Computational Approaches for Melodic Description in Indian Art Music Corpora
Transcript of Computational Approaches for Melodic Description in Indian Art Music Corpora
Computational Approaches for Melodic Description in Indian Art Music Corpora
Doctoral Thesis Presentation
Music Technology Group, Department of Information and Communication Technologies
Universitat Pompeu Fabra, Barcelona, Spain
Thesis director: … Thesis committee members: …
17 November 2016
Outline
Introduction
Music Corpora and Datasets
Melody Descriptors and Representations
Melodic Pattern Processing
Rāga Recognition
Applications
Summary and Future Perspectives
Chapter 1: Introduction
Information technology (IT) is constantly shaping the world that we live in. Its advancements have changed human behavior and influenced the way we connect with our environment. Music, being a universal socio-cultural phenomenon, has been deeply influenced by these advancements. The way music is created, stored, disseminated, listened to, and even learned has changed drastically over the last few decades. A massive amount of audio music content is now available on demand. Thus, it becomes necessary to develop computational techniques that can process and automatically describe large volumes of digital music content to facilitate novel ways of interaction with it. There are different information sources, such as editorial metadata, social data, and audio recordings, that can be exploited to generate a description of music. Melody, along with harmony and rhythm, is a fundamental facet of music and, therefore, an essential component in its description. In this thesis, we focus on describing melodic aspects of music through an automated analysis of audio content.
1.1 Motivation
“It is melody that enables us to distinguish one work from another. It is melody that human beings are innately able to reproduce by singing, humming, and whistling. It is melody that makes music memorable: we are likely to recall a tune long after we have forgotten its text.”
(Selfridge-Field, 1998)
The importance of melody in our musical experiences makes its analysis and description a crucial component in music content processing. It becomes even more important for melody-dominant music traditions such as Indian art music (IAM), where the concept of harmony (functional harmony as understood in common practice) does not exist, and the complex melodic structure takes the central role in music aesthetics.
Chapter 3: Music Corpora and Datasets
3.1 Introduction
A research corpus is a collection of data compiled to study a research problem. A well designed research corpus is representative of the domain under study. It is practically infeasible to work with the entire universe of data. Therefore, to ensure scalability of information retrieval technologies to real-world scenarios, it is important to develop and test computational approaches using a representative data corpus. Moreover, an easily accessible data corpus provides a common ground for researchers to evaluate their methods, and thus accelerates knowledge sharing and advancement.
Not every computational task requires the entire research corpus for development and evaluation of approaches. Typically, a subset of the corpus is used in a specific research task. We call this subset a test corpus or test dataset. The models built over a test dataset can later be extended to the entire research corpus. A test dataset is a static collection of data specific to an experiment, as opposed to a research corpus, which can evolve over time. Therefore, different versions of the test dataset used in a specific experiment should be retained to ensure reproducibility of the research results. Note that a test dataset should not be confused with the training and testing splits of a dataset, which are terms used in the context of a cross-validation experimental setup.
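The distinction between a versioned test dataset and a cross-validation split can be made concrete with a small sketch. This is illustrative only; the identifier scheme and helper names are hypothetical, not part of the thesis infrastructure:

```python
import hashlib
import json
import random

def dataset_fingerprint(recording_ids):
    """Checksum of the sorted ID list: a compact version tag for a test dataset."""
    payload = json.dumps(sorted(recording_ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def cv_folds(recording_ids, k=5, seed=42):
    """k-fold train/test splits made *within* one fixed test dataset."""
    ids = sorted(recording_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    return [(sorted(set(ids) - set(fold)), sorted(fold)) for fold in folds]

recordings = [f"rec_{i:03d}" for i in range(20)]
version = dataset_fingerprint(recordings)  # retain this alongside published results
train_ids, test_ids = cv_folds(recordings)[0]
```

Retaining the fingerprint with each experiment pins the exact dataset version for reproducibility, while the folds remain a transient artifact of one cross-validation run.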
In MIR, a considerable number of computational approaches follow a data-driven methodology, and hence a well curated research corpus becomes a key factor in determining the success of these approaches. Due to the importance of a good data corpus in research, building a corpus is in itself a fundamental research task (MacMullen, 2003). MIR can be regarded as a relatively new research area within information retrieval, which has primarily gained popularity in the last two decades. Even today, a significant number of studies in MIR use ad-hoc procedures to build a collection of data to be used in experiments. Quite often the audio recordings used in the experiments are taken from a researcher's personal music collection. Availability of a good representative research corpus has been a challenge in MIR (Serra, 2014). This
Chapter 4: Melody Descriptors and Representations
4.1 Introduction
In this chapter, we describe methods for extracting relevant melodic descriptors and a low-level melody representation from raw audio signals. These melodic features are used as inputs to higher-level melodic analyses by the methods described in the subsequent chapters. Since these features are used in a number of computational tasks, we consolidate and present these common processing blocks in the current chapter. This chapter is largely based on our published work presented in Gulati et al. (2014a,b,c).
There are four sections in this chapter, and in each section, we describe the extraction of a different melodic descriptor or representation.
In Section 4.2, we focus on the task of automatically identifying the tonic pitch in an audio recording. Our main objective is to perform a comparative evaluation of different existing methods and select the best method to be used in our work.
In Section 4.3, we present the method used to extract the predominant pitch from audio signals, and describe the post-processing steps to reduce frequently occurring errors.
In Section 4.4, we describe the process of segmenting the solo percussion regions (Tani sections) in the audio recordings of Carnatic music.
In Section 4.5, we describe our nyas-based approach for segmenting melodies in Hindustani music.
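As an illustration of the kind of pitch-contour post-processing mentioned above, the sketch below removes octave errors by comparing each frame against a local median. This is a generic cleaning heuristic sketched under simple assumptions, not the exact procedure used in the thesis:

```python
from statistics import median

def correct_octave_errors(cents, window=5, threshold=600.0):
    """Shift frames that sit roughly an octave (1200 cents) away from the
    local median back into line; a crude octave-error correction."""
    out = list(cents)
    half = window // 2
    for i in range(len(out)):
        local = median(out[max(0, i - half): i + half + 1])
        while out[i] - local > threshold:   # frame jumped up an octave or more
            out[i] -= 1200.0
        while local - out[i] > threshold:   # frame dropped down an octave or more
            out[i] += 1200.0
    return out

contour = [100.0, 110.0, 1310.0, 105.0, 100.0]  # third frame is an octave error
cleaned = correct_octave_errors(contour)
```

A real predominant-pitch post-processor would also handle unvoiced frames and short spurious segments; the median-reference trick shown here is only the core idea.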
Chapter 5: Melodic Pattern Processing: Similarity, Discovery and Characterization
5.1 Introduction
In this chapter, we present our methodology for discovering musically relevant melodic patterns in sizable audio collections of IAM. We address three main computational tasks involved in this process: melodic similarity, pattern discovery, and characterization of the discovered melodic patterns. We refer to these different tasks jointly as melodic pattern processing.
“Only by repetition can a series of tones be characterized as something definite. Only repetition can demarcate a series of tones and its purpose. Repetition thus is the basis of music as an art.”
(Schenker et al., 1980)
Repeating patterns are at the core of music. Consequently, analysis of patterns is fundamental in music analysis. In IAM, recurring melodic patterns are the building blocks of melodic structures. They provide a base for improvisation and composition, and thus are crucial to the analysis and description of ragas, compositions, and artists in this music tradition. A detailed account of the importance of melodic patterns in IAM is provided in Section 2.3.2.
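As a minimal illustration of what a melodic similarity measure must handle (timing flexibility between two renditions of the same phrase), here is a textbook dynamic-time-warping distance over pitch sequences in cents. The thesis evaluates more refined similarity variants, so treat this as a baseline sketch only:

```python
def dtw_distance(a, b):
    """Dynamic-time-warping cost between two pitch sequences (values in cents)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# A phrase and a time-stretched rendition of it align at zero cost.
phrase = [0.0, 200.0, 300.0, 200.0]
stretched = [0.0, 0.0, 200.0, 300.0, 300.0, 200.0]
```

Because the warping path may repeat frames of either sequence, two renditions that differ only in local timing score as highly similar, which a frame-by-frame Euclidean comparison would not.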
To recapitulate, from the literature review presented in Section 2.4.2 and Section 2.5.2, we see that the approaches for pattern processing in music can be broadly put into two categories (Figure 5.1). The first type of approach performs pattern detection (or matching) and follows a supervised methodology. In these approaches, the system knows a priori the pattern to be extracted. Typically, such a system is fed with exemplar patterns or queries and is expected to extract all of their occurrences in a piece of
Chapter 6: Automatic Raga Recognition
6.1 Introduction
In this chapter, we address the task of automatically recognizing ragas in audio recordings of IAM. We describe two novel approaches for raga recognition that jointly capture the tonal and the temporal aspects of melody. The contents of this chapter are largely based on our published work in Gulati et al. (2016a,b).
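The idea of jointly capturing tonal and temporal aspects can be sketched as two complementary feature sets computed from a tonic-normalized pitch contour. This is a simplified caricature with hypothetical function names, not the approaches developed in this chapter:

```python
from collections import Counter

BINS = 12  # one pitch-class bin per semitone relative to the tonic

def pitch_class_distribution(cents):
    """Tonal facet: how much time the melody spends on each pitch class."""
    classes = [round(c / 100.0) % BINS for c in cents]
    counts = Counter(classes)
    total = len(classes)
    return [counts.get(b, 0) / total for b in range(BINS)]

def pitch_class_transitions(cents):
    """Temporal facet: counts of successive pitch-class transitions (bigrams)."""
    classes = [round(c / 100.0) % BINS for c in cents]
    return Counter(zip(classes, classes[1:]))

contour = [0.0, 0.0, 200.0, 300.0, 200.0, 0.0]  # cents above the tonic
tonal = pitch_class_distribution(contour)
temporal = pitch_class_transitions(contour)
```

Two ragas can share a svara set (similar distributions) yet differ in their characteristic progressions, which is why a purely tonal feature is insufficient on its own.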
Raga is a core musical concept used in compositions, performances, music organization, and pedagogy of IAM. Even beyond the art music, numerous compositions in Indian folk and film music are also based on ragas (Ganti, 2013). Raga is therefore one of the most desired melodic descriptions of a recorded performance of IAM, and an important criterion used by listeners to browse its audio music collections. Despite its significance, there exists a large volume of audio content whose raga is incorrectly labeled or simply unlabeled. A computational approach to raga recognition will allow us to automatically annotate large collections of audio music recordings. It will enable raga-based music retrieval in large audio archives, semantically meaningful music discovery, and musicologically informed navigation. Furthermore, a deeper understanding of the raga framework from a computational perspective will pave the way for building applications for music pedagogy in IAM.
Raga recognition is the most studied research topic in MIR of IAM. There exist a considerable number of approaches utilizing different characteristic aspects of ragas, such as the svara set, svara salience, and arohana-avrohana. A critical in-depth review of the existing approaches for raga recognition is presented in Section 2.4.3, wherein we identify several shortcomings in these approaches and possible avenues for scientific contribution to take this task to the next level. Here we provide a short summary of this analysis:
Nearly half of the existing approaches for raga recognition do not utilize the temporal aspects of melody at all (Table 2.3), which are crucial
Chapter 7: Applications
7.1 Introduction
The research work presented in this thesis has several applications. A number of these applications have already been described in Section 1.1. While some applications, such as raga-based music retrieval and music discovery, are relevant mainly in the context of large audio collections, there are several applications that can be developed on the scale of the music collection compiled in the CompMusic project. In this chapter, we present some concrete examples of such applications that have already incorporated parts of our work. We provide a brief overview of Dunya, a collection of the music corpora and software tools developed during the CompMusic project, and of Saraga and Riyaz, the mobile applications developed within the technology transfer project Culture Aware MUsic Technologies (CAMUT). In addition, we present three web demos that showcase parts of the outcomes of our computational methods. We also briefly present one of our recent studies that performs musicologically motivated exploration of melodic structures in IAM. It serves as an example of how our methods can facilitate investigations in computational musicology.
7.2 Dunya
Dunya comprises the music corpora and the software tools that have been developed as part of the CompMusic project. It includes data for five music traditions: Hindustani, Carnatic, Turkish Makam, Jingju, and Andalusian music. By providing access to the gathered data (such as audio recordings and metadata) and the generated data (such as audio descriptors) in the project, Dunya aims to facilitate the study and exploration of relevant musical aspects of different music repertoires. The CompMusic corpora mainly comprise audio recordings and complementary information that describes those recordings. This complementary information can be in the form of the
http://dunya.compmusic.upf.edu/
Chapter 8: Summary and Future Perspectives
8.1 Introduction
In this thesis, we have presented a number of computational approaches for analyzing melodic elements at different levels of melodic organization in IAM. In tandem, these approaches generate a high-level melodic description of the audio collections in this music tradition. They build upon each other to finally allow us to achieve the goals that we set at the beginning of this thesis, which are:
To curate and structure representative music corpora of IAM that comprise audio recordings and the associated metadata, and use that to compile sizable and well annotated test datasets for melodic analyses.
To develop data-driven computational approaches for the discovery and characterization of musically relevant melodic patterns in sizable audio collections of IAM.
To devise computational approaches for automatically recognizing ragas in recorded performances of IAM.
Based on the results presented in this thesis, we can now say that our goals are successfully met. We started this thesis by presenting our motivation behind the analysis and description of melodies in IAM, highlighting the opportunities and challenges that this music tradition offers within the context of music information research and the CompMusic project (Chapter 1). We provided a brief introduction to IAM and its melodic organization, and critically reviewed the existing literature on related topics within the context of MIR and IAM (Chapter 2).
In Chapter 3, we provided a comprehensive overview of the music corpora and datasets that were compiled and structured in our work as a part of the CompMusic project. To the best of our knowledge, these corpora comprise the largest audio collections of IAM along with curated metadata and automatically extracted music
Chapter 2: Background
2.1 Introduction
In this chapter, we present our review of the existing literature related to the work presented in this thesis. In addition, we also present a brief overview of the relevant musical and mathematical background. We start with a brief discussion of the terminology used in this thesis (Section 2.2). Subsequently, we provide an overview of selected music concepts to better understand the computational tasks addressed in our work (Section 2.3). We then present a review of the relevant literature, which we divide into two parts. We first present a review of the work done in computational analysis of IAM (Section 2.4). This includes approaches for tonic identification, melodic pattern processing, and automatic raga recognition. In the second part, we present relevant work done in MIR in general, including topics related to pattern processing (detection and discovery) and key estimation (Section 2.5). Finally, we provide a brief overview of selected scientific concepts (Section 2.6).
2.2 Terminology
In this section, we provide our working definition of selected terms that we have used throughout the thesis. To begin with, it is important to understand the meaning of melody in the context of this thesis. As we see from the literature, defining melody in itself has been a challenge, with no consensus on a single definition of melody (Gómez et al., 2003; Salamon, 2013). We do not aim here to formally define melody in the universal sense, but present its working definition within the scope of this thesis. Before we proceed, it is important to understand the setup of a concert of IAM. Every performance of IAM has a lead artist, who plays the central role (also literally positioned in the center of the stage), and all other instruments are considered as accompaniments. There is a main melody line by the lead performer, and typically also a melodic accompaniment that follows the main melody. A more detailed
Introduction
Motivation
Automatic music description
Applications: similarity, discovery, search, radio, recommendation, pedagogy, education, entertainment, enhanced listening, computational musicology
Melodic description: representation, segmentation, similarity, motif discovery, pattern characterization
CompMusic Project
Description: data driven, culture aware, open access, reproducibility
• Serra, X. (2011). A multicultural approach to music information research. In Proc. of ISMIR, pp. 151–156.
Indian Art Music
[Performance photos: Hindustani (Kaustuv K. Ganguli); Carnatic (Vignesh Ishwar)]
Terminology
§ Tonic: Base pitch; varies across artists
§ Melody: Pitch of the predominant melodic source
§ Rāga: Melodic framework
§ Svara: Analogous to a note
§ Ārohana-avrohana: Ascending and descending svara progressions
§ Chalan: Abstraction of the melodic progression
§ Motifs: Characteristic melodic patterns
§ Gamakas: Gliding melodic gestures in Carnatic music
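Because the tonic varies across artists, melodic analysis typically normalizes pitch to cents relative to the performer's tonic. The conversion below is the standard formula, shown here only as a reminder of the convention, not as thesis-specific code:

```python
import math

def hz_to_cents(freq_hz, tonic_hz):
    """Frequency expressed in cents above the tonic (1200 cents = one octave)."""
    return 1200.0 * math.log2(freq_hz / tonic_hz)
```

For example, a frequency one octave above the tonic maps to 1200 cents, and the tonic itself maps to 0, so contours from artists with different tonics become directly comparable.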
Opportunities and Challenges
§ Opportunities
§ Well studied music tradition
§ Homophonic; acoustical characteristics
• http://www.music-ir.org/mirex/wiki/MIREX_HOME
• Salamon, J. & Gómez, E. (2012). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.
§ Melodic patterns → building blocks
§ Rāga framework:
• Martinez, J. L. (2001). Semiosis in Hindustani Music. Motilal Banarsidass Publishers.
“The rāga is more fixed than a mode, and less fixed than the melody; beyond the mode and short of melody; and richer both than a given mode or a given melody.”
(Martinez, 2001)
Mode < Rāga < Melody
§ Challenges
§ Improvisatory music tradition
[Figure: predominant pitch contour of an excerpt; frequency (cents) vs. time (s)]
§ Melodic transcription: changing and ill-defined
[Figure: pitch contour with svara labels (S, g, m) and a characteristic movement]
• http://www.freesound.org/people/sankalp/sounds/360428/
§ No standard reference frequency; tonic pitch
§ Long audio recordings (~3 hours)
Broad Objectives
§ Compilation and curation of corpora of IAM
§ Discovery and characterization of melodic patterns
§ Automatic rāga recognition
Computational Tasks
Melodic Description: Tonic Identification (& Evaluation), Nyās Segmentation, Melodic Similarity, Pattern Discovery, Pattern Characterization, Automatic Rāga Recognition
Music Corpora and Datasets
CompMusic Corpora
§ Team effort
§ Procedure and design criteria
• Serra, X. (2014). Creating research corpora for the computational study of music: the case of the CompMusic project. In Proc. of the 53rd AES Int. Conf. on Semantic Audio. London.
• Srinivasamurthy, A., Koduri, G. K., Gulati, S., Ishwar, V., & Serra, X. (2014). Corpora for music information research in Indian art music. In Int. Computer Music Conf./Sound and Music Computing Conf., pp. 1029–1036. Athens, Greece.
Audio (CDs); editorial metadata (MusicBrainz)
Carnatic and Hindustani Music Corpora
[Slide: corpus statistics]
Open Access Music Corpora
Annotations: sections, melodic phrases
Test Datasets
§ Tonic identification
§ Nyās segmentation
§ Melodic similarity
§ Rāga recognition
Melody Descriptors and Representations
Melody Descriptors and Representations
Tonic Identification (& Evaluation), Nyās Segmentation, Predominant Melody Estimation and post-processing, Tani Segmentation
Tonic Identification: Approaches
Method | Features | Feature Distribution | Tonic Selection
MRS (Sengupta et al., 2005) | Pitch (Datta, 1996) | NA | Error minimization
MRH1/MRH2 (Ranjani et al., 2011) | Pitch (Boersma & Weenink, 2001) | Parzen-window-based PDE | GMM fitting
MJS (Salamon et al., 2012) | Multi-pitch salience (Salamon et al., 2011) | Multi-pitch histogram | Decision tree
MSG (Gulati et al., 2012) | Multi-pitch salience (Salamon et al., 2011) | Multi-pitch histogram | Decision tree
MCS (Chordia & Sentürk, 2013) | Predominant melody (Salamon & Gómez, 2012) | Pitch histogram | Decision tree
MAB1 (Bellur et al., 2012) | Pitch (De Cheveigné & Kawahara, 2002) | GD histogram | Highest peak
MAB2 (Bellur et al., 2012) | Pitch (De Cheveigné & Kawahara, 2002) | GD histogram | Template matching
MAB3 (Bellur et al., 2012) | Pitch (De Cheveigné & Kawahara, 2002) | GD histogram | Highest peak
Abbreviations: NA = not applicable; GD = group delay; PDE = probability density estimate.
Table 2.1: Summary of the existing tonic identification approaches.
• Salamon, J., Gulati, S., & Serra, X. (2012). A multipitch approach to tonic identification in Indian classical music. In Proc. of Int. Soc. for Music Information Retrieval Conf. (ISMIR), pp. 499–504. Porto, Portugal.
• Gulati, S., Salamon, J., & Serra, X. (2012). A two-stage approach for tonic identification in indian art music. In Proc. of the 2nd CompMusic Workshop, pp. 119–127. Istanbul, Turkey: Universitat Pompeu Fabra.
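The "highest peak" tonic-selection strategy in Table 2.1 can be sketched as follows: histogram the pitch track on a cent grid and read off the fullest bin. This is an illustrative reduction; the cited methods add salience weighting, multi-pitch information, or templates on top of this core idea:

```python
import math
from collections import Counter

def highest_peak_tonic(pitch_hz, bin_cents=10, ref_hz=55.0):
    """Return the centre frequency of the most populated pitch-histogram bin."""
    histogram = Counter()
    for f in pitch_hz:
        if f > 0:  # skip unvoiced frames (pitch trackers often emit 0 there)
            cents = 1200.0 * math.log2(f / ref_hz)
            histogram[int(cents // bin_cents)] += 1
    peak_bin, _ = histogram.most_common(1)[0]
    centre_cents = peak_bin * bin_cents + bin_cents / 2.0
    return ref_hz * 2.0 ** (centre_cents / 1200.0)

# Frames hovering around a 132 Hz tonic, with stray octave errors and silence.
frames = [132.0] * 50 + [131.5] * 20 + [264.0] * 10 + [0.0] * 5
estimate = highest_peak_tonic(frames)
```

The weakness this exposes is exactly the one discussed later: when a drone, percussion, or octave errors pile mass into a bin an octave away, the highest peak is no longer the tonic, which motivates template matching and learned selection rules.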
Multi-pitch + classification; predominant pitch + template/peak picking
Comparative Evaluation
§ Seven methods
§ Six diverse datasets: TIDCM1, TIDCM2, TIDCM3, TIDIITM1, TIDIITM2, TIDIISc
[Table: number of excerpts and mean duration (mins) per dataset]
Results
Methods | TIDCM1 TP/TPC | TIDCM2 TP/TPC | TIDCM3 TP/TPC | TIDIISc TP/TPC | TIDIITM1 TP/TPC | TIDIITM2 TP/TPC
MRH1 | -/81.4 | 69.6/84.9 | 73.2/90.8 | 81.8/83.6 | 92.1/97.4 | 80.2/86.9
MRH2 | -/63.2 | 65.7/78.2 | 68.5/83.5 | 83.6/83.6 | 94.7/97.4 | 83.8/88.8
MAB1 | -/- | -/- | -/- | -/- | 89.5/89.5 | -/-
MAB2 | -/88.9 | 74.5/82.9 | 78.5/83.4 | 72.7/76.4 | 92.1/92.1 | 86.6/89.1
MAB3 | -/86 | 61.1/80.5 | 67.8/79.9 | 72.7/72.7 | 94.7/94.7 | 85/86.6
MJS | -/88.9 | 87.4/90.1 | 88.4/91 | 75.6/77.5 | 89.5/97.4 | 90.8/94.1
MSG | -/92.2 | 87.8/90.9 | 87.7/90.5 | 79.8/85.3 | 97.4/97.4 | 93.6/93.6
Table 4.1: Accuracies (%) for tonic pitch (TP) and tonic pitch-class (TPC) identification by seven methods on six different datasets using only audio data. The best accuracy obtained for each dataset is highlighted using bold text. The dashed horizontal line divides the methods based on supervised learning (MJS and MSG) and those based on expert knowledge (MRH1, MRH2, MAB1, MAB2 and MAB3). The TP column for TIDCM1 is marked as '-' because it consists of only instrumental excerpts, for which we do not evaluate tonic pitch accuracy. MAB1 is only evaluated on TIDIITM1 since it works on the whole concert recording.
Methods | TIDCM1 | TIDCM2 | TIDCM3 | TIDIISc | TIDIITM1 | TIDIITM2
MRH1 | 87.7 | 83.5 | 88.9 | 87.3 | 97.4 | 91.7
MRH2 | 79.55 | 76.3 | 82 | 85.5 | 97.4 | 91.5
MAB1 | - | - | - | - | 97.4 | -
MAB2 | 92.3 | 91.5 | 94.2 | 81.8 | 97.4 | 91.1
MAB3 | 87.5 | 86.7 | 90.9 | 81.8 | 94.7 | 89.9
MJS | 88.9 | 93.6 | 92.4 | 80.9 | 97.4 | 92.3
MSG | 92.2 | 90.9 | 90.5 | 85.3 | 97.4 | 93.6
Table 4.2: Accuracies (tonic pitch-class, %) when additional information regarding the gender of the lead singer (male/female) and performance type (vocal/instrumental) is used. The dashed horizontal line divides the methods based on supervised learning (MJS and MSG) and those based on expert knowledge (MRH1, MRH2, MAB1, MAB2 and MAB3).
the presence of the drone, the tonal sounds produced by percussive instruments, and the octave errors produced by the pitch tracker all contribute to the appearance of a high peak one octave below the tonic of the female singers. This is especially the case for three-minute excerpts, wherein a limited amount of vocal pitch information is available. In the case of the approaches based on multipitch analysis and classification (MJS and MSG), a probable reason for obtaining better performance for male singers is the larger number of excerpts from male singers in the database. As a result, it is possible that the rules learned by the classifier are slightly biased towards the performances of male singers.
4.2.2.2 Results Obtained Using Metadata Together with the Audio
One of the ways to reduce the amount of octave errors in tonic identification is to restrict the frequency range of the allowed tonic pitches. The frequency range can be optimized based on the additional information regarding the gender of the singer (when available) to guide the method. In this section, we analyze the effect of including information regarding the gender of the singer and the performance type (vocal or instrumental) on the identification accuracy obtained by the methods.
In Table 4.2, we present the identification accuracies obtained when associated metadata is available to the methods in addition to the audio data. Note that for this evaluation we only report the tonic pitch accuracy for vocal excerpts (and not pitch-class accuracy), since when this metadata is available the pitch range of the tonic is known and limited to a single octave, meaning the TP and TPC accuracies will be the same.
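The two scores reported in these tables differ only in whether octave errors are forgiven. A compact way to state the tonic pitch (TP) vs. tonic pitch-class (TPC) criterion is sketched below; the 50-cent tolerance is a common choice but is assumed here, not taken from the thesis:

```python
import math

def tonic_correctness(estimate_hz, reference_hz, tolerance_cents=50.0):
    """Return (TP, TPC): exact correctness and octave-folded correctness."""
    diff = 1200.0 * math.log2(estimate_hz / reference_hz)
    tp = abs(diff) <= tolerance_cents
    folded = ((diff + 600.0) % 1200.0) - 600.0  # wrap deviation into [-600, 600)
    tpc = abs(folded) <= tolerance_cents
    return tp, tpc
```

An estimate exactly one octave above the reference therefore counts as correct for TPC but not for TP, which is why TPC accuracies in Table 4.1 are always greater than or equal to the TP accuracies.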
Comparing the identification accuracies summarized in Table 4.2 with the ones in Table 4.1, we see that the accuracies for all methods are higher when gender and performance-type metadata are available.
[Slides: tonic identification results. Bar charts show accuracy (%) for tonic pitch and tonic pitch-class for the methods JS, SG, RH1, RH2, AB2 and AB3, per dataset (CM1, CM2, CM3, IISCB1, IITM1, IITM2) and per category (Hindustani, Carnatic, male, female), for the audio-only and audio + metadata setups. Companion charts break down the errors (%) into Pa errors, Ma errors and other errors. Slide titles: Audio; Audio + metadata; Results; Duration of excerpt; Category-wise analysis; Detailed error analysis; Category-wise error analysis.]
Results: Summary
§ Multi-pitch analysis + classification: audio only; invariant to excerpt duration
§ Pitch histogram peak picking: audio + metadata; sensitive to excerpt duration
§ Consistency across tradition, gender and instrument
• Gulati, S., Bellur, A., Salamon, J., Ranjani, H. G., Ishwar, V., Murthy, H. A., & Serra, X. (2014). Automatic tonic identification in Indian art music: approaches and evaluation. JNMR, 43(1), 53–71.
Predominant Pitch Estimation
§ Melodia
• Salamon, J. & Gómez, E. (2012). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.
• Bogdanov, D., Wack, N., Gómez, E., Gulati, S., Herrera, P., Mayor, O., Roma, G., Salamon, J., Zapata, J., & Serra, X. (2013). Essentia: an audio analysis library for music information retrieval. In ISMIR, pp. 493–498.
Nyās Svara Segmentation
§ Melodic segmentation
§ Delimit melodic phrases
[Figure: predominant-F0 contour (cents) vs. time (s), with nyās segments N1–N5 marked.]
Nyās Svara Segmentation (processing pipeline, built up over several slides):
Audio → predominant pitch estimation and representation (with tonic identification and histogram computation) → segmentation → feature extraction per candidate segment (local, contextual, and local + contextual features) → segment classification and fusion → nyās segments. Each candidate segment is represented by a feature vector (e.g. 0.12, 0.34, 0.59, 0.23, 0.54) and classified as nyās or non-nyās.
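The final classify-and-fuse step of the pipeline can be sketched as follows. The single-feature threshold classifier and the feature values are illustrative stand-ins for the trained classifiers used in the thesis; only the merge logic reflects the described fusion step.

```python
# Minimal sketch of the classify-and-fuse step: each candidate segment
# gets a feature vector, a (hypothetical) classifier labels it nyas or
# non-nyas, and adjacent segments with the same label are merged.

def fuse(segments):
    """segments: list of (start, end, label); merge adjacent equal labels."""
    fused = []
    for seg in segments:
        if fused and fused[-1][2] == seg[2] and fused[-1][1] == seg[0]:
            fused[-1] = (fused[-1][0], seg[1], seg[2])  # extend previous
        else:
            fused.append(seg)
    return fused

def classify(features):
    # Toy stand-in for a trained classifier (threshold on one feature).
    return "nyas" if features[0] > 0.5 else "non-nyas"

bounds = [(0.0, 1.2), (1.2, 2.0), (2.0, 3.5)]
feats = [[0.9], [0.8], [0.1]]         # one hypothetical feature per segment
labeled = [(s, e, classify(f)) for (s, e), f in zip(bounds, feats)]
print(fuse(labeled))
# -> [(0.0, 2.0, 'nyas'), (2.0, 3.5, 'non-nyas')]
```

The two adjacent segments labeled nyās are fused into one, which is the behavior the pipeline's "segment fusion" block describes.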
Dataset
§ Performances of Hindustani music
§ Annotations by a musician with more than 15 years of training
§ Annotated nyās segments
[Table: Recordings | Duration | Artists | Rāgas]
Results

MELODY DESCRIPTORS AND REPRESENTATIONS
Segmentation | Feat. | DTW | Tree | k-NN | NB | LR | SVM
PLS | FL | 0.356 | 0.407 | 0.447 | 0.248 | 0.449 | 0.453
PLS | FC | 0.284 | 0.394 | 0.387 | 0.383 | 0.389 | 0.406
PLS | FL+FC | 0.289 | 0.414 | 0.426 | 0.409 | 0.432 | 0.437
Proposed | FL | 0.524 | 0.672 | 0.719 | 0.491 | 0.736 | 0.749
Proposed | FC | 0.436 | 0.629 | 0.615 | 0.641 | 0.621 | 0.673
Proposed | FL+FC | 0.446 | 0.682 | 0.708 | 0.591 | 0.725 | 0.735

Table 4.3: F-scores for nyās boundary detection using the PLS method and the proposed segmentation method. Results are shown for different classifiers (Tree, k-NN, NB, LR, SVM) and for local (FL), contextual (FC), and combined (FL+FC) features. DTW is the baseline method used for comparison. The F-score for the random baseline obtained using BR2 is 0.184.
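The boundary-detection F-score reported in Table 4.3 can be sketched as below: a predicted boundary counts as a hit if it falls within a time tolerance of a not-yet-matched ground-truth boundary. The tolerance value and function name are assumptions for illustration, not the thesis' exact evaluation settings.

```python
# Hedged sketch of F-score for boundary detection with a tolerance window.
def boundary_f_score(predicted, reference, tol=0.1):
    """predicted, reference: boundary times in seconds; tol is illustrative."""
    matched, used = 0, set()
    for p in predicted:
        for i, r in enumerate(reference):
            if i not in used and abs(p - r) <= tol:
                matched += 1
                used.add(i)     # each reference boundary matches at most once
                break
    if not predicted or not reference:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two of three predictions fall within tolerance of a reference boundary.
print(round(boundary_f_score([1.02, 2.50, 4.00], [1.00, 2.45, 3.10]), 3))
# -> 0.667
```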
Segmentation | Feat. | DTW | Tree | k-NN | NB | LR | SVM
PLS | FL | 0.553 | 0.685 | 0.723 | 0.621 | 0.727 | 0.722
PLS | FC | 0.251 | 0.639 | 0.631 | 0.690 | 0.688 | 0.674
PLS | FL+FC | 0.389 | 0.694 | 0.693 | 0.708 | 0.722 | 0.706
Proposed | FL | 0.546 | 0.708 | 0.754 | 0.714 | 0.749 | 0.758
Proposed | FC | 0.281 | 0.671 | 0.611 | 0.697 | 0.689 | 0.697
Proposed | FL+FC | 0.332 | 0.672 | 0.710 | 0.730 | 0.743 | 0.731

Table 4.4: F-scores for the nyās and non-nyās label annotation task using the PLS method and the proposed segmentation method. Results are shown for different classifiers (Tree, k-NN, NB, LR, SVM) and for local (FL), contextual (FC), and combined (FL+FC) features. DTW is the baseline method used for comparison. The best random baseline F-score is 0.153, obtained using BR2.
significant for all the proposed method variants. Furthermore, we also see that the proposed segmentation method consistently performs better than PLS, although the differences are not always statistically significant.

In addition, we investigate per-class accuracies for the label annotations. We find that the performance for the nyās segments is considerably better than for the non-nyās segments. This could be attributed to the fact that, even though the segment classification accuracy is balanced across classes, the difference in segment length between nyās and non-nyās segments (nyās segments being considerably longer) can result in a larger number of matched pairs for nyās segments.

In general, we see that the proposed segmentation method improves the performance over the PLS method in both tasks, wherein the differences are statistically significant
Boundary
Label
Results: Summary
§ Proposed segmentation
§ Feature extraction
§ Local features
§ Piece-wise linear segmentation
§ Pattern matching (DTW)
§ Local + contextual features
• Gulati, S., Serrà, J., Ganguli, K. K., & Serra, X. (2014). Landmark detection in Hindustani music melodies. In ICMC-SMC, pp. 1062–1068.
Melodic Pattern Processing
Chapter 5
Melodic Pattern Processing: Similarity, Discovery and Characterization

5.1 Introduction

In this chapter, we present our methodology for discovering musically relevant melodic patterns in sizable audio collections of IAM. We address three main computational tasks involved in this process: melodic similarity, pattern discovery, and characterization of the discovered melodic patterns. We refer to these tasks jointly as melodic pattern processing.
“Only by repetition can a series of tones be characterized as something definite. Only repetition can demarcate a series of tones and its purpose. Repetition thus is the basis of music as an art”

(Schenker et al., 1980)
Repeating patterns are at the core of music; consequently, the analysis of patterns is fundamental in music analysis. In IAM, recurring melodic patterns are the building blocks of melodic structures. They provide a base for improvisation and composition, and are thus crucial to the analysis and description of rāgas, compositions, and artists in this music tradition. A detailed account of the importance of melodic patterns in IAM is provided in Section 2.3.2.
To recapitulate, from the literature review presented in Section 2.4.2 and Section 2.5.2, we see that the approaches for pattern processing in music can be broadly put into two categories (Figure 5.1). The first type of approach performs pattern detection (or matching) and follows a supervised methodology. In these approaches, the system knows a priori the pattern to be extracted. Typically, such a system is fed with exemplar patterns or queries and is expected to extract all of their occurrences in a piece of
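The supervised detection setting described above can be sketched as a sliding-window search over a pitch track. The mean absolute difference used as a distance, the threshold, and all values here are illustrative assumptions; the methods surveyed below use more elaborate measures such as DTW or RLCS.

```python
# Hedged sketch of supervised pattern detection: slide a query melodic
# pattern over a longer pitch track and report windows whose mean
# absolute deviation falls below a threshold (values are illustrative).
def detect_occurrences(query, pitch_track, threshold=1.0):
    n = len(query)
    hits = []
    for start in range(len(pitch_track) - n + 1):
        window = pitch_track[start:start + n]
        dist = sum(abs(q - w) for q, w in zip(query, window)) / n
        if dist <= threshold:
            hits.append((start, dist))
    return hits

track = [0, 0, 2, 4, 5, 4, 2, 0, 2, 4, 5, 4]
print(detect_occurrences([2, 4, 5, 4], track, threshold=0.5))
# -> [(2, 0.0), (8, 0.0)]
```

Both occurrences of the exemplar are found; in the unsupervised (discovery) setting, by contrast, no such query is given and repeated segments must be found by comparing all pairs.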
Existing Approaches for IAM

Method | Task | Melody representation | Segmentation | Similarity measure | Speed-up | #Ragas | #Rec | #Patt | #Occ
Ishwar et al. (2012) | Distinction | Continuous | GT annotations | HMM | - | 5c | NA | 6 | 431
Ross & Rao (2012) | Detection | Continuous | Pa nyās | DTW | Segmentation | 1h | 2 | 2 | 107
Ross et al. (2012) | Detection | Continuous, SAX-12,1200 | Sama location | Euc., DTW | Segmentation | 3h | 4 | 3 | 107
Ishwar et al. (2013) | Detection | Stationary point, Continuous | Brute-force | RLCS | Two-stage method | 2c | 47 | 4 | 173
Rao et al. (2013) | Distinction | Continuous | GT annotations | DTW | - | 1h | 8 | 3 | 268
Dutta & Murthy (2014a) | Discovery | Stationary point, Continuous | Brute-force | RLCS | - | 5c | 59 | - | -
Dutta & Murthy (2014b) | Detection | Stationary point, Continuous | NA | Modified-RLCS | Two-stage method | 1c | 16 | NA | 59
Rao et al. (2014) | Distinction | Continuous | GT annotations | DTW | - | 1h | 8 | 3 | 268
Rao et al. (2014) | Distinction | Continuous | GT annotations | HMM | - | 5c | NA | 10 | 652
Ganguli et al. (2015) | Detection | BSS, Transcription | - | Smith-Waterman | Discretization | 34h | 50 | NA | 1075

h: Hindustani music collection; c: Carnatic music collection.
ABBREVIATIONS: #Rec = number of recordings; #Patt = number of unique patterns; #Occ = total number of annotated occurrences of the patterns; GT = ground truth; NA = not available; "-" = not applicable; Euc. = Euclidean distance.

Table 2.2: Summary of the methods proposed in the literature for melodic pattern processing in IAM. Note that all of these studies were published during the course of our work. (Tasks: Detection 5, Distinction 4, Discovery 1.)
Melodic Pattern Processing
Supervised Approach vs. Unsupervised Approach
§ Dataset size
§ Knowledge bias
§ Human annotation limitations
Melodic Pattern Processing
Melodic Similarity | Pattern Discovery | Pattern Characterization
Supervised | Unsupervised | Unsupervised
Melodic Similarity
[Figure: two melodic excerpts, F0 frequency (cents) vs. time (s); the challenge is to judge the melodic similarity between them.]
Melodic Similarity (processing chain):
Audio1, Audio2 → Predominant Pitch Estimation → Pitch Normalization → Uniform Time-scaling → Distance Computation → melodic similarity score
Melodic Similarity — design choices at each step:
§ Predominant pitch estimation: sampling rate?
§ Pitch normalization: yes / no; what kind?
§ Uniform time-scaling: yes / no; how much?
§ Distance computation: Euclidean or dynamic time warping (DTW)? Parameters?
Melodic Similarity
[Slide: the predominant pitch of each recording is represented as a sampled sequence of values for the distance computation.]
Melodic Similarity — pitch normalization options:
§ No normalization
§ Tonic (continuous, or quantized to 24 or 12 bins)
§ Mean
§ Median
§ Z-norm
§ Median absolute deviation
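A minimal sketch of a few of these normalization options, assuming the melody is a sequence of pitch values in cents; the function name, modes, and values are illustrative, not the thesis' implementation.

```python
# Sketch of candidate pitch normalizations applied to a sequence in cents.
import statistics

def normalize(cents, mode="tonic", tonic_cents=0.0):
    if mode == "tonic":     # express pitch relative to the tonic
        ref = tonic_cents
    elif mode == "mean":    # zero-mean: robust to transposed repetitions
        ref = statistics.mean(cents)
    elif mode == "median":  # robust alternative to the mean
        ref = statistics.median(cents)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [c - ref for c in cents]

seq = [1200.0, 1300.0, 1250.0]
print(normalize(seq, "mean"))   # -> [-50.0, 50.0, 0.0]
```

Zero-mean normalization makes two pitch-transposed instances of the same pattern compare equal, which is one rationale for including it among the candidates.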
Melodic Similarity — uniform time-scaling options:
§ Off
§ On, with scaling factors {0.9, 0.95, 1.0, 1.05, 1.1}
Melodic Similarity — distance measures:
§ Euclidean
§ Dynamic time warping (DTW), with global and local constraints
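A generic sketch of DTW with an L1 local cost and a Sakoe-Chiba global band, in the spirit of the constrained variants (e.g. DDTW_L1_G90) compared later; the band width and sequences are illustrative, and this is not the thesis' exact implementation.

```python
# DTW with a Sakoe-Chiba global band and an L1 local cost.
import math

def dtw(a, b, band_frac=0.1):
    n, m = len(a), len(b)
    band = max(1, int(band_frac * max(n, m)))   # global constraint width
    INF = math.inf
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band around the diagonal are filled.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = abs(a[i - 1] - b[j - 1])     # L1 local cost
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# The shorter sequence aligns to the longer one despite the extra sample.
print(dtw([0, 1, 2, 3], [0, 1, 1, 2, 3], band_frac=0.5))  # -> 0.0
```

The global band limits how far the warping path may stray from the diagonal, which both speeds up the computation and rules out the pathological warpings discussed in the results.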
Evaluation: Setup
§ N annotated patterns plus randomly sampled segments
§ Top nearest neighbours retrieved per query
§ Mean average precision (MAP)
Dataset
[Tables: Carnatic music dataset and Hindustani music dataset — recordings, duration, rāgas, patterns, occurrences.]
Results

5.2 MELODIC SIMILARITY: APPROACHES AND EVALUATION

Dataset | MAP | Srate | Norm | TScale | Dist
MSDcmd_iitm | 0.413 | w67 | Zmean | Woff | DDTW_L1_G90
MSDcmd_iitm | 0.412 | w67 | Zmean | Won | DDTW_L1_G10
MSDcmd_iitm | 0.411 | w100 | Zmean | Woff | DDTW_L1_G90
MSDhmd_iitb | 0.552 | w100 | Ztonic | Woff | DDTW_L0_G90
MSDhmd_iitb | 0.551 | w67 | Ztonic | Woff | DDTW_L0_G90
MSDhmd_iitb | 0.547 | w50 | Ztonic | Woff | DDTW_L0_G90

Table 5.1: MAP score and details of the parameter settings for the three best performing variants for the MSDcmd_iitm and MSDhmd_iitb datasets. Srate: sampling rate of the melody representation; Norm: normalization technique; TScale: uniform time-scaling; Dist: distance measure.
we use mean average precision (MAP), a typical evaluation measure in information retrieval (Manning et al., 2008). MAP is computed by taking the mean of the average precision values of each query in the dataset. This way, we have a single number with which to evaluate and compare the performance of a variant. To assess whether the difference in the performance of any two variants is statistically significant, we use the Wilcoxon signed-rank test (Wilcoxon, 1945) with p < 0.01. To compensate for multiple comparisons, we apply the Holm-Bonferroni method (Holm, 1979). Thus, considering that we compare 560 different variants, we effectively use a much more stringent criterion than p < 0.01 for measuring statistical significance.
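The MAP computation described above can be sketched as follows, assuming each query yields a ranked list of booleans marking which retrieved items are relevant.

```python
# Sketch of mean average precision (MAP) over ranked retrieval lists.
def average_precision(ranked_relevance):
    """ranked_relevance: list of bools, in retrieval rank order."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """queries: one ranked relevance list per query."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Two toy queries: AP = (1/1 + 2/3)/2 and AP = 1/2.
print(round(mean_average_precision([[True, False, True], [False, True]]), 4))
# -> 0.6667
```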
5.2.3 Results and Discussion

In this section, we present the results of our evaluation of the 560 variants on each of the datasets, MSDcmd_iitm and MSDhmd_iitb. We order the variants in decreasing order of their MAP scores and present only the three best performing variants in Table 5.1. For complete results, see the companion page for this study (Appendix A).

In Table 5.1 (top half), we show the MAP scores and details of the parameter settings for the MSDcmd_iitm dataset. We see that the best performing variant has a MAP score of 0.413. Looking at the whole ranked list of the 560 variants for MSDcmd_iitm, the top performing variants consistently use higher sampling rates (either w100 or w67). This can be attributed to the rapid oscillatory melodic movements present in Carnatic music, whose preservation requires a higher sampling rate. The top variants invariably use zero-mean normalization (Zmean), suggesting that several repeated instances of the melodic patterns are pitch transposed within the same recording. In addition, the top variants also use a DTW-based distance with a local constraint (either DDTW_L1_G10 or DDTW_L1_G90), indicating that melodic fragments are prone to large pathological warping that can significantly degrade the performance. We do not observe any consistency in the usage of uniform time-scaling.
MELODIC PATTERN PROCESSING

Dataset | MAP | Srate | Norm | TScale | Dist
MSDcmd_iitm | 0.279 | w67 | Zmean | Won | DDTW_L1_G10
MSDcmd_iitm | 0.277 | w67 | ZtonicQ12 | Won | DDTW_L1_G10
MSDcmd_iitm | 0.275 | w100 | ZtonicQ12 | Won | DDTW_L1_G10
MSDhmd_iitb | 0.259 | w40 | ZtonicQ12 | Won | DDTW_L1_G90
MSDhmd_iitb | 0.259 | w100 | ZtonicQ12 | Won | DDTW_L1_G90
MSDhmd_iitb | 0.259 | w67 | ZtonicQ12 | Won | DDTW_L1_G90

Table 5.2: MAP score and details of the parameter settings for the three best performing variants for the MSDcmd_iitm and MSDhmd_iitb datasets. These results correspond to the experimental setup where the target patterns' length is not based on the ground-truth annotations, but is taken to be the same as the query pattern length. Srate: sampling rate of the melody representation; Norm: normalization technique; TScale: uniform time-scaling; Dist: distance measure.
are segmented using the ground-truth annotations. We now present the results corresponding to the other experimental setup described in Section 5.2.2, where the target pattern lengths are taken to be the same as that of the query pattern. We evaluate all 560 variants of the method, as done above, on both datasets under this experimental setup. In Table 5.2 we show the MAP scores of the top performing variants for both datasets, MSDcmd_iitm and MSDhmd_iitb. We find that the MAP score for the best performing variant decreases from 0.41 to 0.28 and from 0.55 to 0.26 for MSDcmd_iitm and MSDhmd_iitb, respectively. This indicates that the melodic similarity computation task becomes much more challenging in the absence of an accurate melodic segmentation method. For this experimental setup, the trend in the sampling rate for Carnatic and Hindustani music remains the same as in Table 5.1: a higher sampling rate is desired for representing melodic patterns in Carnatic music. In terms of normalization, we see that, surprisingly, ZtonicQ12 is used by the top performing variants for both datasets. Note that in other variants, whose performance is not statistically significantly different from the ones reported in Table 5.2, Zmean and Ztonic normalization are also used for the MSDcmd_iitm and MSDhmd_iitb datasets, respectively. An interesting observation is that in this experimental setup uniform time-scaling is consistently used by all the top performing variants. This indicates that such a time-scaling operation is immensely advantageous in retrieval scenarios where the length of the target melodic patterns is taken to be the same as that of the query pattern (i.e. pattern segmentation is not known). We also see that DDTW_L1_G10 and DDTW_L1_G90 are always used for the MSDcmd_iitm and MSDhmd_iitb datasets, respectively, indicating that applying a local constraint in DTW is critical for this experimental setup.
C _a _5.2 MELODIC SIMILARITY: APPROACHES AND EVALUATION 131
Dataset MAP Srate Norm TScale Dist
MSDcmdiitm
0.413 w67 Zmean Woff DDTW_L1_G90
0.412 w67 Zmean Won DDTW_L1_G10
0.411 w100 Zmean Woff DDTW_L1_G90
MSDhmdiitb
0.552 w100 Ztonic Woff DDTW_L0_G90
0.551 w67 Ztonic Woff DDTW_L0_G90
0.547 w50 Ztonic Woff DDTW_L0_G90
Table 5.1: MAP score and details of the parameter settings for the three best performingvariants for MSDcmd
iitm and MSDhmdiitb dataset. Srate: sampling rate of the melody representation,
Norm: normalization technique, TScale: uniform time-scaling and Dist: distance measure.
we use mean average precision (MAP), a typical evaluation measure in informationretrieval (Manning et al., 2008). Mean average precision (MAP) is computed bytaking the mean of the average precision values of each query in the dataset. This way,we have a single number to evaluate and compare the performance of a variant. Inorder to assess if the difference in the performance of any two variants is statisticallysignificant, we use the Wilcoxon signed rank-test (Wilcoxon, 1945) with p < 0.01. Tocompensate for multiple comparisons, we apply the Holm-Bonferroni method (Holm,1979). Thus, considering that we compare 560 different variants, we effectively use amuch more stringent criterion than p < 0.01 for measuring statistical significance.
5.2.3 Results and Discussion
In this section, we present the results of our evaluation of the 560 variants for eachof the datasets, MSDcmd
iitm and MSDhmdiitb . We order the variants in the decreasing order
of their MAP scores and present only the three best performing variants in Table 5.1.For complete results see the companion page for this study (Appendix A).
In Table 5.1 (top half), we show the MAP scores and details of the parameter settingsfor MSDcmd
iitm dataset. We see that the best performing variant has a MAP score of0.413. Having a look at the whole list, we observe that for MSDcmd
iitm , in the rankedlist of the 560 variants, top performing variants consistently use higher sampling rates(either w100, or w67). This can be attributed to the rapid oscillatory melodic move-ments present in Carnatic music, whose preservation requires a higher sampling rate.The top variants invariably use the zero mean normalization (Zmean), suggesting thatthere are several repeated instances of the melodic patterns that are pitch transposedwithin the same recording. In addition, top variants also use DTW-based distancewith local constraint (either DDTW_L1_G10 or DDTW_L1_G90), indicating that melodicfragments are prone to large pathological warping that can significantly degrade theperformance. We do not observe any consistency in the usage of uniform time-scaling.
CMD HMD
134 MELODIC PATTERN PROCESSING
Dataset MAP Srate Norm TScale Dist
MSDcmdiitm
0.279 w67 Zmean Won DDTW_L1_G10
0.277 w67 ZtonicQ12 Won DDTW_L1_G10
0.275 w100 ZtonicQ12 Won DDTW_L1_G10
MSDhmdiitb
0.259 w40 ZtonicQ12 Won DDTW_L1_G90
0.259 w100 ZtonicQ12 Won DDTW_L1_G90
0.259 w67 ZtonicQ12 Won DDTW_L1_G90
Table 5.2: MAP score and details of the parameter settings for the three best performing vari-ants for MSDcmd
iitm and MSDhmdiitb dataset. These results are corresponding to the experimental
setup where the target patterns’ length is not based on the ground-truth annotations, but istaken to be the same as the query pattern length. Srate: sampling rate of the melody repres-entation, Norm: normalization technique, TScale: uniform time-scaling and Dist: distancemeasure.
are segmented using the ground-truth annotations. We now present the results corresponding to the other experimental setup described in Section 5.2.2, where the target pattern lengths are considered to be the same as that of the query pattern. We evaluate all 560 variants of the method, as done above, on both datasets under this experimental setup. In Table 5.2 we show the MAP scores of the top performing variants for both datasets, MSDcmd_iitm and MSDhmd_iitb. We find that the MAP score for the best performing variant decreases from 0.41 to 0.28 for MSDcmd_iitm and from 0.55 to 0.26 for MSDhmd_iitb. This indicates that the melodic similarity computation task becomes much more challenging in the absence of an accurate melodic segmentation method. For this experimental setup, the trend in the sampling rate for Carnatic and Hindustani music remains the same as in Table 5.1, which is that a higher sampling rate is desired for representing melodic patterns in Carnatic music. In terms of normalization, we see that, surprisingly, ZtonicQ12 is used by the top performing variants for both datasets. Note that among the other variants whose performance is not statistically significantly different from the ones reported in Table 5.2, Zmean and Ztonic normalization are also used for the MSDcmd_iitm and MSDhmd_iitb datasets, respectively. An interesting observation is that in this experimental setup uniform time-scaling is consistently used by all the top performing variants. This indicates that such a time-scaling operation is immensely advantageous in retrieval scenarios where the length of the target melodic patterns is taken to be the same as that of the query pattern (i.e. pattern segmentation is not known). We also see that DDTW_L1_G10 and DDTW_L1_G90 are always used for the MSDcmd_iitm and MSDhmd_iitb datasets, respectively, indicating that applying a local constraint in DTW is critical for this experimental setup.
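The two ingredients that recur in the top variants, per-pattern normalization followed by an L1-cost DTW under a warping constraint, can be sketched as follows. This is a minimal illustration, not the thesis implementation: the `band` parameter stands in for the G10/G90 constraint named above, and the function names are ours.

```python
def zmean(p):
    """Zero-mean (Zmean) normalization: subtract the pattern's own mean pitch,
    so pitch-transposed repetitions of a pattern become directly comparable."""
    m = sum(p) / len(p)
    return [v - m for v in p]

def dtw_l1(x, y, band=0.1):
    """DTW with an L1 local cost and a warping band whose half-width is a
    fraction `band` of the longer sequence (a stand-in for the G10/G90
    constraint in the text; the exact constraint used in the thesis may
    differ)."""
    n, m = len(x), len(y)
    w = max(1, int(band * max(n, m)))
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])  # L1 local cost
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

With Zmean applied first, two transposed copies of the same contour obtain a DTW distance of zero, which is the behaviour the discussion above attributes to the best variants.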
[Slides: Results: Summary. Parameter trends per collection (sampling rate, normalization, time-scaling, distance, local and global constraints; mean-based normalization and DTW with a local constraint for Carnatic, tonic-based normalization for Hindustani; best MAP 0.41 for Carnatic and 0.55 for Hindustani), and the known vs. unknown phrase-segmentation setups for Carnatic and Hindustani music.]
• Gulati, S., Serrà, J., & Serra, X. (2015). An evaluation of methodologies for melodic similarity in audio recordings of Indian art music. In ICASSP, pp. 678–682.
[Slide: Improving Melodic Similarity. Pipeline: Audio1, Audio2 → Predominant Pitch Estimation → Pitch Normalization → Uniform Time-scaling → Distance Computation → Melodic Similarity; annotated with culture specificities.]
[Slide: Improving Melodic Similarity. Example: Hindustani music.]
[Slide: Improving Melodic Similarity. Complexity-invariant distance measure:]
• G. E. Batista, X. Wang, and E. J. Keogh. A complexity-invariant distance measure for time series. In SDM, volume 11, pp. 699–710, 2011.
[Figure residue: pitch contours (frequency in cents vs. time in seconds) of occurrences of the same phrase in Carnatic music, with three marked segments differing in melodic complexity.]
based distance (DDTW) in order to compute the final similarity score Df = γ · DDTW. We compute γ as:

γ = max(zi, zj) / min(zi, zj),  (5.1)

zi = √( Σ_{i=1}^{N−1} (pi − pi+1)² ),  (5.2)
where zi is the complexity estimate of a melodic pattern of length N samples and pi is the pitch value of the ith sample. We explore two variants of this complexity estimate. One of these variants was already proposed in Batista et al. (2011) and is described in Eq. 5.2. We denote this method variant by MCW1. We propose another variant that utilizes melodic characteristics of Carnatic music. This variant takes the number of saddle points in the melodic phrase as the complexity estimate (Ishwar et al., 2013). This method variant is denoted by MCW2. As saddle points we consider all the local minima and maxima in the pitch contour that have at least one minimum-to-maximum distance of half a semitone. Since such melodic characteristics are predominantly present in Carnatic music, the complexity weighting is not applicable for computing melodic similarity in Hindustani music.
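The two complexity estimates and the weighting of Eq. 5.1 can be sketched as follows. This is a minimal illustration under our own reading of the text: the saddle-point detection in MCW2 is simplified, the 50-cent threshold assumes a pitch contour expressed in cents, and all names are ours.

```python
import math

def complexity_mcw1(p):
    """MCW1 (Batista et al., 2011; Eq. 5.2): square root of the summed
    squared first differences of the pitch contour."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, p[1:])))

def complexity_mcw2(p, min_excursion=50.0):
    """MCW2, sketched: count local extrema whose excursion to the adjacent
    extremum is at least half a semitone (50 cents); a simplified
    saddle-point count, not the thesis' exact procedure."""
    d = [(1 if b > a else -1 if b < a else 0) for a, b in zip(p, p[1:])]
    ext = [0]
    for i in range(len(d) - 1):
        if d[i] != 0 and d[i + 1] != 0 and d[i] != d[i + 1]:
            ext.append(i + 1)  # slope changes sign -> local extremum
    ext.append(len(p) - 1)
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(p[b] - p[a]) >= min_excursion)

def weighted_distance(d_dtw, zi, zj):
    """Complexity weighting (Eq. 5.1): Df = gamma * D_DTW,
    gamma = max(zi, zj) / min(zi, zj)."""
    return (max(zi, zj) / min(zi, zj)) * d_dtw
```

The weighting penalizes pairs whose contours differ strongly in complexity, since gamma grows with the ratio of the two estimates.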
5.3.2 Evaluation
5.3.2.1 Dataset and Annotations
For evaluation we use the same music collection as used in Section 5.2, which is described in Section 3.3.3. This collection enables a better comparison of the results with other studies, since it has been used in several other studies for a similar task (Rao et al., 2014; Ross et al., 2012). However, we found a number of issues in the annotations of the melodic phrases, which we corrected; we also extended the dataset by adding 25% more melodic phrase annotations, as explained in Section 3.3.3. We denote this new dataset by MSD_CM and the constituent Carnatic and Hindustani sub-datasets by MSDcmd_CM and MSDhmd_CM, respectively. Similar to the way we perform evaluations in Section 5.2, we evaluate our approach separately on both the Carnatic and Hindustani datasets. In Table 3.6 we summarize the relevant details of the dataset in terms of the number of artists, ragas, audio recordings and total duration. In Table 3.8 we summarize the details of the annotated phrases in terms of their number of instances and basic statistics of the length of the phrases.
5.3.2.2 Setup, Measures and Statistical Significance
The experimental setup and evaluation measures used in this study are exactly thesame as used for the comparative evaluation described in Section 5.2. We consider
[Slide: Improving Melodic Similarity. Pipeline: Audio1, Audio2 → Predominant Pitch Estimation + Post Processing → Pitch Normalization → Svara Duration Truncation → Distance Computation + Complexity Weighting → Melodic Similarity; partial transcription (nyās segmentation); truncation durations {0.1, 0.3, 0.5, 0.75, 1.0, 1.5, 2.0} s.]
[Slide: Datasets. Carnatic and Hindustani music datasets, with per-dataset counts of recordings, duration, rāgas and pattern occurrences.]
MSDhmd_CM
Norm     MB           MDT          MCW1   MCW2
Ztonic   0.45 (0.25)  0.52 (0.24)  -      -
Zmean    0.25 (0.20)  0.31 (0.23)  -      -
Ztetra   0.40 (0.23)  0.47 (0.23)  -      -

MSDcmd_CM
Norm     MB           MDT          MCW1         MCW2
Ztonic   0.39 (0.29)  0.42 (0.29)  0.41 (0.28)  0.41 (0.29)
Zmean    0.39 (0.26)  0.45 (0.28)  0.43 (0.27)  0.45 (0.27)
Ztetra   0.45 (0.26)  0.50 (0.27)  0.49 (0.28)  0.51 (0.27)

Table 5.3: MAP scores for the two datasets MSDhmd_CM and MSDcmd_CM for the four method variants MB, MDT, MCW1 and MCW2 and for different normalization techniques. Standard deviation of average precision is reported within round brackets.
We first analyse the results for the MSDhmd_CM dataset. From Table 5.3 (upper half), we see that the proposed method variant that applies a duration truncation performs better than the baseline method for all the normalization techniques. Moreover, this difference is found to be statistically significant in each case. The results for MSDhmd_CM in this table correspond to F = 500 ms, for which we obtain the highest accuracy compared to the other F values, as shown in Figure 5.10. Furthermore, we see that Ztonic results in the best accuracy for MSDhmd_CM for all the method variants, and the difference is found to be statistically significant in each case. From Table 5.3, we notice a high standard deviation of the average precision values. This is because some occurrences of melodic phrases possess a large amount of melodic variation (acting as outliers), and therefore the average precision of the results retrieved using them as a query is much smaller compared to the other occurrences. In Figure 5.11, we show a boxplot of average precision values for each phrase category and for both MB and MDT to get a better understanding of the results. We observe that, with the exception of phrase category H2, MDT consistently performs better than MB for all the other phrase categories. A close examination of this exception reveals that the error is often in the segmentation of the steady svara regions of the melodic phrases corresponding to H2. This can be attributed to a specific subtle melodic movement in H2 that is confused by the segmentation method as a melodic ornament instead of a svara transition, leading to a segmentation error.

We now analyse the results for the MSDcmd_CM dataset. From Table 5.3 (lower half), we see that using the method variants MDT, MCW1 and MCW2 we obtain reasonably higher
[Figure residue: boxplots of average precision per phrase category (H1–H5 and C1–C5) for the variants MB, MDT and MCW2; MAP (roughly 0.44–0.52) vs. truncation duration F in seconds for the HMD and CMD datasets.]
[Slide: duration truncation (what is the optimal duration?) and complexity weighting.]
[Slide: Results: Summary. Duration truncation (Carnatic and Hindustani); complexity weighting (Carnatic).]

• Gulati, S., Serrà, J., & Serra, X. (2015). Improving melodic similarity in Indian art music using culture-specific melodic characteristics. In ISMIR, pp. 680–686.
[Slide: Melodic Pattern Processing. Melodic similarity (supervised), pattern discovery (unsupervised), pattern characterization (unsupervised).]
[Slide: Pattern Discovery. Intra-recording pattern discovery, inter-recording pattern search, rank refinement (R1, R2, R3).]
[Slide: Intra-recording Pattern Discovery. Dynamic time warping (DTW); global constraint on pattern length; no local constraint; step set {(0,1), (1,1), (1,0)}.]
[Slide: Rank Refinement. Local constraint; step set {(2,1), (1,1), (1,2)}; local cost function variants (V1, V2, V3, V4).]
[Slide: Computational Complexity. O(n) samples and O(n) candidate patterns yield tens of millions of pattern pairs; motivates distance lower bounds.]
[Slide: Lower Bounds for DTW. DTrue ≥ DLB; candidates are pruned against a top-N queue of the best distances found so far.]
[Slide: Lower Bounds for DTW]
• Cascaded lower bounds [Rakthanmanon et al. (2013)]
• First-Last bound [Kim et al. (2001)]
• LB_Keogh [Keogh & Ratanamahatana (2004)]
• Early abandoning
• Kim, S. W., Park, S., & Chu, W. W. (2001). An index-based approach for similarity search supporting time warping in large sequence databases. In 17th International Conference on Data Engineering, pp. 607–614.
• Keogh, E. & Ratanamahatana, C. A. (2004). Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3), 358–386.
• Rakthanmanon, T., Campana, B., Mueen, A., Batista, G., Westover, B., Zhu, Q., Zakaria, J., & Keogh, E. (2013). Addressing big data time series: mining trillions of time series subsequences under dynamic time warping. ACM Transactions on Knowledge Discovery from Data (TKDD), 7(3), 10:1–10:31.
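Of these, LB_Keogh is simple to sketch: build an envelope around the query within the allowed warping window and charge each candidate sample only for leaving that envelope. The version below is our own minimal illustration (an L1 accumulation to match an L1 DTW cost; it assumes equal-length sequences). The bound never exceeds the true DTW distance, so a candidate whose bound already exceeds the worst distance in the top-N queue can be discarded without running DTW.

```python
def lb_keogh(query, cand, w):
    """LB_Keogh lower bound for DTW (Keogh & Ratanamahatana, 2004), sketched:
    sum of how far each candidate sample falls outside the query's
    [min, max] envelope within warping window `w` (in samples)."""
    n = len(query)
    lb = 0.0
    for i in range(n):
        window = query[max(0, i - w): i + w + 1]
        u, l = max(window), min(window)  # query envelope at position i
        if cand[i] > u:
            lb += cand[i] - u
        elif cand[i] < l:
            lb += l - cand[i]
    return lb
```

Because the envelope widens with `w`, the bound is tightest for small warping windows, which is where pruning pays off most.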
[Slide: Dataset. Seed patterns: 78,000; search patterns: ~15,000,000; Carnatic music (recordings, duration).]
[Slides: Results. Number of DTW distance computations (on the order of trillions; 1 T = 10^12) for intra- and inter-recording search, and the percentage of computations pruned by each cascaded lower bound (First-Last, LB_Keogh variants).]
Seed Category   V1    V2    V3    V4
S1              0.92  0.92  0.91  0.89
S2              0.68  0.73  0.73  0.66
S3              0.35  0.34  0.35  0.35

Table 5.5: MAP scores for four variants of the rank refinement method (Vi) for each seed category (S1, S2 and S3).
Next, we evaluate the performance of the inter-recording pattern detection task and assess the effect of the four DTW cost variants explained in Section 5.4.3 (denoted by V1...V4). To investigate the dependence of the performance on the category of the seed pair, we perform the evaluation within each seed category. In Table 5.5 we show the MAP scores obtained for the different variants of the rank refinement method (V1, V2, V3 and V4) for each seed category (S1, S2 and S3). In addition, we also present a box plot of the corresponding average precision values in Figure 5.20. In general, we observe that every method performs well for category S1, with a MAP score around 0.9 and no statistically significant difference between the variants. For category S2, V2 and V3 perform better than the rest, and the difference is found to be statistically significant. The performance is poor for the third category S3 for every variant. The difference in performance between any two methods across seed categories is statistically significant. We observe that MAP scores across different seed categories correlate strongly with the fraction of melodically similar seed pairs in that category (discussed above). This means that the seed pattern pairs that have a high distance D between them obtain low average precision when used as query patterns, and vice versa. This might be because across repetitions of such patterns there could be a high degree of melodic variation, which the current similarity measure appears inadequate to model. In addition, closely repeating seed pattern pairs (i.e. with low distance D between them) might have more repetitions with a low degree of melodic variation, for which the current similarity measure performs well.

Finally, we analyse the distance distributions of melodically similar and dissimilar search patterns. For this we use the best performing rank-refinement variant V2. To examine the separation in the distance distribution, we compute the ROC curve as shown in Figure 5.19 (dashed red line). We observe that the separability between melodically similar and dissimilar subsequences in this case is poorer than the one obtained for the seed pairs (solid blue line). This suggests that it is much harder to differentiate melodically similar from dissimilar patterns when the search is performed across recordings. This can be attributed to the total number of melodically dissimilar subsequences (irrelevant documents) for every query subsequence, which is orders of magnitude higher in this task compared to the task of pattern discovery within a recording. In addition, one of the reasons can also be that the melodic phrases of two allied ragas (Section 2.3.2) are differentiated based on subtle melodic nuances (Viswanathan &
[Figure residue: boxplots of average precision per seed category (S1, S2, S3).]
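The separability that the ROC analysis above summarizes can be quantified as the area under the curve, i.e. the probability that a randomly chosen melodically similar pair receives a smaller distance than a dissimilar one. A pure-Python sketch (O(n·m) pairwise comparison; the function name is ours):

```python
def auc_separability(d_similar, d_dissimilar):
    """AUC of the ROC over two distance distributions: the probability that
    a melodically similar pair gets a smaller distance than a dissimilar
    one (ties count half). 1.0 = perfect separation, 0.5 = chance."""
    wins = 0.0
    for ds in d_similar:
        for dd in d_dissimilar:
            if ds < dd:
                wins += 1.0
            elif ds == dd:
                wins += 0.5
    return wins / (len(d_similar) * len(d_dissimilar))
```

A value near 0.5 corresponds to the overlapping distributions observed for the inter-recording search, while values near 1.0 correspond to the well-separated seed-pair case.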
[Slide: Results. Annotated patterns; local cost function; distance separability; intra- vs. inter-recording pattern detection.]
[Slide: Results. Intra- vs. inter-recording ROC operating points (a markedly higher true positive rate for intra-recording search at the same false positive rate); best local cost function: city block.]

• Gulati, S., Serrà, J., Ishwar, V., & Serra, X. (2014). Mining melodic patterns in large audio collections of Indian art music. In SITIS-MIRA, pp. 264–271.
[Slide: Melodic Pattern Processing. Melodic similarity (supervised), pattern discovery (unsupervised), pattern characterization (unsupervised).]
[Slide: Melodic Pattern Characterization. Gamakas and alankāras, composition-specific patterns, rāga motifs.]
[Slide: Melodic Pattern Characterization. Pipeline: Carnatic music collection → Data Processing → Melodic Pattern Discovery (intra-recording pattern discovery, inter-recording pattern detection) → Pattern Characterization (pattern network generation, network filtering, community detection and characterization) → rāga motifs.]
[Slide: Melodic Pattern Characterization. Pattern network generation (undirected network).]

• M. E. J. Newman, "The structure and function of complex networks," Society for Industrial and Applied Mathematics (SIAM) Review, vol. 45, no. 2, pp. 167–256, 2003.
[Slide: Melodic Pattern Characterization. Network filtering.]

• S. Maslov and K. Sneppen, "Specificity and stability in topology of protein networks," Science, vol. 296, no. 5569, pp. 910–913, 2002.
[Slide: Melodic Pattern Characterization. Community detection and characterization.]

• V. D. Blondel, J. L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, P10008, 2008.
• S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3, pp. 75–174, 2010.
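The last two stages, network filtering and community detection, can be sketched as follows. This is a toy illustration with hypothetical pattern nodes and distances; connected components of the filtered network stand in here for proper community detection (the thesis uses the Louvain method of Blondel et al., 2008).

```python
from collections import defaultdict

def filter_network(edges, threshold):
    """Network filtering: keep only edges whose distance is at most
    `threshold`. `edges` is a list of (node_a, node_b, distance) tuples."""
    return [(a, b) for a, b, d in edges if d <= threshold]

def communities(edges):
    """Group nodes of the filtered, undirected pattern network. Connected
    components are a stand-in for community detection proper."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comms = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comms.append(comp)
    return comms

# Toy pattern network: nodes are melodic patterns, edge weight = melodic
# distance (lower = more similar); the values are illustrative only.
toy = [("p1", "p2", 0.1), ("p2", "p3", 0.2), ("p4", "p5", 0.1),
       ("p3", "p4", 0.9)]
groups = communities(filter_network(toy, threshold=0.5))
```

Filtering out the weak p3-p4 link splits the toy network into the two groups {p1, p2, p3} and {p4, p5}, which is the effect the threshold D has in Figure 5.24.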
[Slide: Melodic Pattern Characterization. Expected properties of the community types:]

Community type        Nodes   Recordings  Rāgas
Gamakas               High    High        High
Composition phrases   Medium  Less        Less
Rāga phrases          Medium  Medium      Less
[Slide: Melodic Pattern Characterization. Rāga distribution within a community.]
5.5 CHARACTERIZATION OF MELODIC PATTERNS 169
Figure 5.24: Graphical representation of the melodic pattern network after filtering by threshold D. The detected communities in the network are indicated by different colors. A few examples of these communities are shown on the right.
rank the communities, we empirically devise a goodness measure G, which denotes the likelihood that a community Cq represents a raga motif. We propose to use

G = N L⁴ c,  (5.6)

where L is an estimate of the likelihood of raga a′q in Cq,

L = Aq,1 / N,  (5.7)

and c indicates how uniformly the nodes of the community are distributed over audio recordings,

c = ( Σ_{l=1}^{Lb} l · Bq,l ) / N.  (5.8)
Higher c implies a more uniform distribution. Since a community that represents a raga motif is expected to contain nodes from a single raga (high value of L) and nodes that belong to many different recordings (high value of c), the goodness measure G is high for such a community. In general we prefer large communities, but, to avoid
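Equations 5.6 and 5.7 can be sketched directly. Here the uniformity term c of Eq. 5.8 is taken as a precomputed input, since it depends on the node-to-recording distribution Bq defined earlier in the thesis; the function names and label format are ours.

```python
from collections import Counter

def raga_likelihood(node_ragas):
    """L = Aq,1 / N (Eq. 5.7): the fraction of the community's N nodes that
    carry its most common raga label."""
    counts = Counter(node_ragas)
    return counts.most_common(1)[0][1] / len(node_ragas)

def goodness(node_ragas, c):
    """G = N * L**4 * c (Eq. 5.6). `c` is the recording-uniformity term of
    Eq. 5.8, passed in precomputed."""
    n = len(node_ragas)
    return n * raga_likelihood(node_ragas) ** 4 * c
```

The fourth power on L makes G drop quickly for communities that mix ragas, while the leading N still favours large single-raga communities.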
[Slide: worked example of the rāga-likelihood term for a community with N = 5 nodes (majority-rāga count 3 → L = 3/5).]
[Slide: worked example of the recording-distribution term c for a community with N = 5 nodes, using the centroid of the nodes-per-recording counts.]
[Slide: Melodic Pattern Characterization. Community detection and characterization: the goodness measure.]
[Slide: Evaluation. Dataset details (recordings, duration, rāgas, compositions); 10 rāgas × 10 communities = 100 phrases; listening test with musicians.]
Raga          µr    σr     Raga      µr    σr
Hamsadhvani   0.84  0.23   Begaḍa    0.88  0.11
Kamavardani   0.78  0.17   Kapi      0.75  0.10
Darbar        0.81  0.23   Bhairavi  0.91  0.15
Kalyāṇi       0.90  0.10   Behag     0.84  0.16
Kambhoji      0.87  0.12   Toḍi      0.92  0.07

Table 5.7: Mean (µr) and standard deviation (σr) of µψ for each raga. Ragas with µr ≥ 0.85 are highlighted.
Figure 5.26: Histogram of µψ for all of the 100 melodic patterns selected for the listening test.
several recordings in the dataset. As described in Section 5.5.1.2, in such a scenario the goodness measure G can be made more robust by computing c using the distribution of nodes Bq over unique compositions rather than over audio recordings.

The results show that the proposed method successfully discovers raga motifs with high accuracy. We see that, even for the allied ragas present in the dataset, such as Kambhoji and Begaḍa (Section 5.5.2.1), the method is able to discover distinct characteristic raga motifs. As mentioned, allied ragas are challenging because they have a substantial overlap in the set of svaras that they comprise (see also Table 5.6). Finally, on a more informal note, it is worth mentioning that the musicians were impressed when, after the listening test, they came to know that the melodic patterns had been discovered by a machine following an unsupervised approach.

We consider this work a preliminary study with a lot of scope for improvement. In particular, the approach can be extended to identify patterns belonging to other categories (gamaka or composition-specific patterns), the definition of the goodness measure G can be improved as suggested in Section 5.5.3, and the listening test could be done more
[Slide: Results. Per-phrase performance, per-rāga performance, musicians' agreement.]
[Slide: Results: Summary. Mean per-rāga and per-phrase ratings.]

• Gulati, S., Serrà, J., Ishwar, V., & Serra, X. (2016). Discovering rāga motifs by characterizing communities in networks of melodic patterns. In ICASSP, pp. 286–290.
[Figure residue: histogram of ratings.]

[Slide: Automatic Rāga Recognition]
Chapter 6
Automatic Raga Recognition
6.1 Introduction
In this chapter, we address the task of automatically recognizing ragas in audio recordings of IAM. We describe two novel approaches for raga recognition that jointly capture the tonal and the temporal aspects of melody. The contents of this chapter are largely based on our published work in Gulati et al. (2016a,b).
Raga is a core musical concept used in compositions, performances, music organization, and pedagogy of IAM. Even beyond art music, numerous compositions in Indian folk and film music are also based on ragas (Ganti, 2013). Raga is therefore one of the most desired melodic descriptions of a recorded performance of IAM, and an important criterion used by listeners to browse audio music collections. Despite its significance, there exists a large volume of audio content whose raga is incorrectly labeled or, simply, unlabeled. A computational approach to raga recognition will allow us to automatically annotate large collections of audio music recordings. It will enable raga-based music retrieval in large audio archives, semantically meaningful music discovery, and musicologically informed navigation. Furthermore, a deeper understanding of the raga framework from a computational perspective will pave the way for building applications for music pedagogy in IAM.
Raga recognition is the most studied research topic in MIR of IAM. There exist a considerable number of approaches utilizing different characteristic aspects of ragas such as the svara set, svara salience, and arohana-avrohana. A critical in-depth review of the existing approaches for raga recognition is presented in Section 2.4.3, wherein we identify several shortcomings in these approaches and possible avenues for scientific contribution to take this task to the next level. Here we provide a short summary of this analysis:
• Nearly half of the existing approaches for raga recognition do not utilize the temporal aspects of melody at all (Table 2.3), which are crucial for characterizing a raga.
Methods | Melodic characteristics used¹ | Svara discretization | Temporal aspects
--- | --- | --- | ---
Pandey et al. (2003) | • • | Yes | •
Chordia & Rae (2007) | • • • | Yes | •
Belle et al. (2009) | • • • | No |
Shetty & Achary (2009) | • • | Yes | •
Sridhar & Geetha (2009) | • • | Yes | •
Koduri et al. (2011) | • • | Yes |
Ranjani et al. (2011) | • | No |
Chakraborty & De (2012) | • | Yes |
Koduri et al. (2012) | • • • | Both |
Chordia & Sentürk (2013) | • • • | No |
Dighe et al. (2013a) | • • • | Yes | •
Dighe et al. (2013b) | • • | Yes |
Koduri et al. (2014) | • • • | No |
Kumar et al. (2014) | • • • • | Yes | •
Dutta et al. (2015)ᵃ | • | No | •

¹ Each • marks one of: svara set, svara salience, svara intonation, arohana-avrohana, melodic phrases.
ᵃ This method performs raga verification and not recognition.

Table 2.3: Raga recognition methods proposed in the literature along with the melodic characteristics they utilize to perform the task. We also indicate if a method uses a discrete svara representation of melody. The methods are arranged in chronological order.
Method | Tonal feature | Tonic identification | Feature | Recognition method | #Ragas | Dataset (Dur./Num.)ᵃ | Audio type
--- | --- | --- | --- | --- | --- | --- | ---
Pandey et al. (2003) | Pitch (Boersma & Weenink, 2001) | NA | Svara sequence | HMM and n-gram | 2 | - / 31 | MP
Chordia & Rae (2007) | Pitch (Sun, 2000) | Manual | PCD, PCDD | SVM classifier | 31 | 20 / 127 | MP
Belle et al. (2009) | Pitch (Rao & Rao, 2009) | Manual | PCD (parameterized) | k-NN classifier | 4 | 0.6 / 10 | PP
Shetty & Achary (2009) | Pitch (Sridhar & Geetha, 2006) | NA | #Svaras, vakra svaras | Neural network classifier | 20 | - / 90 | MP
Sridhar & Geetha (2009) | Pitch (Lee, 2006) | Singer identification | Svara set, its sequence | String matching | 3 | - / 30 | PP
Koduri et al. (2011) | Pitch (Rao & Rao, 2010) | Brute force | PCD | k-NN classifier | 10 | 2.82 / 170 | PP
Ranjani et al. (2011) | Pitch (Boersma & Weenink, 2001) | GMM fitting | PDE | SC-GMM and set matching | 7 | - / 48 | PP
Chakraborty & De (2012) | Pitch (Sengupta, 1990) | Error minimization | Svara set | Set matching | - | - / - | -
Koduri et al. (2012) | Predominant pitch (Salamon & Gómez, 2012) | Multipitch-based | PCD variants | k-NN classifier | 43 | - / 215 | PP
Chordia & Sentürk (2013) | Pitch (Camacho, 2007) | Brute force | PCD variants | k-NN and statistical classifiers | 31 | 20 / 127 | MP
Dighe et al. (2013a) | Chroma (Lartillot et al., 2008) | Brute force (vadi-based) | Chroma, timbre features | HMM | 4 | 9.33 / 56 | PP
Dighe et al. (2013b) | Chroma (Lartillot et al., 2008) | Brute force (vadi-based) | PCD variant | RF classifier | 8 | 16.8 / 117 | PP
Koduri et al. (2014) | Predominant pitch (Salamon & Gómez, 2012) | Multipitch-based | PCD (parameterized) | Different classifiers | 45? | 93 / 424 | PP
Kumar et al. (2014) | Predominant pitch (Salamon & Gómez, 2012) | Brute force | PCD + n-gram distribution | SVM classifier | 10 | 2.82 / 170 | PP
Dutta et al. (2015)* | Predominant pitch (Salamon & Gómez, 2012) | Cepstrum-based | Pitch contours | LCS with k-NN | 30† | 3 / 254 | PP

ᵃ In the case of multiple datasets we list the larger one. * This method performs raga verification and not recognition. ? Authors do not use all 45 ragas at once in a single experiment, but consider groups of 3 ragas per experiment. † Authors finally use only 17 ragas in their experiments.
ABBREVIATIONS: Dur.: duration of the dataset; Num.: number of recordings; NA: not applicable; '-': not available; SC-GMM: semi-continuous GMM; MP: monophonic; PP: polyphonic.

Table 2.4: Summary of the raga recognition methods proposed in the literature. The methods are arranged in chronological order.
[Slide: Proposed Approaches — phrase-based rāga recognition; TDM-based rāga recognition (tonal and temporal discretization).]
[Slide: Phrase-based Rāga Recognition — block diagram: Music collection → Data Processing → Melodic Pattern Discovery (Intra-recording Pattern Discovery, Inter-recording Pattern Detection) → Melodic Pattern Clustering (Pattern Network Generation, Similarity Thresholding, Community Detection) → TF-IDF Feature Extraction (Vocabulary Extraction, Term-frequency Feature Extraction, Feature Normalization) → Feature Matrix.]
182 AUTOMATIC RAGA RECOGNITION
In the next step of our method we group together similar melodic patterns (Figure 6.1). To do so, we detect non-overlapping communities in the network of melodic patterns using the method proposed by Blondel et al. (2008). This community detection method is based on optimizing the modularity of the network and is parameter-free from the user's point of view. This method is capable of handling very large networks and has been extensively used in various applications (Fortunato, 2010). We use its implementation available in networkX (Hagberg et al., 2008), a Python language package for exploration and analysis of networks and network algorithms. Note that, from now on, the melodic patterns grouped within a community are regarded as the occurrences of a single melodic phrase. Thus, a community essentially represents a melodic phrase or motif.
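The community-detection step can be sketched as follows. This is an illustrative stand-in, not the thesis pipeline: it uses networkx's greedy modularity optimizer (Clauset–Newman–Moore) rather than the exact Louvain implementation of Blondel et al. (2008), and the toy graph (two cliques joined by a bridge) is a hypothetical pattern network.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy network of melodic patterns: nodes are pattern occurrences, edges link
# pairs whose melodic distance fell below the similarity threshold.
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),  # clique 1
                  (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7),  # clique 2
                  (3, 4)])                                          # weak bridge

# Non-overlapping communities by modularity optimization; each community is
# then treated as the set of occurrences of one melodic phrase.
communities = greedy_modularity_communities(G)
phrases = [set(c) for c in communities]
```

On this toy network the optimizer recovers the two cliques as two communities, i.e. two melodic phrases.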
6.2.1.3 TF-IDF Feature Extraction
As mentioned above, we draw an analogy between raga rendition and textual description of a topic. Using this analogy we represent each audio recording using a vector space model, wherein melodic patterns are considered as words (or terms). This process is divided into three blocks (Figure 6.1).
We start by building our vocabulary V, which translates to selecting relevant pattern communities for characterizing ragas. For this, we include all the detected communities except the ones that comprise patterns extracted from only a single audio recording. Such communities are analogous to the words that only occur within a document and, hence, are irrelevant for modeling a topic.
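The vocabulary-pruning rule above amounts to a document-frequency filter. The sketch below assumes a hypothetical intermediate, a mapping from community id to the set of recordings its patterns came from:

```python
def build_vocabulary(pattern_recordings):
    """Keep only communities whose patterns span more than one recording.

    pattern_recordings: dict mapping community id -> set of recording ids.
    Communities confined to a single recording are dropped, analogous to
    words that occur in only one document.
    """
    return {comm for comm, recs in pattern_recordings.items() if len(recs) > 1}

vocab = build_vocabulary({
    "c1": {"rec1", "rec2"},   # appears in two recordings: kept
    "c2": {"rec3"},           # single recording: dropped
    "c3": {"rec1", "rec3"},   # kept
})
```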
We experiment with three different sets of features f1, f2 and f3, which are similar to the TF-IDF features typically used in text information retrieval. We denote our corpus by R, comprising N_R = |R| recordings. A melodic phrase and a recording are denoted by ψ and r, respectively.
$$
f_1(\psi, r) =
\begin{cases}
1, & \text{if } f(\psi, r) > 0 \\
0, & \text{otherwise}
\end{cases}
\qquad (6.1)
$$
where f(ψ,r) denotes the raw frequency of occurrence of pattern ψ in recording r. f1 only considers the presence or absence of a pattern in a recording. In order to investigate if the frequency of occurrence of melodic patterns is relevant for characterizing ragas, we take f2(ψ,r) = f(ψ,r). As mentioned, the melodic patterns that occur across different ragas and in several recordings do not aid raga recognition. Therefore, to reduce their effect in the feature vector we employ a weighting scheme similar to the inverse document frequency (idf) weighting in text retrieval.
$$
f_3(\psi, r) = f(\psi, r)\, I(\psi, R) \qquad (6.2)
$$
$$
I(\psi, R) = \log\!\left( \frac{N_R}{\lvert \{ r \in R : \psi \in r \} \rvert} \right) \qquad (6.3)
$$
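Equations (6.1)–(6.3) translate directly into code. The sketch below uses a toy corpus whose dictionary layout is illustrative, not the thesis data format, and it omits the pipeline's separate feature-normalization block:

```python
import math

# Toy corpus: recording id -> {pattern id: raw occurrence count f(psi, r)}
corpus = {
    "rec1": {"p1": 3, "p2": 1},
    "rec2": {"p1": 1, "p3": 2},
    "rec3": {"p3": 4},
}
N_R = len(corpus)  # number of recordings |R|

def doc_freq(psi):
    """Number of recordings in which pattern psi occurs: |{r in R : psi in r}|."""
    return sum(1 for counts in corpus.values() if psi in counts)

def f1(psi, r):
    """Presence/absence of a pattern in a recording, Eq. (6.1)."""
    return 1 if corpus[r].get(psi, 0) > 0 else 0

def f2(psi, r):
    """Raw frequency of occurrence f(psi, r)."""
    return corpus[r].get(psi, 0)

def f3(psi, r):
    """Idf-weighted frequency, Eqs. (6.2)-(6.3)."""
    df = doc_freq(psi)
    return corpus[r].get(psi, 0) * math.log(N_R / df) if df else 0.0
```

For instance, f3("p1", "rec1") evaluates to 3 · log(3/2): the pattern occurs three times in rec1 and in two of the three recordings.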
6.2 PATTERN-BASED RAGA RECOGNITION
Method | Feature | SVML | SGD | NBM | NBG | RF | LR | 1-NN
--- | --- | --- | --- | --- | --- | --- | --- | ---
MVSM | f1 | 51.04 | 55 | 37.5 | 54.37 | 25.41 | 55.83 | -
MVSM | f2 | 45.83 | 50.41 | 35.62 | 47.5 | 26.87 | 51.87 | -
MVSM | f3 | 45.83 | 51.66 | **67.29** | 44.79 | 23.75 | 51.87 | -
MPC | PCDfull | - | - | - | - | - | - | **73.12**
MGK | PDparam | 30.41 | 22.29 | 27.29 | 28.12 | 42.91 | 30.83 | 25.62
MGK | PDcontext | 54.16 | 43.75 | 5.2 | 33.12 | 49.37 | **54.79** | 26.25

Table 6.1: Accuracy (%) of raga recognition on the RRDCMD dataset by MVSM and other methods using different features and classifiers. Bold text signifies the best accuracy by a method among all its variants.
Hindustani music recordings. We therefore do not consider this method for comparing results on Hindustani music.

The authors of MPC courteously ran the experiments on our dataset using the original implementation of the method. For MGK, the authors kindly extracted the features (PDparam and PDcontext) using the original implementation of their method, and the experiments using different classification strategies were done by us.
6.2.3 Results and Discussion
Before we proceed to present our results, we notify readers that the accuracies reported in this section for different methods vary slightly from the ones reported in Gulati et al. (2016b). As mentioned, the experimental setup used here is different compared to the paper (Section 6.2.2.2). In addition, the parameters used for discovering melodic patterns are also different (Section 6.2.1.1). Furthermore, we do not consider the dataset that comprises only 10 ragas, as done in Gulati et al. (2016b), for which our method was shown to outperform the state of the art.
In Table 6.1 and Table 6.2, we present the results of our proposed method MVSM and the two state-of-the-art methods MPC and MGK for the two datasets RRDCMD and RRDHMD, respectively. The highest accuracy for every method is highlighted in bold for both datasets. Due to the poor performance of the SVML and NBB classifiers in the experiments, we do not consider them in our further analysis.
We start by analyzing the results of the variants of MVSM for both the datasets. From Table 6.1 and Table 6.2, we see that the highest accuracy obtained by MVSM for RRDCMD is 67.29%, and for RRDHMD is 82.66%. The difference in the performance across the two datasets can be attributed to several factors. One of them is the difference in the number of ragas across the two datasets: 40 in RRDCMD and 30 in RRDHMD. The performance difference can also be because of the difference in the
Method | Feature | SVML | SGD | NBM | NBG | RF | LR | 1-NN
--- | --- | --- | --- | --- | --- | --- | --- | ---
MVSM | f1 | 71 | 72.33 | 69.33 | 79.33 | 38.66 | 74.33 | -
MVSM | f2 | 65.33 | 64.33 | 67.66 | 72.66 | 40.33 | 68 | -
MVSM | f3 | 65.33 | 62.66 | **82.66** | 72 | 41.33 | 67.66 | -
MPC | PCDfull | - | - | - | - | - | - | **91.66**

Table 6.2: Accuracy (%) of raga recognition on the RRDHMD dataset by MVSM and other methods using different features and classifiers. Bold text signifies the best accuracy by a method among all its variants.
overall length of the audio recordings. Recordings of the music pieces considered in our datasets are significantly longer in Hindustani music compared to Carnatic music (Section 3.3.4). However, since the melodic characteristics and complexity vary considerably across the two music traditions, a definite reason can only be identified by performing the experiments with an equal number of ragas in the dataset and equal duration of the music pieces (Section 6.4).
From Table 6.1 and Table 6.2, we see that for both datasets, the accuracy obtained by MVSM across the feature sets differs substantially. We also see that their performance is very sensitive to the choice of the classifier. With the exceptions of the NBM and LR classifiers, feature f1 in general performs better than the other two features, and the difference is found to be statistically significant in each case. This suggests that considering just the presence or the absence of a melodic pattern, irrespective of its frequency of occurrence, might be sufficient for raga recognition. Interestingly, this finding is consistent with the fact that characteristic melodic patterns are unique to a raga and a single occurrence of such patterns is sufficient to identify the raga (Krishna & Ishwar, 2012). However, the best performance is obtained by the combination of f3 and the NBM classifier, and the difference in its performance compared to any other variant is statistically significant. The NBM classifier outperforming other classifiers using appropriate features is also well recognized in the text classification community (McCallum & Nigam, 1998). We, therefore, only consider the combination of f3 and the NBM classifier for comparing MVSM with the other methods. Overall, the accuracies across different features indicate that normalizing the frequency of occurrence of melodic patterns (as done for f1 and f3) is beneficial for raga recognition. A plausible reason can be that a normalization procedure reduces the weight of the type of melodic patterns that occur more frequently, such as gamaka-type patterns (Section 2.3.2), and are not the characteristic patterns of any particular raga. It is worth noting that the feature weights assigned by a classifier can be used to identify the relevant melodic patterns for raga recognition. These patterns can serve as a dictionary of semantically-meaningful melodic units for many computational tasks in IAM.
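The winning combination, frequency-based pattern features classified with multinomial naive Bayes, can be illustrated as below. This is a sketch in spirit only, not the thesis experiment: it uses scikit-learn's MultinomialNB on a hypothetical 4-recording, 4-pattern feature matrix with made-up counts and raga labels.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows: recordings; columns: (hypothetical) pattern-community frequencies,
# analogous to the tf-idf-style features f1-f3. Labels: raga of each recording.
X_train = np.array([[5, 0, 1, 0],
                    [4, 1, 0, 0],
                    [0, 0, 3, 2],
                    [0, 1, 4, 1]])
y_train = ["raga_A", "raga_A", "raga_B", "raga_B"]

# Multinomial naive Bayes models per-class feature counts, which suits
# frequency-of-occurrence features (McCallum & Nigam, 1998).
clf = MultinomialNB().fit(X_train, y_train)
predictions = clf.predict([[3, 0, 0, 0], [0, 0, 2, 2]])
```

The first test recording is dominated by the first pattern (frequent in raga_A renditions of the toy data) and the second by the last two patterns, so the classifier assigns them to raga_A and raga_B respectively.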
3 4 5 6 7 8 9 10 11 12 13 14
� (bin index)
0.06
0.08
0.10
0.12
0.14
0.16
0.18
0.20
0.22
Clu
ster
ing
Coe
ffici
ent(
C)
0
10
20
30
40
50
60
70
Acc
urac
y(%
)
C(G) � C(Gr) Accuracy of MVSM
3 4 5 6 7 8 9 10 11 12 13 14
� (bin index)
0.06
0.08
0.10
0.12
0.14
0.16
0.18
0.20
0.22
0.24
Clu
ster
ing
Coe
ffici
ent(
C)
30
40
50
60
70
80
90
Acc
urac
y(%
)
C(G) � C(Gr) Accuracy of MVSM
149
6.2 PATTERN-BASED RAGA RECOGNITION
Method  Feature    SVML   SGD    NBM    NBG    RF     LR     1-NN
MVSM    f1         51.04  55     37.5   54.37  25.41  55.83  -
        f2         45.83  50.41  35.62  47.5   26.87  51.87  -
        f3         45.83  51.66  67.29  44.79  23.75  51.87  -
MPC     PCDfull    -      -      -      -      -      -      73.12
MGK     PDparam    30.41  22.29  27.29  28.12  42.91  30.83  25.62
        PDcontext  54.16  43.75  5.2    33.12  49.37  54.79  26.25

Table 6.1: Accuracy (%) of raga recognition on the RRDCMD dataset by MVSM and the other methods using different features and classifiers. Bold text signifies the best accuracy by a method among all its variants.
Hindustani music recordings. We therefore do not consider this method for comparing results on Hindustani music.
The authors of MPC courteously ran the experiments on our dataset using the original implementation of the method. For MGK, the authors kindly extracted the features (PDparam and PDcontext) using the original implementation of their method, and the experiments using the different classification strategies were done by us.
6.2.3 Results and Discussion
Before we proceed to present our results, we notify readers that the accuracies reported in this section for the different methods vary slightly from the ones reported in Gulati et al. (2016b). As mentioned, the experimental setup used here is different compared to the paper (Section 6.2.2.2). In addition, the parameters used for discovering melodic patterns are also different (Section 6.2.1.1). Furthermore, we do not consider the dataset that comprises only 10 ragas, as done in Gulati et al. (2016b), for which our method was shown to outperform the state of the art.
In Table 6.1 and Table 6.2, we present the results of our proposed method MVSM and the two state-of-the-art methods MPC and MGK on the two datasets RRDCMD and RRDHMD, respectively. The highest accuracy for every method is highlighted in bold for both datasets. Due to the poor performance of the SVML and NBB classifiers in the experiments, we do not consider them in our further analysis.
We start by analyzing the results of the variants of MVSM for both datasets. From Table 6.1 and Table 6.2, we see that the highest accuracy obtained by MVSM for RRDCMD is 67.29%, and for RRDHMD is 82.66%. The difference in performance across the two datasets can be attributed to several factors. One of them is the difference in the number of ragas across the two datasets: 40 in RRDCMD and 30 in RRDHMD. The performance difference can also be because of the difference in the
186 AUTOMATIC RAGA RECOGNITION
Method  Feature  SVML   SGD    NBM    NBG    RF     LR     1-NN
MVSM    f1       71     72.33  69.33  79.33  38.66  74.33  -
        f2       65.33  64.33  67.66  72.66  40.33  68     -
        f3       65.33  62.66  82.66  72     41.33  67.66  -
MPC     PCDfull  -      -      -      -      -      -      91.66

Table 6.2: Accuracy (%) of raga recognition on the RRDHMD dataset by MVSM and the other methods using different features and classifiers. Bold text signifies the best accuracy by a method among all its variants.
overall length of the audio recordings. Recordings of the music pieces considered in our datasets are significantly longer in Hindustani music compared to Carnatic music (Section 3.3.4). However, since the melodic characteristics and complexity vary considerably across the two music traditions, a definite reason can only be identified by performing the experiments with an equal number of ragas in the dataset and an equal duration of the music pieces (Section 6.4).

From Table 6.1 and Table 6.2, we see that for both datasets, the accuracy obtained by MVSM across the feature sets differs substantially. We also see that their performance is very sensitive to the choice of the classifier. With the exception of the NBM and LR classifiers, feature f1 in general performs better than the other two features, and the difference is found to be statistically significant in each case. This suggests that considering just the presence or absence of a melodic pattern, irrespective of its frequency of occurrence, might be sufficient for raga recognition. Interestingly, this finding is consistent with the fact that characteristic melodic patterns are unique to a raga, and a single occurrence of such a pattern is sufficient to identify the raga (Krishna & Ishwar, 2012). However, the best performance is obtained by the combination of f3 and the NBM classifier, and the difference in its performance compared to any other variant is statistically significant. The NBM classifier outperforming other classifiers given appropriate features is also well recognized in the text classification community (McCallum & Nigam, 1998). We therefore only consider the combination of f3 and the NBM classifier for comparing MVSM with the other methods. Overall, the accuracies across the different features indicate that normalizing the frequency of occurrence of melodic patterns (as done for f1 and f3) is beneficial for raga recognition. A plausible reason can be that a normalization procedure reduces the weight of the types of melodic patterns that occur more frequently, such as gamaka-type patterns (Section 2.3.2), and are not the characteristic patterns of any particular raga. It is worth noting that the feature weights assigned by a classifier can be used to identify the relevant melodic patterns for raga recognition. These patterns can serve as a dictionary of semantically meaningful melodic units for many computational tasks in IAM.
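The role of feature weighting discussed above can be sketched in a few lines. This is an illustrative sketch, not the thesis's exact formulation: here f1 is assumed to be a binary presence/absence feature, f2 the raw occurrence counts, and f3 a tf-idf-style weighting that, like the normalization discussed above, down-weights patterns occurring in many recordings (e.g. gamaka-type patterns):

```python
import math

def pattern_features(recordings):
    """recordings: list of dicts mapping pattern id -> occurrence count.
    Returns three per-recording feature variants (illustrative definitions):
      f1 -- binary presence/absence of a pattern,
      f2 -- raw frequency of occurrence,
      f3 -- tf-idf style weight: count damped by the number of
            recordings that contain the pattern."""
    n = len(recordings)
    # Document frequency: in how many recordings each pattern occurs.
    df = {}
    for rec in recordings:
        for pat in rec:
            df[pat] = df.get(pat, 0) + 1
    f1 = [{p: 1.0 for p in rec} for rec in recordings]
    f2 = [dict(rec) for rec in recordings]
    # A pattern present in every recording gets an idf weight of log(1) = 0.
    f3 = [{p: c * math.log(n / df[p]) for p, c in rec.items()}
          for rec in recordings]
    return f1, f2, f3
```

Under these assumed definitions, a pattern shared by every recording receives an f3 weight of exactly zero, mirroring the observation above that frequent, non-characteristic patterns should not dominate the classifier.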
[Figure: clustering coefficients C(G) and C(Gr), and the accuracy (%) of MVSM, as a function of the bin index, shown for both datasets.]
• Gulati, S., Serrà, J., Ishwar, V., Şentürk, S., & Serra, X. (2016). Phrase-based rāga recognition using vector space modeling. In ICASSP, pp. 66–70.
[Figure: raga recognition confusion matrix over the 40 ragas of RRDCMD (R1 Śanmukhapriya … R40 Karaharapriya).]
[Slide: confusion matrix over the 40 ragas of RRDCMD — phrase-based rāgas.]
[Slide: confusion matrix over the 40 ragas of RRDCMD — allied rāgas.]
[Slide: confusion matrices over the 40 ragas of RRDCMD — phrase-based vs. PCD-based.]
[Slide: proposed approaches — phrase-based rāga recognition and TDMS-based rāga recognition (tonal and temporal discretization).]
[Slide: pitch contours (Frequency in Cents vs. Time in seconds) with svaras (S, m, g) and a characteristic melodic movement marked.]
Time Delayed Melodic Surface (TDMS)

[Block diagram: audio signal → pre-processing (predominant pitch estimation, tonic normalization) → surface generation → post-processing (power compression, Gaussian smoothing) → TDMS.]
Surface Generation
given recording, we generate a surface S of size h × h by computing

s_{ij} = \sum_{t=\tau}^{N-1} \mathbb{I}\left(B(p_t),\, i\right)\, \mathbb{I}\left(B(p_{t-\tau}),\, j\right)    (6.4)

for 0 ≤ i, j < h, where s_{ij} is the (i, j)-th element of the two-dimensional matrix S, p_t is the t-th sample (in the Cent scale) of the pitch sequence of length N, and \mathbb{I} is an indicator function such that

\mathbb{I}(x, y) = \begin{cases} 1, & \text{iff } x = y \\ 0, & \text{otherwise} \end{cases}    (6.5)

B is an octave-wrapping integer binning operator defined by

B(x) = \left\lfloor \left( \frac{h\,x}{1200} \right) \bmod h \right\rfloor,    (6.6)

and τ is a time delay index (in frames) that is left as a parameter. Note that the frames where a predominant pitch could not be obtained (unvoiced segments) are excluded from any calculation. For the size of S we use h = 120. This value corresponds to 10 Cents per bin, an optimal pitch resolution reported by Chordia & Şentürk (2013). In that study, the authors show that raga recognition using PCDs with a bin width of 10 Cents outperforms PCDs with a bin width of 100 Cents, and obtains comparable results to PCDs with a bin width of 5 Cents.

An example of the generated surface S for a music piece⁵⁴ in raga Yaman is shown in Figure 6.7 (a). We see that the prominent peaks in the surface correspond to the svaras of raga Yaman. We notice that these peaks are steep and that the dynamic range of the surface is high. This can be attributed to the nature of the melodies in these music traditions, particularly in Hindustani music, where the melodies often contain long held svaras. In addition, the dynamic range is high also because the pitches in the stable svara regions lie within a small frequency range around the mean svara frequency, compared to the pitches in the transitory melodic regions. Because of this, most of the pitch values are mapped to a small number of bins, making the prominent peaks steeper.
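Equations 6.4–6.6 can be sketched directly. The following is a minimal illustration, not the original implementation; it assumes the pitch sequence is already tonic-normalized, expressed in Cents, and stripped of unvoiced frames (which the thesis excludes from the computation):

```python
import math

def wrap_bin(cents, h=120):
    """Eq. 6.6: octave-wrapping integer binning operator
    B(x) = floor((h * x / 1200) mod h)."""
    return int(math.floor((h * cents / 1200.0) % h))

def surface(pitch_cents, tau, h=120):
    """Eq. 6.4: s_ij counts how often bin i at frame t co-occurs with
    bin j at frame t - tau; the indicator functions of Eq. 6.5 reduce
    to a single increment per frame pair."""
    bins = [wrap_bin(p, h) for p in pitch_cents]
    s = [[0] * h for _ in range(h)]
    for t in range(tau, len(bins)):
        s[bins[t]][bins[t - tau]] += 1
    return s
```

With h = 120, a pitch value of 600 Cents falls in bin 60, i.e. a 10-Cent resolution per bin, as used in the thesis.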
6.3.1.3 Post-processing

Power Compression: In order to accentuate the values corresponding to the transitory regions in the melody and reduce the dynamic range of the surface, we apply an element-wise power compression

S = S^{\alpha},    (6.7)

where α is an exponent that is left as a parameter. Once a more compact (in terms of the dynamic range) surface is obtained, we apply Gaussian smoothing. With that, we attempt to attenuate the subtle differences in S corresponding to the different melodies within the same raga, while retaining the attributes that characterize that raga.

⁵⁴ http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af
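The post-processing stage can likewise be sketched. Power compression follows Eq. 6.7 directly; for the Gaussian smoothing, this sketch assumes a separable kernel with wrap-around at the borders — a choice that suits the octave-wrapped bins, though the thesis does not specify the border handling:

```python
import math

def compress(surface, alpha):
    """Eq. 6.7: element-wise power compression S = S^alpha, which
    reduces the dynamic range of the surface."""
    return [[v ** alpha for v in row] for row in surface]

def gaussian_smooth(surface, sigma=1.0):
    """Separable 2-D Gaussian smoothing with wrap-around borders."""
    h = len(surface)
    r = max(1, int(3 * sigma))
    kern = [math.exp(-(k * k) / (2.0 * sigma * sigma))
            for k in range(-r, r + 1)]
    norm = sum(kern)
    kern = [v / norm for v in kern]  # kernel sums to 1
    # Horizontal pass, then vertical pass over the intermediate result.
    tmp = [[sum(kern[k + r] * surface[i][(j + k) % h] for k in range(-r, r + 1))
            for j in range(h)] for i in range(h)]
    return [[sum(kern[k + r] * tmp[(i + k) % h][j] for k in range(-r, r + 1))
             for j in range(h)] for i in range(h)]
```

Because the kernel is normalized, a constant surface is left unchanged by the smoothing; only the fine, melody-specific structure of S is attenuated.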
194 AUTOMATIC RAGA RECOGNITION
given recording, we generate a surface S of size h ⇥h by computing
si j =N�1
Ât=t
I (B(pt) , i) I (B(pt�t) , j) (6.4)
for 0 i, j < h , where si j is the (i, j)th element of the two-dimensional matrix S, ptis the t th sample (in Cent-scale) of the pitch sequence of length N, I is an indicatorfunction such that
I(x,y) =
(1, iff x = y0, otherwise
(6.5)
B is an octave-wrapping integer binning operator defined by
B(x) =j ⇣ hx
1200
⌘mod h
k, (6.6)
and t is a time delay index (in frames) that is left as a parameter. Note that, the frameswhere a predominant pitch could not be obtained (unvoiced segments) are excludedfrom any calculation. For the size of S we use h = 120. This value corresponds to10 Cents per bin, an optimal pitch resolution reported by Chordia & Sentürk (2013).In that study, the authors show that raga recognition using PCDs with a bin width of10 Cents outperforms the PCDs with a bin width of 100 Cents, and obtains a compar-able results to PCDs with a bin width of 5 Cents.
An example of the generated surface S for a music piece54 in raga Yaman is shownin Figure 6.7 (a). We see that the prominent peaks in the surface correspond to thesvaras of raga Yaman. We notice that these peaks are steep and the dynamic rangeof the surface is high. This can be attributed to the nature of the melodies in thesemusic traditions, particularly in Hindustani music, where the melodies often containlong held svaras. In addition, the dynamic range is high also because the pitches inthe stable svara regions lie within a small frequency range around the mean svarafrequency compared to the pitches in the transitory melodic regions. Because of this,most of the pitch values are mapped to a small number of bins, making the prominentpeaks more steep.
6.3.1.3 Post-processing
Power Compression: In order to accentuate the values corresponding to the trans-itory regions in the melody and reduce the dynamic range of the surface, we apply anelement-wise power compression
S = Sa , (6.7)
where a is an exponent that is left as a parameter. Once a more compact (in terms ofthe dynamic range) surface is obtained, we apply Gaussian smoothing. With that, weattempt to attenuate the subtle differences in S corresponding to the different melodieswithin the same raga, while retaining the attributes that characterize that raga.
54http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af
194 AUTOMATIC RAGA RECOGNITION
given recording, we generate a surface S of size h ⇥h by computing
si j =N�1
Ât=t
I (B(pt) , i) I (B(pt�t) , j) (6.4)
for 0 i, j < h , where si j is the (i, j)th element of the two-dimensional matrix S, ptis the t th sample (in Cent-scale) of the pitch sequence of length N, I is an indicatorfunction such that
I(x,y) =
(1, iff x = y0, otherwise
(6.5)
B is an octave-wrapping integer binning operator defined by
B(x) =j ⇣ hx
1200
⌘mod h
k, (6.6)
and t is a time delay index (in frames) that is left as a parameter. Note that, the frameswhere a predominant pitch could not be obtained (unvoiced segments) are excludedfrom any calculation. For the size of S we use h = 120. This value corresponds to10 Cents per bin, an optimal pitch resolution reported by Chordia & Sentürk (2013).In that study, the authors show that raga recognition using PCDs with a bin width of10 Cents outperforms the PCDs with a bin width of 100 Cents, and obtains a compar-able results to PCDs with a bin width of 5 Cents.
An example of the generated surface S for a music piece54 in raga Yaman is shownin Figure 6.7 (a). We see that the prominent peaks in the surface correspond to thesvaras of raga Yaman. We notice that these peaks are steep and the dynamic rangeof the surface is high. This can be attributed to the nature of the melodies in thesemusic traditions, particularly in Hindustani music, where the melodies often containlong held svaras. In addition, the dynamic range is high also because the pitches inthe stable svara regions lie within a small frequency range around the mean svarafrequency compared to the pitches in the transitory melodic regions. Because of this,most of the pitch values are mapped to a small number of bins, making the prominentpeaks more steep.
6.3.1.3 Post-processing
Power Compression: In order to accentuate the values corresponding to the trans-itory regions in the melody and reduce the dynamic range of the surface, we apply anelement-wise power compression
S = Sa , (6.7)
where a is an exponent that is left as a parameter. Once a more compact (in terms ofthe dynamic range) surface is obtained, we apply Gaussian smoothing. With that, weattempt to attenuate the subtle differences in S corresponding to the different melodieswithin the same raga, while retaining the attributes that characterize that raga.
54http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af
160
EUY P Me P [PUO Da RMO
§ Da RMO 9 M U[
194 AUTOMATIC RAGA RECOGNITION
given recording, we generate a surface S of size h ⇥h by computing
si j =N�1
Ât=t
I (B(pt) , i) I (B(pt�t) , j) (6.4)
for 0 i, j < h , where si j is the (i, j)th element of the two-dimensional matrix S, ptis the t th sample (in Cent-scale) of the pitch sequence of length N, I is an indicatorfunction such that
I(x,y) =
(1, iff x = y0, otherwise
(6.5)
B is an octave-wrapping integer binning operator defined by
B(x) =j ⇣ hx
1200
⌘mod h
k, (6.6)
and t is a time delay index (in frames) that is left as a parameter. Note that, the frameswhere a predominant pitch could not be obtained (unvoiced segments) are excludedfrom any calculation. For the size of S we use h = 120. This value corresponds to10 Cents per bin, an optimal pitch resolution reported by Chordia & Sentürk (2013).In that study, the authors show that raga recognition using PCDs with a bin width of10 Cents outperforms the PCDs with a bin width of 100 Cents, and obtains a compar-able results to PCDs with a bin width of 5 Cents.
An example of the generated surface S for a music piece54 in raga Yaman is shownin Figure 6.7 (a). We see that the prominent peaks in the surface correspond to thesvaras of raga Yaman. We notice that these peaks are steep and the dynamic rangeof the surface is high. This can be attributed to the nature of the melodies in thesemusic traditions, particularly in Hindustani music, where the melodies often containlong held svaras. In addition, the dynamic range is high also because the pitches inthe stable svara regions lie within a small frequency range around the mean svarafrequency compared to the pitches in the transitory melodic regions. Because of this,most of the pitch values are mapped to a small number of bins, making the prominentpeaks more steep.
6.3.1.3 Post-processing
Power Compression: In order to accentuate the values corresponding to the trans-itory regions in the melody and reduce the dynamic range of the surface, we apply anelement-wise power compression
S = Sa , (6.7)
where a is an exponent that is left as a parameter. Once a more compact (in terms ofthe dynamic range) surface is obtained, we apply Gaussian smoothing. With that, weattempt to attenuate the subtle differences in S corresponding to the different melodieswithin the same raga, while retaining the attributes that characterize that raga.
54http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af
G]gW
#WYag
K] Y
G+
G,
194 AUTOMATIC RAGA RECOGNITION
given recording, we generate a surface S of size h ⇥h by computing
si j =N�1
Ât=t
I (B(pt) , i) I (B(pt�t) , j) (6.4)
for 0 i, j < h , where si j is the (i, j)th element of the two-dimensional matrix S, ptis the t th sample (in Cent-scale) of the pitch sequence of length N, I is an indicatorfunction such that
I(x,y) =
(1, iff x = y0, otherwise
(6.5)
B is an octave-wrapping integer binning operator defined by
B(x) =j ⇣ hx
1200
⌘mod h
k, (6.6)
and t is a time delay index (in frames) that is left as a parameter. Note that, the frameswhere a predominant pitch could not be obtained (unvoiced segments) are excludedfrom any calculation. For the size of S we use h = 120. This value corresponds to10 Cents per bin, an optimal pitch resolution reported by Chordia & Sentürk (2013).In that study, the authors show that raga recognition using PCDs with a bin width of10 Cents outperforms the PCDs with a bin width of 100 Cents, and obtains a compar-able results to PCDs with a bin width of 5 Cents.
An example of the generated surface S for a music piece54 in raga Yaman is shownin Figure 6.7 (a). We see that the prominent peaks in the surface correspond to thesvaras of raga Yaman. We notice that these peaks are steep and the dynamic rangeof the surface is high. This can be attributed to the nature of the melodies in thesemusic traditions, particularly in Hindustani music, where the melodies often containlong held svaras. In addition, the dynamic range is high also because the pitches inthe stable svara regions lie within a small frequency range around the mean svarafrequency compared to the pitches in the transitory melodic regions. Because of this,most of the pitch values are mapped to a small number of bins, making the prominentpeaks more steep.
6.3.1.3 Post-processing
Power Compression: In order to accentuate the values corresponding to the trans-itory regions in the melody and reduce the dynamic range of the surface, we apply anelement-wise power compression
S = Sa , (6.7)
where a is an exponent that is left as a parameter. Once a more compact (in terms ofthe dynamic range) surface is obtained, we apply Gaussian smoothing. With that, weattempt to attenuate the subtle differences in S corresponding to the different melodieswithin the same raga, while retaining the attributes that characterize that raga.
54http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af
G+G,
194 AUTOMATIC RAGA RECOGNITION
given recording, we generate a surface S of size h ⇥h by computing
si j =N�1
Ât=t
I (B(pt) , i) I (B(pt�t) , j) (6.4)
for 0 i, j < h , where si j is the (i, j)th element of the two-dimensional matrix S, ptis the t th sample (in Cent-scale) of the pitch sequence of length N, I is an indicatorfunction such that
I(x,y) =
(1, iff x = y0, otherwise
(6.5)
B is an octave-wrapping integer binning operator defined by
B(x) =j ⇣ hx
1200
⌘mod h
k, (6.6)
and t is a time delay index (in frames) that is left as a parameter. Note that, the frameswhere a predominant pitch could not be obtained (unvoiced segments) are excludedfrom any calculation. For the size of S we use h = 120. This value corresponds to10 Cents per bin, an optimal pitch resolution reported by Chordia & Sentürk (2013).In that study, the authors show that raga recognition using PCDs with a bin width of10 Cents outperforms the PCDs with a bin width of 100 Cents, and obtains a compar-able results to PCDs with a bin width of 5 Cents.
An example of the generated surface S for a music piece54 in raga Yaman is shownin Figure 6.7 (a). We see that the prominent peaks in the surface correspond to thesvaras of raga Yaman. We notice that these peaks are steep and the dynamic rangeof the surface is high. This can be attributed to the nature of the melodies in thesemusic traditions, particularly in Hindustani music, where the melodies often containlong held svaras. In addition, the dynamic range is high also because the pitches inthe stable svara regions lie within a small frequency range around the mean svarafrequency compared to the pitches in the transitory melodic regions. Because of this,most of the pitch values are mapped to a small number of bins, making the prominentpeaks more steep.
6.3.1.3 Post-processing
Power Compression: In order to accentuate the values corresponding to the trans-itory regions in the melody and reduce the dynamic range of the surface, we apply anelement-wise power compression
S = Sa , (6.7)
where a is an exponent that is left as a parameter. Once a more compact (in terms ofthe dynamic range) surface is obtained, we apply Gaussian smoothing. With that, weattempt to attenuate the subtle differences in S corresponding to the different melodieswithin the same raga, while retaining the attributes that characterize that raga.
54http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af
G+’
G,’
#G+’& G,’
# &
161
EUY P Me P [PUO Da RMO
§ Da RMO 9 M U[
194 AUTOMATIC RAGA RECOGNITION
given recording, we generate a surface S of size h ⇥h by computing
si j =N�1
Ât=t
I (B(pt) , i) I (B(pt�t) , j) (6.4)
for 0 i, j < h , where si j is the (i, j)th element of the two-dimensional matrix S, ptis the t th sample (in Cent-scale) of the pitch sequence of length N, I is an indicatorfunction such that
I(x,y) =
(1, iff x = y0, otherwise
(6.5)
B is an octave-wrapping integer binning operator defined by
B(x) =j ⇣ hx
1200
⌘mod h
k, (6.6)
and t is a time delay index (in frames) that is left as a parameter. Note that, the frameswhere a predominant pitch could not be obtained (unvoiced segments) are excludedfrom any calculation. For the size of S we use h = 120. This value corresponds to10 Cents per bin, an optimal pitch resolution reported by Chordia & Sentürk (2013).In that study, the authors show that raga recognition using PCDs with a bin width of10 Cents outperforms the PCDs with a bin width of 100 Cents, and obtains a compar-able results to PCDs with a bin width of 5 Cents.
An example of the generated surface S for a music piece54 in raga Yaman is shownin Figure 6.7 (a). We see that the prominent peaks in the surface correspond to thesvaras of raga Yaman. We notice that these peaks are steep and the dynamic rangeof the surface is high. This can be attributed to the nature of the melodies in thesemusic traditions, particularly in Hindustani music, where the melodies often containlong held svaras. In addition, the dynamic range is high also because the pitches inthe stable svara regions lie within a small frequency range around the mean svarafrequency compared to the pitches in the transitory melodic regions. Because of this,most of the pitch values are mapped to a small number of bins, making the prominentpeaks more steep.
6.3.1.3 Post-processing
Power Compression: In order to accentuate the values corresponding to the trans-itory regions in the melody and reduce the dynamic range of the surface, we apply anelement-wise power compression
S = Sa , (6.7)
where a is an exponent that is left as a parameter. Once a more compact (in terms ofthe dynamic range) surface is obtained, we apply Gaussian smoothing. With that, weattempt to attenuate the subtle differences in S corresponding to the different melodieswithin the same raga, while retaining the attributes that characterize that raga.
54http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af
G]gW
#WYag
K] Y
G+
G,
194 AUTOMATIC RAGA RECOGNITION
given recording, we generate a surface S of size h ⇥h by computing
si j =N�1
Ât=t
I (B(pt) , i) I (B(pt�t) , j) (6.4)
for 0 i, j < h , where si j is the (i, j)th element of the two-dimensional matrix S, ptis the t th sample (in Cent-scale) of the pitch sequence of length N, I is an indicatorfunction such that
I(x,y) =
(1, iff x = y0, otherwise
(6.5)
B is an octave-wrapping integer binning operator defined by
B(x) =j ⇣ hx
1200
⌘mod h
k, (6.6)
and t is a time delay index (in frames) that is left as a parameter. Note that, the frameswhere a predominant pitch could not be obtained (unvoiced segments) are excludedfrom any calculation. For the size of S we use h = 120. This value corresponds to10 Cents per bin, an optimal pitch resolution reported by Chordia & Sentürk (2013).In that study, the authors show that raga recognition using PCDs with a bin width of10 Cents outperforms the PCDs with a bin width of 100 Cents, and obtains a compar-able results to PCDs with a bin width of 5 Cents.
An example of the generated surface S for a music piece54 in raga Yaman is shownin Figure 6.7 (a). We see that the prominent peaks in the surface correspond to thesvaras of raga Yaman. We notice that these peaks are steep and the dynamic rangeof the surface is high. This can be attributed to the nature of the melodies in thesemusic traditions, particularly in Hindustani music, where the melodies often containlong held svaras. In addition, the dynamic range is high also because the pitches inthe stable svara regions lie within a small frequency range around the mean svarafrequency compared to the pitches in the transitory melodic regions. Because of this,most of the pitch values are mapped to a small number of bins, making the prominentpeaks more steep.
6.3.1.3 Post-processing

Power Compression: In order to accentuate the values corresponding to the transitory regions in the melody and to reduce the dynamic range of the surface, we apply an element-wise power compression

S = S^{\alpha}, \qquad (6.7)

where α is an exponent that is left as a parameter. Once a more compact (in terms of dynamic range) surface is obtained, we apply Gaussian smoothing. With that, we attempt to attenuate the subtle differences in S corresponding to the different melodies within the same raga, while retaining the attributes that characterize that raga.

54 http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af
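The post-processing step (power compression of Eq. 6.7 followed by Gaussian smoothing) can be sketched as below. The exponent alpha and smoothing width sigma are illustrative placeholders, since the thesis leaves both as parameters; the smoothing here is a plain separable Gaussian convolution in NumPy rather than any particular library routine.

```python
import numpy as np

def postprocess(S, alpha=0.5, sigma=1.0):
    """Sketch of the post-processing: element-wise power compression
    (Eq. 6.7) followed by separable Gaussian smoothing.
    alpha and sigma are illustrative, not values from the thesis."""
    # Eq. 6.7: compress the dynamic range element-wise
    S = np.power(S, alpha)
    # Build a truncated, normalized 1-D Gaussian kernel
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    # Smooth rows, then columns ('same' keeps the surface size)
    S = np.apply_along_axis(np.convolve, 1, S, kernel, mode="same")
    S = np.apply_along_axis(np.convolve, 0, S, kernel, mode="same")
    return S
```

With alpha < 1 the compression boosts the small values contributed by transitory melodic regions relative to the steep peaks at the stable svaras, and the smoothing then blurs out melody-specific detail while preserving the coarse raga-characteristic shape.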
[Figure 6.7: Generated melody surfaces for the example piece, panels (a) and (b); both axes show the bin index (0–120), with a color scale from 0.0 to 1.0.]
194 AUTOMATIC RAGA RECOGNITION
given recording, we generate a surface S of size h ⇥h by computing
si j =N�1
Ât=t
I (B(pt) , i) I (B(pt�t) , j) (6.4)
for 0 i, j < h , where si j is the (i, j)th element of the two-dimensional matrix S, ptis the t th sample (in Cent-scale) of the pitch sequence of length N, I is an indicatorfunction such that
I(x,y) =
(1, iff x = y0, otherwise
(6.5)
B is an octave-wrapping integer binning operator defined by
B(x) =j ⇣ hx
1200
⌘mod h
k, (6.6)
and t is a time delay index (in frames) that is left as a parameter. Note that, the frameswhere a predominant pitch could not be obtained (unvoiced segments) are excludedfrom any calculation. For the size of S we use h = 120. This value corresponds to10 Cents per bin, an optimal pitch resolution reported by Chordia & Sentürk (2013).In that study, the authors show that raga recognition using PCDs with a bin width of10 Cents outperforms the PCDs with a bin width of 100 Cents, and obtains a compar-able results to PCDs with a bin width of 5 Cents.
An example of the generated surface S for a music piece54 in raga Yaman is shownin Figure 6.7 (a). We see that the prominent peaks in the surface correspond to thesvaras of raga Yaman. We notice that these peaks are steep and the dynamic rangeof the surface is high. This can be attributed to the nature of the melodies in thesemusic traditions, particularly in Hindustani music, where the melodies often containlong held svaras. In addition, the dynamic range is high also because the pitches inthe stable svara regions lie within a small frequency range around the mean svarafrequency compared to the pitches in the transitory melodic regions. Because of this,most of the pitch values are mapped to a small number of bins, making the prominentpeaks more steep.
6.3.1.3 Post-processing
Power Compression: In order to accentuate the values corresponding to the transitory regions in the melody and reduce the dynamic range of the surface, we apply an element-wise power compression

$$S = S^{\alpha}, \qquad (6.7)$$

where α is an exponent that is left as a parameter. Once a more compact (in terms of dynamic range) surface is obtained, we apply Gaussian smoothing. With that, we attempt to attenuate the subtle differences in S corresponding to the different melodies within the same rāga, while retaining the attributes that characterize that rāga.
54http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af
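The post-processing described above can be sketched as below. The default α, the smoothing width σ, and the final sum-to-one normalization are illustrative assumptions; the thesis tunes these values experimentally.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def postprocess_surface(S, alpha=0.75, sigma=2.0):
    """Sketch of the post-processing of a time-delayed melody surface.

    alpha: exponent of the element-wise power compression (Eq. 6.7);
           values below 1 reduce the dynamic range and accentuate the
           values in the transitory melodic regions.
    sigma: width (in bins) of the Gaussian smoothing, which attenuates
           melody-specific detail while retaining raga-level structure.
    """
    S = np.power(S, alpha)                       # Eq. 6.7: S <- S^alpha
    S = gaussian_filter(S, sigma, mode='wrap')   # wrap: both axes are octave-wrapped
    return S / S.sum() if S.sum() > 0 else S     # normalize to unit sum (assumption)
```

The `mode='wrap'` choice reflects that both axes of S index octave-wrapped pitch bins, so smoothing should carry over the bin boundaries.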
[Slide 166: Time-Delayed Melody Surface. Post-processing: power compression, Gaussian smoothing, normalization. Figure panels (a) and (b), axes Index 0 to 120, color scale 0.0 to 1.0]
[Slides 167 to 172: Time-Delayed Melodic Surface, repeated figure panels (a) and (b), axes Index 0 to 120, color scale 0.0 to 1.0; audio example: http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af]
Results: Summary

[Slide: bar chart comparing accuracies on the Carnatic and Hindustani datasets of the proposed TDMS method against PCD-based approaches (Chordia & Şentürk, 2013; Koduri et al.) and a phrase-based method with TF-IDF weighting]
• Gulati, S., Serrà, J., Ganguli, K. K., Şentürk, S., & Serra, X. (2016). Time-delayed melody surfaces for rāga recognition. In ISMIR, pp. 751–757.
Effect of Datasets

[Slides: accuracy (%) as a function of the number of rāgas, from 5 to 40 for Carnatic and from 5 to 30 for Hindustani; y-axis range 0.80 to 1.00]
Applications
Chapter 7 Applications
7.1 Introduction
The research work presented in this thesis has several applications. A number of these applications have already been described in Section 1.1. While some applications such as rāga-based music retrieval and music discovery are relevant mainly in the context of large audio collections, several applications can be developed at the scale of the music collection compiled in the CompMusic project. In this chapter, we present some concrete examples of such applications that have already incorporated parts of our work. We provide a brief overview of Dunya, a collection of the music corpora and software tools developed during the CompMusic project, and of Saraga and Riyaz, the mobile applications developed within the technology transfer project Culture Aware MUsic Technologies (CAMUT). In addition, we present three web demos that showcase parts of the outcomes of our computational methods. We also briefly present one of our recent studies that performs a musicologically motivated exploration of melodic structures in IAM. It serves as an example of how our methods can facilitate investigations in computational musicology.
7.2 Dunya
Dunya⁵⁵ comprises the music corpora and the software tools that have been developed as part of the CompMusic project. It includes data for five music traditions: Hindustani, Carnatic, Turkish Makam, Jingju, and Andalusian music. By providing access to the data gathered in the project (such as audio recordings and metadata) and the data generated in it (such as audio descriptors), Dunya aims to facilitate the study and exploration of relevant musical aspects of different music repertoires. The CompMusic corpora mainly comprise audio recordings and complementary information that describes those recordings. This complementary information can be in the form of the
55http://dunya.compmusic.upf.edu/
[Slide: Applications]
608 Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016
[Slides: Dunya; Dunya Web Interface (http://dunya.compmusic.upf.edu); Computational Musicology]
Ganguli, K. K., Gulati, S., Serra, X., & Rao, P. (2016). Data-driven exploration of melodic structures in Hindustani music. In ISMIR, pp. 605–611.
[Slides: Demo, Melodic Patterns (http://dunya.compmusic.upf.edu/motifdiscovery/ and http://dunya.compmusic.upf.edu/pattern_network/); Rāgawise (https://dunya.compmusic.upf.edu/ragawise/)]
[Slides: Sarāga and Riyāz mobile applications, with screenshots and user/session statistics]
Summary and Future Perspectives
Chapter 8 Summary and Future Perspectives
8.1 Introduction

In this thesis, we have presented a number of computational approaches for analyzing melodic elements at different levels of melodic organization in IAM. In tandem, these approaches generate a high-level melodic description of the audio collections in this music tradition. They build upon each other to finally allow us to achieve the goals that we set at the beginning of this thesis, which are:
To curate and structure representative music corpora of IAM that comprise audio recordings and the associated metadata, and to use them to compile sizable and well-annotated test datasets for melodic analyses.

To develop data-driven computational approaches for the discovery and characterization of musically relevant melodic patterns in sizable audio collections of IAM.

To devise computational approaches for automatically recognizing rāgas in recorded performances of IAM.
Based on the results presented in this thesis, we can now say that our goals are successfully met. We started this thesis by presenting our motivation behind the analysis and description of melodies in IAM, highlighting the opportunities and challenges that this music tradition offers within the context of music information research and the CompMusic project (Chapter 1). We provided a brief introduction to IAM and its melodic organization, and critically reviewed the existing literature on related topics within the context of MIR and IAM (Chapter 2).

In Chapter 3, we provided a comprehensive overview of the music corpora and datasets that were compiled and structured in our work as a part of the CompMusic project. To the best of our knowledge, these corpora comprise the largest audio collections of IAM along with curated metadata and automatically extracted music
Contributions: Corpora and Datasets
• Indian art music corpora
• Datasets:
  • Tonic identification datasets
  • Nyās dataset
  • Improving melodic similarity dataset
  • Rāga recognition datasets (Hindustani and Carnatic)
http://compmusic.upf.edu/node/…
Contributions: Scientific
• Review of existing approaches
• Tonic identification: comprehensive evaluation
• Nyās detection
• Melodic similarity: comprehensive evaluation
• Improving melodic similarity
• Melodic pattern discovery
• Melodic pattern characterization
• Rāga recognition:
  • Phrase-based method
  • Time-delayed melodic surface based method
http://compmusic.upf.edu/node/…
Contributions: Technical
• Tonic identification integration
• Web demos for sharing discovered patterns
• Rāgawise (pitch analysis in JavaScript)
• Sarāga and Riyāz
Future Perspectives
• Combining supervised and unsupervised approaches
• Studying melodic segmentation
• Refining phrase boundaries
• Combining multiple features for rāga recognition
• Study of the evolution of music
• Ontological framework for music description systems
MusicMuni Labs
Publications by the Author
• Peer-reviewed journal articles:
• Gulati, S., Bellur, A., Salamon, J., Ranjani, H. G., Ishwar, V., Murthy, H. A., & Serra, X. (2014). Automatic tonic identification in Indian art music: approaches and evaluation. JNMR, 43(1), 53–71.
• Koduri, G. K., Gulati, S., Rao, P., & Serra, X. (2012). Rāga recognition based on pitch distribution methods. JNMR, 41(4), 337–350.
• Full articles in peer-reviewed conferences:
• Gulati, S., Serrà, J., Ganguli, K. K., Şentürk, S., & Serra, X. (2016). Time-delayed melody surfaces for rāga recognition. In ISMIR, pp. 751–757.
• Ganguli, K. K., Gulati, S., Serra, X., & Rao, P. (2016). Data-driven exploration of melodic structures in Hindustani music. In ISMIR, pp. 605–611.
• Gulati, S., Serrà, J., Ishwar, V., Şentürk, S., & Serra, X. (2016). Phrase-based rāga recognition using vector space modeling. In ICASSP, pp. 66–70.
• Gulati, S., Serrà, J., Ishwar, V., & Serra, X. (2016). Discovering rāga motifs by characterizing communities in networks of melodic patterns. In ICASSP, pp. 286–290.
• Gulati, S., Serrà, J., & Serra, X. (2015). Improving melodic similarity in Indian art music using culture-specific melodic characteristics. In ISMIR, pp. 680–686.
• Gulati, S., Serrà, J., & Serra, X. (2015). An evaluation of methodologies for melodic similarity in audio recordings of Indian art music. In ICASSP, pp. 678–682.
• Gulati, S., Serrà, J., Ishwar, V., & Serra, X. (2014). Mining melodic patterns in large audio collections of Indian art music. In SITIS-MIRA, pp. 264–271.
• Gulati, S., Serrà, J., Ganguli, K. K., & Serra, X. (2014). Landmark detection in Hindustani music melodies. In ICMC-SMC, pp. 1062–1068.
• Srinivasamurthy, A., Koduri, G. K., Gulati, S., Ishwar, V., & Serra, X. (2014). Corpora for music information research in Indian art music. In ICMC-SMC, pp. 1029–1036.
• Şentürk, S., Gulati, S., & Serra, X. (2014). Towards alignment of score and audio recordings of Ottoman-Turkish makam music. In FMA.
• Bogdanov, D., Wack, N., Gómez, E., Gulati, S., Herrera, P., Mayor, O., Roma, G., Salamon, J., Zapata, J., & Serra, X. (2013). Essentia: an audio analysis library for music information retrieval. In ISMIR, pp. 493–498.
• Bogdanov, D., Wack, N., Gómez, E., Gulati, S., Herrera, P., Mayor, O., Roma, G., Salamon, J., Zapata, J., & Serra, X. (2013). ESSENTIA: an open-source library for sound and music analysis. In ACM Multimedia, pp. 855–858.
• Şentürk, S., Gulati, S., & Serra, X. (2013). Score informed tonic identification for makam music of Turkey. In ISMIR, pp. 175–180.
• Sordo, M., Koduri, G. K., Şentürk, S., Gulati, S., & Serra, X. (2012). A musically aware system for browsing and interacting with audio music collections. In CompMusic Workshop.
• Koduri, G. K., Gulati, S., & Rao, P. (2011). A survey of rāga recognition techniques and improvements to the state-of-the-art. In SMC.
• Other contributions to conferences:
• Gulati, S., Serrà, J., Ishwar, V., & Serra, X. (2014). Melodic Pattern Extraction in Large Collections of Music Recordings Using Time Series Mining Techniques. In Late-Breaking Demo Session of ISMIR.
• Gulati, S., Ganguli, K. K., Gupta, S., Srinivasamurthy, A., & Serra, X. (2015). Rāgawise: A Lightweight Real-time Raga Recognition System for Indian Art Music. In Late-Breaking Demo Session of ISMIR.
• Caro, R., Srinivasamurthy, A., Gulati, S., & Serra, X. (2014). Jingju Music: Concepts and Computational Tools for its Analysis. A tutorial at ISMIR.