MANIPULATION, ANALYSIS AND RETRIEVAL
SYSTEMS FOR AUDIO SIGNALS
GEORGE TZANETAKIS
A DISSERTATION
PRESENTED TO THE FACULTY
OF PRINCETON UNIVERSITY
IN CANDIDACY FOR THE DEGREE
OF DOCTOR OF PHILOSOPHY
RECOMMENDED FOR ACCEPTANCE
BY THE DEPARTMENT OF
COMPUTER SCIENCE
JUNE 2002
© Copyright by George Tzanetakis, 2002.
All Rights Reserved
To my grandfather Giorgos Tzanetakis whose passion for knowledge and love for all living things will always be an inspiration for me.

To my parents Panos and Anna and my sister Flora for their support and love.
Abstract

Digital audio and especially music collections are becoming a major part of the average computer user experience. Large digital audio collections of sound effects are also used by the movie and animation industry. Research areas that utilize large audio collections include Auditory Display, Bioacoustics, Computer Music, Forensics, and Music Cognition.

In order to develop more sophisticated tools for interacting with large digital audio collections, research in Computer Audition algorithms and user interfaces is required. In this work a series of systems for manipulating, retrieving from, and analyzing large collections of audio signals will be described. The foundation of these systems is the design of new and the application of existing algorithms for automatic audio content analysis. The results of the analysis are used to build novel 2D and 3D graphical user interfaces for browsing and interacting with audio signals and collections. The proposed systems are based on techniques from the fields of Signal Processing, Pattern Recognition, Information Retrieval, Visualization and Human Computer Interaction. All the proposed algorithms and interfaces are integrated under MARSYAS, a free software framework designed for rapid prototyping of computer audition research. In most cases the proposed algorithms have been evaluated and informed by conducting user studies.

New contributions of this work to the area of Computer Audition include: a general multifeature audio texture segmentation methodology, feature extraction from mp3 compressed data, automatic beat detection and analysis based on the Discrete Wavelet Transform, and musical genre classification combining timbral, rhythmic and harmonic features. Novel graphical user interfaces developed in this work include various tools for browsing and visualizing large audio collections such as the Timbregram, TimbreSpace, GenreGram, and Enhanced Sound Editor.
Acknowledgments

First of all I would like to thank Perry Cook for being such a great advisor. Perry has always been supportive of my work and a constant source of new and interesting ideas. Large parts of the research described in this thesis were written after Perry said "I have a question for you. How hard would it be to do ...". In addition I would like to thank the members of my thesis committee: Andrea LaPaugh, Ken Steiglitz, Tom Funkhouser and Paul Lansky, who provided valuable feedback and support, especially during the last difficult months of looking for a job and finishing my PhD.

Kimon has motivated me to walk and run in the woods, clearing my brain as I follow his wagging tail. Emily provided support, soups and stews. Lujo Bauer, Georg Essl, Neophytos Michael, Kedar Swadi, and all the other graduate students at the Computer Science department have been great friends and helped me many times in many different ways.

During my years in graduate school it was my pleasure to collaborate with many undergraduate and graduate students doing projects related to my research, in many cases using MARSYAS, the computer audition software framework I have developed. These include John Forsyth, Fei Fei Li, Corrie Elder, Doug Turnbull, Anoop Gupta, David Greco, George Tourtellot, Andreye Ermolinskyi, and Ari Lazier.

The two summer internships I did at the Computer Human Interaction Center (CHIC) at SRI International and at Moodlogic were great learning experiences and provided me with valuable knowledge. I would like to especially thank Luc Julia and Rehan Khan for being great people to work with.

Finally, research costs money, and I was fortunate enough to have support from the following funding sources: NSF grant 9984087, New Jersey Commission on Science and Technology grant 01-2042-007-22, Intel and the Arial Foundation.
Contents

Abstract

1 Introduction
  1.1 Motivation and Applications
  1.2 The main problem
  1.3 Current state
  1.4 Relevant Research Areas
  1.5 Guiding principles and observations
  1.6 Overview
  1.7 Related work
  1.8 Main Contributions
  1.9 A short tour

2 Representation
  2.1 Introduction
  2.2 Related Work
    2.2.1 Contributions
  2.3 Spectral Shape Features
    2.3.1 STFT-based features
    2.3.2 Mel-Frequency Cepstral Coefficients
    2.3.3 MPEG-based features
  2.4 Representing sound "texture"
    2.4.1 Analysis and texture windows
    2.4.2 Low Energy Feature
  2.5 Wavelet transform features
    2.5.1 The Discrete Wavelet Transform
    2.5.2 Wavelet features
  2.6 Other features
  2.7 Rhythmic Content Features
    2.7.1 Beat Histograms
    2.7.2 Features
  2.8 Pitch Content Features
    2.8.1 Pitch Histograms
    2.8.2 Features
  2.9 Musical Content Features
  2.10 Summary

3 Analysis
  3.1 Introduction
  3.2 Related Work
    3.2.1 Contributions
  3.3 Query-by-example content-based similarity retrieval
  3.4 Classification
    3.4.1 Classifiers
    3.4.2 Classification schemes
    3.4.3 Voices
  3.5 Segmentation
    3.5.1 Motivation
    3.5.2 Methodology
    3.5.3 Segmentation Features
  3.6 Audio thumbnailing
  3.7 Integration
  3.8 Summary

4 Interaction
  4.1 Introduction
  4.2 Related Work
    4.2.1 Beyond the Query-by-Example paradigm
    4.2.2 Contributions
  4.3 Terminology
  4.4 Browsers
    4.4.1 Timbrespace Browser
    4.4.2 Timbregram viewer and browser
    4.4.3 Other Browsers
  4.5 Enhanced Audio Editor
  4.6 Monitors
    4.6.1 GenreGram
    4.6.2 TimbreBall monitor
    4.6.3 Other monitors
  4.7 Audio-based Query Interfaces
    4.7.1 Sound Sliders
    4.7.2 Sound palettes
    4.7.3 Loops
    4.7.4 Music Mosaics
    4.7.5 3D sound effects generators
  4.8 MIDI-based Query Interfaces
    4.8.1 The groovebox
    4.8.2 The tonal knob
    4.8.3 Style machines
  4.9 Integration
  4.10 The Princeton Display Wall
  4.11 Summary

5 Evaluation
  5.1 Introduction
  5.2 Related Work
    5.2.1 Contributions
  5.3 Collections
  5.4 Classification evaluation
    5.4.1 Cross-Validation
  5.5 Segmentation experiments
  5.6 MPEG-based features
    5.6.1 Music/Speech classification
    5.6.2 Segmentation evaluation
  5.7 DWT-based features
  5.8 Similarity retrieval evaluation
  5.9 Automatic Musical Genre Classification
    5.9.1 Results
    5.9.2 Human performance for genre classification
  5.10 Other experimental results
    5.10.1 Audio Thumbnailing
    5.10.2 Annotation
    5.10.3 Beat studies
  5.11 Summary

6 Implementation
  6.1 Introduction
  6.2 Related Work
  6.3 Architecture
    6.3.1 Computation Server
    6.3.2 User Interface Client - Interaction
  6.4 Usage
  6.5 Characteristics

7 Conclusions
  7.1 Discussion
  7.2 Current Work
  7.3 Future work
    7.3.1 Other directions
  7.4 Final Thoughts

A
  A.1 Introduction
  A.2 Short Time Fourier Transform
  A.3 Discrete Wavelet Transform
  A.4 Mel-Frequency Cepstral Coefficients
  A.5 Multiple Pitch Detection
  A.6 Gaussian classifier
  A.7 GMM classifier and the EM algorithm
  A.8 K-Means
  A.9 KNN classifier
  A.10 Principal Component Analysis

B
  B.1 Glossary

C
  C.1 Publications

D
  D.1 Web Resources
List of Figures

1.1 Relevant Research Areas
1.2 Computer Audition Pipeline

2.1 Magnitude Spectrum
2.2 Sound textures of news broadcast
2.3 Beat Histogram Calculation Flow Diagram
2.4 Beat Histogram Example
2.5 Beat Histogram Examples

3.1 Musical Genres Hierarchy
3.2 Classification Schemes Accuracy
3.3 Feature-based segmentation and classification

4.1 2D TimbreSpace of Sound Effects
4.2 3D TimbreSpace of Sound Effects
4.3 3D TimbreSpace of orchestral music
4.4 Timbregram superimposed over waveform
4.5 Timbregrams of music and speech
4.6 Timbregrams of classical, rock and speech
4.7 Timbregrams of orchestral music
4.8 Enhanced Audio Editor I
4.9 Enhanced Audio Editor II
4.10 Timbreball and GenreGram monitors
4.11 Sound sliders and palettes
4.12 3D sound effects generators
4.13 Groovebox
4.14 Indian music style machine
4.15 Integration of interfaces
4.16 Marsyas on the Display Wall

5.1 Musical Genres Hierarchy
5.2 Histograms of subject agreement and learning curve for mean segment completion time
5.3 1,2,3 and 4,5,6 are the means and variances of MPEG Centroid, Rolloff and Flux. Low Energy is 7.
5.4 DWT-based feature evaluation based on audio classification
5.5 Web-based Retrieval User Evaluation
5.6 Results of User Evaluation
5.7 Classification accuracy percentages (RND = random, RT = real time, WF = whole file)
5.8 Effect of texture window size on classification accuracy
5.9 Synthetic Beat Patterns (CY = cymbal, HH = hi-hat, SD = snare drum, BD = bass drum, WB = woodblock, HT/MT/LT = tom-toms)

6.1 Music Speech Discriminator Architectural Layout
6.2 Segmentation peaks and corresponding time tree
6.3 Marsyas User Interface Component Architecture

A.1 Discrete Wavelet Transform Pyramidal Algorithm Schematic
A.2 Summing of FFT bins for MFCC filterbank calculation
A.3 Multiple Pitch Analysis Block Diagram
A.4 Schematic of Principal Component extraction
Chapter 1

Introduction
The process of preparing programs for a digital computer is especially attractive, not only because it can be economically and scientifically rewarding, but also because it can be an aesthetic experience much like composing poetry or music.

Donald E. Knuth, The Art of Computer Programming
(especially when you write programs to analyze audio signals)
1.1 Motivation and Applications
Developments in hardware, compression technology and network connectivity have made digital audio and especially music a major part of the everyday computer user experience and a significant portion of web traffic. The media and legal attention that companies like Napster have recently received testifies to this increasing importance of digital audio. It is likely that in the near future the majority of recorded music in human history will be available digitally. As an indication of the magnitude of such a collection, the number of CD tracks currently in retail is approximately 3 million. It is still unclear what the business models for this brave new music world will be. However, one thing that is certain is the need to interact effectively with increasingly large digital audio collections.
In addition to digital music distribution, the need to interact with large audio collections arises in other application areas as well. Most of the movies produced today use a large number of sounds taken from large libraries of sound effects (for example the BBC Sound Effects Library contains 60 compact disks). Whenever we hear a door closing, a gunshot or a telephone ringing in a movie it almost always originates from one of those libraries. Even larger numbers of sound effects are used in the animation industry. The importance and usage of audio in computer games is also increasing. Moreover, composers, DJs, and arrangers frequently utilize collections of sounds to create musical works. These sounds typically originate from synthesizers, samplers and musical instruments recorded live.
In addition to applications related to the entertainment industry, the need to interact with large audio collections arises in many scientific disciplines. Auditory Display (http://www.icad.org) is a new research area that focuses on the use of non-speech audio to convey information. Specific areas include the sonification of scientific data, spatial sound, and the use of auditory icons in user interfaces. Responding to parameters of analyzed audio can also be used by interactive algorithms performing animation or sound synthesis in virtual reality simulations and computer games. Audio analysis tools also have applications in Forensics, which is the use of science and technology to investigate and establish facts in criminal or civil courts of law. For example the make of a particular car or the identity of a speaking person can be automatically identified from a recording. Other fields where working with audio collections is important are Bioacoustics, Psychoacoustics and Music Cognition. Bioacoustics is the study of sounds produced by animals. Psychoacoustics studies the perception of different types of sound by animals of the human species, and Music Cognition looks specifically at how we perceive and understand music.

This indicative but by no means exhaustive list of application areas from industry and academia shows the need for tools to manipulate, analyze and retrieve audio signals from large digital audio collections.
1.2 The main problem
To summarize, the main problem that this work tries to address is the need to manipulate, analyze and retrieve from large digital audio collections, with special emphasis on musical signals. In order to achieve this goal a series of existing and novel automatic computer audition algorithms were designed, developed, evaluated, and applied directly to create novel content and context aware audio graphical user interfaces for interacting with large audio collections.
1.3 Current state
One of the main limitations of current multimedia software systems is that they have very little understanding of the data they process and store. For example an audio editing program represents audio as an unstructured monolithic block of digital samples. In order to develop new effective tools for interacting with audio collections it is important to have information about the content and context of audio signals. Content refers to any information about an audio signal and its structure. Some examples of content information are: knowing that a specific section of a jazz music piece corresponds to the saxophone solo, identifying a bird from its song, finding sounds similar to a particular walking sound in a collection of sound effects for movies. Context refers to information about an audio signal that is dependent upon a larger collection to which it belongs. As an example of context information, in a large collection of all musical styles two female-singer Jazz pieces would be similar, whereas in a collection of vocal Jazz they might be quite dissimilar. Their similarity is therefore context information, as it depends on the particular audio collection. Another example of context information is the notion of a "fast" musical piece, which is different for different musical genres and therefore depends on the particular context.
Obviously, in order to achieve this sensitivity to the content and context of audio collections, more high level information than what is currently used is required. The most common approach to address this problem has been to manually annotate audio signals with additional information (for example filenames and ID3 tags of mp3 files). However this approach has several limitations: it requires significant user time (there are commercial companies that pay full-time experts to perform this annotation), does not scale well, and can only provide specific types of information that are typically categorical in nature such as music genre, mood, or energy level. In addition, the maintenance of all this information requires significant infrastructure and quality assurance in order to ensure that the annotation results are consistent and accurate. Therefore algorithms and techniques that are capable of extracting such information automatically are desired. In this thesis, a variety of such algorithms and techniques will be proposed and used to create novel tools for structuring and interacting with large audio collections.
1.4 Relevant Research Areas
This work falls under the general research area of Computer Audition (CA). The term Computer Audition is used in a similar fashion to the term Computer Vision and it includes any process of extracting information from audio signals. Other terms used in the literature are Machine Listening and Auditory Scene Analysis. (I prefer the term Computer Audition because, being more general, it is applicable to a broader set of problems. Machine Listening implies a high level perceptual activity, whereas the term Auditory Scene Analysis refers to the recognition of different sound-producing objects in a scene. Although there is no common agreement about the terminology, I consider these two other terms subtopics of Computer Audition.)

[Figure 1.1: Relevant Research Areas — a diagram connecting Computer Audition to Signal Processing, Information Retrieval, Machine Learning, Perception, Computer Vision, Visualization, and Human Computer Interaction.]

Although Speech Recognition can be considered part of CA, for the purposes of this thesis CA will refer to techniques that, although they may process speech signals, do not perform any language modeling. For example, speaker/singer identification and music/speech discrimination will be included, as they can be performed by a human who might not know the language spoken, but Speech Recognition and Speech Synthesis will not be covered as they rely on language models. Although research in aspects of Computer Audition has existed for a long time, only recently (after 1996) has there been significant research output. (My graduate studies began in 1997 and therefore I have been very lucky, as I was able to track most of the new research in Computer Audition and actively participate in its evolution.)
Some of the reasons for this late interest are:

• Developments in hardware that made computations over large collections of audio signals feasible

• Audio compression such as the MPEG audio compression standard (.mp3 files), which made digital music distribution over the internet a reality

• A shift from symbolic methods and small "toy" examples to statistical methods and large collections of "real world" audio signals (also a trend in the general Artificial Intelligence community)

Despite the fact that Computer Audition is a relatively recent research area still in its infancy, there are many existing well-established research areas where interesting ideas and tools can be found and adapted for the purpose of extracting information from audio signals.

Figure 1.1 shows the main research areas that are relevant to this work. The thickness of the lines is indicative of the importance of each area to this particular work rather
than to the area of Computer Audition in general. Audio and especially music signals are complex, data-intensive, time-varying signals with unique characteristics that pose new challenges to these relevant research areas. Therefore, research in CA provides a good way to test, evaluate and create new algorithms and tools in all those research areas. This reciprocal relation is indicated in Figure 1.1 with the use of connecting arrows pointing in both directions. In the following paragraphs short descriptions of these research areas and their connection to Computer Audition are provided. In addition, a representative reference textbook (by no means the only one or the best one) for each specific area is provided.
Digital Signal Processing [97] is the computer manipulation of analog signals (commonly sounds or images) which have been converted to digital form (sampled). The various techniques of Time-Frequency analysis developed for processing audio signals, in many cases originally developed for Speech Processing and Recognition [98], are of special importance for this work.
Machine Learning techniques allow a machine to improve its performance based on previous results and training by humans. More specifically, pattern recognition machine learning techniques [30] aim to classify data (patterns) based on either a priori knowledge or on statistical information extracted from the patterns. The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space.
Computer Vision [9] is a term used to describe any algorithm and system that extracts information from visual signals such as image and video. Of specific interest to this work is research in video analysis, as both video and audio are temporal signals.
Human Computer Interaction (HCI) research is becoming an increasingly important area of Computer Science research [114]. Graphical user interfaces (GUIs) enable the creation of powerful tools that combine the best of human and machine capabilities. Of specific interest to this work is the field of Information Visualization [120, 34], which deals with the visual presentation of various types of information in order to better understand its structure and characteristics.
Information Retrieval [137], familiar from popular web search engines such as AltaVista (www.altavista.com) and Google (www.google.com), refers to techniques for searching and retrieving text documents based on user queries. More recently, techniques for Multimedia Information Retrieval such as content-based image retrieval systems [37] have started appearing.
Last but not least is Perception research. The human perceptual system has unique characteristics and abilities that can be taken into account to develop more effective computer systems for processing sensory data. For example, both in images and sound, knowledge of the properties of the human perceptual system has led to powerful compression methods such as JPEG (for images) and MPEG (for video and audio). A good introduction to psychoacoustics (the perception of sounds) using computers is [21].
1.5 Guiding principles and observations
Given the current state of tools for interacting with large audio collections, several directions for improvement are evident. Based on these points, the following guiding principles characterize this work and differentiate it from the majority of existing research:
• Emphasis on fundamental problems and "real world" signals

Many existing systems are too ambitious in the sense that they solve some hard problem such as polyphonic transcription but only for very limited, artificially constructed signals ("toy problems"). In contrast, in this work the emphasis is on solving fundamental problems that are applicable to arbitrary "real world" audio signals such as music pieces stored as mp3 files.
• Develop new and improve existing computer audition algorithms

In order to design and develop new algorithms in any field it is important to first understand, implement, compare and improve existing algorithms. Therefore we have tried to include and improve many existing CA algorithms in addition to the newly designed algorithms.
• Integration of different analysis techniques

A large percentage of existing work in CA deals with only a specific analysis problem. However, in many cases the results of different analysis techniques can be integrated, improving the resulting analysis. For example, segmentation can be used to reduce classification errors. A consequence of this desire for integration has been the creation of a general and flexible software infrastructure for CA research rather than solving each small individual problem separately.
• Content and context aware user interfaces based on automatic analysis

Existing CA work emphasizes automatic algorithms and is rarely concerned with how users can interact with these algorithms. Human computer interaction is a very important aspect in the design of an effective system. In this thesis our goal is to combine the abilities of humans and computers by creating novel user interfaces to interact with audio signals and collections. Unlike existing tools, these interfaces are aware of content and context information about the audio signals they present, and this information is extracted using automatic analysis algorithms.
• Interacting with large collections rather than individual files

In both the developed algorithms and interfaces the emphasis is on working with large collections rather than individual files. So, for example, a large part of the work in interface design is concerned with browsing large collections, while the automatic analysis is heavily based on statistical learning techniques that utilize large datasets for training.
• Automatic and user-based evaluation

Evaluation of systems that try to imitate human sensory perception is difficult because there is no single well-defined correct answer. However, it is feasible, and it is important for the design and development of effective CA algorithms and systems. Because CA is a relatively recent and still developing field, many existing systems are proposed without detailed evaluation (for example, only a few examples where the algorithm works are presented). In this thesis an effort has been made to evaluate the proposed techniques statistically using large datasets, and with user experiments when that is required.
• Real-time performance

Computer audition computations are numerically intensive and stress current hardware to its limits. In order to have the desired real-time performance and interactivity in the user interfaces, careful choices have to be made both at the algorithmic and the implementation level. Unlike certain proposed algorithms that take hours to analyze a minute of audio, our emphasis has been on fast algorithms and efficient implementation that enable real-time feature extraction, analysis and interaction.
Although some of these design guidelines have been used in several proposed systems, to the best of our knowledge no existing system has combined all of them. In addition, the following simple, informal but important observations are, in some way or another, behind many of the ideas described in this thesis.
• Automatic techniques with errors can be made useful.

Automatic audio analysis techniques are still far from perfect and in many cases have large percentages of errors. However, their results can be significantly improved using various techniques such as statistics, integration of different analysis techniques, user interfaces, and utilizing the errors as a source of information. For example, although short-time automatic beat detection is not accurate, a good representation of the overall beat of a song can be collected using statistics (see Beat Histograms, Chapter 2). In addition, the failure of the beat detection is itself a source of information for musical genre classification (for example, beat detection works better for rock and rap music than for classical or ambient). Although this observation is relatively simple, it leads to many interesting ideas that will be explored in the following chapters.
• Avoid solving a simple problem by trying to solve a harder one

Although this observation sounds trivial, it is not taken into account in a lot of existing work. For example, a lot of work in audio segmentation assumes an initial classification. However, classification is a harder problem as it involves learning, both in humans and machines (there is no easy way to prove such an assertion, and therefore some people might disagree; however, it will be shown that automatic segmentation without classification is feasible). Another example is the use of polyphonic transcription, which is still an unsolved problem, for the purposes of retrieval and classification, which can be solved using statistical methods.
• Use the software you write

In order to evaluate the algorithms and tools you develop it is important to use them, and the most easily accessible user is yourself. An effort has been made in this work to develop working prototypes that are subsequently used to evaluate and improve the developed algorithms, creating that way a self-improving feedback loop.
The way these guiding principles and observations influence the ideas described in this thesis will become clearer in the following chapters, which describe in detail the developed algorithms and systems.
1.6 Overview
As in every PhD thesis, my primary goal is to present my research work in a clear and accurate manner. In addition, I would also like my thesis to serve as a good introduction to the field of Computer Audition in general and its current state of the art. Because most of the research is relatively recent and I have implemented many of the existing algorithms proposed in the literature, I hope that my goal is achievable. Ideally I would like this thesis to be a good starting point for someone entering the field of Computer Audition. Another side-effect of this secondary goal is the extended bibliography. Unlike more mature fields where there are good survey papers that can serve as starting points, Computer Audition is a relatively new field and it is possible (and I have attempted to do so) to cite the majority of directly relevant publications.
Toward this goal the structure of this thesis follows a form similar to a hypothetical Computer Audition textbook. For presentation purposes I like to divide Computer Audition into the following stages:
• Representation - Hearing

Audio signals are data-intensive time-domain signals and are stored (in their basic uncompressed form) as series of numbers corresponding to the amplitude of the signal over time. Although this representation is adequate for transmission and reproduction of arbitrary waveforms, it is not appropriate for analyzing and understanding audio signals. The way we perceive and understand audio signals as humans is based on and constrained by our auditory system. It is well known that the early stages of the human auditory system (HAS) decompose the incoming sound wave into different frequency bands. In a similar fashion, in Computer Audition, Time-Frequency analysis techniques are frequently used for representing audio signals. The representation stage of the Computer Audition pipeline refers to any algorithms and systems that take as input simple time-domain audio signals and create a more compact, information-bearing representation. The most relevant research area to this stage is Signal Processing.
• Analysis - Understanding

Once a good representation is extracted from the audio signal, various types of automatic analysis can be performed. These include similarity retrieval, classification, clustering, segmentation, and thumbnailing. These higher-level types of analysis typically involve memory and learning, both for humans and machines. Therefore Machine Learning algorithms and techniques are important for this stage.
• Interaction - Acting

When the signal is represented and analyzed by the system, the user must be presented with the necessary information and act according to it. The algorithms and systems of this stage are influenced by ideas from Human Computer Interaction and deal with how information about audio signals and collections can be presented to the user and what types of controls for handling this information are provided.
This division into stages will be used to structure this thesis. It is important to emphasize that the boundaries between these stages are not clear cut. For example, a very complete representation (such as a musical score) makes analysis easier than a less detailed representation (such as feature vectors). However, feature vectors can be easily calculated, while the automatic extraction of musical scores (polyphonic transcription) is still an unsolved problem. Although these stages are presented separately for clarity, in any complete system they are all integrated. Additionally, the design of algorithms and systems in each stage has to be informed and constrained by the design choices for the other stages. This becomes evident in evaluation and implementation, which typically cannot be applied to each stage separately but have to take all of them into account. Figure 1.2 shows the stages of the Computer Audition Pipeline and representative examples of algorithms and systems for each stage. Although in most existing systems information typically flows bottom-up, both directions of flow are important, as has been argued in [115, 31].

[Figure 1.2: Computer Audition Pipeline — the stages Representation (Hearing), Analysis (Understanding) and Interaction (Acting), with representative examples for each: feature extraction, polyphonic transcription, beat detection; classification, segmentation, similarity retrieval; browsers, editors, query filters.]
This thesis is structured as follows: Chapter 1 (which you are currently reading) provides a high-level overview of the thesis, the motivation behind it and how it fits in the context of related work. Chapter 2 is about representation and describes various feature extraction techniques that create representations for different aspects of musical and audio content. The analysis stage is covered in Chapter 3, where techniques such as musical genre classification, sound "texture" segmentation and similarity retrieval are described. Chapter 4 corresponds to the interaction stage and is about various novel content and context aware user interfaces for interacting with audio signals and collections. The evaluation of the proposed algorithms and systems is described in Chapter 5. The reason evaluation is presented in a separate chapter is that in most cases it involves many different algorithms and systems from all the CA stages. For example, in order to evaluate MPEG-based features, both classification and segmentation results are obtained through the use of developed graphical user interfaces. Chapter 6 talks about the implementation and architecture of our system. The thesis ends with Chapter 7, which summarizes the main thesis contributions and gives directions for future research. Each chapter contains a section about related work and the specific contributions of this work in each area. Standard algorithms and techniques that are important for this work are described in more detail in Appendix A. A glossary of the terminology used in this thesis is provided for quick reference in Appendix B. Appendix C presents the publications that constitute the work presented in this thesis in chronological order, and Appendix D provides some web resources related to this research.

For the remainder of the thesis, it is useful to have in mind the following hypothetical scenario that will very likely become reality in the near future. Suppose all of the recorded music in the world is stored digitally on your computer and you want to be able to manipulate, analyze and retrieve from this large collection of audio signals. Although, in many cases, the techniques described in this work are also applicable to other types of audio signals such as sound effects and isolated musical instrument tones, the main emphasis will be on musical signals. So if there is no other indication in the text, the reader can assume that the signals of interest will be musical signals. Marsyas is the name of the free software framework for Computer Audition research developed as part of this thesis, under which all the described algorithms and interfaces are integrated.
1.7 Related work
In this section, pointers to important related work will be provided. The emphasis is on representative and influential publications as well as descriptions of complete Computer Audition systems rather than individual techniques or algorithms. More detailed and complete references are provided in the related work section of each individual chapter.

As already mentioned, the early origins of computer audition techniques are in Speech recognition and processing [98]. Another important area of early influence is Computer Music research [99, 85], which, although mostly involved with the synthesis of sounds and music, has provided a variety of interesting tools and techniques that can be applied to Computer Audition.
Most of the algorithms in Music Information Retrieval (MIR) (http://www.music-ir.org) have been developed based on symbolic representations such as MIDI files. Symbolic representations are similar to musical scores in that they are typically comprised of detailed structured information about the notes, pitches, timings, and instruments used in a recording. Based on this information, MIR algorithms can perform various types of content analysis such as melody retrieval, key identification, and chord recognition. The techniques used have strong similarities to text information retrieval [137]. More information about symbolic MIR can be found in [1, 2].
Transcription of music is defined as the act of listening to a piece of music and writing down musical notation for the notes that constitute the piece. Automatic transcription refers to the process of converting an audio recording such as a .mp3 file to a structured representation such as a MIDI score, and provides the bridge that can connect symbolic MIR with the analysis of "real world" audio signals.

The automatic transcription of monophonic signals is practically a solved problem, since several algorithms have been proposed that are reliable, commercially applicable and operate in real time. Attempts toward polyphonic transcription date back to the 1970s, when Moorer built a system for transcribing duets, i.e., two-voice compositions [86]. The proposed system suffered from severe limitations on the allowable frequency relations of two simultaneously playing notes, and on the range of notes in use. However, this was the first polyphonic transcription system, and several others were to follow.
To this day, transcription systems have not been able to solve anything but very limited problems. It is only lately that proposed systems can transcribe music 1) with more than two-voice polyphony, 2) with even a limited generality of sounds, and 3) yield results that are mostly reliable and would work as a helping tool for a musician [79, 61]. Still, these two also suffer from substantial limitations. The first was simulated using only a short musical excerpt, and could yield good results only after a careful tuning of threshold parameters. The second system restricts the range of note pitches in simulations to fewer than 20 different notes. However, some flexibility in transcription has been attained, when compared to the previous systems. To summarize, for practical purposes automatic polyphonic transcription of arbitrary music is still unsolved and little progress has been made.
Another approach that offers the possibility of automatic music analysis is Structured Audio (SA). The idea is to build a hybrid representation combining symbolic representation, software synthesis and raw audio in a structured way. Although promising, this technique has only been used on a small scale and would require changes in the way music is produced and recorded. MPEG-4 [107] might change the situation, but for now and for the near future most of the data will still be in raw audio format. Moreover, even if SA is widely used, converters from raw audio signals will still be required.
Given the fact that automatic music transcription, the holy grail of Computer Audition, is not very likely to be achieved in the near future, and that Structured Audio is still mainly a proposal, if one wishes to work with audio signals today a significant paradigm shift is required. The main idea behind this paradigm shift is to try to extract information directly from audio signals using methods from Signal Processing and Machine Learning without attempting note-level transcriptions. This paradigm is followed in the work described in this thesis. Two position papers supporting this shift from note-level transcriptions to statistical approaches based on perceptual properties of "real-world" sounds, even in the case when note-level transcriptions are available, are [82, 106].
In my opinion, this statistical non-transcriptive approach to Computer Audition for non-speech audio signals really started in the second half of the 90s with the publication of two important influential papers. Statistical methods for the retrieval and classification of isolated sounds such as musical instrument tones and sound effects were introduced in [143]. An algorithm for the discrimination between music and speech, using automatic classification based on spectral features and trained using supervised learning, was described in [110]. An early overview of audio information retrieval, including techniques based on Speech Recognition analysis, can be found in [39].
Another influential work for CA is the classic book "Auditory Scene Analysis" by McGill psychologist Albert Bregman [16]. Auditory scene analysis is the process by which the human auditory system builds mental descriptions of complex auditory environments by analyzing mixtures of sounds. The book describes a large number of psychological experiments that provide a lot of insight about the mechanisms used by the human auditory system to perform this process. Computational implementations of some of these ideas are covered in [101]. Although there is no direct attempt in this thesis to model the human auditory system, knowledge about how it works can provide valuable insight for the design of algorithms and systems that attempt to perform similar tasks.
More recently, the Machine Listening Group of the MIT Media Lab has been active in CA, producing a number of interesting PhD dissertations. The work of [31] is concerned with the problem of computational auditory scene analysis, where an auditory scene is decomposed into a number of sound sources. The classification of musical instrument sounds is explored in [81]. The use of information reduction techniques to provide mathematical descriptions of low level auditory processes such as preprocessing, grouping and scene analysis is described in [118]. The most relevant dissertation to this work is [109], which describes a series of systems for listening to music that do not attempt to perform polyphonic transcription.
1.8 Main Contributions
In this section, the main new contributions of this thesis are briefly reviewed. More details and explanation of the terms will be provided in the following chapters.
• MPEG-based compressed domain spectral shape features

Computation of features to represent the short-time spectral shape directly from MPEG audio compressed data (.mp3 files). That way the analysis done for compression can be used for "free" for feature extraction during decoding [130].
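To make the flavor of these spectral shape features concrete, here is a minimal sketch that computes centroid, rolloff and flux from magnitude spectra. It operates on ordinary FFT magnitude spectra; the thesis computes analogous quantities directly from the subband data already present in MPEG compressed audio, which this sketch does not replicate.

```python
import numpy as np

def spectral_shape_features(prev_mag, mag, rolloff_pct=0.85):
    """Centroid, rolloff and flux from successive magnitude spectra.

    A minimal sketch: mag and prev_mag are FFT magnitude spectra of the
    current and previous analysis windows; the thesis derives analogous
    features from MPEG subband magnitudes available during decoding.
    """
    bins = np.arange(1, len(mag) + 1)
    # Centroid: spectral "center of mass"
    centroid = np.sum(bins * mag) / (np.sum(mag) + 1e-12)
    # Rolloff: bin below which rolloff_pct of the spectral energy lies
    cumulative = np.cumsum(mag)
    rolloff = np.searchsorted(cumulative, rolloff_pct * cumulative[-1])
    # Flux: squared change between normalized successive spectra
    n_prev = prev_mag / (np.sum(prev_mag) + 1e-12)
    n_cur = mag / (np.sum(mag) + 1e-12)
    flux = np.sum((n_cur - n_prev) ** 2)
    return centroid, rolloff, flux

# Usage on one analysis window of a signal x:
# mag = np.abs(np.fft.rfft(x[start:start + 512] * np.hanning(512)))
```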
• Representation of sound "texture" using texture windows

Unlike simple monophonic sounds such as musical instrument tones or sound effects, music is a complex polyphonic signal that does not have a steady state and therefore has to be characterized by its overall "texture" characteristics rather than a steady state. The use of texture windows as a way to represent sound "texture" has been formalized and experimentally validated in [133].
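A minimal sketch of the texture window idea, assuming a matrix of per-analysis-window feature vectors as input: each texture window is summarized by the means and standard deviations of the features of the analysis windows it spans. The window sizes here are illustrative defaults, not the exact parameters of the thesis.

```python
import numpy as np

def texture_features(analysis_features, texture_size=40):
    """Summarize short analysis windows over longer texture windows.

    analysis_features: array (num_windows, num_features), one feature
    vector per short analysis window (e.g. ~20 ms). texture_size is the
    number of analysis windows per texture window (e.g. ~1 second).
    Returns one vector of means and standard deviations per texture window.
    """
    out = []
    for start in range(0, len(analysis_features) - texture_size + 1, texture_size):
        block = analysis_features[start:start + texture_size]
        out.append(np.concatenate([block.mean(axis=0), block.std(axis=0)]))
    return np.array(out)
```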
• Wavelet Transform Features

The Wavelet Transform (WT) is a technique for analyzing signals. It was developed as an alternative to the short time Fourier transform (STFT) to overcome problems related to its frequency and time resolution properties. A new set of features based on the WT that can be used for representing sound "texture" is proposed [134].
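The following sketch illustrates the general shape of such features under simplifying assumptions: a pyramidal wavelet decomposition followed by simple statistics of the coefficients in each band. The Haar wavelet is used here to keep the pyramidal algorithm explicit; the proposed features are based on a different wavelet and feature set, described in Chapter 2.

```python
import numpy as np

def haar_dwt_features(x, levels=6):
    """Sound "texture" features from a pyramidal Haar wavelet decomposition.

    Each level halves the time resolution and isolates one octave-wide
    frequency band; each subband is summarized by the mean and standard
    deviation of its absolute coefficients. A simplified sketch, not the
    thesis feature set.
    """
    feats = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx = approx[: len(approx) // 2 * 2]              # even length
        detail = (approx[0::2] - approx[1::2]) / np.sqrt(2)  # high-pass band
        approx = (approx[0::2] + approx[1::2]) / np.sqrt(2)  # low-pass residue
        feats.extend([np.mean(np.abs(detail)), np.std(np.abs(detail))])
    feats.extend([np.mean(np.abs(approx)), np.std(np.abs(approx))])
    return np.array(feats)
```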
• Beat Histograms and beat content features

The majority of existing work in audio information retrieval mainly utilizes features related to spectral characteristics. However, if the signals of interest are musical signals, more specific information such as the rhythmic content can be represented using features. Unlike typical automatic beat detection algorithms that provide only a running estimate of the main beat and its strength, Beat Histograms provide a richer representation of the rhythmic content of a musical piece that can be used for similarity retrieval and classification. A method for their calculation based on periodicity analysis of multiple frequency bands is described [135, 133].
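As a single-band illustration of the periodicity analysis, the sketch below accumulates autocorrelation peaks of an onset-strength envelope into a histogram indexed by tempo in beats per minute. The full method decomposes the signal into multiple frequency bands (via the wavelet transform) and extracts an envelope per band before accumulating; the envelope is an assumed input here.

```python
import numpy as np

def accumulate_beat_histogram(envelope, env_sr, hist, bpm_range=(40, 200)):
    """Add one band's envelope periodicities to a beat histogram.

    envelope: low-rate amplitude/onset envelope of one frequency band;
    env_sr: its sampling rate in Hz; hist: dict mapping tempo (bpm) to
    accumulated strength. A single-band sketch of the multi-band method.
    """
    env = envelope - np.mean(envelope)
    # Autocorrelation at non-negative lags
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    lo = int(env_sr * 60.0 / bpm_range[1])   # lag of the fastest tempo
    hi = int(env_sr * 60.0 / bpm_range[0])   # lag of the slowest tempo
    for lag in range(max(lo, 1), min(hi, len(ac) - 1)):
        if ac[lag] > ac[lag - 1] and ac[lag] >= ac[lag + 1]:  # local peak
            bpm = int(round(60.0 * env_sr / lag))
            hist[bpm] = hist.get(bpm, 0.0) + float(ac[lag])
    return hist
```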
• Pitch Histograms and pitch content features

Another important source of information specific to musical signals is pitch content. Pitch content has been used for retrieval of music in symbolic form, where all the musical pitches are explicitly represented. However, pitch information about audio signals is more difficult to acquire, as automatic multiple pitch detection algorithms are still inaccurate. Pitch Histograms collect statistical information about the pitch content over the whole file and provide valuable information for similarity and classification even when the pitch detection algorithm is not perfect [133].
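A minimal sketch of the histogram-building step, assuming some multiple-pitch detection front end that yields (frequency, amplitude) pairs per frame: frequencies are quantized to MIDI note numbers for an unfolded histogram, and folded modulo 12 into pitch classes for a harmony-oriented one.

```python
import numpy as np

def pitch_histograms(detections):
    """Build unfolded and folded pitch histograms from pitch detections.

    detections: iterable of (frequency_hz, amplitude) pairs produced by
    some multiple pitch detection front end (assumed, not shown here).
    Returns a 128-bin histogram over MIDI notes (unfolded) and a 12-bin
    histogram over pitch classes (folded), each weighted by amplitude.
    """
    unfolded = np.zeros(128)
    folded = np.zeros(12)
    for freq, amp in detections:
        if freq <= 0:
            continue
        note = int(round(69 + 12 * np.log2(freq / 440.0)))  # Hz -> MIDI note
        if 0 <= note < 128:
            unfolded[note] += amp
            folded[note % 12] += amp   # fold octaves onto pitch classes
    return unfolded, folded
```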
• Multifeature segmentation methodology

Audio segmentation is the process of detecting changes of sound "texture" in audio signals. A general multi-feature segmentation methodology that applies to arbitrary sound "textures" and does not rely on previous classification is described in this thesis. The proposed methodology can be applied to any type of audio features and has been evaluated for various feature sets with user experiments described in Chapter 5. This methodology was first described in [125], where segmentation as a separate process from classification and a more formal approach to multi-feature segmentation were provided. Additional segmentation experiments are described in [128].
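To convey the flavor of feature-based segmentation, here is a heavily simplified sketch: distances between successive feature vectors form a novelty curve whose prominent peaks are taken as texture-change boundaries. The distance weighting and peak-picking heuristics actually used are described in Chapter 3; plain Euclidean distance and naive peak picking stand in for them here.

```python
import numpy as np

def texture_boundaries(features, num_segments=10):
    """Detect texture changes as peaks in the distance between
    successive feature vectors.

    features: array (num_windows, num_features) of per-window feature
    vectors. Returns indices of candidate segment boundaries, in time
    order. An illustrative sketch, not the thesis methodology.
    """
    # Novelty curve: distance between each window and its predecessor
    diffs = np.linalg.norm(np.diff(features, axis=0), axis=1)
    peaks = [i for i in range(1, len(diffs) - 1)
             if diffs[i] > diffs[i - 1] and diffs[i] >= diffs[i + 1]]
    # Keep only the strongest peaks as boundaries
    peaks.sort(key=lambda i: diffs[i], reverse=True)
    return sorted(peaks[:num_segments])
```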
• Automatic musical genre classification

An algorithm for automatic musical genre classification of audio signals is described. It is based on a set of features for describing musical content that, in addition to the typical timbral features, includes features for representing rhythmic and harmonic aspects of music. A variety of experiments have been conducted to evaluate this algorithm and, to the best of our knowledge, this is the first detailed exploration of automatic musical genre classification. An initial version of the algorithm was published in [135] and the complete algorithm and more evaluation experiments are described in [133].
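As a concrete (if simplified) instance of the classification step, the sketch below fits one diagonal-covariance Gaussian per genre over such feature vectors, in the spirit of the simple Gaussian classifier of Appendix A.6; the experiments also use GMM and KNN classifiers, and all data and labels here are placeholders.

```python
import numpy as np

def train_gaussian(X, y):
    """Fit one diagonal-covariance Gaussian per genre label.

    X: (num_examples, num_features) feature vectors combining timbral,
    rhythmic and pitch content features; y: array of genre labels.
    """
    return {label: (X[y == label].mean(axis=0), X[y == label].var(axis=0) + 1e-9)
            for label in np.unique(y)}

def classify_gaussian(model, x):
    """Assign x to the genre with the highest Gaussian log-likelihood."""
    def loglik(mean, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    return max(model, key=lambda label: loglik(*model[label]))
```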
• TimbreSpace

The TimbreSpace is a content and context aware visualization of audio collections where each audio signal is represented as a visual object in a 3D space. The shape, color and position of the objects are derived automatically based on automatic feature extraction and Principal Component Analysis (PCA).
• TimbreGrams

TimbreGrams are content and context aware visualizations of audio signals that use colored stripes to show content structure and similarity. The coloring is derived automatically based on automatic feature extraction and Principal Component Analysis (PCA).
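Both visualizations rest on the same dimensionality reduction step, sketched below under simplifying assumptions: PCA projects high-dimensional feature vectors onto three principal components, which can be read either as (x, y, z) coordinates in a TimbreSpace or, rescaled to [0, 1], as RGB stripe colors in a TimbreGram.

```python
import numpy as np

def pca_project(features, k=3):
    """Project feature vectors onto their first k principal components.

    features: (num_items, num_features), one vector per file (TimbreSpace)
    or per window (TimbreGram). The three output dimensions serve as
    (x, y, z) positions, or rescaled to [0, 1] as RGB colors.
    """
    centered = features - features.mean(axis=0)
    # Eigenvectors of the covariance matrix, strongest components first
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    components = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return centered @ components

def to_rgb(projected):
    """Rescale each projected dimension to [0, 1] for use as color."""
    lo, hi = projected.min(axis=0), projected.max(axis=0)
    return (projected - lo) / (hi - lo + 1e-12)
```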
These short descriptions of each contribution are just indicative of the main ideas. More complete descriptions follow in the remaining chapters.
1.9 A short tour
In this section a short tour of the proposed manipulation, analysis and retrieval systems is provided by describing some possible usage scenarios. The emphasis is on how the systems can be used rather than how they work. The interested reader might want to return to this section after reading the subsequent chapters that explain how the proposed systems work, to see again how they fit in an application context. Some of the terms used might be unfamiliar (they will be explained in the rest of the thesis) but the gist of each scenario should be clear. These scenarios are indicative of the requirements of interacting with large audio collections and show how the systems we are going to describe meet these requirements.
A music listener is driving a car, listening to music through a computer audition enhanced radio receiver. The receiver is configured by the user to automatically change radio station when there are commercials and to automatically adjust equalizer settings based on the musical genre that is played (which is detected automatically). At some point, the listener hears a strange polyphonic vocal piece that is very interesting but does not know what it is. The piece is recorded and submitted to a music search engine. Several similar pieces are returned and the listener finds out that the piece is pygmy polyphonic music from Africa. The retrieved pieces can subsequently be explored, for example using a Sound slider to find a faster piece of similar texture.
A researcher in Bioacoustics receives a 24-hour long field recording of a site with birds and has to identify if any of the bird songs correspond to a list of endangered bird species. First the individual bird songs are separated using the enhanced sound editor and automatic segmentation. Subsequently the researcher creates a collection consisting of the individual bird songs in addition to representative bird song samples from the endangered bird species list, each labeled with a characteristic image. Finally the bird songs that are similar to the endangered species ones can be found by visual (they will be close to the target images) and aural (they will sound similar) inspection of an automatically calculated 2D or 3D timbrespace (a visual representation of large audio collections). By collecting a large number of representative bird song samples, an automatic classifier can be trained to recognize bird songs of endangered species.
A musicologist is looking at the differences in duration of symphonic movements performed by different orchestras and conductors. Timbregrams (a visual representation of audio signals) of various symphonic works are calculated using a collection of orchestral music as the basis for Principal Component Analysis (PCA). That way each movement will be separated in color, and movements of similar texture (for example loud, high energy movements) will have similar colors. The Timbregrams are then manually arranged in a table display such that each column corresponds to a symphonic work and each row corresponds to a specific conductor/orchestra combination. That way the differences in duration of each movement will be easily seen visually by comparing the lengths of similarly colored movements across a specific column. By looking at a specific row and comparing it with the rest of the table, the musicologist can draw conclusions such as: conductor A tends to take faster high energy movements and slower low energy movements than conductor B.
A Jazz saxophone student wants to make a collection of saxophone solos by Charlie Parker. A TimbreSpace containing a collection of jazz music, where different jazz subgenres are automatically labeled by shape, is used initially. By selecting the triangles corresponding to Bebop files and constraining the filenames to contain Charlie Parker, a new collection of all the Bebop files with Charlie Parker is created. Using the enhanced sound editor with automatic segmentation, and fine-tuning the segmentation results manually, all the saxophone solos are marked and saved as separate files. Finally a collection of saxophone solos is created from those saved files and can be used for subsequent browsing and exploration.
A sound designer wants to create the soundscape of walking through a forest using a database of sound effects. As a first step the sound of walking on soil has to be looked up. After clustering the sound effects collection the designer notices that a particular cluster has many files with Walking in their filename. Zooming in on this cluster the designer selects 4 sound files labeled as walking on soil. Using spatialized audio, generated thumbnails of the 4 sound files can be heard simultaneously in order to make the final decision. In a similar fashion files with bird songs can be located as well as the sound of a waterfall. Once the components of the soundscape are selected, they can be edited and mixed in order to create the final soundscape of walking through a forest. The illusion of approaching the waterfall can be achieved by artificial reverberation and crossfading.
It is evident from these scenarios that a large variety of algorithms for manipulating, analyzing and interacting with audio signals and collections are required. To make these scenarios even more interesting, imagine that all these tasks are performed in front of a large screen with multiple graphic displays that are all consistent and connected to the same underlying data representation. Of course having such a large collaborative space allows more than one user to interact simultaneously with the system. The first steps toward such a system, as well as a variety of algorithms and interfaces that can be used to realize these scenarios, will be described in the following Chapters.
Chapter 2
Representation
If all things were at rest, and nothing moved, there would be perfect silence in the world. In such a state of absolute quiescence, nothing would be heard, for motion and percussion must precede sound. Just as the immediate cause of sound is some percussion, the immediate cause of all percussion must be motion. Some vibratory impulses or motions causing a percussion on the ear return with greater speed than others. Consequently they have a greater number of vibrations in a given time, while others are repeated slowly, and consequently are less frequent for a given length of time. The quick returns and greater number of such impulses produce the highest sounds, while the slower, which have fewer vibrations, produce the lower.
Euclid, Elements of Music, Introduction to the Section of the Canon
2.1 Introduction
The extraction of feature vectors as the main representation of audio signals constitutes the foundation of most algorithms in Computer Audition. Feature extraction is the process of computing a numerical representation that can be used to characterize a segment of audio. This numerical representation, which is called the feature vector, is subsequently
used as a fundamental building block of various types of audio analysis and information extraction algorithms. This vector typically has a fixed dimension and therefore can be thought of as a point in a multi-dimensional feature space. When using feature vectors to represent music two main approaches are used. In the first approach the audio file is broken into small segments in time and a feature vector is computed for each segment. The resulting representation is a time series of feature vectors which can be thought of as a trajectory of points in the feature space. In the second approach a single feature vector that summarizes information for the whole file is used. The single vector approach is appropriate when overall information about the whole file is required whereas the trajectory approach is appropriate when information needs to be updated in real time. For example, automatic musical genre classification of mp3 files might use the single vector approach while classification of radio signals might use the trajectory approach. Typically the signal is broken into small chunks called analysis windows. Their sizes are usually around 20 to 40 milliseconds, so that the signal characteristics are relatively stable for the duration of the window.
Most of the time, evaluation of audio features is done in the context of some specific application or analysis technique. So for example, a set of features for representing speech might be evaluated in the context of speech compression while a set of features for representing musical content might be evaluated in the context of automatic musical genre classification. Several cases of evaluation of the features described in this chapter are described in Chapter 5. The large variety of audio features that have been proposed differ not only in how effective they are but also in their performance requirements. For example, FFT-based features might be more appropriate for a real-time application than computationally intensive but perceptually more accurate features based on a cochlear filterbank.
The most relevant research area for audio feature extraction is Signal Processing. In many cases feature extraction is performed using some form of Time-Frequency analysis technique such as the Short Time Fourier Transform (STFT). Time-Frequency analysis techniques basically represent the energy distribution of the signal in a time-frequency plane and differ in how this plane is subdivided into regions. The human ear also acts as a Time-Frequency analyzer, decomposing the incoming pressure wave in the cochlea into different frequency bands.
2.2 Related Work
The origins of audio feature extraction are in the representation and processing of speech signals. A variety of representations for speech signals have been proposed in the literature. These representations were initially developed for the purposes of telephony and compression and later for speech recognition and synthesis. A complete overview of these techniques is beyond the scope of this Section. Interested readers can consult any of the standard textbooks on Speech Processing such as [98]. Representations originating in speech processing that have been used in this thesis are: Linear Prediction Coefficients (LPC) [76] and Mel-Frequency Cepstral Coefficients (MFCC) [26]. The use of MFCC for music/speech classification has been studied in [72]. These features will be explained in more detail later in this Chapter.
Toward the end of the 90s features for representing non-speech audio signals were proposed. Examples of non-speech audio signals that have been explored include: isolated instrument tones [44, 45, 32, 143], sound effects [143], and music/speech discrimination [110]. Another interesting direction of research during the 90s was the exploration of alternative front-ends to the typical Short Time Fourier Transform for the calculation of features. Representative examples are Cochlear Models [116, 117, 31, 109], Wavelets [65, 68, 122, 134], and the MPEG audio compression filterbank [95, 130].
Most of these papers utilize the proposed features for various audio analysis tasks such as classification or segmentation and will be described in more detail in the next Chapter. A short description of some representative examples is provided in this Section. Statistical moments of the magnitude spectrum are used as features for instrument classification in [44]. Spectral shape features such as Centroid, Rolloff, and Flux, described later in this chapter, are used in [143] for sound effects and in [110] combined with MFCC-based features for music/speech discrimination. Cochlear models attempt to approximate the exact mechanical workings of the lower levels of the human auditory system and have been shown to have similar properties, for example in pitch detection [116]. Statistical characteristics of Wavelet subbands such as the absolute mean and variance of the coefficients have been used in [134] to model sound texture for the purposes of classification and segmentation. The use of the MPEG audio compression filterbank was independently proposed by [95] and [130] at the International Conference on Acoustics, Speech and Signal Processing (ICASSP 2000). Recently, systems that jointly process audio and video information from MPEG encoded sequences, such as [14], have been proposed.
In the previously cited systems, the proposed acoustic features do not directly attempt to model musical signals and therefore ignore potentially useful information. For example no information regarding the rhythmic structure of the music is utilized. Research in the areas of automatic beat detection and multiple pitch analysis can provide ideas for the development of novel features specifically targeted to the analysis of music signals.
One characteristic of music that has received considerable attention is the automatic extraction of the beat or tempo of musical signals. Informally the beat can be defined as an underlying regular sequence of pulses that coincides with the flow of music. Of course this definition is vague but gives a general idea of the problem. A more anthropocentric way of defining beat is as the sequence of human foot tapping when one listens to music. More elaborate discussions about the definition of beat can be found in Musicology and Music
Cognition publications that are too numerous to be listed here. A good starting point is [67].
There is a significant body of work that deals with beat tracking and analysis using symbolic representations such as MIDI files. Although many interesting ideas have been proposed, these techniques cannot deal with real audio signals. In contrast the systems that will be described in this Section are applicable to audio signals. A real-time beat tracking system for audio signals with music is described in [108]. In this system, a filterbank is coupled with a network of comb filters that track the signal periodicities to provide an estimate of the main beat and its strength. A real-time beat tracking system based on a multiple agent architecture that tracks several beat hypotheses in parallel is described in [49, 13]. A system for rhythm and periodicity detection in polyphonic music based on detecting beats at a narrow low frequency band (50-200 Hz) is described in [3]. A multiresolution time-frequency analysis and interpretation of musical rhythm based on Wavelets is described in [119]. More recently, computationally simpler methods based on onset detection at specific frequencies have been proposed in [66, 112]. A beat tracking system for audio or MIDI data based on the frequency of occurrence of the various time durations between pairs of note onset times and multiple hypothesis search was introduced in [28] and enhanced with a user interface that provides visual and aural feedback in [29]. A beat extraction algorithm that operates directly on MPEG audio compressed bitstreams (.mp3 files) was recently proposed in [141].
The main output of these systems is a running estimate of the main beat and its strength. The beat spectrum, described in [43], is a more global representation of rhythm than just the main beat and its strength. Beat Histograms, introduced in [134] and described in this chapter, are a similar more global representation of rhythm (computed differently) that has been used for automatic musical genre classification [135, 133].
In addition to information related to the beat of music, another important source of information is pitch content. Chord change detection (without chord name identification) has been used in [48] for real-time rhythm tracking for drumless audio signals. An algorithm capable of detecting multiple pitches in polyphonic audio signals is described in [123]. This algorithm has been used in [133] to calculate Pitch Histograms, a statistical representation of pitch content that is used for automatic musical genre classification.
The Short Time Fourier Transform is a standard Signal Processing technique and there are many references that describe it. A good introductory book is [121] and a more complete description can be found in [97, 89]. A good introduction to Wavelets, especially in relation to Signal Processing, can be found in [77]. A high level overview of MPEG audio compression can be found at [88] and [58, 59] contain the complete specifications of the algorithm. For the sake of completeness a short description of these techniques is provided in Appendix A for the interested reader.
2.2.1 Contributions
The following contributions of this thesis to the area of audio feature extraction will be described in this chapter:
- MPEG-based compressed domain spectral shape features: computation of features to represent the short-time spectral shape directly from MPEG audio compressed data (.mp3 files). That way the analysis done for compression can be used for "free" for feature extraction during decoding [130].
- Representation of sound "texture" using texture windows: unlike simple monophonic sounds such as musical instrument tones or sound effects, music is a complex polyphonic signal that does not have a steady state and therefore has to be characterized by its overall "texture" characteristics rather than a steady state. The use of texture windows as a way to represent sound "texture" has been formalized and experimentally validated in [133].
- Wavelet Transform features: the Wavelet Transform (WT) is a technique for analyzing signals. It was developed as an alternative to the Short Time Fourier Transform (STFT) to overcome problems related to its frequency and time resolution properties. A new set of features based on the WT that can be used for representing sound "texture" is proposed [134].
- Beat Histograms and beat content features: the majority of existing work in audio information retrieval mainly utilizes features related to spectral characteristics. However, if the signals of interest are musical signals, more specific information such as the rhythmic content can be represented using features. Unlike typical automatic beat detection algorithms that provide only a running estimate of the main beat and its strength, Beat Histograms provide a richer representation of the rhythmic content of a musical piece that can be used for similarity retrieval and classification. A method for their calculation based on periodicity analysis of multiple frequency bands is described [135, 133].
- Pitch Histograms and pitch content features: another important source of information specific to musical signals is pitch content. Pitch content has been used for retrieval of music in symbolic form where all the musical pitches are explicitly represented. However, pitch information about audio signals is more difficult to acquire as automatic multiple pitch detection algorithms are still inaccurate. Pitch Histograms collect statistical information about the pitch content over the whole file and provide valuable information for similarity and classification even when the pitch detection algorithm is not perfect [133].
These short descriptions of each contribution are just indicative of the main ideas. More complete descriptions follow in the remaining Sections of this Chapter.
Figure 2.1: Magnitude Spectrum
2.3 Spectral Shape Features
Typically in Time-Frequency analysis techniques such as the STFT the signal is broken into small segments in time and the frequency content of each small segment is calculated. The frequency content can be represented as a magnitude spectrum (see Figure 2.1) that represents the energy distribution over frequency for that particular segment in time. The exact details of how the amplitudes are represented and how the frequency axis is spaced are important but for presentation purposes they will be ignored for now. Although the frequency spectrum can be used directly to represent audio signals (in the extreme case it can even be used to perfectly reconstruct them) it has the problem that it contains too much information. For Computer Audition purposes a more compact representation is more appropriate for two main reasons: 1) a lot of the information contained in the spectrum is not important; 2) machine learning algorithms work better with feature vectors of small dimensionality that
are as informative as possible. Therefore typically after the spectrum is calculated a small set of features that characterize the gross spectral shape are calculated. These types of features have been used in almost every type of Computer Audition and Speech Recognition algorithm. In the following subsections specific cases of features for characterizing the overall spectral shape are described.
2.3.1 STFT-based features
Features based on the Short Time Fourier Transform (STFT) are very common and have the advantage of fast calculation based on the Fast Fourier Transform algorithm. Although the exact details of the STFT parameters used to calculate them differ from system to system, their basic description is the same. For the curious, the following STFT analysis parameters have been used successfully in our implementation to represent musical and speech signals: sampling rate 22050 Hz, window size 512 samples (approximately 23 milliseconds), hop size 512 or 256 samples, Hanning window, amplitudes in dB. The following features are based on the short-time magnitude of the STFT of the signal:
Spectral Centroid
The spectral centroid is defined as the center of gravity of the magnitude spectrum of the STFT:

C_t = \frac{\sum_{n=1}^{N} M_t[n] \cdot n}{\sum_{n=1}^{N} M_t[n]}    (2.1)

where M_t[n] is the magnitude of the Fourier transform at frame t and frequency bin n. The centroid is a measure of spectral shape and higher centroid values correspond to "brighter" textures with more high frequencies. The spectral centroid has been shown by user experiments to be an important perceptual attribute in the characterization of musical instrument timbre [50].
Spectral Rolloff

The spectral rolloff is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated:

\sum_{n=1}^{R_t} M_t[n] = 0.85 \sum_{n=1}^{N} M_t[n]    (2.2)

The rolloff is another measure of spectral shape and shows how much of the signal's energy is concentrated in the lower frequencies.
Spectral Flux

The spectral flux is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

F_t = \sum_{n=1}^{N} \left( N_t[n] - N_{t-1}[n] \right)^2    (2.3)

where N_t[n] and N_{t-1}[n] are the normalized magnitudes of the Fourier transform at the current frame t and the previous frame t-1 respectively. The spectral flux is a measure of the amount of local spectral change. Spectral flux has also been shown by user experiments to be an important perceptual attribute in the characterization of musical instrument timbre [50].
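A minimal sketch (in Python with NumPy, not part of MARSYAS; the function name spectral_features is illustrative) shows how Equations 2.1-2.3 could be computed from two successive magnitude spectra:

    import numpy as np

    def spectral_features(mag, prev_mag, rolloff_percent=0.85):
        """Centroid, rolloff bin, and flux for one magnitude spectrum frame."""
        bins = np.arange(1, len(mag) + 1)
        centroid = np.sum(mag * bins) / np.sum(mag)                  # Eq. 2.1
        cumulative = np.cumsum(mag)
        rolloff = np.searchsorted(cumulative,
                                  rolloff_percent * cumulative[-1]) + 1  # Eq. 2.2
        # Normalize both frames before taking the squared difference (Eq. 2.3).
        norm = mag / np.sum(mag)
        prev_norm = prev_mag / np.sum(prev_mag)
        flux = np.sum((norm - prev_norm) ** 2)
        return centroid, rolloff, flux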
2.3.2 Mel-Frequency Cepstral Coefficients
Mel-Frequency Cepstral Coefficients (MFCC) are perceptually motivated features that are also based on the STFT. After taking the log-amplitude of the magnitude spectrum, the FFT bins are grouped and smoothed according to the perceptually motivated Mel-frequency scaling. Finally, in order to decorrelate the resulting feature vectors a Discrete Cosine Transform is performed. Although typically 13 coefficients are used for speech representation, we have found that the first five coefficients are adequate for music representation. A more detailed description of the MFCC calculation can be found in the Appendix.
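As an illustration only (the thesis implementation is part of MARSYAS), the same chain of steps can be sketched with the librosa library, an assumption made here because it provides a ready-made Mel filterbank and DCT; keeping the first five coefficients after the DC term matches the choice described above:

    import librosa

    y, sr = librosa.load("piece.wav", sr=22050)   # hypothetical input file
    # 13 coefficients as in speech; keep the first five after the DC term.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=512)
    mfcc5 = mfcc[1:6, :]   # one 5-dimensional vector per analysis window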
2.3.3 MPEG-based features
Most of the available audio data on the web is in compressed form following the MPEG audio compression standard (the well known .mp3 files). MPEG audio compression is a lossy perceptually-based compression scheme that allows audio files to be stored in approximately one tenth of the space they would normally require if they were uncompressed. This enables large numbers of audio files to be stored and streamed over the network. Using current hardware the decompression can be performed in real time and therefore in most cases there is no reason to store the uncompressed audio signal. Traditional Computer Audition algorithms operate on uncompressed audio signals and therefore would require the signal to be decompressed and then subsequently analyzed. In MPEG audio compression the signal is analyzed in time and frequency for compression purposes. Therefore an interesting idea is to use this analysis not only for compression but also to extract features directly from compressed data. This idea is similar to compressed domain visual processing of video (for example [146]). Because the bulk of the feature calculation is performed during the encoding stage, this process has a significant performance advantage if the available data is compressed, because the analysis is already there for "free". Combining decoding and analysis in one stage is also very important for audio streaming applications.
This idea of calculating features directly from MPEG audio compressed data was explored in [130] (a very similar idea about compressed domain audio features was independently proposed by David Pye at the same conference, ICASSP 2000 [95]) and evaluated in the context of music/speech classification and segmentation. In this section the calculation of these features will be described and Chapter 5 describes how these features were evaluated. These features have been used in [71] for audio analysis of MPEG compressed video signals.
Short MPEG overview
The MPEG audio coding standard is an example of a perception based coder that exploits the characteristics of the human auditory system in order to compress audio more efficiently. Since there is no specific source model, like in speech compression algorithms, it can be used to compress any type of audio. In the first step the audio signal is converted to spectral components via an analysis filterbank. Each spectral component is quantized and coded with the goal of keeping the quantization noise below the masking threshold. Simultaneous masking is a frequency domain phenomenon where a low level signal (the maskee) can be made inaudible (masked) by a simultaneously occurring stronger signal (the masker) as long as masker and maskee are close enough to each other in frequency. Therefore, the dynamic bit allocation in each subband is controlled by a psychoacoustic model. Layers I, II and III offer different trade-offs between complexity and compression ratio. More details can be found in [88] and in the complete ISO specification [58, 59]. Essentially what MPEG audio compression does is assign quantization bits "cleverly" based on the properties of the human auditory system. (Although there is no easy way to confirm this experimentally, MPEG audio compression would probably not sound as good to non-human animals, such as dogs, that have different auditory characteristics.)
Like most modern coding schemes there is an asymmetry between the encoder and the decoder. The encoder is more complicated, slower, and there is some flexibility in the psychoacoustic model used. On the other hand, decoding is simpler and straightforward. The subband sequences are reconstructed on the basis of blocks, taking into account the bit allocation information. Each time the subband samples of all the 32 subbands have been calculated, they are applied to the synthesis filterbank, and 32 consecutive 16-bit PCM-format audio samples are calculated. The feature calculation described in the following section is performed before the filterbank synthesis that is common to all layers.
The digital audio input is mapped into 32 subbands via equally spaced bandpass filters. A polyphase filter structure is used for the frequency mapping; its filters have 512 coefficients. The filters are equally spaced in frequency. At a 22050 Hz sampling rate, each band has a width of 11025 / 32 = 345 Hz. The subsampled filter outputs exhibit significant overlap. The impulse response of subband i, h_i[n], is obtained by multiplication of a single prototype low-pass filter, h[n], by a modulating function that shifts the low-pass response to the appropriate subband frequency range:

h_i[n] = h[n] \cos\left( \frac{(2i+1)\pi}{2M} n + \phi_i \right), \quad M = 32; \; i = 1, 2, \ldots, 32; \; n = 1, 2, \ldots, 512    (2.4)
Although the actual calculation of the samples is performed differently for performance reasons, the 32-dimensional subband vector can be written as a convolution equation:

S_t[i] = \sum_{n=0}^{511} x[t-n] \cdot h_i[n]    (2.5)

where h_i[n] are the individual subband band-pass filter responses. Details about the coefficients of the prototype filter and the phase shifts \phi_i are given in the ISO/MPEG standard [58, 59]. A more complete description of the MPEG audio compression standard can be found in Appendix A.
For the experiments, MPEG-2, Layer III, 22050 Hz sampling rate, mono files were used. A similar approach can be used with the other layers and sampling rates as well as with any filterbank-based perceptual coder. The analysis is performed on blocks of 576 samples (about 26 msec at 22050 Hz) that correspond to one MPEG audio frame. A root mean square subband vector is calculated for the frame as:

M[i] = \sqrt{\frac{\sum_{t=1}^{18} S_t[i]^2}{18}}, \quad i = 1, \ldots, 32    (2.6)
S_t are the 32-dimensional subband vectors. The resulting M is a 32-dimensional vector that describes the spectral content of the sound for that frame. The characteristic features calculated are:

MPEG Centroid is the balancing point of the vector. It can be calculated using

C = \frac{\sum_{i=1}^{32} i \, M[i]}{\sum_{i=1}^{32} M[i]}    (2.7)
MPEG Rolloff is the value R such that

\sum_{i=1}^{R} M[i] = 0.85 \sum_{i=1}^{32} M[i]    (2.8)
MPEG Spectral Flux is the 2-norm of the difference between normalized M vectors evaluated at two successive frames.
MPEG RMS is a measure of the loudness of the frame. It can be calculated using

RMS = \sqrt{\frac{\sum_{i=1}^{32} M[i]^2}{32}}    (2.9)

This feature is unique to segmentation since changes in loudness are important cues for new sound events. In contrast, classification algorithms must usually be loudness invariant.
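Assuming a decoder makes the 18 subband vectors S_t of a frame available (obtaining them is decoder-specific and not shown), the features could be computed along the following lines; this is a sketch of Equations 2.6-2.9, not the MARSYAS code itself, and the sum-normalization used for the flux is one possible choice:

    import numpy as np

    def mpeg_features(S, prev_M=None, rolloff_percent=0.85):
        """S: array of shape (18, 32), the subband vectors of one MPEG frame."""
        M = np.sqrt(np.mean(S ** 2, axis=0))                     # Eq. 2.6
        i = np.arange(1, 33)
        centroid = np.sum(i * M) / np.sum(M)                     # Eq. 2.7
        cumulative = np.cumsum(M)
        rolloff = np.searchsorted(cumulative,
                                  rolloff_percent * cumulative[-1]) + 1  # Eq. 2.8
        rms = np.sqrt(np.mean(M ** 2))                           # Eq. 2.9
        flux = None                                              # 2-norm of difference
        if prev_M is not None:
            flux = np.linalg.norm(M / np.sum(M) - prev_M / np.sum(prev_M))
        return M, centroid, rolloff, rms, flux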
It is clear that these features are very similar to the features calculated on the magnitude spectrum of the STFT. This is not surprising, as in both cases essentially what is summarized are the gross characteristics of the spectral shape. Comparisons of these features with other features such as STFT-based and MFCC in the context of audio classification and segmentation will be provided in Chapter 5.
2.4 Representing sound "texture"
According to the standard definition, the term "timbre" is used to describe all the characteristics of a sound that differentiate it from other sounds of the same pitch and loudness. This definition makes sense for single sound sources such as an isolated musical instrument tone or a single spoken vowel but cannot be applied to more complex audio signals such as polyphonic music or speech. However, for different types of complex audio signals there are statistical qualities of the energy distribution in time and frequency that can be used to characterize them. For example we should expect heavy metal music to have quite different spectral characteristics than classical music, and speech is probably quite different from music. The term sound "texture" will be used to describe these statistical characteristics of the spectral distribution. It is important to note that these characteristics depend both on frequency and time. For example it is likely that heavy metal music will have more energy in the higher frequencies than classical music (frequency) and speech will have more changes of energy over time than most types of music.
Figure 2.2: Sound textures of news broadcast
Some examples of sound textures are: an orchestra playing, a guitar solo, a singing voice, a string quartet, a jazz big band, etc. Although in some cases these textures can be observed visually on a spectrogram, in most cases they cannot. Figure 2.2 shows an excerpt from a news broadcast. There are three different textures (music, male speech, female speech), each one colored differently. It is easy to visually separate the music from speech in the spectrogram display, but it is more challenging to separate male from female speech. So the goal is to somehow capture the statistical properties of sound "textures" using numerical features.
2.4.1 Analysis and texture windows
In short-time audio analysis the signal is broken into small, possibly overlapping, segments in time and each segment is processed separately. These segments are called analysis windows and have to be small enough so that the frequency characteristics of the magnitude spectrum are relatively stable, i.e. it is assumed that the signal for that short amount of time is stationary. However, the sensation of a sound "texture" arises as the result of multiple short-time spectrums with different characteristics following some pattern in time. For example, speech contains vowel and consonant sections, each of which has different spectral characteristics.
Therefore, in order to capture the long-term nature of sound "texture", the actual features computed in our system are the running means and variances of the extracted features, described in the previous Sections, over a number of analysis windows. The term texture window will be used to describe this larger window, and ideally it should correspond to the minimum amount of time that is necessary to identify a particular sound or music "texture". Essentially, rather than using the feature values directly, the parameters of a running multidimensional Gaussian distribution are estimated. More specifically, these parameters (means, variances) are calculated based on the texture window which consists of the current feature vector in addition to a specific number of feature vectors from the past. Another way to think of the texture window is as a memory of the past. For efficient implementation a circular buffer holding previous feature vectors can be used. In our system, an analysis window of 23 milliseconds (512 samples at 22050 Hz sampling rate) and a texture window of 1 second (43 analysis windows) is used. In Chapter 5 it will be shown experimentally that the use of "texture" windows improves the results of automatic musical genre classification.
2.4.2 Low Energy Feature
Low Energy is the only feature that is based on the texture window rather than the analysis window. It is defined as the percentage of analysis windows that have less RMS energy than the average RMS energy across the texture window. For example, musical signals will have a lower "Low Energy" value than speech signals, which usually contain many silent windows. A small sketch of both ideas follows.
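The sketch below, assuming a feature trajectory of shape (num_windows, num_features) and a parallel array of per-window RMS values (both hypothetical inputs), illustrates the running texture-window statistics and the Low Energy feature:

    import numpy as np

    TEXTURE = 43  # analysis windows per texture window (about 1 second)

    def texture_stats(features, rms):
        """Running means/variances over the texture window, plus Low Energy."""
        out = []
        for t in range(len(features)):
            start = max(0, t - TEXTURE + 1)       # memory of the recent past
            window = features[start:t + 1]
            stats = np.concatenate([window.mean(axis=0), window.var(axis=0)])
            r = rms[start:t + 1]
            low_energy = np.mean(r < r.mean())    # fraction of low-RMS windows
            out.append(np.append(stats, low_energy))
        return np.array(out)

A production version would use the circular buffer mentioned above instead of re-slicing the trajectory at every step.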
2.5 Wavelet transform features
2.5.1 The Discrete Wavelet Transform
The Wavelet Transform (WT) is a technique for analyzing signals. It was developed as an alternative to the Short Time Fourier Transform (STFT) to overcome problems related to its frequency and time resolution properties. More specifically, unlike the STFT which provides uniform time resolution for all frequencies, the WT provides high time resolution and low frequency resolution for high frequencies, and low time and high frequency resolution for low frequencies. In that respect it is similar to the human ear, which exhibits similar time-frequency resolution characteristics.
The Discrete Wavelet Transform (DWT) is a special case of the WT that provides a compact representation of the signal in time and frequency and that can be computed efficiently
using a fast, pyramidal algorithm related to multirate filterbanks. More information about the WT and DWT can be found in Mallat [77]. For the purposes of this work, the DWT can be viewed as a computationally efficient way to calculate an octave decomposition of the signal in frequency. More specifically, the DWT can be viewed as a constant Q (center frequency / bandwidth) filterbank with octave spacing between the centers of the filters. A more detailed description of the DWT can be found in Appendix A.
2.5.2 Wavelet features
The extracted wavelet coefficients provide a compact representation that shows the energy of the signal in time and frequency. In order to further reduce the dimensionality of the extracted feature vectors, statistics over the sets of wavelet coefficients are used. That way the statistical characteristics of the "texture" or "music surface" of the piece can be represented. For example the distribution of energy in time and frequency for music is different from that of speech. The following features are used to represent sound "texture":
- The mean of the absolute value of the coefficients in each subband. These features provide information about the frequency distribution of the audio signal.
- The standard deviation of the coefficients in each subband. These features provide information about the amount of change of the frequency distribution over time.
- Ratios of the mean absolute values between adjacent subbands. These features also provide information about the frequency distribution.
A window size of 65536 samples at 22050 Hz sampling rate with a hop size of 512 samples is used as input to the feature extraction. This window corresponds to approximately 3 seconds. Twelve levels (subbands) of coefficients are used, resulting in a feature vector with 35 dimensions (12 + 12 + 11).
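A sketch of the three statistics using the PyWavelets package (pywt, an assumption; the thesis uses its own DWT implementation) with a Daubechies wavelet:

    import numpy as np
    import pywt

    def wavelet_texture_features(x, levels=12, wavelet="db4"):
        """Mean-absolute-value, standard deviation, and adjacent-band ratios."""
        # wavedec returns levels+1 arrays; keep 12 subbands for the statistics.
        coeffs = pywt.wavedec(x, wavelet, level=levels)[:12]
        means = np.array([np.mean(np.abs(c)) for c in coeffs])
        stds = np.array([np.std(c) for c in coeffs])
        eps = 1e-10                              # guard against division by zero
        ratios = means[:-1] / (means[1:] + eps)  # 11 adjacent-band ratios
        return np.concatenate([means, stds, ratios])  # 12 + 12 + 11 = 35 values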
2.6 Other features
In this section some other supported features that do not fit in the previous categories will be briefly described.
RMS

RMS is a measure of the loudness of a window. It can be calculated using

RMS = \sqrt{\frac{\sum_{n=1}^{N} x[n]^2}{N}}    (2.10)

This feature is unique to segmentation since changes in loudness are important cues for new sound events. In contrast, classification algorithms must usually be loudness invariant.
Pitch

Pitch (as an audio feature) typically refers to the fundamental frequency of a monophonic sound signal and can be calculated using various different techniques [96].
Harmonicity

Harmonicity is a measure of how strong the pitch detection for a sound is [143]. It can also be used for voiced/unvoiced detection.
Linear Prediction (LPC) reflection coefficients

LPC coefficients are used in speech research as an estimate of the speech vocal tract filter [76].
Time domain Zero Crossings

Z_t = \frac{1}{2} \sum_{n=1}^{N} \left| \mathrm{sign}(x[n]) - \mathrm{sign}(x[n-1]) \right|    (2.11)
where the sign function is 1 for positive arguments and 0 for negative arguments and x[n] is the time domain signal for frame t. Time domain Zero Crossings provide a measure of the noisiness of the signal. For example heavy metal music, due to guitar distortion and lots of drums, will tend to have much higher zero crossing values than classical music. For clean (without noise) signals Zero Crossings are highly correlated with the Spectral Centroid. The evaluation of these features in the context of audio classification is described in Chapter 5.
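Equation 2.11 translates directly into a short sketch (with sign(x) taken as 1 for positive arguments and 0 otherwise, as defined above):

    import numpy as np

    def zero_crossings(x):
        """Number of sign changes in one analysis window (Eq. 2.11)."""
        s = (x > 0).astype(int)       # 1 for positive arguments, 0 otherwise
        return 0.5 * np.sum(np.abs(np.diff(s)))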
2.7 Rhythmic Content Features
2.7.1 Beat Histograms
Most automatic beat detection systems provide a running estimate of the main beat and an estimate of its strength. In order to characterize musical genres, more information about the rhythmic structure of a piece can be utilized in addition to these features. The regularity of the rhythm, the relation of the main beat to the subbeats, and the relative strength of subbeats to the main beat are some examples of characteristics of music we would like to represent through feature vectors.
A common automatic beat detector structure consists of signal decomposition into frequency bands using a filterbank, followed by an envelope extraction step and finally a periodicity detection algorithm which is used to detect the lags at which the signal's envelope is most similar to itself. The process of automatic beat detection resembles pitch detection with larger periods (approximately 0.5 seconds to 1.5 seconds for beat compared to 2 milliseconds to 50 milliseconds for pitch).
The feature set for representing rhythm structure is based on detecting the most salient periodicities of the signal. Figure 2.3 shows the flow diagram of the beat analysis algorithm. The signal is first decomposed into a number of octave frequency bands using the
[Diagram: the signal is decomposed by the Discrete Wavelet Transform into octave frequency bands; each band passes through envelope extraction (full wave rectification, low pass filtering, downsampling, mean removal); the band envelopes are summed, autocorrelated, and multiple peak picking accumulates the peaks into the Beat Histogram.]
Figure 2.3: Beat Histogram Calculation Flow Diagram
DWT. Following this decomposition, the time domain amplitude envelope of each band is extracted separately. This is achieved by applying full-wave rectification, low pass filtering, and downsampling to each octave frequency band. After mean removal the envelopes of each band are then summed together and the autocorrelation of the resulting sum envelope is computed. The dominant peaks of the autocorrelation function correspond to the various periodicities of the signal's envelope. These peaks are accumulated over the whole sound file into a Beat Histogram where each bin corresponds to the peak lag (i.e. the beat period in beats-per-minute, bpm). Rather than adding one, the amplitude of each peak is added to the beat histogram. That way, when the signal is very similar to itself (strong beat) the histogram peaks will be higher.
The following building blocks are used for the beat analysis feature extraction:
Full Wave Rectification

y[n] = |x[n]|    (2.12)

is applied in order to extract the temporal envelope of the signal rather than the time domain signal itself.
Low Pass Filtering (LPF)

y[n] = (1 - \alpha) x[n] + \alpha y[n-1]    (2.13)

i.e. a One Pole Filter with an alpha value of 0.99, is used to smooth the envelope. Full Wave Rectification followed by Low Pass Filtering is a standard envelope extraction technique.
Downsampling

y[n] = x[kn]    (2.14)

where k = 16 in our implementation. Because of the large periodicities for beat analysis, downsampling the signal reduces computation time for the autocorrelation computation without affecting the performance of the algorithm.
Mean Removal

y[n] = x[n] - E[x[n]]    (2.15)

is applied in order to center the signal at zero for the autocorrelation stage.
Enhanced Autocorrelation

y[k] = \frac{1}{N} \sum_{n} x[n] \, x[n-k]    (2.16)

The peaks of the autocorrelation function correspond to the time lags where the signal is most similar to itself. The time lags of peaks in the right time range for rhythm analysis correspond to beat periodicities. The autocorrelation function is enhanced using a method similar to the multipitch analysis model of [123] in order to reduce the effect of integer multiples of the basic periodicities. The original autocorrelation function of the summary of the envelopes is clipped to positive values and then time-scaled by a factor of two and subtracted from the original clipped function. The same process is repeated with other integer factors such that repetitive peaks at integer multiples are removed.
Peak Detection and Histogram Calculation

The first three peaks of the enhanced autocorrelation function that are in the appropriate range for beat detection are selected and added to a Beat Histogram (BH). The bins of the histogram correspond to beats-per-minute (bpm) from 40 to 200 bpm. For each peak of the enhanced autocorrelation function the peak amplitude is added to the histogram. That way peaks that have high amplitude (where the signal is highly similar) are weighted more strongly than weaker peaks in the histogram calculation.
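A condensed sketch of the envelope-extraction and histogram-accumulation chain (Equations 2.12-2.16) follows; the DWT band decomposition is not shown (the bands are assumed to be full-rate octave-band signals), the autocorrelation enhancement is shown for the factor of two only, and the peak selection is simplified, so this is illustrative rather than the MARSYAS implementation:

    import numpy as np

    def band_envelope(band, alpha=0.99, k=16):
        """Envelope of one octave band: Eqs. 2.12-2.15."""
        env = np.abs(band)                       # full wave rectification (2.12)
        smooth = np.empty_like(env)              # one-pole low pass filter (2.13)
        prev = 0.0
        for n, v in enumerate(env):
            prev = (1.0 - alpha) * v + alpha * prev
            smooth[n] = prev
        down = smooth[::k]                       # downsampling (2.14)
        return down - down.mean()                # mean removal (2.15)

    def beat_histogram(bands, sr, k=16, num_peaks=3):
        """bands: list of octave-band signals at the full sampling rate sr."""
        env = sum(band_envelope(b, k=k) for b in bands)
        n = len(env)
        ac = np.correlate(env, env, mode="full")[n - 1:] / n      # Eq. 2.16
        ac = np.clip(ac, 0.0, None)
        # Enhancement: subtract a two-times time-stretched copy (a full
        # implementation repeats this for other integer factors).
        stretched = np.interp(np.arange(n) / 2.0, np.arange(n), ac)
        enhanced = np.clip(ac - stretched, 0.0, None)
        hist = np.zeros(161)                     # bins for 40..200 bpm
        env_sr = sr / k
        # Simplified peak picking: strongest lags first (a real implementation
        # would select local maxima of the enhanced autocorrelation).
        for lag in np.argsort(enhanced)[::-1]:
            if lag == 0:
                continue
            bpm = int(round(60.0 * env_sr / lag))
            if 40 <= bpm <= 200:
                hist[bpm - 40] += enhanced[lag]  # add the peak amplitude, not 1
                num_peaks -= 1
                if num_peaks == 0:
                    break
        return hist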
2.7.2 Features
Figure 2.4 shows a Beat Histogram for a 30 second excerpt of the song "Come Together" by the Beatles. The two main peaks of the Beat Histogram (BH) correspond to the main beat at approximately 80 bpm and its first harmonic (twice the speed) at 160 bpm. Figure 2.5 shows four Beat Histograms of pieces from different musical genres. The upper left corner, labeled Classical, is the BH of an excerpt from "La Mer" by Claude Debussy. Because of
[Plot: Beat Strength versus BPM (60-200) for a rock excerpt.]
Figure 2.4: Beat Histogram Example
the complexity of the multiple instruments of the orchestra there is no strong self-similarity and there is no clear dominant peak in the histogram. Stronger peaks can be seen at the lower left corner, labeled Jazz, which is an excerpt from a live performance by Dee Dee Bridgewater. The two peaks correspond to the beat of the song (70 and 140 bpm). The BH of Figure 2.4 is shown on the upper right corner, where the peaks are more pronounced because of the stronger beat of rock music. The highest peaks of the lower right corner indicate the strong rhythmic structure of a HipHop song by Neneh Cherry.
Unlike previous work in automatic beat detection, which typically aims to provide only an estimate of the main beat (or tempo) of the song and possibly a measure of its strength, the BH representation captures more detailed information about the rhythmic content of the piece that can be used to intelligently guess the musical genre of a song. Figure 2.5 indicates that the BHs of different musical genres can be visually differentiated. Based on this observation a set of features based on the BH are calculated in order to represent
[Plots: Beat Strength versus BPM (60-200) for Classical, Rock, Jazz, and Hip-Hop excerpts.]
Figure 2.5: Beat Histogram Examples
rhythmic content, and are shown to be useful for automatic musical genre classification. These are:
- A0, A1: relative amplitude (divided by the sum of amplitudes) of the first and second histogram peak
- RA: ratio of the amplitude of the second peak divided by the amplitude of the first peak
- P1, P2: period of the first and second peak in bpm
- SUM: overall sum of the histogram (indication of beat strength)
For the BH calculation, the DWT is applied in a window of 65536 samples at 22050 Hz sampling rate, which corresponds to approximately 3 seconds. This window is advanced by a hop size of 32768 samples. This larger window is necessary to capture the signal repetitions at the beat and subbeat levels.
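Given a computed Beat Histogram, the six features could be read off as in this sketch (peak picking via local maxima; the function name bh_features is illustrative):

    import numpy as np

    def bh_features(hist):
        """A0, A1, RA, P1, P2, SUM from a 40-200 bpm Beat Histogram."""
        # Local maxima as candidate peaks, sorted strongest first.
        peaks = [i for i in range(1, len(hist) - 1)
                 if hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1]]
        peaks.sort(key=lambda i: hist[i], reverse=True)
        p1, p2 = peaks[0], peaks[1]
        total = hist.sum()
        a0, a1 = hist[p1] / total, hist[p2] / total   # relative amplitudes
        ra = hist[p2] / hist[p1]                      # second-to-first ratio
        return a0, a1, ra, p1 + 40, p2 + 40, total    # periods in bpm, overall sum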
2.8 Pitch Content Features
2.8.1 Pitch Histograms
The pitch content feature set is based on multiple pitch detection techniques. More specifically, the multipitch detection algorithm described by [123] is utilized. In this algorithm, the signal is decomposed into two frequency bands (below and above 1000 Hz) and amplitude envelopes are extracted for each frequency band. The envelope extraction is performed by applying half-wave rectification and low-pass filtering. The envelopes are summed and an enhanced autocorrelation function is computed so that the effect of integer multiples of the peak frequencies on multiple pitch detection is reduced. More details about this algorithm can be found in the Appendix.
The prominent peaks of this summary enhanced autocorrelation function (SACF) correspond to the main pitches for that short segment of sound. This method is similar to the beat detection structure but for the shorter periods corresponding to pitch perception. The three dominant peaks of the SACF are accumulated into a Pitch Histogram (PH) over the whole sound file. For the computation of the PH, a pitch analysis window of 512 samples at 22050 Hz sampling rate (approximately 23 milliseconds) is used.
The frequencies corresponding to each histogram peak are converted to musical pitches such that each bin of the PH corresponds to a musical note with a specific pitch (for example A4 = 440 Hz). The musical notes are labeled using the MIDI note numbering scheme. The conversion from frequency to MIDI note number can be performed using the following equation:

n = 12 \log_2 \left( \frac{f}{440} \right) + 69    (2.17)

where f is the frequency in Hertz and n is the histogram bin (MIDI note number).
Two versions of the PH are created: a folded (FPH) and an unfolded (UPH) histogram. The unfolded version is created using the above equation without any further modifications. In the folded case all notes are mapped to a single octave using the equation:

c = n \bmod 12    (2.18)

where c is the folded histogram bin (pitch class or chroma value), and n is the unfolded histogram bin (or MIDI note number). The folded version contains information regarding the pitch classes or harmonic content of the music whereas the unfolded version contains information about the pitch range of the piece. The FPH is similar in concept to the chroma-based representations used in [10] for audio thumbnailing. More information regarding the chroma and height dimensions of musical pitch can be found in [113] and the relation of musical scales to frequency is discussed in more detail in [94].
Finally, the FPH is mapped to a circle-of-fifths histogram so that adjacent histogram bins are spaced a fifth apart rather than a semitone. This mapping is achieved by the following equation:

c' = (7 \times c) \bmod 12    (2.19)

where c' is the new folded histogram bin after the mapping and c is the original folded histogram bin. The number 7 corresponds to 7 semitones, or the music interval of a fifth. That way, the distances between adjacent bins after the mapping are better suited for expressing tonal music relations (tonic-dominant) and the extracted features result in better classification accuracy.
Although musical genres by no means can be fully characterized by their pitch content, there are certain tendencies that can lead to useful feature vectors. For example Jazz or Classical music tends to have a higher degree of pitch change than Rock or Pop music. As
a consequence Pop or Rock music pitch histograms will have fewer and more pronounced peaks than the histograms of Jazz or Classical music.
2.8.2 Features
Based on these observations the following features are computed from the UPH and FPH in order to represent pitch content (a short sketch follows the list):
- FA0: amplitude of the maximum peak of the folded histogram. This corresponds to the most dominant pitch class of the song. For tonal music this peak will typically correspond to the tonic or dominant chord. This peak will be higher for songs that do not have many harmonic changes.
- UP0: period of the maximum peak of the unfolded histogram. This corresponds to the octave range of the dominant musical pitch of the song.
- FP0: period of the maximum peak of the folded histogram. This corresponds to the main pitch class of the song.
- IPO1: pitch interval between the two most prominent peaks of the folded histogram. This corresponds to the main tonal interval relation. For pieces with simple harmonic structure this feature will have value 1 or -1, corresponding to a fifth or fourth interval (tonic-dominant).
- SUM: the overall sum of the histogram. This feature is a measure of the strength of the pitch detection.
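A sketch of the mappings of Equations 2.17-2.19 and the resulting features, assuming a list of detected pitch frequencies with their SACF amplitudes (hypothetical inputs):

    import numpy as np

    def pitch_histograms(freqs, amps):
        """Unfolded (MIDI-note) and circle-of-fifths folded pitch histograms."""
        uph = np.zeros(128)
        fph = np.zeros(12)
        for f, a in zip(freqs, amps):
            n = int(round(12 * np.log2(f / 440.0) + 69))   # Eq. 2.17
            n = min(max(n, 0), 127)                        # clamp to MIDI range
            c = n % 12                                     # Eq. 2.18
            uph[n] += a
            fph[(7 * c) % 12] += a                         # Eq. 2.19, fifths spacing
        return uph, fph

    def pitch_features(uph, fph):
        """FA0, UP0, FP0, IPO1, SUM from the pitch histograms."""
        fa0 = fph.max()                      # amplitude of strongest folded peak
        up0 = int(uph.argmax())              # MIDI note of strongest unfolded peak
        fp0 = int(fph.argmax())              # strongest folded (fifths-spaced) bin
        second = int(np.argsort(fph)[-2])    # second most prominent folded peak
        ipo1 = fp0 - second                  # main tonal interval relation
        return fa0, up0, fp0, ipo1, float(fph.sum())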
2.9 Musical Content Features
Based on the previously described features it is possible to design a feature representation that captures timbral, rhythmic and harmonic aspects of musical signals. More specifically the following 30-dimensional feature vector can be constructed:
- Timbre - Sound Texture: means over the whole file of texture STFT-based features (means and variances of Centroid, Rolloff, Flux, and Zero Crossings over the texture window) and Low Energy (9 dimensions, 23 millisecond analysis window, 1 second texture window).
- Timbre - Sound Texture: means over the whole file of texture MFCCs (means and variances of the first five MFCC coefficients over the texture window, not counting the DC term) (10 dimensions, 23 millisecond analysis window, 1 second texture window).
- Beat Content - Rhythm: rhythmic content features based on Beat Histograms (6 dimensions, 3 second analysis window, 1.5 second hop size).
- Pitch Content - Harmony: pitch content features based on Pitch Histograms (5 dimensions, 23 millisecond analysis window).
Although there are many possible variations, this feature set forms the basic representation used by most of the described algorithms in this thesis and has been shown to be effective in representing musical content.
2.10 Summary
Feature vectors are the fundamental building blocks that underlie the majority of existing work in Computer Audition. A set of features based on the Short Time Fourier Transform (arguably the most common feature front-end) was presented. A similar set of features that can be calculated directly from MPEG audio compressed data was described. In order to represent sound "texture", statistical information about the change of short-time characteristics over larger time intervals is required. The use of "texture" windows allows the statistical combination of short-time spectral shape features in order to represent sound "texture". A set of features that describe spectral shape and texture based on the Discrete Wavelet Transform was also described. In addition to timbral and textural information, musical signals can be characterized by characteristics of their rhythmic and pitch content. Beat and Pitch Histograms are such statistical characterizations of musical signals that can be used to calculate features that represent musical content.
Chapter 3
Analysis
Why do we listen with greater pleasure to men singing music which we happen to know beforehand, than to music which we do not know? Is it because, when we recognize the song, the meaning of the composer, like a man hitting the mark, is more evident? This is pleasant to consider. Or is it because it is more pleasant to contemplate than to learn? The reason is that in the one case it is the acquisition of knowledge, but in the other it is using it and a form of recognition. Moreover, what we are accustomed to is always more pleasant than the unfamiliar.
Aristotle, Problems, Book 19, No. 5
3.1 Introduction
In the previous chapter various different ways of representing audio signals in the form of feature vectors were described. In this chapter we will see how these representations can be used to analyze and somehow "understand" audio signals in various ways. After feature extraction an audio signal can be represented in two major ways: 1) as a time series of feature vectors (or a trajectory of points in the feature space) or 2) as a single feature vector (or point in the feature space). Essentially, analysis techniques try to extract content and
context information about audio signals and collections by trying to learn and analyze the geometric structure of the feature space.
Extracting content and context information both in humans and machines is a task that involves memory and learning. Therefore ideas from Machine Learning and especially Pattern Classification are important for the development of the proposed algorithms. Auditory Scene Analysis is the process by which the auditory system builds mental descriptions of complex auditory environments by analyzing mixtures of sounds [16]. From an ecological viewpoint, we try to associate events with sounds in order to understand our environment. Classification (what made this sound?), segmentation (how long did it last?) and similarity (what else sounds like that?) are fundamental processes of any type of auditory analysis. In this chapter algorithms for automatically performing these tasks will be described. Although there is no attempt to directly model the human auditory system, knowledge about how it works has provided valuable insight for the design of the described algorithms and systems. For example there is significant evidence that the decisions for sequential and simultaneous integration of sounds are based on multiple cues. Similarly, multiple features and representations are used in the proposed algorithms.
3.2 Related Work
Audio analysis is based on Machine Learning (ML) and specifically Pattern Classification techniques. Representative textbooks describing these areas are [30, 105]. Although ML techniques such as Hidden Markov Models (HMM) have been used for a long time in Speech Recognition [98], they have only recently been applied to other types of audio signals.
When Speech Recognition techniques started being accurate enough for practical use, they were used in order to retrieve multimedia information. For example the Informedia project at Carnegie Mellon [53] contains a terabyte of audiovisual data. Indexing the archive is done using a combination of speech recognition, image analysis and keyword
searching techniques. Audio analysis and browsing tools can enhance such indexing techniques for multimedia data such as radio and television broadcasts, especially for the regions that do not contain speech.
More recently, audio classification techniques that include non-speech signals have been proposed. Most of these systems target the classification of broadcast news and video in broad categories like music, speech and environmental sounds. The problem of discrimination between music and speech has received considerable attention, from the early work of Saunders [103] where simple thresholding of the average zero-crossing rate and energy features is used, to the work of Scheirer and Slaney [110] where multiple features and statistical pattern recognition classifiers are carefully evaluated. A similar system is used in [102] to initially separate speech from music and then detect phonemes or notes accordingly. Audio signals are segmented and classified into "music", "speech", "laughter" and non-speech sounds using cepstral coefficients and a Hidden Markov Model (HMM) in [63]. A heuristic rule-based system for the segmentation and classification of audio signals from movies or TV programs based on the time-varying properties of simple features is proposed in [147]. Signals are classified into two broad groups of music and non-music which are further subdivided into (Music) Harmonic Environmental Sound, Pure Music, Song, Speech with Music, Environmental Sound with Music and (Non-music) Pure Speech and Non-Harmonic Environmental Sound. A similar audio classification and segmentation method (music, speech, environmental sound and silence) based on Nearest Neighbor and Learning Vector Quantization is presented in [74]. A solution to the more difficult problem of locating singing voice segments in musical signals is described in [12]. In their system, the phoneme activation output of an automatic speech recognition system is used as the feature vector for classifying singing segments. A system for the separation of speech signals from complex auditory scenes is described in [90].
Another type of non-speech audio classification system involves isolated musical instrument sounds and sound effects. A retrieval-by-similarity system for isolated sounds (sound effects and instrument tones) has been developed at Muscle Fish LLC [143]. Users can search for and retrieve sounds by perceptual and acoustical features, can specify classes based on these features and can ask the engine to retrieve similar or dissimilar sounds. The features used in their system are statistics (mean, variance, autocorrelation) over the whole sound file of short-time features such as pitch, amplitude, brightness, and bandwidth. Using the same data set various other retrieval and classification approaches have been proposed. The use of MFCC coefficients to construct a learning tree vector quantizer is proposed in [38]. Histograms of the relative frequencies of feature vectors in each quantization bin are subsequently used for retrieval. The same data set is also used in [68] to evaluate a feature extraction and indexing scheme based on statistics of the Discrete Wavelet Transform (DWT) coefficients. In [70] the same data set is used to compare various classification methods and feature sets, and the use of the Nearest Feature Line pattern classification method is proposed.
Similarity retrieval of music signals based on content is explored in [127]. A more specific similarity retrieval method targeted towards identifying different performances of the same piece is described in [144, 145]. A similar system that retrieves large symphonic orchestral works based on the overall energy profile and dynamic programming techniques is described in [41].
A multi-feature classifier based on spectral moments for recognition of steady-state instrument tones was initially described in [44] and subsequently improved in [45]. Another paper exploring the recognition of musical instrument tones from monophonic recordings is [17]. A more general framework for the same problem is described in [80, 81]. A detailed exploration and comparison of features for isolated musical instrument tones is provided in [32].
A method for the automatic hierarchical classification of musical genres was initially described in [135] and improved in [133]. A more limited classification system (three genres: Jazz, Classical, Rock) using ideas from texture modeling in the visual domain is described in [27]. The problem of artist detection using neural networks and support vector machines trained using audio features is addressed in [142]. A related problem to similarity retrieval is audio identification by content, or audio fingerprinting, which is the use of audio content to directly identify pieces of music in a large database [5].
Segmentation algorithms can be divided into algorithms that require prior classification and those that do not. Segmentation systems that require prior classification in many cases are based on Hidden Markov Models (HMM). Some examples are: the segmentation of recorded meetings based on speaker changes and silences described in [63], and the evaluation of music segmentation using different feature front-ends (MFCC, LPC, Discrete Cepstrum) described in [8]. Another approach has been to detect changes by tracking various individual audio features and then using heuristics to combine their results to create segmentations [147]. A segmentation method based on self-similarity is described in [42]. An interesting approach for segmentation based on the detection of relative silence is described in [93]. A comparison of model-based, metric-based and energy-based audio segmentation for broadcast news data is described in [62]. A multifeature approach that does not require prior classification and formalizes the integration of multiple features was proposed in [125] and further evaluated with user experiments in [128].
3.2.1 Contributions
The main contribution of this thesis to the analysis stage of Computer Audition is a general multi-feature methodology for audio segmentation with arbitrary sound "textures" that does not rely on previous classification. The proposed methodology can be applied to any type of audio features and has been evaluated for various feature sets with user experiments described in Chapter 5. This methodology was first described in [125], where segmentation as a separate process from classification and a more formal approach to multi-feature segmentation were provided. Additional segmentation experiments are described in [128]. In addition, we have shown that effective musical genre classification and content-based similarity retrieval are possible using the musical content feature set that represents timbral, rhythmic and harmonic aspects of music. An initial version of the automatic musical genre classification algorithm was published in [135] and the complete algorithm and more evaluation experiments are described in [133].
3.3 Query-by-example content-based similarity retrieval
In most content-based multimedia information retrieval systems the main method of specifying a query is by example. In that paradigm the user provides a multimedia object (for example an image or a piece of music) as the query and the system returns a list of similar multimedia objects ranked by their similarity. Similarity retrieval can also be used for automatic playlist generation [4].
Query-by-example (QBE) music similarity retrieval is implemented based on the musical content features (Section 2.9) using the single feature vector representation for each file. Because the computed features represent various characteristics of the music such as its timbral texture, rhythmic and pitch content, feature vectors that are close together in feature space will also be perceptually similar. A naive implementation of similarity retrieval would use the Euclidean distance between feature vectors and would return similar audio files ranked by increasing distance (the closest one to the query feature vector is the first match). Because the features have different dynamic ranges and probably are correlated, a Mahalanobis distance is used in our system. The Mahalanobis distance [75] between two vectors is defined as:
M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}    (3.1)

where \Sigma^{-1} is the inverse of the feature vector covariance matrix. The feature covariance matrix \Sigma is defined as:

\Sigma = E[x x^T]    (3.2)
The Mahalanobis distance automatically accounts for the proper scaling of the coordinate axes and corrects for the correlation between features. Evaluating similarity retrieval is difficult, as it typically involves conducting user experiments. User experiments conducted to evaluate this technique are described in Chapter 5. A demonstration of the system can be found at: http://soundlab.princeton.edu/demo.php
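To make the retrieval step concrete, the following minimal sketch ranks a collection by Mahalanobis distance to a query. It uses NumPy rather than the system's own implementation, and the feature data are hypothetical stand-ins for the single-feature-vector representations described above.

```python
import numpy as np

def mahalanobis_rank(query, features):
    """Rank collection items by Mahalanobis distance to the query.

    query    -- 1-D feature vector of the query file
    features -- 2-D array with one row per file in the collection
    Returns collection indices sorted by increasing distance.
    """
    # Covariance over the whole collection (rows are observations);
    # the pseudo-inverse guards against a singular covariance matrix.
    sigma_inv = np.linalg.pinv(np.cov(features, rowvar=False))
    diff = features - query
    # Squared Mahalanobis distance of every row to the query.
    d2 = np.einsum('ij,jk,ik->i', diff, sigma_inv, diff)
    return np.argsort(d2)

# Hypothetical example: 100 files with 30-dimensional content features.
rng = np.random.default_rng(0)
collection = rng.normal(size=(100, 30))
print(mahalanobis_rank(collection[0], collection)[:10])
```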
3.4 Classification
Classification algorithms attempt to categorize observed patterns into groups of related patterns, or classes. The classification or description scheme is usually based on the availability of a set of patterns that have already been classified or described. This set of patterns is termed the training set, and the resulting learning strategy is characterized as supervised. Learning can also be unsupervised, in the sense that the system is not given an a priori labeling of patterns; instead it establishes the classes itself based on the statistical regularities of the patterns.
The classification or description scheme usually uses one of the following approaches: statistical (or decision theoretic), syntactic (or structural), or neural. Statistical pattern recognition is based on statistical characterizations of patterns, assuming that the patterns are generated by a probabilistic system. Structural pattern recognition is based on the structural interrelationships of features. Neural pattern recognition employs the neural computing paradigm that has emerged with neural networks.
A huge variety of classification systems have been proposed in the literature and there is no clear-cut choice about which is best, as they differ in many aspects such as: training time, amount of training data required, classification time, robustness, and generalization. The main idea behind most of these classification schemes is to "learn" the geometric structure of the training data in the feature space and use that to estimate a "model" that can generalize to new unclassified patterns. Parametric classifiers assume a functional form for the underlying feature probability distribution, while non-parametric ones use the training set directly for classification. Even a brief presentation of all the proposed classification systems is beyond the scope of this thesis. More details can be found in standard textbooks on Pattern Classification such as [30, 105].
3.4.1 Classifiers
In this section, the main classification algorithms that are supported in our system will be briefly described. The emphasis is on what model is used to represent the feature vectors and not on how the model parameters can be estimated from training data. More details about the training of the described classifiers can be found in Appendix A. Although by no means the only possible choices, these classification algorithms have been used effectively to construct practical audio analysis algorithms that in many cases run in real time.
The Gaussian (GMAP) classifier assumes each class can be represented as a multi-dimensional Gaussian distribution in feature space. The parameters of the distribution (means and covariance matrix) are estimated using the labeled data of the training set. This classifier is typical of parametric statistical classifiers that assume a particular form for the underlying class probability density functions. This classifier is more appropriate when the feature distribution is unimodal. The locus of points equidistant from a particular Gaussian classifier is an ellipse, which is axis-aligned if the covariance matrix is diagonal.
The Gaussian Mixture Model (GMM) classifier models each class as a fixed-size weighted mixture of Gaussian distributions. For example, GMM3 will be used to denote a model that has 3 mixture components. The GMM classifier is characterized by the mixture weights and the means and covariance matrices of each mixture component. For efficiency reasons the covariance matrices are typically assumed to be diagonal. The GMM classifier is typically trained using the Expectation-Maximization (EM) algorithm, which is an iterative optimization procedure. A tutorial on the EM algorithm can be found in [84].
Unlike parametric classifiers, the K-Nearest Neighbor (KNN) classifier directly uses the training set for classification without assuming any mathematical form for the underlying class probability density functions. Each sample is classified according to the class of its nearest neighbor in the training data set. In the KNN classifier, the K nearest neighbors to the point to be classified are calculated and voting is used to determine the class. An interesting result about the nearest neighbor classifier is that no matter what classifier is used, one can never do better than to cut the error rate in half relative to the nearest-neighbor classifier, assuming the training and testing data represent the underlying feature space topology [30].
Artificial Neural Networks (ANN) are multilayer architectures of strongly connected simple components that can be trained to perform function approximation, pattern association, and pattern classification. A standard technique for the training of ANNs is backpropagation, which refers to the process by which derivatives of network error, with respect to network weights and biases, can be computed. This process can be used with a number of different optimization strategies. The architecture of a multilayer network is not completely constrained by the problem to be solved. The number of inputs to the network is constrained by the problem, and the number of neurons in the output layer is constrained by the number of outputs required by the problem. However, the number of layers between the network inputs and the output layer, and the sizes of those layers, are up to the designer. It has been shown that a two-layer sigmoid/linear network can represent any functional relationship between inputs and outputs if the sigmoid layer has enough neurons. A standard backpropagation neural network trained using a gradient-descent type of optimization has been used in our system as a pattern classifier. Although the achieved classification performance is comparable (not superior) to the other classifiers, training time for large collections is too long for practical purposes, and therefore the results in Chapter 5 will be based mainly on statistical pattern recognition classifiers. There is a lot of potential in the use of ANNs with architectures that are tailored to problems of Computer Audition rather than used as a generic classification algorithm.
In all the previously described classifiers a labeled training set was used (supervised learning). The K-means algorithm is a well-known technique for unsupervised learning, where no labels are provided and the system automatically forms clusters based solely on the structure of the training data. In the K-means algorithm each cluster is represented by its centroid and the clusters are iteratively improved. More details can be found in Appendix A. The same algorithm can also be used for Vector Quantization (VQ), a technique where each feature vector is coded as an integer index into a codebook of representative feature vectors corresponding to the means of the clusters.
Having different classifiers is important in real-world applications because, in addition to classification accuracy, other factors such as training speed, classification speed and robustness matter. For example, the KNN family of classifiers in its basic form is computationally intensive and requires significant storage for classification of large data sets. On the other hand, classification and training using the simple Gaussian (GMAP) classifier are fast, and therefore it can be used for real-time applications.
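To make these trade-offs concrete, here is a minimal sketch that trains a GMAP-style Gaussian classifier, a GMM3-style mixture classifier, and a KNN classifier on the same labeled feature vectors. It uses scikit-learn as a stand-in for the classifiers built into our system, and the synthetic data are hypothetical placeholders for the audio features of Chapter 2.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled audio feature vectors (3 classes).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(200, 10)) for m in (-1.0, 0.0, 1.0)])
y = np.repeat([0, 1, 2], 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# GMAP-style: one full-covariance Gaussian per class.
gmap = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr)

# GMM3-style: a 3-component diagonal-covariance mixture per class,
# trained with EM; classify by maximum per-class log-likelihood
# (the classes here are balanced, so priors are equal).
gmms = [GaussianMixture(3, covariance_type='diag', random_state=0)
        .fit(X_tr[y_tr == c]) for c in np.unique(y_tr)]
gmm_pred = np.argmax([g.score_samples(X_te) for g in gmms], axis=0)

# Non-parametric KNN: vote among the 5 nearest training neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

print('GMAP accuracy:', gmap.score(X_te, y_te))
print('GMM3 accuracy:', float(np.mean(gmm_pred == y_te)))
print('KNN accuracy:', knn.score(X_te, y_te))
```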
3.4.2 Classification schemes
Based on these classification algorithms and the audio features described in Chapter 2, various audio classification schemes are supported in our system. These schemes will be used in Chapter 5 to evaluate various feature sets and algorithms. For the purposes of this chapter, each classification scheme will be briefly described and a representative classification accuracy will be given (more detailed descriptions and results, as well as the specific details of the features and classifiers used, will be provided in Chapter 5). There has been little prior work on the Musical, Jazz and Classical genre classification schemes.
Music vs Speech
Separating music from speech is one of the first audio classification problems to have been explored. Detecting speech and music segments is important for automatic speech recognition systems, especially when dealing with real-world multimedia data, and good results can be achieved because music and speech have quite different spectral characteristics. In our implementation a classification accuracy of 89% is achieved.
3.4.3 Voices
In addition to separating music from speech, another important classification is detecting the gender of a speaker. In this classification scheme audio signals are classified into three categories: male voice, female voice, and sports announcing. Sports announcing refers to any type of voice over a loud noisy background. In addition to being an important source of information by itself, speaker gender identification can be used to improve speech recognition performance. A classification accuracy of 73% is achieved in our system.
Musical Genres
Using this classification scheme, audio signals are automatically classified into one of the following categories: Classical, Country, Disco, HipHop, Jazz, Rock, Blues, Reggae, Pop, Metal. A classification accuracy of 61% is achieved in our system.
Classical Genres
Classical music can be further subdivided into: Choir, Orchestra, Piano, String Quartet. These subgenres are closer to instrumentation categories. A classification accuracy of 88% is achieved in our system.
Jazz Genres
Jazz music can be further subdivided into: BigBand, Cool, Fusion, Piano, Quartet, Swing. These subgenres are closer to instrumentation categories. A classification accuracy of 68% is achieved in our system.
Figure 3.1: Musical genre hierarchy (a tree with Speech subdivided into Male, Female and Sports, and Music subdivided into Classical, Country, Disco, HipHop, Jazz, Rock, Blues, Reggae, Pop and Metal, with Classical further subdivided into Choir, Orchestra, Piano, String Quartet and Jazz into BigBand, Cool, Fusion, Piano, Quartet, Swing)
Figure 3.2: Classification schemes accuracy
Hierarchical Classification
These classification schemes are combined into a hierarchical scheme that can be used to classify radio and television broadcasts. Figure 3.1 shows this hierarchical scheme depicted as a tree with 23 nodes. A summary of the classification accuracy for each scheme is provided in Figure 3.2 and Table 3.1. These results are representative; more results and details about their calculation can be found in Chapter 5.

Table 3.1: Classification schemes accuracy

    Scheme            Random   Automatic
    MusicSpeech (2)     50        89
    Voices (3)          33        73
    Genres (10)         10        61
    Classical (4)       25        88
    Jazz (6)            16        68
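A hierarchical scheme of this kind amounts to routing a feature vector down a tree of classifiers. The sketch below is a hypothetical outline, not the system's actual code; it assumes each node holds an already-trained classifier with a scikit-learn-style predict method.

```python
def classify_hierarchical(x, node):
    """Route a feature vector down a tree of trained classifiers.

    node -- dict with a 'clf' (anything with a .predict method) and a
            'children' dict mapping each predicted label either to a
            subtree of the same form or to None (a leaf).
    Returns the labels along the path, e.g. ['Music', 'Jazz', 'Fusion']
    or ['Speech', 'Male'].
    """
    label = node['clf'].predict([x])[0]
    path = [label]
    child = node['children'].get(label)
    if child is not None:  # descend into the finer-grained scheme
        path += classify_hierarchical(x, child)
    return path
```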
Other schemes
In addition to the classification schemes described above, other types of audio classification, such as clustering of sound effects, isolated instrument recognition, and speaker emotion detection, have been implemented using our system. Unlike the previously described classification schemes, these types of classification are covered in the existing literature and will not be described further in this thesis. The results obtained are comparable with those reported in the published literature.
3.5 Segmentation
3.5.1 Motivation
One of the first chapters of most textbooks on image processing or computer vision is devoted to edge detection and object segmentation. This is because it is much easier to build classification and analysis algorithms using segmented objects as input rather than raw image data. In video analysis, shots, pans and generally temporal segments are detected and then analyzed for content. Similarly, temporal segmentation can be used for audio and especially music analysis.
Auditory Scene Analysis is the process by which the human auditory system builds mental descriptions of complex auditory environments by analyzing mixtures of sounds [16]. From an ecological viewpoint, we try to associate events with sounds in order to understand our environment. The characteristics of sound sources tend to vary smoothly in time, so abrupt changes usually indicate a new sound event. The decisions for sequential and simultaneous integration of sound are based on multiple cues. Although our method does not attempt to model the human auditory system, it does use significant changes of multiple features as segmentation boundaries. The experiments indicate that the selected features contain enough information to be useful for automatic segmentation.
Temporal segmentation is a more primitive process than classification, since it does not try to interpret the data. Therefore, it can be more easily modeled using mathematical techniques. Being simpler, it can work with arbitrary audio and does not pose specific constraints on its input, such as a single speaker or isolated tones. It has been argued in [82] that music analysis systems should be built for and tested on real music, and should be based on perceptual properties rather than music theory and note-level transcriptions.
In this section, a general segmentation methodology that can be used to segment audio based on arbitrary sound "texture" changes will be described. Some examples of sound "texture changes" are a piano entrance after the orchestra in a concerto, a rock guitar solo entrance, or a change of speaker.
Unlike many of the proposed algorithms for audio segmentation that rely on the existence of a prior classification, the proposed methodology is not constrained to a fixed number of possible sound "textures". This makes it applicable to a much broader class of segmentation problems. For example, segmentation into music and speech regions can easily be handled by existing classification-based segmentation systems. However, segmentation of the different instrument sections of an orchestra using the classification-based approach is more difficult, as a model for each possible subset of the orchestra would have to be learned. The proposed segmentation methodology has no problem handling such cases, as it simply detects the segmentation boundaries.
To further motivate segmentation, suppose that a user is given the task of segmenting an opera recording into sections and labeling each section appropriately with a text annotation. It can easily be verified experimentally that most of the time will be spent searching the audio for the section boundaries, and much less time on the actual annotation. As an indication of the time required for such a task, 2 hours was the average time required by the subjects of the experiments described in Chapter 5 to manually segment and annotate 10 minutes of audio using standard sound editing tools. These observations suggest a semi-automatic approach that combines manual and fully automatic annotation into a flexible, practical user interface for audio manipulation. Automatic segmentation is an important part of such a system. For example, the user can automatically segment audio into regions and then run automatic classification algorithms that suggest annotations for each region. The annotations can then be edited and/or expanded by the user. This way, significant amounts of user time are saved without losing the flexibility of subjective annotation. Segmentation can also be used to detect the musical structure of songs (for example, cyclic ABAB forms).
3.5.2 Methodology
The main idea behind the segmentation methodology described in this section is that changes in sound "texture" will correspond to abrupt changes in the trajectory of feature vectors representing the file. Although the statistics of a particular sound "texture" (and the corresponding feature vectors) might change significantly over time, they will do so smoothly, whereas a "texture" change will produce an abrupt gap. The proposed methodology can utilize arbitrary sets of features (of course they have to be good representations of audio content, but there is no constraint on their dimensionality or method of calculation). This is in contrast to other proposed segmentation systems that are designed around specific features such as amplitude and pitch, track changes in each feature individually, and use heuristics to combine the results.
The methodology can be broken into four stages:
1. A time series of feature vectors $V_t$ is calculated by iterating over the sound file.

2. A distance signal $\Delta_t = \| V_t - V_{t-1} \|$ is calculated between successive frames of sound. In our implementation we use a Mahalanobis distance given by

$M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{T} \Sigma^{-1} (\mathbf{x} - \mathbf{y})}$    (3.3)

where $\Sigma$ is the feature covariance matrix calculated from the whole sound file. This distance rotates and scales the feature space so that the contribution of each feature is equal. Other distance metrics, possibly using relative feature weighting, can also be used.

3. The derivative $d\Delta_t / dt$ of the distance signal is taken. The derivative of the distance will be low for slowly changing textures and high during sudden transitions. The peaks roughly correspond to texture changes.

4. Peaks are picked using simple heuristics and are used to create the segmentation of the signal into time regions. As a heuristic example, adaptive thresholding can be used. A minimum duration between successive peaks can be set to avoid small regions.
For the experiments described in Chapter 5, the peak-picking heuristic is parameterized by the desired number of peaks. More specifically, the following heuristic is used (a code sketch follows the list):

1. The peak with the maximum amplitude is picked.

2. A region around and including the peak is zeroed (this helps avoid counting the same peak twice). The size of the region is proportional to the size of the sound file divided by the number of desired regions (20% of the average region size).

3. Step 1 is repeated until the desired number of peaks is reached.
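The four stages and the peak-picking heuristic map directly onto a few lines of numerical code. The following is a minimal sketch assuming NumPy and a precomputed feature matrix V (one row per analysis window); it follows the same steps but is not the MARSYAS implementation.

```python
import numpy as np

def segment(V, n_peaks):
    """Multi-feature segmentation of a feature-vector trajectory.

    V       -- 2-D array of feature vectors, one row per analysis window
    n_peaks -- desired number of segmentation boundaries
    Returns the window indices of the detected texture changes.
    """
    # Stage 2: Mahalanobis distance between successive frames; the
    # covariance is estimated over the whole file (pseudo-inverse in
    # case it is singular).
    sigma_inv = np.linalg.pinv(np.cov(V, rowvar=False))
    diff = np.diff(V, axis=0)
    delta = np.sqrt(np.einsum('ij,jk,ik->i', diff, sigma_inv, diff))

    # Stage 3: derivative of the distance signal; its peaks roughly
    # correspond to texture changes.
    d_delta = np.diff(delta)

    # Stage 4: iteratively pick the largest peak, zeroing a region of
    # 20% of the average region size around it so the same peak is
    # not counted twice.
    guard = max(1, int(0.2 * len(d_delta) / n_peaks))
    signal = d_delta.astype(float).copy()
    peaks = []
    for _ in range(n_peaks):
        p = int(np.argmax(signal))
        peaks.append(p)
        signal[max(0, p - guard):p + guard + 1] = -np.inf
    return sorted(peaks)
```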
One important parameter of the proposed segmentation methodology is the collection over which the feature covariance matrix $\Sigma$ is calculated. The choice of collection makes the segmentation methodology context-dependent. For example, the same file might be segmented differently if a collection of orchestral music is used for the computation of the covariance matrix compared to a collection of rock music.
3.5.3 Segmentation Features
The following set of features has been used successfully for automatic segmentation and forms the basis of the experiments described in Chapter 5: means and variances (over a texture window) of Spectral Centroid, Rolloff, Flux, Zero Crossings, LowEnergy, and RMS. Loudness information, expressed using the RMS feature, is an important source of segmentation information that is usually ignored in classification-based segmentation algorithms, as classification has to be loudness-invariant.
3.6 Audio thumbnailing
Audio thumbnailing refers to the process of creating a short summary sound file from a large sound file in such a way that the summary best captures the essential elements of the original. It is similar to the concept of key frames in video and thumbnails in images. Audio thumbnailing is important for audio MIR, especially for the presentation of the returned ranked list, since it allows users to quickly hear the results of their query and make their selection. In [73], two methods of audio thumbnailing are explored: the first is based on clustering and the second on the use of Hidden Markov Models (HMMs). According to the user experiments described in [73], both methods perform better than random selection, with clustering the better of the two. In addition to the clustering method, a segmentation-based method is supported in our system. In this method, short segments around the segmentation boundaries are concatenated to form the summary. User experiments [128] have indicated that humans consider segmentation boundaries important for summarization.
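As an illustration, a segmentation-based thumbnail can be produced by the following sketch; the sample array, sample rate and boundary list are hypothetical inputs, with the boundaries coming from a segmentation such as the one in Section 3.5.

```python
import numpy as np

def thumbnail(audio, boundaries_sec, sr, snippet_sec=2.0):
    """Concatenate short snippets around segmentation boundaries.

    audio          -- 1-D array of samples
    boundaries_sec -- segmentation boundaries, in seconds
    sr             -- sample rate in Hz
    snippet_sec    -- length of the snippet taken around each boundary
    """
    half = int(snippet_sec * sr / 2)
    parts = [audio[max(0, int(b * sr) - half):int(b * sr) + half]
             for b in boundaries_sec]
    return np.concatenate(parts)
```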
3.7 Integration
Although the analysis algorithms were presented separately, their results can be integrated, resulting in improved performance. As already mentioned in the related work section of this chapter, many proposed segmentation algorithms require prior classification. The other direction is also possible. Most of the classification methods proposed in the literature report improved performance if the classification results are integrated over larger time windows. However, using fixed-size integration windows blurs the transition edges between classes. Usually, test data consists of files that do not contain transitions, to simplify the evaluation; therefore this problem does not show up. In real-world data, however, transitions exist and it is important to preserve them.
The described segmentation method provides a natural way of breaking up the data into regions based on texture. These regions can then be used to integrate classification results, for example using a majority filter. That way, sharp transitions are preserved and the classification performance is improved because of the integration. Initial experiments on a number of different sound files confirm this. A more detailed quantitative evaluation of how this method compares to fixed-window integration is planned for the future. Other examples of integration are the use of segmentation for thumbnailing, or the use of clustering in similarity retrieval to group the ranked list of similar files.
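As a sketch of this integration, frame-level classification decisions can be smoothed within texture regions by a majority filter; the helper below is illustrative rather than the system's actual code.

```python
import numpy as np

def majority_filter(frame_labels, boundaries):
    """Replace each region's frame labels by the region's majority label.

    frame_labels -- non-negative integer class label per analysis frame
    boundaries   -- frame indices of the segmentation boundaries
    Sharp transitions are preserved because the smoothing never
    crosses a region boundary.
    """
    labels = np.asarray(frame_labels).copy()
    edges = [0] + sorted(boundaries) + [len(labels)]
    for start, end in zip(edges[:-1], edges[1:]):
        region = labels[start:end]
        if len(region):
            region[:] = np.bincount(region).argmax()
    return labels
```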
3.8 Summary
A number of fundamental audio analysis algorithms were presented in this chapter. The basic idea behind these algorithms is to learn and utilize the structure of the feature vector representation in order to extract information about audio signals. Figure 3.3 shows a high-level overview of the basic ideas described in this chapter. The audio signal is represented as a trajectory of feature vectors by breaking it into small analysis windows and computing a feature vector for each window. Segmentation can be performed by looking for abrupt changes in this trajectory of feature vectors, and classification can be performed by partitioning the feature space into regions such that the points (vectors) falling in each partition belong to the same class.

Figure 3.3: Feature-based segmentation and classification (an analysis front-end such as FFT, LPC or MFCC, followed by feature calculation and feature-based classification and segmentation boundaries)
More specifically, the following analysis techniques were presented: content-based query-by-example similarity retrieval based on musical content features, hierarchical musical genre classification, a general multifeature segmentation methodology, and audio thumbnailing.
Chapter 4
Interaction
Thought is impossible without an image.
Aristotle, On the Soul, Book III

A Picture's Meaning Can Express Ten Thousand Words¹
Phony Chinese Proverb
(and thousands of songs)
4.1 Introduction
The results of automatic audio analysis are still far from perfect and therefore need to be corrected, edited, and enhanced by a human user. Therefore the creation of user interfaces for interacting with audio signals and collections is an important direction for current research. Even if in the future automatic algorithms are vastly improved, music and sound listening is inherently subjective and therefore, in my opinion, the human user will always be part of the system.

¹Correct translation of the phony Chinese proverb typically translated as "A picture is worth ten thousand words". The "Chinese" quotation was fabricated by an advertising executive representing a baking soda company, who assumed that consumers would be compelled to buy a product that had the weight of Chinese philosophy behind it: a young boy's smile is equal to many words explaining the benefits of the product. The ad was often seen as a streetcar card ad that customers did not have much time to read. It appeared in Printers' Ink (now Marketing/Communication), March 10, 1927 (pp. 114-115).
Successful user interfaces combine the skills of humans and computers, making effective use of their different abilities. User interfaces have been shown to provide significant improvements in almost every computing field, ranging from the popular windows-based desktop to mission-critical applications such as the design of NASA control rooms. Human Computer Interaction (HCI) studies the ways humans and computers interact and tries to improve them.
Computer audition research, in many cases originating from Electrical Engineering departments, has largely ignored user interface aspects. Although publications typically describe how the proposed algorithms can be used, they rarely show concrete examples of user interfaces that utilize their results. Such interfaces, in addition to demonstrating the proposed algorithms, are important for evaluation user experiments.
Currently the common way to interact with large collections of audio files is by using commercial software developed mainly for audio recording and production purposes. Such sound editing software tools typically have well-designed but complex interfaces and allow the user to apply standard montage-like editing operations (cut, copy, paste, mix, etc.), viewing operations (zoom in, zoom out, region select, etc.) and sound effects (filtering, denoising, pitch shifting, etc.) based on the traditional tape-recorder paradigm.
These current software tools for working with audio suffer from two major limitations. The first limitation is that they view audio as a monolithic block of digital samples, without any "understanding" of the specific audio content. For example, if users wish to locate the saxophone solo in a jazz piece, they have to locate the segment manually by listening, scrolling and zooming. Although Waveform and Spectrogram displays can provide visual cues about audio content, the process of locating regions of interest is still time consuming. The second limitation is that current audio software tools are centered around the concept of a file as the main unit of processing. Although typically multiple files can be opened and mixed, and batch operations can be applied to lists of files, there are no direct graphical user interfaces for displaying, browsing and editing large audio collections. Ideally these collection interfaces should also somehow reflect the content and similarity relations of their individual files.
In this chapter, a series of different novel content- and context-aware interfaces for interacting with audio signals and collections are proposed and integrated in a common application for audio browsing, manipulation, analysis and retrieval. These components can utilize either manually or automatically extracted information, or both. It is our belief that this final case of utilizing both kinds of information provides the most effective way of interacting with large audio collections. The use of automatic techniques can significantly reduce the amount of user involvement and can provide detailed information, while the human user can weed out the imperfections and make subjective judgments.
4.2 Related Work
Recently, various graphical user interfaces for browsing image collections have been proposed, spurred by the now commonplace availability of digital photography to computer users. Various spatial techniques and the use of thumbnail pictures have been proposed and shown to improve the browsing and retrieval of images [60, 57, 100]. Several academic and commercial systems have been proposed for content-based retrieval of images. Some representative examples are the Blobworld system [11] developed at UC Berkeley, the Photobook from MIT [91] and the QBIC system from IBM [37]. In these systems the user can search, browse and retrieve images based on similarity and various automatic feature extraction methods. This work falls under the general area of Multimedia Information Management Systems [51], with specific applications to audio signals.
In the context of audio signals, most of the existing work on browsing single files concentrates on the browsing of spoken documents. SpeechSkimmer [7] is a system for interactively skimming recorded speech that pushes audio interaction beyond the tape-recorder metaphor. The user can browse spoken documents by logical units such as words, sentences and changes of speaker. Time and pitch modification techniques can be used to speed up the speech signal without affecting intelligibility. Intelligent browsing of video signals of recorded meetings, based on Hidden Markov Model (HMM) analysis using audio and visual features, is described in [15]. Although there have been early attempts [18, 87] to develop "intelligent" sound editors, most of their research effort was spent addressing problems related to the limited hardware of that time. Today, commercially available audio editors are either small-scale editors that work with relatively short sound files, or professional tools for audio recording and production. In both cases they offer very little support for browsing large collections. Typically they support montage-like operations like copying, mixing, cutting and splicing, and some signal processing tools such as filtering, reverberation and flanging.
Many of the ideas described in this chapter have their roots in the field of Scientific and Information Visualization [120, 34]. Visualization techniques have been used in many scientific domains. They take advantage of the strong pattern recognition abilities of the human visual system to reveal similarities, patterns and correlations in space and time. Visualization is best suited to areas that are exploratory in nature and where there are large amounts of data to be analyzed. Interacting with large audio collections is a good example of such an area. The concept of browsing is central to the design of the interfaces for interacting with audio collections described in this chapter. Browsing is defined as "an exploratory, information seeking strategy that depends upon serendipity ... especially appropriate for ill-defined problems and exploring new task domains" [78]. The described components have been designed following Shneiderman's mantra for the design of direct manipulation and interactive visualization interfaces: "overview first, zoom and filter, then detail on demand" [114].
Direct manipulation systems visualize objects and actions of interest, support rapid, reversible, incremental actions, and replace complex command-language syntax by direct manipulation of the object of interest [114]. Direct sonification refers to immediate aural feedback to user actions. Examples of such direct manipulation systems include the popular desktop metaphor, computer-assisted-design tools, and video games. The main property of direct manipulation interfaces is the immediate aural and visual feedback in response to user actions.
Of central importance to this work is the idea of representing sounds as visual objects in a 2D or 3D space with properties related to the audio content and context. This idea has been used in psychoacoustics in order to construct perceptual spaces that visually show similarity relations between single notes of different musical instruments. These spaces can be constructed using Multidimensional Scaling (MDS) over data collected from user studies [50]. Another application that uses the idea of a space for audio browsing is the Intuitive Sound Editing Environment (ISEE), where 2D or 3D nested visual spaces are used to browse instrument sounds as experienced by musicians using MIDI synthesizers and samplers [138, 139].
The project most closely related to this work is the Sonic Browser, a tool for accessing sounds or collections of sounds using sound spatialization and context-overview visualization techniques [35, 36]. In that work, a prototype 2D graphics browser combined with multi-stream sonic browsing was developed and evaluated. One application of this browsing system is musicological research in Irish traditional music. Although the goals of this work and the Sonic Browser are similar, there are some differences in the two approaches. The main difference is that the Sonic Browser emphasizes user interaction and direct manipulation without utilizing any automatically extracted information about the signals. In Chapter 7 the combination of the Sonic Browser with our system to create a powerful sound browsing and editing environment will be described.
Visualizations of audio signals take advantage of the strong pattern recognition properties of the human visual system by mapping audio content and context information to visual representations. A visualization of audio signals based on self-similarity is proposed in [40]. Timbregrams, a content- and context-aware visualization for audio signals, were first introduced in [127]. TimbreSpaces, a visualization for audio collection browsing, and the GenreGram monitor, a dynamic content-based real-time display, were introduced in [126]. A more complete description of the proposed interfaces, with special emphasis on their use on the Princeton Display Wall, is provided in [131].
4.2.1 Beyond the Query-by-Example paradigm
As we have seen, in recent years techniques for audio and music information retrieval have started emerging as research prototypes. These systems can be classified into two major paradigms. In the first paradigm the user sings a melody and similar audio files containing that melody are retrieved. This approach is called "Query by Humming" (QBH) [47]. Unfortunately it has the disadvantage of being applicable only when the audio data is stored in symbolic form, such as MIDI files. The conversion of generic audio signals to symbolic form, called polyphonic transcription, is still an open research problem in its infancy. Another problem with QBH is that it is not applicable to several musical genres, such as Dance music, where there is no singable melody that can be used as a query. In the second paradigm, called "Query-by-Example" (QBE) [127], an audio file is used as the query and audio files that have similar content are returned, ranked by their similarity. In order to search and retrieve general audio signals, such as mp3 files on the web, only the QBE paradigm is currently applicable.
In this chapter, new ways of browsing and retrieving audio and musical signals from large collections that go beyond the QBE paradigm will be proposed. The developed algorithms and tools rely on the automatic extraction of content information from audio signals. The main idea behind this work is to create new audio signals, either by combination of other audio signals or synthetically based on various constraints. These generated audio signals are subsequently used to auralize queries and parts of the browsing space. The ultimate goal of this work is to lay the foundations for the creation of a musical "sketchpad" which will allow computer users to "sketch" the music they want to hear. Ideally we would like the audio equivalent of sketching a green circle over a brown rectangle and retrieving images of trees (something which is supported in current content-based image retrieval systems).
In addition to going beyond the QBE paradigm for audio and music information retrieval, this work differs from the majority of existing work in two ways: 1) continuous aural feedback, and 2) use of Computer Audition algorithms. Continuous aural feedback means that the user constantly hears audio or music that corresponds to her actions. Computer Audition techniques extract content information from audio signals that is used to configure the graphical user interfaces.
The term Query User Interfaces (QUI) will be used in this chapter to describe any interface that can be used to specify in some way audio and musical aspects of the desired query. Two major families of query interfaces will be described, based on the feedback they provide to the user. The first family consists of interfaces that directly utilize audio files in order to provide feedback, while the second family consists of interfaces that generate symbolic information in MIDI format. It is important to note that in both of these cases the goal is to retrieve from general audio and music collections, not symbolic representations.
Another important influence for this work is the legacy of systems for automatic music generation and style modeling. These systems typically fall into four major categories: generating music [13], assisting composition [83], modeling style, performance and/or composers [46], and automatic musical accompaniment [24]. These references are indicative and representative of early work in each category. There are also commercial systems that generate music according to a variety of musical styles, such as the Band-in-a-Box software: http://www.sonicspot.com/bandinabox/bandinabox.html
4.2.2 Contributions
Most of the content- and context-aware interfaces described in this chapter are new, although some extend or are based on older ideas. The most original contributions are the TimbreSpace browser and Timbregrams, which are visualizations for audio collections and signals that express audio content and context information in the visual domain.
4.3 Terminology
The following terms will be used to describe the proposed systems:

Collection: a list of audio files (typical size > 100 items)

Browser: an interface for interacting with a collection of audio files

Editor: an interface for interacting with a specific audio file

Monitor: a display that is updated in real time in response to the audio that is being played

Mapping: a particular setting of interface parameters

Selection: a list of selected audio files in a collection, or a specific region in time in an audio file

Viewer: a way to visualize a specific audio file

Timeline: a particular segmentation of an audio file into non-overlapping regions in time

View: a particular way of viewing a specific collection
The user interfaces described in this chapter follow a Model-View-Controller (MVC) framework [64]. The Model part comprises the actual underlying data and the operations that can be performed to manipulate it. The View part describes specific ways the data model can be displayed, and the Controller part describes how user input can be used to control the other two parts.
Collections are named lists of audio files. For example, a collection might be named Rock80s and consist of rock music files from that decade. A collection selection consists of an arbitrary subset of a collection. Typically a selection is specified by mouse operations. Semantic zooming refers to the process of zooming to a new collection consisting of only the selected objects. In the context of audio files, a selection refers to a specific region in time of the audio file. A timeline is a segmentation of an audio file into non-overlapping, possibly annotated regions in time. For example, a jazz piece might have regions for each individual instrument solo as well as the chorus. These concepts correspond to the Model part of the MVC framework.
The central idea behind the design of the described content- and context-aware GUIs is to map audio files to visual objects with specific properties related to their content, using viewers. Collections of audio files are mapped to spatial arrangements of visual objects in browsers, showing in that way their context. A mapping describes a specific way the audio files are mapped to visual objects and their spatial arrangement. Multiple views of the same audio collection can be displayed, possibly each with its own separate mapping. These concepts correspond to the View part of the MVC framework.
The Controller part of the MVC framework in this work utilizes standard keyboard and mouse operations to control widgets such as buttons, sliders and scrollbars, so there is no need to describe it in more detail. However, we note that separating this part from the Model and View components allows the easy incorporation of alternative methods of control, such as speech input or virtual reality sensors, into the system.
4.4 Browsers
4.4.1 TimbreSpace Browser
The TimbreSpace browser maps each audio file to an object in a 2D or 3D space. The main browser properties that can be mapped are the x, y, z coordinates of each object. In addition, a shape, a texture image or color, and a text annotation can be provided for each object. Standard graphical operations such as display zooming, panning, and rotating can be used to explore the browsing space. Data model operations such as selection pointing and semantic zooming are also supported. Selections can also be specified by placing constraints on the browser and object properties. For example, the user can ask to select all the files that have positive x values, triangular shape and red color. Principal Curves, originally proposed in [52] and used for sonification in [54], can be used to move sequentially through the objects.
TimbreSpaces can be constructed automatically based on the computation of audio features and the use of dimensionality reduction techniques such as Principal Components Analysis (PCA) [56]. PCA is described in more detail in Appendix A. Dimensionality reduction techniques map a high-dimensional set of feature vectors to a set of feature vectors of lower dimensionality with minimum loss of information. Using PCA and a single feature vector representation, each file is mapped to the x, y, z coordinates of a visual object. The mapping is achieved by taking the first three principal components (which in many cases describe most of the variability of the data collection) and mapping them to the unit cube. Another possibility for automatic configuration of TimbreSpace parameters is the direct use of feature values for each axis. For example, the x-axis might correspond to the estimated song tempo in bpm, the y-axis to its length and the z-axis to auditory brightness. Finally, the object colors and/or shapes can be automatically configured using classification and clustering algorithms.
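A minimal sketch of this automatic layout, using scikit-learn's PCA as a stand-in for the system's own implementation: each file's single feature vector is projected onto the first three principal components, which are then rescaled to the unit cube.

```python
import numpy as np
from sklearn.decomposition import PCA

def timbrespace_coords(features):
    """Map one feature vector per file to x, y, z positions in [0, 1]^3."""
    coords = PCA(n_components=3).fit_transform(features)
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (coords - lo) / span             # rescale to the unit cube
```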
Figure 4.1: 2D TimbreSpace of sound effects
Figure 4.1 shows a 2D TimbreSpace of sound effects. The icons represent different types of sound effects, such as walking (brown) and various other types of sound effects (white) such as tool, telephone and doorbell sounds. Although the icons have been assigned manually, the x, y coordinates of each icon are calculated automatically based on timbral texture features. In that way, files that are similar in content are visually clustered together, as can be seen from the figure, where the brown walking sounds occupy the left side and the white sounds occupy the right side.
Figure 4.2 shows a 3D TimbreSpace, also of a collection of sound effects; similarly to Figure 4.1, the walking sounds are colored dark. The walking sounds are automatically detected using K-means clustering.
Figure 4.3 shows a 3D TimbreSpace of different pieces of orchestral music. Each piece is represented as a colored rectangle. The x, y, z coordinates are automatically extracted based on music similarity, and the rectangle coloring is based on Timbregrams, which are described in Section 4.4.2. Both figures contain fewer objects than typical configurations, for clarity of presentation on paper. Audio collections have sizes typically ranging from 100 to 1000 files/objects.

Figure 4.2: 3D TimbreSpace of sound effects

Figure 4.3: 3D TimbreSpace of orchestral music
4.4.2 Timbregram viewer and browser
The basic idea behind Timbregrams is to map audio files to sequences of vertical color stripes where each stripe corresponds to a short slice of sound (typical sizes 20 milliseconds to 0.5 seconds). Time is mapped from left to right. The similarity of different files (context) is shown as overall color similarity, while the similarity within a file (content) is shown by color similarity within the Timbregram. For example, a file that has an ABA structure, where section A and section B have different sound textures, will have an ABA structure in color also. Although it is possible to create Timbregrams manually, typically they are created automatically using Principal Components Analysis (PCA) over automatically extracted feature vectors.
Two main approaches are used for mapping the principal components to color to create Timbregrams (a sketch of the second follows below). If an indexed image is desired, then the first principal component is divided equally and each interval is mapped to an index into a colormap. Any standard visualization colormap, such as Greyscale or Thermometer, can be used. This approach works especially well if the first principal component explains a large percentage of the variance of the data set. In the second approach, the first three principal components are normalized so that they have equal means and variances. Although this normalization distorts the original feature space, in practice it provides more satisfactory results, as the colors are more clearly separated. Each of the first three principal components is mapped to a coordinate in an RGB or HSV color space. In this approach a full-color Timbregram is created. There is a trade-off between the ability to show small-scale local structure and global overall similarity, depending on the quantization levels and the amount of variance allowed. For example, by allowing many quantization levels and large variance, different sections of the same audio file will be visually separated. If only an overall color is desired, fewer quantization steps and smaller variation should be used.
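A sketch of the second (full-color) approach, using NumPy and scikit-learn as stand-ins for the actual implementation: per-window feature vectors are projected onto the first three principal components, normalized to equal mean and variance, and mapped to RGB stripes.

```python
import numpy as np
from sklearn.decomposition import PCA

def timbregram_colors(window_features):
    """Map per-window feature vectors to one RGB stripe per window."""
    pcs = PCA(n_components=3).fit_transform(window_features)
    # Normalize the three components to equal mean and variance so the
    # colors separate clearly (this distorts the original feature space).
    pcs = (pcs - pcs.mean(axis=0)) / pcs.std(axis=0)
    # Clip to +/- 2 standard deviations and rescale into RGB [0, 1].
    return (np.clip(pcs, -2.0, 2.0) + 2.0) / 4.0
```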
It is important to note that in this case the similarity in color depends on the context, i.e. the collection over which PCA is calculated. That means that two files might have Timbregrams with similar colors as part of collection A and Timbregrams with different colors as part of collection B. For example, a string quartet and an orchestral piece will have different Timbregrams if viewed as part of a classical music collection, but similar Timbregrams if viewed as part of a collection of arbitrary music. This is in contrast to techniques that directly map frequency or timbral content to color, where the coloring of the file depends only on the file itself (content) and not on the larger collection it belongs to (context).
Figure 4.4: Timbregram superimposed over waveform
Timbregrams can be arranged in 2D tables for browsing. The table axes can be either computed automatically or created manually. For example, one axis might correspond to the year of release and the other to the tempo of the song. In addition, Timbregrams can be superimposed over traditional Waveform displays (see Figure 4.4) and texture-mapped over objects in a TimbreSpace (see Figure 4.3).
Figure 4.5: Timbregrams of music and speech
Figure 4.5 shows the Timbregrams of six sound files. The three files in the left column contain speech and the three on the right contain classical music. It is easy to visually separate music and speech even in the greyscale image. It should be noted that no explicit class model of music and speech is used; the different colors are a direct result of the visualization technique. The bottom right sound file (opera) is light purple and its speech segments are light green. In this mapping, light and bright colors correspond to speech or singing (Figure 4.5, left column), while purple and blue colors typically correspond to classical music (Figure 4.5, right column). Similarly, Figure 4.6 shows Timbregrams of Classical (left), Rock (middle), and Speech (right). Again the colors are a direct consequence of the visualization mapping and no class model is used.
Timbregrams of pieces of orchestral music can be viewed in Figure 4.7. From inspecting the figure visually, it is clear that the fourth piece from the top has an AB structure, where the A part is similar to the second piece from the top and the B part is similar to the last piece from the top. This is confirmed by listening to the corresponding pieces, where A is a loud, energetic moment in which the whole orchestra is playing and B is a lightly orchestrated flute solo.

Figure 4.6: Timbregrams of classical, rock and speech

Figure 4.7: Timbregrams of orchestral music
4.4.3 Other Browsers
The following browsers are based on existing graphical user interfaces and can be combined, sharing the same underlying data representation, in arbitrary ways with the previously described browsers. For example, selecting a file in a TimbreSpace can also be reflected in a Tree browser and auralized using a SoundSpace.
Standard
The Standard browser provides a scrollable list of filenames corresponding to the audio files of a collection. Standard single and multiple mouse selection are supported.
Tree
The Tree browser renders the audio signals of a collection based on a tree representation. This tree representation can be generated by automatic classification or created manually. For example, the first level of the tree can be Music vs Speech, and Speech can be further subdivided into Male vs Female, and so forth.
SoundSpace
The SoundSpace browser takes advantage of the cocktail-party effect in a similar fashion to the Sonic Browser system described in [36]. As the user browses, a neighborhood (aura) around the currently selected item is auralized by playing all the corresponding files simultaneously. When used with the Princeton Display Wall (see Section 4.10), each file of the aura is mapped to one of the 16 loudspeakers of the system and the intensity levels are normalized for better presentation. Although all 16 loudspeakers can be used, we have found that at most 8 simultaneous audio streams are useful. Using more advanced 3D audio rendering techniques to place files anywhere in the space is planned for the future. To shorten the playback time, equal-length audio thumbnails can be automatically generated for each file.

Figure 4.8: Enhanced Audio Editor I
4.5 Enhanced Audio Editor
The Enhanced Audio Editor looks like a typical tape-recorder-style waveform editor. However, in addition to the typical play, fast-forward, rewind and stop buttons, it allows skipping by either user-defined fixed-duration blocks or timelines containing regions of variable duration. These timelines can be either created by hand or generated using automatic audio segmentation.

Figure 4.9: Enhanced Audio Editor II
Skipping and annotating using regions is much faster than manual annotation, in the same way that finding a song on a CD is much faster than finding it on a tape.
The user can select a region and retrieve similar sounds. Another possibility is to classify the region using one of the available classification schemes, such as Music/Speech or Genre classification. Finally, each time region can be annotated with multiple keywords. In addition, the user can combine time regions to form a time tree that can be used for multi-resolution browsing and annotation. The tree captures the hierarchical nature of music pieces, and therefore can be used for musical analysis.
Figure 4.8 shows an automatically created segmentation of an audio file. The user can browse by segments and automatically classify them. Figure 4.9 shows a snapshot of the editor containing a Waveform display with a Timbregram superimposed, as well as a Spectrum display.

Figure 4.10: TimbreBall and GenreGram monitors
4.6 Monitors
4.6.1 GenreGram
The GenreGram monitor is a dynamic real-time audio display for showing automatic genre classification results. Although it can be used with any audio signals, it was designed for real-time classification of live radio signals. Each genre is represented as a cylinder that moves up and down in real time based on a classification confidence measure ranging from 0.0 to 1.0. Each cylinder is texture-mapped with a representative image for its genre. In addition to being a demonstration of real-time automatic musical genre classification, the GenreGram provides valuable feedback both to users and to algorithm designers. Different classification decisions and their relative strengths are combined visually, revealing correlations and classification patterns. Since in many cases the boundaries between genres are fuzzy, a display like this is more informative than a single classification decision. For example, both male speech and hiphop are activated in the case of a hiphop song, as shown in Figure 4.10. Of course it is possible to use GenreGrams to display other types of audio classification, such as instrument, sound effect, and birdsong classification.
4.6.2 TimbreBall monitor
This monitor is a tool for visualizing in real time the evolution of feature vectors extracted from an audio signal. In this animation each feature vector is mapped to the x, y, z coordinates of a small ball inside a cube. The ball moves in the space as the sound is playing, following the evolution of the corresponding feature vectors. Texture changes are visible as abrupt jumps in the trajectory of the ball. A shadow is provided for better depth perception (see Figure 4.10).
4.6.3 Other monitors
These monitors are based on existing visualization techniques for audio signals.
Waveform
Display of the signal's amplitude envelope, familiar from commercial editors.
Spectrum
Display that shows the magnitude in decibels (dB) of the Short Time Fourier Transform (STFT) of the signal. See Figure 4.9.
Spectrogram
The Spectrogram renders the same information as the Spectrum as a greyscale image. Time is mapped from left to right and the frequency amplitudes are mapped to greyscale values according to their magnitude. See Figure 4.4.
Waterfall
This monitor shows a waterfall-like cascade of Spectrum plots in 3D space.
Metronome
This monitor shows the tempo of the music in beats per minute (bpm) and a confidence level for how strong the beat is. The main beat and its strength are detected automatically.
FeaturePlot
This monitor shows a plot of a single user-specified feature.
4.7 Audio-based Query Interfaces
In audio and music information retrieval there is a query and a collection that is searched for similar files. The main idea behind audio-based QUIs is to utilize not only the query's audio but also the audio collection directly. Since the structure and content of the collection are already automatically or manually extracted for retrieval, this information can be used together with the audio to provide interactive aural feedback to the user. This idea will be clarified in the following sections with examples of audio-based QUIs.
4.7.1 Sound Sliders
Sound sliders are used to browse audio collections based on continuous attributes. For presentation purposes, assume that for each audio file in a collection the tempo and beat strength of the song are automatically extracted. For each of these two attributes a corresponding sound slider is created. For a particular setting of the sliders, the sound file with attributes corresponding to the slider values is selected and part of it is played in a loop. If more than one sound file corresponds to the particular slider values, the user can advance in circular fashion to the next sound file by pressing a button. One way to view this is that for each particular setting of slider values there is a corresponding list of sound files that have matching attributes. When the sliders are moved, the sound is crossfaded to a new list that corresponds to the new slider settings. The extraction of the continuous attributes and their sorting is performed automatically.

Figure 4.11: Sound sliders and palettes.
In current audio software, sliders are typically first adjusted, and then, by pressing a submit button, files that correspond to these parameters are retrieved. Unlike this traditional use of sliders for setting parameters, sound sliders provide continuous aural feedback that corresponds to the actions of the user (direct sonification). So, for example, when the user sets the tempo to 150 beats per minute (bpm) and beat strength to its highest value, there is immediate feedback about what the values represent, by hearing a correspondingly fast song with a strong rhythm. Another important aspect of sound sliders is that they are not independent: the aural feedback is influenced by the settings of all of them. Figure 4.11 shows a screenshot of the sound sliders used in our system.
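The selection logic behind sound sliders can be sketched as follows; the attribute names, normalization and tolerance are hypothetical illustrative choices, not the system's actual parameters.

```python
import numpy as np

def files_for_sliders(slider_values, attributes, tol=0.05):
    """Return indices of files whose attributes match the slider setting.

    slider_values -- current slider positions, e.g. (tempo, beat_strength),
                     normalized to [0, 1]
    attributes    -- 2-D array of the same attributes, one row per file,
                     normalized the same way
    tol           -- how close an attribute must be to count as a match
    The caller cycles through the returned list, crossfading whenever
    the sliders move to a new setting.
    """
    close = np.abs(attributes - np.asarray(slider_values)) < tol
    return np.flatnonzero(np.all(close, axis=1))
```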
4.7.2 Sound palettes
Sound palettes are similar to sound sliders but apply to browsing discrete attributes. A palette contains a fixed set of visual objects (text, images, shapes) that are arranged based on discrete attributes. At any time only one of the visual objects can be selected, by clicking on it with the mouse. For example, objects might be arranged in a table by genre and year of release. Continuous aural feedback similar to the sound sliders is supported. The bottom left corner of Figure 4.11 shows a content palette corresponding to musical genres. Sound palettes and sliders can be combined in arbitrary ways.
4.7.3 Loops
In the previously described query interfaces it is necessary to play back sound files continuously in a looping fashion. Several methods for looping are supported in the system. The simplest method is to loop the whole file, with crossfading at the loop point. The main problem with this method is that the file might be too long for browsing purposes. For files with a regular rhythm, automatic beat detection tools are used to extract loops typically corresponding to an integer number of beats. Another approach, which can be applied to arbitrary files, is to loop based on spectral similarity. In this approach the file is broken into short windows, and for each window a numerical representation of the main spectral characteristics is calculated. The file is then searched for windows that have similar spectral characteristics, and these windows can be used to achieve smooth looping points. Finally, automatic thumbnailing methods that utilize more high-level information can be used to extract representative short thumbnails.
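A sketch of the spectral-similarity approach, using NumPy's FFT; the window sizes and minimum loop length are hypothetical parameters, and a real system would also crossfade or phase-align the splice point.

```python
import numpy as np

def loop_points(audio, sr, win=2048, hop=1024, min_gap_sec=1.0):
    """Find two spectrally similar windows to use as loop points.

    Returns (start_sample, end_sample) of the smoothest loop found.
    """
    # Magnitude spectrum of every analysis window.
    n = (len(audio) - win) // hop
    spectra = np.array([np.abs(np.fft.rfft(audio[i * hop:i * hop + win]))
                        for i in range(n)])
    min_gap = int(min_gap_sec * sr / hop)  # enforce a minimum loop length
    best, best_dist = (0, n - 1), np.inf
    for i in range(n - min_gap):
        # Distance from window i to every sufficiently later window.
        d = np.linalg.norm(spectra[i + min_gap:] - spectra[i], axis=1)
        j = int(np.argmin(d))
        if d[j] < best_dist:
            best, best_dist = (i, i + min_gap + j), d[j]
    start, end = best
    return start * hop, end * hop
```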
4.7.4 Music Mosaics
Loop and thumbnail techniques can be used to create short representations of audio signals and collections. Another possibility for the representation of audio collections is the creation of Music Mosaics, which are pieces created by concatenating short segments of other pieces [111, 148]. For example, in order to represent a collection of songs by the Beatles, a music mosaic that would sound like Hey Jude could be created by concatenating small pieces of other Beatles songs. Time-stretching based on beat detection and overlap-add techniques can also be used in Music Mosaics.
Figure 4.12: 3D sound effects generators.
4.7.5 3D sound effects generators
In addition to digital music distribution, another area where large audio collections are utilized is libraries of digital sound effects. Most of the time, when we hear a door opening or a telephone ringing in a movie, the sounds were not actually recorded during the filming of the movie but are taken from libraries of prerecorded sound effects.
Searching libraries of digital sound effects poses new challenges to audio information retrieval. Of special interest are sound effects where the sound depends on the interaction of a human with a physical object. Some examples are the sound of walking on gravel, or the rolling of a can on a wooden table. In contrast to non-interactive sound effects such as a doorbell, the production of these sounds is closely tied to a mechanical motion.
In order to go beyond the QBE paradigm for content retrieval of sound effects, query generators that somehow model the human production of these sounds are desired. Towards this goal we have developed a number of interfaces with the common theme of providing a user-controlled animated object connected directly to the parameters of a synthesis algorithm for a particular class of sound effects. These 3D interfaces not only look similar to their real-world counterparts but also sound similar. This work is part of the Physically Oriented Library of Interactive Sound Effects (PhOLISE) project [19]. This project uses physical and physically motivated analysis and synthesis algorithms, such as modal synthesis, banded waveguides [33], and stochastic particle models [20], to provide interactive parametric models of real-world sound effects.
Figure 4.12 shows two such 3D sound effects query generators. The PuCola Can, shown on the right, is a 3D model of a soda can that can be slid across various surface textures. The sliding speed and texture material are controlled by the user and result in appropriate changes to the sound in real time. The Gear, on the left, is a real-time parametric synthesis of the sound of a turning wrench. Other developed generators are: a model of falling particles on a surface where parameters such as density, speed, and material are controlled, and a model of walking sounds where parameters such as gait, heel and toe material, weight and surface material are controlled.
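As a concrete illustration of the modal-synthesis family of algorithms mentioned above, the sketch below sums exponentially damped sinusoids; the mode frequencies, gains, and decay rates are invented placeholder values, not the parameters of the actual PhOLISE models.

#include <cmath>
#include <vector>

struct Mode { double freqHz, amp, decayPerSec; };

// Render one "hit" as a sum of exponentially damped sinusoids (modes).
std::vector<float> synthesizeHit(const std::vector<Mode>& modes,
                                 double seconds, double sr = 22050.0) {
    const double PI = 3.14159265358979;
    int n = (int)(seconds * sr);
    std::vector<float> out(n, 0.0f);
    for (const Mode& m : modes)
        for (int t = 0; t < n; ++t) {
            double time = t / sr;
            out[t] += (float)(m.amp * std::exp(-m.decayPerSec * time)
                              * std::sin(2.0 * PI * m.freqHz * time));
        }
    return out;
}

// Hypothetical example: three modes loosely evoking a small metal can.
// std::vector<Mode> can = {{520, 0.6, 6.0}, {1370, 0.3, 9.0}, {2900, 0.1, 14.0}};
// std::vector<float> sound = synthesizeHit(can, 1.0);

In an interface like the PuCola Can, user-controlled parameters such as sliding speed or surface material would continuously remap quantities like these mode gains and decay rates.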
4.8 MIDI-based Query Interfaces
In contrast to audio-based QUIs, MIDI-based QUIs do not utilize the audio collection directly but rather synthetically generate query audio signals from a symbolic representation (MIDI) that is created based on the user actions. The parameters used for the generation of this query and/or the automatically analyzed generated audio are used to retrieve similar audio pieces from a large collection. It is important to note that although MIDI is used for the generation of the queries, these interfaces are used for searching audio collections. These concepts will be clarified in the subsequent sections with concrete examples of MIDI-based QUIs.
Figure 4.13: Groovebox
4.8.1 The groovebox
The groovebox is similar to standard software drum machine emulators, where beat patterns are created using a rectangular grid of note values and drum sounds. Figure 4.13 shows a screenshot of a groove machine where the user can create a beat pattern or select it from a set of predefined patterns and speed it up and down. Sound files are retrieved based on the interface settings as well as by audio analysis (Beat Histograms [133]) of the generated beat pattern. Another possibility is to use beat analysis methods based on extracting specific sound events (such as bass drum and cymbal hits) and match each drum track separately. The possibility of automatically aligning the generated beat pattern with the retrieved audio track and playing both as the user edits the beat pattern is under development. The code for the groovebox is based on demonstration code for the JavaSound API.
Figure 4.14: Indian music style machine
4.8.2 The tonal knob
The tonal knob shows a circle of fifths, which is an ordering of the musical pitches so that harmonically related pitches are successively spaced in a circle. The user can select the tonal center and hear a chord progression in a specified musical style that establishes that tonic center. Pieces with the same tonal center are subsequently retrieved using Pitch Histograms [133].
4.8.3 Style machines
Style machines are closer to standard automatic music generation interfaces. For each particular style we are interested in modelling for retrieval, a set of sliders and buttons corresponding to various attributes of the style is used to generate an audio signal in real time. This signal is subsequently used as a query for retrieval purposes. Style attributes include tempo, density, key center, and instrumentation. Figure 4.14 shows a screenshot of Ragamatic, a style machine for Indian music.
Figure 4.15: Integration of interfaces
4.9 Integration
Although the described graphical user interface components were presented separately, they are all integrated following a Model-View-Controller user interface design framework [64]. The Model part comprises the actual underlying data and the operations that can be performed to manipulate it. The View part describes specific ways the data model can be displayed, and the Controller part describes how user input can be used to control the other two parts. By sharing the Model part, the graphical user interface components can affect the same underlying data. That way, for example, sound sliders, sound palettes, and style machines can all be used together to browse the same audio collection visualized as a Timbrespace and Timbregram table and edited using the Enhanced Audio Editor. Figure 4.15 shows a snapshot of one of the possible combinations of the described interfaces. A minimal sketch of this shared-Model arrangement follows.
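The sketch below uses hypothetical class names, not the actual implementation: every view observes the one underlying collection, so a selection made through any component is reflected in all of them.

#include <vector>

class View {                       // base class for Timbrespace, Timbregram, ...
public:
    virtual void refresh() = 0;    // redraw against the shared model
    virtual ~View() {}
};

class CollectionModel {            // the shared Model part
public:
    void attach(View* v) { views_.push_back(v); }
    void select(int fileIndex) {   // a Controller calls this on user input
        selected_ = fileIndex;
        for (View* v : views_) v->refresh();  // every attached view updates
    }
    int selected() const { return selected_; }
private:
    std::vector<View*> views_;
    int selected_ = -1;
};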
Of course, in a complete music information retrieval system, traditional graphical user interfaces for searching and retrieval such as keyword or metadata searching would also be used in addition to the described interfaces. These techniques and ideas are well known and therefore are not described in this chapter, but they are part of the developed system.
Although the description of the interfaces in this chapter has emphasized music retrieval, it is clear that such a system would also be useful for music creation. For example, sound designers, composers (especially those utilizing ideas from Musique Concrète and Granular Synthesis), and DJs would all benefit from novel ways of interacting with large collections of audio signals. It is also possible that many of the proposed ideas and interfaces can also be used for browsing and interacting with other types of temporal signals such as video and speech signals.
4.10 The Princeton Display Wall
Working with many audio files using the described interfaces on a desktop monitor is difficult because of the large number of windows and screen space required. In addition, the desktop monitor is not appropriate for interactive collaborations among multiple simultaneous users. To overcome this difficulty, Marsyas has been used as an application for the Princeton Scalable Display Wall (PDW) project. The Princeton Scalable Display Wall project explores building and using a large-format display with multiple commodity components.

Figure 4.16: Marsyas on the Display Wall

The prototype system has been operational since March 1998. It comprises an 8 x 18 foot rear-projection screen with a 4 x 2 array of projectors. Each projector is driven by a commodity PC with an off-the-shelf graphics accelerator. The multi-speaker display system has 16 speakers around the area in front of the wall and is controlled by a sound server PC that uses two 8-channel sound cards [22]. A more detailed description of the project can be found in [69]. This large immersive display allows for visual and aural presentation of detailed information for browsing and editing large audio collections and supports interactive collaborations among multiple simultaneous users. More information about the use of Marsyas on the PDW can be found in [131], and Figure 4.9 shows a variety of the developed interfaces on the PDW.
4.11 Summary
In this chapter, a variety of content- and context-aware user interfaces for browsing and interacting with audio signals and collections were presented. They are based on ideas from Visualization and Human Computer Interaction and utilize automatically extracted audio content information. Timbrespaces and Timbregrams are visualizations of collections and signals that map audio content to properties of visual objects. They can automatically adapt to audio data using automatic feature extraction and Principal Component Analysis (PCA). A series of interfaces for specifying audio queries that go beyond the traditional Query-by-Example paradigm was also presented.
Chapter 5

Evaluation

Knowledge must come through action; you can have no test which is not fanciful, save by trial.

Sophocles, Trachiniae

The first principle is that you must not fool yourself - and you are the easiest person to fool.

Richard Feynman, The Pleasure of Finding Things Out
5.1 Introduction
In the previous chapters a variety of ways for representing and analyzing audio signals and collections was presented. Evaluation of Computer Audition research, as of any type of research that attempts to model how humans process and understand sensory input, is difficult, as there is no single correct answer. Unlike other areas in Computer Science (for example sorting algorithms) where there is a well-defined answer to the problem being solved, answers in Computer Audition are usually fuzzy and subjective in nature. That observation does not mean that evaluation of Computer Audition is impossible; however, it requires carefully designed experiments that in many cases involve user studies. Another way to increase confidence in the evaluation results is to combine results from different experiments. For example, MPEG-based features will be compared to other proposed features both in the context of automatic classification and segmentation.

Another problem facing the evaluation of Computer Audition algorithms and systems is the relatively recent appearance of the field and the resulting lack of standardized evaluation data collections and results. In some cases of new types of analysis, such as segmentation and thumbnailing, it is not even clear how algorithms can be evaluated, and evaluation experiments have to be designed from scratch.
5.2 Related Work
In a sense all the work described in this section is a contribution, as it consists of evaluation results about the algorithms and systems that have been described. These results are calculated automatically and by conducting user studies. As there are no standardized datasets in Computer Audition, direct comparison with existing results is difficult in most cases; when it is possible to do so, direct and indirect comparisons will be provided. Therefore, in most cases the only way to evaluate and compare the performance of classifiers is by conducting empirical experiments.
5.2.1 Contributions
The basic contributions described in this chapter are the following: 1) the design of user experiments for evaluating segmentation into arbitrary sound "textures"; 2) evidence from the conducted user experiments that humans are consistent when segmenting audio, that their performance can be approximated automatically, and that providing an editable automatic segmentation does not bias the resulting segmentation; 3) MPEG-based features have comparable performance with other proposed feature sets for music/speech classification and segmentation; 4) DWT-based features have comparable performance with other proposed feature sets for audio classification; 5) effective similarity retrieval is possible using timbral and beat information features; 6) Beat Histograms can be used for tempo estimation and provide a good representation of rhythmic content.

The exact experiments, numbers, and details that support the previous statements will be described in the rest of this chapter.
Figure 5.1: Musical genres hierarchy (Music: Classical, Country, Disco, HipHop, Jazz, Rock, Blues, Reggae, Pop, Metal; Classical: Choir, Orchestra, Piano, String Quartet; Jazz: BigBand, Cool, Fusion, Piano, Quartet, Swing; Speech: Male, Female, Sports)
5.3 Collections
In order to train and evaluate statistical pattern recognition classifiers it is important to have large representative datasets. In this section the collections used to conduct the experiments described in this chapter are described.

Figure 5.1 shows the hierarchy of musical genres used for evaluation, augmented by a few (3) speech-related categories. In addition, a music/speech classifier similar to [110] has been implemented. For each of the 20 musical genres and 3 speech genres, 100 representative excerpts were used for training. Each excerpt was 30 seconds long, resulting in (23 * 100 * 30 seconds = 19 hours) of training audio data. To ensure variety of different recording qualities, the excerpts were taken from radio, compact disks, and mp3 compressed audio files. The files were stored as 22050 Hz, 16-bit, mono audio files. An effort was made to ensure that the training sets are representative of the corresponding musical genres. The Genres dataset has the following classes: Classical, Country, Disco, HipHop, Jazz, Rock, Blues, Reggae, Pop, Metal. The Classical dataset has the following classes: Choir, Orchestra, Piano, String Quartet. The Jazz dataset has the following classes: BigBand, Cool, Fusion, Piano, Quartet, Swing.

For the similarity retrieval experiment, a collection consisting of 1000 distinct 30-second rock song excerpts was used. The files were stored as 22050 Hz, 16-bit, mono audio files. In addition, collections of sound effects and isolated musical instrument tones have also been used to design and develop the proposed algorithms.
5.4 Classification evaluation
Although a labeled training set is used in training automatic classifiers, we are really interested in how the classifier performs when presented with data it has not encountered before, i.e., its generalization performance. It can be shown that if the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one learning or classification method over another (the No Free Lunch Theorem; see Chapter 9 of [30]). If one algorithm seems to outperform another in a particular situation, it is a consequence of its fit to the particular problem, not the general superiority of the algorithm. Therefore, in order to evaluate and compare different classifiers it is necessary to conduct empirical studies and experiments.
5.4.1 Cross-Validation
When evaluating an automatic classification method, there is typically a set of labeled data that, in addition to training, has to be used for testing. In simple validation the set of labeled samples is randomly split into two sets: one is used as the traditional training set for adjusting the model parameters of the classifier. The other set, called the validation set, is used to estimate the generalization error, which can be used to compare different classifiers. Obviously this error will depend on the particular split of the data into training and testing, and there is the danger of overfitting the training data.

In order to avoid this problem, a simple generalization of the above method is m-fold cross-validation. Here the training set is randomly divided into m disjoint sets of equal size n/m, where n is the total number of labeled patterns. The classifier is trained m times, each time with a different set held out as a validation set. The estimated performance is the mean of these m errors. In the limit where m = n, the method is called the leave-one-out approach.

A specific detail that is important when evaluating classifiers of audio files represented as trajectories of feature vectors is to never split feature vectors from the same file between training and testing. This is important since there is a good deal of frame-to-frame correlation in vectors of the same file, so splitting the file would give an incorrect estimate of classifier performance for truly novel data. The results presented in this chapter do not split same-file feature vectors and are based on 10-fold cross-validation with 100 iterations. A sketch of this file-aware fold construction is shown below.
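The following sketch (names and data layout are illustrative, not the actual Marsyas code) forms the folds over file ids, so that all feature vectors of a file land on one side of the split.

#include <algorithm>
#include <random>
#include <vector>

struct Sample { std::vector<float> features; int label; int fileId; };

// Returns, for each of the m folds, the list of file ids held out for testing.
std::vector<std::vector<int>> makeFolds(int numFiles, int m, unsigned seed) {
    std::vector<int> files(numFiles);
    for (int i = 0; i < numFiles; ++i) files[i] = i;
    std::mt19937 rng(seed);
    std::shuffle(files.begin(), files.end(), rng);
    std::vector<std::vector<int>> folds(m);
    for (int i = 0; i < numFiles; ++i) folds[i % m].push_back(files[i]);
    return folds;
}
// For fold k: train on every Sample whose fileId is not in folds[k] and test
// on the rest, so same-file frames never straddle the split; repeat with new
// seeds (e.g. 100 iterations) and average the fold errors.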
5.5 Segmentation experiments
Table 5.1: Free segmentation

           Human           Automatic
           Agreement       Fixed Budget    Best Effort
           AG       %      FB      %       BE      MX    %
Classic1   8/14     57     7/8     87      7/8     8     87
Classic2   7/11     63     5/7     71      6/7     14    85
Classic3   5/13     38     2/5     40      3/5     16    60
Jazz1      3/14     21     2/3     66      3/3     16    100
Jazz2      5/8      62     3/5     60      5/5     10    100
JazzRock   6/9      66     0/6     0       5/6     12    83
Pop1       5/6      83     4/5     80      5/5     10    100
Pop2       5/10     50     4/5     80      4/5     5     80
Radio1     4/10     40     3/4     75      4/4     10    100
Radio2     8/11     72     5/8     62      6/8     11    75
Total      56/106   55     35/56   62      48/56   11    87

Unlike classification, where there are well-established evaluation techniques, segmentation provides a challenge. Methods that rely on previous segmentation can use the classification annotations as the ground truth; however, segmentation of arbitrary textures is more tricky and required the design of new user experiments. The main questions the designed user experiments attempt to answer are:
- Are humans consistent when segmenting audio signals? In other words, is there a common agreement between different subjects about what constitutes a good segmentation?

- Can their performance be approximated automatically, and how accurate is this approximation?

- Does providing an editable automatic segmentation and allowing the users to modify it bias their segmentation results? The answer to this question is important because, if it is yes, then automatic segmentation cannot be directly used as a preprocessing step.
Two studies [125, 128] were conducted in order to address these questions. The purpose of the first experiment was to explore what humans do when asked to segment audio and to compare those results with the automatic segmentation method.
Table 5.2: 4 ± 1 segmentation

           Human          Automatic
           Agreement      Fixed Budget    Best Effort
           AG      %      FB      %       BE      MX    %
Classic1   4/6     66     0/4     0       4/4     10    100
Classic2   4/7     57     3/4     75      3/4     4     75
Classic3   4/7     57     2/4     50      2/4     4     50
Jazz1      3/11    27     2/3     66      3/3     16    100
Jazz2      5/6     83     3/5     60      5/5     10    100
JazzRock   5/7     71     1/5     20      4/5     12    80
Pop1       4/7     57     2/4     50      4/4     10    100
Pop2       5/7     71     4/5     80      4/5     5     80
Radio1     4/6     66     3/4     75      4/4     10    100
Radio2     5/7     71     2/5     40      4/5     10    80
Total      43/71   62     22/43   51      37/43   9     86
Table 5.3: 8 ± 2 segmentation

           Human           Automatic
           Agreement       Fixed Budget    Best Effort
           AG       %      FB      %       BE      MX    %
Classic1   8/14     57     7/8     87      7/8     8     87
Classic2   7/10     70     5/7     71      6/7     14    85
Classic3   9/11     81     6/9     66      6/9     9     66
Jazz1      4/15     26     3/4     75      3/4     4     75
Jazz2      5/11     45     3/5     60      5/5     10    100
JazzRock   7/9      77     3/7     42      5/7     12    71
Pop1       6/12     50     5/6     83      6/6     10    100
Pop2       8/13     61     4/8     50      5/8     16    62
Radio1     7/10     70     3/7     42      4/7     10    57
Radio2     9/11     81     5/9     55      6/9     11    66
Total      70/116   61     44/77   63      53/70   10    77
The data used for both studies consists of 10 sound files about 1 minute long. A variety of styles and textures is represented. In the first study, 10 subjects were asked to segment each sound file using standard audio editing tools in 3 ways. The first way, which we call free, is breaking up the file into any number of segments. The second and third ways constrain the users to a specific budget of total segments: 4 ± 1 and 8 ± 2.
The results of the first study are shown in Tables 5.1, 5.2 and 5.3. The segments that more than 5 of the 10 subjects agreed upon were used for comparison. The AG column shows the number of these salient segments compared to the total number of all segments marked by any subject. It is a measure of consistency between the subjects. For comparing the automatic method, a segment boundary was considered to be the same if it was within 0.5 sec of the average human boundary. This was based on the deviation of segment boundaries between subjects. FB (fixed budget) refers to automatic segmentation by requesting the same number of segments as the salient human segments. BE (best effort) refers to the best automatic segmentation achieved by incrementally increasing the number of regions up to a maximum of 16. MX is the number of segments necessary to achieve the best effort segmentation. The boundary matching criterion is sketched below.
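Illustrative code, not the code used for the study: an automatic boundary counts as a hit if it lies within ±0.5 sec of a human boundary, and each automatic boundary can match at most once.

#include <cmath>
#include <vector>

int countHits(const std::vector<double>& humanSecs,
              const std::vector<double>& autoSecs,
              double tolSecs = 0.5) {
    int hits = 0;
    std::vector<bool> used(autoSecs.size(), false);  // one match per boundary
    for (double h : humanSecs)
        for (size_t j = 0; j < autoSecs.size(); ++j)
            if (!used[j] && std::fabs(autoSecs[j] - h) <= tolSecs) {
                used[j] = true;
                ++hits;
                break;
            }
    return hits;  // e.g. 7 of 8 salient boundaries matched -> a "7/8" entry
}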
The results show that humans are consistent when segmenting audio (more than half of the segments are common for most of the subjects). In addition, they show that human segmentation can be approximated by automatic algorithms. The biggest problem seems to be the perceptual weighting of a texture change. For example, many errors involved soft speech entrances that were marked by all the subjects although they were not significant as changes in the computed feature space. The automatic segmentation results are usually perceptually justified and a superset of the human segmented regions. In the first study subjects were allowed to use the sound editor of their choice, therefore no timing information was collected.
In the second study [128] a new set of user experiments was conducted to re-confirm the results of [125] and to answer some additional questions. The main question we tried to answer was whether providing an automatic segmentation as a basis for editing would bias the resulting segmentation. In addition, information about the time required to segment and annotate audio was collected. The use of automatically segmented time lines can greatly accelerate the segmentation and annotation process. Unlike the first study [125], where participants could use a sound editor of their choice, in the second study they used a segmentation-annotation tool based on the Enhanced Sound Editor described in Chapter 4. The segmentation-annotation tool consists of a graphical user interface looking like a typical sound editor. Using a waveform display, arbitrary regions can be selected for playback, and annotated time lines can be loaded and saved. Each segmented region is colored differently, and the user can move forward and backward through those regions. In addition to the typical sound editor functionality, the system can automatically segment the audio to suggest a time line. The resulting regions can then be edited by adding/deleting boundaries until the desired segmentation is reached.
In Figure 5.2, histograms of subject agreement are shown. To calculate the histogram, all the segmentation marks were collected and partitioned into bins of agreement in the following manner: all the segment marks within ±0.5 sec were considered as corresponding to the same segment boundary. This value was calculated based on the differences between the exact location of the segment boundary between subjects and was confirmed by listening to the corresponding transitions. Since at most all ten subjects can contribute marks within this standard deviation, the maximum number of agreeing subjects is 10. In the figure, the different lines show the histogram of the experiments in [125] (old), the results of the second study (new), and the total histogram of both studies (total). The total histogram is calculated by considering the 20 subjects and dividing by two to normalize (notice that this is different from taking the average of the two histograms, since this is done on a boundary basis).

Figure 5.2: Histograms of subject agreement (A: free, B: 8 ± 2, C: 4 ± 1; percentage of total segment marks vs. number of agreeing subjects) and learning curve for segmentation and annotation (D: minutes vs. sound file order number)

Therefore, if in the two studies the boundaries had changed, indicating bias because of the automatically suggested segmentation, the shape of the total histogram would drastically change. The histograms show that there is a large percentage of agreement between subjects and that the segmentation results are not affected by the provision of an automatically suggested segmentation time line. Finally, the fact that the histograms of the old, new and total experiments have about the same shape in free, 8 ± 2 and 4 ± 1 suggests that constraining the users to a specific budget of segments does not significantly affect their agreement. The binning of marks into agreement counts is sketched below.
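A sketch of the binning step; the greedy clustering shown here is one plausible reading of the procedure described above, not necessarily the exact one used.

#include <algorithm>
#include <vector>

// Pool the marks of all subjects, cluster marks within the tolerance into
// one boundary, and return how many marks (subjects) each boundary got.
std::vector<int> agreementCounts(std::vector<double> marks, double tol = 0.5) {
    std::sort(marks.begin(), marks.end());
    std::vector<int> counts;                 // one entry per distinct boundary
    size_t i = 0;
    while (i < marks.size()) {
        double anchor = marks[i];
        int n = 0;
        while (i < marks.size() && marks[i] - anchor <= tol) { ++n; ++i; }
        counts.push_back(n);                 // at most 10 with 10 subjects
    }
    return counts;  // the histogram is the frequency of each agreement level
}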
As a metric of subject agreement we define the percentage of the total segmentation marks that more than 5 of the 10 subjects agreed upon. This number can be calculated by integrating the histograms from 6 to 10. For the experiments in [125] this metric gives 79%, for this study 76%, and for the combined total 73%. In [125], 87% of the segments that more than half the subjects agreed upon were in the set of the best effort automatic segmentation. Moreover, for this study (new), 70% of the total segment marks by all subjects were retained from the best effort automatically suggested segmentation. All these numbers are for the case of free segmentation.
The mean and standard deviation of the time it took to complete (segment and annotate) a sound file with a duration of about 1 minute was 13 ± 4 minutes. This result was calculated using only the free segmentation timing information, because the other cases are much faster due to familiarity with the sound file and the reusability of segmentation information from the free case. The lower right sub-figure of Figure 5.2 shows the average time per sound file in order of processing. The order of processing was random, therefore the figure indicates there is a significant learning curve for the task. This happens despite the fact that an initial sound file that was not timed was used to familiarize the users with the interface. Therefore the actual mean time is probably lower (about 10 minutes) for an experienced user.
5.6 MPEG-based features
The performance of features calculated from MPEG-compressed data is evaluated and compared with other feature sets in the context of automatic music/speech classification and segmentation.
5.6.1 Music/Speech classification
The data used for evaluating the music/speech classification consists of about 2 hours of audio data. There are 45 minutes of speech, 45 minutes of music, and about 30 minutes of mixed audio. Radio, live recordings of speech, compact disks and movies, representing a variety of speakers and music styles, were used as data sources. The results are calculated using 10-fold cross-validation as described in Section 5.4.1. The results are calculated on a frame basis without any integration. Further improvements in classification performance can be achieved by integrating the results. For integration, the region detected by the segmentation algorithm can be used.
Figure 5.3: Increase in classification performance as features are added (% correct classifications vs. feature number; 1, 2, 3 and 4, 5, 6 are the means and variances of MPEG Centroid, Rolloff and Flux; Low Energy is 7)
Figure 5.3 shows the improvement in classification performance with the gradual addition of features. The order used is mean Centroid, mean Rolloff, mean Flux, variance Centroid, variance Rolloff, variance Flux and Low Energy. Table 5.4 compares the performance of the MPEG-based classification with PCM-based systems that use Spectral Shape Features calculated using the STFT.

Table 5.4: Music/Speech percentages of frame-based classification performance

         GMAP           K-NN(1)        K-NN(5)
MPEG     82.2 ± 2.1%    84.6 ± 4.4%    86.2 ± 3.8%
PCM      84.8 ± 1.4%    89.5 ± 2.0%    90.0 ± 1.2%
Different Data Set
OTHER1   94.0 ± 2.6%    94.2 ± 3.6%    95.7 ± 3.5%
Different Data Set & Classifiers
OTHER2   77.1%          90.0%          94.2%

Table 5.5: Comparison of segmentation algorithms

          Free           4 ± 1          8 ± 2
MPEG FB   55% (31/56)    58% (25/43)    54% (38/70)
PCM FB    62% (35/56)    51% (22/43)    62% (44/70)
MPEG BE   71% (40/56)    74% (32/43)    65% (46/70)
PCM BE    85% (48/56)    86% (37/43)    75% (53/70)

For the PCM line, the same dataset and a similar set of features calculated using the short time Fourier Transform were used. The OTHER1 line is based on results by [110] and uses a different dataset. The OTHER2 line is based on results by [102] and uses a different dataset and different classifiers. The results indicate that the proposed features can be used for classification without significant decrease in accuracy. These results are calculated using a GMAP classifier. Because the main purpose is to compare the MPEG feature set with other feature sets, the particular choice of classifier is not important. The spectral shape features themselves are sketched below.
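For reference, the spectral shape features compared in this section can be computed from a magnitude spectrum roughly as follows; these are the standard textbook formulations and may differ in detail from the thesis implementation.

#include <vector>

double centroid(const std::vector<double>& mag) {   // spectral "brightness"
    double num = 0, den = 0;
    for (size_t k = 0; k < mag.size(); ++k) { num += k * mag[k]; den += mag[k]; }
    return den > 0 ? num / den : 0;
}

double rolloff(const std::vector<double>& mag, double frac = 0.85) {
    double total = 0;                 // bin below which frac of the energy lies
    for (double m : mag) total += m;
    double sum = 0;
    for (size_t k = 0; k < mag.size(); ++k) {
        sum += mag[k];
        if (sum >= frac * total) return (double)k;
    }
    return (double)mag.size() - 1;
}

double flux(const std::vector<double>& mag, const std::vector<double>& prev) {
    double f = 0;                     // frame-to-frame spectral change
    for (size_t k = 0; k < mag.size(); ++k) {
        double d = mag[k] - prev[k];
        f += d * d;
    }
    return f;
}
// Low Energy can be computed as the fraction of analysis windows whose RMS
// energy is below the mean RMS energy over the surrounding texture window.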
5.6.2 Segmentation evaluation
The comparison of MPEG-based features to PCM-based features for automatic segmentation is based on the experiments described in Section 5.5. In Table 5.5 the performance of the MPEG-based segmentation algorithm is compared with a similar algorithm based on short time Fourier Transform analysis. The table shows the number of regions that were marked by humans that were captured by the system. FB (fixed budget) refers to automatic segmentation by requesting the same number of segments as the salient human segments. BE (best effort) refers to the best automatic segmentation achieved by incrementally increasing the number of regions up to a maximum of 16. Although the performance is lower than the PCM-based algorithm, still a large number of segmentation boundaries are captured by the algorithm. Most of the cases where the algorithm missed boundaries compared to the PCM-based algorithm were in sound files containing rock or jazz music. The reason is that the PCM-based algorithm uses time domain zero crossings that capture the transition from noise-like signals to more harmonic signals.
Figure 5.4: DWT-based feature evaluation based on audio classification
5.7 DWT-based features
The features based on the DWT transform were evaluated in the context of audio classification, and more specifically using the following classification schemes: Music/Speech, Classical Genres, and Voices. Figure 5.4 and Table 5.6 compare the classification results between FFT-based, MFCC, and DWT-based features. These results are calculated using a GMAP classifier. Because the main purpose is to compare the performance of the DWT-based feature set with other feature sets, the particular choice of classifier is not important. A sketch of one possible DWT-based feature computation is given after Table 5.6.

Table 5.6: DWT-based feature evaluation based on audio classification

         Music/Speech   Classical   Voices
Random   50             25          33
FFT      87             52          64
MFCC     89             61          72
DWTC     82             53          70
Figure 5.5: Web-based Retrieval User Evaluation
5.8 Similarity retrieval evaluation
In order to perform similarity retrieval user studies, a Web evaluation infrastructure was developed. Figure 5.5 shows a sample page for evaluating the results of a query. A user survey of similarity retrieval was conducted using the Rock dataset (1000 30-second rock songs). The large size of the dataset made the calculation of recall difficult, so only precision was examined. Because the 30-second snippets used for the evaluation did not have many texture changes, the single vector approach was used. The evaluation was done by 7 subjects who gave a relevance judgement from 1 (worst) to 5 (best). There were twelve queries, five matches returned, and three algorithms (random, only beat detection, beat and texture), giving a total of 7 * 5 * 12 * 3 = 1260 data points. The beat detection was done using the algorithm described in [108]. In this algorithm, a filterbank is coupled with a network of comb filters that track the signal periodicities to provide an estimate of the main beat and its strength. The full retrieval was done using both beat detection and spectral features.

Figure 5.6: Results of User Evaluation (mean relevance, on a 1-5 scale, for the Random, BPM and Full retrieval methods)
Figure 5.6 shows the mean and standard deviation of the three retrieval methods. The standard deviation is due to the different nature of the queries and subject differences and is about the same for all algorithms. Although it is clear that the system performs better than random and that the full approach is slightly better than using only beat detection, more work needs to be done to improve the scores. Significantly better results are achieved using the full musical content feature sets that include Beat and Pitch Histogram features.
Unfortunately, this study was conducted before the design of these new features. A similar evaluation study using the newly proposed features is planned for the future.
5.9 Automatic Musical Genre Classification
Because this is one of the first published works that deals seriously with the problem of automatic musical genre classification, a variety of experimental results will be presented.
5.9.1 Results
Table 5.7 shows the classification accuracy percentage results of different classifiers and musical genre datasets. With the exception of the RT GS row, these results have been computed using a single vector to represent the whole audio file. The vector consists of the timbral texture features (9 (FFT) + 10 (MFCC) = 19 dimensions), the rhythmic content features (6 dimensions), and the pitch content features (5 dimensions), resulting in a 30-dimensional feature vector. In order to compute a single timbral-texture vector for the whole file, the mean feature vector over the whole file is used.

Table 5.7: Classification accuracy mean and standard deviation

          Genres (10)   Classical (4)   Jazz (6)
Random    10            25              16
RT GS     44 ± 2        61 ± 3          53 ± 4
GS        59 ± 4        77 ± 6          61 ± 8
GMM(2)    60 ± 4        81 ± 5          66 ± 7
GMM(3)    61 ± 4        88 ± 4          68 ± 7
GMM(4)    61 ± 4        88 ± 5          62 ± 6
GMM(5)    61 ± 4        88 ± 5          59 ± 6
KNN(1)    59 ± 4        77 ± 7          57 ± 6
KNN(3)    60 ± 4        78 ± 6          58 ± 7
KNN(5)    56 ± 3        70 ± 6          56 ± 6
The row RT GS shows classification accuracy percentage results for real-time classification per frame using only the timbral texture feature set (19 dimensions). In this case each file is represented by a time series of feature vectors, one for each analysis window. Frames from the same audio file are never split between training and testing data in order to avoid falsely higher accuracy due to the similarity of feature vectors from the same file. A comparison of random classification, real-time features, and whole-file features is shown in Figure 5.7. The data for creating this bar graph corresponds to the Random, RT GS, and GMM(3) rows of Table 5.7.
The classification results are calculated using 10-fold cross-validation, where the dataset to be evaluated is randomly partitioned so that 10% is used for testing and 90% is used for training. The process is iterated with different random partitions and the results are averaged (for Table 5.7, 100 iterations were performed). This ensures that the calculated accuracy will not be biased by a particular partitioning of training and testing data. If the datasets are representative of the corresponding musical genres, then these results are also indicative of the classification performance with real-world unknown signals. The ± part shows the standard deviation of classification accuracy over the iterations. The row labeled Random corresponds to the classification accuracy of a chance guess.
The additional music/speech classification has 86% accuracy (random would be 50%), and the voices classification (male, female, sports announcing) has 74% (random 33%). Sports announcing refers to any type of speech over a very noisy background. The STFT-based feature set is used for the music/speech classification and the MFCC-based feature set is used for the speech classification.
Confusion matrices

Figure 5.7: Classification accuracy percentages for the Genres, Classical and Jazz datasets (RND = random, RT = real time, WF = whole file)

Table 5.9 shows more detailed information about the musical genre classifier performance in the form of a confusion matrix. In a confusion matrix, the columns correspond to the actual genre and the rows to the predicted genre. For example, the cell of row 5, column 1 with value 26 means that 26% of the Classical music (column 1) was wrongly classified as Jazz music (row 5). The percentages of correct classification lie on the diagonal of the confusion matrix. The confusion matrix shows that the misclassifications of the system are similar to what a human would do. For example, Classical music is misclassified as Jazz music for pieces with strong rhythm from composers like Leonard Bernstein and George Gershwin. Rock music has the worst classification accuracy and is easily confused with other genres, which is expected because of its broad nature. A sketch of how such a matrix is accumulated follows.
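Illustrative code, not the thesis implementation: rows are the predicted class, columns the actual class, and each column is normalized to percentages as in the tables that follow.

#include <vector>

std::vector<std::vector<double>>
confusion(const std::vector<int>& actual, const std::vector<int>& predicted,
          int numClasses) {
    std::vector<std::vector<double>> cm(numClasses,
                                        std::vector<double>(numClasses, 0));
    std::vector<int> colTotals(numClasses, 0);
    for (size_t i = 0; i < actual.size(); ++i) {
        cm[predicted[i]][actual[i]] += 1;    // row = predicted, col = actual
        colTotals[actual[i]] += 1;
    }
    for (int r = 0; r < numClasses; ++r)     // normalize each column to percent
        for (int c = 0; c < numClasses; ++c)
            if (colTotals[c] > 0) cm[r][c] = 100.0 * cm[r][c] / colTotals[c];
    return cm;
}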
Table 5.8: Real-time classification accuracy percentage results

           Genres (10)   Classical (4)   Jazz (6)
Random     10            25              16
Gaussian   54 ± 4        77 ± 7          62 ± 8
GMM(3)     57 ± 4        85 ± 6          70 ± 7

Table 5.9: Genre Confusion Matrix

     cl   co   di   hi   ja   ro   bl   re   po   me
cl   69    0    0    0    1    0    0    0    0    0
co    0   53    2    0    5    8    6    4    2    0
di    0    8   52   11    0   13   14    5    9    6
hi    0    3   18   64    1    6    3   26    7    6
ja   26    4    0    0   75    8    7    1    2    1
ro    5   13    4    1    9   40   14    1    7   33
bl    0    7    0    1    3    4   43    1    0    0
re    0    9   10   18    2   12   11   59    7    1
po    0    2   14    5    3    5    0    3   66    0
me    0    1    0    1    0    4    2    0    0   53

Tables 5.10 and 5.11 show the confusion matrices for the Classical and Jazz genre datasets. In the Classical genre dataset, Orchestral music is mostly misclassified as String Quartet. As can be seen from the confusion matrix (Table 5.10), Jazz genres are mostly misclassified as Fusion. This is due to the fact that Fusion is a broad category that exhibits large variability of feature values. Jazz Quartet seems to be a particularly difficult genre to correctly classify using the proposed features (it is mostly misclassified as Cool and Fusion).
Importance of texture window size
Table 5.10: Jazz Confusion Matrix

        BBand   Cool   Fus.   Piano   4tet   Swing
BBand   42      2      1      0       6      1
Cool    21      67     5      4       23     10
Fus.    28      16     88     0       38     22
Piano   1       0      0      80      0      0
4tet    4       5      2      0       19     5
Swing   4       10     4      16      14     62

Table 5.11: Classical Confusion Matrix

           Choir   Orch.   Piano   Str.4tet
Choir      99      7       7       3
Orch.      0       58      2       7
Piano      0       9       86      4
Str.4tet   1       26      5       86

Figure 5.8 shows how changing the size of the texture window affects the classification performance. It can be seen that the use of a texture window significantly increases the classification accuracy. The value of zero analysis windows corresponds to using directly the features computed from the analysis window. After approximately 40 analysis windows (1 second), subsequent increases in texture window size do not improve classification, as they do not provide any additional statistical information. Based on this plot, the value of 40 analysis windows was chosen as the texture window size. The timbral-texture feature set (STFT and MFCC) for the whole file and a single Gaussian classifier (GS) were used for the creation of Figure 5.8. A running-statistics sketch of the texture window is shown below.

Figure 5.8: Effect of texture window size (number of analysis windows) on classification accuracy
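The texture window can be implemented as a circular buffer of recent feature values over which running means and variances are computed; the sketch below is illustrative (names are hypothetical), with W = 40 analysis windows per the plot.

#include <vector>

class TextureWindow {
public:
    explicit TextureWindow(size_t w = 40) : buf_(w), size_(0), next_(0) {}
    void push(double feature) {              // add the newest analysis window
        buf_[next_] = feature;
        next_ = (next_ + 1) % buf_.size();
        if (size_ < buf_.size()) ++size_;
    }
    double mean() const {                    // running mean over the window
        double s = 0;
        for (size_t i = 0; i < size_; ++i) s += buf_[i];
        return size_ ? s / size_ : 0;
    }
    double variance() const {                // running variance over the window
        double m = mean(), s = 0;
        for (size_t i = 0; i < size_; ++i) s += (buf_[i] - m) * (buf_[i] - m);
        return size_ ? s / size_ : 0;
    }
private:
    std::vector<double> buf_;                // last W feature values
    size_t size_, next_;
};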
Importance of individual feature sets
Table 5.12 shows the individual importance of the proposed feature sets for the task of automatic musical genre classification. As can be seen, the non-timbral-texture features (Pitch Histogram Features (PHF) and Beat Histogram Features (BHF)) perform worse than the timbral-texture features (STFT, MFCC) in all cases. However, in all cases the proposed feature sets perform better than random classification and therefore provide some information about musical genre, and therefore about musical content in general. The last row of Table 5.12 corresponds to the full combined feature set and the first row corresponds to random classification. The number in parentheses beside each feature set denotes the number of individual features for that particular feature set. The results of Table 5.12 were calculated using a single Gaussian classifier (GS) using the whole-file approach.

Table 5.12: Individual feature set importance

            Genres   Classical   Jazz
RND         10       25          16
PHF(5)      23       40          26
BHF(6)      28       39          31
STFT(9)     45       78          58
MFCC(10)    47       61          56
FULL(30)    59       77          61
The classification accuracy of the combined feature set, in some cases, is not significantly increased compared to the individual feature set classification accuracies. This fact does not necessarily imply that the features are correlated or do not contain useful information, because it can be the case that a specific file is correctly classified by two different feature sets that contain different and uncorrelated feature information. In addition, although certain individual features are correlated, the addition of each specific feature improves classification accuracy. The rhythmic and pitch content feature sets seem to play a less important role in the Classical and Jazz dataset classification compared to the Genre dataset. This is an indication that genre-specific feature sets may need to be designed for more detailed subgenre classification.
Table 5.13: Best individual features

               Genres
BHF.SUM        20
PHF.FP0        23
STFT.VCTRD     29
MFCC.MMFCC1    25

Table 5.13 shows the best individual features for each feature set. These are the sum of the Beat Histogram (BHF.SUM), the period of the first peak of the Folded Pitch Histogram (PHF.FP0), the variance of the Spectral Centroid over the texture window (STFT.VCTRD), and the mean of the first MFCC coefficient over the texture window (MFCC.MMFCC1).
5.9.2 Human performance for genre classification
The performance of humans in classifying musical genre has been investigated in [92, 30]. Using a ten-way forced-choice paradigm, college students were able to judge genre accurately 53% of the time after listening to only 250-millisecond samples and 70% of the time after listening to three seconds (chance would be 10%). Listening to more than 3 seconds did not improve their performance. The subjects were trained using representative samples from each genre. The ten genres used in this study were: Blues, Country, Classical, Dance, Jazz, Latin, Pop, R&B, Rap and Rock. Although direct comparison of these results with the automatic musical genre classification results is not possible due to different genres and datasets, it is clear that the automatic performance is not far away from the human performance. Moreover, these results indicate the fuzzy nature of musical genre boundaries.
5.10 Other experimental results
In this section some additional experimental results collected in the course of the previously described user studies are presented.
5.10.1 Audio Thumbnailing
An additional component of the annotation tasks of [125] was that of "thumbnailing." After doing the free, 8 ± 2, and 4 ± 1 segmenting tasks, the subjects were instructed to note the begin and end times of a two-second thumbnail segment of audio that best represented each section of their free segmentations.

Inspection of the 545 total user thumbnail selections reveals that 62% of them were chosen to be the first two seconds of a segment, and 92% of them were a selection of two seconds within the first five seconds of the segment. This implies that a machine algorithm which can perform segmentation could also do a reasonable job of matching human performance on thumbnailing. By simply using the first five seconds of each segment as the thumbnail, and combining with the results of the best-effort machine segmentation (87% match with human segments), a set containing 80% "correct" thumbnails could be automatically constructed.
5.10.2 Annotation
Some work on verbal cues for sound retrieval [140] has shown that humans tend to describe isolated sounds by source type (what it is), situation (how it is made), and onomatopoeia (what it sounds like). Text labeling of segments of a continuous sound stream might be expected to introduce different description types, however. In the conducted experiments, a preliminary investigation of semantic annotations of sound segments was conducted. While doing the segmentation tasks, subjects were instructed to "write a short (2-8 words) description of the section..." The annotations from the free segmentations were inspected by sorting the words by frequency of occurrence.
The average annotation length was 4 words, resulting in a total of 2200 meaningful words (words like of, and, etc. were removed) and 620 unique words. Of these, only 100 words occur 5 or more times, and these represent 64% of the total word count. Of these "top 100" words, 37 are literal descriptions of the dominant source of sound (piano, female, strings, horns, guitar, synthesizer, etc.), and these make up almost 25% of the total words used.
Even though only 5 of the 20 subjects could be considered professional composers/musicians, the next most popular word type could be classed as music-theoretical structural descriptions (melody, verse, sequence, tune, break, head, phrase, etc.). 29 of the top 100 words were of this type, and they represent 23% of the total words used.
Another significant category of words used corresponded to basic acoustic parameters (soft, loud, slow, fast, low, high, build, crescendo, increase, etc.). Most of such parameters are easy to calculate from the signal. 12 of these words represented about 10% of the total words used.
These preliminary findings indicate that with suitable algorithms determining basic acoustical parameters (mostly possible today), the perceptually dominant sound source type (somewhat possible today), and music-theoretical structural aspects of sound segments (much algorithmic work still to be done), machine labeling of a fair number of segments (60%) would be possible.
5.10.3 Beat studies
A number of synthetic beat patterns were constructed to evaluate the Beat Histogram calculation. Bass drum, snare, hi-hat and cymbal sounds were used, with the addition of woodblock and three tom-tom sounds for the unison examples. The simple unison beats versus cascaded beat patterns are used to check for phase sensitivity in the algorithm. Simple and complex beat patterns are used to gauge the possibility of finding the dominant peak timing. For this purpose, the beat patterns have been synthesized at 5 different speeds. The simple rhythm pattern has a simple rhythmic substructure with single-beat regularity, whereas the complex pattern shows a complex rhythmic substructure across and within the single instrument sections. Figure 5.9 shows the synthetic beat patterns that were used for evaluation in tablature notation. In all the cases the algorithm detected the basic beat of the pattern as a prominent peak in the Beat Histogram.

Figure 5.9: Synthetic beat patterns (CY = cymbal, HH = hi-hat, SD = snare drum, BD = bass drum, WB = woodblock, HT/MT/LT = tom-toms)
In addition to the synthetic beat patterns, the algorithm was applied to real-world music signals. To evaluate the algorithm's performance, it was compared to the beats per minute (bpm) detected manually by tapping the mouse with the music. The average time difference between the taps was used as the manual beat estimate (see the sketch below). Twenty files containing a variety of music styles were used to evaluate the algorithm (5 HipHop, 3 Rock, 6 Jazz, 1 Blues, 3 Classical, 2 World). For most of the files (13/20) the prominent beat was detected clearly (i.e., the beat corresponded to the highest peak of the histogram). For 5/20 files the beat was detected as a histogram peak but it was not the highest, and for 2/20 no peak corresponding to the main beat was found. In the pieces where the beat was not detected there was no dominant periodicity (either classical or free jazz). In such cases humans rely on more high-level information like grouping, melody and harmonic progression to perceive the primary beat from the interplay of multiple periodicities.
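The manual estimate reduces to converting the mean inter-tap interval to beats per minute, as in this small sketch (illustrative, not the study's code):

#include <vector>

double tappedBpm(const std::vector<double>& tapTimesSec) {
    if (tapTimesSec.size() < 2) return 0.0;
    double span = tapTimesSec.back() - tapTimesSec.front();
    double meanInterval = span / (tapTimesSec.size() - 1);  // avg gap in sec
    return 60.0 / meanInterval;  // e.g. taps 0.5 s apart -> 120 bpm
}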
5.11 Summary
A series of experimental results regarding the evaluation of the proposed Computer Audition algorithms and systems was described. Using a newly designed user study for segmentation evaluation, it was shown that humans are consistent when segmenting audio and that their results can be approximated automatically. In addition, the results of a user study evaluating similarity retrieval show that spectral and beat information are important for music retrieval. It was shown that automatic musical genre classification is possible with results that are not far away from the performance reported for human subjects. Finally, evidence was presented that Beat Histograms provide a good representation of rhythmic content.
Chapter 6

Implementation

In the development of the understanding of complex phenomena, the most powerful tool available to the human intellect is abstraction. Abstraction arises from the recognition of similarities between certain objects, situations, or processes in the real world and the decision to concentrate on these similarities and to ignore, for the time being, their differences.

C.A.R. Hoare, Structured Programming
6.1 Introduction
In the previous chapters a variety of different algorithms and systems for manipulation, analysis and retrieval of audio signals were presented. Although for presentation purposes they were described separately, in reality they share many common building blocks and have to be integrated to create useful tools for interacting with audio signals and collections.

The main subject of this chapter is Marsyas, the free software framework for rapid prototyping of Computer Audition research that we have developed. A framework is a set of cooperating classes that make up a reusable design for a specific class of software [55]. A framework is similar to a library in that it provides a set of reusable components and building blocks that can be assembled into a complex system. In addition, it has to be flexible and extensible, allowing the addition and integration of new components with minimal effort. Finding the right abstractions and making the framework flexible and extensible is not trivial. Computer Audition systems use similar features, analysis algorithms, and interfaces for a variety of different tasks. Therefore, in the design of our system, an effort was made to abstract these common elements and use them as architectural building blocks. This facilitates the integration of different techniques under a common framework and interface. In addition, it helps rapid prototyping, since the common elements are written once, and developing and evaluating a new technique or application requires writing only the new task-specific code.
The desired properties of a good framework require careful programming design. An additional complication, in the case of a framework for Computer Audition research, is the requirement for efficient implementation of heavy numerical computations. Typically, Computer Audition systems follow a bottom-up processing architecture where sensory information flows from low-level signals to higher-level cognitive representations. However, there is increasing evidence that the human auditory system uses top-down as well as bottom-up information flow [115]. A top-down (prediction-driven) approach has been used for computational auditory scene analysis [31]. An extension of this approach with a hierarchical taxonomy of sound sources is proposed in [81]. In the design of our framework, we tried to have a flexible architecture that can support these models of top-down flow and hierarchical classification as well as traditional bottom-up processing.
This chapter is mainly about architectural design and implementation issues and should be skipped by readers who are mainly interested in ideas about Computer Audition algorithms. It should be useful for readers who are interested in implementing practical, real-time Computer Audition systems. Publications in Computer Audition rarely go into implementation details (in many cases because their performance is too slow). The result is that people entering the field have to discover the correct abstractions by trial and error. It is my hope that this chapter and the software it describes will help people entering the field concentrate on new ideas and avoid rediscovering and reimplementing basic building blocks of Computer Audition research.
6.2 Related Work
This section will be a little different from the corresponding sections of the previous chapters in that it references software systems rather than publications. Software frameworks from other application areas that have similar design goals to Marsyas, as well as alternatives to Marsyas for implementing Computer Audition algorithms, will be described.

There are many software frameworks for a variety of areas in Computer Science. Representative examples (references can be found at the corresponding web pages) are:

- Audio Synthesis: CSOUND (http://mitpress2.mit.edu/e-books/csound/frontpage.html), Synthesis Toolkit (STK) (http://www-ccrma.stanford.edu/software/stk/), Pure Data (Pd) (http://www.pure-data.org/)

- Image Processing and Manipulation: GIMP (http://www.gimp.org), Khoros (http://www.tnt.uni-hannover.de/js/soft/imgproc/khoros/)

- Visualization: Visualization Toolkit (VTK) (http://public.kitware.com/VTK/)

When designing and developing Computer Audition software there are several possible implementation choices, each with its own advantages and disadvantages. A common approach is to use a numerical computation environment (typically MATLAB, http://www.mathworks.com). The main advantages of this approach are: 1) the large number of available numerical functions, 2) simple syntax tailored to numerical computations, 3) interactive computing and plotting. The main disadvantages are: 1) slow performance (MATLAB is memory-intensive and slow compared to efficient C or C++ implementations), 2) lack of the flexibility of a general programming language (although MATLAB fans might disagree), 3) difficulty in creating user interfaces that respond in real time to audio signals, 4) expensive licence updates, 5) not easy to port to new platforms. In many cases, in order to overcome some of these problems, a hybrid approach is used, where the critical parts are implemented in C or C++ and MATLAB is used for further processing and integration.
Another approach is to start from scratch using a general-purpose programming language and provide a separate efficient implementation for each developed algorithm. The main disadvantage of this approach is the significant effort it requires, because code has to be rewritten from scratch for every new algorithm. The use of standard libraries can provide some code reusability but does not allow easy addition of new components.
Another possibility is the use of an audio synthesis framework such as CSOUND, STK or Pd. Although a lot of the required functionality, such as reading sound files and playing audio, is already there, these frameworks are not targeted at Computer Audition and therefore lack certain important components such as Machine Learning algorithms.
The approach chosen in this work is to develop a framework for Computer Audition research in a general programming language and try to abstract the basic building blocks and make it flexible so that new blocks can be added easily. The Marsyas software framework is the result of these efforts. It was first described in [124, 129] and has been used to design and develop all the described algorithms in this thesis as well as to conduct the evaluation user experiments. The advantage of this approach is that, if done properly, it combines expressivity and flexibility with performance. The main disadvantage is that it requires significant effort in design and implementation until it reaches a stage where it is usable.
6.3 Architecture
Marsyas follows a client-server architecture. The server, written in C++, contains all the signal processing and pattern recognition modules optimized for performance. The client, written in Java, contains only the user interface code and communicates with the computation engine via sockets. This breakdown has the advantage of decoupling the interface from the computation code and allows different interfaces to be built that use the same underlying computational functionality. For example, the server can be accessed by different graphical user interfaces, scripting tools, web crawlers, etc. Another advantage is that the server or the client can be rewritten in some other programming language without needing to change the other part, as long as the interface conventions between the two are respected.
Although the objects form a natural bottom-up hierarchy, top-down flow of information can be expressed in the framework. As a simple example, a silence feature can be used by an extractor for Music/Speech classification to avoid calculating features on silent frames. Similarly, hierarchical classification can be expressed using multiple extractors using different features.
6.3.1 Computation Server
The computation server contains the code for reading/writing/playing audio files, feature extraction, signal processing and machine learning. Special attention was given to abstracting these layers using object-oriented programming techniques. Abstract classes are used to provide a common API for the building blocks of the system, and inheritance is used to factor out common operations. The main classes of the system can roughly be divided into two categories: process-like objects and data-structure-like objects.
Process objects:

Systems process an input vector of floating point numbers and output another vector (of possibly different dimensionality). They encapsulate all the Signal Processing functionality of Marsyas. The parameters of a System and any temporary memory storage it might require are initialized and allocated when the object is constructed. The process method can then be applied, usually more than one time, to floating point vector objects of a specific input and output size specified at construction time. Examples of Systems include transformations such as windowing (for example Hamming), the Fast Fourier Transform, and Wavelets. In addition, features such as the Spectral Centroid, Rolloff and Flux are also represented as Systems. Systems are similar to plugins or unit generators of audio processing software. Systems can be constructed from arbitrary flow networks of other Systems using combinators. For example, the whole computation of features for Music/Speech classification can be encapsulated in one System, as shown in Figure 6.1.

Combinators provide ways to form complex Systems by combining simpler Systems. The Parallel combinator takes as input a list of Systems that have the same input size and produces a System that takes the same input and has as output the concatenation of the outputs of the Systems it combines. The Series combinator takes as input a list of Systems and creates a System that applies them one after the other (the input and output sizes of adjacent pairs have to match). Figure 6.1 shows how Parallel and Series combinators are used to construct the flow network for Music/Speech classification; a simplified sketch follows.
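The following simplified sketch conveys the System/Combinator abstraction; the class and method names are illustrative and do not reproduce the actual Marsyas API.

#include <memory>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

class System {                         // process one vector, produce another
public:
    virtual Vec process(const Vec& in) = 0;
    virtual ~System() {}
};

class Series : public System {         // output of each stage feeds the next
public:
    explicit Series(std::vector<std::unique_ptr<System>> s)
        : stages_(std::move(s)) {}
    Vec process(const Vec& in) override {
        Vec v = in;
        for (auto& s : stages_) v = s->process(v);
        return v;
    }
private:
    std::vector<std::unique_ptr<System>> stages_;
};

class Parallel : public System {       // same input, concatenated outputs
public:
    explicit Parallel(std::vector<std::unique_ptr<System>> s)
        : branches_(std::move(s)) {}
    Vec process(const Vec& in) override {
        Vec out;
        for (auto& b : branches_) {
            Vec r = b->process(in);
            out.insert(out.end(), r.begin(), r.end());
        }
        return out;
    }
private:
    std::vector<std::unique_ptr<System>> branches_;
};

// The Music/Speech extractor of Figure 6.1 then reads roughly as:
// Series{ Hamming, FFT, Magnitude, Parallel{ Centroid, Rolloff, Flux }, Memory }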
Figure 6.1: Music/Speech discriminator architectural layout (a Series of Hamming, FFT, Magnitude, a Parallel of Centroid/Rolloff/Flux, and a Memory forms the extractor; its feature vectors feed a Gaussian classifier that labels each audio window as music or speech)
Memories are a particular case of Systems that correspond to circular buffers holding previously calculated features for a limited time. They are used to compute means and variances of features over large windows without recomputing the features. They can have different sizes depending on the application.
Extractors break up a sound stream into frames. For each frame they use a complex System to calculate a Feature Vector. The result is a time series of feature vectors which is called the Feature Matrix. Typically, there is a different Extractor for each classification scheme.

Classifiers are trained using a Feature Matrix and can then be used to classify Feature Vectors or Feature Matrices. They also support various methods for debugging and evaluation such as k-fold cross-validation.

Segmentors take as input a Feature Matrix and output a Time Line corresponding to the extracted segments.
Data structure objects:

Signals produce or consume sound segments represented as vectors of floating point numbers. Examples of Signals include: various readers/writers of sound file formats, real-time audio I/O, and synthesis algorithms.

Float Vectors-Matrices are the basic data components of the system. They are float arrays tagged with sizes. Operator overloading is used for vector operations to avoid writing many nested loops in signal processing code. The operator functions are inlined and optimized. The resulting code is easy to read and understand without compromising performance (see the sketch below).
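A sketch of the idea (an illustrative subset, not the actual class): a size-tagged float array with overloaded, inlined operators so that signal processing code reads like vector math.

#include <cassert>
#include <vector>

class FloatVector {
public:
    explicit FloatVector(size_t n) : data_(n, 0.0f) {}
    size_t size() const { return data_.size(); }
    float& operator[](size_t i) { return data_[i]; }
    float operator[](size_t i) const { return data_[i]; }

    FloatVector& operator+=(const FloatVector& o) {  // element-wise add
        assert(size() == o.size());                  // sizes must match
        for (size_t i = 0; i < size(); ++i) data_[i] += o.data_[i];
        return *this;
    }
    FloatVector& operator*=(float s) {               // scale in place
        for (float& v : data_) v *= s;
        return *this;
    }
private:
    std::vector<float> data_;
};
// e.g. accumulating a mean feature vector: sum += frame; ...; sum *= 1.0f / n;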
Feature Matrices store automatically extracted audio features in the form of a Feature Matrix and map them to a Label Matrix that is used for training, classification and segmentation. For example, a particular Feature Vector might be mapped to two labels, one corresponding to the class it belongs to (for example Jazz) and the other to the file it originates from.

Time Regions are time intervals tagged with annotation information.

Time Lines are lists of Time Regions.

Time Trees are arbitrary trees of Time Regions. They represent a hierarchical decomposition of audio into successively smaller segments (see Figure 6.2). Trees have been shown to be an effective abstraction for browsing multimedia data [136].

All the objects contain methods to read/write themselves to files and transport them using the socket interface. For example, a calculated Feature Matrix can be stored and used to evaluate different Classifiers without having to redo the feature calculation for each Classifier.
Figure 6.2: Segmentation peaks and corresponding time tree (Group: El.Piano Intro, Entrance; Toms, Cymbal, Guitar)
6.3.2 User Interface Client - Interaction
The client contains all the code for the user interfaces and is written in Java. Because all of the performance-critical code is executed in the server, the choice of Java does not impact performance and allows the use of the excellent Java Swing graphical user interface framework. The main user interface components and their architecture were described in Chapter 4. In this section some additional details about the architecture and implementation are provided.

All the interfaces follow a Model-View-Controller design pattern [64] that allows multiple views of the same underlying data to be created. Browsers render audio collections based on parameters stored in MapFiles. In many cases the MapFiles are automatically generated using Computer Audition algorithms. Using Browsers, the space can be explored and a subset of the collection can be selected (logical zooming) for subsequent browsing or editing. Editing the selected files can be done using Viewers and Editors, which are also configured by MapFiles, possibly automatically generated. During playback of the
Figure 6.3: Marsyas User Interface Component Architecture (MARSYAS3D: abstract Browser, Editor, Monitor and Analysis Tool components, each configured by a Mapfile over a Collection or Item; concrete instances include the SoundSpace and TimbreSpace2D/3D Browsers, the Waveform Editor, the Spectrogram and ClassGram Monitors, and the Music/Speech Classifier and Segmentation Analysis Tools)
files, various Monitors can be set up to display information that is updated in real time. A schematic of the system's architecture can be seen in Figure 6.3.
6.4 Usage
Marsyas can be used at a variety of different levels depending on the user's experience and the flexibility required. At the highest level the graphical user interfaces can be used directly
to browse, analyze and edit audio signals and collections. At a lower level, command-line tools can be used to perform basic Computer Audition operations. For example, the following sequence of commands extracts Spectral Shape features based on the STFT, trains a music/speech Gaussian classifier and evaluates the results. The output will be the classification accuracy using 10-fold cross-validation and the corresponding confusion matrix.
mkcollection music.mf /home/gtzan/data/sound/music/
mkcollection speech.mf /home/gtzan/data/sound/speech/
bextract FFT ms music.mf speech.mf
evaluate 0 GS ms 100
If more flexibility is required, complex Computer Audition algorithms can be constructed by combining various objects that correspond to basic building blocks. Many of the projects done by undergraduate students using Marsyas are done this way, without having to write any code other than the glue that connects the building blocks. Finally, arbitrary new algorithms can be added by writing source code directly and extending, via inheritance, the existing abstract classes of the system.
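As an illustration of this extension mechanism, the following is a minimal sketch of a new one-value feature added by deriving from the hypothetical System base class sketched earlier in this chapter; the names are illustrative assumptions, not the actual Marsyas class hierarchy.

    #include <vector>
    using std::vector;

    // Hypothetical abstract base class (see the earlier sketch).
    struct System {
        virtual vector<float> process(const vector<float>& in) = 0;
        virtual ~System() = default;
    };

    // A new feature added purely by inheritance: the zero-crossing
    // rate of the analysis window, returned as a one-element vector.
    struct ZeroCrossings : System {
        vector<float> process(const vector<float>& in) override {
            int crossings = 0;
            for (std::size_t i = 1; i < in.size(); ++i)
                if ((in[i - 1] >= 0.0f) != (in[i] >= 0.0f)) ++crossings;
            return { in.empty() ? 0.0f
                                : static_cast<float>(crossings) / in.size() };
        }
    };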
6.5 Characteristics
In addition to the described components, for the purposes of the MPEG-based feature calculation, an MPEG audio decoder was implemented in C++. The source code is loosely based on the Fraunhofer Institute reference software implementation and other open source implementations (8). The combined decoding and classification/segmentation runs in real time on a Pentium II PC.

8 http://www.mpeg.org
All the graphical user interface components are implemented in Java. The Java3D API is used for the 3D graphics and the JavaSound API is used for the audio and MIDI playback. The computation server is written in standard ANSI C++. For parametric synthesis of sound effects and audio generation from MIDI the Synthesis Toolkit (9) [23] is used. MARSYAS is mainly developed under Linux and has also been compiled under the Solaris, Irix, Mac OS X, and Windows (Visual Studio and Cygwin compilers) operating systems. The use of standard C++ and Java and the independence from external libraries make the system easily portable to new architectures. Moreover, the client-server architecture allows a variety of possible configurations. For example, it is possible to have Windows and Solaris clients accessing the audio information retrieval server running in Linux. The framework is available as free software under the GNU Public License (GPL) from http://www.cs.princeton.edu/~gtzan/marsyas.html and has been downloaded approximately 2500 times from 1000 different hosts. Marsyas has been used by many Princeton undergraduate and graduate students doing projects related to Computer Audition. Some examples are: Music Mosaics (Douglas Turnbull, Ari Lazier), instrument classification (Corrie Elder), Pitch Histograms (George Tourtellot), query user interfaces (Andreye Ermolinskyi), and Music Library (John Forsyth).
9 http://ccrma-www.stanford.edu/software/stk/
Chapter 7
Conclusions
Remember, Information is not knowledge, Knowledge is not Wisdom; Wisdom is not truth; Truth is not beauty; Beauty is not love; Love is not music; Music is the best.
Frank Zappa
7.1 Discussion
In this thesis, a variety of algorithms, techniques and interfaces for working with audio signals and collections were proposed, developed, and evaluated. These systems can be used for manipulation, analysis and retrieval of audio signals and collections. Some of the main characteristics they share are: 1) practical solutions to fundamental problems that are applicable to arbitrary audio signals; 2) utilization of large collections; 3) statistical evaluation, in many cases based on user studies; 4) use of visualization and human-computer interaction techniques to include the user in the system; 5) integration under a common software implementation. These characteristics can be summarized as a general tendency to create practical, workable systems that can be used as prototypes for the Computer Audition systems of the future.
To summarize, the main contributions of this thesis are: 1) calculation of features from MPEG audio compressed data; 2) calculation of features based on the DWT; 3) a general multifeature methodology for segmentation of arbitrary textures; 4) automatic musical genre classification and similarity retrieval based on timbral, rhythmic and harmonic features; 5) a variety of novel user interfaces, such as Timbregrams and Timbrespaces for interacting with audio signals and collections and Sound Sliders for specifying audio queries; 6) the creation of an extensible software framework for Computer Audition research that integrates the proposed systems as well as a large variety of existing algorithms and systems.
This is a good point to go back to Section 1.9 and read again the proposed Computer Audition usage scenarios to see how the systems described in this thesis serve their requirements. Our hope is that someday in the future the tools for interacting with audio signals and collections will be inspired by, and look like, our prototype systems. In the following Sections current and future work related to this thesis will be described.
7.2 Current Work
In this Section some of the ongoing work continuing the research described in this thesis will be presented. The Sonic Browser is a tool for accessing sounds or collections of sounds using sound spatialization and content-overview visualization techniques, developed at the University of Limerick, Ireland. We are currently collaborating with the Sonic Browser team, combining their system with Marsyas. More specifically, the new system under development combines the interactivity and user configuration of the Sonic Browser with the automatic audio information retrieval tools of Marsyas in order to create a flexible distributed graphical user interface with multiple visualization components that provides novel ways of browsing large audio collections. The Extensible Markup Language (XML) is used to exchange configuration and metadata information between the various developed
components. This work has been submitted to the International Conference on Auditory Display (ICAD) 2002.
In Chapter 5 it was shown that features calculated from Pitch Histograms can be used for automatic musical genre classification. However, for the cases where these features fail, it is not clear if the failures have to do with the choice of features or with errors in the multiple pitch detection algorithm. In order to address this question we have compared the performance of Pitch Histogram features calculated on MIDI data and on audio derived from the same MIDI files. By definition, Pitch Histograms derived from MIDI data have no pitch errors, so by comparing the two results we can judge to what extent pitch errors influence the use of the features for classification. It has been shown that the Pitch Histogram features in both cases show the same performance, so it can be concluded that possible pitch errors in the audio case do not significantly affect the classification performance of the features. Of course, audio generated from MIDI, although polyphonic, is easier than "real world" audio signals for multiple pitch detection. However, because the genre classification results obtained from MIDI and audio-from-MIDI are comparable to results obtained from "real world" audio data for similar tasks (but obviously different data sets), it can be concluded that the classification performance of the features on "real world" audio data is also not significantly affected by pitch detection errors. It is important to note that this does not mean that better features are not possible or that the multiple pitch detection algorithm is perfect. This work will be submitted to the International Symposium on Music Information Retrieval (ISMIR) 2002.
As already mentioned, in addition to digital music distribution, collections of sound effects are another important area for Computer Audition research. The perceptual similarity of sound effects has been investigated by conducting user experiments in [104]. We are collaborating with the people involved in this project to use the described tools and interfaces to visualize and compare the results of this work. In addition, these results will be
compared with automatically extracted feature vectors and similarity information over the same collection of digital sound effects.
7.3 Future work
Computer Audition is a relatively recent research area and there are a lot of interesting directions for future work. Some of these directions are obvious extensions of the work described in this thesis and others are more unrelated. It is our hope that the developed software framework will help us and other researchers realize some of these directions for future work.

In the representation stage, the investigation of different front-ends for feature extraction is an area of active research. Although most of the features proposed in the literature have shown adequate performance, there have been few thorough investigations of all possible feature choices and combinations. By providing a common framework and data collections we are planning to compare different feature front-ends and optimize the choice of parameters for particular applications. For example, we are investigating the use of computational models of ear physiology, such as the ones proposed in [116, 117], as the basis for feature extraction.
Although there have been a large number of studies about the perception of beat periodicities, there is relatively little work on the perception of beat strength. We are currently conducting a user experiment to investigate how humans perceive musical beat strength. Each subject has to rate 50 segments from a variety of musical styles into five beat strength groups (weak, medium-weak, medium, medium-strong, strong). The order of presentation is randomized to avoid any order effects such as learning. Preliminary results from the 15 subjects that have completed the study so far indicate that there is considerable agreement between subjects regarding musical beat strength. The results of this study will be used to evaluate and calibrate the calculation of Beat Histograms.
Another interesting direction is the use of compressed-domain audio features for types of audio analysis other than music/speech classification, instrument classification and segmentation. Some examples are multiple pitch detection, beat analysis, speaker identification and speech recognition. The features described in this work are not the only ones possible, and further investigation is needed in order to come up with better features tailored to specific applications. In addition to the subband analysis, other types of information can be utilized. For example, in MPEG layer III a window switching scheme is used where the 32 subband signals are further subdivided in frequency content by applying, to each subband, a 6-point or 18-point modified DCT block transform. The 18-point block transform is normally applied because it provides better frequency resolution, whereas the 6-point block transform provides better time resolution. Therefore it is more likely to find segmentation boundaries in short blocks, and this information could be used to enhance the segmentation algorithm. For statistical pattern recognition, large amounts of training data need to be collected. By doing the analysis on MPEG compressed data, the enormous resources available on the Web can become available for training. We are planning to implement a Web crawler to automatically gather .mp3 files from the Web for this purpose.
The exact details of feature calculation for older methods such as the STFT and MFCC have been relatively well explored in the literature. On the other hand, features based on the DWT have appeared only recently. Further improvements in classification accuracy can be expected with more careful experimentation with the exact details of the parameters. We also plan to investigate the use of other wavelet families for the feature computation. Another interesting direction is to combine features from different analysis techniques to improve classification accuracy. We also plan to apply the general methodology for audio segmentation based on "texture" described in this thesis, using the DWT-based features.
In the implementation of the segmentation algorithm the importance of each feature is equal. Most likely this is not the case with humans, and it is not optimal. Therefore, relative
weighting of the features, as well as optimization of the parameters and heuristics, need to be explored. One possible approach to this problem of feature weighting is the use of Genetic Algorithms, as in [44]. The use of the proposed segmentation methodology for other types of audio signals, such as news broadcasts, is also an interesting possibility. Because of the difficulty of evaluating and comparing different segmentation schemes, standard corpora of audio examples need to be designed. Standardized corpora are very important for Computer Audition research and unfortunately their creation has been hindered by copyright problems. In the future we plan to collect more data on the time required to segment audio. Empirical evidence is required that automatic segmentation reduces the time to segment. Further tests in thumbnailing will need to be devised to determine the salience of the human- and machine-selected thumbnails, and to determine their usefulness. For example, can thumbnails cause a speedup in location/indexing, or can thumbnails be concatenated or otherwise combined to construct a useful "caricature" of a long audio selection? We plan to make more detailed analyses of the text annotations. Finally, the developed graphical user interface allows the collection of many user statistics, like the number of edit operations, detailed timing information, etc., that we plan to investigate further.
For the beat analysis using Beat Histograms, a comparison of the beat detection algorithm described in [108] with our scheme on the same data set is also planned for the future. In addition, more careful studies using the synthetic examples are also planned. These studies will show if it is possible to detect the phase and periodicity of multiple rhythmic lines using the DWT.
In this thesis similarity retrieval was performed using a single feature vector representation for each file. Although this approach works well for files with relatively constant texture, it has problems with files, such as classical music, with many different sections. Segmentation algorithms can provide information about where texture changes occur, but the problem of how to do similarity retrieval in those cases still remains. In the future we plan
to investigate trajectory- and segmentation-based approaches to content-based similarity retrieval.
For the automatic musical genre classification, an obvious direction for future research is expanding the genre hierarchy both in width and depth. Other semantic descriptions such as emotion or voice style will be investigated as possible classification categories. More exploration of the pitch content feature set could possibly lead to better performance. Alternative multiple pitch detection algorithms, for example based on cochlear models, could be used to create the pitch histograms. For the calculation of the beat histogram we plan to explore other filterbank front-ends as well as onset-based periodicity detection as in [66, 112]. We are also planning to investigate real-time running versions of the rhythmic structure and harmonic content feature sets. Another interesting possibility is the extraction of similar features directly from MPEG audio compressed data as in [95, 130]. Finally, we are also planning to use the proposed feature sets with alternative classification and clustering methods such as artificial neural networks. Two other possible sources of information about musical genre content are melody and singer voice. Although melody extraction is a hard problem that is not solved for general audio, it might be possible to obtain some statistical information even from imperfect melody extraction algorithms. Singing voice extraction and analysis is another interesting direction for future research.
In the Interaction stage, there is a lot of work that has to be done in order to improve the described interfaces and evaluate their effectiveness. For example, based on the fact that on average the subjects needed more than 2 hours to segment 10 minutes of audio using standard audio editing tools, we plan to compare how much this process can be accelerated using the Enhanced Audio Editor. By using our own software, collecting usage statistics such as mouse clicks and timing information is much easier than using a commercial audio editor. Some other directions for future work are the development of alternative sound-analysis visualization tools and the development of new model-based controllers for sound
synthesis based on the same principles. Of course, new analysis and synthesis algorithms will require adjustment of our existing tools and development of new ones. More formal evaluations of the developed tools are planned for the future. Integrating the model-based controllers into a real virtual environment is another direction of future work.
The Display Wall project has implemented a variety of alternative input methods in addition to standard keyboard and mouse navigation. We are exploring the use of speech recognition input for controlling the system. The client architecture of the user interfaces will make the integration with the speech system easy. Speech recognition is already being used for a circuit viewer. Currently the placing of the windows is done by the window manager, requiring manual adjustment, which is difficult on such a large display. Therefore, intelligent automatic placement of windows is planned for the future.
It is our belief that the problem of query specification has many interesting directions for future research because it provides a new perspective and design constraints on the issue of automatic music and sound generation and representation. Unlike existing work in automatic music generation, where the goal is to create as good music as possible, the goal in our work is to create convincing sketches or caricatures of the music that can provide feedback and be used for music information retrieval. Another related issue is the visual design of interfaces and displays and its connection to visual music and score representations.
Evaluating a complex retrieval system with a large number of graphical user interfaces is difficult and can only be done by conducting user studies. For the future we plan a task-based evaluation of our system, similar to [57], where users will be given a specific task, such as locating a particular song in a large collection, and their performance using different combinations of tools will be measured.
Another direction for future research is the collaboration with composers and music cognition experts with the goal of exploring different metaphors for sketching music
queries for users with and without musical background, and how those differ. Currently a lot of information about audio, and especially music, signals can be found in the form of metadata annotations such as filenames and ID3 tags of mp3 files. We plan to develop web crawlers that will collect that information and use it to build more effective tools for interacting with large audio collections.
On the implementation side, having a well-designed framework and knowing the basic abstractions and building blocks makes porting to other programming languages and systems easier. Currently, versions of the computational server written in SML/NJ (a typed functional programming language) and Java are under development. A version of some of the client user interfaces, written using GTK, is under development.
7.3.1 Other directions
In this Section some directions for future work that are not direct extensions of the work described in this thesis will be described. As already mentioned, a large amount of existing Computer Audition research solves hard problems on small collections of limited "toy" examples, with the hope of gradually increasing the number and complexity of signals it can handle. This approach has not yet been successful in building practical systems. In contrast, the work described in this thesis emphasizes simpler, fundamental problems that can be reliably solved over large collections of "real world" complex audio signals. Therefore an obvious direction for future work is to try to solve increasingly harder problems while still keeping the methods applicable to "real world" audio signals. These harder problems can be seen as intermediate steps toward the goal of automatic polyphonic transcription. In a sense, some of the techniques described in this work, such as Beat Histograms and Pitch Histograms, are moving toward that goal. In the following paragraphs a number of such interesting unsolved problems are described.
An important source of information about musical signals is the number and type of instruments playing at any particular moment (for example, a saxophone, piano and drums are playing now). Instrument mixture identification refers to automatically solving this problem. Although it is a harder problem than classification, as there has to be some separation of the signals, it is an easier problem than polyphonic transcription or source separation, as only the instrument labels are required. A related problem is instrument tracking, where a statistical model of an instrument sound is provided and that particular instrument is tracked through a polyphonic recording. Another interesting problem is the separation of particular types of sounds or instruments such as bass, drums and voice. For example, the separation of drum sounds could be used for more elaborate detection of particular rhythmic styles such as Bossa Nova.
In the same manner that Pitch Histograms provide overall pitch information for a song even though multiple pitch detection is still not perfect, it is probably possible to detect and identify chords. Beat analysis methods can also provide metric information as well as higher-level structural information. An important source of information for identifying music is the singing voice. Therefore techniques for singer identification and singing voice separation or enhancement are very interesting.
In addition to applications related to audio signals, it is possible that many of the proposed ideas have applications in other areas and types of signals. Some obvious examples include video and speech signals, and some less obvious ones include measured experimental data from fields such as physics and astronomy, stock market information, and biological signals. For example, visualizations of DNA based on Signal Processing techniques similar to the ones described in this thesis are described in [6].
7.4 Final Thoughts
Audio signals, and especially musical signals, are very interesting from a philosophical perspective. For some reason, humans find organized structures of moving air molecules so interesting that they have sophisticated sensory organs for detecting them, spend significant amounts of their lives listening to them, make weird devices for creating them, and use sophisticated technology to store and reproduce them. In the early days of humanity, the construction of practical tools such as arrows and knives was paralleled by the creation of holed tubes and stretched membranes for setting air molecules in motion. Today the construction of practical tools such as operating systems and firewalls is paralleled by the creation of systems for helping humans interact with and explore the vast amounts of structured air molecule motion stored as strings of ones and zeros. Music is the best.
Appendix A
A.1 Introduction
This appendix is a brief overview of some standard techniques that are important for the work described in this thesis. The main goal is to describe how the algorithms work rather than explaining and proving why they work. More details can be found in the corresponding references.
A.2 Short Time Fourier Transform
The most common set of features used for classification, segmentation and retrieval is based on the FFT, and more specifically the Short Time Fourier Transform (STFT). The
STFT X(n,k) of a signal x(n) is a function of both time n and frequency k. It can be written as:

X(n,k) = \sum_{m=-\infty}^{\infty} x(m)\, h(n-m)\, e^{-j(2\pi/N)km}    (A.1)
where h(n) is a sliding window, x(m) is the input signal, and N is the size of the transform. Typically the STFT is implemented using the FFT algorithm to make its calculation computationally efficient. The STFT can be viewed as a filter bank where the kth filter
channel is obtained by multiplying the input x(m) by a complex sinusoid at frequency k/N times the sample rate and then convolving with the lowpass filter h(m). In other words, the output X(n,k) for any particular value of k is a frequency-shifted, band-pass filtered version of the input. Another way to view the STFT is that for any particular value of n, the STFT is the Discrete Fourier Transform of the windowed input at time n. In other words, the STFT can also be thought of as partially overlapping Fourier transforms. The STFT for a particular frequency k at a particular time n is a complex number. Typically, for feature calculation only the magnitude of these complex numbers is retained.
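To make the windowed-transform view concrete, here is a minimal C++ sketch that computes the magnitude of one STFT frame with a direct DFT; as noted above, a real implementation would use the FFT instead of this O(N^2) loop. The function name is an illustrative assumption.

    #include <cmath>
    #include <cstddef>
    #include <vector>
    using std::vector;

    const double PI = 3.141592653589793;

    // Magnitude of one STFT frame: Hamming-window the N input samples,
    // then evaluate |X(k)| with a direct DFT (use an FFT in practice).
    vector<double> stftMagnitude(const vector<double>& frame) {
        const std::size_t N = frame.size();  // assumed > 1
        vector<double> w(N);
        for (std::size_t n = 0; n < N; ++n)
            w[n] = frame[n] * (0.54 - 0.46 * std::cos(2.0 * PI * n / (N - 1)));

        vector<double> mag(N / 2 + 1);  // keep only the non-redundant bins
        for (std::size_t k = 0; k < mag.size(); ++k) {
            double re = 0.0, im = 0.0;
            for (std::size_t n = 0; n < N; ++n) {
                double phi = 2.0 * PI * k * n / N;
                re += w[n] * std::cos(phi);
                im -= w[n] * std::sin(phi);
            }
            mag[k] = std::sqrt(re * re + im * im);
        }
        return mag;
    }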
A.3 Discrete Wavelet Transform
In the pyramidal algorithm the signal is analyzed at different frequency bands with different resolutions for each band. This is achieved by successively decomposing the signal into a coarse approximation and detail information. The coarse approximation is then further decomposed using the same wavelet decomposition step. This decomposition step is achieved by successive highpass and lowpass filtering of the time domain signal and is defined by the following equations:
y_{high}[k] = \sum_{n} x[n]\, g[2k - n]    (A.2)

y_{low}[k] = \sum_{n} x[n]\, h[2k - n]    (A.3)
where y_{high}[k] and y_{low}[k] are the outputs of the highpass (g) and lowpass (h) filters, respectively, after subsampling by two. Each level of the pyramid corresponds roughly to frequency bands spaced in octaves. Because of the symmetric pyramid structure and the decimation, the DWT can be performed in O(N) time (where N is the number of input points). In this work, the DAUB4 filters proposed by Daubechies [25] are used. Figure A.1 shows
a schematic diagram of the pyramidal algorithm for the calculation of the Discrete Wavelet Transform.
Figure A.1: Discrete Wavelet Transform Pyramidal Algorithm Schematic (the original signal, with frequency range 0-B, is split by the lowpass h(n) and highpass g(n) filters and downsampled, giving bands 0-B/2 and B/2-B at level 1, and 0-B/4 and B/4-B/2 at level 2)
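A minimal C++ sketch of one decomposition step of the pyramidal algorithm, implementing the filter-and-downsample structure of equations (A.2) and (A.3); Haar filter coefficients are used here for brevity instead of the DAUB4 filters that this work actually uses.

    #include <cmath>
    #include <vector>
    using std::vector;

    // One pyramid level: filter with h (lowpass) and g (highpass),
    // then keep every second sample (downsampling by two).
    void dwtStep(const vector<double>& x,
                 vector<double>& low, vector<double>& high) {
        // Haar filters for brevity; this work uses DAUB4 [25].
        const double h[2] = { 1.0 / std::sqrt(2.0),  1.0 / std::sqrt(2.0) };
        const double g[2] = { 1.0 / std::sqrt(2.0), -1.0 / std::sqrt(2.0) };
        low.clear();
        high.clear();
        for (std::size_t k = 0; 2 * k + 1 < x.size(); ++k) {
            double lo = 0.0, hi = 0.0;
            for (int n = 0; n < 2; ++n) {  // two-tap filter over x[2k], x[2k+1]
                lo += x[2 * k + n] * h[n];
                hi += x[2 * k + n] * g[n];
            }
            low.push_back(lo);
            high.push_back(hi);
        }
        // The full DWT applies this step recursively to `low`.
    }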
A.4 Mel-Frequency Cepstral Coefficients
Mel-Frequency Cepstral Coefficients [26] are a common feature front-end used in many speech recognition systems. An exploration of their use in Music/Speech classification is described in [72]. This technique combines an auditory filter bank, followed by a cosine transform, to give a representation with properties roughly similar to those of the human auditory system.

More specifically, the following steps are taken in order to calculate MFCCs:
1. The short-time slice of audio data to be processed is windowed by a Hamming window.

2. The magnitude of the Discrete Fourier Transform is computed using the FFT algorithm.

3. The FFT bins are combined into filterbank outputs approximating the perceptual frequency resolution properties of the human ear. The filterbank is constructed using
13 linearly-spaced filters (133.33 Hz between their center frequencies) followed by 27 log-spaced filters (separated by a factor of 1.0711703 in frequency). The bins are summed using a triangular weighting function which starts rising at the CF of the previous filter, has a peak at the current filter's CF, and ends at the CF of the next filter (see Figure A.2).

4. The filterbank outputs are converted to log base 10.

5. The Discrete Cosine Transform of the outputs is used to reduce dimensionality.
Figure A.2: Summing of FFT bins for MFCC filterbank calculation (each triangular window rises from the previous center frequency, CF/1.071 or CF - 133 Hz, peaks at the current CF, and falls to the next, CF * 1.071 or CF + 133 Hz)
More details about the calculation of MFCCs can be found in [98].
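The triangular summing of step 3 lends itself to a short C++ sketch; the bin-to-frequency mapping and the filter edge parameters below are illustrative assumptions, not a transcription of a particular MFCC implementation.

    #include <vector>
    using std::vector;

    // Weighted sum of FFT magnitude bins for one triangular filter whose
    // response rises from `lo`, peaks at `cf`, and falls back to zero at `hi`.
    double triangleFilterOutput(const vector<double>& mag,  // |FFT| bins
                                double binHz,               // Hz per FFT bin
                                double lo, double cf, double hi) {
        double sum = 0.0;
        for (std::size_t k = 0; k < mag.size(); ++k) {
            double f = k * binHz;
            double w = 0.0;
            if (f > lo && f <= cf)      w = (f - lo) / (cf - lo);  // rising edge
            else if (f > cf && f < hi)  w = (hi - f) / (hi - cf);  // falling edge
            sum += w * mag[k];
        }
        return sum;
    }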
A.5 Multiple Pitch Detection
The multiple pitch detection used for the Pitch Histogram calculation is based on the two-channel pitch analysis model described in [123]. A block diagram of this model is shown in Figure A.3. The signal is separated into two channels, below and above 1 kHz. The channel separation is done with filters that have 12 dB/octave attenuation in the stopband. The lowpass block also includes a highpass rolloff with 12 dB/octave below 70 Hz. The high channel is half-wave rectified and lowpass filtered with a filter similar (including the highpass characteristic at 70 Hz) to that used for separating the low channel.
The periodicity detection is based on "generalized autocorrelation", i.e., the computation consists of a discrete Fourier transform (DFT), magnitude compression of the spectral
representation, and an inverse transform (IDFT). The signal x2 of Figure A.3 is obtained as:
x_2 = \mathrm{IDFT}(|\mathrm{DFT}(x_{low})|^k) + \mathrm{IDFT}(|\mathrm{DFT}(x_{high})|^k)    (A.4)
    = \mathrm{IDFT}(|\mathrm{DFT}(x_{low})|^k + |\mathrm{DFT}(x_{high})|^k)    (A.5)
where x_{low} and x_{high} are the low and high channel signals before the periodicity detection blocks in Figure A.3. The parameter k determines the frequency domain compression; for normal autocorrelation k = 2. The fast Fourier Transform (FFT) and its inverse (IFFT) are used to speed up the computation of the transforms.

The peaks of the summary autocorrelation function (SACF) (signal x2 of Figure A.3) are relatively good indicators of potential pitch periods in the analyzed signal. In order to filter out integer multiples of the fundamental period, a peak pruning technique is used. The original SACF curve is first clipped to positive values, then time-scaled by a factor of two and subtracted from the original clipped SACF function, and the result is again clipped to positive values only. That way, repetitive peaks with double the time lag of the basic peak are removed. The resulting function is called the enhanced summary autocorrelation (ESACF), and its prominent peaks are accumulated in the Pitch Histogram calculation. More details about the calculation steps of this multiple pitch detection algorithm, as well as its evaluation and justification, can be found in [123].
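The peak-pruning step described above lends itself to a very short C++ sketch; this is a direct reading of the clip/stretch/subtract procedure, not the reference implementation of [123].

    #include <algorithm>
    #include <vector>
    using std::vector;

    // Enhance a summary autocorrelation function (SACF): clip to positive
    // values, subtract a time-stretched (x2) copy, and clip again.
    vector<double> enhanceSACF(const vector<double>& sacf) {
        vector<double> clipped(sacf.size());
        for (std::size_t i = 0; i < sacf.size(); ++i)
            clipped[i] = std::max(sacf[i], 0.0);

        vector<double> esacf(clipped.size());
        for (std::size_t i = 0; i < clipped.size(); ++i) {
            // Stretching by two means lag i of the copy comes from lag i/2,
            // so a peak at lag L in the copy lands at lag 2L.
            double stretched = clipped[i / 2];
            esacf[i] = std::max(clipped[i] - stretched, 0.0);
        }
        return esacf;  // its prominent peaks feed the Pitch Histogram
    }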
Figure A.3: Multiple Pitch Analysis Block Diagram (the input is split into xl and xh by 1 kHz lowpass and highpass filters; the high channel is half-wave rectified and lowpass filtered; each channel feeds a periodicity detection block, and the combined SACF passes through the SACF enhancer to produce x2)
A.6 Gaussian classifier
The Gaussian classifier models each class as a multidimensional Gaussian distribution. The Gaussian or Normal distribution in d dimensions can be written as:
p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right]    (A.6)
where x is a d-component column vector, \mu is the d-component mean vector, \Sigma is the d-by-d covariance matrix, and |\Sigma| and \Sigma^{-1} are its determinant and inverse, respectively.
A Gaussian classifier can be fully characterized by its mean vector \mu and covariance matrix \Sigma. Maximum Likelihood methods view the parameters of a distribution as quantities whose values are fixed but unknown. The best estimate of their values is defined to be the one that maximizes the probability of obtaining the samples actually observed. It can be shown that the Maximum Likelihood estimate of the parameters of the Gaussian classifier is the sample mean and covariance matrix [30]. In other words, in order to train a Gaussian classifier for a particular class of labeled samples, the sample mean and covariance matrix are used as the parameters of the distribution. Classification of an unknown vector x using a Gaussian classifier is done by finding the class Gaussian distribution that most likely would produce this vector.
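A minimal C++ sketch of this train-then-classify procedure, restricted to diagonal covariance matrices so that the inverse and determinant are trivial; the names are illustrative and nonzero variances are assumed.

    #include <cmath>
    #include <cstddef>
    #include <vector>
    using std::vector;

    const double PI = 3.141592653589793;

    // Per-class model: sample mean and diagonal sample variance per dimension.
    struct DiagGaussian {
        vector<double> mean, var;

        // Maximum Likelihood training: sample mean and variance of class data.
        void train(const vector<vector<double>>& samples) {
            std::size_t d = samples[0].size(), n = samples.size();
            mean.assign(d, 0.0);
            var.assign(d, 0.0);
            for (const auto& x : samples)
                for (std::size_t j = 0; j < d; ++j) mean[j] += x[j] / n;
            for (const auto& x : samples)
                for (std::size_t j = 0; j < d; ++j)
                    var[j] += (x[j] - mean[j]) * (x[j] - mean[j]) / n;
        }

        // Log of equation (A.6) with a diagonal covariance matrix.
        double logLikelihood(const vector<double>& x) const {
            double ll = 0.0;
            for (std::size_t j = 0; j < mean.size(); ++j) {
                double diff = x[j] - mean[j];
                ll -= 0.5 * (std::log(2.0 * PI * var[j]) + diff * diff / var[j]);
            }
            return ll;
        }
    };
    // Classification: evaluate logLikelihood(x) under each class model
    // and pick the class with the highest value.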
A.7 GMM classifier and the EM algorithm
The GMM classifier models each class probability density function as a finite mixture of multidimensional Gaussian distributions. If we denote by \theta_m the parameters of each Gaussian mixture component, and by \theta the complete set of parameters \theta_1, ..., \theta_k, \alpha_1, ..., \alpha_k, then the probability density function of the mixture model can be written as:
p(y | \theta) = \sum_{m=1}^{k} \alpha_m\, p(y | \theta_m)    (A.7)
where the \alpha_m are the mixing probabilities. Given a set of n independent and identically distributed samples Y = {y^{(1)}, y^{(2)}, ..., y^{(n)}}, the log-likelihood corresponding to a k-component mixture is
\log p(Y | \theta) = \log \prod_{i=1}^{n} p(y^{(i)} | \theta) = \sum_{i=1}^{n} \log \sum_{m=1}^{k} \alpha_m\, p(y^{(i)} | \theta_m)    (A.8)
Our goal is to find the maximum likelihood (ML) estimate:
\theta_{ML} = \arg\max_{\theta} \{ \log p(Y | \theta) \}    (A.9)
It is known that this estimate cannot be found analytically, therefore the EM algorithm, an iterative procedure for finding local maxima of \log p(Y | \theta), is used.
The EM algorithm is based on the interpretation of Y as incomplete data. For finite mixtures, the missing part is a set of n labels Z = {z^{(1)}, z^{(2)}, ..., z^{(n)}} associated with the n samples, indicating which component produced each sample. Each label is a binary indicator vector. The complete log-likelihood (i.e., the one from which we could estimate \theta if the complete data X = {Y, Z} were observed) is
\log p(Y, Z | \theta) = \sum_{i=1}^{n} \sum_{m=1}^{k} z_m^{(i)} \log\left[ \alpha_m\, p(y^{(i)} | \theta_m) \right]    (A.10)
The EM algorithm produces a sequence of parameter estimates {\theta(t), t = 0, 1, 2, ...} by alternately applying two steps (until some convergence criterion is met):

- E-step: Compute the conditional expectation of the complete log-likelihood, given Y and the current estimate of the parameters \theta(t):
Q(\theta, \theta(t)) = E\left[ \log p(Y, Z | \theta) \mid Y, \theta(t) \right]    (A.11)
- M-step: Update the parameter estimates according to:
\theta(t+1) = \arg\max_{\theta} Q(\theta, \theta(t))    (A.12)
In our implementation, the K-Means algorithm described in the next Section is used to initialize the means of each component, and diagonal covariance matrices are used. More details about the EM algorithm can be found in [84].
A.8 K-Means
The K-Means algorithm is an unsupervised learning or clustering algorithm. The only input parameter is the desired number of clusters, and the output of the algorithm is the mean vectors \mu_1, \mu_2, ..., \mu_n of the clusters. The algorithm is iterative and has the following steps (a sketch follows below):

1. Initialize n, c, \mu_1, \mu_2, ..., \mu_n
2. Do: classify the n samples according to the nearest \mu_i
3. Recompute \mu_i
4. Until there is no change in \mu_i
5. Return \mu_1, \mu_2, ..., \mu_n
More details and variations can be found in [30].
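A minimal C++ sketch of the loop above for points in d dimensions, with squared Euclidean distance as the assignment criterion; initializing the means from the first k samples is a crude illustrative choice, not a recommendation.

    #include <cstddef>
    #include <vector>
    using std::vector;
    using Point = vector<double>;

    double sqDist(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t j = 0; j < a.size(); ++j)
            s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    // Returns the k cluster means once the assignments stop changing.
    vector<Point> kMeans(const vector<Point>& data, std::size_t k) {
        vector<Point> mu(data.begin(), data.begin() + k);   // crude init
        vector<std::size_t> label(data.size(), k);          // k = unassigned
        for (bool changed = true; changed; ) {
            changed = false;
            // Step 2: classify each sample according to the nearest mean.
            for (std::size_t i = 0; i < data.size(); ++i) {
                std::size_t best = 0;
                for (std::size_t c = 1; c < k; ++c)
                    if (sqDist(data[i], mu[c]) < sqDist(data[i], mu[best]))
                        best = c;
                if (best != label[i]) { label[i] = best; changed = true; }
            }
            // Step 3: recompute each mean from its assigned samples.
            for (std::size_t c = 0; c < k; ++c) {
                Point sum(data[0].size(), 0.0);
                std::size_t cnt = 0;
                for (std::size_t i = 0; i < data.size(); ++i)
                    if (label[i] == c) {
                        for (std::size_t j = 0; j < sum.size(); ++j)
                            sum[j] += data[i][j];
                        ++cnt;
                    }
                if (cnt > 0)
                    for (std::size_t j = 0; j < sum.size(); ++j)
                        mu[c][j] = sum[j] / cnt;
            }
        }
        return mu;
    }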
A.9 KNN classifier
The K-nearest-neighbors (KNN) classifier is an example of a non-parametric classifier. If we denote by D_n = {x_1, ..., x_n} a set of n labeled prototypes, then the nearest neighbor rule for classifying an unknown vector x is to assign it the label of its closest point in the set of labeled prototypes D_n. It can be shown that for an unlimited number of prototypes the error rate of this classifier is never worse than twice the optimal Bayes rate [30]. The KNN rule classifies x by assigning it the label most frequently represented among the k nearest samples. Typically k is odd to avoid ties in voting. Because this algorithm has heavy time and space requirements, various methods are used to make its computation faster and its storage requirements smaller [30].
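For completeness, a minimal C++ sketch of the voting rule; the brute-force scan below is exactly the heavy computation that the speed-up methods mentioned above would replace, and all names are illustrative.

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <vector>
    using std::vector;
    using Point = vector<double>;

    // Squared Euclidean distance (same helper as in the K-Means sketch).
    double sqDist(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t j = 0; j < a.size(); ++j)
            s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    // Return the majority label among the k prototypes nearest to x (k >= 1).
    int knnClassify(const vector<Point>& protos, const vector<int>& labels,
                    const Point& x, std::size_t k) {
        vector<std::size_t> idx(protos.size());
        for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
        // Brute force: order all prototypes by distance to x.
        std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
            return sqDist(protos[a], x) < sqDist(protos[b], x);
        });
        std::map<int, int> votes;  // label -> count among the k nearest
        for (std::size_t i = 0; i < k && i < idx.size(); ++i)
            ++votes[labels[idx[i]]];
        int bestLabel = 0, bestCount = -1;
        for (const auto& v : votes)
            if (v.second > bestCount) { bestLabel = v.first; bestCount = v.second; }
        return bestLabel;
    }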
A.10 Principal Component Analysis
Principal Component Analysis (PCA) is a technique for dimensionality reduction [56]. PCA transforms an original set of variables into a new set of principal components which are at right angles (orthogonal) to each other (i.e., uncorrelated). Basically, the extraction of a principal component amounts to a variance-maximizing rotation of the original variable space. In other words, the first principal component is the axis passing through the centroid of the feature vectors that has the maximum variance, and therefore it explains a large part of the underlying feature structure. The next principal component tries to maximize the variance not explained by the first. In this manner, consecutive orthogonal components are extracted, and each one describes an increasingly smaller portion of the variance. A schematic representation of Principal Component extraction is shown in Figure A.4.
If we denote the feature covariance matrix as:
E(xx^T) = \Sigma    (A.13)
then the eigenvectors corresponding to the N largest eigenvalues of \Sigma are the N principal component axes.
Figure A.4: Schematic of Principal Component extraction (3 to 2 dimensions): the 1st and 2nd principal components become the new axes PC1 and PC2
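Since finding the eigenvectors of \Sigma is the computational core of PCA, here is a minimal C++ sketch that estimates the first principal component by power iteration on the covariance matrix; this is one simple choice of eigensolver for illustration, not necessarily the one used in this work.

    #include <cmath>
    #include <cstddef>
    #include <vector>
    using std::vector;

    // Estimate the dominant eigenvector of a symmetric matrix (e.g., the
    // feature covariance Sigma) by power iteration: v <- Sigma v, renormalize.
    vector<double> firstPrincipalComponent(const vector<vector<double>>& sigma,
                                           int iterations = 100) {
        std::size_t d = sigma.size();
        vector<double> v(d, 1.0);  // arbitrary starting vector
        for (int it = 0; it < iterations; ++it) {
            vector<double> w(d, 0.0);
            for (std::size_t i = 0; i < d; ++i)
                for (std::size_t j = 0; j < d; ++j)
                    w[i] += sigma[i][j] * v[j];
            double norm = 0.0;
            for (double x : w) norm += x * x;
            norm = std::sqrt(norm);
            for (std::size_t i = 0; i < d; ++i) v[i] = w[i] / norm;
        }
        return v;  // converges toward the axis of maximum variance
    }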
Appendix B
B.1 Glossary
This section is a glossary of the basic terminology used in this thesis, provided for quick reference. In most cases the meaning of these terms corresponds to their common use in the existing literature, but in some cases (mainly interaction- and implementation-related) they are somewhat idiomatic. Because of the short nature of the definitions, some details that do not affect the use of the term in this thesis are left out. For more complete and correct definitions the reader should consult the appropriate literature.
Analysis windows: short segments of sound with presumed relatively stable characteristics used to analyze audio content.

Beat: a relatively regular sequence of pulses that coincides with significant musical events.

Beat tracking: the process of extracting and following in real time the main beat of a piece of music.

Browser: an interface for interacting with a collection of audio files.

Classification: the process of automatically assigning a categorical label to an audio signal from a list of predefined categories.
Clustering: the process of forming groups (clusters) of audio signals based on their similarity.

Collection: a list of audio files (typical size >= 100 items).

Computer Audition: the extraction of information from audio signals (as in Computer Vision).

Editor: an interface for interacting with a specific audio file.

Feature Vector: a fixed-length array of real numbers used to represent the content of a short segment of sound called the analysis window. A feature vector can be thought of as a point in a multidimensional Feature Space.

Feature Space: a multidimensional space that contains Feature Vectors as points.

Fundamental frequency: the main frequency of oscillation present in a sound.

Mapping: a particular setting of interface parameters.

Monitor: a display that is updated in real time in response to the audio that is being played.

MIDI: an interface specification for controlling digital music instruments. Also a file format (similar to a musical score) that stores control information for playing a piece of music using digital music instruments.

MPEG audio compression: Motion Pictures Expert Group audio compression is a perceptually based lossy audio compression scheme. The well-known mp3 files are stored using this compression format.

Mp3 files: audio signals compressed using MPEG audio compression.
Musical Score: a symbolic representation of music that consists of the start, duration, volume and instrument quality of every note in a piece. Typically, musical scores are represented graphically.

Note: a sound of a particular pitch.

Octave: the relation between two sounds that have a frequency ratio of two.

Pitch: a subjective property of sound that can be used to order sounds from low to high and is typically related to the fundamental frequency.

Query User Interfaces: generate and/or filter audio signals in order to specify an audio retrieval query.

Segmentation: the process of detecting changes of Sound Texture.

Selection: a list of selected audio files in a collection, or a specific region in time in an audio file.

Similarity retrieval: the process of detecting files similar to a query and returning them ranked by their similarity.

Sound Texture: a particular type of sound mixture that can be characterized by its overall sound and audio feature statistics.

Texture windows: larger windows used to characterize sound "texture" by collecting statistics of the "analysis windows".

Timeline: a particular segmentation of an audio file into non-overlapping regions in time.

Thumbnailing: the process of creating a short sound segment that is representative of a longer audio signal.

View: a particular way of viewing a specific collection.
Appendix C
C.1 Publications
The various topics and ideas in this thesis were presented in thematic order. The careful reader might have observed, while reading the evaluation chapter, that certain experiments have not been performed as a consequence of the chronological order of publications. For example, in evaluating similarity retrieval a subset of the musical content features was used, because at the time the Beat Histogram and Pitch Histogram features had not been designed. This appendix provides a list of the publications that this thesis is based on, in chronological order, with a short description of their content.
1. A framework for audio analysis based on temporal segmentation and classification, 1999 [124]
First description of the Marsyas software framework. Initial proposal of the segmentation algorithm and implementation of Music/Speech classification based on [110] using the framework.

2. Multifeature audio segmentation for browsing and annotation, 1999 [125]
Full description of the multifeature segmentation methodology and proposal of a feature set. First user study to evaluate the segmentation algorithm.
3. Experiments in computer-assisted annotation of audio, 2000 [128]
Second user study for evaluating segmentation. Annotation and thumbnailing results collected. The experiments were conducted with the help of the developed Enhanced Audio Editor user interface.

4. Sound analysis using MPEG compressed audio, 2000 [130]
Proposal of MPEG-based compressed-domain audio features. Evaluation of the features based on Music/Speech classification and segmentation.

5. Marsyas: a framework for audio analysis, 2000 [129]
An enhanced journal publication of [124].

6. Audio Information Retrieval (AIR) Tools, 2000 [127]
A high-level overview of audio information retrieval. First description of Timbregrams and a similarity retrieval evaluation study.

7. 3D Graphics Tools for Isolated Sound Collections, 2000 [126]
Description of 3D graphics user interfaces. Introduction of Timbrespaces and sound effect query generators (PuCola, Gear, etc.).

8. MARSYAS3D: A prototype audio browser-editor using a large scale immersive visual and audio display, 2001 [131]
Description of graphics user interfaces and architecture. Emphasis on the use of the interface on the Princeton Display Wall.

9. Audio Analysis using the Discrete Wavelet Transform, 2001 [134]
Investigation of the Discrete Wavelet Transform for audio classification and beat analysis. Beat Histograms are introduced.
10. Automatic Musical Genre Classification of Audio Signals, 2001 [135]
First paper on musical genre classification. Use of spectral features and beat histograms for classification.

11. Audio Information Retrieval using Marsyas, 2002 [132]
Chapter in a book. High-level overview of Marsyas, essentially a short summary of this thesis.

12. Musical Genre Classification, 2002 [133]
Detailed experiments in musical genre classification. Pitch Histograms introduced and used together with spectral features and beat histogram features for classification.
Appendix D
D.1 Web Resources
Web resources related to the research presented in this thesis:
- http://soundlab.cs.princeton.edu/demo.php
Demonstrations.

- http://www.cs.princeton.edu/~gtzan/marsyas.html
The Marsyas software framework.

- http://www.cs.princeton.edu/~gtzan/publications.html
My publications.

- http://www.cs.princeton.edu/~gtzan/caudition.html
Lots of web resources (people, conferences, companies, publications, etc.)
Bibliography
[1] Proc. Int. Symposium on Music Information Retrieval (ISMIR), Plymouth, MA, 2000.
[2] Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2001.
[3] M. Alghoniemy and A. Tewfik. Rhythm and Periodicity Detection in Polyphonic Music. In Proc. 3rd Workshop on Multimedia Signal Processing, pages 185-190, Denmark, Sept. 1999.
[4] M. Alghoniemy and A. Tewfik. Personalized Music Distribution. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, Istanbul, Turkey, June 2000. IEEE.
[5] E. Allamanche, H. Jurgen, O. Hellmuth, B. Froba, T. Kastner, and M. Cremer. Content-based Identification of Audio Material using MPEG-7 Low Level Description. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2001.
[6] D. Anastassiou. Genomic Signal Processing. IEEE Signal Processing Magazine, 18(4):8-21, July 2001.
[7] B. Arons. SpeechSkimmer: a system for interactively skimming recorded speech. ACM Transactions on Computer-Human Interaction, 4:3-38, 1997. http://www.media.mit.edu/people/barons/papers/ToCHI97.ps.
[8] J.-J. Aucouturier and M. Sandler. Segmentation of Musical Signals Using Hidden Markov Models. In Proc. 110th Audio Engineering Society Convention, Amsterdam, The Netherlands, May 2001. Audio Engineering Society AES.
[9] D. H. Ballard and C. M. Brown. Computer Vision. Prentice Hall, 1982.
[10] M. A. Bartsch and G. H. Wakefield. To Catch a Chorus: Using Chroma-Based Representation for Audio Thumbnailing. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics, pages 15-19, Mohonk, NY, 2001. IEEE.
[11] S. Belongie, C. Carson, H. Greenspan, and J. Malik. Blobworld: A system for region-based image indexing and retrieval. In Proc. 6th Int. Conf. on Computer Vision, Jan. 1998.
[12] A. L. Berenzweig and D. P. Ellis. Locating singing voice segments within musical signals. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA, pages 119-123, Mohonk, NY, 2001. IEEE.
[13] J. Biles. GenJam: A Genetic Algorithm for Generating Jazz Solos. In Proc. Int. Computer Music Conf. (ICMC), pages 131-137, Aarhus, Denmark, Sept. 1994.
[14] G. Boccignone, M. De Santo, and G. Percannella. Joint Audio-Video Processing of MPEG Encoded Sequences. In Proc. Int. Conf. on Multimedia Computing and Systems (ICMCS), pages 225-229. IEEE, 1999.
[15] J. Boreczky and L. Wilcox. A hidden Markov model framework for video segmentation using audio and image features. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, volume 6, pages 3741-3744. IEEE, 1998.
[16] A. Bregman. Auditory Scene Analysis. MIT Press, Cambridge, 1990.
[17] J. Brown. Computer identification of musical instruments. Journal of the Acoustical Society of America, 105(3):1933-1941, 1999.
[18] C. Chafe, B. Mont-Reynaud, and L. Rush. Toward an Intelligent Editor of Digital Audio: Recognition of Musical Constructs. Computer Music Journal, 6(1):30-41, 1982.
[19] P. Cook. Physically inspired sonic modeling (PhISM): synthesis of percussive sounds. Computer Music Journal, 21(3), Aug. 1997.
[20] P. Cook. Toward physically-informed parametric synthesis of sound effects. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA, New Paltz, NY, 1999. Invited Keynote Address.
[21] P. Cook, editor. Music, Cognition, and Computerized Sound. MIT Press, 2001.
[22] P. Cook, G. Essl, and D. Trueman. N >> 2: Multi-speaker Display Systems for Virtual Reality and Spatial Audio Projection. In Proc. Int. Conf. on Auditory Display (ICAD), Glasgow, Scotland, 1998.
[23] P. Cook and G. Scavone. The Synthesis Toolkit (STK), version 2.1. In Proc. Int. Computer Music Conf. ICMC, Beijing, China, Oct. 1999. ICMA.
[24] R. Dannenberg. An on-line algorithm for real-time accompaniment. In Proc. Int. Computer Music Conf., pages 187-191, Paris, France, 1984.
[25] I. Daubechies. Orthonormal bases of compactly supported wavelets. Communications on Pure and Applied Math, 41:909-996, 1988.
[26] S. Davis and P. Mermelstein. Experiments in syllable-based recognition of continuous speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 28:357-366, Aug. 1980.
[27] H. Deshpande, R. Singh, and U. Nam. Classification of Musical Signals in the Visual Domain. In Proc. COST G-6 Conf. on Digital Audio Effects (DAFX), Limerick, Ireland, Dec. 2001.
[28] S. Dixon. A Lightweight Multi-agent Musical Beat Tracking System. In Proc. Pacific Rim Int. Conf. on Artificial Intelligence, pages 778-788, 2000.
[29] S. Dixon. An Interactive Beat Tracking and Visualization System. In Proc. Int. Computer Music Conf. (ICMC), pages 215-218, Habana, Cuba, 2002. ICMA.
[30] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, New York, 2000.
[31] D. Ellis. Prediction-driven computational auditory scene analysis. PhD thesis, MIT Media Lab, 1996.
[32] A. Eronen and A. Klapuri. Musical Instrument Recognition using Cepstral Features and Temporal Features. In Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, Istanbul, Turkey, 2000. IEEE.
[33] G. Essl and P. Cook. Measurements and efficient simulations of bowed bars. Journal of the Acoustical Society of America (JASA), 108(1):379-388, 2000.
[34] U. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2002.
[35] M. Fernstrom and C. McNamara. After Direct Manipulation - Direct Sonification. In Proc. Int. Conf. on Auditory Display, ICAD, Glasgow, Scotland, 1998.
[36] M. Fernstrom and E. Brazil. Sonic Browsing: an auditory tool for multimedia asset management. In Proc. Int. Conf. on Auditory Display (ICAD), Espoo, Finland, July 2001.
[37] M. Flickner et al. Query by image and video content: the QBIC system. IEEE Computer, 28(9):23-32, Sept. 1995.
[38] J. Foote. Content-based retrieval of music and audio. In Multimedia Storage and Archiving Systems II, pages 138-147, 1997.
[39] J. Foote. An overview of audio information retrieval. ACM Multimedia Systems, 7:2-10, 1999.
[40] J. Foote. Visualizing music and audio using self-similarity. In ACM Multimedia, 1999.
[41] J. Foote. Arthur: Retrieving orchestral music by long-term structure. Read at the First International Symposium on Music Information Retrieval, 2000.
[42] J. Foote. Automatic Audio Segmentation using a Measure of Audio Novelty. In Proc. Int. Conf. on Multimedia and Expo, volume 1, pages 452-455. IEEE, 2000.
[43] J. Foote and S. Uchihashi. The Beat Spectrum: a new approach to rhythmic analysis. In Int. Conf. on Multimedia & Expo. IEEE, 2001.
[44] I. Fujinaga. Machine recognition of timbre using steady-state tone of acoustic instruments. In Proc. Int. Computer Music Conf. ICMC, pages 207-210, Ann Arbor, Michigan, 1998. ICMA.
[45] I. Fujinaga. Realtime recognition of orchestral instruments. In Int. Computer Music Conf. ICMC, pages 141-143. ICMA, 2000.
[46] B. Garton. Virtual Performance Modelling. In Proc. Int. Computer Music Conf. (ICMC), pages 219-222, San Jose, California, Oct. 1992.
[47] A. Ghias, J. Logan, D. Chamberlin, and B. Smith. Query by Humming: Musical Information Retrieval in an Audio Database. ACM Multimedia, pages 213-236, 1995.
[48] M. Goto and Y. Muraoka. Real-time rhythm tracking for drumless audio signals - chord change detection for musical decisions. In Proc. Int. Joint Conf. on Artificial Intelligence: Workshop on Computational Auditory Scene Analysis, 1997.
[49] M. Goto and Y. Muraoka. Music Understanding at the Beat Level: Real-time Beat Tracking of Audio Signals. In D. Rosenthal and H. Okuno, editors, Computational Auditory Scene Analysis, pages 157-176. Lawrence Erlbaum Associates, 1998.
[50] J. M. Gray. An Exploration of Musical Timbre. PhD thesis, Dept. of Psychology, Stanford Univ., 1975.
[51] W. Grosky, R. Jain, and R. Mehrotra, editors. The Handbook of Multimedia Information Management. Prentice Hall, 1997.
[52] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502-516, 1989.
[53] A. Hauptmann and M. Witbrock. Informedia: News-on-demand multimedia information acquisition and retrieval. In Intelligent Multimedia Information Retrieval, chapter 10, pages 215-240. MIT Press, Cambridge, 1997. http://www.cs.cmu.edu/afs/cs/user/alex/www/.
[54] T. Hermann, P. Meinicke, and H. Ritter. Principal Curve Sonification. In International Conference on Auditory Display, ICAD, 2000.
[55] R. E. Johnson. Components, Frameworks, Patterns. In Proc. ACM SIGSOFT Symposium on Software Reusability, pages 10-17, 1997.
[56] L. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[57] J. M. Jose, J. Furner, and D. J. Harper. Spatial querying for image retrieval: a user-oriented evaluation. In Proc. SIGIR Conf. on Research and Development in Information Retrieval, Melbourne, Australia, 1998. ACM.
[58] I. JTC1/SC29. Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s - IS 11172 (Part 3, Audio). 1992.
[59] I. JTC1/SC29. Information Technology - Generic Coding of Moving Pictures and Associated Audio Information - IS 13818 (Part 3, Audio). 1994.
[60] H. Kang and B. Shneiderman. Visualization Methods for Personal Photo Collections: Browsing and Searching in the PhotoFinder. In Proc. Int. Conf. on Multimedia and Expo, New York, 2000. IEEE.
[61] N. Kashino and T. Kinoshita. Application of Bayesian probability network to music scene analysis. In Proc. Int. Joint Conf. on Artificial Intelligence, CASA Workshop, 1995.
[62] T. Kemp, M. Schmidt, M. Westphal, and A. Waibel. Strategies for automatic segmentation of audio data. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), volume 3, pages 1423-1426. IEEE, 2000.
[63] D. Kimber and L. Wilcox. Acoustic segmentation for audio browsers. In Interface Conference, Sydney, Australia, July 1996.
[64] G. E. Krasner and S. T. Pope. A cookbook for using the model-view-controller user interface paradigm in Smalltalk-80. Journal of Object-Oriented Programming, 1(3):26-49, Aug. 1988.
[65] R. Kronland-Martinet, J. Morlet, and A. Grossman. Analysis of sound patterns through wavelet transforms. Int. Journal of Pattern Recognition and Artificial Intelligence, 1(2):237-301, 1987.
[66] J. Laroche. Estimating Tempo, Swing and Beat Locations in Audio Recordings. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA, pages 135-139, Mohonk, NY, 2001. IEEE.
[67] F. Lerdahl and R. Jackendoff. A Generative Theory of Tonal Music. MIT Press, 1983.
[68] G. Li and A. Khokar. Content-based indexing and retrieval of audio data using wavelets. In Int. Conf. on Multimedia and Expo (II), pages 885-888. IEEE, 2000.
[69] K. Li et al. Building and using a scalable display wall system. IEEE Computer Graphics and Applications, 21(3), July 2000.
[70] S. Li. Content-based classification and retrieval of audio using the nearest feature line method. IEEE Transactions on Speech and Audio Processing, 8(5):619-625, Sept. 2000.
[71] J. List, A. van Ballegooij, and A. de Vries. Known-Item Retrieval on Broadcast TV. Technical Report INS-R0104, CWI, Apr. 2001.
[72] B. Logan. Mel Frequency Cepstral Coefficients for Music Modeling. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2000.
[73] B. Logan. Music summarization using key phrases. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP. IEEE, 2000.
[74] L. Lu, J. Hao, and Z. HongJiang. A robust audio classification and segmentation method. In Proc. ACM Multimedia, Ottawa, Canada, 2001.
[75] P. C. Mahalanobis. On the generalized distance in statistics. In Proc. National Inst. of Science India, volume 12, pages 49-55, 1936.
[76] J. Makhoul. Linear prediction: A tutorial overview. Proceedings of the IEEE, 63:561-580, Apr. 1975.
[77] S. G. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.
[78] G. Marchionini. Information Seeking in Electronic Environments. The Press Syndicate of the University of Cambridge, 1995.
[79] K. Martin. A Blackboard System for Automatic Transcription of Simple Polyphonic Music. Technical Report 399, MIT Media Lab, 1996.
[80] K. Martin. Toward automatic sound source recognition: identifying musical instruments. In NATO Computational Hearing Advanced Study Institute, Il Ciocco, IT, 1998.
[81] K. Martin. Sound-Source Recognition: A Theory and Computational Model. PhD thesis, MIT Media Lab, 1999.
[82] K. Martin, E. Scheirer, andB. Vercoe. Musical contentanalysisthroughmodelsof audition. In Proc.MultimediaWorkshopon Content-basedProcessingof Music,Bristol, UK, 1998.ACM.
[83] N. Masako and K. Watanabe. Interactive Music Composerbasedon NeuralNetworks. In Proc.Int. ComputerMusicConf. (ICMC), SanJose,California,1992.
[84] T. K. Moon. The Expectation-Maximization Algorithm. IEEE SignalProcessingMagazine, 13(6):47–60,Nov. 1996.
[85] F. Moore. Elementsof ComputerMusic. PrenticeHall, 1990.
[86] J. Moorer. On the Segmentation and Analysis of Continuous Musical SoundbyDigital Computer. PhDthesis,Dept.of Music,StanfordUniversity, 1975.
[87] J. Moorer. The Lucasfilm Audio Signal Processor. ComputerMusic Journal,6(3):30–41, 1982.(alsoin ICASSP82).
[88] P. Noll. MPEG digital audio coding. IEEE Signal Processing Magazine, pages59–81,September1997.
[89] A. Oppenheimand R. Schafer. Discrete-Time SignalProcessing. PrenticeHall,Edgewood Clif fs, NJ,1989.
[90] L. OttavianiandD. Rocchesso.Separationof SpeechSignalfromComplex AuditoryScenes. In Proc. COSTG-6 Conf. on Digital Audio Effects (DAFX), Limerick,Ireland,Dec.2001.
[91] A. Pentland,R. Picard, and S. Sclaroff. Photobook: Tools for Content-BasedManipulationof ImageDatabases.IEEEMultimedia, pages73–75,July1994.
[92] D. Perrot and R. Gjerdigen. Scanningthe dial: An exploration of factors inidentificationof musicalstyle. In Proc.Societyfor MusicPerceptionandCognition,page88,1999.(abstract).
[93] S. Pfeiffer. Pauseconceptsfor audiosegmentation at differentsemanticlevels. InProc.ACM Multimedia, Ottawa,Canada,2001.
[94] J. Pierce. Consonance and Scales. In P. Cook, editor, Music, Cognition, and Computerized Sound, pages 167–185. MIT Press, 1999.
[95] D. Pye. Content-based methods for the management of digital music. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2000.
[96] L. Rabiner, M. Cheng, A. Rosenberg, and C. McGonegal. A comparative performance study of several pitch detection algorithms. IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-24:399–417, Oct. 1976.
[97] L. Rabiner and B. Gold. Theory and Application of Digital Signal Processing. Prentice Hall, 1975.
[98] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.
[99] C. Roads. Computer Music Tutorial. MIT Press, 1996.
[100] K. Rodden, W. Basalaj, D. Sinclair, and K. Wood. Does Organization by Similarity Assist Image Browsing? In Proc. SIGCHI Conf. on Human Factors in Computing Systems, Seattle, WA, USA, 2001. ACM.
[101] D. F. Rosenthal and H. G. Okuno, editors. Computational Auditory Scene Analysis. Lawrence Erlbaum, 1998.
[102] S. Rossignol, X. Rodet, et al. Features extraction and temporal segmentation of acoustic signals. In Proc. Int. Computer Music Conf. (ICMC), pages 199–202. ICMA, 1998.
[103] J. Saunders. Real-time discrimination of broadcast speech/music. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 993–996. IEEE, 1996.
[104] G. Scavone, S. Lakatos, P. Cook, and C. Harbke. Perceptual spaces for sound effects obtained with an interactive similarity rating program. In Proc. Int. Symposium on Musical Acoustics, Perugia, Italy, Sept. 2001.
[105] R. Schalkoff. Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley & Sons, 1992.
[106] E. Scheirer. Bregman's chimerae: Music perception as auditory scene analysis. In Proc. Int. Conf. on Music Perception and Cognition, Montreal, 1996.
[107] E. Scheirer. The MPEG-4 structured audio standard. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1998.
[108] E. Scheirer. Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1):588–601, Jan. 1998.
[109] E. Scheirer. Music-Listening Systems. PhD thesis, MIT, 2000.
[110] E. Scheirer and M. Slaney. Construction and evaluation of a robust multifeature speech/music discriminator. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 1331–1334. IEEE, 1997.
[111] D. Schwarz. A system for data-driven concatenative sound synthesis. In Proc. COST G-6 Conf. on Digital Audio Effects (DAFX), Verona, Italy, Dec. 2000.
[112] J.Sepp̈anen.QuantumGrid Analysisof MusicalSignals.In Proc.Int. Workshoponapplicationsof SignalProcessing to AudioandAcousticsWASPAA, pages131–135,Mohonk, NY, 2001.IEEE.
[113] R.N. Shepard.Circularityin Judgementsof RelativePitch.Journalof theAcousticalSocietyof America, 35:2346–2353,1964.
[114] B. Shneiderman.Designing the User Interface: Strategies for EffectiveHuman-ComputerInteraction. Addison-Wesley, 3rded.edition,1998.
[115] M. Slaney. A critique of pure audition. In Computational Auditory Scene Analysis, 1997.
[116] M. Slaney and R. Lyon. A perceptual pitch detector. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 357–360, Albuquerque, NM, 1990. IEEE.
[117] M. Slaney and R. Lyon. On the importance of time: a temporal representation of sound. In M. Cooke, B. Beet, and M. Crawford, editors, Visual Representations of Speech Signals, pages 95–116. John Wiley & Sons Ltd, 1993.
[118] P. Smaragdis. Redundancy Reduction for Computational Audition, a Unifying Approach. PhD thesis, MIT Media Lab, 2001.
[119] L. Smith. A Multiresolution Time-Frequency Analysis and Interpretation of Musical Rhythm. PhD thesis, University of Western Australia, July 1999.
[120] R. Spence. Information Visualization. Addison-Wesley ACM Press, 2001.
[121] K. Steiglitz. A Digital Signal Processing Primer. Addison-Wesley, 1996.
[122] S. R. Subramanya and A. Youssef. Wavelet-based indexing of audio data in audio/multimedia databases. In Proc. Int. Workshop on Multimedia Database Management Systems (IW-MMDBMS), pages 46–53, 1998.
[123] T. Tolonen and M. Karjalainen. A Computationally Efficient Multipitch Analysis Model. IEEE Trans. on Speech and Audio Processing, 8(6):708–716, Nov. 2000.
[124] G. Tzanetakis and P. Cook. A Framework for Audio Analysis based on Classification and Temporal Segmentation. In Proc. Euromicro 99 Workshop on Music Technology and Audio Processing, 1999.
[125] G. Tzanetakis and P. Cook. Multifeature audio segmentation for browsing and annotation. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 1999. IEEE.
[126] G. Tzanetakis and P. Cook. 3D Graphics Tools for Sound Collections. In Proc. COST G-6 Conf. on Digital Audio Effects (DAFX), Verona, Italy, Dec. 2000.
[127] G. Tzanetakis and P. Cook. Audio Information Retrieval (AIR) Tools. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2000.
[128] G. Tzanetakis and P. Cook. Experiments in computer-assisted annotation of audio. In Proc. Int. Conf. on Auditory Display (ICAD), 2000.
[129] G. Tzanetakis and P. Cook. Marsyas: A framework for audio analysis. Organised Sound, 4(3), 2000.
[130] G. Tzanetakis and P. Cook. Sound analysis using MPEG compressed audio. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, 2000. IEEE.
[131] G. Tzanetakis and P. Cook. Marsyas3D: A prototype audio browser-editor using a large scale immersive visual and audio display. In Proc. Int. Conf. on Auditory Display (ICAD), Espoo, Finland, Aug. 2001.
[132] G. Tzanetakis and P. Cook. Audio Information Retrieval using Marsyas. Kluwer Academic Publishers, 2002. (To be published.)
[133] G. Tzanetakis and P. Cook. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, 2002. (Accepted for publication.)
[134] G. Tzanetakis, G. Essl, and P. Cook. Audio Analysis using the Discrete Wavelet Transform. In Proc. Conf. in Acoustics and Music Theory Applications. WSES, Sept. 2001.
[135] G. Tzanetakis, G. Essl, and P. Cook. Automatic Musical Genre Classification of Audio Signals. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), Oct. 2001.
[136] G. Tzanetakis and L. Julia. Multimedia Structuring using Trees. In Proc. RIAO 2000 "Content-based multimedia information access", Paris, France, Apr. 2000.
[137] K. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
[138] R. Vertegaal and E. Bonis. ISEE: An Intuitive Sound Editing Environment. Computer Music Journal, 18(2):21–29, 1994.
[139] R. Vertegaal and B. Eaglestone. Looking for Sound? Selling Perceptual Space in Hierarchically Nested Boxes. In Summary CHI 98 Conf. on Human Factors in Computing Systems. ACM, 1998.
[140] S. Wake and T. Asahi. Sound Retrieval with Intuitive Verbal Expressions. In Proc. Int. Conf. on Auditory Display (ICAD), Glasgow, Scotland, 1997.
[141] Y. Wang and M. Vilermo. A compressed domain beat detector using MP3 audio bitstreams. In Proc. ACM Multimedia, Ottawa, Canada, 2001.
[142] B. Whitman, G. Flake, and S. Lawrence. Artist Detection in Music with Minnowmatch. In Proc. Workshop on Neural Networks for Signal Processing, pages 559–568, Falmouth, Massachusetts, Sept. 2001. IEEE.
[143] E. Wold, T. Blum, D. Keislar, and J. Wheaton. Content-based classification, search and retrieval of audio. IEEE Multimedia, 3(2), 1996.
[144] C. Yang. MACS: Music Audio Characteristic Sequence Indexing for Similarity Retrieval. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 2001. IEEE.
[145] C. Yang. Music Database Retrieval based on Spectral Similarity. In Proc. Int. Symposium on Music Information Retrieval (ISMIR, poster), Bloomington, Indiana, 2001.
[146] B. Yeo and B. Liu. Rapid Scene Analysis on Compressed Videos. IEEE Trans. Circuits and Systems for Video Technology, 5(6), 1995.
[147] T. Zhang and J. Kuo. Audio Content Analysis for Online Audiovisual Data Segmentation and Classification. IEEE Transactions on Speech and Audio Processing, 9(4):441–457, May 2001.
[148] A. Zils and F. Pachet. Musical Mosaicing. In Proc. COST G-6 Conf. on Digital Audio Effects (DAFX), Limerick, Ireland, Dec. 2001.