MANIPULATION, ANALYSIS AND RETRIEVAL
SYSTEMS FOR AUDIO SIGNALS
GEORGE TZANETAKIS
A DISSERTATION
PRESENTED TO THE FACULTY
OF PRINCETON UNIVERSITY
IN CANDIDACY FOR THE DEGREE
OF DOCTOR OF PHILOSOPHY
RECOMMENDED FOR ACCEPTANCE
BY THE DEPARTMENT OF
COMPUTER SCIENCE
JUNE 2002
© Copyright by George Tzanetakis, 2002.
All Rights Reserved
To my grandfather Giorgos Tzanetakis whose passion for knowledge and love for all living things will always be an inspiration for me.

To my parents Panos and Anna and my sister Flora for their support and love.
Abstract

Digital audio and especially music collections are becoming a major part of the average computer user experience. Large digital audio collections of sound effects are also used by the movie and animation industry. Research areas that utilize large audio collections include Auditory Display, Bioacoustics, Computer Music, Forensics, and Music Cognition.

In order to develop more sophisticated tools for interacting with large digital audio collections, research in Computer Audition algorithms and user interfaces is required. In this work a series of systems for manipulating, retrieving from, and analyzing large collections of audio signals will be described. The foundation of these systems is the design of new and the application of existing algorithms for automatic audio content analysis. The results of the analysis are used to build novel 2D and 3D graphical user interfaces for browsing and interacting with audio signals and collections. The proposed systems are based on techniques from the fields of Signal Processing, Pattern Recognition, Information Retrieval, Visualization and Human Computer Interaction. All the proposed algorithms and interfaces are integrated under MARSYAS, a free software framework designed for rapid prototyping of computer audition research. In most cases the proposed algorithms have been evaluated and informed by conducting user studies.

New contributions of this work to the area of Computer Audition include: a general multifeature audio texture segmentation methodology, feature extraction from mp3 compressed data, automatic beat detection and analysis based on the Discrete Wavelet Transform, and musical genre classification combining timbral, rhythmic and harmonic features. Novel graphical user interfaces developed in this work include various tools for browsing and visualizing large audio collections such as the Timbregram, TimbreSpace, GenreGram, and Enhanced Sound Editor.
Acknowledgments

First of all I would like to thank Perry Cook for being such a great advisor. Perry has always been supportive of my work and a constant source of new and interesting ideas. Large parts of the research described in this thesis were written after Perry said "I have a question for you. How hard would it be to do ...". In addition I would like to thank the members of my thesis committee: Andrea LaPaugh, Ken Steiglitz, Tom Funkhouser and Paul Lansky, who provided valuable feedback and support, especially during the last difficult months of looking for a job and finishing my PhD.

Kimon has motivated me to walk and run in the woods, clearing my brain as I follow his wagging tail. Emily provided support, soups and stews. Lujo Bauer, Georg Essl, Neophytos Michael, Kedar Swadi, and all the other graduate students at the Computer Science department have been great friends and helped me many times in many different ways.

During my years in graduate school it was my pleasure to collaborate with many undergraduate and graduate students doing projects related to my research, in many cases using MARSYAS, the computer audition software framework I have developed. These include John Forsyth, Fei Fei Li, Corrie Elder, Doug Turnbull, Anoop Gupta, David Greco, George Tourtellot, Andreye Ermolinskyi, and Ari Lazier.

The two summer internships I did at the Computer Human Interaction Center (CHIC) at SRI International and at Moodlogic were great learning experiences and provided me with valuable knowledge. I would like to especially thank Luc Julia and Rehan Khan for being great people to work with.

Finally, research costs money, and I was fortunate enough to have support from the following funding sources: NSF grant 9984087, New Jersey Commission on Science and Technology grant 01-2042-007-22, Intel and the Arial Foundation.
Contents

Abstract

1 Introduction
  1.1 Motivation and Applications
  1.2 The main problem
  1.3 Current state
  1.4 Relevant Research Areas
  1.5 Guiding principles and observations
  1.6 Overview
  1.7 Related work
  1.8 Main Contributions
  1.9 A short tour

2 Representation
  2.1 Introduction
  2.2 Related Work
    2.2.1 Contributions
  2.3 Spectral Shape Features
    2.3.1 STFT-based features
    2.3.2 Mel-Frequency Cepstral Coefficients
    2.3.3 MPEG-based features
  2.4 Representing sound "texture"
    2.4.1 Analysis and texture windows
    2.4.2 Low Energy Feature
  2.5 Wavelet transform features
    2.5.1 The Discrete Wavelet Transform
    2.5.2 Wavelet features
  2.6 Other features
  2.7 Rhythmic Content Features
    2.7.1 Beat Histograms
    2.7.2 Features
  2.8 Pitch Content Features
    2.8.1 Pitch Histograms
    2.8.2 Features
  2.9 Musical Content Features
  2.10 Summary

3 Analysis
  3.1 Introduction
  3.2 Related Work
    3.2.1 Contributions
  3.3 Query-by-example content-based similarity retrieval
  3.4 Classification
    3.4.1 Classifiers
    3.4.2 Classification schemes
    3.4.3 Voices
  3.5 Segmentation
    3.5.1 Motivation
    3.5.2 Methodology
    3.5.3 Segmentation Features
  3.6 Audio thumbnailing
  3.7 Integration
  3.8 Summary

4 Interaction
  4.1 Introduction
  4.2 Related Work
    4.2.1 Beyond the Query-by-Example paradigm
    4.2.2 Contributions
  4.3 Terminology
  4.4 Browsers
    4.4.1 Timbrespace Browser
    4.4.2 Timbregram viewer and browser
    4.4.3 Other Browsers
  4.5 Enhanced Audio Editor
  4.6 Monitors
    4.6.1 GenreGram
    4.6.2 TimbreBall monitor
    4.6.3 Other monitors
  4.7 Audio-based Query Interfaces
    4.7.1 Sound Sliders
    4.7.2 Sound palettes
    4.7.3 Loops
    4.7.4 Music Mosaics
    4.7.5 3D sound effects generators
  4.8 MIDI-based Query Interfaces
    4.8.1 The groovebox
    4.8.2 The tonal knob
    4.8.3 Style machines
  4.9 Integration
  4.10 The Princeton Display Wall
  4.11 Summary

5 Evaluation
  5.1 Introduction
  5.2 Related Work
    5.2.1 Contributions
  5.3 Collections
  5.4 Classification evaluation
    5.4.1 Cross-Validation
  5.5 Segmentation experiments
  5.6 MPEG-based features
    5.6.1 Music/Speech classification
    5.6.2 Segmentation evaluation
  5.7 DWT-based features
  5.8 Similarity retrieval evaluation
  5.9 Automatic Musical Genre Classification
    5.9.1 Results
    5.9.2 Human performance for genre classification
  5.10 Other experimental results
    5.10.1 Audio Thumbnailing
    5.10.2 Annotation
    5.10.3 Beat studies
  5.11 Summary

6 Implementation
  6.1 Introduction
  6.2 Related Work
  6.3 Architecture
    6.3.1 Computation Server
    6.3.2 User Interface Client - Interaction
  6.4 Usage
  6.5 Characteristics

7 Conclusions
  7.1 Discussion
  7.2 Current Work
  7.3 Future work
    7.3.1 Other directions
  7.4 Final Thoughts

A
  A.1 Introduction
  A.2 Short Time Fourier Transform
  A.3 Discrete Wavelet Transform
  A.4 Mel-Frequency Cepstral Coefficients
  A.5 Multiple Pitch Detection
  A.6 Gaussian classifier
  A.7 GMM classifier and the EM algorithm
  A.8 K-Means
  A.9 KNN classifier
  A.10 Principal Component Analysis

B
  B.1 Glossary

C
  C.1 Publications

D
  D.1 Web Resources
List of Figures

1.1 Relevant Research Areas
1.2 Computer Audition Pipeline

2.1 Magnitude Spectrum
2.2 Sound textures of news broadcast
2.3 Beat Histogram Calculation Flow Diagram
2.4 Beat Histogram Example
2.5 Beat Histogram Examples

3.1 Musical Genres Hierarchy
3.2 Classification Schemes Accuracy
3.3 Feature-based segmentation and classification

4.1 2D TimbreSpace of Sound Effects
4.2 3D TimbreSpace of Sound Effects
4.3 3D TimbreSpace of orchestral music
4.4 Timbregram superimposed over waveform
4.5 Timbregrams of music and speech
4.6 Timbregrams of classical, rock and speech
4.7 Timbregrams of orchestral music
4.8 Enhanced Audio Editor I
4.9 Enhanced Audio Editor II
4.10 Timbreball and GenreGram monitors
4.11 Sound sliders and palettes
4.12 3D sound effects generators
4.13 Groovebox
4.14 Indian music style machine
4.15 Integration of interfaces
4.16 Marsyas on the Display Wall

5.1 Musical Genres Hierarchy
5.2 Histograms of subject agreement and learning curve for mean segment completion time
5.3 1,2,3 and 4,5,6 are the means and variances of MPEG Centroid, Rolloff and Flux. Low Energy is 7.
5.4 DWT-based feature evaluation based on audio classification
5.5 Web-based Retrieval User Evaluation
5.6 Results of User Evaluation
5.7 Classification accuracy percentages (RND = random, RT = real time, WF = whole file)
5.8 Effect of texture window size on classification accuracy
5.9 Synthetic Beat Patterns (CY = cymbal, HH = hi-hat, SD = snare drum, BD = bass drum, WB = woodblock, HT/MT/LT = tom-toms)

6.1 Music Speech Discriminator Architectural Layout
6.2 Segmentation peaks and corresponding time tree
6.3 Marsyas User Interface Component Architecture

A.1 Discrete Wavelet Transform Pyramidal Algorithm Schematic
A.2 Summing of FFT bins for MFCC filterbank calculation
A.3 Multiple Pitch Analysis Block Diagram
A.4 Schematic of Principal Component extraction
Chapter 1

Introduction
The process of preparing programs for a digital computer is especially attractive, not only because it can be economically and scientifically rewarding, but also because it can be an aesthetic experience much like composing poetry or music.

Donald E. Knuth, The Art of Computer Programming
(especially when you write programs to analyze audio signals)
1.1 Motivation and Applications
Developments in hardware, compression technology and network connectivity have made digital audio and especially music a major part of the everyday computer user experience and a significant portion of web traffic. The media and legal attention that companies like Napster have recently received testifies to this increasing importance of digital audio. It is likely that in the near future the majority of recorded music in human history will be available digitally. As an indication of the magnitude of such a collection, the number of CD tracks currently in retail is approximately 3 million. It is still unclear what the business models for this brave new music world will be. However, one thing that is certain is the need to interact effectively with increasingly large digital audio collections.
In addition to digital music distribution, the need to interact with large audio collections arises in other application areas as well. Most of the movies produced today use a large number of sounds taken from large libraries of sound effects (for example the BBC Sound Effects Library contains 60 compact disks). Whenever we hear a door closing, a gunshot or a telephone ringing in a movie it almost always originates from one of those libraries. Even larger numbers of sound effects are used in the animation industry. The importance and usage of audio in computer games is also increasing. Moreover, composers, DJs, and arrangers frequently utilize collections of sounds to create musical works. These sounds typically originate from synthesizers, samplers and musical instruments recorded live.
In addition to applications related to the entertainment industry, the need to interact with large audio collections arises in many scientific disciplines. Auditory Display (http://www.icad.org) is a new research area that focuses on the use of non-speech audio to convey information. Specific areas include the sonification of scientific data, spatial sound, and the use of auditory icons in user interfaces. Responding to parameters of analyzed audio can also be used by interactive algorithms performing animation or sound synthesis in virtual reality simulations and computer games. Audio analysis tools also have applications in Forensics, which is the use of science and technology to investigate and establish facts in criminal or civil courts of law. For example the make of a particular car or the identity of a speaking person can be automatically identified from a recording. Other fields where working with audio collections is important are Bioacoustics, Psychoacoustics and Music Cognition. Bioacoustics is the study of sounds produced by animals. Psychoacoustics studies the perception of different types of sound by animals of the human species, and Music Cognition looks specifically at how we perceive and understand music.

This indicative but by no means exhaustive list of application areas from industry and academia shows the need for tools to manipulate, analyze and retrieve audio signals from large digital audio collections.
1.2 The main problem
To summarize, the main problem that this work tries to address is the need to manipulate, analyze and retrieve from large digital audio collections, with special emphasis on musical signals. In order to achieve this goal a series of existing and novel automatic computer audition algorithms were designed, developed, evaluated, and applied directly to create novel content and context aware audio graphical user interfaces for interacting with large audio collections.
1.3 Current state
One of the main limitations of current multimedia software systems is that they have very little understanding of the data they process and store. For example an audio editing program represents audio as an unstructured monolithic block of digital samples. In order to develop new effective tools for interacting with audio collections it is important to have information about the content and context of audio signals. Content refers to any information about an audio signal and its structure. Some examples of content information are: knowing that a specific section of a jazz music piece corresponds to the saxophone solo, identifying a bird from its song, finding sounds similar to a particular walking sound in a collection of sound effects for movies. Context refers to information about an audio signal that is dependent upon a larger collection to which it belongs. As an example of context information, in a large collection of all musical styles two female-singer Jazz pieces would be similar, whereas in a collection of vocal Jazz they might be quite dissimilar. Their similarity is therefore context information, as it depends on the particular audio collection. Another example of context information is the notion of a "fast" musical piece, which is different for different musical genres and therefore depends on the particular context.
Obviously, in order to achieve this sensitivity to the content and context of audio collections, more high level information than what is currently used is required. The most common approach to address this problem has been to manually annotate audio signals with additional information (for example filenames and ID3 tags of mp3 files). However this approach has several limitations: it requires significant user time (there are commercial companies that pay full-time experts to perform this annotation), does not scale well, and can only provide specific types of information that are typically categorical in nature such as music genre, mood, or energy level. In addition, the maintenance of all this information requires significant infrastructure and quality assurance in order to ensure that the annotation results are consistent and accurate. Therefore algorithms and techniques that are capable of extracting such information automatically are desired. In this thesis, a variety of such algorithms and techniques will be proposed and used to create novel tools for structuring and interacting with large audio collections.
1.4 Relevant Research Areas
This work falls under the general research area of Computer Audition (CA). The term Computer Audition is used in a similar fashion to the term Computer Vision and it includes any process of extracting information from audio signals. Other terms used in the literature are Machine Listening and Auditory Scene Analysis. (I prefer the term Computer Audition because, being more general, it is applicable to a broader set of problems. Machine Listening implies a high level perceptual activity, whereas the term Auditory Scene Analysis refers to the recognition of different sound-producing objects in a scene. Although there is no common agreement about the terminology, I consider these two other terms subtopics of Computer Audition.)

[Figure 1.1: Relevant Research Areas — a diagram connecting Computer Audition to Signal Processing, Information Retrieval, Machine Learning, Perception, Computer Vision, Visualization, and Human Computer Interaction.]

Although Speech Recognition can be considered part of CA, for the purposes of this thesis CA will refer to techniques that, although they may process speech signals, do not perform any language modeling. For example, speaker/singer identification and music/speech discrimination will be included, as they can be performed by a human who might not know the language spoken, but Speech Recognition and Speech Synthesis will not be covered as they rely on language models. Although research in aspects of Computer Audition has existed for a long time, only recently (after 1996) has there been significant research output. (My graduate studies began in 1997 and therefore I have been very lucky, as I was able to track most of the new research in Computer Audition and actively participate in its evolution.)
Some of the reasons for this late interest are:

• Developments in hardware that made computations over large collections of audio signals feasible

• Audio compression such as the MPEG audio compression standard (.mp3 files), which made digital music distribution over the internet a reality

• A shift from symbolic methods and small "toy" examples to statistical methods and large collections of "real world" audio signals (also a trend in the general Artificial Intelligence community)

Despite the fact that Computer Audition is a relatively recent research area still in its infancy, there are many existing well-established research areas where interesting ideas and tools can be found and adapted for the purpose of extracting information from audio signals.

Figure 1.1 shows the main research areas that are relevant to this work. The thickness of the lines is indicative of the importance of each area to this particular work rather
than to the area of Computer Audition in general. Audio and especially music signals are complex, data-intensive, time-varying signals with unique characteristics that pose new challenges to these relevant research areas. Therefore, research in CA provides a good way to test, evaluate and create new algorithms and tools in all those research areas. This reciprocal relation is indicated in Figure 1.1 with the use of connecting arrows pointing in both directions. In the following paragraphs short descriptions of these research areas and their connection to Computer Audition are provided. In addition, a representative reference textbook (by no means the only one or the best one) for each specific area is provided.
Digital Signal Processing [97] is the computer manipulation of analog signals (commonly sounds or images) which have been converted to digital form (sampled). The various techniques of Time-Frequency analysis developed for processing audio signals, in many cases originally developed for Speech Processing and Recognition [98], are of special importance for this work.
Machine Learning techniques allow a machine to improve its performance based on previous results and training by humans. More specifically, pattern recognition machine learning techniques [30] aim to classify data (patterns) based on either a priori knowledge or on statistical information extracted from the patterns. The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space.
Computer Vision [9] is a term used to describe any algorithm and system that extracts information from visual signals such as image and video. Of specific interest to this work is research in video analysis, as both video and audio are temporal signals.
Human Computer Interaction (HCI) research is becoming an increasingly important area of Computer Science research [114]. Graphical user interfaces (GUIs) enable the creation of powerful tools that combine the best of human and machine capabilities. Of specific interest to this work is the field of Information Visualization [120, 34], which deals with the visual presentation of various types of information in order to better understand its structure and characteristics.
Information Retrieval [137], familiar from popular web search engines such as AltaVista (www.altavista.com) and Google (www.google.com), refers to techniques for searching and retrieving text documents based on user queries. More recently, techniques for Multimedia Information Retrieval such as content-based image retrieval systems [37] have started appearing.
Last but not least is Perception research. The human perceptual system has unique characteristics and abilities that can be taken into account to develop more effective computer systems for processing sensory data. For example, both in images and sound, knowledge of the properties of the human perceptual system has led to powerful compression methods such as JPEG (for images) and MPEG (for video and audio). A good introduction to psychoacoustics (the perception of sounds) using computers is [21].
1.5 Guiding principles and observations
Given the current state of tools for interacting with large audio collections, several directions for improvement are evident. Based on these points, the following guiding principles characterize this work and differentiate it from the majority of existing research:
• Emphasis on fundamental problems and "real world" signals

Many existing systems are too ambitious in the sense that they solve some hard problem such as polyphonic transcription but only for very limited, artificially constructed signals ("toy problems"). In contrast, in this work the emphasis is on solving fundamental problems that are applicable to arbitrary "real world" audio signals such as music pieces stored as mp3 files.
• Develop new and improve existing computer audition algorithms

In order to design and develop new algorithms in any field it is important to first understand, implement, compare and improve existing algorithms. Therefore we have tried to include and improve many existing CA algorithms in addition to the newly designed algorithms.
• Integration of different analysis techniques

A large percentage of existing work in CA deals with only a specific analysis problem. However, in many cases the results of different analysis techniques can be integrated, improving the resulting analysis. For example, segmentation can be used to reduce classification errors. A consequence of this desire for integration has been the creation of a general and flexible software infrastructure for CA research rather than solving each small individual problem separately.
• Content and context aware user interfaces based on automatic analysis

Existing CA work emphasizes automatic algorithms and is rarely concerned with how users can interact with these algorithms. Human computer interaction is a very important aspect in the design of an effective system. In this thesis our goal is to combine the abilities of humans and computers by creating novel user interfaces to interact with audio signals and collections. Unlike existing tools, these interfaces are aware of content and context information about the audio signals they present, and this information is extracted using automatic analysis algorithms.
• Interacting with large collections rather than individual files

In both the developed algorithms and interfaces the emphasis is on working with large collections rather than individual files. So, for example, a large part of the work in interface design is concerned with browsing large collections, while the automatic analysis is heavily based on statistical learning techniques that utilize large datasets for training.
• Automatic and user-based evaluation

Evaluation of systems that try to imitate human sensory perception is difficult because there is no single well-defined correct answer. However, it is feasible, and it is important for the design and development of effective CA algorithms and systems. Because CA is a relatively recent and still developing field, many existing systems are proposed without detailed evaluation (for example, only a few examples where the algorithm works are presented). In this thesis an effort has been made to evaluate the proposed techniques statistically using large datasets, and with user experiments when that is required.
• Real-time performance

Computer audition computations are numerically intensive and stress current hardware to its limits. In order to have the desired real-time performance and interactivity in the user interfaces, careful choices have to be made both at the algorithmic and the implementation level. Unlike certain proposed algorithms that take hours to analyze a minute of audio, our emphasis has been on fast algorithms and efficient implementation that enable real-time feature extraction, analysis and interaction.
Although some of these design guidelines have been used in several proposed systems, to the best of our knowledge no existing system has combined all of them. In addition, the following simple, informal but important observations are, in some way or another, behind many of the ideas described in this thesis.
• Automatic techniques with errors can be made useful.

Automatic audio analysis techniques are still far from perfect and in many cases have large percentages of errors. However, their results can be significantly improved using various techniques such as statistics, integration of different analysis techniques, user interfaces, and utilizing the errors as a source of information. For example, although short-time automatic beat detection is not accurate, a good representation of the overall beat of a song can be collected using statistics (see Beat Histograms, Chapter 2). In addition, the failure of the beat detection is itself a source of information for musical genre classification (for example, beat detection works better for rock and rap music than for classical or ambient). Although this observation is relatively simple, it leads to many interesting ideas that will be explored in the following chapters.
• Avoid solving a simple problem by trying to solve a harder one

Although this observation sounds trivial, it is not taken into account in a lot of existing work. For example, a lot of work in audio segmentation assumes an initial classification. However, classification is a harder problem as it involves learning, both in humans and machines (there is no easy way to prove such an assertion, and therefore some people might disagree; however, it will be shown that automatic segmentation without classification is feasible). Another example is the use of polyphonic transcription, which is still an unsolved problem, for the purposes of retrieval and classification, which can be solved using statistical methods.
• Use the software you write

In order to evaluate the algorithms and tools you develop it is important to use them, and the most easily accessible user is yourself. An effort has been made in this work to develop working prototypes that are subsequently used to evaluate and improve the developed algorithms, creating that way a self-improving feedback loop.
The way these guiding principles and observations influence the ideas described in this thesis will become clearer in the following chapters, which describe in detail the developed algorithms and systems.
1.6 Overview
As in every PhD thesis, my primary goal is to present my research work in a clear and accurate manner. In addition, I would also like my thesis to serve as a good introduction to the field of Computer Audition in general and its current state of the art. Because most of the research is relatively recent and I have implemented many of the existing algorithms proposed in the literature, I hope that my goal is achievable. Ideally I would like this thesis to be a good starting point for someone entering the field of Computer Audition. Another side-effect of this secondary goal is the extended bibliography. Unlike more mature fields where there are good survey papers that can serve as starting points, Computer Audition is a relatively new field and it is possible (and I have attempted to do so) to cite the majority of directly relevant publications.
Toward this goal the structure of this thesis follows a form similar to a hypothetical Computer Audition textbook. For presentation purposes I like to divide Computer Audition into the following stages:
• Representation - Hearing

Audio signals are data-intensive time-domain signals and are stored (in their basic uncompressed form) as series of numbers corresponding to the amplitude of the signal over time. Although this representation is adequate for transmission and reproduction of arbitrary waveforms, it is not appropriate for analyzing and understanding audio signals. The way we perceive and understand audio signals as humans is based on and constrained by our auditory system. It is well known that the early stages of the human auditory system (HAS) decompose the incoming sound wave into different frequency bands. In a similar fashion, in Computer Audition, Time-Frequency analysis techniques are frequently used for representing audio signals. The representation stage of the Computer Audition pipeline refers to any algorithms and systems that take as input simple time-domain audio signals and create a more compact, information-bearing representation. The most relevant research area to this stage is Signal Processing.
• Analysis - Understanding

Once a good representation is extracted from the audio signal, various types of automatic analysis can be performed. These include similarity retrieval, classification, clustering, segmentation, and thumbnailing. These higher-level types of analysis typically involve memory and learning, both for humans and machines. Therefore Machine Learning algorithms and techniques are important for this stage.
• Interaction - Acting

When the signal is represented and analyzed by the system, the user must be presented with the necessary information and act according to it. The algorithms and systems of this stage are influenced by ideas from Human Computer Interaction and deal with how information about audio signals and collections can be presented to the user and what types of controls for handling this information are provided.
This division into stages will be used to structure this thesis. It is important to emphasize that the boundaries between these stages are not clear cut. For example, a very complete representation (such as a musical score) makes analysis easier than a less detailed representation (such as feature vectors). However, feature vectors can be easily calculated, while the automatic extraction of musical scores (polyphonic transcription) is still an unsolved problem. Although these stages are presented separately for clarity, in any complete system they are all integrated. Additionally, the design of algorithms and systems in each stage has to be informed and constrained by the design choices for the other stages. This becomes evident in evaluation and implementation, which typically cannot be applied to each stage separately but have to take all of them into account. Figure 1.2 shows the stages of the Computer Audition Pipeline and representative examples of algorithms and systems for each stage. Although in most existing systems information typically flows bottom-up, both directions of flow are important, as has been argued in [115, 31].

[Figure 1.2: Computer Audition Pipeline — the stages Representation (Hearing), Analysis (Understanding) and Interaction (Acting), with representative examples for each: feature extraction, polyphonic transcription, beat detection; classification, segmentation, similarity retrieval; browsers, editors, query filters.]
This thesis is structured as follows: Chapter 1 (which you are currently reading) provides a high-level overview of the thesis, the motivation behind it and how it fits in the context of related work. Chapter 2 is about representation and describes various feature extraction techniques that create representations for different aspects of musical and audio content. The analysis stage is covered in Chapter 3, where techniques such as musical genre classification, sound "texture" segmentation and similarity retrieval are described. Chapter 4 corresponds to the interaction stage and is about various novel content and context aware user interfaces for interacting with audio signals and collections. The evaluation of the proposed algorithms and systems is described in Chapter 5. The reason evaluation is presented in a separate chapter is that in most cases it involves many different algorithms and systems from all the CA stages. For example, in order to evaluate MPEG-based features, both classification and segmentation results are obtained through the use of developed graphical user interfaces. Chapter 6 talks about the implementation and architecture of our system. The thesis ends with Chapter 7, which summarizes the main thesis contributions and gives directions for future research. Each chapter contains a section about related work and the specific contributions of this work in each area. Standard algorithms and techniques that are important for this work are described in more detail in Appendix A. A glossary of the terminology used in this thesis is provided for quick reference in Appendix B. Appendix C presents the publications that constitute the work presented in this thesis in chronological order, and Appendix D provides some web resources related to this research.

For the remainder of the thesis, it is useful to have in mind the following hypothetical scenario that will very likely become reality in the near future. Suppose all of the recorded music in the world is stored digitally on your computer and you want to be able to manipulate, analyze and retrieve from this large collection of audio signals. Although, in many cases, the techniques described in this work are also applicable to other types of audio signals such as sound effects and isolated musical instrument tones, the main emphasis will be on musical signals. So if there is no other indication in the text, the reader can assume that the signals of interest will be musical signals. Marsyas is the name of the free software framework for Computer Audition research developed as part of this thesis, under which all the described algorithms and interfaces are integrated.
1.7 Related work
In this section, pointers to important related work will be provided. The emphasis is on representative and influential publications as well as descriptions of complete Computer Audition systems rather than individual techniques or algorithms. More detailed and complete references are provided in the related work section of each individual chapter.

As already mentioned, the early origins of computer audition techniques are in Speech recognition and processing [98]. Another important area of early influence is Computer Music research [99, 85], which, although mostly involved with the synthesis of sounds and music, has provided a variety of interesting tools and techniques that can be applied to Computer Audition.
Most of the algorithms in Music Information Retrieval (MIR) (http://www.music-ir.org) have been developed based on symbolic representations such as MIDI files. Symbolic representations are similar to musical scores in that they are typically comprised of detailed structured information about the notes, pitches, timings, and instruments used in a recording. Based on this information, MIR algorithms can perform various types of content analysis such as melody retrieval, key identification, and chord recognition. The techniques used have strong similarities to text information retrieval [137]. More information about symbolic MIR can be found in [1, 2].
Transcription of music is defined as the act of listening to a piece of music and writing down musical notation for the notes that constitute the piece. Automatic transcription refers to the process of converting an audio recording such as a .mp3 file to a structured representation such as a MIDI score, and provides the bridge that can connect symbolic MIR with the analysis of "real world" audio signals.

The automatic transcription of monophonic signals is practically a solved problem, since several algorithms have been proposed that are reliable, commercially applicable and operate in real time. Attempts toward polyphonic transcription date back to the 1970s, when Moorer built a system for transcribing duets, i.e., two-voice compositions [86]. The proposed system suffered from severe limitations on the allowable frequency relations of two simultaneously playing notes, and on the range of notes in use. However, this was the first polyphonic transcription system, and several others were to follow.
To this day, transcription systems have not been able to solve anything but very limited problems. It is only lately that proposed systems can transcribe music 1) with more than two-voice polyphony, 2) with even a limited generality of sounds, and 3) yield results that are mostly reliable and would work as a helping tool for a musician [79, 61]. Still, these two also suffer from substantial limitations. The first was simulated using only a short musical excerpt, and could yield good results only after a careful tuning of threshold parameters. The second system restricts the range of note pitches in simulations to fewer than 20 different notes. However, some flexibility in transcription has been attained, when compared to the previous systems. To summarize, for practical purposes automatic polyphonic transcription of arbitrary music is still unsolved and little progress has been made.
Another approach that offers the possibility of automatic music analysis is Structured Audio (SA). The idea is to build a hybrid representation combining symbolic representation, software synthesis and raw audio in a structured way. Although promising, this technique has only been used on a small scale and would require changes in the way music is produced and recorded. MPEG-4 [107] might change the situation, but for now and for the near future most of the data will still be in raw audio format. Moreover, even if SA is widely used, converters from raw audio signals will still be required.
Given the fact that automatic music transcription, the holy grail of Computer Audition, is not very likely to be achieved in the near future, and that Structured Audio is still mainly a proposal, if one wishes to work with audio signals today a significant paradigm shift is required. The main idea behind this paradigm shift is to try to extract information directly from audio signals using methods from Signal Processing and Machine Learning without attempting note-level transcriptions. This paradigm is followed in the work described in this thesis. Two position papers supporting this shift from note-level transcriptions to statistical approaches based on perceptual properties of "real-world" sounds, even in the case when note-level transcriptions are available, are [82, 106].
In my opinion, this statistical non-transcriptive approach to Computer Audition for non-speech audio signals really started in the second half of the 90s with the publication of two important influential papers. Statistical methods for the retrieval and classification of isolated sounds such as musical instrument tones and sound effects were introduced in [143]. An algorithm for the discrimination between music and speech, using automatic classification based on spectral features and trained using supervised learning, was described in [110]. An early overview of audio information retrieval, including techniques based on Speech Recognition analysis, can be found in [39].
Another influential work for CA is the classic book "Auditory Scene Analysis" by McGill psychologist Albert Bregman [16]. Auditory scene analysis is the process by which the human auditory system builds mental descriptions of complex auditory environments by analyzing mixtures of sounds. The book describes a large number of psychological experiments that provide a lot of insight about the mechanisms used by the human auditory system to perform this process. Computational implementations of some of these ideas are covered in [101]. Although there is no direct attempt in this thesis to model the human auditory system, knowledge about how it works can provide valuable insight for the design of algorithms and systems that attempt to perform similar tasks.
More recently, the Machine Listening Group of the MIT Media Lab has been active in CA, producing a number of interesting PhD dissertations. The work of [31] is concerned with the problem of computational auditory scene analysis, where an auditory scene is decomposed into a number of sound sources. The classification of musical instrument sounds is explored in [81]. The use of information reduction techniques to provide mathematical descriptions of low level auditory processes such as preprocessing, grouping and scene analysis is described in [118]. The most relevant dissertation to this work is [109], which describes a series of systems for listening to music that do not attempt to perform polyphonic transcription.
1.8 Main Contributions
In this section, the main new contributions of this thesis are briefly reviewed. More details and explanation of the terms will be provided in the following chapters.
• MPEG-based compressed domain spectral shape features

Computation of features to represent the short-time spectral shape directly from MPEG audio compressed data (.mp3 files). That way the analysis done for compression can be used for "free" for feature extraction during decoding [130].
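To make the flavor of these spectral shape features concrete, here is a minimal sketch that computes centroid, rolloff and flux from magnitude spectra. It operates on ordinary FFT magnitude spectra; the thesis computes analogous quantities directly from the subband data already present in MPEG compressed audio, which this sketch does not replicate.

```python
import numpy as np

def spectral_shape_features(prev_mag, mag, rolloff_pct=0.85):
    """Centroid, rolloff and flux from successive magnitude spectra.

    A minimal sketch: mag and prev_mag are FFT magnitude spectra of the
    current and previous analysis windows; the thesis derives analogous
    features from MPEG subband magnitudes available during decoding.
    """
    bins = np.arange(1, len(mag) + 1)
    # Centroid: spectral "center of mass"
    centroid = np.sum(bins * mag) / (np.sum(mag) + 1e-12)
    # Rolloff: bin below which rolloff_pct of the spectral energy lies
    cumulative = np.cumsum(mag)
    rolloff = np.searchsorted(cumulative, rolloff_pct * cumulative[-1])
    # Flux: squared change between normalized successive spectra
    n_prev = prev_mag / (np.sum(prev_mag) + 1e-12)
    n_cur = mag / (np.sum(mag) + 1e-12)
    flux = np.sum((n_cur - n_prev) ** 2)
    return centroid, rolloff, flux

# Usage on one analysis window of a signal x:
# mag = np.abs(np.fft.rfft(x[start:start + 512] * np.hanning(512)))
```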
• Representation of sound "texture" using texture windows

Unlike simple monophonic sounds such as musical instrument tones or sound effects, music is a complex polyphonic signal that does not have a steady state and therefore has to be characterized by its overall "texture" characteristics rather than a steady state. The use of texture windows as a way to represent sound "texture" has been formalized and experimentally validated in [133].
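A minimal sketch of the texture window idea, assuming a matrix of per-analysis-window feature vectors as input: each texture window is summarized by the means and standard deviations of the features of the analysis windows it spans. The window sizes here are illustrative defaults, not the exact parameters of the thesis.

```python
import numpy as np

def texture_features(analysis_features, texture_size=40):
    """Summarize short analysis windows over longer texture windows.

    analysis_features: array (num_windows, num_features), one feature
    vector per short analysis window (e.g. ~20 ms). texture_size is the
    number of analysis windows per texture window (e.g. ~1 second).
    Returns one vector of means and standard deviations per texture window.
    """
    out = []
    for start in range(0, len(analysis_features) - texture_size + 1, texture_size):
        block = analysis_features[start:start + texture_size]
        out.append(np.concatenate([block.mean(axis=0), block.std(axis=0)]))
    return np.array(out)
```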
• Wavelet Transform Features

The Wavelet Transform (WT) is a technique for analyzing signals. It was developed as an alternative to the short time Fourier transform (STFT) to overcome problems related to its frequency and time resolution properties. A new set of features based on the WT that can be used for representing sound "texture" is proposed [134].
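The following sketch illustrates the general shape of such features under simplifying assumptions: a pyramidal wavelet decomposition followed by simple statistics of the coefficients in each band. The Haar wavelet is used here to keep the pyramidal algorithm explicit; the proposed features are based on a different wavelet and feature set, described in Chapter 2.

```python
import numpy as np

def haar_dwt_features(x, levels=6):
    """Sound "texture" features from a pyramidal Haar wavelet decomposition.

    Each level halves the time resolution and isolates one octave-wide
    frequency band; each subband is summarized by the mean and standard
    deviation of its absolute coefficients. A simplified sketch, not the
    thesis feature set.
    """
    feats = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx = approx[: len(approx) // 2 * 2]              # even length
        detail = (approx[0::2] - approx[1::2]) / np.sqrt(2)  # high-pass band
        approx = (approx[0::2] + approx[1::2]) / np.sqrt(2)  # low-pass residue
        feats.extend([np.mean(np.abs(detail)), np.std(np.abs(detail))])
    feats.extend([np.mean(np.abs(approx)), np.std(np.abs(approx))])
    return np.array(feats)
```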
• Beat Histograms and beat content features

The majority of existing work in audio information retrieval mainly utilizes features related to spectral characteristics. However, if the signals of interest are musical signals, more specific information such as the rhythmic content can be represented using features. Unlike typical automatic beat detection algorithms that provide only a running estimate of the main beat and its strength, Beat Histograms provide a richer representation of the rhythmic content of a musical piece that can be used for similarity retrieval and classification. A method for their calculation based on periodicity analysis of multiple frequency bands is described [135, 133].
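As a single-band illustration of the periodicity analysis, the sketch below accumulates autocorrelation peaks of an onset-strength envelope into a histogram indexed by tempo in beats per minute. The full method decomposes the signal into multiple frequency bands (via the wavelet transform) and extracts an envelope per band before accumulating; the envelope is an assumed input here.

```python
import numpy as np

def accumulate_beat_histogram(envelope, env_sr, hist, bpm_range=(40, 200)):
    """Add one band's envelope periodicities to a beat histogram.

    envelope: low-rate amplitude/onset envelope of one frequency band;
    env_sr: its sampling rate in Hz; hist: dict mapping tempo (bpm) to
    accumulated strength. A single-band sketch of the multi-band method.
    """
    env = envelope - np.mean(envelope)
    # Autocorrelation at non-negative lags
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    lo = int(env_sr * 60.0 / bpm_range[1])   # lag of the fastest tempo
    hi = int(env_sr * 60.0 / bpm_range[0])   # lag of the slowest tempo
    for lag in range(max(lo, 1), min(hi, len(ac) - 1)):
        if ac[lag] > ac[lag - 1] and ac[lag] >= ac[lag + 1]:  # local peak
            bpm = int(round(60.0 * env_sr / lag))
            hist[bpm] = hist.get(bpm, 0.0) + float(ac[lag])
    return hist
```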
• Pitch Histograms and pitch content features

Another important source of information specific to musical signals is pitch content. Pitch content has been used for retrieval of music in symbolic form, where all the musical pitches are explicitly represented. However, pitch information about audio signals is more difficult to acquire, as automatic multiple pitch detection algorithms are still inaccurate. Pitch Histograms collect statistical information about the pitch content over the whole file and provide valuable information for similarity and classification even when the pitch detection algorithm is not perfect [133].
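A minimal sketch of the histogram-building step, assuming some multiple-pitch detection front end that yields (frequency, amplitude) pairs per frame: frequencies are quantized to MIDI note numbers for an unfolded histogram, and folded modulo 12 into pitch classes for a harmony-oriented one.

```python
import numpy as np

def pitch_histograms(detections):
    """Build unfolded and folded pitch histograms from pitch detections.

    detections: iterable of (frequency_hz, amplitude) pairs produced by
    some multiple pitch detection front end (assumed, not shown here).
    Returns a 128-bin histogram over MIDI notes (unfolded) and a 12-bin
    histogram over pitch classes (folded), each weighted by amplitude.
    """
    unfolded = np.zeros(128)
    folded = np.zeros(12)
    for freq, amp in detections:
        if freq <= 0:
            continue
        note = int(round(69 + 12 * np.log2(freq / 440.0)))  # Hz -> MIDI note
        if 0 <= note < 128:
            unfolded[note] += amp
            folded[note % 12] += amp   # fold octaves onto pitch classes
    return unfolded, folded
```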
• Multifeature segmentation methodology

Audio segmentation is the process of detecting changes of sound "texture" in audio signals. A general multi-feature segmentation methodology that applies to arbitrary sound "textures" and does not rely on previous classification is described in this thesis. The proposed methodology can be applied to any type of audio features and has been evaluated for various feature sets with user experiments described in Chapter 5. This methodology was first described in [125], where segmentation as a separate process from classification and a more formal approach to multi-feature segmentation were provided. Additional segmentation experiments are described in [128].
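To convey the flavor of feature-based segmentation, here is a heavily simplified sketch: distances between successive feature vectors form a novelty curve whose prominent peaks are taken as texture-change boundaries. The distance weighting and peak-picking heuristics actually used are described in Chapter 3; plain Euclidean distance and naive peak picking stand in for them here.

```python
import numpy as np

def texture_boundaries(features, num_segments=10):
    """Detect texture changes as peaks in the distance between
    successive feature vectors.

    features: array (num_windows, num_features) of per-window feature
    vectors. Returns indices of candidate segment boundaries, in time
    order. An illustrative sketch, not the thesis methodology.
    """
    # Novelty curve: distance between each window and its predecessor
    diffs = np.linalg.norm(np.diff(features, axis=0), axis=1)
    peaks = [i for i in range(1, len(diffs) - 1)
             if diffs[i] > diffs[i - 1] and diffs[i] >= diffs[i + 1]]
    # Keep only the strongest peaks as boundaries
    peaks.sort(key=lambda i: diffs[i], reverse=True)
    return sorted(peaks[:num_segments])
```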
• Automatic musical genre classification

An algorithm for automatic musical genre classification of audio signals is described. It is based on a set of features for describing musical content that, in addition to the typical timbral features, includes features for representing rhythmic and harmonic aspects of music. A variety of experiments have been conducted to evaluate this algorithm and, to the best of our knowledge, this is the first detailed exploration of automatic musical genre classification. An initial version of the algorithm was published in [135] and the complete algorithm and more evaluation experiments are described in [133].
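As a concrete (if simplified) instance of the classification step, the sketch below fits one diagonal-covariance Gaussian per genre over such feature vectors, in the spirit of the simple Gaussian classifier of Appendix A.6; the experiments also use GMM and KNN classifiers, and all data and labels here are placeholders.

```python
import numpy as np

def train_gaussian(X, y):
    """Fit one diagonal-covariance Gaussian per genre label.

    X: (num_examples, num_features) feature vectors combining timbral,
    rhythmic and pitch content features; y: array of genre labels.
    """
    return {label: (X[y == label].mean(axis=0), X[y == label].var(axis=0) + 1e-9)
            for label in np.unique(y)}

def classify_gaussian(model, x):
    """Assign x to the genre with the highest Gaussian log-likelihood."""
    def loglik(mean, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    return max(model, key=lambda label: loglik(*model[label]))
```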
• TimbreSpace

The TimbreSpace is a content and context aware visualization of audio collections where each audio signal is represented as a visual object in a 3D space. The shape, color and position of the objects are derived automatically based on automatic feature extraction and Principal Component Analysis (PCA).
• TimbreGrams

TimbreGrams are content and context aware visualizations of audio signals that use colored stripes to show content structure and similarity. The coloring is derived automatically based on automatic feature extraction and Principal Component Analysis (PCA).
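Both visualizations rest on the same dimensionality reduction step, sketched below under simplifying assumptions: PCA projects high-dimensional feature vectors onto three principal components, which can be read either as (x, y, z) coordinates in a TimbreSpace or, rescaled to [0, 1], as RGB stripe colors in a TimbreGram.

```python
import numpy as np

def pca_project(features, k=3):
    """Project feature vectors onto their first k principal components.

    features: (num_items, num_features), one vector per file (TimbreSpace)
    or per window (TimbreGram). The three output dimensions serve as
    (x, y, z) positions, or rescaled to [0, 1] as RGB colors.
    """
    centered = features - features.mean(axis=0)
    # Eigenvectors of the covariance matrix, strongest components first
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    components = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return centered @ components

def to_rgb(projected):
    """Rescale each projected dimension to [0, 1] for use as color."""
    lo, hi = projected.min(axis=0), projected.max(axis=0)
    return (projected - lo) / (hi - lo + 1e-12)
```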
These short descriptions of each contribution are just indicative of the main ideas. More complete descriptions follow in the remaining chapters.
1.9 A short tour
In this section a short tour of the proposed manipulation, analysis and retrieval systems is provided by describing some possible usage scenarios. The emphasis is on how the systems can be used rather than how they work. The interested reader might want to return to this section after reading the subsequent chapters that explain how the proposed systems work, to see again how they fit in an application context. Some of the terms used might be unfamiliar (they will be explained in the rest of the thesis) but the gist of each scenario should be clear. These scenarios are indicative of the requirements of interacting with large audio collections and show how the systems we are going to describe meet these requirements.
A music listener is driving a car, listening to music through a computer audition enhanced radio receiver. The receiver is configured by the user to automatically change radio station when there are commercials and to automatically adjust equalizer settings based on the musical genre that is played (which is detected automatically). At some point, the listener hears a strange polyphonic vocal piece that is very interesting but does not know what it is. The piece is recorded and submitted to a music search engine. Several similar pieces are returned and the listener finds out that the piece is pygmy polyphonic music from Africa. The retrieved pieces can subsequently be explored, for example using a Sound slider to find a faster piece of similar texture.
A researcher in Bioacoustics receives a 24-hour long field recording of a site with birds and has to identify if any of the bird songs correspond to a list of endangered bird species. First the individual bird songs are separated using the enhanced sound editor and automatic segmentation. Subsequently the researcher creates a collection consisting of the individual bird songs in addition to representative bird song samples from the endangered bird species list, each labeled with a characteristic image. Finally the bird songs that are similar to the endangered species ones can be found by visual (they will be close to the target images) and aural (they will sound similar) inspection of an automatically calculated 2D or 3D timbrespace (a visual representation of large audio collections). By collecting a large number of representative bird song samples, an automatic classifier can be trained to recognize bird songs of endangered species.
A musicologist is looking at the differences in duration of symphonic movements performed by different orchestras and conductors. Timbregrams (a visual representation of audio signals) of various symphonic works are calculated using a collection of orchestral music as the basis for Principal Component Analysis (PCA). That way each movement will be separated in color, and movements of similar texture (for example loud, high energy movements) will have similar colors. The Timbregrams are then manually arranged in a table display such that each column corresponds to a symphonic work and each row corresponds to a specific conductor/orchestra combination. That way the differences in duration of each movement will be easily seen visually by comparing the lengths of similarly colored movements across a specific column. By looking at a specific row and comparing it with the rest of the table, the musicologist can draw conclusions such as: conductor A tends to take faster high energy movements and slower low energy movements than conductor B.
A Jazz saxophone student wants to make a collection of saxophone solos by Charlie Parker. A TimbreSpace containing a collection of jazz music, where different jazz subgenres are automatically labeled by shape, is used initially. By selecting the triangles corresponding to Bebop files and constraining the filenames to contain Charlie Parker, a new collection of all the Bebop files with Charlie Parker is created. Using the enhanced sound editor with automatic segmentation, and fine-tuning the segmentation results manually, all the saxophone solos are marked and saved as separate files. Finally a collection of saxophone solos is created from those saved files and can be used for subsequent browsing and exploration.
A sound designer wants to create the soundscape of walking through a forest using a database of sound effects. As a first step the sound of walking on soil has to be looked up. After clustering the sound effects collection the designer notices that a particular cluster has many files with Walking in their filename. Zooming in on this cluster the designer selects 4 sound files labeled as walking on soil. Using spatialized audio, generated thumbnails of the 4 sound files can be heard simultaneously in order to make the final decision. In a similar fashion files with bird songs can be located as well as the sound of a waterfall. Once the components of the soundscape are selected, they can be edited and mixed in order to create the final soundscape of walking through a forest. The illusion of approaching the waterfall can be achieved by artificial reverberation and crossfading.
It is evident from these scenarios that a large variety of algorithms for manipulating, analyzing and interacting with audio signals and collections are required. To make these scenarios even more interesting, imagine that all these tasks are performed in front of a large screen with multiple graphic displays that are all consistent and connected to the same underlying data representation. Of course having such a large collaborative space allows more than one user to interact simultaneously with the system. The first steps toward such a system, as well as a variety of algorithms and interfaces that can be used to realize these scenarios, will be described in the following Chapters.
Chapter 2
Representation
If all things were at rest, and nothing moved, there would be perfect silence in the world. In such a state of absolute quiescence, nothing would be heard, for motion and percussion must precede sound. Just as the immediate cause of sound is some percussion, the immediate cause of all percussion must be motion. Some vibratory impulses or motions causing a percussion on the ear return with greater speed than others. Consequently they have a greater number of vibrations in a given time, while others are repeated slowly, and consequently are less frequent for a given length of time. The quick returns and greater number of such impulses produce the highest sounds, while the slower, which have fewer vibrations, produce the lower.
Euclid, Elements of Music, Introduction to the Section of the Canon
2.1 Introduction
The extraction of feature vectors as the main representation of audio signals constitutes the foundation of most algorithms in Computer Audition. Feature extraction is the process of computing a numerical representation that can be used to characterize a segment of audio. This numerical representation, which is called the feature vector, is subsequently
used as a fundamental building block of various types of audio analysis and information extraction algorithms. This vector typically has a fixed dimension and therefore can be thought of as a point in a multi-dimensional feature space. When using feature vectors to represent music two main approaches are used. In the first approach the audio file is broken into small segments in time and a feature vector is computed for each segment. The resulting representation is a time series of feature vectors which can be thought of as a trajectory of points in the feature space. In the second approach a single feature vector that summarizes information for the whole file is used. The single vector approach is appropriate when overall information about the whole file is required whereas the trajectory approach is appropriate when information needs to be updated in real time. For example, automatic musical genre classification of mp3 files might use the single vector approach while classification of radio signals might use the trajectory approach. Typically the signal is broken into small chunks called analysis windows. Their sizes are usually around 20 to 40 milliseconds, so that the signal characteristics are relatively stable for the duration of the window.
Most of the time, evaluation of audio features is done in the context of some specific application or analysis technique. So for example, a set of features for representing speech might be evaluated in the context of speech compression while a set of features for representing musical content might be evaluated in the context of automatic musical genre classification. Several cases of evaluation of the features described in this chapter are described in Chapter 5. The large variety of audio features that have been proposed differ not only in how effective they are but also in their performance requirements. For example, FFT-based features might be more appropriate for a real-time application than computationally intensive but perceptually more accurate features based on a cochlear filterbank.
The most relevant research area for audio feature extraction is Signal Processing. In many cases feature extraction is performed using some form of Time-Frequency analysis technique such as the Short Time Fourier Transform (STFT). Time-Frequency analysis techniques basically represent the energy distribution of the signal in a time-frequency plane and differ in how this plane is subdivided into regions. The human ear also acts as a Time-Frequency analyzer, decomposing the incoming pressure wave in the cochlea into different frequency bands.
2.2 Related Work
The origins of audio feature extraction are in the representation and processing of speech signals. A variety of representations for speech signals have been proposed in the literature. These representations were initially developed for the purposes of telephony and compression and later for speech recognition and synthesis. A complete overview of these techniques is beyond the scope of this Section. Interested readers can consult any of the standard textbooks on Speech Processing such as [98]. Representations originating in speech processing that have been used in this thesis are: Linear Prediction Coefficients (LPC) [76] and Mel-Frequency Cepstral Coefficients (MFCC) [26]. The use of MFCC for music/speech classification has been studied in [72]. These features will be explained in more detail later in this Chapter.
Toward the end of the 90s features for representing non-speech audio signals were proposed. Examples of non-speech audio signals that have been explored include: isolated instrument tones [44, 45, 32, 143], sound effects [143], and music/speech discrimination [110]. Another interesting direction of research during the 90s was the exploration of alternative front-ends to the typical Short Time Fourier Transform for the calculation of features. Representative examples are Cochlear Models [116, 117, 31, 109], Wavelets [65, 68, 122, 134], and the MPEG audio compression filterbank [95, 130].
Most of these papers utilize the proposed features for various audio analysis tasks such as classification or segmentation and will be described in more detail in the next Chapter. A short description of some representative examples is provided in this Section. Statistical moments of the magnitude spectrum are used as features for instrument classification in [44]. Spectral shape features such as Centroid, Rolloff, and Flux, described later in this chapter, are used in [143] for sound effects and in [110] combined with MFCC-based features for music/speech discrimination. Cochlear models attempt to approximate the exact mechanical workings of the lower levels of the human auditory system and have been shown to have similar properties, for example in pitch detection [116]. Statistical characteristics of Wavelet subbands such as the absolute mean and variance of the coefficients have been used in [134] to model sound texture for the purposes of classification and segmentation. The use of the MPEG audio compression filterbank was independently proposed by [95] and [130] at the International Conference on Acoustics, Speech and Signal Processing (ICASSP 2000). Recently, systems that jointly process audio and video information from MPEG encoded sequences, such as [14], have been proposed.
In the previously cited systems, the proposed acoustic features do not directly attempt to model musical signals and therefore ignore potentially useful information. For example no information regarding the rhythmic structure of the music is utilized. Research in the areas of automatic beat detection and multiple pitch analysis can provide ideas for the development of novel features specifically targeted to the analysis of music signals.
One characteristic of music that has received considerable attention is the automatic extraction of the beat or tempo of musical signals. Informally the beat can be defined as an underlying regular sequence of pulses that coincides with the flow of music. Of course this definition is vague but gives a general idea of the problem. A more anthropocentric way of defining beat is as the sequence of human foot tapping when one listens to music. More elaborate discussions about the definition of beat can be found in Musicology and Music
Cognition publications that are too numerous to be listed here. A good starting point is [67].
There is a significant body of work that deals with beat tracking and analysis using symbolic representations such as MIDI files. Although many interesting ideas have been proposed, these techniques cannot deal with real audio signals. In contrast the systems that will be described in this Section are applicable to audio signals. A real-time beat tracking system for audio signals with music is described in [108]. In this system, a filterbank is coupled with a network of comb filters that track the signal periodicities to provide an estimate of the main beat and its strength. A real-time beat tracking system based on a multiple agent architecture that tracks several beat hypotheses in parallel is described in [49, 13]. A system for rhythm and periodicity detection in polyphonic music based on detecting beats at a narrow low frequency band (50-200 Hz) is described in [3]. A multiresolution time-frequency analysis and interpretation of musical rhythm based on Wavelets is described in [119]. More recently, computationally simpler methods based on onset detection at specific frequencies have been proposed in [66, 112]. A beat tracking system for audio or MIDI data based on the frequency of occurrence of the various time durations between pairs of note onset times and multiple hypothesis search was introduced in [28] and enhanced with a user interface that provides visual and aural feedback in [29]. A beat extraction algorithm that operates directly on MPEG audio compressed bitstreams (.mp3 files) was recently proposed in [141].
The main output of these systems is a running estimate of the main beat and its strength. The beat spectrum, described in [43], is a more global representation of rhythm than just the main beat and its strength. Beat Histograms, introduced in [134] and described in this chapter, are a similar more global representation of rhythm (computed differently) that has been used for automatic musical genre classification [135, 133].
In addition to information related to the beat of music, another important source of information is pitch content. Chord change detection (without chord name identification) has been used in [48] for real-time rhythm tracking for drumless audio signals. An algorithm capable of detecting multiple pitches in polyphonic audio signals is described in [123]. This algorithm has been used in [133] to calculate Pitch Histograms, a statistical representation of pitch content that is used for automatic musical genre classification.
The Short Time Fourier Transform is a standard Signal Processing technique and there are many references that describe it. A good introductory book is [121] and a more complete description can be found in [97, 89]. A good introduction to Wavelets, especially in relation to Signal Processing, can be found in [77]. A high level overview of MPEG audio compression can be found at [88] and [58, 59] contain the complete specifications of the algorithm. For the sake of completeness a short description of these techniques is provided in Appendix A for the interested reader.
2.2.1 Contributions
The following contributions of this thesis to the area of audio feature extraction will be described in this chapter:
- MPEG-based compressed domain spectral shape features: computation of features to represent the short-time spectral shape directly from MPEG audio compressed data (.mp3 files). That way the analysis done for compression can be used for "free" for feature extraction during decoding [130].
- Representation of sound "texture" using texture windows: unlike simple monophonic sounds such as musical instrument tones or sound effects, music is a complex polyphonic signal that does not have a steady state and therefore has to be characterized by its overall "texture" characteristics rather than a steady state. The use of texture windows as a way to represent sound "texture" has been formalized and experimentally validated in [133].
- Wavelet Transform features: the Wavelet Transform (WT) is a technique for analyzing signals. It was developed as an alternative to the Short Time Fourier Transform (STFT) to overcome problems related to its frequency and time resolution properties. A new set of features based on the WT that can be used for representing sound "texture" is proposed [134].
- Beat Histograms and beat content features: the majority of existing work in audio information retrieval mainly utilizes features related to spectral characteristics. However, if the signals of interest are musical signals, more specific information such as the rhythmic content can be represented using features. Unlike typical automatic beat detection algorithms that provide only a running estimate of the main beat and its strength, Beat Histograms provide a richer representation of the rhythmic content of a musical piece that can be used for similarity retrieval and classification. A method for their calculation based on periodicity analysis of multiple frequency bands is described [135, 133].
- Pitch Histograms and pitch content features: another important source of information specific to musical signals is pitch content. Pitch content has been used for retrieval of music in symbolic form where all the musical pitches are explicitly represented. However, pitch information about audio signals is more difficult to acquire as automatic multiple pitch detection algorithms are still inaccurate. Pitch Histograms collect statistical information about the pitch content over the whole file and provide valuable information for similarity and classification even when the pitch detection algorithm is not perfect [133].
These short descriptions of each contribution are just indicative of the main ideas. More complete descriptions follow in the remaining Sections of this Chapter.
Figure 2.1: Magnitude Spectrum
2.3 Spectral Shape Features
Typically in Time-Frequency analysis techniques such as the STFT the signal is broken into small segments in time and the frequency content of each small segment is calculated. The frequency content can be represented as a magnitude spectrum (see Figure 2.1) that represents the energy distribution over frequency for that particular segment in time. The exact details of how the amplitudes are represented and how the frequency axis is spaced are important but for presentation purposes they will be ignored for now. Although the frequency spectrum can be used directly to represent audio signals (in the extreme case it can even be used to perfectly reconstruct them) it has the problem that it contains too much information. For Computer Audition purposes a more compact representation is more appropriate for two main reasons: 1) a lot of the information contained in the spectrum is not important; 2) machine learning algorithms work better with feature vectors of small dimensionality that
are as informative as possible. Therefore typically after the spectrum is calculated a small set of features that characterize the gross spectral shape are calculated. These types of features have been used in almost every type of Computer Audition and Speech Recognition algorithm. In the following subsections specific cases of features for characterizing the overall spectral shape are described.
2.3.1 STFT-based features
Features based on the Short Time Fourier Transform (STFT) are very common and have the advantage of fast calculation based on the Fast Fourier Transform algorithm. Although the exact details of the STFT parameters used to calculate them differ from system to system, their basic description is the same. For the curious, the following STFT analysis parameters have been used successfully in our implementation to represent musical and speech signals: sampling rate 22050 Hz, window size 512 samples (approximately 23 milliseconds), hop size 512 or 256 samples, Hanning window, amplitudes in dB. The following features are based on the short-time magnitude of the STFT of the signal:
Spectral Centroid
The spectral centroid is defined as the center of gravity of the magnitude spectrum of the STFT:

C_t = \frac{\sum_{n=1}^{N} M_t[n] \cdot n}{\sum_{n=1}^{N} M_t[n]}    (2.1)

where M_t[n] is the magnitude of the Fourier transform at frame t and frequency bin n. The centroid is a measure of spectral shape and higher centroid values correspond to "brighter" textures with more high frequencies. The spectral centroid has been shown by user experiments to be an important perceptual attribute in the characterization of musical instrument timbre [50].
Spectral Rolloff

The spectral rolloff is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated:

\sum_{n=1}^{R_t} M_t[n] = 0.85 \sum_{n=1}^{N} M_t[n]    (2.2)

The rolloff is another measure of spectral shape and shows how much of the signal's energy is concentrated in the lower frequencies.
Spectral Flux

The spectral flux is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

F_t = \sum_{n=1}^{N} \left( N_t[n] - N_{t-1}[n] \right)^2    (2.3)

where N_t[n] and N_{t-1}[n] are the normalized magnitudes of the Fourier transform at the current frame t and the previous frame t-1 respectively. The spectral flux is a measure of the amount of local spectral change. Spectral flux has also been shown by user experiments to be an important perceptual attribute in the characterization of musical instrument timbre [50].
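A minimal sketch (in Python with NumPy, not part of MARSYAS; the function name spectral_features is illustrative) shows how Equations 2.1-2.3 could be computed from two successive magnitude spectra:

    import numpy as np

    def spectral_features(mag, prev_mag, rolloff_percent=0.85):
        """Centroid, rolloff bin, and flux for one magnitude spectrum frame."""
        bins = np.arange(1, len(mag) + 1)
        centroid = np.sum(mag * bins) / np.sum(mag)                  # Eq. 2.1
        cumulative = np.cumsum(mag)
        rolloff = np.searchsorted(cumulative,
                                  rolloff_percent * cumulative[-1]) + 1  # Eq. 2.2
        # Normalize both frames before taking the squared difference (Eq. 2.3).
        norm = mag / np.sum(mag)
        prev_norm = prev_mag / np.sum(prev_mag)
        flux = np.sum((norm - prev_norm) ** 2)
        return centroid, rolloff, flux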
2.3.2 Mel-Frequency Cepstral Coefficients
Mel-Frequency Cepstral Coefficients (MFCC) are perceptually motivated features that are also based on the STFT. After taking the log-amplitude of the magnitude spectrum, the FFT bins are grouped and smoothed according to the perceptually motivated Mel-frequency scaling. Finally, in order to decorrelate the resulting feature vectors a Discrete Cosine Transform is performed. Although typically 13 coefficients are used for speech representation, we have found that the first five coefficients are adequate for music representation. A more detailed description of the MFCC calculation can be found in the Appendix.
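As an illustration only (the thesis implementation is part of MARSYAS), the same chain of steps can be sketched with the librosa library, an assumption made here because it provides a ready-made Mel filterbank and DCT; keeping the first five coefficients after the DC term matches the choice described above:

    import librosa

    y, sr = librosa.load("piece.wav", sr=22050)   # hypothetical input file
    # 13 coefficients as in speech; keep the first five after the DC term.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=512)
    mfcc5 = mfcc[1:6, :]   # one 5-dimensional vector per analysis window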
2.3.3 MPEG-based features
Most of the available audio data on the web is in compressed form following the MPEG audio compression standard (the well known .mp3 files). MPEG audio compression is a lossy perceptually-based compression scheme that allows audio files to be stored in approximately one tenth of the space they would normally require if they were uncompressed. This enables large numbers of audio files to be stored and streamed over the network. Using current hardware the decompression can be performed in real time and therefore in most cases there is no reason to store the uncompressed audio signal. Traditional Computer Audition algorithms operate on uncompressed audio signals and therefore would require the signal to be decompressed and then subsequently analyzed. In MPEG audio compression the signal is analyzed in time and frequency for compression purposes. Therefore an interesting idea is to use this analysis not only for compression but also to extract features directly from compressed data. This idea is similar to compressed domain visual processing of video (for example [146]). Because the bulk of the feature calculation is performed during the encoding stage, this process has a significant performance advantage if the available data is compressed, because the analysis is already there for "free". Combining decoding and analysis in one stage is also very important for audio streaming applications.
This idea of calculating features directly from MPEG audio compressed data was explored in [130] (a very similar idea about compressed domain audio features was independently proposed by David Pye at the same conference, ICASSP 2000 [95]) and evaluated in the context of music/speech classification and segmentation. In this section the calculation of these features will be described and Chapter 5 describes how these features were evaluated. These features have been used in [71] for audio analysis of MPEG compressed video signals.
Short MPEG overview
The MPEG audio coding standard is an example of a perception based coder that exploits the characteristics of the human auditory system in order to compress audio more efficiently. Since there is no specific source model, like in speech compression algorithms, it can be used to compress any type of audio. In the first step the audio signal is converted to spectral components via an analysis filterbank. Each spectral component is quantized and coded with the goal of keeping the quantization noise below the masking threshold. Simultaneous masking is a frequency domain phenomenon where a low level signal (the maskee) can be made inaudible (masked) by a simultaneously occurring stronger signal (the masker) as long as masker and maskee are close enough to each other in frequency. Therefore, the dynamic bit allocation in each subband is controlled by a psychoacoustic model. Layers I, II and III offer different trade-offs between complexity and compression ratio. More details can be found in [88] and in the complete ISO specification [58, 59]. Essentially what MPEG audio compression does is assign quantization bits "cleverly" based on the properties of the human auditory system. (Although there is no easy way to confirm this experimentally, MPEG audio compression would probably not sound as good to non-human animals, such as dogs, that have different auditory characteristics.)
Like most modern coding schemes there is an asymmetry between the encoder and the decoder. The encoder is more complicated, slower, and there is some flexibility in the psychoacoustic model used. On the other hand, decoding is simpler and straightforward. The subband sequences are reconstructed on the basis of blocks, taking into account the bit allocation information. Each time the subband samples of all the 32 subbands have been calculated, they are applied to the synthesis filterbank, and 32 consecutive 16-bit PCM-format audio samples are calculated. The feature calculation described in the following section is performed before the filterbank synthesis that is common to all layers.
The digital audio input is mapped into 32 subbands via equally spaced bandpass filters. A polyphase filter structure is used for the frequency mapping; its filters have 512 coefficients. The filters are equally spaced in frequency. At a 22050 Hz sampling rate, each band has a width of 11025 / 32 = 345 Hz. The subsampled filter outputs exhibit significant overlap. The impulse response of subband i, h_i[n], is obtained by multiplication of a single prototype low-pass filter, h[n], by a modulating function that shifts the low-pass response to the appropriate subband frequency range:

h_i[n] = h[n] \cos\left( \frac{(2i+1)\pi}{2M} n + \phi_i \right), \quad M = 32; \; i = 1, 2, \ldots, 32; \; n = 1, 2, \ldots, 512    (2.4)
Although the actual calculation of the samples is performed differently for performance reasons, the 32-dimensional subband vector can be written as a convolution equation:

S_t[i] = \sum_{n=0}^{511} x[t-n] \cdot h_i[n]    (2.5)

where h_i[n] are the individual subband band-pass filter responses. Details about the coefficients of the prototype filter and the phase shifts \phi_i are given in the ISO/MPEG standard [58, 59]. A more complete description of the MPEG audio compression standard can be found in Appendix A.
For the experiments, MPEG-2, Layer III, 22050 Hz sampling rate, mono files were used. A similar approach can be used with the other layers and sampling rates as well as with any filterbank-based perceptual coder. The analysis is performed on blocks of 576 samples (about 26 msec at 22050 Hz) that correspond to one MPEG audio frame. A root mean square subband vector is calculated for the frame as:

M[i] = \sqrt{\frac{\sum_{t=1}^{18} S_t[i]^2}{18}}, \quad i = 1, \ldots, 32    (2.6)
S_t are the 32-dimensional subband vectors. The resulting M is a 32-dimensional vector that describes the spectral content of the sound for that frame. The characteristic features calculated are:

MPEG Centroid is the balancing point of the vector. It can be calculated using

C = \frac{\sum_{i=1}^{32} i \, M[i]}{\sum_{i=1}^{32} M[i]}    (2.7)
MPEG Rolloff is the value R such that

\sum_{i=1}^{R} M[i] = 0.85 \sum_{i=1}^{32} M[i]    (2.8)
MPEG Spectral Flux is the 2-norm of the difference between normalized M vectors evaluated at two successive frames.
MPEG RMS is a measure of the loudness of the frame. It can be calculated using

RMS = \sqrt{\frac{\sum_{i=1}^{32} M[i]^2}{32}}    (2.9)

This feature is unique to segmentation since changes in loudness are important cues for new sound events. In contrast, classification algorithms must usually be loudness invariant.
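Assuming a decoder makes the 18 subband vectors S_t of a frame available (obtaining them is decoder-specific and not shown), the features could be computed along the following lines; this is a sketch of Equations 2.6-2.9, not the MARSYAS code itself, and the sum-normalization used for the flux is one possible choice:

    import numpy as np

    def mpeg_features(S, prev_M=None, rolloff_percent=0.85):
        """S: array of shape (18, 32), the subband vectors of one MPEG frame."""
        M = np.sqrt(np.mean(S ** 2, axis=0))                     # Eq. 2.6
        i = np.arange(1, 33)
        centroid = np.sum(i * M) / np.sum(M)                     # Eq. 2.7
        cumulative = np.cumsum(M)
        rolloff = np.searchsorted(cumulative,
                                  rolloff_percent * cumulative[-1]) + 1  # Eq. 2.8
        rms = np.sqrt(np.mean(M ** 2))                           # Eq. 2.9
        flux = None                                              # 2-norm of difference
        if prev_M is not None:
            flux = np.linalg.norm(M / np.sum(M) - prev_M / np.sum(prev_M))
        return M, centroid, rolloff, rms, flux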
It is clear that these features are very similar to the features calculated on the magnitude spectrum of the STFT. This is not surprising, as in both cases essentially what is summarized are the gross characteristics of the spectral shape. Comparisons of these features with other features such as STFT-based and MFCC in the context of audio classification and segmentation will be provided in Chapter 5.
2.4 Representing sound "texture"
According to the standard definition, the term "timbre" is used to describe all the characteristics of a sound that differentiate it from other sounds of the same pitch and loudness. This definition makes sense for single sound sources such as an isolated musical instrument tone or a single spoken vowel but cannot be applied to more complex audio signals such as polyphonic music or speech. However, for different types of complex audio signals there are statistical qualities of the energy distribution in time and frequency that can be used to characterize them. For example we should expect heavy metal music to have quite different spectral characteristics than classical music, and speech is probably quite different from music. The term sound "texture" will be used to describe these statistical characteristics of the spectral distribution. It is important to note that these characteristics depend both on frequency and time. For example it is likely that heavy metal music will have more energy in the higher frequencies than classical music (frequency) and speech will have more changes of energy over time than most types of music.
Figure 2.2: Sound textures of news broadcast
Some examples of sound textures are: an orchestra playing, a guitar solo, a singing voice, a string quartet, a jazz big band, etc. Although in some cases these textures can be observed visually on a spectrogram, in most cases they cannot. Figure 2.2 shows an excerpt from a news broadcast. There are three different textures (music, male speech, female speech), each one colored differently. It is easy to visually separate the music from speech in the spectrogram display, but it is more challenging to separate male from female speech. So the goal is to somehow capture the statistical properties of sound "textures" using numerical features.
2.4.1 Analysis and texture windows
In short-time audio analysis the signal is broken into small, possibly overlapping, segments in time and each segment is processed separately. These segments are called analysis windows and have to be small enough so that the frequency characteristics of the magnitude spectrum are relatively stable, i.e. it is assumed that the signal for that short amount of time is stationary. However, the sensation of a sound "texture" arises as the result of multiple short-time spectrums with different characteristics following some pattern in time. For example, speech contains vowel and consonant sections, each of which has different spectral characteristics.
Therefore, in order to capture the long-term nature of sound "texture", the actual features computed in our system are the running means and variances of the extracted features, described in the previous Sections, over a number of analysis windows. The term texture window will be used to describe this larger window, and ideally it should correspond to the minimum amount of time that is necessary to identify a particular sound or music "texture". Essentially, rather than using the feature values directly, the parameters of a running multidimensional Gaussian distribution are estimated. More specifically, these parameters (means, variances) are calculated based on the texture window which consists of the current feature vector in addition to a specific number of feature vectors from the past. Another way to think of the texture window is as a memory of the past. For efficient implementation a circular buffer holding previous feature vectors can be used. In our system, an analysis window of 23 milliseconds (512 samples at 22050 Hz sampling rate) and a texture window of 1 second (43 analysis windows) is used. In Chapter 5 it will be shown experimentally that the use of "texture" windows improves the results of automatic musical genre classification.
2.4.2 Low Energy Feature
Low Energy is the only feature that is based on the texture window rather than the analysis window. It is defined as the percentage of analysis windows that have less RMS energy than the average RMS energy across the texture window. For example, musical signals will have a lower "Low Energy" value than speech signals, which usually contain many silent windows. A small sketch of both ideas follows.
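The sketch below, assuming a feature trajectory of shape (num_windows, num_features) and a parallel array of per-window RMS values (both hypothetical inputs), illustrates the running texture-window statistics and the Low Energy feature:

    import numpy as np

    TEXTURE = 43  # analysis windows per texture window (about 1 second)

    def texture_stats(features, rms):
        """Running means/variances over the texture window, plus Low Energy."""
        out = []
        for t in range(len(features)):
            start = max(0, t - TEXTURE + 1)       # memory of the recent past
            window = features[start:t + 1]
            stats = np.concatenate([window.mean(axis=0), window.var(axis=0)])
            r = rms[start:t + 1]
            low_energy = np.mean(r < r.mean())    # fraction of low-RMS windows
            out.append(np.append(stats, low_energy))
        return np.array(out)

A production version would use the circular buffer mentioned above instead of re-slicing the trajectory at every step.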
2.5 Wavelet transform features
2.5.1 The Discrete Wavelet Transform
The Wavelet Transform (WT) is a technique for analyzing signals. It was developed as an alternative to the Short Time Fourier Transform (STFT) to overcome problems related to its frequency and time resolution properties. More specifically, unlike the STFT which provides uniform time resolution for all frequencies, the WT provides high time resolution and low frequency resolution for high frequencies, and low time and high frequency resolution for low frequencies. In that respect it is similar to the human ear, which exhibits similar time-frequency resolution characteristics.
The Discrete Wavelet Transform (DWT) is a special case of the WT that provides a compact representation of the signal in time and frequency and that can be computed efficiently
using a fast, pyramidal algorithm related to multirate filterbanks. More information about the WT and DWT can be found in Mallat [77]. For the purposes of this work, the DWT can be viewed as a computationally efficient way to calculate an octave decomposition of the signal in frequency. More specifically, the DWT can be viewed as a constant Q (center frequency / bandwidth) filterbank with octave spacing between the centers of the filters. A more detailed description of the DWT can be found in Appendix A.
2.5.2 Wavelet features
The extracted wavelet coefficients provide a compact representation that shows the energy of the signal in time and frequency. In order to further reduce the dimensionality of the extracted feature vectors, statistics over the sets of wavelet coefficients are used. That way the statistical characteristics of the "texture" or "music surface" of the piece can be represented. For example the distribution of energy in time and frequency for music is different from that of speech. The following features are used to represent sound "texture":
- The mean of the absolute value of the coefficients in each subband. These features provide information about the frequency distribution of the audio signal.
- The standard deviation of the coefficients in each subband. These features provide information about the amount of change of the frequency distribution over time.
- Ratios of the mean absolute values between adjacent subbands. These features also provide information about the frequency distribution.
A window size of 65536 samples at 22050 Hz sampling rate with a hop size of 512 samples is used as input to the feature extraction. This window corresponds to approximately 3 seconds. Twelve levels (subbands) of coefficients are used, resulting in a feature vector with 35 dimensions (12 + 12 + 11).
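A sketch of the three statistics using the PyWavelets package (pywt, an assumption; the thesis uses its own DWT implementation) with a Daubechies wavelet:

    import numpy as np
    import pywt

    def wavelet_texture_features(x, levels=12, wavelet="db4"):
        """Mean-absolute-value, standard deviation, and adjacent-band ratios."""
        # wavedec returns levels+1 arrays; keep 12 subbands for the statistics.
        coeffs = pywt.wavedec(x, wavelet, level=levels)[:12]
        means = np.array([np.mean(np.abs(c)) for c in coeffs])
        stds = np.array([np.std(c) for c in coeffs])
        eps = 1e-10                              # guard against division by zero
        ratios = means[:-1] / (means[1:] + eps)  # 11 adjacent-band ratios
        return np.concatenate([means, stds, ratios])  # 12 + 12 + 11 = 35 values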
2.6 Other features
In this section some other supported features that do not fit in the previous categories will be briefly described.
RMS

RMS is a measure of the loudness of a window. It can be calculated using

RMS = \sqrt{\frac{\sum_{n=1}^{N} x[n]^2}{N}}    (2.10)

This feature is unique to segmentation since changes in loudness are important cues for new sound events. In contrast, classification algorithms must usually be loudness invariant.
Pitch

Pitch (as an audio feature) typically refers to the fundamental frequency of a monophonic sound signal and can be calculated using various different techniques [96].
Harmonicity

Harmonicity is a measure of how strong the pitch detection for a sound is [143]. It can also be used for voiced/unvoiced detection.
Linear Prediction (LPC) reflection coefficients

LPC coefficients are used in speech research as an estimate of the speech vocal tract filter [76].
Time domain Zero Crossings

Z_t = \frac{1}{2} \sum_{n=1}^{N} \left| \mathrm{sign}(x[n]) - \mathrm{sign}(x[n-1]) \right|    (2.11)
where the sign function is 1 for positive arguments and 0 for negative arguments and x[n] is the time domain signal for frame t. Time domain Zero Crossings provide a measure of the noisiness of the signal. For example heavy metal music, due to guitar distortion and lots of drums, will tend to have much higher zero crossing values than classical music. For clean (without noise) signals Zero Crossings are highly correlated with the Spectral Centroid. The evaluation of these features in the context of audio classification is described in Chapter 5.
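Equation 2.11 translates directly into a short sketch (with sign(x) taken as 1 for positive arguments and 0 otherwise, as defined above):

    import numpy as np

    def zero_crossings(x):
        """Number of sign changes in one analysis window (Eq. 2.11)."""
        s = (x > 0).astype(int)       # 1 for positive arguments, 0 otherwise
        return 0.5 * np.sum(np.abs(np.diff(s)))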
2.7 Rhythmic Content Features
2.7.1 Beat Histograms
Most automatic beat detection systems provide a running estimate of the main beat and an estimate of its strength. In order to characterize musical genres, more information about the rhythmic structure of a piece can be utilized in addition to these features. The regularity of the rhythm, the relation of the main beat to the subbeats, and the relative strength of subbeats to the main beat are some examples of characteristics of music we would like to represent through feature vectors.
A common automatic beat detector structure consists of signal decomposition into frequency bands using a filterbank, followed by an envelope extraction step and finally a periodicity detection algorithm which is used to detect the lags at which the signal's envelope is most similar to itself. The process of automatic beat detection resembles pitch detection with larger periods (approximately 0.5 seconds to 1.5 seconds for beat compared to 2 milliseconds to 50 milliseconds for pitch).
The feature set for representing rhythm structure is based on detecting the most salient periodicities of the signal. Figure 2.3 shows the flow diagram of the beat analysis algorithm. The signal is first decomposed into a number of octave frequency bands using the
[Diagram: the signal is decomposed by the Discrete Wavelet Transform into octave frequency bands; each band passes through envelope extraction (full wave rectification, low pass filtering, downsampling, mean removal); the band envelopes are summed, autocorrelated, and multiple peak picking accumulates the peaks into the Beat Histogram.]
Figure 2.3: Beat Histogram Calculation Flow Diagram
DWT. Following this decomposition, the time domain amplitude envelope of each band is extracted separately. This is achieved by applying full-wave rectification, low pass filtering, and downsampling to each octave frequency band. After mean removal the envelopes of each band are then summed together and the autocorrelation of the resulting sum envelope is computed. The dominant peaks of the autocorrelation function correspond to the various periodicities of the signal's envelope. These peaks are accumulated over the whole sound file into a Beat Histogram where each bin corresponds to the peak lag (i.e. the beat period in beats-per-minute, bpm). Rather than adding one, the amplitude of each peak is added to the beat histogram. That way, when the signal is very similar to itself (strong beat) the histogram peaks will be higher.
The following building blocks are used for the beat analysis feature extraction:
Full Wave Rectification

y[n] = |x[n]|    (2.12)

is applied in order to extract the temporal envelope of the signal rather than the time domain signal itself.
Low Pass Filtering (LPF)

y[n] = (1 - \alpha) x[n] + \alpha y[n-1]    (2.13)

i.e. a One Pole Filter with an alpha value of 0.99, is used to smooth the envelope. Full Wave Rectification followed by Low Pass Filtering is a standard envelope extraction technique.
Downsampling

y[n] = x[kn]    (2.14)

where k = 16 in our implementation. Because of the large periodicities for beat analysis, downsampling the signal reduces computation time for the autocorrelation computation without affecting the performance of the algorithm.
Mean Removal

y[n] = x[n] - E[x[n]]    (2.15)

is applied in order to center the signal at zero for the autocorrelation stage.
Enhanced Autocorrelation

y[k] = \frac{1}{N} \sum_{n} x[n] \, x[n-k]    (2.16)

The peaks of the autocorrelation function correspond to the time lags where the signal is most similar to itself. The time lags of peaks in the right time range for rhythm analysis correspond to beat periodicities. The autocorrelation function is enhanced using a method similar to the multipitch analysis model of [123] in order to reduce the effect of integer multiples of the basic periodicities. The original autocorrelation function of the summary of the envelopes is clipped to positive values and then time-scaled by a factor of two and subtracted from the original clipped function. The same process is repeated with other integer factors such that repetitive peaks at integer multiples are removed.
Peak Detection and Histogram Calculation

The first three peaks of the enhanced autocorrelation function that are in the appropriate range for beat detection are selected and added to a Beat Histogram (BH). The bins of the histogram correspond to beats-per-minute (bpm) from 40 to 200 bpm. For each peak of the enhanced autocorrelation function the peak amplitude is added to the histogram. That way peaks that have high amplitude (where the signal is highly similar) are weighted more strongly than weaker peaks in the histogram calculation.
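A condensed sketch of the envelope-extraction and histogram-accumulation chain (Equations 2.12-2.16) follows; the DWT band decomposition is not shown (the bands are assumed to be full-rate octave-band signals), the autocorrelation enhancement is shown for the factor of two only, and the peak selection is simplified, so this is illustrative rather than the MARSYAS implementation:

    import numpy as np

    def band_envelope(band, alpha=0.99, k=16):
        """Envelope of one octave band: Eqs. 2.12-2.15."""
        env = np.abs(band)                       # full wave rectification (2.12)
        smooth = np.empty_like(env)              # one-pole low pass filter (2.13)
        prev = 0.0
        for n, v in enumerate(env):
            prev = (1.0 - alpha) * v + alpha * prev
            smooth[n] = prev
        down = smooth[::k]                       # downsampling (2.14)
        return down - down.mean()                # mean removal (2.15)

    def beat_histogram(bands, sr, k=16, num_peaks=3):
        """bands: list of octave-band signals at the full sampling rate sr."""
        env = sum(band_envelope(b, k=k) for b in bands)
        n = len(env)
        ac = np.correlate(env, env, mode="full")[n - 1:] / n      # Eq. 2.16
        ac = np.clip(ac, 0.0, None)
        # Enhancement: subtract a two-times time-stretched copy (a full
        # implementation repeats this for other integer factors).
        stretched = np.interp(np.arange(n) / 2.0, np.arange(n), ac)
        enhanced = np.clip(ac - stretched, 0.0, None)
        hist = np.zeros(161)                     # bins for 40..200 bpm
        env_sr = sr / k
        # Simplified peak picking: strongest lags first (a real implementation
        # would select local maxima of the enhanced autocorrelation).
        for lag in np.argsort(enhanced)[::-1]:
            if lag == 0:
                continue
            bpm = int(round(60.0 * env_sr / lag))
            if 40 <= bpm <= 200:
                hist[bpm - 40] += enhanced[lag]  # add the peak amplitude, not 1
                num_peaks -= 1
                if num_peaks == 0:
                    break
        return hist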
2.7.2 Features
Figure 2.4 shows a Beat Histogram for a 30 second excerpt of the song "Come Together" by the Beatles. The two main peaks of the Beat Histogram (BH) correspond to the main beat at approximately 80 bpm and its first harmonic (twice the speed) at 160 bpm. Figure 2.5 shows four Beat Histograms of pieces from different musical genres. The upper left corner, labeled Classical, is the BH of an excerpt from "La Mer" by Claude Debussy. Because of
[Plot: Beat Strength versus BPM (60-200) for a rock excerpt.]
Figure 2.4: Beat Histogram Example
the complexity of the multiple instruments of the orchestra there is no strong self-similarity and there is no clear dominant peak in the histogram. Stronger peaks can be seen at the lower left corner, labeled Jazz, which is an excerpt from a live performance by Dee Dee Bridgewater. The two peaks correspond to the beat of the song (70 and 140 bpm). The BH of Figure 2.4 is shown on the upper right corner, where the peaks are more pronounced because of the stronger beat of rock music. The highest peaks of the lower right corner indicate the strong rhythmic structure of a HipHop song by Neneh Cherry.
Unlike previous work in automatic beat detection, which typically aims to provide only an estimate of the main beat (or tempo) of the song and possibly a measure of its strength, the BH representation captures more detailed information about the rhythmic content of the piece that can be used to intelligently guess the musical genre of a song. Figure 2.5 indicates that the BHs of different musical genres can be visually differentiated. Based on this observation a set of features based on the BH are calculated in order to represent
[Plots: Beat Strength versus BPM (60-200) for Classical, Rock, Jazz, and Hip-Hop excerpts.]
Figure 2.5: Beat Histogram Examples
rhythmic content, and are shown to be useful for automatic musical genre classification. These are:
- A0, A1: relative amplitude (divided by the sum of amplitudes) of the first and second histogram peak
- RA: ratio of the amplitude of the second peak divided by the amplitude of the first peak
- P1, P2: period of the first and second peak in bpm
- SUM: overall sum of the histogram (indication of beat strength)
For the BH calculation, the DWT is applied in a window of 65536 samples at 22050 Hz sampling rate, which corresponds to approximately 3 seconds. This window is advanced by a hop size of 32768 samples. This larger window is necessary to capture the signal repetitions at the beat and subbeat levels.
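Given a computed Beat Histogram, the six features could be read off as in this sketch (peak picking via local maxima; the function name bh_features is illustrative):

    import numpy as np

    def bh_features(hist):
        """A0, A1, RA, P1, P2, SUM from a 40-200 bpm Beat Histogram."""
        # Local maxima as candidate peaks, sorted strongest first.
        peaks = [i for i in range(1, len(hist) - 1)
                 if hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1]]
        peaks.sort(key=lambda i: hist[i], reverse=True)
        p1, p2 = peaks[0], peaks[1]
        total = hist.sum()
        a0, a1 = hist[p1] / total, hist[p2] / total   # relative amplitudes
        ra = hist[p2] / hist[p1]                      # second-to-first ratio
        return a0, a1, ra, p1 + 40, p2 + 40, total    # periods in bpm, overall sum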
2.8 Pitch Content Features
2.8.1 Pitch Histograms
The pitch content feature set is based on multiple pitch detection techniques. More specifically, the multipitch detection algorithm described by [123] is utilized. In this algorithm, the signal is decomposed into two frequency bands (below and above 1000 Hz) and amplitude envelopes are extracted for each frequency band. The envelope extraction is performed by applying half-wave rectification and low-pass filtering. The envelopes are summed and an enhanced autocorrelation function is computed so that the effect of integer multiples of the peak frequencies on multiple pitch detection is reduced. More details about this algorithm can be found in the Appendix.
The prominent peaks of this summary enhanced autocorrelation function (SACF) correspond to the main pitches for that short segment of sound. This method is similar to the beat detection structure but for the shorter periods corresponding to pitch perception. The three dominant peaks of the SACF are accumulated into a Pitch Histogram (PH) over the whole sound file. For the computation of the PH, a pitch analysis window of 512 samples at 22050 Hz sampling rate (approximately 23 milliseconds) is used.
The frequencies corresponding to each histogram peak are converted to musical pitches such that each bin of the PH corresponds to a musical note with a specific pitch (for example A4 = 440 Hz). The musical notes are labeled using the MIDI note numbering scheme. The conversion from frequency to MIDI note number can be performed using the following equation:

n = 12 \log_2 \left( \frac{f}{440} \right) + 69    (2.17)

where f is the frequency in Hertz and n is the histogram bin (MIDI note number).
Two versions of the PH are created: a folded (FPH) and an unfolded (UPH) histogram. The unfolded version is created using the above equation without any further modifications. In the folded case all notes are mapped to a single octave using the equation:

c = n \bmod 12    (2.18)

where c is the folded histogram bin (pitch class or chroma value), and n is the unfolded histogram bin (or MIDI note number). The folded version contains information regarding the pitch classes or harmonic content of the music whereas the unfolded version contains information about the pitch range of the piece. The FPH is similar in concept to the chroma-based representations used in [10] for audio thumbnailing. More information regarding the chroma and height dimensions of musical pitch can be found in [113] and the relation of musical scales to frequency is discussed in more detail in [94].
Finally, the FPH is mapped to a circle-of-fifths histogram so that adjacent histogram bins are spaced a fifth apart rather than a semitone. This mapping is achieved by the following equation:

c' = (7 \times c) \bmod 12    (2.19)

where c' is the new folded histogram bin after the mapping and c is the original folded histogram bin. The number 7 corresponds to 7 semitones, or the music interval of a fifth. That way, the distances between adjacent bins after the mapping are better suited for expressing tonal music relations (tonic-dominant) and the extracted features result in better classification accuracy.
Although musical genres by no means can be fully characterized by their pitch content, there are certain tendencies that can lead to useful feature vectors. For example Jazz or Classical music tends to have a higher degree of pitch change than Rock or Pop music. As
a consequence Pop or Rock music pitch histograms will have fewer and more pronounced peaks than the histograms of Jazz or Classical music.
2.8.2 Features
Based on these observations the following features are computed from the UPH and FPH in order to represent pitch content (a short sketch follows the list):
- FA0: amplitude of the maximum peak of the folded histogram. This corresponds to the most dominant pitch class of the song. For tonal music this peak will typically correspond to the tonic or dominant chord. This peak will be higher for songs that do not have many harmonic changes.
- UP0: period of the maximum peak of the unfolded histogram. This corresponds to the octave range of the dominant musical pitch of the song.
- FP0: period of the maximum peak of the folded histogram. This corresponds to the main pitch class of the song.
- IPO1: pitch interval between the two most prominent peaks of the folded histogram. This corresponds to the main tonal interval relation. For pieces with simple harmonic structure this feature will have value 1 or -1, corresponding to a fifth or fourth interval (tonic-dominant).
- SUM: the overall sum of the histogram. This feature is a measure of the strength of the pitch detection.
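A sketch of the mappings of Equations 2.17-2.19 and the resulting features, assuming a list of detected pitch frequencies with their SACF amplitudes (hypothetical inputs):

    import numpy as np

    def pitch_histograms(freqs, amps):
        """Unfolded (MIDI-note) and circle-of-fifths folded pitch histograms."""
        uph = np.zeros(128)
        fph = np.zeros(12)
        for f, a in zip(freqs, amps):
            n = int(round(12 * np.log2(f / 440.0) + 69))   # Eq. 2.17
            n = min(max(n, 0), 127)                        # clamp to MIDI range
            c = n % 12                                     # Eq. 2.18
            uph[n] += a
            fph[(7 * c) % 12] += a                         # Eq. 2.19, fifths spacing
        return uph, fph

    def pitch_features(uph, fph):
        """FA0, UP0, FP0, IPO1, SUM from the pitch histograms."""
        fa0 = fph.max()                      # amplitude of strongest folded peak
        up0 = int(uph.argmax())              # MIDI note of strongest unfolded peak
        fp0 = int(fph.argmax())              # strongest folded (fifths-spaced) bin
        second = int(np.argsort(fph)[-2])    # second most prominent folded peak
        ipo1 = fp0 - second                  # main tonal interval relation
        return fa0, up0, fp0, ipo1, float(fph.sum())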
2.9 Musical Content Features
Based on the previously described features it is possible to design a feature representation that captures timbral, rhythmic and harmonic aspects of musical signals. More specifically the following 30-dimensional feature vector can be constructed:
- Timbre - Sound Texture: means over the whole file of texture STFT-based features (means and variances of Centroid, Rolloff, Flux, and Zero Crossings over the texture window) and Low Energy (9 dimensions, 23 millisecond analysis window, 1 second texture window).
- Timbre - Sound Texture: means over the whole file of texture MFCCs (means and variances of the first five MFCC coefficients over the texture window, not counting the DC term) (10 dimensions, 23 millisecond analysis window, 1 second texture window).
- Beat Content - Rhythm: rhythmic content features based on Beat Histograms (6 dimensions, 3 second analysis window, 1.5 second hop size).
- Pitch Content - Harmony: pitch content features based on Pitch Histograms (5 dimensions, 23 millisecond analysis window).
Although there are many possible variations, this feature set forms the basic representation used by most of the described algorithms in this thesis and has been shown to be effective in representing musical content.
2.10 Summary
Feature vectors are the fundamental building blocks that underlie the majority of existing work in Computer Audition. A set of features based on the Short Time Fourier Transform (arguably the most common feature front-end) was presented. A similar set of features that can be calculated directly from MPEG audio compressed data was described. In order to represent sound "texture", statistical information about the change of short-time characteristics over larger time intervals is required. The use of "texture" windows allows the statistical combination of short-time spectral shape features in order to represent sound "texture". A set of features that describe spectral shape and texture based on the Discrete Wavelet Transform was also described. In addition to timbral and textural information, musical signals can be characterized by characteristics of their rhythmic and pitch content. Beat and Pitch Histograms are such statistical characterizations of musical signals that can be used to calculate features that represent musical content.
Chapter 3
Analysis
Why do we listen with greater pleasure to men singing music which we happen to know beforehand, than to music which we do not know? Is it because, when we recognize the song, the meaning of the composer, like a man hitting the mark, is more evident? This is pleasant to consider. Or is it because it is more pleasant to contemplate than to learn? The reason is that in the one case it is the acquisition of knowledge, but in the other it is using it and a form of recognition. Moreover, what we are accustomed to is always more pleasant than the unfamiliar.
Aristotle, Problems, Book 19, No. 5
3.1 Introduction
In the previous chapter various different ways of representing audio signals in the form of feature vectors were described. In this chapter we will see how these representations can be used to analyze and somehow "understand" audio signals in various ways. After feature extraction an audio signal can be represented in two major ways: 1) as a time series of feature vectors (or a trajectory of points in the feature space) or 2) as a single feature vector (or point in the feature space). Essentially, analysis techniques try to extract content and
context information about audio signals and collections by trying to learn and analyze the geometric structure of the feature space.
Extracting content and context information both in humans and machines is a task that involves memory and learning. Therefore ideas from Machine Learning and especially Pattern Classification are important for the development of the proposed algorithms. Auditory Scene Analysis is the process by which the auditory system builds mental descriptions of complex auditory environments by analyzing mixtures of sounds [16]. From an ecological viewpoint, we try to associate events with sounds in order to understand our environment. Classification (what made this sound?), segmentation (how long did it last?) and similarity (what else sounds like that?) are fundamental processes of any type of auditory analysis. In this chapter algorithms for automatically performing these tasks will be described. Although there is no attempt to directly model the human auditory system, knowledge about how it works has provided valuable insight for the design of the described algorithms and systems. For example there is significant evidence that the decisions for sequential and simultaneous integration of sounds are based on multiple cues. Similarly, multiple features and representations are used in the proposed algorithms.
3.2 Related Work
Audio analysis is based on Machine Learning (ML) and specifically Pattern Classification techniques. Representative textbooks describing these areas are [30, 105]. Although ML techniques such as Hidden Markov Models (HMM) have been used for a long time in Speech Recognition [98], they have only recently been applied to other types of audio signals.
When Speech Recognition techniques started being accurate enough for practical use, they were used in order to retrieve multimedia information. For example the Informedia project at Carnegie Mellon [53] contains a terabyte of audiovisual data. Indexing the archive is done using a combination of speech recognition, image analysis and keyword
searching techniques. Audio analysis and browsing tools can enhance such indexing techniques for multimedia data such as radio and television broadcasts, especially for the regions that do not contain speech.
More recently, audio classification techniques that include non-speech signals have been proposed. Most of these systems target the classification of broadcast news and video in broad categories like music, speech and environmental sounds. The problem of discrimination between music and speech has received considerable attention, from the early work of Saunders [103] where simple thresholding of the average zero-crossing rate and energy features is used, to the work of Scheirer and Slaney [110] where multiple features and statistical pattern recognition classifiers are carefully evaluated. A similar system is used in [102] to initially separate speech from music and then detect phonemes or notes accordingly. Audio signals are segmented and classified into "music", "speech", "laughter" and non-speech sounds using cepstral coefficients and a Hidden Markov Model (HMM) in [63]. A heuristic rule-based system for the segmentation and classification of audio signals from movies or TV programs based on the time-varying properties of simple features is proposed in [147]. Signals are classified into two broad groups of music and non-music which are further subdivided into (Music) Harmonic Environmental Sound, Pure Music, Song, Speech with Music, Environmental Sound with Music and (Non-music) Pure Speech and Non-Harmonic Environmental Sound. A similar audio classification and segmentation method (music, speech, environmental sound and silence) based on Nearest Neighbor and Learning Vector Quantization is presented in [74]. A solution to the more difficult problem of locating singing voice segments in musical signals is described in [12]. In their system, the phoneme activation output of an automatic speech recognition system is used as the feature vector for classifying singing segments. A system for the separation of speech signals from complex auditory scenes is described in [90].
Another type of non-speech audio classification system involves isolated musical instrument sounds and sound effects. A retrieval-by-similarity system for isolated sounds (sound effects and instrument tones) has been developed at Muscle Fish LLC [143]. Users can search for and retrieve sounds by perceptual and acoustical features, can specify classes based on these features and can ask the engine to retrieve similar or dissimilar sounds. The features used in their system are statistics (mean, variance, autocorrelation) over the whole sound file of short-time features such as pitch, amplitude, brightness, and bandwidth. Using the same data set various other retrieval and classification approaches have been proposed. The use of MFCC coefficients to construct a learning tree vector quantizer is proposed in [38]. Histograms of the relative frequencies of feature vectors in each quantization bin are subsequently used for retrieval. The same data set is also used in [68] to evaluate a feature extraction and indexing scheme based on statistics of the Discrete Wavelet Transform (DWT) coefficients. In [70] the same data set is used to compare various classification methods and feature sets, and the use of the Nearest Feature Line pattern classification method is proposed.
Similarity retrieval of music signals based on content is explored in [127]. A more specific similarity retrieval method targeted towards identifying different performances of the same piece is described in [144, 145]. A similar system that retrieves large symphonic orchestral works based on the overall energy profile and dynamic programming techniques is described in [41].
A multi-feature classifier based on spectral moments for recognition of steady-state instrument tones was initially described in [44] and subsequently improved in [45]. Another paper exploring the recognition of musical instrument tones from monophonic recordings is [17]. A more general framework for the same problem is described in [80, 81]. A detailed exploration and comparison of features for isolated musical instrument tones is provided in [32].
A method for the automatic hierarchical classification of musical genres was initially described in [135] and improved in [133]. A more limited classification system (three genres: Jazz, Classical, Rock) using ideas from texture modeling in the visual domain is described in [27]. The problem of artist detection using neural networks and support vector machines trained using audio features is addressed in [142]. A related problem to similarity retrieval is audio identification by content, or audio fingerprinting, which is the use of audio content to directly identify pieces of music in a large database [5].
Segmentation algorithms can be divided into algorithms that require prior classification and those that do not. Segmentation systems that require prior classification in many cases are based on Hidden Markov Models (HMM). Some examples are: the segmentation of recorded meetings based on speaker changes and silences described in [63], and the evaluation of music segmentation using different feature front-ends (MFCC, LPC, Discrete Cepstrum) described in [8]. Another approach has been to detect changes by tracking various individual audio features and then using heuristics to combine their results to create segmentations [147]. A segmentation method based on self-similarity is described in [42]. An interesting approach for segmentation based on the detection of relative silence is described in [93]. A comparison of model-based, metric-based and energy-based audio segmentation for broadcast news data is described in [62]. A multifeature approach that does not require prior classification and formalizes the integration of multiple features was proposed in [125] and further evaluated with user experiments in [128].
3.2.1 Contributions
The main contribution of this thesis to the analysis stage of Computer Audition is a general multi-feature methodology for audio segmentation with arbitrary sound "textures" that does not rely on previous classification. The proposed methodology can be applied to any type of audio features and has been evaluated for various feature sets with user experiments described in Chapter 5. This methodology was first described in [125], where segmentation as a separate process from classification and a more formal approach to multi-feature segmentation were provided. Additional segmentation experiments are described in [128]. In addition, we have shown that effective musical genre classification and content-based similarity retrieval are possible using the musical content feature set that represents timbral, rhythmic and harmonic aspects of music. An initial version of the automatic musical genre classification algorithm was published in [135] and the complete algorithm and more evaluation experiments are described in [133].
3.3 Query-by-example content-based similarity retrieval
In most content-based multimedia information retrieval systems the main method of specifying a query is by example. In that paradigm the user provides a multimedia object (for example an image or a piece of music) as the query and the system returns a list of similar multimedia objects ranked by their similarity. Similarity retrieval can also be used for automatic playlist generation [4].
Query-by-example (QBE) music similarity retrieval is implemented based on the musical content features (Section 2.9) using the single feature vector representation for each file. Because the computed features represent various characteristics of the music such as its timbral texture, rhythmic and pitch content, feature vectors that are close together in feature space will also be perceptually similar. A naive implementation of similarity retrieval would use the Euclidean distance between feature vectors and would return similar audio files ranked by increasing distance (the closest one to the query feature vector is the first match). Because the features have different dynamic ranges and probably are correlated, a Mahalanobis distance is used in our system. The Mahalanobis distance [75] between two vectors is defined as:
M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}    (3.1)

where \Sigma^{-1} is the inverse of the feature vector covariance matrix. The feature covariance matrix \Sigma is defined as:

\Sigma = E[x x^T]    (3.2)
The Mahalanobis distance automatically accounts for the proper scaling of the coordinate axes and corrects for the correlation between features. Evaluating similarity retrieval is difficult, as it typically involves conducting user experiments. User experiments conducted to evaluate this technique are described in Chapter 5. A demonstration of the system can be found at: http://soundlab.princeton.edu/demo.php
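To make the retrieval step concrete, the following minimal sketch ranks a collection by Mahalanobis distance to a query. It uses NumPy rather than the system's own implementation, and the feature data are hypothetical stand-ins for the single-feature-vector representations described above.

```python
import numpy as np

def mahalanobis_rank(query, features):
    """Rank collection items by Mahalanobis distance to the query.

    query    -- 1-D feature vector of the query file
    features -- 2-D array with one row per file in the collection
    Returns collection indices sorted by increasing distance.
    """
    # Covariance over the whole collection (rows are observations);
    # the pseudo-inverse guards against a singular covariance matrix.
    sigma_inv = np.linalg.pinv(np.cov(features, rowvar=False))
    diff = features - query
    # Squared Mahalanobis distance of every row to the query.
    d2 = np.einsum('ij,jk,ik->i', diff, sigma_inv, diff)
    return np.argsort(d2)

# Hypothetical example: 100 files with 30-dimensional content features.
rng = np.random.default_rng(0)
collection = rng.normal(size=(100, 30))
print(mahalanobis_rank(collection[0], collection)[:10])
```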
3.4 Classification
Classification algorithms attempt to categorize observed patterns into groups of related patterns, or classes. The classification or description scheme is usually based on the availability of a set of patterns that have already been classified or described. This set of patterns is termed the training set, and the resulting learning strategy is characterized as supervised. Learning can also be unsupervised, in the sense that the system is not given an a priori labeling of patterns; instead it establishes the classes itself based on the statistical regularities of the patterns.
The classification or description scheme usually uses one of the following approaches: statistical (or decision theoretic), syntactic (or structural), or neural. Statistical pattern recognition is based on statistical characterizations of patterns, assuming that the patterns are generated by a probabilistic system. Structural pattern recognition is based on the structural interrelationships of features. Neural pattern recognition employs the neural computing paradigm that has emerged with neural networks.
A huge variety of classification systems have been proposed in the literature and there is no clear-cut choice about which is best, as they differ in many aspects such as: training time, amount of training data required, classification time, robustness, and generalization. The main idea behind most of these classification schemes is to "learn" the geometric structure of the training data in the feature space and use that to estimate a "model" that can generalize to new unclassified patterns. Parametric classifiers assume a functional form for the underlying feature probability distribution, while non-parametric ones use the training set directly for classification. Even a brief presentation of all the proposed classification systems is beyond the scope of this thesis. More details can be found in standard textbooks on Pattern Classification such as [30, 105].
3.4.1 Classifiers
In this section, the main classification algorithms that are supported in our system will be briefly described. The emphasis is on what model is used to represent the feature vectors and not on how the model parameters can be estimated from training data. More details about the training of the described classifiers can be found in Appendix A. Although by no means the only possible choices, these classification algorithms have been used effectively to construct practical audio analysis algorithms that in many cases run in real time.
The Gaussian (GMAP) classifier assumes each class can be represented as a multi-dimensional Gaussian distribution in feature space. The parameters of the distribution (means and covariance matrix) are estimated using the labeled data of the training set. This classifier is typical of parametric statistical classifiers that assume a particular form for the underlying class probability density functions. This classifier is more appropriate when the feature distribution is unimodal. The locus of points equidistant from a particular Gaussian classifier is an ellipse, which is axis-aligned if the covariance matrix is diagonal.
The Gaussian Mixture Model (GMM) classifier models each class as a fixed-size weighted mixture of Gaussian distributions. For example, GMM3 will be used to denote a model that has 3 mixture components. The GMM classifier is characterized by the mixture weights and the means and covariance matrices of each mixture component. For efficiency reasons the covariance matrices are typically assumed to be diagonal. The GMM classifier is typically trained using the Expectation-Maximization (EM) algorithm, which is an iterative optimization procedure. A tutorial on the EM algorithm can be found in [84].
Unlike parametric classifiers, the K-Nearest Neighbor (KNN) classifier directly uses the training set for classification without assuming any mathematical form for the underlying class probability density functions. Each sample is classified according to the class of its nearest neighbor in the training data set. In the KNN classifier, the K nearest neighbors to the point to be classified are calculated and voting is used to determine the class. An interesting result about the nearest neighbor classifier is that no matter what classifier is used, one can never do better than to cut the error rate in half relative to the nearest-neighbor classifier, assuming the training and testing data represent the underlying feature space topology [30].
Artificial Neural Networks (ANN) are multilayer architectures of strongly connected simple components that can be trained to perform function approximation, pattern association, and pattern classification. A standard technique for the training of ANNs is backpropagation, which refers to the process by which derivatives of network error, with respect to network weights and biases, can be computed. This process can be used with a number of different optimization strategies. The architecture of a multilayer network is not completely constrained by the problem to be solved. The number of inputs to the network is constrained by the problem, and the number of neurons in the output layer is constrained by the number of outputs required by the problem. However, the number of layers between the network inputs and the output layer, and the sizes of those layers, are up to the designer. It has been shown that a two-layer sigmoid/linear network can represent any functional relationship between inputs and outputs if the sigmoid layer has enough neurons. A standard backpropagation neural network trained using a gradient-descent type of optimization has been used in our system as a pattern classifier. Although the achieved classification performance is comparable (not superior) to the other classifiers, training time for large collections is too long for practical purposes, and therefore the results in Chapter 5 will be based mainly on statistical pattern recognition classifiers. There is a lot of potential in the use of ANNs with architectures that are tailored to problems of Computer Audition rather than used as a generic classification algorithm.
In all the previously described classifiers a labeled training set was used (supervised learning). The K-means algorithm is a well-known technique for unsupervised learning, where no labels are provided and the system automatically forms clusters based solely on the structure of the training data. In the K-means algorithm each cluster is represented by its centroid and the clusters are iteratively improved. More details can be found in Appendix A. The same algorithm can also be used for Vector Quantization (VQ), a technique where each feature vector is coded as an integer index into a codebook of representative feature vectors corresponding to the means of the clusters.
Having different classifiers is important in real-world applications because, in addition to classification accuracy, other factors such as training speed, classification speed and robustness matter. For example, the KNN family of classifiers in its basic form is computationally intensive and requires significant storage for classification of large data sets. On the other hand, classification and training using the simple Gaussian (GMAP) classifier are fast, and therefore it can be used for real-time applications.
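To make these trade-offs concrete, here is a minimal sketch that trains a GMAP-style Gaussian classifier, a GMM3-style mixture classifier, and a KNN classifier on the same labeled feature vectors. It uses scikit-learn as a stand-in for the classifiers built into our system, and the synthetic data are hypothetical placeholders for the audio features of Chapter 2.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled audio feature vectors (3 classes).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(200, 10)) for m in (-1.0, 0.0, 1.0)])
y = np.repeat([0, 1, 2], 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# GMAP-style: one full-covariance Gaussian per class.
gmap = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr)

# GMM3-style: a 3-component diagonal-covariance mixture per class,
# trained with EM; classify by maximum per-class log-likelihood
# (the classes here are balanced, so priors are equal).
gmms = [GaussianMixture(3, covariance_type='diag', random_state=0)
        .fit(X_tr[y_tr == c]) for c in np.unique(y_tr)]
gmm_pred = np.argmax([g.score_samples(X_te) for g in gmms], axis=0)

# Non-parametric KNN: vote among the 5 nearest training neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

print('GMAP accuracy:', gmap.score(X_te, y_te))
print('GMM3 accuracy:', float(np.mean(gmm_pred == y_te)))
print('KNN accuracy:', knn.score(X_te, y_te))
```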
3.4.2 Classification schemes
Based on these classification algorithms and the audio features described in Chapter 2, various audio classification schemes are supported in our system. These schemes will be used in Chapter 5 to evaluate various feature sets and algorithms. For the purposes of this chapter, each classification scheme will be briefly described and a representative classification accuracy will be given (more detailed descriptions and results, as well as the specific details of the features and classifiers used, will be provided in Chapter 5). There has been little prior work on the Musical, Jazz and Classical genre classification schemes.
Music vs Speech
Separating music from speech is one of the first audio classification problems to have been explored. Detecting speech and music segments is important for automatic speech recognition systems, especially when dealing with real-world multimedia data, and good results can be achieved because music and speech have quite different spectral characteristics. In our implementation a classification accuracy of 89% is achieved.
3.4.3 Voices
In addition to separating music from speech, another important classification is detecting the gender of a speaker. In this classification scheme audio signals are classified into three categories: male voice, female voice, and sports announcing. Sports announcing refers to any type of voice over a loud noisy background. In addition to being an important source of information by itself, speaker gender identification can be used to improve speech recognition performance. A classification accuracy of 73% is achieved in our system.
Musical Genres
Using this classification scheme, audio signals are automatically classified into one of the following categories: Classical, Country, Disco, HipHop, Jazz, Rock, Blues, Reggae, Pop, Metal. A classification accuracy of 61% is achieved in our system.
Classical Genres
Classical music can be further subdivided into: Choir, Orchestra, Piano, String Quartet. These subgenres are closer to instrumentation categories. A classification accuracy of 88% is achieved in our system.
Jazz Genres
Jazz music can be further subdivided into: BigBand, Cool, Fusion, Piano, Quartet, Swing. These subgenres are closer to instrumentation categories. A classification accuracy of 68% is achieved in our system.
Figure 3.1: Musical genre hierarchy (a tree with Speech subdivided into Male, Female and Sports, and Music subdivided into Classical, Country, Disco, HipHop, Jazz, Rock, Blues, Reggae, Pop and Metal, with Classical further subdivided into Choir, Orchestra, Piano, String Quartet and Jazz into BigBand, Cool, Fusion, Piano, Quartet, Swing)
Figure 3.2: Classification schemes accuracy
Hierarchical Classification
These classification schemes are combined into a hierarchical scheme that can be used to classify radio and television broadcasts. Figure 3.1 shows this hierarchical scheme depicted as a tree with 23 nodes. A summary of the classification accuracy for each scheme is provided in Figure 3.2 and Table 3.1. These results are representative; more results and details about their calculation can be found in Chapter 5.

Table 3.1: Classification schemes accuracy

    Scheme            Random   Automatic
    MusicSpeech (2)     50        89
    Voices (3)          33        73
    Genres (10)         10        61
    Classical (4)       25        88
    Jazz (6)            16        68
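A hierarchical scheme of this kind amounts to routing a feature vector down a tree of classifiers. The sketch below is a hypothetical outline, not the system's actual code; it assumes each node holds an already-trained classifier with a scikit-learn-style predict method.

```python
def classify_hierarchical(x, node):
    """Route a feature vector down a tree of trained classifiers.

    node -- dict with a 'clf' (anything with a .predict method) and a
            'children' dict mapping each predicted label either to a
            subtree of the same form or to None (a leaf).
    Returns the labels along the path, e.g. ['Music', 'Jazz', 'Fusion']
    or ['Speech', 'Male'].
    """
    label = node['clf'].predict([x])[0]
    path = [label]
    child = node['children'].get(label)
    if child is not None:  # descend into the finer-grained scheme
        path += classify_hierarchical(x, child)
    return path
```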
Other schemes
In addition to the classification schemes described above, other types of audio classification, such as clustering of sound effects, isolated instrument recognition, and speaker emotion detection, have been implemented using our system. Unlike the previously described classification schemes, these types of classification are covered in the existing literature and will not be described further in this thesis. The results obtained are comparable with those reported in the published literature.
3.5 Segmentation
3.5.1 Motivation
One of the first chapters of most textbooks on image processing or computer vision is devoted to edge detection and object segmentation. This is because it is much easier to build classification and analysis algorithms using segmented objects as input rather than raw image data. In video analysis, shots, pans and generally temporal segments are detected and then analyzed for content. Similarly, temporal segmentation can be used for audio and especially music analysis.
Auditory Scene Analysis is the process by which the human auditory system builds mental descriptions of complex auditory environments by analyzing mixtures of sounds [16]. From an ecological viewpoint, we try to associate events with sounds in order to understand our environment. The characteristics of sound sources tend to vary smoothly in time, so abrupt changes usually indicate a new sound event. The decisions for sequential and simultaneous integration of sound are based on multiple cues. Although our method does not attempt to model the human auditory system, it does use significant changes of multiple features as segmentation boundaries. The experiments indicate that the selected features contain enough information to be useful for automatic segmentation.
Temporal segmentation is a more primitive process than classification, since it does not try to interpret the data. Therefore, it can be more easily modeled using mathematical techniques. Being simpler, it can work with arbitrary audio and does not pose specific constraints on its input, such as a single speaker or isolated tones. It has been argued in [82] that music analysis systems should be built for and tested on real music, and should be based on perceptual properties rather than music theory and note-level transcriptions.
In this section, a general segmentation methodology that can be used to segment audio based on arbitrary sound "texture" changes will be described. Some examples of sound "texture changes" are a piano entrance after the orchestra in a concerto, a rock guitar solo entrance, or a change of speaker.
Unlike many of the proposed algorithms for audio segmentation that rely on the existence of a prior classification, the proposed methodology is not constrained to a fixed number of possible sound "textures". This makes it applicable to a much broader class of segmentation problems. For example, segmentation into music and speech regions can easily be handled by existing classification-based segmentation systems. However, segmentation of the different instrument sections of an orchestra using the classification-based approach is more difficult, as a model for each possible subset of the orchestra would have to be learned. The proposed segmentation methodology has no problem handling such cases, as it simply detects the segmentation boundaries.
To further motivate segmentation, suppose that a user is given the task of segmenting an opera recording into sections and labeling each section appropriately with a text annotation. It can easily be verified experimentally that most of the time will be spent searching the audio for the section boundaries, and much less time on the actual annotation. As an indication of the time required for such a task, 2 hours was the average time required by the subjects of the experiments described in Chapter 5 to manually segment and annotate 10 minutes of audio using standard sound editing tools. These observations suggest a semi-automatic approach that combines manual and fully automatic annotation into a flexible, practical user interface for audio manipulation. Automatic segmentation is an important part of such a system. For example, the user can automatically segment audio into regions and then run automatic classification algorithms that suggest annotations for each region. The annotations can then be edited and/or expanded by the user. This way, significant amounts of user time are saved without losing the flexibility of subjective annotation. Segmentation can also be used to detect the musical structure of songs (for example, cyclic ABAB forms).
3.5.2 Methodology
The main idea behind the segmentation methodology described in this section is that changes in sound "texture" will correspond to abrupt changes in the trajectory of feature vectors representing the file. Although the statistics of a particular sound "texture" (and the corresponding feature vectors) might change significantly over time, they will do so smoothly, whereas a "texture" change will produce an abrupt gap. The proposed methodology can utilize arbitrary sets of features (of course they have to be good representations of audio content, but there is no constraint on their dimensionality or method of calculation). This is in contrast to other proposed segmentation systems that are designed around specific features such as amplitude and pitch, track changes in each feature individually, and use heuristics to combine the results.
The methodology can be broken into four stages:
1. A time series of feature vectors $V_t$ is calculated by iterating over the sound file.

2. A distance signal $\Delta_t = \| V_t - V_{t-1} \|$ is calculated between successive frames of sound. In our implementation we use a Mahalanobis distance given by

$M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{T} \Sigma^{-1} (\mathbf{x} - \mathbf{y})}$    (3.3)

where $\Sigma$ is the feature covariance matrix calculated from the whole sound file. This distance rotates and scales the feature space so that the contribution of each feature is equal. Other distance metrics, possibly using relative feature weighting, can also be used.

3. The derivative $d\Delta_t / dt$ of the distance signal is taken. The derivative of the distance will be low for slowly changing textures and high during sudden transitions. The peaks roughly correspond to texture changes.

4. Peaks are picked using simple heuristics and are used to create the segmentation of the signal into time regions. As a heuristic example, adaptive thresholding can be used. A minimum duration between successive peaks can be set to avoid small regions.
For the experiments described in Chapter 5, the peak-picking heuristic is parameterized by the desired number of peaks. More specifically, the following heuristic is used (a code sketch follows the list):

1. The peak with the maximum amplitude is picked.

2. A region around and including the peak is zeroed (this helps avoid counting the same peak twice). The size of the region is proportional to the size of the sound file divided by the number of desired regions (20% of the average region size).

3. Step 1 is repeated until the desired number of peaks is reached.
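The four stages and the peak-picking heuristic map directly onto a few lines of numerical code. The following is a minimal sketch assuming NumPy and a precomputed feature matrix V (one row per analysis window); it follows the same steps but is not the MARSYAS implementation.

```python
import numpy as np

def segment(V, n_peaks):
    """Multi-feature segmentation of a feature-vector trajectory.

    V       -- 2-D array of feature vectors, one row per analysis window
    n_peaks -- desired number of segmentation boundaries
    Returns the window indices of the detected texture changes.
    """
    # Stage 2: Mahalanobis distance between successive frames; the
    # covariance is estimated over the whole file (pseudo-inverse in
    # case it is singular).
    sigma_inv = np.linalg.pinv(np.cov(V, rowvar=False))
    diff = np.diff(V, axis=0)
    delta = np.sqrt(np.einsum('ij,jk,ik->i', diff, sigma_inv, diff))

    # Stage 3: derivative of the distance signal; its peaks roughly
    # correspond to texture changes.
    d_delta = np.diff(delta)

    # Stage 4: iteratively pick the largest peak, zeroing a region of
    # 20% of the average region size around it so the same peak is
    # not counted twice.
    guard = max(1, int(0.2 * len(d_delta) / n_peaks))
    signal = d_delta.astype(float).copy()
    peaks = []
    for _ in range(n_peaks):
        p = int(np.argmax(signal))
        peaks.append(p)
        signal[max(0, p - guard):p + guard + 1] = -np.inf
    return sorted(peaks)
```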
One important parameter of the proposed segmentation methodology is the collection over which the feature covariance matrix $\Sigma$ is calculated. The choice of collection makes the segmentation methodology context-dependent. For example, the same file might be segmented differently if a collection of orchestral music is used for the computation of the covariance matrix compared to a collection of rock music.
3.5.3 Segmentation Features
The following set of features has been used successfully for automatic segmentation and forms the basis of the experiments described in Chapter 5: means and variances (over a texture window) of Spectral Centroid, Rolloff, Flux, Zero Crossings, LowEnergy, and RMS. Loudness information, expressed using the RMS feature, is an important source of segmentation information that is usually ignored in classification-based segmentation algorithms, as classification has to be loudness-invariant.
3.6 Audio thumbnailing
Audio thumbnailing refers to the process of creating a short summary sound file from a large sound file in such a way that the summary best captures the essential elements of the original. It is similar to the concept of key frames in video and thumbnails in images. Audio thumbnailing is important for audio MIR, especially for the presentation of the returned ranked list, since it allows users to quickly hear the results of their query and make their selection. In [73], two methods of audio thumbnailing are explored: the first is based on clustering and the second on the use of Hidden Markov Models (HMMs). According to the user experiments described in [73], both methods perform better than random selection, with clustering the better of the two. In addition to the clustering method, a segmentation-based method is supported in our system. In this method, short segments around the segmentation boundaries are concatenated to form the summary. User experiments [128] have indicated that humans consider segmentation boundaries important for summarization.
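As an illustration, a segmentation-based thumbnail can be produced by the following sketch; the sample array, sample rate and boundary list are hypothetical inputs, with the boundaries coming from a segmentation such as the one in Section 3.5.

```python
import numpy as np

def thumbnail(audio, boundaries_sec, sr, snippet_sec=2.0):
    """Concatenate short snippets around segmentation boundaries.

    audio          -- 1-D array of samples
    boundaries_sec -- segmentation boundaries, in seconds
    sr             -- sample rate in Hz
    snippet_sec    -- length of the snippet taken around each boundary
    """
    half = int(snippet_sec * sr / 2)
    parts = [audio[max(0, int(b * sr) - half):int(b * sr) + half]
             for b in boundaries_sec]
    return np.concatenate(parts)
```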
3.7 Integration
Although the analysis algorithms were presented separately, their results can be integrated, resulting in improved performance. As already mentioned in the related work section of this chapter, many proposed segmentation algorithms require prior classification. The other direction is also possible. Most of the classification methods proposed in the literature report improved performance if the classification results are integrated over larger time windows. However, using fixed-size integration windows blurs the transition edges between classes. Usually, test data consists of files that do not contain transitions, to simplify the evaluation; therefore this problem does not show up. In real-world data, however, transitions exist and it is important to preserve them.
The described segmentation method provides a natural way of breaking up the data into regions based on texture. These regions can then be used to integrate classification results, for example using a majority filter. That way, sharp transitions are preserved and the classification performance is improved because of the integration. Initial experiments on a number of different sound files confirm this. A more detailed quantitative evaluation of how this method compares to fixed-window integration is planned for the future. Other examples of integration are the use of segmentation for thumbnailing, or the use of clustering in similarity retrieval to group the ranked list of similar files.
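As a sketch of this integration, frame-level classification decisions can be smoothed within texture regions by a majority filter; the helper below is illustrative rather than the system's actual code.

```python
import numpy as np

def majority_filter(frame_labels, boundaries):
    """Replace each region's frame labels by the region's majority label.

    frame_labels -- non-negative integer class label per analysis frame
    boundaries   -- frame indices of the segmentation boundaries
    Sharp transitions are preserved because the smoothing never
    crosses a region boundary.
    """
    labels = np.asarray(frame_labels).copy()
    edges = [0] + sorted(boundaries) + [len(labels)]
    for start, end in zip(edges[:-1], edges[1:]):
        region = labels[start:end]
        if len(region):
            region[:] = np.bincount(region).argmax()
    return labels
```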
3.8 Summary
A number of fundamental audio analysis algorithms were presented in this chapter. The basic idea behind these algorithms is to learn and utilize the structure of the feature vector representation in order to extract information about audio signals. Figure 3.3 shows a high-level overview of the basic ideas described in this chapter. The audio signal is represented as a trajectory of feature vectors by breaking it into small analysis windows and computing a feature vector for each window. Segmentation can be performed by looking for abrupt changes in this trajectory of feature vectors, and classification can be performed by partitioning the feature space into regions such that the points (vectors) falling in each partition belong to the same class.

Figure 3.3: Feature-based segmentation and classification (an analysis front-end such as FFT, LPC or MFCC, followed by feature calculation and feature-based classification and segmentation boundaries)
More specifically, the following analysis techniques were presented: content-based query-by-example similarity retrieval based on musical content features, hierarchical musical genre classification, a general multifeature segmentation methodology, and audio thumbnailing.
Chapter 4
Interaction
Thought is impossible without an image.
Aristotle, On the Soul, Book III

A Picture's Meaning Can Express Ten Thousand Words¹
Phony Chinese Proverb
(and thousands of songs)
4.1 Introduction
The results of automatic audio analysis are still far from perfect and therefore need to be corrected, edited, and enhanced by a human user. Therefore the creation of user interfaces for interacting with audio signals and collections is an important direction for current research. Even if in the future automatic algorithms are vastly improved, music and sound listening is inherently subjective and therefore, in my opinion, the human user will always be part of the system.

¹Correct translation of the phony Chinese proverb typically translated as "A picture is worth ten thousand words". The "Chinese" quotation was fabricated by an advertising executive representing a baking soda company, who assumed that consumers would be compelled to buy a product that had the weight of Chinese philosophy behind it: a young boy's smile is equal to many words explaining the benefits of the product. The ad was often seen as a streetcar card ad that customers did not have much time to read. It appeared in Printers' Ink (now Marketing/Communication), March 10, 1927 (pp. 114-115).
Successful user interfaces combine the skills of humans and computers, making effective use of their different abilities. User interfaces have been shown to provide significant improvements in almost every computing field, ranging from the popular windows-based desktop to mission-critical applications such as the design of NASA control rooms. Human Computer Interaction (HCI) studies the ways humans and computers interact and tries to improve them.
Computer audition research, in many cases originating from Electrical Engineering departments, has largely ignored user interface aspects. Although publications typically describe how the proposed algorithms can be used, they rarely show concrete examples of user interfaces that utilize their results. Such interfaces, in addition to demonstrating the proposed algorithms, are important for evaluation user experiments.
Currently the common way to interact with large collections of audio files is by using commercial software developed mainly for audio recording and production purposes. Such sound editing software tools typically have well-designed but complex interfaces and allow the user to apply standard montage-like editing operations (cut, copy, paste, mix, etc.), viewing operations (zoom in, zoom out, region select, etc.) and sound effects (filtering, denoising, pitch shifting, etc.) based on the traditional tape-recorder paradigm.
These current software tools for working with audio suffer from two major limitations. The first limitation is that they view audio as a monolithic block of digital samples, without any "understanding" of the specific audio content. For example, if users wish to locate the saxophone solo in a jazz piece, they have to locate the segment manually by listening, scrolling and zooming. Although Waveform and Spectrogram displays can provide visual cues about audio content, the process of locating regions of interest is still time consuming. The second limitation is that current audio software tools are centered around the concept of a file as the main unit of processing. Although typically multiple files can be opened and mixed, and batch operations can be applied to lists of files, there are no direct graphical user interfaces for displaying, browsing and editing large audio collections. Ideally these collection interfaces should also somehow reflect the content and similarity relations of their individual files.
In this chapter, a series of different novel content- and context-aware interfaces for interacting with audio signals and collections are proposed and integrated in a common application for audio browsing, manipulation, analysis and retrieval. These components can utilize either manually or automatically extracted information, or both. It is our belief that this final case of utilizing both kinds of information provides the most effective way of interacting with large audio collections. The use of automatic techniques can significantly reduce the amount of user involvement and can provide detailed information, while the human user can weed out the imperfections and make subjective judgments.
4.2 Related Work
Recently, various graphical user interfaces for browsing image collections have been proposed, spurred by the now commonplace availability of digital photography to computer users. Various spatial techniques and the use of thumbnail pictures have been proposed and shown to improve the browsing and retrieval of images [60, 57, 100]. Several academic and commercial systems have been proposed for content-based retrieval of images. Some representative examples are the Blobworld system [11] developed at UC Berkeley, the Photobook from MIT [91] and the QBIC system from IBM [37]. In these systems the user can search, browse and retrieve images based on similarity and various automatic feature extraction methods. This work falls under the general area of Multimedia Information Management Systems [51], with specific applications to audio signals.
In the context of audio signals, most of the existing work on browsing single files concentrates on the browsing of spoken documents. SpeechSkimmer [7] is a system for interactively skimming recorded speech that pushes audio interaction beyond the tape-recorder metaphor. The user can browse spoken documents by logical units such as words, sentences and changes of speaker. Time and pitch modification techniques can be used to speed up the speech signal without affecting intelligibility. Intelligent browsing of video signals of recorded meetings, based on Hidden Markov Model (HMM) analysis using audio and visual features, is described in [15]. Although there have been early attempts [18, 87] to develop "intelligent" sound editors, most of their research effort was spent addressing problems related to the limited hardware of that time. Today, commercially available audio editors are either small-scale editors that work with relatively short sound files, or professional tools for audio recording and production. In both cases they offer very little support for browsing large collections. Typically they support montage-like operations like copying, mixing, cutting and splicing, and some signal processing tools such as filtering, reverberation and flanging.
Many of the ideas described in this chapter have their roots in the field of Scientific and Information Visualization [120, 34]. Visualization techniques have been used in many scientific domains. They take advantage of the strong pattern recognition abilities of the human visual system to reveal similarities, patterns and correlations in space and time. Visualization is best suited to areas that are exploratory in nature and where there are large amounts of data to be analyzed. Interacting with large audio collections is a good example of such an area. The concept of browsing is central to the design of the interfaces for interacting with audio collections described in this chapter. Browsing is defined as "an exploratory, information seeking strategy that depends upon serendipity ... especially appropriate for ill-defined problems and exploring new task domains" [78]. The described components have been designed following Shneiderman's mantra for the design of direct manipulation and interactive visualization interfaces: "overview first, zoom and filter, then detail on demand" [114].
Direct manipulation systems visualize objects and actions of interest, support rapid, reversible, incremental actions, and replace complex command-language syntax by direct manipulation of the object of interest [114]. Direct sonification refers to immediate aural feedback to user actions. Examples of such direct manipulation systems include the popular desktop metaphor, computer-assisted-design tools, and video games. The main property of direct manipulation interfaces is the immediate aural and visual feedback in response to user actions.
Of central importance to this work is the idea of representing sounds as visual objects in a 2D or 3D space with properties related to the audio content and context. This idea has been used in psychoacoustics in order to construct perceptual spaces that visually show similarity relations between single notes of different musical instruments. These spaces can be constructed using Multidimensional Scaling (MDS) over data collected from user studies [50]. Another application that uses the idea of a space for audio browsing is the Intuitive Sound Editing Environment (ISEE), where 2D or 3D nested visual spaces are used to browse instrument sounds as experienced by musicians using MIDI synthesizers and samplers [138, 139].
The project most closely related to this work is the Sonic Browser, a tool for accessing sounds or collections of sounds using sound spatialization and context-overview visualization techniques [35, 36]. In that work, a prototype 2D graphics browser combined with multi-stream sonic browsing was developed and evaluated. One application of this browsing system is musicological research in Irish traditional music. Although the goals of this work and the Sonic Browser are similar, there are some differences in the two approaches. The main difference is that the Sonic Browser emphasizes user interaction and direct manipulation without utilizing any automatically extracted information about the signals. In Chapter 7 the combination of the Sonic Browser with our system to create a powerful sound browsing and editing environment will be described.
Visualizations of audio signals take advantage of the strong pattern recognition properties of the human visual system by mapping audio content and context information to visual representations. A visualization of audio signals based on self-similarity is proposed in [40]. Timbregrams, a content- and context-aware visualization for audio signals, were first introduced in [127]. TimbreSpaces, a visualization for audio collection browsing, and the GenreGram monitor, a dynamic content-based real-time display, were introduced in [126]. A more complete description of the proposed interfaces, with special emphasis on their use on the Princeton Display Wall, is provided in [131].
4.2.1 Beyond the Query-by-Example paradigm
As we have seen, in recent years techniques for audio and music information retrieval have started emerging as research prototypes. These systems can be classified into two major paradigms. In the first paradigm the user sings a melody and similar audio files containing that melody are retrieved. This approach is called "Query by Humming" (QBH) [47]. Unfortunately it has the disadvantage of being applicable only when the audio data is stored in symbolic form, such as MIDI files. The conversion of generic audio signals to symbolic form, called polyphonic transcription, is still an open research problem in its infancy. Another problem with QBH is that it is not applicable to several musical genres, such as Dance music, where there is no singable melody that can be used as a query. In the second paradigm, called "Query-by-Example" (QBE) [127], an audio file is used as the query and audio files that have similar content are returned, ranked by their similarity. In order to search and retrieve general audio signals, such as mp3 files on the web, only the QBE paradigm is currently applicable.
In this chapter, new ways of browsing and retrieving audio and musical signals from large collections that go beyond the QBE paradigm will be proposed. The developed algorithms and tools rely on the automatic extraction of content information from audio signals. The main idea behind this work is to create new audio signals, either by combination of other audio signals or synthetically based on various constraints. These generated audio signals are subsequently used to auralize queries and parts of the browsing space. The ultimate goal of this work is to lay the foundations for the creation of a musical "sketchpad" which will allow computer users to "sketch" the music they want to hear. Ideally we would like the audio equivalent of sketching a green circle over a brown rectangle and retrieving images of trees (something which is supported in current content-based image retrieval systems).
In addition to going beyond the QBE paradigm for audio and music information retrieval, this work differs from the majority of existing work in two ways: 1) continuous aural feedback, and 2) use of Computer Audition algorithms. Continuous aural feedback means that the user constantly hears audio or music that corresponds to her actions. Computer Audition techniques extract content information from audio signals that is used to configure the graphical user interfaces.
The term Query User Interfaces (QUI) will be used in this chapter to describe any interface that can be used to specify in some way audio and musical aspects of the desired query. Two major families of query interfaces will be described, based on the feedback they provide to the user. The first family consists of interfaces that directly utilize audio files in order to provide feedback, while the second family consists of interfaces that generate symbolic information in MIDI format. It is important to note that in both of these cases the goal is to retrieve from general audio and music collections, not symbolic representations.
Another important influence for this work is the legacy of systems for automatic music generation and style modeling. These systems typically fall into four major categories: generating music [13], assisting composition [83], modeling style, performance and/or composers [46], and automatic musical accompaniment [24]. These references are indicative and representative of early work in each category. There are also commercial systems that generate music according to a variety of musical styles, such as the Band-in-a-Box software: http://www.sonicspot.com/bandinabox/bandinabox.html
4.2.2 Contributions
Most of the content- and context-aware interfaces described in this chapter are new, although some extend or are based on older ideas. The most original contributions are the TimbreSpace browser and Timbregrams, which are visualizations for audio collections and signals that express audio content and context information in the visual domain.
4.3 Terminology
The following terms will be used to describe the proposed systems:

Collection: a list of audio files (typical size > 100 items)

Browser: an interface for interacting with a collection of audio files

Editor: an interface for interacting with a specific audio file

Monitor: a display that is updated in real time in response to the audio that is being played

Mapping: a particular setting of interface parameters

Selection: a list of selected audio files in a collection, or a specific region in time in an audio file

Viewer: a way to visualize a specific audio file

Timeline: a particular segmentation of an audio file into non-overlapping regions in time

View: a particular way of viewing a specific collection
The user interfaces described in this chapter follow a Model-View-Controller (MVC) framework [64]. The Model part comprises the actual underlying data and the operations that can be performed to manipulate it. The View part describes specific ways the data model can be displayed, and the Controller part describes how user input can be used to control the other two parts.
Collections are named lists of audio files. For example, a collection might be named Rock80s and consist of rock music files from that decade. A collection selection consists of an arbitrary subset of a collection. Typically a selection is specified by mouse operations. Semantic zooming refers to the process of zooming to a new collection consisting of only the selected objects. In the context of audio files, a selection refers to a specific region in time of the audio file. A timeline is a segmentation of an audio file into non-overlapping, possibly annotated regions in time. For example, a jazz piece might have regions for each individual instrument solo as well as the chorus. These concepts correspond to the Model part of the MVC framework.
The central idea behind the design of the described content- and context-aware GUIs is to map audio files to visual objects with specific properties related to their content, using viewers. Collections of audio files are mapped to spatial arrangements of visual objects in browsers, showing in that way their context. A mapping describes a specific way the audio files are mapped to visual objects and their spatial arrangement. Multiple views of the same audio collection can be displayed, possibly each with its own separate mapping. These concepts correspond to the View part of the MVC framework.
The Controller part of the MVC framework in this work utilizes standard keyboard and mouse operations to control widgets such as buttons, sliders and scrollbars, so there is no need to describe it in more detail. However, we note that separating this part from the Model and View components allows the easy incorporation of alternative methods of control, such as speech input or virtual reality sensors, into the system.
4.4 Browsers
4.4.1 TimbreSpace Browser
The TimbreSpace browser maps each audio file to an object in a 2D or 3D space. The main browser properties that can be mapped are the x, y, z coordinates of each object. In addition, a shape, a texture image or color, and a text annotation can be provided for each object. Standard graphical operations such as display zooming, panning, and rotating can be used to explore the browsing space. Data model operations such as selection pointing and semantic zooming are also supported. Selections can also be specified by placing constraints on the browser and object properties. For example, the user can ask to select all the files that have positive x values, triangular shape and red color. Principal Curves, originally proposed in [52] and used for sonification in [54], can be used to move sequentially through the objects.
TimbreSpaces can be constructed automatically based on the computation of audio features and the use of dimensionality reduction techniques such as Principal Components Analysis (PCA) [56]. PCA is described in more detail in Appendix A. Dimensionality reduction techniques map a high-dimensional set of feature vectors to a set of feature vectors of lower dimensionality with minimum loss of information. Using PCA and a single feature vector representation, each file is mapped to the x, y, z coordinates of a visual object. The mapping is achieved by taking the first three principal components (which in many cases describe most of the variability of the data collection) and mapping them to the unit cube. Another possibility for automatic configuration of TimbreSpace parameters is the direct use of feature values for each axis. For example, the x-axis might correspond to the estimated song tempo in bpm, the y-axis to its length and the z-axis to auditory brightness. Finally, the object colors and/or shapes can be automatically configured using classification and clustering algorithms.
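A minimal sketch of this automatic layout, using scikit-learn's PCA as a stand-in for the system's own implementation: each file's single feature vector is projected onto the first three principal components, which are then rescaled to the unit cube.

```python
import numpy as np
from sklearn.decomposition import PCA

def timbrespace_coords(features):
    """Map one feature vector per file to x, y, z positions in [0, 1]^3."""
    coords = PCA(n_components=3).fit_transform(features)
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (coords - lo) / span             # rescale to the unit cube
```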
Figure 4.1: 2D TimbreSpace of sound effects
Figure 4.1 shows a 2D TimbreSpace of sound effects. The icons represent different types of sound effects, such as walking (brown) and various other types of sound effects (white) such as tool, telephone and doorbell sounds. Although the icons have been assigned manually, the x, y coordinates of each icon are calculated automatically based on timbral texture features. In that way, files that are similar in content are visually clustered together, as can be seen from the figure, where the brown walking sounds occupy the left side and the white sounds occupy the right side.
Figure 4.2 shows a 3D TimbreSpace, also of a collection of sound effects; similarly to Figure 4.1, the walking sounds are colored dark. The walking sounds are automatically detected using K-means clustering.
Figure 4.3 shows a 3D TimbreSpace of different pieces of orchestral music. Each piece is represented as a colored rectangle. The x, y, z coordinates are automatically extracted based on music similarity, and the rectangle coloring is based on Timbregrams, which are described in Section 4.4.2. Both figures contain fewer objects than typical configurations, for clarity of presentation on paper. Audio collections have sizes typically ranging from 100 to 1000 files/objects.

Figure 4.2: 3D TimbreSpace of sound effects

Figure 4.3: 3D TimbreSpace of orchestral music
4.4.2 Timbregram viewer and browser
The basic idea behind Timbregrams is to map audio files to sequences of vertical color stripes where each stripe corresponds to a short slice of sound (typical sizes 20 milliseconds to 0.5 seconds). Time is mapped from left to right. The similarity of different files (context) is shown as overall color similarity, while the similarity within a file (content) is shown by color similarity within the Timbregram. For example, a file that has an ABA structure, where section A and section B have different sound textures, will have an ABA structure in color also. Although it is possible to create Timbregrams manually, typically they are created automatically using Principal Components Analysis (PCA) over automatically extracted feature vectors.
Two main approaches are used for mapping the principal components to color to create Timbregrams (a sketch of the second follows below). If an indexed image is desired, then the first principal component is divided equally and each interval is mapped to an index into a colormap. Any standard visualization colormap, such as Greyscale or Thermometer, can be used. This approach works especially well if the first principal component explains a large percentage of the variance of the data set. In the second approach, the first three principal components are normalized so that they have equal means and variances. Although this normalization distorts the original feature space, in practice it provides more satisfactory results, as the colors are more clearly separated. Each of the first three principal components is mapped to a coordinate in an RGB or HSV color space. In this approach a full-color Timbregram is created. There is a trade-off between the ability to show small-scale local structure and global overall similarity, depending on the quantization levels and the amount of variance allowed. For example, by allowing many quantization levels and large variance, different sections of the same audio file will be visually separated. If only an overall color is desired, fewer quantization steps and smaller variation should be used.
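A sketch of the second (full-color) approach, using NumPy and scikit-learn as stand-ins for the actual implementation: per-window feature vectors are projected onto the first three principal components, normalized to equal mean and variance, and mapped to RGB stripes.

```python
import numpy as np
from sklearn.decomposition import PCA

def timbregram_colors(window_features):
    """Map per-window feature vectors to one RGB stripe per window."""
    pcs = PCA(n_components=3).fit_transform(window_features)
    # Normalize the three components to equal mean and variance so the
    # colors separate clearly (this distorts the original feature space).
    pcs = (pcs - pcs.mean(axis=0)) / pcs.std(axis=0)
    # Clip to +/- 2 standard deviations and rescale into RGB [0, 1].
    return (np.clip(pcs, -2.0, 2.0) + 2.0) / 4.0
```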
It is important to note that in this case the similarity in color depends on the context, i.e. the collection over which PCA is calculated. That means that two files might have Timbregrams with similar colors as part of collection A and Timbregrams with different colors as part of collection B. For example, a string quartet and an orchestral piece will have different Timbregrams if viewed as part of a classical music collection, but similar Timbregrams if viewed as part of a collection of arbitrary music. This is in contrast to techniques that directly map frequency or timbral content to color, where the coloring of the file depends only on the file itself (content) and not on the larger collection it belongs to (context).
Figure 4.4: Timbregram superimposed over waveform
Timbregrams can be arranged in 2D tables for browsing. The table axes can be either computed automatically or created manually. For example, one axis might correspond to the year of release and the other to the tempo of the song. In addition, Timbregrams can be superimposed over traditional Waveform displays (see Figure 4.4) and texture-mapped over objects in a TimbreSpace (see Figure 4.3).
Figure 4.5: Timbregrams of music and speech
Figure 4.5 shows the Timbregrams of six sound files. The three files in the left column contain speech and the three on the right contain classical music. It is easy to visually separate music and speech even in the greyscale image. It should be noted that no explicit class model of music and speech is used; the different colors are a direct result of the visualization technique. The bottom right sound file (opera) is light purple and its speech segments are light green. In this mapping, light and bright colors correspond to speech or singing (Figure 4.5, left column), while purple and blue colors typically correspond to classical music (Figure 4.5, right column). Similarly, Figure 4.6 shows Timbregrams of Classical (left), Rock (middle), and Speech (right). Again the colors are a direct consequence of the visualization mapping and no class model is used.
Timbregrams of pieces of orchestral music can be viewed in Figure 4.7. From inspecting the figure visually, it is clear that the fourth piece from the top has an AB structure, where the A part is similar to the second piece from the top and the B part is similar to the last piece from the top. This is confirmed by listening to the corresponding pieces, where A is a loud, energetic moment in which the whole orchestra is playing and B is a lightly orchestrated flute solo.

Figure 4.6: Timbregrams of classical, rock and speech

Figure 4.7: Timbregrams of orchestral music
4.4.3 Other Browsers
The following browsers are based on existing graphical user interfaces and can be combined, sharing the same underlying data representation, in arbitrary ways with the previously described browsers. For example, selecting a file in a TimbreSpace can also be reflected in a Tree browser and auralized using a SoundSpace.
Standard
The Standard browser provides a scrollable list of filenames corresponding to the audio files of a collection. Standard single and multiple mouse selection are supported.
Tree
The Tree browser renders the audio signals of a collection based on a tree representation. This tree representation can be generated by automatic classification or created manually. For example, the first level of the tree can be Music vs Speech, and Speech can be further subdivided into Male vs Female, and so forth.
SoundSpace
The SoundSpace browser takes advantage of the cocktail-party effect in a similar fashion to the Sonic Browser system described in [36]. As the user browses, a neighborhood (aura) around the currently selected item is auralized by playing all the corresponding files simultaneously. When used with the Princeton Display Wall (see Section 4.10), each file of the aura is mapped to one of the 16 loudspeakers of the system and the intensity levels are normalized for better presentation. Although all 16 loudspeakers can be used, we have found that at most 8 simultaneous audio streams are useful. Using more advanced 3D audio rendering techniques to place files anywhere in the space is planned for the future. To shorten the playback time, equal-length audio thumbnails can be automatically generated for each file.

Figure 4.8: Enhanced Audio Editor I
4.5 Enhanced Audio Editor
The Enhanced Audio Editor looks like a typical tape-recorder-style waveform editor. However, in addition to the typical play, fast-forward, rewind and stop buttons, it allows skipping by either user-defined fixed-duration blocks or timelines containing regions of variable duration. These timelines can be either created by hand or generated using automatic audio segmentation.

Figure 4.9: Enhanced Audio Editor II
Skipping and annotating using regions is much faster than manual annotation, in the same way that finding a song on a CD is much faster than finding it on a tape.
The user can select a region and retrieve similar sounds. Another possibility is to classify the region using one of the available classification schemes, such as Music/Speech or Genre classification. Finally, each time region can be annotated with multiple keywords. In addition, the user can combine time regions to form a time tree that can be used for multi-resolution browsing and annotation. The tree captures the hierarchical nature of music pieces, and therefore can be used for musical analysis.
Figure 4.8 shows an automatically created segmentation of an audio file. The user can browse by segments and automatically classify them. Figure 4.9 shows a snapshot of the editor containing a Waveform display with a Timbregram superimposed, as well as a Spectrum display.

Figure 4.10: TimbreBall and GenreGram monitors
4.6 Monitors
4.6.1 GenreGram
The GenreGram monitor is a dynamic real-time audio display for showing automatic genre classification results. Although it can be used with any audio signals, it was designed for real-time classification of live radio signals. Each genre is represented as a cylinder that moves up and down in real time based on a classification confidence measure ranging from 0.0 to 1.0. Each cylinder is texture-mapped with a representative image for its genre. In addition to being a demonstration of real-time automatic musical genre classification, the GenreGram provides valuable feedback both to users and to algorithm designers. Different classification decisions and their relative strengths are combined visually, revealing correlations and classification patterns. Since in many cases the boundaries between genres are fuzzy, a display like this is more informative than a single classification decision. For example, both male speech and hiphop are activated in the case of a hiphop song, as shown in Figure 4.10. Of course it is possible to use GenreGrams to display other types of audio classification, such as instrument, sound effect, and birdsong classification.
4.6.2 TimbreBall monitor
This monitor is a tool for visualizing in real time the evolution of feature vectors extracted from an audio signal. In this animation each feature vector is mapped to the x, y, z coordinates of a small ball inside a cube. The ball moves in the space as the sound is playing, following the evolution of the corresponding feature vectors. Texture changes are visible as abrupt jumps in the trajectory of the ball. A shadow is provided for better depth perception (see Figure 4.10).
4.6.3 Other monitors
These monitors are based on existing visualization techniques for audio signals.
Waveform
Display of the signal's amplitude envelope, familiar from commercial editors.
Spectrum
Display that shows the magnitude in decibels (dB) of the Short Time Fourier Transform (STFT) of the signal. See Figure 4.9.
Spectrogram
The Spectrogram renders the same information as the Spectrum as a greyscale image. Time is mapped from left to right and the frequency amplitudes are mapped to greyscale values according to their magnitude. See Figure 4.4.
Waterfall
This monitor shows a waterfall-like cascade of Spectrum plots in 3D space.
Metronome
This monitor shows the tempo of the music in beats per minute (bpm) and a confidence level for how strong the beat is. The main beat and its strength are detected automatically.
FeaturePlot
This monitor shows a plot of a single user-specified feature.
4.7 Audio-based Query Interfaces
In audio and music information retrieval there is a query and a collection that is searched for similar files. The main idea behind audio-based QUIs is to utilize not only the query's audio but also the audio collection directly. Since the structure and content of the collection are already automatically or manually extracted for retrieval, this information can be used together with the audio to provide interactive aural feedback to the user. This idea will be clarified in the following sections with examples of audio-based QUIs.
4.7.1 Sound Sliders
Sound sliders are used to browse audio collections based on continuous attributes. For presentation purposes, assume that for each audio file in a collection the tempo and beat strength of the song are automatically extracted. For each of these two attributes a corresponding sound slider is created. For a particular setting of the sliders, the sound file with attributes corresponding to the slider values is selected and part of it is played in a loop. If more than one sound file corresponds to the particular slider values, the user can advance in circular fashion to the next sound file by pressing a button. One way to view this is that for each particular setting of slider values there is a corresponding list of sound files that have matching attributes. When the sliders are moved, the sound is crossfaded to a new list that corresponds to the new slider settings. The extraction of the continuous attributes and their sorting is performed automatically.

Figure 4.11: Sound sliders and palettes.
In current audio software, sliders are typically first adjusted, and then, by pressing a submit button, files that correspond to these parameters are retrieved. Unlike this traditional use of sliders for setting parameters, sound sliders provide continuous aural feedback that corresponds to the actions of the user (direct sonification). So, for example, when the user sets the tempo to 150 beats per minute (bpm) and beat strength to its highest value, there is immediate feedback about what the values represent, by hearing a correspondingly fast song with a strong rhythm. Another important aspect of sound sliders is that they are not independent: the aural feedback is influenced by the settings of all of them. Figure 4.11 shows a screenshot of the sound sliders used in our system.
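The selection logic behind sound sliders can be sketched as follows; the attribute names, normalization and tolerance are hypothetical illustrative choices, not the system's actual parameters.

```python
import numpy as np

def files_for_sliders(slider_values, attributes, tol=0.05):
    """Return indices of files whose attributes match the slider setting.

    slider_values -- current slider positions, e.g. (tempo, beat_strength),
                     normalized to [0, 1]
    attributes    -- 2-D array of the same attributes, one row per file,
                     normalized the same way
    tol           -- how close an attribute must be to count as a match
    The caller cycles through the returned list, crossfading whenever
    the sliders move to a new setting.
    """
    close = np.abs(attributes - np.asarray(slider_values)) < tol
    return np.flatnonzero(np.all(close, axis=1))
```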
4.7.2 Sound palettes
Sound palettes are similar to sound sliders but apply to browsing discrete attributes. A palette contains a fixed set of visual objects (text, images, shapes) that are arranged based on discrete attributes. At any time only one of the visual objects can be selected, by clicking on it with the mouse. For example, objects might be arranged in a table by genre and year of release. Continuous aural feedback similar to the sound sliders is supported. The bottom left corner of Figure 4.11 shows a content palette corresponding to musical genres. Sound palettes and sliders can be combined in arbitrary ways.
4.7.3 Loops
In the previously described query interfaces it is necessary to play back sound files continuously in a looping fashion. Several methods for looping are supported in the system. The simplest method is to loop the whole file, with crossfading at the loop point. The main problem with this method is that the file might be too long for browsing purposes. For files with a regular rhythm, automatic beat detection tools are used to extract loops typically corresponding to an integer number of beats. Another approach, which can be applied to arbitrary files, is to loop based on spectral similarity. In this approach the file is broken into short windows, and for each window a numerical representation of the main spectral characteristics is calculated. The file is then searched for windows that have similar spectral characteristics, and these windows can be used to achieve smooth looping points. Finally, automatic thumbnailing methods that utilize more high-level information can be used to extract representative short thumbnails.
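A sketch of the spectral-similarity approach, using NumPy's FFT; the window sizes and minimum loop length are hypothetical parameters, and a real system would also crossfade or phase-align the splice point.

```python
import numpy as np

def loop_points(audio, sr, win=2048, hop=1024, min_gap_sec=1.0):
    """Find two spectrally similar windows to use as loop points.

    Returns (start_sample, end_sample) of the smoothest loop found.
    """
    # Magnitude spectrum of every analysis window.
    n = (len(audio) - win) // hop
    spectra = np.array([np.abs(np.fft.rfft(audio[i * hop:i * hop + win]))
                        for i in range(n)])
    min_gap = int(min_gap_sec * sr / hop)  # enforce a minimum loop length
    best, best_dist = (0, n - 1), np.inf
    for i in range(n - min_gap):
        # Distance from window i to every sufficiently later window.
        d = np.linalg.norm(spectra[i + min_gap:] - spectra[i], axis=1)
        j = int(np.argmin(d))
        if d[j] < best_dist:
            best, best_dist = (i, i + min_gap + j), d[j]
    start, end = best
    return start * hop, end * hop
```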
4.7.4 Music Mosaics
Loop and thumbnail techniques can be used to create short representations of audio signals and collections. Another possibility for the representation of audio collections is the creation of Music Mosaics, which are pieces created by concatenating short segments of other pieces [111, 148]. For example, in order to represent a collection of songs by the Beatles, a music mosaic that would sound like Hey Jude could be created by concatenating small pieces of other Beatles songs. Time-stretching based on beat detection and overlap-add techniques can also be used in Music Mosaics.
Figure 4.12: 3D sound effects generators.
4.7.5 3D sound effects generators
In addition to digital music distribution, another area where large audio collections are utilized is libraries of digital sound effects. Most of the time, when we hear a door opening or a telephone ringing in a movie, the sounds were not actually recorded during the filming of the movie but are taken from libraries of prerecorded sound effects.
Searching libraries of digital sound effects poses new challenges to audio information retrieval. Of special interest are sound effects where the sound depends on the interaction of a human with a physical object. Some examples are the sound of walking on gravel, or the rolling of a can on a wooden table. In contrast to non-interactive sound effects such as a doorbell, the production of these sounds is closely tied to a mechanical motion.
In order to go beyond the QBE paradigm for content retrieval of sound effects, query generators that somehow model the human production of these sounds are desired. Towards this goal we have developed a number of interfaces with the common theme of providing a user-controlled animated object connected directly to the parameters of a synthesis algorithm for a particular class of sound effects. These 3D interfaces not only look similar to their real-world counterparts but also sound similar. This work is part of the Physically Oriented Library of Interactive Sound Effects (PhOLISE) project [19]. This project uses physical and physically motivated analysis and synthesis algorithms, such as modal synthesis, banded waveguides [33], and stochastic particle models [20], to provide interactive parametric models of real-world sound effects.
Figure 4.12 shows two such 3D sound effects query generators. The PuCola Can, shown on the right, is a 3D model of a soda can that can be slid across various surface textures. The sliding speed and texture material are controlled by the user and result in appropriate changes to the sound in real time. The Gear, on the left, is a real-time parametric synthesis of the sound of a turning wrench. Other developed generators are: a model of falling particles on a surface where parameters such as density, speed, and material are controlled, and a model of walking sounds where parameters such as gait, heel and toe material, weight and surface material are controlled.
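As a concrete illustration of the modal-synthesis family of algorithms mentioned above, the sketch below sums exponentially damped sinusoids; the mode frequencies, gains, and decay rates are invented placeholder values, not the parameters of the actual PhOLISE models.

#include <cmath>
#include <vector>

struct Mode { double freqHz, amp, decayPerSec; };

// Render one "hit" as a sum of exponentially damped sinusoids (modes).
std::vector<float> synthesizeHit(const std::vector<Mode>& modes,
                                 double seconds, double sr = 22050.0) {
    const double PI = 3.14159265358979;
    int n = (int)(seconds * sr);
    std::vector<float> out(n, 0.0f);
    for (const Mode& m : modes)
        for (int t = 0; t < n; ++t) {
            double time = t / sr;
            out[t] += (float)(m.amp * std::exp(-m.decayPerSec * time)
                              * std::sin(2.0 * PI * m.freqHz * time));
        }
    return out;
}

// Hypothetical example: three modes loosely evoking a small metal can.
// std::vector<Mode> can = {{520, 0.6, 6.0}, {1370, 0.3, 9.0}, {2900, 0.1, 14.0}};
// std::vector<float> sound = synthesizeHit(can, 1.0);

In an interface like the PuCola Can, user-controlled parameters such as sliding speed or surface material would continuously remap quantities like these mode gains and decay rates.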
4.8 MIDI-based Query Interfaces
In contrast to audio-based QUIs, MIDI-based QUIs do not utilize the audio collection directly but rather synthetically generate query audio signals from a symbolic representation (MIDI) that is created based on the user actions. The parameters used for the generation of this query and/or the automatically analyzed generated audio are used to retrieve similar audio pieces from a large collection. It is important to note that although MIDI is used for the generation of the queries, these interfaces are used for searching audio collections. These concepts will be clarified in the subsequent sections with concrete examples of MIDI-based QUIs.
Figure 4.13: Groovebox
4.8.1 The groovebox
The groovebox is similar to standard software drum machine emulators, where beat patterns are created using a rectangular grid of note values and drum sounds. Figure 4.13 shows a screenshot of a groove machine where the user can create a beat pattern or select it from a set of predefined patterns and speed it up and down. Sound files are retrieved based on the interface settings as well as by audio analysis (Beat Histograms [133]) of the generated beat pattern. Another possibility is to use beat analysis methods based on extracting specific sound events (such as bass drum and cymbal hits) and match each drum track separately. The possibility of automatically aligning the generated beat pattern with the retrieved audio track and playing both as the user edits the beat pattern is under development. The code for the groovebox is based on demonstration code for the JavaSound API.
Figure 4.14: Indian music style machine
4.8.2 The tonal knob
The tonal knob shows a circle of fifths, which is an ordering of the musical pitches so that harmonically related pitches are successively spaced in a circle. The user can select the tonal center and hear a chord progression in a specified musical style that establishes that tonic center. Pieces with the same tonal center are subsequently retrieved using Pitch Histograms [133].
4.8.3 Style machines
Style machines are closer to standard automatic music generation interfaces. For each particular style we are interested in modelling for retrieval, a set of sliders and buttons corresponding to various attributes of the style is used to generate an audio signal in real time. This signal is subsequently used as a query for retrieval purposes. Style attributes include tempo, density, key center, and instrumentation. Figure 4.14 shows a screenshot of Ragamatic, a style machine for Indian music.
Figure 4.15: Integration of interfaces
4.9 Integration
Although the described graphical user interface components were presented separately, they are all integrated following a Model-View-Controller user interface design framework [64]. The Model part comprises the actual underlying data and the operations that can be performed to manipulate it. The View part describes specific ways the data model can be displayed, and the Controller part describes how user input can be used to control the other two parts. By sharing the Model part, the graphical user interface components can affect the same underlying data. That way, for example, sound sliders, sound palettes, and style machines can all be used together to browse the same audio collection visualized as a Timbrespace and Timbregram table and edited using the Enhanced Audio Editor. Figure 4.15 shows a snapshot of one of the possible combinations of the described interfaces. A minimal sketch of this shared-Model arrangement follows.
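The sketch below uses hypothetical class names, not the actual implementation: every view observes the one underlying collection, so a selection made through any component is reflected in all of them.

#include <vector>

class View {                       // base class for Timbrespace, Timbregram, ...
public:
    virtual void refresh() = 0;    // redraw against the shared model
    virtual ~View() {}
};

class CollectionModel {            // the shared Model part
public:
    void attach(View* v) { views_.push_back(v); }
    void select(int fileIndex) {   // a Controller calls this on user input
        selected_ = fileIndex;
        for (View* v : views_) v->refresh();  // every attached view updates
    }
    int selected() const { return selected_; }
private:
    std::vector<View*> views_;
    int selected_ = -1;
};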
Of course, in a complete music information retrieval system, traditional graphical user interfaces for searching and retrieval such as keyword or metadata searching would also be used in addition to the described interfaces. These techniques and ideas are well known and therefore are not described in this chapter, but they are part of the developed system.
Although the description of the interfaces in this chapter has emphasized music retrieval, it is clear that such a system would also be useful for music creation. For example, sound designers, composers (especially those utilizing ideas from Musique Concrète and Granular Synthesis), and DJs would all benefit from novel ways of interacting with large collections of audio signals. It is also possible that many of the proposed ideas and interfaces can also be used for browsing and interacting with other types of temporal signals such as video and speech signals.
4.10 The Princeton Display Wall
Working with many audio files using the described interfaces on a desktop monitor is difficult because of the large number of windows and screen space required. In addition, the desktop monitor is not appropriate for interactive collaborations among multiple simultaneous users. To overcome this difficulty, Marsyas has been used as an application for the Princeton Scalable Display Wall (PDW) project. The Princeton Scalable Display Wall project explores building and using a large-format display with multiple commodity components.

Figure 4.16: Marsyas on the Display Wall

The prototype system has been operational since March 1998. It comprises an 8 x 18 foot rear-projection screen with a 4 x 2 array of projectors. Each projector is driven by a commodity PC with an off-the-shelf graphics accelerator. The multi-speaker display system has 16 speakers around the area in front of the wall and is controlled by a sound server PC that uses two 8-channel sound cards [22]. A more detailed description of the project can be found in [69]. This large immersive display allows for visual and aural presentation of detailed information for browsing and editing large audio collections and supports interactive collaborations among multiple simultaneous users. More information about the use of Marsyas on the PDW can be found in [131], and Figure 4.9 shows a variety of the developed interfaces on the PDW.
4.11 Summary
In this chapter, a variety of content- and context-aware user interfaces for browsing and interacting with audio signals and collections were presented. They are based on ideas from Visualization and Human Computer Interaction and utilize automatically extracted audio content information. Timbrespaces and Timbregrams are visualizations of collections and signals that map audio content to properties of visual objects. They can automatically adapt to audio data using automatic feature extraction and Principal Component Analysis (PCA). A series of interfaces for specifying audio queries that go beyond the traditional Query-by-Example paradigm was also presented.
Chapter 5

Evaluation

Knowledge must come through action; you can have no test which is not fanciful, save by trial.

Sophocles, Trachiniae

The first principle is that you must not fool yourself - and you are the easiest person to fool.

Richard Feynman, The Pleasure of Finding Things Out
5.1 Introduction
In the previous chapters a variety of ways for representing and analyzing audio signals and collections was presented. Evaluation of Computer Audition research, as of any type of research that attempts to model how humans process and understand sensory input, is difficult, as there is no single correct answer. Unlike other areas in Computer Science (for example sorting algorithms) where there is a well-defined answer to the problem being solved, answers in Computer Audition are usually fuzzy and subjective in nature. That observation does not mean that evaluation of Computer Audition is impossible; however, it requires carefully designed experiments that in many cases involve user studies. Another way to increase confidence in the evaluation results is to combine results from different experiments. For example, MPEG-based features will be compared to other proposed features both in the context of automatic classification and segmentation.

Another problem facing the evaluation of Computer Audition algorithms and systems is the relatively recent appearance of the field and the resulting lack of standardized evaluation data collections and results. In some cases of new types of analysis, such as segmentation and thumbnailing, it is not even clear how algorithms can be evaluated, and evaluation experiments have to be designed from scratch.
5.2 Related Work
In a sense all the work described in this section is a contribution, as it consists of evaluation results about the algorithms and systems that have been described. These results are calculated automatically and by conducting user studies. As there are no standardized datasets in Computer Audition, direct comparison with existing results is difficult in most cases; when it is possible to do so, direct and indirect comparisons will be provided. Therefore, in most cases the only way to evaluate and compare the performance of classifiers is by conducting empirical experiments.
5.2.1 Contributions
The basic contributions described in this chapter are the following: 1) the design of user experiments for evaluating segmentation into arbitrary sound "textures"; 2) evidence from the conducted user experiments that humans are consistent when segmenting audio, that their performance can be approximated automatically, and that providing an editable automatic segmentation does not bias the resulting segmentation; 3) MPEG-based features have comparable performance with other proposed feature sets for music/speech classification and segmentation; 4) DWT-based features have comparable performance with other proposed feature sets for audio classification; 5) effective similarity retrieval is possible using timbral and beat information features; 6) Beat Histograms can be used for tempo estimation and provide a good representation of rhythmic content.

The exact experiments, numbers, and details that support the previous statements will be described in the rest of this chapter.
Figure 5.1: Musical genres hierarchy (Music: Classical, Country, Disco, HipHop, Jazz, Rock, Blues, Reggae, Pop, Metal; Classical: Choir, Orchestra, Piano, String Quartet; Jazz: BigBand, Cool, Fusion, Piano, Quartet, Swing; Speech: Male, Female, Sports)
5.3 Collections
In order to train and evaluate statistical pattern recognition classifiers it is important to have large representative datasets. In this section the collections used to conduct the experiments described in this chapter are described.

Figure 5.1 shows the hierarchy of musical genres used for evaluation, augmented by a few (3) speech-related categories. In addition, a music/speech classifier similar to [110] has been implemented. For each of the 20 musical genres and 3 speech genres, 100 representative excerpts were used for training. Each excerpt was 30 seconds long, resulting in (23 * 100 * 30 seconds = 19 hours) of training audio data. To ensure variety of different recording qualities, the excerpts were taken from radio, compact disks, and mp3 compressed audio files. The files were stored as 22050 Hz, 16-bit, mono audio files. An effort was made to ensure that the training sets are representative of the corresponding musical genres. The Genres dataset has the following classes: Classical, Country, Disco, HipHop, Jazz, Rock, Blues, Reggae, Pop, Metal. The Classical dataset has the following classes: Choir, Orchestra, Piano, String Quartet. The Jazz dataset has the following classes: BigBand, Cool, Fusion, Piano, Quartet, Swing.

For the similarity retrieval experiment, a collection consisting of 1000 distinct 30-second rock song excerpts was used. The files were stored as 22050 Hz, 16-bit, mono audio files. In addition, collections of sound effects and isolated musical instrument tones have also been used to design and develop the proposed algorithms.
5.4 Classification evaluation
Although a labeled training set is used in training automatic classifiers, we are really interested in how the classifier performs when presented with data it has not encountered before, i.e., its generalization performance. It can be shown that if the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one learning or classification method over another (the No Free Lunch Theorem; see Chapter 9 of [30]). If one algorithm seems to outperform another in a particular situation, it is a consequence of its fit to the particular problem, not the general superiority of the algorithm. Therefore, in order to evaluate and compare different classifiers it is necessary to conduct empirical studies and experiments.
5.4.1 Cross-Validation
When evaluating an automatic classification method, there is typically a set of labeled data that, in addition to training, has to be used for testing. In simple validation the set of labeled samples is randomly split into two sets: one is used as the traditional training set for adjusting the model parameters of the classifier. The other set, called the validation set, is used to estimate the generalization error, which can be used to compare different classifiers. Obviously this error will depend on the particular split of the data into training and testing, and there is the danger of overfitting the training data.

In order to avoid this problem, a simple generalization of the above method is m-fold cross-validation. Here the training set is randomly divided into m disjoint sets of equal size n/m, where n is the total number of labeled patterns. The classifier is trained m times, each time with a different set held out as a validation set. The estimated performance is the mean of these m errors. In the limit where m = n, the method is called the leave-one-out approach.

A specific detail that is important when evaluating classifiers of audio files represented as trajectories of feature vectors is to never split feature vectors from the same file between training and testing. This is important since there is a good deal of frame-to-frame correlation in vectors of the same file, so splitting the file would give an incorrect estimate of classifier performance for truly novel data. The results presented in this chapter do not split same-file feature vectors and are based on 10-fold cross-validation with 100 iterations. A sketch of this file-aware fold construction is shown below.
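The following sketch (names and data layout are illustrative, not the actual Marsyas code) forms the folds over file ids, so that all feature vectors of a file land on one side of the split.

#include <algorithm>
#include <random>
#include <vector>

struct Sample { std::vector<float> features; int label; int fileId; };

// Returns, for each of the m folds, the list of file ids held out for testing.
std::vector<std::vector<int>> makeFolds(int numFiles, int m, unsigned seed) {
    std::vector<int> files(numFiles);
    for (int i = 0; i < numFiles; ++i) files[i] = i;
    std::mt19937 rng(seed);
    std::shuffle(files.begin(), files.end(), rng);
    std::vector<std::vector<int>> folds(m);
    for (int i = 0; i < numFiles; ++i) folds[i % m].push_back(files[i]);
    return folds;
}
// For fold k: train on every Sample whose fileId is not in folds[k] and test
// on the rest, so same-file frames never straddle the split; repeat with new
// seeds (e.g. 100 iterations) and average the fold errors.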
5.5 Segmentation experiments
Table 5.1: Free segmentation

           Human           Automatic
           Agreement       Fixed Budget    Best Effort
           AG       %      FB      %       BE      MX    %
Classic1   8/14     57     7/8     87      7/8     8     87
Classic2   7/11     63     5/7     71      6/7     14    85
Classic3   5/13     38     2/5     40      3/5     16    60
Jazz1      3/14     21     2/3     66      3/3     16    100
Jazz2      5/8      62     3/5     60      5/5     10    100
JazzRock   6/9      66     0/6     0       5/6     12    83
Pop1       5/6      83     4/5     80      5/5     10    100
Pop2       5/10     50     4/5     80      4/5     5     80
Radio1     4/10     40     3/4     75      4/4     10    100
Radio2     8/11     72     5/8     62      6/8     11    75
Total      56/106   55     35/56   62      48/56   11    87

Unlike classification, where there are well-established evaluation techniques, segmentation provides a challenge. Methods that rely on previous segmentation can use the classification annotations as the ground truth; however, segmentation of arbitrary textures is more tricky and required the design of new user experiments. The main questions the designed user experiments attempt to answer are:
- Are humans consistent when segmenting audio signals? In other words, is there a common agreement between different subjects about what constitutes a good segmentation?

- Can their performance be approximated automatically, and how accurate is this approximation?

- Does providing an editable automatic segmentation and allowing the users to modify it bias their segmentation results? The answer to this question is important because, if it is yes, then automatic segmentation cannot be directly used as a preprocessing step.
Two studies [125, 128] were conducted in order to address these questions. The purpose of the first experiment was to explore what humans do when asked to segment audio and to compare those results with the automatic segmentation method.
Table 5.2: 4 ± 1 segmentation

           Human          Automatic
           Agreement      Fixed Budget    Best Effort
           AG      %      FB      %       BE      MX    %
Classic1   4/6     66     0/4     0       4/4     10    100
Classic2   4/7     57     3/4     75      3/4     4     75
Classic3   4/7     57     2/4     50      2/4     4     50
Jazz1      3/11    27     2/3     66      3/3     16    100
Jazz2      5/6     83     3/5     60      5/5     10    100
JazzRock   5/7     71     1/5     20      4/5     12    80
Pop1       4/7     57     2/4     50      4/4     10    100
Pop2       5/7     71     4/5     80      4/5     5     80
Radio1     4/6     66     3/4     75      4/4     10    100
Radio2     5/7     71     2/5     40      4/5     10    80
Total      43/71   62     22/43   51      37/43   9     86
Table 5.3: 8 ± 2 segmentation

           Human           Automatic
           Agreement       Fixed Budget    Best Effort
           AG       %      FB      %       BE      MX    %
Classic1   8/14     57     7/8     87      7/8     8     87
Classic2   7/10     70     5/7     71      6/7     14    85
Classic3   9/11     81     6/9     66      6/9     9     66
Jazz1      4/15     26     3/4     75      3/4     4     75
Jazz2      5/11     45     3/5     60      5/5     10    100
JazzRock   7/9      77     3/7     42      5/7     12    71
Pop1       6/12     50     5/6     83      6/6     10    100
Pop2       8/13     61     4/8     50      5/8     16    62
Radio1     7/10     70     3/7     42      4/7     10    57
Radio2     9/11     81     5/9     55      6/9     11    66
Total      70/116   61     44/77   63      53/70   10    77
The data used for both studies consists of 10 sound files about 1 minute long. A variety of styles and textures is represented. In the first study, 10 subjects were asked to segment each sound file using standard audio editing tools in 3 ways. The first way, which we call free, is breaking up the file into any number of segments. The second and third ways constrain the users to a specific budget of total segments: 4 ± 1 and 8 ± 2.
The results of the first study are shown in Tables 5.1, 5.2 and 5.3. The segments that more than 5 of the 10 subjects agreed upon were used for comparison. The AG column shows the number of these salient segments compared to the total number of all segments marked by any subject. It is a measure of consistency between the subjects. For comparing the automatic method, a segment boundary was considered to be the same if it was within 0.5 sec of the average human boundary. This was based on the deviation of segment boundaries between subjects. FB (fixed budget) refers to automatic segmentation by requesting the same number of segments as the salient human segments. BE (best effort) refers to the best automatic segmentation achieved by incrementally increasing the number of regions up to a maximum of 16. MX is the number of segments necessary to achieve the best effort segmentation. The boundary matching criterion is sketched below.
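Illustrative code, not the code used for the study: an automatic boundary counts as a hit if it lies within ±0.5 sec of a human boundary, and each automatic boundary can match at most once.

#include <cmath>
#include <vector>

int countHits(const std::vector<double>& humanSecs,
              const std::vector<double>& autoSecs,
              double tolSecs = 0.5) {
    int hits = 0;
    std::vector<bool> used(autoSecs.size(), false);  // one match per boundary
    for (double h : humanSecs)
        for (size_t j = 0; j < autoSecs.size(); ++j)
            if (!used[j] && std::fabs(autoSecs[j] - h) <= tolSecs) {
                used[j] = true;
                ++hits;
                break;
            }
    return hits;  // e.g. 7 of 8 salient boundaries matched -> a "7/8" entry
}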
The results show that humans are consistent when segmenting audio (more than half of the segments are common for most of the subjects). In addition, they show that human segmentation can be approximated by automatic algorithms. The biggest problem seems to be the perceptual weighting of a texture change. For example, many errors involved soft speech entrances that were marked by all the subjects although they were not significant as changes in the computed feature space. The automatic segmentation results are usually perceptually justified and a superset of the human segmented regions. In the first study subjects were allowed to use the sound editor of their choice, therefore no timing information was collected.
In the second study [128] a new set of user experiments was conducted to re-confirm the results of [125] and to answer some additional questions. The main question we tried to answer was whether providing an automatic segmentation as a basis for editing would bias the resulting segmentation. In addition, information about the time required to segment and annotate audio was collected. The use of automatically segmented time lines can greatly accelerate the segmentation and annotation process. Unlike the first study [125], where participants could use a sound editor of their choice, in the second study they used a segmentation-annotation tool based on the Enhanced Sound Editor described in Chapter 4. The segmentation-annotation tool consists of a graphical user interface looking like a typical sound editor. Using a waveform display, arbitrary regions can be selected for playback, and annotated time lines can be loaded and saved. Each segmented region is colored differently, and the user can move forward and backward through those regions. In addition to the typical sound editor functionality, the system can automatically segment the audio to suggest a time line. The resulting regions can then be edited by adding/deleting boundaries until the desired segmentation is reached.
In Figure 5.2, histograms of subject agreement are shown. To calculate the histogram, all the segmentation marks were collected and partitioned into bins of agreement in the following manner: all the segment marks within ±0.5 sec were considered as corresponding to the same segment boundary. This value was calculated based on the differences between the exact location of the segment boundary between subjects and was confirmed by listening to the corresponding transitions. Since at most all ten subjects can contribute marks within this standard deviation, the maximum number of agreeing subjects is 10. In the figure, the different lines show the histogram of the experiments in [125] (old), the results of the second study (new), and the total histogram of both studies (total). The total histogram is calculated by considering the 20 subjects and dividing by two to normalize (notice that this is different from taking the average of the two histograms, since this is done on a boundary basis).

Figure 5.2: Histograms of subject agreement (A: free, B: 8 ± 2, C: 4 ± 1; percentage of total segment marks vs. number of agreeing subjects) and learning curve for segmentation and annotation (D: minutes vs. sound file order number)

Therefore, if in the two studies the boundaries had changed, indicating bias because of the automatically suggested segmentation, the shape of the total histogram would drastically change. The histograms show that there is a large percentage of agreement between subjects and that the segmentation results are not affected by the provision of an automatically suggested segmentation time line. Finally, the fact that the histograms of the old, new and total experiments have about the same shape in free, 8 ± 2 and 4 ± 1 suggests that constraining the users to a specific budget of segments does not significantly affect their agreement. The binning of marks into agreement counts is sketched below.
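A sketch of the binning step; the greedy clustering shown here is one plausible reading of the procedure described above, not necessarily the exact one used.

#include <algorithm>
#include <vector>

// Pool the marks of all subjects, cluster marks within the tolerance into
// one boundary, and return how many marks (subjects) each boundary got.
std::vector<int> agreementCounts(std::vector<double> marks, double tol = 0.5) {
    std::sort(marks.begin(), marks.end());
    std::vector<int> counts;                 // one entry per distinct boundary
    size_t i = 0;
    while (i < marks.size()) {
        double anchor = marks[i];
        int n = 0;
        while (i < marks.size() && marks[i] - anchor <= tol) { ++n; ++i; }
        counts.push_back(n);                 // at most 10 with 10 subjects
    }
    return counts;  // the histogram is the frequency of each agreement level
}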
As a metric of subject agreement we define the percentage of the total segmentation marks that more than 5 of the 10 subjects agreed upon. This number can be calculated by integrating the histograms from 6 to 10. For the experiments in [125] this metric gives 79%, for this study 76%, and for the combined total 73%. In [125], 87% of the segments that more than half the subjects agreed upon were in the set of the best effort automatic segmentation. Moreover, for this study (new), 70% of the total segment marks by all subjects were retained from the best effort automatically suggested segmentation. All these numbers are for the case of free segmentation.
The mean and standard deviation of the time it took to complete (segment and annotate) a sound file with a duration of about 1 minute was 13 ± 4 minutes. This result was calculated using only the free segmentation timing information, because the other cases are much faster due to familiarity with the sound file and the reusability of segmentation information from the free case. The lower right sub-figure of Figure 5.2 shows the average time per sound file in order of processing. The order of processing was random, therefore the figure indicates there is a significant learning curve for the task. This happens despite the fact that an initial sound file that was not timed was used to familiarize the users with the interface. Therefore the actual mean time is probably lower (about 10 minutes) for an experienced user.
5.6 MPEG-based features
The performance of features calculated from MPEG-compressed data is evaluated and compared with other feature sets in the context of automatic music/speech classification and segmentation.
5.6.1 Music/Speech classification
The data used for evaluating the music/speech classification consists of about 2 hours of audio data. There are 45 minutes of speech, 45 minutes of music, and about 30 minutes of mixed audio. Radio, live recordings of speech, compact disks and movies, representing a variety of speakers and music styles, were used as data sources. The results are calculated using 10-fold cross-validation as described in Section 5.4.1. The results are calculated on a frame basis without any integration. Further improvements in classification performance can be achieved by integrating the results. For integration, the region detected by the segmentation algorithm can be used.
Figure 5.3: Increase in classification performance as features are added (% correct classifications vs. feature number; 1, 2, 3 and 4, 5, 6 are the means and variances of MPEG Centroid, Rolloff and Flux; Low Energy is 7)
Figure 5.3 shows the improvement in classification performance with the gradual addition of features. The order used is mean Centroid, mean Rolloff, mean Flux, variance Centroid, variance Rolloff, variance Flux and Low Energy. Table 5.4 compares the performance of the MPEG-based classification with PCM-based systems that use Spectral Shape Features calculated using the STFT.

Table 5.4: Music/Speech percentages of frame-based classification performance

         GMAP           K-NN(1)        K-NN(5)
MPEG     82.2 ± 2.1%    84.6 ± 4.4%    86.2 ± 3.8%
PCM      84.8 ± 1.4%    89.5 ± 2.0%    90.0 ± 1.2%
Different Data Set
OTHER1   94.0 ± 2.6%    94.2 ± 3.6%    95.7 ± 3.5%
Different Data Set & Classifiers
OTHER2   77.1%          90.0%          94.2%

Table 5.5: Comparison of segmentation algorithms

          Free           4 ± 1          8 ± 2
MPEG FB   55% (31/56)    58% (25/43)    54% (38/70)
PCM FB    62% (35/56)    51% (22/43)    62% (44/70)
MPEG BE   71% (40/56)    74% (32/43)    65% (46/70)
PCM BE    85% (48/56)    86% (37/43)    75% (53/70)

For the PCM line, the same dataset and a similar set of features calculated using the short time Fourier Transform were used. The OTHER1 line is based on results by [110] and uses a different dataset. The OTHER2 line is based on results by [102] and uses a different dataset and different classifiers. The results indicate that the proposed features can be used for classification without significant decrease in accuracy. These results are calculated using a GMAP classifier. Because the main purpose is to compare the MPEG feature set with other feature sets, the particular choice of classifier is not important. The spectral shape features themselves are sketched below.
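For reference, the spectral shape features compared in this section can be computed from a magnitude spectrum roughly as follows; these are the standard textbook formulations and may differ in detail from the thesis implementation.

#include <vector>

double centroid(const std::vector<double>& mag) {   // spectral "brightness"
    double num = 0, den = 0;
    for (size_t k = 0; k < mag.size(); ++k) { num += k * mag[k]; den += mag[k]; }
    return den > 0 ? num / den : 0;
}

double rolloff(const std::vector<double>& mag, double frac = 0.85) {
    double total = 0;                 // bin below which frac of the energy lies
    for (double m : mag) total += m;
    double sum = 0;
    for (size_t k = 0; k < mag.size(); ++k) {
        sum += mag[k];
        if (sum >= frac * total) return (double)k;
    }
    return (double)mag.size() - 1;
}

double flux(const std::vector<double>& mag, const std::vector<double>& prev) {
    double f = 0;                     // frame-to-frame spectral change
    for (size_t k = 0; k < mag.size(); ++k) {
        double d = mag[k] - prev[k];
        f += d * d;
    }
    return f;
}
// Low Energy can be computed as the fraction of analysis windows whose RMS
// energy is below the mean RMS energy over the surrounding texture window.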
5.6.2 Segmentation evaluation
The comparison of MPEG-based features to PCM-based features for automatic segmentation is based on the experiments described in Section 5.5. In Table 5.5 the performance of the MPEG-based segmentation algorithm is compared with a similar algorithm based on short time Fourier Transform analysis. The table shows the number of regions that were marked by humans that were captured by the system. FB (fixed budget) refers to automatic segmentation by requesting the same number of segments as the salient human segments. BE (best effort) refers to the best automatic segmentation achieved by incrementally increasing the number of regions up to a maximum of 16. Although the performance is lower than the PCM-based algorithm, still a large number of segmentation boundaries are captured by the algorithm. Most of the cases where the algorithm missed boundaries compared to the PCM-based algorithm were in sound files containing rock or jazz music. The reason is that the PCM-based algorithm uses time domain zero crossings that capture the transition from noise-like signals to more harmonic signals.
Figure 5.4: DWT-based feature evaluation based on audio classification
5.7 DWT-based features
The features based on the DWT transform were evaluated in the context of audio classification, and more specifically using the following classification schemes: Music/Speech, Classical Genres, and Voices. Figure 5.4 and Table 5.6 compare the classification results between FFT-based, MFCC, and DWT-based features. These results are calculated using a GMAP classifier. Because the main purpose is to compare the performance of the DWT-based feature set with other feature sets, the particular choice of classifier is not important. A sketch of one possible DWT-based feature computation is given after Table 5.6.

Table 5.6: DWT-based feature evaluation based on audio classification

         Music/Speech   Classical   Voices
Random   50             25          33
FFT      87             52          64
MFCC     89             61          72
DWTC     82             53          70
Figure 5.5: Web-based Retrieval User Evaluation
5.8 Similarity retrieval evaluation
In order to perform similarity retrieval user studies, a Web evaluation infrastructure was developed. Figure 5.5 shows a sample page for evaluating the results of a query. A user survey of similarity retrieval was conducted using the Rock dataset (1000 30-second rock songs). The large size of the dataset made the calculation of recall difficult, so only precision was examined. Because the 30-second snippets used for the evaluation did not have many texture changes, the single vector approach was used. The evaluation was done by 7 subjects who gave a relevance judgement from 1 (worst) to 5 (best). There were twelve queries, five matches returned, and three algorithms (random, only beat detection, beat and texture), giving a total of 7 * 5 * 12 * 3 = 1260 data points. The beat detection was done using the algorithm described in [108]. In this algorithm, a filterbank is coupled with a network of comb filters that track the signal periodicities to provide an estimate of the main beat and its strength. The full retrieval was done using both beat detection and spectral features.

Figure 5.6: Results of User Evaluation (mean relevance, on a 1-5 scale, for the Random, BPM and Full retrieval methods)
Figure 5.6 shows the mean and standard deviation of the three retrieval methods. The standard deviation is due to the different nature of the queries and subject differences and is about the same for all algorithms. Although it is clear that the system performs better than random and that the full approach is slightly better than using only beat detection, more work needs to be done to improve the scores. Significantly better results are achieved using the full musical content feature sets that include Beat and Pitch Histogram features.
Unfortunately, this study was conducted before the design of these new features. A similar evaluation study using the newly proposed features is planned for the future.
5.9 Automatic Musical Genre Classification
Because this is one of the first published works that deals seriously with the problem of automatic musical genre classification, a variety of experimental results will be presented.
5.9.1 Results
Table 5.7 shows the classification accuracy percentage results of different classifiers and musical genre datasets. With the exception of the RT GS row, these results have been computed using a single vector to represent the whole audio file. The vector consists of the timbral texture features (9 (FFT) + 10 (MFCC) = 19 dimensions), the rhythmic content features (6 dimensions), and the pitch content features (5 dimensions), resulting in a 30-dimensional feature vector. In order to compute a single timbral-texture vector for the whole file, the mean feature vector over the whole file is used.

Table 5.7: Classification accuracy mean and standard deviation

          Genres (10)   Classical (4)   Jazz (6)
Random    10            25              16
RT GS     44 ± 2        61 ± 3          53 ± 4
GS        59 ± 4        77 ± 6          61 ± 8
GMM(2)    60 ± 4        81 ± 5          66 ± 7
GMM(3)    61 ± 4        88 ± 4          68 ± 7
GMM(4)    61 ± 4        88 ± 5          62 ± 6
GMM(5)    61 ± 4        88 ± 5          59 ± 6
KNN(1)    59 ± 4        77 ± 7          57 ± 6
KNN(3)    60 ± 4        78 ± 6          58 ± 7
KNN(5)    56 ± 3        70 ± 6          56 ± 6
The row RT GS shows classification accuracy percentage results for real-time classification per frame using only the timbral texture feature set (19 dimensions). In this case each file is represented by a time series of feature vectors, one for each analysis window. Frames from the same audio file are never split between training and testing data in order to avoid falsely higher accuracy due to the similarity of feature vectors from the same file. A comparison of random classification, real-time features, and whole-file features is shown in Figure 5.7. The data for creating this bar graph corresponds to the Random, RT GS, and GMM(3) rows of Table 5.7.
The classification results are calculated using 10-fold cross-validation, where the dataset to be evaluated is randomly partitioned so that 10% is used for testing and 90% is used for training. The process is iterated with different random partitions and the results are averaged (for Table 5.7, 100 iterations were performed). This ensures that the calculated accuracy will not be biased by a particular partitioning of training and testing data. If the datasets are representative of the corresponding musical genres, then these results are also indicative of the classification performance with real-world unknown signals. The ± part shows the standard deviation of classification accuracy over the iterations. The row labeled Random corresponds to the classification accuracy of a chance guess.
The additional music/speech classification has 86% accuracy (random would be 50%), and the voices classification (male, female, sports announcing) has 74% (random 33%). Sports announcing refers to any type of speech over a very noisy background. The STFT-based feature set is used for the music/speech classification and the MFCC-based feature set is used for the speech classification.
Confusion matrices

Figure 5.7: Classification accuracy percentages for the Genres, Classical and Jazz datasets (RND = random, RT = real time, WF = whole file)

Table 5.9 shows more detailed information about the musical genre classifier performance in the form of a confusion matrix. In a confusion matrix, the columns correspond to the actual genre and the rows to the predicted genre. For example, the cell of row 5, column 1 with value 26 means that 26% of the Classical music (column 1) was wrongly classified as Jazz music (row 5). The percentages of correct classification lie on the diagonal of the confusion matrix. The confusion matrix shows that the misclassifications of the system are similar to what a human would do. For example, Classical music is misclassified as Jazz music for pieces with strong rhythm from composers like Leonard Bernstein and George Gershwin. Rock music has the worst classification accuracy and is easily confused with other genres, which is expected because of its broad nature. A sketch of how such a matrix is accumulated follows.
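Illustrative code, not the thesis implementation: rows are the predicted class, columns the actual class, and each column is normalized to percentages as in the tables that follow.

#include <vector>

std::vector<std::vector<double>>
confusion(const std::vector<int>& actual, const std::vector<int>& predicted,
          int numClasses) {
    std::vector<std::vector<double>> cm(numClasses,
                                        std::vector<double>(numClasses, 0));
    std::vector<int> colTotals(numClasses, 0);
    for (size_t i = 0; i < actual.size(); ++i) {
        cm[predicted[i]][actual[i]] += 1;    // row = predicted, col = actual
        colTotals[actual[i]] += 1;
    }
    for (int r = 0; r < numClasses; ++r)     // normalize each column to percent
        for (int c = 0; c < numClasses; ++c)
            if (colTotals[c] > 0) cm[r][c] = 100.0 * cm[r][c] / colTotals[c];
    return cm;
}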
Table 5.8: Real-time classification accuracy percentage results

           Genres (10)   Classical (4)   Jazz (6)
Random     10            25              16
Gaussian   54 ± 4        77 ± 7          62 ± 8
GMM(3)     57 ± 4        85 ± 6          70 ± 7

Table 5.9: Genre Confusion Matrix

     cl   co   di   hi   ja   ro   bl   re   po   me
cl   69    0    0    0    1    0    0    0    0    0
co    0   53    2    0    5    8    6    4    2    0
di    0    8   52   11    0   13   14    5    9    6
hi    0    3   18   64    1    6    3   26    7    6
ja   26    4    0    0   75    8    7    1    2    1
ro    5   13    4    1    9   40   14    1    7   33
bl    0    7    0    1    3    4   43    1    0    0
re    0    9   10   18    2   12   11   59    7    1
po    0    2   14    5    3    5    0    3   66    0
me    0    1    0    1    0    4    2    0    0   53

Tables 5.10 and 5.11 show the confusion matrices for the Classical and Jazz genre datasets. In the Classical genre dataset, Orchestral music is mostly misclassified as String Quartet. As can be seen from the confusion matrix (Table 5.10), Jazz genres are mostly misclassified as Fusion. This is due to the fact that Fusion is a broad category that exhibits large variability of feature values. Jazz Quartet seems to be a particularly difficult genre to correctly classify using the proposed features (it is mostly misclassified as Cool and Fusion).
Importance of texture window size
Table 5.10: Jazz Confusion Matrix

        BBand   Cool   Fus.   Piano   4tet   Swing
BBand   42      2      1      0       6      1
Cool    21      67     5      4       23     10
Fus.    28      16     88     0       38     22
Piano   1       0      0      80      0      0
4tet    4       5      2      0       19     5
Swing   4       10     4      16      14     62

Table 5.11: Classical Confusion Matrix

           Choir   Orch.   Piano   Str.4tet
Choir      99      7       7       3
Orch.      0       58      2       7
Piano      0       9       86      4
Str.4tet   1       26      5       86

Figure 5.8 shows how changing the size of the texture window affects the classification performance. It can be seen that the use of a texture window significantly increases the classification accuracy. The value of zero analysis windows corresponds to using directly the features computed from the analysis window. After approximately 40 analysis windows (1 second), subsequent increases in texture window size do not improve classification, as they do not provide any additional statistical information. Based on this plot, the value of 40 analysis windows was chosen as the texture window size. The timbral-texture feature set (STFT and MFCC) for the whole file and a single Gaussian classifier (GS) were used for the creation of Figure 5.8. A running-statistics sketch of the texture window is shown below.

Figure 5.8: Effect of texture window size (number of analysis windows) on classification accuracy
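The texture window can be implemented as a circular buffer of recent feature values over which running means and variances are computed; the sketch below is illustrative (names are hypothetical), with W = 40 analysis windows per the plot.

#include <vector>

class TextureWindow {
public:
    explicit TextureWindow(size_t w = 40) : buf_(w), size_(0), next_(0) {}
    void push(double feature) {              // add the newest analysis window
        buf_[next_] = feature;
        next_ = (next_ + 1) % buf_.size();
        if (size_ < buf_.size()) ++size_;
    }
    double mean() const {                    // running mean over the window
        double s = 0;
        for (size_t i = 0; i < size_; ++i) s += buf_[i];
        return size_ ? s / size_ : 0;
    }
    double variance() const {                // running variance over the window
        double m = mean(), s = 0;
        for (size_t i = 0; i < size_; ++i) s += (buf_[i] - m) * (buf_[i] - m);
        return size_ ? s / size_ : 0;
    }
private:
    std::vector<double> buf_;                // last W feature values
    size_t size_, next_;
};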
Importance of individual feature sets
Table 5.12 shows the individual importance of the proposed feature sets for the task of automatic musical genre classification. As can be seen, the non-timbral-texture features (Pitch Histogram Features (PHF) and Beat Histogram Features (BHF)) perform worse than the timbral-texture features (STFT, MFCC) in all cases. However, in all cases the proposed feature sets perform better than random classification and therefore provide some information about musical genre, and therefore about musical content in general. The last row of Table 5.12 corresponds to the full combined feature set and the first row corresponds to random classification. The number in parentheses beside each feature set denotes the number of individual features for that particular feature set. The results of Table 5.12 were calculated using a single Gaussian classifier (GS) using the whole-file approach.

Table 5.12: Individual feature set importance

            Genres   Classical   Jazz
RND         10       25          16
PHF(5)      23       40          26
BHF(6)      28       39          31
STFT(9)     45       78          58
MFCC(10)    47       61          56
FULL(30)    59       77          61
The classification accuracy of the combined feature set, in some cases, is not significantly increased compared to the individual feature set classification accuracies. This fact does not necessarily imply that the features are correlated or do not contain useful information, because it can be the case that a specific file is correctly classified by two different feature sets that contain different and uncorrelated feature information. In addition, although certain individual features are correlated, the addition of each specific feature improves classification accuracy. The rhythmic and pitch content feature sets seem to play a less important role in the Classical and Jazz dataset classification compared to the Genre dataset. This is an indication that genre-specific feature sets may need to be designed for more detailed subgenre classification.
Table 5.13: Best individual features

               Genres
BHF.SUM        20
PHF.FP0        23
STFT.VCTRD     29
MFCC.MMFCC1    25

Table 5.13 shows the best individual features for each feature set. These are the sum of the Beat Histogram (BHF.SUM), the period of the first peak of the Folded Pitch Histogram (PHF.FP0), the variance of the Spectral Centroid over the texture window (STFT.VCTRD), and the mean of the first MFCC coefficient over the texture window (MFCC.MMFCC1).
5.9.2 Human performance for genre classification
The performance of humans in classifying musical genre has been investigated in [92, 30]. Using a ten-way forced-choice paradigm, college students were able to judge genre accurately 53% of the time after listening to only 250-millisecond samples and 70% of the time after listening to three seconds (chance would be 10%). Listening to more than 3 seconds did not improve their performance. The subjects were trained using representative samples from each genre. The ten genres used in this study were: Blues, Country, Classical, Dance, Jazz, Latin, Pop, R&B, Rap and Rock. Although direct comparison of these results with the automatic musical genre classification results is not possible due to different genres and datasets, it is clear that the automatic performance is not far away from the human performance. Moreover, these results indicate the fuzzy nature of musical genre boundaries.
5.10 Other experimental results
In this section some additional experimental results collected in the course of the previously described user studies are presented.
5.10.1 Audio Thumbnailing
An additional component of the annotation tasks of [125] was that of "thumbnailing." After doing the free, 8 ± 2, and 4 ± 1 segmenting tasks, the subjects were instructed to note the begin and end times of a two-second thumbnail segment of audio that best represented each section of their free segmentations.

Inspection of the 545 total user thumbnail selections reveals that 62% of them were chosen to be the first two seconds of a segment, and 92% of them were a selection of two seconds within the first five seconds of the segment. This implies that a machine algorithm which can perform segmentation could also do a reasonable job of matching human performance on thumbnailing. By simply using the first five seconds of each segment as the thumbnail, and combining with the results of the best-effort machine segmentation (87% match with human segments), a set containing 80% "correct" thumbnails could be automatically constructed.
5.10.2 Annotation
Some work on verbal cues for sound retrieval [140] has shown that humans tend to describe isolated sounds by source type (what it is), situation (how it is made), and onomatopoeia (what it sounds like). Text labeling of segments of a continuous sound stream might be expected to introduce different description types, however. In the conducted experiments, a preliminary investigation of semantic annotations of sound segments was conducted. While doing the segmentation tasks, subjects were instructed to "write a short (2-8 words) description of the section..." The annotations from the free segmentations were inspected by sorting the words by frequency of occurrence.
The average annotation length was 4 words, resulting in a total of 2200 meaningful words (words like of, and, etc. were removed) and 620 unique words. Of these, only 100 words occur 5 or more times, and these represent 64% of the total word count. Of these "top 100" words, 37 are literal descriptions of the dominant source of sound (piano, female, strings, horns, guitar, synthesizer, etc.), and these make up almost 25% of the total words used.
Even though only 5 of the 20 subjects could be considered professional composers/musicians, the next most popular word type could be classed as music-theoretical structural descriptions (melody, verse, sequence, tune, break, head, phrase, etc.). 29 of the top 100 words were of this type, and they represent 23% of the total words used.
Another significant category of words used corresponded to basic acoustic parameters (soft, loud, slow, fast, low, high, build, crescendo, increase, etc.). Most of such parameters are easy to calculate from the signal. 12 of these words represented about 10% of the total words used.
These preliminary findings indicate that with suitable algorithms determining basic acoustical parameters (mostly possible today), the perceptually dominant sound source type (somewhat possible today), and music-theoretical structural aspects of sound segments (much algorithmic work still to be done), machine labeling of a fair number of segments (60%) would be possible.
5.10.3 Beat studies
A number of synthetic beat patterns were constructed to evaluate the Beat Histogram calculation. Bass drum, snare, hi-hat and cymbal sounds were used, with the addition of woodblock and three tom-tom sounds for the unison examples. The simple unison beats versus cascaded beat patterns are used to check for phase sensitivity in the algorithm. Simple and complex beat patterns are used to gauge the possibility of finding the dominant peak timing. For this purpose, the beat patterns have been synthesized at 5 different speeds. The simple rhythm pattern has a simple rhythmic substructure with single-beat regularity, whereas the complex pattern shows a complex rhythmic substructure across and within the single instrument sections. Figure 5.9 shows the synthetic beat patterns that were used for evaluation in tablature notation. In all the cases the algorithm detected the basic beat of the pattern as a prominent peak in the Beat Histogram.

Figure 5.9: Synthetic beat patterns (CY = cymbal, HH = hi-hat, SD = snare drum, BD = bass drum, WB = woodblock, HT/MT/LT = tom-toms)
In addition to the synthetic beat patterns, the algorithm was applied to real-world music signals. To evaluate the algorithm's performance, it was compared to the beats per minute (bpm) detected manually by tapping the mouse with the music. The average time difference between the taps was used as the manual beat estimate (see the sketch below). Twenty files containing a variety of music styles were used to evaluate the algorithm (5 HipHop, 3 Rock, 6 Jazz, 1 Blues, 3 Classical, 2 World). For most of the files (13/20) the prominent beat was detected clearly (i.e., the beat corresponded to the highest peak of the histogram). For 5/20 files the beat was detected as a histogram peak but it was not the highest, and for 2/20 no peak corresponding to the main beat was found. In the pieces where the beat was not detected there was no dominant periodicity (either classical or free jazz). In such cases humans rely on more high-level information like grouping, melody and harmonic progression to perceive the primary beat from the interplay of multiple periodicities.
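The manual estimate reduces to converting the mean inter-tap interval to beats per minute, as in this small sketch (illustrative, not the study's code):

#include <vector>

double tappedBpm(const std::vector<double>& tapTimesSec) {
    if (tapTimesSec.size() < 2) return 0.0;
    double span = tapTimesSec.back() - tapTimesSec.front();
    double meanInterval = span / (tapTimesSec.size() - 1);  // avg gap in sec
    return 60.0 / meanInterval;  // e.g. taps 0.5 s apart -> 120 bpm
}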
5.11 Summary
A series of experimental results regarding the evaluation of the proposed Computer Audition algorithms and systems was described. Using a newly designed user study for segmentation evaluation, it was shown that humans are consistent when segmenting audio and that their results can be approximated automatically. In addition, the results of a user study evaluating similarity retrieval show that spectral and beat information are important for music retrieval. It was shown that automatic musical genre classification is possible with results that are not far away from the performance reported for human subjects. Finally, evidence was presented that Beat Histograms provide a good representation of rhythmic content.
Chapter 6

Implementation

In the development of the understanding of complex phenomena, the most powerful tool available to the human intellect is abstraction. Abstraction arises from the recognition of similarities between certain objects, situations, or processes in the real world and the decision to concentrate on these similarities and to ignore, for the time being, their differences.

C.A.R. Hoare, Structured Programming
6.1 Introduction
In the previous chapters a variety of different algorithms and systems for manipulation, analysis and retrieval of audio signals were presented. Although for presentation purposes they were described separately, in reality they share many common building blocks and have to be integrated to create useful tools for interacting with audio signals and collections.

The main subject of this chapter is Marsyas, the free software framework for rapid prototyping of Computer Audition research that we have developed. A framework is a set of cooperating classes that make up a reusable design for a specific class of software [55]. A framework is similar to a library in that it provides a set of reusable components and building blocks that can be assembled into a complex system. In addition, it has to be flexible and extensible, allowing the addition and integration of new components with minimal effort. Finding the right abstractions and making the framework flexible and extensible is not trivial. Computer Audition systems use similar features, analysis algorithms, and interfaces for a variety of different tasks. Therefore, in the design of our system, an effort was made to abstract these common elements and use them as architectural building blocks. This facilitates the integration of different techniques under a common framework and interface. In addition, it helps rapid prototyping, since the common elements are written once, and developing and evaluating a new technique or application requires writing only the new task-specific code.
The desired properties of a good framework require careful programming design. An additional complication, in the case of a framework for Computer Audition research, is the requirement for efficient implementation of heavy numerical computations. Typically, Computer Audition systems follow a bottom-up processing architecture where sensory information flows from low-level signals to higher-level cognitive representations. However, there is increasing evidence that the human auditory system uses top-down as well as bottom-up information flow [115]. A top-down (prediction-driven) approach has been used for computational auditory scene analysis [31]. An extension of this approach with a hierarchical taxonomy of sound sources is proposed in [81]. In the design of our framework, we tried to have a flexible architecture that can support these models of top-down flow and hierarchical classification as well as traditional bottom-up processing.
This chapter is mainly about architectural design and implementation issues and should be skipped by readers who are mainly interested in ideas about Computer Audition algorithms. It should be useful for readers who are interested in implementing practical, real-time Computer Audition systems. Publications in Computer Audition rarely go into implementation details (in many cases because their performance is too slow). The result is that people entering the field have to discover the correct abstractions by trial and error. It is my hope that this chapter and the software it describes will help people entering the field concentrate on new ideas and avoid rediscovering and reimplementing basic building blocks of Computer Audition research.
6.2 Related Work
This section will be a little different from the corresponding sections of the previous chapters in that it references software systems rather than publications. Software frameworks from other application areas that have similar design goals to Marsyas, as well as alternatives to Marsyas for implementing Computer Audition algorithms, will be described.

There are many software frameworks for a variety of areas in Computer Science. Representative examples (references can be found at the corresponding web pages) are:

- Audio Synthesis: CSOUND (http://mitpress2.mit.edu/e-books/csound/frontpage.html), Synthesis Toolkit (STK) (http://www-ccrma.stanford.edu/software/stk/), Pure Data (Pd) (http://www.pure-data.org/)

- Image Processing and Manipulation: GIMP (http://www.gimp.org), Khoros (http://www.tnt.uni-hannover.de/js/soft/imgproc/khoros/)

- Visualization: Visualization Toolkit (VTK) (http://public.kitware.com/VTK/)

When designing and developing Computer Audition software there are several possible implementation choices, each with its own advantages and disadvantages. A common approach is to use a numerical computation environment (typically MATLAB, http://www.mathworks.com). The main advantages of this approach are: 1) the large number of available numerical functions, 2) simple syntax tailored to numerical computations, 3) interactive computing and plotting. The main disadvantages are: 1) slow performance (MATLAB is memory-intensive and slow compared to efficient C or C++ implementations), 2) lack of the flexibility of a general programming language (although MATLAB fans might disagree), 3) difficulty in creating user interfaces that respond in real time to audio signals, 4) expensive licence updates, 5) not easy to port to new platforms. In many cases, in order to overcome some of these problems, a hybrid approach is used, where the critical parts are implemented in C or C++ and MATLAB is used for further processing and integration.
Another approach is to start from scratch using a general-purpose programming language and provide a separate efficient implementation for each developed algorithm. The main disadvantage of this approach is the significant effort it requires, because code has to be rewritten from scratch for every new algorithm. The use of standard libraries can provide some code reusability but does not allow easy addition of new components.
Another possibility is the use of an audio synthesis framework such as CSOUND, STK or Pd. Although a lot of the required functionality, such as reading sound files and playing audio, is already there, these frameworks are not targeted at Computer Audition and therefore lack certain important components such as Machine Learning algorithms.
The approach chosen in this work is to develop a framework for Computer Audition research in a general programming language and try to abstract the basic building blocks and make it flexible so that new blocks can be added easily. The Marsyas software framework is the result of these efforts. It was first described in [124, 129] and has been used to design and develop all the described algorithms in this thesis as well as to conduct the evaluation user experiments. The advantage of this approach is that, if done properly, it combines expressivity and flexibility with performance. The main disadvantage is that it requires significant effort in design and implementation until it reaches a stage where it is usable.
6.3 Architecture
Marsyas follows a client-server architecture. The server, written in C++, contains all the signal processing and pattern recognition modules optimized for performance. The client, written in Java, contains only the user interface code and communicates with the computation engine via sockets. This breakdown has the advantage of decoupling the interface from the computation code and allows different interfaces to be built that use the same underlying computational functionality. For example, the server can be accessed by different graphical user interfaces, scripting tools, web crawlers, etc. Another advantage is that the server or the client can be rewritten in some other programming language without needing to change the other part, as long as the interface conventions between the two are respected.
Although the objects form a natural bottom-up hierarchy, top-down flow of information can be expressed in the framework. As a simple example, a silence feature can be used by an extractor for Music/Speech classification to avoid calculating features on silent frames. Similarly, hierarchical classification can be expressed using multiple extractors using different features.
6.3.1 Computation Server
The computation server contains the code for reading/writing/playing audio files, feature extraction, signal processing and machine learning. Special attention was given to abstracting these layers using object-oriented programming techniques. Abstract classes are used to provide a common API for the building blocks of the system, and inheritance is used to factor out common operations. The main classes of the system can roughly be divided into two categories: process-like objects and data-structure-like objects.
Process objects:

Systems process an input vector of floating point numbers and output another vector (of possibly different dimensionality). They encapsulate all the Signal Processing functionality of Marsyas. The parameters of a System and any temporary memory storage it might require are initialized and allocated when the object is constructed. The process method can then be applied, usually more than one time, to floating point vector objects of a specific input and output size specified at construction time. Examples of Systems include transformations such as windowing (for example Hamming), the Fast Fourier Transform, and Wavelets. In addition, features such as the Spectral Centroid, Rolloff and Flux are also represented as Systems. Systems are similar to plugins or unit generators of audio processing software. Systems can be constructed from arbitrary flow networks of other Systems using combinators. For example, the whole computation of features for Music/Speech classification can be encapsulated in one System, as shown in Figure 6.1.

Combinators provide ways to form complex Systems by combining simpler Systems. The Parallel combinator takes as input a list of Systems that have the same input size and produces a System that takes the same input and has as output the concatenation of the outputs of the Systems it combines. The Series combinator takes as input a list of Systems and creates a System that applies them one after the other (the input and output sizes of adjacent pairs have to match). Figure 6.1 shows how Parallel and Series combinators are used to construct the flow network for Music/Speech classification; a simplified sketch follows.
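The following simplified sketch conveys the System/Combinator abstraction; the class and method names are illustrative and do not reproduce the actual Marsyas API.

#include <memory>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

class System {                         // process one vector, produce another
public:
    virtual Vec process(const Vec& in) = 0;
    virtual ~System() {}
};

class Series : public System {         // output of each stage feeds the next
public:
    explicit Series(std::vector<std::unique_ptr<System>> s)
        : stages_(std::move(s)) {}
    Vec process(const Vec& in) override {
        Vec v = in;
        for (auto& s : stages_) v = s->process(v);
        return v;
    }
private:
    std::vector<std::unique_ptr<System>> stages_;
};

class Parallel : public System {       // same input, concatenated outputs
public:
    explicit Parallel(std::vector<std::unique_ptr<System>> s)
        : branches_(std::move(s)) {}
    Vec process(const Vec& in) override {
        Vec out;
        for (auto& b : branches_) {
            Vec r = b->process(in);
            out.insert(out.end(), r.begin(), r.end());
        }
        return out;
    }
private:
    std::vector<std::unique_ptr<System>> branches_;
};

// The Music/Speech extractor of Figure 6.1 then reads roughly as:
// Series{ Hamming, FFT, Magnitude, Parallel{ Centroid, Rolloff, Flux }, Memory }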
Figure 6.1: Music/Speech discriminator architectural layout (a Series of Hamming, FFT, Magnitude, a Parallel of Centroid/Rolloff/Flux, and a Memory forms the extractor; its feature vectors feed a Gaussian classifier that labels each audio window as music or speech)
Memories are a particular case of Systems that correspond to circular buffers holding previously calculated features for a limited time. They are used to compute means and variances of features over large windows without recomputing the features. They can have different sizes depending on the application.
Extractors break up a sound stream into frames. For each frame they use a complex System to calculate a Feature Vector. The result is a time series of feature vectors which is called the Feature Matrix. Typically, there is a different Extractor for each classification scheme.

Classifiers are trained using a Feature Matrix and can then be used to classify Feature Vectors or Feature Matrices. They also support various methods for debugging and evaluation such as k-fold cross-validation.

Segmentors take as input a Feature Matrix and output a Time Line corresponding to the extracted segments.
Data structure objects:

Signals produce or consume sound segments represented as vectors of floating point numbers. Examples of Signals include: various readers/writers of sound file formats, real-time audio I/O, and synthesis algorithms.

Float Vectors-Matrices are the basic data components of the system. They are float arrays tagged with sizes. Operator overloading is used for vector operations to avoid writing many nested loops in signal processing code. The operator functions are inlined and optimized. The resulting code is easy to read and understand without compromising performance (see the sketch below).
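A sketch of the idea (an illustrative subset, not the actual class): a size-tagged float array with overloaded, inlined operators so that signal processing code reads like vector math.

#include <cassert>
#include <vector>

class FloatVector {
public:
    explicit FloatVector(size_t n) : data_(n, 0.0f) {}
    size_t size() const { return data_.size(); }
    float& operator[](size_t i) { return data_[i]; }
    float operator[](size_t i) const { return data_[i]; }

    FloatVector& operator+=(const FloatVector& o) {  // element-wise add
        assert(size() == o.size());                  // sizes must match
        for (size_t i = 0; i < size(); ++i) data_[i] += o.data_[i];
        return *this;
    }
    FloatVector& operator*=(float s) {               // scale in place
        for (float& v : data_) v *= s;
        return *this;
    }
private:
    std::vector<float> data_;
};
// e.g. accumulating a mean feature vector: sum += frame; ...; sum *= 1.0f / n;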
Feature Matrices store automatically extracted audio features in the form of a Feature Matrix and map them to a Label Matrix that is used for training, classification and segmentation. For example, a particular Feature Vector might be mapped to two labels, one corresponding to the class it belongs to (for example Jazz) and the other to the file it originates from.

Time Regions are time intervals tagged with annotation information.

Time Lines are lists of Time Regions.

Time Trees are arbitrary trees of Time Regions. They represent a hierarchical decomposition of audio into successively smaller segments (see Figure 6.2). Trees have been shown to be an effective abstraction for browsing multimedia data [136].

All the objects contain methods to read/write themselves to files and transport them using the socket interface. For example, a calculated Feature Matrix can be stored and used to evaluate different Classifiers without having to redo the feature calculation for each Classifier.
Figure 6.2: Segmentation peaks and corresponding time tree (Group: El.Piano Intro, Entrance; Toms, Cymbal, Guitar)
6.3.2 User Interface Client - Interaction
The client contains all the code for the user interfaces and is written in Java. Because all of the performance-critical code is executed in the server, the choice of Java does not impact performance and allows the use of the excellent Java Swing graphical user interface framework. The main user interface components and their architecture were described in Chapter 4. In this section some additional details about the architecture and implementation are provided.

All the interfaces follow a Model-View-Controller design pattern [64] that allows multiple views of the same underlying data to be created. Browsers render audio collections based on parameters stored in MapFiles. In many cases the MapFiles are automatically generated using Computer Audition algorithms. Using Browsers, the space can be explored and a subset of the collection can be selected (logical zooming) for subsequent browsing or editing. Editing the selected files can be done using Viewers and Editors, which are also configured by MapFiles, possibly automatically generated. During playback of the
Figure 6.3: Marsyas User Interface Component Architecture (MARSYAS3D: abstract Browser, Editor, Monitor and Analysis Tool components, each configured by a Mapfile over a Collection or Item; concrete instances include the SoundSpace and TimbreSpace2D/3D Browsers, the Waveform Editor, the Spectrogram and ClassGram Monitors, and the Music/Speech Classifier and Segmentation Analysis Tools)
files, various Monitors can be set up to display information that is updated in real time. A schematic of the system's architecture can be seen in Figure 6.3.
6.4 Usage
Marsyas can be used at a variety of different levels depending on the user's experience and the flexibility required. At the highest level the graphical user interfaces can be used directly
to browse, analyze and edit audio signals and collections. At a lower level, command-line tools can be used to perform basic Computer Audition operations. For example, the following sequence of commands extracts Spectral Shape features based on the STFT, trains a music/speech Gaussian classifier and evaluates the results. The output will be the classification accuracy using 10-fold cross-validation and the corresponding confusion matrix.
mkcollection music.mf /home/gtzan/data/sound/music/
mkcollection speech.mf /home/gtzan/data/sound/speech/
bextract FFT ms music.mf speech.mf
evaluate 0 GS ms 100
If more flexibility is required, complex Computer Audition algorithms can be constructed by combining various objects that correspond to basic building blocks. Many of the projects done by undergraduate students using Marsyas are done this way, without having to write any code other than the glue that connects the building blocks. Finally, arbitrary new algorithms can be added by writing source code directly and extending, via inheritance, the existing abstract classes of the system.
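As an illustration of this extension mechanism, the following is a minimal sketch of a new one-value feature added by deriving from the hypothetical System base class sketched earlier in this chapter; the names are illustrative assumptions, not the actual Marsyas class hierarchy.

    #include <vector>
    using std::vector;

    // Hypothetical abstract base class (see the earlier sketch).
    struct System {
        virtual vector<float> process(const vector<float>& in) = 0;
        virtual ~System() = default;
    };

    // A new feature added purely by inheritance: the zero-crossing
    // rate of the analysis window, returned as a one-element vector.
    struct ZeroCrossings : System {
        vector<float> process(const vector<float>& in) override {
            int crossings = 0;
            for (std::size_t i = 1; i < in.size(); ++i)
                if ((in[i - 1] >= 0.0f) != (in[i] >= 0.0f)) ++crossings;
            return { in.empty() ? 0.0f
                                : static_cast<float>(crossings) / in.size() };
        }
    };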
6.5 Characteristics
In addition to the described components, for the purposes of the MPEG-based feature calculation, an MPEG audio decoder was implemented in C++. The source code is loosely based on the Fraunhofer Institute reference software implementation and other open source implementations (8). The combined decoding and classification/segmentation runs in real time on a Pentium II PC.

8 http://www.mpeg.org
All the graphical user interface components are implemented in Java. The Java3D API is used for the 3D graphics and the JavaSound API is used for the audio and MIDI playback. The computation server is written in standard ANSI C++. For parametric synthesis of sound effects and audio generation from MIDI the Synthesis Toolkit (9) [23] is used. MARSYAS is mainly developed under Linux and has also been compiled under the Solaris, Irix, Mac OS X, and Windows (Visual Studio and Cygwin compilers) operating systems. The use of standard C++ and Java and the independence from external libraries make the system easily portable to new architectures. Moreover, the client-server architecture allows a variety of possible configurations. For example, it is possible to have Windows and Solaris clients accessing the audio information retrieval server running in Linux. The framework is available as free software under the GNU Public License (GPL) from http://www.cs.princeton.edu/~gtzan/marsyas.html and has been downloaded approximately 2500 times from 1000 different hosts. Marsyas has been used by many Princeton undergraduate and graduate students doing projects related to Computer Audition. Some examples are: Music Mosaics (Douglas Turnbull, Ari Lazier), instrument classification (Corrie Elder), Pitch Histograms (George Tourtellot), query user interfaces (Andreye Ermolinskyi), and Music Library (John Forsyth).
9 http://ccrma-www.stanford.edu/software/stk/
Chapter 7
Conclusions
Remember, Information is not knowledge, Knowledge is not Wisdom; Wisdom is not truth; Truth is not beauty; Beauty is not love; Love is not music; Music is the best.
Frank Zappa
7.1 Discussion
In this thesis, a variety of algorithms, techniques and interfaces for working with audio signals and collections were proposed, developed, and evaluated. These systems can be used for manipulation, analysis and retrieval of audio signals and collections. Some of the main characteristics they share are: 1) practical solutions to fundamental problems that are applicable to arbitrary audio signals; 2) utilization of large collections; 3) statistical evaluation, in many cases based on user studies; 4) use of visualization and human-computer interaction techniques to include the user in the system; 5) integration under a common software implementation. These characteristics can be summarized as a general tendency to create practical, workable systems that can be used as prototypes for the Computer Audition systems of the future.
To summarize, the main contributions of this thesis are: 1) calculation of features from MPEG audio compressed data; 2) calculation of features based on the DWT; 3) a general multifeature methodology for segmentation of arbitrary textures; 4) automatic musical genre classification and similarity retrieval based on timbral, rhythmic and harmonic features; 5) a variety of novel user interfaces, such as Timbregrams and Timbrespaces for interacting with audio signals and collections and Sound Sliders for specifying audio queries; 6) the creation of an extensible software framework for Computer Audition research that integrates the proposed systems as well as a large variety of existing algorithms and systems.
This is a good point to go back to Section 1.9 and read again the proposed Computer Audition usage scenarios to see how the systems described in this thesis serve their requirements. Our hope is that someday in the future the tools for interacting with audio signals and collections will be inspired by, and look like, our prototype systems. In the following Sections current and future work related to this thesis will be described.
7.2 Current Work
In this Section some of the ongoing work continuing the research described in this thesis will be presented. The Sonic Browser is a tool for accessing sounds or collections of sounds using sound spatialization and content-overview visualization techniques, developed at the University of Limerick, Ireland. We are currently collaborating with the Sonic Browser team, combining their system with Marsyas. More specifically, the new system under development combines the interactivity and user configuration of the Sonic Browser with the automatic audio information retrieval tools of Marsyas in order to create a flexible distributed graphical user interface with multiple visualization components that provides novel ways of browsing large audio collections. The Extensible Markup Language (XML) is used to exchange configuration and metadata information between the various developed
components. This work has been submitted to the International Conference on Auditory Display (ICAD) 2002.
In Chapter 5 it was shown that features calculated from Pitch Histograms can be used for automatic musical genre classification. However, for the cases where these features fail, it is not clear if the failures have to do with the choice of features or with errors in the multiple pitch detection algorithm. In order to address this question we have compared the performance of Pitch Histogram features calculated on MIDI data and on audio derived from the same MIDI files. By definition, Pitch Histograms derived from MIDI data have no pitch errors, so by comparing the two results we can judge to what extent pitch errors influence the use of the features for classification. It has been shown that the Pitch Histogram features in both cases show the same performance, so it can be concluded that possible pitch errors in the audio case do not significantly affect the classification performance of the features. Of course, audio generated from MIDI, although polyphonic, is easier than "real world" audio signals for multiple pitch detection. However, because the genre classification results obtained from MIDI and audio-from-MIDI are comparable to results obtained from "real world" audio data for similar tasks (but obviously different data sets), it can be concluded that the classification performance of the features on "real world" audio data is also not significantly affected by pitch detection errors. It is important to note that this does not mean that better features are not possible or that the multiple pitch detection algorithm is perfect. This work will be submitted to the International Symposium on Music Information Retrieval (ISMIR) 2002.
As already mentioned, in addition to digital music distribution, collections of sound effects are another important area for Computer Audition research. The perceptual similarity of sound effects has been investigated by conducting user experiments in [104]. We are collaborating with the people involved in this project to use the described tools and interfaces to visualize and compare the results of this work. In addition, these results will be
compared with automatically extracted feature vectors and similarity information over the same collection of digital sound effects.
7.3 Future work
Computer Audition is a relatively recent research area and there are a lot of interesting directions for future work. Some of these directions are obvious extensions of the work described in this thesis and others are more unrelated. It is our hope that the developed software framework will help us and other researchers realize some of these directions for future work.

In the representation stage, the investigation of different front-ends for feature extraction is an area of active research. Although most of the features proposed in the literature have shown adequate performance, there have been few thorough investigations of all possible feature choices and combinations. By providing a common framework and data collections we are planning to compare different feature front-ends and optimize the choice of parameters for particular applications. For example, we are investigating the use of computational models of ear physiology, such as the ones proposed in [116, 117], as the basis for feature extraction.
Although there have been a large number of studies about the perception of beat periodicities, there is relatively little work on the perception of beat strength. We are currently conducting a user experiment to investigate how humans perceive musical beat strength. Each subject has to rate 50 segments from a variety of musical styles into five beat strength groups (weak, medium-weak, medium, medium-strong, strong). The order of presentation is randomized to avoid any order effects such as learning. Preliminary results from the 15 subjects that have completed the study so far indicate that there is considerable agreement between subjects regarding musical beat strength. The results of this study will be used to evaluate and calibrate the calculation of Beat Histograms.
Another interesting direction is the use of compressed-domain audio features for types of audio analysis other than music/speech classification, instrument classification and segmentation. Some examples are multiple pitch detection, beat analysis, speaker identification and speech recognition. The features described in this work are not the only ones possible, and further investigation is needed in order to come up with better features tailored to specific applications. In addition to the subband analysis, other types of information can be utilized. For example, in MPEG layer III a window switching scheme is used where the 32 subband signals are further subdivided in frequency content by applying, to each subband, a 6-point or 18-point modified DCT block transform. The 18-point block transform is normally applied because it provides better frequency resolution, whereas the 6-point block transform provides better time resolution. Therefore it is more likely to find segmentation boundaries in short blocks, and this information could be used to enhance the segmentation algorithm. For statistical pattern recognition, large amounts of training data need to be collected. By doing the analysis on MPEG compressed data, the enormous resources available on the Web can become available for training. We are planning to implement a Web crawler to automatically gather .mp3 files from the Web for this purpose.
The exact details of feature calculation for older methods such as the STFT and MFCC have been relatively well explored in the literature. On the other hand, features based on the DWT have appeared only recently. Further improvements in classification accuracy can be expected with more careful experimentation with the exact details of the parameters. We also plan to investigate the use of other wavelet families for the feature computation. Another interesting direction is to combine features from different analysis techniques to improve classification accuracy. We also plan to apply the general methodology for audio segmentation based on "texture" described in this thesis, using the DWT-based features.
In the implementation of the segmentation algorithm the importance of each feature is equal. Most likely this is not the case with humans, and it is not optimal. Therefore, relative
weighting of the features, as well as optimization of the parameters and heuristics, need to be explored. One possible approach to this problem of feature weighting is the use of Genetic Algorithms, as in [44]. The use of the proposed segmentation methodology for other types of audio signals, such as news broadcasts, is also an interesting possibility. Because of the difficulty of evaluating and comparing different segmentation schemes, standard corpora of audio examples need to be designed. Standardized corpora are very important for Computer Audition research and unfortunately their creation has been hindered by copyright problems. In the future we plan to collect more data on the time required to segment audio. Empirical evidence is required that automatic segmentation reduces the time to segment. Further tests in thumbnailing will need to be devised to determine the salience of the human- and machine-selected thumbnails, and to determine their usefulness. For example, can thumbnails cause a speedup in location/indexing, or can thumbnails be concatenated or otherwise combined to construct a useful "caricature" of a long audio selection? We plan to make more detailed analyses of the text annotations. Finally, the developed graphical user interface allows the collection of many user statistics, like the number of edit operations, detailed timing information, etc., that we plan to investigate further.
For the beat analysis using Beat Histograms, a comparison of the beat detection algorithm described in [108] with our scheme on the same data set is also planned for the future. In addition, more careful studies using the synthetic examples are also planned. These studies will show if it is possible to detect the phase and periodicity of multiple rhythmic lines using the DWT.
In this thesis similarity retrieval was performed using a single feature vector representation for each file. Although this approach works well for files with relatively constant texture, it has problems with files, such as classical music, with many different sections. Segmentation algorithms can provide information about where texture changes occur, but the problem of how to do similarity retrieval in those cases still remains. In the future we plan
to investigate trajectory- and segmentation-based approaches to content-based similarity retrieval.
For the automatic musical genre classification, an obvious direction for future research is expanding the genre hierarchy both in width and depth. Other semantic descriptions such as emotion or voice style will be investigated as possible classification categories. More exploration of the pitch content feature set could possibly lead to better performance. Alternative multiple pitch detection algorithms, for example based on cochlear models, could be used to create the pitch histograms. For the calculation of the beat histogram we plan to explore other filterbank front-ends as well as onset-based periodicity detection as in [66, 112]. We are also planning to investigate real-time running versions of the rhythmic structure and harmonic content feature sets. Another interesting possibility is the extraction of similar features directly from MPEG audio compressed data as in [95, 130]. Finally, we are also planning to use the proposed feature sets with alternative classification and clustering methods such as artificial neural networks. Two other possible sources of information about musical genre content are melody and singer voice. Although melody extraction is a hard problem that is not solved for general audio, it might be possible to obtain some statistical information even from imperfect melody extraction algorithms. Singing voice extraction and analysis is another interesting direction for future research.
In the Interaction stage, there is a lot of work that has to be done in order to improve the described interfaces and evaluate their effectiveness. For example, based on the fact that on average the subjects needed more than 2 hours to segment 10 minutes of audio using standard audio editing tools, we plan to compare how much this process can be accelerated using the Enhanced Audio Editor. By using our own software, collecting usage statistics such as mouse clicks and timing information is much easier than using a commercial audio editor. Some other directions for future work are the development of alternative sound-analysis visualization tools and the development of new model-based controllers for sound
synthesis based on the same principles. Of course, new analysis and synthesis algorithms will require adjustment of our existing tools and development of new ones. More formal evaluations of the developed tools are planned for the future. Integrating the model-based controllers into a real virtual environment is another direction of future work.
The Display Wall project has implemented a variety of alternative input methods in addition to standard keyboard and mouse navigation. We are exploring the use of speech recognition input for controlling the system. The client architecture of the user interfaces will make the integration with the speech system easy. Speech recognition is already being used for a circuit viewer. Currently the placing of the windows is done by the window manager, requiring manual adjustment, which is difficult on such a large display. Therefore, intelligent automatic placement of windows is planned for the future.
It is our belief that the problem of query specification has many interesting directions for future research because it provides a new perspective and design constraints on the issue of automatic music and sound generation and representation. Unlike existing work in automatic music generation, where the goal is to create as good music as possible, the goal in our work is to create convincing sketches or caricatures of the music that can provide feedback and be used for music information retrieval. Another related issue is the visual design of interfaces and displays and its connection to visual music and score representations.
Evaluating a complex retrieval system with a large number of graphical user interfaces is difficult and can only be done by conducting user studies. For the future we plan a task-based evaluation of our system, similar to [57], where users will be given a specific task, such as locating a particular song in a large collection, and their performance using different combinations of tools will be measured.
Another direction for future research is the collaboration with composers and music cognition experts with the goal of exploring different metaphors for sketching music
queries for users with and without musical background, and how those differ. Currently a lot of information about audio, and especially music, signals can be found in the form of metadata annotations such as filenames and ID3 tags of mp3 files. We plan to develop web crawlers that will collect that information and use it to build more effective tools for interacting with large audio collections.
On the implementation side, having a well-designed framework and knowing the basic abstractions and building blocks makes porting to other programming languages and systems easier. Currently, versions of the computational server written in SML/NJ (a typed functional programming language) and Java are under development. A version of some of the client user interfaces, written using GTK, is under development.
7.3.1 Other directions
In this Section some directions for future work that are not direct extensions of the work described in this thesis will be described. As already mentioned, a large amount of existing Computer Audition research solves hard problems on small collections of limited "toy" examples, with the hope of gradually increasing the number and complexity of signals it can handle. This approach has not yet been successful in building practical systems. In contrast, the work described in this thesis emphasizes simpler, fundamental problems that can be reliably solved over large collections of "real world" complex audio signals. Therefore an obvious direction for future work is to try to solve increasingly harder problems while still keeping the methods applicable to "real world" audio signals. These harder problems can be seen as intermediate steps toward the goal of automatic polyphonic transcription. In a sense, some of the techniques described in this work, such as Beat Histograms and Pitch Histograms, are moving toward that goal. In the following paragraphs a number of such interesting unsolved problems are described.
An important source of information about musical signals is the number and type of instruments playing at any particular moment (for example, a saxophone, piano and drums are playing now). Instrument mixture identification refers to automatically solving this problem. Although it is a harder problem than classification, as there has to be some separation of the signals, it is an easier problem than polyphonic transcription or source separation, as only the instrument labels are required. A related problem is instrument tracking, where a statistical model of an instrument sound is provided and that particular instrument is tracked through a polyphonic recording. Another interesting problem is the separation of particular types of sounds or instruments such as bass, drums and voice. For example, the separation of drum sounds could be used for more elaborate detection of particular rhythmic styles such as Bossa Nova.
In the same manner that Pitch Histograms provide overall pitch information for a song even though multiple pitch detection is still not perfect, it is probably possible to detect and identify chords. Beat analysis methods can also provide metric information as well as higher-level structural information. An important source of information for identifying music is the singing voice. Therefore techniques for singer identification and singing voice separation or enhancement are very interesting.
In addition to applications related to audio signals, it is possible that many of the proposed ideas have applications in other areas and types of signals. Some obvious examples include video and speech signals, and some less obvious ones include measured experimental data from fields such as physics and astronomy, stock market information, and biological signals. For example, visualizations of DNA based on Signal Processing techniques similar to the ones described in this thesis are described in [6].
7.4 Final Thoughts
Audio signals, and especially musical signals, are very interesting from a philosophical perspective. For some reason, humans find organized structures of moving air molecules so interesting that they have sophisticated sensory organs for detecting them, spend significant amounts of their lives listening to them, make weird devices for creating them, and use sophisticated technology to store and reproduce them. In the early days of humanity, the construction of practical tools such as arrows and knives was paralleled by the creation of holed tubes and stretched membranes for setting air molecules in motion. Today the construction of practical tools such as operating systems and firewalls is paralleled by the creation of systems for helping humans interact with and explore the vast amounts of structured air molecule motion stored as strings of ones and zeros. Music is the best.
Appendix A
A.1 Introduction
This appendix is a brief overview of some standard techniques that are important for the work described in this thesis. The main goal is to describe how the algorithms work rather than explaining and proving why they work. More details can be found in the corresponding references.
A.2 Short Time Fourier Transform
The most common set of features used for classification, segmentation and retrieval is based on the FFT, and more specifically the Short Time Fourier Transform (STFT). The
STFT X(n,k) of a signal x(n) is a function of both time n and frequency k. It can be written as:

X(n,k) = \sum_{m=-\infty}^{\infty} x(m)\, h(n-m)\, e^{-j(2\pi/N)km}    (A.1)
where h(n) is a sliding window, x(m) is the input signal, and N is the size of the transform. Typically the STFT is implemented using the FFT algorithm to make its calculation computationally efficient. The STFT can be viewed as a filter bank where the kth filter
channel is obtained by multiplying the input x(m) by a complex sinusoid at frequency k/N times the sample rate and then convolving with the lowpass filter h(m). In other words, the output X(n,k) for any particular value of k is a frequency-shifted, band-pass filtered version of the input. Another way to view the STFT is that for any particular value of n, the STFT is the Discrete Fourier Transform of the windowed input at time n. In other words, the STFT can also be thought of as partially overlapping Fourier transforms. The STFT for a particular frequency k at a particular time n is a complex number. Typically, for feature calculation only the magnitude of these complex numbers is retained.
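To make the windowed-transform view concrete, here is a minimal C++ sketch that computes the magnitude of one STFT frame with a direct DFT; as noted above, a real implementation would use the FFT instead of this O(N^2) loop. The function name is an illustrative assumption.

    #include <cmath>
    #include <cstddef>
    #include <vector>
    using std::vector;

    const double PI = 3.141592653589793;

    // Magnitude of one STFT frame: Hamming-window the N input samples,
    // then evaluate |X(k)| with a direct DFT (use an FFT in practice).
    vector<double> stftMagnitude(const vector<double>& frame) {
        const std::size_t N = frame.size();  // assumed > 1
        vector<double> w(N);
        for (std::size_t n = 0; n < N; ++n)
            w[n] = frame[n] * (0.54 - 0.46 * std::cos(2.0 * PI * n / (N - 1)));

        vector<double> mag(N / 2 + 1);  // keep only the non-redundant bins
        for (std::size_t k = 0; k < mag.size(); ++k) {
            double re = 0.0, im = 0.0;
            for (std::size_t n = 0; n < N; ++n) {
                double phi = 2.0 * PI * k * n / N;
                re += w[n] * std::cos(phi);
                im -= w[n] * std::sin(phi);
            }
            mag[k] = std::sqrt(re * re + im * im);
        }
        return mag;
    }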
A.3 Discrete Wavelet Transform
In the pyramidal algorithm the signal is analyzed at different frequency bands with different resolutions for each band. This is achieved by successively decomposing the signal into a coarse approximation and detail information. The coarse approximation is then further decomposed using the same wavelet decomposition step. This decomposition step is achieved by successive highpass and lowpass filtering of the time domain signal and is defined by the following equations:
y_{high}[k] = \sum_{n} x[n]\, g[2k - n]    (A.2)

y_{low}[k] = \sum_{n} x[n]\, h[2k - n]    (A.3)
where y_{high}[k] and y_{low}[k] are the outputs of the highpass (g) and lowpass (h) filters, respectively, after subsampling by two. Each level of the pyramid corresponds roughly to frequency bands spaced in octaves. Because of the symmetric pyramid structure and the decimation, the DWT can be performed in O(N) time (where N is the number of input points). In this work, the DAUB4 filters proposed by Daubechies [25] are used. Figure A.1 shows
a schematic diagram of the pyramidal algorithm for the calculation of the Discrete Wavelet Transform.
Figure A.1: Discrete Wavelet Transform Pyramidal Algorithm Schematic (the original signal, with frequency range 0-B, is split by the lowpass h(n) and highpass g(n) filters and downsampled, giving bands 0-B/2 and B/2-B at level 1, and 0-B/4 and B/4-B/2 at level 2)
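A minimal C++ sketch of one decomposition step of the pyramidal algorithm, implementing the filter-and-downsample structure of equations (A.2) and (A.3); Haar filter coefficients are used here for brevity instead of the DAUB4 filters that this work actually uses.

    #include <cmath>
    #include <vector>
    using std::vector;

    // One pyramid level: filter with h (lowpass) and g (highpass),
    // then keep every second sample (downsampling by two).
    void dwtStep(const vector<double>& x,
                 vector<double>& low, vector<double>& high) {
        // Haar filters for brevity; this work uses DAUB4 [25].
        const double h[2] = { 1.0 / std::sqrt(2.0),  1.0 / std::sqrt(2.0) };
        const double g[2] = { 1.0 / std::sqrt(2.0), -1.0 / std::sqrt(2.0) };
        low.clear();
        high.clear();
        for (std::size_t k = 0; 2 * k + 1 < x.size(); ++k) {
            double lo = 0.0, hi = 0.0;
            for (int n = 0; n < 2; ++n) {  // two-tap filter over x[2k], x[2k+1]
                lo += x[2 * k + n] * h[n];
                hi += x[2 * k + n] * g[n];
            }
            low.push_back(lo);
            high.push_back(hi);
        }
        // The full DWT applies this step recursively to `low`.
    }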
A.4 Mel-Frequency Cepstral Coefficients
Mel-Frequency Cepstral Coefficients [26] are a common feature front-end used in many speech recognition systems. An exploration of their use in Music/Speech classification is described in [72]. This technique combines an auditory filter bank, followed by a cosine transform, to give a representation with properties roughly similar to those of the human auditory system.

More specifically, the following steps are taken in order to calculate MFCCs:
1. The short-time slice of audio data to be processed is windowed by a Hamming window.

2. The magnitude of the Discrete Fourier Transform is computed using the FFT algorithm.

3. The FFT bins are combined into filterbank outputs approximating the perceptual frequency resolution properties of the human ear. The filterbank is constructed using
13 linearly-spaced filters (133.33 Hz between their center frequencies) followed by 27 log-spaced filters (separated by a factor of 1.0711703 in frequency). The bins are summed using a triangular weighting function which starts rising at the CF of the previous filter, has a peak at the current filter's CF, and ends at the CF of the next filter (see Figure A.2).

4. The filterbank outputs are converted to log base 10.

5. The Discrete Cosine Transform of the outputs is used to reduce dimensionality.
Figure A.2: Summing of FFT bins for MFCC filterbank calculation (each triangular window rises from the previous center frequency, CF/1.071 or CF - 133 Hz, peaks at the current CF, and falls to the next, CF * 1.071 or CF + 133 Hz)
More details about the calculation of MFCCs can be found in [98].
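The triangular summing of step 3 lends itself to a short C++ sketch; the bin-to-frequency mapping and the filter edge parameters below are illustrative assumptions, not a transcription of a particular MFCC implementation.

    #include <vector>
    using std::vector;

    // Weighted sum of FFT magnitude bins for one triangular filter whose
    // response rises from `lo`, peaks at `cf`, and falls back to zero at `hi`.
    double triangleFilterOutput(const vector<double>& mag,  // |FFT| bins
                                double binHz,               // Hz per FFT bin
                                double lo, double cf, double hi) {
        double sum = 0.0;
        for (std::size_t k = 0; k < mag.size(); ++k) {
            double f = k * binHz;
            double w = 0.0;
            if (f > lo && f <= cf)      w = (f - lo) / (cf - lo);  // rising edge
            else if (f > cf && f < hi)  w = (hi - f) / (hi - cf);  // falling edge
            sum += w * mag[k];
        }
        return sum;
    }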
A.5 Multiple Pitch Detection
The multiple pitch detection used for the Pitch Histogram calculation is based on the two-channel pitch analysis model described in [123]. A block diagram of this model is shown in Figure A.3. The signal is separated into two channels, below and above 1 kHz. The channel separation is done with filters that have 12 dB/octave attenuation in the stopband. The lowpass block also includes a highpass rolloff with 12 dB/octave below 70 Hz. The high channel is half-wave rectified and lowpass filtered with a filter similar (including the highpass characteristic at 70 Hz) to that used for separating the low channel.
The periodicity detection is based on "generalized autocorrelation", i.e., the computation consists of a discrete Fourier transform (DFT), magnitude compression of the spectral
representation, and an inverse transform (IDFT). The signal x2 of Figure A.3 is obtained as:
x_2 = \mathrm{IDFT}(|\mathrm{DFT}(x_{low})|^k) + \mathrm{IDFT}(|\mathrm{DFT}(x_{high})|^k)    (A.4)
    = \mathrm{IDFT}(|\mathrm{DFT}(x_{low})|^k + |\mathrm{DFT}(x_{high})|^k)    (A.5)
where x_{low} and x_{high} are the low and high channel signals before the periodicity detection blocks in Figure A.3. The parameter k determines the frequency domain compression; for normal autocorrelation k = 2. The fast Fourier Transform (FFT) and its inverse (IFFT) are used to speed up the computation of the transforms.

The peaks of the summary autocorrelation function (SACF) (signal x2 of Figure A.3) are relatively good indicators of potential pitch periods in the analyzed signal. In order to filter out integer multiples of the fundamental period, a peak pruning technique is used. The original SACF curve is first clipped to positive values, then time-scaled by a factor of two and subtracted from the original clipped SACF function, and the result is again clipped to positive values only. That way, repetitive peaks with double the time lag of the basic peak are removed. The resulting function is called the enhanced summary autocorrelation (ESACF), and its prominent peaks are accumulated in the Pitch Histogram calculation. More details about the calculation steps of this multiple pitch detection algorithm, as well as its evaluation and justification, can be found in [123].
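The peak-pruning step described above lends itself to a very short C++ sketch; this is a direct reading of the clip/stretch/subtract procedure, not the reference implementation of [123].

    #include <algorithm>
    #include <vector>
    using std::vector;

    // Enhance a summary autocorrelation function (SACF): clip to positive
    // values, subtract a time-stretched (x2) copy, and clip again.
    vector<double> enhanceSACF(const vector<double>& sacf) {
        vector<double> clipped(sacf.size());
        for (std::size_t i = 0; i < sacf.size(); ++i)
            clipped[i] = std::max(sacf[i], 0.0);

        vector<double> esacf(clipped.size());
        for (std::size_t i = 0; i < clipped.size(); ++i) {
            // Stretching by two means lag i of the copy comes from lag i/2,
            // so a peak at lag L in the copy lands at lag 2L.
            double stretched = clipped[i / 2];
            esacf[i] = std::max(clipped[i] - stretched, 0.0);
        }
        return esacf;  // its prominent peaks feed the Pitch Histogram
    }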
Figure A.3: Multiple Pitch Analysis Block Diagram (the input is split into xl and xh by 1 kHz lowpass and highpass filters; the high channel is half-wave rectified and lowpass filtered; each channel feeds a periodicity detection block, and the combined SACF passes through the SACF enhancer to produce x2)
A.6 Gaussian classifier
The Gaussian classifier models each class as a multidimensional Gaussian distribution. The Gaussian or Normal distribution in d dimensions can be written as:
p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right]    (A.6)
where x is a d-component column vector, \mu is the d-component mean vector, \Sigma is the d-by-d covariance matrix, and |\Sigma| and \Sigma^{-1} are its determinant and inverse, respectively.
A Gaussian classifier can be fully characterized by its mean vector \mu and covariance matrix \Sigma. Maximum Likelihood methods view the parameters of a distribution as quantities whose values are fixed but unknown. The best estimate of their values is defined to be the one that maximizes the probability of obtaining the samples actually observed. It can be shown that the Maximum Likelihood estimate of the parameters of the Gaussian classifier is the sample mean and covariance matrix [30]. In other words, in order to train a Gaussian classifier for a particular class of labeled samples, the sample mean and covariance matrix are used as the parameters of the distribution. Classification of an unknown vector x using a Gaussian classifier is done by finding the class Gaussian distribution that most likely would produce this vector.
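A minimal C++ sketch of this train-then-classify procedure, restricted to diagonal covariance matrices so that the inverse and determinant are trivial; the names are illustrative and nonzero variances are assumed.

    #include <cmath>
    #include <cstddef>
    #include <vector>
    using std::vector;

    const double PI = 3.141592653589793;

    // Per-class model: sample mean and diagonal sample variance per dimension.
    struct DiagGaussian {
        vector<double> mean, var;

        // Maximum Likelihood training: sample mean and variance of class data.
        void train(const vector<vector<double>>& samples) {
            std::size_t d = samples[0].size(), n = samples.size();
            mean.assign(d, 0.0);
            var.assign(d, 0.0);
            for (const auto& x : samples)
                for (std::size_t j = 0; j < d; ++j) mean[j] += x[j] / n;
            for (const auto& x : samples)
                for (std::size_t j = 0; j < d; ++j)
                    var[j] += (x[j] - mean[j]) * (x[j] - mean[j]) / n;
        }

        // Log of equation (A.6) with a diagonal covariance matrix.
        double logLikelihood(const vector<double>& x) const {
            double ll = 0.0;
            for (std::size_t j = 0; j < mean.size(); ++j) {
                double diff = x[j] - mean[j];
                ll -= 0.5 * (std::log(2.0 * PI * var[j]) + diff * diff / var[j]);
            }
            return ll;
        }
    };
    // Classification: evaluate logLikelihood(x) under each class model
    // and pick the class with the highest value.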
A.7 GMM classifier and the EM algorithm
The GMM classifier models each class probability density function as a finite mixture of multidimensional Gaussian distributions. If we denote by \theta_m the parameters of each Gaussian mixture component, and by \theta the complete set of parameters \theta_1, ..., \theta_k, \alpha_1, ..., \alpha_k, then the probability density function of the mixture model can be written as:
p(y | \theta) = \sum_{m=1}^{k} \alpha_m\, p(y | \theta_m)    (A.7)
where the \alpha_m are the mixing probabilities. Given a set of n independent and identically distributed samples Y = {y^{(1)}, y^{(2)}, ..., y^{(n)}}, the log-likelihood corresponding to a k-component mixture is
\log p(Y | \theta) = \log \prod_{i=1}^{n} p(y^{(i)} | \theta) = \sum_{i=1}^{n} \log \sum_{m=1}^{k} \alpha_m\, p(y^{(i)} | \theta_m)    (A.8)
Our goal is to find the maximum likelihood (ML) estimate:
\theta_{ML} = \arg\max_{\theta} \{ \log p(Y | \theta) \}    (A.9)
It is known that this estimate cannot be found analytically, therefore the EM algorithm, an iterative procedure for finding local maxima of \log p(Y | \theta), is used.
The EM algorithm is based on the interpretation of Y as incomplete data. For finite mixtures, the missing part is a set of n labels Z = {z^{(1)}, z^{(2)}, ..., z^{(n)}} associated with the n samples, indicating which component produced each sample. Each label is a binary indicator vector. The complete log-likelihood (i.e., the one from which we could estimate \theta if the complete data X = {Y, Z} were observed) is
\log p(Y, Z | \theta) = \sum_{i=1}^{n} \sum_{m=1}^{k} z_m^{(i)} \log\left[ \alpha_m\, p(y^{(i)} | \theta_m) \right]    (A.10)
The EM algorithm produces a sequence of parameter estimates {\theta(t), t = 0, 1, 2, ...} by alternately applying two steps (until some convergence criterion is met):

- E-step: Compute the conditional expectation of the complete log-likelihood, given Y and the current estimate of the parameters \theta(t):
Q(\theta, \theta(t)) = E\left[ \log p(Y, Z | \theta) \mid Y, \theta(t) \right]    (A.11)
- M-step: Update the parameter estimates according to:
\theta(t+1) = \arg\max_{\theta} Q(\theta, \theta(t))    (A.12)
In our implementation, the K-Means algorithm described in the next Section is used to initialize the means of each component, and diagonal covariance matrices are used. More details about the EM algorithm can be found in [84].
A.8 K-Means
The K-Means algorithm is an unsupervised learning or clustering algorithm. The only input parameter is the desired number of clusters, and the output of the algorithm is the mean vectors \mu_1, \mu_2, ..., \mu_n of the clusters. The algorithm is iterative and has the following steps (a sketch follows below):

1. Initialize n, c, \mu_1, \mu_2, ..., \mu_n
2. Do: classify the n samples according to the nearest \mu_i
3. Recompute \mu_i
4. Until there is no change in \mu_i
5. Return \mu_1, \mu_2, ..., \mu_n
More details and variations can be found in [30].
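A minimal C++ sketch of the loop above for points in d dimensions, with squared Euclidean distance as the assignment criterion; initializing the means from the first k samples is a crude illustrative choice, not a recommendation.

    #include <cstddef>
    #include <vector>
    using std::vector;
    using Point = vector<double>;

    double sqDist(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t j = 0; j < a.size(); ++j)
            s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    // Returns the k cluster means once the assignments stop changing.
    vector<Point> kMeans(const vector<Point>& data, std::size_t k) {
        vector<Point> mu(data.begin(), data.begin() + k);   // crude init
        vector<std::size_t> label(data.size(), k);          // k = unassigned
        for (bool changed = true; changed; ) {
            changed = false;
            // Step 2: classify each sample according to the nearest mean.
            for (std::size_t i = 0; i < data.size(); ++i) {
                std::size_t best = 0;
                for (std::size_t c = 1; c < k; ++c)
                    if (sqDist(data[i], mu[c]) < sqDist(data[i], mu[best]))
                        best = c;
                if (best != label[i]) { label[i] = best; changed = true; }
            }
            // Step 3: recompute each mean from its assigned samples.
            for (std::size_t c = 0; c < k; ++c) {
                Point sum(data[0].size(), 0.0);
                std::size_t cnt = 0;
                for (std::size_t i = 0; i < data.size(); ++i)
                    if (label[i] == c) {
                        for (std::size_t j = 0; j < sum.size(); ++j)
                            sum[j] += data[i][j];
                        ++cnt;
                    }
                if (cnt > 0)
                    for (std::size_t j = 0; j < sum.size(); ++j)
                        mu[c][j] = sum[j] / cnt;
            }
        }
        return mu;
    }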
A.9 KNN classifier
The K-nearest-neighbors (KNN) classifier is an example of a non-parametric classifier. If we denote by D_n = {x_1, ..., x_n} a set of n labeled prototypes, then the nearest neighbor rule for classifying an unknown vector x is to assign it the label of its closest point in the set of labeled prototypes D_n. It can be shown that for an unlimited number of prototypes the error rate of this classifier is never worse than twice the optimal Bayes rate [30]. The KNN rule classifies x by assigning it the label most frequently represented among the k nearest samples. Typically k is odd to avoid ties in voting. Because this algorithm has heavy time and space requirements, various methods are used to make its computation faster and its storage requirements smaller [30].
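For completeness, a minimal C++ sketch of the voting rule; the brute-force scan below is exactly the heavy computation that the speed-up methods mentioned above would replace, and all names are illustrative.

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <vector>
    using std::vector;
    using Point = vector<double>;

    // Squared Euclidean distance (same helper as in the K-Means sketch).
    double sqDist(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t j = 0; j < a.size(); ++j)
            s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    // Return the majority label among the k prototypes nearest to x (k >= 1).
    int knnClassify(const vector<Point>& protos, const vector<int>& labels,
                    const Point& x, std::size_t k) {
        vector<std::size_t> idx(protos.size());
        for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
        // Brute force: order all prototypes by distance to x.
        std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
            return sqDist(protos[a], x) < sqDist(protos[b], x);
        });
        std::map<int, int> votes;  // label -> count among the k nearest
        for (std::size_t i = 0; i < k && i < idx.size(); ++i)
            ++votes[labels[idx[i]]];
        int bestLabel = 0, bestCount = -1;
        for (const auto& v : votes)
            if (v.second > bestCount) { bestLabel = v.first; bestCount = v.second; }
        return bestLabel;
    }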
A.10 Principal Component Analysis
Principal Component Analysis (PCA) is a technique for dimensionality reduction [56]. PCA transforms an original set of variables into a new set of principal components which are at right angles (orthogonal) to each other (i.e., uncorrelated). Basically, the extraction of a principal component amounts to a variance-maximizing rotation of the original variable space. In other words, the first principal component is the axis passing through the centroid of the feature vectors that has the maximum variance, and therefore it explains a large part of the underlying feature structure. The next principal component tries to maximize the variance not explained by the first. In this manner, consecutive orthogonal components are extracted, and each one describes an increasingly smaller portion of the variance. A schematic representation of Principal Component extraction is shown in Figure A.4.
If we denote the feature covariance matrix as:
E(xx^T) = \Sigma    (A.13)
then the eigenvectors corresponding to the N largest eigenvalues of \Sigma are the N principal component axes.
Figure A.4: Schematic of Principal Component extraction (3 to 2 dimensions): the 1st and 2nd principal components become the new axes PC1 and PC2
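Since finding the eigenvectors of \Sigma is the computational core of PCA, here is a minimal C++ sketch that estimates the first principal component by power iteration on the covariance matrix; this is one simple choice of eigensolver for illustration, not necessarily the one used in this work.

    #include <cmath>
    #include <cstddef>
    #include <vector>
    using std::vector;

    // Estimate the dominant eigenvector of a symmetric matrix (e.g., the
    // feature covariance Sigma) by power iteration: v <- Sigma v, renormalize.
    vector<double> firstPrincipalComponent(const vector<vector<double>>& sigma,
                                           int iterations = 100) {
        std::size_t d = sigma.size();
        vector<double> v(d, 1.0);  // arbitrary starting vector
        for (int it = 0; it < iterations; ++it) {
            vector<double> w(d, 0.0);
            for (std::size_t i = 0; i < d; ++i)
                for (std::size_t j = 0; j < d; ++j)
                    w[i] += sigma[i][j] * v[j];
            double norm = 0.0;
            for (double x : w) norm += x * x;
            norm = std::sqrt(norm);
            for (std::size_t i = 0; i < d; ++i) v[i] = w[i] / norm;
        }
        return v;  // converges toward the axis of maximum variance
    }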
Appendix B
B.1 Glossary
This section is a glossary of the basic terminology used in this thesis, provided for quick reference. In most cases the meaning of these terms corresponds to their common use in the existing literature, but in some cases (mainly interaction- and implementation-related) they are somewhat idiomatic. Because of the short nature of the definitions, some details that do not affect the use of the term in this thesis are left out. For more complete and correct definitions the reader should consult the appropriate literature.
Analysis windows: short segments of sound with presumed relatively stable characteristics used to analyze audio content.

Beat: a relatively regular sequence of pulses that coincides with significant musical events.

Beat tracking: the process of extracting and following in real time the main beat of a piece of music.

Browser: an interface for interacting with a collection of audio files.

Classification: the process of automatically assigning a categorical label to an audio signal from a list of predefined categories.
Clustering: the process of forming groups (clusters) of audio signals based on their similarity.

Collection: a list of audio files (typical size >= 100 items).

Computer Audition: the extraction of information from audio signals (as in Computer Vision).

Editor: an interface for interacting with a specific audio file.

Feature Vector: a fixed-length array of real numbers used to represent the content of a short segment of sound called the analysis window. A feature vector can be thought of as a point in a multidimensional Feature Space.

Feature Space: a multidimensional space that contains Feature Vectors as points.

Fundamental frequency: the main frequency of oscillation present in a sound.

Mapping: a particular setting of interface parameters.

Monitor: a display that is updated in real time in response to the audio that is being played.

MIDI: an interface specification for controlling digital music instruments. Also a file format (similar to a musical score) that stores control information for playing a piece of music using digital music instruments.

MPEG audio compression: Motion Pictures Expert Group audio compression is a perceptually based lossy audio compression scheme. The well-known mp3 files are stored using this compression format.

Mp3 files: audio signals compressed using MPEG audio compression.
Musical Score: a symbolic representation of music that consists of the start, duration, volume and instrument quality of every note in a piece. Typically, musical scores are represented graphically.

Note: a sound of a particular pitch.

Octave: the relation between two sounds that have a frequency ratio of two.

Pitch: a subjective property of sound that can be used to order sounds from low to high and is typically related to the fundamental frequency.

Query User Interfaces: generate and/or filter audio signals in order to specify an audio retrieval query.

Segmentation: the process of detecting changes of Sound Texture.

Selection: a list of selected audio files in a collection, or a specific region in time in an audio file.

Similarity retrieval: the process of detecting files similar to a query and returning them ranked by their similarity.

Sound Texture: a particular type of sound mixture that can be characterized by its overall sound and audio feature statistics.

Texture windows: larger windows used to characterize sound "texture" by collecting statistics of the "analysis windows".

Timeline: a particular segmentation of an audio file into non-overlapping regions in time.

Thumbnailing: the process of creating a short sound segment that is representative of a longer audio signal.

View: a particular way of viewing a specific collection.
Appendix C
C.1 Publications
The various topics and ideas in this thesis were presented in thematic order. The careful reader might have observed, while reading the evaluation chapter, that certain experiments have not been performed as a consequence of the chronological order of publications. For example, in evaluating similarity retrieval a subset of the musical content features was used, because at the time the Beat Histogram and Pitch Histogram features had not been designed. This appendix provides a list of the publications that this thesis is based on, in chronological order, with a short description of their content.
1. A framework for audio analysis based on temporal segmentation and classification, 1999 [124]
First description of the Marsyas software framework. Initial proposal of the segmentation algorithm and implementation of Music/Speech classification based on [110] using the framework.

2. Multifeature audio segmentation for browsing and annotation, 1999 [125]
Full description of the multifeature segmentation methodology and proposal of a feature set. First user study to evaluate the segmentation algorithm.
3. Experiments in computer-assisted annotation of audio, 2000 [128]
Second user study for evaluating segmentation. Annotation and thumbnailing results collected. The experiments were conducted with the help of the developed Enhanced Audio Editor user interface.

4. Sound analysis using MPEG compressed audio, 2000 [130]
Proposal of MPEG-based compressed-domain audio features. Evaluation of the features based on Music/Speech classification and segmentation.

5. Marsyas: a framework for audio analysis, 2000 [129]
An enhanced journal publication of [124].

6. Audio Information Retrieval (AIR) Tools, 2000 [127]
A high-level overview of audio information retrieval. First description of Timbregrams and a similarity retrieval evaluation study.

7. 3D Graphics Tools for Isolated Sound Collections, 2000 [126]
Description of 3D graphics user interfaces. Introduction of Timbrespaces and sound effect query generators (PuCola, Gear, etc.).

8. MARSYAS3D: A prototype audio browser-editor using a large scale immersive visual and audio display, 2001 [131]
Description of graphics user interfaces and architecture. Emphasis on the use of the interface on the Princeton Display Wall.

9. Audio Analysis using the Discrete Wavelet Transform, 2001 [134]
Investigation of the Discrete Wavelet Transform for audio classification and beat analysis. Beat Histograms are introduced.
10. Automatic Musical Genre Classification of Audio Signals, 2001 [135]
First paper on musical genre classification. Use of spectral features and beat histograms for classification.

11. Audio Information Retrieval using Marsyas, 2002 [132]
Chapter in a book. High-level overview of Marsyas, essentially a short summary of this thesis.

12. Musical Genre Classification, 2002 [133]
Detailed experiments in musical genre classification. Pitch Histograms introduced and used together with spectral features and beat histogram features for classification.
Appendix D
D.1 Web Resources
Web resources related to the research presented in this thesis:
- http://soundlab.cs.princeton.edu/demo.php
Demonstrations.

- http://www.cs.princeton.edu/~gtzan/marsyas.html
The Marsyas software framework.

- http://www.cs.princeton.edu/~gtzan/publications.html
My publications.

- http://www.cs.princeton.edu/~gtzan/caudition.html
Lots of web resources (people, conferences, companies, publications, etc.)
Bibliography
[1] Proc. Int. Symposium on Music Information Retrieval (ISMIR), Plymouth, MA, 2000.
[2] Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2001.
[3] M. Alghoniemy and A. Tewfik. Rhythm and Periodicity Detection in Polyphonic Music. In Proc. 3rd Workshop on Multimedia Signal Processing, pages 185-190, Denmark, Sept. 1999.
[4] M. Alghoniemy and A. Tewfik. Personalized Music Distribution. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, Istanbul, Turkey, June 2000. IEEE.
[5] E. Allamanche, H. Jurgen, O. Hellmuth, B. Froba, T. Kastner, and M. Cremer. Content-based Identification of Audio Material using MPEG-7 Low Level Description. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2001.
[6] D. Anastassiou. Genomic Signal Processing. IEEE Signal Processing Magazine, 18(4):8-21, July 2001.
[7] B. Arons. SpeechSkimmer: a system for interactively skimming recorded speech. ACM Transactions on Computer-Human Interaction, 4:3-38, 1997. http://www.media.mit.edu/people/barons/papers/ToCHI97.ps.
[8] J.-J. Aucouturier and M. Sandler. Segmentation of Musical Signals Using Hidden Markov Models. In Proc. 110th Audio Engineering Society Convention, Amsterdam, The Netherlands, May 2001. Audio Engineering Society AES.
[9] D. H. Ballard and C. M. Brown. Computer Vision. Prentice Hall, 1982.
[10] M. A. Bartsch and G. H. Wakefield. To Catch a Chorus: Using Chroma-Based Representation for Audio Thumbnailing. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics, pages 15-19, Mohonk, NY, 2001. IEEE.
[11] S. Belongie, C. Carson, H. Greenspan, and J. Malik. Blobworld: A system for region-based image indexing and retrieval. In Proc. 6th Int. Conf. on Computer Vision, Jan. 1998.
[12] A. L. Berenzweig and D. P. Ellis. Locating singing voice segments within musical signals. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA, pages 119-123, Mohonk, NY, 2001. IEEE.
[13] J. Biles. GenJam: A Genetic Algorithm for Generating Jazz Solos. In Proc. Int. Computer Music Conf. (ICMC), pages 131-137, Aarhus, Denmark, Sept. 1994.
[14] G. Boccignone, M. De Santo, and G. Percannella. Joint Audio-Video Processing of MPEG Encoded Sequences. In Proc. Int. Conf. on Multimedia Computing and Systems (ICMCS), pages 225-229. IEEE, 1999.
[15] J. Boreczky and L. Wilcox. A hidden Markov model framework for video segmentation using audio and image features. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, volume 6, pages 3741-3744. IEEE, 1998.
[16] A. Bregman. Auditory Scene Analysis. MIT Press, Cambridge, 1990.
[17] J. Brown. Computer identification of musical instruments. Journal of the Acoustical Society of America, 105(3):1933-1941, 1999.
[18] C. Chafe, B. Mont-Reynaud, and L. Rush. Toward an Intelligent Editor of Digital Audio: Recognition of Musical Constructs. Computer Music Journal, 6(1):30-41, 1982.
[19] P. Cook. Physically inspired sonic modeling (PhISM): synthesis of percussive sounds. Computer Music Journal, 21(3), Aug. 1997.
[20] P. Cook. Toward physically-informed parametric synthesis of sound effects. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA, New Paltz, NY, 1999. Invited Keynote Address.
[21] P. Cook, editor. Music, Cognition, and Computerized Sound. MIT Press, 2001.
[22] P. Cook, G. Essl, and D. Trueman. N >> 2: Multi-speaker Display Systems for Virtual Reality and Spatial Audio Projection. In Proc. Int. Conf. on Auditory Display (ICAD), Glasgow, Scotland, 1998.
[23] P. Cook and G. Scavone. The Synthesis Toolkit (STK), version 2.1. In Proc. Int. Computer Music Conf. ICMC, Beijing, China, Oct. 1999. ICMA.
[24] R. Dannenberg. An on-line algorithm for real-time accompaniment. In Proc. Int. Computer Music Conf., pages 187-191, Paris, France, 1984.
[25] I. Daubechies. Orthonormal bases of compactly supported wavelets. Communications on Pure and Applied Math, 41:909-996, 1988.
[26] S. Davis and P. Mermelstein. Experiments in syllable-based recognition of continuous speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 28:357-366, Aug. 1980.
[27] H. Deshpande, R. Singh, and U. Nam. Classification of Musical Signals in the Visual Domain. In Proc. COST G-6 Conf. on Digital Audio Effects (DAFX), Limerick, Ireland, Dec. 2001.
[28] S. Dixon. A Lightweight Multi-agent Musical Beat Tracking System. In Proc. Pacific Rim Int. Conf. on Artificial Intelligence, pages 778-788, 2000.
[29] S. Dixon. An Interactive Beat Tracking and Visualization System. In Proc. Int. Computer Music Conf. (ICMC), pages 215-218, Habana, Cuba, 2002. ICMA.
[30] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, New York, 2000.
[31] D. Ellis. Prediction-driven computational auditory scene analysis. PhD thesis, MIT Media Lab, 1996.
[32] A. Eronen and A. Klapuri. Musical Instrument Recognition using Cepstral Features and Temporal Features. In Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, Istanbul, Turkey, 2000. IEEE.
[33] G. Essl and P. Cook. Measurements and efficient simulations of bowed bars. Journal of the Acoustical Society of America (JASA), 108(1):379-388, 2000.
[34] U. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2002.
[35] M. Fernstrom and C. McNamara. After Direct Manipulation - Direct Sonification. In Proc. Int. Conf. on Auditory Display, ICAD, Glasgow, Scotland, 1998.
[36] M. Fernstrom and E. Brazil. Sonic Browsing: an auditory tool for multimedia asset management. In Proc. Int. Conf. on Auditory Display (ICAD), Espoo, Finland, July 2001.
[37] M. Flickner et al. Query by image and video content: the QBIC system. IEEE Computer, 28(9):23-32, Sept. 1995.
[38] J. Foote. Content-based retrieval of music and audio. In Multimedia Storage and Archiving Systems II, pages 138-147, 1997.
[39] J. Foote. An overview of audio information retrieval. ACM Multimedia Systems, 7:2-10, 1999.
[40] J. Foote. Visualizing music and audio using self-similarity. In ACM Multimedia, 1999.
[41] J. Foote. Arthur: Retrieving orchestral music by long-term structure. Read at the First International Symposium on Music Information Retrieval, 2000.
[42] J. Foote. Automatic Audio Segmentation using a Measure of Audio Novelty. In Proc. Int. Conf. on Multimedia and Expo, volume 1, pages 452-455. IEEE, 2000.
[43] J. Foote and S. Uchihashi. The Beat Spectrum: a new approach to rhythmic analysis. In Int. Conf. on Multimedia & Expo. IEEE, 2001.
[44] I. Fujinaga. Machine recognition of timbre using steady-state tone of acoustic instruments. In Proc. Int. Computer Music Conf. ICMC, pages 207-210, Ann Arbor, Michigan, 1998. ICMA.
[45] I. Fujinaga. Realtime recognition of orchestral instruments. In Int. Computer Music Conf. ICMC, pages 141-143. ICMA, 2000.
[46] B. Garton. Virtual Performance Modelling. In Proc. Int. Computer Music Conf. (ICMC), pages 219-222, San Jose, California, Oct. 1992.
[47] A. Ghias, J. Logan, D. Chamberlin, and B. Smith. Query by Humming: Musical Information Retrieval in an Audio Database. ACM Multimedia, pages 213-236, 1995.
[48] M. Goto and Y. Muraoka. Real-time rhythm tracking for drumless audio signals - chord change detection for musical decisions. In Proc. Int. Joint Conf. on Artificial Intelligence: Workshop on Computational Auditory Scene Analysis, 1997.
[49] M. Goto and Y. Muraoka. Music Understanding at the Beat Level: Real-time Beat Tracking of Audio Signals. In D. Rosenthal and H. Okuno, editors, Computational Auditory Scene Analysis, pages 157-176. Lawrence Erlbaum Associates, 1998.
[50] J. M. Gray. An Exploration of Musical Timbre. PhD thesis, Dept. of Psychology, Stanford Univ., 1975.
[51] W. Grosky, R. Jain, and R. Mehrotra, editors. The Handbook of Multimedia Information Management. Prentice Hall, 1997.
[52] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502-516, 1989.
[53] A. Hauptmann and M. Witbrock. Informedia: News-on-demand multimedia information acquisition and retrieval. In Intelligent Multimedia Information Retrieval, chapter 10, pages 215-240. MIT Press, Cambridge, 1997. http://www.cs.cmu.edu/afs/cs/user/alex/www/.
[54] T. Hermann, P. Meinicke, and H. Ritter. Principal Curve Sonification. In International Conference on Auditory Display, ICAD, 2000.
[55] R. E. Johnson. Components, Frameworks, Patterns. In Proc. ACM SIGSOFT Symposium on Software Reusability, pages 10-17, 1997.
[56] L. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[57] J. M. Jose, J. Furner, and D. J. Harper. Spatial querying for image retrieval: a user-oriented evaluation. In Proc. SIGIR Conf. on Research and Development in Information Retrieval, Melbourne, Australia, 1998. ACM.
[58] I. JTC1/SC29. Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s - IS 11172 (Part 3, Audio). 1992.
[59] I. JTC1/SC29. Information Technology - Generic Coding of Moving Pictures and Associated Audio Information - IS 13818 (Part 3, Audio). 1994.
[60] H. Kang and B. Shneiderman. Visualization Methods for Personal Photo Collections: Browsing and Searching in the PhotoFinder. In Proc. Int. Conf. on Multimedia and Expo, New York, 2000. IEEE.
[61] N. Kashino and T. Kinoshita. Application of Bayesian probability network to music scene analysis. In Proc. Int. Joint Conf. on Artificial Intelligence, CASA Workshop, 1995.
[62] T. Kemp, M. Schmidt, M. Westphal, and A. Waibel. Strategies for automatic segmentation of audio data. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), volume 3, pages 1423-1426. IEEE, 2000.
[63] D. Kimber and L. Wilcox. Acoustic segmentation for audio browsers. In Interface Conference, Sydney, Australia, July 1996.
[64] G. E. Krasner and S. T. Pope. A cookbook for using the model-view-controller user interface paradigm in Smalltalk-80. Journal of Object-Oriented Programming, 1(3):26-49, Aug. 1988.
[65] R. Kronland-Martinet, J. Morlet, and A. Grossman. Analysis of sound patterns through wavelet transforms. Int. Journal of Pattern Recognition and Artificial Intelligence, 1(2):237-301, 1987.
[66] J. Laroche. Estimating Tempo, Swing and Beat Locations in Audio Recordings. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA, pages 135-139, Mohonk, NY, 2001. IEEE.
[67] F. Lerdahl and R. Jackendoff. A Generative Theory of Tonal Music. MIT Press, 1983.
[68] G. Li and A. Khokar. Content-based indexing and retrieval of audio data using wavelets. In Int. Conf. on Multimedia and Expo (II), pages 885-888. IEEE, 2000.
[69] K. Li et al. Building and using a scalable display wall system. IEEE Computer Graphics and Applications, 21(3), July 2000.
[70] S. Li. Content-based classification and retrieval of audio using the nearest feature line method. IEEE Transactions on Speech and Audio Processing, 8(5):619-625, Sept. 2000.
[71] J. List, A. van Ballegooij, and A. de Vries. Known-Item Retrieval on Broadcast TV. Technical Report INS-R0104, CWI, Apr. 2001.
[72] B. Logan. Mel Frequency Cepstral Coefficients for Music Modeling. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2000.
[73] B. Logan. Music summarization using key phrases. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP. IEEE, 2000.
[74] L. Lu, J. Hao, and Z. HongJiang. A robust audio classification and segmentation method. In Proc. ACM Multimedia, Ottawa, Canada, 2001.
[75] P. C. Mahalanobis. On the generalized distance in statistics. In Proc. National Inst. of Science India, volume 12, pages 49-55, 1936.
[76] J. Makhoul. Linear prediction: A tutorial overview. Proceedings of the IEEE, 63:561-580, Apr. 1975.
[77] S. G. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.
[78] G. Marchionini. Information Seeking in Electronic Environments. The Press Syndicate of the University of Cambridge, 1995.
[79] K. Martin. A Blackboard System for Automatic Transcription of Simple Polyphonic Music. Technical Report 399, MIT Media Lab, 1996.
[80] K. Martin. Toward automatic sound source recognition: identifying musical instruments. In NATO Computational Hearing Advanced Study Institute, Il Ciocco, IT, 1998.
[81] K. Martin. Sound-Source Recognition: A Theory and Computational Model. PhD thesis, MIT Media Lab, 1999.
[82] K. Martin, E. Scheirer, andB. Vercoe. Musical contentanalysisthroughmodelsof audition. In Proc.MultimediaWorkshopon Content-basedProcessingof Music,Bristol, UK, 1998.ACM.
[83] N. Masako and K. Watanabe. Interactive Music Composerbasedon NeuralNetworks. In Proc.Int. ComputerMusicConf. (ICMC), SanJose,California,1992.
[84] T. K. Moon. The Expectation-Maximization Algorithm. IEEE SignalProcessingMagazine, 13(6):47–60,Nov. 1996.
[85] F. Moore. Elementsof ComputerMusic. PrenticeHall, 1990.
[86] J. Moorer. On the Segmentation and Analysis of Continuous Musical SoundbyDigital Computer. PhDthesis,Dept.of Music,StanfordUniversity, 1975.
[87] J. Moorer. The Lucasfilm Audio Signal Processor. ComputerMusic Journal,6(3):30–41, 1982.(alsoin ICASSP82).
[88] P. Noll. MPEG digital audio coding. IEEE Signal Processing Magazine, pages59–81,September1997.
[89] A. Oppenheimand R. Schafer. Discrete-Time SignalProcessing. PrenticeHall,Edgewood Clif fs, NJ,1989.
[90] L. OttavianiandD. Rocchesso.Separationof SpeechSignalfromComplex AuditoryScenes. In Proc. COSTG-6 Conf. on Digital Audio Effects (DAFX), Limerick,Ireland,Dec.2001.
[91] A. Pentland,R. Picard, and S. Sclaroff. Photobook: Tools for Content-BasedManipulationof ImageDatabases.IEEEMultimedia, pages73–75,July1994.
[92] D. Perrot and R. Gjerdigen. Scanningthe dial: An exploration of factors inidentificationof musicalstyle. In Proc.Societyfor MusicPerceptionandCognition,page88,1999.(abstract).
[93] S. Pfeiffer. Pauseconceptsfor audiosegmentation at differentsemanticlevels. InProc.ACM Multimedia, Ottawa,Canada,2001.
[94] J. Pierce. Consonance and Scales. In P. Cook, editor, Music, Cognition, and Computerized Sound, pages 167–185. MIT Press, 1999.
[95] D. Pye. Content-based methods for the management of digital music. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2000.
[96] L. Rabiner, M. Cheng, A. Rosenberg, and C. McGonegal. A comparative performance study of several pitch detection algorithms. IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-24:399–417, Oct. 1976.
[97] L. Rabiner and B. Gold. Theory and Application of Digital Signal Processing. Prentice Hall, 1975.
[98] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.
[99] C. Roads. Computer Music Tutorial. MIT Press, 1996.
[100] K. Rodden, W. Basalaj, D. Sinclair, and K. Wood. Does Organization by Similarity Assist Image Browsing? In Proc. SIGCHI Conf. on Human Factors in Computing Systems, Seattle, WA, USA, 2001. ACM.
[101] D. F. Rosenthal and H. G. Okuno, editors. Computational Auditory Scene Analysis. Lawrence Erlbaum, 1998.
[102] S. Rossignol, X. Rodet, et al. Features extraction and temporal segmentation of acoustic signals. In Proc. Int. Computer Music Conf. (ICMC), pages 199–202. ICMA, 1998.
[103] J. Saunders. Real-time discrimination of broadcast speech/music. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 993–996. IEEE, 1996.
[104] G. Scavone, S. Lakatos, P. Cook, and C. Harbke. Perceptual spaces for sound effects obtained with an interactive similarity rating program. In Proc. Int. Symposium on Musical Acoustics, Perugia, Italy, Sept. 2001.
[105] R. Schalkoff. Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley & Sons, 1992.
[106] E. Scheirer. Bregman's chimerae: Music perception as auditory scene analysis. In Proc. Int. Conf. on Music Perception and Cognition, Montreal, 1996.
[107] E. Scheirer. The MPEG-4 structured audio standard. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1998.
[108] E. Scheirer. Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1):588–601, Jan. 1998.
[109] E. Scheirer. Music-Listening Systems. PhD thesis, MIT, 2000.
[110] E. Scheirer and M. Slaney. Construction and evaluation of a robust multifeature speech/music discriminator. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 1331–1334. IEEE, 1997.
[111] D. Schwarz. A system for data-driven concatenative sound synthesis. In Proc. COST G-6 Conf. on Digital Audio Effects (DAFX), Verona, Italy, Dec. 2000.
[112] J.Sepp̈anen.QuantumGrid Analysisof MusicalSignals.In Proc.Int. Workshoponapplicationsof SignalProcessing to AudioandAcousticsWASPAA, pages131–135,Mohonk, NY, 2001.IEEE.
[113] R.N. Shepard.Circularityin Judgementsof RelativePitch.Journalof theAcousticalSocietyof America, 35:2346–2353,1964.
[114] B. Shneiderman.Designing the User Interface: Strategies for EffectiveHuman-ComputerInteraction. Addison-Wesley, 3rded.edition,1998.
[115] M. Slaney. A critique of pure audition. In Computational Auditory Scene Analysis, 1997.
[116] M. Slaney and R. Lyon. A perceptual pitch detector. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 357–360, Albuquerque, NM, 1990. IEEE.
[117] M. Slaney and R. Lyon. On the importance of time: a temporal representation of sound. In M. Cooke, B. Beet, and M. Crawford, editors, Visual Representations of Speech Signals, pages 95–116. John Wiley & Sons Ltd, 1993.
[118] P. Smaragdis. Redundancy Reduction for Computational Audition, a Unifying Approach. PhD thesis, MIT Media Lab, 2001.
[119] L. Smith. A Multiresolution Time-Frequency Analysis and Interpretation of Musical Rhythm. PhD thesis, University of Western Australia, July 1999.
[120] R. Spence. Information Visualization. Addison-Wesley ACM Press, 2001.
[121] K. Steiglitz. A Digital Signal Processing Primer. Addison-Wesley, 1996.
[122] S. R. Subramanya and A. Youssef. Wavelet-based indexing of audio data in audio/multimedia databases. In Proc. Int. Workshop on Multimedia Database Management Systems (IW-MMDBMS), pages 46–53, 1998.
[123] T. Tolonen and M. Karjalainen. A Computationally Efficient Multipitch Analysis Model. IEEE Trans. on Speech and Audio Processing, 8(6):708–716, Nov. 2000.
[124] G. Tzanetakis and P. Cook. A Framework for Audio Analysis based on Classification and Temporal Segmentation. In Proc. Euromicro 99 Workshop on Music Technology and Audio Processing, 1999.
[125] G. Tzanetakis and P. Cook. Multifeature audio segmentation for browsing and annotation. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 1999. IEEE.
[126] G. Tzanetakis and P. Cook. 3D Graphics Tools for Sound Collections. In Proc. COST G-6 Conf. on Digital Audio Effects (DAFX), Verona, Italy, Dec. 2000.
[127] G. Tzanetakis and P. Cook. Audio Information Retrieval (AIR) Tools. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2000.
[128] G. Tzanetakis and P. Cook. Experiments in computer-assisted annotation of audio. In Proc. Int. Conf. on Auditory Display (ICAD), 2000.
[129] G. Tzanetakis and P. Cook. Marsyas: A framework for audio analysis. Organised Sound, 4(3), 2000.
[130] G. Tzanetakis and P. Cook. Sound analysis using MPEG compressed audio. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, 2000. IEEE.
[131] G. Tzanetakis and P. Cook. Marsyas3D: A prototype audio browser-editor using a large scale immersive visual and audio display. In Proc. Int. Conf. on Auditory Display (ICAD), Espoo, Finland, Aug. 2001.
[132] G. Tzanetakis and P. Cook. Audio Information Retrieval using Marsyas. Kluwer Academic Publishers, 2002. (To be published.)
[133] G. Tzanetakis and P. Cook. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, 2002. (Accepted for publication.)
[134] G. Tzanetakis, G. Essl, and P. Cook. Audio Analysis using the Discrete Wavelet Transform. In Proc. Conf. in Acoustics and Music Theory Applications. WSES, Sept. 2001.
[135] G. Tzanetakis, G. Essl, and P. Cook. Automatic Musical Genre Classification of Audio Signals. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), Oct. 2001.
[136] G. Tzanetakis and L. Julia. Multimedia Structuring using Trees. In Proc. RIAO 2000 "Content-based multimedia information access", Paris, France, Apr. 2000.
[137] K. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
[138] R. Vertegaal and E. Bonis. ISEE: An Intuitive Sound Editing Environment. Computer Music Journal, 18(2):21–29, 1994.
[139] R. Vertegaal and B. Eaglestone. Looking for Sound? Selling Perceptual Space in Hierarchically Nested Boxes. In Summary CHI 98 Conf. on Human Factors in Computing Systems. ACM, 1998.
[140] S. Wake and T. Asahi. Sound Retrieval with Intuitive Verbal Expressions. In Proc. Int. Conf. on Auditory Display (ICAD), Glasgow, Scotland, 1997.
[141] Y. Wang and M. Vilermo. A compressed domain beat detector using MP3 audio bitstreams. In Proc. ACM Multimedia, Ottawa, Canada, 2001.
[142] B. Whitman, G. Flake, and S. Lawrence. Artist Detection in Music with Minnowmatch. In Proc. Workshop on Neural Networks for Signal Processing, pages 559–568, Falmouth, Massachusetts, Sept. 2001. IEEE.
[143] E. Wold, T. Blum, D. Keislar, and J. Wheaton. Content-based classification, search and retrieval of audio. IEEE Multimedia, 3(2), 1996.
[144] C. Yang. MACS: Music Audio Characteristic Sequence Indexing for Similarity Retrieval. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 2001. IEEE.
[145] C. Yang. Music Database Retrieval based on Spectral Similarity. In Proc. Int. Symposium on Music Information Retrieval (ISMIR, poster), Bloomington, Indiana, 2001.
[146] B. Yeo and B. Liu. Rapid Scene Analysis on Compressed Videos. IEEE Trans. Circuits and Systems for Video Technology, 5(6), 1995.
[147] T. Zhang and J. Kuo. Audio Content Analysis for Online Audiovisual Data Segmentation and Classification. IEEE Transactions on Speech and Audio Processing, 9(4):441–457, May 2001.
[148] A. Zils and F. Pachet. Musical Mosaicing. In Proc. COST G-6 Conf. on Digital Audio Effects (DAFX), Limerick, Ireland, Dec. 2001.