Manipulation and Resynthesis of Environmental Sounds with
Natural Wavelet Grains
by
Reynald Hoskinson
B.A. (English with Computer Science Minor)
McGill University, 1996
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE STUDIES
(Department of Computer Science)
We accept this thesis as conforming to the required standard
The University of British Columbia
March 2002
© Reynald Hoskinson, 2002
Abstract
A technique is presented to facilitate the creation of constantly changing, randomized audio streams from samples of source material. A core motivation is to make it easier to quickly create soundscapes for virtual environments and other scenarios where long streams of audio are used. While mostly in the background, these streams are vital for the creation of mood and realism in these types of applications.

Our approach is to extract the component parts of sampled audio signals, and use them to synthesize a continuous audio stream of indeterminate length. An automatic speech recognition algorithm involving wavelets is used to split up the input signal into syllable-like audio segments. The segments are taken from the original sample and are not transformed in any way.

For each segment, a table of similarity between it and all the other segments is constructed. The segments are then output in a continuous stream, with the next segment being chosen from among those other segments which best follow from it. In this way, we can construct an infinite number of variations on the original signal with a minimum amount of interaction. An interface for the manipulation and playback of several of these streams is provided to facilitate building complex audio environments.
Contents
Abstract

Contents

List of Figures

Acknowledgements

1 Introduction
1.1 Problem and Motivation
1.1.1 Natural Grains
1.1.2 Ecological Perception
1.1.3 Objectives
1.2 Thesis Organization

2 Background and Related Work
2.1 Overview
2.2 Representation
2.3 Signal Transforms
2.4 The Wavelet Transform
2.4.1 Implementation
2.5 Audio Segmentation using Wavelets
2.5.1 Speech Classification Techniques
2.5.2 Using Differences between Coefficients
2.6 Segmenting in the Wavelet Domain
2.7 Wavelet Packets
2.8 Granular Synthesis
2.9 Concatenative Sound Synthesis
2.10 Physically-Based Synthesis

3 Our Early Attempts at Segmenting Audio Samples
3.1 A Streaming Granular Synthesis Engine

4 Segmentation and Resynthesis
4.1 Segmentation
4.2 Grading the Transitions
4.3 Resynthesis
4.3.1 Cross-fading
4.3.2 Thresholding
4.4 Implementation
4.5 Implementing the Wavelet Transform
4.6 Real-time Considerations
4.7 Segmentation/Resynthesis Control Interface
4.8 Preset Mechanism
4.9 Discouraging Repetition

5 Results and Evaluation
5.1 User Study
5.1.1 Participants
5.1.2 Experimental Procedure
5.1.3 Results
5.1.4 Discussion

6 Conclusions and Future Work
6.1 Overview
6.2 Goals and Results
6.3 Future Work

Bibliography
List of Figures
2.1 Contrast between frequency-based, STFT-based, and wavelet views of the signal
2.2 The wavelet filtering process
2.3 Distance measure for transition between frames 2 and 3. The arrows represent difference calculations.
2.4 Wavelet packet decomposition
4.1 Input waveform
4.2 Portion of output waveform
4.3 Segmented waveform and interface
4.4 Segmented stream management interface
5.1 Correct scores per subject and sample. Each score is out of 6.
5.2 Percentage correct answers per subject, over all samples
5.3 Statistics for the number of correct responses
5.4 Percentage correct per sample, compiled over all subjects
Acknowledgements
I'd like to thank my supervisor, Dinesh K. Pai, and Holger Hoos, who provided adroit feedback at several stages along the way. Also, I appreciate the efforts of Antoine Maloney, who has always given me good advice, although I haven't always followed it.
REYNALD HOSKINSON
The University of British Columbia
March 2002
Chapter 1
Introduction

1.1 Problem and Motivation
Natural sounds are an infinite source of material for anyone working with audio. Although the source may be infinite, there are many situations where one sample has to be used repeatedly. Electro-acoustic music composers often use samples as motifs that reappear again and again over the course of a piece. Acoustic installations sometimes stretch pre-obtained source material over the entire life of an exhibit. Video games can use the same sample ad infinitum during gameplay. Simple repetition is not effective for long, so we often create variations of a sample by manipulating one or more of its properties.
There is a long tradition in the electro-acoustic music community of splitting audio samples into portions and manipulating them to create new sounds. Curtis Roads [Roa78] and Barry Truax [Tru94] pioneered granular synthesis, in which small grains are combined to form complicated sounds. Grains can be constructed from scratch, or obtained by splitting an audio sample into small segments. More recently, Bar-Joseph [BJDEY+99] proposed a variant of granular synthesis using wavelets, where the separation and re-combination of grains is done in the time-frequency representation of an audio sample. Similar work is also being done on images to produce variations of tiles or textures [WL00, SSSE00].
When what is desired is simply a variation on the original source that still bears a strong resemblance to the original, the above audio techniques have critical problems. Granular synthesis is a technique to create new sounds, not recognizable variations of the original except in a very abstract sense. A long audio sample is not even required; it suffices to specify the shape of the grain and its envelope. When an audio sample is used, a grain is an arbitrary slice chosen independently of the sound's inherent structure.
Attempts at better preserving the original structure of the sound have been made. Bar-Joseph [BJDEY+99] uses a comparison step where wavelet coefficients representing parts of the sample are swapped only when they are similar. The authors employ "statistical learning", which produces a different sound statistically similar to the original. In this algorithm, only the local neighbours in the multi-resolution tree are considered when calculating similarity, and the swapping is very fine-grained. This means that large-scale changes over time will not be taken into account. On almost any signal, this results in a "chattering" effect.
To address the limitations of the above methods, we have developed an algorithm for segmenting sound samples that focuses on determining natural transition points. The sound in between these transition points is considered atomic and is not broken up any further or transformed in any way. We refer to the sound between transition points as "natural grains".
Once we have the grains, creating new sounds becomes a problem of how best to string them together. We do this by constructing a first-order Markov chain, with each state of the chain corresponding to a natural grain. The transition probabilities from one state to the others are estimated based on the smoothness of transition between it and all other grains. The natural grains are thus output in a continuous stream, with the next grain being chosen at random from among those other grains which best follow from it. In this way, we can construct an arbitrarily large number of variations on the original signal with a minimum amount of user input.
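As an illustrative sketch of this resynthesis loop (hypothetical names, not the actual implementation, which is described in Chapter 4), the chain can be driven by a weighted random choice over a precomputed table of transition smoothness:

import java.util.Random;

// Sketch of the resynthesis loop: a first-order Markov chain over natural
// grains. smoothness[i][j] holds a precomputed score for how well grain j
// follows grain i; higher scores make a transition more likely.
public class GrainChain {
    private final double[][] smoothness;
    private final Random rng = new Random();

    public GrainChain(double[][] smoothness) {
        this.smoothness = smoothness;
    }

    // Pick the next grain at random, biased toward the smoothest transitions.
    public int next(int current) {
        double[] w = smoothness[current];
        double total = 0;
        for (double v : w) total += v;
        double r = rng.nextDouble() * total;
        for (int j = 0; j < w.length; j++) {
            r -= w[j];
            if (r <= 0) return j;
        }
        return w.length - 1; // numerical fallback
    }
}

In practice, as described above, the candidate set would first be restricted to the grains that best follow the current one before weighting.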
1.1.1 Natural Grains
Segmenting an audio sample into natural grains involves some understanding of the process by which the acoustic waves that are detected by our ears are transformed into the sounds we perceive. What cues do we use to distinguish one sound from another? More specifically, what are the clues that our brains pick up to distinguish where one sound ends and the next begins?
From the time Helmholtz published "On the Sensations of Tone as a Physiological Basis for the Theory of Music" in 1885 [Hel54], it was generally held that the steady-state components in a sound were the most important factor in human recognition. Risset and Mathews [RM69] wrote a seminal study of the time-varying spectra of trumpet tones which invalidated this hypothesis. Their work showed that the primacy of steady-state components was not realistic from the perspective of synthesizing realistic musical instrument sounds.
Instead, they proposed that the dynamic components of a spectrum were primary, and that steady-state components did not help very much at all for instrument classification and tone-quality assessment. Risset's hypothesis has now become the dominant view of sound structure in the psychological literature, and claims to represent the perceptibly important dynamic structures that comprise auditory phenomena from the perspective of musical instrument sound structures.
Handel [Han95] defines timbre as the perceptual qualities of objects and events, or "what it sounds like." The sense of timbre comes from the emergent, interactive properties of the vibration pattern. Clearly, any segmentation algorithm that purports to preserve the perceptual properties of the sound must segment on a larger scale than the local changes that make up the timbre of a sound event.
Drawing on the work of Helmholtz, Michael Casey [Cas98] enumerates the types of change in a sonic structure by examining the constraints of the human auditory system. Fourier persistence refers to how the cochlear mechanics of the ear are sensitive to changes on the order of 50 ms and shorter. The ear represents these changes as a static quality in log-frequency space. In other words, when our ears sense regular changes in air pressure at rates greater than 20 Hz, we perceive one pitch rather than each individual change in air pressure. 20 Hz is the frequency perception threshold of the cochlear mechanism.
We are, however, able to perceive changes occurring at rates less than 20 Hz as actual change. Those that are continuous in terms of the underlying Fourier components are classified as short-time changes in the static frequency spectrum. For example, when I drop a coin, there is a Fourier persistence due to the physical characteristics of a small metallic object. The short-time change reflects the individual impacts.
The above information leads us to focus our segmentation algorithm on changes at time scales longer than that of the 20 Hz threshold, i.e., longer than 1/20th of a second. Considering that we are looking at samples recorded at 44.1 kHz, our windows should therefore be at least 44100/20 = 2205 samples long.
1.1.2 Ecological Perception
There is a significant amount of literature arguing that the atomicity of human perception of sound is more on the level of what we have defined as a grain than that of an individual sound wave. J. J. Gibson [Gib79] originally introduced the term ecological perception to denote the idea that what an organism needs from a stimulus, for the purposes of its normal everyday life, is often obtained directly from invariant structures in the environment. Some types of complex stimuli may be considered as elemental from the perspective of an organism's perceptual apparatus, unmediated by higher-level mechanisms such as memory and inference.
While Gibson was primarily referring to the vision system, there are analogous patterns in hearing. Perception is not simply the integration of low-level stimuli, such as single pixels in the retina or narrow-band frequency channels in the cochlea, but instead involves directly perceivable groups of features.
Ecological perception was further explored for the auditory domain in William Gaver's pioneering work on everyday listening [Gav88]. Everyday listening involves perceiving the source of the sound and its material properties such as size and weight. Take, for instance, the sound of a door slowly closing on rusty hinges. In everyday listening, attention is focused on the door itself, the force with which it is being closed, the size of the room it is being closed in, and other material properties of the origin of the sound. This type of listening is differentiated from another type of auditory experience, musical listening, in which musical parameters such as pitch, duration and loudness are most important. In the door example, musical listening would involve hearing the change in pitch as the door opens, the particular timbre of the hinges, and the band-limited impulsive noise as the door hits the frame. Everyday listening instead involves distinguishing the individual events which produce the sounds that we hear.
From the perspective of everyday listening, the perceptual world is one where sounds have clear beginnings and endings, even continuous sounds such as wind that have no onsets or offsets. In this way, a grain could be defined by its temporal boundaries. However, the beginnings and endings of sounds are not necessarily due to the structure of the acoustic wave; often they are not physically marked by actual silent intervals. Our main task, then, is to find points in the acoustic wave which best approximate the beginnings and endings that we can perceive by listening to the sounds ourselves.
However, there is as yet no accepted mathematical framework within which to use this theory of perception in a systematic manner. Casey [Cas98] does provide a mathematical framework using group theory, but he is primarily concerned with extracting the structure of larger-scale sounds, such as the timing and spectral changes between bounces as a ball bounces a number of times before settling on the ground.
While any sound consists of a time-varying pattern of harmonics, when sounds from different sources overlap, all of the harmonic components are mixed in time and frequency. A listener can use the timing, harmonic, and amplitude and frequency modulation relationships among the components to parse the sound wave into discrete component sources. These component sources often occur contemporaneously, and are called "streams" in the literature of acoustic perception. In this thesis, however, we will limit ourselves to splitting sound solely in the time domain, with only one stream per sample.
1.1.3 Objectives
There will never be a lack of natural-world sounds to record and feed into a computer system. In an application which uses empirically recorded samples, playing them back blindly is inefficient in terms of resources and often less than optimal in terms of the desired effect the sound has on a user. The more information the application has about the sound, the more it can tailor its output to the situation.
Our implementation aims to provide users with a tool to automatically manipulate sound sources and soundscapes. Along with the core segmentation/resynthesis tool, we provide a higher-level interface which allows multiple randomized sounds to be played at once, each with a number of controls that affect how it appears in the soundscape. There are controls for automating how often a sound stream is triggered and how long an instance lasts. There are pan and gain controls for controlling stereo amplitude over the course of the instance. This allows a user to assemble a permanently changing auditory soundscape from just a few representative samples. Such a tool is useful for immersive virtual environments, video games, soundtracks for film, auditory displays, and even music composition.
1.2 Thesis Organization
This thesis is divided into six chapters. Chapter 1 introduced the problem and stated the objectives of the work. In Chapter 2, we provide a general background on the signal processing tools used. Chapter 3 details our earlier, different approach to creating sound textures. The segmentation and resynthesis algorithm is detailed in Chapter 4. Chapter 5 shows the results of the research, including some user testing to demonstrate utility. Finally, Chapter 6 summarizes goals and results, and offers some potential future research areas.
Chapter 2
Background and Related Work
2.1 Overview
This chapter will review related research into the representation and analysis of audio signals for the purpose of segmentation. The wavelet transform will then be introduced, and we will review some of the methods which use wavelets for signal segmentation. We then review some of the related work which uses wavelet packets. Finally, techniques from the electro-acoustic community such as granular synthesis and concatenative synthesis are briefly touched upon.
2.2 Representation
The sound samples we use for this algorithm are input in pulse-code modulation (PCM) format. PCM means that each signal sample is interpreted as a "pulse" at a particular amplitude. To determine where to segment an input audio sample, it is useful to change representations from the original PCM format to something which more compactly expresses the information we are interested in.
Ultimately, we need a representation that can aid us in determining when an audio signal changes, and by how much. With the location of change, we can segment the signal into natural grains. With a metric of how much the signal has changed, we can offer a threshold value that increases or decreases the coarseness of the grains, and also compare the grains to each other to estimate how they fit together.
There are a multitude of ways to represent a sound signal. Using an appropriately multi-resolutional approach, [SY98] identifies three main schemes for classifying audio signals:
1. Signal statistics, such as

(a) mean
(b) variance
(c) zero-crossings
(d) auto-correlation
(e) histograms of samples/differences of samples, either on the whole data or on blocks of data.
The main problem with low-level statistics is that they are vulnerable to alterations in the original signal, making them fragile in the presence of noise.
2. Acoustical attributes. Another general category of audio signal classification tools is acoustical attributes such as pitch, loudness, brightness and harmonicity. Statistical analysis is then applied to these attributes to derive feature vectors. Because we hope to preserve as many of the acoustical attributes as possible in our resynthesized sound, this appears to be a much better alternative.

Most of these measurements, however, are directed towards music, where the sound is already relatively structured. For less structured environmental sounds, where possibly many different events are happening simultaneously, these measurements are less effective. Calculating acoustic attributes is also much more expensive computationally overall, and suffers from the same lack of robustness to noise as traditional signal statistics.
3. Transform-based schemes. Here the coefficients of a transform of the signal are used for classification to reduce the susceptibility to noise. There are many advantages to using signal transforms in analysis, such as the potential for compression, and the ability to tailor the transform used to bring out the characteristics of the signal that are most important to the task at hand. The canonical transform used in audio processing is the Fourier transform, in part because of the efficiency of the Fast Fourier Transform, and its utility over a broad range of tasks. For reasons explained below, we instead chose the Discrete Wavelet Transform, which has the same algorithmic complexity as the FFT.
2.3 Signal Transforms
A spectrum can be loosely described as "a measure of the distribution of signal energy as a function of frequency" [Roa96]. We must define this term loosely because, according to Heisenberg's Uncertainty Principle, any attempt to improve the time resolution of a signal will degrade its frequency resolution [Wic94]. Both the time waveform and frequency spectrum cannot be made arbitrarily small simultaneously [Sko80]. The product of these two resolutions, the time-bandwidth product, remains constant for any system. So any representation of a signal's spectrum is necessarily a trade-off between these competing concerns.
Long a staple of digital signal processing, the Fourier transform is one way of calculating a spectrum. It was originally formulated by Jean Baptiste Joseph Fourier (1768-1830). Its main principle is that all complex periodic waveforms can be modeled by a set of harmonically related sine waves added together. The Fourier transform of a continuous time signal x(t) can be defined as:

X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \qquad (2.1)
The results of evaluating X(f) are analysis coefficients which define the global frequency f in a signal. As shown in Figure 2.1, the coefficients are computed as inner products of the signal with sine-wave basis functions of infinite duration. This means abrupt changes in time in a non-stationary signal are spread out over the whole frequency axis in X(f).

To obtain a localized view of the frequency spectrum, there is the Short-Time Fourier Transform (STFT), in which one divides the sample into short frames, then puts each through the FFT. As shown in Figure 2.1 below, the smaller the frame, the better the time resolution, but employing frames raises a whole new set of problems.
To begin with, it is impossible to resolve frequencies whose periods are longer than the frame. Using frames also has the side effect of distorting the spectrum measurement. This is because we are measuring not purely the input signal, but instead the product of the input signal and the frame itself. The spectrum that results is the convolution of the spectra of the input and the frame signals.
For each frame, we can think of the STFT as applying a bank of filters at equally spaced frequency intervals. The frequencies are spaced at integer multiples of the sampling frequency divided by the frame length. Artifacts of frame analysis arise from the fact that the samples analyzed do not always contain an integer number of periods of the frequencies they contain. There are a number of strategies to curb the effect of this "leakage", such as employing an envelope on each frame that accentuates the middle of the frame at the expense of the sides, where most of the leaking is [Roa96].
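As an illustration of such an envelope (the thesis does not prescribe a particular window; the Hann, or raised-cosine, window shown here is one common choice):

// Apply a Hann (raised-cosine) envelope to a frame in place. The frame
// edges, where most leakage originates, are attenuated toward zero.
static void applyHann(double[] frame) {
    int n = frame.length;
    for (int i = 0; i < n; i++) {
        frame[i] *= 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
    }
}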
For audio purposes, the FFT also has the drawback that it divides the frequency spectrum into equal linear segments. The user is put into an inescapable quandary: narrow frames provide good time resolution but poor frequency resolution, while wide frames provide good frequency resolution but poor time resolution. Moreover, if the frames are too wide, the signal within them cannot be assumed to be stationary, which is something the FFT depends on.
The problem with using linear segments is that humans perceive pitch on a scale closer to logarithmic [War99]. We are relatively good at resolving low-frequency sounds, but as the frequency increases, our ability to recognize differences decreases.
2.4 The Wavelet Transform
A wavelet is a waveform with very specific properties, such as an average value of zero and an effectively limited duration. Analysis with wavelets involves breaking up a signal into shifted and scaled versions of the original (or mother) wavelet. Wavelet analysis uses a time-scale region rather than a time-frequency region. Since only artificial tones are purely sinusoidal, this is not in itself a drawback.
The wavelet transform is capable of revealing aspects of data that other signal analysis techniques miss, such as trends, breakdown points, discontinuities in higher derivatives, and self-similarity. Intuitively, the wavelet decomposition calculates a "resemblance index" between the signal and the wavelet. A large index means the resemblance is strong; otherwise it is slight. The indices are the wavelet coefficients.
In contrast to the linear spacing of channels on the frequency axis in the STFT, the wavelet transform uses a logarithmic division of bandwidths. This implies that the relative channel bandwidth ∆f/f is constant for the wavelet transform, while in the STFT, the frame duration is fixed.
To define the continuous wavelet transform (CWT), we start by confining the impulse responses of a particular filter bank to be scaled versions of the same prototype ψ(t):

\psi_a(t) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{t}{a}\right) \qquad (2.2)

where a is a scale factor, and the constant 1/√a is used for energy normalization. ψ(t) is often referred to as the mother wavelet. With the mother wavelet, we can define the continuous wavelet transform (CWT) as

CWT_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt \qquad (2.3)
Here * denotes complex conjugation, and x(t) is the original signal. As the scale a increases, the scaled wavelet ψ(t/a) (the filter impulse response) becomes spread out in time and thus takes only longer durations into account. Both global and very local variations are measured by using different scales a; b controls the translation of the wavelet along the signal.
We will limit our discussion to the discrete wavelet transform, which involves choosing dyadic scales and positions (powers of two). In the discrete wavelet transform, a decomposition into wavelet bases requires only the values of the transform at the dyadic scales:

a = 2^j \quad \text{and} \quad b = k \cdot 2^j

The analysis is more efficient, and just as accurate for our purposes, as the continuous wavelet transform. The discrete wavelet transform approach was first developed by Mallat [Mal89]. It is based on a classical scheme known as the two-channel sub-band coder.
Unlike the FFT, which uses sinusoids of infinite duration, wavelets are localized in time. Leakage is also a different concern with the wavelet transform. If we confine the length of our signal to a power of two, the wavelet transform will analyze the signal in an integer number of steps, so leakage is not an issue. It doesn't matter that the frequencies present in the signal don't line up with power-of-two boundaries, since we are not measuring frequencies per se, but scales.
The effectiveness of the Discrete Wavelet Transform (DWT) for a particular application can depend on the choice of the wavelet function. For example, Mallat [MZ92] has shown that if a wavelet function which is the first derivative of a smoothing function is chosen, then the local maxima of the DWT indicate the sharp variations in the signal, whereas the local minima indicate slow variations.
As Figure 2.1 shows, the wavelet transform gives better scale resolution for lower frequencies, but worse time resolution. Higher frequencies, on the other hand, have less resolution in scale, but better resolution in time.
One notable aspect of wavelet transforms as they pertain to audio processing is that they are not shift-independent. For this reason, we use the energies of the coefficients rather than the raw coefficients themselves for our metrics, as in [PK99].

Figure 2.1: Contrast between frequency-based, STFT-based, and wavelet views of the signal
2.4.1 Implementation
From an implementation-centred point of view, we can think of the wavelet transform as a set of filter banks, as in Figure 2.2.

Figure 2.2: The wavelet filtering process
Here the original signal, S, passes through two complementary filters and emerges as two signals. The low- and high-pass decomposition filters (L and H), together with their associated reconstruction filters (L′ and H′), form a system of quadrature mirror filters.

The filtering process is implemented by convolving the signal with a filter. Initially, we end up with twice as many samples. Throwing away every second data point (downsampling) solves this problem. We are still able to regenerate the entire original sample with the downsampled signal.
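A minimal sketch of one decomposition step, using the Haar filter pair for brevity (the filter actually used in our implementation is discussed in Section 4.5); for Haar, the convolution followed by downsampling reduces to combining pairs of samples:

// One level of the DWT: low- and high-pass filtering followed by
// downsampling by two. Returns {approximation, detail} coefficients.
// Assumes signal.length is even.
static double[][] dwtStep(double[] signal) {
    int half = signal.length / 2;
    double[] approx = new double[half]; // low-pass (average) coefficients
    double[] detail = new double[half]; // high-pass (difference) coefficients
    double s = Math.sqrt(2.0) / 2.0;    // Haar filter coefficient
    for (int i = 0; i < half; i++) {
        approx[i] = s * (signal[2 * i] + signal[2 * i + 1]);
        detail[i] = s * (signal[2 * i] - signal[2 * i + 1]);
    }
    return new double[][] { approx, detail };
}

Iterating dwtStep on the approximation output yields the decomposition tree described next.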
The decomposition process can be iterated, with successive approximations being decomposed in turn, so that one signal is broken down into many lower-resolution components. This is the wavelet decomposition tree, otherwise known as the Mallat tree. The average coefficients are the high-scale, low-frequency components of the signal. Difference coefficients are the low-scale, high-frequency components.
To reconstruct the original signal from the wavelet analysis, we employ the inverse discrete wavelet transform. It consists of upsampling and filtering. Upsampling is the process of lengthening a signal component by inserting zeros between samples.
For a more in-depth introduction to the wavelet transform, the reader is directed to articles such as [SN96, KM88, Mal89, MMOP96] for the theory, and [Wic94, Cod92] for ideas on implementation.
2.5 Audio Segmentation using Wavelets
Once we have expressed the signal using the wavelet representation, we use it to identify the potential points at which to split into grains. There have been many attempts at segmenting sound using the wavelet transform, most of those we looked at geared towards speech analysis.
Attempts at using signal statistics on wavelet transform coefficients have also been made. However, a statistical prior model for wavelet coefficients is complicated because wavelet coefficients do not have a Gaussian distribution [Mal89]. Though wavelet coefficients are decorrelated, their values are not statistically independent, another limitation to take into consideration when using statistical properties.
An important step for classification techniques such as [LKS+98, SG97, TLS+94] is to characterize the signal in as small a number of descriptors as possible, without throwing away any information that would help classification. Reducing the feature set has several advantages: primarily, it is computationally more cost-effective, but it also aids in generalization. Our task is to figure out where it is best to segment the signal into grains. We must then define what happens in the signal at the start or end of an event before looking for these features.
Most speech recognition algorithms include a segmentation step to separate the phonemes for later recognition. However, segmentation in speech recognition has important differences from our goals. Smoothness of transition is not an issue for recognition, because in human speech, phonemes usually blend into each other to such an extent that any splits are usually in areas with a very high degree of frequency change, too much so for our purposes. This is acceptable for speech analysis because the results are only needed for the purposes of recognition, not resynthesis.
2.5.1 Speech Classification Techniques
A paper by Sarikaya and Gowdy [SG97] on identifying normal versus stressed speech details a scheme that showed some promise as a signal segmentation technique applicable to our needs. In their algorithm, an 8 kHz sample is segmented into frames of 128 samples. A lookahead and history of 64 samples are added, making each frame 256 samples, with a skip rate of 64 samples. This representation is the base of a classification algorithm that uses a two-dimensional separability distance measure between two speech parametrization classes.
Their separability measure uses Scale Energy, which represents the distribution of energy among frequency bands, defined as:

SE^{(k)}(s_i) = \frac{\sum_m \left| W_\psi x^{(k)}(s_i, m) \right|^2}{\sup_n \left( SE^{(n)}(s_i) \right)} \qquad (2.4)
where W_ψ x is the wavelet transform of x, k is the frame number, i the scale number, s_i is the i-th scale, and n spans all available frames. The denominator is a normalizing constant. They use the scale energy for an autocorrelation computation that measures the separability of two signals. The ACS, Autocorrelation of Scale Energies, measures how correlated adjacent frames are. It can be defined as:

ACS^{(l)}(s_i, k) = \frac{\sum_{n=k}^{k+L} \left[ SE^{(n)}(s_i)\, SE^{(n+l)}(s_i) \right]^2}{\sup_j \left( ACS^{(l)}(s_i, j) \right)} \qquad (2.5)
Here j is an index which spans all correlation coefficients at a given scale, and l is the correlation lag, which is fixed in this paper at 1. If we had set l = 0, we would look at only one frame at a time, so the ACS would model the normalized power in scale i. For l > 0, ACS models changes in the frame-to-frame correlation variation of the SE parameters. Sarikaya also fixes the correlation frame length L at 6. This means the ACS parameters are measures of how correlated six adjacent frames are.
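For concreteness, the unnormalized numerator of Equation 2.4 is simply the energy of one wavelet level of one frame; a sketch with illustrative names (not the authors' code):

// Unnormalized scale energy of one frame at a given level: the sum of
// squared wavelet coefficients on that level. coeffs[level] holds the
// coefficients of that level for this frame.
static double scaleEnergy(double[][] coeffs, int level) {
    double e = 0;
    for (double c : coeffs[level]) e += c * c;
    return e;
}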
Used in this way, the autocorrelation of scale energies is a comparison between levels to bring out hidden identifying features. On a test of the phoneme /o/ in "go," they achieved satisfactory scores with both the SE and the ACS parameters, although ACS, which takes into account the change between frames, was significantly higher. Readers can refer to the paper [SG97] for further information on testing methodology and complete results.
This work is more tuned toward recognition and distinction between classes than we are interested in. The Scale Energy parameter is useful for identifying features of individual frames, and we adopt a similar approach using wavelet coefficients, as does the paper by Alani [AD99] discussed in the next section. ACS, however, tends to smooth out local changes because it is measured over a number of frames. We are more interested in the locations and magnitudes of these local changes, and the differences between frames, and so do not adopt this technique.
2.5.2 Using Differences between Coefficients
A paper by Alani and Deriche [AD99] details another approach to segmenting speech into phonemes. The signal is broken into small frames of 1024 samples each, with an overlap of 256 samples. Sound samples are input as CD-quality 44.1 kHz, so each frame is approximately 23.2 ms long. To provide metrics for what is happening during the length of each frame, they are each analyzed with the wavelet transform. The energies of each of the first six levels of difference coefficients are calculated for each frame.
Their next step is to segment the signal based on the differences in energy between each level of difference coefficients in consecutive frames. A Euclidean distance function over four frames is used. As an example, we calculate the strength of the transition between frames 2 and 3:

D(f_2, f_3) = \sum_{i=1}^{2} \sum_{j=3}^{4} \sum_{k=1}^{6} \left( X_{i,k} - X_{j,k} \right)^2 \qquad (2.6)
Figure 2.3: Distance measure for transition between frames 2 and 3. The arrows represent difference calculations.
Here k refers to the wavelet level, i and j are frame numbers, and X_{i,k} and X_{j,k} are the energies of wavelet difference-coefficient levels. Only like levels are compared, and the differences between them are added up to obtain an overall difference between frames.
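A direct transcription of Equation 2.6 might look like the following (a sketch, not the authors' code; energies[f][k] holds the level-k difference-coefficient energy of frame f, with frames and levels indexed from 1 as in Figure 2.3):

// Distance of Equation 2.6: compare the wavelet-level energies of frames
// 1-2 against those of frames 3-4 to grade the transition between frames
// 2 and 3. energies must be at least [5][7] for the 1-based indexing.
static double transitionDistance(double[][] energies) {
    double d = 0;
    for (int i = 1; i <= 2; i++)           // frames before the boundary
        for (int j = 3; j <= 4; j++)       // frames after the boundary
            for (int k = 1; k <= 6; k++)   // wavelet levels
                d += Math.pow(energies[i][k] - energies[j][k], 2);
    return d;
}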
Alani and Deriche [AD99] use the algorithm to isolate phonemes which are then fed into a separate speech-recognition system. In normal speech, vowels have pitches which are relatively constant over time, whereas consonants are not pitched at all, and so have frequencies that change considerably over the course of the phoneme. Additionally, in human speech, phonemes meld into each other, making isolation an even more difficult task, one that can really only be successfully achieved using context-sensitive information. Taking this into account, the authors pick the points where the distance measure is highest, reasoning that this is where the speakers are going from one phoneme to another.
Isolating phonemes, however, is quite a different task than trying to isolate grains. We would like something more on the lines of syllables, where there are clearer demarcations to latch on to. When segmenting, we also have to take into account how the grains will fit back together again, something Alani and Deriche did not consider. As described in the next chapter, we adopt a modified version of this algorithm which is more suitable for a coarser level of detail (on the level of syllables rather than phonemes), and for our differing requirements.
2.6 Segmenting in the Wavelet Domain
Not only can we do the analysis in the wavelet domain, but it is also an option to do the separation and re-connection of grains there as well. The inverse wavelet transform would then be performed to obtain the new, modified signal.
This is the approach taken by Bar-Joseph et al. [BJDEY+99]. The authors use a comparison step where wavelet coefficients representing part of the sample are swapped only when they are similar, and call their approach "statistical learning". The aim is to produce a different sound statistically similar to the original. Satisfactory results are reported, with "almost no artifacts due to the random granular recombination of different segments of the original input sound." Unfortunately, no sound samples are available to support these claims.¹

¹At the 2001 International Computer Music Conference, I asked a number of people who had attended the conference in 1999, when Bar-Joseph's work was presented, but nobody could remember what it sounded like.
To find out ourselves, we implemented this algorithm as it is described in the paper. More details about the implementation are given in the next chapter. The results were perceivably different, enough to make the technique inappropriate for our use. With any signal, we found that there was a characteristic "chattering" effect as parts of the signal were repeated quickly right after each other.
The problems stem from the way the algorithm produces new variations. Only the local neighbours in the multi-resolution tree are taken into account when calculating similarity, and the swapping is very fine-grained. Because swapping only takes place when the coefficients are similar, much of the large-scale patterns are preserved, resulting in a sound that still has much of the same order of events. The events themselves are changed to a degree, but also muddied because of convolution artifacts.
These convolution artifacts arise because switching coefficients of the wavelet transform has unpredictable results. Unless the changes are on the dyadic boundaries, it is really changing the timbre of the input sound rather than switching the sound events. These changes cannot be easily predicted; they have to do with the choice of wavelet filter, the filter length, and the position of the coefficients. The convolution involved in reconstruction makes this process virtually impossible to do without introducing unwanted artifacts.
Extending the sample to an arbitrary length is also non-trivial. Unless all that is needed is extending it by a power of two, the inverse wavelet transform becomes much more complex. Extending the sample by a power of two is also unsatisfactory. The result is very similar to looping the original sample, but with a lot of added artifacts, which are the very things we would like to avoid.
2.7 Wavelet Packets
Wavelet packets were seriously considered as a method for representing natural grains. They differ from wavelets in that at every decomposition step, the difference and average coefficients are further broken down. The results are particular linear combinations or superpositions of wavelets. They form bases which retain many of the orthogonality, smoothness, and localization properties of their parent wavelets [Wic94].
Wavelet packets seem to have a lot of potential: depending on the strategy used to find a suitable basis from the over-complete set of packets in a full wavelet packet transform, you can find the basis which represents the signal with the fewest non-zero coefficients. See Wickerhauser [Wic94], for example, for a discussion of various basis-finding algorithms.
A wavelet packet library was written in Java expressly to see if something similar to Bar-Joseph's work [BJDEY+99] could be done with wavelet packets. Instead of interchanging bald wavelet coefficients, we would interchange the wavelet packets, which ostensibly would hold information about whole events, rather than sample-level information.
While efficiency of representation is important, there are some key problems that cannot easily be overcome. First of all, normal wavelet packets are not shift-invariant. This makes comparison between different regions of the wavelet packet transform extremely difficult.
Another difficulty with comparison has to do with packet levels. Every packet is denoted by the order in which the average or difference coefficients have been further broken down, as shown in Figure 2.4. However, there is no guarantee that all portions of the signal will be represented on the same packet level. If we have a packet that is denoted by ADADDDAD in Figure 2.4, it is not trivial to exchange it with another packet denoted by DADD. They are different sizes, and have to interact with different packets in order to be properly put through the inverse wavelet packet transform. This makes switching positions of packets very difficult.
Some artificial means of constraining the representation could be taken, such as limiting the result to be all on one level of the wavelet packet tree. However, this seriously undermines the whole point of finding the best basis, as the number of coefficients increases by a power of two at each level.
Figure 2.4: Wavelet packet decomposition

Related literature supports the problems listed above. In [WW99] Wickerhauser notes that the best-basis algorithm is not well suited for isolating phonemes, since there is no reason for phonemes to even "begin" and "end" at dyadic points. Packets are thus not guaranteed to represent entire, re-arrangeable events.
Wickerhauser instead uses a segmentation algorithm to split up the time axis of the unprocessed signal. The segmentation algorithm measures the instantaneous frequency at discrete points, and places segmentation points where this changes.²

²The algorithm was supposed to be published in another paper, but unfortunately, it was not. To my knowledge, unpublished copies are not available either.
Despite the problems raised above, there have been attempts to use wavelet packets to aid in signal classification. However, surmounting the above concerns seems to use up any advantage over regular wavelets. For example, Sarikaya [SG98] has proposed an alternate version of his paper [SG97], discussed in Section 2.5.1, using wavelet packets instead of wavelet difference coefficients. This involves characterizing each window by subband features derived from the energy of wavelet packets. Their results, however, were not different enough from their earlier, wavelet-based approach for us to change our algorithm.
Delfs and Jondral [DJ97] use the best-basis algorithm for wavelet packets to characterize piano tones. They use a specialized, shift-invariant discrete wavelet packet transform to improve classification. The packet coefficients in the best basis whose energies exceed a threshold are normalized, then compared to piano tones in a database which have been similarly analyzed. The Euclidean distance determines the success of the match. This system was used for identification only, not resynthesis. Their results indicate that these specialized wavelet packets do not seem to offer any advantage over simple discrete Fourier transform features.
2.8 Granular Synthesis
There is a long history in the electro-acoustic music community of arranging small segments of sound to create larger textures. Granular synthesis, pioneered by Curtis Roads [Roa88] and Barry Truax [Tru88, Tru94], is a method of sound generation that uses a rapid succession of short sound bursts, called granules, that together form larger sound structures. Granular synthesis is particularly good at generating textured sounds such as a waterfall, rain, or wind. The grains in this case are taken as portions of a larger sound sample which can be specified to the algorithm. The sound sample itself has a large influence on the result of the granular synthesis, but since it can be specified by the user from anywhere, it is impossible to facilitate easier interaction with this parameter.
Curtis Roads describes granular synthesis as involving "generating thousands of very short sonic grains to form larger acoustic events" [Roa88]. A grain is defined as a signal with an amplitude envelope in the shape of a bell curve. The duration of a grain typically falls into the range of 1-50 msec. This definition puts granular synthesis entirely into the realm of new sound creation rather than manipulation which preserves the original perceptible properties of the sound. He likens granular synthesis to particle synthesis in computer graphics, used to create effects such as smoke, clouds, and grass.
Samples from the natural world have been used in granular synthesis, most notably by Barry Truax [Tru94]. He creates rich sound textures from extremely small fragments of source material. The scheme relies on grain attack and decay envelopes to eliminate clicks and transients. The primary goal is time-shifting: drawing out the length of the sample to reveal its spectral components as a compositional technique.
There has been work done on extracting grains from natural signals and recombining them with phase alignment [JP88]. Phase alignment is used because with such small grains, the method of joining them has a large effect on the resulting sound. This strategy helps avoid discontinuities in the waveforms of reconnected grains. Phase alignment works for both periodic and noisy signals. However, it is primarily a way to alter the original signal, for instance for time-stretching, by combining parts of the signal in various ways.
Gerhard Behles uses another method [BSR98] with pitch markers, which reference each pitch period (the inverse of the local fundamental frequency). The onset of the source sound excerpt is quantized to the closest pitch marker. This method is less computationally expensive than the one outlined by Jones. It has trouble, however, with inharmonic signals.
Granular synthesis with natural signals deals with arbitrary extraction of grains from the original sample, without regard to what is going on in the signal. The majority of the granular synthesis literature refers to it as a compositional approach with no intention of being perceptibly similar to the original sound.
2.9 Concatenative Sound Synthesis
Also working primarily in the electro-acoustic music community, Diemo Schwarz has developed the CATERPILLAR system [Sch00], which uses a large database of source sounds, and a selection algorithm that data-mines these sounds to find those that best match the sound or phrase to be synthesized.
The first step of audio segmentation is not discussed in Schwarz's paper; readers are instead directed to a thesis [Ros00] available only in French. Segments are characterized by acoustical attributes, such as pitch, energy, spectrum, spectral tilt, spectral centroid, spectral flux, inharmonicity, and voicing coefficients. For each feature, low-level signal statistics are calculated, which are then stored in the database as keys to the segment.
Because of the complexity and number of the features measured for each sound segment, this system is not meant to be real-time. Neither is it intended to be automatic: every segment is individually chosen by the user. This makes it inapplicable for our goals, but it does show the breadth of applications that concatenation of audio segments can address.
2.10 Physically-Based Synthesis
Kees van den Doel's method for audio synthesis [vdDKP01] via modal resonance models can be viewed as a kind of resynthesis, where the sound's physical properties are estimated and used for resynthesis. It has been used to create a diverse number of sounds, such as impacts, scraping, sliding and rolling.
The modal model M = {f, d, A} consists of modal frequencies represented by a vector f of length N, decay rates specified as a vector d of length N, and an N × K matrix A, whose elements a_{nk} are the gains for each mode at different locations. The modeled response for an impulse at location k is given by

y_k(t) = \sum_{n=1}^{N} a_{nk}\, e^{-d_n t} \sin(2\pi f_n t) \qquad (2.7)

with t ≥ 0; y_k(t) is zero for t < 0. Geometry and material properties, such as elasticity and texture, determine the frequencies and damping of the oscillators. The gains of the modes are dependent on the location of contact on the object.
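A direct (if inefficient) rendering of Equation 2.7 sums the damped sinusoids sample by sample; a sketch with illustrative names (practical implementations typically use recursive resonator filters instead):

// Synthesize the impulse response at contact location k of a modal model
// (Equation 2.7): a sum of exponentially decaying sinusoids.
// freqs and decays have length N; gains is the N x K gain matrix A.
static double[] modalImpulse(double[] freqs, double[] decays, double[][] gains,
                             int k, int numSamples, double sampleRate) {
    double[] y = new double[numSamples];
    for (int t = 0; t < numSamples; t++) {
        double time = t / sampleRate;
        for (int n = 0; n < freqs.length; n++) {
            y[t] += gains[n][k] * Math.exp(-decays[n] * time)
                    * Math.sin(2.0 * Math.PI * freqs[n] * time);
        }
    }
    return y;
}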
For simple object geometries, parameters for the modal model can be derived, but for most realistic objects, where derivation becomes untenable, we can directly estimate location-dependent sound parameters. Josh Richmond [RP00] has developed a method to do this using the telerobotic system ACME [PLLW99]. The object in question is poked with a sound effector over various locations, creating a map of sounds from which it is possible, using an algorithm developed by van den Doel, to estimate the sample's dominant frequency modes. These modes are fed into the modal model algorithm to produce resynthesized sounds.
This technique is very effective for creating and manipulating models of the sound properties of everyday objects. However, so far it has primarily been applied to contact sounds, where the objects making the noise can be modeled or measured satisfactorily. For background sounds, such as wind, animal cries, and traffic noises, it is possible to use recorded samples to estimate the modal model parameters. However, tweaking these parameters for these types of sounds, which usually have a high degree of variation, is time-consuming and difficult. Producing a sufficiently randomized stream with this technique is also non-trivial, and the method to achieve this for one type of environmental sound wouldn't necessarily be transferable to others. So while this technique is very adept at synthesizing contact sounds, we think there is room for a resynthesis system specifically for background, environmental audio.
Chapter 3
Our Early Attempts at
Segmenting Audio Samples
Part of the genesis of this thesis was the paper by Bar-Joseph et al. [BJDEY+99] detailing their attempt at automatic granulation of a sample using wavelets. A description has been given in Section 2.6. Interested by its promise, we implemented the algorithm detailed in the paper. Matlab was a convenient language to use in this case because the wavelet toolbox [MMOP96] has all of the wavelet functionality we needed to implement Bar-Joseph's algorithm.

Although the paper claims that they achieved "a high quality resynthesized sound, with almost no artifacts due to the random granular recombination of different segments of the original input sound", our results were disappointing. Some implementation details were omitted from the paper, so parts of the algorithm had to be guessed and re-invented. There are also no samples available on the web to verify their claims.
A note about terminology, taken from the Bar-Joseph paper: if we consider a Mallat tree as defined in Section 2.4.1 turned upside-down, we end up with a binary tree with the first level of difference coefficients of the wavelet transform on the very bottom. In this view, the predecessors of a coefficient are the adjacent wavelet coefficients to its left on the same level, and its ancestors are coefficients on higher levels that have, in the binary-tree sense of the word, this coefficient as a child.
The object is to mix up the wavelet coefficients in this binary tree representation in a judicious manner, so that when the inverse transform is performed, the result sounds like a variation of the original.
To replace a wavelet coefficient on a given level of the binary tree representation, we examine the node's predecessors and ancestors. Other wavelet coefficients from the same level are considered as candidates to replace it if they have similar predecessors and ancestors. This similarity is measured within a threshold, which is specified by the user.
In the naïve Bar-Joseph algorithm, all the neighbours of a coefficient are taken into account when deciding what to switch. Doing this over the whole transform results in a quadratic number of checks. This makes the algorithm impractically slow, so the authors suggest limiting the search space to the children of the candidate set of nodes of the parent, which greatly decreases the search space.
Our implementation found that the candidates were almost always only the immediate neighbours, because they had the most ancestors and neighbours in common with the node to be replaced. The node adjacent to the one to be replaced has all the same neighbours except itself, and all of the same ancestors as well.
Almost never were there any nodes other than the immediate neighbours being considered as candidates, and almost never were the immediate neighbours not considered. This was the case no matter what the threshold was set to. Either only the immediate neighbours were considered, or all of the nodes in the entire level. This explains some of the effects we observed in our results, which tended to sound similar to the input sample, but with slight artifacts from the wavelet resynthesis. There were no large-scale reorganizations of the sample, only local changes.
To promote more large-scale changes, we only allowed shuffling of the coefficients in the higher levels of the inverted Mallat tree, ones that represented more than 10 ms of audio. The rest of the tree was re-organized according to the last level scrambled. For any coefficient on the last level scrambled, not only is that coefficient taken from somewhere else on the same level, but also that coefficient's children, and the children's children, and so on until the end of the tree. Thus for any coefficient on the last level processed, it and all of its descendants are moved en masse.
However, allowing coefficient shuffling only at higher levels is only a partial solution, because the inverse wavelet transform will work on at least the number of samples of the length of the filter at one time. There is really no direct analogy to the parents and children of a binary tree, because even the smallest filter has more than two coefficients. Because of the convolution step that happens between the filter and the coefficients in the inverse wavelet transform, the effect of any one coefficient is spread over a number of coefficients equal to twice the filter length. This almost invariably leads to artifacts, because there is no guarantee that the inverse wavelet transform will produce a smooth signal from these altered coefficients. For applications such as computer music, perhaps these artifacts are desirable, although they won't be predictable in any useful way. For applications such as environmental sound production, it is a definite drawback.
Despite our concerns, the results did have some promise. While there were artifacts, the sound was recognizable. The artifacts were mostly due to "chattering," with the same portion of the sound repeated a few times without enough of a decay envelope. At this point, we thought there was potential to solve these problems, so the decision was made to try a Java implementation that streamed audio in real time. This required a Java version of the wavelet transform, which was then implemented, and is discussed in Section 4.5.
3.1 A Streaming Granular Synthesis Engine
To achieve real-time performance, we computed possible candidates for replacement for every coefficient beforehand, so that when generating audio, all that had to be done was to construct a Mallat tree from the pre-computed candidate sets, then do an inverse wavelet transform. The word "streaming" is used loosely: the smallest unit was the whole wavelet tree, which was the same length as the input sound.
Streaming audio in this way gave mixed results. Although we managed to get the audio out in real time, because of the local nature of the granulation, the sound wasn't adequately changed to allow for seamless combination of finished versions in the way the paper described. Although events were mixed, there was no guarantee that the end of one and the start of another would flow well.
The root of the problem was still the granularity of the coefficients. Their interdependence made it impossible to move audio events around cleanly. We needed a better way to characterize the events of the signal. Wavelet packets were briefly considered, then rejected for reasons detailed in Section 2.7.
After abandoning the wavelet packet transform, we hit upon the idea of using a speech recognition algorithm that used the wavelet transform in the analysis step. Because we valued the fidelity and similarity of the input sound to the output, using the wavelet transform for analysis only gave us much more satisfactory results.
Chapter 4
Segmentation and Resynthesis
In this chapter, we describe the steps towards an implementation of a resynthesis engine based on natural grains. First is the segmentation algorithm, which analyzes the input sound signal and outputs a series of graded points in the sample that are most appropriate to segment around. The user can then fine-tune the default threshold to determine the total number of segments in the sample. Next is the method to grade how segments fit together with each other for the purposes of playback. We then describe our implementation.
4.1 Segmentation

The core of our segmentation algorithm is a modified version of the method by Alani and Deriche [AD99] described in Section 2.5.2, in which a signal is divided into frames and analyzed with the wavelet transform. An input audio signal is broken into small frames of 1024 samples each, with an overlap of 256 samples. Sound samples are input as CD-quality 44.1 kHz audio, so each frame is approximately 23.2 ms long. To provide metrics for what is happening during the length of each frame, six levels of the wavelet transform are computed for each frame.
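To make the framing step concrete, here is a minimal Java sketch; the method name and the hop-size interpretation (1024-sample frames whose starts are spaced so that consecutive frames share 256 samples) are ours, not a copy of the original implementation:

    // Cut a signal into overlapping analysis frames: 1024 samples per
    // frame, 256 samples shared between consecutive frames (hop = 768).
    // Assumes signal.length >= FRAME.
    static final int FRAME = 1024;
    static final int OVERLAP = 256;

    static float[][] toFrames(float[] signal) {
        int hop = FRAME - OVERLAP;
        int count = (signal.length - FRAME) / hop + 1;
        float[][] frames = new float[count][FRAME];
        for (int f = 0; f < count; f++)
            System.arraycopy(signal, f * hop, frames[f], 0, FRAME);
        return frames; // six wavelet levels are then computed per frame
    }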
An additive information cost function is then computed on each level of difference coefficients. The function currently used is the sum of u^2 \log(u^2) for all non-zero values of u, where u ranges over all difference coefficients in one level of the wavelet transform. This function is a measure of concentration, i.e. the result is large when the elements are roughly the same size and small when all but a few elements are negligible. A simpler function that sums the absolute values of the sequence has also been tried, with similar results.
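In Java, the cost function is a direct transcription of the formula above (the method name is ours):

    // Additive information cost of one level of difference coefficients:
    // the sum of u^2 * log(u^2) over all non-zero coefficients u. Large
    // when the coefficients are roughly the same size, small when all
    // but a few are negligible.
    static double informationCost(float[] coefficients) {
        double sum = 0.0;
        for (float u : coefficients) {
            double u2 = (double) u * u;
            if (u2 > 0.0)
                sum += u2 * Math.log(u2);
        }
        return sum;
    }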
A measure of correlation between corresponding wavelet levels across adjacent frames is then used to map the local changes in the signal. For this we use the same function as in Section 2.6. Equation 4.1 gives a slightly more general version, giving the strength of the transition between frames a and b:
D(f_a, f_b) = \sum_{i=a-1}^{a} \sum_{j=b}^{b+1} \sum_{k=1}^{6} \left( X_{i,k} - X_{j,k} \right)^2 \qquad (4.1)
Again, k refers to the wavelet level, i and j are frame numbers, and X_{i,k} and X_{j,k} are the energies of the wavelet difference coefficient levels. Only like levels are compared, and the differences between them are added up to obtain an overall difference between frames.
When this calculation is done for each frame (minus the first and last two) in the signal, the result is one number per frame which represents the degree of change between a frame and its immediate neighbours. The numbers are stored as an array, and only need to be calculated once per sound sample.
Alani and Deriche used this method to find the points where the distance measure is highest, in order to separate phonemes. We are not trying to 'understand speech', however. Isolating phonemes is not critical to our application. Rather, we would like to segment at the granularity of a syllable, where transitions are much more pronounced and the boundaries more amenable to shuffling.
A simple alteration to their algorithm that makes it more suitable for our goals is to look for the points which have the least difference between frames, instead of the greatest. With respect to the amplitude envelope, these points are more likely to be in the troughs of the signal between relevant portions, rather than in the middle or in the attack portion.
Another change is that we normalize the energies before calculating the correlation. This is done by dividing each energy in a frame by the sum of energies in that frame, which focuses the correlation on the differences in relative strength of the frequency bands between frames. This was not done in Alani and Deriche's version, but we consistently get smoother transitions after normalization.
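The normalization and the frame distance of Equation 4.1 reduce to a few lines of Java. In this sketch (names are ours) X[f][k] holds the energy of wavelet level k in frame f, with levels indexed 0..5:

    // Normalize each frame's six level energies to sum to 1, so the
    // correlation compares relative band strengths rather than loudness.
    static void normalizeFrames(double[][] X) {
        for (double[] frame : X) {
            double total = 0.0;
            for (double e : frame) total += e;
            if (total > 0.0)
                for (int k = 0; k < frame.length; k++) frame[k] /= total;
        }
    }

    // Equation 4.1: strength of the transition between frames a and b,
    // summed over the two frames on each side of the boundary.
    static double transitionStrength(double[][] X, int a, int b) {
        double d = 0.0;
        for (int i = a - 1; i <= a; i++)
            for (int j = b; j <= b + 1; j++)
                for (int k = 0; k < 6; k++) {
                    double diff = X[i][k] - X[j][k];
                    d += diff * diff;
                }
        return d;
    }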
For every frame boundary, we now have a number representing how similar its neighbours are on either side. To segment the sound into grains, we compare each of these numbers to a threshold. Those lower than the threshold are taken as new grain boundaries. We thus favour splitting the signal up at the points where there is little change, and keeping together parts of the signal where there is a relatively large amount of change.
We need to ensure a minimum grain size of more than 40 ms so that we do not have grains occurring at a rate of more than 20 Hz, the limit of frequency perception. So we only consider a point whose distance measure is a minimum compared to its two neighbours, which means that if grain boundaries occur in two adjacent frames, one of them is ignored.
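Combining the threshold with the local-minimum rule gives a simple boundary picker. The sketch below (Java, with illustrative names) assumes dist[f] is the per-boundary distance measure described above:

    import java.util.ArrayList;
    import java.util.List;

    // A frame boundary becomes a grain boundary when its distance measure
    // is below the threshold AND is a local minimum with respect to its
    // two neighbours, so two adjacent frames cannot both be boundaries.
    static List<Integer> grainBoundaries(double[] dist, double threshold) {
        List<Integer> cuts = new ArrayList<>();
        for (int f = 1; f < dist.length - 1; f++)
            if (dist[f] < threshold
                    && dist[f] <= dist[f - 1]
                    && dist[f] <= dist[f + 1])
                cuts.add(f);
        return cuts;
    }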
4.2 Grading the Transitions

The segment boundaries derived from the above approach represent the locations in the signal where it changes least abruptly. The degree of change is given by the result of the difference algorithm in Equation 4.1. Our final aim is to re-create randomized versions of the signal that retain as many of the original characteristics as possible. The next task, then, is to determine which of the grains flow most naturally from any given grain.
To enumerate the most natural transitions between grains, the grains are compared against each other and graded on their similarity. This is done in the same way as we calculated the original grains from the sample. To calculate the transition between grains A and B, the last two frames of A are fed into the four-frame Euclidean distance algorithm of Equation 4.1, along with the first two frames of B. The lower in magnitude the result, the smoother the transition between the two grains will be.
4.3 Resynthesis

By taking the last two frames of each grain and comparing them with the first two of all other grains, the similarity metric allows us to construct probabilities of transition between each and every grain. These probabilities are used to construct a first-order Markov chain, with each state corresponding to a natural grain. The next grain to be played is chosen by randomly sampling the probabilities that have been constructed from the measure of how well the end of the current grain matches the beginnings of all the other grains.
Probabilities are constructed by employing an inverse transform technique which uses the match scores for each grain as observed values for a probability density function (pdf). The smaller the result of Equation 4.1, the smoother the transition between the two windows on either side. We would like higher probabilities for smoother transitions, so we take the inverse of each transition score to orient the weights in favour of the smaller scores.
We do not always want the probability of choosing the next grain to depend entirely on the smoothness of transition. Randomness is sometimes as important a consideration as smoothness. Some probabilities might be much greater than all of the others, and so be picked often enough for the repetition to be noticed. To allow control over the weighting differences, we add a noise variable C which helps even out the grain weightings.
Let P_{ij} = 1/D(i, j) indicate the likelihood that grain i is followed by grain j. We can convert this to a probability p_{ij} by normalizing as follows:

p_{ij} = \frac{P_{ij} + C}{\sum_{j=0}^{n} P_{ij} + nC} \qquad (4.2)
where n is the number of grains. C denotes the constant noise we want to add to the distribution to give the grains with smaller similarities more of a chance to be selected. This number can be changed interactively to alter the randomness of the output signal.
We now construct a cumulative density function (cdf) from the pdf, which gives us a stepwise function from 0 to 1. This is sampled by taking a random number from 0 to 1 and using it as the index to the function. The desired index can then be found using a binary search for the interval the random number lies in, and using that step of the cdf to index our map.
Once we have the transition probabilities, resynthesis is as simple as choosing the grain to be played next by random sampling from the empirical distribution p_{ij}. In this way, the smoother the transition between the current grain and the next, the higher the probability that this grain is chosen as the successor. A high noise variable C flattens this preference, but never eliminates it.
4.3.1 Cross-fading

Our algorithm works to match the energies of wavelet bands between grains as closely as possible. Since the energies are normalized before comparison, boundary amplitude changes are not given much weight in our resynthesis choices. Normally, this is not much of an issue because the algorithm prefers the "troughs" of the signal, in which the amplitude is near its minimum. For sound samples where there are no such troughs, and to give an overall cohesiveness to the output, we cross-fade between each successive grain. A linear cross-fade of 2 frames (approximately 5 ms) is used.
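A linear cross-fade is straightforward; this Java sketch (names are ours) overlaps the tail of one grain with the head of the next over len samples:

    // Concatenate grains a and b with a linear cross-fade over the last
    // len samples of a and the first len samples of b (assumes len >= 2).
    static float[] crossFade(float[] a, float[] b, int len) {
        float[] out = new float[a.length + b.length - len];
        System.arraycopy(a, 0, out, 0, a.length - len);
        for (int i = 0; i < len; i++) {
            float t = (float) i / (len - 1);          // ramps 0 -> 1
            out[a.length - len + i] =
                (1 - t) * a[a.length - len + i] + t * b[i];
        }
        System.arraycopy(b, len, out, a.length, b.length - len);
        return out;
    }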
4.3.2 Thresholding

The user has control over how many grains a signal is split up into. A slider in the graphical interface changes the value of the threshold below which a frame boundary is considered a grain boundary. The threshold extremities are determined by the maximum and minimum values over all frame boundaries. Between them, the threshold slider sacrifices control at the high end to obtain fine-tuning over the first few grains. This is important because there is a perceptually much bigger difference in changing the value when there are only 5 grains than when there are 50 over the same sample. An exponential function is thus used to control the values of the threshold slider. There is a default threshold value which is currently set at 25% of the total possible number of grains.
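One way to realize such a mapping is to warp the slider position exponentially before scaling it into the score range. This is a sketch only; the curvature constant k is illustrative and not the value used in our interface:

    // Map slider position s in [0,1] onto [min, max] exponentially, so
    // most of the slider travel is spent on the low (few-grain) end.
    static double sliderToThreshold(double s, double min, double max) {
        double k = 5.0; // curvature: larger k = finer control at low end
        double warped = (Math.exp(k * s) - 1.0) / (Math.exp(k) - 1.0);
        return min + warped * (max - min);
    }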
Figure 4.1: Input waveform

Figure 4.2: Portion of output waveform

Figures 4.1 and 4.2 show an example of the transformation which results from resynthesis. Although the output waveform in Figure 4.2 looks significantly different from that of Figure 4.1, both use exactly the same samples (except for the small cross-fades on grain boundaries). The samples in the output sound are just re-arranged, and sometimes repeated.
4.4 Implementation

Our resynthesis system is implemented in Java, and features a graphical interface to facilitate real-time interaction. By manipulating sliders, users can change the segmentation threshold and the noise parameter C of the grain selection process. Using JavaSound on any platform that supports it, such as Linux and Windows, all of this can be done in real time with no signal interruptions on systems with a Pentium II processor.
Figure 4.3 shows the interface we have built to assist with segmentation. The top row of buttons, from left to right, are used to play the original sample, record a new sample, pause playback, load a new sound sample, and separate the sample using the algorithm described above.
In the center is the current segmented waveform, with vertical lines slicing through it at the locations of the grain boundaries. Moving the threshold slider left or right causes more or fewer segmentation lines to appear, giving instant feedback as to the segmentation threshold. The noise slider affects the weights given to the next possible grains to be played; it is C in Equation 4.2.

Figure 4.3: Segmented waveform and interface
4.5 Implementing the Wavelet Transform

To our knowledge, no publicly available version of a wavelet library exists for Java, so we decided to write our own. We used techniques described by Wickerhauser [Wic94], in the UBC Imager wavelet library wvlt [(or95], and in Dr. Dobb's Journal [Cod92, Cod94]. All of these contained partial code examples written in C. To verify our implementation, we compared our results against those of the compiled Dr. Dobb's version. We also verified that applying the inverse transform to the transformed data reproduced the input data.

The wavelet filters available were all ported from the Imager wvlt library. The filters ported include:

* the Adelson, Simoncelli, and Hingorani filter [ASH87]
* filters by Antonini, Barlaud, Mathieu and Daubechies [ABMD92]
* the Battle-Lemarie filter [Mal89]
* the Burt-Adelson filter [Dau92]
* Coiflet filters [BCR91]
* Daubechies filters [Dau92]
* the Haar filter [Dau92]
* pseudocoiflet filters [Rei93]
* spline filters [Dau92]
For our purposes, the Daubechies 10 wavelet filter is usually used, although the other filters also work well. In general, little work has been done on which particular filters to use for sound, and more rigorous study is needed in this area. In our application, because the wavelet coefficients are summed into energies per level and then normalized, the result is not very sensitive to the features of individual wavelet filters.
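For reference, the core of one analysis level of the fast wavelet transform is a pair of periodic convolutions followed by downsampling. This Java sketch is a generic textbook formulation, not a copy of our ported library code; lo and hi are a quadrature mirror filter pair, such as the 20 Daubechies 10 coefficients:

    // One analysis level: convolve the input with the low-pass (scaling)
    // and high-pass (wavelet) filters, downsample by two, and use a
    // periodic boundary. approx and detail must each be in.length/2 long.
    static void dwtLevel(float[] in, float[] lo, float[] hi,
                         float[] approx, float[] detail) {
        int half = in.length / 2;
        for (int i = 0; i < half; i++) {
            float a = 0f, d = 0f;
            for (int k = 0; k < lo.length; k++) {
                int idx = (2 * i + k) % in.length; // wrap around the end
                a += lo[k] * in[idx];
                d += hi[k] * in[idx];
            }
            approx[i] = a;
            detail[i] = d;
        }
    }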
The wavelet library is designed to be a separate module from the rest of the implementation so that it can be used for other tasks. A wavelet packet library is also included in the Java package, which we plan to release to the community.
4.6 Real-time Considerations

To make our implementation adequate for real-time use, the segmentation step is done before playback. This is achieved by computing the differences between every pair of adjacent frames in advance, so that setting the threshold is just a matter of checking which frame scores fall below the selected threshold value. Those that do become grain boundaries. Segmentation data can be saved for a later date, so this step only has to be done once per sound sample.

This allows us to synthesize our audio output in real time with very little computational overhead, leaving plenty of extra computational power for other tasks on even the most average of desktop computers.
4.7 Segmentation/Resynthesis Control Interface

To facilitate the construction of larger-scale sound ecologies, a higher-level interface has been implemented. Per sound sample, there are controls for gain (amplitude) and pan (stereo left-right amplitude). There are three controls for each: start value, middle, and end, allowing the sound to change over time. Because not every sound should be played continuously, there is also a control for trigger: how often the sound should be played, given in seconds (for instance, a value of 10 means the sound is activated every 10 seconds). Duration controls how long each of these higher-level segments is. Trigger and Duration both have associated values for controlling random variability. The Trigger value and its associated random element are used as the mean and standard deviation of a normal-distribution random number generator. The pan and gain envelopes affect each of these higher-level segments individually.
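As a sketch of the trigger logic (Java; the method name is ours), the delay before a sound's next activation is drawn from a normal distribution whose mean is the Trigger value and whose standard deviation is its variability control:

    import java.util.Random;

    // Time in seconds until the sound is next activated. Negative draws
    // are clamped to zero so the sound simply plays immediately.
    static double nextTriggerDelay(double trigger, double variability,
                                   Random rng) {
        double delay = trigger + variability * rng.nextGaussian();
        return Math.max(0.0, delay);
    }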
The sample-segmentation interface shown in Figure 4.4, which allows control of how many natural grains the sample is chopped up into, is still accessible via a button for each control. A "Random Walk" button places the sound in a continuous random walk through stereo space. "Play All" and "Stop All" buttons allow for overall control of all samples loaded at one time.
Figure 4.4: Segmented stream management interface
4.8 Preset Mechanism

Because the initial time to analyze the signal for segments can be relatively substantial, the interface also has the capability of saving presets for sounds that have already been analyzed.

Currently, all of the window transition values are saved, so that when the preset is loaded again, none of the wavelet transforms has to be done again. The threshold, noise, and all of the interface settings such as pan, gain, etc. are also saved. When a preset is loaded, all the user has to do is push "play" to hear the sounds at the same settings they had when the preset was last saved.
4.9 Discouraging Repetition

Favourable transitions between segments are given a higher probability of being chosen, so for segments that have very good compatibility with just a few others, and very low compatibility with the rest, the same sequence of samples can recur. For certain samples, this repetition of sections was recognizable, especially if the input signal was relatively short in duration. A high noise parameter helps by evening out the probabilities of the next candidate segments. To further prevent repetition, a transition that has just occurred has its weight reduced for subsequent picks from that segment.
Discouraging repetition is accomplished as follows. For each segment s1 that is played, we pick the next segment from the weighted probability list of s1. We then reduce the possibility that this transition from s1 will be picked again in the near future. This is done by associating with each segment a list of the 10 most recently chosen transitions from that segment. These 10 are themselves weighted with a higher value. The normal weights (the weights given by signal analysis) of these 10 most recent transitions are then divided by the repetition weights, yielding new weights that are a fraction of their originals. Using this scheme, after enough time these last 10 played transitions revert to their original weights.
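A sketch of this bookkeeping in Java (the penalty factor is illustrative; the thesis only specifies that recent transitions are weighted down, not by how much):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Per-segment memory of the 10 most recently chosen transitions.
    // A transition still in the memory has its analysis weight divided
    // by PENALTY; once it falls off the deque it reverts to normal.
    static final int MEMORY = 10;
    static final double PENALTY = 4.0; // illustrative value

    final Deque<Integer> recent = new ArrayDeque<>();

    double effectiveWeight(int successor, double analysisWeight) {
        return recent.contains(successor)
                ? analysisWeight / PENALTY : analysisWeight;
    }

    void recordChoice(int successor) {
        recent.addLast(successor);
        if (recent.size() > MEMORY) recent.removeFirst();
    }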
We chose to discourage the transitions rather than the actual segments because it kept our overall system of choosing segments intact. If we had discouraged actual segments instead, there are cases where our system would no longer be random, and would develop a recognizable repeating pattern. Consider the extreme case of a sample with fewer segments than the length of the recently-played list that we do not play again. We would end up just playing a loop, with the next segment to be played almost always being the one we played least recently.
Chapter 5

Results and Evaluation

As example sound inputs to our algorithm, we have used samples taken from the Vancouver Soundscape [Tru99], and other natural recordings from a number of different environments. This provided us with a wealth of real-world samples, both pitched and unpitched, which were ideal for this algorithm.
5.1 User Study

To test the utility of our natural grain technique, we carried out a pilot user study on subjects taken from around the UBC Computer Science Department. We tested the hypothesis that a sound generated by our algorithm is indistinguishable from the real audio sample it is taken from.
The seven samples used for the tests were:

* a series of car horns
* the sounds of some tree frogs
* crickets
* the sound of crumpling and tearing paper
* ambient sounds from a fish market
* the sound of a bubbling brook
* bird chirps in the forest
Each sample was gone through by hand to pick an appropriate segmentation threshold. We chose threshold values that would ensure there was more than one segment in any snippet we played as a test.
There is a possibility of subjects comparing the sound events in the real sample against those in the resynthesized sound, instead of evaluating whether the resynthesized sound is plausibly realistic. To minimize this, we did not just use a snippet of a real sound and that same snippet resynthesized, because then the same events would always happen in both samples. Instead, the snippets were taken from two larger pools, constructed from original and resynthesized samples of much larger duration. The location of the snippet within the larger sample pools was chosen at random, the only condition being that the start had to leave enough room in the pool to play the 4-second duration of the test snippet itself. New locations were chosen each time a snippet from a sound was played. This precaution preserves the intent of the algorithm: to produce sounds that appear to have been recorded from the same source as the original, but at different times.
5.1.1 Participants

Ten members of our department (9 males, 1 female) participated in the user study. All reported normal hearing. The participants were not paid for their time.
5.1.2 Experimental Procedure

The experiment used a two-alternative forced-choice design. Subjects sat on a chair in front of two speakers in an enclosed room. We told them that we would play a series of various environmental sound samples in pairs. Each pair would consist of a random section of the original sample and a random section of a resynthesized sound. They would be in no particular order, and there would be a number of different types of sounds. For each pair, the subjects were instructed to identify the 'real' sample. They were told they could only listen to the samples once; no repetition was allowed. They were then given a practice sample different from those used in the test. Finally, the test began. There were three iterations of both orders, real-synthesized and synthesized-real, for each sample, for a total of 42 tests per subject. The order of tests was random.

Subjects were not told that the tests would be symmetrical, that is, that there would be as many pairs with the resynthesized sound first as with the real sound first. Nor were they told the number of repetitions in the experiment, nor the nature of the resynthesis algorithm.
There were some potential issues with the approach we took to demonstrate the utility of our algorithm. For one thing, the subject always has the original sound to listen to next to the resynthesized version, and thus might be able to pick out idiosyncrasies in the resynthesized sound that would not be recognizable if the real sound were not played. On the flip side, however, if subjects cannot statistically distinguish the real sound from the resynthesized one in this test, it is a strong endorsement that the algorithm can produce realistic sounds.
5.1.3 Results

Figure 5.1 displays the correct scores per subject tested. There was substantial variation between subjects. Three out of the 10 scored above 70%, 2 others above 60%, 1 above 50%, and the other 4 below chance, as shown in Figure 5.2.

We tested the null hypothesis that the subjects perform at the chance level (each response is a pure guess) for the 10 subjects. Under this hypothesis, the mean number of correct responses is µ = 21 and the standard deviation is σ = 3.24. Using the normal approximation to the binomial distribution, we conclude that we can reject the hypothesis with a two-tailed test at the significance level α = 0.05 only if the sample mean is outside the interval µ ± 1.96σ = [14.65, 27.35]. Two subjects scored above this range (32, 31), and one below (14).
Correct Scores per Subject and Sample (each out of 6)

Subject   Car horns  Crickets  Paper  Market  Frogs  Stream  Birds
   1          3          2       2       2      2       3      0
   2          5          3       2       3      4       6      3
   3          6          4       3       4      5       5      5
   4          1          2       2       2      3       3      5
   5          2          5       3       4      3       2      5
   6          2          2       5       3      4       2      1
   7          6          2       4       4      5       5      4
   8          5          5       1       5      5       5      5
   9          5          3       2       2      2       2      2
  10          6          2       4       3      3       3      5

Figure 5.1: Correct scores per subject and sample. Each score is out of 6.
Subject   Total Correct   Percentage Correct   Standard Deviation
   1           14              0.33333               1
   2           26              0.61905               1.38013
   3           32              0.76190               0.9759
   4           18              0.42857               1.272418
   5           24              0.56251               1.272418
   6           19              0.45238               1.380131
   7           30              0.71428               1.253566
   8           31              0.73809               1.511858
   9           18              0.42857               1.133893
  10           26              0.61904               1.380131

Figure 5.2: Percentage correct answers per subject, over all samples.
However, the mean, 23.8, falls solidly within this range, so overall we cannot reject the null hypothesis.
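For reference, these numbers follow directly from the normal approximation to the Binomial(42, 1/2) distribution:

\mu = np = 42 \times 0.5 = 21, \qquad \sigma = \sqrt{np(1-p)} = \sqrt{42 \times 0.25} \approx 3.24,

\mu \pm 1.96\sigma = 21 \pm 6.35 = [14.65,\ 27.35].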
5.1.4 Discussion

These tests showed that samples resynthesized with this algorithm are virtually indistinguishable from the originals. This demonstrates the utility of this process for creating audio streams of indefinite length from samples of fixed length. It can also be used to create a number of variations of a fixed-duration sound sample.
Correct Responses
mean   23.8
max    32
min    14
std     6.268

Figure 5.3: Statistics for the number of correct responses
Percentage Correct, Per Sample
car horns          0.683
brook              0.600
frogs              0.583
birds              0.583
market             0.533
crickets           0.500
crumpling paper    0.483

Figure 5.4: Percentage correct per sample, compiled over all subjects
The results also reveal something about which sounds are most effectively rendered with our algorithm. Figure 5.4 shows the fraction of correct answers per sample. There is substantial variation between samples, with the car horns being the most identifiable and the crumpling paper sounds the least.
In the case of the car horns, we can partly attribute the high success rate to the nature of the sample. A car horn is a pitched sound with a distinct attack, middle, and end. Because it is so plain and in the foreground, it is easier to pick out idiosyncrasies when the horn is altered, which in turn makes it easier to identify the resynthesized sound. In this particular case, some subjects remarked that the horn stopped or started too quickly to be considered normal. The segmentation/resynthesis process sometimes changed the original envelope of the horn enough to be noticeable. Subjects were comparing not only the two samples played for them, but also each against their own personal idea of what a car horn should sound like.
Sounds of the market, on the other hand, depict the bustle and commotion of many people going about their routines. The sample is chaotic and layered, displaying much less temporal structure than the regular sound of a car horn. Without the larger temporal clues to give them away, the resynthesized market sounds were thus more difficult to identify.
Another low scorer, the crumpling paper, also does not contain as much temporal information as the car horns. Like the market, it is unpitched and irregular, which provides a much richer set of possibilities for segmentation. Pitch does not present an inherent difficulty for our algorithm, but pitched sounds are usually accompanied by amplitude envelopes, which are a problem because of their temporal structure. This shows that our algorithm performs best on the sorts of sounds it was designed for: unstructured environmental sounds suitable as 'background' noises.
As it stands, sounds with temporal structures can only be handled by starting with an input sound that contains a number of the discrete structures, and then only segmenting between them. For instance, to achieve better car horn sounds, we could have segmented only between each toot of the horn. Another solution, which would involve adding temporal information to our algorithm, is detailed in the next chapter.
One confound that might exist for these tests is the segmentation threshold we chose for the individual samples. Each sample was gone through by hand, to ensure there was more than one segment in any snippet we played as a test. The threshold could not be too large, either, because then the segments would be too small, causing the resultant sound to differ perceptibly in timbre from the original. These two constraints still leave considerable room to maneuver, so the notion of an 'ideal' threshold is a loose one.
Our design decision to give final say on the threshold to the sound designer allows for maximum flexibility. However, it also makes tests that limit the threshold to one value per sample necessarily limited in scope relative to the choices the implementation offers.
Overall, the subjects' reactions to the resynthesized sounds were uniformly positive. They invariably commented that it was very difficult to distinguish our sounds from the originals, irrespective of their final scores.
Chapter 6

Conclusions and Future Work

6.1 Overview

This chapter summarizes the goals and results of this thesis, and also outlines some directions for further work and improvements.
6.2 Goals and Results

Our goal in developing the natural grain resynthesizer was to complement existing methods for generating audio through physical simulation of sounds with one that focuses on manipulating existing samples. Background sounds, such as birds in a forest, chatter in a cafe, or street sounds, are currently beyond the scope of physical simulation, but are just as necessary to virtual environments, film, and other disciplines that value realistic sound environments.
In this thesis we extended the utility of sample-based audio resynthesis by providing a method to create randomized versions of samples that preserve the perceptual qualities of the original. This allows users to create samples of indeterminate length from an input sound of fixed length, without having to resort to simple, deterministic looping. In a situation where the same sound sample is triggered very often in response to an event, one could instead create a number of variations of similar duration to the sample, rather than repeating the original again and again.
Our implementation also provides an interface that makes it possible to easily mix together multiple streams of resynthesized audio to quickly create an integrated acoustic environment from separate elements.
The implementation can also be used for purposes other than sound extension. Interesting effects can be achieved when we set the granularity to be very fine, in which case we achieve a sparse form of granular synthesis. Intriguing combinations and textures can also be generated by using a sample with many heterogeneous sound sources present. The algorithm mixes them in unexpected and interesting ways, often in short bursts that are connected seamlessly with other short bursts from elsewhere in the signal to create new macroscopic structures.
Our main challenge was to preserve the perceptual characteristics of the original sound. There are many ways of creating new sounds with new timbres from input samples, such as granular synthesis, but much less work has been done on creating new samples that sound similar to those input. This is the key achievement of our work.
6.3 Future Work

While the system works well right now for a variety of applications, there are many interesting extensions to the dynamic scrambling algorithm. In this section, we explore a few future directions for work.
1. More work could be done to automatically set the threshold to a reasonable value depending on the sample. Determination of automatic threshold values is tricky, because the optimum number of segments for a sample varies with the size of the sample and its inherent separability.
2. It would be useful not only to extract the components in the time domain, but also to split the signal up into multiple simultaneous streams. For instance, in a sample where there is simultaneously traffic noise and birds singing, we could extract the bird sound from the background and re-synthesize the two separately. Independent Component Analysis (ICA) [Cas98] could be a viable method of separating the signals from one source.
3. Modifying the underlying signal through manipulation of the time-frequency domain can produce desirable and predictable changes in the perceived objects involved in the sound production. For instance, Miner and Caudell [MC97] have developed methods for altering the wavelet coefficients in the representation of a rain sample to change the perceived surface the rain is falling on. They could also change the perceived size of the drops, or their density. Since our algorithm already computes a version of the signal in the wavelet domain, this would be relatively efficient to implement.
4. Although it is not needed for the purposes of this thesis, adjusting the threshold slider in real time to change the grain size is also currently possible, but with more computational overhead. This functionality is useful for musicians and sound designers who want to create effects by continuously changing the grain size in real time.

The generation of the first-order Markov chain is relatively expensive, and must be re-done after every change in threshold. This is because until we know the threshold, we do not even know where the grains will be, so it is difficult to determine how well they match together. In practice, on a moderate-length sound sample (less than 5 minutes), this is not a problem. However, longer samples could lead to thousands of grains. This becomes a problem because the time complexity of recomputing the Markov chains is O(n^2), where n is the number of grains.
For the purposes set out in this thesis, the existing implementation is adequate, but re-writing the code to pre-calculate segment relations is possible. We would have to calculate every possible transition between segments for every possible threshold value, and be able to consider only those that are below the current threshold.
5. One of the most interesting of the possible extensions to this algorithm would be the incorporation of time information. This would allow us to produce new sounds with the same rhythmic structures as the originals. A possible way of accomplishing this would be to analyze amplitude, and use the resulting information in a hierarchical Markov chain. Entire sounds or just localized sections could be analyzed to obtain their large-scale amplitude envelopes. This information would be used to cut down the number of segments considered when choosing the next segment to play. The subset could be based on the criterion that the next segment must have an amplitude that matches the current portion of the larger pattern. Finer-scale choices between individual segments within the subset would still be made using the algorithm outlined above. Or, if subsets remove too much randomness, we could instead modify the entire weighting system used in resynthesis to reflect this new information.
This approach would allow us to successfully segment a much broader class of sounds than is currently possible. Sounds with a high degree of temporal information would be particularly better handled. This approach would also create additional creative opportunities. For instance, we could analyze one sound for its amplitude envelope, then use this information on a completely different sound. Since local decisions affecting sound quality would still be made as we have described above using local information, the sound would still be of very good quality, but with completely different (yet recognizable) temporal characteristics.
6. In addition to the amplitude-following described above, we might be able to implement something close to the beat-tracking interfaces offered in some commercial packages for musicians. By analyzing the amplitude maps, we could estimate the rhythm of the sample, and allow the user to change it. The change would be brought about by exploiting an artifact of our algorithm: for rhythmic sounds, decreasing the segment size often increases the tempo of the sound. The exact cause of this is not known at present, but it could be exploited by offering it as a control to the user. With judicious use of amplitude maps, we could intentionally shorten the envelope of a particular sound by not picking as many segments in the sustain part of the sound, so that the envelope becomes shorter. Since it is the attack that gives a sound most of its character, as described by Risset and Mathews [RM69], this would hopefully not change the perceptible characterization of the sound drastically.
7. For particular tasks that demand a high degree of control over which segments are combined, it would be useful to let the user manually combine segments. We could provide a sound-builder interface, where the user starts with one segment, then is provided with a menu of every other segment ordered according to goodness of fit. Or there could be criteria other than goodness of fit by which the user chooses to order the segments. The selection of segments offered could also be more sophisticated. For instance, the user could select segments from a whole database of sounds, called up according to some criteria. This takes us closer to the objectives of the CATERPILLAR system [Sch00].
8. Our aim of creating a perceptibly similar sound breaks down with extremely small samples (under 1-3 seconds, depending on the sample characteristics). With such a small amount of data, there is little possibility of producing quality output with an adequate number of segments: the segments become too small, and the result sounds more like granular synthesis. We could provide a system which analyzes the sound and changes perceptible qualities like pitch, amplitude, and colour by small amounts to create the illusion of change or variation without radically changing the perception of the sound. We would also have to choose the segment and resynthesis points carefully, using something like the amplitude-following technique described above.
This algorithm and its proposed extensions all aim to give the user an intuitive interface to sound design. We believe it is not enough to provide novel physical or graphical interfaces to sound synthesis engines; what is often needed, and rarely found in practice, are interfaces that reflect the underlying physical properties of a sound. When dealing with samples from the real world, as we have done, this involves developing methods for signal understanding, and manipulation techniques that preserve important aural properties.
Bibliography
[ABMD92] Marc Antonini, Michel Barlaud, Pierre Mathieu, and Ingrid Daubechies. Image coding using wavelet transform. IEEE Transactions on Image Processing, 2(1):205–220, 1992.

[AD99] Ahmed Alani and Mohamed Deriche. A novel approach to speech segmentation using the wavelet transform. In Fifth International Symposium on Signal Processing and Its Applications, 1999.

[AFG99] Maria Grazia Albanesi, Marco Ferretti, and Alessandro Giancane. Time-frequency decomposition for analysis and retrieval of 1-D signals. In Proceedings of the IEEE International Conference on Multimedia Computing and Systems, volume 2, pages 974–978, 1999.

[ASH87] E. H. Adelson, E. Simoncelli, and R. Hingorani. Orthogonal pyramid transforms for image coding. In Visual Communications and Image Processing II, pages 50–58, 1987.

[BCR91] G. Beylkin, R. Coifman, and V. Rokhlin. Fast wavelet transforms and numerical algorithms, 1991.

[BIN96] Jerry Banks, John S. Carson II, and Barry L. Nelson. Discrete-Event System Simulation. Prentice-Hall, 1996.

[BJDEY+99] Ziv Bar-Joseph, Shlomo Dubnov, Ran El-Yaniv, Dani Lischinski, and Michael Werman. Granular synthesis of sound textures using statistical learning. In Proceedings of the International Computer Music Conference, pages 178–181, 1999.

[BSR98] Gerhard Behles, Sascha Starke, and Axel Robel. Quasi-synchronous and pitch-synchronous granular sound processing with Stampede II. Computer Music Journal, 22:44–51, Summer 1998.

[Cas98] Michael Anthony Casey. Auditory Group Theory with Applications to Statistical Basis Methods for Structured Audio. PhD thesis, Massachusetts Institute of Technology Media Laboratory, 1998.

[CMW92] R. Coifman, Y. Meyer, and M. Wickerhauser. Wavelet analysis and signal processing, 1992.

[Cod92] Mac A. Cody. The fast wavelet transform: Beyond fast Fourier transforms. Dr. Dobb's Journal of Software Tools, 17(4):16–18, 20, 24, 26, 28, 100–101, April 1992.

[Cod94] Mac A. Cody. The wavelet packet transform: Extending the wavelet transform. Dr. Dobb's Journal, April 1994.

[Dau92] Ingrid Daubechies. Ten Lectures on Wavelets, volume 61. Society for Industrial and Applied Mathematics, Philadelphia, 1992.

[DJ97] Christoph Delfs and Friedrich Jondral. Classification of piano sounds using time-frequency signal analysis. In Proceedings of IEEE ICASSP '97, pages 2093–2096, 1997.

[Dry96] Andrzej Drygajlo. New fast wavelet packet transform algorithms for frame synchronized speech processing. In Proc. of the 4th International Conference on Spoken Language Processing, pages 410–413, 1996.

[EC95] Kamran Etemad and Rama Chellappa. Dimensionality reduction of multiscale feature spaces using a separability criterion. In Inter. Conf. on Acoustics, Speech and Signal Processing, 1995.

[Gav88] W. W. Gaver. Everyday listening and auditory icons. PhD thesis, University of California in San Diego, 1988.

[Gib79] James Jerome Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin, 1979.

[Han89] Stephen Handel. Listening: An Introduction to the Perception of Auditory Events. MIT Press, 1989.

[Han95] S. Handel. Timbre perception and auditory object identification. Hearing (Handbook of Perception and Cognition, 2nd Edition), pages 425–461, 1995.

[Hel54] H. L. F. Helmholtz. On the Sensations of Tone as a Psychological Basis for the Theory of Music. Dover, New York, 1954.

[JP88] Douglas L. Jones and Thomas W. Parks. Generation and combination of grains for music synthesis. Computer Music Journal, 12:27–34, Summer 1988.

[KM88] Richard Kronland-Martinet. The wavelet transform for analysis, synthesis, and processing of speech and music sounds. Computer Music Journal, 12(4):11–19, Winter 1988.

[KT98] Damián Keller and Barry Truax. Ecologically-based granular synthesis. In ICMC, pages 117–120, 1998.

[LKS+98] T. Lambrou, P. Kudumakis, R. Speller, M. Sandler, and A. Linney. Classification of audio signals using statistical features on the time and wavelet transform domains. In Proceedings of the IEEE 1998 International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), volume 6, pages 3621–3624, Seattle (WA), May 12–15 1998.

[Mal89] Stephane G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11(7):674–693, 1989.

[MC97] Nadine E. Miner and Thomas P. Caudell. Using wavelets to synthesize stochastic-based sounds for immersive virtual environments. In Proceedings of the International Conference on Auditory Display, 1997.

[MMOP96] Michel Misiti, Yves Misiti, Georges Oppenheim, and Jean-Michel Poggi. Wavelet Toolbox User's Guide. The MathWorks, 1996.

[MZ92] Stephane G. Mallat and S. Zhong. Characterization of signals from multiscale edges. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):710–732, July 1992.

[(or95] Alain Fournier (organizer). Wavelets and their applications in computer graphics. In Siggraph 1995 course notes, 1995.

[PK99] Stefan Pittner and Sagar V. Kamarthi. Feature extraction from wavelet coefficients for pattern recognition tasks. In IEEE Trans. on PAMI, volume 21, pages 83–88, 1999.

[PLLW99] D. K. Pai, J. Lang, J. E. Lloyd, and R. J. Woodham. ACME, a telerobotic active measurement facility. In Experimental Robotics VI, vol. 250 of Lecture Notes in Control and Information Sciences, pages 391–400, 1999.

[Rei93] L. M. Reissell. Multiresolution geometric algorithms using wavelets: Representation for parametric curves and surfaces, 1993.

[RM69] J. Risset and M. Mathews. Analysis of musical instrument tones. Physics Today, 22(2):23–30, 1969.

[Roa78] Curtis Roads. Automated granular synthesis of sound. Computer Music Journal, 2(2):61–62, 1978.

[Roa88] Curtis Roads. Introduction to granular synthesis. Computer Music Journal, 12:11–13, Summer 1988.

[Roa96] Curtis Roads. The Computer Music Tutorial. MIT Press, Cambridge, MA, 1996.

[Ros00] S. Rossignol. Segmentation et indexation des signaux sonores musicaux. PhD thesis, University of Paris VI, July 2000.

[RP00] J. L. Richmond and D. K. Pai. Active measurement and modeling of contact sounds. In Proceedings of the 2000 IEEE International Conference on Robotics and Automation, pages 2146–2152, 2000.

[Sch00] Diemo Schwarz. A system for data-driven concatenative sound synthesis. In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy, December 2000.

[SG97] Ruhi Sarikaya and John N. Gowdy. Wavelet based analysis of speech under stress. In IEEE Southeastcon, volume 1, pages 92–96, Blacksburg, Virginia, 1997.

[SG98] Ruhi Sarikaya and John N. Gowdy. Wavelet based analysis of speech under stress. In Proceedings of the 1998 IEEE Conference on Acoustics, Speech and Signal Processing, volume 1, pages 569–572, 1998.

[Sko80] M. Skolnik. Introduction to Radar Systems. McGraw-Hill Book Co., 1980.

[SN96] G. Strang and T. Nguyen. Wavelets and Filter Banks. Wellesley-Cambridge Press, Wellesley, Massachusetts, 1996.

[SSSE00] Arno Schodl, Richard Szeliski, David H. Salesin, and Irfan Essa. Video textures. In Siggraph, 2000.

[SY98] S. R. Subramanya and Abdou Youssef. Wavelet-based indexing of audio data in audio/multimedia databases. In Proceedings of the International Workshop on Multimedia Database Management Systems, pages 46–53, 1998.

[TLS+94] B. Tan, R. Lang, H. Schroder, A. Spray, and P. Dermody. Applying wavelet analysis to speech segmentation and classification. In H. H. Szu, editor, Wavelet Applications, Proc. SPIE 2242, pages 750–761, 1994.

[Tru88] Barry Truax. Real-time granular synthesis with a digital signal processor. Computer Music Journal, 12:14–26, 1988.

[Tru94] Barry Truax. Discovering inner complexity - time shifting and transposition with a real-time granulation technique. In Computer Music Journal, volume 2, pages 38–48, Summer 1994.

[Tru99] Barry Truax. Handbook for Acoustic Ecology. ARC Publications, 1978. CD-ROM version, Cambridge Street Publishing, 1999.

[vdDKP01] K. van den Doel, P. G. Kry, and D. K. Pai. FoleyAutomatic: Physically-based sound effects for interactive simulation and animation. In Computer Graphics (ACM SIGGRAPH 2001 Conference Proceedings), 2001.

[War99] Richard M. Warren. Auditory Perception: A New Analysis and Synthesis. Cambridge University Press, 1999.

[Wic92] Mladen Victor Wickerhauser. Acoustic signal compression with wavelet packets. In Charles K. Chui, editor, Wavelets – A Tutorial in Theory and Applications, pages 679–700. Academic Press, Boston, 1992.

[Wic94] Mladen Victor Wickerhauser. Adapted Wavelet Analysis from Theory to Software. AK Peters, Ltd., Wellesley, Massachusetts, 1994.

[WL00] Li-Yi Wei and Marc Levoy. Fast texture synthesis using tree-structured vector quantization. In Siggraph, 2000.

[WS85] W. H. Warren and R. E. Shaw. Events and encounters as units of analysis for ecological psychology. In W. H. Warren and R. E. Shaw, editors, Persistence and Change: Proceedings of the First International Conference on Event Perception, pages 1–27, 1985.

[WW99] Eva Wesfreid and Mladen Victor Wickerhauser. Vocal command signal segmentation and phoneme classification. In Alberto A. Ochoa, editor, Proceedings of the II Artificial Intelligence Symposium at CIMAF 99, page 10. Institute of Cybernetics, Mathematics and Physics (ICIMAF), Habana, Cuba, 1999.