
A Tutorial on Spectral Sound Processing Using Max/MSP and Jitter

Jean-François Charles, 1 Route de Plampery, 74470 Vailly, [email protected]

For computer musicians, sound processing in the frequency domain is an important and widely used technique. Two particular frequency-domain tools of great importance for the composer are the phase vocoder and the sonogram. The phase vocoder, an analysis-resynthesis tool based on a sequence of overlapping short-time Fourier transforms, helps perform a variety of sound modifications, from time stretching to arbitrary control of energy distribution through frequency space. The sonogram, a graphical representation of a sound's spectrum, offers composers more readable frequency information than a time-domain waveform.

Such tools make graphical sound synthesis convenient. A history of graphical sound synthesis is beyond the scope of this article, but a few important figures include Evgeny Murzin, Percy Grainger, and Iannis Xenakis. In 1938, Evgeny Murzin invented a system to generate sound from a visible image; the design, based on the photo-optic sound technique used in cinematography, was implemented as the ANS synthesizer in 1958 (Kreichi 1995). Percy Grainger was also a pioneer with the Free Music Machine that he designed and built with Burnett Cross in 1952 (Lewis 1991); the device was able to generate sound from a drawing of an evolving pitch and amplitude. In 1977, Iannis Xenakis and associates built on these ideas when they created the famous UPIC (Unité Polyagogique Informatique du CEMAMu; Marino, Serra, and Raczinski 1993).

In this article, I explore the domain of graphical spectral analysis and synthesis in real-time situations. The technology has evolved so that now, not only can the phase vocoder perform analysis and synthesis in real time, but composers have access to a new conceptual approach: spectrum modifications considered as graphical processing. Nevertheless, the underlying matrix representation is still intimidating to many musicians. Consequently, the musical potential of this technique is as yet unfulfilled.

Computer Music Journal, 32:3, pp. 87–102, Fall 2008. © 2008 Massachusetts Institute of Technology.

This article is intended as both a presentation of the potential of manipulating spectral sound data as matrices and a tutorial for musicians who want to implement such effects in the Max/MSP/Jitter environment. Throughout the article, I consider spectral analysis and synthesis as realized by the Fast Fourier Transform (FFT) and Inverse-FFT algorithms. I assume a familiarity with the FFT (Roads 1995) and the phase vocoder (Dolson 1986). To make the most of the examples, a familiarity with the Max/MSP environment is necessary, and a basic knowledge of the Jitter extension may be helpful.

I begin with a survey of the software currently available for working in this domain. I then show some improvements to the traditional phase vocoder used in both real time and performance time. (Whereas real-time treatments are applied on a live sound stream, performance-time treatments are transformations of sound files that are generated during a performance.) Finally, I present extensions to the popular real-time spectral processing method known as the freeze, to demonstrate that matrix processing can be useful in the context of real-time effects.

Spectral Sound Processing with Graphical Interaction

Several dedicated software products enable graphic rendering and/or editing of sounds through their sonogram. They generally do not work in real time, because a few years ago, real-time processing of complete spectral data was not possible on computers accessible to individual musicians. This calculation limitation led to the development of objects like IRCAM's Max/MSP external iana~, which reduces spectral data to a set of useful descriptors (Todoroff, Daubresse, and Fineberg 1995). After a quick survey of the current limitations of non-real-time software, we review the environments allowing FFT processing and visualization in real time.


    Non-Real-Time Tools

AudioSculpt, a program developed by IRCAM, is characterized by the high precision it offers as well as the possibility to customize advanced parameters for the FFT analysis (Bogaards, Robel, and Rodet 2004). For instance, the user can adjust the analysis window size to a different value than the FFT size. Three automatic segmentation methods are provided and enable high-quality time stretching with transient preservation. Other important functions are frequency-bin-independent dynamics processing (to be used for noise removal, for instance) and application of filters drawn on the sonogram. Advanced algorithms are available, including fundamental frequency estimation, calculation of spectral envelopes, and partial tracking. Synthesis may be realized in real time with sequenced values of parameters; these can, however, not be altered on the fly. The application, with graphical user interface, runs on Mac OS X.

MetaSynth (Wenger and Spiegel 2005) applies graphical effects to FFT representations of sounds before resynthesis. The user can apply graphical filters to sonograms, including displacement maps, zooming, and blurring. A set of operations involving two sonograms is also available: blend, fade in/out, multiply, and bin-wise cross synthesis. It is also possible to use formant filters. The pitfall is that MetaSynth does not give control of the phase information given by the FFT: phases are randomized, and the user is given the choice among several modes of randomization. Although this might produce interesting sounds, it does not offer the most comprehensive range of possibilities for resynthesis.

There are too many other programs in this vein to list them all, and more are being developed each year. For instance, Spear (Klingbeil 2005) focuses on partial analysis, SoundHack (Polansky and Erbe 1996) has interesting algorithms for spectral mutation, and Praat (Boersma and Weenink 2006) is dedicated to speech processing. The latter two offer no interactive graphical control, though.

Figure 1. FFT data recorded in a stereo buffer.

    Real-Time Environments

Pure Data and Max/MSP are two environments widely used by artists, composers, and researchers to process sound in real time. Both enable work in the spectral domain via FFT analysis/resynthesis. I focus on Max/MSP in this article, primarily because the MSP object pfft~ simplifies the developer's task by clearly separating the time domain (outside the pfft~ patcher) from the frequency domain (inside the patcher). When teaching, I find that the graphical border between the time and frequency domains facilitates the general understanding of spectral processing.

Using FFT inside Max/MSP may be seen as difficult because the FFT representation is two-dimensional (i.e., values belong to a time/frequency grid), whereas the audio stream and buffers are one-dimensional. Figure 1 shows how data can be stored when recording a spectrum into a stereo buffer. The tension between the two-dimensional nature of the data and the one-dimensional framework makes coding of advanced processing patterns (e.g., visualization of a sonogram, evaluation of transients, and non-random modifications of spatial energy in the time/frequency grid) somewhat difficult. An example of treatment in this context is found in the work of Young and Lexer (2003), in which the energy in graphically selected frequency regions is mapped onto synthesis control parameters.

The lack of multidimensional data within Max/MSP led IRCAM to develop FTM (Schnell et al. 2005), an extension dedicated to multidimensional data and initially part of the jMax project. More specifically, the FTM Gabor objects (Schnell and Schwarz 2005) enable a generalized granular synthesis, including spectral manipulations. FTM attempts to operate "Faster Than Music" in that the operations are not linked to the audio rate but are done as fast as possible. FTM is intended to be cross-platform and is still under active development. However, the environment is not widely used at the moment, and it is not as easy to learn and use as Max/MSP.

Figure 2. Phase vocoder design.

In 2002, Cycling '74 introduced its own multidimensional data extension to Max/MSP: Jitter. This package is primarily used to generate graphics (including video and OpenGL rendering), but it enables more generally the manipulation of matrices inside Max/MSP. Like FTM, it performs operations on matrices as fast as possible. The seamless integration with Max/MSP makes implementation of audio/video links very simple (Jones and Nevile 2005).

Jitter is widely used by video artists. The learning curve for Max users is reasonable, thanks to the number of high-quality tutorials available. The ability to program in JavaScript and Java within Max is also fully available to Jitter objects. Thus, I chose Jitter as a future-proof platform for implementing this project.

    An Extended Phase Vocoder

Implementing a phase vocoder in Max/MSP/Jitter unlocks a variety of musical treatments that would remain impossible with more traditional approaches. Previous work using Jitter to process sound in the frequency domain includes two sources: first, Luke DuBois's patch jitter pvoc 2D.pat, distributed with Jitter, which shows a way to record FFT data in a matrix, to transform it (via zoom and rotation), and to play it back via the IFFT; second, an extension of this first work, where the authors introduce graphical transformations using a transfer matrix (Sedes, Courribet, and Thiebaut 2004). Both use a design similar to the one presented in Figure 2.

This section endeavors to make the phase-vocoder technique accessible; it presents improvements in the resynthesis and introduces other approaches to graphically based transformations.

    FFT Data in Matrices

An FFT yields Cartesian values, usually translated to their equivalent polar representation. Working with Cartesian coordinates is sometimes useful for minimizing CPU usage. However, to easily achieve the widest and most perceptively meaningful range of transformations before resynthesis, we will work with magnitudes and phase differences (i.e., the bin-by-bin differences between phase values in adjacent frames).

Figure 3 shows how I store FFT data in a Jitter matrix throughout this article. We use 32-bit floating-point numbers here, the same data type that MSP uses. The grid height is the number of frequency bins, namely, half the FFT size, and its length is the number of analysis windows (also called frames in the FFT literature). It is logical to use a two-plane matrix: the first is used for magnitudes, the second for phase differences.
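As an illustration outside of Max, the following Python/NumPy sketch builds the same kind of two-plane structure: plane 0 holds magnitudes, plane 1 holds wrapped bin-by-bin phase differences between adjacent frames. The FFT size, overlap factor, and window choice below are illustrative assumptions, not values prescribed by the article.

```python
import numpy as np

def stft_to_planes(x, fft_size=1024, overlap=4):
    """Analyze signal x into a 2-plane matrix: plane 0 holds magnitudes,
    plane 1 holds bin-by-bin phase differences between adjacent frames."""
    hop = fft_size // overlap
    win = np.hanning(fft_size)
    n_frames = 1 + (len(x) - fft_size) // hop
    bins = fft_size // 2                                   # grid height, as in Figure 3
    planes = np.zeros((2, bins, n_frames), dtype=np.float32)  # 32-bit floats, as in MSP/Jitter
    prev_phase = np.zeros(bins)
    for n in range(n_frames):
        frame = x[n * hop : n * hop + fft_size] * win
        spec = np.fft.rfft(frame)[:bins]                   # Cartesian FFT values
        planes[0, :, n] = np.abs(spec)                     # magnitude plane
        phase = np.angle(spec)
        dphi = phase - prev_phase                          # phase difference
        planes[1, :, n] = np.mod(dphi + np.pi, 2 * np.pi) - np.pi  # wrap to [-pi, pi]
        prev_phase = phase
    return planes
```

The later sketches in this article assume this [plane, frequency bin, frame] layout.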


Figure 3. FFT representation of a sound stored in a matrix.

    Interacting with a Sonogram

The sonogram represents the distribution of energy at different frequencies over time. To interact with this representation of the spectrum, we must scale the dimensions of the picture to the dimensions of the matrix holding the FFT data. To map the matrix cell coordinates to the frequency and time domains, we use the following relations. First, the time position t (in seconds) is given by

t = n · (WindowSize / OverlapFactor) · (1 / sr)    (1)

where n is the frame index, sr is the sampling rate (Hz), and WindowSize is given in samples. Second, the center frequency f_c (Hz) of frequency bin m is

f_c = m · sr / FFTSize    (2)

Third, assuming no more than one frequency is present in each frequency bin of the analyzed signal, its value in Hz can be expressed as

f = f_c + Δφ · sr / (2π · WindowSize)    (3)

where Δφ is the phase difference, wrapped within the range [−π, π] (Moore 1990). Note that with the pfft~ object, the window size (i.e., frame size) is the same as the FFT size. (In other words, there is no zero-padding.)
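The mapping of Equations 1–3 from a matrix cell (bin m, frame n) to time and frequency can be written compactly. The sketch below assumes, as stated above, that the window size equals the FFT size; the default sampling rate and overlap factor are only illustrative.

```python
import numpy as np

def cell_to_time_freq(m, n, dphi, sr=44100.0, window_size=1024, overlap=4):
    """Map matrix cell (bin m, frame n) plus its phase difference dphi
    to time position, bin center frequency, and estimated frequency."""
    t = n * (window_size / overlap) * (1.0 / sr)           # Equation 1
    fc = m * sr / window_size                              # Equation 2 (FFTSize == WindowSize)
    dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi         # wrap phase difference to [-pi, pi]
    f = fc + dphi * sr / (2 * np.pi * window_size)         # Equation 3
    return t, fc, f
```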

The patch in Figure 4 shows a simple architecture for interacting with the sonogram. Equations 1 and 2 and part of Equation 3 are implemented inside patcher (or p) objects. To reduce computational cost, the amplitude plane of the matrix holding the FFT data is converted to a low-resolution matrix before the display adjustments (i.e., the inversion of low/high frequencies and the inversion of black and white, because by default, 0 corresponds to black and 1 to white in Jitter). The OpenGL implementation of the display is more efficient and flexible, but it is beyond the scope of this article.

    Recording

When implementing the recording of FFT data into a matrix, we must synchronize the frame iteration with the FFT bin index. The frame iteration must happen precisely when the bin index leaps back to 0 at the beginning of a new analysis window. Luke DuBois's patch jitter pvoc 2D.pat contains one implementation of a solution. (See the object count~ in [pfft~ jit.recordfft.pat].)

    Playback

Figure 5 shows a simple playback algorithm using an IFFT. A control signal sets the current frame to read.


Figure 4. Interaction with a sonogram.

To play back the sound at the original speed, the control signal must be incremented by one frame at a period equal to the hop size. Hence, the frequency f (in Hz) of the phasor~ object driving the frame number is

f = sr / (N · HopSize) · PlaybackRate    (4)

where N is the total number of analysis windows. Here, PlaybackRate sets the apparent speed of the sound: 1 is normal, 2 is twice as fast, 0 freezes one frame, and −1 plays the sound backwards.
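Equation 4 can be checked numerically with a small helper; the frame count and hop size below are hypothetical, not taken from the article.

```python
def phasor_frequency(sr, n_frames, hop_size, playback_rate):
    """Frequency (Hz) of the ramp driving the frame index (Equation 4)."""
    return sr / (n_frames * hop_size) * playback_rate

# Example: 44.1 kHz, 430 analysis windows, hop of 1,024 samples, original speed.
# 430 * 1024 / 44100 is roughly 10 s of sound, so the ramp sweeps at about 0.1 Hz.
f = phasor_frequency(44100, 430, 1024, 1.0)
```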

    Advanced Playback

In extreme time stretching, the two main artifacts of the phase vocoder are the frame effect and the time-stretching of transients. In my composition Plex for solo instrument and live electronics, ten seconds are recorded at the beginning of the piece and played back 36 times more slowly, spread over 6 minutes. Considering an FFT with a hop size of 1,024 samples and a sampling rate of 44,100 Hz, each analysis window with an original duration of 23 msec is stretched out to 836 msec. During synthesis with the traditional phase-vocoder method, the leap from one frame to the following one is audible with a low playback speed. This particular artifact, the frame effect, has not received much attention in the literature. I will describe two methods to interpolate spectra between two recorded FFT frames, one of which can easily be extended to a real-time blurring effect.

When humans slow down the pacing of a spoken sentence, they are bound to slow down vowels more than consonants, some of which are physically impossible to slow down. Similarly, musicians can naturally slow down the tempo of a phrase without slowing down the attacks. That is why time-stretching a sound uniformly, including transients, does not sound natural. Transient preservation in the phase vocoder has been studied, and several efficient algorithms have been presented (Laroche and Dolson 1999; Robel 2003). Nevertheless, these propositions have not been created in a form adapted to real-time environments. I present a simple, more straightforward approach that computes a transient value for each frame and uses it during resynthesis. That will naturally lead to an algorithm for segmentation, designed for performance time as well.

Figure 5. A simple player.

    Removing the Frame Effect

This section shows two algorithms interpolating intermediary spectra between two regular FFT frames, thus smoothing transitions and diminishing the frame effect. In both cases, a floating-point number describes the frame to be resynthesized: for instance, 4.5 for an interpolation halfway between frames 4 and 5.

In the first method, the synthesis module continuously plays a one-frame spectrum stored in a one-column matrix. As illustrated in Figure 6, this matrix is fed with linear interpolations between two frames from the recorded spectrum, computed with the jit.xfade object. Values from both amplitude and phase-difference planes are interpolated. With a frame number of 4.6, the value in a cell of the interpolated matrix is the sum of 40 percent of the value in the corresponding cell in frame 4 and 60 percent of the value in frame 5.
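A minimal sketch of this interpolation, assuming the two-plane matrix layout introduced earlier; it mirrors what jit.xfade does without claiming to reproduce the patch.

```python
import numpy as np

def interpolated_frame(planes, frame_pos):
    """Linear interpolation between two recorded FFT frames.
    planes has shape [2, bins, frames]; frame_pos is a float such as 4.6."""
    i = int(np.floor(frame_pos))
    frac = frame_pos - i              # 0.6 -> 40 % of frame 4 plus 60 % of frame 5
    return (1.0 - frac) * planes[:, :, i] + frac * planes[:, :, i + 1]
```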

My second approach is a controlled stochastic spectral synthesis. For each frame to be synthesized, each amplitude/phase-difference pair is copied from the recorded FFT data, either from the corresponding frame or from the next frame. The probability of picking up values in the next frame instead of the current frame is the fractional value given by the user. For instance, a frame number of 4.6 results in a probability of 0.4 that values are copied from the original frame 4 and a probability of 0.6 that values are copied from the original frame 5. This random choice is made for each frequency bin within each synthesis frame, such that two successive synthesized frames do not have the same spectral content, even if the specification is the same. The implementation shown in Figure 7 is a small modification of the engine presented in Figure 5. A signal-rate noise with amplitude from 0 to 1 is added to the frame number. While reading the FFT data matrix, jit.peek~ truncates the floating-point value to an integer giving the frame number. The vertical synchronization is removed to enable the frame number to change at any time. Increasing the floating-point value in the control signal creates an increased probability of reading the next frame.

Figure 6. Interpolated spectrum between two analysis windows.

Figure 7. A smooth player.

Figure 8. Stochastic synthesis between two frames, blur over several frames.

This method can be readily extrapolated to blur any number of frames over time with negligible additional cost in CPU power. Instead of adding a noise between 0 and 1, we scale it to [0, R], where R is the blur width in frames. If C is the index of the current frame, each pair of values is then copied from a frame randomly chosen between frames C and C + R, as shown in Figure 8. The musical result is an ever-moving blur of the sound, improving the quality of frozen sounds in many cases. The blur width can be controlled at the audio rate.
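A sketch of this stochastic selection, covering both the two-frame case (blur width 1) and the generalized blur over R frames. The array layout is the assumed two-plane matrix used above, and the clipping at the last frame is an added guard, not part of the original design.

```python
import numpy as np

def stochastic_frame(planes, frame_pos, blur_width=1.0):
    """For each bin, copy the amplitude/phase-difference pair from a frame chosen
    at random between frame_pos and frame_pos + blur_width (cf. Figure 8)."""
    bins = planes.shape[1]
    noise = np.random.uniform(0.0, blur_width, size=bins)  # noise in [0, R], one value per bin
    idx = np.floor(frame_pos + noise).astype(int)          # truncate to an integer frame number
    idx = np.clip(idx, 0, planes.shape[2] - 1)             # guard against reading past the end
    return planes[:, np.arange(bins), idx]                 # one synthesis frame, shape [2, bins]
```

With blur_width = 1.0 and frame_pos = 4.6, each bin reads frame 5 with probability 0.6 and frame 4 with probability 0.4, matching the description above.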

Both the interpolated-frame and the stochastic playback methods produce high-quality results during extremely slow playback, but they are useful in different contexts. The interpolated frames often sound more natural. Because it lacks vertical coherence, the stochastic method may sound less natural, but it presents other advantages. First, it is completely implemented in the perform thread of Max/MSP; as a result, CPU use is constant regardless of the parameters, including playback rate. However, the interpolated spectrum must be computed in the low-priority thread, meaning that the CPU load increases with the number of frames to calculate. Second, the stochastic method offers a built-in blurring effect, whereas achieving this effect with matrix interpolation would require additional programming and be less responsive. In what follows, I continue to develop this second method, because it adheres to the original phase-vocoder architecture and is more flexible.

    Transient Preservation

I propose a simple algorithm for performance-time transient evaluation, along with a complementary algorithm to play a sound at a speed proportional to the transient value. My approach consists in assigning to each frame a transient value, defined as the spectral distance between this frame and the preceding frame, normalized to [0, 1] over the set of frames.

Several choices are possible for the measure of the spectral distance between two frames. The Euclidean distance is the most obvious choice. Given M frequency bins, with a_{m,n} the amplitude in bin m of frame n, the Euclidean distance between frames n and n−1 is

t_n = √( Σ_{m=0}^{M−1} (a_{m,n} − a_{m,n−1})² )    (5)


Figure 9. A transient value for each frame.

In this basic formula, we can modify two parameters that influence quality and efficiency: first, the set of descriptors out of which we calculate the distance, and second, the distance measure itself. The set of descriptors used in Equation 5 is the complete set of frequency bins: applying the Euclidean distance weights high frequencies as much as low ones, which is not perceptually accurate. For a result closer to human perception, we would weight the different frequencies according to the ear's physiology, for instance by using a logarithmic scale for the amplitudes and by mapping the frequency bins non-linearly to a set of relevant descriptors like the 24 Bark or Mel-frequency coefficients. In terms of CPU load, distance is easier to calculate over 24 coefficients than over the whole range of frequency bins, but the calculation of the coefficients makes the overall process more computationally expensive. Although using the Euclidean distance makes mathematical sense, the square root is a relatively expensive calculation. To reduce the computation load, in Figure 9 I use a rough approximation of the Euclidean distance:

t_n = Σ_{m=0}^{M−1} |a_{m,n} − a_{m,n−1}|    (6)

The playback part of the vocoder can use the transient value to drive the playback rate. In the patch presented in Figure 10, the user specifies the playback rate as a transient rate r_trans (the playback rate for the greatest transient value) and a stationary rate r_stat (the playback rate for the most stationary part of the sound). Given the current frame's transient value tr, the instantaneous playback rate r_inst is given by

r_inst = r_stat + tr · (r_trans − r_stat)    (7)

The musical result of a continuous transient value between 0 and 1 is interesting, because it offers a built-in protection against errors of classification as transient or stationary. This is especially useful in a performance-time context. A binary transient value could, of course, be readily implemented by a comparison to a threshold applied to all the values of the transient matrix with a jit.op object.

Controlling the blur width (see Figure 8) with the transient value may be musically useful as well. The most basic idea is to make the blur width inversely proportional to the transient value, as presented in Figure 11. Similarly to the playback rate, the instantaneous blur width b_inst is given by

b_inst = b_stat + tr · (b_trans − b_stat)    (8)

where tr is the current frame's transient value, b_stat is the stationary blur width, and b_trans is the transient blur width.
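Both mappings are simple affine functions of the transient value. A sketch follows; the example values are purely illustrative and not taken from the article.

```python
def playback_rate(tr, r_stat, r_trans):
    """Instantaneous playback rate from the transient value (Equation 7)."""
    return r_stat + tr * (r_trans - r_stat)

def blur_width(tr, b_stat, b_trans):
    """Instantaneous blur width from the transient value (Equation 8)."""
    return b_stat + tr * (b_trans - b_stat)

# Hypothetical settings: stretch stationary parts 8x (rate 0.125) but keep attacks
# near full speed, and blur stationary parts over 12 frames while keeping transients sharp.
r = playback_rate(0.9, 0.125, 1.0)
b = blur_width(0.9, 12.0, 0.0)
```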

    Segmentation

Figure 10. Re-synthesizing at self-adapting speed.

Figure 11. Blur width controlled by transient value. During the playback of stationary parts of the sound (left), the blurred region is wider than during the playback of transients (right).

This section describes a performance-time method of segmentation based on transient values of successive frames. I place a marker where the spectral changes from frame to frame are greater than a given threshold. As Figure 12 shows, frame n is marked if and only if its transient value is greater than or equal to the transient value in the preceding frame plus a threshold h, that is, whenever

t_n ≥ t_{n−1} + h    (9)

The formula used to calculate transient values appears to have a great influence on the segmentation result. In the preceding section, we used the Euclidean distance or an approximation by absolute difference (Equations 5 and 6). Both expressions yield a useful auto-adaptive playback rate, but my experience is that the following expression is preferable for subsequent segmentation with Equation 9:

t_n = Σ_{m=0}^{M−1} a_{m,n} / a_{m,n−1}    (10)


Figure 12. Automatic performance-time segmentation.

Indeed, whereas the Euclidean distance or the absolute difference give spectrum changes a lower weight at locally low amplitudes than at high amplitudes, the division in Equation 10 enables a scaling of the degree of spectrum modification to the local level of amplitude. Musical applications include real-time leaps from one segmentation marker to another, resulting in a meaningful shuffling of the sound.

An important point to retain from this section is that, whereas ideas may materialize first as iterative expressions (e.g., see the summations in Equations 5, 6, and 10), the implementation in Jitter is reduced to a small set of operations on matrices. To take full advantage of Jitter, we implement parallel equivalents to iterative formulas.
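For example, the iterative sums of Equations 5, 6, and 10 and the test of Equation 9 reduce to whole-matrix operations. The following NumPy sketch is one possible parallel equivalent; the small epsilon guarding the division in Equation 10 is an added assumption, not part of the original formula.

```python
import numpy as np

def transient_values(amp, eps=1e-9):
    """Per-frame transient values computed as matrix operations.
    amp has shape [bins, frames]; each result has length frames - 1."""
    diff = amp[:, 1:] - amp[:, :-1]
    t_euclid = np.sqrt(np.sum(diff ** 2, axis=0))                 # Equation 5
    t_abs = np.sum(np.abs(diff), axis=0)                          # Equation 6
    t_ratio = np.sum(amp[:, 1:] / (amp[:, :-1] + eps), axis=0)    # Equation 10 (eps avoids /0)
    return t_euclid, t_abs, t_ratio

def normalize(t):
    """Normalize transient values to [0, 1] over the set of frames."""
    return (t - t.min()) / (t.max() - t.min() + 1e-12)

def segmentation_markers(t, h):
    """Mark frame n whenever t[n] >= t[n-1] + h (Equation 9)."""
    return np.flatnonzero(t[1:] >= t[:-1] + h) + 1
```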

    Graphically Based Transformations

Now, we explore several ways of transforming a sound through its FFT representation stored in two-dimensional matrices. The four categories I examine are direct transformations, use of masks, interaction between sounds, and mosaicing.

    Direct Transformations

The most straightforward transformation consists of the application of a matrix operation to an FFT data matrix. In Jitter, we can use all the objects that work with 32-bit floating-point numbers. A flexible implementation of such direct transformations is possible thanks to jit.expr, an object that parses and evaluates expressions to an output matrix. Moreover, all Jitter operators compatible with 32-bit floating-point numbers are available in jit.expr.

Phase information is important for the quality of resynthesized sounds. The choice to apply a treatment to amplitude and phase difference or only to amplitude, or to apply a different treatment to the phase-difference plane, is easily implemented whether in jit.expr or with jit.pack and jit.unpack objects. This choice depends on the particular situation and must generally be made after experimenting.

In Figure 13, the expression attribute of the second jit.expr object is [gtp(in[0].p[0]\,1.3) in[0].p[1]]. The first part of the expression is evaluated onto plane 0 of the output matrix; it yields the amplitude of the new sound. It applies the operator gtp (which passes its argument if its value is greater than the threshold; otherwise, it yields 0) to the amplitude plane of the incoming matrix with the threshold 1.3. It is a rough denoiser. (In the previous expression, in[0].p[0] is plane 0 of the matrix in input 0.) The second part of the expression copies the phase-difference plane of the incoming matrix (i.e., in[0].p[1]) to plane 1 of the output; the phase is kept unchanged.
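Outside of Jitter, the same rough denoiser can be sketched as a thresholding of the amplitude plane. The threshold 1.3 is the value quoted above; the two-plane array layout is the one assumed throughout these sketches.

```python
import numpy as np

def rough_denoise(planes, threshold=1.3):
    """Equivalent of gtp(in[0].p[0], 1.3): keep amplitudes above the threshold,
    zero the others, and pass the phase-difference plane through unchanged."""
    amp, dphi = planes[0], planes[1]
    out_amp = np.where(amp > threshold, amp, 0.0)
    return np.stack([out_amp, dphi])
```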

    Masks

Figure 13. Graphical transformations.

Figure 14. Two designs to use masks.

Masks enable the gradual application of an effect to a sound, thus refining the sound quality. A different percentage of the effect can be used in every frequency bin and every frame. In this article, I limit the definition of a mask to a grid of the same resolution as the FFT analysis of the sound, with values between 0 and 1. In Jitter, that means a one-plane, 32-bit floating-point matrix of the same dimensions as the sound matrix.

A mask can be arbitrarily generated, interpolated from a graphical file, drawn by the user on top of the sound sonogram, or computed from another sound or the sound itself. For example, let us consider a mask made of the amplitudes (or the spectral envelope) of one sound, mapped to [0, 1]. Multiplying the amplitudes of a source sound with this mask is equivalent to applying a source-filter cross synthesis.
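A sketch of both ideas follows, assuming the two-plane layout: a generic mask multiplication, and a mask derived from another sound's amplitudes. The normalization by the maximum is one possible mapping to [0, 1], not necessarily the author's.

```python
import numpy as np

def apply_mask(planes, mask):
    """Multiply the amplitude plane by a mask in [0, 1]; phase differences untouched."""
    out = planes.copy()
    out[0] = out[0] * mask            # mask: [bins, frames], same grid as the analysis
    return out

def envelope_mask(other_planes):
    """A mask built from another sound's amplitudes, mapped to [0, 1]:
    applying it amounts to a source-filter cross synthesis."""
    amp = other_planes[0]
    return amp / (amp.max() + 1e-12)
```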

Figure 14 presents two possible designs. In the left version, the processing uses the mask to produce the final result; this is memory-efficient. In the right version, the original sound is mixed with the processed one; this allows a responsive control of the mix that can be adjusted without re-calculation of the sound transformation.

    Interaction of Sounds

Figure 15. Context-free mosaicing algorithm.

Generalized cross synthesis needs both amplitude and phase difference from both sounds. Similarly, interpolating and morphing sounds typically requires amplitude and phase information from both sounds.

It is easy to implement interpolation between sounds with jit.xfade, in the same way as interpolation between frames in Figure 6. An interesting musical effect results from blurring the cross-faded sound with filters such as jit.streak, jit.sprinkle, and jit.plur.

As opposed to an interpolated sound, a morphing between two sounds evolves from the first to the second within the duration of the synthesized sound itself. Such a morphing can be implemented as an interpolation combined with three masks: one for the disappearance of the first source sound, a second for the appearance of the second sound, and a third one to apply a variable amount of blurring.
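A minimal sketch of such a morphing, using only the first two (complementary) masks; the linear ramp is an illustrative choice, and the blur-amount mask is omitted for brevity.

```python
import numpy as np

def morph(planes_a, planes_b, n_frames):
    """Morph from sound A to sound B over n_frames frames using two complementary masks."""
    ramp = np.linspace(0.0, 1.0, n_frames)       # 0 -> only A, 1 -> only B
    mask_out = 1.0 - ramp                         # disappearance of the first sound
    mask_in = ramp                                # appearance of the second sound
    return mask_out * planes_a[:, :, :n_frames] + mask_in * planes_b[:, :, :n_frames]
```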

    Mosaicing

Mosaicing is a subset of concatenative synthesis, where a target sound is imitated as a sequence of best-matching elements from a database (Schwarz 2006). In the context of the phase vocoder, I provide a simple mosaicing by spectral frame similarity, similar to the Spectral Reanimation technique (Lyon 2003). Before resynthesis, each FFT frame in the target sound is replaced by the closest frame in the database, irrespective of the context. The algorithm is illustrated in Figure 15.

Three factors influence the speed and quality of the resulting sound. First, the amplitude scale might be linear (less CPU intensive) or logarithmic (closer to perception). Second, the descriptors for each frame can take variable sizes. The set of amplitudes in each frequency bin is directly available, but it is a large vector. A linear interpolation to a reduced number of bands, a set of analyzed features (pitch, centroid, etc.), and a nonlinear interpolation to the 24 Bark or Mel-frequency coefficients are other possible vectors. The third factor is the distance used to find the closest vector (Euclidean distance, or an approximation sparing the square-root calculation).

My choice has been to use a logarithmic amplitude scale, a vector of Bark coefficients, and the Euclidean distance or an approximation similar to Equation 6. The Bark coefficients, reflecting ear physiology, are well suited to this task.
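A context-free matching loop can be sketched as follows. The per-frame descriptor vectors (e.g., log-amplitude Bark coefficients, as suggested above) are assumed to be computed elsewhere, and the squared Euclidean distance is used to spare the square root.

```python
import numpy as np

def mosaic(target_desc, db_desc, db_planes):
    """Replace each target frame by the closest database frame, irrespective of context.
    target_desc and db_desc are descriptor matrices of shape [n_coeffs, frames]."""
    out_cols = []
    for n in range(target_desc.shape[1]):
        d = np.sum((db_desc - target_desc[:, n:n+1]) ** 2, axis=0)  # squared distances
        out_cols.append(db_planes[:, :, np.argmin(d)])              # best-matching frame
    return np.stack(out_cols, axis=-1)                              # [2, bins, target frames]
```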


Figure 16. Real-time stochastic freeze implementation.

    Real-Time Freeze and Beyond

The matrix approach to spectral treatments allows not only a variety of performance-time extensions to the phase vocoder but also improvements in real-time processing, such as freezing a sound and transforming a melody into a chord.

    Real-Time Freeze

A simple way to freeze a sound in real time is to resynthesize one spectral frame continuously. I improve the sound quality by freezing several frames at once and then resynthesizing the complete set of frames with the stochastic blurring technique described previously (Figure 8).

My implementation features a continuous spectral recording of the input sound into a circular buffer of eight frames (an eight-column matrix). This matrix is copied into the playback matrix upon receipt of the freeze message.
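A sketch of this freeze logic, with the eight-frame circular buffer and the copy triggered by the freeze message; the class structure and method names are purely illustrative.

```python
import numpy as np

class StochasticFreeze:
    """Keep the last 8 analysis frames in a circular buffer; on 'freeze',
    copy them to the playback matrix for stochastic blurred resynthesis."""
    def __init__(self, bins, n_frames=8):
        self.record = np.zeros((2, bins, n_frames))   # circular recording buffer
        self.play = np.zeros_like(self.record)        # matrix read by the synthesis engine
        self.pos = 0

    def write_frame(self, frame):                     # frame: [2, bins] from the analysis
        self.record[:, :, self.pos] = frame
        self.pos = (self.pos + 1) % self.record.shape[2]

    def freeze(self):
        self.play[:] = self.record                    # triggered by the freeze message
```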

    Automatic Cross-Fade

The stochastic freeze presented in Figure 16 switches abruptly from one frozen sound to the next. A transition sub-patch between the matrix to freeze (freeze8-record) and the matrix currently playing (freeze8-play) can be used to cross-fade when a new sound is frozen. Such sub-patches incorporate first a counter to generate a number of transition matrices, and second a cross-fading object such as jit.xfade controlled by a user-drawn graph specifying the cross-fade curve.

    Melody to Harmony

In my composition Aqua for solo double bass, aquaphon, clarinet, and live electronics, arpeggios are transformed in real time into chords with a modified freeze tool. As shown in Figure 17, the currently synthesized spectrum is added to the incoming one via a feedback loop. The build harmony block uses the current output synthesis data and the new window analysis to calculate a matrix reflecting the addition of the new note. The first solution for this computation is to average the energy in each frequency bin over all frozen spectra:

ā_n = (1/n) · Σ_{i=1}^{n} a_i    (11)


Figure 17. Real-time freeze design with extensions.

where n is the number of frames to average, ā_n is the average amplitude, and a_i is the amplitude in frame i. The same formula can be used for phase differences. The implementation requires a recursive equivalent to Equation 11:

ā_{n+1} = (n · ā_n + a_{n+1}) / (n + 1)    (12)

Note that in the case of stochastic blurred resynthesis over eight frames, this operation is done independently for each frame.

However, the solution I ended up using in concert is given by

ā_n = √( (1/n) · Σ_{i=1}^{n} a_i² )    (13)

which is written and implemented recursively as

ā_{n+1} = √( (n · ā_n² + a_{n+1}²) / (n + 1) )    (14)

Indeed, the latter formula produces a more powerful sound, compensating for the low-amplitude frames that may be recorded.
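A sketch of both accumulation rules in recursive form, assuming the quadratic-mean reading of Equations 13 and 14 used above; amplitudes can be scalars or whole amplitude planes.

```python
import numpy as np

def add_note_mean(avg, new_amp, n):
    """Recursive arithmetic mean over n frozen spectra (Equation 12)."""
    return (n * avg + new_amp) / (n + 1)

def add_note_rms(avg, new_amp, n):
    """Recursive quadratic mean (Equations 13-14 as read here); it yields a more
    powerful result than the plain average when some frames have low amplitude."""
    return np.sqrt((n * avg ** 2 + new_amp ** 2) / (n + 1))
```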

    Conclusion

This article provides an overview of techniques to perform spectral audio treatments using matrices in the Max/MSP/Jitter environment. I have presented a paradigm for extensions of the phase vocoder in which advanced graphical processing is possible in performance time. The composition of sound treatments described by Hans Tutschku (Nez 2003) is now available not only in the studio, but also during a concert. I have also described an improvement to the real-time effect known as the freeze. I hope that these musical tools will help create not only new sounds, but also new compositional approaches. Some patches described in this article, complementary patches, and sound examples are available online at www.jeanfrancoischarles.com.

    Acknowledgments

Thanks to Jacopo Baboni Schilingi and Hans Tutschku for their invitation to present this material within the Prisma group; the exchanges with the group have been the source of important developments. Thanks to Joshua Fineberg for helping me through the writing of the article. Thanks to Örjan Sandred for his useful comments after his attentive reading of my draft. Thanks to Andrew Bentley for his invitation to teach this material.

    References

Boersma, P., and D. Weenink. 2006. Praat: Doing Phonetics by Computer. Available online at www.praat.org. Accessed 11 October 2006.


Bogaards, N., A. Robel, and X. Rodet. 2004. Sound Analysis and Processing with AudioSculpt 2. Proceedings of the 2004 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 462–465.

Dolson, M. 1986. The Phase Vocoder: A Tutorial. Computer Music Journal 10(4):14–27.

Jones, R., and B. Nevile. 2005. Creating Visual Music in Jitter: Approaches and Techniques. Computer Music Journal 29(4):55–70.

Klingbeil, M. 2005. Software for Spectral Analysis, Editing, and Synthesis. Proceedings of the 2005 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 107–110.

Kreichi, S. 1995. The ANS Synthesizer: Composing on a Photoelectronic Instrument. Leonardo 28(1):59–62.

Laroche, J., and M. Dolson. 1999. Improved Phase Vocoder Time-Scale Modification of Audio. IEEE Transactions on Speech and Audio Processing 7(3):323–332.

Lewis, T. P. 1991. Free Music. In T. P. Lewis, ed. A Source Guide to the Music of Percy Grainger. White Plains, New York: Pro-Am Music Resources, pp. 153–162.

Lyon, E. 2003. Spectral Reanimation. Proceedings of the Ninth Biennial Symposium on Arts and Technology. New London, Connecticut: Connecticut College, pp. 103–105.

Marino, G., M.-H. Serra, and J.-M. Raczinski. 1993. The UPIC System: Origins and Innovations. Perspectives of New Music 31(1):258–269.

Moore, F. R. 1990. Elements of Computer Music. Englewood Cliffs, New Jersey: Prentice-Hall.

Nez, K. 2003. An Interview with Hans Tutschku. Computer Music Journal 27(4):14–26.

Polansky, L., and T. Erbe. 1996. Spectral Mutation in SoundHack. Computer Music Journal 20(1):92–101.

Roads, C. 1995. The Computer Music Tutorial. Cambridge, Massachusetts: MIT Press.

Robel, A. 2003. Transient Detection and Preservation in the Phase Vocoder. Proceedings of the 2003 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 247–250.

Schnell, N., et al. 2005. FTM: Complex Data Structures for Max. Proceedings of the 2005 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 9–12.

Schnell, N., and D. Schwarz. 2005. Gabor: Multi-Representation Real-Time Analysis/Synthesis. Proceedings of the 8th International Conference on Digital Audio Effects (DAFx-05). Madrid: Universidad Politecnica de Madrid, pp. 122–126.

Schwarz, D. 2006. Concatenative Sound Synthesis: The Early Years. Journal of New Music Research 35(1):3–22.

Sedes, A., B. Courribet, and J. B. Thiebaut. 2004. From the Visualization of Sound to Real-Time Sonification: Different Prototypes in the Max/MSP/Jitter Environment. Proceedings of the 2004 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 371–374.

Todoroff, T., E. Daubresse, and J. Fineberg. 1995. Iana (A Real-Time Environment for Analysis and Extraction of Frequency Components of Complex Orchestral Sounds and Its Application within a Musical Context). Proceedings of the 1995 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 292–293.

Wenger, E., and E. Spiegel. 2005. MetaSynth 4.0 User Guide and Reference. San Francisco, California: U&I Software.

Young, M., and S. Lexer. 2003. FFT Analysis as a Creative Tool in Live Performance. Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03). London: Queen Mary, University of London. Available online at www.elec.qmul.ac.uk/dafx03/proceedings/pdfs/dafx29.pdf.
