DECLARATION

All sentences or passages quoted in this dissertation from other people's work have been specifically acknowledged by clear cross-referencing to author, work and page(s). I understand that failure to do this amounts to plagiarism and will be considered grounds for failure in this dissertation and the degree examination as a whole.

Name: Stuart Nicholas Wrigley

Signature:

Date: 29 April 1998


Audio Morphing

Stuart Nicholas Wrigley

29 April 1998

Supervisors: Dr Guy Brown, Dr Martin Cooke

This report is submitted in partial fulfilment of the requirements for the degree of Bachelor of Science with Honours in Computer Science by Stuart Nicholas Wrigley.


Abstract

This paper describes the techniques needed to automatically morph from one sound signal to another. To do this, each signal's information has to be converted into another representation which encodes the pitch and spectral envelope on orthogonal axes. Individual components of the sounds are then matched and their amplitudes interpolated to produce a new sound. This new signal's representation then has to be converted back to an acoustic waveform. This paper describes the representations of the signals required to effect the morph, and also the techniques required to match the sound components, interpolate the amplitudes and invert the new sound's representation back to an acoustic waveform.


Contents

1. Introduction .......................................................... 1
2. Background Theory ..................................................... 3
   2.1 Digital representation of sound ................................... 3
   2.2 Linear Systems Theory ............................................. 6
   2.3 Windowing ......................................................... 6
   2.4 The Fourier Series ................................................ 8
   2.5 Cepstral Analysis ................................................. 10
3. Previous Morphing Work ................................................ 13
4. The Audio Morphing Process ............................................ 15
   4.1 Pre-processing .................................................... 15
   4.2 Matching and Warping .............................................. 16
      4.2.1 Dynamic Time Warping ......................................... 16
   4.3 Morphing .......................................................... 21
   4.4 Signal Estimation from magnitude DFT .............................. 24
   4.5 Sound Clips ....................................................... 26
   4.6 Use of the Signal Estimation algorithm ............................ 26
      4.6.1 Estimated Signal Quality ..................................... 26
   4.7 Summary ........................................................... 31
5. Results and Discussion ................................................ 33
6. Conclusions ........................................................... 37
7. References ............................................................ 38

Appendix 1 - Project Milestones
Appendix 2 - Source Code Extracts


1. Introduction

The aim of this project is to develop and implement a system for morphing between one sound and another. Like image morphing, audio morphing aims to preserve the shared characteristics of the starting and final sounds, while generating a smooth transition between them. An example of image morphing is shown in Figure 1.1. This shows that the in-between images all show one face smoothly changing its shape and texture until it turns into the target face.

Figure 1.1 Visual morph between two distinct faces [7].

It is this feature that an audio morph should possess. One sound should smoothly change into another, keeping the shared characteristics of the starting and ending sounds but smoothly changing the other properties. Figure 1.1 shows the intermediate images at equal time intervals as the morph progresses. A popular use of image morphing is not to show the images separately, but to have a single view showing the sequence of images smoothly fading from one to the next until the morph is complete. This approach will be used in the audio morphing: a single sound clip will be produced which progresses from the first clip to the second clip.

This area of research offers many possible applications, including sound synthesis. For example, there are two major methods for synthesising musical notes. One is to digitally model the sound's physical source and provide a number of parameters in order to produce a synthetic note of the desired pitch. Another is to take two notes which bound the desired note and use the principles of audio morphing to manufacture a note which contains the shared characteristics of the bounding notes but whose other properties have been altered to form a new note. Using the analogue of visual morphing, the new note would be equivalent to one of the intermediate images produced by the process.

In the two years since Slaney et al. [6] published their paper, their ideas and processes have already been put into use in a number of different fields. One such field is music. In 1996, Settel and Lippe [17] demonstrated real-time audio morphing at the 7th International Symposium on Electronic Art and predicted that audio morphing would sweep through the music world in the same way that visual morphing did in the graphics world. Evidence for this prediction is apparent in the number of hardware audio morphing solutions. Lexicon Inc., founded by MIT Professor Dr. Francis Lee, is a recognised leader in digital signal processing (DSP) technology. In their suite of DSP modules is the Vortex [16], which allows real-time production of audio effects using their Audio Morphing(TM) processor.

The use of pitch manipulation within the algorithm also has an interesting potential use. In the interests of security, it is sometimes necessary for people to disguise the identity of their voice. An interesting way of doing this is to alter the pitch of the sound in real-time using methods similar to those described in Section 4.3.

Audio morphing can be achieved by transforming the sound's representation from the acoustic waveform (Figure 2.1 and Figure 2.2) obtained by sampling of the analog signal, with which many people are familiar, to another representation. To prepare the signal for the transformation, it is split into a number of 'frames' - sections of the waveform. The transformation is then applied to each frame of the signal. This provides another way of viewing the signal information. The new representation (said to be in the


frequency domain) describes the average energy present at each frequency band. Further analysis enables two pieces of information to be obtained: pitch information and the overall envelope of the sound.

A key element in the morphing is the manipulation of the pitch information. If two signals with different pitches were simply cross-faded, it is highly likely that two separate sounds would be heard. This occurs because the sound would have two distinct pitches, causing the auditory system to perceive two different objects. A successful morph must exhibit a smoothly changing pitch throughout.

The pitch information of each sound is compared to provide the best match between the two signals' pitches. To do this, the signals are stretched and compressed so that important sections of each signal match in time. The interpolation of the two sounds can then be performed, creating the intermediate sounds of the morph. The final stage is to convert the frames back into a normal waveform.

However, after the morphing has been performed, the legacy of the earlier analysis becomes apparent. The conversion of the sound to a representation in which the pitch and spectral envelope can be separated loses some information. Therefore, this information has to be re-estimated for the morphed sound.

This process obtains an acoustic waveform which can then be stored or listened to. Recent work in the area by Slaney et al. [6] will form the basis of the work to be conducted. The algorithm to be used is shown below in Figure 1.2.

Figure 1.2 Block diagram of the simplified audio morphing algorithm.

The algorithm contains a number of fundamental signal processing methods, including sampling, the discrete Fourier transform and its inverse, and cepstral analysis. These techniques will be explained in the next chapter.

[Figure 1.2 block labels: sound 1 and sound 2 each pass through Representation Conversion and Pitch and Envelope Analysis; the resulting pitch information and envelope information feed the Temporal Match, Interpolation and Signal Re-estimation stages, producing the morph.]


2. Background Theory

This section of the report introduces the major concepts associated with processing a sound signal and transforming it to the new representation mentioned above.

2.1 Digital representation of sound

With the advent of the computer, signal processing has moved away from manual electronic techniques to those offered by complex and powerful software systems. However, before any processing can begin, the sound signal that is created by some real-world process has to be ported to the computer by some method. This is called sampling.

A fundamental aspect of a digital signal (in this case sound) is that it is based on processing sequences of samples. When sound is produced by a natural process, such as a musical instrument, the signal produced is analog (continuous-time) because it is defined along a continuum of times.

A discrete-time signal is represented by a sequence of numbers - the signal is only defined at discrete times. A digital signal is a special instance of a discrete-time signal - both time and amplitude are discrete. Each discrete representation of the signal is termed a sample.

A discrete-time signal has two methods of generation:
· periodic sampling of a continuous-time signal
· direct generation by some discrete-time process

This type of signal has its own mathematical representation:

x = {x[n]},  -∞ < n < ∞

where n is an integer and x is the sequence of numbers, each member of which corresponds to a sample.

This notation is considered by some to be cumbersome and x is usually referred to simply as "the sequence x[n]".

Therefore x[n] is the nth sample. If this is applied to the analog signal xa(t)¹, which is sampled at a sampling period of T, then

x[n] = xa(nT),  -∞ < n < ∞

Note: the sampling frequency is the reciprocal of the sampling period, i.e. 1/T.

¹ The sequence representation in this section follows that used by Oppenheim and Schafer [1]: [ ] encloses the independent variable of discrete-variable functions and ( ) encloses the independent variable of continuous-variable functions. By convention, the independent variable is usually time.
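The sampling relation x[n] = xa(nT) can be sketched in a few lines. The project itself was implemented in MATLAB; the fragment below uses Python with NumPy instead, and the tone frequency and sampling rate are illustrative choices, not values from the dissertation.

```python
import numpy as np

# Periodic sampling of a continuous-time signal x_a(t) = sin(2*pi*440*t):
# x[n] = x_a(nT), where T is the sampling period. The 440 Hz tone and
# 8 kHz rate are illustrative assumptions.
fs = 8000                             # sampling frequency, 1/T
T = 1.0 / fs                          # sampling period
n = np.arange(480)                    # 480 samples = 60 ms of signal
x = np.sin(2 * np.pi * 440 * n * T)   # x[n] = x_a(nT)

print(len(x))
```

Each element of x is one sample of the underlying continuous-time sinusoid, taken every T seconds.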


Figure 2.1 6 ms segment of a continuous-time speech signal represented in the time-domain.

Figure 2.2 Sample sequence of Figure 2.1 comprising 297 samples with T = 22.676 µs.

However, a signal cannot, in general, be uniquely recovered from its samples. An example of such signals is shown in Figure 2.3. If T is too large, the original signal cannot be reproduced from the sampled sequence (as in Figure 2.3); conversely, if T is too small, redundant samples are included in the sequence. The Nyquist Sampling Theorem, which relates the frequency bandwidth of the analog signal to the sampling period, resolves this problem.

The theorem states that when the analog signal is bandlimited between 0 and w Hertz, and x(n) is sampled at every T = 1/(2w) seconds, then the original signal can be completely reproduced by

x(t) = Σ[i = -∞ to ∞] x(i/2w) · sin[2πw(t - i/2w)] / [2πw(t - i/2w)]

Here x(i/2w) is a sampled value of x(n) at ti = i/2w. Furthermore, 1/T = 2w is called the Nyquist Rate.

Related to sampling is the scalar quantisation of each sample. This is the number of possible amplitude values used to represent each sample of the waveform. If the number of amplitude values is increased, then the accuracy also increases.



Figure 2.3 Sinusoids with the same samples.

If a sample is said to be represented using q bits, then there are 2^q possible amplitude values. Inaccuracy (distortion in the signal) due to an insufficient number of amplitude values is termed quantisation noise. An example of quantisation and quantisation noise is shown in Figure 2.4.
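A short sketch of q-bit scalar quantisation (Python/NumPy; the 8-bit choice mirrors the 8-bit clips used later in the report, while the sine test signal is illustrative): each sample is rounded to the nearest of 2^q evenly spaced levels, and the rounding error is the quantisation noise described above.

```python
import numpy as np

# q-bit scalar quantisation of a signal in [-1, 1].
q = 8
levels = 2 ** q                        # number of amplitude values
x = np.sin(2 * np.pi * np.arange(297) / 297.0)

step = 2.0 / (levels - 1)              # spacing between adjacent levels
xq = np.round((x + 1.0) / step) * step - 1.0   # quantised samples

noise = x - xq                         # quantisation noise
print(levels, np.max(np.abs(noise)))
```

By construction the noise never exceeds half a quantisation step, so doubling the number of levels (one more bit) halves the worst-case error.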

Figure 2.4 (a) Scalar quantisation of a simple curve (b). The amplitude values are shown on the y axis and the samplingpositions on the x axis. The circles represent the desired sample value and the squares are the actual value. Note that some

coincide but many samples have discrepancies (quantisation noise). The final curve after sampling is shown in (c).

(b)

(c)

(a)

Page 10: DECLARATION All sentences or passages quoted in this ...staff · Figure 2.1 6 ms segment of a continuous-time speech signal represented in the time-domain. Figure 2.2 Sample sequence

Audio Morphing Stuart N Wrigley

Chapter 2 Page 6

2.2 Linear Systems Theory

Once a signal has been converted to the new domain required for the warping, a number of properties of the systems to be used become invaluable. Although the details of the transformation are given later, an outline of Linear Systems Theory will be given here.

A discrete-time system is defined as a transformation that maps the input sequence x[n] to the output sequence y[n]:

y[n] = T{x[n]}

where T is the discrete-time system (or transformation) in question.

Linear time-invariant systems (normally referred to merely as linear systems) are a subclass of such systems which have two important properties:

1. the system is linear - the output of the system in response to a sum of inputs must be equal to the sum of the outputs of the system in response to each individual input. If

   y1[n] = T{x1[n]},  y2[n] = T{x2[n]}  and  x[n] = a·x1[n] + b·x2[n], where a and b are constants,

   then

   y[n] = T{x[n]} = T{a·x1[n] + b·x2[n]}
        = a·T{x1[n]} + b·T{x2[n]}   [Superposition Principle]
        = a·y1[n] + b·y2[n]

2. the system is time invariant - the effect of the system on the signal is the same at any particular time: if y[n] = T{x[n]} then y[n-t] = T{x[n-t]}
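The superposition property can be checked numerically for a concrete linear system. The sketch below (Python/NumPy; the 3-point moving-average filter and random inputs are illustrative choices, not part of the project) verifies that T{a·x1 + b·x2} = a·T{x1} + b·T{x2}.

```python
import numpy as np

def T_sys(x):
    """3-point moving-average filter: y[n] = (x[n] + x[n-1] + x[n-2]) / 3.
    Convolution with a fixed kernel is linear and time-invariant."""
    return np.convolve(x, np.ones(3) / 3.0, mode="full")

rng = np.random.default_rng(0)
x1 = rng.standard_normal(64)
x2 = rng.standard_normal(64)
a, b = 2.0, -0.5

lhs = T_sys(a * x1 + b * x2)             # T{a x1 + b x2}
rhs = a * T_sys(x1) + b * T_sys(x2)      # a T{x1} + b T{x2}
print(np.allclose(lhs, rhs))
```

Any filter expressible as a convolution passes this test; a system with, say, a squaring stage would not.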

2.3 Windowing

A DFT can only deal with a finite amount of information. Therefore, a long signal must be split up into a number of segments, called frames. Generally, sound signals are constantly changing, so the aim is to make the frame short enough that the segment is almost stationary, yet long enough to resolve consecutive pitch harmonics. Therefore, the length of such frames tends to be in the region of 25 to 75 ms [11]. There are a number of possible windows [1]. A selection is:

· The Rectangular window:
  w(n) = 1 when 0 ≤ n ≤ N, 0 otherwise

· The Hamming window:
  w(n) = 0.54 - 0.46·cos(2πn/N) when 0 ≤ n ≤ N, 0 otherwise

· The Hanning window:
  w(n) = 0.5 - 0.5·cos(2πn/N) when 0 ≤ n ≤ N, 0 otherwise
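These window definitions translate directly into code. The sketch below (Python/NumPy rather than the MATLAB used in the project) simply transcribes the three formulas; NumPy's own np.hamming and np.hanning implement essentially the same raised-cosine shapes.

```python
import numpy as np

# Direct transcriptions of the three window formulas for 0 <= n <= N.
def rectangular(N):
    return np.ones(N + 1)                         # w(n) = 1

def hamming(N):
    n = np.arange(N + 1)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / N)

def hanning(N):
    n = np.arange(N + 1)
    return 0.5 - 0.5 * np.cos(2 * np.pi * n / N)

N = 255
# The raised-cosine windows taper toward zero at the ends; the
# rectangular window cuts off abruptly.
print(rectangular(N)[0], hamming(N)[0], hanning(N)[0])
```

The Hamming window's small non-zero end value (0.08) is what distinguishes it from the Hanning window, which reaches exactly zero.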


Figure 2.5 Commonly used windows.

It is evident that the rectangular window is a much more abrupt function in the time-domain than the Hanning and Hamming windows, which are raised-cosine functions. The frequency-domain spectrum of the Hamming window is much smoother than that of the rectangular window, and it is commonly used in spectral analysis. Fallside and Woods [12] give a good comparison between the properties of Hamming and rectangular windows.

The windowing function splits the signal into time-weighted frames. However, it is not enough to merely process contiguous frames. When the frames are put back together, modulation in the signal becomes evident due to the windowing function. As the weighting of the window is required, another means of overcoming the modulation must be found. A simple method is to use overlapping windows: as one frame fades out, its successor fades in. This has the advantage that any discontinuities are smoothed out. However, it does increase the amount of processing required due to the increase in the number of frames produced.
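The reason overlapping removes the modulation can be shown directly: Hanning windows advanced by half their length sum to a constant, so the per-frame weighting cancels when the frames are added back together. A sketch (Python/NumPy; the frame length and hop size are illustrative choices):

```python
import numpy as np

N = 64                                       # frame length
hop = N // 2                                 # 50% overlap
n = np.arange(N)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)    # "periodic" Hanning window

# Accumulate the window weight contributed by every overlapping frame.
total = np.zeros(N * 10)
for start in range(0, len(total) - N + 1, hop):
    total[start:start + N] += w

# Away from the signal ends, the summed weight is constant (1), so the
# overlap-added frames reproduce the signal without ripple.
interior = total[N:-N]
print(np.allclose(interior, 1.0))
```

A single sequence of non-overlapping Hanning frames would instead impose a periodic dip to zero at every frame boundary.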



2.4 The Fourier Series

The series was proposed by Jean Baptiste Joseph, Baron de Fourier, in 1822 in order for any periodic signal (or wave) to be reconstructed as a weighted sum of sinusoids. A sinusoid was chosen because it has a clear interpretation in the frequency domain: a well-defined frequency and a measurable amplitude.

Fourier showed that if a periodic signal has a period f, then the frequencies of the Fourier series are 1/f, 2/f, 3/f, etc. Hz.

The Fourier series is defined for both continuous-time and discrete-time signals. However, this project uses discrete-time (i.e. sampled) signals, and the remainder of this section will therefore concentrate on the discrete-time Fourier series.

If x(t) is a periodic discrete-time signal with period T, then it can be represented by the Fourier series

x(t) = a0 + Σ[c = 1 to ∞] ac·cos(cωt) + Σ[c = 1 to ∞] bc·sin(cωt)

where ω = 2π/T and a0 is a constant. This simplifies to

x(t) = a0 + Σ[c = 1 to ∞] Bc·cos(cωt - φc)

where Bc = √(ac² + bc²) and φc = arctan(bc/ac).

Before it is possible to analyse a signal with respect to its frequency content, and so view the signal in the frequency-domain², a transform is required to convert the signal into its Fourier components. This is achieved by the Discrete Fourier Transform (DFT). The DFT is used specifically for periodic input signals containing discretely sampled signal values.

The DFT of a sequence of N points, x(n), is

X(k) = Σ[n = 0 to N-1] x(n)·W^(-nk)   where W = e^(i2π/N)

and the inverse DFT (IDFT) is

x(n) = (1/N) Σ[k = 0 to N-1] X(k)·W^(nk)   where W = e^(i2π/N)

² A signal plotted with Fourier component amplitude against frequency is said to be in the frequency-domain.
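The DFT/IDFT pair above can be transcribed directly into code and checked for consistency. The sketch below (Python/NumPy; a direct O(N²) implementation, for illustration only) confirms that the IDFT inverts the DFT and that the sign convention matches NumPy's built-in FFT.

```python
import numpy as np

# Direct transcription of the DFT/IDFT pair, with W = e^{i 2 pi / N}.
def dft(x):
    N = len(x)
    n = np.arange(N)
    W = np.exp(2j * np.pi / N)
    return np.array([np.sum(x * W ** (-n * k)) for k in range(N)])

def idft(X):
    N = len(X)
    k = np.arange(N)
    W = np.exp(2j * np.pi / N)
    return np.array([np.sum(X * W ** (m * k)) for m in range(N)]) / N

x = np.sin(2 * np.pi * 3 * np.arange(32) / 32)   # 3 cycles in 32 samples
X = dft(x)
x_back = idft(X)
print(np.allclose(x_back.real, x), np.allclose(X, np.fft.fft(x)))
```

For this pure sinusoid, the energy concentrates in the bins k = 3 and k = 29, illustrating the symmetric result discussed below.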


Figure 2.6 Time-domain representation of the utterance "Are you sure?" sampled at 11025 Hz, 8 bit, mono.

Figure 2.7 Two spectrograms of the signal shown in Figure 2.6. High energy content is represented by the dark shading.
Left: Wideband spectrogram (N = 128, equivalent to approximately 86 Hz resolution).
Right: Narrowband spectrogram (N = 512, equivalent to approximately 22 Hz resolution).

A spectrogram is a specific use of the DFT function and visually demonstrates an important concept - the balance between frequency and time resolution. The signal is split up into a number of overlapping frames by applying a windowing function. The windowing function takes sections of the signal comprising a finite number of samples and weights each sample according to the specific windowing function used. The spectrograms shown in Figure 2.7 are produced using the N-point DFT on a sequence of windows. The DFT result from each window corresponds to the vertical strips evident in the diagrams.

The DFT produces a symmetric result and so only the first half of the values are used in a spectrogram. Hence the spectrograms in Figure 2.7 show the frequency axis only extending to just over 5 kHz. The size of N plays an important role in the DFT: it determines the characteristics of the spectrogram itself. To obtain a sufficiently high level of frequency resolution, N has to be relatively large. In the case of speech, this enables formants and individual harmonics to become visible (Figure 2.7). However, as the size of N increases, so too does the DFT window. Therefore, each frequency's single energy value is being calculated over a larger time-slice of the signal. This means that there is a loss in time resolution. Especially in speech, certain events in the signal occur so quickly that they can be lost at high values of N. As a result, there is a time-frequency trade-off. This property is a form of uncertainty principle in which, as time resolution increases, frequency resolution decreases and vice versa:

Δt · Δf ≡ 1

where Δt and Δf are the relative time and frequency resolutions respectively.

The computational complexity of the DFT is proportional to N², which means that the DFT becomes very inefficient at large values of N. The Fast Fourier Transform (FFT) was devised in the 1960s as a much quicker, non-approximating replacement for the DFT, with complexity N log N.


2.5 Cepstral Analysis

Further information encoded into a signal is information about the pitch and the formant frequencies. Formant frequencies are a sound source's natural frequencies, which correspond to resonances in its frequency response. Especially in speech, these formants change relatively slowly and can be studied by analysing the signal further after converting it to the frequency-domain. These pieces of information are not combined additively: they are convolved. Once converted to the frequency-domain, this convolution becomes multiplication. The use of logarithms, in turn, converts multiplication into addition. This enables the pitch information to be 'subtracted' or removed in order to obtain a cepstrally smoothed spectrum for formant extraction.

The term cepstrum is given to the inverse DFT (IDFT) of the log magnitude spectrum of a signal. The process, which is visually demonstrated in Figure 2.8, is

signal → DFT → log |·| → IDFT → cepstrum

As a log magnitude has been taken in-between the DFT and IDFT, the domain is not quite the time-domain. As such, a new set of terms is required to name the various components in this domain. The word 'cepstrum' comes from interchanging the letters in 'spectrum', and 'frequency' becomes 'quefrency'. Therefore, the cepstrum is said to be in the quefrency-domain.
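The signal → DFT → log|·| → IDFT chain is short in code. The sketch below (Python/NumPy rather than the MATLAB used in the project; the synthetic pulse-train "voiced" frame and all parameter values are illustrative assumptions) computes one cepstral slice and locates the pitch peak of the kind shown in Figure 2.8.

```python
import numpy as np

# One cepstral slice of a synthetic "voiced" frame: a pulse train with a
# 100-sample pitch period, crudely mimicking glottal pulses (at 11025 Hz
# this would correspond to a pitch of roughly 110 Hz).
period = 100
frame = np.zeros(1024)
frame[::period] = 1.0

spectrum = np.fft.fft(frame * np.hamming(len(frame)))
log_mag = np.log(np.abs(spectrum) + 1e-10)   # log magnitude spectrum
cepstrum = np.fft.ifft(log_mag).real         # the (real) cepstrum

# The pitch appears as a peak at a quefrency of one pitch period.
peak = 50 + int(np.argmax(cepstrum[50:512]))
print(peak)
```

The search starts above quefrency 50 to skip the large low-quefrency (envelope) region near the origin.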

Figure 2.8 The series of steps from the sampled waveform (i) to the cepstral slice (iii). Note the pitch peak in the cepstral slice.



Pitch and formant information can be extracted from the cepstrum by using lifters (the quefrency-domain equivalent of filters): high-pass for pitch information and low-pass for formant information.

Figure 2.9 The liftering process to obtain the signal envelope information (top) and the pitch and voicing information (bottom).

If this process is repeated for all 'slices' in the signal, and these liftered spectra are then independently transformed back into the Fourier domain, two forms of spectrogram are obtained: a smooth spectrogram for the signal envelope information and a pitch spectrogram (Figure 2.10). The power of the MATLAB environment used to implement the audio morphing process is demonstrated by the code fragment in Appendix 2: the above cepstral analysis can be performed in merely two lines of code.
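A liftering sketch in the same spirit (written in Python/NumPy here, whereas the project used MATLAB; the synthetic frame, pitch period and quefrency cutoff are illustrative assumptions): the low-quefrency coefficients are kept for the envelope and the remainder for the pitch information.

```python
import numpy as np

# Lifter one cepstral slice: low quefrencies -> spectral envelope,
# high quefrencies -> pitch and voicing.
N = 1024
frame = np.zeros(N)
frame[::100] = 1.0                           # synthetic voiced frame
log_mag = np.log(np.abs(np.fft.fft(frame * np.hamming(N))) + 1e-10)
cepstrum = np.fft.ifft(log_mag).real

cutoff = 30                                  # quefrency cutoff (samples)
low = np.zeros(N)
low[:cutoff] = cepstrum[:cutoff]             # low-pass lifter...
low[-(cutoff - 1):] = cepstrum[-(cutoff - 1):]   # ...and its mirror half
high = cepstrum - low                        # high-pass lifter (pitch)

# Transforming the low-quefrency part back gives the cepstrally
# smoothed (envelope) log spectrum.
envelope = np.fft.fft(low).real
print(envelope.shape)
```

Because the two lifters partition the cepstrum exactly, their outputs sum back to the original slice; repeating this over every frame yields the two spectrograms of Figure 2.10.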



Figure 2.10 A smooth spectrogram of "Are you sure?" (left) encoding broad spectral shapes of the signal. The pitch spectrogram (right) encodes the pitch and voicing information. The standard spectrogram can be seen in Figure 2.7.

The ability to separate out these two pieces of information is extremely important when the two signals finally come to be morphed. It allows details of the pitch and spectra to be worked on independently, and is used in the temporal alignment of the pitch peaks and the interpolation of the signals' spectra.


3. Previous Morphing Work

This area of research is still very new and, as such, there are few papers in existence. Slaney et al. [6] provide an excellent treatise on the subject. This section gives a brief comparison between image and audio techniques and summarises Slaney's paper.

The technique of digitally distorting an image is called warping. The complete and smooth transformation from one image to another is called morphing and is an extension of warping. Watkins et al. [18] use the analogy of a sheet of rubber on which the image is printed to describe warping. In order to distort the image in the required manner, just a few 'control points' need to be displaced. The elasticity of the rubber takes care of the rest of the image warping in the region surrounding the control points.

In visual morphing, key features in both images have to be matched. Such features could be the eyes, nose and mouth, in the case of a face, or headlamps and windscreens in the case of vehicles. The process of morphing is relatively uncomplicated:
1. warp the target image so that key features match the corresponding source image's features as closely as possible.
2. over a number of frames, return the target image to its original state; in the process, the source image is warped to maintain the key feature correspondence. A weighted average between the two warps is used to smooth any discontinuities.

For more than fifteen years, image morphing has been used as a computer graphics technique in many areas ranging from experimental art to Hollywood blockbusters (3D morphing was highly evident in TriStar Pictures' Terminator 2) and television commercials. To enable the morphing algorithms to effect a warp from one image to another, mesh points are required to define the correspondence points between the two images. Usually, the selection of these mesh points is a manual, labour-intensive task. However, Covell and Withgott [7] proposed techniques to ease this process.

Unlike many image morphing systems, audio morphing has no need to manually select features to form the basis of the morph. Slaney et al. [6] have shown that the fundamental principles of auto-correspondence in video [7] can be used for audio as well. In a pitch spectrogram, the spacing of the peaks is proportional to the pitch. If the two sounds between which a morph is required both have a pitch, then these pitches are likely to be different and so have to be matched over the duration of the morph. The identification of these peaks and the subsequent peak matching is automated.

Image morphing (as in Figure 1.1) is equivalent to the simplest form of audio morph - that of tracing the path between two points in some suitably warped space. Video morphing, where the morph starts with the first clip's properties and smoothly changes until it possesses all the properties of the second, destination clip, is equivalent to an audio morph between two stationary objects - for example a steady vowel and a musical note.

Time is an important part of a sound. Unlike in video morphing, the time dimension can be considered independently of the other dimensions of the sound, such as spectral envelope and pitch. Slaney et al. use techniques based on magnitude spectrograms. This is possible due to the ability to find the sound with the same magnitude spectrogram - phase re-estimation is an integral part of this process.

It is demonstrated that such magnitude spectrograms cannot be simply cross-faded, as this can lead to two pitches being introduced into the final sound. Such a sound is then perceived as two auditory objects, which ruins the objective of a smooth morph. Therefore, Slaney et al. decompose the sound's properties into two salient 'dimensions' - one encoding the broad spectral envelope of the sound, the other encoding the pitch and voicing information. In the paper, they use mel-frequency cepstral coefficients (MFCC) to obtain these two pieces of information.

Each signal will have a number of features which need to be matched (c.f. visual morphing). Such features do not normally occur at exactly the same point and so the features have to be moved slowly from the position in the first sound to the position in the second. Dynamic Time Warping (described in the next section) is used to match the features in the two sounds.

The time-varying property used by Slaney et al. is the sounds' pitch, although other matching functions could be used; rhythm could be considered for sounds with a predominant beat. In order to match the sounds with respect to their pitch, a pitch estimate for each sound is required. Slaney et al. [6] use a combination of a conventional pitch estimation algorithm and dynamic programming [14].

The spectrograms are interpolated and then inverted to recover the morphed sounds. This procedure uses an extension to Griffin and Lim's [9] algorithm to allow the iterative procedure to quickly converge [15].

The results of their paper are convincing. However, the authors still suggest that further work is required in a number of areas to improve the quality of the morphed sounds. For example, they suggest further research to produce an improved representation of the sound which will allow the separation of the voicing information and the pitch information. Work on improved matching techniques is also suggested to allow for different possible types of sounds (e.g. rhythm-dominated sounds).


4. The Audio Morphing Process

4.1 Pre-processing

This process takes place for each of the signals involved with the morph. Consider obtaining a spectrum for just one frame of a signal. It involves applying a windowing function to the desired section of the signal. The windowing function used is the popular Hamming window. The Discrete Fourier Transform (DFT) is then applied to this windowed section of the signal. The process is

apply windowing function to select and weight section → apply DFT

To obtain a number of overlapping spectra, the window is shifted along the signal by a number of samples (no more than the window length) and the process is repeated.

If the windows are applied starting at the beginning and continuing along the signal, a problem arises. Any one sample should fall under more than one window. However, at the beginning of the signal, the first few samples (the number of which corresponds to the window shift) only fall under one window. To overcome this, each signal is padded with silence at the beginning for a period equal to the window length. This means that the first sample will fall under the requisite number of windows and will thus be analysed correctly in later processes. Finally, the log magnitude is taken of each slice and then the inverse DFT is performed. Each signal is now represented by a series of cepstral slices as shown in section 2.5. Further, these slices are then liftered as described earlier to give two series of slices for each signal: a pitch information series and an envelope information series.
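The pre-processing chain described above (pad with silence, apply overlapping Hamming windows, DFT, log magnitude, inverse DFT) can be sketched in a few lines of numpy. This is an illustrative sketch rather than the project's MATLAB code; the window length and shift are assumed values.

```python
import numpy as np

def cepstral_slices(signal, win_len=256, shift=64):
    """Return one real-cepstrum slice per overlapping Hamming-windowed frame."""
    # pad with silence so the first sample falls under the full set of windows
    signal = np.concatenate([np.zeros(win_len), signal])
    window = np.hamming(win_len)
    slices = []
    for start in range(0, len(signal) - win_len + 1, shift):
        frame = signal[start:start + win_len] * window        # select and weight
        log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)   # log magnitude DFT
        slices.append(np.fft.ifft(log_mag).real)              # inverse DFT -> cepstrum
    return np.array(slices)
```

Each row of the result is one cepstral slice; liftering then splits each slice into its pitch and envelope series.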

For the series of pitch slices, two arrays are created: one to hold the peak location for each slice and the other to store the pitch peak magnitude. These values are required during the pitch morphing process, below.


4.2 Matching and Warping

Both signals will have a number of 'time-varying properties'. To create an effective morph, it is necessary to match one or more of these properties of each signal to those of the other signal in some way. The property of concern is the pitch of the signal - although other properties such as the amplitude could be used - and will have a number of features. It is almost certain that matching features do not occur at exactly the same point in each signal. Therefore, the feature must be moved to some point in between the position in the first sound and the second sound.

In other words, to smoothly morph the pitch information, the pitch present in each signal needs to be matched and then the amplitude at each frequency cross-faded. To perform the pitch matching, a pitch contour for the entire signal is required. This is obtained by using the pitch peak location in each cepstral pitch slice.
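Filling the two peak arrays amounts to scanning each pitch slice for its largest peak. The sketch below assumes the pitch region is the set of quefrencies above a lower cutoff within the symmetric half of the slice; the real bounds come from the liftering described in section 2.5.

```python
import numpy as np

def pitch_contour(pitch_slices, min_quefrency=20):
    """For each cepstral pitch slice, record the peak location and magnitude."""
    half = pitch_slices.shape[1] // 2            # the cepstrum is symmetric
    region = pitch_slices[:, min_quefrency:half]
    peak_locations = region.argmax(axis=1) + min_quefrency
    peak_magnitudes = region.max(axis=1)
    return peak_locations, peak_magnitudes
```

The sequence of peak locations forms the pitch contour used by the DTW stage.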

Consider the simple case of two signals, each with two features occurring in different positions (Figure 4.1).

Figure 4.1 The match path between two signals with differently located features.

The match path shows the amount of movement (or warping) required in order to align corresponding features in time. Such a match path is obtained by Dynamic Time Warping (DTW).

4.2.1 Dynamic Time Warping

Speaker recognition and speech recognition are two important applications of speech processing. These applications are essentially pattern recognition problems, which is a large field in itself. Some Automatic Speech Recognition (ASR) systems employ time normalisation. This is the process by which time-varying features within the words are brought into line. The current method is time-warping, in which the time axis of the unknown word is non-uniformly distorted to match its features to those of the pattern word. The degree of discrepancy between the unknown word and the pattern - the amount of warping required to match the two words - can be used directly as a distance measure. Such time-warping algorithms are usually implemented by dynamic programming and are known as Dynamic Time Warping.

Dynamic Time Warping (DTW) is used to find the best match between the features of the two sounds - in this case, their pitch. To create a successful morph, major features which occur at generally the same time in each signal ought to remain fixed and intermediate features should be moved or interpolated. DTW enables a match path to be created. This shows how each element in one signal corresponds to each element in the second signal.

In Figure 4.1 it can be seen how the two major features in each signal remain stationary, and the intermediate sections of the signal are stretched and compressed.


Page 21: DECLARATION All sentences or passages quoted in this ...staff · Figure 2.1 6 ms segment of a continuous-time speech signal represented in the time-domain. Figure 2.2 Sample sequence

Audio Morphing Stuart N Wrigley

Chapter 4 Page 17

In order to understand DTW, two concepts need to be dealt with:

· Features: the information in each signal has to be represented in some manner.
· Distances: some form of metric has to be used in order to obtain a match path. There are two types:
  1. Local: a computational difference between a feature of one signal and a feature of the other.
  2. Global: the overall computational difference between an entire signal and another signal of possibly different length.

Feature vectors are the means by which the signal is represented and are created at regular intervals throughout the signal. In this use of DTW, a path between two pitch contours is required. Therefore, each feature vector will be a single value. In other uses of DTW, however, such feature vectors could be large arrays of values.

Since the feature vectors could possibly have multiple elements, a means of calculating the local distance is required. The distance measure between two feature vectors is calculated using the Euclidean distance metric. Therefore the local distance between feature vector x of signal 1 and feature vector y of signal 2 is given by,

d(x, y) = √( Σ_i (x_i - y_i)² )

As the pitch contours are single-value feature vectors, this simplifies to,

d(x, y) = |x - y|

which is equivalent to another popular distance metric, the City Block or Manhattan metric.
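The local distance metric can be written directly; for single-value feature vectors the Euclidean form collapses to the absolute difference, as noted above. A minimal sketch:

```python
import math

def local_distance(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For the scalar pitch contours used here, `local_distance([x], [y])` equals `abs(x - y)`.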

The global distance is the overall difference between the two signals. Audio is a time-dependent process. For example, two audio sequences may have different durations, and two sequences of the same sound with the same duration are likely to differ in the middle due to differences in sound production rate. Therefore, to produce a global distance measure, time alignment must be performed - the matching of similar features and the stretching and compressing, in time, of others. Instead of considering every possible match path, which would be very inefficient, a number of constraints are imposed upon the matching process.

The DTW Algorithm
The basic DTW algorithm is symmetrical - in other words, every frame in both signals must be used. The constraints placed upon the matching process are:

· Matching paths cannot go backwards in time;
· Every frame in each signal must be used in a matching path;
· Local distance scores are combined by adding to give a global distance.

If D(i,j) is the global distance up to (i,j) and the local distance at (i,j) is given by d(i,j), then

D(i,j) = min[ D(i-1,j-1), D(i-1,j), D(i,j-1) ] + d(i,j)    (1)

The only directions in which the match path can move when at (i,j) in the time-time matrix are given in Figure 4.2.


Figure 4.2 The three possible directions in which the best match path may move from cell (i,j) in symmetric DTW.

Computationally, (1) is already in a form that could be recursively programmed. However, unless the language is optimised for recursion, this method can be slow even for relatively small pattern sizes. Another method which is both quicker and requires less memory storage uses two nested for loops. This method only needs two arrays that hold adjacent columns of the time-time matrix. In the following explanation, it is assumed that the array notation is of the form 0..N-1 for an array of length N.

Figure 4.3 The cells at (i,j) and (i,0) have different possible originator cells. The path to (i,0) can only originate from (i-1,0). However, the path to (i,j) can originate from the three standard locations as shown.

The algorithm to find the least global cost is:

1. Calculate column 0, starting at the bottom-most cell. The global cost to this cell is just its local cost. Then, the global cost for each successive cell is the local cost for that cell plus the global cost to the cell below it. This is called the predCol (predecessor column).
2. Calculate the global cost to the first cell of the next column (the curCol). This is the local cost for the cell plus the global cost to the bottom-most cell of the previous column.
3. Calculate the global cost of the rest of the cells of curCol. For example, at (i,j) this is the local distance at (i,j) plus the minimum global cost at either (i-1,j), (i-1,j-1) or (i,j-1).
4. curCol is assigned to predCol and the process repeats from step 2 until all columns have been calculated.
5. The global cost is the value stored in the top-most cell of the last column.


The pseudocode for this process is:

calculate first column (predCol)
for i = 1 to number of signal 1 feature vectors
    curCol[0] = local cost at (i,0) + global cost at (i-1,0)
    for j = 1 to number of signal 2 feature vectors
        curCol[j] = local cost at (i,j) + minimum of global costs at (i-1,j), (i-1,j-1) or (i,j-1)
    end
    predCol = curCol
end
minimum global cost is the value in curCol[number of signal 2 feature vectors - 1]
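The two-column pseudocode translates directly into Python. The project itself used MATLAB; this is an illustrative sketch using |x - y| as the local cost, as for the pitch contours.

```python
import numpy as np

def dtw_cost(sig1, sig2):
    """Minimum global DTW cost using only two columns of the time-time matrix."""
    # column 0 (predCol): each cell is its local cost plus the cost of the cell below
    pred_col = np.cumsum([abs(sig1[0] - y) for y in sig2])
    for i in range(1, len(sig1)):
        cur_col = np.empty(len(sig2))
        # the bottom cell of a column can only be reached from (i-1, 0)
        cur_col[0] = abs(sig1[i] - sig2[0]) + pred_col[0]
        for j in range(1, len(sig2)):
            cur_col[j] = abs(sig1[i] - sig2[j]) + min(
                pred_col[j], pred_col[j - 1], cur_col[j - 1])
        pred_col = cur_col
    return pred_col[-1]   # top-most cell of the last column
```

Identical contours cost nothing, and small timing differences incur only the cost of the mismatched frames.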

However, in the case of audio morphing, it is not the minimum global distance itself which is of interest but the path required to achieve it. In other words, a backtrace array must be kept with entries in the array pointing to the preceding point in the path. Therefore, a second algorithm is required to extract the path.

The path has three different types of direction changes:

· vertical
· horizontal
· diagonal

The backtrace array will be of equal size to that of the time-time matrix. When the global distance to each cell, say (i,j), in the time-time matrix is calculated, its predecessor cell is known - it is the cell out of (i-1,j), (i-1,j-1) or (i,j-1) with the lowest global cost. Therefore, it is possible to record in the backtrace array the predecessor cell using the following notation (for the cell (i,j)):

· 1: (i-1,j-1) - diagonal
· 2: (i-1,j) - horizontal
· 3: (i,j-1) - vertical

Figure 4.4 A sample backtrace array with each cell containing a number which represents the location of the predecessor cell in the lowest global path distance to that cell.

The path is calculated from the last position; in Figure 4.4 this would be (4,4). The first cell in the path is denoted by a zero in the backtrace array and is always the cell (0,0). A final 2D array is required which gives a pair (signal 1 vector, signal 2 vector) for each step in the match path given a backtrace array similar to that of Figure 4.4.


The pseudocode is:

store the backtrace indices for the top right cell
obtain the value in that cell - currentVal
while currentVal is not 0
    if currentVal is 1 then reduce both indices by 1
    if currentVal is 2 then reduce the signal 1 index by 1
    if currentVal is 3 then reduce the signal 2 index by 1
    store the new indices at the beginning of the 2D array
    obtain the value in that cell - currentVal
end

Therefore, for the example in Figure 4.4, the 2D array would be,

signal 1:  0  0  1  2  3  4
signal 2:  0  1  2  2  3  4
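The path-extraction pass can be sketched in Python (illustrative, not the Appendix 2 MATLAB). Fed a backtrace array whose cells on the best path carry the codes above, it reproduces the match path of the worked example; the off-path entries below are placeholders, since the walk never visits them.

```python
def extract_path(backtrace):
    """Walk back from the top-right cell: 1 = diagonal (i-1,j-1),
    2 = horizontal (i-1,j), 3 = vertical (i,j-1); 0 marks the start cell (0,0)."""
    i, j = len(backtrace) - 1, len(backtrace[0]) - 1
    path = [(i, j)]
    while backtrace[i][j] != 0:
        code = backtrace[i][j]
        if code == 1:
            i, j = i - 1, j - 1
        elif code == 2:
            i -= 1
        else:
            j -= 1
        path.insert(0, (i, j))
    return path

# A backtrace array consistent with the worked example, indexed [signal1][signal2]
# (only the cells on the best path matter here - the rest are placeholders)
backtrace = [[0, 3, 1, 1, 1],
             [1, 1, 1, 1, 1],
             [1, 1, 2, 1, 1],
             [1, 1, 1, 1, 1],
             [1, 1, 1, 1, 1]]
```

`extract_path(backtrace)` yields (0,0), (0,1), (1,2), (2,2), (3,3), (4,4): signal 1 indices 0 0 1 2 3 4 against signal 2 indices 0 1 2 2 3 4.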

This is illustrated in Figure 4.5.

Figure 4.5 The sample backtrace array of Figure 4.4 with the calculated path overlaid.

The MATLAB function to perform the DTW process and the 2D array extraction isincluded in Appendix 2.

At this stage, we now have the match path between the pitches of the two signals and each signal in the appropriate form for manipulation. The next stage is to produce the final morphed signal.


4.3 Morphing

The overall aim in this section is to make a smooth transition from signal 1 to signal 2. This is partially accomplished by the 2D array of the match path provided by the DTW. At this stage, it was decided exactly what form the morph would take. The implementation chosen was to perform the morph in the duration of the longest sound; in other words, the final morphed sound would have the duration of the longest sound. In order to accomplish this, the 2D array is interpolated to provide the desired duration. However, one problem still remains: the interpolated pitch of each morph slice. If no interpolation were to occur then this would be equivalent to the warped cross-fade, which would still be likely to result in a sound with two pitches. Therefore, a pitch in between those of the first and second signals must be created. The precise properties of this manufactured pitch peak are governed by how far through the morph the process is. At the beginning of the morph, the pitch peak will take on more characteristics of the signal 1 pitch peak - peak value and peak location - than the signal 2 peak. Towards the end of the morph, the peak will bear more resemblance to that of the signal 2 peak. The variable λ is used to control the balance between signal 1 and signal 2. At the beginning of the morph, λ has the value 0 and upon completion, λ has the value 1.
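The control of the manufactured peak by λ can be sketched as below (written `lam`). The blending law is assumed linear for illustration; the text does not pin down the exact law.

```python
def interpolate_peak(loc1, mag1, loc2, mag2, lam):
    """Blend pitch-peak location and magnitude; lam = 0 gives signal 1's peak,
    lam = 1 gives signal 2's peak."""
    location = round((1 - lam) * loc1 + lam * loc2)   # nearest cepstral bin
    magnitude = (1 - lam) * mag1 + lam * mag2
    return location, magnitude
```

Halfway through the morph (lam = 0.5) the manufactured peak sits midway between the two stored peak locations, with the average of the two magnitudes.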

Consider the example in Figure 4.6. This diagram shows a sample cepstral slice with the pitch peak area highlighted. Figure 4.7 shows another sample cepstral slice, again with the same information highlighted. To illustrate the morph process, these two cepstral slices shall be used.

There are three stages:

1. combination of the envelope information;
2. combination of the pitch information residual - the pitch information excluding the pitch peak;
3. combination of the pitch peak information.

Combination of the envelope information
Slaney et al. [6] conclude that the best morphs are obtained when the envelope information is merely cross-faded, as opposed to employing any pre-warping of features, and so this approach is adopted here.

In order to cross-fade any information in the cepstral domain, care has to be taken. Due to the properties of logarithms employed in the cepstral analysis stage, multiplication is transformed into addition. Therefore, if a cross-fade between the two envelopes were attempted, multiplication would in fact take place. Consequently, each envelope must be transformed back into the frequency domain (involving an inverse logarithm) before the cross-fade is performed.

Once the envelopes have been successfully cross-faded according to the weighting determined by λ, the morphed envelope is once again transformed back into the cepstral domain. This new cepstral slice forms the basis of the completed morph slice.
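The exp/log round trip can be sketched as follows - an illustrative numpy version, assuming each slice is the inverse DFT of a log magnitude spectrum as produced in the pre-processing stage:

```python
import numpy as np

def crossfade_cepstral(slice1, slice2, lam):
    """Cross-fade two cepstral slices with weight lam, doing the fade in the
    linear magnitude spectral domain to avoid multiplying the spectra."""
    spec1 = np.exp(np.fft.fft(slice1).real)   # cepstrum -> linear magnitude spectrum
    spec2 = np.exp(np.fft.fft(slice2).real)
    mixed = (1 - lam) * spec1 + lam * spec2   # the actual cross-fade
    return np.fft.ifft(np.log(mixed)).real    # back to the cepstral domain
```

Fading in the cepstral domain directly would add the log spectra, i.e. multiply the envelopes; the round trip through the linear spectrum performs a true weighted sum.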


Figure 4.6 Sample cepstral slice with the three main areas of interest in the morphing process highlighted.

Figure 4.7 A second sample cepstral slice with the pitch peak area highlighted. Note the different pitch peak location compared with that of Figure 4.6.

Combination of the pitch information residual
The pitch information residual is the pitch information section of the cepstral slice with the pitch peak removed by liftering. To produce the morphed residual, it is combined in a similar way to the envelope information: no further matching is performed. It is simply transformed back into the frequency domain and cross-faded with respect to λ. Once the cross-fade has been performed, it is again transformed into the cepstral domain. The information is now combined with the new morph cepstral slice (currently containing envelope information). The only remaining part to be morphed is the pitch peak area.

[Annotations to Figures 4.6 and 4.7: the pitch and voicing section of the cepstral slice, consisting of the pitch peak region and the pitch and voicing residual; the smooth spectral shape section of the cepstral slice; and the pitch peak region of interest in the morph pitch peak manufacture process.]


Combination of the pitch peak information
As stated above, in order to produce a satisfying morph, it must have just one pitch. This means that the morph slice must have a pitch peak which has characteristics of both signal 1 and signal 2. Therefore, an 'artificial' peak needs to be generated to satisfy this requirement. The positions of the signal 1 and signal 2 pitch peaks are stored in an array (created during the pre-processing, above), which means that the desired pitch peak location can easily be calculated. In order to manufacture the peak, the following process is performed:

1. Each pitch peak area is liftered from its respective slice. Although the alignment of the pitch peaks will not match with respect to the cepstral slices, the pitch peak areas are liftered in such a way as to align the peaks with respect to the liftered area (see Figure 4.8).
2. The two liftered cepstral slices are then transformed back into the frequency domain where they can be cross-faded with respect to λ. The cross-fade is then transformed back into the cepstral domain.
3. The morphed pitch peak area is now placed at the appropriate point in the morph cepstral slice to complete the process.

Figure 4.8 Detail of the cepstral slices showing the pitch peak areas (top and middle). The left hand column shows the pitch peak location with respect to the cepstral slice. The right hand column shows how the peaks are aligned during the liftering process; note that the peak locations now match. The lower frame shows the relative position and shape of the morphed pitch peak with λ=0.5.

The morphing process is now complete. The final series of morphed cepstral slices is transformed back into the frequency domain. All that remains to be done is to re-estimate the waveform.


4.4 Signal Estimation from Magnitude DFT

This part of the project was conceptually very demanding and took a considerable amount of time to master. However, it is a vital part of the system and the time expended on it was well spent. As described above, due to the signals being transformed into the cepstral domain, a magnitude function is used. This results in a loss of phase information in the representation of the data. Therefore, an algorithm is required to estimate a signal whose magnitude DFT is close to that of the processed magnitude DFT. The solution to this problem [9] is explained below.

The signal is windowed in the manner described above.

Let the windowing function used in the DFT be w(n), which is L points long and non-zero for 0 ≤ n ≤ L-1. Therefore, the windowed signal can be represented by

xw(m, l) = w(mS - l) x(l)

where m is the window index, l is the signal sample index and S is the window shift.

Hence, from the definition of the DFT, the windowed signal's N-point DFT is given by

Xw(m, k) = Σ_{l=0}^{N-1} xw(m, l) W^(-lk),  where W = e^(i2π/N)

As stated above, the morphing process shall produce a magnitude DFT. Let Yw(m, k) represent this magnitude DFT.

Before investigating the signal estimation of the magnitude DFT, let us consider estimating a signal from Yw(m, k). The time-domain signal of Yw(m, k) is given by

yw(m, l) = (1/N) Σ_{k=0}^{N-1} Yw(m, k) W^(lk),  where W = e^(i2π/N)

However, due to Yw(m, k) having been manufactured by some process, it is generally not a 'valid' DFT. The phrase not a 'valid' DFT means that there is no specific signal whose DFT is given by Yw(m, k). Griffin and Lim [9] estimate a signal x(n) whose DFT Xw(m, k) is as close as possible to the manufactured DFT Yw(m, k). The closeness of the two is represented by calculating the squared error between the estimated signal x(n) and that of the manufactured DFT. This error measurement can be represented as

D[ x(n), Yw(m, k) ] = Σ_{m=-∞}^{∞} Σ_{l=-∞}^{∞} [ xw(m, l) - yw(m, l) ]²

This is the sum of all the errors in a windowed section between the estimated signal and the manufactured DFT's signal. These are then summed for all windows.


The error measurement equation can be solved to find x(n) because the equation is in quadratic form. This gives:

x(n) = [ Σ_{m=-∞}^{∞} w(mS - n) yw(m, n) ] / [ Σ_{m=-∞}^{∞} w²(mS - n) ]

This equation forms the basis of the algorithm to estimate a signal from the magnitude DFT Yw(m, k). In this iterative algorithm, the error between the magnitude DFT of the estimated signal Xw(m, k) and the magnitude DFT Yw(m, k) produced by the morphing sequence is decreased with each iteration.

Let x^i(n) be the estimated x(n) after i iterations. x^(i+1)(n) is found by taking the DFT of x^i(n), replacing the magnitude of Xw^i(m, k) with the magnitude DFT of the morph, Yw(m, k), and then using the above equation to calculate the signal.

The sequence of the algorithm is shown in Figure 4.9.

Figure 4.9 Signal re-estimation algorithm.

As can be seen from the algorithm, the re-estimation process requires a DFT of the previous iteration's signal estimate in order to obtain the phase information for the current iteration. However, the first iteration has no previous estimate from which to obtain phase information and so random noise is used as the phase for the first iteration.

The equations in Figure 4.9 are:

Xw^0(m, k) = Yw(m, k) e^(jr),  where r is a random number between 0 and 1

Xw^i(m, k) = Yw(m, k) e^(j ∠Xw^i(m, k))   (the morph magnitude is kept; the phase is taken from the current estimate)

x^(i+1)(n) = [ Σ_{m=-∞}^{∞} w(mS - n) xw^i(m, n) ] / [ Σ_{m=-∞}^{∞} w²(mS - n) ]

where xw^i(m, n) = (1/N) Σ_{k=0}^{N-1} Xw^i(m, k) W^(nk) and W = e^(i2π/N).
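The whole re-estimation loop can be sketched in numpy. This is an illustrative implementation of the Griffin and Lim iteration described above, not the project's code; the window length, shift and iteration count are assumed values.

```python
import numpy as np

def stft(x, window, shift, nfft):
    """Overlapping windowed N-point DFTs of x."""
    L = len(window)
    frames = [x[s:s + L] * window for s in range(0, len(x) - L + 1, shift)]
    return np.array([np.fft.fft(f, nfft) for f in frames])

def overlap_add(X, window, shift, length):
    """x(n) = sum_m w(mS-n) x_w(m,n) / sum_m w^2(mS-n), as in the text."""
    L = len(window)
    num = np.zeros(length)
    den = np.zeros(length)
    for m, frame in enumerate(np.fft.ifft(X, axis=1).real):
        s = m * shift
        num[s:s + L] += window * frame[:L]
        den[s:s + L] += window ** 2
    return num / np.maximum(den, 1e-12)

def reestimate(magnitude, window, shift, iterations=30):
    """Estimate a signal whose magnitude DFT is close to `magnitude`."""
    rng = np.random.default_rng(0)
    # first iteration: random noise supplies the phase
    X = magnitude * np.exp(2j * np.pi * rng.random(magnitude.shape))
    length = shift * (magnitude.shape[0] - 1) + len(window)
    for _ in range(iterations):
        x = overlap_add(X, window, shift, length)
        X = stft(x, window, shift, magnitude.shape[1])
        # keep the estimated phase, restore the target magnitude
        X = magnitude * np.exp(1j * np.angle(X))
    return overlap_add(X, window, shift, length)
```

Each pass through the loop inverts the current DFT, re-analyses the result, and replaces its magnitude with the target, so the squared error D never increases.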


4.5 Sound Clips

As this section describes techniques such as signal manipulation and re-estimation, a number of example signals shall be used for demonstration purposes. Especially in the case of signal re-estimation, the major factor in evaluating its success is the evaluation of the sound's quality. In order for the reader to judge for themselves and listen to the results, a World Wide Web (WWW) page has been constructed to accompany this document which gives a brief overview of the processes detailed below together with a range of audio resources. The URL of the page is

http://www.dcs.shef.ac.uk/~u5snw/AudioMorphing

Boxes similar to the one below will inform the reader of the presence of a sound clip on the web page which is referred to in that section of the text.

Sound clip

4.6 Use of the Signal Estimation algorithm

4.6.1 Estimated Signal Quality

An important issue with regard to signal estimation is the number of iterations required before a signal of acceptable quality is obtained. This section looks into this area. The algorithm is tested with a magnitude DFT which has had no alterations made to it. In other words, the signal estimation algorithm should return a signal which is very close to the original signal.

For this exercise, a sound sequence of the utterance "Are you sure?" will be used.

The wave file of the utterance "Are you sure?" can be found on the accompanying web page under reference 1.

The signal estimation algorithm was run with 1, 2, 4, 10, 25 and 100 iterations to demonstrate the increasing quality of each estimation.


Figure 4.10 Waveform plots for 1, 2 and 4 iterations. The plot of the original is shown for comparison. Note the amplitude scale for 1 iteration.


Figure 4.11 Waveform plots for 10, 25 and 100 iterations. The plot of the original is shown for comparison.

Wave files of the above estimated signals (after 1, 2, 4, 10, 25 and 100 iterations) can be found on the accompanying web page under reference numbers 2 to 7 respectively.

Figure 4.10 and Figure 4.11 are useful to see the gradual improvement of the overall envelope of the waveform. However, in order to appreciate the improvement fully, the fine structure needs to be examined. Figure 4.12 and Figure 4.13 show the fine structure between 180 ms and 280 ms, which covers the transition between the words you and sure.


Figure 4.12 Waveform plots of the 100 ms of the transition between you and sure for 1, 2 and 4 iterations. The plot of the original is shown for comparison. Note the amplitude scale for 1 iteration.


Figure 4.13 Waveform plots of the 100 ms of the transition between you and sure for 10, 25 and 100 iterations. The plot of the original is shown for comparison.

After one iteration of the algorithm, a recognisable sound signal is produced, although it is very quiet. As the number of iterations increases, the maximum amplitude approaches that of the original signal. After 10 iterations, the waveform's envelope and fine structure can be seen to be taking on the characteristics of the original signal to a greater extent, although it still sounds artificial. Griffin and Lim [9] state that essentially the same results in terms of speech quality were observed after 25 to 100 iterations. Indeed, when the 25-iteration estimated signal is listened to, it sounds reasonably similar to the original. After 100 iterations, the estimated signal is very accurate.


Although all the examples above use the same magnitude DFT as the basis of the signal estimation, the phase information evidently plays an important role in not only the quality of the sound but also the amplitude of the estimated signal. When the phase information is still a random noise sequence (the example using one iteration), the amplitude of the sound is very low but the acoustic signal is still intelligible. After only one additional iteration, the amplitude of the estimated signal is essentially that of the original and the quality has improved substantially.

4.7 Summary

In this section, the sequence of steps to generate a smooth morph between two sounds has been described. The morph was generated by splitting each sound into two forms: a pitch and voicing representation and an envelope representation (Section 4.1). The pitch peaks were obtained from the pitch spectrograms to create a pitch contour for each sound. Dynamic Time Warping was then performed on these contours to align the sounds with respect to their pitches (Section 4.2). At each corresponding frame, the pitch, voicing and envelope information were separately morphed to produce a final morphed frame (Section 4.3). These frames were then converted back into a time-domain waveform using the signal re-estimation algorithm (Section 4.4). This process is shown in Figure 4.14.

Figure 4.14 Block diagram of the audio morphing algorithm. [Figure: each input (signal 1, signal 2) is sampled and passed through a DFT to give overlapping magnitude power spectra; cepstral analysis and filtering separate these into pitch and voicing information and spectral envelope information; the morphed information is recombined into combined magnitude power spectra, from which signal re-estimation produces the morphed signal.]

5. Results and Discussion

Figure 5.1 shows a completed morph from the word nine to the word six, spoken by the author. Figure 5.2 shows the waveform plots of the same sounds. The morph was generated by splitting each sound into two forms: a pitch representation and an envelope representation. The pitch peaks were obtained from the pitch spectrograms to create a pitch contour for each sound. Dynamic Time Warping was then performed on these contours to align the sounds with respect to their pitches. At each corresponding frame, the pitch, voicing and envelope information were separately morphed to produce a final morphed frame. These frames were then converted back into a time domain waveform using the signal re-estimation algorithm.

Wave files of the words in Figure 5.1 - nine, six and nixe - can be found on the accompanying web page under reference numbers 8 to 10 respectively.

Although the spectrogram and the waveform plot of the morphed sound appear distorted when compared to the originals, the true test of success is to listen to the sound itself. In the Introduction, it was stated that the aim of the project was to create a morph in which one sound should smoothly change into another sound, keeping the shared characteristics of the starting and ending sounds but smoothly changing the other properties. This has been achieved. The morph successfully moves from the first sound to the second without any abrupt changes in pitch.

However, the process is reasonably costly. The pitch matching performed by Dynamic Time Warping has complexity O(N²), where N is the number of cepstral slices. However, the cost of the morphing process is dominated by the signal re-estimation algorithm. As described in Section 4.6, the quality of the morph is heavily influenced by the number of iterations used to re-estimate the sound. In that section, the algorithm was tested by re-estimating a sound from an unprocessed magnitude DFT. In other words, no information was removed - intentionally or not - by further manipulation. This meant that if a large number of iterations were used then an almost perfect signal could be obtained. In audio morphing, a large amount of manipulation of the signal takes place and some loss of quality is inevitable. Therefore, fewer iterations were required before the sound converged to a point at which further iterations made negligible difference.
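The O(N²) pitch-matching step can be sketched as standard Dynamic Time Warping. This is an illustrative Python version, not the project's own implementation (that is the MATLAB code in Appendix 2); function and variable names are hypothetical:

```python
def dtw(pattern, template):
    """Dynamic Time Warping between two pitch contours.
    Builds the full cost matrix - hence the O(N^2) complexity
    noted in the text - then backtracks for the alignment path."""
    n, m = len(pattern), len(template)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(pattern[i - 1] - template[j - 1])  # local cost
            cost[i][j] = d + min(cost[i - 1][j - 1],   # diagonal
                                 cost[i - 1][j],       # vertical
                                 cost[i][j - 1])       # horizontal
    # backtrack from (n, m) towards (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return cost[n][m], path[::-1]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` matches the repeated frame in the template to one pattern frame at zero total cost.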

Slaney et al. [6] use an extension of the algorithm described in Section 4.4 [9] which allows the iterative algorithm to converge quickly. For modulated sinusoids, the new algorithm has mixed results when compared to the algorithm proposed by Griffin and Lim [9]. However, when tested on voice samples, the error can be reduced by as much as a factor of 10 [15]. This would have a dramatic effect on the cost of the entire morphing process and would form an interesting extension to the project.

A further means of reducing the computational cost of the re-estimation algorithm was to change the windowing technique. In order to reduce the number of cepstral slices to be processed, the window size and window shift were increased. However, the size of the window was still within the range recommended by Parsons [11]. The need for overlapping windows in the time domain also plays an important role. In order to re-estimate the phase of the signal, each sample in the wave must lie in at least two spectral frames. As stated in Section 2.3, the use of overlapping windows significantly increases subsequent computation. In this work, a 512 point window (equivalent to approximately 50 ms at a sampling rate of 11025 Hz) with a 256 point shift (equivalent to approximately 25 ms at a sampling rate of 11025 Hz) was used. This results in a two-way overlap. However, Slaney et al. [6] recommend a four-way overlap and use a 256 point window with a 64 point shift. Naturally this would be much slower due to the four-fold increase in the number of frames, although the quality would be improved.
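The frame-count arithmetic behind these window choices can be checked directly. The sketch below uses the window and shift figures from the text; the helper names are hypothetical:

```python
def frame_count(num_samples, window, shift):
    """Number of complete analysis windows covering the signal."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift

def overlap_factor(window, shift):
    """How many windows each interior sample falls into."""
    return window // shift

one_second = 11025  # samples at the 11025 Hz rate used in this work

# this work: 512-point window, 256-point shift -> two-way overlap
frames_here = frame_count(one_second, 512, 256)      # 42 frames
overlap_here = overlap_factor(512, 256)              # 2

# Slaney et al. [6]: 256-point window, 64-point shift -> four-way overlap
frames_slaney = frame_count(one_second, 256, 64)     # 169 frames
overlap_slaney = overlap_factor(256, 64)             # 4
```

For a one-second signal the Slaney et al. settings produce 169 frames against 42, which is the roughly four-fold increase in frames mentioned above.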

Figure 5.1 Spectrograms of the words "nine" and "six" with the result of morphing six into nine - a word sounding like "nixe" [13]. [Figure: three spectrogram panels, labelled nine, six and nixe.]

Figure 5.2 Waveform plots of the words "nine" and "six" together with the result of morphing six into nine - a word sounding like "nixe". [Figure: three waveform panels, labelled nine, six and nixe.]

The pitch contour extraction process of Section 4.1 is performed in a rather naïve manner. In order to smoothly morph the pitch information, the pitches of each signal need to be matched. To facilitate this, a pitch estimate for the entire signal is found - a pitch contour. Dynamic Time Warping is then used to find the best match between the two pitch contours. In this work, the pitch contour is found from the cepstral domain.

The position of the peak in each slice is found and these positions build up a pitch contour. Although the results are satisfactory, this method does not take into account two possibilities:

· the pitch may be absent or difficult to find in both frames;
· one frame may have a pitch but the other may not.
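The naive peak-picking scheme amounts to the following sketch. This is illustrative Python rather than the project's MATLAB, and `min_q`/`max_q` (bounds on the plausible quefrency range) are assumed parameters:

```python
def pitch_contour(cepstral_slices, min_q, max_q):
    """Naive pitch contour: the position of the largest cepstral
    value in each slice, searched within a plausible quefrency
    range. No check is made that a genuine peak exists - exactly
    the weakness identified in the text."""
    contour = []
    for slice_ in cepstral_slices:
        region = slice_[min_q:max_q]
        peak = max(range(len(region)), key=lambda i: region[i])
        contour.append(min_q + peak)
    return contour

# two toy cepstral slices with peaks at quefrencies 3 and 4
contour = pitch_contour([[0, 0, 1, 5, 2, 0],
                         [0, 0, 2, 1, 9, 0]], 2, 6)
```

Because an index is always returned, a voiceless or noisy slice contributes an arbitrary value to the contour, which is why the two failure cases above matter.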

To overcome this problem, Slaney et al. [6] use a combination of dynamic programming and a conventional pitch scheme in order to produce a pitch contour. The method they employ [14] sets two constraints:

1. the pitch estimate must fit the available data (the pitch information);
2. the pitch must smoothly change over time.
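A toy illustration of how dynamic programming can trade those two constraints against each other is sketched below. This is not Secrest and Doddington's actual algorithm [14], just a minimal Viterbi-style version with made-up candidate data and hypothetical names:

```python
def smooth_pitch_track(candidates, strengths, jump_weight=1.0):
    """Pick one pitch candidate per frame, balancing candidate
    strength (fit to the data) against frame-to-frame pitch jumps
    (smoothness), by dynamic programming over all paths."""
    prev = [-s for s in strengths[0]]  # cost of each frame-0 choice
    back = []
    for t in range(1, len(candidates)):
        cur, choices = [], []
        for j, pitch in enumerate(candidates[t]):
            # cheapest predecessor for this candidate
            best_k = min(range(len(candidates[t - 1])),
                         key=lambda k: prev[k] +
                         jump_weight * abs(pitch - candidates[t - 1][k]))
            cur.append(prev[best_k] +
                       jump_weight * abs(pitch - candidates[t - 1][best_k]) -
                       strengths[t][j])
            choices.append(best_k)
        prev, back = cur, back + [choices]
    # backtrack the cheapest path
    j = min(range(len(prev)), key=lambda k: prev[k])
    path = [j]
    for choices in reversed(back):
        j = choices[j]
        path.append(j)
    path.reverse()
    return [candidates[t][path[t]] for t in range(len(candidates))]

# two candidates per frame (e.g. a true pitch and an octave error);
# equal strengths, so smoothness decides
track = smooth_pitch_track([[100, 200], [105, 210], [102, 205]],
                           [[1, 1], [1, 1], [1, 1]])
```

With equal candidate strengths, the smoothness term selects the low, slowly varying track rather than mixing in the larger jumps.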

The development and integration of this technique with the work to date would improve the robustness of the morphing process and would be an important area for future work.

As noted in Section 4.2, pitch is not the only time-varying property which can be used to morph between two sounds. If the underlying rhythm of a sound is important then this ought to be used as the matching function between the two sounds. A better approach still may be to combine two or more matching functions in order to achieve a more pleasing morph. The algorithm presented here is prone to excessive stretching of the time axis in order to achieve a match between the two pitch contours. The use of a combined rhythm and pitch matching function could limit this unwanted warping. Further, the weighting of each component in the matching function could be determined according to requirements, allowing heavily rhythm-biased or heavily pitch-biased matches.
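Such a combined matching function could be as simple as a weighted sum of local differences, used as the per-frame cost inside the Dynamic Time Warping step. The sketch below is purely illustrative; the rhythm feature (for example, a frame energy envelope) and all names are assumptions:

```python
def combined_cost(pitch_a, pitch_b, rhythm_a, rhythm_b, w_pitch=0.5):
    """Local DTW cost mixing a pitch difference with a rhythm
    difference. w_pitch=1.0 gives a purely pitch-biased match,
    w_pitch=0.0 a purely rhythm-biased one."""
    return (w_pitch * abs(pitch_a - pitch_b) +
            (1.0 - w_pitch) * abs(rhythm_a - rhythm_b))
```

Sweeping `w_pitch` between the two extremes would give exactly the rhythm-biased or pitch-biased matches described above.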

In this paper, only one type of morphing has been discussed - that in which the final morph has the same duration as the longest signal. However, according to the eventual use of the morph, a number of other types could be produced:

· the longest signal is compressed and the morph has the same duration as the shortest signal (the reverse of the approach described here);
· if one signal is significantly longer than the other, two possibilities arise:

1. if the longer signal is the 'target' - the sound one wishes to morph to - then the morph would be performed between the start signal and the target's corresponding section (of equal duration), with the remainder of the target's signal unaffected;
2. if the longer signal is the start signal, then the morph would be performed over the duration of the shorter signal and the remainder of the start signal would be removed.

Further extension to this work to provide the above functionality would create a powerful and flexible morphing tool. Such a tool would allow the user to specify the points at which a morph was to start and finish, the properties of the morph and also the matching function.

With the increased user interaction in the process, a Graphical User Interface could be designed and integrated to make the package more 'user-friendly'. Such an improvement would provide immediate visual feedback (which is lacking in the current implementation) and possibly step-by-step guidance.

Finally, this work has used spectrograms as the pitch and voicing and spectral envelope representations. Although effective, further work ought to concentrate on new representations which enable further separation of information. For example, a new representation might allow the separation of the pitch and voicing [6].

6. Conclusions

This paper has implemented many of the ideas of Slaney et al.'s new approach to audio morphing [6]. The method has already been put into use in a number of different fields. One such field is music. In 1996, Settel and Lippe [17] demonstrated real-time audio morphing at the 7th International Symposium on Electronic Art and predicted that audio morphing would sweep through the music world in the same way that visual morphing did in the graphics world.

This approach separates the sounds into two forms: spectral envelope information and pitch and voicing information. These can then be independently modified. The morph is generated by splitting each sound into two forms: a pitch representation and an envelope representation. The pitch peaks are then obtained from the pitch spectrograms to create a pitch contour for each sound. Dynamic Time Warping of these contours aligns the sounds with respect to their pitches. At each corresponding frame, the pitch, voicing and envelope information are separately morphed to produce a final morphed frame. These frames are then converted back into a time domain waveform using the signal re-estimation algorithm.

Unlike visual morphing, audio morphing can separate different aspects of the sound into independent dimensions: time, pitch and voicing, and spectral envelope.

There are a number of areas in which further work should be carried out in order to improve the technique described here and extend the field of audio morphing in general.

The time required to generate a morph is dominated by the signal re-estimation process. Even a small number of iterations (for example, two) takes a significant amount of time, even for signals of approximately one second in duration. Although in audio morphing an inevitable loss of quality due to manipulation occurs, so that fewer iterations are required, an improved re-estimation algorithm is still needed. Slaney et al. [6] use an extension to the algorithm described in Section 4.4 [9] which allows the iterative algorithm to converge quickly. This can reduce the error by as much as a factor of 10 [15] and reduces the required computation time significantly.

Pitch is not the only time-varying property which can be used to morph between two sounds. If the underlying rhythm of a sound is important then this ought to be used as the matching function between the two sounds. A better approach still may be to combine two or more matching functions in order to achieve a more pleasing morph. The algorithm presented here is prone to excessive stretching of the time axis in order to achieve a match between the two pitch contours. The use of a combined rhythm and pitch matching function could limit this unwanted warping. Further, the weighting of each component in the matching function could be determined according to requirements, allowing heavily rhythm-biased or heavily pitch-biased matches.

This work has used spectrograms as the pitch and voicing and spectral envelope representations. Although effective, further work ought to concentrate on new representations which enable further separation of information. For example, a new representation might allow the separation of the pitch and voicing [6].

In summary, the work described in this paper enables the automatic morphing of one sound to another whilst maintaining a smoothly changing pitch. A number of the processes, such as the matching and signal re-estimation, are still at a primitive level but do produce satisfactory morphs. Concentration on the issues described above for further work, and extensions to the audio morphing principle, ought to produce systems which create extremely convincing and satisfying sound morphs.

7. References

[1] Oppenheim, A.V. and Schafer, R.W., "Discrete-Time Signal Processing". Prentice-Hall International, 1989.

[2] Elliot, D.F. and Rao, K.R., "Fast Transforms: Algorithms, Analyses, Applications". Academic Press, 1982.

[3] Nussbaumer, H.J., "Fast Fourier Transform and Convolution Algorithms". Springer-Verlag, 1982.

[4] Rabiner, L.R. and Schafer, R.W., "Digital Processing of Speech Signals". Prentice-Hall, 1978.

[5] Deller Jr., J.R., Proakis, J.G. and Hansen, J.H.L., "Discrete-Time Processing of Speech Signals". Prentice-Hall, 1987.

[6] Slaney, M., Covell, M. and Lassiter, B., "Automatic Audio Morphing". Proceedings of the 1996 IEEE ICASSP, Atlanta, GA, May 7-10, 1996.

[7] Covell, M. and Withgott, M., "Spanning the Gap between Motion Estimation and Morphing". Proceedings of the 1994 IEEE ICASSP.

[8] Sommerville, I., "Software Engineering, Fourth Edition". Addison-Wesley, 1992.

[9] Griffin, D. and Lim, J., "Signal Estimation from Modified Short-Time Fourier Transform". IEEE Trans. on Acoustics, Speech, and Signal Processing, 32, 236-242, 1984.

[10] Brown, G. and Cooke, M., "COM325: Computer Speech & Hearing". Course notes, Department of Computer Science, University of Sheffield.

[11] Parsons, T., "Voice and Speech Processing". McGraw-Hill, 1987.

[12] Fallside, F. and Woods, W.A., "Computer Speech Processing". Prentice-Hall International, 1985.

[13] Cooke, M., Code to create spectrograms in the MATLAB environment. Department of Computer Science, University of Sheffield.

[14] Secrest, B. and Doddington, G., "An integrated pitch tracking algorithm for speech systems". Proceedings of the 1983 IEEE ICASSP, Boston, MA, 1983.

[15] Slaney, M., Naar, D. and Lyon, R.F., "Auditory Model Inversion for Sound Separation". Proceedings of the 1994 IEEE ICASSP, Adelaide, Australia, 1994.

[16] Lexicon Inc., Lexicon Vortex using the Audio Morphing™ Processor. http://www.harmony-central.com/Effects/Data/Lexicon/Vortex-spec.html

[17] Settel, Z. and Lippe, C., "Real-Time Audio Morphing". 7th International Symposium on Electronic Art, Rotterdam, The Netherlands, September 16-20, 1996. http://www.isea96.nl/madi

[18] Watkins, C.D., Sadun, A. and Marenka, S., "Modern Image Processing: Warping, Morphing, and Classical Techniques". Academic Press Professional, 1993.


Appendix 1 - Project Milestones

[Gantt chart: for each of Semester 1 and Semester 2 (weeks W1 to W15), the tasks scheduled were Design, Implementation, Testing, System Evaluation and Identification of Modifications. The different shades of grey in the original chart are merely for separation purposes.]


Appendix 2 - Source Code Extracts

MATLAB Code to calculate the 2D array of Section 4.3

function localCost=d(x,y)
global PATTERN;
global TEMPLATE;
% return the local euclidean cost
localCost=abs(PATTERN(x)-TEMPLATE(y));

function ia=arrayindex(ba)
% return the 2D array given the backtrace array
ia=[];
pid=size(ba,2); % number of columns
tid=size(ba,1); % number of rows

% first pair in array is last pair in backtrace array
ia(1,1)=pid;
ia(2,1)=tid;

% obtain predecessor code: 1 diagonal, 2 horizontal, 3 vertical
curLoc=ba(tid,pid);

% do until at beginning of backtrace array (indicated by a zero in that cell)
while curLoc~=0
    if curLoc==1
        % diagonal - decrement row and column indexes
        pid=pid-1;
        tid=tid-1;
    end
    if curLoc==2
        % horizontal - decrement column index
        pid=pid-1;
    end
    if curLoc==3
        % vertical - decrement row index
        tid=tid-1;
    end

    % store new pair at the beginning of the 2D array
    ia=[pid ia(1,:); tid ia(2,:)];
    % obtain predecessor code: 1 diagonal, 2 horizontal, 3 vertical
    curLoc=ba(tid,pid);
end

MATLAB Code to create a cepstral slice.

fastft=abs(fft(window.*wave2(n:n+windowLength-1), nfft)); % create DFT slice
cep=ifft((log10(0.00001+fastft)),nfft); % create cepstral slice; 0.00001 added to avoid log of zero
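For comparison, the same cepstral-slice computation expressed in Python/NumPy terms; this is an illustrative equivalent, not part of the original project, and the function name is hypothetical:

```python
import numpy as np

def cepstral_slice(wave, n, window, nfft):
    """One cepstral slice: DFT magnitude of a windowed segment,
    log-compressed (with 1e-5 added before the log to avoid the
    log of zero, as in the MATLAB above), then inverse-transformed."""
    segment = wave[n:n + len(window)] * window
    mag = np.abs(np.fft.fft(segment, nfft))
    return np.fft.ifft(np.log10(1e-5 + mag), nfft)

# demo on one 64-sample segment of a cosine
wave = np.cos(2 * np.pi * 8 * np.arange(256) / 64)
cep = cepstral_slice(wave, 0, np.hanning(64), 64)
```

Because the magnitude spectrum of a real segment is symmetric, the resulting cepstrum is real up to rounding error; its peaks at non-zero quefrency are what the pitch contour extraction of Section 4.1 searches for.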