Speech Analysis and Synthesis by Linear Prediction of the Speech Wave


Received 2 April 1971; revised 21 April 1971    9.8, 9.3

Speech Analysis and Synthesis by Linear Prediction of the Speech Wave

B. S. ATAL AND SUZANNE L. HANAUER

Bell Telephone Laboratories, Incorporated, Murray Hill, New Jersey 07974

We describe a procedure for efficient encoding of the speech wave by representing it in terms of time-varying parameters related to the transfer function of the vocal tract and the characteristics of the excitation. The speech wave, sampled at 10 kHz, is analyzed by predicting the present speech sample as a linear combination of the 12 previous samples. The 12 predictor coefficients are determined by minimizing the mean-squared error between the actual and the predicted values of the speech samples. Fifteen parameters--namely, the 12 predictor coefficients, the pitch period, a binary parameter indicating whether the speech is voiced or unvoiced, and the rms value of the speech samples--are derived by analysis of the speech wave, encoded, and transmitted to the synthesizer. The speech wave is synthesized as the output of a linear recursive filter excited by either a sequence of quasiperiodic pulses or a white-noise source. Application of this method for efficient transmission and storage of speech signals, as well as procedures for determining other speech characteristics, such as formant frequencies and bandwidths, the spectral envelope, and the autocorrelation function, are discussed.

INTRODUCTION

Efficient representation of speech signals in terms of a small number of slowly varying parameters is a problem of considerable importance in speech research. Most methods for analyzing speech start by transforming the acoustic data into spectral form by performing a short-time Fourier analysis of the speech wave. Although spectral analysis is a well-known technique for studying signals, its application to speech signals suffers from a number of serious limitations arising from the nonstationary as well as the quasiperiodic properties of the speech wave. As a result, methods based on spectral analysis often do not provide a sufficiently accurate description of speech articulation. We present in this paper a new approach to speech analysis and synthesis in which we represent the speech waveform directly in terms of time-varying parameters related to the transfer function of the vocal tract and the characteristics of the source function. By modeling the speech wave itself, rather than its spectrum, we avoid the problems inherent in frequency-domain methods. For instance, the traditional Fourier analysis methods require a relatively long speech segment to provide adequate spectral resolution. As a result, rapidly changing speech events cannot be accurately followed. Furthermore, because of the periodic nature of voiced speech, little information about the spectrum between pitch harmonics is available; consequently, the frequency-domain techniques do not perform satisfactorily for high-pitched voices such as the voices of women and children. Although pitch-synchronous analysis-by-synthesis techniques can provide a partial solution to the above difficulties, such techniques are extremely cumbersome and time consuming even for modern digital computers and are therefore unsuitable for automatic processing of large amounts of speech data. In contrast, the techniques presented in this paper are shown to avoid these problems completely.

The speech analysis-synthesis technique described in this paper is applicable to a wide range of research problems in speech production and perception. One of the main objectives of our method is the synthesis of speech which is indistinguishable from normal human speech. Much can be learned about the information-carrying structure of speech by selectively altering the properties of the speech signal. These techniques can thus serve as a tool for modifying the acoustic properties of a given speech signal without degrading the speech quality. Some other potential applications of these techniques are in the areas of efficient storage and transmission of speech, automatic formant and pitch extraction, and speaker and speech recognition.

In the rest of the paper, we describe a parametric model for representing the speech signal in the time


FIG. 1. Block diagram of a functional model of speech production based on the linear-prediction representation of the speech wave.

domain; we discuss methods for analyzing the speech wave to obtain these parameters and for synthesizing the speech wave from them. Finally, we discuss applications for efficient coding of speech, estimation of the spectral envelope, formant analysis, and for modifying the acoustic properties of the speech signal. The paper is organized such that most of the mathematical details are discussed in a set of appendixes. The main body of the paper is nearly complete in itself, and those readers who are not interested in the mathematical or computational aspects may skip the appendixes.

I. MODEL FOR PARAMETRIC REPRESENTATION OF THE SPEECH WAVE

In modern signal-processing techniques, the procedures for analyzing a signal make use of all the information that can be obtained in advance about the structure of that signal. The first step in signal analysis is thus to make a model of the signal.

Speech sounds are produced as a result of acoustical excitation of the human vocal tract. During the production of voiced sounds, the vocal tract is excited by a series of nearly periodic pulses generated by the vocal cords. In the case of unvoiced sounds, the excitation is provided by air passing turbulently through constrictions in the tract. A simple model of the vocal tract can be made by representing it as a discrete time-varying linear filter. If we assume that the variations with time of the vocal-tract shape can be approximated with sufficient accuracy by a succession of stationary shapes, it is possible to define a transfer function in the complex z domain for the vocal tract. The transfer function of a linear network can always be represented by its poles and zeros. It is well known that for nonnasal voiced speech sounds the transfer function of the vocal tract has no zeros. For these sounds, the vocal tract can therefore be adequately represented by an all-pole (recursive) filter. A representation of the vocal tract for unvoiced and nasal sounds usually includes the antiresonances (zeros) as well as the resonances (poles) of the vocal tract. Since the zeros of the transfer function of the vocal tract for unvoiced and nasal sounds lie within the unit circle in the z plane, each factor in the numerator of the transfer function can be approximated by multiple poles in the denominator of the transfer function. In addition, the location of a pole is considerably more important perceptually than the location of a zero; the zeros in most cases contribute only to the spectral balance. Thus, an explicit representation of the antiresonances by zeros of the linear filter is not necessary. An all-pole model of the vocal tract can approximate the effect of antiresonances on the speech wave in the frequency range of interest to any desired accuracy.

The z transform of the glottal volume flow during a single pitch period can also be assumed to have poles only and no zeros. With this approximation, the z transform of the glottal flow can be represented by

U_g(z) = K_1 / [(1 - z_a z^{-1})(1 - z_b z^{-1})],   (1)

where K_1 is a constant related to the amplitude of the glottal flow and z_a, z_b are poles on the real axis inside the unit circle. In most cases, one of the poles is very close to the unit circle. If the radiation of sound from the mouth is approximated as radiation from a simple spherical source, then the ratio between the sound pressure at the microphone and the volume velocity at the lips is represented in the z-transform notation as K_2(1 - z^{-1}), where K_2 is a constant related to the amplitude of the volume flow at the lips and the distance from the lips to the microphone. The contribution of the glottal volume flow, together with the radiation, can thus be represented in the transfer function by the factor

K_1 K_2 (1 - z^{-1}) / [(1 - z_a z^{-1})(1 - z_b z^{-1})],

which, in turn, can be approximated as

K_1 K_2 / (1 - z_b z^{-1}).   (2)

The error introduced by this approximation is given by

K_1 K_2 z^{-1}(1 - z_a) / [(1 - z_a z^{-1})(1 - z_b z^{-1})].


FIG. 2. Block diagram of the pitch pulse detector.


The contribution of this error to the transfer function in the frequency range of interest can be assumed to be small, since z_a is close to 1. One of the important features of our model is that the combined contributions of the glottal flow, the vocal tract, and the radiation are represented by a single recursive filter. The difficult problem of separating the contribution of the source function from that of the vocal tract is thus completely avoided.

This representation of the speech signal is illustrated in sampled-data form in Fig. 1. The vocal-cord excitation for voiced sounds is produced by a pulse generator with adjustable period and amplitude. The noise-like excitation of unvoiced sounds is produced by a white-noise source. The linear predictor P, a transversal filter with p delays of one sample interval each, forms a weighted sum of the past p samples at the input of the predictor. The output of the linear filter at the nth sampling instant is given by

s_n = Σ_{k=1}^{p} a_k s_{n-k} + δ_n,   (3)

where the "predictor coefficients" a_k account for the filtering action of the vocal tract, the radiation, and the glottal flow, and δ_n represents the nth sample of the excitation.

The transfer function of the linear filter of Fig. 1 is given by

T(z) = 1 / (1 - Σ_{k=1}^{p} a_k z^{-k}).   (4)

The poles of T(z) are the (reciprocal) zeros of the polynomial in z^{-1} in the denominator on the right side of Eq. 4. The linear filter thus has a total of p poles, which are either real or occur in conjugate pairs. Moreover, for the linear filter to be stable, the poles must be inside the unit circle.
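As a concrete illustration of Eqs. 3 and 4 (not part of the original paper), the following Python sketch implements the all-pole filter T(z) and checks its stability from the roots of the denominator polynomial; the coefficient values in the example are arbitrary placeholders, not values from the paper.

```python
import numpy as np

def synthesis_filter(a, excitation):
    """All-pole filter of Eq. 4: s_n = sum_k a_k s_{n-k} + excitation_n (Eq. 3)."""
    p = len(a)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        past = sum(a[k] * s[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
        s[n] = past + excitation[n]
    return s

def is_stable(a):
    """True if all poles of T(z), i.e. the zeros of z^p - a_1 z^{p-1} - ... - a_p, lie inside the unit circle."""
    return np.all(np.abs(np.roots(np.concatenate(([1.0], -np.asarray(a, float))))) < 1.0)

# Arbitrary second-order example: a complex pole pair at radius ~0.95
a = [1.5, -0.9]
print(is_stable(a))                                    # True
impulse_response = synthesis_filter(a, np.r_[1.0, np.zeros(99)])
```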

The number of coefficients p required to represent any speech segment adequately is determined by the number of resonances and antiresonances of the vocal tract in the frequency range of interest, the nature of the glottal volume flow function, and the radiation. As discussed earlier, two poles are usually adequate to represent the influence of the glottal flow and the radiation on the speech wave. It is shown in Appendix B that, in order to represent the poles of the vocal-tract transfer function adequately, the linear predictor memory must be equal to twice the time required for sound waves to travel from the glottis to the lips (nasal opening for nasal sounds). For example, if the vocal tract is 17 cm in length, the memory of the predictor should be roughly 1 msec in order to represent the poles of the transfer function of the vocal tract. The corresponding value of p is then 10 for a sampling interval of 0.1 msec. With the two poles required for the glottal flow and the radiation added, p should be approximately 12. These calculations are meant to provide only a rough estimate of p and will depend to some extent on the speaker as well as on the spoken material. The results based on speech synthesis experiments (see Sec. IV) indicate that, in most cases, a value of p equal to 12 is adequate at a sampling frequency of 10 kHz. p is, naturally, a function of the sampling frequency and is roughly proportional to it.

The predictor coefficients a_k, together with the pitch period, the rms value of the speech samples, and a binary parameter indicating whether the speech is voiced or unvoiced, provide a complete representation of the speech wave over a time interval during which the vocal-tract shape is assumed to be constant. During speech production, of course, the vocal-tract shape changes continuously in time. In most cases, it is sufficient to readjust these parameters periodically, for example, once every 5 or 10 msec.

II. SPEECH ANALYSIS

A. Determination of the Predictor Parameters

Going back to Fig. 1, we see that, except for one sample at the beginning of every pitch period, samples of voiced speech are linearly predictable in terms of the past p speech samples. We now use this property of the speech wave to determine the predictor coefficients. Let us define the prediction error E_n as the difference between the speech sample s_n and its predicted value ŝ_n, given by

ŝ_n = Σ_{k=1}^{p} a_k s_{n-k}.   (5)

E_n is then given by

E_n = s_n - ŝ_n = s_n - Σ_{k=1}^{p} a_k s_{n-k}.   (6)

We define the mean-squared prediction error ⟨E_n^2⟩_av as the average of E_n^2 over all the sampling instants n in the speech segment to be analyzed except those at the


FIG. 3. Waveform of the speech signal together with the positions of the pitch pulses (shown by vertical lines).

beginning of each pitch period, i.e.,

⟨E_n^2⟩_av = ⟨(s_n - Σ_{k=1}^{p} a_k s_{n-k})^2⟩_av.   (7)

The predictor coefficients a_k of Eq. 3 are chosen so as to minimize the mean-squared prediction error ⟨E_n^2⟩_av. The same procedure is used to determine the predictor parameters for unvoiced sounds, too. The coefficients a_k which minimize the mean-squared prediction error are obtained by setting the partial derivative of ⟨E_n^2⟩_av with respect to each a_k equal to zero. It can then be shown that the coefficients a_k are obtained as solutions of the set of equations

Σ_{k=1}^{p} φ_{jk} a_k = φ_{j0},   j = 1, 2, ..., p,   (8)

where

φ_{jk} = ⟨s_{n-j} s_{n-k}⟩_av.   (9)

In general, the solution of a set of simultaneous linear equations requires a great deal of computation. However, the set of linear equations given by Eq. 8 is a special one, since the matrix of coefficients is symmetric and positive definite. There are several methods of solving such equations. A computationally efficient method of solving Eq. 8 is outlined in Appendix C.

Occasionally, the coefficients a_k obtained by solving Eq. 8 produce poles in the transfer function which are outside the unit circle. This can happen whenever a pole of the transfer function near the unit circle appears outside the unit circle, owing to approximations in the model. The locations of all such poles must be corrected. A simple computational procedure to determine if any pole of the transfer function is outside the unit circle and a method for correcting the predictor coefficients are described in Appendix D.
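A minimal sketch of the computation of Eqs. 8 and 9 follows, assuming the analysis segment is available as a NumPy array; the Cholesky factorization corresponds to the triangular decomposition of Appendix C. The function name and array conventions are illustrative, not from the paper.

```python
import numpy as np

def predictor_coefficients(s, p=12):
    """Predictor coefficients a_1..a_p minimizing the mean-squared prediction error (Eqs. 7-9)."""
    s = np.asarray(s, dtype=float)
    n = np.arange(p, len(s))                             # sampling instants entering the averages
    lags = np.stack([s[n - j] for j in range(p + 1)])    # row j holds s_{n-j}
    phi = lags @ lags.T / len(n)                         # phi[j, k] = <s_{n-j} s_{n-k}>  (Eq. 9)
    L = np.linalg.cholesky(phi[1:, 1:])                  # phi = L L^t  (cf. Appendix C)
    x = np.linalg.solve(L, phi[1:, 0])                   # forward substitution
    return np.linalg.solve(L.T, x)                       # back substitution: the a_k of Eq. 8
```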

B. Pitch Analysis

Although any reliable pitch-analysis method can be used to determine the pitch of the speech signal, we outline here briefly two methods of pitch analysis which are sufficiently reliable and accurate for our purpose.

In the first method, the speech wave is filtered through a 1-kHz low-pass filter and each filtered speech sample is raised to the third power to emphasize the high-amplitude portions of the speech waveform. The duration of the pitch period is obtained by performing a pitch-synchronous correlation analysis of the cubed speech. The voiced-unvoiced decision is based on two factors, the density of zero crossings in the speech wave and the peak value of the correlation function. This method of pitch analysis is described in detail in Ref. 14.

The second method of pitch analysis is based on the linear prediction representation of the speech wave. It follows from Fig. 1 that, except for a sample at the beginning of each pitch period, every sample of the voiced speech waveform can be predicted from the past sample values. Therefore, the positions of individual pitch pulses can be determined by computing the prediction error E_n given by Eq. 6 and then locating the


FIG. 4. Block diagram of the speech synthesizer.


samples for which the prediction error is large. The latter function is easily accomplished by a suitable peak-picking procedure. This procedure is illustrated in Fig. 2. In practice, the prediction error was found to be large at the beginning of the pitch periods and a relatively simple peak-picking procedure was found to be effective. The voiced-unvoiced decision is based on the ratio of the mean-squared value of the speech samples to the mean-squared value of the prediction error samples. This ratio is considerably smaller for unvoiced speech sounds than for voiced speech sounds--typically, by a factor of 10. The result of the pitch analysis on a short segment of the speech wave is illustrated in Fig. 3. The positions of the individual pitch pulses, shown by vertical lines, are superimposed on the speech waveform for easy comparison.
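The second pitch-analysis method can be sketched as below; `predictor_coefficients` is the hypothetical helper shown earlier, and the threshold, the minimum pulse spacing, and the factor-of-10 voicing test are illustrative settings rather than values prescribed by the paper.

```python
import numpy as np

def pitch_pulses(s, a, min_period=20, threshold=3.0):
    """Locate pitch pulses as large peaks of the prediction error E_n (Eq. 6)."""
    s = np.asarray(s, dtype=float)
    p = len(a)
    err = np.array([s[n] - np.dot(a, s[n - p:n][::-1]) for n in range(p, len(s))])
    rms_err = np.sqrt(np.mean(err ** 2))
    pulses = []
    for n in np.flatnonzero(np.abs(err) > threshold * rms_err):
        if not pulses or n - pulses[-1] >= min_period:        # simple peak picking, one pulse per period
            pulses.append(n)
    voiced = np.mean(s[p:] ** 2) / np.mean(err ** 2) > 10.0   # speech-to-error power ratio test
    return np.asarray(pulses) + p, voiced
```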

III. SPEECH SYNTHESIS

The speech signal is synthesized by means of the same parametric representation as was used in the analysis. A block diagram of the speech synthesizer is shown in Fig. 4. The control parameters supplied to the synthesizer are the pitch period, a binary voiced-unvoiced parameter, the rms value of the speech samples, and the p predictor coefficients. The pulse generator produces a pulse of unit amplitude at the beginning of each pitch period. The white-noise generator produces uncorrelated, uniformly distributed random samples with standard deviation equal to 1 at each sampling instant. The selection between the pulse generator and the white-noise generator is made by the voiced-unvoiced switch. The amplitude of the excitation signal is adjusted by the amplifier G. The linearly predicted value ŝ_n of the speech signal is combined with the excitation signal to form the nth sample of the synthesized speech signal. The speech samples are finally low-pass filtered to provide the continuous speech wave s(t).

It may be pointed out here that, although for time-invariant networks the synthesizer of Fig. 4 will be equivalent to a traditional formant synthesizer with variable formant bandwidths, its operation for the time-varying case (which is true in speech synthesis) differs significantly from that of a formant synthesizer. For instance, a formant synthesizer has separate filters for each formant and, thus, a correct labeling of formant frequencies is essential for the proper functioning of a formant synthesizer. This is not necessary for the synthesizer of Fig. 4, since the formants are synthesized together by one recursive filter. Moreover, the amplitude of the pitch pulses as well as of the white noise is adjusted to provide the correct rms value of the synthetic speech samples.

The synthesizer control parameters are reset to their new values at the beginning of every pitch period for voiced speech and once every 10 msec for unvoiced speech. If the control parameters are not determined pitch-synchronously in the analysis, new parameters are computed by suitable interpolation of the original parameters to allow pitch-synchronous resetting of the synthesizer. The pitch period and the rms value are interpolated "geometrically" (linear interpolation on a logarithmic scale). In interpolating the predictor coefficients, it is necessary to ensure the stability of the recursive filter in the synthesizer. The stability cannot, in general, be ensured by direct linear interpolation of the predictor parameters. One suitable method is to interpolate the first p samples of the autocorrelation function of the impulse response of the recursive filter. The autocorrelation function has the important advantage of having a one-to-one relationship with the predictor coefficients. Therefore, the predictor coefficients can be recomputed from the autocorrelation function. Moreover, the predictor coefficients derived from the autocorrelation function always result in a stable filter in the synthesizer. The relationship between the predictor coefficients and the autocorrelation function can be derived as follows:


FIG. 5. Variation of the minimum value of the rms prediction error with p, the number of predictor coefficients. Solid line shows the curve for voiced speech. Dotted line shows the curve for unvoiced speech.

From Eq. 3, the impulse response of the linear recursive filter of Fig. 1 satisfies the equation

s_n = Σ_{k=1}^{p} a_k s_{n-k},   n ≥ 1,   (10)

with the initial conditions s_0 = 1 and s_n = 0 for n < 0.


FIG. 6. Comparison of wide-band sound spectrograms for synthetic and original speech signals for the utterance "May we all learn a yellow lion roar," spoken by a male speaker: (a) synthetic speech, and (b) original speech.


negative. In case such a solution does not exist, G is set to zero. The nth sample of the synthesized wave is finally obtained by adding the amplified excitation Gu_n to the linearly predicted value ŝ_n.
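The synthesizer loop of Fig. 4 can be sketched as follows, under simplifying assumptions that are not in the paper: the parameters are held constant over one frame, the excitation gain is set crudely from the target rms, and the final low-pass filtering is omitted. The function and parameter names are illustrative.

```python
import numpy as np

def synthesize_frame(a, n_samples, pitch_period, voiced, target_rms, memory=None):
    """One frame of the synthesizer of Fig. 4: pulse or noise excitation driving the recursive filter."""
    p = len(a)
    s = np.zeros(n_samples + p)
    if memory is not None:
        s[:p] = memory                                         # filter state carried over from the last frame
    if voiced:
        exc = np.zeros(n_samples)
        exc[::pitch_period] = 1.0                              # unit pulse at the start of each pitch period
    else:
        exc = np.random.uniform(-np.sqrt(3), np.sqrt(3), n_samples)   # unit-variance uniform noise
    gain = target_rms / (np.sqrt(np.mean(exc ** 2)) + 1e-12)  # crude gain choice, illustrative only
    for n in range(n_samples):
        s[p + n] = np.dot(a, s[n:p + n][::-1]) + gain * exc[n]
    return s[p:], s[-p:]
```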

IV. COMPUTER SIMULATION OF THE ANALYSIS-SYNTHESIS SYSTEM

In order to assess the subjective quality of the synthesized speech, the speech analysis and synthesis system described above was simulated on a digital computer. The speech wave was first low-pass filtered to 5 kHz and then sampled at a frequency of 10 kHz. The analysis segment was set equal to a pitch period for voiced speech and equal to 10 msec for unvoiced speech. The various parameters were then determined for each analysis segment according to the procedure described in Sec. II. These parameters were finally used to control the speech synthesizer shown in Fig. 4.

The optimum value for the number of predictor parameters p was determined as follows: The speech wave was synthesized for various values of p between 2 and 18. Informal listening tests revealed no significant differences between synthetic speech samples for p larger than 12. There was slight degradation in speech quality at p equal to 8. However, even for p as low as 2, the synthetic speech was intelligible although poor in quality. The influence of decreasing p to values less than 10 was most noticeable on nasal consonants. Furthermore, the effect of decreasing p was less noticeable on female voices than on male voices. This could be expected in view of the fact that the length of the vocal tract for female speakers is generally shorter than for male speakers and that the nasal tract is slightly longer than the oral tract. From these results, it was concluded that a value of p equal to 12 was required to provide an adequate representation of the speech signal.

It may be worthwhile at this point to compare these results with the objective results based on an examination of the variation of the prediction error as a function of p. In Fig. 5, we have plotted the minimum value of the rms prediction error as a function of several values of p. The speech power in each case was normalized to unity. The results are presented separately for voiced and unvoiced speech. As can be seen in the figure, the prediction error curve is relatively flat for values of p greater than 12 for voiced speech and for p greater than 6 for unvoiced speech. These results suggest again that p equal to 12 is adequate for voiced speech. For unvoiced speech, a lower value of p, e.g., p equal to 6, should be adequate. For those readers who wish to listen to the quality of synthesized speech at various values of p, a recording accompanies this article. Appendix A gives the contents of the record. The reader should listen at this point to the first section of the record.

In informal listening tests, the quality of the synthetic speech was found to be very close to that of the original speech for a wide range of speakers and spoken material. No significant differences were observed between the synthetic speech samples of male and female speakers. The second section of the record includes examples of synthesized speech for several utterances of different speakers. In each case, p was set equal to 12. The spectrograms of the synthetic and the original speech for two of these utterances are compared in Figs. 6 and 7. As can be seen, the spectrogram of the synthetic speech closely resembles that of the original speech.
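The objective comparison of Fig. 5 can be reproduced in outline with the sketch below, which reuses the hypothetical `predictor_coefficients` helper from Sec. II.A and reports the normalized rms prediction error for a range of values of p.

```python
import numpy as np

def rms_prediction_error_vs_p(s, p_values=range(2, 21, 2)):
    """Normalized rms prediction error as a function of p (the quantity plotted in Fig. 5)."""
    s = np.asarray(s, dtype=float)
    s = s / np.sqrt(np.mean(s ** 2))                          # normalize speech power to unity
    errors = {}
    for p in p_values:
        a = predictor_coefficients(s, p)
        n = np.arange(p, len(s))
        predicted = np.array([np.dot(a, s[i - p:i][::-1]) for i in n])
        errors[p] = np.sqrt(np.mean((s[n] - predicted) ** 2))
    return errors
```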


FIG. 7. Comparison of wide-band sound spectrograms for synthetic and original speech signals for the utterance "It's time we rounded up that herd of Asian cattle," spoken by a male speaker: (a) synthetic speech, and (b) original speech.

V. APPLICATIONS

A. Digital Storage and Transmission of Speech

Methods for encoding speech at data rates considerably smaller than those needed for PCM encoding are important in many practical applications. For example, automatic answerback services can be practical if a sufficiently large vocabulary of words and phrases can be stored economically in a digital computer. Efficient speech coding methods can reduce, by a factor of 30 or more, the space needed for storing the vocabulary. We discuss in this section several procedures for efficient coding of the synthesizer control information.

The synthesizer control information includes 15 parameters for every analysis interval, i.e., the twelve predictor coefficients, the pitch period, the voiced-unvoiced parameter, and the rms value. The methods for proper encoding of this information, except the predictor coefficients, are relatively well understood. On the other hand, the procedure for encoding the predictor coefficients must include provision for ensuring the stability of the linear filter in the synthesizer. In general, to ensure stability, relatively high accuracy (about 8-10 bits per coefficient) is required if the predictor coefficients are quantized directly. Moreover, the predictor coefficients are samples of the inverse Fourier transform of the reciprocal of the transfer function. The reciprocal of the transfer function has zeros precisely where the transfer function has poles. Therefore, small errors in the predictor coefficients often can result in large errors in the poles. The direct quantization of the predictor coefficients is thus not efficient. One suitable method is to convert the 12 predictor coefficients to another equivalent set of parameters which possess well-defined constraints for achieving the desired stability. For example, the poles of the linear filter can be computed from the predictor coefficients. For stability of the filter, it is sufficient that the poles be inside the unit circle. The stability is therefore easily ensured by quantizing the frequencies and the bandwidths of the poles. The poles of the transfer function are by definition the


FIG. 8. Spectral envelope for the vowel /i/ in "we," spoken by a male speaker (F0 = 120 Hz).


roots of the polynomial equation

Σ_{k=1}^{p} a_k z^{-k} = 1,   (18)

where, as before, the a_k are the predictor coefficients. Table I shows the precision with which each of the parameters is quantized. It was found that the frequencies and the bandwidths of the poles can be quantized within 60 bits without producing any perceptible effect on the synthesized speech. Adding this value to the bits needed for the pitch (6 bits), the rms value (5 bits), and the voiced-unvoiced parameter (1 bit), one arrives at a value of 72 bits (60+6+5+1) for each frame of analyzed data. The data rate in bits/sec is obtained by multiplying the number of bits used to encode each frame of data by the number of frames of data stored or transmitted per second. Thus, a bit rate of 7200 bits/sec is achieved if the parameters are sampled at a rate of 100/sec. The bit rate is lowered to 2400 bits/sec at a sampling rate of 33/sec.

At this point, the reader can listen to recorded examples of synthesized speech encoded at three different data rates, namely, 7200, 4800, and 2400 bits/sec, respectively, in the third section of the enclosed record.

The quantizing of the frequencies and the bandwidths of the poles is not the only method of encoding the predictor coefficients. For example, it can be shown (see Appendix F) that a transfer function with p poles is always realizable as the transfer function of an acoustic tube consisting of p cylindrical sections of equal length

TABLE I. Quantization of synthesizer control information.

Parameter                                   Number of levels    Bits
Pitch                                       64                  6
V/UV                                        2                   1
rms                                         32                  5
Frequencies and bandwidths of the poles                         60

Total                                                           72

with the last section terminated by a unit acoustic resistance. Moreover, the poles are always inside the unit circle if the cross-sectional area of each cylindrical section is positive. Thus, the stability of the synthesizer filter is easily achieved by quantizing the areas of the sections or any other suitable function of the areas.

No significant difference in speech quality was observed for the different quantizing methods outlined above at various bit rates above 2400 bits/sec. It is quite possible that at very low bit rates these different methods of coding may show appreciable differences. An example of speech synthesized using area quantization is presented in the fourth section of the record.

The data rates discussed in this paper are suitable for speech-transmission applications where large buffer storage is to be avoided. The efficiency of speech coding naturally can vary considerably from one application to another. For example, it has been assumed so far that the speech signal is analyzed at uniform time intervals. However, it may be more efficient to vary the analysis interval so that it is short during fast articulatory transitions and long during steady-state segments. Furthermore, in applications such as disk storage of voice messages, additional savings can be realized by choosing the quantization levels for each parameter around its mean value determined in advance over short time intervals. The mean value itself can be quantized separately.

B. Separation of Spectral Envelope and Fine Structure

It is often desirable to separate the envelope of the speech spectrum from its fine structure. The representation of the speech signal shown in Fig. 1 is very suitable for achieving this decomposition. In this representation, the fine structure of the spectrum is contributed by the source while the envelope is contributed by the linear filter. Thus, the two are easily separated. The spectral envelope is the power spectrum of the impulse response of the linear filter. In mathematical notation, the relationship between the spectral envelope G(f) at the frequency f and the predictor coefficients is expressed by

G(f) = 1 / |1 - Σ_{k=1}^{p} a_k e^{-2πjkf/f_s}|^2,   (19)


FIG. 9. Spectral envelope for the vowel /i/ in "we," spoken by a female speaker (F0 = 200 Hz).

where the a_k, as before, are the predictor coefficients and f_s is the sampling frequency. Two examples of the spectral envelope obtained in the above manner for the vowel /i/ belonging to the word "we" in the utterance "May we all learn a yellow lion roar," spoken by a male and a female speaker, are illustrated in Figs. 8 and 9, respectively. We would like to add here that a spectral section obtained on a sound spectrograph failed to separate the third formant from the second formant for the female speaker both in the wide-band and the narrow-band analysis. The spectral section showed one broad peak for the two formants. On the other hand, the spectral envelope of Fig. 9 shows the two formants without any ambiguity. Of course, it is difficult to evaluate the accuracy of this method from results based on real speech alone. Results with synthetic speech, where the spectral envelope is known precisely, indicate that the spectral envelope is accurately determined over a wide range of pitch values (from 50 to 300 Hz).

It also follows from Eq. 19 that, although the Fourier transform of G(f) is not time limited, the Fourier transform of 1/G(f) is time limited to 2p/f_s sec. Thus, spectral samples of G(f), spaced f_s/2p Hz apart, are sufficient for reconstruction of the spectral envelope. For p = 12 and f_s = 10 kHz, this means that a spacing of roughly 400 Hz between spectral samples is adequate.

In some applications, it may be desired to compute the Fourier transform of G(f), namely, the autocorrelation function. The autocorrelation function can be determined directly from the predictor coefficients without computing G(f). The relationship between the predictor coefficients and the autocorrelation function is given in Eqs. 12 and 13, and a computational method for performing these operations is outlined in Appendix E.
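Equation 19 translates directly into the short sketch below (the function name and grid spacing are illustrative); plotting the result in decibels gives curves of the kind shown in Figs. 8 and 9.

```python
import numpy as np

def spectral_envelope(a, fs=10000.0, n_points=500):
    """Spectral envelope G(f) of Eq. 19 on a uniform grid from 0 to fs/2, in linear power units."""
    a = np.asarray(a, dtype=float)
    k = np.arange(1, len(a) + 1)
    f = np.linspace(0.0, fs / 2.0, n_points)
    A = 1.0 - np.exp(-2j * np.pi * np.outer(f, k) / fs) @ a   # A(f) = 1 - sum_k a_k e^{-2*pi*j*k*f/fs}
    return f, 1.0 / np.abs(A) ** 2                            # G(f) = 1/|A(f)|^2
```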

C. Formant Analysis

The objective of formant analysis is to determine the complex natural frequencies of the vocal tract as they change during speech production. If the vocal-tract configuration were known, these natural frequencies could be computed. However, the speech signal is influenced both by the properties of the source and by the vocal tract. For example, if the source spectrum has a zero close to one of the natural frequencies of the vocal tract, it will be extremely difficult, if not impossible, to determine the frequency or the bandwidth of that particular formant. A side-branch element such as the nasal cavity creates a similar problem. In determining formant frequencies and bandwidths from the speech signal, one can at best hope to obtain such information as is not obscured or lost owing to the influence of the source.

Present methods of formant analysis usually start by transforming the speech signal into a short-time Fourier spectrum, and consequently suffer from many additional problems which are inherent in short-time Fourier transform techniques. Such problems, of course, can be completely avoided by determining the formant frequencies and bandwidths directly from the speech wave.

In the representation of the speech wave shown in Fig. 1, the linear filter represents the combined contributions of the vocal tract and the source to the spectral envelope. Thus, the poles of the transfer function of the filter include the poles of the vocal tract as well as of the source. So far, we have made no attempt to separate these two contributions. For formant analysis, however, it is necessary that the poles of the vocal tract be separated out from the transfer function. In general, it is our experience that the poles contributed by the source either fall on the real axis in the unit circle or produce a relatively small peak in the spectral envelope. The magnitude of the spectral peak produced by a pole can easily be computed and compared with a threshold to determine whether a pole of the transfer function is indeed a natural frequency of the vocal tract. This is accomplished as follows:

From Eq. 4, the poles of the transfer function are the roots of the polynomial equation

Σ_{k=1}^{p} a_k z^{-k} = 1.   (20)

Let there be n complex conjugate pairs of roots z_1, z_1*; z_2, z_2*; ...; z_n, z_n*. The transfer function due to these


FIG. 10. Formant frequencies for the utterance "We were away a year ago," spoken by a male speaker (F0 = 120 Hz). (a) Wide-band sound spectrogram for the above utterance, and (b) formants determined by the computer program.


roots is given by

T(z) = Π_{i=1}^{n} (1 - z_i)(1 - z_i*) / [(1 - z_i z^{-1})(1 - z_i* z^{-1})],   (21)

where the additional factors in the numerator set the transfer function at dc (z = 1) equal to 1. The spectral peak produced by the kth complex conjugate pole pair is given by

A_k = |(1 - z_k)(1 - z_k*)| / |(z̄_k - z_k)(z̄_k - z_k*)|,   (22)

where z̄_k = exp(2πjf_kT) is the point on the unit circle at the frequency f_k of the pole, z_k = |z_k| exp(2πjf_kT), and T is the sampling interval. The threshold value of A_k was set equal to 1.7. Finally, the formant frequency F and the (two-sided) bandwidth B are related to the z-plane root z_k by

F = (1/2πT) Im(ln z_k),   (23)

and

B = -(1/πT) Re(ln z_k).   (24)

Examples of the formant frequencies determined according to the above procedure are illustrated in Figs. 10-12. Each figure consists of (a) a wide-band sound spectrogram of the utterance, and (b) formant data as determined by the above method. The results are presented for three different utterances. The first utterance, "We were away a year ago," was spoken by a male speaker (average fundamental frequency F0 = 120 Hz).


FIG. 11. Formant frequencies for the utterance "May we all learn a yellow lion roar," spoken by a female speaker (F0 = 200 Hz). (a) Wide-band sound spectrogram for the above utterance, and (b) formants determined by the computer program.


The second utterance, "May we all learn a yellow lion roar," was spoken by a female speaker (F0 = 200 Hz). The third utterance, "Why do I owe you a letter?" was spoken by a male speaker (F0 = 125 Hz). Each point in these plots represents the results from a single frame of the speech signal, which was equal to a pitch period in Figs. 10 and 11 and equal to 10 msec in Fig. 12. No

TABLE II. Factor by which each parameter was scaled for simulating a female voice from parameters derived from a male voice.

Parameter                    Scaling factor
Pitch period T               0.58
Formant frequencies F_i      1.14
Formant bandwidths B_i       2 - F_i/5000


smoothing of the formant data over adjacent frames was done.

Again, in order to obtain a better estimate of the accuracy of this method of formant analysis, speech was synthesized with a known formant structure. The correspondence between the actual formant frequencies and bandwidths and the computed ones was found to be extremely close.
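A sketch of the formant procedure of Eqs. 20-24 follows, assuming the predictor coefficients of a frame are given; the poles are found with a numerical root finder, the peak test uses the threshold of 1.7 quoted above, and the function name is illustrative.

```python
import numpy as np

def formants(a, fs=10000.0, peak_threshold=1.7):
    """Formant frequencies and (two-sided) bandwidths from the roots of Eq. 20."""
    a = np.asarray(a, dtype=float)
    T = 1.0 / fs
    roots = np.roots(np.concatenate(([1.0], -a)))              # poles of the transfer function
    out = []
    for zk in roots[np.imag(roots) > 0]:                        # one member of each conjugate pair
        zbar = np.exp(1j * np.angle(zk))                         # unit-circle point at the pole frequency
        A_k = abs((1 - zk) * (1 - np.conj(zk))) / abs((zbar - zk) * (zbar - np.conj(zk)))  # Eq. 22
        if A_k < peak_threshold:
            continue                                             # small peak: attributed to the source
        F = np.angle(zk) / (2 * np.pi * T)                       # Eq. 23
        B = -np.log(np.abs(zk)) / (np.pi * T)                    # Eq. 24
        out.append((F, B))
    return sorted(out)
```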

D. Re-forming the Speech Signals

The ability to modify the acoustical characteristics of a speech signal without degrading its quality is important for a wide variety of applications. For example, information regarding the relative importance of various acoustic variables in speech perception can be obtained by listening to speech in which some particular


FIG. 12. Formant frequencies for the utterance "Why do I owe you a letter?" spoken by a male speaker (F0 = 125 Hz). (a) Wide-band sound spectrogram for the above utterance, and (b) formants determined by the computer program.


acoustic variables have been altered in a controlled manner. The speech analysis and synthesis techniques described in this paper can be used as a flexible and convenient method for conducting such speech-perception experiments. We would like to point out here that the synthesis procedure allows independent control of such speech characteristics as spectral envelope, relative durations, pitch, and intensity. Thus, the speaking rate of a given speech signal may be altered, e.g., for producing fast speech for blind persons or for producing slow speech for learning foreign languages. Or, in an application such as the recovery of "helium speech," the frequencies of the spectral envelope can be scaled, leaving the fundamental frequency unchanged. Moreover, in synthesizing sentence-length utterances from stored data about individual words, the method can be used to reshape the intonation and stress contours so that the speech sounds natural.

Examples of speech in which selected acoustical characteristics have been altered are presented in the fifth section of the enclosed record. First, the listener can hear the utterance at the normal speaking rate. Next, the speaking rate is increased by a factor of 1.5. As the third item, the same utterance with the speaking rate reduced by a factor of 1.5 is presented. Finally, an example of a speech signal in which the pitch, the formant frequencies, and their bandwidths were changed from their original values, obtained from a male voice, to


TABLE III. Computation times needed to perform various operations discussed in the paper on the GE 635 (p = 10, f_s = 10 kHz).

Operation                                                               Computation time
Predictor coefficients from speech samples (No. of samples = 100)       75 msec/frame
Spectral envelope (500 spectral samples) from predictor coefficients    250 msec/frame
Formant frequencies and bandwidths from predictor coefficients          60 msec/frame
p samples of autocorrelation function from predictor coefficients       10 msec/frame
Speech from predictor coefficients                                       8 times real time
Pitch analysis                                                           10 times real time

simulate a "female" voice is presented. The factor by which each parameter was changed from its original value is shown in Table II.

VI. COMPUTATIONAL EFFICIENCY

The computation times needed to perform several of the operations described in this paper are summarized in Table III. The programs were run on a GE 635 computer having a cycle time of 1 μsec. As can be seen, this method of speech analysis and synthesis is computationally efficient. In fact, the techniques are about five to 10 times faster than the ones needed to perform equivalent operations by fast-Fourier-transform methods. For instance, both the formant frequencies and their bandwidths are determined in 135 msec for each frame of the speech wave 10 msec long. Assuming that the formants are analyzed once every 10 msec, the program will run in about 13 times real time; by comparison, fast-Fourier-transform techniques need about 100 times real time. Even for computing the spectral envelope, the method based on predictor coefficients is at least three times faster than the fast-Fourier-transform methods. The complete analysis and synthesis procedure was found to run in approximately 25 times real time. Real-time operation could easily be achieved by using special hardware to perform some of the functions.

VII. CONCLUSIONS

We have presented a method for automatic analysis and synthesis of speech signals by representing them in terms of time-varying parameters related to the transfer function of the vocal tract and the characteristics of the excitation. An important property of the speech wave, namely, its linear predictability, forms the basis of both the analysis and synthesis procedures. Unlike past speech analysis methods based on Fourier analysis, the method described here derives the speech parameters from a direct analysis of the speech wave. Consequently, various problems encountered when Fourier analysis is applied to nonstationary and quasiperiodic signals like speech are avoided. One of the main advantages of this method is that the analysis procedure requires only a short segment of the speech wave to yield accurate results. This method is therefore very suitable for following rapidly changing speech events. It is also suitable for analyzing the speech of speakers with high-pitched voices, such as women or children. As an additional advantage, the analyzed parameters are rigorously related to other well-known speech characteristics. Thus, by first representing the speech signal in terms of the predictor coefficients, other speech characteristics can be determined as desired without much additional computation.

The speech signal is synthesized by a single recursive filter. The synthesizer, thus, does not require any information about the individual formants, and the formants need not be determined explicitly during analysis. Moreover, the synthesizer makes use of the formant bandwidths of real speech, in contrast to formant synthesizers, which use fixed bandwidths for each formant. Informal listening tests show very little or no perceptible degradation in the quality of the synthesized speech. These results suggest that the analyzed parameters retain all the perceptually important features of the speech signal. Furthermore, the various parameters used for the synthesis can be encoded efficiently. It was found possible to reduce the data rate to approximately 2400 bits/sec without producing significant degradation in the speech quality. The above bit rate is smaller by a factor of about 30 than that for direct PCM encoding of the speech waveform. The latter bit rate is approximately 70 000 bits/sec (70 000 bits/sec = 7 bits/sample × 10 000 samples/sec).

In addition to providing an efficient and accurate description of the speech signal, the method is computationally very fast. The entire analysis and synthesis procedure runs at about 25 times real time on a GE 635 digital computer. The method is thus well suited for analyzing large amounts of speech data automatically on the computer.

APPENDIX A: DESCRIPTION OF ENCLOSED RECORDED MATERIAL

Side 1

Section 1. Speech analysis and synthesis for various values of p, the number of predictor coefficients:
(a) p = 2,
(b) p = 6,
(c) p = 10,
(d) p = 14,
(e) p = 18,
(f) original speech.

Section 2. Comparison of synthesized speech with the original, p = 12. Synthetic--original. Five utterances.


Section 3. Synthesized speech encoded at different bit rates, the parameters quantized as shown in Table I, p = 12. Original--unquantized--7200 bits/sec--4800 bits/sec--2400 bits/sec. Three utterances.

Section 4. Synthesized speech obtained by quantizing the areas of an acoustic tube, p = 12. Bit rate = 7200 bits/sec.
(1) Frequencies and bandwidths quantized into 60-bit frames.
(2) Areas quantized into 60-bit frames.
The rest of the parameters are quantized as shown in Table I.

Section 5. Fast and slow speech, p = 14:
(a) Original speech.
(b) Speaking rate = 1.5 times the original.
(c) Speaking rate = 0.67 times the original.

Section 6. Manipulation of pitch, formant frequencies, and their bandwidths, p = 10:
(a) Pitch, formant frequencies, and bandwidths altered as shown in Table II.
(b) Original voice.

APPENDIX B: RELATIONSHIP BETWEEN THE LENGTH OF THE VOCAL TRACT AND THE NUMBER OF PREDICTOR COEFFICIENTS

Below about 5000 Hz, the acoustic properties of the vocal tract can be determined by considering it as an acoustic tube of variable cross-sectional area. The relationship between the sound pressure P_g and volume velocity U_g at the glottis and the corresponding quantities P_l, U_l at the lips is best described in terms of the ABCD matrix parameters (chain matrix) of the acoustic tube. These parameters are defined by the matrix equation (see Fig. B-1):

| P_g |     | A   B | | P_l |
| U_g |  =  | C   D | | U_l | .   (B1)

We now prove that the inverse Fourier transforms of these parameters in the time domain have finite duration τ = 2l/c, where l is the length of the tube and c is the velocity of sound. Let S(x) be the area function of the vocal tract, where x is the distance from the glottis to the point at which the cross-sectional area is specified. Consider a small element of tube of length dx at a distance x from the glottis. The ABCD matrix parameters of the tube element dx are given by

A = D = cosh(Γ dx) = (e^{Γ dx} + e^{-Γ dx})/2,
B = -Z_0 sinh(Γ dx) = -Z_0 (e^{Γ dx} - e^{-Γ dx})/2,   (B2)
C = -sinh(Γ dx)/Z_0 = -(e^{Γ dx} - e^{-Γ dx})/(2Z_0),

where Z_0 is the characteristic impedance of the tube element dx, Z_0 = ρc/S(x), Γ is the propagation constant, Γ = jω/c, ρ is the density of air, c is the velocity of sound, and ω is the angular frequency in radians. The ABCD matrix of the complete tube is given by the product of the ABCD matrices of the individual tube elements of length dx spaced dx apart along the length of the tube. Let l = n·dx. It is now easily verified that each of the ABCD parameters of the tube can be expressed as a power series in e^{Γ dx} of the form Σ_k c_k e^{kΓ dx}.

The ABCD parameters are thus Fourier transforms of functions of time, each with duration τ = 2n·dx/c. Taking the limit as dx → 0, n → ∞, with n·dx = l, we obtain τ = 2l/c.

From Eq. B1, the relationship between the glottal and the lip volume velocities is expressed in terms of the ABCD parameters by

U_g = C P_l + D U_l.   (B3)

Since P_l = jωK U_l, K being a constant related to the mouth area, Eq. B3 is rewritten as

U_g = (jωKC + D) U_l.   (B4)

FIG. B-1. Nonuniform acoustic tube.



The memory of the linear predictor (see Fig. 1) is by definition equal to the duration of the inverse Fourier transform of the reciprocal of the transfer function between the lip and the glottal volume velocities. Therefore, from Eq. B4, the memory of the linear predictor is equal to τ = 2l/c.
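The relation τ = 2l/c of this appendix fixes the predictor order, as in the small sketch below (the parameter names and default figures follow Sec. I; the extra two poles account for the glottal flow and the radiation). This is an illustration, not code from the paper.

```python
def predictor_order(tract_length_m=0.17, fs=10000.0, c=340.0, extra_poles=2):
    """Estimate p from the predictor memory 2*l/c (Appendix B) plus source/radiation poles (Sec. I)."""
    memory_seconds = 2.0 * tract_length_m / c      # round-trip travel time glottis -> lips
    return round(memory_seconds * fs) + extra_poles

print(predictor_order())                           # 12 for a 17-cm tract sampled at 10 kHz
```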

APPENDIX C: DETERMINATION OF THE PREDICTOR PARAMETERS FROM THE COVARIANCE MATRIX

Equation 8 can be written in matrix notation as

Φ a = ψ,   (C1)

where Φ = [φ_{ij}] is a positive definite (or positive semidefinite) symmetric matrix, and a = [a_i] and ψ = [φ_{i0}] are column vectors. Since Φ is positive definite (or semidefinite) and symmetric, it can be expressed as the product of a triangular matrix V with real elements and its transpose V^t. Thus,

Φ = V V^t.   (C2)

Equation C1 can now be resolved into two simpler equations:

V x = ψ,   (C3)
V^t a = x.   (C4)

Since V is a triangular matrix, Eqs. C3 and C4 can be solved recursively.

Equations C3 and C4 provide a simple method of computing the minimum value of the prediction error ⟨E_n^2⟩_av as a function of p, the number of predictor coefficients. It is easily verified from Eqs. 7-9 that the minimum value of the mean-squared prediction error is given by

ε_p = φ_{00} - a^t ψ.   (C5)

On substituting for a from Eq. C4 into Eq. C5, we obtain

ε_p = φ_{00} - x^t V^{-1} ψ,

which on substitution from Eq. C3 for ψ yields

ε_p = φ_{00} - x^t x.   (C6)

Thus, the minimum value of the mean-squared prediction error is given as

ε_p = φ_{00} - Σ_{k=1}^{p} x_k^2.   (C7)

The advantage of using Eq. C7 lies in the fact that a single computation of the vector x for one value of p is sufficient. After the vector x is determined for the largest value of p at which the error is desired, ε_p is calculated for smaller values of p from Eq. C7.
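A compact sketch of the computation described in this appendix, assuming the covariance matrix φ of Eq. 9 is available (e.g., as built in the Sec. II.A sketch): one Cholesky factorization yields the triangular systems of Eqs. C3-C4 and, through Eq. C7, the prediction error for every order up to p_max.

```python
import numpy as np

def prediction_error_vs_order(phi, p_max):
    """Minimum mean-squared prediction error eps_p for p = 1..p_max from Eqs. C1-C7."""
    V = np.linalg.cholesky(phi[1:p_max + 1, 1:p_max + 1])      # phi = V V^t  (Eq. C2)
    x = np.linalg.solve(V, phi[1:p_max + 1, 0])                # V x = psi    (Eq. C3)
    # Eq. C7: eps_p = phi_00 - sum_{k<=p} x_k^2; the single vector x serves every order p
    return {p: phi[0, 0] - float(np.sum(x[:p] ** 2)) for p in range(1, p_max + 1)}
```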

APPENDIX D: CORRECTION OF THE PREDICTOR COEFFICIENTS

Let us denote by f(z) a polynomial defined by

f(z) = z^p - a_1 z^{p-1} - ... - a_p,   (D1)

where the polynomial coefficients a_k are the predictor coefficients of Eq. 3. Associated with the polynomial f(z), we define a reciprocal polynomial f*(z) by

f*(z) = -a_p z^p - a_{p-1} z^{p-1} - ... - a_1 z + 1.   (D2)

Let us construct the sequence of polynomials f_{p-1}(z), f_{p-2}(z), ..., f_n(z), ..., f_1(z), where f_n(z) is a polynomial of degree n, according to the formula

f_n(z) = k_{n+1} f*_{n+1}(z) - l_{n+1} f_{n+1}(z),   (D3)

where f_p(z) = f(z), k_n is the coefficient of z^n in f_n(z), and l_n is the constant term in f_n(z). It can then be shown that the polynomial f(z) has all its zeros inside the unit circle if and only if |l_n| > |k_n| for each n.
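Because the correction procedure itself is cut off in this copy of Appendix D, the sketch below substitutes a common alternative rather than the paper's recursion: it locates the poles numerically and reflects any pole lying outside the unit circle back inside before rebuilding the coefficients. Names and the reflection rule are assumptions of this sketch.

```python
import numpy as np

def correct_coefficients(a):
    """Stability check and correction for the predictor coefficients (substitute for Appendix D)."""
    a = np.asarray(a, dtype=float)
    poles = np.roots(np.concatenate(([1.0], -a)))              # zeros of f(z) in Eq. D1
    outside = np.abs(poles) > 1.0
    if not outside.any():
        return a, True                                          # filter already stable
    poles[outside] = 1.0 / np.conj(poles[outside])              # reflect offending poles inside the circle
    corrected = -np.real(np.poly(poles))[1:]                    # back to predictor coefficients
    return corrected, False
```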


Equation E1 is identical to Eq. 12 for n = p. In matrix form the above equation becomes

| r_0       r_1       ...   r_{n-1} |  | a_1^{(n)} |     | r_1 |
| r_1       r_0       ...   r_{n-2} |  | a_2^{(n)} |     | r_2 |
|  .         .                .     |  |     .     |  =  |  .  |   (E2)
| r_{n-1}   r_{n-2}   ...   r_0     |  | a_n^{(n)} |     | r_n |

Let R_n be the n × n matrix on the left side of Eq. E2, a^{(n)} the vector of coefficients, and r_n an n-dimensional vector whose kth component is r_k. Let us define, for every vector, a reciprocal vector by the relationship

[a^{(n)*}]_k = [a^{(n)}]_{n-k+1}.   (E3)

Equation E2 can now be rewritten as

    a._ t") =R._ffr._*--a.(")R._ctr._. (E6)It is easilyverified rom Eq. E2 that

    R._fr._ * =a._ (-). (E7)InsertingEq. E7 into Eq. E6 gives

    a._ (")=a._ ("-) -- a?)[a._("-o ] *. (E8)Next, we multiply Eq. E8 throughby r,-d and insertthe result n Eq. E5. After rearrangement f the terms,we find that

    n--I n--1a?)[r0-E ra("-)=r. -- E r.-a (-). (E9)klEquationsE8 and E9 provide a complete ecursiremlutionof Eq. El. We start with n = 1. The solutionsobviously r/ro. (El0)Next, a () and a_ ) are computed or successivelyincreasing alues of n until n=p. Furthermore, f Ris nonsingular,he expressionnside he bracketson theleft sideof Eq. E9 is alwayspositive.Therefore,a () isalways inite.To determine the autocorrelation function from thepredictorcoefficients, e proceedas follows:From Eq.E8, [a_{* = Ea_t-"]*-a?a_(-< (Ell)Therefore, fter eliminatinga_(-)] * from Eqs. E8and Ell, one obtains

    Starting with n=p, we computea? ) for successivelysmaller values of n until n=l. The autocorrelationfunctionat the nth sampling nstant s given rom Eq.E1 by r,,= a(")r,_k,or l_


FIG. F-1. A nonuniform acoustic tube formed by cascading uniform cylindrical sections.

form notation as

| p̂_{n+1}(z) |                 | p̂_n(z) |
| û_{n+1}(z) |  =  Q_{n+1}(z)  | û_n(z) | ,   (F6)

where p̂_n(z) and û_n(z) are the z transforms of p_n(t) and u_n(t), respectively, with z = exp(j2πf·2Δ/c), Δ being the length of one cylindrical section. Similarly as in Eq. F6, we have the inverse relationship (Eq. F7), which can be written in matrix notation as Eq. F8. Moreover, from Eq. F8,

| p̂_1(z) |            | p̂_{N+1}(z) |
| û_1(z) |  =  W_N(z)  | û_{N+1}(z) | .   (F10)

Let

W_n(z) = | w_11^{(n)}(z)   w_12^{(n)}(z) |  =  Π_{k=1}^{n} Q_k(z).   (F11)
         | w_21^{(n)}(z)   w_22^{(n)}(z) |

The matrix W_n(z) satisfies the equation

W_{n+1}(z) = W_n(z) Q_{n+1}(z).   (F12)

It can be verified from Eq. F11 that

J [W_n(z^{-1})]^{-1} J = W_n(z),   (F13)

where J is a constant matrix. Equation F13 implies relations (Eqs. F14 and F15) between the elements of W_n(z) and those of W_n(z^{-1}).


Let us assume that the tube is terminated in a unit acoustic resistance. We then have the terminal boundary condition

p_{N+1}(t) + u_{N+1}(t) = p_{N+1}(t) - u_{N+1}(t),   (F16)

from which it follows that u_{N+1}(t) = 0. The volume velocity at the input of the tube is given by p_1(t) - u_1(t). Let

C_T(z) = (volume velocity at the input) / (volume velocity at the output)
       = [p̂_1(z) - û_1(z)] / p̂_{N+1}(z).   (F17)

It can now be easily verified from Eq. F10 that

C_T(z) = w_11^{(N)}(z) - w_21^{(N)}(z).   (F18)

Let us define, for each n between 1 and N,

C_n(z) = w_11^{(n)}(z) - w_21^{(n)}(z).   (F19)

On multiplying Eq. F12 by the vector [1  -1], and substituting for W_n(z) from Eq. F15, we find that

[C_{n+1}(z)   -C_{n+1}(z^{-1})] = [C_n(z)   -C_n(z^{-1})] · (1/(1 - r_{n+1})) |  1                -r_{n+1} |
                                                                              |  r_{n+1} z^{-1}     z^{-1} | .   (F20)

Hence,

C_{n+1}(z) = [C_n(z) - r_{n+1} C_n(z^{-1}) z^{-1}] / (1 - r_{n+1}).   (F21)

Except for a factor z^{n/2}, each C_n(z) is a polynomial of degree n. Thus, the transfer function, which is the reciprocal of C_N(z), consists of a factor z^{N/2} divided by a polynomial of degree N. The factor z^{N/2} represents, of course, the transmission delay in the tube. The transfer function has N poles, which are the zeros of C_N(z). Furthermore, the poles are inside the unit circle, pro-


vided that r_n satisfies the condition

|r_n| < 1,   1 ≤ n ≤ N.