arXiv:1107.0076v1 [stat.AP] 30 Jun 2011

KARMA: Kalman-based autoregressive moving average modeling and inference for formant and antiformant tracking a)

Daryush D. Mehta, b) Daniel Rudoy, and Patrick J. Wolfe c)

School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138

(Dated: October 29, 2018)

Vocal tract resonance characteristics in acoustic speech signals are classically tracked using frame-by-frame point estimates of formant frequencies followed by candidate selection and smoothing using dynamic programming methods that minimize ad hoc cost functions. The goal of the current work is to provide both point estimates and associated uncertainties of center frequencies and bandwidths in a statistically principled state-space framework. Extended Kalman (K) algorithms take advantage of a linearized mapping to infer formant and antiformant parameters from frame-based estimates of autoregressive moving average (ARMA) cepstral coefficients. Error analysis of KARMA, WaveSurfer, and Praat is accomplished in the all-pole case using a manually marked formant database and synthesized speech waveforms. KARMA formant tracks exhibit lower overall root-mean-square error relative to the two benchmark algorithms, with third formant tracking more challenging. Antiformant tracking performance of KARMA is illustrated using synthesized and spoken nasal phonemes. The simultaneous tracking of uncertainty levels enables practitioners to recognize time-varying confidence in parameters of interest and adjust algorithmic settings accordingly.

PACS numbers: 43.72.Ar, 43.70.Bk, 43.60.Cg, 43.60.Uv

I. INTRODUCTION

Speech formant tracking has received continued attention over the past sixty years to better characterize formant motion during vowels as well as vowel-consonant boundaries. The de facto approach to resonance estimation involves waveform segmentation and the assumption of an all-pole model characterized by second-order digital resonators (Schafer and Rabiner, 1970). The center frequency and bandwidth of each resonator are then estimated through picking peaks in the all-pole spectrum or finding roots of the prediction polynomial. Tracking these estimates across frames is typically accomplished via dynamic programming methods that minimize cost functions to produce smoothly-varying trajectories.

This general formant-tracking algorithm is implemented in WaveSurfer (Sjölander and Beskow, 2005) and Praat (Boersma and Weenink, 2009), speech analysis tools that enjoy widespread use in the speech recognition, clinical, and linguistic communities. There are, however, numerous shortcomings to this classical approach. For example, formant track smoothing and correction (e.g., for large frequency jumps) are performed in an ad hoc manner that precludes the ability to apply statistical

a) Portions of this work were presented at the INTERSPEECH conference in Antwerp, Belgium, in August 2007 (Rudoy et al., 2007).
b) Electronic address: [email protected]; also with the Center for Laryngeal Surgery and Voice Rehabilitation, Massachusetts General Hospital, Boston, Massachusetts 02114.
c) Also with the Department of Statistics, Harvard University, and the Speech and Hearing Bioscience and Technology Program, Harvard-MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139.

analysis to obtain confidence intervals around the estimated tracks.

Initial development of the formant tracking approach described here has been reported by Rudoy et al. (2007) using a manually marked formant database for error analysis (Deng et al., 2006b). The current work continues this line of research and offers two main contributions. The first provides improvements to the Kalman-based autoregressive approach of Deng et al. (2007) and extensions to enable antiformant frequency and bandwidth tracking in a Kalman-based autoregressive moving average (KARMA) framework. The second empirically determines the performance of the KARMA approach through visual and quantitative error analysis and compares this performance with that of WaveSurfer and Praat.

A. Classical formant tracking algorithms

Linear predictive coding (LPC) models have been shown to efficiently encode source/filter characteristics of the acoustic speech signal (Atal and Hanauer, 1971). To extract frame-by-frame formant parameters, the poles of the LPC spectrum can be computed as the roots of the prediction polynomial, peaks in the LPC spectrum, or peaks in the second derivative of the frequency spectrum (Christensen et al., 1976). The first complete formant tracker over multiple continuous speech frames incorporated spectral peak-picking, selection of formants from the candidate peaks using continuity constraints, and voicing detection to handle silent and unvoiced speech segments (McCandless, 1974). Extensions to LPC analysis incorporated autoregressive moving average (ARMA) models that added estimates of candidate zeros associated with anti-resonances during consonantal and nasal speech sounds (Steiglitz, 1977; Atal and Schroeder, 1978).

Kalman-based formant and antiformant tracking 1


FIG. 1. Illustration of the (A) classical and (B) proposed approaches to formant tracking. Key advantages to the proposed KARMA approach include intra-frame observation of autoregressive moving average (ARMA) parameters for both formant and antiformant tracking, inter-frame tracking using linearized Kalman (K) inference, and the availability of both point estimates and uncertainties for each trajectory.

Fig. 1A illustrates the classical tracking process for the all-pole case. Following pre-processing steps, LPC spectral coefficients yield intra-frame point estimates of candidate frequency and bandwidth parameters via root finding or peak-picking. Intra-frame parameter estimation can be accomplished using a number of methods (Atal and Hanauer, 1971; Atal and Schroeder, 1978; Broad and Clermont, 1989; Yegnanarayana, 1978), and inter-frame parameter selection and smoothing can be performed by minimizing various cost functions in a dynamic programming environment (Sjölander and Beskow, 2005; Boersma and Weenink, 2009). Note that the required root-finding (or peak-picking) procedure cannot be written in closed form. Consequently, statistical analysis (distributions, bias, variance) of the resultant formant and bandwidth estimates is challenging. Alternative spectrographic representations primarily apply to sustained vowels and require significant manual interaction (Fulop, 2010).

B. Statistical formant tracking algorithms

Probabilistic and statistical models for tracking formants have gained widespread use in the past 15 years with motivation from automatic speech recognition applications. The first such probabilistic model was introduced by Kopec (1986), in which a hidden Markov model was used to constrain the evolution of vector-quantized sets of formant frequencies and bandwidths. Similarly, a state-space dynamical model can appropriately constrain the evolution of formant parameters, where observations of the acoustic speech waveform are linked through nonlinear relationships to “hidden” states (formant parameters) that evolve over time. Inference of the state values can be performed by variants of Kalman filter algorithms (Kalman, 1960).

In these algorithms, ad hoc assignment of poles and zeros to appropriate formant indices is precluded by the inherent association of spectral/cepstral coefficients to formant and antiformant frequencies and bandwidths.

The first reported state-space approach to formant tracking inferred formant frequencies and bandwidths directly from LPC spectral coefficients (Rigoll, 1986). An extension to this LPC approach was made by Toyoshima et al. (1991) to build a tracker that inferred frequencies and bandwidths of both formants and antiformants from time-varying ARMA spectral coefficients (Miyanaga et al., 1986). More recent state-space models define the observations as coefficients in the LPC cepstral domain (Zheng and Hasegawa-Johnson, 2004; Deng et al., 2007), providing statistical methods to support or refute empirical relations obtained between low-order cepstral coefficients and formant frequencies (Broad and Clermont, 1989).

The proposed KARMA (Kalman-based autoregressive moving average) approach explores the performance of such a state-space model with ARMA cepstral coefficients as observations to track formant and antiformant parameters. Taking advantage of a linearized mapping between frequency and bandwidth values and cepstral coefficients, KARMA applies Kalman inference to yield point estimates and uncertainties for the output trajectories.

II. METHODS

Fig. 1B illustrates the proposed statistical modeling approach to formant and antiformant tracking. This approach affords several advantages over classical approaches: (1) both formant and antiformant trajectories are tracked, (2) both frequency and bandwidth estimates are propagated as distributions instead of point estimates to provide for uncertainty quantification, and (3) pole/zero assignment to formants/antiformants is made through a linearized cepstral mapping instead of candidate selection using ad hoc cost functions.


A. Step 1: Pre-processing

The sampled acoustic speech waveform s[m] is first windowed into short-time frames st[m] = s[m]wt[m] using overlapping windows wt[m] with frame index t. Each short-time frame st[m] is then pre-emphasized via

st[m] = st[m]− γst[m− 1], (1)

where γ is the pre-emphasis coefficient defining the inherent high-pass filter characteristic that is typically applied to equalize energy across the speech spectrum for improved model fitting.
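As a concrete illustration (a sketch of our own, not code from the paper), the pre-emphasis of Eq. (1) can be written in a few lines; the function and parameter names are hypothetical:

```python
import numpy as np

def pre_emphasize(frame, gamma=0.7):
    """Apply the first-difference high-pass filter of Eq. (1):
    s_t[m] <- s_t[m] - gamma * s_t[m-1]."""
    frame = np.asarray(frame, dtype=float)
    out = frame.copy()
    out[1:] -= gamma * frame[:-1]  # first sample passes through unchanged
    return out
```

With γ = 0 the frame is left untouched; values near 1 approximate a differentiator that boosts high frequencies.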

B. Step 2: Intra-frame observation generation

1. ARMA model of speech

Following windowing and pre-emphasis, the acoustic waveform st[m] is modeled as a stochastic ARMA(p, q) process:

$$ s_t[m] = \sum_{i=1}^{p} a_i s_t[m-i] + \sum_{j=1}^{q} b_j u[m-j] + u[m], \qquad (2) $$

where ai are the p AR coefficients, bj are the q MA coefficients, and u[m] is the stochastic excitation waveform. The z-domain transfer function associated with Eq. (2) is

$$ T(z) \triangleq \frac{1 + \sum_{j=1}^{q} b_j z^{-j}}{1 - \sum_{i=1}^{p} a_i z^{-i}}. \qquad (3) $$

A number of standard spectral estimation techniques can be employed in order to fit data to the ARMA(p, q) model (see Marelli and Balazs, 2010, for a recent review of ARMA estimation methods). In the current study, ARMA estimation was performed using the ‘armax’ function in MATLAB’s System Identification toolbox (The MathWorks, Natick, MA), which implements an iterative method to minimize a quadratic error prediction criterion (Ljung, 1999, Section 10.2).
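For the all-pole special case (q = 0) used in the evaluation below, a minimal sketch of the autocorrelation method of linear prediction is given here; this is our own illustration, not the paper's `armax`-based fit, and the function name is hypothetical:

```python
import numpy as np

def lpc_autocorr(frame, p):
    """Autocorrelation method of linear prediction: solve the Yule-Walker
    normal equations for the AR coefficients a_1..a_p of Eq. (2),
    all-pole case (q = 0)."""
    frame = np.asarray(frame, dtype=float)
    # Biased autocorrelation estimates r[0..p]
    r = np.array([frame[:len(frame) - k] @ frame[k:] for k in range(p + 1)])
    # Toeplitz system R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])
```

For p = 1 this reduces to the familiar lag-one correlation estimate r[1]/r[0].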

2. Generation of observations: ARMA cepstral coefficients

In the proposed approach, the ARMA spectral coefficients in Eq. (3) are transformed to the complex cepstrum before inferring formant characteristics. This mapping from ARMA spectral coefficients to ARMA cepstral coefficients has been derived in the all-pole case (e.g., Deng et al., 2006a) and can be extended to account for the presence of zeros in the spectrum. Letting Cn denote the nth cepstral coefficient,

$$ C_n = c_n - c'_n, \qquad (4) $$

where Cn depends on separate contributions from the denominator and numerator of the ARMA model through the following recursive relationships:

$$ c_n = \begin{cases} a_n & n = 1 \\ a_n + \sum_{i=1}^{n-1} \tfrac{i}{n}\, a_{n-i}\, c_i & 1 < n \le p \\ \sum_{i=n-p}^{n-1} \tfrac{i}{n}\, a_{n-i}\, c_i & p < n, \end{cases} \qquad (5a) $$

$$ c'_n = \begin{cases} b_n & n = 1 \\ b_n + \sum_{j=1}^{n-1} \tfrac{j}{n}\, b_{n-j}\, c'_j & 1 < n \le q \\ \sum_{j=n-q}^{n-1} \tfrac{j}{n}\, b_{n-j}\, c'_j & q < n. \end{cases} \qquad (5b) $$

Derivation of Eqs. (5) is given in the Appendix. The proof is derived under the minimum-phase assumption that constrains the poles and zeros of the ARMA transfer function to lie within the unit circle.
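The recursions of Eqs. (5) translate directly into code. The sketch below is our own (the function name is hypothetical), with the 1-based indices of the text shifted to Python's 0-based arrays:

```python
def arma_cepstrum(a, b, N):
    """First N cepstral coefficients C_n = c_n - c'_n of Eqs. (4)-(5),
    given AR coefficients a = [a_1..a_p] and MA coefficients b = [b_1..b_q]."""
    def recurse(coeffs):
        order = len(coeffs)
        c = [0.0] * (N + 1)  # c[0] unused so indices match the text
        for n in range(1, N + 1):
            val = coeffs[n - 1] if n <= order else 0.0
            # Summation limits of Eq. (5): only terms with n - i <= order survive
            for i in range(max(1, n - order), n):
                val += (i / n) * coeffs[n - i - 1] * c[i]
            c[n] = val
        return c
    c, cp = recurse(a), recurse(b)
    return [c[n] - cp[n] for n in range(1, N + 1)]
```

In the all-pole case the result matches the known closed form for a single real pole a, namely C_n = a^n / n.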

C. Step 3: Inter-frame parameter tracking

The proposed algorithm tracks point estimates and uncertainties for I formants and J antiformants from frame to frame. To accommodate the temporal dimension, the parameters of frame t are placed in column vector xt:

$$ x_t \triangleq \begin{pmatrix} f_1 & \cdots & f_I & b_1 & \cdots & b_I & f'_1 & \cdots & f'_J & b'_1 & \cdots & b'_J \end{pmatrix}^T, \qquad (6) $$

where (fi, bi) is the frequency/bandwidth pair of the ith formant and (f′j, b′j) is the frequency/bandwidth pair for the jth antiformant.

1. Observation model

Inference of the output parameters is facilitated by a closed-form mapping from the state vector xt to the observed cepstral coefficients Cn in Eq. (4). Extending the speech production model of Schafer and Rabiner (1970) to capture zeros, we assume that the transfer function T(z) of the ARMA model can be written as a cascade of I second-order digital resonators and J second-order digital anti-resonators:

$$ T(z) = \frac{\prod_{j=1}^{J} (1 - \beta_j z^{-1})(1 - \bar{\beta}_j z^{-1})}{\prod_{i=1}^{I} (1 - \alpha_i z^{-1})(1 - \bar{\alpha}_i z^{-1})}, \qquad (7) $$

where (αi, ᾱi) and (βj, β̄j) denote complex-conjugate pole and zero pairs, respectively. Each pole and zero is parameterized by a center frequency and 3-dB bandwidth (both in units of Hertz) using the following relations:

$$ (\alpha_i, \bar{\alpha}_i) = \exp\!\left( \frac{-\pi b_i \pm 2\pi\sqrt{-1}\, f_i}{f_s} \right), \qquad (8a) $$

$$ (\beta_j, \bar{\beta}_j) = \exp\!\left( \frac{-\pi b'_j \pm 2\pi\sqrt{-1}\, f'_j}{f_s} \right), \qquad (8b) $$

where fs is the sampling rate (in Hz).

Performing a Taylor-series expansion of log T (z) yields

$$ \log T(z) = \sum_{i=1}^{I} \sum_{n=1}^{\infty} \frac{\alpha_i^n + \bar{\alpha}_i^n}{n}\, z^{-n} - \sum_{j=1}^{J} \sum_{n=1}^{\infty} \frac{\beta_j^n + \bar{\beta}_j^n}{n}\, z^{-n}. \qquad (9) $$

Kalman-based formant and antiformant tracking 3

Page 4: arXiv:1107.0076v1 [stat.AP] 30 Jun 2011 · is the pre-emphasis coe cient de ning the in-herent high-pass lter characteristic that is typically ap-plied to equalize energy across the

Recalling that Cn is the nth cepstral coefficient, $\log T(z) = C_0 + \sum_{n=1}^{\infty} C_n z^{-n}$. Thus, equating the coefficients of powers of z−1 leads to

$$ C_n = \frac{1}{n} \sum_{i=1}^{I} (\alpha_i^n + \bar{\alpha}_i^n) - \frac{1}{n} \sum_{j=1}^{J} (\beta_j^n + \bar{\beta}_j^n). \qquad (10) $$

Finally, inserting αi and βj from Eqs. (8) into Eq. (10) yields the following observation model h(xt) that maps elements of xt to Cn:

$$ h(x_t) \triangleq C_n = \frac{2}{n} \sum_{i=1}^{I} \exp\!\left(\frac{-\pi n b_i}{f_s}\right) \cos\!\left(\frac{2\pi n f_i}{f_s}\right) - \frac{2}{n} \sum_{j=1}^{J} \exp\!\left(\frac{-\pi n b'_j}{f_s}\right) \cos\!\left(\frac{2\pi n f'_j}{f_s}\right). \qquad (11) $$

2. State-space model

We adopt a state-space framework similar to that of Deng et al. (2007) to model the evolution of the state vector in Eq. (6) from frame t to frame t + 1:

$$ x_{t+1} = F x_t + w_t, \qquad (12a) $$

$$ y_t = h(x_t) + v_t, \qquad (12b) $$

where F is the state transition matrix, and wt and vt are uncorrelated white Gaussian sequences with covariance matrices Q and R, respectively. The function h(xt) is the nonlinear mapping of Eq. (11), and vector yt consists of estimates of the first N cepstral coefficients Cn (not including the zeroth coefficient). The initial state x0 follows a normal distribution with mean µ0 and covariance Σ0. The state-space model of Eqs. (12) is thus parameterized by the set θ:

$$ \theta \triangleq (F, Q, R, \mu_0, \Sigma_0). \qquad (13) $$

3. Linearization via Taylor approximation

The cepstral mapping in Eq. (11) can be linearized to enable approximate minimum-mean-square-error (MMSE) estimates of the tracked states via the extended Kalman filter. The mapping h(xt) is linearized by computing the first-order terms of the Taylor-series expansion of Cn in Eq. (11):

$$ \frac{\partial C_n}{\partial f_i} = -\frac{4\pi}{f_s} \exp\!\left(\frac{-\pi n b_i}{f_s}\right) \sin\!\left(\frac{2\pi n f_i}{f_s}\right), $$

$$ \frac{\partial C_n}{\partial b_i} = -\frac{2\pi}{f_s} \exp\!\left(\frac{-\pi n b_i}{f_s}\right) \cos\!\left(\frac{2\pi n f_i}{f_s}\right), $$

$$ \frac{\partial C_n}{\partial f'_j} = \frac{4\pi}{f_s} \exp\!\left(\frac{-\pi n b'_j}{f_s}\right) \sin\!\left(\frac{2\pi n f'_j}{f_s}\right), $$

$$ \frac{\partial C_n}{\partial b'_j} = \frac{2\pi}{f_s} \exp\!\left(\frac{-\pi n b'_j}{f_s}\right) \cos\!\left(\frac{2\pi n f'_j}{f_s}\right). $$

TABLE I. Extended Kalman smoother algorithm.

1. Initialization: Set m0|0 = µ0 and P0|0 = Σ0.

2. Filtering: Repeat for t = 1, . . . , T:

$$ m_{t|t-1} = F m_{t-1|t-1} $$
$$ P_{t|t-1} = F P_{t-1|t-1} F^T + Q $$
$$ K_t = P_{t|t-1} H_t^T \left( H_t P_{t|t-1} H_t^T + R \right)^{-1} \qquad (16) $$
$$ m_{t|t} = m_{t|t-1} + K_t \left( y_t - h(m_{t|t-1}) \right) $$
$$ P_{t|t} = P_{t|t-1} - K_t H_t P_{t|t-1} $$

3. Smoothing: Repeat for t = T, . . . , 1:

$$ S_t = P_{t-1|t-1} F^T P_{t|t-1}^{-1} $$
$$ m_{t-1|T} = m_{t-1|t-1} + S_t \left( m_{t|T} - F m_{t-1|t-1} \right) $$
$$ P_{t-1|T} = P_{t-1|t-1} + S_t \left( P_{t|T} - P_{t|t-1} \right) S_t^T $$

The Jacobian matrix Ht thus consists of four sub-matrices for each frame t:

$$ H_t \triangleq \begin{pmatrix} H(f_i) & H(b_i) & H(f'_j) & H(b'_j) \end{pmatrix}, \qquad (14) $$

where H(fi) and H(bi) each consist of N rows and p/2 columns:

$$ H(f_i) \triangleq \begin{pmatrix} \frac{\partial C_1}{\partial f_1} & \frac{\partial C_1}{\partial f_2} & \cdots & \frac{\partial C_1}{\partial f_{p/2}} \\ \frac{\partial C_2}{\partial f_1} & \frac{\partial C_2}{\partial f_2} & \cdots & \frac{\partial C_2}{\partial f_{p/2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial C_N}{\partial f_1} & \frac{\partial C_N}{\partial f_2} & \cdots & \frac{\partial C_N}{\partial f_{p/2}} \end{pmatrix}, \qquad (15a) $$

$$ H(b_i) \triangleq \begin{pmatrix} \frac{\partial C_1}{\partial b_1} & \frac{\partial C_1}{\partial b_2} & \cdots & \frac{\partial C_1}{\partial b_{p/2}} \\ \frac{\partial C_2}{\partial b_1} & \frac{\partial C_2}{\partial b_2} & \cdots & \frac{\partial C_2}{\partial b_{p/2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial C_N}{\partial b_1} & \frac{\partial C_N}{\partial b_2} & \cdots & \frac{\partial C_N}{\partial b_{p/2}} \end{pmatrix}, \qquad (15b) $$

and H(f′j) and H(b′j) are defined analogously, each with N rows and q/2 columns.

4. Kalman-based inference

Given observations yt for frame indices 1 to T, the extended Kalman smoother (EKS) can be used to compute the mean mt|T (point estimates) and covariance Pt|T (estimate uncertainties) of each parameter in xt. Table I displays the steps of the EKS, which employs a two-pass filtering (forward) and smoothing (backward) procedure. For real-time processing, the forward filtering stage may be applied without a backward smoothing procedure; naturally, this will lead to larger uncertainties in the corresponding parameter estimates.
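One forward-filtering iteration of Table I can be sketched as follows. This is a minimal sketch of our own, not the paper's implementation; `h` and `jacobian` stand in for the cepstral mapping of Eq. (11) and its linearization:

```python
import numpy as np

def ekf_step(m, P, y, F, Q, R, h, jacobian):
    """One filtering iteration of Table I: predict with the linear state
    model of Eq. (12a), then update with the linearized observation
    model of Eq. (12b)."""
    # Prediction
    m_pred = F @ m
    P_pred = F @ P @ F.T + Q
    # Linearize the observation model at the predicted state
    H = jacobian(m_pred)
    # Kalman gain, Eq. (16)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    # Measurement update
    m_new = m_pred + K @ (y - h(m_pred))
    P_new = P_pred - K @ H @ P_pred
    return m_new, P_new
```

When h is linear and the Jacobian is constant, this reduces to the ordinary Kalman filter; the posterior covariance shrinks relative to the prediction, reflecting the information in yt.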

Care must be taken when approximating the observation model of Eq. (11) to avoid suboptimal performance or algorithm divergence in the case of the Kalman filter (Julier and Uhlmann, 1997). To verify the appropriateness of the linearization in Section II.C.3 in this


FIG. 2. (Color online) Comparison of KARMA (blue) and particle filter (red) tracking performance in terms of root-mean-square error (RMSE) averaged over 25 Monte Carlo trials and reported with 95% confidence intervals (gray).

setting, comparisons are made to a more computationally intensive method of stochastic computation termed a particle filter, which approximates the densities in question by sequentially propagating a fixed number of samples, or “particles,” and hence avoids the linearization of Eq. (11).

Twenty-five Monte Carlo simulations of 100-sample data sequences were performed according to Eqs. (12) with four complex-conjugate pole pairs (p = 8, q = 0) and N = 15 cepstral coefficients. Fig. 2 compares the output of KARMA using the extended Kalman filter of Table I and that of a particle filter in terms of root-mean-square error (RMSE) as a function of the number of particles, averaged over all formant frequency tracks (true bandwidths were provided to both algorithms). The performance of the EKF compares favorably to that of the particle filter, even when a large number of particles is used. Similar results hold over a broad range of parameter values.

5. Observability of states

The model of Eqs. (12) does not explicitly take into account the existence of speech and non-speech states. To continue to track, or coast, parameters during silence frames, the state vector xt can be augmented with a binary indicator variable to specify the presence of speech in the frame. The approximate MMSE state estimate is then obtained via EKS inference by modifying the Kalman gain in Eq. (16):

$$ K_t = M_t P_{t|t-1} H_t^T \left( H_t P_{t|t-1} H_t^T + R \right)^{-1}, $$

where Mt is a diagonal matrix with diagonal entries equal to 1 or 0 depending on the presence or absence, respectively, of speech energy in frame t.

In addition, to handle the presence or absence of particular tracks, the state vector xt can be dynamically modified to include or omit corresponding frequency/bandwidth states in Eq. (6). The approximate MMSE state estimate is then obtained via EKS inference in Table I with the modified state vector. If an absent state reappears in a given frame, that state is reinitialized with corresponding entries in µ0 and Σ0.

6. Model order selection and system identification

As is commonly done, the orders p and q of the ARMA model are chosen to capture as much information as possible on the peaks and valleys in the resonance spectrum, while avoiding overfitting and mistakenly capturing source-related information. The ARMA cepstral order N is chosen to be at least max(p, q) so that all pole/zero information is incorporated per Eq. (5). Finally, selecting I and J in Eq. (11) depends on the expected number of formants and antiformants, respectively, in the speech bandwidth fs/2.

Formants do not evolve independently of one another, and their temporal trajectories are not independent in frequency. In the synthesis of front vowels, it is common practice to employ a linear regression of f3 onto f1 and f2 (e.g., Nearey, 1989). Empirically, we found the formant cross-correlation function to decay slowly (Rudoy et al., 2007), implying that a set of formant values at frame t might be helpful in predicting values of all formants at frame t + 1. Thus, instead of setting the state transition matrix F to the identity matrix (Deng et al., 2006a, 2007), F is estimated a priori for a particular utterance from first-pass WaveSurfer formant frequency tracks using a linear least-squares estimator (Hamilton, 1994).
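A least-squares fit of F from first-pass tracks can be sketched as below; this is our own illustration of the estimator, with a hypothetical function name:

```python
import numpy as np

def estimate_F(tracks):
    """Least-squares estimate of the state transition matrix F in
    Eq. (12a) from first-pass formant tracks: rows of `tracks` are
    frames, columns are states, and x_{t+1} is approximated by F x_t."""
    X = np.asarray(tracks, dtype=float)[:-1].T  # predictors, states x (T-1)
    Y = np.asarray(tracks, dtype=float)[1:].T   # responses, states x (T-1)
    return Y @ X.T @ np.linalg.inv(X @ X.T)
```

Off-diagonal entries of the fitted F capture exactly the cross-formant predictability described above.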

The state transition covariance matrix Q, which dictates the frame-to-frame frequency variation, consists of a diagonal matrix with values corresponding to standard deviations of approximately 320 Hz for center frequencies and 100 Hz for bandwidths. These values were empirically found to follow temporal variations of speech articulation. The covariance matrix R, representing the signal-to-noise ratio of the cepstral coefficients, is a diagonal matrix with elements Rnn = 1/n for n ∈ {1, 2, . . . , N}. This was observed to be in reasonable agreement with the variance of the residual vector of the cepstral coefficients derived from speech waveforms.

The center frequencies and bandwidths are initialized to µ0 = (500 1500 2500 80 120 160) Hz. The initial covariance Σ0 is set to Q.

D. Summary of KARMA approach

Table II outlines the steps of the proposed KARMA algorithm, which includes a pre-processing stage, intra-frame ARMA cepstral coefficient estimation, and inter-frame tracking of formant and antiformant parameters using Kalman inference.

E. Benchmark algorithms

Performance of KARMA is compared with that of Praat (Boersma and Weenink, 2009) and WaveSurfer (Sjölander and Beskow, 2005), two software packages


TABLE II. Proposed KARMA algorithm for formant and antiformant tracking.

Repeat for frames t = 1, . . . , T (online or batch mode):

1. Pre-processing of input speech waveform s[m]
   (a) Window: st[m] = s[m]wt[m]
   (b) Pre-emphasize st[m]

2. Intra-frame observation of N cepstral coefficients
   (a) Estimate ARMA(p, q) spectral coefficients ai and bj in Eq. (3)
   (b) Convert ai and bj to ARMA cepstral coefficients using Eq. (4)

3. Inter-frame parameter tracking of I formants and J antiformants
   (a) Apply Kalman filtering step in Table I
   (b) mt|t are point estimates, and the diagonal elements of Pt|t are the associated variances of the estimates

Repeat for frames t = T, . . . , 1 (batch mode only):

4. Inter-frame parameter tracking of I formants and J antiformants
   (a) Apply Kalman smoothing step in Table I
   (b) mt|T are point estimates, and the diagonal elements of Pt|T are the associated variances of the estimates

that see wide use among voice and speech researchers. WaveSurfer and Praat both follow the classical formant tracking approach in which frame-by-frame formant frequency candidates are obtained from the all-pole spectrum and smoothed across the entire speech utterance to remove outliers and constrain the trajectories to physiologically plausible values. Smoothing is accomplished through dynamic programming to minimize the sum of the following three cost functions: (1) the deviation of the frequency for each formant from baseline values of each frequency; (2) a measure of the quality factor fi/bi of a formant, where higher quality factors are favored; and (3) a transition cost that penalizes large frequency jumps. The user sets weights on these cost functions to tune the algorithm’s performance.

III. RESULTS

Evaluation of KARMA is accomplished in the all-pole case using the vocal tract resonance (VTR) database (Deng et al., 2006b). Since the VTR database itself only yields estimates of ground truth and exhibits observable labeling errors, two speech databases are created using overlap-add of synthesized speech frames generated from the four VTR formant tracks. Antiformant tracking performance of KARMA is illustrated using synthesized and spoken nasal phonemes.

A. Error analysis using a hand-corrected formant database

The VTR database contains a representative subset of the TIMIT speech corpus (Garofolo et al., 1993) that consists of 516 diverse, phonetically balanced utterances collated across gender, individual speakers, dialects, and phonetic contexts. The VTR database contains state information for four formant trajectory pairs (center frequency and bandwidth). The first three center frequency trajectories were manually corrected after an initial automated pass (Deng et al., 2004). Corrections were made using knowledge-based intervention based on the speech waveform, its wideband spectrogram, word- and phoneme-level transcriptions, and phonemic boundaries.

Analysis parameters of KARMA are set to the following values: fs = 7 kHz, 20 ms Hamming windows with 50% overlap, γ = 0.7, p = 12 (q = 0), and I = 3 (J = 0). Each frame is fit with an ARMA(12, 0) model using the autocorrelation method of linear prediction and, subsequently, transformed to N = 15 cepstral coefficients via Eq. (5). The initial state vector is set to x0 = (500 1500 2500)T, and Σ0 is set to Q. TIMIT phone transcriptions are used to indicate whether each frame contains speech energy or a silence region. A frame is considered silent if all its samples are labeled as a pause (pau, epi, h#), closure interval (bcl, dcl, gcl, pcl, tcl, kcl), or glottal stop (q). Thus, errors due to speech activity detection are minimized, and all tracks are coasted during silent frames.

Default smoothing settings are used within WaveSurfer and Praat. Other analysis parameters are matched to KARMA: fs = 7 kHz, 20 ms Hamming windows with 50% overlap, γ = 0.7, p = 12 (q = 0), and I = 3 (J = 0).

Table III summarizes the performance of KARMA, WaveSurfer, and Praat on the VTR database. The root-mean-square error (RMSE) per formant is computed over all speech frames for each utterance and then averaged, per formant, across all 516 utterances in the database. The cepstral-based KARMA approach results in lower overall error compared to the classical algorithms, with particular gains for f1 and f2 tracking. Praat exhibits the lowest average error for f3.
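The per-utterance, per-formant error metric described above can be sketched as follows; the function and variable names are our own illustration.

```python
import numpy as np

def formant_rmse(est, ref, speech_mask):
    """Per-formant RMSE over speech frames of one utterance.

    est, ref: (num_frames, num_formants) arrays of frequencies in Hz;
    speech_mask: boolean array marking speech-labeled frames.
    """
    err = est[speech_mask] - ref[speech_mask]
    return np.sqrt(np.mean(err ** 2, axis=0))   # one RMSE value per formant

# The database-level figure is then the mean of these per-utterance vectors:
# rmse_db = np.mean([formant_rmse(e, r, m) for e, r, m in utterances], axis=0)
```

Restricting the mask to speech-labeled frames mirrors the paper's convention that RMSE is only computed where speech energy is present.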

Figure 3 illustrates the formant tracks output from the three algorithms for VTR utterance 200, spoken by an adult female. During non-speech regions (the first 700 ms exhibits noise energy during inhalation), mean KARMA

TABLE III. Formant tracking performance of KARMA, WaveSurfer, and Praat in terms of root-mean-square error (RMSE) per formant averaged across 516 utterances in the VTR database (Deng et al., 2006b). RMSE is only computed over speech-labeled frames.

Formant   KARMA    WaveSurfer   Praat
f1        114 Hz   170 Hz       185 Hz
f2        226 Hz   276 Hz       254 Hz
f3        320 Hz   383 Hz       303 Hz
Overall   220 Hz   276 Hz       247 Hz



FIG. 3. (Color online) Estimated formant tracks on the spectrogram of VTR utterance 200: “Withdraw only as much money as you need.” Reference trajectories from the VTR database are shown in red along with the formant frequency tracks in blue from (A) KARMA, (B) WaveSurfer, and (C) Praat. The KARMA output additionally displays uncertainties (gray shading, ±1 standard deviation) for each formant trajectory and speech-labeled frames (green). Reported root-mean-square error (RMSE) is averaged across formants conditioned on speech presence for each frame.

track estimates are linear, with increasing uncertainty for frames that are farther from frames with formant information. Compared to the WaveSurfer and Praat tracks, KARMA trajectories are smoother and better behaved, reflecting the slow-moving nature of the speech articulators. The classical algorithms exhibit errant tracking of f2 during the /i/ vowel in “need” at 2.5 s that is handled by the KARMA approach.

B. Error analysis using synthesized databases

Speech waveforms in the first database (VTRsynth) are synthesized through overlap-add of frames that each follow the ARMA model of Eq. (2). ARMA spectral coefficients are derived from the four formant frequency/bandwidth pairs in the corresponding frame of the VTR database utterance using the impulse-invariant transformation of a digital resonator (Klatt, 1980). The source excitation is white Gaussian noise during non-silence frames. Synthesis parameters are set to the following values: fs = 16 kHz, 20-ms Hanning windows with 50 % overlap, and p = 8 (q = 0).
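Under the impulse-invariant mapping, a formant frequency/bandwidth pair (F, B) becomes a conjugate pole pair at z = exp(−πB/fs)·exp(±j2πF/fs). The coefficient formulas below follow Klatt's (1980) digital resonator; the function itself is our sketch, not the paper's synthesizer code.

```python
import math

def resonator_coeffs(f_hz, bw_hz, fs_hz):
    """Second-order digital resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    r = math.exp(-math.pi * bw_hz / fs_hz)          # pole radius from bandwidth
    b = 2.0 * r * math.cos(2.0 * math.pi * f_hz / fs_hz)
    c = -r * r
    a = 1.0 - b - c                                  # normalizes gain to 1 at DC
    return a, b, c
```

Cascading one such section per formant pair reproduces the all-pole part of the per-frame ARMA spectrum used in the synthesis.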

The second database (VTRsynthf0) introduces a model mismatch between synthesis and KARMA analysis by applying a Rosenberg C source waveform (Rosenberg, 1971) instead of white noise for each frame considered voiced. The fundamental frequency of each voiced frame in the original VTR database is estimated by WaveSurfer. The VTRsynthf0 database thus includes voiced, unvoiced, and non-speech frames. Synthesis parameters are set as in the VTRsynth database. Formant trajectories from these two synthesized databases act as truer ground truth contours than those in the VTR database to test the performance of the cepstral-based KARMA algorithm.
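One polynomial glottal pulse of the kind described by Rosenberg (1971) can be sketched as below; whether this exactly matches the "Rosenberg C" variant used in the paper is an assumption, and the parameter names (opening phase Tp, closing phase Tn) and default fractions are ours.

```python
def rosenberg_pulse(n_samples, open_frac=0.4, close_frac=0.16):
    """One pitch period of a polynomial glottal pulse (illustrative sketch)."""
    Tp = int(open_frac * n_samples)       # end of opening phase
    Tn = int(close_frac * n_samples)      # duration of closing phase
    g = [0.0] * n_samples
    for t in range(n_samples):
        if t < Tp:                        # opening: smooth cubic rise 0 -> 1
            x = t / Tp
            g[t] = 3 * x * x - 2 * x * x * x
        elif t < Tp + Tn:                 # closing: parabolic fall back to 0
            y = (t - Tp) / Tn
            g[t] = 1.0 - y * y
        # remainder of the period is the closed phase, g = 0
    return g
```

Repeating one such period at the WaveSurfer-estimated fundamental frequency yields the periodic excitation for voiced frames, in place of the white noise used in VTRsynth.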

Table IV and Table V display performance on the VTRsynth and VTRsynthf0 databases, respectively, of the three tested algorithms with settings as described in the previous section. The proposed KARMA approach compares favorably to WaveSurfer and Praat. The similar error of KARMA and WaveSurfer validates the use of ARMA cepstral coefficients as observations in place of ARMA spectral coefficients.

C. Antiformant tracking

The KARMA approach to formant and antiformant tracking is illustrated in this section. Synthesized and real speech examples are presented to determine the ability of the ARMA-derived cepstral coefficients to capture pole and zero information.

TABLE IV. Average RMSE of KARMA, WaveSurfer, and Praat formant tracking of the first three formant trajectories in the VTRsynth database, which resynthesizes utterances using a stochastic source and formant tracks from the VTR database. Error is only computed over speech-labeled frames.

Formant   KARMA   WaveSurfer   Praat
f1        29 Hz   37 Hz        58 Hz
f2        53 Hz   60 Hz        123 Hz
f3        64 Hz   54 Hz        130 Hz
Overall   48 Hz   50 Hz        104 Hz

1. Synthesized waveform

In the synthesized case, a speech-like waveform /nAn/ is generated with varying frame-by-frame formant and antiformant characteristics and a periodic source excitation, as was implemented for the VTRsynthf0 database. The /nAn/ waveform is synthesized at fs = 10 kHz using 75 100-ms frames with 50 % overlap. Formant frequencies (bandwidths) of the /n/ phonemes are set to 257 Hz (32 Hz) and 1891 Hz (100 Hz). One antiformant is placed at 1223 Hz (bandwidth of 52 Hz) to mimic the location of an alveolar nasal antiformant. Formant frequencies (bandwidths) of /A/ are set to 850 Hz (80 Hz) and 1500 Hz (120 Hz). A random term with zero mean and standard deviation of 10 Hz is added to each trajectory to simulate realistic variation.
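An antiformant such as the 1223 Hz zero above can be realized as a conjugate zero pair mirrored from the resonator mapping, placing zeros at z = exp(−πB/fs)·exp(±j2πF/fs). This FIR section is our sketch of one standard realization, not the paper's synthesizer.

```python
import math

def antiresonator_coeffs(f_hz, bw_hz, fs_hz):
    """FIR anti-resonator y[n] = x[n] + b1*x[n-1] + b2*x[n-2]."""
    r = math.exp(-math.pi * bw_hz / fs_hz)           # zero radius from bandwidth
    b1 = -2.0 * r * math.cos(2.0 * math.pi * f_hz / fs_hz)
    b2 = r * r
    return 1.0, b1, b2                                # deep notch near f_hz
```

Cascading this section with the pole sections produces the pole-zero (ARMA) spectrum of each nasal frame.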

Fig. 4 shows the results of formant and antiformant tracking using KARMA on the synthesized phoneme string /nAn/. Two different visualizations are displayed. Fig. 4A plots point estimates and uncertainties of the center frequency and bandwidth trajectories for each frame. Fig. 4B displays the wideband spectrogram with overlaid center frequency tracks whose width reflects the corresponding 3-dB bandwidth value. Note that the length of the state vector in the KARMA state-space model is modified depending on the presence or absence of antiformant energy. Estimated trajectories fit the ground truth values well once initialized values reach a steady state.

2. Spoken nasals

During real speech, a vocal tract configuration consisting of multiple acoustic paths results in the possible existence of both poles and zeros in the transfer function T(z) (Eq. 7). For example, the effects of antiresonances might enter the transfer function of nasalized speech sounds as zeros in T(z). Typically, the frequency of the lowest zero depends on tongue position. For the labial nasal consonant /m/, the frequency of this antiresonance is approximately 1100 Hz. As the point of closure moves toward the back of the oral cavity—such as for the alveolar and velar

TABLE V. Average RMSE of KARMA, WaveSurfer, and Praat formant tracking of the first three formant trajectories in the VTRsynthf0 database, which resynthesizes VTR database utterances using stochastic and periodic sources. Error is only computed over speech-labeled frames.

Formant   KARMA   WaveSurfer   Praat
f1        44 Hz   57 Hz        57 Hz
f2        53 Hz   58 Hz        117 Hz
f3        62 Hz   59 Hz        111 Hz
Overall   53 Hz   58 Hz        95 Hz

nasal consonants—the length of the resonator decreases, and the frequency of this zero increases. The frequency of a second zero is approximately three times the frequency of the lowest zero due to the quarter-wavelength oral cavity configuration.
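The quarter-wavelength relation above can be made concrete: zeros of a closed side branch of length L fall at odd multiples of c/(4L). The function and the example branch length are our illustration, assuming a speed of sound of 350 m/s in warm, humid vocal tract air.

```python
def antiresonance_freqs(length_m, c=350.0, n_zeros=2):
    """Quarter-wavelength estimate of side-branch antiresonance frequencies (Hz)."""
    # zeros occur at odd multiples of c / (4 L): f_k = (2k - 1) * c / (4 L)
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n_zeros + 1)]
```

For example, a branch of about 8 cm places the first zero near 1100 Hz, consistent with the /m/ case above, with the second zero at three times that frequency.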

KARMA performance was evaluated visually on spoken nasal consonants produced with closure at the labial (/m/), alveolar (/n/), and velar (/N/) positions. The extended Kalman smoother was applied using an ARMA(16, 4) model, fs = 8 kHz, 20-ms Hamming windows with 50 % overlap, N = 20 cepstral coefficients, and γ = 0.7. The frequencies (bandwidths) of the formants were initialized to 500 Hz (80 Hz), 1500 Hz (120 Hz), and 2500 Hz (160 Hz). The frequencies (bandwidths) of the antiformants were initialized to 1000 Hz (80 Hz) and 2000 Hz (80 Hz).

Figure 5 displays KARMA outputs (point estimates and uncertainties of frequency tracks) and averaged spectra for the three sustained consonants. The KARMA algorithm takes a few frames to settle to its steady-state estimates. As expected, the frequency of the antiformant increases as the position of closure moves toward the back of the oral cavity. The uncertainty of the first antiformant of /N/ increases significantly, indicating that this antiformant is not well observed in the waveform. Note that the inclusion of zeros greatly improves the ability of the ARMA model to fit the underlying waveform spectra.

Finally, the KARMA tracker was applied to the spoken word “piano” to determine whether the antiformant tracks would capture any zeros during the nasal phoneme. Figure 6 displays the KARMA formant and antiformant tracks with their associated uncertainties. During the non-nasalized regions, the uncertainty around the point estimates of the antiformant track is large, reflecting the lack of antiresonance information. During the /n/ segment, the uncertainty of the antiformant tracks decreases to reveal observable antiformant information.

IV. DISCUSSION

In this article, the task of tracking frequencies and bandwidths of formants and antiformants was approached from a statistical point of view. The evolution of parameters was cast in a state-space model to provide access to point estimates and uncertainties of each track. The key relationship was a linearized mapping between cepstral coefficients and formant and antiformant parameters that allowed for the use of the extended family of Kalman inference algorithms.

The VTR database provides an initial benchmark of “ground truth” for the first three formant frequency values to which multiple algorithm outputs can be compared. The values in the VTR database, however, should be interpreted with caution because starting values were initially obtained via a first-pass automatic algorithm (Deng et al., 2004). It is unclear how much manual intervention was required and what types of errors were corrected. In particular, VTR tracks do not always overlap high-energy spectral regions. Despite the presence of various labeling errors in the VTR database, it is still



FIG. 4. (Color online) Illustration of the output from KARMA for the synthesized utterance /nAn/. Plots in panel A overlay the true trajectories (red) with the mean estimates (blue for formants, green for antiformants) and uncertainties (gray shading) for each frequency and bandwidth. Panel B plots an alternative display with a wideband spectrogram along with estimated frequency and bandwidth tracks of formants (blue) and antiformants (green). The 3-dB bandwidths dictate the width of the corresponding frequency tracks.

useful for obtaining an initial assessment of the performance of formant tracking algorithms on real speech.

In the current framework, cepstral coefficients are derived from the spectral coefficients of the fitted stochastic ARMA model (Section II.B). Source information related to phonation is thus separated from vocal tract resonances by assuming that the source is a white Gaussian noise process. This is not the case in reality, especially for voiced speech, where the source excitation component has its own characteristics in frequency (e.g., spectral slope) and time (e.g., periodicity). This model mismatch has been explored here via VTRsynthf0, though we note that it is also possible to incorporate more sophisticated source modeling through the use of flexible basis functions such as wavelets (Mehta et al., 2011).

An alternative approach to ARMA modeling is to compute the nonparametric (real) cepstrum directly from the speech samples. Based on the convolutional model of speech, low-quefrency cepstral coefficients up to about 5 ms are largely linked to vocal tract information (Childers et al., 1977), depending on the fundamental frequency. Skipping the 0th cepstral coefficient, which quantifies the overall spectral level, this would translate to including up to the first 35 cepstral coefficients in the observation vector yt in the state-space framework (Eq. 12). Although coefficients from the real cepstrum do not strictly adhere to the observation model derived in Eq. 11, the approximate separation of source and filter in the nonparametric cepstral domain makes this approach viable.

Figure 7 illustrates the output of the algorithm using the first 15 coefficients of the real cepstrum as observations. Interestingly, the performance of the nonparametric cepstrum is comparable to that of the parametric ARMA cepstrum, especially for the first formant frequency. Most of the error stems from underestimating the second and third formant frequencies. Advantages to using the nonparametric cepstrum include computational efficiency and freedom from ARMA model constraints.

The capability of automated methods to track the third formant strongly depends on the resampling frequency, which controls the amount of energy in the spectrum at higher frequencies. For example, if the signal were resampled to 10 kHz, a given algorithm might erroneously track the third formant frequency through spectral regions typically ascribed to the fourth formant. Traditional formant tracking algorithms have access to multiple candidate frequencies, which are constantly re-sorted so that f4 > f3 > f2 > f1. In the proposed statistical approach, the ordering of formant indices is inherent in the mapping of formants to cepstral coefficients (Eq. 11), and further empirical study of this formants-to-cepstrum mapping can be expected to lead to improved methods when additional resonances are present in the speech bandwidth. Further analysis of “noise” in the estimated ARMA cepstrum can also be expected to improve overall robustness in the presence of various sources of uncertainty (Tourneret and Lacaze, 1995).

FIG. 5. (Color online) KARMA output for three spoken nasals: (A) /m/, (B) /n/, and (C) /N/. On the left, spectrograms overlay the mean estimates (blue for formants, green for antiformants) and uncertainties (gray shading) for each frequency and bandwidth. Plots on the right display the corresponding periodogram (gray) and spectral ARMA model fit (black).

FIG. 6. (Color online) KARMA formant and antiformant tracks of the utterance “piano” by an adult male. Displayed are (A) the wideband spectrogram of the speech waveform and (B) the spectrogram overlaid with formant frequency estimates (blue), antiformant frequency estimates (green), and uncertainties (±1 standard deviation) for each track (gray). Arrows indicate the beginning and ending of the utterance. Note the increase in uncertainty during silence regions.

Overall, the proposed KARMA approach compares favorably with WaveSurfer and Praat in terms of root-mean-square error. RMSE, however, is only one selected error metric, which must be validated by observing how well the raw trajectories behave. The proposed KARMA tracker yields smoother outputs as a result and offers parameters that allow the user to tune the performance of the algorithm in a statistically principled manner. Such well-behaved trajectories may be particularly desirable for the resynthesis of perceptually natural speech.

FIG. 7. (Color online) KARMA formant tracks using observations from (A) the parametric ARMA cepstrum and (B) the nonparametric real cepstrum for VTRsynthf0 utterance 1: “Even then, if she took one step forward, he could catch her.” Reference trajectories from the VTR database are shown in red with the outputs of KARMA in blue. KARMA uncertainties (±1 standard deviation) are shown as gray shading. Reported root-mean-square error (RMSE) is averaged across 3 formants conditioned on the presence of speech energy for each frame.

Though considerably more complex and more sensitive to model assumptions, a time-varying autoregressive moving average (TV-ARMA) model has been previously proposed for formant and antiformant tracking (Toyoshima et al., 1991), with little follow-up investigation. In their study, Toyoshima et al. used an extended Kalman filter to solve for ARMA spectral coefficients at each speech sample. One real zero and one real pole were included to model changes in gross spectral shape over time. While a frame-based approach (as taken in KARMA) appears to yield more salient parameters at a lower computational cost, future work could consider this as well as alternative time-varying approaches (Rudoy et al., 2011).

As a final observation, antiformant tracking remains a challenging task in speech analysis. Antiresonances are typically weaker than their resonant counterparts during nasalized phonation, and the estimation of subglottal resonances continues to rely on empirical relationships rather than direct acoustic observation (Arsikere et al., 2011). Nevertheless, the proposed approach allows the user the option of tracking antiformants during select speech regions of interest. Potential improvements here include the use of formal statistical tests for detecting the presence of zeros within a frame prior to tracking them.

V. CONCLUSIONS

This article has presented KARMA, a Kalman-based autoregressive moving average modeling approach to formant and antiformant tracking. The contributions of this work are twofold. The first is methodological, with improvements to the Kalman-based AR approach of Deng et al. (2007) and extensions to enable antiformant frequency and bandwidth tracking in a KARMA framework. The second is empirical, with visual and quantitative error analysis of the KARMA algorithm demonstrating improvements over two standard speech processing tools, WaveSurfer (Sjolander and Beskow, 2005) and Praat (Boersma and Weenink, 2009).

It is expected that additional improvements will come with better understanding of precisely how formant information is captured through this class of nonlinear ARMA (or nonparametric) cepstral coefficient models. As noted, antiformant tracking remains challenging, although it has been shown here that appropriate results can be obtained for selected cases exhibiting antiresonances. The demonstrated effectiveness of this approach, coupled with its ability to capture uncertainty in the frequency and bandwidth estimates, yields a statistically principled tool appropriate for use in clinical and other applications where it is desired, for example, to quantitatively assess acoustic features such as nasality, subglottal resonances, and coarticulation.

APPENDIX: DERIVATION OF CEPSTRAL COEFFICIENTS FROM THE ARMA SPECTRUM

Assume an ARMA process with the minimum-phase rational transfer function

    T(z) \triangleq \frac{B(z)}{A(z)} = \frac{1 + \sum_{j=1}^{q} b_j z^{-j}}{1 - \sum_{i=1}^{p} a_i z^{-i}},    (A.1)

which in turn implies a right-sided complex cepstrum. For the moment, assume b_j = 0 for 1 ≤ j ≤ q to initially derive the all-pole LPC cepstrum, whose Z-transform is denoted by C(z):

    C(z) \triangleq \log T(z) = \sum_{n=0}^{\infty} c_n z^{-n},    (A.2)

where

    c_n = \frac{1}{2\pi i} \oint_{z = e^{i\omega}} \left(\log T(z)\right) z^{n-1}\, dz

is the nth coefficient of the LPC cepstrum.

Using the chain rule, dC(z)/dz^{-1} can be obtained independently from Eq. (A.1) or Eq. (A.2), yielding the relation

    \frac{dC(z)}{dz^{-1}} = \frac{1}{T(z)} \frac{dT(z)}{dz^{-1}} = \frac{\sum_{i=1}^{p} i a_i z^{-i+1}}{1 - \sum_{i=1}^{p} a_i z^{-i}},

which implies that

    \frac{\sum_{i=1}^{p} i a_i z^{-i+1}}{1 - \sum_{i=1}^{p} a_i z^{-i}} = \sum_{n=0}^{\infty} c_n \frac{d}{dz^{-1}}\left(z^{-n}\right) = \sum_{n=0}^{\infty} n c_n z^{-n+1}.

Rearranging the terms above, we obtain

    \sum_{n=0}^{\infty} n c_n z^{-n+1} = \sum_{i=1}^{p} i a_i z^{-i+1} + \sum_{i=1}^{p} a_i z^{-i} \sum_{n=0}^{\infty} n c_n z^{-n+1}.    (A.3)

Using Eq. (A.3), we can match the coefficients of terms on both sides with equal exponents. In the constant-coefficient case (associated with z^0), we have c_1 = a_1. For 1 < n ≤ p, we obtain

    c_n = a_n + \sum_{i=1}^{n-1} \frac{n-i}{n}\, a_i c_{n-i} = a_n + \sum_{i=1}^{n-1} \left(1 - \frac{i}{n}\right) a_i c_{n-i}.

On the other hand, if n > p, then Eq. (A.3) implies that

    c_n = \sum_{i=1}^{p} \frac{n-i}{n}\, a_i c_{n-i} = \sum_{i=1}^{p} \left(1 - \frac{i}{n}\right) a_i c_{n-i}.

In summary, we have obtained the following relationship between the prediction polynomial coefficients and the complex cepstrum:

    c_n =
    \begin{cases}
      a_1 & \text{if } n = 1, \\
      a_n + \sum_{i=1}^{n-1} \left(\frac{n-i}{n}\right) a_i c_{n-i} & \text{if } 1 < n \le p, \\
      \sum_{i=1}^{p} \left(\frac{n-i}{n}\right) a_i c_{n-i} & \text{if } p < n.
    \end{cases}

Reversing the roles of i and (n − i) yields the all-pole version in Eq. (5a).

To allow for nonzero b_j coefficients in Eq. (A.1), we obtain the ARMA cepstral coefficients C_n by separating contributions from the numerator and denominator of Eq. (A.1) as follows:

    C_n = \mathcal{Z}^{-1}\left\{\log T(z)\right\}
        = \mathcal{Z}^{-1}\left\{\log \frac{1}{A(z)}\right\} - \mathcal{Z}^{-1}\left\{\log \frac{1}{B(z)}\right\}
        = c_n - c'_n,

yielding the respective pole and zero recursions of Eqs. (5).
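The all-pole recursion just derived can be checked numerically; the code below is our own illustration, comparing the recursion against a reference cepstrum computed from the log magnitude spectrum (for a minimum-phase 1/A(z), the complex cepstrum at positive quefrencies equals twice the real cepstrum).

```python
import numpy as np

def lpc_to_cepstrum(a, n_coeffs):
    """a: coefficients a_1..a_p of A(z) = 1 - sum_i a_i z^-i; returns c_1..c_N."""
    p = len(a)
    c = np.zeros(n_coeffs + 1)                    # index n runs from 1
    for n in range(1, n_coeffs + 1):
        acc = a[n - 1] if n <= p else 0.0         # a_n term only for n <= p
        for i in range(1, min(n - 1, p) + 1):     # i = 1 .. min(n-1, p)
            acc += (1.0 - i / n) * a[i - 1] * c[n - i]
        c[n] = acc
    return c[1:]

a = np.array([1.2, -0.5])                         # stable AR(2) example
c_rec = lpc_to_cepstrum(a, 8)

# reference: twice the real cepstrum of |1/A| on a dense FFT grid
n_fft = 4096
A = np.fft.rfft(np.concatenate(([1.0], -a)), n_fft)
c_fft = 2.0 * np.fft.irfft(-np.log(np.abs(A)), n_fft)[1:9]
```

The two vectors agree to within FFT discretization error, confirming the coefficient-matching argument above (e.g., c_1 = a_1 and c_2 = a_2 + (1/2) a_1 c_1).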

Arsikere, H., Lulich, S. M., and Alwan, A. (2011). “Automatic estimation of the first subglottal resonance”, J. Acoust. Soc. Am. 129, EL197–EL203.

Atal, B. S. and Hanauer, S. L. (1971). “Speech analysis and synthesis by linear prediction of the speech wave”, J. Acoust. Soc. Am. 50, 637–655.

Atal, B. S. and Schroeder, M. R. (1978). “Linear prediction analysis of speech based on a pole-zero representation”, J. Acoust. Soc. Am. 64, 1310–1318.

Boersma, P. and Weenink, D. (2009). “Praat: Doing phonetics by computer”, version 5.1.40, retrieved from www.praat.org 13 July 2009.

Broad, D. J. and Clermont, F. (1989). “Formant estimation by linear transformation of the LPC cepstrum”, J. Acoust. Soc. Am. 86, 2013–2017.

Childers, D. G., Skinner, D. P., and Kemerait, R. C. (1977). “The cepstrum: A guide to processing”, Proc. IEEE 65, 1428–1443.

Christensen, R., Strong, W., and Palmer, E. (1976). “A comparison of three methods of extracting resonance information from predictor-coefficient coded speech”, IEEE Trans. Acoust. 24, 8–14.

Deng, L., Acero, A., and Bazzi, I. (2006a). “Tracking vocal tract resonances using a quantized nonlinear function embedded in a temporal constraint”, IEEE Trans. Audio Speech Lang. Processing 14, 425–434.

Deng, L., Cui, X., Pruvenok, R., Huang, J., Momen, S., Chen, Y., and Alwan, A. (2006b). “A database of vocal tract resonance trajectories for research in speech processing”, Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 1, 369–372.

Deng, L., Lee, L. J., Attias, H., and Acero, A. (2004). “A structured speech model with continuous hidden dynamics and prediction-residual training for tracking vocal tract resonances”, Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 1, I-557–560.

Deng, L., Lee, L. J., Attias, H., and Acero, A. (2007). “Adaptive Kalman filtering and smoothing for tracking vocal tract resonances using a continuous-valued hidden dynamic model”, IEEE Trans. Audio Speech Lang. Processing 15, 13–23.

Fulop, S. A. (2010). “Accuracy of formant measurement for synthesized vowels using the reassigned spectrogram and comparison with linear prediction”, J. Acoust. Soc. Am. 127, 2114–2117.

Garofolo, J. S., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N., and Zue, V. (1993). TIMIT Acoustic-Phonetic Continuous Speech Corpus (Linguistic Data Consortium, Philadelphia, PA).

Hamilton, J. D. (1994). Time Series Analysis (Princeton University).

Julier, S. and Uhlmann, J. (1997). “A new extension of the Kalman filter to nonlinear systems”, Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulation, and Controls 3, 26.

Kalman, R. E. (1960). “A new approach to linear filtering and prediction problems”, Trans. ASME J. Basic Eng. 82, 35–45.

Klatt, D. H. (1980). “Software for a cascade/parallel formant synthesizer”, J. Acoust. Soc. Am. 67, 971–995.

Kopec, G. (1986). “Formant tracking using hidden Markov models and vector quantization”, IEEE Trans. Acoust. 34, 709–729.

Ljung, L. (1999). System Identification (Prentice-Hall, Upper Saddle River, NJ).

Marelli, D. and Balazs, P. (2010). “On pole-zero model estimation methods minimizing a logarithmic criterion for speech analysis”, IEEE Trans. Audio Speech Lang. Processing 18, 237–248.

McCandless, S. (1974). “An algorithm for automatic formant extraction using linear prediction spectra”, IEEE Trans. Acoust. 22, 134–141.

Mehta, D. D., Rudoy, D., and Wolfe, P. J. (2011). “Joint source-filter modeling using flexible basis functions”, Proc. IEEE Int. Conf. Acoust. Speech Signal Process.

Miyanaga, Y., Miki, N., and Nagai, N. (1986). “Adaptive identification of a time-varying ARMA speech model”, IEEE Trans. Acoust. 34, 423–433.

Nearey, T. M. (1989). “Static, dynamic, and relational properties in vowel perception”, J. Acoust. Soc. Am. 85, 2088–2113.

Rigoll, G. (1986). “A new algorithm for estimation of formant trajectories directly from the speech signal based on an extended Kalman filter”, Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 11, 1229–1232.

Rosenberg, A. E. (1971). “Effect of glottal pulse shape on the quality of natural vowels”, J. Acoust. Soc. Am. 49, 583–590.

Rudoy, D., Quatieri, T., and Wolfe, P. (2011). “Time-varying autoregressions in speech: Detection theory and applications”, IEEE Trans. Audio Speech Lang. Processing 19, 977–989.

Rudoy, D., Spendley, D. N., and Wolfe, P. J. (2007). “Conditionally linear Gaussian models for estimating vocal tract resonances”, Proc. INTERSPEECH, 526–529.

Schafer, R. W. and Rabiner, L. R. (1970). “System for automatic formant analysis of voiced speech”, J. Acoust. Soc. Am. 47, 634–648.

Sjolander, K. and Beskow, J. (2005). “WaveSurfer for Windows”, version 1.8.5, retrieved 1 November 2005.

Steiglitz, K. (1977). “On the simultaneous estimation of poles and zeros in speech analysis”, IEEE Trans. Acoust. 25, 229–234.

Tourneret, J.-Y. and Lacaze, B. (1995). “On the statistics of estimated reflection and cepstrum coefficients of an autoregressive process”, Signal Processing 43, 253–267.

Toyoshima, T., Miki, N., and Nagai, N. (1991). “Adaptive formant estimation with compensation for gross spectral shape”, Electronics and Communications in Japan (Part III: Fundamental Electronic Science) 74, 58–68.

Yegnanarayana, B. (1978). “Formant extraction from linear-prediction phase spectra”, J. Acoust. Soc. Am. 63, 1638–1640.

Zheng, Y. and Hasegawa-Johnson, M. (2004). “Formant tracking by mixture state particle filter”, Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 1, 565–568.
