
Formant estimation from speech signal using the magnitude spectrum

modified with group delay spectrum

Husne Ara Chowdhury and Mohammad Shahidur Rahman

Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet 3114, Bangladesh

(Received 1 July 2020, Accepted for publication 26 October 2020)

Abstract: The magnitude spectrum is a popular mathematical tool for speech signal analysis. In this paper, we propose a new technique for improving the performance of the magnitude spectrum by utilizing the benefits of the group delay (GD) spectrum to estimate the characteristics of the vocal tract accurately. The traditional magnitude spectrum suffers from difficulties when estimating vocal tract characteristics, particularly for high-pitched speech, owing to its low resolution and high spectral leakage. Phase domain analysis shows that the GD spectrum has low spectral leakage and high resolution owing to its additive property. Thus, the magnitude spectrum modified with its GD spectrum, referred to as the modified spectrum, is found to significantly improve the estimation of formant frequency over traditional methods. The accuracy is tested on synthetic vowels for a wide range of fundamental frequencies, up to the high-pitched female speaker range. The validity of the proposed method is also verified by inspecting the formant contour of an utterance from the Texas Instruments and Massachusetts Institute of Technology (TIMIT) database and the standard F2–F1 plot of natural vowel speech spoken by male and female speakers. The results are compared with those of two state-of-the-art methods, and the proposed method performs better than both.

Keywords: Modified spectrum, Phase domain analysis, Spectrogram, GD spectrum, Glottal formant effect

1. INTRODUCTION

The speech production system is composed of the convolution of a source and a system or filter. Most speech signal analysis research is based on the separation of the source and filter. Atal and Hanauer separated the source and filter by linear predictive coding (LPC) [1]. However, the conventional linear prediction (LP) technique has some limitations [2]. Magi et al. [3] used the stabilized weighted LP to overcome the problems of LP. Noll [4] used the cepstrum method for source–filter separation. The source is characterized by the quasi-periodic motion of the vocal folds, whose rate is the fundamental frequency (F0). The system or filter is modeled as a concatenation of tubes of different cross-sectional areas representing the vocal tract. The vocal tract has resonance frequencies that correspond to these cross-sectional areas and are called formants, denoted by F1, F2, F3, etc.

Fundamental and formant frequencies are well highlighted and distinguished in the magnitude spectrum domain [5]. The local minima and maxima in the magnitude spectrum correspond to zeros and poles in the z domain. This forms the basis of the source–filter separation described by Loweimi and Barker [6,7]. Schafer and Rabiner [8] used the magnitude spectrum for formant estimation. Rahman and Shimamura employed the magnitude spectrum to show improved accuracy of formant estimation [9,10]. As a result, the magnitude spectrum has become a natural choice for speech analysis.

Some human perception experiments, such as those of Paliwal and Alsteris [11], showed that the short-time phase spectrum contributes to speech intelligibility and plays an important role in high-quality speech synthesis. Unfortunately, the phase spectrum is in a wrapped form [7], and the negative first-order derivative of its unwrapped version, called the GD function, is generally preferred. The GD function is related to the overall delay of the input signal passing through the system. The GD spectrum represents the phase spectrum information, but its overall shape is more understandable. A properly estimated GD spectrum has a high spectral resolution, lower spectral leakage, and resembles a magnitude spectrum. Gowda et al. [12] applied the GD function to the stabilized weighted linear prediction spectrum for formant estimation.

e-mail: [email protected] [doi:10.1250/ast.42.93]

Acoust. Sci. & Tech. 42, 2 (2021) ©2021 The Acoustical Society of Japan

In the proposed method, the magnitude spectrum is modified using the GD spectrum obtained from an appropriate phase domain analysis of the speech signal. The resulting modified spectrum can be used for robust formant estimation.

The GD spectrum often presents spurious spikes that make its processing very problematic. Murthy and Yegnanarayana proposed various methods [13,14] to remove these spikes. Bozkurt et al. [15] tried to remove spikes in the GD spectrum by performing the analysis on a circle of radius greater than that of the unit circle. The minimum phase signal can be a choice for avoiding spikes. The GD spectrum obtained from a minimum phase signal has a better frequency resolution, having peaks at the poles and valleys at the zeros. It also resembles the magnitude spectrum, as explained by Loweimi [7]. A signal is said to be minimum phase if both the signal and its inverse are causal and stable.

In their method [13], Murthy and Yegnanarayana estimated formant frequency from the GD spectrum of a short analysis window segment of less than one pitch period, which is not appropriate for high-pitched speech owing to an insufficient number of samples. This method is also pitch-dependent. Murthy and Gadde [16] modified the GD spectrum by its cepstrally smoothed spectrum and applied it to phoneme detection. Murthy and Yegnanarayana showed different approaches for formant extraction [14]; the approaches in [14] are based only on synthetic speech. The study by Bozkurt [17], which analyzed the characteristics of the source and filter using the zeros of the z transform, revealed that zeros close to the unit circle are responsible for the spikes. The Collaborative voice analysis repository (COVAREP) [18] package uses the differential-phase peak tracking (DPPT) method [19] for formant tracking, but it is found to be affected by the glottal formant and unsuitable for high-pitched speech analysis. The methods described in [17–19] show some robustness problems. The CheapTrick [20] method of the WORLD [21] vocoder produces a smoothed, vocal-tract-dominated spectrum. The main problem with utilizing this spectrum as a formant estimator is that some peaks other than formant peaks arise, and some of the true peaks appear with low resolution. This method is also pitch-dependent.

On the basis of the above observations, we propose a technique that computes the GD spectrum from a minimum-phase-like signal and uses the resulting GD spectrum to modify the magnitude spectrum. The modified spectrum emphasizes the formant peaks while reducing the glottal formant impact. The effectiveness of the proposed method is measured by estimating the formant frequencies and comparing the results with those of the recent DPPT method of the COVAREP tool and of a formant estimator utilizing the spectrum of the WORLD vocoder. The proposed method shows significant improvement over both the DPPT of COVAREP and the WORLD formant estimator.

2. PROBLEM ANALYSIS

The basics of magnitude- and phase-based signal processing, their significance, similarities, problems, and available solutions are discussed.

2.1. Magnitude Spectrum and Its Significance

Fourier analysis is an important and popular mathematical tool for speech signal analysis. The N-point discrete Fourier transform (DFT) [22] of the discrete signal x(n) can be defined as

X(k) = \frac{1}{L} \sum_{n=1}^{L} x(n) e^{-j(2\pi/L)kn},  (1)

where n, L, and k indicate the sample number, frame length, and frequency index, respectively, with k = 1, 2, 3, \ldots, N. The term e^{-j(2\pi/L)kn} in Eq. (1) forms the basis functions of the Fourier transform. X(k) is a complex quantity that can be written in Cartesian coordinates as

X(k) = X_R(k) + j X_I(k),

where X_R(k) and X_I(k) indicate the real and imaginary parts, respectively. The complex numbers are represented in polar form as

X(k) = |X(k)| e^{j\theta_X(k)},  (2)

where |X(k)| and \theta_X(k) are the magnitude spectrum and phase spectrum, respectively, which can also be expressed as

|X(k)| = \sqrt{X_R(k)^2 + X_I(k)^2}, \qquad \theta_X(k) = \arctan\frac{X_I(k)}{X_R(k)}.  (3)
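The decomposition in Eqs. (1)–(3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the 100 Hz test frame and its length are made up for demonstration:

```python
import numpy as np

# Hypothetical analysis frame: a 100 Hz cosine sampled at 10 kHz
fs, L = 10_000, 300
n = np.arange(L)
x = np.cos(2 * np.pi * 100 * n / fs)

X = np.fft.fft(x)                     # complex DFT of Eq. (1), without the 1/L scale
magnitude = np.abs(X)                 # |X(k)|, Eq. (3)
phase = np.arctan2(X.imag, X.real)    # theta_X(k), Eq. (3), wrapped to (-pi, pi]

# 100 Hz falls exactly on bin k = 100 * L / fs = 3, so the magnitude
# spectrum peaks there, and the spectrum of a real signal is symmetric
peak_bin = int(np.argmax(magnitude[: L // 2]))
```

Note that `np.arctan2` resolves the quadrant automatically, which a plain arctangent of X_I/X_R would not.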

The magnitude spectrum better characterizes the speech signal and matches our level of understanding. A synthetic speech signal of vowel /a/ is segmented using a 30 ms Gaussian window. The time-domain speech segment and the corresponding log magnitude spectrum, phase spectrum, and GD spectrum are shown in Fig. 1. The log magnitude spectrum shown in Fig. 1(b) has a fine structure that is closely related to the fundamental frequency and its harmonics. The resonances are also clear here. This forms a basis of source–filter separation using the (log of the) magnitude spectrum. The local minima and maxima in the magnitude spectrum correspond to zeros and poles in the z domain, which is useful for separating the source and filter. Rahman and Shimamura [9] attempted to separate the source and filter using the magnitude spectrum. The understandable and expressive behavior of the magnitude spectrum makes it applicable to a wide range of uses.

The main problem with the magnitude spectrum is its low resolution and high spectral leakage. Although window functions are used to minimize these problems, a window function cannot eliminate spectral leakage; it uniformly distributes the leakage effect over all frequency points. As a result, the leakage effect is not visible on the spectrum of the windowed signal. If the magnitude spectrum is taken from a short duration of real measurements, the resolution degrades. Moreover, the magnitude spectrum is formed by the product of the magnitude spectra of the individual components, so the overall resolution is reduced. Spectral leakage [23] takes place owing to the lack of an integer number of cycles of each component spectrum.

2.2. Phase Spectrum

The phase spectrum described by Eq. (3) is depicted in Fig. 1(c). Its behavior is not as clear as that of the magnitude spectrum; it shows a chaotic nature because of phase wrapping. The main difficulty in unwrapping the phase is that adding an integer multiple of 2\pi to the principal phase spectrum, as in

X(k) = |X(k)| e^{j(\theta_X(k) + 2m\pi)},

does not change the complex number X(k). An alternative way to unwrap the phase spectrum is to take the derivative of the phase spectrum using the expression

\arg' X(k) = \frac{d\{\arg X(k)\}}{dk} = \frac{X_R(k) X'_I(k) - X'_R(k) X_I(k)}{|X(k)|^2},

where ' denotes the derivative with respect to k. Two factors limit this process: one is a small FFT size and the other is noise. Increasing the FFT size improves the accuracy.

2.3. GD Spectrum

The GD spectrum is an alternative and meaningful representation of the phase spectrum. It is defined as the negative spectral derivative of the unwrapped phase spectrum:

\tau(k) = -\frac{d\{\arg X(k)\}}{dk}.  (4)

Another method of computing the GD spectrum, without unwrapping the phase spectrum, is

\tau(k) = -\frac{X_R(k) X'_I(k) - X'_R(k) X_I(k)}{|X(k)|^2}.
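This derivative-based form is convenient numerically because no phase unwrapping is needed: for a finite sequence, the spectral derivative satisfies X'(k) = -j · DFT{n x(n)}, so the GD reduces to τ(k) = Re{X(k) Y*(k)} / |X(k)|², with Y = DFT{n x(n)}. A minimal sketch under that identity (illustrative, not the authors' implementation):

```python
import numpy as np

def group_delay(x, n_fft=1024):
    """GD spectrum in samples: tau(k) = Re{X(k) conj(Y(k))} / |X(k)|^2,
    where Y is the DFT of n*x(n). Equivalent to -d(arg X)/d(omega)."""
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(np.arange(len(x)) * x, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-15)

# Sanity check: a pure 3-sample delay has a flat GD of 3 samples
x = np.zeros(8)
x[3] = 1.0
gd = group_delay(x, 64)
```

For the delay, X(k) = e^{-j\omega_k 3}, so τ(k) = 3 at every frequency bin.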

Bozkurt et al. [15] showed that if the speech signal is not segmented with an appropriate window size and position, the GD spectrum taken from it becomes spiky. The window center must be synchronized with the glottal closure instant, which is a challenging task. If the GD spectrum becomes spiky, neither the fundamental frequency nor the formants can be distinguished visually. Another cause of spikes is poles and zeros located close to the unit circle, as discussed elaborately by Bozkurt [17]. The choice of a minimum phase signal can eliminate the aforementioned inconvenience: the GD spectrum can be used effectively when the signal is minimum phase. A signal is minimum phase if and only if all the poles and zeros of its z transform lie within the unit circle. Mathematically, this can be expressed as

Fig. 1 A speech segment of synthetic vowel /a/. (a) Windowed speech segment in time domain. (b) Log magnitude spectrum. (c) Phase spectrum. (d) GD spectrum.

X(z) = \frac{b_0 \prod_{i=1}^{m} (1 - b_i z^{-1})}{a_0 \prod_{i=1}^{n} (1 - a_i z^{-1})},

where |b_i| < 1 and |a_i| < 1 for all i.
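The condition that all b_i and a_i lie inside the unit circle is easy to check numerically once the numerator and denominator coefficients are known. A small illustrative helper (not from the paper; the example transfer function is made up):

```python
import numpy as np

def is_minimum_phase(b, a):
    """True if every zero (root of b) and pole (root of a)
    lies strictly inside the unit circle."""
    return bool(np.all(np.abs(np.roots(b)) < 1.0) and
                np.all(np.abs(np.roots(a)) < 1.0))

# H(z) = (1 - 0.5 z^-1) / (1 - 0.9 z^-1): zero at 0.5, pole at 0.9
inside = is_minimum_phase([1.0, -0.5], [1.0, -0.9])    # True
# Moving the zero to z = 2 breaks the minimum-phase condition
outside = is_minimum_phase([1.0, -2.0], [1.0, -0.9])   # False
```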

2.4. Comparison of Magnitude Spectrum with GD Spectrum

For the convolution of two signals in the time domain, for example, a signal x(n) and an impulse response h(n) of a filter, the magnitude spectrum and GD spectrum can be expressed by

F\{x(n) * h(n)\} = X(k) H(k) = |X(k)||H(k)| e^{j(\arg X(k) + \arg H(k))} \;\Longrightarrow\; \tau_{x*h}(k) = \tau_x(k) + \tau_h(k),

where F denotes the DFT, X and H denote the DFTs of the signal and the impulse response, and \tau indicates the GD spectrum. Thus, convolution in the time domain becomes multiplication in the magnitude spectrum domain and addition in the GD spectrum domain. Suppose a speech signal is composed of five resonators. The magnitude spectrum and the corresponding GD spectrum are shown in Fig. 2. Since the GD spectrum is composed by the addition of the individual resonators characterized by the poles, the overall resolution is high. However, the magnitude spectrum is the combined multiplicative effect of the individual component resonators, so the overall resolution is degraded.
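The additive property can be checked numerically: the GD of a cascade of two resonators equals the sum of their individual GDs. A sketch under two assumptions not in the paper, namely the DFT{n x(n)} identity for computing GD and two made-up one-pole responses:

```python
import numpy as np

def gd(x, n_fft=256):
    # Group delay via tau = Re{X conj(Y)}/|X|^2 with Y = DFT{n x(n)}
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(np.arange(len(x)) * x, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-15)

# Two one-pole (minimum phase) impulse responses
n = np.arange(64)
h1 = 0.5 ** n
h2 = 0.8 ** n
cascade = np.convolve(h1, h2)   # convolution in the time domain ...
# ... is addition in the GD domain: tau_{h1*h2}(k) = tau_{h1}(k) + tau_{h2}(k)
```

The equality is exact here because the linear convolution (127 samples) fits inside the 256-point DFT, so the cascade's spectrum is exactly the product of the individual spectra.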

2.5. Derivation of GD Spectrum from Magnitude Spectrum

The GD spectrum can be derived from the signal obtained by the inverse Fourier transform of the magnitude spectrum after it is converted to a minimum-phase-like signal. The conversion method is elaborately illustrated in Sect. 3. Let u(n) be a minimum phase signal whose Fourier transform can be expressed as

U(e^{j\omega}) = |U(e^{j\omega})| e^{j\theta(e^{j\omega})}.

We can show that [14]

\ln|U(e^{j\omega})| = c(0)/2 + \sum_{n=1}^{\infty} c(n) \cos(n\omega).  (5)

The unwrapped phase function is expressed as

\theta(e^{j\omega}) = \sum_{n=1}^{\infty} c(n) \sin(n\omega),  (6)

where \{c(n)\} are the cepstral coefficients [22]. Taking the negative derivative of \theta(e^{j\omega}) with respect to \omega, we obtain the GD spectrum as

\tau(e^{j\omega}) = -\sum_{n=1}^{\infty} n\,c(n) \cos(n\omega).  (7)

In the case of the minimum phase signal, we note from Eqs. (5) and (6) that the magnitude spectrum and the phase spectrum are related by the cepstral coefficients. Equation (7) derives the GD spectrum from the phase spectrum, and this GD function is an even function. It can be observed that the GD spectrum can also be derived from the log magnitude spectrum of Eq. (5) by multiplying each cepstral coefficient by n and ignoring the dc term.
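That recipe can be sketched directly: take the real cepstrum of the magnitude spectrum, keep the causal (doubled) part, weight each coefficient by n, and transform back. This is an illustrative reconstruction, not the authors' code; sign conventions for Eqs. (6) and (7) vary between references, and the sign below is chosen so that a resonance yields a positive GD peak, which the comparison against a derivative-based GD verifies:

```python
import numpy as np

def gd_from_magnitude(x, n_fft=512):
    """GD of the minimum-phase counterpart of x, derived from the
    magnitude spectrum alone (cf. Eqs. (5)-(7))."""
    X = np.fft.fft(x, n_fft)
    rc = np.fft.ifft(np.log(np.abs(X) + 1e-12)).real   # real cepstrum (even sequence)
    c = np.zeros(n_fft)
    c[0] = rc[0]
    c[1:n_fft // 2] = 2.0 * rc[1:n_fft // 2]           # causal cepstrum of the min-phase version
    q = np.arange(n_fft) * c                           # weight by n (the dc term drops out)
    return np.fft.rfft(q).real                         # sum_n n c(n) cos(2 pi k n / N)

def gd_direct(x, n_fft=512):
    """Reference GD via tau = Re{X conj(Y)}/|X|^2, Y = DFT{n x(n)}."""
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(np.arange(len(x)) * x, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-15)

# A one-pole minimum-phase response: both routes should agree
h = 0.6 ** np.arange(100)
```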

Murthy and Yegnanarayana [13] utilized (\cdot)^r, 0 < r \le 1, in place of the logarithmic operation on the magnitude spectrum to derive a GD spectrum. The sequence obtained from the inverse Fourier transform with r = 1 is an even sequence, which Murthy and Yegnanarayana call the spectral root cepstrum. The GD spectrum obtained from this sequence becomes spiky because the sequence contains an anticausal part along with the causal part. By truncating the anticausal part, Murthy and Yegnanarayana converted this sequence into one similar to a minimum phase signal and derived the GD spectrum from it.

Bozkurt et al. [15] obtained a signal by taking the inverse Fourier transform of the magnitude spectrum and applied a chirp function to it. They then derived a GD spectrum from it, which becomes a vocal-tract-dominated spectrum. Since the phase information is lost, it contains only the information available in the magnitude spectrum. Formant peaks appear with better resolution in this GD spectrum than in the magnitude spectrum.

3. FORMATION OF PROPOSED MODIFIED SPECTRUM

Now, from Sect. 2, it is clear that the problems of the magnitude spectrum can be compensated by combining it with the GD spectrum. In this section, we derive a method of forming a modified spectrum from the magnitude spectrum using the GD spectrum.

Fig. 2 A speech signal using 12-pole LPC-based (a) magnitude spectrum and (b) corresponding GD spectrum.

Suppose a speech signal s(n) is segmented with a Gaussian window w_g(n) [24] of the form

w_g(n) = \exp\left\{ -\frac{1}{2} \left( \frac{\alpha n}{L} \right)^2 \right\},

where L is the window size, n = 1, 2, 3, \ldots, L, and \alpha = 2.5. The resulting windowed signal can be expressed as

x(n) = w_g(n) s(n).

Since the proposed method involves deriving a signal that shows the characteristics of a minimum phase signal, the GD spectrum conveys the information of the magnitude spectrum with better resolution. Thus, the signal x(n) must be converted to a minimum-phase-like signal, either by reflecting the poles and zeros inside the unit circle or by imposing the causality condition on it. The N-point DFT of the windowed signal x(n) is X(k), defined by Eq. (1), with the polar form defined by Eq. (2). Taking the inverse DFT of only the magnitude spectrum, we have

x_m(n) = \frac{1}{N} \sum_{k=1}^{N} |X(k)| e^{j(2\pi/N)kn}.  (8)

As the signal x_m(n) is obtained from the inverse DFT of the magnitude spectrum, poles and zeros outside the unit circle are absent from it. However, the signal x_m(n) does not clearly show minimum phase characteristics. Yegnanarayana et al. [25] have shown that the values of x_m(n) outside the interval 0 < n \le N/2 violate the causality condition, which still imposes spikes in the GD spectrum. As a result, the GD spectrum taken from x_m(n) is not at an understandable level, as shown in Fig. 3. We impose the causality condition on the signal x_m(n) by defining a window w(n) of size N/2 as

w(n) = 1 if 1 \le n \le N/2, and 0 otherwise.

After multiplying the signal x_m(n) by w(n), we obtain the signal x'_m(n) using

x'_m(n) = x_m(n) w(n).  (9)

The GD spectrum taken from x'_m(n) becomes clearer, as shown in Fig. 4. Hence, we can say that the signal x'_m(n) shows characteristics like those of a minimum phase signal. Now, the N-point DFT of x'_m(n) is X_m(k), obtained using Eq. (1) or, equivalently, Eq. (2). Using X_m(k) in Eq. (3), we have

\arg X_m(k) = \theta_{X_m}(k) = \arctan\frac{X_{mI}(k)}{X_{mR}(k)}.

Taking the first-order derivative of the above expression, we find

\tau_m(k) = -\frac{d\{\arg X_m(k)\}}{dk},

where \tau_m(k) is the GD spectrum shown in Fig. 4. Thus, the complete magnitude spectral information is held in the GD spectrum obtained from the signal x'_m(n). The cepstral coefficients and the weighted cepstrum can be derived recursively from x'_m(n) using Eq. (5).

A negative value of the GD spectrum violates the causality of the signal system [7]. Therefore, the GD spectrum is half-wave rectified using Eq. (10), and the resulting GD spectrum \tau_{min}(k) is shown in Fig. 5:

\tau_{min}(k) = \tau_m(k) if \tau_m(k) > 0, and 0 otherwise.  (10)

Since our objective is to use this GD spectrum to modify the magnitude spectrum, the GD spectrum should be similar to the magnitude spectrum, and the GD spectrum \tau_{min}(k) obtained from Eq. (10) indeed resembles the magnitude spectrum. The modified spectrum X_{mg}(k) is formed by multiplying the GD spectrum \tau_{min}(k) with the magnitude spectrum |X(k)|:

X_{mg}(k) = |X(k)| \tau_{min}(k).  (11)

The similarities between the modified spectrum X_{mg}(k) and the magnitude spectrum |X(k)| are shown in Fig. 6. In the modified spectrum, the harmonics nearest to a resonant frequency are emphasized and the others are suppressed more than in the magnitude spectrum, as observed in Fig. 6. The proposed method is depicted as a block diagram in Fig. 7. Formant frequencies estimated from the modified spectrum have elevated accuracy.

Fig. 3 GD spectrum taken from the x_m(n) signal.

Fig. 4 GD spectrum taken from the x'_m(n) signal.
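Putting Sect. 3 together, the pipeline |X(k)| → x_m(n) → causal windowing → τ_m(k) → half-wave rectification → X_mg(k) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the Gaussian window centering and the derivative-based GD computation are assumptions, and the final AR/root-solving estimation step is omitted:

```python
import numpy as np

def modified_spectrum(s, n_fft=1024, alpha=2.5):
    """Modified spectrum X_mg(k) = |X(k)| * tau_min(k), Eqs. (8)-(11),
    returned on the positive-frequency bins."""
    L = len(s)
    n = np.arange(L)
    # Gaussian analysis window (alpha = 2.5 as in the paper; the exact
    # centering/normalization used here is an assumption)
    wg = np.exp(-0.5 * (alpha * (n - (L - 1) / 2) / (L / 2)) ** 2)
    x = wg * s
    X_mag = np.abs(np.fft.fft(x, n_fft))       # magnitude spectrum |X(k)|
    xm = np.fft.ifft(X_mag).real               # Eq. (8): zero-phase signal x_m(n)
    xm[n_fft // 2:] = 0.0                      # Eq. (9): causal window w(n) of size N/2
    # GD of the minimum-phase-like signal x'_m(n), via Y = DFT{n x(n)}
    Xm = np.fft.rfft(xm)
    Ym = np.fft.rfft(np.arange(n_fft) * xm)
    tau = (Xm.real * Ym.real + Xm.imag * Ym.imag) / (np.abs(Xm) ** 2 + 1e-12)
    tau_min = np.maximum(tau, 0.0)             # Eq. (10): half-wave rectification
    return X_mag[: n_fft // 2 + 1] * tau_min   # Eq. (11): modified spectrum

# A damped sinusoid at normalized frequency 0.1 (near bin 102 of 1,024):
# the modified spectrum should peak near the resonance
n = np.arange(300)
s = (0.99 ** n) * np.cos(2 * np.pi * 0.1 * n)
spec = modified_spectrum(s)
```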

4. ANALYSIS ON SYNTHETIC SPEECH

The Liljencrants–Fant glottal model [26] is used to simulate the source for generating synthetic speech. The robustness of the proposed method is tested on five synthetic Japanese vowels, generated according to the method described in [9], over a range of fundamental frequencies. The speech signal is sampled at 10 kHz. We aim to verify the accuracy of the proposed method by estimating the formant frequencies at different F0s. For this purpose, the parameters of the glottal model are kept constant for all values of F0. The formant frequencies used to synthesize the five vowels are listed in Table 1. The bandwidths of the five formants are set to 60, 100, 120, 175, and 281 Hz. The analysis order is set to 12. A Gaussian window of 30 ms and a frame shift of 5 ms are used. A 1,024-point DFT is used to obtain the magnitude and phase domain spectra.

4.1. Formant Frequency Estimation

To estimate the formant frequencies, the synthetic speech signal is analyzed by segmenting it at different window positions. Each segmented signal is analyzed using the method summarized in the block diagram shown in Fig. 7. Autoregressive (AR) coefficients are estimated using the Levinson–Durbin algorithm [27], and formants are measured by the root-solving technique. The DPPT method estimates the formant frequency by peak picking on a smoothed GD spectrum. The CheapTrick [20] method of the WORLD [21] vocoder provides a vocal-tract-dominated spectrum; the Levinson–Durbin algorithm is applied to this spectrum to calculate the AR coefficients. The WORLD formant estimator measures

Fig. 5 GD spectrum after half-wave rectification.

Fig. 6 Similarities between the proposed modified spectrum and the magnitude spectrum.

Fig. 7 Block diagram of the proposed method.

Table 1 Formant frequencies (Hz) used to synthesize the five vowels.

vowel   F1    F2    F3    F4    F5
/a/     813   1313  2688  3438  4438
/i/     375   2188  2938  3438  4438
/u/     375   1063  2188  3438  4438
/e/     438   1863  2688  3438  4438
/o/     438   1063  2688  3438  4438

the formants from these AR coefficients by the root-solving technique. The estimated formant frequency values are taken as the arithmetic mean over all window positions. The relative estimation error (REE) of the ith formant, RE_{Fi}, is calculated by averaging the individual Fi errors over all five vowels:

RE_{Fi} = \frac{1}{5} \sum_{j=1}^{5} \frac{|\hat{F}_{ij} - F_{ij}|}{F_{ij}},

where F_{ij} denotes the ith formant frequency of the jth vowel and \hat{F}_{ij} is the corresponding estimated value. The average REE, E, over the first three formants of all five vowels is

E = \frac{1}{15} \sum_{j=1}^{5} \sum_{i=1}^{3} \frac{|\hat{F}_{ij} - F_{ij}|}{F_{ij}}.
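The two error measures can be computed directly. A small illustrative helper, with the ground truth taken from Table 1 and a hypothetical estimator that overshoots every formant by 1%:

```python
import numpy as np

def relative_estimation_error(F_est, F_true):
    """RE_Fi (mean over the 5 vowels, per formant) and the overall
    average E over the first three formants of all five vowels.
    Both arguments: arrays of shape (5 vowels, 3 formants), in Hz."""
    rel = np.abs(np.asarray(F_est) - F_true) / F_true
    return rel.mean(axis=0), rel.mean()

# F1-F3 of the five vowels from Table 1
F_true = np.array([[813, 1313, 2688],
                   [375, 2188, 2938],
                   [375, 1063, 2188],
                   [438, 1863, 2688],
                   [438, 1063, 2688]], dtype=float)
# Hypothetical estimates: each formant overestimated by 1%
re_fi, E = relative_estimation_error(F_true * 1.01, F_true)
```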

For low F0s, Fig. 8 shows that the DPPT method has larger errors than the other methods. The F1 error is lowest for the proposed method. For F2 and F3, the proposed method and the WORLD formant estimator show similar performance. The average over F1, F2, and F3 shows that the proposed method outperforms the other methods. For high F0s, it is observed from Fig. 9 that the proposed method has much higher accuracy than the DPPT method. The DPPT method shows comparatively higher errors for high-pitched speech because, at high F0s, it is strongly affected by the glottal formant, so the original formant peak shifts towards the nearest harmonic. When the proposed method is compared with the WORLD formant estimator, results similar to the low-F0 case are observed. Since the proposed method is based on the combined effect

Fig. 8 REE of formant frequencies of five vowels at different F0 values up to 190 Hz.

Fig. 9 REE of formant frequencies of five vowels at different F0 values from 200 to 400 Hz.

of both the phase and magnitude spectra, its spectral resolution is higher. Thus, the modified spectrum results in more accurate formant estimation. A close inspection of Figs. 8 and 9 suggests that the proposed method can also be used for analyzing high-pitched speech with elevated accuracy.

5. ANALYSIS OF REAL SPEECH

An utterance of a sentence from the TIMIT [28] database, "Don't ask me to carry an oily rag like that," is analyzed. The sampling frequency of this speech is 16 kHz. A Gaussian window of 30 ms is used to segment the speech signal, with a frame shift of 5 ms. A 1,024-point DFT is used to compute the magnitude and GD spectra. The signal is pre-emphasized by 1 - z^{-1}. The analysis order used for formant estimation is 16, and seven formants are detected. The spectrogram and the formant contour of the first three formants of the sentence are shown in Fig. 10. A close inspection of the formant contour reveals that the formants of the proposed method are clearer in the voiced regions. The WORLD formant estimator shows better formant contours than the DPPT method.

We analyzed real speech signals of the vowel sounds /a/ and /o/ spoken by a male and a female speaker. The frame size is 30 ms, and the frame shift is 5 ms. The speech signal is pre-emphasized by 1 - z^{-1}, and a prediction order of 12 is used in the analysis. The sampling frequency is 10 kHz. The results of analyzing these vowels are represented in standard F2–F1 plots [29] (Appendix) for the low-pitched male and high-pitched female speech in Figs. 11–14. In both cases, the proposed method produced F2–F1 values that, for almost all frames, lie within a confined region. The WORLD formant estimator shows some scattered values for the female vowel /a/, but for the other vowels it shows results similar to those of the proposed method. However, the DPPT method produced F2–F1 values that are more scattered, and some of the F2 values are treated as F1 values because, in DPPT, F1 is affected by the glottal formant. From this observation, it is

Fig. 10 A sentence from the TIMIT database, "Don't ask me to carry an oily rag like that." (a) Spectrogram. F1–F3 formant contour obtained using the (b) proposed method, (c) WORLD formant estimator, and (d) DPPT method of COVAREP.

[Fig. 11 F2 vs F1 plot of vowel /a/ spoken by a male speaker. Axes: F2 (Hz) vs F1 (Hz); series: Proposed, WORLD, DPPT.]

Acoust. Sci. & Tech. 42, 2 (2021)


clear that the proposed method is less affected by the glottal formant. From all of these analyses, it is evident that the proposed method outperforms the existing methods.

6. STABILITY OF AR FILTER

If the inverse DFT is taken of the modified spectrum \(X_{mg}(k)\), we obtain the signal \(x_{mg}(n)\) using

\[ x_{mg}(n) = \frac{1}{N} \sum_{k=0}^{N-1} X_{mg}(k)\, e^{j(2\pi/N)kn} \qquad (12) \]

Since x_mg(n) is formed by the inverse DFT of the product of two spectra, |X(k)| and τ_min(k), which is similar to a power spectrum, it can be shown that this signal has properties similar to those of the autocorrelation sequence of a signal. The standard autocorrelation function produces a stable AR filter. It can also be shown that the AR filter resulting from x_mg(n) is stable. We know that the magnitude spectrum has all its poles and zeros within the unit circle, so the AR filter resulting from that signal becomes stable. Also, τ_min(k) retains nonnegative values. Thus, the modified spectrum formed by the product of the magnitude spectrum |X(k)| and the GD spectrum τ_min(k) produces a positive semidefinite matrix as an autocorrelation, which guarantees the stability of the AR filter.
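The stability argument can be checked numerically. The sketch below is illustrative, not the paper's code: it treats the inverse DFT of a nonnegative one-sided spectrum as an autocorrelation-like sequence, runs the standard Levinson–Durbin recursion [27], and verifies that the resulting AR polynomial has all its roots inside the unit circle.

```python
import numpy as np

def levinson(r, order):
    """Standard Levinson-Durbin recursion: AR coefficients a
    (with a[0] = 1) from an autocorrelation-like sequence r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / err
        a_prev = a[1:m].copy()
        a[1:m] = a_prev + k * a_prev[::-1]   # update inner coefficients
        a[m] = k                              # new reflection coefficient
        err *= (1.0 - k * k)
    return a

def ar_is_stable(S, order=12):
    """S: one-sided, nonnegative spectrum (e.g. |X(k)| * tau_min(k)).
    Returns the AR coefficients and whether all poles lie strictly
    inside the unit circle."""
    x_mg = np.fft.irfft(S)                    # autocorrelation-like signal
    a = levinson(x_mg, order)
    return a, bool(np.all(np.abs(np.roots(a)) < 1.0))
```

For any positive semidefinite autocorrelation sequence, the recursion yields reflection coefficients of magnitude at most one, which is exactly the stability guarantee invoked above.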

7. CONCLUSION

A new spectral representation based on the magnitude spectrum has been proposed in this research to estimate vocal tract characteristics accurately. For this purpose, we exploited the benefits of the GD spectrum by multiplying the magnitude spectrum with it, obtaining a high-resolution spectrum with spectral leakage compensation. It is shown that a properly estimated GD spectrum combined with the magnitude spectrum can be used for extracting closely spaced and low-amplitude formant frequencies. At the same time, the proposed method can be used to estimate the formant frequencies of high-pitched speech with elevated accuracy.

ACKNOWLEDGEMENT

We are grateful to the Information and Communication Technology (ICT) Division of the Bangladesh Government for providing funds for this research. We are also thankful to the three anonymous reviewers for their important comments and valuable suggestions on this manuscript.

[Fig. 12 F2 vs F1 plot of vowel /o/ spoken by a male speaker. Axes: F2 (Hz) vs F1 (Hz); series: Proposed, WORLD, DPPT.]

[Fig. 13 F2 vs F1 plot of high-pitched vowel /a/ spoken by a female speaker at 300 Hz. Axes: F2 (Hz) vs F1 (Hz); series: Proposed, WORLD, DPPT.]

[Fig. 14 F2 vs F1 plot of high-pitched vowel /o/ spoken by a female speaker at 300 Hz. Axes: F2 (Hz) vs F1 (Hz); series: Proposed, WORLD, DPPT.]

REFERENCES

[1] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Am., 50, 637–655 (1971).

[2] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, 63, 561–580 (1975).

[3] C. Magi, J. Pohjalainen, T. Backstrom and P. Alku, "Stabilised weighted linear prediction," Speech Commun., 51, 401–411 (2009).

[4] A. M. Noll, "Cepstrum pitch determination," J. Acoust. Soc. Am., 41, 293–309 (1967).

[5] R. W. Schafer and A. V. Oppenheim, Discrete Time Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1989).

[6] E. Loweimi, J. Barker and T. Hain, "Source–filter separation of speech signal in the phase domain," Proc. Interspeech 2015, pp. 598–602 (2015).

[7] E. Loweimi, "Robust phase-based speech signal processing from source–filter separation to model-based robust ASR," Ph.D. Dissertation, University of Sheffield (2018).

[8] R. W. Schafer and L. R. Rabiner, "System for automatic formant analysis of voiced speech," J. Acoust. Soc. Am., 47, 634–648 (1970).

[9] M. S. Rahman and T. Shimamura, "Formant frequency estimation of high-pitched speech by homomorphic prediction," Acoust. Sci. & Tech., 26, 502–510 (2005).

[10] M. S. Rahman and T. Shimamura, "Linear prediction using refined autocorrelation function," EURASIP J. Audio Speech Music Process., 1, 1–9 (2007).

[11] K. K. Paliwal and L. Alsteris, "Usefulness of phase spectrum in human speech perception," 8th Eur. Conf. Speech Communication and Technology, pp. 2117–2120 (2003).

[12] D. Gowda, J. Pohjalainen, M. Kurimo and P. Alku, "Robust formant detection using group delay function and stabilized weighted linear prediction," Proc. Interspeech 2013, pp. 49–53 (2013).

[13] H. A. Murthy and B. Yegnanarayana, "Formant extraction from group delay function," Speech Commun., 10, 209–221 (1991).

[14] H. A. Murthy and B. Yegnanarayana, "Group delay functions and its applications in speech technology," Sadhana, 36, 745–782 (2011).

[15] B. Bozkurt, L. Couvreur and T. Dutoit, "Chirp group delay analysis of speech signals," Speech Commun., 49, 159–176 (2007).

[16] H. A. Murthy and V. Gadde, "The modified group delay function and its application to phoneme recognition," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) 2003, pp. 1–68 (2003).

[17] B. Bozkurt, "Zeros of the z-transform (ZZT) representation and chirp group delay processing for the analysis of source and filter characteristics of speech signals," Thesis work, University Polytechnique de Mons, Belgium and LIMSI-CNRS, France (October 2005).

[18] G. Degottex, J. Kane, T. Drugman, T. Raitio and S. Scherer, "COVAREP — A collaborative voice analysis repository for speech technologies," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) 2014, pp. 960–964 (2014).

[19] B. Bozkurt, T. Dutoit, B. Doval and C. d'Alessandro, "Improved differential phase spectrum processing for formant tracking," Proc. 8th Int. Conf. Spoken Language Processing (ICSLP) 2004, pp. 2421–2424 (2004).

[20] M. Morise, "CheapTrick, a spectral envelope estimator for high-quality speech synthesis," Speech Commun., 67, 1–7 (2015).

[21] M. Morise, F. Yokomori and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. Inf. Syst., 99, 1877–1884 (2016).

[22] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals (Prentice-Hall Inc., Englewood Cliffs, NJ, 1975).

[23] M. Cerna and A. F. Harvey, The Fundamentals of FFT-based Signal Analysis and Measurement (Application Note 041, National Instruments, 2000).

[24] F. J. Harris, "On the use of windows for harmonic analysis with the discrete Fourier transform," Proc. IEEE, 66, 51–83 (1978).

[25] B. Yegnanarayana, D. Saikia and T. Krishnan, "Significance of group delay functions in signal reconstruction from spectral magnitude or phase," IEEE Trans. Acoust. Speech Signal Process., 32, 610–623 (1984).

[26] G. Fant, J. Liljencrants and Q.-G. Lin, "A four-parameter model of glottal flow," STL-QPSR, 4, 1–13 (1985).

[27] J. Durbin, "The fitting of time series models," Rev. Inst. Int. Stat., 28, 233–243 (1960).

[28] V. Zue, S. Seneff and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Commun., 9, 351–356 (1990).

[29] D. Watt and A. Fabricius, "Evaluation of a technique for improving the mapping of multiple speakers' vowel spaces in the F1–F2 plane," Leeds Working Papers in Linguistics and Phonetics, 9, 159–173 (2002).

Husne Ara Chowdhury received her B.Sc. (Eng.) degree in computer science and engineering from Shahjalal University of Science and Technology, Bangladesh, in 2002. She joined the CSE department of the same university as a junior faculty member in 2004 and became an assistant professor in 2007. She has been working toward her Ph.D. in speech signal analysis at the same university since November 2018. Her research interests are speech recognition, synthesis and speech coding.

M. Shahidur Rahman received his B.Sc. and M.Sc. degrees in electronics and computer science from Shahjalal University of Science and Technology, Sylhet, Bangladesh, in 1995 and 1997, respectively. He received his Ph.D. degree in mathematical information systems in 2006 from Saitama University, Saitama, Japan. In 1997, he joined Shahjalal University of Science and Technology as a junior faculty member, and he is currently serving as a professor. He was a JSPS Postdoctoral Research Fellow from 2009 to 2011 at Saitama University. His current research interests include speech analysis, speech synthesis, speech recognition, enhancement of bone-conducted speech and digital signal processing.
