Formant estimation from speech signal using the magnitude spectrum
modified with group delay spectrum
Husne Ara Chowdhury* and Mohammad Shahidur Rahman
Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet 3114, Bangladesh
(Received 1 July 2020, Accepted for publication 26 October 2020)
Abstract: The magnitude spectrum is a popular mathematical tool for speech signal analysis. In this paper, we propose a new technique for improving the performance of the magnitude spectrum by utilizing the benefits of the group delay (GD) spectrum to estimate the characteristics of a vocal tract accurately. The traditional magnitude spectrum suffers from difficulties when estimating vocal tract characteristics, particularly for high-pitched speech, owing to its low resolution and high spectral leakage. After phase domain analysis, it is observed that the GD spectrum has low spectral leakage and high resolution owing to its additive property. Thus, the magnitude spectrum modified with its GD spectrum, referred to as the modified spectrum, is found to significantly improve the estimation of formant frequency over traditional methods. The accuracy is tested on synthetic vowels for a wide range of fundamental frequencies up to the high-pitched female speaker range. The validity of the proposed method is also verified by inspecting the formant contour of an utterance from the Texas Instruments and Massachusetts Institute of Technology (TIMIT) database and standard F2-F1 plots of natural vowel speech spoken by male and female speakers. The results are compared with two state-of-the-art methods, and our proposed method performs better than both.
Keywords: Modified spectrum, Phase domain analysis, Spectrogram, GD spectrum, Glottal formant effect
1. INTRODUCTION
The speech production system can be modeled as the convolution of a source and a system, or filter. Most of the
speech signal analysis research is based on the separation
of the source and filter. Atal and Hanauer separated the
source and filter by the linear predictive coding (LPC) [1].
However, the conventional linear prediction (LP) technique
has some limitations [2]. Magi et al. [3] used the stabilized
weighted LP to overcome the problem of LP. Noll [4]
used the cepstrum method for source–filter separation. The
source is characterized by quasi-periodic motion in the
vocal folds called fundamental frequency (F0). In addition,
the system or filter is formed by a connection with several
tubes of different cross-sectional areas representing the
vocal tract. This vocal tract has some resonance frequen-
cies that correspond to the cross-sectional areas and are
called formants, denoted by F1, F2, F3, etc.
Fundamental and formant frequencies are well high-
lighted and distinguished in the magnitude spectrum
domain [5]. The local minimum and maximum in the
magnitude spectrum correspond to zeros and poles in the
z domain. This forms the basis of source–filter separation
described by Loweimi and Barker [6,7]. Schafer and
Rabiner [8] used the magnitude spectrum for formant
estimation. Rahman and Shimamura employed the magni-
tude spectrum to show improved accuracy of formant
estimation [9,10]. As a result, the magnitude spectrum
becomes a natural choice for speech analysis.
Some human perception experiments, such as those
of Paliwal and Alsteris [11], showed that the short-time
phase spectrum contributes to speech intelligibility and
plays an important role in high-quality speech synthesis.
Unfortunately, the phase spectrum is in a wrapped form
[7], and the negative first-order derivative of its unwrapped
version, called the GD function, is generally preferred. The
GD function is related to the overall delay of the input
signal passing through the system. The GD spectrum
represents the phase spectrum information but its overall
shape is more understandable. A properly estimated GD
spectrum has a high spectral resolution, lower spectral
leakage, and resembles a magnitude spectrum. Gowda
et al. [12] applied the GD function on the stabilized
*e-mail: [email protected] [doi:10.1250/ast.42.93]
Acoust. Sci. & Tech. 42, 2 (2021) © 2021 The Acoustical Society of Japan
weighted linear prediction spectrum for formant estima-
tion.
In the proposed method, the magnitude spectrum is
modified using the GD spectrum obtained from an
appropriate phase domain analysis of the speech signal.
The resulting modified spectrum can be used for robust
formant estimation.
The GD spectrum often presents spurious spikes that
make its processing very problematic. Murthy and
Yegnanarayana proposed various methods [13,14] to
remove these spikes. Bozkurt et al. [15] tried to remove
spikes in the GD spectrum by evaluating it on an analysis circle greater than the unit circle. The minimum-phase signal can
be a choice for avoiding spikes. The GD spectrum obtained
from the minimum-phase signal has a better frequency
resolution having peaks at the poles and valleys at the
zeros. It also resembles the magnitude spectrum explained
by Loweimi [7]. A signal is said to be a minimum phase signal if it corresponds to a causal and stable system whose inverse is also causal and stable.
In their method [13], Murthy and Yegnanarayana
estimated formant frequency from the GD spectrum of a
short analysis window segment of less than one pitch
period, which is not appropriate for high-pitched speech
owing to an insufficient number of samples. This method
is also pitch-dependent. Murthy and Gadde [16] modified
the GD spectrum by its cepstrally smoothed spectrum
and applied it in phoneme detection. Murthy and
Yegnanarayana showed different approaches of formant
extraction [14]. The approaches in [14] are based on only
synthetic speech. The study by Bozkurt [17] revealed that
the existence of zeros close to the unit circle is responsible
for the spikes in his analysis of the characteristics of the
source and filter using zeros of z transform. The Collab-
orative voice analysis repository (COVAREP) [18] pack-
age used the differential-phase peak tracking (DPPT)
method [19] for formant tracking, but is found to be
affected by glottal formant and unsuitable for high-pitched
speech analysis. The methods described in [17,18], and
[19] show some problems of robustness. The CheapTrick
[20] method of the WORLD [21] vocoder shows a
smoothed spectrum that becomes a vocal-tract-dominated
spectrum. The main problem with utilizing this spectrum as
a formant estimator is that some peaks other than formant
peaks arise, and some of the true peaks arise with low
resolution. This method is also pitch-dependent.
On the basis of the above observations, we propose a technique that uses the GD function by deriving a minimum-phase-like signal from which the GD spectrum is computed. The resulting GD spectrum is used to modify
the magnitude spectrum. The modified spectrum empha-
sizes the formant peaks, reducing the glottal formant
impact. The effectiveness of the proposed method is
measured by estimating the formant frequencies and
comparing it with the recent DPPT method of the
COVAREP tool and a formant estimator utilizing the
spectrum of the WORLD vocoder. The proposed method
shows significant improvement over the DPPT of
COVAREP and WORLD formant estimator.
2. PROBLEM ANALYSIS
The basics of magnitude- and phase-based signal processing, their significance, similarities and problems, and the available solutions are discussed.
2.1. Magnitude Spectrum and Its Significance
Fourier analysis is an important and popular mathe-
matical tool for speech signal analysis. The N point
discrete Fourier transform (DFT) [22] of the discrete signal
$x(n)$ can be defined as

$$X(k) = \frac{1}{L}\sum_{n=1}^{L} x(n)\,e^{-j(2\pi/L)kn}, \qquad (1)$$

where $n$, $L$, and $k$ indicate the sample number, frame length, and frequency index, respectively, with $k = 1, 2, 3, \ldots, N$. The term $e^{-j(2\pi/L)kn}$ in Eq. (1) forms the basis functions of the Fourier transform.
$X(k)$ forms a complex quantity in Cartesian coordinates as follows:

$$X(k) = X_R(k) + jX_I(k).$$

Here, $X_R(k)$ and $X_I(k)$ indicate the real and imaginary parts, respectively. The complex numbers are represented in polar form as
$$X(k) = |X(k)|\,e^{j\theta_X(k)}, \qquad (2)$$

where $|X(k)|$ and $\theta_X(k)$ are the magnitude spectrum and phase spectrum, respectively, which can also be expressed as

$$|X(k)| = \sqrt{X_R(k)^2 + X_I(k)^2}, \quad \theta_X(k) = \arctan\frac{X_I(k)}{X_R(k)}. \qquad (3)$$
The magnitude spectrum characterizes the speech signal well and is easy to interpret. A
synthetic speech signal of vowel /a/ is segmented using
a 30 ms length of the Gaussian window. The time-domain
speech segment and corresponding log magnitude spec-
trum, phase spectrum, and GD spectrum are shown in
Fig. 1. The log magnitude spectrum shown in Fig. 1(b) has
a fine structure that is closely related to the fundamental
frequency and its harmonics. The resonances are also clear
here. This forms a basis of the source–filter separation
using the (log of the) magnitude spectrum. The local
minimum and maximum in the magnitude spectrum
correspond to zeros and poles in the z domain, which is
useful to separate the source and filter. Rahman and
Shimamura [9] attempted to separate the source and filter
using the magnitude spectrum. The understandable and
expressive behavior of the magnitude spectrum makes it
applicable to a wider range of use.
The main problem with the magnitude spectrum is its
low resolution and high spectral leakage. Although window
functions are used to minimize these problems, the window
function cannot eliminate spectral leakage. It uniformly
distributes the leakage effect over all frequency points. As
a result, the leakage effect is not visible on the spectrum of
the windowed signal. If the magnitude spectrum is taken
from a short duration of real measurements, the resolution
degrades. Again, the magnitude spectrum is formed as the product of the magnitude spectra of the individual components; hence, the overall resolution is reduced.
Spectral leakage [23] takes place owing to the lack of
integer multiple cycles of each component spectrum.
2.2. Phase Spectrum
The phase spectrum described by Eq. (3) is depicted in
Fig. 1(c). Its behavior is not as clear as that of the
magnitude spectrum. It shows a chaotic nature because of
phase wrapping. The main difficulty in unwrapping the
phase is that the addition of an integer multiple of $2\pi$ (as in the following expression) to the principal phase spectrum does not change the complex number $X(k)$:

$$X(k) = |X(k)|\,e^{j(\theta_X(k) + 2m\pi)}.$$
An alternative way to unwrap the phase spectrum is to
take the derivative of the phase spectrum using the
expression
$$\arg' X(k) = \frac{d\{\arg X(k)\}}{dk} = \frac{X_R(k)X_I'(k) - X_R'(k)X_I(k)}{|X(k)|^2},$$

where $'$ denotes the derivative with respect to $k$. Two factors limit this process: one is a small FFT size and the other is noise. Increasing the FFT size improves the accuracy.
2.3. GD Spectrum
The GD spectrum is an alternative and meaningful representation of the phase spectrum. It is defined as the negative of the spectral derivative of the unwrapped phase spectrum, as given in

$$\tau(k) = -\frac{d\{\arg X(k)\}}{dk}. \qquad (4)$$
Another method of computing the GD spectrum without unwrapping the phase spectrum is

$$\tau(k) = -\frac{X_R(k)X_I'(k) - X_R'(k)X_I(k)}{|X(k)|^2}.$$
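This derivative-free form admits a standard discrete realization: with $Y(k)$ the DFT of $n\,x(n)$, the group delay equals $(X_R Y_R + X_I Y_I)/|X(k)|^2$ in samples (the same identity used by, e.g., MATLAB's grpdelay). The sketch below is illustrative, not code from the paper.

```python
import numpy as np

# Group delay without phase unwrapping: for X(k) = DFT{x(n)} and
# Y(k) = DFT{n * x(n)}, tau(k) = (X_R Y_R + X_I Y_I) / |X(k)|^2 (samples).
def group_delay(x, nfft=1024, eps=1e-12):
    n = np.arange(len(x))
    X = np.fft.fft(x, nfft)
    Y = np.fft.fft(n * x, nfft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X)**2 + eps)

# Sanity check: a pure delay of d samples has constant group delay d.
d = 5
x = np.zeros(64)
x[d] = 1.0
tau = group_delay(x, nfft=64)   # constant group delay of 5 samples
```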
Bozkurt et al. [15] showed that if the speech signal
cannot be segmented with the appropriate window size
and position, the GD spectrum taken from this signal
becomes spiky. The window center must be synchronized
with the glottal closure instant, which is a challenging task.
If the GD spectrum becomes spiky, neither the fundamental
frequency nor the formants can be distinguished visually.
Another cause of spikes is poles and zeros being located
close to the unit circle, as discussed elaborately by Bozkurt
[17]. The choice of minimum phase signal can eliminate
the aforementioned inconvenience. The GD spectrum can
be effectively used when the signal is in the minimum
phase. A signal is in the minimum phase if and only if all
the poles and zeroes of the z transform lie within the unit
circle. Mathematically, this can be expressed as
$$X(z) = \frac{b_0\prod_{i=1}^{m}\left(1 - b_i z^{-1}\right)}{a_0\prod_{i=1}^{n}\left(1 - a_i z^{-1}\right)},$$

where $|b_i| < 1$ and $|a_i| < 1$ for all $i$.

Fig. 1 A speech segment of synthetic vowel /a/. (a) Windowed speech segment in the time domain. (b) Log magnitude spectrum. (c) Phase spectrum. (d) GD spectrum.
2.4. Comparison of Magnitude Spectrum with GD
Spectrum
For the convolution of two signals in the time domain, for example, a signal $x(n)$ and an impulse response $h(n)$ of a filter, the magnitude spectrum and GD spectrum can be expressed by

$$\mathcal{F}\{x(n) * h(n)\} = X(k)H(k) = |X(k)|\,|H(k)|\,e^{j(\arg X(k) + \arg H(k))}$$
$$\Rightarrow \tau_{x*h}(k) = \tau_x(k) + \tau_h(k),$$

where $\mathcal{F}$ denotes the DFT, $X$ and $H$ denote the DFTs of the signal and the impulse response, and $\tau$ indicates the GD spectrum.
It is known that the convolution in the time domain
becomes a multiplication in the magnitude spectrum
domain and an addition in the GD spectrum domain.
Suppose a speech signal is composed of five resonators. The magnitude spectrum and the corresponding GD spectrum are shown in Fig. 2. Since the GD spectrum is formed by adding the contributions of the individual resonators characterized by the poles, its overall resolution is high. In contrast, the magnitude spectrum is the combined multiplicative effect of the individual component resonators, so its overall resolution is degraded.
2.5. Derivation of GD Spectrum from Magnitude
Spectrum
The GD spectrum can be derived from the signal obtained by the inverse Fourier transform of the magnitude spectrum, after that signal is converted into one resembling a minimum phase signal. The conversion method is elaborately illustrated in Sect. 3.
Let $u(n)$ be a minimum phase signal whose Fourier transform can be expressed as

$$U(e^{j\omega}) = |U(e^{j\omega})|\,e^{j\theta(e^{j\omega})}.$$

We can show that [14]

$$\ln|U(e^{j\omega})| = \frac{c[0]}{2} + \sum_{n=1}^{\infty} c(n)\cos(n\omega). \qquad (5)$$

The unwrapped phase function is expressed as

$$\theta(e^{j\omega}) = \sum_{n=1}^{\infty} c(n)\sin(n\omega), \qquad (6)$$

where $\{c(n)\}$ are the cepstral coefficients [22]. Taking the negative derivative of $\theta(e^{j\omega})$ with respect to $\omega$, we can obtain the GD spectrum as

$$\tau(e^{j\omega}) = -\sum_{n=1}^{\infty} n\,c(n)\cos(n\omega). \qquad (7)$$
In the case of the minimum phase signal, we note from
Eqs. (5) and (6) that the magnitude spectrum and the phase
spectrum are related by the cepstral coefficients. Equa-
tion (7) derives the GD spectrum from the phase spectrum
and this GD function becomes an even function. It can be
observed that the GD spectrum can also be derived from
the log magnitude spectrum of Eq. (5) by multiplying each
cepstral coefficient by n and ignoring the dc term.
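This cepstral weighting can be sketched numerically. Note that with numpy's forward-DFT sign convention, the group delay of a minimum-phase pole comes out positive when computed as $+\sum_n n\,c(n)\cos(n\omega)$; the overall sign in Eq. (7) depends on the transform convention. The single-pole test spectrum below is illustrative.

```python
import numpy as np

# Sketch: GD spectrum derived from the magnitude spectrum alone via
# cepstral coefficients (Eqs. (5)-(7)), valid for minimum-phase signals.
def gd_from_magnitude(mag):
    N = len(mag)
    c_r = np.fft.ifft(np.log(mag)).real      # real cepstrum of log|X|
    M = N // 2
    n = np.arange(1, M)
    c = 2.0 * c_r[1:M]                       # c(n) of Eq. (5)
    omega = 2 * np.pi * np.arange(N) / N
    # Weight each cepstral coefficient by n, then sum the cosine series.
    return (n * c) @ np.cos(np.outer(n, omega))

# Check against a single real pole at radius a: tau(0) = a / (1 - a).
a = 0.5
omega = 2 * np.pi * np.arange(1024) / 1024
mag = np.abs(1.0 / (1.0 - a * np.exp(-1j * omega)))
tau = gd_from_magnitude(mag)                 # tau[0] is close to 1.0
```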
Murthy and Yegnanarayana [13] utilized $(\cdot)^r$, $0 < r \le 1$, in place of the logarithmic operation on the magnitude spectrum to derive a GD spectrum. The sequence obtained from the inverse Fourier transformation by setting $r = 1$ gives an even sequence, which Murthy and Yegnanarayana call the spectral root cepstrum. The GD spectrum obtained from this sequence becomes spiky because the sequence contains an anticausal part along with the causal part. By truncating the anticausal part, Murthy and Yegnanarayana converted this signal into one similar to the minimum phase signal and derived the GD spectrum from it.
Bozkurt et al. [15] obtained a signal by taking the inverse Fourier transform of its magnitude spectrum and applied a chirp function to it. They then derived a GD spectrum from it, which becomes a vocal-tract-dominated spectrum. Since the phase information is lost, it contains only the information available in the magnitude spectrum. Formant peaks appear with better resolution in this GD spectrum than in the magnitude spectrum.
3. FORMATION OF PROPOSED MODIFIED SPECTRUM
Now, from Sect. 2, it is clear that the problems of the magnitude spectrum can be compensated by combining it with the GD spectrum.

Fig. 2 A speech signal using 12-pole LPC-based (a) magnitude spectrum and (b) corresponding GD spectrum.

In this section, we derive a method
of forming a modified spectrum from the magnitude
spectrum using the GD spectrum.
Suppose a speech signal $s(n)$ has to be segmented with a Gaussian window $w_g(n)$ [24] of the expression

$$w_g(n) = \exp\left\{-\frac{1}{2}\left(\frac{\sigma n}{L}\right)^2\right\},$$

where $L$ is the window size, $n = 1, 2, 3, \ldots, L$, and $\sigma = 2.5$. The resulting windowed signal can be expressed as

$$x(n) = w_g(n)\,s(n).$$
Since the proposed method involves deriving a signal that
shows the characteristics of a minimum phase signal, the
GD spectrum conveys the information of the magnitude
spectrum with better resolution. Thus, the signal xðnÞ must
be converted to a signal like the minimum phase signal
either by reflecting the poles and zeros inside the unit circle
or by imposing the causality condition on it. The N-point DFT of the windowed signal $x(n)$ is $X(k)$, defined by Eq. (1), and its polar form is defined by Eq. (2). Taking the inverse DFT of only the magnitude spectrum, we have

$$x_m(n) = \frac{1}{N}\sum_{k=1}^{N} |X(k)|\,e^{j(2\pi/N)kn}. \qquad (8)$$
As the signal xmðnÞ is obtained from the inverse DFT of the
magnitude spectrum, the poles and zeros outside the unit
circle are absent in it. However, the signal xmðnÞ does not
show the minimum phase signal characteristics clearly.
Yegnanarayana et al. [25] have shown that the values of $x_m(n)$ outside the interval $0 < n \le N/2$ violate the causality condition, which still imposes spikes in the GD spectrum. As a result, the GD spectrum taken from $x_m(n)$ is not at an understandable level, as shown in Fig. 3. We imposed the causality condition on the signal $x_m(n)$ by defining a window $w(n)$ of size $N/2$ as follows:

$$w(n) = \begin{cases} 1, & \text{if } 1 \le n \le N/2 \\ 0, & \text{otherwise.} \end{cases}$$

After multiplying the signal $x_m(n)$ by $w(n)$, we obtain the signal $x_m'(n)$ using

$$x_m'(n) = x_m(n)\,w(n). \qquad (9)$$
The GD taken from the spectrum of $x_m'(n)$ becomes clearer, as shown in Fig. 4. Hence, we can say that the signal $x_m'(n)$ shows characteristics like those of the minimum phase signal. Now, the N-point DFT of $x_m'(n)$ is $X_m(k)$, obtained by using Eq. (1) or, equivalently, Eq. (2). Using $X_m(k)$ in Eq. (3), we have

$$\arg X_m(k) = \theta_{X_m}(k) = \arctan\frac{X_{mI}(k)}{X_{mR}(k)}.$$
Taking the first-order derivative of the above expression, we can find

$$\tau_m(k) = -\frac{d\{\arg X_m(k)\}}{dk},$$

where $\tau_m(k)$ is the GD spectrum shown in Fig. 4. Thus, the complete magnitude spectral information is held in the GD spectrum obtained from the signal $x_m'(n)$. The cepstral coefficients and the weighted cepstrum can be derived recursively from $x_m'(n)$ using Eq. (5).
The negative value of the GD spectrum violates the
causality of the signal system [7]. Therefore, the GD
spectrum is half-wave rectified using Eq. (10), and the resulting GD spectrum $\tau_{\min}(k)$ is shown in Fig. 5:

$$\tau_{\min}(k) = \begin{cases} \tau_m(k), & \text{if } \tau_m(k) > 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (10)$$
Since our objective is to use this GD spectrum to modify the magnitude spectrum, the GD spectrum should be similar to the magnitude spectrum. The resulting GD spectrum $\tau_{\min}(k)$ obtained from Eq. (10) indeed resembles the magnitude spectrum. The modified spectrum $X_{mg}(k)$ is
Fig. 3 GD spectrum taken from the $x_m(n)$ signal.

Fig. 4 GD spectrum taken from the $x_m'(n)$ signal.
formed by the multiplication of the GD spectrum $\tau_{\min}(k)$ with the magnitude spectrum $|X(k)|$ as

$$X_{mg}(k) = |X(k)|\,\tau_{\min}(k). \qquad (11)$$
The similarities between the modified spectrum $X_{mg}(k)$ and the magnitude spectrum $|X(k)|$ are shown in Fig. 6. In this modified spectrum, the harmonics nearest to the resonant frequencies are emphasized and the others are suppressed more strongly than in the magnitude spectrum, as observed in Fig. 6.
The proposed method is depicted as the block diagram
in Fig. 7. The formant frequency estimated from the
modified spectrum has elevated accuracy.
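The pipeline above can be sketched end to end as follows. This is a minimal sketch, assuming the Gaussian window form reconstructed earlier and a derivative-free GD computation via the DFT of $n\,x(n)$ (an implementation detail not spelled out in the paper); it is not the authors' reference implementation.

```python
import numpy as np

def modified_spectrum(s, nfft=1024, sigma=2.5):
    # Gaussian analysis window (sigma = 2.5, as stated in the paper).
    L = len(s)
    n = np.arange(1, L + 1)
    x = np.exp(-0.5 * (sigma * n / L) ** 2) * s

    X = np.fft.fft(x, nfft)
    mag = np.abs(X)

    # Eq. (8): zero-phase signal from the magnitude spectrum alone.
    xm = np.fft.ifft(mag).real

    # Eq. (9): causality window of size N/2 -> minimum-phase-like signal.
    w = np.zeros(nfft)
    w[:nfft // 2] = 1.0
    xm_c = xm * w

    # GD spectrum of the causal signal (derivative-free form).
    Xm = np.fft.fft(xm_c)
    Ym = np.fft.fft(np.arange(nfft) * xm_c)
    tau = (Xm.real * Ym.real + Xm.imag * Ym.imag) / (np.abs(Xm) ** 2 + 1e-12)

    # Eq. (10): half-wave rectification; Eq. (11): modify the magnitude spectrum.
    return mag * np.maximum(tau, 0.0)

# Example: one 30 ms frame of a synthetic harmonic test signal at 10 kHz.
t = np.arange(300) / 10_000
s = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
Xmg = modified_spectrum(s)
```

Formants are then estimated by fitting an AR model to this modified spectrum, as described in the next section.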
4. ANALYSIS ON SYNTHETIC SPEECH
The Liljencrants-Fant glottal model [26] is used to simulate the source for generating synthetic speech. The robustness of the proposed method is tested on five synthetic Japanese vowels generated according to the method described in [9] over a range of fundamental frequencies. The speech
signal is sampled at 10 kHz. We aim to verify the accuracy
of the proposed method by estimating the formant
frequencies at different F0s. For this purpose, the param-
eters of the glottal model are kept constant for all values of
F0s. The formant frequencies to synthesize the five vowels
are listed in Table 1. Bandwidths of the five formants are
set to 60, 100, 120, 175, and 281 Hz. The analysis order is
set to 12. A Gaussian window of 30 ms and a frame shift
of 5 ms are used. A 1,024-point DFT is used to obtain the
magnitude and phase domain spectra.
4.1. Formant Frequency Estimation
To estimate the formant frequencies, analysis is
conducted by segmenting the synthetic speech signal on
different window positions. Each segmented signal is
analyzed using the method described briefly in the block
diagram shown in Fig. 7. Autoregressive (AR) coefficients
are estimated using the Levinson–Durbin algorithm [27].
Formants are measured by the root-solving technique. The
DPPT method estimates the formant frequency by peak
picking on a smoothed GD spectrum. The CheapTrick [20]
method of the WORLD [21] vocoder provides a vocal-
tract-dominated spectrum. The Levinson–Durbin algorithm
is employed on this spectrum to calculate the AR
coefficients. The WORLD formant estimator measures
Fig. 5 GD spectrum after half-wave rectification.

Fig. 6 Similarities between the proposed modified spectrum and the magnitude spectrum.
Fig. 7 Block diagram of the proposed method.
Table 1 Formant frequencies used to synthesize the five vowels.

vowel   F1    F2    F3    F4    F5
/a/     813   1313  2688  3438  4438
/i/     375   2188  2938  3438  4438
/u/     375   1063  2188  3438  4438
/e/     438   1863  2688  3438  4438
/o/     438   1063  2688  3438  4438
the formants from these AR coefficients by the root-solving
technique. The estimated formant frequency values are
taken as the arithmetic mean of those at all window
positions. The relative estimation error (REE), $RE_{F_i}$, of the $i$th formant is calculated by averaging the individual $F_i$ errors of all five vowels. Thus, we can express $RE_{F_i}$ as

$$RE_{F_i} = \frac{1}{5}\sum_{j=1}^{5}\frac{|\hat{F}_{ij} - F_{ij}|}{F_{ij}},$$
where $F_{ij}$ denotes the $i$th formant frequency of the $j$th vowel and $\hat{F}_{ij}$ is the corresponding estimated value. The average REE, $E$, of the first three formants of all five vowels is represented using

$$E = \frac{1}{15}\sum_{j=1}^{5}\sum_{i=1}^{3}\frac{|\hat{F}_{ij} - F_{ij}|}{F_{ij}}.$$
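The root-solving step and the relative error above can be sketched as follows. The AR coefficients of a single 800 Hz resonator are hypothetical stand-ins for the coefficients produced by the Levinson-Durbin step; the angle-to-frequency conversion is the standard one.

```python
import numpy as np

# Sketch of formant measurement by root solving on AR coefficients.
fs = 10_000
f_true, bw = 800.0, 60.0
r = np.exp(-np.pi * bw / fs)                         # pole radius from bandwidth
theta = 2 * np.pi * f_true / fs                      # pole angle from frequency
a = np.array([1.0, -2 * r * np.cos(theta), r * r])   # A(z) of one resonator

roots = np.roots(a)
roots = roots[roots.imag > 0]                        # keep one of each conjugate pair
formants = np.sort(np.angle(roots) * fs / (2 * np.pi))

# Relative estimation error for one formant of one vowel:
ree = abs(formants[0] - f_true) / f_true             # essentially zero here
```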
For low F0s, Fig. 8 shows that the DPPT method has more errors than the other methods. The F1 error is lowest for the proposed method. In the case of F2 and F3, the proposed method and the WORLD formant estimator show similar performance. The average of F1, F2, and F3 shows that the proposed method outperforms the other methods. For high F0s, it is observed from Fig. 9 that the proposed method has much higher accuracy than the DPPT method. Comparatively, the DPPT method shows higher errors for high-pitched speech. This is because, for high F0s, the DPPT method is mostly affected by the glottal formant, and thus the original formant peak shifts towards the nearest harmonics. When the proposed method is compared with the WORLD formant estimator, results similar to those at low F0s are observed for these two methods.
Since the proposed method is based on the combined effect
Fig. 8 REE of formant frequencies of five vowels at different F0 values up to 190 Hz.

Fig. 9 REE of formant frequencies of five vowels at different F0 values from 200 to 400 Hz.
of both phase and magnitude spectra, spectral resolution is
higher. Thus, the modified spectrum results in a more
accurate formant estimation. A close inspection of Figs. 8
and 9 suggests that the proposed method can be used for
analyzing high-pitched speech as well, with elevated
accuracy.
5. ANALYSIS OF REAL SPEECH
An utterance of a sentence from the TIMIT [28]
database, ‘‘Don’t ask me to carry an oily rag like that,’’ is
analyzed. The sampling frequency of this speech is 16 kHz.
A Gaussian window of 30 ms is used to segment the speech
signal with a frame overlap of 5 ms. The 1,024-point DFT is used to analyze the magnitude and GD spectra. The signal is pre-emphasized by $1 - z^{-1}$. The formant frequencies are estimated using an analysis order of 16, and seven formants are detected. The spectrogram and the contour of the first three formants of the sentence are shown in Fig. 10. A close inspection of the formant contours reveals that the formants are clearer in the voiced regions for the proposed method. The WORLD formant estimator shows better formant contours than the DPPT method.
We analyzed the real speech signal of vowel sounds
/a/ and /o/ spoken by a male and a female speaker. The
frame size is 30 ms, and the frame shift is 5 ms. The speech signal is pre-emphasized by $1 - z^{-1}$, and a prediction order of 12 is used in the analysis. The sampling frequency is
10 kHz. The results of analyzing these vowels are repre-
sented in standard F2–F1 plots [29] (Appendix) for the
low-pitched male and high-pitched female speeches in
Figs. 11–14. In both cases, the proposed method produced
F2–F1 values of almost all frames that exist within a
confined region. The WORLD formant estimator shows
some scattered values at the female vowel /a/, but for the
other vowels, it shows a similar result to the proposed
method. However, the DPPT method produced F2–F1
values that are more scattered, and some of the F2 values
are treated as F1 values. This is because, in DPPT, F1 is
affected by the glottal formant. From this observation, it is
Fig. 10 A sentence from the TIMIT database, "Don't ask me to carry an oily rag like that." (a) Spectrogram. F1-F3 formant contour obtained using the (b) proposed method, (c) WORLD formant estimator, and (d) DPPT method of COVAREP.
Fig. 11 F2 vs F1 plot of vowel /a/ spoken by a male speaker.
clear that the proposed method is less affected by the
glottal formant. From all of these analyses, it is clear that the proposed method outperforms the existing methods.
6. STABILITY OF AR FILTER
If the inverse DFT is taken from the modified spectrum $X_{mg}(k)$, we have the signal $x_{mg}(n)$ using

$$x_{mg}(n) = \frac{1}{N}\sum_{k=1}^{N} X_{mg}(k)\,e^{j(2\pi/N)kn}. \qquad (12)$$
Since $x_{mg}(n)$ is formed by the inverse DFT of the product of the two spectra $|X(k)|$ and $\tau_{\min}(k)$, which is similar to a power spectrum, it can be shown that this signal has properties similar to those of the autocorrelation sequence of a signal. The standard autocorrelation function produces a stable AR filter, and it can also be shown that the AR filter resulting from $x_{mg}(n)$ is stable. The minimum-phase-like signal derived from the magnitude spectrum has all its poles and zeros within the unit circle, so the AR filter resulting from that signal is stable. Also, $\tau_{\min}(k)$ retains only nonnegative values. Thus, the modified spectrum formed by the product of the magnitude spectrum $|X(k)|$ and the GD spectrum $\tau_{\min}(k)$ yields a positive semidefinite autocorrelation matrix, which guarantees the stability of the AR filter.
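This argument can be checked numerically. Below, a hypothetical nonnegative, even spectrum stands in for $X_{mg}(k)$ (it is not the output of the full method), and the Levinson-Durbin recursion is a plain textbook implementation, not a library call.

```python
import numpy as np

# Fit an AR model to an autocorrelation-like sequence with Levinson-Durbin
# and verify that all poles lie strictly inside the unit circle.
def levinson(r, order):
    a = np.array([1.0])
    e = r[0]
    for _ in range(order):
        k = -(r[len(a)] + a[1:] @ r[len(a) - 1:0:-1]) / e
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([a, [0.0]])[::-1]
        e *= 1.0 - k * k
    return a, e

rng = np.random.default_rng(0)
N = 512
k_idx = np.arange(N)
tau_min = np.maximum(np.cos(4 * np.pi * k_idx / N), 0.0)   # stand-in for rectified GD
S = np.abs(np.fft.fft(rng.standard_normal(N))) * tau_min   # stand-in for X_mg(k): nonnegative, even
r = np.fft.ifft(S).real                                    # autocorrelation-like sequence, Eq. (12)
a, _ = levinson(r, 12)
poles = np.roots(a)                                        # all |pole| < 1: stable AR filter
```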
7. CONCLUSION
A new spectral representation of the magnitude
spectrum has been proposed in this research to estimate
the vocal tract characteristics accurately. For this purpose,
we exploited the benefits of the GD spectrum by multi-
plying the magnitude spectrum with it and obtained a high-
resolution spectrum with spectral leakage compensation.
It is shown that the properly estimated GD spectrum with
the magnitude spectrum can be used for extracting closely
spaced and low-amplitude formant frequencies. At the
same time, the proposed method can be used to estimate
the formant frequency of high-pitched speech with elevated
accuracy.
ACKNOWLEDGEMENT
We are grateful to the Information and Communication
Technology (ICT) Division of the Bangladesh Government
for providing funds for this research. We are also thankful
to the three anonymous reviewers for their important
comments and valuable suggestions on this manuscript.
Fig. 12 F2 vs F1 plot of vowel /o/ spoken by a male speaker.

Fig. 13 F2 vs F1 plot of high-pitched vowel /a/ spoken by a female speaker at 300 Hz.

Fig. 14 F2 vs F1 plot of high-pitched vowel /o/ spoken by a female speaker at 300 Hz.

REFERENCES
[1] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Am., 50, 637-655 (1971).
[2] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, 63, 561-580 (1975).
[3] C. Magi, J. Pohjalainen, T. Backstrom and P. Alku, "Stabilised weighted linear prediction," Speech Commun., 51, 401-411 (2009).
[4] A. M. Noll, "Cepstrum pitch determination," J. Acoust. Soc. Am., 41, 293-309 (1967).
[5] R. W. Schafer and A. V. Oppenheim, Discrete Time Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1989).
[6] E. Loweimi, J. Barker and T. Hain, "Source-filter separation
of speech signal in the phase domain,’’ Proc. Interspeech 2015,pp. 598–602 (2015).
[7] E. Loweimi, ‘‘Robust phase-based speech signal processing from source–filter separation to model-based robust ASR,’’ Ph.D. Dissertation, University of Sheffield (2018).
[8] R. W. Schafer and L. R. Rabiner, ‘‘System for automatic formant analysis of voiced speech,’’ J. Acoust. Soc. Am., 47, 634–648 (1970).
[9] M. S. Rahman and T. Shimamura, ‘‘Formant frequency estimation of high-pitched speech by homomorphic prediction,’’ Acoust. Sci. & Tech., 26, 502–510 (2005).
[10] M. S. Rahman and T. Shimamura, ‘‘Linear prediction using refined autocorrelation function,’’ EURASIP J. Audio Speech Music Process., 1, 1–9 (2007).
[11] K. K. Paliwal and L. Alsteris, ‘‘Usefulness of phase spectrum in human speech perception,’’ Proc. 8th Eur. Conf. Speech Communication and Technology, pp. 2117–2120 (2003).
[12] D. Gowda, J. Pohjalainen, M. Kurimo and P. Alku, ‘‘Robust formant detection using group delay function and stabilized weighted linear prediction,’’ Proc. Interspeech 2013, pp. 49–53 (2013).
[13] H. A. Murthy and B. Yegnanarayana, ‘‘Formant extraction from group delay function,’’ Speech Commun., 10, 209–221 (1991).
[14] H. A. Murthy and B. Yegnanarayana, ‘‘Group delay functions and its applications in speech technology,’’ Sadhana, 36, 745–782 (2011).
[15] B. Bozkurt, L. Couvreur and T. Dutoit, ‘‘Chirp group delay analysis of speech signals,’’ Speech Commun., 49, 159–176 (2007).
[16] H. A. Murthy and V. Gadde, ‘‘The modified group delay function and its application to phoneme recognition,’’ Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) 2003, pp. 1–68 (2003).
[17] B. Bozkurt, ‘‘Zeros of the z-transform (ZZT) representation and chirp group delay processing for the analysis of source and filter characteristics of speech signals,’’ Ph.D. Thesis, Faculté Polytechnique de Mons, Belgium and LIMSI-CNRS, France (October 2005).
[18] G. Degottex, J. Kane, T. Drugman, T. Raitio and S. Scherer, ‘‘COVAREP — A collaborative voice analysis repository for speech technologies,’’ Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) 2014, pp. 960–964 (2014).
[19] B. Bozkurt, T. Dutoit, B. Doval and C. d’Alessandro, ‘‘Improved differential phase spectrum processing for formant tracking,’’ Proc. 8th Int. Conf. Spoken Language Processing (ICSLP) 2004, pp. 2421–2424 (2004).
[20] M. Morise, ‘‘CheapTrick, a spectral envelope estimator for high-quality speech synthesis,’’ Speech Commun., 67, 1–7 (2015).
[21] M. Morise, F. Yokomori and K. Ozawa, ‘‘WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,’’ IEICE Trans. Inf. Syst., 99, 1877–1884 (2016).
[22] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals (Prentice-Hall Inc., Englewood Cliffs, NJ, 1975).
[23] M. Cerna and A. F. Harvey, The Fundamentals of FFT-based Signal Analysis and Measurement (Application Note 041, National Instruments, 2000).
[24] F. J. Harris, ‘‘On the use of windows for harmonic analysis with the discrete Fourier transform,’’ Proc. IEEE, 66, 51–83 (1978).
[25] B. Yegnanarayana, D. Saikia and T. Krishnan, ‘‘Significance of group delay functions in signal reconstruction from spectral magnitude or phase,’’ IEEE Trans. Acoust. Speech Signal Process., 32, 610–623 (1984).
[26] G. Fant, J. Liljencrants and Q.-G. Lin, ‘‘A four-parameter model of glottal flow,’’ STL-QPSR, 4, 1–13 (1985).
[27] J. Durbin, ‘‘The fitting of time series models,’’ Rev. Inst. Int. Stat., 28, 233–243 (1960).
[28] V. Zue, S. Seneff and J. Glass, ‘‘Speech database development at MIT: TIMIT and beyond,’’ Speech Commun., 9, 351–356 (1990).
[29] D. Watt and A. Fabricius, ‘‘Evaluation of a technique for improving the mapping of multiple speakers’ vowel spaces in the F1–F2 plane,’’ Leeds Working Papers in Linguistics and Phonetics, 9, 159–173 (2002).
Husne Ara Chowdhury received her B.Sc. (Eng.) degree in computer science and engineering from Shahjalal University of Science and Technology, Bangladesh, in 2002. She joined the CSE department of the same university as a junior faculty member in 2004 and became an assistant professor in 2007. She has been working toward her Ph.D. in speech signal analysis at the same university since November 2018. Her research interests are in speech recognition, speech synthesis and speech coding.
M. Shahidur Rahman received his B.Sc. and M.Sc. degrees in electronics and computer science from Shahjalal University of Science and Technology, Sylhet, Bangladesh, in 1995 and 1997, respectively. He received his Ph.D. degree in mathematical information systems from Saitama University, Saitama, Japan, in 2006. In 1997, he joined Shahjalal University of Science and Technology as a junior faculty member, and he is currently serving as a professor. He was a JSPS Postdoctoral Research Fellow at Saitama University from 2009 to 2011. His current research interests include speech analysis, speech synthesis, speech recognition, enhancement of bone-conducted speech and digital signal processing.
Acoust. Sci. & Tech. 42, 2 (2021)