2013 5th International Conference on Modeling, Simulation and Applied Optimization (ICMSAO 2013), Hammamet, Tunisia, 28-30 April 2013

Selection of Pertinent Acoustic Features for Detection of Pathological Voices

Lotfi SALHI Signal Processing Laboratory, Physics Department

Sciences Faculty of Tunis, University of Tunis El-Manar, 1060 Tunis, Tunisia

[email protected]

Adnane CHERIF Signal Processing Laboratory, Physics Department

Sciences Faculty of Tunis, University of Tunis El-Manar, 1060 Tunis, Tunisia

[email protected]

Abstract—This paper suggests a new method to improve the performance of acoustic feature selection for the classification of pathological and normal voices. The effectiveness of the Mel Frequency Cepstrum Coefficients (MFCCs), selected using the Fisher Discriminant Ratio (FDR), is analyzed. To evaluate the performance of the selected features, experiments were performed using a Multi-Layer Perceptron (MLP) classifier with the Feed Forward Back Propagation (FFBP) training algorithm. The developed method was evaluated on a voice database composed of recorded voice samples (continuous speech) from normophonic and dysphonic speakers. On this mixed voice database, the best selected features achieved a correct classification rate of 92.74%. The proposed system shows that the FDR is a sufficient selection method of acoustic features for the classification of pathological and normal voices.

Keywords—Fisher Criterion, Multilayer Perceptron, Pathological Voices, Acoustic Features.

I. INTRODUCTION

The diagnosis of pathological voice is an important topic that has received considerable attention. The analysis of voice disorders remains essentially clinical. Diverse medical techniques exist for the direct examination and diagnosis of pathologies: laryngoscopy, glottography, stroboscopy, electromyography and videokymography are the most frequently used by medical specialists [1,2]. In general, all these methods rely on visual examination during the phonation process, which is hardly accessible and makes it more problematic to identify the pathology. Moreover, these diagnostic means may cause patients much discomfort and may distort the actual signal, which can lead to an incorrect diagnosis as well. Speech processing has proved to be an excellent tool for voice disorder detection. In fact, it is a non-invasive diagnostic technique that allows examining many people in a short time period with minimal discomfort, and revealing pathologies at early stages. This method can be of great interest for medical institutions.

In the current literature, the features extracted from audio data for voice pathology analysis often include the fundamental frequency (F0), jitter, shimmer, Mel-Frequency Cepstral Coefficients (MFCC), Signal-to-Noise Ratio (SNR), Harmonic-to-Noise Ratio (HNR) and high-order statistics (HOS) parameters [3]. These parameters are frequently used in systems for automatic vocal pathology detection and are usually selected arbitrarily. In general, these methods assume the signal is stationary within a given time frame and may therefore lack the ability to analyse localized events correctly.

Various classification methods have been used in recent approaches to pathological voice classification. Several studies have recently addressed various kinds of pathological classification tasks, such as the application of automatic speaker recognition techniques to pathological voice assessment [20], the identification of voice disorders using speech samples [18], the performance of Gaussian mixture models as classifiers for pathological voice [20], the use of the correlation between acoustic descriptors for normal/pathological voice discrimination [11,12], and support vector machines applied to the detection of voice disorders [7]. Although the multilayer neural network (MNN) has been widely used, since it does not require a detailed mathematical model of the data, it is relatively easy to train, and it has produced good pathological recognition performance, we admit that a comparison with other classification systems is needed to evaluate the performance of the proposed system.

In this study, the MLP method was used to classify a mixed voice data set (pathological and normal voices) into normal and pathological classes. Different classical vocal parameters (pitch, formants, ZCR, MFCC, PLP, etc.) were used to identify the state of a voice sample. The same process was followed for each parameter set, and all the resulting performances were compared.

II. EXPERIMENTAL FRAMEWORK

A. Proposed System

The proposed system for automatic identification of pathological voices consists essentially of two parts: parameterization and classification. Artificial neural networks are powerful tools for handling higher-dimensional problems [11,23], and are good at tasks such as pattern matching and classification, function approximation, optimization, and data clustering, while traditional computers, because of their architecture, are inefficient at these tasks, especially classification tasks [15,16]. Figure 1 presents the proposed system process.

978-1-4673-5814-9/13/$31.00 © 2013 IEEE


Fig. 1. Proposed system process

B. Corpus

The corpus comprises sustained vowels [a], including onsets and offsets, and four French sentences produced by 22 normophonic or dysphonic speakers (10 male and 12 female) [11,12]. The corpus includes 20 adults (from 20 to 79 years), one boy aged 14 and one girl aged 10. Five speakers are normophonic; the others are dysphonic. The dysphonic speakers were patients of the Laryngology Department of a University Hospital in Brussels, Belgium. The disordered voices range from mildly deviant to very deviant. The pathologies were diagnosed as follows: dysfunctional dysphonia, bilateral nodule, polyp on the left vocal fold, edema of the vocal folds, mutational disorder, dysphonia plicae ventricularis, and unilateral vocal fold paralysis. The sentences are referred to as S1, S2, S3 and S4, respectively. They have the same grammatical structure, the same number of syllables and roughly the same number of resonants and plosives. Sentences S1 and S2 are voiced by default, whereas S3 and S4 include voiced and unvoiced segments.

Speech signals have been recorded at a sampling frequency of 48 kHz. The recordings were made in an isolated booth by means of a digital audio tape recorder (Sony TCD D8) and a head-mounted microphone (AKG C41WL). The recordings have been transferred from the DAT recorder to computer hard disk via a digital-to-digital interface. Silent intervals before and after each recording have been removed by manual segmentation.

C. Feature Extraction

The speech signal has many acoustic features from which measurements can be taken. Different parameters were chosen at the input of the neural networks [4]. The type of each parameter depends on its method of extraction.

Fig. 2. Feature extraction

• Pitch and Formants:

Starting as a vibration of the vocal cords, speech signals have a pitch or "fundamental frequency" (denoted F0) which is determined by the mass, length and tension of the speaker’s vocal cords. As the buzzing sound created by vocal fold vibration travels through the pharynx, oral and nasal cavities, it is modified by the length and shape of the air column through which it passes and becomes a speech sound. Articulation changes the sound that exits the mouth simply because it changes the shape and size of the air cavities used for speech. During the production of vowels, one of the most important features this creates is a series of vocal resonances called formants. The vocal tract produces an infinite number of formants. Only the three lowest-frequency formants (designated F1, F2 and F3) are usually required for listeners to hear all the different vowel sounds.
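The fundamental frequency described above can be estimated in several ways; a common one is autocorrelation of a voiced frame. Below is a minimal sketch under assumed settings (the search range 60-400 Hz and the 30 ms frame are our choices, not values from the paper):

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Rough F0 estimate of a voiced frame via autocorrelation.
    The search range [f0_min, f0_max] is an assumption, not from the paper."""
    frame = frame - np.mean(frame)
    # Full autocorrelation, keeping only non-negative lags.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)                 # smallest period considered
    lag_max = min(int(fs / f0_min), len(ac) - 1)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag

fs = 48000                                     # sampling rate used in the corpus
t = np.arange(int(0.03 * fs)) / fs             # one 30 ms frame
frame = np.sin(2 * np.pi * 150.0 * t)          # synthetic 150 Hz "voiced" frame
print(estimate_f0(frame, fs))                  # close to 150 Hz
```

On real speech a peak-picking step with interpolation would refine the lag, but the principle is the same: the autocorrelation peaks at the pitch period.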

• Log Energy:

The short-time energy measurement can be defined as follows [17]:

Log E = log( Σ_{i=1}^{N} S_i² )   (1)
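Eq. (1) translates directly to code; a minimal sketch (the toy frame and the eps guard against log(0) on silent frames are our additions):

```python
import numpy as np

def log_energy(frame, eps=1e-12):
    """Short-time log energy of one frame, as in Eq. (1).
    eps guards against log(0) on silent frames (our addition)."""
    return np.log(np.sum(np.asarray(frame, dtype=float) ** 2) + eps)

frame = np.array([0.5, -0.5, 0.5, -0.5])   # toy frame: sum of squares = 1
print(log_energy(frame))                    # log(1) = 0 (up to eps)
```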

• Zero-Crossing Rate:

The zero-crossing rate (ZCR) is another basic acoustic feature that can be computed easily. It is equal to the number of zero-crossing of the waveform within a given frame.

Fig. 3. ZCR values for both voice types

ZCR can be defined as follows [9]:

ZCR(x) = (1/(N-1)) Σ_{n=1}^{N-1} s(n),  where s(n) = 1 if x_n · x_{n-1} < 0 and s(n) = 0 otherwise   (2)
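A minimal sketch of Eq. (2), with a synthetic tone and noise frame (our own test signals) to illustrate the voiced/unvoiced contrast discussed next:

```python
import numpy as np

def zcr(x):
    """Zero-crossing rate of a frame, following Eq. (2): the fraction
    of adjacent sample pairs whose product is negative."""
    x = np.asarray(x, dtype=float)
    return np.sum(x[1:] * x[:-1] < 0) / (len(x) - 1)

fs = 48000
t = np.arange(fs) / fs                                  # one second of signal
tone = np.sin(2 * np.pi * 100 * t)                      # voiced-like tone
noise = np.random.default_rng(0).standard_normal(fs)    # unvoiced-like noise
print(zcr(tone) < zcr(noise))                           # True: noise crosses zero far more often
```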

Page 3: [IEEE 2013 5th International Conference on Modeling, Simulation and Applied Optimization (ICMSAO 2013) - Hammamet (2013.4.28-2013.4.30)] 2013 5th International Conference on Modeling,

In general, the ZCR of unvoiced sounds and environmental noise is larger than that of voiced sounds (which have observable fundamental periods).

• MFCC and ΔMFCC:

The most commonly used acoustic features for speech or speaker recognition are the mel-scale frequency cepstral coefficients (MFCC) [19]. The MFCCs take human perceptual sensitivity with respect to frequency into consideration, and are therefore well suited to speech recognition. To extract envelope-like features, we use triangular bandpass filters. The positions of these filters are equally spaced along the Mel frequency, which is related to the common linear frequency F_Hz by the following equation:

F_Mel = 2595 · log10( 1 + F_Hz / 700 )   (3)

Fig. 4. Frequency to Mel-Frequency curve
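The Hz-to-Mel mapping of Eq. (3) is a one-liner; the 1000 Hz check below is a well-known property of the scale (1000 Hz sits near 1000 mel):

```python
import math

def hz_to_mel(f_hz):
    """Hz to Mel mapping of Eq. (3): 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# 1000 Hz maps to roughly 1000 mel.
print(hz_to_mel(1000.0))
```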

It is also advantageous to include the time derivatives of the MFCCs as new features, which represent the velocity and acceleration of the MFCCs. If we add the velocity (ΔMFCC), the feature dimension is 24. If we add both the velocity and the acceleration (ΔΔMFCC), the feature dimension is 36.
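The paper does not give its delta formula; a common choice is the regression over neighbouring frames shown below (the window size m=2 and the toy MFCC matrix are our assumptions). Stacking the statics with Δ and ΔΔ yields the 36-dimensional vectors mentioned above:

```python
import numpy as np

def delta(features, m=2):
    """First-order time derivatives (velocity) of a feature matrix.
    features: (num_frames, num_coeffs), e.g. 12 MFCCs per frame.
    Uses a standard regression over +/- m frames; m=2 is our assumption."""
    padded = np.pad(features, ((m, m), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, m + 1))
    out = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        for k in range(1, m + 1):
            out[t] += k * (padded[t + m + k] - padded[t + m - k])
    return out / denom

mfcc = np.random.default_rng(0).standard_normal((100, 12))  # toy MFCC matrix
d = delta(mfcc)                  # ΔMFCC (velocity)
dd = delta(d)                    # ΔΔMFCC (acceleration)
full = np.hstack([mfcc, d, dd])  # 12 + 12 + 12 = 36-dimensional vectors
print(full.shape)                # (100, 36)
```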

• PLP and RASTA-PLP:

The Perceptual Linear Prediction (PLP) was originally proposed by Hynek Hermansky as a way of warping spectra to minimize the differences between speakers while preserving the important speech information [4,8].

Fig. 5. Method of calculating PLP coefficients

Relative Spectral Transform - Perceptual Linear Prediction (RASTA-PLP) is a separate technique that applies a band-pass filter to the energy in each frequency subband in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel e.g. from a telephone line [7].

Fig. 6. PLP and RASTA-PLP for normal speech signal

Fig. 7. PLP and RASTA-PLP for pathological speech signal

III. EXPERIMENTAL RESULTS

A. Fisher Criterion

The generalized Fisher criterion allows estimating the average discrimination power of a feature by measuring the Fisher Discrimination Ratio (FDR). This value reflects the overlap of the probability density functions of the classes to be distinguished. It is expressed for each parameter as follows [5,6]:

FDR = [ Σ_{i=1}^{K} Σ_{j=1}^{K} ( mean(x[i]) - mean(x[j]) )² ] / [ Σ_{i=1}^{K} Var(x[i]) ]   (4)

where mean(x[i]) and mean(x[j]) are the averages of the parameter x for classes i and j respectively, Var(x[i]) is the variance of the parameter x for class i, and K is the number of classes.

For two given classes, the FDR can be interpreted as the ratio of the interclass variability of the parameter to its intraclass variability. This criterion consists of calculating the distance between the average values of a feature over the two classes, normalized by the average of the variances, in order to estimate the discriminating power of the feature in question between these two classes [13,14].

In our case, a recorded voice belongs to one of two possible classes: pathological or normal. The FDR of any parameter x can then be defined by [9,10]:

FDR = ( mean(x_Normal) - mean(x_Pathol) )² / ( Var(x_Normal) + Var(x_Pathol) )   (5)

We can then choose the distance between the projected means as our objective function.
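The two-class FDR can be sketched directly in code; the toy class samples below (well-separated Gaussians standing in for a feature such as MFCC-2) are our own illustration:

```python
import numpy as np

def fdr_two_class(x_normal, x_pathological):
    """Two-class Fisher Discrimination Ratio, Eq. (5): squared
    difference of class means over the sum of class variances."""
    x_normal = np.asarray(x_normal, dtype=float)
    x_pathological = np.asarray(x_pathological, dtype=float)
    num = (x_normal.mean() - x_pathological.mean()) ** 2
    den = x_normal.var() + x_pathological.var()
    return num / den

# Toy example: well-separated classes give a high FDR.
rng = np.random.default_rng(0)
a = rng.normal(8.0, 1.0, 200)    # feature values, normal voices (assumed)
b = rng.normal(4.0, 1.0, 200)    # feature values, pathological voices (assumed)
print(fdr_two_class(a, b))       # roughly (8-4)^2 / (1+1) = 8
```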

Fig. 8. Separation principle of the two classes by FDR

The table below shows the results of the statistical study based on the Fisher Discrimination Ratio (FDR) for each selected acoustic parameter, using the corpus mentioned previously.

TABLE I. FDR VALUES FOR EACH PARAMETER

Feature   FDR    Feature    FDR    Feature   FDR
F0        0.84   MFCC-5     8.18   PLP-3     2.26
F1        0.93   MFCC-6     7.72   PLP-4     2.67
F2        0.56   MFCC-7     7.32   PLP-5     2.36
F3        1.06   MFCC-8     9.92   PLP-6     1.83
Log E     0.39   MFCC-9     9.18   PLP-7     1.57
ZCR       0.21   MFCC-10    8.47   PLP-8     6.65
MFCC-1    1.57   MFCC-11    9.36   PLP-9     4.22
MFCC-2    10.24  MFCC-12    9.54   PLP-10    1.82
MFCC-3    9.63   PLP-1      1.38   PLP-11    2.16
MFCC-4    7.76   PLP-2      2.98   PLP-12    8.24
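Given per-feature FDR values like those in Table I, selecting the most discriminative features reduces to a simple ranking; a minimal sketch using a few values copied from the table (the top-3 cut-off is purely illustrative):

```python
# A few FDR values taken from Table I.
fdr = {
    "F0": 0.84, "ZCR": 0.21, "Log E": 0.39,
    "MFCC-2": 10.24, "MFCC-8": 9.92, "PLP-12": 8.24, "PLP-1": 1.38,
}

# Rank features by decreasing FDR and keep the top 3.
ranked = sorted(fdr, key=fdr.get, reverse=True)
print(ranked[:3])   # ['MFCC-2', 'MFCC-8', 'PLP-12']
```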

The following table presents the statistical results obtained for the most relevant acoustic parameters, i.e. those with the highest Fisher discrimination ratios. The more the class averages of a parameter differ, the better it is able to separate the two voice classes.

TABLE II. STATISTICAL PARAMETERS WITH THE BEST FDR

Class C1: Normal
Feature   Min     Max     Mean   Std. dev.
MFCC-2    -5.84   18.32   8.32   1.04
PLP-12    -5.55   15.40   6.12   0.52

Class C2: Pathological
Feature   Min     Max     Mean   Std. dev.
MFCC-2    -4.36   13.22   3.96   0.88
PLP-12    -9.06   15.66   3.06   0.93

The following figure shows the standard deviation around the average of each parameter. It allows visualizing the ability of each parameter (MFCC and PLP) to separate the two voice classes (pathological and normal).

Fig. 9. Separation of the two classes by different vocal acoustic parameters

B. Multilayer Perceptron Classifier

Following our previous work, the multi-layer perceptron (MLP) with the Back-Propagation Algorithm (BPA) was chosen as the pattern classification method for several main reasons [4]. We tried various combinations of classical acoustic parameters in order to assess their ability to identify pathological voices. In this study we use 8 input vectors, as shown in Table 3.

The purpose of our study is to classify the mixed voice data set into normal and pathological voices and to compare the performance of each acoustic feature. The performance of each parameter is evaluated in terms of the learning error, i.e. the mean squared error (MSE) between the output Y and the target T, defined as:

MSE = (1/N) Σ_{i=1}^{N} e_i² = (1/N) Σ_{i=1}^{N} ( t_i - y_i )²   (6)


Also the performance of each parameter is evaluated in terms of the correct classification rate (CCR) defined as:

CCR = ( number of voices correctly identified / total number of voices ) × 100   (7)
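Eqs. (6) and (7) translate into two small helpers; the toy targets and outputs below are our own example, not data from the paper:

```python
import numpy as np

def mse(targets, outputs):
    """Mean squared error between network outputs and targets, Eq. (6)."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return np.mean((targets - outputs) ** 2)

def ccr(true_labels, predicted_labels):
    """Correct classification rate in percent, Eq. (7)."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    return 100.0 * np.mean(true_labels == predicted_labels)

t = [1, 1, 0, 0]                      # toy targets
y = [0.9, 0.8, 0.2, 0.6]              # toy network outputs
print(mse(t, y))                      # (0.01 + 0.04 + 0.04 + 0.36) / 4 = 0.1125
print(ccr(t, [1, 1, 0, 1]))           # 3 of 4 correct: 75.0
```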

TABLE III. INPUT VECTORS TO THE MLP

Vectors   Acoustic Features            N
V1        F0, F1, F2, F3               4
V2        F0, F1, F2, F3, Log E, ZCR   6
V3        12 MFCC                      12
V4        12 MFCC, Log E, ZCR          14
V5        12 MFCC, F0, F1, F2, F3      16
V6        12 MFCC, 12 Δ, 12 ΔΔ         36
V7        12 PLP                       12
V8        12 RASTA-PLP                 12

In order to evaluate the effectiveness of the method and features, the same preconditions were configured: the same initialisation data set of the MLP, the same generalized architecture and the same number of training epochs. Since the total amount of data was small, we trained and tested the classifier model by splitting the total data set into two parts: 75% of the data for training and the remaining 25% for testing. In each training stage, the classifier model was trained and tested separately using different combinations of data sets, which is especially useful when little speech data is available for training and testing. The experimental results for the identification of pathological voices are shown in this section.
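The evaluation protocol above (75% training / 25% test, one hidden layer) can be sketched with scikit-learn's MLPClassifier as a stand-in for the paper's FFBP-trained MLP; the synthetic 36-dimensional features, class separation, sizes and random seeds are all our assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the feature vectors: 36-dimensional samples
# (like vector V6), two classes (0 = normal, 1 = pathological).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (80, 36)),
               rng.normal(1.5, 1.0, (80, 36))])
y = np.array([0] * 80 + [1] * 80)

# 75% of the data for training, 25% for test, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# One hidden layer of 20 neurons, the best size found for V3-V8.
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0)
clf.fit(X_tr, y_tr)
print(100.0 * clf.score(X_te, y_te))   # CCR on the test set, in percent
```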

TABLE IV. BEST RESULTS FOR EACH FEATURE VECTOR

Vectors   M    CCR (Normal)   CCR (Pathological)   CCR (Total)
V1        15   66.70          70.83                68.75
V2        15   71.43          69.44                70.44
V3        20   80.95          84.72                82.84
V4        20   80.95          86.11                83.53
V5        20   85.71          86.11                85.91
V6        20   90.48          90.28                90.38
V7        20   80.95          79.17                80.06
V8        20   85.71          91.67                88.69

Using the MLP structure shown above and varying the number M of neurons in the hidden layer for each input feature vector, we obtain the results shown in Table 4. This table shows that the best correct classification rate is given by the input vector V6, which consists of the MFCC, ΔMFCC and ΔΔMFCC coefficients.

As shown in Figure 10, the best number of hidden neurons is 15 for vectors V1 and V2. However, for the other input vectors, 20 neurons give the best results.

Fig. 10. CCR according to the neurons number of hidden layer

Concerning the MSE, Figure 11 shows the results obtained using the best architectures (i.e. the best number of hidden neurons). We recall that the lower the MSE, the better the learning performance.

Fig. 11. MSE for each input feature vector

We thus note that the vector V6 gives the best performance in the identification of pathological voices, both in terms of CCR (90.38%) and MSE (9.9×10⁻⁶).

We can also notice that the rate of decrease toward the minimal error goal is highest when using the MFCCs and their derivative coefficients. The PLP and RASTA-PLP parameters also give good results in terms of CCR and MSE. This result allows us to place more emphasis on these acoustic features, which are the most discriminative for characterizing the speech signal.

IV. CONCLUSION

This paper presents a classification scheme between normal and pathological voices using the MLP method, as one attempt to obtain better classification performance based on various sets of feature parameters. MFCC, PLP and RASTA-PLP give the best results for the identification of pathological voices. The MLP-based classification method was applied to each vector of input parameters. The obtained results show, on average, a slight improvement with normal voice data and a significant improvement with pathological voice data. These results are in accordance with those given by the Fisher criterion. This comparison leads to the deduction that the Fisher criterion can be of major interest for classifying pathological and normal voices. Indeed, it allows a better selection of the most relevant parameters of the speech signal for the classification task concerned.

ACKNOWLEDGMENT

The authors would like to express their appreciation to Prof. Francis GRENEZ and to all members of the "Signals and Waves" Laboratory (LIST), Faculty of Engineering, Free University of Brussels, for their invaluable collaboration and for making the voice database available.

REFERENCES

[1] P. Yu, M. Ouaknine, J. Revis, and A. Giovanni, "Objective voice analysis for dysphonic patients: a multiparametric protocol including acoustic and aerodynamic measurements", Journal of Voice 15 (2001), pp. 529-542.
[2] B. Boyanov and S. Hadjitodorov, "Acoustic analysis of pathological voices. A voice analysis system for the screening of laryngeal diseases", IEEE Eng. Med. Biol. Mag. 16, no. 4 (1997), pp. 74-82.
[3] T. Dubuisson, T. Dutoit, B. Gosselin, and M. Remacle, "On the use of the correlation between acoustic descriptors for the normal/pathological voices discrimination", EURASIP Journal on Advances in Signal Process., ID 173967 (2009), 19 pages.
[4] L. Salhi, M. Talbi, A. Cherif, "Voice disorders identification using hybrid approach: wavelet analysis and multilayer neural networks", World Academy of Science, Engineering and Technology (WASET), vol. 45 (2008), pp. 330-339.
[5] C. Liu, H. Wechsler, "A shape- and texture-based enhanced Fisher classifier for face recognition", IEEE Transactions on Image Processing, vol. 10, no. 4 (2001).
[6] P. Belhumeur, J. Hespanha, D. Kriegman, "Eigenfaces vs. Fisherfaces: recognition using class specific linear projection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7 (1997).
[7] G. Llorente et al., "Support vector machines applied to the detection of voice disorders", Springer Verlag, Lecture Notes in Computer Science 3817 (2006), pp. 219-230.
[8] A. Cherif, "Pitch detection and formant extraction of Arabic speech processing", Journal of Applied Acoustics (2001).
[9] W. Malina, "On an extended Fisher criterion for feature selection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 3, no. 5 (1981), pp. 611-614.
[10] M. Smiatacz, W. Malina, "Modifying the input data structure for Fisher classifier", 2nd Conference on Computer Recognition Systems (KOSYR 2001), pp. 363-367.
[11] A. Kacha, F. Grenez, and J. Schoentgen, "Multiband frame-based acoustic cues of vocal dysperiodicities in disordered connected speech", Biomedical Signal Process. and Control 1 (2006), pp. 137-143.
[12] F. Bettens, F. Grenez, and J. Schoentgen, "Estimation of vocal dysperiodicities in disordered connected speech by means of distant-sample bidirectional linear predictive analysis", J. Acoust. Soc. Am. 117 (2005), pp. 328-337.
[13] X.-S. Zhuang, D.-Q. Dai, "Inverse Fisher discriminate criteria for small sample size problem and its application to face recognition", Pattern Recognition 38 (2005), pp. 2192-2194.
[14] H. Guo, Q. Zhang and A. K. Nandi, "Feature generation using genetic programming based on Fisher criterion", EUSIPCO, Poznan (2007), pp. 1867-1871.
[15] J. Kortelainen, K. Noponen, Neural Networks, Intelligent Systems (2005).
[16] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford, Clarendon Press (1996).
[17] R. T. Ritchings, M. A. McGillion, and C. J. Moore, "Pathological voice quality assessment using artificial neural network", Elsevier Medical Engineering & Physics, vol. 24, no. 8 (2002), pp. 561-564.
[18] J. Wang, C. Jo, "Performance of Gaussian mixture model as a classifier for pathological voice", Proceedings of the ASST, Auckland (2006), pp. 165-169.
[19] G. Llorente et al., "Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters", IEEE Trans. on Biomedical Engin. 53, no. 10 (2006), pp. 1943-1953.
[20] C. Fredouille, G. Pouchoulin, J.-F. Bonastre, M. Azzarello, A. Giovanni, A. Ghio, "Application of automatic speaker recognition techniques to pathological voice assessment", in Proceedings of the International Conference on Acoustic Speech and Signal Processing (2005).