PID2198363 _6 Feb 12


Assimilate the Auditory Scale with Wavelet Packet Filters for Multistyle Classification of Speech Under Stress

Nurul Aida Amira Bt Johari, M. Hariharan, A. Saidatul, Sazali Yaacob
School of Mechatronic Engineering, University Malaysia Perlis, Perlis, Malaysia

[email protected], [email protected]

Abstract—Nowadays, people experience high stress levels due to heavy workloads, emergency phone calls and multitasking. The emotional/stress state of a person affects his/her performance in daily life and speech production. Understanding human emotional/stress states through speech has undergone research and development over the past two decades. This paper presents a feature extraction method based on wavelet packet decomposition for detecting the emotional or stressed state of a person. Three different wavelet packet filter bank structures are designed based on the Bark scale, Mel scale and Equivalent Rectangular Bandwidth (ERB) scale. A Linear Discriminant Analysis (LDA) based classifier and a Support Vector Machine (SVM) are employed to identify the emotional/stressed state of a person. In this study, speech samples are taken from the Speech Under Simulated and Actual Stress (SUSAS) database. Experimental results show that the suggested method can be used to identify the stress and emotional state of a person.

Keywords- Emotional/Stressed states, Wavelet packet transform, Linear Discriminant Analysis, Support Vector Machine, stress classification

I. INTRODUCTION

One of the important and challenging research problems today is the recognition of emotion and stress from a speaker's speech. It is an important application that helps in identifying a speaker's stress and emotional condition, and is part of human-computer interaction, or affective computing. Recognition of human affective states is rapidly gaining interest among researchers and industrial developers since it has a broad range of applications. The study is useful in applications such as robots that recognize emotion, metropolitan emergency telephone systems that direct emotional telephone calls to a priority operator, and potentially in multimedia as interactive voice response systems, aircraft voice communication monitoring, and psychiatric diagnosis [1-3].

A user's stress and emotional state have been analyzed using speech patterns. Vocal parameters and prosodic features such as fundamental frequency, intensity (energy) and speaking rate are strongly related to the emotion expressed in speech [4-12]. Many studies have shown distinctive differences in phonetic features between normal speech and speech produced under stress [4-12], using classifiers such as Hidden Markov Models [6-9] and Neural Networks [10-12]. Researchers have proposed different speech features; the most common are MFCC (Mel-Frequency Cepstral Coefficients), pitch, LPC (Linear Prediction Coefficients), autocorrelation coefficients and Teager Energy Operator (TEO) based features [5,13]. Up to now, researchers have not identified a specific feature set for the recognition of emotional/stressed states through speech [12].

The wavelet transform is a promising tool for non-stationary speech analysis, capable of analyzing the speech signal in both time and frequency. Zhang Xueying and Jiao Zhiping [14] developed two filterbank structures based on the Bark scale and the ERB scale for speech recognition, where the wavelet packet filter frequency bands are spaced to follow the Bark and ERB scales closely. The advantage of the wavelet packet transform is its ability to partition both the low and high frequency bands. The Bark scale [15] is a psychoacoustical scale proposed by Eberhard Zwicker in 1961; it is named after Heinrich Barkhausen, who proposed the first subjective measurements of loudness. In 1938, Fletcher introduced the critical band concept, the bandwidth of the human auditory filter at different characteristic frequencies along the cochlea. He assumed that the auditory filters were rectangular, and several physiologically motivated formulas have since been derived for the ERB values [16]. S. Datta and co-workers [18] developed a new filter structure using a Mel-like admissible wavelet packet structure, whose filter frequency bands are spaced to follow the Mel scale closely. In [17], a two-level WP decomposition with 32nd-order Daubechies wavelets was obtained by estimating the minimum RMSE between the centre frequencies of the Mel and Bark scales, and log-energy features were extracted; the best result obtained in that study was ~95% for both PCA and LDA classifiers. Our paper investigates the usefulness of three different wavelet packet filterbank structures based on the Bark, Mel and ERB scales. Energy and entropy features were extracted from the wavelet packet coefficients of each subband. The simulation results show that the suggested methods can be used to identify the emotional/stressed states of a person.
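The three auditory scales above have closed-form expressions. As an illustration (the particular formula variants below are common choices from the psychoacoustics literature, not taken from this paper), the following sketch maps a frequency in Hz to its Bark and Mel values and gives the ERB of the auditory filter centred there:

```python
import math

def hz_to_bark(f):
    # Zwicker-style Bark approximation: z = 13*atan(0.00076 f) + 3.5*atan((f/7500)^2)
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def hz_to_mel(f):
    # Standard Mel mapping: m = 2595 * log10(1 + f/700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def erb_bandwidth(f):
    # Glasberg & Moore ERB (in Hz) of the auditory filter centred at f Hz
    return 24.7 * (4.37 * f / 1000.0 + 1.0)
```

For example, the ERB at 1 kHz is about 133 Hz, which is why auditory-scale filterbanks use narrow bands at low frequencies and progressively wider bands at high frequencies.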

The work is supported by the grant: FRGS-9003-00224 from the Ministry of Higher Education of Malaysia.


II. DATABASE

The database employed in this study is SUSAS [13]. It consists of stressed speech samples recorded under simulated and actual conditions. The simulated subcorpus in SUSAS contains 11 speaking styles [7]. Of the five domains available in SUSAS, only three are used: talking styles, single tracking task and the Lombard effect domain. The stressed speech was uttered by 9 speakers representing different dialects, with three speakers for each of the main USA dialects (General American, Boston and New York). In this experiment the four speaking styles considered are neutral, angry, lombard and loud. The SUSAS database contains 35 isolated words, and each style contains 2 recordings of each word by each speaker. The total number of utterances used is 2524. All speech samples are recorded at an 8 kHz sampling frequency, twice the highest frequency of the original speech, to avoid aliasing, and with a resolution of 16 bits per sample. Pairwise stress classification (neutral vs angry, neutral vs loud, neutral vs lombard) is carried out in this work, where angry is considered emotional speech, while loud and lombard are considered stressed speech.

III. METHODOLOGY

In this research, a total of 2524 stressed and emotional speech samples from the angry, lombard, loud and neutral (reference) styles were used. A voice activity detector (VAD) is applied to the samples to discard the unvoiced parts of the speech data. The segmented voiced portions are subjected to feature extraction using auditory wavelet packet filters based on the Bark scale, Mel scale and ERB scale. The energy and entropy features are extracted from each wavelet packet subband. SVM and LDA are used as classifiers. Fig. 1 depicts the feature extraction and classification phases of the multistyle pairwise stressed speech classification.

Fig. 1 Block diagram of the feature extraction and classification phase

IV. DESIGN OF WAVELET PACKET FILTERS

This section briefly explains the design of the wavelet packet filters and the feature extraction performed using them.

A. Wavelet Transform

The wavelet transform provides a time-frequency representation of a signal by decomposing it over dilated and translated wavelets. A wavelet is a waveform of effectively limited duration that has an average value of zero. The wavelet transform is defined as the convolution of a signal f(t) with a wavelet function ψ(t) shifted in time by a translation parameter b and dilated by a scale parameter a [16]. The general definition of the wavelet transform is given as:

W(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t) \, \psi^{*}\!\left( \frac{t-b}{a} \right) dt    (1)

where a and b are real, * denotes complex conjugation, and ψ(t) is the wavelet function. The wavelet transform uses a multi-resolution technique by which different frequencies are analyzed with different resolutions [18-20]. The discrete wavelet transform (DWT) of a sampled sequence f_n = f(nT) with sampling period T is computed as:

\mathrm{DWT} f[n, a^{j}] = a^{-j/2} \sum_{m=0}^{N-1} f[m] \, \psi^{*}\!\left( \frac{n-m}{a^{j}} \right)    (2)

where m and n are integers and the value of a is equal to 2.

B. Wavelet Packets

In the DWT decomposition procedure, a signal is decomposed into two frequency bands: a lower frequency band (approximation coefficients) and a higher frequency band (detail coefficients). Only the low frequency band is used for further decomposition, so the DWT yields a left-recursive binary tree structure. In the wavelet packet (WP) decomposition procedure, both the lower and higher frequency bands are decomposed into two sub-bands, so the wavelet packet transform yields a balanced binary tree structure. In the tree, each subspace is indexed by its depth i and subspace number p. The two wavelet packet orthogonal bases at a parent node (i, p) are given by the following forms [18-20]:

\psi_{i+1}^{2p}(k) = \sum_{n=-\infty}^{\infty} l[n] \, \psi_{i}^{p}(k - 2^{i} n)    (3)

where l[n] is the low-pass (scaling) filter, and

\psi_{i+1}^{2p+1}(k) = \sum_{n=-\infty}^{\infty} h[n] \, \psi_{i}^{p}(k - 2^{i} n)    (4)

where h[n] is the high-pass (wavelet) filter. Wavelet packet decomposition partitions the high frequency side into smaller bands, which cannot be achieved using the ordinary discrete wavelet transform. Tables I, II and III give the lower cut-off frequency (LCF), higher cut-off frequency (HCF) and bandwidth (BW) of all three wavelet packet filter banks, whose frequency bands closely follow the Bark scale (16 bands), Mel scale (20 bands) and ERB scale (19 bands).

[Fig. 1 blocks: Speech Signal → Voiced/Unvoiced Segmentation → Wavelet Packet Filters (Energy and Entropy) → Classification using LDA and SVM]

The speech samples are sampled at 8 kHz, giving a 4 kHz bandwidth signal. The speech signals are filtered with the 16 Bark scale, 20 Mel scale and 19 ERB scale wavelet packet filters [14]. The 4th-order Daubechies wavelet is used; it was chosen for the following properties [18]: time invariance, fast computation, and sharp filter transition bands. For a better representation of the sub-band signals, energy and entropy features are often used [21,22]. The energy feature is extracted from the wavelet packet coefficients using equation (5):

\mathrm{Energy}_{n,k} = \sum_{i=1}^{N} \left( C_{n,k}^{P}(i) \right)^{2}    (5)

n = 1, 2, ..., N;  k = 0, 1, ..., 2^N - 1

where P is the scale index, n represents the decomposition level and k represents the wavelet packet node. The Shannon entropy can be computed from the extracted wavelet packet coefficients using equation (6):

\mathrm{Entropy}_{n,k} = - \sum_{i=1}^{N} \left( C_{n,k}^{P}(i) \right)^{2} \log \left( C_{n,k}^{P}(i) \right)^{2}    (6)

n = 1, 2, ..., N;  k = 0, 1, ..., 2^N - 1

where P is the scale index and n represents the decomposition level. After the energy and entropy measures are computed from the wavelet packet coefficients of each subband, a feature database is created, and these measures are used as input features for the classifiers to distinguish the speech samples as neutral versus angry, lombard or loud.
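Equations (5) and (6) amount to a sum of squared subband coefficients and a log-energy style Shannon entropy per subband. A minimal sketch follows; the small epsilon guard against log(0) for zero coefficients is an implementation assumption, not part of the paper's formulas:

```python
import math

def subband_energy(coeffs):
    # Eq. (5): energy of one wavelet packet subband = sum of squared coefficients
    return sum(c * c for c in coeffs)

def subband_entropy(coeffs, eps=1e-12):
    # Eq. (6): -sum of c^2 * log(c^2); eps avoids log(0) for zero coefficients
    return -sum((c * c) * math.log(c * c + eps) for c in coeffs)

def wp_feature_vector(subbands):
    # One energy and one entropy value per subband, concatenated into
    # the feature vector fed to the classifiers
    return ([subband_energy(b) for b in subbands] +
            [subband_entropy(b) for b in subbands])
```

With the 16-band Bark filterbank this yields a 16-dimensional energy vector and a 16-dimensional entropy vector per utterance (20 for Mel, 19 for ERB).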

TABLE I. FREQUENCY BANDS OBTAINED FROM BARK SCALE WAVELET PACKET DECOMPOSITION

Filter | Bark LCF (Hz) | Bark UCF (Hz) | Bark BW (Hz) | WP level/node | WP LCF (Hz) | WP UCF (Hz) | WP BW (Hz)
1 | 0 | 100 | 100 | 5,0 | 0 | 125 | 125
2 | 100 | 200 | 100 | 5,1 | 125 | 250 | 125
3 | 200 | 300 | 100 | 6,4 | 250 | 312.5 | 62.5
4 | 300 | 400 | 100 | 6,5 | 312.5 | 375 | 62.5
5 | 400 | 510 | 110 | 5,3 | 375 | 500 | 125
6 | 510 | 770 | 260 | 4,2 | 500 | 750 | 250
7 | 770 | 920 | 150 | 5,6 | 750 | 875 | 125
8 | 920 | 1080 | 160 | 5,7 | 875 | 1000 | 125
9 | 1080 | 1270 | 190 | 4,4 | 1000 | 1250 | 250
10 | 1270 | 1480 | 210 | 4,5 | 1250 | 1500 | 250
11 | 1480 | 1720 | 240 | 4,6 | 1500 | 1750 | 250
12 | 1720 | 2000 | 280 | 4,7 | 1750 | 2000 | 250
13 | 2000 | 2320 | 320 | 4,8 | 2000 | 2250 | 250
14 | 2320 | 2700 | 380 | 4,10 | 2500 | 2750 | 250
15 | 2700 | 3150 | 450 | 4,11 | 2750 | 3000 | 250
16 | 3150 | 3700 | 550 | 3,7 | 3000 | 4000 | 1000

TABLE II. FREQUENCY BANDS OBTAINED FROM MEL SCALE WAVELET PACKET DECOMPOSITION

Filter | Mel LCF (Hz) | Mel UCF (Hz) | Mel BW (Hz) | WP level/node | WP LCF (Hz) | WP UCF (Hz) | WP BW (Hz)
1 | 0 | 100 | 100 | 5,0 | 0 | 125 | 125
2 | 100 | 200 | 100 | 5,1 | 125 | 250 | 125
3 | 200 | 300 | 100 | 5,2 | 250 | 375 | 125
4 | 300 | 400 | 100 | 5,3 | 375 | 500 | 125
5 | 400 | 500 | 100 | 5,4 | 500 | 625 | 125
6 | 500 | 600 | 100 | 5,5 | 625 | 750 | 125
7 | 600 | 700 | 100 | 5,6 | 750 | 875 | 125
8 | 700 | 800 | 100 | 5,7 | 875 | 1000 | 125
9 | 800 | 900 | 100 | 5,8 | 1000 | 1125 | 125
10 | 900 | 1000 | 100 | 5,9 | 1125 | 1250 | 125
11 | 1000 | 1149 | 145 | 5,10 | 1250 | 1375 | 125
12 | 1149 | 1320 | 171 | 5,11 | 1375 | 1500 | 125
13 | 1320 | 1516 | 196 | 4,6 | 1500 | 1750 | 250
14 | 1516 | 1741 | 225 | 4,7 | 1750 | 2000 | 250
15 | 1741 | 2000 | 259 | 4,8 | 2000 | 2250 | 250
16 | 2000 | 2297 | 297 | 4,9 | 2250 | 2500 | 250
17 | 2297 | 2639 | 342 | 4,10 | 2500 | 2750 | 250
18 | 2639 | 3031 | 392 | 4,11 | 2750 | 3000 | 250
19 | 3031 | 3482 | 451 | 3,6 | 3000 | 3500 | 500
20 | 3482 | 4000 | 518 | 3,7 | 3500 | 4000 | 500

TABLE III. FREQUENCY BANDS OBTAINED FROM ERB SCALE WAVELET PACKET DECOMPOSITION

Filter | ERB LCF (Hz) | ERB UCF (Hz) | ERB BW (Hz) | WP level/node | WP LCF (Hz) | WP UCF (Hz) | WP BW (Hz)
1 | 0 | 36 | 36 | 7,0 | 0 | 31.25 | 31.25
2 | 36 | 79 | 42 | 7,1 | 31.25 | 62.5 | 31.25
3 | 79 | 129 | 49 | 6,1 | 62.5 | 125 | 62.5
4 | 129 | 186 | 57 | 6,2 | 125 | 187.5 | 62.5
5 | 186 | 253 | 66 | 6,3 | 187.5 | 250 | 62.5
6 | 253 | 331 | 77 | 6,4 | 250 | 312.5 | 62.5
7 | 331 | 421 | 90 | 6,6 | 375 | 437.5 | 62.5
8 | 421 | 526 | 104 | 6,7 | 437.5 | 500 | 62.5
9 | 526 | 648 | 121 | 5,4 | 500 | 625 | 125
10 | 648 | 789 | 141 | 5,5 | 625 | 750 | 125
11 | 789 | 953 | 160 | 4,3 | 750 | 1000 | 250
12 | 953 | 1143 | 190 | 5,8 | 1000 | 1125 | 125
13 | 1143 | 1364 | 220 | 5,10 | 1250 | 1375 | 125
14 | 1364 | 1620 | 256 | 4,6 | 1500 | 1750 | 250
15 | 1620 | 1918 | 297 | 4,7 | 1750 | 2000 | 250
16 | 1918 | 2264 | 344 | 4,8 | 2000 | 2250 | 250
17 | 2264 | 2665 | 401 | 4,10 | 2500 | 2750 | 250
18 | 2665 | 3131 | 465 | 4,11 | 2750 | 3000 | 250
19 | 3131 | 3672 | 541 | 3,6 | 3000 | 3500 | 500
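The level/node-to-frequency-band correspondence in the wavelet-band columns of Tables I-III follows directly from repeated halving of the 4 kHz analysis bandwidth. A small helper illustrates this, assuming frequency-ordered node indices (an assumption worth noting, since some wavelet toolboxes return packet nodes in natural rather than frequency order):

```python
def wp_band(level, node, fs=8000.0):
    # Frequency band (low, high) in Hz covered by the frequency-ordered
    # wavelet packet node `node` at depth `level`, for sampling rate fs:
    # each decomposition level halves the bands, so a depth-`level`
    # band is (fs/2) / 2**level Hz wide.
    width = (fs / 2.0) / (2 ** level)
    return (node * width, (node + 1) * width)
```

For instance, node (5,0) spans 0-125 Hz and node (4,8) spans 2000-2250 Hz, matching the corresponding wavelet-band entries in the tables.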


V. CLASSIFIERS

Several classifiers have been proposed in the area of classifying speech under stress. In this paper, a linear discriminant analysis based classifier and an SVM are used to test the effectiveness of the wavelet packet energy and entropy features.

A. Support Vector Machine

The SVM is a relatively new and promising method for solving nonlinear classification, function estimation, density estimation and pattern recognition tasks [23-24]. It was originally proposed to classify samples within two classes. It maps training samples of the two classes into a higher-dimensional space through a kernel function, and seeks an optimal separating hyperplane in this new space that maximizes its distance from the closest training points. During testing, a query point is categorized according to its distance from the hyperplane. SVM models are built around a kernel function that transforms the input data into an n-dimensional space where a hyperplane can be constructed to partition the data. Linear, multilayer and radial basis function (RBF) kernels are normally used by researchers [23-24]. In this work, the RBF kernel function is used since it gives excellent generalization at low computational cost. In the RBF kernel, σ² (sig2) is an important parameter, since it changes the flexion of the hyperplane:

K(x, x_{i}) = \exp\!\left( - \frac{\| x - x_{i} \|^{2}}{2 \sigma^{2}} \right)    (7)

In this work, the LS-SVMLab toolbox [25] is used to perform pairwise classification of speech under stress.

There are two parameters to be chosen optimally: the regularization parameter (γ, gam) and σ² (sig2), the squared bandwidth of the RBF kernel. Suitable values of γ and σ² are chosen as 90 and 0.9 respectively to obtain better accuracy.

B. Linear Discriminant Analysis

Discriminant analysis is a statistical technique for classifying objects into mutually exclusive and exhaustive groups based on a set of measurable features; it is also often called pattern recognition, supervised learning, or supervised classification. Linear discriminants (LD) [24] partition the feature space into the different classes using a set of hyperplanes. The parameters of this classifier were fitted to the available training data using the method of maximum likelihood. With this method, training is achieved by direct calculation and is extremely fast relative to other classifier building methods such as neural networks. The model assumes that the feature data have a Gaussian distribution for each class. In response to input features, linear discriminants provide a probability estimate for each class, and the final classification is obtained by choosing the class with the highest probability estimate. The LDA based classifier is implemented using MATLAB 7.0.
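The RBF kernel of equation (7), with the paper's sig2 = 0.9 as the default, can be written directly. This is only a sketch of the kernel evaluation; the LS-SVM training itself is performed with the LS-SVMLab toolbox and is not reproduced here:

```python
import math

def rbf_kernel(x, xi, sig2=0.9):
    # Eq. (7): K(x, x_i) = exp(-||x - x_i||^2 / (2 * sigma^2)),
    # where sig2 = sigma^2 is the squared kernel bandwidth (0.9 in this work)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * sig2))
```

A larger sig2 flattens the kernel so that distant training points still influence the decision surface; a smaller sig2 bends the hyperplane more tightly around the support vectors.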

VI. RESULTS AND DISCUSSIONS

Many studies on automatic recognition of emotion and stress in speech have used the SUSAS database, and researchers have presented their results using several methods and techniques. In 1997, R. Sarikaya presented an ungrouped stress classification study on 11 emotion and stress speaking styles from the SUSAS database. He implemented a multi-layer perceptron (MLP) with the backpropagation training method

TABLE IV. RESULTS OF AUDITORY WAVELET PACKET FILTERS USING LDA AND SVM CLASSIFIERS FOR ENERGY FEATURES

Wavelet Packet Filter | Classifier | Neutral vs Angry | Neutral vs Lombard | Neutral vs Loud
Bark Scale | LDA | 88.50 | 86.31 | 91.63
Bark Scale | SVM | 93.28 | 90.27 | 95.43
Mel Scale | LDA | 88.02 | 86.15 | 91.35
Mel Scale | SVM | 90.31 | 88.88 | 95.43
ERB Scale | LDA | 88.49 | 73.81 | 93.25
ERB Scale | SVM | 92.68 | 82.34 | 96.62

TABLE V. RESULTS OF AUDITORY WAVELET PACKET FILTERS USING LDA AND SVM CLASSIFIERS FOR ENTROPY FEATURES

Wavelet Packet Filter | Classifier | Neutral vs Angry | Neutral vs Lombard | Neutral vs Loud
Bark Scale | LDA | 90.95 | 91.31 | 93.97
Bark Scale | SVM | 92.29 | 91.46 | 96.23
Mel Scale | LDA | 90.20 | 91.47 | 93.81
Mel Scale | SVM | 91.50 | 91.86 | 96.23
ERB Scale | LDA | 91.31 | 90.28 | 93.53
ERB Scale | SVM | 91.60 | 92.65 | 96.62


as the stress classifier. Of the four proposed subband-based features, the Subband Cepstral (SC) feature was the most promising, giving 59.1% accuracy [26]. T. L. Nwe (2003) selected four speech styles corresponding to the emotion category of anger and the stress categories of loud, lombard and clear speech. The extracted features were based on Log Frequency Power Coefficients (LFPC) and Teager Energy Operator based LFPC features, achieving accuracies of 87% and 89% for stress and emotion classification respectively [27]. Ling He (2009) used spectrograms subdivided into three sets of alternative frequency bands (critical band, Bark scale and equivalent rectangular bandwidth (ERB)), as well as 12 log-Gabor filters, with the stress features modeled by a Gaussian Mixture Model (GMM). The results showed that the log-Gabor features outperformed the alternative frequency bands, which gave between 40-80% classification rates using the energy feature.

In this study, we developed three sets of auditory wavelet packet filterbanks to imitate the frequency resolution of the human auditory system. To evaluate the recognition of emotional and stressed speech from the simulated neutral, angry, lombard and loud styles of the SUSAS database, three different wavelet analyses were created, exploiting the multiresolution capability of the wavelet packet (WP) transform to derive salient features of the stress and emotional content. First, we segmented only the voiced parts of each input utterance using an end-point detector based on zero-crossing rate (ZCR) and frame energy; this ensures that the useful information is analyzed and the noise is discarded. In WP analysis, emotion/stress discrimination is also performed on the high frequency subbands: both lower and higher frequency bands are decomposed, giving a balanced binary WP tree-structured filterbank. Moreover, the wavelet transform uses an adaptive window size that allocates more time to lower frequencies and less time to higher frequencies, so the dynamic characteristics, which are very important for differentiating emotional/stressed speech, are captured through properties localized in time (space) and scale (frequency). Effective decomposition of emotional/stressed speech can be achieved by a best basis (band) selection criterion: a full j-level wavelet packet decomposition yields more than 2^(2^(j-1)) orthogonal bases, and these bases contain information at different frequency scales. Basis selection must be performed on the voiced speech within a number of frequency bands before feature extraction; this helps in capturing the emotional/stressed information in the corresponding frequencies, and it is for this reason that the proposed wavelet packet tree-structured filter banks are derived. We therefore investigated the wavelet packet analysis bands (filterbanks).

In this study, the first set of wavelet packet frequency bands is the Bark Scale Wavelet Packet (Bark SWP). The Bark SWP consists of 16 frequency bands, yielding 16 coefficients to represent the characteristics of the neutral, angry, lombard and loud speaking styles; it forms bands of small bandwidth over the speech spectrum from 0 to 3 kHz and wider bands above 3 kHz. The second set is the Equivalent Rectangular Bandwidth scale Wavelet Packet (ERB SWP). The ERB SWP forms 19 frequency bands covering the speech spectrum from 0 to 4 kHz and has the smallest bandwidths among the three types. The last set is the Mel scale Wavelet Packet (Mel SWP), which adopts 20 filterbands for the emotion/stress speech to be investigated. All the bands in the Bark SWP, ERB SWP and Mel SWP were selected from the entire set of wavelet packet analysis bands as those that most closely approximate the critical bands characterizing human auditory perception.
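The end-point detection step described above (frame energy plus zero-crossing rate) can be sketched as follows. The frame length, hop and thresholds are illustrative assumptions, since the paper does not list its exact settings:

```python
def frame_features(x, frame_len=200, hop=80):
    # Short-time energy and zero-crossing rate per frame
    # (200 samples = 25 ms and 80 samples = 10 ms at 8 kHz)
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy = sum(s * s for s in frame)
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0))
        feats.append((energy, zcr / (frame_len - 1)))
    return feats

def voiced_mask(feats, e_thr, z_thr):
    # A frame is kept as voiced when its energy is high and its ZCR is low;
    # low-energy or high-ZCR frames are treated as silence/unvoiced and discarded
    return [e > e_thr and z < z_thr for e, z in feats]
```

Voiced speech concentrates energy in low frequencies (few zero crossings), while unvoiced fricatives and noise cross zero frequently, which is what makes this simple two-feature test workable.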

Finally, the neutral, angry, lombard and loud speech is analyzed and tested by measuring prosodic features, namely energy and entropy, to characterize the emotional/stressed information in the neutral, angry, loud and lombard utterances. A conventional validation scheme is used to test the effectiveness of the classifiers: 80% of the data are used for training and 20% for testing. Three experiments are conducted after extracting the energy and entropy features. First, the LDA and SVM classifiers are trained and tested with the energy features alone; the results are tabulated in Table IV. The second experiment was conducted using the entropy features alone; the results are tabulated in Table V. The third experiment was conducted using the combination of features (energy + entropy); the results are shown in Table VI. From the tables, it is observed that the SVM gives better classification for all the wavelet packet filters, and that the entropy features give better classification accuracy than the energy features. From the third experiment, it is observed that the combination of energy and entropy features gives very promising classification accuracy of more than 94% for all the pairwise stressed speech classifications using the SVM classifier. Observing the patterns of the results across all three experiments on the text-independent SUSAS utterances, we noticed that using neutral speech as the reference reduced the error in confusing the emotional angry utterances. The Neutral vs Lombard pair is the most confusable and has the highest error, which reduces the recognition rate during classification. The combination of energy and entropy

TABLE VI. RESULTS OF AUDITORY WAVELET PACKET FILTERS USING LDA AND SVM CLASSIFIERS FOR ENERGY + ENTROPY FEATURES

Wavelet Packet Filter | Classifier | Neutral vs Angry | Neutral vs Lombard | Neutral vs Loud
Bark Scale | LDA | 91.10 | 90.79 | 94.60
Bark Scale | SVM | 94.26 | 93.05 | 96.42
Mel Scale | LDA | 90.43 | 91.42 | 94.84
Mel Scale | SVM | 91.50 | 95.23 | 97.81
ERB Scale | LDA | 91.98 | 91.62 | 93.25
ERB Scale | SVM | 95.65 | 94.84 | 96.03


features used to discriminate neutral and loud gives the best result, ~93-97% across the three wavelet packet filterbanks and the three sets of experiments. The primary concern for the stressed speech is the lombard utterances. Degradation of the results for the Lombard effect using the energy feature is observed in experiment 1, where an accuracy of 73% is obtained using the WP filters that approximate the ERB scale. This is due to the large differences in the effect of pronouncing the word utterances, generated by the vocal folds and glottis, which project a very high fundamental frequency during the release of the sound from the laryngeal cavity. This result infers that there is significant degradation when speech is spoken under a stress condition such as Lombard.

VII. CONCLUSIONS

This paper presents a simple feature extraction method based on three different auditory wavelet packet filterbank structures for multistyle classification of speech under stress. To test the effectiveness and reliability of the suggested features, LDA and SVM based classifiers are used. Three experiments were conducted using the extracted features. The experimental results show that the suggested features give very promising classification accuracy of 91% for all the combinations of emotional/stressed speech classification. The wavelet transform (WT) has the properties of time-frequency localization and multiresolution; the main reasons for its popularity lie in its complete theoretical framework, the great flexibility in choosing bases or wavelet packets (WP), and its low computational complexity. The suggested method can be used to detect the emotional/stressed state of a person. In future work, feature reduction will be applied to reduce the feature dimension, and other classification algorithms will be developed to improve the current results with less computation.

ACKNOWLEDGEMENT

This work is supported by the grant FRGS-9003-00224 from the Ministry of Higher Education of Malaysia. The authors wish to thank our Vice Chancellor Y. Bhg. Brig. Jen. Prof. Dato' Dr. Kamarudin Hussin for his valuable support during the research work.

References [1] O. Kwon, K. Chan, J. Hao, and T. Lee, "Emotion recognition by speech

signals," in EUROSPEECH 2003, GENEVA, pp. 125-128, 2003. [2] N. Mbitiru, P. Tay, J. Z. Zhang, and R. D. Adams, "Analysis of Stress in

Speech Using Empirical Mode Decomposition," Proceedings of The 2008 IAJC-IJME International Conference, pp. 140-146, 2008.

[3] H. Selye, "Stress Management and Research Center”, http://www.smrc.com.my/index.html, Retrieved on 01/12/2009

[4] G. Zhou, J. H. L. Hansen, and J. F. Kaiser, "Nonlinear feature based classification of speech under stress," IEEE Transactions on Speech and Audio Processing, vol. 9, pp. 201-216, 2001.

[5] S. E. Bou-Ghazale and J. H. L. Hansen, "A comparative study of traditional and newly proposed features for recognition of speech under stress," IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 429-442, 2000.

[6] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Classification of stress in speech using linear and nonlinear features," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, pp. 1394-1398, 2003.

[7] T. L. Nwe, F. S. Wei, and L. C. De Silva, "Speech based emotion classification," Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology (TENCON '01), Phuket Island, Langkawi Island, Singapore, pp. 297-301, 2001.

[8] B. Schuller, G. Rigoll, and M. Lang, "Hidden Markov model-based speech emotion recognition," Proceedings of The 2003 International Conference on Multimedia and Expo(ICME'03), Baltimore, Maryland, USA, pp. 401–404, 2003.

[9] J. H. L. Hansen and B. D. Womack, "Feature analysis and neural network-based classification of speech under stress," IEEE Transactions on Speech and Audio Processing, vol. 4, pp. 307-313, 1996.

[10] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Computing & Applications(Springer), vol. 9, pp. 290-296, 2000.

[11] C. H. Park and K. B. Sim, "Emotion recognition and acoustic analysis from speech signal," Proceedings of the International Joint Conference on Neural Networks, Portland, Oregon, USA, pp. 2594–2598, 2003.

[12] S. Casale, A. Russo, and S. Serrano, "Multistyle classification of speech under stress using feature subset selection based on genetic algorithms," Speech Communication (Elsevier), vol. 49, pp. 801-810, 2007.

[13] J. H. L. Hansen and S. E. Bou-Ghazale, "Getting started with SUSAS: A speech under simulated and actual stress database," Proceedings of the International Conference on Speech Communication and Technology (Eurospeech), Rhodes, Greece, pp. 1743–1746, 1997.

[14] Zhang Xueying and Jiao Zhiping, "Speech Recognition based on auditory wavelet packet filter," International Conference on Signal Processing, pp. 695-698, 2004.

[15] Waleed H. Abdulla, "Auditory based feature vectors for speech recognition systems," Electrical and Electronic Engineering Department, The University of Auckland, New Zealand, 2005.

[16] Raghuveer M. Rao and Ajith S. Bopardikar, Wavelet Transforms: Introduction to Theory and Applications, Pearson Education Asia, 2000.

[17] N. S. Nehe, D. V. Jadhav, and R. S. Holambe, "Multiresolution Features and Polynomial Kernel Subspace Approach for Isolated Word Recognition," International Conference on Advances in Computing, Communication and Control (ICAC3), 2009.

[18] C. Burrus, R. Gopinath, H. Guo, J. Odegard, and I. Selesnick, Introduction to wavelets and wavelet transforms: a primer, Prentice Hall Upper Saddle River, NJ, 1997.

[19] O. Farooq and S. Datta, "Mel filter-like admissible wavelet packet structure for speech recognition," IEEE Signal Processing Letters, vol. 8, no. 7, pp. 196-198, 2001.

[20] A. Cohen, I. Daubechies, and J. C. Feauveau, "Biorthogonal bases of compactly supported wavelets," Communications on Pure and Applied Mathematics (Wiley), vol. 45, no. 5, pp. 485-560, 1992.

[21] R. Behroozmand and F. Almasganj, "Optimal selection of wavelet packet-based features using genetic algorithm in pathological assessment of patients' speech signal with unilateral vocal fold paralysis," Computers in Biology and Medicine, vol. 37, pp. 474-485, 2007.

[22] E. Avci, D. Hanbay, and A. Varol, "An expert discrete wavelet adaptive network based fuzzy inference system for digital modulation recognition," Expert Systems with Applications, vol. 33, pp. 582-589, 2006.

[23] A. Ben-Hur, D. Horn, H. T. Siegelmann, et al., "A support vector clustering method," Proceedings of the 15th International Conference on Pattern Recognition, vol. 1(2), pp. 724-727, 2000.

[24] K. De Brabanter, P. Karsmakers, F. Ojeda, C. Alzate, J. De Brabanter, K. Pelckmans, B. De Moor, and J. Vandewalle, et al., "LS-SVMlab Toolbox User's Guide Version 1.7," 2010.

[25] http://www.esat.kuleuven.be/sista/lssvmlab

[26] R. Sarikaya and J. N. Gowdy, "Subband Based Classification of Speech under Stress," Digital Speech and Audio Processing Laboratory, Clemson University, 1997.

[27] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Detection of Stress and Emotion in Speech Using Traditional and FFT Based Log Energy Features," ICICS-PCM 2003, 2003.