
[IEEE 2012 5th International Congress on Image and Signal Processing (CISP) - Chongqing, Sichuan,...

978-1-4673-0964-6/12/$31.00 ©2012 IEEE 1796

2012 5th International Congress on Image and Signal Processing (CISP 2012)

Project founded by National Natural Science Foundation of China under

Grant No. 61075022

Environmental Sounds Recognition Using TESPAR

Guanyu You, Ying Li College of Mathematics and Computer Science

Fuzhou University Fuzhou, P.R. China

Abstract—Environmental sounds depict the sound content of the survival and activities of a variety of creatures, and are also closely related to the human living environment. Conventional approaches to environmental sound recognition require significant computational resources and employ complex signal processing methods in the frequency domain. This work proposes a low-complexity method named Time Encoded Signal Processing and Recognition (TESPAR). The computational requirements of this method are two orders of magnitude lower than those of the usual methods. We used the TESPAR coding method to produce simple data structures, and then used the archetypes technique for classification. Our method was tested on two databases: database 1 consisted of 10 classes of bird sounds to test intraspecific recognition, and database 2 consisted of 10 classes of different environmental sounds to test interspecific recognition. We also ran experiments on the same databases using MFCC and SVM for comparison. Results showed that TESPAR has a lower training time complexity than SVM, and that the recognition rate for interspecific recognition was better than for intraspecific recognition.

Keywords—environmental sounds recognition; Time Encoded Signal Processing And Recognition; Linde-Buzo-Gray vector quantization; archetypes; interspecific recognition; intraspecific recognition

I. INTRODUCTION

Audio data contains a significant amount of information, enabling a system to capture a semantically richer environment on top of what visual information can provide. Moreover, the fusion of audio and visual information can be used to capture a more complete description of a scene, such as for disambiguation of environment and object types. Compared to visual signals, audio data can be obtained at any moment while the system is functioning, in spite of challenging external conditions such as poor lighting or visual obstruction, and is relatively cheap to store and process.

Environmental sounds depict the sound content of the survival and activities of a variety of creatures, and are also closely related to the human living environment. For example, animal sounds at different moments can help biologists study the living habits of different animals in an ecological environment; observation of wind and rain sounds can also help meteorologists understand the meteorological conditions of a region. Environmental sound recognition is also important for intelligent robots. An intelligent robot can use hearing to complement vision, because vision only provides information within a limited field of view.

Environmental sound recognition is generally done in two phases: feature extraction first, followed by classification. Feature extraction can be split into two broad types: stationary and non-stationary feature extraction [1]. The most popular stationary feature extraction techniques are Mel frequency cepstral coefficients (MFCC), linear prediction cepstral (LPC) coefficients, and perceptual linear prediction (PLP) features. The most popular non-stationary feature extraction techniques are the short-time Fourier transform (STFT), the fast (discrete) wavelet transform (FWT), and the continuous wavelet transform (CWT). The commonly used classification techniques are dynamic time warping (DTW), hidden Markov models (HMM), learning vector quantization (LVQ), artificial neural networks (ANN), Gaussian mixture models (GMM) and support vector machines (SVM). However, these approaches require significant computational resources and generally employ complex signal processing methods in the frequency domain. We propose a low-complexity method named Time Encoded Signal Processing and Recognition (TESPAR).

TESPAR is a simple and efficient language for describing complex waveforms in digital terms [9]. It has three key features. First, it can separate and classify signals that are indistinguishable in the frequency domain. Second, it can code time-varying signals into optimum configurations for processing by artificial neural networks. Third, the algorithm is simple and can easily be implemented in hardware. The TESPAR algorithm can be implemented on an 8-bit microcontroller chip; the required storage, processing capacity and operating power are very small, and it can work in harsh industrial environments.

The TESPAR algorithm has been applied in several areas. Vasile V. Moca et al. used TESPAR in combination with a multilayer perceptron to identify depth of anesthesia (DOA) levels in [2]. Marius Vasile Ghiurcau et al. used the TESPAR method to detect wildlife intruders in [3]; they also used TESPAR to identify six types of vehicles based on their generated sound in [4]. Giorgos Mazarakis et al. used TESPAR and a fast artificial neural network (FANN) to perform musical instrument recognition in [5]. George and King applied TESPAR to speaker verification in [6]. However, there has been no research on the recognition of environmental sounds using TESPAR.

Given the advantages of TESPAR, this paper presents a method for the recognition of environmental sounds using TESPAR. Two databases were used for the experiments: database 1 consists of 10 classes of bird sounds, and database 2 consists of 10 types of environmental sounds. To simulate different environments, we added noise to the sounds at different SNRs; the noises included Gaussian white noise and wind sounds. We also ran the experiments on the same data set using SVM for comparison.

The rest of this paper is organized as follows. Section II presents the coding method of TESPAR. Experiments and result analysis are provided in Section III. Finally, the conclusion is presented in Section IV.


II. BACKGROUND

A. TESPAR

Time Encoded Signal Processing And Recognition (TESPAR) is a time-domain digital language for coding band-limited signals, first proposed by King and Gosling in [7]. The method is based on infinite clipping, a coding method proposed by Licklider and Pollack in [8]. They investigated the effects of amplitude clipping on the intelligibility of the speech waveform and extended this process to the so-called infinite clipping format, in which all amplitude information is removed from the waveform. The result is a binary transformation that preserves only the zero-crossing points of the original signal. Infinite clipping coding is a direct representation of the durations between the zero crossings of the waveform, and thus depends only on the waveform itself.
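The infinite-clipping transform itself is trivial to express in code. The following is a minimal Python/NumPy sketch (the paper's software ran on MATLAB, so the function name and conventions here are ours):

```python
import numpy as np

def infinite_clip(signal):
    """Infinite clipping: discard all amplitude information, keeping only
    the sign of each sample, so that only the zero-crossing positions of
    the original waveform survive."""
    clipped = np.sign(np.asarray(signal, dtype=float))
    clipped[clipped == 0] = 1  # convention: map exact zeros to +1
    return clipped
```

The output is a binary (+1/−1) sequence whose sign changes occur exactly at the zero crossings of the input.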

The above observations on the importance of zeros to the intelligibility of a coded waveform led researchers to further investigate zero-based methods of signal approximation. Consider a signal waveform of bandwidth W and duration T. The signal contains 2TW zeros, and typically 2TW exceeds several thousand. While the real zeros are easy to determine, complex zero extraction is a difficult problem involving the factorization of a polynomial of order 2TW. Such an approach to zero identification requires significant computational resources and is practically infeasible. An approximation of the complex zeros' locations can be given instead of determining their exact positions [9]. Thus, the waveform is segmented between successive real zeros, which bound the positions of the complex zeros. Complex zeros become visible in the shape of the waveform as minima, maxima or points of inflection, and occur in conjugate pairs inside an epoch.

Hence, a band-limited waveform may be simply approximated by segmenting it into successive epochs, each described by two features. The signal is split into portions situated between two adjacent zero crossings of the waveform, called epochs. Each epoch is described by a pair of parameters:

(1) D (duration): the duration between two successive real zeros (the number of samples between them);
(2) S (shape): the number of points of local minima/maxima between two consecutive real zeros.
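The epoch segmentation and the D/S descriptor can be sketched as follows. This is a Python/NumPy stand-in for the paper's MATLAB implementation; the function and variable names are our own:

```python
import numpy as np

def tespar_epochs(signal):
    """Split a waveform into epochs (segments between successive real zero
    crossings) and compute the D/S descriptor of each epoch.

    Returns a list of (D, S) pairs:
      D - duration of the epoch in samples,
      S - number of local minima/maxima strictly inside the epoch
          (approximating the complex-zero count).
    """
    signs = np.sign(np.asarray(signal, dtype=float))
    signs[signs == 0] = 1                       # treat exact zeros as positive
    crossings = np.where(np.diff(signs) != 0)[0] + 1  # real-zero positions

    pairs = []
    for start, end in zip(crossings[:-1], crossings[1:]):
        epoch = signal[start:end]
        d = end - start                         # D: samples between real zeros
        # S: count interior extrema via sign changes of the first difference.
        slope = np.diff(epoch)
        s = int(np.sum(np.diff(np.sign(slope)) != 0)) if len(epoch) > 2 else 0
        pairs.append((d, s))
    return pairs
```

For a pure sinusoid, every epoch is one half-cycle: D is half the period in samples and S is 1 (a single extremum).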

The D/S pairs obtained from the analysis of the original audio signal are coded using an alphabet, representing each epoch by a single "letter". This coding procedure results in a symbol stream. The alphabet is produced by a vector quantization process. The TESPAR alphabet is specific to each class of signals, and its main purpose is to reduce the noise affecting the epochs by assigning the same symbol to similar epochs. The symbol stream received from the TESPAR coder can easily be converted into a series of classification operands called TESPAR matrices:

Figure 1. The TESPAR D/S descriptor

(1) S matrix: an N × 1 vector that counts the number of occurrences of each symbol of the alphabet in the coded symbol stream, where N is the number of symbols in the alphabet.
(2) A matrix: an N × N matrix that counts the number of occurrences of all pairs of symbols at a distance n (the lag) apart. It contains more information than the S matrix, but as a trade-off its computation time is larger. The use of small lags (i.e., n ≤ 10) tends to emphasize short-term properties of the signals, whilst the use of larger lags tends to highlight more globally defined features. The elements of the A matrix can be represented as follows:

a_ij = (1 / (N_m − L)) · Σ_{n=L+1}^{N_m} x_ij(n)

where x_ij(n) = 1 when t(n) = i and t(n−L) = j, and x_ij(n) = 0 otherwise; t(n) is the n-th TESPAR symbol, N_m is the number of symbols in the stream, and L is the lag.
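Under these definitions, the S and A matrices can be computed directly from the symbol stream. A Python sketch (symbols are 0-indexed here, whereas the formula above uses 1-based indexing):

```python
import numpy as np

def tespar_matrices(stream, n_symbols=32, lag=2):
    """Build the TESPAR S and A matrices from a coded symbol stream.

    stream    : sequence of alphabet symbols t(1)..t(Nm), each in 0..n_symbols-1
    n_symbols : alphabet size N
    lag       : distance L between the two symbols of a counted pair
    """
    stream = np.asarray(stream)
    nm = len(stream)

    # S matrix: N x 1 histogram of symbol occurrences.
    s_mat = np.bincount(stream, minlength=n_symbols).astype(float)

    # A matrix: a_ij = (1 / (Nm - L)) * sum_n [t(n) == i and t(n - L) == j].
    a_mat = np.zeros((n_symbols, n_symbols))
    for n in range(lag, nm):                  # n = L+1 .. Nm in 1-based terms
        a_mat[stream[n], stream[n - lag]] += 1
    a_mat /= (nm - lag)
    return s_mat, a_mat
```

Note that with this normalization the entries of the A matrix sum to 1, regardless of the stream length.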

B. Classification

Compared with other frequency-domain descriptors, a significant advantage of representing time-varying signals with TESPAR matrices is that the matrices remain fixed in size regardless of the duration of the signal being coded. In other words, even if the different conditions in a classification problem span variable time frames, all may be represented using TESPAR matrices of a common dimension. The use of fixed-size data structures greatly simplifies the process of generating exemplar templates at the training stage, and renders TESPAR matrices amenable to a wide range of comparison and classification procedures.

Because of the fixed dimensions of the TESPAR matrices, the archetypes technique can be used to classify a particular sound. An archetype is created by adding together the A (or S) matrices of a class's training examples and computing their average. The averaging process tends to emphasize the consistent characteristics of a condition and to reduce the significance of anomalies that may exist in individual examples. Once the archetypes are computed, they can be stored in a database and used later for the classification of unknown samples. The process is simple: the A or S matrix of a test sample is computed and compared with each of the archetypes in the database. In this paper, the comparison is done using the city block distance; the archetype at the smallest distance determines the class of the test sample.

City block (L1) distance between two matrices X and Y: d(X, Y) = Σ_{i,j} |x_ij − y_ij|.
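The archetype construction and the nearest-archetype decision rule amount to a few lines of code. A Python sketch (the class labels and data structures here are illustrative, not the authors'):

```python
import numpy as np

def city_block(x, y):
    """City block (L1) distance: sum of absolute element-wise differences."""
    return float(np.abs(np.asarray(x, float) - np.asarray(y, float)).sum())

def build_archetype(train_matrices):
    """Archetype = element-wise average of a class's training S (or A) matrices."""
    return np.mean(train_matrices, axis=0)

def classify(sample, archetypes):
    """Return the label of the archetype closest (in L1 distance) to the sample."""
    return min(archetypes, key=lambda label: city_block(sample, archetypes[label]))
```

A test sample is simply assigned to whichever stored archetype minimizes the L1 distance.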


III. EXPERIMENT AND RESULT ANALYSIS

A. Experiment

The environmental sound classification procedure can be divided into three steps: preprocessing, TESPAR coding and classification. The procedure is shown in Fig. 2. Our main experimental aim was to estimate the recognition accuracy of environmental sound recognition using TESPAR compared with SVM.

All software was implemented on the MATLAB platform. We used sound records obtained from [10], organized into two databases. Database 1 consisted of 10 classes of bird sounds: cuckoo, duck, owl, flour chicken, chicken, gallicrex cinerea, partridge, orange-chicken, turtledove and streptopelia chinensis. All sounds in database 1 came from the same broad category (birds). Database 2 consisted of 10 different types of environmental sounds, including nature sounds such as ocean waves, rain, frog, lightning and water; human sounds such as footsteps and a kid crying; and sounds made by objects, such as a bell ringing and glass breaking. Database 1 was used for intraspecific recognition, and database 2 for interspecific recognition. Each kind of sound contained 10 records, 200 records in total. More details about these twenty classes of sounds are given in Tables I and II. We also ran the experiments on the same data set using SVM for comparison.

Figure 2. TESPAR coding and classification process procedure.

1) Preprocessing

For each database, the following process was used. The data set was divided into audio clips of 2 s duration and downsampled to an 8000 Hz sampling rate, mono-channel, 16 bits per sample. Because TESPAR is sensitive to DC offset, we also removed the DC offset from the signals.
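The preprocessing step can be sketched as below. This is a simplified Python stand-in for the paper's MATLAB pipeline; in particular, the decimation here has no anti-aliasing filter, which a real implementation would add:

```python
import numpy as np

def preprocess(signal, sr, target_sr=8000, clip_seconds=2.0):
    """Downsample to 8 kHz (naive decimation, integer ratios only), remove
    the DC offset, and split the record into 2-second clips."""
    assert sr % target_sr == 0, "sketch assumes an integer decimation ratio"
    signal = np.asarray(signal, float)[::sr // target_sr]  # crude downsampling
    signal = signal - np.mean(signal)          # TESPAR is sensitive to DC offset
    clip_len = int(target_sr * clip_seconds)
    n_clips = len(signal) // clip_len
    return [signal[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]
```

In practice a proper resampler (e.g. a polyphase or filtered decimator) should replace the bare stride above.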

2) TESPAR Coding

We used the Linde-Buzo-Gray (LBG) vector quantization algorithm [11] to generate the alphabet. For this computation, we used 20 records, 2 clips from each class. We computed the D and S values between every two consecutive real zeros and used the D/S pairs as the training data for the LBG algorithm. A 32-symbol alphabet was sufficient for our research. For databases 1 and 2, we used 9 clips per class for training.

3) Classification

In this step, we first generated the TESPAR symbol streams and the TESPAR matrices, and then computed the archetypes for both the A and S matrices. For the A matrix, the lag n was set to 2. The archetypes were finally stored in a database. To obtain more statistically reliable results, for each experiment we generated ten archetypes, each time using 9 clips per class for training and keeping the remaining clip for testing. When testing an unknown sound sample, its S and A matrices were computed and then compared with the archetypes stored in the database.
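The LBG alphabet generation used in the coding step above follows the classic splitting procedure; here is a generic Python sketch of it, not the authors' code, with an illustrative perturbation factor eps:

```python
import numpy as np

def lbg_codebook(data, n_codes=32, n_iter=20, eps=1e-2):
    """Generate a codebook of D/S vectors with the splitting LBG algorithm.

    data: (n, 2) array of D/S pairs; n_codes should be a power of two.
    The codebook starts as the global centroid and is doubled by small
    multiplicative perturbations, with k-means-style refinement after
    every split.
    """
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < n_codes:
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            # Nearest-code assignment, then centroid update per code.
            dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for k in range(len(codebook)):
                if np.any(labels == k):
                    codebook[k] = data[labels == k].mean(axis=0)
    return codebook
```

Each D/S pair is then replaced by the index of its nearest codebook entry, producing the symbol stream.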

TABLE I. THE CLASSES OF BIRD SOUNDS.

NO  CLASS           NO  CLASS
1   Cuckoo          6   Gallicrex Cinerea
2   Duck            7   Partridge
3   Owl             8   Orange-chicken
4   Flour Chicken   9   Turtledove
5   Chicken         10  Streptopelia Chinensis

TABLE II. THE CLASSES OF ENVIRONMENTAL SOUNDS.

NO  CLASS        NO  CLASS
1   Footstep     6   Lightning
2   Kid Crying   7   Car
3   Ocean Wave   8   Bell Ring
4   Frog         9   Water
5   Rain         10  Glass Breaking

B. Comparison Between TESPAR And SVM

The classification results of TESPAR (using the S matrix and the A matrix) and of SVM [12] are compared in Table III. For database 1, the recognition rate of TESPAR using the S matrix is 92%, using the A matrix 94%, and using SVM 98%. For database 2, the recognition rate of TESPAR using the S matrix is 95%, using the A matrix 98%, and using SVM 100%. The results show that the A matrix yields a better recognition rate, because it contains more information. As expected, the recognition accuracy of SVM is better than that of TESPAR, but at the expense of much higher complexity. For m training samples, the training time complexity of TESPAR is O(32 × 2 × m) = O(m), while for SVM it is O(m³) [13]. So when applied to large-scale environmental sound recognition, TESPAR is more practical.

C. Recognition Under Different Environmental Conditions And Different SNRs

To simulate different environmental conditions, we tested the recognition accuracy of TESPAR and SVM under different noises and different SNR levels. The noises included Gaussian white noise and wind sounds.
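Mixing noise into a clean recording at a prescribed SNR amounts to scaling the noise power. A generic Python sketch of this step (the authors do not specify their exact procedure, so this is an assumption about the standard approach):

```python
import numpy as np

def add_noise(signal, noise, snr_db):
    """Add noise to a signal at a target SNR (in dB).

    The noise is tiled or truncated to the signal length and scaled so
    that 10*log10(P_signal / P_noise_scaled) equals snr_db.
    """
    signal = np.asarray(signal, float)
    noise = np.resize(np.asarray(noise, float), len(signal))
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + scale * noise
```

The same routine serves for both Gaussian white noise (a random array) and a recorded wind sound (a waveform loaded from file).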

First, Gaussian white noise at different SNRs was added to the sounds, and the noisy sounds were identified with TESPAR and SVM respectively. The corresponding results are shown in Table IV. For database 1, when SNR ≥ 50 dB, the accuracy of TESPAR is 94% while that of SVM is 98%; for database 2, the figures are 98% and 100% respectively. As the SNR decreases, the accuracy drops rapidly.

TABLE III. RECOGNITION ACCURACY OF DIFFERENT METHODS

Database  Method     Accuracy (%)  Time Complexity
1         TESPAR(S)  92            O(m)
1         TESPAR(A)  94            O(m)
1         SVM        98            O(m^3)
2         TESPAR(S)  95            O(m)
2         TESPAR(A)  98            O(m)
2         SVM        100           O(m^3)

TABLE IV. CLASSIFICATION RESULTS OF DIFFERENT METHODS WITH GAUSSIAN WHITE NOISE ADDED AT DIFFERENT SNRS

Database  SNR (dB)  TESPAR(A) Accuracy (%)  SVM Accuracy (%)
1         50        94                      98
1         40        89                      96
1         30        76                      74
1         20        38                      36
1         10        17                      15
2         50        98                      100
2         40        96                      96
2         30        81                      74
2         20        40                      38
2         10        20                      18

Second, wind sound at different SNRs was added to the sounds, and the noisy sounds were identified with TESPAR and SVM respectively. The corresponding results are shown in Table V.

TABLE V. CLASSIFICATION RESULTS OF DIFFERENT METHODS WITH WIND SOUND ADDED AT DIFFERENT SNRS

Database  SNR (dB)  TESPAR(A) Accuracy (%)  SVM Accuracy (%)
1         50        94                      98
1         40        86                      94
1         30        74                      72
1         20        38                      36
1         10        16                      15
2         50        98                      100
2         40        96                      96
2         30        79                      74
2         20        38                      36
2         10        19                      18

The results showed that when SNR ≥ 50 dB, the accuracy of TESPAR for database 1 was 94%, while that of SVM was 98%. As the SNR decreased, the accuracy dropped rapidly; when SNR = 10 dB, the accuracy of TESPAR was 16% while that of SVM was 15%. For database 2, when SNR ≥ 50 dB the accuracy of TESPAR was lower than that of SVM, but when SNR ≤ 30 dB the accuracy of TESPAR was better than that of SVM.

The experimental results implied that under high-SNR conditions (clean or SNR ≥ 40 dB), the recognition rate of SVM was better than that of TESPAR, but under low-SNR conditions the recognition rate of TESPAR was better than that of SVM. This may be due to the low noise immunity of MFCC at low SNRs. Moreover, the data above show that the recognition accuracy on database 1 was lower than on database 2, because the differences between the signals in database 1 were smaller than those in database 2.

IV. CONCLUSION

This work proposed a low-complexity method for environmental sound recognition using Time Encoded Signal Processing and Recognition (TESPAR). Our experiments and analysis showed that the recognition rate when using the A matrix is better than when using the S matrix. The training time complexity of TESPAR is O(m), much lower than that of SVM. Intraspecific recognition accuracy was lower than interspecific recognition accuracy, and the recognition accuracy of TESPAR remained effective under noisy conditions. Our future work will aim at improving the recognition performance of TESPAR.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under Grant No.61075022. The authors would like to thank the partners in the lab for their helpful comments and suggestions.

REFERENCES

[1] M. Cowling and R. Sitte, "Comparison of techniques for environmental sound recognition," Pattern Recognition Letters, vol. 24, pp. 2895-2903, 2003.
[2] V. V. Moca, B. Scheller and R. C. Muresan, "EEG under anesthesia—Feature extraction with TESPAR," Computer Methods and Programs in Biomedicine, vol. 95, pp. 191-202, 2009.
[3] M. V. Ghiurcau, C. Rusu, R. C. Bilcu and J. Astola, "Audio based solutions for detecting intruders in wild areas," Signal Processing, vol. 92, pp. 829-840, 2012.
[4] M. V. Ghiurcau and C. Rusu, "Vehicle sound classification application and low pass filtering influence," in Proc. IEEE ISSCS 2009, Iasi, Romania, July 9-11, 2009, pp. 301-304.
[5] E. Benetos, M. Kotti and C. Kotropoulos, "Musical instrument classification using non-negative matrix factorization algorithms," in Proc. IEEE ISCAS, Kos, Greece, 2006, pp. 246-255.
[6] M. H. George and R. A. King, "Time Encoded Signal Processing and Recognition for Reduced Data, High Performance Speaker Verification Architectures," in Proc. First International Conference on Audio- and Video-based Biometric Person Authentication (AVBPA'97), Crans-Montana, Switzerland, 1997, pp. 377-384.
[7] R. A. King and W. Gosling, Electron. Lett., vol. 14, no. 15, pp. 456-457, 1978.
[8] J. C. Licklider and I. Pollack, "Effects of differentiation, integration and infinite peak clipping upon the intelligibility of speech," J. Acoust. Soc. Am., vol. 20, pp. 42-51, 1948.
[9] R. A. King and T. C. Philips, "Shannon, TESPAR and approximation strategies," Computers and Security, vol. 18, pp. 445-453, 1999.
[10] Freesound.org, http://www.freesound.org/.
[11] G. Patane and M. Russo, "The enhanced LBG algorithm," Neural Networks, vol. 14, pp. 1219-1237, 2001.
[12] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2011, http://www.csie.ntu.edu.tw/~cjlin/libsvm/. Accessed: February 15, 2011.
[13] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, 2002.
[14] J. Rojo-Alvarez, G. Camps-Valls, M. Martinez-Ramon, E. Soria-Olivas, A. Navia-Vazquez and A. Figueiras-Vidal, "Support vector machines framework for linear signal processing," Signal Processing, vol. 85, no. 12, pp. 2316-2326, 2005.