1 Btp Report 07010245 Suraj s Sheth16april2011


Extraction of Pitch from

Speech Signals using

Hilbert Huang Transform

Submitted for the fulfilment of the requirements for the award of

the degree of Bachelor of Technology

by

Suraj Satishkumar Sheth (ROLL No. : 07010245)

Supervised by

Dr. S. R. M. Prasanna

Department of Electronics & Communication Engineering

Indian Institute of Technology, Guwahati

Year: July 2010 to May 2011



CERTIFICATE

It is certified that the work contained in the report entitled

Extraction of Pitch from Speech Signals using Hilbert Huang

Transform is a bona fide work of Suraj Satishkumar Sheth

(Roll No. 07010245), which has been carried out in the

Department of Electronics and Communication Engineering,

Indian Institute of Technology (IIT) Guwahati under my

supervision and this work has not been submitted elsewhere for a

degree.

Dr. S. R. M. PRASANNA

Associate Professor,

Department of Electronics and Communication Engineering,

Indian Institute of Technology Guwahati,

Guwahati 781039, INDIA

April, 2011

GUWAHATI



Acknowledgements

I feel it a great privilege in expressing my deepest and most sincere gratitude to my supervisor, Dr. S. R. M. Prasanna, for the most valuable guidance provided to me during the course of this project. Such a successful work would not have been possible without his guidance. I would also like to thank the Head of the Department and other faculty members for their kind help in carrying out this work.

I am very grateful to the non-teaching staff and students of the department who have always helped me out from the very beginning of this work. I sincerely thank Mr. Gyandhar Pradhan and Mr. Govind who have helped me.



Contents

1. Abstract
2. Introduction
3. My contributions
4. Empirical Mode Decomposition
5. Mode-Mixing
6. Ensemble Empirical Mode Decomposition
7. Neighbourhood Limited Empirical Mode Decomposition
8. Pitch Extraction using Empirical Mode Decomposition
9. Potential Applications
10. Results and Conclusions
11. Future Work
12. References



List of Figures

Figure 1. Speech Signal along with Maxima and Minima
Figure 2. Speech Signal along with Maximum, minimum and mean-envelope
Figure 3. A synthetic signal with two sinusoids and the corresponding IMFs
Figure 4. A synthetic signal having two sinusoids of different lengths, and the Intrinsic Mode Functions obtained using Empirical Mode Decomposition
Figure 5. A speech signal and the corresponding Intrinsic Mode Functions depicting the Mode-Mixing phenomenon, specifically in IMF4 and IMF5
Figure 6. The Speech signal and its Fourth and Fifth Intrinsic Mode Functions
Figure 7. Input Signal: Concatenation of two sinusoids
Figure 8. EMD: Single IMF contains two modes
Figure 9. EEMD: Highest Frequency IMF
Figure 10. EEMD: Lowest Frequency IMF
Figure 11. Input Signal: x=1:1000;y=[sin(0.1*x) sin(0.8*x)];
Figure 12. EMD: Single IMF contains 2 modes
Figure 13. NLEMD: Highest Frequency IMF
Figure 14. NLEMD: Lowest Frequency IMF
Figure 15. The speech signal and the corresponding IMFs
Figure 16. Filtered IMFs along with the speech signal
Figure 17. The speech signal and the corresponding IMFs determined by modified Neighbourhood Limited Empirical Mode Decomposition
Figure 18. The speech signal and the corresponding Filtered IMFs obtained using EMD, Neighbourhood Limited Criterion and Filtering
Figure 19. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Envelope of Filtered IMF6 and the Threshold in red
Figure 20. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Envelope of Filtered IMF6 and the Threshold in red, (d) Epochs determined by our novel HHT based method and (e) Epochs determined by ZFF method
Figure 21. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Time-frequency representation of IMF6, (d) Epochs determined by our novel HHT based method and (e) Epochs determined by ZFF method
Figure 22. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Time-frequency representation of IMF6, (d) Epochs determined by our novel HHT based method and (e) Epochs determined by ZFF method



Tables

Table 1: Comparison of Frequency Representation techniques



1. Abstract

In this report, a novel method for instantaneous pitch extraction using the Hilbert Huang Transform is proposed. Unlike traditional methods, it does not require truncating the data and segmenting them into windows, nor does it rely on stationarity or linearity assumptions. The instantaneous pitch is derived from the Intrinsic Mode Functions (IMFs), which in turn are obtained from the speech signal through Empirical Mode Decomposition. We have explored various methods to overcome the shortcomings of Empirical Mode Decomposition and proposed several new variants of these methods that address the specific problem at hand. Robustness is increased by using a novel variant of Neighbourhood Limited Empirical Mode Decomposition. Also, the filtered IMFs, rather than the IMFs themselves, are used to determine the best candidate for pitch.

The results of the novel algorithm are compared to the epochs obtained using the Zero-Frequency Filter method. The accuracy is found to be 96.43% for a dataset containing about 5000 epochs determined by the Zero-Frequency Filter method.

The envelope of the modified IMF that is the best candidate for pitch is also shown to be useful for several applications such as Digit Recognition. The algorithm provides pitch with high time resolution and high frequency resolution, and is not significantly affected by windowing effects because short-time analysis is not used. The bases used are adaptive rather than a priori, which is important for this specific problem.



2. Introduction

The pitch of a speech signal plays an important role in different speech processing applications, including speaker recognition, automatic speech recognition, speech enhancement, analysis and modelling of speech prosody, low-bit-rate speech coding, etc. Although many methods exist for pitch extraction, their reliability and accuracy remain limited. Usually, the instantaneous values of pitch differ even within a frame, so we need good time resolution. We also need good frequency resolution, especially since the pitch is used in various applications.

Time-frequency representation is an important component of Signal Processing and has many potential applications. Examples include the STFT, the Wavelet Transform, etc. Most data-analysis methods assume that the data is linear and stationary. Techniques like Wavelet analysis work for non-stationary but linear data. All these techniques face one or more of the following problems: windowing effect, low time resolution, low frequency resolution, fixed basis functions, etc. Usually, speech is both non-linear and non-stationary. Hence, we need adaptive basis functions for speech.

The Hilbert Huang Transform (HHT) is a time-frequency representation technique developed recently by N. E. Huang and his group [1]. It is an empirical data-analysis method capable of processing non-linear and non-stationary data, and it provides the ingredients to obtain adaptive bases for the representation. The HHT has given much better and sharper results than other conventional time-frequency-energy representation methods for various problems. Additionally, the HHT has revealed true physical meanings in many of the data examined.

The most important component of the Hilbert Huang Transform is Empirical Mode Decomposition (EMD). It separates components with different frequency modes and produces a sensible representation even from non-linear and non-stationary data.

After Empirical Mode Decomposition, the next component is Hilbert Spectral Analysis, which provides the frequency-time representation of the input non-linear and non-stationary signal. The Hilbert Huang Transform (or, more precisely, the Empirical Mode Decomposition), in a weak sense, captures the highest frequency component at each iteration, producing IMFs that have different frequency modes at a particular time instant. However, a particular IMF may have different frequency modes at different time instants. This phenomenon is called 'mode-mixing' [2]. An example is given in Figure 6. Mode-mixing does not affect Hilbert Spectral Analysis because, in the final time-frequency representation, all the IMFs are mapped to a single graph. But if we want to use the IMFs individually and want to ensure that each IMF has a single frequency mode, as in the case of pitch extraction, we need to get rid of mode-mixing. In this report, we explain the various techniques for enhancing the decomposition of a speech signal into Intrinsic Mode Functions and for removing mode-mixing in the Hilbert Huang Transform, and we propose a new method for pitch extraction using the Hilbert Huang Transform.



3. My contributions

• Problem Formulation
• Literature Survey for Hilbert Huang Transform and Empirical Mode Decomposition
• Writing the code for Empirical Mode Decomposition and Hilbert Huang Transform in Matlab
• Testing a set of synthetic signals to understand the features of Empirical Mode Decomposition and Hilbert Huang Transform
• Exploring the applications of Hilbert Huang Transform to Speech Signals
• Trying out the noise-assisted analysis of data (Ensemble Empirical Mode Decomposition) to get rid of mode-mixing, one of the shortcomings of Hilbert Huang Transform, and writing codes for it
• Codes for Neighbourhood Limited Empirical Mode Decomposition algorithm
• Literature Survey for pitch extraction algorithms
• Derivation of the algorithm for pitch extraction using Empirical Mode Decomposition
• Writing the code for pitch extraction using Empirical Mode Decomposition
• Location of Epochs using Empirical Mode Decomposition and comparing it to the Epochs derived from Zero Frequency Filter Method [5]
• Optimising Ensemble Empirical Mode Decomposition for reducing time complexity
• Testing this method on Speech Signals and the corresponding Intrinsic Mode Functions
• Application of Neighbourhood Limited Empirical Mode Decomposition to synthetic and speech signals and designing novel variants of Neighbourhood Limited Empirical Mode Decomposition for the specific problem in hand, Pitch Extraction
• Developing a novel combination of filtering and Neighbourhood Limited Empirical Mode Decomposition methods for the extraction of pitch to increase efficiency
• Analysing the pitch and the epochs obtained using the modified filtering, novel variant of Neighbourhood Limited Empirical Mode Decomposition and Hilbert Spectral Analysis
• Exploring the use of features of modified Intrinsic Mode Function for various purposes including Digit Recognition
• Comparing the proposed novel epoch detection algorithm to Zero-Frequency Filter method with a dataset containing about 5000 epochs determined by the Zero-Frequency Filter method and analysing the results



4. Empirical Mode Decomposition

The fundamental part of the Hilbert Huang Transform is the Empirical Mode Decomposition (EMD) method. Using the EMD method, a signal can be decomposed into a finite and often small number of frequency modes called Intrinsic Mode Functions (IMFs). An IMF represents a simple oscillatory mode similar to a simple harmonic function, but it is much more general: unlike a simple harmonic component, an IMF has an amplitude and a frequency that are functions of time.

The algorithm can be described as:

Step 1: Initially set $z_0 = x(t)$ and $i = 0$, where $x(t)$ is the speech signal.

Step 2: To find the $(i+1)$th Intrinsic Mode Function:

(a) Initially set $h_i^{(0)} = z_i$ and $k = 1$;

(b) Find the local extrema of $h_i^{(k-1)}$;

(c) Construct the maxima envelope and the minima envelope of $h_i^{(k-1)}$ by interpolation;

(d) Calculate the mean envelope $m_i^{(k-1)}$ from the maxima envelope and the minima envelope;

(e) Subtract the mean envelope from the input signal of this stage: $h_i^{(k)} = h_i^{(k-1)} - m_i^{(k-1)}$;

(f) Check whether $h_i^{(k)}$ satisfies the properties of an Intrinsic Mode Function. If not, set $k = k + 1$ and go back to (b).

Step 3: Define $z_{i+1} = z_i - h_i^{(k)}$ and set $i = i + 1$.

Step 4: Check whether the required precision is obtained. If not, go back to Step 2.
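As a rough illustration of steps (b) to (e), the following Python sketch performs a single sifting pass. It is not the report's MATLAB code: the function names are hypothetical, and the envelopes are piecewise-linear rather than spline-interpolated, an assumption made only to keep the sketch free of external libraries.

```python
import math


def local_extrema(x):
    """Return (maxima_idx, minima_idx) for the interior local extrema of x."""
    maxima, minima = [], []
    for n in range(1, len(x) - 1):
        if x[n - 1] < x[n] >= x[n + 1]:
            maxima.append(n)
        elif x[n - 1] > x[n] <= x[n + 1]:
            minima.append(n)
    return maxima, minima


def envelope(x, idx):
    """Piecewise-linear envelope through x[idx], held flat beyond both ends.

    Assumption: linear interpolation stands in for the spline used in the text.
    """
    env = []
    j = 0
    for n in range(len(x)):
        # Advance j so that idx[j] is the last extremum at or before n.
        while j + 1 < len(idx) and idx[j + 1] <= n:
            j += 1
        if n <= idx[0]:
            env.append(x[idx[0]])
        elif n >= idx[-1]:
            env.append(x[idx[-1]])
        else:
            a, b = idx[j], idx[j + 1]
            w = (n - a) / (b - a)
            env.append((1 - w) * x[a] + w * x[b])
    return env


def sift_once(h):
    """One sifting pass: subtract the mean of the two envelopes from h.

    Returns None when h has too few extrema to build envelopes.
    """
    maxima, minima = local_extrema(h)
    if len(maxima) < 2 or len(minima) < 2:
        return None
    upper = envelope(h, maxima)
    lower = envelope(h, minima)
    return [v - (u + l) / 2 for v, u, l in zip(h, upper, lower)]


# Example: a sinusoid riding on a constant offset of 2.
x = [2.0 + math.sin(2 * math.pi * n / 20) for n in range(400)]
h1 = sift_once(x)
```

In this toy example both envelopes are flat, so a single pass removes the offset and leaves the sinusoid. Real speech would normally need several passes and a smoother interpolant.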



The procedure followed to compute the Intrinsic Mode Functions is:

1) Initially, the maxima and minima of the speech signal are computed.

Fig. 1: Speech Signal along with Maxima and Minima

2) Using a curve-fitting technique (in this case, the built-in spline function in MATLAB), a maxima envelope is generated for the set of maxima and a minima envelope is generated for the set of minima.

3) A mean envelope (in magenta) is generated as the mean of the maxima envelope (in red) and the minima envelope (in green), computed at each sampling point.



Fig. 2: Speech Signal along with maximum, minimum and mean envelopes

4) The mean envelope is subtracted from the speech signal $s(t)$ to get a new signal $h_i^{(k)}(t)$, which is a potential IMF.

5) Check whether $h_i^{(k)}(t)$ satisfies the properties required of an IMF:

a) The number of extrema and the number of zero crossings can differ by at most one.

b) The mean value of the envelope (steps 1, 2 and 3 applied to $h_i^{(k)}(t)$) should be zero at all points.

6) Also check whether $h_i^{(k)}(t)$ satisfies the Standard Deviation criterion, where the Standard Deviation is defined by

$$SD = \sum_{t=0}^{T} \frac{\left| h_1^{(k-1)}(t) - h_1^{(k)}(t) \right|^{2}}{\left( h_1^{(k-1)}(t) \right)^{2}}$$

Here, $h_1^{(0)}(t) = s(t)$, and $h_1^{(k-1)}(t)$ and $h_1^{(k)}(t)$ are the potential IMFs in iterations $k-1$ and $k$ respectively.

The Standard Deviation threshold is usually set between 0.2 and 0.3.

7) If $h_i^{(k)}(t)$ satisfies the criteria mentioned in steps 5 and 6, $c_1(t) = h_i^{(k)}(t)$ is declared to be an IMF (the 1st IMF in this case). This IMF is subtracted from the original speech signal $s(t)$ to get $s_1(t)$. Steps 1 to 7 are then repeated with $s_1(t)$ as the input. The process is stopped when the residual signal is a monotonic function or a constant.

8) So, for each input signal, we get a set of IMFs and possibly a residue, which is a constant or a monotonic signal.
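The checks in steps 5(a) and 6 can be sketched in a few lines of Python. This is an illustration under assumptions rather than the report's implementation: the function names are hypothetical, the Standard Deviation is taken to be the sum of squared differences between consecutive sifting results normalised by the squared previous result, and the small constant `eps` is added here only to avoid division by zero. Criterion 5(b), the zero-mean envelope, is not checked in this sketch.

```python
import math


def count_zero_crossings(h):
    """Count sign changes between consecutive samples (exact zeros are ignored)."""
    return sum(1 for a, b in zip(h, h[1:]) if a * b < 0)


def count_extrema(h):
    """Count interior samples where the slope changes sign."""
    return sum(1 for a, b, c in zip(h, h[1:], h[2:]) if (b - a) * (c - b) < 0)


def looks_like_imf(h):
    """Criterion 5(a): extrema and zero crossings differ by at most one."""
    return abs(count_extrema(h) - count_zero_crossings(h)) <= 1


def sifting_sd(h_prev, h_curr, eps=1e-12):
    """Assumed SD form: sum over t of (h_prev - h_curr)^2 / h_prev^2.

    eps is a guard introduced in this sketch, not part of the definition.
    """
    return sum((p - c) ** 2 / (p * p + eps) for p, c in zip(h_prev, h_curr))


# Example: a pure sinusoid passes 5(a); the same sinusoid with an offset does not.
sine = [math.sin(2 * math.pi * n / 20 + 0.3) for n in range(200)]
shifted = [v + 2.0 for v in sine]
sd_between = sifting_sd(shifted, sine)
```

With the threshold range quoted above, sifting would continue while the SD between consecutive passes stays above roughly 0.2 to 0.3.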

Now, let us compare the features of three different types of Frequency representation techniques:

                 Fourier Transform   Wavelet Transform       Hilbert-Huang Transform

Basis            a priori            a priori                adaptive

Frequency        Not local           Not local               Local, instantaneous

Presentation     Energy-Frequency    Energy-Time-Frequency   Energy-Time-Frequency

Non-linear       NO                  NO                      YES

Non-stationary   NO                  YES                     YES

Table 1: Comparison of Frequency Representation techniques



For example, consider the signal x=1:10000; y=sin(x*0.01)+sin(x*0.1); The signal is plotted along with its two IMFs.

Fig. 3: A synthetic signal with two sinusoids and the corresponding IMFs

Fig. 4: A synthetic signal having two sinusoids of different lengths. The highest-frequency sinusoid persists for a longer duration. The second and third subplots show the Intrinsic Mode Functions obtained using Empirical Mode Decomposition.

In this case, EMD separates the frequency components well. In the general case, however, it faces the problem of mode-mixing, in which a single frequency component is present in more than one IMF.



5. Mode-Mixing

Mode-mixing is a phenomenon in which a single frequency mode is present in more than one Intrinsic Mode Function. It occurs because Empirical Mode Decomposition picks up the highest frequency component at each point. The figure below depicts this.

Figure 5. A speech signal and the corresponding Intrinsic Mode Functions depicting the mode-mixing phenomenon, specifically in IMF4 and IMF5



Now, let us observe the magnified view of the speech signal, the 4th Intrinsic Mode Function and the 5th Intrinsic Mode Function.

Figure 6. The Speech signal and its Fourth and Fifth Intrinsic Mode Functions

The portions highlighted in green have the same characteristics, yet they are spread over more than one Intrinsic Mode Function. We want them to be present in a single Intrinsic Mode Function. This problem arises when we process the Intrinsic Mode Functions individually. When the application requires the whole time-frequency representation, mode-mixing does not come into the picture because the frequency components remain intact. It is therefore not a shortcoming of the Hilbert Huang Transform itself, but a limitation of its representation. In this specific case, we are searching for a particular IMF, so we need to tackle the problem of mode-mixing.



6. Ensemble Empirical Mode Decomposition

Ensemble Empirical Mode Decomposition (EEMD) [3] is a noise-assisted data analysis method that addresses mode-mixing. White Gaussian noise is added to the input speech signal before Empirical Mode Decomposition is applied. The same experiment is repeated N (>> 1) times using N different noise sequences, and the corresponding IMFs from these N experiments are averaged. Because the noise is random, its contribution to the average becomes negligible compared to the signal, so that ideally only the signal component remains. We can thus avoid mode-mixing in Empirical Mode Decomposition.
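The averaging argument can be checked numerically without running EMD at all. The Python sketch below, with hypothetical function names and an arbitrary seed, averages N independent white Gaussian noise sequences and measures the RMS of the result, which is expected to shrink roughly as 1/sqrt(N). It illustrates only why the added noise fades in the ensemble mean, not the full EEMD procedure.

```python
import math
import random


def rms(x):
    """Root-mean-square value of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def averaged_noise(num_trials, length, sigma=1.0, seed=0):
    """Average num_trials independent white Gaussian noise sequences."""
    rng = random.Random(seed)
    acc = [0.0] * length
    for _ in range(num_trials):
        for n in range(length):
            acc[n] += rng.gauss(0.0, sigma)
    return [a / num_trials for a in acc]


# Residual noise level for a single trial and for an ensemble of 100 trials.
rms_single = rms(averaged_noise(1, 2000))
rms_ensemble = rms(averaged_noise(100, 2000))
```

Since the residual noise falls only with the square root of N, reducing it by a further factor of ten needs a hundred times more trials.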

Figure 7. Input Signal: Concatenation of two sinusoids

Figure 8. EMD: Single IMF contains two modes

Figure 9. EEMD: Highest Frequency IMF

Figure 10. EEMD: Lowest Frequency IMF



The EEMD algorithm [8] removes mode-mixing by defining the true IMF components as the mean of an ensemble of trials, each obtained by adding noise of finite variance to the input signal. Although the autocorrelation-based method has been reported to be the best pitch estimation technique for the analysis of the pathological sustained vowel /a/, it has been shown [9] that it fails, along with the RAPT algorithm, while the method using Ensemble Empirical Mode Decomposition gives better results.

In the figures above, the first plot shows a synthetic signal that is a concatenation of two sinusoids. With Empirical Mode Decomposition, which captures the highest frequency component at each instant of time, a single Intrinsic Mode Function contains both sinusoids. With Ensemble Empirical Mode Decomposition, we instead get two different Intrinsic Mode Functions containing one sinusoid each, as desired, so the mode-mixing has been handled well. We need a particular Intrinsic Mode Function to contain a specific frequency component, and this can be achieved using Ensemble Empirical Mode Decomposition.

However, Ensemble Empirical Mode Decomposition has a few limitations. The EEMD outputs do not satisfy the properties of IMFs unless N, the number of trials, is very high. To reduce the variance of the estimate, we need to average it over a number of estimates [7], and the residual noise strength reduces only when N is high. We found that the higher the value of N (of the order of thousands), the better the output. But this leads to an N-fold increase in resource consumption (mainly time). Also, because of the added noise, the execution of a single trial becomes costlier. Hence, the run-time is very high. We then experimented with varying noise levels and found that if the noise energy is very low, EEMD behaves like ordinary Empirical Mode Decomposition and does little to avoid mode-mixing. So, we need an alternative to tackle mode-mixing.



7. Neighbourhood Limited Empirical Mode Decomposition

Neighbourhood Limited Empirical Mode Decomposition (NLEMD) [4] is another method to avoid mode-mixing in Empirical Mode Decomposition. In the EMD algorithm described above, frequency mixing usually occurs because the highest frequency component is picked in each locality. So, we need to restrict the frequency span of a particular IMF. This can be done by limiting the distance between two consecutive extrema: a spurious extremum is added whenever the distance between two consecutive extrema is large. This ensures that the frequency within a particular frequency mode is limited, while the amplitude can still vary. Thus, the frequencies are separated efficiently into their corresponding frequency modes.
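One way to picture the distance limit is the small Python sketch below. It is a guess at the mechanics rather than the algorithm of [4]: the function name, the `max_gap` parameter and the choice to place spurious extrema at fixed steps of `max_gap` samples are all assumptions made for illustration.

```python
def limit_extrema_spacing(extrema_idx, max_gap):
    """Insert spurious extremum positions so no gap exceeds max_gap samples.

    extrema_idx: sorted sample indices of the detected extrema.
    max_gap: largest allowed distance between consecutive extrema (assumed parameter).
    """
    if max_gap <= 0:
        raise ValueError("max_gap must be positive")
    if not extrema_idx:
        return []
    limited = [extrema_idx[0]]
    for nxt in extrema_idx[1:]:
        # Fill an over-long gap with evenly stepped spurious extrema.
        while nxt - limited[-1] > max_gap:
            limited.append(limited[-1] + max_gap)
        limited.append(nxt)
    return limited


# Example: the 28-sample gap between indices 12 and 40 receives two extra extrema.
spaced = limit_extrema_spacing([0, 10, 12, 40], max_gap=10)
```

The envelopes would then be interpolated through the augmented set of extrema, which stops slowly varying stretches from falling into an IMF meant for faster oscillations.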

Figure 11. Input Signal: x=1:1000;y=[sin(0.1*x) sin(0.8*x)];

Figure 12. EMD: Single IMF contains 2 modes

Figure 13. NLEMD: Highest Frequency IMF

Figure 14. NLEMD: Lowest Frequency IMF

In Figures 11 to 14, the X-axis is time and the Y-axis is amplitude.



If we use strict Neighbourhood Limited Empirical Mode Decomposition before filtering the signal, all the Intrinsic Mode Functions will contain only frequencies in the pitch region, and we will not be able to decide the best candidate for pitch information using filtering. If we instead filter the signals first and then apply the Neighbourhood Limited criterion, much of the pitch information will leak into other Intrinsic Mode Functions and we will fail to tackle mode-mixing.

The resource consumption of Neighbourhood Limited Empirical Mode Decomposition increases as the entropy in the time-frequency representation increases. So, the resource consumption is lower for sinusoids, higher for speech and even higher for noisy speech. To decrease the resources consumed, we can begin applying the Neighbourhood Limited criterion at the Intrinsic Mode Function where the pitch information begins to appear. The resource consumption is thereby reduced while the performance remains almost intact. Another application of Neighbourhood Limited Empirical Mode Decomposition can be Speech Enhancement, as the frequency resolution of the Hilbert Huang Transform is high.

Hence, we have designed an algorithm that captures the pitch information in adjacent Intrinsic Mode Functions where it is present, and captures other frequencies elsewhere. This ensures that the particular Intrinsic Mode Function contains the pitch information wherever pitch information exists and contains different frequencies in other regions. So, we perform the modified novel Neighbourhood Limited Empirical Mode Decomposition first and then proceed with our filtering.



8. Pitch Extraction using Empirical Mode Decomposition

Pitch is a prominent attribute of speech (and of the speaker). It is used extensively in Speech Coding, Speaker Recognition and a few Speech Recognition systems. Many techniques exist to calculate the pitch, such as the Autocorrelation method, Cepstral methods, AMDF, etc. But all of these techniques face some or all of the following problems: windowing effect, low time resolution, low frequency resolution, etc.

This research work is an attempt to get rid of some or all of these shortcomings. We can use Empirical Mode Decomposition to find the instantaneous pitch. The idea is that one of the Intrinsic Mode Functions contains the pitch information. To make sure that there is a unique Intrinsic Mode Function containing the pitch information, we need to get rid of mode-mixing. We can use Neighbourhood Limited Empirical Mode Decomposition for this purpose, and we have observed that it serves the purpose. We also need to filter the signals. The main purpose of this filtering is to determine the best candidate for the pitch information. The Intrinsic Mode Functions are filtered, and the ratio of the energy of the filter output to the energy of the filter input is used to determine the best candidate. The best candidate passes through the filter to the greatest extent, while the other Intrinsic Mode Functions are highly attenuated, which is also clear from the given figures.

Another purpose of the filtering is to determine the voiced portions. We can use the short-time ratio of the filter output to the filter input as a parameter to determine the voicedness of the speech. Only in voiced regions does the energy of the signal pass through the filter to the maximum extent; the filter attenuates the signal in other regions. This can also be used as a feature in various tasks such as Digit Recognition and Speech Recognition, among others.

So, initially, all the IMFs are computed using Neighbourhood Limited Empirical Mode Decomposition. Then, an estimate of the pitch is calculated using the autocorrelation method. A narrow-band band-pass filter is generated with the estimated pitch as the centre frequency. The IMFs are normalized so that they can later be compared. All the IMFs are then passed through the filter. The IMF having the highest energy after filtering is proposed as the IMF containing the pitch information.
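The selection step can be sketched as follows in Python. The report does not specify the filter design, so this sketch assumes a second-order (biquad) band-pass section with unit gain at the centre frequency and an arbitrary quality factor `q`. The function names are hypothetical, and pure tones stand in for real IMFs.

```python
import math


def bandpass_biquad(x, fs, f0, q):
    """Second-order band-pass filter with 0 dB gain at f0 (assumed design)."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b0, b2 = alpha, -alpha                      # b1 is zero for this design
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = (b0 * xn + b2 * x2 - a1 * y1 - a2 * y2) / a0
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y


def energy(x):
    return sum(v * v for v in x)


def normalise(x):
    """Scale x to unit energy so that candidates can be compared."""
    e = math.sqrt(energy(x))
    return [v / e for v in x] if e > 0 else list(x)


def pick_pitch_imf(imfs, fs, pitch_hz, q=5.0):
    """Return (index of best candidate, fraction of energy passed for each IMF)."""
    ratios = [energy(bandpass_biquad(normalise(imf), fs, pitch_hz, q)) for imf in imfs]
    best = max(range(len(ratios)), key=ratios.__getitem__)
    return best, ratios


# Example with stand-in "IMFs": tones at 1500 Hz, 400 Hz and 100 Hz.
fs = 8000
tones = [[math.sin(2 * math.pi * f * n / fs) for n in range(4000)]
         for f in (1500.0, 400.0, 100.0)]
best, ratios = pick_pitch_imf(tones, fs, pitch_hz=100.0)
```

Because each candidate is scaled to unit energy first, the output energy is directly the fraction of energy passed, which is the quantity the text uses to rank the IMFs.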



Figure 15. The speech signal and the corresponding IMFs



Figure 16. Filtered IMFs along with the speech signal

It can be observed that IMF6 contains the pitch information and has the highest fraction of energy passed through the filter.



Now, let us consider an example of a non-voiced and a voiced signal (/s/ /e/). The utterance contains two phonemes: the fricative /s/ and the vowel /e/. The speech signal and the corresponding IMFs are plotted.

Figure 17. The speech signal and the corresponding IMFs determined by modified

Neighbourhood Limited Empirical Mode Decomposition



The amplitude of the Filtered IMF6 is high in the voiced region and close to zero in the non-voiced part. So, its envelope can be used to distinguish between the two, with a suitable threshold placed on the envelope to make the decision. The plots below show that the 6th Intrinsic Mode Function contains the pitch information. This is seen more clearly from the quantitative analysis of the Filtered Intrinsic Mode Functions. To find out which IMF contains the pitch information, we pass all the IMFs through a suitable filter and determine the fraction of energy that passes through. The IMF with the largest fraction of energy passed is the best candidate for the pitch information. These fractions also represent the confidence in the chosen IMF: the fraction should be as large as possible for the IMF that is chosen and as low as possible for the others. In this example, we can observe that IMF6 ranks sixth among the IMFs with respect to amplitude or energy content, but Filtered IMF6 ranks first among the Filtered IMFs. Also, we can observe that Filtered IMF6 contains hardly any mode-mixing, i.e. mixing of more than one frequency mode.

Figure 18. The speech signal and the corresponding Filtered IMFs obtained using EMD,

Neighbourhood Limited Criterion and Filtering



The plots of the speech signal, its Filtered IMF6 and the envelope of the magnitude of Filtered IMF6 are shown for about 250 ms. We can observe that the local amplitude of the Intrinsic Mode Function containing the pitch information represents the voicedness of the speech. We can use it to determine whether a particular region is voiced or not. In this case, this is done by placing a threshold on the maxima envelope of the Intrinsic Mode Function. The figure below depicts this. The red line in the third subplot represents the threshold. The envelope can also be used for various applications including, but not limited to, Digit Recognition.

Figure 19. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Envelope of Filtered IMF6 and the Threshold in red



We also tested the method for epoch determination. For this purpose, the zero crossings with a positive slope are located; these are the epochs. The epochs obtained using this method are compared to those obtained using the Zero Frequency Filter method [5]. The average accuracy for a set of speech files containing about 5000 epochs is 96.43%, with a standard deviation of about 0.4 ms (sampling frequency Fs = 16 kHz).
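A minimal Python sketch of the epoch step is given below. It assumes that the zero crossings are taken on the filtered pitch-bearing IMF, which the figures suggest but the text does not state explicitly. The function names are hypothetical, and the matching procedure used for the accuracy figure is not reproduced here.

```python
def positive_zero_crossings(x):
    """Indices n where the signal goes from negative to non-negative."""
    return [n for n in range(1, len(x)) if x[n - 1] < 0 <= x[n]]


def pitch_periods(epochs, fs):
    """Time in seconds between consecutive epochs."""
    return [(b - a) / fs for a, b in zip(epochs, epochs[1:])]


# Example on a short hand-made sequence and on evenly spaced epochs at Fs = 16 kHz.
epochs = positive_zero_crossings([-1, 1, 1, -1, -1, 2])
periods = pitch_periods([0, 160, 320], fs=16000)
```

In the second example, epochs 160 samples apart at 16 kHz correspond to a pitch period of 10 ms, i.e. a pitch of 100 Hz.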


Figure 20. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Envelope of Filtered IMF6 and the Threshold in red, (d) Epochs determined by our novel HHT based method and (e) Epochs determined by ZFF method



The pitch is found from the filtered Intrinsic Mode Functions (IMFs) using Hilbert Spectral Analysis. This is done by first computing the Hilbert Transform of the signal. Then the original signal, i.e. the filtered IMF, is used as the real part of a complex signal and its Hilbert Transform as the imaginary part. The frequency at each sampling point is obtained by scaling the time derivative of the instantaneous phase of this complex signal. The instantaneous frequency (as a frequency-time representation) is obtained for two speech signal portions along with the epochs and is plotted in the following figures. The epochs obtained by the Zero-Frequency Filtering [6] method are also plotted.
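The computation described above amounts to forming the analytic signal z(t) = x(t) + j H{x(t)} and taking f(t) = (1/2π) dφ/dt, where φ is the unwrapped phase of z. A minimal sketch follows; it builds the analytic signal with the FFT (the same construction that scipy.signal.hilbert uses) and approximates the derivative by a first difference, which is an assumption about the discretisation rather than the report's stated choice.

```python
import numpy as np

def analytic_signal(x):
    """Complex signal whose real part is x and imaginary part is its Hilbert transform."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # zero out negative frequencies.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)

def instantaneous_frequency(x, fs):
    """Instantaneous frequency in Hz; the result has len(x) - 1 samples."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    # First difference of the phase, scaled from rad/sample to Hz.
    return np.diff(phase) * fs / (2.0 * np.pi)
```

Applied to a filtered IMF that carries the pitch, the output is a pitch contour sampled at the signal rate; it is only meaningful in regions where the IMF is dominated by a single frequency mode.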

Figure 21. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Time-frequency representation of IMF6, (d) Epochs determined by our novel HHT based method and (e) Epochs determined by ZFF method



We can observe that, for each epoch determined by the Zero-Frequency Filtering method, there is a corresponding epoch determined by our novel algorithm based on the modified Neighbourhood Limited Empirical Mode Decomposition, as desired. We can also observe that the distance between adjacent epochs determined by the Zero-Frequency Filtering method is equal to the distance between the corresponding epochs determined by our novel method.

Figure 22. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Time-frequency representation of Filtered IMF6, (d) Epochs determined by our novel HHT based method and (e) Epochs determined by ZFF method



9. Potential applications

The extracted pitch can be used as a feature in speaker recognition. It can also be used as one of the parameters in speech coding. Other applications include biometric identification, biomedical applications, signal processing research, etc.

This work can help future research in signal processing, especially in the field of spectral representation of signals.

Other products of this algorithm are also useful in many cases. An example is digit recognition, where the envelope of the suitable Intrinsic Mode Function increases the accuracy.



10. Results and Conclusions

In this report we have explained and tested various methods to improve the decomposition of a speech signal into Intrinsic Mode Functions (IMFs) and to extract pitch using the Hilbert-Huang Transform. We can conclude that Empirical Mode Decomposition (EMD) is a good method for decomposing a speech signal into IMFs. However, mode-mixing violates the assumption that an IMF contains a single frequency mode, so it must be avoided in applications that rely on this assumption. Hence, we turned to Ensemble Empirical Mode Decomposition (EEMD), which avoids mode-mixing using the principle of noise-assisted data analysis. However, the time required by EEMD is very high: N times that of normal EMD, where N is of the order of a few thousand. Hence, we need an alternative method that avoids mode-mixing without affecting the other properties of EMD. We found that Neighbourhood Limited Empirical Mode Decomposition combined with filtering meets our expectations: it effectively gets rid of mode-mixing and has a runtime comparable to that of EMD. We have obtained very good results using this novel algorithm. The epoch-detection accuracy on a dataset of about 5000 epochs was found to be 96.43%, with a standard deviation of about 0.4 ms. Other products of this algorithm also have a variety of applications.



11. Future Work

1) To explore the uses of instantaneous pitch
2) To apply the algorithm to other signals, including biomedical signals
3) To test the algorithm on a larger dataset
4) To increase the efficiency of the algorithm with respect to run-time (for example, by implementing the algorithm in C or C++)
5) To test the novel algorithm for Intrinsic Mode Function analysis in various applications



12. References

[1] Norden E. Huang, Zheng Shen, Steven R. Long, Manli C. Wu, Hsing H. Shih, Quanan Zheng, Nai-Chyuan Yen, Chi Chao Tung and Henry H. Liu, "The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis," Proc. Royal Society London A, vol. 454, pp. 903-995, 1998.

[2] Norden E. Huang and Samuel S. P. Shen, Hilbert-Huang Transform and Its Applications, London: World Scientific, 2005.

[3] G. Schlotthauer, M. E. Torres and H. L. Rufiner, "A new algorithm for instantaneous F0 speech extraction based on ensemble empirical mode decomposition," Proc. European Signal Processing Conference, Glasgow, Scotland, August 24-28, 2009.

[4] Guanlei Xu, Xiaotong Wang and Xiaogang Xu, "Neighborhood Limited Empirical Mode Decomposition and application in Image Processing," Proc. Fourth International Conference on Image and Graphics, pp. 149-154, 2007.

[5] K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602-1614, Nov. 2008.

[6] S. R. M. Prasanna, D. Govind, K. Sreenivasa Rao and B. Yegnanarayana, "Fast prosody modification using instants of significant excitation," Proc. Speech Prosody, Chicago, USA, May 2010.

[7] R. M. Rangayyan, Biomedical Signal Analysis - A Case-Study Approach, IEEE and Wiley, New York, p. 289, 2002.

[8] Z. Wu and N. Huang, "Ensemble empirical mode decomposition: a noise-assisted data analysis method," Advances in Adaptive Data Analysis, vol. 1, no. 1, pp. 1-41, 2009.

[9] G. Schlotthauer, M. E. Torres and H. L. Rufiner, "Voice fundamental frequency extraction algorithm based on ensemble empirical mode decomposition and entropies," Proc. 11th Int. Congr. of the IFMBE, Munich, pp. 984-987, 2009.
