1 Btp Report 07010245 Suraj s Sheth16april2011

  • 8/7/2019 1 Btp Report 07010245 Suraj s Sheth16april2011

    Extraction of Pitch from Speech Signals using Hilbert Huang Transform

    Submitted for the fulfilment of the requirements for the award of the degree of Bachelor of Technology

    by

    Suraj Satishkumar Sheth (Roll No.: 07010245)

    Supervised by

    Dr. S. R. M. Prasanna

    Department of Electronics & Communication Engineering

    Indian Institute of Technology, Guwahati

    Year: July 2010 to May 2011

    CERTIFICATE

    It is certified that the work contained in the report entitled "Extraction of Pitch from Speech Signals using Hilbert Huang Transform" is a bona fide work of Suraj Satishkumar Sheth (Roll No. 07010245), which has been carried out in the Department of Electronics and Communication Engineering, Indian Institute of Technology (IIT) Guwahati under my supervision, and this work has not been submitted elsewhere for a degree.

    Dr. S. R. M. PRASANNA

    Associate Professor,

    Department of Electronics and Communication Engineering,

    Indian Institute of Technology Guwahati,

    Guwahati 781039, INDIA

    April, 2011

    GUWAHATI

    Acknowledgements

    I consider it a great privilege to express my deepest and most sincere gratitude to my supervisor, Dr. S. R. M. Prasanna, for the most valuable guidance provided to me during the course of this project. Such a successful work would not have been possible without his guidance. I would also like to thank the Head of the Department and other faculty members for their kind help in carrying out this work.

    I am very grateful to the non-teaching staff and students of the department who have always helped me from the very beginning of this work. I sincerely thank Mr. Gyandhar Pradhan and Mr. Govind, who have helped me.

    Contents

    1. Abstract

    2. Introduction

    3. My contributions

    4. Empirical Mode Decomposition

    5. Mode-Mixing

    6. Ensemble Empirical Mode Decomposition

    7. Neighbourhood Limited Empirical Mode Decomposition

    8. Pitch Extraction using Empirical Mode Decomposition

    9. Potential Applications

    10. Results and Conclusions

    11. Future Work

    12. References

    List of Figures

    Figure 1. Speech signal along with maxima and minima

    Figure 2. Speech signal along with maximum, minimum and mean envelopes

    Figure 3. A synthetic signal with two sinusoids and the corresponding IMFs

    Figure 4. A synthetic signal having two sinusoids of different lengths, and the Intrinsic Mode Functions obtained using Empirical Mode Decomposition

    Figure 5. A speech signal and the corresponding Intrinsic Mode Functions depicting the mode-mixing phenomenon, specifically in IMF4 and IMF5

    Figure 6. The speech signal and its fourth and fifth Intrinsic Mode Functions

    Figure 7. Input signal: concatenation of two sinusoids

    Figure 8. EMD: a single IMF contains two modes

    Figure 9. EEMD: highest-frequency IMF

    Figure 10. EEMD: lowest-frequency IMF

    Figure 11. Input signal: x=1:1000; y=[sin(0.1*x) sin(0.8*x)];

    Figure 12. EMD: a single IMF contains two modes

    Figure 13. NLEMD: highest-frequency IMF

    Figure 14. NLEMD: lowest-frequency IMF

    Figure 15. The speech signal and the corresponding IMFs

    Figure 16. Filtered IMFs along with the speech signal

    Figure 17. The speech signal and the corresponding IMFs determined by modified Neighbourhood Limited Empirical Mode Decomposition

    Figure 18. The speech signal and the corresponding filtered IMFs obtained using EMD, the Neighbourhood Limited criterion and filtering

    Figure 19. (a) Magnified view of the speech signal, (b) filtered IMF6, (c) envelope of filtered IMF6 and the threshold in red

    Figure 20. (a) Magnified view of the speech signal, (b) filtered IMF6, (c) envelope of filtered IMF6 and the threshold in red, (d) epochs determined by our novel HHT-based method and (e) epochs determined by the ZFF method

    Figure 21. (a) Magnified view of the speech signal, (b) filtered IMF6, (c) time-frequency representation of IMF6, (d) epochs determined by our novel HHT-based method and (e) epochs determined by the ZFF method

    Figure 22. (a) Magnified view of the speech signal, (b) filtered IMF6, (c) time-frequency representation of IMF6, (d) epochs determined by our novel HHT-based method and (e) epochs determined by the ZFF method

    Tables

    Table 1: Comparison of Frequency Representation techniques

    1. Abstract

    In this report, a novel method for instantaneous pitch extraction using the Hilbert Huang Transform is proposed. Unlike traditional methods, it does not require truncating the data and segmenting it into windows, and it does not rely on the stationarity and linearity assumptions. The instantaneous pitch is derived from the Intrinsic Mode Functions (IMFs), which in turn are obtained from the speech signal through Empirical Mode Decomposition. We have explored various methods to overcome the shortcomings of Empirical Mode Decomposition and proposed several new variants of these methods which tackle the specific problem at hand. Robustness is increased by using a novel variant of Neighbourhood Limited Empirical Mode Decomposition. Also, the filtered IMFs, rather than the IMFs themselves, are used to determine the best candidate for pitch.

    The results of the novel algorithm are compared to the epochs obtained using the Zero-Frequency Filter method. The accuracy is found to be 96.43% for a dataset containing about 5000 epochs determined by the Zero-Frequency Filter method.

    The envelope of the modified IMF that is the best candidate for pitch is also shown to be useful for applications such as Digit Recognition. The algorithm provides pitch with high time resolution and high frequency resolution, and it is not significantly affected by the windowing effect because short-time analysis is not used. The bases used are adaptive rather than a priori, which is important for this specific problem.

    2. Introduction

    The pitch of a speech signal plays an important role in speech processing applications including speaker recognition, automatic speech recognition, speech enhancement, analysis and modelling of speech prosody, and low-bit-rate speech coding. Although many methods exist for pitch extraction, their reliability and accuracy are limited. The instantaneous value of pitch usually varies even within a frame, so good time resolution is needed. Good frequency resolution is needed as well, because the pitch is used in many applications.

    Time-frequency representation is an important component of signal processing; examples include the STFT and the Wavelet Transform. Most data-analysis methods assume that the data is linear and stationary. Techniques like Wavelet analysis work for non-stationary but linear data. All these techniques face one or more of the following problems: windowing effect, low time resolution, low frequency resolution and fixed basis functions. Speech is usually both non-linear and non-stationary. Hence, we need adaptive basis functions for speech.

    The Hilbert Huang Transform (HHT) is a time-frequency representation technique developed recently by N. E. Huang and his group [1]. It is an empirical data-analysis method capable of processing non-linear and non-stationary data. It provides the ingredients to obtain adaptive bases for the representation. The HHT has given much better and sharper results than other conventional time-frequency-energy representation methods for various problems. Additionally, the HHT has revealed true physical meanings in many of the data examined.

    The most important component of the HHT is Empirical Mode Decomposition (EMD). It separates components with different frequency modes and produces a sensible representation even from non-linear and non-stationary data.

    After Empirical Mode Decomposition, the next component is Hilbert Spectral Analysis, which provides the time-frequency representation of the input non-linear and non-stationary signal. The HHT (or, more precisely, the EMD), in a weak sense, captures the highest frequency component at each iteration, producing Intrinsic Mode Functions (IMFs) that have different frequency modes at a particular time instant. But it may happen that a particular IMF has different frequency modes at different time instants. This phenomenon is called 'mode-mixing' [2]. An example is given in Figure 6. Mode-mixing does not affect Hilbert Spectral Analysis because, in the final time-frequency representation, all the IMFs are mapped to a single graph. But if we want to use the IMFs individually and want each IMF to have a single frequency mode, as in the case of pitch extraction, we need to get rid of mode-mixing. In this report, we explain the various techniques for enhancing the decomposition of a speech signal into IMFs and getting rid of mode-mixing in the HHT, and we propose a new method for pitch extraction using the HHT.

    3. My contributions

    - Problem formulation
    - Literature survey for Hilbert Huang Transform and Empirical Mode Decomposition
    - Writing the code for Empirical Mode Decomposition and Hilbert Huang Transform in Matlab
    - Testing a set of synthetic signals to understand the features of Empirical Mode Decomposition and Hilbert Huang Transform
    - Exploring the applications of Hilbert Huang Transform to speech signals
    - Trying out the noise-assisted analysis of data, Ensemble Empirical Mode Decomposition, to get rid of mode-mixing, one of the shortcomings of Hilbert Huang Transform, and writing code for it
    - Code for the Neighbourhood Limited Empirical Mode Decomposition algorithm
    - Literature survey for pitch extraction algorithms
    - Derivation of the algorithm for pitch extraction using Empirical Mode Decomposition
    - Writing the code for pitch extraction using Empirical Mode Decomposition
    - Location of epochs using Empirical Mode Decomposition and comparing them to the epochs derived from the Zero Frequency Filter method [5]
    - Optimising Ensemble Empirical Mode Decomposition to reduce time complexity
    - Testing this method on speech signals and the corresponding Intrinsic Mode Functions
    - Application of Neighbourhood Limited Empirical Mode Decomposition to synthetic and speech signals, and designing novel variants of Neighbourhood Limited Empirical Mode Decomposition for the specific problem at hand, pitch extraction
    - Developing a novel combination of filtering and Neighbourhood Limited Empirical Mode Decomposition methods for the extraction of pitch to increase efficiency
    - Analysing the pitch and the epochs obtained using the modified filtering, the novel variant of Neighbourhood Limited Empirical Mode Decomposition and Hilbert Spectral Analysis
    - Exploring the use of features of the modified Intrinsic Mode Function for various purposes including Digit Recognition
    - Comparing the proposed novel epoch detection algorithm to the Zero-Frequency Filter method with a dataset containing about 5000 epochs determined by the Zero-Frequency Filter method, and analysing the results

    4. Empirical Mode Decomposition

    The fundamental part of the Hilbert Huang Transform is the Empirical Mode Decomposition (EMD) method. Using the EMD method, a signal can be decomposed into a finite and often small number of frequency modes called Intrinsic Mode Functions (IMFs). An IMF represents a simple oscillatory mode similar to a simple harmonic function, but it is much more general: unlike a simple harmonic component, an IMF has amplitude and frequency that are functions of time.

    The algorithm can be described as:

    Step 1: Initially set z_0 = x(t) and i = 0, where x(t) is the speech signal.

    Step 2: To find the (i+1)th Intrinsic Mode Function:

    (a) Initially set h_{i,0} = z_i and k = 1;

    (b) Find the local extrema of h_{i,k-1};

    (c) Construct the maxima envelope and the minima envelope of h_{i,k-1} by interpolation;

    (d) Calculate the mean envelope m_{i,k-1} from the maxima envelope and the minima envelope;

    (e) Subtract the mean envelope from the input signal of this stage: h_{i,k} = h_{i,k-1} - m_{i,k-1};

    (f) Check whether h_{i,k} satisfies the properties of an Intrinsic Mode Function. If not, increment k and go back to (b).

    Step 3: Define z_{i+1} = z_i - h_{i,k} and increment i.

    Step 4: Check whether the required precision is obtained; if not, go back to Step 2.

    The procedure followed to compute the Intrinsic Mode Functions is:

    1) Initially, the maxima and minima of the speech signal are computed.

    Fig. 1: Speech signal along with maxima and minima

    2) Using a curve-fitting technique, in this case the built-in spline function in Matlab, a maxima envelope is generated for the set of maxima and a minima envelope is generated for the set of minima.

    3) A mean envelope (in magenta) is generated, which is the mean of the maxima envelope (in red) and the minima envelope (in green) computed at each sampling point.

    Fig. 2: Speech signal along with maximum, minimum and mean envelopes

    4) The mean envelope is subtracted from the speech signal s(t) to get a new signal h_{i,k}(t), which is a potential IMF.

    5) Check whether h_{i,k}(t) satisfies the properties required of an IMF:

    a) The number of extrema and the number of zero crossings differ by at most one.

    b) The mean of the envelopes (steps 1, 2 and 3 applied to h_{i,k}(t)) is zero at all points.

    6) Also check whether h_{i,k}(t) satisfies the Standard Deviation criterion, where the Standard Deviation is defined by

    $$SD = \sum_{t=0}^{T} \frac{\left| h_{1,k-1}(t) - h_{1,k}(t) \right|^{2}}{h_{1,k-1}^{2}(t)}$$

    Here, h_{1,0}(t) = s(t); h_{1,k-1}(t) is the potential IMF in iteration k-1 and h_{1,k}(t) is the potential IMF in iteration k.

    The Standard Deviation threshold is usually set between 0.2 and 0.3.

    7) If h_{i,k}(t) satisfies the criteria mentioned in steps 5 and 6, c_1(t) = h_{i,k}(t) is declared to be an IMF (the 1st IMF in this case). This IMF is subtracted from the original speech signal s(t) to get s_1(t). We repeat steps 1 to 7 with s_1(t) as the input. The process is stopped when the residual signal is a monotonic function or a constant signal.

    8) So, for each input signal, we get a set of IMFs and, possibly, a residue which is a constant signal or a monotonic signal.
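
The sifting procedure described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the Matlab code used in this work: linear interpolation (`np.interp`) stands in for the spline envelopes, the stopping test uses a ratio-of-sums variant of the Standard Deviation criterion so that near-zero samples do not dominate, and the function names and the `max_sift` and `max_imfs` limits are hypothetical.

```python
import numpy as np

def local_extrema(h):
    """Return indices of strict local maxima and minima of a 1-D array."""
    d = np.diff(h)
    maxima = np.where((d[:-1] > 0) & (d[1:] < 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] > 0))[0] + 1
    return maxima, minima

def mean_envelope(h):
    """Mean of the upper and lower envelopes, or None if too few extrema."""
    maxima, minima = local_extrema(h)
    if len(maxima) < 2 or len(minima) < 2:
        return None
    t = np.arange(len(h))
    # Assumption: linear interpolation replaces the spline fit of the report.
    upper = np.interp(t, maxima, h[maxima])
    lower = np.interp(t, minima, h[minima])
    return 0.5 * (upper + lower)

def sift(z, sd_threshold=0.25, max_sift=50):
    """Extract one IMF candidate from z by repeated mean-envelope removal."""
    h_prev = z.copy()
    for _ in range(max_sift):
        m = mean_envelope(h_prev)
        if m is None:
            break
        h = h_prev - m
        # Assumption: ratio-of-sums form of the SD test; the threshold is
        # borrowed from the 0.2 to 0.3 range and may need retuning.
        sd = np.sum((h_prev - h) ** 2) / (np.sum(h_prev ** 2) + 1e-12)
        h_prev = h
        if sd < sd_threshold:
            break
    return h_prev

def emd(x, max_imfs=10):
    """Decompose x into a list of IMFs and a residue."""
    residue = np.asarray(x, dtype=float).copy()
    imfs = []
    for _ in range(max_imfs):
        if mean_envelope(residue) is None:
            break  # too few extrema left: treat the residue as monotonic
        imf = sift(residue)
        imfs.append(imf)
        residue = residue - imf
    return imfs, residue
```

By construction, the IMFs and the residue sum back to the input signal, which is a convenient sanity check for any implementation of the procedure.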

    Now, let us compare the features of three different types of Frequency representation techniques:

                      Fourier Transform    Wavelet Transform        Hilbert-Huang Transform

    Basis             a priori             a priori                 adaptive

    Frequency         Not local            Not local                Local, instantaneous

    Presentation      Energy-Frequency     Energy-Time-Frequency    Energy-Time-Frequency

    Non-linear        NO                   NO                       YES

    Non-stationary    NO                   YES                      YES

    Table 1: Comparison of Frequency Representation techniques

    For example, take x=1:10000; y=sin(x*0.01)+sin(x*0.1); The signal and its two IMFs are plotted below.

    Fig. 3: A synthetic signal with two sinusoids and the corresponding IMFs

    Fig. 4: A synthetic signal having two sinusoids of different lengths. The highest-frequency sinusoid persists for a longer duration. The second and third subplots show the Intrinsic Mode Functions obtained using Empirical Mode Decomposition.

    In this case, EMD separates the frequency components well. In the general case, however, it faces the problem of mode-mixing, in which a single frequency component is present in more than one IMF.

    5. Mode-Mixing

    Mode-mixing is a phenomenon in which a single frequency mode is present in more than one Intrinsic Mode Function. It arises because Empirical Mode Decomposition picks up the highest frequency component at each point. The figure below depicts this.

    Figure 5. A speech signal and the corresponding Intrinsic Mode Functions depicting the mode-mixing phenomenon, specifically in IMF4 and IMF5

    Now, let us observe the magnified view of the speech signal, the 4th Intrinsic Mode Function and the 5th Intrinsic Mode Function.

    Figure 6. The speech signal and its fourth and fifth Intrinsic Mode Functions

    The portion highlighted in green has the same characteristics, yet it is spread over more than one Intrinsic Mode Function. We want it to be present in a single Intrinsic Mode Function. This problem arises when we process the Intrinsic Mode Functions individually. When the application requires the whole time-frequency representation, mode-mixing does not come into the picture because the frequency components remain intact. It is therefore not a shortcoming of the Hilbert Huang Transform but a limitation of its representation. In this specific case, we are searching for a particular IMF, so we need to tackle the problem of mode-mixing.

    6. Ensemble Empirical Mode Decomposition

    Ensemble Empirical Mode Decomposition (EEMD) [3] is a noise-assisted data analysis method designed to take care of mode-mixing. White Gaussian noise is added to the input speech signal before it is decomposed. The same experiment is repeated N (>>1) times using N different sequences of noise, and the corresponding IMFs from these N experiments are averaged. Because the noise is random, its contribution to the average becomes negligible compared to the signal. Hence, ideally, only the signal component remains, and mode-mixing in Empirical Mode Decomposition is avoided.

    Figure 7. Input signal: concatenation of two sinusoids

    Figure 8. EMD: a single IMF contains two modes

    Figure 9. EEMD: highest-frequency IMF

    Figure 10. EEMD: lowest-frequency IMF

    The EEMD algorithm [8] gets rid of mode-mixing by defining the true IMF components as the mean of an ensemble of trials, each obtained by adding noise of finite variance to the input signal. Although the method based on autocorrelation has been reported to be the best pitch estimation technique for the analysis of the pathological sustained vowel /a/, it has been shown [9] that it fails, along with the RAPT algorithm, while the method using Ensemble Empirical Mode Decomposition gives better results.

    In the figures above, the first plot is that of a synthetic signal which is a concatenation of two sinusoids. Empirical Mode Decomposition captures the highest frequency component at each instant of time, so a single Intrinsic Mode Function contains both sinusoids. With Ensemble Empirical Mode Decomposition, we get two different Intrinsic Mode Functions containing one sinusoid each, as desired. Mode-mixing is thus handled well by Ensemble Empirical Mode Decomposition: we need a particular Intrinsic Mode Function to contain a specific frequency component, and EEMD achieves this.

    However, Ensemble Empirical Mode Decomposition has a few limitations. The averaged EEMD components do not satisfy the properties of IMFs unless N, the number of trials, is very high. To reduce the variance of the estimate, we need to average it over a number of estimates [7]. Also, the noise strength reduces only when N is high. We found that the higher the value of N (of the order of thousands), the better the output. But this leads to an N-fold increase in resource consumption (mainly time). In addition, the added noise makes the execution of a single trial more costly. Hence, the run-time is very high. We then experimented with varying noise levels and found that if the noise energy is very low, EEMD behaves like normal Empirical Mode Decomposition and does not help much in avoiding mode-mixing. So, we need to find an alternative to tackle mode-mixing.
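
The dependence on N can be illustrated numerically. The sketch below does not run EMD at all; it only shows, on a hypothetical test sinusoid, how the noise left in an ensemble average shrinks as the number of trials grows (roughly as 1/sqrt(N) for independent white Gaussian noise). The function name, the signal and the default parameters are assumptions made for illustration.

```python
import numpy as np

def residual_noise_std(n_trials, n_samples=2000, noise_std=0.2, seed=0):
    """Std of the noise left after averaging n_trials noisy copies of a signal."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples)
    signal = np.sin(0.1 * t)  # hypothetical clean signal
    # Each trial adds an independent white Gaussian noise sequence.
    trials = signal + rng.normal(0.0, noise_std, size=(n_trials, n_samples))
    ensemble_mean = trials.mean(axis=0)
    return float(np.std(ensemble_mean - signal))
```

Calling `residual_noise_std(1)` and `residual_noise_std(100)` shows the residual dropping by about a factor of ten, which is why a large N is needed before the added noise becomes negligible, and why the cost grows with it.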

    7. Neighbourhood Limited Empirical Mode Decomposition

    Neighbourhood Limited Empirical Mode Decomposition (NLEMD) [4] is another method to avoid mode-mixing in Empirical Mode Decomposition. Frequency mixing occurs in the EMD algorithm described above because it picks the highest frequency component in each locality. So, we need to restrict the frequency span of a particular IMF. This can be done by limiting the distance between two consecutive extrema: a spurious extremum is added whenever the distance between two consecutive extrema is large. This limits the frequency range within a particular frequency mode, while the amplitude is still free to vary. The frequencies are thus separated into their corresponding frequency modes.
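
The idea of limiting the distance between consecutive extrema can be sketched as follows. This is only an illustration of the principle, not the NLEMD algorithm of [4] or the variant developed in this work: the function name, the `max_gap` parameter and the rule of stepping through a long gap in increments of `max_gap` are all assumptions.

```python
def limit_extrema_gaps(extrema, max_gap):
    """Insert extra indices so consecutive extrema are at most max_gap apart.

    extrema is a sorted list of sample indices of detected extrema.
    """
    if max_gap < 1:
        raise ValueError("max_gap must be at least 1")
    limited = []
    for idx in extrema:
        # Fill an over-long gap with spurious extrema spaced max_gap apart.
        while limited and idx - limited[-1] > max_gap:
            limited.append(limited[-1] + max_gap)
        limited.append(idx)
    return limited
```

For example, `limit_extrema_gaps([0, 10, 50], 15)` returns `[0, 10, 25, 40, 50]`: the gap between samples 10 and 50 is filled so that the envelope built on these points cannot follow an oscillation slower than the chosen limit.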

    Figure 11. Input signal: x=1:1000; y=[sin(0.1*x) sin(0.8*x)];

    Figure 12. EMD: a single IMF contains two modes

    Figure 13. NLEMD: highest-frequency IMF

    Figure 14. NLEMD: lowest-frequency IMF

    In each figure, the X-axis is time and the Y-axis is amplitude.

    If we use strict Neighbourhood Limited Empirical Mode Decomposition before filtering the signal, all the Intrinsic Mode Functions will contain only frequencies in the pitch region. Then, we will not be able to decide the best candidate for pitch information using filtering. If we instead filter the signals first and then apply the Neighbourhood Limited criterion, much of the pitch information will leak into other Intrinsic Mode Functions and we will fail to tackle mode-mixing.

    The resource consumption of Neighbourhood Limited Empirical Mode Decomposition increases as the entropy in the time-frequency representation increases. So, the resource consumption is lower for sinusoids, higher for speech and even higher for noisy speech. To decrease the resources consumed, we can begin applying the Neighbourhood Limited criterion at the Intrinsic Mode Function where the pitch information first appears. The resource consumption is thus reduced while the performance remains almost intact. Another application of Neighbourhood Limited Empirical Mode Decomposition can be Speech Enhancement, as the frequency resolution of the Hilbert Huang Transform is high.

    Hence, we have designed an algorithm which captures the pitch information, if any is present, from the adjacent Intrinsic Mode Functions and otherwise captures other frequencies. This ensures that a particular Intrinsic Mode Function contains the pitch information wherever the pitch information exists and has different frequencies in other regions. So, we first perform the modified Neighbourhood Limited Empirical Mode Decomposition and then proceed with our filtering.

    8. Pitch Extraction using Empirical Mode Decomposition

    The pitch is a prominent characteristic of speech (and of the speaker). It is used extensively in Speech Coding, Speaker Recognition and a few Speech Recognition systems. Many techniques exist to calculate the pitch, such as the autocorrelation method, cepstral methods and AMDF. But all of these techniques face some or all of the following problems: windowing effect, low time resolution and low frequency resolution.

    This research work is an attempt to get rid of some or all of these shortcomings. We can use Empirical Mode Decomposition to find the instantaneous pitch. The idea is that one of the Intrinsic Mode Functions contains the pitch information. To make sure that there is a unique Intrinsic Mode Function containing the pitch information, we need to get rid of mode-mixing. We use Neighbourhood Limited Empirical Mode Decomposition for this purpose and have observed that it serves the purpose.

    We also need to filter the signals. The main purpose of this filtering is to determine the best candidate for the pitch information. The Intrinsic Mode Functions are filtered, and the ratio of the energy of the filter output to the energy of the filter input is used to determine the best candidate. The best candidate passes through the filter to the highest extent, while the other Intrinsic Mode Functions are highly attenuated, which is also clear from the given figures.

    Another purpose of the filtering is to determine the voiced portions. The short-time ratio of the filter output to the filter input can be used as a parameter to determine the voicedness of the speech. Only in voiced regions does the energy of the signal pass through the filter to the maximum extent; the filter attenuates the signal in other regions. This ratio can also be used as a feature in tasks such as Digit Recognition and Speech Recognition, among others.

    So, initially, all the IMFs are computed using Neighbourhood Limited Empirical Mode Decomposition. Then, an estimate of the pitch is calculated using the autocorrelation method. A narrow-band band-pass filter is generated with the estimated pitch as the centre frequency. The IMFs are normalized so that they can later be compared. All the IMFs are then passed through the filter. The IMF having the highest energy after filtering is proposed as the IMF containing the pitch information.
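
The selection step can be sketched as follows, assuming the IMFs and a rough pitch estimate are already available. The filter here is an ideal brick-wall band-pass built with the FFT, which is an assumption made for brevity (the report does not specify the filter design), and the function names and the `half_width_hz` parameter are hypothetical. Because the score is a ratio of output energy to input energy, it does not depend on how each IMF is scaled.

```python
import numpy as np

def bandpass_fft(x, fs, f_lo, f_hi):
    """Ideal (brick-wall) band-pass filter implemented with the FFT."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def select_pitch_imf(imfs, fs, pitch_hz, half_width_hz=30.0):
    """Return (index of the best candidate, energy fraction passed per IMF)."""
    fractions = []
    for imf in imfs:
        imf = np.asarray(imf, dtype=float)
        filtered = bandpass_fft(imf, fs,
                                pitch_hz - half_width_hz,
                                pitch_hz + half_width_hz)
        energy_in = np.sum(imf ** 2)
        # Fraction of the IMF's energy that survives the band-pass filter.
        fraction = float(np.sum(filtered ** 2) / energy_in) if energy_in > 0 else 0.0
        fractions.append(fraction)
    return int(np.argmax(fractions)), fractions
```

With three synthetic "IMFs" at 1000 Hz, 125 Hz and 20 Hz and a pitch estimate of 125 Hz, the second component passes almost all of its energy while the other two are rejected, so it is selected as the pitch candidate.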

    Figure 15. The speech signal and the corresponding IMFs


    Figure 16. Filtered IMFs along with the speech signal

    It can be observed that IMF 6 contains the pitch information and has the highest fraction of its energy passed through the filter.


    Now, let us consider an utterance with an unvoiced part followed by a voiced part (/s/ /e/). The utterance contains two phonemes: the fricative /s/ and the vowel /e/. The speech signal and the corresponding IMFs are plotted.

    Figure 17. The speech signal and the corresponding IMFs determined by modified

    Neighbourhood Limited Empirical Mode Decomposition


    The amplitude of the filtered IMF6 is high in the voiced region and close to zero in the unvoiced part, so its envelope can be used to distinguish the two. A suitable threshold is placed on the envelope to make the decision. The plots below show that the 6th IMF contains the pitch information. This is even clearer from a quantitative analysis of the filtered IMFs. To find out which IMF contains the pitch information, we pass all the IMFs through a suitable filter and determine the fraction of each IMF's energy that passes through. The IMF with the largest fraction is the best candidate for the pitch information. These fractions also indicate the confidence in the chosen IMF: the fraction should be as large as possible for the chosen IMF and as small as possible for the others. In this example, IMF6 ranks sixth among the IMFs with respect to amplitude or energy content, but the filtered IMF6 ranks first among the filtered IMFs. The filtered IMF6 also shows hardly any mode-mixing, that is, mixing of more than one frequency mode.

    Figure 18. The speech signal and the corresponding Filtered IMFs obtained using EMD,

    Neighbourhood Limited Criterion and Filtering


    The speech signal, its filtered IMF6 and the envelope of the magnitude of the filtered IMF6 are plotted for about 250 ms. We can observe that the local amplitude of the IMF containing the pitch information reflects the voicedness of the speech, so it can be used to decide whether a particular region is voiced. In this case, the decision is made by placing a threshold on the maxima envelope of the IMF, as the figure below shows. The red line in the third subplot represents the threshold. The envelope can also be used in other applications, including but not limited to digit recognition.

    Figure 19. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Envelope of Filtered

    IMF6 and the Threshold in red
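
    The envelope-and-threshold decision illustrated in Figure 19 can be sketched as follows. This is a minimal sketch under assumptions: the maxima envelope is approximated by linear interpolation through the local maxima of the magnitude, the threshold is taken as 20% of the envelope peak (an assumed value, not the one used in this work), and the function names and the mock filtered IMF are hypothetical.

```python
import numpy as np

def maxima_envelope(x):
    """Envelope of |x|: linear interpolation through the local maxima of the magnitude."""
    mag = np.abs(x)
    # interior samples that are >= the left neighbour and > the right neighbour
    idx = np.where((mag[1:-1] >= mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    if idx.size == 0:
        return mag.copy()
    return np.interp(np.arange(len(x)), idx, mag[idx])

def voiced_mask(filtered_imf, rel_threshold=0.2):
    """True where the envelope exceeds rel_threshold times its maximum."""
    env = maxima_envelope(filtered_imf)
    return env > rel_threshold * np.max(env)

# Mock filtered IMF (assumption): nearly silent first half, strong 125 Hz tone in the second half.
fs = 16000
n = np.arange(3200)
amplitude = np.where(n < 1600, 0.01, 1.0)
mock_filtered_imf = amplitude * np.sin(2 * np.pi * 125 * n / fs)

mask = voiced_mask(mock_filtered_imf)
```

    In practice the threshold would be tuned on real data; a relative threshold is used here only so that the sketch does not depend on the signal scale.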


    We also tested the method for epoch determination. For this purpose, the zero crossings of the filtered IMF with a positive slope are located; these are taken as the epochs. The epochs obtained using this method are compared with those obtained using the Zero Frequency Filter method [5]. The average accuracy for a set of speech files containing about 5000 epochs is 96.43%, with a standard deviation of about 0.4 ms (sampling frequency Fs = 16 kHz).

    Figure 20. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Envelope of Filtered

    IMF6 and the Threshold in red, (d) Epochs determined by our novel HHT based method and (e)

    Epochs determined by ZFF method
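
    Locating the positive-slope zero crossings can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only: the function name is hypothetical, the epoch is reported as the index of the first positive sample after the crossing (an assumed convention), and the mock filtered IMF is a phase-shifted 125 Hz sinusoid rather than real data.

```python
import numpy as np

def positive_zero_crossings(x):
    """Indices n such that x[n-1] <= 0 and x[n] > 0 (negative-to-positive crossings)."""
    x = np.asarray(x)
    return np.where((x[:-1] <= 0) & (x[1:] > 0))[0] + 1

# Mock filtered IMF (assumption): a 125 Hz sinusoid with a small phase offset.
fs = 16000
n = np.arange(2048)
mock_filtered_imf = np.sin(2 * np.pi * 125 * n / fs - 0.5)

epochs = positive_zero_crossings(mock_filtered_imf)
periods_ms = np.diff(epochs) * 1000.0 / fs  # spacing between successive epochs in ms
```

    The spacing between successive epochs gives the local pitch period, so the same indices can be used both for epoch comparison and for a period-by-period pitch estimate.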


    The pitch is obtained from the filtered IMF using Hilbert Spectral Analysis. First, the Hilbert Transform of the signal is computed. Then the original signal, i.e. the filtered IMF, is used as the real part of a complex (analytic) signal and its Hilbert Transform as the imaginary part. The frequency at each sampling point is obtained by scaling the derivative of the instantaneous phase of this complex signal. The instantaneous frequency (as a frequency-time representation) is obtained for two portions of speech and is plotted, along with the epochs, in the following figures. The epochs obtained by the Zero-Frequency Filtering method [6] are also plotted.

    Figure 21. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Time-frequency

    representation of IMF6, (d) Epochs determined by our novel HHT based method and (e) Epochs

    determined by ZFF method
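
    The Hilbert Spectral Analysis step can be sketched as follows. This is a minimal sketch under assumptions, not the code used in this work: the analytic signal is built with the standard FFT construction (negative-frequency bins set to zero, positive ones doubled), the phase derivative is approximated by a first difference, and the function names and the 125 Hz test tone are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Return x + j*H{x}, where H is the Hilbert Transform, via the FFT."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep the DC bin
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep the Nyquist bin
        weights[1:n // 2] = 2.0           # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def instantaneous_frequency(x, fs):
    """Instantaneous frequency in Hz: scaled first difference of the unwrapped phase."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    return np.diff(phase) * fs / (2.0 * np.pi)

# Mock filtered IMF (assumption): a pure 125 Hz tone sampled at 16 kHz.
fs = 16000
n = np.arange(2048)
mock_filtered_imf = np.sin(2 * np.pi * 125 * n / fs)

f_inst = instantaneous_frequency(mock_filtered_imf, fs)
```

    For a real filtered IMF the result varies over time and gives the pitch contour in voiced regions; in unvoiced regions, where the amplitude is close to zero, the phase is unreliable and the values should be ignored.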


    We can observe that for each epoch determined by the Zero-Frequency Filtering method there is a corresponding epoch determined by our algorithm based on modified Neighbourhood Limited Empirical Mode Decomposition, as desired. We can also observe that the distance between adjacent epochs determined by the Zero-Frequency Filtering method matches the distance between the corresponding epochs determined by our method.

    Figure 22. (a) Magnified view of the Speech signal, (b) Filtered IMF6, (c) Time-frequency

    representation of Filtered IMF6, (d) Epochs determined by our novel HHT based method and

    (e) Epochs determined by ZFF method


    9. Potential applications

    The extracted pitch can be used as a feature in speaker recognition.

    It can also be used as one of the parameters in speech coding.

    Other applications include biometric identification, biomedical applications and signal processing research.

    This work may help future research in signal processing, especially in the spectral representation of signals.

    Other by-products of this algorithm are also useful. One example is digit recognition, where the envelope of the suitable IMF increases the accuracy.


    10. Results and Conclusions

    In this report we have explained and tested various methods to improve the decomposition of speech signals into Intrinsic Mode Functions (IMFs) and to extract pitch using the Hilbert Huang Transform. We conclude that Empirical Mode Decomposition (EMD) is a good method for decomposing a speech signal into IMFs. However, mode-mixing violates the assumption that an IMF contains a single frequency mode, so it must be avoided in applications that rely on this assumption. We therefore considered Ensemble Empirical Mode Decomposition, which avoids mode-mixing using the principle of noise-assisted data analysis. However, the time required by Ensemble Empirical Mode Decomposition is very high: N times that of normal EMD, where N is of the order of a few thousand. Hence, an alternative method is needed that avoids mode-mixing without affecting the other properties of EMD. We found that Neighbourhood Limited Empirical Mode Decomposition combined with filtering meets this requirement. It effectively removes mode-mixing and has a runtime comparable to that of EMD. We have obtained very good results using this algorithm: the epoch detection accuracy on a dataset of about 5000 epochs is 96.43%, with a standard deviation of 0.4 ms. Other by-products of this algorithm also have a variety of applications.


    11. Future Work

    1) To explore the uses of instantaneous pitch

    2) To exploit the algorithm for other signals including Biomedical signals

    3) To test the algorithm on a larger dataset

    4) To improve the run-time efficiency of the algorithm (for example, by implementing it in C or C++)

    5) To test the novel algorithm for IMF analysis in various applications


    12. References

    [1] Norden E. Huang, Zheng Shen, Steven R. Long, Manli C. Wu, Hsing H. Shih, Quanan Zheng, Nai-Chyuan Yen, Chi Chao Tung and Henry H. Liu, "The Empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis," Proc. Royal Society London A, vol. 454, pp. 903-995, 1998.

    [2] Norden E. Huang, Samuel S.P. Shen, Hilbert-Huang transform and its applications, London: World Scientific, 2005.

    [3] G. Schlotthauer, M. E. Torres, and H. L. Rufiner, "A new algorithm for instantaneous F0 speech extraction based on ensemble empirical mode decomposition," Proc. European Signal Processing Conference, Glasgow, Scotland, August 24-28, 2009.

    [4] Guanlei Xu, Xiaotong Wang, Xiaogang Xu, "Neighborhood Limited Empirical Mode Decomposition and application in Image Processing," Proc. Fourth International Conference on Image and Graphics, pp. 149-154, 2007.

    [5] K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602-1614, Nov. 2008.

    [6] S. R. M. Prasanna, D. Govind, K. Sreenivasa Rao and B. Yegnanarayana, "Fast prosody modification using instants of significant excitation," Proc. Speech Prosody, Chicago, USA, May 2010.

    [7] R. M. Rangayyan, Biomedical Signal Analysis - A Case-Study Approach, IEEE and Wiley, New York, p. 289, 2002.

    [8] Z. Wu and N. Huang, "Ensemble empirical mode decomposition: a noise-assisted data analysis method," Advances in Adaptive Data Analysis, vol. 1, no. 1, pp. 1-41, 2009.

    [9] G. Schlotthauer, M. E. Torres, and H. L. Rufiner, "Voice fundamental frequency extraction algorithm based on ensemble empirical mode decomposition and entropies," Proc. 11th Int. Congr. of the IFMBE, Munich, pp. 984-987, 2009.