Speech Recognition Using Wavelets


    Abstract

Although speech recognition systems have come a long way in the last forty years, there is still room for improvement. Though readily available, these systems are sometimes inaccurate and insufficient. In an effort to provide a more efficient representation of the speech signal, the application of wavelet analysis is considered. Here we present an effective and robust method for extracting features for speech processing. Based on the time-frequency multi-resolution property of the wavelet transform, the input speech signal is decomposed into various frequency channels, and the original speech can then be recognized using the wavelet transform. The major issues in the design of this wavelet-based speech recognition system are choosing optimal wavelets for speech signals, selecting the decomposition level in the DWT, and selecting the feature vectors from the wavelet coefficients.

Dynamic Time Warping (DTW) is a pattern-matching approach that can be used for limited-vocabulary speech recognition; it is based on a temporal alignment of the input signal with the template models. The main drawback of this method is its high computational cost as the length of the signals increases. The main aim of this project is to provide a modified version of DTW, based on the Discrete Wavelet Transform (DWT), which reduces its original complexity. Daubechies wavelet decompositions at level 4 and level 7 are evaluated, and the corresponding results are reported.

The proposed approaches are implemented in software and also on an FPGA.


4. WAVELET ANALYSIS
4.1 Definition
4.2 Fourier Analysis
4.2.1 Limitations
4.3 Short-Time Fourier Analysis
4.3.1 Limitations
4.4 Types of Wavelets
4.4.1 Haar Wavelet
4.4.2 Daubechies-N wavelet family
4.4.3 Advantages of Wavelet analysis over STFT
4.5 Wavelet Transform
4.5.1 Discrete Wavelet Transform
4.5.2 Multilevel Decomposition of Signal
4.5.3 Wavelet Reconstruction
5. FROM SPEECH TO FEATURE VECTORS
5.1 Preprocessing
5.1.1 Pre-emphasis
5.1.2 Voice Activation Detection (VAD)
5.2 Frame blocking & Windowing
5.2.1 Frame blocking
5.2.2 Windowing
5.3 Feature Extraction
6. DYNAMIC TIME WARPING
6.1 DTW Algorithm
6.1.1 DP-Matching Principle
6.1.2 Restrictions on Warping Function
6.1.3 Discussions on Weighting Coefficient
6.2 Practical DP-Matching Algorithm
6.2.1 DP-Equation
6.2.2 Calculation Details
7. FPGA Implementation


8. SIMULATION & RESULTS
8.1 Input Signal
8.2 Pre-emphasis
8.3 Voice Activation & Detection
8.4 De-noising
8.5 Recognition Results
8.6 FPGA Implementation
9. CONCLUSION
REFERENCES


    List of Tables

Table 8.1: Recognition rates for English words using db8 & level 4 DWT.
Table 8.2: Recognition rates for English words using db8 & level 7 DWT.


    List of Figures

Fig. 2.1 Literature survey
Fig. 3.1 Schematic diagram of the speech production/perception process
Fig. 3.2 Human Vocal Mechanism
Fig. 3.3 Discrete-Time Speech Production Model
Fig. 3.4 Three state representation of a speech signal
Fig. 3.5 Spectrogram using Welch's Method
Fig. 4.1 Fourier transform
Fig. 4.2 Short time Fourier transform
Fig. 4.3 Haar wavelet
Fig. 4.5 Daubechies wavelets
Fig. 4.6 Comparison of Wavelet analysis over STFT
Fig. 4.7 Filter functions
Fig. 4.8 Decomposition of DWT Co-efficients
Fig. 4.9 Decomposition using DWT
Fig. 4.10 Signal Reconstruction
Fig. 4.11 Signal Decomposition & Reconstruction
Fig. 5.1 Main steps in Feature Extraction
Fig. 5.2 Pre-processing
Fig. 5.3 Pre-emphasis filter
Fig. 5.4 Frame blocking & Windowing
Fig. 5.5 Frame blocking of a sequence
Fig. 5.6 Hamming Window
Fig. 6.1 Warping function & adjusting window definition
Fig. 6.2 Slope constraint on warping function
Fig. 6.3 Weighting coefficient W(k)
Fig. 7.1 Synthesis flow in AccelDSP
Fig. 8.1 Input speech signal
Fig. 8.2 Pre-emphasis output
Fig. 8.3 Voice Activation & Detection


Fig. 8.4 Speech signal after Voice Activation & Detection
Fig. 8.5 Speech signal after de-noising
Fig. 8.6 Matlab output of Speech Recognition for the word "FEDORA"
Fig. 8.7 FPGA results for the word "FEDORA"


    1. INTRODUCTION

    1.1 Definition

Speech recognition is the process of automatically extracting and determining the linguistic information conveyed by a speech signal using computers or electronic circuits. Recent advances in soft computing techniques give more importance to automatic speech recognition. Large variations in speech signals, along with other criteria such as native accent and varying pronunciation, make the task very difficult. Automatic speech recognition (ASR) is hence a complex task, and it requires considerable intelligence to achieve a good recognition result. Speech recognition is a topic that is very useful in many applications and environments in our daily life.

The fundamental purpose of speech is communication, i.e., the transmission of messages. According to Shannon's information theory, a message represented as a sequence of discrete symbols can be quantified by its information content in bits, and the rate of transmission of information is measured in bits per second (bps).

    In order for communication to take place, a speaker must produce a speech signal in the

    form of a sound pressure wave that travels from the speaker's mouth to a listener's ears. Although

    the majority of the pressure wave originates from the mouth, sound also emanates from the

    nostrils, throat, and cheeks. Speech signals are composed of a sequence of sounds that serve as a

    symbolic representation for a thought that the speaker wishes to relay to the listener. The

    arrangement of these sounds is governed by rules associated with a language. The scientific

    study of language and the manner in which these rules are used in human communication is

    referred to as linguistics. The science that studies the characteristics of human sound production,

    especially for the description, classification, and transcription of speech, is called phonetics.

    1.2 Application area, Features & Issues

A different aspect of speech recognition is assisting people with functional disabilities or other kinds of handicap. To make their daily chores easier, voice control could be helpful: with their voice they could operate the light switch, turn the coffee machine on or off, or operate other domestic appliances. This leads to the discussion of intelligent homes, where these operations can be made available for the common man as well as for the handicapped.


1.2.1 Features

- Speech input is easy to perform because it does not require a specialized skill, as typing or pushbutton operation does.
- Information can be input even when the user is moving or doing other activities involving the hands, legs, eyes, or ears.
- Since a microphone or telephone can be used as an input terminal, inputting information is economical, with remote input possible over existing telephone networks and the Internet.

1.2.2 Issues

- A lot of redundancy is present in the speech signal, which makes discriminating between the classes difficult.
- Presence of temporal and frequency variability, such as intra-speaker variability in the pronunciation of words and phonemes as well as inter-speaker variability, e.g. the effect of regional dialects.
- Context-dependent pronunciation of the phonemes (co-articulation).
- Signal degradation due to additive and convolutive noise present in the background or in the channel.
- Signal distortion due to non-ideal channel characteristics.

1.3 Recognition Systems

Recognition systems may be designed in many modes to achieve specific objectives or performance criteria.

    1.3.1 Speaker Dependent / Independent System

For speaker-dependent systems, the user is asked to utter predefined words or sentences. These acoustic signals form the training data, which are used for recognition of the input speech. Since these systems are used by only a predefined speaker, their performance is higher than that of speaker-independent systems.


    1.3.2 Isolated Word Recognition

This is also called a discrete recognition system. In this system there has to be a pause between uttered words, so the system does not have to find boundaries between words.

    1.3.3 Continuous Speech Recognition

These systems are the ultimate goal of a recognition process. No matter how or when a word is uttered, it is recognized in real time and an action is performed accordingly. Changes in speaking rate, careless pronunciation, detection of word boundaries, and real-time constraints are the main problems for this recognition mode.

    1.3.4 Vocabulary Size

The smaller the vocabulary of a recognition system, the higher the recognition performance. Specific tasks may use small vocabularies. However, a natural system would be speaker-independent continuous recognition over a large vocabulary, which is the most difficult case.

    1.3.5 Keyword Spotting

These systems are used to detect a word in continuous speech. For this reason they can be as accurate as isolated-word recognition while also having the capability to handle continuous speech.

Speech word recognition systems commonly carry out some kind of classification based on speech features, which are usually obtained via Fourier Transforms (FTs), Short-Time Fourier Transforms (STFTs), or Linear Predictive Coding techniques. However, these methods have some disadvantages: they assume the signal is stationary within a given time frame and may therefore lack the ability to analyze localized events correctly. The wavelet transform copes with some of these problems. Other factors influencing the selection of Wavelet Transforms (WT) over conventional methods include their ability to capture localized features. In this work the Discrete Wavelet Transform is used for speech processing.

The speech recognizer implemented in Matlab was used in simulation, as if a speech recognizer were operating in a real environment. Simulation recordings were taken in an open environment to obtain realistic data.


In the future it could be possible to use this information to create a chip that could serve as a new human interface. For example, it would be desirable to get rid of all remote controls in the home and simply tell the television, stereo, or any other device what to do by voice.

    1.4 Objectives

This project covers speaker-independent, small-vocabulary speech recognition with the help of wavelet analysis using the Dynamic Time Warping method. The project is composed of two phases:

1) Training phase: a number of words are trained to extract a model for each word.

2) Recognition phase: a sequence of connected words is entered via microphone or an input file, and the system tries to recognize these words.

1.5 Outline

The outline of this thesis is as follows.

Chapter 2 - Literature Survey:

This chapter discusses trends and technologies that have been followed to improve speech recognition performance.

Chapter 3 - The Speech Signal:

This chapter discusses how the production and perception of speech is performed. Topics related to this chapter are speech production, speech representation, characteristics of the speech signal, and perception.

Chapter 4 - Wavelet Analysis:

This chapter discusses what a wavelet is, what types of wavelets are available, which types are used, why wavelets were introduced, and wavelet decomposition. Some topics related to this chapter are Fourier analysis, the STFT, types of wavelets, and the wavelet transform.


Chapter 5 - From Speech to Feature Vectors:

This chapter covers the fundamental signal processing applied in a speech recognizer. Some topics related to this chapter are pre-processing, frame blocking and windowing, and feature extraction.

Chapter 6 - Dynamic Time Warping:

Aspects of this chapter are the theory and implementation of the pattern-matching technique referred to as Dynamic Time Warping. Some topics related to this chapter are the DTW algorithm and the DP-matching algorithm.

Chapter 7 - FPGA Implementation:

This chapter describes the FPGA implementation of the speech recognition system using the AccelDSP tool in Xilinx ISE.

Chapter 8 - Simulation & Results:

In this chapter the speech recognizer implemented in Matlab is used to test the recognizer in different cases and find its efficiency.

Chapter 9 - Conclusions:

This chapter summarizes the whole project.


    2. LITERATURE SURVEY

    Designing a machine that mimics human behavior, particularly the capability of speaking

    naturally and responding properly to spoken language, has intrigued engineers and scientists for

    centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a system model

    for speech analysis and synthesis, the problem of automatic speech recognition has been

    approached progressively, from a simple machine that responds to a small set of sounds to a

    sophisticated system that responds to fluently spoken natural language and takes into account the

    varying statistics of the language in which the speech is produced. Based on major advances in

    statistical modeling of speech in the 1980s, automatic speech recognition systems today find

    widespread application in tasks that require a human-machine interface, such as automatic call

    processing in the telephone network and query-based information systems that do things like

    provide updated travel information, stock price quotations, weather reports, etc.

    Speech is the primary means of communication between people. For reasons ranging

    from technological curiosity about the mechanisms for mechanical realization of human speech

    capabilities, to the desire to automate simple tasks inherently requiring human-machine

    interactions, research in automatic speech recognition (and speech synthesis) by machine has

    attracted a great deal of attention over the past five decades.

    2.1 Advancement in technology

    Fig. 2.1 shows a timeline of progress in speech recognition and understanding technology

    over the past several decades. We see that in the 1960s we were able to recognize small

    vocabularies (order of 10-100 words) of isolated words, based on simple acoustic-phonetic

    properties of speech sounds. The key technologies that were developed during this time frame

    were filter-bank analyses, simple time normalization methods, and the beginnings of

sophisticated dynamic programming methodologies. In the 1970s we were able to recognize medium vocabularies (order of 100-1000 words) using simple template-based, pattern

    recognition methods [3]. The key technologies that were developed during this period were the

    pattern recognition models, the introduction of LPC methods for spectral representation, the

    pattern clustering methods for speaker-independent recognizers, and the introduction of dynamic

    programming methods for solving connected word recognition problems. In the 1980s we


    started to tackle large vocabulary (1000-unlimited number of words) speech recognition

    problems based on statistical methods, with a wide range of networks for handling language

    structures. The key technologies introduced during this period were the hidden Markov model

    (HMM) [9] and the stochastic language model, which together enabled powerful new methods

    for handling virtually any continuous speech recognition problem efficiently and with high

    performance. In the 1990s we were able to build large vocabulary systems with unconstrained

    language models, and constrained task syntax models for continuous speech recognition and

    understanding. The key technologies developed during this period were the methods for

    stochastic language understanding, statistical learning of acoustic and language models, and the

introduction of the finite state transducer framework (and the FSM Library) and the methods for their determinization and minimization for efficient implementation of large vocabulary speech

    understanding systems.

    Fig. 2.1 Literature survey


    Finally, in the last few years, we have seen the introduction of very large vocabulary

    systems with full semantic models, integrated with text-to-speech (TTS) synthesis systems, and

    multi-modal inputs (pointing, keyboards, mice, etc.). These systems enable spoken dialog

    systems with a range of input and output modalities for ease-of-use and flexibility in handling

    adverse environments where speech might not be as suitable as other input-output modalities.

    During this period we have seen the emergence of highly natural speech synthesis systems, the

    use of machine learning to improve both speech understanding and speech dialogs, and the

    introduction of mixed-initiative dialog systems to enable user control when necessary.

    After nearly five decades of research, speech recognition technologies have finally

    entered the marketplace, benefiting the users in a variety of ways. Throughout the course of

    development of such systems, knowledge of speech production and perception was used in

    establishing the technological foundation for the resulting speech recognizers. Major advances,

    however, were brought about in the 1960s and 1970s via the introduction of advanced speech

    representations based on LPC analysis and cepstral analysis methods, and in the 1980s through

    the introduction of rigorous statistical methods based on hidden Markov models [9]. All of this

    came about because of significant research contributions from academia, private industry and the

    government. As the technology continues to mature, it is clear that many new applications will

emerge and become part of our way of life, thereby taking full advantage of machines that are

    partially able to mimic human speech capabilities.


    3. THE SPEECH SIGNAL

    This chapter intends to discuss how the speech signal is produced and perceived by

    human beings. This is an essential subject that has to be considered before one can pursue and

    decide which approach to use for speech recognition.

    3.1 Speech production

    Human communication is to be seen as a comprehensive diagram of the process from

    speech production to speech perception between the talker and listener as in Fig. 3.1 [2].

    Fig. 3.1 Schematic diagram of the speech production/perception process

Five different elements (A. Speech formulation, B. Human vocal mechanism, C. Acoustic air, D. Perception of the ear, E. Speech comprehension) will be examined more carefully in the following sections.

The first element (A. Speech formulation) is associated with the formulation of the speech signal in the talker's mind. This formulation is used by the human vocal mechanism (B.

    Human vocal mechanism) to produce the actual speech waveform. The waveform is transferred

    via the air (C. Acoustic air) to the listener. During this transfer the acoustic wave can be affected

    by external sources, for example noise, resulting in a more complex waveform. When the wave


reaches the listener's hearing system (the ears), the listener perceives the waveform (D. Perception of the ear) and the listener's mind (E. Speech comprehension) starts processing this waveform to comprehend its content, so that the listener understands what the talker is trying to tell him or her.

    Fig. 3.2 Human Vocal Mechanism

To understand how the production of speech is performed, one needs to know how the human vocal mechanism is constructed; see Fig. 3.2.


The most important parts of the human vocal mechanism are the vocal tract together with the nasal cavity, which begins at the velum. The velum is a trapdoor-like mechanism that is used to form nasal sounds when needed. When the velum is lowered, the nasal cavity is coupled to the vocal tract to form the desired speech signal. The cross-sectional area of the vocal tract is limited by the tongue, lips, jaw, and velum and varies from 0 to 20 cm².

When humans produce speech, air is expelled from the lungs through the trachea. The air flowing from the lungs causes the vocal cords to vibrate, and by shaping the vocal tract, lips, tongue, and jaw, and possibly using the nasal cavity, different sounds can be produced.

Important parts of the discrete-time speech production model, in the field of speech recognition and signal processing, are u(n), the gain b0, and H(z). The impulse generator acts like the lungs, exciting the glottal filter G(z), resulting in u(n). G(z) can be regarded as the vocal cords in the human vocal mechanism. The signal u(n) can be seen as the excitation signal entering the vocal tract and the nasal cavity; it is formed by exciting the vocal cords with air from the lungs.

    Fig. 3.3 Discrete-Time Speech Production Model

The gain b0 is a factor related to the volume of the speech being produced: a larger gain b0 gives louder speech and vice versa. The vocal tract filter H(z) is a model of the vocal tract and the nasal cavity. The lip radiation filter R(z) is a model of the formation of the human lips to produce different sounds.


    3.2 Speech Representation

    The speech signal and all its characteristics can be represented in two different domains,

    the time and the frequency domain.

    A speech signal is a slowly time varying signal in the sense that, when examined over a

    short period of time (between 5 and 100 ms), its characteristics are short-time stationary. This is

    not the case if we look at a speech signal under a longer time perspective (approximately time

T>0.5 s). In this case the signal's characteristics are non-stationary, meaning that they change to

    reflect the different sounds spoken by the talker.

To be able to use a speech signal and interpret its characteristics in a proper manner, some kind of representation of the speech signal is preferred. The speech representation can exist in either the time or frequency domain, and in three different ways: a three-state representation, a spectral representation, and a parameterization of the spectral activity.

3.2.1 Three-state Representation

The three-state representation is one way to classify events in speech. The events of interest for the three-state representation are:

- Silence (S): no speech is produced.
- Unvoiced (U): vocal cords are not vibrating, resulting in an aperiodic or random speech waveform.
- Voiced (V): vocal cords are tensed and vibrating periodically, resulting in a speech waveform that is quasi-periodic.

Quasi-periodic means that the speech waveform can be seen as periodic over a short-time period (5-100 ms) during which it is stationary.


    Fig. 3.4 Three state representation of a speech signal.

The upper plot, Fig. 3.4(a), contains the whole speech sequence, and the middle plot, Fig. 3.4(b), reproduces a zoomed-in area of it. At the bottom of Fig. 3.4 the segmentation into the three-state representation, in relation to the different parts of the middle plot, is given.


    3.2.2 Spectral Representation

Spectral representation of speech intensity over time is very popular, and the most popular form is the sound spectrogram; see Fig. 3.5.

Fig. 3.5 Spectrogram using Welch's Method

Here the darkest (dark blue) parts represent the parts of the waveform where no speech is produced, and the lighter (red) parts represent higher intensity where speech is present.


    3.3.2 Fundamental Frequency

The time between successive vocal fold openings is called the fundamental period T0, while the rate of vibration is called the fundamental frequency of the phonation, F0 = 1/T0. Using voiced excitation for a speech sound results in a pulse train at this fundamental frequency. Voiced excitation is used when articulating vowels and some of the consonants. For fricatives (e.g., /f/ as in fish or /s/ as in mess), unvoiced excitation (noise) is used. In these cases usually no fundamental frequency can be detected; on the other hand, the zero-crossing rate of the signal is very high. Plosives (like /p/ as in put), which use transient excitation, are best detected in the speech signal by looking for the short silence necessary to build up the air pressure before the plosive bursts out.

    3.3.3 Peaks in the Spectrum

After passing the glottis, the vocal tract gives a characteristic spectral shape to the speech signal. If one simplifies the vocal tract to a straight pipe (about 17 cm long), one can see that the pipe shows resonances at certain frequencies. Depending on the shape of the vocal tract (the diameter of the pipe changes along its length), the frequencies of the formants (especially the 1st and 2nd formant) change and therefore characterize the vowel being articulated.

    3.3.4 The Envelope of the Power Spectrum

The pulse sequence from the glottis has a power spectrum decreasing towards higher frequencies by -12 dB per octave. The emission characteristics of the lips show a high-pass characteristic of +6 dB per octave. This results in an overall decrease of -6 dB per octave.

    3.4 Speech perception process

The microphone.cs class is responsible for accepting input from a microphone and forwarding it to the feature extraction module. Before converting the signal into a suitable or desired form, it is important to identify the segments of the sound containing words. The audio.cs class deals with all tasks needed for converting a wave file to a stream of digits and vice versa. It also has a provision for saving the sound into WAV files.


    4. WAVELET ANALYSIS

    4.1 Definition

    A wavelet is a wave-like oscillation with amplitude that starts out at zero, increases, and

    then decreases back to zero. It can typically be visualized as a "brief oscillation" like one might

    see recorded by a seismograph or heart monitor. Generally, wavelets are purposefully crafted to

    have specific properties that make them useful for signal processing. Wavelets can be combined,

    using a "reverse, shift, multiply and sum" technique called convolution, with portions of an

    unknown signal to extract information from the unknown signal.

    The fundamental idea behind wavelets is to analyze according to scale. The wavelet

    analysis procedure is to adopt a wavelet prototype function called an analyzing wavelet or

    mother wavelet. Any speech signal can then be represented by translated and scaled versions of

the mother wavelet. Wavelet analysis is capable of revealing aspects of data that other speech signal analysis techniques miss; the extracted features are then passed to a classifier for the recognition of isolated words [4].

The integral wavelet transform is the integral transform defined as:

$$W_\psi(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi\!\left(\frac{t - b}{a}\right) dt \qquad \text{(Equation 4.1)}$$

where a is positive and defines the scale, and b is any real number and defines the shift.
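As a concrete illustration, the following minimal MATLAB sketch evaluates Equation 4.1 numerically for a single (a, b) pair. The toy signal and the real-valued, Morlet-style mother wavelet are assumptions made for this example only; the recognizer itself uses Daubechies wavelets, which have no closed-form expression.

```matlab
% Minimal sketch: numerically evaluating Equation 4.1 for one (a, b) pair.
psi = @(t) cos(5*t) .* exp(-t.^2 / 2);    % illustrative mother wavelet (assumed)

t = linspace(-10, 10, 4096);              % time grid
x = sin(2*pi*2*t) + 0.5*sin(2*pi*5*t);    % toy input signal (assumed)

a = 0.5;                                  % scale (a > 0)
b = 1.0;                                  % shift (any real number)

% W(a,b) = (1/sqrt(a)) * integral of x(t) * psi((t - b)/a) dt
W = (1/sqrt(a)) * trapz(t, x .* psi((t - b)/a));
fprintf('W(a=%.2f, b=%.2f) = %.4f\n', a, b, W);
```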

For decomposition of the speech signal we can use different techniques such as Fourier analysis, the STFT (Short-Time Fourier Transform), and wavelet transform techniques.

Here we explain the necessity and advantages of wavelet analysis by first considering Fourier analysis and its limitations, then its modification into the Short-Time Fourier Transform and its limitations, and finally wavelet analysis itself.


    4.2 Fourier Analysis

Fourier analysis breaks down a signal into constituent sinusoids of different frequencies. It is a mathematical technique for transforming a signal from a time-based representation to a frequency-based one:

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \qquad \text{(Equation 4.2)}$$

The Fourier transform of a sinusoidal signal is depicted in Fig. 4.1 below.

    Fig. 4.1 Fourier transform

    4.2.1 Limitations

But Fourier analysis has a serious drawback: in transforming to the frequency domain, time information is lost. When looking at the Fourier transform of a signal, it is impossible to tell when a particular event took place. If a signal doesn't change much over time, i.e. if it is what is called a stationary signal, this drawback isn't very important. However, most interesting signals contain numerous non-stationary or transitory characteristics: drift, trends, abrupt changes, and beginnings and ends of events. These characteristics are often the most important part of the signal, and Fourier analysis is not suited to detecting them.

4.3 Short-Time Fourier Analysis

The Short-Time Fourier Transform (STFT) maps a signal into a two-dimensional function of time and frequency using a technique called windowing. Mathematically it is given by

$$X(m, \omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n - m]\, e^{-j\omega n} \qquad \text{(Equation 4.3)}$$

where the signal is x[n] and the window is w[n].


The Short-Time Fourier Transform of a random signal is shown in Fig. 4.2 below.

    Fig. 4.2 Short time Fourier transform

    The STFT represents a sort of compromise between the time- and frequency-based views

    of a signal. It provides some information about both when and at what frequencies a signal event

    occurs.
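The sketch below illustrates this fixed-window analysis in MATLAB (it assumes the Signal Processing Toolbox); the two-tone test signal and the 256-sample window are arbitrary choices for the example.

```matlab
% Minimal sketch of Equation 4.3 in practice: a fixed-window STFT of a
% two-tone test signal.
fs = 8000;                                 % sampling rate (Hz)
t  = 0:1/fs:1;
x  = [sin(2*pi*500*t), sin(2*pi*1500*t)];  % 500 Hz tone, then 1500 Hz

win = hamming(256);                        % the window w[n], fixed for all frequencies
spectrogram(x, win, 128, 256, fs, 'yaxis');
title('STFT: one window size for the whole time-frequency plane');
```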

    4.3.1 Limitations

However, this information can only be obtained with limited precision, and that precision is determined by the size of the window. While the STFT's compromise between time and frequency information can be useful, the drawback is that once a particular size for the time window is chosen, that window is the same for all frequencies. If a wider window is chosen, it gives better frequency resolution but poor time resolution; a narrower window gives good time resolution but poor frequency resolution. Many signals require a more flexible approach, one where the window size can be varied to determine more accurately either time or frequency.

    4.4 Types of Wavelets

Different types of wavelets are Haar, Daubechies, biorthogonal, Coiflet, Symlet, Morlet, Mexican hat, and Meyer wavelets.

    Wavelets mainly used in speech recognition are discussed here.


    4.4.1 Haar Wavelet

The Haar wavelet is the first and simplest wavelet. It is discontinuous and resembles a step function. It represents the same wavelet as Daubechies db1.

The Haar wavelet family for t ∈ [0, 1] is defined as follows:

$$h_i(t) = \begin{cases} 2^{j/2}, & \dfrac{k}{m} \le t < \dfrac{k + 0.5}{m} \\ -2^{j/2}, & \dfrac{k + 0.5}{m} \le t < \dfrac{k + 1}{m} \\ 0, & \text{otherwise} \end{cases} \qquad \text{(Equation 4.4)}$$

The integer m = 2^j (j = 0, 1, ..., J) indicates the level of the wavelet, and k = 0, 1, 2, ..., m-1 is the translation parameter. The maximal level of resolution is J.

    Fig. 4.3 Haar wavelet

    4.4.2 Daubechies-N wavelet family

The Daubechies wavelets are a family of orthogonal wavelets defining a discrete wavelet transform and characterized by a maximal number of vanishing moments for a given support. With each wavelet type of this class there is a scaling function (also called the father wavelet) which generates an orthogonal multi-resolution analysis. The Daubechies wavelet is one of the most popular wavelets and has been used for speech recognition [4].


In general the Daubechies wavelets are chosen to have the highest number A of vanishing moments (this does not imply the best smoothness) for a given support width N = 2A, and among the 2^(A-1) possible solutions the one whose scaling filter has extremal phase is chosen. The wavelet transform is also easy to put into practice using the fast wavelet transform. Daubechies wavelets are widely used in solving a broad range of problems, e.g. self-similarity properties of a signal, fractal problems, signal discontinuities, etc.

The Daubechies wavelet properties are [6]:

- The support length of the wavelet function ψ and the scaling function φ is 2N-1.
- The number of vanishing moments of ψ is N.
- Most dbN wavelets are not symmetrical.
- The regularity increases with the order: when N becomes very large, ψ and φ belong to C^(μN), where μ is approximately equal to 0.2.

The Daubechies-8 wavelet is used for decomposition of the speech signal, as it needs the minimum support size for the given number of vanishing moments.

The names of the Daubechies family wavelets are written dbN, where N is the order and db the "surname" of the wavelet. The db1 wavelet, as mentioned above, is the same as the Haar wavelet.

    Here are the next nine members of the family:

    Fig. 4.5 Daubechies wavelets
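Assuming MATLAB's Wavelet Toolbox, the wavefun routine can approximate and plot the db8 pair used later in this work:

```matlab
% Minimal sketch: approximate the db8 scaling and wavelet functions by
% cascade iteration and plot them.
[phi, psi, xval] = wavefun('db8', 10);   % 10 refinement iterations

subplot(2,1,1); plot(xval, phi); title('db8 scaling function \phi');
subplot(2,1,2); plot(xval, psi); title('db8 wavelet function \psi');
```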


4.4.3 Advantages of Wavelet analysis over STFT

    Wavelet analysis represents the next logical step: a windowing technique with variable-

    sized regions. Wavelet analysis allows the use of long time intervals where we want more precise

    low frequency information, and shorter regions where we want high frequency information.

    Fig. 4.6 Comparison of Wavelet analysis over STFT

The time-based, frequency-based, and STFT views of a signal are shown alongside the wavelet-analysis view. One major advantage afforded by wavelets is the ability to perform local analysis, i.e., to analyze a localized area of a larger signal.

    4.5 Wavelet Transform

The transform of a signal is just another form of representing the signal; it does not change the information content present in the signal. For many signals the low-frequency part contains the most important content: it gives the signal its identity. Consider the human voice: if we remove the high-frequency components, the voice sounds different, but we can still tell what's being said. In wavelet analysis we often speak of approximations and details. The approximations are the high-scale, low-frequency components of the signal; the details are the low-scale, high-frequency components.

$$CWT_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt \qquad \text{(Equation 4.5)}$$

where ψ(t) is a time function with finite energy and fast decay, called the mother wavelet.


    4.5.1 Discrete Wavelet Transform

The Discrete Wavelet Transform (DWT) involves choosing scales and positions based on powers of two, so-called dyadic scales and positions. The mother wavelet is rescaled or dilated by powers of two and translated by integers. Specifically, a function f(t) ∈ L²(R) (the space of square-integrable functions) can be represented as [1]:

$$f(t) = \sum_{k} a(L, k)\, 2^{-L/2}\, \varphi(2^{-L} t - k) + \sum_{j=1}^{L} \sum_{k} d(j, k)\, 2^{-j/2}\, \psi(2^{-j} t - k) \qquad \text{(Equation 4.6)}$$

The function ψ(t) is known as the mother wavelet, while φ(t) is known as the scaling function. The set of functions {2^{-j/2} ψ(2^{-j} t - k) : j, k ∈ Z}, where Z is the set of integers, is an orthonormal basis for L²(R). The numbers a(L, k) are known as the approximation coefficients at scale L, while d(j, k) are known as the detail coefficients at scale j. The approximation and detail coefficients can be expressed as:

$$a(L, k) = \int f(t)\, 2^{-L/2}\, \varphi(2^{-L} t - k)\, dt \qquad \text{(Equation 4.7)}$$

$$d(j, k) = \int f(t)\, 2^{-j/2}\, \psi(2^{-j} t - k)\, dt \qquad \text{(Equation 4.8)}$$

The DWT analysis can be performed using a fast, pyramidal algorithm related to multi-rate filter banks. As a multi-rate filter bank the DWT can be viewed as a constant-Q filter bank with octave spacing between the centers of the filters. Each sub-band contains half the samples of the neighboring higher-frequency sub-band. In the pyramidal algorithm the signal is analyzed at different frequency bands with different resolution by decomposing the signal into a coarse approximation and detail information. The coarse approximation is then further decomposed using the same wavelet decomposition step. This is achieved by successive high-pass and low-pass filtering of the time-domain signal and is defined by the following equations:

$$y_{low}[n] = \sum_{k} x[k]\, g[2n - k] \qquad \text{(Equation 4.9)}$$

$$y_{high}[n] = \sum_{k} x[k]\, h[2n - k] \qquad \text{(Equation 4.10)}$$


    Fig. 4.7 Filter functions

The signal x[n] is passed through low-pass and high-pass filters and then downsampled by 2:

$$y_{low}[n] = (x * g) \downarrow 2 \qquad \text{(Equation 4.11)}$$

$$y_{high}[n] = (x * h) \downarrow 2 \qquad \text{(Equation 4.12)}$$

In the DWT, each level is calculated by passing the previous approximation coefficients through a pair of high-pass and low-pass filters.
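A minimal MATLAB sketch of Equations 4.9 to 4.12 is given below. Only the db8 decomposition filters come from the Wavelet Toolbox; the random input is a stand-in for a real speech segment.

```matlab
% Minimal sketch: one DWT level as filtering followed by downsampling by 2.
[g, h] = wfilters('db8', 'd');        % decomposition low-pass g and high-pass h

x = randn(1, 512);                    % stand-in for a speech segment (assumed)

yl = conv(x, g);  ylow  = yl(1:2:end);   % (x * g), then keep every other sample
yh = conv(x, h);  yhigh = yh(1:2:end);   % (x * h), then keep every other sample

% Each sub-band now holds roughly half the samples of the input signal.
fprintf('input %d samples -> approx %d, detail %d\n', ...
        numel(x), numel(ylow), numel(yhigh));
```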

    4.5.2 Multilevel Decomposition of Signal

A signal can be decomposed using wavelet analysis as shown below [11]:

    Fig. 4.8 Decomposition of DWT Co-efficients

    Fig. 4.9 Decomposition using DWT


The DWT is computed by successive low-pass and high-pass filtering of the discrete time-domain signal, as shown in Figs. 4.8 and 4.9. This is called the Mallat algorithm or Mallat-tree decomposition.

    4.5.3 Wavelet Reconstruction

Recovering the original signal with no (or minimal) loss of information is called reconstruction. It is done by the inverse discrete wavelet transform (IDWT). Whereas wavelet analysis involves filtering and downsampling, the wavelet reconstruction process consists of upsampling and filtering. Upsampling is the process of lengthening a signal component by inserting zeros between samples.
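The full decompose-and-reconstruct round trip can be sketched in a few lines of MATLAB (again assuming the Wavelet Toolbox); the random input is a stand-in for a de-noised speech segment.

```matlab
% Minimal sketch: 4-level Mallat decomposition with db8, then perfect
% reconstruction via the inverse DWT.
x = randn(1, 2048);                 % stand-in for a speech segment (assumed)

[C, L] = wavedec(x, 4, 'db8');      % C: all coefficients, L: bookkeeping lengths
cA4 = appcoef(C, L, 'db8', 4);      % level-4 approximation coefficients
cD1 = detcoef(C, L, 1);             % level-1 detail coefficients

xr = waverec(C, L, 'db8');          % upsample-and-filter reconstruction
fprintf('max reconstruction error: %g\n', max(abs(x - xr)));
```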

    Fig. 4.10 Signal Reconstruction

    Fig. 4.11 Signal Decomposition & Reconstruction


    5. FROM SPEECH TO FEATURE VECTORS

The main objective of this stage is to extract features that are sufficient for the recognizer to recognize the words. This chapter describes how to extract information from a

    speech signal, which means creating feature vectors from the speech signal. A wide range of

    possibilities exist for parametrically representing a speech signal and its content. The main steps

    for extracting information are preprocessing, frame blocking & windowing and feature

    extraction [1].

    Fig. 5.1 Main steps in Feature Extraction

    5.1 Preprocessing

This is the first step in creating feature vectors. The objective of pre-processing is to modify the speech signal x(n) so that it will be more suitable for the feature extraction analysis. The preprocessing operations (noise cancelling, pre-emphasis, and voice activation detection) are shown in Fig. 5.2.

Fig. 5.2 Pre-processing

The first thing to consider is whether the speech x(n) is corrupted by some noise d(n), for example an additive disturbance x(n) = s(n) + d(n), where s(n) is the clean speech signal. There are several approaches to performing noise reduction on a noisy speech signal. Two commonly used noise reduction algorithms in the speech recognition context are spectral subtraction and adaptive noise cancellation. A low signal-to-noise ratio (SNR) decreases the


performance of the recognizer in a real environment. Some changes to make the speech recognizer more noise-robust will be presented later. Note that the order of the operations might be changed for some tasks; for example, the noise reduction algorithm spectral subtraction is better placed last in the chain (it needs the voice activation detection).

5.1.1 Pre-emphasis

There is a need to spectrally flatten the signal. The pre-emphasizer, often realized as a first-order high-pass FIR filter, is used to emphasize the higher-frequency components.

This stage boosts the amount of energy in the high frequencies. If we look at the spectrum of voiced segments like vowels, there is more energy at the lower frequencies than at the higher frequencies. This drop in energy across frequencies (called spectral tilt) is caused by the nature of the glottal pulse. Boosting the high-frequency energy makes information from the higher formants more available to the acoustic model and improves phone detection accuracy.

Fig. 5.3 Pre-emphasis filter

The pre-emphasizer spectrally flattens the speech signal, usually with a high-pass filter. The most commonly used filter for this step is the FIR filter described below:

$$H(z) = 1 - 0.95\, z^{-1} \qquad \text{(Equation 5.1)}$$


The filter response for this FIR filter can be seen in Fig. 5.3. In the time domain the filter is h(n) = {1, -0.95}, and filtering in the time domain gives the pre-emphasized signal s1(n):

$$s_1(n) = x(n) - 0.95\, x(n - 1) \qquad \text{(Equation 5.2)}$$
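In MATLAB the pre-emphasis filter is a single call to filter; the sketch below uses a random stand-in for the recorded signal.

```matlab
% Minimal sketch of Equations 5.1-5.2: first-order FIR pre-emphasis.
x  = randn(1, 8000);              % stand-in for the recorded speech x(n) (assumed)
s1 = filter([1 -0.95], 1, x);     % s1(n) = x(n) - 0.95*x(n-1), boosts high frequencies
```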

    5.1.2 Voice Activation Detection (VAD)

The problem of locating the endpoints of an utterance in a speech signal is a major problem for the speech recognizer: inaccurate endpoint detection will decrease its performance. Detecting endpoints seems relatively trivial, but it has been found to be very difficult in practice; only when a fair SNR is given is the task made easier. Some commonly used measurements for finding speech are the short-term energy estimate Es1, the short-term power estimate Ps1, and the short-term zero-crossing rate Zs1. For the speech signal s1(n) these measures are calculated as follows [1]:

$$E_{s1}(m) = \sum_{n = mL - L + 1}^{mL} s_1^2(n) \qquad \text{(Equation 5.3)}$$

$$P_{s1}(m) = \frac{1}{L} \sum_{n = mL - L + 1}^{mL} s_1^2(n) \qquad \text{(Equation 5.4)}$$

$$Z_{s1}(m) = \frac{1}{2L} \sum_{n = mL - L + 1}^{mL} \left| \operatorname{sgn}(s_1(n)) - \operatorname{sgn}(s_1(n - 1)) \right| \qquad \text{(Equation 5.5)}$$

where

$$\operatorname{sgn}(s_1(n)) = \begin{cases} 1, & s_1(n) \ge 0 \\ -1, & s_1(n) < 0 \end{cases} \qquad \text{(Equation 5.6)}$$

For each block of L samples these measures produce one value. Note that the index for these functions is m and not n, because the measures do not have to be calculated for every sample (they can, for example, be calculated every 20 ms). The short-term energy estimate will increase when speech is present in s1(n). This is also the case for the short-term power estimate; the only thing that separates them is the scaling by 1/L. The short-term zero-crossing rate gives a measure of how many times the signal s1(n) changes sign; it tends to be larger during unvoiced regions.

These measures need triggers for making the decision about where the utterances begin and end. To create a trigger, one needs some information about the background noise. This is done by assuming that the first 10 blocks are background noise. With this assumption the


mean and variance of the measure are calculated. To make a more convenient approach, the following function is used:

$$W_{s1}(m) = P_{s1}(m)\, \bigl(1 - Z_{s1}(m)\bigr)\, S_c \qquad \text{(Equation 5.7)}$$

Using this function, both the short-term power and the zero-crossing rate are taken into account. Sc is a scale factor for avoiding small values; in a typical application Sc = 1000. The trigger for this function can be described as:

$$t_W = \mu_W + \alpha\, \sigma_W \qquad \text{(Equation 5.8)}$$

where μW is the mean and σW is the variance of Ws1(m) calculated over the first 10 blocks. The term α is a constant that has to be fine-tuned according to the characteristics of the signal. After some testing, an empirically chosen α (Equation 5.9) gives fairly good voice activation detection at various levels of additive background noise. The voice activation detection function VAD(m) can now be found as:

$$VAD(m) = \begin{cases} 1, & W_{s1}(m) \ge t_W \\ 0, & W_{s1}(m) < t_W \end{cases} \qquad \text{(Equation 5.10)}$$

VAD(n) takes the block value VAD(m) for every sample in block m. For example, if the measures are calculated every 320 samples (block length L = 320), which corresponds to 40 ms at a sampling rate of 8 kHz, then the first 320 samples of VAD(n) take the value VAD(1). Using these settings, VAD(n) is calculated for the speech signal containing the word, as shown in the results.
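A minimal MATLAB sketch of the detector follows, continuing from the pre-emphasized signal s1 above. The block length and the value of alpha are assumptions for illustration; as noted, alpha must be tuned to the signal.

```matlab
% Minimal sketch of Equations 5.4-5.10: block-wise voice activation detection.
L  = 320;                                 % block length: 40 ms at 8 kHz (assumed)
M  = floor(numel(s1) / L);                % number of measurement blocks
Sc = 1000;                                % scale factor from Equation 5.7

W = zeros(1, M);
for m = 1:M
    blk  = s1((m - 1)*L + 1 : m*L);
    P    = sum(blk.^2) / L;                         % short-term power Ps1(m)
    Z    = sum(abs(diff(sign(blk)))) / (2*L);       % zero-crossing rate Zs1(m)
    W(m) = P * (1 - Z) * Sc;                        % Equation 5.7
end

alpha = 0.2;                              % assumed placeholder; fine-tune per Equation 5.9
tW    = mean(W(1:10)) + alpha * var(W(1:10));       % trigger from first 10 blocks
VAD   = double(W >= tW);                  % Equation 5.10: 1 = speech, 0 = silence
```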

    5.2 Frame blocking & Windowing

The speech signal is a non-stationary signal, but we can assume it is stationary over 10-20 ms. Framing is used to cut the long speech signal into short-time segments in order to obtain relatively stable frequency characteristics. Features are extracted periodically. The time over which the signal is considered for processing is called a window, and the data acquired in a window is called a frame. Typically features are extracted once every 10 ms, which is called the frame rate. The window duration is typically 20 ms; thus two consecutive frames have overlapping areas.


    Fig. 5.4 Frame blocking & Windowing

    5.2.1 Frame blocking

For each utterance of a word, a window duration (Tw) of 320 samples is used for processing at later stages. A frame is formed from the windowed data with a typical frame duration (Tf) of about 200 samples. Since the frame duration is shorter than the window duration there is an overlap of data, and the percentage overlap is given as:

$$\%\,\text{Overlap} = \frac{(T_w - T_f) \times 100}{T_w} \qquad \text{(Equation 5.11)}$$

Each frame is K samples long, with adjacent frames separated by P samples.


    Fig. 5.5 Frame blocking of a sequence

By applying frame blocking to the de-noised signal x(k), one gets M vectors of length K, which correspond to x(k; m), where k = 0, 1, ..., K-1 and m = 0, 1, ..., M-1.
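A minimal MATLAB sketch of the blocking step, continuing from the pre-emphasized signal s1 and using the Tw = 320, Tf = 200 values above (so K = 320 samples per frame with a shift of P = 200 samples, i.e. 37.5% overlap):

```matlab
% Minimal sketch: cut the signal into K-sample frames, P samples apart.
K = 320;  P = 200;                        % frame length and frame shift (assumed)
M = floor((numel(s1) - K) / P) + 1;       % number of complete frames

frames = zeros(K, M);
for m = 1:M
    frames(:, m) = s1((m-1)*P + 1 : (m-1)*P + K).';   % frame x(k; m), k = 0..K-1
end
```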

    5.2.2 Windowing

The windowing concept is used to minimize signal distortion: the window tapers the signal to zero at the beginning and end of each frame, i.e., it reduces the signal discontinuity at either end of the block.

The rectangular window (i.e., no window) can cause problems when we do Fourier analysis, because it abruptly cuts off the signal at its boundaries. A good window function has a narrow main lobe and low side-lobe levels in its transfer function; it shrinks the values of the signal toward zero at the window boundaries, avoiding discontinuities (Equation 5.12).

The most commonly used window function in speech processing is the Hamming window, defined as:

w(k) = 0.54 − 0.46 · cos(2πk/(K − 1)), 0 ≤ k ≤ K − 1    Equation 5.13

By applying w(k) to x(k; m) for all blocks, the windowed signal output is calculated.


The Hamming window function is shown in Fig. 5.6 below:

    Fig. 5.6 Hamming Window

Multiplication of the signal by a window function in the time domain corresponds to convolution in the frequency domain. A rectangular window gives maximum sharpness but large side lobes (ripples); the Hamming window blurs the spectrum somewhat but produces much less leakage.
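Continuing the frame-blocking sketch above, the Hamming window of Equation 5.13 can be applied to every frame as follows; the explicit cosine formula is written out so that the snippet does not depend on the toolbox hamming() function.

    % Windowing (sketch): taper each frame with a Hamming window
    % to suppress the discontinuities at the frame boundaries.
    w = 0.54 - 0.46*cos(2*pi*(0:K-1)'/(K-1));  % Equation 5.13, length K
    windowed = frames .* repmat(w, 1, M);      % w(k) * x(k; m) for all m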

    5.3 Feature Extraction

A feature extractor should reduce the pattern vector (i.e., the original waveform) to a lower dimension that contains most of the useful information from the original vector. Here, the features of the input speech signal are extracted using Daubechies-8 wavelets at level 4 [4].

The extracted wavelet coefficients provide a compact representation that shows the energy distribution of the signal in time and frequency. To further reduce the dimensionality of the extracted feature vectors, statistics over the set of wavelet coefficients are used.


The following features are used in our system:

The mean of the absolute value of the coefficients in each sub-band. These features provide information about the frequency distribution of the audio signal.

The standard deviation of the coefficients in each sub-band. These features provide information about the amount of change of the frequency distribution.

The energy of each sub-band of the signal. These features provide information about the energy of each sub-band.

The kurtosis of each sub-band of the signal. These features measure whether the data are peaked or flat relative to a normal distribution.

The skewness of each sub-band of the signal. These features measure the symmetry, or lack of symmetry, of the data.

After frame blocking and windowing, we get different frame vectors, i.e., several signals must be loaded for feature extraction at once. Hence, multisignal wavelet analysis is performed on the input frame vectors in MATLAB [13], as sketched below.
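A sketch of the per-frame feature computation using the Wavelet Toolbox [13] follows; the helper name wavelet_features is illustrative, and kurtosis() and skewness() are assumed to be available from the Statistics Toolbox.

    % Wavelet feature extraction (sketch): db8 DWT at level 4 per frame,
    % then five statistics over each of the five sub-bands (A4, D4..D1).
    function fv = wavelet_features(frame)
        level  = 4;
        [C, S] = wavedec(frame, level, 'db8');       % DWT coefficients
        fv = [];
        for band = 1:level+1
            if band == 1
                c = appcoef(C, S, 'db8', level);     % approximation A4
            else
                c = detcoef(C, S, level - band + 2); % details D4 down to D1
            end
            fv = [fv, mean(abs(c)), std(c), sum(c.^2), ...
                      kurtosis(c), skewness(c)];
        end
    end

With 5 sub-bands and 5 statistics each, every frame reduces to a 25-element feature vector.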


    6. DYNAMIC TIME WARPING

    Dynamic time warping (DTW) is an algorithm for measuring similarity between two

    sequences which may vary in time or speed. For instance, similarities in walking patterns would

    be detected, even if in one video the person was walking slowly and if in another he or she were

    walking more quickly, or even if there were accelerations and decelerations during the course of

one observation. DTW has been applied to video, audio, and graphics; indeed, any data which

    can be turned into a linear representation can be analyzed with DTW. A well-known application

    has been automatic speech recognition, to cope with different speaking speeds [3].

    In general, DTW is a method that allows a computer to find an optimal match between

    two given sequences (e.g. time series) with certain restrictions. The sequences are "warped" non-

    linearly in the time dimension to determine a measure of their similarity independent of certain

    non-linear variations in the time dimension. This sequence alignment method is often used in

time series classification.

    The recognition process then consists of matching the incoming speech with stored

    templates. The template with the lowest distance measure from the input pattern is the

    recognized word. The best match (lowest distance measure) is based upon dynamic

    programming.

    6.1 DTW Algorithm

    Speech is a time-dependent process. Hence the utterances of the same word will have

    different durations, and utterances of the same word with the same duration will differ in the

    middle, due to different parts of the words being spoken at different rates. To obtain a global

    distance between two speech patterns (represented as a sequence of vectors) a time alignment

    must be performed.


    6.1.1 DP-Matching Principle

    General Time-Normalized Distance Definition:

    Speech can be expressed by appropriate feature extraction as a sequence of feature

    vectors.

A = a1, a2, a3, ..., ai, ..., aI    Equation 6.1

B = b1, b2, b3, ..., bj, ..., bJ    Equation 6.2

    Consider the problem of eliminating timing differences between these two speech

    patterns. In order to clarify the nature of time-axis fluctuation or timing differences, let us

consider an i-j plane, shown in Fig. 6.1, where patterns A and B are developed along the i-axis and j-axis, respectively. When these speech patterns are of the same category, the timing differences between them can be depicted by a sequence of points c = (i, j):

F = c(1), c(2), ..., c(k), ..., c(K)    Equation 6.3

where c(k) = (i(k), j(k)).

    This sequence can be considered to represent a function which approximately realizes a

mapping from the time axis of pattern A onto that of pattern B. Hereafter, it is called a warping

    function. When there is no timing difference between these patterns, the warping function

    coincides with the diagonal line j = i. It deviates further from the diagonal line as the timing

    difference grows [3].

Fig. 6.1 Warping function and adjustment window definition


As a measure of the difference between two feature vectors ai and bj, a distance

d(c) = d(i, j) = || ai − bj ||    Equation 6.4

is employed. Then, the weighted summation of distances along warping function F becomes

E(F) = Σ_{k=1}^{K} d(c(k)) · w(k)    Equation 6.5

where w(k) is a nonnegative weighting coefficient, intentionally introduced to give the E(F) measure a flexible characteristic. E(F) is a reasonable measure of the goodness of

    warping function F. It attains its minimum value when warping function F is determined so as to

    optimally adjust the timing difference. This minimum residual distance value can be considered

to be a distance between patterns A and B, remaining still after eliminating the timing differences

    between them, and is naturally expected to be stable against time-axis fluctuation. Based on these

considerations, the time-normalized distance between two speech patterns A and B is defined as

    follows:

D(A, B) = min_F [ Σ_{k=1}^{K} d(c(k)) · w(k) / Σ_{k=1}^{K} w(k) ]    Equation 6.6

The denominator is employed to compensate for the effect of K (the number of points on the warping function F). The above equation is no more than a fundamental definition of time-

    normalized distance. Effective characteristics of this measure greatly depend on the warping

function specification and the weighting coefficient definition. Desirable characteristics of the

    time-normalized distance measure will vary, according to speech pattern properties (especially

    time axis expression of speech pattern) to be dealt with. Therefore, the present problem is

    restricted to the most general case where the following two conditions hold:

    Condition 1: Speech patterns are time-sampled with a common and constant sampling period.

Condition 2: We have no a priori knowledge about which parts of a speech pattern contain linguistically important information. In this case, it is reasonable to consider each part of a speech pattern to contain an equal amount of linguistic information.
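To make Equation 6.4 concrete, a minimal sketch computing the local distance between every pair of feature vectors is shown below; the feature sequences A (dim x I) and B (dim x J) are assumed to be stored column-wise, and the Euclidean norm is assumed as the vector distance.

    % Local distance matrix (sketch): d(i, j) = || a_i - b_j ||,
    % as in Equation 6.4, for column-wise feature sequences A and B.
    I = size(A, 2); J = size(B, 2);
    d = zeros(I, J);
    for i = 1:I
        for j = 1:J
            d(i, j) = norm(A(:, i) - B(:, j));
        end
    end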


    6.1.2 Restrictions on Warping Function

Warping function F is a model of time-axis fluctuation in a speech pattern. Accordingly, it should approximate the properties of actual time-axis fluctuation. In other words, function F, when viewed as a mapping from the time axis of pattern A onto that of pattern B, must preserve linguistically essential structures in the pattern A time axis, and vice versa. Essential speech pattern

    time-axis structures are continuity, monotonicity (or restriction of relative timing in a speech),

limitation on the acoustic parameter transition speed in a speech, and so on. These conditions can be realized as the following restrictions on warping function F, or on the points c(k) = (i(k), j(k)):

1) Monotonic conditions: i(k−1) ≤ i(k) and j(k−1) ≤ j(k).    Equation 6.7

2) Continuity conditions: i(k) − i(k−1) ≤ 1 and j(k) − j(k−1) ≤ 1.    Equation 6.8

    As a result of these two restrictions, the following relation holds between two consecutive points

c(k−1) = (i(k), j(k)−1), (i(k)−1, j(k)−1), or (i(k)−1, j(k)).    Equation 6.9

3) Boundary conditions: i(1) = 1, j(1) = 1, and i(K) = I, j(K) = J.    Equation 6.10

4) Adjustment window condition:

|i(k) − j(k)| ≤ r    Equation 6.11

where r is an appropriate positive integer called the window length. This condition corresponds to the fact that time-axis fluctuation in usual cases never causes an excessively large timing difference.

5) Slope constraint condition: Neither too steep nor too gentle a gradient should be allowed for warping function F,

    because such deviations may cause undesirable time-axis warping. Too steep a gradient, for

example, causes an unrealistic correspondence between a very short pattern A segment and a relatively long pattern B segment. Then, a case can occur where a short segment in a consonant

    or phoneme transition part happens to be in good coincidence with an entire steady vowel part.

    Therefore, a restriction called a slope constraint condition was set upon the warping function F,

    so that its first derivative is of discrete form. The slope constraint condition is realized as a

    restriction on the possible relation among (or the possible configuration of) several consecutive


    points on the warping function, as is shown in Fig. 6.2(a) and (b). To put it concretely, if point c

(k) moves forward in the direction of the i (or j)-axis m consecutive times, then point c (k) is not

    allowed to step further in the same direction before stepping at least n times in the diagonal

    direction. The effective intensity of the slope constraint can be evaluated by the following

    measure P = n/m.

    Fig. 6.2 Slope constraint on warping function

    The larger the P measure, the more rigidly the warping function slope is restricted. When

P = 0, there are no restrictions on the warping function slope. When P = ∞ (that is, m = 0), the warping function is restricted to the diagonal line j = i. Nothing more occurs than conventional


pattern matching with no time normalization. Generally speaking, if the slope constraint is too severe,

    then time-normalization would not work effectively. If the slope constraint is too lax, then

    discrimination between speech patterns in different categories is degraded. Thus, setting neither a

too large nor too small a value for P is desirable. The results of an investigation on an optimum compromise for the P value through several experiments are reported in [3].

    In Fig. 6.2(c) and (d), two examples of permissible point c (k) paths under slope

constraint condition P = 1 are shown. The Fig. 6.2(c) type is directly derived from the above definition, while Fig. 6.2(d) is an approximated type with one additional constraint: the second derivative of warping function F is restricted, so that the point c(k) path does not orthogonally change its direction. This new constraint reduces the number of paths to be searched. Therefore, the simpler Fig. 6.2(d) type is adopted afterward, except for the P = 0 case.

    6.1.3 Discussions on Weighting Coefficient

Since the criterion function in Equation 6.6 is a rational expression, its minimization is an unwieldy problem. If the denominator in Equation 6.6,

N = Σ_{k=1}^{K} w(k)    Equation 6.12

(called the normalization coefficient), is independent of warping function F, it can be put outside the minimization, simplifying the equation as follows:

D(A, B) = (1/N) · min_F [ Σ_{k=1}^{K} d(c(k)) · w(k) ]    Equation 6.13

This simplified problem can be effectively solved by use of the dynamic programming technique.

In the symmetric form,

w(k) = [i(k) − i(k−1)] + [j(k) − j(k−1)]    Equation 6.14

and then N = I + J, where I and J are the lengths of speech patterns A and B, respectively.

If it is assumed that time axes i and j are both continuous, then, in the symmetric form, the summation in Equation 6.6 means an integration along the temporarily defined axis l = i + j. As a result, the time-normalized distance is symmetric, i.e., D(A, B) = D(B, A), in the symmetric form. Another, more important result, caused by the difference in the integration axis, is that, as shown in Fig. 6.3, the weighting coefficient w(k) reduces to zero in the asymmetric form when the point on the warping function steps in the direction of the j-axis, i.e., c(k) = c(k−1) + (0, 1). This means that some feature vectors bj are possibly excluded from the integration in the asymmetric


    form. On the contrary, in the case of symmetric form, minimum w (k) value is equal to 1, and no

    exclusion occurs. Since discussions here are based on the assumption that each part in a speech

    pattern should be treated equally, an exclusion of any feature vectors from integration should be

    avoided as long as possible. It can be expected, therefore, that the symmetric form will give

    better recognition accuracy than the asymmetric form. However, it should be noted that the slope

    constraint reduces the situation where the point in warping function steps in the j-axis direction.

    The difference in performance between the symmetric one and asymmetric one will gradually

    vanish as the slope constraint is intensified.

Fig. 6.3 Weighting coefficient w(k)

    6.2 Practical DP-Matching Algorithm

    6.2.1 DP-Equation

    A simplified definition of time-normalized distance D (A, B) given above is one of the

typical problems to which the well-known DP principle can be applied. The basic

    algorithm for calculating Equation 6.13 is written as follows.

    Initial condition:

g1(c(1)) = d(c(1)) · w(1)    Equation 6.15

DP-equation:

g_k(c(k)) = min over c(k−1) [ g_{k−1}(c(k−1)) + d(c(k)) · w(k) ]    Equation 6.16

Time-normalized distance:

D(A, B) = (1/N) · g_K(c(K))    Equation 6.17


    It is implicitly assumed here that c (0) = (0, 0). Accordingly, w (1) = 2 in the symmetric

    form, and w (1) = 1 in the asymmetric form. By realizing the restriction on the warping function

    described in Section 6.1.2 and substituting Equation 6.14 for weighting coefficient w (k) in

Equation 6.16, several practical algorithms can be derived. As one of the simplest examples, the algorithm for the symmetric form, in which no slope constraint is employed (that is, P = 0), is shown

    here.

    Initial condition:

g(1, 1) = 2 · d(1, 1)    Equation 6.18

DP-equation:

g(i, j) = min [ g(i, j−1) + d(i, j),  g(i−1, j−1) + 2 · d(i, j),  g(i−1, j) + d(i, j) ]    Equation 6.19

Restricting condition (adjustment window):

j − r ≤ i ≤ j + r    Equation 6.20

Time-normalized distance:

D(A, B) = (1/N) · g(I, J)    Equation 6.21

where N = I + J.
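A minimal MATLAB sketch of this symmetric, P = 0 algorithm with the adjustment window of Equation 6.20 might look as follows; d is the local distance matrix from the sketch in Section 6.1.1, and the function name dtw_symmetric is illustrative.

    % Symmetric DP-matching without slope constraint (P = 0), using an
    % adjustment window of width r (Equations 6.18 - 6.21). Cells outside
    % the window stay Inf and are never selected.
    function D = dtw_symmetric(d, r)
        [I, J] = size(d);
        g = inf(I, J);
        g(1, 1) = 2*d(1, 1);                    % Equation 6.18
        for i = 1:I
            for j = max(1, i-r):min(J, i+r)     % Equation 6.20
                if i == 1 && j == 1, continue; end
                best = inf;
                if j > 1,          best = min(best, g(i, j-1)   +   d(i, j)); end
                if i > 1 && j > 1, best = min(best, g(i-1, j-1) + 2*d(i, j)); end
                if i > 1,          best = min(best, g(i-1, j)   +   d(i, j)); end
                g(i, j) = best;                 % Equation 6.19
            end
        end
        D = g(I, J)/(I + J);                    % Equation 6.21, N = I + J
    end

The template with the smallest D against the input utterance is reported as the recognized word.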

    The algorithm, especially the DP-equation, should be modified when the asymmetric

form is adopted or some slope constraint is employed. In Table I of [3], algorithms are summarized for

    both symmetric and asymmetric forms, with various slope constraint conditions. In this table,

    DP-equations for asymmetric forms are shown in some improved form. The first expression in

the bracket of the asymmetric-form DP-equation for P = 1 (that is, g(i−1, j−2) + [d(i, j−1) + d(i, j)]/2) corresponds to the case where c(k−1) = (i(k), j(k)−1) and c(k−2) = (i(k−1)−1, j(k−1)−1). Accordingly, if the definition in Equation 6.14 is strictly obeyed, w(k) is equal to zero while w(k−1) is equal to 1, thus completely omitting d(c(k)) from the summation. In order to avoid this situation to a certain extent, the weighting coefficient w(k−1) = 1 is divided between the two weighting coefficients w(k−1) and w(k). Thus, (d(i, j−1) + d(i, j))/2 is substituted for d(i, j−1) + 0 · d(i, j) in this expression. Similar modifications are applied to the other asymmetric-form DP-

    equations. In fact, it has been established, by a preliminary experiment, that this modification

    significantly improves the asymmetric form performance [12].


    6.2.2 Calculation Details

The DP-equation, i.e., g(i, j), must be recurrently calculated in ascending order with respect to coordinates i and j, starting from the initial condition at (1, 1) up to (I, J). The domain in which the DP-equation must be calculated is specified by

1 ≤ i ≤ I, 1 ≤ j ≤ J    Equation 6.22

and the adjustment window

j − r ≤ i ≤ j + r.    Equation 6.23

    The optimum DP-algorithm, applied to speech recognition, was investigated. Symmetric

    form was proposed along with slope constraint technique. These varieties were then compared

    through theoretical and experimental investigations.

    Conclusions are as follows: Slope constraint is actually effective. Optimum performance is

    attained when the slope constraint condition is P = 1. The validity of these results was ensured by

    a good agreement between theoretical discussions and experimental results. The optimized

    algorithm was then experimentally compared with several other DP-algorithms applied to spoken

word recognition by different research groups, and the superiority of the algorithm described above was established [3].


    7. FPGA Implementation

The AccelDSP Synthesis Tool is a product that allows a MATLAB floating-point design to be transformed into a hardware module that can be implemented in a Xilinx FPGA. The AccelDSP

    Synthesis Tool features an easy-to-use Graphical User Interface that controls an integrated

    environment with other design tools such as MATLAB, Xilinx ISE tools, and other industry-

    standard HDL simulators and logic synthesizers.

    AccelDSP Synthesis is done with the following implementation procedure:

a) Reading and analyzing a MATLAB floating-point design.

b) Automatically creating an equivalent MATLAB fixed-point design.

c) Invoking a MATLAB simulation to verify the fixed-point design.

d) Providing the power to quickly explore design trade-offs of algorithms optimized for the target FPGA architectures.

e) Creating a synthesizable RTL HDL model and a test bench to ensure bit-true, cycle-accurate design verification.

f) Providing scripts that invoke and control downstream tools such as HDL simulators, RTL logic synthesizers, and the Xilinx ISE implementation tools.


    The Synthesis flow in AccelDSP ISE can be observed from the following flow chart:

    Fig. 7.1 Synthesis flow in AccelDSP


    8. SIMULATION & RESULTS

This chapter presents the experimental results obtained from the proposed approach, namely wavelet analysis and Dynamic Time Warping, applied to isolated-word speech recognition. The effectiveness of the algorithms is measured through analysis of the results.

8.1 Input Signal

1) Input speech signal for the word Speech:

Fig. 8.1 Input speech signal

The input speech signal, with a duration of 5 seconds and a sampling frequency of 8 kHz, is shown above.


    8.2 Pre emphasis:

    Pre emphasis output for Speech:

    Fig. 8.2 Pre emphasis output

The output obtained after passing the input Speech signal through the pre-emphasis (first-order high-pass) filter is shown above. The output has significantly better spectral flatness compared with the input.
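As a sketch, this filtering step can be written in MATLAB as follows; the coefficient value 0.95 is an assumption (typical pre-emphasis coefficients lie between 0.9 and 1).

    % Pre-emphasis (sketch): first-order high-pass filter
    % y(n) = s(n) - a*s(n-1); a = 0.95 is assumed here.
    a = 0.95;
    y = filter([1 -a], 1, s);   % flattens the spectral tilt of speech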


    8.3 Voice Activation & Detection

    1) Voice Activation and Detection for Speech:

    Fig. 8.3 Voice Activation & Detection

The above plot shows the voice-activated region for the word Speech. The output is 1 for the voiced region and 0 for the unvoiced and silence regions. Hence, out of the total samples, only the voice-activated samples are passed on for further processing.


2) Speech signal after Voice Activation & Detection:

    Fig. 8.4 Speech signal after Voice Activation & Detection

After obtaining the Voice Activation & Detection output, the regions for which VAD = 1 are extracted for further analysis.


    8.4 De-noising:

    De-noising for Speech:

    Fig. 8.5 Speech signal after de-noising

The final de-noised signal is obtained after spectral subtraction; the noise components present in the signal are reduced.
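A minimal sketch of one common spectral-subtraction formulation is shown below; the non-overlapping frames, the noise estimate taken from the leading frames, and the clipping of negative magnitudes are assumptions about the variant, not necessarily the exact one used here.

    % Spectral subtraction (sketch): estimate the noise magnitude from
    % the leading noiseFrames frames, subtract it from every frame's
    % magnitude spectrum, and resynthesize with the noisy phase.
    function y = spectral_subtract(x, K, noiseFrames)
        M   = floor(length(x)/K);
        X   = fft(reshape(x(1:M*K), K, M));        % frame-wise FFT (columns)
        N   = mean(abs(X(:, 1:noiseFrames)), 2);   % noise magnitude estimate
        mag = max(abs(X) - repmat(N, 1, M), 0);    % subtract, clip negatives
        y   = real(ifft(mag .* exp(1i*angle(X)))); % keep the noisy phase
        y   = y(:);                                % back to a single column
    end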


    8.5 Recognition Results:

This section provides the experimental results for recognizing the isolated words. In the experiment, the database consists of 10 different words with 25 utterances of each word. The recognition rate is calculated as shown in Equation 8.1 below:

Recognition rate (%) = (number of correct recognitions / total number of utterances) · 100    Equation 8.1

For example, a word recognized correctly 24 times out of 25 gives (24/25) · 100 = 96%.

a) The recognition rates for each word using the Daubechies-8 wavelet with level-4 DWT decomposition for English words are shown in the following table:

Word to be recognized    Number of times the word is correctly recognized    Recognition rate (%)

    Matrix 24 96

    Paste 24 96

    Project 18 72

    Speech 18 72

    Window 24 96

    Distance 20 80

    India 24 96

    Ubuntu 19 76

    Fedora 25 100

    Android 24 96

    Table 8.1: Recognition rates for English words using db8 & level 4 DWT.

The overall recognition rate for English words using the Daubechies-8 wavelet at level 4 is 88%, which can be verified directly from the counts in Table 8.1 as shown below.
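As a quick check of Equation 8.1, the per-word counts of Table 8.1 reproduce the overall rate:

    % Overall recognition rate from Table 8.1 (25 utterances per word).
    correct = [24 24 18 18 24 20 24 19 25 24];
    rate = sum(correct)/(25*numel(correct))*100;   % gives 88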


b) The recognition rates for each word using the Daubechies-8 wavelet with level-7 DWT decomposition for English words are shown in the following table:

Word to be recognized    Number of times the word is correctly recognized    Recognition rate (%)

    Matrix 24 96

    Paste 23 92

    Project 21 84

    Speech 23 92

    Window 24 96

    Distance 22 88

    India 25 100

    Ubuntu 21 84

    Fedora 25 100

    Android 25 100

    Table 8.2: Recognition rates for English words using db8 & level 7 DWT.

The overall recognition rate for English words using the Daubechies-8 wavelet at level 7 is 93.2%.

8.6 FPGA Implementation

The AccelDSP synthesis tool is used to transform the MATLAB design into a hardware module that can be implemented in a Xilinx FPGA.

Fig. 8.6 shows the MATLAB result for the recognized word FEDORA.

Fig. 8.7 shows the FPGA implementation result for the recognized word FEDORA, obtained with the AccelDSP tool in the Xilinx ISE platform.


Fig. 8.6 MATLAB output of speech recognition for the word FEDORA.


Fig. 8.7 FPGA implementation results for the word FEDORA.


    REFERENCES

[1] Trivedi, Saurabh, Sachin and Raman, "Speech Recognition by Wavelet Analysis", International Journal of Computer Applications (0975-8887), Vol. 15, No. 8, February 2011.

[2] Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition".

[3] Hiroaki Sakoe and Seibi Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-26, No. 1, February 1978.

[4] Ingrid Daubechies, "Ten Lectures on Wavelets", SIAM, Philadelphia, 1992.

[5] Ian McLoughlin, "Audio Processing with Matlab Examples".

[6] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets", Communications on Pure and Applied Mathematics, Vol. 41, pp. 909-996, November 1988.

[7] Murali Krishnan, Chris P. Neophytou and Glenn Prescott, "Wavelet Transform Speech Recognition using Vector Quantization, Dynamic Time Warping and Artificial Neural Networks".

[8] George Tzanetakis, Georg Essl and Perry Cook, "Audio Analysis using the Discrete Wavelet Transform", Organised Sound, Vol. 4(3), 2000.

[9] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.

[10] Michael Nilsson and Marcus Ejnarsson, "Speech Recognition using Hidden Markov Model".

[11] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, pp. 674-693, 1989.

[12] Sylvio Barbon Junior, Rodrigo Capobianco Guido, Shi-Huang Chen, Lucimar Sasso Vieira and Fabricio Lopes Sanchez, "Improved Dynamic Time Warping Based on the Discrete Wavelet Transform", Ninth IEEE International Symposium on Multimedia, 2007.

[13] M. Misiti, Y. Misiti, G. Oppenheim and J. Poggi, "Matlab Wavelet Toolbox", The MathWorks, Inc., 2000.

[14] George Tzanetakis, Georg Essl and Perry Cook, "Audio Analysis using the Discrete Wavelet Transform", Organised Sound, Vol. 4(3), 2000.


[15] Mike Brookes, "VOICEBOX: Speech Processing Toolbox for MATLAB", Department of Electrical & Electronic Engineering, Imperial College, London SW7 2BT, UK.

[16] Daryl Ning, "Developing an Isolated Word Recognition System in MATLAB", MATLAB Digest, January 2010.