Speech Recognition Using Wavelets


    Abstract

Although speech recognition systems have come a long way in the last forty years, there is still room for improvement. Though readily available, these systems are sometimes inaccurate and insufficient. In an effort to provide a more efficient representation of the speech signal, the application of wavelet analysis is considered. Here we present an effective and robust method for extracting features for speech processing. Based on the time-frequency multi-resolution property of the wavelet transform, the input speech signal is decomposed into various frequency channels, and the original speech can then be recognized using the wavelet transform. The major issues in the design of this wavelet-based speech recognition system are choosing optimal wavelets for speech signals, selecting the decomposition level in the DWT, and selecting the feature vectors from the wavelet coefficients.

Dynamic Time Warping (DTW) is a pattern-matching approach that can be used for limited-vocabulary speech recognition; it is based on a temporal alignment of the input signal with the template models. The main drawback of this method is its high computational cost as the length of the signals increases. The main aim of this project is to provide a modified version of DTW, based on the Discrete Wavelet Transform (DWT), which reduces its original complexity. Daubechies wavelet decompositions at level 4 and level 7 are evaluated, and the corresponding results are reported.

The proposed approaches are implemented in software and also on an FPGA.


4. WAVELET ANALYSIS
4.1 Definition
4.2 Fourier Analysis
4.2.1 Limitations
4.3 Short-Time Fourier Analysis
4.3.1 Limitations
4.4 Types of Wavelets
4.4.1 Haar Wavelet
4.4.2 Daubechies-N wavelet family
4.4.3 Advantages of Wavelet analysis over STFT
4.5 Wavelet Transform
4.5.1 Discrete Wavelet Transform
4.5.2 Multilevel Decomposition of Signal
4.5.3 Wavelet Reconstruction
5. FROM SPEECH TO FEATURE VECTORS
5.1 Preprocessing
5.1.1 Pre-emphasis
5.1.2 Voice Activation Detection (VAD)
5.2 Frame blocking & Windowing
5.2.1 Frame blocking
5.2.2 Windowing
5.3 Feature Extraction
6. DYNAMIC TIME WARPING
6.1 DTW Algorithm
6.1.1 DP-Matching Principle
6.1.2 Restrictions on Warping Function
6.1.3 Discussions on Weighting Coefficient
6.2 Practical DP-Matching Algorithm
6.2.1 DP-Equation
6.2.2 Calculation Details
7. FPGA Implementation


8. SIMULATION & RESULTS
8.1 Input Signal
8.2 Pre-emphasis
8.3 Voice Activation & Detection
8.4 De-noising
8.5 Recognition Results
8.6 FPGA Implementation
9. CONCLUSION
REFERENCES


    List of Tables

Table 8.1: Recognition rates for English words using db8 & level 4 DWT.
Table 8.2: Recognition rates for English words using db8 & level 7 DWT.


    List of Figures

Fig. 2.1 Literature survey
Fig. 3.1 Schematic diagram of the speech production/perception process
Fig. 3.2 Human Vocal Mechanism
Fig. 3.3 Discrete-Time Speech Production Model
Fig. 3.4 Three state representation of a speech signal
Fig. 3.5 Spectrogram using Welch's Method
Fig. 4.1 Fourier transform
Fig. 4.2 Short time Fourier transform
Fig. 4.3 Haar wavelet
Fig. 4.5 Daubechies wavelets
Fig. 4.6 Comparison of Wavelet analysis over STFT
Fig. 4.7 Filter functions
Fig. 4.8 Decomposition of DWT Co-efficients
Fig. 4.9 Decomposition using DWT
Fig. 4.10 Signal Reconstruction
Fig. 4.11 Signal Decomposition & Reconstruction
Fig. 5.1 Main steps in Feature Extraction
Fig. 5.2 Pre-processing
Fig. 5.3 Pre-emphasis filter
Fig. 5.4 Frame blocking & Windowing
Fig. 5.5 Frame blocking of a sequence
Fig. 5.6 Hamming Window
Fig. 6.1 Warping function & adjusting window definition
Fig. 6.2 Slope constraint on warping function
Fig. 6.3 Weighting coefficient W(k)
Fig. 7.1 Synthesis flow in AccelDSP
Fig. 8.1 Input speech signal
Fig. 8.2 Pre-emphasis output
Fig. 8.3 Voice Activation & Detection


Fig. 8.4 Speech signal after Voice Activation & Detection
Fig. 8.5 Speech signal after de-noising
Fig. 8.6 Matlab output of Speech Recognition for the word "FEDORA"
Fig. 8.7 FPGA results for the word "FEDORA"


    1. INTRODUCTION

    1.1 Definition

Speech recognition is the process of automatically extracting and determining the linguistic information conveyed by a speech signal using computers or electronic circuits. Recent advances in soft computing techniques give more importance to automatic speech recognition. Large variations in speech signals, along with other criteria such as native accent and varying pronunciation, make the task very difficult. Automatic speech recognition (ASR) is hence a complex task, and it requires considerable intelligence to achieve a good recognition result. Speech recognition is a topic that is very useful in many applications and environments in our daily life.

The fundamental purpose of speech is communication, i.e., the transmission of messages. According to Shannon's information theory, a message represented as a sequence of discrete symbols can be quantified by its information content in bits, and the rate of transmission of information is measured in bits per second (bps).

    In order for communication to take place, a speaker must produce a speech signal in the

    form of a sound pressure wave that travels from the speaker's mouth to a listener's ears. Although

    the majority of the pressure wave originates from the mouth, sound also emanates from the

    nostrils, throat, and cheeks. Speech signals are composed of a sequence of sounds that serve as a

    symbolic representation for a thought that the speaker wishes to relay to the listener. The

    arrangement of these sounds is governed by rules associated with a language. The scientific

    study of language and the manner in which these rules are used in human communication is

    referred to as linguistics. The science that studies the characteristics of human sound production,

    especially for the description, classification, and transcription of speech, is called phonetics.

    1.2 Application area, Features & Issues

A different aspect of speech recognition is assisting people with functional disabilities or other kinds of handicap. To make their daily chores easier, voice control could be helpful: with their voice they could operate the light switch, turn the coffee machine on or off, or operate other domestic appliances. This leads to the discussion of intelligent homes, where these operations can be made available for the common man as well as for the handicapped.


1.2.1 Features

- Speech input is easy to perform because it does not require a specialized skill, as typing or pushbutton operation does.
- Information can be input even when the user is moving or doing other activities involving the hands, legs, eyes, or ears.
- Since a microphone or telephone can be used as an input terminal, inputting information is economical, with remote input possible over existing telephone networks and the Internet.

1.2.2 Issues

- A lot of redundancy is present in the speech signal, which makes discriminating between the classes difficult.
- Presence of temporal and frequency variability, such as intra-speaker variability in the pronunciation of words and phonemes as well as inter-speaker variability, e.g. the effect of regional dialects.
- Context-dependent pronunciation of the phonemes (co-articulation).
- Signal degradation due to additive and convolutive noise present in the background or in the channel.
- Signal distortion due to non-ideal channel characteristics.

1.3 Recognition Systems

Recognition systems may be designed in many modes to achieve specific objectives or performance criteria.

    1.3.1 Speaker Dependent / Independent System

For speaker-dependent systems, the user is asked to utter predefined words or sentences. These acoustic signals form the training data, which are used for recognition of the input speech. Since these systems are used by only a predefined speaker, their performance is higher than that of speaker-independent systems.


    1.3.2 Isolated Word Recognition

This is also called a discrete recognition system. In this system there has to be a pause between uttered words, so the system does not have to find boundaries between words.

    1.3.3 Continuous Speech Recognition

These systems are the ultimate goal of a recognition process. No matter how or when a word is uttered, it is recognized in real time and an action is performed accordingly. Changes in speaking rate, careless pronunciation, detection of word boundaries, and real-time constraints are the main problems for this recognition mode.

    1.3.4 Vocabulary Size

The smaller the vocabulary of a recognition system, the higher the recognition performance. Specific tasks may use small vocabularies. However, a natural system would be speaker-independent continuous recognition over a large vocabulary, which is the most difficult case.

    1.3.5 Keyword Spotting

These systems are used to detect a word in continuous speech. For this reason they can be as accurate as isolated-word recognition while also having the capability to handle continuous speech.

Speech word recognition systems commonly carry out some kind of classification based on speech features, which are usually obtained via Fourier Transforms (FTs), Short-Time Fourier Transforms (STFTs), or Linear Predictive Coding techniques. However, these methods have some disadvantages: they assume the signal is stationary within a given time frame and may therefore lack the ability to analyze localized events correctly. The wavelet transform copes with some of these problems. Other factors influencing the selection of Wavelet Transforms (WT) over conventional methods include their ability to capture localized features. In this work the Discrete Wavelet Transform is used for speech processing.

The speech recognizer implemented in Matlab was used in simulation, as if a speech recognizer were operating in a real environment. Simulation recordings were taken in an open environment to obtain realistic data.


In the future it could be possible to use this information to create a chip that could serve as a new human interface. For example, it would be desirable to get rid of all remote controls in the home and simply tell the television, stereo, or any other device what to do by voice.

    1.4 Objectives

This project covers speaker-independent, small-vocabulary speech recognition with the help of wavelet analysis using the Dynamic Time Warping method. The project is composed of two phases:

1) Training phase: a number of words are trained to extract a model for each word.

2) Recognition phase: a sequence of connected words is entered via microphone or an input file, and the system tries to recognize these words.

1.5 Outline

The outline of this thesis is as follows.

Chapter 2 - Literature Survey:

This chapter discusses trends and technologies that have been followed to improve speech recognition performance.

Chapter 3 - The Speech Signal:

This chapter discusses how the production and perception of speech is performed. Topics related to this chapter are speech production, speech representation, characteristics of the speech signal, and perception.

Chapter 4 - Wavelet Analysis:

This chapter discusses what a wavelet is, what types of wavelets are available, which types are used, why wavelets were introduced, and wavelet decomposition. Some topics related to this chapter are Fourier analysis, the STFT, types of wavelets, and the wavelet transform.


Chapter 5 - From Speech to Feature Vectors:

This chapter covers the fundamental signal processing applied in a speech recognizer. Some topics related to this chapter are pre-processing, frame blocking and windowing, and feature extraction.

Chapter 6 - Dynamic Time Warping:

Aspects of this chapter are the theory and implementation of the pattern-matching technique referred to as Dynamic Time Warping. Some topics related to this chapter are the DTW algorithm and the DP-matching algorithm.

Chapter 7 - FPGA Implementation:

This chapter describes the FPGA implementation of the speech recognition system using the AccelDSP tool in Xilinx ISE.

Chapter 8 - Simulation & Results:

In this chapter the speech recognizer implemented in Matlab is used to test the recognizer in different cases and find its efficiency.

Chapter 9 - Conclusions:

This chapter summarizes the whole project.


    2. LITERATURE SURVEY

    Designing a machine that mimics human behavior, particularly the capability of speaking

    naturally and responding properly to spoken language, has intrigued engineers and scientists for

    centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a system model

    for speech analysis and synthesis, the problem of automatic speech recognition has been

    approached progressively, from a simple machine that responds to a small set of sounds to a

    sophisticated system that responds to fluently spoken natural language and takes into account the

    varying statistics of the language in which the speech is produced. Based on major advances in

    statistical modeling of speech in the 1980s, automatic speech recognition systems today find

    widespread application in tasks that require a human-machine interface, such as automatic call

    processing in the telephone network and query-based information systems that do things like

    provide updated travel information, stock price quotations, weather reports, etc.

    Speech is the primary means of communication between people. For reasons ranging

    from technological curiosity about the mechanisms for mechanical realization of human speech

    capabilities, to the desire to automate simple tasks inherently requiring human-machine

    interactions, research in automatic speech recognition (and speech synthesis) by machine has

    attracted a great deal of attention over the past five decades.

    2.1 Advancement in technology

    Fig. 2.1 shows a timeline of progress in speech recognition and understanding technology

    over the past several decades. We see that in the 1960s we were able to recognize small

    vocabularies (order of 10-100 words) of isolated words, based on simple acoustic-phonetic

    properties of speech sounds. The key technologies that were developed during this time frame

    were filter-bank analyses, simple time normalization methods, and the beginnings of

sophisticated dynamic programming methodologies. In the 1970s we were able to recognize medium vocabularies (order of 100-1000 words) using simple template-based, pattern

    recognition methods [3]. The key technologies that were developed during this period were the

    pattern recognition models, the introduction of LPC methods for spectral representation, the

    pattern clustering methods for speaker-independent recognizers, and the introduction of dynamic

    programming methods for solving connected word recognition problems. In the 1980s we


    started to tackle large vocabulary (1000-unlimited number of words) speech recognition

    problems based on statistical methods, with a wide range of networks for handling language

    structures. The key technologies introduced during this period were the hidden Markov model

    (HMM) [9] and the stochastic language model, which together enabled powerful new methods

    for handling virtually any continuous speech recognition problem efficiently and with high

    performance. In the 1990s we were able to build large vocabulary systems with unconstrained

    language models, and constrained task syntax models for continuous speech recognition and

    understanding. The key technologies developed during this period were the methods for

    stochastic language understanding, statistical learning of acoustic and language models, and the

introduction of the finite state transducer framework (and the FSM Library) and the methods for their determinization and minimization for efficient implementation of large vocabulary speech

    understanding systems.

    Fig. 2.1 Literature survey


    Finally, in the last few years, we have seen the introduction of very large vocabulary

    systems with full semantic models, integrated with text-to-speech (TTS) synthesis systems, and

    multi-modal inputs (pointing, keyboards, mice, etc.). These systems enable spoken dialog

    systems with a range of input and output modalities for ease-of-use and flexibility in handling

    adverse environments where speech might not be as suitable as other input-output modalities.

    During this period we have seen the emergence of highly natural speech synthesis systems, the

    use of machine learning to improve both speech understanding and speech dialogs, and the

    introduction of mixed-initiative dialog systems to enable user control when necessary.

    After nearly five decades of research, speech recognition technologies have finally

    entered the marketplace, benefiting the users in a variety of ways. Throughout the course of

    development of such systems, knowledge of speech production and perception was used in

    establishing the technological foundation for the resulting speech recognizers. Major advances,

    however, were brought about in the 1960s and 1970s via the introduction of advanced speech

    representations based on LPC analysis and cepstral analysis methods, and in the 1980s through

    the introduction of rigorous statistical methods based on hidden Markov models [9]. All of this

    came about because of significant research contributions from academia, private industry and the

    government. As the technology continues to mature, it is clear that many new applications will

emerge and become part of our way of life, thereby taking full advantage of machines that are

    partially able to mimic human speech capabilities.


    3. THE SPEECH SIGNAL

    This chapter intends to discuss how the speech signal is produced and perceived by

    human beings. This is an essential subject that has to be considered before one can pursue and

    decide which approach to use for speech recognition.

    3.1 Speech production

    Human communication is to be seen as a comprehensive diagram of the process from

    speech production to speech perception between the talker and listener as in Fig. 3.1 [2].

    Fig. 3.1 Schematic diagram of the speech production/perception process

Five different elements (A. Speech formulation, B. Human vocal mechanism, C. Acoustic air, D. Perception of the ear, E. Speech comprehension) will be examined more carefully in the following sections.

The first element (A. Speech formulation) is associated with the formulation of the speech signal in the talker's mind. This formulation is used by the human vocal mechanism (B.

    Human vocal mechanism) to produce the actual speech waveform. The waveform is transferred

    via the air (C. Acoustic air) to the listener. During this transfer the acoustic wave can be affected

    by external sources, for example noise, resulting in a more complex waveform. When the wave


reaches the listener's hearing system (the ears), the listener perceives the waveform (D. Perception of the ear) and the listener's mind (E. Speech comprehension) starts processing this waveform to comprehend its content, so that the listener understands what the talker is trying to tell him or her.

    Fig. 3.2 Human Vocal Mechanism

To understand how the production of speech is performed, one needs to know how the human vocal mechanism is constructed; see Fig. 3.2.


The most important parts of the human vocal mechanism are the vocal tract together with the nasal cavity, which begins at the velum. The velum is a trapdoor-like mechanism that is used to form nasal sounds when needed. When the velum is lowered, the nasal cavity is coupled to the vocal tract to form the desired speech signal. The cross-sectional area of the vocal tract is limited by the tongue, lips, jaw, and velum and varies from 0 to 20 cm².

When humans produce speech, air is expelled from the lungs through the trachea. The air flowing from the lungs causes the vocal cords to vibrate, and by shaping the vocal tract, lips, tongue, and jaw, and possibly using the nasal cavity, different sounds can be produced.

Important parts of the discrete-time speech production model, in the field of speech recognition and signal processing, are u(n), the gain b0, and H(z). The impulse generator acts like the lungs, exciting the glottal filter G(z), resulting in u(n). G(z) can be regarded as the vocal cords in the human vocal mechanism. The signal u(n) can be seen as the excitation signal entering the vocal tract and the nasal cavity; it is formed by exciting the vocal cords with air from the lungs.

    Fig. 3.3 Discrete-Time Speech Production Model

The gain b0 is a factor related to the volume of the speech being produced: a larger gain b0 gives louder speech and vice versa. The vocal tract filter H(z) is a model of the vocal tract and the nasal cavity. The lip radiation filter R(z) is a model of the formation of the human lips to produce different sounds.


    3.2 Speech Representation

    The speech signal and all its characteristics can be represented in two different domains,

    the time and the frequency domain.

    A speech signal is a slowly time varying signal in the sense that, when examined over a

    short period of time (between 5 and 100 ms), its characteristics are short-time stationary. This is

    not the case if we look at a speech signal under a longer time perspective (approximately time

T>0.5 s). In this case the signal's characteristics are non-stationary, meaning that they change to

    reflect the different sounds spoken by the talker.

To be able to use a speech signal and interpret its characteristics in a proper manner, some kind of representation of the speech signal is preferred. The speech representation can exist in either the time or frequency domain, and in three different ways: a three-state representation, a spectral representation, and a parameterization of the spectral activity.

3.2.1 Three-state Representation

The three-state representation is one way to classify events in speech. The events of interest for the three-state representation are:

- Silence (S): no speech is produced.
- Unvoiced (U): vocal cords are not vibrating, resulting in an aperiodic or random speech waveform.
- Voiced (V): vocal cords are tensed and vibrating periodically, resulting in a speech waveform that is quasi-periodic.

Quasi-periodic means that the speech waveform can be seen as periodic over a short-time period (5-100 ms) during which it is stationary.


    Fig. 3.4 Three state representation of a speech signal.

The upper plot, Fig. 3.4(a), contains the whole speech sequence, and the middle plot, Fig. 3.4(b), reproduces a zoomed-in area of it. At the bottom of Fig. 3.4 the segmentation into the three-state representation, in relation to the different parts of the middle plot, is given.


    3.2.2 Spectral Representation

Spectral representation of speech intensity over time is very popular, and the most popular form is the sound spectrogram; see Fig. 3.5.

Fig. 3.5 Spectrogram using Welch's Method

Here the darkest (dark blue) parts represent the parts of the waveform where no speech is produced, and the lighter (red) parts represent higher intensity where speech is present.


    3.3.2 Fundamental Frequency

The time between successive vocal fold openings is called the fundamental period T0, while the rate of vibration is called the fundamental frequency of the phonation, F0 = 1/T0. Using voiced excitation for a speech sound results in a pulse train at this fundamental frequency. Voiced excitation is used when articulating vowels and some of the consonants. For fricatives (e.g., /f/ as in fish or /s/ as in mess), unvoiced excitation (noise) is used. In these cases usually no fundamental frequency can be detected; on the other hand, the zero-crossing rate of the signal is very high. Plosives (like /p/ as in put), which use transient excitation, are best detected in the speech signal by looking for the short silence necessary to build up the air pressure before the plosive bursts out.

    3.3.3 Peaks in the Spectrum

After passing the glottis, the vocal tract gives a characteristic spectral shape to the speech signal. If one simplifies the vocal tract to a straight pipe (about 17 cm long), one can see that the pipe shows resonances at certain frequencies. Depending on the shape of the vocal tract (the diameter of the pipe changes along its length), the frequencies of the formants (especially the 1st and 2nd formant) change and therefore characterize the vowel being articulated.

    3.3.4 The Envelope of the Power Spectrum

The pulse sequence from the glottis has a power spectrum decreasing towards higher frequencies by -12 dB per octave. The emission characteristics of the lips show a high-pass characteristic of +6 dB per octave. This results in an overall decrease of -6 dB per octave.

    3.4 Speech perception process

The microphone.cs class is responsible for accepting input from a microphone and forwarding it to the feature extraction module. Before converting the signal into a suitable or desired form, it is important to identify the segments of the sound containing words. The audio.cs class deals with all tasks needed for converting a wave file to a stream of digits and vice versa. It also has a provision for saving the sound into WAV files.


    4. WAVELET ANALYSIS

    4.1 Definition

    A wavelet is a wave-like oscillation with amplitude that starts out at zero, increases, and

    then decreases back to zero. It can typically be visualized as a "brief oscillation" like one might

    see recorded by a seismograph or heart monitor. Generally, wavelets are purposefully crafted to

    have specific properties that make them useful for signal processing. Wavelets can be combined,

    using a "reverse, shift, multiply and sum" technique called convolution, with portions of an

    unknown signal to extract information from the unknown signal.

    The fundamental idea behind wavelets is to analyze according to scale. The wavelet

    analysis procedure is to adopt a wavelet prototype function called an analyzing wavelet or

    mother wavelet. Any speech signal can then be represented by translated and scaled versions of

the mother wavelet. Wavelet analysis is capable of revealing aspects of data that other speech signal analysis techniques miss; the extracted features are then passed to a classifier for the recognition of isolated words [4].

The integral wavelet transform is the integral transform defined as:

$$W_\psi(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi\!\left(\frac{t - b}{a}\right) dt \qquad \text{(Equation 4.1)}$$

where a is positive and defines the scale, and b is any real number and defines the shift.
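As a concrete illustration, the following minimal MATLAB sketch evaluates Equation 4.1 numerically for a single (a, b) pair. The toy signal and the real-valued, Morlet-style mother wavelet are assumptions made for this example only; the recognizer itself uses Daubechies wavelets, which have no closed-form expression.

```matlab
% Minimal sketch: numerically evaluating Equation 4.1 for one (a, b) pair.
psi = @(t) cos(5*t) .* exp(-t.^2 / 2);    % illustrative mother wavelet (assumed)

t = linspace(-10, 10, 4096);              % time grid
x = sin(2*pi*2*t) + 0.5*sin(2*pi*5*t);    % toy input signal (assumed)

a = 0.5;                                  % scale (a > 0)
b = 1.0;                                  % shift (any real number)

% W(a,b) = (1/sqrt(a)) * integral of x(t) * psi((t - b)/a) dt
W = (1/sqrt(a)) * trapz(t, x .* psi((t - b)/a));
fprintf('W(a=%.2f, b=%.2f) = %.4f\n', a, b, W);
```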

For decomposition of the speech signal we can use different techniques such as Fourier analysis, the STFT (Short-Time Fourier Transform), and wavelet transform techniques.

Here we explain the necessity and advantages of wavelet analysis by first considering Fourier analysis and its limitations, then its modification into the Short-Time Fourier Transform and its limitations, and finally wavelet analysis itself.


    4.2 Fourier Analysis

Fourier analysis breaks down a signal into constituent sinusoids of different frequencies. It is a mathematical technique for transforming a signal from a time-based representation to a frequency-based one:

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \qquad \text{(Equation 4.2)}$$

The Fourier transform of a sinusoidal signal is depicted in Fig. 4.1 below.

    Fig. 4.1 Fourier transform

    4.2.1 Limitations

But Fourier analysis has a serious drawback: in transforming to the frequency domain, time information is lost. When looking at the Fourier transform of a signal, it is impossible to tell when a particular event took place. If a signal doesn't change much over time, i.e. if it is what is called a stationary signal, this drawback isn't very important. However, most interesting signals contain numerous non-stationary or transitory characteristics: drift, trends, abrupt changes, and beginnings and ends of events. These characteristics are often the most important part of the signal, and Fourier analysis is not suited to detecting them.

4.3 Short-Time Fourier Analysis

The Short-Time Fourier Transform (STFT) maps a signal into a two-dimensional function of time and frequency using a technique called windowing. Mathematically it is given by

$$X(m, \omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n - m]\, e^{-j\omega n} \qquad \text{(Equation 4.3)}$$

where the signal is x[n] and the window is w[n].


The Short-Time Fourier Transform of a random signal is shown in Fig. 4.2 below.

    Fig. 4.2 Short time Fourier transform

    The STFT represents a sort of compromise between the time- and frequency-based views

    of a signal. It provides some information about both when and at what frequencies a signal event

    occurs.
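The sketch below illustrates this fixed-window analysis in MATLAB (it assumes the Signal Processing Toolbox); the two-tone test signal and the 256-sample window are arbitrary choices for the example.

```matlab
% Minimal sketch of Equation 4.3 in practice: a fixed-window STFT of a
% two-tone test signal.
fs = 8000;                                 % sampling rate (Hz)
t  = 0:1/fs:1;
x  = [sin(2*pi*500*t), sin(2*pi*1500*t)];  % 500 Hz tone, then 1500 Hz

win = hamming(256);                        % the window w[n], fixed for all frequencies
spectrogram(x, win, 128, 256, fs, 'yaxis');
title('STFT: one window size for the whole time-frequency plane');
```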

    4.3.1 Limitations

However, this information can only be obtained with limited precision, and that precision is determined by the size of the window. While the STFT's compromise between time and frequency information can be useful, the drawback is that once a particular size for the time window is chosen, that window is the same for all frequencies. If a wider window is chosen, it gives better frequency resolution but poor time resolution; a narrower window gives good time resolution but poor frequency resolution. Many signals require a more flexible approach, one where the window size can be varied to determine more accurately either time or frequency.

    4.4 Types of Wavelets

Different types of wavelets are Haar, Daubechies, biorthogonal, Coiflet, Symlet, Morlet, Mexican hat, and Meyer wavelets.

    Wavelets mainly used in speech recognition are discussed here.


    4.4.1 Haar Wavelet

The Haar wavelet is the first and simplest wavelet. It is discontinuous and resembles a step function. It represents the same wavelet as Daubechies db1.

The Haar wavelet family for t ∈ [0, 1] is defined as follows:

$$h_i(t) = \begin{cases} 2^{j/2}, & \dfrac{k}{m} \le t < \dfrac{k + 0.5}{m} \\ -2^{j/2}, & \dfrac{k + 0.5}{m} \le t < \dfrac{k + 1}{m} \\ 0, & \text{otherwise} \end{cases} \qquad \text{(Equation 4.4)}$$

The integer m = 2^j (j = 0, 1, ..., J) indicates the level of the wavelet, and k = 0, 1, 2, ..., m-1 is the translation parameter. The maximal level of resolution is J.

    Fig. 4.3 Haar wavelet

    4.4.2 Daubechies-N wavelet family

The Daubechies wavelets are a family of orthogonal wavelets defining a discrete wavelet transform and characterized by a maximal number of vanishing moments for a given support. With each wavelet type of this class there is a scaling function (also called the father wavelet) which generates an orthogonal multi-resolution analysis. The Daubechies wavelet is one of the most popular wavelets and has been used for speech recognition [4].


In general the Daubechies wavelets are chosen to have the highest number A of vanishing moments (this does not imply the best smoothness) for a given support width N = 2A, and among the 2^(A-1) possible solutions the one whose scaling filter has extremal phase is chosen. The wavelet transform is also easy to put into practice using the fast wavelet transform. Daubechies wavelets are widely used in solving a broad range of problems, e.g. self-similarity properties of a signal, fractal problems, signal discontinuities, etc.

The Daubechies wavelet properties are [6]:

- The support length of the wavelet function ψ and the scaling function φ is 2N-1.
- The number of vanishing moments of ψ is N.
- Most dbN wavelets are not symmetrical.
- The regularity increases with the order: when N becomes very large, ψ and φ belong to C^(μN), where μ is approximately equal to 0.2.

The Daubechies-8 wavelet is used for decomposition of the speech signal, as it needs the minimum support size for the given number of vanishing moments.

The names of the Daubechies family wavelets are written dbN, where N is the order and db the "surname" of the wavelet. The db1 wavelet, as mentioned above, is the same as the Haar wavelet.

    Here are the next nine members of the family:

    Fig. 4.5 Daubechies wavelets
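Assuming MATLAB's Wavelet Toolbox, the wavefun routine can approximate and plot the db8 pair used later in this work:

```matlab
% Minimal sketch: approximate the db8 scaling and wavelet functions by
% cascade iteration and plot them.
[phi, psi, xval] = wavefun('db8', 10);   % 10 refinement iterations

subplot(2,1,1); plot(xval, phi); title('db8 scaling function \phi');
subplot(2,1,2); plot(xval, psi); title('db8 wavelet function \psi');
```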


4.4.3 Advantages of Wavelet analysis over STFT

    Wavelet analysis represents the next logical step: a windowing technique with variable-

    sized regions. Wavelet analysis allows the use of long time intervals where we want more precise

    low frequency information, and shorter regions where we want high frequency information.

    Fig. 4.6 Comparison of Wavelet analysis over STFT

The time-based, frequency-based, and STFT views of a signal are shown alongside the wavelet-analysis view. One major advantage afforded by wavelets is the ability to perform local analysis, i.e., to analyze a localized area of a larger signal.

    4.5 Wavelet Transform

The transform of a signal is just another form of representing the signal; it does not change the information content present in the signal. For many signals the low-frequency part contains the most important content: it gives the signal its identity. Consider the human voice: if we remove the high-frequency components, the voice sounds different, but we can still tell what's being said. In wavelet analysis we often speak of approximations and details. The approximations are the high-scale, low-frequency components of the signal; the details are the low-scale, high-frequency components.

$$CWT_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt \qquad \text{(Equation 4.5)}$$

where ψ(t) is a time function with finite energy and fast decay, called the mother wavelet.


    4.5.1 Discrete Wavelet Transform

The Discrete Wavelet Transform (DWT) involves choosing scales and positions based on powers of two, so-called dyadic scales and positions. The mother wavelet is rescaled or dilated by powers of two and translated by integers. Specifically, a function f(t) ∈ L²(R) (the space of square-integrable functions) can be represented as [1]:

$$f(t) = \sum_{k} a(L, k)\, 2^{-L/2}\, \varphi(2^{-L} t - k) + \sum_{j=1}^{L} \sum_{k} d(j, k)\, 2^{-j/2}\, \psi(2^{-j} t - k) \qquad \text{(Equation 4.6)}$$

The function ψ(t) is known as the mother wavelet, while φ(t) is known as the scaling function. The set of functions {2^{-j/2} ψ(2^{-j} t - k) : j, k ∈ Z}, where Z is the set of integers, is an orthonormal basis for L²(R). The numbers a(L, k) are known as the approximation coefficients at scale L, while d(j, k) are known as the detail coefficients at scale j. The approximation and detail coefficients can be expressed as:

$$a(L, k) = \int f(t)\, 2^{-L/2}\, \varphi(2^{-L} t - k)\, dt \qquad \text{(Equation 4.7)}$$

$$d(j, k) = \int f(t)\, 2^{-j/2}\, \psi(2^{-j} t - k)\, dt \qquad \text{(Equation 4.8)}$$

The DWT analysis can be performed using a fast, pyramidal algorithm related to multi-rate filter banks. As a multi-rate filter bank the DWT can be viewed as a constant-Q filter bank with octave spacing between the centers of the filters. Each sub-band contains half the samples of the neighboring higher-frequency sub-band. In the pyramidal algorithm the signal is analyzed at different frequency bands with different resolution by decomposing the signal into a coarse approximation and detail information. The coarse approximation is then further decomposed using the same wavelet decomposition step. This is achieved by successive high-pass and low-pass filtering of the time-domain signal and is defined by the following equations:

$$y_{low}[n] = \sum_{k} x[k]\, g[2n - k] \qquad \text{(Equation 4.9)}$$

$$y_{high}[n] = \sum_{k} x[k]\, h[2n - k] \qquad \text{(Equation 4.10)}$$


    Fig. 4.7 Filter functions

The signal x[n] is passed through low-pass and high-pass filters and then downsampled by 2:

$$y_{low}[n] = (x * g) \downarrow 2 \qquad \text{(Equation 4.11)}$$

$$y_{high}[n] = (x * h) \downarrow 2 \qquad \text{(Equation 4.12)}$$

In the DWT, each level is calculated by passing the previous approximation coefficients through a pair of high-pass and low-pass filters.
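A minimal MATLAB sketch of Equations 4.9 to 4.12 is given below. Only the db8 decomposition filters come from the Wavelet Toolbox; the random input is a stand-in for a real speech segment.

```matlab
% Minimal sketch: one DWT level as filtering followed by downsampling by 2.
[g, h] = wfilters('db8', 'd');        % decomposition low-pass g and high-pass h

x = randn(1, 512);                    % stand-in for a speech segment (assumed)

yl = conv(x, g);  ylow  = yl(1:2:end);   % (x * g), then keep every other sample
yh = conv(x, h);  yhigh = yh(1:2:end);   % (x * h), then keep every other sample

% Each sub-band now holds roughly half the samples of the input signal.
fprintf('input %d samples -> approx %d, detail %d\n', ...
        numel(x), numel(ylow), numel(yhigh));
```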

    4.5.2 Multilevel Decomposition of Signal

A signal can be decomposed using wavelet analysis as shown below [11]:

    Fig. 4.8 Decomposition of DWT Co-efficients

    Fig. 4.9 Decomposition using DWT


The DWT is computed by successive low-pass and high-pass filtering of the discrete time-domain signal, as shown in Figs. 4.8 and 4.9. This is called the Mallat algorithm or Mallat-tree decomposition.

    4.5.3 Wavelet Reconstruction

Recovering the original signal with no (or minimal) loss of information is called reconstruction. It is done by the inverse discrete wavelet transform (IDWT). Whereas wavelet analysis involves filtering and downsampling, the wavelet reconstruction process consists of upsampling and filtering. Upsampling is the process of lengthening a signal component by inserting zeros between samples.
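The full decompose-and-reconstruct round trip can be sketched in a few lines of MATLAB (again assuming the Wavelet Toolbox); the random input is a stand-in for a de-noised speech segment.

```matlab
% Minimal sketch: 4-level Mallat decomposition with db8, then perfect
% reconstruction via the inverse DWT.
x = randn(1, 2048);                 % stand-in for a speech segment (assumed)

[C, L] = wavedec(x, 4, 'db8');      % C: all coefficients, L: bookkeeping lengths
cA4 = appcoef(C, L, 'db8', 4);      % level-4 approximation coefficients
cD1 = detcoef(C, L, 1);             % level-1 detail coefficients

xr = waverec(C, L, 'db8');          % upsample-and-filter reconstruction
fprintf('max reconstruction error: %g\n', max(abs(x - xr)));
```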

    Fig. 4.10 Signal Reconstruction

    Fig. 4.11 Signal Decomposition & Reconstruction


    5. FROM SPEECH TO FEATURE VECTORS

The main objective of this stage is to extract features that are sufficient for the recognizer to recognize the words. This chapter describes how to extract information from a

    speech signal, which means creating feature vectors from the speech signal. A wide range of

    possibilities exist for parametrically representing a speech signal and its content. The main steps

    for extracting information are preprocessing, frame blocking & windowing and feature

    extraction [1].

    Fig. 5.1 Main steps in Feature Extraction

    5.1 Preprocessing

This is the first step in creating feature vectors. The objective of pre-processing is to modify the speech signal x(n) so that it will be more suitable for the feature extraction analysis. The preprocessing operations (noise cancelling, pre-emphasis, and voice activation detection) are shown in Fig. 5.2.

Fig. 5.2 Pre-processing

The first thing to consider is whether the speech x(n) is corrupted by some noise d(n), for example an additive disturbance x(n) = s(n) + d(n), where s(n) is the clean speech signal. There are several approaches to performing noise reduction on a noisy speech signal. Two commonly used noise reduction algorithms in the speech recognition context are spectral subtraction and adaptive noise cancellation. A low signal-to-noise ratio (SNR) decreases the


performance of the recognizer in a real environment. Some changes to make the speech recognizer more noise-robust will be presented later. Note that the order of the operations might be changed for some tasks; for example, the noise reduction algorithm spectral subtraction is better placed last in the chain (it needs the voice activation detection).

5.1.1 Pre-emphasis

There is a need to spectrally flatten the signal. The pre-emphasizer, often realized as a first-order high-pass FIR filter, is used to emphasize the higher-frequency components.

This stage boosts the amount of energy in the high frequencies. If we look at the spectrum of voiced segments like vowels, there is more energy at the lower frequencies than at the higher frequencies. This drop in energy across frequencies (called spectral tilt) is caused by the nature of the glottal pulse. Boosting the high-frequency energy makes information from the higher formants more available to the acoustic model and improves phone detection accuracy.

Fig. 5.3 Pre-emphasis filter

The pre-emphasizer spectrally flattens the speech signal, usually with a high-pass filter. The most commonly used filter for this step is the FIR filter described below:

$$H(z) = 1 - 0.95\, z^{-1} \qquad \text{(Equation 5.1)}$$


The filter response for this FIR filter can be seen in Fig. 5.3. In the time domain the filter is h(n) = {1, -0.95}, and filtering in the time domain gives the pre-emphasized signal s1(n):

$$s_1(n) = x(n) - 0.95\, x(n - 1) \qquad \text{(Equation 5.2)}$$
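In MATLAB the pre-emphasis filter is a single call to filter; the sketch below uses a random stand-in for the recorded signal.

```matlab
% Minimal sketch of Equations 5.1-5.2: first-order FIR pre-emphasis.
x  = randn(1, 8000);              % stand-in for the recorded speech x(n) (assumed)
s1 = filter([1 -0.95], 1, x);     % s1(n) = x(n) - 0.95*x(n-1), boosts high frequencies
```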

    5.1.2 Voice Activation Detection (VAD)

The problem of locating the endpoints of an utterance in a speech signal is a major problem for the speech recognizer: inaccurate endpoint detection will decrease its performance. Detecting endpoints seems relatively trivial, but it has been found to be very difficult in practice; only when a fair SNR is given is the task made easier. Some commonly used measurements for finding speech are the short-term energy estimate Es1, the short-term power estimate Ps1, and the short-term zero-crossing rate Zs1. For the speech signal s1(n) these measures are calculated as follows [1]:

$$E_{s1}(m) = \sum_{n = mL - L + 1}^{mL} s_1^2(n) \qquad \text{(Equation 5.3)}$$

$$P_{s1}(m) = \frac{1}{L} \sum_{n = mL - L + 1}^{mL} s_1^2(n) \qquad \text{(Equation 5.4)}$$

$$Z_{s1}(m) = \frac{1}{2L} \sum_{n = mL - L + 1}^{mL} \left| \operatorname{sgn}(s_1(n)) - \operatorname{sgn}(s_1(n - 1)) \right| \qquad \text{(Equation 5.5)}$$

where

$$\operatorname{sgn}(s_1(n)) = \begin{cases} 1, & s_1(n) \ge 0 \\ -1, & s_1(n) < 0 \end{cases} \qquad \text{(Equation 5.6)}$$

For each block of L samples these measures produce one value. Note that the index for these functions is m and not n, because the measures do not have to be calculated for every sample (they can, for example, be calculated every 20 ms). The short-term energy estimate will increase when speech is present in s1(n). This is also the case for the short-term power estimate; the only thing that separates them is the scaling by 1/L. The short-term zero-crossing rate gives a measure of how many times the signal s1(n) changes sign; it tends to be larger during unvoiced regions.

These measures need triggers for making the decision about where the utterances begin and end. To create a trigger, one needs some information about the background noise. This is done by assuming that the first 10 blocks are background noise. With this assumption the


mean and variance of the measure are calculated. To make a more convenient approach, the following function is used:

$$W_{s1}(m) = P_{s1}(m)\, \bigl(1 - Z_{s1}(m)\bigr)\, S_c \qquad \text{(Equation 5.7)}$$

Using this function, both the short-term power and the zero-crossing rate are taken into account. Sc is a scale factor for avoiding small values; in a typical application Sc = 1000. The trigger for this function can be described as:

$$t_W = \mu_W + \alpha\, \sigma_W \qquad \text{(Equation 5.8)}$$

where μW is the mean and σW is the variance of Ws1(m) calculated over the first 10 blocks. The term α is a constant that has to be fine-tuned according to the characteristics of the signal. After some testing, an empirically chosen α (Equation 5.9) gives fairly good voice activation detection at various levels of additive background noise. The voice activation detection function VAD(m) can now be found as:

$$VAD(m) = \begin{cases} 1, & W_{s1}(m) \ge t_W \\ 0, & W_{s1}(m) < t_W \end{cases} \qquad \text{(Equation 5.10)}$$

VAD(n) takes the block value VAD(m) for every sample in block m. For example, if the measures are calculated every 320 samples (block length L = 320), which corresponds to 40 ms at a sampling rate of 8 kHz, then the first 320 samples of VAD(n) take the value VAD(1). Using these settings, VAD(n) is calculated for the speech signal containing the word, as shown in the results.
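A minimal MATLAB sketch of the detector follows, continuing from the pre-emphasized signal s1 above. The block length and the value of alpha are assumptions for illustration; as noted, alpha must be tuned to the signal.

```matlab
% Minimal sketch of Equations 5.4-5.10: block-wise voice activation detection.
L  = 320;                                 % block length: 40 ms at 8 kHz (assumed)
M  = floor(numel(s1) / L);                % number of measurement blocks
Sc = 1000;                                % scale factor from Equation 5.7

W = zeros(1, M);
for m = 1:M
    blk  = s1((m - 1)*L + 1 : m*L);
    P    = sum(blk.^2) / L;                         % short-term power Ps1(m)
    Z    = sum(abs(diff(sign(blk)))) / (2*L);       % zero-crossing rate Zs1(m)
    W(m) = P * (1 - Z) * Sc;                        % Equation 5.7
end

alpha = 0.2;                              % assumed placeholder; fine-tune per Equation 5.9
tW    = mean(W(1:10)) + alpha * var(W(1:10));       % trigger from first 10 blocks
VAD   = double(W >= tW);                  % Equation 5.10: 1 = speech, 0 = silence
```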

    5.2 Frame blocking & Windowing

The speech signal is a non-stationary signal, but we can assume it is stationary over 10-20 ms. Framing is used to cut the long speech signal into short-time segments in order to obtain relatively stable frequency characteristics. Features are extracted periodically. The time over which the signal is considered for processing is called a window, and the data acquired in a window is called a frame. Typically features are extracted once every 10 ms, which is called the frame rate. The window duration is typically 20 ms; thus two consecutive frames have overlapping areas.


    Fig. 5.4 Frame blocking & Windowing

    5.2.1 Frame blocking

For each utterance of a word, a window duration (Tw) of 320 samples is used for processing at later stages. A frame is formed from the windowed data with a typical frame duration (Tf) of about 200 samples. Since the frame duration is shorter than the window duration there is an overlap of data, and the percentage overlap is given as:

$$\%\,\text{Overlap} = \frac{(T_w - T_f) \times 100}{T_w} \qquad \text{(Equation 5.11)}$$

Each frame is K samples long, with adjacent frames separated by P samples.


    Fig. 5.5 Frame blocking of a sequence

By applying frame blocking to the de-noised signal x(k), one gets M vectors of length K, which correspond to x(k; m), where k = 0, 1, ..., K-1 and m = 0, 1, ..., M-1.
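A minimal MATLAB sketch of the blocking step, continuing from the pre-emphasized signal s1 and using the Tw = 320, Tf = 200 values above (so K = 320 samples per frame with a shift of P = 200 samples, i.e. 37.5% overlap):

```matlab
% Minimal sketch: cut the signal into K-sample frames, P samples apart.
K = 320;  P = 200;                        % frame length and frame shift (assumed)
M = floor((numel(s1) - K) / P) + 1;       % number of complete frames

frames = zeros(K, M);
for m = 1:M
    frames(:, m) = s1((m-1)*P + 1 : (m-1)*P + K).';   % frame x(k; m), k = 0..K-1
end
```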

    5.2.2 Windowing

The windowing concept is used to minimize signal distortion: the window tapers the signal to zero at the beginning and end of each frame, i.e., it reduces the signal discontinuity at either end of the block.

The rectangular window (i.e., no window) can cause problems when we do Fourier analysis, because it abruptly cuts off the signal at its boundaries. A good window function has a narrow main lobe and low side-lobe levels in its transfer function; it shrinks the values of the signal toward zero at the window boundaries, avoiding discontinuities (Equation 5.12).

The most commonly used window function in speech processing is the Hamming window, defined as:

w(k) = 0.54 − 0.46 · cos(2πk/(K − 1)), 0 ≤ k ≤ K − 1    Equation 5.13

By applying w(k) to x(k; m) for all blocks, the windowed signal output is calculated.


The Hamming window function is shown in Fig. 5.6 below:

    Fig. 5.6 Hamming Window

Multiplication of the signal by a window function in the time domain corresponds to convolution in the frequency domain. A rectangular window gives maximum sharpness but large side lobes (ripples); the Hamming window blurs the spectrum somewhat but produces much less leakage.
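Continuing the frame-blocking sketch above, the Hamming window of Equation 5.13 can be applied to every frame as follows; the explicit cosine formula is written out so that the snippet does not depend on the toolbox hamming() function.

    % Windowing (sketch): taper each frame with a Hamming window
    % to suppress the discontinuities at the frame boundaries.
    w = 0.54 - 0.46*cos(2*pi*(0:K-1)'/(K-1));  % Equation 5.13, length K
    windowed = frames .* repmat(w, 1, M);      % w(k) * x(k; m) for all m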

    5.3 Feature Extraction

A feature extractor should reduce the pattern vector (i.e., the original waveform) to a lower dimension that contains most of the useful information from the original vector. Here, the features of the input speech signal are extracted using Daubechies-8 wavelets at level 4 [4].

The extracted wavelet coefficients provide a compact representation that shows the energy distribution of the signal in time and frequency. To further reduce the dimensionality of the extracted feature vectors, statistics over the set of wavelet coefficients are used.


The following features are used in our system:

The mean of the absolute value of the coefficients in each sub-band. These features provide information about the frequency distribution of the audio signal.

The standard deviation of the coefficients in each sub-band. These features provide information about the amount of change of the frequency distribution.

The energy of each sub-band of the signal. These features provide information about the energy of each sub-band.

The kurtosis of each sub-band of the signal. These features measure whether the data are peaked or flat relative to a normal distribution.

The skewness of each sub-band of the signal. These features measure the symmetry, or lack of symmetry, of the data.

After frame blocking and windowing, we get different frame vectors, i.e., several signals must be loaded for feature extraction at once. Hence, multisignal wavelet analysis is performed on the input frame vectors in MATLAB [13], as sketched below.
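A sketch of the per-frame feature computation using the Wavelet Toolbox [13] follows; the helper name wavelet_features is illustrative, and kurtosis() and skewness() are assumed to be available from the Statistics Toolbox.

    % Wavelet feature extraction (sketch): db8 DWT at level 4 per frame,
    % then five statistics over each of the five sub-bands (A4, D4..D1).
    function fv = wavelet_features(frame)
        level  = 4;
        [C, S] = wavedec(frame, level, 'db8');       % DWT coefficients
        fv = [];
        for band = 1:level+1
            if band == 1
                c = appcoef(C, S, 'db8', level);     % approximation A4
            else
                c = detcoef(C, S, level - band + 2); % details D4 down to D1
            end
            fv = [fv, mean(abs(c)), std(c), sum(c.^2), ...
                      kurtosis(c), skewness(c)];
        end
    end

With 5 sub-bands and 5 statistics each, every frame reduces to a 25-element feature vector.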


    6. DYNAMIC TIME WARPING

    Dynamic time warping (DTW) is an algorithm for measuring similarity between two

    sequences which may vary in time or speed. For instance, similarities in walking patterns would

    be detected, even if in one video the person was walking slowly and if in another he or she were

    walking more quickly, or even if there were accelerations and decelerations during the course of

one observation. DTW has been applied to video, audio, and graphics; indeed, any data which

    can be turned into a linear representation can be analyzed with DTW. A well-known application

    has been automatic speech recognition, to cope with different speaking speeds [3].

    In general, DTW is a method that allows a computer to find an optimal match between

    two given sequences (e.g. time series) with certain restrictions. The sequences are "warped" non-

    linearly in the time dimension to determine a measure of their similarity independent of certain

    non-linear variations in the time dimension. This sequence alignment method is often used in

time series classification.

    The recognition process then consists of matching the incoming speech with stored

    templates. The template with the lowest distance measure from the input pattern is the

    recognized word. The best match (lowest distance measure) is based upon dynamic

    programming.

    6.1 DTW Algorithm

    Speech is a time-dependent process. Hence the utterances of the same word will have

    different durations, and utterances of the same word with the same duration will differ in the

    middle, due to different parts of the words being spoken at different rates. To obtain a global

    distance between two speech patterns (represented as a sequence of vectors) a time alignment

    must be performed.


    6.1.1 DP-Matching Principle

    General Time-Normalized Distance Definition:

    Speech can be expressed by appropriate feature extraction as a sequence of feature

    vectors.

A = a1, a2, a3, ..., ai, ..., aI    Equation 6.1

B = b1, b2, b3, ..., bj, ..., bJ    Equation 6.2

    Consider the problem of eliminating timing differences between these two speech

    patterns. In order to clarify the nature of time-axis fluctuation or timing differences, let us

consider an i-j plane, shown in Fig. 6.1, where patterns A and B are developed along the i-axis and j-axis, respectively. When these speech patterns are of the same category, the timing differences between them can be depicted by a sequence of points c = (i, j):

F = c(1), c(2), ..., c(k), ..., c(K)    Equation 6.3

where c(k) = (i(k), j(k)).

    This sequence can be considered to represent a function which approximately realizes a

mapping from the time axis of pattern A onto that of pattern B. Hereafter, it is called a warping

    function. When there is no timing difference between these patterns, the warping function

    coincides with the diagonal line j = i. It deviates further from the diagonal line as the timing

    difference grows [3].

Fig. 6.1 Warping function and adjustment window definition


As a measure of the difference between two feature vectors ai and bj, a distance

d(c) = d(i, j) = || ai − bj ||    Equation 6.4

is employed. Then, the weighted summation of distances along warping function F becomes

E(F) = Σ_{k=1}^{K} d(c(k)) · w(k)    Equation 6.5

where w(k) is a nonnegative weighting coefficient, intentionally introduced to give the E(F) measure a flexible characteristic. E(F) is a reasonable measure of the goodness of

    warping function F. It attains its minimum value when warping function F is determined so as to

    optimally adjust the timing difference. This minimum residual distance value can be considered

to be a distance between patterns A and B, remaining still after eliminating the timing differences

    between them, and is naturally expected to be stable against time-axis fluctuation. Based on these

considerations, the time-normalized distance between two speech patterns A and B is defined as

    follows:

D(A, B) = min_F [ Σ_{k=1}^{K} d(c(k)) · w(k) / Σ_{k=1}^{K} w(k) ]    Equation 6.6

The denominator is employed to compensate for the effect of K (the number of points on the warping function F). The above equation is no more than a fundamental definition of time-

    normalized distance. Effective characteristics of this measure greatly depend on the warping

function specification and the weighting coefficient definition. Desirable characteristics of the

    time-normalized distance measure will vary, according to speech pattern properties (especially

    time axis expression of speech pattern) to be dealt with. Therefore, the present problem is

    restricted to the most general case where the following two conditions hold:

    Condition 1: Speech patterns are time-sampled with a common and constant sampling period.

Condition 2: We have no a priori knowledge about which parts of a speech pattern contain linguistically important information. In this case, it is reasonable to consider each part of a speech pattern to contain an equal amount of linguistic information.
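To make Equation 6.4 concrete, a minimal sketch computing the local distance between every pair of feature vectors is shown below; the feature sequences A (dim x I) and B (dim x J) are assumed to be stored column-wise, and the Euclidean norm is assumed as the vector distance.

    % Local distance matrix (sketch): d(i, j) = || a_i - b_j ||,
    % as in Equation 6.4, for column-wise feature sequences A and B.
    I = size(A, 2); J = size(B, 2);
    d = zeros(I, J);
    for i = 1:I
        for j = 1:J
            d(i, j) = norm(A(:, i) - B(:, j));
        end
    end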


    6.1.2 Restrictions on Warping Function

Warping function F is a model of time-axis fluctuation in a speech pattern. Accordingly, it should approximate the properties of actual time-axis fluctuation. In other words, function F, when viewed as a mapping from the time axis of pattern A onto that of pattern B, must preserve linguistically essential structures in the pattern A time axis, and vice versa. Essential speech pattern

    time-axis structures are continuity, monotonicity (or restriction of relative timing in a speech),

limitation on the acoustic parameter transition speed in a speech, and so on. These conditions can be realized as the following restrictions on warping function F, or on the points c(k) = (i(k), j(k)):

1) Monotonic conditions: i(k−1) ≤ i(k) and j(k−1) ≤ j(k).    Equation 6.7

2) Continuity conditions: i(k) − i(k−1) ≤ 1 and j(k) − j(k−1) ≤ 1.    Equation 6.8

    As a result of these two restrictions, the following relation holds between two consecutive points

c(k−1) = (i(k), j(k)−1), (i(k)−1, j(k)−1), or (i(k)−1, j(k)).    Equation 6.9

3) Boundary conditions: i(1) = 1, j(1) = 1, and i(K) = I, j(K) = J.    Equation 6.10

4) Adjustment window condition:

|i(k) − j(k)| ≤ r    Equation 6.11

where r is an appropriate positive integer called the window length. This condition corresponds to the fact that time-axis fluctuation in usual cases never causes an excessively large timing difference.

5) Slope constraint condition: Neither too steep nor too gentle a gradient should be allowed for warping function F,

    because such deviations may cause undesirable time-axis warping. Too steep a gradient, for

example, causes an unrealistic correspondence between a very short pattern A segment and a relatively long pattern B segment. Then, a case can occur where a short segment in a consonant

    or phoneme transition part happens to be in good coincidence with an entire steady vowel part.

    Therefore, a restriction called a slope constraint condition was set upon the warping function F,

    so that its first derivative is of discrete form. The slope constraint condition is realized as a

    restriction on the possible relation among (or the possible configuration of) several consecutive


    points on the warping function, as is shown in Fig. 6.2(a) and (b). To put it concretely, if point c

(k) moves forward in the direction of the i (or j)-axis m consecutive times, then point c (k) is not

    allowed to step further in the same direction before stepping at least n times in the diagonal

    direction. The effective intensity of the slope constraint can be evaluated by the following

    measure P = n/m.

    Fig. 6.2 Slope constraint on warping function

    The larger the P measure, the more rigidly the warping function slope is restricted. When

P = 0, there are no restrictions on the warping function slope. When P = ∞ (that is, m = 0), the warping function is restricted to the diagonal line j = i. Nothing more occurs than conventional


pattern matching with no time normalization. Generally speaking, if the slope constraint is too severe,

    then time-normalization would not work effectively. If the slope constraint is too lax, then

    discrimination between speech patterns in different categories is degraded. Thus, setting neither a

too large nor too small a value for P is desirable. The results of an investigation on an optimum compromise for the P value through several experiments are reported in [3].

    In Fig. 6.2(c) and (d), two examples of permissible point c (k) paths under slope

constraint condition P = 1 are shown. The Fig. 6.2(c) type is directly derived from the above definition, while Fig. 6.2(d) is an approximated type with one additional constraint: the second derivative of warping function F is restricted, so that the point c(k) path does not orthogonally change its direction. This new constraint reduces the number of paths to be searched. Therefore, the simpler Fig. 6.2(d) type is adopted afterward, except for the P = 0 case.

    6.1.3 Discussions on Weighting Coefficient

Since the criterion function in Equation 6.6 is a rational expression, its minimization is an unwieldy problem. If the denominator in Equation 6.6,

N = Σ_{k=1}^{K} w(k)    Equation 6.12

(called the normalization coefficient), is independent of warping function F, it can be put outside the minimization, simplifying the equation as follows:

D(A, B) = (1/N) · min_F [ Σ_{k=1}^{K} d(c(k)) · w(k) ]    Equation 6.13

This simplified problem can be effectively solved by use of the dynamic programming technique.

In the symmetric form,

w(k) = [i(k) − i(k−1)] + [j(k) − j(k−1)]    Equation 6.14

and then N = I + J, where I and J are the lengths of speech patterns A and B, respectively.

If it is assumed that time axes i and j are both continuous, then, in the symmetric form, the summation in Equation 6.6 means an integration along the temporarily defined axis l = i + j. As a result, the time-normalized distance is symmetric, i.e., D(A, B) = D(B, A), in the symmetric form. Another, more important result, caused by the difference in the integration axis, is that, as shown in Fig. 6.3, the weighting coefficient w(k) reduces to zero in the asymmetric form when the point on the warping function steps in the direction of the j-axis, i.e., c(k) = c(k−1) + (0, 1). This means that some feature vectors bj are possibly excluded from the integration in the asymmetric


    form. On the contrary, in the case of symmetric form, minimum w (k) value is equal to 1, and no

    exclusion occurs. Since discussions here are based on the assumption that each part in a speech

    pattern should be treated equally, an exclusion of any feature vectors from integration should be

    avoided as long as possible. It can be expected, therefore, that the symmetric form will give

    better recognition accuracy than the asymmetric form. However, it should be noted that the slope

    constraint reduces the situation where the point in warping function steps in the j-axis direction.

    The difference in performance between the symmetric one and asymmetric one will gradually

    vanish as the slope constraint is intensified.

Fig. 6.3 Weighting coefficient w(k)

    6.2 Practical DP-Matching Algorithm

    6.2.1 DP-Equation

    A simplified definition of time-normalized distance D (A, B) given above is one of the

typical problems to which the well-known DP principle can be applied. The basic

    algorithm for calculating Equation 6.13 is written as follows.

    Initial condition:

g1(c(1)) = d(c(1)) · w(1)    Equation 6.15

DP-equation:

g_k(c(k)) = min over c(k−1) [ g_{k−1}(c(k−1)) + d(c(k)) · w(k) ]    Equation 6.16

Time-normalized distance:

D(A, B) = (1/N) · g_K(c(K))    Equation 6.17


    It is implicitly assumed here that c (0) = (0, 0). Accordingly, w (1) = 2 in the symmetric

    form, and w (1) = 1 in the asymmetric form. By realizing the restriction on the warping function

    described in Section 6.1.2 and substituting Equation 6.14 for weighting coefficient w (k) in

Equation 6.16, several practical algorithms can be derived. As one of the simplest examples, the algorithm for the symmetric form, in which no slope constraint is employed (that is, P = 0), is shown

    here.

    Initial condition:

g(1, 1) = 2 · d(1, 1)    Equation 6.18

DP-equation:

g(i, j) = min [ g(i, j−1) + d(i, j),  g(i−1, j−1) + 2 · d(i, j),  g(i−1, j) + d(i, j) ]    Equation 6.19

Restricting condition (adjustment window):

j − r ≤ i ≤ j + r    Equation 6.20

Time-normalized distance:

D(A, B) = (1/N) · g(I, J)    Equation 6.21

where N = I + J.
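A minimal MATLAB sketch of this symmetric, P = 0 algorithm with the adjustment window of Equation 6.20 might look as follows; d is the local distance matrix from the sketch in Section 6.1.1, and the function name dtw_symmetric is illustrative.

    % Symmetric DP-matching without slope constraint (P = 0), using an
    % adjustment window of width r (Equations 6.18 - 6.21). Cells outside
    % the window stay Inf and are never selected.
    function D = dtw_symmetric(d, r)
        [I, J] = size(d);
        g = inf(I, J);
        g(1, 1) = 2*d(1, 1);                    % Equation 6.18
        for i = 1:I
            for j = max(1, i-r):min(J, i+r)     % Equation 6.20
                if i == 1 && j == 1, continue; end
                best = inf;
                if j > 1,          best = min(best, g(i, j-1)   +   d(i, j)); end
                if i > 1 && j > 1, best = min(best, g(i-1, j-1) + 2*d(i, j)); end
                if i > 1,          best = min(best, g(i-1, j)   +   d(i, j)); end
                g(i, j) = best;                 % Equation 6.19
            end
        end
        D = g(I, J)/(I + J);                    % Equation 6.21, N = I + J
    end

The template with the smallest D against the input utterance is reported as the recognized word.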

    The algorithm, especially the DP-equation, should be modified when the asymmetric

form is adopted or some slope constraint is employed. In Table I of [3], algorithms are summarized for

    both symmetric and asymmetric forms, with various slope constraint conditions. In this table,

    DP-equations for asymmetric forms are shown in some improved form. The first expression in

the bracket of the asymmetric-form DP-equation for P = 1 (that is, g(i−1, j−2) + [d(i, j−1) + d(i, j)]/2) corresponds to the case where c(k−1) = (i(k), j(k)−1) and c(k−2) = (i(k−1)−1, j(k−1)−1). Accordingly, if the definition in Equation 6.14 is strictly obeyed, w(k) is equal to zero while w(k−1) is equal to 1, thus completely omitting d(c(k)) from the summation. In order to avoid this situation to a certain extent, the weighting coefficient w(k−1) = 1 is divided between the two weighting coefficients w(k−1) and w(k). Thus, (d(i, j−1) + d(i, j))/2 is substituted for d(i, j−1) + 0 · d(i, j) in this expression. Similar modifications are applied to the other asymmetric-form DP-

    equations. In fact, it has been established, by a preliminary experiment, that this modification

    significantly improves the asymmetric form performance [12].


    6.2.2 Calculation Details

The DP-equation, i.e., g(i, j), must be recurrently calculated in ascending order with respect to coordinates i and j, starting from the initial condition at (1, 1) up to (I, J). The domain in which the DP-equation must be calculated is specified by

1 ≤ i ≤ I, 1 ≤ j ≤ J    Equation 6.22

and the adjustment window

j − r ≤ i ≤ j + r.    Equation 6.23

    The optimum DP-algorithm, applied to speech recognition, was investigated. Symmetric

    form was proposed along with slope constraint technique. These varieties were then compared

    through theoretical and experimental investigations.

    Conclusions are as follows: Slope constraint is actually effective. Optimum performance is

    attained when the slope constraint condition is P = 1. The validity of these results was ensured by

    a good agreement between theoretical discussions and experimental results. The optimized

    algorithm was then experimentally compared with several other DP-algorithms applied to spoken

word recognition by different research groups, and the superiority of the algorithm described above was established [3].


    7. FPGA Implementation

The AccelDSP Synthesis Tool is a product that allows a MATLAB floating-point design to be transformed into a hardware module that can be implemented in a Xilinx FPGA. The AccelDSP

    Synthesis Tool features an easy-to-use Graphical User Interface that controls an integrated

    environment with other design tools such as MATLAB, Xilinx ISE tools, and other industry-

    standard HDL simulators and logic synthesizers.

    AccelDSP Synthesis is done with the following implementation procedure:

a) Reading and analyzing a MATLAB floating-point design.

b) Automatically creating an equivalent MATLAB fixed-point design.

c) Invoking a MATLAB simulation to verify the fixed-point design.

d) Providing the power to quickly explore design trade-offs of algorithms optimized for the target FPGA architectures.

e) Creating a synthesizable RTL HDL model and a test bench to ensure bit-true, cycle-accurate design verification.

f) Providing scripts that invoke and control downstream tools such as HDL simulators, RTL logic synthesizers, and the Xilinx ISE implementation tools.


    The Synthesis flow in AccelDSP ISE can be observed from the following flow chart:

    Fig. 7.1 Synthesis flow in AccelDSP


    8. SIMULATION & RESULTS

This chapter presents the experimental results obtained from the proposed approach, namely wavelet analysis and Dynamic Time Warping, applied to isolated-word speech recognition. The effectiveness of the algorithms is measured through analysis of the results.

8.1 Input Signal

1) Input speech signal for the word Speech:

Fig. 8.1 Input speech signal

The input speech signal, with a duration of 5 seconds and a sampling frequency of 8 kHz, is shown above.


    8.2 Pre emphasis:

    Pre emphasis output for Speech:

    Fig. 8.2 Pre emphasis output

The output obtained after passing the input Speech signal through the pre-emphasis (first-order high-pass) filter is shown above. The output has significantly better spectral flatness compared with the input.
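As a sketch, this filtering step can be written in MATLAB as follows; the coefficient value 0.95 is an assumption (typical pre-emphasis coefficients lie between 0.9 and 1).

    % Pre-emphasis (sketch): first-order high-pass filter
    % y(n) = s(n) - a*s(n-1); a = 0.95 is assumed here.
    a = 0.95;
    y = filter([1 -a], 1, s);   % flattens the spectral tilt of speech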


    8.3 Voice Activation & Detection

    1) Voice Activation and Detection for Speech:

    Fig. 8.3 Voice Activation & Detection

The above plot shows the voice-activated region for the word Speech. The output is 1 for the voiced region and 0 for the unvoiced and silence regions. Hence, out of the total samples, only the voice-activated samples are passed on for further processing.


2) Speech signal after Voice Activation & Detection:

    Fig. 8.4 Speech signal after Voice Activation & Detection

After obtaining the Voice Activation & Detection output, the regions for which VAD = 1 are extracted for further analysis.


    8.4 De-noising:

    De-noising for Speech:

    Fig. 8.5 Speech signal after de-noising

The final de-noised signal is obtained after spectral subtraction; the noise components present in the signal are reduced.
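A minimal sketch of one common spectral-subtraction formulation is shown below; the non-overlapping frames, the noise estimate taken from the leading frames, and the clipping of negative magnitudes are assumptions about the variant, not necessarily the exact one used here.

    % Spectral subtraction (sketch): estimate the noise magnitude from
    % the leading noiseFrames frames, subtract it from every frame's
    % magnitude spectrum, and resynthesize with the noisy phase.
    function y = spectral_subtract(x, K, noiseFrames)
        M   = floor(length(x)/K);
        X   = fft(reshape(x(1:M*K), K, M));        % frame-wise FFT (columns)
        N   = mean(abs(X(:, 1:noiseFrames)), 2);   % noise magnitude estimate
        mag = max(abs(X) - repmat(N, 1, M), 0);    % subtract, clip negatives
        y   = real(ifft(mag .* exp(1i*angle(X)))); % keep the noisy phase
        y   = y(:);                                % back to a single column
    end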


    8.5 Recognition Results:

This section provides the experimental results for recognizing the isolated words. In the experiment, the database consists of 10 different words with 25 utterances of each word. The recognition rate is calculated as shown in Equation 8.1 below:

Recognition rate (%) = (number of correct recognitions / total number of utterances) · 100    Equation 8.1

For example, a word recognized correctly 24 times out of 25 gives (24/25) · 100 = 96%.

a) The recognition rates for each word using the Daubechies-8 wavelet with level-4 DWT decomposition for English words are shown in the following table:

Word to be recognized    Number of times the word is correctly recognized    Recognition rate (%)

    Matrix 24 96

    Paste 24 96

    Project 18 72

    Speech 18 72

    Window 24 96

    Distance 20 80

    India 24 96

    Ubuntu 19 76

    Fedora 25 100

    Android 24 96

    Table 8.1: Recognition rates for English words using db8 & level 4 DWT.

The overall recognition rate for English words using the Daubechies-8 wavelet at level 4 is 88%, which can be verified directly from the counts in Table 8.1 as shown below.
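As a quick check of Equation 8.1, the per-word counts of Table 8.1 reproduce the overall rate:

    % Overall recognition rate from Table 8.1 (25 utterances per word).
    correct = [24 24 18 18 24 20 24 19 25 24];
    rate = sum(correct)/(25*numel(correct))*100;   % gives 88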


b) The recognition rates for each word using the Daubechies-8 wavelet with level-7 DWT decomposition for English words are shown in the following table:

Word to be recognized    Number of times the word is correctly recognized    Recognition rate (%)

    Matrix 24 96

    Paste 23 92

    Project 21 84

    Speech 23 92

    Window 24 96

    Distance 22 88

    India 25 100

    Ubuntu 21 84

    Fedora 25 100

    Android 25 100

    Table 8.2: Recognition rates for English words using db8 & level 7 DWT.

The overall recognition rate for English words using the Daubechies-8 wavelet at level 7 is 93.2%.

8.6 FPGA Implementation

The AccelDSP synthesis tool is used to transform the MATLAB design into a hardware module that can be implemented in a Xilinx FPGA.

Fig. 8.6 shows the MATLAB result for the recognized word FEDORA.

Fig. 8.7 shows the FPGA implementation result for the recognized word FEDORA, obtained with the AccelDSP tool in the Xilinx ISE platform.


Fig. 8.6 MATLAB output of speech recognition for the word FEDORA.


Fig. 8.7 FPGA implementation results for the word FEDORA.


    REFERENCES

[1] Trivedi, Saurabh, Sachin and Raman, "Speech Recognition by Wavelet Analysis", International Journal of Computer Applications (0975-8887), Vol. 15, No. 8, February 2011.

[2] Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition".

[3] Hiroaki Sakoe and Seibi Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-26, No. 1, February 1978.

[4] Ingrid Daubechies, "Ten Lectures on Wavelets", SIAM, Philadelphia, 1992.

[5] Ian McLoughlin, "Audio Processing with Matlab Examples".

[6] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets", Communications on Pure and Applied Mathematics, Vol. 41, pp. 909-996, November 1988.

[7] Murali Krishnan, Chris P. Neophytou and Glenn Prescott, "Wavelet Transform Speech Recognition using Vector Quantization, Dynamic Time Warping and Artificial Neural Networks".

[8] George Tzanetakis, Georg Essl and Perry Cook, "Audio Analysis using the Discrete Wavelet Transform", Organised Sound, Vol. 4(3), 2000.

[9] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.

[10] Michael Nilsson and Marcus Ejnarsson, "Speech Recognition using Hidden Markov Model".

[11] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, pp. 674-693, 1989.

[12] Sylvio Barbon Junior, Rodrigo Capobianco Guido, Shi-Huang Chen, Lucimar Sasso Vieira and Fabricio Lopes Sanchez, "Improved Dynamic Time Warping Based on the Discrete Wavelet Transform", Ninth IEEE International Symposium on Multimedia, 2007.

[13] M. Misiti, Y. Misiti, G. Oppenheim and J. Poggi, "Matlab Wavelet Toolbox", The MathWorks, Inc., 2000.

[14] George Tzanetakis, Georg Essl and Perry Cook, "Audio Analysis using the Discrete Wavelet Transform", Organised Sound, Vol. 4(3), 2000.


[15] Mike Brookes, "VOICEBOX: Speech Processing Toolbox for MATLAB", Department of Electrical & Electronic Engineering, Imperial College, London SW7 2BT, UK.

[16] Daryl Ning, "Developing an Isolated Word Recognition System in MATLAB", MATLAB Digest, January 2010.