Speech Recognition Using Wavelets
7/31/2019 Speech Recognition Using Wavelets
Abstract
Although speech recognition systems have come a long way in the last forty years, there is still
room for improvement. Though readily available, these systems are sometimes inaccurate and
insufficient. In an effort to provide a more efficient representation of the speech signal, the
application of wavelet analysis is considered. Here we present an effective and robust method
for extracting features for speech processing. Based on the time-frequency multi-resolution
property of the wavelet transform, the input speech signal is decomposed into various frequency
channels, and the original speech can then be recognized using the wavelet transform. The major
issues concerning the design of this wavelet-based speech recognition system are choosing
optimal wavelets for speech signals, the decomposition level in the DWT, and selecting the
feature vectors from the wavelet coefficients.
Dynamic Time Warping (DTW) is a pattern matching approach that can be used for
limited vocabulary speech recognition, which is based on a temporal alignment of the input
signal with the template models. The main drawback of this method is its high computational
cost when the length of the signals increases. The main aim of this project is to provide a
modified version of the DTW, based on the Discrete Wavelet Transform (DWT), which reduces
its original complexity. The Daubechies wavelet family with level 4 and level 7 decompositions
is experimented with, and the corresponding results are reported.
The proposed approaches are implemented in software and also on an FPGA.
4. WAVELET ANALYSIS .......................................................................................................... 17
4.1 Definition ............................................................................................................................ 17
4.2 Fourier Analysis .................................................................................................................. 18
4.2.1 Limitations .................................................................................................................... 18
4.3 Short-Time Fourier analysis ................................................................................................ 18
4.3.1 Limitations .................................................................................................................... 19
4.4 Types of Wavelets ............................................................................................................... 19
4.4.1 Haar Wavelet ................................................................................................................ 20
4.4.2 Daubechies-N wavelet family ...................................................................................... 20
4.4.3 Advantages of Wavelet Analysis over STFT ................................................................ 22
4.5 Wavelet Transform .............................................................................................................. 22
4.5.1 Discrete Wavelet Transform ......................................................................................... 23
4.5.2 Multilevel Decomposition of Signal............................................................................. 24
4.5.3 Wavelet Reconstruction ................................................................................................ 25
5. FROM SPEECH TO FEATURE VECTORS ........................................................................... 26
5.1 Preprocessing ...................................................................................................................... 26
5.1.1 Pre emphasis ................................................................................................................. 27
5.1.2 Voice Activation Detection (VAD) .............................................................................. 28
5.2 Frame blocking & Windowing ............................................................................................ 29
5.2.1 Frame blocking ............................................................................................................. 30
5.2.2 Windowing ................................................................................................................... 31
5.3 Feature Extraction ............................................................................................................... 32
6. DYNAMIC TIME WARPING ................................................................................................. 34
6.1 DTW Algorithm .................................................................................................................. 34
6.1.1 DP-Matching Principle ................................................................................................. 35
6.1.2 Restrictions on Warping Function ................................................................................ 37
6.1.3 Discussions on Weighting Coefficient ......................................................................... 39
6.2 Practical DP-Matching Algorithm ...................................................................................... 40
6.2.1 DP-Equation ................................................................................................................. 40
6.2.2 Calculation Details ....................................................................................................... 42
7. FPGA Implementation .............................................................................................................. 43
8. SIMULATION & RESULTS ................................................................................................... 45
8.1 Input Signal: ................................................................................................................... 45
8.2 Pre emphasis:.................................................................................................................. 46
8.3 Voice Activation & Detection ........................................................................................ 47
8.4 De-noising: ..................................................................................................................... 49
8.5 Recognition Results: ...................................................................................................... 50
8.6 FPGA Implementation ................................................................... 51
9. CONCLUSION ......................................................................................................................... 54
REFERENCES ............................................................................................................................. 55
List of Tables
Table 8.1: Recognition rates for English words using db8 & level 4 DWT. ................................ 50
Table 8.2: Recognition rates for English words using db8 & level 7 DWT. ................................ 51
List of Figures
Fig. 2.1 Literature survey ................................................................................................................ 7
Fig. 3.1 Schematic diagram of the speech production/perception process ..................................... 9
Fig. 3.2 Human Vocal Mechanism ............................................................................................... 10
Fig. 3.3 Discrete-Time Speech Production Model........................................................................ 11
Fig. 3.4 Three state representation of a speech signal. ................................................................. 13
Fig. 3.5 Spectrogram using Welch's Method ............................................................... 14
Fig. 4.1 Fourier transform ............................................................................................................. 18
Fig. 4.2 Short time Fourier transform ........................................................................................... 19
Fig. 4.3 Haar wavelet .................................................................................................................... 20
Fig. 4.5 Daubechies wavelets........................................................................................................ 21
Fig. 4.6 Comparison of Wavelet analysis over STFT ................................................................... 22
Fig. 4.7 Filter functions ................................................................................................................. 24
Fig. 4.8 Decomposition of DWT Co-efficients ............................................................................ 24
Fig. 4.9 Decomposition using DWT ............................................................................................. 24
Fig. 4.10 Signal Reconstruction .................................................................................................... 25
Fig. 4.11 Signal Decomposition & Reconstruction ...................................................................... 25
Fig. 5.1 Main steps in Feature Extraction ..................................................................................... 26
Fig. 5.2 Pre processing .................................................................................................................. 26
Fig. 5.3 Pre emphasis filter ........................................................................................................... 27
Fig. 5.4 Frame blocking & Windowing ........................................................................................ 30
Fig. 5.5 Frame blocking of a sequence ......................................................................................... 31
Fig. 5.6 Hamming Window .......................................................................................................... 32
Fig. 6.1 warping function & adjusting window definition............................................................ 35
Fig. 6.2 Slope constraint on warping function .............................................................................. 38
Fig. 6.3 Weighting coefficient W(k) ............................................................................................. 40
Fig. 7.1 Synthesis flow in AccelDSP ............................................................................................ 44
Fig. 8.1 Input speech signal .......................................................................................................... 45
Fig. 8.2 Pre emphasis output ......................................................................................................... 46
Fig. 8.3 Voice Activation & Detection ......................................................................................... 47
Fig. 8.4 Speech signal after Voice Activation & Detection .......................................................... 48
Fig. 8.5 Speech signal after de-noising ......................................................................................... 49
Fig. 8.6 Matlab output of Speech Recognition for the word "FEDORA" ......................... 52
Fig. 8.7 FPGA results for the word "FEDORA" ..................................... 53
1. INTRODUCTION
1.1 Definition
Speech recognition is the process of automatically extracting and determining linguistic
information conveyed by a speech signal using computers or electronic circuits. Recent advances
in soft computing techniques give more importance to automatic speech recognition. Large
variation in speech signals and other criteria like native accent and varying pronunciations makes
the task very difficult. ASR is hence a complex task and it requires more intelligence to achieve a
good recognition result. Speech recognition is a topic that is very useful in many applications and
environments in our daily life.
The fundamental purpose of speech is communication, i.e., the transmission of messages.
According to Shannon's information theory, a message represented as a sequence of discrete
symbols can be quantified by its information content in bits, and the rate of transmission of
information is measured in bits per second (bps).
In order for communication to take place, a speaker must produce a speech signal in the
form of a sound pressure wave that travels from the speaker's mouth to a listener's ears. Although
the majority of the pressure wave originates from the mouth, sound also emanates from the
nostrils, throat, and cheeks. Speech signals are composed of a sequence of sounds that serve as a
symbolic representation for a thought that the speaker wishes to relay to the listener. The
arrangement of these sounds is governed by rules associated with a language. The scientific
study of language and the manner in which these rules are used in human communication is
referred to as linguistics. The science that studies the characteristics of human sound production,
especially for the description, classification, and transcription of speech, is called phonetics.
1.2 Application area, Features & Issues
Another application of speech recognition is assisting people with functional disabilities or
other kinds of handicap. To make their daily chores easier, voice control could be helpful:
with their voice they could operate the light switch, turn the coffee machine on or off, or
operate other domestic appliances. This leads to the discussion of intelligent homes, where
these operations can be made available to the common man as well as to the handicapped.
1.2.1 Features
Speech input is easy to perform because it does not require a specialized skill, as typing
or pushbutton operation does.
Information can be input even when the user is moving or doing other activities involving
the hands, legs, eyes, or ears.
Since a microphone or telephone can be used as an input terminal, inputting information is
economical, with remote input possible over existing telephone networks and the Internet.
1.2.2 Issues
A lot of redundancy is present in the speech signal, which makes discriminating between the
classes difficult.
Temporal and frequency variability is present, such as intra-speaker variability in the
pronunciation of words and phonemes, as well as inter-speaker variability, e.g. the effect of
regional dialects.
Pronunciation of the phonemes is context dependent (co-articulation).
Signal degradation occurs due to additive and convolutive noise present in the background or
in the channel.
Signal distortion arises due to non-ideal channel characteristics.
1.3 Recognition Systems
Recognition systems may be designed in many modes to achieve specific objectives or
performance criteria.
1.3.1 Speaker Dependent / Independent System
For speaker-dependent systems, the user is asked to utter predefined words or sentences.
These acoustic signals form the training data, which are used for recognition of the input speech.
Since these systems are used by only a predefined speaker, their performance is higher
compared to speaker-independent systems.
1.3.2 Isolated Word Recognition
This is also called a discrete recognition system. In this system, there has to be a pause
between uttered words, so the system does not have to find boundaries between words.
1.3.3 Continuous Speech Recognition
These systems are the ultimate goal of a recognition process. No matter how or when a
word is uttered, it is recognized in real time and an action is performed accordingly. Changes
in speaking rate, careless pronunciation, detecting word boundaries, and real-time issues are
the main problems for this recognition mode.
1.3.4 Vocabulary Size
The smaller the vocabulary of a recognition system, the higher the recognition
performance. Specific tasks may use small vocabularies; however, a natural system should
perform speaker-independent continuous recognition over a large vocabulary, which is the most
difficult case.
1.3.5 Keyword Spotting
These systems are used to detect a word in continuous speech. For this reason they may be
as good as isolated-word recognition while also being capable of handling continuous speech.
Word recognition systems commonly carry out some kind of classification based on speech
features, which are usually obtained via Fourier Transforms (FTs), Short-Time Fourier
Transforms (STFTs), or Linear Predictive Coding techniques. However, these methods have some
disadvantages: they assume the signal is stationary within a given time frame and may
therefore lack the ability to analyze localized events correctly. The wavelet transform copes
with some of these problems. Other factors favoring Wavelet Transforms (WT) over conventional
methods include their ability to capture localized features. In this work, the Discrete
Wavelet Transform is used for speech processing.
The speech recognizer implemented in Matlab was used for simulation, as if it were operating
in a real environment. Simulation recordings are taken in an open environment to obtain real
data.
In the future it could be possible to use this information to create a chip that could serve
as a new human interface. For example, one could get rid of all the remote controls in the
home and simply tell the television, stereo, or any other device what to do by voice.
1.4 Objectives
This project covers speaker-independent, small-vocabulary speech recognition with the help of
wavelet analysis using the Dynamic Time Warping method. The project consists of two phases:
1) Training phase: a number of words are trained to extract a model for each word.
2) Recognition phase: a sequence of connected words is entered via microphone or an input
file, and the system tries to recognize these words.
1.5 Outline
The outline of this thesis is as follows.
Chapter 2 - Literature Survey:
This chapter discusses the trends and technologies followed to improve speech recognition
performance.
Chapter 3 - The Speech Signal:
This chapter discusses how the production and perception of speech are performed.
Topics related to this chapter are speech production, speech representation, characteristics
of the speech signal, and perception.
Chapter 4 - Wavelet Analysis:
This chapter discusses what a wavelet is, the types of wavelets available, which types are
used, why wavelets were introduced, and wavelet decomposition. Some topics related to this
chapter are Fourier analysis, the STFT, types of wavelets, and the wavelet transform.
Chapter 5 - From Speech to Feature Vectors
This chapter covers the fundamental signal processing applied in a speech recognizer. Some
topics related to this chapter are pre-processing, frame blocking and windowing, and feature
extraction.
Chapter 6 - Dynamic Time Warping:
Aspects of this chapter are the theory and implementation of the pattern matching technique
referred to as Dynamic Time Warping. Some topics related to this chapter are the DTW
algorithm and the DP matching algorithm.
Chapter 7 - FPGA Implementation:
This chapter describes the FPGA implementation of the speech recognition system using the
AccelDSP tool in Xilinx ISE.
Chapter 8 - Simulation & Results:
In this chapter the speech recognizer implemented in Matlab is used to test the recognizer in
different cases and determine its efficiency.
Chapter 9 - Conclusions
This chapter summarizes the whole project.
2. LITERATURE SURVEY
Designing a machine that mimics human behavior, particularly the capability of speaking
naturally and responding properly to spoken language, has intrigued engineers and scientists for
centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a system model
for speech analysis and synthesis, the problem of automatic speech recognition has been
approached progressively, from a simple machine that responds to a small set of sounds to a
sophisticated system that responds to fluently spoken natural language and takes into account the
varying statistics of the language in which the speech is produced. Based on major advances in
statistical modeling of speech in the 1980s, automatic speech recognition systems today find
widespread application in tasks that require a human-machine interface, such as automatic call
processing in the telephone network and query-based information systems that do things like
provide updated travel information, stock price quotations, weather reports, etc.
Speech is the primary means of communication between people. For reasons ranging
from technological curiosity about the mechanisms for mechanical realization of human speech
capabilities, to the desire to automate simple tasks inherently requiring human-machine
interactions, research in automatic speech recognition (and speech synthesis) by machine has
attracted a great deal of attention over the past five decades.
2.1 Advancement in technology
Fig. 2.1 shows a timeline of progress in speech recognition and understanding technology
over the past several decades. We see that in the 1960s we were able to recognize small
vocabularies (order of 10-100 words) of isolated words, based on simple acoustic-phonetic
properties of speech sounds. The key technologies that were developed during this time frame
were filter-bank analyses, simple time normalization methods, and the beginnings of
sophisticated dynamic programming methodologies. In the 1970s we were able to recognize
medium vocabularies (order of 100-1000 words) using simple template-based, pattern
recognition methods [3]. The key technologies that were developed during this period were the
pattern recognition models, the introduction of LPC methods for spectral representation, the
pattern clustering methods for speaker-independent recognizers, and the introduction of dynamic
programming methods for solving connected word recognition problems. In the 1980s we
started to tackle large vocabulary (1000-unlimited number of words) speech recognition
problems based on statistical methods, with a wide range of networks for handling language
structures. The key technologies introduced during this period were the hidden Markov model
(HMM) [9] and the stochastic language model, which together enabled powerful new methods
for handling virtually any continuous speech recognition problem efficiently and with high
performance. In the 1990s we were able to build large vocabulary systems with unconstrained
language models, and constrained task syntax models for continuous speech recognition and
understanding. The key technologies developed during this period were the methods for
stochastic language understanding, statistical learning of acoustic and language models, and the
introduction of the finite state transducer framework (and the FSM Library) and the methods for
their determinization and minimization for efficient implementation of large vocabulary speech
understanding systems.
Fig. 2.1 Literature survey
Finally, in the last few years, we have seen the introduction of very large vocabulary
systems with full semantic models, integrated with text-to-speech (TTS) synthesis systems, and
multi-modal inputs (pointing, keyboards, mice, etc.). These systems enable spoken dialog
systems with a range of input and output modalities for ease-of-use and flexibility in handling
adverse environments where speech might not be as suitable as other input-output modalities.
During this period we have seen the emergence of highly natural speech synthesis systems, the
use of machine learning to improve both speech understanding and speech dialogs, and the
introduction of mixed-initiative dialog systems to enable user control when necessary.
After nearly five decades of research, speech recognition technologies have finally
entered the marketplace, benefiting the users in a variety of ways. Throughout the course of
development of such systems, knowledge of speech production and perception was used in
establishing the technological foundation for the resulting speech recognizers. Major advances,
however, were brought about in the 1960s and 1970s via the introduction of advanced speech
representations based on LPC analysis and cepstral analysis methods, and in the 1980s through
the introduction of rigorous statistical methods based on hidden Markov models [9]. All of this
came about because of significant research contributions from academia, private industry and the
government. As the technology continues to mature, it is clear that many new applications will
emerge and become part of our way of life, thereby taking full advantage of machines that are
partially able to mimic human speech capabilities.
3. THE SPEECH SIGNAL
This chapter intends to discuss how the speech signal is produced and perceived by
human beings. This is an essential subject that has to be considered before one can pursue and
decide which approach to use for speech recognition.
3.1 Speech production
Human communication can be seen as a comprehensive process from speech production to speech
perception between the talker and the listener, as in Fig. 3.1 [2].
Fig. 3.1 Schematic diagram of the speech production/perception process
Five different elements (A. Speech formulation, B. Human vocal mechanism, C. Acoustic air,
D. Perception of the ear, E. Speech comprehension) will be examined more carefully in the
following sections.
The first element (A. Speech formulation) is associated with the formulation of the
speech signal in the talker's mind. This formulation is used by the human vocal mechanism (B.
Human vocal mechanism) to produce the actual speech waveform. The waveform is transferred
via the air (C. Acoustic air) to the listener. During this transfer the acoustic wave can be
affected by external sources, for example noise, resulting in a more complex waveform. When
the wave
reaches the listener's hearing system (the ears), the listener perceives the waveform (D.
Perception of the ear) and the listener's mind (E. Speech comprehension) starts processing
this waveform to comprehend its content, so the listener understands what the talker is
trying to tell him or her.
Fig. 3.2 Human Vocal Mechanism
To be able to understand how speech production is performed, one needs to know how the human
vocal mechanism is constructed, as in Fig. 3.2.
The most important parts of the human vocal mechanism are the vocal tract together with the
nasal cavity, which begins at the velum. The velum is a trapdoor-like mechanism that is used
to form nasal sounds when needed. When the velum is lowered, the nasal cavity is coupled with
the vocal tract to form the desired speech signal. The cross-sectional area of the vocal
tract is limited by the tongue, lips, jaw, and velum and varies from 0 to 20 cm².
When humans produce speech, air is expelled from the lungs through the trachea. The air
flowing from the lungs causes the vocal cords to vibrate and by forming the vocal tract, lips,
tongue, jaw and maybe using the nasal cavity, different sounds can be produced.
Important parts of the discrete-time speech production model, in the field of speech
recognition and signal processing, are u(n), the gain b0, and H(z). The impulse generator
acts like the lungs, exciting the glottal filter G(z) and producing u(n). G(z) can be
regarded as the vocal cords in the human vocal mechanism. The signal u(n) can be seen as the
excitation signal entering the vocal tract and the nasal cavity, formed by exciting the vocal
cords with air from the lungs.
Fig. 3.3 Discrete-Time Speech Production Model
The gain b0 is a factor related to the volume of the speech being produced: a larger gain b0
gives louder speech and vice versa. The vocal tract filter H(z) is a model of the vocal tract
and the nasal cavity. The lip radiation filter R(z) is a model of the formation of the human
lips to produce different sounds.
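The source-filter chain described above can be sketched in a few lines. This is a hedged Python toy (the thesis's model is in Matlab): the one-pole filter is only a crude stand-in for G(z) and H(z), and the first difference approximates the high-pass lip radiation R(z).

```python
def impulse_train(n_samples, period):
    """Excitation from the 'lungs': one pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def one_pole(x, a):
    """Simple recursive filter y(n) = x(n) + a*y(n-1); a crude stand-in
    for the glottal filter G(z) or the vocal tract filter H(z)."""
    y, prev = [], 0.0
    for v in x:
        prev = v + a * prev
        y.append(prev)
    return y

def lip_radiation(x):
    """R(z) approximated as a first difference: y(n) = x(n) - x(n-1)."""
    return [x[n] - (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

b0 = 0.5                          # gain b0 controls loudness
e = impulse_train(80, 20)         # pitch period of 20 samples
u = one_pole(e, 0.9)              # u(n): glottal pulse shaping, G(z)
s = lip_radiation(one_pole([b0 * v for v in u], 0.7))  # H(z) then R(z)
print(len(s))                     # 80 synthetic speech-like samples
```

Raising b0 scales the whole output, matching the statement that a larger gain gives louder speech.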
3.2 Speech Representation
The speech signal and all its characteristics can be represented in two different domains,
the time and the frequency domain.
A speech signal is a slowly time-varying signal in the sense that, when examined over a
short period of time (between 5 and 100 ms), its characteristics are short-time stationary.
This is not the case if we look at a speech signal over a longer time perspective
(approximately T > 0.5 s). In this case the signal's characteristics are non-stationary,
meaning that they change to reflect the different sounds spoken by the talker.
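This short-time stationarity is what motivates frame-based analysis later in the processing chain. A minimal Python sketch (the 25 ms frame length and 8 kHz sampling rate are illustrative assumptions, not values taken from this work):

```python
def frames(signal, fs, frame_ms=25):
    """Split a signal into short, non-overlapping frames within which
    speech is assumed (quasi-)stationary."""
    n = int(fs * frame_ms / 1000)          # samples per frame
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]

fs = 8000                  # assumed sampling rate
x = [0.0] * fs             # one second of (dummy) signal
f = frames(x, fs)          # 25 ms -> 200 samples per frame
print(len(f), len(f[0]))   # 40 frames of 200 samples
```

In practice frames overlap and are windowed, as described in Chapter 5.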
To be able to use a speech signal and interpret its characteristics properly, some kind of
representation of the speech signal is preferred. The speech representation can exist in
either the time or frequency domain, and in three different forms: a three-state
representation, a spectral representation, and a parameterization of the spectral activity.
3.2.1 Three-state Representation
The three-state representation is one way to classify events in speech. The events of
interest for the three-state representation are:
Silence (S) - No speech is produced.
Unvoiced (U) - Vocal cords are not vibrating, resulting in an aperiodic or random speech
waveform.
Voiced (V) - Vocal cords are tensed and vibrating periodically, resulting in a speech
waveform that is quasi-periodic.
Quasi-periodic means that the speech waveform can be seen as periodic over a short-time
period (5-100 ms) during which it is stationary.
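A frame can be assigned to one of the three states from simple measurements such as short-time energy and zero crossing rate. The following Python sketch is only illustrative; the thresholds are arbitrary assumptions, not values used in this work.

```python
import math

def classify_frame(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Toy silence/unvoiced/voiced decision for one frame.
    Thresholds are illustrative, not tuned values."""
    energy = sum(s * s for s in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)
    if energy < energy_thresh:
        return "S"    # silence: negligible energy
    if zcr > zcr_thresh:
        return "U"    # unvoiced: noise-like, many zero crossings
    return "V"        # voiced: periodic, few zero crossings

# A slow sine stands in for a voiced (quasi-periodic) frame.
voiced = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
print(classify_frame([0.0] * 200))   # S
print(classify_frame(voiced))        # V
```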
Fig. 3.4 Three state representation of a speech signal.
The upper plot, Fig. 3.4(a), contains the whole speech sequence, and the middle plot, Fig.
3.4(b), reproduces part of it by zooming in on an area of the whole sequence. At the bottom
of Fig. 3.4 the segmentation into a three-state representation, relative to the different
parts of the middle plot, is given.
3.2.2 Spectral Representation
A spectral representation shows speech intensity over time; the most popular one is the
sound spectrogram, see Fig. 3.5.
Fig. 3.5 Spectrogram using Welch's Method
Here the darkest (dark blue) parts represent the parts of the speech waveform where no
speech is produced, and the lighter (red) parts represent higher intensity where speech is
produced.
3.3.2 Fundamental Frequency
The time between successive vocal fold openings is called the fundamental period T0,
while the rate of vibration is called the fundamental frequency of the phonation, F0 = 1/T0.
Voiced excitation of a speech sound results in a pulse train at the so-called
fundamental frequency. Voiced excitation is used when articulating vowels and some of the
consonants. For fricatives (e.g., /f/ as in fish or /s/ as in mess), unvoiced excitation
(noise) is used. In these cases, usually no fundamental frequency can be detected; on the
other hand, the zero crossing rate of the signal is very high. Plosives (like /p/ as in put),
which use transient excitation, are best detected in the speech signal by looking for the
short silence necessary to build up the air pressure before the plosive bursts out.
3.3.3 Peaks in the Spectrum
After passing the glottis, the vocal tract gives a characteristic spectral shape to the speech
signal. If one simplifies the vocal tract to a straight pipe (the length is about 17 cm), one can see that the pipe shows resonances at roughly odd multiples of 500 Hz (about 500, 1500, 2500 Hz, ...). Depending on the shape of the vocal tract (the diameter of the pipe changes along the pipe), the frequencies of the formants (especially of the 1st and 2nd formant) change and therefore characterize the vowel being articulated.
3.3.4 The Envelope of the Power Spectrum
The pulse sequence from the glottis has a power spectrum decreasing towards higher frequencies by -12 dB per octave. The emission characteristics of the lips show a high-pass characteristic of +6 dB per octave. Thus, this results in an overall decrease of -6 dB per octave.
3.4 Speech perception process
The microphone.cs class is responsible for accepting input from a microphone and forwarding it to the feature extraction module. Before converting the signal into a suitable or desired form, it is important to identify the segments of the sound containing words. The audio.cs class deals with all tasks needed for converting a wave file to a stream of digits and vice versa. It also has a provision
of saving the sound into WAV files.
4. WAVELET ANALYSIS
4.1 Definition
A wavelet is a wave-like oscillation with amplitude that starts out at zero, increases, and
then decreases back to zero. It can typically be visualized as a "brief oscillation" like one might
see recorded by a seismograph or heart monitor. Generally, wavelets are purposefully crafted to
have specific properties that make them useful for signal processing. Wavelets can be combined,
using a "reverse, shift, multiply and sum" technique called convolution, with portions of an
unknown signal to extract information from the unknown signal.
The fundamental idea behind wavelets is to analyze according to scale. The wavelet
analysis procedure is to adopt a wavelet prototype function called an analyzing wavelet or
mother wavelet. Any speech signal can then be represented by translated and scaled versions of
the mother wavelet. Wavelet analysis is capable of revealing aspects of data that other speech signal analysis techniques miss. The extracted features are then passed to a classifier for the recognition of isolated words [4].
The integral wavelet transform is the integral transform defined as:
W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi\!\left(\frac{t - b}{a}\right) dt    Equation 4.1

Where a is positive and defines the scale and b is any real number and defines the shift.
For decomposition of speech signal, we can use different techniques like Fourier analysis,
STFT (Short-Time Fourier Transform), and wavelet transform techniques.
Here, we have explained the necessity and advantages of Wavelet Analysis by first
considering the Fourier analysis, its limitations, its modification to Short Time Fourier
Transform, its limitations and finally the Wavelet Analysis.
4.2 Fourier Analysis
Fourier analysis breaks down a signal into constituent sinusoids of different frequencies.
It is a mathematical technique for transforming a signal from a time-based one to a frequency-
based one. The Fourier transform of a sinusoidal signal is depicted in Fig. 4.1 below.

X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt    Equation 4.2
Fig. 4.1 Fourier transform
4.2.1 Limitations
But Fourier analysis has a serious drawback. In transforming to the frequency domain,
time information is lost. When looking at a Fourier transform of a signal, it is impossible to tell
when a particular event took place. If a signal does not change much over time, i.e. if it is what is called a stationary signal, this drawback is not very important. However, most interesting signals
contain numerous non-stationary or transitory characteristics: drift, trends, abrupt changes, and
beginnings and ends of events. These characteristics are often the most important part of the
signal, and Fourier analysis is not suited to detecting them.
4.3 Short-Time Fourier Analysis
The Short-Time Fourier Transform (STFT) maps a signal into a two-dimensional function of time and frequency by a technique called windowing the signal. Mathematically it is given by:

X(m, \omega) = \sum_{n} x[n]\, w[n - m]\, e^{-j\omega n}    Equation 4.3

Where the signal is x[n] and the window is w[n].
Short-Time Fourier Transform of a random signal is shown in Fig. 4.2 below.
Fig. 4.2 Short time Fourier transform
The STFT represents a sort of compromise between the time- and frequency-based views
of a signal. It provides some information about both when and at what frequencies a signal event
occurs.
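A minimal numerical sketch of Equation 4.3 (illustrative only; a practical implementation would use an FFT): at each hop, the signal is windowed and an N-point DFT of the block is taken.

```python
import cmath

def stft(x, win, hop):
    """Short-Time Fourier Transform sketch.

    For each block start m, window the signal and take an N-point DFT
    of the block, which evaluates sum_n x[n] w[n-m] e^{-jwn} at N
    discrete frequencies (up to a constant phase factor per block).
    """
    N = len(win)
    frames = []
    for m in range(0, len(x) - N + 1, hop):
        block = [x[m + n] * win[n] for n in range(N)]
        frames.append([sum(block[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                           for n in range(N))
                       for k in range(N)])
    return frames
```

For a sinusoid with a 4-point rectangular window, all the energy lands in bin 1 of each frame; the choice of N is exactly the time-frequency trade-off discussed in 4.3.1.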
4.3.1 Limitations
However, you can only obtain this information with limited precision, and that precision
is determined by the size of the window. While the STFT's compromise between time and frequency information can be useful, the drawback is that once you choose a particular size for the time window, that window is the same for all frequencies. If a wider window is chosen, it gives better frequency resolution but poor time resolution, while a narrower window gives
good time resolution but poor frequency resolution. Many signals require a more flexible
approach - one where we can vary the window size to determine more accurately either time or
frequency.
4.4 Types of Wavelets
Different types of wavelets are Haar wavelets, Daubechies wavelets, biorthogonal
wavelets, Coiflet wavelets, Symlet wavelets, Morlet wavelets, Mexican Hat wavelets and Meyer
wavelets.
Wavelets mainly used in speech recognition are discussed here.
4.4.1 Haar Wavelet
The Haar wavelet is the first and simplest. Haar is discontinuous and resembles a step function. It represents the same wavelet as Daubechies db1.
The Haar wavelet family for t \in [0, 1] is defined as follows:

h_i(t) = \begin{cases} 2^{j/2}, & k/m \le t < (k + 0.5)/m \\ -2^{j/2}, & (k + 0.5)/m \le t < (k + 1)/m \\ 0, & \text{otherwise} \end{cases}    Equation 4.4

Integer m = 2^j (j = 0, 1, ..., J) indicates the level of the wavelet; k = 0, 1, ..., m - 1 is the translation parameter. The maximal level of resolution is J.
Fig. 4.3 Haar wavelet
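Equation 4.4 transcribes directly to code (a sketch; the 2^{j/2} factor is the usual orthonormal scaling):

```python
def haar(j, k, t):
    """Haar wavelet h_{j,k}(t) on [0, 1).

    With m = 2**j, the value is +2^{j/2} on the first half of the
    support [k/m, (k+1)/m), -2^{j/2} on the second half, and 0
    elsewhere; the 2^{j/2} factor keeps the family orthonormal.
    """
    m = 2 ** j
    lo, mid, hi = k / m, (k + 0.5) / m, (k + 1) / m
    if lo <= t < mid:
        return m ** 0.5
    if mid <= t < hi:
        return -(m ** 0.5)
    return 0.0
```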
4.4.2 Daubechies-N wavelet family
The Daubechies wavelets are a family of orthogonal wavelets defining a discrete wavelet transform and characterized by a maximal number of vanishing moments for some given support. With each wavelet type of this class, there is a scaling function (also called the father wavelet) which generates an orthogonal multiresolution analysis. The Daubechies wavelet is one
of the popular wavelets and has been used for speech recognition [4].
In general the Daubechies wavelets are chosen to have the highest number A of vanishing moments (this does not imply the best smoothness) for given support width N = 2A, and among the 2^{A-1} possible solutions the one is chosen whose scaling filter has extremal phase. The wavelet
transform is also easy to put into practice using the fast wavelet transform. Daubechies wavelets
are widely used in solving a broad range of problems, e.g. self-similarity properties of a signal
or fractal problems, signal discontinuities, etc.
The Daubechies wavelets properties [6]:
The support length of the wavelet function \psi and the scaling function \phi is 2N - 1. The number of vanishing moments of \psi is N.
Most dbN are not symmetrical.
The regularity increases with the order. When N becomes very large, \psi and \phi belong to C^{\mu N}, where \mu is approximately equal to 0.2.
The Daubechies-8 wavelet is used for decomposition of the speech signal as it needs the minimum support size for the given number of vanishing moments.
The names of the Daubechies family wavelets are written dbN, where N is the order, and
db the surname of the wavelet. The db1 wavelet, as mentioned above, is the same as Haar.
Here are the next nine members of the family:
Fig. 4.5 Daubechies wavelets
4.4.3 Advantages of Wavelet Analysis over STFT
Wavelet analysis represents the next logical step: a windowing technique with variable-
sized regions. Wavelet analysis allows the use of long time intervals where we want more precise
low frequency information, and shorter regions where we want high frequency information.
Fig. 4.6 Comparison of Wavelet analysis over STFT
The time-based, frequency-based and STFT views of a signal are given with respect to
that of Wavelet analysis. One major advantage afforded by wavelets is the ability to perform
local analysis, i.e., to analyze a localized area of a larger signal.
4.5 Wavelet Transform
The transform of a signal is just another form of representing the signal. It does not
change the information content present in the signal. For many signals, the low-frequency part
contains the most important part. It gives an identity to a signal. Consider the human voice. If we
remove the high-frequency components, the voice sounds different, but we can still tell what is being said. In wavelet analysis, we often speak of approximations and details. The
approximations are the high-scale, low-frequency components of the signal. The details are the low-scale, high-frequency components.

W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt    Equation 4.5

Where \psi(t) is a time function with finite energy and fast decay called the mother wavelet.
4.5.1 Discrete Wavelet Transform
The Discrete Wavelet Transform (DWT) involves choosing scales and positions based on powers of two, the so-called dyadic scales and positions. The mother wavelet is rescaled, or dilated, by powers of two and translated by integers. Specifically, a function f(t) \in L^2(R) (the space of square-integrable functions) can be represented as [1]:

f(t) = \sum_{k} a(L,k)\, 2^{-L/2}\, \phi(2^{-L}t - k) + \sum_{j=1}^{L} \sum_{k} d(j,k)\, 2^{-j/2}\, \psi(2^{-j}t - k)    Equation 4.6

The function \psi(t) is known as the mother wavelet, while \phi(t) is known as the scaling function. The set of functions \{ 2^{-L/2}\phi(2^{-L}t - k),\ 2^{-j/2}\psi(2^{-j}t - k) : j \le L;\ j, k, L \in Z \}, where Z is the set of integers, is an orthonormal basis for L^2(R). The numbers a(L,k) are known as the approximation coefficients at scale L, while d(j,k) are known as the detail coefficients at scale j. The approximation and detail coefficients can be expressed as:

a(L,k) = \frac{1}{\sqrt{2^L}} \int_{-\infty}^{\infty} f(t)\, \phi(2^{-L}t - k)\, dt    Equation 4.7

d(j,k) = \frac{1}{\sqrt{2^j}} \int_{-\infty}^{\infty} f(t)\, \psi(2^{-j}t - k)\, dt    Equation 4.8

The DWT analysis can be performed using a fast, pyramidal algorithm related to multi-
rate filter-banks. As a multi-rate filter-bank the DWT can be viewed as a constant Q filter-bank
with octave spacing between the centers of the filters. Each sub-band contains half the samples
of the neighboring higher frequency sub-band. In the pyramidal algorithm the signal is analyzed
at different frequency bands with different resolution by decomposing the signal into a coarse
approximation and detail information. The coarse approximation is then further decomposed
using the same wavelet decomposition step. This is achieved by successive high-pass and low-pass filtering of the time domain signal and is defined by the following equations:

y_{low}[n] = \sum_{k} x[k]\, g[2n - k]    Equation 4.9

y_{high}[n] = \sum_{k} x[k]\, h[2n - k]    Equation 4.10
Fig. 4.7 Filter functions
The signal x[n] is passed through low-pass and high-pass filters and downsampled by 2:

y_{low}[n] = (x * g) \downarrow 2    Equation 4.11

y_{high}[n] = (x * h) \downarrow 2    Equation 4.12

In the DWT, each level is calculated by passing the previous approximation coefficients through high-pass and low-pass filters.
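Equations 4.11-4.12 can be sketched directly. The Haar filter pair stands in here for the dbN pair used in the project, and the convolution is the plain finite-length one:

```python
def analysis_step(x, g, h):
    """One DWT level: filter with low-pass g and high-pass h, then
    keep every second output sample (downsample by 2)."""
    def conv(x, f):
        return [sum(f[k] * x[n - k]
                    for k in range(len(f)) if 0 <= n - k < len(x))
                for n in range(len(x))]
    return conv(x, g)[1::2], conv(x, h)[1::2]

# Haar filter pair as a simple stand-in for a Daubechies filter pair
S = 2 ** 0.5
G = [1 / S, 1 / S]   # low-pass (scaling) filter
H = [1 / S, -1 / S]  # high-pass (wavelet) filter
```

For a constant signal, all the energy goes to the approximation channel and the detail channel is zero, matching the intuition that details capture high-frequency change.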
4.5.2 Multilevel Decomposition of Signal
A signal can be decomposed using Wavelet Analysis as Shown below [11]:
Fig. 4.8 Decomposition of DWT Co-efficients
Fig. 4.9 Decomposition using DWT
The DWT is computed by successive low-pass and high-pass filtering of the discrete
time-domain signal as shown in figure 4.8 and 4.9. This is called the Mallat algorithm or Mallat-
tree decomposition.
4.5.3 Wavelet Reconstruction
Recovering the original signal with no (or minimal) loss of information is called reconstruction. It can be done by the inverse discrete wavelet transform (IDWT). Whereas wavelet analysis involves filtering and downsampling, the wavelet reconstruction process consists of upsampling and filtering. Upsampling is the process of lengthening a signal component by inserting zeros between samples.
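For the Haar case the analysis/synthesis pair is short enough to verify perfect reconstruction end to end (a sketch; dbN filters follow the same upsample-and-filter pattern):

```python
import math

def haar_analysis(x):
    """Single-level Haar DWT: scaled pairwise sums (approximation)
    and pairwise differences (details)."""
    s = math.sqrt(2.0)
    a = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    d = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return a, d

def haar_synthesis(a, d):
    """IDWT: interleave (upsample) and invert the filters; for Haar
    this reconstructs the original even-length signal exactly."""
    s = math.sqrt(2.0)
    x = []
    for ai, di in zip(a, d):
        x.append((ai + di) / s)
        x.append((ai - di) / s)
    return x
```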
Fig. 4.10 Signal Reconstruction
Fig. 4.11 Signal Decomposition & Reconstruction
5. FROM SPEECH TO FEATURE VECTORS
The main objective of this stage is to extract the important features that are enough for
the recognizer to recognize the words. This chapter describes how to extract information from a
speech signal, which means creating feature vectors from the speech signal. A wide range of
possibilities exist for parametrically representing a speech signal and its content. The main steps
for extracting information are preprocessing, frame blocking & windowing and feature
extraction [1].
Fig. 5.1 Main steps in Feature Extraction
5.1 Preprocessing
This step is the first step to create feature vectors. The objective in the pre-processing is
to modify the speech signal, x (n), so that it will be more suitable for the feature extraction
analysis. The preprocessing operations (noise cancelling, pre-emphasis and voice activation detection) can be seen in Fig. 5.2 below.
Fig. 5.2 Pre processing
The first thing to consider is if the speech, x (n), is corrupted by some noise, d(n), for
example an additive disturbance x (n) = s (n) + d (n), where s (n) is the clean speech signal.
There are several approaches to perform noise reduction on a noisy speech signal. Two commonly used noise reduction algorithms in the speech recognition context are spectral subtraction and adaptive noise cancellation. A low signal-to-noise ratio (SNR) decreases the
performance of the recognizer in a real environment. Some changes to make the speech
recognizer more noise robust will be presented later. Note that the order of the operations might
be reordered for some tasks. For example the noise reduction algorithm, spectral subtraction, is
better placed last in the chain (it needs the voice activation detection).
5.1.1 Pre-emphasis
There is a need to spectrally flatten the signal. The pre-emphasizer, often realized as a first-order high-pass FIR filter, is used to emphasize the higher frequency components.
The second stage in feature extraction is to boost the amount of energy in the high frequencies. If we look at the spectrum of voiced segments like vowels, there is more energy at the lower frequencies than at the higher frequencies. This drop in energy across frequencies (called spectral tilt) is caused by the nature of the glottal pulse. Boosting the high frequency energy makes information from these higher formants more available to the acoustic model and improves phone detection accuracy.
Fig. 5.3 Pre-emphasis filter
The pre emphasizer is used to spectrally flatten the speech signal. This is usually done by
a high-pass filter. The most commonly used filter for this step is the FIR filter described below:

H(z) = 1 - 0.95\, z^{-1}    Equation 5.1
The filter response for this FIR filter can be seen in Fig. 5.3. The filter in the time domain is h(n) = {1, -0.95}, and filtering in the time domain gives the pre-emphasized signal s1(n):

s_1(n) = s(n) - 0.95\, s(n - 1)    Equation 5.2
The pre-emphasis filter is shown in Fig. 5.3.
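Equation 5.2 in code (a sketch; the first sample is passed through unchanged since s(-1) is undefined):

```python
def pre_emphasis(s, alpha=0.95):
    """First-order high-pass FIR: s1(n) = s(n) - alpha * s(n-1)."""
    return [s[0]] + [s[n] - alpha * s[n - 1] for n in range(1, len(s))]
```

A constant (low-frequency) input is almost cancelled while fast changes pass through, which is exactly the spectral flattening described above.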
5.1.2 Voice Activation Detection (VAD)
The problem of locating the endpoints of an utterance in a speech signal is a major
problem for the speech recognizer. Inaccurate endpoint detection will decrease the performance
of the speech recognizer. The problem of detecting endpoints seems to be relatively trivial, but it has been found to be very difficult in practice. Only when a fair SNR is given is the task made
easier. Some commonly used measurements for finding speech are short-term energy estimate
Es1, or short-term power estimate Ps1, and short term zero crossing rate Zs1. For the speech
signal s1(n) these measures are calculated as follows [1]:
E_{s1}(m) = \sum_{n=m-L+1}^{m} s_1(n)^2    Equation 5.3

P_{s1}(m) = \frac{1}{L} \sum_{n=m-L+1}^{m} s_1(n)^2    Equation 5.4

Z_{s1}(m) = \frac{1}{2L} \sum_{n=m-L+1}^{m} \left| \mathrm{sgn}(s_1(n)) - \mathrm{sgn}(s_1(n-1)) \right|    Equation 5.5

Where:

\mathrm{sgn}(s_1(n)) = \begin{cases} 1, & s_1(n) \ge 0 \\ -1, & s_1(n) < 0 \end{cases}    Equation 5.6

For each block of L samples these measures calculate some value. Note that the index for
these functions is m and not n; this is because these measures do not have to be calculated for every sample (the measures can, for example, be calculated every 20 ms). The short-term energy
estimate will increase when speech is present in s1 (n). This is also the case with the short-term
power estimate; the only thing that separates them is scaling with 1/L when calculating the short-
term power estimate. The short-term zero crossing rate gives a measure of how many times the signal, s1(n), changes sign. The zero crossing rate tends to be larger during unvoiced regions.
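Equations 5.3-5.5 for one block of L samples can be sketched as (with sgn as in Eq. 5.6):

```python
def short_term_measures(block):
    """Short-term energy E, power P = E/L, and zero-crossing rate Z
    (Eqs. 5.3-5.5) for one block of L samples."""
    L = len(block)
    sgn = lambda v: 1 if v >= 0 else -1
    E = sum(v * v for v in block)
    P = E / L
    Z = sum(abs(sgn(block[n]) - sgn(block[n - 1]))
            for n in range(1, L)) / (2 * L)
    return E, P, Z
```

An alternating-sign block gives maximal Z, the unvoiced-like case the text describes.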
These measures need some triggers for making decisions about where the utterances
begin and end. To create a trigger, one needs some information about the background noise. This
is done by assuming that the first 10 blocks are background noise. With this assumption the
mean and variance of the measures are calculated. For a more convenient approach, the following function is used:
W_{s1}(m) = P_{s1}(m)\,(1 - Z_{s1}(m))\, S_c    Equation 5.7
Using this function both the short-term power and the zero crossing rates will be taken
into account. Sc is a scale factor for avoiding small values; in a typical application, Sc = 1000.
The trigger for this function can be described as:
t_W = \mu_W + \alpha\, \sigma_W^2    Equation 5.8

Here \mu_W is the mean and \sigma_W^2 is the variance of W_{s1}(m), calculated over the first 10 blocks. The term \alpha is a constant that has to be fine-tuned according to the characteristics of the signal. After some testing, the following approximation of \alpha gives pretty good voice activation detection at various levels of additive background noise:

Equation 5.9

The voice activation detection function, VAD(m), can now be found as:

VAD(m) = \begin{cases} 1, & W_{s1}(m) \ge t_W \\ 0, & W_{s1}(m) < t_W \end{cases}    Equation 5.10
VAD(n) is found by holding VAD(m) constant over each measurement block. For example, if the measures are calculated every 320 samples (block length L = 320), this corresponds to 40 ms at a sampling rate of 8 kHz; the first 320 samples of VAD(n) are then given by VAD(m) with m = 1. Using these settings, VAD(n) is calculated for the speech signal containing the word file shown in the results.
5.2 Frame blocking & Windowing
Speech is a non-stationary signal, but we can assume it is stationary over 10-20 ms. Framing is used to cut the long-time speech into short-time segments in order to get relatively stable frequency characteristics. Features are extracted periodically. The time over which the signal is considered for processing is called a window, and the data acquired in a window is called a frame. Typically, features are extracted once every 10 ms; this is called the frame rate. The window duration is typically 20 ms. Thus two consecutive frames have
overlapping areas.
Fig. 5.4 Frame blocking & Windowing
5.2.1 Frame blocking
For each utterance of the word, a window duration (Tw) of 320 samples is used for processing at later stages. A frame is formed from the windowed data with a typical frame duration (Tf) of about 200 samples. Since the frame duration is shorter than the window duration, there is an overlap of data, and the percentage overlap is given as:

\%Overlap = \frac{(T_w - T_f) \times 100}{T_w}    Equation 5.11

Each frame is K samples long, with adjacent frames being separated by P samples.
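With the values above (Tw = 320 samples, Tf = 200 samples) the blocking and Eq. 5.11 look like:

```python
def frame_block(x, win_len=320, frame_step=200):
    """Split x into overlapping frames of win_len samples spaced
    frame_step apart; the overlap percentage follows Eq. 5.11."""
    frames = [x[m:m + win_len]
              for m in range(0, len(x) - win_len + 1, frame_step)]
    overlap = (win_len - frame_step) * 100.0 / win_len
    return frames, overlap
```

For these defaults the overlap is (320 - 200) * 100 / 320 = 37.5%.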
Fig. 5.5 Frame blocking of a sequence
By applying frame blocking to the de-noised signal x(k), one gets M vectors of length K, which correspond to x(k; m) where k = 0, 1, ..., K-1 and m = 0, 1, ..., M-1.
5.2.2 Windowing
Windowing is used to minimize signal distortion by tapering the signal to zero at the beginning and end of each frame, i.e. to reduce the signal discontinuity at either end of the block.
The rectangular window (i.e. no window) can cause problems when we do Fourier analysis; it abruptly cuts off the signal at its boundaries. A good window function has a narrow main lobe and low side-lobe levels in its transfer function, which shrinks the values of the signal toward zero at the window boundaries, avoiding discontinuities.

Equation 5.12

The most commonly used window function in speech processing is the Hamming window, defined as follows:

w(k) = 0.54 - 0.46 \cos\!\left(\frac{2\pi k}{K - 1}\right), \quad 0 \le k \le K - 1    Equation 5.13

By applying w(k) to x(k; m) for all blocks, the windowed signal output is calculated.
The Hamming window function is shown in Fig. 5.6 below:
Fig. 5.6 Hamming Window
Multiplication of the signal by a window function in the time domain is equivalent to convolving their spectra in the frequency domain. The rectangular window gives maximum sharpness but large side-lobes (ripples); the Hamming window blurs in frequency but produces much less leakage.
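Equation 5.13 directly:

```python
import math

def hamming(K):
    """Hamming window w(k) = 0.54 - 0.46 cos(2*pi*k/(K-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (K - 1))
            for k in range(K)]
```

The taper is 0.08 at the edges, 1.0 in the centre, and symmetric, which is what suppresses the side-lobe leakage discussed above.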
5.3 Feature Extraction
A feature extractor should reduce the pattern vector (i.e., the original waveform) to a
lower dimension, which contains most of the useful information from the original vector. Here we extract features of the input speech signal using Daubechies-8 wavelets at level 4 [4].
The extracted wavelet coefficients provide a compact representation that shows the energy distribution of the signal in time and frequency. In order to further reduce the
dimensionality of the extracted feature vectors, statistics over the set of the wavelet coefficients
are used.
The following features are used in our system:
The mean of the absolute value of the coefficients in each sub-band. These features provide information about the frequency distribution of the audio signal.
The standard deviation of the coefficients in each sub-band. These features provide information about the amount of change of the frequency distribution.
The energy of each sub-band of the signal.
The kurtosis of each sub-band of the signal. These features measure whether the data are peaked or flat relative to a normal distribution.
The skewness of each sub-band of the signal. These features measure the symmetry, or lack of symmetry, of the distribution.
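The five statistics per sub-band can be sketched as follows (plain population moments; in the actual system the coefficient array would come from the DWT of a frame):

```python
import math

def subband_features(c):
    """Mean |c|, standard deviation, energy, kurtosis and skewness of
    one sub-band's wavelet coefficients (population moments)."""
    n = len(c)
    mu = sum(c) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in c) / n)
    mean_abs = sum(abs(v) for v in c) / n
    energy = sum(v * v for v in c)
    skew = sum((v - mu) ** 3 for v in c) / (n * sd ** 3) if sd else 0.0
    kurt = sum((v - mu) ** 4 for v in c) / (n * sd ** 4) if sd else 0.0
    return mean_abs, sd, energy, kurt, skew
```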
After frame blocking and windowing we get different frame vectors, i.e. several signals must be loaded to extract the features at a time. Hence multisignal wavelet analysis is performed on the input frame vectors using MATLAB [13].
6. DYNAMIC TIME WARPING
Dynamic time warping (DTW) is an algorithm for measuring similarity between two
sequences which may vary in time or speed. For instance, similarities in walking patterns would
be detected, even if in one video the person was walking slowly and if in another he or she were
walking more quickly, or even if there were accelerations and decelerations during the course of
one observation. DTW has been applied to video, audio, and graphics; indeed, any data which
can be turned into a linear representation can be analyzed with DTW. A well-known application
has been automatic speech recognition, to cope with different speaking speeds [3].
In general, DTW is a method that allows a computer to find an optimal match between
two given sequences (e.g. time series) with certain restrictions. The sequences are "warped" non-
linearly in the time dimension to determine a measure of their similarity independent of certain
non-linear variations in the time dimension. This sequence alignment method is often used in
time series classification.
The recognition process then consists of matching the incoming speech with stored
templates. The template with the lowest distance measure from the input pattern is the
recognized word. The best match (lowest distance measure) is based upon dynamic
programming.
6.1 DTW Algorithm
Speech is a time-dependent process. Hence the utterances of the same word will have
different durations, and utterances of the same word with the same duration will differ in the
middle, due to different parts of the words being spoken at different rates. To obtain a global
distance between two speech patterns (represented as a sequence of vectors) a time alignment
must be performed.
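The alignment described above reduces to a dynamic-programming recursion, sketched here in the symmetric form (weights as in Eq. 6.14, normalised by N = I + J; the adjustment window and slope constraints discussed below are omitted for brevity):

```python
def dtw_distance(A, B, dist=lambda a, b: abs(a - b)):
    """Time-normalised DTW distance, symmetric DP form:
    g(i,j) = min(g(i,j-1) + d, g(i-1,j-1) + 2d, g(i-1,j) + d),
    D(A,B) = g(I,J) / (I + J)."""
    I, J = len(A), len(B)
    INF = float("inf")
    g = [[INF] * (J + 1) for _ in range(I + 1)]
    g[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d = dist(A[i - 1], B[j - 1])
            g[i][j] = min(g[i][j - 1] + d,
                          g[i - 1][j - 1] + 2 * d,
                          g[i - 1][j] + d)
    return g[I][J] / (I + J)
```

Identical patterns give distance 0; a timing shift is absorbed by the warping path rather than penalised sample by sample.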
6.1.1 DP-Matching Principle
General Time-Normalized Distance Definition:
Speech can be expressed by appropriate feature extraction as a sequence of feature
vectors.
A = a_1, a_2, a_3, ..., a_i, ..., a_I    Equation 6.1

B = b_1, b_2, b_3, ..., b_j, ..., b_J    Equation 6.2
Consider the problem of eliminating timing differences between these two speech
patterns. In order to clarify the nature of time-axis fluctuation or timing differences, let us
consider an i-j plane, shown in Fig. 6.1, where patterns A and B are developed along the i-axis and j-axis, respectively. When these speech patterns are of the same category, the timing differences between them can be depicted by a sequence of points c = (i, j):

F = c(1), c(2), ..., c(k), ..., c(K)    Equation 6.3

Where c(k) = (i(k), j(k)).
This sequence can be considered to represent a function which approximately realizes a
mapping from the time axis of pattern A onto that of pattern B. Hereafter, it is called a warping
function. When there is no timing difference between these patterns, the warping function
coincides with the diagonal line j = i. It deviates further from the diagonal line as the timing
difference grows [3].
Fig. 6.1 Warping function and adjustment window definition
As a measure of the difference between two feature vectors a_i and b_j, a distance

d(c) = d(i, j) = \| a_i - b_j \|    Equation 6.4

is employed between them. Then, the weighted summation of distances on warping function F becomes

E(F) = \sum_{k=1}^{K} d(c(k))\, w(k)    Equation 6.5

(where w(k) is a nonnegative weighting coefficient, which is intentionally introduced to allow
the E(F) measure a flexible characteristic), and is a reasonable measure for the goodness of
warping function F. It attains its minimum value when warping function F is determined so as to
optimally adjust the timing difference. This minimum residual distance value can be considered
to be a distance between patterns A and B, still remaining after eliminating the timing differences
between them, and is naturally expected to be stable against time-axis fluctuation. Based on these
considerations, the time-normalized distance between two speech patterns A and B is defined as follows:

D(A, B) = \min_{F} \left[ \frac{\sum_{k=1}^{K} d(c(k))\, w(k)}{\sum_{k=1}^{K} w(k)} \right]    Equation 6.6

Where the denominator is employed to compensate for the effect of K (the number of points on the warping function F). The above equation is no more than a fundamental definition of time-
normalized distance. Effective characteristics of this measure greatly depend on the warping
function specification and the weighting coefficient definition. Desirable characteristics of the
time-normalized distance measure will vary, according to speech pattern properties (especially
time axis expression of speech pattern) to be dealt with. Therefore, the present problem is
restricted to the most general case where the following two conditions hold:
Condition 1: Speech patterns are time-sampled with a common and constant sampling period.
Condition 2: We have no a priori knowledge about which parts of speech pattern contain
linguistically important information. In this case, it is reasonable to consider each part of a speech pattern to contain an equal amount of linguistic information.
6.1.2 Restrictions on Warping Function
Warping function F is a model of time-axis fluctuation in a speech pattern. Accordingly, it should approximate the properties of actual time-axis fluctuation. In other words, function F, when viewed as a mapping from the time axis of pattern A onto that of pattern B, must preserve linguistically essential structures in the pattern A time axis, and vice versa. Essential speech pattern time-axis structures are continuity, monotonicity (or restriction of relative timing in a speech), limitation on the acoustic parameter transition speed in a speech, and so on. These conditions can be realized as the following restrictions on warping function F, or on points c(k) = (i(k), j(k)):

1) Monotonic conditions: i(k-1) \le i(k) and j(k-1) \le j(k).    Equation 6.7

2) Continuity conditions: i(k) - i(k-1) \le 1 and j(k) - j(k-1) \le 1.    Equation 6.8

As a result of these two restrictions, the following relation holds between two consecutive points:

c(k-1) \in \{ (i(k)-1, j(k)),\ (i(k)-1, j(k)-1),\ (i(k), j(k)-1) \}    Equation 6.9

3) Boundary conditions: i(1) = 1, j(1) = 1, and i(K) = I, j(K) = J.    Equation 6.10

4) Adjustment window condition:

| i(k) - j(k) | \le r    Equation 6.11
Where r is an appropriate positive integer, called the window length. This condition corresponds to the fact that time-axis fluctuation in usual cases never causes too excessive a timing difference.

5) Slope constraint condition:

Neither too steep nor too gentle a gradient should be allowed for warping function F
because such deviations may cause undesirable time-axis warping. Too steep a gradient, for
example, causes an unrealistic correspondence between a very short pattern A segment and a relatively long pattern B segment. Then, a case may occur where a short segment in a consonant or phoneme transition part happens to be in good coincidence with an entire steady vowel part.
Therefore, a restriction called a slope constraint condition was set upon the warping function F,
so that its first derivative is of discrete form. The slope constraint condition is realized as a
restriction on the possible relation among (or the possible configuration of) several consecutive
points on the warping function, as is shown in Fig. 6.2(a) and (b). To put it concretely, if point c
(k) moves forward in the direction of i (or j)-axis consecutive m times, then point c (k) is not
allowed to step further in the same direction before stepping at least n times in the diagonal
direction. The effective intensity of the slope constraint can be evaluated by the following
measure P = n/m.
Fig. 6.2 Slope constraint on warping function
The larger the P measure, the more rigidly the warping function slope is restricted. When P = 0, there are no restrictions on the warping function slope. When P = \infty (that is, m = 0), the warping function is restricted to the diagonal line j = i, and nothing more occurs than conventional
pattern matching, with no time normalization. Generally speaking, if the slope constraint is too severe,
then time-normalization would not work effectively. If the slope constraint is too lax, then
discrimination between speech patterns in different categories is degraded. Thus, setting neither
too large nor too small a value for P is desirable. An optimum compromise for the P value was
investigated through several experiments in [3].
In Fig. 6.2(c) and (d), two examples of permissible point c(k) paths under the slope
constraint condition P = 1 are shown. The Fig. 6.2(c) type is directly derived from the above
definition, while Fig. 6.2(d) is an approximated type with another constraint: the
second derivative of the warping function F is restricted, so that the point c(k) path does not
orthogonally change its direction. This new constraint reduces the number of paths to be
searched. Therefore, the simpler Fig. 6.2(d) type is adopted hereafter, except for the P = 0 case.
6.1.3 Discussions on Weighting Coefficient
Since the criterion function in Equation 6.6 is a rational expression, its minimization is
an unwieldy problem. If the denominator in Equation 6.6,

N = Σ w(k) (sum over k = 1, ..., K), Equation 6.12

(called the normalization coefficient) is independent of the warping function F, it can be put outside the
bracket, simplifying the equation as follows:

D(A, B) = (1/N) · min over F [ Σ d(c(k)) · w(k) ]. Equation 6.13

This simplified problem can be effectively solved by use of the dynamic programming technique.
When the symmetric form of the weighting coefficient,

w(k) = [i(k) - i(k-1)] + [j(k) - j(k-1)], Equation 6.14

is employed, then N = I + J, where I and J are the lengths of speech patterns A and B, respectively.
If it is assumed that the time axes i and j are both continuous, then, in the symmetric form,
the summation in Equation 6.6 means an integration along the temporarily defined axis l = i + j.
As a result, the time-normalized distance is symmetric, i.e., D(A, B) = D(B, A), in
the symmetric form. Another, more important, consequence of the difference in the integration
axis is that, as shown in Fig. 6.3, the weighting coefficient w(k) reduces to zero in the asymmetric form
when the point in the warping function steps in the direction of the j-axis, that is, c(k) = c(k-1) + (0, 1). This
means that some feature vectors b_j are possibly excluded from the integration in the asymmetric
form. On the contrary, in the case of symmetric form, minimum w (k) value is equal to 1, and no
exclusion occurs. Since discussions here are based on the assumption that each part in a speech
pattern should be treated equally, an exclusion of any feature vectors from integration should be
avoided as long as possible. It can be expected, therefore, that the symmetric form will give
better recognition accuracy than the asymmetric form. However, it should be noted that the slope
constraint reduces the situations where the point in the warping function steps in the j-axis direction.
The difference in performance between the symmetric and asymmetric forms will gradually
vanish as the slope constraint is intensified.
Fig. 6.3 Weighting coefficient W(k)
6.2 Practical DP-Matching Algorithm
6.2.1 DP-Equation
A simplified definition of the time-normalized distance D(A, B) given above is one of the
typical problems to which the well-known dynamic programming (DP) principle can be applied. The basic
algorithm for calculating Equation 6.13 is written as follows.

Initial condition:

g_1(c(1)) = d(c(1)) · w(1). Equation 6.15

DP-equation:

g_k(c(k)) = min over c(k-1) [ g_{k-1}(c(k-1)) + d(c(k)) · w(k) ]. Equation 6.16

Time-normalized distance:

D(A, B) = (1/N) · g_K(c(K)). Equation 6.17
It is implicitly assumed here that c (0) = (0, 0). Accordingly, w (1) = 2 in the symmetric
form, and w (1) = 1 in the asymmetric form. By realizing the restriction on the warping function
described in Section 6.1.2 and substituting Equation 6.14 for the weighting coefficient w(k) in
Equation 6.16, several practical algorithms can be derived. As one of the simplest examples, the
algorithm for the symmetric form, in which no slope constraint is employed (that is, P = 0), is shown
here.
Initial condition:
g(1, 1) = 2·d(1, 1). Equation 6.18
DP-equation:

g(i, j) = min[ g(i, j-1) + d(i, j),  g(i-1, j-1) + 2·d(i, j),  g(i-1, j) + d(i, j) ]. Equation 6.19

Restricting condition (adjustment window):

j - r ≤ i ≤ j + r. Equation 6.20

Time-normalized distance:

D(A, B) = (1/N) · g(I, J), Equation 6.21

where N = I + J.
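As a concrete illustration, the symmetric P = 0 algorithm above can be sketched in Python (a minimal sketch, not the project's MATLAB implementation; the Euclidean local distance d and the feature matrices are placeholder assumptions):

```python
import numpy as np

def dtw_symmetric(A, B, r=None):
    """Symmetric DP-matching with no slope constraint (P = 0).

    A, B: (I, dim) and (J, dim) arrays of feature vectors.
    r: adjustment-window width (None disables the window; choose
       r >= |I - J| so that the endpoint (I, J) stays reachable).
    Returns the time-normalized distance D(A, B).
    """
    I, J = len(A), len(B)
    g = np.full((I + 1, J + 1), float("inf"))      # 1-indexed DP table
    d = lambda i, j: float(np.linalg.norm(A[i - 1] - B[j - 1]))  # local distance

    g[1, 1] = 2 * d(1, 1)                          # initial condition (Eq. 6.18)
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if i == 1 and j == 1:
                continue
            if r is not None and abs(i - j) > r:   # adjustment window (Eq. 6.20)
                continue
            g[i, j] = min(g[i, j - 1] + d(i, j),   # DP-equation (Eq. 6.19)
                          g[i - 1, j - 1] + 2 * d(i, j),
                          g[i - 1, j] + d(i, j))
    return g[I, J] / (I + J)                       # normalization, N = I + J (Eq. 6.21)
```

Because the recurrence treats the i and j directions identically, this form satisfies D(A, B) = D(B, A), matching the discussion of the symmetric form in Section 6.1.3, and D(A, A) = 0.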
The algorithm, especially the DP-equation, should be modified when the asymmetric
form is adopted or some slope constraint is employed. In Table I of [3], algorithms are summarized for
both symmetric and asymmetric forms, with various slope constraint conditions. In that table, the
DP-equations for the asymmetric forms are shown in a somewhat improved form. The first expression in
the bracket of the asymmetric-form DP-equation for P = 1, that is,

g(i-1, j-2) + [d(i, j-1) + d(i, j)]/2,

corresponds to the case where c(k-1) = (i(k), j(k)-1) and c(k-2) = (i(k-1)-1, j(k-1)-1).
Accordingly, if the asymmetric weighting definition is strictly obeyed, w(k) is equal to zero while w(k-1)
is equal to 1, thus completely omitting d(c(k)) from the summation. In order to avoid this
situation to a certain extent, the weighting coefficient w(k-1) = 1 is divided between the two
weighting coefficients w(k-1) and w(k). Thus, [d(i, j-1) + d(i, j)]/2 is substituted for d(i, j-1)
+ 0·d(i, j) in this expression. Similar modifications are applied to the other asymmetric-form DP-equations.
In fact, it has been established by a preliminary experiment that this modification
significantly improves the asymmetric-form performance [12].
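For comparison with the P = 0 case, the symmetric DP-equation under slope constraint P = 1 (the configuration later reported as optimal) can be sketched as follows. This is the symmetric P = 1 recurrence from Table I of [3] with simplified boundary handling, so treat it as an illustrative sketch rather than the exact tabulated algorithm:

```python
import numpy as np

def dtw_symmetric_p1(A, B):
    """Symmetric DP-matching with slope constraint P = 1.

    Recurrence (symmetric P = 1 form):
      g(i, j) = min( g(i-1, j-2) + 2*d(i, j-1) + d(i, j),
                     g(i-1, j-1) + 2*d(i, j),
                     g(i-2, j-1) + 2*d(i-1, j) + d(i, j) )
    Cells unreachable under the constraint simply stay at +inf.
    """
    I, J = len(A), len(B)
    d = lambda i, j: float(np.linalg.norm(np.asarray(A[i - 1]) - np.asarray(B[j - 1])))
    g = {}
    G = lambda i, j: g.get((i, j), float("inf"))   # +inf outside the computed domain
    g[(1, 1)] = 2 * d(1, 1)
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if (i, j) == (1, 1):
                continue
            cands = [G(i - 1, j - 1) + 2 * d(i, j)]
            if j >= 2:
                cands.append(G(i - 1, j - 2) + 2 * d(i, j - 1) + d(i, j))
            if i >= 2:
                cands.append(G(i - 2, j - 1) + 2 * d(i - 1, j) + d(i, j))
            g[(i, j)] = min(cands)
    return G(I, J) / (I + J)       # time-normalized distance, N = I + J
```

Only three step patterns are permitted here, so after a horizontal or vertical step the path must immediately move diagonally; this is exactly the n/m = 1 restriction described above.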
6.2.2 Calculation Details
The DP-equation for g(i, j) must be recurrently calculated in ascending order with respect to the
coordinates i and j, starting from the initial condition at (1, 1) up to (I, J). The domain in which the
DP-equation must be calculated is specified by

1 ≤ i ≤ I, 1 ≤ j ≤ J, Equation 6.22
and adjustment window
j - r ≤ i ≤ j + r. Equation 6.23
The optimum DP-algorithm, as applied to speech recognition, was investigated. The symmetric
form was proposed along with the slope constraint technique, and these varieties were then compared
through theoretical and experimental investigations.

The conclusions are as follows: the slope constraint is actually effective, and optimum performance is
attained with the slope constraint condition P = 1. The validity of these results was ensured by
good agreement between the theoretical discussion and the experimental results. The optimized
algorithm was then experimentally compared with several other DP-algorithms applied to spoken
word recognition by different research groups, and the superiority of the described algorithm was
established [3].
7. FPGA Implementation
The AccelDSP Synthesis Tool is a product that allows a MATLAB floating-point design to be
transformed into a hardware module that can be implemented in a Xilinx FPGA. The AccelDSP
Synthesis Tool features an easy-to-use graphical user interface that controls an integrated
environment with other design tools such as MATLAB, the Xilinx ISE tools, and other industry-standard
HDL simulators and logic synthesizers.
AccelDSP synthesis is done with the following implementation procedure:

a) Reading and analyzing a MATLAB floating-point design.
b) Automatically creating an equivalent MATLAB fixed-point design.
c) Invoking a MATLAB simulation to verify the fixed-point design.
d) Providing the power to quickly explore design trade-offs of algorithms that are optimized for the target FPGA architectures.
e) Creating a synthesizable RTL HDL model and a test bench to ensure bit-true, cycle-accurate design verification.
f) Providing scripts that invoke and control downstream tools such as HDL simulators, RTL logic synthesizers, and the Xilinx ISE implementation tools.
The synthesis flow in AccelDSP can be observed in the following flow chart:
Fig. 7.1 Synthesis flow in AccelDSP
8. SIMULATION & RESULTS
This chapter presents the experimental results obtained from the proposed approach, namely
wavelet analysis and Dynamic Time Warping, as applied to isolated-word speech recognition.
The effectiveness of the algorithms is measured through analysis of the results.

8.1 Input Signal

1) Input speech signal for the word "Speech":
Fig. 8.1 Input speech signal
The input speech signal, with a duration of 5 seconds and a sampling frequency of 8 kHz, is
shown above.
8.2 Pre-emphasis:

Pre-emphasis output for the word "Speech":

Fig. 8.2 Pre-emphasis output

The output is obtained after passing the input "Speech" signal through the pre-emphasis
(first-order high-pass) filter. The output spectrum is significantly flatter than that of the input.
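A first-order pre-emphasis filter of this kind is typically y[n] = x[n] - a·x[n-1]; the report does not state its coefficient, so a = 0.95 below is an assumption, chosen as a common default:

```python
import numpy as np

def pre_emphasis(x, a=0.95):
    """First-order high-pass (pre-emphasis) filter: y[n] = x[n] - a*x[n-1].

    Boosts the high-frequency content, flattening the speech spectrum
    before further analysis. The coefficient a = 0.95 is an assumed value.
    """
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                 # first sample has no predecessor
    y[1:] = x[1:] - a * x[:-1]
    return y
```

On a constant (DC) input every sample after the first is attenuated to (1 - a) of its value, which is exactly the spectral-flattening effect described above.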
8.3 Voice Activation & Detection
1) Voice Activation and Detection for Speech:
Fig. 8.3 Voice Activation & Detection
The above plot shows the voice-activated region for the word "Speech". The output is 1
for the voiced region and 0 for the unvoiced and silence regions. Hence, out of the total samples,
only the voice-activated samples are retained.
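The report does not specify its detection criterion, so the sketch below uses a simple short-time-energy threshold, a common stand-in for this step; the frame length and threshold ratio are assumed values:

```python
import numpy as np

def vad_energy(x, frame_len=80, threshold_ratio=0.1):
    """Toy frame-energy voice activity detector.

    Returns a 0/1 flag per frame: 1 where the short-time energy exceeds
    a fraction (threshold_ratio) of the maximum frame energy.
    """
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)          # short-time energy per frame
    thresh = threshold_ratio * energy.max()
    return (energy > thresh).astype(int)

def extract_voiced(x, flags, frame_len=80):
    """Keep only the samples belonging to frames flagged as voiced (VAD = 1)."""
    keep = np.repeat(flags.astype(bool), frame_len)
    x = np.asarray(x, dtype=float)[:len(keep)]
    return x[keep]
```

At an 8 kHz sampling rate a frame of 80 samples corresponds to 10 ms, a typical analysis frame for speech.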
2) Speech signal after voice activation and Detection:
Fig. 8.4 Speech signal after Voice Activation & Detection
After obtaining the Voice Activation & Detection output, the regions for which VAD = 1
are extracted for further analysis.
8.4 De-noising:
De-noising for Speech:
Fig. 8.5 Speech signal after de-noising
The final denoised signal is obtained after spectral subtraction, which reduces the noise
components present in the signal.
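Spectral subtraction estimates the noise magnitude spectrum from a noise-only segment and subtracts it from each frame's magnitude spectrum. The sketch below is a deliberately simplified version (non-overlapping rectangular frames rather than the usual overlap-add, and a noise estimate passed in explicitly); the report's exact parameters are not stated:

```python
import numpy as np

def spectral_subtraction(x, noise, frame_len=256):
    """Basic magnitude spectral subtraction, frame by frame.

    noise: a noise-only segment used to estimate the average noise
    magnitude spectrum. Frames are non-overlapping and rectangular,
    a simplification of the usual overlap-add scheme; any trailing
    partial frame of x is left as zeros.
    """
    x = np.asarray(x, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Average noise magnitude spectrum over whole noise frames.
    n_frames = max(1, len(noise) // frame_len)
    noise_frames = noise[:n_frames * frame_len].reshape(n_frames, frame_len)
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)

    out = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, frame_len):
        spec = np.fft.rfft(x[start:start + frame_len])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # subtract, floor at 0
        phase = np.angle(spec)                            # keep the noisy phase
        out[start:start + frame_len] = np.fft.irfft(mag * np.exp(1j * phase), n=frame_len)
    return out
```

Flooring the subtracted magnitude at zero prevents negative magnitudes, the standard half-wave-rectification step of basic spectral subtraction.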
8.5 Recognition Results:
This section provides the experimental results for recognizing the isolated words. In the
experiment, the database consists of 10 different words, with 25 utterances of each word.
The recognition rate is calculated as given in Equation 8.1:

Recognition rate (%) = (number of correct recognitions / total number of utterances) × 100. Equation 8.1
a) The recognition rates for each word, using the Daubechies-8 wavelet with level-4 DWT
decomposition, are shown in the following table:

Word to be recognized | Correct recognitions (out of 25) | Recognition rate (%)
Matrix   | 24 | 96
Paste    | 24 | 96
Project  | 18 | 72
Speech   | 18 | 72
Window   | 24 | 96
Distance | 20 | 80
India    | 24 | 96
Ubuntu   | 19 | 76
Fedora   | 25 | 100
Android  | 24 | 96

Table 8.1: Recognition rates for English words using db8 & level-4 DWT.

The overall recognition rate for English words using the Daubechies-8 wavelet at level 4 is 88%.
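The per-word counts in Table 8.1 can be checked against the quoted rates with a few lines of Python (counts transcribed from the table):

```python
# Correct-recognition counts from Table 8.1 (db8 wavelet, level-4 DWT),
# out of 25 utterances per word.
counts = {"Matrix": 24, "Paste": 24, "Project": 18, "Speech": 18,
          "Window": 24, "Distance": 20, "India": 24, "Ubuntu": 19,
          "Fedora": 25, "Android": 24}

per_word = {w: 100.0 * c / 25 for w, c in counts.items()}    # Equation 8.1 per word
overall = 100.0 * sum(counts.values()) / (25 * len(counts))  # overall rate

print(per_word["Speech"], overall)   # 72.0 88.0
```

The 220 correct recognitions out of 250 total utterances reproduce the stated overall rate of 88%.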
b) The recognition rates for each word, using the Daubechies-8 wavelet with level-7 DWT
decomposition, are shown in the following table:

Word to be recognized | Correct recognitions (out of 25) | Recognition rate (%)
Matrix   | 24 | 96
Paste    | 23 | 92
Project  | 21 | 84
Speech   | 23 | 92
Window   | 24 | 96
Distance | 22 | 88
India    | 25 | 100
Ubuntu   | 21 | 84
Fedora   | 25 | 100
Android  | 25 | 100

Table 8.2: Recognition rates for English words using db8 & level-7 DWT.

The overall recognition rate for English words using the Daubechies-8 wavelet at level 7 is 93.2%.
8.6 FPGA Implementation

The AccelDSP synthesis tool is used to transform the MATLAB design into a hardware module
that can be implemented in a Xilinx FPGA.

Fig. 8.6 shows the MATLAB results for the recognized word "FEDORA".

Fig. 8.7 shows the FPGA implementation results for the recognized word "FEDORA" using the
AccelDSP tool in the Xilinx ISE platform.
Fig. 8.6 MATLAB output of speech recognition for the word "FEDORA".
Fig. 8.7 FPGA results for the word "FEDORA".
REFERENCES
[1] Trivedi, Saurabh, Sachin and Raman, "Speech Recognition by Wavelet Analysis", International Journal of Computer Applications (0975-8887), Vol. 15, No. 8, February 2011.
[2] Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition".
[3] Hiroaki Sakoe and Seibi Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-26, No. 1, February 1978.
[4] Ingrid Daubechies, "Ten Lectures on Wavelets", SIAM, Philadelphia, 1992.
[5] Ian McLoughlin, "Audio Processing with MATLAB Examples".
[6] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets", Comm. on Pure and Applied Math., Vol. 41, pp. 909-996, Nov. 1988.
[7] Murali Krishnan, Chris P. Neophytou and Glenn Prescott, "Wavelet Transform Speech Recognition using Vector Quantization, Dynamic Time Warping and Artificial Neural Networks".
[8] George Tzanetakis, Georg Essl and Perry Cook, "Audio Analysis using the Discrete Wavelet Transform", Organised Sound, Vol. 4(3), 2000.
[9] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
[10] Michael Nilsson and Marcus Ejnarsson, "Speech Recognition using Hidden Markov Model".
[11] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, 1989, pp. 674-693.
[12] Sylvio Barbon Junior, Rodrigo Capobianco Guido, Shi-Huang Chen, Lucimar Sasso Vieira and Fabricio Lopes Sanchez, "Improved Dynamic Time Warping Based on the Discrete Wavelet Transform", Ninth IEEE International Symposium on Multimedia, 2007.
[13] M. Misiti, Y. Misiti, G. Oppenheim and J. Poggi, "MATLAB Wavelet Toolbox", The MathWorks, Inc., 2000.
[14] George Tzanetakis, Georg Essl and Perry Cook, "Audio Analysis using the Discrete Wavelet Transform", Organised Sound, Vol. 4(3), 2000.
[15] Mike Brookes, "VOICEBOX: Speech Processing Toolbox for MATLAB", Department of Electrical & Electronic Engineering, Imperial College, London SW7 2BT, UK.
[16] Daryl Ning, "Developing an Isolated Word Recognition System in MATLAB", MATLAB Digest, January 2010.