SPEECH ENHANCEMENT DURING BiPAP
USE FOR PERSONS LIVING WITH ALS
by
SAMUEL D. CHUA
B.A.Sc., University of British Columbia, 2005
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF APPLIED SCIENCE
in
The Faculty of Graduate Studies
(Electrical and Computer Engineering)
THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)
November 2012
© Samuel D. Chua, 2012
ABSTRACT
Speaking from behind a face mask while on Bilevel Positive Airway Pressure (BiPAP) ventilation is
extremely difficult for persons living with Amyotrophic Lateral Sclerosis (ALS). The inability to verbally
communicate while on ventilation causes frustration and feelings of isolation from loved ones and
decreases quality of life.
A system that integrates with face masks, captures speech, removes ventilator wind noise, and
outputs and recognizes the de-noised speech is proposed, implemented and tested. The system is tested with a
dataset consisting of digitally added noise as well as a single patient with ALS. Automated machine
recognition of the words is then performed and results analyzed. A subjective listening test is conducted
with individuals listening to the noisy and filtered speech samples and the results are also analyzed.
Although intelligibility does not seem to improve for human listeners, there appears to be some
improvement in machine recognition scores. In addition, feedback from the ALS community reports an
improvement in the quality of life simply because patients are able to use their own voice and be heard by
loved ones.
PREFACE
Ethics approval, H10-01703, for the project “Speech Enhancement During BiPAP Use For Persons Living
With ALS” was obtained through the Clinical Research Ethics Board.
TABLE OF CONTENTS
ABSTRACT ................................................................................................................................................. ii
PREFACE ................................................................................................................................................... iii
TABLE OF CONTENTS ............................................................................................................................. iv
LIST OF TABLES ..................................................................................................................................... vii
LIST OF FIGURES ................................................................................................................................... viii
ABBREVIATIONS ...................................................................................................................................... ix
ACKNOWLEDGEMENTS .......................................................................................................................... x
DEDICATION ............................................................................................................................................. xi
1 INTRODUCTION ................................................................................................................................. 1
1.1 ALS and BiPAP ............................................................................................................................ 1
1.2 The Effects of ALS on Speech and Quality of Life ...................................................................... 1
1.3 Research Goals .............................................................................................................................. 2
1.4 Organization of the Thesis ............................................................................................................ 2
1.5 Contributions of the Thesis ........................................................................................................... 2
2 BACKGROUND ................................................................................................................................... 3
2.1 Speech: Our Preferred Method of Communication ....................................................................... 3
2.2 Speech Dysarthria ......................................................................................................................... 3
2.3 Speech in Patients with ALS ......................................................................................................... 4
2.4 Measuring Speech Intelligibility ................................................................................................... 4
2.5 Increasing Speech Intelligibility .................................................................................................... 4
2.6 Three Difficulties in Increasing Speech Intelligibility .................................................................. 5
3 RELATED WORK – HISTORY AND PRESENT .............................................................................. 6
3.1 The Capture Problem..................................................................................................................... 6
3.2 The Noise Problem ........................................................................................................................ 8
3.3 Automated Speech Recognition .................................................................................................... 9
3.4 Automated Speech Recognition in Persons with ALS ................................................................ 10
4 AUTOMATIC SPEECH RECOGNITION AND ENHANCEMENT SYSTEM ............................... 12
4.1 System Overview and Setup........................................................................................................ 12
4.2 Microphone Selection.................................................................................................................. 14
4.3 Calibration and Positioning ......................................................................................................... 15
4.4 Microphone Powering Circuit ..................................................................................................... 16
4.5 Spectral Subtraction .................................................................................................................... 17
4.6 Speech Extraction ........................................................................................................................ 21
4.7 Mel Frequency Cepstral Coefficients .......................................................................................... 24
4.8 Dynamic Time Warping .............................................................................................................. 25
5 USER INTERFACE AND SYSTEM USAGE ................................................................................... 31
5.1 User Interface Description ........................................................................................................... 31
5.2 System Usage .............................................................................................................................. 32
5.3 Initial Training ............................................................................................................................. 32
6 EXPERIMENTS ................................................................................................................................. 34
6.1 Experimental Setup ..................................................................................................................... 34
6.1.1 Goal ..................................................................................................................................... 34
6.1.2 Setup .................................................................................................................................... 34
6.1.3 Hypothesis ........................................................................................................................... 35
6.2 Experimental Results ................................................................................................................... 36
6.2.1 Digital Addition of Noise to Nemours Subject BB ............................................................. 36
6.2.2 Person Living with ALS – RG ............................................................................................ 41
6.3 Phoneme Analysis ....................................................................................................................... 45
6.4 Discussion ................................................................................................................................... 47
6.4.1 Summary of Phoneme Analysis .......................................................................................... 47
6.4.2 Summary of ASRES Results ............................................................................................... 48
6.4.3 Effectiveness of ASRES ...................................................................................................... 50
6.4.4 Feedback from ALS Community ........................................................................................ 51
6.5 Validity ........................................................................................................................................ 51
7 CONCLUSIONS AND FUTURE WORK .......................................................................................... 53
7.1 Research Goals Summary ........................................................................................................... 53
7.2 Contributions of this Work .......................................................................................................... 53
7.3 Strengths and Limitations ............................................................................................................ 54
7.3.1 Strengths .............................................................................................................................. 54
7.3.2 Weaknesses ......................................................................................................................... 55
7.4 Potential Applications ................................................................................................................. 55
7.5 Future Work ................................................................................................................................ 55
7.6 Conclusion ................................................................................................................................... 56
BIBLIOGRAPHY ....................................................................................................................................... 57
APPENDICES ............................................................................................................................................. 59
Appendix A: Procedure for Collecting Speech Samples of a PALS ....................................................... 59
Appendix B: Phoneme Analysis Data Sheets .......................................................................................... 60
LIST OF TABLES
Table 1 - Noisy DTW of 5 Words by BB .......................................................... 37
Table 2 - Post-Filtered DTW of 5 Words by BB ................................................ 37
Table 3 - Noisy DTW of 5 Words by BB with +3dB Noise ................................ 38
Table 4 - Post-Filtered DTW of 5 Words by BB with +3dB Noise ..................... 38
Table 5 - Noisy DTW of 5 Words by BB with +7dB Noise ................................ 39
Table 6 - Post-Filtered DTW of 5 Words by BB with +7dB Noise ..................... 39
Table 7 - Noisy DTW of 5 Words by BB with +13dB Noise .............................. 40
Table 8 - Post-Filtered DTW of 5 Words by BB with +13dB Noise ................... 40
Table 9 - Noisy DTW of 5 Words by BB with +13dB Noise and Alternate Noise ............ 41
Table 10 - Post-Filtered DTW of 5 Words by BB with +13dB Noise and Alternate Noise .. 41
Table 11 - DTW of 6 Words by RG ................................................................. 42
Table 12 - Post-Filtered DTW of 6 Words by RG .............................................. 42
Table 13 - Real-Time Filtered DTW of 4 Words by RG ..................................... 42
Table 14 - Noisy DTW of 5 Words by RG ........................................................ 43
Table 15 - Post-Filtered DTW of 5 Words by RG .............................................. 43
Table 16 - Real-Time Filtered DTW of 5 Words by RG ..................................... 44
Table 17 - DTW of 3 Phrases by RG ................................................................ 44
Table 18 - Real-Time Filtered DTW of 3 Phrases by RG ................................... 44
Table 19 - Control 5 Sentences and Phoneme Divisions .................................... 45
Table 20 - An Example Phoneme Analysis Trial Explained ............................... 45
Table 21 - Percentage of Correctly Identified Phonemes ................................... 46
Table 22 - Phoneme Analysis of MB ................................................................ 47
Table 23 - Summary of Phoneme Analysis ....................................................... 48
Table 24 - Summary of BB Datasets ................................................................ 48
Table 25 - Summary of RG Datasets ................................................................ 50
Table 26 - Phoneme Analysis of RS ................................................................. 60
Table 27 - Phoneme Analysis of EC ................................................................. 61
Table 28 - Phoneme Analysis of JW ................................................................. 62
LIST OF FIGURES
Figure 1 - Cross-Sectional View of Kang's Microphone Array ............................. 8
Figure 2 - System Block Diagram ................................................................... 12
Figure 3 - Panasonic Noise Cancelling Microphone Cartridge (WM-55D103) ...... 14
Figure 4 - Frequency Response ...................................................................... 14
Figure 5 - Airflow and Optimal Microphone Placement .................................... 16
Figure 6 - Electret Microphone Circuit ............................................................ 17
Figure 7 - Normalized WAV .......................................................................... 22
Figure 8 - Original and Truncated WAV .......................................................... 23
Figure 9 - Distance Map of Different Words .................................................... 25
Figure 10 - Distance Map of Two Identical Words ........................................... 26
Figure 11 - DTW Minimum Cost Path Equation ............................................... 26
Figure 12 - DTW Scoring for the First Row ..................................................... 27
Figure 13 - DTW Scoring for the Second Row ................................................. 27
Figure 14 - DTW Scoring for the Entire Grid ................................................... 27
Figure 15 - Minimum Cost Path for Two Different Words ................................. 28
Figure 16 - Path with a Score of Zero for Identical Words ................................. 28
Figure 17 - Cost Map of a Word Spoken Slowly ............................................... 29
Figure 18 - Path with a Zero Score for a Word Spoken Slowly ........................... 29
Figure 19 - Euclidean Distance in 3-D ............................................................. 30
Figure 20 - ASR Tool .................................................................................... 31
Figure 21 - SnR vs. Recognition Rate of BB for Noisy Signals Subjected to Post-Filtering ........... 49
Figure 22 - SnR vs. Recognition Rate of RG for Noisy, Post-Filtered and Real-Time Filtering ..... 50
ABBREVIATIONS
Term Definition
AWG American Wire Gauge
ALS Amyotrophic Lateral Sclerosis
ASRES Automated Speech Recognition and Enhancement System
BiPAP Bilevel Positive Airway Pressure
DTW Dynamic Time Warping
HMM Hidden Markov Model
MFCC Mel Frequency Cepstral Coefficients
MMSE Minimum Mean Squared Error
LPC Linear Predictive Coding
PALS Person living with Amyotrophic Lateral Sclerosis
SnR Signal-to-Noise Ratio
ACKNOWLEDGEMENTS
To my supervisor, Philippe, I am grateful for your continued support throughout the duration of this project. If it were not for your patience and guidance, this project would never have come to fruition.
To the members of PROP BC and the ALS Society of BC, thank you for your willingness to support this project and to offer your feedback.
And finally, to my dear wife, Esther, words simply cannot express my gratitude for your support and sacrifices along the way.
DEDICATION
Dedicated to….
those who have fought with ALS courageously
and to my God who made me and strengthened me for this work.
1 INTRODUCTION
1.1 ALS and BiPAP
Amyotrophic lateral sclerosis (ALS) is a progressive neurological disease that destroys the nerve
cells associated with voluntary muscle control [1]. Although the initial symptoms of the disease vary from
person to person, as time progresses all persons with ALS eventually begin to lose their mobility and their
ability to speak, and to have trouble breathing due to the weakening of their respiratory muscles.
At some point assisted breathing is required in the form of mechanical ventilation to ease the
strain on the weakened muscles. Usage may initially be nocturnal followed by increasing daytime usage
as the disease progresses. Eventually, ventilation will be required on a full-time basis when the respiratory
muscles are no longer able to maintain appropriate oxygen and carbon dioxide levels [1].
One assisted breathing method which is common in North America is the Bilevel Positive Airway
Pressure (BiPAP) breathing apparatus. Unlike a mechanical ventilator, it does not replace the normal
breathing mechanism, but rather allows patients with neuromuscular diseases similar to ALS to breathe
normally while reducing the amount of effort required by the patient. This is accomplished by fitting a
mask to the patient’s face and applying a higher positive air pressure upon inhalation and a lower
positive pressure upon exhalation. Although BiPAP cannot slow the progression of the disease, it has
been shown to improve the quality of life in patients [2].
One major problem caused by BiPAP utilization is its interference with speech caused by the
muffling of vocalization by the BiPAP mask and the associated airflow noise. For patients using BiPAP
several hours a day this problem greatly limits their ability to communicate verbally.
1.2 The Effects of ALS on Speech and Quality of Life
Mixed dysarthria associated with ALS involves imprecise consonants, hypernasality, slowed
speech rate, harsh vocal quality, breathiness, slurred speech and low pitch [3]. Acoustic studies show
differences in vowel duration, fundamental frequency and vowel space with some variation between
different individuals [4].
The effects of dysarthria on speech alone have quite an impact on the quality of life of a patient.
The inability to communicate effectively with voice leads to feelings of isolation, frustration, anxiety, loss
of control and increased sadness. Isolation comes from the reduced amount of communication, frustration
from not being understood, fear and anxiety from failed communication attempts, loss of control as their
opinions are ignored or misunderstood, and sadness due to the isolation and frustration experienced by the
caregivers and patient [5]. In addition to this, BiPAP users who wear face masks also feel more distant
and isolated from their friends and primary caregivers. The combined effect of reduced ability to speak
and also the BiPAP face mask reduce the quality of life of persons living with ALS considerably.
1.3 Research Goals
The primary goal of this thesis is to develop a prototype Automatic Speech Recognition and
Enhancement System (ASRES) that will allow Persons Living with ALS (PALS) to communicate clearly
with their voices while being on BiPAP ventilation. The primary goals of this thesis can be separated into
three sub-goals as follows:
• Identify the problems associated with capturing and filtering speech by PALS who are on BiPAP
ventilation
• Design and implement a working prototype that PALS can use
• Validate the effectiveness of the system by examining the recognition rate of ASRES when used
with PALS and also by objectively measuring increases in intelligibility through listening tests
1.4 Organization of the Thesis
This thesis consists of seven chapters. Chapter 1 is an introduction to the problem while Chapters 2
and 3 discuss the background of the problem as well as related work. Chapter 4 presents the proposed
system including an overview of the theory and details behind the implementation. Chapter 5 covers the
user interface and describes the usage of the system including training. Chapter 6 includes all the data
obtained from the different experiments as well as a summary of the data. And lastly, Chapter 7 presents
the conclusions and suggestions for future work.
1.5 Contributions of the Thesis
The contributions of this thesis include:
• A discussion of the unique problem that PALS on BiPAP ventilation face with regard to
communicating using their voices
• An implementation and analysis of ASRES that fits the needs of PALS on BiPAP
• Validation of the effectiveness of the system by conducting experiments with noise digitally
added to samples of dysarthric speech as well as real subject data from a PALS
• A final analysis and evaluation of the prototype
2 BACKGROUND
The purpose of this section is to clearly articulate the communication problems caused by the face
masks of BiPAP users and why they reduce the quality of life of PALS.
2.1 Speech: Our Preferred Method of Communication
Speech is the primary method of communication between people. Although it is a common
method of communicating our intent, many factors come into play when it comes to having highly
intelligible speech. For example, pitch and tone contribute to the semantic meaning of a phrase as well as
inflection. A subtle variation, such as a slight rise in tone at the end of a phrase, can turn a statement
into a question. Accents and variations in localized pronunciation also
provide a challenge with regards to both human and machine recognition. The ability to communicate is
simply a part of our nature and surveys and studies have shown that there is a value in person-to-person
verbal communication in patient care that cannot be replaced by simply giving written instructions [6]. A
study of laryngectomees (persons who have had their larynx removed due to illness) shows that their
quality of life improves after the restoration of verbal communication through either a voice prosthesis or
a tracheo-oesophageal puncture [7]. Since the loss of the ability to speak is considered a loss in quality of
life for individuals, any improvements or enhancements that we can make to restore the intelligibility of
speech will help improve an individual’s quality of life.
2.2 Speech Dysarthria
Speech dysarthria is a term that refers to a group of motor speech disorders that result from either
central or peripheral nervous system damage [8]. The disruption to muscular control that affects the
muscles used in producing speech can result in differing levels of intelligibility. Some patients who suffer
from mild dysarthria when subjected to a Frenchay Dysarthria test can produce scores which are low
enough to have them pass as normal speakers. The Nemours database [9] contains sound samples of 11
different male speakers with varying degrees of dysarthria. Some of the speakers have a greater than 80%
intelligibility rate, while others score 60% or lower, to the point where they are
virtually unintelligible.
Although the degree of dysarthria varies from person to person, imprecise articulation [10] is
characteristic of all dysarthric speakers. Research has shown that for dysarthric speakers, vowels are easy to
produce whereas consonants are difficult to enunciate. The speech of patients with dysarthria is often
characterized as being either very nasal or distorted.
2.3 Speech in Patients with ALS
Persons living with ALS have a kind of mixed dysarthria that is characterized by defective
articulation, slow speech, and imprecise consonant and vowel formation among other things.
Documentation of their speech is not extensive, but a few studies have been conducted to examine their
speech rate, vowel space and variance in speech intelligibility [11].
Although their symptoms are similar to those of other individuals with dysarthria, they face a
unique problem: unlike other disabilities that affect speech, such as cerebral palsy and multiple
sclerosis, ALS causes the patient’s ability to speak to degenerate steadily over time. This poses a particular
challenge when it comes to developing solutions to improve the quality of speech for an individual
with ALS, as the specifics of their dysarthria change over time, rendering a potentially helpful system
either less effective or ineffective altogether.
2.4 Measuring Speech Intelligibility
Intelligibility can be defined as “how well a speaker’s acoustic signal can be accurately recovered
by a listener” [12]. Although the quality of a speech signal affects comprehension, it is important to note
that there are many other nonverbal factors that are involved in listener comprehension. Some examples
would be the length of a message, its predictability, context, relationship to the listener, and facial cues.
The measurement of intelligibility is no simple task either as there are multiple ways in which this
can be done. One way is through orthographic transcription in which a user listens to a speech sample and
then attempts to reproduce in writing what they heard. A percentage score is obtained by calculating the
number of words correctly identified over the total number of words. Although this can objectively
measure the number of words correctly perceived, it is important to design the experiment in such a way
that the context of the sentence in which the words appear does not heavily influence recognition of each
word.
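The percentage score described above can be sketched in a few lines. This is an illustrative sketch only: the function name and sample sentences are hypothetical and not taken from any study here, and it assumes a simple position-by-position word comparison, whereas a real scoring protocol might align the transcript against the reference more carefully.

```python
def intelligibility_score(reference: str, transcription: str) -> float:
    """Percentage of reference words the listener reproduced correctly,
    comparing word-by-word after lowercasing."""
    ref_words = reference.lower().split()
    heard_words = transcription.lower().split()
    correct = sum(1 for r, h in zip(ref_words, heard_words) if r == h)
    return 100.0 * correct / len(ref_words)

# A listener mishears one of six words ("the" heard as "a"): 5/6 correct
print(intelligibility_score("the boy ran to the store",
                            "the boy ran to a store"))  # about 83.3
```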
A stronger method for testing intelligibility is to ask the listener questions to see how well they
were able to understand the speaker’s meaning. Carefully designed general comprehension questions can
be used to determine how well a listener understood the speaker overall, while specific factual questions
can help shed some light on particular word intelligibility.
2.5 Increasing Speech Intelligibility
Many different methods have been developed over the years to attempt to improve the
speech intelligibility of dysarthric patients. These methods can be divided
into two categories: modification of the speech signal to enhance the actual acoustic characteristics of the
signal, and the alternative: manipulating complementary speech information or nonverbal cues in order to
increase a listener’s comprehension or perception of the speech. Studies have shown that speech
intelligibility can be improved simply by offering these additional sources of information to listeners.
Semantic cues, first-letter cues and word-class cues [13] have been shown to improve intelligibility by
10–24% among listeners.
Topical cues and key words are communication strategies that are often employed by PALS [14].
These strategies are currently used by partners of PALS, who give anecdotal
evidence that being able to understand one key word can allow a listener to understand the rest of the sentence.
Although speech intelligibility can be increased by other means that are non-speech related, these
go beyond the scope of this project and we will be concentrating on improving speech intelligibility from
behind a face mask primarily through speech processing.
2.6 Three Difficulties in Increasing Speech Intelligibility
One obvious difficulty with capturing intelligible speech from a patient wearing a face mask is that
because the mask is placed over the face and forms a complete seal, it is difficult to hear any intelligible
speech whatsoever, even with an ear placed very close to the mask itself. This we will refer to
as the Capture Problem in the following section, Section 3, entitled RELATED WORK –
HISTORY AND PRESENT.
Secondly, patients who are on BiPAP not only have voices that are generally
weaker, and thus are not able to speak as loudly as a typical person, but also struggle to articulate their
words with sufficient regularity and precision to be intelligible. This is a significant challenge,
especially when it comes to capturing and processing a good speech sample consistently from
behind the mask. This we will refer to as the Articulation Problem in the following section.
Thirdly, as BiPAPs function with rushing air, attempts to capture the sound also pick up the “wind
noise” within the mask, further decreasing the intelligibility of any captured speech. This we will refer to
as the Noise Problem in the following section.
As the face mask is an enclosed system, little can be done in terms of opening the mask to allow for
sound capture as this would allow air to escape and nullify any respiratory benefits. Coupled with the fact
that patients have weak voices, it would seem that capturing sound outside the mask by opening the
mask or asking the patients to shout would be impossible. Therefore, we must turn our efforts to
improving speech intelligibility by capturing sound not externally, but from within the mask, and
mitigating the effects of the rushing air.
3 RELATED WORK – HISTORY AND PRESENT
At the present time, there are no published works on improving speech from behind a ventilator face mask and filtering BiPAP-induced wind noise. The present work is among the first of its kind and, as a result, has very little to draw on in terms of specifics in this field.
In the previous section, we identified three problems, the Capture Problem, the Articulation
Problem, and the Noise Problem that contribute to the decreased intelligibility of speech in PALS on
BiPAP. Although intelligibility could be increased by addressing all three of these, the scope of this
project would simply be enormous if we were to tackle all of them.
For example, improvements to the Articulation Problem have been attempted with varying degrees of success. Older solutions for persons with ALS involved using speech synthesizers when articulation became too poor or the patient became nonverbal. Although these systems provided functionality, machine voices are still widely regarded as highly impersonal.
Considerable time and energy are being expended at present to add emotion, among other things, to make machine voices sound more human [15]. There is research in the area of capturing dysarthric speech and identifying and resynthesizing the badly pronounced phonemes [16] so that the larger portion of a person's own speech is retained. However, nothing has yet been made available commercially and the technology exists primarily in labs. As this problem is under active research and large enough in and of itself for a thesis, we will not attempt to improve intelligibility through the re-synthesis of poorly articulated phonemes. We will focus our attention on the Capture Problem and the Noise Problem, and then explore current work in machine recognition of the cleaned speech in order to select a mechanism for validating any claims we might make of increased intelligibility.
3.1 The Capture Problem
The first problem that we need to address is how to obtain speech for processing from within a
full face mask. If the quality of our signal is poor, then any attempts that we make at improving
intelligibility may be hampered not by our algorithms, but simply by the fact that our input is poor to
begin with. Therefore, choosing an appropriate solution for capturing sound from within the mask is
important.
The majority of speech processing methods focus on processing speech after it has been recorded by a microphone; however, any gain that we can achieve through a specific microphone configuration is advantageous. A survey of recent literature shows that although dual-microphone solutions for noise suppression are becoming more popular, they still add complexity in terms of weight, size of the array, power consumption and processing [17]. Furthermore, the advantages of dual-microphone noise reduction techniques are more pronounced when filtering non-stationary noise. As we are dealing with the removal of stationary or very slowly varying noise, the two-microphone solution is less appealing: although it would offer some benefit, its increased computational complexity and additional cost in terms of space within a face mask encourage the use of a single microphone.
Very little work seems to have been done on the problem of audio capture from within a full face
mask. A large part of the reason is that there is very little need for a typical person who uses BiPAP to
speak from behind a mask while the ventilator is running. For those who are hospitalized due to acute
respiratory failure, speech is not an option, and therefore, the need for verbal communication is nil. For
persons who suffer from sleep apnea and use BiPAP to aid them in getting a good night’s rest, there is no
reason to want the ability to communicate while sleeping. In the event that someone with sleep apnea or
some other respiratory ailment who uses BiPAP awakens or needs to communicate with their voice, it is a
simple matter for them to switch off the machine, loosen the mask slightly, communicate, and then return
to sleep. This, however, is not the case for PALS as their loss of motor coordination can make the task of
loosening or replacing the Velcro straps of a full face mask extremely difficult. Therefore, any solution
allowing them to speak without having to fumble with a mask is helpful.
The main category of people who use BiPAP and are not unconscious, suffering from acute respiratory failure, or asleep is persons living with ALS. As ALS is a progressive disease, the window in which an individual is on BiPAP and also sufficiently verbal to communicate is narrow and does not last indefinitely. Coupled with the facts that the life expectancy of persons living with ALS is normally 2 to 5 years and that the disease affects only about 6 to 8 people per 100,000, with approximately 2 new diagnoses per 100,000 each year [18], compared with cancer's 186,000 new cases this year alone [19], it is understandable why very little attention and research are directed at this particular problem of allowing a person living with ALS to speak while on BiPAP.
The closest published work relating to the problem of capturing speech from behind a mask comes from military research on fighter pilot oxygen masks. A work by Kang featuring a 4-microphone array [20] attempts to perform noise reduction within a mask, not by addressing ambient noise, but by focusing on the problems associated with reverberations within the face mask. The system constructed by Kang makes use of four collinearly spaced microphones with a sound duct.
Figure 1 - Cross-Sectional View of Kang's Microphone Array
This design allows reverberations in the mask to be removed by adding and subtracting the individual microphone outputs. An absorption material such as wool, which absorbs sound at 4 kHz, helps to further reduce the reverberation effects. Kang reports that this array works well at restoring lost high-frequency components and minimizing the reverberation effect.
Although this setup is good, size is a factor to consider when working with BiPAP masks, as they are smaller than fighter pilot masks. In addition, sound tests recorded from within a BiPAP mask do not seem to be as extremely muffled as in the case of the oxygen masks Kang used for testing. It is possible that the material of the BiPAP full face mask, coupled with its CO2 exchange holes, reduces the reverberation effect, whereas fighter pilot masks are airtight and perform their oxygen and carbon dioxide exchange through a single hose. There are certainly differences between the two kinds of masks; however, the extent to which fighter pilot masks differ from BiPAP full face masks has not been experimentally verified.
Although Kang reports success with his microphone array, noting that it performs some noise reduction in addition to reverberation reduction, it achieves this not only because of the array but also because of its proximity to the mouth. In Kang's tests, the array is placed a mere ¼ inch from the mouth, a distance which would be difficult to calibrate and maintain for PALS and which might also interfere with their breathing or their lips as they struggle to articulate sounds. Although Kang's work is the closest to our problem, it is not by itself a solution.
3.2 The Noise Problem
The problem of removing stationary noise from a signal is not a new one and has been explored and reviewed rather extensively [21]. In fact, more recent papers have begun to attack problems such as filtering non-stationary wind noise with a single microphone [22].
Filtering of undesired sound in noisy environments such as helicopters [23], oxygen masks [20] and vehicles [24] has been researched and solutions proposed. These solutions are helpful to us in that they are geared towards solving a specific ambient noise problem, such as the noise from helicopter rotors or vehicle engines. For the most part, this type of noise is unvarying and can be classified as stationary noise.
If the noise from the airflow in the mask can be filtered using a similar method, and a clean speech signal captured and output, this would be of immense benefit to the quality of life of patients, as they would retain the ability to communicate with others using their voices while on ventilation.
Techniques to attenuate the noise include Wiener filtering, Log-MMSE (minimum mean-square error) estimation and spectral subtraction. Although spectral subtraction was first developed by Boll in 1979 [23] and improved upon by Ephraim and Malah, who eliminated the musical noise phenomenon [25], it continues to have traction in the academic world. Variations have been proposed for non-stationary noise [26], as well as modifications that take advantage of human auditory characteristics [27]. As it is simple to implement, it remains a popular choice for performing noise reduction. The other aforementioned techniques offer similar performance in terms of their ability to reduce noise.
3.3 Automated Speech Recognition
In order to help us validate our claims of improving intelligibility through speech processing, it is
necessary to explore methods for validation. As discussed in Section 2.4 - Measuring Speech
Intelligibility, increased human recognition of words or sentences is an indication that intelligibility has
increased. We need not, however, limit ourselves to human recognition. If a computer recognition system were able to match words against some library of words, and the recognition rate increased as a result of noise filtering, then we would have another objective method for determining the improvement of the system. Therefore, it is also necessary to explore automated speech recognition methods to use for validation purposes.
Automated speech recognition is a field under heavy research, as machines able to understand the subtle nuances of human speech would revolutionize the way we interface with them. Work in this field is varied and in the last few decades has been very successful. The first generation of speech recognizers focused primarily on phonemes and was very limited in its ability to recognize commands. This was followed by a second generation of speech recognizers that made use of linear predictive coding (LPC) and dynamic time warping (DTW) [28]. In the 1980s, Hidden Markov Models (HMMs) and statistical analysis became the de facto standard for automated recognition of continuous speech, and also allowed for a major increase in the size of the vocabularies of new systems.
Although Dynamic Time Warping is older and has since been replaced by Hidden Markov Models for the bulk of speech recognition, it remains useful for small datasets and isolated words. The bulk of research today focuses on optimizing DTW, finding new ways to apply it to larger datasets [29], or using it to enhance HMMs [30].
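Since DTW is the isolated-word matching technique relied upon later in this work, a minimal sketch of the classic dynamic-programming recurrence may be helpful. The function name and the use of 1-D feature sequences with an absolute-difference local cost are illustrative choices, not details from this thesis:

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences.

    The cost of aligning a[i] with b[j] is |a[i] - b[j]|; each cell of the
    table takes the local cost plus the cheapest of the three allowed
    predecessors (match, insertion, deletion)."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of the best alignment of a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # deletion
                                 D[i][j - 1],      # insertion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Because DTW warps the time axis, a word spoken slowly still matches a faster template well, which is why it suits small isolated-word vocabularies where per-speaker templates are cheap to collect.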
3.4 Automated Speech Recognition in Persons with ALS
Several studies have been conducted in the area of dysarthric speech, including speech from patients with ALS. Studies have shown that their speech differs greatly from typical speech and that it is very difficult for such speakers to use off-the-shelf voice recognition packages [9]. Training people on these systems requires a great amount of time and effort and is a very frustrating procedure for patients who have been weakened by illness.
To improve phoneme recognition, experimental systems have been implemented to train speech classifiers to recognize the distinct phonemes produced by patients with dysarthria. Visual and auditory feedback, among other methods, has been used to find the best training method; however, all of these remain time consuming and difficult to achieve [8].
ALS patients present a unique problem in that, unlike other disabilities, the disease causes the patient's ability to speak to degenerate over time. Studies on dysarthric speakers often exclude patients with neurodegenerative conditions, as research projects spanning several years cannot continuously obtain consistent data samples while the patient's disease progresses. Projects like STARDUST [8], which examined isolated word recognition for dysarthric speakers, specifically excluded these types of patients and focused on patients with a more stable disease. The problem with training sets of classifiers for these patients is that, due to the rapid degeneration of their speech, possibly within months, retraining of a speech recognition system becomes necessary. This is extremely time consuming and tedious for PALS, who must then re-record samples for a database of words that has expanded through use. ASR studies involving PALS testing new speech recognition software have shown that during the usually repetitive recording sessions, the PALS grow tired and are no longer able to articulate their words clearly. This implies that systems with low training and retraining requirements are in PALS' best interests. Currently, no products or open source initiatives available to the public are capable of performing automatic speech recognition for dysarthric speakers, let alone persons living with ALS.
Efforts have been made to improve the intelligibility of dysarthric speech by replacing sections of speech with re-synthesized speech [31]; however, the study is careful to explain that its methods would only work for a specific subgroup of dysarthric speakers. As persons with ALS at different stages of disease progression will have wildly different speech characteristics, it is impossible to use a one-size-fits-all method for automatically detecting and resynthesizing speech. Although this method of improving intelligibility seems promising, constructing such a system and making it operate in real time would prove challenging and go beyond the scope of this project.
4 AUTOMATIC SPEECH RECOGNITION AND ENHANCEMENT SYSTEM
This section is an explanation of a prototype system that was designed to address the Capture and
Noise problems as detailed in the previous section. The proposed solution, ASRES, is an automated
system that is able to capture and filter noisy speech from behind a face mask, output the processed
speech, as well as perform recognition.
The details of the system’s implementation and assumptions are explained in this chapter.
Figure 2 - System Block Diagram, below, demonstrates the setup of the system and how a speech signal is passed between the different subsystems.
Figure 2 - System Block Diagram
4.1 System Overview and Setup
A person living with ALS wearing a face mask is the starting point of the diagram. A small electret microphone (WM-61A), with two 30 AWG wires attached to its leads, is inserted into the patient's face mask via the small ventilation holes and hung from the top of the mask. It is important that the microphone be placed out of the direct path of the air flowing through the flexible hose connecting the mask to the BiPAP in order to maximize the Signal-to-Noise Ratio (SnR) of the speech. The microphone must also be adjusted so that it is not pressed against the nasal bridge of patients with larger noses, and not in line with the patient's nostrils, as deep inhalation and exhalation can create large amounts of noise that further reduce the SnR. Once the microphone is properly inserted, the patient attaches the full face mask to their BiPAP ventilator.
The full face masks tested with this setup are the Resmed Ultra Mirage Full Face Mask and the Fisher & Paykel FlexiFit 432. It is conceivable that the system would work well with any full face mask that makes a complete seal over the patient's face and has CO2 ventilation holes no smaller than 0.0799mm in diameter, the diameter of the 30 AWG wire used with the microphone.
As the device makes use of the existing CO2 ventilation holes, it is important to consider whether the insertion of a microphone would impair normal operation of the mask. In a Resmed Ultra Mirage Full Face Mask, CO2 is vented through 6 holes that are approximately 1.8mm in diameter, giving a total area of 15.268mm² through which CO2 can pass. As each 30 AWG lead occupies only 5.107×10⁻² mm², the total area obstructed by the two leads is 0.669%. This means that the mask's ability to vent CO2 still operates at approximately 99.3%, which was not deemed a risk when reviewed by the UBC Clinical Research Ethics Board.
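The vent-area figures above can be checked with a short calculation. This sketch assumes a bare-wire diameter of about 0.255 mm for 30 AWG (the standard gauge value), which reproduces the per-lead area and the obstruction percentage quoted in the text:

```python
import math

# Mask and wire geometry (assumed values: 6 vent holes of 1.8 mm diameter,
# as stated for the Ultra Mirage; 0.255 mm bare diameter for 30 AWG wire).
HOLE_DIAMETER_MM = 1.8
NUM_HOLES = 6
WIRE_DIAMETER_MM = 0.255
NUM_WIRES = 2

vent_area = NUM_HOLES * math.pi * (HOLE_DIAMETER_MM / 2) ** 2   # total vent area
wire_area = NUM_WIRES * math.pi * (WIRE_DIAMETER_MM / 2) ** 2   # area of both leads
blocked_pct = 100.0 * wire_area / vent_area                     # percentage obstructed

print(f"vent area  : {vent_area:.3f} mm^2")
print(f"obstructed : {blocked_pct:.3f} %")
```

Running this yields a vent area of roughly 15.27 mm² and an obstruction of roughly 0.67%, consistent with the figures in the text.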
Once the face mask is securely in place, two alligator clips are fastened to the exposed leads,
connecting the microphone in the patient’s mask to the rest of the system. The leads pass through a small
circuit that supplies power to the electret microphone via the 3.5mm jack on the TMS320C6713 DSP
board.
After the noise-corrupted speech signal is processed by the filtering software on the board, the filtered speech signal is split and sent to two places. One output goes to a set of speakers so that a listener in the room can hear the patient's voice in real time from behind the mask. The other is sent via a 3.5mm aux cable to the microphone jack of a laptop running the ASRES software. A laptop is used as it avoids the cost and difficulty of developing a GUI on an embedded device for this prototype.
Once the system is properly set up, the patient can begin speaking during normal operation of the
BiPAP.
4.2 Microphone Selection
Microphone selection is important for this application, as the environment from which speech is to be extracted is quite noisy. As the results of the digital signal processing and the speech recognition software depend on the quality of the input signal, it is beneficial to maximize the amount of captured speech and minimize the noise before the signal is subjected to digital signal processing. Two ways to accomplish this at the electret microphone stage are to choose the best electret microphone or microphone array, and to optimize the placement of the microphone or microphones.
For this project, a Panasonic Noise Cancelling Back Electret Condenser Microphone Cartridge (WM-55D103) is used because it is small and non-intrusive.
Figure 3 - Panasonic Noise Cancelling Microphone Cartridge (WM-55D103)
The microphone has a frequency range of 100-10,000 Hz, which covers the range of human speech, approximately 200-8000 Hz, quite well.
Figure 4 - Frequency Response
The frequency response of this microphone is fairly flat at close range, which is useful for us since there is little variance in response across the different frequencies of speech.
The noise cancelling microphone has a black porous material that covers the sensitive electret components and serves to protect the microphone from saliva, since it is placed fairly close to the speaker's mouth. As dysarthric speakers and PALS often have difficulty controlling their speech muscles and saliva, this covering is useful. It also helps to reduce some of the wind noise that passes over the microphone. As a noise-cancelling electret microphone is designed to attenuate signals originating farther away from the microphone, it is possible, through placement within the mask, to increase the SnR simply by keeping the microphone sufficiently close to the mouth and as far from the noise source as possible. Although the full potential of this characteristic is not exploited, as the distance between the noise source and the patient's mouth is still on the order of centimeters, it is a beneficial quality and makes the noise-cancelling electret a better choice than other types of microphones. An omnidirectional microphone, for instance, has the undesirable quality of picking up sound from all directions equally well and, as a result, collected more noise in trials. A unidirectional microphone proved slightly less useful, though not as poor as the omnidirectional electret microphone.
4.3 Calibration and Positioning
Calibration was done empirically. Although there is perhaps merit in mathematically modeling the mask to achieve maximal SnR, this is very difficult, as different masks have different shapes and sizes and any optimization algorithm would need an accurate model of each one. Furthermore, modeling the optimal location for the highest SnR is complicated by the fact that skin does not have the reflective acoustic properties of other solid surfaces but actually has absorptive qualities. In addition, variations in patients' nose structures, lips and even facial hair further complicate modeling. Therefore, being mathematically and computationally difficult, not to mention different for each individual BiPAP user, mathematical modeling for the purpose of optimizing the microphone placement was rejected in favor of deductive and empirical testing.
Figure 5 - Airflow and Optimal Microphone Placement
Figure 5 - Airflow and Optimal Microphone Placement, above, shows an arrow indicating the initial flow of air from the BiPAP as it enters the mask. Empirically, it was determined that placing a microphone anywhere along that path, where the wind struck the microphone directly, resulted in very poor SnRs of 0 dB and below. The two rectangles show the areas in the mask that are out of the direct path of the wind and are candidates for placement of the microphone. Although the bottom rectangle in Figure 5 is out of the direct path of the wind, it is still sufficiently close that empirical tests showed very little improvement in the SnR. Therefore, the only practical placement for the microphone is the top rectangle. Empirical testing of microphones placed within the top rectangle and away from the nostrils shows an SnR of up to 30dB when the BiPAP is not in operation and approximately 3-9dB when the BiPAP is on. This large variation in the SnR is due to variation in the strength of the speech signal: an average person who is able to speak normally would be on the higher end, whereas a patient with ALS who has low lung capacity might be closer to the lower end.
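The SnR figures quoted above can be estimated from two short recordings, one of speech over the ventilator and one of the ventilator alone. The sketch below is illustrative (the function name and the assumption that noise power during speech equals that of the noise-only recording are ours, though the latter is reasonable for the stationary BiPAP hum):

```python
import math

def snr_db(speech_plus_noise, noise_only):
    """Estimate the signal-to-noise ratio in decibels from two recordings.

    speech_plus_noise : samples recorded while the patient speaks over the hum.
    noise_only        : samples of the ventilator hum alone.

    The speech power is estimated by subtracting the noise power from the
    total power, assuming the noise is stationary across both recordings."""
    p_total = sum(s * s for s in speech_plus_noise) / len(speech_plus_noise)
    p_noise = sum(n * n for n in noise_only) / len(noise_only)
    p_speech = max(p_total - p_noise, 1e-12)  # floor to avoid log of <= 0
    return 10.0 * math.log10(p_speech / p_noise)
```

A stronger speaker raises p_speech and hence the estimate, which is consistent with the observed spread between healthy speakers (near 30 dB with the BiPAP off) and patients with low lung capacity (near the 3 dB end with it on).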
4.4 Microphone Powering Circuit
The diagram below shows the microphone circuit that powers the electret microphone and interfaces it with the TMS320C6713 DSP board used in this project.
Figure 6 - Electret Microphone Circuit
The electret microphone is connected to a 3.5mm TRS connector in a standard configuration. When it is plugged into the DSP board, a 5V bias is supplied to the microphone through a 2.2kΩ resistor to limit the current.
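As a rough sanity check on the biasing arrangement, Ohm's law gives the current through the 2.2 kΩ resistor. The roughly 2 V drop assumed across the electret capsule is a typical figure for such capsules, not a value from the text:

```python
V_SUPPLY = 5.0    # bias supplied by the DSP board's mic jack, volts
V_MIC = 2.0       # assumed operating voltage across the electret capsule, volts
R_BIAS = 2200.0   # current-limiting resistor, ohms

# Ohm's law: the resistor drops the remaining voltage, setting the bias current.
bias_current_ma = (V_SUPPLY - V_MIC) / R_BIAS * 1000.0
print(f"bias current: {bias_current_ma:.2f} mA")
```

This lands in the low-milliamp region typical for electret capsules, suggesting the resistor value is a reasonable choice for current limiting.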
4.5 Spectral Subtraction
The speech signal that comes from the microphone suffers from noise caused by air rushing over the microphone and the general hum of the BiPAP. This noise corrupts the signal and makes it difficult to perform any speech processing unless it is attenuated or removed. If the noise and the speech signal cannot be separated using multiple microphones, it is necessary to find a way to identify the noise from the same source. One of the simplest ways to do this is to perform power spectral subtraction.
This is a simple but effective way to estimate how strong the noise is, and it takes advantage of the fact that while a BiPAP is operating, a person is not speaking 100% of the time. The non-speech segments can be used to recalculate the noise estimate, which in our case should not vary much, as the BiPAP is fairly consistent. As what is needed is a sample of the stationary noise, it is possible on startup of the system to capture a 250ms sample of the stationary sound and then use this sample for the spectral subtraction calculations.
This algorithm is simple to implement and can be run in real time on a DSP board or a computer. The following derivation is largely based on the paper by Boll [23], whose work paved the way for subsequent spectral subtraction based works.
Let
y(m) = x(m) + n(m)
where y(m) is the corrupted signal, x(m) is the speech signal and n(m) is the uncorrelated, additive noise signal.
In the frequency domain, let the same relationship be represented as
Y(f) = X(f) + N(f)
where Y(f), X(f) and N(f) are the Fourier transforms of the corrupted, speech and noise signals respectively, with f as the frequency variable.
In order to process the original signal, it must be divided into chunks, or windowed. To alleviate the effect of discontinuities at the endpoints of each segment, it is necessary to choose an appropriate window. For our application we use a Hanning window of length N:
w(m) = 0.5 (1 − cos(2πm/(N−1))), m = 0, …, N−1
In the frequency domain, applying a window to a signal is a convolution operation:
Y_w(f) = W(f) ∗ Y(f)
Rearranging terms, we can express the original signal as the noise signal subtracted from the corrupted signal:
X_w(f) = Y_w(f) − N_w(f)
If it were possible to obtain an exact representation of the noise signal, we could completely remove it and restore the original speech signal. Since that is impossible, spectral subtraction is used to reconstruct an approximation of the original signal by subtracting a time-averaged noise spectrum, μ(f), from the corrupted signal. For simplicity, the windowing subscript, w, is dropped:
|X̂(f)| = |Y(f)| − μ(f)
The result is an approximation of the original signal. The accuracy of the reconstructed signal is directly dependent on how accurately the noise spectrum can be measured.
As it is necessary to calculate the average noise spectrum from a period of non-speech activity, this can be done upon the initial startup of the device with the following formula.
μ(f) = (1/K) Σ_{i=0}^{K−1} |N_i(f)|
In the above equation, the average noise spectrum is calculated by summing the magnitude spectra of K frames, indexed i = 0 to K−1, and dividing by K. K is determined by the sampling rate and the length of the signal. For this to work, we assume that the first K frames are pure noise with no speech. Averaging the noise spectrum over a 250-300ms window provides a fairly good estimate of stationary noise and presents no difficulty to the user. In the event that there is speech during the first ¼ of a second, the system can always be reset, or the noise mean recalculated during another interval of silence. After calculating the noise mean, we can create a simple voice activity detector by setting a threshold 2-3dB above the noise mean. If 6-8 consecutive frames of the signal are below this threshold, we can safely assume that speech has ended and that we are now looking at noise.
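The startup noise averaging and the simple threshold detector described above might be sketched as follows. The frame size and function names are illustrative assumptions, not details of the ASRES implementation:

```python
import numpy as np

FRAME = 256        # samples per analysis frame (assumed, not from the text)
MARGIN_DB = 3.0    # detection threshold above the noise mean (text: 2-3 dB)

def frame_signal(x, frame=FRAME):
    """Split a 1-D signal into consecutive non-overlapping frames."""
    n = len(x) // frame
    return x[: n * frame].reshape(n, frame)

def noise_profile(startup_noise):
    """Average magnitude spectrum of the startup (noise-only) frames,
    i.e. the mu(f) used by the spectral subtraction stage."""
    frames = frame_signal(startup_noise)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def is_speech(frame, mu, margin_db=MARGIN_DB):
    """Crude voice activity decision: the frame's mean spectral magnitude
    must exceed the noise mean by margin_db decibels."""
    frame_db = 20 * np.log10(np.abs(np.fft.rfft(frame)).mean() + 1e-12)
    noise_db = 20 * np.log10(mu.mean() + 1e-12)
    return frame_db > noise_db + margin_db
```

In practice the end-of-speech decision would additionally require 6-8 consecutive frames for which is_speech is false, as described in the text, before the noise mean is recalculated.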
The spectral error that remains is approximately equal to the difference between the reconstructed signal and the actual speech signal, which is also roughly equal to the difference between the actual noise in that frame and the noise average that we have calculated:
ε(f) = |X̂(f)| − |X(f)| ≈ |N(f)| − μ(f)
In order to help reduce the spectral error, a few modifications are made to the reconstructed signal. As the spectral error is the difference between the noise in the frame and the calculated noise average, it follows that
(1/M) Σ_{i=0}^{M−1} |N_i(f)| ≈ (1/K) Σ_{i=0}^{K−1} |N_i(f)| = μ(f)
where M < K and K is still the total number of frames over which the noise average is computed. Therefore, if we perform magnitude averaging of the noisy signal over M frames before subtracting the noise average, this helps decrease the error:
|X̂(f)| = (1/M) Σ_{i=0}^{M−1} |Y_i(f)| − μ(f)
According to Boll [23], performing this averaging is not a problem as long as the number of frames, M, does not exceed a certain amount. Based on the results of his DRT tests, as long as the averaging does not exceed 3 half-overlapped windows with a total duration of 38.4ms, intelligibility does not decrease. There is still, of course, the risk that for short, explosive sounds the averaging can cause smearing. However, this risk should be weighed against the benefit of having less spectral error overall.
Upon completion of magnitude averaging, the next step is to perform half-wave rectification. In the event that subtracting the average noise μ(f) from |Y(f)| produces a negative value at a particular frequency, because the average noise spectrum there is greater than the noisy signal, it is necessary to floor these values to 0:
|X̂(f)| = max(|Y(f)| − μ(f), 0)
The benefit of doing this is that, overall, the noise floor is reduced by μ(f). The disadvantage arises when the sum of the noise and speech at a particular frequency is less than μ(f): when the output is set to 0, any speech information contained in that frequency is lost and the result is possibly a loss of intelligibility.
Once half-wave rectification is complete, speech and noise above the threshold remain. This residual noise will have a value somewhere between zero and a maximum measured during the non-speech activity periods. In the case where the noise in a frame at a particular frequency equals the average noise spectrum, the amount of residual noise will be zero, or very close to it. The residual noise consists of frequencies randomly scattered throughout the spectrum that exist for the duration of one window, approximately 25ms. The result is what is known as "musical tones", as it sounds like a number of fundamental tone generators being flipped on and off at all the residual noise frequencies. To help alleviate this effect, additional residual noise reduction can be performed.
There are 3 cases that need to be examined after the average noise spectrum is subtracted from the noisy signal.
In the first case, if the amplitude of X̂(f), the signal after spectral subtraction, is below the maximum threshold for noise and fluctuates rapidly between adjacent frames, then there is a high probability that this is not speech but residual noise. We can therefore replace that particular frequency in the frame with the minimum value found by examining the adjacent frames as well.
In the second case, if the value of X̂(f) is below the maximum noise threshold but remains fairly constant between frames, then there is a high probability that this is not noise but low-energy speech. As the values are fairly similar, again taking the minimum preserves the speech.
In the third case, if X̂(f) is greater than the maximum noise threshold, then nothing else needs to be done, as what remains is most likely a speech signal.
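The three cases reduce to a simple rule: bins that stay below the measured residual-noise maximum take the minimum of the adjacent frames (which suppresses fluctuating residual noise while preserving stable low-energy speech), and bins above it are left alone. A hedged NumPy sketch, with assumed array shapes and names:

```python
import numpy as np

def reduce_residual(frames, max_residual):
    """Boll-style residual noise reduction.

    frames       : array of shape (num_frames, num_bins) holding magnitude
                   spectra after subtraction and half-wave rectification.
    max_residual : per-bin maximum residual noise measured during the
                   non-speech activity periods (shape: num_bins).

    Wherever a bin is below the residual-noise maximum, it is replaced with
    the minimum of that bin across the previous, current and next frames;
    bins above the maximum are assumed to be speech and left alone."""
    out = frames.copy()
    for t in range(1, len(frames) - 1):
        below = frames[t] < max_residual
        local_min = np.minimum(np.minimum(frames[t - 1], frames[t]),
                               frames[t + 1])
        out[t][below] = local_min[below]
    return out
```

Taking the minimum handles both of the first two cases at once: a randomly fluctuating residual bin dips near zero in some adjacent frame, while stable low-energy speech has similar values in all three frames and so survives largely intact.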
After this, all that remains is to restore the signal to the time domain.
Restoration of the time-domain signal is achieved by taking the magnitude spectrum estimate |X̂(k)|, combining it with the phase of the original noisy signal, and then performing an inverse discrete Fourier transform:
x̂(m) = (1/N) Σ_{k=0}^{N−1} |X̂(k)| e^{jθ_Y(k)} e^{j2πkm/N}
In the above equation, θ_Y(k) is the phase of the noisy signal. The estimated magnitude of the original signal can be recombined with the phase of the noisy signal because it is assumed that noise distortion lies primarily in the magnitude spectrum, while phase distortion is largely inaudible.
After the signal frames have been restored to the time domain, the overlapping frames are overlap-added in order to reconstruct our approximation of a noise-free signal.
This technique of magnitude spectral subtraction works well for stationary and slowly varying noise. In the case of patients who are on BiPAP, there is very little variation in the noise once the ventilator is in operation. Empirical tests show that the amount of noise depends on the placement of the microphone and, once the microphone is placed, does not vary. Therefore, spectral subtraction is an ideal candidate for this particular problem.
Although spectral subtraction is fairly simple and effective, it still suffers from the residual
“musical noise” left behind by isolated patches of energy in the time-frequency domain.
Despite best efforts to alleviate this, these artifacts persist. Depending on the accuracy of the noise
estimate and the SNR, there can be large fluctuations in the final output. In general, the less noise that
needs to be removed and the stronger the speech signal, the fewer artifacts appear in the final output. Ephraim
and Malah’s method [25] does not suffer from this and could be used as an alternative; we will
return to this in Chapter 7 - CONCLUSIONS AND FUTURE WORK.
4.6 Speech Extraction
This section explains the process of extracting commands and words from the de-noised and
reconstructed speech signal.
To capture a person’s speech, the de-noised speech is output from the DSP into the 3.5mm
jack of a laptop where the ASRES software is running. The software listens to the microphone channel
and captures the clip of speech. Although this may be done automatically by setting certain sound
thresholds, for the purpose of explanation, we will describe the case in which the record button of the
software is toggled on and off manually in order to acquire a clip of speech.
Upon capturing a clip, the next step is to normalize the WAV so that we can make an accurate
comparison with others stored in the database. To do this, we first iterate through all the
samples and examine the absolute value of each sample. If the absolute maximum is less than 1,
we divide the entire signal by this absolute maximum. An example of a resulting WAV is shown in
Figure 7.
Figure 7 - Normalized WAV
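The normalization step just described can be sketched in a few lines; this is a minimal illustration, not the actual ASRES code.

```python
import numpy as np

def normalize_peak(samples):
    """Scale a clip so its largest absolute sample becomes 1.0.

    As described in the text, scaling is only applied when the absolute
    maximum is below 1, i.e. quiet clips are brought up to full scale.
    """
    peak = np.max(np.abs(samples))
    if 0.0 < peak < 1.0:
        return samples / peak
    return samples
```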
Major errors in isolated word recognition are the result of inaccurate detection of the beginning
and end points of a speech sample. As the output of the DSP is the de-noised signal and we do not have
access to average noise spectra calculations, we need to create an algorithm to determine the endpoints of
the speech sample.
To extract the endpoints, we use an energy-based approach. We begin by segmenting the entire
WAV into 30 ms frames. The energy of each frame is then calculated according to the following
formula:

$$E_n = \sum_{m=0}^{M-1} s_n^2(m)$$

where s_n(m) is the m-th sample of frame n and M is the number of samples per frame.
The first two frames are examined with the following formula and a value for the noise at the
front of the signal is calculated.
$$E_F = \frac{E_1 + E_2}{2}$$

computed only if E_1 and E_2 are of comparable magnitude; otherwise the sample is rejected. After the noise at the front is computed, the noise at the back end is also calculated using the last
two frames:

$$E_B = \frac{E_{N-1} + E_N}{2}$$

again computed only if E_{N-1} and E_N are of comparable magnitude. Finally, the average background noise level is computed from these two values.
$$E_N = \frac{E_F + E_B}{2}$$

computed only if E_F and E_B are similar; otherwise the sample is rejected. Rejection occurs on the basis that the value of EN should lie between the two limits and that the
noise in the front and end frames should not be drastically different. A rejection can also take place if EN
exceeds an experimentally determined threshold. If the sound sample has a very heavy background noise,
i.e. the values of EF and EB exceed a certain threshold, the sample should also be discarded. Once we have
determined that background noise is not a factor and that the first two frames and the last two frames
contain similar residual noise, we can then compute the average power of each frame.
$$P_n = \frac{1}{M} \sum_{m=0}^{M-1} s_n^2(m)$$
Once the power in a frame exceeds a certain threshold, we can assume that this frame contains
speech. To determine the starting frame of speech, we examine the frames sequentially from the first
frame until we encounter a frame that exceeds the threshold. This is a simple method of determining the
start frame; however, this is susceptible to noise.
A better way to determine the start frame is by examining not just the frame, but also the adjacent
frames. If two consecutive frames in a set of three exceed the average, then there is a greater likelihood
that this represents a speech frame. Once this occurs and the average power of the frame surpasses an
experimentally determined threshold, that frame is marked as the starting frame of speech. We repeat this
process in reverse beginning with the last frame to determine the ending frame.
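The energy-based endpoint search described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the threshold factor and the two-of-three consecutive-frame rule follow the text, but the numeric values are not the thesis's experimentally tuned ones.

```python
import numpy as np

def find_speech_endpoints(samples, rate=16000, frame_ms=30, factor=4.0):
    """Locate the start and end frames of speech via per-frame energy.

    `factor` scales the background-noise estimate into a speech threshold;
    it is an assumed value, not the experimentally determined one.
    """
    n = int(rate * frame_ms / 1000)
    frames = [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    # Background-noise estimate from the first two and last two frames.
    e_front = (energy[0] + energy[1]) / 2
    e_back = (energy[-2] + energy[-1]) / 2
    threshold = factor * (e_front + e_back) / 2

    def first_speech(idx_order):
        for k in idx_order:
            window = energy[max(0, k - 1):k + 2]
            # Two of three consecutive frames must exceed the threshold.
            if energy[k] > threshold and np.sum(window > threshold) >= 2:
                return k
        return None

    start = first_speech(range(len(energy)))            # scan forward
    end = first_speech(reversed(range(len(energy))))    # scan backward
    return start, end
```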
Once these two frames have been determined, we can extract this portion of the WAV as speech.
The frames from the designated starting and ending points are then written to a new WAV which is then
converted into its Mel Frequency Cepstral Coefficients for Dynamic Time Warping analysis.
Figure 8 - Original and Truncated WAV
Figure 8 is an example of truncation of a voice clip. The original sample is almost 2.5 seconds in
length, while the truncated WAV is only 0.9 seconds in length. If truncation is not performed correctly or
no truncation is performed at all, this causes serious problems in the later stages when trying to compare
two speech signals using dynamic time warping.
4.7 Mel Frequency Cepstral Coefficients
Although utterances of the same phrase differ drastically in the time domain, they are not so
different in the frequency domain. Therefore spectral analysis is an excellent way to take advantage of
this fact.
In order to quantitatively measure sound samples against each other, we need a way of
representing them numerically. Mel Frequency Cepstral Coefficients (MFCCs) provide a method
for comparing these WAVs to each other.
Calculating the MFCCs is a process that can be summarized as follows:
i. Convert to frames and calculate energy
ii. Take Fourier Transform
iii. Take Log of amplitude spectrum
iv. Perform mel-scaling and smoothing
v. Take Discrete Cosine Transform
Since taking the Fourier transform of a finite segment of a waveform can cause spectral leakage, the first step
is once again to apply a Hanning window to the signal before taking the discrete Fourier transform and
calculating the energy spectrum:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j2\pi nk/N}, \qquad 0 \le k \le N-1$$
The energy spectrum is given by

$$S(k) = |X(k)|^2$$
The energy E_j of each triangular mel filter is then calculated for each frame:

$$E_j = \sum_{k=0}^{N-1} W_j(k)\, S(k), \qquad j = 1, \ldots, J$$

where W_j(k) is the j-th triangular filter and J is the number of triangular filters used. Finally, the Discrete Cosine Transform is taken of the mel
log-amplitudes and only the first 13 coefficients are kept for our purposes.
$$c_i = \sqrt{\frac{2}{J}} \sum_{j=0}^{J-1} \cos\!\left[\frac{\pi i}{J}\left(j + \frac{1}{2}\right)\right] \log(E_j)$$
As MFCC calculations were performed using a DLL, we will not go into further detail regarding
the calculation of MFCCs. An MFCC is calculated for each frame of the input signal and an array of 13-
coefficient MFCC vectors is built. This array can then be compared against other arrays of MFCC vectors
stored in a library.
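Since the thesis used a third-party DLL for the MFCC calculation, the following is only a compact sketch of the five steps listed above. The filterbank size, frequency range, and mel-scale constants are assumed values, not those of the DLL.

```python
import numpy as np

def mfcc_frame(frame, rate=16000, n_filters=26, n_coeffs=13):
    """Compute MFCCs for one frame: window, FFT, mel filterbank, log, DCT."""
    n = len(frame)
    # Steps i-ii: window the frame and take the energy spectrum.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(n))) ** 2
    # Step iv setup: triangular filters spaced evenly on the mel scale
    # (assumed range 0 .. rate/2, standard 2595/700 mel constants).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10 ** (m / 2595.0) - 1.0)
    edges = mel_inv(np.linspace(0, mel(rate / 2), n_filters + 2))
    bins = np.floor((n + 1) * edges / rate).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for j in range(n_filters):
        lo, mid, hi = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[j, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    # Steps iii-iv: mel-scale the spectrum and take log amplitudes.
    log_e = np.log(fbank @ spectrum + 1e-10)
    # Step v: type-II DCT, keeping the first n_coeffs coefficients.
    i = np.arange(n_coeffs)[:, None]
    j = np.arange(n_filters)[None, :]
    return np.cos(np.pi * i * (j + 0.5) / n_filters) @ log_e
```

Applying this to every frame of a clip yields the array of 13-coefficient vectors described above.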
4.8 Dynamic Time Warping
ASRES performs recognition using Dynamic Time Warping (DTW), chosen for its ease of implementation
and its good performance on small datasets. The following is an explanation of
Dynamic Time Warping as used by ASRES. To compare a speech sample using its MFCC array,
we need to perform a frame-by-frame comparison of the speech sample with each individual command
stored in the library.
As signals differ in length, the problem becomes an alignment problem. Let Q and C be two
different signals of lengths n and m respectively.
$$Q = q_1, q_2, \ldots, q_n$$
$$C = c_1, c_2, \ldots, c_m$$
The first thing that needs to be done is to compare the two signals by creating an n by m distance matrix.
Figure 9 - Distance Map of Different Words
As seen in Figure 9, a 7x6 matrix is created to compare the frames in each of the words. In this
example, “BARNEY” is stored in the database, while the word “BEAVERS” has been spoken.
If the letters at the intersection of two frames match, we say that this is a match and assign it a
distance score of 0. If they differ, we assign a positive score, in this
case, the value 1.
Y 1 1 1 1 1 1 1
E 1 0 1 1 0 1 1
N 1 1 1 1 1 1 1
R 1 1 1 1 1 0 1
A 1 1 0 1 1 1 1
B 0 1 1 1 1 1 1
B E A V E R S
If the two frames were completely identical, then the distance score along the diagonal would be
0, as shown in Figure 10.
Figure 10 - Distance Map of Two Identical Words
Once we have the distance map, we then compute the cost to reach the top right-hand corner of
the box by traversing the grid using the following recurrence:

$$D(i,j) = \min\{D(i-1, j-1),\, D(i-1, j),\, D(i, j-1)\} + d(i,j)$$

Figure 11 - DTW Minimum Cost Path Equation
where d(i,j) is the value at point (i,j) in the distance map and D(i,j) is the cumulative cost of reaching cell (i,j).
We define a path W to the top-right corner to be a connected set of K elements with each element
designated as wk.
$$W = w_1, w_2, \ldots, w_k, \ldots, w_K, \qquad \max(m, n) \le K \le m + n - 1$$

The shortest path would be a diagonal, hence max(m,n) as the minimum bound, and the longest
path would involve traversing both edges, hence m+n-1.
Traversal of the grid is subject to three rules:
i. Boundary Conditions: w1 = (1,1) and wK = (m,n). The path must start at the bottom-left
corner and finish at the top-right corner.
ii. Monotonicity: the path cannot go backwards, therefore ik – ik-1 ≥ 0 and jk – jk-1 ≥ 0.
iii. Continuity: the path cannot jump and is restricted to adjacent cells.
In order to determine the minimum cost of reaching the top right corner, we begin by building a
path map of the entire grid, starting at the bottom left-hand corner and then filling each subsequent row.
Each D(i,j) in the bottom row is determined by adding the value of the cell to its immediate left to
the value of d(i,j) found in the distance map in Figure 9. After summing the 1’s on the bottom row, we
can determine the total cost D(i,j) to any point on the bottom row, as shown in Figure 12.
S 1 1 1 1 1 1 0
R 1 1 1 1 1 0 1
E 1 1 1 1 0 1 1
V 1 1 1 0 1 1 1
A 1 1 0 1 1 1 1
E 1 0 1 1 1 1 1
B 0 1 1 1 1 1 1
B E A V E R S
Figure 12 - DTW Scoring for the First Row
For the second row, we use the distance map in Figure 9, apply the formula found in Figure
11, and calculate the next row of the grid as seen in the following figure.
Figure 13 - DTW Scoring for the Second Row
This continues until the entire grid is completely filled with values as seen in the following figure.
Figure 14 - DTW Scoring for the Entire Grid
As seen in the above figure, the minimum cost to get from the bottom left-hand corner to the top
right-hand corner is 5 in this example.
The following figure, Figure 15, shows one of the possible minimum paths that can be taken through
the grid to arrive at the top right-hand corner.
Y 5 4 5 5 4 4 5
E 4 3 4 4 3 4 5
N 3 3 3 3 3 4 4
R 2 2 2 2 3 3 4
A 1 1 1 2 3 4 5
B 0 1 2 3 4 5 6
B E A V E R S
Y
E
N
R
A 1 1 1 2 3 4 5
B 0 1 2 3 4 5 6
B E A V E R S
Figure 15 - Minimum Cost Path for Two Different Words
Although there are alternative paths that can be taken, following the three rules of grid traversal,
the minimum cost remains at 5.
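The grid computation above can be reproduced in a few lines. This is an illustrative sketch using the same 0/1 letter distance and the recurrence in Figure 11; it is not the ASRES implementation.

```python
def dtw_cost(a, b):
    """Minimum DTW alignment cost between sequences a and b.

    Uses the 0/1 distance of the letter example: d = 0 for a match,
    d = 1 otherwise, with the recurrence
    D(i,j) = min(D(i-1,j-1), D(i-1,j), D(i,j-1)) + d(i,j).
    """
    m, n = len(a), len(b)
    INF = float("inf")
    D = [[INF] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            d = 0 if a[i] == b[j] else 1
            if i == 0 and j == 0:
                best = 0  # boundary condition: path starts at (1,1)
            else:
                best = min(D[i - 1][j - 1] if i and j else INF,
                           D[i - 1][j] if i else INF,
                           D[i][j - 1] if j else INF)
            D[i][j] = d + best
    return D[m - 1][n - 1]
```

For “BARNEY” scored against “BEAVERS” this returns a minimum cost of 5, matching the grid above, and an identical pair returns 0. A stretched utterance such as “BEAAAVERS” also aligns to “BEAVERS” at zero cost, which is the warping behaviour illustrated in the figures that follow.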
In the case where the spoken word is identical to the word stored in the database, a path with a
score of zero is produced as seen in Figure 16.
Figure 16 - Path with a Score of Zero for Identical Words
This case does not occur under real conditions, as it is impossible for the exact same utterance to
be repeated more than once; however, if there is a match, the score of the minimum cost path should be
fairly low.
When the sounds of words are stretched, dynamic time warping provides us with a way of showing
that the two utterances are actually identical.
Figure 17 - Cost Map of a Word Spoken Slowly
In Figure 17, the word stored in the database is “BEAVERS”; however, the speaker has
enunciated the word slowly, placing heavy emphasis on the vowels. Although the lengths are no
longer equal, dynamic time warping allows us to calculate a zero-cost path through the grid, showing that,
despite the elongated vowels, the command is the same.
Figure 18 - Path with a Zero Score for a Word Spoken Slowly
This path minimization is what makes dynamic time warping extremely useful for comparing
samples of speech.
In Figure 9, the distance map compares letters with letters, therefore logical values, 1 and 0, are
sufficient to determine the cost of the intersection.
Since we have MFCC vectors, we need a way to calculate the distance between two
vectors. If the two vectors are very similar, i.e. they contain the same segment of a word, the cost to move
to that intersection should be low. To compare two vectors, we take the Euclidean distance between
them, which serves as d(i,j).
Figure 19 - Euclidean Distance in 3-D
In Figure 19, the Euclidean distance in three dimensions is the square root of the sum of the
squares of the differences in each of the component directions.
Since our MFCC vectors have 13 components, we extend the Euclidean distance formula to 13
dimensions:

$$d(i,j) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_{13} - y_{13})^2}$$

where x and y are the components of the MFCC vectors calculated for the two frames of speech
being compared.
A distance map is then computed and the minimum cost path through the grid is calculated.
In the word example given above, the final score was identically zero; however, as stated before, this
does not happen in real life, since no two frames will ever precisely match. Therefore, in order to
determine what word is actually being said, we need to calculate the minimum cost path for every word or
phrase stored in the database. Once this is done, and provided that the speaker has spoken a phrase that is
contained in the database, the lowest score should indicate which phrase was spoken.
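Combining the 13-dimensional Euclidean frame distance with the DTW recurrence, the library lookup just described can be sketched as follows. Function and variable names are illustrative, not taken from ASRES.

```python
import numpy as np

def dtw_distance(q, c):
    """DTW cost between two arrays of MFCC vectors, Euclidean d(i,j)."""
    m, n = len(q), len(c)
    D = np.full((m, n), np.inf)
    for i in range(m):
        for j in range(n):
            d = np.linalg.norm(q[i] - c[j])   # 13-dim Euclidean distance
            prev = 0.0 if i == 0 and j == 0 else min(
                D[i - 1, j - 1] if i and j else np.inf,
                D[i - 1, j] if i else np.inf,
                D[i, j - 1] if j else np.inf)
            D[i, j] = d + prev
    return D[-1, -1]

def recognize(sample, library):
    """Return the library phrase whose MFCC array has the lowest DTW cost."""
    return min(library, key=lambda name: dtw_distance(sample, library[name]))
```

Because every library entry must be scored, recognition time grows linearly with the number of stored phrases.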
5 USER INTERFACE AND SYSTEM USAGE
The previous section described in detail the algorithms behind ASRES. This section documents and
explains the user interface.
5.1 User Interface Description
The user interface of ASRES consists of a main window from which all operations proceed.
Figure 20 - ASR Tool
(1), in the figure above, is the button that sets the user interface into capturing mode. When this
button is toggled, the system goes into a listening mode in which the input signal is analyzed for speech
according to the algorithms described in Chapter 4.
(2), on the right-hand side, contains quick buttons that put the system into a mode to capture one
of three phrases that were commonly used to test the effectiveness of the automated speech recognition
and to store them in the database. Additional custom phrases can also be captured by pressing the “New”
button below the three phrases. In theory, any number of user-defined phrases can be captured
and stored in the database. However, it is important to remember that scoring a sound
clip against the database by DTW means comparing it with every entry, an operation of order O(n).
The performance of the system therefore decreases linearly as the database grows.
(3) is a text field that provides text feedback. For example, if a user was captured as saying “Help
me!”, the system would process the phrase, score it against phrases stored in the database, then display in
the text field the name of the phrase that the system recognized. In this case, the text “Help me!” would be
displayed. Although a visual display would be of little use to a BiPAP user who was actually crying for
help, this output is primarily for those setting up the system. If the command is correctly recognized,
further appropriate reaction can then be taken.
5.2 System Usage
The “Start cap” button is pressed and de-noised speech is captured by the microphone. The
captured speech is compared against the entire database of phrases and a match is located.
Because the system is able to recognize different kinds of words, commands and actions can be tied
to each of these and stored in a database. For example if the “Help me!” command is recognized, an
appropriate reaction could be to trigger an alarm that is connected to the computer via a USB port. Or
perhaps if the caregiver is not in the general vicinity, a message could be sent to their phone via a specific
app. A two-word command such as “Lights on” could be interfaced with the room lights and allow a
patient to have control over the lights.
One of the commands that was implemented is “Open Browser.” Upon accurate recognition of this
command, the default browser of the operating system, e.g. Firefox, Chrome, or Internet Explorer is
opened to its home page. From there, a user can navigate from page to page by using an extension such as
“numberedlinks” for Firefox that assigns numbers to every hyperlink on the page. Thus, a user would be
able to say “Link 15” and the link would be clicked and the new page loaded. To add more functionality,
recording commands like “Back” or “Forward” could also improve the browser experience.
The possibilities are endless; however, they all rely on accurate detection of the words.
5.3 Initial Training
In order to train the system, a user selects either one of the quick buttons on the side, or the “New”
button and records his or her command. These samples become the unique user sound set against which all
future voice commands are compared and aligned using Dynamic Time Warping. In order to
make the system as robust as possible, any “New” command that is recorded can be set to execute a
particular command line command. For example, if we wish to implement something beyond “Open
Browser” and navigating through a web interface, we can specify a command to run.
For example, on a Windows-based system, it is possible to record a clip “Adjust pillow” and set the
executed action to run the command line string “msg.exe SamChua Can you please come over and adjust
my pillow?” Then, when the ASRES system is running and the words “Adjust pillow” are spoken, the
system would execute the above shell command which would cause a message to be sent to the user
SamChua, who is on the local network, with the above text. In this way, provided that the system is set up
in advance and these commands are entered by the caregiver, a PALS could send numerous kinds of
commands to particular people.
Since the commands are shell-based, any kind of command-line based custom software could be
written or bought and used with the ASRES system. An even more useful application, for example,
would be to purchase proprietary PC-to-SMS software and use it to send an SMS to a caregiver: the
recorded command could be “Emergency!” and the executed action could be “SMS.exe
/u:SamChua /m:Come home immediately, I need help!”
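A phrase-to-command dispatch of this kind can be sketched as follows. The table contents are illustrative only; the actual commands are whatever the caregiver configures (msg.exe is the Windows messaging utility mentioned above), and the `runner` parameter is a testing convenience, not part of ASRES.

```python
import subprocess

# Hypothetical phrase-to-command table; entries mirror the examples in
# the text but would be entered by the caregiver in practice.
COMMANDS = {
    "Adjust pillow": ["msg.exe", "SamChua",
                      "Can you please come over and adjust my pillow?"],
    "Emergency!": ["SMS.exe", "/u:SamChua",
                   "/m:Come home immediately, I need help!"],
}

def execute_phrase(phrase, runner=subprocess.run):
    """Run the shell action tied to a recognized phrase, if one exists."""
    action = COMMANDS.get(phrase)
    if action is None:
        return False          # no action configured for this phrase
    runner(action)
    return True
```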
One important thing to note, however, is the degradation of speech over time in individuals with
ALS. In some individuals, the ability to enunciate words clearly declines rapidly within a year. For
others, the loss of speech is a slow process that drags on for several years. This is problematic, as it means
that the speech samples in the database will become inaccurate and need to be retrained over time. Because
the amount of time differs from person to person, either a variable can be set in the XML configuration
file to automatically replace old records with newly spoken ones every few months, or a user can manually
retrain their entire database when they notice recognition error rates increasing substantially.
The automatic deprecation of old records and addition of new ones would be ideal; however, because
dysarthric speakers struggle to control their facial muscles, there is no reliable way to automatically tell
whether an utterance is acceptable or not. Although tedious, the only way to accurately retrain is to
record a sample, let the user listen to it, and confirm that they would like to keep it.
6 EXPERIMENTS
In this section, two experiments involving human subjects are described and their results
discussed.
6.1 Experimental Setup
6.1.1 Goal
The goal of the following experiment was to capture several predetermined speech samples from
a PALS in order to provide ASRES with data that could be used to determine if spectral subtraction
improves speech intelligibility.
6.1.2 Setup
In order to determine whether filtered sound captured from within a mask improves intelligibility,
multiple recordings of sentences and a paragraph are taken from within the mask with the noise filter
deactivated to establish a baseline. Once these recordings are complete, the airflow noise filter is activated
and the same sentences and paragraph are read and recorded.
Test sentences are constructed from a subset of the list of 74 monosyllabic nouns and 37
disyllabic verbs found in the Nemours Database of Dysarthric Speech [9].
Two factors went into the creation of the subset. The first was to remove words that differed
from each other by only a single phoneme, such as “cob” and “cop”. Dysarthric speakers have been shown
to have difficulty with certain vowels and syllables [10], and removing such pairs helps ensure that
incorrect recognition is not caused by the speaker being unable to articulate a particular
sound, which would result in two words being pronounced the same. Words were therefore selected that were at
least two phonemes apart.
The second factor addressed the issue of context. In order to ensure that context was not what was
being used to recognize the speaker’s words, the constructed sentences are nonsensical and are in the
form of “The X is Ying the Z” where X and Z are randomly selected nouns without replacement from the
reduced set of 11 nouns and Y is a verb selected from the reduced set of 8 verbs listed below. An example
sentence is “The fade is leaping the bin”.
Set of nouns:
• cob
• bad
• bait
• fade
• fight
• dime
• dew
• bin
• rot
• pat
• bet

Set of verbs:
• wading
• leaping
• licking
• bearing
• stewing
• sipping
• going
• surging
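The sentence construction just described can be sketched as follows, using the reduced noun and verb sets listed above. This is an illustrative sketch; the thesis does not say the sentences were generated programmatically.

```python
import random

NOUNS = ["cob", "bad", "bait", "fade", "fight", "dime",
         "dew", "bin", "rot", "pat", "bet"]          # reduced set of 11 nouns
VERBS = ["wading", "leaping", "licking", "bearing",
         "stewing", "sipping", "going", "surging"]   # reduced set of 8 verbs

def make_sentence(rng=random):
    """Build one nonsense test sentence of the form 'The X is Ying the Z'."""
    x, z = rng.sample(NOUNS, 2)   # nouns drawn without replacement
    y = rng.choice(VERBS)
    return f"The {x} is {y} the {z}."
```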
Finally, the paragraph to be read by the speaker was a standard passage in the area of speech sciences
called “My Grandfather”:
You wished to know all about my grandfather. Well, he is nearly ninety-three years old; he
dresses himself in an ancient black frock coat, usually minus several buttons; yet he still thinks as
swiftly as ever. A long, flowing beard clings to his chin, giving those who observe him a
pronounced feeling of the utmost respect. When he speaks, his voice is just a bit cracked and
quivers a trifle. Twice each day he plays skillfully and with zest upon our small organ. Except in
the winter when the ooze or snow or ice prevents, he slowly takes a short walk in the open air
each day. We have often urged him to walk more and smoke less, but he always answers,
“Banana oil!” Grandfather likes to be modern in his language.
For the full details of the experimental procedure, see Appendix A: Procedure for Collecting
Speech Samples of a PALS. The subject population consisted of a single PALS.
6.1.3 Hypothesis
Prior to the start of this experiment, it was hypothesized that performing spectral subtraction filtering on
speech captured from a PALS would increase intelligibility.
6.2 Experimental Results
In this subsection, the experimental results for subject RG are discussed. In addition, a dataset
with digitally added noise was also tested on ASRES and the results examined in order to test the
effectiveness of spectral subtraction on BiPAP noise. Two types of tests were conducted. The following is an
examination first of the recognition rates achieved on the constructed dataset, and then on the actual dataset
produced by a PALS.
6.2.1 Digital Addition of Noise to Nemours Subject BB
In the first set of tests, a sample of BiPAP noise was recorded and digitally added to the
voice of a dysarthric speaker; the result was then Post-Filtered and the DTW algorithm applied.
The following datasets were created from words spoken by subject BB recorded in the Nemours
Database of Dysarthric Speech. Subject BB has a form of cerebral palsy that results in speech dysarthria,
although not severe. His level of dysarthria, compared to others in the Nemours database, is fairly low. BB
has a score of 8/8 in sentence intelligibility and conversation intelligibility. In the area of
word intelligibility, however, his score is only 4/8, which implies that his words are not as well formed as
those of an unimpaired speaker. BB’s level of dysarthria would be comparable to that of a person in the earlier
stages of ALS who is just starting to use BiPAP on a more regular basis. This makes him a fairly good candidate for a
test.
In the following results, the recording of the paragraph “My Grandfather” was trimmed until only
five words, “Grandfather”, “Coat”, “Buttons”, “Trifle”, and “Organ” remained. These five words were
then added to the library for DTW matching. A sample of noise recorded from inside a face mask with
BiPAP operating was then digitally added to the five words and the recordings saved. Finally, the sample
of digitally added noise was Post-Filtered with the spectral subtraction algorithm and the result again
segmented into five different words to test against ASRES. The following tables show the effect of
machine recognition as the SnR is gradually decreased.
SnR = 11.32 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        399.51        429.72   434.85    432.59   417.8
Coat               451.74        288.45   331.68    329.96   302.84
Buttons            404.85        292.41   253.06    305.79   274.45
Trifle             409.06        291.11   313.92    250.89   279.04
Organ              416.57        295.18   310.4     291.18   242.36
Table 1 - Noisy DTW of 5 Words by BB
SnR = 22.31 dB

POST-FILTERED (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        207.7         325.05   320.68    332.67   301.89
Coat               308.03        143.18   233.05    251.14   203.18
Buttons            287.12        220.4    119.45    239.64   195.19
Trifle             327.51        255.81   220.71    131.69   176.51
Organ              310.5         223.61   203.6     204.06   105.44
Table 2 - Post-Filtered DTW of 5 Words by BB
In the results above, the SnR is fairly high, and every word in both the five samples
of noise-corrupted speech and the five samples of filtered speech was matched with the correct sample in the library.
Therefore, there is no apparent benefit to performing filtering at this SnR, as it does not improve the
already excellent recognition rate.
SnR = 8.30 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        493.5         533.74   545.93    516.98   512.09
Coat               552.65        389.01   460.63    420.78   427.72
Buttons            503.71        406.92   373.06    388.74   392.32
Trifle             444.91        337.97   371.81    276.45   299.84
Organ              458.73        361.62   393       307.9    264.7
Table 3 - Noisy DTW of 5 Words by BB with +3dB Noise
SnR = 19.77 dB

POST-FILTERED (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        232.81        404.65   388.57    345.31   305.99
Coat               352.28        202.96   322.72    279.87   250.41
Buttons            337.51        294.85   214.69    267.62   238.48
Trifle             343.2         270.32   279.81    130.83   192.42
Organ              328.6         297.51   292.26    204.67   126.06
Table 4 - Post-Filtered DTW of 5 Words by BB with +3dB Noise
In the results above, the SnR has been reduced by 3 dB. It is still fairly good, but lower than in the
previous example. The Post-Filtered data was 100% correctly matched with the library; however, the
DTW matching for the noisy signal is down to 40%, with only two of the words, “Trifle” and “Organ”,
being correctly recognized. It is also worth noting that the scores in the Noisy table are much higher than
those in the Post-Filtered table. This implies that the alignment between the noisy speech and that which
is stored in the library is more distant and, therefore, that the system would be more prone to making errors as
the size of the dataset increased.
SnR = 4.31 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        532.67        558.27   571.66    548.45   543.42
Coat               591.07        437.84   496.93    465.51   467.74
Buttons            527.93        424.23   403.22    418.6    405.6
Trifle             503.11        403.53   439.36    360.08   376.96
Organ              497.42        395.78   424.89    367.13   312.55
Table 5 - Noisy DTW of 5 Words by BB with +7dB Noise
SnR = 16.64 dB

POST-FILTERED (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        258.76        389.99   389.29    336.2    293.33
Coat               361.26        222.17   352.16    269.2    252.84
Buttons            328.83        309.38   228.11    264.2    229.4
Trifle             366.76        279.3    312.32    180.62   219.4
Organ              334.95        291.74   309.11    232.14   152.78
Table 6 - Post-Filtered DTW of 5 Words by BB with +7dB Noise
In the results above, the SnR has been reduced by an additional 4 dB from the previous set. Once
again the Post-Filtered data was 100% correctly matched with the library; however, the DTW matching for
the noisy signal is at 60%, with three words, “Buttons”, “Trifle” and “Organ”, being correctly recognized.
In addition, the winning scores for each of the noisy samples have increased on average by 54.6
points.
This implies that the alignment between the noisy speech and that which is stored in the library is
becoming even more distant and more error-prone.
SnR = -1.70 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        688.6         694.65   715.26    686.7    677.34
Coat               714.46        535.93   581.55    550.37   552.47
Buttons            646.58        503.4    498.33    501.58   488.53
Trifle             594.55        469.85   503.39    427.74   435.51
Organ              633.91        489.28   515.92    447.1    400.54
Table 7 - Noisy DTW of 5 Words by BB with +13dB Noise
SnR = 12.03 dB

POST-FILTERED (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        345.96        417.85   438.26    368.66   338.42
Coat               413.07        265.57   390.25    297.25   278.12
Buttons            379.34        338.53   286.55    297.42   253.08
Trifle             373.28        285.58   341.79    213.9    244.22
Organ              388.06        316.99   352.28    261.48   211.48
Table 8 - Post-Filtered DTW of 5 Words by BB with +13dB Noise
In the results above, the SnR is close to 0 dB, which means that the signal and noise strengths are
almost one-to-one. Once again the Post-Filtered data was 100% correctly matched with the library, with
the DTW matching for the noisy signal remaining at 60%, the three words “Buttons”, “Trifle” and
“Organ” being correctly recognized. Although three words are recognized, “Buttons” is only 5.1 points (1.0%)
away from “Trifle”, the next closest match. If this test were repeated, it is quite possible that the recognition
rate would drop to 40%.
SnR = -1.68 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        720.91        725.14   748.82    718.73   708.46
Coat               730.43        552.9    598.96    567.4    570.11
Buttons            665.13        537      537.99    536.62   529.54
Trifle             580.12        456.3    487.08    415.58   410.74
Organ              616.28        473.26   499.22    432.93   385.86
Table 9 - Noisy DTW of 5 Words by BB with +13dB Noise and Alternate Noise
SnR = 12.07 dB

POST-FILTERED (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        361.95        433.81   453.11    385.53   352.25
Coat               423.18        271.72   396.59    302.63   286.78
Buttons            389.1         335.86   311.83    303.94   277.62
Trifle             368.06        275.71   330.23    212.35   236.44
Organ              383.22        315.09   345.35    256.86   209.23
Table 10 - Post-Filtered DTW of 5 Words by BB with +13dB Noise and Alternate Noise
For this particular set of data, a different sample of BiPAP noise was digitally added to set the
SnR close to 0 once again. In Table 9, it is observed that the scores for “Buttons” and “Trifle” are extremely
close together and, as a result of the noise change, the word “Buttons” is incorrectly recognized as
“Trifle”. Once again the DTW matching for the noisy signal has dropped to 40%.
From the data collected in the tables above, there is reason to believe that spectral subtraction aids
in the recognition of words by DTW.
6.2.2 Person Living with ALS – RG
In the second set of tests, a subject with ALS tested the system by performing both experiments
on the procedure sheet as detailed in the experimental setup. In order to validate the theoretical findings of
the previous experiment, it was necessary to conduct the same experiment with a real person.
Subject RG, a person living with ALS provided the sound clips from which the following results
were derived. The following tables follow the same format and analysis of the digitally added noise
experiment in the previous section.
My Grandfather
SnR = 6.89 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ    Language
Grandfather        402.15        381.8    352.32    426.21   333.79   423.91
Coat               352.16        299.5    251.09    349.65   223.15   333.44
Buttons            342.22        322.35   267.91    348.13   191.05   321.54
Trifle             351.12        332.86   274.62    349.99   232.95   344.45
Organ              361.03        351.08   296.1     384.98   220.72   348.17
Language           382.26        372.5    306.51    371.54   278.42   371.2
Table 11 - DTW of 6 Words by RG
The data from the table above was generated from a real-time noisy dataset that was cut from RG reading the paragraph, “My Grandfather”. At a SnR of 6.89 dB, and with this particular dataset, DTW performed poorly and was only able to match a single word, “Coat”, correctly.
SnR = 6.79 dB

POST-FILTERED  Library
Spoken         Grandfather   Coat     Buttons   Trifle   Organ    Language
Grandfather    704.2         690.74   605.99    755.59   535.35   690.45
Coat           742.46        693.53   586.83    782.55   497.03   711.89
Buttons        681.23        663.34   570.24    764.12   459.74   689.49
Trifle         831.07        784.44   675.2     873.5    573.25   815.71
Organ          762.76        757.45   643.36    821.56   513.91   757.53
Language       750.65        763.74   646.96    784.11   553.19   739.21

Table 12 – Post-Filtered DTW of 6 Words by RG
In the above table, the same real-time noisy dataset was subjected to Post-Filtering. The Post-Filtering in this case performed extremely poorly, and the SnR actually went down by 0.1dB. As a result, again only one word was correctly recognized and no improvement was made.
SnR = 9.12 dB

REAL-TIME FILTERED  Library
Spoken              Grandfather   Coat     Buttons   Language
Grandfather         342.82        249.44   304.91    439.06
Coat                327.89        188.02   287.97    381.67
Buttons             310.02        196.17   218.99    363.23
Trifle              402.9         312.27   407.06    545.7
Organ               411.42        268.7    376.74    512.62
Language            389.79        236.22   401.37    526.47

Table 13 - Real-time Filtered DTW of 4 Words by RG
The data from the table above was generated from a real-time filtered dataset that was cut from RG reading the paragraph, “My Grandfather”. The filtering has improved the SnR by just over 2dB, and the result is that two words, “Coat” and “Buttons”, are correctly recognized. Because RG was not able to articulate the words “Trifle” and “Organ” during this recording, both words were removed from this dataset.
From this dataset we see that Post-Filtering had no benefit in this case; however, the results from the Real-Time Filtered dataset show recognition of two of the words, which is better than the recognition of only one word in the noisy dataset.
Five Isolated Words
SnR = 12.5 dB

NOISY     Library
Spoken    Fight    Dime     Dew      Bin      Bait
Fight     418.85   150.99   201.16   140.37   165.79
Dime      426.74   161.13   190.26   165.62   158.91
Dew       392.86   198.93   178.62   184.3    149.09
Bin       388.54   174.78   162.93   143.62   132.76
Bait      384.9    177.01   162.06   163.56   118.38

Table 14 - Noisy DTW of 5 Words by RG
In the above dataset, words from each of the five sentences spoken by RG were cut from two recordings. The first recording, which had no BiPAP and no filtering enabled, served as the baseline and became the library files. The second recording, which had the BiPAP enabled and filtering off, was segmented and tested against the library with DTW. The result, as seen in the above table at a SnR of 12.5dB, was still fairly poor, with only a single word, “Bait”, being recognized.
SnR = 18.2 dB

POST-FILTERED  Library
Spoken         Fight    Dime     Dew      Bin      Bait
Fight          116.36   403.95   428.75   360.9    389.88
Dime           134.62   422.62   418.36   374.19   404.29
Dew            192      386.68   372.09   338.11   377.12
Bin            144.25   381.5    386      332.7    372.12
Bait           142.73   389.63   392.44   326.03   361.63

Table 15 – Post-Filtered DTW of 5 Words by RG
This dataset was created by taking the noisy dataset found in Table 14 and applying spectral subtraction. This increased the SnR by 5.7dB, and as a result the software was able to recognize 3 of the 5 words, or 60%.
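The spectral subtraction applied here removes an estimate of the noise magnitude spectrum from each frame of the noisy signal. The following is a minimal magnitude-domain sketch, assuming the noise estimate is taken from a single noise-only frame and using non-overlapping frames; the thesis implementation's exact parameters are not reproduced here.

```python
import numpy as np

def spectral_subtract(noisy, noise_profile, frame=256, floor=0.01):
    """Frame-wise magnitude spectral subtraction with a spectral floor."""
    noisy = np.asarray(noisy, dtype=float)
    # Noise magnitude spectrum estimated from a noise-only segment.
    noise_mag = np.abs(np.fft.rfft(np.asarray(noise_profile, dtype=float)[:frame]))
    out = np.zeros_like(noisy)
    for start in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[start:start + frame])
        mag = np.abs(spec) - noise_mag               # subtract the noise estimate
        mag = np.maximum(mag, floor * np.abs(spec))  # clamp negative bins to a floor
        # Recombine with the noisy phase, as is standard for spectral subtraction.
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return out
```

A production version would use overlapping windows with overlap-add and average the noise estimate over many frames; this sketch only illustrates the subtract-and-floor step at the heart of the method, which works well precisely because the BiPAP wind noise is close to stationary.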
SnR = 23.45 dB

REAL-TIME FILTERED  Library
Spoken              Fight    Dime     Dew      Bin      Bait
Fight               114.5    169.87   179.09   167.08   173.05
Dime                129.81   143.39   189.34   165.34   164.29
Dew                 178.28   148.45   144.14   133.46   117.96
Bin                 137      141.73   147.22   128.87   114.96
Bait                133.61   128.47   148.89   129.54   111.12

Table 16 - Real-Time Filtered DTW of 5 Words by RG
In the above dataset, a real-time spectral subtraction recording of the five words yielded an extremely high SnR. As a result, the recognition rate in this case was 4/5, or 80%.
From this dataset we see that there appears to be better recognition of words after filtering is performed.
Three Commands
SnR = 8.56 dB

NOISY          Library
Spoken         Help Me   Open Browser   Close Window
Help Me        522.5     784.6          868.68
Open Browser   1146.6    1042.66        1150.93
Close Window   1098.57   1150.65        1151.85

Table 17 - DTW of 3 Phrases by RG
In the above dataset, three phrases were recorded with the BiPAP off to serve as the library; three more recordings were then made with the BiPAP running, and DTW was applied. At a SnR of 8.56dB, the results are once again poor, with only one phrase being correctly recognized.
SnR = 18.7 dB

FILTERED       Library
Spoken         Help Me   Open Browser   Close Window
Help Me        325.58    721.43         474.74
Open Browser   749.32    637.4          702.31
Close Window   657.25    793.79         505.64

Table 18 - Real-Time Filtered DTW of 3 Phrases by RG
In this dataset, a recording of the 3 phrases was made with the BiPAP running and spectral subtraction enabled. In this case, the filter performed fairly well, boosting the SnR to 18.7dB and allowing for a recognition rate of 2 out of 3, or 67%.
6.3 Phoneme Analysis
In order to further validate whether there is any increase in intelligibility, an experiment was conducted in which individuals listened to the recordings of subject RG and wrote down what they perceived was said. Table 19 below shows the 5 control sentences with the two nouns and the verb in bold. Both of the nouns in the five sentences have exactly 3 phonemes each, while the verbs all have 5 phonemes. Therefore, the two nouns and verb in each sentence have a total of 11 phonemes.
Control                           Words
Sentence                          1   2   3   Total
The fight is wading the cob.      3   5   3   11
The dime is bearing the bet.      3   5   3   11
The dew is leaping the pat.       3   5   3   11
The bin is stewing the fade.      3   5   3   11
The bait is licking the rot.      3   5   3   11

Table 19 - Control 5 Sentences and Phoneme Divisions

The experiment was conducted in two phases consisting of 3 trials each. In each trial, the subject would listen to a sentence and then write down what they perceived RG to be speaking. Subjects understood in advance that the sentences spoken would be in the form of “The X is Ying the Z”. In phase one, three trials were conducted with the subjects listening to unfiltered speech with a SnR of 12.5dB. In phase two, the trials were conducted with the subjects listening to the filtered speech with a SnR of 23.45dB.

No Filtering (SnR = 12.5 dB)      Words
Trial 1                           1    2    3   Total   Score
The fight is wearing the cob      0   -2    0    -2     81.8%
the drive is bearing the dirt    -2    0   -2    -4     63.6%
the view is leaving the path     -1   -1   -1    -3     72.7%
the bend is spewing the pain     -2   -1   -2    -5     54.5%
the bate is licking the vase      0    0   -4    -4     63.6%

Table 20 - An Example Phoneme Analysis Trial Explained

In Table 20 above, an example trial is explained. In this trial, the subject perceived that sentence number one was “The fight is wearing the cob.” As the first word was perceived correctly, no penalty is applied under column 1, as all the phonemes were correctly identified. The verb “wading”, however, was perceived as “wearing”. As these two words differ in exactly 2 phonemes, a score of minus 2 is applied to column 2. The third word, “cob”, was correctly identified, therefore no penalty is applied. In total, only 2 phonemes were incorrectly perceived, which means that 9 out of 11 were correctly perceived, resulting in a recognition percentage of 81.8%. This scoring process is then repeated for the remaining 4
sentences, and this is performed for all 3 trials from the unfiltered set. From this we can then calculate the average phoneme recognition rate for each sentence over the three trials. Finally, a single number, an overall percentage of correctly identified phonemes, can be computed, as seen below in Table 21.
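The scoring procedure just described can be sketched as follows. This is a minimal version, assuming the per-word phoneme counts and the number of misheard phonemes per word are supplied by hand, as they are in the manual analysis, rather than computed from a pronouncing dictionary.

```python
def sentence_score(phonemes_per_word, errors_per_word):
    """Percentage of correctly perceived phonemes for one scored sentence.

    phonemes_per_word: phoneme counts of the two nouns and the verb, e.g. (3, 5, 3).
    errors_per_word:   phonemes misheard in each of those words, e.g. (0, 2, 0).
    """
    total = sum(phonemes_per_word)
    wrong = sum(errors_per_word)
    return 100.0 * (total - wrong) / total

def overall_score(trials):
    """Mean over all sentence scores in all trials, giving one overall percentage."""
    scores = [sentence_score(p, e) for trial in trials for (p, e) in trial]
    return sum(scores) / len(scores)

# The worked example above: "wading" heard as "wearing" costs 2 of 11 phonemes.
print(round(sentence_score((3, 5, 3), (0, 2, 0)), 1))  # -> 81.8
```

Averaging `sentence_score` per sentence over the three trials gives the per-sentence means, and `overall_score` over every scored sentence gives the single summary figure.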
Mean % of Correctly Identified Phonemes per Sentence
Sentence #1     75.8%
Sentence #2     81.8%
Sentence #3     69.7%
Sentence #4     48.5%
Sentence #5     66.7%
Overall % of Correctly Identified Phonemes     68.5%

Table 21 - Percentage of Correctly Identified Phonemes

This process is then repeated for the second phase so that an overall percentage of correctly identified phonemes can also be determined for the filtered dataset, and then the two results can be compared.
Table 22 below is a phoneme analysis of one of four subjects who participated in the study.
No Filtering (SnR = 12.5 dB)      Words
Trial 1                           1    2    3   Total   Score
The fight is wearing the cob      0   -2    0    -2     81.8%
the drive is bearing the dirt    -2    0   -2    -4     63.6%
the view is leaving the path     -1   -1   -1    -3     72.7%
the bend is spewing the pain     -2   -1   -2    -5     54.5%
the bate is licking the vase      0    0   -4    -4     63.6%
Trial 2
the pipe is wearing the cob      -2   -2    0    -4     63.6%
the dime is bearing the bent      0    0   -1    -1     90.9%
the view is reaching the path    -1   -2   -1    -4     63.6%
the bend is spewing the thing    -2   -1   -4    -7     36.4%
the bait is licking the vase      0    0   -4    -4     63.6%
Trial 3
The fight is wearing the cob      0   -2    0    -2     81.8%
the dime is bearing the bent      0    0   -1    -1     90.9%
the view is leaving the path     -1   -1   -1    -3     72.7%
the bend is spewing the pain     -2   -1   -2    -5     54.5%
the bait is raking the rock       0   -2   -1    -3     72.7%

RT Filtered (SnR = 23.45 dB)      Words
Trial 1                           1    2    3   Total   Score
The fight is waving the cob       0   -1    0    -1     90.9%
the dime is wearing the death     0   -1   -2    -3     72.7%
the view is leaking the path     -1   -1   -1    -3     72.7%
the bend is stewing the shade    -2    0   -1    -3     72.7%
the bait is making the rock       0   -2   -1    -3     72.7%
Trial 2
The fight is waving the cob       0   -1    0    -1     90.9%
the dime is wearing the bent      0   -1   -1    -2     81.8%
the Jew is sweeping the path     -1   -2   -1    -4     63.6%
the bend is stewing the shade    -2    0   -1    -3     72.7%
the bait is making the run        0   -2   -2    -4     63.6%
Trial 3
The fight is waving the cob       0   -1    0    -1     90.9%
The dime is wearing the death     0   -1   -2    -3     72.7%
The Jew is sweeping the path     -1   -2   -1    -4     63.6%
The bend is stewing the shade    -2    0   -1    -3     72.7%
The bait is making the run        0   -2   -2    -4     63.6%

Mean % of Correctly Identified Phonemes per Sentence
                 No Filtering   RT Filtered
Sentence #1      75.8%          90.9%
Sentence #2      81.8%          75.8%
Sentence #3      69.7%          66.7%
Sentence #4      48.5%          72.7%
Sentence #5      66.7%          66.7%
Overall          68.5%          74.5%

Table 22 - Phoneme Analysis of MB

The phoneme analysis of subject MB in Table 22 above shows a slight improvement, 6.1%, in recognition of the phonemes.
The details of the other three subjects' phoneme analyses are not included here but can be found in Appendix B: Phoneme Analysis Data Sheets. The summary results are discussed in 6.4.1 - Summary of Phoneme Analysis.
6.4 Discussion
In this subsection, the results of the different experiments are summarized and discussed.
6.4.1 Summary of Phoneme Analysis
A summary of the phoneme analysis for all four subjects is found in Table 23 below.
Overall % of Correctly Identified Phonemes
Subject   No Filtering (SnR = 12.5 dB)   RT Filtered (SnR = 23.45 dB)   +/- %
MB        68.5%                          74.5%                           6.1%
RS        69.1%                          70.3%                           1.2%
EC        67.3%                          67.9%                           0.6%
JW        74.5%                          73.3%                          -1.2%

Table 23 - Summary of Phoneme Analysis
As seen in the data, apart from one subject who showed a 6.1% increase in the number of
phonemes recognized by listening to the filtered speech samples, the other three individuals did not show
any significant increase in the number of phonemes recognized. In one case, JW, the individual actually
had a slight decrease in the number of phonemes accurately recognized.
Although there is a possibility that intelligibility of phonemes has improved slightly, no firm
conclusions can be reached from the dataset above in Table 23.
6.4.2 Summary of ASRES Results
The results from the tables of data for both BB and RG are summarized below, and graphs of SnR vs. Recognition Rate are constructed, in order to verify whether or not spectral subtraction of a noisy BiPAP signal provides any benefit to ASR technology.
SnR     Type           Correct  Percent  Subj.  Set
11.32   Noisy          5/5      100%     BB     Grandfather
8.3     Noisy          2/5      40%      BB     Grandfather
4.31    Noisy          3/5      60%      BB     Grandfather
-1.68   Noisy          2/5      40%      BB     Grandfather
-1.7    Noisy          3/5      60%      BB     Grandfather
22.31   Post-Filtered  5/5      100%     BB     Grandfather
19.77   Post-Filtered  5/5      100%     BB     Grandfather
16.64   Post-Filtered  5/5      100%     BB     Grandfather
12.07   Post-Filtered  5/5      100%     BB     Grandfather
12.03   Post-Filtered  5/5      100%     BB     Grandfather

Table 24 - Summary of BB Datasets
The table above was created by summarizing the data found in Table 1 to Table 10. The data was first sorted by type and then by SnR. The “Noisy” type refers to signals that were unfiltered, whereas “Post-Filtered” refers to noisy signals that had spectral subtraction applied to them in post-processing. The “Correct” column is the fraction of words that were correctly recognized by the ASR, and the “Percent” column simply expresses that fraction as a percentage.
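The SnR values tabulated throughout these summaries can be estimated from the recordings themselves. The following is a minimal sketch, assuming a speech-plus-noise segment and a noise-only segment have already been isolated; the thesis's exact measurement procedure is not reproduced here.

```python
import numpy as np

def snr_db(speech_and_noise, noise_only):
    """Estimate SnR in dB from a noisy-speech segment and a noise-only segment."""
    noise_power = np.mean(np.asarray(noise_only, dtype=float) ** 2)
    total_power = np.mean(np.asarray(speech_and_noise, dtype=float) ** 2)
    # Speech power is approximated as total power minus noise power,
    # clamped so the log is always defined.
    signal_power = max(total_power - noise_power, 1e-12)
    return 10.0 * np.log10(signal_power / noise_power)
```

For example, a segment whose total power is eleven times the noise power contains speech at ten times the noise power, i.e. an SnR of 10 dB.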
Figure 21 - SnR vs. Recognition Rate of BB for Noisy Signals Subjected to Post-Filtering
From Table 24 and Figure 21, there is reason to believe that Post-Filtering a noisy signal results
in a significantly higher recognition rate by ASR. All of the signals that were post-processed with spectral
subtraction were recognized without fail.
SnR     Type              Correct  Percent  Subj.  Set
12.5    Noisy             1/5      20%      RG     5 Sentences
8.56    Noisy             1/3      33%      RG     3 Commands
6.89    Noisy             1/6      17%      RG     Grandfather
18.2    Post-Filtered     3/5      60%      RG     5 Sentences
6.79    Post-Filtered     1/6      17%      RG     Grandfather
23.45   Real-Time Filter  4/5      80%      RG     5 Sentences
18.7    Real-Time Filter  2/3      67%      RG     3 Commands
9.12    Real-Time Filter  2/4      50%      RG     Grandfather

Table 25 - Summary of RG Datasets
Figure 22 - SnR vs. Recognition Rate of RG for Noisy, Post-Filtered and Real-time Filtering
From Table 25 and Figure 22, there is reason to believe that spectral subtraction increases the ASR rate under real-world circumstances. The three data points for the “Noisy” series show that signals under 15dB have a low recognition rate. However, the very same signals subjected to Post-Filtering once again show a marked improvement. The real-time filtering shows that if spectral subtraction results in a SnR that is over 20dB, the recognition rate increases once again.
6.4.3 Effectiveness of ASRES
ASRES seems to show some effectiveness in improving recognition in both cases that were explored. In the case where noise is digitally added to speech samples and then Post-Filtered, the
improvement is large. Under real-world circumstances, there is still a noticeable improvement; however, the results are not nearly as dramatic as the 100% recognition rates achieved in the case in which noise was digitally added and Post-Filtered.
From the intelligibility analysis as summarized in Section 6.4.1 - Summary of Phoneme Analysis,
there is little reason to believe that spectral subtraction at the specific SnRs offers anything more than a
marginal improvement in terms of intelligibility. However, from Section 6.4.2 - Summary of ASRES
Results, it is clear that performing spectral subtraction on noisy speech signals of a patient with ALS can
improve the automatic recognition of their speech signals when the Post-Filtering SnR is greater than
8dB.
Therefore, in our discussions regarding the effectiveness of the ASRES system, it would seem
that the enhancement that the system offers is not primarily one of increased intelligibility for human
listeners, but rather for automated machine recognition of speech.
It would not be fair, however, to say that the system is of no value for human listeners.
Although no discernible increase in intelligibility by filtering was achieved, PALS still gain several
immediate benefits from using ASRES.
For one, as discussed in Section 1.2 - The Effects of ALS on Speech and Quality of Life,
allowing PALS to speak from behind a mask while on BiPAP is already a huge step forward in terms of
combatting loneliness and struggles with isolation.
6.4.4 Feedback from ALS Community
The feedback from ALS caregivers and subjects has been very positive. In May 2009, ASRES was presented at the ALS Society of BC’s Engineering Design Competition, and a live demonstration of its capabilities was performed. As a result of the successful live demonstration, the system won the Principal Award of $5000 and was subsequently selected for a platform presentation, “C75 – Enhancing Speech During BiPAP Use”, at the 20th International Symposium on ALS/MND in Berlin, for which a travel bursary was also awarded. The findings were also reported during the keynote address at the conference closing as a novel attempt at a problem that had not previously been explored.
In addition, subject RG, who tested the system, was very pleased with it, as were the caregivers and professionals who were able to observe the system.
6.5 Validity
When considering the validity of the experiments that were conducted for this thesis, there are two
outstanding issues that still need to be considered:
• Sample size: Due to time, lifespan, disease progression, geographical and language constraints, the number of subjects for the experiment was limited to one subject, who was located with the aid of ALS BC. Finding local subjects was difficult not only due to the rarity of the disease, but also because a subject had to have some ability to speak left while on BiPAP ventilation. One subject was not able to participate due to poor command of the English language. Other subjects who were considered during the time that ASRES was in development did not live to see it come to fruition.
• Size of datasets: Although some results were achieved, only a handful of datasets were used and
the size of the library was very small as well. Larger datasets would help determine whether or
not it was sheer coincidence that resulted in the machine recognition rates being higher for Post-
Filtered and real-time filtered data.
7 CONCLUSIONS AND FUTURE WORK
In this section the work of the thesis is discussed. The goals of the research are reviewed, the results
are summarized, strengths and weaknesses are assessed and future work is discussed.
7.1 Research Goals Summary
The goal of this thesis was to develop and evaluate a prototype Automatic Speech Recognition and Enhancement System (ASRES) that allows Persons Living with ALS (PALS) to communicate clearly with their own voices while on BiPAP ventilation. This problem is a significant one, and a solution to it would greatly aid in improving the overall quality of life of PALS by allowing them to communicate naturally and allowing loved ones to hear their voices. Although there are some published works on related topics in the area of noise reduction and speech enhancement, there are no published works that offer a solution to this particular problem. As a result, it is not possible to evaluate ASRES against any existing system.
In this thesis, the problems associated with capturing and filtering speech by PALS are explained, and as a result specific hardware was selected and software written to capture PALS speech and to filter the wind noise that corrupts signal samples taken from within the mask. A working, mobile prototype was designed and implemented using Matlab and a TMS320C6713 DSP board, with an interface written in C# for Windows.
The effectiveness of ASRES was evaluated by digitally adding noise to speech samples from the Nemours Database of Dysarthric Speech, filtering them with spectral subtraction, and then running the resulting samples through ASRES in order to determine their scores. The effectiveness of the system was also evaluated by testing with a subject with ALS, who recorded a number of different speech samples under different conditions, including no filtering with BiPAP on and filtering with BiPAP on. Human subjects listened to the resulting sound samples and wrote down the sentences they perceived, and the accuracy of their transcriptions was then evaluated using phoneme analysis.
7.2 Contributions of this Work
In the introduction, it was stated that this thesis would attempt to improve the intelligibility of speech of persons living with ALS by, one, identifying the problems associated with capturing speech of PALS; two, designing and implementing a working prototype that would address these problems; and three, validating the effectiveness of the system by conducting experiments and analyzing their results.
An explanation was given for the problem of the “wind noise” that causes difficulties in capturing speech from behind a full face mask while a subject is on BiPAP ventilation, and an appropriate capture solution, involving a single microphone and spectral subtraction for stationary noise, was designed and implemented.
An automated speech recognition system using DTW was also implemented to validate our hypothesis that ASRES improves the intelligibility of PALS speech.
An evaluation of the results shows that spectral subtraction offers some improvement to machine recognition of noisy speech. The benefits of being able to accurately recognize the speech of PALS are many, and the system can be used to trigger alarms, operate a browser, or launch custom software that could contact a caregiver. All of these applications stand to improve the quality of life of PALS.
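Such triggering can be sketched as a simple dispatch on the recognized command. The phrases here match the three-command experiment above, but the action mapping itself is a hypothetical placeholder, not part of the ASRES software.

```python
# Hypothetical mapping from recognized command phrases to actions.
def make_dispatcher(actions):
    def dispatch(recognized_phrase):
        action = actions.get(recognized_phrase.lower())
        if action is None:
            # Unknown phrases are ignored rather than triggering anything.
            return "ignored: " + recognized_phrase
        return action()
    return dispatch

dispatch = make_dispatcher({
    "help me": lambda: "alarm raised",         # e.g. sound a caregiver alarm
    "open browser": lambda: "browser opened",  # e.g. launch a web browser
    "close window": lambda: "window closed",
})
print(dispatch("Help Me"))  # -> alarm raised
```

Because every action hinges on the recognizer emitting the right phrase, the recognition-rate gains reported above translate directly into more reliable triggering.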
A final experiment was conducted to evaluate whether or not filtering increases intelligibility as perceived by human listeners. Although one case showed a slight improvement, the other three cases did not, and the best conclusion is that filtered speech offers no major improvement in intelligibility for human listeners over noisy speech. Although no improvement was made with regard to intelligibility, the very fact that PALS can speak with their own voices while on BiPAP is already an improvement, as their speech is completely muffled by the masks if there is no amplification.
The results of this thesis are significant in that not only is there reason to believe from the data that
there is some benefit to performing filtering of noisy speech, but also that persons with ALS, caregivers
and professionals see this device as a step forward in improving the quality of life for persons with ALS.
7.3 Strengths and Limitations
7.3.1 Strengths
• Low hardware requirements: As a single-microphone solution was chosen, only a basic DSP
that reads and samples from one channel is required. All the components for a simple system,
including the microphone and DSP chip are easily available.
• Improvements to automated speech recognition: This is perhaps the single largest strength of the system: filtered speech is more accurately recognized by a machine. The implications of this are significant, as voice-activated alarms and the triggering of custom software all hinge on accurate recognition of the command words that activate them.
• Improvement of quality of life of PALS: As the system allows PALS to do what they could never do before, namely use their voice while on BiPAP, there is already an immediate improvement in the quality of life for PALS.
7.3.2 Weaknesses
• System training is difficult for PALS. As PALS often have difficulty with fine motor
coordination, having them use a Windows GUI to configure the system would prove difficult. All
training would have to be done with the aid of a caregiver who would not only setup the system,
but record samples of the command words.
• Results are largely based on a few datasets obtained from one subject with ALS. Although this was due to the limitations already mentioned, more tests with other PALS would help to confirm the validity of the results.
• The performance of the automated speech recognition degrades as the user's ability to articulate speech degrades. A mechanism for retraining on a regular basis is necessary in order to avoid decreased accuracy over time.
• Not the strongest noise filtering algorithm. The noise filtering can be improved by changing to a different algorithm, or by using an improved version of spectral subtraction that is more up to date with current research.
7.4 Potential Applications
There are a number of potential applications that could result from this prototype. The primary one, however, would be the development of this prototype into a completely portable, self-powered system that could be permanently integrated with the face mask, with the speakers and computer perhaps attached to the BiPAP unit itself.
It is also possible that the device could be integrated into the construction of existing masks; however, this would be a difficult procedure, as extensive testing would need to be conducted with mask manufacturers to determine whether or not the introduction of a microphone system would compromise either the integrity or the function of the mask.
7.5 Future Work
There are many areas of the project that would need to be enhanced in order to make this project a reality for persons with ALS. The following is a list of several major points that would still need to be solved in order to move ASRES from a prototype into a useable product.
• Permanent integration of the microphone with a full-face mask – As the system is currently a prototype, the microphone dangles within the mask cavity and occupies some CO2 ventilation holes. For long-term use, it would be essential to have microphones built into the walls of the mask itself. This would reduce the risk of the microphone ever becoming dislodged and also decrease the time it takes to secure a microphone in the mask each time it is taken off or washed. The addition of a microphone to the mask is a more complex problem that would involve mask design companies, and approval would also have to be sought to ensure that the classification of the mask remains a Type I or II medical device.
• Development of a completely self-contained prototype that would not need an electrical plug-in or a laptop to configure it. Currently the system is rather difficult to set up and involves a fair number of wires, connections, and DIP switches that cannot be easily set up and are prone to being broken or unplugged if sudden movements stretch the wires.
• Improved training interface – The current user interface is hardly user-friendly and does not offer very many features to users.
• Improving the ASR algorithm by exploring HMMs and building a more sophisticated
automated speech recognition system that would be able to handle larger sets of data.
• Exploring the effect of reverberations in the mask. Although the literature indicates that reverberations within fighter pilot masks decrease intelligibility, this has yet to be confirmed with the full-face masks worn by those on BiPAP.
7.6 Conclusion
This thesis has identified the problems associated with capturing and recognizing the speech of PALS who are on BiPAP ventilation. A prototype was designed, implemented and tested with human subjects and DTW to evaluate its improvements to speech intelligibility. Although no conclusive improvements to intelligibility for human listeners were demonstrated, the ALS community has affirmed that there is real value in being able to communicate from behind a face mask, and that this in and of itself improves the quality of life of persons living with ALS. In addition, there is reason to believe that spectral subtraction filtering of noisy speech enhances automated machine recognition of PALS speech.
The work of this thesis is a first step towards improving the quality of life for PALS. Although the life expectancy of PALS is approximately 2-5 years, and the length of time that a device like this would be useful would therefore be fairly short, perhaps on the order of months or at most a year or two, such a device can still offer immeasurable benefit: its value is determined not by its useable lifespan, but by the impact that it has on caregivers and families for the time in which it can be used.
BIBLIOGRAPHY [1] NIH, “Amyotrophic Lateral Sclerosis Fact Sheet: National Institute of Neurological Disorders and
Stroke (NINDS),” National Institute of Neurological Disorders and Stroke, 2003. [Online]. Available: http://www.ninds.nih.gov/disorders/amyotrophiclateralsclerosis/detail_amyotrophiclateralsclerosis.htm?css=print.
[2] L. S. Aboussouan, S. U. Khan, M. Banerjee, A. C. Arroliga, and H. Mitsumoto, “Objective measures of the efficacy of noninvasive positive-pressure ventilation in amyotrophic lateral sclerosis,” MUSCLE & NERVE, vol. 24, no. 3, pp. 403–409, Mar. 2001.
[3] E. R. Klasner and K. M. Yorkston, “Speech intelligibility in ALS and HD dysarthria: the everyday listener’s perspective.(amyotrophic lateral sclerosis)(Huntington disease),” Journal of Medical Speech - Language Pathology, Jun. 2005.
[4] J. F. Kent, R. D. Kent, J. C. Rosenbek, G. Weismer, R. Martin, R. Sufit, and B. R. Brooks, “Quantitative Description of the Dysarthria in Women With Amyotrophic Lateral Sclerosis.,” Journal of Speech & Hearing Research, vol. 35, no. 4, p. 723, 1992.
[5] “A Guide to ALS Patient Care for Primary Care Physicians.” ALS Society of Canada. [6] M. Shoeb, S. E. Merel, M. B. Jackson, and B. D. Anawalt, “‘Can we just stop and talk?’ patients
value verbal communication about discharge care plans,” J. Hosp. Med., vol. 7, no. 6, pp. 504–507, Aug. 2012.
[7] L. Giordano, S. Toma, R. Teggi, F. Palonta, F. Ferrario, S. Bondi, and M. Bussi, “Satisfaction and Quality of Life in Laryngectomees after Voice Prosthesis Rehabilitation,” Folia Phoniatr. Logop., vol. 63, no. 5, pp. 231–236, 2011.
[8] M. Parker, S. Cunningham, P. Enderby, M. Hawley, and P. Green, “Automatic speech recognition and training for severely dysarthric users of assistive technology: The STARDUST project,” Clinical Linguistics & Phonetics, vol. 20, no. 2–3, pp. 156, 149, 2006.
[9] X. Menendez-Pidal, J. B. Polikoff, S. M. Peters, J. E. Leonzio, and H. T. Bunnell, “Nemours database of dysarthric speech,” in Proceedings of the 1996 International Conference on Spoken Language Processing, ICSLP. Part 3 (of 4), Oct 3-6 1996, Piscataway, NJ, USA, 1996, vol. 3, pp. 1962–1965.
[10] A. B. Kain, J.-P. Hosom, X. Niu, J. P. H. van Santen, M. Fried-Oken, and J. Staehely, “Improving the intelligibility of dysarthric speech,” Speech Communication, vol. 49, no. 9, pp. 743–759, Sep. 2007.
[11] B. Tomik and R. J. Guiloff, “Dysarthria in amyotrophic lateral sclerosis: A review,” Amyotroph. Lateral. Scler., vol. 11, no. 1–2, pp. 4–15, 2010.
[12] K. C. Hustad, “The Relationship Between Listener Comprehension and Intelligibility Scores for Speakers With Dysarthria,” J Speech Lang Hear Res, vol. 51, no. 3, pp. 562–573, Jun. 2008.
[13] W. Jones, P. Mathy, T. Azuma, and J. Liss, “The Effect of Aging and Synthetic Topic Cues on the Intelligibility of Dysarthric Speech,” AAC: Augmentative and Alternative Communication, vol. 20, no. 1, pp. 22–29, 2004.
[14] J. Murphy, “Communication strategies of people with ALS and their partners,” AMYOTROPHIC LATERAL SCLEROSIS AND OTHER MOTOR NEURON DISORDERS, vol. 5, no. 2, pp. 121–126, Jun. 2004.
[15] I. R. Murray and J. L. Arnott, “Synthesizing emotions in speech: is it time to get excited?,” Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, vol. 3, pp. 1816–1819 vol.3, 3.
[16] M. S. Yakoub, S.-A. Selouani, and D. O’Shaughnessy, “Improving dysarthric speech intelligibility through re-synthesized and grafted units,” in Electrical and Computer Engineering, 2008. CCECE 2008. Canadian Conference on, 2008, pp. 001523–001526.
[17] N. Yousefian and P. C. Loizou, “A Dual-Microphone Speech Enhancement Algorithm Based on the Coherence Function,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 2, pp. 599–609, 2012.
[18] “ALS Facts,” Amyotrophic Lateral Sclerosis Society of Canada.
[19] “Canadian Cancer Statistics 2012,” Canadian Cancer Society, 2012.
[20] G. S. Kang and T. M. Moran, “Speech enhancement in noise and within face mask (microphone array approach),” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1998, vol. 2, pp. 1017–1020.
[21] P. Goel and A. Garg, “Review of Spectral Subtraction Techniques for Speech Enhancement,” IJECT, vol. 2, no. 4, 2011.
[22] E. Nemer and W. Leblanc, “Single-microphone wind noise reduction by adaptive postfiltering,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA ’09), 2009, pp. 177–180.
[23] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[24] S. Ahn and H. Ko, “Background noise reduction via dual-channel scheme for speech recognition in vehicular environment,” IEEE Transactions on Consumer Electronics, vol. 51, no. 1, pp. 22–27, 2005.
[25] Y. Ephraim and D. Malah, “Speech enhancement using optimal non-linear spectral amplitude estimation,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’83), 1983, vol. 8, pp. 1118–1121.
[26] Lu-ying Sui, Xiong-wei Zhang, Jian-jun Huang, and Bin Zhou, “An improved spectral subtraction speech enhancement algorithm under non-stationary noise,” in International Conference on Wireless Communications and Signal Processing (WCSP 2011), 2011, pp. 1–5.
[27] Bing-yin Xia, Yan Liang, and Chang-chun Bao, “A modified spectral subtraction method for speech enhancement based on masking property of human auditory system,” in International Conference on Wireless Communications & Signal Processing (WCSP 2009), 2009, pp. 1–5.
[28] B. H. Juang and L. R. Rabiner, “Automatic Speech Recognition – A Brief History of the Technology Development,” in Elsevier Encyclopedia of Language and Linguistics, Second ed., 2005.
[29] E. J. Keogh and M. J. Pazzani, “Scaling up dynamic time warping to massive datasets,” in Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, 1999, pp. 1–11.
[30] T. Oates, L. Firoiu, and P. R. Cohen, “Using Dynamic Time Warping to Bootstrap HMM-Based Clustering of Time Series,” in Sequence Learning: Paradigms, Algorithms, and Applications, 2001, pp. 35–52.
[31] H. Tolba and A. S. El Torgoman, “Towards the improvement of automatic recognition of dysarthric speech,” in 2nd IEEE International Conference on Computer Science and Information Technology (ICCSIT 2009), 2009, pp. 277–281.
APPENDICES
Appendix A: Procedure for Collecting Speech Samples of a PALS
The following is a step-by-step outline of the procedure to be conducted at the subject’s home.
Introduction – 5 minutes
• Welcome and thank the patient for their participation
• Explain the purpose of the study, the patient’s right to terminate the study at any time, and the outline of the procedure
• Sign the consent form
Prototype Setup – 2 minutes
Setup of the prototype involves inserting the microphone into the mask for recording.
Experiment #1: - 4 minutes
The patient speaks 3 sentences and a paragraph while wearing the mask with the inserted microphone. Samples
are recorded.
Experiment #2: - 4 minutes
The sound filter is enabled via a toggle switch, and the patient speaks the same 3 sentences and paragraph while wearing the
mask with the inserted microphone. Samples are recorded.
Debrief – 5 minutes
The debrief involves answering any questions the subject may have. The subject is then thanked for
participating in the study.
Summary
Total visits: 1
Total Duration: 20 min
Appendix B: Phoneme Analysis Data Sheets
Table 26 - Phoneme Analysis of RS
No Filtering (SNR = 12.5 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  The fight is waiting the cop.         0      -1      -1     -2  81.8%
  The dime is burying the bet.          0      -1       0     -1  90.9%
  The view is meeting the pack.        -1      -2      -1     -4  63.6%
  The bin is doing the fin.             0      -2      -2     -4  63.6%
  The page is licking the vine.        -2       0      -3     -5  54.5%
Trial 2
  The fight is waiting the cop.         0      -1      -1     -2  81.8%
  The dime is burying the bat.          0      -1      -1     -2  81.8%
  The view is meeting the pet.         -1      -2      -1     -4  63.6%
  The bin is doing the thing.           0      -2      -3     -5  54.5%
  The beat is making the rock.         -1      -2      -1     -4  63.6%
Trial 3
  The fight is winning the cop.         0      -2      -1     -3  72.7%
  The dime is burying the vet.          0      -1      -1     -2  81.8%
  The view is meeting the cat.         -1      -2      -1     -4  63.6%
  The bin is doing the fig.             0      -2      -2     -4  63.6%
  The meat is making the rock.         -2      -2      -1     -5  54.5%

RT Filtered (SNR = 23.45 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  The fight is waving the towel.        0      -1      -3     -4  63.6%
  The dime is wearing the vest.         0      -1      -2     -3  72.7%
  The zoo is soothing the pet.         -2      -3      -1     -6  45.5%
  The bin is doing the shade.           0      -2      -1     -3  72.7%
  The beat is making the rock.         -1      -2      -1     -4  63.6%
Trial 2
  The fight is waving the tub.          0      -1      -2     -3  72.7%
  The guide is wearing the pet.        -2      -1      -1     -4  63.6%
  The do is zooting the pack.           0      -3      -1     -4  63.6%
  The bin is doing the fade.            0      -2       0     -2  81.8%
  The beat is making the rock.         -1      -2      -1     -4  63.6%
Trial 3
  The fight is waving the tub.          0      -1      -2     -3  72.7%
  The dine is wearing the bet.         -1      -1       0     -2  81.8%
  The do is zooting the pat.            0      -3       0     -3  72.7%
  The bin is stewing the fade.          0       0       0      0  100.0%
  The date is making the rock.         -1      -2      -1     -4  63.6%

Mean % of correctly identified phonemes per sentence (No Filtering / RT Filtered)
  Sentence #1: 78.8% / 69.7%
  Sentence #2: 84.8% / 72.7%
  Sentence #3: 63.6% / 60.6%
  Sentence #4: 60.6% / 84.8%
  Sentence #5: 57.6% / 63.6%

Overall % of correctly identified phonemes: 69.1% / 70.3%
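The arithmetic behind these data sheets can be sketched as follows. The sketch assumes (consistent with every row in Tables 26–28) that each test sentence contains 11 target phonemes spread across its three key words, that each word entry records the number of phonemes the listener missed as a negative count, and that the per-sentence score is the fraction of phonemes identified correctly. The function name `sentence_score` is illustrative, not from the thesis.

```python
TOTAL_PHONEMES = 11  # assumed phoneme count per test sentence

def sentence_score(word_errors):
    """Percent of correctly identified phonemes for one sentence.

    word_errors: per-word entries from the data sheet, e.g. (0, -1, -1),
    where each negative value counts phonemes the listener missed.
    """
    missed = -sum(word_errors)
    return 100.0 * (TOTAL_PHONEMES - missed) / TOTAL_PHONEMES

# Sentence #1 of Table 26 (no filtering), word errors for trials 1-3:
trials = [(0, -1, -1), (0, -1, -1), (0, -2, -1)]
scores = [sentence_score(t) for t in trials]
print([round(s, 1) for s in scores])        # [81.8, 81.8, 72.7]

# Mean over the three trials, matching the "Sentence #1" row:
print(round(sum(scores) / len(scores), 1))  # 78.8
```

The overall percentage at the bottom of each table is the same computation pooled over all 15 sentences (165 phonemes) in a condition.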
Table 27 - Phoneme Analysis of EC
No Filtering (SNR = 12.5 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  the fight is waiting the car          0      -1      -2     -3  72.7%
  the dime is baring the dame           0       0      -3     -3  72.7%
  the view is leading the pack         -1      -1      -1     -3  72.7%
  the bend is chewing the food         -2      -2      -1     -5  54.5%
  the bait is making the bard           0      -2      -4     -6  45.5%
Trial 2
  the fight is waiting the car          0      -1      -2     -3  72.7%
  the dime is bearing the bird          0       0      -3     -3  72.7%
  the dew is leaving the pack           0      -1      -1     -2  81.8%
  the bend is chewing the food         -2      -2      -1     -5  54.5%
  the bait is making the rod            0      -2      -1     -3  72.7%
Trial 3
  the fight is waiting the car          0      -1      -2     -3  72.7%
  the dime the bearing the debt         0       0      -3     -3  72.7%
  the dew is leading the pack           0      -1      -1     -2  81.8%
  the bend is chewing the food         -2      -2      -1     -5  54.5%
  the bait is making the vine           0      -2      -3     -5  54.5%

RT Filtered (SNR = 23.45 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  the fight is waiting the cub          0      -1      -1     -2  81.8%
  the die is wearing the bat           -1      -1      -1     -3  72.7%
  the dew is leading the past           0      -1      -2     -3  72.7%
  the den is stealing the spade        -2      -2      -2     -6  45.5%
  the bait is making the run            0      -2      -2     -4  63.6%
Trial 2
  the fight is waiting the tub          0      -1      -2     -3  72.7%
  the dye is wearing the pants         -1      -1      -4     -6  45.5%
  the dew is sleeping the pack          0      -1      -1     -2  81.8%
  the den is stewing the fade          -2       0       0     -2  81.8%
  the bait is making the run            0      -2      -2     -4  63.6%
Trial 3
  the fight is waiting the top          0      -1      -2     -3  72.7%
  the die is wearing the pants         -1      -1      -4     -6  45.5%
  the dew is sleeping the past          0      -1      -2     -3  72.7%
  the den is stewing the fade          -2       0       0     -2  81.8%
  the bait is making the run            0      -2      -2     -4  63.6%

Mean % of correctly identified phonemes per sentence (No Filtering / RT Filtered)
  Sentence #1: 72.7% / 75.8%
  Sentence #2: 72.7% / 54.5%
  Sentence #3: 78.8% / 75.8%
  Sentence #4: 54.5% / 69.7%
  Sentence #5: 57.6% / 63.6%

Overall % of correctly identified phonemes: 67.3% / 67.9%
Table 28 - Phoneme Analysis of JW
No Filtering (SNR = 12.5 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  the fight is waiting the cob          0      -1       0     -1  90.9%
  the dine is                          -1      -5      -3     -9  18.2%
  the dew is leaving the path           0      -1      -1     -2  81.8%
  the bin is spewing the                0      -1      -3     -4  63.6%
  the fate is making the rock          -1      -2      -1     -4  63.6%
Trial 2
  the fight is waiting the cob          0      -1       0     -1  90.9%
  the dine is bearing the bear         -1       0      -1     -2  81.8%
  the dew is leaving the path           0      -1      -1     -2  81.8%
  the bin is spewing the fish           0      -1      -2     -3  72.7%
  the bait is making the rock           0      -2      -1     -3  72.7%
Trial 3
  the fight is waiting the cob          0      -1       0     -1  90.9%
  the dime is bearing the bear          0       0      -1     -1  90.9%
  the dew is leaving the path           0      -1      -1     -2  81.8%
  the bin is spewing the fish           0      -1      -2     -3  72.7%
  the fate is making the rock          -1      -2      -1     -4  63.6%

RT Filtered (SNR = 23.45 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  The fight is waiting the tub          0      -1      -2     -3  72.7%
  The dine is wearing the bear         -1      -1      -1     -3  72.7%
  The dew is reaping the path           0      -2      -1     -3  72.7%
  The bin is spewing the spade          0      -1      -2     -3  72.7%
  The bait is making the rock           0      -2      -1     -3  72.7%
Trial 2
  The fight is waiting the tub          0      -1      -2     -3  72.7%
  The die is wearing the bear          -1      -1      -1     -3  72.7%
  The dew is reaping the path           0      -2      -1     -3  72.7%
  The bin is stewing the spade          0       0      -2     -2  81.8%
  The bait is making the rock           0      -2      -1     -3  72.7%
Trial 3
  The fight is waiting the tub          0      -1      -2     -3  72.7%
  The dine is wearing the bear         -1      -1      -1     -3  72.7%
  The stew is reaping the path         -1      -2      -1     -4  63.6%
  The bin is stewing the sage           0       0      -2     -2  81.8%
  The bait is making the rock           0      -2      -1     -3  72.7%

Mean % of correctly identified phonemes per sentence (No Filtering / RT Filtered)
  Sentence #1: 90.9% / 72.7%
  Sentence #2: 63.6% / 72.7%
  Sentence #3: 81.8% / 69.7%
  Sentence #4: 69.7% / 78.8%
  Sentence #5: 66.7% / 72.7%

Overall % of correctly identified phonemes: 74.5% / 73.3%