SPEECH ENHANCEMENT DURING BiPAP
USE FOR PERSONS LIVING WITH ALS
by
SAMUEL D. CHUA
B.A.Sc., University of British Columbia, 2005
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF APPLIED SCIENCE
in
The Faculty of Graduate Studies
(Electrical and Computer Engineering)
THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)
November 2012
© Samuel D. Chua, 2012
ABSTRACT
Speaking from behind a face mask while on Bilevel Positive Airway Pressure (BiPAP) ventilation is
extremely difficult for persons living with Amyotrophic Lateral Sclerosis (ALS). The inability to verbally
communicate while on ventilation causes frustration and feelings of isolation from loved ones and
decreases quality of life.
A system that integrates with face masks, captures speech, removes ventilator wind noise, and
outputs and recognizes the de-noised speech is proposed, implemented and tested. The system is tested with a
dataset consisting of digitally added noise as well as a single patient with ALS. Automated machine
recognition of the words is then performed and results analyzed. A subjective listening test is conducted
with individuals listening to the noisy and filtered speech samples and the results are also analyzed.
Although intelligibility does not seem to improve for human listeners, there appears to be some
improvement in machine recognition scores. In addition, feedback from the ALS community reports an
improvement in the quality of life simply because patients are able to use their own voice and be heard by
loved ones.
PREFACE
Ethics approval, H10-01703, for the project “Speech Enhancement During BiPAP Use For Persons Living
With ALS” was obtained through the Clinical Research Ethics Board.
TABLE OF CONTENTS
ABSTRACT ................................................................................................................................................. ii
PREFACE ................................................................................................................................................... iii
TABLE OF CONTENTS ............................................................................................................................. iv
LIST OF TABLES ..................................................................................................................................... vii
LIST OF FIGURES ................................................................................................................................... viii
ABBREVIATIONS ...................................................................................................................................... ix
ACKNOWLEDGEMENTS .......................................................................................................................... x
DEDICATION ............................................................................................................................................. xi
1 INTRODUCTION ................................................................................................................................. 1
1.1 ALS and BiPAP ............................................................................................................................ 1
1.2 The Effects of ALS on Speech and Quality of Life ...................................................................... 1
1.3 Research Goals .............................................................................................................................. 2
1.4 Organization of the Thesis ............................................................................................................ 2
1.5 Contributions of the Thesis ........................................................................................................... 2
2 BACKGROUND ................................................................................................................................... 3
2.1 Speech: Our Preferred Method of Communication ....................................................................... 3
2.2 Speech Dysarthria ......................................................................................................................... 3
2.3 Speech in Patients with ALS ......................................................................................................... 4
2.4 Measuring Speech Intelligibility ................................................................................................... 4
2.5 Increasing Speech Intelligibility .................................................................................................... 4
2.6 Three Difficulties in Increasing Speech Intelligibility .................................................................. 5
3 RELATED WORK – HISTORY AND PRESENT .............................................................................. 6
3.1 The Capture Problem..................................................................................................................... 6
3.2 The Noise Problem ........................................................................................................................ 8
3.3 Automated Speech Recognition .................................................................................................... 9
3.4 Automated Speech Recognition in Persons with ALS ................................................................ 10
4 AUTOMATIC SPEECH RECOGNITION AND ENHANCEMENT SYSTEM ............................... 12
4.1 System Overview and Setup........................................................................................................ 12
4.2 Microphone Selection.................................................................................................................. 14
4.3 Calibration and Positioning ......................................................................................................... 15
4.4 Microphone Powering Circuit ..................................................................................................... 16
4.5 Spectral Subtraction .................................................................................................................... 17
4.6 Speech Extraction ........................................................................................................................ 21
4.7 Mel Frequency Cepstral Coefficients .......................................................................................... 24
4.8 Dynamic Time Warping .............................................................................................................. 25
5 USER INTERFACE AND SYSTEM USAGE ................................................................................... 31
5.1 User Interface Description ........................................................................................................... 31
5.2 System Usage .............................................................................................................................. 32
5.3 Initial Training ............................................................................................................................. 32
6 EXPERIMENTS ................................................................................................................................. 34
6.1 Experimental Setup ..................................................................................................................... 34
6.1.1 Goal ..................................................................................................................................... 34
6.1.2 Setup .................................................................................................................................... 34
6.1.3 Hypothesis ........................................................................................................................... 35
6.2 Experimental Results ................................................................................................................... 36
6.2.1 Digital Addition of Noise to Nemours Subject BB ............................................................. 36
6.2.2 Person Living with ALS – RG ............................................................................................ 41
6.3 Phoneme Analysis ....................................................................................................................... 45
6.4 Discussion ................................................................................................................................... 47
6.4.1 Summary of Phoneme Analysis .......................................................................................... 47
6.4.2 Summary of ASRES Results ............................................................................................... 48
6.4.3 Effectiveness of ASRES ...................................................................................................... 50
6.4.4 Feedback from ALS Community ........................................................................................ 51
6.5 Validity ........................................................................................................................................ 51
7 CONCLUSIONS AND FUTURE WORK .......................................................................................... 53
7.1 Research Goals Summary ........................................................................................................... 53
7.2 Contributions of this Work .......................................................................................................... 53
7.3 Strengths and Limitations ............................................................................................................ 54
7.3.1 Strengths .............................................................................................................................. 54
7.3.2 Weaknesses ......................................................................................................................... 55
7.4 Potential Applications ................................................................................................................. 55
7.5 Future Work ................................................................................................................................ 55
7.6 Conclusion ................................................................................................................................... 56
BIBLIOGRAPHY ....................................................................................................................................... 57
APPENDICES ............................................................................................................................................. 59
Appendix A: Procedure for Collecting Speech Samples of a PALS ....................................................... 59
Appendix B: Phoneme Analysis Data Sheets .......................................................................................... 60
LIST OF TABLES
Table 1 - Noisy DTW of 5 Words by BB .......................................................... 37
Table 2 - Post-Filtered DTW of 5 Words by BB ................................................ 37
Table 3 - Noisy DTW of 5 Words by BB with +3dB Noise ................................ 38
Table 4 - Post-Filtered DTW of 5 Words by BB with +3dB Noise ..................... 38
Table 5 - Noisy DTW of 5 Words by BB with +7dB Noise ................................ 39
Table 6 - Post-Filtered DTW of 5 Words by BB with +7dB Noise ..................... 39
Table 7 - Noisy DTW of 5 Words by BB with +13dB Noise .............................. 40
Table 8 - Post-Filtered DTW of 5 Words by BB with +13dB Noise ................... 40
Table 9 - Noisy DTW of 5 Words by BB with +13dB Noise and Alternate Noise ............ 41
Table 10 - Post-Filtered DTW of 5 Words by BB with +13dB Noise and Alternate Noise .. 41
Table 11 - DTW of 6 Words by RG ................................................................. 42
Table 12 - Post-Filtered DTW of 6 Words by RG .............................................. 42
Table 13 - Real-Time Filtered DTW of 4 Words by RG ..................................... 42
Table 14 - Noisy DTW of 5 Words by RG ........................................................ 43
Table 15 - Post-Filtered DTW of 5 Words by RG .............................................. 43
Table 16 - Real-Time Filtered DTW of 5 Words by RG ..................................... 44
Table 17 - DTW of 3 Phrases by RG ................................................................ 44
Table 18 - Real-Time Filtered DTW of 3 Phrases by RG ................................... 44
Table 19 - Control 5 Sentences and Phoneme Divisions .................................... 45
Table 20 - An Example Phoneme Analysis Trial Explained ............................... 45
Table 21 - Percentage of Correctly Identified Phonemes ................................... 46
Table 22 - Phoneme Analysis of MB ................................................................ 47
Table 23 - Summary of Phoneme Analysis ....................................................... 48
Table 24 - Summary of BB Datasets ................................................................ 48
Table 25 - Summary of RG Datasets ................................................................ 50
Table 26 - Phoneme Analysis of RS ................................................................. 60
Table 27 - Phoneme Analysis of EC ................................................................. 61
Table 28 - Phoneme Analysis of JW ................................................................. 62
LIST OF FIGURES
Figure 1 - Cross-Sectional View of Kang's Microphone Array ............................. 8
Figure 2 - System Block Diagram ................................................................... 12
Figure 3 - Panasonic Noise Cancelling Microphone Cartridge (WM-55D103) ...... 14
Figure 4 - Frequency Response ...................................................................... 14
Figure 5 - Airflow and Optimal Microphone Placement .................................... 16
Figure 6 - Electret Microphone Circuit ............................................................ 17
Figure 7 - Normalized WAV .......................................................................... 22
Figure 8 - Original and Truncated WAV .......................................................... 23
Figure 9 - Distance Map of Different Words .................................................... 25
Figure 10 - Distance Map of Two Identical Words ........................................... 26
Figure 11 - DTW Minimum Cost Path Equation ............................................... 26
Figure 12 - DTW Scoring for the First Row ..................................................... 27
Figure 13 - DTW Scoring for the Second Row ................................................. 27
Figure 14 - DTW Scoring for the Entire Grid ................................................... 27
Figure 15 - Minimum Cost Path for Two Different Words ................................. 28
Figure 16 - Path with a Score of Zero for Identical Words ................................. 28
Figure 17 - Cost Map of a Word Spoken Slowly ............................................... 29
Figure 18 - Path with a Zero Score for a Word Spoken Slowly ........................... 29
Figure 19 - Euclidean Distance in 3-D ............................................................. 30
Figure 20 - ASR Tool .................................................................................... 31
Figure 21 - SnR vs. Recognition Rate of BB for Noisy Signals Subjected to Post-Filtering ........... 49
Figure 22 - SnR vs. Recognition Rate of RG for Noisy, Post-Filtered and Real-Time Filtering ..... 50
ABBREVIATIONS
Term Definition
AWG American Wire Gauge
ALS Amyotrophic Lateral Sclerosis
ASRES Automated Speech Recognition and Enhancement System
BiPAP Bilevel Positive Airway Pressure
DTW Dynamic Time Warping
HMM Hidden Markov Model
MFCC Mel Frequency Cepstral Coefficients
MMSE Minimum Mean Squared Error
LPC Linear Predictive Coding
PALS Person living with Amyotrophic Lateral Sclerosis
SnR Signal-to-Noise Ratio
ACKNOWLEDGEMENTS
To my supervisor, Philippe, I am grateful for your continued support throughout the duration of this project. If it were not for your patience and guidance, this project would never have come to fruition.
To the members of PROP BC and the ALS Society of BC, thank you for your willingness to support this project and to offer your feedback.
And finally, to my dear wife, Esther, words simply cannot express my gratitude for your support and sacrifices along the way.
DEDICATION
Dedicated to….
those who have fought with ALS courageously
and to my God who made me and strengthened me for this work.
1 INTRODUCTION
1.1 ALS and BiPAP
Amyotrophic lateral sclerosis (ALS) is a progressive neurological disease that destroys the nerve
cells associated with voluntary muscle control [1]. Although the initial symptoms of the disease vary from
person to person, as time progresses all persons with ALS eventually begin to lose their mobility and their
ability to speak, and to have trouble breathing due to the weakening of their respiratory muscles.
At some point assisted breathing is required in the form of mechanical ventilation to ease the
strain on the weakened muscles. Usage may initially be nocturnal followed by increasing daytime usage
as the disease progresses. Eventually, ventilation will be required on a full-time basis when the respiratory
muscles are no longer able to maintain appropriate oxygen and carbon dioxide levels [1].
One assisted breathing method which is common in North America is the Bilevel Positive Airway
Pressure (BiPAP) breathing apparatus. Unlike a mechanical ventilator, it does not replace the normal
breathing mechanism, but rather allows patients with neuromuscular diseases similar to ALS to breathe
normally while reducing the amount of effort required by the patient. This is accomplished by fitting a
mask to the patient’s face and applying a higher positive air pressure upon inhalation and a lower
positive pressure upon exhalation. Although BiPAP cannot slow the progression of the disease, it has
been shown to improve the quality of life in patients [2].
One major problem caused by BiPAP utilization is its interference with speech caused by the
muffling of vocalization by the BiPAP mask and the associated airflow noise. For patients using BiPAP
several hours a day this problem greatly limits their ability to communicate verbally.
1.2 The Effects of ALS on Speech and Quality of Life
Mixed dysarthria associated with ALS involves imprecise consonants, hypernasality, slowed
speech rate, harsh vocal quality, breathiness, slurred speech and low pitch [3]. Acoustic studies show
differences in vowel duration, fundamental frequency and vowel space with some variation between
different individuals [4].
The effects of dysarthria on speech alone have quite an impact on the quality of life of a patient.
The inability to communicate effectively with voice leads to feelings of isolation, frustration, anxiety, loss
of control and increased sadness. Isolation comes from the reduced amount of communication, frustration
from not being understood, fear and anxiety from failed communication attempts, loss of control as their
opinions are ignored or misunderstood, and sadness due to the isolation and frustration experienced by the
caregivers and patient [5]. In addition to this, BiPAP users who wear face masks also feel more distant
and isolated from their friends and primary caregivers. The combined effect of reduced ability to speak
and also the BiPAP face mask reduce the quality of life of persons living with ALS considerably.
1.3 Research Goals
The primary goal of this thesis is to develop a prototype Automatic Speech Recognition and
Enhancement System (ASRES) that will allow Persons Living with ALS (PALS) to communicate clearly
with their voices while being on BiPAP ventilation. The primary goals of this thesis can be separated into
three sub-goals as follows:
• Identify the problems associated with capturing and filtering speech by PALS who are on BiPAP
ventilation
• Design and implement a working prototype that PALS can use
• Validate the effectiveness of the system by examining the recognition rate of ASRES when used
with PALS and also by objectively measuring increases in intelligibility through listening tests
1.4 Organization of the Thesis
This thesis consists of seven chapters. Chapter 1 is an introduction to the problem while Chapters 2
and 3 discuss the background of the problem as well as related work. Chapter 4 presents the proposed
system including an overview of the theory and details behind the implementation. Chapter 5 covers the
user interface and describes the usage of the system including training. Chapter 6 includes all the data
obtained from the different experiments as well as a summary of the data. And lastly, Chapter 7 presents
the conclusions and suggestions for future work.
1.5 Contributions of the Thesis
The contributions of this thesis include:
• A discussion of the unique problem that PALS on BiPAP ventilation face with regard to
communicating using their voices
• An implementation and analysis of ASRES that fits the needs of PALS on BiPAP
• Validation of the effectiveness of the system by conducting experiments with noise digitally
added to samples of dysarthric speech as well as real subject data from a PALS
• A final analysis and evaluation of the prototype
2 BACKGROUND
The purpose of this section is to clearly articulate the communication problems caused by the face
masks of BiPAP users and why they reduce the quality of life of PALS.
2.1 Speech: Our Preferred Method of Communication
Speech is the primary method of communication between people. Although it is a common
method of communicating our intent, many factors come into play when it comes to having highly
intelligible speech. For example, pitch and tone contribute to the semantic meaning of a phrase as well as
inflection. A subtle variation, such as a slight rise in tone at the end of a phrase, can turn a statement
into a question. Accents and variations in localized pronunciation also
provide a challenge with regards to both human and machine recognition. The ability to communicate is
simply a part of our nature and surveys and studies have shown that there is a value in person-to-person
verbal communication in patient care that cannot be replaced by simply giving written instructions [6]. A
study of laryngectomees (persons who have had their larynx removed due to illness) shows that their
quality of life improves after the restoration of verbal communication through either a voice prosthesis or
a tracheo-oesophageal puncture [7]. Since the loss of the ability to speak is considered a loss in quality of
life for individuals, any improvements or enhancements that we can make to restore the intelligibility of
speech will help improve an individual’s quality of life.
2.2 Speech Dysarthria
Speech dysarthria is a term that refers to a group of motor speech disorders that result from either
central or peripheral nervous system damage [8]. The disruption to muscular control that affects the
muscles used in producing speech can result in differing levels of intelligibility. Some patients who suffer
from mild dysarthria when subjected to a Frenchay Dysarthria test can produce scores which are low
enough to have them pass as normal speakers. The Nemours database [9] contains sound samples of 11
different male speakers with varying degrees of dysarthria. Some of the speakers have a greater than 80%
intelligibility rate, while others score 60% or lower, to the point where they are
virtually unintelligible.
Although the degree of dysarthria varies from person to person, imprecise articulation [10] is
characteristic of all dysarthric speakers. Research has shown that for dysarthric speakers, vowels are easy to
produce whereas consonants are difficult to enunciate. The speech of patients with dysarthria is often
characterized as being either very nasal or distorted.
2.3 Speech in Patients with ALS
Persons living with ALS have a kind of mixed dysarthria that is characterized by defective
articulation, slow speech, and imprecise consonant and vowel formation among other things.
Documentation of their speech is not extensive, but a few studies have been conducted to examine their
speech rate, vowel space and variance in speech intelligibility [11].
Although their symptoms are similar to those of other individuals with dysarthria, they face a
unique problem: unlike other disabilities that affect speech, such as cerebral palsy and multiple
sclerosis, ALS causes the patient’s ability to speak to degenerate steadily over time. This poses a particular
challenge when it comes to developing solutions to improve the quality of speech for an individual
with ALS, as the specifics of their dysarthria change over time, rendering a potentially helpful system
either less effective or ineffective altogether.
2.4 Measuring Speech Intelligibility
Intelligibility can be defined as “how well a speaker’s acoustic signal can be accurately recovered
by a listener” [12]. Although the quality of a speech signal affects comprehension, it is important to note
that there are many other nonverbal factors that are involved in listener comprehension. Some examples
would be the length of a message, its predictability, context, relationship to the listener, and facial cues.
The measurement of intelligibility is no simple task either as there are multiple ways in which this
can be done. One way is through orthographic transcription in which a user listens to a speech sample and
then attempts to reproduce in writing what they heard. A percentage score is obtained by calculating the
number of words correctly identified over the total number of words. Although this can objectively
measure the number of words correctly perceived, it is important to design the experiment in such a way
that the context of the sentence in which the words appear does not heavily influence recognition of each
word.
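The percentage score described above can be sketched in a few lines. This is an illustrative sketch only: the function name and sample sentences are hypothetical and not taken from any study here, and it assumes a simple position-by-position word comparison, whereas a real scoring protocol might align the transcript against the reference more carefully.

```python
def intelligibility_score(reference: str, transcription: str) -> float:
    """Percentage of reference words the listener reproduced correctly,
    comparing word-by-word after lowercasing."""
    ref_words = reference.lower().split()
    heard_words = transcription.lower().split()
    correct = sum(1 for r, h in zip(ref_words, heard_words) if r == h)
    return 100.0 * correct / len(ref_words)

# A listener mishears one of six words ("the" heard as "a"): 5/6 correct
print(intelligibility_score("the boy ran to the store",
                            "the boy ran to a store"))  # about 83.3
```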
A stronger method for testing intelligibility is to ask the listener questions to see how well they
were able to understand the speaker’s meaning. Carefully designed general comprehension questions can
be used to determine how well a listener understood the speaker overall, while specific factual questions
can help shed some light on particular word intelligibility.
2.5 Increasing Speech Intelligibility
Many different methods have been developed over the years to attempt to improve the
speech intelligibility of dysarthric patients. These methods can be divided
into two categories: modification of the speech signal to enhance the actual acoustic characteristics of the
signal, and the alternative: manipulating complementary speech information or nonverbal cues in order to
increase a listener’s comprehension or perception of the speech. Studies have shown that speech
intelligibility can be improved simply by offering these additional sources of information to listeners.
Semantic cues, first-letter cues and word-class cues [13] have been shown to improve intelligibility by
10–24% among listeners.
Topical cues and key words are communication strategies that are often employed by PALS [14].
These strategies are currently used by partners of PALS, who give anecdotal
evidence that being able to understand one key word can allow a listener to understand the rest of the sentence.
Although speech intelligibility can be increased by other means that are non-speech related, these
go beyond the scope of this project and we will be concentrating on improving speech intelligibility from
behind a face mask primarily through speech processing.
2.6 Three Difficulties in Increasing Speech Intelligibility
One obvious difficulty with capturing intelligible speech from a patient wearing a face mask is that
because the mask is placed over the face and forms a complete seal, it is difficult to hear any intelligible
speech whatsoever, even with an ear placed very close to the mask itself. This we will refer to
as the Capture Problem in the following section, Section 3, entitled RELATED WORK –
HISTORY AND PRESENT.
Secondly, patients who are on BiPAP not only have voices that are generally
weaker, and thus are not able to speak as loudly as a typical person, but also struggle to articulate their
words with sufficient regularity and precision to be intelligible. This is a significant challenge,
especially when it comes to capturing and processing a good speech sample consistently from
behind the mask. This we will refer to as the Articulation Problem in the following section.
Thirdly, as BiPAPs function with rushing air, attempts to capture the sound also pick up the “wind
noise” within the mask, further decreasing the intelligibility of any captured speech. This we will refer to
as the Noise Problem in the following section.
As the face mask is an enclosed system, little can be done in terms of opening the mask to allow for
sound capture as this would allow air to escape and nullify any respiratory benefits. Coupled with the fact
that patients have weak voices, it would seem that capturing sound outside the mask by opening the
mask or asking the patients to shout would be impossible. Therefore, we must turn our efforts to
improving speech intelligibility by capturing sound not externally, but from within the mask, and
mitigating the effects of the rushing air.
3 RELATED WORK – HISTORY AND PRESENT
At the present time, there are no published works on improving speech from behind a ventilator face mask and filtering BiPAP-induced wind noise. The present work is among the first of its kind and, as a result, has very little to draw on in terms of specifics in this field.
In the previous section, we identified three problems, the Capture Problem, the Articulation
Problem, and the Noise Problem that contribute to the decreased intelligibility of speech in PALS on
BiPAP. Although intelligibility could be increased by addressing all three of these, the scope of this
project would simply be enormous if we were to tackle all of them.
For example, improvements to the Articulation Problem have been attempted with varying degrees of success. Older solutions for persons with ALS involved using speech synthesizers when articulation became too poor or the patient became nonverbal. Although these systems provided functionality, machine voices are still widely regarded as highly impersonal.
Considerable time and energy are being expended at present to add emotion, among other things, to make machine voices sound more human [15]. There is research in the area of capturing dysarthric speech and identifying and resynthesizing the badly pronounced phonemes [16] so that the larger portion of a person's own speech is retained. However, nothing has yet been made available commercially and the technology exists primarily in labs. As this problem is under active research and large enough in and of itself for a thesis, we will not attempt to improve intelligibility through the re-synthesis of poorly articulated phonemes. We will focus our attention on the Capture Problem and the Noise Problem, and then explore current work in machine recognition of the cleaned speech in order to select a mechanism for validating any claims we might make of increased intelligibility.
3.1 The Capture Problem
The first problem that we need to address is how to obtain speech for processing from within a
full face mask. If the quality of our signal is poor, then any attempts that we make at improving
intelligibility may be hampered not by our algorithms, but simply by the fact that our input is poor to
begin with. Therefore, choosing an appropriate solution for capturing sound from within the mask is
important.
The majority of speech processing methods focus on processing speech after it has been recorded by a microphone; however, any gain that we can achieve through a specific microphone configuration is advantageous. A survey of recent literature shows that although dual-microphone solutions for noise suppression are becoming more popular, they still add complexity in terms of weight, size of the array, power consumption and processing [17]. Furthermore, the advantages of dual-microphone noise reduction techniques are more pronounced when filtering non-stationary noise. As we are dealing with the removal of stationary or very slowly varying noise, the two-microphone solution is less appealing: although it would offer some benefit, its increased computational complexity and additional cost in terms of space within a face mask encourage the use of a single microphone.
Very little work seems to have been done on the problem of audio capture from within a full face
mask. A large part of the reason is that there is very little need for a typical person who uses BiPAP to
speak from behind a mask while the ventilator is running. For those who are hospitalized due to acute
respiratory failure, speech is not an option, and therefore, the need for verbal communication is nil. For
persons who suffer from sleep apnea and use BiPAP to aid them in getting a good night’s rest, there is no
reason to want the ability to communicate while sleeping. In the event that someone with sleep apnea or
some other respiratory ailment who uses BiPAP awakens or needs to communicate with their voice, it is a
simple matter for them to switch off the machine, loosen the mask slightly, communicate, and then return
to sleep. This, however, is not the case for PALS as their loss of motor coordination can make the task of
loosening or replacing the Velcro straps of a full face mask extremely difficult. Therefore, any solution
allowing them to speak without having to fumble with a mask is helpful.
The main category of people who use BiPAP and are not unconscious, suffering from acute respiratory failure, or asleep is persons living with ALS. As ALS is a progressive disease, the window in which an individual is on BiPAP and also sufficiently verbal to communicate is narrow and does not last indefinitely. Coupled with the facts that the life expectancy of persons living with ALS is normally 2 to 5 years and that the disease affects only about 6 to 8 people per 100,000, with approximately 2 new diagnoses per 100,000 each year [18], compared with cancer's 186,000 new cases this year alone [19], it is understandable why very little attention and research are directed at this particular problem of allowing a person living with ALS to speak while on BiPAP.
The closest published work relating to the problem of capturing speech from behind a mask comes from military research on fighter pilot oxygen masks. A work by Kang featuring a 4-microphone array [20] attempts to perform noise reduction within a mask, not by addressing ambient noise, but by focusing on the problems associated with reverberations within the face mask. The system constructed by Kang makes use of four collinearly spaced microphones with a sound duct.
Figure 1 - Cross-Sectional View of Kang's Microphone Array
This design allows reverberations in the mask to be removed by adding and subtracting the individual microphone outputs. An absorption material such as wool, which absorbs sound at 4 kHz, helps to further reduce the reverberation effects. Kang reports that this array works well at restoring lost high-frequency components and minimizing the reverberation effect.
Although this setup is good, size is a factor to consider when working with BiPAP masks, as they are smaller than fighter pilot masks. In addition, sound tests recorded from within a BiPAP mask do not seem to be as extremely muffled as in the case of the oxygen masks Kang used for testing. It is possible that the material of the BiPAP full face mask, coupled with its CO2 exchange holes, reduces the reverberation effect, whereas fighter pilot masks are airtight and perform their oxygen and carbon dioxide exchange through a single hose. There are certainly differences between the two kinds of masks; however, the extent to which fighter pilot masks differ from BiPAP full face masks has not been experimentally verified.
Although Kang reports success with his microphone array, noting that it performs some noise reduction in addition to reverberation reduction, it achieves this not only because of the array but also because of its proximity to the mouth. In Kang's tests, the array is placed a mere ¼ inch from the mouth, a distance which would be difficult to calibrate and maintain for PALS and which might also interfere with their breathing or their lips as they struggle to articulate sounds. Although Kang's work is the closest to our problem, it is not by itself a solution.
3.2 The Noise Problem
The problem of removing stationary noise from a signal is not a new one and has been explored and reviewed rather extensively [21]. In fact, more recent papers have begun to attack problems such as filtering non-stationary wind noise with a single microphone [22].
Filtering of undesired sound in noisy environments such as helicopters [23], oxygen masks [20] and vehicles [24] has been researched and solutions proposed. These solutions are helpful to us in that they are geared towards solving a specific ambient noise problem, such as the noise from helicopter rotors or vehicle engines. For the most part, this type of noise is unvarying and can be classified as stationary noise.
If the noise from the airflow in the mask can be filtered using a similar method, and a clean speech signal captured and output, this would be of immense benefit to the quality of life of patients, as they would retain the ability to communicate with others using their voices while on ventilation.
Techniques to attenuate the noise include Wiener filtering, Log-MMSE (minimum mean-square error) estimation and spectral subtraction. Although spectral subtraction was first developed by Boll in 1979 [23] and improved upon by Ephraim and Malah, who eliminated the musical noise phenomenon [25], it continues to have traction in the academic world. Variations have been proposed for non-stationary noise [26], as well as modifications that take advantage of human auditory characteristics [27]. As it is simple to implement, it remains a popular choice for performing noise reduction. The other aforementioned techniques offer similar performance in terms of their ability to reduce noise.
3.3 Automated Speech Recognition
In order to help us validate our claims of improving intelligibility through speech processing, it is
necessary to explore methods for validation. As discussed in Section 2.4 - Measuring Speech
Intelligibility, increased human recognition of words or sentences is an indication that intelligibility has
increased. We need not, however, limit ourselves to human recognition. If a computer recognition system were able to match words against some library of words, and the recognition rate increased as a result of noise filtering, then we would have another objective method for determining the improvement of the system. Therefore, it is also necessary to explore automated speech recognition methods to use for validation purposes.
Automated speech recognition is a field under heavy research, as machines able to understand the subtle nuances of human speech would revolutionize the way we interface with them. Work in this field is varied and in the last few decades has been very successful. The first generation of speech recognizers focused primarily on phonemes and was very limited in its ability to recognize commands. This was followed by a second generation of speech recognizers that made use of linear predictive coding (LPC) and dynamic time warping (DTW) [28]. In the 1980s, Hidden Markov Models (HMMs) and statistical analysis became the de facto standard for automated recognition of continuous speech, and also allowed for a major increase in the size of the vocabularies of new systems.
Although Dynamic Time Warping is older and has since been replaced by Hidden Markov Models for the bulk of speech recognition, it remains useful for small datasets and isolated words. The bulk of research today focuses on optimizing DTW, finding new ways to apply it to larger datasets [29], or using it to enhance HMMs [30].
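Since DTW is the isolated-word matching technique relied upon later in this work, a minimal sketch of the classic dynamic-programming recurrence may be helpful. The function name and the use of 1-D feature sequences with an absolute-difference local cost are illustrative choices, not details from this thesis:

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences.

    The cost of aligning a[i] with b[j] is |a[i] - b[j]|; each cell of the
    table takes the local cost plus the cheapest of the three allowed
    predecessors (match, insertion, deletion)."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of the best alignment of a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # deletion
                                 D[i][j - 1],      # insertion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Because DTW warps the time axis, a word spoken slowly still matches a faster template well, which is why it suits small isolated-word vocabularies where per-speaker templates are cheap to collect.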
3.4 Automated Speech Recognition in Persons with ALS
Several studies have been conducted in the area of dysarthric speech, including speech from patients with ALS. Studies have shown that their speech differs greatly from typical speech and that it is very difficult for such speakers to use off-the-shelf voice recognition packages [9]. Training people on these systems requires a great amount of time and effort and is a very frustrating procedure for patients who have been weakened by illness.
To improve phoneme recognition, experimental systems have been implemented to train speech classifiers to recognize the distinct phonemes produced by patients with dysarthria. Visual and auditory feedback, among other methods, has been used to find the best training method; however, all of these remain time consuming and difficult to achieve [8].
ALS patients present a unique problem in that, unlike other disabilities, the disease causes the patient's ability to speak to degenerate over time. Studies on dysarthric speakers often exclude patients with neurodegenerative conditions, as research projects spanning several years cannot continuously obtain consistent data samples while the patient's disease progresses. Projects like STARDUST [8], which examined isolated word recognition for dysarthric speakers, specifically excluded these types of patients and focused on patients with a more stable disease. The problem with training sets of classifiers for these patients is that, due to the rapid degeneration of their speech, possibly within months, retraining of a speech recognition system becomes necessary. This is extremely time consuming and tedious for PALS, who must then re-record samples for a database of words that has expanded through use. ASR studies involving PALS testing new speech recognition software have shown that during the usually repetitive recording sessions, the PALS grow tired and are no longer able to articulate their words clearly. This implies that systems with low training and retraining requirements are in PALS' best interests. Currently, no products or open source initiatives available to the public are capable of performing automatic speech recognition for dysarthric speakers, let alone persons living with ALS.
Efforts have been made to improve the intelligibility of dysarthric speech by replacing sections of speech with re-synthesized speech [31]; however, the study is careful to explain that its methods would only work for a specific subgroup of dysarthric speakers. As persons with ALS at different stages of disease progression will have wildly different speech characteristics, it is impossible to use a one-size-fits-all method for automatically detecting and resynthesizing speech. Although this method of improving intelligibility seems promising, constructing such a system and making it operate in real time would prove challenging and go beyond the scope of this project.
4 AUTOMATIC SPEECH RECOGNITION AND ENHANCEMENT SYSTEM
This section is an explanation of a prototype system that was designed to address the Capture and
Noise problems as detailed in the previous section. The proposed solution, ASRES, is an automated
system that is able to capture and filter noisy speech from behind a face mask, output the processed
speech, as well as perform recognition.
The details of the system’s implementation and assumptions are explained in this chapter.
Figure 2 - System Block Diagram, below, demonstrates the setup of the system and how a speech signal is passed between the different subsystems.
Figure 2 - System Block Diagram
4.1 System Overview and Setup
A person living with ALS wearing a face mask is the starting point of the diagram. A small electret microphone (WM-61A), with two 30 AWG wires attached to its leads, is inserted into the patient's face mask via the small ventilation holes and hung from the top of the mask. It is important that the microphone be placed out of the direct path of the air flowing through the flexible hose connecting the mask to the BiPAP in order to maximize the Signal-to-Noise Ratio (SnR) of the speech. The microphone must also be adjusted so that it is not pressed against the nasal bridge of patients with larger noses, and not in line with the patient's nostrils, as deep inhalation and exhalation can create large amounts of noise that further reduce the SnR. Once the microphone is properly inserted, the patient attaches the full face mask to their BiPAP ventilator.
The full face masks tested with this setup are the Resmed Ultra Mirage Full Face Mask and the Fisher & Paykel FlexiFit 432. It is conceivable that the system would work well with any full face mask that makes a complete seal over the patient's face and has CO2 ventilation holes no smaller than 0.0799mm in diameter, the diameter of the 30 AWG wire used with the microphone.
As the device makes use of the existing CO2 ventilation holes, it is important to consider whether the insertion of a microphone would impair normal operation of the mask. In a Resmed Ultra Mirage Full Face Mask, CO2 is vented through 6 holes that are approximately 1.8mm in diameter, giving a total area of 15.268mm² through which CO2 can pass. As each 30 AWG lead occupies only 5.107×10⁻² mm², the total area obstructed by the two leads is 0.669%. This means that the mask's ability to vent CO2 still operates at approximately 99.3%, which was not deemed a risk when reviewed by the UBC Clinical Research Ethics Board.
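The vent-area figures above can be checked with a short calculation. This sketch assumes a bare-wire diameter of about 0.255 mm for 30 AWG (the standard gauge value), which reproduces the per-lead area and the obstruction percentage quoted in the text:

```python
import math

# Mask and wire geometry (assumed values: 6 vent holes of 1.8 mm diameter,
# as stated for the Ultra Mirage; 0.255 mm bare diameter for 30 AWG wire).
HOLE_DIAMETER_MM = 1.8
NUM_HOLES = 6
WIRE_DIAMETER_MM = 0.255
NUM_WIRES = 2

vent_area = NUM_HOLES * math.pi * (HOLE_DIAMETER_MM / 2) ** 2   # total vent area
wire_area = NUM_WIRES * math.pi * (WIRE_DIAMETER_MM / 2) ** 2   # area of both leads
blocked_pct = 100.0 * wire_area / vent_area                     # percentage obstructed

print(f"vent area  : {vent_area:.3f} mm^2")
print(f"obstructed : {blocked_pct:.3f} %")
```

Running this yields a vent area of roughly 15.27 mm² and an obstruction of roughly 0.67%, consistent with the figures in the text.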
Once the face mask is securely in place, two alligator clips are fastened to the exposed leads,
connecting the microphone in the patient’s mask to the rest of the system. The leads pass through a small
circuit that supplies power to the electret microphone via the 3.5mm jack on the TMS320C6713 DSP
board.
After the noise-corrupted speech signal is processed by the filtering software on the board, the filtered speech signal is split and sent to two places. One output goes to a set of speakers so that a listener in the room can hear the patient's voice in real time from behind the mask. The other is sent via a 3.5mm aux cable to the microphone jack of a laptop running the ASRES software. A laptop is used as it avoids the cost and difficulty of developing a GUI on an embedded device for this prototype.
Once the system is properly set up, the patient can begin speaking during normal operation of the
BiPAP.
4.2 Microphone Selection
Microphone selection is important for this application, as the environment from which speech is to be extracted is quite noisy. As the results of the digital signal processing and the speech recognition software depend on the quality of the input signal, it is beneficial to maximize the amount of captured speech and minimize the noise before the signal is subjected to digital signal processing. Two ways to accomplish this at the electret microphone stage are to choose the best electret microphone or microphone array, and to optimize the placement of the microphone or microphones.
For this project, a Panasonic Noise Cancelling Back Electret Condenser Microphone Cartridge (WM-55D103) is used because it is small and non-intrusive.
Figure 3 - Panasonic Noise Cancelling Microphone Cartridge (WM-55D103)
The microphone has a frequency range of 100-10,000 Hz, which covers the range of human speech, approximately 200-8000 Hz, quite well.
Figure 4 - Frequency Response
The frequency response of this microphone is fairly flat at close range, which is useful for us since there is little variance in response across the different frequencies of speech.
The noise cancelling microphone has a black porous material that covers the sensitive electret components and serves to protect the microphone from saliva, since it is placed fairly close to the speaker's mouth. As dysarthric speakers and PALS often have difficulty controlling their speech muscles and saliva, this covering is useful. It also helps to reduce some of the wind noise that passes over the microphone. As a noise-cancelling electret microphone is designed to attenuate signals originating farther away from the microphone, it is possible, through placement within the mask, to increase the SnR simply by keeping the microphone sufficiently close to the mouth and as far from the noise source as possible. Although the full potential of this characteristic is not exploited, as the distance between the noise source and the patient's mouth is still on the order of centimeters, it is a beneficial quality and makes the noise-cancelling electret a better choice than other types of microphones. An omnidirectional microphone, for instance, has the undesirable quality of picking up sound from all directions equally well and, as a result, collected more noise in trials. A unidirectional microphone proved slightly less useful, though not as poor as the omnidirectional electret microphone.
4.3 Calibration and Positioning
Calibration was done empirically. Although there is perhaps merit in mathematically modeling the mask to achieve maximal SnR, this is very difficult, as different masks have different shapes and sizes and any optimization algorithm would need an accurate model of each one. Furthermore, modeling the optimal location for the highest SnR is complicated by the fact that skin does not have the reflective acoustic properties of other solid surfaces but actually has absorptive qualities. In addition, variations in patients' nose structures, lips and even facial hair further complicate modeling. Therefore, being mathematically and computationally difficult, not to mention different for each individual BiPAP user, mathematical modeling for the purpose of optimizing the microphone placement was rejected in favor of deductive and empirical testing.
Figure 5 - Airflow and Optimal Microphone Placement
Figure 5 - Airflow and Optimal Microphone Placement, above, shows an arrow indicating the initial flow of air from the BiPAP as it enters the mask. Empirically, it was determined that placing a microphone anywhere along that path, where the wind struck the microphone directly, resulted in very poor SnRs of 0 dB and below. The two rectangles show the areas in the mask that are out of the direct path of the wind and are candidates for placement of the microphone. Although the bottom rectangle in Figure 5 is out of the direct path of the wind, it is still sufficiently close that empirical tests showed very little improvement in the SnR. Therefore, the only practical placement for the microphone is the top rectangle. Empirical testing of microphones placed within the top rectangle and away from the nostrils shows an SnR of up to 30dB when the BiPAP is not in operation and approximately 3-9dB when the BiPAP is on. This large variation in the SnR is due to variation in the strength of the speech signal: an average person who is able to speak normally would be on the higher end, whereas a patient with ALS who has low lung capacity might be closer to the lower end.
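The SnR figures quoted above can be estimated from two short recordings, one of speech over the ventilator and one of the ventilator alone. The sketch below is illustrative (the function name and the assumption that noise power during speech equals that of the noise-only recording are ours, though the latter is reasonable for the stationary BiPAP hum):

```python
import math

def snr_db(speech_plus_noise, noise_only):
    """Estimate the signal-to-noise ratio in decibels from two recordings.

    speech_plus_noise : samples recorded while the patient speaks over the hum.
    noise_only        : samples of the ventilator hum alone.

    The speech power is estimated by subtracting the noise power from the
    total power, assuming the noise is stationary across both recordings."""
    p_total = sum(s * s for s in speech_plus_noise) / len(speech_plus_noise)
    p_noise = sum(n * n for n in noise_only) / len(noise_only)
    p_speech = max(p_total - p_noise, 1e-12)  # floor to avoid log of <= 0
    return 10.0 * math.log10(p_speech / p_noise)
```

A stronger speaker raises p_speech and hence the estimate, which is consistent with the observed spread between healthy speakers (near 30 dB with the BiPAP off) and patients with low lung capacity (near the 3 dB end with it on).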
4.4 Microphone Powering Circuit
The diagram below shows the microphone circuit that powers the electret microphone and interfaces it with the TMS320C6713 DSP board used in this project.
Figure 6 - Electret Microphone Circuit
The electret microphone is connected to a 3.5mm TRS connector in a standard configuration. When it is plugged into the DSP board, a 5V bias is supplied to the microphone through a 2.2kΩ resistor to limit the current.
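As a rough sanity check on the biasing arrangement, Ohm's law gives the current through the 2.2 kΩ resistor. The roughly 2 V drop assumed across the electret capsule is a typical figure for such capsules, not a value from the text:

```python
V_SUPPLY = 5.0    # bias supplied by the DSP board's mic jack, volts
V_MIC = 2.0       # assumed operating voltage across the electret capsule, volts
R_BIAS = 2200.0   # current-limiting resistor, ohms

# Ohm's law: the resistor drops the remaining voltage, setting the bias current.
bias_current_ma = (V_SUPPLY - V_MIC) / R_BIAS * 1000.0
print(f"bias current: {bias_current_ma:.2f} mA")
```

This lands in the low-milliamp region typical for electret capsules, suggesting the resistor value is a reasonable choice for current limiting.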
4.5 Spectral Subtraction
The speech signal that comes from the microphone suffers from noise caused by air rushing over the microphone and the general hum of the BiPAP. This noise corrupts the signal and makes it difficult to perform any speech processing unless it is attenuated or removed. If the noise and the speech signal cannot be separated using multiple microphones, it is necessary to find a way to identify the noise from the same source. One of the simplest ways to do this is to perform power spectral subtraction.
This is a simple but effective way to estimate how strong the noise is, and it takes advantage of the fact that while a BiPAP is operating, a person is not speaking 100% of the time. The non-speech segments can be used to recalculate the noise estimate, which in our case should not vary much, as the BiPAP is fairly consistent. As what is needed is a sample of the stationary noise, it is possible on startup of the system to capture a 250ms sample of the stationary sound and then use this sample for the spectral subtraction calculations.
This algorithm is simple to implement and can be run in real time on a DSP board or a computer. The following derivation is largely based on the paper by Boll [23], whose work paved the way for subsequent spectral subtraction based works.
Let
y(m) = x(m) + n(m)
where y(m) is the corrupted signal, x(m) is the speech signal and n(m) is the uncorrelated, additive noise signal.
In the frequency domain, let the same relationship be represented as
Y(f) = X(f) + N(f)
where Y(f), X(f) and N(f) are the Fourier transforms of the corrupted, speech and noise signals respectively, with f as the frequency variable.
In order to process the original signal, it must be divided into chunks, or windowed. To alleviate the effect of discontinuities at the endpoints of each segment, it is necessary to choose an appropriate window. For our application we use a Hanning window of length N:
w(m) = 0.5 (1 − cos(2πm/(N−1))), m = 0, …, N−1
In the frequency domain, applying a window to a signal is a convolution operation:
Y_w(f) = W(f) ∗ Y(f)
Rearranging terms, we can express the original signal as the noise signal subtracted from the corrupted signal:
X_w(f) = Y_w(f) − N_w(f)
If it were possible to obtain an exact representation of the noise signal, we could completely remove it and restore the original speech signal. Since that is impossible, spectral subtraction is used to reconstruct an approximation of the original signal by subtracting a time-averaged noise spectrum, μ(f), from the corrupted signal. For simplicity, the windowing subscript, w, is dropped:
|X̂(f)| = |Y(f)| − μ(f)
The result is an approximation of the original signal. The accuracy of the reconstructed signal is directly dependent on how accurately the noise spectrum can be measured.
As it is necessary to calculate the average noise spectrum from a period of non-speech activity, this can be done upon the initial startup of the device with the following formula.
μ(f) = (1/K) Σ_{i=0}^{K−1} |N_i(f)|
In the above equation, the average noise spectrum is calculated by summing the magnitude spectra of K frames, indexed i = 0 to K−1, and dividing by K. K is determined by the sampling rate and the length of the signal. For this to work, we assume that the first K frames are pure noise with no speech. Averaging the noise spectrum over a 250-300ms window provides a fairly good estimate of stationary noise and presents no difficulty to the user. In the event that there is speech during the first ¼ of a second, the system can always be reset, or the noise mean recalculated during another interval of silence. After calculating the noise mean, we can create a simple voice activity detector by setting a threshold 2-3dB above the noise mean. If 6-8 consecutive frames of the signal are below this threshold, we can safely assume that speech has ended and that we are now looking at noise.
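The startup noise averaging and the simple threshold detector described above might be sketched as follows. The frame size and function names are illustrative assumptions, not details of the ASRES implementation:

```python
import numpy as np

FRAME = 256        # samples per analysis frame (assumed, not from the text)
MARGIN_DB = 3.0    # detection threshold above the noise mean (text: 2-3 dB)

def frame_signal(x, frame=FRAME):
    """Split a 1-D signal into consecutive non-overlapping frames."""
    n = len(x) // frame
    return x[: n * frame].reshape(n, frame)

def noise_profile(startup_noise):
    """Average magnitude spectrum of the startup (noise-only) frames,
    i.e. the mu(f) used by the spectral subtraction stage."""
    frames = frame_signal(startup_noise)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def is_speech(frame, mu, margin_db=MARGIN_DB):
    """Crude voice activity decision: the frame's mean spectral magnitude
    must exceed the noise mean by margin_db decibels."""
    frame_db = 20 * np.log10(np.abs(np.fft.rfft(frame)).mean() + 1e-12)
    noise_db = 20 * np.log10(mu.mean() + 1e-12)
    return frame_db > noise_db + margin_db
```

In practice the end-of-speech decision would additionally require 6-8 consecutive frames for which is_speech is false, as described in the text, before the noise mean is recalculated.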
The spectral error that remains is approximately equal to the difference between the reconstructed signal and the actual speech signal, which is also roughly equal to the difference between the actual noise in that frame and the noise average that we have calculated:
ε(f) = |X̂(f)| − |X(f)| ≈ |N(f)| − μ(f)
In order to help reduce the spectral error, a few modifications are made to the reconstructed signal. As the spectral error is the difference between the noise in the frame and the calculated noise average, it follows that
(1/M) Σ_{i=0}^{M−1} |N_i(f)| ≈ (1/K) Σ_{i=0}^{K−1} |N_i(f)| = μ(f)
where M < K and K is still the total number of frames over which the noise average is computed. Therefore, if we perform magnitude averaging of the noisy signal over M frames before subtracting the noise average, this helps decrease the error:
|X̂(f)| = (1/M) Σ_{i=0}^{M−1} |Y_i(f)| − μ(f)
According to Boll [23], performing this averaging is not a problem as long as the number of frames, M, does not exceed a certain amount. Based on the results of his DRT tests, as long as the averaging does not exceed 3 half-overlapped windows with a total duration of 38.4ms, intelligibility does not decrease. There is still, of course, the risk that for short, explosive sounds the averaging can cause smearing. However, this risk should be weighed against the benefit of having less spectral error overall.
Upon completion of magnitude averaging, the next step is to perform half-wave rectification. In the event that subtracting the average noise μ(f) from |Y(f)| produces a negative value at a particular frequency, because the average noise spectrum there is greater than the noisy signal, it is necessary to floor these values to 0:
|X̂(f)| = max(|Y(f)| − μ(f), 0)
The benefit of doing this is that, overall, the noise floor is reduced by μ(f). The disadvantage arises when the sum of the noise and speech at a particular frequency is less than μ(f): when the output is set to 0, any speech information contained in that frequency is lost and the result is possibly a loss of intelligibility.
Once half-wave rectification is complete, speech and noise above the threshold remain. This residual noise will have a value somewhere between zero and a maximum measured during the non-speech activity periods. In the case where the noise in a frame at a particular frequency equals the average noise spectrum, the amount of residual noise will be zero, or very close to it. The residual noise consists of frequencies randomly scattered throughout the spectrum that exist for the duration of one window, approximately 25ms. The result is what is known as "musical tones", as it sounds like a number of fundamental tone generators being flipped on and off at all the residual noise frequencies. To help alleviate this effect, additional residual noise reduction can be performed.
There are 3 cases that need to be examined after the average noise spectrum is subtracted from the noisy signal.
In the first case, if the amplitude of X̂(f), the signal after spectral subtraction, is below the maximum threshold for noise and fluctuates rapidly between adjacent frames, then there is a high probability that this is not speech but residual noise. We can therefore replace that particular frequency in the frame with the minimum value found by examining the adjacent frames as well.
In the second case, if the value of X̂(f) is below the maximum noise threshold but remains fairly constant between frames, then there is a high probability that this is not noise but low-energy speech. As the values are fairly similar, again taking the minimum preserves the speech.
In the third case, if X̂(f) is greater than the maximum noise threshold, then nothing else needs to be done, as what remains is most likely a speech signal.
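The three cases reduce to a simple rule: bins that stay below the measured residual-noise maximum take the minimum of the adjacent frames (which suppresses fluctuating residual noise while preserving stable low-energy speech), and bins above it are left alone. A hedged NumPy sketch, with assumed array shapes and names:

```python
import numpy as np

def reduce_residual(frames, max_residual):
    """Boll-style residual noise reduction.

    frames       : array of shape (num_frames, num_bins) holding magnitude
                   spectra after subtraction and half-wave rectification.
    max_residual : per-bin maximum residual noise measured during the
                   non-speech activity periods (shape: num_bins).

    Wherever a bin is below the residual-noise maximum, it is replaced with
    the minimum of that bin across the previous, current and next frames;
    bins above the maximum are assumed to be speech and left alone."""
    out = frames.copy()
    for t in range(1, len(frames) - 1):
        below = frames[t] < max_residual
        local_min = np.minimum(np.minimum(frames[t - 1], frames[t]),
                               frames[t + 1])
        out[t][below] = local_min[below]
    return out
```

Taking the minimum handles both of the first two cases at once: a randomly fluctuating residual bin dips near zero in some adjacent frame, while stable low-energy speech has similar values in all three frames and so survives largely intact.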
After this, all that remains is to restore the signal to the time domain.
Restoration of the time-domain signal is achieved by taking the magnitude spectrum estimate |X̂(k)|, combining it with the phase of the original noisy signal, and then performing an inverse discrete Fourier transform:
x̂(m) = (1/N) Σ_{k=0}^{N−1} |X̂(k)| e^{jθ_Y(k)} e^{j2πkm/N}
In the above equation, θ_Y(k) is the phase of the noisy signal. The estimated magnitude of the original signal can be recombined with the phase of the noisy signal because it is assumed that noise distortion lies primarily in the magnitude spectrum, while phase distortion is largely inaudible.
After the signal frames have been restored to the time domain, the overlapping frames are overlap-added in order to reconstruct our approximation of a noise-free signal.
This technique of magnitude spectral subtraction works well for stationary and slowly varying noise. In the case of patients who are on BiPAP, there is very little variation in the noise once the ventilator is in operation. Empirical tests show that the amount of noise depends on the placement of the microphone and, once the microphone is placed, does not vary. Therefore, spectral subtraction is an ideal candidate for this particular problem.
Although spectral subtraction is fairly simple and effective, it still suffers from the residual
“musical noise” left behind by isolated patches of energy in the time-frequency domain.
Despite best efforts to alleviate this, these artifacts persist. Depending on the accuracy of the noise
estimate and the SNR, there can be large fluctuations in the final output. In general, the less noise that
needs to be removed and the stronger the speech signal, the fewer artifacts appear in the final output. Ephraim
and Malah’s method [25] does not suffer from this and could be used as an alternative; we will
return to this in Chapter 7 - CONCLUSIONS AND FUTURE WORK.
4.6 Speech Extraction
This section explains the process of extracting commands and words from the de-noised and
reconstructed speech signal.
To capture a person’s speech, the de-noised speech is output from the DSP into the 3.5mm
jack of a laptop where the ASRES software is running. The software listens to the microphone channel
and captures the clip of speech. Although this may be done automatically by setting certain sound
thresholds, for the purpose of explanation, we will describe the case in which the record button of the
software is toggled on and off manually in order to acquire a clip of speech.
Upon capturing a clip, the next step is to normalize the WAV so that we can make an accurate
comparison with others stored in the database. To do this, we first iterate through all the
samples and examine the absolute value of each sample. If the absolute maximum is less than 1,
we divide the entire signal by this absolute maximum. An example of a resulting WAV is shown in
Figure 7.
Figure 7 - Normalized WAV
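The normalization step just described can be sketched in a few lines; this is a minimal illustration, not the actual ASRES code.

```python
import numpy as np

def normalize_peak(samples):
    """Scale a clip so its largest absolute sample becomes 1.0.

    As described in the text, scaling is only applied when the absolute
    maximum is below 1, i.e. quiet clips are brought up to full scale.
    """
    peak = np.max(np.abs(samples))
    if 0.0 < peak < 1.0:
        return samples / peak
    return samples
```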
Major errors in isolated word recognition are the result of inaccurate detection of the beginning
and end points of a speech sample. As the output of the DSP is the de-noised signal and we do not have
access to average noise spectra calculations, we need to create an algorithm to determine the endpoints of
the speech sample.
To extract the endpoints, we use an energy-based approach. We begin by segmenting the entire
WAV into 30 ms frames. The energy of each frame is then calculated according to the following
formula:

$$E_n = \sum_{m=0}^{M-1} s_n^2(m)$$

where s_n(m) is the m-th sample of frame n and M is the number of samples per frame.
The first two frames are examined with the following formula and a value for the noise at the
front of the signal is calculated.
$$E_F = \frac{E_1 + E_2}{2}$$

computed only if E_1 and E_2 are of comparable magnitude; otherwise the sample is rejected. After the noise at the front is computed, the noise at the back end is also calculated using the last
two frames:

$$E_B = \frac{E_{N-1} + E_N}{2}$$

again computed only if E_{N-1} and E_N are of comparable magnitude. Finally, the average background noise level is computed from these two values.
$$E_N = \frac{E_F + E_B}{2}$$

computed only if E_F and E_B are similar; otherwise the sample is rejected. Rejection occurs on the basis that the value of EN should lie between the two limits and that the
noise in the front and end frames should not be drastically different. A rejection can also take place if EN
exceeds an experimentally determined threshold. If the sound sample has a very heavy background noise,
i.e. the values of EF and EB exceed a certain threshold, the sample should also be discarded. Once we have
determined that background noise is not a factor and that the first two frames and the last two frames
contain similar residual noise, we can then compute the average power of each frame.
$$P_n = \frac{1}{M} \sum_{m=0}^{M-1} s_n^2(m)$$
Once the power in a frame exceeds a certain threshold, we can assume that this frame contains
speech. To determine the starting frame of speech, we examine the frames sequentially from the first
frame until we encounter a frame that exceeds the threshold. This is a simple method of determining the
start frame; however, this is susceptible to noise.
A better way to determine the start frame is by examining not just the frame, but also the adjacent
frames. If two consecutive frames in a set of three exceed the average, then there is a greater likelihood
that this represents a speech frame. Once this occurs and the average power of the frame surpasses an
experimentally determined threshold, that frame is marked as the starting frame of speech. We repeat this
process in reverse beginning with the last frame to determine the ending frame.
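The energy-based endpoint search described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the threshold factor and the two-of-three consecutive-frame rule follow the text, but the numeric values are not the thesis's experimentally tuned ones.

```python
import numpy as np

def find_speech_endpoints(samples, rate=16000, frame_ms=30, factor=4.0):
    """Locate the start and end frames of speech via per-frame energy.

    `factor` scales the background-noise estimate into a speech threshold;
    it is an assumed value, not the experimentally determined one.
    """
    n = int(rate * frame_ms / 1000)
    frames = [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    # Background-noise estimate from the first two and last two frames.
    e_front = (energy[0] + energy[1]) / 2
    e_back = (energy[-2] + energy[-1]) / 2
    threshold = factor * (e_front + e_back) / 2

    def first_speech(idx_order):
        for k in idx_order:
            window = energy[max(0, k - 1):k + 2]
            # Two of three consecutive frames must exceed the threshold.
            if energy[k] > threshold and np.sum(window > threshold) >= 2:
                return k
        return None

    start = first_speech(range(len(energy)))            # scan forward
    end = first_speech(reversed(range(len(energy))))    # scan backward
    return start, end
```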
Once these two frames have been determined, we can extract this portion of the WAV as speech.
The frames from the designated starting and ending points are then written to a new WAV which is then
converted into its Mel Frequency Cepstral Coefficients for Dynamic Time Warping analysis.
Figure 8 - Original and Truncated WAV
Figure 8 is an example of truncation of a voice clip. The original sample is almost 2.5 seconds in
length, while the truncated WAV is only 0.9 seconds in length. If truncation is not performed correctly or
no truncation is performed at all, this causes serious problems in the later stages when trying to compare
two speech signals using dynamic time warping.
4.7 Mel Frequency Cepstral Coefficients
Although utterances of the same phrase differ drastically in the time domain, they are not so
different in the frequency domain. Therefore spectral analysis is an excellent way to take advantage of
this fact.
In order to quantitatively measure sound samples against each other, we need a way of
representing them numerically. Mel Frequency Cepstral Coefficients (MFCCs) provide a method
for comparing these WAVs to each other.
Calculating the MFCCs is a process that can be summarized as follows:
i. Convert to frames and calculate energy
ii. Take Fourier Transform
iii. Take Log of amplitude spectrum
iv. Perform mel-scaling and smoothing
v. Take Discrete Cosine Transform
Since taking the Fourier transform of a finite segment of a waveform can cause spectral leakage, the first step
is once again to apply a Hanning window to the signal before taking the discrete Fourier transform and
calculating the energy spectrum:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j2\pi nk/N}, \qquad 0 \le k \le N-1$$
The energy spectrum is given by

$$S(k) = |X(k)|^2$$
The energy E_j of each triangular mel filter is then calculated for each frame:

$$E_j = \sum_{k=0}^{N-1} W_j(k)\, S(k), \qquad j = 1, \ldots, J$$

where W_j(k) is the j-th triangular filter and J is the number of triangular filters used. Finally, the Discrete Cosine Transform is taken of the mel
log-amplitudes and only the first 13 coefficients are kept for our purposes.
$$c_i = \sqrt{\frac{2}{J}} \sum_{j=0}^{J-1} \cos\!\left[\frac{\pi i}{J}\left(j + \frac{1}{2}\right)\right] \log(E_j)$$
As MFCC calculations were performed using a DLL, we will not go into further detail regarding
the calculation of MFCCs. An MFCC is calculated for each frame of the input signal and an array of 13-
coefficient MFCC vectors is built. This array can then be compared against other arrays of MFCC vectors
stored in a library.
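Since the thesis used a third-party DLL for the MFCC calculation, the following is only a compact sketch of the five steps listed above. The filterbank size, frequency range, and mel-scale constants are assumed values, not those of the DLL.

```python
import numpy as np

def mfcc_frame(frame, rate=16000, n_filters=26, n_coeffs=13):
    """Compute MFCCs for one frame: window, FFT, mel filterbank, log, DCT."""
    n = len(frame)
    # Steps i-ii: window the frame and take the energy spectrum.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(n))) ** 2
    # Step iv setup: triangular filters spaced evenly on the mel scale
    # (assumed range 0 .. rate/2, standard 2595/700 mel constants).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10 ** (m / 2595.0) - 1.0)
    edges = mel_inv(np.linspace(0, mel(rate / 2), n_filters + 2))
    bins = np.floor((n + 1) * edges / rate).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for j in range(n_filters):
        lo, mid, hi = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[j, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    # Steps iii-iv: mel-scale the spectrum and take log amplitudes.
    log_e = np.log(fbank @ spectrum + 1e-10)
    # Step v: type-II DCT, keeping the first n_coeffs coefficients.
    i = np.arange(n_coeffs)[:, None]
    j = np.arange(n_filters)[None, :]
    return np.cos(np.pi * i * (j + 0.5) / n_filters) @ log_e
```

Applying this to every frame of a clip yields the array of 13-coefficient vectors described above.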
4.8 Dynamic Time Warping
ASRES performs recognition using Dynamic Time Warping (DTW), chosen for its ease of implementation
and its good performance on small datasets. The following is an explanation of
Dynamic Time Warping as used by ASRES. To compare a speech sample using its MFCC array,
we need to perform a frame-by-frame comparison of the speech sample with each individual command
stored in the library.
As signals differ in length, the problem becomes an alignment problem. Let Q and C be two
different signals of lengths n and m respectively.
$$Q = q_1, q_2, \ldots, q_n$$
$$C = c_1, c_2, \ldots, c_m$$
The first thing that needs to be done is to compare the two signals by creating an n by m distance matrix.
Figure 9 - Distance Map of Different Words
As seen in Figure 9, a 7x6 matrix is created to compare the frames in each of the words. In this
example, “BARNEY” is stored in the database, while the word “BEAVERS” has been spoken.
If the letters at the intersection of two frames match, we say that this is a match and assign it a
distance score of 0. If they differ, we assign a positive score, in this
case, the value 1.
Y 1 1 1 1 1 1 1
E 1 0 1 1 0 1 1
N 1 1 1 1 1 1 1
R 1 1 1 1 1 0 1
A 1 1 0 1 1 1 1
B 0 1 1 1 1 1 1
B E A V E R S
If the two frames were completely identical, then the distance score along the diagonal would be
0, as shown in Figure 10.
Figure 10 - Distance Map of Two Identical Words
Once we have the distance map, we then compute the cost to reach the top right-hand corner of
the box by traversing the grid using the following recurrence:

$$D(i,j) = \min\{D(i-1, j-1),\, D(i-1, j),\, D(i, j-1)\} + d(i,j)$$

Figure 11 - DTW Minimum Cost Path Equation
where d(i,j) is the value at point (i,j) in the distance map and D(i,j) is the cumulative cost of reaching cell (i,j).
We define a path W to the top-right corner to be a connected set of K elements with each element
designated as wk.
$$W = w_1, w_2, \ldots, w_k, \ldots, w_K, \qquad \max(m, n) \le K \le m + n - 1$$

The shortest path would be a diagonal, hence max(m,n) as the minimum bound, and the longest
path would involve traversing both edges, hence m+n-1.
Traversal of the grid is subject to three rules:
i. Boundary Conditions: w1 = (1,1) and wK = (m,n). The path must start at the bottom-left
corner and finish at the top-right corner.
ii. Monotonicity: the path cannot go backwards, therefore ik – ik-1 ≥ 0 and jk – jk-1 ≥ 0.
iii. Continuity: the path cannot jump and is restricted to adjacent cells.
In order to determine the minimum cost of reaching the top right corner, we begin by building a
path map of the entire grid, starting at the bottom left-hand corner and then filling each subsequent row.
Each D(i,j) in the bottom row is determined by adding the value of the cell to its immediate left to
the value of d(i,j) found in the distance map in Figure 9. After summing the 1’s on the bottom row, we
can determine the total cost D(i,j) to any point on the bottom row, as shown in Figure 12.
S 1 1 1 1 1 1 0
R 1 1 1 1 1 0 1
E 1 1 1 1 0 1 1
V 1 1 1 0 1 1 1
A 1 1 0 1 1 1 1
E 1 0 1 1 1 1 1
B 0 1 1 1 1 1 1
B E A V E R S
Figure 12 - DTW Scoring for the First Row
For the second row, we use the distance map in Figure 9, apply the formula found in Figure
11, and calculate the next row of the grid as seen in the following figure.
Figure 13 - DTW Scoring for the Second Row
This continues until the entire grid is completely filled with values as seen in the following figure.
Figure 14 - DTW Scoring for the Entire Grid
As seen in the above figure, the minimum cost to get from the bottom left-hand corner to the top
right-hand corner is 5 in this example.
The following figure, Figure 15, shows one of the possible minimum paths that can be taken through
the grid to arrive at the top right-hand corner.
Y 5 4 5 5 4 4 5
E 4 3 4 4 3 4 5
N 3 3 3 3 3 4 4
R 2 2 2 2 3 3 4
A 1 1 1 2 3 4 5
B 0 1 2 3 4 5 6
B E A V E R S
Y
E
N
R
A 1 1 1 2 3 4 5
B 0 1 2 3 4 5 6
B E A V E R S
Figure 15 - Minimum Cost Path for Two Different Words
Although there are alternative paths that can be taken, following the three rules of grid traversal,
the minimum cost remains at 5.
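The grid computation above can be reproduced in a few lines. This is an illustrative sketch using the same 0/1 letter distance and the recurrence in Figure 11; it is not the ASRES implementation.

```python
def dtw_cost(a, b):
    """Minimum DTW alignment cost between sequences a and b.

    Uses the 0/1 distance of the letter example: d = 0 for a match,
    d = 1 otherwise, with the recurrence
    D(i,j) = min(D(i-1,j-1), D(i-1,j), D(i,j-1)) + d(i,j).
    """
    m, n = len(a), len(b)
    INF = float("inf")
    D = [[INF] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            d = 0 if a[i] == b[j] else 1
            if i == 0 and j == 0:
                best = 0  # boundary condition: path starts at (1,1)
            else:
                best = min(D[i - 1][j - 1] if i and j else INF,
                           D[i - 1][j] if i else INF,
                           D[i][j - 1] if j else INF)
            D[i][j] = d + best
    return D[m - 1][n - 1]
```

For “BARNEY” scored against “BEAVERS” this returns a minimum cost of 5, matching the grid above, and an identical pair returns 0. A stretched utterance such as “BEAAAVERS” also aligns to “BEAVERS” at zero cost, which is the warping behaviour illustrated in the figures that follow.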
In the case where the spoken word is identical to the word stored in the database, a path with a
score of zero is produced as seen in Figure 16.
Figure 16 - Path with a Score of Zero for Identical Words
This case does not occur under real conditions, as it is impossible for the exact same utterance to
be repeated more than once; however, if there is a match, the score of the minimum cost path should be
fairly low.
When the sounds of words are stretched, dynamic time warping provides us with a way of showing
that the two utterances are actually identical.
Figure 17 - Cost Map of a Word Spoken Slowly
In Figure 17, the word stored in the database is “BEAVERS”; however, the speaker has
enunciated the word slowly, placing heavy emphasis on the vowels. Although the lengths are no
longer equal, dynamic time warping allows us to calculate a zero-cost path through the grid, showing that,
despite the elongated vowels, the command is the same.
Figure 18 - Path with a Zero Score for a Word Spoken Slowly
This path minimization is what makes dynamic time warping extremely useful for comparing
samples of speech.
In Figure 9, the distance map compares letters with letters, therefore logical values, 1 and 0, are
sufficient to determine the cost of the intersection.
Since we have MFCC vectors, we need a way to calculate the distance between two
vectors. If the two vectors are very similar, i.e. they contain the same segment of a word, the cost to move
to that intersection should be low. To compare two vectors, we take the Euclidean distance between
them, which serves as d(i,j).
Figure 19 - Euclidean Distance in 3-D
In Figure 19, the Euclidean distance in three dimensions is the square root of the sum of the
squares of the differences in each of the component directions.
Since our MFCC vectors have 13 components, we extend the Euclidean distance formula to 13
dimensions:

$$d(i,j) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_{13} - y_{13})^2}$$

where x and y are the components of the MFCC vectors calculated for the two frames of speech
being compared.
A distance map is then computed and the minimum cost path through the grid is calculated.
In the word example given above, the final score was identically zero; however, as stated before, this
does not happen in real life, since no two frames will ever precisely match. Therefore, in order to
determine what word is actually being said, we need to calculate the minimum cost path for every word or
phrase stored in the database. Once this is done, and provided that the speaker has spoken a phrase that is
contained in the database, the lowest score should indicate which phrase was spoken.
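Combining the 13-dimensional Euclidean frame distance with the DTW recurrence, the library lookup just described can be sketched as follows. Function and variable names are illustrative, not taken from ASRES.

```python
import numpy as np

def dtw_distance(q, c):
    """DTW cost between two arrays of MFCC vectors, Euclidean d(i,j)."""
    m, n = len(q), len(c)
    D = np.full((m, n), np.inf)
    for i in range(m):
        for j in range(n):
            d = np.linalg.norm(q[i] - c[j])   # 13-dim Euclidean distance
            prev = 0.0 if i == 0 and j == 0 else min(
                D[i - 1, j - 1] if i and j else np.inf,
                D[i - 1, j] if i else np.inf,
                D[i, j - 1] if j else np.inf)
            D[i, j] = d + prev
    return D[-1, -1]

def recognize(sample, library):
    """Return the library phrase whose MFCC array has the lowest DTW cost."""
    return min(library, key=lambda name: dtw_distance(sample, library[name]))
```

Because every library entry must be scored, recognition time grows linearly with the number of stored phrases.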
5 USER INTERFACE AND SYSTEM USAGE
The previous section described in detail the algorithms behind ASRES. This section documents and
explains the user interface.
5.1 User Interface Description
The user interface of ASRES consists of a main window from which all operations proceed.
Figure 20 - ASR Tool
(1), in the figure above, is the button that sets the user interface into capturing mode. When this
button is toggled, the system goes into a listening mode in which the input signal is analyzed for speech
according to the algorithms described in Chapter 4.
(2), on the right-hand side, contains quick buttons that put the system into a mode to capture one
of three phrases that were commonly used to test the effectiveness of the automated speech recognition
and to store them in the database. Additional custom phrases can also be captured by pressing the “New”
button below the three phrases. In theory, any number of user-defined phrases can be captured
and stored in the database. However, it is important to remember that scoring a sound
clip against the database by DTW means comparing it with every entry, an operation of order O(n).
The performance of the system therefore decreases linearly as the database grows.
(3) is a text field that provides text feedback. For example, if a user was captured as saying “Help
me!”, the system would process the phrase, score it against phrases stored in the database, then display in
the text field the name of the phrase that the system recognized. In this case, the text “Help me!” would be
displayed. Although a visual display would be of little use to a BiPAP user who was actually crying for
help, this output is primarily for those setting up the system. If the command is correctly recognized,
further appropriate reaction can then be taken.
5.2 System Usage
The “Start cap” button is pressed and de-noised speech is captured by the microphone. The
captured speech is compared against the entire database of phrases and a match is located.
Because the system is able to recognize different kinds of words, commands and actions can be tied
to each of these and stored in a database. For example if the “Help me!” command is recognized, an
appropriate reaction could be to trigger an alarm that is connected to the computer via a USB port. Or
perhaps if the caregiver is not in the general vicinity, a message could be sent to their phone via a specific
app. A two-word command such as “Lights on” could be interfaced with the room lights and allow a
patient to have control over the lights.
One of the commands that was implemented is “Open Browser.” Upon accurate recognition of this
command, the default browser of the operating system, e.g. Firefox, Chrome, or Internet Explorer is
opened to its home page. From there, a user can navigate from page to page by using an extension such as
“numberedlinks” for Firefox that assigns numbers to every hyperlink on the page. Thus, a user would be
able to say “Link 15” and the link would be clicked and the new page loaded. To add more functionality,
recording commands like “Back” or “Forward” could also improve the browser experience.
The possibilities are endless; however, they all rely on accurate detection of the words.
5.3 Initial Training
In order to train the system, a user selects either one of the quick buttons on the side, or the “New”
button and records his or her command. These samples become the unique user sound set against which all
future voice commands are compared and aligned using Dynamic Time Warping. In order to
make the system as robust as possible, any “New” command that is recorded can be set to execute a
particular command line command. For example, if we wish to implement something beyond “Open
Browser” and navigating through a web interface, we can specify a command to run.
For example, on a Windows-based system, it is possible to record a clip “Adjust pillow” and set the
executed action to run the command line string “msg.exe SamChua Can you please come over and adjust
my pillow?” Then, when the ASRES system is running and the words “Adjust pillow” are spoken, the
system would execute the above shell command which would cause a message to be sent to the user
SamChua, who is on the local network, with the above text. In this way, provided that the system is set up
in advance and these commands are entered by the caregiver, a PALS could send numerous kinds of
commands to particular people.
Since the commands are shell-based, any kind of command-line based custom software could be
written or bought and used with the ASRES system. An even more useful application, for example,
would be to purchase proprietary PC-to-SMS software and use it to send an SMS to a caregiver: the
recorded command could be “Emergency!” and the executed action could be “SMS.exe
/u:SamChua /m:Come home immediately, I need help!”
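A phrase-to-command dispatch of this kind can be sketched as follows. The table contents are illustrative only; the actual commands are whatever the caregiver configures (msg.exe is the Windows messaging utility mentioned above), and the `runner` parameter is a testing convenience, not part of ASRES.

```python
import subprocess

# Hypothetical phrase-to-command table; entries mirror the examples in
# the text but would be entered by the caregiver in practice.
COMMANDS = {
    "Adjust pillow": ["msg.exe", "SamChua",
                      "Can you please come over and adjust my pillow?"],
    "Emergency!": ["SMS.exe", "/u:SamChua",
                   "/m:Come home immediately, I need help!"],
}

def execute_phrase(phrase, runner=subprocess.run):
    """Run the shell action tied to a recognized phrase, if one exists."""
    action = COMMANDS.get(phrase)
    if action is None:
        return False          # no action configured for this phrase
    runner(action)
    return True
```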
One important thing to note, however, is the degradation of speech over time in individuals with
ALS. In some individuals, the ability to enunciate words clearly declines rapidly within a year. For
others, the loss of speech is a slow process that drags on for several years. This is problematic, as it means
that the speech samples in the database will become inaccurate and need to be retrained over time. Because
the amount of time differs from person to person, either a variable can be set in the XML configuration
file to automatically replace old records with newly spoken ones every few months, or a user can manually
retrain their entire database when they notice recognition error rates increasing substantially.
The automatic deprecation of old records and addition of new ones would be ideal; however, because
dysarthric speakers struggle to control their facial muscles, there is no reliable way to automatically tell
whether an utterance is acceptable or not. Although tedious, the only way to accurately retrain is to
record a sample, let the user listen to it, and confirm that they would like to keep it.
6 EXPERIMENTS
In this section, two experiments involving human subjects are described and their results
discussed.
6.1 Experimental Setup
6.1.1 Goal
The goal of the following experiment was to capture several predetermined speech samples from
a PALS in order to provide ASRES with data that could be used to determine if spectral subtraction
improves speech intelligibility.
6.1.2 Setup
In order to determine whether filtered sound captured from within a mask improves intelligibility,
multiple recordings of sentences and a paragraph are taken from within the mask with the noise filter
deactivated to establish a baseline. Once these recordings are complete, the airflow noise filter is activated
and the same sentences and paragraph are read and recorded.
Test sentences are constructed from a subset of the list of 74 monosyllabic nouns and 37
disyllabic verbs found in the Nemours Database of Dysarthric Speech [9].
Two factors went into the creation of the subset. The first was to remove words that differed
from each other by only a single phoneme, such as “cob” and “cop”. Dysarthric speakers have been shown
to have difficulty with certain vowels and syllables [10], and removing such pairs helps ensure that
incorrect recognition is not caused by the speaker being unable to articulate a particular
sound, which would result in two words being pronounced the same. Words were therefore selected that were at
least two phonemes apart.
The second factor addressed the issue of context. In order to ensure that context was not what was
being used to recognize the speaker’s words, the constructed sentences are nonsensical and are in the
form of “The X is Ying the Z” where X and Z are randomly selected nouns without replacement from the
reduced set of 11 nouns and Y is a verb selected from the reduced set of 8 verbs listed below. An example
sentence is “The fade is leaping the bin”.
Set of nouns:
• cob
• bad
• bait
• fade
• fight
• dime
• dew
• bin
• rot
• pat
• bet

Set of verbs:
• wading
• leaping
• licking
• bearing
• stewing
• sipping
• going
• surging
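The sentence construction just described can be sketched as follows, using the reduced noun and verb sets listed above. This is an illustrative sketch; the thesis does not say the sentences were generated programmatically.

```python
import random

NOUNS = ["cob", "bad", "bait", "fade", "fight", "dime",
         "dew", "bin", "rot", "pat", "bet"]          # reduced set of 11 nouns
VERBS = ["wading", "leaping", "licking", "bearing",
         "stewing", "sipping", "going", "surging"]   # reduced set of 8 verbs

def make_sentence(rng=random):
    """Build one nonsense test sentence of the form 'The X is Ying the Z'."""
    x, z = rng.sample(NOUNS, 2)   # nouns drawn without replacement
    y = rng.choice(VERBS)
    return f"The {x} is {y} the {z}."
```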
Finally, the paragraph to be read by the speaker was a standard passage in the area of speech sciences
called “My Grandfather”:
You wished to know all about my grandfather. Well, he is nearly ninety-three years old; he
dresses himself in an ancient black frock coat, usually minus several buttons; yet he still thinks as
swiftly as ever. A long, flowing beard clings to his chin, giving those who observe him a
pronounced feeling of the utmost respect. When he speaks, his voice is just a bit cracked and
quivers a trifle. Twice each day he plays skillfully and with zest upon our small organ. Except in
the winter when the ooze or snow or ice prevents, he slowly takes a short walk in the open air
each day. We have often urged him to walk more and smoke less, but he always answers,
“Banana oil!” Grandfather likes to be modern in his language.
For the full details of the experimental procedure, see Appendix A: Procedure for Collecting
Speech Samples of a PALS. The subject population consisted of a single PALS.
6.1.3 Hypothesis
Prior to the start of this experiment, it was hypothesized that performing spectral subtraction filtering on
speech captured from a PALS would increase intelligibility.
6.2 Experimental Results
In this subsection, the experimental results for subject RG are discussed. In addition, a dataset
with digitally added noise was also tested on ASRES and the results examined in order to test the
effectiveness of spectral subtraction on BiPAP noise. Two types of tests were conducted. The following is an
examination first of the recognition rates achieved on the constructed dataset, and then on the actual dataset
produced by a PALS.
6.2.1 Digital Addition of Noise to Nemours Subject BB
In the first set of tests, a sample of BiPAP noise was recorded and digitally added to the
voice of a dysarthric speaker; the result was then Post-Filtered and the DTW algorithm applied.
The following datasets were created from words spoken by subject BB recorded in the Nemours
Database of Dysarthric Speech. Subject BB has a form of cerebral palsy that results in speech dysarthria,
although not severe. His level of dysarthria, compared to others in the Nemours database, is fairly low. BB
has a score of 8/8 in sentence intelligibility and conversation intelligibility. In the area of
word intelligibility, however, his score is only 4/8, which implies that his words are not as well formed as
those of an unimpaired speaker. BB’s level of dysarthria would be comparable to that of a person in the earlier
stages of ALS who is just starting to use BiPAP on a more regular basis. This makes him a fairly good candidate for a
test.
In the following results, the recording of the paragraph “My Grandfather” was trimmed until only
five words, “Grandfather”, “Coat”, “Buttons”, “Trifle”, and “Organ” remained. These five words were
then added to the library for DTW matching. A sample of noise recorded from inside a face mask with
BiPAP operating was then digitally added to the five words and the recordings saved. Finally, the sample
of digitally added noise was Post-Filtered with the spectral subtraction algorithm and the result again
segmented into five different words to test against ASRES. The following tables show the effect of
machine recognition as the SnR is gradually decreased.
SnR = 11.32 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        399.51        429.72   434.85    432.59   417.8
Coat               451.74        288.45   331.68    329.96   302.84
Buttons            404.85        292.41   253.06    305.79   274.45
Trifle             409.06        291.11   313.92    250.89   279.04
Organ              416.57        295.18   310.4     291.18   242.36
Table 1 - Noisy DTW of 5 Words by BB
SnR = 22.31 dB

POST-FILTERED (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        207.7         325.05   320.68    332.67   301.89
Coat               308.03        143.18   233.05    251.14   203.18
Buttons            287.12        220.4    119.45    239.64   195.19
Trifle             327.51        255.81   220.71    131.69   176.51
Organ              310.5         223.61   203.6     204.06   105.44
Table 2 - Post-Filtered DTW of 5 Words by BB
In the results above, the SnR is fairly high, and every word in both the five samples
of noise-corrupted speech and the five samples of filtered speech was matched with the correct sample in the library.
Therefore, there is no apparent benefit to performing filtering at this SnR, as it does not improve the
already excellent recognition rate.
SnR = 8.30 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        493.5         533.74   545.93    516.98   512.09
Coat               552.65        389.01   460.63    420.78   427.72
Buttons            503.71        406.92   373.06    388.74   392.32
Trifle             444.91        337.97   371.81    276.45   299.84
Organ              458.73        361.62   393       307.9    264.7
Table 3 - Noisy DTW of 5 Words by BB with +3dB Noise
SnR = 19.77 dB

POST-FILTERED (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        232.81        404.65   388.57    345.31   305.99
Coat               352.28        202.96   322.72    279.87   250.41
Buttons            337.51        294.85   214.69    267.62   238.48
Trifle             343.2         270.32   279.81    130.83   192.42
Organ              328.6         297.51   292.26    204.67   126.06
Table 4 - Post-Filtered DTW of 5 Words by BB with +3dB Noise
In the results above, the SnR has been reduced by 3 dB. It is still fairly good, but lower than in the
previous example. The Post-Filtered data was 100% correctly matched with the library; however, the
DTW matching for the noisy signal is down to 40%, with only two of the words, “Trifle” and “Organ”,
being correctly recognized. It is also worth noting that the scores in the Noisy table are much higher than
those in the Post-Filtered table. This implies that the alignment between the noisy speech and that which
is stored in the library is more distant and, therefore, that the system would be more prone to making errors as
the size of the dataset increased.
SnR = 4.31 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        532.67        558.27   571.66    548.45   543.42
Coat               591.07        437.84   496.93    465.51   467.74
Buttons            527.93        424.23   403.22    418.6    405.6
Trifle             503.11        403.53   439.36    360.08   376.96
Organ              497.42        395.78   424.89    367.13   312.55
Table 5 - Noisy DTW of 5 Words by BB with +7dB Noise
SnR = 16.64 dB

POST-FILTERED (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        258.76        389.99   389.29    336.2    293.33
Coat               361.26        222.17   352.16    269.2    252.84
Buttons            328.83        309.38   228.11    264.2    229.4
Trifle             366.76        279.3    312.32    180.62   219.4
Organ              334.95        291.74   309.11    232.14   152.78
Table 6 - Post-Filtered DTW of 5 Words by BB with +7dB Noise
In the results above, the SnR has been reduced by an additional 4 dB from the previous set. Once
again the Post-Filtered data was 100% correctly matched with the library; however, the DTW matching for
the noisy signal is at 60%, with three words, “Buttons”, “Trifle” and “Organ”, being correctly recognized.
In addition, the winning scores for each of the noisy samples have increased on average by 54.6
points.
This implies that the alignment between the noisy speech and that which is stored in the library is
becoming even more distant and more error-prone.
SnR = -1.70 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        688.6         694.65   715.26    686.7    677.34
Coat               714.46        535.93   581.55    550.37   552.47
Buttons            646.58        503.4    498.33    501.58   488.53
Trifle             594.55        469.85   503.39    427.74   435.51
Organ              633.91        489.28   515.92    447.1    400.54
Table 7 - Noisy DTW of 5 Words by BB with +13dB Noise
SnR = 12.03 dB

POST-FILTERED (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        345.96        417.85   438.26    368.66   338.42
Coat               413.07        265.57   390.25    297.25   278.12
Buttons            379.34        338.53   286.55    297.42   253.08
Trifle             373.28        285.58   341.79    213.9    244.22
Organ              388.06        316.99   352.28    261.48   211.48
Table 8 - Post-Filtered DTW of 5 Words by BB with +13dB Noise
In the results above, the SnR is close to 0 dB, which means that the signal and noise strengths are
almost one-to-one. Once again the Post-Filtered data was 100% correctly matched with the library, with
the DTW matching for the noisy signal remaining at 60%, the three words “Buttons”, “Trifle” and
“Organ” being correctly recognized. Although three words are recognized, “Buttons” is only 5.1 points (1.0%)
away from “Trifle”, the next closest match. If this test were repeated, it is quite possible that the recognition
rate would drop to 40%.
SnR = -1.68 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        720.91        725.14   748.82    718.73   708.46
Coat               730.43        552.9    598.96    567.4    570.11
Buttons            665.13        537      537.99    536.62   529.54
Trifle             580.12        456.3    487.08    415.58   410.74
Organ              616.28        473.26   499.22    432.93   385.86
Table 9 - Noisy DTW of 5 Words by BB with +13dB Noise and Alternate Noise
SnR = 12.07 dB

POST-FILTERED (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ
Grandfather        361.95        433.81   453.11    385.53   352.25
Coat               423.18        271.72   396.59    302.63   286.78
Buttons            389.1         335.86   311.83    303.94   277.62
Trifle             368.06        275.71   330.23    212.35   236.44
Organ              383.22        315.09   345.35    256.86   209.23
Table 10 - Post-Filtered DTW of 5 Words by BB with +13dB Noise and Alternate Noise
For this particular set of data, a different sample of BiPAP noise was digitally added to set the
SnR close to 0 once again. In Table 9, it is observed that the scores for “Buttons” and “Trifle” are extremely
close together and, as a result of the noise change, the word “Buttons” is incorrectly recognized as
“Trifle”. Once again the DTW matching for the noisy signal has dropped to 40%.
From the data collected in the tables above, there is reason to believe that spectral subtraction aids
in the recognition of words by DTW.
6.2.2 Person Living with ALS – RG
In the second set of tests, a subject with ALS tested the system by performing both experiments
on the procedure sheet as detailed in the experimental setup. In order to validate the theoretical findings of
the previous experiment, it was necessary to conduct the same experiment with a real person.
Subject RG, a person living with ALS provided the sound clips from which the following results
were derived. The following tables follow the same format and analysis of the digitally added noise
experiment in the previous section.
My Grandfather
SnR = 6.89 dB

NOISY (DTW scores against library)
Spoken \ Library   Grandfather   Coat     Buttons   Trifle   Organ    Language
Grandfather        402.15        381.8    352.32    426.21   333.79   423.91
Coat               352.16        299.5    251.09    349.65   223.15   333.44
Buttons            342.22        322.35   267.91    348.13   191.05   321.54
Trifle             351.12        332.86   274.62    349.99   232.95   344.45
Organ              361.03        351.08   296.1     384.98   220.72   348.17
Language           382.26        372.5    306.51    371.54   278.42   371.2
Table 11 - DTW of 6 Words by RG
The data from the table above was generated from a real-time noisy dataset that was cut from RG reading the paragraph, “My Grandfather”. At a SnR of 6.89 dB, and with this particular dataset, DTW performed poorly and was only able to match a single word, “Coat”, correctly.
SnR = 6.79 dB

POST-FILTERED  Library
Spoken         Grandfather   Coat     Buttons   Trifle   Organ    Language
Grandfather    704.2         690.74   605.99    755.59   535.35   690.45
Coat           742.46        693.53   586.83    782.55   497.03   711.89
Buttons        681.23        663.34   570.24    764.12   459.74   689.49
Trifle         831.07        784.44   675.2     873.5    573.25   815.71
Organ          762.76        757.45   643.36    821.56   513.91   757.53
Language       750.65        763.74   646.96    784.11   553.19   739.21

Table 12 – Post-Filtered DTW of 6 Words by RG
In the above table, the same real-time noisy dataset was subjected to Post-Filtering. The Post-Filtering in this case performed extremely poorly, and the SnR actually went down by 0.1dB. As a result, again only one word was correctly recognized and no improvement was made.
SnR = 9.12 dB

REAL-TIME FILTERED  Library
Spoken              Grandfather   Coat     Buttons   Language
Grandfather         342.82        249.44   304.91    439.06
Coat                327.89        188.02   287.97    381.67
Buttons             310.02        196.17   218.99    363.23
Trifle              402.9         312.27   407.06    545.7
Organ               411.42        268.7    376.74    512.62
Language            389.79        236.22   401.37    526.47

Table 13 - Real-time Filtered DTW of 4 Words by RG
The data from the table above was generated from a real-time filtered dataset that was cut from RG reading the paragraph, “My Grandfather”. The filtering has improved the SnR by just over 2dB, and the result is that two words, “Coat” and “Buttons”, are correctly recognized. Because RG was not able to articulate the words “Trifle” and “Organ” during this recording, both words were removed from this dataset.
From this dataset we see that Post-Filtering had no benefit in this case; however, the results from the Real-Time Filtered dataset show recognition of two of the words, which is better than the recognition of only one word in the noisy dataset.
Five Isolated Words
SnR = 12.5 dB

NOISY     Library
Spoken    Fight    Dime     Dew      Bin      Bait
Fight     418.85   150.99   201.16   140.37   165.79
Dime      426.74   161.13   190.26   165.62   158.91
Dew       392.86   198.93   178.62   184.3    149.09
Bin       388.54   174.78   162.93   143.62   132.76
Bait      384.9    177.01   162.06   163.56   118.38

Table 14 - Noisy DTW of 5 Words by RG
In the above dataset, words from each of the five sentences spoken by RG were cut from two recordings. The first recording, which had no BiPAP and no filtering enabled, served as the baseline and became the library files. The second recording, which had the BiPAP enabled and filtering off, was segmented and tested against the library with DTW. The result, as seen in the above table at a SnR of 12.5dB, was still fairly poor, with only a single word, “Bait”, being recognized.
SnR = 18.2 dB

POST-FILTERED  Library
Spoken         Fight    Dime     Dew      Bin      Bait
Fight          116.36   403.95   428.75   360.9    389.88
Dime           134.62   422.62   418.36   374.19   404.29
Dew            192      386.68   372.09   338.11   377.12
Bin            144.25   381.5    386      332.7    372.12
Bait           142.73   389.63   392.44   326.03   361.63

Table 15 – Post-Filtered DTW of 5 Words by RG
This dataset was created by taking the noisy dataset found in Table 14 and applying spectral subtraction. This increased the SnR by 5.7dB, and as a result the software was able to recognize 3 of the 5 words, or 60%.
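The spectral subtraction applied here removes an estimate of the noise magnitude spectrum from each frame of the noisy signal. The following is a minimal magnitude-domain sketch, assuming the noise estimate is taken from a single noise-only frame and using non-overlapping frames; the thesis implementation's exact parameters are not reproduced here.

```python
import numpy as np

def spectral_subtract(noisy, noise_profile, frame=256, floor=0.01):
    """Frame-wise magnitude spectral subtraction with a spectral floor."""
    noisy = np.asarray(noisy, dtype=float)
    # Noise magnitude spectrum estimated from a noise-only segment.
    noise_mag = np.abs(np.fft.rfft(np.asarray(noise_profile, dtype=float)[:frame]))
    out = np.zeros_like(noisy)
    for start in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[start:start + frame])
        mag = np.abs(spec) - noise_mag               # subtract the noise estimate
        mag = np.maximum(mag, floor * np.abs(spec))  # clamp negative bins to a floor
        # Recombine with the noisy phase, as is standard for spectral subtraction.
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return out
```

A production version would use overlapping windows with overlap-add and average the noise estimate over many frames; this sketch only illustrates the subtract-and-floor step at the heart of the method, which works well precisely because the BiPAP wind noise is close to stationary.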
SnR = 23.45 dB

REAL-TIME FILTERED  Library
Spoken              Fight    Dime     Dew      Bin      Bait
Fight               114.5    169.87   179.09   167.08   173.05
Dime                129.81   143.39   189.34   165.34   164.29
Dew                 178.28   148.45   144.14   133.46   117.96
Bin                 137      141.73   147.22   128.87   114.96
Bait                133.61   128.47   148.89   129.54   111.12

Table 16 - Real-Time Filtered DTW of 5 Words by RG
In the above dataset, a real-time spectral subtraction recording of the five words yielded an extremely high SnR. As a result, the recognition rate in this case was 4/5, or 80%.
From this dataset we see that there appears to be better recognition of words after filtering is performed.
Three Commands
SnR = 8.56 dB

NOISY          Library
Spoken         Help Me   Open Browser   Close Window
Help Me        522.5     784.6          868.68
Open Browser   1146.6    1042.66        1150.93
Close Window   1098.57   1150.65        1151.85

Table 17 - DTW of 3 Phrases by RG
In the above dataset, three phrases were recorded with the BiPAP off to serve as the library; three more recordings were then made with the BiPAP running, and DTW was applied. At a SnR of 8.56dB, the results are once again poor, with only one phrase being correctly recognized.
SnR = 18.7 dB

FILTERED       Library
Spoken         Help Me   Open Browser   Close Window
Help Me        325.58    721.43         474.74
Open Browser   749.32    637.4          702.31
Close Window   657.25    793.79         505.64

Table 18 - Real-Time Filtered DTW of 3 Phrases by RG
In this dataset, a recording of the 3 phrases was made with the BiPAP running and spectral subtraction enabled. In this case, the filter performed fairly well, boosting the SnR to 18.7dB and allowing for a recognition rate of 2 out of 3, or 67%.
6.3 Phoneme Analysis
In order to further validate whether there is any increase in intelligibility, an experiment was conducted in which individuals listened to the recordings of subject RG and wrote down what they perceived was said. Table 19 below shows the 5 control sentences with the two nouns and the verb in bold. Both of the nouns in the five sentences have exactly 3 phonemes each, while the verbs all have 5 phonemes. Therefore, the two nouns and verb in each sentence have a total of 11 phonemes.
Control                           Words
Sentence                          1   2   3   Total
The fight is wading the cob.      3   5   3   11
The dime is bearing the bet.      3   5   3   11
The dew is leaping the pat.       3   5   3   11
The bin is stewing the fade.      3   5   3   11
The bait is licking the rot.      3   5   3   11

Table 19 - Control 5 Sentences and Phoneme Divisions

The experiment was conducted in two phases consisting of 3 trials each. In each trial, the subject would listen to a sentence and then write down what they perceived RG to be speaking. Subjects understood in advance that the sentences spoken would be in the form of “The X is Ying the Z”. In phase one, three trials were conducted with the subjects listening to unfiltered speech with a SnR of 12.5dB. In phase two, the trials were conducted with the subjects listening to the filtered speech with a SnR of 23.45dB.

No Filtering (SnR = 12.5 dB)      Words
Trial 1                           1    2    3   Total   Score
The fight is wearing the cob      0   -2    0    -2     81.8%
the drive is bearing the dirt    -2    0   -2    -4     63.6%
the view is leaving the path     -1   -1   -1    -3     72.7%
the bend is spewing the pain     -2   -1   -2    -5     54.5%
the bate is licking the vase      0    0   -4    -4     63.6%

Table 20 - An Example Phoneme Analysis Trial Explained

In Table 20 above, an example trial is explained. In this trial, the subject perceived that sentence number one was “The fight is wearing the cob.” As the first word was perceived correctly, no penalty is applied under column 1, as all the phonemes were correctly identified. The verb “wading”, however, was perceived as “wearing”. As these two words differ in exactly 2 phonemes, a score of minus 2 is applied to column 2. The third word, “cob”, was correctly identified, therefore no penalty is applied. In total, only 2 phonemes were incorrectly perceived, which means that 9 out of 11 were correctly perceived, resulting in a recognition percentage of 81.8%. This scoring process is then repeated for the remaining 4
sentences, and this is performed for all 3 trials from the unfiltered set. From this we can then calculate the average phoneme recognition rate for each sentence over the three trials. Finally, a single number, an overall percentage of correctly identified phonemes, can be computed, as seen below in Table 21.
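The scoring procedure just described can be sketched as follows. This is a minimal version, assuming the per-word phoneme counts and the number of misheard phonemes per word are supplied by hand, as they are in the manual analysis, rather than computed from a pronouncing dictionary.

```python
def sentence_score(phonemes_per_word, errors_per_word):
    """Percentage of correctly perceived phonemes for one scored sentence.

    phonemes_per_word: phoneme counts of the two nouns and the verb, e.g. (3, 5, 3).
    errors_per_word:   phonemes misheard in each of those words, e.g. (0, 2, 0).
    """
    total = sum(phonemes_per_word)
    wrong = sum(errors_per_word)
    return 100.0 * (total - wrong) / total

def overall_score(trials):
    """Mean over all sentence scores in all trials, giving one overall percentage."""
    scores = [sentence_score(p, e) for trial in trials for (p, e) in trial]
    return sum(scores) / len(scores)

# The worked example above: "wading" heard as "wearing" costs 2 of 11 phonemes.
print(round(sentence_score((3, 5, 3), (0, 2, 0)), 1))  # -> 81.8
```

Averaging `sentence_score` per sentence over the three trials gives the per-sentence means, and `overall_score` over every scored sentence gives the single summary figure.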
Mean % of Correctly Identified Phonemes per Sentence
Sentence #1     75.8%
Sentence #2     81.8%
Sentence #3     69.7%
Sentence #4     48.5%
Sentence #5     66.7%
Overall % of Correctly Identified Phonemes     68.5%

Table 21 - Percentage of Correctly Identified Phonemes

This process is then repeated for the second phase so that an overall percentage of correctly identified phonemes can also be determined for the filtered dataset, and then the two results can be compared.
Table 22 below is a phoneme analysis of one of four subjects who participated in the study.
No Filtering (SnR = 12.5 dB)      Words
Trial 1                           1    2    3   Total   Score
The fight is wearing the cob      0   -2    0    -2     81.8%
the drive is bearing the dirt    -2    0   -2    -4     63.6%
the view is leaving the path     -1   -1   -1    -3     72.7%
the bend is spewing the pain     -2   -1   -2    -5     54.5%
the bate is licking the vase      0    0   -4    -4     63.6%
Trial 2
the pipe is wearing the cob      -2   -2    0    -4     63.6%
the dime is bearing the bent      0    0   -1    -1     90.9%
the view is reaching the path    -1   -2   -1    -4     63.6%
the bend is spewing the thing    -2   -1   -4    -7     36.4%
the bait is licking the vase      0    0   -4    -4     63.6%
Trial 3
The fight is wearing the cob      0   -2    0    -2     81.8%
the dime is bearing the bent      0    0   -1    -1     90.9%
the view is leaving the path     -1   -1   -1    -3     72.7%
the bend is spewing the pain     -2   -1   -2    -5     54.5%
the bait is raking the rock       0   -2   -1    -3     72.7%

RT Filtered (SnR = 23.45 dB)      Words
Trial 1                           1    2    3   Total   Score
The fight is waving the cob       0   -1    0    -1     90.9%
the dime is wearing the death     0   -1   -2    -3     72.7%
the view is leaking the path     -1   -1   -1    -3     72.7%
the bend is stewing the shade    -2    0   -1    -3     72.7%
the bait is making the rock       0   -2   -1    -3     72.7%
Trial 2
The fight is waving the cob       0   -1    0    -1     90.9%
the dime is wearing the bent      0   -1   -1    -2     81.8%
the Jew is sweeping the path     -1   -2   -1    -4     63.6%
the bend is stewing the shade    -2    0   -1    -3     72.7%
the bait is making the run        0   -2   -2    -4     63.6%
Trial 3
The fight is waving the cob       0   -1    0    -1     90.9%
The dime is wearing the death     0   -1   -2    -3     72.7%
The Jew is sweeping the path     -1   -2   -1    -4     63.6%
The bend is stewing the shade    -2    0   -1    -3     72.7%
The bait is making the run        0   -2   -2    -4     63.6%

Mean % of Correctly Identified Phonemes per Sentence
                 No Filtering   RT Filtered
Sentence #1      75.8%          90.9%
Sentence #2      81.8%          75.8%
Sentence #3      69.7%          66.7%
Sentence #4      48.5%          72.7%
Sentence #5      66.7%          66.7%
Overall          68.5%          74.5%

Table 22 - Phoneme Analysis of MB

The phoneme analysis of subject MB in Table 22 above shows a slight improvement, 6.1%, in recognition of the phonemes.
The details of the other three subjects' phoneme analyses are not included here but can be found in Appendix B: Phoneme Analysis Data Sheets. The summary results are discussed in 6.4.1 - Summary of Phoneme Analysis.
6.4 Discussion
In this subsection, the results of the different experiments are summarized and discussed.
6.4.1 Summary of Phoneme Analysis
A summary of the phoneme analysis for all four subjects is found in Table 23 below.
Overall % of Correctly Identified Phonemes
Subject   No Filtering (SnR = 12.5 dB)   RT Filtered (SnR = 23.45 dB)   +/- %
MB        68.5%                          74.5%                           6.1%
RS        69.1%                          70.3%                           1.2%
EC        67.3%                          67.9%                           0.6%
JW        74.5%                          73.3%                          -1.2%

Table 23 - Summary of Phoneme Analysis
As seen in the data, apart from one subject who showed a 6.1% increase in the number of
phonemes recognized by listening to the filtered speech samples, the other three individuals did not show
any significant increase in the number of phonemes recognized. In one case, JW, the individual actually
had a slight decrease in the number of phonemes accurately recognized.
Although there is a possibility that intelligibility of phonemes has improved slightly, no firm
conclusions can be reached from the dataset above in Table 23.
6.4.2 Summary of ASRES Results
The results from the tables of data for both BB and RG are summarized below, and graphs of SnR vs. Recognition Rate are constructed, in order to verify whether or not spectral subtraction of a noisy BiPAP signal provides any benefit to ASR technology.
SnR     Type           Correct  Percent  Subj.  Set
11.32   Noisy          5/5      100%     BB     Grandfather
8.3     Noisy          2/5      40%      BB     Grandfather
4.31    Noisy          3/5      60%      BB     Grandfather
-1.68   Noisy          2/5      40%      BB     Grandfather
-1.7    Noisy          3/5      60%      BB     Grandfather
22.31   Post-Filtered  5/5      100%     BB     Grandfather
19.77   Post-Filtered  5/5      100%     BB     Grandfather
16.64   Post-Filtered  5/5      100%     BB     Grandfather
12.07   Post-Filtered  5/5      100%     BB     Grandfather
12.03   Post-Filtered  5/5      100%     BB     Grandfather

Table 24 - Summary of BB Datasets
The table above was created by summarizing the data found in Table 1 to Table 10. The data was first sorted by type and then by SnR. The “Noisy” type refers to signals that were unfiltered, whereas “Post-Filtered” refers to noisy signals that had spectral subtraction applied to them in post-processing. The “Correct” column is the fraction of words that were correctly recognized by the ASR, and the “Percent” column simply expresses that fraction as a percentage.
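The SnR values tabulated throughout these summaries can be estimated from the recordings themselves. The following is a minimal sketch, assuming a speech-plus-noise segment and a noise-only segment have already been isolated; the thesis's exact measurement procedure is not reproduced here.

```python
import numpy as np

def snr_db(speech_and_noise, noise_only):
    """Estimate SnR in dB from a noisy-speech segment and a noise-only segment."""
    noise_power = np.mean(np.asarray(noise_only, dtype=float) ** 2)
    total_power = np.mean(np.asarray(speech_and_noise, dtype=float) ** 2)
    # Speech power is approximated as total power minus noise power,
    # clamped so the log is always defined.
    signal_power = max(total_power - noise_power, 1e-12)
    return 10.0 * np.log10(signal_power / noise_power)
```

For example, a segment whose total power is eleven times the noise power contains speech at ten times the noise power, i.e. an SnR of 10 dB.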
Figure 21 - SnR vs. Recognition Rate of BB for Noisy Signals Subjected to Post-Filtering
From Table 24 and Figure 21, there is reason to believe that Post-Filtering a noisy signal results
in a significantly higher recognition rate by ASR. All of the signals that were post-processed with spectral
subtraction were recognized without fail.
SnR     Type              Correct  Percent  Subj.  Set
12.5    Noisy             1/5      20%      RG     5 Sentences
8.56    Noisy             1/3      33%      RG     3 Commands
6.89    Noisy             1/6      17%      RG     Grandfather
18.2    Post-Filtered     3/5      60%      RG     5 Sentences
6.79    Post-Filtered     1/6      17%      RG     Grandfather
23.45   Real-Time Filter  4/5      80%      RG     5 Sentences
18.7    Real-Time Filter  2/3      67%      RG     3 Commands
9.12    Real-Time Filter  2/4      50%      RG     Grandfather

Table 25 - Summary of RG Datasets
Figure 22 - SnR vs. Recognition Rate of RG for Noisy, Post-Filtered and Real-time Filtering
From Table 25 and Figure 22, there is reason to believe that spectral subtraction increases the ASR rate under real-world circumstances. The three data points for the “Noisy” series show that signals under 15dB have a low recognition rate. However, the very same signals subjected to Post-Filtering once again show a marked improvement. The real-time filtering shows that if spectral subtraction results in a SnR that is over 20dB, the recognition rate increases once again.
6.4.3 Effectiveness of ASRES
ASRES seems to show some effectiveness in improving recognition in both cases that were explored. In the case where noise is digitally added to speech samples and then Post-Filtered, the
improvement is large. Under real-world circumstances, there is still a noticeable improvement; however, the results are not nearly as dramatic as the 100% recognition rates achieved in the case in which noise was digitally added and Post-Filtered.
From the intelligibility analysis as summarized in Section 6.4.1 - Summary of Phoneme Analysis,
there is little reason to believe that spectral subtraction at the specific SnRs offers anything more than a
marginal improvement in terms of intelligibility. However, from Section 6.4.2 - Summary of ASRES
Results, it is clear that performing spectral subtraction on noisy speech signals of a patient with ALS can
improve the automatic recognition of their speech signals when the Post-Filtering SnR is greater than
8dB.
Therefore, in our discussions regarding the effectiveness of the ASRES system, it would seem
that the enhancement that the system offers is not primarily one of increased intelligibility for human
listeners, but rather for automated machine recognition of speech.
It would not be fair, however, to say that the system is of no value for human listeners.
Although no discernible increase in intelligibility by filtering was achieved, PALS still gain several
immediate benefits from using ASRES.
For one, as discussed in Section 1.2 - The Effects of ALS on Speech and Quality of Life,
allowing PALS to speak from behind a mask while on BiPAP is already a huge step forward in terms of
combatting loneliness and struggles with isolation.
6.4.4 Feedback from ALS Community
The feedback from ALS caregivers and subjects has been very positive. In May 2009, ASRES was presented at the ALS Society of BC’s Engineering Design Competition, and a live demonstration of its capabilities was performed. As a result of the successful live demonstration, the system won the Principal Award of $5000 and was subsequently selected for a platform presentation, “C75 – Enhancing Speech During BiPAP Use”, at the 20th International Symposium on ALS/MND in Berlin, for which a travel bursary was also awarded. The findings were also reported during the keynote address at the conference closing as a novel attempt at a problem that had not previously been explored.
In addition, subject RG, who tested the system, was very pleased with it, as were the caregivers and professionals who were able to observe the system.
6.5 Validity
When considering the validity of the experiments that were conducted for this thesis, there are two
outstanding issues that still need to be considered:
• Sample size: Due to time, lifespan, disease progression, geographical and language constraints, the number of subjects for the experiment was limited to one subject, who was located with the aid of ALS BC. Finding local subjects was difficult not only due to the rarity of the disease, but also because a subject had to have some ability to speak left while on BiPAP ventilation. One subject was not able to participate due to poor command of the English language. Other subjects who were considered during the time that ASRES was in development did not live to see it come to fruition.
• Size of datasets: Although some results were achieved, only a handful of datasets were used and
the size of the library was very small as well. Larger datasets would help determine whether or
not it was sheer coincidence that resulted in the machine recognition rates being higher for Post-
Filtered and real-time filtered data.
7 CONCLUSIONS AND FUTURE WORK
In this section the work of the thesis is discussed. The goals of the research are reviewed, the results
are summarized, strengths and weaknesses are assessed and future work is discussed.
7.1 Research Goals Summary
The goal of this thesis was to develop and evaluate a prototype Automatic Speech Recognition and Enhancement System (ASRES) that allows Persons Living with ALS (PALS) to communicate clearly with their own voices while on BiPAP ventilation. This problem is a significant one, and a solution to it would greatly aid in improving the overall quality of life of PALS by allowing them to communicate naturally and allowing loved ones to hear their voices. Although there are some published works on related topics in the area of noise reduction and speech enhancement, there are no published works that offer a solution to this particular problem. As a result, it is not possible to evaluate ASRES against any existing system.
In this thesis, the problems associated with capturing and filtering speech by PALS are explained, and as a result specific hardware was selected and software written to capture PALS speech and to filter the wind noise that corrupts signal samples taken from within the mask. A working, mobile prototype was designed and implemented using Matlab and a TMS320C6713 DSP board, with an interface written in C# for Windows.
The effectiveness of ASRES was evaluated by digitally adding noise to speech samples from the Nemours Database of Dysarthric Speech, filtering them with spectral subtraction, and then running the resulting samples through ASRES in order to determine their scores. The effectiveness of the system was also evaluated by testing with a subject with ALS, who recorded a number of different speech samples under different conditions, including no filtering with BiPAP on and filtering with BiPAP on. Human subjects listened to the resulting sound samples and wrote down the sentences they perceived, and the accuracy of their transcriptions was then evaluated using phoneme analysis.
7.2 Contributions of this Work
In the introduction, it was stated that this thesis would attempt to improve the intelligibility of speech of persons living with ALS by, one, identifying the problems associated with capturing speech of PALS; two, designing and implementing a working prototype that would address these problems; and three, validating the effectiveness of the system by conducting experiments and analyzing their results.
An explanation was given for the problem of the “wind noise” that causes difficulties in capturing speech from behind a full face mask while a subject is on BiPAP ventilation, and an appropriate capture solution, involving a single microphone and spectral subtraction for stationary noise, was designed and implemented.
An automated speech recognition system using DTW was also implemented to validate our hypothesis that ASRES improves the intelligibility of PALS speech.
An evaluation of the results shows that spectral subtraction offers some improvement to machine recognition of noisy speech. The benefits of being able to accurately recognize the speech of PALS are many, and the system can be used to trigger alarms, operate a browser, or launch custom software that could contact a caregiver. All of these applications stand to improve the quality of life of PALS.
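Such triggering can be sketched as a simple dispatch on the recognized command. The phrases here match the three-command experiment above, but the action mapping itself is a hypothetical placeholder, not part of the ASRES software.

```python
# Hypothetical mapping from recognized command phrases to actions.
def make_dispatcher(actions):
    def dispatch(recognized_phrase):
        action = actions.get(recognized_phrase.lower())
        if action is None:
            # Unknown phrases are ignored rather than triggering anything.
            return "ignored: " + recognized_phrase
        return action()
    return dispatch

dispatch = make_dispatcher({
    "help me": lambda: "alarm raised",         # e.g. sound a caregiver alarm
    "open browser": lambda: "browser opened",  # e.g. launch a web browser
    "close window": lambda: "window closed",
})
print(dispatch("Help Me"))  # -> alarm raised
```

Because every action hinges on the recognizer emitting the right phrase, the recognition-rate gains reported above translate directly into more reliable triggering.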
A final experiment was conducted to evaluate whether or not filtering increases intelligibility as perceived by human listeners. Although one case showed a slight improvement, the other three cases did not, and the best conclusion is that filtered speech offers no major improvement in intelligibility for human listeners over noisy speech. Although no improvement was made with regard to intelligibility, the very fact that PALS can speak with their own voices while on BiPAP is already an improvement, as their speech is completely muffled by the masks if there is no amplification.
The results of this thesis are significant in that not only is there reason to believe from the data that
there is some benefit to performing filtering of noisy speech, but also that persons with ALS, caregivers
and professionals see this device as a step forward in improving the quality of life for persons with ALS.
7.3 Strengths and Limitations
7.3.1 Strengths
• Low hardware requirements: As a single-microphone solution was chosen, only a basic DSP
that reads and samples from one channel is required. All the components for a simple system,
including the microphone and DSP chip are easily available.
• Improvements to automated speech recognition: This is perhaps the single largest strength of the system: filtered speech is more accurately recognized by a machine. The implications of this are significant, as voice-activated alarms and the triggering of custom software all hinge on accurate recognition of the command words that activate them.
• Improvement of quality of life of PALS: As the system allows PALS to do what they could never do before, namely use their voice while on BiPAP, there is already an immediate improvement in the quality of life for PALS.
7.3.2 Weaknesses
• System training is difficult for PALS. As PALS often have difficulty with fine motor
coordination, having them use a Windows GUI to configure the system would prove difficult. All
training would have to be done with the aid of a caregiver who would not only setup the system,
but record samples of the command words.
• Results are largely based on a few datasets obtained from one subject with ALS. Although this was due to the limitations already mentioned, more tests with other PALS would help to confirm the validity of the results.
• The performance of the automated speech recognition degrades as the user's ability to articulate speech degrades. A mechanism for retraining on a regular basis is necessary in order to avoid decreased accuracy over time.
• Not the strongest noise filtering algorithm. The noise filtering can be improved by changing to a different algorithm, or by using an improved version of spectral subtraction that is more up to date with current research.
7.4 Potential Applications
There are a number of potential applications that could result from this prototype. The primary one, however, would be the development of this prototype into a completely portable, self-powered system that could be permanently integrated with the face mask, with the speakers and computer perhaps attached to the BiPAP unit itself.
It is also possible that the device could be integrated into the construction of existing masks; however, this would be a difficult procedure, as extensive testing would need to be conducted with mask manufacturers to determine whether or not the introduction of a microphone system would compromise either the integrity or the function of the mask.
7.5 Future Work
There are many areas of the project that would need to be enhanced in order to make this project a reality for persons with ALS. The following is a list of several major points that would still need to be solved in order to move ASRES from a prototype into a useable product.
• Permanent integration of the microphone with a full-face mask – As the system is currently a prototype, the microphone dangles within the mask cavity and occupies some CO2 ventilation holes. For long-term use, it would be essential to have microphones built into the walls of the mask itself. This would reduce the risk of the microphone ever becoming dislodged and also decrease the time it takes to secure a microphone in the mask each time it is taken off or washed. The addition of a microphone to the mask is a more complex problem that would involve mask design companies, and approval would also have to be sought to ensure that the classification of the mask remains a Type I or II medical device.
• Development of a completely self-contained prototype that would not need an electrical plug-in or a laptop to configure it. Currently the system is rather difficult to set up and involves a fair number of wires, connections, and DIP switches that cannot be easily set up and are prone to being broken or unplugged if sudden movements stretch the wires.
• Improved training interface – The current user interface is hardly user-friendly and does not offer very many features to users.
• Improving the ASR algorithm by exploring HMMs and building a more sophisticated
automated speech recognition system that would be able to handle larger sets of data.
• Exploring the effect of reverberations in the mask. Although the literature indicates that reverberations within fighter pilot masks decrease intelligibility, this has yet to be confirmed with the full-face masks worn by those on BiPAP.
7.6 Conclusion
This thesis has identified the problems associated with capturing and recognizing the speech of PALS who are on BiPAP ventilation. A prototype was designed, implemented and tested with human subjects and DTW to evaluate its improvements to speech intelligibility. Although no conclusive improvements to intelligibility for human listeners were demonstrated, the ALS community has affirmed that there is real value in being able to communicate from behind a face mask, and that this in and of itself improves the quality of life of persons living with ALS. In addition, there is reason to believe that spectral subtraction filtering of noisy speech enhances automated machine recognition of PALS speech.
The work of this thesis is a first step towards improving the quality of life for PALS. Although the life expectancy of PALS is approximately 2-5 years, and the length of time that a device like this would be useful would therefore be fairly short, perhaps on the order of months or at most a year or two, such a device can still offer immeasurable benefit: its value is determined not by its useable lifespan, but by the impact that it has on caregivers and families for the time in which it can be used.
BIBLIOGRAPHY [1] NIH, “Amyotrophic Lateral Sclerosis Fact Sheet: National Institute of Neurological Disorders and
Stroke (NINDS),” National Institute of Neurological Disorders and Stroke, 2003. [Online]. Available: http://www.ninds.nih.gov/disorders/amyotrophiclateralsclerosis/detail_amyotrophiclateralsclerosis.htm?css=print.
[2] L. S. Aboussouan, S. U. Khan, M. Banerjee, A. C. Arroliga, and H. Mitsumoto, “Objective measures of the efficacy of noninvasive positive-pressure ventilation in amyotrophic lateral sclerosis,” MUSCLE & NERVE, vol. 24, no. 3, pp. 403–409, Mar. 2001.
[3] E. R. Klasner and K. M. Yorkston, “Speech intelligibility in ALS and HD dysarthria: the everyday listener’s perspective.(amyotrophic lateral sclerosis)(Huntington disease),” Journal of Medical Speech - Language Pathology, Jun. 2005.
[4] J. F. Kent, R. D. Kent, J. C. Rosenbek, G. Weismer, R. Martin, R. Sufit, and B. R. Brooks, “Quantitative Description of the Dysarthria in Women With Amyotrophic Lateral Sclerosis.,” Journal of Speech & Hearing Research, vol. 35, no. 4, p. 723, 1992.
[5] “A Guide to ALS Patient Care for Primary Care Physicians.” ALS Society of Canada. [6] M. Shoeb, S. E. Merel, M. B. Jackson, and B. D. Anawalt, “‘Can we just stop and talk?’ patients
value verbal communication about discharge care plans,” J. Hosp. Med., vol. 7, no. 6, pp. 504–507, Aug. 2012.
[7] L. Giordano, S. Toma, R. Teggi, F. Palonta, F. Ferrario, S. Bondi, and M. Bussi, “Satisfaction and Quality of Life in Laryngectomees after Voice Prosthesis Rehabilitation,” Folia Phoniatr. Logop., vol. 63, no. 5, pp. 231–236, 2011.
[8] M. Parker, S. Cunningham, P. Enderby, M. Hawley, and P. Green, “Automatic speech recognition and training for severely dysarthric users of assistive technology: The STARDUST project,” Clinical Linguistics & Phonetics, vol. 20, no. 2–3, pp. 156, 149, 2006.
[9] X. Menendez-Pidal, J. B. Polikoff, S. M. Peters, J. E. Leonzio, and H. T. Bunnell, “Nemours database of dysarthric speech,” in Proceedings of the 1996 International Conference on Spoken Language Processing, ICSLP. Part 3 (of 4), Oct 3-6 1996, Piscataway, NJ, USA, 1996, vol. 3, pp. 1962–1965.
[10] A. B. Kain, J.-P. Hosom, X. Niu, J. P. H. van Santen, M. Fried-Oken, and J. Staehely, “Improving the intelligibility of dysarthric speech,” Speech Communication, vol. 49, no. 9, pp. 743–759, Sep. 2007.
[11] B. Tomik and R. J. Guiloff, “Dysarthria in amyotrophic lateral sclerosis: A review,” Amyotroph. Lateral. Scler., vol. 11, no. 1–2, pp. 4–15, 2010.
[12] K. C. Hustad, “The Relationship Between Listener Comprehension and Intelligibility Scores for Speakers With Dysarthria,” J Speech Lang Hear Res, vol. 51, no. 3, pp. 562–573, Jun. 2008.
[13] W. Jones, P. Mathy, T. Azuma, and J. Liss, “The Effect of Aging and Synthetic Topic Cues on the Intelligibility of Dysarthric Speech,” AAC: Augmentative and Alternative Communication, vol. 20, no. 1, pp. 22–29, 2004.
[14] J. Murphy, “Communication strategies of people with ALS and their partners,” AMYOTROPHIC LATERAL SCLEROSIS AND OTHER MOTOR NEURON DISORDERS, vol. 5, no. 2, pp. 121–126, Jun. 2004.
[15] I. R. Murray and J. L. Arnott, “Synthesizing emotions in speech: is it time to get excited?,” Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, vol. 3, pp. 1816–1819 vol.3, 3.
[16] M. S. Yakoub, S.-A. Selouani, and D. O’Shaughnessy, “Improving dysarthric speech intelligibility through re-synthesized and grafted units,” in Electrical and Computer Engineering, 2008. CCECE 2008. Canadian Conference on, 2008, pp. 001523–001526.
[17] N. Yousefian and P. C. Loizou, “A Dual-Microphone Speech Enhancement Algorithm Based on the Coherence Function,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 2, pp. 599–609, 2012.
[18] “ALS Facts,” Amyotrophic Lateral Sclerosis Society of Canada.
[19] “Canadian Cancer Statistics 2012,” Canadian Cancer Society, 2012.
[20] G. S. Kang and T. M. Moran, “Speech enhancement in noise and within face mask (microphone array approach),” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1998, vol. 2, pp. 1017–1020.
[21] P. Goel and A. Garg, “Review of Spectral Subtraction Techniques for Speech Enhancement,” IJECT, vol. 2, no. 4, 2011.
[22] E. Nemer and W. Leblanc, “Single-microphone wind noise reduction by adaptive postfiltering,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA ’09), 2009, pp. 177–180.
[23] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[24] S. Ahn and H. Ko, “Background noise reduction via dual-channel scheme for speech recognition in vehicular environment,” IEEE Transactions on Consumer Electronics, vol. 51, no. 1, pp. 22–27, 2005.
[25] Y. Ephraim and D. Malah, “Speech enhancement using optimal non-linear spectral amplitude estimation,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’83), 1983, vol. 8, pp. 1118–1121.
[26] Lu-ying Sui, Xiong-wei Zhang, Jian-jun Huang, and Bin Zhou, “An improved spectral subtraction speech enhancement algorithm under non-stationary noise,” in International Conference on Wireless Communications and Signal Processing (WCSP 2011), 2011, pp. 1–5.
[27] Bing-yin Xia, Yan Liang, and Chang-chun Bao, “A modified spectral subtraction method for speech enhancement based on masking property of human auditory system,” in International Conference on Wireless Communications & Signal Processing (WCSP 2009), 2009, pp. 1–5.
[28] B. H. Juang and L. R. Rabiner, “Automatic Speech Recognition – A Brief History of the Technology Development,” in Elsevier Encyclopedia of Language and Linguistics, Second ed., 2005.
[29] E. J. Keogh and M. J. Pazzani, “Scaling up dynamic time warping to massive datasets,” in Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, 1999, pp. 1–11.
[30] T. Oates, L. Firoiu, and P. R. Cohen, “Using Dynamic Time Warping to Bootstrap HMM-Based Clustering of Time Series,” in Sequence Learning: Paradigms, Algorithms, and Applications, 2001, pp. 35–52.
[31] H. Tolba and A. S. El Torgoman, “Towards the improvement of automatic recognition of dysarthric speech,” in 2nd IEEE International Conference on Computer Science and Information Technology (ICCSIT 2009), 2009, pp. 277–281.
APPENDICES
Appendix A: Procedure for Collecting Speech Samples of a PALS
The following is a step-by-step outline of the procedure to be conducted at the subject’s home.
Introduction – 5 minutes
• Welcome and thank the patient for their participation
• Explain the purpose of the study, the patient’s right to terminate the study at any time, and the outline of the procedure
• Sign the consent form
Prototype Setup – 2 minutes
Setup of the prototype involves inserting the microphone into the mask for recording.
Experiment #1: - 4 minutes
The patient speaks 3 sentences and a paragraph while wearing the mask with the inserted microphone. Samples
are recorded.
Experiment #2: - 4 minutes
The sound filter is enabled via a toggle switch, and the patient speaks the same 3 sentences and paragraph while wearing the
mask with the inserted microphone. Samples are recorded.
Debrief – 5 minutes
The debrief involves answering any questions the subject may have. The subject is then thanked for
participating in the study.
Summary
Total visits: 1
Total Duration: 20 min
Appendix B: Phoneme Analysis Data Sheets
Table 26 - Phoneme Analysis of RS
No Filtering (SNR = 12.5 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  The fight is waiting the cop.         0      -1      -1     -2  81.8%
  The dime is burying the bet.          0      -1       0     -1  90.9%
  The view is meeting the pack.        -1      -2      -1     -4  63.6%
  The bin is doing the fin.             0      -2      -2     -4  63.6%
  The page is licking the vine.        -2       0      -3     -5  54.5%
Trial 2
  The fight is waiting the cop.         0      -1      -1     -2  81.8%
  The dime is burying the bat.          0      -1      -1     -2  81.8%
  The view is meeting the pet.         -1      -2      -1     -4  63.6%
  The bin is doing the thing.           0      -2      -3     -5  54.5%
  The beat is making the rock.         -1      -2      -1     -4  63.6%
Trial 3
  The fight is winning the cop.         0      -2      -1     -3  72.7%
  The dime is burying the vet.          0      -1      -1     -2  81.8%
  The view is meeting the cat.         -1      -2      -1     -4  63.6%
  The bin is doing the fig.             0      -2      -2     -4  63.6%
  The meat is making the rock.         -2      -2      -1     -5  54.5%

RT Filtered (SNR = 23.45 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  The fight is waving the towel.        0      -1      -3     -4  63.6%
  The dime is wearing the vest.         0      -1      -2     -3  72.7%
  The zoo is soothing the pet.         -2      -3      -1     -6  45.5%
  The bin is doing the shade.           0      -2      -1     -3  72.7%
  The beat is making the rock.         -1      -2      -1     -4  63.6%
Trial 2
  The fight is waving the tub.          0      -1      -2     -3  72.7%
  The guide is wearing the pet.        -2      -1      -1     -4  63.6%
  The do is zooting the pack.           0      -3      -1     -4  63.6%
  The bin is doing the fade.            0      -2       0     -2  81.8%
  The beat is making the rock.         -1      -2      -1     -4  63.6%
Trial 3
  The fight is waving the tub.          0      -1      -2     -3  72.7%
  The dine is wearing the bet.         -1      -1       0     -2  81.8%
  The do is zooting the pat.            0      -3       0     -3  72.7%
  The bin is stewing the fade.          0       0       0      0  100.0%
  The date is making the rock.         -1      -2      -1     -4  63.6%

Mean % of correctly identified phonemes per sentence (No Filtering / RT Filtered)
  Sentence #1: 78.8% / 69.7%
  Sentence #2: 84.8% / 72.7%
  Sentence #3: 63.6% / 60.6%
  Sentence #4: 60.6% / 84.8%
  Sentence #5: 57.6% / 63.6%

Overall % of correctly identified phonemes: 69.1% / 70.3%
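The arithmetic behind these data sheets can be sketched as follows. The sketch assumes (consistent with every row in Tables 26–28) that each test sentence contains 11 target phonemes spread across its three key words, that each word entry records the number of phonemes the listener missed as a negative count, and that the per-sentence score is the fraction of phonemes identified correctly. The function name `sentence_score` is illustrative, not from the thesis.

```python
TOTAL_PHONEMES = 11  # assumed phoneme count per test sentence

def sentence_score(word_errors):
    """Percent of correctly identified phonemes for one sentence.

    word_errors: per-word entries from the data sheet, e.g. (0, -1, -1),
    where each negative value counts phonemes the listener missed.
    """
    missed = -sum(word_errors)
    return 100.0 * (TOTAL_PHONEMES - missed) / TOTAL_PHONEMES

# Sentence #1 of Table 26 (no filtering), word errors for trials 1-3:
trials = [(0, -1, -1), (0, -1, -1), (0, -2, -1)]
scores = [sentence_score(t) for t in trials]
print([round(s, 1) for s in scores])        # [81.8, 81.8, 72.7]

# Mean over the three trials, matching the "Sentence #1" row:
print(round(sum(scores) / len(scores), 1))  # 78.8
```

The overall percentage at the bottom of each table is the same computation pooled over all 15 sentences (165 phonemes) in a condition.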
Table 27 - Phoneme Analysis of EC
No Filtering (SNR = 12.5 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  the fight is waiting the car          0      -1      -2     -3  72.7%
  the dime is baring the dame           0       0      -3     -3  72.7%
  the view is leading the pack         -1      -1      -1     -3  72.7%
  the bend is chewing the food         -2      -2      -1     -5  54.5%
  the bait is making the bard           0      -2      -4     -6  45.5%
Trial 2
  the fight is waiting the car          0      -1      -2     -3  72.7%
  the dime is bearing the bird          0       0      -3     -3  72.7%
  the dew is leaving the pack           0      -1      -1     -2  81.8%
  the bend is chewing the food         -2      -2      -1     -5  54.5%
  the bait is making the rod            0      -2      -1     -3  72.7%
Trial 3
  the fight is waiting the car          0      -1      -2     -3  72.7%
  the dime the bearing the debt         0       0      -3     -3  72.7%
  the dew is leading the pack           0      -1      -1     -2  81.8%
  the bend is chewing the food         -2      -2      -1     -5  54.5%
  the bait is making the vine           0      -2      -3     -5  54.5%

RT Filtered (SNR = 23.45 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  the fight is waiting the cub          0      -1      -1     -2  81.8%
  the die is wearing the bat           -1      -1      -1     -3  72.7%
  the dew is leading the past           0      -1      -2     -3  72.7%
  the den is stealing the spade        -2      -2      -2     -6  45.5%
  the bait is making the run            0      -2      -2     -4  63.6%
Trial 2
  the fight is waiting the tub          0      -1      -2     -3  72.7%
  the dye is wearing the pants         -1      -1      -4     -6  45.5%
  the dew is sleeping the pack          0      -1      -1     -2  81.8%
  the den is stewing the fade          -2       0       0     -2  81.8%
  the bait is making the run            0      -2      -2     -4  63.6%
Trial 3
  the fight is waiting the top          0      -1      -2     -3  72.7%
  the die is wearing the pants         -1      -1      -4     -6  45.5%
  the dew is sleeping the past          0      -1      -2     -3  72.7%
  the den is stewing the fade          -2       0       0     -2  81.8%
  the bait is making the run            0      -2      -2     -4  63.6%

Mean % of correctly identified phonemes per sentence (No Filtering / RT Filtered)
  Sentence #1: 72.7% / 75.8%
  Sentence #2: 72.7% / 54.5%
  Sentence #3: 78.8% / 75.8%
  Sentence #4: 54.5% / 69.7%
  Sentence #5: 57.6% / 63.6%

Overall % of correctly identified phonemes: 67.3% / 67.9%
Table 28 - Phoneme Analysis of JW
No Filtering (SNR = 12.5 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  the fight is waiting the cob          0      -1       0     -1  90.9%
  the dine is                          -1      -5      -3     -9  18.2%
  the dew is leaving the path           0      -1      -1     -2  81.8%
  the bin is spewing the                0      -1      -3     -4  63.6%
  the fate is making the rock          -1      -2      -1     -4  63.6%
Trial 2
  the fight is waiting the cob          0      -1       0     -1  90.9%
  the dine is bearing the bear         -1       0      -1     -2  81.8%
  the dew is leaving the path           0      -1      -1     -2  81.8%
  the bin is spewing the fish           0      -1      -2     -3  72.7%
  the bait is making the rock           0      -2      -1     -3  72.7%
Trial 3
  the fight is waiting the cob          0      -1       0     -1  90.9%
  the dime is bearing the bear          0       0      -1     -1  90.9%
  the dew is leaving the path           0      -1      -1     -2  81.8%
  the bin is spewing the fish           0      -1      -2     -3  72.7%
  the fate is making the rock          -1      -2      -1     -4  63.6%

RT Filtered (SNR = 23.45 dB)
                                   Word 1  Word 2  Word 3  Total  Score
Trial 1
  The fight is waiting the tub          0      -1      -2     -3  72.7%
  The dine is wearing the bear         -1      -1      -1     -3  72.7%
  The dew is reaping the path           0      -2      -1     -3  72.7%
  The bin is spewing the spade          0      -1      -2     -3  72.7%
  The bait is making the rock           0      -2      -1     -3  72.7%
Trial 2
  The fight is waiting the tub          0      -1      -2     -3  72.7%
  The die is wearing the bear          -1      -1      -1     -3  72.7%
  The dew is reaping the path           0      -2      -1     -3  72.7%
  The bin is stewing the spade          0       0      -2     -2  81.8%
  The bait is making the rock           0      -2      -1     -3  72.7%
Trial 3
  The fight is waiting the tub          0      -1      -2     -3  72.7%
  The dine is wearing the bear         -1      -1      -1     -3  72.7%
  The stew is reaping the path         -1      -2      -1     -4  63.6%
  The bin is stewing the sage           0       0      -2     -2  81.8%
  The bait is making the rock           0      -2      -1     -3  72.7%

Mean % of correctly identified phonemes per sentence (No Filtering / RT Filtered)
  Sentence #1: 90.9% / 72.7%
  Sentence #2: 63.6% / 72.7%
  Sentence #3: 81.8% / 69.7%
  Sentence #4: 69.7% / 78.8%
  Sentence #5: 66.7% / 72.7%

Overall % of correctly identified phonemes: 74.5% / 73.3%