© 2013 by Pooja Jain. All rights reserved.

VISUALIZING SPEECH PRODUCTION WITH A HIDDEN MARKOV MODEL TRACKER TO AID SPEECH THERAPY AND COMMUNICATION

BY

POOJA JAIN

THESIS

Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science

in the Graduate College of the
University of Illinois at Urbana-Champaign, 2013

Urbana, Illinois

Advisor:

Associate Professor Karrie G. Karahalios

Abstract

Communication disorders occur across all age groups and often first appear in children. They can range from problems in the comprehension of speech to problems in the expression of speech, to the point that they interfere with an individual's achievement and/or quality of life. Communication disorders can compromise a person's psychological, sociological, educational and vocational growth.

There have been various studies on how the implications of these impairments can be mitigated

through treatment, therapy and communication processes.

This research focuses on the development and implementation of software that aims to facilitate speech production by providing feedback through audio visualizations that represent basic audio features and coherent parts of speech tracked by a hidden Markov model. The goal of these visualizations is to help users understand speech better by providing a system where they can see the words they speak and can experience, develop and practice speech skills using the statistical speech model and temporal features represented through simple abstract visualizations. This research proposes an approach to visualizing speech that can potentially aid speech therapy and communication, giving people with communication disorders a tool they can use to understand their speech problems without the continuous need for a therapist or teacher.


Acknowledgments

First and foremost, I would like to express my sincere gratitude and appreciation to my advisor,

Professor Karrie Karahalios, who has always been an amazing source of support, encouragement

and guidance throughout graduate school. If not for her, this thesis would not have been possible.

Many of the topics introduced in this thesis were the results of her attention to detail and willingness

to try new ideas, and thanks to her utmost dedication to her students I was able to complete this

thesis.

I am also very grateful to Jonathan Tedesco, without whom I could never have completed the

initial hidden Markov model prototype in Matlab. I would also like to extend my thanks to all my

graduate student friends at the University of Illinois who have stood by me through thick and thin.

Their mentoring and words of advice have helped me settle in as a first-year graduate student. I

especially want to thank Hyun Duk Cho for his constant support and willingness to help me till the

very end.

Last but certainly not least, I have to thank my parents and my sisters for always being there

for me when I most needed them. I couldn’t have made it so far without them.


Table of Contents

Chapter 1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Chapter 2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Technology aided speech therapy and communication . . . . . . . . . . . . . . . . . . 3
2.2 Speech technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Chapter 3 Visualization Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.1 Initial designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

Chapter 4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.1 Data preprocessing and hidden Markov model . . . . . . . . . . . . . . . . . . . . . . 8
4.2 Acoustic features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

Chapter 5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Chapter 6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 11

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


Chapter 1

Background

Communication is the act of conveying meaningful information and is integral to a successful life.

We usually learn the fundamentals of speech in childhood or adolescence. However, it is estimated that during these years 24% of all children receive services for language and speech disorders [3]. Difficulties with speech and language skills can increase anxiety and even lead to emotional and behavioral disorders in children [4] [6]. Common effects of inadequate early treatment in children include decreased receptive language, reading, and learning abilities in later years [13]. These disorders are not limited to children; adults have been shown to have residual speech errors as well [8]. Communication disorders can be of various types, such as fluency, voice, language, phonological and hearing disorders. They can also occur as a secondary disorder in people with learning disabilities and autism [3].

Adults and children with communication disorders often have low word, phonological, sound and syllable awareness [5] [15]. Both children and adults often receive treatments and therapies depending on the degree, severity or cause of the disorder. Research shows that these therapies can increase spoken language and phonological processing abilities [7]. Phonological disorders involve patterns of sound errors, such as substituting sounds with those made in the front of the mouth like “t” and “d”, for example saying “tup” instead of “cup” [12]. Over the years, technology has played a great role in aiding speech production and speech therapy. Using computers to recognize and analyze speech dates back to at least the 1970s, but it is only since the 1990s that speech technologies and analysis have made significant advancements and people have shown greater interest in using computers to improve speech and language skills. From the days of Visi-Pitch, an MS-DOS tool that gave visual feedback on an utterance without much information on how to improve, we can now use automatic speech recognizers in real time with very high success rates and with information detailed down to the phoneme level [9]. More recently, work studying visualizations of speech has tried to answer the question “if we could see our speech, what might it look like?” and to determine which sound features give us significant insight into the flow of speech and aid speech production [18] [10].

We propose a visualization that represents a number of sound features and distinguishes between coherent parts of the changing sounds in a word based on states predicted by a statistical hidden Markov model trained over a large corpus of speech samples. These visualizations provide a way for users to “see” words through visual feedback and to see differences between smaller units of sound by comparing different audio recordings. This research proposes an approach to aid speech therapy and communication by providing a reflective, informative and self-sufficient tool that shows temporal patterns and the progression of sound features over time.


Chapter 2

Related Works

2.1 Technology aided speech therapy and communication

With advancements in technology, the number of computer-aided tools and software packages for improving speech and language skills has increased dramatically. Many of these technologies employ speech recognition algorithms to give auditory and visual feedback, and researchers have examined the role of technology in early diagnosis. These tools have helped children make better associations between phoneme-grapheme pairs and between their own articulations and speech. Moreover, with the aid of a computer, children can practice without the need for sustained attention from a teacher [17]. Speech therapy with such interactive tools, designed to facilitate the acquisition of language skills in the areas of basic phonatory skills, phonetic articulation and language, has been effective in eliciting speech from speakers [21].

Speech technologies often provide different kinds of auditory and visual feedback. One of the earliest tools of its kind, Visi-Pitch, available in the MS-DOS days, would show users contours of their spoken speech and let them visually compare these with those of a native speaker; no feedback was given on how to make the comparison [2]. Surprisingly, many of the visualizations used today still rely on similar standard approaches, such as waveforms and spectrograms, to visualize speech, and it can be hard to understand the differences between speech samples using these traditional representations. Since then, speech technologies have become much more advanced and can automatically recognize speech with high precision and detect pitch and phoneme-level sounds. Tools like SpeechViewer [1] and Visi-Pitch [2] have used these algorithms to provide both visual and auditory feedback, but they still use standard contour diagrams and spectrograms for their visualizations, and their sound comparison tools have a high learning curve. While we know that we can sense vocalization with our ears, phonesthesia, or artistic speech, is the idea that, to some extent, the sounds of words tend to reflect and be associated with domains such as shape and texture [18]. Another interesting visualization facilitates multi-syllabic speech production in children with autism and speech delays and is one of the few examples of work that uses visual feedback to provide real-time acoustic information about vocal productions for children with Autism Spectrum Disorder [10]. Our work complements the visualizations in that work by showing similar acoustic information along with coherent parts of speech that are closer to phonetic tones than to syllables. Phonemes, being the smallest identifiable units of sound, are often used to model automatic speech recognition, while syllables consist of a combination of phonemes or only vowel sounds [16]. Our design proposes an approach to increase phonological awareness and strengthen phonological processing abilities through auditory and visual feedback, to aid speech production for people with communication disorders.

2.2 Speech technologies

The availability of more powerful tools and technologies for speech analysis and recognition has led to greater interest in using computers to help develop speaking and listening skills. Such human-machine interaction for speech visualization requires speech analysis, speech recognition, and speech verification algorithms that are robust to the sources of speech variability characteristic of this population of speakers.

One of the major advancements in speech recognition has been the use of hidden Markov models. A hidden Markov model, or HMM, is a statistical model of a sequence of feature vector observations. In building a recognizer with HMMs, we need to decide which sequences will correspond to which models. The model is trained on a set of speech samples using the EM algorithm; given a new observation sample, the HMM then predicts the sequence of hidden states that best fits the model using the Viterbi algorithm. HMMs are useful because they predict each state based on the highest likelihood of the next occurring state and work well even when samples differ in length and volume [20].
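To make this concrete, the following minimal sketch, which is illustrative rather than part of the thesis implementation, fits a Gaussian HMM to a few toy observation sequences and recovers the most likely hidden-state path for a new sequence by Viterbi decoding. It assumes the hmmlearn Python package (the maintained successor of the HMM module once shipped with scikit-learn); the fake two-dimensional features stand in for real acoustic feature vectors.

import numpy as np
from hmmlearn import hmm  # assumed package; successor of the old sklearn HMM module

rng = np.random.default_rng(0)

# Toy "word": low-valued frames followed by high-valued frames,
# standing in for two coherent sounds.
def fake_word_sample(n_low=20, n_high=20):
    low = rng.normal(0.0, 0.5, size=(n_low, 2))
    high = rng.normal(3.0, 0.5, size=(n_high, 2))
    return np.vstack([low, high])

samples = [fake_word_sample() for _ in range(10)]
X = np.concatenate(samples)              # stacked feature frames
lengths = [len(s) for s in samples]      # frame count per sample

# Train with EM (Baum-Welch), then decode a new sample with Viterbi.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
model.fit(X, lengths)

states = model.predict(fake_word_sample())   # most likely hidden-state path
print(states)                                # e.g. [0 0 ... 0 1 1 ... 1]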

HMM tracking over audio features, mapping states from the training samples to new sample observations, has been used successfully in tools for frequency line tracking and for pitch detection in musical queries [14] [22] [20]. Similar models are used for speech recognition, where single states or groups of states in the model represent different consonants and vowels, and these sounds are recognized using different classifiers for each type. We use HMM tracking in our visualization with this model, which can easily be extended to incorporate automatic speech recognition algorithms.


Chapter 3

Visualization Design

3.1 Initial designs

Given the advancements in speech technologies, we explored different methods of visualizing speech. The first design consisted of the two standard visuals used for speech analysis: a speech waveform and a spectrogram.

Figure 3.1: Audio visualization feedback in the form of a waveform and a spectrogram for the word 'Mama'

Speech sounds are created by vibratory activity in the human vocal tract. Speech is normally transmitted to a listener's ears or to a microphone through the air, where speech and other sounds take the form of radiating waves of variation in the surrounding air pressure. A waveform tracks the excess pressure created by these waves as a function of time at a given point in space. It is easy to read peaks of frequency and pitch from a waveform; however, for more detailed analysis, experts and phoneticians are usually best at reading waveform visualizations [19].

Spectrograms are a little easier to read. Their most common format is a graph with two geometric dimensions: the horizontal axis represents time and the vertical axis represents frequency, while a third dimension, indicating the amplitude of a particular frequency at a particular time, is represented by the intensity or color of each point in the image [11]. In figure 3.1 the intensity can be seen for each /ma/ sound, with the first one being denser, with increased intensity of color.
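As a concrete illustration of these two standard views, the short sketch below plots a waveform and a spectrogram for a recording. It is only a sketch under stated assumptions: it uses SciPy and matplotlib rather than the thesis code, and "mama.wav" is a hypothetical recording of the word.

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

# "mama.wav" is a hypothetical recording; scipy/matplotlib are assumptions.
sr, y = wavfile.read("mama.wav")
y = y.astype(float)
if y.ndim > 1:                # fold stereo down to mono if needed
    y = y.mean(axis=1)
t = np.arange(len(y)) / sr

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(8, 6))

# Waveform: excess air pressure as a function of time.
ax_wave.plot(t, y)
ax_wave.set(title="Waveform", xlabel="Time (s)", ylabel="Amplitude")

# Spectrogram: time on x, frequency on y, amplitude as color intensity.
ax_spec.specgram(y, Fs=sr)
ax_spec.set(title="Spectrogram", xlabel="Time (s)", ylabel="Frequency (Hz)")

plt.tight_layout()
plt.show()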

The second initial design represented hidden Markov models (HMMs). HMMs have become predominant in speech recognition in recent years, given the ease of the available training algorithms and the flexibility they offer in controlling the speech model.

Figure 3.2: Audio visualization feedback showing acoustic features and hidden Markov model states for the word 'Thunder', where the model is assigned 6 states for the sounds /Th/, /U/, /n/, /d/, /er/ and silence

A hidden Markov model is a statistical model of a sequence of feature vector observations. In building a recognizer with HMMs, we need to decide which sequences will correspond to which models. For the word 'Thunder', based on the phonetic breakdown of the word commonly found in dictionaries, the model is assigned 6 states for the sounds /Th/, /U/, /n/, /d/, /er/ and silence, so that the model gives back states resembling these sounds. After this model is trained over 72 samples of the word, given a new sample recording it assigns the states that best match the model. The left plot in figure 3.2 shows each state in a different color and helps distinguish the different coherent sounds in a speech waveform. The waveform itself, however, is still complex to read.

The right plot in figure 3.2 shows a simplified representation of the HMM, indicating which state belongs to which time segment of the plot. Red means a segment is closer to the trained model and blue means it is further away, with the opacity representing how close or far it is. For example, a state is bright cool blue if it is not at all close to the sound represented in the model and bright red if it matches exactly. However, this simplified visualization does not reveal information about the acoustic features of the sound sample.


3.2 Visualization

Our final visualization integrates the acoustic features of timing, tone and emphasis along with the hidden Markov model states.

Figure 3.3: Audio visualization feedback showing acoustic features and hidden Markov model states for the word 'Thunder', where the model is assigned 6 states for the sounds /Th/, /U/, /n/, /d/, /er/ and silence

In this visualization, the timing (speed) is depicted along the x-axis, tone (pitch) is depicted

by the curves along the y-axis and the emphasis (volume) is depicted by the thickness of the area

between the curves. The hidden Markov model states, or parts of sound, are represented as the

segments of this shape along the x-axis (time). We now use a single color scheme where the higher

the opacity or brightness, the closer this speech sample is to the trained model.
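A minimal sketch of how such a shape could be drawn is given below. It assumes matplotlib and uses invented per-frame pitch and volume arrays, a per-frame state sequence and a per-state closeness score in [0, 1]; the variable names are illustrative and not taken from the thesis code.

import numpy as np
import matplotlib.pyplot as plt

# Invented inputs, one value per analysis frame (illustrative only):
n = 120
t = np.arange(n)
pitch = 120 + 30 * np.sin(t / 15.0)             # tone: center of the ribbon
volume = 20 + 10 * np.abs(np.sin(t / 25.0))     # emphasis: ribbon thickness
states = np.repeat(np.arange(6), n // 6)        # HMM state per frame
closeness = [0.9, 0.4, 0.7, 0.8, 0.3, 0.6]      # per-state score in [0, 1]

fig, ax = plt.subplots(figsize=(8, 3))
for s in np.unique(states):
    mask = states == s
    ax.fill_between(t[mask],
                    pitch[mask] - volume[mask] / 2,   # lower edge
                    pitch[mask] + volume[mask] / 2,   # upper edge
                    color="tab:red",
                    alpha=closeness[s])               # opacity = closeness to model
ax.set(xlabel="Time (frames)", ylabel="Pitch (Hz)")
plt.show()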

In figure 3.3, compared to the first visualization on the left, the second visualization on the right has similar loudness, or thickness of the area between the curves, but a flatter pronunciation of the word 'Thunder', as visible from the relatively flatter pitch curves. It also matches the model less well along the flatter segments or states, as shown by the darker color or lower opacity.


Chapter 4

Implementation

4.1 Data preprocessing and hidden Markov model

When data is recorded or read directly from audio files, the linearly spaced frequency bands used in the normal cepstrum are not close enough to the human auditory system to give meaningful features. Thus, we calculate the Mel-frequency cepstral coefficients (MFCCs) of the recording or speech sample, where the frequency bands are equally spaced on the Mel scale and approximate the human auditory system much more closely.
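A minimal sketch of this preprocessing step is shown below. It uses the librosa package as a commonly available substitute for the scikit-learn-based implementation described below; the library choice and the file name are assumptions made for illustration.

import librosa  # assumed substitute for the MFCC implementation used in the thesis

def mfcc_frames(path, n_mfcc=13):
    """Return an (n_frames, n_mfcc) array of MFCCs for one recording."""
    y, sr = librosa.load(path, sr=16000)                  # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                         # one feature vector per frame

# Hypothetical file name; each speech sample becomes one sequence of MFCC frames.
X = mfcc_frames("thunder_sample_01.wav")
print(X.shape)   # (number_of_frames, 13)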

For a list of 121 words, each having 50 to 110 samples from both men and women, a hidden Markov model was trained on the MFCCs for each word, with the number of states equal to the number of phonemes in the word plus one state for silence. Then, given a newly recorded audio sample or a previously stored audio file, the system could predict the closest state sequence for the observation sample using the Viterbi decoding algorithm. The scikit-learn machine learning package was used to access Python implementations of a Gaussian hidden Markov model and an MFCC calculator. We assign the number of states to each word based on its phonetic definition, so that each state in our HMM closely represents a different coherent sound. Our HMM-based model implements measures based on the log-likelihood and log-posterior scores detailed in [27]. First, the log-likelihood measure $l_i$ for a particular sound $i$ is computed as

$$ l_i = \frac{1}{d_i} \sum_{t=t_i}^{t_i + d_i - 1} \log p(y_t \mid q_i) \qquad (4.1) $$

where $q_i$ is the HMM state corresponding to sound $i$, $y_t$ is the speech frame at index $t$, and the sound spans $d_i$ frames of $y$, appearing in frames $t_i$ through $t_i + d_i - 1$.

Similarly, the log posterior probability score is computed as

$$ p_i = \frac{1}{d_i} \sum_{t=t_i}^{t_i + d_i - 1} \log P(q_i \mid y_t) \qquad (4.2) $$

where $P(q_i \mid y_t)$ is defined by

$$ P(q_i \mid y_t) = \frac{p(y_t \mid q_i)\, P(q_i)}{\sum_{j=1}^{M} p(y_t \mid q_j)\, P(q_j)} \qquad (4.3) $$

where $P(q_i)$ represents the prior probability of the sound $q_i$ and $M$ is the number of states in the model.

Based on this probability, the closeness to or distance from the average probability of a sound in our trained model is determined. The higher this probability, the more opaque is each state, represented by segments of the curves.
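Putting these pieces together, the sketch below trains a per-word Gaussian HMM on stacked MFCC sequences and computes an average log posterior per decoded state for a new sample, which could then be mapped to the opacity of each segment. It is an illustration under stated assumptions, not the thesis code: it uses hmmlearn and the hypothetical mfcc_frames helper from above, and the file names are invented.

import numpy as np
from hmmlearn import hmm  # assumed package; successor of the old sklearn HMM module

# One HMM per word, with one state per phoneme plus one for silence.
train_files = ["thunder_%02d.wav" % i for i in range(1, 73)]   # invented names
sequences = [mfcc_frames(f) for f in train_files]
X = np.concatenate(sequences)
lengths = [len(s) for s in sequences]

word_model = hmm.GaussianHMM(n_components=6, covariance_type="diag", n_iter=100)
word_model.fit(X, lengths)

# Score a new recording: Viterbi state sequence plus the average log
# posterior of each decoded state over its frames (cf. equation 4.2).
sample = mfcc_frames("thunder_new.wav")
states = word_model.predict(sample)              # Viterbi decoding
posteriors = word_model.predict_proba(sample)    # P(q_i | y_t) per frame

state_scores = {}
for s in np.unique(states):
    frames = states == s
    state_scores[int(s)] = np.mean(np.log(posteriors[frames, s] + 1e-12))
print(state_scores)   # higher (less negative) score -> more opaque segment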

4.2 Acoustic features

The pitch is calculated for small chunks of data read or recorded over time and mapped to curves.

Similarly, the volume calculated over time is represented by the thickness of the curves.

Pitch, being a perceptual property, allows the ordering of sounds on a frequency-related scale, which makes it a natural choice to include in the visualization. Pitch can be compared and distinguished as "higher" and "lower" in the sense associated with musical melodies; higher pitch equates to peaks in the visualization and lower pitch to dips. Pitch is also a major auditory attribute of musical tones, along with duration and loudness. Volume, or loudness, is the other major acoustic feature of sound and helps compare sounds as "loud" or "soft". Volume varies from speaker to speaker; a thin curve represents low volume and a thicker curve represents higher volume, on a relative scale from the starting point.
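A minimal sketch of how these two features could be extracted per frame is shown below; librosa and the particular pitch estimator (YIN) are assumptions made for illustration, not the thesis implementation.

import librosa  # assumed for illustration

y, sr = librosa.load("thunder_new.wav", sr=16000)    # hypothetical recording

# Pitch: fundamental frequency per frame, estimated with YIN.
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)

# Volume: root-mean-square energy per frame, a simple loudness proxy.
rms = librosa.feature.rms(y=y)[0]

print(f0.shape, rms.shape)   # per-frame pitch and loudness curves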


Chapter 5

Analysis

We have incorporated the major acoustic features of sound, pitch and volume, along with hidden Markov model states into our visualization. The acoustic features, pitch and volume, represent the change in frequency over time and the change in amplitude over time, respectively. The states of the HMM represent tones, or coherent parts of sound. Additionally, our model calculates the posterior probability for each state, which helps us compare audio samples state by state. The opacity of the color of each segment represents this probability, with a state probability close to the model being brighter and more opaque than one that is much lower than the average state probability for the model. Our visualization differs from previous work in this area in that it helps the user identify specific sounds of the speech sample (through the states) and their closeness to the trained model (through color opacity).

Our visualization proposes an approach to providing visual and auditory feedback that can aid speech production for people with communication disorders. This tool aims to facilitate speech acquisition by aiding speech production in a variety of ways. Looking at the pitch and volume variations helps users track their speech samples. This is very useful for people with communication disorders, as they can use it as a reflective and informative tool to understand their speech problems without the continuous need for a therapist or teacher. Moreover, this tool integrates acoustic features with a simple hidden Markov model to show users a simple abstract visualization that conveys a large amount of information more easily than many speech tools. The tool can also help non-native speakers understand nuances of speech and see how they compare to a native speaker's speech model.


Chapter 6

Conclusion and Future Work

This tool facilitates speech acquisition and increases phonological awareness. Users can record audio samples of words and compare the results without the continuous presence of a therapist or teacher. The next step would be to evaluate how well this tool can aid speech production for children and adults with communication disorders, which features users like most, and what more could be included in the visualization. It would also be interesting to see how this system fares compared to Vocsyl [10].

Another step would be to take this HMM tracking model to the next level and classify each

sound based on a model trained for each of the 42 distinct phonemes in the English language. This

would open the door for many new kinds of features, such as speech labeling and showing relevant images or text based on speech recognition.


References

[1] IBM. SpeechViewer III, 1997.

[2] KayPENTAX. Visi-Pitch IV, Model 2950B, 1996–2008.

[3] American Speech-Language-Hearing Association, 2008.

[4] Barry M. Prizant et al. Communication disorders and emotional/behavioral disorders in children and adolescents.

[5] Donna M. Boudreau and Natalie L. Hedberg. A comparison of early literacy skills in children with specific language impairment and their typically developing peers. American Journal of Speech-Language Pathology, 1999.

[6] Dennis P. Cantwell and Lorian Baker. The prevalence of anxiety in children with communication disorders. Journal of Anxiety Disorders, 1(3):239–248, 1987.

[7] Sharon Crosbie, Alison Holm, and Barbara Dodd. Intervention for children with severe speech disorder: a comparison of two approaches. International Journal of Language & Communication Disorders, 2005.

[8] A. M. Gallagher, V. Laxon, E. Armstrong, and U. Frith. Phonological difficulties in high-functioning dyslexics. 1996.

[9] Robert Godwin-Jones. Emerging technologies: Speech tools and technologies. Language Learning & Technology, 2009.

[10] Joshua Hailpern, Karrie Karahalios, Laura DeThorne, and Jim Halle. Vocsyl: visualizing syllable production for children with ASD and speech delays. In Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility. ACM, 2010.

[11] Kdenlive. http://kdenlive.org/users/granjow/introducing-scopes-audio-spectrum-and-spectrogram.

[12] American Speech-Language-Hearing Association. http://www.asha.org/public/speech/disorders/speechsounddisorders.htm, 2008.

[13] WebMD, 2010. http://www.webmd.com/parenting/news/20100628/speech-delay-in-kids-linked-to-later-emotional-problems.

[14] Zhaozhang Jin and DeLiang Wang. HMM-based multipitch tracking for noisy and reverberant speech. IEEE Transactions on Audio, Speech, and Language Processing, 2011.

[15] Alan G. Kamhi, Rene Friemoth Lee, and Lauren K. Nelson. Word, syllable, and sound awareness in language-disordered children. Journal of Speech and Hearing Disorders, 1985.

[16] Brett Kessler and Rebecca Treiman. Syllable structure and the distribution of phonemes in English syllables. Journal of Memory and Language, 37(3):295–311, 1997.

[17] Andras Kocsor and Denes Paczolay. Speech technologies in a computer-aided speech therapy system. In Computers Helping People with Special Needs. 2006.

[18] Golan Levin and Zachary Lieberman. In-situ speech visualization in real-time interactive installation and performance. In NPAR.

[19] M. O'Kane, J. Gillis, P. Rose, and M. Wagner. Deciphering speech waveforms. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '86, 1986.

[20] Nicola Orio and M. Sisti Sette. An HMM-based pitch tracker for audio queries. In ISMIR, 2003.

[21] Oscar Saz, Shou-Chun Yin, Eduardo Lleida, Richard Rose, Carlos Vaquero, and William R. Rodríguez. Tools and technologies for computer-aided speech and language therapy. Speech Communication, 2009.

[22] R. L. Streit and R. F. Barrett. Frequency line tracking using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 1990.
