
Speech Recognition: Statistical Methods

L R Rabiner, Rutgers University, New Brunswick, NJ, USA and University of California, Santa Barbara, CA, USA
B-H Juang, Georgia Institute of Technology, Atlanta, GA, USA

© 2006 Elsevier Ltd. All rights reserved.

Introduction

The goal of getting a machine to understand fluently spoken speech and respond in a natural voice has been driving speech research for more than 50 years. Although the personification of an intelligent machine such as HAL in the movie 2001, A Space Odyssey, or R2D2 in the Star Wars series, has been around for more than 35 years, we are still not at the point where machines reliably understand fluent speech, spoken by anyone, and in any acoustic environment. In spite of the remaining technical problems that need to be solved, the fields of automatic speech recognition and understanding have made tremendous advances and the technology is now readily available and used on a day-to-day basis in a number of applications and services – especially those conducted over the public-switched telephone network (PSTN) (Cox et al., 2000). This article aims at reviewing the technology that has made these applications possible.

Speech recognition and language understanding are two major research thrusts that have traditionally been approached as problems in linguistics and acoustic phonetics, where a range of acoustic phonetic knowledge has been brought to bear on the problem with remarkably little success. In this article, however, we focus on statistical methods for speech and language processing, where the knowledge about a speech signal and the language that it expresses, together with practical uses of the knowledge, is developed from actual realizations of speech data through a well-defined mathematical and statistical formalism. We review how the statistical methods are used for speech recognition and language understanding, show current performance on a number of task-specific applications and services, and discuss the challenges that remain to be solved before the technology becomes ubiquitous.

The Speech Advantage

There are fundamentally three major reasons why so much research and effort has gone into the problem of trying to teach machines to recognize and understand fluent speech, and these are the following:

• Cost reduction. Among the earliest goals for speech recognition systems was to replace humans performing certain simple tasks with automated machines, thereby reducing labor expenses while still providing customers with a natural and convenient way to access information and services. One simple example of a cost reduction system was the Voice Recognition Call Processing (VRCP) system introduced by AT&T in 1992 (Roe et al., 1996), which essentially automated so-called operator-assisted calls, such as person-to-person calls, reverse-billing calls, third-party billing calls, collect calls (by far the most common class of such calls), and operator-assisted calls. The resulting automation eliminated about 6600 jobs, while providing a quality of service that matched or exceeded that provided by the live attendants, saving AT&T on the order of $300 million per year.

• New revenue opportunities. Speech recognition and understanding systems enabled service providers to have a 24/7 high-quality customer care automation capability, without the need for access to information by keyboard or touch-tone button pushes. An example of such a service was the How May I Help You (HMIHY) service introduced by AT&T late in 2000 (Gorin et al., 1996), which automated the customer care for AT&T Consumer Services. This system will be discussed further in the section on speech understanding. A second example of such a service was the NTT ANSER service for voice banking in Japan (Sugamura et al., 1994), which enabled Japanese banking customers to access bank account records from an ordinary telephone without having to go to the bank. (Of course, today we utilize the Internet for such information, but in 1981, when this system was introduced, the only way to access such records was a physical trip to the bank and a wait in lines to speak to a banking clerk.)

• Customer retention. Speech recognition provides the potential for personalized services based on customer preferences, and thereby the potential to improve the customer experience. A trivial example of such a service is the voice-controlled automotive environment that recognizes the identity of the driver from voice commands and adjusts the automobile's features (seat position, radio station, mirror positions, etc.) to suit the customer's preference (which is established in an enrollment session).

The Speech Dialog Circle

When we consider the problem of communicating with a machine, we must consider the cycle of events that occurs between a spoken utterance (as part of a dialog between a person and a machine) and the response to that utterance from the machine. Figure 1 shows such a sequence of events, which is often referred to as the speech dialog circle, using an example in the telecommunications context.

The customer initially makes a request by speaking an utterance that is sent to a machine, which attempts to recognize, on a word-by-word basis, the spoken speech. The process of recognizing the words in the speech is called automatic speech recognition (ASR) and its output is an orthographic representation of the recognized spoken input. The ASR process will be discussed in the next section. Next the spoken words are analyzed by a spoken language understanding (SLU) module, which attempts to attribute meaning to the spoken words. The meaning that is attributed is in the context of the task being handled by the speech dialog system. (What is described here is traditionally referred to as a limited domain understanding system or application.) Once meaning has been determined, the dialog management (DM) module examines the state of the dialog according to a prescribed operational workflow and determines the course of action that would be most appropriate to take. The action may be as simple as a request for further information or confirmation of an action that is taken. Thus if there were confusion as to how best to proceed, a text query would be generated by the spoken language generation module to hopefully clarify the meaning and help determine what to do next. The query text is then sent to the final module, the text-to-speech synthesis (TTS) module, and then converted into intelligible and highly natural speech, which is sent to the customer who decides what to say next based on what action was taken, or based on previous dialogs with the machine. All of the modules in the speech dialog circle can be 'data-driven' in both the learning and active use phases, as indicated by the central Data block in Figure 1.

Figure 1 The conventional speech dialog circle.

A typical task scenario, e.g., booking an airline reservation, requires navigating the speech dialog circle many times – each time being referred to as one 'turn' – to complete a transaction. (The average number of turns a machine takes to complete a prescribed task is a measure of the effectiveness of the machine in many applications.) Hopefully, each time through the dialog circle enables the customer to get closer to the desired action either via proper understanding of the spoken request or via a series of clarification steps. The speech dialog circle is a powerful concept in modern speech recognition and understanding systems, and is at the heart of most speech understanding systems that are in use today.

Basic ASR Formulation

The goal of an ASR system is to accurately and efficiently convert a speech signal into a text message transcription of the spoken words, independent of the device used to record the speech (i.e., the transducer or microphone), the speaker, or the environment. A simple model of the speech generation process, as used to convey a speaker's intention, is shown in Figure 2.

It is assumed that the speaker decides what to say and then embeds the concept in a sentence, W, which is a sequence of words (possibly with pauses and other acoustic events such as uh's, um's, er's, etc.). The speech production mechanisms then produce a speech waveform, s(n), which embodies the words of W as well as the extraneous sounds and pauses in the spoken input. A conventional automatic speech recognizer attempts to decode the speech, s(n), into the best estimate of the sentence, Ŵ, using a two-step process, as shown in Figure 3.

The first step in the process is to convert the speech signal, s(n), into a sequence of spectral feature vectors, X, where the feature vectors are measured every 10 ms (or so) throughout the duration of the speech signal. The second step in the process is to use a syntactic decoder to generate every possible valid sentence (as a sequence of orthographic representations) in the task language, and to evaluate the score (i.e., the a posteriori probability of the word string given the realized acoustic signal as measured by the feature vector) for each such string, choosing as the recognized string, Ŵ, the one with the highest score. This is the so-called maximum a posteriori probability (MAP) decision principle, originally suggested by Bayes. Additional linguistic processing can be done to try to determine side information about the speaker, such as the speaker's intention, as indicated in Figure 3.

Figure 2 Model of spoken speech.

Figure 3 ASR decoder from speech to sentence.

Mathematically, we seek to find the string W that maximizes the a posteriori probability of that string, when given the measured feature vector X, i.e.,

\hat{W} = \arg\max_W P(W \mid X)

Using Bayes' Law, we can rewrite this expression as:

\hat{W} = \arg\max_W \frac{P(X \mid W)\, P(W)}{P(X)}

Thus, calculation of the a posteriori probability is decomposed into two main components, one that defines the a priori probability of a word sequence W, P(W), and the other the likelihood of the word string W in producing the measured feature vector, P(X|W). (We disregard the denominator term, P(X), since it is independent of the unknown W.) The latter is referred to as the acoustic model, P_A(X|W), and the former the language model, P_L(W) (Rabiner et al., 1996; Gauvain and Lamel, 2003). We note that these quantities are not given directly, but instead are usually estimated or inferred from a set of training data that have been labeled by a knowledge source, i.e., a human expert. The decoding equation is then rewritten as:

\hat{W} = \arg\max_W P_A(X \mid W)\, P_L(W)

We explicitly write the sequence of feature vectors (the acoustic observations) as:

X = x_1, x_2, \ldots, x_N

where the speech signal duration is N frames (or N times 10 ms when the frame shift is 10 ms). Similarly, we explicitly write the optimally decoded word sequence as:

\hat{W} = w_1 w_2 \ldots w_M

where there are M words in the decoded string. The above decoding equation defines the fundamental statistical approach to the problem of automatic speech recognition.
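To make the decomposition concrete, here is a minimal illustrative sketch (not from the original article) of the arg max operation: each candidate word string is scored by combining its acoustic log-likelihood and its language model log-probability, and the highest-scoring string is returned. The hypothesis list and the numerical scores are invented for the example, and a language model weight is included because practical decoders commonly scale the two terms.

```python
# Hypothetical per-hypothesis scores; in a real recognizer these come from
# the acoustic model P_A(X|W) and the language model P_L(W).
hypotheses = {
    "call home": {"log_acoustic": -120.4, "log_lm": -4.2},
    "call hope": {"log_acoustic": -119.8, "log_lm": -7.9},
    "cal home":  {"log_acoustic": -121.0, "log_lm": -9.5},
}

def map_decode(hyps, lm_weight=1.0):
    """Pick arg max_W [log P_A(X|W) + lm_weight * log P_L(W)].
    P(X) is ignored because it does not depend on W."""
    return max(hyps, key=lambda w: hyps[w]["log_acoustic"] + lm_weight * hyps[w]["log_lm"])

print(map_decode(hypotheses))   # -> "call home"
```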

It can be seen that there are three steps to the basic ASR formulation, namely:

• Step 1: acoustic modeling for assigning probabilities to acoustic (spectral) realizations of a sequence of words. For this step we use a statistical model (called the hidden Markov model or HMM) of the acoustic signals of either individual words or subword units (e.g., phonemes) to compute the quantity P_A(X|W). We train the acoustic models from a training set of speech utterances, which have been appropriately labeled to establish the statistical relationship between X and W.

• Step 2: language modeling for assigning probabilities, P_L(W), to sequences of words that form valid sentences in the language and are consistent with the recognition task being performed. We train such language models from generic text sequences, or from transcriptions of task-specific dialogues. (Note that a deterministic grammar, as is used in many simple tasks, can be considered a degenerate form of a statistical language model. The 'coverage' of a deterministic grammar is the set of permissible word sequences, i.e., expressions that are deemed legitimate.)

• Step 3: hypothesis search whereby we find the word sequence with the maximum a posteriori probability by searching through all possible word sequences in the language.

In step 1, acoustic modeling (Young, 1996; Rabiner et al., 1986), we train a set of acoustic models for the words or sounds of the language by learning the statistics of the acoustic features, X, for each word or sound, from a speech training set, where we compute the variability of the acoustic features during the production of the words or sounds, as represented by the models. For large vocabulary tasks, it is impractical to create a separate acoustic model for every possible word in the language since it requires far too much training data to measure the variability in every possible context. Instead, we train a set of about 50 acoustic-phonetic subword models for the approximately 50 phonemes in the English language, and construct a model for a word by concatenating (stringing together sequentially) the models for the constituent subword sounds in the word, as defined in a word lexicon or dictionary (where multiple pronunciations are allowed). Similarly, we build sentences (sequences of words) by concatenating word models. Since the actual pronunciation of a phoneme may be influenced by neighboring phonemes (those occurring before and after the phoneme), a set of so-called context-dependent phoneme models is often used as the speech models, as long as sufficient data are collected for proper training of these models.

In step 2, the language model (Jelinek, 1997; Rosenfeld, 2000) describes the probability of a sequence of words that form a valid sentence in the task language. A simple statistical method works well, based on a Markovian assumption, namely that the probability of a word in a sentence is conditioned on only the previous N–1 words, namely an N-gram language model, of the form:

P_L(W) = P_L(w_1, w_2, \ldots, w_M) = \prod_{m=1}^{M} P_L(w_m \mid w_{m-1}, w_{m-2}, \ldots, w_{m-N+1})

where P_L(w_m \mid w_{m-1}, w_{m-2}, \ldots, w_{m-N+1}) is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text.

In step 3, the search problem (Ney, 1984; Paul, 2001) is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax, and the task constraints, to find the word sequence with the maximum likelihood. The size of the search space can be astronomically large and take inordinate amounts of computing power to solve by heuristic methods. The use of methods from the field of Finite State Automata Theory provides finite state networks (FSNs) (Mohri, 1997), along with the associated search policy based on dynamic programming, that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems.
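The following toy sketch illustrates the dynamic programming idea behind the search (a bare Viterbi pass over a small hand-built HMM, not the compiled FSN decoder described above); all transition and observation probabilities are invented for the example.

```python
import math

# Toy 3-state model with invented log-probabilities.
log_init  = [math.log(p) for p in [0.8, 0.1, 0.1]]
log_trans = [[math.log(p) for p in row] for row in
             [[0.6, 0.3, 0.1],
              [0.1, 0.6, 0.3],
              [0.1, 0.1, 0.8]]]
# log P(observation | state) for a sequence of 4 frames
log_obs = [[math.log(p) for p in row] for row in
           [[0.7, 0.2, 0.1],
            [0.4, 0.5, 0.1],
            [0.1, 0.6, 0.3],
            [0.1, 0.2, 0.7]]]

def viterbi(log_init, log_trans, log_obs):
    """Dynamic-programming search for the single best state path."""
    T, N = len(log_obs), len(log_init)
    delta = [log_init[j] + log_obs[0][j] for j in range(N)]   # initialization
    back = []
    for t in range(1, T):
        new_delta, ptr = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] + log_trans[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log_trans[best_i][j] + log_obs[t][j])
        delta, back = new_delta, back + [ptr]
    path = [max(range(N), key=lambda j: delta[j])]             # termination
    for ptr in reversed(back):                                 # backtracking
        path.insert(0, ptr[path[0]])
    return path, max(delta)

print(viterbi(log_init, log_trans, log_obs))
```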

Development of a Speech Recognition System for a Task or an Application

Before going into more detail on the various aspects of the process of automatic speech recognition by machine, we review the three steps that must occur in order to define, train, and build an ASR system (Juang et al., 1995; Kam and Helander, 1997). These steps are the following:

• Step 1: choose the recognition task. Specify the word vocabulary for the task, the set of units that will be modeled by the acoustic models (e.g., whole words, phonemes, etc.), the word pronunciation lexicon (or dictionary) that describes the variations in word pronunciation, the task syntax (grammar), and the task semantics. By way of example, for a simple speech recognition system capable of recognizing a spoken credit card number using isolated digits (i.e., single digits spoken one at a time), the sounds to be recognized are either whole words or the set of subword units that appear in the digits /zero/ to /nine/ plus the word /oh/. The word vocabulary is the set of 11 digits. The task syntax allows any single digit to be spoken, and the task semantics specify that a sequence of isolated digits must form a valid credit card code for identifying the user.

• Step 2: train the models. Create a method for building acoustic word models (or subword models) from a labeled speech training data set of multiple occurrences of each of the vocabulary words by one or more speakers. We also must use a text training data set to create a word lexicon (dictionary) describing the ways that each word can be pronounced (assuming we are using subword units to characterize individual words), a word grammar (or language model) that describes how words are concatenated to form valid sentences (i.e., credit card numbers), and finally a task grammar that describes which valid word strings are meaningful in the task application (e.g., valid credit card numbers).

• Step 3: evaluate recognizer performance. We need to determine the word error rate and the task error rate for the recognizer on the desired task. For an isolated digit recognition task, the word error rate is just the isolated digit error rate, whereas the task error rate would be the number of credit card errors that lead to misidentification of the user. Evaluation of the recognizer performance often includes an analysis of the types of recognition errors made by the system. This analysis can lead to revision of the task in a number of ways, ranging from changing the vocabulary words or the grammar (i.e., to eliminate highly confusable words) to the use of word spotting, as opposed to word transcription. As an example, in limited vocabulary applications, if the recognizer encounters frequent confusions between words like 'freight' and 'flight,' it may be advisable to change 'freight' to 'cargo' to maximize its distinction from 'flight.' Revision of the task grammar often becomes necessary if the recognizer experiences substantial amounts of what is called 'out of grammar' (OOG) utterances, namely the use of words and phrases that are not directly included in the task vocabulary (ISCA, 2001).

The Speech Recognition Process

In this section, we provide some technical aspects of a typical speech recognition system. Figure 4 shows a block diagram of a speech recognizer that follows the Bayesian framework discussed above.

Figure 4 Framework of ASR system.

The recognizer consists of three processing steps, namely feature analysis, pattern matching, and confidence scoring, along with three trained databases, the set of acoustic models, the word lexicon, and the language model. In this section, we briefly describe each of the processing steps and each of the trained model databases.

Feature Analysis

The goal of feature analysis is to extract a set of salient features that characterize the spectral properties of the various speech sounds (the subword units) and that can be efficiently measured. The 'standard' feature set for speech recognition is a set of mel-frequency cepstral coefficients (MFCCs) (which perceptually match some of the characteristics of the spectral analysis done in the human auditory system) (Davis and Mermelstein, 1980), along with the first- and second-order derivatives of these features. Typically about 13 MFCCs and their first and second derivatives (Furui, 1981) are calculated every 10 ms, leading to a spectral vector with 39 coefficients every 10 ms. A block diagram of a typical feature analysis process is shown in Figure 5.

The speech signal is sampled and quantized, pre-emphasized by a first-order (highpass) digital filter with pre-emphasis factor a (to reduce the influence of glottal coupling and lip radiation on the estimated vocal tract characteristics), segmented into frames, windowed, and then a spectral analysis is performed using a fast Fourier transform (FFT) (Rabiner and Gold, 1975) or linear predictive coding (LPC) method (Atal and Hanauer, 1971; Markel and Gray, 1976). The frequency conversion from a linear frequency scale to a mel frequency scale is performed in the filtering block, followed by cepstral analysis yielding the MFCCs (Davis and Mermelstein, 1980), equalization to remove any bias and to normalize the cepstral coefficients (Rahim and Juang, 1996), and finally the computation of first- and second-order (via temporal derivative) MFCCs is made, completing the feature extraction process.

Figure 5 Block diagram of feature analysis computation.
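A highly simplified sketch of this pipeline is shown below, assuming only NumPy; it omits many practical details (liftering, energy terms, and cepstral mean normalization, and the exact filterbank and delta formulas vary between systems) and uses random samples in place of real speech.

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_ms=25, shift_ms=10, n_filt=24, n_ceps=13, alpha=0.95):
    """Simplified MFCCs: pre-emphasis, framing, windowing, FFT, mel filterbank,
    log, and DCT; deltas are first differences."""
    sig = np.append(signal[0], signal[1:] - alpha * signal[:-1])   # pre-emphasis
    frame_len, shift = int(fs * frame_ms / 1000), int(fs * shift_ms / 1000)
    n_frames, nfft = max(1, 1 + (len(sig) - frame_len) // shift), 512
    # Triangular mel filterbank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    window, feats = np.hamming(frame_len), []
    for i in range(n_frames):
        frame = sig[i * shift: i * shift + frame_len]
        frame = np.pad(frame, (0, frame_len - len(frame))) * window
        power = np.abs(np.fft.rfft(frame, nfft)) ** 2 / nfft
        log_energy = np.log(fbank @ power + 1e-10)
        # DCT-II decorrelates the log filterbank energies (cepstral coefficients)
        n = np.arange(n_filt)
        dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
        feats.append(dct @ log_energy)
    feats = np.array(feats)
    delta = np.diff(feats, axis=0, prepend=feats[:1])      # first-order derivatives
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])     # second-order derivatives
    return np.hstack([feats, delta, delta2])               # 39 coefficients per frame

x = np.random.randn(8000)            # one second of fake speech at 8 kHz
print(mfcc(x).shape)                 # (n_frames, 39)
```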

Acoustic Models

The goal of acoustic modeling is to characterize the statistical variability of the feature set determined above for each of the basic sounds (or words) of the language. Acoustic modeling uses probability measures to characterize sound realization using statistical models. A statistical method, known as the hidden Markov model (HMM) (Levinson et al., 1983; Ferguson, 1980; Rabiner, 1989; Rabiner and Juang, 1985), is used to model the spectral variability of each of the basic sounds of the language using a mixture density Gaussian distribution (Juang et al., 1986; Juang, 1985), which is optimally aligned with a speech training set and iteratively updated and improved (the means, variances, and mixture gains are iteratively updated) until an optimal alignment and match is achieved.

Figure 6 shows a simple three-state HMM for modeling the subword unit /s/ as spoken at the beginning of the word /six/. Each HMM state is characterized by a probability density function (usually a mixture Gaussian density) that characterizes the statistical behavior of the feature vectors at the beginning (state s1), middle (state s2), and end (state s3) of the sound /s/. In order to train the HMM for each subword unit, we use a labeled training set of words and sentences and utilize an efficient training procedure known as the Baum-Welch algorithm (Rabiner, 1989; Baum, 1972; Baum et al., 1970) to align each of the various subword units with the spoken inputs, and then estimate the appropriate means, covariances, and mixture gains for the distributions in each subword unit state. The algorithm is a hill-climbing algorithm and is iterated until a stable alignment of subword unit models and speech is obtained, enabling the creation of stable models for each subword unit.

Figure 6 Three-state HMM for the sound /s/.

Figure 7 shows how a simple two-sound word, 'is,' which consists of the sounds /ih/ and /z/, is created by concatenating the models (Lee, 1989) for the /ih/ sound with the model for the /z/ sound, thereby creating a six-state model for the word 'is.'

Figure 7 Concatenated model for the word 'is.'

Figure 8 shows how an HMM can be used to characterize a whole-word model (Lee et al., 1989). In this case, the word is modeled as a sequence of M = 5 HMM states, where each state is characterized by a mixture density, denoted as b_j(x_t), where the model state is the index j, the feature vector at time t is denoted as x_t, and the mixture density is of the form:

b_j(x_t) = \sum_{k=1}^{K} c_{jk}\, N[x_t; \mu_{jk}, U_{jk}]

x_t = (x_{t1}, x_{t2}, \ldots, x_{tD}), \quad D = 39

where:

K = number of mixture components in the density function
c_{jk} = weight of the kth mixture component in state j, with c_{jk} \ge 0
N = Gaussian density function
\mu_{jk} = mean vector for mixture k, state j
U_{jk} = covariance matrix for mixture k, state j

\sum_{k=1}^{K} c_{jk} = 1, \quad 1 \le j \le M

\int_{-\infty}^{\infty} b_j(x_t)\, dx_t = 1, \quad 1 \le j \le M

Included in Figure 8 is an explicit set of state transitions, a_{ij}, which specify the probability of making a transition from state i to state j at each frame, thereby defining the time sequence of the feature vectors over the duration of the word. Usually the self-transitions, a_{ii}, are large (close to 1.0), and the skip-state transitions, a_{13}, a_{24}, a_{35}, are small (close to 0).

Figure 8 HMM for whole word model with five states.
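As an illustration of the state observation density b_j(x_t), the following sketch evaluates a log-domain Gaussian mixture for a single state, assuming diagonal covariance matrices U_{jk} (a common simplification; the formulation above allows full covariances). All parameter values are randomly generated for the example.

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """Log of a diagonal-covariance Gaussian density N[x; mean, diag(var)]."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_b(x, weights, means, variances):
    """log b_j(x) = log sum_k c_jk N[x; mu_jk, U_jk] for one HMM state j."""
    log_terms = [np.log(c) + log_gaussian_diag(x, m, v)
                 for c, m, v in zip(weights, means, variances)]
    top = max(log_terms)                       # log-sum-exp for numerical stability
    return top + np.log(sum(np.exp(t - top) for t in log_terms))

D, K = 39, 4                                   # 39-dimensional features, 4 mixtures
rng = np.random.default_rng(0)
weights = np.full(K, 1.0 / K)                  # c_jk, summing to 1
means = rng.normal(size=(K, D))                # mu_jk
variances = np.ones((K, D))                    # diagonal of U_jk
x_t = rng.normal(size=D)
print(log_b(x_t, weights, means, variances))
```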

Once the set of state transitions and state probability densities are specified, we say that a model λ (which is also used to denote the set of parameters that define the probability measure) has been created for the word or subword unit. (The model λ is often written as λ(A, B, π) to explicitly denote the model parameters, namely A = {a_{ij}, 1 ≤ i, j ≤ M}, which is the state transition matrix, B = {b_j(x_t), 1 ≤ j ≤ M}, which is the state observation probability density, and π = {π_i, 1 ≤ i ≤ M}, which is the initial state distribution.) In order to optimally train the various models (for each word unit [Lee et al., 1989] or subword unit [Lee, 1989]), we need to have algorithms that perform the following three steps or tasks (Rabiner and Juang, 1985) using the acoustic observation sequence, X, and the model λ:

a. likelihood evaluation: compute P(X|λ)
b. decoding: choose the optimal state sequence for a given speech utterance
c. re-estimation: adjust the parameters of λ to maximize P(X|λ).

Each of these three steps is essential to defining the optimal HMM models for speech recognition based on the available training data, and each task, if approached in a brute force manner, would be computationally costly. Fortunately, efficient algorithms have been developed to enable efficient and accurate solutions to each of the three steps that must be performed to train and utilize HMM models in a speech recognition system. These are generally referred to as the forward-backward algorithm or the Baum-Welch re-estimation method (Levinson et al., 1983). Details of the Baum-Welch procedure are beyond the scope of this article. The heart of the training procedure for re-estimating model parameters using the Baum-Welch procedure is shown in Figure 9.

Figure 9 The Baum-Welch training procedure.
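For concreteness, the sketch below shows task (a), likelihood evaluation, using the forward recursion of the forward-backward algorithm; the model parameters and per-frame observation scores are invented toy values, and the Baum-Welch re-estimation step itself is not shown.

```python
import numpy as np

def forward_log_likelihood(log_pi, log_A, log_B):
    """Forward algorithm: log P(X | lambda) for an HMM with initial distribution pi,
    transition matrix A, and per-frame state observation log-likelihoods log_B[t][j]."""
    alpha = log_pi + log_B[0]                               # initialization
    for t in range(1, len(log_B)):
        # alpha_t(j) = log sum_i exp(alpha_{t-1}(i) + log a_ij) + log b_j(x_t)
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[t]
    return np.logaddexp.reduce(alpha)                       # termination

# Toy 3-state model with invented parameters and 5 frames of observation scores.
log_pi = np.log([0.9, 0.05, 0.05])
log_A = np.log([[0.7, 0.2, 0.1],
                [0.1, 0.7, 0.2],
                [0.1, 0.1, 0.8]])
log_B = np.log(np.array([[0.6, 0.3, 0.1],
                         [0.5, 0.4, 0.1],
                         [0.2, 0.6, 0.2],
                         [0.1, 0.5, 0.4],
                         [0.1, 0.2, 0.7]]))
print(forward_log_likelihood(log_pi, log_A, log_B))
```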

Recently, the fundamental statistical method, while successful for a range of conditions, has been augmented with a number of techniques that attempt to further enhance the recognition accuracy and make the recognizer more robust to different talkers, background noise conditions, and channel effects. One family of such techniques focuses on transformation of the observed or measured features. The transformation is motivated by the need for vocal tract length normalization (e.g., reducing the impact of differences in vocal tract length of various speakers). Another such transformation (called the maximum likelihood linear regression method) can be embedded in the statistical model to account for a potential mismatch between the statistical characteristics of the training data and the actual unknown utterances to be recognized. Yet another family of techniques (e.g., the discriminative training method based on minimum classification error [MCE] or maximum mutual information [MMI]) aims at direct minimization of the recognition error during the parameter optimization stage.

Word Lexicon

The purpose of the word lexicon, or dictionary, is to define the range of pronunciation of words in the task vocabulary (Jurafsky and Martin, 2000; Riley et al., 1999). The reason that such a word lexicon is necessary is because the same orthography can be pronounced differently by people with different accents, or because the word has multiple meanings that change the pronunciation by the context of its use. For example, the word 'data' can be pronounced as /d/ /ae/ /t/ /ax/ or as /d/ /ey/ /t/ /ax/, and we would need both pronunciations in the dictionary to properly train the recognizer models and to properly recognize the word when spoken by different individuals. Another example of variability in pronunciation from orthography is the word 'record,' which can be either a disk that goes on a player, or the process of capturing and storing a signal (e.g., audio or video). The different meanings have significantly different pronunciations. As in the statistical language model, the word lexicon (consisting of sequences of symbols) can be associated with probability assignments, resulting in a probabilistic word lexicon.

Language Model

The purpose of the language model (Rosenfeld, 2000; Jelinek et al., 1991), or grammar, is to provide a task syntax that defines acceptable spoken input sentences and enables the computation of the probability of the word string, W, given the language model, i.e., P_L(W). There are several methods of creating word grammars, including the use of rule-based systems (i.e., deterministic grammars that are knowledge-driven), and statistical methods that compute an estimate of word probabilities from large training sets of textual material. We describe the way in which a statistical N-gram word grammar is constructed from a large training set of text.

Assume we have a large text training set of labeled words. Thus for every sentence in the training set, we have a text file that identifies the words in that sentence. If we consider the class of N-gram word grammars, then we can estimate the word probabilities from the labeled text training set using counting methods. Thus to estimate word trigram probabilities (that is, the probability that a word w_i was preceded by the pair of words (w_{i-1}, w_{i-2})), we compute this quantity as:

P(w_i \mid w_{i-1}, w_{i-2}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}

where C(w_{i-2}, w_{i-1}, w_i) is the frequency count of the word triplet (i.e., trigram) consisting of (w_{i-2}, w_{i-1}, w_i) that occurred in the training set, and C(w_{i-2}, w_{i-1}) is the frequency count of the word duplet (i.e., bigram) (w_{i-2}, w_{i-1}) that occurred in the training set.

Although the method of training N-gram word grammars, as described above, generally works quite well, it suffers from the problem that the counts of N-grams are often highly in error due to problems of data sparseness in the training set. Hence for a text training set of millions of words, and a word vocabulary of several thousand words, more than 50% of word trigrams are likely to occur either once or not at all in the training set. This leads to gross distortions in the computation of the probability of a word string, as required by the basic Bayesian recognition algorithm. In the cases when a word trigram does not occur at all in the training set, it is unacceptable to define the trigram probability as 0 (as would be required by the direct definition above), since this leads to effectively invalidating all strings with that particular trigram from occurring in recognition. Instead, in the case of estimating trigram word probabilities (or similarly extended to N-grams where N is more than three), a smoothing algorithm (Bahl et al., 1983) is applied by interpolating trigram, bigram, and unigram relative frequencies, i.e.,

P(w_i \mid w_{i-1}, w_{i-2}) = p_3 \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})} + p_2 \frac{C(w_{i-1}, w_i)}{C(w_{i-1})} + p_1 \frac{C(w_i)}{\sum_i C(w_i)}

p_3 + p_2 + p_1 = 1

\sum_i C(w_i) = \text{size of the text training corpus}

where the smoothing probabilities, p_3, p_2, p_1, are obtained by applying the principle of cross-validation. Other schemes such as the Turing-Good estimator, which deals with unseen classes of observations in distribution estimation, have also been proposed (Nadas, 1985).
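A minimal sketch of the interpolated estimate above is given below; the tiny corpus is invented, and the interpolation weights p_3, p_2, p_1 are fixed by hand for illustration rather than estimated by cross-validation.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = sum(uni.values())

def p_interp(w, w1, w2, p3=0.6, p2=0.3, p1=0.1):
    """Interpolated trigram probability P(w | w1, w2), mixing trigram, bigram,
    and unigram relative frequencies (weights chosen by hand for illustration)."""
    p_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p_uni = uni[w] / total
    return p3 * p_tri + p2 * p_bi + p1 * p_uni

# P(sat | cat, the): the trigram "the cat sat" occurs once, "the cat" occurs twice.
print(p_interp("sat", w1="cat", w2="the"))
```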

Worth mentioning here are two important notions that are associated with language models: the perplexity of the language model and the rate of occurrence of out-of-vocabulary words in real data sets. We elaborate on these below.

Language Perplexity A measure of the complexity of the language model is the mathematical quantity known as language perplexity (which is actually the geometric mean of the word branching factor, or the average number of words that follow any given word of the language) (Roukos, 1998). We can compute language perplexity, as embodied in the language model, P_L(W), where W = (w_1, w_2, ..., w_Q) is a length-Q word sequence, by first defining the entropy (Cover and Thomas, 1991) as:

H(W) = -\frac{1}{Q} \log_2 P(W)

Using a trigram language model, we can write the entropy as:

H(W) = -\frac{1}{Q} \sum_{i=1}^{Q} \log_2 P(w_i \mid w_{i-1}, w_{i-2})

where we suitably define the first couple of probabilities as the unigram and bigram probabilities. Note that as Q approaches infinity, the above entropy approaches the asymptotic entropy of the source defined by the measure P_L(W). The perplexity of the language is then defined as:

PP(W) = 2^{H(W)} = P(w_1, w_2, \ldots, w_Q)^{-1/Q} \quad \text{as } Q \to \infty

Some examples of language perplexity for specific speech recognition tasks are the following:

• For an 11-digit vocabulary ('zero' to 'nine' plus 'oh') where every digit can occur independently of every other digit, the language perplexity (average word branching factor) is 11.

• For a 2000-word Airline Travel Information System (ATIS) (Ward, 1991), the language perplexity (using a trigram language model) is 20 (Price, 1990).

• For a 5000-word Wall Street Journal task (reading articles aloud), the language perplexity (using a bigram language model) is 130 (Paul et al., 1992).

A plot of the bigram perplexity for a training set of 500 million words, tested on the Encarta Encyclopedia, is shown in Figure 10. It can be seen that language perplexity grows only slowly with the vocabulary size and is only about 400 for a 60 000-word vocabulary. (Language perplexity is a complicated function of vocabulary size and vocabulary predictability, and is not in any way directly proportional to vocabulary size.)

Figure 10 Bigram language perplexity for Encarta Encyclopedia.
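Computing perplexity from a language model is straightforward once per-word probabilities are available, as in the following sketch; the per-word log probabilities are invented stand-ins for what a real language model (e.g., the interpolated trigram above) would assign to a test sentence.

```python
def perplexity(word_log2_probs):
    """PP = 2^H, where H is the average negative log2 probability per word."""
    Q = len(word_log2_probs)
    H = -sum(word_log2_probs) / Q
    return 2.0 ** H

# Hypothetical per-word log2 probabilities assigned by a language model to a test sentence.
log2_probs = [-6.1, -2.3, -4.8, -3.0, -5.5]
print(perplexity(log2_probs))     # geometric-mean word branching factor
```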

Out-of-Vocabulary Rate Another interesting aspect of language models is their coverage of the language as exemplified by the concept of an out-of-vocabulary (OOV) (Kawahara and Lee, 1998) rate, which measures how often a new word appears for a specific task, given that a language model of a given vocabulary size for the task has been created. Figure 11 shows the OOV rate for sentences from the Encarta Encyclopedia, again trained on 500 million words of text, as a function of the vocabulary size. It can be seen that even for a 60 000-word vocabulary, about 4% of the words that are encountered have not been seen previously and thus are considered OOV words (which, by definition, cannot be recognized correctly by the recognition system).

Figure 11 Out-of-vocabulary rate of Encarta Encyclopedia as a function of the vocabulary size.

Pattern Matching

The job of the pattern matching module is to combine information (probabilities) from the acoustic model, the language model, and the word lexicon to find the 'optimal' word sequence, i.e., the word sequence that is consistent with the language model and that has the highest probability among all possible word sequences in the language (i.e., best matches the spectral feature vectors of the input signal). To achieve this goal, the pattern matching system is actually a decoder (Ney, 1984; Paul, 2001; Mohri, 1997) that searches through all possible word strings and assigns a probability score to each string, using a Viterbi decoding algorithm (Forney, 1973) or its variants.

The challenge for the pattern matching module is to build an efficient structure (via an appropriate finite state network or FSN) (Mohri, 1997) for decoding and searching large-vocabulary complex-language models for a range of speech recognition tasks. The resulting composite FSNs represent the cross-product of the features (from the input signal), with the HMM states (for each sound), with the HMM units (for each sound), with the sounds (for each word), with the words (for each sentence), and with the sentences (those valid within the syntax and semantics of the task and language). For large-vocabulary high-perplexity speech recognition tasks, the size of the network can become astronomically large and has been shown to be on the order of 10^22 states for some tasks. Such networks are prohibitively large and cannot be exhaustively searched by any known method or machine. Fortunately there are methods (Mohri, 1997) for compiling such large networks and reducing the size significantly due to inherent redundancies and overlaps across each of the levels of the network. (One earlier example of taking advantage of the search redundancy is the dynamic programming method (Bellman, 1957), which turns an otherwise exhaustive search problem into an incremental one.) Hence the network that started with 10^22 states was able to be compiled down to a mathematically equivalent network of 10^8 states that was readily searched for the optimum word string with no loss of performance or word accuracy.

The way in which such a large network can be theoretically (and practically) compiled to a much smaller network is via the method of weighted finite state transducers (WFSTs), which combine the various representations of speech and language and optimize the resulting network to minimize the number of search states. A simple example of such a WFST is given in Figure 12, and an example of a simple word pronunciation transducer (for two versions of the word 'data') is given in Figure 13.

Figure 12 Use of WFSTs to compile an FSN to minimize redundancy in the network.

Using the techniques of composition and optimization, the WFST uses a unified mathematical framework to efficiently compile a large network into a minimal representation that is readily searched using standard Viterbi decoding methods. The example of Figure 13 shows how all redundancy is removed and a minimal search network is obtained, even for as simple an example as two pronunciations of the word 'data.'

Figure 13 Word pronunciation transducer for two pronunciations of the word 'data.'

Confidence Scoring

The goal of the confidence scoring module is to post-process the speech feature set in order to identify possible recognition errors as well as out-of-vocabulary events and thereby to potentially improve the performance of the recognition algorithm. To achieve this goal, a word confidence score (Rahim et al., 1997), based on a simple likelihood ratio hypothesis test associated with each recognized word, is computed, and the word confidence score is used to determine which, if any, words are likely to be incorrect because of either a recognition error or because the word was an OOV word (that could never be correctly recognized). A simple example of a two-word phrase and the resulting confidence scores is as follows:

Spoken Input: credit please
Recognized String: credit fees
Confidence Scores: (0.9) (0.3)

Based on the confidence scores (derived using a likelihood ratio test), the recognition system would realize which word or words are likely to be in error and take appropriate steps (in the ensuing dialog) to determine whether an error had been made and how to fix it so that the dialog moves forward to the task goal in an orderly and proper manner. (We will discuss how this happens in the discussion of dialog management later in this article.)
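The following toy sketch shows how such per-word confidence scores might be used downstream: words whose scores fall below a threshold are flagged for confirmation by the dialog manager. The threshold value and the decision rule are illustrative only; real systems derive both from the likelihood-ratio test statistics described above.

```python
def review_hypothesis(words, confidences, threshold=0.5):
    """Flag words whose confidence falls below a threshold so that the dialog
    manager can re-prompt or ask for confirmation."""
    flagged = [w for w, c in zip(words, confidences) if c < threshold]
    if flagged:
        return "confirm", flagged
    return "accept", []

print(review_hypothesis(["credit", "fees"], [0.9, 0.3]))   # ('confirm', ['fees'])
```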

Simple Example of ASR System: Isolated Digit Recognition

To illustrate some of the ideas presented above, consider a simple isolated word speech recognition system where the vocabulary is the set of 11 digits ('zero' to 'nine' plus the word 'oh' as an alternative for 'zero') and the basic recognition unit is a whole word model. For each of the 11 vocabulary words, we must collect a training set with sufficient, say K, occurrences of each spoken word so as to be able to train reliable and stable acoustic models (the HMMs) for each word. Typically a value of K = 5 is sufficient for a speaker-trained system (that is, a recognizer that works only for the speech of the speaker who trained the system). For a speaker-independent recognizer, a significantly larger value of K is required to completely characterize the variability in accents, speakers, transducers, environments, etc. For a speaker-independent system based on using only a single transducer (e.g., a telephone line input), and a carefully controlled acoustic environment (low noise), reasonable values of K are on the order of 100–500 for training reliable word models and obtaining good recognition performance.

For implementing an isolated-word recognition system, we do the following:

1. For each word, n, in the vocabulary, we build a word-based HMM, λ_n, i.e., we must (re-)estimate the model parameters λ_n that optimize the likelihood of the K training vectors for the n-th word. This is the training phase of the system.

2. For each unknown (newly spoken) test word that is to be recognized, we measure the feature vectors (the observation sequence), X = [x_1, x_2, ..., x_N] (where each observation vector, x_i, is the set of MFCCs and their first- and second-order derivatives), we calculate the model likelihoods, P(X|λ_n), 1 ≤ n ≤ V, for each individual word model (where V is 11 for the digits case), and then we select as the recognized word the word whose model likelihood score is highest, i.e.,

n^* = \arg\max_{1 \le n \le V} P(X \mid \lambda_n)

This is the testing phase of the system.

Figure 14 shows a block diagram of a simple HMM-based isolated word recognition system.

Figure 14 HMM-based isolated word recognizer.
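The sketch below mirrors the two phases above in code. To keep it self-contained, it substitutes a single Gaussian per word (trained on synthetic features) for a genuine word HMM trained with Baum-Welch, and a fake_features function stands in for recording and MFCC analysis of real utterances; only the overall train/score/arg max structure corresponds to the system described.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB = ["zero", "one", "two"]                      # shortened digit vocabulary
D, K = 39, 5                                        # feature dimension, training tokens per word

def fake_features(word, n_frames=20):
    """Stand-in for MFCC analysis of a recorded utterance of `word` (random data
    with a word-dependent offset so the toy example is separable)."""
    offset = VOCAB.index(word)
    return rng.normal(loc=offset, scale=1.0, size=(n_frames, D))

# Training phase: estimate one model per word. A single Gaussian per word is a
# deliberate simplification standing in for Baum-Welch training of a word HMM.
models = {}
for word in VOCAB:
    frames = np.vstack([fake_features(word) for _ in range(K)])
    models[word] = (frames.mean(axis=0), frames.var(axis=0) + 1e-3)

def log_likelihood(X, model):
    mean, var = model
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var)

# Testing phase: score the unknown utterance against every word model and
# pick the word whose model gives the highest likelihood.
X_test = fake_features("two")
recognized = max(VOCAB, key=lambda w: log_likelihood(X_test, models[w]))
print(recognized)                                    # -> "two"
```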

Performance of Speech Recognition Systems

A key issue in speech recognition (and understanding) system design is how to evaluate the system's performance. For simple recognition systems, such as the isolated word recognition system described in the previous section, the performance is simply the word error rate of the system. For more complex speech recognition tasks, such as for dictation applications, we must take into account the three types of errors that can occur in recognition, namely word insertions (recognizing more words than were actually spoken), word substitutions (recognizing an incorrect word in place of the correctly spoken word), and word deletions (recognizing fewer words than were actually spoken) (Pallet and Fiscus, 1997). Based on the criterion of equally weighting all three types of errors, the conventional definition of word error rate for most speech recognition tasks is:

\mathrm{WER} = \frac{N_I + N_S + N_D}{|W|}

where N_I is the number of word insertions, N_S is the number of word substitutions, N_D is the number of word deletions, and |W| is the number of words in the sentence W being scored. Based on the above definition of word error rate, the performance of a range of speech recognition and understanding systems is shown in Table 1.
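Before turning to those results, the sketch below shows how the word error rate is typically computed in practice: the hypothesis is aligned to the reference transcription with a minimum-edit-distance (dynamic programming) alignment, and the insertion, substitution, and deletion errors are counted with equal weight.

```python
def word_error_rate(reference, hypothesis):
    """WER = (insertions + substitutions + deletions) / number of reference words,
    computed from a minimum-edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)   # sub / del / ins
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("how may i help you", "how may i help you too"))  # 0.2 (one insertion)
```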

Table 1 Word error rates for a range of speech recognition systems

Corpus | Type of speech | Vocabulary size | Word error rate
Connected digit string (TI database) | Spontaneous | 11 (0–9, oh) | 0.3%
Connected digit string (AT&T mall recordings) | Spontaneous | 11 (0–9, oh) | 2.0%
Connected digit string (AT&T HMIHY) | Conversational | 11 (0–9, oh) | 5.0%
Resource management (RM) | Read speech | 1000 | 2.0%
Airline travel information system (ATIS) | Spontaneous | 2500 | 2.5%
North American business (NAB & WSJ) | Read text | 64 000 | 6.6%
Broadcast news | Narrated news | 210 000 | ~15%
Switchboard | Telephone conversation | 45 000 | ~27%
Call-home | Telephone conversation | 28 000 | ~35%

It can be seen that for a small vocabulary (11 digits), the word error rates are very low (0.3%) for a connected digit recognition task in a very clean environment (TI database) (Leonard, 1984), but we see that the digit word error rate rises significantly (to 5.0%) for connected digit strings recorded in the context of a conversation as part of a speech understanding system (HMIHY) (Gorin et al., 1996). We also see that word error rates are fairly low for 1000- to 2500-word vocabulary tasks (RM [Linguistic Data Consortium, 1992–2000] and ATIS [Ward, 1991]) but increase significantly as the vocabulary size rises (6.6% for a 64 000-word NAB vocabulary, and 13–17% for a 210 000-word broadcast news vocabulary), as well as for more colloquially spoken speech (Switchboard and Call-home [Godfrey et al., 1992]), where the word error rates are much higher than comparable tasks where the speech is more formally spoken.

Figure 15 illustrates the reduction in word error rate that has been achieved over time for several of the tasks from Table 1 (as well as other tasks not covered in Table 1). It can be seen that there is a steady and systematic decrease in word error rate (shown on a logarithmic scale) over time for every system that has been extensively studied. Hence it is generally believed that virtually any (task-oriented) speech recognition system can achieve arbitrarily low error (over time) if sufficient effort is put into finding appropriate techniques for reducing the word error rate.

Figure 15 Reductions in speech recognition word error rates over time for a range of task-oriented systems (Pallet et al., 1995).

If one compares the best ASR performance for machines on any given task with human performance (which often is hard to measure), the resulting comparison (as seen in Figure 16) shows that humans outperform machines by factors of between 10 and 50; that is, the machine achieves word error rates that are larger by factors of 10–50. Hence we still have a long way to go before machines outperform humans on speech recognition tasks. However, one should also note that under a certain condition an automatic speech recognition system could deliver a better service than a human. One such example is the recognition of a long connected digit string, such as a credit card's 16-digit number, that is uttered all at once; a human listener would not be able to memorize or jot down the spoken string without losing track of all the digits.

Figure 16 Comparison of human and machine speech recognition performance for a range of speech recognition tasks (Lippman, 1997).

Spoken Language Understanding

The goal of the spoken language understanding module of the speech dialog circle is to interpret the meaning of key words and phrases in the recognized speech string, and to map them to actions that the speech understanding system should take. For speech understanding, it is important to recognize that in domain-specific applications highly accurate understanding can be achieved without correctly recognizing every word in the sentence. Hence a speaker can have spoken the sentence 'I need some help with my computer hard drive' and, so long as the machine correctly recognized the words 'help' and 'hard drive,' it basically understands the context of the sentence (needing help) and the object of the context (hard drive). All of the other words in the sentence can often be misrecognized (although not so badly that other contextually significant words are recognized) without affecting the understanding of the meaning of the sentence. In this sense, keyword spotting (Wilpon et al., 1990) can be considered a primitive form of speech understanding, without involving sophisticated semantic analysis.

Spoken language understanding makes it possible to offer services where the customer can speak naturally without having to learn a specific vocabulary and task syntax in order to complete a transaction and interact with a machine (Juang and Furui, 2000). It performs this task by exploiting the task grammar and task semantics to restrict the range of meanings associated with the recognized word string, and by exploiting a predefined set of 'salient' words and phrases that map high-information word sequences to this restricted set of meanings. Spoken language understanding is especially useful when the range of meanings is naturally restricted and easily cataloged so that a Bayesian formulation can be used to optimally determine the meaning of the sentence from the word sequence. This Bayesian approach utilizes the recognized sequence of words, W, and the underlying meaning, C, to determine the probability of each possible meaning, given the word sequence, namely:

P(C \mid W) = \frac{P(W \mid C)\, P(C)}{P(W)}

and then finding the best conceptual structure (meaning) using a combination of acoustic, linguistic, and semantic scores, namely:

C^* = \arg\max_C P(W \mid C)\, P(C)

This approach makes extensive use of the statistical relationship between the word sequence and the intended meaning.
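As a toy illustration of this classification view of understanding (not the actual HMIHY models), the sketch below approximates P(W|C) with a naive-Bayes, add-one-smoothed word model estimated from a few invented training phrases per meaning, and picks the meaning C that maximizes P(W|C)P(C).

```python
import math
from collections import Counter

# Invented training phrases for three possible meanings (call types).
training = {
    "AccountBalance": ["what is my balance", "how much do i owe on my account"],
    "CallingPlans": ["tell me about calling plans", "i want a cheaper plan"],
    "UnrecognizedNumber": ["there is a number on my bill i do not recognize"],
}

priors = {c: 1.0 / len(training) for c in training}               # P(C), assumed uniform
counts = {c: Counter(" ".join(s).split()) for c, s in training.items()}
vocab = {w for cnt in counts.values() for w in cnt}

def log_p_w_given_c(words, c):
    """Naive-Bayes approximation of log P(W | C) with add-one smoothing."""
    total = sum(counts[c].values()) + len(vocab)
    return sum(math.log((counts[c][w] + 1) / total) for w in words if w in vocab)

def understand(recognized_string):
    """C* = arg max_C P(W | C) P(C), over the restricted set of meanings."""
    words = recognized_string.split()
    return max(training, key=lambda c: log_p_w_given_c(words, c) + math.log(priors[c]))

print(understand("can you tell me about your calling plans"))     # -> "CallingPlans"
```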

One of the most successful (commercial) speech understanding systems to date has been the AT&T How May I Help You (HMIHY) task for customer care. For this task, the customer dials into an AT&T 800 number for help on tasks related to his or her long distance or local billing account. The prompt to the customer is simply: 'AT&T. How May I Help You?' The customer responds to this prompt with totally unconstrained fluent speech describing the reason for calling the customer care help line. The system tries to recognize every spoken word (but invariably makes a very high percentage of word errors), and then utilizes the Bayesian concept framework to determine the meaning of the speech. Fortunately, the potential meaning of the spoken input is restricted to one of several possible outcomes, such as asking about Account Balances, or new Calling Plans, or changes in local service, or help for an Unrecognized Number, etc. Based on this highly limited set of outcomes, the spoken language component determines which meaning is most appropriate (or else decides not to make a decision but instead to defer the decision to the next cycle of the dialog circle), and appropriately routes the call. The dialog manager, spoken language generation, and text-to-speech modules complete the cycle based on the meaning determined by the spoken language understanding box. A simple characterization of the HMIHY system is shown in Figure 17.

Figure 17 Conceptual representation of HMIHY (How May I Help You?) system.

The major challenge in spoken language understanding is to go beyond the simple classification task of the HMIHY system (where the conceptual meaning is restricted to one of a fixed, often small, set of choices) and to create a true concept and meaning understanding system.

While solutions to this challenge remain at an embryonic stage, an early attempt, namely the Air Travel Information System (ATIS), was made at embedding speech recognition in a stylized semantic structure to mimic a natural language interaction between a human and a machine. In such a system, the semantic notions encapsulated in the system are rather limited, mostly in terms of originating and destination city names, fares, airport names, travel times, and so on, and can be directly instantiated in a semantic template without much text analysis for understanding. For example, a typical semantic template or network is shown in Figure 18, where the relevant notions, such as the departing city, can be easily identified and used in dialog management to create the desired user interaction with the system.
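
Such a semantic template can be thought of as a frame with a small, fixed set of slots that are filled directly from the recognized words. The sketch below fills two slots with a trivial pattern match; the slot names, city list, and patterns are illustrative assumptions, not the actual ATIS semantic grammar (the real system used a grammar/network of the kind shown in Figure 18).

```python
import re

# Hypothetical slot filling for an ATIS-like template; illustrative only.
CITY = r"(boston|denver|dallas|atlanta|san francisco)"

def fill_template(utterance: str) -> dict:
    """Fill origin/destination slots of a flight-query frame from the recognized words."""
    frame = {"origin": None, "destination": None}
    text = utterance.lower()
    m = re.search(rf"from {CITY}", text)
    if m:
        frame["origin"] = m.group(1)
    m = re.search(rf"to {CITY}", text)
    if m:
        frame["destination"] = m.group(1)
    return frame

print(fill_template("Show me flights from Boston to Denver on Monday morning"))
# {'origin': 'boston', 'destination': 'denver'}
```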

Dialog Management, Spoken Language Generation, and Text-to-Speech Synthesis

The goal of the dialog management module is to combine the meaning of the current input speech with the current state of the system (which is based on the interaction history with the user) in order to decide what the next step in the interaction should be. In this manner, the dialog management module makes viable fairly complex services that require multiple exchanges between the system and the customer. Such dialog systems can also handle user-initiated topic switching within the domain of the application.

The dialog management module is one of the most crucial steps in the speech dialog circle for a successful transaction, as it enables the customer to accomplish the desired task. The dialog management module works by exploiting models of dialog to determine the most appropriate spoken text string to guide the dialog forward toward a clear and well-understood goal or system interaction. The computational models for dialog management include both structure-based approaches (which model dialog as a predefined state transition network that is followed from an initial state to a set of final goal states) and plan-based approaches (which consider communication as the execution of a set of plans oriented toward goal achievement).
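
A structure-based dialog manager can be sketched as a small state transition network: each state names the prompt to play and the transitions triggered by what was (or was not) understood. The states, prompts, and events below are invented for illustration and are far simpler than any deployed system.

```python
# Minimal structure-based dialog manager: a predefined state transition network.
# States, prompts, and events are illustrative assumptions.
NETWORK = {
    "ask_origin":      {"prompt": "Where are you leaving from?",
                        "understood": "ask_destination", "no_input": "ask_origin"},
    "ask_destination": {"prompt": "Where are you flying to?",
                        "understood": "confirm", "no_input": "ask_destination"},
    "confirm":         {"prompt": "Shall I book that trip?",
                        "understood": "done", "no_input": "confirm"},
    "done":            {"prompt": "Your reservation is booked. Goodbye.",
                        "understood": "done", "no_input": "done"},
}

def next_state(state: str, event: str) -> str:
    """Follow the network from the current state, given an event
    ('understood', or 'no_input', which triggers a reprompt)."""
    return NETWORK[state][event]

state = "ask_origin"
for event in ["understood", "no_input", "understood", "understood"]:
    print(NETWORK[state]["prompt"], "->", event)
    state = next_state(state, event)
```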

The key tools of dialog strategy are the following:

. Confirmation: used to ascertain the correctness of the recognized and understood utterances

. Error recovery: used to get the dialog back on track after a user indicates that the system has misunderstood something

. Reprompting: used when the system expected input but did not receive any

. Completion: used to elicit missing input information from the user

. Constraining: used to reduce the scope of the request so that a reasonable amount of information is retrieved, presented to the user, or otherwise acted upon

. Relaxation: used to increase the scope of the request when no information has been retrieved

. Disambiguation: used to resolve inconsistent input from the user

. Greeting/Closing: used to maintain social protocol at the beginning and end of an interaction

. Mixed initiative: allows users to manage the dialog flow.

Figure 18 An example of a word grammar with embedded semantic notions in ATIS.

Although most of the tools of dialog strategy are straightforward and the conditions for their use are fairly clear, the mixed initiative tool is perhaps the most interesting one, as it enables a user to manage the dialog and get it back on track whenever the user feels the need to take over and lead the interactions with the machine. Figure 19 shows a simple chart that illustrates the two extremes of mixed initiative for a simple operator services scenario. At the one extreme, where the system manages the dialog totally, the system responses are simple declarative requests to elicit information, as exemplified by the system command 'Please say collect, calling card, third number.' At the other extreme is user management of the dialog, where the system responses are open ended and the customer can freely respond to the system prompt 'How may I help you?'

Figure 20 illustrates some simple examples of the use of system initiative, mixed initiative, and user initiative for an airline reservation task. It can be seen that system initiative leads to long dialogs (due to the limited information retrieved at each query), but the dialogs are relatively easy to design, whereas user initiative leads to shorter dialogs (and hence a better user experience), but the dialogs are more difficult to design. (Most practical natural language systems need to be mixed initiative so as to be able to shift initiative from one extreme to the other, depending on the state of the dialog and how successfully things have progressed toward the ultimate understanding goal.)

Figure 19 Illustration of mixed initiative for operator services scenario.

Figure 20 Examples of mixed initiative dialogs.

Dialog management systems are evaluated based on the speed and accuracy of attaining a well-defined task goal, such as booking an airline reservation, renting a car, purchasing a stock, or obtaining help with a service.

The spoken language generation module translates the action of the dialog manager into a textual representation, and the text-to-speech module converts that textual representation into natural-sounding speech to be played to the user, so as to initiate another round of dialog or to end the query (hopefully successfully).

User Interfaces and Multimodal Systems

The user interface for a speech communications system is defined by the performance of each of the blocks in the speech dialog circle. A good user interface is essential to the success of any task-oriented system, providing the following capabilities:

. It makes the application easy to use and robust to the kinds of confusion that arise in human–machine communications by voice.

. It keeps the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine.

. Although it cannot save a system with poor speech recognition or speech understanding performance, it can make or break a system with excellent speech recognition and speech understanding performance.

Although we have primarily been concerned with speech recognition and understanding interfaces to machines, there are times when a multimodal approach to human–machine communications is both necessary and essential. The potential modalities that can work in concert with speech include gesture and pointing devices (e.g., a mouse, keypad, or stylus). The selection of the most appropriate user interface mode (or combination of modes) depends on the device, the task, the environment, and the user's abilities and preferences. Hence, when trying to identify objects on a map (e.g., restaurants, locations of subway stations, historical sites), the use of a pointing device (to indicate the area of interest) along with speech (to indicate the topic of interest) often makes a good user interface, especially for small computing devices like tablet PCs or PDAs. Similarly, when entering PDA-like information (e.g., appointments, reminders, dates, times, etc.) onto a small handheld device, the use of a stylus to indicate the appropriate type of information, with voice filling in the data field, is often the most natural way of entering such information (especially as contrasted with stylus-based text input systems such as Graffiti for Palm-like devices). Microsoft Research has shown the efficacy of such a solution with the MiPad (Multimodal Interactive Pad) demonstration, and they claim to have achieved double the throughput for English using the multimodal interface over that achieved with just a pen stylus and the Graffiti language.

Summary

In this article we have outlined the major components of a modern speech recognition and spoken language understanding system, as used within a voice dialog system. We have shown the role of signal processing in creating a reliable feature set for the recognizer, and the role of statistical methods in enabling the recognizer to determine the words of the spoken input sentence as well as the meaning associated with the recognized word sequence. We have shown how a dialog manager utilizes the meaning accrued from the current as well as previous spoken inputs to create an appropriate response (as well as potentially taking appropriate actions) to the customer's request(s), and finally how the spoken language generation and text-to-speech synthesis parts of the dialog complete the dialog circle by providing feedback to the user about the actions taken and the further information that is required to complete the requested transaction.

Although we have come a long way toward the vision of HAL, the machine that both recognizes words reliably and understands their meaning almost flawlessly, we still have a long way to go before this vision is fully achieved. The major problem that must yet be tackled is robustness of the recognizer and the language understanding system to variability in speakers, accents, devices, and environments in which the speech is recorded. Systems that appear to work almost flawlessly under laboratory conditions often fail miserably in noisy train or airplane stations, when used with a cellphone or a speakerphone, when used in an automobile environment, or when used in noisy offices. There are many ideas that have been advanced for making speech recognition more robust, but to date none of these ideas has been able to fully combat the degradation in performance that occurs under these nonideal conditions.

Speech recognition and speech understanding systems have made their way into mainstream applications, and almost everybody has used a speech recognition device at one time or another. They are widely used in telephony applications (operator services, customer care), in help desks, in desktop dictation applications, and especially in office environments as an aid to digitizing reports, memos, briefs, and other office information. As speech recognition and speech understanding systems become more robust, they will find their way into cellphone and automotive applications, as well as into small devices, providing a natural and intuitive way to control the operation of these devices as well as to access and enter information.

See also: Speech Recognition, Audio-Visual; Speech Recognition, Automatic: History.

Bibliography

Atal B S & Hanauer S L (1971). 'Speech analysis and synthesis by linear prediction of the speech wave.' Journal of the Acoustical Society of America 50(2), 637–655.

Bahl L R, Jelinek F & Mercer R L (1983). 'A maximum likelihood approach to continuous speech recognition.' IEEE Transactions on Pattern Analysis & Machine Intelligence, PAMI-5(2), 179–190.

Baum L E (1972). 'An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes.' Inequalities 3, 1–8.

Baum L E, Petrie T, Soules G & Weiss N (1970). 'A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains.' Annals of Mathematical Statistics 41, 164–171.

Bellman R (1957). Dynamic programming. Princeton, NJ: Princeton University Press.

Cover T & Thomas J (1991). Elements of information theory. Wiley Series in Telecommunications. John Wiley and Sons.

Cox R V, Kamm C A, Rabiner L R, Schroeter J & Wilpon J G (2000). 'Speech and language processing for next-millennium communications services.' Proceedings of the IEEE 88(8), 1314–1337.

Davis S & Mermelstein P (1980). 'Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences.' IEEE Transactions on Acoustics, Speech and Signal Processing 28(4), 357–366.

Ferguson J D (1980). 'Hidden Markov analysis: an introduction.' In Hidden Markov models for speech. Princeton: Institute for Defense Analyses.

Forney D (1973). 'The Viterbi algorithm.' Proceedings of the IEEE 61, 268–278.

Furui S (1981). 'Cepstral analysis techniques for automatic speaker verification.' IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-29(2), 254–272.

Gauvain J-L & Lamel L (2003). 'Large vocabulary speech recognition based on statistical methods.' In Chou W & Juang B H (eds.) Pattern recognition in speech & language processing. New York: CRC Press. 149–189.

Godfrey J J, Holliman E C & McDaniel J (1992). 'SWITCHBOARD: telephone speech corpus for research and development.' In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing I. 517–520.

Gorin A L, Parker B A, Sachs R M & Wilpon J G (1996). 'How may I help you?' Proceedings of the Interactive Voice Technology for Telecommunications Applications (IVTTA). 57–60.

ISCA Archive (2001). Disfluency in spontaneous speech (DiSS'01), ISCA Tutorial and Research Workshop (ITRW), Edinburgh, Scotland, UK, August 29–31, 2001. http://www.isca-speech.org/archive/diss_01.

Jelinek F (1997). Statistical methods for speech recognition. Cambridge, MA: MIT Press.

Jelinek F, Mercer R L & Roukos S (1991). 'Principles of lexical language modeling for speech recognition.' In Furui & Sondhi (eds.) Advances in speech signal processing. New York: Marcel Dekker. 651–699.

Juang B H (1985). 'Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains.' AT&T Technical Journal 64(6), 1235–1249.

Juang B H & Furui S (2000). 'Automatic recognition and understanding of spoken language – a first step towards natural human-machine communication.' Proceedings of the IEEE.

Juang B H, Levinson S E & Sondhi M M (1986). 'Maximum likelihood estimation for multivariate mixture observations of Markov chains.' IEEE Transactions on Information Theory IT-32(2), 307–309.

Juang B H, Thomson D & Perdue R J (1995). 'Deployable automatic speech recognition systems – advances and challenges.' AT&T Technical Journal 74(2).

Jurafsky D S & Martin J H (2000). Speech and language processing. Englewood: Prentice Hall.

Kamm C & Helander M (1997). 'Design issues for interfaces using voice input.' In Helander M, Landauer T K & Prabhu P (eds.) Handbook of human-computer interaction. Amsterdam: Elsevier. 1043–1059.

Kawahara T & Lee C H (1998). 'Flexible speech understanding based on combined key-phrase detection and verification.' IEEE Transactions on Speech and Audio Processing, T-SA 6(6), 558–568.

Lee C H, Juang B H, Soong F K & Rabiner L R (1989). 'Word recognition using whole word and subword models.' Conference Record, 1989 IEEE International Conference on Acoustics, Speech, and Signal Processing, Paper S 12.2. 683–686.

Lee K-F (1989). The development of the Sphinx System. Kluwer.

Leonard R G (1984). 'A database for speaker-independent digit recognition.' Proceedings of ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing. 42.11.1–42.11.4.

Levinson S E, Rabiner L R & Sondhi M M (1983). 'An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition.' Bell System Technical Journal 62(4), 1035–1074.

Linguistic Data Consortium (1992–2000). LDC Catalog: Resource Management RM 2 2.0, http://wave.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S3C.

Lippmann R P (1997). 'Speech recognition by machines and humans.' Speech Communication 22(1), 1–15.

Markel J D & Gray A H Jr (1996). Linear prediction of speech. Springer-Verlag.

Rahim M & Juang B H (1996). 'Signal bias removal by maximum likelihood estimation for robust telephone speech recognition.' IEEE Transactions on Speech and Audio Processing 4(1), 19–30.

Mohri M (1997). 'Finite-state transducers in language and speech processing.' Computational Linguistics 23(2), 269–312.

Nadas A (1985). 'On Turing's formula for word probabilities.' IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-33(6), 1414–1416.

Ney H (1984). 'The use of a one stage dynamic programming algorithm for connected word recognition.' IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-32(2), 263–271.

Pallett D & Fiscus J (1997). '1996 preliminary broadcast news benchmark tests.' In DARPA 1997 speech recognition workshop.

Pallett D S et al. (1995). '1994 benchmark tests for the ARPA spoken language program.' Proceedings of the 1995 ARPA Human Language Technology Workshop, 5–36.

Paul D B (2001). 'An efficient A* stack decoder algorithm for continuous speech recognition with a stochastic language model.' Proceedings IEEE ICASSP-01, Salt Lake City, May 2001. 357–362.

Paul D B & Baker J M (1992). 'The design for the Wall Street Journal-based CSR corpus.' In Proceedings of the DARPA SLS Workshop.

Price P (1990). 'Evaluation of spoken language systems: the ATIS domain.' In Price P (ed.) Proceedings of the Third DARPA SLS Workshop. Morgan Kaufmann. 91–95.

Rabiner L R (1989). 'A tutorial on hidden Markov models and selected applications in speech recognition.' Proceedings of the IEEE 77(2), 257–286.

Rabiner L R & Gold B (1975). Theory and applications of digital signal processing. Englewood Cliffs, NJ: Prentice-Hall.

Rabiner L R & Juang B H (1985). 'An introduction to hidden Markov models.' IEEE Signal Processing Magazine 3(1), 4–16.

Rabiner L R, Wilpon J G & Juang B H (1986). 'A model-based connected-digit recognition system using either hidden Markov models or templates.' Computer Speech & Language 1(2), December, 167–197.

Rabiner L R, Juang B H & Lee C H (1996). 'An overview of automatic speech recognition.' In Lee et al. (eds.) Automatic speech & speaker recognition – advanced topics. Norwell: Kluwer Academic. 1–30.

Rahim M, Lee C-H & Juang B-H (1997). 'Discriminative utterance verification for connected digit recognition.' IEEE Transactions on Speech and Audio Processing 5(3), 266–277.

Riley M D et al. (1999). 'Stochastic pronunciation modeling from hand-labelled phonetic corpora.' Speech Communication 29(2–4), 209–224.

Roe D B, Wilpon J G, Mikkilineni P & Prezas D (1991). 'AT&T's speech recognition in the telephone network.' Speech Technology Magazine 5(3), February/March, 16–22.

Rosenfeld R (2000). 'Two decades of statistical language modeling: where do we go from here?' Proceedings of the IEEE, Special Issue on Spoken Language Processing 88(8), 1270–1278.

Roukos S (1998). 'Language representation.' In Varile G B & Zampolli A (eds.) Survey of the State of the Art in Human Language Technology. Cambridge University Press.

Sugamura N, Hirokawa T, Sagayama S & Furui S (1994). 'Speech processing technologies and telecommunications applications at NTT.' Proceedings of the IVTTA 94, 37–42.

Ward W (1991). 'Evaluation of the CMU ATIS System.' Proceedings of the DARPA Speech and Natural Language Workshop, February 19–22, 1991. 101–105.

Wilpon J G, Rabiner L R, Lee C-H & Goldman E (1990). 'Automatic recognition of keywords in unconstrained speech using hidden Markov models.' IEEE Transactions on Acoustics, Speech and Signal Processing 38(11), 1870–1878.

Young S J (1996). 'A review of large vocabulary continuous speech recognition.' IEEE Signal Processing Magazine 13(5), September, 45–57.