
Page 1:

Applications of Risk Minimization to Speech Recognition

• Joseph Picone

Inst. for Signal and Info. Processing

Dept. Electrical and Computer Eng.

Mississippi State University

• Contact Information:

Box 9571

Mississippi State University

Mississippi State, Mississippi 39762

Tel: 662-325-3149

Fax: 662-325-2298

Email: [email protected]

IBM – SIGNAL PROCESSING

• URL: www.isip.msstate.edu/publications/seminars/msstate_misc/2004/cse

• Acknowledgement: Supported by NSF under Grant No. EIA-9809300.

Page 2:

INTRODUCTION: ABSTRACT AND BIOGRAPHY

ABSTRACT: Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. In this presentation, we will review our attempts to apply notions of risk minimization to pattern recognition problems such as speech recognition. New approaches based on probabilistic Bayesian learning are shown to provide an order of magnitude reduction in complexity over comparable approaches based on HMMs and Support Vector Machines.

BIOGRAPHY: Joseph Picone is currently a Professor in the Department of Electrical and Computer Engineering at Mississippi State University, where he also directs the Institute for Signal and Information Processing. For the past 15 years he has been promoting open source speech technology. He has previously been employed by Texas Instruments and AT&T Bell Laboratories. Dr. Picone received his Ph.D. in Electrical Engineering from Illinois Institute of Technology in 1983. He is a Senior Member of the IEEE and a registered Professional Engineer.

Page 3:

HUMAN LANGUAGE TECHNOLOGY: SPEECH RECOGNITION RESEARCH?

• Why do we work on speech recognition?

“Language is the preeminent trait of the human species.”

“I never met someone who wasn’t interested in language.”

“I decided to work on language because it seemed to be the hardest problem to solve.”

• Why should we work on speech recognition?

• Antiterrorism, homeland security, military applications

• Telecommunications, mobile communications

• Education, learning tools, educational toys, enrichment

• Computing, intelligent systems, machine learning

• Commodity or liability?

• Fragile technology that is error prone

Page 4:

INTRODUCTION: GENERALIZATION AND RISK

• Optimal decision surface is a line

• Optimal decision surface changes abruptly

• Optimal decision surface still a line

• How much can we trust isolated data points?

• Can we integrate prior knowledge about data, confidence, or willingness to take risk?

Page 5:

HUMAN LANGUAGE TECHNOLOGY: FUNDAMENTAL CHALLENGES

Page 6:

INTRODUCTION: ACOUSTIC CONFUSABILITY

• Regions of overlap represent classification error

• Reduce overlap by introducing acoustic and linguistic context

• Comparison of “aa” in “lOck” and “iy” in “bEAt” for conversational speech
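A toy numeric illustration of overlap-as-error (assumed 1-D Gaussian stand-ins, not the real “aa”/“iy” measurements): with equal priors, the mass of each class on the wrong side of the decision threshold is exactly the classification error described above.

```python
import numpy as np

rng = np.random.default_rng(0)
aa = rng.normal(0.0, 1.0, 100_000)   # pretend formant feature for "aa"
iy = rng.normal(1.5, 1.0, 100_000)   # pretend formant feature for "iy"

threshold = 0.75                     # midpoint = Bayes rule for equal priors/variances
error = 0.5 * np.mean(aa > threshold) + 0.5 * np.mean(iy <= threshold)
print(f"overlap (Bayes) error ~= {error:.3f}")   # about 0.23 here
```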

Page 7:

INTRODUCTION: PROBABILISTIC FRAMEWORK
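The slide’s graphic is not recoverable; as a reference point (the standard formulation, not text recovered from the slide), the probabilistic framework chooses the word sequence $\hat{W}$ that maximizes the posterior given the acoustics $A$:

$$\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)} = \arg\max_{W} P(A \mid W)\, P(W)$$

Here $P(A \mid W)$ is the acoustic model (the HMMs of the next slide) and $P(W)$ is the language model (statistical N-grams).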

Page 8:

SPEECH RECOGNITION: BLOCK DIAGRAM OVERVIEW

Core components:

• transduction

• feature extraction

• acoustic modeling (hidden Markov models)

• language modeling (statistical N-grams)

• search (Viterbi beam)

• knowledge sources
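A toy data-flow sketch of how these components connect (hypothetical function names and single-Gaussian “models”; not the ISIP toolkit API):

```python
import numpy as np

def extract_features(signal, frame_len=400, shift=160):
    """Framing stand-in for feature extraction; a real front end computes
    mel-cepstral coefficients for each frame."""
    n = 1 + max(0, len(signal) - frame_len) // shift
    frames = np.stack([signal[i*shift : i*shift + frame_len] for i in range(n)])
    return frames.mean(axis=1, keepdims=True)     # one toy "feature" per frame

def log_gaussian(x, mean, var):
    """Diagonal-Gaussian log likelihood, the emission density of an HMM state."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def decode(feats, word_models, lm_logprobs):
    """Stand-in for Viterbi beam search: argmax_W log P(A|W) + log P(W)."""
    scores = {w: sum(log_gaussian(f, m["mean"], m["var"]) for f in feats)
                 + lm_logprobs[w]
              for w, m in word_models.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
signal = rng.normal(0.5, 1.0, 8000)               # fake half second of audio
models = {"yes": {"mean": 0.5, "var": 1.0},       # single-state "HMMs"
          "no":  {"mean": -0.5, "var": 1.0}}
print(decode(extract_features(signal), models, {"yes": np.log(0.5), "no": np.log(0.5)}))
```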

Page 9:

INTRODUCTION: ML CONVERGENCE NOT OPTIMAL

• Maximum likelihood convergence does not translate to optimal classification if a priori assumptions about the data are not correct.

• Finding the optimal decision boundary requires only one parameter.

Page 10:

INTRODUCTION: POOR GENERALIZATION WITH GMM MLE

• Data is often not separable by a hyperplane – nonlinear classifier is needed

• Gaussian MLE models tend toward the center of mass – overtraining leads to poor generalization

• Three problems: controlling generalization, direct discriminative training, and sparsity.

Page 11:

RISK MINIMIZATION: DISCRIMINATIVE TRAINING

• Several popular discriminative training approaches (e.g., maximum mutual information estimation)

• Essential Idea: Maximize

$$\frac{P(A \mid W_{\text{in}})}{P(A \mid W_{\text{out}})}$$

• Maximize the numerator (ML term), minimize the denominator (discriminative term)

• Previously developed for neural networks, hybrid systems, and eventually HMM-based speech recognition systems

Page 12:

RISK MINIMIZATION: STRUCTURAL OPTIMIZATION

• Structural optimization is often guided by an Occam’s Razor approach

• Trading goodness of fit against model complexity

– Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination

[Figure: error versus model complexity. Training set error decreases monotonically as complexity grows, while open-loop (test) error reaches a minimum at the optimum model complexity.]
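To make the trade-off concrete (a standard criterion named above, not a formula recovered from the slide), the Bayesian Information Criterion scores a model with maximized likelihood $\hat{L}$, $k$ free parameters, and $N$ training samples as:

$$\text{BIC} = -2\log\hat{L} + k\log N$$

The first term rewards goodness of fit; the second penalizes complexity, so the BIC-minimizing model sits near the “Optimum” of the curve described above.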

Page 13:

RISK MINIMIZATION: SVMS FOR NON-SEPARABLE DATA

• No hyperplane can achieve zero empirical risk (in a feature space of any dimension!)

• Recall the SRM Principle: balance empirical risk and model complexity

• Relax our optimization constraint to allow for errors on the training set:

$$y_i\left(\mathbf{w} \cdot \mathbf{x}_i + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$

• A new parameter, C, must be estimated to optimally control the trade-off between training set errors and model complexity (illustrated in the sketch below)
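A brief illustration of the role of C (using scikit-learn’s SVC as a stand-in, an assumption, since the talk’s experiments used the group’s own SVM tools). C weights training-set errors against margin width; the printout shows how training error and the support-vector count move as C varies:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian classes: non-separable by construction.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.2, (100, 2)), rng.normal(+1, 1.2, (100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(C=C, kernel="rbf").fit(X, y)
    print(f"C={C:<7} support vectors={clf.support_vectors_.shape[0]:3d} "
          f"train error={1 - clf.score(X, y):.3f}")
```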

Page 14:

RISK MINIMIZATION: DRAWBACKS OF SVMS

• Uses a binary (yes/no) decision rule

– Generates a distance from the hyperplane, but this distance is often not a good measure of our “confidence” in the classification

– Can produce a “probability” as a function of the distance (e.g., using sigmoid fits), but these are inadequate

• Number of support vectors grows linearly with the size of the data set

• Requires estimation of the trade-off parameter, C, via held-out sets

Page 15:

RELEVANCE VECTOR MACHINES: EVIDENCE MAXIMIZATION

• Build a fully specified probabilistic model – incorporate prior information/beliefs as well as a notion of confidence in predictions

• MacKay posed a special form of regularization in neural networks – sparsity

• Evidence maximization: evaluate candidate models based on their “evidence”, P(D|Hi)

• Structural optimization by maximizing the evidence across all candidate models

• Steeped in Gaussian approximations
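For reference (the standard MacKay formulation; the slide shows only the term P(D|Hi)), the evidence of model $\mathcal{H}_i$ is the marginal likelihood obtained by integrating out its parameters $w$:

$$P(D \mid \mathcal{H}_i) = \int P(D \mid w, \mathcal{H}_i)\, P(w \mid \mathcal{H}_i)\, dw$$

Models flexible enough to fit the data, but no more flexible than necessary, receive the highest evidence, which is what makes this a structural optimization criterion.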

Page 16:

RELEVANCE VECTOR MACHINES: AUTOMATIC RELEVANCE DETERMINATION

• A kernel-based learning machine:

$$y(x; w) = w_0 + \sum_{i=1}^{N} w_i K(x, x_i), \qquad P(t=1 \mid x; w) = \frac{1}{1 + e^{-y(x; w)}}$$

• Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay):

$$P(w \mid \alpha) = \prod_{i=0}^{N} \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right)$$

• A flat (non-informative) prior over the hyperparameters $\alpha$ completes the Bayesian specification (see the numeric sketch below)
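A toy numeric sketch of this functional form (assumed values, not data from the talk): the kernel expansion plus sigmoid yields P(t=1|x), and each weight’s own precision $\alpha_i$ controls how strongly the ARD prior shrinks that weight toward zero:

```python
import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def rvm_posterior(x, centers, w):
    """P(t=1 | x; w): kernel expansion y(x; w) squashed by a sigmoid."""
    y = w[0] + sum(wi * rbf_kernel(x, xi) for wi, xi in zip(w[1:], centers))
    return 1.0 / (1.0 + np.exp(-y))

# ARD in action: a large alpha_i concentrates N(w_i | 0, 1/alpha_i) at zero,
# effectively pruning that basis function as "irrelevant".
alpha = np.array([1e-2, 1e6])
print("prior std per weight:", 1.0 / np.sqrt(alpha))   # [10.0, 0.001]

centers = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
w = np.array([-0.5, 3.0, 0.0])                         # second weight pruned
print(rvm_posterior(np.array([0.1, -0.2]), centers, w))
```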

Page 17:

RELEVANCE VECTOR MACHINES: ITERATIVE REESTIMATION

• The goal in training becomes finding:

$$(\hat{w}, \hat{\alpha}) = \arg\max_{w, \alpha}\, p(w, \alpha \mid t, X), \quad \text{where} \quad p(w, \alpha \mid t, X) = \frac{p(t \mid w, \alpha, X)\, p(w, \alpha \mid X)}{p(t \mid X)}$$

• Estimation of the “sparsity” parameters is inherent in the optimization – no need for a held-out set!

• A closed-form solution to this maximization problem is not available; instead, we iteratively reestimate $\hat{w}$ and $\hat{\alpha}$

Page 18:

RELEVANCE VECTOR MACHINES: LAPLACE’S METHOD

• Fix $\alpha$ and estimate $w$ (e.g., gradient descent):

$$\hat{w} = \arg\max_{w}\, p(t \mid w)\, p(w \mid \alpha)$$

• Use the Hessian to approximate the covariance of a Gaussian posterior of the weights centered at $\hat{w}$:

$$\Sigma = \left(-\nabla_w \nabla_w \log p(t \mid w)\, p(w \mid \alpha)\right)^{-1}$$

• With $\hat{w}$ and $\Sigma$ as the mean and covariance, respectively, of the Gaussian approximation, we find $\alpha$ by finding:

$$\alpha_i^{\text{new}} = \frac{\gamma_i}{\hat{w}_i^2}, \quad \text{where} \quad \gamma_i = 1 - \alpha_i \Sigma_{ii}$$

• Method is $O(N^2)$ in memory and $O(N^3)$ in time (a compact sketch of the full loop follows)
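Below is a compact sketch of this training loop for the logistic RVM (my own numpy rendering of the standard Tipping-style updates, not the ISIP implementation). The inner Newton search finds $\hat{w}$, the matrix solve/inverse is the $O(N^3)$ step, and the $\gamma$-based update reestimates each $\alpha_i$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rvm_train(Phi, t, n_outer=50, n_newton=25, alpha_cap=1e9):
    """Phi: N x M kernel design matrix (bias column first); t: 0/1 labels."""
    N, M = Phi.shape
    alpha = np.ones(M)                     # one ARD precision per weight
    w = np.zeros(M)
    for _ in range(n_outer):
        # Step 1: fix alpha, find the posterior mode w_hat (Newton / IRLS).
        for _ in range(n_newton):
            y = sigmoid(Phi @ w)
            g = Phi.T @ (t - y) - alpha * w          # gradient of log posterior
            H = Phi.T @ (Phi * (y * (1 - y))[:, None]) + np.diag(alpha)
            w += np.linalg.solve(H, g)               # H is the negative Hessian
        # Step 2: Laplace approximation: Gaussian covariance at w_hat.
        Sigma = np.linalg.inv(H)
        # Step 3: alpha_i = gamma_i / w_i^2, with gamma_i = 1 - alpha_i * Sigma_ii.
        gamma = np.clip(1.0 - alpha * np.diag(Sigma), 0.0, 1.0)
        alpha = np.minimum(gamma / (w ** 2 + 1e-12), alpha_cap)
    return w, alpha

# Toy usage: two 1-D classes; most alphas grow huge, pruning their weights.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 40), rng.normal(2, 1, 40)])
t = np.array([0] * 40 + [1] * 40, dtype=float)
Phi = np.hstack([np.ones((80, 1)), np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)])
w, alpha = rvm_train(Phi, t)
print("relevance vectors kept:", int(np.sum(alpha < 1e3)))
```

Weights whose precision is driven to the cap are pruned; the few survivors are the relevance vectors, which is the sparsity the following slides quantify.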

Page 19:

RELEVANCE VECTOR MACHINES: COMPARISON TO SVMS

RVM:

• Data: Class labels (0,1)

• Goal: Learn posterior, P(t=1|x)

• Structural Optimization: Hyperprior distribution encourages sparsity

• Training: iterative, $O(N^3)$

SVM:

• Data: Class labels (-1,1)

• Goal: Find optimal decision surface under constraints:

$$y_i\left(\mathbf{w} \cdot \mathbf{x}_i + b\right) \ge 1 - \xi_i$$

• Structural Optimization: Trade-off parameter that must be estimated

• Training: quadratic programming, $O(N^2)$

Page 20:

EXPERIMENTAL RESULTS: DETERDING VOWEL DATA

• Deterding Vowel Data: 11 vowels spoken in “h*d” context; 10 log area parameters; 528 train, 462 SI test

Approach                    % Error   # Parameters
SVM: Polynomial Kernels     49%
K-Nearest Neighbor          44%
Gaussian Node Network       44%
SVM: RBF Kernels            35%       83 SVs
Separable Mixture Models    30%
RVM: RBF Kernels            30%       13 RVs

Page 21:

EXPERIMENTAL RESULTS: INTEGRATION WITH SPEECH RECOGNITION

• Data size:

– 30 million frames of data in training set

– Solution: Segmental phone models

• Source for Segmental Data:

– Solution: Use HMM system in bootstrap procedure

– Could also build a segment-based decoder

• Probabilistic decoder coupling:

– SVMs: Sigmoid-fit posterior

– RVMs: naturally probabilistic

[Figure: segmental feature extraction for the phone sequence “hh aw aa r y uw”. Each k-frame segment is divided into three regions of 0.3k, 0.4k, and 0.3k frames, and the mean of each region forms the segmental feature vector.]
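A minimal sketch of that segmental computation (assuming the 0.3/0.4/0.3 region split shown in the figure):

```python
import numpy as np

def segmental_features(frames: np.ndarray) -> np.ndarray:
    """frames: (k, d) per-frame features for one phone segment. Returns the
    concatenated means of three regions spanning 0.3k, 0.4k, 0.3k frames."""
    k = frames.shape[0]
    b1, b2 = int(round(0.3 * k)), int(round(0.7 * k))
    regions = (frames[:b1], frames[b1:b2], frames[b2:])
    return np.concatenate([r.mean(axis=0) for r in regions])

seg = np.random.default_rng(0).normal(size=(20, 13))  # 20 frames of 13-dim cepstra
print(segmental_features(seg).shape)                  # (39,) = 3 regions x 13 dims
```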

Page 22:

EXPERIMENTAL RESULTS: HYBRID DECODER

[Block diagram: Features (Mel-Cepstra) enter HMM RECOGNITION, which produces an N-best List and Segment Information; the SEGMENTAL CONVERTER turns these into Segmental Features; the HYBRID DECODER combines the N-best List with classifier scores on the Segmental Features to produce the final Hypothesis.]
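A minimal sketch of the rescoring step the diagram implies (hypothetical data structures; the actual decoder is more involved): combine each hypothesis’s HMM score with the SVM/RVM posteriors over its segments and return the best rescored entry.

```python
import math

def rescore_nbest(nbest, posteriors, weight=0.5):
    """nbest: list of (hypothesis, hmm_log_score, segment_ids).
    posteriors[seg_id]: classifier posterior for that segment's label."""
    best, best_score = None, -math.inf
    for hyp, hmm_score, segs in nbest:
        clf_score = sum(math.log(max(posteriors[s], 1e-10)) for s in segs)
        score = (1 - weight) * hmm_score + weight * clf_score
        if score > best_score:
            best, best_score = hyp, score
    return best

nbest = [("nine five", -120.0, ["s1", "s2"]), ("nine nine", -118.0, ["s1", "s3"])]
posteriors = {"s1": 0.9, "s2": 0.8, "s3": 0.2}
print(rescore_nbest(nbest, posteriors, weight=0.9))  # classifier flips to "nine five"
```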

Page 23:

EXPERIMENTAL RESULTS: SVM ALPHADIGIT RECOGNITION

• HMM system is cross-word state-tied triphones with 16 mixtures of Gaussian models

• SVM system has monophone models with segmental features

• System combination experiment yields another 1% reduction in error

Transcription   Segmentation   SVM     HMM
N-best          Hypothesis     11.0%   11.9%
N-best+Ref      Reference      3.3%    6.3%

Page 24:

EXPERIMENTAL RESULTS: SVM/RVM ALPHADIGIT COMPARISON

• RVMs yield a large reduction in the parameter count while attaining superior performance

• Computational cost for RVMs lies mainly in training, but training is still prohibitive for larger sets

Approach   Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM        16.4%        257                 0.5 hours       30 mins
RVM        16.2%        12                  30 days         1 min

Page 25:

SUMMARY: PRACTICAL RISK MINIMIZATION?

• Reduction of complexity at the same level of performance is interesting:

• Results hold across tasks

• RVMs have been trained on 100,000 vectors

• Results suggest integrated training is critical

• Risk minimization provides a family of solutions:

• Is there a better solution than minimum risk?

• What is the impact on complexity and robustness?

• Applications to other problems?

• Speech/Non-speech classification?

• Speaker adaptation?

• Language modeling?

Page 26:

SPEECH RECOGNITION: APPLICATION OF INFORMATION RETRIEVAL

Traditional Output:

• best word sequence

• time alignment of information

Other Outputs:

• word graphs

• N-best sentences

• confidence measures

• metadata such as speaker identity, accent, and prosody

Page 27:

APPLICATIONS: INFORMATION RETRIEVAL

• Metadata extraction from conversational speech

• Automatic gisting and intelligence gathering

• Speech to text is the core technology challenge

• Machines vs. humans

• Real-time audio indexing

• Time-varying channel

• Dynamic language model

• Multilingual and cross-lingual

Page 28:

APPLICATIONS: CAVS – DIALOG SYSTEMS FOR THE CAR

• In-vehicle dialog systems improve information access.

• Advanced user interfaces enhance workforce training and increase manufacturing efficiency.

• Noise robustness in both environments to improve recognition performance

• Advanced statistical models and machine learning technology

• Multidisciplinary team (IE, ECE, CS).

Page 29:

SUMMARY: ACKNOWLEDGEMENTS

• Principal Investigators: Aravind Ganapathiraju (Conversay) and Jon Hamaker (Microsoft), as part of their Ph.D. studies at Mississippi State

• Consultants: Michael Tipping (MSR-Cambridge) and Thorsten Joachims (Cornell)

• Motivation: Serious work began after discussions with V.N. Vapnik at the CLSP Summer Workshop in 1997.

Page 30:

SUMMARY: RELEVANT SOFTWARE RESOURCES

• Pattern Recognition Applet: compare popular algorithms on standard or custom data sets

• Speech Recognition Toolkits: compare SVMs and RVMs to standard approaches using a state-of-the-art ASR toolkit

• Fun Stuff: have you seen our commercial on the Home Shopping Channel?

• Foundation Classes: generic C++ implementations of many popular statistical modeling approaches

Page 31:

SUMMARY: BRIEF BIBLIOGRAPHY

Applications to Speech Recognition:

1. J. Hamaker and J. Picone, “Advances in Speech Recognition Using Sparse Bayesian Methods,” submitted to the IEEE Transactions on Speech and Audio Processing, January 2003.

2. A. Ganapathiraju, J. Hamaker and J. Picone, “Applications of Risk Minimization to Speech Recognition,” submitted to the IEEE Transactions on Signal Processing, July 2003.

3. J. Hamaker, J. Picone, and A. Ganapathiraju, “A Sparse Modeling Approach to Speech Recognition Based on Relevance Vector Machines,” Proceedings of the International Conference on Spoken Language Processing, vol. 2, pp. 1001-1004, Denver, Colorado, USA, September 2002.

4. J. Hamaker, Sparse Bayesian Methods for Continuous Speech Recognition, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, December 2003.

5. A. Ganapathiraju, Support Vector Machines for Speech Recognition, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, January 2002.

Influential work:

6. M. Tipping, “Sparse Bayesian Learning and the Relevance Vector Machine,” Journal of Machine Learning Research, vol. 1, pp. 211-244, June 2001.

7. D. J. C. MacKay, “Probable Networks and Plausible Predictions: A Review of Practical Bayesian Methods for Supervised Neural Networks,” Network: Computation in Neural Systems, vol. 6, pp. 469-505, 1995.

8. D. J. C. MacKay, Bayesian Methods for Adaptive Models, Ph. D. thesis, California Institute of Technology, Pasadena, California, USA, 1991.

9. E. T. Jaynes, “Bayesian Methods: General Background,” Maximum Entropy and Bayesian Methods in Applied Statistics, J. H. Justice, ed., pp. 1-25, Cambridge Univ. Press, Cambridge, UK, 1986.

10. V.N. Vapnik, Statistical Learning Theory, John Wiley, New York, NY, USA, 1998.

11. V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, NY, USA, 1995.

12. C.J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.

Page 32:

CASE STUDIES: SPEAK & SPELL™ (JUNE 1978)

• Introduced at the Summer Consumer Electronics Show in Chicago

• First commercial speech synthesis consumer toy

• Based on linear prediction

• Contained a proprietary speech synthesis chip

• Left to right: Gene Frantz, Richard Wiggins, Paul Breedlove and Larry Brantingham (1978)

Page 33:

CASE STUDIES: VOYAGER™ (JUNE 1988)

• Yes / no / true / false recognizer

• Answer questions about history

• Variety of learning modules

• Speaker independent recognition

• Microphone + children???

• Won several industry design awards for the mechanical design

Page 34:

CASE STUDIES: JULIE (DECEMBER 1988)

• Worlds of Wonder approached TI in September of 1988.

• Can you put this toy on the market by Thanksgiving?

• 10-word speaker dependent isolated word recognizer

• 100 sentences for synthesis

• “Transparent training”

• First large-scale consumer toy application for a DSP

Page 35:

CASE STUDIES: WATSON (EARLY 1990’S)

• Voice verification for calling card security

• First widespread deployment of recognition technology in the telephone network

• Stimulated interest in voice dialing and other user-programmable features

• Original application was obsolete before wide-scale deployment