A Sparse Modeling Approach to Speech Recognition Using Kernel Machines
Jon Hamaker ([email protected])
Institute for Signal and Information Processing
Mississippi State University
Abstract
Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to over-fitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. SVMs have been shown to provide significant improvements in performance on small pattern recognition tasks compared to a number of conventional approaches. SVMs, however, require ad hoc (and unreliable) methods to couple them to probabilistic learning machines. Probabilistic Bayesian learning machines, such as the relevance vector machine (RVM), are fairly new approaches that attempt to overcome the deficiencies of SVMs by explicitly accounting for sparsity and statistics in their formulation.
In this presentation, we describe both of these modeling approaches in brief. We then describe our work to integrate these as acoustic models in large vocabulary speech recognition systems. Particular attention is given to algorithms for training these learning machines on large corpora. In each case, we find that both SVM and RVM-based systems perform better than Gaussian mixture-based HMMs in open-loop recognition. We further show that the RVM-based solution performs on par with the SVM system using an order of magnitude fewer parameters. We conclude with a discussion of the remaining hurdles for providing this technology in a form amenable to current state-of-the-art recognizers.
Bio
Jon Hamaker is a Ph.D. candidate in the Department of Electrical and Computer Engineering at Mississippi State University under the supervision of Dr. Joe Picone. He has been a senior member of the Institute for Signal and Information Processing (ISIP) at MSU since 1996. Mr. Hamaker's research work has revolved around automatic structural analysis and optimization methods for acoustic modeling in speech recognition systems. His most recent work has been in the application of kernel machines as replacements for the underlying Gaussian distribution in hidden Markov acoustic models. His dissertation work compares the popular support vector machine with the relatively new relevance vector machine in the context of a speech recognition system. Mr. Hamaker has co-authored 4 journal papers (2 under review), 22 conference papers, and 3 invited presentations during his graduate studies at MS State (http://www.isip.msstate.edu/publications). He also spent two summers as an intern at Microsoft in the recognition engine group.
Outline
The acoustic modeling problem for speech
Current state-of-the-art
Discriminative approaches
Structural optimization and Occam's Razor
Support vector classifiers
Relevance vector classifiers
Coupling vector machines to ASR systems
Scaling relevance vector methods to "real" problems
Extensions of this work
ASR Problem
Front-end maintains information important for modeling in a reduced parameter set
Language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams)
Search engine uses knowledge sources and models to choose among competing hypotheses
[Block diagram: input speech → acoustic front-end → search, which combines statistical acoustic models p(A|W) and a language model p(W) → recognized utterance; the acoustic models are the focus of this work]
Acoustic Confusability: Requires reasoning under uncertainty!
• Regions of overlap represent classification error
• Reduce overlap by introducing acoustic and linguistic context
[Figure: comparison of the feature distributions of "aa" in "lOck" and "iy" in "bEAt" for SWB]
Probabilistic Formulation
To deal with the uncertainty, we typically formulate speech recognition as a probabilistic problem:
    P(W|A) = P(A|W) P(W) / P(A)
Objective: Minimize the word error rate by maximizing P(W|A)
Approach: Maximize P(A|W) during training
Components:
P(A|W): Acoustic Model
P(W): Language Model
P(A): Acoustic probability (ignored during maximization)
Acoustic Modeling - HMMs
HMMs model temporal variation in the transition probabilities of the state machine
GMM emission densities are used to account for variations in speaker, accent, and pronunciation
Sharing model parameters is a common strategy to reduce complexity
[Figure: a left-to-right HMM with states s0-s4, and word models for THREE, TWO, FIVE, EIGHT]
Maximum Likelihood Training
Data-driven modeling supervised only from a word-level transcription
Approach: maximum likelihood estimation
The EM algorithm is used to improve our estimates:
    P(Data | θ̂) ≥ P(Data | θ)  if  Q(θ̂, θ) ≥ Q(θ, θ)
Guaranteed convergence to a local maximum
No guard against overfitting!
Computationally efficient training algorithms (Forward-Backward) have been crucial
Decision trees are used to optimize parameter sharing, minimize system complexity, and integrate additional linguistic knowledge
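To make the EM guarantee concrete, here is a minimal sketch (not the authors' trainer; scikit-learn's GaussianMixture is used as a stand-in for Forward-Backward HMM updates): each EM step cannot decrease the training-data likelihood, and nothing in the procedure guards against overfitting.

```python
# Minimal sketch: one EM step per fit() call (warm_start=True), tracking the
# total training-data log-likelihood, which EM can only improve or hold steady.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])

gmm = GaussianMixture(n_components=4, max_iter=1, warm_start=True,
                      init_params="random", random_state=0)
for it in range(15):
    gmm.fit(X)                      # a single EM iteration, continuing from the last
    ll = gmm.score(X) * len(X)      # total log-likelihood of the training data
    print(f"iteration {it:2d}: log-likelihood {ll:10.2f}")
# The likelihood rises monotonically toward a local maximum -- it says nothing
# about classification error or generalization to unseen data.
```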
Drawbacks of Current Approach
ML Convergence does not translate to optimal classification
Error from incorrect modeling assumptions
Finding the optimal decision boundary requires only one parameter!
Drawbacks of Current Approach
Data not separable by a hyperplane – nonlinear classifier is needed
Gaussian MLE models tend toward the center of mass – overtraining leads to poor generalization
Acoustic Modeling
Acoustic models must:
Model the temporal progression of the speech
Model the characteristics of the sub-word units
We would also like our models to:
Optimally trade off discrimination and representation
Incorporate Bayesian statistics (priors)
Make efficient use of parameters (sparsity)
Produce confidence measures of their predictions for higher-level decision processes
Paradigm Shift - Discriminative Modeling
Discriminative Training (Maximum Mutual Information Estimation)
Essential idea: Maximize
    P(A|W_in) / P(A|W_out)
Maximize numerator (ML term), minimize denominator (discriminative term)
Discriminative Modeling (e.g. ANN Hybrids – Bourlard and Morgan)
Research Focus
Our research: replace the Gaussian likelihood computation with a machine that incorporates notions of:
Discrimination
Bayesian statistics (prior information)
Confidence
Sparsity
All while maintaining computational efficiency
ANN Hybrids
Architecture:
ANN provides flexible, discriminative classifiers for emission probabilities that avoid HMM independence assumptions (can use wider acoustic context)
Trained using Viterbi iterative training (hard decision rule) or can be trained to learn Baum-Welch targets (soft decision rule)
Shortcomings:
Prone to overfitting: require cross-validation to determine when to stop training. Need methods to automatically penalize overfitting
No substantial recognition improvements over HMM/GMM
[Diagram: ANN mapping an input feature vector through hidden layers to emission probabilities P(c1|o) … P(cn|o)]
Structural Optimization
Structural optimization is often guided by an Occam's Razor approach
Trading goodness of fit and model complexity
Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination
[Plot: training-set error decreases with model complexity while open-loop error reaches its minimum at an intermediate, optimum complexity]
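As a small, self-contained illustration (not from the slides) of this trade-off, the sketch below scores Gaussian mixtures of increasing size with one of the criteria named above, BIC = -2 log L + k log n: the fit keeps improving with complexity, but the penalized score does not.

```python
# Occam's-razor illustration: the likelihood always improves with more
# components, while the BIC penalty (parameter count k times log n) makes an
# intermediate model size the winner.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (300, 1)), rng.normal(3, 1, (300, 1))])

for k in (1, 2, 4, 8, 16):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # sklearn's .bic() implements -2 log L + (#params) log n directly
    print(f"{k:2d} components: log-likelihood {gmm.score(X) * len(X):9.1f}, "
          f"BIC {gmm.bic(X):9.1f}")
```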
Structural Risk Minimization
The VC dimension is a measure of the complexity of the learning machine
Higher VC dimension gives a looser bound on the actual risk – thus penalizing a more complex model (Vapnik)
Expected risk:
    R(α) = ∫ ½ |y − f(x, α)| dP(x, y)
Not possible to estimate P(x, y)
Empirical risk:
    R_emp(α) = (1/2l) Σ_{i=1..l} |y_i − f(x_i, α)|
Related by the VC dimension, h:
    R(α) ≤ R_emp(α) + φ(h)
Approach: choose the machine that gives the least upper bound on the actual risk
[Plot: the empirical risk falls and the VC confidence grows with VC dimension h; the bound on the expected risk is minimized at an intermediate, optimum h]
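For reference, the textbook form of this bound (a standard statement, not transcribed from the slide) holds with probability 1 − η over the choice of training set:

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) \;+\;
\underbrace{\sqrt{\frac{h\!\left(\ln\frac{2l}{h}+1\right) - \ln\frac{\eta}{4}}{l}}}_{\text{VC confidence } \phi(h)}
```

The second term grows with the VC dimension h and shrinks with the number of samples l, which is why minimizing the bound penalizes complex machines.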
Support Vector Machines
Hyperplanes C0-C2 achieve zero empirical risk. C0 generalizes optimally
The data points that define the boundary are called support vectors
Optimization: Separable Data
Hyperplane:  x · w + b = 0
Constraints:  y_i (x_i · w + b) − 1 ≥ 0
Quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin). Only a small portion of the training points become support vectors
Final classifier:  f(x) = Σ_{i ∈ SVs} α_i y_i (x · x_i) + b
[Figure: two classes separated with maximum margin; the optimal classifier C0 lies between margin hyperplanes H1 and H2 (defined by the support vectors), with normal vector w; C1 and C2 also separate the data but with smaller margin]
SVMs as Nonlinear Classifiers
Data for practical applications typically not separable using a hyperplane in the original input feature space
Transform data to a higher dimension where a hyperplane classifier is sufficient to model the decision surface
Kernels used for this transformation:
    Φ: R^n → R^N
    K(x_i, x_j) = Φ(x_i) · Φ(x_j)
Final classifier:  f(x) = Σ_{i ∈ SVs} α_i y_i K(x, x_i) + b
SVMs for Non-Separable Data
No hyperplane could achieve zero empirical risk (in any dimension space!)
Recall the SRM principle: trade off empirical risk and model complexity
Relax our optimization constraint to allow for errors on the training set:
    y_i (x_i · w + b) ≥ 1 − ξ_i
A new parameter, C, must be estimated to optimally control the trade-off between training set errors and model complexity
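The sketch below (an assumption-laden illustration, not the authors' code) exercises the two practical knobs just described: an RBF kernel (gamma = 0.5, matching the experiments later in the talk) and the trade-off parameter C, which has to be tuned on held-out data.

```python
# Minimal sketch: soft-margin SVM with an RBF kernel; C is picked on a
# held-out set, and n_support_ shows how many training points are retained.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))                    # stand-in for acoustic features
y = (X[:, 0] * X[:, 1] > 0).astype(int)           # not linearly separable

X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

best = None
for C in (0.1, 1.0, 10.0):                        # trade-off tuned on held-out data
    svm = SVC(C=C, kernel="rbf", gamma=0.5).fit(X_tr, y_tr)
    acc = svm.score(X_ho, y_ho)
    if best is None or acc > best[0]:
        best = (acc, C, svm)

acc, C, svm = best
print(f"C={C}: held-out accuracy {acc:.3f}, "
      f"{svm.n_support_.sum()} support vectors out of {len(X_tr)} training points")
```

The support-vector count reported here foreshadows one of the drawbacks listed next: it tends to grow with the size of the training set.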
SVM Drawbacks
Uses a binary (yes/no) decision rule
Generates a distance from the hyperplane, but this distance is often not a good measure of our "confidence" in the classification
Can produce a "probability" as a function of the distance (e.g. using sigmoid fits), but they are inadequate
Number of support vectors grows linearly with the size of the data set
Requires the estimation of the trade-off parameter, C, via held-out sets
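A minimal sketch of the sigmoid fit mentioned above (Platt-style scaling; the function names and toy data are hypothetical): map the SVM distance f(x) to a posterior-like score 1/(1 + exp(A·f(x) + B)), with A and B estimated on a held-out set.

```python
# Sigmoid fit for SVM outputs: minimize the negative log-likelihood of
# held-out labels as a function of (A, B). scikit-learn's probability=True
# performs a comparable fit internally.
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(f_heldout, t_heldout):
    """Fit A, B of p = 1 / (1 + exp(A*f + B)) on held-out decision values."""
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f_heldout + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(t_heldout * np.log(p) + (1 - t_heldout) * np.log(1 - p))
    return minimize(nll, x0=[-1.0, 0.0], method="Nelder-Mead").x

# f_heldout: SVM decision values on a held-out set; t_heldout: 0/1 labels
f_heldout = np.array([-2.1, -0.8, -0.1, 0.3, 1.2, 2.5])
t_heldout = np.array([0, 0, 0, 1, 1, 1])
A, B = fit_sigmoid(f_heldout, t_heldout)
posterior = lambda f: 1.0 / (1.0 + np.exp(A * f + B))
print(posterior(0.5))
```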
Evidence Maximization
Build a fully specified probabilistic model – incorporate prior information/beliefs as well as a notion of confidence in predictions
MacKay proposed a special form of regularization for neural networks – sparsity
Evidence maximization: evaluate candidate models based on their "evidence", P(D|H_i)
Structural optimization by maximizing the evidence across all candidate models!
Steeped in Gaussian approximations
Evidence Framework
Evidence approximation:
    P(D|H_i) ≈ P(D|ŵ, H_i) · P(ŵ|H_i) Δw
Likelihood of the data given the best-fit parameter set: P(D|ŵ, H_i)
Penalty (Occam factor) that measures how well our posterior model fits our prior assumptions: P(ŵ|H_i) Δw
We can set the prior in favor of sparse, smooth models!
[Figure: prior P(w|H_i) and posterior P(w|D, H_i) over the weights w; the posterior concentrates in a region of width Δw]
Relevance Vector Machines
A kernel-based learning machine:
    y(x; w) = Σ_{i=1..N} w_i K(x, x_i) + w_0
    P(t_i = 1 | x_i, w) = 1 / (1 + e^{−y(x_i; w)})
Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay):
    P(w|α) = Π_{i=0..N} N(w_i | 0, 1/α_i)
A flat (non-informative) prior over the hyperparameters α completes the Bayesian specification
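A small sketch of the functional form above (RBF kernel and toy relevance vectors assumed; not taken from the authors' system):

```python
# RVM posterior: a kernel expansion over the retained relevance vectors plus a
# bias weight w_0, passed through the logistic sigmoid.
import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def rvm_posterior(x, relevance_vectors, weights, w0, gamma=0.5):
    """P(t=1 | x, w) for a trained RVM with the given relevance vectors."""
    y = w0 + sum(w * rbf_kernel(x, xv, gamma)
                 for w, xv in zip(weights, relevance_vectors))
    return 1.0 / (1.0 + np.exp(-y))

# Toy usage: two relevance vectors with hypothetical weights
rvs = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
print(rvm_posterior(np.array([0.2, 0.1]), rvs, weights=[1.5, -0.7], w0=0.1))
```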
Relevance Vector Machines
The goal in training becomes finding:
    (ŵ, α̂) = argmax_{w,α} p(w, α | t, X),
    where p(w, α | t, X) = p(t | w, α, X) p(w | α) p(α) / p(t | X)
Estimation of the "sparsity" parameters α is inherent in the optimization – no need for a held-out set!
A closed-form solution to this maximization problem is not available. Rather, we iteratively re-estimate ŵ and α̂
Laplace's Method
Fix α and estimate w (e.g. gradient descent):
    ŵ = argmax_w p(t|w) p(w|α)
Use the Hessian to approximate the covariance of a Gaussian posterior of the weights centered at ŵ:
    Σ = ( −∇∇ log p(t|w) p(w|α) )^{-1} evaluated at ŵ
With ŵ and Σ as the mean and covariance, respectively, of the Gaussian approximation, we find α by finding
    α_i = γ_i / ŵ_i²,  where γ_i = 1 − α_i Σ_ii
Method is O(N^2) in memory and O(N^3) in time
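Below is a simplified sketch of one round of this procedure (an illustration under stated assumptions, not Tipping's reference implementation): Newton/IRLS steps for the MAP weights with α fixed, the Hessian-based Gaussian posterior, and the ARD re-estimate α_i = γ_i / ŵ_i².

```python
# One (alpha, w) re-estimation round of RVM classification via Laplace's
# method. Phi is the N x M design (kernel) matrix, t the 0/1 targets.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def laplace_step(Phi, t, alpha, w, n_newton=10):
    A = np.diag(alpha)
    for _ in range(n_newton):                     # IRLS for the MAP weights
        y = sigmoid(Phi @ w)
        g = Phi.T @ (t - y) - alpha * w           # gradient of the log posterior
        B = y * (1 - y)                           # logistic "noise" precision
        H = Phi.T @ (Phi * B[:, None]) + A        # negative Hessian at w
        w = w + np.linalg.solve(H, g)             # Newton update
    Sigma = np.linalg.inv(H)                      # Gaussian posterior covariance
    gamma = 1.0 - alpha * np.diag(Sigma)
    new_alpha = gamma / (w ** 2 + 1e-12)          # ARD hyperparameter update
    return new_alpha, w, Sigma

# Toy usage with a random design matrix standing in for the kernel matrix
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 10))
t = (Phi[:, 0] > 0).astype(float)
alpha, w = np.ones(10), np.zeros(10)
for _ in range(5):
    alpha, w, Sigma = laplace_step(Phi, t, alpha, w)
print("alpha after 5 rounds:", np.round(alpha, 2))  # large alpha_i => w_i pruned
```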
RVMs Compared to SVMs
RVM:
Data: Class labels (0,1)
Goal: Learn the posterior, P(t=1|x)
Structural optimization: Hyperprior distribution encourages sparsity
Training: iterative – O(N^3)
SVM:
Data: Class labels (-1,1)
Goal: Find the optimal decision surface under constraints:  y_i (x_i · w + b) ≥ 1 − ξ_i
Structural optimization: Trade-off parameter that must be estimated
Training: Quadratic – O(N^2)
Simple Example
ML Comparison
SVM Comparison
SVM With Sigmoid Posterior Comparison
RVM Comparison
Experimental Progression
Proof of concept on speech classification data
Coupling classifiers to ASR system
Reduced-set tests on Alphadigits task
Algorithms for scaling up RVM classifiers
Further tests on Alphadigits task (still not the full training set though!)
New work aiming at larger data sets and HMM decoupling
Vowel Classification
Deterding Vowel Data: 11 vowels spoken in "h*d" context; 10 log area parameters; 528 train, 462 SI test

Approach                    % Error    # Parameters
SVM: Polynomial Kernels     49%        –
K-Nearest Neighbor          44%        –
Gaussian Node Network       44%        –
SVM: RBF Kernels            35%        83 SVs
Separable Mixture Models    30%        –
RVM: RBF Kernels            30%        13 RVs
Coupling to ASR
Data size: 30 million frames of data in training set
Solution: Segmental phone models
Source for segmental data:
Solution: Use the HMM system in a bootstrap procedure
Could also build a segment-based decoder
Probabilistic decoder coupling:
SVMs: sigmoid-fit posterior
RVMs: naturally probabilistic
[Figure: a phone segment ("hh aw aa r y uw") of k frames is split into three regions of 0.3k, 0.4k, and 0.3k frames; the mean of each region forms the segmental feature vector]
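A minimal sketch of the segmental feature computation implied by the figure (the 30/40/30 proportions come from the region sizes above; everything else is assumed):

```python
# Build a 3-region segmental feature vector: the mean of the frames in the
# first 30%, middle 40%, and last 30% of a phone segment, concatenated.
import numpy as np

def segmental_features(frames, proportions=(0.3, 0.4, 0.3)):
    """frames: (k, d) array of frame-level features for one segment."""
    k = len(frames)
    bounds = np.cumsum([0] + [int(round(p * k)) for p in proportions])
    bounds[-1] = k                                    # absorb rounding error
    regions = [frames[bounds[i]:bounds[i + 1]] for i in range(3)]
    return np.concatenate([r.mean(axis=0) for r in regions])

# Toy usage: a 20-frame segment of 13-dimensional mel-cepstral features
frames = np.random.default_rng(0).normal(size=(20, 13))
print(segmental_features(frames).shape)              # (39,)
```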
Coupling to ASR System
[Diagram: mel-cepstral features → HMM recognition → N-best list and segment information → segmental converter → segmental features → hybrid decoder → hypothesis]
Alphadigit Recognition
OGI Alphadigits: continuous, telephone-bandwidth letters and numbers ("A19B4E")
Reduced training set size for RVM comparison: 2000 training segments per phone model
Could not, at this point, run larger sets efficiently
3329 utterances using 10-best lists generated by the HMM decoder
SVM and RVM system architectures are nearly identical: RBF kernels with gamma = 0.5
SVM requires the sigmoid posterior estimate to produce likelihoods – sigmoid parameters estimated from a large held-out set
SVM Alphadigit Recognition

Transcription    Segmentation    SVM      HMM
N-best           Hypothesis      11.0%    11.9%
N-best+Ref       Reference       3.3%     6.3%

HMM system is cross-word state-tied triphones with 16 mixtures of Gaussian models
SVM system has monophone models with segmental features
System combination experiment yields another 1% reduction in error
SVM/RVM Alphadigit Comparison
RVMs yield a large reduction in the parameter count while attaining superior performance
Computational cost is mainly in training for RVMs, but it is still prohibitive for larger sets

Approach    Error Rate    Avg. # Parameters    Training Time    Testing Time
SVM         16.4%         257                  0.5 hours        30 mins
RVM         16.2%         12                   30 days          1 min
Scaling Up
Central to RVM training is the inversion of an MxM Hessian matrix: an O(N^3) operation initially, since M = N at the start
Solutions:
Constructive Approach: Start with an empty model and iteratively add candidate parameters. M is typically much smaller than N
Divide and Conquer Approach: Divide the complete problem into a set of sub-problems. Iteratively refine the candidate parameter set according to the sub-problem solutions. M is user-defined

Constructive Approach
Tipping and Faul (MSR-Cambridge)
Define  L(α) = L(α_{−i}) + ℓ(α_i);  ℓ(α_i) has a unique solution with respect to α_i
The results give a set of rules for adding vectors to the model, removing vectors from the model, or updating parameters in the model
Constructive Approach Algorithm
Prune all parameters
While not converged:
    For each parameter:
        If parameter is pruned: checkAddRule
        Else: checkPruneRule, checkUpdateRule
    Update model
End

Begin with all weights set to zero and iteratively construct an optimal model without evaluating the full NxN inverse
Formulated for RVM regression – can have oscillatory behavior for classification
Rule subroutines require the full design matrix: an NxN storage requirement
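A hedged sketch of the per-parameter decision rules (following the published constructive approach; the sparsity and quality statistics s_i and q_i are assumed to be available from the current model, and their computation is omitted here):

```python
# theta_i = q_i^2 - s_i determines whether the per-parameter marginal
# likelihood l(alpha_i) has a finite maximum: if so, add or update the
# basis function; if not, prune it (alpha_i -> infinity).
import math

def constructive_step(i, s, q, in_model):
    """Return the action and new alpha_i for candidate basis function i."""
    theta = q[i] ** 2 - s[i]
    if theta > 0:
        alpha_i = s[i] ** 2 / theta           # finite maximizer of l(alpha_i)
        return ("update" if in_model[i] else "add"), alpha_i
    elif in_model[i]:
        return "prune", math.inf              # alpha_i -> infinity removes w_i
    return "skip", math.inf

# Hypothetical statistics for three candidates
s, q = [0.5, 2.0, 1.0], [1.5, 1.0, 0.9]
in_model = [False, True, False]
for i in range(3):
    print(i, constructive_step(i, s, q, in_model))
```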
Iterative Reduction Algorithm
O(M^3) in run-time and O(MxN) in memory. M is a user-defined parameter
Assumes that if P(w_k = 0 | w_{I,J}, D) is 1 then P(w_k = 0 | w, D) is also 1! Optimality?
[Diagram: the full candidate pool is split into subsets; each subset is trained and its surviving relevance vectors are returned to the candidate pool for iteration I+1]
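A rough sketch of the reduction loop under the assumptions above; train_rvm_subset is a hypothetical stand-in for a full RVM training routine run on one subset:

```python
# Divide-and-conquer reduction: train on user-sized subsets of the candidate
# pool, keep only the vectors each sub-problem retains, and refill the pool
# for the next iteration.
import numpy as np

def train_rvm_subset(X, t):
    """Placeholder: return indices of the vectors retained as relevance vectors."""
    # A real implementation would run RVM training (e.g. the Laplace/ARD
    # updates sketched earlier) on this subset only.
    return np.arange(len(X))[: max(1, len(X) // 4)]

def iterative_reduction(X, t, subset_size=1000, n_iterations=5):
    candidates = np.arange(len(X))                  # full candidate pool
    for _ in range(n_iterations):
        survivors = []
        for start in range(0, len(candidates), subset_size):
            idx = candidates[start:start + subset_size]
            kept = train_rvm_subset(X[idx], t[idx])   # O(M^3) per sub-problem
            survivors.append(idx[kept])
        candidates = np.concatenate(survivors)      # refined candidate pool
    return candidates

X = np.random.default_rng(0).normal(size=(5000, 10))
t = (X[:, 0] > 0).astype(int)
print(len(iterative_reduction(X, t)))
```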
Alphadigit Recognition
Data increased to 10000 training vectors
Reduction method has been trained up to 100k vectors (on a toy task). Not possible for the constructive method

Approach            Error Rate    Avg. # Parameters    Training Time    Testing Time
SVM                 15.5%         994                  3 hours          1.5 hours
RVM Constructive    14.8%         72                   5 days           5 mins
RVM Reduction       14.8%         74                   6 days           5 mins
Summary
First to apply kernel machines as acoustic models
Comparison of two machines that apply structural optimization to learning: SVM and RVM
Performance exceeds that of the HMM, but with quite a bit of HMM interaction
Algorithms for increased data sizes are key
Decoupling the HMM
Still want to use segmental data (data size)
Want the kernel machine acoustic model to determine an optimal segmentation though
Need a new decoder
Hypothesize each phone for each possible segment
Pruning is a huge issue
Stack decoder is beneficial
Status: In development
Improved Iterative Algorithm
Same principle of operation
One pass over the data – much faster!
Status: Equivalent performance on all benchmarks – running on Alphadigits now
[Diagram: as before, the candidate pool is split into subsets and trained, but the surviving relevance vectors are produced in a single pass over the data]
Active Learning for RVMs
Idea: Given the current model, iteratively choose a subset of points from the full training set that will improve system performance
Problem #1: "Performance" is typically defined as classifier error rate (e.g. boosting). What about the accuracy of the posterior estimate?
Problem #2: For kernel machines, an added training point can:
Assist in bettering the model performance
Become part of the model itself! How do we determine which points should be added?
Look to work in Gaussian Processes (Lawrence, Seeger, Herbrich, 2003)
Extensions
Not ready for prime time as an acoustic model
How else might we use the same techniques for speech?
Online speech/noise classification? Requires adaptation methods
Application of automatic relevance determination to model selection for HMMs?
Acknowledgments
Collaborators: Aravind Ganapathiraju and Joe Picone at Mississippi State
Consultants: Michael Tipping (MSR-Cambridge) and Thorsten Joachims (now at Cornell)