Automatic Speech Recognition.ppt
Automatic Speech Recognition
How do humans do it?
• Articulation produces sound waves, which the ear conveys to the brain for processing
How might computers do it?
• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
(Figure: acoustic waveform and acoustic signal)
Speech recognition
Speech features
– Representation using features to develop models
– Vocal tract – a time-varying linear filter
– Glottal pulse or noise generator – the signal sources
– The time-varying character of the speech process is captured by performing spectral analysis on short-time frames and repeating the analysis periodically.
Mel-frequency cepstral coefficients (MFCCs)
• MFCC is based on human hearing perception, which does not resolve frequencies above 1 kHz linearly.
• The MFCC filterbank uses two kinds of filter spacing: linear spacing at low frequencies below 1000 Hz, and logarithmic spacing above 1000 Hz.
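The filterbank described above can be sketched in code. This is a minimal illustration, not a production front end; all function names and parameter values (26 filters, 13 coefficients) are illustrative choices, though they are common defaults:

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: roughly linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters whose centres are equally spaced on the mel scale,
    # i.e. linearly spaced below ~1 kHz and logarithmically spaced above.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                 # rising edge of triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frame, sr, n_filters=26, n_ceps=13):
    # Power spectrum of one short-time frame
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2
    # Log energies through the mel filterbank
    log_e = np.log(mel_filterbank(n_filters, n_fft, sr) @ spec + 1e-10)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_e
```

A 512-sample frame at 16 kHz then yields a 13-dimensional MFCC vector per frame.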
Suitability of features
– Features from MFCCs are well suited to density-function models such as mixture models or hidden Markov models (HMMs).
– Cepstra are lower-order Fourier coefficients.
– Variations induced by the pulsed nature of vocal excitation have minimal effect.
Speaker Models
– Neural network
– Support vector machine (SVM)
– Gaussian mixture model (GMM)
– Hidden Markov model (HMM)
– The choice of model depends on circumstance and the specific application.
– The amount of data available and the nature of the speaker verification problem influence the choice of model.
Gaussian mixture model
GMM - Example
(Figures: a two-component Gaussian mixture – the individual component densities and the resulting mixture density p(x).)
(Figures: EM training of a Gaussian mixture from a starting guess, showing the fit after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations.)
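The iterations shown in those figures come from the EM algorithm. A minimal one-dimensional sketch (initialisation by quantiles and the iteration count are illustrative choices, not taken from the slides):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50):
    # Initialise: uniform weights, means at data quantiles, shared variance
    w = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample
        dens = np.stack([w[j] * np.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                         / np.sqrt(2 * np.pi * var[j]) for j in range(k)])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: maximum-likelihood re-estimation from the responsibilities
        nk = resp.sum(axis=1)
        w = nk / x.size
        mu = resp @ x / nk
        var = np.array([resp[j] @ (x - mu[j]) ** 2 for j in range(k)]) / nk
    return w, mu, var
```

Run on samples drawn from two well-separated Gaussians, the estimated means converge to the two true component means, mirroring the iteration-by-iteration figures.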
Model training – depending on the amount of data available:
– Class conditional training
– Adaptive training
– Discriminative training
Class-conditional modeling – maximum likelihood training
CCM-ML (contd.)
Class-conditional modeling – discriminative training
– ML training works fine if the training data is sufficient.
– When training data is inadequate, discriminative training can be used with CCM with no loss in performance.
– This training centres on an important question:
"How should I select the parameters of my model or models such that we maximize performance on our goal of speaker separability?"
CCM-DT (contd.)
– To meet this, we move away from the ML criterion.
– The performance of an SV system is measured by the receiver operating characteristic (ROC) or detection error tradeoff (DET) curve.
– Model parameters are trained directly to improve this performance, even when working with a GMM or HMM.
CCM via Adaptation
– The process of creating speaker models via adaptation starts with a generic model and uses the data collected from the target speaker to tailor the generic model to that speaker.
– The generic model acts as a prior probability distribution.
– The generic model, often referred to as the universal background model (UBM), is typically a GMM trained on data from many speakers.
– Other class conditional models
• Inherently discriminative approaches
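The adaptation step above can be sketched for a one-dimensional GMM-UBM. This follows the common mean-only MAP adaptation scheme; the relevance factor r=16 and the 1-D simplification are illustrative assumptions, not details from the slides:

```python
import numpy as np

def map_adapt_means(x, w, mu, var, r=16.0):
    # Responsibilities of each UBM component for the target speaker's data
    k = len(w)
    dens = np.stack([w[j] * np.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                     / np.sqrt(2 * np.pi * var[j]) for j in range(k)])
    resp = dens / dens.sum(axis=0, keepdims=True)
    nk = resp.sum(axis=1)                      # soft count per component
    ex = resp @ x / np.maximum(nk, 1e-10)      # data mean per component
    alpha = nk / (nk + r)                      # data-dependent adaptation weight
    # Interpolate between the UBM prior means and the speaker-data means:
    # components with little data stay near the UBM, well-observed ones move.
    return alpha * ex + (1 - alpha) * mu
```

Components that see a lot of target-speaker data (large nk) move almost entirely to the speaker's statistics, while sparsely observed components stay anchored to the generic UBM prior.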
Managing variability
Channel Normalisation• Mitigating the effect of the channel based on its
inherent characteristics.
• The channel can be modeled as a linear time-invariant filter; in the cepstral domain its effect is additive:
C_r,n = C_s,n + C_ch,n
(received cepstrum = speech cepstrum + channel cepstrum)
• Cepstral mean subtraction (CMS) – one method of normalisation that is quite effective.
• But it also removes some of the speaker information.
• A channel can shift the features in feature space; CMS removes this shift, producing new features that are less sensitive to channel variability.
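Because a stationary channel adds the same constant C_ch,n to the cepstrum of every frame, subtracting each coefficient's mean over the utterance removes that offset. A minimal sketch:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    # cepstra: array of shape (frames, coefficients).
    # Subtract the per-utterance mean of each coefficient; a constant
    # additive channel term C_ch,n is identical in every frame, so it
    # is absorbed into the mean and cancelled.
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

After CMS, two recordings of the same speech passed through different stationary channels yield identical features, at the cost of discarding the (speaker-informative) long-term cepstral mean.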
Channel normalisation – other methods (approach – 1)
– Are there features that are highly robust to channel variation?
– Such features must be highly robust to channel variations while also carrying a significant amount of speaker-dependent information.
– Formant-related features – a formant is the location of a peak in the short-time spectrum, i.e. a resonant frequency of the vocal tract.
– Variations in glottal pulse timings and fundamental frequency do carry speaker information and have channel invariance.
– These features are difficult to extract and model (Murthy et al.).
Channel normalisation – other methods (approach – 2)
– Treat the channel as a random disturbance and integrate its variability directly into the scoring process for the speaker.
– Estimating a probabilistic model for the channel is a challenge.
– The channel effect can be modelled with a Gaussian random vector.
Channel normalisation – other methods (approach – 3)
– Construct a channel detector.
– Compare with the channel used at the time of enrollment.
– While modeling the speaker, maximum a posteriori (MAP) adaptation methods are employed (Teunen et al.).
– This faces the mismatched-channel problem, but is more effective than the normalisation technique.
Score normalisation
– Normalising the scores once they have been generated.
– Constraining the text
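One widely used score normalisation is T-norm (mentioned later in these slides in the context of Zilca et al.): a trial score is standardised by the mean and spread of scores the same test utterance obtains against a cohort of impostor models. A minimal sketch:

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    # Normalise a trial score by the statistics of an impostor cohort:
    # the same test utterance is scored against a set of cohort models,
    # and the target score is expressed in standard deviations above
    # that cohort mean. This makes a single decision threshold usable
    # across speakers and channels.
    cohort = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort.mean()) / (cohort.std() + 1e-10)
```

A raw score equal to the cohort mean maps to 0; a genuine-speaker trial should land well above 0 regardless of the absolute score scale.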
Measuring performance
– The performance of an SV system is measured by its probability of false dismissal versus its probability of false acceptance at a given threshold value.
– By sweeping the threshold over a collection of data, the DET curve is obtained.
– The DET for one single user is discontinuous.
– The National Institute of Standards and Technology (NIST), in its annual evaluation, collected data from different speakers and normalised it to create a composite DET.
– For a system where speaker- or channel-dependent thresholds are used, this normalisation could lead to an inappropriate DET.
– With speaker-dependent thresholds (no channel detection), a more appropriate way is to combine scores.
Measuring performance (contd.)
– In such cases, for a given P_FA the average of all P_Miss values is found and the DET is plotted.
– The DET or receiver operating characteristic (ROC) provides a great deal of information on an SV system; the equal error rate (EER) quantifies it as a single number.
– Another numerical value, the detection cost function (DCF), involves assigning a cost to each of the two types of errors.
– Many factors affect the performance of the system:
– Amount of training data
– Duration of the evaluation
– The variability
– Types of channels employed
– Protocols employed
– Constraints on the text
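The error measures above can be computed directly from lists of trial scores. A minimal sketch; the DCF costs and target prior used here (C_miss=10, C_fa=1, P_target=0.01) are the classic NIST-style values, given only as an illustrative assumption:

```python
import numpy as np

def det_points(target_scores, impostor_scores):
    # Sweep a threshold over all observed scores; at each threshold
    # compute P_miss (false dismissal) and P_fa (false acceptance).
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa

def eer(target_scores, impostor_scores):
    # Equal error rate: the operating point where P_miss ~ P_fa,
    # summarising the whole DET curve as a single number.
    p_miss, p_fa = det_points(target_scores, impostor_scores)
    i = np.argmin(np.abs(p_miss - p_fa))
    return (p_miss[i] + p_fa[i]) / 2

def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    # Detection cost function: weighted combination of the two error
    # types at one operating point (illustrative NIST-style costs).
    return c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
```

Plotting p_miss against p_fa (on normal-deviate axes) gives the DET curve; eer() and dcf() reduce it to the two single-number summaries the slide mentions.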
Alternative approaches
– 1. Speech recognition approaches
– Cepstral features extracted from speech were used; higher-level (phonetic) information was not exploited.
– These rely primarily on the individual's vocal production mechanism.
– In the verification process the discriminative gain was high.
– However, extracting them over a communication channel reduced the accuracy.
– With improvements in both speech and phonetic recognisers, this approach gained importance.
Alternative approaches (contd.)
– The Dragon system focused on the recogniser's ability to provide higher acoustic knowledge about speakers.
– For training, the Baum-Welch HMM procedure is followed.
– The Dragon approach fared well in all aspects but fell short of the state-of-the-art GMM, owing to a lack of test data.
– Kimball et al. gave a text-dependent recognition system.
– It uses the maximum likelihood linear regression (MLLR) method.
– It proved to be effective even on different channels.
Alternative approaches (contd.)
– 2. Words (and phonetic units) count
– Gauvain et al. proposed speaker recognition based on a phone recogniser.
– Training of acoustic models for the target speaker was done by adaptation; the model was an HMM.
– Transitions were permitted between phones according to the user's language.
– Doddington demonstrated that language patterns (captured by frequently occurring bigrams) contain a great deal of speaker-specific information.
– Andrew et al. demonstrated a word-based recognition system by combining language recognition with acoustic information.
Alternative approaches (contd.)
– 3. Models exploring the shape of feature space
– Statistical modeling through the user's likelihood ratios.
– The speaker model is characterised by a single Gaussian pdf.
– This model ignores the fine structure of the speech – its shape in feature space.
– Gish introduced the eigenvalues of covariance matrices.
– Zilca et al. considered measures of shape and T-norm scaling of scores, in conjunction with channel detection, for the cellular-phone environment.
– This system requires less computation than a GMM.