Automatic Speech Recognition.ppt
Automatic Speech Recognition
How do humans do it?
• Articulation produces sound waves, which the ear conveys to the brain for processing
How might computers do it?
• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
(Figure: acoustic waveform and acoustic signal)
Speech recognition
Speech features
– Representation using features to develop models
– Vocal tract – a time-varying linear filter
– Glottal pulse or noise generator – the signal sources
– The time-varying character of the speech process is captured by performing spectral analysis on short-time frames and repeating the analysis periodically.
Mel-frequency cepstral coefficients (MFCCs)
• MFCC is based on human hearing perception, which does not resolve frequencies above 1 kHz linearly.
• The MFCC filterbank uses two kinds of filter spacing: linear spacing at low frequencies below 1000 Hz, and logarithmic spacing above 1000 Hz.
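The filterbank described above can be sketched in code. This is a minimal illustration, not a production front end; all function names and parameter values (26 filters, 13 coefficients) are illustrative choices, though they are common defaults:

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: roughly linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters whose centres are equally spaced on the mel scale,
    # i.e. linearly spaced below ~1 kHz and logarithmically spaced above.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                 # rising edge of triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frame, sr, n_filters=26, n_ceps=13):
    # Power spectrum of one short-time frame
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2
    # Log energies through the mel filterbank
    log_e = np.log(mel_filterbank(n_filters, n_fft, sr) @ spec + 1e-10)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_e
```

A 512-sample frame at 16 kHz then yields a 13-dimensional MFCC vector per frame.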
Suitability of features
– Features from MFCCs are well suited to density-function models such as mixture models or hidden Markov models (HMMs).
– Cepstra are lower-order Fourier coefficients.
– Variations induced by the pulsed nature of vocal excitation have minimal effect.
Speaker Models
– Neural network
– Support vector machine (SVM)
– Gaussian mixture model (GMM)
– Hidden Markov model (HMM)
– The choice of model depends on circumstance and the specific application.
– The amount of data available and the nature of the speaker verification problem influence the choice of model.
Gaussian mixture model
GMM - Example
(Figures: a two-component Gaussian mixture – the individual component densities and the resulting mixture density p(x).)
(Figures: EM training of a Gaussian mixture from a starting guess, showing the fit after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations.)
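The iterations shown in those figures come from the EM algorithm. A minimal one-dimensional sketch (initialisation by quantiles and the iteration count are illustrative choices, not taken from the slides):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50):
    # Initialise: uniform weights, means at data quantiles, shared variance
    w = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample
        dens = np.stack([w[j] * np.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                         / np.sqrt(2 * np.pi * var[j]) for j in range(k)])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: maximum-likelihood re-estimation from the responsibilities
        nk = resp.sum(axis=1)
        w = nk / x.size
        mu = resp @ x / nk
        var = np.array([resp[j] @ (x - mu[j]) ** 2 for j in range(k)]) / nk
    return w, mu, var
```

Run on samples drawn from two well-separated Gaussians, the estimated means converge to the two true component means, mirroring the iteration-by-iteration figures.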
Model training – depending on the amount of data available:
– Class conditional training
– Adaptive training
– Discriminative training
Class-conditional modeling – maximum likelihood training
CCM-ML (contd.)
Class-conditional modeling – discriminative training
– ML training works fine if the training data is sufficient.
– When training data is inadequate, discriminative training can be used with CCM with no loss in performance.
– This training centres on an important question:
"How should I select the parameters of my model or models such that we maximize performance on our goal of speaker separability?"
CCM-DT (contd.)
– To meet this, we move away from the ML criterion.
– The performance of an SV system is measured by the receiver operating characteristic (ROC) or detection error tradeoff (DET) curve.
– Model parameters are trained directly to improve this performance, even when working with a GMM or HMM.
CCM via Adaptation
– The process of creating speaker models via adaptation starts with a generic model and uses the data collected from the target speaker to tailor the generic model to that speaker.
– The generic model acts as a prior probability distribution.
– The generic model, often referred to as the universal background model (UBM), is typically a GMM trained on data from many speakers.
– Other class conditional models
• Inherently discriminative approaches
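The adaptation step above can be sketched for a one-dimensional GMM-UBM. This follows the common mean-only MAP adaptation scheme; the relevance factor r=16 and the 1-D simplification are illustrative assumptions, not details from the slides:

```python
import numpy as np

def map_adapt_means(x, w, mu, var, r=16.0):
    # Responsibilities of each UBM component for the target speaker's data
    k = len(w)
    dens = np.stack([w[j] * np.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                     / np.sqrt(2 * np.pi * var[j]) for j in range(k)])
    resp = dens / dens.sum(axis=0, keepdims=True)
    nk = resp.sum(axis=1)                      # soft count per component
    ex = resp @ x / np.maximum(nk, 1e-10)      # data mean per component
    alpha = nk / (nk + r)                      # data-dependent adaptation weight
    # Interpolate between the UBM prior means and the speaker-data means:
    # components with little data stay near the UBM, well-observed ones move.
    return alpha * ex + (1 - alpha) * mu
```

Components that see a lot of target-speaker data (large nk) move almost entirely to the speaker's statistics, while sparsely observed components stay anchored to the generic UBM prior.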
Managing variability
Channel Normalisation• Mitigating the effect of the channel based on its
inherent characteristics.
• The channel can be modeled as a linear time-invariant filter; in the cepstral domain its effect is additive:
C_r,n = C_s,n + C_ch,n
(received cepstrum = speech cepstrum + channel cepstrum)
• Cepstral mean subtraction (CMS) – one method of normalisation that is quite effective.
• But it also removes some of the speaker information.
• A channel can shift the features in feature space; CMS removes this shift, producing new features that are less sensitive to channel variability.
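Because a stationary channel adds the same constant C_ch,n to the cepstrum of every frame, subtracting each coefficient's mean over the utterance removes that offset. A minimal sketch:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    # cepstra: array of shape (frames, coefficients).
    # Subtract the per-utterance mean of each coefficient; a constant
    # additive channel term C_ch,n is identical in every frame, so it
    # is absorbed into the mean and cancelled.
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

After CMS, two recordings of the same speech passed through different stationary channels yield identical features, at the cost of discarding the (speaker-informative) long-term cepstral mean.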
Channel normalisation – other methods (approach – 1)
– Are there features that are highly robust to channel variation?
– Such features must be highly robust to channel variations while also carrying a significant amount of speaker-dependent information.
– Formant-related features – a formant is the location of a peak in the short-time spectrum, i.e. a resonant frequency of the vocal tract.
– Variations in glottal pulse timings and fundamental frequency do carry speaker information and have channel invariance.
– These features are difficult to extract and model (Murthy et al.).
Channel normalisation – other methods (approach – 2)
– Treat the channel as a random disturbance and integrate its variability directly into the scoring process for the speaker.
– Estimating a probabilistic model for the channel is a challenge.
– The channel effect can be modelled with a Gaussian random vector.
Channel normalisation – other methods (approach – 3)
– Construct a channel detector.
– Compare with the channel used at the time of enrollment.
– While modeling the speaker, maximum a posteriori (MAP) adaptation methods are employed (Teunen et al.).
– This faces the mismatched-channel problem, but is more effective than the normalisation technique.
Score normalisation
– Normalising the scores once they have been generated.
– Constraining the text
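One widely used score normalisation is T-norm (mentioned later in these slides in the context of Zilca et al.): a trial score is standardised by the mean and spread of scores the same test utterance obtains against a cohort of impostor models. A minimal sketch:

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    # Normalise a trial score by the statistics of an impostor cohort:
    # the same test utterance is scored against a set of cohort models,
    # and the target score is expressed in standard deviations above
    # that cohort mean. This makes a single decision threshold usable
    # across speakers and channels.
    cohort = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort.mean()) / (cohort.std() + 1e-10)
```

A raw score equal to the cohort mean maps to 0; a genuine-speaker trial should land well above 0 regardless of the absolute score scale.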
Measuring performance
– The performance of an SV system is measured by its probability of false dismissal versus its probability of false acceptance at a given threshold value.
– By sweeping the threshold over a collection of data, the DET curve is obtained.
– The DET for one single user is discontinuous.
– The National Institute of Standards and Technology (NIST), in its annual evaluation, collected data from different speakers and normalised it to create a composite DET.
– For a system where speaker- or channel-dependent thresholds are used, this normalisation could lead to an inappropriate DET.
– With speaker-dependent thresholds (no channel detection), a more appropriate way is to combine scores.
Measuring performance (contd.)
– In such cases, for a given P_FA the average of all P_Miss values is found and the DET is plotted.
– The DET or receiver operating characteristic (ROC) provides a great deal of information on an SV system; the equal error rate (EER) quantifies it as a single number.
– Another numerical value, the detection cost function (DCF), involves assigning a cost to each of the two types of errors.
– Many factors affect the performance of the system:
– Amount of training data
– Duration of the evaluation
– The variability
– Types of channels employed
– Protocols employed
– Constraints on the text
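The error measures above can be computed directly from lists of trial scores. A minimal sketch; the DCF costs and target prior used here (C_miss=10, C_fa=1, P_target=0.01) are the classic NIST-style values, given only as an illustrative assumption:

```python
import numpy as np

def det_points(target_scores, impostor_scores):
    # Sweep a threshold over all observed scores; at each threshold
    # compute P_miss (false dismissal) and P_fa (false acceptance).
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa

def eer(target_scores, impostor_scores):
    # Equal error rate: the operating point where P_miss ~ P_fa,
    # summarising the whole DET curve as a single number.
    p_miss, p_fa = det_points(target_scores, impostor_scores)
    i = np.argmin(np.abs(p_miss - p_fa))
    return (p_miss[i] + p_fa[i]) / 2

def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    # Detection cost function: weighted combination of the two error
    # types at one operating point (illustrative NIST-style costs).
    return c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
```

Plotting p_miss against p_fa (on normal-deviate axes) gives the DET curve; eer() and dcf() reduce it to the two single-number summaries the slide mentions.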
Alternative approaches
– 1. Speech recognition approaches
– Cepstral features extracted from speech were used; higher-level (phonetic) information was not exploited.
– These rely primarily on the individual's vocal production mechanism.
– In the verification process the discriminative gain was high.
– However, extracting them over a communication channel reduced the accuracy.
– With improvements in both speech and phonetic recognisers, this approach gained importance.
Alternative approaches (contd.)
– The Dragon system focused on the recogniser's ability to provide higher acoustic knowledge about speakers.
– For training, the Baum-Welch HMM procedure is followed.
– The Dragon approach fared well in all aspects but fell short of the state-of-the-art GMM, owing to a lack of test data.
– Kimball et al. gave a text-dependent recognition system.
– It uses the maximum likelihood linear regression (MLLR) method.
– It proved to be effective even on different channels.
Alternative approaches (contd.)
– 2. Words (and phonetic units) count
– Gauvain et al. proposed speaker recognition based on a phone recogniser.
– Training of acoustic models for the target speaker was done by adaptation; the model was an HMM.
– Transitions were permitted between phones according to the user's language.
– Doddington demonstrated that language patterns (captured by frequently occurring bigrams) contain a great deal of speaker-specific information.
– Andrew et al. demonstrated a word-based recognition system by combining language recognition with acoustic information.
Alternative approaches (contd.)
– 3. Models exploring the shape of feature space
– Statistical modeling through the user's likelihood ratios.
– The speaker model is characterised by a single Gaussian pdf.
– This model ignores the fine structure of the speech – its shape in feature space.
– Gish introduced the eigenvalues of covariance matrices.
– Zilca et al. considered measures of shape and T-norm scaling of scores, in conjunction with channel detection, for the cellular-phone environment.
– This system requires less computation than a GMM.