Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition


Transcript of Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Page 1: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

S. Srinivasan, T. Ma, D. May, G. Lazarou and J. Picone
Department of Electrical and Computer Engineering

Mississippi State University
URL: http://www.ece.msstate.edu/research/isip/publications/conferences/interspeech/2008/mar_hmm/

This material is based upon work supported by the National Science Foundation under Grant No. IIS-0414450. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

[Slide graphics: example waveforms of the sustained phones /aa/ (AA), /m/ (M), and /sh/ (SH), and a three-state HMM (S1, S2, S3) in which each state's output is modeled by a mixture of two AR filters, 1/A1(z) and 1/A2(z).]

Page 2: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Abstract

Gaussian mixture models (GMMs) are a very successful method for modeling the output distribution of a state in a hidden Markov model (HMM). However, this approach is limited by the assumption that the dynamics of speech features are linear and can be modeled with static features and their derivatives. In this paper, a nonlinear mixture autoregressive model is used to model state output distributions (MAR-HMM). Estimation of the model parameters is extended to handle vector features. MAR-HMMs are shown to provide superior performance to comparable Gaussian mixture model-based HMMs (GMM-HMMs), with lower complexity, on two pilot classification tasks.

Page 3: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Motivation

• Popular approaches to speech recognition use:
  – Gaussian Mixture Models (GMMs) to model state output distributions in hidden Markov model-based acoustic models.
  – Additional passes of discriminative training to optimize model performance.
  – Discriminative classifiers, such as Support Vector Machines, for rescoring or combining hypotheses.
• Limitations of current approaches:
  – GMMs model the marginal pdf and do not explicitly model dynamics.
  – Adding differential features to GMMs still only allows modeling of a linear dynamic system and does not directly model nonlinear dynamics.
  – Extraction of features that explicitly model the nonlinear dynamics of speech [May, 2008] has not provided significant improvements that are robust to noise or mismatched training conditions.
• Goal: directly capture dynamic information in our acoustic model.

Page 4: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Nonlinear Dynamic Invariants as Features (May, 2008)

• Nonlinear features: quantify the amount of nonlinearity in the signal (e.g., Lyapunov exponents, fractal dimension, and correlation entropy).

• Pros: invariant to signal transformations such as noise distortions and channel effects (at least in theory). This leads to the expectation that nonlinear invariants could be noise-robust.

• Cons: contrary to expectations, experiments with continuous speech data show that performance with invariants is not noise-robust (May, 2008). While there is an 11.1% relative increase in recognition performance on the clean evaluation set, a 7.6% relative decrease is observed on the noisy set.

• Hypotheses:
  – (Theoretical) Even when estimated correctly, the amount of nonlinearity is only a very crude measure of dynamic information. Invariant values can be similar even when the signals are quite different. This places a limit on the gain achievable by using nonlinear invariants as features.
  – (Practical) Invariant feature estimation algorithms are very sensitive to changes in channel conditions – almost never "invariant."

Page 5: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

A Mixture Autoregressive Model (MAR)

A MAR process (Wong and Li, 2000) of order p with m components:

$$x[n] = \begin{cases} a_{1,0} + \sum_{i=1}^{p} a_{1,i}\,x[n-i] + \varepsilon_1[n] & \text{w.p. } w_1 \\ a_{2,0} + \sum_{i=1}^{p} a_{2,i}\,x[n-i] + \varepsilon_2[n] & \text{w.p. } w_2 \\ \quad\vdots \\ a_{m,0} + \sum_{i=1}^{p} a_{m,i}\,x[n-i] + \varepsilon_m[n] & \text{w.p. } w_m \end{cases}$$

where:

ε_i : zero-mean Gaussian noise with variance σ_i²

"w.p. w_i" : with probability w_i

a_{i,j} (j > 0) : AR predictor coefficients

a_{i,0} : mean term for component i

The model is a weighted sum of Gaussian AR processes; the additional components are what introduce the nonlinearity.
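To make the synthesis view concrete, here is a minimal sketch (in Python with NumPy; all parameter values in the demo call are hypothetical, not taken from the paper) that draws samples from a scalar MAR process by selecting one AR component per time step with probability w_i:

```python
import numpy as np

def sample_mar(weights, coeffs, sigmas, n_samples, rng=None):
    """Draw n_samples from a scalar MAR process.

    weights : (m,) mixture weights w_i (must sum to 1)
    coeffs  : (m, p+1) rows [a_{i,0}, a_{i,1}, ..., a_{i,p}]
    sigmas  : (m,) standard deviations of the Gaussian noise eps_i
    """
    rng = np.random.default_rng() if rng is None else rng
    weights, coeffs, sigmas = map(np.asarray, (weights, coeffs, sigmas))
    p = coeffs.shape[1] - 1
    x = np.zeros(n_samples + p)                     # zero initial history for start-up
    for n in range(p, n_samples + p):
        i = rng.choice(len(weights), p=weights)     # pick component i w.p. w_i
        past = x[n - p:n][::-1]                     # x[n-1], ..., x[n-p]
        mean = coeffs[i, 0] + coeffs[i, 1:] @ past  # a_{i,0} + sum_j a_{i,j} x[n-j]
        x[n] = mean + sigmas[i] * rng.standard_normal()
    return x[p:]

# Hypothetical order-1, two-component MAR:
x = sample_mar([0.4, 0.6], [[0.0, 0.5], [1.0, -0.3]], [0.2, 0.1], n_samples=1000)
```

Setting p = 0 (a single column in coeffs) makes every component a plain Gaussian, which is exactly the GMM special case discussed on the next slide.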

Page 6: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Interpretations

• MAR as a weighted sum of Gaussian autoregressive (AR) processes:
  – Analysis/modeling: the conditional distribution is modeled as a weighted combination of AR filters with Gaussian white-noise inputs (hence, this models data as colored non-Gaussian signals).
  – Synthesis: a random process in which the data sampled at any point in time is generated from one of the components of an AR mixture process, chosen randomly according to its weight w_i.
• Comparisons to GMMs:
  – GMMs model the marginal distribution as a weighted sum of Gaussian random variables.
  – A MAR with AR filter order 0 is equivalent to a GMM.
  – A MAR can capture information in the dynamics in addition to the static information captured by a GMM.

Page 7: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Overview of the MAR-HMM Approach

[Diagram: a three-state HMM (S1, S2, S3); each state's output distribution is a two-component mixture of AR filters 1/A1(z) and 1/A2(z).]

• An example MAR-HMM system with 3 states: each state observation is modeled by a 2-component MAR.

Page 8: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

MAR Properties

• Ability to model nonlinearities:
  – Each AR component is linear.
  – Probabilistic mixing of the AR processes leads to nonlinearities.
• Dynamics of the probability distribution:
  – In a GMM: the distribution is invariant to past samples; only the marginal distribution is modeled.
  – In a MAR: the conditional distribution varies with past samples; the joint distribution of the random process is modeled; nonlinear evolution of the time series is modeled using a probabilistic mixing of linear AR processes.

Page 9: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

MAR Parameter Estimation Using EM

• E-step: the conditional probability that sample x[n] came from mixture component l is

$$\gamma_l[n] = \frac{w_l\, p_l(x[n] \mid \cdot)}{\sum_{k=1}^{m} w_k\, p_k(x[n] \mid \cdot)}$$

where:

$$p_l(x[n] \mid \cdot) = \frac{1}{\sqrt{2\pi\sigma_l^2}} \exp\!\left( -\frac{\left( x[n] - a_{l,0} - \sum_{i=1}^{p} a_{l,i}\, x[n-i] \right)^2}{2\sigma_l^2} \right)$$

• In a GMM-HMM, a_{l,i} = 0 for i > 0, while a_{l,0} corresponds to the mean of component l. This reduces to the familiar EM equations.
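A minimal NumPy sketch of this E-step (the variable names are ours; gamma[:, :p] is left at zero since the first p samples have no complete history):

```python
import numpy as np

def mar_e_step(x, weights, coeffs, sigmas):
    """Compute responsibilities gamma_l[n] for a scalar MAR model."""
    m, N = coeffs.shape[0], len(x)
    p = coeffs.shape[1] - 1
    gamma = np.zeros((m, N))
    for n in range(p, N):
        past = x[n - p:n][::-1]                      # x[n-1], ..., x[n-p]
        means = coeffs[:, 0] + coeffs[:, 1:] @ past  # a_{l,0} + sum_i a_{l,i} x[n-i]
        # conditional Gaussian likelihood p_l(x[n] | past) for every component
        lik = np.exp(-0.5 * ((x[n] - means) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
        gamma[:, n] = weights * lik / np.sum(weights * lik)
    return gamma
```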

Page 10: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

MAR Parameter Estimation Using EM

• M-step parameter updates:

$$\hat{w}_l = \frac{1}{N - p} \sum_{n=p+1}^{N} \gamma_l[n]$$

$$\hat{A}_l = R_l^{-1} r_l$$

$$\hat{\sigma}_l^2 = \frac{\sum_{n=p+1}^{N} \gamma_l[n] \left( x[n] - \hat{a}_{l,0} - \sum_{i=1}^{p} \hat{a}_{l,i}\, x[n-i] \right)^2}{\sum_{n=p+1}^{N} \gamma_l[n]}$$

where

$$R_l = \sum_{n=p+1}^{N} \gamma_l[n]\, X_{n-1} X_{n-1}^{T}, \qquad r_l = \sum_{n=p+1}^{N} \gamma_l[n]\, X_{n-1}\, x[n], \qquad X_{n-1} = \left[\, 1,\; x[n-1],\; x[n-2],\; \ldots,\; x[n-p] \,\right]^{T}.$$

• Again, for a GMM-HMM, a_{l,i} = 0 for i > 0 and a_{l,0} is the component mean, reducing to the familiar EM equations.
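The same updates written as a weighted least-squares sketch (it pairs with mar_e_step above; again a sketch under our naming, not the authors' implementation):

```python
import numpy as np

def mar_m_step(x, gamma, p):
    """Re-estimate (weights, coeffs, sigmas) from responsibilities gamma."""
    m, N = gamma.shape
    # w_l = sum_{n=p+1}^{N} gamma_l[n] / (N - p)
    weights = gamma[:, p:].sum(axis=1) / (N - p)
    # design vectors X_{n-1} = [1, x[n-1], ..., x[n-p]]^T, one row per frame
    X = np.stack([np.concatenate(([1.0], x[n - p:n][::-1])) for n in range(p, N)])
    coeffs, sigmas = np.zeros((m, p + 1)), np.zeros(m)
    for l in range(m):
        g = gamma[l, p:]
        R = (X * g[:, None]).T @ X         # R_l = sum_n gamma_l[n] X_{n-1} X_{n-1}^T
        r = X.T @ (g * x[p:])              # r_l = sum_n gamma_l[n] X_{n-1} x[n]
        coeffs[l] = np.linalg.solve(R, r)  # A_l = R_l^{-1} r_l
        resid = x[p:] - X @ coeffs[l]      # x[n] - a_{l,0} - sum_i a_{l,i} x[n-i]
        sigmas[l] = np.sqrt((g * resid ** 2).sum() / g.sum())
    return weights, coeffs, sigmas
```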

Page 11: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Vector Extension of MAR

• The basic formulation of MAR was only for scalar time series.
• How can we extend MAR to handle vector time series (e.g., MFCCs) without fundamental modifications to the EM equations?
• Solution: tie the MAR mixture components across scalar components, using a single weight estimated by combining the likelihoods of all scalar components for each mixture (see the sketch below):

$$\hat{w}_l = \frac{\sum_{d=1}^{D} \sum_{n=p+1}^{N} \gamma_{l,d}[n]}{\sum_{k=1}^{m} \sum_{d=1}^{D} \sum_{n=p+1}^{N} \gamma_{k,d}[n]}$$

• Pros: only a simple modification to the weight update during EM training.
• Cons: assumes the scalar components are decoupled (except for the shared mixture weight parameter). This is similar to the assumption of a diagonal covariance in GMMs – a very common assumption for MFCC features.
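In code, the tie changes only the weight update: responsibilities are computed independently per feature dimension (e.g., with mar_e_step above, run on each dimension with that dimension's own AR parameters) and then pooled. A sketch:

```python
import numpy as np

def tied_weight_update(gammas, p):
    """Shared-weight update for the vector extension.

    gammas : list of (m, N) responsibility arrays, one per feature dimension d.
    """
    # numerator: sum gamma_{l,d}[n] over dimensions d and frames n for each l;
    # dividing by the sum over components k normalizes the weights to sum to 1
    pooled = sum(g[:, p:].sum(axis=1) for g in gammas)
    return pooled / pooled.sum()
```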

Page 12: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Relationship to Other AR-HMMs

• (Poritz, 1980) Linear Predictive HMM: Equivalent to MAR-HMM with a single mixture component.

• (Juang and Rabiner, 1985) Mixture Autoregressive HMM: somewhat similar to the model investigated here (even identical in name!), but with two major differences: component means were assumed to be zero, and variances were assumed to be equal across components.

• (Ephraim and Roberts, 2005) Switching AR-HMM: restricted to a single-component MAR-HMM.

• All the above were applied to scalar time series with speech samples as observations.

• Our vector extension of MAR has been applied to modeling of the MFCC vector feature stream.

Page 13: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

(Synthetic) Signal Classification – Experimental Design

• Two-class problem.
• Same marginal distribution, different dynamics:
  – Signal 1: no dependence on past samples.
  – Signal 2: nonlinear dependence on past samples.
• Signal parameters for the two classes: Signal 1 is a four-component mixture of Gaussians (a MAR with p = 0, m = 4, and weights w = (0.2, 0.2, 0.3, 0.3)); Signal 2 is a two-component MAR of order p = 1 (m = 2, w = (0.4, 0.6)). The component means and variances were chosen so that the two classes share the same marginal distribution. A data generator of this shape is sketched below.
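A minimal sketch of how such a two-class set can be generated with the sample_mar helper defined earlier; every parameter value here is a placeholder that only mimics the design (class 1: order 0, hence no temporal dependence; class 2: order 1), not the values used in the paper:

```python
# Class 1: order-0 MAR (a plain Gaussian mixture, no dependence on the past)
class1 = dict(weights=[0.2, 0.2, 0.3, 0.3],
              coeffs=[[1.3], [0.7], [-0.7], [-1.3]],  # means only (p = 0)
              sigmas=[0.4, 0.6, 0.6, 0.4])
# Class 2: order-1 MAR (conditional distribution depends on x[n-1])
class2 = dict(weights=[0.4, 0.6],
              coeffs=[[1.0, 0.2], [-1.0, -0.2]],      # rows [a_{i,0}, a_{i,1}]
              sigmas=[0.5, 0.2])

# hypothetical corpus: 100 sequences of 500 samples per class
train = {1: [sample_mar(n_samples=500, **class1) for _ in range(100)],
         2: [sample_mar(n_samples=500, **class2) for _ in range(100)]}
```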

Page 14: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

(Synthetic) Signal Classification – Results

Table 1: Classification accuracy (%) on the synthetic two-class task; parenthesized values give the number of model parameters.

# mixtures | GMM (static only) | MAR (static only) | GMM (static+∆+∆∆) | MAR (static+∆+∆∆)
-----------|-------------------|-------------------|-------------------|------------------
2          | 47.5 (6)          | 100.0 (8)         | 82.5 (14)         | 100.0 (20)
4          | 52.5 (12)         | 100.0 (16)        | 85.0 (28)         | 100.0 (40)

• Static-only features: the GMM can only do as well as a random guess (because both signals have the same marginal distribution); MAR uses the distinct signal dynamics to yield 100% classification accuracy.
• Static+∆+∆∆: GMM performance improves significantly due to the addition of dynamic information. Yet MAR outperforms the GMM because it can capture the nonlinear dynamics in the signals.

Page 15: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Sustained Phone Classification – Experimental Design

• Collected artificially elongated pronunciations of three sounds: vowel /aa/, nasal /m/, sibilant /sh/, sampled at 16 kHz.

• For this preliminary study, we wanted to avoid artifacts introduced by coarticulation.

• Static feature database: 13 MFCCs (including energy) extracted at a frame rate of 10 ms and window duration of 25 ms.

• Static+∆+∆∆ database: static + velocity + acceleration MFCC coefficients.
• Training: 70 recordings of each phone.
• Testing: 30 recordings of each phone.
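For reference, a feature pipeline of this shape can be sketched with librosa (our choice of library; the filename and any front-end settings beyond the numbers above are assumptions):

```python
import numpy as np
import librosa

# 13 MFCCs at a 10 ms frame rate (hop of 160 samples) with a 25 ms window
# (400 samples) for 16 kHz audio; "phone.wav" is a placeholder filename.
signal, sr = librosa.load("phone.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            hop_length=160, win_length=400)
delta = librosa.feature.delta(mfcc)            # velocity coefficients
delta2 = librosa.feature.delta(mfcc, order=2)  # acceleration coefficients
features = np.vstack([mfcc, delta, delta2])    # 39-dim static+∆+∆∆
```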

[Plots: sample waveforms of the sustained phones /aa/ (AA), /m/ (M), and /sh/ (SH).]

Page 16: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Sustained Phone Classification (Static Features) – Results

Table 2: Sustained phone classification accuracy (%) with MAR and GMM using the 13 static MFCC features; parenthesized values give the number of model parameters (e.g., a 2-component diagonal-covariance GMM has 2 × (13 means + 13 variances + 1 weight) = 54; the MAR counts are consistent with an AR order of p = 1, giving 2 × (13 × 3 + 1) = 80).

# mixtures | GMM        | MAR
-----------|------------|-----------
2          | 77.8 (54)  | 83.3 (80)
4          | 86.7 (108) | 90.0 (160)
8          | 91.1 (216) | 94.4 (320)
16         | 93.3 (432) | 95.6 (640)

• For an equal number of parameters, MAR outperforms the GMM.
• MAR exploits dynamic information to model the evolution of the MFCCs within a phone, which the GMM is unable to capture.

Page 17: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Sustained Phone Classification (Static +∆+∆∆ ) – Results

Table 3: Sustained phone classification accuracy (%) with MAR and GMM using the 39 static+∆+∆∆ MFCC features; parenthesized values give the number of model parameters.

# mixtures | GMM          | MAR
-----------|--------------|------------
2          | 92.2 (158)   | 94.4 (236)
4          | 94.4 (316)   | 97.8 (472)
8          | 96.7 (632)   | 97.8 (944)
16         | 100.0 (1264) | 98.9 (1888)

• MAR performs better than the GMM when the number of parameters is equal.
• MAR performance saturates as the number of parameters increases.
• Why? The assumption during MAR training that the features are uncorrelated is invalid: the static features may be roughly decorrelated, but the ∆ features are obviously related to the static features. This causes problems for both diagonal-covariance GMMs and MAR, but is more severe for MAR due to its explicit modeling of dynamics.

Page 18: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

Summary and Future Work

Conclusions:
• Presented a novel nonlinear mixture autoregressive HMM (MAR-HMM) for speech recognition, with a vector extension.
• Demonstrated the efficacy of MAR-HMM over GMM-HMM for signal classification on a two-class problem using synthetic signals.
• Showed the superiority of MAR-HMM over GMM-HMM for sustained phone classification using static MFCC features.
• Experiments with static+∆+∆∆ features were not as conclusive – MAR-HMM performance saturated more quickly than GMM-HMM.

Future Work:
• Perform large-scale phone recognition experiments to study MAR-HMM performance in more practical scenarios involving noise (e.g., Aurora-4).
• Investigate whether phone-dependent decorrelation (class-based PCA) of the features improves MAR-HMM performance.

Page 19: Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

References

1. D. May, Nonlinear Dynamic Invariants For Continuous Speech Recognition, M.S. Thesis, Department of Electrical and Computer Engineering, Mississippi State University, May 2008.

2. Wong, C. S. and Li, W. K., "On a Mixture Autoregressive Model," Journal of the Royal Statistical Society, Series B, vol. 62, no. 1, pp. 95-115, Feb. 2000.

3. Poritz, A. B., “Linear Predictive Hidden Markov Models,” Proceedings of the Symposium on the Application of Hidden Markov Models to Text and Speech, Princeton, New Jersey, USA, pp. 88-142, Oct. 1980.

4. Juang, B. H., and Rabiner, L. R., “Mixture Autoregressive Hidden Markov Models for Speech Signals,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 6, pp. 1404-1413, Dec. 1985.

5. Ephraim, Y. and Roberts, W. J. J., “Revisiting Autoregressive Hidden Markov Modeling of Speech Signals,” IEEE Signal Processing Letters, vol. 12, no. 2, pp. 166-169, Feb. 2005.