Motivation
Standard acoustic models impose many structural constraints
We propose an automatic approach
Use TIMIT Dataset MFCC features Full covariance Gaussians (Young and Woodland, 1994)
Standard subphone/mixture HMM
Temporal Structure
Gaussian Mixtures
Model Error rate
HMM Baseline 25.1%
Phone Classification Results
Method Error Rate
GMM Baseline (Sha and Saul, 2006) 26.0 %
HMM Baseline (Gunawardana et al., 2005) 25.1 %
SVM (Clarkson and Moreno, 1999) 22.4 %
Hidden CRF (Gunawardana et al., 2005) 21.7 %
Our Work 21.4 %
Large Margin GMM (Sha and Saul, 2006) 21.1 %
Hierarchical Refinement Results
0.24
0.26
0.28
0.3
0.32
0.34
0.36
0.38
0 500 1000 1500 2000
Number of States
Error Rate
Split and Merge, Automatic Alignment Split Only
HMM Baseline 41.7%
5 Split Rounds 28.4%
Merging
Not all phones are equally complex Compute log likelihood loss from merging
Split model Merged at one node
t-1 t t+1 t-1 t t+1
Split and Merge Results
0.24
0.26
0.28
0.3
0.32
0.34
0.36
0.38
0 500 1000 1500 2000
Number of States
Error Rate
Split and Merge Split Only
Split Only 28.4%
Split & Merge 27.3%
0
5
10
15
20
25
30
35
ae ao ay eh er ey ih f r s sil aa ah ix iy z cl k sh n
vcl ow l
m t v
uw aw ax ch w th el dh uh p
en oy hh jh ng y b d dx g zh epi
HMM states per phone
ey eh ao
0
5
10
15
20
25
30
35
ae ao ay eh er ey ih f r s sil aa ah ix iy z cl k sh n
vcl ow l
m t v
uw aw ax ch w th el dh uh p
en oy hh jh ng y b d dx g zh epi
HMM states per phone
g d b
0
5
10
15
20
25
30
35
ae ao ay eh er ey ih f r s sil aa ah ix iy z cl k sh n
vcl ow l
m t v
uw aw ax ch w th el dh uh p
en oy hh jh ng y b d dx g zh epi
HMM states per phone
Alignment
0.24
0.26
0.28
0.3
0.32
0.34
0.36
0.38
0 500 1000 1500 2000
Number of States
Error Rate
Split and Merge Split Only Split and Merge, Automatic Alignment
Hand Aligned 27.3%
Auto Aligned 26.3%
Results
0
5
10
15
20
25
30
35
ae ao ay eh er ey ih aa ah ix iy ow uw aw ax el uh en oy f r s z k sh n l m t v ch w th dh
p hh jh ng
y b d dx g zh sil cl vcl epi
Hand Aligned Auto Aligned
Alignment State Distribution
Inference
State sequence: d1-d6-d6-d4-ae5-ae2-ae3-ae0-d2-d2-d3-d7-d5
Phone sequence:d - d - d -d -ae - ae - ae - ae - d - d -d - d - d
Transcription d - ae - d
Viterbi
Variational
???
Variational Inference
Variational Approximation:
Viterbi 26.3%
Variational 25.1%
: Posterior edge marginals
Solution:
Phone Recognition Results
Method Error Rate
State-Tied Triphone HMM (HTK)
(Young and Woodland, 1994)27.7 %
Gender Dependent Triphone HMM
(Lamel and Gauvain, 1993) 27.1 %
Our Work 26.1 %
Bayesian Triphone HMM
(Ming and Smith, 1998) 25.6 %
Heterogeneous classifiers
(Halberstadt and Glass, 1998) 24.4 %
Conclusions
Minimalist, Automatic Approach Unconstrained Accurate
Phone Classification Competitive with state-of-the-art discriminative
methods despite being generative
Phone Recognition Better than standard state-tied triphone models
Top Related