Speech Communication Lab, State University of New York at Binghamton
Dimensionality Reduction Methods for HMM Phonetic Recognition Hongbing Hu, Stephen A. Zahorian
Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902, [email protected], [email protected]
Introduction
Goals for accurate Automatic Speech Recognition (ASR):
- Highly discriminative features
- Incorporation of nonlinear frequency scales and time dependency
- Low-dimensionality feature spaces
- Efficient recognition models (HMMs & Neural Networks)
Neural Network Based Dimensionality Reduction
- Neural Networks (NNs) are used to represent complex data while preserving the variability and discriminability of the original data
- Combined with an HMM recognizer to form a hybrid NN/HMM recognition model
NLDA Reduction Overview
Nonlinear Discriminant Analysis (NLDA):
- A multilayer neural network performs a nonlinear feature transformation of the input speech features
- Phone models are built for the transformed features using HMMs, with each state modeled by a GMM (Gaussian Mixture Model)
- PCA performs a Karhunen-Loeve (KL) transform to reduce the correlation of the network outputs
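The KL/PCA decorrelation step above can be sketched with numpy. This is a minimal illustration on synthetic data, not the authors' code; the function name `pca_decorrelate` and the standard-normal stand-in for the network outputs are assumptions for the example.

```python
import numpy as np

def pca_decorrelate(X, n_components=None):
    """KL (PCA) transform: project mean-centered data onto the
    eigenvectors of its covariance matrix, decorrelating the columns.
    Optionally keep only the top n_components directions."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort directions by variance
    W = eigvecs[:, order[:n_components]]
    return Xc @ W

# Synthetic stand-in for network outputs (1000 frames x 48 outputs)
rng = np.random.default_rng(0)
outputs = rng.normal(size=(1000, 48))
reduced = pca_decorrelate(outputs, n_components=36)
print(reduced.shape)  # (1000, 36)
```

After the transform, the sample covariance of `reduced` is diagonal, which is the decorrelation property the poster relies on.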
NLDA1 Method
- Dimensionality Reduced Features: obtained at the output layer of the neural network; feature dimensionality is further reduced by PCA
- Node Nonlinearity (Activation Function): in the feature transformation, a linear function for the output layer and a sigmoid nonlinearity for the other layers; in NLDA1 training, all layers are nonlinear
Experimental Setup
TIMIT Database ('SI' and 'SX' sentences only)
- 48-phoneme set mapped down from the 62-phoneme set
- Training data: 3696 sentences (460 speakers)
- Testing data: 1344 sentences (168 speakers)
DCTC/DCSC Features
- A total of 78 features (13 DCTCs x 6 DCSCs) were computed
- 10 ms frames with 2 ms frame spacing, and 8 ms block spacing with 1 s block length
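The feature-dimension arithmetic above can be checked with a short sketch. The frames-per-block count uses one common convention (frames must fit entirely inside the block); that convention is an assumption here, while the 13 x 6 = 78 layout is stated on the poster.

```python
# Hedged sketch of the DCTC/DCSC feature dimensions described above.
frame_len_ms, frame_spacing_ms = 10, 2     # 10 ms frames, 2 ms hop
block_len_ms, block_spacing_ms = 1000, 8   # 1 s blocks, 8 ms block hop
n_dctc, n_dcsc = 13, 6                     # 13 DCTCs x 6 DCSC basis vectors

# Assumed convention: count frames that fit fully inside one block
frames_per_block = (block_len_ms - frame_len_ms) // frame_spacing_ms + 1

# Each DCTC trajectory over the block is summarized by 6 DCSC coefficients
n_features = n_dctc * n_dcsc
print(frames_per_block, n_features)  # 496 78
```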
Conclusions
- Very high recognition accuracies were obtained using the outputs of the network middle layer, as in NLDA2
- The NLDA methods are able to produce a low-dimensional, effective representation of speech features
HMM
- 3-state left-to-right Markov models with no skip
- 48 monophone HMMs were created using HTK (ver 3.4)
- Language model: phone bigram information of the training data

Neural Networks in NLDA
- 3 hidden layers: 500-36-500 nodes
- Input layer: 78 nodes, corresponding to the feature dimensionality
- Output layer: 48 nodes for the phoneme targets, or 144 nodes for the state-level targets
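The 78 / 500-36-500 / 48 architecture above can be sketched as a plain numpy forward pass. This is an illustrative stand-in with random weights, not the trained network; the helper names and initialization scale are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Layer sizes from the poster: 78 inputs, 500-36-500 hidden, 48 phone targets
sizes = [78, 500, 36, 500, 48]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Return activations of every layer. Hidden layers use sigmoid;
    the final layer is left linear (as the poster describes for the
    NLDA1 feature transformation)."""
    acts = [x]
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = acts[-1] @ W + b
        acts.append(z if i == len(weights) - 1 else sigmoid(z))
    return acts

x = rng.normal(size=(5, 78))        # 5 frames of 78-dim input features
acts = forward(x)
middle = acts[2]                    # 36-dim middle layer (the NLDA2 features)
print(middle.shape, acts[-1].shape) # (5, 36) (5, 48)
```

Note that the 36-node middle layer is what NLDA2 taps for its reduced features, while NLDA1 uses the 48-dim output layer.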
Training Neural Networks
Phone Level Targets
- Each NN output corresponds to a specific phone
- Straightforward to implement using a phonetically labeled training database
- But why should an NN output be forced to the same value for the entire phone?
State Level Targets
- Each NN output corresponds to a single state of a phone HMM
- But how to determine state boundaries?
o Estimate using a percentage of the total length
o Use an initial training iteration, then Viterbi alignment
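The first boundary-estimation option above (a fixed percentage of the phone length) can be sketched as follows; the 1:4:1 ratio is the one used later on the poster, and the function name is illustrative.

```python
def split_states(start, end, ratios=(1, 4, 1)):
    """Split a phone segment [start, end) into HMM-state sub-segments
    using a fixed length ratio (the poster uses 1:4:1 for 3 states)."""
    total = sum(ratios)
    length = end - start
    bounds, acc = [start], 0
    for r in ratios[:-1]:
        acc += r
        bounds.append(start + round(length * acc / total))
    bounds.append(end)
    return list(zip(bounds[:-1], bounds[1:]))

# A 60-frame phone splits into 10 / 40 / 10 frames for states 1-3
print(split_states(0, 60))  # [(0, 10), (10, 50), (50, 60)]
```

In the second option, these rough boundaries would only seed an initial training pass, after which Viterbi alignment re-estimates them.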
[Block diagram (NLDA overview): Speech Features -> Neural Network (trained with Training Targets) + PCA -> Transformed Features -> HMM Recognizer -> Phonemes]
[Block diagram (NLDA1): Original Features -> Neural Network -> Network Outputs -> PCA -> Dimensionality Reduced Features]
NLDA2 Method
Reduced Features
- Use the outputs of the network's middle hidden layer
- The reduced dimensionality is determined by the number of middle nodes, giving flexibility in the reduced feature dimensionality
- Linear PCA is used only for feature decorrelation
Nonlinearity
- Nonlinear activations are used in all layers, in both the feature transformation and network training
[Block diagram (NLDA2): middle hidden layer -> Dimensionality Reduced Outputs -> PCA -> Dimensionality Reduced Features]
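The "PCA only for decorrelation" point in NLDA2 can be illustrated by rotating the (already 36-dim) middle-layer outputs onto their covariance eigenbasis without dropping any dimensions. The correlated synthetic data stands in for real middle-layer activations; it is an assumption for the example.

```python
import numpy as np

def decorrelate(X):
    """PCA used purely for decorrelation: rotate onto the covariance
    eigenbasis but keep every dimension (no reduction)."""
    Xc = X - X.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ eigvecs

rng = np.random.default_rng(1)
# Synthetic, deliberately correlated stand-in for 36-dim middle-layer outputs
mid = rng.normal(size=(500, 36)) @ rng.normal(size=(36, 36))
dec = decorrelate(mid)
print(dec.shape)  # (500, 36)
```

Keeping all 36 dimensions matches the poster's statement that the reduction itself is done by the middle layer's width, with PCA contributing only decorrelation for the diagonal-covariance GMMs.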
Experimental Results
Control Experiment
- Compare the original DCTC/DCSC features with the PCA- and LDA-reduced features
- Use various numbers of mixtures in the HMMs
- The original 78-dimensional features yield the highest accuracy of 73.2% using 64-mixture HMMs
[Figure: Accuracies using the original, PCA- and LDA-reduced features (20 & 36 dimensions); accuracy (%) vs. number of mixtures (3, 8, 16, 32, 64); series: Original, PCA (36), PCA (20), LDA (36), LDA (20)]
NLDA Experiment
- Evaluate NLDA1 and NLDA2 with and without PCA
- 48-dimensional phoneme-level targets used; features reduced to 36 dimensions
- The middle-layer outputs of the network result in more effective features in a reduced space
- The accuracies improved by about 2% with PCA
Accuracies using the NLDA1 and NLDA2 reduced features, and the reduced features without PCA processing
State Level Target Exp 2
- Use a fixed length ratio and the Viterbi alignment for the state targets
- State-level targets with "Don't cares" used
- Targets obtained using a fixed length ratio (3 states: 1:4:1) and the Viterbi alignment
- Network training: 4x10^7 weight updates
Accuracies; "(R)" indicates a fixed length ratio and "(A)" the Viterbi forced alignment
[Figure: accuracy (%) vs. number of mixtures (3, 8, 16, 32, 64); series: NLDA1, NLDA1 w/o PCA, NLDA2, NLDA2 w/o PCA]
[Figure: accuracy (%) vs. number of mixtures (3, 8, 16, 32, 64); series: NLDA1 (R), NLDA1 (A), NLDA2 (R), NLDA2 (A)]
Literature Comparison
Recognition accuracy based on TIMIT

Feature    | Recognizer | Acc. (%) | Study
MFCC       | HMM        | 68.5     | Somervuo (2003)
PLP        | MLP-GMM    | 71.5     | Ketabdar et al. (2008)
LPC        | HMM-MLP    | 74.6     | Pinto et al. (2008)
MFCC       | Tandem NN  | 78.5     | Schwarz et al. (2006)
DCTC/DCSC  | HMM        | 73.9     | Zahorian et al. (2009)
DCTC/DCSC  | NN-HMM     | 74.9     | This study
State Level Target Exp 1
- Compare the state-level targets with and without "Don't cares"
- 144-dimensional state-level targets used
- State boundaries obtained using the fixed state length method (3 states: 1:4:1)
- Network training: 8x10^6 weight updates
Accuracies using the state-level targets with and without "Don't cares"
- The state-level targets with "Don't cares" result in higher accuracies
- The NLDA2 reduced features achieved a substantial improvement over the original features
[Diagram labels: "Dimensionality Reduction", "Decorrelation" - the two roles of the reduction stage]
[Figure: accuracy (%) vs. number of mixtures (3, 8, 16, 32, 64); series: NLDA1, NLDA1 w/o DC, NLDA2, NLDA2 w/o DC]
[Figure: Phone level target vector (Phone 4) over output indices 0-48, and state level target vectors (Phone 4: States 1, 2, 3) over output indices up to 144, with target values 0/1 and "Don't Care" entries]
For 3-state models, train using "Don't Cares":
o For the 1st portion, the target is "1" for state 1 and "Don't Care" for states 2 and 3
o For the 2nd portion, the target is "1" for state 2 and "Don't Care" for states 1 and 3
o For the 3rd portion, the target is "1" for state 3 and "Don't Care" for states 1 and 2
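The don't-care targets above can be sketched as a target vector plus a care mask that gates the training error. The blocked layout (state 1 in outputs 0-47, state 2 in 48-95, state 3 in 96-143) is an assumption read off the figure's 48/96/144 axis, and the function name is illustrative.

```python
import numpy as np

def state_targets(phone_idx, state_idx, n_phones=48, n_states=3):
    """Build a 144-dim state-level target vector with don't-cares.
    The active phone's current state gets target 1; its other states
    are marked don't-care (error ignored there); all other entries are 0."""
    dim = n_phones * n_states
    target = np.zeros(dim)
    care = np.ones(dim, dtype=bool)
    for s in range(n_states):
        pos = s * n_phones + phone_idx  # assumed layout: states in blocks of 48
        if s == state_idx:
            target[pos] = 1.0
        else:
            care[pos] = False           # don't-care: no error backpropagated
    return target, care

# Frame in the 2nd portion of phone 4: state 2 is "1",
# states 1 and 3 of phone 4 are don't-cares
t, care = state_targets(phone_idx=4, state_idx=1)
# During training, squared error would be accumulated only where care is True
```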