
Speech Communication Lab, State University of New York at Binghamton

Dimensionality Reduction Methods for HMM Phonetic Recognition
Hongbing Hu, Stephen A. Zahorian

Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902, USA
[email protected], [email protected]

Introduction
Accurate Automatic Speech Recognition (ASR) requires:
Highly discriminative features
 » Incorporate nonlinear frequency scales and time dependency
 » Low dimensionality feature spaces
Efficient recognition models (HMMs & Neural Networks)

Neural Network Based Dimensionality Reduction
Neural Networks (NNs) are used to represent complex data while preserving the variability and discriminability of the original data
Combined with an HMM recognizer to form a hybrid NN/HMM recognition model

NLDA Reduction Overview

Nonlinear Discriminant Analysis (NLDA)
A multilayer neural network performs a nonlinear feature transformation of the input speech features
Phone models for the transformed features use HMMs, with each state modeled by a GMM (Gaussian Mixture Model)
PCA performs a Karhunen-Loeve (KL) transform to reduce the correlation of the network outputs (see the sketch below)
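A minimal sketch of this PCA/KL decorrelation step, assuming NumPy and a frames-by-dimensions matrix of network outputs; this is illustrative only, not the authors' implementation, and the function name and arguments are assumptions:

```python
import numpy as np

def pca_decorrelate(outputs, n_components=None):
    """outputs: (n_frames, n_dims) network outputs; returns decorrelated features."""
    mean = outputs.mean(axis=0)
    centered = outputs - mean
    cov = np.cov(centered, rowvar=False)        # covariance across output dimensions
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric matrix -> eigendecomposition
    order = np.argsort(eigvals)[::-1]           # sort basis vectors by decreasing variance
    basis = eigvecs[:, order]
    if n_components is not None:                # optional dimensionality reduction (NLDA1)
        basis = basis[:, :n_components]
    return centered @ basis, mean, basis        # projected features plus PCA parameters
```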

NLDA1 Method

Dimensionality Reduced Features
Obtained at the output layer of the neural network
Feature dimensionality is further reduced by PCA (see the sketch below)
Node Nonlinearity (Activation Function)
In the feature transformation, a linear function is used for the output layer and a Sigmoid nonlinearity for the other layers
In NLDA1 training, all layers are nonlinear
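A rough NumPy sketch of NLDA1 feature extraction under the assumptions stated above (sigmoid hidden layers, linear output layer at transform time, PCA afterwards); parameter names are illustrative, not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nlda1_features(x, weights, biases, pca_mean, pca_basis):
    """x: (n_frames, 78) input features; weights/biases: per-layer parameters."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)                   # sigmoid hidden layers
    out = h @ weights[-1] + biases[-1]           # linear output layer at transform time
    return (out - pca_mean) @ pca_basis          # further reduction/decorrelation by PCA
```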

Experimental Setup
TIMIT Database (‘SI’ and ‘SX’ only)
48-phoneme set mapped down from the 62-phoneme set
Training Data: 3696 sentences (460 speakers)
Testing Data: 1344 sentences (168 speakers)
DCTC/DCSC Features
A total of 78 features (13 DCTCs x 6 DCSCs) were computed
10 ms frames with 2 ms frame spacing; 8 ms block spacing with 1 s block length (a rough feature sketch follows below)
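A loose illustrative sketch of how 13 x 6 = 78 features per block could be computed with the stated frame/block settings; the actual DCTC/DCSC features use nonlinear frequency and time warping basis vectors that are omitted here, and the plain DCTs and parameter names below are assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def dctc_dcsc_sketch(signal, fs=16000, n_dctc=13, n_dcsc=6,
                     frame_ms=10, frame_step_ms=2, block_s=1.0, block_step_ms=8):
    frame_len = int(fs * frame_ms / 1000)
    frame_step = int(fs * frame_step_ms / 1000)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, frame_step)]
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)
    dctcs = dct(log_spec, axis=1, norm='ortho')[:, :n_dctc]    # DCT over frequency (DCTC)

    frames_per_block = int(block_s * 1000 / frame_step_ms)     # frames in a 1 s block
    block_step = int(block_step_ms / frame_step_ms)            # 8 ms block spacing in frames
    feats = []
    for start in range(0, len(dctcs) - frames_per_block, block_step):
        block = dctcs[start:start + frames_per_block]
        dcscs = dct(block, axis=0, norm='ortho')[:n_dcsc]      # DCT over time (DCSC)
        feats.append(dcscs.T.reshape(-1))                      # 13 DCTCs x 6 DCSCs = 78
    return np.array(feats)
```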

Conclusions
Very high recognition accuracies were obtained using the outputs of the network middle layer, as in NLDA2
The NLDA methods are able to produce a low-dimensional, effective representation of speech features

HMM
3-state left-to-right Markov models with no skip (see the topology sketch below)
48 monophone HMMs were created using HTK (ver 3.4)
Language model: phone bigram information from the training data
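To make the left-to-right, no-skip topology explicit, here is the transition structure for one 3-emitting-state model with HTK-style non-emitting entry/exit states; the probability values are illustrative initial guesses, not the trained values:

```python
import numpy as np

trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # entry -> state 1
    [0.0, 0.6, 0.4, 0.0, 0.0],   # state 1: self-loop or advance to state 2
    [0.0, 0.0, 0.6, 0.4, 0.0],   # state 2: self-loop or advance to state 3
    [0.0, 0.0, 0.0, 0.6, 0.4],   # state 3: self-loop or exit
    [0.0, 0.0, 0.0, 0.0, 0.0],   # exit (non-emitting)
])
```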

Neural Networks in NLDA
3 hidden layers: 500-36-500 nodes
Input layer: 78 nodes, corresponding to the feature dimensionality
Output layer: 48 nodes for the phoneme targets, or 144 nodes for the state level targets (layer sizes sketched below)
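A toy initialization just to make the 78-500-36-500-{48 or 144} architecture concrete; the function name and random initialization are assumptions for illustration and connect with the extraction sketches elsewhere on this poster:

```python
import numpy as np

def init_nlda_network(n_targets=48, seed=0):
    """n_targets: 48 for phone level targets or 144 for state level targets."""
    rng = np.random.default_rng(seed)
    sizes = [78, 500, 36, 500, n_targets]       # input, three hidden layers, output
    weights = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(b) for b in sizes[1:]]
    return weights, biases
```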

Training Neural Networks
Phone Level Targets
Each NN output corresponds to a specific phone
Straightforward to implement using a phonetically labeled training database
But why should the NN output be forced to the same value for the entire phone?
State Level Targets
Each NN output corresponds to a single state of a phone HMM
But how are the state boundaries determined?
 o Estimate using a percentage of the total phone length (see the sketch below)
 o Use an initial training iteration, then Viterbi alignment
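A small sketch of the first option: splitting a labeled phone segment into three state sub-segments with the fixed 1:4:1 length ratio used in the experiments (in the second option the boundaries would instead come from a Viterbi forced alignment). Function and argument names are illustrative:

```python
def state_boundaries(start_frame, end_frame, ratio=(1, 4, 1)):
    """Split [start_frame, end_frame) into 3 state sub-segments by length ratio."""
    n = end_frame - start_frame
    total = sum(ratio)
    b1 = start_frame + round(n * ratio[0] / total)
    b2 = start_frame + round(n * (ratio[0] + ratio[1]) / total)
    return [(start_frame, b1), (b1, b2), (b2, end_frame)]
```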

[NLDA1 block diagram: original speech features → Neural Network (trained with the training targets) → network outputs → PCA → dimensionality reduced (transformed) features → HMM recognizer → phonemes]

NLDA2 Method
Reduced Features
Use the outputs of the network's middle hidden layer
The reduced dimensionality is determined by the number of middle-layer nodes, giving flexibility in the reduced feature dimensionality
Linear PCA is used only for feature decorrelation
Nonlinearity
All layers are nonlinear in both the feature transformation and network training (see the sketch below)

[NLDA2 diagram: middle hidden-layer outputs → PCA → dimensionality reduced features]
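An NLDA2-style extraction sketch under the same assumptions as the NLDA1 sketch above: stop at the 36-node middle hidden layer (index 1 for the 78-500-36-500 network) and decorrelate with PCA while keeping all 36 dimensions. Again, names and layout are illustrative rather than the authors' code:

```python
import numpy as np

def nlda2_features(x, weights, biases, pca_mean, pca_basis, middle_layer=1):
    """x: (n_frames, 78); middle_layer: index of the 36-node hidden layer."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid at every layer
        if i == middle_layer:
            return (h - pca_mean) @ pca_basis    # decorrelation only, no dimension cut
    raise ValueError("middle_layer index out of range")
```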

Experimental Results
Control Experiment
Compare the original DCTC/DCSC features with the PCA and LDA reduced features
Use various numbers of mixtures in the HMMs

Accuracies using the original, PCA and LDA reduced features (20 & 36 dimensions)

The original 78-dimensional features yield the highest accuracy of 73.2% using 64-mix HMMs

[Figure (Control Experiment): accuracy (%) vs. number of mixtures (3, 8, 16, 32, 64) for the Original, PCA (36), PCA (20), LDA (36), and LDA (20) features]

NLDA Experiment
Evaluate NLDA1 and NLDA2 with and without PCA
48-dimensional phoneme level targets used
Features reduced to 36 dimensions
Accuracies using the NLDA1 and NLDA2 reduced features, and the reduced features without PCA processing:
The middle-layer outputs of the network result in more effective features in a reduced space
Accuracies improved by about 2% with PCA

State Level Target Exp 2
Use a fixed length ratio and the Viterbi alignment for the state targets
State level targets with “Don’t cares” used
Targets obtained using a fixed length ratio (3 states: 1:4:1) and the Viterbi alignment
Network training: 4x10^7 weight updates
Accuracies; “(R)” indicates a fixed length ratio and “(A)” the Viterbi forced alignment:

[Figure (NLDA Experiment): accuracy (%) vs. number of mixtures (3, 8, 16, 32, 64) for NLDA1, NLDA1 w/o PCA, NLDA2, and NLDA2 w/o PCA]

[Figure (State Level Target Exp 2): accuracy (%) vs. number of mixtures (3, 8, 16, 32, 64) for NLDA1 (R), NLDA1 (A), NLDA2 (R), and NLDA2 (A)]

Literature Comparison
Recognition accuracy based on TIMIT:

Feature      Recognizer   Acc. (%)   Study
MFCC         HMM          68.5       Somervuo (2003)
PLP          MLP-GMM      71.5       Ketabdar et al. (2008)
LPC          HMM-MLP      74.6       Pinto et al. (2008)
MFCC         Tandem NN    78.5       Schwarz et al. (2006)
DCTC/DCSC    HMM          73.9       Zahorian et al. (2009)
DCTC/DCSC    NN-HMM       74.9       This study

State Level Target Exp 1
Compare the state level targets with and without “Don’t cares”
144-dimensional state level targets used
State boundaries obtained using the fixed state length method (3 states: 1:4:1)
Network training: 8x10^6 weight updates
Accuracies using the state level targets with and without “Don’t cares”:

The state level targets with “Don’t cares” result in higher accuracies

The NLDA2 reduced features achieved a substantial improvement versus the original features


[Figure (State Level Target Exp 1): accuracy (%) vs. number of mixtures (3, 8, 16, 32, 64) for NLDA1, NLDA1 w/o DC, NLDA2, and NLDA2 w/o DC]

[Figure: phone level target vector for phone 4 (one of the 48 outputs set to 1, the rest 0) vs. state level target vectors for phone 4, states 1-3 (one of the 144 outputs set to 1, with “Don’t Care” entries for the other states of the same phone)]

For 3-state models, train using “Don’t Cares” (a masked-error sketch follows below):
 o For the 1st portion, the target is “1” for state 1 and “Don’t Care” for states 2 and 3
 o For the 2nd portion, the target is “1” for state 2 and “Don’t Care” for states 1 and 3
 o For the 3rd portion, the target is “1” for state 3 and “Don’t Care” for states 1 and 2
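One way such “Don’t Care” training can be realized is with a masked error, where a frame contributes error only for target dimensions marked 0 or 1. The NaN encoding of “Don’t Care”, the sum-of-squares error, and the function name are assumptions for illustration, not the authors' training procedure:

```python
import numpy as np

def masked_sse(outputs, targets):
    """outputs, targets: (n_frames, n_targets); NaN in targets means 'Don't Care'."""
    care = ~np.isnan(targets)                            # mask of trainable entries
    diff = np.where(care, outputs - np.nan_to_num(targets), 0.0)
    return 0.5 * np.sum(diff ** 2)                       # error ignores don't-care dims
```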