Source: Neural Network Acoustic Models - Carnegie Mellon (…ymiao/pub/IS_slides_sat_dnn_V2.pdf)

Page 1

Towards Speaker Adaptive Training of Deep Neural Network Acoustic Models

Yajie Miao      Hao Zhang      Florian Metze

Language Technologies Institute

School of Computer Science     Carnegie Mellon University

Yajie Miao     Carnegie Mellon Univ. 1  /  23

Page 2

Outline

1. Motivation

2. Speaker-normalized feature space with i-vectors

    AdaptNN -- Bottom adaptation layers

    iVecNN -- Linear feature shift

3. Speaker adaptive training of DNNs

4. Experiments

5. Summary & Future Work

Page 3

Motivation

Deep neural networks have become the state of the art for acoustic modeling.

For GMM models, speaker adaptive training (SAT) has been a standard technique for improving WERs.

Various methods [1,2,3,4,5] have been proposed to perform speaker adaptation for DNNs. However, how we can do SAT for DNNs is not clear.

In this work, we aim to achieve complete speaker adaptive training for DNN acoustic models.

Page 4

SAT for HMM/GMM

Model update in the fMLLR feature space

[Diagram: HMM/GMM model update alternating with fMLLR matrix re-estimation]

Starts with an initial GMM model and estimates fMLLR affine transforms.

Updates model parameters with fMLLR applied, and then re-estimates fMLLR transforms. Repeats until convergence.
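The alternation above can be sketched as a toy loop. `estimate_fmllr` and `update_model` are hypothetical stand-ins (a real fMLLR estimate is an affine transform maximizing likelihood; here the "model" is a single mean and the "transform" a scalar shift), so this only illustrates the control flow, not the actual math:

```python
import numpy as np

# Toy sketch of the SAT loop for HMM/GMM: alternate per-speaker fMLLR
# estimation and model update until convergence.
def estimate_fmllr(model_mean, feats):
    # Stand-in for fMLLR: shift this speaker's features toward the model mean
    return model_mean - feats.mean()

def update_model(feats, shift):
    # Re-estimate the "model" (a single mean) in the transformed feature space
    return (feats + shift).mean()

rng = np.random.default_rng(0)
speakers = [rng.normal(loc=m, size=50) for m in (-2.0, 0.5, 3.0)]

model = np.mean([f.mean() for f in speakers])    # initial SI model
for _ in range(5):                               # "repeat until convergence"
    shifts = [estimate_fmllr(model, f) for f in speakers]
    model = np.mean([update_model(f, s) for f, s in zip(speakers, shifts)])
```

After convergence, every speaker's transformed features line up with the shared model, which is the point of SAT: speaker variability is absorbed by the transforms, not the model.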

Page 5

SAT for HMM/GMM


We want to do a similar thing for DNNs!

Page 6

Basic Idea for SAT‐DNN

[Diagram: original features + i-vector -> new features -> initial DNN]

Start with an initial DNN model, which is the regular DNN we train for hybrid systems.

Learn a function which takes advantage of i-vectors and projects DNN inputs into a speaker-normalized feature space.

Update the DNN model in the new feature space.

Page 7

Bottom Adaptation Layers

[Diagram: AdaptNN inserted between the inputs and the initial DNN]

Insert a smaller adaptation network between the initial DNN and the inputs.

I-vectors are appended to the outputs of each hidden layer.

By using i-vectors, this network (AdaptNN) transforms the original DNN inputs into a speaker-normalized space.

Page 8

Bottom Adaptation Layers


The output layer of AdaptNN has the same dimension as the original input features.

The output layer adopts the linear activation function, while the others use sigmoid.

Parameters of AdaptNN can be estimated by standard error back-propagation, with the initial DNN fixed.
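A minimal forward-pass sketch of such an adaptation network, with the i-vector appended to the input of every layer, sigmoid hidden layers, and a linear output layer matching the input feature dimension. All layer sizes here (360-dim inputs, 100-dim i-vectors, 512-unit hidden layers) are illustrative assumptions, not the configuration from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptnn_forward(x, ivec, weights):
    """x: (input_dim,) acoustic frame; ivec: (ivec_dim,) speaker i-vector.
    weights: list of (W, b); the last layer is linear with output dim == input dim."""
    h = x
    for i, (W, b) in enumerate(weights):
        # The speaker i-vector is appended to the input of every layer
        h = np.concatenate([h, ivec])
        z = W @ h + b
        # Hidden layers use sigmoid; the output layer is linear
        h = z if i == len(weights) - 1 else sigmoid(z)
    return h  # speaker-normalized features, same dimension as x

# Hypothetical sizes: 360-dim input features, 100-dim i-vector
rng = np.random.default_rng(0)
dims = [360, 512, 512, 360]
weights = [(rng.normal(0, 0.01, (dims[i + 1], dims[i] + 100)), np.zeros(dims[i + 1]))
           for i in range(len(dims) - 1)]
out = adaptnn_forward(rng.normal(size=360), rng.normal(size=100), weights)
```

Because the output lives in the same space as the input features, the frozen initial DNN can consume it unchanged, and only `weights` receives gradients during this stage.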

Page 9

Linear Feature Shift

[Diagram: iVecNN feeding a feature shift into the initial DNN]

a_t = o_t + f(i_s)

iVecNN takes speaker i-vectors as inputs and generates a linear feature shift for each speaker.

This feature shift is added to the original DNN inputs, and the resulting features become more speaker-normalized.
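A minimal sketch of the shift a_t = o_t + f(i_s), where f is a small network mapping the speaker i-vector i_s to a correction in the input feature space. The single tanh hidden layer and all sizes are assumptions for illustration; the slides only specify that the output layer is linear and matches the DNN input dimension:

```python
import numpy as np

def ivecnn_shift(ivec, W1, b1, W2, b2):
    """Map a speaker i-vector to a per-speaker shift in input-feature space.
    The output layer is linear, with dimension equal to the DNN inputs."""
    h = np.tanh(W1 @ ivec + b1)   # hidden layer (nonlinearity is an assumption)
    return W2 @ h + b2            # linear output, dim == DNN input dim

rng = np.random.default_rng(0)
ivec_dim, hid, feat_dim = 100, 256, 360          # hypothetical sizes
W1, b1 = rng.normal(0, 0.01, (hid, ivec_dim)), np.zeros(hid)
W2, b2 = rng.normal(0, 0.01, (feat_dim, hid)), np.zeros(feat_dim)

o_t = rng.normal(size=feat_dim)                  # one original input frame
shift = ivecnn_shift(rng.normal(size=ivec_dim), W1, b1, W2, b2)
a_t = o_t + shift                                # speaker-normalized input
```

Note the shift depends only on the speaker, not on the frame, so it is computed once per speaker and added to every frame: that is what makes it a *linear feature shift* rather than a full per-frame transformation.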

Page 10

Linear Feature Shift


The output layer of iVecNN has the same dimension as the DNN inputs and takes the linear activation function.

Parameters of iVecNN can be estimated by standard error back-propagation.

More flexible: it can be applied both to DNNs and to convolutional neural nets (CNNs) [6].

Page 11

Procedures of SAT‐DNN ‐‐ Training

[Diagram: initial DNN]

Step 1: Train the initial DNN model. This DNN can be trained on SI features (e.g., fbank) or SA features (e.g., fMLLR).

Page 12

Procedures of SAT‐DNN ‐‐ Training

[Diagram: original features + i-vectors -> feature function -> initial DNN]

Step 1: Train the initial DNN model. This DNN can be trained on SI features (e.g., fbank) or SA features (e.g., fMLLR).

Step 2: Learn the feature function (AdaptNN or iVecNN) by keeping the initial DNN fixed. This step requires speaker i-vectors as the side information for feature transformation.

Page 13

Procedures of SAT‐DNN ‐‐ Training

[Diagram: original features + i-vectors -> feature function -> SAT-DNN]

Step 1: Train the initial DNN model. This DNN can be trained on SI features (e.g., fbank) or SA features (e.g., fMLLR).

Step 2: Learn the feature function (AdaptNN or iVecNN) by keeping the initial DNN fixed. This step requires speaker i-vectors as the side information for feature transformation.

Step 3: Re-finetune the DNN parameters in the new feature space while keeping the feature function fixed. This finally gives us the SAT-DNN.
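The three stages amount to alternating which parameter group is trainable. A toy sketch of that control flow, where both "models" are single weight matrices and `sgd_step` stands in for a full SGD training loop (everything here is a hypothetical stand-in for the real DNN and feature function):

```python
import numpy as np

rng = np.random.default_rng(0)
dnn = {"W": rng.normal(size=(4, 4))}        # stand-in for the DNN parameters
feat_fn = {"W": rng.normal(size=(4, 4))}    # stand-in for AdaptNN/iVecNN

def sgd_step(params, grad_fn, lr=0.1):
    # One gradient step on the given parameter group only
    for k in params:
        params[k] -= lr * grad_fn(params[k])

toy_grad = lambda W: 2 * W                  # gradient of a toy loss ||W||^2

# Stage 1: train the initial DNN on the original features
sgd_step(dnn, toy_grad)

# Stage 2: learn the feature function; the DNN stays frozen
dnn_before = dnn["W"].copy()
sgd_step(feat_fn, toy_grad)

# Stage 3: re-finetune the DNN in the new space; the feature function stays frozen
ff_before = feat_fn["W"].copy()
sgd_step(dnn, toy_grad)
```

Freezing is implemented simply by never passing the frozen group to the optimizer, which mirrors how the slides describe holding the initial DNN (stage 2) and then the feature function (stage 3) fixed.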

Page 14

Procedures of SAT‐DNN ‐‐ Decoding

[Diagram: original features + i-vectors -> feature function -> SAT-DNN]

Step 1: Given a testing speaker, just extract the i-vector for adaptation. I-vector extraction is totally unsupervised.

Step 2: Input the speech features and the i-vectors into this architecture for decoding. This projects the input features into the speaker-normalized space and adapts the SAT-DNN model automatically to this testing speaker.
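The two decoding steps reduce to one function call once the i-vector is available. A sketch with toy stand-ins (an iVecNN-style shift and a linear "DNN"; all names and sizes are hypothetical):

```python
import numpy as np

def decode_pass(feats, ivec, feat_fn, sat_dnn):
    """Single decoding pass: normalize the features for this speaker,
    then score them with the SAT-DNN."""
    normalized = feat_fn(feats, ivec)   # project into speaker-normalized space
    return sat_dnn(normalized)          # frame scores handed to the decoder

rng = np.random.default_rng(0)
P = rng.normal(0, 0.1, (360, 100))      # toy map: i-vector -> feature shift
W = rng.normal(0, 0.1, (2000, 360))     # toy "DNN": features -> senone scores
feat_fn = lambda feats, ivec: feats + P @ ivec
sat_dnn = lambda x: W @ x

scores = decode_pass(rng.normal(size=360), rng.normal(size=100), feat_fn, sat_dnn)
```

Nothing in this path requires transcripts or a first decoding pass: the only speaker-specific input is the unsupervised i-vector, which is why adaptation costs a single pass.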

Page 15

Procedures of SAT‐DNN ‐‐ Decoding


Since i-vector extraction is totally unsupervised, there is no initial decoding pass and no fine-tuning on the adaptation data.

Only a single pass of decoding, although we are doing unsupervised adaptation.

Very Efficient Unsupervised Adaptation

Page 16

Comparison to related work

G. Saon, H. Soltau, D. Nahamoo, and M. Picheny. Speaker adaptation of neural network acoustic models using i-vectors. ASRU 2013.

Concatenate i-vectors with the original features directly and train the whole network from scratch.

We failed to get obvious gains from this proposal, most likely due to normalization of i-vectors. The i-vectors should be normalized very carefully, which is also observed by: A. Senior, I. Lopez-Moreno. Improving DNN speaker independence with i-vector inputs. ICASSP 2014.

When using our SAT-DNN, there is no need to worry about i-vector normalization. The feature function will do this job!

Page 17

Experiments ‐‐ Switchboard

A 110-hour training setup [7] = 100k utterances

Kaldi for GMM: mono, delta, lda+mllt, sat

Kaldi+PDNN:  http://www.cs.cmu.edu/~ymiao/kaldipdnn.html

Two types of DNN inputs: SI filterbanks and SA fMLLRs

Tested on the SWBD part of HUB’00


I‐Vector Extractor Building

Open-source ALIZE toolkit [8]

A 100-dimensional i-vector is extracted for each training and testing speaker.

Page 18

Experiments ‐‐ Switchboard

Models                    Filterbank      fMLLR
Baseline (initial) DNN    21.4            19.9
SAT-DNN + AdaptNN         19.8 (7.5%)     18.7 (6.0%)
SAT-DNN + iVecNN          19.9 (7.0%)     19.0 (4.8%)
Initial DNN + AdaptNN     20.8 (2.8%)     19.2 (3.5%)
Initial DNN + iVecNN      21.2 (0.9%)     19.7 (1.0%)

(WER %, with relative improvement over the baseline in parentheses)

Page 19

Experiments ‐‐ Switchboard


Our recent work enlarges the improvement to 11.1% and 6.8% relative on Filterbank and fMLLR respectively.

Page 20

Experiments ‐‐ BABEL

More challenging BABEL dataset

Conversational telephone speech from low‐resource languages


80 hours of training data for each language 

Tagalog (IARPA‐babel106‐v0.2f ) and Turkish (IARPA‐babel105b‐v0.4) ; only on the SI filterbank features 

Models                    Tagalog         Turkish
Baseline (initial) DNN    49.3            51.3
SAT-DNN + AdaptNN         47.1 (4.5%)     48.6 (5.3%)
SAT-DNN + iVecNN          47.3 (4.1%)     49.3 (3.9%)

(WER %, with relative improvement over the baseline in parentheses)

Page 21

Summary & Future Work

Summary

We can do SAT for DNNs! To achieve this, we propose two feature-learning approaches to get the speaker-normalized space.

We get nice improvements! Our experiments show SAT-DNN outperforms DNNs regardless of the feature types of the DNN inputs.

Our code is open source! You can check out the code and run the experiments:

http://www.cs.cmu.edu/~ymiao/satdnn.html

Future Work

Comparison with speaker adaptation methods; perform sequence training [9] over the resulting SAT-DNN.

Extend the SAT framework to other architectures, e.g., to bottleneck feature extraction [10] and convolutional neural networks [6].

Page 22

References

[1] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. ASRU, pp. 24–29, 2011.
[2] B. Li and K. C. Sim, "Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems," in Proc. Interspeech, pp. 526–529, 2010.
[3] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, "Adaptation of context-dependent deep neural networks for automatic speech recognition," in Proc. IEEE Spoken Language Technology Workshop, pp. 366–369, 2012.
[4] S. M. Siniscalchi, J. Li, and C.-H. Lee, "Hermitian-based hidden activation functions for adaptation of hybrid HMM/ANN models," in Proc. Interspeech, pp. 526–529, 2012.
[5] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Proc. ASRU, pp. 55–59, 2013.
[6] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. ICASSP, pp. 8614–8618, 2013.
[7] S. P. Rath, D. Povey, K. Vesely, and J. Cernocky, "Improved feature processing for deep neural networks," in Proc. Interspeech, 2013.
[8] J.-F. Bonastre, N. Scheffer, D. Matrouf, C. Fredouille, A. Larcher, A. Preti, G. Pouchoulin, N. Evans, B. Fauve, and J. Mason, "ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition," in Proc. ISCA/IEEE Speaker Odyssey, 2008.
[9] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proc. ICASSP, pp. 3761–3764, 2009.
[10] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in Proc. ICASSP, 2013.

Page 23

Thank You

Yajie Miao      Hao Zhang      Florian Metze

Language Technologies Institute
School of Computer Science     Carnegie Mellon University

Acknowledgements. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.