Towards Speaker Adaptive Training of Deep Neural Network Acoustic Models

Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute
School of Computer Science, Carnegie Mellon University

Yajie Miao, Carnegie Mellon Univ.
Outline
1. Motivation
2. Speaker-normalized feature space with i-vectors
   - AdaptNN: bottom adaptation layers
   - iVecNN: linear feature shift
3. Speaker adaptive training of DNNs
4. Experiments
5. Summary & Future Work
Motivation
Deep neural networks have become the state of the art for acoustic modeling.

For GMM models, speaker adaptive training (SAT) has been a standard technique for improving WERs.

Various methods [1,2,3,4,5] have been proposed to perform speaker adaptation for DNNs. However, how to do SAT for DNNs is not clear.

In this work, we aim to achieve complete speaker adaptive training for DNN acoustic models.
SAT for HMM/GMM
[Diagram: model update in fMLLR feature space; fMLLR matrix → HMM/GMM → fMLLR re-estimation]

Starts with an initial GMM model and estimates fMLLR affine transforms.

Updates model parameters with fMLLR applied, and then re-estimates the fMLLR transforms. Repeats until convergence.
We want to do the same thing for DNNs!
Basic Idea for SAT‐DNN
[Diagram: original features + i-vector → new features → initial DNN]

Start with an initial DNN model, which is the regular DNN we train for hybrid systems.

Learn a function which takes advantage of i-vectors and projects DNN inputs into a speaker-normalized feature space.

Update the DNN model in the new feature space.
Bottom Adaptation Layers
[Diagram: inputs → AdaptNN → initial DNN]

Insert a smaller adaptation network (AdaptNN) between the inputs and the initial DNN.

I-vectors are appended to the outputs of each hidden layer.

By using i-vectors, AdaptNN transforms the original DNN inputs into a speaker-normalized space.
Bottom Adaptation Layers
The output layer of AdaptNN has the same dimension as the original input features.

The output layer adopts a linear activation function, while the others use sigmoid.

Parameters of AdaptNN can be estimated by standard error back-propagation while keeping the initial DNN fixed.
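As a toy illustration (not the authors' implementation), a forward pass through such an adaptation network might look like the following sketch; all layer sizes and variable names are hypothetical, and the i-vector is appended after every sigmoid hidden layer while the final layer is linear and maps back to the input dimension:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptnn_forward(o_t, i_s, weights, biases):
    """Pass one input frame o_t through a small AdaptNN-style network.

    Hidden layers use sigmoid, and the speaker i-vector i_s is appended
    to each hidden layer's output before it feeds the next layer.
    The output layer is linear and has the same dimension as o_t.
    """
    h = o_t
    for k, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        if k < len(weights) - 1:
            h = np.concatenate([sigmoid(z), i_s])  # append the i-vector
        else:
            h = z  # linear output layer, same dimension as the input features
    return h
```

Because the last layer has the input-feature dimension, the transformed frames can feed the initial DNN unchanged.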
Linear Feature Shift
[Diagram: i-vector → iVecNN → shift a(i_s), added (⊕) to the original features → initial DNN]

f_t = o_t + a(i_s)

where o_t is the original feature vector at frame t, i_s is the i-vector of speaker s, and a(·) is the shift generated by iVecNN.

iVecNN takes speaker i-vectors as inputs and generates a linear feature shift for each speaker.

This feature shift is added to the original DNN inputs, and the resulting features become more speaker-normalized.
Linear Feature Shift
The output layer of iVecNN has the same dimension as the DNN inputs and takes a linear activation function.

Parameters of iVecNN can be estimated by standard error back-propagation.

More flexible: it can be applied both to DNNs and also to convolutional neural nets (CNNs) [6].
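A minimal sketch of the shift f_t = o_t + a(i_s), again with made-up layer sizes (one sigmoid hidden layer, one linear output layer of the DNN input dimension); since a(i_s) depends only on the i-vector, every frame of a speaker receives the same shift:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ivecnn_shift(i_s, W1, b1, W2, b2):
    """a(i_s): map a speaker i-vector to one shift vector whose dimension
    equals the DNN input dimension (linear output layer)."""
    h = sigmoid(W1 @ i_s + b1)
    return W2 @ h + b2

def shift_features(frames, i_s, W1, b1, W2, b2):
    """f_t = o_t + a(i_s): every frame of speaker s gets the same shift."""
    return frames + ivecnn_shift(i_s, W1, b1, W2, b2)
```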
Procedures of SAT‐DNN ‐‐ Training
[Diagram: original features + i-vectors → feature function (AdaptNN or iVecNN) → initial DNN → SAT-DNN]

Step 1: Train the initial DNN model. This DNN can be trained on SI (e.g., fbank) or SA features (e.g., fMLLR).

Step 2: Learn the feature function (AdaptNN or iVecNN) while keeping the initial DNN fixed. This step requires speaker i-vectors as the side information for feature transformation.

Step 3: Re-finetune the DNN parameters in the new feature space while keeping the feature function fixed. This finally gives us SAT-DNN.
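The alternating freeze schedule of Steps 2 and 3 can be sketched numerically on a toy problem. This is only an illustration of the schedule, not the paper's system: a single linear layer W stands in for the DNN, and a linear map A stands in for iVecNN; gradients are written by hand for this quadratic loss.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, ivec_dim, n = 5, 3, 200

# Toy data: features o, speaker i-vectors, and targets y generated by a
# "true" speaker shift followed by a "true" linear model.
o = rng.standard_normal((n, feat_dim))
ivec = rng.standard_normal((n, ivec_dim))
true_W = rng.standard_normal((feat_dim, feat_dim))
true_A = rng.standard_normal((feat_dim, ivec_dim))
y = (o + ivec @ true_A.T) @ true_W.T

# Step 1 stand-in: an "initial model" W (here just one linear layer).
W = 0.1 * rng.standard_normal((feat_dim, feat_dim))
# Feature-function stand-in: a linear shift a(i_s) = A @ i_s.
A = np.zeros((feat_dim, ivec_dim))

def loss(W, A):
    f = o + ivec @ A.T          # f_t = o_t + a(i_s)
    r = f @ W.T - y             # residual
    return (r ** 2).mean(), f, r

lr = 0.01
start = loss(W, A)[0]
# Step 2: learn the feature function while keeping the model fixed.
for _ in range(200):
    _, _, r = loss(W, A)
    A -= lr * (W.T @ r.T @ ivec) / n   # gradient step w.r.t. A only
# Step 3: re-finetune the model while keeping the feature function fixed.
for _ in range(200):
    _, f, r = loss(W, A)
    W -= lr * (r.T @ f) / n            # gradient step w.r.t. W only
end = loss(W, A)[0]
```

The point of the sketch is which parameters each loop touches: Step 2 updates only A with W frozen, Step 3 updates only W with A frozen, mirroring the SAT-DNN procedure.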
Procedures of SAT‐DNN ‐‐ Decoding
[Diagram: original features + i-vectors → feature function → SAT-DNN]

Step 1: Given a testing speaker, just extract the i-vector for adaptation. I-vector extraction is totally unsupervised.

Step 2: Input the speech features and the i-vector into this architecture for decoding. This projects the input features into the speaker-normalized space and adapts the SAT-DNN model automatically to this testing speaker.

Since i-vector extraction is totally unsupervised, no initial decoding pass and no fine-tuning on the adaptation data are needed.

Only a single pass of decoding is required, although we are doing unsupervised adaptation.

Very efficient unsupervised adaptation!
Comparison to related work
G. Saon, H. Soltau, D. Nahamoo, and M. Picheny. Speaker adaptation of neural network acoustic models using i-vectors. ASRU 2013.

Concatenate i-vectors with the original features directly and train the whole network from scratch.

We failed to get obvious gains from this proposal, most likely due to normalization of i-vectors. The i-vectors should be normalized very carefully, which is also observed by: A. Senior, I. Lopez-Moreno. Improving DNN speaker independence with i-vector inputs. ICASSP 2014.

When using our SAT-DNN, there is no need to worry about i-vector normalization. The feature function will do this job!
Experiments ‐‐ Switchboard
A 110-hour training setup [7] with 100k utterances.

Kaldi for GMM: mono → delta → lda+mllt → sat

Kaldi+PDNN: http://www.cs.cmu.edu/~ymiao/kaldipdnn.html

Two types of DNN inputs: SI filterbanks and SA fMLLRs.

Tested on the SWBD part of HUB'00.

I-Vector Extractor Building

Open-source ALIZE toolkit [8].

A 100-dimensional i-vector is extracted for each training and testing speaker.
Experiments ‐‐ Switchboard
WER (%), with relative improvement over the baseline in parentheses:

Models                   | Filterbank   | fMLLR
Baseline (initial) DNN   | 21.4         | 19.9
SAT-DNN + AdaptNN        | 19.8 (7.5%)  | 18.7 (6.0%)
SAT-DNN + iVecNN         | 19.9 (7.0%)  | 19.0 (4.8%)
Initial DNN + AdaptNN    | 20.8 (2.8%)  | 19.2 (3.5%)
Initial DNN + iVecNN     | 21.2 (0.9%)  | 19.7 (1.0%)
Our recent work enlarges the improvement to 11.1% and 6.8% relative on the filterbank and fMLLR features, respectively.
Experiments ‐‐ BABEL
More challenging BABEL dataset:

Conversational telephone speech from low-resource languages.

80 hours of training data for each language.

Tagalog (IARPA-babel106-v0.2f) and Turkish (IARPA-babel105b-v0.4); only on the SI filterbank features.

WER (%), with relative improvement over the baseline in parentheses:

Models                   | Tagalog      | Turkish
Baseline (initial) DNN   | 49.3         | 51.3
SAT-DNN + AdaptNN        | 47.1 (4.5%)  | 48.6 (5.3%)
SAT-DNN + iVecNN         | 47.3 (4.1%)  | 49.3 (3.9%)
Summary & Future Work
Summary

We can do SAT for DNNs! To achieve this, we propose two feature-learning approaches to get the speaker-normalized space.

We get nice improvements! Our experiments show SAT-DNN outperforms DNNs regardless of the feature type of the DNN inputs.

Our code is open source! You can check out the code and run the experiments:
http://www.cs.cmu.edu/~ymiao/satdnn.html

Future Work

Comparison with speaker adaptation methods; perform sequence training [9] over the resulting SAT-DNN.

Extend the SAT framework to other architectures, e.g., to bottleneck feature extraction [10] and convolutional neural networks [6].
References

[1] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. ASRU, pp. 24-29, 2011.
[2] B. Li and K. C. Sim, "Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems," in Proc. Interspeech, pp. 526-529, 2010.
[3] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, "Adaptation of context-dependent deep neural networks for automatic speech recognition," in Proc. IEEE Spoken Language Technology Workshop, pp. 366-369, 2012.
[4] S. M. Siniscalchi, J. Li, and C.-H. Lee, "Hermitian-based hidden activation functions for adaptation of hybrid HMM/ANN models," in Proc. Interspeech, pp. 526-529, 2012.
[5] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Proc. ASRU, pp. 55-59, 2013.
[6] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. ICASSP, pp. 8614-8618, 2013.
[7] S. P. Rath, D. Povey, K. Vesely, and J. Cernocky, "Improved feature processing for deep neural networks," in Proc. Interspeech, 2013.
[8] J.-F. Bonastre, N. Scheffer, D. Matrouf, C. Fredouille, A. Larcher, A. Preti, G. Pouchoulin, N. Evans, B. Fauve, and J. Mason, "ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition," in Proc. ISCA/IEEE Speaker Odyssey, 2008.
[9] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proc. ICASSP, pp. 3761-3764, 2009.
[10] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in Proc. ICASSP, 2013.
Thank You

Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute
School of Computer Science, Carnegie Mellon University

Acknowledgements. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.