Towards Speaker Adaptive Training of Deep Neural Network Acoustic Models

Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute
School of Computer Science, Carnegie Mellon University

Yajie Miao, Carnegie Mellon Univ.
Outline
1. Motivation
2. Speaker-normalized feature space with i-vectors
   - AdaptNN: bottom adaptation layers
   - iVecNN: linear feature shift
3. Speaker adaptive training of DNNs
4. Experiments
5. Summary & Future Work
Motivation
Deep neural networks have become the state of the art for acoustic modeling.

For GMM models, speaker adaptive training (SAT) has been a standard technique for improving WERs.

Various methods [1,2,3,4,5] have been proposed to perform speaker adaptation for DNNs. However, how to do SAT for DNNs is not clear.

In this work, we aim to achieve complete speaker adaptive training for DNN acoustic models.
SAT for HMM/GMM
[Diagram: model update in fMLLR feature space; fMLLR matrix → HMM/GMM → fMLLR re-estimation]

Starts with an initial GMM model and estimates fMLLR affine transforms.

Updates model parameters with fMLLR applied, and then re-estimates the fMLLR transforms. Repeats until convergence.
We want to do the same thing for DNNs!
Basic Idea for SAT‐DNN
[Diagram: original features + i-vector → new features → initial DNN]

Start with an initial DNN model, which is the regular DNN we train for hybrid systems.

Learn a function which takes advantage of i-vectors and projects DNN inputs into a speaker-normalized feature space.

Update the DNN model in the new feature space.
Bottom Adaptation Layers
[Diagram: inputs → AdaptNN → initial DNN]

Insert a smaller adaptation network (AdaptNN) between the inputs and the initial DNN.

I-vectors are appended to the outputs of each hidden layer.

By using i-vectors, AdaptNN transforms the original DNN inputs into a speaker-normalized space.
Bottom Adaptation Layers
The output layer of AdaptNN has the same dimension as the original input features.

The output layer adopts a linear activation function, while the others use sigmoid.

Parameters of AdaptNN can be estimated by standard error back-propagation while keeping the initial DNN fixed.
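As a toy illustration (not the authors' implementation), a forward pass through such an adaptation network might look like the following sketch; all layer sizes and variable names are hypothetical, and the i-vector is appended after every sigmoid hidden layer while the final layer is linear and maps back to the input dimension:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptnn_forward(o_t, i_s, weights, biases):
    """Pass one input frame o_t through a small AdaptNN-style network.

    Hidden layers use sigmoid, and the speaker i-vector i_s is appended
    to each hidden layer's output before it feeds the next layer.
    The output layer is linear and has the same dimension as o_t.
    """
    h = o_t
    for k, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        if k < len(weights) - 1:
            h = np.concatenate([sigmoid(z), i_s])  # append the i-vector
        else:
            h = z  # linear output layer, same dimension as the input features
    return h
```

Because the last layer has the input-feature dimension, the transformed frames can feed the initial DNN unchanged.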
Linear Feature Shift
[Diagram: i-vector → iVecNN → shift a(i_s), added (⊕) to the original features → initial DNN]

f_t = o_t + a(i_s)

where o_t is the original feature vector at frame t, i_s is the i-vector of speaker s, and a(·) is the shift generated by iVecNN.

iVecNN takes speaker i-vectors as inputs and generates a linear feature shift for each speaker.

This feature shift is added to the original DNN inputs, and the resulting features become more speaker-normalized.
Linear Feature Shift
The output layer of iVecNN has the same dimension as the DNN inputs and takes a linear activation function.

Parameters of iVecNN can be estimated by standard error back-propagation.

More flexible: it can be applied both to DNNs and also to convolutional neural nets (CNNs) [6].
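A minimal sketch of the shift f_t = o_t + a(i_s), again with made-up layer sizes (one sigmoid hidden layer, one linear output layer of the DNN input dimension); since a(i_s) depends only on the i-vector, every frame of a speaker receives the same shift:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ivecnn_shift(i_s, W1, b1, W2, b2):
    """a(i_s): map a speaker i-vector to one shift vector whose dimension
    equals the DNN input dimension (linear output layer)."""
    h = sigmoid(W1 @ i_s + b1)
    return W2 @ h + b2

def shift_features(frames, i_s, W1, b1, W2, b2):
    """f_t = o_t + a(i_s): every frame of speaker s gets the same shift."""
    return frames + ivecnn_shift(i_s, W1, b1, W2, b2)
```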
Procedures of SAT‐DNN ‐‐ Training
[Diagram: original features + i-vectors → feature function (AdaptNN or iVecNN) → initial DNN → SAT-DNN]

Step 1: Train the initial DNN model. This DNN can be trained on SI (e.g., fbank) or SA features (e.g., fMLLR).

Step 2: Learn the feature function (AdaptNN or iVecNN) while keeping the initial DNN fixed. This step requires speaker i-vectors as the side information for feature transformation.

Step 3: Re-finetune the DNN parameters in the new feature space while keeping the feature function fixed. This finally gives us SAT-DNN.
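The alternating freeze schedule of Steps 2 and 3 can be sketched numerically on a toy problem. This is only an illustration of the schedule, not the paper's system: a single linear layer W stands in for the DNN, and a linear map A stands in for iVecNN; gradients are written by hand for this quadratic loss.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, ivec_dim, n = 5, 3, 200

# Toy data: features o, speaker i-vectors, and targets y generated by a
# "true" speaker shift followed by a "true" linear model.
o = rng.standard_normal((n, feat_dim))
ivec = rng.standard_normal((n, ivec_dim))
true_W = rng.standard_normal((feat_dim, feat_dim))
true_A = rng.standard_normal((feat_dim, ivec_dim))
y = (o + ivec @ true_A.T) @ true_W.T

# Step 1 stand-in: an "initial model" W (here just one linear layer).
W = 0.1 * rng.standard_normal((feat_dim, feat_dim))
# Feature-function stand-in: a linear shift a(i_s) = A @ i_s.
A = np.zeros((feat_dim, ivec_dim))

def loss(W, A):
    f = o + ivec @ A.T          # f_t = o_t + a(i_s)
    r = f @ W.T - y             # residual
    return (r ** 2).mean(), f, r

lr = 0.01
start = loss(W, A)[0]
# Step 2: learn the feature function while keeping the model fixed.
for _ in range(200):
    _, _, r = loss(W, A)
    A -= lr * (W.T @ r.T @ ivec) / n   # gradient step w.r.t. A only
# Step 3: re-finetune the model while keeping the feature function fixed.
for _ in range(200):
    _, f, r = loss(W, A)
    W -= lr * (r.T @ f) / n            # gradient step w.r.t. W only
end = loss(W, A)[0]
```

The point of the sketch is which parameters each loop touches: Step 2 updates only A with W frozen, Step 3 updates only W with A frozen, mirroring the SAT-DNN procedure.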
Procedures of SAT‐DNN ‐‐ Decoding
[Diagram: original features + i-vectors → feature function → SAT-DNN]

Step 1: Given a testing speaker, just extract the i-vector for adaptation. I-vector extraction is totally unsupervised.

Step 2: Input the speech features and the i-vector into this architecture for decoding. This projects the input features into the speaker-normalized space and adapts the SAT-DNN model automatically to this testing speaker.

Since i-vector extraction is totally unsupervised, no initial decoding pass and no fine-tuning on the adaptation data are needed.

Only a single pass of decoding is required, although we are doing unsupervised adaptation.

Very efficient unsupervised adaptation!
Comparison to related work
G. Saon, H. Soltau, D. Nahamoo, and M. Picheny. Speaker adaptation of neural network acoustic models using i-vectors. ASRU 2013.

Concatenate i-vectors with the original features directly and train the whole network from scratch.

We failed to get obvious gains from this proposal, most likely due to normalization of i-vectors. The i-vectors should be normalized very carefully, which is also observed by: A. Senior, I. Lopez-Moreno. Improving DNN speaker independence with i-vector inputs. ICASSP 2014.

When using our SAT-DNN, there is no need to worry about i-vector normalization. The feature function will do this job!
Experiments ‐‐ Switchboard
A 110-hour training setup [7] with 100k utterances.

Kaldi for GMM: mono → delta → lda+mllt → sat

Kaldi+PDNN: http://www.cs.cmu.edu/~ymiao/kaldipdnn.html

Two types of DNN inputs: SI filterbanks and SA fMLLRs.

Tested on the SWBD part of HUB'00.

I-Vector Extractor Building

Open-source ALIZE toolkit [8].

A 100-dimensional i-vector is extracted for each training and testing speaker.
Experiments ‐‐ Switchboard
WER (%), with relative improvement over the baseline in parentheses:

Models                   | Filterbank   | fMLLR
Baseline (initial) DNN   | 21.4         | 19.9
SAT-DNN + AdaptNN        | 19.8 (7.5%)  | 18.7 (6.0%)
SAT-DNN + iVecNN         | 19.9 (7.0%)  | 19.0 (4.8%)
Initial DNN + AdaptNN    | 20.8 (2.8%)  | 19.2 (3.5%)
Initial DNN + iVecNN     | 21.2 (0.9%)  | 19.7 (1.0%)
Our recent work enlarges the improvement to 11.1% and 6.8% relative on the filterbank and fMLLR features, respectively.
Experiments ‐‐ BABEL
More challenging BABEL dataset:

Conversational telephone speech from low-resource languages.

80 hours of training data for each language.

Tagalog (IARPA-babel106-v0.2f) and Turkish (IARPA-babel105b-v0.4); only on the SI filterbank features.

WER (%), with relative improvement over the baseline in parentheses:

Models                   | Tagalog      | Turkish
Baseline (initial) DNN   | 49.3         | 51.3
SAT-DNN + AdaptNN        | 47.1 (4.5%)  | 48.6 (5.3%)
SAT-DNN + iVecNN         | 47.3 (4.1%)  | 49.3 (3.9%)
Summary & Future Work
Summary

We can do SAT for DNNs! To achieve this, we propose two feature-learning approaches to get the speaker-normalized space.

We get nice improvements! Our experiments show SAT-DNN outperforms DNNs regardless of the feature type of the DNN inputs.

Our code is open source! You can check out the code and run the experiments:
http://www.cs.cmu.edu/~ymiao/satdnn.html

Future Work

Comparison with speaker adaptation methods; perform sequence training [9] over the resulting SAT-DNN.

Extend the SAT framework to other architectures, e.g., to bottleneck feature extraction [10] and convolutional neural networks [6].
References

[1] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. ASRU, pp. 24-29, 2011.
[2] B. Li and K. C. Sim, "Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems," in Proc. Interspeech, pp. 526-529, 2010.
[3] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, "Adaptation of context-dependent deep neural networks for automatic speech recognition," in Proc. IEEE Spoken Language Technology Workshop, pp. 366-369, 2012.
[4] S. M. Siniscalchi, J. Li, and C.-H. Lee, "Hermitian-based hidden activation functions for adaptation of hybrid HMM/ANN models," in Proc. Interspeech, pp. 526-529, 2012.
[5] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Proc. ASRU, pp. 55-59, 2013.
[6] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. ICASSP, pp. 8614-8618, 2013.
[7] S. P. Rath, D. Povey, K. Vesely, and J. Cernocky, "Improved feature processing for deep neural networks," in Proc. Interspeech, 2013.
[8] J.-F. Bonastre, N. Scheffer, D. Matrouf, C. Fredouille, A. Larcher, A. Preti, G. Pouchoulin, N. Evans, B. Fauve, and J. Mason, "ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition," in Proc. ISCA/IEEE Speaker Odyssey, 2008.
[9] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proc. ICASSP, pp. 3761-3764, 2009.
[10] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in Proc. ICASSP, 2013.
Thank You

Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute
School of Computer Science, Carnegie Mellon University

Acknowledgements. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.