
A fast maximum likelihood nonlinear feature transformation method for GMM–HMM speaker adaptation

Kaisheng Yao a,*, Dong Yu b, Li Deng b, Yifan Gong a

a Online Service Division, Microsoft Corporation, One Redmond Way, Redmond, WA 98052, USA
b Microsoft Research, One Redmond Way, Redmond, WA 98052, USA

    a r t i c l e i n f o

Article history: Received 4 September 2012; received in revised form 22 December 2012; accepted 12 February 2013.

Keywords: Extreme learning machine; Neural networks; Maximum likelihood; Speech recognition; Speaker adaptation; Hidden Markov models

Abstract

We describe a novel maximum likelihood nonlinear feature bias compensation method for Gaussian mixture model–hidden Markov model (GMM–HMM) adaptation. Our approach exploits a single-hidden-layer neural network (SHLNN) that, similar to the extreme learning machine (ELM), uses randomly generated lower-layer weights and linear output units. Different from the conventional ELM, however, our approach optimizes the SHLNN parameters by maximizing the likelihood of observing the features given the speaker-independent GMM–HMM. We derive a novel and efficient learning algorithm for optimizing this criterion. We show, on a large vocabulary speech recognition task, that the proposed approach can cut the word error rate (WER) by 13% over the feature maximum likelihood linear regression (fMLLR) method with bias compensation, and by more than 5% over the fMLLR method with both bias and rotation transformations if applied on top of the fMLLR. Overall, it can reduce the WER by more than 27% over the speaker-independent system, with an adaptation time of 0.2 times real time.

© 2013 Elsevier B.V. All rights reserved.

    1. Introduction

Automatic speech recognition (ASR) systems rely on powerful statistical techniques such as Gaussian mixture model–hidden Markov models (GMM–HMMs) [1] and deep neural network (DNN)–HMMs [2] to achieve good performance. Even though ASR systems are trained on large amounts of data, mismatches between the training and testing conditions are still unavoidable due to speaker and environment differences. Much like other systems built upon statistical techniques, ASR systems often fail to produce the same level of performance when tested under mismatched conditions.

A substantial amount of work has been conducted to improve ASR systems' performance under mismatched conditions. A comprehensive review of existing techniques can be found in [3]. The key idea of these techniques is to adapt either the model or the feature so that the mismatch between the training and testing conditions can be reduced. Here we briefly review the three classes of techniques that motivated this work.

The first class of techniques adapts the model or feature by applying the physical model that causes the mismatch [4–6]. For example, the parallel model combination (PMC) method [4] adapts the speech models under noisy environments by combining clean speech models and noise models. The joint additive and convolutive noise compensation (JAC) method [5] jointly compensates for the additive background noise and channel distortion by applying the vector Taylor series expansion of the physical distortion model around the mean vectors of clean speech, noise and channel. Methods in this category differ in the type of physical models used and the approximation applied. For instance, the JAC method may be further improved by including both the magnitude and phase terms in the physical model that describes the relationship between speech and noise [6].

The second class of techniques [7,8] adapts the model or feature by learning the distortion model from stereo data, i.e., data recorded simultaneously over a "clean" and a "noisy" channel. Methods in this category differ in how the distortion model is estimated and used to compensate for the mismatches during testing.

In contrast to the above techniques, the third class of techniques does not need an explicit model that describes how speech is distorted. Instead it learns a transformation directly from the adaptation data and the associated transcriptions, which may contain errors. This transformation is then used either to modify the test features to make them closer to the training data or to adjust the HMM model so that it better represents the test data. Example methods in this category include stochastic matching [9] and the popular maximum likelihood linear regression (MLLR) [10]. This class of techniques is more flexible than the first class since it does not rely on a physical model, which may not be available when, for instance, the HMMs are trained on noisy speech.


* Corresponding author. Tel.: +1 469 360 6986. E-mail addresses: [email protected], [email protected] (K. Yao).


This class of techniques is also less demanding than the second class since it does not require stereo data, which are rarely available. For these reasons this class of techniques has been widely used in real-world large vocabulary speech recognition (LVSR) systems, while the first and second classes of techniques are mainly successful in small tasks such as digit recognition in noisy conditions [11,12]. Due to the large number of HMM parameters involved in LVSR systems, feature transformation is preferred over model adaptation [10] to reduce computational cost.

Effort has been made to extend the third class of techniques to deal with nonlinear distortions. For example, in [13] a single-hidden-layer neural network (SHLNN) was used to estimate the nonlinear transformations to adapt the HMM model or feature. A gradient ascent method was developed to estimate both the upper- and lower-layer weights of the neural network, which, unfortunately, was slow to converge and easy to overfit to the training data [13], especially since the adaptation dataset is typically very small. Furthermore, the neural network in [13] was trained to maximize the likelihood of the transformed feature instead of the observed feature; maximizing the likelihood of the observed feature requires considering the Jacobian of the feature transformation.

Recently, the extreme learning machine (ELM) [14] was proposed to address the difficulty and inefficiency in training single-hidden-layer neural networks. There are two core components in the ELM: using randomly generated "fixed" connections between the input and hidden layers (called lower-layer weights), and using linear output units so that, for the minimum mean square error criterion, a closed-form solution is available when the lower-layer weights are fixed. The randomly generated lower-layer weights are independent of the training data, so the ELM is less likely to overfit to the training data than conventional neural networks. The ELM has been used as a building block, for example, in the deep convex network (DCN) [15,16].

In this paper we propose a novel nonlinear feature adaptation framework for LVSR exploiting the core ideas of the ELM. More specifically, we estimate a nonlinear time-dependent bias using an ELM-like SHLNN to compensate for the mismatch between the training and testing data. In this SHLNN, the lower-layer weights are randomly generated and the output layer contains only linear units, just as in the ELM. Different from the ELM, however, the SHLNN in our task is optimized to maximize the likelihood of either the transformed feature, for which a closed-form solution is derived, or the observed feature, for which an efficient second-order Gauss–Newton method is developed.

Our approach belongs to the third class of adaptation techniques just reviewed. It requires neither physical models nor stereo data and can easily be applied to LVSR tasks. Our proposed approach is also more powerful than previously developed approaches because it can take advantage of the neighboring frames (which, e.g., [4–6,10,13] cannot) and uses nonlinear transformations (which, e.g., [17] does not). We show, on a large vocabulary speech recognition task, that the proposed approach can cut the word error rate (WER) by 13% over the feature maximum likelihood linear regression (fMLLR) method with bias compensation, and by more than 5% over the fMLLR method with both bias and rotation transformations if applied on top of the fMLLR. Overall, it can reduce the WER by more than 27% over the speaker-independent system.

The rest of the paper is organized as follows. In Section 2, we discuss the problem of feature space transformation for speaker adaptation. In Section 3, we present the training algorithms of the ELM-like SHLNN for maximizing the likelihood of the transformed feature and of the observed feature, and describe a tandem scheme that applies the nonlinear transformation on top of the linearly transformed feature. Experimental results are reported in Section 4 to demonstrate how the hidden layer size and context window length affect the adaptation effectiveness and how the proposed approach outperforms fMLLR on an LVSR task. We conclude the paper in Section 5.

2. Speaker adaptation through nonlinear feature transformation

In contrast to the traditional feature space transformations [9,10,17], which are linear transforms of the observation vector $x_t$ at time $t$, the method presented in this paper is a nonlinear feature transformation method.

The nonlinear feature transformation can be defined as $f(\bar{x}_t, \phi_t)$, where $\phi_t$ is a meta feature vector that contains information such as signal-to-noise ratio and utterance length, and $\bar{x}_t \in \mathbb{R}^{LD}$ is an expanded observation vector centered at $x_t \in \mathbb{R}^{D}$ with a context length of $L$. For example, for a context length of $L = 3$, $\bar{x}_t = (x_{t-1}^T\ x_t^T\ x_{t+1}^T)^T$, where superscript $T$ denotes transpose. Usually, $\bar{x}_t$ is augmented with the element 1 to become $(\bar{x}_t^T\ 1)^T$.
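For concreteness, the following minimal sketch (in Python; the function name and the edge-padding choice are ours, not the paper's) builds the expanded, 1-augmented observation vectors $\bar{x}_t$ from a sequence of frames:

```python
import numpy as np

def expand_context(X, L=3):
    """Stack L neighboring frames around each frame of the T x D matrix X
    and append the constant element 1, giving a T x (L*D + 1) matrix.
    Edge frames are handled by repeating the first/last frame (an
    assumption; the paper does not specify edge handling)."""
    T, D = X.shape
    half = L // 2
    Xp = np.pad(X, ((half, half), (0, 0)), mode='edge')
    rows = np.stack([Xp[t:t + L].reshape(-1) for t in range(T)])
    return np.hstack([rows, np.ones((T, 1))])
```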

For speech recognition, the transformation $f(\bar{x}_t, \phi_t)$ is typically optimized to improve the likelihood of the observed feature $p(x_t \mid \Theta, \Lambda_x)$. Here $\Theta$ denotes the parameters of the nonlinear transformation and $\Lambda_x$ denotes the speaker-independent HMM parameters, which include Gaussian mixture model (GMM) parameters $\{\pi_m, \mu_m, \Sigma_m;\ m = 1,\dots,M\}$ and state transition probabilities $\{c_{ij};\ i,j = 1,\dots,S\}$, where $M$ is the total number of Gaussian components, $S$ is the number of states, and $\pi_m$, $\mu_m$ and $\Sigma_m$ are the weight, mean vector and diagonal covariance matrix, respectively, of the $m$-th Gaussian component.

Directly improving the above likelihood is difficult. We instead iteratively estimate $\Theta$ through the following auxiliary function:

$$Q(\Theta, \tilde{\Theta}) = -\frac{1}{2}\sum_{m,t}\gamma_m(t)\,\big(f(\bar{x}_t,\phi_t)-\mu_m\big)^T \Sigma_m^{-1} \big(f(\bar{x}_t,\phi_t)-\mu_m\big) + \sum_{m,t}\gamma_m(t)\log|J_t| \qquad (1)$$

where $\tilde{\Theta}$ is the previous estimate of $\Theta$, and $\gamma_m(t)$ is the posterior probability of Gaussian $m$ at time $t$ given the HMM parameters $\Lambda_x$ and the previous estimate $\tilde{\Theta}$ of the nonlinear transformation parameters. $J_t = \partial f(\bar{x}_t,\phi_t)/\partial x_t$ is the Jacobian of the feature transformation at time $t$. Note that we assume that the feature transformation does not depend on a particular Gaussian $m$.

For speaker adaptation, the observation vectors $x_t$ come from a particular speaker. Increasing the above auxiliary function score corresponds to learning a speaker-dependent transformation that increases the likelihood of the observations $x_t$ given the GMM parameters.

Theoretically, the transformation function $f(\bar{x}_t, \phi_t)$ can be any function. In this paper we focus on compensating for the bias change, i.e.,

$$y_t = f(\bar{x}_t, \phi_t) = x_t + g(\bar{x}_t, \phi_t), \qquad (2)$$

where $g(\bar{x}_t, \phi_t)$ is a nonlinear function compensating for the bias and is defined by the SHLNN shown in Fig. 1. Similar to the ELM, the SHLNN consists of a linear input layer, a sigmoid nonlinear hidden layer, and a linear output layer. The SHLNN parameters $\Theta$ include a $D \times K$ matrix $U$, a $K \times LD$ matrix $W$ and a scalar $\alpha$, where $U$ is the upper-layer weights connecting the hidden layer and the output layer, $W$ is the lower-layer weights connecting the input layer and the hidden layer, and $\alpha$ is a parameter that controls the steepness of the sigmoid function at the hidden layer. Since $\phi_t$ can be considered part of the observation, we merge $\phi_t$ into $\bar{x}_t$ from now on to simplify notation.

Eq. (2) can be further rewritten and simplified to

$$y_t = x_t + U h_t \qquad (3)$$


where $h_t = \theta(\alpha W \bar{x}_t)$ is the hidden layer output and $\theta(x) = 1/(1+\exp(-x)) = \exp(x)/(1+\exp(x))$ is the sigmoid function. Following the ELM, the lower-layer weight matrix $W$ is randomly generated and is independent of the adaptation data and the training criterion. The scalar $\alpha$ is typically set to a value around 1.0; in this paper, it is set to 0.6.¹
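A minimal sketch of this forward pass, with randomly generated lower-layer weights as in Section 3.5 (the uniform range $[-2, 2]$ and the use of a fixed seed follow the paper; the seed value itself and all names are illustrative):

```python
import numpy as np

def shlnn_forward(X, Xbar, U, W, alpha=0.6):
    """Compute y_t = x_t + U h_t with h_t = sigmoid(alpha * W xbar_t),
    Eqs. (2)-(3). X: T x D observations; Xbar: T x (L*D + 1) expanded,
    1-augmented inputs; W: K x (L*D + 1); U: D x K."""
    H = 1.0 / (1.0 + np.exp(-alpha * (Xbar @ W.T)))  # T x K hidden outputs
    return X + H @ U.T, H

# Lower-layer weights: random, fixed by a seed, never trained (ELM-style).
D, L, K = 39, 9, 39
rng = np.random.default_rng(0)                       # fixed seed
W = rng.uniform(-2.0, 2.0, size=(K, L * D + 1))
```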

    3. The algorithms for maximum likelihood estimation

As discussed in the last section, the only parameter set we need to learn is $U$; its learning algorithm is derived in this section.

    Substituting (3) into (1), we have

$$Q(\Theta, \tilde{\Theta}) = -\frac{1}{2}\sum_{m,t}\gamma_m(t)\,(U h_t - \rho_{mt})^T \Sigma_m^{-1} (U h_t - \rho_{mt}) + \sum_{m,t}\gamma_m(t)\log|J_t|, \qquad (4)$$

where $\rho_{mt} = \mu_m - x_t$. The Jacobian $J_t = \partial y_t/\partial x_t \in \mathbb{R}^{D\times D}$ is

$$J_t = I_{D\times D} + \alpha\, U H_t (I_{K\times K} - H_t) W_c, \qquad (5)$$

where $W_c \in \mathbb{R}^{K\times D}$ is the submatrix of $W$ whose columns correspond to the center observation $x_t$ in the context window, $H_t \in \mathbb{R}^{K\times K}$ is a diagonal matrix with diagonal elements $\theta(\alpha W \bar{x}_t)$, and $I_{l\times l} \in \mathbb{R}^{l\times l}$ is an identity matrix.

$U$ can be learned with the gradient ascent algorithm, which iteratively updates the estimate of $U$ as

$$U \leftarrow U + \eta\,\frac{\partial Q}{\partial U}, \qquad (6)$$

where $\eta$ is the step size,

$$\frac{\partial Q}{\partial U} = -\sum_{m,t}\gamma_m(t)\,\Sigma_m^{-1}(U h_t - \rho_{mt})\,h_t^T + \sum_{m,t}\gamma_m(t)\,\frac{\partial \log|J_t|}{\partial U} \qquad (7)$$

and

$$\mathrm{vec}\left(\frac{\partial \log|J_t|}{\partial U}\right)^T = \mathrm{vec}(J_t^{-T})^T\,\frac{\partial J_t}{\partial U} = \mathrm{vec}(J_t^{-T})^T\big[\alpha\,(H_t(I_{K\times K}-H_t)W_c)^T \otimes I_{D\times D}\big], \qquad (8)$$

where $\mathrm{vec}(\cdot)$ is the vectorization operation that stacks the columns of a matrix. To obtain $\partial \log|J_t|/\partial U \in \mathbb{R}^{D\times K}$, we reshape the above vector into a $D\times K$ matrix.

    3.1. Maximum likelihood estimation on transformed features

We rewrite the auxiliary function as

$$Q(U, \tilde{U}) = -\frac{1}{2}\sum_{m,t}\gamma_m(t)\,(U h_t - \rho_{mt})^T \Sigma_m^{-1} (U h_t - \rho_{mt}) \qquad (9)$$

by ignoring the Jacobian term, where $\tilde{U}$ is the previous estimate of the upper-layer weights $U$. This simplified auxiliary function maximizes the likelihood of the transformed data $p(y_t \mid U, \Lambda_x)$ rather than that of the observed data $p(x_t \mid U, \Lambda_x)$.

A closed-form solution can be obtained for the above optimization problem by running one iteration of the Gauss–Newton update with the initial value $U = 0$. This is because the criterion (9) is a quadratic function of $U$ and its gradient is linear. Using the Gauss–Newton method, we have

$$\mathrm{vec}(U) = -\left(\frac{\partial^2 Q}{\partial U^2}\right)^{-1}\mathrm{vec}\left(\frac{\partial Q}{\partial U}\right)\bigg|_{U=0} = -\left(\frac{\partial^2 Q}{\partial U^2}\right)^{-1}\mathrm{vec}\left(\sum_{m,t}\gamma_m(t)\,\Sigma_m^{-1}\rho_{mt}h_t^T\right)\bigg|_{U=0}, \qquad (10)$$

where, using matrix calculus [18],

$$\frac{\partial^2 Q}{\partial U^2}\bigg|_{U=0} = -\sum_{m,t}\gamma_m(t)\,\big(h_t h_t^T \otimes \Sigma_m^{-1}\big) \qquad (11)$$

and $\otimes$ is the Kronecker product. The term (11) can be written in detail as

$$\sum_{m,t}\gamma_m(t)\,\big(h_t h_t^T \otimes \Sigma_m^{-1}\big) = \sum_{m,t}\gamma_m(t)\begin{pmatrix} h_{t1}h_{t1}\,\Sigma_m^{-1} & \cdots & h_{t1}h_{tK}\,\Sigma_m^{-1} \\ \vdots & \ddots & \vdots \\ h_{tK}h_{t1}\,\Sigma_m^{-1} & \cdots & h_{tK}h_{tK}\,\Sigma_m^{-1} \end{pmatrix}, \qquad (12)$$

where $h_{ti}$ denotes the $i$-th element of the hidden layer output $h_t$, and each block $h_{ti}h_{tj}\,\Sigma_m^{-1}$ is a $D\times D$ diagonal matrix with diagonal entries $h_{ti}h_{tj}/\sigma^2_{md}$, $d = 1,\dots,D$. The memory for evaluating this term is $O(DK\times DK)$. For a setup with $D = 39$ and $K = 2000$, this amounts to $(39\times 2000)^2 \approx 6.09\times 10^9$ floating-point numbers, which corresponds to about 48 GB of memory. A typical computer cannot hold such a large matrix in memory.

Notice that the terms (11) and (14) are full but sparse matrices because $\Sigma_m$ is diagonal. Fortunately, a memory- and CPU-efficient algorithm can be derived by using the permutation matrix [18]. The intuition behind the algorithm is to re-arrange the full but sparse matrix in (14) to be block-diagonal. A permutation matrix $T_{m,n} \in \mathbb{R}^{mn\times mn}$ is a matrix operator composed of 0s and 1s, with a single 1 in each row and column. It has a number of special properties; in particular, $T_{m,n}^{-1} = T_{n,m} = T_{m,n}^T$ and $T_{m,n}\,\mathrm{vec}(A) = \mathrm{vec}(A^T)$.

Note that

$$h_t h_t^T \otimes \Sigma_m^{-1} = T_{K,D}\,\big(\Sigma_m^{-1} \otimes h_t h_t^T\big)\,T_{D,K}. \qquad (13)$$

Substituting this into (11), we have

$$\frac{\partial^2 Q}{\partial U^2}\bigg|_{U=0} = -T_{K,D}\left(\sum_{m,t}\gamma_m(t)\,\Sigma_m^{-1}\otimes h_t h_t^T\right)T_{D,K}. \qquad (14)$$
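As a quick sanity check of identity (13), the following sketch (our own, not from the paper; `commutation` builds the permutation matrix $T_{m,n}$ from its defining property $T_{m,n}\,\mathrm{vec}(A) = \mathrm{vec}(A^T)$) verifies the re-arrangement numerically on a small example:

```python
import numpy as np

def commutation(m, n):
    """Permutation matrix T_{m,n} with T_{m,n} vec(A) = vec(A^T)
    for A of shape m x n (column-major vec)."""
    T = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            T[i * n + j, j * m + i] = 1.0
    return T

D, K = 3, 2
h = np.random.randn(K, 1)
inv_sigma = np.diag(1.0 / (np.random.rand(D) + 0.1))  # diagonal Sigma^{-1}
lhs = np.kron(h @ h.T, inv_sigma)                     # h h^T (x) Sigma^{-1}
rhs = commutation(K, D) @ np.kron(inv_sigma, h @ h.T) @ commutation(D, K)
assert np.allclose(lhs, rhs)                          # identity (13) holds
```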


    Fig. 1. Neural network architecture.

¹ We experimented with the scalar values $\alpha = 0.6$ and $\alpha = 1.0$ and did not observe much performance difference. Hence, this paper only reports results in Section 4 with $\alpha = 0.6$.


By applying (14) to (10) we get

$$\mathrm{vec}(U) = T_{K,D}\left(\sum_{m,t}\gamma_m(t)\,\Sigma_m^{-1}\otimes h_t h_t^T\right)^{-1}T_{D,K}\,\mathrm{vec}\left(\frac{\partial Q}{\partial U}\right)\bigg|_{U=0}.$$

After some manipulations detailed in Appendix A.1, we get the closed-form solution for the $d$-th row of $U$, denoted $U_{d,:}$:

$$\mathrm{vec}(U_{d,:}) = \left(\sum_{m,t}\gamma_m(t)\,\frac{1}{\sigma^2_{md}}\,h_t h_t^T\right)^{-1}\mathrm{vec}\left(\frac{\partial Q}{\partial U_{d,:}}\right)\bigg|_{U=0}, \quad d = 1,\dots,D, \qquad (15)$$

where $\sigma^2_{md}$ is the $(d,d)$-th element of the diagonal covariance matrix $\Sigma_m$.

This corresponds to re-arranging the term (14) into the following block-diagonal matrix:

$$\sum_{m,t}\gamma_m(t)\begin{pmatrix} \frac{1}{\sigma^2_{m1}}\,h_t h_t^T & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \frac{1}{\sigma^2_{mD}}\,h_t h_t^T \end{pmatrix}. \qquad (16)$$

Keeping only the nonzero diagonal blocks, the total memory cost is now $O(D\times K\times K)$. In terms of computational cost, the matrix inversion $(\partial^2 Q/\partial U^2)^{-1}$ in (10) operates on a matrix of dimension $KD\times KD$, whereas in the new algorithm the inversion $\big(\sum_{m,t}\gamma_m(t)(1/\sigma^2_{md})\,h_t h_t^T\big)^{-1}$ in (15) operates on diagonal blocks of dimension $K\times K$. Since the computational cost of a typical matrix inversion algorithm such as Gauss–Jordan elimination [19] is $O(n^3)$ for a matrix of dimension $n\times n$, the new algorithm costs $O(D\times K^3)$ for the matrix inversions whereas the inversion in (10) costs $O(D^3\times K^3)$ operations.
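The following sketch implements the per-row closed-form update (15) under Viterbi alignment (Section 3.5), so that each frame $t$ is assigned to exactly one Gaussian $m(t)$; the array names and shapes are our own conventions, not the paper's:

```python
import numpy as np

def closed_form_upper_weights(H, rho, inv_var):
    """Closed-form solution (15). H: T x K hidden outputs h_t;
    rho: T x D residuals mu_{m(t)} - x_t; inv_var: T x D inverse
    variances 1/sigma^2_{m(t),d}. Returns the D x K matrix U."""
    T, K = H.shape
    D = rho.shape[1]
    U = np.empty((D, K))
    # Gradient at U = 0: sum_t Sigma^{-1} rho_t h_t^T (Eq. (10)), D x K.
    grad = np.einsum('td,tk->dk', inv_var * rho, H)
    for d in range(D):
        # d-th K x K diagonal block: sum_t (1/sigma^2_{td}) h_t h_t^T.
        A = np.einsum('t,tk,tj->kj', inv_var[:, d], H, H)
        # Solve A u_d = grad_d instead of inverting A explicitly.
        U[d] = np.linalg.solve(A, grad[d])
    return U
```

Solving $D$ independent systems of size $K\times K$ mirrors the $O(D\times K^3)$ cost discussed above.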

    3.2. Maximum likelihood estimation on observed features

To optimize for the observed features we cannot drop the Jacobian term in (4), and we solve the problem using an iterative Gauss–Newton procedure:

$$U \leftarrow \tilde{U} - \eta\left(\frac{\partial^2 Q}{\partial U^2}\right)^{-1}\frac{\partial Q}{\partial U}\bigg|_{U=\tilde{U}}. \qquad (17)$$

The procedure starts from an initial estimate, such as $U = 0$, and runs until a certain number of iterations is reached.

We keep the Hessian term from (11) unchanged, but include the first-order term from (8). Applying permutation matrices as shown in Appendix A.2, we can rewrite (8) as

$$\mathrm{vec}(J_t^{-T})^T\big[\alpha\,(H_t(I_{K\times K}-H_t)W_c)^T \otimes I_{D\times D}\big] = \alpha\Big(\mathrm{vec}(J_t^{-1})^T\big[I_{D\times D}\otimes (H_t(I_{K\times K}-H_t)W_c)^T\big]\Big)^T, \qquad (18)$$

which is a vector of dimension $DK$. Its $d$-th subvector of dimension $K$ is $\alpha\,(J_t^{-1})^T_{:,d}\,(H_t(I_{K\times K}-H_t)W_c)^T$.

Applying permutation matrices similarly to the previous section, we can write (17) row-wise as follows:

$$\mathrm{vec}(U_{d,:}) = \mathrm{vec}(\tilde{U}_{d,:}) + \eta\left(\sum_{m,t}\gamma_m(t)\,\frac{1}{\sigma^2_{md}}\,h_t h_t^T\right)^{-1}\mathrm{vec}\left(\frac{\partial Q}{\partial U_{d,:}}\right)\bigg|_{U=\tilde{U}}, \quad d = 1,\dots,D. \qquad (19)$$

To obtain $\mathrm{vec}(\partial Q/\partial U_{d,:})$, we substitute (18) into (7), i.e.,

$$\mathrm{vec}\left(\frac{\partial Q}{\partial U_{d,:}}\right) = -\sum_{m,t}\gamma_m(t)\,\big(\Sigma_m^{-1}(U h_t - \rho_{mt})\,h_t^T\big)_{:,d} + \sum_{m,t}\gamma_m(t)\,\alpha\,(J_t^{-1})^T_{:,d}\,\big(H_t(I_{K\times K}-H_t)W_c\big)^T. \qquad (20)$$

Note that we assume the Hessian $\partial^2 Q/\partial U^2$ is kept unchanged once it has been estimated at the point $U = 0$, in order to speed up computation. The Gauss–Newton estimate can be initialized from either $U = 0$ or the closed-form solution (15); the latter provides a better initial estimate and requires fewer iterations to estimate $U$.
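A sketch of one Gauss–Newton step (19)–(20), again under Viterbi alignment and with the frozen per-row Hessian blocks `A_blocks` precomputed at $U = 0$ as in (15); `Wc` is the $K\times D$ center submatrix of $W$ (e.g., `W[:, (L//2)*D:(L//2+1)*D]` under the context layout used earlier). All names are illustrative:

```python
import numpy as np

def gauss_newton_step(U, H, rho, inv_var, Wc, A_blocks, alpha=0.6, eta=1e-2):
    """One update of (19) using the gradient (20). Shapes as before;
    A_blocks[d] is the K x K block of (15) for output dimension d."""
    T, K = H.shape
    D = rho.shape[1]
    grad = np.zeros((D, K))
    for t in range(T):
        h = H[t]
        err = inv_var[t] * (U @ h - rho[t])     # Sigma^{-1}(U h_t - rho_t)
        grad -= np.outer(err, h)                # first term of (7)
        S = (h * (1.0 - h))[:, None] * Wc       # H_t (I - H_t) Wc, K x D
        J = np.eye(D) + alpha * (U @ S)         # Jacobian J_t, Eq. (5)
        Jinv = np.linalg.inv(J)                 # SVD may be used instead
        grad += alpha * (Jinv.T @ S.T)          # d log|J_t| / dU, Eq. (8)
    for d in range(D):
        U[d] = U[d] + eta * np.linalg.solve(A_blocks[d], grad[d])
    return U
```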

    3.3. A tandem scheme

The proposed nonlinear bias compensation approach can be applied on top of the affine fMLLR [10],

$$\hat{x}_t = A x_t + b = \Psi \xi_t, \qquad (21)$$

to further improve the performance, where $A \in \mathbb{R}^{D\times D}$ is the transformation that rotates the input $x_t$ at time $t$, $b \in \mathbb{R}^{D}$ is the bias added to the input, $\Psi = [A, b] \in \mathbb{R}^{D\times(D+1)}$ is the combined affine transformation, and $\xi_t = [x_t^T\ 1]^T$ is the augmented observation vector at time $t$. This tandem scheme can take advantage of neighboring frames and compensate for the nonlinear mismatches that remain after the affine fMLLR.
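A minimal sketch of the tandem scheme, reusing `expand_context` and the conventions of the earlier sketches (the fMLLR parameters `A` and `b` are assumed already estimated; all names are ours):

```python
import numpy as np

def tandem_transform(X, A, b, U, W, alpha=0.6, L=9):
    """Apply the affine fMLLR (21), then the nonlinear bias
    compensation (3) on the fMLLR output. X: T x D observations."""
    Xhat = X @ A.T + b[None, :]                 # x_hat_t = A x_t + b
    Xbar = expand_context(Xhat, L)              # stack fMLLR-ed neighbors
    H = 1.0 / (1.0 + np.exp(-alpha * (Xbar @ W.T)))
    return Xhat + H @ U.T                       # y_t = x_hat_t + U h_t
```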

    3.4. Computational costs

We list the computational costs of our algorithms in Table 1. The forward propagation process in each method computes values such as $U h_t$; the backward propagation process re-estimates the neural network parameters $U$ and $W$. We use $N$ to denote the number of observation frames. As described in Section 3.1, the solution (15) is not an iterative algorithm, so its forward process costs $O(N\times L\times D\times K)$. The cost of solution (15) in the backward process, to estimate the neural network parameters, is $O(D\times K^3)$. Our second solution (19), using Eq. (20), needs to compute the inverse matrix $J_t^{-1}$ at each time frame $t$. Since an inversion costs $O(D^3)$, which dominates the per-frame computation, solution (19) costs $O(N\times D^3)$ per iteration. Considering that solution (19) may take $J$ iterations to converge, its total backward cost is $O(J\times N\times D^3)$, which is larger than the cost of solution (15). In the forward process of solution (19), the cost of computing $U h_t$ is $O(J\times N\times L\times D\times K)$, which is higher than the forward-pass cost of solution (15) because solution (19) takes $J$ iterations.

Other methods such as [13] use iterative forward and backward propagation algorithms to estimate both the lower- and upper-layer weights, each costing $O(J\times N\times L\times D\times K)$ operations to converge. According to Table 1, the cost of the forward process in [13] is higher than that of solution (15). Moreover, since $L$ is usually larger than $D$ and $K$ is approximately the same as $D$, the cost of backward propagation in [13] is also higher than those of our solutions (15) and (19). It is worth mentioning that the number of iterations $J$ in our solutions may be smaller than that used in [13] because our solutions use the Gauss–Newton method, which converges faster than the gradient descent method used in [13]. In particular, the solution (15) is a closed-form solution and converges in one iteration.


Table 1
The computational costs of our algorithms. $D$ is the feature dimension, $L$ is the context length, $K$ is the hidden layer size, $N$ is the number of observation frames, and $J$ is the number of iterations.

Method | Forward prop. | Backward prop.
Solution (15) in Section 3.1 | $O(N\times L\times D\times K)$ | $O(D\times K^3)$
Solution (19) in Section 3.2 | $O(J\times N\times L\times D\times K)$ | $O(J\times N\times D^3)$
Method in [13] | $O(J\times N\times L\times D\times K)$ | $O(J\times N\times L\times D\times K)$



    3.5. Other implementation considerations

Note that we have $\gamma_m(t)$, the posterior probability of Gaussian $m$ at time $t$, in the above algorithm. In practice, we use Viterbi alignment at the Gaussian level, so $\gamma_m(t)$ is either 1 or 0.

For completeness, we list in Appendix A.3 the auxiliary functions for fMLLR [10]. To debug our implementations, we compared the scores of these auxiliary functions against the scores produced by the proposed algorithm.

To make the results reproducible, the randomization of $W$ starts from a fixed seed. The random values for the lower-layer weights $W$ are drawn from the range $-2.0$ to $2.0$.

Input vectors need to be normalized to zero mean and unit variance in each input feature dimension. This improves the convergence of the neural network learning.

The step size $\eta$ is important for the Gauss–Newton algorithm. It starts from a small value, usually $1\times 10^{-2}$. If an iteration does not increase the auxiliary score, the current estimate is reverted to the previous estimate and the step size $\eta$ is halved for the next iteration, as sketched below.
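A sketch of this step-size control (the callables `aux_score`, which evaluates the auxiliary function (4), and `step_fn`, which performs one update such as `gauss_newton_step`, are assumed supplied by the caller; both names are ours):

```python
def adapt_with_backtracking(U0, aux_score, step_fn, num_iters=10, eta=1e-2):
    """Run iterative updates, reverting to the previous estimate and
    halving the step size whenever the auxiliary score does not increase."""
    U, best = U0.copy(), aux_score(U0)
    for _ in range(num_iters):
        U_new = step_fn(U.copy(), eta)  # copy: keep U intact if rejected
        score = aux_score(U_new)
        if score > best:
            U, best = U_new, score      # accept the new estimate
        else:
            eta *= 0.5                  # keep the previous estimate, halve eta
    return U
```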

For numerical robustness, the inversion of the Jacobian in (8) may be implemented using singular value decomposition (SVD).

    4. Experiments

    4.1. System descriptions

We conduct automatic speech recognition (ASR) experiments on an internal Xbox voice search data set. The scenario supports distant-talking voice search (of music catalogs, games, movies, etc.) using a microphone array. The training set of the speaker-independent (SI) GMM–HMM system contains 40 h of Xbox voice search commands. The evaluation was conducted on data from 25 speakers. For each speaker, 200 utterances are used for adaptation and 100 utterances are used for testing. The total number of test sentences is 2500, with 10,563 words. There is no overlap among the training, adaptation and test sets. We use Kaldi [20] to train a weighted finite state transducer (WFST)-based recognition system [21]. The features are 13-dimensional Mel-frequency cepstral coefficients (MFCCs) with delta and delta–delta coefficients. Per-device cepstral mean subtraction is applied. The SI acoustic model has 7.5 k Gaussian components trained with the standard maximum likelihood estimation (MLE) procedure [1]. We compare the proposed approaches with the baseline system without adaptation and with the system using fMLLR [10].

    4.2. Effects of hidden layer and context window size

4.2.1. Hidden layer size

We first conducted experiments to investigate how the hidden layer size $K$ affects the performance, using a development set. Notice that there are methods [22,23] to automatically decide the hidden layer size. The method in [22] selects the hidden layer size by reducing its regression error to a predefined threshold. The method in [23] proposes a bound on the generalization error and selects the hidden layer size by maximizing the coverage of unseen samples under a constraint or by thresholding the bound. For the speech recognition task, the most direct measure of error is the word error rate; however, to our knowledge, there is not yet a method that directly relates the word error rate to the neural network size.

We therefore empirically investigated the effect of the hidden layer size on speech recognition performance. In these experiments we used the solution (15) (i.e., optimizing for the transformed features) to estimate the upper-layer weights $U$. Fig. 2 summarizes the experimental results under the condition that the input dimension is set to 351 (i.e., $L = 9$ frames) and $\alpha$ is set to 0.6. We achieved the lowest WER of 25.73% when 39 hidden units were used. Additional experiments using different input layer sizes indicate that the lowest WERs were typically obtained with 20–40 hidden units. This suggests that the optimal hidden layer size is not sensitive to the input layer size. The relatively small optimal hidden layer size is attributed to the small adaptation set used.

4.2.2. Context window size

Other experiments on neural network speech recognition [2,24] suggest using 9–11 frames of context window. Notice that 9–11 frames correspond to 90–110 ms, approximately the average duration of a phone in the speech signal. We therefore conducted experiments to cross-validate the above recommendation using the same development set. We evaluated the impact of the context window size $L$ on the word error rate (WER) when the hidden layer size $K$ is fixed, using the same procedure described above. Table 2 summarizes our results. From Table 2 we observe that the lowest WERs are obtained with a context window of $L = 9$ if the hidden layer size is large enough to take advantage of it.

Similar experiments were conducted using the solution (19) (i.e., optimizing for the observed data), which was initialized from the closed-form solution (15) and run for 10 iterations.


Fig. 2. The effect of hidden layer size on word error rate (WER) with solution (15). The horizontal axis is the hidden layer size.

Table 2
The effect of context window size on word error rate (WER, %) with solution (15).

Context window size | K = 20 | K = 39
1  | 28.80 | 31.91
3  | 27.09 | 28.66
9  | 27.22 | 25.73
11 | 28.08 | 25.91

Table 3
The effect of context window size and hidden layer size on WERs (%) with solution (19).

Context window size | K = 20 | K = 39
1  | 28.66 | 31.49
9  | 27.06 | 25.68
11 | 28.41 | 25.73


Table 3 reports the WER results with the context window size $L$ set to 1, 9, and 11 and the hidden layer size $K$ set to 20 and 39. The step size $\eta$ was set to $1\times 10^{-2}$ for context window $L = 1$, $5\times 10^{-3}$ for $L\in\{9,11\}$ with $K = 20$, and $1\times 10^{-3}$ for $L\in\{9,11\}$ with $K = 39$. In all these cases, using a 9-frame context gives the best WER. These experiments cross-validate the recommendation of using 9–11 frames for neural network speech recognition [2,24].

Remark 1. It is worth pointing out that most previously developed methods [4–6,10,13] cannot take advantage of the neighboring frames, which, as indicated by Tables 2 and 3, have a positive effect. The method in [17] also benefits from the neighboring frames, but uses a linear model.

Remark 2. Notice that the closed-form solution (15) converges in one iteration and has performance similar to that of the iterative solution (19), which takes 10 iterations. Other existing methods [13] use the gradient descent method and are therefore even slower than the Gauss–Newton method used in the iterative solution (19). We measured the real-time factor (RTF) on one speaker with the setup $K = 39$ and $L = 9$, which has the lowest WER. The solution (15) has an RTF of 0.22 and runs in real time; the solution (19) has an RTF of 1.21. The closed-form solution (15) is therefore desirable in conditions requiring fast adaptation.

    4.3. Main results

Table 4 compares our proposed bias compensation approaches with the SI system and with fMLLR with bias compensation [10]. In these experiments we used 9 frames of input (351 input units), 20 or 39 hidden units, and 39 output units. The $\alpha$ was set to 0.6 for the proposed approaches. The step size $\eta$ was set to $5\times 10^{-3}$ and $1\times 10^{-3}$ for hidden layer size $K$ set to 20 and 39, respectively.

From Table 4 we observe that fMLLR with bias compensation reduces the WER from 31.52%, achieved by the SI system, to 29.49%. This reduction is statistically significant, with a significance level of $1.3\times 10^{-3}$. With the hidden layer size $K$ set to 20, our proposed approaches further reduce the WER to 27.22% and 27.06%, respectively, with and without dropping the Jacobian term. Compared to fMLLR with bias compensation, our approaches cut errors by 7.7% and 8.2%. These reductions are significant, with significance levels of $2.5\times 10^{-4}$ and $8.7\times 10^{-5}$, respectively.

Note that the method with the closed-form solution (15), which does not consider the Jacobian term, achieves performance similar to that of the Gauss–Newton method (19). This provides a significant gain in efficiency since the closed-form solution is much faster than the Gauss–Newton method. In fact, as shown in the table, the WER can be further reduced to 25.73% using the closed-form solution (15) if 39, instead of 20, hidden units are used. This translates to a relative WER reduction of 13.0% over the fMLLR with bias compensation system and is statistically significant with a significance level of $9.5\times 10^{-10}$ [25]. The method with the Jacobian term further reduces the WER to 25.68%, with a significance level of $5.6\times 10^{-10}$.

The tandem scheme proposed in Section 3.3 applies both bias and rotation compensation. Table 5 compares the tandem scheme, with the closed-form solution and with the Gauss–Newton method, against the affine fMLLR. In these experiments an SHLNN with 9-frame input features, 39 hidden units and 39 output units was used. The step size $\eta$ was set to $1\times 10^{-3}$. As shown in the table, our proposed approaches reduce the WER to 22.80% and 22.68%, respectively, with and without dropping the Jacobian term, from the 24.03% achieved by the affine fMLLR. These results correspond to 5.1% and 5.6% relative WER reductions, with significance levels of $3.5\times 10^{-2}$ and $2.0\times 10^{-2}$, respectively. Compared to the SI model, our proposed methods cut the WER significantly, by more than 27%, with a significance level of essentially zero.

Table 4
Comparison of the proposed approaches with the SI model and with fMLLR with bias compensation.

System | WER (%)
SI model (w/o adaptation) | 31.52
fMLLR with bias compensation | 29.49
Proposed approach with closed-form solution (K = 20) | 27.22
Proposed approach with Gauss–Newton method (K = 20) | 27.06
Proposed approach with closed-form solution (K = 39) | 25.73
Proposed approach with Gauss–Newton method (K = 39) | 25.68

Table 5
Comparison of the tandem scheme with the affine fMLLR.

System | WER (%)
Affine fMLLR | 24.03
Tandem scheme with closed-form solution | 22.80
Tandem scheme with Gauss–Newton method | 22.68

    5. Conclusions and discussions

In this paper, we have introduced a novel approach to training single-hidden-layer neural networks to reduce the mismatch between training and testing for speaker adaptation of GMM–HMMs. The neural network may be considered a case of the recently proposed extreme learning machine, which uses randomly generated lower-layer weights and linear output units. To estimate the upper-layer weights, we have developed an efficient closed-form solution and a Gauss–Newton method. Our proposed approach compensates for a time-dependent bias and takes advantage of neighboring observation frames. On a large vocabulary speech recognition task, we show that the proposed approach can reduce word error rates by more than 27% over a speaker-independent system. Our proposed nonlinear adaptation approach, which uses a closed-form solution, is also significantly faster than prior art [13] that uses gradient-based methods such as back-propagation [26].

To speed up the adaptation process we have used randomly generated lower-layer weights, similar to the ELM. However, a trade-off may be made between the adaptation speed and the power of the model. For example, instead of using random values we may learn the lower-layer weights as well, possibly by exploiting the constraint imposed by the closed-form relationship between the upper- and lower-layer weights. Alternatively, we may stack one ELM on top of another [15,16] via input–output concatenation to increase the modeling power, which has been shown to improve phone recognition accuracy over the ELM [27,28]. We believe that these extensions may further decrease the recognition error rate, at the cost of increased adaptation time.

    Appendix A

    A.1. Derivation of an efficient closed-form solution

Since $\Sigma_m^{-1}$ is a diagonal matrix, $\Sigma_m^{-1}\otimes h_t h_t^T$ is a block-diagonal matrix. We therefore have

$$\mathrm{vec}(U) = T_{K,D}\begin{pmatrix} \sum_{m,t}\gamma_m(t)\frac{1}{\sigma^2_{m1}}\,h_t h_t^T & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sum_{m,t}\gamma_m(t)\frac{1}{\sigma^2_{mD}}\,h_t h_t^T \end{pmatrix}^{-1} T_{D,K}\,\mathrm{vec}\left(\frac{\partial Q}{\partial U}\right)\bigg|_{U=0}. \qquad (22)$$

Multiplying both sides by $T_{K,D}^{-1}$ and using $T_{K,D}^{-1}\,\mathrm{vec}(U) = T_{D,K}\,\mathrm{vec}(U) = \mathrm{vec}(U^T)$ together with $T_{D,K}\,\mathrm{vec}(\partial Q/\partial U) = \mathrm{vec}\big((\partial Q/\partial U)^T\big)$, we get

$$\mathrm{vec}(U^T) = \begin{pmatrix} \sum_{m,t}\gamma_m(t)\frac{1}{\sigma^2_{m1}}\,h_t h_t^T & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sum_{m,t}\gamma_m(t)\frac{1}{\sigma^2_{mD}}\,h_t h_t^T \end{pmatrix}^{-1} \mathrm{vec}\left(\left(\frac{\partial Q}{\partial U}\right)^T\right)\bigg|_{U=0} \qquad (23)$$

$$= \begin{pmatrix} \big(\sum_{m,t}\gamma_m(t)\frac{1}{\sigma^2_{m1}}\,h_t h_t^T\big)^{-1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \big(\sum_{m,t}\gamma_m(t)\frac{1}{\sigma^2_{mD}}\,h_t h_t^T\big)^{-1} \end{pmatrix} \mathrm{vec}\left(\left(\frac{\partial Q}{\partial U}\right)^T\right)\bigg|_{U=0}. \qquad (24)$$

This indicates that each row of $U$, which is a vector of dimension $K$, can be updated independently of the other rows, as in (15).

A.2. Derivation of an efficient Gauss–Newton algorithm when considering the Jacobian term

Using the permutation matrix, we rewrite the right-hand side of (8) as

$$\begin{aligned}
\mathrm{vec}(J_t^{-T})^T\big[\alpha\,(H_t(I_{K\times K}-H_t)W_c)^T \otimes I_{D\times D}\big]
&= \mathrm{vec}(J_t^{-T})^T\,\alpha\,T_{DD}\big[I_{D\times D}\otimes (H_t(I_{K\times K}-H_t)W_c)^T\big]\,T_{DK} \\
&= \alpha\,\big(T_{DD}^T\,\mathrm{vec}(J_t^{-T})\big)^T\big[I_{D\times D}\otimes (H_t(I_{K\times K}-H_t)W_c)^T\big]\,T_{DK} \\
&= \alpha\,\mathrm{vec}(J_t^{-1})^T\big[I_{D\times D}\otimes (H_t(I_{K\times K}-H_t)W_c)^T\big]\,T_{DK} \\
&= \alpha\Big(T_{DK}^T\big(\mathrm{vec}(J_t^{-1})^T\big[I_{D\times D}\otimes (H_t(I_{K\times K}-H_t)W_c)^T\big]\big)^T\Big)^T \\
&= \alpha\Big(T_{KD}\big(\mathrm{vec}(J_t^{-1})^T\big[I_{D\times D}\otimes (H_t(I_{K\times K}-H_t)W_c)^T\big]\big)^T\Big)^T \\
&= \alpha\Big(\mathrm{vec}(J_t^{-1})^T\big[I_{D\times D}\otimes (H_t(I_{K\times K}-H_t)W_c)^T\big]\Big)^T, \qquad (25)
\end{aligned}$$

where $T_{DK}$ is a permutation matrix. The above is a vector of dimension $DK$.

    A.3. Auxiliary scores by fMLLR

fMLLR can be implemented either as a bias transformation or as an affine transformation. In the case of the bias transformation, its auxiliary score is

$$Q = -\frac{1}{2}\sum_{m,t}(b-\rho_{mt})^T\Sigma_m^{-1}(b-\rho_{mt}), \qquad (26)$$

where $b$ is a global bias added to the input observation $x_t$.

In the case of the affine transformation, the auxiliary score is as follows:

$$Q = \beta\log(|A|) - \frac{1}{2}\sum_{j}\big(w_j G_j w_j^T - 2 w_j k_j\big) - \frac{1}{2}\sum_{m,t}\gamma_m(t)\,\mu_m^T\Sigma_m^{-1}\mu_m, \qquad (27)$$

where $w_j$ is the $j$-th row of the transformation $[A, b]$, and $G_j$ and $k_j$ are the second- and first-order statistics for the fMLLR estimation [10].
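For the debugging comparison mentioned in Section 3.5, a sketch of evaluating the bias-fMLLR auxiliary score (26), under the same Viterbi-aligned array conventions as in the earlier sketches:

```python
import numpy as np

def fmllr_bias_aux_score(b, rho, inv_var):
    """Eq. (26): Q = -1/2 sum_t (b - rho_t)^T Sigma^{-1} (b - rho_t),
    with rho: T x D and inv_var: T x D for the aligned Gaussians."""
    diff = b[None, :] - rho
    return -0.5 * np.sum(inv_var * diff * diff)
```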

    References

[1] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989) 257–286.
[2] G.E. Dahl, D. Yu, L. Deng, A. Acero, Context-dependent pre-trained deep neural networks for large vocabulary speech recognition, IEEE Trans. Audio Speech Lang. Process. 20 (1) (2012) 30–42.
[3] X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 2001.
[4] M.J.F. Gales, S.J. Young, Robust continuous speech recognition using parallel model combination, IEEE Trans. Speech Audio Process. 4 (1996) 352–359.
[5] Y. Gong, A method of joint compensation of additive and convolutive distortions for speaker-independent speech recognition, IEEE Trans. Speech Audio Process. 13 (5) (2005) 975–983.
[6] L. Deng, J. Droppo, A. Acero, Enhancement of log-spectra of speech using a phase-sensitive model of the acoustic environment, IEEE Trans. Speech Audio Process. 12 (2) (2004) 133–143.
[7] L. Deng, A. Acero, L. Jiang, J. Droppo, X. Huang, High-performance robust speech recognition using stereo training data, in: Proceedings of ICASSP, 2001, pp. 301–304.
[8] M. Afify, X. Cui, Y. Gao, Stereo-based stochastic mapping for robust speech recognition, IEEE Trans. Audio Speech Lang. Process. 17 (7) (2009) 1325–1334.
[9] A. Sankar, C.-H. Lee, A maximum-likelihood approach to stochastic matching for robust speech recognition, IEEE Trans. Speech Audio Process. 4 (1996) 190–202.
[10] M.J.F. Gales, Maximum likelihood linear transformations for HMM-based speech recognition, Comput. Speech Lang. 12 (1998) 75–98.
[11] K. Yao, L. Netsch, V. Viswanathan, Speaker-independent name recognition using improved compensation and acoustic modeling methods for mobile applications, in: Proceedings of ICASSP, 2006, pp. 173–176.
[12] J. Li, L. Deng, D. Yu, Y. Gong, A. Acero, High-performance HMM adaptation with joint compensation of additive and convolutive distortions via vector Taylor series, in: ASRU, 2007, pp. 67–70.
[13] A.C. Surendran, C.-H. Lee, M. Rahim, Nonlinear compensation for stochastic matching, IEEE Trans. Speech Audio Process. 7 (6) (1999) 643–655.
[14] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1) (2006) 489–501.
[15] D. Yu, L. Deng, Accelerated parallelizable neural network learning algorithm for speech recognition, in: INTERSPEECH, 2011, pp. 2281–2284.
[16] L. Deng, D. Yu, Deep convex net: a scalable architecture for speech pattern classification, in: INTERSPEECH, 2011, pp. 2285–2288.
[17] J. Huang, K. Visweswariah, P. Olsen, V. Goel, Front-end feature transforms with context filtering for speaker adaptation, in: ICASSP, 2011, pp. 4440–4443.
[18] P. Fackler, Notes on Matrix Calculus, North Carolina State University, 2005.
[19] W.H. Press, B.P. Flannery, S.A. Teukolsky, W.T. Vetterling, Gauss–Jordan elimination, in: Numerical Recipes in FORTRAN: The Art of Scientific Computing, 2nd edition, Chapter 2.1, Cambridge University Press, 1992, pp. 27–32.
[20] D. Povey, A. Ghoshal, et al., The Kaldi speech recognition toolkit, in: IEEE ASRU, 2011.
[21] M. Mohri, F. Pereira, M. Riley, Weighted finite-state transducers in speech recognition, Comput. Speech Lang. 16 (1) (2002) 69–88.
[22] G. Feng, G.-B. Huang, Q. Lin, R. Gay, Error minimized extreme learning machine with growth of hidden nodes and incremental learning, IEEE Trans. Neural Networks 20 (2009) 1352–1357.
[23] D.S. Yeung, W.W. Ng, D. Wang, E.C.C. Tsang, X.-Z. Wang, Localized generalization error model and its application to architecture selection for radial basis function neural network, IEEE Trans. Neural Networks 18 (2007) 1294–1305.
[24] A.-R. Mohamed, D. Yu, L. Deng, Investigation of full-sequence training of deep belief networks for speech recognition, in: INTERSPEECH, 2010, pp. 2846–2849.
[25] L. Gillick, S. Cox, Some statistical issues in the comparison of speech recognition algorithms, in: ICASSP, 1989, pp. 532–535.
[26] A.E. Bryson, Y.-C. Ho, Applied Optimal Control: Optimization, Estimation, and Control, Blaisdell Publishing Company, 1969, p. 481.
[27] L. Deng, D. Yu, J. Platt, Scalable stacking and learning for building deep architectures, in: ICASSP, 2012, pp. 2133–2136.

[28] B. Hutchinson, L. Deng, D. Yu, A deep architecture with bilinear modeling of hidden representations: applications to phonetic recognition, in: ICASSP, 2012, pp. 4805–4808.

Kaisheng Yao received his Ph.D. degree in Communication and Information Systems from Tsinghua University, China, in 2000. From 2000 to 2002, he worked as an invited researcher at the Advanced Telecommunication Research Lab in Japan. From 2002 to 2004, he was a post-doctoral researcher at the University of California at San Diego. From 2004 to 2008, he was with Texas Instruments. Since 2008, he has been with Microsoft. His research and development interests include speech, language, and image processing, machine learning, data mining, pattern recognition and signal processing. He has published 50 papers in these areas and is the inventor/co-inventor of more than 20 granted/pending patents.

Dong Yu joined Microsoft Corporation in 1998 and the Microsoft Speech Research Group in 2002, where he is currently a senior researcher. His research interests include speech processing, robust speech recognition, discriminative training, and machine learning. His most recent work focuses on deep learning and its applications to large vocabulary speech recognition. He has published over 100 papers in these areas and is the inventor/co-inventor of more than 40 granted/pending patents.

He is a senior member of the IEEE. He is currently serving as a member of the IEEE Speech and Language Processing Technical Committee (2013) and an associate editor of the IEEE Transactions on Audio, Speech, and Language Processing (2011). He has served as an associate editor of the IEEE Signal Processing Magazine (2008–2011) and as the lead guest editor of the IEEE Transactions on Audio, Speech, and Language Processing special issue on deep learning for speech and language processing (2010–2011).

Li Deng received the Bachelor's degree from the University of Science and Technology of China, and the Master's and Ph.D. degrees from the University of Wisconsin–Madison. He joined the Department of Electrical and Computer Engineering, University of Waterloo, Ontario, Canada in 1989 as an Assistant Professor, where he became a Full Professor with tenure in 1996. In 1999, he joined Microsoft Research, Redmond, WA as a Senior Researcher, where he is currently a Principal Researcher. Since 2000, he has also been an Affiliate Full Professor and graduate committee member in the Department of Electrical Engineering at the University of Washington, Seattle. Prior to MSR, he also worked or taught at the Massachusetts Institute of Technology, the ATR Interpreting Telecommunications Research Lab (Kyoto, Japan), and HKUST. His current (and past) research activities include deep learning and machine intelligence, deep neural networks for speech and related information processing, automatic speech and speaker recognition, spoken language identification and understanding, speech-to-speech translation, machine translation, language modeling, information retrieval, neural information processing, dynamic systems, machine learning and optimization, parallel and distributed computing, graphical models, audio and acoustic signal processing, image analysis and recognition, compressive sensing, statistical signal processing, digital communication, human speech production and perception, acoustic phonetics, auditory speech processing, auditory physiology and modeling, noise-robust speech processing, speech synthesis and enhancement, multimedia signal processing, and multimodal human–computer interaction. In these areas, he has published over 300 refereed papers in leading journals and conferences and 3 books, and has given keynotes, tutorials, and distinguished lectures worldwide. His recent technical work (since 2009) on industry-scale deep learning with colleagues and collaborators, and his leadership in this emerging area, have created significant impact on speech recognition, signal processing, and related applications with high practical value.

He has been granted over 60 US or international patents in acoustics/audio, speech/language technology, machine learning (deep learning), and related fields. He has received numerous awards/honors bestowed by the IEEE, ISCA, ASA, Microsoft, and other organizations. He is a Fellow of the Acoustical Society of America, a Fellow of the IEEE, and a Fellow of ISCA. He served on the Board of Governors of the IEEE Signal Processing Society (2008–2010). More recently, he served as Editor-in-Chief of the IEEE Signal Processing Magazine (2009–2011), which, according to the Thomson Reuters Journal Citation Reports released in June 2010 and 2011, ranked first in both years among all IEEE publications (127 in total) and all publications within the Electrical and Electronics Engineering category worldwide (247 in total) in terms of impact factor (4.9 and 6.0), and for which he received the 2011 IEEE SPS Meritorious Service Award. He currently serves as Editor-in-Chief of the IEEE Transactions on Audio, Speech and Language Processing (2012–2014).
