[IEEE Multimedia and Expo, 2007 IEEE International Conference on, Beijing, China (2007.07.02-2007.07.05)]
INVESTIGATING DATA SELECTION FOR MINIMUM PHONE ERROR TRAINING OF ACOUSTIC MODELS

Shih-Hung Liu, Fang-Hui Chu, Shih-Hsiang Lin, Berlin Chen

Graduate Institute of Computer Science & Information Engineering, National Taiwan Normal University, Taipei, Taiwan

{g93470185, berlin}@csie.ntnu.edu.tw, {69447014, 69308027}@cc.ntnu.edu.tw

ABSTRACT

This paper considers minimum phone error (MPE) based discriminative training of acoustic models for Mandarin broadcast news recognition. A novel data selection approach based on the normalized frame-level entropy of Gaussian posterior probabilities, obtained from the word lattice of each training utterance, was explored. It has the merit of making the training algorithm focus on the training statistics of those frame samples that lie near the decision boundary, for better discrimination. Moreover, we presented a new phone accuracy function based on the frame-level accuracy of hypothesized phone arcs, instead of the raw phone accuracy function of MPE training. The underlying characteristics of the presented approaches were extensively investigated, and their performance was verified by comparison with the original MPE training approach. Experiments conducted on broadcast news collected in Taiwan showed that the integration of frame-level data selection and accuracy calculation could achieve slight but consistent improvements over the baseline system.

1. INTRODUCTION

Speech is the primary and the most convenient means of communication between individuals. Due to the successful development of much smaller electronic devices and the popularity of wireless communication and networking, it is widely believed that speech will serve as a major human-machine interface for the interaction between people and different kinds of smart devices in the near future. On the other hand, huge quantities of multimedia information, such as broadcast radio and television programs, voice mails, digital archives and so on, are continuously growing and filling our computers, networks and lives. Speech is obviously one of the most important information-bearing sources for these great volumes of multimedia.

Based on these observations, it is expected that automatic speech recognition (ASR) technology will play a very important role in human-machine interaction, as well as in the organization and retrieval of multimedia contents. When considering the development of an ASR system, acoustic modeling is always an indispensable and crucial ingredient that must be handled carefully. The purpose of acoustic modeling is to provide a method to calculate the likelihood of a speech utterance given a word sequence. In principle, the word sequence can be decomposed into a sequence of phone-like units (subwords, or INITIALs and FINALs in Mandarin Chinese) or acoustic models, each of which is normally represented by a hidden Markov model (HMM), and the corresponding model parameters can be estimated from a corpus of orthographically transcribed training utterances using maximum likelihood (ML) training. The acoustic models can alternatively be trained with discriminative training algorithms, such as maximum mutual information (MMI) training and minimum phone error (MPE) training [1, 2]. These algorithms were developed in an attempt to correctly discriminate among recognition hypotheses for the best recognition results, rather than just to fit the model distributions as ML training does, and have therefore been a focus of much active research in a wide variety of large vocabulary continuous speech recognition (LVCSR) tasks in the past few years. Moreover, in contrast to ML training, discriminative training considers not only the reference (or correct) transcript of a training utterance, but also the competing (or incorrect) hypotheses that are often obtained by performing LVCSR on the utterance.

In this paper, we consider minimum phone error (MPE) based discriminative training of acoustic models for Mandarin broadcast news recognition. A data selection approach based on the normalized frame-level entropy of Gaussian posterior probabilities obtained from the word lattice of the training utterance was explored, which has the merit of making the MPE training algorithm focus on the training statistics of those frame samples that lie near the decision boundary for better discrimination. Moreover, we presented a new phone accuracy function based on the frame-level accuracy of hypothesized phone arcs. The underlying characteristics of the presented approaches were extensively investigated and their performance was verified by comparison with the original MPE training approach.

2. MINIMUM PHONE ERROR TRAINING

Given a training set of $K$ acoustic vector sequences $O = \{\mathbf{o}_1, \ldots, \mathbf{o}_k, \ldots, \mathbf{o}_K\}$, the MPE criterion for acoustic model training aims to minimize the expected phone errors (equivalently, to maximize the expected phone accuracy) over these acoustic vector sequences using the following objective function [1]:

$F_{MPE}(\lambda) = \sum_{k=1}^{K} \sum_{W_k \in W_k^{lat}} \mathrm{RawAcc}(W_k)\, P_\lambda(W_k \mid \mathbf{o}_k)$,  (1)

where $\lambda$ denotes the set of phone-like acoustic models; $W_k^{lat}$ is the corresponding word lattice of $\mathbf{o}_k$ obtained by LVCSR; $W_k$ is one of the hypothesized word sequences in $W_k^{lat}$; $P_\lambda(W_k \mid \mathbf{o}_k)$ is the posterior probability of hypothesis $W_k$ given $\mathbf{o}_k$; and $\mathrm{RawAcc}(W_k)$ is the "raw phone accuracy" of $W_k$ in comparison with the corresponding reference transcript, typically computed as the sum of the phone accuracy measures of all phone hypotheses in $W_k$. The objective function in Eq. (1) can then be maximized by applying the Extended Baum-Welch algorithm [3] to update the mean $\mu_{qmd}$ and variance $\sigma^2_{qmd}$ of each dimension $d$ of the $m$-th Gaussian mixture component of the phone arc (or HMM) $q$ using the following equations:

1-4244-1017-7/07/$25.00 ©2007 IEEE, ICME 2007


$\hat{\mu}_{qmd} = \frac{\theta^{num}_{qmd}(O) - \theta^{den}_{qmd}(O) + D\,\mu_{qmd}}{\gamma^{num}_{qm} - \gamma^{den}_{qm} + D}$,  (2)

$\hat{\sigma}^2_{qmd} = \frac{\theta^{num}_{qmd}(O^2) - \theta^{den}_{qmd}(O^2) + D\,(\sigma^2_{qmd} + \mu^2_{qmd})}{\gamma^{num}_{qm} - \gamma^{den}_{qm} + D} - \hat{\mu}^2_{qmd}$,  (3)

$\gamma^{num}_{qm} = \sum_{k=1}^{K} \sum_{q=1}^{Q} \sum_{t=s_q}^{e_q} \gamma^k_{qm}(t)\, \max\!\left(0, \gamma^{kMPE}_q\right)$,  (4)

$\gamma^{den}_{qm} = \sum_{k=1}^{K} \sum_{q=1}^{Q} \sum_{t=s_q}^{e_q} \gamma^k_{qm}(t)\, \max\!\left(0, -\gamma^{kMPE}_q\right)$,  (5)

$\theta^{num}_{qmd}(O) = \sum_{k=1}^{K} \sum_{q=1}^{Q} \sum_{t=s_q}^{e_q} \gamma^k_{qm}(t)\, \max\!\left(0, \gamma^{kMPE}_q\right) o_t(d)$,  (6)

$\theta^{num}_{qmd}(O^2) = \sum_{k=1}^{K} \sum_{q=1}^{Q} \sum_{t=s_q}^{e_q} \gamma^k_{qm}(t)\, \max\!\left(0, \gamma^{kMPE}_q\right) o_t(d)^2$,  (7)

$\gamma^{kMPE}_q = \gamma^k_q \left(c^k_q - c^k_{avg}\right)$,  (8)

where $c^k_{avg}$ is the average phone accuracy over all hypothesized word sequences in the word lattice; $c^k_q$ is the expected phone accuracy over all hypothesized phone sequences containing phone arc $q$; $o_t(d)$ is the observation vector component at frame $t$; $s_q$ and $e_q$ are the start and end times of phone arc $q$; $\gamma^k_{qm}(t)$ is the posterior probability for mixture component $m$ of phone arc $q$ at frame $t$; $\gamma^{num}_{qm}$, $\theta^{num}_{qmd}(O)$ and $\theta^{num}_{qmd}(O^2)$ are the accumulated training statistics for mixture component $m$ of phone arc $q$ whose $c^k_q$ is larger than $c^k_{avg}$, and vice versa for $\gamma^{den}_{qm}$, $\theta^{den}_{qmd}(O)$ and $\theta^{den}_{qmd}(O^2)$; $\mu_{qmd}$ and $\sigma^2_{qmd}$ are respectively the mean and variance estimated in the previous iteration; and $D$ is a constant used to ensure positive variance values. On the other hand, the calculation of $c^k_{avg}$ and $c^k_q$ is actually based on the phone accuracies of the phone arcs in the word lattice. For example, the raw phone accuracy for each word sequence $W_k$ in the lattice can be calculated as the sum of the accuracy of each phone contained in $W_k$ [1]:

$\mathrm{RawAcc}(W_k) = \sum_{q \in W_k} \mathrm{PhoneAcc}(q)$,  (9)

where $\mathrm{PhoneAcc}(q)$ is the raw phone accuracy for a phone arc $q$ in $W_k$, which can be defined as follows:

$\mathrm{PhoneAcc}(q) = \max_{z_j} \begin{cases} -1 + 2\,e(z_j, q)/l(z_j), & z_j = q \\ -1 + e(z_j, q)/l(z_j), & z_j \neq q \end{cases}$,  (10)

where $Z_k$ is the set of phone labels in the corresponding reference transcript; $e(z_j, q)$ is the overlap length in frames (or in time) between a phone label $z_j$ in $Z_k$ and a hypothesized phone arc $q$ in $W_k$; and $l(z_j)$ is the length in frames of $z_j$. We can observe from Eqs. (4)-(8) that, in MPE training, those hypotheses having raw phone accuracies higher than the average provide positive contributions, and vice versa for those hypotheses with accuracies lower than the average.
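To illustrate, the per-utterance objective of Eq. (1) and the raw phone accuracy of Eqs. (9)-(10) can be sketched as follows. This is our own minimal illustration, not the paper's implementation: phone segments are assumed to be (label, start, end) tuples with exclusive end frames, and the lattice is a list of (posterior, hypothesis) pairs with made-up probabilities.

```python
def phone_acc(q, ref_phones):
    """Eq. (10): raw accuracy of one hypothesized arc q, taking the
    best-matching reference phone by frame overlap."""
    label, qs, qe = q
    best = float("-inf")
    for z_label, zs, ze in ref_phones:
        e = max(0, min(qe, ze) - max(qs, zs))  # overlap length e(z, q)
        l = ze - zs                            # reference length l(z)
        acc = -1 + (2 * e / l if z_label == label else e / l)
        best = max(best, acc)
    return best

def raw_acc(hyp_phones, ref_phones):
    """Eq. (9): sum of per-arc accuracies over one hypothesis."""
    return sum(phone_acc(q, ref_phones) for q in hyp_phones)

def mpe_objective(lattice, ref_phones):
    """Eq. (1), one utterance: posterior-weighted raw accuracies."""
    return sum(p * raw_acc(hyp, ref_phones) for p, hyp in lattice)

ref = [("a", 0, 10), ("b", 10, 20), ("c", 20, 30)]
hyp = [("a", 0, 15), ("c", 15, 30)]
print(raw_acc(hyp, ref))  # both arcs fully cover their labels -> 2.0

lattice = [(0.5, ref), (0.5, hyp)]  # two paths with made-up posteriors
print(mpe_objective(lattice, ref))  # 0.5*3 + 0.5*2 = 2.5
```

Note that `raw_acc` scores the deleted "b" case as fully correct (2.0), which is exactly the weakness Section 4 addresses.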

3. FRAME-LEVEL TRAINING DATA SELECTION

In this section, we elucidate the theoretical roots of frame-level training data selection using entropy information, as well as two variant implementations to achieve this goal.

3.1. Normalized Frame-Level Entropy

We propose the use of entropy information to select the frame-level training statistics for MPE training. The normalized entropy of a training frame sample $t$ can be defined as:

$E^k(t) = -\frac{1}{\log_2 N} \sum_{q=1}^{Q} \sum_{m \in q} \gamma^k_{qm}(t) \log_2 \gamma^k_{qm}(t)$,  (11)

Figure 1. A plot of the relationship between the normalized entropy and the number of training speech frame samples.

where $\gamma^k_{qm}(t)$ is the posterior probability for mixture component $m$ of phone arc $q$ at frame $t$, calculated from the word lattice; $N$ is the total number of Gaussian mixture components with nonzero posterior probabilities at frame $t$ ($\gamma^k_{qm}(t) > 0$); and the value of $E^k(t)$ ranges from zero to one [4]. Two extreme cases can be considered. First, if the normalized entropy of a frame sample $t$ is close to zero, the corresponding frame-level posterior probabilities are dominated by one specific mixture component. From the viewpoint of frame sample classification using posterior probabilities, the difference in probability between the true (correct) mixture component and the competing (incorrect) ones is large; that is, the frame sample $t$ is located far from the decision boundary. On the other hand, if the normalized entropy is close to one, the posterior probabilities of the mixture components tend to be uniformly distributed, and the frame sample $t$ is instead located near the decision boundary. In a word, the normalized entropy measure can to some extent define a kind of margin for the selection of useful training frame samples. Therefore, we may take advantage of it to make the MPE training algorithm focus on the training statistics of those frame samples near the decision boundary, for better sample discrimination and model generalization [5].

3.2. Hard Version of Frame Sample Selection (HS)

A straightforward implementation of frame-level training data selection is to define a threshold on the normalized entropy measure and then completely discard the training statistics of those frame samples whose normalized entropy values fall below it. This can be viewed as a "hard version" of data selection. Figure 1 shows the relationship between the normalized entropy and the number of training speech frame samples used in this study. For example, the leftmost vertical bar denotes the number of training speech frame samples whose normalized entropy values are in the range of 0 to 0.1. The large number of frame samples in this bar also reveals that most of the training frame samples are in fact located far from the decision boundary, and thus can be discarded if the threshold is appropriately set.
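As a sketch, the normalized entropy of Eq. (11) and the hard selection rule can be implemented as follows. This is our own minimal illustration over a flat vector of Gaussian posteriors at one frame; function and variable names are ours, not the paper's.

```python
import math

def normalized_entropy(posteriors):
    """Eq. (11): entropy of the nonzero Gaussian posteriors at one
    frame, normalized by log2(N) so the value lies in [0, 1]."""
    nz = [p for p in posteriors if p > 0]
    n = len(nz)
    if n <= 1:
        return 0.0  # a single dominating component carries no entropy
    return -sum(p * math.log2(p) for p in nz) / math.log2(n)

def hard_select(posteriors, threshold):
    """Hard version (HS): keep the frame's statistics only if its
    normalized entropy reaches the threshold."""
    return normalized_entropy(posteriors) >= threshold

# A peaked posterior (far from the decision boundary) is discarded,
# while a near-uniform one (close to the boundary) is kept.
print(hard_select([0.97, 0.01, 0.01, 0.01], 0.5))  # False
print(hard_select([0.25, 0.25, 0.25, 0.25], 0.5))  # True
```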

3.3. Soft Version of Frame Sample Selection (SS)

We also attempted an alternative implementation (or a "soft version") of frame-level training data selection, which emphasizes the training statistics of those frame samples located near the decision boundary according to their normalized entropy values, using the following formula:

$\gamma^{k\prime}_{q} = \gamma^{k}_{q}\,\left(1 + \alpha E^{k}(t)\right)$,  (12)

where $\alpha$ is a tunable positive parameter whose value ranges between 0 and 1. As indicated by Eq. (12), if the normalized entropy value $E^k(t)$ of a training frame sample $t$ is higher, its corresponding training statistics will be emphasized; whereas for a frame sample with a lower entropy value, the training statistics will be deemphasized relative to those of frame samples with higher normalized entropy values [6].
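The soft selection of Eq. (12) is simply a rescaling of the occupancy statistic; a minimal sketch under our own naming, not the paper's code:

```python
def soft_weight(gamma, entropy, alpha):
    """Eq. (12): scale a frame's occupancy statistic gamma by
    (1 + alpha * E), emphasizing high-entropy (near-boundary) frames
    relative to low-entropy ones."""
    assert 0.0 < alpha <= 1.0, "alpha is a tunable parameter in (0, 1]"
    return gamma * (1.0 + alpha * entropy)

print(soft_weight(1.0, entropy=0.9, alpha=0.5))  # 1.45: emphasized
print(soft_weight(1.0, entropy=0.1, alpha=0.5))  # 1.05: nearly unchanged
```

Unlike the hard version, no frame is discarded outright; every statistic survives with a weight between 1 and 1 + alpha.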

4. NEW ACCURACY FUNCTION

It is known that the standard MPE training approach does not sufficiently penalize deletion errors [7]. In general, the original MPE objective function discourages insertion errors more than deletion and substitution errors. Inspired by the work reported in [7, 8], in this paper we presented an alternative phone accuracy function that looks into the frame-level phone accuracies of all hypothesized word sequences in the word lattice, to replace the original raw phone accuracy function for MPE training. The frame-level phone accuracy function (TFA) is defined as:

$\mathrm{FrameAcc}(q) = \frac{\sum_{t=s_q}^{e_q} \delta(q, z(t))}{e_q - s_q + 1}$,  (13)

and

$\delta(q, z(t)) = \begin{cases} 1, & q = z(t) \\ -\rho, & q \neq z(t) \end{cases}, \quad 0 < \rho \leq 1$,  (14)

where $z(t)$ is the phone label of the reference transcript at frame $t$; $\rho$ is a tunable positive parameter used to control the penalty when the phone arc $q$ is incorrect in its label; and the value of $\mathrm{FrameAcc}(q)$ ranges from $-\rho$ to 1. For each frame $t$, we can thus easily evaluate whether a phone arc of a hypothesized word sequence in the word lattice is identical to that of the reference transcript or not. In effect, the presented frame-level phone accuracy function imposes a deletion penalty on incompletely correct phone arcs. As illustrated in Figure 2, given the reference transcript "a-b-c", the hypothesized phone sequence "a-c" is regarded as completely correct (with a score of two) by the original raw phone accuracy function of Eqs. (9) and (10); the presented frame-level phone accuracy function, however, gives the hypothesized phone sequence a score of 1.27 by taking its deletion error into account. Another frame-level phone accuracy function, which uses the sigmoid function to normalize the phone accuracy value into a range between -1 and 1, was also exploited in this paper (STFA):

$\mathrm{SigFrameAcc}(q) = \frac{2}{1 + \exp(-net)} - 1$,  (15)

and

$net = \delta(q, z(t))$,  (16)

where $\delta(q, z(t))$ was previously defined in Eq. (14). Notice that the purpose of the above phone accuracy functions is not to approximate the standard Levenshtein distance measure, but rather to more sufficiently penalize the deletion errors of speech recognition. Another discriminative training approach, which uses a finite state transducer retaining the recognition hypotheses of each training acoustic vector sequence to calculate the exact Levenshtein-distance-based word error rate, was proposed recently [9]; however, it demonstrated no improvements but only degraded results.
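A minimal sketch of the frame-level accuracy of Eqs. (13)-(14) and the sigmoid squashing of Eq. (15) follows. This is our own illustration rather than the paper's implementation: the reference is assumed to be given as one phone label per frame, and how `net` is aggregated over an arc's frames follows Eq. (16).

```python
import math

def frame_acc(q_label, start, end, ref_labels, rho=0.5):
    """Eqs. (13)-(14): average per-frame score of arc q over frames
    start..end (inclusive). A frame scores 1 if the arc label matches
    the reference label z(t), and -rho otherwise, so the result lies
    in [-rho, 1]."""
    score = sum(1.0 if ref_labels[t] == q_label else -rho
                for t in range(start, end + 1))
    return score / (end - start + 1)

def sig_frame_acc(net):
    """Eq. (15): squash a score into (-1, 1) with a sigmoid."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

# Reference "a-b-c", 10 frames per phone; the hypothesized arc "a"
# spills 5 frames into the reference "b" region, so it is no longer
# scored as fully correct.
ref = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
print(frame_acc("a", 0, 14, ref, rho=0.5))  # (10*1 + 5*(-0.5)) / 15 = 0.5
```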

5. BROADCAST NEWS SYSTEM

The major constituent parts of the broadcast news system, as well as the speech and language data used in this paper, are briefly described in this section [10].

Figure 2. An illustration of the frame-level accuracy (reference "a-b-c", hypothesis "a-c"; raw phone accuracy = 2, frame-level phone accuracy = 1.27). The shaded box indicates where the deletion error occurs.

5.1. Front-End Processing

The front-end processing was conducted with the HLDA-based (Heteroscedastic Linear Discriminant Analysis) data-driven Mel-frequency feature extraction approach, and the features were further processed by MLLT (Maximum Likelihood Linear Transformation) for de-correlation [11].

5.2. Speech Corpus and Acoustic Model Training

The speech corpus consists of about 200 hours of MATBN Mandarin television news (Mandarin Across Taiwan Broadcast News) [11], collected by Academia Sinica and the Public Television Service Foundation of Taiwan between November 2001 and April 2003. All 200 hours of speech data are equipped with corresponding orthographic transcripts, of which about 25 hours of gender-balanced speech data from field reporters, collected between November 2001 and December 2002, were used to bootstrap the acoustic training. Another set of 1.5 hours of field-reporter speech collected in 2003 was reserved for testing. The acoustic models chosen here for speech recognition are 112 right-context-dependent INITIALs and 38 context-independent FINALs.

The acoustic models were first trained with optimal settings using the ML criterion and the Baum-Welch training algorithm. The MPE-based discriminative training approaches were then applied to the acoustic models previously trained by the ML criterion. Unigram language model constraints were used in accumulating the training statistics from the word lattices for discriminative training.

5.3. Lexicon and N-gram Language Modeling

The recognition lexicon consists of 72K words. The language models used in this paper consist of unigram, bigram and trigram models, which were estimated based on the ML criterion using a text corpus of 170 million Chinese characters collected from the Central News Agency (CNA) in 2001 and 2002 (the Chinese Gigaword Corpus released by LDC).

5.4. Speech Recognition

The speech recognizer was implemented with a left-to-right frame-synchronous Viterbi tree search and a lexical prefix tree organization of the lexicon. The recognition hypotheses were organized into a word lattice for further language model rescoring. In this study, the word bigram language model was used in the tree search procedure, while the trigram language model was used in the word lattice rescoring procedure [10].

6. EXPERIMENTAL RESULTS

6.1. Baseline Experimental Results

The acoustic models were trained with 25 hours of speech utterances. The MPE training started with the acoustic models trained by 10 iterations of ML training, and used the information contained in the associated word lattices of the training utterances to accumulate the necessary statistics for model training. The ML-trained acoustic models yield a CER (Chinese Character Error Rate) of 23.64%, while the original MPE training (denoted as Baseline) indeed provides a substantial boost over the initial ML-trained models, consistently at all training iterations, as depicted in Figure 3 (the Baseline (MPE) curve).

6.2. Experiments on Proposed Methods

The recognition results for our proposed methods, including the new phone accuracy functions and frame-level training data selection, are shown in Figure 3. We first evaluate the performance of the two accuracy functions, TFA and STFA, described in Section 4. As can be seen from Figure 3, both TFA and STFA outperform the baseline MPE at higher training iterations, and STFA is slightly better than TFA, though the difference between them is almost negligible. On the other hand, we have observed from a series of experiments that using the two variants of phone accuracy functions with different settings of the parameter $\rho$ gives different penalties for insertions and deletions: if the value of $\rho$ is larger, insertion errors are discouraged; if it is smaller, the number of deletion errors decreases.

We then evaluate the effectiveness of training data selection, for which STFA was taken as the phone accuracy function. The results for the two alternative approaches introduced in this paper, i.e., the hard (HS) and soft (SS) versions of frame-level data selection, are depicted in Figure 3. It can be seen that HS considerably boosts the performance of MPE training, while SS demonstrates only marginal improvements. Nevertheless, these results justify our postulation that, with the proper integration of data selection into the acoustic model training process, the training algorithm can be made to focus on the useful training samples and achieve a better discrimination capability on the test set. We also investigated the combination of HS and SS for MPE training, which was achieved by using SS to emphasize or deemphasize the training samples selected by HS. As the STFA+HS+SS curve in Figure 3 shows, this combination can provide additional performance gains over using either HS or SS alone.

Significance tests based on the standard NIST MAPSSWE test [12] were also conducted on the speech recognition results of all the improved approaches presented in this paper (for the acoustic models trained at all iterations). As shown in Table 1, they indicate statistically significant CER improvements (with p-value < 0.001) over the baseline MPE training when either STFA+HS or STFA+SS was exploited in the acoustic model training process. In the meantime, we are extensively experimenting with ways to further improve the performance of MPE training, including trying different training settings and investigating the joint training of feature transformations, acoustic models and language models.

Figure 3. Experimental results of the proposed methods, in comparison with standard MPE training (CER versus training iteration).

      STFA+HS           STFA+SS           STFA+HS+SS
Itr   CER(%)  p-value   CER(%)  p-value   CER(%)  p-value
1     22.46   <0.001    22.75   -         22.53   <0.001
2     21.87   <0.001    22.25   -         21.72   <0.001
3     21.4    <0.001    21.83   <0.001    21.45   <0.001
4     21.38   <0.001    21.45   <0.001    21.38   <0.001
5     21.08   <0.001    21.27   <0.001    21.03   <0.001
6     21.03   <0.001    20.94   <0.001    20.9    <0.001

Table 1. The speech recognition results (CERs) for the proposed improved approaches.

7. CONCLUSIONS

In this paper, we have studied frame-level training data selection for MPE training of acoustic models for Mandarin broadcast news recognition. A novel data selection approach using the normalized frame-level entropy of Gaussian posterior probabilities has been presented. Moreover, a new phone accuracy function directly based on the frame-level accuracy has been proposed as well. Promising and encouraging results on the recognition of Mandarin broadcast news speech were demonstrated. More in-depth investigation of the proposed training data selection, as well as its integration with other discriminative acoustic model training algorithms, is currently under way.

8. REFERENCES

[1] D. Povey and P. C. Woodland, "Minimum Phone Error and I-Smoothing for Improved Discriminative Training," in Proc. ICASSP 2002.

[2] J. W. Kuo et al., "An Empirical Study of Word Error Minimization Approaches for Mandarin Large Vocabulary Speech Recognition," IJCLCLP, Vol. 11(3), 2006.

[3] P. S. Gopalakrishnan et al., "A Generalization of the Baum Algorithm to Rational Objective Functions," in Proc. ICASSP 1989.

[4] H. Misra and H. Bourlard, "Spectral Entropy Feature in Full-Combination Multi-Stream for Robust ASR," in Proc. Eurospeech 2005.

[5] H. Jiang et al., "Large Margin Hidden Markov Models for Speech Recognition," IEEE Trans. ASLP, 14(5), 2006.

[6] S. H. Liu and B. Chen, "The Use of Frame-level Information Cues for Minimum Phone Error Discriminative Training of Acoustic Models," in Proc. TAAI 2006.

[7] J. Zheng and A. Stolcke, "Improved Discriminative Training Using Phone Lattices," in Proc. Eurospeech 2005.

[8] F. Wessel et al., "Explicit Word Error Minimization Using Word Hypothesis Posterior Probability," in Proc. ICASSP 2001.

[9] G. Heigold et al., "Minimum Exact Word Error Training," in Proc. ASRU 2005.

[10] B. Chen et al., "Lightly Supervised and Data-Driven Approaches to Mandarin Broadcast News Transcription," in Proc. ICASSP 2004.

[11] H. S. Chiu and B. Chen, "Word Topical Mixture Models for Dynamic Language Model Adaptation," in Proc. ICASSP 2007.

[12] L. Gillick and S. Cox, "Some Statistical Issues in the Comparison of Speech Recognition Algorithms," in Proc. ICASSP 1989.
