
Generalized Likelihood Ratio Discriminant Analysis

Hung-Shin Lee 1,2, Berlin Chen 2

1 Department of Electrical Engineering, National Taiwan University, Taiwan
2 Department of Computer Science and Information Engineering, National Taiwan Normal University, Taiwan

1 [email protected], 2 [email protected]

Abstract—In the past several decades, classifier-independent front-end feature extraction, in which the derivation of acoustic features is only loosely associated with back-end model training or classification, has been prominently used in various pattern recognition tasks, including automatic speech recognition (ASR). In this paper, we present a novel discriminative feature transformation, named generalized likelihood ratio discriminant analysis (GLRDA), on the basis of the likelihood ratio test (LRT). It seeks a lower-dimensional feature subspace by making the most confusing situation, described by the null hypothesis, as unlikely to happen as possible, without the homoscedastic assumption on class distributions. We also show that classical linear discriminant analysis (LDA) and its well-known extension, heteroscedastic linear discriminant analysis (HLDA), can be regarded as two special cases of our proposed method. The empirical class confusion information can further be incorporated into GLRDA for better recognition performance. Experimental results demonstrate that GLRDA and its variant yield moderate performance improvements over HLDA and LDA on a large vocabulary continuous speech recognition (LVCSR) task.

I. INTRODUCTION

For the purposes of better discrimination and lower computational complexity, acoustic feature transformation has long been a key ingredient of automatic speech recognition (ASR) systems. It typically seeks a linear transformation that projects feature vectors from an $n$-dimensional space onto a $d$-dimensional subspace $(d < n)$, so that the resulting features possess good discriminatory power among phonetic classes [1]. Typical techniques of acoustic feature transformation roughly fall into two categories [2]: classifier-dependent and classifier-independent approaches. In the former, the transformation is jointly derived with either the parameter estimation of the acoustic models or the allocation rule governed by the classifier, on the basis of discriminative criteria such as minimum classification error (MCE) [3] and minimum phone error (MPE) [4].

For the latter category, by contrast, the transformation is derived, according to a variety of class separation criteria, independently of the back-end model training or classification, while the resulting features are still expected to be desirable for recognition. For instance, linear discriminant analysis (LDA) attempts to maximize the average squared Euclidean distance between all whitened class-mean pairs [5]; heteroscedastic linear discriminant analysis (HLDA), as a generalization of LDA, is a maximum likelihood solution that relaxes the assumption of equal class covariances [6]. The maximum mutual information (MMI) and MCE criteria have also been introduced into the classifier-independent approach in an attempt to retain more discriminative information about the individual class distributions than LDA and HLDA [7], while heteroscedastic discriminant analysis (HDA), which considers the individual covariance of each class, provides a new estimate of the within-class scatter in the original LDA criterion [8]. Furthermore, recent research has started to take the empirical pairwise class error rate into account when constructing the transformation, so that the inconsistency between the front-end processing and the classification stage of a recognizer can be mitigated to some extent [9-11].

Although higher separability among classes does not necessarily guarantee better recognition accuracy, the classifier-independent approach remains worthy of exploration for at least two reasons. First, by virtue of its complete severance from acoustic modeling, the derivation of the transformation is immune to changes in the back-end modeling strategy, which makes a complex ASR system easier to dissect and analyze. Second, and most importantly, methods designed along this vein are expected to be more readily applicable to a wide array of pattern recognition tasks.

In this paper, we present a novel discriminative feature transformation, named generalized likelihood ratio discriminant analysis (GLRDA), stemming from the generic idea of the likelihood ratio test (LRT). It seeks a lower-dimensional feature subspace by making the most confusing situation, described by the null hypothesis, as unlikely to happen as possible, without the homoscedastic assumption on the underlying class distributions. We also show that classical linear discriminant analysis (LDA) and its well-known extension, heteroscedastic linear discriminant analysis (HLDA), are just two special cases of our proposed method. Furthermore, the empirical class confusion information can be incorporated into GLRDA for better recognition performance.

The rest of this paper is organized as follows. Section II describes the theoretical background of the LRT and introduces the basic ideas that motivate GLRDA. Section III then elaborates the heteroscedastic formulation of GLRDA and its extension using empirical class confusion information. The experimental settings and a series of experiments conducted on a Mandarin large vocabulary continuous speech recognition (LVCSR) task are presented in Section IV. Finally, conclusions are drawn in Section V.


II. GENERALIZED LIKELIHOOD RATIO DISCRIMINANT ANALYSIS

A. Background

Conceptualized from statistical hypothesis testing [12], the likelihood ratio test (LRT) is a celebrated method of obtaining test statistics in situations where one wishes to test a null hypothesis $H_0$ against a completely general alternative hypothesis $H_1$. In this paper, $H_0$ generally represents a statistical situation that we would not like to accept. For example, in discriminative feature extraction tasks, the parameter space (e.g., for class mean vectors and covariance matrices) described by $H_0$ contains less class-discrimination information. If $\Omega$ denotes the complete parameter space and $\omega$ denotes the parameter space restricted by the null hypothesis $H_0$, the LRT criterion for the null hypothesis $H_0$ against the alternative hypothesis $H_1$ is

$$\mathrm{LR} = \frac{\sup_{\omega} L}{\sup_{\Omega} L}, \qquad (1)$$

where $L$ denotes the likelihood of the sampled data, and $\sup_{S} L$ denotes the likelihood computed with the maximum likelihood (ML) estimate of the parameters over a set $S$. The logic behind the LRT criterion is that, if $H_0$ is perfectly true (with no extra confidence measure being considered), the ML estimate over $\Omega$ should also occur within the parameter set that is consistent with $\omega$; accordingly, $\sup_{\omega} L$ and $\sup_{\Omega} L$ should be close to each other. On the other hand, if $H_0$ is apparently false, the ML condition will occur at a point of $\Omega$ that is not in $\omega$, which means that $\sup_{\omega} L$ will ideally be far smaller than $\sup_{\Omega} L$.
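As a concrete illustration of this logic, the following toy sketch computes the log-likelihood ratio of (1) for two 1-D Gaussian classes, testing a shared mean against class-specific means. It is our own example (data, variance handling, and names are illustrative, not from the paper); when the class means truly differ, the log-ratio comes out strongly negative, exactly the regime GLRDA will later seek.

```python
import numpy as np

# Toy LRT for (1): H0 = both classes share one Gaussian mean,
# H1 = each class keeps its own mean (shared within-class variance).
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 500)            # class 1 samples
x2 = rng.normal(1.5, 1.0, 500)            # class 2 samples
x = np.concatenate([x1, x2])

def gauss_loglik(v, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

# sup over omega (H0): one pooled ML mean and variance
loglik_h0 = gauss_loglik(x, x.mean(), x.var())

# sup over Omega (H1): per-class ML means, pooled within-class variance
var_w = (np.sum((x1 - x1.mean()) ** 2) + np.sum((x2 - x2.mean()) ** 2)) / x.size
loglik_h1 = gauss_loglik(x1, x1.mean(), var_w) + gauss_loglik(x2, x2.mean(), var_w)

log_lr = loglik_h0 - loglik_h1            # log of (1); near 0 if H0 is plausible
print(log_lr)                             # strongly negative here: H0 is unlikely
```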

B. Problem Formulation

For speech processing, various kinds of LRT statistics have been used across many tasks, such as speaker verification [13], phonetic disambiguation [14], and voice activity detection (VAD) [15]. In this paper, however, we do not intend to strictly follow the LRT procedure for dimensionality reduction of acoustic features. More specifically, our goal is not to test whether the null hypothesis is true or false, but rather to seek a projected subspace in which the (most confusing) null hypothesis is as unlikely as possible to be true. To this end, we design the following statistical hypotheses:

$H_0$: The class populations are the same.
$H_1$: The class populations are different.

Moreover, the projected subspace, spanned by the column vectors of the transformation matrix $\Theta \in \mathbb{R}^{n \times d}$ $(d < n)$, must satisfy the condition that the likelihood of all acoustic features being generated under the null hypothesis is as small as possible. In light of this perspective, the objective function of generalized likelihood ratio discriminant analysis (GLRDA) can be generically formulated as

$$J_{\mathrm{GLRDA}}(\Theta) = \frac{\sup_{\text{parameter space where the class populations are the same}} L(\Theta)}{\sup_{\text{complete parameter space}} L(\Theta)}. \qquad (2)$$

Finally, the transformation matrix $\Theta$ can be derived by minimizing $J_{\mathrm{GLRDA}}(\Theta)$.

C. The Homoscedastic Case

Suppose the sampled data is a collection of $N$ independent labelled pairs $(\mathbf{x}_i, l_i)$, where $\mathbf{x}_i \in \mathbb{R}^{n \times 1}$ $(i = 1,\ldots,N)$ is an acoustic feature vector and $l_i \in \{1,\ldots,C\}$ is a class label. Each class $j \in \{1,\ldots,C\}$, with sample size $n_j$, is modelled by a Gaussian distribution with mean vector $\boldsymbol{\mu}_j$ and covariance matrix $\boldsymbol{\Sigma}_j$. The log-likelihood of the data in the transformed subspace is given by

$$\log L(\Theta) = \log p\big(\mathbf{x}_1^N; \{\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}, \Theta\big) = g(N,d) - \sum_{j=1}^{C}\frac{n_j}{2}\Big[(\tilde{\mathbf{m}}_j-\tilde{\boldsymbol{\mu}}_j)^T\tilde{\boldsymbol{\Sigma}}_j^{-1}(\tilde{\mathbf{m}}_j-\tilde{\boldsymbol{\mu}}_j) + \mathrm{Tr}\big(\tilde{\boldsymbol{\Sigma}}_j^{-1}\tilde{\mathbf{S}}_j\big) + \log|\tilde{\boldsymbol{\Sigma}}_j|\Big], \qquad (3)$$

where $g(N,d) = -(Nd/2)\log(2\pi)$, $\mathbf{m}_j$ and $\mathbf{S}_j$ denote the sample mean and the sample covariance of class $j$, respectively, and the tilde variables refer to the $\Theta$-transformed versions of the original variables.
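For concreteness, the log-likelihood (3) can be computed directly from per-class sample statistics, as in the following sketch (our own illustrative implementation, not code from the paper; statistics and parameters are passed in the original space and projected by $\Theta$):

```python
import numpy as np

def log_likelihood(Theta, stats, params):
    """Evaluate (3): log-likelihood of the projected data.

    stats  -- list of per-class sample statistics (n_j, m_j, S_j)
    params -- list of per-class model parameters (mu_j, Sigma_j)
    Both are given in the original n-dimensional space; the tilde
    quantities of (3) are obtained by projecting with Theta (n x d).
    """
    N = sum(n_j for n_j, _, _ in stats)
    d = Theta.shape[1]
    ll = -0.5 * N * d * np.log(2 * np.pi)          # g(N, d)
    for (n_j, m_j, S_j), (mu_j, Sig_j) in zip(stats, params):
        diff = Theta.T @ (m_j - mu_j)              # m~_j - mu~_j
        S_t = Theta.T @ S_j @ Theta                # S~_j
        Sig_t = Theta.T @ Sig_j @ Theta            # Sigma~_j
        Sig_inv = np.linalg.inv(Sig_t)
        ll -= 0.5 * n_j * (diff @ Sig_inv @ diff
                           + np.trace(Sig_inv @ S_t)
                           + np.log(np.linalg.det(Sig_t)))
    return ll
```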

In general, the parameter space in which the class populations are the same can be characterized by the estimates of the class mean vectors. Therefore, in the homoscedastic case, where all classes are assumed to share the same covariance matrix, the hypotheses of GLRDA can be stated as:

$H_0^{\mathrm{homo}}$: For each class $j$, $\boldsymbol{\Sigma}_j = \boldsymbol{\Sigma}$ and $\boldsymbol{\mu}_j = \boldsymbol{\mu}$.
$H_1^{\mathrm{homo}}$: For each class $j$, $\boldsymbol{\Sigma}_j = \boldsymbol{\Sigma}$ and $\boldsymbol{\mu}_j$ is unrestricted.

Here $H_0^{\mathrm{homo}}$ describes an extreme situation: if it is true, the distributions of all class populations become almost indistinguishable, so the parameter space offers little class-discrimination information. The goal of GLRDA is therefore to find the projected subspace that makes the likelihood of the null hypothesis $H_0^{\mathrm{homo}}$ as small as possible. The following proposition shows that, in a maximum likelihood sense, classical LDA is merely a special case of GLRDA under the two competing hypotheses $H_0^{\mathrm{homo}}$ and $H_1^{\mathrm{homo}}$.

Proposition 1: The derivation of the LDA transformation by maximizing the criterion

$$J_{\mathrm{LDA}}(\Theta) = \frac{|\Theta^T\mathbf{S}_B\Theta|}{|\Theta^T\mathbf{S}_W\Theta|} \qquad (4)$$

is equivalent to that obtained by minimizing the objective function of homoscedastic GLRDA

$$J_{\mathrm{GLRDA}}^{\mathrm{homo}}(\Theta) = \frac{\sup L_{H_0^{\mathrm{homo}}}(\Theta)}{\sup L_{H_1^{\mathrm{homo}}}(\Theta)}, \qquad (5)$$

where $\mathbf{S}_B \in \mathbb{R}^{n \times n}$ and $\mathbf{S}_W \in \mathbb{R}^{n \times n}$ denote the between-class and within-class scatter matrices, respectively [16].

Proof: For computational convenience, (5) can be reformulated in logarithmic form:

$$\log J_{\mathrm{GLRDA}}^{\mathrm{homo}}(\Theta) = \sup\log L_{H_0^{\mathrm{homo}}}(\Theta) - \sup\log L_{H_1^{\mathrm{homo}}}(\Theta), \qquad (6)$$

where the two log-likelihood functions can be respectively expressed by (cf. (3))


$$\log L_{H_0^{\mathrm{homo}}}(\Theta) = \log p\big(\mathbf{x}_1^N; \boldsymbol{\mu}, \boldsymbol{\Sigma}, \Theta\big) = g(N,d) - \sum_{j=1}^{C}\frac{n_j}{2}\Big[(\tilde{\mathbf{m}}_j-\tilde{\boldsymbol{\mu}})^T\tilde{\boldsymbol{\Sigma}}^{-1}(\tilde{\mathbf{m}}_j-\tilde{\boldsymbol{\mu}}) + \mathrm{Tr}\big(\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{S}}_j\big) + \log|\tilde{\boldsymbol{\Sigma}}|\Big] \qquad (7)$$

and

$$\log L_{H_1^{\mathrm{homo}}}(\Theta) = \log p\big(\mathbf{x}_1^N; \{\boldsymbol{\mu}_j\}, \boldsymbol{\Sigma}, \Theta\big) = g(N,d) - \sum_{j=1}^{C}\frac{n_j}{2}\Big[(\tilde{\mathbf{m}}_j-\tilde{\boldsymbol{\mu}}_j)^T\tilde{\boldsymbol{\Sigma}}^{-1}(\tilde{\mathbf{m}}_j-\tilde{\boldsymbol{\mu}}_j) + \mathrm{Tr}\big(\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{S}}_j\big) + \log|\tilde{\boldsymbol{\Sigma}}|\Big]. \qquad (8)$$

Clearly, (7) is optimized by the ML estimators $\tilde{\boldsymbol{\mu}}_0^{\mathrm{homo}} = \tilde{\mathbf{m}}$ and $\tilde{\boldsymbol{\Sigma}}_0^{\mathrm{homo}} = \tilde{\mathbf{S}}_T$, where $\tilde{\mathbf{m}}$ and $\tilde{\mathbf{S}}_T$ denote the global sample mean vector and the total scatter matrix of the data, respectively, and it can be shown that $\tilde{\mathbf{S}}_T = \tilde{\mathbf{S}}_B + \tilde{\mathbf{S}}_W$ [16]. An alternative expression of $\sup\log L_{H_0^{\mathrm{homo}}}(\Theta)$ is then

$$\sup\log L_{H_0^{\mathrm{homo}}}(\Theta) = \max\log p\big(\mathbf{x}_1^N; \tilde{\boldsymbol{\mu}}_0^{\mathrm{homo}}, \tilde{\boldsymbol{\Sigma}}_0^{\mathrm{homo}}, \Theta\big) = g(N,d) - \frac{Nd}{2} - \frac{N}{2}\log|\tilde{\mathbf{S}}_T|. \qquad (9)$$

Along a similar vein, (8) is optimized by setting $\tilde{\boldsymbol{\mu}}_{1,j}^{\mathrm{homo}} = \tilde{\mathbf{m}}_j$ and $\tilde{\boldsymbol{\Sigma}}_1^{\mathrm{homo}} = \tilde{\mathbf{S}}_W$, so $\sup\log L_{H_1^{\mathrm{homo}}}(\Theta)$ can alternatively be represented by

$$\sup\log L_{H_1^{\mathrm{homo}}}(\Theta) = \max\log p\big(\mathbf{x}_1^N; \{\tilde{\boldsymbol{\mu}}_{1,j}^{\mathrm{homo}}\}, \tilde{\boldsymbol{\Sigma}}_1^{\mathrm{homo}}, \Theta\big) = g(N,d) - \frac{Nd}{2} - \frac{N}{2}\log|\tilde{\mathbf{S}}_W|. \qquad (10)$$

Finally, by substituting the results obtained from (9) and (10) into (6), the log-likelihood ratio of GLRDA under the homoscedastic assumption can be expressed as

$$\begin{aligned} \log J_{\mathrm{GLRDA}}^{\mathrm{homo}}(\Theta) &= \frac{N}{2}\big(\log|\tilde{\mathbf{S}}_W| - \log|\tilde{\mathbf{S}}_T|\big) = -\frac{N}{2}\log\frac{|\tilde{\mathbf{S}}_B + \tilde{\mathbf{S}}_W|}{|\tilde{\mathbf{S}}_W|} \\ &= -\frac{N}{2}\log\left(1 + \frac{|\Theta^T\mathbf{S}_B\Theta|}{|\Theta^T\mathbf{S}_W\Theta|}\right). \end{aligned} \qquad (11)$$

Since the logarithmic function is monotonically increasing, the transformation matrix $\Theta$ that minimizes $\log J_{\mathrm{GLRDA}}^{\mathrm{homo}}(\Theta)$ can equivalently be derived by maximizing the term $|\Theta^T\mathbf{S}_B\Theta|/|\Theta^T\mathbf{S}_W\Theta|$ in (11), which is exactly the objective function of LDA.
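The relationship can also be checked numerically. The sketch below is our own sanity check on synthetic data (not code from the paper): it builds the scatter matrices, verifies $\tilde{\mathbf{S}}_T = \tilde{\mathbf{S}}_B + \tilde{\mathbf{S}}_W$, and evaluates the quantities of (11) for a random projection.

```python
import numpy as np

# Sanity check for Proposition 1 on synthetic data; all names are ours.
rng = np.random.default_rng(1)
n, d = 6, 2
X = [rng.normal(mu, 1.0, (100, n)) for mu in (0.0, 2.0, 4.0)]  # 3 classes
N = sum(len(x) for x in X)
g = np.concatenate(X).mean(axis=0)                  # global mean
S_W = sum(len(x) * np.cov(x.T, bias=True) for x in X)
S_B = sum(len(x) * np.outer(x.mean(axis=0) - g, x.mean(axis=0) - g) for x in X)
S_T = N * np.cov(np.concatenate(X).T, bias=True)
assert np.allclose(S_T, S_B + S_W)                  # S_T = S_B + S_W [16]

Theta = rng.normal(size=(n, d))                     # a random projection
lda_ratio = (np.linalg.det(Theta.T @ S_B @ Theta)
             / np.linalg.det(Theta.T @ S_W @ Theta))
log_J = -0.5 * N * np.log(1.0 + lda_ratio)          # (11)
print(lda_ratio, log_J)  # larger LDA ratio <=> smaller (more negative) log J
```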

D. The Heteroscedastic Case

To go a step further, we now consider the heteroscedastic counterparts of the statistical hypotheses in GLRDA (cf. [17]):

$H_0^{\mathrm{heter}}$: For each class $j$, $\boldsymbol{\mu}_j = \boldsymbol{\mu}$ and $\boldsymbol{\Sigma}_j$ is unrestricted.
$H_1^{\mathrm{heter}}$: For each class $j$, $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ are both unrestricted.

Following a similar procedure as described in Proposition 1, the objective function of heteroscedastic GLRDA can be expressed by

$$\log J_{\mathrm{GLRDA}}^{\mathrm{heter}}(\Theta) = \sup\log L_{H_0^{\mathrm{heter}}}(\Theta) - \sup\log L_{H_1^{\mathrm{heter}}}(\Theta), \qquad (12)$$

where the two log-likelihood functions can be respectively expressed as

$$\log L_{H_0^{\mathrm{heter}}}(\Theta) = \log p\big(\mathbf{x}_1^N; \boldsymbol{\mu}, \{\boldsymbol{\Sigma}_j\}, \Theta\big) = g(N,d) - \sum_{j=1}^{C}\frac{n_j}{2}\Big[(\tilde{\mathbf{m}}_j-\tilde{\boldsymbol{\mu}})^T\tilde{\boldsymbol{\Sigma}}_j^{-1}(\tilde{\mathbf{m}}_j-\tilde{\boldsymbol{\mu}}) + \mathrm{Tr}\big(\tilde{\boldsymbol{\Sigma}}_j^{-1}\tilde{\mathbf{S}}_j\big) + \log|\tilde{\boldsymbol{\Sigma}}_j|\Big] \qquad (13)$$

and

$$\log L_{H_1^{\mathrm{heter}}}(\Theta) = \log p\big(\mathbf{x}_1^N; \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\}, \Theta\big) = g(N,d) - \sum_{j=1}^{C}\frac{n_j}{2}\Big[(\tilde{\mathbf{m}}_j-\tilde{\boldsymbol{\mu}}_j)^T\tilde{\boldsymbol{\Sigma}}_j^{-1}(\tilde{\mathbf{m}}_j-\tilde{\boldsymbol{\mu}}_j) + \mathrm{Tr}\big(\tilde{\boldsymbol{\Sigma}}_j^{-1}\tilde{\mathbf{S}}_j\big) + \log|\tilde{\boldsymbol{\Sigma}}_j|\Big]. \qquad (14)$$

Clearly, (13) can be optimized with the ML estimators

$$\tilde{\boldsymbol{\mu}}_0^{\mathrm{heter}} = \Big(\sum_{j=1}^{C} n_j\tilde{\boldsymbol{\Sigma}}_j^{-1}\Big)^{-1}\sum_{j=1}^{C} n_j\tilde{\boldsymbol{\Sigma}}_j^{-1}\tilde{\mathbf{m}}_j \qquad (15)$$

and

$$\tilde{\boldsymbol{\Sigma}}_{0,j}^{\mathrm{heter}} = \tilde{\mathbf{B}}_j + \tilde{\mathbf{S}}_j, \qquad (16)$$

where $\tilde{\mathbf{B}}_j = (\tilde{\mathbf{m}}_j - \tilde{\boldsymbol{\mu}}_0^{\mathrm{heter}})(\tilde{\mathbf{m}}_j - \tilde{\boldsymbol{\mu}}_0^{\mathrm{heter}})^T$. Note that since $\tilde{\boldsymbol{\mu}}_0^{\mathrm{heter}}$ in (15) contains the unknown term $\tilde{\boldsymbol{\Sigma}}_j$ that itself needs to be estimated, we can take $\tilde{\mathbf{S}}_j$ as a temporary estimate of $\tilde{\boldsymbol{\Sigma}}_j$ to form a weighted mean over all classes, or alternatively just use the global mean of the data as the estimate of $\tilde{\boldsymbol{\mu}}_0^{\mathrm{heter}}$ (see Section IV-B). Then, from (13), $\sup\log L_{H_0^{\mathrm{heter}}}(\Theta)$ can be expressed as

$$\sup\log L_{H_0^{\mathrm{heter}}}(\Theta) = \max\log p\big(\mathbf{x}_1^N; \tilde{\boldsymbol{\mu}}_0^{\mathrm{heter}}, \{\tilde{\boldsymbol{\Sigma}}_{0,j}^{\mathrm{heter}}\}, \Theta\big) = g(N,d) - \frac{Nd}{2} - \sum_{j=1}^{C}\frac{n_j}{2}\log|\tilde{\mathbf{B}}_j + \tilde{\mathbf{S}}_j|. \qquad (17)$$

Likewise, (14) is optimized with the ML estimators $\tilde{\boldsymbol{\mu}}_{1,j}^{\mathrm{heter}} = \tilde{\mathbf{m}}_j$ and $\tilde{\boldsymbol{\Sigma}}_{1,j}^{\mathrm{heter}} = \tilde{\mathbf{S}}_j$, and from (14) we have

$$\sup\log L_{H_1^{\mathrm{heter}}}(\Theta) = \max\log p\big(\mathbf{x}_1^N; \{\tilde{\boldsymbol{\mu}}_{1,j}^{\mathrm{heter}}\}, \{\tilde{\boldsymbol{\Sigma}}_{1,j}^{\mathrm{heter}}\}, \Theta\big) = g(N,d) - \frac{Nd}{2} - \sum_{j=1}^{C}\frac{n_j}{2}\log|\tilde{\mathbf{S}}_j|. \qquad (18)$$

Finally, by substituting the results obtained from (17) and (18) into (12), the log-likelihood ratio of GLRDA under the heteroscedastic hypotheses can be derived and expressed by

$$\begin{aligned} \log J_{\mathrm{GLRDA}}^{\mathrm{heter}}(\Theta) &= -\sum_{j=1}^{C}\frac{n_j}{2}\big(\log|\tilde{\mathbf{B}}_j + \tilde{\mathbf{S}}_j| - \log|\tilde{\mathbf{S}}_j|\big) = -\sum_{j=1}^{C}\frac{n_j}{2}\log\big|\mathbf{I}_{d\times d} + \tilde{\mathbf{S}}_j^{-1}\tilde{\mathbf{B}}_j\big| \\ &= -\sum_{j=1}^{C}\frac{n_j}{2}\log\Big(1 + (\tilde{\mathbf{m}}_j - \tilde{\boldsymbol{\mu}}_0^{\mathrm{heter}})^T\tilde{\mathbf{S}}_j^{-1}(\tilde{\mathbf{m}}_j - \tilde{\boldsymbol{\mu}}_0^{\mathrm{heter}})\Big), \end{aligned} \qquad (19)$$

where the last equality holds because $\tilde{\mathbf{B}}_j$ is of rank one, so that $|\mathbf{I} + \tilde{\mathbf{S}}_j^{-1}\tilde{\mathbf{B}}_j| = 1 + \mathrm{Tr}(\tilde{\mathbf{S}}_j^{-1}\tilde{\mathbf{B}}_j)$.


Therefore, we get the objective function of the heteroscedastic GLRDA (H-GLRDA):

$$G_{\text{H-GLRDA}}(\Theta) = -\sum_{j=1}^{C}\frac{n_j}{2}\log\Big(1 + (\Theta^T\mathbf{m}_j - \Theta^T\boldsymbol{\mu}_0^{\mathrm{heter}})^T(\Theta^T\mathbf{S}_j\Theta)^{-1}(\Theta^T\mathbf{m}_j - \Theta^T\boldsymbol{\mu}_0^{\mathrm{heter}})\Big). \qquad (20)$$

The derivative of $G_{\text{H-GLRDA}}(\Theta)$ is given by

$$\frac{\partial G_{\text{H-GLRDA}}(\Theta)}{\partial\Theta} = -\sum_{j=1}^{C}\frac{n_j}{1 + \mathrm{Tr}(\tilde{\mathbf{S}}_j^{-1}\tilde{\mathbf{B}}_j)}\big(\mathbf{B}_j\Theta - \mathbf{S}_j\Theta\tilde{\mathbf{S}}_j^{-1}\tilde{\mathbf{B}}_j\big)\tilde{\mathbf{S}}_j^{-1}, \qquad (21)$$

where $\mathbf{B}_j = (\mathbf{m}_j - \boldsymbol{\mu}_0^{\mathrm{heter}})(\mathbf{m}_j - \boldsymbol{\mu}_0^{\mathrm{heter}})^T$.

Since $G'_{\text{H-GLRDA}}(\Theta) = \mathbf{0}$ has no analytical solution for the stationary points, a gradient descent-based procedure can be used to minimize $G_{\text{H-GLRDA}}(\Theta)$.
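A minimal gradient-descent sketch of this procedure is given below, assuming per-class statistics $(\mathbf{m}_j, \mathbf{S}_j, n_j)$ and a fixed estimate of $\boldsymbol{\mu}_0^{\mathrm{heter}}$ are available; the learning rate, step count, and random initialization are illustrative choices of ours, not settings reported in the paper.

```python
import numpy as np

def h_glrda(means, covs, counts, mu0, d, lr=1e-4, steps=500, seed=0):
    """Minimize (20) by gradient descent using the analytic gradient (21)."""
    n = means[0].shape[0]
    Theta = np.random.default_rng(seed).normal(size=(n, d))
    B = [np.outer(m - mu0, m - mu0) for m in means]        # rank-one B_j
    for _ in range(steps):
        grad = np.zeros_like(Theta)
        for n_j, S_j, B_j in zip(counts, covs, B):
            St_inv = np.linalg.inv(Theta.T @ S_j @ Theta)  # S~_j^{-1}
            Bt = Theta.T @ B_j @ Theta                     # B~_j
            w = n_j / (1.0 + np.trace(St_inv @ Bt))
            grad -= w * (B_j @ Theta - S_j @ Theta @ St_inv @ Bt) @ St_inv
        Theta -= lr * grad                                 # descend on G(Theta)
    return Theta
```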

E. Some Discussions

Table I summarizes the statistics of GLRDA under the various hypotheses. Furthermore, not only can GLRDA be reduced to LDA, whose transformation matrix is derived through the statistics of $H_0^{\mathrm{homo}}$ against $H_1^{\mathrm{homo}}$, but we can also prove that GLRDA is a generalization of HLDA.

Proposition 2: The transformation matrix of HLDA can be derived by minimizing the ratio of the maximum likelihood described by $H_0^{\mathrm{homo}}$ to that described by $H_1^{\mathrm{heter}}$. In other words, GLRDA is a generalization of HLDA.

Proof: Based on [6], the objective function of HLDA can be expressed as

be expressed as

$$J_{\mathrm{HLDA}}(\Theta) = -\frac{N}{2}\log|\Theta_{n-d}^T\mathbf{S}_T\Theta_{n-d}| - \sum_{j=1}^{C}\frac{n_j}{2}\log|\Theta_d^T\mathbf{S}_j\Theta_d| + N\log|\Theta|. \qquad (22)$$

Since $\Theta$ here is a full-rank $n \times n$ matrix that can be partitioned as $[\Theta_d\ \ \Theta_{n-d}]$, we can show that (cf. [18])

$$\begin{aligned} |\Theta^T\mathbf{S}_T\Theta| &= |\Theta_d^T\mathbf{S}_T\Theta_d| \times |\Theta_{n-d}^T\mathbf{S}_T\Theta_{n-d}|, \\ \log|\Theta^T\mathbf{S}_T\Theta| &= \log|\Theta_d^T\mathbf{S}_T\Theta_d| + \log|\Theta_{n-d}^T\mathbf{S}_T\Theta_{n-d}|, \\ \log|\Theta_{n-d}^T\mathbf{S}_T\Theta_{n-d}| &= \log|\Theta^T\mathbf{S}_T\Theta| - \log|\Theta_d^T\mathbf{S}_T\Theta_d|, \end{aligned} \qquad (23)$$

and

$$N\log|\Theta| = \frac{N}{2}\log\frac{|\Theta^T\mathbf{S}_T\Theta|}{|\mathbf{S}_T|} = \frac{N}{2}\log|\Theta^T\mathbf{S}_T\Theta| - \frac{N}{2}\log|\mathbf{S}_T|. \qquad (24)$$

Substituting the result of (23) into (22) and using the result of (24), another form of the HLDA objective can be reached:

$$\begin{aligned} J_{\mathrm{HLDA}}(\Theta) &= \frac{N}{2}\log|\Theta_d^T\mathbf{S}_T\Theta_d| - \sum_{j=1}^{C}\frac{n_j}{2}\log|\Theta_d^T\mathbf{S}_j\Theta_d| - \frac{N}{2}\log|\mathbf{S}_T| \\ &= \sup\log L_{H_1^{\mathrm{heter}}}(\Theta_d) - \sup\log L_{H_0^{\mathrm{homo}}}(\Theta_d) - \frac{N}{2}\log|\mathbf{S}_T|. \end{aligned} \qquad (25)$$

Obviously, when the constant term $-(N/2)\log|\mathbf{S}_T|$ is ignored, maximizing (25) is equivalent to minimizing the objective function of GLRDA whose two competing hypotheses are $H_0^{\mathrm{homo}}$ and $H_1^{\mathrm{heter}}$.

From Proposition 2, it can be seen that the major difference between HLDA and heteroscedastic GLRDA lies in the description of the null hypothesis $H_0$: in HLDA, $H_0$ is more strictly defined, in a homoscedastic fashion, whereas in heteroscedastic GLRDA the parameter space described by the null hypothesis $H_0^{\mathrm{heter}}$ is larger (or more general) than that of HLDA, which would make the GLRDA-derived feature space more amenable to class discrimination.

III. AN EXTENSION OF GLRDA

The presupposition that the most confusing condition (i.e., the null hypothesis $H_0$) occurs only when all class means are entirely indistinguishable might be too rigorous to match realistic speech recognition applications. For example, Table II shows the top 5 most confusing class (phone) pairs, together with their corresponding error counts in frames, obtained from the error analysis of a Mandarin LVCSR task (see Section IV-A) that employed LDA-transformed speech features [9-11]. As is evident, the dominating (most confusing) phone pair is ("in", "ing"), since the number of samples (in frames) that originally belong to "in" and "ing" but are misallocated by the recognizer to "ing" and "in", respectively, is the largest among all phone pairs. Intuitively, these statistics motivate us to seek a better way of representing the adverse condition for classification (i.e., $H_0$) by giving greater consideration to the more confusing pairs, rather than simply assuming that all class means are equally mixed up.

TABLE I
The statistics of GLRDA under various hypotheses.

Statistical Hypotheses | ML Estimators | (Relevant) Maximum Log-likelihood
$H_0^{\mathrm{homo}}$: $\boldsymbol{\Sigma}_j = \boldsymbol{\Sigma}$, $\boldsymbol{\mu}_j = \boldsymbol{\mu}$ | $\tilde{\mathbf{m}}$, $\tilde{\mathbf{S}}_T$ | $-\frac{N}{2}\log|\tilde{\mathbf{S}}_T|$
$H_1^{\mathrm{homo}}$: $\boldsymbol{\Sigma}_j = \boldsymbol{\Sigma}$, $\boldsymbol{\mu}_j$ unrestricted | $\tilde{\mathbf{m}}_j$, $\tilde{\mathbf{S}}_W$ | $-\frac{N}{2}\log|\tilde{\mathbf{S}}_W|$
$H_0^{\mathrm{heter}}$: $\boldsymbol{\mu}_j = \boldsymbol{\mu}$, $\boldsymbol{\Sigma}_j$ unrestricted | $\tilde{\boldsymbol{\mu}}_0^{\mathrm{heter}}$, $\tilde{\boldsymbol{\Sigma}}_{0,j}^{\mathrm{heter}}$ | $-\sum_{j=1}^{C}\frac{n_j}{2}\log|\tilde{\mathbf{B}}_j + \tilde{\mathbf{S}}_j|$
$H_1^{\mathrm{heter}}$: $\boldsymbol{\mu}_j$, $\boldsymbol{\Sigma}_j$ unrestricted | $\tilde{\mathbf{m}}_j$, $\tilde{\mathbf{S}}_j$ | $-\sum_{j=1}^{C}\frac{n_j}{2}\log|\tilde{\mathbf{S}}_j|$

TABLE II
The top 5 most confusing phone pairs in the Mandarin LVCSR system.

k | Class (Phone) Pair | Errors (in Frames)
1 | "in", "ing" | 66,353
2 | "an", "eng" | 42,550
3 | "i", silence | 31,796
4 | "u", silence | 29,082
5 | "e", silence | 26,134
Total number of errors: 2,400,806


Fig. 1. Example of class pairs (dashed circles) and confusing clusters (solid ellipses). [Figure: eight classes, labelled 1-8, grouped into three confusing clusters G1, G2, and G3.]

As an initial attempt, in this paper we integrate the empirical pairwise class confusion information into GLRDA, although we believe there are more sophisticated ways of achieving this goal that remain worth exploring. When we represent the null hypothesis using only the empirically confusing class pairs, however, we soon confront the following issue. Suppose the top two most confusing phone pairs are (1, 2) and (2, 3). The relevant part of the null hypothesis can be described by $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ and $\boldsymbol{\mu}_2 = \boldsymbol{\mu}_3$, where $\boldsymbol{\mu}_j$ denotes the mean estimate for class $j$. Clearly, these two conditions can be further simplified to $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \boldsymbol{\mu}_3$, so that classes 1, 2, and 3 can be viewed together as a confusing cluster, as graphically illustrated in Fig. 1. That is, we can treat all classes and their pairwise empirical confusability relations as vertices and edges, respectively, in a graph, where an edge exists only between class pairs with higher empirical confusability (i.e., more misclassified frames). The pursuit of confusing clusters then becomes the problem of extracting all connected subgraphs of a given graph, which can be solved by efficient graph algorithms such as the flood fill algorithm [19], as sketched below.
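A minimal sketch of this clustering step (our own illustration of the connected-components idea; the class labels and pairs are placeholders, with the (1, 2) and (2, 3) example from the text included):

```python
from collections import defaultdict

def confusing_clusters(pairs):
    """Return the connected components of the class-confusion graph."""
    adj = defaultdict(set)
    for a, b in pairs:            # one edge per highly confusable class pair
        adj[a].add(b)
        adj[b].add(a)
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:              # iterative flood fill over one component
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

print(confusing_clusters([(1, 2), (2, 3), (4, 5)]))  # -> [{1, 2, 3}, {4, 5}]
```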

Therefore, we can modify heteroscedastic GLRDA into confusion information-based GLRDA (CI-GLRDA). Let $\mathcal{G} = \{G_k\}$ denote the set of confusing clusters; the corresponding hypotheses are stated as:

$H_0^{\mathrm{CI}}$: For each class $j$, $\boldsymbol{\Sigma}_j$ is unrestricted, and if class $j \in G_{k_j}$, then $\boldsymbol{\mu}_j = \boldsymbol{\mu}_{k_j}^{\mathrm{CI}}$, where $k_j$ labels the cluster to which class $j$ belongs.
$H_1^{\mathrm{CI}}$: For each class $j$, $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ are both unrestricted.

Under the ML estimation framework described in Section II, we can hence derive the objective function of CI-GLRDA as follows:

$$G_{\text{CI-GLRDA}}(\Theta) = -\sum_{\substack{j=1,\\ G_{k_j}\in\mathcal{G}}}^{C}\frac{n_j}{2}\log\Big(1 + (\Theta^T\mathbf{m}_j - \Theta^T\boldsymbol{\mu}_{k_j}^{\mathrm{CI}})^T(\Theta^T\mathbf{S}_j\Theta)^{-1}(\Theta^T\mathbf{m}_j - \Theta^T\boldsymbol{\mu}_{k_j}^{\mathrm{CI}})\Big). \qquad (26)$$

The derivative of $G_{\text{CI-GLRDA}}(\Theta)$ is given by

$$\frac{\partial G_{\text{CI-GLRDA}}(\Theta)}{\partial\Theta} = -\sum_{\substack{j=1,\\ G_{k_j}\in\mathcal{G}}}^{C}\frac{n_j}{1 + \mathrm{Tr}(\tilde{\mathbf{S}}_j^{-1}\tilde{\mathbf{B}}_{jk_j})}\big(\mathbf{B}_{jk_j}\Theta - \mathbf{S}_j\Theta\tilde{\mathbf{S}}_j^{-1}\tilde{\mathbf{B}}_{jk_j}\big)\tilde{\mathbf{S}}_j^{-1}, \qquad (27)$$

where $\mathbf{B}_{jk_j} = (\mathbf{m}_j - \boldsymbol{\mu}_{k_j}^{\mathrm{CI}})(\mathbf{m}_j - \boldsymbol{\mu}_{k_j}^{\mathrm{CI}})^T$.
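In practice, CI-GLRDA can reuse the H-GLRDA optimizer, with each in-cluster class pulled toward its cluster mean $\boldsymbol{\mu}_{k_j}^{\mathrm{CI}}$ instead of the single $\boldsymbol{\mu}_0^{\mathrm{heter}}$. The sketch below uses sample-count-weighted cluster means, one plausible estimator we chose for illustration; the paper itself leaves the exact form to the ML framework of Section II.

```python
import numpy as np

def cluster_means(clusters, means, counts):
    """Assumed estimator: weighted average of member-class sample means."""
    mu_ci = {}
    for comp in clusters:                     # comp: a set of class indices
        n_total = sum(counts[j] for j in comp)
        mu_k = sum(counts[j] * means[j] for j in comp) / n_total
        for j in comp:
            mu_ci[j] = mu_k                   # mu_{k_j}^CI for each member j
    return mu_ci
```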

IV. EXPERIMENTS AND RESULTS

A. Experimental Setup

The speech corpus consists of about 200 hours of MATBN Mandarin television news [20]. All 200 hours of speech data are accompanied by corresponding orthographic transcripts, of which about 25 hours were used to bootstrap acoustic training. Additional sets of 1.5 hours and 1 hour of speech data were reserved for development and testing, respectively. The acoustic models chosen for speech recognition are 112 right-context-dependent INITIALs and 38 context-independent FINALs, trained with the Expectation-Maximization (EM) algorithm.

The recognition lexicon consists of 72K words. The language models used in this paper consist of unigram, bigram, and trigram models, which were estimated on a text corpus of 170 million Chinese characters collected from the Central News Agency (CNA) [21]. The N-gram language models were trained using the SRI Language Modeling Toolkit (SRILM).

The speech recognizer was implemented with a left-to-right frame-synchronous Viterbi tree search over a lexical prefix tree organization of the lexicon. The recognition hypotheses were organized into a word graph for further language model rescoring [20]. The baseline system with Mel-frequency cepstral coefficient (MFCC) features achieved a character accuracy of 72.23% on the test set.

B. Experimental Results

In the following, we evaluate the goodness of our proposed methods, as described in (20) and (26), respectively, for feature extraction. For comparison, the front-end processing was also alternatively performed using LDA, HLDA, and HDA, where the objective function of HDA is [8]

$$J_{\mathrm{HDA}}(\Theta) = -\sum_{j=1}^{C} n_j\log|\Theta^T\mathbf{S}_j\Theta| + N\log|\Theta^T\mathbf{S}_B\Theta|.$$

The original acoustic feature vectors had 162 dimensions $(n = 162)$, obtained by splicing every 9 consecutive 18-dimensional Mel-filter-bank feature vectors, and were then reduced to 39 dimensions $(d = 39)$. The states of each HMM were taken as the units for class assignment, and a well-trained hidden Markov model-based (HMM-based) recognition system was used to obtain the class alignment of the training utterances. During the speech recognition process we kept track of the full state alignment to obtain state-level transcriptions of the training data; by comparing them with the correct transcriptions, we derived the sorted empirical pairwise classification errors for all phone pairs.
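The splicing step can be sketched as follows (our own illustration; we read "9 consecutive frames" as a symmetric context of +/-4 frames, and the projection matrix below is a random stand-in for the learned transform):

```python
import numpy as np

def splice(frames, context=4):
    """Stack each frame with +/-context neighbours: (T, 18) -> (T, 162)."""
    T, dim = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

rng = np.random.default_rng(0)
fbank = rng.normal(size=(100, 18))   # stand-in for real filter-bank features
spliced = splice(fbank)              # (100, 162), i.e. n = 162
Theta = rng.normal(size=(162, 39))   # placeholder for the learned GLRDA/LDA matrix
projected = spliced @ Theta          # (100, 39) transformed features, d = 39
print(projected.shape)
```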


We also report experimental results obtained by additionally applying a heteroscedastic feature decorrelation, namely the maximum likelihood linear transform (MLLT) [22], immediately after the abovementioned transformations. MLLT is designed to obtain a projection that approximately diagonalizes the class covariances by maximizing the likelihood of the projected data, so as to approximate the performance of a system that uses full-covariance HMM modeling.
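A gradient-ascent sketch of this decorrelation step is given below, written against the standard MLLT objective $N\log|\det\mathbf{A}| - \sum_j \frac{n_j}{2}\log|\mathrm{diag}(\mathbf{A}\tilde{\mathbf{S}}_j\mathbf{A}^T)|$; it is our own illustrative implementation (learning rate and step count are arbitrary, and it assumes $\det\mathbf{A}$ stays positive along the path), not the one used in the experiments.

```python
import numpy as np

def mllt(covs, counts, steps=200, lr=1e-4):
    """Estimate a square A that approximately diagonalizes all class covs."""
    d = covs[0].shape[0]
    A = np.eye(d)                               # start at identity, det > 0
    N = sum(counts)
    for _ in range(steps):
        grad = N * np.linalg.inv(A).T           # d/dA of N log|det A|
        for n_j, S in zip(counts, covs):
            D_inv = np.diag(1.0 / np.diag(A @ S @ A.T))
            grad -= n_j * D_inv @ A @ S         # d/dA of the diag-log-det terms
        A += lr * grad                          # ascend on the likelihood
    return A
```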

In the experiments on heteroscedastic GLRDA (denoted by H-GLRDA), since the mean vector $\boldsymbol{\mu}_0^{\mathrm{heter}}$ in the objective function (20) cannot be directly derived from the sample, we adopt two kinds of approximation for it. One is to represent the mean vector in the original space using the following equation (cf. (15)):

$$\boldsymbol{\mu}_0^{\mathrm{heter}} = \Big(\sum_{j=1}^{C} n_j\mathbf{S}_j^{-1}\Big)^{-1}\sum_{j=1}^{C} n_j\mathbf{S}_j^{-1}\mathbf{m}_j, \qquad (28)$$

while the other is to simply use the arithmetic mean of all sampled data to approximate $\boldsymbol{\mu}_0^{\mathrm{heter}}$:

$$\boldsymbol{\mu}_0^{\mathrm{heter}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i. \qquad (29)$$
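The two estimates can be written directly from (28) and (29); a brief sketch of our own, with $\mathbf{S}_j$ standing in for the unknown $\boldsymbol{\Sigma}_j$ as discussed in Section II-D:

```python
import numpy as np

def weighted_mean(means, covs, counts):
    """(28): S_j^{-1}-weighted combination of the class sample means."""
    P = sum(n_j * np.linalg.inv(S_j) for n_j, S_j in zip(counts, covs))
    v = sum(n_j * np.linalg.inv(S_j) @ m_j
            for n_j, S_j, m_j in zip(counts, covs, means))
    return np.linalg.solve(P, v)

def arithmetic_mean(X):
    """(29): the global sample mean of all data (rows of X)."""
    return X.mean(axis=0)
```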

We see in Table III that H-GLRDA with the weighted mean as the estimate of $\boldsymbol{\mu}_0^{\mathrm{heter}}$ achieves higher character accuracy (CA); the same setting is therefore kept for CI-GLRDA.

Going a step further, Table IV compares the performance of H-GLRDA and CI-GLRDA with that of various existing LDA-based approaches. As can be seen, without MLLT for feature decorrelation (Column 2), H-GLRDA and CI-GLRDA both yield much lower accuracies than the existing approaches. The major reason would be that the transformation matrices derived by optimizing (20) and (26) do not, in essence, have the effect of approximately diagonalizing the transformed covariance matrix of each class. On the contrary, both methods demonstrate moderate improvements over LDA, HLDA, and HDA once MLLT has been applied, as illustrated in the right-most column of Table IV. It is also noteworthy that the improvement made by H-GLRDA is less pronounced than that made by CI-GLRDA. This may confirm our expectation that, with the assistance of empirical class confusion information, a proper design of the null hypothesis is beneficial to the final classification performance.

V. CONCLUSIONS

In this paper, we have proposed a new and more general framework, namely generalized likelihood ratio discriminant analysis (GLRDA), for discriminative feature transformation on the basis of the likelihood ratio test. Not only can GLRDA be successfully applied to ASR, but it is also expected to be feasible for other pattern recognition tasks. Moreover, GLRDA can work in conjunction with empirical class confusion information for better recognition performance.

VI. ACKNOWLEDGEMENT

This work was supported by the National Science Council, Taiwan, under Grants: NSC96-2628-E-003-015-MY3, NSC98-2221-E-003-011-MY3, and NSC97-2631-S-003-003.

REFERENCES
[1] B. D. Ripley, Pattern Recognition and Neural Networks. New York: Cambridge University Press, 1996.
[2] X. Wang and K. K. Paliwal, "Feature extraction and dimensionality reduction algorithms and their applications in vowel recognition," Pattern Recognition, vol. 36, pp. 2429-2439, 2003.
[3] X.-B. Li, et al., "Dimensionality reduction using MCE-optimized LDA transformation," in Proc. ICASSP 2004.
[4] D. Povey, et al., "fMPE: discriminatively trained features for speech recognition," in Proc. ICASSP 2005.
[5] R. A. Fisher, "The statistical utilization of multiple measurements," Annals of Eugenics, vol. 8, pp. 376-386, 1938.
[6] N. Kumar and A. G. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition," Speech Communication, vol. 26, pp. 283-297, 1998.
[7] K. Demuynck, et al., "Optimal feature sub-space selection based on discriminant analysis," in Proc. Eurospeech 1999.
[8] G. Saon, et al., "Maximum likelihood discriminant feature spaces," in Proc. ICASSP 2000.
[9] H.-S. Lee and B. Chen, "Linear discriminant feature extraction using weighted classification confusion information," in Proc. Interspeech 2008.
[10] H.-S. Lee and B. Chen, "Improved linear discriminant analysis considering empirical pairwise classification error rates," in Proc. ISCSLP 2008.
[11] H.-S. Lee and B. Chen, "Empirical error rate minimization based linear discriminant analysis," in Proc. ICASSP 2009.
[12] W. J. Krzanowski, Principles of Multivariate Analysis: A User's Perspective. New York: Oxford University Press, 1988.
[13] C.-H. Lee, "A tutorial on speaker and speech verification," in Proc. NORSIG 1998.
[14] Y. Liu and P. Fung, "Acoustic and phonetic confusions in accented speech recognition," in Proc. Interspeech 2005.
[15] J. M. Górriz, et al., "Generalized LRT-based voice activity detector," IEEE Signal Processing Letters, vol. 13, pp. 636-639, 2006.
[16] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. New York: Academic Press, 1990.
[17] N. A. Campbell, "Canonical variate analysis with unequal covariance matrices - generalizations of the usual solution," Mathematical Geology, vol. 16, pp. 109-124, 1984.
[18] M. Sakai, et al., "Linear discriminant analysis using a generalized mean of class covariances and its application to speech recognition," IEICE Transactions on Information and Systems, vol. E91-D, pp. 478-487, 2008.
[19] J. D. Foley, et al., Computer Graphics: Principles and Practice in C, 2nd ed. Addison-Wesley, 1995.
[20] B. Chen, et al., "Lightly supervised and data-driven approaches to Mandarin broadcast news transcription," in Proc. ICASSP 2004.
[21] H.-S. Chiu and B. Chen, "Word topical mixture models for dynamic language model adaptation," in Proc. ICASSP 2007.
[22] R. A. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in Proc. ICASSP 1998.

TABLE III
The CA results (%) of H-GLRDA with various mean estimates.

H-GLRDA | Without MLLT | With MLLT
Weighted mean (28) | 62.34 | 74.88
Arithmetic mean (29) | 58.68 | 74.45

TABLE IV
Comparison among the CA results (%) of H-GLRDA, CI-GLRDA, and various LDA-based approaches.

Methods | Without MLLT | With MLLT
LDA | 71.46 | 74.33
HLDA | 70.28 | 74.88
HDA | 71.36 | 74.53
H-GLRDA | 62.34 | 74.88
CI-GLRDA | 63.62 | 75.26
