Large margin discriminative learning for acoustic modeling

Large margin discriminative learning for acoustic modeling Fei Sha Dept. of Computer & Information Science University of Pennsylvania Thesis Proposal July 17, 2006

Transcript of Large margin discriminative learning for acoustic modeling

  • Large margin discriminative learning for acoustic modeling

    Fei Sha, Dept. of Computer & Information Science

    University of Pennsylvania

    Thesis Proposal

    July 17, 2006

  • Acoustic Modeling: model acoustic properties of speech units

    [Diagram: speech signal → signal processing → decoder (acoustic models + language models) → recognition result]

    p(W|A) ∝ p(A|W) p(W)

    where A denotes the acoustic observations and W the word sequence.

  • Many research problems

    • What types of units?

    • What acoustic properties & how to represent?

    • How to integrate?

    • What kind of models?

    • How to estimate model parameters?

    (the choices made here: phonemes, standard signal processing, continuous density hidden Markov models)

  • Phonemes

    • Smallest distinctive sound units in a language
      - distinguish one word from another
      - used as building blocks for other units

    • Example:

      Phoneme  Example   Phoneme  Example   Phoneme  Example
      /i/      beat      /I/      bit       /e/      bait
      /æ/      bat       /ɛ/      bet       /a/      Bob
      /ɝ/      bird      /ə/      about     /ɔ/      bought
      /u/      boot      /U/      book      /o/      boat
      /ay/     buy       /ɔy/     boy       /aw/     down
      /w/      wit       /l/      let       /r/      rent
      /y/      you       /m/      met       /n/      net
      /ŋ/      sing      /b/      bet       /d/      debt
      /g/      get       /p/      pet       /t/      ten
      /k/      kit       /v/      vat       /θ/      thing
      /z/      zoo       /ʒ/      azure     /f/      fat
      /ð/      that      /s/      sat       /ʃ/      shut
      /h/      hat       /dʒ/     judge     /tʃ/     church

  • Signal processing

    • Frame-based feature representation

    • Industry standard
      - 13 mel-cepstral coefficients (MFCCs) per frame
      - frame differences (delta-features) for temporal information

      (a short front-end sketch follows the figure note below)

    !"#$%&'(' !"#$%&')'

    *+,-%&

    ./

    *+,-%&

    .0

    1#)+(%+&

    2+,$3*#+-

    4%5&!52%+&

    6,$7

    5%,+$($8&

    ,58#+(2"-

    -#9%5&

    *#+&'('

    -#9%5&

    *#+&')'

    3(8$,5&!+#:%33($8&*+#$2;%$9

    3"#+2;2(-%&,$,5

  • Continuous density hidden Markov Models

    • Standard acoustic model
      - HMM state transitions for modeling temporal dynamics
      - Gaussian mixture models for modeling the emission distribution

    • Example:
      - 3 states: left, middle, right
      - forward state transitions

      [Figure: 3-state left-to-right HMM with transition probabilities.]

    p(x_t | s_t = j) = ∑_{m=1}^{M} α_{jm} N(x_t; µ_{jm}, Σ_{jm})
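A minimal sketch of this emission density, assuming SciPy's multivariate normal; the parameter arrays are illustrative, not taken from the thesis.

```python
# p(x_t | s_t = j) = sum_m alpha_jm * N(x_t; mu_jm, Sigma_jm)
import numpy as np
from scipy.stats import multivariate_normal

def emission_prob(x_t, alphas, mus, Sigmas):
    """alphas: (M,) mixture weights; mus: (M, D) means; Sigmas: (M, D, D) covariances."""
    return sum(a * multivariate_normal.pdf(x_t, mean=mu, cov=S)
               for a, mu, S in zip(alphas, mus, Sigmas))
```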

  • The problem of parameter estimation

    A simplified framework:

    • Training data:
      - acoustic features:  x ∈ ℝ^D
      - phonetic transcriptions:  y ∈ {1, 2, ..., C}

    • Model: hidden Markov models with GMM emissions,  p(x, y; θ)

    • Output:
      - transition probabilities
      - means and covariance matrices

  • Paradigms for parameter estimation

    • Maximum likelihood estimation
      - maximize the joint likelihood:

        θ_ML = argmax_θ ∑_n log p(x_n, y_n; θ)

      - simple update procedures (EM algorithms)
      - only indirectly linked to recognition performance

    • Discriminative learning
      - aims to classify and recognize correctly
      - tends to be much more computationally intensive
      - leads to better performance if done right
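A minimal sketch of the maximum likelihood baseline, assuming scikit-learn's GaussianMixture (an EM implementation); the thesis uses HMM/GMM acoustic models, so this only illustrates the per-class GMM part.

```python
# Fit one GMM per class by maximum likelihood (EM), then classify by the
# largest log-likelihood (equal class priors assumed for simplicity).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ml_gmms(X, y, n_components=2):
    """X: (N, D) features; y: (N,) integer labels. Returns one GMM per class."""
    classes = np.unique(y)
    return [GaussianMixture(n_components=n_components, covariance_type="full",
                            random_state=0).fit(X[y == c]) for c in classes]

def classify(gmms, X):
    scores = np.stack([g.score_samples(X) for g in gmms], axis=1)   # log p(x | class)
    return scores.argmax(axis=1)
```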

  • Ex: discriminative learning methods

    • Conditional maximum likelihood (CML/MMIE)

    Intuition: raise likelihood for the target class, lower the likelihood for all other classes

    • Minimum classification error (MCE)
      - raise the discriminant score for the target class
      - lower the discriminant scores for other classes
      - transform & quantify the difference to approximate the classification error

    argmax_θ ∑_n log p(y_n | x_n; θ)
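A minimal sketch of the CML/MMIE objective, assuming a matrix of joint log-likelihoods log p(x_n, c; θ) has already been computed; the conditional log-likelihood is the joint score of the target class normalized over all classes.

```python
import numpy as np
from scipy.special import logsumexp

def conditional_log_likelihood(log_joint, y):
    """log_joint: (N, C) array of log p(x_n, c; theta); y: (N,) target labels.
    Returns sum_n log p(y_n | x_n; theta)."""
    return float((log_joint[np.arange(len(y)), y] - logsumexp(log_joint, axis=1)).sum())
```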

  • Unified framework

    • Covers many methods
      - ML, MCE, CML/MMIE
      - Minimum Phone Error / Minimum Word Error

    Discriminative learning methods generally do much better than ML estimation.

    ∑_n f( log [ ( ∑_W p^α(A_n|W) p^α(W) G(W, W_n) ) / ( ∑_W p^α(A_n|W) p^α(W) ) ]^{1/α} )

  • Challenges of discriminative learning methods

    • Need a lot of training data
      - large corpus
      - aggressive parameter tying

    • Overfitting is a big problem
      - smoothing interpolation between ML and MMI
      - "confusable" set: frame discrimination, MPE/MWE

    • Optimization is complicated
      - EM variants: need to set parameters for convergence
      - gradient methods: mixed success

  • Focus of this thesis

    examine the utility of large margin based discriminative learning algorithms for acoustic-phonetic modeling

  • Why large margin?

    • Discriminative learning paradigm

    • Very successful

    - in multiway classification, e.g., SVMs

    - in structured classification, e.g., M3N (max-margin Markov networks)

    • Theoretically motivated to offer better generalization

    • Computationally appealing.

  • Outline

    • Large margin GMMs
      - label individual samples
      - parallel to support vector machines

    • Crucial extensions for modeling phonemes
      - handle outliers
      - classify and recognize sequences

    • Summary
      - what has been done
      - what will be done

  • Large margin Gaussian mixture modeling

  • Model each class by an ellipsoid

      - centroid:  µ_c ∈ ℝ^D
      - orientation matrix:  Ψ_c ∈ ℝ^{D×D},  Ψ_c ⪰ 0
      - offset:  θ_c ≥ 0

    [Figure: basic setting — an ellipsoid fit to the data points of one class.]

  • Nonlinear classification

    Classify by smallest score

    y = argmin_c { (x − µ_c)^T Ψ_c (x − µ_c) + θ_c },   x ∈ ℝ^D

    (Mahalanobis distance; equivalent to Gaussian classifiers)

  • Change of representation

    • Step 1: collect parameters

    • Step 2: augment inputs

    (µ_c, Ψ_c, θ_c)  −→  Φ_c = [ Ψ_c           −Ψ_c µ_c
                                  −µ_c^T Ψ_c    µ_c^T Ψ_c µ_c + θ_c ]  ⪰ 0

    z = [ x
          1 ] ∈ ℝ^{D+1}

  • Classification rule

    • Old:   y = argmin_c { (x − µ_c)^T Ψ_c (x − µ_c) + θ_c }

      The old representation is nonlinear in µ_c and Ψ_c.

    • New:   y = argmin_c { z^T Φ_c z }

      The new representation is linear in Φ_c.
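A minimal sketch verifying this change of representation: collecting (µ_c, Ψ_c, θ_c) into Φ_c and augmenting x with a constant 1 makes the score z^T Φ_c z reproduce the original Mahalanobis-plus-offset score exactly (values below are illustrative).

```python
import numpy as np

def build_phi(mu, Psi, theta):
    """Phi_c = [[Psi, -Psi mu], [-mu^T Psi, mu^T Psi mu + theta]]."""
    top = np.hstack([Psi, -Psi @ mu[:, None]])
    bottom = np.hstack([-(mu[None, :] @ Psi), [[mu @ Psi @ mu + theta]]])
    return np.vstack([top, bottom])

rng = np.random.default_rng(0)
D = 3
A = rng.normal(size=(D, D))
Psi = A @ A.T                                   # positive semidefinite orientation matrix
mu, theta, x = rng.normal(size=D), 0.5, rng.normal(size=D)

z = np.append(x, 1.0)                           # augmented input z = [x; 1]
old = (x - mu) @ Psi @ (x - mu) + theta         # nonlinear in mu and Psi
new = z @ build_phi(mu, Psi, theta) @ z         # linear in Phi
assert np.isclose(old, new)
```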

  • Large margin learning criteria

    Input: labeled examples {(x_n, y_n)}
    Output: positive semidefinite matrices {Φ_c}
    Goal: classify each example correctly by at least one unit margin:

    ∀ c ≠ y_n:  z_n^T Φ_c z_n − z_n^T Φ_{y_n} z_n ≥ 1

    margin = difference of distance between target and competing classes

  • Decision boundary & margin

    !"#$%$&'()&*'!+,-( .+,/$'(.+,/$'(

  • Handling infeasible examples

    • “Soft” margin constraints with slacks

      ∀ c ≠ y_n:  z_n^T Φ_c z_n + ξ_nc − z_n^T Φ_{y_n} z_n ≥ 1

    • Discriminative loss function: convex, but only a surrogate for the 0/1 loss

    [Figure: hinge loss vs. 0/1 loss as a function of the margin; the slack ξ_nc measures the margin violation of infeasible examples.]

  • Learning via convex optimization

    • Balance the total slack against the scale (trace) of the matrices Ψ_c

    • Properties
      - instance of semidefinite programming (SDP)
      - convex problem: no local optima
      - efficiently solvable with gradient methods

    minimize    ∑_{n,c} ξ_nc + κ ∑_c trace(Ψ_c)
    subject to  z_n^T Φ_c z_n + ξ_nc − z_n^T Φ_{y_n} z_n ≥ 1
                ξ_nc ≥ 0,  ∀ n, c ≠ y_n
                Φ_c ⪰ 0
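A minimal sketch of this SDP on toy data, assuming the cvxpy modeling library with an off-the-shelf solver; the slide notes that in practice the problem is solved more efficiently with gradient methods, so this is only meant to make the formulation concrete.

```python
import numpy as np
import cvxpy as cp

D, C, kappa = 2, 3, 0.1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3.0 * c, size=(20, D)) for c in range(C)])   # toy blobs
y = np.repeat(np.arange(C), 20)
Z = np.hstack([X, np.ones((len(X), 1))])                                   # augmented inputs z = [x; 1]

Phi = [cp.Variable((D + 1, D + 1), PSD=True) for _ in range(C)]            # one PSD matrix per class
xi = cp.Variable((len(X), C), nonneg=True)                                 # slack variables

constraints = []
for n, zn in enumerate(Z):
    target = zn @ Phi[y[n]] @ zn
    for c in range(C):
        if c != y[n]:
            # unit margin between competing and target scores, up to slack
            constraints.append(zn @ Phi[c] @ zn + xi[n, c] - target >= 1)

# total slack plus trace regularization on the Psi_c block of each Phi_c
objective = cp.Minimize(cp.sum(xi) + kappa * sum(cp.trace(P[:D, :D]) for P in Phi))
cp.Problem(objective, constraints).solve(solver=cp.SCS)
```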

  • General setting

    • Model each class by M ellipsoids {Φ_cm}

    • Assume knowledge of a "proxy label" for each example:

      m_n = argmin_m  z_n^T Φ_{y_n, m} z_n

  • Large margin constraints

    ∀ c "= yn∀m z

    TnΦcmz − zTnΦynmnz ≥ 1

    score to the proxy centroid

    score for all competing centroids

  • Handle large number of constraints

    ∀ c "= yn minm

    (zTnΦcmz − zTnΦynmnz) ≥ 1

    ∀ c "= yn − logm∑

    m

    e−(zTnΦcmz−z

    TnΦynmn z) ≥ 1

    softmax trick to squash many constraints into one
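A minimal numerical illustration of the softmax trick: −log ∑_m exp(−d_m) is a smooth lower bound on min_m d_m, so requiring the soft minimum to exceed one unit is a single, slightly stronger constraint that implies all M hard constraints.

```python
import numpy as np
from scipy.special import logsumexp

d = np.array([2.3, 1.1, 4.0])       # margins to the M centroids of one competing class
softmin = -logsumexp(-d)            # -log sum_m exp(-d_m)
assert softmin <= d.min()           # always a lower bound on the hard minimum
print(d.min(), softmin)             # 1.1  vs. roughly 0.80
```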

  • Close the loop: proxy labels

    • Step 1: maximum likelihood parameter estimation

      Φ^ML = argmax_Φ ∑_n log p(x_n, y_n; Φ)

    • Step 2: find the closest mixture component

      m_n = argmin_m  z_n^T Φ^ML_{y_n, m} z_n

  • Optimization problem

    • Properties:
      - no longer an instance of SDP
      - still a convex problem

    minimize    ∑_{n,c} ξ_nc + κ ∑_{c,m} trace(Ψ_cm)
    subject to  −log ∑_m exp( −( z_n^T Φ_{c,m} z_n − z_n^T Φ_{y_n, m_n} z_n ) ) + ξ_nc ≥ 1
                ξ_nc ≥ 0,  ∀ n, c ≠ y_n
                Φ_cm ⪰ 0

  • Application: handwritten digit recognition

    • General setup
      - train GMMs with maximum likelihood estimation
      - use the GMMs to infer proxy labels
      - train large margin GMMs

    • Data set
      - MNIST handwritten digit recognition
      - 60K training & 10K testing examples, ten-fold cross validation
      - dimensionality reduced from 28x28 to 40 with PCA

  • Result: classification error

    • Comparable to the best SVM results (1.0% - 1.4%)
    • Training time: 1 - 10 minutes

    [Bar chart: test error vs. number of mixture components per class, comparing EM and large margin training; the annotated large margin error is 1.2%.]

  • Result: digit prototypes

    • EM: typical images
    • Large margin: superposition of images of different digits

    [Figure: digit prototypes learned by each method.]

  • Crucial extensions for acoustic modeling

    - Handle statistical outliers
    - Classify sequences
    - Recognize sequences

  • Extension 1: outliers

    [Figure: hinge loss as a function of the margin — a single outlier can dominate the loss function, while the near misses are the points we actually want to focus on.]

  • Remedy

    Equalize the contributions from outliers:

    minimize    ∑_{n,c} α_n ξ_nc + κ ∑_c trace(Ψ_c)
    subject to  z_n^T Φ_c z_n + ξ_nc − z_n^T Φ_{y_n} z_n ≥ 1
                ξ_nc ≥ 0,  ∀ n, c ≠ y_n
                Φ_c ⪰ 0

    Estimate the weighting factors α_n from the ML-estimated GMMs.

    [Figure: reweighted loss with the outlier's contribution equalized.]
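A minimal sketch of one plausible way to derive the weighting factors α_n from the ML-trained GMMs; the exact scheme used in the thesis may differ, so treat this purely as an illustration of capping the influence of outliers.

```python
import numpy as np

def outlier_weights(hinge_loss_ml):
    """hinge_loss_ml[n]: total hinge loss of example n under the ML-trained model.
    Outliers (very large losses) get small weights so they cannot dominate the objective."""
    return np.minimum(1.0, 1.0 / np.maximum(hinge_loss_ml, 1e-8))

losses = np.array([0.0, 0.4, 1.0, 7.5, 50.0])    # toy hinge losses under the ML model
print(outlier_weights(losses))                    # [1.   1.   1.   0.133 0.02]
```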

  • Extension 2: phonetic classification

    • Input:
      - segmented speech signals
      - each segment is a phonetic unit

    • Output:
      - phonetic label for each segment

    • Evaluation:
      - percentage of misclassified segments

    [Figure: waveform with known phonetic segmentation, labeled with the phone sequence "bcl l aa kcl t ey bcl".]

  • Algorithm

    • Use the average frame score to set up the margin constraint
      - Input: frames x_n1, x_n2, ..., x_nℓ of segment n
      - Output: label y_n

    ∀ c ≠ y_n:  (1/ℓ) ∑_p z_np^T Φ_c z_np − (1/ℓ) ∑_p z_np^T Φ_{y_n} z_np ≥ 1
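A minimal sketch of the segment-level score used in this constraint: frame scores z_np^T Φ_c z_np averaged over the segment, compared between the target class and each competing class (names and shapes are illustrative).

```python
import numpy as np

def segment_score(Z_seg, Phi_c):
    """Z_seg: (L, D+1) augmented frames of one segment; Phi_c: (D+1, D+1) PSD matrix.
    Returns the average frame score (1/L) * sum_p z_p^T Phi_c z_p."""
    return float(np.einsum("ld,de,le->l", Z_seg, Phi_c, Z_seg).mean())

def margin_satisfied(Z_seg, Phis, y):
    """Check the unit-margin constraint against every competing class."""
    scores = np.array([segment_score(Z_seg, P) for P in Phis])
    competing = np.delete(scores, y)
    return bool(np.all(competing - scores[y] >= 1.0))
```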

  • Application: phonetic classification

    • General setup
      - train baseline GMMs with ML
      - use baseline GMMs to infer proxy labels
      - use baseline GMMs to detect outliers
      - train large margin GMMs

    • Data
      - TIMIT speech database: 1.2M frames, 120K segments for training
      - very well studied benchmark problem

  • Result: phonetic classification error

    [Bar chart: phonetic classification error (20%-34%) vs. number of mixture components (1, 2, 4, 8, 16), comparing baseline EM and large margin training, with the best previously reported result indicated for reference.]

  • Extension 3: phonetic recognition

    • Input:
      - unsegmented speech
      - frames span more than one phone

    • Output:
      - phonetic segmentation
      - phonetic labels

    • Evaluation metrics:
      - multiple metrics
      - some relate more closely to word error rate

    [Figure: output phonetic segmentation (e.g., "bc l a kc t e bc"), inferred indirectly from frame-level phonetic labels.]

  • Algorithm

    • Goal: separate target sequences from incorrect sequences by a large margin

    • Model: fully observable Markov model

    [Figure: Markov chain of states over the input frames; state sequence ↔ phoneme label sequence.]

  • Discriminant score for sequences

    • Analogous to the log probability in CD-HMMs
      - transition scores
      - emission scores

    • Extendable to more complicated features

    D(X, s) = ∑_t log a(s_{t−1}, s_t) − ∑_{t=1}^{T} z_t^T Φ_{s_t} z_t

    X = {x_1, x_2, ..., x_T},   s = {s_1, s_2, ..., s_T}
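A minimal sketch of this discriminant score for a given state sequence; log_a holds the transition scores and Phi the per-state matrices (names and shapes are illustrative).

```python
import numpy as np

def discriminant_score(Z, s, log_a, Phi):
    """Z: (T, D+1) augmented frames; s: (T,) state sequence;
    log_a: (S, S) transition scores; Phi: (S, D+1, D+1) PSD matrices.
    Returns D(X, s) = sum_t log a(s_{t-1}, s_t) - sum_t z_t^T Phi_{s_t} z_t."""
    transition = log_a[s[:-1], s[1:]].sum()
    emission = np.einsum("td,tde,te->t", Z, Phi[s], Z).sum()
    return float(transition - emission)
```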

  • Large margin constraints

    How to handle exponential number of constraints?

    ∀s "= y, D(X,y)−D(X, s) ≥ h(s,y)

    Hamming distance[Taskar et al 03]

  • Softmax margin

    −D(X, y) + max_{s ≠ y} { h(s, y) + D(X, s) } ≤ 0

    squash into one constraint with softmax:

    −D(X, y) + log ∑_{s ≠ y} exp( h(s, y) + D(X, s) ) ≤ 0

    tractable with dynamic programming
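A minimal sketch of that dynamic program: a forward recursion in log space computes log ∑_s exp(h(s, y) + D(X, s)) over all state sequences in O(T·S²) time, exploiting the fact that the Hamming distance decomposes per frame. For simplicity the sum below includes s = y and omits initial-state scores; both are easy to adjust.

```python
import numpy as np
from scipy.special import logsumexp

def log_softmax_margin_sum(Z, y, log_a, Phi):
    """Z: (T, D+1) frames; y: (T,) target states; log_a: (S, S); Phi: (S, D+1, D+1)."""
    T, S = Z.shape[0], log_a.shape[0]
    frame = np.einsum("td,sde,te->ts", Z, Phi, Z)               # z_t^T Phi_s z_t for all t, s
    # per-frame score: Hamming bonus when s_t != y_t, minus the emission score
    local = (np.arange(S)[None, :] != y[:, None]).astype(float) - frame
    alpha = local[0]                                            # forward log-scores at t = 0
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + log_a, axis=0) + local[t]
    return float(logsumexp(alpha))
```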

  • Optimization

    • Properties:
      - convex optimization
      - the main computational effort goes into computing the gradients

    min    ∑_n ξ_n + γ ∑_c trace(Ψ_c)
    s.t.   −D(X_n, y_n) + log ∑_{s ≠ y_n} exp( h(s, y_n) + D(X_n, s) ) ≤ ξ_n
           ξ_n ≥ 0,  n = 1, 2, ..., N
           Φ_c ⪰ 0,  c = 1, 2, ..., C

  • Results

    • General setup
      - train baseline GMMs with ML
      - use baseline GMMs to infer proxy labels
      - use the proxy label sequence as the target state sequence
      - train large margin GMMs

    • Data
      - TIMIT speech database: 1.2M frames, 3000 sentences
      - very well studied benchmark problem

  • Results: phone error rate

    [Bar chart: phone error rate (26%-42%) vs. number of mixture components (1, 2, 4, 8), comparing baseline EM and large margin training.]

  • Result: is “sequential” really important?

    [Bar chart: phone error rate (26%-38%) vs. number of mixture components (1, 2, 4, 8), comparing frame-based and sequence-based large margin training.]

  • Result: relative error reduction over baseline

    [Bar chart: relative error reduction over the baseline (4%-24%) vs. number of mixture components (1, 2, 4, 8), comparing MMI and large margin training.]

  • What has been achieved?

    • Algorithms
      - large margin GMMs for multiway classification
      - large margin GMMs for sequential classification

    • Advantages over traditional methods
      - computationally appealing through convexity
      - well motivated training criteria for generalization

    • Experimental results
      - able to train on millions of data points
      - improvements over the baseline and other competitive methods

  • Proposed Future Work

    • Goals:
      - Is the margin really helpful in getting better performance?
      - Can the formulation of the optimization be improved?

    • Comparison to standard ASR algorithms:
      - MMI/CML, MCE
      - investigate the relation with MPE/MWE

    • Comparison to structural learning methods
      - recent advances in the machine learning community
      - algorithms: extragradient (Taskar, 2006), subgradient method (Bagnell, 2006)