Page 1

Prof. Hui Jiang

Department of Computer Science and Engineering

York University, Toronto, Ont. M3J 1P3, CANADA

Email: [email protected]

Large-Margin HMM Estimation for Speech Recognition

(Joint work with Chao-Jun Liu and Xinwei Li)

Page 2

Research Projects

• Hierarchical covariance modeling in CDHMM

(joint with Y. Tian, J.-L. Zhou, MSRA, Beijing, China)

• Large-scale discriminative training based on MCE/GPD

(joint with B. Liu, Univ. of Sci. & Tech. of China, J.-L. Zhou, MSRA)

• Large-margin HMM estimation for speech recognition

(joint with C. Liu, X. Li, York Univ.)

Page 3

Hierarchical Covariance Modeling (HCM) in CDHMM

[Figure: covariance hierarchy; node labels $\Sigma_f^{(1)}$, $\Sigma_f^{(2)}$, ..., $\Sigma_f^{(k)}$, a row of $\Sigma_d$ nodes including $\Sigma_d^{(p)}$, and data $X^{(p)}$]

$$\Sigma_*^{(p)} = \lambda_0^{(p)} \cdot \Sigma_d^{(p)} + \sum_{k=1} \lambda_k^{(p)} \cdot \Sigma_f^{(k)}$$

Page 4

Hierarchical Covariance Modeling Schemes

HCM

$$\Sigma^{(i)} = \sum_{m \in \Psi(i)} \lambda_m^{(i)} \, \Sigma^{(m)}$$

HPM

$$\left(\Sigma^{(i)}\right)^{-1} = \sum_{m \in \Psi(i)} \lambda_m^{(i)} \left(\Sigma^{(m)}\right)^{-1}$$

HCM+DIAG

$$\Sigma^{(i)} = \lambda_0^{(i)} \, \Sigma_{diag}^{(i)} + \sum_{m \in \Psi(i)} \lambda_m^{(i)} \, \Sigma^{(m)}$$

HPM+DIAG

$$\left(\Sigma^{(i)}\right)^{-1} = \lambda_0^{(i)} \, \Sigma_{diag}^{(i)} + \sum_{m \in \Psi(i)} \lambda_m^{(i)} \left(\Sigma^{(m)}\right)^{-1}$$

HCC

$$\Sigma^{(i)} = \Sigma_{diag}^{(i)} + \sum_{m \in \Psi(i)} \lambda_m^{(i)} \left[ \Sigma^{(m)} - diag\left(\Sigma^{(m)}\right) \right]$$

Page 5

Performance Comparison: RM database

Method      Word Err. Rate   Err. Rate Reduction
Baseline    4.09%            n/a
HLDA        3.50%            14.4%
MLLT        3.61%            11.7%
STC         3.33%            18.6%
MIC         3.49%            14.7%
HCC         3.16%            22.7%
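
The "Err. Rate Reduction" column appears to be the relative reduction in word error rate with respect to the baseline. The short sketch below recomputes it from the WER values in the RM table above; the function name `relative_error_reduction` is a hypothetical helper, not something from the slides.

```python
def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WERs (in %) copied from the RM table above.
baseline = 4.09
rm_wer = {"HLDA": 3.50, "MLLT": 3.61, "STC": 3.33, "MIC": 3.49, "HCC": 3.16}

for method, wer in rm_wer.items():
    print(f"{method}: {relative_error_reduction(baseline, wer):.1f}%")
```

Under this reading, the recomputed values agree with the reported column to one decimal place.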

Page 6

Performance Comparison: Switchboard (minitrain)

Method                Word Err. Rate   Err. Rate Reduction
Baseline              38.0%            n/a
HLDA                  37.2%            2.10%
STC                   36.6%            3.68%
MIC (39 prototypes)   36.5%            3.94%
HCC                   35.9%            5.53%

Page 7

Large-Scale GPD/MCE: In-Search Data Selection

[Figure: search space with axes Time and State, showing the search beam and a word-ending active path at time t. The reference segmentation "$-b-u b-u-k u-k+T k-T+i T-i+s i-s+T s-T+r T-r+i r-i+p i-p+$" is used for token comparison: tokens of phone a-b+c go to the True Token Sets, and tokens of competing phones a'-b'+c' go to the Competing Token Sets.]

Page 8

Large-Scale Discriminative Training based on GPD/MCE

• Discriminative training: refine the original model set discriminatively based on the collected token sets.

[Figure: an HMM model with its True Tokens and Competing Tokens]

Page 9

Implementation Model for state-tied HMMs

[Figure labels: Frames (feature vectors); Optimal Viterbi path (state sequence); Competing states]

Page 10

Criteria in Discriminative Training

• Least Imposter Words (LIW): minimize the total number of imposter words during the decoding of all training data.

– Imposter words are incorrect words that appear within the beam width during Viterbi decoding with a higher likelihood than the reference model. (Jiang et al. 2002)

• Least Phone Competing Tokens (LPCT): minimize the total number of phone competing tokens during the decoding of all training data. (Liu, Jiang, et al. 2005)

• Least Incorrect Frames (LIF): minimize the total number of incorrectly decoded frames during the decoding of all training data.
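
As a rough illustration of the quantity the LIF criterion counts, the sketch below compares a frame-level reference alignment with a decoded alignment and counts the mismatching frames. The per-frame label representation and the name `count_incorrect_frames` are assumptions made for this example; the slides do not specify how frames are labelled.

```python
def count_incorrect_frames(reference, decoded):
    """Count frames whose decoded label differs from the reference label."""
    if len(reference) != len(decoded):
        raise ValueError("alignments must cover the same number of frames")
    return sum(1 for ref, hyp in zip(reference, decoded) if ref != hyp)


# Hypothetical per-frame state labels for one short utterance.
reference = ["s1", "s1", "s2", "s2", "s3", "s3"]
decoded = ["s1", "s1", "s2", "s4", "s4", "s3"]
incorrect = count_incorrect_frames(reference, decoded)
```

Summing this count over all training utterances gives the total that LIF training tries to reduce.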

Page 11

Discriminative Training: RM task

            Training set          Test set
iteration   WER(%)   Err Red      WER(%)   Err Red
0 (ML)      1.26     N/A          4.30     N/A
1           1.19     8%           4.16     3%
5           1.06     16%          3.96     8%
6           1.03     18%          4.06     6%

Page 12

Discriminative Training: Switchboard

            Training set          Test set
iteration   WER(%)   Err Red      WER(%)   Err Red
0 (ML)      33.2     N/A          48.1     N/A
1           31.5     5%           47.1     2%
4           29.4     11%          46.0     4%
6           29.0     13%          46.4     4%

Page 13

Prof. Hui Jiang

Department of Computer Science and Engineering

York University, Toronto, Ont. M3J 1P3, CANADA

Email: [email protected]

Large-Margin HMM Estimation for Speech Recognition

(Joint work with Chao-Jun Liu and Xin-Wei Li)

Page 14

Outline

• Background:

– Automatic Speech Recognition (ASR)

– Large Margin Classifier

• Large Margin for HMM-based classifiers

• A Gradient Ascent Optimization for Continuous Density HMM (CDHMM) in speech recognition

• Preliminary Experiments

• Final Remarks and Ongoing Work

Page 15

ASR Solution: MAP decision rule

$$\hat{W} = \arg\max_{W \in \Omega} p(W|X) = \arg\max_{W \in \Omega} P(W) \cdot p(X|W) = \arg\max_{W \in \Omega} P_\Gamma(W) \cdot p_\Lambda(X|W) = \arg\max_{W \in \Omega} \mathcal{F}(X|W)$$

• $p_\Lambda(X|W)$: Acoustic Model (AM), the probability of generating feature X when W is uttered.

• $P_\Gamma(W)$: Language Model (LM), the probability of W (word, phrase, sentence) being chosen to say.

• $\mathcal{F}(X|W)$: Discriminant Function:

$$\mathcal{F}(X|W) = P_\Gamma(W) \cdot p_\Lambda(X|W)$$
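
A minimal sketch of the MAP decision rule, assuming scores are kept in the log domain so that the product $P_\Gamma(W) \cdot p_\Lambda(X|W)$ becomes a sum. The word list and all numbers are made up for illustration, and `map_decode` is a hypothetical helper name.

```python
import math


def map_decode(acoustic_loglik, lm_logprob):
    """Return the hypothesis W that maximizes log P(W) + log p(X|W)."""
    return max(acoustic_loglik, key=lambda w: lm_logprob[w] + acoustic_loglik[w])


# Hypothetical scores for one utterance X (made-up numbers).
acoustic_loglik = {"bee": -120.0, "dee": -118.5, "pee": -125.0}  # log p(X|W)
lm_prob = {"bee": 0.6, "dee": 0.1, "pee": 0.3}                   # P(W)
lm_logprob = {w: math.log(p) for w, p in lm_prob.items()}

best = map_decode(acoustic_loglik, lm_logprob)
acoustic_only = max(acoustic_loglik, key=acoustic_loglik.get)
```

In this toy case the acoustic score alone prefers "dee", while the combined discriminant selects "bee", which shows how the language model term can change the decision.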

Page 16

Existing HMM Estimation Methods

• Maximum Likelihood Estimation (MLE)

– The Baum-Welch algorithm: the EM algorithm for HMM

• Discriminative Training

– Maximum Mutual Information Estimation (MMIE)

– Minimum Classification Error (MCE)

• Discriminative training can improve (more or less) over the standard ML training.

• All discriminative training methods suffer from the problem of poor generalization.

Page 17

Large-Margin Classifier: Support Vector Machine (SVM)

[Figure: SVM illustration; label: larger margin]

Page 18

Large-Margin Classifiers

• Why do large-margin classifiers yield better generalization performance?

• Conceptually, a large margin gives:

– Robustness w.r.t. data patterns

– Robustness w.r.t. classifier parameters

• The theory in machine learning:

– upper bound of generalization error rate

$$R \le R_d + \sqrt{\frac{C}{M} \left( \frac{V}{d^2} \log^2 \frac{M}{V} + \log \frac{1}{\delta} \right)}$$

Page 19

How about using SVM for Speech Recognition?

• Done in some simple ASR tasks:

– phoneme recognition

– speaker recognition

– small vocabulary isolated speech recognition

• No significant improvement has been reported.

– still not a mainstream method

• Why?

– lack of a proper kernel function to map speech samples from one dynamic high-dimensional space to another high-dimensional space that is suitable for linear classifiers.

Page 20

Large-Margin HMM-based Classifier

[Figure: model 1 ($\Lambda_1$) and model 2 ($\Lambda_2$) with separation boundary $\mathcal{F}(X|\Lambda_1) - \mathcal{F}(X|\Lambda_2) = 0$]

Page 21

Large-Margin HMM-based Classifier

[Figure: original separation boundary $\mathcal{F}(X|\Lambda_1) - \mathcal{F}(X|\Lambda_2) = 0$ for models $\Lambda_1$, $\Lambda_2$, and new separation boundary $\mathcal{F}(X|\Lambda'_1) - \mathcal{F}(X|\Lambda'_2) = 0$ for models $\Lambda'_1$, $\Lambda'_2$]

Page 22

How to define separation margin? (1)

• In a 2-class separable problem:

– For a data token, $x_1$, of class 1

$$d(x_1) = \mathcal{F}(x_1|\Lambda_1) - \mathcal{F}(x_1|\Lambda_2) > 0$$

– For a data token, $x_2$, of class 2

$$d(x_2) = \mathcal{F}(x_2|\Lambda_2) - \mathcal{F}(x_2|\Lambda_1) > 0$$

Page 23

How to define separation margin? (2)

• Extend to multiple-class problem:

– N classes $\Lambda_1, \Lambda_2, \ldots, \Lambda_N$

– For a data token, $x_i$, of class i

$$d(x_i) = \mathcal{F}(x_i|\Lambda_i) - \max_{j \ne i} \mathcal{F}(x_i|\Lambda_j) = \min_{j \ne i} \left[ \mathcal{F}(x_i|\Lambda_i) - \mathcal{F}(x_i|\Lambda_j) \right]$$
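
The multi-class margin can be sketched as follows, assuming the discriminant scores of one token under every class model are already available as a dictionary. The scores and the name `separation_margin` are hypothetical.

```python
def separation_margin(scores, true_class):
    """d(x) = F(x|model of true class) - max over all other classes of F(x|model)."""
    best_rival = max(score for cls, score in scores.items() if cls != true_class)
    return scores[true_class] - best_rival


# Hypothetical discriminant scores F(x|Lambda_j) for a single token x.
scores = {"b": -50.0, "d": -53.5, "e": -61.0}

margin_if_b = separation_margin(scores, "b")  # token truly belongs to "b"
margin_if_d = separation_margin(scores, "d")  # token truly belongs to "d"
```

A positive margin means the token is classified correctly; a negative margin means some rival model scores higher than the true one.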

Page 24

Large-Margin Estimation of HMMs

• An N-class problem: each class is represented by an HMM

$$\Lambda = \{\Lambda_1, \Lambda_2, \ldots, \Lambda_N\}$$

• Given a training set $\mathcal{D}$, define a subset, called the support token set $\mathcal{S}$, as:

$$\mathcal{S} = \{ X_i \mid X_i \in \mathcal{D} \text{ and } 0 \le d(X_i) \le \epsilon \}$$

• Large-Margin Estimation (LME) of HMMs:

$$\hat{\Lambda} = \arg\max_{\Lambda} \min_{X_i \in \mathcal{S}} d(X_i) \quad (\text{subject to all } d(X_i) > 0)$$
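
A small sketch of how the support token set and the LME objective could be evaluated for a fixed model set, assuming the margins $d(X_i)$ have already been computed. The margin values, the threshold and the helper names are illustrative assumptions.

```python
def support_token_set(margins, epsilon):
    """Indices of tokens whose margin satisfies 0 <= d(X_i) <= epsilon."""
    return [i for i, d in enumerate(margins) if 0.0 <= d <= epsilon]


def minimum_margin(margins, support):
    """The quantity maximized by LME: the smallest margin over the support set."""
    return min(margins[i] for i in support)


# Hypothetical margins d(X_i) for five training tokens.
margins = [3.5, 0.4, 12.0, 1.1, -0.7]
support = support_token_set(margins, epsilon=2.0)
objective = minimum_margin(margins, support)
```

Tokens far from the boundary (here 3.5 and 12.0) and misclassified tokens (here -0.7) do not enter the support set, so only the tokens close to the boundary drive the estimation.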

Page 25

Large-Margin Estimation of HMMs

• Convert it into an equivalent minimax optimization problem

• Assume $X_i$ belongs to class i

$$\hat{\Lambda} = \arg\min_{\Lambda} \max_{X_i \in \mathcal{S},\, j \ne i} \left[ \mathcal{F}(X_i|\Lambda_j) - \mathcal{F}(X_i|\Lambda_i) \right]$$

subject to the constraints:

$$\mathcal{F}(X_i|\Lambda_j) - \mathcal{F}(X_i|\Lambda_i) < 0 \quad \text{for all } X_i \in \mathcal{S} \text{ and } j \ne i.$$

Page 26

Two difficulties

• No. 1: without additional constraints on $\Lambda$ during the optimization, the maximum margin does not exist.

– e.g. scale up both $\mathcal{F}(X_i|\Lambda_i)$ and $\mathcal{F}(X_i|\Lambda_j)$ to increase the margin without limit.

• No. 2: how to do the optimization?

– Use standard optimization tools, such as the Matlab optimization toolbox®

– However, too slow …

Page 27

How to guarantee existence of Maximum Margin? (1)

• Solution one: maximize the relative margin instead:

$$d'(X_i) = \frac{\mathcal{F}(X_i|\Lambda_i) - \max_{j \ne i} \mathcal{F}(X_i|\Lambda_j)}{\mathcal{F}(X_i|\Lambda_i)} = \min_{j \ne i} \left[ 1 - \frac{\mathcal{F}(X_i|\Lambda_j)}{\mathcal{F}(X_i|\Lambda_i)} \right]$$

$$-\infty < d'(X_i) \le 1 \;\Rightarrow\; \text{maximum always exists}$$

Called Large Relative Margin Estimation (LRME)

Page 28

How to guarantee existence of Maximum Margin? (2)

• Solution two: optimize one HMM each time

– Do

• for each i, solve a sub-optimization problem

$$\hat{\Lambda}_i = \arg\min_{\Lambda_i} \max_{X_i \in \mathcal{S},\, j \ne i} \left[ \mathcal{F}(X_i|\Lambda_j) - \mathcal{F}(X_i|\Lambda_i) \right]$$

subject to the constraints:

$$\mathcal{F}(X_i|\Lambda_j) - \mathcal{F}(X_i|\Lambda_i) < 0 \quad \text{for all } X_i \in \mathcal{S} \text{ and } j \ne i.$$

where the other HMMs are kept constant in the above optimization.

– Until convergence

Called iterative localized optimization (ILO)

Page 29

Iterative Localized Optimization

Page 30

How to do optimization? (1)

• Use the gradient ascent method to maximize a lower bound of the minimum margin

$$\hat{\Lambda} = \arg\max_{\Lambda} \min_{X_i \in \mathcal{S}} d(X_i) = \arg\max_{\Lambda} Q(\Lambda)$$

$$Q(\Lambda) = \min_{X_i \in \mathcal{S}} d(X_i)$$

– Use a continuous and differentiable function to approximate the minimum margin

Page 31

How to do optimization? (2)

• Approximate $Q(\Lambda)$ with a summation of exponentials

$$Q(\Lambda) \approx Q_\eta(\Lambda) = \frac{1}{\eta} \log \left[ \sum_{X_i \in \mathcal{S},\, j \ne i} \exp[\eta \cdot d(X_i)] \right]$$

$$\lim_{\eta \to -\infty} Q_\eta(\Lambda) = Q(\Lambda) > Q_\eta(\Lambda) \quad (\eta < 0)$$

• Optimize $Q_\eta(\Lambda)$ instead

$$\hat{\Lambda}' = \arg\max_{\Lambda} Q_\eta(\Lambda)$$

• The gradient ascent method

$$\hat{\Lambda}'(n+1) = \hat{\Lambda}'(n) + \epsilon \cdot \left. \frac{\partial Q_\eta(\Lambda)}{\partial \Lambda} \right|_{\Lambda = \Lambda'(n)}$$
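
The following sketch illustrates numerically why the log-sum-of-exponentials with a negative $\eta$ acts as a smooth lower bound of the minimum margin. It assumes the margins entering the sum are given as a flat list of numbers; the values and the name `soft_min` are hypothetical.

```python
import math


def soft_min(margins, eta):
    """Q_eta = (1/eta) * log(sum_k exp(eta * d_k)), defined here for eta < 0."""
    if eta >= 0:
        raise ValueError("eta must be negative for a lower bound of the minimum")
    d_min = min(margins)
    # Factor out exp(eta * d_min) so that the exponentials cannot overflow.
    total = sum(math.exp(eta * (d - d_min)) for d in margins)
    return d_min + math.log(total) / eta


# Hypothetical margin values; the true minimum is 0.4.
margins = [0.4, 1.1, 3.5]
approximations = [soft_min(margins, eta) for eta in (-1.0, -10.0, -100.0)]
```

Each approximation stays at or below the true minimum, and the gap shrinks as $\eta$ becomes more negative, which matches the limit stated on the slide.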

Page 32

How to calculate the gradient for continuous density HMM? (1)

$$\frac{\partial Q_\eta(\Lambda)}{\partial \Lambda} = \frac{\sum_{X_i \in \mathcal{S},\, j \ne i} \exp[\eta \cdot d(X_i)] \cdot \frac{\partial d(X_i)}{\partial \Lambda}}{\sum_{X_i \in \mathcal{S},\, j \ne i} \exp[\eta \cdot d(X_i)]}$$

$$\frac{\partial d(X_i)}{\partial \Lambda_i} = \frac{\partial \mathcal{F}(X_i|\Lambda_i)}{\partial \Lambda_i}$$

$$\frac{\partial d(X_i)}{\partial \Lambda_j} = -\frac{\partial \mathcal{F}(X_i|\Lambda_j)}{\partial \Lambda_j}$$
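
In the gradient of $Q_\eta$, each margin's own gradient is weighted by a normalized exponential. The sketch below computes these weights for a flat list of hypothetical margins and checks them against a finite-difference estimate; all names and numbers are assumptions made for illustration.

```python
import math


def soft_min(margins, eta):
    """Q_eta = (1/eta) * log(sum_k exp(eta * d_k)), with eta < 0."""
    d_min = min(margins)  # shift by the minimum for numerical stability
    total = sum(math.exp(eta * (d - d_min)) for d in margins)
    return d_min + math.log(total) / eta


def soft_min_weights(margins, eta):
    """dQ_eta/dd_k = exp(eta * d_k) / sum_l exp(eta * d_l)."""
    d_min = min(margins)
    exps = [math.exp(eta * (d - d_min)) for d in margins]
    total = sum(exps)
    return [e / total for e in exps]


def finite_difference(margins, eta, k, h=1e-6):
    """Central-difference estimate of dQ_eta/dd_k, used only as a check."""
    up = list(margins)
    up[k] += h
    down = list(margins)
    down[k] -= h
    return (soft_min(up, eta) - soft_min(down, eta)) / (2.0 * h)


margins = [0.4, 1.1, 3.5]  # hypothetical margin values
eta = -2.0
weights = soft_min_weights(margins, eta)
```

The weights sum to one and concentrate on the smallest margins, so a gradient step mostly moves the models to help the tokens closest to the boundary.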

Page 33

How to calculate the gradient for continuous density HMM? (2)

• Assumption 1: adjust CDHMM mean vectors only

• Assumption 2: diagonal precision matrices

• Assumption 3: use the Viterbi approximation

$$\mathcal{F}(X_i|\Lambda_i) \approx C' - \frac{1}{2} \sum_{t=1}^{T} \sum_{d=1}^{D} r^{(i)}_{s_t l_t d} \left( X_{itd} - m^{(i)}_{s_t l_t d} \right)^2$$

$$\mathcal{F}(X_i|\Lambda_j) \approx C'' - \frac{1}{2} \sum_{t=1}^{T} \sum_{d=1}^{D} r^{(j)}_{s'_t l'_t d} \left( X_{itd} - m^{(j)}_{s'_t l'_t d} \right)^2$$
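
Under the three assumptions above, the discriminant is a quadratic function of the mean vectors, so its gradient with respect to the means has a simple closed form. The sketch below is a simplification, not the authors' implementation: it assumes a single Gaussian per state (the mixture index $l_t$ is dropped), a fixed Viterbi path, and a constant term of zero. All numbers are made up.

```python
def viterbi_score(frames, path, means, precisions, const=0.0):
    """F ~= const - 0.5 * sum_t sum_d r[s_t][d] * (x[t][d] - m[s_t][d])**2."""
    total = 0.0
    for x, s in zip(frames, path):
        for d, x_d in enumerate(x):
            total += precisions[s][d] * (x_d - means[s][d]) ** 2
    return const - 0.5 * total


def mean_gradient(frames, path, means, precisions):
    """dF/dm[s][d] = sum over frames aligned to state s of r[s][d] * (x[t][d] - m[s][d])."""
    grad = [[0.0] * len(m) for m in means]
    for x, s in zip(frames, path):
        for d, x_d in enumerate(x):
            grad[s][d] += precisions[s][d] * (x_d - means[s][d])
    return grad


# Two 2-dimensional frames aligned to states 0 and 1 (hypothetical values).
frames = [[1.0, 2.0], [3.0, 0.0]]
path = [0, 1]
means = [[0.0, 2.0], [1.0, 1.0]]
precisions = [[1.0, 1.0], [2.0, 0.5]]

score = viterbi_score(frames, path, means, precisions)
grad = mean_gradient(frames, path, means, precisions)
```

The gradient of the margin with respect to the true model's means is this quantity, and with respect to a competing model's means it is the same expression with the opposite sign, as stated on the previous slide.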

Page 34

How to handle Recognition Errors in training set?

• Given the training set $\mathcal{D}$, based on the current model $\Lambda$, define the error set:

$$\Psi = \{ X_i \mid X_i \in \mathcal{D} \text{ and } d(X_i) \le 0 \}$$

• Use the MCE (minimum classification error)/GPD algorithm to update the model $\Lambda$ based on $\Psi$ to reduce $|\Psi|$.

• Intuitively, the MCE algorithm will move the separation boundary to correctly classify as many error tokens as possible.

• Use MCE-trained models as initial models to start large margin estimation (LME).

Page 35

Preliminary Experiments

• English alphabet E-set recognition

– Use the OGI ISOLET database

• Speaker-independent small vocabulary isolated-word recognition

• Feature vector (39-d): (12 MFCC + E) + Δ + ΔΔ

• Our best MLE system:

– 16-state whole-word CDHMM for each letter

– 4 Gaussian mixtures per state

• Achieves 96.15% accuracy on a standard test set (26-letter)

– comparable to other reported systems: OGI (96%), Cambridge (96.73%).

• Test our best system on the E-set only: 91.5%

Page 36

Preliminary Results: E-Set ASR Performance Comparison

Word accuracy comparison among various HMM training approaches

         ML     MCE    LRME   LME-ILO
1-mix    85.6   91.5   93.5   92.8
2-mix    90.6   94.0   95.0   95.2
4-mix    91.5   94.4   95.2   n/a

ML: Maximum Likelihood Estimation
MCE: Minimum Classification Error
LME-ILO: Large Margin Estimation via Iterative Localized Optimization
LRME: Large Relative Margin Estimation

Page 37

LME learning curves (1-mix)

[Figure: curves for accuracy in test set, actual margin $Q$, and objective function $Q_\eta$]

Page 38

LME learning curves (2-mix)

[Figure: curves for accuracy in test set, actual margin $Q$, and objective function $Q_\eta$]

Page 39

Final Remarks

• Based on preliminary experimental results only.

• LME can yield better performance than MLE and MCE.

• Margin is a good indicator of the generalization capability of an HMM-based speech recognizer.

• Maximizing the objective function $Q_\eta$ (lower bound) effectively increases the actual separation margin.

Page 40

Ongoing and Future Work

• More theoretical explorations:

– How to formulate the constraints in a theoretically sound way?

– How to re-formulate LME as another type of optimization problem which has more efficient solutions?

• semi-definite programming (SDP)?

• Practically, extend to large-scale continuous ASR tasks

– TIDIGITS experiments under way

– SPINE very soon

– Switchboard
