Large margin discriminative learning for acoustic modeling

Large margin discriminative learning for acoustic modeling Fei Sha Dept. of Computer & Information Science University of Pennsylvania Thesis Proposal July 17, 2006

Transcript of Large margin discriminative learning for acoustic modeling

  • Large margin discriminative learning for acoustic modeling

    Fei Sha, Dept. of Computer & Information Science

    University of Pennsylvania

    Thesis Proposal

    July 17, 2006

  • Acoustic Modeling: model acoustic properties of speech units

    [Diagram: speech signal → signal processing → decoder (acoustic models + language models) → recognition result]

    p(W|A) ∝ p(A|W) p(W)

    where A denotes the acoustic observations and W the word sequence.

  • Many research problems

    • What types of units?

    • What acoustic properties & how to represent?

    • How to integrate?

    • What kind of models?

    • How to estimate model parameters?

    (the choices made here: phonemes, standard signal processing, continuous density hidden Markov models)

  • Phonemes

    • Smallest distinctive sound units in a language
      - distinguish one word from another
      - used as building blocks for other units

    • Example:

      Phoneme  Example   Phoneme  Example   Phoneme  Example
      /i/      beat      /I/      bit       /e/      bait
      /æ/      bat       /ɛ/      bet       /a/      Bob
      /ɝ/      bird      /ə/      about     /ɔ/      bought
      /u/      boot      /U/      book      /o/      boat
      /ay/     buy       /ɔy/     boy       /aw/     down
      /w/      wit       /l/      let       /r/      rent
      /y/      you       /m/      met       /n/      net
      /ŋ/      sing      /b/      bet       /d/      debt
      /g/      get       /p/      pet       /t/      ten
      /k/      kit       /v/      vat       /θ/      thing
      /z/      zoo       /ʒ/      azure     /f/      fat
      /ð/      that      /s/      sat       /ʃ/      shut
      /h/      hat       /dʒ/     judge     /tʃ/     church

  • Signal processing

    • Frame-based feature representation

    • Industry standard
      - 13 mel-cepstral coefficients (MFCCs) per frame
      - frame differences (delta-features) for temporal information

      (a short front-end sketch follows the figure note below)

    !"#$%&'(' !"#$%&')'

    *+,-%&

    ./

    *+,-%&

    .0

    1#)+(%+&

    2+,$3*#+-

    4%5&!52%+&

    6,$7

    5%,+$($8&

    ,58#+(2"-

    -#9%5&

    *#+&'('

    -#9%5&

    *#+&')'

    3(8$,5&!+#:%33($8&*+#$2;%$9

    3"#+2;2(-%&,$,5

  • Continuous density hidden Markov Models

    • Standard acoustic model
      - HMM state transitions for modeling temporal dynamics
      - Gaussian mixture models for modeling the emission distribution

    • Example:
      - 3 states: left, middle, right
      - forward state transitions

      [Figure: 3-state left-to-right HMM with transition probabilities.]

    p(x_t | s_t = j) = ∑_{m=1}^{M} α_{jm} N(x_t; µ_{jm}, Σ_{jm})
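A minimal sketch of this emission density, assuming SciPy's multivariate normal; the parameter arrays are illustrative, not taken from the thesis.

```python
# p(x_t | s_t = j) = sum_m alpha_jm * N(x_t; mu_jm, Sigma_jm)
import numpy as np
from scipy.stats import multivariate_normal

def emission_prob(x_t, alphas, mus, Sigmas):
    """alphas: (M,) mixture weights; mus: (M, D) means; Sigmas: (M, D, D) covariances."""
    return sum(a * multivariate_normal.pdf(x_t, mean=mu, cov=S)
               for a, mu, S in zip(alphas, mus, Sigmas))
```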

  • The problem of parameter estimation

    A simplified framework:

    • Training data:
      - acoustic features:  x ∈ ℝ^D
      - phonetic transcriptions:  y ∈ {1, 2, ..., C}

    • Model: hidden Markov models with GMM emissions,  p(x, y; θ)

    • Output:
      - transition probabilities
      - means and covariance matrices

  • Paradigms for parameter estimation

    • Maximum likelihood estimation
      - maximize the joint likelihood:

        θ_ML = argmax_θ ∑_n log p(x_n, y_n; θ)

      - simple update procedures (EM algorithms)
      - only indirectly linked to recognition performance

    • Discriminative learning
      - aims to classify and recognize correctly
      - tends to be much more computationally intensive
      - leads to better performance if done right
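A minimal sketch of the maximum likelihood baseline, assuming scikit-learn's GaussianMixture (an EM implementation); the thesis uses HMM/GMM acoustic models, so this only illustrates the per-class GMM part.

```python
# Fit one GMM per class by maximum likelihood (EM), then classify by the
# largest log-likelihood (equal class priors assumed for simplicity).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ml_gmms(X, y, n_components=2):
    """X: (N, D) features; y: (N,) integer labels. Returns one GMM per class."""
    classes = np.unique(y)
    return [GaussianMixture(n_components=n_components, covariance_type="full",
                            random_state=0).fit(X[y == c]) for c in classes]

def classify(gmms, X):
    scores = np.stack([g.score_samples(X) for g in gmms], axis=1)   # log p(x | class)
    return scores.argmax(axis=1)
```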

  • Ex: discriminative learning methods

    • Conditional maximum likelihood (CML/MMIE)

    Intuition: raise likelihood for the target class, lower the likelihood for all other classes

    • Minimum classification error (MCE)
      - raise the discriminant score for the target class
      - lower the discriminant scores for other classes
      - transform & quantify the difference to approximate the classification error

    argmax_θ ∑_n log p(y_n | x_n; θ)
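A minimal sketch of the CML/MMIE objective, assuming a matrix of joint log-likelihoods log p(x_n, c; θ) has already been computed; the conditional log-likelihood is the joint score of the target class normalized over all classes.

```python
import numpy as np
from scipy.special import logsumexp

def conditional_log_likelihood(log_joint, y):
    """log_joint: (N, C) array of log p(x_n, c; theta); y: (N,) target labels.
    Returns sum_n log p(y_n | x_n; theta)."""
    return float((log_joint[np.arange(len(y)), y] - logsumexp(log_joint, axis=1)).sum())
```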

  • Unified framework

    • Covers many methods
      - ML, MCE, CML/MMIE
      - Minimum Phone Error / Minimum Word Error

    Discriminative learning methods generally do much better than ML estimation.

    ∑_n f( log [ ( ∑_W p^α(A_n|W) p^α(W) G(W, W_n) ) / ( ∑_W p^α(A_n|W) p^α(W) ) ]^{1/α} )

  • Challenges of discriminative learning methods

    • Need a lot of training data
      - large corpus
      - aggressive parameter tying

    • Overfitting is a big problem
      - smoothing interpolation between ML and MMI
      - "confusable" set: frame discrimination, MPE/MWE

    • Optimization is complicated
      - EM variants: need to set parameters for convergence
      - gradient methods: mixed success

  • Focus of this thesis

    examine the utility of large margin based discriminative learning algorithms for acoustic-phonetic modeling

  • Why large margin?

    • Discriminative learning paradigm

    • Very successful

    - in multiway classification, e.g., SVMs

    - in structured classification, e.g., M3N (max-margin Markov networks)

    • Theoretically motivated to offer better generalization

    • Computationally appealing.

  • Outline

    • Large margin GMMs
      - label individual samples
      - parallel to support vector machines

    • Crucial extensions for modeling phonemes
      - handle outliers
      - classify and recognize sequences

    • Summary
      - what has been done
      - what will be done

  • Large margin Gaussian mixture modeling

  • Model each class by an ellipsoid

      - centroid:  µ_c ∈ ℝ^D
      - orientation matrix:  Ψ_c ∈ ℝ^{D×D},  Ψ_c ⪰ 0
      - offset:  θ_c ≥ 0

    [Figure: basic setting — an ellipsoid fit to the data points of one class.]

  • Nonlinear classification

    Classify by smallest score

    y = argmin_c { (x − µ_c)^T Ψ_c (x − µ_c) + θ_c },   x ∈ ℝ^D

    (Mahalanobis distance; equivalent to Gaussian classifiers)

  • Change of representation

    • Step 1: collect parameters

    • Step 2: augment inputs

    (µ_c, Ψ_c, θ_c)  −→  Φ_c = [ Ψ_c           −Ψ_c µ_c
                                  −µ_c^T Ψ_c    µ_c^T Ψ_c µ_c + θ_c ]  ⪰ 0

    z = [ x
          1 ] ∈ ℝ^{D+1}

  • Classification rule

    • Old:   y = argmin_c { (x − µ_c)^T Ψ_c (x − µ_c) + θ_c }

      The old representation is nonlinear in µ_c and Ψ_c.

    • New:   y = argmin_c { z^T Φ_c z }

      The new representation is linear in Φ_c.
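A minimal sketch verifying this change of representation: collecting (µ_c, Ψ_c, θ_c) into Φ_c and augmenting x with a constant 1 makes the score z^T Φ_c z reproduce the original Mahalanobis-plus-offset score exactly (values below are illustrative).

```python
import numpy as np

def build_phi(mu, Psi, theta):
    """Phi_c = [[Psi, -Psi mu], [-mu^T Psi, mu^T Psi mu + theta]]."""
    top = np.hstack([Psi, -Psi @ mu[:, None]])
    bottom = np.hstack([-(mu[None, :] @ Psi), [[mu @ Psi @ mu + theta]]])
    return np.vstack([top, bottom])

rng = np.random.default_rng(0)
D = 3
A = rng.normal(size=(D, D))
Psi = A @ A.T                                   # positive semidefinite orientation matrix
mu, theta, x = rng.normal(size=D), 0.5, rng.normal(size=D)

z = np.append(x, 1.0)                           # augmented input z = [x; 1]
old = (x - mu) @ Psi @ (x - mu) + theta         # nonlinear in mu and Psi
new = z @ build_phi(mu, Psi, theta) @ z         # linear in Phi
assert np.isclose(old, new)
```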

  • Large margin learning criteria

    Input: labeled examples {(x_n, y_n)}
    Output: positive semidefinite matrices {Φ_c}
    Goal: classify each example correctly by at least one unit margin:

    ∀ c ≠ y_n:  z_n^T Φ_c z_n − z_n^T Φ_{y_n} z_n ≥ 1

    margin = difference of distance between target and competing classes

  • Decision boundary & margin

    !"#$%$&'()&*'!+,-( .+,/$'(.+,/$'(

  • Handling infeasible examples

    • “Soft” margin constraints with slacks

      ∀ c ≠ y_n:  z_n^T Φ_c z_n + ξ_nc − z_n^T Φ_{y_n} z_n ≥ 1

    • Discriminative loss function: convex, but only a surrogate for the 0/1 loss

    [Figure: hinge loss vs. 0/1 loss as a function of the margin; the slack ξ_nc measures the margin violation of infeasible examples.]

  • Learning via convex optimization

    • Balance the total slack against the scale (trace) of the matrices Ψ_c

    • Properties
      - instance of semidefinite programming (SDP)
      - convex problem: no local optima
      - efficiently solvable with gradient methods

    minimize    ∑_{n,c} ξ_nc + κ ∑_c trace(Ψ_c)
    subject to  z_n^T Φ_c z_n + ξ_nc − z_n^T Φ_{y_n} z_n ≥ 1
                ξ_nc ≥ 0,  ∀ n, c ≠ y_n
                Φ_c ⪰ 0
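A minimal sketch of this SDP on toy data, assuming the cvxpy modeling library with an off-the-shelf solver; the slide notes that in practice the problem is solved more efficiently with gradient methods, so this is only meant to make the formulation concrete.

```python
import numpy as np
import cvxpy as cp

D, C, kappa = 2, 3, 0.1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3.0 * c, size=(20, D)) for c in range(C)])   # toy blobs
y = np.repeat(np.arange(C), 20)
Z = np.hstack([X, np.ones((len(X), 1))])                                   # augmented inputs z = [x; 1]

Phi = [cp.Variable((D + 1, D + 1), PSD=True) for _ in range(C)]            # one PSD matrix per class
xi = cp.Variable((len(X), C), nonneg=True)                                 # slack variables

constraints = []
for n, zn in enumerate(Z):
    target = zn @ Phi[y[n]] @ zn
    for c in range(C):
        if c != y[n]:
            # unit margin between competing and target scores, up to slack
            constraints.append(zn @ Phi[c] @ zn + xi[n, c] - target >= 1)

# total slack plus trace regularization on the Psi_c block of each Phi_c
objective = cp.Minimize(cp.sum(xi) + kappa * sum(cp.trace(P[:D, :D]) for P in Phi))
cp.Problem(objective, constraints).solve(solver=cp.SCS)
```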

  • General setting

    • Model each class by M ellipsoids {Φ_cm}

    • Assume knowledge of a "proxy label" for each example:

      m_n = argmin_m  z_n^T Φ_{y_n, m} z_n

  • Large margin constraints

    ∀ c "= yn∀m z

    TnΦcmz − zTnΦynmnz ≥ 1

    score to the proxy centroid

    score for all competing centroids

  • Handle large number of constraints

    ∀ c "= yn minm

    (zTnΦcmz − zTnΦynmnz) ≥ 1

    ∀ c "= yn − logm∑

    m

    e−(zTnΦcmz−z

    TnΦynmn z) ≥ 1

    softmax trick to squash many constraints into one
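A minimal numerical illustration of the softmax trick: −log ∑_m exp(−d_m) is a smooth lower bound on min_m d_m, so requiring the soft minimum to exceed one unit is a single, slightly stronger constraint that implies all M hard constraints.

```python
import numpy as np
from scipy.special import logsumexp

d = np.array([2.3, 1.1, 4.0])       # margins to the M centroids of one competing class
softmin = -logsumexp(-d)            # -log sum_m exp(-d_m)
assert softmin <= d.min()           # always a lower bound on the hard minimum
print(d.min(), softmin)             # 1.1  vs. roughly 0.80
```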

  • Close the loop: proxy labels

    • Step 1: maximum likelihood parameter estimation

      Φ^ML = argmax_Φ ∑_n log p(x_n, y_n; Φ)

    • Step 2: find the closest mixture component

      m_n = argmin_m  z_n^T Φ^ML_{y_n, m} z_n

  • Optimization problem

    • Properties:
      - no longer an instance of SDP
      - still a convex problem

    minimize    ∑_{n,c} ξ_nc + κ ∑_{c,m} trace(Ψ_cm)
    subject to  −log ∑_m exp( −( z_n^T Φ_{c,m} z_n − z_n^T Φ_{y_n, m_n} z_n ) ) + ξ_nc ≥ 1
                ξ_nc ≥ 0,  ∀ n, c ≠ y_n
                Φ_cm ⪰ 0

  • Application: handwritten digit recognition

    • General setup
      - train GMMs with maximum likelihood estimation
      - use the GMMs to infer proxy labels
      - train large margin GMMs

    • Data set
      - MNIST handwritten digit recognition
      - 60K training & 10K testing examples, ten-fold cross validation
      - dimensionality reduced from 28x28 to 40 with PCA

  • Result: classification error

    • Comparable to the best SVM results (1.0% - 1.4%)
    • Training time: 1 - 10 minutes

    [Bar chart: test error vs. number of mixture components per class, comparing EM and large margin training; the annotated large margin error is 1.2%.]

  • Result: digit prototypes

    • EM: typical images
    • Large margin: superposition of images of different digits

    [Figure: digit prototypes learned by each method.]

  • Crucial extensions for acoustic modeling

    - Handle statistical outliers
    - Classify sequences
    - Recognize sequences

  • Extension 1: outliers

    [Figure: hinge loss as a function of the margin — a single outlier can dominate the loss function, while the near misses are the points we actually want to focus on.]

  • Remedy

    Equalize the contributions from outliers:

    minimize    ∑_{n,c} α_n ξ_nc + κ ∑_c trace(Ψ_c)
    subject to  z_n^T Φ_c z_n + ξ_nc − z_n^T Φ_{y_n} z_n ≥ 1
                ξ_nc ≥ 0,  ∀ n, c ≠ y_n
                Φ_c ⪰ 0

    Estimate the weighting factors α_n from the ML-estimated GMMs.

    [Figure: reweighted loss with the outlier's contribution equalized.]
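A minimal sketch of one plausible way to derive the weighting factors α_n from the ML-trained GMMs; the exact scheme used in the thesis may differ, so treat this purely as an illustration of capping the influence of outliers.

```python
import numpy as np

def outlier_weights(hinge_loss_ml):
    """hinge_loss_ml[n]: total hinge loss of example n under the ML-trained model.
    Outliers (very large losses) get small weights so they cannot dominate the objective."""
    return np.minimum(1.0, 1.0 / np.maximum(hinge_loss_ml, 1e-8))

losses = np.array([0.0, 0.4, 1.0, 7.5, 50.0])    # toy hinge losses under the ML model
print(outlier_weights(losses))                    # [1.   1.   1.   0.133 0.02]
```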

  • Extension 2: phonetic classification

    • Input:
      - segmented speech signals
      - each segment is a phonetic unit

    • Output:
      - phonetic label for each segment

    • Evaluation:
      - percentage of misclassified segments

    [Figure: waveform with known phonetic segmentation, labeled with the phone sequence "bcl l aa kcl t ey bcl".]

  • Algorithm

    • Use the average frame score to set up the margin constraint
      - Input: frames x_n1, x_n2, ..., x_nℓ of segment n
      - Output: label y_n

    ∀ c ≠ y_n:  (1/ℓ) ∑_p z_np^T Φ_c z_np − (1/ℓ) ∑_p z_np^T Φ_{y_n} z_np ≥ 1
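A minimal sketch of the segment-level score used in this constraint: frame scores z_np^T Φ_c z_np averaged over the segment, compared between the target class and each competing class (names and shapes are illustrative).

```python
import numpy as np

def segment_score(Z_seg, Phi_c):
    """Z_seg: (L, D+1) augmented frames of one segment; Phi_c: (D+1, D+1) PSD matrix.
    Returns the average frame score (1/L) * sum_p z_p^T Phi_c z_p."""
    return float(np.einsum("ld,de,le->l", Z_seg, Phi_c, Z_seg).mean())

def margin_satisfied(Z_seg, Phis, y):
    """Check the unit-margin constraint against every competing class."""
    scores = np.array([segment_score(Z_seg, P) for P in Phis])
    competing = np.delete(scores, y)
    return bool(np.all(competing - scores[y] >= 1.0))
```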

  • Application: phonetic classification

    • General setup
      - train baseline GMMs with ML
      - use baseline GMMs to infer proxy labels
      - use baseline GMMs to detect outliers
      - train large margin GMMs

    • Data
      - TIMIT speech database: 1.2M frames, 120K segments for training
      - very well studied benchmark problem

  • Result: phonetic classification error

    [Bar chart: phonetic classification error (20%-34%) vs. number of mixture components (1, 2, 4, 8, 16), comparing baseline EM and large margin training, with the best previously reported result indicated for reference.]

  • Extension 3: phonetic recognition

    • Input:
      - unsegmented speech
      - frames span more than one phone

    • Output:
      - phonetic segmentation
      - phonetic labels

    • Evaluation metrics:
      - multiple metrics
      - some relate more closely to word error rate

    [Figure: output phonetic segmentation (e.g., "bc l a kc t e bc"), inferred indirectly from frame-level phonetic labels.]

  • Algorithm

    • Goal: separate target sequences from incorrect sequences by a large margin

    • Model: fully observable Markov model

    [Figure: Markov chain of states over the input frames; state sequence ↔ phoneme label sequence.]

  • Discriminant score for sequences

    • Analogous to the log probability in CD-HMMs
      - transition scores
      - emission scores

    • Extendable to more complicated features

    D(X, s) = ∑_t log a(s_{t−1}, s_t) − ∑_{t=1}^{T} z_t^T Φ_{s_t} z_t

    X = {x_1, x_2, ..., x_T},   s = {s_1, s_2, ..., s_T}
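A minimal sketch of this discriminant score for a given state sequence; log_a holds the transition scores and Phi the per-state matrices (names and shapes are illustrative).

```python
import numpy as np

def discriminant_score(Z, s, log_a, Phi):
    """Z: (T, D+1) augmented frames; s: (T,) state sequence;
    log_a: (S, S) transition scores; Phi: (S, D+1, D+1) PSD matrices.
    Returns D(X, s) = sum_t log a(s_{t-1}, s_t) - sum_t z_t^T Phi_{s_t} z_t."""
    transition = log_a[s[:-1], s[1:]].sum()
    emission = np.einsum("td,tde,te->t", Z, Phi[s], Z).sum()
    return float(transition - emission)
```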

  • Large margin constraints

    How to handle exponential number of constraints?

    ∀s "= y, D(X,y)−D(X, s) ≥ h(s,y)

    Hamming distance[Taskar et al 03]

  • Softmax margin

    −D(X, y) + max_{s ≠ y} { h(s, y) + D(X, s) } ≤ 0

    squash into one constraint with softmax:

    −D(X, y) + log ∑_{s ≠ y} exp( h(s, y) + D(X, s) ) ≤ 0

    tractable with dynamic programming
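A minimal sketch of that dynamic program: a forward recursion in log space computes log ∑_s exp(h(s, y) + D(X, s)) over all state sequences in O(T·S²) time, exploiting the fact that the Hamming distance decomposes per frame. For simplicity the sum below includes s = y and omits initial-state scores; both are easy to adjust.

```python
import numpy as np
from scipy.special import logsumexp

def log_softmax_margin_sum(Z, y, log_a, Phi):
    """Z: (T, D+1) frames; y: (T,) target states; log_a: (S, S); Phi: (S, D+1, D+1)."""
    T, S = Z.shape[0], log_a.shape[0]
    frame = np.einsum("td,sde,te->ts", Z, Phi, Z)               # z_t^T Phi_s z_t for all t, s
    # per-frame score: Hamming bonus when s_t != y_t, minus the emission score
    local = (np.arange(S)[None, :] != y[:, None]).astype(float) - frame
    alpha = local[0]                                            # forward log-scores at t = 0
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + log_a, axis=0) + local[t]
    return float(logsumexp(alpha))
```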

  • Optimization

    • Properties:
      - convex optimization
      - the main computational effort goes into computing the gradients

    min    ∑_n ξ_n + γ ∑_c trace(Ψ_c)
    s.t.   −D(X_n, y_n) + log ∑_{s ≠ y_n} exp( h(s, y_n) + D(X_n, s) ) ≤ ξ_n
           ξ_n ≥ 0,  n = 1, 2, ..., N
           Φ_c ⪰ 0,  c = 1, 2, ..., C

  • Results

    • General setup
      - train baseline GMMs with ML
      - use baseline GMMs to infer proxy labels
      - use the proxy label sequence as the target state sequence
      - train large margin GMMs

    • Data
      - TIMIT speech database: 1.2M frames, 3000 sentences
      - very well studied benchmark problem

  • Results: phone error rate

    [Bar chart: phone error rate (26%-42%) vs. number of mixture components (1, 2, 4, 8), comparing baseline EM and large margin training.]

  • Result: is “sequential” really important?

    [Bar chart: phone error rate (26%-38%) vs. number of mixture components (1, 2, 4, 8), comparing frame-based and sequence-based large margin training.]

  • Result: relative error reduction over baseline

    [Bar chart: relative error reduction over the baseline (4%-24%) vs. number of mixture components (1, 2, 4, 8), comparing MMI and large margin training.]

  • What has been achieved?

    • Algorithms
      - large margin GMMs for multiway classification
      - large margin GMMs for sequential classification

    • Advantages over traditional methods
      - computationally appealing through convexity
      - well motivated training criteria for generalization

    • Experimental results
      - able to train on millions of data points
      - improvements over the baseline and other competitive methods

  • Proposed Future Work

    • Goals:
      - Is the margin really helpful in getting better performance?
      - Can the formulation of the optimization be improved?

    • Comparison to standard ASR algorithms:
      - MMI/CML, MCE
      - investigate the relation with MPE/MWE

    • Comparison to structural learning methods
      - recent advances in the machine learning community
      - algorithms: extragradient (Taskar, 2006), subgradient method (Bagnell, 2006)