-
Large margin discriminative learning for acoustic modeling
Fei Sha
Dept. of Computer & Information Science
University of Pennsylvania
Thesis Proposal
July 17, 2006
-
Acoustic Modeling
model acoustic properties of speech units

[Diagram: speech signal → signal processing → decoder (acoustic models + language models) → recognition result]

p(W|A) ∝ p(A|W) p(W)

A: acoustic observations; W: word sequence
-
Many research problems
• What types of units?
• What acoustic properties & how to represent?
• How to integrate?
• What kind of models?
• How to estimate model parameters?
phonemes
standard signal processing
continuous density hidden Markov models
-
Phonemes
• Smallest distinctive sound units in a language
  - distinguish one word from another
  - use as building blocks for other units
• Example:
  /i/ beat    /I/ bit     /e/ bait
  /æ/ bat     /ɛ/ bet     /a/ Bob
  /ɝ/ bird    /ə/ about   /ɔ/ bought
  /u/ boot    /U/ book    /o/ boat
  /ay/ buy    /oy/ boy    /aw/ down
  /w/ wit     /l/ let     /r/ rent
  /y/ you     /m/ met     /n/ net
  /ŋ/ sing    /b/ bet     /d/ debt
  /g/ get     /p/ pet     /t/ ten
  /k/ kit     /v/ vat     /θ/ thing
  /z/ zoo     /ʒ/ azure   /f/ fat
  /ð/ that    /s/ sat     /ʃ/ shut
  /h/ hat     /dʒ/ judge  /tʃ/ church
-
Signal processing
• Frame-based feature representation
• Industry standard
  - 13 mel-cepstral coefficients (MFCCs) per frame
  - frame differences (delta-features) for temporal information
[Figure: signal processing frontend — speech frames pass through short-time analysis, a Fourier transform, and a Mel filter bank; the resulting features feed a learning algorithm that trains a model per phoneme]
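The frontend described above can be sketched with numpy alone. This is a minimal illustration, not the exact industry pipeline: frame length (400 samples = 25 ms at 16 kHz), hop size, filter count, and the absence of pre-emphasis and liftering are all simplifying assumptions, and the function names are mine.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_features(signal, sr=16000, frame_len=400, hop=160,
                  n_mels=24, n_ceps=13):
    """13 cepstral coefficients per frame plus simple delta features."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    fb = mel_filterbank(n_mels, frame_len, sr)
    # unnormalized DCT-II basis for the cepstral transform
    k = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_mels)
    ceps = []
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        logmel = np.log(fb @ power + 1e-10)
        ceps.append(basis @ logmel)
    ceps = np.array(ceps)
    deltas = np.vstack([ceps[1:] - ceps[:-1], np.zeros((1, n_ceps))])
    return np.hstack([ceps, deltas])   # 26-dimensional frame vectors
```

One second of 16 kHz audio yields 98 frames of 26-dimensional features (13 MFCCs + 13 deltas), matching the frame-based representation the slide describes.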
-
Continuous density hidden Markov models
• Standard acoustic model
  - HMM state transitions for modeling temporal dynamics
  - Gaussian mixture models for modeling emission distributions
• Example:
  - 3 states: left, middle, right
  - forward state transitions
[Figure: 3-state left-to-right HMM with self-loops and forward transitions]

p(x_t | s_t = j) = ∑_{m=1}^{M} α_{jm} 𝒩(x_t; μ_{jm}, Σ_{jm})
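The emission density above can be evaluated directly with numpy; this sketch (function name is illustrative) computes the log of the mixture density with the usual log-sum-exp trick for stability:

```python
import numpy as np

def emission_logprob(x, weights, means, covs):
    """log p(x_t | s_t = j) for one state's mixture of full-covariance
    Gaussians: log sum_m alpha_jm N(x_t; mu_jm, Sigma_jm)."""
    D = x.shape[0]
    logps = []
    for a, mu, S in zip(weights, means, covs):
        diff = x - mu
        _, logdet = np.linalg.slogdet(S)
        quad = diff @ np.linalg.solve(S, diff)
        logps.append(np.log(a) - 0.5 * (D * np.log(2 * np.pi) + logdet + quad))
    m = max(logps)   # log-sum-exp for numerical stability
    return m + np.log(sum(np.exp(p - m) for p in logps))
```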
-
The problem of parameter estimation
A simplified framework:
• Training data:
  - acoustic features: x ∈ ℝ^D
  - phonetic transcriptions: y ∈ {1, 2, . . . , C}
• Model: hidden Markov models with GMMs, p(x, y; θ)
• Output:
  - transition probabilities
  - means and covariance matrices
-
Paradigms for parameter estimation
• Maximum likelihood estimation
  - maximize the joint likelihood: arg max_θ ∑_n log p(x_n, y_n; θ)
  - uses simple update procedures (EM algorithms)
  - linked only indirectly to recognition performance
• Discriminative learning
  - aims to classify and recognize correctly
  - tends to be much more computationally intensive
  - leads to better performance if done right
-
Ex: discriminative learning methods
• Conditional maximum likelihood (CML/MMIE)

  arg max_θ ∑_n log p(y_n | x_n; θ)

  Intuition: raise the likelihood of the target class, lower the likelihood of all other classes
• Minimum classification error (MCE)
  - raise the discriminant score of the target class
  - lower the discriminant scores of other classes
  - transform & quantify the difference to approximate classification error
-
Unified framework
• Covers many methods
  - ML, MCE, CML/MMIE
  - Minimum Phone Error / Minimum Word Error

∑_n f( log [ ∑_W p^α(A_n|W) p^α(W) G(W, W_n) / ∑_W p^α(A_n|W) p^α(W) ]^{1/α} )

Discriminative learning methods generally do much better than ML estimation.
-
Challenges of discriminative learning methods
• Need a lot of training data
  - large corpus
  - aggressive parameter tying
• Overfitting is a big problem
  - smoothing: interpolation between ML and MMI
  - “confusable” set: frame discrimination, MPE/MWE
• Optimization is complicated
  - EM variants: need to set parameters for convergence
  - Gradient methods: mixed success
-
Focus of this thesis
Examine the utility of large-margin-based discriminative learning algorithms for acoustic-phonetic modeling.
-
Why large margin?
• Discriminative learning paradigm
• Very successful
  - in multiway classification, e.g., SVMs
  - in structured classification, e.g., M3N
• Theoretically motivated to offer better generalization
• Computationally appealing
-
Outline
• Large margin GMMs - Label individual samples - Parallel to support vector machines
• Crucial extensions for modeling phonemes- Handle outliers- Classify and recognize sequences
• Summary- What has been done- What will be done
-
Large margin Gaussian mixture modeling
-
Basic setting
Model each class by an ellipsoid:
- centroid: μ_c ∈ ℝ^D
- orientation matrix: Ψ_c ∈ ℝ^{D×D}, Ψ_c ≽ 0
- offset: θ_c ≥ 0
-
Nonlinear classification
Classify x ∈ ℝ^D by the smallest score (a Mahalanobis distance):

y = arg min_c { (x − μ_c)^T Ψ_c (x − μ_c) + θ_c }

equivalent to Gaussian classifiers
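The decision rule above is a one-line computation; a minimal sketch (function name is illustrative), taking per-class lists of Ψ_c, μ_c, θ_c:

```python
import numpy as np

def classify(x, Psi, mu, theta):
    """Assign x to the class whose ellipsoid gives the smallest
    Mahalanobis-style score (x - mu_c)^T Psi_c (x - mu_c) + theta_c."""
    scores = [(x - m) @ P @ (x - m) + t
              for P, m, t in zip(Psi, mu, theta)]
    return int(np.argmin(scores))
```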
-
Change of representation
• Step 1: collect parameters

(μ_c, Ψ_c, θ_c) → Φ_c = [ Ψ_c          −Ψ_c μ_c
                          −μ_c^T Ψ_c    μ_c^T Ψ_c μ_c + θ_c ] ≽ 0

• Step 2: augment inputs

z = [x; 1] ∈ ℝ^{D+1}
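The point of the change of representation is that z^T Φ_c z reproduces the original ellipsoid score exactly. A short numpy check of that identity (function name is mine):

```python
import numpy as np

def build_phi(Psi, mu, theta):
    """Collect (mu_c, Psi_c, theta_c) into the single matrix Phi_c."""
    D = len(mu)
    Phi = np.zeros((D + 1, D + 1))
    Phi[:D, :D] = Psi
    Phi[:D, D] = -Psi @ mu
    Phi[D, :D] = -mu @ Psi
    Phi[D, D] = mu @ Psi @ mu + theta
    return Phi
```

For any x, appending a constant 1 to form z gives z^T Φ_c z = (x − μ_c)^T Ψ_c (x − μ_c) + θ_c, so the classifier becomes linear in Φ_c.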
-
Classification rule
• Old: y = arg min_c { (x − μ_c)^T Ψ_c (x − μ_c) + θ_c }
  Old representation is nonlinear in μ_c and Ψ_c.
• New: y = arg min_c { z^T Φ_c z }
  New representation is linear in Φ_c.
-
Large margin learning criteria
Input: labeled examples {(x_n, y_n)}
Output: positive semidefinite matrices {Φ_c}
Goal: classify by at least one unit margin

∀ c ≠ y_n:  z_n^T Φ_c z_n − z_n^T Φ_{y_n} z_n ≥ 1

margin = difference of distances between target and competing classes
-
Decision boundary & margin
[Figure: decision boundary with margins on either side]
-
Handling infeasible examples
• “Soft” margin constraints with slacks ξ_nc

∀ c ≠ y_n:  z_n^T Φ_c z_n + ξ_nc − z_n^T Φ_{y_n} z_n ≥ 1

• Discriminative loss function: convex, but only a surrogate
[Figure: hinge loss vs. 0/1 loss as a function of the margin; feasible points beyond margin 1, infeasible points inside it]
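At the optimum each slack takes the value of the hinge loss of its constraint, which is what makes the slack formulation a convex surrogate for 0/1 error. A small sketch (function name is mine), given one example's per-class scores z^T Φ_c z:

```python
def optimal_slacks(scores, y):
    """Slack for each competing class c != y at the optimum:
    xi_nc = max(0, 1 - (z^T Phi_c z - z^T Phi_y z)),
    i.e. the hinge loss of the margin constraint."""
    return {c: max(0.0, 1.0 - (s - scores[y]))
            for c, s in enumerate(scores) if c != y}
```

A class whose score already exceeds the target's by one unit contributes zero slack; a violated constraint contributes linearly, unlike the flat 0/1 loss.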
-
Learning via convex optimization
• Balance total slacks and scale

minimize   ∑_{n,c} ξ_nc + κ ∑_c trace(Ψ_c)
subject to z_n^T Φ_c z_n + ξ_nc − z_n^T Φ_{y_n} z_n ≥ 1
           ξ_nc ≥ 0, ∀ n, c ≠ y_n
           Φ_c ≽ 0

• Properties
  - instance of semidefinite programming (SDP)
  - convex problem: no local optima
  - efficiently solvable with gradient methods
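The slide mentions gradient methods; one simple way to honor the constraint Φ_c ≽ 0 is projected subgradient descent, sketched below. This is an illustrative solver of my own, not the thesis's actual optimizer; names and step sizes are assumptions.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.maximum(w, 0.0)) @ V.T

def subgradient_step(Phis, z, y, lr=0.1, kappa=0.0):
    """One projected-subgradient step on the hinged margin constraints
    for a single augmented example (z, y)."""
    Phis = [P.copy() for P in Phis]
    zz = np.outer(z, z)
    for c in range(len(Phis)):
        if c == y:
            continue
        margin = z @ Phis[c] @ z - z @ Phis[y] @ z
        if margin < 1.0:            # violated: hinge subgradient is active
            Phis[c] += lr * zz      # push the competing score up
            Phis[y] -= lr * zz      # pull the target score down
    D = len(z) - 1                  # trace penalty acts on the Psi block
    for c in range(len(Phis)):
        Phis[c][:D, :D] -= lr * kappa * np.eye(D)
        Phis[c] = project_psd(Phis[c])
    return Phis
```

Each step moves a violated margin toward feasibility while the eigenvalue clipping keeps every Φ_c positive semidefinite.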
-
General setting
• Model each class by M ellipsoids {Φ_cm}
• Assume knowledge of the “proxy label”

m_n = arg min_m z_n^T Φ_{y_n m} z_n
-
Large margin constraints

∀ c ≠ y_n, ∀ m:  z_n^T Φ_cm z_n − z_n^T Φ_{y_n m_n} z_n ≥ 1

score to the proxy centroid vs. scores of all competing centroids
-
Handle large number of constraints

∀ c ≠ y_n:  min_m ( z_n^T Φ_cm z_n − z_n^T Φ_{y_n m_n} z_n ) ≥ 1

∀ c ≠ y_n:  −log ∑_m e^{−( z_n^T Φ_cm z_n − z_n^T Φ_{y_n m_n} z_n )} ≥ 1

softmax trick to squash many constraints into one
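The softmax trick works because −log ∑_m e^{−a_m} is a smooth lower bound on min_m a_m, so enforcing the relaxed constraint enforces the original one. A small numerical check (function name is mine):

```python
import numpy as np

def softmin(values):
    """-log sum_m exp(-a_m): a smooth lower bound on min_m a_m,
    within log(M) of the true minimum for M values."""
    a = np.asarray(values, dtype=float)
    m = a.min()   # shift for numerical stability
    return m - np.log(np.exp(-(a - m)).sum())
```

Requiring softmin(margins) ≥ 1 therefore implies min(margins) ≥ 1, at the cost of some slack bounded by log M.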
-
Close the loop: proxy labels
• Step 1: maximum likelihood parameter estimation

Φ^ML = arg max_Φ ∑_n log p(x_n, y_n; Φ)

• Step 2: find the closest mixture component

m_n = arg min_m z_n^T Φ^ML_{y_n m} z_n
-
Optimization problem

minimize   ∑_{n,c} ξ_nc + κ ∑_{c,m} trace(Ψ_cm)
subject to −log ∑_m e^{−( z_n^T Φ_cm z_n − z_n^T Φ_{y_n m_n} z_n )} + ξ_nc ≥ 1
           ξ_nc ≥ 0, ∀ n, c ≠ y_n
           Φ_cm ≽ 0

• Properties:
  - no longer an instance of SDP
  - still a convex problem
-
Application: handwritten digit recognition
• General setup
  - train GMMs with maximum likelihood estimation
  - use GMMs to infer proxy labels
  - train large margin GMMs
• Data sets
  - MNIST handwritten digit recognition
  - 60K training & 10K testing, ten-fold cross validation
  - reduce dimensionality from 28x28 to 40 with PCA
-
Result: classification error
• Comparable to best SVM results (1.0% – 1.4%)
• Training time: 1 – 10 minutes
[Chart: classification error vs. number of mixture components (1, 2, 4, 8) for EM and large margin training; large margin GMMs reach 1.2%]
-
Result: digit prototypes
• EM: typical images
• Large margin: superposition of images of different digits
[Figure: digit prototypes]
-
Crucial extensions for acoustic modeling
Handle statistical outliers
Classify sequences
Recognize sequences
-
Extension 1: outliers
[Figure: outliers far outside the margin dominate the hinge loss, while learning should focus on the correct points and near misses]
-
Remedy
Equalize contributions from outliers:

minimize   ∑_{n,c} α_n ξ_nc + κ ∑_c trace(Ψ_c)
subject to z_n^T Φ_c z_n + ξ_nc − z_n^T Φ_{y_n} z_n ≥ 1
           ξ_nc ≥ 0, ∀ n, c ≠ y_n
           Φ_c ≽ 0

Estimate the weighting factors α_n from ML-estimated GMMs.
-
Extension 2: phonetic classification
• Input:
  - segmented speech signals
  - each segment is a phonetic unit
• Output:
  - phonetic label for each segment
• Evaluation:
  - percentage of misclassified segments

Example with known phonetic segmentation: bcl l aa kcl t ey bcl
-
Algorithm
• Use the average score to set up the margin constraint
  - Input: x_n1, x_n2, . . . , x_nℓ
  - Output: y_n

∀ c ≠ y_n:  (1/ℓ) ∑_p z_np^T Φ_c z_np − (1/ℓ) ∑_p z_np^T Φ_{y_n} z_np ≥ 1
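Classifying a whole segment by its average frame score is a direct extension of the single-frame rule; a minimal sketch (function name is mine), taking the segment's frames and the per-class Φ matrices:

```python
import numpy as np

def classify_segment(frames, Phis):
    """Label a phonetic segment by the smallest average frame score
    (1/l) sum_p z_p^T Phi_c z_p."""
    F = np.asarray(frames, dtype=float)
    Z = np.hstack([F, np.ones((len(F), 1))])   # augment each frame with 1
    scores = [np.mean(np.einsum('pi,ij,pj->p', Z, P, Z)) for P in Phis]
    return int(np.argmin(scores))
```

The einsum evaluates z_p^T Φ_c z_p for every frame p at once; averaging makes the score comparable across segments of different lengths ℓ.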
-
Application: phonetic classification
• General setup
  - train baseline GMMs with ML
  - use baseline GMMs to infer proxy labels
  - use baseline GMMs to detect outliers
  - train large margin GMMs
• Data
  - TIMIT speech database: 1.2m frames, 120K segments for training
  - very well studied benchmark problem
-
Result: phonetic classification error
[Chart: classification error (20%–34%) vs. number of mixture components (1, 2, 4, 16) for baseline, EM, and large margin training; large margin approaches the best reported result]
-
Extension 3: phonetic recognition
• Input:
  - unsegmented speech
  - frames span more than one phone
• Output:
  - phonetic segmentation
  - phonetic labels
• Evaluation metrics:
  - multiple metrics
  - some relate more to word error rate

Example: bc l a kc t e bc
(output phonetic segmentation, inferred indirectly from frame-level phonetic labels)
-
Algorithm
• Goal: separate target sequences from incorrect sequences by a large margin
• Model: fully observable Markov model
[Figure: chain of states over inputs; state sequence ↔ phoneme label sequence]
-
Discriminant score for sequences
• Analogous to the log probability in CD-HMMs
  - transition scores
  - emission scores
• Extendable to more complicated features

D(X, s) = ∑_t log a(s_{t−1}, s_t) − ∑_{t=1}^{T} z_t^T Φ_{s_t} z_t

X = {x_1, x_2, . . . , x_T},  s = {s_1, s_2, . . . , s_T}
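The sequence score decomposes over time, so it is a single pass over the frames; a direct sketch (function name is mine), with `log_a` holding the transition scores:

```python
import numpy as np

def discriminant_score(X, states, log_a, Phis):
    """D(X, s) = sum_t log a(s_{t-1}, s_t) - sum_t z_t^T Phi_{s_t} z_t."""
    score = 0.0
    for t, (x, s) in enumerate(zip(X, states)):
        if t > 0:
            score += log_a[states[t - 1], s]   # transition score
        z = np.append(x, 1.0)                  # augmented frame
        score -= z @ Phis[s] @ z               # emission score
    return score
```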
-
Large margin constraints

∀ s ≠ y:  D(X, y) − D(X, s) ≥ h(s, y)

h(s, y): Hamming distance [Taskar et al. 03]
How to handle an exponential number of constraints?
-
Softmax margin

−D(X, y) + max_{s≠y} { h(s, y) + D(X, s) } ≤ 0

squash into one constraint with softmax:

−D(X, y) + log ∑_{s≠y} e^{h(s,y)+D(X,s)} ≤ 0

tractable with dynamic programming
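The log-sum over exponentially many sequences is tractable because both D and the Hamming distance decompose over time, so a forward recursion in the log domain computes it exactly. The sketch below (names are mine) sums over all state sequences, including s = y; that only adds the single term e^{D(X,y)}, so it upper-bounds the slide's sum over s ≠ y.

```python
import numpy as np

def logsumexp(a):
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def log_margin_sum(X, y, log_a, Phis):
    """Forward recursion for log sum_s exp(h(s, y) + D(X, s))."""
    C, T = len(Phis), len(X)
    Z = np.hstack([np.asarray(X), np.ones((T, 1))])
    # per-frame terms: -z_t^T Phi_c z_t plus one unit of Hamming distance
    local = np.array([[-Z[t] @ Phis[c] @ Z[t] + (c != y[t])
                       for c in range(C)] for t in range(T)])
    alpha = local[0]
    for t in range(1, T):
        alpha = np.array([logsumexp(alpha + log_a[:, c])
                          for c in range(C)]) + local[t]
    return logsumexp(alpha)
```

For C states and T frames this costs O(T·C²), versus C^T sequences enumerated naively.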
-
Optimization

min  ∑_n ξ_n + γ ∑_c trace(Ψ_c)
s.t. −D(X_n, y_n) + log ∑_{s≠y_n} e^{h(s,y_n)+D(X_n,s)} ≤ ξ_n
     ξ_n ≥ 0, n = 1, 2, . . . , N
     Φ_c ≽ 0, c = 1, 2, . . . , C

• Properties:
  - convex optimization
  - main computational effort is in computing the gradients
-
Results
• General setup
  - train baseline GMMs with ML
  - use baseline GMMs to infer proxy labels
  - use proxy label sequences as target state sequences
  - train large margin GMMs
• Data
  - TIMIT speech database: 1.2m frames, 3000 sentences
  - very well studied benchmark problem
-
Results: phone error rate
[Chart: phone error rate (26%–42%) vs. number of mixture components (1, 2, 4, 8) for baseline, EM, and large margin training]
-
Result: is “sequential” really important?
[Chart: phone error rate (26%–38%) vs. number of mixture components (1, 2, 4, 8), comparing frame-based and sequence-based training]
-
Result: relative error reduction over baseline
[Chart: relative error reduction (4%–24%) vs. number of mixture components (1, 2, 4, 8), comparing MMI and large margin training]
-
What has been achieved?
• Algorithms
  - large margin GMMs for multiway classification
  - large margin GMMs for sequential classification
• Advantages over traditional methods
  - computationally appealing through convexity
  - well motivated training criteria for generalization
• Experimental results
  - able to train on millions of data points
  - improve over baseline and other competitive methods
-
Proposed Future Work
• Goals:
  - Is margin really helpful in getting better performance?
  - Can the formulation of the optimization be improved?
• Comparison to standard ASR algorithms
  - MMI/CML, MCE
  - investigate the relation with MPE/MWE
• Comparison to structured learning methods
  - recent advances in the machine learning community
  - algorithms: extragradient (Taskar, 2006), subgradient method (Bagnell, 2006)