Post on 19-Jan-2016
description
1
CRANDEM: Conditional Random Fields for ASR
Jeremy Morris
11/21/2008
2
Outline
Background – Tandem HMMs & CRFs Crandem HMM Phone recognition Word recognition
3
Background Conditional Random Fields (CRFs)
Discriminative probabilistic sequence model Directly defines a posterior probability of a label
sequence given a set of observations
Background
Problem: How do we make use of CRF classification for word recognition? Attempt to use CRFs directly? Attempt to fit CRFs into current state-of-the-art
models for speech recognition? Here we focus on the latter approach
How can we integrate what we learn from the CRF into a standard HMM-based ASR system?
4
5
Background Tandem HMM
Generative probabilistic sequence model Uses outputs of a discriminative model (e.g. ANN
MLPs) as input feature vectors for a standard HMM
6
Background Tandem HMM
ANN MLP classifiers are trained on labeled speech data Classifiers can be phone classifiers, phonological
feature classifiers Classifiers output posterior probabilities for each
frame of data E.g. P(Q|X), where Q is the phone class label and X is
the input speech feature vector
7
Background Tandem HMM
Posterior feature vectors are used by an HMM as inputs
In practice, posteriors are not used directly Log posterior outputs or “linear” outputs are more
frequently used “linear” here means outputs of the MLP with no application
of the softmax function to transform into probabilities Since HMMs model phones as Gaussian mixtures, the
goal is to make these outputs look more “Gaussian” Additionally, Principle Components Analysis (PCA) is
applied to features to decorrelate features for diagonal covariance matrices
8
Idea: Crandem Use a CRF classifier to create inputs to a
Tandem-style HMM CRF labels provide a better per-frame accuracy
than input MLPs We’ve shown CRFs to provide better phone
recognition than a Tandem system with the same inputs
This suggests that we may get some gain from using CRF features in an HMM
9
Idea: Crandem Problem: CRF output doesn’t match MLP
output MLP output is a per-frame vector of posteriors CRF outputs a probability across the entire
sequence Solution: Use Forward-Backward algorithm to
generate a vector of posterior probabilities
10
Forward-Backward Algorithm The Forward-Backward algorithm is already
used during CRF training Similar to the forward-backward algorithm for
HMMs Forward pass collects feature functions for the
timesteps prior to the current timestep Backward pass collects feature functions for the
timesteps following the current timestep Information from both passes are combined
together to determine the probability of being in a given state at a particular timestep
11
)(
)),,(),((exp
)|(1
xZ
yyxfyxs
XYP t i jttjjtii
Forward Backward Algorithm
i j
jjii yyxfyxsyyM ))',,(),(exp(]',[
nt
ntifMT
TttT
t
1
1{1
0
0
1{ 1
t
ntifM ttt
)()|( ,,
, xZXyP titi
ti
12
Forward-Backward Algorithm
This form allows us to use the CRF to compute a vector of local posteriors y at any timestep t.
We use this to generate features for a Tandem-style system Take log features, decorelate with PCA
)()|( ,,
, xZXyP titi
ti
13
Phone Recognition Pilot task – phone recognition on TIMIT
61 feature MLPs trained on TIMIT, mapped down to 39 features for evaluation
Crandem compared to Tandem and a standard PLP HMM baseline model
As with previous CRF work, we use the outputs of an ANN MLP as inputs to our CRF
Various CRF models examined (state feature functions only, state+transition functions), and various input feature spaces examined (phone classifier and phonological feature classifier)
14
Phone Recognition
Phonological feature attributes Detector outputs describe phonetic features of a
speech signal Place, Manner, Voicing, Vowel Height, Backness, etc. A phone is described with a vector of feature values
Phone class attributes Detector outputs describe the phone label
associated with a portion of the speech signal /t/, /d/, /aa/, etc.
15
Phone Recognition
)(
)),,(),((exp
)|(1
xZ
yyxfyxs
XYP t i jttjjtii
16
Phone Recognition - Results
Phonological feature attributes Detector outputs describe phonetic features of a
speech signal Place, Manner, Voicing, Vowel Height, Backness, etc. A phone is described with a vector of feature values
Phone class attributes Detector outputs describe the phone label
associated with a portion of the speech signal /t/, /d/, /aa/, etc.
17* Significantly (p<0.05) improvement at 0.6% difference between models
Results (Fosler-Lussier & Morris 08)Model Phone
Accuracy
PLP HMM reference 68.1%
Tandem (61 feas) 70.6%
Tandem (48 feas) 70.8%
CRF (state) 69.9%
CRF (state+trans) 70.7%
Crandem (state) – log 71.1%
Crandem (state+trans) – log 71.7%
Crandem (state) – unnorm 71.2%
Crandem (state+trans) – unnorm 71.8%
18* Significantly (p≤0.05) improvement at 0.6% difference between models
Results (Fosler-Lussier & Morris 08)Model Phone
Accuracy
PLP HMM reference 68.1%
Tandem (105 feas) 70.9%
Tandem (48 feas) 71.2%
CRF (state) 71.4%
CRF (state+trans) 71.6%
Crandem (state) – log 71.7%
Crandem (state+trans) – log 72.4%
Crandem (state) – unnorm 71.7%
Crandem (state+trans) – unnorm 72.4%
19
Word Recognition Second task – Word recognition
Dictionary for word recognition has 54 distinct phones instead of 48, so new CRFs and MLPs trained to provide input features
MLPs and CRFs again trained on TIMIT to provide both phone classifier output and phonological feature classifier output
Initial experiments – use MLPs and CRFs trained on TIMIT to generate features for WSJ recognition Next pass – use MLPs and CRFs trained on TIMIT to
align label files for WSJ, then train MLPs and CRFs for WSJ recognition
20* Significant (p≤0.05) improvement at roughly 1% difference between models
Initial ResultsModel Word
Accuracy
MFCC HMM reference 90.85%
Tandem MLP (54feas) 90.30%
Tandem MFCC+MLP (54feas) 90.90%
Crandem (54feas) (state) 90.95%
Crandem (54feas) (state+trans) 90.77%
Crandem MFCC+CRF (state) 92.29%
Crandem MFCC+CRF (state+tran) 92.40%
21* Significant (p≤0.05) improvement at roughly 1% difference between models
Initial ResultsModel Word
Accuracy
MFCC HMM reference 90.85%
Tandem MLP (98feas) 91.26%
Tandem MFCC+MLP (98feas) 92.04%
Crandem (98feas) (state) 91.31%
Crandem (98feas) (state+trans) 90.49%
Crandem MFCC+CRF (state) 92.47%
Crandem MFCC+CRF (state+tran) 92.62%
22* Significant (p≤0.05) improvement at roughly 1% difference between models
Initial ResultsModel Word
Accuracy
MFCC HMM reference 90.85%
Tandem MLP (WSJ 54) 90.41%
Tandem MFCC+MLP (WSJ 54) 92.21%
Crandem (WSJ 54) (state) 89.58%
23
Word Recognition Problems
Some of the models show slight significant improvement over their Tandem counterpart Unfortunately, what will cause an improvement is not yet
predictable Transition features give slight degredation when used on
their own slight improvement when classifier is mixed with MFCCs
Retraining directly on WSJ data does not give improvement for CRF Gains from CRF training are wiped away if we just
retrain the MLPs on WSJ data
24
Word Recognition Problems (cont.)
The only model that gives improvement for the Crandem system is a CRF model trained on linear outputs from MLPs Softmax outputs – much worse than baseline Log softmax outputs – ditto This doesn’t seem right, especially given the results
from the Crandem phone recognition experiments These were trained on softmax outputs I suspect “implementor error” here, though I haven’t
tracked down my mistake yet
25
Word Recognition Problems (cont.)
Because of the “linear inputs only” issue, certain features have yet to be examined fully “Hifny”-style Gaussian scores have not provided any
gain – scaling of these features may be preventing them from being useful
26
Current Work Sort out problems with CRF models
Why is it so sensitive to the input feature type? (linear vs. log vs. softmax)
If this sensitivity is “built in” to the model, how can I appropriately scale features to include them in the model that works?
Move on to next problem – direct decoding on CRF lattices