ICASSP 2007 Robustness Techniques Survey
Presenter: Shih-Hsiang Lin
Topic
• Word Graph based Feature Enhancement for Noisy Speech Recognition
• Stereo-based Stochastic Mapping for Robust Speech Recognition
• Combination of Recognizers and Fusion of Features Approach to Missing Data ASR Under Non-Stationary Noise Conditions
WORD GRAPH BASED FEATURE ENHANCEMENT FOR NOISY SPEECH RECOGNITION
Zhi-Jie Yan (1), Frank K. Soong (2), Ren-Hua Wang (1)
(1) iFlytek Speech Lab, University of Science and Technology of China, Hefei, P. R. China, 230027
(2) Microsoft Research Asia, Beijing, P. R. China, 100080
SPE-L3: Robust Features and Acoustic Modeling
Introduction
• This paper presents a word graph based feature enhancement method for robust speech recognition in noise
– A word graph is more likely than the first best hypothesis alone to contain the correct hypotheses, even when they have lower posterior probabilities (or likelihoods) than the incorrect first best hypothesis
• The proposed method is based upon Wiener filtering of the Mel-filter bank energy, given
– the input noisy speech
– a signal processing based estimate of noise
– a clean trained Hidden Markov Model (HMM)
• Therefore, the enhanced speech feature after Wiener filtering can match the clean speech model better in the acoustic space, and thus leads to an improved recognition performance
Algorithm Overview
[Figure: block diagram of the algorithm. Recoverable labels: rough estimate of noise spectrum; re-estimate noise; mean normalized speech; model based estimate of the clean speech; final estimate of clean speech; corresponding clean speech]
More Details…
• Kernel posterior probabilities for each Gaussian component of the model can be calculated
– These posterior probabilities serve as the weighting coefficients for synthesizing the model based clean speech for Wiener filtering
– Using the word graph, the posterior probability of kernel k at time t, given the entire observation sequence can be formulated as:
• The model based clean speech estimate for Wiener filtering is constructed in two steps
– In the first step, for each time frame t, the expected values of the mean and covariance of the clean speech feature are calculated using the kernel posterior probabilities along with the kernel parameters
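Kernel posterior probability computed from the word graph:
$$p(k;t \mid o_1^T) = \sum_{[w;s,e]:\ s \le t \le e}\ \sum_{j \in w,\ k \in j} p([w;s,e] \mid o_1^T)\; p(j;t \mid w)\; p(k;t \mid j)$$
The sums run over the word hypotheses [w;s,e] that span frame t and over the states j of w that contain kernel k. The three factors are, in order:
– the Word Posterior Probability (WPP)
– the State Occupancy Probability
– the Kernel Occupancy Probability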
More Details… (cont.)
– In the second step, the clean speech S3 is synthesized in the maximum likelihood (ML) sense
• The ML solution for S3 is obtained by solving the weighted normal equation
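Expected mean and covariance of the clean speech feature at frame t:
$$\hat{\mu}(t) = \sum_{k=1}^{K} p(k;t \mid o_1^T)\, \mu_k$$
$$\hat{\Sigma}(t) = \sum_{k=1}^{K} p(k;t \mid o_1^T)\left(\Sigma_k + \mu_k \mu_k^T\right) - \hat{\mu}(t)\, \hat{\mu}(t)^T$$
Weighted normal equation:
$$W^T U^{-1} W\, C = W^T U^{-1} M$$
where
$$C = \left[c(1)^T, c(2)^T, \ldots, c(T)^T\right]^T$$
$$U^{-1} = \mathrm{diag}\left[\hat{\Sigma}^{-1}(1), \hat{\Sigma}^{-1}(2), \ldots, \hat{\Sigma}^{-1}(T)\right]$$
$$M = \left[\hat{\mu}(1)^T, \hat{\mu}(2)^T, \ldots, \hat{\mu}(T)^T\right]^T$$
The solution C is the synthesized clean speech.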
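The first step (posterior-weighted expected mean and covariance per frame) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes diagonal covariances, uses standard mixture moment matching, and the function name `expected_moments` and the toy numbers are hypothetical.

```python
def expected_moments(posteriors, means, variances):
    """Posterior-weighted mean and variance for one frame (diagonal case).

    posteriors: K kernel posteriors p(k; t | o_1^T), assumed to sum to 1
    means, variances: K lists holding per-dimension kernel parameters
    """
    dim = len(means[0])
    mu_hat = [
        sum(p * m[d] for p, m in zip(posteriors, means))
        for d in range(dim)
    ]
    # Second moment of the mixture minus the squared mean gives the variance.
    var_hat = [
        sum(p * (v[d] + m[d] ** 2)
            for p, m, v in zip(posteriors, means, variances)) - mu_hat[d] ** 2
        for d in range(dim)
    ]
    return mu_hat, var_hat


# Toy frame with two one-dimensional kernels (made-up values).
mu_hat, var_hat = expected_moments([0.25, 0.75], [[0.0], [4.0]], [[1.0], [1.0]])
```

The expected variance is larger than either kernel variance here because the two kernel means disagree, which is what lowers the weight of uncertain frames in the weighted normal equation.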
More Details… (cont.)
• Wiener filtering of the Mel-filter bank energy is performed in the linear spectral domain
• In the last step, the Wiener filtered Mel-filter bank energy is converted to the cepstral domain to give S4, and the word graph is then rescored
– S4 is re-decoded within the constrained search space defined by the word graph
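Conversion of the synthesized clean speech c(t) to the Mel-filter bank energy (FBE) domain:
$$S_3^{\mathrm{FBE}}(t) = \exp\left(\mathrm{IDCT}\left(c(t)\right)\right)$$
Wiener filtering of the noisy filter bank energy $X^{\mathrm{FBE}}(t)$ with the noise estimate $N^{\mathrm{FBE}}(t)$:
$$S_4^{\mathrm{FBE}}(t) = \frac{S_3^{\mathrm{FBE}}(t)}{S_3^{\mathrm{FBE}}(t) + N^{\mathrm{FBE}}(t)}\, X^{\mathrm{FBE}}(t)$$
$S_4^{\mathrm{FBE}}(t)$ is the final estimate of clean speech.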
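The Wiener filtering step can be sketched per Mel channel. This is a hedged illustration that assumes the usual Wiener gain (clean estimate divided by clean estimate plus noise estimate) applied to linear filter bank energies. The function name `wiener_filter_fbe`, the `floor` guard and the sample values are hypothetical.

```python
def wiener_filter_fbe(noisy_fbe, noise_fbe, clean_fbe_estimate, floor=1e-10):
    """Apply a per-channel Wiener gain to one frame of filter bank energies.

    noisy_fbe: observed noisy energies X(t), one value per Mel channel
    noise_fbe: noise energy estimate N(t)
    clean_fbe_estimate: model based clean speech estimate S3(t)
    floor: small constant that guards against division by zero (assumption)
    """
    enhanced = []
    for x, n, s3 in zip(noisy_fbe, noise_fbe, clean_fbe_estimate):
        gain = s3 / max(s3 + n, floor)  # gain lies in [0, 1] for non-negative inputs
        enhanced.append(gain * x)
    return enhanced


# Two channels with made-up energies.
frame = wiener_filter_fbe([4.0, 2.0], [1.0, 2.0], [3.0, 0.0])
```

Channels where the model expects little clean speech energy are attenuated strongly, while channels dominated by speech pass almost unchanged.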
Experimental Results
• Signal processing based feature enhancement consistently improves the recognition performance, and the overall relative error rate reduction is 35.44%
• The graph error rate (GER) of the decoded word graph is significantly lower than the word error rate (WER) of the first best hypothesis (only about 1/4 to 1/5 of it)
• With the proposed method, an overall relative error rate reduction of 57.89% is obtained
• Using word graph constrained second pass decoding, this result is obtained with a minor increase of the computational cost
• The experimental results suggest that the difference between the two decoding scenarios is minimal
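For reference, a relative error rate reduction is conventionally computed from a baseline and an improved error rate as below. This is a small sketch: the function name is hypothetical and the numbers in the example are invented for illustration, not taken from the paper.

```python
def relative_error_reduction(baseline_error, improved_error):
    """Relative error rate reduction in percent."""
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Invented example: an error rate dropping from 20% to 10%.
reduction = relative_error_reduction(20.0, 10.0)
```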
STEREO-BASED STOCHASTIC MAPPING FOR ROBUST SPEECH RECOGNITION
Mohamed Afify, Xiaodong Cui, and Yuqing Gao
IBM T.J. Watson Research Center, 1101 Old Kitchawan Road, Yorktown Heights, NY, 10598
SPE-L3: Robust Features and Acoustic Modeling
Introduction
• The idea is based on building a GMM for the joint distribution of the clean and noisy channels during training and using an iterative compensation algorithm during testing
– Also interpreted as a mixture of linear transforms that are estimated in a special way using stereo data
– Stack both the clean and noisy channels to form a large augmented space and to build a statistical model in this new space
• The observed noisy speech and the augmented statistical model are used to predict the clean speech
Algorithm Formulation
• Assume we have a set of stereo data {(xi, yi)}
• Define z ≡ (x, y) as the concatenation of the two channels
• The first step in constructing the mapping is training the joint probability model for p(z)
• Once this model is constructed it can be used during testing to estimate the clean speech given the noisy observations
– The resulting problem of estimating x looks like a mixture estimation problem
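Joint GMM over the stacked vector z:
$$p(z) = \sum_{k=1}^{K} c_k\, \mathcal{N}\left(z;\ \mu_{z,k},\ \Sigma_{zz,k}\right)$$
where
$$\mu_{z,k} = \begin{pmatrix} \mu_{x,k} \\ \mu_{y,k} \end{pmatrix}, \qquad \Sigma_{zz,k} = \begin{pmatrix} \Sigma_{xx,k} & \Sigma_{xy,k} \\ \Sigma_{yx,k} & \Sigma_{yy,k} \end{pmatrix}$$
Estimate of the clean speech x given the noisy observation y:
$$\hat{x} = \arg\max_x\ p(x \mid y) = \arg\max_x \sum_k p(x, k \mid y) = \arg\max_x \sum_k p(k \mid y)\, p(x \mid k, y)$$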
Algorithm Formulation (cont.)
• Hence, we will iteratively optimize an EM objective function given by
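$$\hat{x} = \arg\max_x \sum_k p(k \mid \bar{x}, y)\, \log\left[p(k \mid y)\, p(x \mid k, y)\right]$$
$$= \arg\max_x \sum_k p(k \mid \bar{x}, y) \left[\log p(k \mid y) + \log p(x \mid k, y)\right]$$
$$= \arg\max_x \sum_k p(k \mid \bar{x}, y)\, \log p(x \mid k, y)$$
$$= \arg\max_x \sum_k p(k \mid \bar{x}, y) \left[-\tfrac{1}{2} \log\left|\Sigma_{x|y,k}\right| - \tfrac{1}{2} \left(x - \mu_{x|y,k}\right)^T \Sigma_{x|y,k}^{-1} \left(x - \mu_{x|y,k}\right)\right]$$
– $\bar{x}$ is the value of x from the previous iteration
– The conditional mean and covariance of x given y for component k are
$$\mu_{x|y,k} = \mu_{x,k} + \Sigma_{xy,k}\, \Sigma_{yy,k}^{-1} \left(y - \mu_{y,k}\right)$$
$$\Sigma_{x|y,k} = \Sigma_{xx,k} - \Sigma_{xy,k}\, \Sigma_{yy,k}^{-1}\, \Sigma_{yx,k}$$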
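The conditional statistics of x given y for one mixture component follow the standard conditioning formulas for a joint Gaussian. The sketch below covers the case where x and y are both scalars, so the matrix inverse reduces to a division. The function name and the parameter values are hypothetical.

```python
def conditional_gaussian(mu_x, mu_y, s_xx, s_xy, s_yy, y):
    """Mean and variance of x given y for one joint Gaussian component.

    mu_x, mu_y: component means of the clean and noisy channels
    s_xx, s_xy, s_yy: entries of the 2x2 joint covariance (s_yx equals s_xy)
    y: observed noisy value
    """
    mean = mu_x + s_xy / s_yy * (y - mu_y)
    var = s_xx - s_xy * s_xy / s_yy
    return mean, var


# Made-up component parameters and observation.
mean, var = conditional_gaussian(1.0, 2.0, 4.0, 2.0, 2.0, 3.0)
```

The conditional mean is linear in y, which is why the mapping can also be read as a mixture of linear transforms.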
Algorithm Formulation (cont.)
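• Differentiating the objective with respect to x and setting the resulting derivative to zero gives the update
$$\sum_k p(k \mid \bar{x}, y)\, \Sigma_{x|y,k}^{-1}\, \hat{x} = \sum_k p(k \mid \bar{x}, y)\, \Sigma_{x|y,k}^{-1}\, \mu_{x|y,k}$$
• An interesting special case arises when x is a scalar
$$\hat{x} = \frac{\sum_k p(k \mid \bar{x}, y)\, \mu_{x|y,k} / \sigma_{x|y,k}^2}{\sum_k p(k \mid \bar{x}, y) / \sigma_{x|y,k}^2}$$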
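The iterative estimate for the scalar special case can be sketched as follows. It is an illustration under stated assumptions rather than the authors' code: x and y are scalars, the component posteriors p(k | x̄, y) are computed from the joint Gaussian factored as p(y | k) p(x̄ | y, k), the iteration is initialised with the noisy observation, and all names (`ssm_scalar_estimate`, the dictionary keys) and parameter values are hypothetical.

```python
import math


def normal_pdf(v, mean, var):
    """Density of a scalar Gaussian."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def ssm_scalar_estimate(y, components, n_iter=10):
    """Iteratively estimate scalar clean speech x from scalar noisy y.

    components: list of dicts with keys c, mu_x, mu_y, s_xx, s_xy, s_yy
    """
    cond = []
    for g in components:
        mean = g["mu_x"] + g["s_xy"] / g["s_yy"] * (y - g["mu_y"])
        var = g["s_xx"] - g["s_xy"] ** 2 / g["s_yy"]
        prior = g["c"] * normal_pdf(y, g["mu_y"], g["s_yy"])  # c_k * p(y | k)
        cond.append((prior, mean, var))

    x_bar = y  # initial value: the noisy observation (an assumption)
    for _ in range(n_iter):
        # Unnormalised p(k | x_bar, y); normalisation cancels in the ratio below.
        weights = [pr * normal_pdf(x_bar, m, v) for pr, m, v in cond]
        num = sum(w * m / v for w, (pr, m, v) in zip(weights, cond))
        den = sum(w / v for w, (pr, m, v) in zip(weights, cond))
        x_bar = num / den
    return x_bar


# Made-up two-component model.
toy_model = [
    {"c": 0.5, "mu_x": 0.0, "mu_y": 0.0, "s_xx": 1.0, "s_xy": 0.5, "s_yy": 1.0},
    {"c": 0.5, "mu_x": 4.0, "mu_y": 4.0, "s_xx": 1.0, "s_xy": 0.5, "s_yy": 1.0},
]
x_hat = ssm_scalar_estimate(2.0, toy_model)
```

Each update is a precision-weighted average of the component conditional means, so the estimate always stays between the smallest and largest of them. With a single component it reduces to that component's conditional mean.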
Experimental Results
• The first three lines refer to train/test conditions, where clean refers to the close-talking (CT) microphone and noisy to the hands-free (HF) microphone
• The proposed mapping outperforms SPLICE for all GMM sizes, with the difference decreasing as the GMM size increases
• Both methods are considerably better than the VTS result in the last row of Table 1
• Using a time window gives an improvement over the baseline stereo-based stochastic mapping (SSM) at a slight cost during runtime
• These results are not given for SPLICE because using biases requires that both the input and output spaces have the same dimensions
Digit recognition in the car
Experimental Results (cont.)
English large vocabulary speech recognition
• SSM brings considerable improvement over multi-style training (MST) even in the clean speech condition
MFCC Feature
LDA+MLLT Feature
• Building maps in the final feature space (after LDA and MLLT) appears to be slightly better than building them in the original cepstral space
Combination of Recognizers and Fusion of Features Approach to Missing Data ASR Under Non-Stationary Noise Conditions
Neil Joshi and Ling Guan
Department of Electrical and Computer Engineering, Ryerson University
Toronto ON M5B 2K3, Canada
SPE-P14: Robustness II
Introduction
• This paper proposes a method to enhance speech recognition performance using missing data techniques under non-stationary noise conditions
– By incorporating more resilient feature sets into the decoding process
– Two separate HMM based models
• One using spectral features
• The other MFCC features
• The statistical dependencies found in the models are based upon a coupled HMM methodology, the Fused HMM model
– One using standard ASR techniques (traditional MFCC based HMM models)
– The other missing data based (missing data theory spectral HMM models)
Coupled Fused HMM
• The fused HMM models the relationship between two HMMs using a probabilistic fusion model
• The statistical dependency between the two HMM processes is thus:
$$\hat{p}^{(1)}(O_1, O_2) = p(O_1)\, p(O_2 \mid \hat{U}_1)$$
$$\hat{p}^{(2)}(O_1, O_2) = p(O_2)\, p(O_1 \mid \hat{U}_2)$$
Experimental Results
• The fused decoder is found to significantly increase recognition performance over the conventional missing data decoding process