ICASSP 2007 Robustness Techniques Survey

Presenter: Shih-Hsiang Lin

Topic

• Word Graph based Feature Enhancement for Noisy Speech Recognition

• Stereo-based Stochastic Mapping for Robust Speech Recognition

• Combination of Recognizers and Fusion of Features Approach to Missing Data ASR Under Non-Stationary Noise Conditions

WORD GRAPH BASED FEATURE ENHANCEMENT FOR NOISY SPEECH RECOGNITION

Zhi-Jie Yan (1), Frank K. Soong (2), Ren-Hua Wang (1)

(1) iFlytek Speech Lab, University of Science and Technology of China, Hefei, P. R. China, 230027

(2) Microsoft Research Asia, Beijing, P. R. China, 100080

SPE-L3: Robust Features and Acoustic Modeling

Introduction

• This paper presents a word graph based feature enhancement method for robust speech recognition in noise

– A word graph is more likely than the first-best hypothesis alone to contain the correct hypotheses, even when they have lower posterior probabilities (or likelihoods) than the incorrect first-best hypothesis

• The proposed method is based on Wiener filtering of the Mel-filter bank energy, given

– the input noisy speech

– a signal processing based estimate of noise

– a clean trained Hidden Markov Model (HMM)

• Therefore, the enhanced speech feature after Wiener filtering can match the clean speech model better in the acoustic space, and thus lead to improved recognition performance

Algorithm Overview

[Figure: block diagram of the algorithm. Stage labels: rough estimate of noise spectrum; re-estimate noise mean; normalized speech; model based estimate of the clean speech; final estimate of clean speech; corresponding clean speech]

More Details…

• Kernel posterior probabilities can be calculated for each Gaussian component of the model

– These posterior probabilities serve as the weighting coefficients for synthesizing the model based clean speech for Wiener filtering

– Using the word graph, the posterior probability of kernel k at time t, given the entire observation sequence, can be formulated as:

$$p(k;t \mid o_1^T) = \sum_{[w;s,e]:\, s \le t \le e} \; \sum_{j \in w,\; k \in j} p([w;s,e] \mid o_1^T)\; p(j;t \mid w)\; p(k;t \mid j)$$

– The three factors are, in order, the Word Posterior Probability (WPP), the State Occupancy Probability, and the Kernel Occupancy Probability

• The model based clean speech estimate for Wiener filtering is constructed in two steps

– First step: for each time frame t, the expected values of the mean and covariance of the clean speech feature are calculated using the kernel posterior probabilities along with the kernel parameters
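
The accumulation of kernel posteriors over the word graph can be sketched as below. This is a minimal illustration, not the authors' implementation: the arc dictionary layout (`start`, `end`, `wpp`, `state_occ`, `kernel_occ`) and the function name `kernel_posteriors` are hypothetical, and the WPP, state occupancy, and kernel occupancy values are assumed to have been computed beforehand.

```python
import numpy as np

def kernel_posteriors(arcs, num_frames, num_kernels):
    """Accumulate p(k;t | o_1^T) over every word arc [w;s,e] that spans frame t.

    Hypothetical arc layout (an assumption for this sketch):
      start, end : inclusive frame span of the arc
      wpp        : word posterior probability p([w;s,e] | o_1^T)
      state_occ  : array (L, J), state occupancy p(j;t | w) per frame of the arc
      kernel_occ : array (L, J, K), kernel occupancy p(k;t | j), indexed by
                   global kernel id (zero for kernels outside state j)
    """
    post = np.zeros((num_frames, num_kernels))
    for arc in arcs:
        s, e = arc["start"], arc["end"]
        for t in range(s, e + 1):
            state_occ = arc["state_occ"][t - s]    # shape (J,)
            kernel_occ = arc["kernel_occ"][t - s]  # shape (J, K)
            # WPP * sum_j p(j;t|w) p(k;t|j)
            post[t] += arc["wpp"] * (state_occ @ kernel_occ)
    return post
```

If the WPPs of the arcs covering a frame sum to one and the occupancies are normalized, each row of the result sums to one, which is a convenient sanity check.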

More Details… (cont.)

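– The expected mean and covariance at frame t follow from the kernel posterior probabilities and the kernel parameters:

$$\hat{\mu}(t) = \sum_{k=1}^{K} p(k;t \mid o_1^T)\, \mu_k$$

$$\hat{\Sigma}(t) = \sum_{k=1}^{K} p(k;t \mid o_1^T) \left[ \Sigma_k + \left(\mu_k - \hat{\mu}(t)\right)\left(\mu_k - \hat{\mu}(t)\right)^{\mathsf T} \right]$$

– Second step, clean speech S3 can be synthesized in the ML sense

• the ML solution of S3 can be obtained by solving the weighted normal equation

$$W^{\mathsf T} U^{-1} W\, C = W^{\mathsf T} U^{-1} M$$

where

$$C = \left[ c(1)^{\mathsf T}, c(2)^{\mathsf T}, \ldots, c(T)^{\mathsf T} \right]^{\mathsf T} \quad \text{(synthesized clean speech)}$$

$$U^{-1} = \mathrm{diag}\left[ \hat{\Sigma}^{-1}(1), \hat{\Sigma}^{-1}(2), \ldots, \hat{\Sigma}^{-1}(T) \right]$$

$$M = \left[ \hat{\mu}(1)^{\mathsf T}, \hat{\mu}(2)^{\mathsf T}, \ldots, \hat{\mu}(T)^{\mathsf T} \right]^{\mathsf T}$$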
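
The two steps can be sketched numerically as follows. This is a sketch under stated assumptions, not the paper's code: covariances are taken as diagonal, the static feature is one-dimensional, and W is assumed to be a window matrix that stacks a static row and a simple delta row 0.5 * (c(t+1) - c(t-1)) per frame, as is common in HMM-based parameter generation. The slides do not define W, and all function names here are hypothetical.

```python
import numpy as np

def expected_moments(post, means, variances):
    """Mixture mean and (diagonal) variance per frame.

    post: (T, K) kernel posteriors; means, variances: (K, D) kernel parameters.
    """
    mu_hat = post @ means
    second_moment = post @ (variances + means ** 2)
    var_hat = second_moment - mu_hat ** 2
    return mu_hat, var_hat

def delta_window_matrix(num_frames):
    """Assumed W: per frame, one static row and one delta row (edges clamped)."""
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0
        W[2 * t + 1, min(t + 1, num_frames - 1)] += 0.5
        W[2 * t + 1, max(t - 1, 0)] -= 0.5
    return W

def solve_trajectory(W, mu_hat, var_hat):
    """Solve W^T U^-1 W C = W^T U^-1 M for C, with diagonal U.

    mu_hat, var_hat: (T, 2) arrays holding [static, delta] statistics per frame.
    """
    u_inv = 1.0 / var_hat.reshape(-1)   # diagonal of U^-1
    M = mu_hat.reshape(-1)
    A = W.T @ (u_inv[:, None] * W)
    b = W.T @ (u_inv * M)
    return np.linalg.solve(A, b)
```

A usage example would be `C = solve_trajectory(delta_window_matrix(T), mu_hat, var_hat)`, where `mu_hat` and `var_hat` come from `expected_moments` applied to static-plus-delta kernel parameters.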

More Details… (cont.)

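• Wiener filtering of the Mel-filter bank energy is performed in the linear spectral domain

$$S_3^{\mathrm{FBE}}(t) = \exp\left( \mathrm{IDCT}\left( c(t) \right) \right)$$

$$S_4^{\mathrm{FBE}}(t) = \frac{S_3^{\mathrm{FBE}}(t)}{S_3^{\mathrm{FBE}}(t) + N^{\mathrm{FBE}}(t)}\; X^{\mathrm{FBE}}(t)$$

– Here $X^{\mathrm{FBE}}(t)$ is the noisy input, $N^{\mathrm{FBE}}(t)$ is the noise estimate, and $S_4^{\mathrm{FBE}}(t)$ is the final estimate of clean speech

• In the last step, $S_4^{\mathrm{FBE}}(t)$ is converted to the cepstral domain and the word graph is rescored

– Re-decode S4 within the constrained search space defined by the word graph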
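
The filtering step can be illustrated with a short NumPy sketch. It is an assumption-laden illustration rather than the paper's front end: the unnormalized DCT-II matrix, the pseudo-inverse used as the IDCT for truncated cepstra, and the names `dct_matrix` and `wiener_enhance` are all choices made here for the example.

```python
import numpy as np

def dct_matrix(num_ceps, num_bands):
    """Unnormalized DCT-II matrix mapping log filter bank energies to cepstra."""
    n = np.arange(num_bands)
    k = np.arange(num_ceps)[:, None]
    return np.cos(np.pi * k * (n + 0.5) / num_bands)

def wiener_enhance(cepstra_s3, noisy_fbe, noise_fbe, dct):
    """Apply the Wiener gain S3 / (S3 + N) to the noisy filter bank energies.

    cepstra_s3: (T, num_ceps) synthesized clean speech cepstra c(t)
    noisy_fbe, noise_fbe: (T, num_bands) linear-domain filter bank energies
    Returns the enhanced energies S4 and their cepstra.
    """
    idct = np.linalg.pinv(dct)              # assumed IDCT (handles truncation)
    s3_fbe = np.exp(cepstra_s3 @ idct.T)    # back to the linear spectral domain
    gain = s3_fbe / (s3_fbe + noise_fbe)    # Wiener gain, between 0 and 1
    s4_fbe = gain * noisy_fbe
    cepstra_s4 = np.log(s4_fbe) @ dct.T     # features passed to rescoring
    return s4_fbe, cepstra_s4
```

With a zero noise estimate the gain is one and the noisy energies pass through unchanged; a larger noise estimate in a band attenuates that band more strongly.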

Experimental Results

• Signal processing based feature enhancement consistently improves the recognition performance, with an overall relative error rate reduction of 35.44%

• The graph error rate (GER) of the decoded word graph is significantly lower than the word error rate (WER) of the first-best hypothesis (only about 1/4 to 1/5)

• The results show that an overall relative error rate reduction of 57.89% is obtained

• Using word graph constrained second pass decoding, this result is obtained with a minor increase of the computational cost

• The experimental results suggest that the difference between the two decoding scenarios is minimal

STEREO-BASED STOCHASTIC MAPPING FOR ROBUST SPEECH RECOGNITION

Mohamed Afify, Xiaodong Cui, and Yuqing Gao

IBM T.J. Watson Research Center, 1101 Old Kitchawan Road, Yorktown Heights, NY, 10598

SPE-L3: Robust Features and Acoustic Modeling

Introduction

• The idea is based on building a GMM for the joint distribution of the clean and noisy channels during training and using an iterative compensation algorithm during testing

– Also interpreted as a mixture of linear transforms that are estimated in a special way using stereo data

– Stack both the clean and noisy channels to form a large augmented space and to build a statistical model in this new space

• The observed noisy speech and the augmented statistical model are used to predict the clean speech

Algorithm Formulation

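• Assume we have a set of stereo data {(x_i, y_i)}, where x is the clean and y the noisy feature

• Define z ≡ (x, y) as the concatenation of the two channels

• The first step in constructing the mapping is training the joint probability model for p(z)

$$p(z) = \sum_{k=1}^{K} c_k\, \mathcal{N}\left(z;\, \mu_{z,k},\, \Sigma_{zz,k}\right)$$

where

$$\mu_{z,k} = \begin{pmatrix} \mu_{x,k} \\ \mu_{y,k} \end{pmatrix}, \qquad \Sigma_{zz,k} = \begin{pmatrix} \Sigma_{xx,k} & \Sigma_{xy,k} \\ \Sigma_{yx,k} & \Sigma_{yy,k} \end{pmatrix}$$

• Once this model is constructed it can be used during testing to estimate the clean speech given the noisy observations

$$\hat{x} = \arg\max_x\, p(x \mid y) = \arg\max_x \sum_k p(x, k \mid y) = \arg\max_x \sum_k p(k \mid y)\, p(x \mid k, y)$$

– The problem of estimating x in this equation looks like a mixture estimation problem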

Algorithm Formulation (cont.)

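• Hence, we will iteratively optimize an EM objective function given by

$$\begin{aligned}
\hat{x} &= \arg\max_x \sum_k p(k \mid \bar{x}, y)\, \log\left[ p(k \mid y)\, p(x \mid k, y) \right] \\
&= \arg\max_x \sum_k p(k \mid \bar{x}, y) \left[ \log p(k \mid y) + \log p(x \mid k, y) \right] \\
&= \arg\max_x \sum_k p(k \mid \bar{x}, y)\, \log p(x \mid k, y) \\
&= \arg\max_x\, -\frac{1}{2} \sum_k p(k \mid \bar{x}, y) \left(x - \mu_{x|y,k}\right)^{\mathsf T} \Sigma_{x|y,k}^{-1} \left(x - \mu_{x|y,k}\right)
\end{aligned}$$

– $\bar{x}$ is the value of x from the previous iteration

– The conditional mean and covariance of component k are

$$\mu_{x|y,k} = \mu_{x,k} + \Sigma_{xy,k}\, \Sigma_{yy,k}^{-1} \left(y - \mu_{y,k}\right)$$

$$\Sigma_{x|y,k} = \Sigma_{xx,k} - \Sigma_{xy,k}\, \Sigma_{yy,k}^{-1}\, \Sigma_{yx,k}$$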
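
The conditional parameters of one joint Gaussian component can be computed as in the sketch below. It assumes the stacked ordering z = (x, y) with x occupying the first `dim_x` entries; the function name `conditional_params` is hypothetical.

```python
import numpy as np

def conditional_params(mu_z, sigma_zz, y, dim_x):
    """Mean and covariance of x given y for one joint Gaussian over z = (x, y)."""
    mu_x, mu_y = mu_z[:dim_x], mu_z[dim_x:]
    s_xx = sigma_zz[:dim_x, :dim_x]
    s_xy = sigma_zz[:dim_x, dim_x:]
    s_yx = sigma_zz[dim_x:, :dim_x]
    s_yy = sigma_zz[dim_x:, dim_x:]
    gain = s_xy @ np.linalg.inv(s_yy)      # Sigma_xy Sigma_yy^-1
    mu_cond = mu_x + gain @ (y - mu_y)
    sigma_cond = s_xx - gain @ s_yx
    return mu_cond, sigma_cond
```

The matrix `gain` is the per-component linear transform applied to the noisy observation, which is why the mapping can be read as a mixture of linear transforms.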

Algorithm Formulation (cont.)

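• Differentiating the EM objective with respect to x and setting the resulting derivative to zero gives

$$\sum_k p(k \mid \bar{x}, y)\, \Sigma_{x|y,k}^{-1}\, \hat{x} = \sum_k p(k \mid \bar{x}, y)\, \Sigma_{x|y,k}^{-1}\, \mu_{x|y,k}$$

• An interesting special case arises when x is a scalar

$$\hat{x} = \frac{\sum_k p(k \mid \bar{x}, y)\, \mu_{x|y,k} / \sigma_{x|y,k}^{2}}{\sum_k p(k \mid \bar{x}, y) / \sigma_{x|y,k}^{2}}$$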
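
The scalar update can be turned into a small iterative routine, sketched below for scalar x and scalar y. Several details are assumptions made for the example and are not stated on the slides: the initial value (a p(k|y)-weighted average of the conditional means), the fixed iteration count, and the names `joint_density` and `ssm_scalar`.

```python
import numpy as np

def joint_density(x, y, mu, cov):
    """Bivariate Gaussian density N([x, y]; mu, cov)."""
    d = np.array([x, y]) - mu
    quad = d @ np.linalg.solve(cov, d)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))

def ssm_scalar(y, weights, mus, covs, num_iters=10):
    """Iterate the scalar update for x given y under a joint GMM over z = (x, y)."""
    weights = np.asarray(weights, dtype=float)
    mus = np.asarray(mus, dtype=float)      # (K, 2): [mu_x, mu_y]
    covs = np.asarray(covs, dtype=float)    # (K, 2, 2)
    # Conditional mean and variance of x given y for each component.
    mu_c = mus[:, 0] + covs[:, 0, 1] / covs[:, 1, 1] * (y - mus[:, 1])
    var_c = covs[:, 0, 0] - covs[:, 0, 1] * covs[:, 1, 0] / covs[:, 1, 1]
    # Assumed initialization: weight the conditional means by p(k | y).
    p_y = weights * np.exp(-0.5 * (y - mus[:, 1]) ** 2 / covs[:, 1, 1])
    p_y = p_y / np.sqrt(2.0 * np.pi * covs[:, 1, 1])
    x = np.sum(p_y / p_y.sum() * mu_c)
    for _ in range(num_iters):
        # Posterior p(k | x_bar, y) from the joint model, x_bar being the current x.
        post = np.array([w * joint_density(x, y, m, c)
                         for w, m, c in zip(weights, mus, covs)])
        post = post / post.sum()
        x = np.sum(post * mu_c / var_c) / np.sum(post / var_c)
    return x
```

Because the update is a weighted average with positive weights, the estimate always lies between the smallest and largest conditional means.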

Experimental Results

Digit recognition in the car

• The first three lines refer to train/test conditions, where clean refers to CT and noisy to HF

• The proposed mapping outperforms SPLICE for all GMM sizes, with the difference decreasing as the GMM size increases

• Both methods are considerably better than the VTS result in the last row of Table 1

• Using a time window gives an improvement over the baseline SSM at a slight cost during runtime

• These results are not given for SPLICE because using biases requires that both the input and output spaces have the same dimensions

Experimental Results (cont.)

English large vocabulary speech recognition

• SSM brings considerable improvement over MST even in the clean speech condition

MFCC Feature

LDA+MLLT Feature

• Building maps in the final feature space (after LDA and MLLT) appears to be slightly better than building them in the original cepstral space

Combination of Recognizers and Fusion of Features Approach to Missing Data ASR Under Non-Stationary Noise Conditions

Neil Joshi and Ling Guan

Department of Electrical and Computer Engineering, Ryerson University, Toronto ON M5B 2K3, Canada

SPE-P14: Robustness II

Introduction

• This paper proposes a method to enhance speech recognition performance using missing data techniques for non-stationary noise conditions

– By incorporating more resilient feature sets into the decoding process

– Two separate HMM based models

• One using spectral features

• The other MFCC features

• The statistical dependencies found in the models are based upon a coupled HMM methodology, the Fused HMM model

– One using standard ASR techniques (traditional MFCC based HMM models)

– The other missing data based (missing data theory spectral HMM models)

Coupled Fused HMM

• The fused HMM model models the relationship between HMMs using a probabilistic fusion model

• The statistical dependencies between the two HMM processes are thus

$$\hat{p}^{(1)}(O_1, O_2) = p(O_1)\; p(O_2 \mid \hat{U}_1)$$

$$\hat{p}^{(2)}(O_1, O_2) = p(O_2)\; p(O_1 \mid \hat{U}_2)$$
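
The first fused score can be sketched for a pair of discrete-observation streams as below. This is only an illustration under assumptions that the slides do not state: the estimated state sequence of the first HMM is taken to be its Viterbi path, the coupling term is assumed to factorize per frame through a table `coupling[u, o]` standing for p(o2 = o | u1 = u), and all function names are hypothetical. All probabilities are assumed strictly positive so that logarithms are finite.

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """log p(O) for a discrete HMM using the scaled forward algorithm."""
    alpha = pi * B[:, obs[0]]
    loglik = 0.0
    for o in obs[1:]:
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha = ((alpha / scale) @ A) * B[:, o]
    return loglik + np.log(alpha.sum())

def viterbi(pi, A, B, obs):
    """Most likely state sequence of a discrete HMM."""
    delta = np.log(pi) + np.log(B[:, obs[0]])
    backpointers = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(A)   # scores[i, j]: move from i to j
        backpointers.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(B[:, o])
    path = [int(delta.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1]

def fused_loglik(hmm1, coupling, obs1, obs2):
    """log p(O1) + log p(O2 | U1_hat), with a per-frame coupling table (assumed)."""
    pi, A, B = hmm1
    states = viterbi(pi, A, B, obs1)
    coupling_term = sum(np.log(coupling[u, o]) for u, o in zip(states, obs2))
    return forward_loglik(pi, A, B, obs1) + coupling_term
```

The second fused score follows by swapping the roles of the two streams and using a coupling table conditioned on the second HMM's states.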


• The fused decoder is found to significantly increase recognition performance over the conventional missing data decoding process