REVISED CONTEXTUAL LRT FOR VOICE ACTIVITY DETECTION
Javier Ramírez, José C. Segura and J.M. Górriz
Dept. of Signal Theory, Networking and Communications
University of Granada (SPAIN)
Presenter: Chen, Hung-Bin
ICASSP 2007
Outline
• Introduction
• Multiple Observation Likelihood Ratio Test
• Analysis Of The Proposed Algorithm (IEEE SIGNAL PROCESSING 2005)
• Revised MO-LRT (ICASSP 2007)
• Experimental Results
Introduction
• This paper presents a voice activity detector (VAD) based on a revised contextual likelihood ratio test (LRT) defined over a multiple observation window
• The new approach evaluates not only the two hypotheses in which all the observations are speech or all are nonspeech, but all the possible hypotheses defined over the individual observations
• The proposed method discriminates speech from nonspeech over a wide range of SNR conditions
Likelihood Ratio Test
• Two-hypothesis test
– Given an observation vector $\tilde{y}$ to be classified, the problem is reduced to selecting the class ($G_0$ or $G_1$) with the largest posterior probability $P(G_i \mid \tilde{y})$
– The likelihood ratio test (LRT) is defined as:
$$L(\tilde{y}) = \frac{p(\tilde{y} \mid G_1)}{p(\tilde{y} \mid G_0)} \;\underset{G_0}{\overset{G_1}{\gtrless}}\; \frac{P(G_0)}{P(G_1)}$$
– where the observation vector is classified as $G_1$ if the likelihood ratio $L(\tilde{y})$ is greater than the ratio between the a priori class probabilities $P(G_0)/P(G_1)$; otherwise it is classified as $G_0$
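The decision rule can be illustrated with a minimal sketch. The scalar Gaussian class densities, the function names `gaussian_pdf` and `lrt_decide`, and all numeric values are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def gaussian_pdf(y, mean, var):
    """Scalar Gaussian density; stands in for p(y | G_i) in this toy example."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def lrt_decide(y, prior_g0=0.5, prior_g1=0.5):
    """Return 1 (class G1) if L(y) > P(G0)/P(G1), otherwise 0 (class G0)."""
    # Assumed toy densities: G0 ~ N(0, 1) and G1 ~ N(0, 4).
    likelihood_ratio = gaussian_pdf(y, 0.0, 4.0) / gaussian_pdf(y, 0.0, 1.0)
    return 1 if likelihood_ratio > prior_g0 / prior_g1 else 0

print(lrt_decide(0.0), lrt_decide(3.0))  # prints: 0 1
```

With equal priors the threshold is 1, so a small-amplitude observation is assigned to the low-variance class $G_0$ and a large-amplitude one to $G_1$.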
Multiple Observation Likelihood Ratio Test
• To improve on an LRT that detects the presence of speech in a noisy signal based on a Gaussian model:
– The multiple observation LRT (MO-LRT) considers not just a single observation vector $\tilde{y}_t$ measured at frame $t$, but also an N-frame neighborhood $\{\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}\}$
– This test involves the evaluation of an N-th order LRT that incorporates contextual information into the decision rule and exhibits significant improvements in speech/pause discrimination over the original LRT
– The multiple observation LRT (MO-LRT) is defined as:
$$L(\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}) = \frac{p(\tilde{y}_{t-N}, \ldots, \tilde{y}_{t+N} \mid G_1)}{p(\tilde{y}_{t-N}, \ldots, \tilde{y}_{t+N} \mid G_0)}$$
IEEE SIGNAL PROCESSING 2005
Multiple Observation Likelihood Ratio Test
• This test involves the evaluation of an N-th order LRT that can be computed efficiently when the individual measurements $\tilde{y}_t$ are independent
• In this case:
$$L(\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}) = \prod_{k=t-N}^{t+N} \frac{p(\tilde{y}_k \mid G_1)}{p(\tilde{y}_k \mid G_0)}$$
• An equivalent log LRT can be defined by taking logarithms:
$$\ell(\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}) = \sum_{k=t-N}^{t+N} \ln \frac{p(\tilde{y}_k \mid G_1)}{p(\tilde{y}_k \mid G_0)}$$
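The equivalence between the product form and the log form can be checked numerically. The sketch below assumes the per-frame likelihood ratios are already available; the function names and the ratio values are hypothetical.

```python
import math

def mo_lrt(frame_ratios):
    """Product of per-frame likelihood ratios p(y_k|G1)/p(y_k|G0)."""
    result = 1.0
    for ratio in frame_ratios:
        result *= ratio
    return result

def mo_log_lrt(frame_ratios):
    """Equivalent log form: the product becomes a sum of logarithms."""
    return sum(math.log(ratio) for ratio in frame_ratios)

# Hypothetical ratios for frames t-1, t, t+1 (N = 1).
ratios = [0.5, 4.0, 2.0]
print(mo_lrt(ratios))                              # prints: 4.0
print(math.isclose(math.exp(mo_log_lrt(ratios)), mo_lrt(ratios)))  # prints: True
```

In practice the log form is usually preferred, since a product of many small ratios can underflow.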
Multiple Observation Likelihood Ratio Test
• The use of the MO-LRT for voice activity detection is mainly motivated by the improved robustness against environmental acoustic noise that multiple observation vectors provide
• The MO-LRT is defined over the observation vectors $\{\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}\}$
• The decision rule is defined by
$$\ell_{t,m} = \sum_{k=t-m}^{t+m} \ln \frac{p(\tilde{y}_k \mid G_1)}{p(\tilde{y}_k \mid G_0)} \;\underset{G_0}{\overset{G_1}{\gtrless}}\; \eta$$
– where $t$ denotes the frame being classified as speech ($G_1$) or nonspeech ($G_0$), $m$ is the window parameter (written $N$ above), and $\eta$ is the decision threshold, experimentally tuned for the best trade-off between speech and nonspeech classification
– $\ell_{t,m} > \eta$: frame $t$ is classified as speech
– $\ell_{t,m} \le \eta$: frame $t$ is classified as nonspeech
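A minimal sketch of this decision rule over a whole utterance follows. It assumes per-frame log-likelihood ratios are given, and it truncates the window at the utterance boundaries, which is an assumption of this sketch rather than something stated on the slides. The name `mo_lrt_vad` and the input values are hypothetical.

```python
def mo_lrt_vad(frame_llrs, m, eta):
    """Classify each frame as speech (1) or nonspeech (0).

    frame_llrs[k] is ln p(y_k|G1) - ln p(y_k|G0) for frame k.
    The statistic for frame t sums the values over frames t-m .. t+m.
    """
    decisions = []
    for t in range(len(frame_llrs)):
        # Window is clipped at the start and end of the utterance (assumption).
        window = frame_llrs[max(0, t - m): t + m + 1]
        decisions.append(1 if sum(window) > eta else 0)
    return decisions

llrs = [-2.0, -1.0, 3.0, 4.0, -1.0, -2.0]   # made-up per-frame values
print(mo_lrt_vad(llrs, m=1, eta=0.0))        # prints: [0, 0, 1, 1, 1, 0]
```

Frame 4 has a negative log-likelihood ratio on its own but is still labeled speech because of its neighbors, which illustrates the contextual smoothing the window provides.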
Multiple Observation Likelihood Ratio Test
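• In order to evaluate the proposed MO-LRT VAD on an incoming signal, an adequate statistical model is needed for the feature vectors in the presence and absence of speech
• The model selected assumes the discrete Fourier transform (DFT) coefficients of the clean speech and the noise to be asymptotically independent Gaussian random variables, with variances $\lambda_S(j)$ and $\lambda_N(j)$ in frequency bin $j$:
$$p(\tilde{y}_t \mid G_0) = \prod_{j=0}^{J-1} \frac{1}{\pi \lambda_N(j)} \exp\left\{ -\frac{|Y_j|^2}{\lambda_N(j)} \right\}$$
$$p(\tilde{y}_t \mid G_1) = \prod_{j=0}^{J-1} \frac{1}{\pi [\lambda_N(j) + \lambda_S(j)]} \exp\left\{ -\frac{|Y_j|^2}{\lambda_N(j) + \lambda_S(j)} \right\}$$
– where $Y_j$ is the $j$-th DFT coefficient of the noisy observation and $J$ is the number of coefficients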
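Under this model, the per-frame log-likelihood ratio that feeds the MO-LRT can be sketched as follows. The sketch assumes the complex Gaussian per-bin densities $\frac{1}{\pi\lambda}\exp(-|Y_j|^2/\lambda)$ with variance $\lambda_N(j)$ under $G_0$ and $\lambda_N(j) + \lambda_S(j)$ under $G_1$; the function name and the numeric inputs are hypothetical.

```python
import math

def frame_log_lr(power_spectrum, noise_var, speech_var):
    """ln p(y|G1) - ln p(y|G0) for one frame.

    power_spectrum[j] = |Y_j|^2, noise_var[j] = lambda_N(j),
    speech_var[j] = lambda_S(j), for bins j = 0 .. J-1.
    """
    total = 0.0
    for y2, lam_n, lam_s in zip(power_spectrum, noise_var, speech_var):
        log_p1 = -math.log(math.pi * (lam_n + lam_s)) - y2 / (lam_n + lam_s)
        log_p0 = -math.log(math.pi * lam_n) - y2 / lam_n
        total += log_p1 - log_p0
    return total

# One bin with |Y|^2 = 4, lambda_N = 1, lambda_S = 1 gives 2 - ln 2.
print(round(frame_log_lr([4.0], [1.0], [1.0]), 4))  # prints: 1.3069
```

A bin whose power is well above the noise variance pushes the ratio towards speech, while a bin with no energy contributes $-\ln(1 + \lambda_S/\lambda_N)$.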
Analysis Of The Proposed Algorithm
• For this experiment
– The AURORA subset of the Spanish SpeechDat-Car (SDC) database was used
– This database consists of recordings from distant and close-talking microphones in car environments under different driving conditions
– The most unfavorable scenario (i.e., distant microphone, high speed over a good road) with a 5 dB average SNR was considered
Analysis Of The Proposed Algorithm
• Fig. 1(a) shows the distributions of the LRT during presence and absence of speech for increasing values of m.
(a) Speech/nonspeech distributions of the multiple observation likelihood ratio for different numbers of observations m
Analysis Of The Proposed Algorithm
• In Fig. 1(b), the speech, nonspeech, and global detection errors are shown as a function of m
• Note that the speech detection error is clearly reduced when increasing the value of m
• The global error exhibits a minimum value for m = 6 frames
(b) Error probabilities as a function of m
Analysis Of The Proposed Algorithm
• Fig. 2 shows the nonspeech hit rate (HR0) versus the false alarm rate (FAR0 = 1 − HR1, where HR1 denotes the speech hit rate) for recordings from the distant microphone at an average SNR of about 5 dB
• It is clearly shown that increasing the number of observation vectors in the MO-LRT improves the performance of the proposed VAD
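The two quantities plotted in the ROC curve can be computed from frame labels as in the sketch below. It assumes reference labels and VAD decisions are given as 0 (nonspeech) / 1 (speech) sequences of equal length, each containing both classes; the function name and the data are made up for illustration.

```python
def roc_point(reference, decisions):
    """Return (HR0, FAR0) with FAR0 = 1 - HR1."""
    n0 = sum(1 for r in reference if r == 0)
    n1 = len(reference) - n0
    # HR0: fraction of nonspeech frames correctly labeled nonspeech.
    hr0 = sum(1 for r, d in zip(reference, decisions) if r == 0 and d == 0) / n0
    # HR1: fraction of speech frames correctly labeled speech.
    hr1 = sum(1 for r, d in zip(reference, decisions) if r == 1 and d == 1) / n1
    return hr0, 1.0 - hr1

reference = [0, 0, 0, 0, 1, 1, 1, 1]
decisions = [0, 0, 0, 1, 1, 1, 0, 1]
print(roc_point(reference, decisions))  # prints: (0.75, 0.25)
```

Sweeping the decision threshold and recording one such point per threshold value traces out a curve of the kind shown in the figure.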
Revised MO-LRT
• The MO-LRT thus evaluates the probability that all the observations in the N-frame neighborhood of the central frame are non-speech, or that all are speech
• For this reason, the method is revised to evaluate not just the two previous hypotheses $G_0$ and $G_1$
– In the MO-LRT, the decision is made in favor of one of the two hypotheses:
$$G_0: \hat{y}_l = \hat{n}_l, \qquad G_1: \hat{y}_l = \hat{s}_l + \hat{n}_l, \qquad \text{for } l = t-N, \ldots, t, \ldots, t+N$$
Revised MO-LRT
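• The multiple observation vector $\{\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}\}$ is reindexed as $\hat{Y} = \{\hat{y}_1, \ldots, \hat{y}_{N+1}, \ldots, \hat{y}_{2N+1}\}$ for convenience of presentation
• Each hypothesis $H_m$ can be defined in terms of a binary integer representation:
$$m = \sum_{k=1}^{2N+1} b_k \, 2^{k-1}$$
– where $b_k$ defines whether the observation $\hat{y}_k$ is non-speech (0) or speech (1):
$$b_k = 0: \hat{y}_k = \hat{n}_k, \qquad b_k = 1: \hat{y}_k = \hat{s}_k + \hat{n}_k, \qquad k = 1, 2, \ldots, 2N+1$$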
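The mapping between a hypothesis index $m$ and its bits can be sketched as below. The sketch assumes the convention $m = \sum_k b_k 2^{k-1}$, i.e. $b_1$ is the least significant bit; the function names are hypothetical.

```python
def hypothesis_index(bits):
    """bits[k-1] holds b_k (0 = non-speech, 1 = speech); returns m."""
    return sum(b << k for k, b in enumerate(bits))

def hypothesis_bits(m, n):
    """Inverse mapping: the 2N+1 bits b_1 .. b_{2N+1} of hypothesis H_m."""
    return [(m >> k) & 1 for k in range(2 * n + 1)]

print(hypothesis_index([1, 1, 0]))  # prints: 3
print(hypothesis_bits(6, 1))        # prints: [0, 1, 1]
```

For N = 1 there are $2^{3} = 8$ hypotheses, $H_0$ (all non-speech) through $H_7$ (all speech).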
Revised MO-LRT
• Thus, each hypothesis $H_m$ consists of 2N+1 individual hypotheses involving the 2N+1 observations
• The classification problem is then reformulated as selecting the class $G_i$ for the current frame, where the bit $b_{N+1}$ assigns speech ($G_1$) or non-speech ($G_0$)
• If the set of all possible hypotheses is split according to the value of the central frame bit $b_{N+1}$ as:
$$M_1 = \{ m : H_m \text{ has } b_{N+1} = 1 \}, \qquad M_0 = \{ m : H_m \text{ has } b_{N+1} = 0 \}$$
• the posterior probabilities are defined to be:
$$P(G_1 \mid \hat{Y}) = \sum_{m \in M_1} p(H_m \mid \hat{Y}), \qquad P(G_0 \mid \hat{Y}) = \sum_{m \in M_0} p(H_m \mid \hat{Y})$$
Revised MO-LRT
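• Using Bayes' rule:
$$P(G_1 \mid \hat{Y}) = \frac{1}{P(\hat{Y})} \sum_{m \in M_1} p(\hat{Y} \mid H_m) \, p(H_m)$$
$$P(G_0 \mid \hat{Y}) = \frac{1}{P(\hat{Y})} \sum_{m \in M_0} p(\hat{Y} \mid H_m) \, p(H_m)$$
• a revised LRT can be defined as:
$$\frac{P(G_1 \mid \hat{Y})}{P(G_0 \mid \hat{Y})} = \frac{\sum_{m \in M_1} p(\hat{Y} \mid H_m) \, p(H_m)}{\sum_{m \in M_0} p(\hat{Y} \mid H_m) \, p(H_m)}$$
• An approximation to the statistical test replaces each summation by the maximum value of the probability over the hypotheses in $M_1$ and $M_0$:
$$\Lambda^* = \frac{\max_{m \in M_1} p(\hat{Y} \mid H_m) \, p(H_m)}{\max_{m \in M_0} p(\hat{Y} \mid H_m) \, p(H_m)}$$
• By taking logarithms this test is expressed in a more compact form:
$$\log \Lambda^* = \max_{m \in M_1} \ell_m - \max_{m \in M_0} \ell_m, \qquad \ell_m = \log \left[ p(\hat{Y} \mid H_m) \, p(H_m) \right]$$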
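The exact (summation) test and its max approximation can be compared by brute force for small N. The sketch below enumerates every hypothesis and assumes, beyond what the slides state, that the observations are independent given the hypothesis and that all hypotheses have equal priors, so $p(H_m)$ cancels. The function name and inputs are hypothetical.

```python
import itertools
import math

def revised_log_lrt(log_p1, log_p0, use_max=True):
    """Log ratio for the central frame over all 2^(2N+1) hypotheses.

    log_p1[k] = log p(y_k | speech), log_p0[k] = log p(y_k | non-speech).
    """
    center = len(log_p1) // 2
    scores = {0: [], 1: []}
    for bits in itertools.product((0, 1), repeat=len(log_p1)):
        # l_m: log-likelihood of the observations under bit pattern `bits`.
        l_m = sum(log_p1[k] if b else log_p0[k] for k, b in enumerate(bits))
        scores[bits[center]].append(l_m)

    def log_sum(values):
        return math.log(sum(math.exp(v) for v in values))

    if use_max:
        return max(scores[1]) - max(scores[0])
    return log_sum(scores[1]) - log_sum(scores[0])

log_p1 = [-1.0, -0.5, -3.0]   # made-up values, N = 1
log_p0 = [-2.0, -1.5, -0.5]
print(revised_log_lrt(log_p1, log_p0, use_max=True))  # prints: 1.0
```

Under these two assumptions, and with no restriction on the hypotheses, both forms reduce to the central frame's own log-likelihood ratio (here $-0.5 - (-1.5) = 1.0$), because the contributions of the other frames cancel between numerator and denominator. Context only influences the decision once the hypothesis set is restricted or the priors are unequal.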
Revised MO-LRT
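• Restrict the possible hypotheses
– Only hypotheses with at most one speech-to-non-speech or non-speech-to-speech transition in the N-frame neighborhood are considered
– The log-likelihoods of the retained hypotheses are computed as
$$L = K B_1 + (I - K) B_0$$
– where K is the Hankel matrix and:
$$L = [\ell_1, \ell_2, \ldots]^T \quad \text{(one entry per row of } K\text{)}$$
$$B_1 = [\log p(\hat{y}_1 \mid 1), \ldots, \log p(\hat{y}_{2N+1} \mid 1)]^T$$
$$B_0 = [\log p(\hat{y}_1 \mid 0), \ldots, \log p(\hat{y}_{2N+1} \mid 0)]^T$$
– Since K is not square, $I$ is read here as the all-ones matrix of the same size as K, so that $I - K$ is the elementwise complement of K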
Revised MO-LRT
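• The matrix K can be split into two submatrices $K_0$ and $K_1$ (the rows of K with central bit $b_{N+1} = 0$ and $b_{N+1} = 1$, respectively), and the test then reduces to:
$$\log \Lambda = \max L_1 - \max L_0$$
– where
$$L_1 = K_1 B_1 + (I - K_1) B_0$$
$$L_0 = K_0 B_1 + (I - K_0) B_0$$
– Note that $I - K_1 = K_0$, so the equations reduce to:
$$L_1 = K_1 B_1 + (I - K_1) B_0$$
$$L_0 = (I - K_1) B_1 + K_1 B_0$$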
Revised MO-LRT
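• As an example, for N = 1, the matrices K, $K_0$ and $K_1$ are defined to be:
$$K = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad K_0 = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad K_1 = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$
• and $L_1$ and $L_0$ are computed by:
$$L_1 = K_1 B_1 + (I - K_1) B_0 = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \log p(\hat{y}_1 \mid 1) \\ \log p(\hat{y}_2 \mid 1) \\ \log p(\hat{y}_3 \mid 1) \end{bmatrix} + \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \log p(\hat{y}_1 \mid 0) \\ \log p(\hat{y}_2 \mid 0) \\ \log p(\hat{y}_3 \mid 0) \end{bmatrix}$$
$$L_0 = (I - K_1) B_1 + K_1 B_0 = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \log p(\hat{y}_1 \mid 1) \\ \log p(\hat{y}_2 \mid 1) \\ \log p(\hat{y}_3 \mid 1) \end{bmatrix} + \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \log p(\hat{y}_1 \mid 0) \\ \log p(\hat{y}_2 \mid 0) \\ \log p(\hat{y}_3 \mid 0) \end{bmatrix}$$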
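The N = 1 computation can be sketched in plain Python as follows. For N = 1 the helper `build_k` reproduces the six rows 001, 011, 111, 110, 100, 000 of K; its extension to larger N (all bit patterns with at most one transition) is an assumption of this sketch. Function names and input values are hypothetical.

```python
def build_k(n):
    """Hankel matrix whose rows are the retained bit patterns."""
    width = 2 * n + 1
    # A Hankel matrix has constant anti-diagonals, so every row is a
    # shifted slice of a single generating sequence.
    seq = [0] * (width - 1) + [1] * width + [0] * width
    return [seq[i:i + width] for i in range(2 * width)]

def revised_mo_lrt(log_p1, log_p0):
    """Return log Lambda = max(L1) - max(L0) for the central frame."""
    n = len(log_p1) // 2
    l1, l0 = [], []
    for row in build_k(n):
        # Row entry 1 selects log p(y_k|1), entry 0 selects log p(y_k|0),
        # which is what K*B1 + (I - K)*B0 computes row by row.
        score = sum(log_p1[k] if b else log_p0[k] for k, b in enumerate(row))
        (l1 if row[n] == 1 else l0).append(score)
    return max(l1) - max(l0)

log_p1 = [-3.0, -1.0, -3.0]   # made-up values: only the central frame
log_p0 = [-0.5, -1.5, -0.5]   # weakly favors speech
print(revised_mo_lrt(log_p1, log_p0))  # prints: -2.0
```

In this example the central frame alone favors speech by 0.5, but the isolated pattern 010 is not among the rows of K, so the neighbors pull the statistic to -2.0 and the frame would be labeled non-speech for any threshold above that value.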
Revised MO-LRT
• Finally
– The algorithm for voice activity detection is based on a comparison of the log-likelihood ratio to a given threshold $\eta$:
$$\log \Lambda = \max L_1 - \max L_0 \;\underset{G_0}{\overset{G_1}{\gtrless}}\; \eta$$
Experimental Results
• The AURORA subset of the Spanish SpeechDat-Car (SDC) database was used
– Fig. 1 shows an utterance of the database in clean conditions (25 dB SNR)
– Fig. 2 shows it under the noisiest conditions (5 dB SNR)
Experimental Results
• ROC curves in quiet conditions (stopped car, engine running) with the close-talking microphone
Experimental Results
• ROC curves in high-noise conditions (high speed over a good road) with the distant microphone
Conclusion
• The new approach evaluates not only the two hypotheses in which all the observations are speech or all are non-speech, but all the possible hypotheses defined over the individual observations
• A Hankel matrix was introduced into the revised statistical test, which smooths the decision variable and reduces its variance
Revised MO-LRT
• The algorithm is adaptive and suitable for non-stationary noise environments, since the noise statistics are updated whenever a frame is classified as non-speech. The variance of the noise is then updated as:
$$\lambda_N(j) = \alpha \, \lambda_N(j) + (1 - \alpha) \, |Y_j|^2$$
– where $\alpha$ is a smoothing factor
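A sketch of this adaptation step follows. It assumes the update is a first-order recursive average, $\lambda_N(j) \leftarrow \alpha \lambda_N(j) + (1 - \alpha)|Y_j|^2$, applied only on frames classified as non-speech. The function name, the symbol `alpha` and its default value are assumptions of this sketch, not values from the paper.

```python
def update_noise_variance(noise_var, power_spectrum, is_speech, alpha=0.95):
    """Return the per-bin noise variance after processing one frame.

    noise_var[j] = lambda_N(j), power_spectrum[j] = |Y_j|^2.
    The estimate is frozen while speech is detected.
    """
    if is_speech:
        return list(noise_var)
    return [alpha * lam + (1.0 - alpha) * y2
            for lam, y2 in zip(noise_var, power_spectrum)]

# With alpha = 0.5 the new estimate is the midpoint of old and observed values.
print(update_noise_variance([1.0, 2.0], [3.0, 2.0], is_speech=False, alpha=0.5))
# prints: [2.0, 2.0]
```

A value of `alpha` close to 1 gives a slowly varying estimate, while a smaller value tracks changes in the noise faster at the cost of a noisier estimate.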