Offline Voice Activity Detector Using Speech Supergaussianity · 2018. 1. 4.


Offline Voice Activity Detector Using Speech Supergaussianity

Ivan J. Tashev
Microsoft Research Labs
One Microsoft Way, Redmond, WA 98051, USA
Email: [email protected]

Abstract—Voice Activity Detectors (VAD) play an important role in audio processing algorithms. Most of these algorithms are designed to be causal, i.e. to work in real time using only current and past audio samples. Off-line processing, when we have access to the entire voice utterance, allows different types of approaches for increased precision. In this paper we propose an algorithm for off-line VAD based on the different probability density functions (PDFs) of the speech and noise. While a Gaussian distribution is a very good model for noise, the speech PDF is peakier. The proposed VAD algorithm works in the frequency domain and estimates the speech presence probability for each frequency bin in each audio frame and for each frame as a whole, and also provides a binary decision per bin and per frame. It provides improved precision compared to streaming real-time VAD algorithms.

I. INTRODUCTION

Voice Activity Detectors (VAD) are algorithms for detecting the presence of a speech signal in a mixture of speech and noise. They are part of noise suppressors, double talk detectors, codecs, and automatic gain control blocks, to mention a few. The VAD output can vary from a simple binary decision (yes/no), through a soft decision (probability of speech presence in the current audio frame), to a probability of speech presence in each frequency bin of each audio frame. The commonly used VAD algorithms are based on the assumption of quasi-stationary noise, i.e. noise that changes much more slowly than the speech signal. A classic VAD algorithm works in real time and makes its decisions based on the current and previous frames, i.e. it is causal. Most of these algorithms work in the frequency domain for better integration in the audio processing chain and provide an estimate for each frequency bin separately. One algorithm frequently used as a baseline is standardized as ITU-T Recommendation G.729 Annex B [1]. An improved and generalized VAD is described in [2], where the authors create a soft-decision VAD assuming Gaussian distributions of the noise and speech signals. A simple HMM is added in [3] to create a hangover scheme and to finalize the decision utilizing the timing of state switching. This algorithm can be optimized for better performance as described in [4].

Most VAD algorithms assume Gaussian distributions of the noise and speech signals. It is well known that, while the distribution of noise amplitudes in the time domain is well modelled with a Gaussian distribution, the distribution of the amplitudes of the speech signal has higher kurtosis than the Gaussian distribution. Gazor and Zhang [5] published a study of the speech signal distribution in the time domain; later, in [6], this study was extended with models of the Probability Density Functions (PDFs) of the speech signal magnitudes in the frequency domain. Several attempts have been published in the literature to utilize the non-Gaussianity of the speech signal for better noise suppression rules [7], [8], or for a better VAD [9]. In most of these cases it is very difficult to find an analytical form of the suppression rules, or of the speech presence probability, and the proposed solutions are either approximate or computationally expensive.

In many cases we do the processing off-line and have access to the entire speech utterance. These scenarios include speech enhancement of recorded utterances sent for speech recognition in the cloud, post-processing of the audio track in video shots, and off-line encoding of audio signals. This type of processing allows building more precise statistical models and is more tolerant of computationally expensive algorithms. The question is how much prior knowledge of the PDFs of the speech and noise signals can help for better parameter estimation, which leads to better VAD algorithms.

In this paper we propose an algorithm for off-line VAD based on the statistical processing of the entire utterance. In section II we formulate the problem and present the statistical models in a VAD. Sections III, IV, and V describe the application of the statistical modeling of the noise and speech PDFs for the estimation of the speech and noise signal parameters for normalization and VAD. In section VI we describe the experimental results, and we conclude in section VII.

II. PROBLEM DEFINITION

A. Modelling

We have a limited discrete signal in the time domain x(lT), where l ∈ [0, L−1], x_min ≤ x(lT) ≤ x_max, ∀l, and T is the sampling period. An example of such a signal is shown in figure 1. This signal is a mixture of two limited, discrete, and uncorrelated signals x(lT) = n(lT) + s(lT), noise n(lT) and speech s(lT) respectively. After framing, windowing, and converting to the frequency domain we have X_b^(n) = N_b^(n) + S_b^(n), where b ∈ [0, B−1] is the frequency bin, B is the number of frequency bins, n ∈ [0, N−1] is the frame number, and N is the number of audio frames. As we do the processing off-line, the same can be written in matrix form X = N + S, where all are B×N complex matrices representing the spectra


[Figure: signal amplitude vs. time in seconds, with the per-frame VAD decision overlaid.]
Fig. 1. Noisy signal in time domain with SNR=10 dB.

[Figure: spectrogram, frequency (0–7000 Hz) vs. time in seconds, magnitude in dB.]
Fig. 2. Noisy signal in frequency domain with SNR=10 dB.

of the signal, the noise, and the speech components. This representation is visualized in figure 2, where the magnitudes are in a decibel scale.

In each frame and/or frequency bin we consider two hypotheses:

H₀: speech is absent, X = N
H₁: speech is present, X = N + S

The goal of the VAD algorithm is to produce, for each frequency bin and for each frame (column in the matrices above), the probability of speech signal presence, P_b^(n)(H₁) and P^(n)(H₁) respectively. An example of the expected VAD decision per frame is shown in figure 1.

B. Voice Activity Detectors

Let us assume that the noise and speech signals are fully characterized by their respective variances σ²_n and σ²_s, and that we have prior knowledge of the PDFs of these two signals, p_n(a | σ²_n) and p_s(a | σ²_s). The PDF of a mix of two uncorrelated signals is the convolution of the PDFs of the two signals:

p_x(a | σ²_n, σ²_s) = p_n(a | σ²_n) ∗ p_s(a | σ²_s).   (1)

Note that this equation has an analytical solution only for a small number of distribution pairs; it has to be solved numerically in most cases.
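In practice equation 1 is evaluated numerically on a sampled amplitude grid. Below is a minimal sketch of that numerical convolution; the Gaussian and Laplace densities are illustrative stand-ins for the noise and speech models, and the uniform grid step `da` is an assumption of this sketch:

```python
import numpy as np

def mixture_pdf(p_n, p_s, da):
    """Numerically convolve two sampled PDFs (equation 1).

    p_n, p_s: PDFs sampled on the same uniform grid with step da.
    Returns the mixture PDF, renormalized to integrate to 1.
    """
    p_x = np.convolve(p_n, p_s, mode="same") * da  # discrete approximation of the integral
    return p_x / (p_x.sum() * da)                  # compensate grid truncation

# Illustration: Gaussian "noise" PDF convolved with a peakier Laplace "speech" PDF
a = np.arange(-1.0, 1.0, 0.001)   # amplitude grid
da = a[1] - a[0]
p_noise = np.exp(-a**2 / (2 * 0.01)) / np.sqrt(2 * np.pi * 0.01)
p_speech = np.exp(-np.abs(a) / 0.05) / (2 * 0.05)
p_mix = mixture_pdf(p_noise, p_speech, da)
```

The variance of `p_mix` approaches the sum of the component variances, as expected for uncorrelated signals.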

The probability P(H₁ | a) that a signal with amplitude a contains speech is derived by directly applying Bayes' rule:

P(H₁ | a) = p(a | H₁) P(H₁) / [p(a | H₁) P(H₁) + p(a | H₀) P(H₀)].   (2)

Here P(H₁) and P(H₀) = 1 − P(H₁) are the prior probabilities of speech and noise presence. After dividing by p(a | H₀) P(H₀) we have:

P(H₁ | a) = εΛ / (1 + εΛ),   (3)

where ε = P(H₁)/P(H₀), and Λ is the likelihood of speech signal presence:

Λ = p_x(a | σ²_n, σ²_s) / p_n(a | σ²_n).   (4)

The proportion ε of the prior probabilities of speech and noise can be assumed constant and known. Then, if we can estimate the noise and speech variances, we can estimate the speech presence probability in each frame and/or frequency bin.

The binary flag V^(n) for speech presence (1) or absence (0) can be obtained by comparing the likelihood Λ or the speech presence probability P^(n) with a fixed threshold η:

V^(n) = 1 if P^(n)(H₁) > η, 0 if P^(n)(H₁) ≤ η.   (5)

For practical purposes a small hysteresis is added to prevent "ringing" of the flag when the probability is close to the threshold.
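A sketch of the binary decision of equation 5 with the small hysteresis mentioned above; the hysteresis half-width `h` is a hypothetical parameter chosen here for illustration:

```python
import numpy as np

def binary_vad(p, eta, h=0.05):
    """Threshold per-frame speech presence probabilities (equation 5).

    The flag switches to 1 only when p > eta + h and back to 0 only
    when p < eta - h, preventing "ringing" near the threshold eta.
    """
    v = np.zeros(len(p), dtype=int)
    state = 0
    for n, pn in enumerate(p):
        if state == 0 and pn > eta + h:
            state = 1
        elif state == 1 and pn < eta - h:
            state = 0
        v[n] = state
    return v
```

With eta = 0.9, the sequence [0.1, 0.96, 0.92, 0.88, 0.1] yields the flags [0, 1, 1, 1, 0]: the dip to 0.88 stays inside the hysteresis band and does not toggle the flag.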

C. Parameters Estimation

For a given set of M equally spaced amplitudes a_m = x_min + mΔ, where m ∈ [0, M−1], x_min < a < x_max, and Δ = (x_max − x_min)/(M−1), we can always build the histogram h_M(a_m) of the input signal and compute the probability density function:

p(a_m) = h_M(a_m) / (Δ Σ_m h_M(a_m)).   (6)

To make the PDF p(a_m) independent of the pauses between the clean speech phrases, we remove the bin of the histogram which contains the zero magnitude.
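The estimate of equation 6, including the removal of the zero-magnitude bin, can be sketched as follows (bins are centered on the grid points a_m, which is an assumption about the intended binning):

```python
import numpy as np

def histogram_pdf(x, x_min, x_max, M=100):
    """PDF estimate of equation 6 over M equally spaced amplitudes a_m.

    The histogram bin containing the zero magnitude is removed, so
    pauses between clean speech phrases do not bias the estimate.
    """
    delta = (x_max - x_min) / (M - 1)
    a_m = x_min + np.arange(M) * delta
    edges = np.concatenate([a_m - delta / 2, [a_m[-1] + delta / 2]])
    h, _ = np.histogram(x, bins=edges)
    h = h.astype(float)
    h[np.argmin(np.abs(a_m))] = 0.0        # drop the zero-magnitude bin
    return a_m, h / (delta * h.sum())
```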

The distance between two probability distributions is given by the Jensen-Shannon divergence [10], which is a symmetrized and smoothed version of the Kullback-Leibler divergence [11] D_KL(p‖q):

D_JS(p‖q) = ½ (D_KL(p‖m) + D_KL(q‖m)),   (7)


where m = (p + q)/2. The Kullback-Leibler divergence is defined as:

D_KL(p‖q) = Σ_i p(x_i) log [p(x_i) / q(x_i)]   (8)

and measures the expected number of extra bits required to code samples from p when using a code based on q, rather than a code based on p. A lower Jensen-Shannon divergence D_JS indicates a better fit of the model to the measured histogram.
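Equations 7 and 8 translate directly into code. A small constant guards the logarithm against empty histogram bins, in the spirit of the paper's note about adding small numbers to denominators:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence of equation 8 (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    """Jensen-Shannon divergence of equation 7: symmetrized KL
    against the midpoint distribution m = (p + q) / 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = (p + q) / 2.0
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))
```

For identical distributions D_JS is 0; for disjoint ones it reaches its maximum of log 2 (in nats).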

We can estimate the noise and speech variances by minimizing the divergence between the measured PDF h_M(a), derived from the histogram of amplitudes, and the model of the mixed signal PDF p_x(a | σ²_n, σ²_s):

[σ̂²_n, σ̂²_s] = argmin over σ²_n, σ²_s of D_JS(h_M(a) ‖ p_x(a | σ²_n, σ²_s)).   (9)

Considering the fact that it is unlikely to have an analytic solution for the PDF of the mixed signal (see II-B), we will have to solve the minimization problem in equation 9 numerically.
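The numerical minimization of equation 9 can be sketched with a derivative-free search. For brevity this toy version fits only a single Gaussian noise variance to a measured PDF; the full two-parameter fit over σ²_n and σ²_s works the same way with the mixture PDF of equation 1 as the model:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def js_div(p, q, eps=1e-12):
    """Jensen-Shannon divergence (equation 7) on probability masses."""
    m = (p + q) / 2.0
    kl = lambda u, v: float(np.sum(u * np.log((u + eps) / (v + eps))))
    return 0.5 * (kl(p, m) + kl(q, m))

def fit_noise_variance(a, p_meas, da):
    """1-D version of equation 9: fit a zero-mean Gaussian model to a
    measured PDF by minimizing the Jensen-Shannon divergence."""
    def objective(log_var):
        var = np.exp(log_var)                 # log-variance keeps var > 0
        model = np.exp(-a**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        return js_div(p_meas * da, model * da)
    res = minimize_scalar(objective, bounds=(-10.0, 2.0), method="bounded")
    return float(np.exp(res.x))

# Synthetic check: the "measured" PDF comes from a Gaussian with variance 0.04
a = np.arange(-1.0, 1.0, 0.01)
p_meas = np.exp(-a**2 / (2 * 0.04)) / np.sqrt(2 * np.pi * 0.04)
var_hat = fit_noise_variance(a, p_meas, 0.01)
```

Optimizing the log-variance is one simple way to keep the search inside the positive domain, in lieu of the penalty functions mentioned later in section VI.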

Once we have estimates of the noise and speech variances, we can compute the likelihood using equation 4 and then compute the speech presence probability for each audio frame and/or frequency bin using equation 3. We apply this technique several times further in this paper.

III. SPEECH SIGNAL NORMALIZATION

Before converting to the frequency domain and proceeding with the VAD algorithm, we found it useful to normalize the speech signal. In many cases the recorded signal has a low level and contains clicks or bursts of wind noise. The last two are the reason why a simple normalization of the peak amplitude will not work: typically these are short peaks with high amplitude. It is acceptable to clip them and to normalize the amplitude of the actual speech signal.

We can build the histogram of the signal in the time domain using the procedure described in II-C and compute the PDF of the mixed signal p_xt(a_m) according to equation 6. The PDF of the noise signal is modeled as a zero-mean Gaussian distribution:

p_nt(a | σ²_nt) = [1 / (σ_nt √(2π))] exp(−a² / (2σ²_nt)),   (10)

where σ²_nt is the noise variance.

The PDF of the speech signal is modelled with a zero-mean Generalized Gaussian Distribution [12] as:

p_st(a | σ²_st) = [β_t / (2 α_t Γ(1/β_t))] exp(−(|a|/α_t)^β_t),   (11)

where Γ(·) is the gamma function, α_t is the scale, and β_t is the shape parameter. When the shape parameter β_t = 2 we have the Gaussian distribution, when β_t = 1 we have the Laplace distribution, and the clean speech distribution has even higher

[Figure: PDF vs. magnitude; curves for the measured signal, the speech model, the noise model, and the fitted mixture.]
Fig. 3. Distribution fit for signal in time domain with SNR=10 dB.

kurtosis. We assume that the shape parameter is constant for the clean speech signal and known in advance. The only parameter of the clean speech signal PDF which needs to be estimated is the scale, related to the speech signal variance σ²_st as:

α_t = √(σ²_st Γ(1/β_t) / Γ(3/β_t)).   (12)

We can use equation 9 to estimate the noise and speech variances σ²_nt and σ²_st respectively. Figure 3 shows an example of the distributions of a signal in the time domain with SNR=10 dB. This is an enlarged view of the central part of the distribution. Note the missing histogram bin with zero values and the heavier tail of the distribution of the speech signal.
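A sketch of the speech model of equations 11 and 12, parameterized directly by the variance (using `scipy.special.gamma` for Γ(·)):

```python
import numpy as np
from scipy.special import gamma

def ggd_pdf(a, var_s, beta):
    """Generalized Gaussian PDF of equation 11, with the scale alpha
    tied to the speech variance var_s through equation 12."""
    alpha = np.sqrt(var_s * gamma(1.0 / beta) / gamma(3.0 / beta))
    return beta / (2.0 * alpha * gamma(1.0 / beta)) * np.exp(-(np.abs(a) / alpha) ** beta)
```

For beta = 2 this reduces exactly to the zero-mean Gaussian of equation 10, which is a quick sanity check on equation 12.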

The input signal in the time domain can be normalized by applying a gain G_s = σ_st0/σ_st, aiming at a desired deviation σ_st0 of the speech component. To handle corner cases, such as only noise or too small an amount of speech signal, this gain is limited as

G = min[G_s, G_p],   (13)

where G_p is the gain which will cause 90% of the samples to be clipped. It can be obtained from the cumulative distribution function, easily computed from the estimated PDF, or directly from the histogram.
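A sketch of the limited gain of equation 13. The paper derives G_p from the cumulative distribution; here it is approximated with an empirical amplitude quantile, so `clip_quantile` and the full-scale value of 1.0 are assumptions of this sketch rather than details fixed by the text:

```python
import numpy as np

def normalization_gain(x, sigma_st, sigma_st0, clip_quantile=0.9):
    """Limited normalization gain of equation 13.

    Gs brings the estimated speech deviation sigma_st to the target
    sigma_st0; Gp caps the gain so that amplification stays bounded
    in noise-only or speech-poor corner cases.
    """
    g_s = sigma_st0 / sigma_st
    q = float(np.quantile(np.abs(x), clip_quantile))  # amplitude at the chosen quantile
    g_p = 1.0 / max(q, 1e-12)                         # gain driving it to full scale
    return min(g_s, g_p)
```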

IV. PER-FRAME VAD ALGORITHM

After normalization and conversion to the frequency domain we can compute the RMS of the signal frames as:

A^(n) = √[ (1/N) Σ from b=B_beg to B_end of (X_b^(n))² ].   (14)

Here we implicitly applied a band-pass filter by using only the frequency bins from B_beg to B_end.
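A sketch of equation 14; since X is complex, magnitudes are used, and the normalization here averages over the selected bins (both are reasonable readings of the equation rather than details fixed by the text):

```python
import numpy as np

def frame_rms(X, b_beg, b_end):
    """Band-limited per-frame RMS of equation 14.

    X: complex B x N spectrogram. Using only bins b_beg..b_end
    implicitly applies a band-pass filter.
    """
    band_power = np.abs(X[b_beg:b_end + 1, :]) ** 2
    return np.sqrt(band_power.mean(axis=0))
```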

We model the PDF of the magnitudes of noise-only frames with a Weibull distribution [13]:


[Figure: Weibull PDF vs. magnitude for shape parameters k = 0.3, 0.5, 1.0, 2, 4, and 5.]
Fig. 4. Weibull distribution.

p_nf(A | σ²_nf) = (k_nf/λ_nf) (A/λ_nf)^(k_nf−1) exp(−(A/λ_nf)^k_nf).   (15)

Here λ_nf is the scale parameter and k_nf is the shape parameter. This distribution is defined for A ≥ 0 and is shown in figure 4 for various shape parameters. When k_nf = 2 the Weibull distribution becomes the Rayleigh distribution, which is the PDF of the magnitude of a complex number with a zero-mean Gaussian distribution of the real and imaginary parts, i.e. noise. We can state that this is the distribution of the magnitudes of frames with one sample. As the shape parameter increases above 2, the Weibull distribution becomes a narrower and narrower bell shape while keeping the same mean. This models well the reduction of the variation of the frame RMS as the number of samples in the frame increases. Here we assume a known shape parameter k_nf > 2, which is frame-size dependent.

The only parameter we have to estimate is the scale parameter λ_nf, which is related to the noise variance as:

λ_nf = σ_nf / Γ(1 + 1/k_nf).   (16)

In this case σ_nf is the deviation of the noise, i.e. the mean of the PDF.
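A sketch of the Weibull model of equation 15 with its scale tied to the mean magnitude through equation 16 (again using `scipy.special.gamma`):

```python
import numpy as np
from scipy.special import gamma

def weibull_pdf(A, sigma, k):
    """Weibull PDF of equation 15; the scale lambda follows from the
    mean magnitude sigma via equation 16: lambda = sigma / Gamma(1 + 1/k)."""
    lam = sigma / gamma(1.0 + 1.0 / k)
    A = np.asarray(A, dtype=float)
    return (k / lam) * (A / lam) ** (k - 1) * np.exp(-(A / lam) ** k)
```

Since the Weibull mean is λ·Γ(1 + 1/k), the distribution constructed this way has mean sigma by design, for any shape k.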

We model the PDF of the magnitudes of the audio frames with only speech, p_sf(A | σ²_sf), with the same Weibull distribution (figure 4), but with a different shape parameter k_sf. When the shape parameter k_sf = 1 the Weibull distribution becomes the exponential distribution, and the actual clean speech PDF has even higher kurtosis. We assume a known shape parameter k_sf < 1 of the PDF of the speech audio frames, and then the only parameter we have to estimate is the scale parameter λ_sf, which is a function of the speech signal variance σ²_sf as shown in equation 16.

Using the approach described in section II-C we can build the histogram and estimate the noise-only and speech-only

[Figure: PDF vs. magnitude; curves for the measured signal, the speech model, the noise model, and the fitted mixture.]
Fig. 5. Distribution fit of the frame magnitudes for signal in frequency domain with SNR=10 dB.

variances σ_nf and σ_sf by minimizing the Jensen-Shannon divergence. Figure 5 illustrates the result of this distribution fitting for a signal with SNR=10 dB converted to the frequency domain.

With the obtained noise-only and speech-only variances σ_nf and σ_sf we can estimate the per-frame speech presence probability using the process described in section II-B.

V. PER-BIN VAD ALGORITHM

The approach for the per-bin estimation of the speech presence probability is the same. Here we will omit the bin index wherever possible. The noise PDF is given by the Rayleigh distribution [14]:

p_nb(A_b | σ²_nb) = (A_b/σ²_nb) exp(−A_b² / (2σ²_nb)),  A_b ≥ 0.   (17)

Here σ²_nb is the noise variance for frequency bin b.

The clean speech signal distribution p_sb(A | σ²_sb) is also modelled with a Weibull distribution according to equation 15. λ_sb is the scale parameter and k_sb is the shape parameter, which is assumed known and constant across all frequency bins [6], with a light dependency on the frame size. The scale parameter is related to the speech variance per equation 16. Here we also have to omit the first magnitude bin, for the same reasons as in section IV. An example of fitting the PDF for the frequency bin corresponding to 1000 Hz is shown in figure 6. The estimated speech presence probability per frequency bin is shown in figure 7 for the same signal with SNR=10 dB.

For relatively high SNRs the complex and computationally expensive minimizations can be avoided by direct estimation of the noise and clean speech variances for each bin separately. Given the per-frame speech signal presence probability P^(n)(H₁), we can estimate the noise and speech-plus-noise variances as a weighted average:


[Figure: PDF vs. magnitude; curves for the measured signal, the speech model, the noise model, and the fitted mixture.]
Fig. 6. Distribution fit of the magnitudes in 1000 Hz frequency bin for signal in frequency domain with SNR=10 dB.

[Figure: frequency (0–7000 Hz) vs. time in seconds; color scale is the speech presence probability from 0 to 1.]
Fig. 7. Estimated speech presence probability per frequency bin, SNR=10 dB.

σ̂²_nb = [Σ_n (X_b^(n))² (1 − P^(n)(H₁))] / [Σ_n (1 − P^(n)(H₁))],   (18)

σ̂²_{s+n,b} = [Σ_n (X_b^(n))² P^(n)(H₁)] / [Σ_n P^(n)(H₁)].   (19)

Then we can assume σ̂²_sb ≈ σ̂²_{s+n,b} and continue the estimation as above.
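The direct estimates of equations 18 and 19 are probability-weighted averages of the per-bin powers; a sketch, using magnitude squared for (X_b^(n))² since X is complex:

```python
import numpy as np

def per_bin_variances(X, p_frame, eps=1e-12):
    """Per-bin noise and speech-plus-noise variances of equations 18
    and 19, weighted by the per-frame speech presence probabilities.

    X: complex B x N spectrogram; p_frame: length-N array of P^(n)(H1).
    """
    power = np.abs(X) ** 2
    w_s = np.asarray(p_frame, dtype=float)       # speech weights, P^(n)(H1)
    w_n = 1.0 - w_s                              # noise weights
    var_n = power @ w_n / (w_n.sum() + eps)      # equation 18
    var_sn = power @ w_s / (w_s.sum() + eps)     # equation 19
    return var_n, var_sn
```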

VI. EXPERIMENTAL RESULTS

A. Data set

For the evaluation of the proposed algorithm we created a clean speech file containing 10 utterances randomly taken from 10 different speakers in the TIMIT [15] database and mixed them with vehicle noise, recorded in a moving car, in proportions to create a set of files with SNRs of 0, 10, 20, 30, 40, and 50 dB. Two sets of files were created for the evaluation of the proposed algorithm: one for training and one for evaluation. Different speakers and noise segments were selected for the two sets. Both sets also include the clean speech files and the noise-only files. All files have a sampling rate of 16 kHz.

B. Evaluation parameters

As evaluation parameters we selected the error rate per frame and per bin. The error rate per frame is estimated as:

E_f = √[ (1/N) Σ_n (P^(n)(H₁) − G^(n)(H₁))² ],   (20)

where G^(n)(H₁) is the ground truth, a binary mask obtained by comparing the frame RMS of the clean speech signal with a fixed, very low threshold. The ground truth per bin G_b^(n)(H₁) and the error rate per bin E_b are obtained and estimated in the same way. An identical approach was used to compute the error rates for the binary decisions.
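Equation 20 is the RMS difference between the soft decisions and the binary ground truth; a direct transcription:

```python
import numpy as np

def frame_error_rate(p, g):
    """Per-frame error rate of equation 20: RMS difference between the
    estimated speech presence probabilities p and the ground truth mask g."""
    p = np.asarray(p, dtype=float)
    g = np.asarray(g, dtype=float)
    return float(np.sqrt(np.mean((p - g) ** 2)))
```

The same function applied to the per-bin probability matrix and the per-bin ground truth yields E_b.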

C. Implementation and parameters

The values of the distribution parameters assumed constant were obtained using the noise-only and clean-speech-only files from the training set. The prior probability proportions and optimal thresholds were obtained by minimizing the error rates against the training set. The values of the constant parameters are shown in Table I.

The implementation of the off-line VAD was done in Matlab. The histograms were estimated in M = 100 points. The frame size for conversion to the frequency domain was 512 samples with 50% overlap and a Hann window. For the numerical estimation of the convolution in equation 1, a grid ten times denser than the one used for the histogram was used. For solving the optimization problem in equation 9, the Matlab function for unconstrained optimization fminunc() was used; the optimization process was constrained by setting minimal and maximal values for each optimization parameter and applying quadratic penalty functions added to the optimization criterion. Proper measures were taken to prevent overflows and divisions by zero by adding small numbers to the denominators.

D. Results

All of the results presented in this section were obtained using the evaluation data set. Table II shows the results for three VAD structures. The first is the per-frame VAD, as described in section IV, for the soft and binary decisions. The second is the per-bin VAD from section V with direct estimation of the variances of the noise and speech signals according to equations 18 and 19. The third is the per-bin VAD from section V with estimation of the noise and speech variances by fitting the distributions.

VII. CONCLUSIONS

In this paper we presented an approach for off-line VAD which utilizes prior knowledge of the PDFs of the speech and noise signals, the components of the mixture.


TABLE I
VALUES OF THE CONSTANT PARAMETERS

β_t      k_nf     k_sf     ε_f      η_f    k_sb    ε_b      η_b
0.2403   4.2416   0.3244   0.0215   0.9    0.571   0.0236   0.25

TABLE II
RESULTS (ERROR RATES PER SNR, dB)

Algorithm             Decision   0        10       20       30       40       50       Average
VAD per frame         soft       0.3898   0.2149   0.1355   0.0578   0.0380   0.0338   0.1449
                      binary     0.3974   0.2061   0.1178   0.0524   0.0308   0.0283   0.1388
VAD per bin, direct   soft       0.3523   0.3110   0.2674   0.2557   0.2526   0.2509   0.2833
                      binary     0.3233   0.2890   0.2478   0.2424   0.2312   0.2301   0.2606
VAD per bin, fit      soft       0.2699   0.2007   0.1611   0.1355   0.1166   0.1118   0.1659
                      binary     0.2667   0.1947   0.1508   0.1236   0.1033   0.0985   0.1562

One of the core ideas in this approach is the numerical estimation of the mixed signal PDF from the PDFs of the noise and speech signals. In the time domain we model the noise with a Gaussian distribution and the clean speech signal with a Generalized Gaussian distribution. In the frequency domain we model the magnitudes of the audio frames with a Weibull distribution for both noise-only and clean-speech-only audio frames, using different values of the shape parameter. Also in the frequency domain, we model the magnitudes of the frequency bins with a Rayleigh distribution for noise-only bins and with a Weibull distribution for speech-only bins. In all cases we assume constant values of the shape parameters, which makes the PDFs functions only of the variances. We estimate these variances by fitting the model PDF to the measured PDF, using the Jensen-Shannon divergence as the minimization criterion.

The algorithm was evaluated against a small data set with various signal-to-noise ratios (SNRs) and shows performance comparable with real-time VAD algorithms of the same class. The proposed approach handles much better signals with very high SNRs where, surprisingly enough, the classic VADs do not perform very well. This makes it suitable for building a baseline for testing and evaluating other VAD algorithms.

Potential next steps are adding a hangover scheme, proven to be very effective for improving the precision of the VAD decisions both per frequency bin and per frame. Implementing the ideas from [9] for combining the per-bin decisions to form a better per-frame decision is also a path for improving the performance.

REFERENCES

[1] "Recommendation G.729 Annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications," 1997.
[2] J. Sohn and W. Sung, "A voice activity detection employing soft decision based noise spectrum adaptation," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1998.
[3] J. Sohn, N. Kim, and W. Sung, "A statistical model based voice activity detector," IEEE Signal Processing Letters, vol. 6, pp. 1–3, 1999.
[4] I. Tashev, A. Lovitt, and A. Acero, "Unified framework for single channel speech enhancement," in Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 2009.
[5] S. Gazor and W. Zhang, "Speech probability distribution," IEEE Signal Processing Letters, vol. 10, pp. 204–207, 2003.
[6] I. Tashev and A. Acero, "Statistical modeling of the speech signal," in Proceedings of International Workshop on Acoustic, Echo, and Noise Control (IWAENC), 2010.
[7] R. Martin, "Speech enhancement using MMSE short time spectral estimation with Gamma distributed speech priors," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002.
[8] T. Lotter, Speech Enhancement, chapter "Single- and Multi-Microphone Spectral Amplitude Estimation Using a Super-Gaussian Speech Model," pp. 67–95, Springer-Verlag, 2005.
[9] I. Tashev, A. Lovitt, and A. Acero, "Dual stage probabilistic voice activity detector," in NOISE-CON 2010 and 159th Meeting of the Acoustical Society of America, 2010.
[10] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[11] S. Kullback and R. Leibler, "On information and sufficiency," Annals of Mathematical Statistics, vol. 22, pp. 79–86, 1951.
[12] M. K. Varanasi and B. Aazhang, "Parametric generalized Gaussian density estimation," Journal of the Acoustical Society of America, vol. 86, pp. 1404–1415, 1989.
[13] W. Weibull, "A statistical distribution function of wide applicability," Journal of Applied Mechanics – Transactions of the ASME, vol. 18, pp. 293–297, 1951.
[14] A. Papoulis and S. U. Pillai, Probability, Random Variables and Stochastic Processes, McGraw-Hill, 2001.
[15] J. S. Garofolo et al., "TIMIT acoustic-phonetic continuous speech corpus," Linguistic Data Consortium, 1993.