Evaluating de-reverberation and ASR techniques in reverberant … · yApplied Physics Laboratory,...

ENHANCEMENT OF REVERBERANT AND NOISY SPEECH BY EXTENDING ITSCOHERENCE

Scott Wisdom?, Thomas Powers?, Les Atlas?, and James Pitton†?

?Electrical Engineering Department, University of Washington, Seattle, USA†Applied Physics Laboratory, University of Washington, Seattle, USA

ABSTRACTWe introduce a novel speech enhancement algorithm for re-moving reverberation and noise from recorded speech data.Our approach centers around using a single-channel min-imum mean-square error log-spectral amplitude (MMSE-LSA) estimator, which applies gain coefficients in a time-frequency domain to suppress noise and reverberation. Themain contribution of this paper is that the enhancement isdone in a time-frequency domain that is coherent with speechsignals over longer analysis durations than the short-timeFourier transform (STFT) domain. This extended coherenceis gained by using a linear model of fundamental frequencyvariation over the analysis frame. In the multichannel case,we preprocess the data with either a minimum variance distor-tionless response (MVDR) beamformer, or a delay-and-sumbeamformer (DSB). We evaluate our algorithm on the RE-VERB challenge dataset. Compared to the same processingdone in the STFT domain, our approach achieves significantimprovement on the REVERB challenge objective metrics,and according to informal listening tests, results in fewerartifacts in the enhanced speech.

Index Terms— Speech enhancement, speech derever-beration, time-warping, fan-chirp transform, adaptive basis,beamforming

1. INTRODUCTION AND PRIOR WORK

The enhancement of speech signals in the presence of rever-beration and noise remains a challenging problem with manyapplications. Many methods are prone to generating artifactsin the enhanced speech, and must trade off noise reductionagainst speech distortion.

In this paper, we describe a new enhancement algorithmthat suppresses both reverberation and background noise. Wecombine a statistically optimal single-channel enhancementalgorithm that suppresses background noise and reverbera-tion with an adaptive time-frequency transform domain thatis coherent with speech signals over longer durations than theshort-time Fourier transform (STFT). Thus, we are able touse longer analysis windows while still satisfying the assump-tions of the optimal single-channel enhancement filter. Multi-channel processing is made possible using a classic minimumvariance distortionless response (MVDR) beamformer or, inthe case of two-channel data, a delay-and-sum beamformer(DSB) preceding the single-channel enhancement.

First, we review the speech enhancement and derever-beration problem, as well as the enhancement algorithm weuse proposed by Habets [1], which suppresses both noise andlate reverberation based on a statistical model of reverbera-tion. Then, we describe the fan-chirp transform, proposed

This work is funded by ONR contract N00014-12-G-0078, delivery or-der 0013

by Weruaga and Kepesi [2, 3] and improved upon by Can-cela et al. [4], which provides an enhancement domain, theshort-time fan-chirp transform (STFChT), that better matchestime-varying frequency content of voiced speech. We dis-cuss why performing the enhancement in the STFChT domaingives superior results compared to the STFT domain. Finally,we present our results on the REVERB challenge dataset [5],which shows that our new method achieves superior resultsversus conventional STFT-based processing in terms of ob-jective measures.

Our basic multichannel architecture of single-channel en-hancement preceded by beamforming is not unprecedented.Gannot and Cohen [6] used a similar architecture for noisereduction that consists of a generalized sidelobe cancellation(GSC) beamformer followed by a single-channel post-filter.Maas et al. [7] employed a similar single-channel enhance-ment algorithm for reverberation suppression and observedpromising speech recognition performance in even highly re-verberant environments.

There have been several dereverberation and enhancementapproaches that estimate and leverage the time-varying funda-mental frequency f0 of speech. Nakatani et al. [8] proposeda dereverberation method using inverse filtering that exploitsthe harmonicity of speech to build an adaptive comb filter.Kawahara et al. [9] used adaptive spectral analysis and esti-mates of f0 to perform manipulation of speech characteristics.

Droppo and Acero [10] observed how the fundamentalfrequency of speech can change within an analysis window,and proposed a new framework that could better predict theenergy of voiced speech. Dunn and Quatieri [11] used thefan-chirp transform for sinusoidal analysis and synthesis ofspeech, and Dunn et al. [12] also examined the effect of var-ious interpolation methods on reconstruction error. Pantaziset al. [13] proposed an analysis/synthesis domain that usesestimates of instantaneous frequency to decompose speechinto quasi-harmonic AM-FM components. Degottex andStylianou [14] proposed another analysis/synthesis schemefor speech using an adaptive harmonic model that they claimis more flexible than the fan-chirp, as it allows nonlinearfrequency trajectories.

To our knowledge, single-channel enhancement has notbeen attempted in these new related transform domains. Here,we demonstrate improved performance using the STFChT in-stead of the STFT.

2. OPTIMAL SINGLE-CHANNEL SUPPRESSION OFNOISE AND LATE REVERBERATION

In this section, we review the speech enhancement problemand a popular statistical speech enhancement algorithm, theminimum mean-square error log-spectral amplitude (MMSE-LSA) estimator, which was originally proposed by Ephraimand Malah [15, 16] and later improved by Cohen [17]. We

REVERB Workshop 2014

1

review the application of MMSE-LSA to both noise reductionand joint dereverberation and noise reduction (the latter ofwhich was proposed by Habets [1]).

2.1. Noise reduction using MMSE-LSAA classic speech enhancement algorithm is the minimummean-square error (MMSE) short-time spectral amplitudeestimator proposed by Ephraim and Malah [15]. They laterrefined the estimator to minimize the MSE of the log-spectra[16]. We will refer to this algorithm as LSA (log-spectral am-plitude). Minimizing the MSE of the log-spectra was found toprovide better enhanced output because log-spectra are moreperceptually meaningful. Cohen [17] suggested improve-ments to Ephraim and Malah’s algorithm, which he referredto as “optimal modified log-spectral amplitude” (OM-LSA).

Given samples of a noisy speech signal

y[n] = s[n] + v[n], (1)

where s[n] is the clean speech signal and v[n] is additivenoise, the goal of an enhancement algorithm is to estimates[n] from the noisy observations y[n]. The LSA estimatoryields an estimate A(d, k) of the clean STFT magnitudes|S(d, k)| (where S(d, k) are assumed to be normally dis-tributed) by applying a frequency-dependent gainGLSA(d, k)to the noisy STFT magnitudes |Y (d, k)|:

A(d, k) = GLSA(d, k)|Y (d, k)|. (2)

Given these estimated magnitudes, the enhanced speech is re-constructed from STFT coefficients combining A(d, k) withnoisy phase: S(d, k) = A(d, k)ej∠Y (d,k). The LSA gains arecomputed as [16, (20)]:

GLSA(d, k) =ξ(d, k)

1 + ξ(d, k)exp

{1

2

∫ ∞v(d,k)

e−t

tdt

}, (3)

where ξ(d, k) is the a priori signal-to-noise ratio (SNR) forthe kth frequency bin of the dth frame, and is defined to beξ(d, k)

∆= λs(d,k)

λv(d,k) , where λs(d, k) = E{|S(d, k)|2

}is the

variance of S(d, k) and λv(d, k) = E{|V (d, k)|2

}is the

variance of V (d, k). The variable v(d, k) = ξ(d,k)1+ξ(d,k)γ(d, k),

where γ(d, k) is the a posteriori SNR for the kth frequencybin of the dth frame, defined as γ(d, k)

∆= |Y (d,k)|2

λv(d,k) .

Cohen [17] refined Ephraim and Malah’s approach to in-clude a lower bound Gmin for the gains as well as an a priorispeech presence probability (SPP) estimator p(d, k). Cohen’sestimator is as follows [17, (8)]:

GOM−LSA = {GLSA(d, k)}p(d,k) ·G1−p(d,k)min . (4)

Cohen also derived an efficient estimator for the SPP p(d, k)[17] that exploits the strong interframe and interfrequencycorrelation of speech in the STFT domain.

2.2. Joint dereverberation and noise reductionHabets [1] proposed a MMSE-LSA enhancement algorithmthat uses a statistical model of reverberation to suppress bothnoise and late reverberation. The signal model he uses is

y[n] = s[n] ∗ h[n] + v[n] = xe[n] + x`[n] + v[n], (5)

where s[n] is the clean speech signal, h[n] is the room im-pulse response (RIR), and v[n] is additive noise. The termsxe[n] and x`[n] correspond to the early and late reverberatedspeech signals, respectively. The partition between early andlate reverberations is determined by a parameter ne, which isa discrete sample index. All samples in the RIR before ne aretaken to cause early reflections, and all samples after ne aretaken to cause late reflections [1]. Thus,

h[n] =

0, if n < 0he[n], if 0 ≤ n < neh`[n] if ne ≤ n.

(6)

Using these definitions, xe[n] = s[n] ∗ he[n] and x`[n] =s[n] ∗ h`[n].

Habets proposed a generalized statistical model of re-verberation that is valid both when the source-microphonedistance is less than or greater than the critical distance [1].This model divides the RIR h[n] into a direct-path componenthd[n] and reverberant component hr[n]. Both direct-path andreverberant components are taken to be white, zero-mean,stationary Gaussian noise sequences bd[n] and br[n] withvariances σ2

d and σ2r enveloped by an exponential decay,

hd[n] = bd[n]e−ζn and hr[n] = br[n]e−ζn, (7)

where ζ is related to the reverberation time T60 by [1]:

ζ =3 ln(10)

T60fs. (8)

Using this model, the expected value of the energy enve-lope of h[n] is

E[h2[n]

]=

σ2de−2ζn, for 0 ≤ n < nd

σ2re−2ζn, for n ≥ nd

0 otherwise.(9)

Under the assumptions that the speech signal is stationaryover short analysis windows (i.e., duration much less thanT60), Habets proposed [1, (3.87)] the following model of thespectral variance of the reverberant component xr[n]:

λxr (d, k) =e−2ζ(k)Rλxr (d− 1, k)...

+ErEd

(1− e−2ζ(k)R

)λxd(d− 1, k), (10)

where R is the number of samples separating two adjacentanalysis frames and Er/Ed is the inverse of the direct-to-reverberant ratio (DRR). Thus, the spectral variance of thereverberant component in the current frame d is composed ofscaled copies of the spectral variance of the reverberation andthe spectral variance of the direct-path signal from the previ-ous frame d− 1.

Using this model, the variance of the late reverberant com-ponent can be expressed as [1, (3.85)]:

λx`(d, k) = e−2ζ(k)(ne−R)λxr

(d− ne

R+ 1, k

), (11)

which is quite useful in practice, because the variance of thelate-reverberant component can be computed from the vari-ance of the total reverberant component.

2

To suppress both noise and late reverberation, the a prioriand a posteriori SNRs ξ(d, k) and γ(d, k) from the pre-vious section become a priori and a posteriori signal-to-interference ratios (SIRs), given by [1, (3.25), (3.26)]:

ξ(d, k) =λxe(d, k)

λx`(d, k) + λv(d, k)(12)

and

γ(d, k) =|Y (d, k)|2

λx`(d, k) + λv(d, k). (13)

The gains are computed by plugging the SIRs in (12) and (13)into (3) and (4). Habets suggested an additional change to (4),which makes Gmin time- and frequency-dependent. This isdone because the interference of both noise and late reverber-ation is time-varying. The modification is [1, (3.29)]

Gmin(d, k) =Gmin,x` λx`(d, k) +Gmin,vλv(d, k)

λx`(d, k) + λv(d, k). (14)

Notice that two parameters in (8) and (10) are not knowna priori; namely, T60 and the DRR. These parameters mustbe blindly estimated from the data. For T60 estimation,Lollmann et al. [18] propose an algorithm, which we foundto be effective. As for the DRR, Habets suggested an onlineadaptive procedure [1, §3.7.2].

3. ANALYSIS AND SYNTHESIS USING THEFAN-CHIRP TRANSFORM

In this section, we review the forward short-time fan-chirptransform (STFChT) and describe a method of inverting theSTFChT.

3.1. The forward fan-chirp transformWe adopt the fan-chirp transform formulation used by Can-cela et al. [4]. The forward fan-chirp transform is definedas

X(f, α) =

∫x(t)φ′α(t)e−j2πfφα(t)dt (15)

where φα(t) =(1 + 1

2αt)t and φ′α(t) = 1+αt. The variable

α is an analysis chirp rate. Using a change of variable τ ←φα(t), (15) can be written as the Fourier transform of a time-warped signal:

X(f, α) =

∫ ∞−∞

x(φ−1α (τ))e−j2πfτdτ. (16)

The short-time fan-chirp transform (STFChT) of x(t) isdefined as the fan-chirp transform of the dth short frame ofx(t):

Xd(f, αd) =

∫ Tw/2

−Tw/2w(τ)xd(φ

−1αd

(τ))e−j2πfτdτ (17)

where w(t) is an analysis window, αd is the analysis chirprate for the dth frame given by (21), and xd(t) is the dth shortframe of the input signal of duration T :

xd(t) =

{x(t− dThop), −T/2 ≤ t ≤ T/2

0, otherwise.(18)

T is the duration of the pre-warped short-time duration,Thop is the frame hop, Tw is the post-warped short-time du-ration, and w(t) is a Tw-long analysis window. The analysiswindow is applied after time-warping so as to avoid warpingof the window, which can cause unpredictable smearing of theFourier transform.

Implementing the fan-chirp transform as a time-warpingfollowed by a Fourier transform allows efficient implementa-tion, consisting simply as an interpolation of the signal fol-lowed by an FFT. In the implementation provided by Cancelaet al. [4], the interpolation used in the forward fan-chirp trans-form is linear.

Kepesi and Weruaga [2] provide a method for determina-tion of the analysis chirp rate α using the gathered log spec-trum (GLogS). The GLogS is defined as follows:

ρ(f0, α) =1

Nh

Nh∑k=1

ln |X(kf0, α)| (19)

where Nh is the maximum number of harmonics that fitwithin the analysis bandwidth. That is,

Nh =

⌊fs

2f0

(1 + 1

2 |α|Tw)⌋ . (20)

Cancela et al. [4] proposed several enhancements to theGLogS. First, they observed improved results by replacingln | · | with ln (1 + γ |·|). Cancela et al. note that this expres-sion approximates a p-norm, with 0 < p < 1, where lowervalues of γ with γ ≥ 1 approach the 1-norm, while highervalues approaches the 0-norm. Cancela et al. note that γ = 10gave good results for their application.

Additionally, Cancela et al. propose modifications thatsuppress multiples and submultiples of the current f0. Also,they propose normalizing the GLogS such that is has zeromean and unit variance. This is necessary because the vari-ance of the GLogS increases with increasing fundamental fre-quency. For mean and variances measured over all frames ina database, a polynomial fit is determined and the GLogS arecompensated using these polynomial fits.

Let ρd(f0, α) be the GLogS of the dth frame with theseenhancements applied. For practical implementation, finitesetsA of candidate chirp rates andF0 of candidate fundamen-tal frequencies are used, and the GLogS is exhaustively com-puted for every chirp rate in A and fundamental frequency inF0. The analysis chirp rate αd for the dth frame is thus foundby

αd = argmaxα∈A

maxf0∈F0

ρd(f0, α). (21)

3.2. The inverse fan-chirp transformInverting the fan-chirp transform is a matter of reversing thesteps used in the forward transform. Thus, the inverse fan-chirp transform for a short-time frame consists of an inverseFourier transform, removal of the analysis window, and aninverse time-warping. The removal of the analysis windoww(t) from the Tw-long warped signal limits the choice ofanalysis windows to positive functions only, such as a Ham-ming window, so the window can be divided out. Also, sincethe warping is nonuniform, it is possible that the samplinginterval between points may exceed the Nyquist sampling in-terval. To combat this, the data should be oversampled beforetime-warping, which means the data must be downsampledafter undoing the time-warping.

3

The choice of post-warped duration Tw and the method ofinterpolation used in the inverse time-warping affect the re-construction error of the inverse fan-chirp transform. There isa trade-off between reconstruction performance and compu-tational complexity, because interpolation error decreases asinterpolation order increases. Kepesi and Weruaga [19] ana-lyzed fan-chirp reconstruction error with respect to order ofthe time-warping interpolation and oversampling factor, andfound that for cubic Hermite splines and an oversampling fac-tor of 2, a signal-to-error ratio of over 30dB can be achieved.For our application, we choose an oversampling factor of 8and cubic-spline interpolation.

4. MMSE-LSA IN THE FAN-CHIRP DOMAIN

In this section we analyze performing joint dereverberationand noise reduction using MMSE-LSA in the STFChT do-main and provide an example for why the STFChT (17),which is a domain that is more coherent with speech signals,is a more appropriate enhancement domain than the STFT.

The MMSE-LSA framework for joint dereverberation andnoise reduction implicitly assumes that the the frequency con-tent of speech does not change very much over the analysisduration. Such an assumption relies on the local stationar-ity of speech signals within the analysis frame. For voicedspeech, this is essentially equivalent to the fundamental fre-quency f0 being constant over the analysis frame, and the fre-quency variation of voiced speech limits analysis durations to10-30ms.

Using shorter analysis frames means only a finite amountof approximately stationary data is available at any specifictime, and this finite amount of data limits the performance ofstatistical estimators. To improve this situation, we propose toincrease the analysis duration by changing the time base of theanalysis such that there is less frequency variation within theframe, which makes the data more stationary. This time basemodification is performed using the fan-chirp, which uses alinear model of frequency variation within the frame.

To give intuition about the benefits of the fan-chirp in thepresence of reverberation, we present a simple example in fig-ure 1. Consider two successive Gaussian-enveloped harmonicchirps with duration 200 ms and spaced 100 ms apart. Let thef0 of the first harmonic chirp start at 200 Hz and rise to 233Hz, and let the f0 of the second harmonic chirp have a rangefrom 250 Hz falling to 200 Hz. Both chirps have 20 harmon-ics. This sequence of harmonic chirps has parameters thatare typical of two successive voiced vowels (here we do notconsider the spectral shape imposed by a vocal tract filter, forsimplicity). Now, let us apply reverberation to this signal (weuse the first channel of the MediumRoom2 far AnglA RIRprovided in the REVERB challenge development set [5]), andexamine the clean and reverberated versions in both the STFTand STFChT domains. The result is shown in figure 1. STFTand STFChT parameters are exactly matched, with a sam-pling rate of fs = 16kHz, a Hamming window of duration2048 samples, a frame hop of 128 samples, and a 3262-lengthFFT.

Notice that at higher frequencies in the STFT of the cleansignal (top left panel of figure 1), the harmonics becomebroader and more smeared across frequency. In contrast, theSTFChT of the clean signal (center left panel of figure 1) ex-hibits narrow lines at all frequencies. When reverberation isapplied, the STFT of the reverberated signal (top right panelof figure 1) exhibits smears that become wider with increas-ing frequency, and adjacent harmonics even become smearedtogether. In contrast, in the STFChT of the reverberatedsignal (center right panel of figure 1), the direct path sig-nal shows up as narrow lines at all frequencies, while some

|STFT |2 of c lean

Fre

quency(k

Hz)

0 0.2 0.4 0.6 0.80

1

2

3

|STFChT |2 of c lean

Fre

quency(k

Hz)

T ime (sec)0 0.2 0.4 0.6 0.8

0

1

2

3

|STFT |2 of reverberated

0 0.2 0.4 0.6 0.80

1

2

3

Time (sec)

|STFChT |2 of reverberated

0 0.2 0.4 0.6 0.80

1

2

3

0 0.2 0.4 0.6 0.8−4

−2

0

2

4

Time (sec)

Chirpra

teα

d

0 0.2 0.4 0.6 0.8−4

−2

0

2

4

Time (sec)

Fig. 1: Simple example showing benefits of fan-chirp bothfor narrower harmonics across frequency and for better coher-ence with direct path signal in the presence of reverberation.Test signal is two consecutive synthetic speech-like harmonicstacks. All colormaps are identical with a dynamic range of40dB. Left plots are the representations of the clean signal andright plots are the representations of the reverberated signal.The chosen analysis chirp rates αd are shown in the bottomplots.

smearing results in frames that contain reverberant energy.One cause of the smearing during reverberation-dominatedframes seems to be errors in the estimation of αd, the analysischirp rate, caused by the additional reverberant components,which are shown in the bottom panels of figure 1. Despitethese estimation errors, the STFChT still seems to give abetter representation of the signal compared to the STFT,because the STFChT reduces smearing of higher-frequencycomponents and achieves better coherence with the direct-path signal (i.e., direct-path signals show up as more narrowlines).

5. IMPLEMENTATION

Our algorithms are implemented in MATLAB, and we useutterance-based processing. The algorithm starts by using theutterance data to estimate the T60 time of the room using theblind algorithm proposed by Lollmann et al. [18]. Multichan-nel utterance input data is concatenated into a long vector,and as recommended by Lollmann et al., noise reduction isperformed beforehand. We use Loizou’s implementation [20]of Ephraim and Malah’s LSA [16] for this pre-enhancement.Figure 2 shows histograms of the T60 estimation performanceusing this approach.

For multichannel data, we estimate the direction of arrival(DOA) by cross-correlating oversampled data between chan-

4

0

0.1

0.2

0.3

0.4

0.5

room1, near

0

0.1

0.2

0.3

0.4

0.5

room1, far

0

0.1

0.2

0.3

0.4

0.5

room2, near

0

0.1

0.2

0.3

0.4

0.5

room2, far

0 0.4 0.80

0.10.20.30.40.5

0 0.4 0.8

Estimated T60 (s)

room3, near

0 0.4 0.8 0 0.4 0.80

0.10.20.30.40.5

0 0.4 0.8

Estimated T60 (s)

room3, far

0 0.4 0.8

Fig. 2: Histograms of estimated T60 time measured on Sim-Data evaluation dataset (these results were not used to tunethe algorithm). For each condition, left plot is for 1-channeldata, center plot is for 2-channel data, and right plot is for8-channel data. These plots show that T60 estimation [18]precision generally improved with increasing amounts of data(i.e., with more channels), although for some conditions T60estimates were inaccurate. Dotted lines indicate approximateT60 times given by REVERB organizers [5].

nels. That is, we compute a Nch-length vector of time delaysd with d1 = 0 and di, i=2,...,Nch given by

di = argmaxk

r1i[k]

Ufs, (22)

where r1i[k] =∑n x1[n]xi[n − k], U is the oversampling

factor, and c = 340 meters per second, the approximate speedof sound in air.

Given a time delay vector d, the DOA estimate is givenby the solution to Pa = 1

cd, where a is a 3 × 1 unit vec-tor representing the estimated DOA of the speech signal andP is a Nch × 3 matrix containing the Cartesian (x, y, z) co-ordinates of the array elements. For example, for an eight-element uniform circular array, Pi1 = xi = r cos(iπ/4),Pi2 = yi = r sin(iπ/4), and Pi3 = zi = 0 for i = 0, 1, ..., 7,where r is the array radius.

For the 8-channel case, the estimated DOA is used to formthe steering vector vH(f) for a frequency-domain MVDRbeamformer applied to the multichannel signal. The weightswH(d, f) for the MVDR are [21, (6.14-15)]

wH(d, f) =vH(f)S−1

yy (d, f)

vH(d, f)S−1yy (d, f)v(d, f)

, (23)

where Syy(d, f) is the spatial covariance matrix at frequencyf and frame d estimated using N snapshots Y (d − n, f) for−N/2 ≤ n < N/2 and v is given by

v(f) = exp

(j

2πf

cPa

). (24)

The MVDR uses a 512-sample long Hamming window with25% overlap, a 512-point FFT, and N = 24 snapshots forspatial covariance estimates. For 2-channel data, we use adelay-and-sum beamformer to enhance the signal with thedelay given by the DOA estimate. Single-channel data is en-hanced directly by the single-channel MMSE-LSA algorithm.A block diagram of these three cases is shown in figure 3.

MVDRMMSE-LSA

STFT or STFChTy1:8[n] s[n]

MMSE-LSASTFT or STFChTDSB

y1:2[n] s[n]

MMSE-LSASTFT or STFChT

y1[n] s[n]

Fig. 3: Block diagrams of processing for 8-channel data usinga minimum variance distortionless response (MVDR) beam-former (top), 2-channel data using a delay-and-sum beam-former (DSB, middle), and 1-channel data (bottom).

We tried three analysis/synthesis domains for the MMSE-LSA enhancement algorithm: the STFT with a short window,the STFT with a long window, and the STFChT. The STFTwith a short window uses 512-sample long (T = 32ms)Hamming windows, a frame hop of 128 samples, and anFFT length of 512. Short-window STFT processing ischosen to match conventional speech processing windowlengths. The STFT with a long window uses 2048-samplelong (T = 128ms) Hamming windows, a frame hop of 128samples, and an FFT length of 3262. Long-window STFTprocessing is intended to match the parameters of STFChTprocessing for a direct comparison. STFChT processing usesan analysis duration of 2048 samples, a Hamming analysiswindow, a frame hop of 128 samples, an FFT length of 3262,oversampling factor of 8, and a set of possible analysis chirprates A consisting of 21 equally spaced αs from -4 to 4.

The forward STFChT, given by (17), proceeds frame-by-frame, estimating the optimal analysis chirp rate αd using(21), oversampling in time, warping, applying an analysiswindow, and taking the FFT. Then MMSE-LSA weights areestimated frame-by-frame and applied in the STFChT do-main, and the enhanced speech signal is reconstructed usingthe inverse STFChT.

For all methods, noise estimation is performed with adecision-directed method and simple online updating ofthe noise variance. Voice activity detection to determineif a frame is noise-only or speech-plus-noise is done usingLoizou’s method, which compares the following quantity toa threshold ηthresh:

η(d) =∑k

ln γ(d, k)ξ(d, k)

1 + ξ(d, k)− ln(1 + ξ(d, k)). (25)

If η(d) < ηthresh, the frame is determined to be noise-onlyand the noise variance is updated as λv(d, k) = µvλv(d −1, k)+(1−µv)|Y (d, k)|2, with µv = 0.98 and ηthresh = 0.15.

For our implementation of Habets’s joint dereverberationand noise reduction algorithm, we used Loizou’s implementa-tion [20] of Ephraim and Malah’s LSA logmmse MATLABimplementation as a foundation. The forward STFChT codewas written by Cancela et al. [4]. We wrote our own MAT-LAB implementation of the inverse STFChT.

Computation times for processing REVERB evaluationdata are shown in figure 7. We measured reference wall clocktimes of 265.43s and 39.62s, respectively, for SimData andRealData. For 8-channel data, the MVDR and the STFChTrequire the most computation. For 1-channel and 2-channeldata, the STFChT requires the most computation. For theSTFChT, much of the computation is used to compute theGLogS for estimation of the analysis chirp rate αd (21) foreach frame. Note that this computation could be easily paral-lelized in hardware.

5

room1 room2 room31

1.5

2

2.5

3

3.5

PESQ

PESQ, near distance

room1 room2 room32

3

4

5

6

7

SRMR

SRMR, near distance

room1 room2 room31

1.5

2

2.5

PESQ

PESQ, far distance

room1 room2 room32

3

4

5

6

7

SRMR

SRMR, far distance

Orig

1ch STFT 5121ch STFT 20481ch STFChT2ch STFT 5122ch STFT 20482ch STFChT8ch MVDR8ch STFT 5128ch STFT 20488ch STFChT

Fig. 4: PESQ and SRMR results for SimData evaluation set. Upper plots are near distance condition, lower plots are far distancecondition. Left plots are PESQ, right plots are SRMR.

Noisy reverberated speech

Fre

q(k

Hz)

1 2 3 4 5 6 7 8 90

2

4

6

8

STFT 512 domain enhanced speech - female speaker, 8ch

Fre

q(k

Hz)

1 2 3 4 5 6 7 8 90

2

4

6

8

STFChT domain enhanced speech - female speaker, 8ch

Time (sec)

Fre

q(k

Hz)

1 2 3 4 5 6 7 8 90

2

4

6

8

Fig. 5: Spectrogram comparisons for one 8-channel far-distance utterance, c3bc020q, from SimData evaluation set.

6. RESULTS AND DISCUSSION

Our results on REVERB evaluation data are shown in fig-ures 4, 6, and 7. For the challenge, we submitted results us-ing STFChT-based processing. We choose to display PESQ(Perceptual Evaluation of Speech Quality) [22] and SRMR(source-to-reverberation modulation energy ratio) [23] moreprominently because the former is the ITU-T standard forvoice quality testing [24] and the latter is both a measure ofdereverberation and the only non-intrusive measure that canbe run on RealData (for which the clean speech is not avail-able).

For SimData, STFChT-based enhancement always per-forms better in terms of PESQ than STFT-based enhancementusing either a short (512-sample) window or a long (2048-sample) window, for the 8-, 2-, and 1-channel cases (exceptfor 8-channel, far-distance data in room 3). Informal listeningtests revealed an oversuppression of speech and some musicalnoise artifacts in STFT processing, while STFChT process-ing did not exhibit oversuppression or musical noise artifacts.The oversuppression of direct-path speech by STFT process-ing can be seen in the spectrogram comparisons shown in fig-ure 5. In terms of SRMR, STFChT processing yields equiva-

far near2

4

6

8

10

SRMR

RealData SRMR

Orig

1ch STFT 5121ch STFT 20481ch STFChT2ch STFT 5122ch STFT 20482ch STFChT8ch MVDR8ch STFT 5128ch STFT 20488ch STFChT

Fig. 6: SRMR results for RealData evaluation set.

lent or slightly worse SRMR scores than long-window STFTprocessing for the 8-, 2-, and 1-channel cases (except for 8-channel, near-distance data, where STFChT processing doesslightly better). One issue with these SRMR comparisons,however, is that the variance of the SRMR scores is quite high.Thus, for SimData, STFChT processing achieves better per-ceptual audio quality while still achieving almost equivalentdereverberation compared to STFT processing.

For RealData, we achieved SRMR improvements of over3, as shown in figure 6. Short-window STFT processingachieved higher scores than STFChT processing (especiallyfor 1- and 2-channel data), but informal listening tests re-vealed an oversuppression of speech and some musical noiseartifacts in STFT processing, while little oversuppressionand fewer artifacts were perceived in STFChT processing.Informal listening tests also indicated that the STFChT pro-cessing suppresses reverberation slightly less as compared toSTFT processing, which concurs with lower SRMR scores forSTFChT processing. Thus, though STFT processing achievesbetter dereverberation on RealData, better dereverberationperformance seems to come at the cost of oversuppression ofdirect-path speech and addition of artifacts. STFChT process-ing, on the other hand, achieves slightly less dereverberationon RealData, but the enhanced speech does not seem to sufferfrom oversuppression or artifacts.

7. CONCLUSION AND FUTURE WORK

In this paper we combined an optimal MMSE log-spectralamplitude estimator for joint dereverberation and noise re-duction with a recently-developed adaptive time-frequencytransform that is coherent with speech signals over longer du-rations. Our approach yielded improved results on the RE-

6

SimData summaryCh. Method Comp. Time (s) Mean CD Median

CD SRMR MeanLLR

MedianLLR

MeanFWSegSNR

MedianFWSegSNR PESQ

Orig — 3.97 3.68 3.68 0.57 0.51 3.62 5.39 1.488 STFT 512/2048 7447.86 / 7622.38 3.56 / 3.18 3.23 / 2.83 4.77 / 4.56 0.61 / 0.43 0.50 / 0.38 8.06 / 6.79 8.47 / 9.31 1.83 / 1.948 STFChT 17132.60 2.97 2.49 4.82 0.43 0.37 9.21 10.63 2.102 STFT 512/2048 1955.89 / 2022.04 3.80 / 3.57 3.42 / 3.22 4.86 / 4.47 0.65 / 0.49 0.55 / 0.44 7.26 / 5.46 7.93 / 7.86 1.60 / 1.662 STFChT 8248.49 3.33 2.83 4.75 0.51 0.45 7.68 9.19 1.771 STFT 512/2048 1003.54 / 1074.93 3.87 / 3.84 3.48 / 3.51 4.79 / 4.28 0.68 / 0.54 0.58 / 0.47 6.72 / 4.65 7.62 / 6.71 1.53 / 1.591 STFChT 7454.59 3.57 3.07 4.55 0.57 0.49 7.07 8.60 1.69

SimData, far distance, room 1Orig — 2.67 2.38 4.58 0.38 0.35 6.68 9.24 1.61

8 STFT 512/2048 7337.43 / 7477.34 3.16 / 2.28 2.88 / 2.05 5.05 / 4.82 0.52 / 0.37 0.43 / 0.34 8.50 / 9.16 8.73 / 10.57 1.90 / 1.878 STFChT 16730.48 2.47 2.04 5.01 0.34 0.30 10.16 11.26 2.112 STFT 512/2048 1872.86 / 1936.2 3.32 / 2.41 2.98 / 2.14 5.47 / 5.10 0.50 / 0.38 0.41 / 0.35 8.06 / 8.30 8.48 / 10.14 1.68 / 1.752 STFChT 7175.92 2.66 2.18 5.44 0.35 0.31 8.99 10.14 1.871 STFT 512/2048 975.37 / 1057.11 3.34 / 2.60 3.00 / 2.32 5.42 / 5.00 0.51 / 0.37 0.42 / 0.34 8.09 / 7.82 8.59 / 9.97 1.63 / 1.711 STFChT 6378.58 2.83 2.34 5.30 0.37 0.32 8.86 10.12 1.81

SimData, near distance, room 1Orig — 1.99 1.68 4.50 0.35 0.33 8.12 10.72 2.14










RealData summaryCh. Method Comp. Time (s) SRMR

Orig — 3.188 STFT 512/2048 3080.52 / 3152.88 6.90 / 5.318 STFChT 5236.74 6.332 STFT 512/2048 852.84 / 934.18 6.29 / 4.572 STFChT 3036.49 5.241 STFT 512/2048 610.26 / 682.97 5.80 / 4.211 STFChT 2753.87 4.85

RealData, far distanceCh. Method Comp. Time (s) SRMR


RealData, near distanceCh. Method Comp. Time (s) SRMR


Fig. 7: Results for SimData and RealData evaluation sets.

7

VERB challenge dataset versus standard STFT processing.Processing in the STFChT domain resulted in less reverbera-tion at the output without introducing artifacts, which concurswith substantial increase in the PESQ scores, the ITU-T stan-dard for voice quality. We also provided insight as to why en-hancement performance improves using the STFChT domain.The improvement gained by STFChT-based processing is aninteresting result, and warrants further investigation. Furtherexploration may be fruitful, as combination of the fan-chirpor other coherent transforms with other methods for derever-beration and/or noise reduction may yield improved results.

8. REFERENCES

[1] E. A. P. Habets, “Speech dereverberation using sta-tistical reverberation models,” in Speech Dereverber-ation, Patrick A. Naylor and Nikolay D. Gaubitch, Eds.Springer, July 2010.

[2] M. Kepesi and L. Weruaga, “Adaptive chirp-basedtime–frequency analysis of speech signals,” SpeechCommunication, vol. 48, no. 5, pp. 474–492, May 2006.

[3] L. Weruaga and M. Kepesi, “The fan-chirp transformfor non-stationary harmonic signals,” Signal Process-ing, vol. 87, no. 6, pp. 1504–1522, June 2007.

[4] P. Cancela, E. Lopez, and M. Rocamora, “Fan chirptransform for music representation,” in Proc. Inter-national Conference On Digital Audio Effects (DAFx),Graz, Austria, 2010, p. 1–8.

[5] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani,E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr,W. Kellermann, and R. Maas, “The REVERB chal-lenge: A common evaluation framework for dereverber-ation and recognition of reverberant speech,” in Proc.IEEE Workshop on Applications of Signal Processing toAudio and Acoustics (WASPAA), New Paltz, NY, 2013.

[6] S. Gannot and I. Cohen, “Speech enhancement basedon the general transfer function GSC and postfiltering,”IEEE Transactions on Speech and Audio Processing,vol. 12, no. 6, pp. 561–571, 2004.

[7] R. Maas, E. A. P. Habets, A. Sehr, and W. Kellermann,“On the application of reverberation suppression to ro-bust speech recognition,” in Proc. IEEE InternationalConference on Acoustics, Speech, and Signal Process-ing (ICASSP), Kyoto, Japan, 2012, pp. 297–300.

[8] T. Nakatani, K. Kinoshita, and M. Miyoshi,“Harmonicity-based blind dereverberation for single-channel speech signals,” IEEE Transactions on Audio,Speech, and Language Processing, vol. 15, no. 1, pp.80–95, 2007.

[9] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne,“Restructuring speech representations using apitch-adaptive time–frequency smoothing and aninstantaneous-frequency-based f0 extraction: Possi-ble role of a repetitive structure in sounds,” SpeechCommunication, vol. 27, no. 3, pp. 187–207, 1999.

[10] J. Droppo and A. Acero, “A fine pitch model forspeech.,” in Proc. Interspeech, Antwerp, Belgium,2007, p. 2757–2760.

[11] R. Dunn and T. Quatieri, “Sinewave analysis/synthesisbased on the fan-chirp tranform,” in Proc. IEEE Work-shop on Applications of Signal Processing to Audio andAcoustics (WASPAA), New Paltz, NY, 2007, pp. 247–250.

[12] R. Dunn, T. Quatieri, and N. Malyska, “Sinewave pa-rameter estimation using the fast fan-chirp transform,”in Proc. IEEE Workshop on Applications of Signal Pro-cessing to Audio and Acoustics (WASPAA), New Paltz,NY, 2009, pp. 349–352.

[13] Y. Pantazis, O. Rosec, and Y. Stylianou, “Adaptive AM-FM signal decomposition with application to speechanalysis,” IEEE Transactions on Audio, Speech, andLanguage Processing, vol. 19, no. 2, pp. 290–300, 2011.

[14] G. Degottex and Y. Stylianou, “Analysis and synthesisof speech using an adaptive full-band harmonic model,”IEEE Transactions on Audio, Speech, and LanguageProcessing, vol. 21, no. 9-10, pp. 2085–2095, 2013.

[15] Y. Ephraim and D. Malah, “Speech enhancement us-ing a minimum-mean square error short-time spectralamplitude estimator,” IEEE Transactions on Acoustics,Speech and Signal Processing, vol. 32, no. 6, pp. 1109–1121, Dec. 1984.

[16] Y. Ephraim and D. Malah, “Speech enhancement using aminimum mean-square error log-spectral amplitude es-timator,” IEEE Transactions on Acoustics, Speech andSignal Processing, vol. 33, no. 2, pp. 443–445, 1985.

[17] I. Cohen, “Optimal speech enhancement under signalpresence uncertainty using log-spectral amplitude esti-mator,” IEEE Signal Processing Letters, vol. 9, no. 4,pp. 113–116, 2002.

[18] H. Lollmann, E. Yilmaz, M. Jeub, and P. Vary, “Animproved algorithm for blind reverberation time estima-tion,” in Proc. International Workshop on Acoustic Echoand Noise Control (IWAENC), Tel Aviv, Israel, 2010, p.1–4.

[19] L. Weruaga and M. Kepesi, “Speech analysis with thefast chirp transform,” in Proc. European Signal Pro-cessing Conference (EUSIPCO), Vienna, Austria, 2004,p. 1011–1014.

[20] P. C. Loizou, Speech Enhancement: Theory and Prac-tice, CRC Press, June 2007.

[21] H. L. Van Trees, Optimum Array Processing. Part IV ofDetection, Estimation, and Modulation Theory, Wiley-Interscience, New York, 2002.

[22] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hek-stra, “Perceptual evaluation of speech quality (PESQ)-anew method for speech quality assessment of telephonenetworks and codecs,” in Proc. IEEE International Con-ference on Acoustics, Speech, and Signal Processing(ICASSP), Salt Lake City, UT, 2001, vol. 2, p. 749–752.

[23] T. Falk, C. Zheng, and W.-Y. Chan, “A non-intrusivequality and intelligibility measure of reverberant anddereverberated speech,” IEEE Transactions on Audio,Speech, and Language Processing, vol. 18, no. 7, pp.1766–1774, 2010.

[24] ITU-T P.862.2, “Wideband extension to rec. p.862 forthe assessment of wideband telephone networks andspeech codecs,” 2007.

8

Evaluating de-reverberation and ASR techniques in reverberant … · yApplied Physics Laboratory,...

Documents

Transcript of Evaluating de-reverberation and ASR techniques in reverberant … · yApplied Physics Laboratory,...