Joint Dereverberation and Noise Reduction 4pt Using ...

Joint Dereverberation and Noise ReductionUsing Beamforming and a Single-Channel

Speech Enhancement Scheme

B. Cauchi, I. Kodrasi, R. Rehr,S. Gerlach, T. Gerkmann,

S. Doclo, S. Goetze

Fraunhofer IDMT,Project Group Hearing, Speech and Audio Technology

Oldenburg University,Signal Processing Group

Florence, May 10th 2014

[email protected]

phone 0441 2172-450

c©Fraunhofer IDMT1/16

Introduction

� Overview of the proposed system

� Design of the MVDR beamformer

� DOA estimated using MUSIC� Estimated noise covariance

� Single-channel enhancement scheme

� Combination and optimization of published estimators

� Results

� Objective measures� MUSHRA scores� WER using a baseline recognizer


1. Proposed System

Overview

MV

DR

bea

mfo

rmer

singl

e-ch

annel

enhan

cem

ent

VAD

DOAestimation

coherenceestimation

s(n)

x(n)

θ

Γ(k)

y1(n)

y2(n)

yM(n)

h1 (n)

h2 (n)

hM (n)

θ

s(n)

� Beamformer: towards estimated direction of arrival (DOA)

� Single-channel enhancement: based on statistical estimators

� Late reverberant spectral variance (LRSV)� Noise spectral variance (NSV)� Speech spectral variance (SSV)


2. MVDR Beamformer

With Ym(k, ℓ) the STFT of the input signal in the m-thmicrophone we define

Y(k, ℓ) = [Y1(k, ℓ) Y2(k, ℓ) . . . YM(k, ℓ)]T

The output X(k, ℓ) of the beamformer is obtained as

X(k, ℓ) = WHθ (k)Y(k, ℓ)

where

Wθ(k) = Γ−1(k)dθ(k)

dH

θ(k)Γ−1(k)dθ(k)

� Noise coherence matrix: Γ(k) estimated using a VAD.

� Steering vector: dθ(k) from θ using a far-field assumption.


2. MVDR Beamformer

Estimation of noisefield coherence

� Noise periods identified with a VAD� Comparison between the long-term spectral envelope and the

average noise spectrum

� Γ(k) is estimated using detected noise-only frames

� Alternatively, a theoretically diffuse noise field is used:

Wθ(k) = (Γdiff(k)+(k)IM )−1dθ(k)

dH

θ(k)(Γdiff(k)+(k)IM )−1

dθ(k)

with (k) a constraint such that

WHθ (k)Wθ(k) ≤ WNGmax = 10 dB

Ramirez, J., Segura, J.C., Benitez, C., de la Torre, A., and Rubio, A., Efficient voice activity detection algorithms using long-term speech information, 2003.


2. MVDR Beamformer

DOA Estimation

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

b

b

b

b

bb

b

b

b

b

b bb

b b

b

b

b

b

bb

b

b

b

θ

θ = argmaxθ

1

K

khigh∑

klow

Uθ(k, ℓ),

where Uθ(k, ℓ) is the MUSIC pseudo-spectra:

Uθ(k, ℓ) = 1dH

θ(k)E(k,ℓ)EH(k,ℓ)dθ(k)

E(k, ℓ) = [eQ+1(k, ℓ) . . . eM (k, ℓ)]

with em denoting eigenvectors of thecovariance matrix of Y(k, ℓ).

Schmidt, R., Multiple emitter location and signal parameter estimation, 1986.


3. Single-channel Enhancement

Overview

noise powerestimation

speechpower

estimation

reverbpower

estimation

T60

estimator

speechpower re-estimation

+

gainfunction

×

σ2v(k)

σ2z(k)

σ2 r(k

)

σ2r(k) + σ2

v(k)

σ2s(k) G(k)

S(k)

T60

X(k)

� σ2v(k, ℓ) estimated using Minimum Statistics

� σ2s(k, ℓ) estimated using Cepstral Smoothing

� σ2r (k, ℓ) estimated using Lebart’s approach

Martin, R., Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics, 2001.Breithaupt, C., Gerkmann T. and Martin, R., A Novel A Priori SNR Estimation Approach Based on Selective Cepstro-Temporal Smoothing, 2008.Eaton, J., Gaubitch, N.D., Naylor, P.A., Noise-robust reverberation time estimation using spectral decay distributions with reduced computational cost, 2012.Lebart, K., Boucher J.M. and Denbigh, P., A new method based on spectral subtraction for speech dereverberation, 2013.



LRSV estimation

� RIR modeled as Gaussian noise with decay ∆ = 3 ln 10T60fs

decay

RIR

Time [s]

Am

plitu

de

0 0.1 0.2 0.3 0.4 0.5

� Representing the variance of the reverberant speech as:� σ2

z(k, ℓ) = σ2r (k, ℓ) + σ2

s(k, ℓ)

� Leads to the estimator� σ2

r (k, ℓ) = e−2∆Tdfs σ2z(k, ℓ − Td/Ts)

Lebart, K., Boucher J.M. and Denbigh, P., A new method based on spectral subtraction for speech dereverberation, 2001.



Gain function

� The output X(k, ℓ) of the beamformer contains the anechoicspeech, remaining noise and spatially filtered reverberation

� X(k, ℓ) = S(k, ℓ) + V (k, ℓ) + R(k, ℓ)

� We aim to compute a real gain such that:

� S(k, ℓ) = G(k, ℓ)X(k, ℓ)

� Computation of G(k, ℓ) using an MMSE estimation of thespeech amplitude based on a super Gaussian speech model.

Breithaupt, C., Krawczyk, M., and Martin, R., Parameterized MMSE spectral magnitude estimation for the enhancement of noisy speech, 2008.


4. Objective Measures

SRMR

Real

Unprocessed8 Channels1 Channel

SR

MR

[dB

]Simulated

700 ms700 ms500 ms250 msnear far Meannear far near far near far Mean

1

3

5

7

9

1

3

5

7

9

� Illustrates dereverberation performance in all condition

� Better dereverberation achieved by multichannel, except forT60=500 ms



FWSSNR


FW

SSN

R[d

B]

Simulated

700 ms500 ms250 msnear far near far near far Mean

2

6

10

14

� Illustrates noise reduction in all condition

� Beamforming step advantageous for the noise reduction



PESQ


PE

SQ

Simulated

700 ms500 ms250 msnear far near far near far Mean

1.5

2.5

3.5

� Improvement of PESQ score in all condition illustrate theoverall improvement in speech quality


5. Subjective Tests

MUSHRA test

Intermediate results of the subjective test ran by the organizers:

� Tests carried out separately for 1 and 8 channels

UnprocessedOverall QualityReverberation

Real

Single-channel

Sim

MU

SH

RA

scor

e[%

]

near farnear far0

1020304050607080

Real

Multichannel

Sim

near farnear far0

1020304050607080

� Improvement for all tested condition

� Higher improvement of the overall quality


6. Preprocessing for ASR

Word Error Rate

� Baseline recognizer provided by the organizers

� Using pre-trained models on clean data

Simulated

2.3

3

3.2

7

−11.9

4

−20.3

1

−15.7

8

−18.6

9

−10.1

8

1.6

5

2.8

3

−24.9

1

−42.5

3

−21.6

3

−30.4

8

−19.1

7

WE

R \

%

250, near 250, far 500, near 500, far 700, near 700, far Mean10

20

30

40

50

60

70

80

90

100

1 Channel

8 Channels

baseline

Real

−12.7

4

−11.4

8

−12.1

1

−25.9

3

−25.7

6

−25.8

5

700, near 700, far Mean10

20

30

40

50

60

70

80

90

100


7. Conclusion

� System based on combination of MVDR beamformer andspectral enhancement

� All parameters are blindly estimated

� Speech enhancement achieved in all conditions in terms of:

� Objective measures� Subjective tests� Word error rate


Thank you very much for your attention

House of Hearing, Oldenburg

Questions ?

Fraunhofer IDMTProject Group Hearing, Speech and Audio Technology

Oldenburg UniversitySignal Processing Group

[email protected]


Joint Dereverberation and Noise Reduction 4pt Using ...

Documents

Transcript of Joint Dereverberation and Noise Reduction 4pt Using ...