Joint Dereverberation and Noise Reduction 4pt Using ...
Transcript of Joint Dereverberation and Noise Reduction 4pt Using ...
Joint Dereverberation and Noise ReductionUsing Beamforming and a Single-Channel
Speech Enhancement Scheme
B. Cauchi, I. Kodrasi, R. Rehr,S. Gerlach, T. Gerkmann,
S. Doclo, S. Goetze
Fraunhofer IDMT,Project Group Hearing, Speech and Audio Technology
Oldenburg University,Signal Processing Group
Florence, May 10th 2014
phone 0441 2172-450
c©Fraunhofer IDMT1/16
Introduction
� Overview of the proposed system
� Design of the MVDR beamformer
� DOA estimated using MUSIC� Estimated noise covariance
� Single-channel enhancement scheme
� Combination and optimization of published estimators
� Results
� Objective measures� MUSHRA scores� WER using a baseline recognizer
c©Fraunhofer IDMT2/16
1. Proposed System
Overview
MV
DR
bea
mfo
rmer
singl
e-ch
annel
enhan
cem
ent
VAD
DOAestimation
coherenceestimation
s(n)
x(n)
θ
Γ(k)
y1(n)
y2(n)
yM(n)
h1 (n)
h2 (n)
hM (n)
θ
s(n)
� Beamformer: towards estimated direction of arrival (DOA)
� Single-channel enhancement: based on statistical estimators
� Late reverberant spectral variance (LRSV)� Noise spectral variance (NSV)� Speech spectral variance (SSV)
c©Fraunhofer IDMT3/16
2. MVDR Beamformer
With Ym(k, ℓ) the STFT of the input signal in the m-thmicrophone we define
Y(k, ℓ) = [Y1(k, ℓ) Y2(k, ℓ) . . . YM(k, ℓ)]T
The output X(k, ℓ) of the beamformer is obtained as
X(k, ℓ) = WHθ (k)Y(k, ℓ)
where
Wθ(k) = Γ−1(k)dθ(k)
dH
θ(k)Γ−1(k)dθ(k)
� Noise coherence matrix: Γ(k) estimated using a VAD.
� Steering vector: dθ(k) from θ using a far-field assumption.
c©Fraunhofer IDMT4/16
2. MVDR Beamformer
Estimation of noisefield coherence
� Noise periods identified with a VAD� Comparison between the long-term spectral envelope and the
average noise spectrum
� Γ(k) is estimated using detected noise-only frames
� Alternatively, a theoretically diffuse noise field is used:
Wθ(k) = (Γdiff(k)+(k)IM )−1dθ(k)
dH
θ(k)(Γdiff(k)+(k)IM )−1
dθ(k)
with (k) a constraint such that
WHθ (k)Wθ(k) ≤ WNGmax = 10 dB
Ramirez, J., Segura, J.C., Benitez, C., de la Torre, A., and Rubio, A., Efficient voice activity detection algorithms using long-term speech information, 2003.
c©Fraunhofer IDMT5/16
2. MVDR Beamformer
DOA Estimation
-3 -2 -1 0 1 2 3
-3
-2
-1
0
1
2
3
b
b
b
b
bb
b
b
b
b
b bb
b b
b
b
b
b
bb
b
b
b
θ
θ = argmaxθ
1
K
khigh∑
klow
Uθ(k, ℓ),
where Uθ(k, ℓ) is the MUSIC pseudo-spectra:
Uθ(k, ℓ) = 1dH
θ(k)E(k,ℓ)EH(k,ℓ)dθ(k)
E(k, ℓ) = [eQ+1(k, ℓ) . . . eM (k, ℓ)]
with em denoting eigenvectors of thecovariance matrix of Y(k, ℓ).
Schmidt, R., Multiple emitter location and signal parameter estimation, 1986.
c©Fraunhofer IDMT6/16
3. Single-channel Enhancement
Overview
noise powerestimation
speechpower
estimation
reverbpower
estimation
T60
estimator
speechpower re-estimation
+
gainfunction
×
σ2v(k)
σ2z(k)
σ2 r(k
)
σ2r(k) + σ2
v(k)
σ2s(k) G(k)
S(k)
T60
X(k)
� σ2v(k, ℓ) estimated using Minimum Statistics
� σ2s(k, ℓ) estimated using Cepstral Smoothing
� σ2r (k, ℓ) estimated using Lebart’s approach
Martin, R., Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics, 2001.Breithaupt, C., Gerkmann T. and Martin, R., A Novel A Priori SNR Estimation Approach Based on Selective Cepstro-Temporal Smoothing, 2008.Eaton, J., Gaubitch, N.D., Naylor, P.A., Noise-robust reverberation time estimation using spectral decay distributions with reduced computational cost, 2012.Lebart, K., Boucher J.M. and Denbigh, P., A new method based on spectral subtraction for speech dereverberation, 2013.
c©Fraunhofer IDMT7/16
3. Single-channel Enhancement
LRSV estimation
� RIR modeled as Gaussian noise with decay ∆ = 3 ln 10T60fs
decay
RIR
Time [s]
Am
plitu
de
0 0.1 0.2 0.3 0.4 0.5
� Representing the variance of the reverberant speech as:� σ2
z(k, ℓ) = σ2r (k, ℓ) + σ2
s(k, ℓ)
� Leads to the estimator� σ2
r (k, ℓ) = e−2∆Tdfs σ2z(k, ℓ − Td/Ts)
Lebart, K., Boucher J.M. and Denbigh, P., A new method based on spectral subtraction for speech dereverberation, 2001.
c©Fraunhofer IDMT8/16
3. Single-channel Enhancement
Gain function
� The output X(k, ℓ) of the beamformer contains the anechoicspeech, remaining noise and spatially filtered reverberation
� X(k, ℓ) = S(k, ℓ) + V (k, ℓ) + R(k, ℓ)
� We aim to compute a real gain such that:
� S(k, ℓ) = G(k, ℓ)X(k, ℓ)
� Computation of G(k, ℓ) using an MMSE estimation of thespeech amplitude based on a super Gaussian speech model.
Breithaupt, C., Krawczyk, M., and Martin, R., Parameterized MMSE spectral magnitude estimation for the enhancement of noisy speech, 2008.
c©Fraunhofer IDMT9/16
4. Objective Measures
SRMR
Real
Unprocessed8 Channels1 Channel
SR
MR
[dB
]Simulated
700 ms700 ms500 ms250 msnear far Meannear far near far near far Mean
1
3
5
7
9
1
3
5
7
9
� Illustrates dereverberation performance in all condition
� Better dereverberation achieved by multichannel, except forT60=500 ms
c©Fraunhofer IDMT10/16
4. Objective Measures
FWSSNR
Unprocessed8 Channels1 Channel
FW
SSN
R[d
B]
Simulated
700 ms500 ms250 msnear far near far near far Mean
2
6
10
14
� Illustrates noise reduction in all condition
� Beamforming step advantageous for the noise reduction
c©Fraunhofer IDMT11/16
4. Objective Measures
PESQ
Unprocessed8 Channels1 Channel
PE
SQ
Simulated
700 ms500 ms250 msnear far near far near far Mean
1.5
2.5
3.5
� Improvement of PESQ score in all condition illustrate theoverall improvement in speech quality
c©Fraunhofer IDMT12/16
5. Subjective Tests
MUSHRA test
Intermediate results of the subjective test ran by the organizers:
� Tests carried out separately for 1 and 8 channels
UnprocessedOverall QualityReverberation
Real
Single-channel
Sim
MU
SH
RA
scor
e[%
]
near farnear far0
1020304050607080
Real
Multichannel
Sim
near farnear far0
1020304050607080
� Improvement for all tested condition
� Higher improvement of the overall quality
c©Fraunhofer IDMT13/16
6. Preprocessing for ASR
Word Error Rate
� Baseline recognizer provided by the organizers
� Using pre-trained models on clean data
Simulated
2.3
3
3.2
7
−11.9
4
−20.3
1
−15.7
8
−18.6
9
−10.1
8
1.6
5
2.8
3
−24.9
1
−42.5
3
−21.6
3
−30.4
8
−19.1
7
WE
R \
%
250, near 250, far 500, near 500, far 700, near 700, far Mean10
20
30
40
50
60
70
80
90
100
1 Channel
8 Channels
baseline
Real
−12.7
4
−11.4
8
−12.1
1
−25.9
3
−25.7
6
−25.8
5
700, near 700, far Mean10
20
30
40
50
60
70
80
90
100
c©Fraunhofer IDMT14/16
7. Conclusion
� System based on combination of MVDR beamformer andspectral enhancement
� All parameters are blindly estimated
� Speech enhancement achieved in all conditions in terms of:
� Objective measures� Subjective tests� Word error rate
c©Fraunhofer IDMT15/16
Thank you very much for your attention
House of Hearing, Oldenburg
Questions ?
Fraunhofer IDMTProject Group Hearing, Speech and Audio Technology
Oldenburg UniversitySignal Processing Group
c©Fraunhofer IDMT16/16