NOISE REDUCTION FOR ENHANCING SPEECH QUALITY...
-
Presented by: Ryandhimas E. Zezario
Noise Reduction for Enhancing Speech Quality and Intelligibility
ryandhimas@citi.sinica.edu.tw
DC & CV Lab, CSIE, NTU, 2020
Course instructor: Dr. Chiou-Shann Fuh (傅楸善)
-
OUTLINE
• 11.1 Introduction
• 11.2 Noise Reduction Using Two Stages Deep Learning Model
• 11.3 Specialized Speech Enhancement Model Selection Based on Learned Non-Intrusive Quality Assessment Metric
• 11.4 Speech Enhancement based on Denoising Autoencoder with Multi-branched Encoders
-
INTRODUCTION
• Over the last few decades, a great deal of research has been done on various aspects and properties of speech signal processing.
• However, improving intelligibility for both human listening and machine recognition in real acoustic conditions remains a challenging task.
[Figure: environmental mismatch. Clean speech $x(t)$ is corrupted by background noise $n(t)$ (additive noise) and by a channel effect $h(t)$ (convolutional noise), which gives the noisy speech
$y(t) = (x(t) + n(t)) * h(t)$,
where $*$ denotes convolution. A speech signal processing algorithm is then applied to improve intelligibility and quality. Audio example: temm/clean.wav]
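The signal model above can be simulated in a few lines. This is only a minimal sketch: the sine tone, the white noise, the three-tap channel, and the function name `simulate_noisy_speech` are illustrative assumptions, not data or code from the slides.

```python
import numpy as np

def simulate_noisy_speech(x, n, h):
    """Return y = (x + n) * h, where * is linear convolution, trimmed to len(x)."""
    x = np.asarray(x, dtype=float)
    n = np.asarray(n, dtype=float)
    h = np.asarray(h, dtype=float)
    if x.shape != n.shape:
        raise ValueError("x and n must have the same length")
    # Additive noise first, then the convolutional channel effect.
    return np.convolve(x + n, h, mode="full")[: len(x)]

# Toy stand-ins (assumptions): a 220 Hz tone as "speech", white noise, short channel.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 220.0 * t)
n = 0.1 * rng.standard_normal(x.shape)
h = np.array([1.0, 0.5, 0.25])
y = simulate_noisy_speech(x, n, h)
```

With `h = [1.0]` the channel is transparent and the model reduces to purely additive noise.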
-
INTRODUCTION
• Issues that limit speech enhancement performance
➢ Residual noise
➢ Noticeable speech distortions in enhanced speech signals
-
INTRODUCTION
• Deep Learning Models
➢ Deep Denoising Autoencoder (DDAE)
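[Figure: DDAE-based speech enhancement. The noisy waveform is transformed by FFT into amplitude and phase. Feature extraction feeds the noisy amplitude to the DDAE model (hidden layers), which outputs an estimate of the clean amplitude. Spectrum recovery combines this estimate with the phase, and IFFT produces the enhanced waveform. Panels show waveforms and spectrograms (0 to 4 kHz, 0 to 3 s).]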
➢ X. Lu, Y. Tsao, S. Matsuda and C. Hori, “Speech Enhancement based on Deep Denoising Autoencoder,” Interspeech 2013.
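The FFT, amplitude/phase split, and IFFT steps around the DDAE model can be sketched for a single frame as follows. This is a hedged illustration, not the authors' implementation: the 512-sample frame, the natural-log power features, and the names `analyze`, `synthesize`, and `enhance_frame` are assumptions, and `model` stands in for a trained DDAE.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); an assumption of this sketch

def analyze(frame):
    """FFT of one frame, returned as (log power spectrum, phase)."""
    spec = np.fft.rfft(frame)
    return np.log(np.abs(spec) ** 2 + EPS), np.angle(spec)

def synthesize(log_power, phase, n):
    """Rebuild a time-domain frame from log power and phase via IFFT."""
    amplitude = np.sqrt(np.exp(log_power))
    return np.fft.irfft(amplitude * np.exp(1j * phase), n=n)

def enhance_frame(frame, model):
    """Apply `model` to the log power spectrum and reuse the noisy phase."""
    log_power, phase = analyze(frame)
    return synthesize(model(log_power), phase, len(frame))

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)          # 512 samples give 257 frequency bins
features, _ = analyze(frame)
restored = enhance_frame(frame, lambda f: f)  # identity "model" as a sanity check
```

Only the amplitude is modified; the phase is passed through unchanged, as the figure indicates.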
-
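TWO STAGES DEEP LEARNING MODELS
Offline Stage
[Figure: offline (training) stage. The noisy corpus is passed through speech enhancement to produce an enhanced corpus. Feature extractors are applied to the noisy, enhanced, and clean corpora. Subtracting the noisy features from the enhanced features gives DEN; subtracting the noisy features from the clean features gives DCN. A model maps DEN to DCN, and a loss function drives the parameter update.]
With $\mathbf{X}$, $\mathbf{Y}$, and $\hat{\mathbf{X}}$ denoting the clean, noisy, and enhanced features over $N$ frames:
$\mathbf{D}^{EN} = \hat{\mathbf{X}} - \mathbf{Y} = [\mathbf{d}^{EN}_1, \ldots, \mathbf{d}^{EN}_n, \ldots, \mathbf{d}^{EN}_N]$, where $\mathbf{d}^{EN}_n = \hat{\mathbf{x}}_n - \mathbf{y}_n$,
$\mathbf{D}^{CN} = \mathbf{X} - \mathbf{Y} = [\mathbf{d}^{CN}_1, \ldots, \mathbf{d}^{CN}_n, \ldots, \mathbf{d}^{CN}_N]$, where $\mathbf{d}^{CN}_n = \mathbf{x}_n - \mathbf{y}_n$.
In this study, the DDAE model is used to model the mapping function from DEN to DCN:
$h_1(\mathbf{d}^{EN}_n) = \sigma(\mathbf{W}_0 \mathbf{d}^{EN}_n + \mathbf{b}_0)$,
⋮
$h_J(\mathbf{d}^{EN}_n) = \sigma(\mathbf{W}_{J-1} h_{J-1}(\mathbf{d}^{EN}_n) + \mathbf{b}_{J-1})$,
$\hat{\mathbf{d}}^{CN}_n = \mathbf{W}_J h_J(\mathbf{d}^{EN}_n) + \mathbf{b}_J$.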
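The offline stage can be sketched as a forward pass through J sigmoid hidden layers with a linear output layer, plus a mean squared error between the predicted and true DCN features. Everything here is an assumption made for illustration: the logistic sigmoid for σ, the MSE loss, the tiny layer sizes, the random data, and all function names. Frames are stored as rows, so the layer equations appear as `h @ W.T + b`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_ddae(dim, hidden, num_hidden, seed=0):
    """Random (W, b) pairs for num_hidden hidden layers and one linear output layer."""
    rng = np.random.default_rng(seed)
    sizes = [dim] + [hidden] * num_hidden + [dim]
    return [(0.01 * rng.standard_normal((o, i)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def ddae_forward(params, d_en):
    """Map DEN features (frames x dim) to predicted DCN features."""
    h = d_en
    for W, b in params[:-1]:
        h = sigmoid(h @ W.T + b)   # hidden layers: sigma(W h + b)
    W_out, b_out = params[-1]
    return h @ W_out.T + b_out     # linear output layer

def mse_loss(params, D_en, D_cn):
    """Mean squared error between predicted and true DCN features."""
    return float(np.mean((ddae_forward(params, D_en) - D_cn) ** 2))

# Toy features (assumptions): 5 frames, 8 dimensions.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))       # clean
Y = rng.standard_normal((5, 8))       # noisy
X_hat = rng.standard_normal((5, 8))   # first-stage enhanced
D_en, D_cn = X_hat - Y, X - Y
params = init_ddae(dim=8, hidden=16, num_hidden=3)
loss = mse_loss(params, D_en, D_cn)
```

A real training loop would minimise this loss with a gradient-based optimiser; that part is omitted here.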
-
TWO STAGES DEEP LEARNING MODELS
Online Stage
[Figure: online stage. The noisy corpus is passed through speech enhancement to produce an enhanced corpus, and feature extractors are applied to both. The DEN features (enhanced minus noisy) are fed to the trained model, and its output is added back to the noisy features.]
Based on the computed DEN features, we then estimate the predicted DCN features using the DDAE model that was trained in the offline stage:
$\hat{\mathbf{d}}^{CN}_n = F(\mathbf{d}^{EN}_n)$.
Then $\hat{\mathbf{d}}^{CN}_n$ is used to perform feature compensation:
$\tilde{\mathbf{x}}_n = \mathbf{y}_n + \hat{\mathbf{d}}^{CN}_n$.
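The online stage reduces to three array operations. The sketch below assumes the features are NumPy arrays of equal shape and that `F` is any callable standing in for the trained DDAE; the name `dpf_compensate` is hypothetical.

```python
import numpy as np

def dpf_compensate(Y, X_hat, F):
    """Feature compensation: noisy features plus the predicted DCN features."""
    D_en = X_hat - Y       # DEN: enhanced minus noisy
    D_cn_hat = F(D_en)     # predicted DCN from the trained mapping
    return Y + D_cn_hat

# Toy features (assumptions): 4 frames, 6 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))       # clean (unknown at test time)
Y = rng.standard_normal((4, 6))       # noisy
X_hat = rng.standard_normal((4, 6))   # first-stage enhanced

same_as_first_stage = dpf_compensate(Y, X_hat, lambda d: d)
oracle = dpf_compensate(Y, X_hat, lambda d: X - Y)
```

Two limiting cases help check the logic: an identity mapping returns the first-stage output unchanged, and an oracle mapping that outputs the true DCN recovers the clean features exactly.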
-
EXPERIMENTAL SETUP
• The MHINT sentences were used to test the proposed DPF approach.
• The corpus consisted of 300 clean utterances: 250 were used as training data and 50 as testing data.
• Two types of noise, car and two-talker, recorded in real environments, were used to generate the noisy speech.
• The DPF model was realized by a DDAE model with three hidden layers and 2500 hidden nodes in each layer.
• Features: 257-dimensional log power spectral feature vectors.
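To give a sense of the model size, the parameter count of such a network can be worked out directly. This sketch assumes fully connected layers with biases and a 257-dimensional output matching the input; neither detail is stated on the slide. A 257-dimensional spectrum is what a 512-point FFT would produce (512/2 + 1), which is also an assumption.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))

# 257-dim input, three hidden layers of 2500 nodes, 257-dim output (assumed).
layer_sizes = [257, 2500, 2500, 2500, 257]
total = dense_param_count(layer_sizes)
print(total)  # 13792757
```

Under these assumptions the model has roughly 13.8 million parameters, most of them in the two 2500 x 2500 hidden-to-hidden weight matrices.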
-
OBJECTIVE EVALUATION
Car Noise
SNR (dB)   PESQ               STOI               LSD
           DDAE   DDAE-DPF    DDAE   DDAE-DPF    DDAE   DDAE-DPF
10         2.72   3.32        0.89   0.95        0.77   0.61
6          2.59   3.04        0.88   0.93        0.81   0.69
2          2.37   2.71        0.86   0.90        0.89   0.78
0          2.24   2.53        0.85   0.88        0.93   0.83
-2         2.16   2.39        0.84   0.86        0.96   0.88
-6         1.95   2.13        0.80   0.82        1.09   1.02
-10        1.74   1.88        0.77   0.78        1.22   1.16
Ave        2.25   2.57        0.84   0.87        0.95   0.85
DDAE-DPF outperforms DDAE in terms of the PESQ, STOI, and LSD metrics consistently across all SNR levels.
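For PESQ and STOI higher is better, while for LSD (log-spectral distance) lower is better. The slides do not give the exact LSD formula, so the sketch below uses one common definition as an assumption: the root-mean-square difference of log10 power spectra over frequency, averaged over frames. The function name and the `eps` floor are also assumptions, and the scale may differ from the one used for the table.

```python
import numpy as np

def log_spectral_distance(clean_power, enhanced_power, eps=1e-12):
    """Mean over frames of the RMS log10-power difference over frequency bins.

    Both inputs are (frames x bins) arrays of power spectra.
    """
    log_c = np.log10(np.asarray(clean_power, dtype=float) + eps)
    log_e = np.log10(np.asarray(enhanced_power, dtype=float) + eps)
    per_frame = np.sqrt(np.mean((log_c - log_e) ** 2, axis=1))
    return float(np.mean(per_frame))

clean_power = np.ones((4, 257))
lsd_same = log_spectral_distance(clean_power, clean_power)
lsd_10x = log_spectral_distance(clean_power, 10.0 * clean_power)
```

Identical spectra give a distance of zero, and a uniform factor of 10 in power gives a distance of about 1 under this definition.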
-
OBJECTIVE EVALUATION
Two Talker Noise
SNR (dB)   PESQ               STOI               LSD
           DDAE   DDAE-DPF    DDAE   DDAE-DPF    DDAE   DDAE-DPF
10         2.41   3.06        0.88   0.93        0.87   0.76
6          2.21   2.73        0.86   0.91        0.92   0.83
2          1.99   2.47        0.84   0.88        0.96   0.89
0          1.91   2.33        0.82   0.86        1.00   1.16
-2         1.81   2.18        0.81   0.84        1.04   0.93
-6         1.60   1.92        0.76   0.80        1.14   1.03
-10        1.44   1.69        0.71   0.74        1.31   0.96
Ave        1.91   2.34        0.81   0.85        1.04   0.94
Direct spectral mapping and ratio-masking mechanisms can be used together to leverage their complementary information and achieve better SE performance.
The first-stage DDAE model performs direct spectral mapping, and the second DDAE model (which serves as a DPF) predicts the ratio of clean to noisy speech in order to compute the enhanced speech.
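One way to see why a model that predicts a feature difference acts as a ratio predictor is the short derivation below. It assumes natural-log power spectral features, with $|X|^2$ and $|Y|^2$ the clean and noisy power in one time-frequency bin, $d$ the clean-minus-noisy feature difference, and $\hat{d}$ its prediction; these symbols are introduced here for illustration only.

```latex
\begin{aligned}
d &= \log |X|^2 - \log |Y|^2 = \log \frac{|X|^2}{|Y|^2}, \\
|\tilde{X}|^2 &= \exp\!\left(\log |Y|^2 + \hat{d}\right) = |Y|^2 \, e^{\hat{d}}.
\end{aligned}
```

Adding $\hat{d}$ in the log domain is therefore the same as multiplying the noisy power by the estimated clean-to-noisy ratio $e^{\hat{d}}$.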
-
SPECTROGRAMS ANALYSIS
[Figure: spectrograms. Panels: Clean, Noisy, DDAE, DDAE-DPF, MMSE, MMSE-DPF.]
We can note that almost all of the SE methods effectively reduce the noise components in the noisy speech utterance.
We also observe that the DPF approach further improves the MMSE- and DDAE-enhanced speech by eliminating distortions and restoring the detailed information of the speech signals.
The results suggest that the overall performance of the DPF approach depends on the capability of the preceding SE method.
-
ASR PERFORMANCES
[Figure: CER (%) versus SNR (dB) at -10, -6, -2, 0, 2, 6, and 10 dB for Noisy, DDAE, and DDAE-DPF speech; the CER axis runs from 10 to 70.]
We used the Google ASR as the speech recognizer.
Compared with the unprocessed noisy speech, the speech processed by the single-stage DDAE achieved lower character error rates (CERs) in relatively noisier conditions (-10 to 2 dB SNR) but higher CERs in relatively cleaner conditions (SNR higher than 6 dB).
DDAE-DPF achieves further improvements over the single-stage DDAE and consistently outperforms the unprocessed noisy speech from low to high SNR levels.
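CER is conventionally the character-level edit distance between the recognizer output and the reference transcript, divided by the reference length. The sketch below implements that standard definition; the function names are assumptions, and the slides do not say which scoring tool was actually used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

For example, one wrong character in a four-character reference gives a CER of 25%.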
-
INCORPORATING QUALITY ASSESSMENT METRIC
[Figure: Quality-Net incorporated into the speech enhancement pipeline. Labels: Noisy Speech, FFT, Quality-Net, IFFT, Enhanced Speech.]
-
INCORPORATING QUALITY ASSESSMENT METRIC
WSJ dataset
Training: 37,416 training utterances were corrupted by 90 types of stationary and non-stationary noise at several SNR levels from 20 to -10 dB.
Testing: 330 test utterances were corrupted by four types of unseen noise (car, pink, street, and babble) at SNR levels of -10, -5, 0, 5, 10, and 15 dB.
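Corrupting clean speech at a chosen SNR is usually done by scaling the noise before adding it. The sketch below shows that standard procedure; the function name, the toy signals, and the choice to truncate the noise to the speech length are assumptions, not details from the slides.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise power ratio equals snr_db, then add it."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Toy signals (assumptions): a tone as "speech" and white noise.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220.0 * np.arange(16000) / 16000.0)
noise = rng.standard_normal(20000)
noisy = mix_at_snr(clean, noise, snr_db=5.0)
measured = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Measuring the SNR of the mixture afterwards, as in the last line, is a quick check that the scaling is correct.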
-
SPEECH ENHANCEMENT BASED ON DENOISING AUTOENCODER WITH MULTI-BRANCHED ENCODERS
-
THANK YOU