NOISE REDUCTION FOR ENHANCING SPEECH QUALITY...

PRESENTED BY: RYANDHIMAS E. ZEZARIO

NOISE REDUCTION FOR ENHANCING SPEECH QUALITY AND INTELLIGIBILITY

RYANDHIMAS@CITI.SINICA.EDU.TW

DC & CV LAB, CSIE, NTU 2020

Instructor: Dr. Chiou-Shann Fuh (傅楸善)

OUTLINE

• 11.1 Introduction

• 11.2 Noise Reduction Using Two Stages Deep Learning Model

• 11.3 Specialized Speech Enhancement Model Selection Based on Learned Non-Intrusive Quality Assessment Metric

• 11.4 Speech Enhancement based on Denoising Autoencoder with Multi-branched Encoders

INTRODUCTION

• Over the last few decades, a great amount of research has been done on various aspects and properties of speech signal processing.

• However, improving intelligibility for both human listening and machine recognition in real acoustic conditions remains a challenging task.


[Figure: Environmental mismatch. Clean speech x(t) is corrupted by background noise n(t) (additive noise) and a channel effect h(t) (convolutional noise), producing noisy speech y(t). A speech signal processing algorithm then targets intelligibility and quality.]

$$y(t) = (x(t) + n(t)) * h(t)$$
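The degradation model in the figure can be simulated with a few lines of NumPy. This is only a minimal sketch: the function name `simulate_noisy_speech`, the sine-wave stand-in for clean speech, the Gaussian noise, and the three-tap channel response are all illustrative assumptions, not material from the slides.

```python
import numpy as np

def simulate_noisy_speech(x, n, h):
    """Return y = (x + n) * h, where * is linear convolution.

    x: clean signal, n: additive noise (same length as x),
    h: channel impulse response. Output is trimmed to len(x).
    """
    x = np.asarray(x, dtype=float)
    n = np.asarray(n, dtype=float)
    h = np.asarray(h, dtype=float)
    if x.shape != n.shape:
        raise ValueError("x and n must have the same length")
    # Additive noise is applied first, then the convolutional channel effect.
    return np.convolve(x + n, h, mode="full")[: len(x)]

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 440 * t)          # stand-in for clean speech (assumption)
n = 0.1 * rng.standard_normal(x.shape)   # stand-in background noise (assumption)
h = np.array([1.0, 0.5, 0.25])           # toy channel impulse response (assumption)
y = simulate_noisy_speech(x, n, h)
```

With `h = [1.0]` the channel is transparent and the model reduces to purely additive noise.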


• Issues with speech enhancement performance

➢ Residual noise

➢ Speech distortions are noticeable in enhanced speech signals


• Deep Learning Models

➢ Deep Denoising Autoencoder (DDAE)

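[Figure: DDAE-based speech enhancement. An FFT splits the noisy signal into amplitude and phase. After feature extraction, the DDAE model maps the noisy amplitude to the clean amplitude through its hidden layers. The spectrum is then recovered from the enhanced amplitude and the phase, and an IFFT returns the waveform.]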

➢ X. Lu, Y. Tsao, S. Matsuda and C. Hori, “Speech Enhancement based on Deep Denoising Autoencoder,” Interspeech 2013.
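The FFT, amplitude/phase split, model, and IFFT steps of the DDAE pipeline can be sketched per frame as follows. This is a hedged illustration rather than the authors' implementation: `enhance_frame` and `amplitude_model` are hypothetical names, the log-amplitude feature is an assumption, and the trained DDAE is replaced by any callable supplied by the caller.

```python
import numpy as np

def enhance_frame(noisy_frame, amplitude_model):
    """Enhance one frame: FFT -> amplitude/phase -> model -> IFFT.

    amplitude_model is a stand-in for the trained DDAE; it receives and
    returns log-amplitude features (the feature choice is an assumption).
    """
    noisy_frame = np.asarray(noisy_frame, dtype=float)
    spectrum = np.fft.rfft(noisy_frame)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)                 # phase of the noisy input is reused
    log_amp = np.log(amplitude + 1e-8)         # feature extraction
    enhanced_amp = np.exp(amplitude_model(log_amp))
    enhanced_spectrum = enhanced_amp * np.exp(1j * phase)  # spectrum recovery
    return np.fft.irfft(enhanced_spectrum, n=len(noisy_frame))

def identity_model(log_amp):
    """Placeholder model that leaves the features unchanged."""
    return log_amp

rng = np.random.default_rng(1)
frame = rng.standard_normal(512)   # a 512-sample frame yields 257 frequency bins
restored = enhance_frame(frame, identity_model)
```

With the identity placeholder the frame is reconstructed almost exactly, which is a convenient sanity check before plugging in a real model.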

TWO STAGES DEEP LEARNING MODELS

[Figure: Offline stage. The noisy corpus is passed through speech enhancement to produce the enhanced corpus. Feature extractors are applied to the noisy, clean, and enhanced corpora. Subtracting the noisy features from the clean features gives the DCN features, and subtracting them from the enhanced features gives the DEN features. The model that takes DEN as input is trained against DCN through a loss function and parameter updates.]

$$\mathbf{D}^{CN} = \mathbf{X} - \mathbf{Y} = [\mathbf{d}^{CN}_1, \mathbf{d}^{CN}_2, \ldots, \mathbf{d}^{CN}_N], \quad \text{where } \mathbf{d}^{CN}_n = \mathbf{x}_n - \mathbf{y}_n.$$

In this study, the DDAE model is used to model the mapping function:

$$h_1(\mathbf{d}^{EN}) = \sigma(\mathbf{W}_0 \mathbf{d}^{EN} + \mathbf{b}_0),$$

$$\vdots$$

$$h_J(\mathbf{d}^{EN}) = \sigma(\mathbf{W}_{J-1} h_{J-1}(\mathbf{d}^{EN}) + \mathbf{b}_{J-1}),$$

$$\hat{\mathbf{d}}^{CN} = \mathbf{W}_J h_J(\mathbf{d}^{EN}) + \mathbf{b}_J.$$
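The layer-by-layer mapping of the DDAE can be written as a short forward pass. The sketch below assumes that the activation is the logistic sigmoid and uses small random weights purely for illustration; `ddae_forward` and the toy layer sizes are hypothetical and do not reproduce the trained model.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid (assumed form of the activation sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

def ddae_forward(d, weights, biases):
    """Forward pass with weights = [W_0, ..., W_J] and biases = [b_0, ..., b_J].

    Hidden layers apply sigma(W h + b); the final layer is linear.
    """
    h = np.asarray(d, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)               # hidden layer
    return weights[-1] @ h + biases[-1]      # linear output layer

rng = np.random.default_rng(0)
dims = [257, 16, 16, 16, 257]   # three small hidden layers, for illustration only
weights = [0.1 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n_out) for n_out in dims[1:]]
d_hat = ddae_forward(rng.standard_normal(257), weights, biases)
```

The output has the same dimensionality as the input feature vector, so it can be added back to the noisy features directly.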

[Figure: Online stage. The noisy corpus is passed through speech enhancement to produce the enhanced corpus. Feature extractors are applied to both, and their difference gives the DEN features. The trained DDAE maps the DEN features to the predicted DCN features, which are added to the noisy features.]

Based on the computed DEN features, we then estimate the predicted DCN features using the DDAE model that is trained in the offline stage:

$$\hat{\mathbf{d}}^{CN} = F(\mathbf{d}^{EN}).$$

Then $\hat{\mathbf{d}}^{CN}$ is used to perform feature compensation:

$$\hat{\mathbf{x}} = \mathbf{y} + \hat{\mathbf{d}}^{CN}.$$
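The offline and online stages can be tied together in a toy end-to-end sketch. To keep it short, the DDAE is replaced here by a ridge-regularized linear least-squares map, which is a deliberate simplification and not the model used in the slides. The helper names `fit_linear_map` and `apply_dpf`, the synthetic features, and the sign convention DEN = enhanced minus noisy are assumptions made for illustration.

```python
import numpy as np

def fit_linear_map(d_en, d_cn, ridge=1e-6):
    """Offline stage stand-in: least-squares map from DEN to DCN.

    Rows are frames. A linear model replaces the DDAE for brevity.
    """
    a = np.hstack([d_en, np.ones((d_en.shape[0], 1))])   # append a bias column
    gram = a.T @ a + ridge * np.eye(a.shape[1])
    return np.linalg.solve(gram, a.T @ d_cn)

def apply_dpf(noisy, enhanced, params):
    """Online stage: DEN -> predicted DCN -> feature compensation."""
    d_en = enhanced - noisy
    a = np.hstack([d_en, np.ones((d_en.shape[0], 1))])
    d_cn_hat = a @ params
    return noisy + d_cn_hat          # compensated features

rng = np.random.default_rng(0)
clean = rng.standard_normal((200, 4))               # synthetic clean features
noisy = clean + rng.standard_normal((200, 4))       # synthetic noisy features
enhanced = noisy + 0.5 * (clean - noisy)            # toy first stage: halfway to clean
params = fit_linear_map(enhanced - noisy, clean - noisy)
restored = apply_dpf(noisy, enhanced, params)
```

In this synthetic case DCN is exactly twice DEN, so the fitted map recovers the clean features; real speech features would need the nonlinear DDAE.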

EXPERIMENTAL SETUP


• The MHINT sentences were used to test the proposed DPF approach.

• The corpus consisted of 300 clean utterances: 250 were used as training data and 50 as testing data.

• Two types of noise, car and two-talker, recorded in real environments, were used to generate noisy speech.

• The DPF model was realized by a DDAE model consisting of three hidden layers, with 2500 hidden nodes in each layer.

• 257-dimensional log power spectral feature vectors were used.
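A 257-dimensional log power spectral vector is what a 512-point real FFT produces per frame, so the feature extraction can be sketched as below. The 512-sample frame, 256-sample hop, Hamming window, and natural logarithm are assumptions; the slides state only the feature dimensionality.

```python
import numpy as np

def log_power_spectra(signal, frame_len=512, hop=256, eps=1e-10):
    """Return an array of shape (n_frames, frame_len // 2 + 1) of log power spectra."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames * window, axis=1)
    # eps avoids log(0) for silent bins.
    return np.log(np.abs(spectra) ** 2 + eps)

features = log_power_spectra(np.random.default_rng(0).standard_normal(16000))
print(features.shape)  # (61, 257)
```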

OBJECTIVE EVALUATION


Car Noise

            PESQ               STOI               LSD
SNR (dB)    DDAE   DDAE-DPF    DDAE   DDAE-DPF    DDAE   DDAE-DPF
10          2.72   3.32        0.89   0.95        0.77   0.61
6           2.59   3.04        0.88   0.93        0.81   0.69
2           2.37   2.71        0.86   0.90        0.89   0.78
0           2.24   2.53        0.85   0.88        0.93   0.83
-2          2.16   2.39        0.84   0.86        0.96   0.88
-6          1.95   2.13        0.80   0.82        1.09   1.02
-10         1.74   1.88        0.77   0.78        1.22   1.16
Ave         2.25   2.57        0.84   0.87        0.95   0.85

DDAE-DPF outperforms DDAE in terms of PESQ, STOI, and LSD metrics consistently over all SNR levels.
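Of the three metrics, LSD is the simplest to compute by hand, and lower values mean the enhanced spectrum is closer to the clean one. The sketch below uses one common definition (root-mean-square log-spectral difference per frame, averaged over frames); the slides do not state which variant was used, so the formula and the function name `log_spectral_distance` should be read as assumptions.

```python
import numpy as np

def log_spectral_distance(lps_ref, lps_est):
    """Mean over frames of the RMS difference between two log power spectra.

    Both inputs have shape (n_frames, n_bins).
    """
    lps_ref = np.asarray(lps_ref, dtype=float)
    lps_est = np.asarray(lps_est, dtype=float)
    if lps_ref.shape != lps_est.shape:
        raise ValueError("inputs must have the same shape")
    per_frame = np.sqrt(np.mean((lps_ref - lps_est) ** 2, axis=1))
    return float(np.mean(per_frame))

reference = np.zeros((10, 257))
shifted = reference + 0.5        # constant offset of 0.5 in every bin
print(log_spectral_distance(reference, shifted))  # 0.5
```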


Two-Talker Noise

            PESQ               STOI               LSD
SNR (dB)    DDAE   DDAE-DPF    DDAE   DDAE-DPF    DDAE   DDAE-DPF
10          2.41   3.06        0.88   0.93        0.87   0.76
6           2.21   2.73        0.86   0.91        0.92   0.83
2           1.99   2.47        0.84   0.88        0.96   0.89
0           1.91   2.33        0.82   0.86        1.00   1.16
-2          1.81   2.18        0.81   0.84        1.04   0.93
-6          1.60   1.92        0.76   0.80        1.14   1.03
-10         1.44   1.69        0.71   0.74        1.31   0.96
Ave         1.91   2.34        0.81   0.85        1.04   0.94

Direct spectral mapping and ratio-masking mechanisms can be used together to leverage the complementary information to achieve better SE performance.

The first-stage DDAE model performed direct spectral mapping, and the second DDAE model (which serves as a DPF) predicted the ratio of clean to noisy speech in order to compute the enhanced speech.
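One way to see why a difference-based post-filter can be read as a ratio predictor is the worked identity below. It assumes that the features are log power spectra (as in the experimental setup) taken with a natural logarithm. The per-bin symbols $X(k)$ and $Y(k)$ for the clean and noisy spectra, and $x_k$, $y_k$, $d_k$ for the corresponding features, are notation introduced here for illustration.

```latex
\begin{align*}
d_k &= x_k - y_k
     = \log |X(k)|^2 - \log |Y(k)|^2
     = \log \frac{|X(k)|^2}{|Y(k)|^2}, \\
\hat{x}_k &= y_k + \hat{d}_k
\;\Longrightarrow\;
|\hat{X}(k)|^2 = |Y(k)|^2 \, e^{\hat{d}_k}.
\end{align*}
```

Under these assumptions, adding the predicted difference in the log domain is equivalent to scaling the noisy power spectrum by a predicted clean-to-noisy ratio.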

SPECTROGRAMS ANALYSIS


We can note that almost all of the SE methods can effectively reduce noise components from the noisy speech utterance.

We also observe that the DPF approach can further improve the MMSE- and DDAE-enhanced speech by eliminating distortions and restoring the detailed information of the speech signals.

The results suggest that the overall performance of the DPF approach depends on the capability of the preceding SE method.

[Figure: Spectrograms of six signals: Clean, Noisy, DDAE, DDAE-DPF, MMSE, and MMSE-DPF.]

ASR PERFORMANCES

[Figure: CER (%) versus SNR (dB) at -10, -6, -2, 0, 2, 6, and 10 dB for Noisy, DDAE, and DDAE-DPF speech.]

We used Google ASR as the speech recognizer.

Compared with the unprocessed noisy speech, the speech processed by the single-stage DDAE achieved lower CERs in relatively noisier conditions (-10 to 2 dB SNR levels) but higher CERs in relatively cleaner conditions (SNR higher than 6 dB).

DDAE-DPF achieves further improvements over the single-stage DDAE and outperforms unprocessed noisy speech consistently over low to high SNR levels.
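CER is read here as character error rate: the character-level edit distance between the reference transcript and the recognizer output, divided by the reference length. The following is a minimal sketch of that standard computation; the function names are illustrative, and it says nothing about how the scores on the slide were actually produced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(ref, hyp):
    """CER in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)

print(character_error_rate("abcd", "abxd"))  # 25.0
```

Because the function works on any sequence of characters, it applies unchanged to Mandarin transcripts, where each character is one unit.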

INCORPORATING QUALITY ASSESSMENT METRIC

[Figure: Block diagram with the labels Noisy Speech, FFT, Quality-Net, IFFT, and Enhanced Speech.]


WSJ dataset

Training: 37,416 training utterances were corrupted by 90 types of noise, including stationary and non-stationary noise, at several SNR levels from 20 to -10 dB.

Testing: 330 test utterances were corrupted by four types of unseen noise (car, pink, street, and babble) at seven SNR levels (-10, -5, 0, 5, 10, and 15 dB).
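Corrupting clean utterances at a chosen SNR usually means scaling the noise so that the clean-to-noise power ratio hits the target. The sketch below shows one conventional way to do this; the function name `mix_at_snr`, the noise tiling, and the synthetic signals are assumptions, and the slides do not describe the actual mixing procedure.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean so that 10*log10(P_clean / P_noise) equals snr_db."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(clean):
        # Repeat short noise recordings to cover the utterance.
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    if clean_power == 0 or noise_power == 0:
        raise ValueError("clean and noise must have non-zero power")
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # synthetic stand-in for an utterance
noise = rng.standard_normal(4000)    # synthetic stand-in for a noise recording
mixed = {snr: mix_at_snr(clean, noise, snr) for snr in (-10, -5, 0, 5, 10, 15)}
```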


SPEECH ENHANCEMENT BASED ON DENOISING AUTOENCODER WITH MULTI-BRANCHED ENCODERS


THANK YOU
