NOISE REDUCTION FOR ENHANCING SPEECH QUALITY AND INTELLIGIBILITY

Presented by: Ryandhimas E. Zezario ([email protected])
DC&CV Lab, CSIE, NTU, 2020
Course instructor: Prof. Chiou-Shann Fuh (傅楸善 博士)


  • OUTLINE

    • 11.1 Introduction

    • 11.2 Noise Reduction Using a Two-Stage Deep Learning Model

    • 11.3 Specialized Speech Enhancement Model Selection Based on a Learned
      Non-Intrusive Quality Assessment Metric

    • 11.4 Speech Enhancement Based on a Denoising Autoencoder with
      Multi-Branched Encoders

    2

  • INTRODUCTION

    3

    • Over the last few decades, a great amount of research has been done on
      various aspects and properties of speech signal processing.

    • However, improving intelligibility for both human listening and machine
      recognition in real acoustic conditions remains a challenging task.

    [Figure: environmental mismatch. Clean speech x(t) is corrupted by additive
    background noise n(t) and a convolutional channel effect h(t), giving noisy
    speech y(t), which is fed to a speech signal processing algorithm evaluated
    for intelligibility and quality. Audio example: temm/clean.wav]

    y(t) = (x(t) + n(t)) * h(t)
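The degradation model y(t) = (x(t) + n(t)) * h(t) can be sketched in a few lines of NumPy. The signals below are toy arrays standing in for real recordings:

```python
import numpy as np

def make_noisy(x, n, h):
    """Noisy-speech model y(t) = (x(t) + n(t)) * h(t): additive
    background noise n(t) followed by convolution with a channel
    impulse response h(t)."""
    return np.convolve(x + n, h, mode="full")

# Toy signals standing in for real waveforms.
x = np.array([1.0, 0.5, -0.5])   # clean speech
n = np.array([0.1, -0.1, 0.2])   # additive noise
h = np.array([1.0, 0.3])         # channel impulse response
y = make_noisy(x, n, h)
```

Note that the channel effect is multiplicative in the frequency domain, which is why it is written as a convolution in time.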

  • INTRODUCTION

    4

    • Issues in speech enhancement performance:

    ➢ Residual noise

    ➢ Speech distortions noticeable in the enhanced speech signals

  • INTRODUCTION

    5

    • Deep Learning Models

    ➢ Deep Denoising Autoencoder (DDAE)

    [Figure: DDAE enhancement pipeline. The noisy waveform is framed and passed
    through an FFT; the amplitude spectrum goes through feature extraction and
    the DDAE model (hidden layers), while the phase is kept; spectrum recovery
    and an IFFT produce the enhanced waveform. Spectrograms (0-4 kHz, 0-3 s)
    compare the noisy and clean speech.]

    ➢ X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech Enhancement Based on
      Deep Denoising Autoencoder," Interspeech 2013.
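The FFT → enhance amplitude → IFFT pipeline in the figure can be sketched for a single frame; the trained DDAE is replaced here by a placeholder identity mapping, and the noisy phase is reused as in the slide:

```python
import numpy as np

def enhance_frame(noisy_frame, enhance_fn):
    """One frame of the FFT -> enhance-amplitude -> IFFT pipeline.
    Only the amplitude spectrum is mapped (enhance_fn stands in for
    the trained DDAE); the noisy phase is kept unchanged."""
    spec = np.fft.rfft(noisy_frame)
    amplitude, phase = np.abs(spec), np.angle(spec)
    enhanced_amp = enhance_fn(amplitude)
    return np.fft.irfft(enhanced_amp * np.exp(1j * phase), n=len(noisy_frame))

frame = np.random.randn(512)
identity = lambda a: a            # placeholder for the DDAE mapping
out = enhance_frame(frame, identity)
# With an identity mapping, the frame is reconstructed (to float precision).
```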

  • TWO-STAGE DEEP LEARNING MODELS

    6

    Offline Stage

    [Figure: offline stage. Feature extractors process the clean corpus, the
    noisy corpus, and the enhanced corpus produced by a speech enhancement
    front-end; the clean-noisy difference gives the DCN features and the
    enhanced-noisy difference gives the DEN features. A loss function and
    parameter updates train the model to map DEN to DCN.]

    D = X − Y = [d_1, d_2, …, d_T], where d_t = x_t − y_t.

    h^1(d_t) = σ(W^0 d_t + b^0),

    h^j(d_t) = σ(W^{j−1} h^{j−1}(d_t) + b^{j−1}), j = 2, …, J,

    d̂_t = W^J h^J(d_t) + b^J.

    In this study, the DDAE model is used to model the mapping function.
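The layer equations above can be sketched as a plain NumPy forward pass. The weights below are untrained random placeholders sized as in the experimental setup (three hidden layers of 2500 nodes, 257-dimensional features); the logistic sigmoid for σ is an assumption, since the slides only write σ:

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic sigmoid (assumed choice for sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

def ddae_forward(d, weights, biases):
    """DDAE mapping: h^1 = sigma(W^0 d + b^0),
    h^j = sigma(W^{j-1} h^{j-1} + b^{j-1}), and the linear output
    d_hat = W^J h^J + b^J."""
    h = d
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers
        h = sigmoid(W @ h + b)
    return weights[-1] @ h + biases[-1]           # linear output layer

rng = np.random.default_rng(0)
dims = [257, 2500, 2500, 2500, 257]   # input, 3 hidden layers, output
weights = [0.01 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(4)]
biases = [np.zeros(dims[i + 1]) for i in range(4)]
d_hat = ddae_forward(rng.standard_normal(257), weights, biases)
```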

  • TWO-STAGE DEEP LEARNING MODELS

    7

    Online Stage

    [Figure: online stage. The noisy input and its enhanced version pass
    through feature extractors; their difference gives the DEN features, which
    the trained DDAE maps to the predicted DCN features d̂.]

    d̂ = F(d_DEN),

    x̂ = y + d̂.

    Based on the computed DEN features, we estimate the predicted DCN features
    d̂ using the DDAE model trained in the offline stage; d̂ is then used to
    perform feature compensation.
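The online stage reduces to two steps: predict d̂ = F(d_DEN), then compensate x̂ = y + d̂. A minimal sketch, with toy arrays and a stand-in function playing the role of the trained DDAE:

```python
import numpy as np

def compensate(y_feat, d_den, ddae):
    """Online stage: predict the DCN features d_hat = F(d_DEN) with the
    offline-trained model, then compensate: x_hat = y + d_hat."""
    d_hat = ddae(d_den)
    return y_feat + d_hat

# Toy stand-ins: a fixed scaling plays the role of the trained DDAE.
y_feat = np.array([1.0, 2.0, 3.0])   # noisy features
d_den = np.array([0.5, 0.5, 0.5])    # enhanced-minus-noisy difference
x_hat = compensate(y_feat, d_den, ddae=lambda d: 2 * d)
# x_hat == y_feat + 2 * d_den == [2.0, 3.0, 4.0]
```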

  • EXPERIMENTAL SETUP

    8

    • The MHINT sentences were used to test the proposed DPF approach.

    • The corpus consists of 300 clean utterances: 250 were used as training
      data and 50 as testing data.

    • Two types of noise, car and two-talker, recorded in real environments,
      were used to generate the noisy speech.

    • The DPF model was realized by a DDAE with three hidden layers of 2500
      nodes each.

    • 257-dimensional log-power spectral feature vectors were used.
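The 257-dimensional log-power spectral features can be computed as follows. The 512-point FFT (giving 257 rFFT bins), Hann window, and 50% hop are assumptions; the slides only state the feature dimension:

```python
import numpy as np

def log_power_spectra(signal, frame_len=512, hop=256):
    """Frame the waveform and take log-power spectra: a 512-point
    rFFT of each windowed frame yields 257 bins per frame."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spec = np.fft.rfft(np.array(frames) * np.hanning(frame_len), axis=1)
    return np.log(np.abs(spec) ** 2 + 1e-12)   # small floor avoids log(0)

feats = log_power_spectra(np.random.randn(16000))  # ~1 s at 16 kHz
```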

  • OBJECTIVE EVALUATION

    9

    Car Noise

                 PESQ              STOI              LSD
    Method       DDAE  DDAE-DPF    DDAE  DDAE-DPF    DDAE  DDAE-DPF
    SNR 10       2.72  3.32        0.89  0.95        0.77  0.61
    SNR 6        2.59  3.04        0.88  0.93        0.81  0.69
    SNR 2        2.37  2.71        0.86  0.90        0.89  0.78
    SNR 0        2.24  2.53        0.85  0.88        0.93  0.83
    SNR -2       2.16  2.39        0.84  0.86        0.96  0.88
    SNR -6       1.95  2.13        0.80  0.82        1.09  1.02
    SNR -10      1.74  1.88        0.77  0.78        1.22  1.16
    Ave          2.25  2.57        0.84  0.87        0.95  0.85

    DDAE-DPF consistently outperforms DDAE in terms of the PESQ, STOI, and LSD
    metrics over all SNR levels.
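Of the three metrics, LSD is simple enough to sketch directly. A common definition is used below (RMS log-spectrum difference per frame, averaged over frames); the exact variant behind the tables is not specified in the slides:

```python
import numpy as np

def lsd(clean_spec, enh_spec, eps=1e-12):
    """Log-spectral distance between two magnitude spectrograms
    (frames x bins): RMS log-spectrum difference per frame, averaged
    over frames. Lower is better; identical inputs give 0."""
    diff = np.log(clean_spec ** 2 + eps) - np.log(enh_spec ** 2 + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

spec = np.abs(np.random.randn(10, 257)) + 0.1   # toy magnitude spectrogram
zero_dist = lsd(spec, spec)        # identical spectra -> 0.0
nonzero_dist = lsd(spec, 2 * spec) # any mismatch -> positive distance
```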

  • OBJECTIVE EVALUATION

    10

    Two-Talker Noise

                 PESQ              STOI              LSD
    Method       DDAE  DDAE-DPF    DDAE  DDAE-DPF    DDAE  DDAE-DPF
    SNR 10       2.41  3.06        0.88  0.93        0.87  0.76
    SNR 6        2.21  2.73        0.86  0.91        0.92  0.83
    SNR 2        1.99  2.47        0.84  0.88        0.96  0.89
    SNR 0        1.91  2.33        0.82  0.86        1.00  1.16
    SNR -2       1.81  2.18        0.81  0.84        1.04  0.93
    SNR -6       1.60  1.92        0.76  0.80        1.14  1.03
    SNR -10      1.44  1.69        0.71  0.74        1.31  0.96
    Ave          1.91  2.34        0.81  0.85        1.04  0.94

    Direct spectral mapping and ratio-masking mechanisms can be used together
    to leverage their complementary information and achieve better SE
    performance: the first-stage DDAE model performs direct spectral mapping,
    and the second DDAE model (serving as a DPF) predicts the ratio of clean to
    noisy speech used to compute the enhanced speech.

  • SPECTROGRAM ANALYSIS

    11

    Almost all of the SE methods effectively reduce noise components in the
    noisy speech utterance.

    The DPF approach further improves the MMSE- and DDAE-enhanced speech by
    eliminating distortions and restoring detailed information in the speech
    signals.

    The results suggest that the overall performance of the DPF approach
    depends on the capability of the preceding SE method.

    [Figure: spectrograms of Clean, Noisy, DDAE, DDAE-DPF, MMSE, and MMSE-DPF
    speech.]

  • ASR PERFORMANCE

    12

    [Figure: CER (%) versus SNR (dB, from -10 to 10) for Noisy, DDAE, and
    DDAE-DPF speech.]

    We used the Google ASR engine as the speech recognizer.

    Compared with the unprocessed noisy speech, the speech processed by the
    single-stage DDAE achieved lower CERs in relatively noisier conditions
    (-10 to 2 dB SNR) but higher CERs in relatively cleaner conditions
    (SNR higher than 6 dB).

    DDAE-DPF achieves further improvements over the single-stage DDAE and
    consistently outperforms the unprocessed noisy speech from low to high SNR
    levels.

  • INCORPORATING QUALITY ASSESSMENT METRIC

    13

    [Figure: the noisy speech passes through an FFT, enhancement, and an IFFT
    to produce the enhanced speech, which Quality-Net scores.]

  • INCORPORATING QUALITY ASSESSMENT METRIC

    14

    WSJ dataset

    Training: 37,416 training utterances were corrupted by 90 types of noise,
    stationary and non-stationary, at several SNR levels from 20 to -10 dB.

    Testing: 330 test utterances were injected with four types of unseen noise
    (car, pink, street, and babble) at seven SNR levels (-10, -5, 0, 5, 10,
    and 15 dB).
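The model-selection idea of Section 11.3 reduces to: run each specialized SE model, score each output with the learned non-intrusive quality metric, and keep the highest-scoring one. A minimal sketch with placeholder models and a placeholder scorer (the real scorer would be a trained predictor such as Quality-Net):

```python
def select_best(noisy, se_models, quality_fn):
    """Run every specialized SE model on the noisy input and return the
    candidate that the learned quality metric scores highest."""
    candidates = [model(noisy) for model in se_models]
    scores = [quality_fn(c) for c in candidates]
    return candidates[max(range(len(scores)), key=scores.__getitem__)]

# Toy stand-ins for specialized enhancement models and the quality predictor.
models = [lambda x: x + 1, lambda x: x * 3, lambda x: x - 2]
best = select_best(2, models, quality_fn=lambda c: -abs(c - 5))
# Candidates are 3, 6, 0; the scorer prefers values near 5 -> best == 6.
```

Because the metric is non-intrusive, no clean reference is needed at selection time.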

  • INCORPORATING QUALITY ASSESSMENT METRIC

    15-19

    [Slides 15-19: figures only.]

  • SPEECH ENHANCEMENT BASED ON DENOISING AUTOENCODER WITH MULTI-BRANCHED
    ENCODERS

    20-22

    [Slides 20-22: figures only.]

  • THANK YOU

    23