MMSE Speech Enhancement

SPEECH ENHANCEMENT

Chunjian Li, Aalborg University, Denmark

Introduction

Applications:

- Improving quality and intelligibility (hearing aid, cockpit comm., video conferencing ...)

- Source coding (mobile phone, video conferencing, IP phone ...)

- Pre-processor for other speech processing applications (speech recognition, speaker verification ...)

Introduction

Classification 1

- Single channel

- Multi-channel

  * with acoustic barrier (Adaptive Noise Cancelling)

  * without acoustic barrier (Array Processing)

Classification 2

- Spectrum subtraction (Power Spectral Subtraction, Amplitude Spectral Subtraction, Autocorrelation Subtraction, Non-causal Infinite Wiener Filtering)

- Parametric method (Iterative Wiener Filtering)

- Adaptive noise cancelling

- Adaptive comb filtering

Single Channel Speech Enhancement

Stochastic Model

- Noise process: broadband (white), stationary (or short-time stationary), uncorrelated with speech, additive.

- Speech process: short-time stationary.

- Need short-time processing:

$$y(n;m) = s(n;m) + d(n;m)$$

where $m$ indexes the short-time frame.

Single Channel Speech Enhancement

Important relation in the Power Spectrum domain:

$$\Gamma_y(\omega;m) = \Gamma_s(\omega;m) + \Gamma_d(\omega;m)$$

This is true only when the noise is uncorrelated with the speech signal.

* To be concise, the index "m" is dropped in the following discussion.
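The additivity of the power spectra can be checked numerically. The sketch below is only an illustration under assumptions that are not in the slides: white Gaussian sequences stand in for speech and noise, frames are non-overlapping with an arbitrary length of 64, and the short-time power spectrum is taken as the periodogram |S(ω)|²/N. The helper name short_time_power_spectra is hypothetical.

```python
import numpy as np

def short_time_power_spectra(x, frame_len):
    """Split x into non-overlapping frames and return the periodogram of each frame."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    return np.abs(spectra) ** 2 / frame_len  # shape (n_frames, frame_len // 2 + 1)

rng = np.random.default_rng(0)          # fixed seed, illustrative data only
s = rng.standard_normal(64 * 2000)      # white stand-in for the speech process
d = rng.standard_normal(64 * 2000)      # independent additive noise
y = s + d

gamma_s = short_time_power_spectra(s, 64)
gamma_d = short_time_power_spectra(d, 64)
gamma_y = short_time_power_spectra(y, 64)

# Within a single frame the cross term 2*Re(S D*)/N does not vanish ...
per_frame_gap = np.abs(gamma_y - (gamma_s + gamma_d)).mean()
# ... but it averages out over many frames when s and d are uncorrelated.
avg_gap = np.abs(gamma_y.mean(axis=0) - (gamma_s + gamma_d).mean(axis=0)).max()
print(per_frame_gap, avg_gap)
```

In this toy setting the relation holds on average rather than frame by frame, which is why the noise spectrum used for subtraction is an estimate and not an exact quantity.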

Power Spectral Subtraction

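$$\hat{\Gamma}_s(\omega) = \Gamma_y(\omega) - \hat{\Gamma}_d(\omega)$$

$$|\hat{S}_s(\omega)| = \sqrt{N\,\hat{\Gamma}_s(\omega)}$$

$$\hat{S}_s(\omega) = |\hat{S}_s(\omega)|\, e^{j\varphi_y(\omega)} \qquad (1)$$

* The Power Spectral Subtraction method uses the noisy phase spectrum $\varphi_y(\omega)$ to synthesize the enhanced signal.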
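The three steps of Power Spectral Subtraction can be sketched for a single frame as follows. This is a minimal illustration, not the author's implementation: it assumes the periodogram |S(ω)|²/N as the short-time power spectrum, a noise power spectrum supplied by the caller, and half-wave rectification of negative differences. The function name is hypothetical.

```python
import numpy as np

def power_spectral_subtraction(frame, noise_psd):
    """Enhance one frame; noise_psd is an estimate of Gamma_d on the rfft grid."""
    n = len(frame)
    spec_y = np.fft.rfft(frame)
    gamma_y = np.abs(spec_y) ** 2 / n                 # periodogram of the noisy frame
    gamma_s = np.maximum(gamma_y - noise_psd, 0.0)    # subtraction + half-wave rectification
    magnitude = np.sqrt(n * gamma_s)                  # |S_s| = sqrt(N * Gamma_s)
    phase_y = np.angle(spec_y)                        # the noisy phase is reused unchanged
    return np.fft.irfft(magnitude * np.exp(1j * phase_y), n=n)

frame = np.sin(2 * np.pi * 5 * np.arange(64) / 64)
same = power_spectral_subtraction(frame, np.zeros(33))        # nothing to subtract
quieter = power_spectral_subtraction(frame, np.full(33, 1.0))  # flat noise estimate
```

With a zero noise estimate the frame is returned unchanged; with a non-zero estimate every bin is attenuated while its phase is left as observed.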

Generalized Spectral Subtraction and its variants

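Generalization

Eq. (1) can be generalized as:

$$\hat{S}_s(\omega) = \left[\, |S_y(\omega)|^{\alpha} - |\hat{S}_d(\omega)|^{\alpha} \,\right]^{1/\alpha} e^{j\varphi_y(\omega)} \qquad (2)$$

With $\alpha = 2$, Eq. (2) is Power Spectral Subtraction, Eq. (1). When $\alpha = 1$, Eq. (2) is called Amplitude Spectral Subtraction (Boll, 1979).

Variant – Correlation subtraction

$$\hat{r}_s(\eta;m) = r_y(\eta;m) - \hat{r}_d(\eta;m)$$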
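The role of the exponent α in Eq. (2) can be seen on a few numbers. The sketch below evaluates only the magnitude part of Eq. (2); the input values are made up for illustration, the function name is hypothetical, and flooring negative differences at zero is an added assumption.

```python
import numpy as np

def generalized_subtraction_magnitude(mag_y, mag_d, alpha):
    """Return [ |S_y|^alpha - |S_d|^alpha ]^(1/alpha), floored at zero."""
    diff = np.maximum(mag_y ** alpha - mag_d ** alpha, 0.0)
    return diff ** (1.0 / alpha)

mag_y = np.array([5.0, 2.0, 1.0])   # hypothetical noisy magnitudes
mag_d = np.array([3.0, 1.0, 2.0])   # hypothetical noise magnitude estimates

pss = generalized_subtraction_magnitude(mag_y, mag_d, alpha=2.0)  # power subtraction
ass = generalized_subtraction_magnitude(mag_y, mag_d, alpha=1.0)  # amplitude subtraction
```

For the same inputs, amplitude subtraction never yields a larger magnitude than power subtraction, since (a − b)² ≤ a² − b² whenever a ≥ b ≥ 0.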

Comments on Spectral Subtraction methods

- Low complexity

- Severe musical noise

- Usually need further enhancement: smoothing in time and frequency; rectification

[Audio samples: noisy speech; Amplitude Spectral Subtraction; Power Spectral Subtraction]

Comments on Spectral Subtraction methods

Oversuppressing and smoothing can reduce residual noise but result in distortion to the speech spectrum.

[Audio samples: oversuppressing ASS; oversuppressing PSS; smoothing in time]
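Oversuppression and smoothing in time can each be written in a few lines. The sketch below is one possible reading, not a method taken from the slides: oversuppression is modeled as subtracting a multiple of the noise power spectrum, and smoothing as a first-order recursive average over frames. The names factor and weight, and all numeric values, are assumptions.

```python
import numpy as np

def oversubtract(gamma_y, gamma_d, factor=2.0):
    """Subtract `factor` times the noise PSD estimate and rectify at zero."""
    return np.maximum(gamma_y - factor * gamma_d, 0.0)

def smooth_in_time(psd_frames, weight=0.8):
    """First-order recursive average of PSD frames along axis 0 (time)."""
    out = np.empty_like(psd_frames, dtype=float)
    out[0] = psd_frames[0]
    for m in range(1, len(psd_frames)):
        out[m] = weight * out[m - 1] + (1.0 - weight) * psd_frames[m]
    return out

frames = np.array([[4.0, 1.0], [0.0, 1.0], [4.0, 1.0]])  # toy PSD, 3 frames x 2 bins
noise = np.array([1.0, 1.0])

over = oversubtract(frames, noise, factor=2.0)
smoothed = smooth_in_time(frames, weight=0.5)
```

A larger factor removes more residual noise but also more of the speech spectrum, and a larger weight smears spectral detail across frames, which matches the distortion trade-off stated above.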

Wiener Filtering

The non-causal infinite Wiener filter using a spectral-subtraction prior (hereafter referred to as the Non-causal Wiener Filter) can be regarded as a Spectral Subtraction method.

The non-causal infinite Wiener filter using an LPC model as prior can be employed in an iterative manner, which can be regarded as a parametric method.

Noncausal Wiener Filter

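A linear Minimum Mean Squared Error filter:

$$\hat{s}(n) = \sum_{q=-\infty}^{\infty} h(q)\, y(n-q) = h(n) * y(n)$$

Orthogonality principle:

$$E\left\{ \left[ s(n) - \sum_{q=-\infty}^{\infty} h(q)\, y(n-q) \right] y(n-k) \right\} = 0$$

$$\Rightarrow\; R_{ys}(k) = \sum_{q=-\infty}^{\infty} h(q)\, R_{yy}(k-q) \qquad \text{(Wiener-Hopf equation)}$$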
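The Wiener-Hopf equation is a linear system in the filter taps h(q). The sketch below solves a truncated version of it, with q and k restricted to a finite range, so it only approximates the infinite noncausal filter. The correlation functions are toy assumptions (an exponentially decaying speech correlation plus white noise), and all names are hypothetical.

```python
import numpy as np

def solve_wiener_hopf(r_yy, r_ys, half_len):
    """Solve R_ys(k) = sum_q h(q) R_yy(k - q) for q, k in [-half_len, half_len].

    r_yy and r_ys are callables returning the correlation at an integer lag.
    """
    lags = np.arange(-half_len, half_len + 1)
    toeplitz = np.array([[r_yy(k - q) for q in lags] for k in lags])
    rhs = np.array([r_ys(k) for k in lags])
    return lags, np.linalg.solve(toeplitz, rhs)

# Assumed toy statistics: AR(1)-like speech correlation plus white noise.
rho, noise_var = 0.9, 0.5

def r_ss(k):
    return rho ** abs(k)

def r_yy(k):
    # Noise is uncorrelated with speech, so the correlations add.
    return r_ss(k) + (noise_var if k == 0 else 0.0)

def r_ys(k):
    # For uncorrelated noise the cross-correlation equals the speech correlation.
    return r_ss(k)

lags, h = solve_wiener_hopf(r_yy, r_ys, half_len=20)
# Mean squared error of the truncated filter: sigma_s^2 - sum_q h(q) R_ys(q).
mse = r_ss(0) - float(np.dot(h, [r_ys(q) for q in lags]))
```

The resulting taps are symmetric around q = 0, so the filter is two-sided (noncausal) and has zero phase.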

Noncausal Wiener Filter

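Orthogonality principle (frequency domain):

$$S_{ys}(\omega) = H(\omega)\, S_{yy}(\omega)$$

Transfer function:

$$H(\omega) = \frac{S_{ys}(\omega)}{S_{yy}(\omega)}$$

MSE of the Wiener filter:

$$E\left[ (s(n) - \hat{s}(n))^2 \right] = \sigma_s^2 - \sum_{q=-\infty}^{\infty} h(q)\, R_{ys}(q)$$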
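For additive noise that is uncorrelated with the speech, the cross spectrum S_ys equals the speech spectrum S_ss, and S_yy equals S_ss + S_dd, so the transfer function becomes H(ω) = S_ss(ω) / (S_ss(ω) + S_dd(ω)). The sketch below evaluates this gain per frequency bin. The PSD values are hypothetical, and a strictly positive noise PSD is assumed.

```python
import numpy as np

def wiener_gain(psd_speech, psd_noise):
    """H = S_ss / (S_ss + S_dd), evaluated per frequency bin."""
    psd_speech = np.asarray(psd_speech, dtype=float)
    psd_noise = np.asarray(psd_noise, dtype=float)
    return psd_speech / (psd_speech + psd_noise)

# Hypothetical per-bin PSD values: SNR of +10 dB, 0 dB and -10 dB.
gain = wiener_gain([10.0, 1.0, 0.1], [1.0, 1.0, 1.0])
```

The gain is real and lies between 0 and 1, so it scales the amplitude of each bin and leaves the phase of the noisy observation untouched.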

Comments on Noncausal WF

- Requires estimates of the power spectra of speech and noise.

- Performance depends very much on the estimates of the speech and noise spectra.

- WF oversuppresses the speech spectrum, resulting in a muffling effect.

- WF does not process the phase spectrum.

Comments on Noncausal WF

Roughness caused by phase noise

The phase spectrum is not processed, which results in a loss of phase coherence in voiced speech. The effect is called roughness or reverberance.

[Audio samples of muffling and roughness: clean; muffling; roughness; muffling & roughness]

Iterative Wiener Filtering

- A parametric method using an all-pole model

- A sequential MAP estimator of both the speech waveform and the LP coefficients [Lim, Oppenheim 1978]

All-pole modeling of speech

- The speech amplitude spectrum can be well modeled by an all-pole transfer function (the vocal tract) excited by a white sequence or a pulse train (the glottal pulses). The coefficients of the all-pole model are found by the Linear Prediction method and are thus called LP coefficients; the excitation is called the residue.

- The LP model is minimum phase, which is generally not the true phase of the vocal tract.


The algorithm:

1. Estimate the LP coefficients from the noisy observation samples. Estimate the noise spectrum during nonspeech activity.

2. Estimate the waveform using the noncausal WF, given the current estimates of the LP coefficients and the noise spectrum.

3. Re-estimate the LP coefficients given the current estimate of the waveform.

4. Repeat steps 2 and 3 until some stopping criterion is satisfied.
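The four steps can be sketched for a single frame as below. This is a simplified illustration under stated assumptions, not the published algorithm in full: LP coefficients come from the autocorrelation method, the Wiener filter is applied on the DFT grid (so circular effects are ignored), the noise power spectrum is given by the caller, and a fixed iteration count stands in for the stopping criterion. All function names and parameter values are hypothetical.

```python
import numpy as np

def lpc_autocorrelation(x, order):
    """LP coefficients a[1..p] and residual power from the autocorrelation method."""
    n = len(x)
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    toeplitz = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    toeplitz += 1e-9 * (r[0] + 1e-12) * np.eye(order)   # small ridge for numerical safety
    a = np.linalg.solve(toeplitz, r[1:])
    residual_power = max(r[0] - np.dot(a, r[1:]), 1e-12)
    return a, residual_power

def all_pole_psd(a, residual_power, n_fft):
    """PSD of the all-pole model, residual_power / |A(e^{jw})|^2, on the rfft grid."""
    denom = np.fft.rfft(np.concatenate(([1.0], -a)), n=n_fft)
    return residual_power / np.abs(denom) ** 2

def iterative_wiener(y, noise_psd, order=10, n_iter=3):
    """Alternate LP estimation and noncausal Wiener filtering of one frame."""
    n = len(y)
    spec_y = np.fft.rfft(y)
    s_hat = y.copy()                       # step 1 starts from the noisy samples
    for _ in range(n_iter):                # step 4: fixed count as the stop rule
        a, g2 = lpc_autocorrelation(s_hat, order)      # steps 1 and 3
        speech_psd = all_pole_psd(a, g2, n)
        gain = speech_psd / (speech_psd + noise_psd)   # step 2: Wiener gain
        s_hat = np.fft.irfft(gain * spec_y, n=n)       # filter the noisy frame
    return s_hat

rng = np.random.default_rng(1)             # fixed seed, illustrative data only
t = np.arange(256)
clean = np.sin(2 * np.pi * 0.05 * t)
noisy = clean + 0.5 * rng.standard_normal(256)
# White noise of variance 0.25 has a flat periodogram expectation of 0.25.
enhanced = iterative_wiener(noisy, noise_psd=np.full(129, 0.25))
```

Each pass re-fits the all-pole model to the latest waveform estimate and then filters the original noisy frame again, so the model and the waveform are refined in turn.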


Comments:

- Convergence is not guaranteed; a heuristic stopping criterion is needed

- Results in unrealistically sharp formants and pole jittering

- Suffers from musical noise

- Needs some kind of smoothing

[Audio samples: 10 dB noisy sample; Iterative WF; Iterative WF with smoothing]

Further enhancement to IWF

- Constrained IWF [Hansen, Clements 1987]: apply spectral constraints inter-frame and intra-frame using the LSP transformation.

- Pole-zero modeling [Flanagan 1972]

- Replace WF with Kalman filtering [Gibson 1991]

- Vector quantization method [Gibson 1988]

- Use HMM [Ephraim 1988]

Phase issues

The majority of noise reduction methods process only the amplitude spectrum, while the noisy phase spectrum is left unprocessed. The reasons are:

- Human ears are less sensitive to the phase spectrum than to the amplitude spectrum.

- Masking of amplitude to phase (6 dB / 0.6 rad threshold).

For low-SNR (< 6 dB) sources, the noisy phase causes roughness/reverberance.

MMSE approaches to speech enhancement

- Wiener filtering

- MMSE amplitude spectrum estimator

- MMSE log-amplitude spectrum estimator

- Non-Gaussian prior MMSE approaches

MMSE approaches are the dominant technique because they perform better than the Spectral Subtraction methods. They need a priori information about the speech and noise spectra.

MMSE amplitude spectrum estimator (Ephraim-Malah filter)

- Ephraim and Malah, 1984

- The basis of the noise reduction function of the MELPe coding standard

- Consists of two parts: the Decision-Directed method for estimating the a priori speech spectrum, and the MMSE Short-Time Spectral Amplitude (STSA) estimator

MMSE STSA estimator

Assumptions:

- Stationary additive Gaussian noise with known spectrum.

- An estimate of the speech spectrum is available.

- Spectral components (DFT coefficients) are statistically independent and each follows a Gaussian distribution (the DFT amplitude follows a Rayleigh distribution).

- The DFT phase follows a uniform distribution and is independent of the amplitude.

The signal model:

$$y(t) = x(t) + d(t)$$

Let $Y_k \equiv R_k \exp(j\theta_k)$, $X_k \equiv A_k \exp(j\alpha_k)$, and $D_k$ denote the $k$th spectral component of the noisy observation $y(t)$, the signal $x(t)$, and the noise $d(t)$, respectively.

MMSE STSA estimator

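With the following PDFs:

$$p(Y_k \mid A_k, \alpha_k) = \frac{1}{\pi \lambda_d(k)} \exp\left\{ -\frac{1}{\lambda_d(k)} \left| Y_k - A_k e^{j\alpha_k} \right|^2 \right\}$$

$$p(A_k, \alpha_k) = \frac{A_k}{\pi \lambda_x(k)} \exp\left\{ -\frac{A_k^2}{\lambda_x(k)} \right\}$$

and Bayes' rule, the estimator can be shown to be:

$$\hat{A}_k = E[A_k \mid Y_k] = \frac{\sqrt{\pi}}{2}\, \frac{\sqrt{v_k}}{\gamma_k} \exp\left( -\frac{v_k}{2} \right) \left[ (1 + v_k)\, I_0\!\left( \frac{v_k}{2} \right) + v_k\, I_1\!\left( \frac{v_k}{2} \right) \right] R_k$$

where $I_0(\cdot)$ and $I_1(\cdot)$ denote the modified Bessel functions of zero and first order, and $v_k$ is defined by:

$$v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k$$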
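The estimator can be written as a gain applied to the noisy amplitude, Â_k = G(ξ_k, γ_k) R_k. The sketch below evaluates that gain with SciPy's exponentially scaled Bessel functions; it is an illustration of the formula, and the function name mmse_stsa_gain is hypothetical. Both SNR arguments are assumed to be linear (not in dB) and strictly positive.

```python
import numpy as np
from scipy.special import i0e, i1e

def mmse_stsa_gain(xi, gamma):
    """Gain G = A_hat / R of the MMSE STSA estimator for a priori SNR xi
    and a posteriori SNR gamma (both linear, not dB)."""
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    v = xi / (1.0 + xi) * gamma
    # i0e(x) = exp(-x) * I0(x) for x >= 0, so exp(-v/2) * I0(v/2) = i0e(v/2);
    # the scaled forms avoid overflow for large v.
    bessel_term = (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)
    return np.sqrt(np.pi) / 2.0 * np.sqrt(v) / gamma * bessel_term

high = mmse_stsa_gain(100.0, 101.0)      # high SNR in both arguments
low_a = mmse_stsa_gain(0.1, 1.0)         # low a priori SNR, small a posteriori SNR
low_b = mmse_stsa_gain(0.1, 10.0)        # same a priori SNR, larger a posteriori SNR
```

At high SNR the gain is close to ξ/(1+ξ), and for a fixed low ξ it decreases as γ grows.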

MMSE STSA estimator

$\xi_k$ and $\gamma_k$ are defined by:

$$\xi_k = \frac{\lambda_x(k)}{\lambda_d(k)}, \qquad \gamma_k = \frac{R_k^2}{\lambda_d(k)}$$

where $\lambda_x(k) = E[|X_k|^2]$ and $\lambda_d(k) = E[|D_k|^2]$.

$\xi_k$ and $\gamma_k$ are interpreted as the a priori and a posteriori signal-to-noise ratio, respectively.

$\xi_k$ is estimated by the Decision-Directed method.

Decision-Directed method

- An estimate of the a priori SNR.

- A combination of Power Spectrum Subtraction, half-wave rectification and inter-frame smoothing.

$$\hat{\xi}_k(n) = \alpha\, \frac{\hat{A}_k^2(n-1)}{\lambda_d(k, n-1)} + (1 - \alpha)\, \max[\gamma_k(n) - 1,\, 0], \qquad 0 \le \alpha < 1$$

where $n$ is the frame index.

$\alpha$ is usually chosen to be 0.98 in order to get the best smoothing performance. The higher $\alpha$ is, the less musical noise, but the more distortion to the speech.
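The Decision-Directed update for one frequency bin is a single weighted sum. The sketch below follows the formula directly; the function and argument names are hypothetical and the numeric inputs are made up.

```python
def decision_directed_xi(prev_amplitude, noise_var_prev, gamma_now, alpha=0.98):
    """A priori SNR estimate for one bin:
    alpha * A_hat(n-1)^2 / lambda_d(n-1) + (1 - alpha) * max(gamma(n) - 1, 0)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    return (alpha * prev_amplitude ** 2 / noise_var_prev
            + (1.0 - alpha) * max(gamma_now - 1.0, 0.0))

# Previous amplitude estimate 2.0, unit noise variance, current a posteriori SNR 3.0.
xi = decision_directed_xi(prev_amplitude=2.0, noise_var_prev=1.0, gamma_now=3.0)
# With alpha = 0 only the rectified power-subtraction term remains.
xi_unsmoothed = decision_directed_xi(2.0, 1.0, 3.0, alpha=0.0)
```

The first term carries over the previous frame's estimate (inter-frame smoothing), and the second is the half-wave rectified power-subtraction estimate of the current frame.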

Comments on the MMSE STSA estimator

Comparison of the suppression gains of Wiener filter and MMSE STSA

- The instantaneous SNR can be interpreted as the a priori SNR estimated without smoothing.

- WF gains do not vary with the instantaneous SNR; they vary only with the a priori SNR. The MMSE STSA gains vary with both the instantaneous SNR and the a priori SNR.

- When the a priori SNR is high, the MMSE STSA estimator has gain curves very close to those of the WF. When the a priori SNR is low, the MMSE STSA shows a higher gain, which is strongly affected by the instantaneous SNR.


A comparison of the suppression gains of PSS, WF and MMSE STSA estimator

[Figure, two panels, both plotted against the estimated a priori SNR. One panel: solid line, power subtraction; dashed line, Wiener filter. Other panel: the MMSE STSA, where Rpost denotes the a priori SNR estimated without smoothing (the instantaneous SNR).]


- The gain curve transitions smoothly between the power subtraction curve and the Wiener curve. This transition is controlled by the un-smoothed estimate of the a priori SNR (Rpost). The larger Rpost, the stronger the attenuation.

- This counter-intuitive behavior manages to flatten the spurious spectral peaks caused by the noise in the low-SNR part of the spectrum, whereas WF tends to sharpen the spurious peaks in the low-SNR part of the spectrum.

- The phase of the noisy speech is used as the phase of the enhanced speech, because of the assumption of uniformly distributed phase. An independent MMSE estimate of the phasor has non-unity modulus, and thus cannot be combined with the MMSE STSA.

- Suffers less musical noise than the WF.
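The two reference curves, power subtraction and Wiener, have simple closed forms when both are written as functions of one SNR estimate. Power subtraction gives |Ŝ|² = |Y|² − λ_d, so its gain is sqrt(1 − 1/γ); taking the SNR estimate as γ − 1, this becomes sqrt(SNR / (1 + SNR)), while the Wiener gain is SNR / (1 + SNR). The sketch below tabulates both; the function names and the chosen SNR values are assumptions.

```python
import math

def wiener_gain(snr):
    """Wiener gain for a linear SNR estimate."""
    return snr / (1.0 + snr)

def power_subtraction_gain(snr):
    """Power spectral subtraction gain, sqrt(snr / (1 + snr))."""
    return math.sqrt(snr / (1.0 + snr))

for snr_db in (-10, 0, 10):
    snr = 10.0 ** (snr_db / 10.0)
    print(snr_db, round(power_subtraction_gain(snr), 3), round(wiener_gain(snr), 3))
```

Since the power subtraction gain is the square root of the Wiener gain, it is never smaller, and the two curves bound the region in which the MMSE STSA gain moves.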

MMSE Log-Spectral Amplitude Estimator

- A modification to the MMSE STSA, based on the fact that a distortion measure based on the mean-square error of the log-spectra is more suitable for speech processing.

- Minimize the distortion measure

$$E\left[ (\log A_k - \log \hat{A}_k)^2 \right]$$

- The MMSE LSA estimator can be shown to be:

$$\hat{A}_k = \exp\left( E[\ln A_k \mid Y_k] \right) = \frac{\xi_k}{1 + \xi_k} \exp\left( \frac{1}{2} \int_{v_k}^{\infty} \frac{e^{-t}}{t}\, dt \right) R_k$$

where $v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k$, and $\xi_k$ and $\gamma_k$ are the a priori SNR and a posteriori SNR as defined before.
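The integral in the MMSE LSA estimator is the exponential integral E1(v_k), which SciPy provides as scipy.special.exp1. The sketch below evaluates the resulting gain Â_k / R_k; the function name mmse_lsa_gain is hypothetical, and both SNR arguments are assumed to be linear and strictly positive.

```python
import numpy as np
from scipy.special import exp1

def mmse_lsa_gain(xi, gamma):
    """Gain G = A_hat / R of the MMSE log-spectral amplitude estimator."""
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    v = xi / (1.0 + xi) * gamma
    # exp1(v) is the exponential integral E1(v) = integral_v^inf exp(-t)/t dt.
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))

high = mmse_lsa_gain(100.0, 101.0)   # large v: the exponential factor is close to 1
low = mmse_lsa_gain(0.1, 1.0)        # small v: the exponential factor raises the gain
```

For large v_k the exponential factor tends to 1 and the gain reduces to ξ_k / (1 + ξ_k), the Wiener gain; for small v_k the factor is greater than 1, so the gain stays above the Wiener gain.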

MMSE Log-Spectral Amplitude Estimator

Comparison of the suppression gains of MMSE STSA and MMSE LSA

- The gain curves of the MMSE LSA are always lower than those of the MMSE STSA, resulting in lower residual noise.

- When the a priori SNR is high, the gain curve of the MMSE LSA is very flat, which is similar to the Wiener filter. When the a priori SNR is low, the gain curve of the MMSE LSA varies with the instantaneous SNR, as that of the MMSE STSA does.

[Audio samples: noisy sample (0 dB); Decision-Directed Wiener Filter; MMSE LSA]

MMSE estimator with non-Gaussian prior

How well does Gaussian model fit the real probability distribution of DFT coefficients?

[Figure: histogram of speech DFT amplitude; histogram of noise (recorded in a market place) DFT amplitude.]

* The histograms are taken from one hour of speech.


- The probability density function of the DFT coefficients of speech can be better modeled by super-Gaussian functions (e.g. Gamma or Laplace) than by the Gaussian function [Rainer Martin 2002, 2003].

- An even more exact probability density function is one tailored to fit the shape of the histogram of the DFT coefficients [Lotter, Vary 2003].

- Using these density functions in place of the Gaussian density function (for the speech or noise processes) in the MMSE estimator can result in better noise reduction.

- The non-Gaussian prior MMSE estimator is nonlinear and non-zero-phase.


Compared with WF:

- Better output SNR (Gaussian/Gamma)

- Less musical noise (Laplace/Gamma)

- Less distortion to the speech

Exercises

1. The noncausal Wiener filter and the Ephraim-Malah filter are both MMSE estimators. They have a lot in common. Please list at least 4 common points of these two estimators.

2. So, what makes the two estimators different?

3. Residual noise is often categorized as white noise or musical noise. Different filters produce different residual noise. Which of the two types of residual noise do you prefer? Discuss how you would make the choice in different communication scenarios.

4. What do you think of the experimental data (the histograms) used for finding the PDF of the DFT amplitude? Can you suggest any improvements?
