
Proceedings of the 2009 12th International Conference on Computer and Information Technology (ICCIT 2009), 21-23 December 2009, Dhaka, Bangladesh

Novel Objective Criteria for Perceptual Separation of Two Kinds of Distortion in Speech Enhancement Applications

Md. Jahangir Alam, Douglas O'Shaughnessy, Sid-Ahmed Selouani†

INRS-EMT, University of Quebec, Montreal, QC, Canada; †University of Moncton, Campus de Shippagan, NB, Canada
[email protected], [email protected], [email protected]

Abstract

There is an increasing interest in the development of robust quantitative speech quality measures that correlate well with subjective measures. This paper presents two objective criteria, the Perceptual Signal to Audible Noise Ratio (PSANR) and the Perceptual Signal to Audible Distortion Ratio (PSADR), to characterize the two kinds of degradation (i.e., residual background noise, speech distortion, or both) in speech enhancement applications. For performance evaluation of speech enhancement algorithms it is necessary to determine with accuracy the kind of degradation present in the enhanced signal. Experimental results for speech enhancement using different well-known approaches depict the usefulness of the proposed objective criteria.

Objective speech quality measures can be classified according to the perceptual domain transformation module being used, and these are:

- Time domain measures
- Spectral domain measures
- Perceptual domain measures

Perceptual domain measures are shown to have the best chance of predicting the subjective quality of speech and other audio signals, since they are based on models of human auditory perception.

Figure 1. Classification of speech quality measures: subjective measures (such as MOS) and objective measures, the latter divided into time domain (such as Segmental SNR), spectral domain (such as Log Spectral Distance), and perceptual domain (such as PESQ) measures.

The common point of all objective criteria is their ability to evaluate speech quality using a single parameter that embeds all kinds of degradation after any processing. Indeed, speech quality measures base their evaluation on both the original and the degraded speech according to the following mapping:

c : E² → ℝ, (x, y) ↦ c,    (1)

where E denotes the time, frequency or perceptual domain, x and y denote the original speech and the observed speech altered by noise (or the denoised speech after processing), respectively, and c is the score of the objective measure. Mathematically, c is not a bijection from E² to ℝ: it is possible to find a signal y' that is perceptually different from y but has the same score as the one obtained with y, i.e., c(x, y) = c(x, y'). Assessing the denoised speech quality by means of two parameters overcomes this non-bijectivity of classical objective evaluation and better characterizes each kind of speech degradation.
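As a small illustration of this non-bijectivity (not taken from the paper), the following Python sketch constructs two degraded versions of a reference signal that receive the same SNR score although one is corrupted by additive background noise and the other by spectral attenuation; all signals and constants are hypothetical.

```python
import numpy as np

# Hypothetical illustration (not from the paper): two degraded signals that
# receive the same SNR score although their errors are of different kinds.
rng = np.random.default_rng(0)
fs, n = 16000, 16000
x = np.sin(2 * np.pi * 440 * np.arange(n) / fs)              # clean stand-in signal

noise = rng.standard_normal(n)
noise *= 0.1 * np.linalg.norm(x) / np.linalg.norm(noise)
y1 = x + noise                                               # additive background noise

err = x - np.convolve(x, np.ones(8) / 8, mode="same")        # high-frequency part of x
err *= 0.1 * np.linalg.norm(x) / np.linalg.norm(err)
y2 = x - err                                                 # (scaled) spectral attenuation

snr = lambda ref, est: 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))
print(snr(x, y1), snr(x, y2))   # both about 20 dB, yet y1 and y2 sound very different
```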

I. INTRODUCTION

Quality assessment of the processed speech signal can be done using subjective listening tests or objective quality measures, as shown in Figure 1. Subjective listening tests such as the Mean Opinion Score (MOS) or Degradation MOS (DMOS) provide perhaps the most reliable method for assessing speech quality. Subjective evaluation involves comparisons of original and processed speech signals by a group of listeners who are asked to rate the quality of the speech signal along a pre-determined scale. These tests, however, can be time consuming, requiring in most cases access to trained listeners. For these reasons, several researchers have investigated the possibility of devising objective, rather than subjective, measures of speech quality [5, 8-11].

The aim of objective speech quality measures is to achieve high correlation with subjective speech quality measures such as the Mean Opinion Score (MOS) or Degradation MOS (DMOS). An ideal objective speech quality measure would be able to assess the quality of the degraded or processed speech by simply observing the speech in question, without accessing the original speech. Much progress has been made in developing such an objective measure [5, 8-11]. Current objective measures are limited in that most require access to the original speech signal and some can only model the low-level processing (e.g., masking effects) of the auditory system. Yet, despite these limitations, some of these objective measures have been found to correlate well with subjective listening tests [11].

Keywords: speech enhancement, masking threshold, objective quality measure, PSANDR.



Hence, instead of the mapping defined in (1), we develop a novel mapping from the perceptual domain to ℝ²:

c : E² → ℝ², (x, y) ↦ (PSANR, PSADR),    (2)

where PSANR (Perceptual Signal to Audible Noise Ratio) and PSADR (Perceptual Signal to Audible Distortion Ratio) are two parameters related to the residual noise and the speech distortion, respectively. The definition of PSANR and PSADR is inspired by the SNR definition, which is the ratio of signal energy to noise energy.

A masked signal is made inaudible by a masker if the masked signal magnitude is below the perceptual masking threshold MT. Residual noise and speech distortion can be audible or inaudible according to their position with respect to the masking threshold. We propose decision rules that decide on the audibility of the residual noise and the speech distortion by using the masking threshold concept. If they are audible, the audibility rate is quantified according to the proposed criterion.

This paper is organized as follows: Section 2 provides a description of the speech enhancement technique. In Section 3, an overview of the proposed method is given. Performance evaluation is made in Section 4, and in Section 5 the paper is concluded.

II. SPEECH ENHANCEMENT METHOD

Basic speech enhancement methods involve estimating every frequency component of the clean speech spectrum X(m,k) from the noisy observation. The estimate is obtained by applying a noise suppression filter to the noisy spectrum,

X̂(m,k) = H(m,k) Y(m,k),    (3)

where m = 1, 2, ..., M is the frame index, k = 1, 2, ..., K is the frequency bin index, M is the total number of frames, K is the frame length, H(m,k) is the noise suppression filter chosen according to a suitable criterion, and Y(m,k) represents the short-time spectral components of the noisy signal. The error signal generated by this filter is

e(m,k) = X̂(m,k) - X(m,k) = (H(m,k) - 1) X(m,k) + H(m,k) D(m,k),    (4)

where D(m,k) denotes the short-time spectrum of the noise. The first term in equation (4) describes the speech distortion caused by the spectral weighting. The second term is the residual noise, which is perceptually heard as background noise.
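As a brief illustration of equations (3) and (4) (not part of the paper), the sketch below applies a spectral gain to a toy noisy spectrum and verifies that the resulting error splits into the speech distortion term and the residual noise term; the function name, the Wiener-style gain and all numerical values are hypothetical.

```python
import numpy as np

def enhance_and_decompose(X, D, H):
    """Apply a spectral gain H to the noisy spectrum Y = X + D (eq. (3)) and
    split the resulting error into the two components of eq. (4)."""
    Y = X + D                        # noisy short-time spectrum
    X_hat = H * Y                    # eq. (3): enhanced spectrum
    distortion = (H - 1.0) * X       # speech distortion term
    residual_noise = H * D           # residual noise term
    assert np.allclose(X_hat - X, distortion + residual_noise)
    return X_hat, distortion, residual_noise

# Toy example with random spectra and an oracle Wiener-like gain (illustrative only).
rng = np.random.default_rng(1)
X = rng.standard_normal(257) + 1j * rng.standard_normal(257)
D = 0.3 * (rng.standard_normal(257) + 1j * rng.standard_normal(257))
H = np.abs(X) ** 2 / (np.abs(X) ** 2 + np.abs(D) ** 2)
X_hat, distortion, residual_noise = enhance_and_decompose(X, D, H)
```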

III. OVERVIEW OF THE PROPOSED METHOD

Figure 2 and Figure 3 depict the complete block diagram of our proposed method and the block diagram used to calculate the masking threshold, respectively. The following sections give a description of the proposed perceptual domain objective quality measure used to quantify the two kinds of degradation, namely the residual noise and the speech distortion.

Figure 2. Block diagram of the PSANDR measure computation (blocks include the reference signal, the degraded signal, the masking threshold, the spreading function, the spectrum of the estimated signal, and the PSANDR score).

Figure 3. Block diagram for the calculation of the masking threshold.

A. Computation of Masking Threshold (MT)

The masking threshold (MT) is obtained by modeling the frequency selectivity of the human auditory system and its masking properties. The estimation of the MT includes the preliminary estimation of the original clean speech, critical band analysis, the spread masking threshold, the relative threshold offset, the masking threshold normalization, and comparison with the absolute threshold of hearing [7]. The steps involved in the computation of the MT are taken from the Johnston model [7] and are shown in Figure 3.
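A minimal Python sketch of such a masking threshold computation is given below. It follows the listed steps only loosely: a fixed offset stands in for the tonality-dependent relative offset of the Johnston model, the renormalization is a crude energy rescaling, and the absolute threshold of hearing assumes a calibrated dB scale, so this is an assumption-laden approximation rather than the authors' implementation.

```python
import numpy as np

def bark(f_hz):
    """Approximate Hz -> Bark mapping (Zwicker-style formula)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def masking_threshold(frame, fs, n_fft=512, offset_db=14.5):
    """Very simplified, Johnston-inspired masking threshold for one frame.

    Steps: power spectrum -> critical-band (Bark) energies -> spreading ->
    relative offset -> renormalization -> absolute threshold of hearing.
    The fixed offset_db replaces the tonality-dependent offset (illustrative only).
    """
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    z = bark(freqs)
    n_bands = int(np.ceil(z.max()))
    band_idx = np.minimum(z.astype(int), n_bands - 1)
    band_energy = np.bincount(band_idx, weights=spec, minlength=n_bands)

    # Spread the band energies with a Schroeder-type spreading function (in dB).
    i = np.arange(n_bands)
    dz = i[:, None] - i[None, :]
    spread_db = 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)
    spread_energy = (10.0 ** (spread_db / 10.0)).T @ band_energy

    # Relative offset and crude renormalization of the spread threshold.
    thr_band = spread_energy * 10.0 ** (-offset_db / 10.0)
    thr_band *= band_energy.sum() / max(spread_energy.sum(), 1e-12)

    # Map back to FFT bins and apply the absolute threshold of hearing.
    thr = thr_band[band_idx]
    f_khz = np.maximum(freqs, 20.0) / 1000.0
    ath_db = (3.64 * f_khz ** -0.8
              - 6.5 * np.exp(-0.6 * (f_khz - 3.3) ** 2)
              + 1e-3 * f_khz ** 4)
    return np.maximum(thr, 10.0 ** (ath_db / 10.0))

# Example: masking threshold of a 32 ms frame of a synthetic tone at 16 kHz.
fs = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(512) / fs)
mt = masking_threshold(frame, fs)
```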

B. Upper and Lower Bound of Perceptual Equivalence

According to the MT definition, it is possible to add the MT curve to the clean speech power spectrum so that the resulting signal (obtained by inverse FFT) has the same audible quality as the clean one. The resulting spectrum is called the Upper Bound of Perceptual Equivalence (UBPE) and is defined as


UBPE(m,k) = Γ_x(m,k) + MT(m,k),    (5)

where Γ_x(m,k) is the clean speech power spectrum.

When some frequency components of the denoised speech are above the UBPE, the resulting additive noise is heard. Thus, by analogy with the UBPE, we propose to calculate a second curve that expresses the lower bound under which any attenuation of frequency components is heard as a distortion. We call it the Lower Bound of Perceptual Equivalence (LBPE). To compute the LBPE, we use the audible spectrum introduced in [12]. In that case, the audible spectrum is calculated by taking the maximum between the clean speech spectrum and the masking threshold. When speech components are under the MT, they are not heard and we can replace them by a chosen threshold α(m,k). The proposed LBPE is defined as

LBPE(m,k) = Γ_x(m,k) if Γ_x(m,k) ≥ MT(m,k), and LBPE(m,k) = α(m,k) otherwise.    (6)

The choice of α(m,k) obeys only one condition, α(m,k) < MT(m,k). In this paper we choose it equal to 0 dB.

Using the UBPE and the LBPE, we can define three regions characterizing the perceptual quality of the denoised speech: frequency components between the UBPE and the LBPE are perceptually equivalent to the original speech components, frequency components above the UBPE contain background noise, and frequency components under the LBPE are characterized by speech distortion. This characterization constitutes our idea for identifying and detecting audible additive noise and audible distortion. As an illustration, we present in Figure 4 an example of a speech frame power spectrum and its related UBPE and LBPE curves. The clean speech power spectrum is, for all frequency indices, between the two curves UBPE and LBPE. We remark that the two curves coincide at most peaks. This means that, for these frequency intervals, any kind of degradation altering the speech will be audible: if it is above the UBPE, it will be heard as background noise; in the opposite case, it will be heard as speech distortion.
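The following sketch (not from the paper) computes the two bounds of equations (5) and (6) for one frame and labels each frequency bin according to the three regions just described; the function names and the linear-power value used for the 0 dB floor are assumptions.

```python
import numpy as np

def perceptual_bounds(clean_power, mt, alpha=1.0):
    """UBPE (eq. (5)) and LBPE (eq. (6)) for one frame.

    clean_power : clean speech power spectrum, Γ_x(m,k)
    mt          : masking threshold MT(m,k) on the same bins
    alpha       : floor used where the clean component is masked; the paper
                  only requires alpha < MT and uses 0 dB (taken here as a
                  linear power of 1.0, which is an assumption).
    """
    ubpe = clean_power + mt                                   # eq. (5)
    lbpe = np.where(clean_power >= mt, clean_power, alpha)    # eq. (6)
    return ubpe, lbpe

def classify_bins(processed_power, ubpe, lbpe):
    """Three-region labelling of the processed spectrum: 0 = perceptually
    equivalent, +1 = audible residual noise (above UBPE), -1 = audible
    speech distortion (below LBPE)."""
    return np.where(processed_power > ubpe, 1,
                    np.where(processed_power < lbpe, -1, 0))
```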

C. Estimation of Audible Degradation

Estimation of audible noise power spectrum

Once the UBPE is calculated, superimposing the denoised signal power spectrum on the UBPE separates two cases. The first corresponds to the regions of the denoised speech power spectrum that are under the UBPE; in that case there is no audible residual noise. In the second case, some denoised speech frequency components are above the UBPE, and the amount above the UBPE constitutes the audible residual noise. In terms of listening tests, such residual noise is annoying and in some cases constitutes musical noise. Such musical noise is well known and constitutes the main drawback of spectral subtraction. Once the UBPE is calculated, it is possible to estimate the audible power spectral density of the residual noise by a simple subtraction when it exists. Hence, the audible residual noise power spectrum is written as

Γ_n^p(m,k) = Γ_x̂(m,k) - UBPE(m,k) if Γ_x̂(m,k) > UBPE(m,k), and Γ_n^p(m,k) = 0 otherwise,    (7)

where Γ_x̂(m,k) denotes the power spectrum of the processed speech and the superscript p indicates that the power spectrum is taken in the perceptual (audible) sense.

Audible speech distortion power spectrum estimation

Using the same methodology as the one used for the residual background noise, it is possible to estimate the audible distortion power spectrum as

Γ_d^p(m,k) = LBPE(m,k) - Γ_x̂(m,k) if Γ_x̂(m,k) < LBPE(m,k), and Γ_d^p(m,k) = 0 otherwise.    (8)
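A corresponding sketch for equations (7) and (8), again with hypothetical function names, estimates the audible residual noise and audible distortion power spectra by thresholded subtraction against the UBPE and LBPE:

```python
import numpy as np

def audible_degradation_spectra(processed_power, ubpe, lbpe):
    """Audible residual noise (eq. (7)) and audible distortion (eq. (8))
    power spectra for one frame; processed_power plays the role of Γ_x̂(m,k)."""
    residual_noise = np.where(processed_power > ubpe,
                              processed_power - ubpe, 0.0)    # eq. (7)
    distortion = np.where(processed_power < lbpe,
                          lbpe - processed_power, 0.0)        # eq. (8)
    return residual_noise, distortion
```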

Figure 4. UBPE, LBPE, and power spectrum of the original clean speech signal.

D. Calculation of PSANR and PSADR

The perceptual residual noise criterion is defined as the ratio between the upper effective signal, which is the UBPE, and the audible residual noise. The Perceptual Signal to Audible Noise Ratio (PSANR) of the m-th frame is calculated in the frequency domain and is formulated as follows:

PSANR(m) = 10 log10 [ Σ_{k=1}^{N} UBPE(m,k) / Σ_{k=1}^{N} Γ_n^p(m,k) ].    (9)

The mean PSANR measure is computed by averaging the frame PSANR measures across the sentence as follows:

PSANR_mean = (1/M) Σ_{m=1}^{M} PSANR(m).    (10)

Similarly, the Perceptual Signal to Audible Distortion Ratio (PSADR) of the m-th frame is defined as the ratio between the lower effective signal, which is the LBPE, and the audible distortion, and is given as

PSADR(m) = 10 log10 [ Σ_{k=1}^{N} LBPE(m,k) / Σ_{k=1}^{N} Γ_d^p(m,k) ].    (11)

The mean PSADR measure is computed by averaging the frame PSADR measures across the sentence as follows:

PSADR_mean = (1/M) Σ_{m=1}^{M} PSADR(m).    (12)

The couple (PSANR, PSADR) then defines the new criterion to evaluate both kinds of degradation. We call it the Perceptual Signal to Audible Noise and Distortion Ratio (PSANDR). The higher the PSANDR score, the better the quality of the processed speech.
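Putting the pieces together, the following sketch (not the authors' implementation) computes the frame-wise PSANR and PSADR of equations (9) and (11) and their sentence-level means (10) and (12) from per-frame power spectra and masking thresholds; the epsilon guarding against division by zero and the linear-power floor are assumptions not specified in the paper.

```python
import numpy as np

def psandr(clean_power, processed_power, mt, alpha=1.0, eps=1e-12):
    """Sketch of the PSANDR couple for one sentence.

    clean_power, processed_power, mt : M x N arrays (frames x frequency bins)
    of clean power spectra, processed power spectra and masking thresholds.
    Returns (PSANR_mean, PSADR_mean) as in eqs. (9)-(12).
    """
    ubpe = clean_power + mt                                    # eq. (5)
    lbpe = np.where(clean_power >= mt, clean_power, alpha)     # eq. (6)
    noise_p = np.where(processed_power > ubpe, processed_power - ubpe, 0.0)
    dist_p = np.where(processed_power < lbpe, lbpe - processed_power, 0.0)
    psanr = 10.0 * np.log10(ubpe.sum(axis=1) / (noise_p.sum(axis=1) + eps))   # eq. (9)
    psadr = 10.0 * np.log10(lbpe.sum(axis=1) / (dist_p.sum(axis=1) + eps))    # eq. (11)
    return psanr.mean(), psadr.mean()                          # eqs. (10) and (12)
```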

IV. PERFORMANCE EVALUATION

In order to evaluate the performance of the proposed PSANDR measure in quantifying the perceptual separation of the two kinds of degradation (residual noise and speech distortion), we choose two well-known speech enhancement algorithms [1-5], namely spectral subtraction [2] and the Log MMSE algorithm incorporating speech-presence uncertainty (SPU) [4, 6]. Figure 5 depicts an example of denoised speech power spectra and their related UBPE curve calculated from the clean speech. Figure 5 (a) is obtained using the Log MMSE algorithm incorporating speech-presence uncertainty [4, 6]. We notice from this figure that, excluding some frequency points (around 900, 1200 and 2300 Hz), the processed speech power spectrum is almost entirely under the UBPE curve and hence contains little (or no) residual audible noise. Figure 5 (b) is obtained using the spectral subtraction algorithm. It is observed from this figure that in the frequency regions between 800 Hz and 1800 Hz the processed speech power spectrum is above the UBPE and therefore contains residual audible noise. In terms of listening tests, such residual noise is annoying and in some cases constitutes musical noise. Such musical noise is well known and constitutes the main drawback of spectral subtraction.

Figure 5. Power spectra of the processed speech and the related clean speech UBPE: (a) Log MMSE with SPU, (b) spectral subtraction.

Figure 6 (a) and (b) represent an example of the power spectrum of the processed speech and its related LBPE curve calculated from the clean speech. Figure 6 (a) is obtained when the Log MMSE algorithm incorporating speech-presence uncertainty is used, whereas Figure 6 (b) is obtained when the spectral subtraction algorithm is used. We notice from Figure 6 (a) that, excluding some specific frequency points, almost all regions are above the LBPE and hence constitute little (or no) audible distortion of the clean speech. From Figure 6 (b) it is observed that, including the frequency regions between 3200 Hz and 3500 Hz, there are other small frequency regions where the processed speech power spectrum is under the LBPE, which means that they constitute audible distortion of the clean speech. In terms of listening tests, these are completely different from residual background noise; they are heard as a loss of speech tonality.

Figure 6. Power spectra of the processed speech and the related clean speech LBPE: (a) Log MMSE with SPU, (b) spectral subtraction.

Table 1 shows the PSANDR (PSANR, PSADR) scores and the Segmental SNR values for the Log MMSE with SPU and the spectral subtraction methods when the clean signals are degraded with subway noise at global SNR levels of 5 dB and 10 dB. The PSANR, which gives an idea of the residual noise, shows that the Log MMSE with SPU is the best one regarding noise attenuation. The PSADR, which measures the distortion of the denoised signals, shows that the largest distortion is obtained using the spectral subtraction (SS) technique. These observations are confirmed by informal subjective listening tests.

Table 1. Experimental results

Algorithm                         Input SNR (dB)   SegSNR   PSADR    PSANR
Log MMSE with SPU                 5                2.81     7.56     18.00
Log MMSE with SPU                 10               4.85     16.65    7.115
Spectral subtraction (Berouti)    5                1.40     6.94     10.764
Spectral subtraction (Berouti)    10               4.1      15.2     6.85

V. CONCLUSION

Two parameters, PSANR and PSADR, characterizing the two kinds of degradation for speech enhancement applications are developed in this paper. We first propose two curves, the UBPE and the LBPE, to classify the audible residual noise and the audible distortion. Simulation results comparing different well-known speech enhancement algorithms and a classical objective measure (Segmental SNR) show a better characterization of the nature of the degradation of the enhanced signal. The calculation of the degree of correlation of the proposed criteria with the MOS criterion constitutes the perspective of our future work.

REFERENCES

[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 27, pp. 113-120, Apr. 1979.
[2] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, vol. 1, Washington, DC, pp. 208-211, Apr. 1979.
[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984.
[4] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, pp. 443-445, 1985.
[5] P. C. Loizou, Speech Enhancement: Theory and Practice, 1st ed., CRC Press, June 2007.
[6] I. Cohen, "Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator," IEEE Signal Processing Letters, vol. 9, no. 4, pp. 113-116, 2002.
[7] J. D. Johnston, "Transform coding of audio signals using perceptual noise criteria," IEEE J. on Selected Areas in Comm., vol. 6, pp. 314-323, Feb. 1988.
[8] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models, 2nd ed., Springer-Verlag, 1999.
[9] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. on Audio, Speech and Language Processing, vol. 16, no. 1, pp. 229-238, Jan. 2008.
[10] S. Quackenbush, T. Barnwell, and M. Clements, Objective Measures of Speech Quality, Prentice Hall, Englewood Cliffs, NJ, USA, 1988.
[11] W. Yang, M. Benbouchta, and R. Yantorno, "Performance of a modified bark spectral distortion measure as an objective speech quality measure," in Proc. IEEE ICASSP, pp. 541-544, Seattle, 1998.
[12] D. E. Tsoukalas, J. Mourjopoulos, and G. Kokkinakis, "Speech enhancement based on audible noise suppression," IEEE Trans. Speech and Audio Processing, vol. 5, no. 6, pp. 497-514, Nov. 1997.