Proceedings of 2009 12th International Conference on Computer and Information Technology (ICCIT 2009), 21-23 December 2009, Dhaka, Bangladesh
Novel Objective Criteria for Perceptual Separation of Two Kinds of Distortion in Speech Enhancement Applications
Md. Jahangir Alam, Douglas O'Shaughnessy, Sid-Ahmed Selouani†
INRS-EMT, University of Quebec, Montreal, QC, Canada
†University of Moncton, Campus de Shippagan, NB, Canada
[email protected], [email protected], [email protected]
Abstract
There is an increasing interest in the development of robust quantitative speech quality measures that correlate well with subjective measures. This paper presents two objective criteria, the Perceptual Signal to Audible Noise Ratio (PSANR) and the Perceptual Signal to Audible Distortion Ratio (PSADR), to characterize the two kinds of degradation (i.e., residual background noise, speech distortion, or both) in speech enhancement applications. For performance evaluation of speech enhancement algorithms it is necessary to determine accurately the kind of degradation present in the enhanced signal. Experimental results for speech enhancement using different well-known approaches illustrate the usefulness of the proposed objective criteria.
Objective speech quality measures can be classified according to the perceptual domain transformation module being used:
- Time domain measures
- Spectral domain measures
- Perceptual domain measures
Perceptual domain measures have been shown to have the best chance of predicting the subjective quality of speech and other audio signals, since they are based on models of human auditory perception.
[Figure 1 diagram: speech quality measures divide into subjective measures (such as MOS) and objective measures; objective measures divide into time domain (such as segmental SNR), spectral domain (such as log spectral distance) and perceptual domain (such as PESQ) measures.]

Figure 1. Classification of speech quality measures.

The common point of all objective criteria is that they evaluate speech quality using a single parameter which embeds all kinds of degradation introduced by the processing. Indeed, speech quality measures base their evaluation on both the original and the degraded speech according to the following mapping:

$$c : E^2 \to \mathbb{R}, \qquad (x, y) \mapsto c(x, y), \tag{1}$$

where $E$ denotes the time, frequency or perceptual domain, $x$ and $y$ denote the original speech and the observed speech (altered by noise, or denoised after processing), respectively, and $c(x, y)$ is the score of the objective measure. Mathematically, $c$ is not a bijection from $E^2$ to $\mathbb{R}$ (in particular, it is not injective): it is possible to find a signal $y'$ which is perceptually different from $y$ but has the same score as the one obtained with $y$, i.e. $c(x, y) = c(x, y')$.

The assessment of the denoised speech quality by means of two parameters makes it possible to overcome this non-bijection problem of classic objective evaluation and to better characterize each kind of speech degradation.
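The limitation of a single score can be made concrete with a small numerical sketch. This is an illustration added here under stated assumptions, not an experiment from the paper: plain time-domain SNR plays the role of the score, the toy sinusoid stands in for clean speech, and the helper name `snr_db` is hypothetical. One degraded signal contains an added interfering tone (a residual-noise-like degradation), the other is an attenuated copy of the clean signal (a distortion-like degradation), with levels chosen so that both error energies are equal.

```python
import math

def snr_db(x, y):
    """Time-domain SNR in dB between a reference x and a degraded signal y."""
    signal = sum(v * v for v in x)
    error = sum((a - b) ** 2 for a, b in zip(x, y))
    return 10.0 * math.log10(signal / error)

n = 800
# Toy "clean" signal: 10 full cycles of a sinusoid (illustrative only).
x = [math.sin(2 * math.pi * 10 * i / n) for i in range(n)]

# Degradation 1: additive interference (a 55-cycle tone at half amplitude).
y_noise = [v + 0.5 * math.sin(2 * math.pi * 55 * i / n) for i, v in enumerate(x)]

# Degradation 2: pure attenuation of the clean signal by a factor of 0.5.
y_dist = [0.5 * v for v in x]

# Two different kinds of degradation, one and the same score.
print(snr_db(x, y_noise), snr_db(x, y_dist))
```

Both calls return about 6.02 dB, although the first signal would be heard as speech plus an extra tone and the second as a quieter version of the original. A single number cannot tell these two cases apart.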
I. INTRODUCTION
Quality assessment of the processed speech signal can be done using subjective listening tests or objective quality measures, as shown in Figure 1. Subjective listening tests such as the Mean Opinion Score (MOS) or Degradation MOS (DMOS) provide perhaps the most reliable method for assessing speech quality. Subjective evaluation involves comparisons of original and processed speech signals by a group of listeners who are asked to rate the quality of the speech signal along a pre-determined scale. These tests, however, can be time consuming and in most cases require access to trained listeners. For these reasons, several researchers have investigated the possibility of devising objective, rather than subjective, measures of speech quality [5, 8-11].
The aim of objective speech quality measures is to achieve high correlation with subjective speech quality measures such as MOS or DMOS. An ideal objective speech quality measure would be able to assess the quality of the degraded or processed speech by simply observing the speech in question, without accessing the original speech. Much progress has been made in developing such an objective measure [5, 8-11]. Current objective measures are limited in that most require access to the original speech signal and some can only model the low-level processing (e.g., masking effects) of the auditory system. Yet, despite these limitations, some of these objective measures have been found to correlate well with subjective listening tests [11].
Keywords: speech enhancement, masking threshold, objective quality measure, PSANDR.
978-1-4244-6284-1/09/$26.00 ©2009 IEEE
Hence, instead of the mapping defined in (1), we develop a novel mapping from the perceptual domain to $\mathbb{R}^2$:

$$c : E^2 \to \mathbb{R}^2, \qquad (x, y) \mapsto (\mathrm{PSANR}, \mathrm{PSADR}), \tag{2}$$

where PSANR (Perceptual Signal to Audible Noise Ratio) and PSADR (Perceptual Signal to Audible Distortion Ratio) are two parameters related to the residual noise and the speech distortion, respectively. The definitions of PSANR and PSADR are inspired by the SNR definition, which is the ratio of signal energy to noise energy.

A masked signal is made inaudible by a masker if the masked signal magnitude is below the perceptual masking threshold MT. Residual noise and speech distortion can therefore be audible or inaudible depending on their position relative to the masking threshold. We propose decision rules that determine the audibility of residual noise and speech distortion using the masking threshold concept. If they are audible, their audibility is quantified according to the proposed criteria.

This paper is organized as follows: Section II describes the speech enhancement technique. Section III gives an overview of the proposed method. Performance evaluation is presented in Section IV, and Section V concludes the paper.

II. SPEECH ENHANCEMENT METHOD

Basic speech enhancement methods involve estimating every frequency component of the clean speech spectrum $\hat{X}(m,k)$ as

$$\hat{X}(m,k) = H(m,k)\,Y(m,k), \tag{3}$$

where $m = 1, 2, \ldots, M$ is the frame index, $k = 1, 2, \ldots, K$ is the frequency bin index, $M$ is the total number of frames, $K$ is the frame length, $H(m,k)$ is the noise suppression filter chosen according to a suitable criterion, and $Y(m,k)$ represents the short-time spectral components of the noisy signal. The error signal generated by this filter is

$$e(m,k) = \hat{X}(m,k) - X(m,k) = \big(H(m,k) - 1\big)\,X(m,k) + H(m,k)\,D(m,k), \tag{4}$$

where $X(m,k)$ is the clean speech spectrum and $D(m,k)$ denotes the noise spectrum, so that $Y(m,k) = X(m,k) + D(m,k)$. The first term in equation (4) describes the speech distortion caused by the spectral weighting. The second term is the residual noise, which is perceived as background noise.

III. OVERVIEW OF THE PROPOSED METHOD

Figure 2 and Figure 3 depict the complete block diagram of the proposed method and the block diagram for calculating the masking threshold, respectively. The following sections describe the proposed perceptual domain objective quality measure, which quantifies the two kinds of degradation, namely the residual noise and the speech distortion.

[Figure 2 diagram; recoverable block labels: reference signal, degraded signal, spectrum of the estimated signal, spreading function, masking threshold, PSANDR score.]

Figure 2. Block diagram of the PSANDR measure computation.

Figure 3. Block diagram for the calculation of the masking threshold.
A. Computation of Masking Threshold (MT)
The masking threshold (MT) is obtained by modeling the frequency selectivity of the human auditory system and its masking properties. The estimation of the MT includes a preliminary estimation of the original clean speech, critical band analysis, computation of the spread masking threshold, the relative threshold offset, normalization of the masking threshold, and comparison with the absolute threshold of hearing [7]. The steps involved in the computation of the MT are taken from the Johnston model [7] and are shown in Figure 3.
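The steps listed above can be sketched as follows. This is a simplified illustration in the spirit of the Johnston model [7], not the authors' implementation. The Bark-scale approximation, the Schroeder spreading function, Terhardt's formula for the threshold in quiet, the calibration constant `full_scale_db` (the sound pressure level assumed to correspond to unit power), and the division of each band threshold by its number of bins are all assumptions made for the example. The preliminary clean speech estimation is omitted, and the input is one frame of a one-sided power spectrum.

```python
import math

def hz_to_bark(f):
    """Common arctangent approximation of the Bark (critical band) scale."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def spreading_db(dz):
    """Schroeder spreading function in dB; dz = maskee band - masker band."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)

def absolute_threshold_db(f):
    """Terhardt's approximation of the threshold in quiet, in dB SPL."""
    khz = max(f, 20.0) / 1000.0
    return 3.64 * khz ** -0.8 - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2) + 1e-3 * khz ** 4

def masking_threshold(power, fs, full_scale_db=90.0):
    """Per-bin masking threshold for one frame of a one-sided power spectrum."""
    n_bins = len(power)
    freqs = [k * fs / (2.0 * (n_bins - 1)) for k in range(n_bins)]
    band_of = [int(hz_to_bark(f)) for f in freqs]
    n_bands = band_of[-1] + 1

    # 1) Critical band analysis: sum the power falling into each Bark band.
    band_energy = [0.0] * n_bands
    band_count = [0] * n_bands
    for k, p in enumerate(power):
        band_energy[band_of[k]] += p
        band_count[band_of[k]] += 1

    # 2) Spread masking threshold: apply the spreading function across bands.
    gain = [[10.0 ** (spreading_db(i - j) / 10.0) for j in range(n_bands)]
            for i in range(n_bands)]
    spread = [sum(g * e for g, e in zip(gain[i], band_energy)) for i in range(n_bands)]

    # 3) Relative threshold offset, driven by the spectral flatness measure.
    eps = 1e-12
    log_mean = sum(math.log10(p + eps) for p in power) / n_bins
    sfm_db = 10.0 * (log_mean - math.log10(sum(power) / n_bins + eps))
    tonality = min(max(sfm_db / -60.0, 0.0), 1.0)

    band_thr = []
    for i in range(n_bands):
        offset_db = tonality * (14.5 + i + 1) + (1.0 - tonality) * 5.5
        # 4) Normalization: undo the energy gain introduced by the spreading.
        band_thr.append(spread[i] * 10.0 ** (-offset_db / 10.0) / sum(gain[i]))

    # 5) Comparison with the absolute threshold of hearing, bin by bin.
    mt = []
    for k, f in enumerate(freqs):
        quiet = 10.0 ** ((absolute_threshold_db(f) - full_scale_db) / 10.0)
        mt.append(max(band_thr[band_of[k]] / band_count[band_of[k]], quiet))
    return mt

# Toy frame: a weak flat floor with one strong component at 1 kHz (fs = 8 kHz).
frame = [1e-6] * 129
frame[32] = 1.0
mt = masking_threshold(frame, fs=8000.0)
```

In this toy frame the threshold at the 1 kHz bin stays below the component itself, while the spreading step raises the threshold in the neighbouring bands well above the weak floor, which is the masking behaviour the model is meant to capture.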
B. Upper and Lower Bound of Perceptual Equivalence
According to the definition of the MT, the MT curve can be added to the clean speech power spectrum so that the resulting signal (obtained by inverse FFT) has the same audible quality as the clean one. The resulting spectrum is called the Upper Bound of Perceptual Equivalence (UBPE) and is defined as

$$\mathrm{UBPE}(m,k) = \Gamma_x(m,k) + \mathrm{MT}(m,k), \tag{5}$$

where $\Gamma_x(m,k)$ is the clean speech power spectrum.

When some frequency components of the denoised speech are above UBPE, the resulting additive noise is heard. By analogy with UBPE, we propose to calculate a second curve which expresses the lower bound under which any attenuation of frequency components is heard as distortion. We call it the Lower Bound of Perceptual Equivalence (LBPE). To compute LBPE, we use the audible spectrum introduced in [12], which is calculated by taking the maximum of the clean speech spectrum and the masking threshold. When speech components are under the MT, they are not heard and we can replace them by a chosen threshold $a(m,k)$.

The proposed LBPE is defined as

$$\mathrm{LBPE}(m,k) = \begin{cases} \Gamma_x(m,k) & \text{if } \Gamma_x(m,k) \ge \mathrm{MT}(m,k), \\ a(m,k) & \text{otherwise.} \end{cases} \tag{6}$$

The choice of $a(m,k)$ obeys only one condition: $a(m,k) < \mathrm{MT}(m,k)$. In this work we choose it equal to 0 dB.
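As a minimal sketch of equations (5) and (6), the two bounds can be computed bin by bin for one frame. The function names, the example values and the use of linear power units are assumptions made for illustration. In particular, the floor value 1.0 is meant to represent 0 dB on a linear power scale, which presumes a unit reference power.

```python
def upper_bound(clean_power, mt):
    """UBPE, equation (5): clean power spectrum plus masking threshold."""
    return [g + t for g, t in zip(clean_power, mt)]

def lower_bound(clean_power, mt, floor=1.0):
    """LBPE, equation (6): keep audible clean components, floor the masked ones.

    floor=1.0 stands for 0 dB in linear power (assumed unit reference);
    it must remain below the masking threshold in every bin where it is used.
    """
    return [g if g >= t else floor for g, t in zip(clean_power, mt)]

# Hypothetical one-frame example with four frequency bins (linear power).
clean = [100.0, 40.0, 3.0, 60.0]
mt = [10.0, 8.0, 5.0, 6.0]

ubpe = upper_bound(clean, mt)
lbpe = lower_bound(clean, mt)
```

In the third bin the clean component (3.0) lies under the masking threshold (5.0), so LBPE falls back to the floor there, while UBPE is simply the sum 8.0. In the other bins LBPE coincides with the clean spectrum.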
Using UBPE and LBPE, we can define three regions characterizing the perceptual quality of the denoised speech: frequency components between LBPE and UBPE are perceptually equivalent to the original speech components, frequency components above UBPE contain audible background noise, and frequency components under LBPE are characterized by audible speech distortion. This characterization is the basis of our approach for identifying and detecting audible additive noise and audible distortion. As an illustration, Figure 4 presents an example of a speech frame power spectrum and its related UBPE and LBPE curves. The clean speech power spectrum lies, for all frequency indices, between the two curves UBPE and LBPE. We remark that the two curves coincide at most spectral peaks. This means that in these frequency intervals any kind of degradation altering the speech will be audible: if the degraded component rises above UBPE, it will be heard as background noise; in the opposite case, it will be heard as speech distortion.
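The three-region rule can be illustrated with a short sketch that also previews how the audible noise, the audible distortion and the frame-level ratios of Sections III.C and III.D follow from it. The function names and the numerical values are hypothetical, and returning infinity when nothing is audible is an assumption of this example, since the paper does not state how that case is handled.

```python
import math

def classify(p, lower, upper):
    """Assign one processed-speech bin to one of the three perceptual regions."""
    if p > upper:
        return "noise"
    if p < lower:
        return "distortion"
    return "equivalent"

def audible_noise(proc_power, ubpe):
    """Power in excess of UBPE, counted as audible residual noise."""
    return [max(p - u, 0.0) for p, u in zip(proc_power, ubpe)]

def audible_distortion(proc_power, lbpe):
    """Power missing below LBPE, counted as audible speech distortion."""
    return [max(l - p, 0.0) for p, l in zip(proc_power, lbpe)]

def ratio_db(bound, audible):
    """Frame-level ratio in dB; infinite when no degradation is audible (assumed)."""
    total = sum(audible)
    return math.inf if total == 0.0 else 10.0 * math.log10(sum(bound) / total)

# Hypothetical bounds and processed-speech power for one frame (linear power).
ubpe = [110.0, 48.0, 8.0, 66.0]
lbpe = [100.0, 40.0, 1.0, 60.0]
proc = [105.0, 60.0, 4.0, 30.0]

regions = [classify(p, l, u) for p, l, u in zip(proc, lbpe, ubpe)]
noise = audible_noise(proc, ubpe)
dist = audible_distortion(proc, lbpe)
psanr = ratio_db(ubpe, noise)   # frame PSANR
psadr = ratio_db(lbpe, dist)    # frame PSADR
```

Here the second bin exceeds UBPE and is counted as audible noise, the fourth bin falls under LBPE and is counted as audible distortion, and the other two bins are perceptually equivalent to the clean speech. Averaging the frame values over all frames of a sentence would give the mean scores.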
C. Estimation of audible degradation

Estimation of audible noise power spectrum

Once the UBPE is calculated, superimposing the denoised signal power spectrum on the UBPE separates two cases. The first corresponds to the regions of the denoised speech power spectrum which are under UBPE; in this case, there is no audible residual noise. In the second case, some denoised speech frequency components are above UBPE, and the amount above UBPE constitutes the audible residual noise. In terms of listening tests, such residual noise is annoying and in some cases constitutes musical noise. Musical noise is well known and is the main drawback of spectral subtraction. Once the UBPE is calculated, the audible power spectral density of the residual noise, when it exists, can be estimated by a simple subtraction. Hence, the residual noise power spectrum is written as

$$\Gamma_n^p(m,k) = \begin{cases} \Gamma_{\hat{x}}(m,k) - \mathrm{UBPE}(m,k) & \text{if } \Gamma_{\hat{x}}(m,k) > \mathrm{UBPE}(m,k), \\ 0 & \text{otherwise,} \end{cases} \tag{7}$$

where $\Gamma_{\hat{x}}(m,k)$ denotes the power spectrum of the processed speech and the superscript $p$ indicates that the power spectrum is taken in the perceptual (audible) sense.

Audible speech distortion power spectrum estimation

Using the same methodology as for the residual background noise, the audible distortion power spectrum can be estimated as

$$\Gamma_d^p(m,k) = \begin{cases} \mathrm{LBPE}(m,k) - \Gamma_{\hat{x}}(m,k) & \text{if } \Gamma_{\hat{x}}(m,k) < \mathrm{LBPE}(m,k), \\ 0 & \text{otherwise.} \end{cases} \tag{8}$$

[Figure 4 plot; axis labels and tick values not recoverable.]

Figure 4. UBPE, LBPE and power spectrum of the original clean speech signal.

D. Calculation of PSANR and PSADR

The perceptual residual noise criterion is defined as the ratio between the upper effective signal, which is the UBPE, and the audible residual noise. The Perceptual Signal to Audible Noise Ratio (PSANR) of the $m$th frame is calculated in the frequency domain and is formulated as

$$\mathrm{PSANR}(m) = 10 \log_{10} \frac{\sum_{k=1}^{N} \mathrm{UBPE}(m,k)}{\sum_{k=1}^{N} \Gamma_n^p(m,k)}. \tag{9}$$

The mean PSANR measure is computed by averaging the frame PSANR measures across the sentence:

$$\mathrm{PSANR}_{\mathrm{mean}} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{PSANR}(m). \tag{10}$$

Similarly, the Perceptual Signal to Audible Distortion Ratio (PSADR) of the $m$th frame is defined as the ratio between the lower effective signal, which is the LBPE, and the audible distortion, and is given as

$$\mathrm{PSADR}(m) = 10 \log_{10} \frac{\sum_{k=1}^{N} \mathrm{LBPE}(m,k)}{\sum_{k=1}^{N} \Gamma_d^p(m,k)}. \tag{11}$$

The mean PSADR measure is computed by averaging the frame PSADR measures across the sentence:

$$\mathrm{PSADR}_{\mathrm{mean}} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{PSADR}(m). \tag{12}$$

The couple (PSANR, PSADR) defines the new criterion used to evaluate both kinds of degradation. We call it the Perceptual Signal to Audible Noise and Distortion Ratio (PSANDR). The higher the PSANDR score, the better the quality of the processed speech.

IV. PERFORMANCE EVALUATION

In order to evaluate the ability of the proposed PSANDR measures to quantify the perceptual separation of the two kinds of degradation (residual noise and speech distortion), we choose two well-known speech enhancement algorithms [1-5], namely spectral subtraction [2] and the Log MMSE algorithm incorporating speech-presence uncertainty (SPU) [4, 6]. Figure 5 depicts an example of denoised speech power spectra and their related UBPE curve calculated from the clean speech. Figure 5(a) is obtained using the Log MMSE algorithm incorporating speech-presence uncertainty [4, 6]. We notice from this figure that, excluding some frequency points (900, 1200, and 2300 Hz), the processed speech power spectrum is almost everywhere under the UBPE curve and hence contains little (or no) audible residual noise. Figure 5(b) is obtained using the spectral subtraction algorithm. It is observed from this figure that in the frequency region between 800 Hz and 1800 Hz the processed speech power spectrum is above the UBPE and therefore contains audible residual noise. In terms of listening tests, such residual noise is annoying and in some cases constitutes musical noise. Musical noise is well known and is the main drawback of spectral subtraction.

Figure 5. Power spectra of the processed speech and the related UBPE of the clean speech: (a) Log MMSE with SPU, (b) spectral subtraction.

Figures 6(a) and 6(b) show an example of the power spectrum of the processed speech and its related LBPE curve calculated from the clean speech. Figure 6(a) is obtained with the Log MMSE algorithm incorporating speech-presence uncertainty, whereas Figure 6(b) is obtained with the spectral subtraction algorithm. We notice from Figure 6(a) that, excluding some specific frequency points, almost all regions are above LBPE and hence show little (or no) audible distortion of the clean speech. From Figure 6(b) it is observed that, in addition to the frequency region between 3200 Hz and 3500 Hz, there are other small frequency regions where the processed speech power spectrum is under the LBPE, which means that they constitute audible distortion of the clean speech. In terms of listening tests, they are completely different from residual background noise: they are heard as a loss of speech tonality.

Figure 6. Power spectra of the processed speech and the related LBPE of the clean speech: (a) Log MMSE with SPU, (b) spectral subtraction.

Table 1 shows the PSANDR (PSANR, PSADR) scores and the segmental SNR values for the Log MMSE with SPU and spectral subtraction methods when the clean signals are degraded with subway noise at global SNR levels of 5 dB and 10 dB. PSANR, which reflects the residual noise, indicates that Log MMSE with SPU is the better method regarding noise attenuation. PSADR, which measures the distortion of the denoised signals, shows that the larger distortion is obtained with the spectral subtraction (SS) technique. These observations are confirmed by informal subjective listening tests.

Table 1. Experimental results

| Algorithm | Input SNR (dB) | SegSNR | PSADR | PSANR |
| Log MMSE with SPU | 5 | 2.81 | 7.56 | 18.00 |
| Log MMSE with SPU | 10 | 4.85 | 16.65 | 7.115 |
| Spectral subtraction (Berouti) | 5 | 1.40 | 6.94 | 10.764 |
| Spectral subtraction (Berouti) | 10 | 4.1 | 15.2 | 6.85 |

V. CONCLUSION

Two parameters, PSANR and PSADR, characterizing the two kinds of degradation in speech enhancement applications are developed in this paper. We first propose two curves, UBPE and LBPE, to classify the audible residual noise and the audible distortion. Simulation results comparing different well-known speech enhancement algorithms and a classical objective measure (segmental SNR) show a better characterization of the nature of the degradation of the enhanced signal. Computing the degree of correlation of the proposed criteria with the MOS criterion is left for future work.

REFERENCES
[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 27, pp. 113-120, Apr. 1979.
[2] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE Int. Conf. on Acoustics, Speech, Signal Processing, vol. 1, Washington, DC, pp. 208-211, Apr. 1979.
[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984.
[4] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, pp. 443-445, 1985.
[5] P. C. Loizou, Speech Enhancement: Theory and Practice, 1st ed., CRC Press, June 2007.
[6] I. Cohen, "Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator," IEEE Signal Processing Letters, vol. 9, no. 4, pp. 113-116, 2002.
[7] J. D. Johnston, "Transform coding of audio signals using perceptual noise criteria," IEEE J. on Selected Areas in Comm., vol. 6, pp. 314-323, Feb. 1988.
[8] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models, 2nd ed., Springer-Verlag, 1999.
[9] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. on Audio, Speech and Language Processing, vol. 16, no. 1, pp. 229-238, Jan. 2008.
[10] S. Quackenbush, T. Barnwell, and M. Clements, Objective Measures of Speech Quality, Englewood Cliffs, NJ, USA: Prentice Hall, 1988.
[11] W. Yang, M. Benbouchta, and R. Yantorno, "Performance of a modified Bark spectral distortion measure as an objective speech quality measure," IEEE ICASSP, pp. 541-544, Seattle, 1998.
[12] D. E. Tsoukalas, J. Mourjopoulos, and G. Kokkinakis, "Speech enhancement based on audible noise suppression," IEEE Trans. Speech and Audio Processing, vol. 5, no. 6, pp. 497-514, Nov. 1997.