
IEICE TRANS. FUNDAMENTALS, VOL.E94–A, NO.1 JANUARY 2011

PAPER

Improving Power Spectra Estimation in 2-Dimensional Areas Using Number of Active Sound Sources

Yusuke HIOKA†a), Member, Ken’ichi FURUYA†, Yoichi HANEDA†, and Akitoshi KATAOKA††, Senior Members

SUMMARY An improvement in estimating the power spectra of sounds located in a particular 2-dimensional area is proposed. We previously proposed a method that estimates sound power spectra using multiple fixed beamformers in order to emphasize speech located in a particular 2-dimensional area. However, that method has one drawback: the number of areas in which active sound sources are located must be restricted. This restriction makes the method less effective when many noise sources located in different areas are simultaneously active. In this paper, we reveal the cause of this restriction and determine the maximum number of areas for which the method is able to simultaneously estimate sound power spectra. We also introduce a procedure for investigating which areas include active sound sources in order to reduce the number of unknown power spectra to be estimated. The effectiveness of the proposed method is examined by experimental evaluation applied to sounds recorded in a practical environment.
key words: microphone array, power spectrum estimation, speech enhancement, 2-dimensional area, estimating source position

1. Introduction

With the popularisation of teleconference and voice recognition systems, hands-free microphone systems have become indispensable devices due to their convenience. However, they often suffer from signal-to-noise ratio (SNR) deterioration due to the speaker's mouth being positioned a few metres away from the microphone, which is much farther than the mouth's position during use of a conventional handset or close-talking microphone. In particular, an intercom with a hands-free microphone may be used in a public space where various nonstationary noise sources are located around the speaker. In such an environment, the microphone might pick up confidential conversations that should not be sent to the other speaker. Thus, a hands-free microphone that picks up only the speech located in a particular 2-dimensional area, i.e., the area where the speakers are located, is required. Here, "2-dimension" corresponds to the direction and the distance.

For that purpose, using a microphone array [1] is a well-known strategy, and various conventional methods have been reported. However, only a few works [2]–[5] have dealt with the problem of discriminating sounds whose sources are located in the same direction but at different distances.

Manuscript received April 19, 2010.
Manuscript revised August 30, 2010.
†The authors are with NTT Cyber Space Laboratories, NTT Corporation, Musashino-shi, 180-8585 Japan.
††The author is with the Faculty of Science and Technology, Ryukoku University, Otsu-shi, 520-2194 Japan.
a) E-mail: [email protected]
DOI: 10.1587/transfun.E94.A.273

One major approach reported is beamforming. Fixed beamforming [2] has the advantage of robustness against changes in noise source positions because its design is independent of the input signal; however, its noise suppression performance is limited. Adaptive beamforming is a different beamforming strategy that achieves better noise-suppression performance after sufficient adaptation [3], [4]. However, its performance is sometimes severely degraded by noise source movement because the convergence speed of the adaptation lags behind the changes in noise position. There is also a method based on independent component analysis (ICA) [5], which is known to work similarly to adaptive beamforming [6]. However, it also lacks robustness against rapid changes in noise positions because ICA likewise involves an iterative calculation. Moreover, a restriction applies to every conventional method referred to above: beamforming, which can be recognized as spatial filtering, can manipulate signals arriving from at most as many different directions as it has microphones, owing to the limit on its degrees of freedom. Due to this restriction, beamforming may not achieve a sufficient effect, or it may become an ill-designed filter if it is forcibly optimised.

Another framework, known as "fixed beamforming with a nonlinear postfilter" [7], achieves sufficient noise suppression without suffering from the limitation on the degrees of freedom, owing to the nonlinearity of the postfilter. Saruwatari et al. proposed a method within this framework that estimates the noise power spectrum for postfiltering using complementary beamforming [8]. However, even this method may encounter the restriction because its estimation of the noise power spectrum is performed within the fixed beamforming framework. Above all, this method deals with sound emphasis in a particular direction but not in a 2-dimensional area; therefore, it cannot be applied to our problem directly.

The authors previously proposed a method [9] that estimates the power spectra of sounds whose sources are located in the target sound areas using multiple fixed beamformers. Both the spectrum estimation and the postfiltering are performed in the power spectrum domain, which is a form of nonlinear processing. Therefore, the method sufficiently emphasizes the speech without suffering from the restriction on the degrees of freedom. However, the method has one major drawback: the number of areas in which active sound sources are located must be restricted. Therefore, the method is less effective when many noise sources located in different areas are simultaneously active. Motivated by this situation, in this paper we aim to reveal the cause of this restriction and to determine the maximum number of areas for which the method is able to simultaneously estimate sound power spectra. We also present a method for investigating which areas include active sound sources in order to overcome the drawback of the conventional method.

This paper is organized as follows. First, in Sect. 2, we explain the basic principle of the conventional method that estimates the power spectra of 2-dimensional areas using fixed beamformers. Then, in Sect. 3, the cause of the drawback of the conventional method is discussed and the key strategy of the proposed method to overcome it is described. A detailed explanation of the procedure of the proposed method is given in Sect. 4. Section 5 shows some experimental results that confirm the improvement achieved by the proposed method, and we conclude this paper with some comments in Sect. 6.

2. Power Spectra Estimation in 2-Dimensional Area Using Multiple Fixed Beamformers

Many of the conventional speech enhancement procedures based on nonlinear postfilter designs require the power spectra of the target speech and noise to be estimated a priori [7]. For example, the well-known Wiener filter [10] is defined by G(ω) = λ_S(ω)/(λ_S(ω) + λ_N(ω)), where λ_S(ω) and λ_N(ω) are the power spectra of the speech and noise, respectively, and ω denotes the frequency bin. The target speech is then emphasized by Ŝ(ω) = G(ω)X(ω), where Ŝ(ω) and X(ω) are the spectra of the emphasized speech and input signal, respectively. In this section, we present a method for estimating λ_S(ω) and λ_N(ω) when the sound sources are located in a particular angular sector. We then extend the method to cover 2-dimensional cases.
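To make the postfiltering step concrete, here is a minimal sketch of such a Wiener-type postfilter applied per frequency bin, assuming the power spectra λ_S and λ_N have already been estimated; the variable names and the small flooring constant are illustrative, not taken from the paper.

```python
import numpy as np

def wiener_postfilter(X, lambda_s, lambda_n, eps=1e-12):
    """Apply the Wiener gain G = lambda_s / (lambda_s + lambda_n) to the
    input spectrum X; all arguments are per-frequency-bin 1-D arrays."""
    G = lambda_s / (lambda_s + lambda_n + eps)  # eps avoids division by zero
    return G * X                                # spectrum of the emphasized speech

# Illustrative usage with random spectra for a single frame.
rng = np.random.default_rng(0)
X = rng.standard_normal(257) + 1j * rng.standard_normal(257)
lambda_s = rng.uniform(0.1, 1.0, 257)
lambda_n = rng.uniform(0.1, 1.0, 257)
S_hat = wiener_postfilter(X, lambda_s, lambda_n)
```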

2.1 Power Spectrum Estimation Using Fixed Beamformers

We briefly explain here the basic structure of the power spectrum estimation. Suppose we have N fixed beamformers whose complex directivity gains are assumed to be constant within each of the M angular sectors, depicted by the dotted line in the beamformers' directivity pattern shown in Fig. 1. The output of the n-th beamformer is then given by $Y_n(\omega) = \sum_m a_{nm}(\omega) S_m(\omega)$. Here, $a_{nm}(\omega)$ is the constant complex directivity gain of the n-th beamformer in the m-th angular sector, and $S_m(\omega)$ is the spectrum of the signal located in the m-th angular sector.

Fig. 1 Angular sectors and directivity pattern of n-th beamformer.

If only the directivity gain toward the m-th angular sector is non-zero while all the other gains are zero, the beamformer is able to estimate the spectrum of each sound source separately. However, this is often impossible because a beamformer whose directivity gain satisfies such a condition cannot be designed in practice; the beamformer design is constrained by its limited degrees of freedom.

In contrast, our method does not estimate the signals themselves but estimates their power spectra using multiple beamformer outputs. If the signals are assumed to be mutually uncorrelated, the relation between the sound sources and the outputs of the beamformers is approximated by

$$
\begin{bmatrix}
E[|Y_1(\omega)|^2] \\ \vdots \\ E[|Y_N(\omega)|^2]
\end{bmatrix}
\simeq
\begin{bmatrix}
|a_{11}(\omega)|^2 & \cdots & |a_{1M}(\omega)|^2 \\
\vdots & \ddots & \vdots \\
|a_{N1}(\omega)|^2 & \cdots & |a_{NM}(\omega)|^2
\end{bmatrix}
\begin{bmatrix}
E[|S_1(\omega)|^2] \\ \vdots \\ E[|S_M(\omega)|^2]
\end{bmatrix}, \quad (1)
$$

where E[·] denotes the expectation, which is normally replaced by an average over multiple frames. The power spectrum of the signals in each angular sector can be estimated by solving this simultaneous equation as long as the columns of the matrix are linearly independent. Because the actual directivity shape is not the ideal one assumed above, the average gains of the actual directivity (depicted by the thick solid line in the beamformers' directivity pattern shown in Fig. 1) over each angular sector are used for a_nm(ω). Therefore, the left side of Eq. (1) is only an approximation of the right side. In the following, we explain the extension of this power spectrum estimation to sound sources located in a 2-dimensional area.
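As an illustration of Eq. (1), the following sketch estimates the per-sector power spectra at a single frequency bin by solving the linear system built from the squared directivity gains; the gain matrix and averaged output powers are made-up numbers, not values from the paper.

```python
import numpy as np

# |a_nm(omega)|^2: averaged squared gains of N = 3 beamformers over M = 3 sectors
# (illustrative values; in practice they come from the designed directivities).
A = np.array([[1.0, 0.4, 0.1],
              [0.3, 1.0, 0.3],
              [0.1, 0.4, 1.0]])

# E[|Y_n(omega)|^2]: frame-averaged output powers of the beamformers at this bin.
y = np.array([2.1, 1.4, 0.9])

# Estimate E[|S_m(omega)|^2] by solving the simultaneous equation (1).
s_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
s_hat = np.maximum(s_hat, 0.0)  # power spectra cannot be negative
print(s_hat)
```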

2.2 Definition of Area

In general, a microphone array requires a sufficiently large aperture to discriminate the distances of the sources. To satisfy this requirement, we introduce a pair of separated microphone arrays, "array-L" and "array-R." When ML and MR angular sectors are defined for the respective arrays, a 2-dimensional Area is defined in this placement by the combination of an angular sector Θxa of array-L and one of array-R, where x and a ∈ {L, R} represent the angular sectors and arrays, respectively. For example, when three angular sectors (ML = MR = 3) symbolized by x ∈ {C, L, R} are assumed as shown in Fig. 2, the target Area T and eight noise Areas RR, R, C, L, LL, NC, NR, and NL are defined. Note that in the following we denote the Areas in an italic font, e.g., "T", while the arrays are denoted in a Roman font, e.g., "L", to avoid confusion. Regarding the Areas, the following discussion uses the definition in Fig. 2 unless otherwise stated.

The positions of the arrays and Areas are arbitrarily determined under the following restrictions: (a) the distance between the target Area and each microphone array is nearly the same (dL ≈ dR), (b) the distance between the two microphone arrays is unknown but longer than that between the target Area and each microphone array (dinter ≥ dL or dR), and (c) the target Area is located in front of each microphone array. Furthermore, the received signal can be treated as a plane wave because the aperture length of each microphone array is sufficiently short. In addition, we assume that the sound signals are mutually uncorrelated.

Fig. 2 Definition of Areas.

Fig. 3 Beampatterns of fixed beamformers: left suppression beamformer (LSB), centre suppression beamformer (CSB), and right suppression beamformer (RSB).

2.3 Power Spectrum Estimation of Areas

In the same manner as for the single-array case mentioned in Sect. 2.1, we introduce fixed beamformers applied to both array-L and array-R. Each beamformer is designed to have a different directivity shape and has a null in its directivity pattern (directivity null hereafter) pointing toward one of the angular sectors Θxa†, as depicted in Fig. 3. Now, the relationship between the input and output power spectra of the beamformers is described by

Y(ω) = T(ω) · Z(ω), (2)

where Y(ω) and Z(ω) are column vectors that consist of the power spectra of the fixed beamformers' outputs and the unknown power spectra of each Area, respectively. They are given by

$$
Y(\omega) :=
\begin{bmatrix}
E[|Y_{LL}(\omega)|^2] \\
E[|Y_{CL}(\omega)|^2] \\
E[|Y_{RL}(\omega)|^2] \\
E[|Y_{LR}(\omega)|^2] \\
E[|Y_{CR}(\omega)|^2] \\
E[|Y_{RR}(\omega)|^2]
\end{bmatrix} \quad (3)
$$

$$
Z(\omega) :=
\begin{bmatrix}
P_T(\omega) \\ P_{LL}(\omega) \\ P_L(\omega) \\ P_{NL}(\omega) \\ P_C(\omega) \\ P_{NC}(\omega) \\ P_{RR}(\omega) \\ P_R(\omega) \\ P_{NR}(\omega)
\end{bmatrix}. \quad (4)
$$

Here, Yxa(ω) denotes the output signal of beamformer xSB of array-a. Furthermore, the sum of the power spectra of the signals whose sources are located in Area b (b ∈ {T, LL, L, C, R, RR, NR, NC, NL}) is defined by

$$P_b(\omega) := \sum_{i} E[|S_{bi}(\omega)|^2], \quad (5)$$

where S_bi(ω) is the spectrum of the i-th sound source located in Area b. On the other hand, the matrix T(ω) consists of the column vectors given by Eq. (6) and Eq. (7), where each vector is composed of a set of squared directivity gains for one Area. As Y(ω) and T(ω) consist of known parameters, the power spectra in Z(ω) can be estimated by solving the simultaneous equation Eq. (2).

3. Restriction on Number of Areas to Estimate Power Spectra

The problem in solving Eq. (2) is that the simultaneous equation is under-determined because the rank of T(ω) is smaller than the number of unknown variables in Z(ω). This is caused by the lack of degrees of freedom in specifying a particular Area, which is determined by the number of angular sectors defined at each array. In fact, the maximum number of angular sectors that can be defined at each array is connected to the number of microphones in the array. Therefore, in the following, we discuss this rank deficiency problem from the viewpoint of the number of microphones.

When we design a beamformer, in principle, at least K microphones are required to control the directivity gain in K directions. Therefore, in our case, the minimum number of microphones required for each array depends on the number of angular sectors. In the case of Fig. 2, we need at least ML + MR (= 6) microphones. Normally, we could estimate the power spectra of 6 sound sources using this array, but we lose one degree of freedom in the directivity design because the distance between array-L and array-R is unknown.

†For this reason, we call them "Suppression Beamformers" (SBs).


$$
T(\omega) =
\begin{bmatrix}
\alpha^2_{CL}(\omega) & \alpha^2_{LL}(\omega) & \alpha^2_{CL}(\omega) & \alpha^2_{LL}(\omega) & \alpha^2_{RL}(\omega) & \alpha^2_{LL}(\omega) & \alpha^2_{RL}(\omega) & \alpha^2_{RL}(\omega) & \alpha^2_{CL}(\omega) \\
\beta^2_{CL}(\omega) & \beta^2_{LL}(\omega) & \beta^2_{CL}(\omega) & \beta^2_{LL}(\omega) & \beta^2_{RL}(\omega) & \beta^2_{LL}(\omega) & \beta^2_{RL}(\omega) & \beta^2_{RL}(\omega) & \beta^2_{CL}(\omega) \\
\gamma^2_{CL}(\omega) & \gamma^2_{LL}(\omega) & \gamma^2_{CL}(\omega) & \gamma^2_{LL}(\omega) & \gamma^2_{RL}(\omega) & \gamma^2_{LL}(\omega) & \gamma^2_{RL}(\omega) & \gamma^2_{RL}(\omega) & \gamma^2_{CL}(\omega) \\
\alpha^2_{CR}(\omega) & \alpha^2_{LR}(\omega) & \alpha^2_{LR}(\omega) & \alpha^2_{CR}(\omega) & \alpha^2_{LR}(\omega) & \alpha^2_{RR}(\omega) & \alpha^2_{RR}(\omega) & \alpha^2_{CR}(\omega) & \alpha^2_{RR}(\omega) \\
\beta^2_{CR}(\omega) & \beta^2_{LR}(\omega) & \beta^2_{LR}(\omega) & \beta^2_{CR}(\omega) & \beta^2_{LR}(\omega) & \beta^2_{RR}(\omega) & \beta^2_{RR}(\omega) & \beta^2_{CR}(\omega) & \beta^2_{RR}(\omega) \\
\gamma^2_{CR}(\omega) & \gamma^2_{LR}(\omega) & \gamma^2_{LR}(\omega) & \gamma^2_{CR}(\omega) & \gamma^2_{LR}(\omega) & \gamma^2_{RR}(\omega) & \gamma^2_{RR}(\omega) & \gamma^2_{CR}(\omega) & \gamma^2_{RR}(\omega)
\end{bmatrix} \quad (6)
$$

$$
:= \begin{bmatrix} t_T & t_{LL} & t_L & t_{NL} & t_C & t_{NC} & t_{RR} & t_R & t_{NR} \end{bmatrix} \quad (7)
$$

Fig. 4 Example of Area restriction for estimation.

Thus, the rank of T(ω) is reduced to ML + MR − 1 (= 5). Consequently, to make Eq. (2) determined, we need to reduce the number of unknown variables in Z(ω) to 5. In the proposed method, we achieve this by restricting the number of Areas for which the power spectra are estimated by solving Eq. (2). This idea is based on the fact that a nonstationary signal such as speech satisfies the sparseness assumption on the time-frequency plane [11]: the spectrum of the input signal at an arbitrary frequency bin and frame is composed of only a few sound sources. Thus, we concluded that the number of Areas could be reduced without causing severe degradation in the estimation accuracy as long as the method is used for speech enhancement.

Figure 4 shows examples of this Area restriction, where only the power spectra of the sound sources in the shaded Areas are estimated. In Fig. 4(a), the selected Areas cover all three angular sectors of both arrays. In this case, the rank of T(ω) is 5, which means we can solve Eq. (2) after reducing the number of unknown variables in Z(ω) to 5. However, in some cases, the rank of T(ω) becomes smaller than 5 even though the number of variables in Z(ω) is reduced to 5, because the definition of T(ω) itself is changed by the Area selection. In other words, some of the vectors defined in Eq. (7) that are indispensable for spanning the original space of T(ω) are lost by the Area selection. We found that this occurs for certain combinations of selected Areas. For example, no Area that lies in the angular sector ΘRL is selected for estimation in Fig. 4(b). Because every vector that spans the subspace of ΘRL is lost in this case, the rank of T(ω) is reduced to 4. The same problem occurs even if the number of Areas is restricted to 4, as in Fig. 4(c). Eventually, such a problem is avoided only if the number of Areas is reduced to the smaller of ML and MR.
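The rank argument above can be checked numerically. The toy sketch below builds a matrix with the structure of Eq. (6): each Area's column stacks the squared gains of the three array-L beamformers toward that Area's array-L sector on top of those of the three array-R beamformers toward its array-R sector. All gain values are random placeholders; the point is only that the full matrix has rank 5, and a selection that leaves one of array-L's sectors uncovered drops the rank to 4, as in Fig. 4(b).

```python
import numpy as np

rng = np.random.default_rng(1)

# Squared gains of each array's three beamformers toward its three angular
# sectors (rows: beamformers, columns: sectors). Illustrative values only.
GL = rng.uniform(0.05, 1.0, size=(3, 3))  # array-L
GR = rng.uniform(0.05, 1.0, size=(3, 3))  # array-R

# Each Area corresponds to one (array-L sector, array-R sector) pair; its
# column in T stacks the array-L gain column on top of the array-R gain column.
areas = [(i, j) for i in range(3) for j in range(3)]  # 9 Areas
T = np.column_stack([np.concatenate([GL[:, i], GR[:, j]]) for i, j in areas])

print(np.linalg.matrix_rank(T))           # 5, not 6: one degree of freedom is lost

# Selection of 5 Areas that leaves no Area in array-L's third sector (cf. Fig. 4(b)).
keep = [k for k, (i, j) in enumerate(areas) if i != 2][:5]
print(np.linalg.matrix_rank(T[:, keep]))  # 4: the reduced system is still rank deficient
```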

4. Proposed Method

The process flow of the proposed method is shown in Fig. 5. The method mainly consists of the following steps: (a) investigating the possibility that an active sound source exists, (b) estimating the power spectra, and (c) postfiltering. These steps are explained in the following.

Note that, because of the nonstationarity of the sound sources, we process the signals frame by frame, with the frame index denoted by l hereafter. In addition, we approximate the expectations that appeared above by averages over frames, for example

$$E[|Y_{xa}(\omega, l)|^2] = \sum_{\lambda=0}^{\Lambda-1} |Y_{xa}(\omega, l-\lambda)|^2, \quad (8)$$

where Λ denotes the number of frames used for the average.
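As a small illustration of Eq. (8), the expectation can be replaced by a running sum of the squared magnitudes over the most recent Λ frames; the buffer-based implementation below is an assumed sketch, and Λ = 3 is just an illustrative choice.

```python
import numpy as np
from collections import deque

LAMBDA = 3                     # number of frames used for the average
frames = deque(maxlen=LAMBDA)  # holds |Y_xa(omega, l - lambda)|^2 for recent frames

def averaged_power(Y_frame):
    """Push the current beamformer output spectrum and return the sum of the
    last LAMBDA squared-magnitude spectra, approximating E[|Y_xa(omega, l)|^2]."""
    frames.append(np.abs(Y_frame) ** 2)
    return np.sum(np.stack(frames), axis=0)
```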

4.1 Investigating Possibility of Existence of Active Sound Source

To specify the Areas to be excluded from the spectrum estimation, the proposed method preliminarily estimates the noise positions by investigating the possibility of each Area including active sound sources. We then omit the Areas that do not include active sound sources from the power spectra estimation. By reducing the number of unknown variables in this way, the simultaneous equation is reformulated to be over-determined, and it can then be solved by the least squares method.

Fig. 5 Process flow of proposed method.

Fig. 6 Example of investigating possibility of existence of active sound source.

The possibility is investigated by using the input/output (I/O) level differences of the beamformers. As each beamformer in Fig. 3 is designed to point its directivity null towards a particular angular sector, its output level drops markedly below its input level if an active sound source is located in the corresponding angular sector. Because each Area is defined by the combination of the angular sectors of array-L and array-R, we can identify an Area b in which an active sound source exists by examining the product of the I/O level differences given by

$$
IO_b(\omega, l) = \left( \sum_{\lambda=0}^{\Lambda} |X_{Lk}(\omega, l-\lambda)|^2 - \sum_{\lambda=0}^{\Lambda} |Y_{xL}(\omega, l-\lambda)|^2 \right)
\times \left( \sum_{\lambda=0}^{\Lambda} |X_{Rk}(\omega, l-\lambda)|^2 - \sum_{\lambda=0}^{\Lambda} |Y_{xR}(\omega, l-\lambda)|^2 \right), \quad (9)
$$

where k is an arbitrary microphone of each array. The larger this value, the higher the possibility that a sound source exists in Area b.
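A sketch of this investigation step follows, assuming each Area is identified by the pair of beamformers (one per array) whose nulls point toward its two angular sectors. The power values are placeholders, and the Area-to-beamformer mapping shown is hypothetical except for Areas T and R, which follow the Fig. 6 example; the full mapping over the nine Areas is given by Fig. 2.

```python
def io_level_difference(in_power, out_power):
    """I/O level difference of one beamformer: frame-summed input power minus
    frame-summed output power, as in each factor of Eq. (9)."""
    return in_power - out_power

# Illustrative frame-summed powers at one frequency bin.
in_L, in_R = 4.0, 4.2                          # sum of |X_Lk|^2 and |X_Rk|^2
out_L = {"LSB": 3.9, "CSB": 1.1, "RSB": 3.8}   # array-L beamformer output powers
out_R = {"LSB": 3.7, "CSB": 1.2, "RSB": 4.0}   # array-R beamformer output powers

# Area -> (array-L beamformer, array-R beamformer) whose nulls cover that Area.
# "T" and "R" follow the Fig. 6 example; "L" is a hypothetical entry.
area_to_sb = {"T": ("CSB", "CSB"), "R": ("LSB", "CSB"), "L": ("CSB", "LSB")}

io_product = {
    area: io_level_difference(in_L, out_L[l_sb]) * io_level_difference(in_R, out_R[r_sb])
    for area, (l_sb, r_sb) in area_to_sb.items()
}
# Areas with the largest products are the most likely to contain active sources.
ranked = sorted(io_product, key=io_product.get, reverse=True)
print(ranked)
```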

Figure 6 shows an example of this investigation using the I/O level differences of the beamformers. As Fig. 6(a) shows, sound sources are located in each of Areas R and T. When the outputs of the beamformers applied to array-R are mutually compared, as described in Fig. 6(b), only the output of CSBR is significantly reduced because both R and T are located in the direction of ΘCR, to which CSBR points its null. In the same manner, in the case of array-L as shown in Fig. 6(c), the outputs of both LSBL and CSBL are reduced because R and T are located in the directions of ΘRL and ΘCL, respectively. The I/O level differences defined by Eq. (9) combine these measures of how much the signal was reduced by the beamformers of array-R and array-L. Therefore, as described in Fig. 6(d), the I/O level differences of both R and T become large because they are defined by the combinations CSBR–LSBL and CSBR–CSBL, respectively. Thus, we can specify the Areas that have a high possibility of sound source existence by searching for the Areas whose I/O level differences are larger than those of the other Areas.

Note that there are some specific cases in which we cannot determine the existence of a sound source by using the I/O level differences. However, this restriction can be ignored because the chance of such cases occurring is extremely small. The details of this restriction are given in the Appendix.

4.2 Power Spectrum Estimation under Area Restriction

The number of unknown variables in Z(ω, l) is reduced by using the investigated possibilities of active sound source existence. Because the rank of T(ω) is not more than 5, we first discard the 4 of the 9 Areas whose products of I/O level differences are the smallest. We then introduce a matrix T5(ω) := [t_{c1} t_{c2} t_{c3} t_{c4} t_{c5}], where the set C := {c1, · · · , c5} consists of the names of the remaining (not discarded) Areas listed in descending order of the product of the I/O level differences, i.e., IO_{c1} > IO_{c2} > · · · > IO_{c5}. Then, the power spectra of the remaining Areas are estimated by solving the modified simultaneous equation

$$T_5(\omega) Z_5(\omega, l) \simeq Y(\omega, l), \quad (10)$$

$$
Z_5(\omega, l) :=
\begin{bmatrix}
P_{c1}(\omega, l) \\ P_{c2}(\omega, l) \\ P_{c3}(\omega, l) \\ P_{c4}(\omega, l) \\ P_{c5}(\omega, l)
\end{bmatrix}. \quad (11)
$$

The equation is now over-determined, so it can be solved by the least squares method as $\hat{Z}_5(\omega, l) = T_5^{+}(\omega) Y(\omega, l)$, where $^{+}$ and $\hat{\cdot}$ denote the Moore-Penrose pseudo-inverse and an estimated value, respectively. If T5(ω) is still rank deficient, we further discard the Area with the least possibility of containing a sound source from the estimation, and repeat this until the rank of the matrix equals the number of Areas to be estimated.
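A minimal sketch of this estimation step is given below, assuming the full 6×9 matrix T, the vector Y of frame-averaged beamformer output powers, and the I/O products of the nine Areas are already available; the function name and the non-negativity clipping at the end are assumptions for illustration, not details from the paper.

```python
import numpy as np

def estimate_restricted_spectra(T_full, Y, io_products, area_names, n_keep=5):
    """Keep the n_keep Areas with the largest I/O products, then solve the
    reduced system T5 * Z5 = Y by least squares using the Moore-Penrose
    pseudo-inverse. If T5 is rank deficient, drop the least likely remaining
    Area and retry, as described in Sect. 4.2."""
    order = sorted(range(len(area_names)), key=lambda k: io_products[k], reverse=True)
    keep = order[:n_keep]
    while keep:
        T5 = T_full[:, keep]
        if np.linalg.matrix_rank(T5) >= len(keep):
            Z5 = np.linalg.pinv(T5) @ Y
            return {area_names[k]: max(float(z), 0.0) for k, z in zip(keep, Z5)}
        keep = keep[:-1]  # discard the Area with the least possibility
    return {}
```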


4.3 Modified Postfilter Based on Selected Areas

To emphasize the sound sources in the target Area, we multiply the sum of the fixed beamformer outputs by the postfilter G(ω, l). In the summation, only the beamformers whose directivity nulls do not point towards the target Area, i.e., LSBa and RSBa, are used:

$$Y(\omega, l) = G(\omega, l) \cdot Y_B(\omega, l) \quad (12)$$

$$Y_B(\omega, l) := \frac{1}{4}\left( Y_{LL}(\omega, l) + Y_{RL}(\omega, l) + Y_{LR}(\omega, l) + Y_{RR}(\omega, l) \right) \quad (13)$$

The postfilter is calculated from the estimated power spectra of the target and noise Areas using well-known conventional methods, such as Wiener filtering [10], i.e.,

$$G(\omega, l) := \frac{P_T(\omega, l)}{\sum_{i \in C} P_i(\omega, l)}, \quad (14)$$

except for the case where the target Area T is not included in the set C. In that case, we judge that the target sound source is not active, and G(ω, l) is set to a small constant. Finally, the speech-emphasized signal is obtained by applying the inverse Fourier transform to Y(ω, l).
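Putting Sect. 4.3 together, a hedged sketch of the postfiltering step might look as follows. The beamformer output spectra and the dictionary of estimated Area power spectra are placeholders, and the small gain used when the target Area is judged inactive is an assumed value, since the paper only states that a small constant is used.

```python
import numpy as np

def postfilter_frame(Y_LL, Y_RL, Y_LR, Y_RR, P_hat, target="T", small_gain=0.1):
    """Combine the four beamformer outputs whose nulls do not point at the
    target Area (Eq. (13)), then apply the gain of Eq. (14) computed from the
    estimated Area power spectra P_hat (dict: Area name -> power spectrum)."""
    Y_B = 0.25 * (Y_LL + Y_RL + Y_LR + Y_RR)       # Eq. (13)
    if target not in P_hat:                         # target judged inactive
        return small_gain * Y_B
    denom = sum(P_hat.values())
    G = P_hat[target] / np.maximum(denom, 1e-12)    # Eq. (14), floored to avoid /0
    return G * Y_B                                  # Eq. (12); inverse FFT follows
```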

5. Experimental Evaluation

5.1 Situation and Parameters

Experiments were performed in a room whose reverberation time was approximately 250 ms. The parameters used in the experiment are shown in Table 1. The positions of the microphone arrays and sound sources are shown in Fig. 7. The target speaker was positioned at T, which was located inside the target Area, and the noise sources were positioned at N1 to N8. Each microphone array was composed of four microphones in a linear, equi-spaced configuration with an inter-microphone distance of 3 cm. For comparison, we also evaluated the conventional method [9], which assumes that no noise exists in Areas NL, NR, L, and NC unless otherwise stated.

Table 1 Parameters applied to experimental evaluation.
Sampling rate: 16 kHz
d (inter-microphone distance): 3 cm
Frame length: 32 ms
Frame shift: 16 ms
Λ: 3
Λ: 1
Number of microphones in array-L: 4
Number of microphones in array-R: 4
ΘRL: [10°, 70°]
ΘLR: [−70°, −10°]
ΘCL: [−10°, 10°]
ΘCR: [−10°, 10°]
ΘLL: [−50°, −10°]
ΘRR: [10°, 50°]

Fig. 7 Positions of microphone array and sound sources.

5.2 Noise Suppression Performance

First, we examined the performance for a single noise source. The improvement in the signal-to-interference ratio (SIR) [12] was measured when the noise source was located at one of the noise positions defined in Fig. 7. In the results shown in Fig. 8(a), the conventional method failed to suppress any noise signal whose source was located in Area NR, NL, L, or NC, in which the conventional method assumed there were no active sound sources. In contrast, the proposed method, whose results are shown in Fig. 8(b), achieved 6 to 11 dB SIR improvement for most of the Areas, which is nearly the same performance as that of the conventional method when the noise sources were located in the assumed Areas.

Fig. 8 Noise suppression performance measured by SIR improvements: (a) conventional and (b) proposed methods.


Fig. 9 Change of average SIR improvement depending on number of noise sources.

Table 2 Assessment words used for the subjective evaluation.

Score  Assessment word
 3     Much Better
 2     Better
 1     Slightly Better
 0     Same
−1     Slightly Worse
−2     Worse
−3     Much Worse

Second, the noise suppression performance with respect to the number of Areas containing simultaneously active sound sources was evaluated. Figure 9 shows the SIR improvements, averaged over every combination of active Areas, when the number of active noise Areas was changed. The results show that the conventional method loses its ability to suppress noise as the number of Areas increases, while the proposed method maintains its effectiveness even if noise sources are located in every noise Area. For the conventional method, we also show the SIR improvements averaged only over the combinations of the Areas that the conventional method assumes, i.e., C, R, LL, and RR. In comparison, the results of the proposed method were not much worse than those of the conventional method for the assumed Areas.

5.3 Evaluation on Sound Distortion

The nonlinear postfilter applied in the proposed method is well known to cause signal distortion that degrades the sound quality in return for the noise suppression effect. From the sound quality point of view, we therefore performed a subjective evaluation of the output signals of the proposed method. In the test, each of seven subjects (five males and two females, in their twenties to forties) rated the sound quality of the outputs of the proposed method against that of the conventional method using a scale whose assessment words are defined in Table 2. The test was performed only for the conditions where the conventional method worked correctly, i.e., the noise sources were located only in the assumed Areas mentioned above. Furthermore, in order to avoid evaluating the effect of noise suppression itself, each set of data consisted of results of the conventional and proposed methods that gave nearly the same amount of SIR improvement. As the number of results meeting this condition was limited, the subjective test was performed using only the results obtained when either two or three noise Areas were active.

Fig. 10 Results of subjective evaluation.

Figure 10 shows the results of the subjective evaluation. Each "x" marks the average score, and the error bar denotes the 95% confidence interval. From the results, we cannot find any significant degradation of sound quality, because the score "0," which means "the sound quality of the proposed method is the same as that of the conventional method," is included within the confidence interval in every case.

In addition, to examine how the sound quality varies with the noise source position, we also evaluated the amount of distortion in the extracted target sounds using the signal-to-distortion ratio (SDR) [12]. The evaluation was performed only on the output of the proposed method for each noise source position given in Fig. 8(b). As a result, the average SDR over all noise source positions was 7.11 dB, and there was no large dispersion across positions: the standard deviation of the SDR was 0.75 dB. In conclusion, the proposed method is superior to the conventional method in its noise suppression effect, while there is no significant difference in sound quality.

5.4 Example Result of Proposed Method

Finally, Fig. 11 shows the waveforms of the target speech, the input signal, and the output signal when noise sources are located at positions N3, N4, N6, N7, and N8. From these results, we can see that the proposed method successfully retrieved the target signal, because the waveform of the output signal resembles the target speech much more closely than the contaminated input signal does. Furthermore, during the non-target intervals, e.g., 0–1 s and 4.7–6.2 s, the level of the interfering sound was sufficiently reduced even though five noise sources were active.


Fig. 11 Observed waveforms in experiment: (a) target speech, (b) input signal, (c) output of proposed method.

6. Concluding Remarks

An improved method for estimating the power spectra of sound sources located in a particular 2-dimensional area has been proposed. We first introduced the conventional method of estimating power spectra using multiple fixed beamformers, then extended the method to the 2-dimensional case and discussed the maximum number of Areas for which the method is able to simultaneously estimate power spectra. We then presented a way to overcome this limitation by investigating the source locations. The effectiveness of the proposed method was examined by experimental evaluation applied to sounds recorded in a practical environment.

References

[1] M. Brandstein and D. Ward, Microphone Arrays, Springer-Verlag, 2001.
[2] J.G. Ryan and R.A. Goubran, "Near-field beamforming for microphone arrays," ICASSP 1997, pp.363–366, 1997.
[3] Y.R. Zheng and R.A. Goubran, "Robust near-field adaptive beamforming with distance discrimination," IEEE Trans. Speech Audio Process., vol.12, no.5, pp.478–488, 2004.
[4] Y.R. Zheng, P. Xie, and S. Grant, "Robustness and distance discrimination of adaptive near field beamformings," ICASSP 2006, pp.1041–1044, 2006.
[5] A. Ando, M. Iwaki, K. Ono, and K. Kurozumi, "Separation of sound sources propagated in the same direction," IEICE Trans. Fundamentals, vol.E88-A, no.7, pp.1665–1672, July 2005.
[6] S. Araki, S. Makino, Y. Hinamoto, R. Mukai, T. Nishikawa, and H. Saruwatari, "Equivalence between frequency domain blind source separation and frequency domain adaptive beamforming for convolutive mixtures," EURASIP Journal on Applied Signal Processing, vol.2003, no.11, pp.1157–1166, 2003.
[7] R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms," ICASSP 1988, pp.2578–2581, 1988.
[8] H. Saruwatari, S. Kajita, K. Takeda, and F. Itakura, "Speech enhancement using nonlinear microphone array based on complementary beamforming," IEICE Trans. Fundamentals, vol.E82-A, no.8, pp.1501–1510, Aug. 1999.
[9] Y. Hioka, K. Kobayashi, K. Furuya, and A. Kataoka, "Enhancement of sound source located within a particular area using a pair of small microphone arrays," IEICE Trans. Fundamentals, vol.E91-A, no.2, pp.561–574, Feb. 2008.
[10] J. Benesty, S. Makino, and J. Chen, Speech Enhancement, Springer-Verlag, 2005.
[11] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Process., vol.52, no.7, pp.1830–1847, 2004.
[12] S. Araki, H. Sawada, R. Mukai, and S. Makino, "Underdetermined sparse source separation of convolutive mixtures with observation vector clustering," ISCAS 2006, pp.2594–2597, 2006.

Appendix: Restrictions on Estimating Sound Source Existence by Using I/O Level Difference

As noted in Sect. 4.1, there are some specific cases in which we cannot determine the existence of a sound source by using the I/O level difference. Here we explain this restriction in detail. First, let us assume that each suppression beamformer defined in Fig. 3 ideally suppresses the signals arriving from its assigned angular sector while passing the signals from the other sectors without any distortion. For example, the gains of LSBL in this situation are given by αLL(ω) = αCL(ω) = 1 and αRL(ω) ≈ 0. Based on this assumption, the I/O level differences IOb(ω, l) expressed in matrix form are given by

$$
IO_b(\omega, l) =
\begin{bmatrix}
\Pi_{\mathrm{LSBL}}\Pi_{\mathrm{RSBL}} & \Pi_{\mathrm{LSBL}}\Pi_{\mathrm{RSBC}} & \Pi_{\mathrm{LSBL}}\Pi_{\mathrm{RSBR}} \\
\Pi_{\mathrm{LSBC}}\Pi_{\mathrm{RSBL}} & \Pi_{\mathrm{LSBC}}\Pi_{\mathrm{RSBC}} & \Pi_{\mathrm{LSBC}}\Pi_{\mathrm{RSBR}} \\
\Pi_{\mathrm{LSBR}}\Pi_{\mathrm{RSBL}} & \Pi_{\mathrm{LSBR}}\Pi_{\mathrm{RSBC}} & \Pi_{\mathrm{LSBR}}\Pi_{\mathrm{RSBR}}
\end{bmatrix}, \quad (\mathrm{A}\cdot 1)
$$

where

$$
\begin{aligned}
\Pi_{\mathrm{LSBL}}(\omega, l) &= P_C(\omega, l) + P_L(\omega, l) + P_{LL}(\omega, l), \\
\Pi_{\mathrm{LSBC}}(\omega, l) &= P_R(\omega, l) + P_T(\omega, l) + P_{NL}(\omega, l), \\
\Pi_{\mathrm{LSBR}}(\omega, l) &= P_{RR}(\omega, l) + P_{NR}(\omega, l) + P_{NC}(\omega, l), \\
\Pi_{\mathrm{RSBL}}(\omega, l) &= P_C(\omega, l) + P_R(\omega, l) + P_{RR}(\omega, l), \\
\Pi_{\mathrm{RSBC}}(\omega, l) &= P_L(\omega, l) + P_T(\omega, l) + P_{NR}(\omega, l), \\
\Pi_{\mathrm{RSBR}}(\omega, l) &= P_{LL}(\omega, l) + P_{NL}(\omega, l) + P_{NC}(\omega, l).
\end{aligned}
$$

As we search for the Areas whose I/O level differences are larger than those of the others, we cannot specify the Area if some I/O level differences have an equal value. From Eq. (A·1), such cases occur when some or all of the following equations are satisfied:

$$
\begin{aligned}
\Pi_{\mathrm{LSBL}}(\omega, l) &= \Pi_{\mathrm{LSBC}}(\omega, l) = \Pi_{\mathrm{LSBR}}(\omega, l), \\
\Pi_{\mathrm{RSBL}}(\omega, l) &= \Pi_{\mathrm{RSBC}}(\omega, l) = \Pi_{\mathrm{RSBR}}(\omega, l).
\end{aligned}
$$

If some of these equations hold exactly, the proposed I/O level difference cannot be used to determine the Areas. However, such cases are extremely rare in practice because most noise sources are nonstationary and their levels change quickly. Therefore, we can ignore the influence of this restriction in practical implementations of the proposed method.


Yusuke Hioka received B.E., M.E., and Ph.D. degrees in electrical engineering in 2000, 2002, and 2005, respectively, from Keio University, Yokohama, Japan. From 2002 to 2005, he was a Research Associate of the 21st Century COE program at Keio University. In 2005, he joined Nippon Telegraph and Telephone Corporation (NTT), and currently he is a Research Engineer at NTT Cyber Space Laboratories. Since June 2010, Dr. Hioka has also been a Visiting Research Fellow at Massey University, Wellington, New Zealand. His research interests include array signal processing and acoustic signal processing. He is a member of IEEE and the Acoustical Society of Japan.

Ken'ichi Furuya received the B.E. and M.E. degrees in acoustic design from Kyushu Institute of Design, Fukuoka, Japan, in 1985 and 1987, respectively, and the Ph.D. degree from Kyushu University, Japan, in 2005. He joined Nippon Telegraph and Telephone Corporation (NTT) in 1987. He is now a Senior Research Engineer at NTT Cyber Space Laboratories, Tokyo, Japan. His current research interests include signal processing in acoustic engineering. Dr. Furuya was awarded the Sato Prize by the Acoustical Society of Japan (ASJ) in 1991. He is a member of the Acoustical Society of Japan, the Acoustical Society of America, and IEEE.

Yoichi Haneda received the B.S., M.S., and Ph.D. degrees from Tohoku University, Sendai, in 1987, 1989, and 1999, respectively. Since joining Nippon Telegraph and Telephone Corporation (NTT) in 1989, he has been investigating the modelling of acoustic transfer functions, acoustic signal processing, and acoustic echo cancellers. He is now a Senior Research Engineer, Group Leader, at NTT Cyber Space Laboratories. He received paper awards from the Acoustical Society of Japan and from the Institute of Electronics, Information and Communication Engineers of Japan in 2002. Dr. Haneda is a senior member of the IEEE and the Acoustical Society of Japan.

Akitoshi Kataoka received the B.E., M.E., and Ph.D. degrees in Electrical Engineering from Doshisha University, Kyoto, in 1984, 1986, and 1999, respectively. Since joining NTT Laboratories in 1986, he has been engaged in research on noise and reverberation reduction, acoustic microphone arrays, and medium bit-rate speech and wideband coding algorithms for the ITU-T standard. He contributed to establishing the ITU-T G.729 standard. He is currently a Professor in the Faculty of Science and Technology at Ryukoku University. He received the Technology Development Award from the ASJ, the Telecommunication Systems Technology Prize awarded by The Telecommunications Advancement Foundation in 1996, and the Prize of the Commissioner of the Japan Patent Office from the Japan Institute of Invention and Innovation in 2003. Dr. Kataoka is a member of the Acoustical Society of Japan.