Binaural sub-band adaptive speech enhancement using artificial neural networks


Amir Hussain *, Douglas R. Campbell

Department of Electronic Engineering and Physics, University of Paisley, High St., Paisley PA1 2BE, Scotland, UK

Received 1 September 1997; received in revised form 1 January 1998; accepted 1 March 1998

Abstract

In this paper, a general class of "single-hidden layered, feedforward" Artificial Neural Network (ANN) based adaptive non-linear filters is proposed for processing band-limited signals in a multi-microphone sub-band adaptive speech-enhancement scheme. Initial comparative results achieved in simulation experiments using both simulated and real automobile reverberant data demonstrate that the proposed speech-enhancement system employing ANN-based sub-band processing is capable of outperforming conventional noise cancellation schemes. © 1998 Elsevier Science B.V. All rights reserved.

Zusammenfassung

In this work, a general class of non-linear filters based on "single-hidden layer feedforward" artificial neural networks (ANN) is proposed for processing band-limited signals in an adaptive multi-microphone sub-band speech-enhancement method. Initial comparative results, obtained in simulation experiments with both simulated and real signals from reverberant automobile environments, show that the proposed speech-enhancement system using ANN-based sub-band processing can outperform conventional noise-suppression methods. © 1998 Elsevier Science B.V. All rights reserved.

Résumé

In this article, a general class of non-linear adaptive filters based on single-hidden-layer, feedforward artificial neural networks (ANN) is proposed for processing band-limited signals within a multi-microphone, sub-band adaptive signal-enhancement approach. The first comparative results, obtained using real and simulated reverberant automobile data, show that the proposed speech-enhancement system using sub-band ANN processing is capable of outperforming more conventional noise-cancellation methods. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Adaptive speech enhancement; Sub-band processing; Artificial neural networks

Speech Communication 25 (1998) 177-186

* Corresponding author. Tel.: +44 141 848 3427; fax: +44 141 848 3404; e-mail: [email protected].

0167-6393/98/$ - see front matter © 1998 Elsevier Science B.V. All rights reserved.

PII: S0167-6393(98)00036-3

1. Introduction

The enhancement of speech degraded by background noise, by which we mean an increase in the signal-to-noise ratio (SNR), is a necessary but not sufficient condition for improving intelligibility for either human listeners or machine recognition. Humans are capable of detecting and understanding speech at low SNRs without prior knowledge of the speech, the noise or the environment (Ghitza, 1994). Compared with humans, the performance of modern speech recognition equipment has been shown to degrade markedly in the presence of background noise (Lippmann, 1997).

Some researchers have looked to the human hearing system as a source of engineering models for the enhancement problem, e.g. Ghitza (1994), modeling the cochlea, and Cheng and O'Shaughnessy (1991), utilizing a model of the lateral inhibition effect. A recurring feature in this body of work is the accepted model of the cochlea as a spectrum analyser, which splits incoming signals into a large number of band-limited signals prior to further processing. In practice, speech-enhancement systems based on such sub-band decomposition have been shown to give the important benefit of supporting adaptive, diverse parallel processing in the sub-bands (Campbell, 1996; Toner and Campbell, 1993; Hussain et al., 1997b; Mahalanobis et al., 1993). They allow signal features within the sub-bands, such as the noise power, the coherence between the in-band signals from multiple sensors and the convergence behaviour of an adaptive algorithm, to influence the subsequent processing within the respective frequency band.

Single-channel enhancement strategies generally suffer when the noise spectrum overlaps that of the speech. Humans can function well in such circumstances, as shown by the "cocktail party" effect (Moore, 1995), which can be partly attributed to multi-sensor usage, since performance degrades with damage to a sensory path. The existence of the "binaural unmasking" effect (Moore, 1995) supports the use of multiple sensors for noise reduction as well as for spatial localisation; this effect appears functionally equivalent to Widrow's classic noise cancelling (Widrow and Stearns, 1985). Further complicating the enhancement problem are the non-stationarity of many everyday sources and the effects of room acoustics. Humans may invoke short-term adaptation strategies, related to the effect reported by Darwin et al. (1989), to compensate for these.

Classical speech-enhancement methods based on full-band multi-microphone noise cancellation implementations (Widrow and Stearns, 1985), which attempt to model acoustic path transfer functions, can produce excellent results in anechoic environments with localized sound radiators (Toner and Campbell, 1993); however, performance deteriorates in reverberant environments (Campbell, 1996). Sub-band processing has been found to be important in combating reverberation effects (Goulding and Bird, 1990; Toner and Campbell, 1993). Hermansky and Tibrewala (1997) and Bourlard and Dupont (1997) have also recently demonstrated the advantages of using sub-band-based speech recognizers. Adaptation is necessary to compensate for changing noise fields (Wallace and Goubran, 1992) due to, for example, non-Gaussian sources, source/sensor motion, or time-varying acoustic paths. Multi-sensor methods are necessary to compensate for reverberation and speech/noise spectral overlap (Toner and Campbell, 1993; Campbell, 1996).

In previous multi-microphone sub-band adaptive (MMSBA) noise cancellation systems, filter-bank or transform methods provide a set of contiguous sub-bands within which continuous time signals can be processed. The subsequent processing within each sub-band is performed using linear adaptive filters, often using the Least Mean Squares (LMS) algorithm. Significant performance improvements, in terms of error convergence speed and output SNR, have been demonstrated over classical full-band Finite Impulse Response (FIR) filter based noise cancellers for speech enhancement in both simulated and real reverberant environments (Toner and Campbell, 1993; Bouquin et al., 1994; Campbell, 1996). However, in cases where the distortions are not linear, conventional linear adaptive filters will not be able to cancel the non-linear distortions optimally. Additionally, some acoustic signals of interest are more efficiently modeled as non-Gaussian processes (such as those with Laplacian or Gamma density functions). Knecht et al. (1995) have recently demonstrated performance improvements through the use of non-linear filtering in the full band.

Artificial Neural Networks (ANNs) are an attempt to emulate the functionality of the human brain in a very fundamental manner, with a view to capturing some of the power of these biological systems. ANNs are useful for problems in which either the objective cannot be expressed precisely in terms of measurable parameters, or the set of parameters is poorly defined (Hush and Horne, 1993). Over the past decade, there has been an increasing interest in the use of ANNs for solving complex real-world problems (Hush and Horne, 1993; Haykin, 1996; Hussain, 1996). This resurgence of interest is primarily due to improved learning algorithms, firmer theoretical foundations, greatly enhanced computer systems for simulation and improved implementation strategies. Today ANNs have matured into an attractive alternative for solving difficult scientific problems involving learning, due to the sustained efforts of many researchers over the last thirty years.

The conventional feedforward neural networks include the category of multi-hidden-layered Multi-Layered Perceptron (MLP) type structures (Hush and Horne, 1993), and the single-hidden-layered Radial Basis Function (RBF) (Hush and Horne, 1993), Volterra Neural Network (VNN) (Rayner and Lynch, 1989) and newly reported Functionally Expanded Neural Network (FENN) (Hussain et al., 1997a; Hussain, 1997) type structures. All have been shown to be capable of forming an arbitrarily close approximation to any continuous non-linear mapping. However, the multi-layered MLP-type networks have highly non-linear-in-the-parameters structures and require computationally expensive non-linear updating algorithms (such as back-propagation), which are less suitable for on-line adaptive applications (Hussain, 1996). On the other hand, the RBF, VNN and FENN have linear-in-the-parameters structures, because their outputs are a linear combination of the output-layer weights, giving the relative advantages of ease of analysis and rapid adaptation.

In this paper, we propose the use of a class of general non-linear adaptive FIR-type filters, based on single-hidden-layered linear-in-the-parameters ANNs, for processing the band-limited signals in a multi-band speech-enhancement system. We report experiments using real speech signals corrupted with simulated non-linear interference and with real automobile reverberant noise. The results indicate that non-linear ANN-based sub-band processing can enhance the performance of conventional linear full-band and multi-band speech-enhancement systems.

This paper is organized as follows. Section 2 describes the structure of the ANN-based adaptive non-linear FIR filter, along with its adaptation algorithms. Section 3 summarizes the binaural sub-band scheme employing the ANN-based sub-band processing. Some implementation issues concerned with the proposed noise cancellation scheme are presented in Section 4. Section 5 presents the experimental results, which are finally followed by some concluding remarks and proposals for future work in Section 6.

2. Structure of the ANN-based non-linear adaptive filter

Fig. 1. The ANN-based adaptive non-linear filter.

The general structure of the proposed non-linear FIR-type filter, illustrated in Fig. 1, is based on single-hidden-layered, linear-in-the-parameters feedforward ANNs. It employs an input expander which transforms the $n$ inputs $(x_1, \ldots, x_n)$ (representing lagged values of the input signal $x$ passed through an $(n-1)$th-order tapped delay line) into a non-linear intermediate (hidden) space of increased dimension $N$. The expanded input terms $F = (f_1, \ldots, f_N)$ (termed the $N$ basis functions) are then weighted by $W = (w_1, \ldots, w_N)$ and linearly combined to form the adaptive filter output $y$. The overall mapping of the adaptive FIR-type filter is thus $\mathbb{R}^n \to \mathbb{R}^N \to \mathbb{R}$. The advantage of this particular non-linear filter structure is that linear adaptive filter theory can be readily applied for on-line adaptation. The non-linear expansion-model $F$ employed in the filter is completely general and can be, for example, any of the following.

(i) Any of the non-linear basis functions commonly employed in RBF neural networks (Hush and Horne, 1993), such as:

(a) Thin-plate spline basis functions of the $n$ inputs:

$$f_i(u_i) = u_i^2 \log(u_i), \tag{1}$$

where $u_i = \|x - c_i\|$ for $i = 1, \ldots, N$; $x = (x_1, \ldots, x_n)$ is the input vector; $f_i(\cdot)$ are the $N$ non-linear basis functions of the inputs; $\|\cdot\|$ denotes the Euclidean norm; $c_i$ are the centres of the basis functions; and $N$ is the number of RBF centres. The centres are fixed points chosen to sample the $n$-dimensional input space.

(b) The multiquadric activation functions:

$$f_i(u_i) = (u_i^2 + r^2)^{1/2}, \tag{2}$$

where $r$ is a real constant usually termed the width of the basis function.

(c) The inverse multiquadric functions:

$$f_i(u_i) = 1/(u_i^2 + r^2)^{1/2}. \tag{3}$$

(d) And the most widely used Gaussian basis functions:

$$f_i(u_i) = \exp(-u_i^2/r^2). \tag{4}$$

(ii) The sigmoidal basis functions employed in MLP networks (Haykin, 1996):

$$f_i(x) = \tanh(x). \tag{5}$$

(iii) The Volterra (polynomial) expansion employed in the hidden layer of the conventional VNN (Rayner and Lynch, 1989):

$$F(x) = (1, x_{i_1}, x_{i_1}x_{i_2}, \ldots, x_{i_1}x_{i_2}\cdots x_{i_k}), \tag{6}$$

where $i_c = 1, \ldots, n$ for $c = 1, \ldots, k$, with $k$ representing a $k$th-order polynomial expansion of the $n$ inputs, and $F = (f_1, \ldots, f_N)$.

(iv) A hybrid functional expansion employed in a newly developed Functionally Expanded Neural Network (FENN) (Hussain, 1996; Hussain et al., 1997a; Hussain, 1997), which is a variant of the conventional non-linear-in-the-parameters Functional-Link Neural Network (Hussain et al., 1997c):

$$F(x) = (1, x, \sin(ix), \cos(ix), x_j\sin(x_k), x_j\cos(x_k), x_{i_1}x_{i_2}, \ldots, x_{i_1}x_{i_2}\cdots x_{i_k}) \tag{7}$$

for $i = 1, 2, 3$; $j \neq k$ and $j, k = 1, \ldots, n$; $i_1 \neq i_2 \neq \cdots \neq i_k$, and each of $i_1, i_2, \ldots, i_k = 1, \ldots, n$, with $k$ representing the order of the polynomial expansion. The above expansion comprises a combination of sigmoidal-shaped, Gaussian-shaped and polynomial-subset activation functions (Hussain et al., 1997a). An additional benefit of employing the FENN's functional expansion-model, like the VNN's polynomial expansion, is that including the original network inputs within the expansion-model also enables efficient modeling of linear dynamical transfer functions (Hussain, 1997).
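To make the expansion models concrete, the following NumPy sketch implements a tapped delay line, the Gaussian RBF layer of Eq. (4) and the Volterra expansion of Eq. (6) truncated at $k = 2$, followed by the linear output layer $y = F^T W$ of Fig. 1. It is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, samples before the start of the signal are assumed to be zero, and each distinct product $x_{i_1}x_{i_2}$ is kept only once (so $x_1x_2$ and $x_2x_1$ are not duplicated).

```python
import itertools

import numpy as np


def tapped_delay(x, n):
    """Rows of lagged inputs (x[k], x[k-1], ..., x[k-n+1]); samples before x[0] are zero."""
    x = np.asarray(x, dtype=float)
    padded = np.concatenate([np.zeros(n - 1), x])
    return np.stack([padded[k:k + n][::-1] for k in range(len(x))])


def gaussian_rbf(xvec, centres, r):
    """Eq. (4): f_i = exp(-u_i^2 / r^2) with u_i = ||x - c_i|| (one row of `centres` per c_i)."""
    u = np.linalg.norm(centres - xvec, axis=1)
    return np.exp(-(u ** 2) / r ** 2)


def volterra2(xvec):
    """Eq. (6) truncated at k = 2: [1, x_1..x_n, every distinct product x_i1 * x_i2]."""
    pairs = itertools.combinations_with_replacement(range(len(xvec)), 2)
    quad = [xvec[i] * xvec[j] for i, j in pairs]
    return np.concatenate([[1.0], xvec, quad])


def filter_output(F, W):
    """Linear output layer of Fig. 1: y = F^T W."""
    return float(F @ W)


# Usage: expand the latest delay-line vector and combine it with (arbitrary) weights.
X = tapped_delay([0.5, -1.0, 0.25, 2.0], n=3)
F = volterra2(X[-1])            # N = 1 + 3 + 6 = 10 basis functions for n = 3
W = np.full(len(F), 0.1)        # hypothetical output-layer weights
y = filter_output(F, W)
```

Because the weights enter only through $F^T W$, replacing `volterra2` by `gaussian_rbf` (with chosen centres and width) changes the hidden layer without changing the linear output layer on which the adaptation algorithms of Section 2.1 act.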

The choice of one of the above non-linear expansion-models is, in general, problem dependent (Hussain, 1996). Hush and Horne (1993), for example, have shown that some problems, such as functional approximation, can be solved more efficiently with the sigmoidal-type basis functions employed in the MLP and FENN (Eqs. (5) and (7)), while others, such as classification problems, are more amenable to the localized (e.g. Gaussian-type) basis functions employed in the RBF and FENN (Eqs. (4) and (7)). However, all the expansion-models above (Eqs. (1)-(7)) are known to be universal approximators (Hush and Horne, 1993; Hussain, 1996; Hussain et al., 1997a), in that they can approximate any continuous non-linear function on a compact input domain to an arbitrary degree of accuracy.


The relative performance-complexity trade-off of the non-linear models therefore needs to be determined for each specific problem. In practice, however, the simple polynomial expansion-model employed in the VNN (Eq. (6)) is attractive, since it requires relatively low-complexity hardware for implementation. In this paper, we shall restrict our choice of the non-linear filter's expansion-model to be polynomial.

Once the full expansion-model at the single hidden layer of the above ANN-based filter has been specified, conventional stochastic-gradient or least-squares based algorithms can be used to provide an efficient means for real-time adaptation of the filter weights, as described in Section 2.1. This gives these non-linear FIR-type filters a significant advantage over multi-layered (MLP-type) neural-network-based adaptive filter structures (Knecht et al., 1995) in recursive applications.

2.1. Adaptation algorithm

The three sequential stages of the filter's adaptation algorithm are as follows:
1. Compute the filter output at time $k$ as

$$y(k) = F^T(k)\,W(k-1), \tag{8}$$

where $F(k)$ defines the $[N,1]$ hidden-layer vector comprising the expanded input functions,

$$F(k) = [f_1(k)\ f_2(k)\ \ldots\ f_N(k)]^T,$$

where $f_i(k)$, $i = 1, \ldots, N$, represent the basis functions within the non-linear expansion-model described above, superscript $T$ denotes vector transpose, and $W(k-1)$ is the $[N,1]$ filter output-layer weight vector given by

$$W(k-1) = [w_1(k-1)\ w_2(k-1)\ \ldots\ w_N(k-1)]^T.$$

2. Compute the output error as

$$e(k) = d(k) - y(k), \tag{9}$$

where $d(k)$ is the desired signal. The Mean Squared Error (MSE) is therefore

$$E[e(k)^2] = E[d(k)^2] - 2W(k-1)^T E[d(k)F(k)] + W(k-1)^T E[F(k)F(k)^T]\,W(k-1),$$

where $E[\cdot]$ denotes the expectation operator and $T$ denotes matrix transpose. Because the MSE is a quadratic function of the filter weights, with the positive semi-definite matrix $E[F(k)F(k)^T]$ as its curvature, it has no local minima apart from the global minimum. The corresponding minimum MSE (MMSE) for the ANN-based filter can thus be readily written as

$$\mathrm{MMSE} = E[d(k)^2] - E[d(k)F(k)]^T\, E[F(k)F(k)^T]^{-1}\, E[d(k)F(k)],$$

with superscript $-1$ denoting matrix inverse, and assuming that the autocorrelation matrix $E[F(k)F(k)^T]$ is non-singular; the minimum is attained at $W = E[F(k)F(k)^T]^{-1} E[d(k)F(k)]$. The MMSE above includes as a special case the best linear (Wiener) MMSE, obtained for $F(k) = [x_1(k)\ \ldots\ x_n(k)]^T$.
3. Update the filter weight vector $W(k)$ using either:
3.1. Recursive Least Squares (RLS) update.

Update the filter weights $W(k)$ using the exponentially weighted RLS algorithm as follows:

$$W(k) = W(k-1) + P(k)F(k)e(k), \tag{10}$$

where $P(k)$ is the inverse of the exponentially weighted correlation matrix $\sum_{i \le k} \lambda^{k-i} F(i)F(i)^T$ of the expanded input vector, and is updated as

$$P(k) = \frac{1}{\lambda}\left[P(k-1) - \frac{P(k-1)F(k)F(k)^T P(k-1)}{\lambda + F(k)^T P(k-1) F(k)}\right], \tag{11}$$

where $\lambda$ ($\le 1$) is the forgetting factor, which introduces exponential weighting into past data. Numerically robust versions of the RLS, such as the Givens Least Squares algorithm, can be used instead of the above.

3.2. Least Mean Squares (LMS) update. Alternatively, the computationally more efficient and robust LMS algorithm can be used for updating the output-layer weights as follows:

$$W(k) = W(k-1) + \mu\, e(k) F(k), \tag{12}$$

where $\mu$ controls the convergence rate. However, the rate of convergence of the LMS algorithm depends on the spread of the eigenvalues of the input expansion-model's autocorrelation matrix, $E[F(k)F(k)^T]$, with a large eigenvalue spread dictating a significantly slower convergence rate (Hussain et al., 1997a). On the other hand, the least-squares-criterion-based RLS algorithm converges more rapidly than the LMS, but at the expense of increased computational complexity per update, O(N^2) compared to O(N). Various Fast RLS (FRLS) algorithms have also been proposed to reduce the complexity of the RLS from O(N^2) to O(N), and can also be readily applied to adapt the above ANN-based filter structure (Hussain, 1996).
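The three stages above can be sketched in NumPy as follows, using a truncated second-order Volterra expansion (Eq. (6) with $k = 2$) as the hidden layer. The toy "unknown" system, the initialisation $P(0) = I/\delta$, and the values of $\lambda$, $\mu$ and $\delta$ are assumptions chosen for illustration, and the function names are hypothetical. The `wiener` helper estimates $E[FF^T]$ and $E[dF]$ by sample averages and evaluates the MMSE expression of stage 2, so that the adaptive solutions can be checked against it.

```python
import itertools

import numpy as np


def volterra2(xvec):
    """Truncated 2nd-order Volterra expansion (Eq. (6), k = 2)."""
    pairs = itertools.combinations_with_replacement(range(len(xvec)), 2)
    return np.concatenate([[1.0], xvec, [xvec[i] * xvec[j] for i, j in pairs]])


def expand(x, n):
    """Hidden-layer vectors F(k) for every k (zero initial conditions assumed)."""
    padded = np.concatenate([np.zeros(n - 1), np.asarray(x, dtype=float)])
    return np.stack([volterra2(padded[k:k + n][::-1]) for k in range(len(x))])


def adapt(x, d, n, method="rls", lam=0.999, mu=0.1, delta=1e-3):
    """Stages 1-3 of Section 2.1, sample by sample; returns the final W and e(k)."""
    Fs = expand(x, n)
    N = Fs.shape[1]
    W = np.zeros(N)
    P = np.eye(N) / delta                # P(0) = I / delta (assumed initialisation)
    errors = np.empty(len(x))
    for k, F in enumerate(Fs):
        e = d[k] - F @ W                 # Eqs. (8) and (9), using W(k-1)
        if method == "rls":
            PF = P @ F
            P = (P - np.outer(PF, PF) / (lam + F @ PF)) / lam   # Eq. (11)
            W = W + (P @ F) * e          # Eq. (10)
        else:
            W = W + mu * e * F           # Eq. (12), LMS
        errors[k] = e
    return W, errors


def wiener(x, d, n):
    """Sample-average estimate of the optimal weights and of the MMSE expression."""
    Fs = expand(x, n)
    R = Fs.T @ Fs / len(x)               # estimate of E[F F^T]
    p = Fs.T @ d / len(x)                # estimate of E[d F]
    W_opt = np.linalg.solve(R, p)
    mmse = np.mean(d ** 2) - p @ W_opt   # E[d^2] - p^T R^{-1} p
    return W_opt, mmse


# Hypothetical noise-free "unknown" system to be identified.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 5000)
x_prev = np.concatenate([[0.0], x[:-1]])
d = 0.5 * x - 0.3 * x_prev + 0.2 * x ** 2 + 0.1 * x * x_prev

W_rls, e_rls = adapt(x, d, n=2, method="rls")
W_lms, e_lms = adapt(x, d, n=2, method="lms")
W_opt, mmse = wiener(x, d, n=2)
```

Because the toy system lies inside the span of the expansion and no measurement noise is added, RLS, LMS and the closed-form solution should all arrive at the same weights, and the estimated MMSE should be close to zero; adding noise to `d` would raise the MMSE to the noise floor. The inputs are drawn from [-1, 1] so that $\mu\|F(k)\|^2$ stays small, which keeps this LMS run stable.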

3. Binaural sub-band scheme employing ANN-based processing

The binaural sub-band system illustrated in Fig. 2 decomposes the two wide-band input signals into a number of band-limited signals. The sub-band approach reduces the problem of identifying a single, lengthy acoustic impulse response to one of identifying a set of shorter, parallel filters (Toner and Campbell, 1993; Mahalanobis et al., 1993). Toner and Campbell (1993) reported that this approach considerably improved the mean squared error (MSE) convergence rate of an adaptive multi-band LMS filter compared to both the conventional wide-band time-domain and frequency-domain LMS filters. A significant advantage of using sub-band processing (SBP) for speech enhancement is that it supports the use of diverse processing in individual frequency bands, the required sub-band processing being identified from features of the sub-band signals from the multiple sensors (Toner and Campbell, 1993; Campbell, 1996; Hussain et al., 1997b). The sub-bands can be distributed in the frequency domain either in a linear or in a non-linear fashion.

In this work, the sub-bands are obtained by modifying the spectra of the FFT (or DCT) of the input signals, and the number of filters is therefore limited by the size of the FFT. The processing in each sub-band is performed using the ANN-based adaptive non-linear FIR-type filters.
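One simple reading of "modifying the spectra of the DCT" is to take the orthonormal DCT-II of a whole record, zero every coefficient outside a band, and invert; the resulting band signals then sum back to the original. The sketch below assumes equal-width contiguous bands and whole-record processing (no framing, overlap or decimation), and the function names are hypothetical. The DCT matrix is built explicitly so that only NumPy is needed.

```python
import numpy as np


def dct_matrix(L):
    """Orthonormal DCT-II matrix C, so that X = C @ x and x = C.T @ X."""
    k = np.arange(L)[:, None]
    m = np.arange(L)[None, :]
    C = np.sqrt(2.0 / L) * np.cos(np.pi * (2 * m + 1) * k / (2 * L))
    C[0, :] = np.sqrt(1.0 / L)
    return C


def dct_subbands(x, n_bands):
    """Split x into n_bands contiguous, equal-width bands by masking its DCT.

    Returns an array of shape (n_bands, len(x)) whose rows sum back to x.
    """
    x = np.asarray(x, dtype=float)
    C = dct_matrix(len(x))
    X = C @ x
    edges = np.linspace(0, len(x), n_bands + 1).astype(int)
    bands = np.zeros((n_bands, len(x)))
    for b, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        Xb = np.zeros_like(X)
        Xb[lo:hi] = X[lo:hi]             # keep only this band's coefficients
        bands[b] = C.T @ Xb              # back to a band-limited time signal
    return bands


# Usage: a 1 kHz tone sampled at 10 kHz, split into four 1.25 kHz-wide bands.
fs, L = 10_000, 1000
tone = np.sin(2 * np.pi * 1000 * np.arange(L) / fs)
bands = dct_subbands(tone, n_bands=4)
```

With $f_s = 10$ kHz and $L = 1000$, DCT coefficient $k$ corresponds to roughly $k f_s/(2L) = 5k$ Hz, so the 1 kHz tone falls near $k = 200$ and its energy lies almost entirely in the first (0 to 1.25 kHz) band. Each row of `bands` would then feed its own sub-band adaptive filter.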

4. Implementation issues

Because the proposed multi-band speech-enhancement system is a repetition of structurally identical sub-band ANN-based elements, it naturally supports implementation on parallel processors. This spreads the computational load, which may be further reduced by applying decimation techniques. Using closely spaced microphones reduces the required adaptive filter order, and hence both the sub-band computational load and the misadjustment noise in a continuously adapting scheme (Campbell, 1996). The simple constrained ANN processing elements can be implemented using, for example, multipliers for the VNN, look-up tables for the FENN, and so on. The FFT (or DCT) based analysis filter-bank employed in this work can be efficiently implemented using a bank of band-pass filters.

Fig. 2. The binaural noise canceller employing ANN-based SBP.

These features indicate that practical real-time voice-band systems using multi-microphone sub-band adaptive (MMSBA) processing could be constructed using projected high-speed VLSI devices. As a justification, consider the recent Philips DCC-PASC system (Hoogendoorn, 1994), which consists of two custom VLSI chips. It operates in real time with a 24 kHz bandwidth, processing 32 sub-bands 750 Hz wide, and applies decimation and cross-band masking based on human hearing to achieve data compression. As a rough estimate, reducing the bandwidth of this system to 4 kHz could allow 192 bands of width 21 Hz for the same data rate. A non-linear distribution of sub-bands, as in humans (Greenwood, 1990; Allen, 1994), would allow a higher resolution in the range of the formants of speech, if necessary.
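The rough estimate above can be reproduced as follows, assuming (an interpretation, not stated in the text) that the sixfold reduction in bandwidth is traded one-for-one for a sixfold increase in the number of bands:

$$
32 \times 750\ \text{Hz} = 24\ \text{kHz}, \qquad
32 \times \frac{24\ \text{kHz}}{4\ \text{kHz}} = 192\ \text{bands}, \qquad
\frac{4000\ \text{Hz}}{192} \approx 20.8\ \text{Hz} \approx 21\ \text{Hz}.
$$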

5. Simulation results

5.1. Simulated data

A multi-band version of the classical noise cancellation system is illustrated in Fig. 2. The errors between the band-limited primary signal and the output of the adaptive filter within each Sub-Band Processing (SBP) block are summed to form the processed signal e(k) (Toner and Campbell, 1993). The complete system was implemented in MATLAB, and the filter-bank was realized using the real-valued Discrete Cosine Transform (DCT) method.

An initial experiment was carried out using real speech and a simple periodic noise signal. Sinusoids of various frequencies were chosen for this initial experiment because the non-linear distortion effects on the signal are then easy to characterize. An example experiment is described below, showing typical results for sinusoidal noise signals.

The desired signal at the primary channel was a real anechoic speech signal $s(k)$ sampled at 10 kHz, and the reference noise signal was a 1 kHz sinusoid, $n(k) = 0.285\sin(2\pi \cdot 1000\, k/f_s)$ with $f_s = 10$ kHz. This noise was passed through a non-linear transfer function to produce the correlated noise signal $n_0(k)$,

$$n_0(k) = 0.3n(k) + 0.6n(k-1)^2 + 0.9n(k-2)^3 - 0.6n(k-3)^2 - 0.3n(k-4),$$

which was added to the speech in the primary channel. The above transfer function was chosen arbitrarily in order to provide a test case with a strong non-linearity. The SNR at the primary input was approximately -1.4 dB. Ten thousand samples (representing 1 s) of the reference signal $n(k)$ and the primary signal $s(k) + n_0(k)$ were used.
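The noise signals of this experiment can be sketched as follows. The sketch reads $n(k)$ as a 1 kHz tone sampled at $f_s = 10$ kHz (10 samples per period), assumes zero initial conditions for the non-linear path, and uses hypothetical function names; the speech record $s(k)$ itself is not reproduced.

```python
import numpy as np

FS = 10_000            # sampling rate given in the text (10 kHz)
NUM_SAMPLES = 10_000   # 1 s of data


def reference_noise(num_samples=NUM_SAMPLES, fs=FS, amp=0.285, f0=1000.0):
    """n(k) = 0.285 sin(2*pi*1000*k/fs): a 1 kHz tone, 10 samples per period at 10 kHz."""
    k = np.arange(num_samples)
    return amp * np.sin(2 * np.pi * f0 * k / fs)


def nonlinear_path(n):
    """n0(k) = 0.3n(k) + 0.6n(k-1)^2 + 0.9n(k-2)^3 - 0.6n(k-3)^2 - 0.3n(k-4).

    Samples of n before k = 0 are taken as zero (an assumption).
    """
    n = np.asarray(n, dtype=float)
    padded = np.concatenate([np.zeros(4), n])

    def lag(j):
        # padded[4 - j + k] equals n[k - j], or zero when k - j < 0
        return padded[4 - j:4 - j + len(n)]

    return 0.3 * lag(0) + 0.6 * lag(1) ** 2 + 0.9 * lag(2) ** 3 - 0.6 * lag(3) ** 2 - 0.3 * lag(4)


def snr_db(signal, noise):
    """SNR in dB of a signal against an additive noise component."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))


n = reference_noise()
n0 = nonlinear_path(n)   # correlated noise to be added to the speech in the primary channel
# With a speech record s (not reproduced here): primary = s + n0, and snr_db(s, n0) is the input SNR.
```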

The enhancement performance of the proposed multi-band non-linear FIR (MBNLFIR) based noise canceller was compared with that of the classical full (wide)-band linear FIR (FBLFIR) based noise canceller (Widrow and Stearns, 1985) and the multi-band linear FIR (MBLFIR) based noise canceller (Toner and Campbell, 1993). In the MBLFIR system, linear adaptive FIR filters independently perform the processing in each sub-band, whereas in the MBNLFIR system the proposed non-linear FIR filters perform the processing in each sub-band. The Volterra series expansion-model (Eq. (6)), employing a truncated second-order polynomial expansion of the filter inputs, was chosen as the input expansion-model within the non-linear FIR (NLFIR) filter in the MBNLFIR system. The exponentially weighted RLS algorithm was used for adapting the weight coefficients of all the full-band and multi-band noise cancellers.

In order to make the comparisons as fair as possible, an attempt was made to balance the computational complexity of the three algorithms. The order of the full-band linear FIR (FBLFIR) filter was set to 84. For this demonstration, the number of sub-bands in the multi-band linear FIR (MBLFIR) system was set to four, and the order of the linear FIR filter within each band was thus chosen as 21 (so that 4 × 21 = 84, the order of the FBLFIR system). In the case of the MBNLFIR system with four sub-bands, the order of the non-linear VNN-based FIR filter within each band was set to 7. A truncated second-order polynomial expansion of the sub-band NLFIR filter inputs was employed, comprising the actual sub-band filter inputs, their square terms and second-order cross-product terms, which resulted in a total of N = 21 terms (basis functions).
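Counting adaptive output-layer weights, which is one natural (assumed) reading of the balancing described above, the three simulated-data systems come out equal:

$$
\underbrace{84}_{\text{FBLFIR}} \;=\; \underbrace{4 \times 21}_{\text{MBLFIR}} \;=\; \underbrace{4 \times N}_{\text{MBNLFIR},\ N = 21} \;=\; 84 .
$$

If cost is instead measured by the O(N^2) per-sample RLS update of Section 2.1, the split designs are cheaper than the full-band one, roughly $4 \times 21^2 = 1764$ against $84^2 = 7056$, ignoring the cost of the analysis filter-bank.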

The Mean Squared Error (MSE) achieved by the various noise cancellers over the last 9500 samples (the first 500 samples being allowed for convergence) is compared in Table 1, with the SNR improvements shown in parentheses.

As can be seen from Table 1, for this test case the MBLFIR system gives performance similar to that of the conventional FBLFIR filter in cancelling the simulated non-linear distortion. However, the proposed MBNLFIR system produces a much greater performance improvement: its MSE is more than an order of magnitude lower, and its SNR improvement is 20.2 dB compared with about 8 dB for the two linear systems. Informal listening also showed the MBNLFIR processed speech to be both enhanced in SNR and of better perceived quality than that obtained by the other methods.

5.2. Real automobile reverberant data

Speech and noise sequences recorded in an automobile (Mercedes Benz 1990 model) were used for comparing the performance of the three adaptive noise cancellers. It is assumed for this experiment that: (a) the speaker is close enough to the microphones that the acoustic effects on the speech are insignificant; (b) the noise signal at the microphones may be represented as a point source modified by two different acoustic-path transfer functions, H1 and H2; and (c) an effective voice activity detector (VAD) is available.

The three adaptive noise cancellers were now configured to operate as intermittent (or adapt-and-freeze) noise cancellers, with noisy speech input to both the primary and secondary microphones. It has been shown in previous work by the authors (Toner and Campbell, 1993; Campbell, 1996; Hussain et al., 1997b) that two relatively closely spaced microphones may be used in an adaptive noise-cancellation scheme to identify a differential acoustic-path transfer function. Specifically, for the case of correlated inter-channel noise sequences, intermittent noise cancellation is performed, wherein the adaptive filters converge during a noise-alone period. The converged filters model the differential acoustic-path transfer function between the noise source and the two microphones, and can then be used in a noise cancellation format to process the noisy speech signal during the speech-plus-noise period. This scheme is described mathematically in Appendix A.
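The intermittent scheme can be sketched in the time domain as below. For brevity it uses a single full-band linear FIR filter adapted by normalised LMS, not the sub-band RLS-adapted filters used in the experiments; the acoustic paths, the stand-in "speech", and the known noise-alone length (playing the role of the VAD in assumption (c)) are all hypothetical. The reference path is taken as $H_2 = 1$, so the differential transfer function is simply the FIR path `h1`.

```python
import numpy as np


def adapt_and_freeze(primary, reference, n_taps, noise_only, mu=0.5, eps=1e-8):
    """Intermittent noise canceller: adapt an FIR H3 on noise-alone samples, then freeze it.

    noise_only : number of leading samples labelled noise-alone (stands in for the VAD)
    Returns the output e(k) = primary(k) - (H3 * reference)(k) and the frozen taps.
    """
    w = np.zeros(n_taps)
    padded = np.concatenate([np.zeros(n_taps - 1), np.asarray(reference, dtype=float)])
    out = np.empty(len(primary))
    for k in range(len(primary)):
        r = padded[k:k + n_taps][::-1]           # r(k), r(k-1), ..., r(k - n_taps + 1)
        e = primary[k] - w @ r
        if k < noise_only:                       # adapt only while speech is absent
            w = w + mu * e * r / (eps + r @ r)   # normalised LMS step
        out[k] = e
    return out, w


rng = np.random.default_rng(1)
K, noise_only = 4000, 2000
noise = rng.standard_normal(K)
h1 = np.array([0.5, 0.3, -0.2])                  # hypothetical noise-to-primary path; H2 = 1
speech = np.zeros(K)
speech[noise_only:] = np.sin(2 * np.pi * 0.03 * np.arange(K - noise_only))  # stand-in "speech"
primary = speech + np.convolve(noise, h1)[:K]
reference = speech + noise
out, w = adapt_and_freeze(primary, reference, n_taps=8, noise_only=noise_only)
```

After the freeze, the output during speech is $(1 - H_3)S$, a filtered copy of the speech, in line with Appendix A. If the two paths were identical, the frozen filter would become the identity and the speech would be cancelled together with the noise.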

Noise sequences were digitally recorded at a sampling frequency of 12 kHz in the car travelling at 100 km/h, using two microphones with a microphone-to-microphone (MTM) spacing of 0.06 m. NATO alphabet code words, e.g. "anton", "berta", "emil" and "friedrich", were recorded in the car when stationary with all systems turned off. The noise was added to each of four different code words to provide data for two realistic SNR cases, 0 dB and +3 dB. A noise-alone period comprising the first 1024 samples was manually labelled, and the following three noise cancellation systems were compared.
1. The FBLFIR filter of order 1024.
2. The MBLFIR system comprising four sub-bands, with the order of the linear FIR filter within each band set to 1024/4 = 256 (that is, the length of the full-band filter above divided by the number of bands, as in Somayazulu et al., 1989), in an attempt to balance the computational complexity with system 1 above.
3. The MBNLFIR system comprising four sub-bands, with the order of the non-linear VNN-based FIR filter within each band set to 512/4 = 128. A truncated second-order polynomial expansion of the sub-band NLFIR filter inputs was employed, comprising the actual sub-band filter inputs and their square terms, which resulted in a total of N = 256 terms (basis functions) in each sub-band NLFIR filter. Thus, the complexity of the MBNLFIR system is comparable to that of system 2 above.

The results are presented in Fig. 3. As can be seen from Fig. 3, of the three approaches, the MBNLFIR system employing ANN-based SBP gives the best performance in cancelling the real automobile reverberant noise. Informal listening tests using random presentation of the processed signals to five young male adults confirmed the MBNLFIR processed speech to be enhanced in SNR and of better perceived quality than that obtained by the other methods.

Fig. 3. Average SNR improvement (ASNRI ± 1 standard deviation) versus initial SNR.

Table 1
Performance comparison of various adaptive noise cancellers for simulated noisy data

                       FBLFIR         MBLFIR         MBNLFIR
MSE                    8.6 × 10^-4    8.5 × 10^-4    5.1 × 10^-5
(SNR improvement)      (7.9 dB)       (8 dB)         (20.2 dB)

6. Conclusions

A class of general ANN-based adaptive non-linear FIR-type filters has been presented, together with the adaptation algorithms employed, in a sub-band adaptive speech-enhancement scheme. Comparative results achieved in simulation experiments using both simulated non-linear distortion and real-world reverberant signals demonstrate that the use of these non-linear filters within a multi-band noise cancellation system can enhance speech more effectively than conventional linear-filtering-based multi-band and full-band noise cancellers.

The superior performance of the MBNLFIR system is due to the use of non-linear ANN-based processing within the sub-bands. A detailed theoretical study is proposed to define the attainable performance. While the preliminary results reported in this paper must be taken with care, they give interesting information on the limits of such methods as well as on the enhancement brought about by the new scheme.

Additionally, the linear-in-the-parameters structure of the proposed ANNs may give useful insights into the physical composition of the underlying unknown transfer function within each respective frequency band. Furthermore, in both experiments considered in this paper the complexity of the MBNLFIR system was forced to be comparable to that of the MBLFIR system, but it could be further reduced by employing, for example, a self-structuring LMS-type algorithm (Lynch et al., 1991). Although a particular simulated non-linear noise transfer function and a specific car environment were employed in the case studies, in practice any dynamical acoustic-path transfer function can in principle be modelled using the MBNLFIR approach, since all the proposed ANN-based non-linear expansion-models are universal approximators.

Further experiments will use other real data sets, as well as speech recognizers and normal-hearing human subjects, in order to assess and formally quantify the intelligibility improvements obtained by use of the proposed MBNLFIR scheme employing ANN-based sub-band processing.

Acknowledgements

During this work, Dr. A. Hussain was supported by the UK EPSRC Project Reference No. GR/K48907, and Prof. D.R. Campbell was supported by a Leverhulme Trust Fellowship. The authors are grateful to Dr. Klaus Linhard of Daimler-Benz Ltd., Germany, for providing the real automobile data, and thank the anonymous reviewers for their constructive comments and suggestions.

Appendix A

Let $N$, $S$, $P$ and $R$ represent the z-transforms of the noise signal, speech signal, primary signal and reference signal, respectively. The primary and reference signals in each sub-band are thus

$$P = B(S + H_1 N), \qquad R = B(S + H_2 N).$$

The transformed error signal $E = P - H_3 R$ is thus

$$E = B[(1 - H_3)S + (H_1 - H_3 H_2)N],$$

which is a frequency-domain error weighted by the band-limiting transfer function $B$, where $H_3$ represents the sub-band adaptive filter. The Mean Squared Error (MSE) function is

$$J_E = \frac{1}{2\pi j}\oint_{|z|=1} E(z)\,E(z^{-1})\,z^{-1}\,dz.$$

The sub-band noise cancellation problem isthus, to ®nd an H3 such that within the sub-bandde®ned by B, the variance of JE is minimised.During a noise only period S� 0, de®ning thenoise spectral density Unn; then


$$J_E = (2\pi j)^{-1} \oint_{|z|=1} B(H_1 - H_3 H_2)\,\Phi_{nn}\,(H_1 - H_3 H_2)^{*}\,B^{*}\,z^{-1}\,dz,$$

which is minimised in the least-squares sense when

$$H_3 = (BH_1)(BH_2)^{-1}.$$

That is, $H_3$ is a band-limited transfer function, equal to $H_1 H_2^{-1}$ wherever $B \neq 0$, that minimises the noise power in $E$. Now, using $H_3$ as a fixed processing filter when both speech and noise are present ideally gives

$$E = B(1 - H_3)S,$$

where the output $E$ is a noise-reduced, filtered version of the sub-band speech signal. This approach will fail if $H_1 = H_2$, since then $H_3 = 1$ and the speech is cancelled along with the noise; however, in practical situations such acoustic-path balancing is difficult to achieve.
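The two-phase procedure can be sketched in the time domain with an LMS filter (Widrow and Stearns, 1985): first adapt $H_3$ while $S = 0$, then freeze it. This is a minimal sketch under stated assumptions, not the authors' sub-band implementation. It assumes $B = 1$, used as a single full-band stand-in for one sub-band. It also assumes white noise, hypothetical FIR acoustic paths chosen so that $H_1 H_2^{-1}$ is itself a short FIR filter, and a sinusoid as a stand-in for speech. The helper names `fir`, `lms` and `two_mic_canceller` are illustrative.

```python
import numpy as np

def fir(h, x):
    """Causal FIR filtering with zero initial state."""
    return np.convolve(h, x)[: len(x)]

def lms(ref, prim, n_taps=4, mu=0.01):
    """LMS: adapt FIR weights w so that fir(w, ref) tracks prim."""
    w = np.zeros(n_taps)
    buf = np.zeros(n_taps)                 # tapped delay line of reference samples
    for k in range(len(ref)):
        buf = np.roll(buf, 1)
        buf[0] = ref[k]
        e = prim[k] - w @ buf              # error = primary - filtered reference
        w += mu * e * buf
    return w

def two_mic_canceller(h1, h2, seed=0):
    """Adapt H3 during a noise-only period, then freeze it and process
    speech plus noise. Assumes B = 1 (full-band stand-in for a sub-band)."""
    rng = np.random.default_rng(seed)
    n = rng.standard_normal(20000)                  # noise-only period, S = 0
    w = lms(fir(h2, n), fir(h1, n))                 # converges towards H1 / H2
    t = np.arange(4000)
    s = np.sin(2 * np.pi * 0.01 * t)                # hypothetical speech stand-in
    n2 = rng.standard_normal(t.size)
    p, r = s + fir(h1, n2), s + fir(h2, n2)         # P = S + H1 N, R = S + H2 N
    e = p - fir(w, r)                               # E = P - H3 R, H3 frozen
    return w, s, e

h2 = np.array([1.0, 0.4])                  # hypothetical noise path to reference mic
h3_target = np.array([0.6, 0.3, -0.1])     # hypothetical H1 / H2
h1 = np.convolve(h3_target, h2)            # chosen so that H1 = H3 H2 exactly

w, s, e = two_mic_canceller(h1, h2)
print(np.round(w, 3))                       # close to h3_target padded with a zero
print(np.max(np.abs(e - (s - fir(w, s))))) # E versus (1 - H3) S: tiny residual noise

# Balanced paths (H1 = H2): H3 tends to 1, so the speech is cancelled too
w_bal, s_bal, e_bal = two_mic_canceller(h2, h2)
print(np.round(w_bal, 3), np.max(np.abs(e_bal)))
```

When the paths are balanced, the adapted filter tends to a unit impulse, so the output vanishes entirely. This is the failure case described for $H_1 = H_2$.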

References

Allen, J.B., 1994. How do humans process and recognise speech? IEEE Trans. Speech Audio Process. 2, 567–577.

Bouquin, R.L., Faucon, G., Azirani, A.A., Ehrmann, F., 1994. Speech-enhancement using sub-band decomposition and comparison with full-band techniques. In: Signal Processing VII: Theories and Applications, Proc. EUSIPCO'94, University of Edinburgh, UK, pp. 1206–1209.

Bourlard, H., Dupont, S., 1997. Sub-band-based speech recognition. In: Proceedings IEEE ICASSP'97, Vol. 2, Munich, Germany, pp. 1251–1254.

Campbell, D.R., 1996. Speech-enhancement for hearing aids. In: Signal Processing VIII: Theories and Applications, Proc. EUSIPCO'96, Trieste, Italy, pp. 467–470.

Cheng, Y.M., O'Shaughnessy, D., 1991. Speech-enhancement based conceptually on auditory evidence. IEEE Trans. Signal Process. 39 (9), 1943–1954.

Darwin, C.J., McKeown, J.D., Kirby, D., 1989. Perceptual compensation for transmission channel and speaker effects on vowel quality. Speech Communication 8, 221–234.

Ghitza, O., 1994. Auditory models and human performance in tasks related to speech coding and speech recognition. IEEE Trans. Speech Audio Process. 2, 115–132.

Goulding, M.M., Bird, J.S., 1990. Speech-enhancement for mobile telephony. IEEE Trans. Vehicular Technol. 39 (4), 316–326.

Greenwood, D.D., 1990. A cochlear frequency-position function for several species-29 years later. J. Acoustic Soc. 86 (6), 2592–2605.

Haykin, S., 1996. Neural networks expand signal processing's horizons. IEEE Signal Process. Magazine 13, 25–49.

Hermansky, H., Tibrewala, S., 1997. Sub-band based recognition of noisy speech. In: Proceedings IEEE ICASSP'97, Vol. 2, Munich, Germany, pp. 1255–1258.

Hoogendoorn, A., 1994. Digital compact cassette. Proc. IEEE 82 (10), 1479–1489.

Hush, D.R., Horne, B.G., 1993. Progress in supervised neural networks: What's new since Lippmann. IEEE Signal Process. Magazine 10, 9–39.

Hussain, A., 1996. Novel artificial neural network architectures and algorithms for non-linear dynamical system modeling and digital communications applications. Ph.D. Thesis, University of Strathclyde, Glasgow, UK.

Hussain, A., 1997. A new neural network structure for temporal signal processing. In: Proceedings IEEE ICASSP'97, Munich, Germany, pp. 3341–3344.

Hussain, A., Soraghan, J.J., Durrani, T.S., 1997a. A new hybrid neural network structure for non-linear time series modeling. Internat. J. Comput. Intelligence Finance 5 (1), 16–26.

Hussain, A., Campbell, D.R., Moir, T.J., 1997b. A multi-microphone sub-band adaptive speech-enhancement system employing diverse sub-band processing. In: Proceedings ESCA-NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 123–126.

Hussain, A., Soraghan, J.J., Durrani, T.S., 1997c. A new adaptive functional-link neural network based DFE for overcoming co-channel interference. IEEE Trans. Commun. 45 (10), 1358–1362.

Knecht, W.G., Schenkel, M.E., Moschytz, G.S., 1995. Neural network filters for speech-enhancement. IEEE Trans. Speech Audio Process. 3 (6), 433–438.

Lippmann, R.P., 1997. Speech recognition by machines and humans. Speech Communication 22, 1–15.

Lynch, M.R., Holden, S.B., Rayner, P.J., 1991. Complexity reduction in Volterra connectionist networks using a self-structuring LMS algorithm. In: Proceedings of Second IEE Internat. Conference Artificial Neural Networks, Bournemouth, UK, pp. 44–48.

Mahalanobis, A., Song, S., Mitra, S.K., Petraglia, M.R., 1993. Adaptive FIR filters based on structural sub-band decomposition for system identification problems. IEEE Trans. Circuits Systems 40 (6), 375–381.

Moore, B.C.J., 1995. Perceptual Consequences of Cochlear Damage. Oxford University Press, London.

Rayner, P.J.W., Lynch, M.R., 1989. A new connectionist model based on a non-linear adaptive filter. In: Proceedings IEEE ICASSP'89, Glasgow, UK, pp. 1191–1194.

Somayazulu, V.S., Mitra, S.K., Shynk, J.J., 1989. Adaptive line enhancement using multirate techniques. In: Proceedings IEEE ICASSP'89, Glasgow, UK, pp. 928–931.

Toner, E., Campbell, D.R., 1993. Speech enhancement using sub-band intermittent adaptation. Speech Communication 12, 253–259.

Wallace, R.B., Goubran, R.A., 1992. Improved tracking adaptive noise canceller for non-stationary environments. IEEE Trans. Signal Process. 40, 700–703.

Widrow, B., Stearns, S.D., 1985. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.

