pdfs.semanticscholar.org · 1 Annihilation-Reordering Lo ok-Ahead Pip elined CORDIC Based RLS...


Annihilation-Reordering Look-Ahead Pipelined CORDIC Based RLS Adaptive Filters and Their Application to Adaptive Beamforming

Jun Ma, Keshab K. Parhi, and Ed F. Deprettere

Abstract

The novel annihilation-reordering look-ahead technique is proposed as an attractive technique for pipelining of Givens rotation (or CORDIC) based adaptive filters. Unlike the existing relaxed look-ahead, the annihilation-reordering look-ahead does not depend on the statistical properties of the input samples. It is an exact look-ahead and is based on CORDIC arithmetic, which is known to be numerically stable; the conventional look-ahead is based on multiply-add arithmetic. The annihilation-reordering look-ahead technique transforms an orthogonal sequential adaptive filtering algorithm into an equivalent orthogonal concurrent one by creating additional concurrency in the algorithm. Parallelism in the transformed algorithm is explored, and different implementation styles including pipelining, block processing, and incremental block processing are presented. Their complexities are also studied and compared. The annihilation-reordering look-ahead is employed to develop fine-grain pipelined QR decomposition based RLS adaptive filters. Both implicit and explicit weight extraction algorithms are considered. The proposed pipelined architectures can be operated at arbitrarily high sample rates without degrading the filter convergence behavior. Stability under finite-precision arithmetic is studied and proved for the proposed architectures. The pipelined CORDIC based RLS adaptive filters are then employed to develop high-speed linearly constrained minimum variance (LCMV) adaptive beamforming algorithms. Both the QR decomposition based minimum variance distortionless response (MVDR) realization and the generalized sidelobe canceller (GSC) realization are presented. The complexity of the pipelined architectures is analyzed and compared.
The proposed architectures can be operated at arbitrarily high sample rates, and consist of only Givens rotations, which can be scheduled onto CORDIC arithmetic based processors.

EDICS: 5-VLSI / 5-PARA, 5-PARI. Contact author: Jun Ma, Department of Electrical and Computer Engineering, University of Minnesota, 200 Union St. S.E., Minneapolis, MN 55455. Phone: (612) 626-7217. Fax: (612) 625-4583. Email: junma at ece.umn.edu.

I. Introduction

The digital signal processing (DSP) technology has been and continues to be driven by the progress in VLSI circuit technology, by the increasing bandwidth requirements of various applications such as multimedia and wireless communications, and more recently by the desire to reduce the overall power consumed by the target applications. One of the important ways to design efficient algorithms for high-speed/low-power applications is through pipelining [1]-[2] and parallel processing [3]. Look-ahead techniques for pipelining of recursive fixed-coefficient filters have been proposed [2], [4]-[5] and have been successfully applied to two-dimensional recursive filtering [6], dynamic programming [7]-[9], algorithms with quantizer loops [10], finite state machines [8], and Huffman decoders [11]. Relaxed look-ahead techniques for pipelining of adaptive filters have been proposed [12], and have been successfully applied to LMS adaptive filters [13], the stochastic gradient lattice filter [14], the adaptive differential vector quantizer

J. Ma and K.K. Parhi are with the Department of Electrical and Computer Engineering at the University of Minnesota, Minneapolis, MN 55455. E.F. Deprettere is with the Department of Electrical Engineering at the Delft University of Technology, 2628 CD Delft, The Netherlands.


[15], and the adaptive differential pulse code modulation codec [16]. Both look-ahead and relaxed look-ahead techniques are based on multiply-add arithmetic. In this paper, we propose novel annihilation-reordering look-ahead techniques for pipelining of Givens rotation or CORDIC arithmetic based adaptive filters, which have been successfully applied to the QR decomposition based recursive least squares (QRD-RLS) algorithm [17], the adaptive inverse QR algorithm [18], and the QR decomposition based minimum variance distortionless response (MVDR-QR) algorithm [19].

Recursive least squares (RLS) [20] based adaptive filters have wide applications in channel equalization, voiceband modems, high-definition TV (HDTV), digital audio broadcast (DAB) systems, beamforming, and speech and image processing. Historically, least mean squares (LMS) based adaptive filters are preferred in practical applications due to their simplicity and ease of implementation. A limitation of the LMS algorithm is its very slow convergence rate. The convergence of the LMS algorithm is also very sensitive to the eigenvalue spread of the correlation matrix of the input data. In applications such as HDTV equalizers and DAB systems, only a limited number of data samples is available, and an LMS based equalizer may not be able to converge.
The convergence of the RLS algorithm is an order of magnitude faster than that of the LMS algorithm, but its complexity is an order of magnitude higher. However, with rapid advances in scaled VLSI technologies, it is possible to implement RLS adaptive filters on single chips, which makes them attractive due to their rapid convergence behavior.

The QR decomposition based RLS (QRD-RLS) algorithm [20], also referred to as the Givens rotation or CORDIC based RLS algorithm in this paper, is the most promising RLS algorithm since it possesses desirable properties for VLSI implementation such as regularity and good finite word-length behavior, and can be mapped onto CORDIC arithmetic based processors [21]-[24]. The QRD-RLS algorithm can be summarized as follows. At each sample time instance n, evaluate a residual error:

e(n) = y(n) - u^T(n) w(n),    (1)

where u(n) and y(n) denote the p-element vector of signal samples and the reference signal at time instance n, respectively, and w(n) is the p-element vector of weights which minimizes the quantity

\xi(n) = \| \Lambda^{1/2}(n) e(n) \|^2 = \| \Lambda^{1/2}(n) (y(n) - A(n) w(n)) \|^2,    (2)

where y(n) = [y(1), ..., y(n)]^T denotes the sequence of all reference signal samples obtained up to time instance n, A(n) = [u(1), u(2), ..., u(n)]^T is the input data matrix, and \Lambda(n) = diag[\lambda^{n-1}, ..., \lambda, 1] is the diagonal matrix of the forgetting factors. Here, we assume that all the data are real; the extension to the complex case does not present any particular difficulties. The optimum weight vector w_{ls} of the QRD-RLS solution can be obtained by solving the equation

R(n) w_{ls}(n) = p(n),    (3)

where R(n) and p(n) are a p-by-p matrix and a p-by-1 vector, respectively, which are obtained by applying a QR decomposition to the weighted data matrix \Lambda^{1/2}(n) A(n) and the weighted reference vector \Lambda^{1/2}(n) y(n). R(n) is usually chosen to be an upper triangular matrix.

In practice, the QR decomposition is implemented in a recursive manner.
With each incoming data sample set, a new row u^T(n) is appended to the data matrix A(n-1) to yield A(n). An orthogonal transformation matrix Q(n) is determined as the product of p Givens rotation matrices to null the last row of A(n). Thus the triangular matrix R(n-1) gets updated to R(n). The determined matrix Q(n) is then used to update p(n-1) to p(n). The QR


update procedure can be described by the following equation

\begin{bmatrix} R(n) & p(n) \\ 0_p^T & \alpha(n) \end{bmatrix} = Q(n) \begin{bmatrix} \lambda^{1/2} R(n-1) & \lambda^{1/2} p(n-1) \\ u^T(n) & y(n) \end{bmatrix}.    (4)

A signal flow graph (SFG) representation of the QR update procedure is shown in Fig. 1. In this figure, the circle and square cells denote CORDIC operations, with circle cells operating in vectoring mode and square cells operating in rotating mode. The functionality of the recursive update with M feedback delays (M = 1 for the triangular update) is shown at the bottom of Fig. 1. The c and s denote cos θ and sin θ, respectively, which are chosen to annihilate x_1(n). The determined rotation angle is then applied to rotate the second column vector, which consists of \lambda^{M/2} r_2(n-M) and x_2(n).
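The recursive update (4) can be sketched numerically. The following Python fragment is a minimal sketch of ours, not the paper's implementation (all variable and function names are our own): it appends a row [u^T(n), y(n)] to the scaled pair [\lambda^{1/2} R, \lambda^{1/2} p] and re-triangularizes with p Givens rotations, the boundary cell acting in vectoring mode and the internal cells in rotating mode, so that R^T R remains the exponentially weighted Gram matrix of the inputs.

```python
import math

def givens(a, b):
    """Rotation (c, s) with c*a + s*b = sqrt(a^2 + b^2) and -s*a + c*b = 0."""
    r = math.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def qr_update(R, p, u, y, lam=1.0):
    """One recursive QR update, eq. (4): p Givens rotations null u^T(n)."""
    n = len(u)
    s_l = math.sqrt(lam)
    R = [[s_l * x for x in row] for row in R]   # lam^(1/2) R(n-1)
    p = [s_l * x for x in p]                    # lam^(1/2) p(n-1)
    u = list(u)
    for k in range(n):
        c, s = givens(R[k][k], u[k])            # boundary cell: vectoring mode
        for j in range(k, n):                   # internal cells: rotating mode
            R[k][j], u[j] = c * R[k][j] + s * u[j], -s * R[k][j] + c * u[j]
        p[k], y = c * p[k] + s * y, -s * p[k] + c * y
    return R, p

# Feed three sample rows; with lam = 1, R^T R must equal A^T A.
A = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
ys = [1.0, 0.0, 2.0]
R, p = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
for row, y in zip(A, ys):
    R, p = qr_update(R, p, row, y)
```

The invariant R^T(n) R(n) = A^T(n) \Lambda(n) A(n) is what makes the recursion equivalent to the batch QR decomposition of the weighted data matrix.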

[Fig. 1 graphic: a triangular array of boundary (circle, vectoring-mode) and internal (square, rotating-mode) CORDIC cells with M-delay feedback loops; the boundary cell forms r(n) from \lambda^{M/2} r(n-M) and x(n), producing the angles \theta_1, ..., \theta_4 applied by the internal cells to u(n) and y(n).]

Fig. 1. The signal flow graph representation of the QR update procedure.

In practice, there are two types of QRD-RLS algorithms. One type is the implicit weight extraction based RLS algorithms, which are found useful in applications such as adaptive beamforming. In these algorithms, the residual error e(n) is obtained without the explicit computation of the weight vector w(n). A popular implicit weight extraction algorithm is due to McWhirter et al. [25]. The other type is the explicit weight extraction based RLS algorithms, which are found useful in applications such as channel equalization. One such algorithm is due to Gentleman and Kung [26], which involves a triangular update part and a linear array for triangular back-solving. The linear array part does not make use of Givens rotations and thus cannot be efficiently combined with the triangular update part. To overcome this problem, alternative QRD-RLS algorithms with inverse updating have been developed that achieve explicit weight extraction and also make use of Givens rotations. A typical structure is the double-triangular type adaptive inverse QR algorithm [27]-[28]. This algorithm performs a QR update in an upper triangular matrix and an inverse QR update for weight extraction in a lower triangular matrix. Both implicit and explicit weight extraction based QRD-RLS algorithms can be easily pipelined at the cell level (also referred to as coarse-grain pipelining). However, the speed or sample rate of the algorithms is limited by the recursive operations in the individual cells, as shown in Fig. 1. In many applications, such as image coding and beamforming, very high data rates are required, and the sequential QRD-RLS algorithms may not be able to operate at such high data rates. In this paper, we create parallelism in the QRD-RLS algorithms and exploit it for pipelining at a finer level such as the bit or multi-bit level (also referred to as fine-grain pipelining). Notice


that apart from being used to increase speed, pipelining can also be used to reduce power dissipation in low to moderate speed applications [29].

To exploit the parallelism and increase the speed of the QRD-RLS, look-ahead techniques or block processing techniques can be applied. The look-ahead techniques and the so-called STAR rotation have been used in [30] to allow fine-grain pipelining with little hardware overhead. However, this is achieved at the cost of a degradation in filtering performance due to the approximations in the algorithms. Block processing was used to speed up the QRD-RLS in [31], with large hardware overhead. Both these approaches are based on multiply-add arithmetic. If one insists on not using multiply-add arithmetic for the implementation, there is no trivial extension of the look-ahead technique to the QRD-RLS algorithm.

There are other fast QRD-RLS algorithms, which are computationally more efficient than the original algorithm [32]-[38]. Square-root free forms of QRD-RLS are presented in [32]-[36]. A unified approach to square-root free QRD-RLS algorithms is presented in [33]. A low-complexity square-root free algorithm is developed in [34]. In [35], a scaled version of the fast Givens rotation [32] is developed that prevents overflow and underflow. Recently, a division as well as square-root free algorithm has been proposed in [37]. In [38], a fast QRD-RLS algorithm based on Givens rotations was introduced. However, all these fast algorithms suffer the same pipelining difficulty as the QRD-RLS algorithm, i.e., they cannot be pipelined at the fine-grain level.

In this paper, we propose a novel annihilation-reordering look-ahead technique to achieve fine-grain pipelining in QRD-RLS adaptive filters. It is an exact look-ahead and is based on CORDIC arithmetic. One of the nice properties of this technique is that it can transform an orthogonal sequential recursive DSP algorithm into an equivalent orthogonal concurrent one by creating additional concurrency in the algorithm.
The resulting transformed algorithm possesses pipelinability, stability (if the original algorithm is stable), and good finite word-length behavior, which are attractive for VLSI implementation.

This paper is organized as follows. The annihilation-reordering look-ahead technique is presented in section II. The derivation of pipelined CORDIC based RLS adaptive filters using the proposed look-ahead technique is presented in section III. Section IV presents the application of pipelined RLS adaptive filters to adaptive beamforming. Finally, section V draws conclusions and discusses future research directions. Appendix A and appendix B provide the derivation and proof of some key formulas presented in the paper.

II. The Annihilation-Reordering Look-Ahead Technique

In this section, we introduce the annihilation-reordering look-ahead technique as an exact look-ahead based on CORDIC arithmetic. Similar to the traditional look-ahead, it transforms a sequential recursive algorithm into an equivalent concurrent one by creating additional parallelism in the algorithm. It is based on CORDIC arithmetic and is suitable for pipelining of Givens rotation based adaptive filtering algorithms. The annihilation-reordering look-ahead technique can be derived from two aspects. One is the block processing point of view, and the other is the iteration point of view. The former is practical in real applications, while the latter shows the connection with the traditional look-ahead technique. Throughout our derivation, the forgetting factor \lambda is omitted for clarity.

This section is organized as follows. The derivations of the annihilation-reordering look-ahead through block processing and iteration are presented in section II-A and section II-B, respectively. The relationship with the conventional multiply-add look-ahead is shown in section II-C. Section II-D explores the parallelism in the proposed look-ahead transformed algorithm. Different implementation styles are then presented in section II-E.
Finally, a lemma for stability invariance is stated and proved in section II-F.

A. Look-Ahead Through Block Processing

In this section, we derive the annihilation-reordering look-ahead transformation for Givens rotation based adaptive filtering algorithms via a block-processing formulation. The annihilation-reordering look-ahead technique can be summarized as the following two-step procedure.

1. Formulate the block updating form of the recursive operations with block size equal to the pipelining level M.
2. Choose a sequence of Givens rotations to perform the updating in such a way that it first operates on the block data and then updates the recursive variables. The aim is to reduce the computational complexity of a block update inside the feedback loop to the same complexity as a single-step update.

Assume that a 3-time speedup is desired for the QR update shown in Fig. 1. Consider the block update form of the QR update procedure shown in Fig. 2 with block size equal to the desired pipelining level 3. A sequence of Givens rotations is then chosen to annihilate the block data u(n-2), u(n-1), and u(n), and update R(n-3) to R(n). In this figure, the traditional sequential update operation is used. The sample data is annihilated in a row-by-row manner, and the diagonal r elements are involved in each update. The signal flow graph of a typical r element update is shown in Fig. 3. It can be seen that the number of rotations inside the feedback loop increases linearly with the number of delay elements in the loop. Therefore, there is no net improvement in the sample or clock speed.

[Fig. 2 graphic: five Givens rotation stages Q_1 through Q_5 annihilating the block data rows u(n-2), u(n-1), u(n) against the triangular r array, row by row.]

Fig. 2. QRD-RLS block update.

[Fig. 3 graphic: (a) one CORDIC unit C in a single-delay feedback loop; (b) three cascaded CORDIC units in a 3-delay feedback loop.]

Fig. 3. (a) The sequential QR update procedure. (b) The block update procedure with block size 3.

The novel annihilation-reordering look-ahead technique is illustrated in Fig. 4. In this figure, the sample data is annihilated in a column-by-column manner, and the diagonal r elements are updated only at the last step. The signal flow graph of a typical r element update is shown in Fig. 5. It can be seen that, without increasing the loop computational complexity, the number of delay elements in the feedback loop is increased from one to three. These three delay elements can then be redistributed around the loop using the retiming technique [39] to achieve fine-grain pipelining by 3 levels. The two CORDIC units outside the feedback loop are the


computation overhead due to the look-ahead transformation. Since they are feed-forward, cutset pipelining [40] can be applied to speed them up. Furthermore, the overhead CORDIC units outside the loop can be arranged in a tree structure to exploit the parallelism and reduce the overall latency.
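For the boundary (r) element, the reordering can be checked numerically. The sketch below is our own illustration (the hypothetical `vec` stands in for a vectoring-mode CORDIC, and the forgetting factor is omitted as in the derivation): it compares row-by-row annihilation, where all three rotations sit in the feedback loop, with the reordered column-by-column scheme, where two rotations are feed-forward and only the last one touches r.

```python
import math

def vec(a, b):
    """Vectoring-mode CORDIC: rotate [a, b] onto [sqrt(a^2 + b^2), 0]."""
    return math.hypot(a, b)

r_old = 2.0                  # r(n-3)
u = [0.7, -1.3, 0.4]         # block data u(n-2), u(n-1), u(n)

# Row-by-row annihilation (Fig. 3): three rotations inside the loop.
r_seq = r_old
for x in u:
    r_seq = vec(r_seq, x)

# Annihilation reordering (Fig. 5): pre-combine the block feed-forward,
# then a single rotation updates r inside the loop.
t = vec(u[2], u[1])          # rotation on rows (3, 4): feed-forward
t = vec(t, u[0])             # rotation on rows (2, 3): feed-forward
r_la = vec(r_old, t)         # rotation on rows (1, 2): the only loop operation
```

Both orderings yield the same r(n) = sqrt(r(n-3)^2 + u(n-2)^2 + u(n-1)^2 + u(n)^2), but the reordered version leaves a single CORDIC delay in the loop, which is what permits retiming.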

[Fig. 4 graphic: five rotation stages Q_1 through Q_5 annihilating the block data u(n-2), u(n-1), u(n) column by column; the r elements are touched only by the final stage.]

Fig. 4. QRD-RLS annihilation-reordering look-ahead.

[Fig. 5 graphic: (a) one CORDIC unit C in a single-delay feedback loop; (b) two feed-forward CORDIC units feeding one CORDIC unit in a 3-delay feedback loop.]

Fig. 5. (a) A sequential QR update procedure. (b) The 3-level pipelined architecture using annihilation-reordering look-ahead.

B. Look-Ahead Through Iteration

Alternatively, the annihilation-reordering look-ahead can be derived through matrix iterations. From Fig. 1 and equation (4), the basic QR recursion is given as

\begin{bmatrix} r(n) \\ 0 \end{bmatrix} = \begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} r(n-1) \\ u(n) \end{bmatrix},    (5)

where r(n) and u(n) correspond to the boundary element and the input data to the boundary element in Fig. 1, respectively. A direct look-ahead by iterating equation (5) two times can be performed by the following embedding procedure. Equation (5) can be rewritten as

\begin{bmatrix} r(n) \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} c_1 & 0 & s_1 \\ 0 & 1 & 0 \\ -s_1 & 0 & c_1 \end{bmatrix} \begin{bmatrix} r(n-1) \\ 0 \\ u(n) \end{bmatrix}.    (6)

From equation (5), we also have

\begin{bmatrix} r(n-1) \\ 0 \\ u(n) \end{bmatrix} = \begin{bmatrix} c_2 & s_2 & 0 \\ -s_2 & c_2 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r(n-2) \\ u(n-1) \\ u(n) \end{bmatrix}.    (7)


Substituting equation (7) into equation (6) leads to

\begin{bmatrix} r(n) \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} c_1 & 0 & s_1 \\ 0 & 1 & 0 \\ -s_1 & 0 & c_1 \end{bmatrix} \begin{bmatrix} c_2 & s_2 & 0 \\ -s_2 & c_2 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r(n-2) \\ u(n-1) \\ u(n) \end{bmatrix}.    (8)

This is the one-step iterated version of equation (5). Iterating (8) once more leads to the following two-step iterated version of (5):

\begin{bmatrix} r(n) \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} c_1 & 0 & 0 & s_1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -s_1 & 0 & 0 & c_1 \end{bmatrix} \begin{bmatrix} c_2 & 0 & s_2 & 0 \\ 0 & 1 & 0 & 0 \\ -s_2 & 0 & c_2 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} c_3 & s_3 & 0 & 0 \\ -s_3 & c_3 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r(n-3) \\ u(n-2) \\ u(n-1) \\ u(n) \end{bmatrix}.    (9)

The signal flow graph of (9) is shown in Fig. 3. Notice that this transformation does not help much, since all three CORDIC operations involve updating the r element. Although the feedback loop contains three delays, the computation time in the loop is also increased by a factor of three. Therefore, the overall sample rate remains unaltered.

In order to increase the sample rate, the following transformation is considered. Notice that in (9), instead of annihilating the input vector in the order of (1,2), (1,3), and (1,4), we could apply the Givens matrices in a different order so that the input vector is annihilated in the order of (3,4), (2,3), and (1,2), where the notation (i,j) represents a pair of row indices of the input matrix in (9). For example, (3,4) denotes that the Givens matrix will operate on the input vector [u(n-1), u(n)]^T. According to this scheme, the input samples are pre-processed, and the r elements are updated only at the last step. This leads to the following 3-level annihilation-reordering look-ahead transformation for CORDIC based RLS adaptive filters:

\begin{bmatrix} r(n) \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} c_1' & s_1' & 0 & 0 \\ -s_1' & c_1' & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & c_2' & s_2' & 0 \\ 0 & -s_2' & c_2' & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & c_3' & s_3' \\ 0 & 0 & -s_3' & c_3' \end{bmatrix} \begin{bmatrix} r(n-3) \\ u(n-2) \\ u(n-1) \\ u(n) \end{bmatrix}.

The SFG of the above transformation is shown in Fig. 5, which is the same as the one derived from the block processing point of view.
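The equivalence of the two rotation orderings can be verified directly on the 4-by-4 matrices of (9) and its reordered counterpart. The Python sketch below is ours, for illustration only (the angles are computed on the fly as each rotation is applied): it annihilates the same input vector in the orders (1,2), (1,3), (1,4) and (3,4), (2,3), (1,2), and checks that both produce the same r(n).

```python
import math

def givens4(i, j, a, b):
    """4x4 Givens matrix acting on rows (i, j) that zeroes b against a."""
    r = math.hypot(a, b)
    c, s = ((1.0, 0.0) if r == 0.0 else (a / r, b / r))
    G = [[1.0 if p == q else 0.0 for q in range(4)] for p in range(4)]
    G[i][i], G[i][j], G[j][i], G[j][j] = c, s, -s, c
    return G

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

v0 = [2.0, 0.7, -1.3, 0.4]          # [r(n-3), u(n-2), u(n-1), u(n)]

# Eq. (9): annihilate against the first row in the order (1,2), (1,3), (1,4);
# every rotation touches the r element (0-based row 0).
v = list(v0)
for j in (1, 2, 3):
    v = matvec(givens4(0, j, v[0], v[j]), v)

# Reordered: (3,4), (2,3), (1,2); only the final rotation touches row 0.
w = list(v0)
for i, j in ((2, 3), (1, 2), (0, 1)):
    w = matvec(givens4(i, j, w[i], w[j]), w)
```

In both cases the rotated vector collapses to [sqrt(r^2 + u^2-sum), 0, 0, 0]; the reordering only changes which rotations lie inside the feedback loop.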
Therefore, without increasing the loop computation time, we have increased the number of delays in the feedback loop from one delay element to three. These three delay elements can then be redistributed around the loop to achieve pipelining by 3 levels.

The above derivation is similar to the traditional multiply-add look-ahead [2] procedure in the sense that both perform look-ahead through iteration. However, it can be seen that the block processing derivation in section II-A is simpler and more practical in real applications.

C. Relationship with Multiply-Add Look-Ahead

It is worth mentioning here that, for the first-order case, there exists a strong similarity between the transformed flow graphs of the annihilation-reordering look-ahead and the multiply-add look-ahead. Consider the first-order IIR digital filter described by the equation

y(n) = a y(n-1) + u(n).    (10)


[Fig. 6 graphic: (a) a multiply-add unit with coefficient a in a single-delay feedback loop; (b) two feed-forward multiply-add stages feeding a coefficient-a^3 loop with three delays.]

Fig. 6. (a) A first-order IIR digital filter. (b) The 3-level pipelined architecture using multiply-add look-ahead.

The SFG of (10) is shown in Fig. 6(a). After applying the multiply-add look-ahead transformation with pipelining level 3, the resulting equation is given as

y(n) = a^3 y(n-3) + a^2 u(n-2) + a u(n-1) + u(n).    (11)

The SFG of (11) is shown in Fig. 6(b). The filter sample rate can be increased by a factor of 3 after redistributing the 3 delay elements in the feedback loop.
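As a quick sanity check (our illustration, with an assumed zero initial condition y(-1) = 0 and arbitrary sample values), the look-ahead form (11) reproduces the sequential recursion (10) exactly:

```python
a = 0.9
u = [1.0, -0.5, 0.25, 2.0, 0.0, 1.5]

# Sequential recursion (10): y(n) = a*y(n-1) + u(n).
y = []
prev = 0.0
for x in u:
    prev = a * prev + x
    y.append(prev)

# Look-ahead form (11): y(n) = a^3*y(n-3) + a^2*u(n-2) + a*u(n-1) + u(n).
# The a^2 and a terms are feed-forward; only the a^3 term sits in the loop.
for n in range(3, len(u)):
    y_la = a ** 3 * y[n - 3] + a ** 2 * u[n - 2] + a * u[n - 1] + u[n]
    assert abs(y_la - y[n]) < 1e-12
```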

[Fig. 7 graphic: Fig. 5 redrawn in the form of Fig. 6, with CORDIC units C in place of the multiply-add units.]

Fig. 7. (a) A sequential QR update procedure. (b) The 3-level pipelined architecture using annihilation-reordering look-ahead.

On the other hand, Fig. 5 can be redrawn as shown in Fig. 7. Comparing Fig. 7 to Fig. 6, it is seen that the two graphs are essentially the same except that the multiply-add units are replaced by CORDIC units.

D. Parallelism in Annihilation-Reordering Look-Ahead

In this section, we show explicitly how the annihilation-reordering look-ahead technique exploits the parallelism in the recursive algorithm and creates the additional concurrency.

[Fig. 8 graphic: a chain of CORDIC nodes r(n-3), ..., r(n+5) along the k (time) direction, each fed by one input sample u(n-2), ..., u(n+5).]

Fig. 8. The dependence graph of the sequential QR update procedure.

Consider the sequential QR update procedure shown in Fig. 5(a). Its dependence graph (DG) representation is shown in Fig. 8. In this figure, the little circles denote CORDIC operations, and the arrows denote dependency between signals. The k direction is the time-increasing direction, and the arrows along the k direction represent the dependency between iterations. For example, r(n+1) is dependent on r(n), r(n) is dependent on r(n-1), and so on. As we mentioned in section I, it is this kind of dependency that limits the speed of the QR update and thus limits the sample rate. The annihilation-reordering look-ahead actually breaks this dependency and transforms the original DG into an equivalent DG which consists of M independent sub-DGs, where M is the desired pipelining level. Since these M sub-DGs are independent, they can be executed in parallel. Therefore, the M independent sub-DGs are the additional concurrency created by the look-ahead transformation. For M = 3, the 3 sub-DGs (DG-I, DG-II, and DG-III) are shown in Fig. 9. Fig. 9 is the DG representation of Fig. 5(b). From Fig. 9, it is seen that the computation of r(n) no longer depends on r(n-1), but instead depends on the r element 3 iterations


[Fig. 9 graphic: the dependence graph split into three independent sub-DGs DG-I, DG-II, and DG-III along the k (time) direction.]

Fig. 9. The dependence graph of the pipelined QR update with pipelining level 3.

back, which is r(n-3). Similarly, r(n+1) is dependent on r(n-2), and r(n+2) depends on r(n-1). Therefore, after the look-ahead transformation, the dependency between consecutive iterations is broken down into 3 independent operation sequences, with each sequence having a dependency between every three iterations. For each sub-DG, the dependency which is perpendicular to the k direction does not cause a problem, since an index transformation [41] (which is equivalent to cut-set pipelining for an SFG) can be performed to resolve this dependency. Therefore, 3 independent sequences, which consist of 3^2 = 9 independent CORDIC operations in total, are created for one iteration. In general, for pipelining level M, M^2 independent CORDIC operations are created in the algorithm for one iteration. As M increases, unbounded parallelism can be created in the algorithm, which can thus achieve an arbitrarily high sample rate.

E. Pipelining and Block Processing Implementations

In this section, we present three concurrent realizations of CORDIC based RLS adaptive filters. They are named pipelining, block processing, and incremental block processing.

E.1 Pipelined Realization

Consider the three sub-DGs in Fig. 9. If they are mapped along the k direction, we obtain the SFG representation. Since the three sub-DGs are independent, they can be mapped onto the same CORDIC operation resources and operated in a pipeline interleaving fashion [2]. This leads to the pipelined realization of CORDIC based adaptive filters shown in Fig. 10. In this figure, the input data samples are processed in a block manner through a tapped delay line, as shown in Fig. 11(a). Since consecutive block samples are shift-overlapped, all filtering outputs can be obtained consecutively. The implementation complexity in terms of CORDIC units for the pipelined realization is linear with respect to the pipelining level M, which is M = 3 CORDIC units in Fig. 10.
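The pipeline interleaving of the M = 3 sub-DGs can be illustrated in Python. The sketch below is ours (the hypothetical `vec` again models a vectoring-mode CORDIC, the forgetting factor is omitted, and a unit initial condition is assumed): after look-ahead, r(n) depends only on r(n-3), so the three residue classes n mod 3 form independent streams that can be interleaved through a single 3-stage pipelined unit.

```python
import math

def vec(a, b):
    return math.hypot(a, b)

u = [0.5, -1.0, 0.7, 1.2, -0.3, 0.9, 0.4, -0.8, 1.1]

# Sequential reference: r(n) = vec(r(n-1), u(n)), starting from r = 1.
r_seq = []
prev = 1.0
for x in u:
    prev = vec(prev, x)
    r_seq.append(prev)

# Look-ahead with M = 3: keep one state per sub-DG (residue class n mod 3)
# and pre-combine each shift-overlapped block u(n-2), u(n-1), u(n)
# feed-forward before the single in-loop rotation.
streams = {m: 1.0 for m in range(3)}
r_la = []
for n in range(len(u)):
    t = 0.0
    for x in u[max(0, n - 2): n + 1]:   # feed-forward pre-combining tree
        t = vec(t, x)
    streams[n % 3] = vec(streams[n % 3], t)
    r_la.append(streams[n % 3])
```

Because the three streams never read each other's state, they can share one CORDIC unit in round-robin fashion, which is exactly the pipeline-interleaved boundary cell of Fig. 10.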


[Fig. 10 graphic: compound boundary cell with three cascaded CORDIC units (angles θ1, θ2) sharing a 3-delay feedback loop; the dashed circle marks the compound cell.]

Fig. 10. Pipelined realization with pipelining level 3.

[Fig. 11 graphic: tapped delay lines converting the serial input u(n) into (a) shift-overlapped blocks and (b) non-overlapping blocks u(n), ..., u(n-8) for blocks (k), (k-1), (k-2).]

Fig. 11. Serial-to-parallel conversion for (a) pipelining and (b) block processing.

E.2 Block Processing Realization

In block processing, the three sub-DGs are mapped independently along the k direction to obtain the block processing realization shown in Fig. 12. In block realizations, input samples are processed in the form of non-overlapping blocks to generate non-overlapping output samples. The block of multiple inputs is derived from the single serial input by using a serial-to-parallel converter at the input, as shown in Fig. 11(b), and the serial output is derived from the block of outputs by a parallel-to-serial converter at the output. The implementation complexity in terms of CORDIC units for the block processing realization is quadratic with respect to the pipelining level M, which is M^2 = 3^2 = 9 CORDIC units in Fig. 12. To reduce complexity, the incremental block processing technique can be used.

[Fig. 12 graphic: nine CORDIC units producing r(3k), r(3k+1), r(3k+2) in parallel from the block inputs u(3k+1), u(3k), with rotation angles θ1, θ2, θ3 per output.]

Fig. 12. Block processing realization with block size 3.


E.3 Incremental Block Processing Realization

Consider the annihilation-reordering look-ahead transformed DG shown in Fig. 9. Instead of using u(n+1), u(n), u(n-1), and r(n-2) to obtain r(n+1), r(n+1) can be computed incrementally using u(n+1) and r(n) once r(n) is available. Similarly, r(n+2) can be computed incrementally using u(n+2) and r(n+1) once r(n+1) is available. The DG of the incremental block QR update is shown in Fig. 13. Mapping the DG along the k direction gives the SFG representation of the incremental block processing realization shown in Fig. 14. The implementation complexity in terms of CORDIC units for incremental block processing is linear with respect to the pipelining level M, which is 2M - 1 = 2 x 3 - 1 = 5 CORDIC units in Fig. 14. Notice that the incremental computation parts do not contain feedback loops, thus cutset pipelining can be employed to speed them up.
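The incremental scheme can be checked on the scalar boundary-cell recursion. This is our own sketch (the hypothetical `vec` stands for a vectoring-mode CORDIC; the forgetting factor is omitted): one block of size M = 3 costs 2M - 1 = 5 rotations, three of them forming r(6) by look-ahead and two incremental ones filling in r(4) and r(5).

```python
import math

def vec(a, b):
    return math.hypot(a, b)

u = [0.3, -1.1, 0.8, 2.0, -0.6, 0.5]    # samples u(1), ..., u(6)

# Sequential reference: r[n] holds r(n), with r(0) = 1 assumed.
r = [1.0]
for x in u:
    r.append(vec(r[-1], x))

# One block starting at n = 3: obtain r(6) via look-ahead (feed-forward
# pre-combining plus one in-loop rotation), then r(4) and r(5) incrementally.
t = vec(u[5], vec(u[4], u[3]))   # pre-combine u(4), u(5), u(6): 2 rotations
r6 = vec(r[3], t)                # the only rotation in the feedback loop
r4 = vec(r[3], u[3])             # incremental: r(4) from r(3) and u(4)
r5 = vec(r4, u[4])               # incremental: r(5) from r(4) and u(5)
```

The incremental rotations for r(4) and r(5) hang off the loop as feed-forward computations, which is why they can themselves be cutset-pipelined.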

[Fig. 13 graphic: dependence graph along the k (time) direction in which r(n+3) is obtained from r(n) by look-ahead, while r(n+1) and r(n+2) are filled in incrementally from r(n) and r(n+1).]

Fig. 13. The dependence graph of the incremental block QR update with block size 3.

[Fig. 14 graphic: five CORDIC units, one inside the 3-delay feedback loop and four feed-forward, producing r(3k), r(3k+1), r(3k+2) with angles θ1(3k), θ2(3k), θ3(3k), θ3(3k+1), θ3(3k+2).]

Fig. 14. Incremental block processing realization with block size 3.

Therefore, in terms of the number of CORDIC units used in the implementation, the pipelined realization is better than incremental block processing and block processing, and incremental block processing is better than block processing. In practice, the choice of implementation style depends on the target application and the available hardware resources.

F. Invariance of Bounded Input/Bounded Output

In this section, we show a property of the annihilation-reordering look-ahead transformation. It will be useful in the proof of the stability of the pipelined QRD-RLS algorithm in section III-B.

Lemma 1: Consider the compound CORDIC cell denoted by the dashed circle in Fig. 10. Under finite-precision arithmetic, if each individual CORDIC cell is bounded-input bounded-output (BIBO), then the compound CORDIC cell is also BIBO.


Proof: Assume the pipelining level is M. From Fig. 10, the look-ahead transformed compound CORDIC cell consists of a cascade connection of M CORDIC units. Since each of them is BIBO under finite-precision arithmetic, the compound cell is also BIBO.

III. Pipelined CORDIC Based RLS Adaptive Filters

In this section, we apply the annihilation-reordering look-ahead technique to CORDIC based RLS adaptive filters and derive fine-grain pipelined topologies. We consider RLS with both implicit and explicit weight extraction.

This section is organized as follows. The pipelined QRD-RLS with implicit weight extraction is presented in section III-A. Its stability under finite-precision arithmetic is studied and proved in section III-B. Finally, the pipelined adaptive inverse QR algorithm for explicit weight extraction is presented in section III-C.

A. Pipelined QRD-RLS with Implicit Weight Extraction

Consider the QRD-RLS formulation given in equations (1)-(4). After the triangular matrix R(n) and the corresponding vector p(n) are generated, the optimum weight vector w(n) can be obtained by solving equation (3). The residual error e(n) is then computed as

\[ e(n) = y(n) - u^T(n) R^{-1}(n)\, p(n). \quad (12) \]

However, for some applications such as adaptive beamforming, this proves to be unnecessary: in these cases the residual error e(n) is usually the only variable of interest, and it is not necessary to compute the weight vector w(n) explicitly. In [25], it is shown that the estimation error e(n) may be obtained directly as the product of two variables, the angle-normalized residual α(n) and the likelihood factor γ(n). α(n) and γ(n) are obtained by applying the same orthogonal transformation matrix Q(n) to the vector [p(n-1), y(n)]^T and the pinning vector π = [0, ..., 0, 1]^T [25]. Therefore, the adaptive recursive least squares algorithm can be summarized as

\[
\begin{bmatrix} R(n) & p(n) & s(n) \\ 0_p^T & \alpha(n) & \gamma(n) \end{bmatrix}
= Q(n)
\begin{bmatrix} \lambda^{1/2} R(n-1) & \lambda^{1/2} p(n-1) & 0_p \\ u^T(n) & y(n) & 1 \end{bmatrix}, \quad (13)
\]

where 0_p is the p-by-1 null vector.
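The update (13) can be exercised numerically. The sketch below (our own names; explicit 2-by-2 Givens rotations stand in for the CORDIC stages) annihilates u^T(n) into λ^{1/2}R(n-1), carries the p and pinning columns along, and checks the stated identity that α(n)γ(n) equals the a posteriori residual y(n) - u^T(n)w(n):

```python
import numpy as np

rng = np.random.default_rng(0)
p, lam = 3, 0.99

def qrd_rls_step(R, pv, u, y, lam):
    """One update of eq. (13): p Givens rotations annihilate u^T into
    sqrt(lam)*R; the same rotations update p(n-1) -> p(n) and turn the
    pinning column [0_p; 1] into the likelihood factor gamma."""
    A = np.zeros((p + 1, p + 2))
    A[:p, :p], A[:p, p] = np.sqrt(lam) * R, np.sqrt(lam) * pv
    A[p, :p], A[p, p], A[p, p + 1] = u, y, 1.0
    for i in range(p):
        a, b = A[i, i], A[p, i]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        Ai, Ap = A[i].copy(), A[p].copy()
        A[i], A[p] = c * Ai + s * Ap, -s * Ai + c * Ap
    return A[:p, :p], A[:p, p], A[p, p], A[p, p + 1]  # R, p, alpha, gamma

R, pv = 1e-3 * np.eye(p), np.zeros(p)      # small regularizing start
for n in range(50):
    u = rng.standard_normal(p)
    y = u @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal()
    R, pv, alpha, gamma = qrd_rls_step(R, pv, u, y, lam)

w = np.linalg.solve(R, pv)                   # implicit weights, eq. (3)
assert np.isclose(alpha * gamma, y - u @ w)  # e(n) = alpha(n) * gamma(n)
```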
The SFG representation of the algorithm is shown in Fig. 15, where the problem size p is chosen to be 4. The circle and square cells in Fig. 15 denote CORDIC operations and follow the same notation as in Fig. 1. The circle cell with the letter G inside denotes a Gaussian rotation (a linear CORDIC operation); its functionality is shown in the figure. Notice that the converting-factor cells, which generate the likelihood factor γ, do not contain recursive operations.

In Fig. 15, the recursive operation in each cell limits the throughput of the input samples. To increase the sample rate, the annihilation-reordering look-ahead technique is applied. The recursive updating formula for the QRD-RLS with implicit weight extraction is given in equation (13). Its block updating form with block size M is given as

\[
\begin{bmatrix} R(n) & p(n) & s(n) \\ O_{M\times p} & \alpha(n) & \gamma(n) \end{bmatrix}
= Q(n)
\begin{bmatrix} \lambda^{M/2} R(n-M) & \lambda^{M/2} p(n-M) & 0_p \\ U_M(n) & y_M(n) & \pi_M \end{bmatrix}, \quad (14)
\]


Fig. 15. Signal flow graph representation for recursive least squares minimization.

where U_M(n) is an M-by-p matrix defined as

\[ U_M(n) = [\, u(n-M+1), \ldots, u(n-1), u(n) \,]^T, \]

and y_M(n) is an M-by-1 vector defined as

\[ y_M(n) = [\, y(n-M+1), \ldots, y(n-1), y(n) \,]^T. \]

In (14), O_{M×p} and 0_p denote the M-by-p null matrix and the p-by-1 null vector, respectively; α(n) and γ(n) are M-by-1 vectors; and π_M is an M-by-1 constant vector defined as π_M = [0, ..., 0, 1]^T. The estimation error e(n) can be calculated as the inner product of the angle-normalized residual vector α(n) and the likelihood vector γ(n), i.e.,

\[ e(n) = \alpha^T(n)\, \gamma(n). \quad (15) \]

The proof is given in Appendix A.

We now determine a sequence of Givens rotations, whose product forms the orthogonal transformation matrix Q(n) in (14), to annihilate the block input data matrix U_M(n). The order of the Givens rotations is chosen such that the input data is pre-processed and the block-data update is finished with the same complexity as a single-data update. This procedure is illustrated in detail in Fig. 4, and a 3-level fine-grain pipelined QR update topology is shown in Fig. 5. After applying the annihilation-reordering look-ahead, the concurrent QRD-RLS algorithm can be realized using different implementation styles such as pipelining, block processing, and incremental block processing, as discussed in section II-E. In the rest of the paper we only show the topologies for the pipelined realization; the other realizations can be derived similarly. A fine-grain pipelined implementation, with pipelining level 3, of the CORDIC based QRD-RLS adaptive filter with implicit weight extraction is shown in Fig. 16. In this figure, all cell notations follow the notations in Fig. 15 except that they are compound versions. The internal structure of each compound cell is shown at the


bottom part of Fig. 16. Compared to Fig. 15, the 3-level pipelined architecture triples the number of CORDIC units and the communication bandwidth, which is linear with respect to the pipelining level. Thus, in general, the total complexity is O((1/2)Mp^2) CORDIC units per sample time, where p is the input sample size and M is the pipelining level.
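The block update (14) and the inner-product formula (15) can be checked in the same way. The sketch below (assumed names; NumPy instead of CORDIC) uses a plain row-by-row annihilation order rather than the reordered sequence of Fig. 4; any two annihilating transformations differ only by an orthogonal mixing of the bottom M rows, which leaves the inner product α^T(n)γ(n) unchanged:

```python
import numpy as np

rng = np.random.default_rng(4)
p, M, lam = 3, 3, 0.98

def annihilate_row(A, j):
    """Annihilate row p+j of A into the triangular rows 0..p-1
    with p Givens rotations."""
    for i in range(p):
        a, b = A[i, i], A[p + j, i]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        Ai, Aj = A[i].copy(), A[p + j].copy()
        A[i], A[p + j] = c * Ai + s * Aj, -s * Ai + c * Aj

# stand-ins for R(n-M), p(n-M) and the block of new data
R = np.triu(rng.standard_normal((p, p))) + 3.0 * np.eye(p)
pv = rng.standard_normal(p)
U = rng.standard_normal((M, p))            # rows u(n-M+1) ... u(n)
yv = rng.standard_normal(M)                # y(n-M+1) ... y(n)

# assemble the array of eq. (14): columns [R-part | p-part | pinning]
A = np.zeros((p + M, p + 2))
A[:p, :p] = lam ** (M / 2) * R
A[:p, p] = lam ** (M / 2) * pv
A[p:, :p] = U
A[p:, p] = yv
A[p + M - 1, p + 1] = 1.0                  # pi_M = [0, ..., 0, 1]^T
for j in range(M):
    annihilate_row(A, j)

R_new, pv_new = A[:p, :p], A[:p, p]
alpha, gamma = A[p:, p], A[p:, p + 1]      # the M-vectors of eq. (14)
w = np.linalg.solve(R_new, pv_new)
# eq. (15): alpha^T gamma is the a posteriori error at time n
assert np.isclose(alpha @ gamma, yv[-1] - U[-1] @ w)
```

In this particular ordering only the final row touches the pinning column, so α^Tγ reduces to its last term; the reordered sequence of Fig. 4 distributes the work differently but yields the same inner product.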

Fig. 16. A 3-level fine-grain pipelined CORDIC based implicit weight extraction QRD-RLS adaptive filter architecture.

B. Stability Analysis

It is generally recognized that QR decomposition based algorithms have good numerical properties, meaning that they perform acceptably in a short word-length environment. This is due to the fact that the algorithms consist only of orthogonal transformations, which leads to inherent stability under finite-precision implementation. From sections II-A and II-B, we see that the annihilation-reordering look-ahead transformation involves only orthogonal transformations and does not alter the orthogonality of the algorithm. This implies that the pipelined algorithms also maintain these good numerical properties. Let us define the stability of the QRD-RLS algorithm in the sense of bounded input/bounded output (BIBO), i.e., under finite-precision arithmetic, if the input signals are bounded, then the output residual error e(n) is also bounded. We have the following result.


Theorem 1: Under finite-precision arithmetic, given a pipelining level M, the M-level fine-grain pipelined CORDIC based RLS adaptive filter algorithm with implicit weight extraction is stable.

Proof: In [42], it is shown that for the sequential QRD-RLS algorithm of Fig. 15, a CORDIC cell operating with finite-precision arithmetic constitutes a BIBO subsystem of the array. From Fig. 16, the pipelined algorithm has the same architecture as the sequential one except that all CORDIC cells are compound versions. Therefore, by Lemma 1, a compound CORDIC cell operating with finite-precision arithmetic also constitutes a BIBO subsystem of the array. Thus, if the desired response y(n) and the input samples u(n) in Fig. 16 are bounded, the quantized value at the input of the final linear CORDIC cell is also bounded, which leads to a bounded residual error e(n). This completes the proof of Theorem 1.

The stability of the other CORDIC based RLS adaptive filtering and beamforming algorithms presented in this paper can be proved using a similar approach and will not be repeated in the rest of the paper.

C. Pipelined QRD-RLS With Explicit Weight Extraction

In applications such as channel equalization, RLS based equalization algorithms, e.g., decision-directed schemes [43] and orthogonalized constant modulus algorithms [44], require the explicit availability of the filter weight vector w(n). The standard (Gentleman-Kung type) QRD-RLS update scheme involves two computational steps which cannot be efficiently combined on a pipelined array. To circumvent this difficulty, inverse updating based algorithms have been developed [44]-[46]. Here, we focus on a double-triangular type adaptive inverse QR algorithm [27]. Consider the least squares formulation given in (1)-(4). Define the (p+1)-by-(p+1) upper triangular compound matrix \tilde{R}(n) as

\[
\tilde{R}(n) = \begin{bmatrix} \lambda^{1/2} R(n) & \lambda^{1/2} p(n) \\ 0_p^T & \gamma(n) \end{bmatrix},
\]

where γ(n) is a scalar factor, and R(n), p(n), and 0_p^T are defined as in section I.
Using equation (3), \tilde{R}^{-1} is then given as

\[
\tilde{R}^{-1}(n) =
\begin{bmatrix} \lambda^{-1/2} R^{-1}(n) & -R^{-1}(n)p(n)/\gamma(n) \\ 0_p^T & 1/\gamma(n) \end{bmatrix}
=
\begin{bmatrix} \lambda^{-1/2} R^{-1}(n) & -w(n)/\gamma(n) \\ 0_p^T & 1/\gamma(n) \end{bmatrix}.
\]

Notice that \tilde{R}^{-1} remains upper triangular, and the optimal weight vector w(n) appears explicitly in the rightmost column of \tilde{R}^{-1}, except for the scaling factor -1/γ. Now consider the QR update of the upper triangular compound matrix \tilde{R}. From (4), we have

\[
\begin{bmatrix} \tilde{R}(n) \\ 0_{p+1}^T \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}(n-1) \\ \tilde{u}^T(n) \end{bmatrix}, \quad (16)
\]

where \tilde{u}^T(n) = [u^T(n), y(n)], and \tilde{Q}(n) is determined as the product of (p+1) Givens rotation matrices that null the input sample vector \tilde{u}^T(n) and update the matrix \tilde{R}. Extending the (p+2)-by-(p+1) matrix on the right-hand side of (16) to a (p+2)-by-(p+2) square matrix, by appending the extra column vector [0_{p+1}^T, 1]^T on its right, leads to

\[
\begin{bmatrix} \tilde{R}(n) & v(n) \\ 0_{p+1}^T & d(n) \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}(n-1) & 0_{p+1} \\ \tilde{u}^T(n) & 1 \end{bmatrix}, \quad (17)
\]

where the vector v(n) and the scalar d(n) correspond to the QR update of the vector 0_{p+1} and the scalar 1. Inverting both sides of equation (17) (the matrix is non-singular since \tilde{R} is non-singular) and noticing that Q^{-1} = Q^T leads to

\[
\begin{bmatrix} \tilde{R}^{-1}(n) & v'(n) \\ 0_{p+1}^T & d'(n) \end{bmatrix}
=
\begin{bmatrix} \tilde{R}^{-1}(n-1) & 0_{p+1} \\ -\tilde{u}^T(n)\tilde{R}^{-1}(n-1) & 1 \end{bmatrix}
\tilde{Q}^T(n). \quad (18)
\]


Taking the transpose of both sides of (18), we obtain

\[
\begin{bmatrix} \tilde{R}^{-T}(n) & 0_{p+1} \\ v'^T(n) & d'(n) \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}^{-T}(n-1) & -\tilde{R}^{-T}(n-1)\,\tilde{u}(n) \\ 0_{p+1}^T & 1 \end{bmatrix}.
\]

Thus, we have the following inverse updating formula

\[
\begin{bmatrix} \tilde{R}^{-T}(n) \\ v'^T(n) \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}^{-T}(n-1) \\ 0_{p+1}^T \end{bmatrix}. \quad (19)
\]

Notice that the orthogonal transformation matrix \tilde{Q}(n), which updates the upper triangular matrix \tilde{R} in (16), also updates the lower triangular matrix \tilde{R}^{-T} in (19). Thus, the double-triangular adaptive inverse QR algorithm can be summarized as

\[
\begin{bmatrix} \tilde{R}(n) & \tilde{R}^{-T}(n) \\ 0_{p+1}^T & v'^T(n) \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}(n-1) & \tilde{R}^{-T}(n-1) \\ \tilde{u}^T(n) & 0_{p+1}^T \end{bmatrix}. \quad (20)
\]

The important point lies in noticing that the scaled weight vector -w/γ sits explicitly on the bottom row of the lower triangular matrix \tilde{R}^{-T}, or equivalently in the rightmost column of the upper triangular matrix \tilde{R}^{-1}, as shown before. Therefore, we can achieve parallel weight extraction by taking the last-row elements of \tilde{R}^{-T}(n) and multiplying them by the scaling factor γ(n), which sits in the lower right corner of the upper triangular matrix \tilde{R}(n).

An efficient SFG representation of the CORDIC based double-triangular adaptive inverse QR algorithm is shown in Fig. 17. In this figure, the notation follows Fig. 1. The operation for updating the r^{-1} elements is shown at the bottom part of Fig. 17. The residual error e(n) is computed according to (1), as shown in the figure. The element in the lower right corner of the lower triangular matrix \tilde{R}^{-T} contains the value 1/γ(n) and is not shown in the figure.
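The key fact behind (19) and (20), namely that the rotations annihilating ũ^T(n) into R̃ simultaneously keep the companion block equal to R̃^{-T}, can be verified on a random example (a generic upper-triangular stand-in for R̃, λ = 1, and our own variable names):

```python
import numpy as np

rng = np.random.default_rng(1)
q = 4                                     # stands for p + 1

Rt = np.triu(rng.standard_normal((q, q))) + 3.0 * np.eye(q)  # R~(n-1)
RtinvT = np.linalg.inv(Rt).T              # lower triangular R~^{-T}(n-1)
ut = rng.standard_normal(q)               # extended input [u^T(n), y(n)]

L = np.vstack([Rt, ut])                   # left stack of eq. (20)
S = np.vstack([RtinvT, np.zeros(q)])      # right stack of eq. (20)
for i in range(q):                        # q rotations null ut into Rt ...
    a, b = L[i, i], L[q, i]
    r = np.hypot(a, b)
    c, s = a / r, b / r
    for X in (L, S):                      # ... and act on both stacks
        Xi, Xq = X[i].copy(), X[q].copy()
        X[i], X[q] = c * Xi + s * Xq, -s * Xi + c * Xq

Rt_new, RtinvT_new = L[:q], S[:q]
assert np.allclose(L[q], 0.0)                         # ut annihilated
assert np.allclose(Rt_new @ RtinvT_new.T, np.eye(q))  # still the inverse
```

In the full algorithm the bottom row of R̃^{-T}(n) then holds the scaled weights -w(n)/γ(n), which is what makes parallel weight extraction possible.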
This algorithm has complexity O(p^2) Givens rotations per sample period, where p is the size of the array. We now apply the annihilation-reordering look-ahead technique to derive a concurrent adaptive inverse QR algorithm for high-speed CORDIC based parallel RLS weight extraction.

The block updating form, with block size M, of the sequential updating equation (20) is

\[
\begin{bmatrix} \tilde{R}(n) & \tilde{R}^{-T}(n) \\ O_{M\times(p+1)} & V(n) \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}(n-M) & \tilde{R}^{-T}(n-M) \\ \tilde{U}_M^T(n) & O_{M\times(p+1)} \end{bmatrix}, \quad (21)
\]

where \tilde{U}_M^T(n) is an M-by-(p+1) matrix defined as

\[ \tilde{U}_M^T(n) = [\, \tilde{u}(n-M+1), \ldots, \tilde{u}(n-1), \tilde{u}(n) \,]^T, \]

O_{M×(p+1)} denotes an M-by-(p+1) null matrix, and V(n) is an M-by-(p+1) matrix. The derivation of (21) essentially follows the algebraic manipulations in (16)-(20), provided that we start from the following block update equation

\[
\begin{bmatrix} \tilde{R}(n) \\ O_{M\times(p+1)} \end{bmatrix}
= \tilde{Q}(n)
\begin{bmatrix} \tilde{R}(n-M) \\ \tilde{U}_M(n) \end{bmatrix}. \quad (22)
\]

Notice that the \tilde{Q}(n) matrix in (22) is different from the \tilde{Q}(n) in (16), though we use the same notation. Applying the annihilation-reordering procedure described in Fig. 4, we obtain the concurrent architecture shown in Fig. 5. A complete 3-level fine-grain pipelined topology for CORDIC based QRD-RLS with explicit parallel weight


Fig. 17. Signal flow graph representation of the double-triangular type adaptive inverse QR algorithm.

extraction is shown in Fig. 18. In this figure, all cell notations follow the notations in Fig. 17 except that some of them are compound versions. The internal structure of each compound cell is shown at the bottom part of Fig. 18. Compared to Fig. 17, the number of CORDIC units and the communication bandwidth are tripled, which is linear with respect to the pipelining level. In general, the total complexity of the pipelined realization of CORDIC based QRD-RLS with explicit weight extraction is O(Mp^2), where M is the pipelining level and p is the size of the input samples.

IV. Application to Adaptive Beamforming

In this section, we apply the annihilation-reordering look-ahead technique to CORDIC based adaptive beamforming algorithms and derive fine-grain pipelined topologies.

A beamformer is a processor which forms a scalar output signal as a weighted combination of the data received at an array of sensors. The weights determine the spatial filtering characteristics of the beamformer and enable separation of signals having overlapping frequency content if they originate from different locations. The weights in a data-independent beamformer are chosen to provide a fixed response independent of the received data. Statistically optimum beamformers select the weights to optimize the beamformer response based on the statistics of the data. Statistically optimum beamformers include the multiple sidelobe canceller (MSC) [47], beamformers with a reference signal [48], the maximum signal-to-noise ratio (Max SNR) beamformer [49], and the linearly constrained minimum variance (LCMV) beamformer [50]. Among them, the LCMV beamformer is the most flexible since it does not require


Fig. 18. A 3-level fine-grain pipelined topology of the CORDIC based double-triangular adaptive inverse QR algorithm.

the absence of the desired signal as in the MSC, generation of the desired signal as in reference signal beamformers, or estimation of the signal and noise covariance matrices as in Max SNR beamformers. The basic idea behind LCMV beamforming is to constrain the response of the beamformer so that signals from the direction of interest are passed with specified gain and phase. The weights are chosen to minimize the output variance or power subject to the response constraint. This has the effect of preserving the desired signal while minimizing contributions to the output due to interfering signals and noise arriving from directions other than the direction of interest. LCMV beamforming is a constrained minimization problem. Solving the constrained minimization problem directly leads to the minimum variance distortionless response (MVDR) beamforming realization [28]. An alternative formulation changes the constrained minimization problem into an unconstrained form, which leads to the generalized sidelobe canceller realization [51]-[52], [47]. In practice, the data statistics are often unknown and may change with time, so adaptive algorithms are used to obtain weights that converge to the statistically optimum solution. Least mean squares (LMS) and recursive least squares (RLS) based adaptive algorithms are the two most popular for beamforming. Among RLS algorithms, QR decomposition based schemes are numerically stable and therefore attractive in practical applications. A good survey of various beamforming approaches is given in [53]. In this paper, we consider high-speed QR decomposition based LCMV adaptive beamforming, which includes the MVDR and GSC realizations.

This section is organized as follows. The LCMV adaptive beamforming problem is stated in section IV-A. The pipelined QRD-MVDR realization is then presented in section IV-B.
Finally, section IV-C addresses the pipelined QRD-GSC realization and its comparison with the pipelined MVDR beamformer.


A. LCMV Adaptive Beamforming Problem

Consider a linear array of p uniformly spaced sensors whose outputs are individually weighted and then summed to produce the beamformer output corresponding to the kth desired look direction,

\[ e^{(k)}(n) = u^T(n)\, w^{(k)}(n), \quad (23) \]

where u(n) is the p-element vector of signal samples received by the array at time instant n, and w^{(k)}(n) is the p-element vector of weights corresponding to the kth desired look direction. Let c^{(k)} be the kth steering vector, which represents the kth desired look direction, and let β^{(k)} be the corresponding beamforming gain, which is usually a constant. The linearly constrained minimum variance (LCMV) beamforming problem may be summarized as follows. Minimize the cost function

\[ \xi^{(k)}(n) = \| \Lambda^{1/2}(n)\, e^{(k)}(n) \|^2 = \| \Lambda^{1/2}(n)\, A(n)\, w^{(k)}(n) \|^2, \quad (24) \]

subject to the linear constraint

\[ c^{(k)T} w^{(k)}(n) = \beta^{(k)}, \quad (25) \]

where A(n) = [u(1), u(2), ..., u(n)]^T is the input data matrix and Λ(n) = diag[λ^{n-1}, ..., λ, 1] is the diagonal matrix of forgetting factors. Here we assume that all the data are real; the extension to the complex case does not present any particular difficulties.

B. Pipelined QRD-MVDR Adaptive Beamforming

Solving the constrained least squares minimization problem given in (24) and (25) directly, using the method of Lagrange multipliers, leads to the minimum variance distortionless response (MVDR) beamforming realization. The solution to the constrained least squares minimization problem is

\[ w^{(k)}(n) = \frac{\beta^{(k)}\, \Phi^{-1}(n)\, c^{(k)}}{c^{(k)T}\, \Phi^{-1}(n)\, c^{(k)}}, \quad (26) \]

where Φ is the covariance matrix defined by

\[ \Phi(n) = A^T(n)\, \Lambda(n)\, A(n). \quad (27) \]

Assuming that a QR decomposition has been carried out on the weighted data matrix Λ^{1/2}(n) A(n), so that

\[ Q(n)\, \Lambda^{1/2}(n)\, A(n) = \begin{bmatrix} R(n) \\ 0 \end{bmatrix}, \quad (28) \]

where R(n) is a p-by-p upper triangular matrix, it follows that

\[ \Phi(n) = R^T(n)\, R(n), \quad (29) \]

and so R(n) is the Cholesky square root factor of the covariance matrix Φ(n).
Equation (26) may therefore be written in the form

\[
w^{(k)}(n) = \frac{\beta^{(k)}\, R^{-1}(n) R^{-T}(n)\, c^{(k)}}{c^{(k)T}\, R^{-1}(n) R^{-T}(n)\, c^{(k)}}
= \frac{\beta^{(k)}\, R^{-1}(n)\, a^{(k)}(n)}{\| a^{(k)}(n) \|^2}, \quad (30)
\]


where

\[ a^{(k)}(n) = R^{-T}(n)\, c^{(k)}. \quad (31) \]

It follows that the beamformer output for the kth desired look direction at time instant n is given by

\[ e^{(k)}(n) = u^T(n)\, w^{(k)}(n) = \frac{\beta^{(k)}\, u^T(n) R^{-1}(n)\, a^{(k)}(n)}{\| a^{(k)}(n) \|^2}. \quad (32) \]

Let

\[ e'^{(k)}(n) = u^T(n) R^{-1}(n)\, a^{(k)}(n). \quad (33) \]

That is, e'^{(k)}(n) is a scaled version of e^{(k)}(n), with scaling factor β^{(k)} / \| a^{(k)}(n) \|^2.

The QR decomposition of the data matrix A(n) can be implemented in a recursive manner, as shown in (4) and repeated here as

\[
\begin{bmatrix} R(n) \\ 0_p^T \end{bmatrix} = Q(n) \begin{bmatrix} \lambda^{1/2} R(n-1) \\ u^T(n) \end{bmatrix}. \quad (34)
\]

After some algebraic manipulation, it can be shown that the orthogonal matrix Q in (34) also updates the auxiliary vector a^{(k)} in (31) [28]. Therefore, the QR update for the MVDR adaptive beamforming algorithm can be summarized as

\[
\begin{bmatrix} R(n) & a^{(k)}(n) & g(n) \\ 0_p^T & \alpha^{(k)}(n) & \gamma(n) \end{bmatrix}
= Q(n)
\begin{bmatrix} \lambda^{1/2} R(n-1) & \lambda^{-1/2} a^{(k)}(n-1) & 0_p \\ u^T(n) & 0 & 1 \end{bmatrix}. \quad (35)
\]

The insertion of the third column in (35) is used to generate the converting factor γ(n). It can be shown that the scaled residual e'^{(k)}(n) can be computed as [25]

\[ e'^{(k)}(n) = -\alpha^{(k)}(n)\, \gamma(n). \quad (36) \]

An efficient SFG representation of the CORDIC based QRD-MVDR adaptive beamforming algorithm is shown in Fig. 19. In this figure, the circle and square cells follow the notations in Fig. 1. There are 3 desired look directions in the figure, with scaled beamformer outputs e'^{(1)}(n), e'^{(2)}(n), and e'^{(3)}(n). The circle cell with a shaded right angle and the letter G inside denotes a Gaussian rotation; its functionality is shown in the figure. Notice that the sign is subtraction instead of addition, unlike Fig. 15. The algorithm complexity is O((1/2)p^2 + Kp) Givens rotations per sample time, where p is the number of adjustable weights (the size of the antenna array) and K is the number of look direction constraints.

In Fig. 19, the recursive operation in each cell limits the throughput of the input samples.
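The equivalence of (26) with (30) and (31) is easy to confirm numerically, with a Cholesky factor standing in for the QR-accumulated R(n) (Λ = I for brevity; the names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
p, beta = 4, 1.0

A = rng.standard_normal((50, p))
Phi = A.T @ A                           # covariance, eq. (27) with Lambda = I
R = np.linalg.cholesky(Phi).T           # upper triangular, Phi = R^T R (eq. (29))
c = rng.standard_normal(p)              # steering vector c^(k)

Phinv_c = np.linalg.solve(Phi, c)
w_direct = beta * Phinv_c / (c @ Phinv_c)            # eq. (26)

a = np.linalg.solve(R.T, c)                          # a = R^{-T} c, eq. (31)
w_qr = beta * np.linalg.solve(R, a) / (a @ a)        # eq. (30)
assert np.allclose(w_direct, w_qr)
```

The triangular solves with R and R^T are exactly what the back-substitution-free MVDR array avoids computing explicitly; here they serve only to check the algebra.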
Beamforming applications usually require very high sample rates, and the sequential QRD-MVDR algorithm might not be able to operate at such rates. To increase the throughput, the annihilation-reordering look-ahead technique is applied. The recursive updating formula for the QRD-MVDR algorithm is given in (35). Its block update form with block size M is

\[
\begin{bmatrix} R(n) & a^{(k)}(n) & g(n) \\ O_{M\times p} & \alpha^{(k)}(n) & \gamma(n) \end{bmatrix}
= Q(n)
\begin{bmatrix} \lambda^{M/2} R(n-M) & \lambda^{-M/2} a^{(k)}(n-M) & 0_p \\ U_M^T(n) & 0_M & \pi_M \end{bmatrix}, \quad (37)
\]

where U_M^T(n) is an M-by-p matrix defined as

\[ U_M^T(n) = [\, u(n-M+1), \ldots, u(n-1), u(n) \,]^T, \]


Fig. 19. Signal flow graph representation of the QRD-MVDR adaptive beamforming algorithm.

O_{M×p} and 0_M denote the M-by-p null matrix and the M-by-1 null vector, respectively; α^{(k)}(n) and γ(n) are M-by-1 vectors; and π_M is an M-by-1 constant vector defined as

\[ \pi_M = [\, 0, \ldots, 0, 1 \,]^T. \]

The scaled beamformer output for the kth look direction is then given by

\[ e'^{(k)}(n) = -\alpha^{(k)T}(n)\, \gamma(n). \quad (38) \]

The derivations of equations (37) and (38) are shown in Appendix B. Notice that the Q(n) matrix in (37) is different from the Q(n) in (35), though we use the same notation.

A sequence of Givens rotations, whose product forms the orthogonal transformation matrix Q(n) in (37), is then determined to annihilate the block input data matrix U_M^T(n) and to update R(n-M) and a^{(k)}(n-M), as shown in Fig. 4 and Fig. 5. A final fine-grain pipelined QRD-MVDR beamforming topology with pipelining level 3 is shown in Fig. 20. In this figure, all cell notations follow the notations in Fig. 19 except that they are compound versions. The internal structure of each compound cell is shown at the bottom part of Fig. 20. Notice that, compared to Fig. 19, the 3-level pipelined architecture triples the number of CORDIC units and the communication bandwidth, which is linear with respect to the pipelining level. Thus, the total complexity is O(M((1/2)p^2 + Kp)) Givens rotations per sample time, where p is the number of antenna elements, K is the number of look direction constraints, and M is the pipelining level.

C. Pipelined CORDIC Based Generalized Sidelobe Canceller

An alternative realization of the LCMV beamforming problem transforms the constrained least squares minimization problem into an unconstrained form. This leads to the generalized sidelobe canceller (GSC) realization. Consider the LCMV problem given in equations (24) and (25). The weight vector w^{(k)}(n) can be decomposed into two orthogonal parts, which lie in the range and null space of c^{(k)}:

\[ w^{(k)} = w_q^{(k)} + B^{(k)} w_b^{(k)}, \]


Fig. 20. A 3-level pipelined CORDIC based QRD-MVDR adaptive beamforming architecture.

where w_q^{(k)} and w_b^{(k)} are p-by-1 and (p-1)-by-1 vectors, respectively. The p-by-(p-1) blocking matrix B^{(k)} satisfies

\[ c^{(k)T} B^{(k)} = 0. \quad (39) \]

By (39), we then have

\[ c^{(k)T} w^{(k)}(n) = c^{(k)T} w_q^{(k)} + c^{(k)T} B^{(k)} w_b^{(k)}(n) = c^{(k)T} w_q^{(k)} = \beta^{(k)}. \]

Thus,

\[ w_q^{(k)} = c^{(k)} \left( c^{(k)T} c^{(k)} \right)^{-1} \beta^{(k)}. \]

Furthermore,

\[ e^{(k)}(n) = A(n)\, w^{(k)}(n) = A(n)\, w_q^{(k)} + A(n) B^{(k)} w_b^{(k)}(n) = y^{(k)}(n) - A^{(k)}(n)\, w_b^{(k)}(n), \quad (40) \]

where

\[ y^{(k)}(n) = A(n)\, w_q^{(k)}, \qquad A^{(k)}(n) = -A(n) B^{(k)}. \quad (41) \]


Fig. 21. The block diagram of a generalized sidelobe canceller.

Substituting (40) into (24) and comparing with (2), we see that the GSC problem is actually an unconstrained RLS problem, with the error equation given in (40) and a pre-processing procedure given in (41). A complete block diagram of a serial generalized sidelobe canceller is shown in Fig. 21.

From Fig. 21, we see that the GSC consists of two parts. The first part is a pre-processing procedure, which consists of a matrix-vector multiplication and a vector inner product, as given in (41). We will show that both operations can be implemented using CORDIC rotations and contain only feed-forward paths, which can be pipelined to fine-grain level using cutset pipelining. The second part is the recursive least squares adaptive filtering shown in the dashed box, which can be realized using the QRD-RLS algorithm; furthermore, it can be fine-grain pipelined using the annihilation-reordering look-ahead technique of section II. Therefore, fine-grain pipelined QRD-GSC architectures can be developed which operate at arbitrarily high sample rates.

The pre-processing part can be implemented using CORDIC arithmetic in the following way. A QR decomposition of the vector c^{(k)} gives

\[ c^{(k)T} Q = [\, \| c^{(k)} \|, 0, \ldots, 0 \,]. \]

Decompose the orthogonal matrix Q as

\[ Q = [\, w_q'^{(k)} \mid B^{(k)} \,]. \]

Then

\[ c^{(k)T} w_q'^{(k)} = \| c^{(k)} \|, \qquad c^{(k)T} B^{(k)} = 0. \quad (42) \]

Comparing (42) with (39) and (40), we see that w_q'^{(k)} is a scaled version of w_q^{(k)}, and that the orthogonal matrix Q contains the vector w_q^{(k)}, up to a scaling factor, together with the matrix B^{(k)}. A typical CORDIC based pre-processing SFG realization is shown in Fig. 22. Each small circle represents a CORDIC operation. The CORDIC units with the letter V


inside operate in the vectoring mode, and the ones with the letter R inside operate in the rotating mode. The dashed lines indicate the places where pipelining latches can be placed. Since all paths are feed-forward, fine-grain pipelining can be achieved by placing latches between CORDIC micro-rotation stages using cutset pipelining. The output of Fig. 22 is then fed into the input of Fig. 16, as shown in Fig. 21.
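A quick numerical check of (39) and (42), with NumPy's Householder QR standing in for the CORDIC vectoring/rotating array (the sign of the first column of Q is ambiguous, hence the absolute value; names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5
c = rng.standard_normal(p)              # steering vector c^(k)

# full QR of c as a p-by-1 matrix: the first column of Q is +-c/||c||
Q_full, _ = np.linalg.qr(c.reshape(-1, 1), mode='complete')
wq_scaled = Q_full[:, 0]                # scaled quiescent weight w'_q^(k)
B = Q_full[:, 1:]                       # p-by-(p-1) blocking matrix B^(k)

assert np.allclose(c @ B, 0.0, atol=1e-10)                 # eq. (39)
assert np.isclose(abs(c @ wq_scaled), np.linalg.norm(c))   # eq. (42)
```

Any orthogonal completion of c/||c|| works as a blocking matrix; the CORDIC array of Fig. 22 simply builds one such completion from a sequence of plane rotations.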

Fig. 22. A typical pipelined pre-processing architecture of the GSC algorithm.

From Fig. 21, the pipelined GSC topology consists of a feed-forward constraint preprocessor and a recursive least-squares processor, as shown in Fig. 23(a). The preprocessor consumes approximately Mp CORDIC operations, and the least-squares processor uses approximately (1/2)Mp^2 CORDIC operations, where p is the number of adjustable weights and M is the pipelining level. Fig. 21 shows the GSC block diagram for only one constraint. If we impose several (say K) constraints independently, we have to run the same data through the beamformer a total of K times. Such an imposition requires K different preprocessors and K least-squares processing runs, taking a total of approximately MK((1/2)p^2 + p) operations for the pipelined GSC topology.

Fig. 23. (a) CORDIC based generalized sidelobe canceller. (b) QRD-MVDR beamformer.

On the other hand, from Fig. 20, the MVDR topology is basically a constraint postprocessor in that it permits the application of the constraint after the least-squares processing, as shown in Fig. 23(b). Consequently, the output


from the least-squares processor is broadcast to K separate constraint processors, each of which uses approximately Mp operations. The total number of operations in the pipelined MVDR topology is thus approximately M((1/2)p^2 + Kp), which is considerably less than the corresponding number of operations for the pipelined GSC topology. Therefore, the pipelined MVDR beamformer has the advantage of being computationally more efficient than the pipelined GSC realization. However, the GSC provides insight and is useful for analysis of the LCMV beamforming problem.

V. Conclusions

In this paper, a novel annihilation-reordering look-ahead technique is proposed to achieve fine-grain pipelining for CORDIC based RLS adaptive filtering and beamforming algorithms. It is an exact look-ahead based on CORDIC arithmetic. The look-ahead transformation can be derived from either the block processing or the iteration point of view; the former is simpler and more practical, while the latter shows the connection with the traditional multiply-add look-ahead technique. Exploring the parallelism in the annihilation-reordering look-ahead transformation leads to three implementation styles, namely pipelining, block processing, and incremental block processing. The implementation complexity in terms of CORDIC units is lowest for the pipelined realization and highest for block processing.

CORDIC based RLS filters with implicit weight extraction are useful in applications such as adaptive beamforming, and those with explicit weight extraction are useful in applications such as channel equalization. The application of the proposed look-ahead technique to these RLS filters leads to fine-grain pipelined topologies which can be operated at arbitrarily high sample rates. The pipelined algorithms maintain orthogonality and stability under finite-precision arithmetic.

High-speed adaptive beamforming applications are presented in the paper.
In particular, the linearly constrained minimum variance (LCMV) adaptive beamformer is considered. LCMV adaptive beamforming is a constrained least-squares minimization problem. Solving the constrained minimization problem directly leads to the QRD-MVDR beamforming realization. An alternative is the unconstrained reformulation, which leads to the QR decomposition based GSC realization. Both the MVDR and the GSC beamformer can be realized using CORDIC arithmetic. Applying the annihilation-reordering look-ahead technique to these adaptive beamforming algorithms leads to fine-grain pipelined topologies which can achieve arbitrarily high throughput. Furthermore, they consist of only Givens rotations, which can be mapped onto CORDIC arithmetic based processors [24].

TABLE I
The implementation complexity in terms of CORDIC units for various RLS based algorithms and implementation styles.

Implementation Style | QRD-RLS                | Inverse QR  | QRD-MVDR                    | QRD-GSC
Pipelining           | $\frac{1}{2}Mp^2$      | $Mp^2$      | $M(\frac{1}{2}p^2+Kp)$      | $MK(\frac{1}{2}p^2+p)$
Incremental Block    | $\frac{1}{2}(2M-1)p^2$ | $(2M-1)p^2$ | $(2M-1)(\frac{1}{2}p^2+Kp)$ | $(2M-1)K(\frac{1}{2}p^2+p)$
Block Processing     | $\frac{1}{2}M^2p^2$    | $M^2p^2$    | $M^2(\frac{1}{2}p^2+Kp)$    | $M^2K(\frac{1}{2}p^2+p)$

The implementation complexity in terms of CORDIC units for the various RLS based algorithms and implementation styles is shown in Table I. From the table, we see that the pipelining level M appears in the complexity expressions for all algorithms and implementation styles. The pipelining and incremental block-processing realizations require a linearly increasing number of CORDIC units, with a factor of M for pipelining and a factor of 2M-1 for incremental block processing. The complexity factor for block processing is $M^2$, which is quadratic in the pipelining level. The adaptive inverse QR algorithm requires approximately twice as many CORDIC


units as the QRD-RLS algorithm, since an extra lower triangular matrix is needed to extract the weight vector. The MVDR topology outperforms the GSC topology in terms of complexity by employing a constraint post-processor rather than the constraint pre-processor of the GSC realization.

Future research will be directed towards the scheduling and mapping of these pipelined adaptive filtering and beamforming architectures onto hardware configurations. It will be interesting to see how different scheduling strategies lead to mappings with different power consumptions. Designing architecture configurations and schedules that both satisfy the bandwidth requirements and achieve low power consumption will be challenging. Fast orthonormal μ-rotations can also be employed to implement these Givens rotation based architectures with lower complexity [54]-[56].

Appendix A

In this appendix, we derive equations (14) and (15). At time instance (n-M), apply the QR decomposition to the weighted data matrix $\Lambda^{1/2}(n-M)A(n-M)$ and the reference vector $y(n-M)$ as follows:

$$Q(n-M)\,\Lambda^{1/2}(n-M)\left[\,A(n-M)\;\;y(n-M)\,\right] = \begin{bmatrix} R(n-M) & p(n-M) \\ O & v(n-M) \end{bmatrix} \qquad (43)$$

where $R(n-M)$ is a p-by-p upper triangular matrix, and $p(n-M)$ and $v(n-M)$ are p-by-1 and (n-M-p)-by-1 vectors, respectively. At time n, the new inputs $U_M(n)$ and $y_M(n)$ become available for processing, and we have

$$\left[\,A(n)\;\;y(n)\,\right] = \begin{bmatrix} A(n-M) & y(n-M) \\ U_M^T(n) & y_M(n) \end{bmatrix} \qquad (44)$$

Define

$$\bar{\Lambda}^{1/2}(n) = \begin{bmatrix} \lambda^{M/2}\Lambda^{1/2}(n-M) & \\ & I_M \end{bmatrix}, \qquad (45)$$

$$\bar{Q}(n-M) = \begin{bmatrix} Q(n-M) & \\ & I_M \end{bmatrix}. \qquad (46)$$

Then

$$\bar{Q}(n-M)\,\bar{\Lambda}^{1/2}(n)\left[\,A(n)\;\;y(n)\,\right] = \begin{bmatrix} \lambda^{M/2}R(n-M) & \lambda^{M/2}p(n-M) \\ O & \lambda^{M/2}v(n-M) \\ U_M^T(n) & y_M(n) \end{bmatrix} \qquad (47)$$

Notice that here we choose $\bar{\Lambda}^{1/2}(n)$ instead of $\Lambda^{1/2}(n)$; $\Lambda^{1/2}(n)$ differs from $\bar{\Lambda}^{1/2}(n)$ in replacing $I_M$ by $\Lambda^{1/2}(M)$. Using $\Lambda^{1/2}(n)$ would lead to extra operations on the input data.
Using $\bar{\Lambda}^{1/2}(n)$, however, does not, and it also does not affect the algorithm's convergence behavior, owing to the presence of $\lambda^{M/2}$ in $\bar{\Lambda}^{1/2}(n)$.

Applying the orthogonal matrix Q(n), which consists of a sequence of Givens rotations, to annihilate the input data $U_M(n)$ against $\lambda^{M/2}R(n-M)$ in (47), we have

$$\begin{bmatrix} R(n) & p(n) \\ O & v(n) \\ O_{M\times p} & \eta(n) \end{bmatrix} = Q(n)\begin{bmatrix} \lambda^{M/2}R(n-M) & \lambda^{M/2}p(n-M) \\ O & \lambda^{M/2}v(n-M) \\ U_M^T(n) & y_M(n) \end{bmatrix} \qquad (48)$$

which derives the first two columns in (14). Next we prove equation (15) and justify the third column in (14). From (2), we have


$$e(n-M) = y(n-M) - A(n-M)\,w(n-M) \qquad (49)$$

By (43), we then have

$$Q(n-M)\,\Lambda^{1/2}(n-M)\,e(n-M) = \begin{bmatrix} p(n-M) \\ v(n-M) \end{bmatrix} - \begin{bmatrix} R(n-M) \\ O \end{bmatrix} w(n-M). \qquad (50)$$

Let

$$\varepsilon(n-M) = Q(n-M)\,\Lambda^{1/2}(n-M)\,e(n-M), \quad \bar{e}_M(n) = y_M(n) - U_M^T(n)\,w(n-M), \quad e_M(n) = y_M(n) - U_M^T(n)\,w(n) \qquad (51)$$

where $\bar{e}_M(n)$ and $e_M(n)$ are M-by-1 vectors. By (1), it is seen that the last element of $e_M(n)$ is the desired residual error e(n) at time instance n. From (50) and (51), we obtain

$$\begin{bmatrix} \lambda^{M/2}\varepsilon(n-M) \\ \bar{e}_M(n) \end{bmatrix} = \begin{bmatrix} \lambda^{M/2}p(n-M) \\ \lambda^{M/2}v(n-M) \\ y_M(n) \end{bmatrix} - \begin{bmatrix} \lambda^{M/2}R(n-M) \\ O \\ U_M^T(n) \end{bmatrix} w(n-M). \qquad (52)$$

Applying the orthogonal matrix Q(n) to both sides of (52) and using (48), we obtain

$$Q(n)\begin{bmatrix} \lambda^{M/2}\varepsilon(n-M) \\ e_M(n) \end{bmatrix} = \begin{bmatrix} p(n) \\ v(n) \\ \eta(n) \end{bmatrix} - \begin{bmatrix} R(n) \\ O \\ O \end{bmatrix} w(n). \qquad (53)$$

Notice that after annihilating the input block data $U_M^T(n)$, the weight vector w(n-M) has been updated to w(n); correspondingly, the residual error $\bar{e}_M(n)$ becomes $e_M(n)$ as defined in (51). Now, moving Q(n) to the right-hand side of (53) and noticing that Q(n) is orthogonal, we obtain

$$\begin{bmatrix} \lambda^{M/2}\varepsilon(n-M) \\ e_M(n) \end{bmatrix} = Q^T(n)\begin{bmatrix} p(n) - R(n)\,w(n) \\ v(n) \\ \eta(n) \end{bmatrix} \qquad (54)$$

Since the optimum weight vector w(n) satisfies $p(n) - R(n)w(n) = 0_p$, (54) reduces to

$$\begin{bmatrix} \lambda^{M/2}\varepsilon(n-M) \\ e_M(n) \end{bmatrix} = Q^T(n)\begin{bmatrix} 0_p \\ v(n) \\ \eta(n) \end{bmatrix} \qquad (55)$$

Noticing that the last element of $e_M(n)$ is e(n), we have

$$e(n) = \left[\,0_{n-M}^T \;\; \pi_M^T\,\right] Q^T(n)\begin{bmatrix} 0_p \\ v(n) \\ \eta(n) \end{bmatrix} = \left[\,0_{n-M}^T \;\; \pi_M^T\,\right] Q^T(n)\begin{bmatrix} O_{(n-M)\times M} \\ I_M \end{bmatrix} \eta(n). \qquad (56)$$

The second equality is due to the fact that $Q^T(n)$ only makes use of the elements $0_p$ and $\eta(n)$ in the vector $[\,0_p^T,\; v^T(n),\; \eta^T(n)\,]^T$. Taking the transpose on both sides of (56), and noticing that e(n) is a scalar, we then have


$$e(n) = \eta^T(n)\left[\,O_{M\times(n-M)} \;\; I_M\,\right] Q(n)\begin{bmatrix} 0_{n-M} \\ \pi_M \end{bmatrix} = \eta^T(n)\left[\,O_{M\times(n-M)} \;\; I_M\,\right]\begin{bmatrix} s(n) \\ \gamma(n) \end{bmatrix} = \eta^T(n)\,\gamma(n). \qquad (57)$$

The second equality justifies the third column in (14). This completes the derivation of (14) and (15).

Appendix B

In this appendix, we derive equations (37) and (38). From (48), we have the following block QR update for the QRD-MVDR algorithm:

$$\begin{bmatrix} R(n) \\ O_{M\times p} \end{bmatrix} = Q(n)\begin{bmatrix} \lambda^{M/2}R(n-M) \\ U_M^T(n) \end{bmatrix} \qquad (58)$$

Extending the (p+M)-by-p matrix on the right-hand side of (58) to a (p+M)-by-(p+M) square matrix, by adding the extra columns $[\,O_{p\times M}^T \;\; I_M\,]^T$ to its right, leads to

$$\begin{bmatrix} R(n) & V(n) \\ O_{M\times p} & D(n) \end{bmatrix} = Q(n)\begin{bmatrix} \lambda^{M/2}R(n-M) & O_{p\times M} \\ U_M^T(n) & I_M \end{bmatrix} \qquad (59)$$

where the matrices V(n) and D(n) are p-by-M and M-by-M, respectively, and correspond to the QR update of the matrices $O_{p\times M}$ and $I_M$. Inverting the matrices on both sides of (59) (they are non-singular since R is non-singular) and noticing that $Q^{-1} = Q^T$ leads to

$$\begin{bmatrix} R^{-1}(n) & V'(n) \\ O_{M\times p} & D'(n) \end{bmatrix} = \begin{bmatrix} \lambda^{-M/2}R^{-1}(n-M) & O_{p\times M} \\ -\lambda^{-M/2}U_M^T(n)R^{-1}(n-M) & I_M \end{bmatrix} Q^T(n). \qquad (60)$$

Taking the transpose on both sides of (60), we obtain

$$\begin{bmatrix} R^{-T}(n) & O_{p\times M} \\ V'^T(n) & D'^T(n) \end{bmatrix} = Q(n)\begin{bmatrix} \lambda^{-M/2}R^{-T}(n-M) & -\lambda^{-M/2}R^{-T}(n-M)\,U_M(n) \\ O_{M\times p} & I_M \end{bmatrix}. \qquad (61)$$

Thus, we have the following inverse updating formula:

$$\begin{bmatrix} R^{-T}(n) \\ V'^T(n) \end{bmatrix} = Q(n)\begin{bmatrix} \lambda^{-M/2}R^{-T}(n-M) \\ O_{M\times p} \end{bmatrix}. \qquad (62)$$

Using (62), we can derive the block QR update equation for the MVDR auxiliary vector $a^{(k)}(n)$ defined in (31):

$$Q(n)\begin{bmatrix} \lambda^{-M/2}a^{(k)}(n-M) \\ 0_M \end{bmatrix} = Q(n)\begin{bmatrix} \lambda^{-M/2}R^{-T}(n-M) \\ O_{M\times p} \end{bmatrix} c^{(k)} = \begin{bmatrix} R^{-T}(n) \\ V'^T(n) \end{bmatrix} c^{(k)} = \begin{bmatrix} a^{(k)}(n) \\ \zeta^{(k)}(n) \end{bmatrix}, \qquad (63)$$

where

$$\zeta^{(k)}(n) = V'^T(n)\,c^{(k)} \qquad (64)$$


is an M-by-1 vector. Therefore, the orthogonal matrix Q(n) in (58) also updates the auxiliary vector $a^{(k)}$ in (63). This completes the derivation of the first two columns in (37).

Now, comparing (12) with (33), it is seen that if the vector p(n) is replaced by the vector $-a^{(k)}(n)$, and y(n) is set to 0, then (12) becomes (33) and, correspondingly, (14) becomes (37), except that $\eta(n)$ is changed to $-\eta^{(k)}(n)$. In other words, if the p cells in Fig. 16 are replaced by the auxiliary cells $a^{(k)}$, and the reference signal y(n) is set to zero, then the residual error e(n) in Fig. 16 becomes $-e'^{(k)}(n)$ in Fig. 20. Therefore, according to (15) and the proof in Appendix A, we conclude the correctness of (38) and, at the same time, justify the insertion of the third column in equation (37). This completes the derivation of equations (37) and (38).

References

[1] K.K. Parhi, "Algorithm transformation techniques for concurrent processors," Proceedings of the IEEE, vol. 77, pp. 1879-1895, Dec. 1989.
[2] K.K. Parhi and D.G. Messerschmitt, "Pipeline interleaving and parallelism in recursive digital filters, Part I: Pipelining using scattered look-ahead and decomposition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1118-1134, July 1989.
[3] K.K. Parhi and D.G. Messerschmitt, "Pipeline interleaving and parallelism in recursive digital filters, Part II: Pipelined incremental block filtering," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1099-1117, July 1989.
[4] P.M. Kogge, "Parallel solution of recurrence problems," IBM J. Res. Develop., vol. 18, pp. 138-148, March 1974.
[5] H.H. Loomis and B. Sinha, "High speed recursive digital filter realization," Circuits, Syst., Signal Processing, vol. 3, no. 3, pp. 267-297, 1984.
[6] K.K. Parhi and D.G. Messerschmitt, "Concurrent architectures for two-dimensional recursive digital filtering," IEEE Transactions on Circuits and Systems, vol. 36, pp. 813-829, June 1989.
[7] G. Fettweis and H. Meyr, "Parallel Viterbi decoding by breaking the compare-select feedback bottleneck," IEEE Transactions on Communications, vol. 37, pp. 785-790, Aug. 1989.
[8] H.D. Lin and D.G. Messerschmitt, "Finite state machine has unlimited concurrency," IEEE Transactions on Circuits and Systems, vol. 38, pp. 465-475, May 1991.
[9] K.K. Parhi, "Pipelining in dynamic programming architectures," IEEE Transactions on Signal Processing, vol. 39, pp. 1442-1450, June 1991.
[10] K.K. Parhi, "Pipelining in algorithms with quantizer loops," IEEE Transactions on Circuits and Systems, vol. 38, pp. 745-754, July 1991.
[11] K.K. Parhi, "High-speed VLSI architectures for Huffman and Viterbi decoders," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 39, pp. 385-391, June 1992.
[12] N.R. Shanbhag and K.K. Parhi, Pipelined Adaptive Digital Filters, Kluwer Academic Publishers, 1994.
[13] N.R. Shanbhag and K.K. Parhi, "Relaxed look-ahead pipelined LMS adaptive filters and their application to ADPCM coder," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, pp. 753-766, Dec. 1993.
[14] N.R. Shanbhag and K.K. Parhi, "A pipelined adaptive lattice filter architecture," IEEE Transactions on Signal Processing, vol. 41, pp. 1925-1939, May 1993.
[15] N.R. Shanbhag and K.K. Parhi, "A pipelined adaptive differential vector quantizer for low-power speech coding applications," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, pp. 347-349, May 1993.
[16] N.R. Shanbhag and K.K. Parhi, "A high-speed architecture for ADPCM coder and decoder," in International Symposium on Circuits and Systems, May 1992, pp. 1499-1502.
[17] J. Ma, E.F. Deprettere, and K.K. Parhi, "Pipelined CORDIC based QRD-RLS adaptive filtering using matrix look-ahead," in Proc. of the IEEE Workshop on Signal Processing Systems (SiPS), Nov. 1997, pp. 131-140.
[18] J. Ma, K.K. Parhi, and E.F. Deprettere, "High-speed CORDIC based parallel weight extraction for QRD-RLS adaptive filtering," in International Symposium on Circuits and Systems, May 1998, pp. 245-248.
[19] J. Ma, K.K. Parhi, and E.F. Deprettere, "Pipelined CORDIC based QRD-MVDR adaptive beamforming," in International Conference on Acoustics, Speech, and Signal Processing, May 1998, pp. 3025-3028.
[20] S. Haykin, Adaptive Filter Theory, Englewood Cliffs, NJ: Prentice-Hall, 1992.
[21] J.E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. on Electronic Computers, pp. 330-334, Sept. 1959.
[22] Y.H. Hu, "CORDIC-based VLSI architectures for digital signal processing," IEEE Signal Processing Magazine, pp. 16-35, July 1992.
[23] G.J. Hekstra and E.F. Deprettere, "Floating point CORDIC," in Proceedings 11th Symp. Computer Arithmetic, June 1993, pp. 130-137.
[24] E. Rijpkema, G. Hekstra, E. Deprettere, and J. Ma, "A strategy for determining a Jacobi specific dataflow processor," in Proc. of the IEEE International Conf. on Application-Specific Systems, Architectures and Processors, July 1997, pp. 53-64.
[25] J.G. McWhirter, "Recursive least-squares minimization using a systolic array," in Proc. SPIE: Real Time Signal Processing VI, 1983, vol. 431, pp. 105-112.
[26] W.M. Gentleman and H.T. Kung, "Matrix triangularization by systolic arrays," in Proc. SPIE: Real-Time Signal Processing IV, 1981, pp. 298-303.
[27] T.J. Shepherd, J.G. McWhirter, and J.E. Hudson, "Parallel weight extraction from a systolic adaptive beamformer," in Mathematics in Signal Processing II (J.G. McWhirter, ed.), pp. 775-790, Clarendon Press, Oxford, 1990.
[28] J.G. McWhirter and T.J. Shepherd, "Systolic array processor for MVDR beamforming," in IEE Proceedings, April 1989, vol. 136, pp. 75-80.
[29] A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, "Low-power CMOS digital design," IEEE J. Solid-State Circuits, vol. 27, pp. 473-484, April 1992.


[30] K.J. Raghunath and K.K. Parhi, "Pipelined RLS adaptive filtering using scaled tangent rotations (STAR)," IEEE Transactions on Signal Processing, vol. 44, pp. 2591-2604, October 1996.
[31] T.H.Y. Meng, E.A. Lee, and D.G. Messerschmitt, "Least-squares computation at arbitrarily high speeds," in International Conference on Acoustics, Speech, and Signal Processing, 1987, pp. 1398-1401.
[32] G.H. Golub and C.F. Van Loan, Matrix Computations, Baltimore, MD: Johns Hopkins University Press, 1989.
[33] S.F. Hsieh, K.J.R. Liu, and K. Yao, "A unified square-root-free Givens rotation approach for QRD-based recursive least squares estimation," IEEE Transactions on Signal Processing, vol. 41, pp. 1405-1409, March 1993.
[34] S. Hammarling, "A note on modifications to the Givens plane rotation," J. Inst. Math. Applicat., vol. 13, pp. 215-218, 1974.
[35] J.L. Barlow and I.C.F. Ipsen, "Scaled Givens rotations for the solution of linear least-squares problems on systolic arrays," SIAM J. Sci. Stat. Comput., vol. 13, pp. 716-733, Sept. 1987.
[36] J. Götze and U. Schwiegelshohn, "A square-root and division free Givens rotation for solving least-squares problems on systolic arrays," SIAM J. Sci. Stat. Comput., pp. 800-807, Dec. 1991.
[37] E. Frantzeskakis and K.J.R. Liu, "A class of square-root and division free algorithms and architectures for QRD-based adaptive signal processing," IEEE Transactions on Signal Processing, vol. 42, pp. 2455-2469, Sept. 1994.
[38] J.M. Cioffi, "The fast adaptive ROTOR's RLS algorithm," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, pp. 631-653, April 1990.
[39] C.E. Leiserson, F. Rose, and J. Saxe, "Optimizing synchronous circuitry by retiming," in Proc. Third Caltech Conf. VLSI, Pasadena, CA, March 1983, pp. 87-116.
[40] K.K. Parhi, "High-level algorithm and architecture transformations for DSP synthesis," Journal of VLSI Signal Processing, vol. 9, pp. 121-143, 1995.
[41] E.F. Deprettere, P. Held, and P. Wielage, "Model and methods for regular array design," International Journal of High Speed Electronics; Special Issue on Massively Parallel Computing, Part II, vol. 4(2), pp. 133-201, 1993.
[42] H. Leung and S. Haykin, "Stability of recursive QRD-LS algorithms using finite-precision systolic array implementation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 5, pp. 760-763, 1989.
[43] M. Moonen and E.F. Deprettere, "A fully pipelined RLS-based array for channel equalization," Journal of VLSI Signal Processing, vol. 14, pp. 67-74, 1996.
[44] R. Gooch and J. Lundell, "The CM array: An adaptive beamformer for constant modulus signals," in International Conference on Acoustics, Speech, and Signal Processing, 1986, pp. 2523-2526.
[45] C.T. Pan and R.J. Plemmons, "Least squares modifications with inverse factorizations: parallel implications," J. Comput. Appl. Math., vol. 27, pp. 109-127, 1989.
[46] S.T. Alexander and A.L. Ghirnikar, "A method for recursive least squares adaptive filtering based upon an inverse QR decomposition," IEEE Transactions on Signal Processing, vol. 41, pp. 20-30, 1993.
[47] S.P. Applebaum and D.J. Chapman, "Adaptive arrays with main beam constraints," IEEE Trans. on Antennas and Propagation, vol. AP-24, pp. 650-662, Sept. 1976.
[48] B. Widrow, P.E. Mantey, L.J. Griffiths, and B.B. Goode, "Adaptive antenna systems," Proceedings of the IEEE, vol. 55, pp. 2143-2159, Dec. 1967.
[49] R. Monzingo and T. Miller, Introduction to Adaptive Arrays, Wiley and Sons, NY, 1980.
[50] O.L. Frost III, "An algorithm for linearly constrained adaptive array processing," Proceedings of the IEEE, vol. 60, pp. 926-935, Aug. 1972.
[51] R.L. Hanson and C.L. Lawson, "Extensions and applications of the Householder algorithm for solving linear least squares problems," Math. Comp., vol. 23, pp. 917-926, 1969.
[52] L.J. Griffiths and C.W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. on Antennas and Propagation, vol. AP-30, pp. 27-34, Jan. 1982.
[53] B.D. Van Veen and K.M. Buckley, "Beamforming: a versatile approach to spatial filtering," IEEE ASSP Magazine, vol. 5, pp. 4-24, April 1988.
[54] G.J. Hekstra and E.F. Deprettere, "Fast rotations: low cost arithmetic methods for orthonormal rotation," in Proc. 13th Symp. Computer Arithmetic, July 1997, pp. 116-125.
[55] J. Götze and G. Hekstra, "An algorithm and architecture based on orthonormal μ-rotations for computing the symmetric EVD," INTEGRATION, the VLSI Journal, vol. 20, pp. 21-39, 1995.
[56] J. Ma, K.K. Parhi, G.J. Hekstra, and E.F. Deprettere, "Efficient implementations of CORDIC based IIR digital filters using fast orthonormal μ-rotations," in Proc. of the SPIE: Advanced Signal Processing Algorithms, Architectures, and Implementations VIII, July 1998.