Bit Level Optimization of Adder Trees for Multiple Constant Multiplication Fir Filter Implementation


  • SUBMITTED TO IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS 1

    Bit-Level Optimization of Adder-Trees for Multiple Constant Multiplication FIR Filter Implementation

    Yu Pan and Pramod Kumar Meher, Senior Member, IEEE

    Abstract: The multiple constant multiplication (MCM) scheme is widely used for implementing transposed direct-form FIR filters. While the research focus of MCM has been on more effective common subexpression elimination, the optimization of adder-trees, which sum up the computed sub-expressions for each coefficient, is largely omitted. In this paper, we identify the resource minimization problem in the scheduling of adder-tree operations for the MCM block, and present a mixed integer programming (MIP) based algorithm for more efficient MCM-based implementation of FIR filters. Experimental results show that up to 15% reduction of area and 11.6% reduction of power (with an average of 8.46% and 5.96%, respectively) can be achieved on top of an already optimized adder/subtractor network of the MCM block.

    Index Terms: Adder-tree optimization, finite impulse response filter, FIR, multiple constant multiplication, MCM

    I. INTRODUCTION

    Finite impulse response (FIR) digital filters are widely used as a basic tool in many digital signal processing (DSP) and communication applications. The complexity of an FIR filter is largely dominated by the multiplication of input samples with filter coefficients. Fortunately, the filter coefficients are constants for a given filter, so the multiplications are implemented by a network of adders, subtractors, and hard-wired shifts, where the number of adders and subtractors is minimized by a constant multiplication scheme. In the case of a transposed direct-form FIR filter, the most recent input sample at any given clock period is multiplied with all the filter coefficients. A set of intermediate results is generated in this case and shared across all the multiplications in order to minimize the total number of additions/subtractions using multiple constant multiplication (MCM) techniques (see Fig. 1(a)). Each such intermediate result in an MCM process corresponds to one of the common sub-expressions (CS) of the set of constants to be multiplied.

    A great deal of research has been done to develop effective algorithms to identify the optimal set of non-redundant subexpressions to achieve the minimum number of logic operators and the minimum logic depth of the MCM [1]-[7]. Irrespective of differences in methodology and the level of optimality, in all these works, after the common subexpression terms are determined and the ADD/SUB network of non-redundant subexpressions (or terms) is formed, the product value corresponding to each of the coefficients is computed by an adder-

    Yu Pan and Pramod Kumar Meher are with the Institute for Infocomm Research, 1 Fusionopolis Way, Singapore, 138632, e-mail: {ypan, pkmeher}@i2r.a-star.edu.sg.

    Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

    Fig. 2. (a) An example adder-tree with delays and signs on each input term. (b) An internal schedule with minimum delay.

    A. Greedy Adder-Tree Scheduling

    The common practice for summing the CS terms of each coefficient is to use the tree-height minimization algorithm [10] to produce a height-optimal adder-tree. The tree-height minimization algorithm iteratively collapses the pair $\{T_i, T_j\}$ with the smallest delays using an ADD/SUB to form a new term with delay $\max(D_i, D_j) + 1$, until a single term remains. Fig. 2 gives an example of an adder-tree (a) and its schedule with minimum delay (b).

    Note that either a positive or a negative sign is associated with each input term (see Fig. 2(a)), which denotes whether the corresponding term should be added to or subtracted from the summation. These signs also determine whether an addition or a subtraction is used when the algorithm collapses a pair of terms in the adder-tree, based on the following rules. (1) If the two input edges are of the same sign, an ADD is used; otherwise, a SUB is used. (2) The sign of the output edge is always the same as that of the left input edge (i.e., the minuend edge in the subtraction case). Under these two rules, the final term producing the summation result may carry a negative sign, in which case a negation is needed after the adder-tree to correct the value. For an FIR filter, results from multiple adder-trees are accumulated by a structural adder-register line, so the negation can be eliminated by replacing the structural adder with a subtractor (see coefficient C1 in Fig. 1 for an example).
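The greedy collapse and the two sign rules above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [10]: the function name `greedy_adder_tree`, the `(delay, sign)` tuple encoding, and the convention that the earlier-ready term of each pair is the left (minuend) operand are all assumptions made for the example.

```python
import heapq
from itertools import count

def greedy_adder_tree(terms):
    """Greedy tree-height minimization sketch.

    terms: list of (delay, sign) pairs, sign is +1 or -1.
    Returns (final_delay, final_sign, ops), where ops lists the
    operator type chosen at each collapse step.
    Assumption: the first term popped is treated as the left operand.
    """
    tie = count()  # tie-breaker so the heap never compares signs
    heap = [(d, next(tie), s) for d, s in terms]
    heapq.heapify(heap)
    ops = []
    while len(heap) > 1:
        d_left, _, s_left = heapq.heappop(heap)    # smallest delay
        d_right, _, s_right = heapq.heappop(heap)  # next smallest delay
        # Rule (1): equal signs give an ADD, different signs a SUB.
        ops.append("ADD" if s_left == s_right else "SUB")
        # Rule (2): the output keeps the sign of the left input edge.
        heapq.heappush(heap, (max(d_left, d_right) + 1, next(tie), s_left))
    delay, _, sign = heap[0]
    return delay, sign, ops
```

A final sign of -1 corresponds to the case described above where the structural adder is replaced by a subtractor instead of inserting a negation.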

    B. Cost model

    In order to quantify and minimize the hardware cost of the adder-tree, we model the cost of ADD/SUB operations in this section based on the ripple-carry implementation, which is the most area-efficient and will be picked by the hardware compiler whenever the timing allows.

    Without loss of generality, for a single ADD/SUB operation, the pair of input operands may be of different bit widths, and one of them is to be left shifted by a certain number of bit positions. We enumerate all the scenarios of shift-add/sub operations in Fig. 3. The cost calculation is done separately in three bit-segments. Starting from the least significant bit (LSB), the 1st segment covers the bit positions up to but not including the first bit of the shifted operand; the 3rd segment covers the bits corresponding to the sign extension bits of the sign-extended operand; the 2nd segment takes the rest of the bit positions.

    Two cases of the ADD operation are shown in Fig. 3(a) and (b). In both cases, the 2nd and 3rd segments are implemented by one full adder (FA) per bit, while the 1st segment costs nothing other than wiring.

    Fig. 3. Cost of ADD/SUB operation under various input scenarios. Notation: FA - full adder; overline - inverter (INV). (a), (b) Cases for ADD. (c)-(f) Cases for SUB.


    of the same height. First, the bit widths of intermediate nodes are significantly smaller. For example, while other adder-tree nodes are of similar bit widths, the one in the second layer is only 7 bits wide in the optimal adder-tree (Fig. 4(b)), compared to 14 bits in the non-optimal adder-tree. Note that a wider signal may further raise the hardware cost when it is input to the next layer of operators. Second, subtractions with the shift on the subtrahend also reduce operator cost. For example, for a base bit width of 8 bits, the operator performing

    Fig. 6. Relationship of operator type and the output sign with the signs of input operands.

    tree. It should be clear that if $E_j$ is assigned any input term, all the operators which precede it on the binary tree should be NULLs. The number of operators that are nullified by edge $E_j$ on the $l$th level is a constant $NumNull(E_j) = 2^l - 1$. Overall, the total number of operators on the binary tree equals the sum of the number of ADDs/SUBs and the number of NULL operators. Since a binary tree of depth $L$ contains $2^L - 1$ operator positions and summing $N$ input terms takes $N - 1$ ADDs/SUBs, the number of NULL operators is $2^L - N$. This is expressed as follows

    $$\sum_{E_j} NumNull(E_j) \cdot sumETSet_j = 2^L - N \qquad (4)$$
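As a quick sanity check of the NULL-operator count, the sketch below evaluates both sides of the identity for a given placement of input terms. The helper names and the representation of a placement as a list of levels (one entry per input term, level 0 being the topmost layer) are assumptions for illustration; a placement is treated as valid when the subtrees above the assigned edges exactly cover the $2^L$ top-layer positions.

```python
def num_null(level):
    """Operators nullified by an input term placed on an edge of this level."""
    return 2 ** level - 1

def null_count_holds(levels, depth):
    """Check sum(NumNull) == 2^L - N for input terms placed at `levels`.

    levels: one level index per input term (0 = topmost layer).
    depth:  L, the number of operator layers of the binary tree.
    Assumption: a valid placement covers all 2^L top-layer positions.
    """
    if sum(2 ** l for l in levels) != 2 ** depth:
        raise ValueError("placement does not cover the binary tree exactly")
    n_terms = len(levels)
    return sum(num_null(l) for l in levels) == 2 ** depth - n_terms
```

For instance, four terms placed at levels 0, 0, 1 and 2 of a depth-3 tree nullify 0 + 0 + 1 + 3 = 4 operator positions, which matches $2^3 - 4$.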

    So far, the above constraints suffice to produce assignments of the input terms that correctly form adder-trees of various topologies. In order to model the cost of the adder-tree, we need to further model the ADD/SUB type of the operators, the bit widths of the edges and the cost of operators.

    B. Constraints for the Type of Operators

    For a non-NULL operator, its ADD/SUB type is determined by the simple rules outlined in Section II-A and depicted graphically in Fig. 6. On the binary decision tree, the type of each operator is determined by (1) the assignment of input terms, which carry the original edge signs, and (2) the propagation of the signs throughout the network. Consequently, we use two different sets of variables to model the initial condition and the propagation.

    For the initial condition, for each edge $E_j$, we define two binary variables $isp_j$ and $isn_j$ to express the sign of $E_j$ due to input term assignment. $isp_j$ is 1 if the input term assigned to $E_j$ carries a $+$ sign, $isn_j$ is 1 if it carries a $-$ sign, and both are 0 if no input term is assigned to $E_j$. For each $E_j$, we have

    $$isp_j = \sum_{t_{i.j} \in ETSet_j,\ Sign(T_i) = +} t_{i.j}$$

    $$isn_j = \sum_{t_{i.j} \in ETSet_j,\ Sign(T_i) = -} t_{i.j} \qquad (5)$$

    For sign propagation, for each edge $E_j$, we define another pair of binary variables $sp_j$ and $sn_j$ for the resultant sign of $E_j$. $sp_j$ is 1 if $E_j$ carries a $+$ sign, and $sn_j$ is 1 if $E_j$ carries a $-$ sign. If both are 0, $E_j$ is the output of a NULL operator. The sign propagates from the topmost layer downwards. For each $E_j \in$ 0th layer, we have

    $$sp_j = isp_j$$

    $$sn_j = isn_j \qquad (6)$$

    For each $E_j$ on the 1st layer downwards, supposing $E_{jl}$ is the


    edges. Given that the absolute shifts of all input terms are known constants for a coefficient, the absolute shifts of intermediate terms on its possible adder-trees can be obtained by a top-down propagation process.

    Absolute shifts are modeled in the same way as signs. On each edge $E_j$, we first define an integer variable $iash_j$ for the initial absolute shift imposed by input terms. Let $AbsSh(T_i)$ denote the constant absolute shift value of input term $T_i$; we have

    $$iash_j = \sum_{t_{i.j} \in ETSet_j} AbsSh(T_i) \cdot t_{i.j} \qquad (9)$$

    Then we define an integer variable $ash_j$ on each edge $E_j$ for the absolute shift value. The top-down propagation is modeled as follows. For each $E_j \in$ 0th layer, we have

    $$ash_j = iash_j \qquad (10)$$

    For each $E_j \notin$ 0th layer, we have

    $$ash_j = \min(ash_{jl}, ash_{jr}) + iash_j \qquad (11)$$

    Given the absolute shifts, we are ready to express the actual shifts. For a pair of input edges to the same operator, their common absolute shift is trimmed off and taken care of by the output edge of the operator (see Fig. 7(b)). For each edge $E_j \notin$ the last layer, with $E_{jo}$ its output edge, the actual shift integer variable of $E_j$, denoted as $sh_j$, is expressed as

    $$sh_j = \max(ash_j - ash_{jo}, 0) \qquad (12)$$

    The max function with 0 ensures that the resulting value for a NULL operator is 0. For the last layer edge $E_j$, i.e., the output edge of the adder-tree, we simply have

    $$sh_j = ash_j \qquad (13)$$
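The propagation of absolute shifts and the derivation of actual shifts can be illustrated for a single operator. The sketch below assumes the reading of (11) and (12) with a minus sign inside the max, and uses a hypothetical helper name `propagate_shifts`; `iash_out` is 0 when no input term is assigned directly to the output edge.

```python
def propagate_shifts(ash_left, ash_right, iash_out=0):
    """One step of absolute-shift propagation through an operator.

    Returns (ash_out, sh_left, sh_right):
      ash_out  - absolute shift carried by the operator's output edge,
      sh_left  - actual shift applied to the left input edge,
      sh_right - actual shift applied to the right input edge.
    """
    # The output edge carries the common part of both input shifts.
    ash_out = min(ash_left, ash_right) + iash_out
    # Each input keeps only the shift beyond the common part.
    sh_left = max(ash_left - ash_out, 0)
    sh_right = max(ash_right - ash_out, 0)
    return ash_out, sh_left, sh_right
```

For two input terms with absolute shifts 3 and 5, the output edge carries a shift of 3, the left operand needs no shift, and the right operand is shifted by 2 inside the operator.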

    [Linearization] Note that both the min and max functions are non-linear. The general form $s = \min(a, b)$ can be linearized by introducing an additional binary variable $z$ and the following linear constraints

    $$s \le a; \quad s \le b; \quad s + M \cdot z \ge a; \quad s + M \cdot (1 - z) \ge b$$

    where $M$ is a big positive constant greater than the possible difference between $a$ and $b$. Similarly, $s = \max(a, b)$ can be linearized with

    $$s \ge a; \quad s \ge b; \quad s \le a + M \cdot z; \quad s \le b + M \cdot (1 - z)$$
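The big-M constraints for min and max can be checked by brute force on small integers. The sketch below assumes the inequality directions $s \le a$, $s \le b$, $s + Mz \ge a$, $s + M(1 - z) \ge b$ for min (and the mirrored set for max); the function names and the search window are choices made for this illustration only.

```python
def min_feasible(s, a, b, z, M):
    """Big-M constraints that should force s == min(a, b)."""
    return s <= a and s <= b and s + M * z >= a and s + M * (1 - z) >= b

def max_feasible(s, a, b, z, M):
    """Big-M constraints that should force s == max(a, b)."""
    return s >= a and s >= b and s <= a + M * z and s <= b + M * (1 - z)

def feasible_values(check, a, b, M, lo=-20, hi=20):
    """All s in [lo, hi] admitted by the constraints for some binary z."""
    return {s for s in range(lo, hi + 1)
            for z in (0, 1) if check(s, a, b, z, M)}
```

For every pair of operands whose difference stays below $M$, the only admitted value of $s$ is the true minimum (or maximum), which is what allows an MIP solver to treat $s$ as an ordinary linear variable.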

    2) Edge values: Computing the values of edges is also a top-down propagation process. For each edge $E_j$, we define an integer variable $iv_j$ for the initial value imposed by input terms. Let $Value(T_i)$ be the constant value of input term $T_i$; we have

    $$iv_j = \sum_{t_{i.j} \in ETSet_j} Value(T_i) \cdot t_{i.j} \qquad (14)$$

    Let integer variable $v_j$ denote the value of edge $E_j$. For each $E_j \in$ 0th layer, we have

    $$v_j = iv_j \qquad (15)$$

    For each $E_j \notin$ 0th layer, we have

    $v_j = iv_j$ if $kn_j$ is 1

    $= v_{jl}$


    define an extra binary variable $z$ such that it equals 0 when $a$ is negative, and 1 otherwise. That is

    $$-a - M \cdot (1 - z) < 0; \quad -a + M \cdot z \ge 0$$

    where $M$ is a positive constant greater than the upper bound of $|a|$. Given $z$, $s = |a|$ can be expressed as

    $$-a - M \cdot z \le s \le -a + M \cdot z$$

    $$a - M \cdot (1 - z) \le s \le a + M \cdot (1 - z)$$

    where the $M$ here should be a positive constant greater than the upper bound of $2|a|$.

    We further linearize the bits function in the general form $s = bits(a)$, where $a$ is non-negative. Suppose $a$ does not exceed $S$ bits. We create a set of binary variables $\{z_0, \ldots, z_{S-1}\}$ to represent the LSB to MSB of $a$, such that

    $$a = \sum_{k=0}^{S-1} 2^k \cdot z_k$$

    We create another set of binary variables $\{p_0, \ldots, p_{S-1}\}$, such that if $z_k$ is the most significant bit of value 1 amongst all bits, then $p_k$ is 1, which in turn indicates that the bit width of $a$ should be $k + 1$. If $z_k$ is the most significant bit of value 1, the bits more significant than it will all be 0. For each $p_k$, we have

    $p_k = 1$ if $z_k$ is 1 and $z_{k+1}, \ldots, z_{S-1}$ are all 0

    $= 0$ otherwise

    which can be expressed with the following linear constraints

    $$(1 - z_k) + z_{k+1} + \ldots + z_{S-1} - M \cdot (1 - p_k) \le 0$$

    $$(1 - z_k) + z_{k+1} + \ldots + z_{S-1} + p_k > 0$$

    where $M$ is a positive constant greater than the number of $z$ variables involved in the inequalities. Finally, $s = bits(a)$ is expressed as

    $$s = \sum_{k=0}^{S-1} (k + 1) \cdot p_k$$
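The bit-width linearization can be verified exhaustively for a small word length. The sketch below assumes the two constraints read $(1 - z_k) + z_{k+1} + \ldots + z_{S-1} - M(1 - p_k) \le 0$ and $(1 - z_k) + z_{k+1} + \ldots + z_{S-1} + p_k > 0$; the function names are hypothetical and the enumeration over all $p$ vectors stands in for what an MIP solver would do.

```python
from itertools import product

def msb_constraints_hold(z, p, M):
    """Check both linear constraints for every bit position k."""
    for k in range(len(z)):
        lhs = (1 - z[k]) + sum(z[k + 1:])
        if lhs - M * (1 - p[k]) > 0:   # violated: p_k = 1 but z_k is not the MSB
            return False
        if lhs + p[k] <= 0:            # violated: z_k is the MSB but p_k = 0
            return False
    return True

def bits_via_constraints(a, S):
    """Bit width of non-negative a (< 2**S) recovered from the constraints."""
    z = [(a >> k) & 1 for k in range(S)]   # LSB first
    M = S + 1                              # greater than the number of z variables
    feasible = [p for p in product((0, 1), repeat=S)
                if msb_constraints_hold(z, p, M)]
    if len(feasible) != 1:
        raise ValueError("constraints should admit exactly one p vector")
    p = feasible[0]
    return sum((k + 1) * p[k] for k in range(S))
```

Under these assumptions the constraints admit exactly one assignment of the $p_k$ variables for each value of $a$, and the resulting $s$ agrees with Python's `int.bit_length()`.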

    4) Cost of adder-tree operators: Let $F$ and $I$ be the constant costs of a 1-bit full adder and an inverter, respectively. The cost of each operator $c_j$ which produces edge $E_j$ is described as follows, according to the cost model described in Section II-B

    $c_j = 0$ if $kn_j$ is 1

    $= F \cdot [bw_j - (sh_{jl} + sh_{jr})]$ if $ka_j$ is 1

    $= F \cdot [bw_j - (s_{jl} + sh_{jr})] + I \cdot (bw_{jr} - s_{jl})$ if $ks_j$ is 1 $\qquad (18)$

    where $s_{jl}$ is a binary variable indicating whether the shift is at the left input edge $E_{jl}$, such that

    $s_{jl} = 1$ if $sh_{jl} > 0$

    $= 0$ otherwise $\qquad (19)$

    [Linearization] The conditional equalities can be linearized using the same technique discussed in the edge value modeling section. Constraint (19) is expressed linearly as

    $$sh_{jl} - M \cdot s_{jl} \le 0; \quad sh_{jl} - s_{jl} \ge 0$$

    where $M$ is a constant greater than the upper bound of $sh_{jl}$.

    5) Cost of structural operators: For the sake of clarity, the structural operator(s) of the adder-tree are not modeled on the binary tree. In linear-phase FIR filters, which are used in most cases, a coefficient is accumulated twice at symmetric tap positions, thus corresponding to two structural operators. In a general FIR filter, an adder-tree corresponds to one structural operator. For a structural operator, the adder-tree output edge always feeds it as the right-side operand, while the left input and output edges are on the accumulation line, and their bit widths can be predetermined according to the upper/lower bound of accumulated values up to this coefficient tap. The left input edge is always + signed and carries 0 shift. Let $E_s$ denote the structural operator output and $E_{sr}$ its right-hand-side input edge (i.e., the final output edge of the adder-tree). The cost of the structural operator $c_s$ is modeled as follows

    $c_s = F \cdot [Bits(E_s) - sh_{sr}]$ if $sp_{sr}$ is 1

    $= F \cdot [Bits(E_s) - sh_{sr}] + I \cdot bw_{jr}$ if $sn_{sr}$ is 1 $\qquad (20)$

    where Bits(Es) is the bit width constant of the output edge.

    D. Objective function

    The objective function that minimizes the resource cost of all operators on the adder-tree and its structural operator(s) is

    $$\text{minimize:} \quad \sum_{E_j \notin \text{0th layer}} c_j + \sum_{\text{structural ops}} c_s \qquad (21)$$

    IV. RESOURCE MINIMIZATION FOR THE ENTIRE FIR FILTER

    In this section, we describe the top-level algorithm to optimize the resource consumption of the entire FIR filter, based on the minimum resource of each individual coefficient returned by solving the MIP formulations discussed previously. The top-level algorithm is based on the observation that logic depth relaxation on a coefficient may result in an adder-tree with less resource. Without affecting the filter performance, logic depth relaxation is applicable to all the coefficients whose logic depths are smaller than the logic depth of the filter.

    We first compute the minimum logic depth required by a coefficient adder-tree, and then apply logic depth relaxation to improve the minimum resource when applicable.

    A. Minimum Logic Depth of a Coefficient Adder-Tree

    First, the total number of ADD/SUB operators needed to sum up $N$ input terms is $N - 1$. Based on the binary tree model of the MIP formulation, an input term $T_i$ with latency/logic depth of $D_i$ nullifies at least $2^{D_i} - 1$ predecessor operator positions on the binary tree. As a result, the minimum binary tree required to accommodate the input terms should consist of at least $N - 1 + \sum_{i=0}^{N-1} (2^{D_i} - 1)$ operator positions. On the other hand, a binary tree of depth $L$ contains $2^L - 1$ operators. Thus we have $2^L - 1 \ge N - 1 + \sum_{i=0}^{N-1} (2^{D_i} - 1)$, which simplifies to $2^L \ge \sum_{i=0}^{N-1} 2^{D_i}$. The minimum depth required by the coefficient is computed by

    $$L \ge \left\lceil \log_2 \sum_{i=0}^{N-1} 2^{D_i} \right\rceil \qquad (22)$$

    The logic depth of the entire FIR filter is the largest among all its coefficients' logic depths.
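The bound in (22) is straightforward to evaluate with integer arithmetic. The sketch below uses a hypothetical helper `min_logic_depth` that takes the list of input-term depths $D_i$; it avoids floating-point logarithms by using the bit length of the sum minus one, which equals the ceiling of the base-2 logarithm for any positive integer.

```python
def min_logic_depth(delays):
    """Smallest L with 2**L >= sum(2**D_i), i.e. ceil(log2(sum 2**D_i)).

    delays: logic depths D_i of the input terms of one coefficient.
    """
    total = sum(2 ** d for d in delays)
    # For a positive integer n, (n - 1).bit_length() == ceil(log2(n)).
    return (total - 1).bit_length()
```

For input terms with depths 1, 1, 2 and 3 the sum is 16, so a depth of 4 suffices; adding one more term of depth 0 raises the sum to 17 and the minimum depth to 5.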

    B. Resource Minimization Algorithm of FIR filter

    Algorithm 1: Resource minimization of entire FIR filter.
    1 compute the logic depth of the filter L_filter;
    2 for each non-zero coefficient C_i do
    3     compute C_i's minimum logic depth L_Ci;
    4     repeat
    5         rsrc := MipSched(C_i, L_Ci);
    6         if rsrc is not improved over the previous iteration then
    7             use the previous adder-tree schedule and break;
    8         L_Ci := L_Ci + 1;
          until L_Ci > L_filter;

    The algorithm for resource minimization of the entire FIR filter is elaborated in Algorithm 1. For a particular coefficient, the MIP-based resource minimization procedure is first applied to yield an adder-tree schedule at its minimum logic depth (line 5). Further logic depth relaxation is attempted provided that the current depth does not exceed that of the filter (line 8), and the current attempt did improve the minimum resource (line 6). At the end of the algorithm, each coefficient is resource minimized with its adder-tree schedule. The total resource of the entire filter is minimized without increasing its overall logic depth.
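The control flow of Algorithm 1 can be sketched in Python as follows. The MIP solve itself is abstracted behind a caller-supplied function: `mip_sched` and `min_depth` are hypothetical callables standing in for MipSched and the depth bound of (22), and the returned tuple layout is an assumption of this sketch rather than part of the original algorithm.

```python
def minimize_filter_resource(coeffs, min_depth, mip_sched):
    """Depth-relaxation loop of Algorithm 1 (sketch).

    coeffs:    iterable of non-zero coefficients.
    min_depth: callable coeff -> minimum logic depth of its adder-tree.
    mip_sched: callable (coeff, depth) -> (cost, schedule); placeholder
               for the MIP-based scheduling procedure.
    Returns {coeff: (best_cost, best_schedule, depth_used)}.
    """
    coeffs = list(coeffs)
    # Line 1: the filter depth is the largest coefficient depth.
    l_filter = max(min_depth(c) for c in coeffs)
    result = {}
    for c in coeffs:                       # line 2
        depth = min_depth(c)               # line 3
        best = None
        while depth <= l_filter:           # repeat ... until depth > L_filter
            cost, schedule = mip_sched(c, depth)        # line 5
            if best is not None and cost >= best[0]:    # line 6
                break                                   # line 7: keep previous
            best = (cost, schedule, depth)
            depth += 1                                  # line 8
        result[c] = best
    return result
```

Because every coefficient starts at a depth no greater than the filter depth, the loop body runs at least once per coefficient, matching the repeat-until form of the listing.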

    V. EXPERIMENTAL EVALUATIONS

    We performed our experiments on seven linear-phase band-pass FIR filters of orders 16, 24, 32, 40, 48, 56 and 64. The coefficients are quantized as 16-bit signed integers and then converted to their canonical signed digit (CSD) representations. A 2-bit common subexpression identification algorithm, which iteratively identifies and eliminates the most frequently occurring 2-bit horizontal pattern among the coefficients as outlined in [11], is developed to produce the input term network for the adder-trees.

    Graph algorithms are developed using the LEMON [12] C++ graph library, which includes well-defined interfaces for linear programming and supports several popular solvers. IBM ILOG's CPLEX 12.3 [13] library is used to solve the MIP procedures. A small library of generic hardware components (ADDs, SUBs, registers and negators) handling different operand cases (signed/unsigned) and shifts is composed to facilitate parameterizable hardware generation. ADDs and SUBs in these components are directly performed by the VHDL + and - operators on operands with the appropriate bit width and shift. At the end of the algorithm, a synthesizable hardware description in VHDL for the filter is produced. The filters are then synthesized with Synopsys Design Compiler using a TSMC 90 nm library.

    Filter  Logic   Cell Area (sq. um)          Power Consumption (mW)
    Order   Depth   Conv.    MIP      Improv.   Conv.    MIP      Improv.
    16      3       15764    13393    15.04%    0.280    0.247    11.57%
    24      3       23065    21003     8.94%    0.425    0.391     7.97%
    32      3       28897    26909     6.88%    0.531    0.511     3.93%
    40      3       36400    33281     8.57%    0.672    0.636     5.39%
    48      4       41271    38399     6.96%    0.785    0.742     5.44%
    56      4       49642    46837     5.65%    0.927    0.902     2.65%
    64      4       57026    52925     7.19%    1.064    1.013     4.80%
    Avg                                8.46%                       5.96%

    TABLE I
    AREA AND POWER CONSUMPTION IMPROVEMENT ON MCM BLOCK.

    The area and power consumption of the filters' MCM blocks (inclusive of structural operators) are shown in Table I and compared against the conventional adder-tree scheduling algorithm. The average area improvement is 8.46%, and the average power saving is 5.96%. With increasing filter order, the reductions tend to be lower. This is because the ratio of small-valued coefficients is higher, and these coefficients use small adder-trees that can hardly be improved. Among a total of 140 adder-trees optimized in the table, 19 are further improved by logic depth relaxation of 1 more step.

    The run time of solving the MIP problem for each coefficient is generally within a few seconds. In a small number of cases, adder-trees of depth 4 take more than 1 minute to solve. In these cases, due to the auxiliary variables introduced to linearize the


    [6] P. K. Meher and Y. Pan, "MCM-based implementation of block FIR filters for high-speed and low-power applications," in VLSI and System-on-Chip (VLSI-SoC), 2011 IEEE/IFIP 19th International Conference on. IEEE, Oct. 2011, pp. 118-121.

    [7] L. Aksoy, C. Lazzari, E. Costa, P. Flores, and J. Monteiro, "Design of digit-serial FIR filters: Algorithms, architectures, and a CAD tool," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 3, pp. 498-511, Mar. 2013.

    [8] M. B. Gately, M. B. Yeary, and C. Y. Tang, "Multiple real-constant multiplication with improved cost model and greedy and optimal searches," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 2012, pp. 588-591.

    [9] M. Kumm, P. Zipf, M. Faust, and C.-H. Chang, "Pipelined adder graph optimization for high speed multiple constant multiplication," in Circuits and Systems (ISCAS), 2012 IEEE International Symposium on. IEEE, May 2012, pp. 49-52.

    [10] R. Hartley and A. Casavant, "Tree-height minimization in pipelined architectures," in Proc. IEEE International Conference on Computer-Aided Design (ICCAD), Nov. 1989.

    [11] R. Mahesh and A. Vinod, "A new common subexpression elimination algorithm for realizing low-complexity higher order digital filters," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 2, pp. 217-229, 2008.

    [12] E. L. University, "LEMON (library for efficient modeling and optimization in networks)," http://lemon.cs.elte.hu/.

    [13] "IBM ILOG CPLEX optimizer," www.ibm.com/software/integration/optimization/cplex-optimizer/.