
    Energy Savings with Appropriate Interconnection Networks in Parallel DSP

Thorsten Dräger and Gerhard Fettweis

Technische Universität Dresden

Institut für Nachrichtentechnik

Mannesmann Mobilfunk Stiftungslehrstuhl für mobile Nachrichtensysteme

01062 Dresden, Germany

{draeger, fettweis}@ifn.et.tu-dresden.de

    Abstract

This paper presents an instruction-level power model for a very long instruction word (VLIW) single instruction multiple data (SIMD) digital signal processor (DSP). We show that power-consuming memory accesses can be reduced in such SIMD processor architectures with an appropriate network connecting the parallel register files and/or data paths. Several network topologies are analyzed for a variety of digital signal processing algorithms with respect to energy.

    1. Introduction

In recent years the field of mobile computing devices has grown dramatically. More and more computing power as well as long battery lifetimes are requested. These demands conflict, since more computing power usually causes higher power consumption. In conjunction with an increasing demand for computation power, higher power consumption results in higher energy consumption, which reduces battery lifetime. Hence, there is a need for power/energy reduction techniques.

Power may be reduced by software [17] or by hardware [19]. A design for low power cannot be achieved without accurate power prediction. For low-power optimization in software an instruction-based power model is needed, which can also be used for low-power optimization in hardware. Increased computing power may be realized by higher clock frequencies as well as by parallelism [4][19]. Parallelism can be achieved in two independent ways: the instruction set architecture can be parallelized, resulting e.g. in a VLIW architecture, and parallel data paths can be implemented, allowing parallel computation of results, e.g. in a SIMD architecture.

These two parallel approaches cause problems but also allow optimizations concerning energy. The parallel instruction set architecture complicates the generation and handling of an instruction-level power model. Chapter 2 focuses on this topic. In chapter 3 we show how energy consumption may be reduced in parallel SIMD architectures. But if several parallel data paths are implemented, the architecture needs a network which interconnects the data paths with each other. Today this network is often realized by a crossbar switch, which is known to consume significant power for high parallelism. Chapter 4 analyzes various alternatives to a crossbar switch, focusing on energy consumption and also taking speed considerations into account.

    2. Instruction-Level Power Model for VLIW

We set up an instruction-level power model for the M3-DSP [3][15], which was developed at Technische Universität Dresden. The goal was to predict the average power consumption of the DSP running a particular sequence of instructions. From this average power and the number of cycles of the given sequence at a certain clock rate it is possible to calculate the energy consumption. In order to use this power model with an energy-optimizing compiler, the predicted energy consumption for a given sequence has to be calculated efficiently during compile time [9][10].


Figure 1. Parallel FIR-filter implementation calculating one sample per slice

The measured instruction sequence has to be long enough to neglect the loop instruction itself, or two measurements with sequences of different sizes can be done to correct the measured values. With the described approach the difference between estimated and measured power consumption is less than 2% for the M3-DSP, which is sufficient for our needs.

A part of the generated power model is shown in table 2. Different FIW pairs are listed with the measured power consumption normalized to the no-operation (NOP) instruction (all FIWs are NOP). Aside from SIMD MAC instructions, which perform 16 MAC calculations in parallel, the most energy-consuming instructions are memory accesses (AGU read and AGU write). Hence, one strategy to reduce power dissipation may be the reduction of memory accesses. In the next chapter we show that memory accesses can be reduced in parallel architectures by using data transfers, which consume significantly less power as shown in table 2.
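To illustrate how such a lookup-table model can be evaluated efficiently at compile time, the following Python sketch estimates the energy of an instruction sequence. The table entries, the clock rate and the normalization constant are illustrative assumptions, not the measured M3-DSP values.

# Sketch: energy estimation from an instruction-level power lookup table.
# All numeric values below are illustrative placeholders, not measured data.

NOP_POWER_MW = 1.0      # assumed absolute power of the NOP reference, in mW
CLOCK_HZ = 100e6        # assumed clock rate

POWER_TABLE = {         # normalized power per functional instruction word (FIW) class
    "NOP": 1.0,
    "DATA_TRANSFER": 1.3,   # e.g. broadcast or Zurich Zip on the ICU
    "AGU_READ": 2.0,        # memory accesses are among the most expensive
    "AGU_WRITE": 2.1,
    "SIMD_MAC": 2.5,        # 16 MAC operations in parallel
}

def estimate_energy(sequence):
    """Estimated energy in joules for a sequence of FIW classes (one cycle each)."""
    avg_power_mw = NOP_POWER_MW * sum(POWER_TABLE[f] for f in sequence) / len(sequence)
    return (avg_power_mw * 1e-3) * len(sequence) / CLOCK_HZ

print(estimate_energy(["AGU_READ", "SIMD_MAC", "AGU_WRITE"]))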

    3. Parallelization Reduces Memory Accesses

Many algorithms for digital signal processing can be parallelized in a SIMD manner. The computation is then carried out on parallel data paths, all performing the same operation. In that case the same operands may be used in different data paths in the same cycle or some cycles later. Filter algorithms often show this property.

Equation (1) and figure 1 show one possible parallelization of a real M-tap FIR-filter. Here n represents the number of the output sample y(n), with one output sample calculated per slice. In the case of figure 1 the filter coefficients h(m) in all parallel data paths are the same in one cycle, and all but one of the input sample values x(n-m) are the same in the next cycle in the neighboring data path. SIMD parallelization as shown for the FIR-filter was also analyzed for cascaded biquad filters and lattice-ladder filters, with similar reuse opportunities.

y(n) = \sum_{m=0}^{M-1} h(m) \, x(n-m)    (1)

For sequential calculation of an FIR-filter with element memory support, each of the N output samples needs M filter coefficients and M input samples. This results in roughly N(2M + 1) memory accesses for all N samples. The memory accesses can be reduced by calculating in parallel, depending on the capabilities of the interconnection network unit (ICU). The largest reduction can be achieved for the FIR scheduling of figure 1 if the ICU supports a broadcast data transfer to pass the appropriate filter coefficient h(m) to all data paths, if the ICU supports a Zurich Zip [1] data transfer to shift all input samples by one to the right while filling data path 0 with a new sample, and if the number of output samples is an integer multiple of the number of data paths p. Thereby far fewer element memory accesses are needed. If we additionally consider a group memory architecture, in which p data elements are read and written at once per memory access, the number of memory accesses is further reduced by a factor of p.
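The following rough counting sketch compares the two schemes. The per-block access counts are simplified assumptions (coefficients broadcast once per cycle, one fresh input sample per cycle via the shift transfer, one result write per output), not the paper's exact formulas.

# Rough sketch: memory-access counts for an M-tap FIR over N output samples.
# The counting rules are simplified assumptions for illustration only.

def sequential_accesses(N, M):
    # per output: M coefficient reads + M sample reads + 1 result write
    return N * (2 * M + 1)

def parallel_accesses(N, M, p, group_memory=False):
    # p data paths, one output per slice; coefficients broadcast, samples shifted in
    blocks = N // p                      # assumes N is a multiple of p
    reads_per_block = M + (M + p - 1)    # coefficient reads + input sample reads
    writes_per_block = p                 # one result per slice
    accesses = blocks * (reads_per_block + writes_per_block)
    if group_memory:
        accesses = -(-accesses // p)     # p elements per access (ceiling division)
    return accesses

print(sequential_accesses(1024, 16))                      # element memory, sequential
print(parallel_accesses(1024, 16, 8))                     # element memory, parallel
print(parallel_accesses(1024, 16, 8, group_memory=True))  # group memory, parallel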

The latter reduction of memory accesses also reduces energy dissipation if the group access needs less than p times the energy of the element access, which depends on the memory access implementation. Considering the first reduction approach with parallel calculation and element memory, energy reduction can be achieved if the data transfers needed consume less energy than the memory accesses that are no longer needed. This depends on the number of data transfers as well as on the energy consumed by each data transfer, and hence on the interconnection network.

Comparing sequential calculation with element memory and parallel calculation with group memory, the number of memory accesses can be reduced substantially, combining both effects described above. The power reduction depends on the memory access implementation and on the interconnection network. In the next chapter different network architectures are analyzed concerning energy consumption.

    4. Interconnection Network

We have seen that in the M3-DSP with its ICU, constructed of multiplexer chains forming a bus network, energy dissipation can be reduced by exchanging memory accesses for data transfers in parallel algorithms. Today's parallel DSP architectures commonly use crossbar switches as connection networks. This network may be efficient for small parallelism but grows with O(p^2) for p parallel data paths. Hence, other efficient networks must be found for architectures with higher parallelism. In the following we analyze flexible stage networks, like the M3's multiplexer chain, and well-known fixed stage networks, also composed of multiplexer chains, to find an efficient ICU architecture.
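To make the scaling argument concrete, a small sketch can contrast the quadratic growth of a crossbar with the linear growth of a skip-based multiplexer chain; the cost measure (crosspoints or multiplexer inputs only, ignoring wires and control logic) is an assumption for illustration.

# Sketch: rough scaling of interconnect cost with the number of data paths p.
# The cost measure (crosspoints / multiplexer inputs only) is an assumption.

def crossbar_cost(p):
    return p * p                    # full crossbar grows quadratically

def skip_network_cost(p, skips=(1, 2, 4)):
    return p * len(skips)           # one multiplexer input per skip distance per node

for p in (8, 16, 64):
    print(p, crossbar_cost(p), skip_network_cost(p))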

A network can be characterized by three parameters: topology, routing and flow control. In the topology or interconnection graph the vertices correspond to the network nodes, which are multiplexers connecting other multiplexers with a data path. The edges are the physical connections of the nodes. To find an effective topology, five parameters can be considered: bisection width, which measures the wiring density of the network; in and out degree, which correspond to the number of input and output wires; diameter, which is the longest shortest path between two nodes; wire length; and symmetry [2]. Concentrating on energy-efficient networks, an energy estimation model is used to evaluate a network's energy consumption.
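Two of these topology parameters, out degree and diameter, are easy to compute from the interconnection graph. The sketch below does so for an assumed example, a ring of eight nodes with an extra skip of length two.

# Sketch: out degree and diameter of a small topology graph (example: ring with skip 2).
from collections import deque

def ring_with_skips(p, skips=(1, 2)):
    # adjacency list: node i connects to (i +/- s) mod p for every skip length s
    return {i: sorted({(i + s) % p for s in skips} | {(i - s) % p for s in skips})
            for i in range(p)}

def diameter(adj):
    # longest shortest path over all node pairs, via BFS from every node
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

topology = ring_with_skips(8)
print(len(topology[0]), diameter(topology))   # out degree of node 0, network diameter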

    4.1. Network Power Estimation

In CMOS designs, power dissipation can be classified into three sources: switching power P_{sw}, internal or short-circuit power P_{sc}, and static power P_{stat} [13]. Equation (2) shows the power dissipation for one node. These power dissipation sources can be mapped to the out degree and the wire length of the network topology.

P_{node} = P_{sw} + P_{sc} + P_{stat}    (2)

Switching power describes the power dissipation of charging and discharging the node capacitance (3). The transition density D is the average number of transitions per time [11]. It reflects whether a node is used or not during a data transfer performed on the network; this depends on the routing of the network and on the data transfer itself. The fanout f corresponds to the out degree of the node. The load capacitance C_L is formed of the gate capacitance C_{gate} and the wire capacitance C_{wire} (5). Here, we assume long wires and refer to [5], which results in much more wire than gate capacitance (6). Also, the importance of the wire capacitance for the load capacitance increases with each migration in technology [13][5]. Hence, for simplicity we neglect the gate capacitance in this model (7). This results in a switching power proportional to the total wire length over the out degree of a node, depending on the network routing and on the data transfers (4).

P_{sw} = \frac{1}{2} V_{dd}^2 \, D \sum_{i=1}^{f} C_{L,i}    (3)

P_{sw} \propto D \sum_{i=1}^{f} l_{wire,i}    (4)

with C_L = C_{gate} + C_{wire}    (5)

and C_{wire} \gg C_{gate}    (6)

C_L \approx C_{wire}    (7)

Short-circuit power occurs when both the p- and n-channel devices conduct simultaneously, thereby establishing a short. This power is composed of the supply voltage V_{dd} and the peak current I_{peak} (8), which we assume to be constant values [6][14]. The short-circuit power appears only if a transition is performed and hence depends only on the transition density and a constant factor (9).

P_{sc} \propto V_{dd} \, I_{peak}    (8)

P_{sc} = \beta \, D    (9)

with \beta \propto V_{dd} \, I_{peak}    (10)

Static power is considered to be insignificant in CMOS technology [13] and is therefore neglected. For the resulting power model (11), short-circuit power is shown to be typically between 10% and 30% of the total power dissipation and switching power between 70% and 90% of the total power [13]. Hence, values for the factors \alpha and \beta can be obtained by simulation of networks performing various data transfers. Using the presented power model, an energy-optimal routing can be found by simulation.

P_{node} = \alpha \, D \sum_{i=1}^{f} l_{wire,i} + \beta \, D    (11)
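A minimal sketch of this model follows, with illustrative values for the factors alpha and beta (in the paper they would be obtained by circuit simulation) and assumed wire-length units.

# Sketch of the node power model of equation (11); alpha, beta and the units are assumed.

ALPHA = 0.5   # switching-power weight per unit wire length (illustrative)
BETA = 0.2    # short-circuit power weight (illustrative)

def node_power(transition_density, out_wire_lengths):
    """P_node = alpha * D * sum(l_wire) + beta * D, cf. equation (11)."""
    return ALPHA * transition_density * sum(out_wire_lengths) + BETA * transition_density

def network_power(nodes):
    """Total power of a data transfer; `nodes` holds (transition density, wire lengths)."""
    return sum(node_power(d, wires) for d, wires in nodes)

# example: three multiplexer nodes exercised by one transfer in some layout
print(network_power([(0.5, [1.0, 2.0]), (0.25, [1.0]), (0.0, [4.0])]))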

    4.2. Analysis

To find an appropriate energy-efficient network, we examined four classes of network topologies with eight nodes. One class is the bus topology without extra skip (as the M3-DSP's ICU), with an extra skip of length two, and with extra skips of both length two and four. Another class is the ring topology, also without or with extra skips. Together with the hypercube topology these networks are all flexible stage networks, because they may use any given number of stages to perform a data transfer. Also the omega and butterfly topologies were simulated. These are fixed stage networks: all data transfers use the three stages of the network. Because the maximum number of used stages may increase the critical path of the overall design, we define an upper limit for this number for the flexible stage networks. For the bus network without extra skip we allow seven stages, because otherwise a transfer from the first to the last node would not be possible in one step. In the ring network without extra skip we restrict the limit to four for the same reason. In the other bus and ring networks we reduce the limit by one for every extra skip. In the hypercube we restrict the limit to three for the same reason as for bus and ring. Bus and ring topologies are shown in figure 2; for the other networks we refer to [18]. To analyze the influence of layout mapping, different layouts are considered, shown in figure 3.

Figure 2. Bus and ring networks: bus without extra skip (bus, skip 1); bus with one extra skip (bus, skip 1,2); bus with two extra skips (bus, skip 1,2,4); ring without extra skip (ring, skip 1); ring with one extra skip (ring, skip 1,2); ring with two extra skips (ring, skip 1,2,4)

Figure 3. Different layouts: linear, double linear, circular

We simulated all data transfers (DT) needed for our target algorithms. In addition to the broadcast and Zurich Zip data transfers needed for FIR filtering, we took the rotation DT into account (e.g. a rotation by two, which means that the value of node 0 is transferred to node 2, the value of node 1 to node 3, and so on). Another data transfer we considered was the swap DT, which exchanges the values of node pairs. We also took into account data transfers used by Cooley-Tukey FFT and Singleton FFT/Viterbi algorithms, including bit-reversal. Here we refer to [12].
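To make the transfer classes concrete, the sketch below expresses them as index mappings over eight nodes; the particular conventions (rotation distance two, pairing of the swap, how the Zurich Zip refills node 0) are illustrative assumptions.

# Sketch: data transfer classes as index mappings over p = 8 nodes.
# dst[i] names the node whose value node i receives (conventions assumed).
p = 8

broadcast = [0] * p                           # every node receives node 0's value
rotation_2 = [(i - 2) % p for i in range(p)]  # node 0's value ends up in node 2, etc.
swap = [i ^ 1 for i in range(p)]              # neighboring pairs exchange values
zurich_zip = [i - 1 for i in range(p)]        # shift right by one; -1 marks a fresh
                                              # sample loaded from memory into node 0
bit_reversal = [int(f"{i:03b}"[::-1], 2) for i in range(p)]  # FFT-style reordering

print(rotation_2, bit_reversal)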

Considering group memory, data may not be aligned properly, having an offset as illustrated in figure 4. We therefore simulated all described data transfers with all possible offsets in a next step. To find out whether the specific target data transfers have a significant impact on the performance of the networks, random data transfers were analyzed in a third step.
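A minimal sketch of how such an offset can be modelled follows, assuming it simply rotates both the source and the destination indices of a transfer, which is one plausible reading of figure 4.

# Sketch: applying a data offset to a transfer pattern (assumed model:
# source and destination indices are both rotated by the offset).

def with_offset(dst, offset):
    p = len(dst)
    return [(dst[(i - offset) % p] + offset) % p for i in range(p)]

broadcast = [0] * 8
print(broadcast)                  # aligned data: all nodes read from node 0
print(with_offset(broadcast, 3))  # offset 3: the broadcast source moves to node 3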

Figure 4. Aligned and non-aligned data (example: offset = 3)

    5. Results

After simulating all combinations of the different topologies, layouts and data transfers, figure 5 shows the average power consumption over all data transfers without offset for each particular topology in the different layouts. The general result is independent of the layout: flexible stage networks outperform fixed stage networks regarding power dissipation. Butterfly and omega networks consume 110%–330% more power than the other networks. Regarding flexible stage networks only, skip networks in ring or bus structure consume 15%–35% less energy than hypercube networks. In bus and ring networks energy consumption can be reduced with each additional skip. Comparing bus and ring networks with an equal number of extra skips, ring networks outperform bus networks: power dissipation is up to 35% smaller.

Concentrating on layout issues, networks in linear layout underperform networks in double linear layout, which in turn underperform networks in circular layout. But for bus topologies the differences are significantly smaller than for the other networks. Ring and hypercube topologies do not profit from a circular layout instead of a double linear layout either.

Figure 5. Average energy consumption for different layouts

Figure 6 shows the energy consumption for the particular data transfers without offset for all network topologies in double linear layout. The energy dissipation differs a lot when comparing the results for different data transfers performed in the same network. In general, broadcast data transfers need the least energy and the swap data transfer the most energy for all networks. The difference is 65%–240%. However, a classification into energy-expensive and non-expensive data transfers is not always possible. The Zurich Zip data transfer, for instance, performs well on ring networks and poorly on hypercube networks: in ring networks 20% more energy is consumed comparing the Zurich Zip with the broadcast data transfer, in hypercube networks the difference is 70%. Comparing energy consumption over all networks and data transfers, rotation transfers perform particularly well on ring networks and Cooley-Tukey FFT transfers perform best on hypercube networks.

Considering all data transfers with offset, the average energy consumption is significantly higher for all networks than without offset, as depicted in figure 7. This effect is most notable for all ring networks and for bus networks with extra skips. Here, the energy consumption is 30% higher than without offset, whereas for the other networks the difference is less than 10%. This means that considering offsets equalizes the energy performance of the networks: hypercubes perform better than bus networks, and the advantage of ring networks decreases when offsets are considered. For random transfers, energy consumption increases by another 10–20% compared with the data transfers with offset.


Figure 6. Energy consumption of data transfer classes in double linear layout

Figure 7. Energy consumption of data transfer classes with additional offset and random data transfers in double linear layout

Network efficiency is not only determined by energy consumption, but also by the number of cycles needed to perform a data transfer. Figure 8 shows the average cycle count for all data transfers in the different networks in double linear layout. Here fixed stage networks outperform most of the other networks. In bus and ring networks every additional skip reduces the average cycle count significantly. Bus networks with two extra skips and ring networks with at least one extra skip outperform hypercubes. Ring networks with two extra skips perform even better than fixed stage networks. Considering data transfers with offset and random data transfers, the results remain similar.

Figure 8. Average cycle count

As far as energy consumption is concerned, bus or, even better, ring networks outperform all others: the more extra skips, the better. Also with respect to cycle count, ring networks with extra skips show very good performance. But implementation and routing may cause problems with these bus or ring networks with extra skips.

    6. Conclusions

In chapter 2 of this paper we set up an instruction-level power model. The model complexity was reduced by generating separate lookup tables for instruction groups whose energy consumption is independent of each other. These groups were found by cutting the very long instruction word into its functional instruction words.

According to the generated power model of the M3-DSP, memory accesses are among the most power-consuming instructions, whereas data transfers dissipate significantly less power. In chapter 3 it was shown that memory accesses can be reduced in SIMD parallel architectures. The reduction depends on the number of parallel data paths. It is realized by performing group memory accesses instead of element accesses, and by exchanging memory accesses in general for data transfers.

To achieve energy reduction by exchanging memory accesses for data transfers, an interconnection network is needed that realizes all requested transfers and consumes little power. By analyzing various possible interconnection networks based on multiplexer chains in chapter 4, we found that flexible stage networks have significantly less power dissipation than fixed stage networks. Different layouts do not change the ranking of the different topologies, but for different data transfers the ranking does change. Therefore, we suggest an application-specific construction of the interconnection network. Considering non-aligned data in group memory, the topology-specific differences in power dissipation are reduced. Hence, a separate pre-aligning process might be taken into account.

The first-choice network considering energy consumption as well as cycle count and the maximum number of used stages is the ring topology with extra skips. The weak points are implementation and routing aspects, which have to be analyzed further. Good alternatives are bus networks with extra skips, which reduce the implementation problems but also have difficult routing, or the hypercube topology with easy routing but slightly worse performance in energy consumption.

    References

[1] J. Beraud and C. Galand. Digital Signal Processor Architecture With Plural Multiply/Accumulate Devices. U.S. Patent No. 5,175,702, IBM Corp., December 29, 1992.

[2] W. Dally. VLSI and Parallel Computation, chapter 3. Morgan Kaufmann Publishers, Inc., San Mateo, California, 1990.

[3] G. P. Fettweis, M. Weiss, W. Drescher, U. Walther, F. Engel, S. Kobayashi, and T. Richter. Breaking new grounds over 3000 MOPS: A broadband mobile multimedia modem DSP. In ICSPAT '98, Toronto, 13.–16. September 1998.

[4] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., second edition, 1996.

[5] R. Ho, K. W. Mai, and M. A. Horowitz. The Future of Wires. In Proceedings of the IEEE, volume 89, April 2001.

[6] Y. I. Ismail and E. G. Friedman. Dynamic and Short-Circuit Power of CMOS Gates Driving Lossless Transmission Lines. IEEE Transactions on Circuits and Systems, 46(8), August 1999.

[7] B. Klass, D. Thomas, H. Schmit, and D. Nagle. Modeling Inter-Instruction Energy Effects in a Digital Signal Processor. Power-Driven Microarchitecture Workshop, International Symposium on Computer Architecture, June 1998.

[8] M. Lee, V. Tiwari, S. Malik, and M. Fujita. Power analysis and minimization techniques for embedded DSP software. IEEE Transactions on VLSI Systems, 5(1):123–135, March 1997.

[9] M. Lorenz, T. Dräger, R. Leupers, P. Marwedel, and G. Fettweis. Low-Energy DSP Code Generation Using a Genetic Algorithm. In Proceedings of the International Conference on Computer Design (ICCD) 2001, Austin, Texas, pages 431–437, September 2001.

[10] M. Lorenz and P. Marwedel. Energiebewusste Compilierung für Digitale Signalprozessoren. In this conference book.

[11] F. N. Najm. Transition Density: A New Measure of Activity in Digital Circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 310–323, February 1993.

[12] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing. Prentice-Hall, 1989.

[13] M. Pedram. Power Minimization in IC Design: Principles and Applications. ACM Transactions on Design Automation of Electronic Systems, 1(1):3–56, January 1996.

[14] D. Rabe and W. Nebel. Short circuit power consumption of glitches. In Proceedings of the 1996 International Symposium on Low Power Electronics and Design, pages 125–128, Monterey, CA, USA, August 1996.

[15] T. Richter, W. Drescher, F. Engel, S. Kobayashi, V. Nikolajevic, M. Weiss, and G. Fettweis. A platform-based highly parallel digital signal processor. In Proc. CICC 2001, pages 305–308, San Diego, USA, 6.–9. May 2001.

[16] S. Steinke, M. Knauer, L. Wehmeyer, and P. Marwedel. An Accurate and Fine Grain Instruction-Level Energy Model Supporting Optimizations. International Workshop on Power and Timing Modeling, Optimization and Simulation, Yverdon-les-Bains, Switzerland, September 2001.

[17] V. Tiwari, S. Malik, and A. Wolfe. Power analysis of embedded software: a first step towards software power minimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2(4):437–445, December 1994.

[18] A. Varma and C. Raghavendra. Interconnection Networks for Multiprocessors and Multicomputers: Theory and Practice. Manuscript for IEEE tutorial text, IEEE Computer Society Press, 1994.

[19] G. Yeap and F. Najm, editors. Low Power VLSI Design and Technology. World Scientific Publishing Co., 1996.