
    Energy Savings with Appropriate Interconnection Networks in Parallel DSP

Thorsten Dräger and Gerhard Fettweis

Technische Universität Dresden

Institut für Nachrichtentechnik

Mannesmann Mobilfunk Stiftungslehrstuhl für mobile Nachrichtensysteme

01062 Dresden, Germany

{draeger, fettweis}@ifn.et.tu-dresden.de

    Abstract

This paper presents an instruction-level power model for a very long instruction word (VLIW) single instruction multiple data (SIMD) digital signal processor (DSP). We show that power-consuming memory accesses can be reduced in such SIMD processor architectures with an appropriate network connecting the parallel register files and/or data paths. Several network topologies are analyzed for a variety of digital signal processing algorithms with respect to energy.

    1. Introduction

In recent years the field of mobile computing devices has grown dramatically. More and more computing power as well as long battery lifetimes are requested. These demands conflict, since more computing power usually causes higher power consumption. In conjunction with an increasing demand for computation power, higher power consumption results in higher energy consumption, which reduces battery lifetime. Hence, there is a need for power/energy reduction techniques.

Power may be reduced by software [17] or by hardware [19]. A design for low power cannot be achieved without accurate power prediction. For low-power optimization in software an instruction-based power model is needed, which can also be used for low-power optimization in hardware. Increased computing power may be realized by higher clock frequencies as well as by parallelism [4][19]. Parallelism can be achieved in two independent ways: the instruction set architecture can be parallelized, resulting e.g. in a VLIW architecture, and parallel data paths can be implemented, allowing parallel computation of results, e.g. in a SIMD architecture.

These two parallel approaches cause problems but also allow optimizations concerning energy. The parallel instruction set architecture complicates the generation and handling of an instruction-level power model. Chapter 2 focuses on this topic. In chapter 3 we show how energy consumption may be reduced in parallel SIMD architectures. But if several parallel data paths are implemented, the architecture needs a network which interconnects the data paths with each other. Today this network is often realized by a crossbar switch, which is known to consume significant power for high parallelism. Chapter 4 analyzes various alternatives to a crossbar switch, focusing on energy consumption and also taking speed considerations into account.

    2. Instruction-Level Power Model for VLIW

We set up an instruction-level power model for the M3-DSP [3][15], which was developed at Technische Universität Dresden. The goal was to predict the average power consumption of the DSP running a particular sequence of instructions. From this average power and the number of cycles of the given sequence at a certain clock rate it is possible to calculate the energy consumption. In order to use this power model with an energy-optimizing compiler, the predicted energy consumption for a given sequence has to be calculated efficiently during compile time [9][10].


Figure 1. Parallel FIR-filter implementation calculating one sample per slice

The measured instruction sequence has to be long enough to neglect the loop instruction itself, or two measurements with sequences of different sizes can be done to correct the measured values. With the described approach the difference between estimated and measured power consumption is less than 2% for the M3-DSP, which is sufficient for our needs.

A part of the generated power model is shown in table 2. Different FIW pairs are listed with the measured power consumption normalized to the no-operation (NOP) instruction (all FIWs are NOP). Aside from SIMD MAC instructions, which perform 16 MAC calculations in parallel, the most energy-consuming instructions are memory accesses (AGU read and AGU write). Hence, one strategy to reduce power dissipation may be the reduction of memory accesses. In the next chapter we show that memory accesses can be reduced in parallel architectures by using data transfers, which consume significantly less power as shown in table 2.
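To illustrate how such a lookup-table model can be evaluated efficiently at compile time, the following Python sketch estimates the energy of an instruction sequence. The table entries, the clock rate and the normalization constant are illustrative assumptions, not the measured M3-DSP values.

# Sketch: energy estimation from an instruction-level power lookup table.
# All numeric values below are illustrative placeholders, not measured data.

NOP_POWER_MW = 1.0      # assumed absolute power of the NOP reference, in mW
CLOCK_HZ = 100e6        # assumed clock rate

POWER_TABLE = {         # normalized power per functional instruction word (FIW) class
    "NOP": 1.0,
    "DATA_TRANSFER": 1.3,   # e.g. broadcast or Zurich Zip on the ICU
    "AGU_READ": 2.0,        # memory accesses are among the most expensive
    "AGU_WRITE": 2.1,
    "SIMD_MAC": 2.5,        # 16 MAC operations in parallel
}

def estimate_energy(sequence):
    """Estimated energy in joules for a sequence of FIW classes (one cycle each)."""
    avg_power_mw = NOP_POWER_MW * sum(POWER_TABLE[f] for f in sequence) / len(sequence)
    return (avg_power_mw * 1e-3) * len(sequence) / CLOCK_HZ

print(estimate_energy(["AGU_READ", "SIMD_MAC", "AGU_WRITE"]))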

    3. Parallelization Reduces Memory Accesses

Many algorithms for digital signal processing can be parallelized in a SIMD manner. The computation is then carried out on parallel data paths, all performing the same operation. In that case the same operands may be used in different data paths in the same cycle or some cycles later. Filter algorithms often show this property.

Equation (1) and figure 1 show one possible parallelization of a real M-tap FIR-filter. Here n represents the number of the output sample y(n), with one output sample calculated per slice. In the case of figure 1 the filter coefficients h(m) in all parallel data paths are the same in one cycle, and all but one of the input sample values x(n-m) are the same in the next cycle in the neighboring data path. SIMD parallelization as shown for the FIR-filter was also analyzed for cascaded biquad filters and lattice-ladder filters, with similar reuse opportunities.

y(n) = \sum_{m=0}^{M-1} h(m) \, x(n-m)    (1)

For sequential calculation of an FIR-filter with element memory support, each of the N output samples needs M filter coefficients and M input samples. This results in roughly N(2M + 1) memory accesses for all N samples. The memory accesses can be reduced by calculating in parallel, depending on the capabilities of the interconnection network unit (ICU). The largest reduction can be achieved for the FIR scheduling of figure 1 if the ICU supports a broadcast data transfer to pass the appropriate filter coefficient h(m) to all data paths, if the ICU supports a Zurich Zip [1] data transfer to shift all input samples by one to the right while filling data path 0 with a new sample, and if the number of output samples is an integer multiple of the number of data paths p. Thereby far fewer element memory accesses are needed. If we additionally consider a group memory architecture, in which p data elements are read and written at once per memory access, the number of memory accesses is further reduced by a factor of p.
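The following rough counting sketch compares the two schemes. The per-block access counts are simplified assumptions (coefficients broadcast once per cycle, one fresh input sample per cycle via the shift transfer, one result write per output), not the paper's exact formulas.

# Rough sketch: memory-access counts for an M-tap FIR over N output samples.
# The counting rules are simplified assumptions for illustration only.

def sequential_accesses(N, M):
    # per output: M coefficient reads + M sample reads + 1 result write
    return N * (2 * M + 1)

def parallel_accesses(N, M, p, group_memory=False):
    # p data paths, one output per slice; coefficients broadcast, samples shifted in
    blocks = N // p                      # assumes N is a multiple of p
    reads_per_block = M + (M + p - 1)    # coefficient reads + input sample reads
    writes_per_block = p                 # one result per slice
    accesses = blocks * (reads_per_block + writes_per_block)
    if group_memory:
        accesses = -(-accesses // p)     # p elements per access (ceiling division)
    return accesses

print(sequential_accesses(1024, 16))                      # element memory, sequential
print(parallel_accesses(1024, 16, 8))                     # element memory, parallel
print(parallel_accesses(1024, 16, 8, group_memory=True))  # group memory, parallel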

The latter reduction of memory accesses also reduces energy dissipation if the group access needs less than p times the energy of the element access, which depends on the memory access implementation. Considering the first reduction approach with parallel calculation and element memory, energy reduction can be achieved if the data transfers needed consume less energy than the memory accesses that are no longer needed. This depends on the number of data transfers as well as on the energy consumed by each data transfer, and hence on the interconnection network.

Comparing sequential calculation with element memory and parallel calculation with group memory, the number of memory accesses can be reduced substantially, combining both effects described above. The power reduction depends on the memory access implementation and on the interconnection network. In the next chapter different network architectures are analyzed concerning energy consumption.

    4. Interconnection Network

We have seen that in the M3-DSP with its ICU, constructed of multiplexer chains forming a bus network, energy dissipation can be reduced by exchanging memory accesses for data transfers in parallel algorithms. Today's parallel DSP architectures commonly use crossbar switches as connection networks. This network may be efficient for small parallelism but grows with O(p^2) for p parallel data paths. Hence, other efficient networks must be found for architectures with higher parallelism. In the following we analyze flexible stage networks, like the M3's multiplexer chain, and well-known fixed stage networks, also composed of multiplexer chains, to find an efficient ICU architecture.
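To make the scaling argument concrete, a small sketch can contrast the quadratic growth of a crossbar with the linear growth of a skip-based multiplexer chain; the cost measure (crosspoints or multiplexer inputs only, ignoring wires and control logic) is an assumption for illustration.

# Sketch: rough scaling of interconnect cost with the number of data paths p.
# The cost measure (crosspoints / multiplexer inputs only) is an assumption.

def crossbar_cost(p):
    return p * p                    # full crossbar grows quadratically

def skip_network_cost(p, skips=(1, 2, 4)):
    return p * len(skips)           # one multiplexer input per skip distance per node

for p in (8, 16, 64):
    print(p, crossbar_cost(p), skip_network_cost(p))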

A network can be characterized by three parameters: topology, routing and flow control. In the topology or interconnection graph the vertices correspond to the network nodes, which are multiplexers connecting other multiplexers with a data path. The edges are the physical connections of the nodes. To find an effective topology, five parameters can be considered: bisection width, which measures the wiring density of the network; in and out degree, which correspond to the number of input and output wires; diameter, which is the longest shortest path between two nodes; wire length; and symmetry [2]. Concentrating on energy-efficient networks, an energy estimation model is used to evaluate a network's energy consumption.
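Two of these topology parameters, out degree and diameter, are easy to compute from the interconnection graph. The sketch below does so for an assumed example, a ring of eight nodes with an extra skip of length two.

# Sketch: out degree and diameter of a small topology graph (example: ring with skip 2).
from collections import deque

def ring_with_skips(p, skips=(1, 2)):
    # adjacency list: node i connects to (i +/- s) mod p for every skip length s
    return {i: sorted({(i + s) % p for s in skips} | {(i - s) % p for s in skips})
            for i in range(p)}

def diameter(adj):
    # longest shortest path over all node pairs, via BFS from every node
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

topology = ring_with_skips(8)
print(len(topology[0]), diameter(topology))   # out degree of node 0, network diameter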

    4.1. Network Power Estimation

In CMOS designs, power dissipation can be classified into three sources: switching power P_{sw}, internal or short-circuit power P_{sc}, and static power P_{stat} [13]. Equation (2) shows the power dissipation for one node. These power dissipation sources can be mapped to the out degree and the wire length of the network topology.

P_{node} = P_{sw} + P_{sc} + P_{stat}    (2)

Switching power describes the power dissipation of charging and discharging the node capacitance (3). The transition density D is the average number of transitions per time [11]. It reflects whether a node is used or not during a data transfer performed on the network; this depends on the routing of the network and on the data transfer itself. The fanout f corresponds to the out degree of the node. The load capacitance C_L is formed of the gate capacitance C_{gate} and the wire capacitance C_{wire} (5). Here, we assume long wires and refer to [5], which results in much more wire than gate capacitance (6). Also, the importance of the wire capacitance for the load capacitance increases with each migration in technology [13][5]. Hence, for simplicity we neglect the gate capacitance in this model (7). This results in a switching power proportional to the total wire length over the out degree of a node, depending on the network routing and on the data transfers (4).

P_{sw} = \frac{1}{2} V_{dd}^2 \, D \sum_{i=1}^{f} C_{L,i}    (3)

P_{sw} \propto D \sum_{i=1}^{f} l_{wire,i}    (4)

with C_L = C_{gate} + C_{wire}    (5)

and C_{wire} \gg C_{gate}    (6)

C_L \approx C_{wire}    (7)

Short-circuit power occurs when both the p- and n-channel devices conduct simultaneously, thereby establishing a short. This power is composed of the supply voltage V_{dd} and the peak current I_{peak} (8), which we assume to be constant values [6][14]. The short-circuit power appears only if a transition is performed and hence depends only on the transition density and a constant factor (9).

P_{sc} \propto V_{dd} \, I_{peak}    (8)

P_{sc} = \beta \, D    (9)

with \beta \propto V_{dd} \, I_{peak}    (10)

Static power is considered to be insignificant in CMOS technology [13] and is therefore neglected. For the resulting power model (11), short-circuit power is shown to be typically between 10% and 30% of the total power dissipation and switching power between 70% and 90% of the total power [13]. Hence, values for the factors \alpha and \beta can be obtained by simulation of networks performing various data transfers. Using the presented power model, an energy-optimal routing can be found by simulation.

P_{node} = \alpha \, D \sum_{i=1}^{f} l_{wire,i} + \beta \, D    (11)
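A minimal sketch of this model follows, with illustrative values for the factors alpha and beta (in the paper they would be obtained by circuit simulation) and assumed wire-length units.

# Sketch of the node power model of equation (11); alpha, beta and the units are assumed.

ALPHA = 0.5   # switching-power weight per unit wire length (illustrative)
BETA = 0.2    # short-circuit power weight (illustrative)

def node_power(transition_density, out_wire_lengths):
    """P_node = alpha * D * sum(l_wire) + beta * D, cf. equation (11)."""
    return ALPHA * transition_density * sum(out_wire_lengths) + BETA * transition_density

def network_power(nodes):
    """Total power of a data transfer; `nodes` holds (transition density, wire lengths)."""
    return sum(node_power(d, wires) for d, wires in nodes)

# example: three multiplexer nodes exercised by one transfer in some layout
print(network_power([(0.5, [1.0, 2.0]), (0.25, [1.0]), (0.0, [4.0])]))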

    4.2. Analysis

To find an appropriate energy-efficient network, we examined four classes of network topologies with eight nodes. One class is the bus topology without extra skip (as the M3-DSP's ICU), with an extra skip of length two, and with extra skips of both length two and four. Another class is the ring topology, also without or with extra skips. Together with the hypercube topology these networks are all flexible stage networks, because they may use any given number of stages to perform a data transfer. Also the omega and butterfly topologies were simulated. These are fixed stage networks: all data transfers use the three stages of the network. Because the maximum number of used stages may increase the critical path of the overall design, we define an upper limit for this number for the flexible stage networks. For the bus network without extra skip we allow seven stages, because otherwise a transfer from the first to the last node would not be possible in one step. In the ring network without extra skip we restrict the limit to four for the same reason. In the other bus and ring networks we reduce the limit by one for every extra skip. In the hypercube we restrict the limit to three for the same reason as for bus and ring. Bus and ring topologies are shown in figure 2; for the other networks we refer to [18]. To analyze the influence of layout mapping, different layouts are considered, shown in figure 3.

Figure 2. Bus and ring networks: bus without extra skip (bus, skip 1); bus with one extra skip (bus, skip 1,2); bus with two extra skips (bus, skip 1,2,4); ring without extra skip (ring, skip 1); ring with one extra skip (ring, skip 1,2); ring with two extra skips (ring, skip 1,2,4)

Figure 3. Different layouts: linear, double linear, circular

We simulated all data transfers (DT) needed for our target algorithms. In addition to the broadcast and Zurich Zip data transfers needed for FIR filtering, we took the rotation DT into account (e.g. a rotation by two, which means that the value of node 0 is transferred to node 2, the value of node 1 to node 3, and so on). Another data transfer we considered was the swap DT, which exchanges the values of node pairs. We also took into account data transfers used by Cooley-Tukey FFT and Singleton FFT/Viterbi algorithms, including bit-reversal. Here we refer to [12].
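To make the transfer classes concrete, the sketch below expresses them as index mappings over eight nodes; the particular conventions (rotation distance two, pairing of the swap, how the Zurich Zip refills node 0) are illustrative assumptions.

# Sketch: data transfer classes as index mappings over p = 8 nodes.
# dst[i] names the node whose value node i receives (conventions assumed).
p = 8

broadcast = [0] * p                           # every node receives node 0's value
rotation_2 = [(i - 2) % p for i in range(p)]  # node 0's value ends up in node 2, etc.
swap = [i ^ 1 for i in range(p)]              # neighboring pairs exchange values
zurich_zip = [i - 1 for i in range(p)]        # shift right by one; -1 marks a fresh
                                              # sample loaded from memory into node 0
bit_reversal = [int(f"{i:03b}"[::-1], 2) for i in range(p)]  # FFT-style reordering

print(rotation_2, bit_reversal)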

Considering group memory, data may not be aligned properly, having an offset as illustrated in figure 4. We therefore simulated all described data transfers with all possible offsets in a next step. To find out whether the specific target data transfers have a significant impact on the performance of the networks, random data transfers were analyzed in a third step.
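A minimal sketch of how such an offset can be modelled follows, assuming it simply rotates both the source and the destination indices of a transfer, which is one plausible reading of figure 4.

# Sketch: applying a data offset to a transfer pattern (assumed model:
# source and destination indices are both rotated by the offset).

def with_offset(dst, offset):
    p = len(dst)
    return [(dst[(i - offset) % p] + offset) % p for i in range(p)]

broadcast = [0] * 8
print(broadcast)                  # aligned data: all nodes read from node 0
print(with_offset(broadcast, 3))  # offset 3: the broadcast source moves to node 3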

Figure 4. Aligned and non-aligned data (example: offset = 3)

    5. Results

After simulating all combinations of the different topologies, layouts and data transfers, figure 5 shows the average power consumption over all data transfers without offset for each particular topology in the different layouts. The general result is independent of the layout: flexible stage networks outperform fixed stage networks regarding power dissipation. Butterfly and omega networks consume 110%–330% more power than the other networks. Regarding flexible stage networks only, skip networks in ring or bus structure consume 15%–35% less energy than hypercube networks. In bus and ring networks energy consumption can be reduced with each additional skip. Comparing bus and ring networks with an equal number of extra skips, ring networks outperform bus networks: power dissipation is up to 35% smaller.

Concentrating on layout issues, networks in linear layout underperform networks in double linear layout, which in turn underperform networks in circular layout. But for bus topologies the differences are significantly smaller than for the other networks. Ring and hypercube topologies do not profit from a circular layout instead of a double linear layout either.

Figure 5. Average energy consumption for different layouts

Figure 6 shows the energy consumption for the particular data transfers without offset for all network topologies in double linear layout. The energy dissipation differs a lot when comparing the results for different data transfers performed in the same network. In general, broadcast data transfers need the least energy and the swap data transfer the most energy for all networks. The difference is 65%–240%. However, a classification into energy-expensive and non-expensive data transfers is not always possible. The Zurich Zip data transfer, for instance, performs well on ring networks and poorly on hypercube networks: in ring networks 20% more energy is consumed comparing the Zurich Zip with the broadcast data transfer, in hypercube networks the difference is 70%. Comparing energy consumption over all networks and data transfers, rotation transfers perform particularly well on ring networks and Cooley-Tukey FFT transfers perform best on hypercube networks.

Considering all data transfers with offset, the average energy consumption is significantly higher for all networks than without offset, as depicted in figure 7. This effect is most notable for all ring networks and for bus networks with extra skips. Here, the energy consumption is 30% higher than without offset, whereas for the other networks the difference is less than 10%. This means that considering offsets equalizes the energy performance of the networks: hypercubes perform better than bus networks, and the advantage of ring networks decreases when offsets are considered. For random transfers, energy consumption increases by another 10–20% compared with the data transfers with offset.


Figure 6. Energy consumption of data transfer classes in double linear layout

Figure 7. Energy consumption of data transfer classes with additional offset and random data transfers in double linear layout

Network efficiency is not only determined by energy consumption, but also by the number of cycles needed to perform a data transfer. Figure 8 shows the average cycle count for all data transfers in the different networks in double linear layout. Here fixed stage networks outperform most of the other networks. In bus and ring networks every additional skip reduces the average cycle count significantly. Bus networks with two extra skips and ring networks with at least one extra skip outperform hypercubes. Ring networks with two extra skips perform even better than fixed stage networks. Considering data transfers with offset and random data transfers, the results remain similar.

Figure 8. Average cycle count

As far as energy consumption is concerned, bus or, even better, ring networks outperform all others: the more extra skips, the better. Also with respect to cycle count, ring networks with extra skips show very good performance. But implementation and routing may cause problems with these bus or ring networks with extra skips.

    6. Conclusions

In chapter 2 of this paper we set up an instruction-level power model. The model complexity was reduced by generating separate lookup tables for instruction groups whose energy consumption is independent of each other. These groups were found by cutting the very long instruction word into its functional instruction words.

According to the generated power model of the M3-DSP, memory accesses are among the most power-consuming instructions, whereas data transfers dissipate significantly less power. In chapter 3 it was shown that memory accesses can be reduced in SIMD parallel architectures. The reduction depends on the number of parallel data paths. It is realized by performing group memory accesses instead of element accesses, and by exchanging memory accesses in general for data transfers.

To achieve energy reduction by exchanging memory accesses for data transfers, an interconnection network is needed that realizes all requested transfers and consumes little power. By analyzing various possible interconnection networks based on multiplexer chains in chapter 4, we found that flexible stage networks have significantly less power dissipation than fixed stage networks. Different layouts do not change the ranking of the different topologies, but for different data transfers the ranking does change. Therefore, we suggest an application-specific construction of the interconnection network. Considering non-aligned data in group memory, the topology-specific differences in power dissipation are reduced. Hence, a separate pre-aligning process might be taken into account.

The first-choice network considering energy consumption as well as cycle count and the maximum number of used stages is the ring topology with extra skips. The weak points are implementation and routing aspects, which have to be analyzed further. Good alternatives are bus networks with extra skips, which reduce the implementation problems but also have difficult routing, or the hypercube topology with easy routing but slightly worse performance in energy consumption.

    References

[1] J. Beraud and C. Galand. Digital Signal Processor Architecture With Plural Multiply/Accumulate Devices. U.S. Patent No. 5,175,702, IBM Corp., December 29, 1992.

[2] W. Dally. VLSI and Parallel Computation, chapter 3. Morgan Kaufmann Publishers, Inc., San Mateo, California, 1990.

[3] G. P. Fettweis, M. Weiss, W. Drescher, U. Walther, F. Engel, S. Kobayashi, and T. Richter. Breaking new grounds over 3000 MOPS: A broadband mobile multimedia modem DSP. In ICSPAT '98, Toronto, 13.–16. September 1998.

[4] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., second edition, 1996.

[5] R. Ho, K. W. Mai, and M. A. Horowitz. The Future of Wires. In Proceedings of the IEEE, volume 89, April 2001.

[6] Y. I. Ismail and E. G. Friedman. Dynamic and Short-Circuit Power of CMOS Gates Driving Lossless Transmission Lines. IEEE Transactions on Circuits and Systems, 46(8), August 1999.

[7] B. Klass, D. Thomas, H. Schmit, and D. Nagle. Modeling Inter-Instruction Energy Effects in a Digital Signal Processor. Power-Driven Microarchitecture Workshop, International Symposium on Computer Architecture, June 1998.

[8] M. Lee, V. Tiwari, S. Malik, and M. Fujita. Power analysis and minimization techniques for embedded DSP software. IEEE Transactions on VLSI Systems, 5(1):123–135, March 1997.

[9] M. Lorenz, T. Dräger, R. Leupers, P. Marwedel, and G. Fettweis. Low-Energy DSP Code Generation Using a Genetic Algorithm. In Proceedings of the International Conference on Computer Design (ICCD) 2001, Austin, Texas, pages 431–437, September 2001.

[10] M. Lorenz and P. Marwedel. Energiebewusste Compilierung für Digitale Signalprozessoren. In this conference book.

[11] F. N. Najm. Transition Density: A New Measure of Activity in Digital Circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 310–323, February 1993.

[12] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing. Prentice-Hall, 1989.

[13] M. Pedram. Power Minimization in IC Design: Principles and Applications. ACM Transactions on Design Automation of Electronic Systems, 1(1):3–56, January 1996.

[14] D. Rabe and W. Nebel. Short circuit power consumption of glitches. In Proceedings of the 1996 International Symposium on Low Power Electronics and Design, pages 125–128, Monterey, CA, USA, August 1996.

[15] T. Richter, W. Drescher, F. Engel, S. Kobayashi, V. Nikolajevic, M. Weiss, and G. Fettweis. A platform-based highly parallel digital signal processor. In Proc. CICC 2001, pages 305–308, San Diego, USA, 6.–9. May 2001.

[16] S. Steinke, M. Knauer, L. Wehmeyer, and P. Marwedel. An Accurate and Fine Grain Instruction-Level Energy Model Supporting Optimizations. International Workshop on Power and Timing Modeling, Optimization and Simulation, Yverdon-les-Bains, Switzerland, September 2001.

[17] V. Tiwari, S. Malik, and A. Wolfe. Power analysis of embedded software: a first step towards software power minimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2(4):437–445, December 1994.

[18] A. Varma and C. Raghavendra. Interconnection Networks for Multiprocessors and Multicomputers: Theory and Practice. Manuscript for IEEE tutorial text, IEEE Computer Society Press, 1994.

[19] G. Yeap and F. Najm, editors. Low Power VLSI Design and Technology. World Scientific Publishing Co., 1996.