10.1.1.17.8991
-
Upload
palash-swarnakar -
Category
Documents
-
view
213 -
download
0
Transcript of 10.1.1.17.8991
-
7/29/2019 10.1.1.17.8991
1/8
Energy Savings with Appropriate Interconnection Networks in Parallel DSP
Thorsten Drager and Gerhard FettweisTechnische Universitat Dresden
Institut fur Nachrichtentechnik
Mannesmann Mobilfunk Stiftungslehrstuhl fur mobile Nachrichtensysteme
01062 Dresden, Germany
draeger,fettweis @ifn.et.tu-dresden.de
Abstract
This paper presents an instruction level powermodel for a very long instructions word (VLIW) single
instruction multiple data (SIMD) digital signal pro-
cessor (DSP). We show that power consuming mem-
ory accesses can be reduced in such SIMD processor
architectures with an appropriate network connecting
the parallel register files and/or data paths. Several
network topologies are analyzed for a variety of digital
signal processor algorithms concerning energy issues.
1. Introduction
In recent years the field of mobile computing de-
vices has grown dramatically. More and more com-
puting power as well as long battery lifetimes are re-
quested. This is contradictory, since more computing
power is mostly causing higher power consumption.
In conjunction with an increasing computation power
request, higher power consumption results in higher
energy consumption, which reduces battery lifetime.
Hence, there is a need for power/energy reduction
techniques.
Power may be reduced by software [17] or by
hardware [19]. The design for low power cannot be
achieved without accurate power prediction. For low
power optimization in software an instruction-based
power model is needed, which can also be used for
low power optimization in hardware. Increasing com-
puting power may be realized by higher frequencies
as well as by using parallelism [4] [19]. Parallelism
can be achieved in two independent ways: The in-
struction set architecture can be parallelized resulting
e.g. in a VLIW architecture and parallel data paths canbe implemented allowing parallel computation of re-
sults e.g. in a SIMD architecture.
These two parallel approaches cause problems but
also allow optimization concerning energy. The paral-
lel instruction set architecture complicates the genera-
tion and handling of an instruction-level power model.
Chapter 2 focuses on this topic. In chapter 3 we show
how energy consumption may be reduced in parallel
SIMD architectures. But if several parallel data paths
are implemented, the architecture needs a network
which interconnects the data paths with each other. To-day this network is often realized by a crossbar switch,
which is known to consume significant power for high
parallelism. Chapter 4 analyzes various alternatives to
a crossbar switch focusing on energy consumption and
also taking speed considerations into account.
2. Instruction-Level Power Model for VLIW
We set up an instruction-level power model for the
M3-DSP [3][15], which was developed at Technische
Universit at Dresden. The goal was to predict the aver-
age power consumption of the DSP, running a particu-
lar sequence of instructions. From this average power
and the number of cycles of the given sequence at a
certain clock rate it would be possible to calculate the
energy consumption. In order to use this power model
with an energy optimizing compiler the predicted en-
ergy consumption for a given sequence has to be cal-
culated efficiently during compile time [9][10].
-
7/29/2019 10.1.1.17.8991
2/8
-
7/29/2019 10.1.1.17.8991
3/8
cycle slice 0 slice 1 . . . slice p-1
t . . .
t+1 . . ....
......
......
t+M . . .
t+M+1 . . ....
......
......
Figure 1. Parallel FIR-filter implementation calculating one sample per slice
be long enough to neglect the loop instruction itself,
or two measurements with sequences of different sizes
can be done to correct the measured values.
With the described approach the difference between
estimated and measured power consumption is lessthan 2% for the M3-DSP, which is sufficient for our
needs.
A part of the generated power model is shown in
table 2. Different FIW pairs are listed with the mea-
sured power consumption normalized to the no opera-
tion (NOP) instruction (all FIW are NOP). Aside from
SIMD MAC instructions which perform 16 MAC cal-
culations in parallel, the most energy consuming in-
structions are memory accesses (AGU readand AGU
write). Hence, one strategy to reduce power dissipa-
tion may be the reduction of memory accesses. In thenext chapter we show that memory accesses can be re-
duced in parallel architectures by using data transfers
which consume significant less power as shown in ta-
ble 2.
3. Parallelization Reduces Memory Accesses
Many algorithms for digital signal processing can
be parallelized in SIMD manner. So the calculation
is calculated on parallel data paths, all performing the
same operation. Then the same operands may be used
in different data paths in the same cycles or some cy-
cles later. Filter algorithms often show this property.
Equation (1) and figure 1 show one possible paral-
lelization of a real M-tap FIR-filter. Here represents
the number of the output sample ( ) with .
In figure 1 case the filter coefficients in all paral-
lel data paths are the same in one cycle, and all but one
input sample values ( ) are the same in the next cy-
cle in the neighboring data path. SIMD-parallelization
as shown for the FIR-filter was also analyzed for cas-
caded biquad-filters and lattice-ladder filters with sim-
ilar reuse opportunities.
(1)
...
For sequential calculation of an FIR-filter with ele-
ment memory support, each of the output samples
needs filter coefficients and input sam-
ples. This results in ) memory accesses for
all samples. The memory accesses can be reduced by
doing parallel calculation, depending on the capabili-
ties of the interconnection network unit (ICU). Most
reduction can be achieved for the FIR-scheduling of
figure 1, if the ICU supports a broadcast data transfer
to pass the appropriate filter coefficient to all data
paths, if the ICU supports a Zurich Zip [1] data trans-
fer to shift all input samples by one to the right and
filling data path 0 with a new sample, and if is an
integer. Thereby only memory accesses
are needed. If we additionally consider a group mem-
ory architecture, in which data elements are read
and written at once per memory access, the number
of memory accesses is further reduced by .
The latter reduction of memory accesses also re-
duces energy dissipation, if the group access needs less
-
7/29/2019 10.1.1.17.8991
4/8
then times the energy of the element access, which
depends on memory access implementation. Consid-
ering the first reduction approach with parallel calcu-
lation and element memory, energy reduction can be
achieved if the data transfers needed consume less en-
ergy than the memory accesses not needed any more.This depends on the number of data transfers as well
as on the energy consumed by each data transfer, and
hence on the interconnection network.
Comparing sequential calculation with element
memory and parallel calculation with group memory,
memory access can be reduced by . Power reduc-
tion depends on memory access implementation and
on the interconnection network. In the next chapter
different network architectures are analyzed concern-
ing energy consumption.
4. Interconnection Network
We have seen that in the M3-DSP with its ICU,
constructed of multiplexer chains forming a bus net-
work, energy dissipation can be reduced by exchang-
ing memory accesses with data transfers in parallel
algorithms. Todays parallel DSP architectures com-
monly use crossbar switches as connection networks.
This network may be efficient for small parallelism but
grows with for parallel data paths. Hence,
other efficient networks must be found for architec-
tures with higher parallelism. In the following we ana-
lyze flexible stage networks, like the M3s multiplexer
chain, and well known fixed stage networks, also com-
posed of multiplexer chains, to find an efficient ICU
architecture.
A network can be characterized by three parame-
ters: topology, routing and flow control. In the topol-
ogy or interconnection graph the vertices correspond
to the network nodes, which are multiplexers connect-
ing other multiplexer with a data path. The edges are
the physical connections of the nodes. To find an effec-
tive topology five parameters can be found: Bisection
width which measures the wiring density of the net-
work, in and out degree which corresponds to the num-
ber of input and output wires, diameter which is the
longest shortest path between two nodes, wire length
and symmetry[2]. Concentrating on energy efficient
networks, an energy estimation model is used to eval-
uate a networks energy consumption.
4.1. Network Power Estimation
In CMOS designs, power dissipation can be classi-
fied into three sources: switching power , internal
or short circuit power and static power [13].
Equation (2) shows the power dissipation for one node.These power dissipation sources can be mapped to the
out degree and the wire length of the network topol-
ogy.
(2)
Switching power describes the power dissipation
of charging and discharging the node capacitance (3).
The transition density is the average number of tran-
sitions per time [11]. It refers to whether a node is used
or not during a data transfer performed on the network.This is depending on the routing of the network and the
data transfer itself. The fanout corresponds to the out
degree of the node. The load capacitance can be
formed of the gate capacitance and the wire ca-
pacitance (5). Here, we assume long wires and
refer to [5], which results in much more wire than gate
capacitance (6). Also, the importance of wire capaci-
tance for the load capacitance increases with each mi-
gration in technology [13][5]. Hence, for simplicity
we neglect the gate capacitance in this model (7). This
results in a switching power proportional to the totalwire length over the out degree of a node, depending
on the network routing and on data transfers (4).
(3)
(4)
with (5)
and (6)
(7)
Short circuit power occurs, when both p and n chan-
nel devices conduct simultaneously, and thereby estab-
lishing a short. This power is composed of the supply
voltage and the peak current (8), which we
assume to be constant values [6][14]. The short cir-
cuit power appears only if a transition is performed
-
7/29/2019 10.1.1.17.8991
5/8
and hence is only depending on the transition density
and a constant factor (9).
(8)
(9)with (10)
Static power is considered to be insignificant in
CMOS technology [13], and is therefore neglected.
For the resulting power model (11) short-circuit
power is shown to be typically between 10 30%
of total power dissipation and switching power be-
tween 70 90% of total power [13]. Hence values
for factors and can be received by simulation of
networks performing various data transfers. Using the
presented power model, energy optimal routing can befound by simulation.
(11)
4.2. Analysis
To find an appropriate energy efficient network,
we examined four classes of network topologies with
eight nodes. One class is the bus topology without
extra skip (as the M3-DSPs ICU), with extra skip of
length two, and with extra skip of both length two and
four. Another class is the ring topology also without or
with extra skips. Together with the hypercube topol-
ogy these networks are all flexible stage networks, be-
cause they may use any given number of stages to per-
form a data transfer. Also the omega and butterfly
topologies were simulated. These are fixed stage net-
works: all data transfers use the three stages of the net-
work. Because the maximum number of used stages
( ) may increase the critical path of the overall
design, we define an upper limit for this number re-
garding flexible stage networks. For the bus network
without extra skip we allow seven stages, because oth-
erwise a transfer from first to last node wouldnt be
possible in one step. In the ring network without ex-
tra skip we restrict to four for the same reason.
In the other bus and ring networks we reduce by
one for every extra skip. In the hypercube we restrict
to three for the same reason as for bus and ring.
Bus and ring topologies are shown in figure 2, for the
other networks we refer to [18]. To analyze the influ-
ence of layout mapping, different layouts are consid-
ered, shown in figure 3.
ring with one extra skip (bus, skip 1,2)
ring with two extra skips (ring, skip 1,2,4)
bus with two extra skips (bus, skip 1,2,4)
ring without extra skip (bus, skip 1)
bus without extra skip (bus, skip 1)
bus with one extra skip (bus, skip 1,2)
Figure 2. Bus and ring networks
double linear layout
circular layout
linear layout
Figure 3. Different layouts
We simulated all data transfers (DT) needed for
our target algorithms. In addition to the broadcast
and zurich zip data transfer needed for FIR-filtering
we took the rotation DT (e.g.
, which means the value of node
0 is transfered to node 2 etc.). Another data trans-
fer we considered, was the swap DT, which exchanges
-
7/29/2019 10.1.1.17.8991
6/8
the values of node pairs as in
. We also took data transfers into
account used by Cooley-Tuckey-FFT and Singleton-
FFT/VITERBI algorithms including bit-reversal. Here
we refer to [12].
Considering group memory, data may not bealigned properly having an offset as illustrated in fig-
ure 4. We therefore simulated all described data trans-
fers with all possible offsets in a next step. To find
out if specific target data transfers have a significant
impact on the performance of networks, random data
transfers are analyzed in a third step.
slice no. 0 1 2 3 4 5 6 7
aligned data
non-aligned data X X X
(offset=3)
Figure 4. aligned and non-aligned data
5. Results
After simulating all combinations of different
topologies, layouts and data transfers figure 5 shows
the average power consumption over all data trans-
fers without offset for each particular topology in dif-
ferent layouts. The general result is independent of
the layout: flexible stage networks outperform fixed
stage networks regarding power dissipation. Butterfly
or omega networks consume 110%330% more power
than the other networks. Regarding flexible stage net-
works only, skip networks in ring or bus structure con-
sume 15%35% less energy than hypercube networks.
In bus and ring energy consumption can be reduced
with each additional skip. Comparing bus and ring net-
works with equal number of extra skips ring networks
outperform bus networks: power dissipation is 35%
smaller at most.
Concentrating on layout issues networks in linear
layout under perform networks in double linear layout,
which in turn under perform networks in circular lay-
out. But for bus topologies the differences are signif-
icantly smaller than for the other networks. Ring and
hypercube topologies dont profit from circular layout
instead of double linear layout either.
Figure 6 shows the energy consumption for the par-
ticular data transfers without offset for all network
bus, skip 1bus, skip 1,2
bus, skip 1,2,4ring, skip 1
ring, skip 1,2ring, skip 1,2,4
hypercubebutterfly
omega0
1
2
3
4
5
6lineardouble linearcircular
Figure 5. Average energy consumption for dif-
ferent layouts
topologies in double linear layout. The energy dissi-pations differs a lot comparing the results for differ-
ent data transfers performed in the same network. In
general broadcast data transfers need the least energy
and the swap data transfer the most energy for all net-
works. The difference is 65%240%. However, clas-
sification in energy expensive and non-expensive data
transfers is not always possible. The zurich zip data
transfer for instance, performs good on ring networks
and bad on hypercube networks: in ring networks 20%
more energy is consumed comparing broadcast with
zurich zip data transfers, in hypercube networks thedifference is 70%. Comparing all networks and data
transfers energy consumption, rotation transfers per-
form significantly good on ring networks and Cooley-
Tuckey-FFT transfers perform best on hypercube net-
works.
Considering all data transfers with offset, the av-
erage energy consumption is significantly higher for
all networks than without offset, as depicted in fig-
ure 7. This effect is most notably for all ring networks
and for bus networks with extra skip. Here, the en-
ergy consumption is 30% higher than without offset,
whereas for the other networks the difference is less
than 10%. This means, considering offsets equalizes
the energy performance of the networks: hypercubes
perform better than bus networks and the advantage of
ring networks decreases considering offsets. For ran-
dom transfers, energy consumption increases by an-
other 10-20% compared with the data transfers with
offset.
-
7/29/2019 10.1.1.17.8991
7/8
BroadcastRotation
Zurich ZipSwap
Cooley-Tockey-FFTSingleton-FFT
0
1
2
3
4
5
6
bus, skip 1bus, skip 1,2bus, skip 1,2,4ring, skip 1ring, skip 1,2ring, skip 1,2,4hypercube
butterflyomega
Figure 6. Energy consumption of data trans-
fer classes in double linear layout
bus, skip 1bus, skip 1,2
bus, skip 1,2,4ring, skip 1
ring, skip 1,2ring, skip 1,2,4
hypercubebutterfly
omega0
1
2
3
4
5
6no offsetwith offsetrandom data transfer
Figure 7. Energy consumption of data trans-
fer classes with additional offset and random
data transfers in double linear layout
Network efficiency is not only determined by en-
ergy consumption, but also by cycles needed to per-
form a data transfer. Figure 8 shows the average cycle
count for all data transfers in the different networks in
double linear layout. Here fixed stages networks out-
perform most of the other networks. In bus and ring
networks every additional skip reduces the average cy-
cle count significantly. Bus networks with two extra
skips and ring networks with at least one extra skip
outperform hypercubes. Ring networks with two extra
skips perform even better than fixed stage networks.
Considering data transfers with offset and random data
transfers the results remain similar.
bus, skip 1bus, skip 1,2
bus, skip 1,2,4ring, skip 1
ring, skip 1,2ring, skip 1,2,4
hypercubebutterfly
omega1
1,2
1,4
1,6
Figure 8. Average cycle count
As far as energy consumption is concerned, bus or
even better ring networks outperform all others the
more extra skips the better. Also for cycle count con-
siderations ring networks with extra skips show verygood performance. But implementation and routing
may cause problems with these bus or ring networks
with extra skips.
6. Conclusions
In chapter 2 of this paper we set up an instruction-
level power model. The model complexity was re-
duced by generating separate lookup tables for energy
consumption independent instruction groups. These
groups were found, cutting the very long instruction
word into its functional instruction words.
According to the generated power model of the M3-
DSP, memory accesses are one of the most power con-
suming instructions, whereas data transfers dissipate
significant small power. In chapter 3 it was shown,
that memory accesses can be reduced in SIMD parallel
architectures. The reduction is depending on the num-
ber of parallel data paths. It is realized by performing
group memory accesses instead of element accesses,
and by exchanging memory accesses in general with
data transfers.
To achieve energy reduction by exchanging mem-
ory accesses with data transfers, an interconnection
network is needed that realizes all requested trans-
fers and consume little power. By analyzing vari-
ous possible interconnection networks based on multi-
plexer chains in chapter 4, we found that flexible stage
networks have significant less power dissipation than
fixed stage networks. Different layouts dont change
-
7/29/2019 10.1.1.17.8991
8/8
the ranking of the different topologies. But for differ-
ent data transfers the ranking does change. Therefore,
we suggest application specific construction of the in-
terconnection network. Considering non-aligned data
in group memory, the topology specific differences in
power dissipation are reduced. Hence, a separate pre-aligning process might be taken into account.
The first choice network considering energy con-
sumption as well as cycle count and the number of
maximum used stages is the ring topology with extra
skips. The weak points are implementation and rout-
ing aspects, which have to be further analyzed. Good
alternatives are bus networks with extra skips which
reduce implementation problems but also have diffi-
cult routing, or the hypercube topology with easy rout-
ing but slightly less performance in energy consump-
tion.
References
[1] J. Beraud and C. Galand. Digital Signal Processor Ar-
chitecture With Plural Multiply/Accumulate Devices.
U.S. Patent no. 5 175 702, IBM Corp., December 29
1992.
[2] W. Dally. VLSI and Parallel Computation, chapter 3.
Morgan Kaufmann Publishers, Inc. San Mateo, Cali-
fornia, 1990.
[3] G. P. Fettweis, M. Weiss, W. Drescher, U. Walther,
F. Engel, S. Kobayashi, and T. Richter. Breaking newgrounds over 3000 mops: A broadband mobile mul-
timedia modem dsp. In ICSPAT98. Toronto, 13.-16.
September 1998.
[4] J. L. Hennessy and D. A. Patterson. Computer Archi-
tecture a Quantitative Approach. Morgan Kaufmann
Publishers, Inc., second edition, 1996.
[5] R. Ho, K. W. Mai, and M. A. Horowitz. The Future of
Wires. In Proceedings of the IEEE, volume 89, April
2001.
[6] Y. I. Ismail and E. G. Friedman. Dynamic and
Short-Circuit Power of CMOS Gates driving Lossless
Transmission Lines. IEEE Transactions on Circuits
and Systems, 46(8), August 1999.
[7] B. Klass, D. Thomas, H. Schmit, and D. Nagle. Mod-
eling Inter-Instruction Energy Effects in a Digital Sig-
nal Processor. Power-Driven Microarchitecture Work-
shop, International Symposium on Computer Archi-
tecture, June 1998.
[8] M. Lee, V. Tiwari, S. Malik, and M. Fujita. Power
analysis and minimization techniques for embedded
DSP software. IEEE Trans. VLSI Systems, 5(1):123
135, March 1997.
[9] M. Lorenz, T. Drager, R. Leupers, P. Marwedel, and
G. Fettweis. Low-Energy DSP Code Generation Us-
ing a Generic Algorithm. In Proceedings of Interna-
tional Conference on Computer Design (ICCD) 2001,
Austin, Texas, pages 431437, September 2001.
[10] M. Lorenz and P. Marwedel. Energiebewusste Com-
pilierung fur Digitale Signalprozessoren. In this con-ference book.
[11] F. N. Najm. Transition Density: A New Measure of
Activity in Digital Circuits. IEEE Transactions on
Computer Aided Design of Intergrated Circuits and
Systems, pages 310323, February 1993.
[12] A. V. Oppenheim and R. W. Schafer. Discrete-Time
Signal Processing. PrenticeHall, 1989.
[13] M. Pedram. Power Minimization in IC Design: Prin-
ciples and Applications. ACM Transactions on De-
sign Automation of Electronic Systems, 1(1):356,
January 1996.
[14] D. Rabe and W. Nebel. Short circuit power consump-tion of glitches. In Proceedings of the 1996 inter-
national symposium on Low power electronics and
design, pages 125128, Monterey, CA USA, August
1996.
[15] T. Richter, W. Drescher, F. Engel, S. Kobayashi,
V. Nikolajevic, M. Weiss, and G. Fettweis. A
platform-based highly parallel digital signal proces-
sor. In Proc. CICC 2001, pages 305308, San Diego,
USA, 6.-9. May 2001.
[16] S. Steinke, M. Knauer, L. Wehmeyer, and P. Mar-
wedel. An Accurate and Fine Grain Instruction-Level
Energy Model Supporting Optimizations. Interna-
tional Workshop - Power and Timing Modeling, Opti-mization and Simulation, Yverdon-les-bains, Switzer-
land, September 2001.
[17] V. Tiwari, S. Malik, and A. Wolfe. Power analy-
sis of embedded software: a first step towards soft-
ware power minimization. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 2(4):437
445, December 1994.
[18] A. Varma and C. Raghavendra. Interconnection net-
works for multiprocessor and multicomputers: theory
and practice: manuscript for IEEE tutorial text. IEEE
Computer Society Press, 1994.
[19] G. Yeap and F. Najm, editors. Low Power VLSI De-sign and Technology. World Scientific Publishing
Co., 1996.