MPSoC Platform Design and Simulation for Power
Performance Estimation
Zhengjie Lu
Eindhoven University of Technology, Electrical Engineering Department
Den Dolech 2, Postbus 513, 5600 MB, Eindhoven, The Netherlands [email protected]
Abstract—Wireless sensor nodes (WSN) play an important role
in future intelligent applications, such as remote medical
examination and unattended industrial-field monitoring. Since such a
device is usually powered by a battery, the design trade-off between
performance and power consumption is critical to achieving
maximum battery life. A configurable platform with performance
and high-level power simulation is essential for determining
feasible designs in the early design stage. In this paper, we present
such an MPSoC platform and its dynamic power model. A work
flow that enables design automation with the help of CHESS and
CHECKER (both commercial compilers from TargetCompiler N.V.)
is provided. An ECG beat-detection application is implemented on
our platform as a case study. The simulation results predict that the
P2P communication architecture with software pipelining is optimal
on an ECG MPSoC platform in both the power-constrained scenario
and the time-constrained scenario.
Index terms—MPSoC platform, dynamic power model, work
flow, low power application, high-level power simulation.
1. INTRODUCTION
Wireless sensors are normally constrained by a limited
power budget, which makes a low-power design essential [1].
A favoured approach is to reuse the dedicated hardware
which is optimized for the low power application and
distributed as the Intellectual Property (IP). A full-system
power simulation is then necessary to determine the most
energy efficient configuration.
Power estimation at a low abstraction level is widely accepted
in industry and offers promising accuracy. Unfortunately, its
simulation speed can decrease dramatically as the system
becomes complex (e.g. a triple-core processor), which is not
efficient in the early design stage. In fact, a rough estimate of
the power consumption is good enough in the early design
stage to compare different configurations. For this reason, a
higher-level power simulation without high accuracy is
acceptable as the starting point of the system design.
Instruction-set simulation (ISS) is a kind of high-level
performance simulation which provides cycle-accurate
profiles of a single processor at an adequately fast simulation
speed. Both the number of active cycles and the energy
consumption of the function units are collected during the
simulation. To support multi-core simulation, ISS's are
embedded within a co-simulation environment. The energy
of a full system is then the sum of all ISS active energy plus
the inter-ISS communication energy. However, different
ISS's expose different I/O interfaces to the rest of the system,
and altering the networked architecture requires reworking
the communication interfaces.
The ULP-DSP (Ultra-Low Power Digital Signal
Processing) group at imec Netherlands [24] is developing
ultra-low-power WSNs for health care and industrial
monitoring. The targeted multi-core system employs
application-specific instruction-set processors (ASIPs). The
ASIP hardware is developed with the TargetCompiler
Designer Tool [2], which is appropriate for low power
applications. The software running on such an ASIP is
compiled by the CHESS compiler [2]. In addition, an ISS for
each dedicated ASIP is generated by CHECKER [2], which
produces profiles of the software execution and hardware
usage.
Three aspects of the TargetCompiler work flow can be
improved. First of all, hardware/software co-design for
multi-core systems is inadequately supported: an ASIP and
its accompanying software are designed from a single-core
point of view, and questions such as how the ASIPs should
be interconnected and what the optimal networked
architecture is are not well supported by the current work
flow [2]. Secondly, a convenient programming model for
multi-core applications is not provided in the TargetCompiler
work flow. Thirdly, power simulation is difficult due to the
lack of power models.
In this paper, a flexible multi-processor system-on-chip
(MPSoC) platform with high-level power models is set up,
improving the TargetCompiler work flow. It enables design
space exploration with both performance and power
estimation.
This paper is organized as follows. Section 2 introduces
the related work and our contribution. An MPSoC platform
and its power model are described in section 3. The power
models are intensively studied in section 4. A detailed
explanation of our proposed work flow is given in section 5.
Section 6 shows a case study of ECG application on our
platform. Experiments are reported in section 7. Conclusions
and future work are given in section 8.
2. RELATED WORK AND CONTRIBUTION
Many MPSoC platforms have already been proposed in
academia. In [3] and [4], SimpleScalar [5] is
embedded into the systemC-based framework for co-
simulation. A general ISS-wrapper interface is introduced in
[6], which extends its use to not only SimpleScalar but also
other ISS’s. A full-system platform is also proposed in [7] as
an extension of M5 [8]. To speed up the simulation, mixed-
level simulation is introduced in [9]: the intra-core
computation is simulated by ISS on the cycle-accurate level
and the inter-core communication is simulated by OSCI
TLM-2 [10] on the transaction-accurate level. Application-specific
MPSoC platforms are also presented in [11] and
[12]. A composable and predictable MPSoC platform
template for streaming application is proposed in [11]. A
work flow for MPSoC platform automatic synthesis is
developed in [12]. None of them supports power simulation.
To bridge this gap, [13] and [14] integrate Wattch [15] in
their platform to estimate the dynamic power consumption
of SimpleScalar cores. Orion2 [16] combines the core power
model of Wattch with a router power model, aiming at
network-on-chip architectures. Besides, a universal power
simulator McPAT is proposed in [17], which takes the
output of a cycle-by-cycle performance simulator as its input.
Unfortunately, an additional parser is needed to integrate
McPAT with an MPSoC platform. Moreover, a
TargetCompiler ISS cannot be inserted directly into the
MPSoC platforms above.
This paper is devoted to meeting these challenges. Our
work makes three contributions:
1) A general MPSoC platform with common memory
interfaces to an ISS (e.g. TargetCompiler ISS).
2) A high-level dynamic power model for our proposed
platform, aiming at predicting the dynamic power
trend in the early design stage.
3) An investigation of the impact of the communication
architecture and the program coding on the dynamic
power consumption of an ECG application.
3. MPSOC PLATFORM
A parameterized MPSoC platform not only benefits the
module reuse, but also enables the design space exploration
(DSE). Three subsystems are included, as shown in Fig. 1:
(1) the IP core subsystem (e.g. IP core 1), (2) the I/O tile
subsystem (e.g. I/O tile 1), and (3) the OCCN subsystem
(e.g. the OCCN bus and P2P links). The IP core subsystem is
a systemC-wrapped ISS (see Fig. 2), while the I/O tile and
OCCN subsystems together implement the inter-core
communication. In the following, we explain the architecture
in section 3.1 and introduce the programming model on this
platform in section 3.2.
3.1. Architecture
In Fig. 1, an IP core is connected to an I/O tile subsystem
through its systemC wrapper interface and as a result it can
access either the program memory (PM) or the data memory
(DM) within the I/O tile. Each I/O tile can exchange data
through bus or peer-to-peer-link (P2P) using network
interfaces (NI). A shared memory module can be derived
from the I/O tile if only the DM, the arbiter and the slave NI
are presented in the I/O tile, as shown in Fig. 1. Each
component in Fig. 1 is explained in the following sections.
Fig. 1. Architecture overview: IP cores connect through core-address-mapping modules to their I/O tiles (PM, DM, arbiter, master/slave NI), which communicate over the OCCN bus and OCCN P2P links; a shared-memory tile contains only a DM, an arbiter and a slave NI.

Fig. 2. IP core subsystem: the ISS in its systemC wrapper, with separate address, data and control buses towards the program memory and the data memory.

Fig. 3. Multi-word access through NI: 32-bit data are transferred as two 16-bit words between the master NI (M_NI) and the slave NI (S_NI), using REQ/ACK handshakes.
3.1.1. IP core: In our situation, an IP core is presented as a
CHECKER generated ISS. The ISS is supposed to load an
executable which is compiled by the CHESS compiler, and
to perform a cycle-accurate simulation. As we mentioned
above, it needs to be wrapped as a systemC class before
being integrated into the platform. CHECKER can also
perform this job, which results in the memory-mapped I/O
(MMIO) interface shown in Fig. 2. Notice that the address
bus to the data memory is isolated from that of the program
memory; the same holds for the data bus and the control bus.
All bus widths are determined by the IP core's
specifications.
3.1.2. Core address mapping: A core-address-mapping
module provides a de-multiplexer with a single input port
and two identical output ports, bridging the core and I/O tile.
Both input port and output port are configured based on the
system specifications so that they can adapt to different IP
core’s I/O width. A core address mapping decides whether
the IP request goes to the local DM or external DMs. A DM
is local to an IP core when it resides inside the core’s I/O tile.
The remainder of DMs are declared as the external DMs
from this core’s point of view. In our case, the MMIO
address space of a IP core is divided into two continuous
sections. The core address mapping module just maps the
first section to the local DM, and the rest to the external
DMs. Details about the local/external address will be
discussed in section 4.1.
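To make the mapping concrete, a minimal C++ sketch of the de-multiplexer decision is shown below. The section boundary constant and the names are illustrative assumptions; the real boundary follows from the MPSoC configuration, not from this sketch.

    #include <cstdint>

    // Hypothetical boundary between the two contiguous MMIO sections:
    // addresses below it map to the local DM, the rest to external DMs.
    constexpr uint32_t LOCAL_SECTION_END = 0x8000;

    enum class Target { LocalDM, ExternalDM };

    // De-multiplexer with a single input (the core address) and two
    // identical output ports (local DM side, external DM side).
    Target route(uint32_t core_addr) {
        return (core_addr < LOCAL_SECTION_END) ? Target::LocalDM
                                               : Target::ExternalDM;
    }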
3.1.3. Arbiter: Simultaneous requests from different IP
cores to the same memory must be supported on our
platform. A multi-ported memory seems to be an easy
solution, but it can result in the high power consumption as
well as the large silicon area [18]. For this reason, a single-
ported memory is preferred in the practical low power
designs. As a consequence, arbitration is necessary to
sequence the concurrent requests to the single-ported
memory. However, it may increase the memory access time
from the IP core's point of view, i.e. one additional cycle is
needed for arbitration; it may become even worse when an
external request competes for the local memory at the same
time. Round-robin scheduling is currently implemented
within the arbiter, which guarantees that an IP core can
access its local DM within at most 4 cycles (i.e. 2 cycles
waiting for the previous request to complete and 2 more
cycles to process this access) from the IP core's point of
view. No arbitration is needed for accessing the PM, because
we assume no re-configuration (reloading of the PM) at run
time; we also assume the PM is not an instruction cache.
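As an illustration of the scheduling policy only (not the platform's actual implementation), a round-robin arbiter can be sketched in C++ as follows; the request encoding is our own assumption.

    #include <cstddef>
    #include <optional>
    #include <vector>

    // Round-robin arbiter: grants one pending request per arbitration,
    // resuming the search just after the port granted last time.
    class RoundRobinArbiter {
        std::size_t last_ = 0;
    public:
        // pending[i] is true when port i requests the memory this cycle.
        std::optional<std::size_t> grant(const std::vector<bool>& pending) {
            const std::size_t n = pending.size();
            for (std::size_t k = 1; k <= n; ++k) {
                const std::size_t i = (last_ + k) % n;
                if (pending[i]) { last_ = i; return i; }
            }
            return std::nullopt;  // no request this cycle
        }
    };

With two requesters (the local core and one external master), the worst case matches the 4-cycle bound above: 2 cycles while the previously granted access completes, plus 2 cycles for the granted access itself.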
3.1.4. Memory: The DM module provides universal
interfaces to adapt to different accessing modes: word, byte
and multi-word access. The IP core determines which modes
are supported in its local memory and what the word/vector
size is. Any access to the DM, in any mode, is accomplished
within 2 cycles from the memory's point of view; the same
holds for access to the PM module.
3.1.5. OCCN bus/P2P: The OCCN network library [19],
which is developed in the systemC community, is employed
in our platform to simulate the inter-core communication. It
provides two types of communication on the transaction-
accurate level: bus and P2P. The advantages of the
transaction accurate level simulation are the fast simulation
speed and the high level of abstraction. A read transaction
takes 2 clock cycles (one for the request, the other for the
acknowledgement), while a write transaction costs only one
cycle. Arbitration is only necessary on the bus, and it does
not cost any clock cycles in the transaction-accurate level
simulation. However, we do add one extra clock cycle for
arbitration for the sake of a more accurate high-level power
simulation.
3.1.6. NI: The basic transmission unit in the OCCN network
library is called the "Protocol Data Unit" (PDU), which
consists of a control header and a data body.
Communication occurs between a pair of master port and
slave port inside different NI’s. Two types of NI are
available in our architecture: the master NI as the traffic
initiator and the slave NI as the traffic target. A slave NI is
accessed by a master NI using network addresses (see
section 5.1 for details). A master NI has two data
connections to reach its neighbour slave NI on the right, as
shown in Fig. 1: either through a bus or a direct P2P link.
The routing decision is based on where the data is kept, i.e.
in the streaming addresses (mapped to the P2P link) or the
non-streaming ones (mapped to the bus). This strategy not
only speeds up regular burst transfers between two cores, but
also provides flexible communication. Another
function of an NI is to synchronize the transaction-level
simulation in the OCCN network and the cycle-accurate
level simulation in the I/O tiles. Also it supports the data
format transformation between PDU and the signals within
the I/O tiles. Both the master NI and the slave NI must be
capable of handling communication between I/O tiles
with different data bus widths. In our case, data larger than
the network data width (i.e. 16-bit) are broken into multiple
16-bit words by the master NI and then sent to the slave NI
(shown in Fig. 3) in sequence. The slave NI receives the data
segments and assembles them into the complete data.
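The splitting and reassembly can be illustrated with a small C++ sketch; the high-word-first order is an assumption, since Fig. 3 only fixes the 16-bit segment size.

    #include <cstdint>
    #include <vector>

    // Master NI side: break a 32-bit value into 16-bit network words.
    std::vector<uint16_t> split32(uint32_t data) {
        return { static_cast<uint16_t>(data >> 16),       // high half (assumed first)
                 static_cast<uint16_t>(data & 0xFFFFu) }; // low half
    }

    // Slave NI side: reassemble the received segments into the full word.
    uint32_t assemble32(uint16_t hi, uint16_t lo) {
        return (static_cast<uint32_t>(hi) << 16) | lo;
    }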
3.2. Multi-process programming model
IP-based design reuses not only the IP hardware but also
the dedicated software that is optimized for it. The software
programs running on different IP cores are treated as
individual processes, and a universal problem is how to
enable communication among different processes. In our
case, the data in a process is divided into two groups: the
private data and the shared data. The communication
between the different processes can only take place with the
shared data. How to address the shared data and how to
synchronize the accesses to the shared data are the topics in
this section.

Fig. 4. Address hierarchy: the core address splits into the local address (subdivided into private and shared addresses) and the external address (subdivided into stream and non-stream addresses).
3.2.1. Process communication: The starting point is the
address mapping. Two types of addresses are defined from
the IP core’s point of view as shown in Fig. 4: (1) the local
address and (2) the external address. The first one is mapped
to the IP core’s local DM, while the latter is mapped to the
external DMs. The local address space can be divided further
as: (1) the private address which is only addressable to its
local core, and (2) the shared address which can be accessed
by all cores. Hence, the shared data must physically reside in
the shared address. A core can access the shared data, which
is not located in its local memory, through its external
address. A two-step address translation is employed here.
First of all, the master NI translates the IP core’s external
address into the network address so that the target slave NI
can be found in the network. In the second step, the target
slave NI translates the network address to the shared
memory address and puts it on the address bus in the I/O tile.
By doing this, the shared data can be transferred among
different I/O tiles. As we mentioned before, there exist two
data connections from a master NI to its neighbour slave NI.
In our case, a number of the network addresses available to
the master NI are designated as stream addresses: data
located at these stream addresses are exchanged through the
P2P connection, while the non-stream addresses are mapped
to the address space of the bus communication.
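A minimal C++ sketch of the two-step translation is given below, with illustrative base addresses (the real values follow from the global address map of section 6.3, not from this sketch).

    #include <cstdint>

    constexpr uint32_t EXTERNAL_BASE   = 0x8000; // start of the external section (assumed)
    constexpr uint32_t TARGET_NET_BASE = 0x2000; // network base of the target slave NI (assumed)

    // Step 1, master NI: external core address -> global network address,
    // which identifies the target slave NI in the network.
    uint32_t to_network_address(uint32_t external_addr) {
        return external_addr - EXTERNAL_BASE;
    }

    // Step 2, slave NI: network address -> shared-memory address, which is
    // put on the address bus of the target I/O tile.
    uint32_t to_memory_address(uint32_t network_addr) {
        return network_addr - TARGET_NET_BASE;
    }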
    void proc() begin
      // Wait for start semaphore
      P(start_semph); V(start_semph);
      // Start intra-process processing
      ...
      // Wait for unlocking
      P(busy_semph); V(busy_semph);
      // Start inter-process communication
      ...
    end

Fig. 5. Pseudo code for the P/V synchronization example
3.2.2. Process synchronization: Before transferring the
shared data, synchronization between processes is necessary.
Two types of software synchronization are employed on our
platform. The first one is based on the P/V primitives [20].
Two semaphores are defined as:
Start semaphore: it indicates whether a core is started.
Busy semaphore: it indicates whether a core is ready
for transferring shared data.
Fig. 5 shows the pseudo codes of the P/V synchronization. A
core can only start processing after it is informed by the start
semaphore. After processing is complete, it is blocked until
the busy semaphore is released. Then it will write the
processing results to the destination memory. The second
type employs FIFO-based synchronization [21]. The core
acting as the traffic initiator can only write data to a FIFO
when it is "not full", while the core acting as the traffic
target can only read data from a FIFO when it is "not empty".
This type of synchronization opens the opportunity for data
pipelining, but it has a significant side-effect: both the read
and write pointers have to be loaded every time before the
FIFO can be accessed.
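The FIFO discipline can be sketched as a circular buffer in shared memory (cf. [21]); the depth and field layout below are assumptions. Note how both pointers must be inspected before each access, which is exactly the overhead mentioned above.

    #include <cstdint>

    // FIFO-based synchronization over shared memory: the initiator may
    // only write when "not full"; the target may only read when "not empty".
    // In the real system the pointers and buffer reside in shared memory.
    struct SharedFifo {
        static constexpr uint32_t N = 64;  // illustrative depth
        uint32_t rd = 0, wr = 0;           // read/write pointers
        uint16_t buf[N];

        bool push(uint16_t v) {            // traffic initiator
            if (wr - rd == N) return false;  // full: caller must retry
            buf[wr % N] = v; wr = wr + 1; return true;
        }
        bool pop(uint16_t& v) {            // traffic target
            if (wr == rd) return false;      // empty: caller must retry
            v = buf[rd % N]; rd = rd + 1; return true;
        }
    };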
4. DYNAMIC POWER MODEL
On the architecture level, an electronic system’s dynamic
energy can be defined as the sum of the dynamic energy
consumed by all its components. The power dissipation on
the wires is assumed zero. From this, we can derive the
general expression for the system dynamic power:

$$P_{sys,dynm} = \frac{E_{sys,dynm}}{t_{app,dynm}}$$

where $P_{sys,dynm}$ and $E_{sys,dynm}$ are the system's average
dynamic power and total dynamic energy, respectively, and
$t_{app,dynm}$ is the total time needed to complete the
application (e.g. the duty time). This power model is at a
high abstraction level and is not necessarily accurate; e.g. the
leakage power dissipated during idle time is not included.
However, it can still be used to predict the power trends of
different architectures, which is sufficient in the early design
stage.
Our architecture consists of the cores, the I/O tiles and the
networks. So its dynamic energy is the sum of the dynamic
energy contribution of all three types:
$$E_{sys,dynm} = \sum_{i=1}^{N_{core}} E_{core,dynm}(i) + \sum_{j=1}^{N_{io}} E_{io,dynm}(j) + \sum_{k=1}^{N_{network}} E_{network,dynm}(k)$$

in which $N_{core}$ is the total number of IP cores in the system
and $E_{core,dynm}(i)$ denotes the total dynamic energy of the
$i$-th IP core; similar meanings hold for the other terms. It
should be pointed out that $N_{io}$ might be larger than
$N_{core}$ when a standalone shared memory is present on the
bus. Each term in the equation above is explained in the next
sections.
4.1. IP core dynamic energy model
An IP core’s dynamic power model on high abstract level
is usually a constant, which is the mean dynamic power
number taken from either the layout back-annotated power
simulation or the silicon measurements. In this case, its
dynamic energy is defined as:
, , ,core dynm core dynm core dynmE P t (1)
where ,core dynmP is the core’s dynamic power number and
,core dynmt is its active time.
Due to the software synchronization, an IP core can be
active even when it is not processing data; this is caused by
the absence of DMA in our current system. The core's active
time is divided into three phases: the synchronization phase,
the memory-transfer phase and the computation phase.
Equation (1) then becomes:

$$E_{core,dynm} = P_{core,dynm} \cdot (t_{core,sync} + t_{core,mem} + t_{core,comp}) \quad (2)$$
This three-phase classification can help us identify whether
the energy bottleneck is in the computation or not. It can also
indicate whether additional hardware components, like a
DMA or an interrupt controller, are needed in the architecture.
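Equation (2) translates directly into the simple energy bookkeeping sketched below; the struct layout is our own illustration, with the phase times taken from the simulation log.

    // Dynamic energy of one IP core following equation (2):
    // E = P_dynm * (t_sync + t_mem + t_comp).
    struct CorePhases { double t_sync, t_mem, t_comp; };  // seconds

    double core_dynamic_energy(double p_dynm_watt, const CorePhases& t) {
        return p_dynm_watt * (t.t_sync + t.t_mem + t.t_comp);  // joules
    }

For example, a hypothetical core with a 5 mW dynamic power number that is active for 1 ms in total consumes 5 uJ.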
4.2. I/O tile dynamic energy model
Our proposed I/O tile consists of the following modules:
program memory (PM), data memory (DM), arbiter (AR),
de-multiplexer (DMUX) and master/slave network interface
(m_NI/s_NI). A general dynamic energy model can be
expressed as:

$$\begin{aligned} E_{io,dynm} = {} & E_{pm,dynm} + E_{dm,dynm} \\ & + E_{ar,dynm} + E_{dmux,dynm} \\ & + E_{m\_ni,dynm} + E_{s\_ni,dynm} \end{aligned} \quad (3)$$
The first row in equation (3) addresses the dynamic
energy consumed by memories. The second row in equation
(3) denotes the dynamic energy used by the arbiter and the
de-multiplexer. The arbiter and the de-multiplexer can be
designed with only a few gates and wires in our case because
of their simple function and limited number of ports on our
platform. For that reason, their dynamic energy
consumptions are assumed zero.
In equation (3), the dynamic energy contributed by NI is
special. A network interface functions as the transceiver in
the network and as a multi-word splitter/assembler in the I/O
tile. The dynamic energy consumption, due to the first
function, is represented in the network energy model in
section 4.3, so we can ignore it here in equation (3). The
second function of NI just increases the number of memory
accesses, and its dynamic energy contribution is not as
significant as the one contributed by the memories.
Considering the above, the dynamic energy contributed by
the NI is also neglected, and equation (3) can be simplified to:
$$E_{io,dynm} = E_{pm,dynm} + E_{dm,dynm} \quad (4)$$
4.3. Network dynamic energy model
An energy model for wires in a 130 nm technology process,
based on silicon measurements, is proposed in [22]:

$$E_{wire} = 0.27 \cdot l + 0.58 \;\;\mathrm{pJ} \quad (5)$$

in which $l$ is the wire length, ranging from 1 to 5 mm. A bus
can be considered as a collection of wires, so its energy can
be derived from equation (5) [22]:

$$E_{bus} = W_{bus} \cdot \alpha_{bus} \cdot (N-1) \cdot E_{wire} \;\;\mathrm{pJ/transaction} \quad (6)$$

where $N$ is the number of nodes connected to the bus,
$W_{bus}$ is the number of wires in the bus, and $\alpha_{bus}$ is the
activity factor of the bus, ranging from 0 to 1. The dynamic
energy contribution of the bus arbiter is ignored here, as in
[22]. In a similar way, the P2P energy model can be
described as:

$$E_{p2p} = W_{p2p} \cdot \alpha_{p2p} \cdot E_{wire} \;\;\mathrm{pJ/transaction} \quad (7)$$

where $W_{p2p}$ is the number of wires per link and $\alpha_{p2p}$ is
the link's activity factor. For the same activity factor and
number of wires, a P2P link is more energy efficient than a
bus.
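Under the linear reading of equation (5) above, the three models can be coded as small helpers; the last function applies the 90 nm scaling factor of 1.44 that section 6.2 later takes from [26].

    // Network dynamic energy per transaction, equations (5)-(7),
    // in pJ, for the 130 nm wire model of [22].
    double e_wire_pj(double l_mm) {                 // (5), 1 mm <= l <= 5 mm
        return 0.27 * l_mm + 0.58;
    }
    double e_bus_pj(int w_bus, double alpha, int n_nodes, double l_mm) {
        return w_bus * alpha * (n_nodes - 1) * e_wire_pj(l_mm);   // (6)
    }
    double e_p2p_pj(int w_p2p, double alpha, double l_mm) {
        return w_p2p * alpha * e_wire_pj(l_mm);                   // (7)
    }
    double scale_to_90nm(double e_pj) {             // cf. section 6.2, [26]
        return e_pj / 1.44;
    }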
5. WORK FLOW
A five-step work flow is introduced for our platform (see
Fig. 6), which covers three different aspects in system design:
(1) the IP-based architecture design, (2) the multi-core
programming, and (3) the performance evaluation. Generally
speaking, an IP core is a stand-alone ISS integrated into a
configurable MPSoC platform. The MPSoC simulator is
derived from this platform. The software is modified for the
simulator. Finally, the system is simulated and evaluated,
and the system designers may improve the design based on
the Y-chart methodology [23]. Each step in Fig. 6 is
explained below.
Step 1: platform configuration
The system specifications (e.g. clock, I/O width, power
numbers, etc.) are determined from either data sheets or the
design goals; a list is given in table 1. The type of inter-core
communication, i.e. either the bus or the peer-to-peer link, is
specified in this stage. The IP core's dynamic power number
comes from either its data sheet or power simulations. A
configuration script generates an MPSoC configuration file
from these specifications.
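As an illustration only (the actual configuration file format of the platform is not shown in this paper), the information of table 1 could be captured in C++ as follows; all field names and types are assumptions.

    // Illustrative shape of an MPSoC configuration (cf. table 1).
    enum class LinkType { Bus, P2P };

    struct CoreSpec   { int io_width_bit; int word_bit; double p_dynm_mw; };
    struct MemorySpec { int size_kb; int data_bit; int addr_bit; bool arbitrated; };
    struct CommSpec   { LinkType type; int bandwidth_bit; double e_trans_nj; };
    struct ClockSpec  { double ip_mhz, io_mhz, net_mhz; };

    struct MpsocConfig {
        CoreSpec core; MemorySpec mem; CommSpec comm; ClockSpec clk;
    };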
Fig. 6. Work flow: the system specification and a template are turned into an MPSoC configuration by the configuration script (step 1); the systemC compiler front-end produces the target architecture and the source codes are reworked (steps 2.1 and 2.2); the CHESS compiler and the systemC compiler back-end, using the CHECKER instruction-set simulators, produce the executable files and the target simulator (steps 3.1 and 3.2); hardware/software co-simulation yields performance and power numbers (step 4), which feed back into specification, code and platform changes (steps 5.1-5.3).
TABLE 1. SYSTEM SPECIFICATION LIST

Name          | Parameters
IP core       | I/O bus width, word size, address mapping, power number
Memory        | Size, data/address width, address mapping, arbitration, power number
Communication | Type, bandwidth, address mapping, power number
Clock         | IP clock, I/O clock, network clock
Step 2: hardware/software retarget
The SystemC compiler will read the configuration file
together with the MPSoC platform, resulting in the target
architecture in systemC, as shown in step 2.1. In step 2.2, the
software running on the individual IP cores has to be
modified so that it can become part of the multi-core
application, i.e. the shared data must be declared explicitly
and the inter-process synchronization must be implemented.
Step 3: hardware/software compilation
The software modified in the previous step is compiled by
CHESS into the executable files (step 3.1). ISS's, which
have already been generated and wrapped as the systemC
classes by CHECKER, are read by systemC compiler
together with the target architecture. The target simulator is
generated as the output of the systemC compiler in step 3.2.
Step 4: hardware/software co-simulation
The target simulator loads the executable files and outputs
the simulation results, i.e. the performance and the power
numbers.
Step 5: evaluation and redesign
The numbers are evaluated and may give feedback to the
earlier steps in the flow. The system specifications, such as
the communication type, can be changed in step 5.1, while
the software coding can be improved in step 5.2. In step 5.3,
dedicated hardware accelerators that are not in the IP library
can be modelled in systemC and then inserted into our
platform. By doing this, the design can be optimized.
6. LOW-POWER APPLICATION STUDY
A WSN is a typical low-power design: it features a low
duty cycle and is sensitive to power/energy constraints. In
this section, we study one of its applications in health care
and map it onto our proposed MPSoC platform. The research
results are given in
section 7.
6.1. ECG application
Traditional WSN applications transmit the raw sensor
data to the network terminal over RF communication. This
can lead to considerable energy consumption, since the RF
device is active for a long time. The ULP-DSP group at imec
Netherlands pursues a more power-efficient approach, in
which the raw sensor data are pre-processed by a DSP before
being transmitted by the RF transmitter. By doing this, the
amount of transmitted data is reduced and considerable
energy is saved. The roadmap projects a power consumption
of 20 uW on an existing signal processing platform [24].
One typical application of the research in the ULP-DSP
group is remote electrocardiogram (ECG) sampling of
human heartbeats. The sampled heartbeats are analyzed
every 3 seconds, and then the ECG signals are transmitted to
the remote terminal. Three scenarios occur in this
application.
Scenario 1: normal heartbeats are detected and a small
amount of pre-processed data (i.e. 128 bytes) needs to
be transmitted.
Scenario 2: irregular heartbeats are detected so that all
raw ECG data (i.e. 3072 bytes) have to be transmitted.
Scenario 3: the minor case in which the remote terminal
reconfigures the complete system through the wireless
network.
The second scenario interests us the most, because it
represents the worst case of the power consumption. Once its
power constraint is satisfied, the power constraints in the
other two scenarios should be met as well. We employ
synchronous data flow (SDF) [25] to gain better insight into
this scenario.
In Fig. 7, actor A (or sensor) fires 3072 tokens (i.e. 3072
bytes) on its output edge every 3 seconds. The other nodes
consume the tokens on their input edges at constant rates
(e.g. 3072 bytes for node C). Communication latency is
assumed to be zero in Fig. 7, which is the ideal condition.
The flow latency should not exceed 3 seconds in the ECG
application:

$$\frac{3072}{r_1} + \frac{3072}{r_2} + \frac{3072}{r_3} \leq 3\,\mathrm{sec}$$

in which $r_x$ denotes a node's processing rate.
Fig. 7. SDF for scenario 2: actor A (sensor) fires 3072 tokens every 3 seconds; actors B (digital signal processing), C (plain-text ciphering) and D (RF transmitting) process at rates r1, r2 and r3, with 3072 tokens on each edge.
6.2. Hardware specification
Three dedicated ASIPs are employed on our platform,
which are developed individually with TargetCompiler ASIP
design tool:
1) Biomedical processor: it processes the digital signals
from the bio-sensor.
2) Crypto processor: it ciphers the ECG data for the sake
of the privacy protection.
3) Ultra-wide-band processor: it is optimized for the
radio transceiving.
By default, these three processors can be individually
power-gated to reduce the leakage current. Their major
specifications are listed in table 2. The data memories must
hold sufficient space to keep both the private data and the
shared data (see table 3). Both the P2P and the bus channels
are supposed to be feasible in the target system, so that they
can serve different application scenarios. Because the P2P
link consists of only wires while a bus implementation is
more complex, the P2P bandwidth (i.e. 16 bits per 10 ns) is
assumed to be twice the bus bandwidth (i.e. 16 bits per
20 ns), as listed in table 4.
The target triple-core architecture can now be determined
from the specifications above, as shown in Fig. 8. It should
be pointed out that each processor block in Fig. 8 is a
combination of an IP core and its I/O tile. A standalone
shared memory (128 KB, 16-bit wide) is reserved in Fig. 8,
dedicated to the minor scenario of system reconfiguration.
The bus, the P2P links and the standalone shared memory
are disabled when they are not required.
The dynamic power numbers of ASIP and data memory
are obtained from the power simulations and the data sheets,
with the TSMC 90 nm technology process. The power
numbers of bus/P2P are derived from equation (6) and (7), in
which the activity factor is assumed to be 0.5 and the wire
length to be 1 mm. Since the wire energy model in equation
(5) was derived for a 130 nm technology process, it has to be
scaled down by a factor of 1.44 [26] for our 90 nm
technology process.
TABLE 2. ASIP SPECIFICATION

Processor  | I/O data width (bit) | Typical clock (MHz) | Dynamic power (mW) | Leakage power (uW)
Biomedical | 32                   | 100                 | 5.135              | 6.45
Crypto     | 16                   | 100                 | 7.08               | 4.71
UWB        | 16                   | 100                 | 1.2368             | 0.792

TABLE 3. DATA MEMORY SPECIFICATION

No. | Local IP core | Data width (bit) | Size (KB)
1   | Biomedical    | 32               | 32
2   | Crypto        | 16               | 16
3   | UWB           | 16               | 16

TABLE 4. NETWORK SPECIFICATION

Name | Clock (ns) | Latency (ns) | Data width (bit) | Dynamic energy (nJ/trans)
Bus  | 20         | 2            | 16               | 43.35
P2P  | 10         | 1            | 16               | 14.45
Fig. 8. Overview of the target architecture: the Biomedical, Crypto and UWB processors and a shared memory connect to the OCCN bus; OCCN P2P links carry traffic 1 (Biomedical as initiator 1, Crypto as target 1) and traffic 2 (Crypto as initiator 2, UWB as target 2).
6.3. Software specification
The application mapping determines where the shared
data is allocated. In our case, static mapping is preferred
since each ASIP is optimized for dedicated application(s)
(see section 6.2). In the rest of this paper, the application
mapping remains unchanged.
After mapping the applications on the ASIPs, the network
address (see section 3.2) must be determined for the shared
addresses. A global address map starting from zero is
applied, which maps the network addresses to the data
memories in ascending order: the standalone shared memory
on the bus occupies the first section of the global addressing
space, followed by the Biomedical processor's local memory,
while the Crypto and the UWB processors occupy the third
and the fourth segments, respectively. The total size of the
global addressing space equals the number of available
shared addresses. The ASIPs can now access the shared
addresses in other memories through the corresponding
network addresses.
The physical locations of the shared data/synchronizers
must be determined (e.g. in which memory they reside), too.
In our case, the shared data and their synchronizers are
physically stored in the shared address of the traffic target,
and they can be accessed by the traffic initiator through its
external address. The Biomedical processor is the initiator
and the Crypto processor is the target in the traffic between
them, while the Crypto processor is the initiator and the
UWB processor is the target in the traffic between them (see
Fig. 8).
By default, these three ASIPs are operated in an ordered
way as shown in Fig. 9. The duty time of the sequential
program is the sum of the execution time in three ASIPs,
which must not exceed the data analysis period (3 seconds).
Fig. 9. Sequential program: the Biomedical, Crypto and UWB processors are active one after another along the time axis; the duty time spans all three active phases.
6.4. Power model for ECG application
In Section 4, a dynamic power model for MPSoC
platform has been defined. However, it is too limited for our
ECG application: since the future tape-out will employ the
90 nm technology process, the impact of leakage current
must be taken into account in the 90 nm technology.
Leakage power can be as significant as the dynamic one.
Due to this reason, an extended power model for ECG
application is proposed as:
, ,
0
N
ECG dynm i leak i
iECG
E P t
PT
(8)
in which ,ECG dynmE and ,i leakP are the system’s dynamic
energy and the ith component’s leakage power, respectively.
it is the duty time of the ith component and N is the total
number of the components in our architecture. T is the data
analysis period (i.e. 3 seconds).
Equation (8) indicates that the leakage energy is strongly
related to the duty time. Once the frequency scaling is
applied to reduce the dynamic power, the application duty
time definitely increases, resulting in an increase of leakage
energy. It thus seems difficult to reduce the system power by
frequency scaling alone. We argue, however, that the leakage
energy can also go down when the clock frequency is scaled
down: the leakage energy is caused by the leakage current,
which depends on the supply voltage, and a low-frequency
clock can run well at a low supply voltage. The leakage
energy increase caused by a longer duty time can therefore
be compensated by a lower supply voltage.
Since only the dynamic power interests us in this paper,
the leakage energy is excluded from our power simulation.
However, we account for it indirectly by keeping the duty
time within a reasonable range. For this purpose, the Energy-
Delay Product (EDP) is employed as one of our metrics:
only systems with a low EDP are candidates. In this paper,
the terms energy and delay refer to the dynamic energy and
the duty time, respectively, unless explicitly stated otherwise.
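Concretely, the metric used in section 7 is $EDP = E_{dynm} \cdot t_{duty}$, expressed in nJ-sec, so halving the duty time at equal dynamic energy halves the EDP.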
TABLE 5. APPLICATION MAPPING

Processor  | DSP | Plain-text cipher | RF transmit*
Biomedical |  x  |                   |
Crypto     |     |         x         |
UWB        |     |                   |      x

* The RF-transmit application is replaced with a low-pass filter, as it is still under development at the moment.
6.5. Power simulation strategy
Our power simulation aims at predicting the power trend
of the ECG application in the early stage. Considering this,
five assumptions are made to ease our power simulation.
1) All modules on our platform are individually power
gated (power gating has already been developed by the
ULP-DSP group, although it is not present in our
platform). The IP cores are also power-gated when they
access the external memories.
2) The power consumption caused by memories other
than the data memories is ignored (based on an
unpublished test report of the ULP-DSP group). The
voltage supply is fixed at 1.2 V, which allows for a
100 MHz clock frequency.
3) The IP core and its I/O tile should be operated at the
same frequency for the sake of simplicity. The bus
and P2P clock periods are fixed in our simulation (see
table 4).
4) Each IP core can scale its frequency individually.
5) The three ASIPs run their own software as shown in
table 5. Each ASIP must be in one of the following
four states during the simulation: idle, active,
synchronization or memory-transfer. An ASIP
waiting for the starting signal is in its idle state and is
assumed to be power gated. Once it is started, it
enters the active state until it needs access to the
external memories. Then it is blocked in the
synchronization state until it is authorized to access
the target external memory. In the memory-transfer
state, it moves the shared data from one memory to
the other, like a physical DMA would do. The
activities in the different states are recorded in a log
file for high-level power calculations afterwards.
Four test-benches are specified for the power simulations,
which are shown in Table 6. They are discussed in the next
section.
TABLE 6. TEST-BENCH SPECIFICATION

No. | Bus | P2P | Sequential program | Software pipeline
1   |     |  x  |         x          |
2   |  x  |     |         x          |
3   |  x  |     |                    |        x
4   |     |  x  |                    |        x
TABLE 7. BENCHMARK SPECIFICATION

I/O data width (bit) | Typical clock (MHz) | Dynamic power (mW) | Leakage power (uW) | Local memory size (KB)
32                   | 100                 | 13.4518            | 11.952             | 32
TABLE 8. BENCHMARK PROFILE

Name      | Dynamic energy (nJ) | Duty time (ms)
Data path | 55875.48            | 4.15
Memory    | 525                 | 0.82
Total     | 56400               | 4.15

Dynamic power (mW): 13.6
7. EXPERIMENTS
In this section, the power simulations of ECG application
are discussed. Four test-benches are involved in our
simulation and evaluated by EDP. Section 7.1 gives some
basic facts about the benchmark. Then test-bench 1 and 2 are
profiled in section 7.2, and test-bench 3 and 4 in section 7.3.
The local frequency scaling and the global frequency scaling
are discussed in sections 7.4 and 7.5, respectively. A
summary of our experiments is given in section 7.6.
7.1. Benchmark
An imaginary processor is considered as our benchmark,
whose data path is made up of three ASIPs (i.e. Biomedical,
Crypto and UWB processor). Its hardware specifications are
shown in table 7. It runs the biomedical processing, the
plain-text ciphering and the RF transceiving successively,
and then is power-gated until the next data analysis period.
The dynamic energy and the duty time are given by the sum
of all ASIP dynamic energy and duty time, respectively. The
data path consumes most of the full system's dynamic
energy (55875 nJ out of 56400 nJ), as shown in table 8.
7.2. Sequential programming
The sequential program is running in both test-bench 1
and 2, as indicated in table 6. The dynamic energy profile is
shown in Fig. 10. Bus communication in test-bench 2
consumes more energy (92690 nJ) than the P2P
communication in test-bench 1 (16357 nJ). Although
communication is not dominant in test-bench 1, its cost is
still as high as 5.32 nJ/byte (16357 nJ for 3072 bytes),
compared with a computation cost of 8.77 nJ/byte
(25851.8 nJ for 3072 bytes). The energy contributed by the
synchronization/memory-transfer states and data memories
is too low to be seen in Fig. 10.
Fig. 10. Energy profile of test-benches 1 and 2: computation costs about 25852 nJ in both, while communication costs 16357 nJ (P2P, test-bench 1) versus 92691 nJ (bus, test-bench 2).

Fig. 11. Duty time profile of test-bench 1, broken down into the active, synchronization and memory-transfer states of the three processors.
The duty time in test-bench 1 (i.e. 5.07 ms in Fig. 14) is
smaller than that in test-bench 2 (i.e. 5.14 ms in Fig. 14), due
to two reasons: (1) the bus is slower than P2P (see table 4)
and (2) the concurrent requests from the multiple IP cores
are handled in parallel by multiple P2P links.
Fig. 11 shows that ASIPs in test-bench 1 spend limited
time on the communication when they are in the
synchronization state and the memory-transfer state. The
Crypto processor's processing (i.e. the Crypto active state)
dominates the complete processing; its active time is about
1.6 times and 2.5 times longer than those of the Biomedical
processor and the UWB processor, respectively.
7.3. Software pipelining
Software pipelining, as shown in Fig. 12, is applied on
test-bench 1 and 2. The test-benches which implement the
software pipelining (i.e. test-bench 3 and 4) consume more
dynamic energy than those without the pipelining (i.e. test-
bench 1 and 2), as shown in Fig. 13. This is due to the
synchronization overheads caused by the software pipelining.
The P2P connected architecture (test-bench 4) wins over the
bus connected one (test-bench 3) in the energy saving,
because the P2P communication has a lower energy
consumption per transaction than the bus communication
does. However, the duty time of the test-benches with the
software pipelining is decreased by nearly half in Fig. 14. A
detailed profile of the energy contribution of test-bench 4
(see Fig. 15) discloses that the synchronization accounts for
most energy consumption.
Fig. 12. SDF of software pipeline: the active phases of the Biomedical, Crypto and UWB processors overlap in time, separated by synchronization phases, which shortens the duty time.

Fig. 13. Dynamic energy in the test-benches (test-bench 1: 44087 nJ; test-bench 2: 120427 nJ; test-bench 3: 768507 nJ; test-bench 4: 117806 nJ)

Fig. 14. Duty time in the test-benches (test-bench 1: 5.07 ms; test-bench 2: 5.14 ms; test-bench 3: 2.80 ms; test-bench 4: 2.69 ms)
Fig. 15. Energy contribution in test-bench 4, broken down into the active, synchronization and memory-transfer states of the three processors, the data memories, and the Biomedical-Crypto and Crypto-UWB synchronizations and memory transfers

Fig. 16. Energy-Delay Product in the test-benches (test-bench 1: 223.55; test-bench 2: 618.97; test-bench 3: 2150.04; test-bench 4: 317.15 nJ-sec)

Fig. 17. SDF for test-bench 4: the sensor fires 3072 tokens every 3000 ms, while the Bio-processor, Crypto processor and UWB processor process their 3072 tokens in 1.54 ms, 2.55 ms and 0.979 ms, respectively.
7.4. Local frequency scaling
Slowing down the Biomedical processor in test-bench 4
can reduce the number of synchronizations between the
Crypto processor and the Biomedical processor. Local
frequency scaling is one way to do that. The proper scaling
factor can be determined by SDF analysis. In Fig. 17, the
node processing rates are derived from the profile of test-
bench 1 (sequential programming). A relation between
the different clock frequencies can now be determined from
the SDF:

$$f_{cry} = 1.7 \cdot f_{bio} = 2.6 \cdot f_{uwb} \quad (9)$$

in which $f_x$ is the clock frequency of the corresponding
ASIP. The net-list of
the Crypto processor is optimized for 100 MHz and a higher
clock would violate the optimized circuit structure. The
Crypto processor processing is the bottleneck of the entire
execution time, so that the Crypto processor must run as fast
as possible. For these two reasons, the Crypto processor’s
clock frequency is fixed at 100 MHz. Design space
explorations of the ASIP clock frequency are then executed
in the following ranges: the Biomedical processor clock
within 50~100 MHz and the UWB processor clock within
30~100 MHz.
The Biomedical processor clock is explored first, as
shown in Fig. 18. The horizontal axis is the duty time and
the vertical axis is the dynamic energy. Four Pareto points
are identified, at clock periods of 10, 17, 18 and 19 ns. After
checking their EDP values, shown in Fig. 19, the 18 ns clock
period, with an EDP of 121.97 nJ-sec, is chosen as the most
efficient performance-power trade-off.
The UWB processor can also be slowed down, because its
processing time is shorter than the duty time. Such a high
speed is a waste of dynamic power. Fig. 20 shows the DSE
results. One Pareto point (27 ns) is found in Fig. 20; its EDP
is about 121.96 nJ-sec, as shown in Fig. 21.
The optimized ASIP clock frequencies in test-bench 4 are
now:
Biomedical processor at 55 MHz.
Crypto processor at 100 MHz.
UWB processor at 37 MHz.
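These choices agree well with equation (9): $100\,\mathrm{MHz}/1.7 \approx 59\,\mathrm{MHz}$ against the chosen 55 MHz (18 ns period), and $100\,\mathrm{MHz}/2.6 \approx 38\,\mathrm{MHz}$ against the chosen 37 MHz (27 ns period).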
Although the duty time increases slightly from 2.69 ms to
2.75 ms (see table 9), the energy is reduced by as much as
63% after local frequency scaling.
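As a consistency check, the scaled profile reproduces the Pareto EDP of Fig. 19: $44296.3\,\mathrm{nJ} \times 2.75\,\mathrm{ms} \approx 121.8$ nJ-sec, close to the reported 121.97 nJ-sec.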
Local frequency scaling is unnecessary for test-bench 1,
because test-bench 1 runs the sequential program without
synchronization overheads. Nevertheless, the clock
frequencies in test-bench 1 are scaled down to those in test-
bench 4 for comparison. By doing so, the energy in test-
bench 1 stays the same while the duty time climbs up by
about 1ms after scaling, as shown in table 10.
Fig. 18. DSE of Biomedical processor clock frequency
Fig. 19. Energy-Delay product in test-bench 4 after
Biomedical processor frequency scaling
Fig. 20. DSE of UWB processor clock frequency
Fig. 21. Energy-Delay product in test-bench 4 after UWB
processor frequency scaling
TABLE 9. PROFILE OF TEST-BENCH 4

               | Dynamic energy (nJ) | Duty time (ms) | Dynamic power (mW)
Before scaling | 117805.67           | 2.69           | 43.79
After scaling  | 44296.3             | 2.75           | 16.1

TABLE 10. PROFILE OF TEST-BENCH 1

               | Dynamic energy (nJ) | Duty time (ms) | Dynamic power (mW)
Before scaling | 44087.9             | 5.07           | 8.69
After scaling  | 44087.9             | 7.09           | 7.1

TABLE 11. PROFILE FOR THE POWER-CONSTRAINED SCENARIO

             | Dynamic power (uW) | Duty time (ms) | Global scaling factor
Benchmark    | 20                 | 2822           | 680
Test-bench 1 | 20                 | 2201           | 355
Test-bench 4 | 20                 | 2213           | 805

TABLE 12. PROFILE FOR THE DUTY-TIME-CONSTRAINED SCENARIO

             | Dynamic power (uW) | Duty time (s) | Global scaling factor
Benchmark    | 18.9               | 3             | 723
Test-bench 1 | 14.7               | 3             | 484
Test-bench 4 | 14.8               | 3             | 1091

TABLE 13. CLOCK FREQUENCIES FOR THE 20 uW CONSTRAINT

             | Biomedical (kHz) | Crypto (kHz) | UWB (kHz)
Test-bench 1 | 160              | 282          | 105
Test-bench 4 | 70               | 125          | 46
Benchmark    | 148 (single clock)

TABLE 14. CLOCK FREQUENCIES FOR THE 3-SECOND CONSTRAINT

             | Biomedical (kHz) | Crypto (kHz) | UWB (kHz)
Test-bench 1 | 114              | 207          | 77
Test-bench 4 | 51               | 92           | 34
Benchmark    | 139 (single clock)
7.5. Global frequency scaling
To meet the power budget of the ECG application (i.e.
20 uW [24]), global frequency scaling is applied to the
benchmark and to test-benches 1 and 4. Table 11 lists the
results after the global scaling: the duty time of the
benchmark is almost 3 seconds, while those of test-benches 1
and 4 are around 2.2 seconds.
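These factors are consistent with the specifications: dividing the benchmark's 13.6 mW dynamic power (table 8) by its global scaling factor of 680 gives exactly 20 uW, and scaling its 100 MHz clock by the same factor gives about 147 kHz, matching the 148 kHz reported in table 13.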
Global frequency scaling can also be applied in the duty-
time-constrained scenario, where the duty time is set equal to
the data analysis period, as shown in table 12. Both test-
benches then reach a dynamic power (14.7~14.8 uW) about
22% lower than the benchmark (18.9 uW).
Tables 13 and 14 give the clock frequencies after
local/global scaling in the two scenarios above. They suggest
that test-bench 4 can be supplied with a much lower voltage
than the others because of its slower clock frequencies.
7.6. Summary
According to the simulation results above, three facts are
observed:
1) The bus communication is more expensive than the
P2P communication for the ECG application on our
proposed platform (see Fig. 10).
2) P2P communication with software pipelining achieves
a lower duty time and a lower power consumption than our
benchmark in the power-constrained scenario and the duty-
time-constrained scenario, respectively (see tables 11 and
12).
3) The P2P communication with the software pipelining
can benefit from a low voltage supply due to its lower clock
frequencies (see table 13 and table 14).
8. CONCLUSIONS AND FUTURE WORK
In our work, an MPSoC platform and its work flow are
proposed. This platform consists of three subsystems: the IP
core, the I/O tile and the OCCN bus/P2P. The IP core
derives from the systemC wrapped ISS, and the I/O tile is
implemented in systemC. The OCCN bus/P2P is adopted
from the OCCN library. A multi-process programming
model is developed on this platform, in which the process
communication is based on the memory address mapping
and the process synchronization is implemented in the
software. An architecture-level dynamic power model is also
developed for this platform, which facilitates power-trend
prediction in the early design stage. In the work flow, our
platform is configured with the system specifications (e.g. IP
core clock frequency). A target simulator is generated in the
work flow, which can deliver the cycle-accurate timing
information and the power consumption numbers. The
simulation results can help improve the hardware/software
design. Although ISS's developed with TargetCompiler are
employed throughout this paper, other ISS's with memory
interfaces can be integrated into our platform as well.
The ECG application has been demonstrated on our
platform, employing three ASIPs developed at imec
Netherlands whose ISS's and software are integrated into our
platform. The high-level power simulation shows that P2P
communication with software pipelining can satisfy both the
power constraints and the time constraints of the ECG
scenarios, and it may benefit from a low leakage power as
well.
In our proposed work flow, the software engineers
manually allocate the memory addresses for the shared data
to enable the inter-core communication (i.e. step 2.2 in Fig.
6). In fact, these memory addresses for the inter-core
communication are already determined by the MPSoC
configuration. Once the IP core's role in the traffic (i.e.
traffic initiator or target) is specified and its shared data are
explicitly declared, a smart compiler/linker could map the
shared data to the proper memory addresses of the IP core.
To achieve this goal, a dedicated linker with special
keywords can be developed in the future.
ACKNOWLEDGMENT
This paper is part of the author’s master project in imec
Netherlands. Lots of thanks go to Prof. Henk Corporaal, Dr.
Mario Konijnenburg, Mr. Jos Huisken, Mr. Firew Siyoum
and the people in ULP-DSP group for their intensive
guidance.
REFERENCES
[1] H. Long, Y. Liu, Y. Wang, R.P. Dick, and H. Yang, "Battery allocation for wireless sensor network lifetime maximization under cost constraints," Proceedings of the 2009 International Conference on Computer-Aided Design, San Jose, California: ACM, 2009, pp. 705-712.
[2] G. Goossens, D. Lanneer, W. Geurts, and J. Van Praet, "Design of ASIPs in Multi-Processor SoCs using the Chess/Checkers Retargetable Tool Suite," International Symposium on System-on-Chip (SoC 2006), Tampere, 2006.
[3] S. Jäger, "A Simulation Framework for Multiprocessor SoC," Student Thesis SA-2003.15, Winter Term 2002/2003, ETHz, 2003.
[4] R. Zhong, Y. Zhu, W. Chen, M. Lin, and W.F. Wong, "An Inter-Core Communication Enabled Multi-Core Simulator Based on SimpleScalar," 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07), 2007.
[5] D. Burger and T.M. Austin, "The SimpleScalar tool set, version 2.0," ACM SIGARCH Computer Architecture News, vol. 25, 1997, pp. 13-25.
[6] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino, "Legacy SystemC co-simulation of multi-processor systems-on-chip," Computer Design: VLSI in Computers and Processors, 2002, pp. 494-499.
[7] M. Yu, J. Song, F. Fu, S. Sun, and B. Liu, "A Fast Timing-Accurate MPSoC HW/SW Co-Simulation Platform based on a Novel Synchronization Scheme," Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 2, 2010.
[8] N.L. Binkert, R.G. Dreslinski, L.R. Hsu, K.T. Lim, A.G. Saidi, and S.K. Reinhardt, "The M5 simulator: Modeling networked systems," IEEE Micro, vol. 26, 2006, pp. 52-60.
[9] V. Joloboff and C. Helmstetter, "SimSoC: A SystemC TLM integrated ISS for full system simulation," 2008.
[10] A. Rose, S. Swan, J. Pierce, and J.M. Fernandez, "Transaction level modeling in SystemC," Open SystemC Initiative, 2005.
[11] A. Hansson, K. Goossens, M. Bekooij, and J. Huisken, "CoMPSoC: A template for composable and predictable multi-processor system on chips," ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 14, 2009, p. 2.
[12] H.N. Nikolov, "System-level design methodology for streaming multi-processor embedded systems," Leiden Institute of Advanced Computer Science (LIACS), Faculty of Science, Leiden University, 2009.
[13] M. Monchiero, R. Canal, and A. González, "Design space exploration for multicore architectures: a power/performance/thermal view," Proceedings of the 20th Annual International Conference on Supercomputing, 2006, p. 186.
[14] J. Xu, Y. Zhu, L. Jiang, J. Ni, and K. Zheng, "A simulator for multi-core processor micro-architecture featuring inter-core communication, power and thermal behavior."
[15] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: a framework for architectural-level power analysis and optimizations," ACM SIGARCH Computer Architecture News, vol. 28, 2000, p. 94.
[16] A. Kahng, B. Li, L.S. Peh, and K. Samadi, "Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration," Design, Automation, and Test in Europe, 2009, pp. 423-428.
[17] S. Li, J.H. Ahn, R.D. Strong, J.B. Brockman, D.M. Tullsen, and N.P. Jouppi, "McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures," Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 469-480.
[18] H. Bajwa and X. Chen, "Low-Power High-Performance and Dynamically Configured Multi-Port Cache Memory Architecture," 2007 International Conference on Electrical Engineering, Lahore, Pakistan, 2007, pp. 1-6.
[19] M. Coppola, S. Curaba, M.D. Grammatikakis, G. Maruccia, and F. Papariello, "OCCN: a network-on-chip modeling and simulation framework," Proceedings of the Conference on Design, Automation and Test in Europe, Volume 3, 2004, p. 30174.
[20] F.L. Bauer and H. Wössner, Algorithmic Language and Program Development, Birkhäuser, 1982.
[21] "Circular buffer," Wikipedia, the free encyclopedia, Aug. 2010.
[22] J. Dielissen, A. Radulescu, and K. Goossens, Power Measurements and Analysis of a Network on Chip, Eindhoven, The Netherlands: Philips Research, 2005.
[23] P. Lieverse, P. Van Der Wolf, K. Vissers, and E. Deprettere, "A methodology for architecture exploration of heterogeneous signal processing systems," The Journal of VLSI Signal Processing, vol. 29, 2001, pp. 197-207.
[24] (2010) The HolstCenter website. [Online]. Available: http://www.holstcentre.com/
[25] B.D. Theelen, M.C.W. Geilen, T. Basten, J.P.M. Voeten, S.V. Gheorghita, and S. Stuijk, "A scenario-aware data flow model for combined long-run average and worst-case performance analysis," Fourth ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE'06), 2006, pp. 185-194.
[26] M. Pedram and J.M. Rabaey, Power Aware Design Methodologies, Springer, 2002.