COMMUNICATION SYNTHESIS FOR
ON CHIP NETWORKS
A Thesis
by
NARAYANAN SWAMINATHAN
Submitted to the Office of Graduate Studies of
Texas A&M University
in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
August 2002
Major Subject: Computer Engineering
COMMUNICATION SYNTHESIS FOR
ON CHIP NETWORKS
A Thesis
by
NARAYANAN SWAMINATHAN
Submitted to Texas A&M University
in partial fulfillment of the requirements
for the degree of
MASTER OF SCIENCE
Approved as to style and content by:
Gwan Seung Choi (Co-Chair of Committee)
Rabi N. Mahapatra (Co-Chair of Committee)
Ting-Chi Wang (Member)
Chanan Singh (Head of Department)
August 2002
Major Subject: Computer Engineering
ABSTRACT
Communication Synthesis for
On Chip Networks. (August 2002)
Narayanan Swaminathan, B.E., Bharathiyar University
Co-Chairs of Advisory Committee: Dr. Gwan Seung Choi, Dr. Rabi N. Mahapatra
The modern day System on Chip (SoC) is made up of a large number of heterogeneous
cores with varied communication requirements. On chip networks are the scalable,
global interconnection solutions for these explicitly parallel systems. In this
thesis, we have analyzed the issues involved in synthesizing these networks and
proposed a methodology that will help to arrive at an optimal Network on Chip (NoC)
design. Some of the important issues that we consider for synthesis are the quality of
service (QoS) requirements of the communicating cores, such as latency and data rate,
the utilization of the network resources, and implementation cost issues such as area,
power and wiring complexity. We have developed a tool-set written mainly in C++, which
comprises an IP clustering engine and an on chip network simulator, to aid the
synthesis. We have annotated the models used for communication architecture synthesis
with design parameters obtained from gate level synthesis. The tool-set can aid
the designer in predicting the various cost parameters and configuring the network on
chip architecture for optimal performance. The network simulator, NoCSIM, can be
extended to evaluate the performance and feasibility of new and innovative network
architectures developed for the on chip environment.
To My Parents
ACKNOWLEDGMENTS
Thanks are due first to my committee, Rabi Mahapatra, Gwan Choi and Ting-Chi Wang,
for their help and insight. I would have surely been lost without them.
I would also like to thank Praveen for playing the devil's advocate to my ideas.
Thanks are also due to Nithin and Vishi, who were very forthcoming with ideas for
the simulator design.
The entire Texas A&M HW/SW Codesign group also deserves my thanks for
sitting through multiple versions of my evolving research.
Lastly, I would like to thank my sister, Janani, for sharing her expertise in
logic synthesis, my brother-in-law, Bala, and my parents, Lalli and Swami, for their
encouragement in my academic endeavors.
TABLE OF CONTENTS
CHAPTER
I INTRODUCTION
II BACKGROUND AND PROPOSED RESEARCH
   A. Introduction
   B. Specification and Profiling
   C. Communication Synthesis and Estimation
   D. Communication Architecture
   E. Proposed Research
   F. Conclusion
III CLUSTERING
   A. Introduction
   B. Inputs to Clustering Engine
   C. Clustering Algorithm
      1. Analysis Models for Clustering Cost Evaluation
      2. Clustering Results
   D. Cluster Profile Annotation
   E. Intra-Cluster Synthesis
   F. Conclusion
IV NETWORK ARCHITECTURE AND SYNTHESIS
   A. Introduction
   B. Architecture
      1. Network Operation
      2. Flow-control
      3. Routing
      4. Topology
      5. Switching
      6. Synthesis Variables
   C. NoCSIM - An On Chip Network Simulator
      1. SystemC - A System Level Design Language
      2. Features of the Network Simulator
      3. Network Synthesis Results
   D. VLSI Implementation Feasibility
      1. Results
   E. Core-Network Interface
   F. Conclusion
V FUTURE WORK
   A. Introduction
   B. Integration to Codesign Environment
   C. Synthesized Communication Architecture
VI CONCLUSION
REFERENCES
VITA
LIST OF TABLES
TABLE
I ITRANS 2001 Roadmap
II Effect of Virtual Channels and Buffers in Area of Network
III Area Cost Comparison of 128 and 256 Bit Networks
IV Interface Analysis
LIST OF FIGURES
FIGURE
1 Shared Bus Architecture
2 Hierarchical Bus Architecture
3 Octagon Architecture
4 Synthesis Methodology
5 Folded Torus
6 Network Connection Diagram
7 Power Consumption Breakup of a High Performance Microprocessor
8 Clustering Flow
9 Results of Clustering Algorithm
10 Comparison of Wire Delay and Gate Delay
11 Tradeoffs in Area and Power with Cluster Size
12 Failure Comparison of Bus and Network
13 Synthesis and Verification Methodology
14 Packet Format
15 Output Controller Architecture
16 Input Controller Architecture
17 Network Routing Example
18 Simulator Structure and Features
19 Buffers vs. Network Performance
20 Virtual Channels vs. Network Performance
21 Comparison of Multicast and Unicast Routing
22 VLSI Complexity Cost of Adding Virtual Channels
23 Core Network Interface
CHAPTER I
INTRODUCTION
The complexity and performance requirements of future System on Chip (SoC)
designs, combined with the multiplicity of bus interfaces, pose significant challenges to
the design and verification of communication architecture. The International Technology
Roadmap for Semiconductors (ITRS) identifies design productivity, power
management, multi-core organization, I/O bandwidth, and circuit and process technologies
as key contexts for the future evolution of microprocessing units [1]. Communication
architecture design and synthesis has a major role to play in almost all of
these factors, because as the density of the chip increases, the area, throughput and power
dissipation of the communication architecture become significant issues, apart from the
design and verification complexity. An aggressive design strategy can provide a good
tradeoff in terms of power, speed, area and design complexity. A robust and aggressive
communication architecture design requires a quick evaluation of communication
requirements and a synthesis based on those requirements.
The traditional bus based architectures are neither scalable nor efficient enough for next
generation SoCs with hundreds of cores and aggregate throughput rates of multiple gigabits
per second [2]. These highly integrated and explicitly parallel SoC architectures are
replacing the architectures based on instruction level parallelism (ILP) and implicit
communications for high performance designs. A switching network on chip (NoC)
appears to be the truly scalable interconnect, as compared to shared buses, multiple
buses or a hierarchy of buses, when we consider explicitly parallel systems on chip. The
migration to the distributed interconnection network oriented setup is the intuitive
next step after the transition from the shared-bus paradigm to the hierarchical bus
architecture.
The journal model is IEEE Transactions on Automatic Control.
While high performance is an essential design attribute, predictability of performance
and verifiability of the system are necessary for designers and chip architects.
Predictability is necessary to take early design decisions based on expected bus/link
performance before actual implementation. Verification and chip integration pose a
significant challenge due to the presence of a large number of proprietary buses for
different IP cores and the shift of technology towards ultra deep-submicron processes. This
necessitates architectural standards on communication architecture and core interfaces,
like those proposed by the Virtual Socket Interface Alliance [3]. But the high performance
and scalability of NoCs come with an associated cost in terms of area and
power dissipation. The necessity to optimize this cost motivates us to develop a
synthesis tool that is supported by fast and accurate estimation models for optimal
communication architecture design. In this thesis, we propose a methodology for the
synthesis of an on chip switching network. The aim of communication synthesis is to
create the hardware and software entities required for optimal and reliable communication
between the different cores in the system, across the on chip network. The
methodology addresses various issues involved in the synthesis of an on chip switching
network, such as the placement of cores in the networked chip, the architectural synthesis
of an optimal network and the synthesis of core interface logic.
This thesis has the following contributions:
• Synthesis methodology for on chip networks.
• Analysis models of on chip network costs to aid synthesis.
• NoCSIM - an on chip network simulator.
• VLSI feasibility analysis of on chip networks.
The rest of the chapters are organized as follows. The second chapter analyzes the
work that has already been done in the area and introduces our synthesis methodology.
The third chapter deals with the analysis phase of our methodology, called
clustering. The fourth chapter deals with the network architecture synthesis and its
VLSI implementation feasibility based on gate level synthesis. It also addresses other
issues in synthesis, such as the core network interface.
CHAPTER II
BACKGROUND AND PROPOSED RESEARCH
A. Introduction
The three critical issues for communication architecture performance are the inputs
to the synthesis methodology, the estimation of the communication requirements and the
synthesized communication architecture. Although research in networks on chip is
still in its infancy, a number of contributions have been reported recently [4, 5, 2, 6,
7, 8]. In this section, we analyze the present state of research in NoCs and the
critical issues mentioned above.
B. Specification and Profiling
System specification has been one of the major areas of research over the past two
decades, and a large number of specification languages have been the result [9]. These
include C/C++, SystemC [10], Java, Speccharts and Hardware Description Languages
(HDL). Our approach is not specific to any of these descriptions, but can be easily
integrated with SystemC based descriptions.
In [11], Vahid presents a novel communication profiling method based on accesses
rather than data flow, which reduces the complexity of task-graph specification.
Graph complexity is one of the most important considerations when dealing with
the specification of complex designs. In [12], Knudsen explores a graph based communication
analysis scheme using a Control/Data flow graph. In [13], Lahiri analyzes
communication based on execution traces, using co-simulation results from Ptolemy.
For this thesis, we have assumed that a profiled task-graph is made available to the
synthesis methodology.
C. Communication Synthesis and Estimation
Communication synthesis for distributed embedded systems has been dealt with in
[14] by analyzing delay bounds on communication in the system. This analysis has
been used to synthesize the communication links and to assign inter-process communication
to the links. In [15], Pai Chou emphasizes the synthesis of software
drivers and hardware interfaces (glue logic) to connect processors to co-processors and
peripheral devices, while meeting bandwidth and performance requirements. Their
algorithm takes a communication estimate, derived from control flow graphs and
descriptions of processors and devices, as input and synthesizes the required communication
logic and drivers. Communication synthesis is tackled as an allocation problem
in [16], where the two aspects of synthesis, protocol generation and interface synthesis,
are addressed. In [17], the authors integrate the process of communication protocol
selection into hardware/software codesign by using a communication library to explore
tradeoffs in communication architecture design. Their work takes driver-processing
effects into account while calculating the throughput. In [18], the aim has been to
close the synthesis loop between system and physical design with a set of tools. The
authors further suggest a methodology for analyzing the communication, creating the bus
network and iterating for the best physical floorplan. Our communication synthesis
approach is similar to [18], but is focused on the synthesis of a switching communication
network, instead of buses, as the global wiring solution.
D. Communication Architecture
In the past, a significant amount of work has been proposed on different
communication architectures for on chip and off chip environments. The most common
communication architectures have been the bus based architecture and switching
architectures. The main interconnection setups for bus based architectures are the
point to point, shared bus (shown in Figure 1) and hierarchical bus (shown in Figure 2)
architectures. A variety of switching architectures have been used in
multicomputer and multiprocessing networks, such as k-ary n-cube (e.g., mesh, torus),
Multistage Interconnection Networks (e.g., fat tree) and crossbar.
Fig. 1. Shared Bus Architecture (masters, slaves and an arbitrator on a shared bus)
The most dominant architecture used for the on chip environment is the bus based
architecture [19, 20, 21, 22]. This is mainly because the throughput requirements of
systems on chip have so far been satisfied by shared and hierarchical buses. But it
is very doubtful that this scenario will persist for long. The shared bus architecture,
shown in Figure 1, is the most commonly available bus architecture for on chip
communication. The components attached to the bus contend among themselves to
access the bus. A bus arbiter examines the requests accumulated at the component
interfaces and grants access to the requesting component having the highest priority.
The shared bus architecture suffers from poor scalability [5]. For example,
the greater the number of processors trying to access the shared memory, the greater
is the latency to access it. Bus arbiter delay also grows with the number of devices
attached. The bandwidth is limited and shared. However, a shared bus architecture
has its own advantages. The area cost of a bus and its supporting logic is very low,
and the logic is simple. Hierarchical bus architecture is another scheme that has been
proposed. This arrangement is shown in Figure 2. The components that require a large amount
of communication among themselves are grouped on the same level of the hierarchy. Each
level of the hierarchy has a bridge that acts as a liaison for the traffic from one level to
the other. So at each level, we have fewer components contributing to the loading
of the bus. This architecture has higher performance, concurrency and throughput
than the shared bus, but has associated higher costs.
Fig. 2. Hierarchical Bus Architecture (bus segments, each with masters, slaves and an arbitrator, joined by bus bridges)
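The priority-based arbitration described above for the shared bus can be sketched in a few lines of Python. This is only an illustrative model: the function name `grant_bus`, the component names and the convention that a larger number means a higher priority are assumptions made for the example, not details taken from any particular bus standard.

```python
from typing import Dict, Optional


def grant_bus(requests: Dict[str, bool], priority: Dict[str, int]) -> Optional[str]:
    """Return the requesting component with the highest priority, or None.

    requests maps a component name to True if it is currently asking for the bus.
    priority maps a component name to its priority (assumed: larger wins).
    """
    requesting = [name for name, wants in requests.items() if wants]
    if not requesting:
        return None  # bus stays idle this cycle
    return max(requesting, key=lambda name: priority[name])


if __name__ == "__main__":
    requests = {"cpu": True, "dma": True, "dsp": False}
    priority = {"cpu": 2, "dma": 3, "dsp": 5}
    print(grant_bus(requests, priority))
```

Because only one requester is granted per arbitration round, every additional master adds to the waiting time of the lower-priority ones, which is the scalability problem noted in the text.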
The migration to the distributed network environment is the intuitive next step for
even higher throughput requirements and to achieve more concurrency. This is one
important reason that motivates our research into networks on chip and switching
architectures like [23]. The other motivations are reliability, performance
predictability and possible power reduction. Among switching architectures, the crossbar
has the best performance, but its cost is prohibitive and crossbars are generally
non-scalable, as the wiring complexity is very high. The complexity grows as
the square of the number of nodes on the chip and a lot of care has to be taken to
avoid signal skew. This scheme is not energy efficient, as a lot of energy is
consumed by all the buses and the switches. Another architecture that has been proposed
is the fat-tree network architecture called Scalable, Programmable, Integrated
Network (SPIN) [2]. The number of roots in the tree was increased to amortize the
traffic at the routers. The leaves contain processors or memory. However, this scheme
still suffers because the intermediate routers have to share the burden of routing and
buffering the on chip traffic destined for the leaves. A novel architecture called the
Octagon architecture has been proposed in [5]. This scheme, shown in Figure 3, tries
to achieve the efficiency of a crossbar architecture using an octagon arrangement of
nodes. Communication between any pair of nodes can be achieved in at most 2 hops.
Fig. 3. Octagon Architecture
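The two-hop property of the Octagon can be checked with a small breadth-first search. The sketch below assumes the usual Octagon connectivity, in which each of the eight nodes is linked to its two ring neighbours and to the node directly across; this wiring is an assumption for illustration, since the text above does not spell it out. The function names are hypothetical.

```python
from collections import deque


def octagon_neighbors(node: int, n: int = 8) -> list:
    """Assumed Octagon wiring: two ring neighbours plus the node directly across."""
    return [(node + 1) % n, (node - 1) % n, (node + n // 2) % n]


def hop_count(src: int, dst: int, n: int = 8) -> int:
    """Minimum number of hops from src to dst, found by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            return dist[cur]
        for nxt in octagon_neighbors(cur, n):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("destination unreachable")


def max_hops(n: int = 8) -> int:
    """Largest hop count over all source/destination pairs."""
    return max(hop_count(s, d, n) for s in range(n) for d in range(n))


if __name__ == "__main__":
    print("worst-case hops:", max_hops())
    # For comparison, a full crossbar on n nodes needs on the order of n * n crosspoints.
    print("crossbar crosspoints for 8 nodes:", 8 * 8)
```

Under this wiring each node needs only three links, whereas the crossbar cost grows with the square of the node count, which is the tradeoff the paragraph describes.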
The idea of using a switching NoC has been explored in [4, 5, 2, 6, 7, 8]. In [4],
an NoC example is proposed and the research issues involved in designing the NoC
are stated. In [2], a generic packet-switching network architecture based on a fat-tree
topology is proposed. The Pleiades architecture [24] explores a reconfigurable NoC
design for DSP applications and assesses the impact of reconfigurability and fault
tolerance of such networks. The impact of energy efficiency and reliability of NoCs is
analyzed in [7].
Fig. 4. Synthesis Methodology (inputs: system specification/process profile and hardware/resource profile; mapping; synthesis: clustering, intra-cluster communication synthesis and profile annotation, NoCSIM based inter-cluster communication synthesis and core interface synthesis; output: communication synthesized system based on the implementation model)
E. Proposed Research
The primary objective of our work is to explore the communication architecture design
and synthesis space with a switching network as the global wiring solution. There
appears to have been little effort to investigate the synthesis of switching networks
in the SoC environment. There are no existing tools to explore viable synthesis
options for the communication entities required for on chip switching network
communication. Figure 4 shows the various steps in the proposed synthesis methodology.
The inputs are obtained in the form of a profiled resource graph, G(V, E), whose nodes, V,
are the communicating resources (and their parameters) and whose edges, E, describe the
communication between the resources. As the on chip communication architectures play a
very important role in the performance of the chip, they need to be designed keeping the
on chip environment in mind. Therefore, we have developed an analytical model of the
NoC costs in C++ to cluster cores into tiles. This is the pre-processing
step of our synthesis methodology, called Clustering. After synthesizing the communication
architecture inside the cluster and generating the communication profile
between different clusters, the outputs of the clustering engine are fed to the network
simulator, NoCSIM. This simulation model is used to estimate and refine the network
architecture. The interface between the cores and the network is also synthesized at
this stage. We use an implementation model based on VHDL to generate the synthesized
network architecture, which is integrated with the existing SoC design flow for
gate level synthesis. In this thesis, we also provide an analysis of the feasibility
of the VLSI implementation of the network. We have developed the simulation tool,
called NoCSIM, based on SystemC, a popular system level design language. The
development of a tool based on this language will make it easier to integrate existing
SystemC descriptions of cores into the NoC environment. It will also make it easier
to investigate the protocol stack implementations [25] that will be overlaid on the
physical network. In our work, we are experimenting with a folded torus topology,
shown in Figure 5. The communicating cores are clustered into a single network
node, referred to as VCNODE in Figure 6. These nodes or tiles communicate with
one another over the switching network. The communication architecture is a two-tier
hierarchy, where the first level is the network between tiles and the second level is the
communication channel within a tile.
Fig. 5. Folded Torus
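As a rough illustration of the folded torus used between tiles, the sketch below computes the logical neighbours of a tile in a k x k torus and one common way of folding each ring so that no physical link spans the whole row. The interleaved placement (0, k-1, 1, k-2, ...) is an assumption about how the folding is done, since the text only names the topology; the function names are hypothetical.

```python
def torus_neighbors(row: int, col: int, k: int) -> list:
    """Logical neighbours (north, south, west, east) in a k x k torus with wrap-around."""
    return [((row - 1) % k, col), ((row + 1) % k, col),
            (row, (col - 1) % k), (row, (col + 1) % k)]


def folded_position(index: int, k: int) -> int:
    """Physical slot of logical ring index under the assumed interleaved folding.

    Logical nodes 0, 1, 2, ... take the even slots left to right, and the
    remaining nodes take the odd slots right to left.
    """
    if index < (k + 1) // 2:
        return 2 * index
    return 2 * (k - 1 - index) + 1


def max_link_length(k: int) -> int:
    """Longest physical distance (in slots) between logically adjacent ring nodes."""
    return max(abs(folded_position(i, k) - folded_position((i + 1) % k, k))
               for i in range(k))


if __name__ == "__main__":
    k = 8
    print("neighbours of tile (0, 0):", torus_neighbors(0, 0, k))
    print("longest link after folding:", max_link_length(k))
    print("longest link without folding:", k - 1)
```

The logical connectivity is that of an ordinary torus; folding only changes the placement, trading the single long wrap-around wire for links that each skip at most one tile.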
F. Conclusion
In this section, we have analyzed the present state of the network on chip synthesis
problem and we have also introduced our methodology. We believe that using this
methodology which is based on the system requirements will lead to reduction in area
and power costs of NoCs, while meeting the quality of service requirements.
Fig. 6. Network Connection Diagram (VC NODE tiles, each with an input controller and an output controller)
CHAPTER III
CLUSTERING
A. Introduction
Clustering is the process of identifying the cores that will be grouped as a single network
tile in the hierarchical topology. It is a pre-processing step, where we examine
the global view of communication between cores without considering the actual implementation
within a cluster. Clustering makes it easier to explore the design space,
as a hierarchical approach to communication architecture design is used. Clustering
makes it possible for elements (cores) to share physical communication channels inside
a tile, thereby reducing the size and complexity of NoC resources. Clustering
becomes a necessity for some very low latency modes of communication, where the
latency incurred over the network cannot be amortized over the delay bandwidth
product of communication. Clusters are also formed to balance the communication
load on the NoC, to pack cores into tiles without wasting the real estate of a tile and
to provide a regular structural layout of the cores. By clustering, we also hope to
attain reduced global clock power and interconnect power dissipation, which together
form more than 40 percent of the power dissipation of today's chips [1]. The power
consumption breakup is shown in Figure 7. But increasing the number of clusters
without limit will prove to be counterproductive, as area and power could be wasted
in the network logic and network interconnection complexity. The above issues form
the various hard and soft constraints for clustering.
A cluster could comprise the local memory/cache of a processor and other elements
that are tightly coupled with the core. An indication of this type of communication
is available from the task-graph fanout. In general, we believe that communication
modes like interrupts and cache/local-memory to processor traffic, which are low latency
and highly coupled modes of communication, are candidates for intra-cluster communication,
whereas streaming traffic and shared memory accesses should use inter-cluster
communication. Multicast modes of communication should also go through
the packet switched network.
Fig. 7. Power Consumption Breakup of a High Performance Microprocessor
B. Inputs to Clustering Engine
The inputs to the clustering engine are:
• Resource matrix
• Connection matrix
• Implementation constraints
The resource matrix provides the details on the type of resource (core), the resource
area, and its length and width. The core could be a soft core or a hard core, the difference
being that the aspect ratio of a soft core can be varied while that of a hard core cannot.
The cores can be of different kinds, such as microprocessors (DSP, GPP),
memory (shared, local), configurable cores (Xtensa core, FPGAs) and RTL blocks
(FIR, CORDIC cores).
The connection matrix provides the details of the interconnection between two
resources. It also provides information on the maximum and average latencies of
communication between the communicating components in the default time scale
units, the data flow rate between the communicating modules and the type of traffic.
The type of traffic is an input intended for the simulator and could be Constant
Bit Rate (CBR) or random with a Poisson distribution. The analyzer uses only a Poisson
distribution of packet arrivals in its queuing model for the network. In addition to
this, the designer can provide additional information on the connectivity of two modules
using a binary variable called the user defined constraint. If this value is 1 for a
communicating pair of modules, then these two modules will always be held within
the same cluster. In other words, they are a single core for the clustering engine. The user
defined constraint gives the designer more control over the convergence of the
annealing algorithm. We believe that automating clustering to a high extent could
be counterproductive for realistic design and verification issues.
The implementation constraints that are needed by the clustering engine include
the process wire pitch, resistance and capacitance values of the global interconnect for
the fabrication process technology to be used, repeater areas, network logic overheads,
area per bit of storage, frequency of network and bus operations, number of allowed
buses inside a cluster and signaling overheads (in terms of number of wires per data
wire) for the physical layer.
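The inputs described above can be pictured as simple records. The following Python sketch is illustrative only: the class and field names (`Resource`, `Connection`, `user_constraint` and so on) are assumptions chosen for this example and are not the formats used by the actual C++ tool-set. It also shows one possible way, using union-find, to treat modules tied by a user defined constraint as a single core.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Resource:
    """One row of the resource matrix (hypothetical field names)."""
    name: str
    kind: str            # e.g. "DSP", "memory", "RTL"
    length: float
    width: float
    hard: bool = False   # a hard core has a fixed aspect ratio

    @property
    def area(self) -> float:
        # Assumption: rectangular core, so area = length * width.
        return self.length * self.width


@dataclass
class Connection:
    """One entry of the connection matrix (hypothetical field names)."""
    src: str
    dst: str
    max_latency: float
    avg_latency: float
    data_rate: float
    traffic: str = "poisson"   # or "cbr"
    user_constraint: int = 0   # 1 = keep both ends in the same cluster


def merge_constrained(resources: List[Resource],
                      connections: List[Connection]) -> List[Set[str]]:
    """Group resources that user defined constraints tie together."""
    parent: Dict[str, str] = {r.name: r.name for r in resources}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for c in connections:
        if c.user_constraint == 1:
            parent[find(c.src)] = find(c.dst)

    groups: Dict[str, Set[str]] = {}
    for r in resources:
        groups.setdefault(find(r.name), set()).add(r.name)
    return list(groups.values())
```

Each returned group can then be moved as one unit by the clustering engine, which matches the statement that constrained modules behave as a single core.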
Fig. 8. Clustering Flow (inputs: connection file, resource file, implementation file and designer input; outputs: violation reports, power costs, area costs and cluster parameters)
C. Clustering Algorithm
The clustering methodology shown in Figure 8 takes the various constraints as inputs
and tries to optimize the allocation of cores into tiles. We have used a popular optimization
technique called Simulated Annealing for this purpose. The outputs
obtained from the clustering engine are violation reports of the hard constraints,
the annotated task graph profile of inter-cluster communication and the area and power
costs of the implementation with the specified number of clusters. If an incompatibility
in the specification or implementation cannot be addressed by the simulated
annealing algorithm, then the cluster configuration has to be modified or the incompatibility
should be resolved by re-mapping the behavior to the architecture. This is
the outer loop in the clustering methodology. As these design spaces are quite large
with varied configurations, the estimation should be fast and provide the necessary
level of accuracy. This is the reason clustering should be done based on analytical
models. It is not realistic to explore these regions with the network simulator.
The Simulated Annealing algorithm lends itself favorably to any problem with a large
solution space. Simulated annealing models the cost of a system as the energy
function of a thermodynamic system. The optimal cost of the system then
becomes the ground state of this thermodynamic system. The main challenges in the
simulated annealing algorithm are the definition of the cost factors involved and the
selection of annealing moves, bias values and cooling schedule. The cost factors involved
in our clustering analysis are split into hard specification constraints, like latency
and wire complexity, whose biases will be high, and soft implementation constraints,
like bandwidth and area deviations, whose bias values will be low. These biases can
also be set by the designer. Apart from these, other cost factors can be added
to the algorithm, such as power dissipation and network logic area. We found these
factors to be unfruitful, as they showed no significant variation with the
analysis models used when the target number of clusters is fixed. But these factors
vary significantly between different numbers of clusters. Therefore, the design space is
explored for these soft constraints by varying the number of clusters and examining
their effect on the cost.
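One simple way to picture the biased cost described above is a weighted sum of constraint violations, with large weights on the hard constraints and small weights on the soft ones. The sketch below is an assumption about the general shape of such a cost function; the bias values (100 and 1), the constraint names and the function name `clustering_cost` are hypothetical and are not taken from the clustering engine itself.

```python
# Hypothetical default biases: hard constraints weigh far more than soft ones.
DEFAULT_BIASES = {
    "latency": 100.0,          # hard specification constraint
    "wire_complexity": 100.0,  # hard specification constraint
    "bandwidth": 1.0,          # soft implementation constraint
    "area": 1.0,               # soft implementation constraint
}


def clustering_cost(violations, biases=None):
    """Weighted sum of violations.

    violations maps a constraint name to the (normalized) amount by which the
    current clustering exceeds that constraint; 0 means the constraint is met.
    biases lets the designer override any of the default weights.
    """
    weights = dict(DEFAULT_BIASES)
    if biases:
        weights.update(biases)
    return sum(weights[name] * amount for name, amount in violations.items())


if __name__ == "__main__":
    # A latency violation dominates an area deviation of similar size.
    print(clustering_cost({"latency": 0.5, "area": 0.25}))
    # The designer raises the area bias for an area-critical design.
    print(clustering_cost({"latency": 0.5, "area": 0.25}, {"area": 10.0}))
```

With weights of this kind, the annealer is pushed to clear hard violations first and only then to trade off the soft deviations.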
The algorithm maps resources randomly to different clusters in the initialization step.
It then searches for resources with user constraints on connectivity and assigns those
resources to the same cluster. Any move separating them will not be accepted after
this assignment. The initialization step is followed by simulated annealing. A move
is defined as the removal of a core from a cluster and its addition to a different cluster.
The move is accepted or rejected based on the change in cost. All moves that result
in an improvement are accepted, and moves that do not result in an improvement are
accepted based on the temperature at which the move is made and the cost difference.
Many moves are made at the same temperature until a limit is reached on the number
of attempts. When this limit is reached, the temperature is lowered at a user specified
rate (default of 0.95). When the temperature is high, the probability of accepting a
negative (cost-increasing) move is high, but as the algorithm cools off, this probability
becomes lower and lower, until a point is reached when no further improvement is attained
for a temperature. Simulated annealing is stopped at this point and the violation and cost
reports are passed on to the designer for analysis. Figure 9 shows the decrease in
cost over iterations. The three curves depict the maximum, minimum and average of
accepted costs at each temperature. The experiments were conducted for a random
resource graph with 85 resource nodes and 200 edges, mapped to 16 clusters.
Fig. 9. Results of Clustering Algorithm
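The annealing loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the acceptance probability exp(-delta / T) is the standard Metropolis rule and is assumed here, the toy cost (total traffic crossing cluster boundaries) stands in for the real analysis models, user constraints are omitted, and all function and parameter names are hypothetical. Only the move definition, the fixed number of attempts per temperature, the 0.95 cooling rate and the stopping rule follow the text.

```python
import math
import random


def cut_traffic(edges):
    """Toy cost: total weight of edges whose endpoints sit in different clusters."""
    def cost(assign):
        return sum(w for (a, b, w) in edges if assign[a] != assign[b])
    return cost


def anneal(cores, num_clusters, cost_fn, t_start=10.0, cooling=0.95,
           attempts_per_temp=200, seed=0):
    """Cluster cores by simulated annealing; requires num_clusters >= 2."""
    rng = random.Random(seed)
    # Initialization: map every core to a random cluster.
    assign = {c: rng.randrange(num_clusters) for c in cores}
    cost = cost_fn(assign)
    best, best_cost = dict(assign), cost
    temp = t_start
    while True:
        improved = False
        for _ in range(attempts_per_temp):
            # A move removes one core from its cluster and adds it to another.
            core = rng.choice(cores)
            old = assign[core]
            assign[core] = rng.choice([k for k in range(num_clusters) if k != old])
            new_cost = cost_fn(assign)
            delta = new_cost - cost
            # Improving moves are always accepted; worsening moves are accepted
            # with a probability that shrinks with delta and with temperature.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(assign), cost
                    improved = True
            else:
                assign[core] = old  # reject: undo the move
        if not improved:
            break  # no further improvement at this temperature
        temp *= cooling
    return best, best_cost


if __name__ == "__main__":
    cores = ["a", "b", "c", "d"]
    edges = [("a", "b", 5), ("b", "c", 1), ("c", "d", 5)]
    print(anneal(cores, 2, cut_traffic(edges)))
```

The toy cost is trivially minimized by putting every core in one cluster; a realistic cost would add the latency, wire complexity, bandwidth and area terms so that the balance between clusters matters.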
Table I. ITRANS 2001 Roadmap
Year    Power (W)    Vdd (V)    Clock rate (MHz)
2001    130          1.1        1684
2002    140          1          2317
2003    150          1          3088
2004    160          1          3990
2005    170          0.9        5173
2010    218          0.6        11511
2016    288          0.4        28751
1. Analysis Models for Clustering Cost Evaluation
We have developed cost models for the area, power and latency of the hierarchical mesh
network for fast estimation of cluster requirements. This section explains these
models and the assumptions that have been made. One of the most important features
of a hierarchical mesh architecture like the one explored here is that the system can
have a Globally Asynchronous Locally Synchronous (GALS) architecture [26, 27]. The
packet switched network is a globally asynchronous medium, whereas each cluster/tile
can be made to operate on a single synchronous clock. This makes it possible to
reduce the global clock power dissipation, clock skew and synchronization problems,
and the number of repeaters used for signal integrity. On the downside, the clock
generation circuitry becomes more complex, and asynchronous interfaces have to be
developed for the tile-network interface. In spite of this, ITRANS [1] predicts that
wire-delay-dominated complex SoC designs may have to turn to GALS for a global wiring
solution in this decade.
This is mainly because the clock frequency is increasing at such a high rate (as
shown in Table I) that a signal could take multiple clock cycles to travel from
one end of the chip to the other [28]. The wire delay and gate delay for various process
technologies are shown in Figure 10. GALS also makes it possible to do localized
frequency scaling, which is a feature that can be explored with our network simulator.
Fig. 10. Comparison of Wire Delay and Gate Delay
The clock power [29] is obtained by the Equation 3.1. The power dissipated by clock
is proportional to the product of the clock load and the length of clock wire to the
clock load.
\[ P_{clk} = k \cdot A \cdot 2\sqrt{A} \tag{3.1} \]
The clock is assumed to be distributed uniformly across the area (A) and the length
of the clock wire is proportional to the square root of area, while k is a proportionality
constant.
The length of the interconnect is obtained by analyzing the interconnection wires
that are needed between the input and output controllers, shown in Figure 6. For our
network structure, the total interconnection length (Li · Ni) is given by Equation
3.2 after neglecting the control wires, which are far fewer than the data wires.
\[ L_i \cdot N_i = (\mathit{ClusterLength}) \cdot (22 \cdot \mathit{ChannelWidth}) \tag{3.2} \]
The interconnect power dissipation, Pint, is modeled by the Equation 3.3.
\[ P_{int} = a \cdot V_{dd}^{2} \cdot f \cdot (22 \cdot \mathit{ChannelWidth} \cdot C_w \cdot \mathit{ClusterLength} + C_b \cdot N_b) \tag{3.3} \]
The interconnect power is the sum of the power dissipated in charging and discharging
the interconnect capacitance (Cw) and the power dissipated by the repeaters placed
along the length of the wire according to the critical length (Lcrit). The value of a,
the probability of a wire switching from 0 to 1, is assumed to be 0.15 for a global
wire, but this value will decrease as the number of clusters increases because the
switching activity is spread over the network. It will be much higher for a shared
bus. The values of Vdd and Cw will also be much higher for a shared bus, as the wire
load on a shared bus becomes higher. A distribution of the power dissipation of the
different components of the design is given in Figure 7. The repeaters are modeled as
spread along the interconnection length (Li · Ni), one after each critical length (Lcrit),
which is assumed to be a constant for the specified metal wire and load. The number of
repeaters (Nb) is obtained from Equation 3.4. The value of the buffer capacitive load Cb is
an input parameter to the clustering engine and is a constant for a given buffer design.
\[ N_b = \frac{L_i \cdot N_i}{L_{crit}} \tag{3.4} \]
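As a rough illustration, Equations 3.1 to 3.4 can be evaluated with a few small
functions. This is a sketch under stated assumptions rather than the clustering
engine's implementation: the struct, the function names and the choice of units are
hypothetical, and the repeater count is kept as a real number exactly as Equation 3.4
gives it (a real tool would probably round it).

```cpp
#include <cmath>

// Hypothetical parameter bundle; field names are illustrative only.
struct InterconnectParams {
    double clusterLength;  // side length of a cluster
    double channelWidth;   // number of data wires per channel
    double cw;             // wire capacitance per unit length (Cw)
    double cb;             // capacitive load of one repeater (Cb)
    double lcrit;          // critical length between repeaters (Lcrit)
    double activity;       // switching probability a (0.15 for a global wire in the text)
    double vdd;            // supply voltage (Vdd)
    double freq;           // clock frequency f
};

// Equation 3.1: clock power for a region of area A with proportionality constant k.
double clockPower(double k, double area) {
    return k * area * 2.0 * std::sqrt(area);
}

// Equation 3.2: total interconnection length Li * Ni.
double totalWireLength(const InterconnectParams& p) {
    return p.clusterLength * (22.0 * p.channelWidth);
}

// Equation 3.4: number of repeaters, one per critical length (not rounded here).
double repeaterCount(const InterconnectParams& p) {
    return totalWireLength(p) / p.lcrit;
}

// Equation 3.3: interconnect power = wire switching power + repeater power.
double interconnectPower(const InterconnectParams& p) {
    const double wireCap = 22.0 * p.channelWidth * p.cw * p.clusterLength;
    const double repeaterCap = p.cb * repeaterCount(p);
    return p.activity * p.vdd * p.vdd * p.freq * (wireCap + repeaterCap);
}
```

Keeping each equation in its own function makes it easy to sweep the number of
clusters (and hence the cluster length and area) when exploring the design space.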
The power dissipated by the interconnect logic is proportional to the number of
storage elements, added up with the overhead of input and output controllers. The
same holds true for the logic area of the network.
The wiring complexity (Wc) is modeled as the ratio of the tracks needed by the
interconnection network inside a cluster to the total number of tracks available, as
shown in Equation 3.5.
\[ W_c = \frac{(\mathit{ClusterArea}) \cdot (\mathit{WireUsageFactor})}{(\mathit{WirePitch}) \cdot (9 \cdot \mathit{ClusterLength} \cdot \mathit{ChannelWidth})} \tag{3.5} \]
Here, we have assumed that the wiring is done in two layers of metal, so the total
interconnection length is divided by two. When this ratio is greater than one, the
wiring complexity constraint is violated. This violation makes the annealing cost of
the network high when the area of the cluster becomes very small. Equation 3.5 gives an
upper bound on the number of clusters that should be used for a specific process
technology for this distributed network architecture. It should be noted that for
this architecture, the wiring complexity is higher within a cluster than between two
clusters. To reduce it, the channel size within the distributed switch can be reduced,
but this increases the latency.
The latency for travel between two nodes is the product of the hop latency and
the number of hops. The hop latency for the network (Thop), given by Equation 3.6,
is the sum of the wait time (Tw) to get a free buffer and the time to get through the
input (Ti) and output controllers (To) and twice the time of flight (Tf) through the
wire in network clock cycles.
\[ T_{hop} = T_w + T_i + T_o + 2 \cdot T_f \tag{3.6} \]
\[ T_w = \frac{1}{1 - \dfrac{(1-\rho)\,\rho^{k}}{1-\rho^{k+1}}} \tag{3.7} \]
The wait time is modeled as the wait time to get a free buffer in an M/M/1 queue [30]
with k buffers, as shown in Equation 3.7. The number of network clocks to get
through the input controller (Ti) and output controller (To) is assumed to be one
without any pipelining. The time of flight (Tf) delay is obtained from a lookup table for
the specified interconnection length. The time of flight is counted twice because of
the inter-cluster propagation time of the folded torus (Tf) and the propagation delay
between the input and output controllers (Tf). This additional delay will not be present
for a mesh structure, but the average number of hops to reach a destination will
increase. The lookup table is created using delay values obtained from an Elmore
delay model of the interconnect between two standard CMOS buffers for the specified
length.
Fig. 11. Tradeoffs in Area and Power with Cluster Size
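The latency model of Equations 3.6 and 3.7 can be sketched in a few lines. The
following is an illustration under assumptions, not the thesis code: the queue
utilization is called rho here, Ti and To default to one cycle each (one reading of
the text), the time of flight is passed in directly instead of being read from the
lookup table, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability that all k buffers of the queue are occupied, i.e. the fraction
// subtracted from one in the denominator of Equation 3.7.
// Requires 0 <= rho < 1 and k >= 1 (rho == 1 would divide zero by zero).
double blockingProbability(double rho, int k) {
    assert(rho >= 0.0 && rho < 1.0 && k >= 1);
    return (1.0 - rho) * std::pow(rho, k) / (1.0 - std::pow(rho, k + 1));
}

// Equation 3.7: wait time (in network clock cycles) to obtain a free buffer.
double waitTime(double rho, int k) {
    return 1.0 / (1.0 - blockingProbability(rho, k));
}

// Equation 3.6: Thop = Tw + Ti + To + 2 * Tf.
// ASSUMPTION: ti and to default to one cycle each; tf stands in for the
// lookup-table value described in the text.
double hopLatency(double rho, int k, double tf, double ti = 1.0, double to = 1.0) {
    return waitTime(rho, k) + ti + to + 2.0 * tf;
}

// Node-to-node latency: hop latency multiplied by the number of hops.
double pathLatency(int hops, double rho, int k, double tf) {
    return hops * hopLatency(rho, k, tf);
}
```

With an idle network (rho equal to zero) the wait time reduces to a single cycle, so
the hop latency is dominated by the controller and time-of-flight terms.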
2. Clustering Results
Figure 11 gives us an insight into the effect of the number of clusters on the clock power,
area and interconnect power dissipation. Clock power decreases monotonically as
the number of clusters increases, whereas the area of the interconnection logic increases.
The interconnect logic is mostly in the form of network logic for a large number of
clusters, as the number of buses inside a network tile reduces. Interconnect power is
measured as a percentage of the total maximum power dissipation obtained from
Table I. The initial decrease in the power consumption of the interconnect can
be attributed to the reduction of bus load, and the subsequent increase shows the
diminishing returns of having more network logic and wiring. It is therefore possible
to assign an operating point for the network in terms of the number of clusters (sixteen
for this example).
Fig. 12. Latency Failure Comparison of Bus and Network
Figure 12 shows the variation in latency failures of the bus and the network
with the number of clusters. The bus is unable to support the huge number of cores when
there is a single cluster, as one would expect of a shared bus. As we increase the number
of clusters, the failures that occur over the bus decrease because the bus load becomes
progressively lighter. The figure also shows that, in spite of monotonically increasing
the throughput of the network by increasing the number of network tiles and reducing
the size of each cluster, the latency of communication between two components can
cause a number of failures in communication. These failures can be partly attributed
to the excessive number of hops taken by the data between communicating cores
and partly to the sparseness in connectivity of the task graph that was taken as the
input. When the task graph was connected more densely, with a higher average fanout,
cores with higher connectivity were mapped into neighboring clusters because
the algorithm was able to make a clear distinction between cores that need to
communicate over the network and cores that need to be present within a cluster.
But when the cores were connected sparsely, this distinction was absent and cores were
mapped in an ad hoc fashion. Decreasing the size of clusters makes matters worse,
because when the number of cores inside a cluster becomes smaller, more cores are made
to communicate over the network even though they may need to be mapped into
the same cluster. This caused a fixed number of failures to occur, irrespective of
the number of iterations for which the simulation was run. Therefore the number of
clusters should be kept at the optimum level required by the design and its communication
complexity. Another important observation from Figure 12 is that the assignment
of buses to certain communicating cores (clustering) whose latencies fall above the
expected network latency could be done beforehand by the designer for more optimal
results. The designer can use the user-defined constraint for this purpose.
D. Cluster Pro�le Annotation
Profile annotation is the process of creating a connection matrix specifying the
connection parameters of the different clusters or tiles of the network. This is used for
the architectural synthesis of the network. The aggregate communication bandwidth
of the tile and the minimum of the latencies of the cores inside the tile are some
connection parameters that become important Quality of Service (QoS) parameters to be
satisfied by the network synthesizer. Apart from this, the location of communicating cores
in the network is also given as an input to the network and interface synthesizer, so that
they can calculate the source routes and generate packets to the appropriate destination at
the specified rate with the specified random distribution of the connection matrix. It
is also possible to increase the accuracy of architectural synthesis by annotating the
communication parameters of each and every sender and receiver (cores) rather than using
aggregate throughput and minimum latency measures.
E. Intra-Cluster Synthesis
Once clusters have been determined, we focus on each cluster and synthesize the
communication requirements for each cluster. Intra-cluster synthesis is mainly split
into protocol generation and interface generation. In other words, we should allocate
actual physical channels and synthesize the glue logic and software drivers to make
them communicate effectively. As a designer should explore the mapping of logical to
physical channels in the best possible fashion, we need a fast estimation scheme like
those proposed in [17, 31, 32, 33] for intra-cluster communication. The main issues
that need to be addressed for intra-cluster synthesis are the protocols and services
used for communication, and the bus width and data generation frequency differences
between the communicating cores, for buffering and glue logic generation. The synthesis
of logic inside the cluster is closely related to clustering, and this process happens
iteratively with clustering. We have not investigated the detailed synthesis of
intra-cluster communication in this thesis.
F. Conclusion
Clustering gives the designer an idea about the cost involved in choosing an arbitrary
number of clusters. It also provides the designer with the necessary data needed for
the next step of the synthesis process: intra-cluster and inter-cluster synthesis. It
reduces the number of iterations spent in both syntheses.
CHAPTER IV
NETWORK ARCHITECTURE AND SYNTHESIS
A. Introduction
The main goal of architecture level communication synthesis for NoCs is to explore
the NoC design space and arrive at an optimal global communication architecture.
A detailed simulator tailored for NoC simulation assists the synthesis process. The
central idea behind architectural synthesis is to minimize the resource requirements
of NoC, while satisfying QoS requirements.
Figure 13 shows the proposed communication architecture design and verification flow
using our SystemC methodology library based design. The network creation file and
the communication description file act as the configurators of the various inputs of the
simulator. Network simulation is done iteratively by changing the network creation
file with the aim of optimizing the network. It is necessary that the synthesized
network be verified using actual cosimulation trace inputs and interface models written
in SystemC before RTL or gate level synthesis. The distributed network [4], which has
been used for analysis throughout this work, is just one example of a NoC. There are
other possible topologies, routing and flow control mechanisms that can be used in
a NoC environment. The simulator that we have developed is intended for research
into these areas of NoC design. Irrespective of the network used, there are a few basic
rules for NoC design. The design should be power efficient, as power consumption is a
limiting factor in SoCs. Buffer space is costly whereas wires are cheap and abundant,
contrary to off-chip environments. Latency is a limiting factor of the packet switched
interconnection scheme and should be reduced or hidden to the maximum possible
extent. The clock rate can be made arbitrarily large by pipelining and is ultimately
going to be bound by the time of flight delay. Protocol stack implementation plays
a vital role in deciding the performance of the network. While designing the protocol
stack and layering schemes, simplicity and performance should be more of a concern
than generality and dynamic capabilities of the network. This is because, once
a network is laid out on the chip, the degree of reconfigurability that is required
is much lower than that of large scale interconnection networks. Also, at all steps
of the design and implementation process, the advantages and limitations of a NoC
environment should be kept in mind. The distances are very small, and noise levels are
dependent on the aggressiveness of the design. The synthesis of the network is also
dependent on the application at hand and the service it requires. The next section deals
with the architectural details and features of the synthesized network.
[Figure 13: flow diagram. Blocks: Clustering based Estimation, Network Creation File,
Network Simulation, Analysis of Performance and Utilization, Network Behavioral Model
in SystemC, Cosimulation Traces, Interface Behavioral Model in SystemC, Designer Input,
Communication Description File. Stages: Synthesis, Verification]
Fig. 13. Synthesis and Verification Methodology
B. Architecture
The network architecture that we synthesize has a folded torus topology with source
based routing and virtual channel flow control, similar to the one proposed in [4].
The topology is folded in this particular fashion in order to distribute
the wire delays equally between hops. This makes the structure regular and can
save global interconnection power by splitting the wire. This architecture is shown
in Figure 5. The other features of the architecture are back pressure based on credit
flow, reliable and in-order delivery of packets, and dynamic virtual channel assignment.
In any network, the three main parameters that characterize performance are flow
control, routing and topology. These will be explained in the following sections. But
a brief discussion of the working of the network and the format of the packets that
traverse the network is necessary at this point. The reader is referred to Figure
6 for a description of the top level interconnection between the output controllers and
input controllers of the switch and the interconnections between the network nodes.
Each node has five of these input and output controllers: one for each direction of
the network and one for communication with the cores inside the node. The five
input and output controllers can be visualized as a single crossbar switch with five
inputs and outputs. The on chip implementation of this is a distributed switch, as
shown in Figure 6.
TYPE | VCID | ROUTE | DATA
Fig. 14. Packet Format
1. Network Operation
Each packet has the fields shown in Figure 14. Flits are classified as head, data, tail
and complete flits. Flits are the smallest units of data that carry flow control
information. The first flit of a packet is the `Head' flit. This is followed by multiple
`Data' flits and, at the end, a `Tail' flit. If a packet is only as big as a flit, then the
flit is `Complete', because it is the head, data and tail flit combined. This distinction
is indicated by the type field. The division of a packet into flits is necessary for the
packet to be able to traverse the network after being segmented into many pieces.
This segmentation becomes a necessity because the channels or buffers need not be
wide enough to carry complete packets. Segmentation is also a way of reducing latency,
as the flits can move towards the destination without waiting for their successors
to catch up. The sender sends a packet by sending the head flit into the core input
controller and then sending data flits in successive clock cycles. The head flit gets
routed into the appropriate output controller (shown in Figure 15), based on the
route field specified in the flit, and the same route is set for the rest of the data flits
for that virtual channel until the tail flit frees the virtual channel. The head flit then
tries to reserve a virtual channel at the next input controller (shown in Figure 16)
when it gets into the output controller. The assignment of virtual channels is made
on demand at the output controllers for the immediately succeeding input controller.
[Figure 15: block diagram. Flits In and Credits In from DIR 1 to DIR 4; VC Updations;
Mux with selection on buffer state and count; Flit Out; Credits Out]
Fig. 15. Output Controller Architecture
[Figure 16: block diagram. Flit In; Demux; virtual channel queues VC 1 to VC 4; Virtual
Channel Forwarding Unit; Mux with selection based on FIFO and credits; Demux with
selection based on route; Credit State; VC State; Flit Out; Credit In; Credit Out]
Fig. 16. Input Controller Architecture
If no virtual channel is available at the next node, the output buffer gets blocked
and so does the rest of the virtual channel. When the head flit gets a virtual channel,
it reserves the virtual channel for its data flits and tail flit by setting the route
field of that virtual channel to the shifted route field in its header. The subsequent
data flits use the same virtual channel reserved by the head flit. When the corresponding
data flits arrive at the next input controller, they are identified by their virtual
channel identifiers and put into the appropriate FIFO queue (virtual channel). The
route has already been set by the head flit, so they just move to the output buffer
depending on the output buffer availability. The availability of the output buffer
depends on the availability of buffers in the next assigned virtual channel. In this
fashion, a single packet passes through the network based on buffer availability. This
drastically reduces the buffer requirement and therefore the latency of communication.
The availability of buffers is passed as credit signals between neighboring nodes
and also between the input and output controllers of the same node. It is also possible
to implement these as fields in the packet, but this implementation increases the
amount of packet traffic. The method of having separate control wires and data wires
can also be used for better flow control schemes that take advantage of the on chip
communication scenario. The data field of the packet carries the actual communication.
It could be in the form of a request/response or a stream packet. Request/response
packets can be used for address space communication, whereas stream packets are
for streaming data between two cores. It is possible to map all kinds of data flow into
these two scenarios. A processor communicating with the distributed shared memory
could use address space communication, whereas an FIR filter core passing a stream
of data to another signal processing core or a peripheral can use the streaming mode
of communication. The actual interpretation of data in the data field is handled by
the core network interface, which will be discussed in the forthcoming sections. This
is an example of the layered approach to communication in our synthesis flow. The
logical network layer is abstracted from the wrapper, which interprets the data passed
by the network layer. The disadvantage, of course, is the amount of space wasted in
the physical channel for carrying network flow information like the route field and
virtual channel identifier. The next section underlines the importance of the virtual
channel flow control mechanism.
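The segmentation of a packet into head, data, tail and complete flits described in
this section can be illustrated with a short sketch. The field widths, the Flit struct
and the segment function are hypothetical and are not taken from the simulator; the
sketch also assumes that every flit, including the head flit, carries one payload word.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Flit types carried in the TYPE field of the packet format.
enum class FlitType : std::uint8_t { Head, Data, Tail, Complete };

// ASSUMPTION: field widths are illustrative; the thesis does not fix them here.
struct Flit {
    FlitType type;        // TYPE
    std::uint8_t vcid;    // VCID: virtual channel identifier
    std::uint16_t route;  // ROUTE: source route
    std::uint32_t data;   // DATA: one payload word per flit (assumed)
};

// Split a packet payload into flits. A one-word packet becomes a single
// Complete flit; otherwise the first word is the Head, the last is the Tail,
// and everything in between is Data. An empty payload yields no flits.
std::vector<Flit> segment(const std::vector<std::uint32_t>& payload,
                          std::uint8_t vcid, std::uint16_t route) {
    std::vector<Flit> flits;
    const std::size_t n = payload.size();
    for (std::size_t i = 0; i < n; ++i) {
        FlitType type = FlitType::Data;
        if (n == 1)          type = FlitType::Complete;
        else if (i == 0)     type = FlitType::Head;
        else if (i == n - 1) type = FlitType::Tail;
        flits.push_back(Flit{type, vcid, route, payload[i]});
    }
    return flits;
}
```

A sender would push the resulting flits into the core input controller in successive
clock cycles, head flit first.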
2. Flow-control
Flow control is a very important area of research as it controls the allocation of
network resources like channels and buffers. Buffer space can be utilized more efficiently
by using more intelligent flow control schemes like [34]. The network traffic flow
control that we have used is a hop by hop blocking mode of flow control. Hop by hop
flow control makes it possible to convey congestion information from any part of the
network to the source, and therefore the data gets blocked at the sender itself. This is
a way of informing the source that an alternative path could be taken to reach the
destination if there is support for multiple routes. This flow control is faster because
it does not have to rely on timeout mechanisms and acknowledgement packets
coming back from the destination. Virtual channels are built using a set of buffers
(queues) at the input, so that packets (flits) are not blocked by a
previous packet (flit) that is unable to leave the switch because it is itself blocked. When
there is a blockage in any virtual channel, another available virtual channel can
be used by other communicating cores for transmission. Virtual channel flow control
has the multiple advantages of avoiding deadlock, providing bandwidth guarantees to
flows and improving the saturation throughput. But adding virtual channels without
any necessity can cause increased complexity and latency in the network. A more
detailed analysis of virtual channel flow control can be found in [35]. It should be
noted that setting the number of virtual channels to one gives us an ordinary
buffered network with FIFO arbitration. The next section gives more details about
the routing scheme with an example.
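The credit-based back pressure used by this hop by hop scheme can be sketched with a
simple counter kept at the sending side of a link. This is an illustrative model only;
the class name and its methods are hypothetical and do not correspond to the
simulator's interfaces.

```cpp
#include <cassert>

// Tracks how many buffer slots are known to be free in the downstream
// virtual channel. A flit may be sent only while credits remain.
class CreditCounter {
public:
    explicit CreditCounter(int depth) : depth_(depth), credits_(depth) {}

    bool canSend() const { return credits_ > 0; }

    // Consume one credit for a flit that is sent; returns false (and sends
    // nothing) when the downstream buffer is full, which blocks the sender.
    bool trySend() {
        if (credits_ == 0) return false;
        --credits_;
        return true;
    }

    // Called when the downstream node signals that it has freed one buffer slot.
    void creditReturned() {
        assert(credits_ < depth_);  // more credits than buffers would be a protocol error
        ++credits_;
    }

    int credits() const { return credits_; }

private:
    int depth_;    // downstream buffer depth in flits
    int credits_;  // free slots currently known to the sender
};
```

Because a blocked output buffer stops returning credits to the node before it, the
blocking propagates backwards hop by hop until it reaches the source.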
3. Routing
We have experimented with two choices of routing, namely source based routing and
dynamic routing. Source based routing is simple and, we believe, has enough flexibility
for on chip implementation. The latency in the network logic is reduced because of
the absence of any lookups, which is not the case with dynamic routing, where the
routing decision is made by the network logic.
We will explain the routing scheme with the example shown in Figure 17. After
receiving a packet, the input controller that received the packet on that node makes
a decision based on the last two bits of the route field. The values 00, 01, 10 and 11
indicate to the controller that the flit is to be routed to the left, right, above and into
the core, respectively. Let us assume a case in which a sender gives a source route of 54,
which translates to 00110110 in binary, to the core input controller. In the first step,
the core controller looks into the source route and extracts the last 2 bits, which are
10. The packet is therefore forwarded straight (above) from the node using the routing
logic explained above. The next controller analyzes the shifted route field (001101),
extracts the last 2 bits (01) and decides to forward the packet to its right. Finally, the
receiver node extracts the packet to itself after finding 11 in the remaining route, and
the packet is sunk.
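The decoding steps of this example can be reproduced with a small sketch. The Port
enumeration and the function names are hypothetical; only the two-bit encoding (00
left, 01 right, 10 above, 11 core) and the shift by two bits per hop come from the text.

```cpp
#include <cstdint>
#include <vector>

// Two-bit direction codes from the routing example.
enum class Port : std::uint8_t { Left = 0, Right = 1, Above = 2, Core = 3 };

// The controller looks at the last two bits of the route field.
Port nextPort(std::uint32_t route) {
    return static_cast<Port>(route & 0x3u);
}

// After the decision, the route field is shifted right by two bits
// before being handed to the next controller.
std::uint32_t shiftRoute(std::uint32_t route) {
    return route >> 2;
}

// Follow a source route until the packet is delivered into a core.
// A 32-bit route holds at most 16 two-bit codes, which bounds the loop.
std::vector<Port> decodeRoute(std::uint32_t route) {
    std::vector<Port> hops;
    for (int i = 0; i < 16; ++i) {
        const Port p = nextPort(route);
        hops.push_back(p);
        if (p == Port::Core) break;  // 11: sink the packet at this node
        route = shiftRoute(route);
    }
    return hops;
}
```

For the route 54 (binary 00110110) used in the example, decodeRoute yields Above,
Right, Core, matching the three steps described in the text.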
The simulator also has the capability to support routing decisions at the controller
depending on the destination address. This is called dynamic routing. It increases
the latency and storage requirements of the node but decreases the packet overhead,
as the route field will be much larger than the address field. Another extension to
routing that we have made is the implementation of a multicast routing scheme for this
architecture. This decreases the latency of communication with very little increase in
network logic, as explained in [36].
[Figure 17: grid of VC nodes showing a packet path from Sender to Receiver. The sender
sends a packet with route 00110110; the sender node forwards the packet straight because
of 10; the intermediate node forwards the packet to its right because of 01; the receiver
node sinks the packet because of 11]
Fig. 17. Network Routing Example
4. Topology
For our study, we have assumed the folded torus topology shown in Figure 5. The
advantage of this topology is its reduced number of hops for communication compared
with other regular topologies like the mesh. There is no special significance to this
topology other than the ones already expounded in classical texts on the subject, apart
from its application to the on chip environment, to which it is suited because of its
reduced wire complexity and its reduced signal skew and skew variations on account of
the regular structure. Although a crossbar has a very low latency of communication, it
may not be scalable for complex networks on chips because of its wire complexity. The
simulator that we have developed is programmable for arbitrary topologies with slight
modifications in the controller logic. This is because all topologies have inputs and
outputs, and therefore input and output controllers. The difference is in the number of
I/O ports and the interconnections between the ports.
5. Switching
Switching schemes that are currently used in switching networks can be classified
as circuit switching, packet switching and hybrid switching [37]. We have adopted
packet switching as our scheme of implementation, but the choice of switching scheme is
application dependent: high locality communication works better in a circuit switched
environment, while more random communication benefits from packet switching.
6. Synthesis Variables
Synthesis of this network aims at selecting the appropriate number of buffers, virtual
channels, channel widths, packet formats and network topology, and the location of
cores in the network. A rough decision on the relative location of cores in the network
is available from the clustering step, but further refinement using
detailed simulation can reveal violations of latency or bandwidth constraints that
could not be captured by the analytical models. The next section gives details about
the simulator, the synthesis process and the results that were obtained for the
synthesis of the case study involving 85 resources and 200 connections. The resources
were mapped by the clustering phase into 16 clusters, the network operating point.
C. NoCSIM - An On Chip Network Simulator
1. SystemC - A System Level Design Language
SystemC [10] is a C++ class library that can be used to create cycle-accurate
models of software algorithms, hardware architectures, and interfaces for SoC
and system-level designs. The SystemC library and standard C++ development tools can
be used to create a system-level model and simulate it to validate and optimize the
design. The class library provides the notions of signals, clocks and threads, which are
the most important entities for hardware simulation. Modules in SystemC are hierarchical
entities that can contain other modules and processes. Processes are used
to describe the behavior of a module and can be made sensitive to a clock or made to
wait on events. Modules communicate with each other through a rich set of ports and
signals. SystemC can be used to model systems at different levels of abstraction,
ranging from untimed functional models to clock cycle accurate register transfer level
models. But simulation speed drops as the description becomes more detailed. A system
level description of a design should not go down to clock cycle accurate RTL levels.
The main advantage of writing the network simulator in SystemC, apart from its speed,
is that it is tailored for system level design and verification of hardware and software.
It provides a rich set of classes for modeling hardware. Moreover, integration of the
simulator with models of processors, memories and cores written in SystemC becomes
easy. Another important advantage is that a description written in SystemC can,
within the subset supported by commercial tools, be synthesized to gate level and
further down.
2. Features of the Network Simulator
The network has been developed at the flit level and is parameterizable for the synthesis
variables that have been discussed. It is well suited to the estimation of network
parameters for a complex SoC. The class hierarchy and programmability of the NoCSIM
simulator are shown in Figure 18. The simulator can generate input traffic patterns
with Constant Bit Rate (CBR) and random Poisson distributions. Another feature
that has been implemented for accuracy in a codesign environment is
cosimulation trace inputs. These traces can be obtained from cosimulation
tools like Ptolemy. They can be compiled to generate packets, similar to the Bus
Functional Models (BFM) used for design verification. The outputs (monitors) that can
be used to analyze the network performance include latencies of resource pairs, network
resource utilizations and channel activity rates for power estimation. These outputs are
useful for analysis and for carrying the synthesis process further to the gate level. The
simulator can also output Value Change Dump (VCD) trace files for functional debugging
of new protocols, flow control and routing schemes. The simulator can also be run
with different network clock rates for frequency scaling applications, and the tiles can
have independent clock domains. The simulator can simulate any k-ary n-cube
topology, but for arbitrary topology extensions, the switch input and switch output
may need to be changed. The inputs and outputs of the switch will
always be present irrespective of the topology of the network; the characterizing
features that need to be modeled are the number of input and output ports and
the interconnections between them inside and outside the switch. The main extension
that has been carried out so far in the simulator is the implementation of a multicast
routing scheme [36]. More advanced flow control and routing schemes can also be
modeled in a similar way.
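The two synthetic traffic patterns mentioned above can be sketched as arrival-time
generators. This is not NoCSIM code: the function names, the use of cycle counts as the
time unit and the seeding scheme are assumptions made for illustration. The Poisson
pattern is produced in the usual way, by drawing exponentially distributed gaps
between packets.

```cpp
#include <random>
#include <vector>

// Constant Bit Rate: one packet every `period` cycles, starting at cycle 0
// and stopping before `horizon`.
std::vector<long> cbrArrivals(long period, long horizon) {
    std::vector<long> times;
    if (period <= 0) return times;  // no valid rate, no traffic
    for (long cycle = 0; cycle < horizon; cycle += period) {
        times.push_back(cycle);
    }
    return times;
}

// Poisson traffic with `rate` packets per cycle on average: inter-arrival
// gaps are exponentially distributed with mean 1 / rate.
std::vector<double> poissonArrivals(double rate, double horizon, unsigned seed) {
    std::vector<double> times;
    if (rate <= 0.0) return times;
    std::mt19937 rng(seed);  // fixed seed gives a repeatable trace
    std::exponential_distribution<double> gap(rate);
    for (double t = gap(rng); t < horizon; t += gap(rng)) {
        times.push_back(t);
    }
    return times;
}
```

Either list of arrival times could then drive a packet source, in the same way that
compiled cosimulation traces do.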
[Figure 18: block diagram of the simulator. Blocks: Monitors, Switch Input, Switch
Output, Traffic I/O, Channels, clock source, Config Inputs. Labels: routing functions;
buffers and sizes; virtual channels; channel widths; arbitration / flow control /
protocol; noise generation; activity factors; utilizations; drop/blocking rates;
protocol checking; cosimulation trace; traffic generator; protocol models; clock
domains (sources/sinks)]
Fig. 18. Simulator Structure and Features
Fig. 19. Buffers vs. Network Performance
Fig. 20. Virtual Channels vs. Network Performance
3. Network Synthesis Results
Figure 19 shows a plot of network performance variation with the number of buffers.
A minimum buffer size (12 for this example) is required to avoid excessive
drop rates at the input and output controllers. Drop rates are obtained
by measuring packet drops at each core input controller and averaging them over
100,000 simulation cycles. It is not possible to block data at the sender. Data
coming from the sender is therefore dropped at the sender when the data is bursty
or the data rate is very high. The drop rate can be reduced by providing buffers. But
beyond a certain saturation data rate, adding a buffer just increases the latency due
to more queuing and waiting. We have arrived at the optimum buffer size of 1 flit and
depth of 12 by iteratively simulating and measuring the incidence of drops. The other
parameter of concern to the designer is the latency slack, which gives the deviation
from the target latency. Slack is measured by subtracting the packet creation clock
cycle from the current clock cycle and subtracting this result from the target latency.
A negative slack indicates too much blocking in the path. This can be ameliorated
by examining the reason for drops or blocks with targeted simulation. Figure 20
shows that adding virtual channels helps only up to the point at which the
saturation bandwidth is reached. We simulated the case shown in Figure 20 by
generating flits from tiles every cycle, whereas the first simulation was under a lighter
load. The packet generators were clocked at 67 percent of the network clock. As
shown in Figure 20, under heavy traffic between communicating components, or in a
case where a sender from a core sends to many destinations simultaneously, adding
virtual channels helps in improving slack and reducing the drop rate at the senders.
Figure 21 shows that the use of multicast routing also helps in reducing latency.
Multicast routing has been implemented for this architecture [36] and was shown to
take 12*n bits of additional storage, where n is the number of nodes on the source to
destination path of the multicast group.

Fig. 21. Comparison of Multicast and Unicast Routing
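
The slack and drop-rate measurements described above can be illustrated with a minimal Python sketch. It is not the simulator's code: the `Packet` record, its field names and both function names are hypothetical, and the drop rate is computed under one reading of the text (drops per cycle at each input controller, then averaged over the controllers).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Packet:
    """Hypothetical packet record; the field names are illustrative only."""
    creation_cycle: int  # clock cycle at which the packet was created
    target_latency: int  # latency budget, in network clock cycles


def latency_slack(packet: Packet, current_cycle: int) -> int:
    """Slack = target latency - (current cycle - creation cycle)."""
    elapsed = current_cycle - packet.creation_cycle
    return packet.target_latency - elapsed


def average_drop_rate(drops_per_controller: Sequence[int],
                      cycles: int = 100_000) -> float:
    """Drops per cycle at each input controller, averaged over controllers."""
    if not drops_per_controller or cycles <= 0:
        raise ValueError("need at least one controller and a positive cycle count")
    rates = [drops / cycles for drops in drops_per_controller]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    pkt = Packet(creation_cycle=100, target_latency=20)
    print(latency_slack(pkt, current_cycle=115))  # positive: within budget
    print(latency_slack(pkt, current_cycle=130))  # negative: too much blocking
    print(average_drop_rate([1000, 3000]))
```

A packet whose slack is negative has already exceeded its target latency, which is the condition the text associates with excessive blocking on the path.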
We observe that the aggregate throughput of the proposed network can reach up to
600 Gigabits per second, assuming a 400 MHz network clock. We estimate that the area
overhead will be less than 5 percent of the total chip area for a sixteen cluster network,
based on gate level synthesis results. With the decreasing silicon cost of the chip,
increasing interconnect delay and stringent power requirements, we believe that on
chip networks definitely have an edge over the traditional buses in high performance
applications. We predict an increase in the number of clusters and network size,
in terms of buffer and channel sizes, based on technology scaling and performance
requirements of the future.
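
As a rough sanity check on the throughput figure, the sketch below converts the quoted aggregate throughput and clock rate into bits moved per network cycle. The function name is hypothetical, and the 128-bit channel width used in the second step is an assumption borrowed from the designs compared in Table III; the text does not state which width the 600 Gigabits per second figure refers to.

```python
def bits_per_cycle(throughput_bps: float, clock_hz: float) -> float:
    """Aggregate number of bits the network moves in one clock cycle."""
    return throughput_bps / clock_hz


# Figures quoted in the text: 600 Gigabits per second at a 400 MHz clock.
aggregate_bits = bits_per_cycle(600e9, 400e6)

# Assumption for illustration only: 128-bit channels.
ASSUMED_CHANNEL_WIDTH_BITS = 128
busy_channels = aggregate_bits / ASSUMED_CHANNEL_WIDTH_BITS

if __name__ == "__main__":
    print(f"bits per cycle: {aggregate_bits:.0f}")
    print(f"equivalent fully busy 128-bit channels: {busy_channels:.2f}")
```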
After architecture level synthesis of the network, the synthesis parameters are given as
input to the RTL generator, which generates the RTL code for the configured network.
Unlike the simulator, the RTL generator supports only buffer sizes, number of buffers
and number of virtual channels as parameterizable arguments. Therefore variation of
virtual channel arbitration policies and routing schemes will not be possible in the
synthesized design without modifications to the VHDL code and synthesis scripts.

Fig. 22. VLSI Complexity Cost of Adding Virtual Channels
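
The split between what the simulator can vary and what the RTL generator accepts can be sketched as follows. This is a hypothetical configuration record, not the tool-set's actual interface: the class, field names and option values (`round_robin`, `unicast`, `multicast`) are assumptions chosen only to mirror the three RTL parameters named in the text.

```python
from dataclasses import dataclass, asdict

# The three parameters the text says the RTL generator accepts
# (the identifiers themselves are hypothetical).
RTL_PARAMETERS = ("buffer_size", "num_buffers", "num_virtual_channels")


@dataclass(frozen=True)
class NetworkConfig:
    buffer_size: int           # size of one buffer entry, in flits
    num_buffers: int           # buffer depth
    num_virtual_channels: int
    # Simulator-only options; names and values are illustrative assumptions.
    vc_arbitration: str = "round_robin"
    routing: str = "unicast"


def rtl_generator_args(cfg: NetworkConfig) -> dict:
    """Keep only the fields that the RTL generator can parameterize."""
    full = asdict(cfg)
    return {name: full[name] for name in RTL_PARAMETERS}


def simulator_only_options(cfg: NetworkConfig) -> dict:
    """Options that would need VHDL or script changes to reach hardware."""
    return {k: v for k, v in asdict(cfg).items() if k not in RTL_PARAMETERS}


if __name__ == "__main__":
    cfg = NetworkConfig(buffer_size=1, num_buffers=12,
                        num_virtual_channels=2, routing="multicast")
    print(rtl_generator_args(cfg))
    print(simulator_only_options(cfg))
```

Anything returned by `simulator_only_options` corresponds to a design choice that, per the text, cannot be carried into the synthesized design without modifying the VHDL code and synthesis scripts.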
D. VLSI Implementation Feasibility
The network logic was implemented in VHDL and synthesized with Synopsys Design
Analyzer using 0.18 micron libraries. We found that the network could be clocked at
a rate of 400 MHz (critical path delay of 2.5 ns). This is based on results obtained
from gate level synthesis using a segmented wire-load model. Further pipelining of
transactions could increase this clock rate. The area and timing determination is
dependent on the gate level library and could be different for some other libraries.
Table II. Effect of Virtual Channels and Buffers in Area of Network

Virtual Channels   Buffers   Area Estimate (sq. microns)
1                  8         338,000
2                  4         354,000
2                  8         431,000
4                  4         470,000

Table III. Area Cost Comparison of 128 and 256 Bit Networks

Module              Area (128 bit design)   Area (256 bit design)
Output Controller   78,800                  154,100
Input Controller    354,000                 610,000
1. Results
Figure 22 shows the cost of increasing the number of virtual channels using the
same number of buffers (8). Increasing the number of virtual channels definitely has
a penalty in the number of nets, but as shown in Table II, the logic cost has not
increased much.
Table III compares the gate level implementation of a 128 bit and a 256
bit network in terms of area. This is a very important synthesis decision that could
be made with the help of the network simulator.
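
The area figures in Tables II and III can be compared with a few lines of Python. The numbers below are copied from the tables; the dictionary layout and the helper name are assumptions made for this sketch, and Table III does not state its units.

```python
# Area estimates in square microns, keyed by (virtual channels, buffers),
# copied from Table II.
TABLE_II = {(1, 8): 338_000, (2, 4): 354_000, (2, 8): 431_000, (4, 4): 470_000}

# (128 bit design area, 256 bit design area), copied from Table III.
TABLE_III = {
    "Output Controller": (78_800, 154_100),
    "Input Controller": (354_000, 610_000),
}


def relative_increase(base: float, new: float) -> float:
    """Fractional growth from base to new, e.g. 0.10 for a 10 percent rise."""
    return (new - base) / base


# Going from 1 to 2 virtual channels with the buffer count held at 8.
vc_cost = relative_increase(TABLE_II[(1, 8)], TABLE_II[(2, 8)])

# Ratio of 256 bit area to 128 bit area for each module.
width_ratio = {module: wide / narrow
               for module, (narrow, wide) in TABLE_III.items()}

if __name__ == "__main__":
    print(f"1 -> 2 virtual channels at 8 buffers: {vc_cost:.1%} more area")
    for module, ratio in width_ratio.items():
        print(f"{module}: 256 bit / 128 bit area = {ratio:.2f}")
```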
E. Core-Network Interface
The interface between a core and the network, shown in Figure 23, plays the part of a
layer in the protocol stack. It could be designed in a variety of ways, but any
implementation should be tailored to suit the service needs of the core. The Virtual
Socket Interface Alliance (VSIA) [3] aims to standardize the interface on the core side.
These standards are called the Virtual Component Interface (VCI) and Peripheral Virtual
Component Interface (PVCI), and they act as interfaces for processor and peripheral
cores. The advantage of having a standard like VCI is that the core functionality is
decoupled from its communication. The VCI communicates with global buses using
entities called wrappers [38]. Some preliminary investigations into the design of
wrappers for cores that produce streaming data and for configurable cores like the
Xtensa processor [39] have been conducted. Table IV summarizes the results of our
investigation. The analysis and design of wrappers for cores following these standards
is left as future work.

Fig. 23. Core Network Interface
Table IV. Interface Analysis

Wrappers for cores that produce streaming data need to have buffers and packetizers.
The packetizer looks up the route information provided in its transport
control register before sending the packet over the network. These buffers and
packetizers are either integrated with the core or separated from it. No wrappers are
needed for configurable processor cores, as they can be made to communicate directly
with the network by adding logic to the core data-path. Packetizing functionality was
added to the Xtensa processor core through the use of user specified
instructions called TIE instructions. The processor, designed with these new
instructions (e.g., packetize, which does a lookup of specific registers present in the
processor core), packetizes the data directly. We reduce the wrapper logic overhead, but
on the downside we couple the core and its communication, as opposed to the
implementation advocated by VSIA. We feel that this solution is applicable because of
the configurability of cores: if a core is configurable directly, there is no reason to add
wrappers to configure it. We found that it took two core clock cycles for the configured
Xtensa processor to do a packetization after getting the actual read instruction to memory.
The memory hierarchy has not been taken into consideration for this analysis.
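
The behaviour of a wrapper for a streaming core (buffer the data, look up the route in a transport control register, emit a packet) can be sketched in Python as follows. This is an illustrative model only: the class and field names, the byte-oriented payload and the packet layout are assumptions, and the sketch does not describe the VHDL wrapper or the TIE-based packetize instruction.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    route: int      # route information read from the transport control register
    payload: bytes  # data taken from the wrapper's buffer


class StreamingWrapper:
    """Hypothetical wrapper model: a buffer plus a packetizer."""

    def __init__(self, payload_bytes: int, route: int = 0) -> None:
        self.transport_control = route  # stand-in for the transport control register
        self.payload_bytes = payload_bytes
        self.buffer = deque()           # holds individual byte values

    def push(self, data: bytes) -> None:
        """Buffer streaming data arriving from the core."""
        self.buffer.extend(data)

    def packetize(self) -> Optional[Packet]:
        """Emit one packet if enough data is buffered, otherwise None."""
        if len(self.buffer) < self.payload_bytes:
            return None
        payload = bytes(self.buffer.popleft() for _ in range(self.payload_bytes))
        return Packet(route=self.transport_control, payload=payload)


if __name__ == "__main__":
    wrapper = StreamingWrapper(payload_bytes=4, route=5)
    wrapper.push(b"abcdef")
    print(wrapper.packetize())  # first four bytes, tagged with the route
    print(wrapper.packetize())  # not enough data left yet
```

In the configurable-core alternative described above, the same lookup and packet assembly would happen inside the processor data-path rather than in a separate wrapper object.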
F. Conclusion
In this section, we have introduced our synthesis scheme and the simulator that
we have developed. We have also demonstrated the other issues involved in network
synthesis like wrapper synthesis. We have also discussed the feasibility of the network
on chip scheme for current and future implementations using gate level synthesis.
CHAPTER V
FUTURE WORK
A. Introduction
This thesis has proposed a methodology and tool-set for the analysis and synthesis
of networks on chip. In this chapter, the possible extensions to this work are
analyzed.
B. Integration to Codesign Environment
The integration of this communication architecture design into the mainstream codesign
flow involves making the partitioning and mapping of architecture to the behavior
aware of the global network for communication. Although we have an outer loop
in the clustering methodology to re-map communicating components, their effect on
clustering has not been modeled completely. We believe that integrating clustering
into the process of partitioning and mapping would prove to be the best way to make
the whole codesign environment aware of the global network.
C. Synthesized Communication Architecture
The synthesized communication architecture should be optimized at different layers.
At the physical layer, signaling and asynchronous circuit design for senders and
receivers are the major research areas. The goal here should be power reduction and
design simplicity. The main areas of future research in the logical network layer
are topology, flow control, switching and routing. This research should be aimed at
simplicity of design and high performance. At the application layer, possible areas
of research are the system level design of cores, interfaces, end to end protocols and
tuning of design entities right from operating systems to wrappers, in order to make
these design entities aware of the network. It would also be worthwhile to investigate
the effects of system level power optimization techniques (like voltage and frequency
scaling of the global network) on the network performance. The other issues that will
be affected by having a network on chip based VLSI design are handling of signals like
chip level reset, testability of the chip and reconfigurability. Network reconfigurability,
in particular, has the very positive effect of reducing cost by increasing wafer yield.
CHAPTER VI
CONCLUSION
This thesis has provided a synthesis methodology for networks on chip. The problems
in network on chip synthesis were analyzed and the key issues were identified. The
first step in the methodology assists in analyzing the communication requirements.
This preprocessing step, called clustering, helped in reducing the simulation space and
in predicting the cost and performance tradeoffs of networks on chip. The simulation
and synthesis tool written in SystemC helped in analyzing the communication
requirements with a more precise system level model of the network. The system
level simulation was performed iteratively until the resource utilization was optimized
to the required level, while satisfying the quality of service requirements. The
simulation can be performed with a high level of accuracy by using realistic execution
traces.
The synthesis of the core network interface and packetization has also been explained
and some preliminary results of their cost and performance have been provided. Gate
level synthesis of the network logic and interface logic has provided an estimate of
the VLSI implementation costs of a network on chip. This thesis should provide the
reader with an understanding of the issues involved in the synthesis of networks on
chip and their application in the explicitly parallel systems on chip of tomorrow.
REFERENCES
[1] A. Allan, D. Edenfeld, W. H. Joyner Jr., A. B. Kahng, M. Rodgers, and Y. Zorian,
"2001 technology roadmap for semiconductors," IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 35, pp. 42-53, January 2002.
[2] P. Guerrier and A. Greiner, "A generic architecture for on-chip packet-switched
interconnections," in Proceedings of the Design Automation and Test in Europe,
Paris, France, March 2000, pp. 250-256.
[3] Virtual Socket Interface Alliance, "On chip bus attributes and virtual component
interface - draft specification, v. 2.0.4," URL: http://www.vsia.org, September
1999.
[4] W. Dally and B. Towles, "Route packets, not wires: On-chip interconnection
networks," in Proceedings of the 38th Design Automation Conference, Las Vegas,
NV, June 2001, pp. 684-689.
[5] F. Karim, A. Nguyen, S. Dey, and R. Rao, "On-chip communication architecture
for OC-768 network processors," in Proceedings of the 38th Design Automation
Conference, Las Vegas, NV, June 2001, pp. 678-683.
[6] K. Goossens, J. van Meerbergen, A. Peeters, and P. Wielage, "Networks on
silicon: Combining best-effort and guaranteed services," in Proceedings of the
European Design and Test Conference, Paris, France, March 2002.
[7] L. Benini and G. De Micheli, "Powering networks on chips," in Proceedings
of the 14th International Symposium on System Synthesis, Montreal, Canada,
September 2001, pp. 33-38.
[8] S. Kumar, A. Jantsch, J. P. Soininen, M. Forsell, M. Millberg, J. Oberg,
K. Tiensyrja, and A. Hemani, "A network on chip architecture and design methodology,"
in Proceedings of the IEEE Computer Society Symposium on VLSI, Pittsburgh,
PA, April 2002.
[9] P. J. Ashenden, System-on-chip Methodologies and Design Languages, Kluwer
Academic Publishers, Dordrecht, The Netherlands, 1st edition, 2001.
[10] Synopsys, CoWare, and Frontier Design, "SystemC version 1.0 user's guide,"
URL: http://www.systemc.org, 2000.
[11] F. Vahid and D. D. Gajski, "SLIF: A Specification-Level Intermediate Format for
system design," in Proceedings of the European Design and Test Conference,
Paris, France, March 1995, pp. 185-189.
[12] P. V. Knudsen and J. Madsen, "Communication estimation for hardware
software codesign," in Proceedings of the Sixth International Workshop on Hardware
Software Codesign (CODES), Seattle, WA, March 1998, pp. 55-59.
[13] K. Lahiri, A. Raghunathan, and S. Dey, "System-level performance analysis
for designing on-chip communication architectures," IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, vol. 20, pp. 768-782,
June 2001.
[14] T. Yen and W. Wolf, "Communication synthesis for distributed embedded
systems," in Proceedings of the International Conference on Computer Aided
Design, Knoxville, TN, November 1995, pp. 288-294.
[15] P. Chou, R. B. Ortega, and G. Borriello, "Interface co-synthesis techniques for
embedded systems," in Proceedings of the International Conference on
Computer Aided Design, Knoxville, TN, November 1995, pp. 280-287.
[16] J. M. Daveau, G. F. Marchioro, T. Ben-Ismail, and A. A. Jerraya, "Protocol
selection and interface generation for hw-sw codesign," IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, vol. 5, pp. 136-144, March 1997.
[17] P. V. Knudsen and J. Madsen, "Integrating communication protocol selection
with partitioning in hardware/software codesign," in Proceedings of the
International Symposium on System Level Synthesis, Seattle, WA, March 1998, pp.
111-116.
[18] M. Drinic, D. Kirovski, S. Meguerdichian, and M. Potkonjak, "Latency guided
on-chip bus network design," in Proceedings of the IEEE/ACM International
Conference on Computer Aided Design, San Jose, CA, November 2000, pp. 420-423.
[19] International Business Machines Corporation, "The coreconnect bus
architecture," URL: http://www.chips.ibm.com/products/coreconnect, 1999.
[20] Advanced RISC Machines Limited, "AMBA on-chip bus," URL:
http://www.arm.com.
[21] D. Wingard, "Micronetwork-based integration of socs," in Proceedings of the
38th Design Automation Conference, Las Vegas, NV, June 2001, pp. 673-677.
[22] B. Cordan, "An efficient bus architecture for system-on-chip design," in IEEE
Conference on Custom Integrated Circuits, San Diego, CA, May 1999, pp. 623-626.
[23] RapidIO Trade Association, "RapidIO interconnect specification," URL:
http://www.rapidio.org/specs, 2001.
[24] H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, and J. M.
Rabaey, "A 1-V heterogeneous reconfigurable DSP IC for wireless baseband
digital signal processing," IEEE Journal of Solid-State Circuits, vol. 35, pp.
1697-1704, November 2000.
[25] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, and A. S.
Vincentelli, "Addressing the system-on-a-chip interconnect woes through
communication-based design," in Proceedings of the 38th Design Automation
Conference, Las Vegas, NV, June 2001, pp. 667-672.
[26] D. M. Chapiro, "Globally-asynchronous locally-synchronous systems," Ph.D.
dissertation, Stanford University, 1984.
[27] J. Muttersbach, T. Villiger, and W. Fichtner, "Practical design of
globally-asynchronous locally-synchronous systems," in Proceedings of the 6th
International Symposium on Advanced Research in Asynchronous Circuits and Systems,
Eilat, Israel, April 2000, pp. 52-59.
[28] D. Sylvester and K. Keutzer, "A global wiring paradigm for deep submicron
design," IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 19, pp. 242-252, February 2000.
[29] A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Olsson, P. Nilsson, J. Oberg,
P. Ellervee, and D. Lundqvist, "Lowering power consumption in clock by
using globally asynchronous locally synchronous design style," in Proceedings of
Design Automation Conference, New Orleans, LA, June 1999, pp. 873-878.
[30] A. Leon-Garcia and I. Widjaja, Computer Networks: Fundamental Concepts
and Key Architectures, McGraw-Hill, Boston, MA, 1st edition, 2000.
[31] P. V. Knudsen and J. Madsen, "Graph based communication analysis for
hardware software codesign," in Proceedings of the Seventh International Workshop
on Hardware Software Codesign (CODES), Rome, Italy, May 1999, pp. 131-135.
[32] K. Lahiri, A. Raghunathan, and S. Dey, "Fast performance analysis of bus-based
system-on-chip communication architectures," in Proceedings of the IEEE/ACM
International Conference on Computer-Aided Design, San Jose, CA, November
1999, pp. 566-572.
[33] K. Lahiri, A. Raghunathan, and S. Dey, "Efficient exploration of the soc
communication architecture design space," in Proceedings of the IEEE/ACM
International Conference on Computer-Aided Design, San Jose, CA, November 2000,
pp. 424-430.
[34] L.-S. Peh and W. Dally, "Flit reservation flow control," in Proceedings of
the 6th International Symposium of High Performance Computer Architecture,
Toulouse, France, January 2000, pp. 73-84.
[35] W. Dally, "Virtual channel flow control," IEEE Transactions on Parallel and
Distributed Systems, vol. 3, pp. 194-205, March 1992.
[36] N. N. Sujir, V. Bangalore, N. Swaminathan, and R. N. Mahapatra, "Performance
improvement using multicast for on-chip networks," Tech. Rep., Texas A&M
University, 2002.
[37] J. Duato, P. Lopez, F. Silla, and S. Yalamanchili, "A high performance router
architecture for interconnection networks," in Proceedings of the 1996
International Conference on Parallel Processing, August 1996, vol. 1, pp. 61-68.
[38] D. Lyonnard, S. Yoo, A. Baghdadi, and A. A. Jerraya, "Automatic generation
of application-specific architectures for heterogeneous multiprocessor
system-on-chip," in Proceedings of the 38th Design Automation Conference, Las Vegas,
NV, June 2001, pp. 518-523.
[39] Xtensa Incorporated, "Xtensa TIE reference guide," URL: www.xtensa.com,
March 2002.
VITA
Narayanan Swaminathan was born in Madras, India on the 12th of September
1978, the son of Lalitha and Swaminathan. After completing his work at Vanavani
High School, he went on to gain a Bachelor of Engineering in Electrical Engineering
at P.S.G College of Technology, Bharathiyar University in May of 2000.
Permanent Address:
29, 6th Main Road, Nanganallur.
Madras,Tamilnadu.
India. Pin-600061
The typist for this thesis was the author.