
COMMUNICATION SYNTHESIS FOR

ON CHIP NETWORKS

A Thesis

by

NARAYANAN SWAMINATHAN

Submitted to the Office of Graduate Studies of Texas A&M University

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

August 2002

Major Subject: Computer Engineering

COMMUNICATION SYNTHESIS FOR

ON CHIP NETWORKS

A Thesis

by

NARAYANAN SWAMINATHAN

Submitted to Texas A&M University in partial fulfillment of the requirements

for the degree of

MASTER OF SCIENCE

Approved as to style and content by:

Gwan Seung Choi (Co-Chair of Committee)

Rabi N. Mahapatra (Co-Chair of Committee)

Ting-Chi Wang (Member)

Chanan Singh (Head of Department)

August 2002

Major Subject: Computer Engineering


ABSTRACT

Communication Synthesis for

On Chip Networks. (August 2002)

Narayanan Swaminathan, B.E., Bharathiyar University

Co-Chairs of Advisory Committee: Dr. Gwan Seung Choi, Dr. Rabi N. Mahapatra

The modern day System on Chip (SoC) is made up of a large number of het-

erogeneous cores with varied communication requirements. On chip networks are the

scalable, global interconnection solutions for these explicitly parallel systems. In this

thesis, we have analyzed the issues involved in synthesizing these networks and pro-

posed a methodology that will help to arrive at an optimal Network on Chip (NoC)

design. Some of the important issues that we consider for synthesis are the quality of

service (QoS) requirements of the communicating cores like latency and data rate, the

utilization of the network resources and implementation cost issues like area, power

and wiring complexity. We have developed a tool-set written mainly in C++, which

comprises an IP clustering engine and an on chip network simulator, to aid the

synthesis. We have annotated the models used for communication architecture syn-

thesis with design parameters obtained from gate level synthesis. The tool-set can aid

the designer in predicting the various cost parameters and configuring the network on

chip architecture for optimal performance. The network simulator, NoCSIM, can be

extended to evaluate the performance and feasibility of new and innovative network

architectures developed for on chip environment.


To My Parents


ACKNOWLEDGMENTS

Thanks are due first to my committee, Rabi Mahapatra, Gwan Choi and Ting-Chi Wang for their help and insight. I would have surely been lost without them.

I would also like to thank Praveen for playing the devil's advocate to my ideas.

Thanks are also due to Nithin and Vishi, who were very forthcoming with ideas for

the simulator design.

The entire Texas A&M HW/SW Codesign group also deserves my thanks for

sitting through multiple versions of my evolving research.

Lastly, I would like to thank my sister, Janani, for sharing her expertise in

logic synthesis, my brother-in-law, Bala, and my parents, Lalli and Swami, for their

encouragement in my academic endeavors.

TABLE OF CONTENTS

CHAPTER

I INTRODUCTION

II BACKGROUND AND PROPOSED RESEARCH
   A. Introduction
   B. Specification and Profiling
   C. Communication Synthesis and Estimation
   D. Communication Architecture
   E. Proposed Research
   F. Conclusion

III CLUSTERING
   A. Introduction
   B. Inputs to Clustering Engine
   C. Clustering Algorithm
      1. Analysis Models for Clustering Cost Evaluation
      2. Clustering Results
   D. Cluster Profile Annotation
   E. Intra-Cluster Synthesis
   F. Conclusion

IV NETWORK ARCHITECTURE AND SYNTHESIS
   A. Introduction
   B. Architecture
      1. Network Operation
      2. Flow-control
      3. Routing
      4. Topology
      5. Switching
      6. Synthesis Variables
   C. NoCSIM - An On Chip Network Simulator
      1. SystemC - A System Level Design Language
      2. Features of the Network Simulator
      3. Network Synthesis Results
   D. VLSI Implementation Feasibility
      1. Results
   E. Core-Network Interface
   F. Conclusion

V FUTURE WORK
   A. Introduction
   B. Integration to Codesign Environment
   C. Synthesized Communication Architecture

VI CONCLUSION

REFERENCES

VITA

LIST OF TABLES

TABLE

I ITRS 2001 Roadmap
II Effect of Virtual Channels and Buffers in Area of Network
III Area Cost Comparison of 128 and 256 Bit Networks
IV Interface Analysis

LIST OF FIGURES

FIGURE

1 Shared Bus Architecture
2 Hierarchical Bus Architecture
3 Octagon Architecture
4 Synthesis Methodology
5 Folded Torus
6 Network Connection Diagram
7 Power Consumption Breakup of a High Performance Microprocessor
8 Clustering Flow
9 Results of Clustering Algorithm
10 Comparison of Wire Delay and Gate Delay
11 Tradeoffs in Area and Power with Cluster Size
12 Failure Comparison of Bus and Network
13 Synthesis and Verification Methodology
14 Packet Format
15 Output Controller Architecture
16 Input Controller Architecture
17 Network Routing Example
18 Simulator Structure and Features
19 Buffers vs. Network Performance
20 Virtual Channels vs. Network Performance
21 Comparison of Multicast and Unicast Routing
22 VLSI Complexity Cost of Adding Virtual Channels
23 Core Network Interface


CHAPTER I

INTRODUCTION

The complexity and performance requirements of future System on Chip (SoC) designs, combined with the multiplicity of bus interfaces, pose significant challenges to the design and verification of communication architecture. The International Technology Roadmap for Semiconductors (ITRS) identifies design productivity, power management, multi-core organization, I/O bandwidth, and circuit and process technologies as key contexts for future evolution of microprocessing units [1]. Communication architecture design and synthesis has a major role to play in almost all of these factors, because as the density of the chip increases, area, throughput and power dissipation of communication architecture become significant issues, apart from the design and verification complexity. An aggressive design strategy can provide a good tradeoff in terms of power, speed, area and design complexity. A robust and aggressive communication architecture design requires a quick evaluation of communication requirements and a synthesis based on those requirements.

The traditional bus based architectures are neither scalable nor efficient for next generation SoCs with hundreds of cores and aggregate throughput rates of multiple gigabits per second [2]. These highly integrated and explicitly parallel SoC architectures are replacing the architectures based on instruction level parallelism (ILP) and implicit communications, for high performance designs. A switching network on chip (NoC) appears to be the truly scalable interconnect as compared to shared buses, multiple buses or hierarchy of buses, when we consider explicitly parallel systems on chip. The migration to the distributed interconnection network oriented set up is the intuitive next step after the transition from shared-bus paradigm to the hierarchical bus architecture.

The journal model is IEEE Transactions on Automatic Control.

While high performance is an essential design attribute, predictability of performance and verifiability of the system are necessary for designers and chip architects. Predictability is necessary to take early design decisions based on expected bus/link performance before actual implementation. Verification and chip integration pose a significant challenge due to the presence of a large number of proprietary buses for different IP cores and the shift of technology towards ultra deep-submicron. This necessitates architectural standards on communication architecture and core interfaces, like those proposed by the Virtual Socket Interface Alliance [3]. But the high performance and scalability of NoCs come with an associated cost in terms of area and power dissipation. The necessity to optimize this cost motivates us to develop a synthesis tool that is supported by fast and accurate estimation models for optimal communication architecture design. In this thesis, we propose a methodology for the synthesis of an on chip switching network. The aim of communication synthesis is to create the hardware and software entities required for optimal and reliable communication between the different cores in the system, across the on chip network. The methodology addresses various issues involved in the synthesis of an on-chip switching network, like placement of cores in the networked chip, architectural synthesis of an optimal network and synthesis of core interface logic.

This thesis makes the following contributions:

• Synthesis methodology for on chip networks.

• Analysis models of on chip network costs to aid synthesis.

• NoCSIM - an on chip network simulator.

• VLSI feasibility analysis of on chip networks.

The rest of the chapters are organized as follows. The second chapter analyzes the

work that has already been done in the area and introduces our synthesis method-

ology. The third chapter deals with the analysis phase of our methodology called

clustering. The fourth chapter deals with the network architecture synthesis and its

VLSI implementation feasibility based on gate level synthesis. It also addresses other

issues in synthesis like core network interface.


CHAPTER II

BACKGROUND AND PROPOSED RESEARCH

A. Introduction

The three critical issues for communication architecture performance are the inputs to the synthesis methodology, estimation of the communication requirements and the synthesized communication architecture. Although research in networks on chip is still in its infancy, a number of contributions have been reported recently [4, 5, 2, 6, 7, 8]. In this section, we analyze the present state of research in NoCs and the critical issues mentioned above.

B. Specification and Profiling

System specification has been one of the major aspects of research over the past two decades and a large number of specification languages have been the result [9]. These include C/C++, SystemC [10], Java, SpecCharts and Hardware Description Languages (HDL). Our approach is not specific to any of these descriptions, but can be easily integrated with SystemC based descriptions.

In [11], Vahid presents a novel communication profiling method based on accesses rather than data flow, which reduces the complexity of task-graph specification. Graph complexity is one of the most important considerations when dealing with specification of complex designs. In [12], Knudsen explores a graph based communication analysis scheme using a Control/Data flow graph. In [13], Lahiri analyzes communication based on execution traces using co-simulation results from Ptolemy. For this thesis, we have assumed that a profiled task-graph is made available to the synthesis methodology.

C. Communication Synthesis and Estimation

Communication synthesis for distributed embedded systems has been dealt with in [14], by analyzing delay bounds on communication in the system. This analysis has been used to synthesize the communication links and to assign inter-process communication to the links. In [15], Pai Chou emphasizes the synthesis of software drivers and hardware interfaces (glue logic) to connect processors to co-processors and peripheral devices, while meeting bandwidth and performance requirements. Their algorithm takes a communication estimate, derived from control flow graphs and descriptions of processor and devices, as input and synthesizes the required communication logic and drivers. Communication synthesis is tackled as an allocation problem in [16], where the two aspects of synthesis, protocol generation and interface synthesis, are addressed. In [17], the authors integrate the process of communication protocol selection into hardware-software codesign by using a communication library to explore tradeoffs in communication architecture design. Their work takes driver-processing effects into account while calculating the throughput. In [18], the aim has been to close the synthesis loop between system and physical design by a set of tools. They further suggest a methodology for analyzing the communication, creating the bus network and iterating for the best physical floorplan. Our communication synthesis approach is similar to [18], but focused on synthesis of a switching communication network as the global wiring solution instead of buses.

D. Communication Architecture

In the past, a significant amount of work has been proposed on different communication architectures for on chip and off chip environments. The most common communication architectures have been the bus based architecture and switching architectures. The main interconnection setups for bus based architectures are the point to point, shared bus (shown in Figure 1) and hierarchical bus (shown in Figure 2) architectures. A variety of switching architectures have been used in multicomputer and multiprocessing networks, like k-ary n-cube (e.g. mesh, torus), Multistage Interconnection Networks (e.g. fat tree) and crossbar.

Fig. 1. Shared Bus Architecture

The most dominant architecture used for the on chip environment is the bus based architecture [19, 20, 21, 22]. This is mainly because the throughput requirements of systems on chip have so far been satisfied by shared and hierarchical buses. But it is very doubtful that this scenario will persist for long. The shared bus architecture, shown in Figure 1, is the most commonly available bus architecture for on chip communication. The components attached to the bus contend among themselves to access the bus. A bus arbiter examines the requests accumulated at the component interfaces and grants access to the requesting component with the highest priority. The shared bus architecture suffers from poor scalability [5]. For example, the greater the number of processors trying to access the shared memory, the greater is the latency to access it. Bus arbiter delay also grows with the number of devices attached. The bandwidth is limited and shared. However, a shared bus architecture has its own advantages. The area cost of a bus is very low and its supporting logic is simple. Hierarchical bus architecture is another scheme that has been proposed. This arrangement is shown in Figure 2. The components that require a large amount of communication among themselves are grouped on the same hierarchy. Each level of hierarchy has a bridge that acts as a liaison for the traffic from one hierarchy to the other. So at each hierarchy, we have fewer components contributing to the loading of the bus. This architecture has higher performance, concurrency and throughput than the shared bus, but has associated higher costs.

Fig. 2. Hierarchical Bus Architecture
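The priority-based arbitration described above for the shared bus can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the master names, the priority values and the function name arbitrate are hypothetical, and a larger integer is assumed to mean a higher priority. Real arbiters may instead use round-robin or time-division schemes.

```python
from typing import Dict, Optional, Set


def arbitrate(requests: Set[str], priority: Dict[str, int]) -> Optional[str]:
    """Return the requesting master with the highest priority, or None if idle.

    Assumption: a larger integer means a higher priority.
    """
    if not requests:
        return None
    return max(requests, key=lambda master: priority[master])


# Hypothetical masters and priorities, for illustration only.
priority = {"cpu": 3, "dma": 2, "dsp": 1}
print(arbitrate({"dma", "dsp"}, priority))  # dma outranks dsp
print(arbitrate(set(), priority))           # no requester, the bus stays idle
```

Because every request passes through this single decision point, the sketch also hints at why arbitration delay and access latency grow as more components are attached to the bus.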

The migration to the distributed network environment is the intuitive next step, for even higher throughput necessities, and to achieve more concurrency. This is one important reason that motivates our research into networks on chip and switching architectures like [23]. The other motivational reasons are reliability, performance predictability and possible power reduction. Among switching architectures, the crossbar has the best performance, but its cost is prohibitive and crossbars are generally non-scalable, as the wiring complexity is very high. The complexity grows as the square of the number of nodes on the chip and a lot of care has to be taken to avoid signal skew. This scheme is not energy efficient, as a lot of energy is consumed by all the buses and the switches. Another architecture that has been proposed is the fat-tree network architecture called Scalable, Programmable, Integrated Network (SPIN) [2]. The number of roots in the tree was increased to amortize the traffic at the routers. The leaves contain processors or memory. However, this scheme still suffers because the intermediate routers have to share the burden of routing and buffering the on chip traffic destined for the leaves. A novel architecture called the Octagon architecture has been proposed in [5]. This scheme, shown in Figure 3, tries to achieve the efficiency of a crossbar architecture using an octagon arrangement of nodes. Communication between any pair of nodes can be achieved in at most 2 hops.

Fig. 3. Octagon Architecture

The idea of using a switching NoC has been explored in [4, 5, 2, 6, 7, 8]. In [4], an NoC example is proposed and the research issues involved in designing the NoC are stated. In [2], a generic packet-switching network architecture based on Fat-tree topology is proposed. The Pleiades architecture [24] explores a reconfigurable NoC design for DSP applications and assesses the impact of reconfigurability and fault tolerance of such networks. The impact of energy efficiency and reliability of NoCs is analyzed in [7].

Fig. 4. Synthesis Methodology
[Figure: flow diagram. Inputs: System Specification/Process Profile, Hardware/Resource Profile. Mapping. Synthesis: Clustering; Intra-Cluster Communication Synthesis and Profile Annotation; NoCSim based Synthesis (Inter-Cluster Communication Synthesis & Core Interface Synthesis). Outputs: Communication Synthesized System (based on Implementation model).]
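The two-hop bound quoted above for the Octagon can be checked with a short breadth-first search. The sketch below assumes the connectivity usually described for this architecture: eight nodes on a ring, each also linked to the diametrically opposite node. That connectivity is an assumption here, since this section does not spell it out, and the function names are hypothetical.

```python
from collections import deque


def octagon_neighbors(node: int, n: int = 8) -> list:
    """Ring neighbours plus the link to the diametrically opposite node."""
    return [(node + 1) % n, (node - 1) % n, (node + n // 2) % n]


def hop_counts(src: int, n: int = 8) -> dict:
    """Breadth-first search returning the hop count from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        for nxt in octagon_neighbors(cur, n):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


# Worst-case hop count over all source nodes of the 8-node octagon.
worst = max(max(hop_counts(s).values()) for s in range(8))
print(worst)  # 2
```

Under the assumed connectivity, every node reaches its two ring neighbours and the opposite node in one hop, and the remaining four nodes in two.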


E. Proposed Research

The primary objective of our work is to explore the communication architecture design and synthesis space with a switching network as the global wiring solution. There seems to have been little effort to investigate the synthesis of switching networks in the SoC environment. There are no existing tools to explore viable synthesis options of the communication entities required for on chip switching network communication. Figure 4 shows the various steps in the proposed synthesis methodology. The inputs are obtained in the form of a profiled resource graph, G(V,E), whose nodes, V, are the communicating resources (and their parameters) and whose edges, E, describe the communication between the resources. As the on chip communication architectures play a very important role in the performance of the chip, they need to be designed keeping the on chip environment in mind. Therefore, we have developed an analytical model of the NoC costs in C++ to do clustering of cores into tiles. This is the pre-processing step of our synthesis methodology, called Clustering. After synthesizing the communication architecture inside the cluster and generating the communication profile between different clusters, the outputs of the clustering engine are fed to the network simulator, NoCSIM. This simulation model is used to estimate and refine the network architecture. The interface between the cores and the network is also synthesized at this stage. We use an implementation model based on VHDL to generate the synthesized network architecture, which is integrated with the existing SoC design flow for gate level synthesis. In this thesis, we also provide an analysis of the feasibility of the VLSI implementation of the network. We have developed the simulation tool, called NoCSIM, based on SystemC, a popular system level design language. The development of a tool based on this language will make it easier to integrate existing SystemC descriptions of cores into the NoC environment. It will also make it easier to investigate the protocol stack implementations [25] that will be overlaid on the physical network. In our work, we are experimenting with a folded torus topology, shown in Figure 5. The communicating cores are clustered into a single network node, referred to as VCNODE in Figure 6. These nodes or tiles communicate with one another over the switching network. The communication architecture is a two-tier hierarchy, where the first level is the network between tiles and the second level is the communication channel within a tile.

Fig. 5. Folded Torus
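To illustrate why a torus is usually folded, the sketch below places the nodes of one logical ring into physical slots so that no link has to span the whole row. The interleaved placement used here is a common textbook folding and is an assumption; the thesis does not state which placement its folded torus uses, and the function names are hypothetical.

```python
def folded_position(i: int, n: int) -> int:
    """Physical slot of logical ring node i in a folded ring of n nodes."""
    half = (n + 1) // 2
    if i < half:
        return 2 * i              # first half occupies the even slots
    return 2 * (n - 1 - i) + 1    # second half folds back onto the odd slots


def max_link_length(n: int) -> int:
    """Longest physical distance between logically adjacent ring nodes."""
    return max(
        abs(folded_position(i, n) - folded_position((i + 1) % n, n))
        for i in range(n)
    )


n = 8
print([folded_position(i, n) for i in range(n)])
print(max_link_length(n))
```

In an unfolded row the wrap-around link would span n - 1 slots, whereas with this placement every link spans at most two slots. The same placement can be applied independently to each dimension of a two-dimensional torus.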

F. Conclusion

In this section, we have analyzed the present state of the network on chip synthesis

problem and we have also introduced our methodology. We believe that using this

methodology which is based on the system requirements will lead to reduction in area

and power costs of NoCs, while meeting the quality of service requirements.

Fig. 6. Network Connection Diagram
[Figure: VC NODE tiles with Input Controller and Output Controller blocks]


CHAPTER III

CLUSTERING

A. Introduction

Clustering is the process of identifying the cores that will be grouped as a single net-

work tile in the hierarchical topology. It is a pre-processing step, where we examine

the global view of communication between cores without considering the actual im-

plementation within a cluster. Clustering makes it easier to explore the design space,

as a hierarchical approach to communication architecture design is used. Clustering

makes it possible for elements (cores) to share physical communication channels in-

side a tile, thereby reducing the size and complexity of NoC resources. Clustering

becomes a necessity for some very low latency modes of communication, where the

latency incurred over the network cannot be amortized over the delay bandwidth

product of communication. Clusters are also formed to balance the communication

load on the NoC and to pack cores into tiles without wasting real estate of a tile and

to provide a regular structural layout of the cores. By clustering, we also hope to

attain reduced global clock power and interconnect power dissipation, which together

form more than 40 percent of the power dissipation of today's chips [1]. The power consumption breakup is shown in Figure 7. But increasing the number of clusters without limit will prove counterproductive, as area and power could be wasted in the network logic and network interconnection complexity. The above issues form the various hard and soft constraints for clustering.

Fig. 7. Power Consumption Breakup of a High Performance Microprocessor

A cluster could constitute the local memory/cache of a processor, and other elements which are tightly coupled with the core. An indication of this type of communication is available from the task-graph fanout. In general, we believe that communication modes like interrupts and cache/local-memory to processor traffic, which are low latency and highly coupled modes of communication, are candidates for intra-cluster communication, whereas streaming traffic and shared memory accesses should use inter-cluster communication. Multicast modes of communication should also go through the packet switched network.

B. Inputs to Clustering Engine

The inputs to the clustering engine are:

• Resource matrix

• Connection matrix

• Implementation constraints

The resource matrix provides the details on the type of resource (core), the resource area, and its length and width. The core could be a soft core or a hard core, the difference being that the aspect ratio of a soft core can be varied while that of a hard core cannot. The cores can be of different kinds like microprocessors (DSP, GPP), memory (shared, local), configurable cores (Xtensa core, FPGAs) and RTL blocks (FIR, CORDIC cores).

The connection matrix provides the details of the interconnection between two resources. It also provides information on the maximum and average latencies of communication between the communicating components in the default time scale units, the data flow rate between the communicating modules and the type of traffic. The type of traffic is an input intended for the simulator and could be Constant Bit Rate (CBR) or random Poisson distribution. The analyzer uses only Poisson distribution of packet arrivals in its queuing model for the network. In addition to this, the designer can provide additional information on connectivity of two modules using a binary variable called the user defined constraint. If this value is 1 for a communicating pair of modules, then these two modules will always be held within a cluster. In other terms, they are a single core for the clustering engine. The user defined constraint gives more control to the designer over the convergence of the annealing algorithm. We believe that automating clustering to a high extent could be counterproductive for realistic design and verification issues.
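One way to treat modules tied by the user defined constraint as a single core is to merge them with a union-find structure before clustering starts. The sketch below is an assumption about how this could be done, not a description of the thesis tool; the module names, the tuple layout of the constraints and the function name merge_constrained are hypothetical.

```python
from typing import Dict, List, Tuple


def merge_constrained(modules: List[str],
                      constraints: List[Tuple[str, str, int]]) -> List[List[str]]:
    """Group modules so that every pair flagged with 1 ends up in one group."""
    parent: Dict[str, str] = {m: m for m in modules}

    def find(m: str) -> str:
        # Follow parent links to the representative, halving the path on the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b, flag in constraints:
        if flag == 1:                      # user defined constraint is set
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for m in modules:
        groups.setdefault(find(m), []).append(m)
    return list(groups.values())


# Hypothetical modules and constraint flags, for illustration only.
modules = ["cpu", "cache", "dsp", "sram", "fir"]
constraints = [("cpu", "cache", 1), ("dsp", "sram", 1), ("dsp", "fir", 0)]
print(merge_constrained(modules, constraints))
```

Each returned group can then be handled as one movable unit, so no annealing move can separate modules that the designer has tied together.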

The implementation constraints that are needed by the clustering engine include

the process wire pitch, resistance and capacitance values of the global interconnect for

the fabrication process technology to be used, repeater areas, network logic overheads,

area per bit of storage, frequency of network and bus operations, number of allowed

buses inside a cluster and signaling overheads (in terms of number of wires per data

wire) for the physical layer.

Fig. 8. Clustering Flow
[Figure: flow diagram with blocks Connection File, Resource File, Implementation File, Designer Input, Clustering Algorithm, Violation Reports, Power Costs, Area Costs, Cluster Parameters]

C. Clustering Algorithm

The clustering methodology shown in Figure 8 takes the various constraints as inputs and tries to optimize the allocation of cores into tiles. We have used a popular optimization technique called Simulated Annealing for this purpose. The outputs of the clustering engine are violation reports of the hard constraints, an annotated task graph profile of inter-cluster communication and the area and power costs of the implementation with the specified number of clusters. If an incompatibility in the specification or implementation cannot be addressed by the simulated annealing algorithm, then the cluster configuration has to be modified or the incompatibility should be resolved by re-mapping the behavior to the architecture. This is the outer loop in the clustering methodology. As these design spaces are quite large with varied configurations, the estimation should be fast and provide the necessary level of accuracy. This is the reason clustering should be done based on analytical models. It is not realistic to explore these regions with the network simulator.

The Simulated Annealing algorithm lends itself favorably to any problem with a large solution space. Simulated annealing models the cost of a system as the energy function of a thermodynamic system. The optimal cost of the system then becomes the ground state of this thermodynamic system. The main challenges in the simulated annealing algorithm are the definition of the cost factors involved and the selection of annealing moves, bias values and cooling schedule. The cost factors involved in our clustering analysis are split into hard specification constraints like latency and wire complexity, whose biases will be high, and soft implementation constraints like bandwidth and area deviations, whose bias values will be low. These biases can also be set by the designer. Apart from this, some other cost factors can be added to the algorithm, such as power dissipation and network logic area. We found these factors to be unfruitful, as there was no significant variation of these factors with the analysis models used, when the target number of clusters is fixed. But these factors vary significantly between different numbers of clusters. Therefore, the design space is explored for these soft constraints by varying the number of clusters and examining their effect on the cost.
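The split between heavily biased hard constraints and lightly biased soft constraints can be pictured as a weighted sum. The sketch below is a minimal illustration under stated assumptions: the bias values, the normalised violation amounts and the name clustering_cost are hypothetical, and the thesis does not give the actual form of its cost function.

```python
from typing import Dict

# Hypothetical bias values: hard constraints are weighted far more heavily.
DEFAULT_BIAS = {
    "latency": 100.0, "wire_complexity": 100.0,   # hard specification constraints
    "bandwidth": 1.0, "area_deviation": 1.0,      # soft implementation constraints
}


def clustering_cost(violations: Dict[str, float],
                    bias: Dict[str, float] = DEFAULT_BIAS) -> float:
    """Weighted sum of normalised constraint violations (0 means satisfied)."""
    return sum(bias[name] * max(0.0, amount) for name, amount in violations.items())


# Two hypothetical candidate clusterings.
a = {"latency": 0.0, "wire_complexity": 0.0, "bandwidth": 0.3, "area_deviation": 0.2}
b = {"latency": 0.1, "wire_complexity": 0.0, "bandwidth": 0.0, "area_deviation": 0.0}
print(clustering_cost(a), clustering_cost(b))
```

With these weights, a small latency violation in candidate b costs more than the combined soft deviations of candidate a, so the search is steered towards satisfying the hard constraints first. Passing a different bias dictionary corresponds to the designer setting the biases.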

The algorithm maps resources randomly to di�erent clusters in the initialization step.

It then searches for resources with user constraints on connectivity and assigns those

resources to the same cluster. Any move separating them will not be accepted after

this assignment. The initialization step is followed by simulated annealing. A move

is de�ned as removal of a core from a cluster and its addition to a di�erent cluster.

The move is accepted or rejected based on the change in cost. All moves that result

in an improvement are accepted and moves that do not result in an improvement are

accepted based on the temperature at which the move is made and the cost di�erence.

18

Fig. 9. Results of Clustering Algorithm

Many moves are made at the same temperature until a limit is reached on the number of attempts. When this limit is reached, the temperature is lowered at a user-specified rate (default of 0.95). When the temperature is high, the probability of accepting a cost-increasing move is high, but as the algorithm cools off, this probability becomes lower and lower, until a point is reached at which no further improvement is attained at a temperature. Simulated annealing is stopped at this point, and the violation and cost reports are passed on to the designer for analysis. Figure 9 shows the decrease in cost over iterations. The three curves depict the maximum, minimum and average of the accepted costs at each temperature. The experiments were conducted for a random resource graph with 85 resource nodes and 200 edges, mapped to 16 clusters.
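The annealing loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the tool-set's code: the function name `anneal_clusters`, the `pinned` dictionary used to model user connectivity constraints, the starting temperature, and the Metropolis acceptance rule exp(-delta/T) are all assumptions. The text only states that acceptance depends on the temperature and the cost difference.

```python
import math
import random

def anneal_clusters(num_cores, num_clusters, cost, pinned=None,
                    t_start=10.0, cooling=0.95, attempts=200, seed=0):
    """Assign cores to clusters by simulated annealing.

    cost:   caller-supplied function mapping an assignment list to a number.
    pinned: hypothetical {core: cluster} map for user-constrained cores,
            which are placed once and never moved afterwards.
    """
    rng = random.Random(seed)
    pinned = pinned or {}
    # Initialization: random mapping, then apply the user constraints.
    assign = [rng.randrange(num_clusters) for _ in range(num_cores)]
    for core, cluster in pinned.items():
        assign[core] = cluster
    movable = [c for c in range(num_cores) if c not in pinned]

    current = cost(assign)
    best, best_cost = list(assign), current
    temp = t_start
    improved = bool(movable)
    while improved:                       # stop when a temperature brings no gain
        improved = False
        for _ in range(attempts):         # many moves at the same temperature
            core = rng.choice(movable)
            old, new = assign[core], rng.randrange(num_clusters)
            if new == old:
                continue
            assign[core] = new            # move: leave one cluster, join another
            delta = cost(assign) - current
            # Assumed Metropolis rule: always take improvements, otherwise
            # accept with a probability that shrinks as the temperature drops.
            if delta < 0 or rng.random() < math.exp(-delta / temp):
                current += delta
                if current < best_cost:
                    best, best_cost = list(assign), current
                    improved = True
            else:
                assign[core] = old        # reject the move
        temp *= cooling                   # default cooling rate of 0.95
    return best, best_cost
```

A caller would supply its own cost function, for example one that counts the edges of the resource graph that cross cluster boundaries plus penalties for constraint violations.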


Table I. ITRANS 2001 Roadmap

Year    Power (W)    Vdd (V)    Clock rate (MHz)
2001    130          1.1        1684
2002    140          1          2317
2003    150          1          3088
2004    160          1          3990
2005    170          0.9        5173
2010    218          0.6        11511
2016    288          0.4        28751

1. Analysis Models for Clustering Cost Evaluation

We have developed cost models for the area, power and latency of the hierarchical mesh network for fast estimation of cluster requirements. This section explains these models and the assumptions that have been made. One of the most important features of a hierarchical mesh architecture like the one explored here is that the system can have a Globally Asynchronous Locally Synchronous (GALS) architecture [26, 27]. The packet switched network is a globally asynchronous medium, whereas each cluster/tile can be made to operate on a single synchronous clock. This makes it possible to reduce the global clock power dissipation, clock skew and synchronization problems, and the number of repeaters used for signal integrity. On the downside, clock generation circuitry becomes more complex and asynchronous interfaces have to be developed for the tile-network interface. In spite of this, ITRANS [1] has predicted that wire delay dominated complex SoC designs may have to turn to GALS for a global wiring solution in this decade.

This is mainly because the clock frequency is increasing at such a high rate (as shown in Table I) that a signal could take multiple clock cycles to travel from one end of the chip to the other [28]. The wire delay and gate delay for various process technologies are shown in Figure 10. GALS also makes it possible to do localized frequency scaling, which is a feature that can be explored with our network simulator.

Fig. 10. Comparison of Wire Delay and Gate Delay

The clock power [29] is given by Equation 3.1. The power dissipated by the clock is proportional to the product of the clock load and the length of the clock wire to that load.

$$P_{clk} = k \cdot A \cdot 2\sqrt{A} \tag{3.1}$$

The clock is assumed to be distributed uniformly across the area (A) and the length

of the clock wire is proportional to the square root of area, while k is a proportionality

constant.
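As a worked illustration of Equation 3.1, suppose (this is an assumption made for the example, not a statement of the model itself) that the chip area $A$ is split into $N$ equal clusters, each with its own clock distributed over an area $A/N$. Summing Equation 3.1 over the clusters gives:

$$P_{clk}^{total} = N \cdot k \cdot \frac{A}{N} \cdot 2\sqrt{\frac{A}{N}} = \frac{k \cdot A \cdot 2\sqrt{A}}{\sqrt{N}}$$

Under this assumption the total clock power falls as $1/\sqrt{N}$. For example, $N = 16$ gives one quarter of the single-cluster value.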

The length of the interconnect is obtained by analyzing the interconnection wires that are needed between the input and output controllers, shown in Figure 6. For our network structure, the total interconnection length ($L_i \cdot N_i$) is given by Equation 3.2, after neglecting the control wires, which are far fewer than the data wires.

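$$L_i \cdot N_i = (\text{ClusterLength}) \cdot (22 \cdot \text{ChannelWidth}) \tag{3.2}$$

The interconnect power dissipation, $P_{int}$, is modeled by Equation 3.3.

$$P_{int} = a \cdot V_{dd}^{2} \cdot f \cdot (22 \cdot \text{ChannelWidth} \cdot C_w \cdot \text{ClusterLength} + C_b \cdot N_b) \tag{3.3}$$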

The interconnect power is the sum of the power dissipated in charging and discharging the interconnect capacitance ($C_w$) and the power dissipated by the repeaters placed along the length of the wire according to the critical length ($L_{crit}$). The value of $a$, the probability of a wire switching from 0 to 1, is assumed to be 0.15 for a global wire, but this value decreases as the number of clusters increases, because the probability factor is spread over the network. It will be much higher for a shared bus. The value of $V_{dd}$ and $C_w$ will also be much higher for a shared bus, as the wire load on a shared bus is higher. A distribution of the power dissipation of the different components of the design is given in Figure 7. The repeaters are modeled as spread along the interconnection length ($L_i \cdot N_i$), one after each critical length ($L_{crit}$), which is assumed to be constant for the specified metal wire and load. The number of repeaters ($N_b$) is obtained from Equation 3.4. The buffer capacitive load $C_b$ is an input parameter to the clustering engine and is a constant for a given buffer design.

$$N_b = \frac{L_i \cdot N_i}{L_{crit}} \tag{3.4}$$
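Equations 3.2 to 3.4 can be evaluated together with a few lines of code. The sketch below is a direct transcription of the three equations. The function and parameter names are hypothetical, and the units (for example, whether $C_w$ is a capacitance per unit length) are assumptions that must be chosen consistently by the caller.

```python
def interconnect_length(cluster_length, channel_width):
    """Total data-wire length L_i * N_i inside a cluster (Equation 3.2)."""
    return cluster_length * 22 * channel_width

def repeater_count(total_length, l_crit):
    """Number of repeaters N_b, one per critical length (Equation 3.4)."""
    return total_length / l_crit

def interconnect_power(a, vdd, f, channel_width, c_w, cluster_length, c_b, l_crit):
    """Interconnect power P_int (Equation 3.3).

    a   : probability of a wire switching from 0 to 1 (0.15 for a global wire)
    c_w : wire capacitance per unit length (assumed unit)
    c_b : capacitive load of one repeater buffer
    """
    n_b = repeater_count(interconnect_length(cluster_length, channel_width), l_crit)
    wire_cap = 22 * channel_width * c_w * cluster_length
    return a * vdd ** 2 * f * (wire_cap + c_b * n_b)
```

Sweeping `cluster_length` for a fixed chip area is one way to reproduce the kind of tradeoff curve the clustering engine explores, subject to the assumptions above.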

The power dissipated by the interconnect logic is proportional to the number of

storage elements, added up with the overhead of input and output controllers. The

same holds true for the logic area of the network.

The wiring complexity ($W_c$) is modeled as the ratio of the tracks needed by the interconnection network inside a cluster to the total number of tracks available, as shown in Equation 3.5.

$$W_c = \frac{(\text{ClusterArea}) \cdot (\text{WireUsageFactor})}{(\text{WirePitch}) \cdot (9 \cdot \text{ClusterLength} \cdot \text{ChannelWidth})} \tag{3.5}$$

Here, we have assumed that the wiring is done in two layers of metal, so the total interconnection length is divided by two. When this ratio is greater than one, the wiring complexity constraint is violated. This violation makes the annealing cost of the network high when the area of the cluster becomes very low. Equation 3.5 gives an upper bound on the number of clusters that should be used for a specific process technology for this distributed network architecture. It should be noted that for this architecture, the wiring complexity is higher within a cluster than between two clusters. To reduce it, the channel size within the distributed switch can be reduced, but this increases the latency.

The latency for travel between two nodes is the product of the hop latency and the number of hops. The hop latency for the network ($T_{hop}$), given by Equation 3.6, is the sum of the wait time ($T_w$) to get a free buffer, the time to get through the input ($T_i$) and output ($T_o$) controllers, and twice the time of flight ($T_f$) through the wire, all in network clock cycles.

$$T_{hop} = T_w + T_i + T_o + 2 \cdot T_f \tag{3.6}$$

$$T_w = \frac{1}{1 - \dfrac{(1-\rho) \cdot \rho^{k}}{1-\rho^{k+1}}} \tag{3.7}$$

The wait time is modeled as the wait time to get a free buffer in an M/M/1 queue [30] with $k$ buffers, as shown in Equation 3.7. The number of network clocks to get through the input controller ($T_i$) and the output controller ($T_o$) is assumed to be one without any pipelining. The time of flight ($T_f$) delay is obtained from a lookup table for the specified interconnection length. The time of flight is counted twice because of the inter-cluster propagation time of the folded torus ($T_f$) and the propagation delay between the input and output controllers ($T_f$). This additional delay is not present in a mesh structure, but the average number of hops to reach a destination increases. The lookup table is created using delay values obtained from an Elmore delay model of the interconnect between two standard CMOS buffers for the specified length.

Fig. 11. Tradeoffs in Area and Power with Cluster Size
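The latency model of Equations 3.6 and 3.7 can be sketched as below. The function names are hypothetical. Reading the fraction in Equation 3.7 as the probability that all $k$ buffers are full, with $\rho$ the offered load, is the usual interpretation for a finite M/M/1 queue and is treated here as an assumption. The time of flight is passed in directly rather than read from a lookup table.

```python
def blocking_probability(rho, k):
    """Fraction appearing in Equation 3.7 for load rho and k buffers (k >= 1)."""
    if rho == 1.0:
        return 1.0 / (k + 1)          # limit of the expression as rho -> 1
    return (1.0 - rho) * rho ** k / (1.0 - rho ** (k + 1))

def wait_time(rho, k):
    """Wait time T_w to get a free buffer, in network clocks (Equation 3.7)."""
    return 1.0 / (1.0 - blocking_probability(rho, k))

def hop_latency(rho, k, t_i=1, t_o=1, t_f=1):
    """Hop latency T_hop = T_w + T_i + T_o + 2 * T_f (Equation 3.6)."""
    return wait_time(rho, k) + t_i + t_o + 2 * t_f

def path_latency(hops, rho, k, t_f=1):
    """Latency between two nodes: hop latency times the number of hops."""
    return hops * hop_latency(rho, k, t_f=t_f)
```

With a load of 0.5 and a single buffer, for instance, the wait time evaluates to 1.5 network clocks, so a three-hop path with a one-cycle time of flight costs 16.5 clocks in this model.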

2. Clustering Results

Figure 11 gives an insight into the effect of the number of clusters on the clock power, area and interconnect power dissipation. Clock power decreases monotonically as the number of clusters increases, whereas the area of the interconnection logic increases. For a large number of clusters, the interconnect logic is mostly in the form of network logic, as the number of buses inside a network tile reduces. Interconnect power is measured as a percentage of the total maximum power dissipation obtained from Table I. The initial decrease in the power consumption of the interconnect can be attributed to the reduction of bus load, and the subsequent increase shows the diminishing returns of having more network logic and wiring. It is therefore possible to assign an operating point for the network in terms of the number of clusters (sixteen for this example).

Fig. 12. Latency Failure Comparison of Bus and Network

Figure 12 shows the variation in latency failures of the bus and the network with the number of clusters. The bus is unable to support the huge number of cores when there is a single cluster, as is to be expected of a shared bus. As we increase the number of clusters, the failures that occur over the bus reduce because the bus load becomes smaller and smaller. This figure also shows that, in spite of monotonically increasing the throughput of the network by increasing the number of network tiles and reducing the size of each cluster, the latency of communication between two components can cause a number of failures in communication. These failures can be partly attributed to the excessive number of hops taken by the data between communicating cores and partly to sparseness in the connectivity of the task graph that was taken as the input. When the task graph was connected in a denser manner with a higher average fanout, cores with higher connectivity were mapped into neighboring clusters, because the algorithm was able to make a clear distinction between cores that need to communicate over the network and cores that need to be present within a cluster.

But when the cores were connected sparsely, this distinction was absent and cores were mapped in an ad hoc fashion. Decreasing the size of clusters makes matters worse, because when the number of cores inside a cluster becomes smaller, more cores are made to communicate over the network even though they may need to be mapped into the same cluster. This caused a fixed number of failures to occur, irrespective of the number of iterations for which the simulation was run. Therefore, the number of clusters should be kept at the optimum level required by the design and its communication complexity. Another important observation from Figure 12 is that the assignment of buses to certain communicating cores (clustering) whose latencies fall above the expected network latency could be done beforehand by the designer for more optimal results. The designer can use the user-defined constraint for this purpose.

D. Cluster Profile Annotation

Profile annotation is the process of creating a connection matrix specifying the connection parameters of the different clusters or tiles of the network. This is used for the architectural synthesis of the network. The aggregate communication bandwidth of the tile and the minimum of the latencies of the cores inside the tile are some of the connection parameters that become important Quality of Service (QoS) parameters to be satisfied by the network synthesizer. Apart from this, the location of the communicating cores in the network is also given as an input to the network and interface synthesizer, so that they can calculate the source routes and generate packets to the appropriate destination at the specified rate and with the specified random distribution of the connection matrix. It is also possible to increase the accuracy of architectural synthesis by annotating the communication parameters of each individual sender and receiver (core), rather than using aggregate throughput and minimum latency measures.
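One way to picture profile annotation is as a reduction from per-core flows to a tile-to-tile connection matrix. The sketch below is an illustration only: the function name `annotate_profiles`, the tuple layout of a flow, and the choice to skip traffic that stays inside a tile are assumptions, not a description of the tool-set's data format.

```python
from collections import defaultdict

def annotate_profiles(core_to_tile, flows):
    """Build a tile-level connection matrix from core-level flows.

    core_to_tile: {core_name: tile_index}, as produced by clustering.
    flows:        iterable of (src_core, dst_core, bandwidth, latency_bound).
    Returns {(src_tile, dst_tile): {"bandwidth": sum, "latency": minimum}}.
    """
    matrix = defaultdict(lambda: {"bandwidth": 0.0, "latency": float("inf")})
    for src, dst, bandwidth, latency in flows:
        s, d = core_to_tile[src], core_to_tile[dst]
        if s == d:
            continue                      # assumed: intra-tile traffic stays on the bus
        entry = matrix[(s, d)]
        entry["bandwidth"] += bandwidth   # aggregate bandwidth of the tile pair
        entry["latency"] = min(entry["latency"], latency)  # tightest latency bound
    return dict(matrix)
```

Annotating every sender and receiver individually, as the text suggests for higher accuracy, would amount to keeping the per-flow tuples instead of reducing them.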

E. Intra-Cluster Synthesis

Once clusters have been determined, we focus on each cluster and synthesize its communication requirements. Intra-cluster synthesis is mainly split into protocol generation and interface generation. In other words, we should allocate actual physical channels and synthesize the glue logic and software drivers to make the cores communicate effectively. As a designer should explore the mapping of logical to physical channels in the best possible fashion, we need a fast estimation scheme like those proposed in [17, 31, 32, 33] for intra-cluster communication. The main issues that need to be addressed for intra-cluster synthesis are the protocols and services used for communication, and the bus width and data generation frequency differences between the communicating cores, which drive buffering and glue logic generation. The synthesis of logic inside the cluster is closely related to clustering, and this process happens iteratively with clustering. We have not investigated the detailed synthesis of intra-cluster communication in this thesis.

F. Conclusion

Clustering gives the designer an idea of the cost involved in choosing an arbitrary number of clusters. It also provides the designer with the data needed for the next step of the synthesis process: intra-cluster and inter-cluster synthesis. It reduces the number of iterations spent in both syntheses.

CHAPTER IV

NETWORK ARCHITECTURE AND SYNTHESIS

A. Introduction

The main goal of architecture level communication synthesis for NoCs is to explore

the NoC design space and arrive at an optimal global communication architecture.

A detailed simulator tailored for NoC simulation assists the synthesis process. The

central idea behind architectural synthesis is to minimize the resource requirements

of NoC, while satisfying QoS requirements.

Figure 13 shows the proposed communication architecture design and verification flow, using a design based on our SystemC methodology library. The network creation file and the communication description file act as the configurators of the various inputs of the simulator. Network simulation is done iteratively by changing the network creation file with the aim of optimizing the network. It is necessary that the synthesized network is verified using actual cosimulation trace inputs and interface models written in SystemC before RTL or gate level synthesis. The distributed network [4], which has been used for analysis throughout this work, is just one example of a NoC. There are other possible topologies, routing and flow control mechanisms which can be used in a NoC environment. The simulator that we have developed is intended for research into these areas of NoC design.

Fig. 13. Synthesis and Verification Methodology
[Block diagram; labels: Designer Input, Clustering based Estimation, Network Creation File, Communication Description File, Network Simulation, Analysis of Performance and Utilization, Network Behavioral Model in SystemC, Interface Behavioral Model in SystemC, Cosimulation Traces, Synthesis, Verification]

Irrespective of the network used, there are a few basic rules for NoC design. The design should be power efficient, as power consumption is a limiting factor in SoCs. Buffer space is costly whereas wires are cheap and abundant, contrary to off-chip environments. Latency is a limiting factor of the packet switched interconnection scheme and should be reduced or hidden to the maximum possible extent. The clock rate can be made arbitrarily large by pipelining and is ultimately going to be bound by the time of flight delay. Protocol stack implementation plays a vital role in deciding the performance of the network. While designing the protocol stack and layering schemes, simplicity and performance should be more of a concern than generality and dynamic capabilities of the network. This is because, once a network is laid out on the chip, the degree of reconfigurability that is required is much less than that of large scale interconnection networks. Also, at all steps of the design and implementation process, the advantages and limitations of a NoC environment should be kept in mind. The distances are very small and noise levels are dependent on the aggressiveness of the design. The synthesis of the network is also dependent on the application at hand and the service it requires. The next section deals with the architectural details and features of the synthesized network.

B. Architecture

The network architecture that we synthesize has a folded torus topology with source based routing and virtual channel flow control, similar to the one proposed in [4]. The topology is folded in order to distribute the wire delays equally between hops. This makes the structure regular and can save global interconnection power by splitting the wire. This architecture is shown in Figure 5. The other features of the architecture are back pressure based on credit flow, reliable and in-order delivery of packets, and dynamic virtual channel assignment. In any network, the three main parameters that characterize performance are flow control, routing and topology. These will be explained in the following sections, but a brief discussion of the working of the network and the format of the packets that traverse the network is necessary at this point. The reader is referred to Figure 6 for a description of the top level interconnection between the output controllers and input controllers of the switch and the interconnections between the network nodes. Each node has five of these input and output controllers: one for each direction of the network and one for communication with the cores inside the node. The five input and output controllers can be visualized as a single crossbar switch with five inputs and outputs. The on-chip implementation of this is a distributed switch, as shown in Figure 6.

[ TYPE | VCID | ROUTE | DATA ]

Fig. 14. Packet Format

1. Network Operation

Each packet has the fields shown in Figure 14. Flits are classified as head, data, tail and complete flits. The flits are the smallest units of data with flow control information. The first flit of a packet is the 'Head' flit. This is followed by multiple 'Data' flits and, at the end, a 'Tail' flit. If a packet is only as big as a flit, then the flit is 'Complete', because it is the head, data and tail flit combined. This distinction is indicated by the type field. The division of a packet into flits is necessary for the packet to be able to traverse the network after being segmented into many pieces.

This segmentation becomes a necessity because the channels or buffers need not be wide enough to carry complete packets. Segmentation is also a way of reducing latency, as the flits can move towards the destination without waiting for their successors to catch up. The sender sends a packet by sending the head flit into the core input controller and then sending data flits in successive clock cycles. The head flit gets routed into the appropriate output controller (shown in Figure 15), based on the route field specified in the flit, and the same route is set for the rest of the data flits for that virtual channel until the tail flit frees the virtual channel. The head flit then tries to reserve a virtual channel at the next input controller (shown in Figure 16) when it gets into the output controller. The assignment of virtual channels is made on demand at the output controllers for the immediately succeeding input controller.

Fig. 15. Output Controller Architecture
[Block diagram; labels: Flits In, Credits In, Credits Out, DIR 1 to DIR 4, VC Updations, Selection on Buffer state and Count, Flit Out Mux]
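The flit classification and segmentation described above can be illustrated with a short sketch. The field names follow the packet format (TYPE, VCID, ROUTE, DATA), but the `Flit` class, the `segment` function, the assumption that the head flit also carries payload, and the choice to leave the route field at zero on non-head flits are all hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class Flit:
    kind: str     # TYPE field: "HEAD", "DATA", "TAIL" or "COMPLETE"
    vcid: int     # virtual channel identifier
    route: int    # source route; only used on HEAD/COMPLETE flits in this sketch
    data: bytes   # payload slice carried by this flit

def segment(payload: bytes, flit_bytes: int, vcid: int, route: int):
    """Split a packet payload into flits no wider than flit_bytes."""
    chunks = [payload[i:i + flit_bytes]
              for i in range(0, len(payload), flit_bytes)] or [b""]
    if len(chunks) == 1:
        # A packet that fits in one flit is head, data and tail combined.
        return [Flit("COMPLETE", vcid, route, chunks[0])]
    flits = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            kind = "HEAD"                 # sets the route for the virtual channel
        elif i == len(chunks) - 1:
            kind = "TAIL"                 # frees the virtual channel
        else:
            kind = "DATA"
        flits.append(Flit(kind, vcid, route if i == 0 else 0, chunk))
    return flits
```

Because only the head flit carries routing information in this sketch, the remaining flits depend on the virtual channel state set up by the head, which mirrors the reservation behaviour described in the text.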

If no virtual channel is available at the next node, the output buffer gets blocked and so does the rest of the virtual channel. When the head flit gets a virtual channel, it reserves the virtual channel for its data flits and tail flit by setting the route field of that virtual channel with the shifted route field in its header. The subsequent data flits use the same virtual channel reserved by the head flit. When the corresponding data flits arrive at the next input controller, they are identified by their virtual channel identifiers and put into the appropriate FIFO queue (virtual channel). The route has already been set by the head flit, so they just move to the output buffer depending on the output buffer availability. The availability of the output buffer depends on the availability of buffers in the next assigned virtual channel. In this fashion, a single packet passes through the network based on buffer availability. This drastically reduces the buffer requirement and therefore the latency of communication. The availability of buffers is passed as credit signals between neighboring nodes and also between the input and output controllers of the same node. It is also possible to implement these as fields in the packet, but this implementation increases the amount of packet traffic. The method of having separate control wires and data wires can also be used for better flow control schemes that take advantage of the on-chip communication scenario.

Fig. 16. Input Controller Architecture
[Block diagram; labels: Flit In, Mux, Demux, VC 1 to VC 4, Virtual Channel Forwarding Unit, Selection based on Route, Selection based on FIFO and Credits, Credit State, VC State, Flit out, Credit In, Credit Out]

The data field of the packet carries the actual communication.

It could be in the form of a request/response or a stream packet. Request/response packets can be used for address space communication, whereas stream packets are for streaming data between two cores. It is possible to map all kinds of data flow into these two scenarios. A processor communicating with the distributed shared memory could use address space communication, whereas an FIR filter core passing a stream of data to another signal processing core or a peripheral can use the streaming mode of communication. The actual interpretation of data in the data field is handled by the core network interface, which will be discussed in the forthcoming sections. This is an example of the layered approach to communication in our synthesis flow. The logical network layer is abstracted from the wrapper, which interprets the data passed by the network layer. The disadvantage, of course, is the amount of space wasted in the physical channel for carrying network flow information like the route field and virtual channel identifier. The next section underlines the importance of the virtual channel flow control mechanism.

2. Flow-control

Flow control is a very important area of research, as it controls the allocation of network resources like channels and buffers. Buffer space can be utilized more efficiently by using more intelligent flow control schemes like [34]. The network traffic flow control that we have used is a hop by hop blocking mode of flow control. Hop by hop flow control makes it possible to convey congestion information from any part of the network back to the source, and therefore the data gets blocked at the sender itself. This is a way of informing the source that an alternative path could be taken to reach the destination if there is support for multiple routes. This flow control is faster because it does not have to rely on timeout mechanisms and acknowledgement packets coming back from the destination. Virtual channels are built using a set of buffers (queues) at the input, so that a packet (flit) is not blocked by a previous packet (flit) that is itself blocked and unable to leave the switch. When there is a blockage in any virtual channel, another available virtual channel can be used by other communicating cores for transmission. Virtual channel flow control has the multiple advantages of avoiding deadlock, providing bandwidth guarantees to flows and improving the saturation throughput. But adding virtual channels unnecessarily can increase the complexity and latency of the network. A more detailed analysis of virtual channel flow control can be found in [35]. It should be noted that setting the number of virtual channels to one gives an ordinary buffered network with FIFO arbitration. The next section gives more details about the routing scheme with an example.
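The hop by hop back pressure described in this section relies on credits that report downstream buffer availability. A minimal sketch of one such credit-controlled virtual channel is given below. The class `CreditLink` and its methods are hypothetical, and a simple credit counter is an assumption about how the credit signals could be tracked. The text only states that buffer availability is passed as credit signals.

```python
from collections import deque

class CreditLink:
    """One virtual channel between an upstream output and a downstream FIFO."""

    def __init__(self, depth):
        self.credits = depth      # free downstream buffer slots known upstream
        self.fifo = deque()       # downstream virtual channel queue

    def try_send(self, flit):
        """Upstream side: send only if a credit is available, else block."""
        if self.credits == 0:
            return False          # back pressure: the flit stays upstream
        self.credits -= 1
        self.fifo.append(flit)
        return True

    def drain(self):
        """Downstream side: forward one flit and hand a credit back."""
        if not self.fifo:
            return None
        flit = self.fifo.popleft()
        self.credits += 1
        return flit
```

When a chain of such links fills up, `try_send` fails hop after hop back towards the source, which is how congestion reaches the sender without timeouts or acknowledgement packets.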

3. Routing

We have experimented with two choices of routing, namely source based routing and dynamic routing. Source based routing is simple and, we believe, has enough flexibility for on-chip implementation. The latency in the network logic is reduced because of the absence of any lookups, which is not the case with dynamic routing, where the routing decision is made by the network logic.

We will explain the routing scheme with the example shown in Figure 17. After receiving a packet, the input controller that received the packet on that node makes a decision based on the last two bits of the route field. The values 00, 01, 10 and 11 indicate to the controller that the flit is to be routed to the left, to the right, above (straight ahead) and into the core, respectively. Let us assume a case in which a sender gives a source route of 54, which translates to 00110110 in binary, to the core input controller. In the first step, the core controller looks into the source route and extracts the last 2 bits, which are 10. The packet is forwarded straight from the node using the routing logic explained above. The next controller analyzes the shifted route field (001101), extracts the last 2 bits (01) and decides to forward the packet to its right. Finally, the node that receives the packet finds 11 in the remaining route (0011), extracts the packet for its own core, and the packet is sunk.
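The worked example can be reproduced with a few lines of code. The sketch below decodes a source route two bits at a time, exactly as in the example. The name `decode_route` and the `max_hops` guard are hypothetical additions for illustration.

```python
# Two-bit direction codes taken from the routing example.
DIRECTIONS = {0b00: "left", 0b01: "right", 0b10: "straight", 0b11: "core"}

def decode_route(route, max_hops=16):
    """Return the per-node decisions encoded in a source route."""
    decisions = []
    for _ in range(max_hops):             # guard against routes that never sink
        step = DIRECTIONS[route & 0b11]   # each node looks at the last two bits
        decisions.append(step)
        if step == "core":
            break                         # packet is delivered to the core
        route >>= 2                       # shifted route is passed to the next node
    return decisions

print(decode_route(54))  # prints ['straight', 'right', 'core']
```

The route 54 (00110110) therefore encodes three decisions, and the two leading zero bits are never consumed.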

The simulator also has the capability to support routing decisions at the controller depending on the destination address. This is called dynamic routing. It increases the latency and storage requirements of the node but decreases the packet overhead, as the route field will be much larger than the address field. Another routing extension that we have made is the implementation of a multicast routing scheme for this architecture. This decreases the latency of communication with very little increase in network logic, as explained in [36].

Fig. 17. Network Routing Example
[Diagram of a grid of VC nodes; annotations: Sender sends packet with route 00110110; the Sender node forwards the packet straight because of 10; the intermediate node forwards the packet to its right because of 01; the receiver node sinks the packet because of 11]

4. Topology

For our study, we have assumed the folded torus topology shown in Figure 5. The advantage of this topology is its reduced number of hops for communication compared with other regular topologies like the mesh. There is no special significance to this topology other than the ones already expounded in classical texts on the subject, apart from its application to the on-chip environment, where the regular structure brings reduced wire complexity and reduced signal skew and skew variations. Although a crossbar has a very low latency of communication, it may not be scalable for complex networks on chips because of its wire complexity. The simulator that we have developed is programmable for arbitrary topologies with slight modifications in the controller logic. This is because all topologies have inputs and outputs, and therefore input and output controllers. The difference is in the number of I/O ports and the interconnections between the ports.

5. Switching

Switching schemes that are currently used in switching networks can be classified as circuit switching, packet switching and hybrid switching [37]. We have adopted packet switching as our scheme of implementation, but the choice of switching scheme is application dependent: communications with high locality work better in a circuit switched environment, while more random communication benefits from packet switching.

6. Synthesis Variables

Synthesis of this network aims at selecting the appropriate number of buffers, virtual channels, channel widths, packet formats and network topology, and the location of cores in the network. A rough decision on the relative location of cores in the network is available from the clustering step, but further refinement using detailed simulation can reveal violations of latency or bandwidth constraints that could not be captured by the analytical models. The next section gives details about the simulator, the synthesis process and the results obtained for the synthesis of the case study involving 85 resources and 200 connections. The resources were mapped by the clustering phase into 16 clusters, the network operating point.


C. NoCSIM - An On Chip Network Simulator

1. SystemC - A System Level Design Language

SystemC [10] is a C++ class library that can be used to effectively create a cycle-accurate model of software algorithms, hardware architecture, and interfaces for SoC and system-level designs. The SystemC library and standard C++ development tools can be used to create a system-level model and simulate it to validate and optimize the design. The class library provides the notions of signals, clocks and threads, which are the most important entities for hardware simulation. Modules in SystemC are hierarchical entities that can contain other modules and processes of execution. Processes are used to describe the behavior of the module; they can be sensitive to the clock and made to wait on events. Modules communicate with each other through a rich set of ports and signals. SystemC can be used to model systems at different levels of abstraction, ranging from untimed functional models to clock cycle accurate register transfer level models. But we lose simulation speed as the description becomes more detailed, so a system level description of a design should not go down to the clock cycle accurate RTL level. The main advantage of writing the network simulator in SystemC, apart from its speed, is that it is tailored for system level design and verification of hardware and software. It provides a rich set of classes for modeling hardware. Moreover, integration of the simulator with models of processors, memories and cores written in SystemC becomes easy. Another important advantage is that any description written in SystemC can be synthesized using commercial tools to gate level and further down.

2. Features of the Network Simulator

The network has been modeled at the flit level and is parameterizable for the synthesis variables that have been discussed. It is well suited to the estimation of network parameters for a complex SoC. The class hierarchy and programmability of the NoCSIM simulator are shown in Figure 18. The simulator can generate input traffic patterns with Constant Bit Rate (CBR) and random Poisson distributions. Another feature that has been implemented for purposes of accuracy in a codesign environment is cosimulation trace inputs. These traces can be made available from cosimulation tools like Ptolemy. The traces are compiled to generate packets, in a manner similar to the Bus Functional Models (BFM) used for design verification. The outputs (monitors) that can be used to analyze the network performance include latencies of resource pairs, network resource utilizations and channel activity rates for power estimation. These outputs are useful for analysis and for carrying the synthesis process further to the gate level. The simulator can also output Value Change Dump (VCD) trace files for functional debugging of new protocols, flow control and routing schemes. The simulator can also be run with different network clock rates for frequency scaling applications, and the tiles can have independent clock domains. The simulator can simulate any k-ary n-cube topology, but for arbitrary topology extensions, the switch input and switch output may need to be changed. This is because the inputs and outputs of the switch will always be present irrespective of the topology of the network. The characterizing features that need to be modeled are the number of input and output ports and the interconnections between them inside and outside the switch. The main extension that has been carried out so far in the simulator is the implementation of the multicast routing scheme [36]. More advanced flow control and routing schemes can also be modeled in a similar way.
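The two synthetic traffic patterns mentioned above differ only in how packet injection times are spaced. The sketch below illustrates that difference and is not taken from the simulator: the function names are hypothetical, and modeling Poisson traffic through exponentially distributed gaps between packets is the standard construction, assumed here.

```python
import random

def cbr_times(period, count, start=0):
    """Constant Bit Rate: one packet every `period` cycles."""
    return [start + i * period for i in range(count)]

def poisson_times(rate, count, seed=None):
    """Poisson traffic: exponential inter-arrival gaps with mean 1/rate."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(count):
        t += rng.expovariate(rate)   # gap to the next packet, in cycles
        times.append(t)
    return times
```

For the same average rate, the Poisson source produces bursts and idle gaps, which is the kind of burstiness that stresses buffer depth more than a CBR source does.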

Fig. 18. Simulator Structure and Features
[Block diagram; blocks: Switch Input, Switch Output, Channels, Traffic I/O, Monitors and clock source, each with Config Inputs. Configurable items shown: routing functions, buffers and sizes, virtual channels, channel widths, arbitration / flow control / protocol, noise generation, activity factors, utilizations, drop/blocking rates, protocol checking, cosimulation trace, traffic generator, protocol models, clock domains (sources/sinks)]

Fig. 19. Buffers vs. Network Performance

Fig. 20. Virtual Channels vs. Network Performance

3. Network Synthesis Results

Figure 19 shows a plot of the variation of network performance with the number of buffers. A minimum buffer size (12 for this example) is required to avoid excessive drop rates at the input and output controllers. Drop rates are obtained by measuring packet drops at each core input controller and averaging them over 100,000 simulation cycles. It is not possible to block data at the sender, so data that comes from the sender gets dropped at the sender when the data is bursty or the data rate is very high. The drop rate can be reduced by providing buffers. But beyond a certain saturation data rate, adding a buffer just increases the latency due to more queuing and waiting. We have arrived at the optimum buffer size of 1 flit and depth of 12 by iteratively simulating and measuring the incidence of drops. The other parameter of concern to the designer is the latency slack, which gives the latency deviation from the target latency. Slack is measured by subtracting the packet creation clock cycle from the current clock cycle and subtracting this result from the target latency. A negative slack indicates too much blocking in the path. This can be ameliorated by examining the reason for drops or blocks with targeted simulation. Figure 20 shows that adding virtual channels can help only to an extent, once the saturation bandwidth is reached. We simulated the case shown in Figure 20 by generating flits from tiles every cycle, whereas the first simulation was under a lighter load. The packet generators were clocked at 67 percent of the network clock. As shown in Figure 20, under heavy traffic between communicating components, or in a case where a sender in a core sends to many destinations simultaneously, adding virtual channels helps in improving slack and reducing the drop rate at the senders. It can also be seen from Figure 21 that the use of multicast routing helps in reducing latency. Multicast routing has been implemented for this architecture [36] and was proved to

43

Fig. 21. Comparison of Multicast and Unicast Routing

take 12*n bits of additional storage, where n is the number of nodes on the source to

destination path of the multicast group.
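The slack and drop rate definitions used in this discussion can be written down directly. The following Python sketch is an illustration only, not the simulator's implementation: the function names and the numeric values in the example are hypothetical, and the per-controller drop rate is assumed to be drops divided by simulated cycles.

```python
def latency_slack(target_latency, creation_cycle, current_cycle):
    """Slack = target latency - (current cycle - packet creation cycle)."""
    observed_latency = current_cycle - creation_cycle
    return target_latency - observed_latency

def drop_rates(drops_per_controller, simulated_cycles):
    """Drops per cycle for each core input controller (assumed definition)."""
    if simulated_cycles <= 0:
        raise ValueError("simulated_cycles must be positive")
    return [drops / simulated_cycles for drops in drops_per_controller]

def multicast_storage_bits(nodes_on_path):
    """Additional storage for multicast: 12*n bits for n nodes on the path."""
    return 12 * nodes_on_path

if __name__ == "__main__":
    # Hypothetical packet: created at cycle 100, target latency of 20 cycles.
    print(latency_slack(20, 100, 112))   # positive: ahead of the target
    print(latency_slack(20, 100, 130))   # negative: too much blocking
    print(drop_rates([500, 0, 2500], 100_000))
    print(multicast_storage_bits(5))
```

A negative return value from `latency_slack` corresponds to the blocking case described in the text and would be the trigger for a targeted simulation of that path.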

We observe that the aggregate throughput of the proposed network can reach up to 600 gigabits per second, assuming a 400 MHz network clock. We estimate that the area overhead will be less than 5 percent of the total chip area for a sixteen cluster network, based on gate level synthesis results. With decreasing silicon cost of the chip, increasing interconnect delay and stringent power requirements, we believe that on chip networks definitely have an edge over the traditional buses in high performance applications. We predict an increase in the number of clusters and in network size, in terms of buffer and channel sizes, based on technology scaling and performance requirements of the future.

After architecture level synthesis of the network, the synthesis parameters are given as input to the RTL generator, which generates the RTL code for the configured network. Unlike the simulator, the RTL generator supports only buffer sizes, number of buffers and number of virtual channels as parameterizable arguments. Therefore, variation of virtual channel arbitration policies and routing schemes will not be possible in the synthesized design without modifications to the VHDL code and synthesis scripts.

Fig. 22. VLSI Complexity Cost of Adding Virtual Channels

D. VLSI Implementation Feasibility

The network logic was implemented in VHDL and synthesized with Synopsys Design Analyzer using 0.18 micron libraries. We found that the network could be clocked at a rate of 400 MHz (critical path delay of 2.5 ns). This is based on results obtained from gate level synthesis using a segmented wire-load model. Further pipelining of transactions could shorten the critical path and thereby raise this clock rate. The area and timing determination is dependent on the gate level library and could be different for other libraries.
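The relation between the critical path delay and the clock rate quoted above can be checked with a few lines of arithmetic. The following Python sketch is an illustration only: the function names are hypothetical, the 600 Gb/s aggregate throughput is the figure quoted in the Network Synthesis Results, and the 1250 ps path is an invented example of a shorter critical path, not a synthesis result.

```python
def clock_mhz(critical_path_ps):
    """Highest clock rate (MHz) whose period still covers the critical path (ps)."""
    return 1_000_000 / critical_path_ps

def bits_per_cycle(aggregate_bits_per_second, clock_hz):
    """Bits the whole network must move in one network clock cycle."""
    return aggregate_bits_per_second / clock_hz

if __name__ == "__main__":
    print(clock_mhz(2500))               # 2.5 ns critical path from the text
    print(clock_mhz(1250))               # hypothetical halved critical path
    print(bits_per_cycle(600e9, 400e6))  # quoted throughput at a 400 MHz clock
```

Read this way, the quoted aggregate throughput corresponds to a total of 1500 bits transferred per network cycle, summed over all channels that are active at the same time.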


Table II. Effect of Virtual Channels and Buffers in Area of Network

Virtual Channels    Buffers    Area Estimate (sq. microns)
1                   8          338000
2                   4          354000
2                   8          431000
4                   4          470000

Table III. Area Cost Comparison of 128 and 256 Bit Networks

Module               Area (128 bit design)    Area (256 bit design)
Output Controller    78800                    154100
Input Controller     354000                   610000

1. Results

Figure 22 shows the cost of increasing the number of virtual channels using the same number of buffers (8). Increasing the number of virtual channels definitely has a penalty in the number of nets, but as shown in Table II, the logic cost has not increased much.

Table III compares the gate level implementations of a 128 bit and a 256 bit network in terms of area. The channel width is a very important synthesis decision that can be made with the help of the network simulator.
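The relative area costs implied by Tables II and III can be worked out as follows. This Python sketch only restates the table entries and computes percentage differences. The dictionary layout and the helper name `percent_increase` are assumptions made for the example.

```python
# Area estimates copied from Table II: (virtual channels, buffers) -> area
TABLE_II = {(1, 8): 338000, (2, 4): 354000, (2, 8): 431000, (4, 4): 470000}

# Area values copied from Table III: module -> (128 bit area, 256 bit area)
TABLE_III = {
    "Output Controller": (78800, 154100),
    "Input Controller": (354000, 610000),
}

def percent_increase(base, new):
    """Relative growth from `base` to `new`, in percent."""
    return 100.0 * (new - base) / base

if __name__ == "__main__":
    vc_cost = percent_increase(TABLE_II[(1, 8)], TABLE_II[(2, 8)])
    print(f"1 -> 2 virtual channels at 8 buffers: {vc_cost:.1f}% more area")
    for module, (area_128, area_256) in TABLE_III.items():
        growth = percent_increase(area_128, area_256)
        print(f"{module}: {growth:.1f}% more area at 256 bits")
```

Comparing such percentages against the latency and drop rate improvements reported by the simulator is one way to weigh a wider channel or an extra virtual channel against its silicon cost.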

E. Core-Network Interface

The interface between a core and the network, shown in Figure 23, plays the part of a layer in the protocol stack. It could be designed in a variety of ways, but any implementation should be tailored to suit the service needs of the core. The Virtual Socket Interface Alliance (VSIA) [3] aims to standardize the interface on the core side. These standards are called the Virtual Component Interface (VCI) and Peripheral Virtual Component Interface (PVCI), and they act as interfaces for processor and peripheral cores. The advantage of having a standard like VCI is that the core functionality is decoupled from its communication. The VCI communicates with global buses using entities called wrappers [38]. Some preliminary investigations into the design of wrappers for cores that produce streaming data and for configurable cores like the Xtensa processor [39] have been conducted. Table IV summarizes the results of our investigation. The analysis and design of wrappers for cores following these standards is left as future work.

Fig. 23. Core Network Interface
[Figure labels: Core, Wrapper, VCI, Configurable Core, To Network]

Wrappers for cores that produce streaming data need to have buffers and packetizers. The packetizer does a lookup of the route information provided in its transport control register before sending the packet over the network. These buffers and packetizers are either integrated with the core or separated from it. No wrappers are needed for configurable processor cores, as they can be made to communicate directly with the network by adding logic to the core data-path.

Table IV. Interface Analysis

The Xtensa processor core was given the packetizing functionality through the use of user specified instructions called TIE instructions. The processor, designed with these new instructions (e.g., packetize, which does a lookup of specific registers present in the processor core), packetizes the data directly. This reduces the wrapper logic overhead, but on the downside it couples the core and its communication, as opposed to the implementation advocated by VSIA. We feel that this solution is applicable because of the configurability of cores: if a core is configurable directly, there is no reason to add wrappers to configure it. We found that it took two core clock cycles for the configured Xtensa processor to do a packetization after getting the actual read instruction to memory. The memory hierarchy has not been taken into consideration for this analysis.
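The buffer and packetizer arrangement described for streaming cores can be sketched in software. The Python model below is a hypothetical illustration, not the wrapper design or the TIE instruction used in this work: the class names, the dictionary standing in for the transport control register, the fixed number of words per packet, and the drop-on-full policy are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Packet:
    route: tuple    # route copied from the transport control register
    payload: tuple  # data words carried by this packet

class StreamWrapper:
    """Hypothetical wrapper for a streaming core: a buffer plus a packetizer."""

    def __init__(self, transport_control, words_per_packet, buffer_depth):
        self.transport_control = dict(transport_control)  # destination -> route
        self.words_per_packet = words_per_packet
        self.buffer_depth = buffer_depth
        self.buffer = deque()
        self.dropped = 0

    def push(self, word):
        """Accept one word from the core; drop it if the buffer is full."""
        if len(self.buffer) >= self.buffer_depth:
            self.dropped += 1
            return False
        self.buffer.append(word)
        return True

    def packetize(self, destination):
        """Form one packet if enough words are buffered, else return None."""
        if len(self.buffer) < self.words_per_packet:
            return None
        route = self.transport_control[destination]  # route lookup
        words = tuple(self.buffer.popleft() for _ in range(self.words_per_packet))
        return Packet(route=tuple(route), payload=words)

if __name__ == "__main__":
    wrapper = StreamWrapper({"dsp": ("east", "north")}, words_per_packet=2, buffer_depth=4)
    for word in range(5):
        wrapper.push(word)  # the fifth word finds the buffer full
    print(wrapper.packetize("dsp"), wrapper.dropped)
```

In the configurable core approach, the same route lookup and packet formation would be performed by added instructions inside the processor data-path instead of by a separate wrapper object.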

F. Conclusion

In this section, we have introduced our synthesis scheme and the simulator that we have developed. We have also discussed other issues involved in network synthesis, such as wrapper synthesis, and the feasibility of the network on chip scheme for current and future implementations using gate level synthesis.


CHAPTER V

FUTURE WORK

A. Introduction

This thesis has proposed a methodology and tool-set for the analysis and synthesis of networks on chip. In this chapter, possible extensions to this work are analyzed.

B. Integration to Codesign Environment

The integration of this communication architecture design into the mainstream codesign flow involves making the partitioning and mapping of architecture to the behavior aware of the global network for communication. Although we have an outer loop in the clustering methodology to re-map communicating components, their effect on clustering has not been modeled completely. We believe that integrating clustering into the process of partitioning and mapping would prove to be the best way to make the whole codesign environment aware of the global network.

C. Synthesized Communication Architecture

The synthesized communication architecture should be optimized at different layers. At the physical layer, signaling and asynchronous circuit design for senders and receivers are the major research areas. The goal here should be power reduction and design simplicity. The main areas of future research in the logical network layer are topology, flow control, switching and routing. This research should be aimed at simplicity of design and high performance. At the application layer, possible areas of research are the system level design of cores, interfaces, end to end protocols and tuning of design entities right from operating systems to wrappers, in order to make these design entities aware of the network. It would also be worthwhile to investigate the effects of system level power optimization techniques (like voltage and frequency scaling of the global network) on the network performance. Other issues that will be affected by having a network on chip based VLSI design are the handling of signals like chip level reset, testability of the chip and reconfigurability. Network reconfigurability, in particular, has the very positive effect of reducing cost by increasing wafer yield.


CHAPTER VI

CONCLUSION

This thesis has provided a synthesis methodology for networks on chip. The problems in network on chip synthesis were analyzed and the key issues were identified. The first step in the methodology assists in analyzing the communication requirements. This preprocessing step, called clustering, helped in reducing the simulation space and in predicting the cost and performance tradeoffs of networks on chip. The simulation and synthesis tool written in SystemC helped in analyzing the communication requirements with a more precise system level model of the network. The system level simulation was performed iteratively until the resource utilization was optimized to the required level while satisfying the quality of service requirements. The simulation can be performed with a high level of accuracy by using realistic execution traces.

The synthesis of the core network interface and packetization has also been explained, and some preliminary results of their cost and performance have been provided. Gate level synthesis of the network logic and interface logic has provided an estimate of the VLSI implementation costs of a network on chip. This thesis should provide the reader with an understanding of the issues involved in the synthesis of networks on chip and their application in the explicitly parallel systems on chip of tomorrow.


REFERENCES

[1] A. Allan, D. Edenfeld, W. H. Joyner Jr, A. B. Kahng, M. Rodgers, and Y. Zorian, "2001 technology roadmap for semiconductors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 35, pp. 42-53, January 2002.

[2] P. Guerrier and A. Greiner, "A generic architecture for on-chip packet-switched interconnections," in Proceedings of the Design Automation and Test in Europe, Paris, France, March 2000, pp. 250-256.

[3] Virtual Socket Interface Alliance, "On chip bus attributes and virtual component interface - draft specification, v. 2.0.4," URL: http://www.vsia.org, September 1999.

[4] W. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," in Proceedings of the 38th Design Automation Conference, Las Vegas, NV, June 2001, pp. 684-689.

[5] F. Karim, A. Nguyen, S. Dey, and R. Rao, "On-chip communication architecture for OC-768 network processors," in Proceedings of the 38th Design Automation Conference, Las Vegas, NV, June 2001, pp. 678-683.

[6] K. Goossens, J. van Meerbergen, A. Peeters, and P. Wielage, "Networks on silicon: Combining best-effort and guaranteed services," in Proceedings of the European Design and Test Conference, Paris, France, March 2002.

[7] L. Benini and G. De Micheli, "Powering networks on chips," in Proceedings of the 14th International Symposium on System Synthesis, Montreal, Canada, September 2001, pp. 33-38.


[8] S. Kumar, A. Jantsch, J. P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani, "A network on chip architecture and design methodology," in Proceedings of the IEEE Computer Society Symposium on VLSI, Pittsburgh, PA, April 2002.

[9] P. J. Ashenden, System-on-chip Methodologies and Design Languages, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1st edition, 2001.

[10] Synopsys, CoWare, and Frontier Design, "SystemC version 1.0 user's guide," URL: http://www.systemc.org, 2000.

[11] F. Vahid and D. D. Gajski, "SLIF: A Specification-Level Intermediate Format for system design," in Proceedings of the European Design and Test Conference, Paris, France, March 1995, pp. 185-189.

[12] P. V. Knudsen and J. Madsen, "Communication estimation for hardware software codesign," in Proceedings of the Sixth International Workshop on Hardware Software Codesign (CODES), Seattle, WA, March 1998, pp. 55-59.

[13] K. Lahiri, A. Raghunathan, and S. Dey, "System-level performance analysis for designing on-chip communication architectures," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, pp. 768-782, June 2001.

[14] T. Yen and W. Wolf, "Communication synthesis for distributed embedded systems," in Proceedings of the International Conference on Computer Aided Design, Knoxville, TN, November 1995, pp. 288-294.

[15] P. Chou, R. B. Ortega, and G. Borriello, "Interface co-synthesis techniques for embedded systems," in Proceedings of the International Conference on Computer Aided Design, Knoxville, TN, November 1995, pp. 280-287.

[16] J. M. Daveau, G. F. Marchioro, T. Ben-Ismail, and A. A. Jerraya, "Protocol selection and interface generation for hw-sw codesign," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 5, pp. 136-144, March 1997.

[17] P. V. Knudsen and J. Madsen, "Integrating communication protocol selection with partitioning in hardware/software codesign," in Proceedings of the International Symposium on System Level Synthesis, Seattle, WA, March 1998, pp. 111-116.

[18] M. Drinic, D. Kirovski, S. Meguerdichian, and M. Potkonjak, "Latency guided on-chip bus network design," in Proceedings of the IEEE/ACM International Conference on Computer Aided Design, San Jose, CA, November 2000, pp. 420-423.

[19] International Business Machines Corporation, "The coreconnect bus architecture," URL: http://www.chips.ibm.com/products/coreconnect, 1999.

[20] Advanced RISC Machines Limited, "AMBA on-chip bus," URL: http://www.arm.com.

[21] D. Wingard, "Micronetwork-based integration of socs," in Proceedings of the 38th Design Automation Conference, Las Vegas, NV, June 2001, pp. 673-677.

[22] B. Cordan, "An efficient bus architecture for system-on-chip design," in IEEE Conference on Custom Integrated Circuits, San Diego, CA, May 1999, pp. 623-626.


[23] RapidIO Trade Association, "RapidIO interconnect specification," URL: http://www.rapidio.org/specs, 2001.

[24] H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, and J. M. Rabaey, "A 1-V heterogeneous reconfigurable DSP IC for wireless baseband digital signal processing," IEEE Journal of Solid-State Circuits, vol. 35, pp. 1697-1704, November 2000.

[25] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, and A. S. Vincentelli, "Addressing the system-on-a-chip interconnect woes through communication-based design," in Proceedings of the 38th Design Automation Conference, Las Vegas, NV, June 2001, pp. 667-672.

[26] D. M. Chapiro, "Globally-asynchronous locally-synchronous systems," Ph.D. dissertation, Stanford University, 1984.

[27] J. Muttersbach, T. Villiger, and W. Fichtner, "Practical design of globally-asynchronous locally-synchronous systems," in Proceedings of the 6th International Symposium on Advanced Research in Asynchronous Circuits and Systems, Eilat, Israel, April 2000, pp. 52-59.

[28] D. Sylvester and K. Keutzer, "A global wiring paradigm for deep submicron design," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, pp. 242-252, February 2000.

[29] A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Olsson, P. Nilsson, J. Oberg, P. Ellervee, and D. Lundqvist, "Lowering power consumption in clock by using globally asynchronous locally synchronous design style," in Proceedings of Design Automation Conference, New Orleans, LA, June 1999, pp. 873-878.


[30] A. Leon-Garcia and I. Widjaja, Computer Networks: Fundamental Concepts and Key Architectures, McGraw-Hill, Boston, MA, 1st edition, 2000.

[31] P. V. Knudsen and J. Madsen, "Graph based communication analysis for hardware software codesign," in Proceedings of the Seventh International Workshop on Hardware Software Codesign (CODES), Rome, Italy, May 1999, pp. 131-135.

[32] K. Lahiri, A. Raghunathan, and S. Dey, "Fast performance analysis of bus-based system-on-chip communication architectures," in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, November 1999, pp. 566-572.

[33] K. Lahiri, A. Raghunathan, and S. Dey, "Efficient exploration of the soc communication architecture design space," in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, November 2000, pp. 424-430.

[34] L.-S. Peh and W. Dally, "Flit reservation flow control," in Proceedings of the 6th International Symposium of High Performance Computer Architecture, Toulouse, France, January 2000, pp. 73-84.

[35] W. Dally, "Virtual channel flow control," IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 194-205, March 1992.

[36] N. N. Sujir, V. Bangalore, N. Swaminathan, and R. N. Mahapatra, "Performance improvement using multicast for on-chip networks," Tech. Rep., Texas A&M University, 2002.

[37] J. Duato, P. Lopez, F. Silla, and S. Yalamanchili, "A high performance router architecture for interconnection networks," in Proceedings of the 1996 International Conference on Parallel Processing, August 1996, vol. 1, pp. 61-68.

[38] D. Lyonnard, S. Yoo, A. Baghdadi, and A. A. Jerraya, "Automatic generation of application-specific architectures for heterogeneous multiprocessor system-on-chip," in Proceedings of the 38th Design Automation Conference, Las Vegas, NV, June 2001, pp. 518-523.

[39] Xtensa Incorporated, "Xtensa TIE reference guide," URL: www.xtensa.com, March 2002.


VITA

Narayanan Swaminathan was born in Madras, India on the 12th of September

1978, the son of Lalitha and Swaminathan. After completing his work at Vanavani

High School, he went on to gain a Bachelor of Engineering in Electrical Engineering

at P.S.G College of Technology, Bharathiyar University in May of 2000.

Permanent Address:

29, 6th Main Road, Nanganallur.

Madras,Tamilnadu.

India. Pin-600061

The typist for this thesis was the author.
