TKT-2431 SoC Design, Lec 10 – On-chip communication
Erno Salminen
Department of Computer Systems, Tampere University of Technology
Fall 2008
Erno Salminen - Nov. 2008#2/45
Copyright notice
Part of the slides adapted from:
- slide set by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley, http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml
- Timo D. Hämäläinen, Managing On-Chip Communications, SoC Symposium, Tampere, 19.11.2003
Part of the figures from:
- L. Benini, G. De Micheli, "Networks on chips: a new SoC paradigm", Computer, Vol. 35, Iss. 1, Jan. 2002, pp. 70-78.
- V. Lahtinen, Design and Analysis of Interconnection Architectures for On-Chip Digital Systems, PhD Thesis, Tampere University of Technology, Department of Information Technology, June 2004. http://www.tkt.cs.tut.fi/research/daci/pub_open/lahtinen_thesis.pdf
Contents
- Problem statement
- Physical limitations
- Network-on-chip (NoC)
- Extra

See also:
- E. Salminen, A. Kulmala, T.D. Hämäläinen, "Survey of Network-on-chip Proposals", white paper, OCP-IP, April 9, 2008, 13 pages. [online]: http://www.ocpip.org/socket/whitepapers/OCP-IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf
- E. Salminen, A. Kulmala, T.D. Hämäläinen, "On Network-on-chip comparison", Euromicro Conf. on Digital System Design, Lübeck, Germany, August 27-31, 2007, pp. 503-510. http://daci.digitalsystems.cs.tut.fi:8180/pubfs/fileservlet?download=true&filedir=dacifs&freal=Salminen_-_On_Network-on-chip_compar.pdf&id=82519
At first
Make sure that simple things work before even trying more complex ones
Problem Statement - SoC Complexity
- SoC consists of heterogeneous components
- Varying communication requirements/profiles
- Not all components communicate with each other

[Figure: communication network connecting Mem_1..Mem_N, Proc_1..Proc_N, Acc_1..Acc_N, and Periph_1..Periph_N]
Problem Statement (2)
- Bandwidth (or throughput): amount of data transferred per unit time [MB/s]
  - High requirement between CPU and memory
  - Low requirement between CPU and peripheral
- Different latency expectations

[Figure: high BW between CPU_1..CPU_N and Mem_1..Mem_N; low BW towards Periph_1..Periph_N and Acc_1..Acc_N]
Several clock domains
- Not possible/practical to use the same clock in every component
- GALS – Globally Asynchronous, Locally Synchronous
  - Components have local clocks
  - Communication needs handshaking/synchronization

[Figure: high-frequency Proc_1..Proc_N and Mem_1..Mem_N vs. low-frequency Periph_1..Periph_N and Acc_1..Acc_N]
Characteristics of offered traffic load
- Spatial: where the data goes
  - Are all sources similar?
  - a) one dst: neighbor, b) one dst: some node, c) few dsts, d) send to all
- Temporal: average data rate
- Temporal: when to transfer
  - a) Short bursts of high transfer activity and long periods of inactivity
  - b) Transfers with constant sizes and intervals

[Figure: data amount vs. time for very bursty, moderately bursty, and constant-bitrate traffic; spatial patterns a)-d) from a source]
Latency
- Delay between start of transfer and completion:
  time(last data ejected) – time(first data enters)
  [n cycles for transferring d words]
- Interrupts usually require low latency
- Real-time systems require guaranteed latency
- Stream data (voice, video) may require constant latency (low jitter)
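The bracketed note above (n cycles for d words) can be made concrete with a toy model. This is only an illustrative sketch: the pipeline-fill overhead and link width below are made-up parameters, not values from the lecture.

```python
import math

def transfer_latency_cycles(d_words, init_cycles, words_per_cycle=1):
    """Latency = time(last word ejected) - time(first word enters).

    Toy model: a fixed pipeline-fill overhead (arbitration, routing)
    plus one cycle for every words_per_cycle words on the link.
    """
    return init_cycles + math.ceil(d_words / words_per_cycle)

# 16-word burst, one word per cycle, 3-cycle fill overhead (assumed)
print(transfer_latency_cycles(16, init_cycles=3))  # 19
```

The same shape (fixed overhead + size-proportional term) reappears later in the bus-latency and acknowledgement slides.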
Measuring load-latency behavior

Measured load-latency curve
Transaction latency components
Scalable Multiprocessors, lecture slides, http://www.cs.princeton.edu/courses/archive/spr07/cos598A/
Physical limitations
ITRS 2003: Interconnect – Chip cross-section
- Wires on top metal levels are wider and taller than on lower levels
- Top layers used for power supply, clock, and global signals
- Several metal layers – less congestion
- Hierarchical scaling

[Figure: chip cross-section with transistors at the bottom and stacked metal layers above]
ITRS 2003: Interconnect
- Delay of global wires does not scale with technology

[Figure: relative delay vs. process generation for gate delay, local signals, global signals with repeaters (bigger area and energy), and plain global signals]

Note! Very important!
Energy breakdown forecast
[Mattan Erez, Stream Architectures – Programmability and Efficiency, Tampere SoC, Nov. 17 2004]

[Figure: forecast of the chip energy breakdown; compare the shares of computation and communication]
Localization
[Mattan Erez, Stream Architectures – Programmability and Efficiency, Tampere SoC, Nov. 17 2004]
- Communication between non-neighboring components requires many hops
- Communication must be localized to avoid long wires, which
  - consume much energy
  - are slow, prone to errors, and cause routing congestion
- Favor several small components instead of a few large ones
Reliability problems
- "Synchronization failures between clock domains will be rare but unavoidable"
- Electrical noise due to crosstalk, electromagnetic interference, radiation...
  - Data errors or upsets, soft errors
- Data transfers become unreliable and nondeterministic
- Design needs both deterministic and stochastic models
Achieving reliability
- Today, designers use physical techniques to overcome reliability problems:
  - Wire sizing
  - Length optimization
  - Repeater insertion
  - Shielding
  - Data coding
  - A bunch of others...
  - Huge design effort required
- In the (near) future, 100% reliability on the physical level cannot be afforded anymore
- Reliability must be increased with additional HW or SW layers:
  - Error detecting/correcting codes
  - Retransmissions
  - Request/acknowledge and time-out counters
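The detect-and-retransmit idea above can be sketched with the simplest possible code, a single parity bit per word. This is a hypothetical minimal example: real on-chip links would use stronger codes (CRC, Hamming) that can also locate or correct errors.

```python
def parity(word: int) -> int:
    """Even parity over a 32-bit word: XOR of all its bits."""
    word &= 0xFFFFFFFF
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

def send(word):
    return (word, parity(word))        # data + check bit travel together

def receive(word, check):
    # Detects any odd number of bit flips; on failure the receiver
    # would withhold the acknowledge, triggering a retransmission.
    return parity(word) == check

data, chk = send(0b1011_0010)
print(receive(data, chk))              # clean transfer accepted
print(receive(data ^ 0b100, chk))      # single bit flip detected
```

Parity misses even numbers of flips, which is exactly why the slide lists it together with retransmissions and time-out counters rather than as a standalone fix.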
Network-on-chip (NoC)
Network-on-Chip (NoC)
NoC motivation:
1. High fab cost and effort in traditional VLSI – design a general-purpose platform, ASSP
2. Flexibility for changing application needs
3. Concurrency in transfers (whole chip)
4. On-chip wires are no longer reliable
5. Only short signal wires due to power and delay problems
- Usually a packet-switched, multi-hop network
Differences between Multiprocessors and SoC
(MP = multiprocessor systems, past; SoC = System-on-Chip, portable device)
- MP: Scalability important after fab (increase nodes). SoC: scalability an issue only at design time (reuse, easy addition of nodes).
- MP: Load balancing and even distribution of computation important for maximum performance. SoC: energy consumption important, idle nodes must be shut down.
- MP: Communication network used as a means of balancing computation and communication (both adjusted for optimal performance). SoC: computation might already be fixed per node (functional partition); the network serves the nodes (only the network is adjusted).
- MP: Dataflow computing. SoC: computation is very heterogeneous, both dataflow and control style.
- MP: In principle any node can compute a given task. SoC: execution of various applications clustered within the SoC (specialized nodes).
- MP: Much experience and well-established research on routing, switching, scalability, and tailoring according to applications. SoC: some research seems to be "re-inventing the wheel"; new challenge: energy saving combined with past multiprocessor research.
Micronetwork protocol stack
- Layers are specialized and optimized according to the application (domain)

[Figure: protocol stack with increasing abstraction: wiring; arbitration and packetization to increase reliability; routing; splitting into packets and reordering; HW-dependent SW]
NoC
- Structure
  - topology – routers and links
  - router design
- Control
  - routing – which way to take
  - flow control and switching – when to transmit
Terminology

[Figure: agent(0) and agent(1), each containing a processing element, attach through network interfaces to ports of router(0), router(1), and router(2) (degree = 4) in the communication network; routers are joined by links. A message (or stream) from a terminal is split into packets, packets into flits, and flits into phits.]

Abbreviations: fl = flit, flow control unit; ph = phit, physical unit; pkt = packet
Homogeneous network
- Replication effect
- Memory dominated anyway
- Solve realization issues once and for all
- Less flexible
- Problematic if processing units are heterogeneous: assumes a uniform size for components and hence a) wastes area, b) components have to be split

H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003.
Heterogeneous network
- Better fit to the application domain – better performance
- Smaller increments
- Components are not uniformly sized
- Hierarchical structure
- Are ASICs possible in the future anymore?

H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003.
Network topology
- Defines the components (e.g. routers) and the connections (e.g. each router connected to 4 neighbours)
- Vast number of topologies proposed in the literature:
  1. Static networks utilize only point-to-point or shared connection lines
  2. Dynamic networks use switches (or routers) for communication
     a) Direct = each processing node connected to a switch
     b) Indirect = some switches are not connected directly to any processing node
Network topology (2)
- Can be modeled with graphs: node = processing unit, edge = data stream
- Number of nodes denoted with N
- Average path length L
  - Average number of edges between all node pairs in the graph
  - Small L desired for small latency
- Average degree <k>
  - Average number of edges at each switch
  - Large <k> may decrease L, but the implementation also gets more complex
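The metrics L and <k> can be computed from the adjacency lists with a plain BFS. This is a graph-level sketch only: the per-topology L values quoted on later slides follow the lecture's own hop-counting conventions (e.g. counting switch traversals), so this code will not reproduce those exact numbers.

```python
from collections import deque

def avg_path_length_and_degree(adj):
    """L = average hop count over all ordered node pairs (via BFS),
    <k> = average number of edges per node."""
    n = len(adj)
    total_hops, pairs = 0, 0
    for s in range(n):
        dist = {s: 0}
        q = deque([s])
        while q:                       # standard breadth-first search
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total_hops += sum(dist.values())
        pairs += n - 1
    avg_k = sum(len(neigh) for neigh in adj) / n
    return total_hops / pairs, avg_k

# bidirectional 4-node ring: each node has two neighbours
L, k = avg_path_length_and_degree([[1, 3], [0, 2], [1, 3], [0, 2]])
print(L, k)  # 1.333... and 2.0
```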
Network topology: Bisection bandwidth
- The design is partitioned into two (nearly) equal halves; bisection bandwidth is the minimum number of wires that must cross between the halves, considering all possible partitions
  - Number of nodes in the halves differs by at most 1
  - Other definitions also exist...
- A high number means a higher number of possible routes and hence increased flexibility and fault tolerance
- Should increase with the number of nodes in scalable networks
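The definition above can be checked by brute force on tiny graphs: enumerate every balanced partition and count the crossing edges. This exhaustive sketch is exponential in the node count and is meant only to make the definition concrete, not as a practical tool.

```python
from itertools import combinations

def bisection_width(n_nodes, edges):
    """Minimum number of edges crossing any balanced two-way partition.
    Brute force over all size-floor(n/2) halves - tiny graphs only."""
    best = None
    for half in combinations(range(n_nodes), n_nodes // 2):
        half = set(half)
        cut = sum(1 for u, v in edges if (u in half) != (v in half))
        best = cut if best is None else min(best, cut)
    return best

# 4-node ring 0-1-2-3-0: any balanced cut severs at least 2 links
print(bisection_width(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # 2
```

The result matches the value 2 given for the ring in the static-analysis tables later in this deck; a chain (tree-like) graph gives 1, illustrating the bottleneck noted on the tree slide.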
Generic router

[Figure: generic router with input ports and output ports connected through a crossbar, controlled by a routing arbitrator]
Routing algorithm
- Selects the route from source to destination
1. Deterministic
   - Same route always used between a source and a destination
   - E.g. 2-D mesh: first find the correct row, then the correct column
   - All packets arrive in order
   - One blocked (or faulty) link/router blocks all packets on that route
2. Adaptive
   - Route varies according to blockage
   - Better performance (at least when reordering is neglected)
   - Better fault tolerance
   - Deadlock avoidance needs extra care
   - Data may arrive out of order
     - Reordering buffers required at the receiver
     - Buffers may consume large area/energy
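Dimension-ordered routing, the deterministic 2-D mesh example above, fits in a few lines. This is a minimal sketch (here walking x first, then y); it returns the unique route, which is why packets stay in order and why one faulty link on it blocks everything.

```python
def xy_route(src, dst):
    """Deterministic dimension-ordered (XY) routing in a 2-D mesh:
    move along x until the column matches, then along y.
    src and dst are (x, y) router coordinates."""
    x, y = src
    path = []
    while x != dst[0]:                 # resolve the x dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then the y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))  # [(1, 0), (2, 0), (2, 1)]
```

An adaptive router would instead pick among the productive directions based on congestion, trading the in-order guarantee for fault tolerance.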
Buffering
Buffering has a big impact on NoC performance and router area
1. Store-and-forward
   - Data forwarded when the whole packet has been received
   - Whole packet buffered; increases area and latency
2. Virtual cut-through: data forwarded ASAP
   - Whole packet buffered if the output is blocked
3. Wormhole: data forwarded ASAP
   - Buffer sizes can be independent of the packet size
   - Reserves the whole transfer path and hence increases contention
- Some schemes drop packets when contention is high
  - Highly nondeterministic
  - Acknowledgements required (doubles latency, buffers for retransfers)
  - Not recommended
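The latency cost of store-and-forward versus the cut-through schemes can be seen in the standard zero-load approximation (one flit per link per cycle, one-cycle router delay, no contention). These formulas are a textbook simplification, not figures from the lecture; at zero load virtual cut-through behaves like wormhole.

```python
def store_and_forward_latency(hops, flits):
    """Whole packet is serialized again at every router."""
    return hops * flits

def wormhole_latency(hops, flits):
    """Header pipelines through the routers; the body streams
    behind it, so hop count and packet length add instead of multiply.
    Same zero-load formula applies to virtual cut-through."""
    return hops + flits

# 8-flit packet over 4 hops
print(store_and_forward_latency(4, 8))  # 32 cycles
print(wormhole_latency(4, 8))           # 12 cycles
```

The gap grows with the hop count, which is why multi-hop NoCs favor wormhole or virtual cut-through; the price is the path reservation (wormhole) or full-packet buffers (cut-through) noted above.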
(Shared multimaster) bus
- Bus = set of signals connected to all devices
- Shared resource
  - One connection between devices reserves the whole interconnection
  - Bandwidth shared among devices
  - Bandwidth may be scaled by adding links
- Most common SoC network
  - Low implementation cost, simple
  - Long signal lines problematic

Single bus: N = 16, L = 1, <k> = -
Multiple bus: N = 16, L = 1, <k> = -
Bus arbitration / addr decoding
- Arbitration decides which master may use the shared resource (e.g. bus or memory)
  - A single-master system does not need arbitration
  - E.g. priority, round-robin, TDMA
  - Two-level: e.g. TDMA + priority
  - May be pipelined with the previous transfer
- Decoding is needed to determine the target
  - Centralized / distributed schemes
  - Address and data are broadcast to every node
  - The decoder selects which node reads the data or responds
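Of the arbitration policies listed above, round-robin is the easiest to sketch: grant the first requesting master after the previous winner, wrapping around. A behavioural model (in hardware this would be a priority encoder with a rotating mask):

```python
def round_robin_grant(requests, last_grant, n_masters):
    """requests: set of master ids currently asserting a request.
    Scan from the master after the previous winner, wrapping around,
    so every persistent requester is served within n_masters rounds."""
    for i in range(1, n_masters + 1):
        candidate = (last_grant + i) % n_masters
        if candidate in requests:
            return candidate
    return None  # no requests pending

reqs = {0, 2, 3}                       # masters 0, 2, 3 request the bus
g = round_robin_grant(reqs, last_grant=0, n_masters=4)
print(g)                               # 2
print(round_robin_grant(reqs, g, 4))   # 3
```

A fixed-priority arbiter would instead always scan from master 0, which is simpler but can starve low-priority masters; that is why the slide mentions two-level schemes such as TDMA + priority.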
Centralized / Distributed
(M = master, S = slave)

[Figure 2: Centralized vs. distributed control. a) Centralized: masters M1-M3 and slaves S1-S3 share a single arbiter (request + grant) and a single decoder (select). b) Distributed: each agent A1-A5 has its own arbiter/decoder.]
Complex bus topologies
- Hierarchical bus – several bus segments connected with bridges
  - Fast access as long as the target is in the same segment
  - Requires locality of accesses
  - Theoretical max. speed-up = number of segments
  - Segments either circuit- or packet-switched together
  - Packet switching provides more parallelism with added buffering
- Split-bus
  - No data storage – only three-state buffers
  - If the switches are non-conducting: smaller effective capacitance and, hence, smaller energy

Hierarchical bus (chain + tree): N = 16, L = 2.1, <k> = 2.5
Hierarchical bus (chain): N = 16, L = 2.3, <k> = 2
Other topologies
- Ring: N = 16, L = 6.3, <k> = 3
  - Simple layout
  - Unidirectional ring may result in long latency
  - Good for pipelines
- Fully connected, point-to-point network: N = 16, L = 1, <k> = -
  - Highest performance
  - Clearly not a scalable approach
- 3-D hypercube: N = 8, L = 3.7, <k> = 8
  - 3-D topologies are hard to map on a 2-D silicon die
Topologies: mesh and torus
- 2-D mesh and torus are very popular
- Simple layout for uniformly sized nodes
- Wrap-around wires in the torus need special attention

2-D mesh: N = 16, L = 4.7, <k> = 4
2-D torus: N = 16, L = 4.1, <k> = 5
Topologies: Tree
- Traditional tree has bisection bandwidth = 1
  - Bottleneck for uniform traffic
  - Does not matter when the traffic is localized
- Fat tree has more (or wider) links near the root
  - Becoming more popular as a NoC topology
- Trees can also be constructed so that each node is a processing node

Rooted, complete, binary tree: N = 16, L = 6.5, <k> = 2.9
Fat tree with butterfly elements and fanout of 2 (binary fat tree): N = 16, L = 6.5, <k> = 3.5
Topologies: static analysis
- Some basic properties may be analyzed statically
- Simulation with real applications preferred (i.e. dynamic analysis)

Lahtinen 2004, Table 3.3: Implementation costs

Network                          | Number of switches | Number of wires   | Links
Single bus                       | 0                  | 1                 | Bi
Multiple bus                     | 0                  | e                 | Bi
Hierarchical bus (chain)         | e-1                | e                 | Bi
Crossbar                         | N^2/4              | N^2/2             | Bi
One-sided crossbar               | N^2/2              | N^2-N/2           | Bi
Binary tree                      | N-1                | 2(N-1)            | Bi
Fat tree (fanout 2)              | N*log2(N)          | 2N*log2(N)        | Bi
Ring                             | N                  | 2N                | Bi
3-D hypercube                    | N                  | N+(N/2)*log2(N)   | Bi
2-D mesh                         | N                  | 3N-2*sqrt(N)      | Bi
2-D torus                        | N                  | 3N                | Bi
Point-to-point, fully connected  | 0                  | (N^2-N)/2         | Bi
Omega network (MIN)              | (N/4)(log2(N)-1)   | (N/2)*log2(N)     | Uni

Lahtinen 2004, Table 3.2: Performance

Network                          | Parallel transactions | Longest path | Bisection bandwidth | Links
Single bus                       | 1                     | 1            | 1                   | Bi
Multiple bus                     | e (e <= N)            | 1            | e                   | Bi
Hierarchical bus (chain)         | e (e <= N)            | e (e <= N)   | 1                   | Bi
Crossbar                         | N                     | N            | N-1                 | Bi
One-sided crossbar               | N                     | 2N-1         | N/2                 | Bi
Binary tree                      | N                     | 2*log2(N)    | 1                   | Bi
Fat tree (fanout 2)              | N                     | 2*log2(N)    | N                   | Bi
Ring                             | N                     | N/2+2        | 2                   | Bi
3-D hypercube                    | N                     | log2(N)+2    | N/2                 | Bi
2-D mesh                         | N                     | 2*sqrt(N)    | sqrt(N)             | Bi
2-D torus                        | N                     | sqrt(N)+2    | 2*sqrt(N)           | Bi
Point-to-point, fully connected  | N                     | 1            | (N/2)*(N/2)         | Bi
Omega network (MIN)              | N/2                   | log2(N)      | N                   | Uni
Average NoC 2008

Average NoC 2008 (2)

Salminen et al., Survey of NoC proposals, OCP-IP, 2008
Overview of Managing On-Chip Communications

[Figure: interconnect options ordered from dedicated point-to-point links, through a single bus, hierarchical bus structures, and regular multi-hop topologies, to customized multi-hop networks. Trade-off axes: number of network elements; latency & BW (always guaranteed vs. best-effort/predictable); scalability & flexibility (limited vs. arbitrary); number of IP blocks; network reuse (design once, general purpose vs. IP-block specific); complexity (simple vs. very complex).]
Conclusion
- Many small components with different requirements
- Wire delays and power consumption becoming very problematic
- Big difference between local and global (or off-chip) communication
- Fully synchronous approach becoming infeasible
- Network-on-chip = multi-hop on-chip network
  - Often packet-switched
  - Buffering, routing, and topology are important design decisions
Extra

Case Study: Managing Interconnection Complexity in Heterogeneous IP Block Interconnection (HIBI)
Lessons Learned
- Many communication networks have been studied at TUT
  - On-chip communication research started in 1997
- A regular topology can be fitted well to an algorithm-specific, comp/comm-balanced implementation
- In the general case there is no optimal topology
- Communication-centric design was successfully conducted for performance
- Important to exploit features of the application(s) to optimize the interconnection
- Established parallel processing doctrines can be applied to SoC
- The SoC challenge is heterogeneity in computation
Interconnection Implementation View
- Make the lowest-level data transfer mechanisms simple and efficient
  - Minimum number of signals
  - "Every clock edge carries useful data in a transaction"
- Perform all high-level operations on the basic mechanisms
  - Layered protocol model, OCP compatible
  - Message passing
- Use identical HW modules to compose the overall interconnection
  - Translate IP-specific communication operations to the network
  - Support all (practical) topologies
  - No limits to the number of IP blocks (whole design)
  - Support (re-)configurability
  - Fit to all communication needs – from memories to peripherals

"Gives body to build the interconnect"
System Design View
- Make the interconnection aware of application functionality
A) System design time
  - Communication profiled from application processes
  - Clustering: localization of communication
  - Allocation of communication resources (segments, buffers)
  - Optimization of non-reconfigurable parameters
  - Initial QoS and other transfer parameters
B) Run time
  - Utilize knowledge of predictable communication events if available
  - Guaranteed QoS in transfers
  - Track communication – change QoS & other parameters if required
  - Totally change the mode of operation if required
- The HIBI Design Flow is 80% of the HIBI interconnect scheme

"Gives brains to the communication"
HIBI Identical Interconnection Modules
- The HIBI wrapper is the only building block used everywhere in the interconnection
  - Between the network and IP blocks
  - Between network segments
- The wrapper is parametrizable, modular, and configurable
- Asynchronous FIFO buffering

[Figure: processors P1..PN, memories Mem1..MemN, and accelerators Acc1..AccN each attach through a HIBI wrapper (FIFO / OCP interface between IP and wrapper) to the HIBI network]
HIBI Network
- The HIBI network consists of bus segments and bridges
  - Transfers within a segment: synchronous, circuit-switched
  - Transfers across bridges: asynchronous, packet-switched
  - Scales from a serial point-to-point link to an arbitrary topology
- Identical signals between wrappers on the network side
  - No dedicated point-to-point signals
  - All signals shared within a network segment
  - Wrapper layout is independent of the number of agents
- Totally distributed arbitration
  - No central arbiter
  - Each wrapper is aware of the communication details
HIBI Network Example

[Figure: several clock domains, each a bus segment with IP blocks attached via HIBI wrappers; segments connected with bridges, each bridge formed by a pair of HIBI wrappers back-to-back]
Bus latency
Total latency consists of several phases. From: K. Kuusilinna, PhD Thesis, TUT, 2001.

Arbitration latency:
- Request bus ownership (available methods: central arbiter, daisy chain, wired-OR, connectionless arbitration)
- Wait for higher-priority transactions to complete / arbitration (round-robin, hierarchical round-robin, time-slot, fixed priority, adaptive)
- Bus ownership granted (see Request)
- Waiting time may be long during high contention

Initial latency:
- Begin transaction; wait for master ready / wait for target ready (address/data multiplexing, handshaking)
- Transfer first data

Subsequent data latency:
- Transfer data; wait for master ready / wait for target ready
- Until all data has been transferred or a limit for data transfers per burst is reached
- Optimizing this phase has the biggest impact in long transfers

Turn-around latency:
- Drive or wait for the bus to settle to its idle state

Figure: Bus latency
HIBI Quality of Service
- TDMA (time division multiple access) with freely run-time adjustable frame length and slot durations and allocations
- Re-synchronization to the application phase
- Also traditional priority / round-robin

[Figure: repeating time frames with time slots allocated to agents A1-A3 plus competition slots; priority and round-robin orderings shown over time for comparison]
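The TDMA frame described above can be modeled as an ordered list of (agent, slot duration) pairs that repeats every frame. The schedule below is a made-up example for illustration; in HIBI the frame length and slot allocations are run-time adjustable, which here would simply mean replacing the list.

```python
def tdma_owner(cycle, slots):
    """slots = ordered (agent, duration) pairs forming one time frame.
    Returns which agent owns the bus on the given clock cycle."""
    frame_len = sum(d for _, d in slots)
    t = cycle % frame_len              # position inside the repeating frame
    for agent, duration in slots:
        if t < duration:
            return agent
        t -= duration

frame = [("A1", 3), ("A2", 2), ("A3", 3)]   # hypothetical 8-cycle frame
print([tdma_owner(c, frame) for c in range(9)])
# ['A1', 'A1', 'A1', 'A2', 'A2', 'A3', 'A3', 'A3', 'A1']
```

Because ownership is a pure function of the cycle counter and the (shared) schedule, every wrapper can evaluate it locally, which is what makes fully distributed arbitration possible.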
HIBI Basic Transfer
- Pipelined with arbitration
- Split transactions
- Burst transfers
- No wait cycles allowed
- Non-pre-emptive transfers
- QoS is guaranteed with TDMA or with a combination of Send Max + Priority/Round-Robin

[Figure: pipelined, split transactions over time: write address and data, read-request address, and returned address and data interleaved on the bus]
HIBI Wrapper Structure (v.2)

[Figure: Tx FSM fed by high- and low-priority tx FIFOs; Rx FSM with an address decoder feeding high- and low-priority rx FIFOs; a mux/demux pair between the HIBI signals in/out and the IP signals in/out; a configuration memory]
Wrapper Configuration Memory
- Stores all the information needed for distributed arbitration
  - Permanent: ROM, 1 page
  - Semi run-time configurable: ROM with several pages
  - Fully run-time configurable: RAM with pages

[Figure: a cycle counter and time-slot logic select between current and new configuration values via mux/demux; current page and configuration page]
HIBI Wrapper Area

[Figure: bar chart of sub-block areas in gates (Tx ctrl, Rx ctrl, config mem 1-page RAM and ROM, FIFOs 10x8b and 5x8b, read mux, addr decoder, write demux), values ranging from 15 to 1161 gates]
- Flip-flop-based buffers need a large area
HIBI Wrapper Area in ASIC

[Figure: area in gates (0 to 35 000) for 8/16/32/64-bit wrappers, ROM vs. RAM configuration memory, in three configurations: 3/3 low-priority and 0/0 high-priority FIFOs with 1-page memory; 5/5 low- and 5/5 high-priority FIFOs with 1-page memory; 10/5 low- and 10/5 high-priority FIFOs with 2-page memory]
HIBI Wrapper Area in FPGA

[Figure: area in logic elements (0 to 9 000) for the same 8/16/32/64-bit wrapper configurations, ROM vs. RAM]
- More efficient boundary optimization
Effect of Virtual Cut-Through / Store-and-Forward on HIBI Interconnection Area
- Application: 2 shared memories, 8 processors, each performing read-process-write of 100 bytes
- Solid lines: minimal buffering in the wrapper (virtual cut-through routing)
  - HIBI v.1 area is larger than v.2 due to separate address and data FIFOs
- Dashed lines: all data buffered in the wrapper (store-and-forward routing)

[Figure: interconnection logic area [mm2] (0-5) vs. transfer size [bytes] (0-120) for 1x8b and 2x8b HIBI v.1 and v.2, plus 1x8b excess-buffering variants of v.1 and v.2]
NoC component comparison

Runtime comparison

Salminen et al., SAMOS 2005.
NoC extras
Problems with Current NoC Discussion
- What is a "NoC" – no common definition
  - Something new, good by definition (needs no proof), ...
- General purpose – but to what extent?
  - Arbitrary connectivity between any nodes?
  - Uniform overall transfer distribution?
- Discussion about an "optimal topology"
  - Multiprocessor architectures for scientific computations?
  - Can massive fine-grain parallelism be utilized in realistic SoC applications?
- Copying computer network ideas without criticism
  - In-network data buffering, routing tables and algorithms
  - Compare to current TCP/IP or past ATM routers!
- Toy test case applications
  - A billion transistors – executes a single FFT?
  - Common benchmarks should be designed!
Wiring hierarchy
[H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003]
- Wire layers: global, intermediate, local
- How far can a signal reach in one local clock cycle? Depends on
  - frequency (i.e. duration of the clock cycle)
  - wiring parameters (layer, width, height, density, shielding)
- Not far, anyway...
- Global wires will function as lossy transmission lines
  - Today's RC models become inaccurate
  - 3-D modeling is s-l-o-w and difficult
Crosstalk impact
P. Liljeberg et al., Self-timed Approach for Noise Reduction in NoC, in "Interconnect-centric design for advanced SoC and NoC", Kluwer, 2004
- Long, fast-switching wires close to each other
- Switching on neighboring wires affects delay
- Delay on wire 4 shown in Table 2
Impact of DMA

[Figure: i) an agent containing a CPU core, instruction and data memories, a DMA unit, a network interface, and other peripherals; ii) execution timelines with and without DMA for a) short communication time, b) equal computation and communication time, c) long communication time – DMA overlaps communication with computation]
Retransfer buffers
- If packets are dropped or corrupted in delivery, they (usually) have to be retransferred
- Variable latencies are problematic: is a packet dropped or just experiencing a longer latency?
  - If the time-out latency is exceeded, the packet is assumed to be missing
- The source must store packets until it receives an acknowledgement of successful transfer
  - Sending an acknowledgement after each packet results in a small buffer but (at least) double latency
  - Sending an ack after every N packets requires bigger buffers but gives better performance

a) ack for each packet: latency per pkt = send_latency + ack_latency
b) ack for each N packets: latency per pkt = (N*send_latency + ack_latency) / N

[Figure: source-side retransfer buffers and acknowledgements between source and destination for both schemes]
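The two formulas above can be evaluated directly to see the amortization effect. The 10-cycle send and ack latencies below are illustrative numbers only.

```python
def latency_per_pkt_ack_each(send_latency, ack_latency):
    """Scheme a): wait for an acknowledgement after every packet."""
    return send_latency + ack_latency

def latency_per_pkt_ack_every_n(send_latency, ack_latency, n):
    """Scheme b): one acknowledgement per n packets, amortized."""
    return (n * send_latency + ack_latency) / n

# assumed: 10-cycle send latency, 10-cycle ack latency, ack every 4 packets
print(latency_per_pkt_ack_each(10, 10))        # 20 cycles per packet
print(latency_per_pkt_ack_every_n(10, 10, 4))  # 12.5 cycles per packet
```

As n grows the per-packet cost approaches the bare send latency, but the source must keep n packets buffered for possible retransmission, which is exactly the buffer-vs-latency trade-off stated above.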
Reordering buffers
- Packets arriving out of order may require huge reordering buffers
  - Sometimes processing units may accept out-of-order delivery, or the buffers can be integrated with the internal memory of the processing unit
- If an ack is sent after 4 packets, a buffer for 4 packets is needed
- Furthermore, separate buffers are needed for each source, as data may be received in an interleaved manner
  - E.g. (pkt_<n>_<src>) received: pkt_1_1, pkt_4_1, pkt_4_2, pkt_3_3...
- E.g. if an ack is sent after every N packets and there are S sources:
  reorder buffer size = N*S packets
- Ack forces in-order delivery

[Figure: a) ack for each packet, b) ack for each N packets; the destination keeps per-source buffers]
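The sizing rule and the reordering itself can be sketched as follows: keep a per-source map of pending packets and release each source's stream as soon as its sequence numbers become contiguous. A behavioural sketch only; a hardware implementation would use the fixed N*S slot budget derived above.

```python
def reorder_buffer_size(n_ack, n_sources):
    """Worst case: N unacknowledged packets in flight per source,
    S sources interleaving -> N*S buffer slots."""
    return n_ack * n_sources

def deliver_in_order(received, n_sources):
    """received = (seq, src) tuples in arrival order.
    Buffers out-of-order packets per source, releases contiguous runs."""
    pending = {s: set() for s in range(n_sources)}
    next_seq = {s: 1 for s in range(n_sources)}
    delivered = []
    for seq, src in received:
        pending[src].add(seq)
        while next_seq[src] in pending[src]:   # release contiguous run
            delivered.append((next_seq[src], src))
            pending[src].remove(next_seq[src])
            next_seq[src] += 1
    return delivered

print(reorder_buffer_size(4, 2))  # 8 slots, matching N*S above
print(deliver_in_order([(2, 0), (1, 1), (1, 0), (2, 1)], 2))
# [(1, 1), (1, 0), (2, 0), (2, 1)]
```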
Buffer reservation

[Figure: two handshake variants between a sender agent and a receiver agent. i) Sender notifies the receiver of the next tx; the receiver reserves a buffer and configures its rx DMA; ACK; actual data; (optional ACK); the receiver consumes the data. ii) The receiver reserves a buffer in advance and notifies the sender of the reserved buffer; actual data is sent (copied) and consumed; the buffer is reserved again, etc. The observed tx duration differs between the two variants.]
Intertwined/Reordering
- Transfers from different sources may be arbitrarily intertwined
- In addition, packets may arrive out of order

[Figure: source0 sends aa bb cc and source1 sends dd ee through the network; destination0 receives them interleaved (dd aa bb ee cc). These are either single words, bursts, or packets, depending on the network. i) fixed-length packets: "FIFO"-like buffers suffice; ii) variable-length packets: linked-list buffers are needed]
Irregular IP size
- IPs tend to have irregular size and shape
  - The largest IP per row/column decides its height/width
  - Some space is wasted; links will have varying length
- Reordering the IPs reduces area (<19.5% reduction in area>)
  - Ensure that frequently communicating IPs are still close to each other
Customized mesh
- Connect more than one IP to one router
  - Somewhat smaller bandwidth available per IP
  - Usually enough, though
- Adopt a totally customized topology (the rightmost fig.)