TKT-2431 SoC Design, Lec 10 – On-chip communication
Erno Salminen
Department of Computer Systems, Tampere University of Technology
Fall 2008
Erno Salminen - Nov. 2008#2/45
Copyright notice
Part of the slides adapted from:
- slide set by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley, http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml
- Timo D. Hämäläinen, Managing On-Chip Communications, SoC Symposium, Tampere, 19.11.2003
Part of the figures from:
- L. Benini, G. De Micheli, "Networks on chips: a new SoC paradigm", Computer, Vol. 35, Iss. 1, Jan. 2002, pp. 70-78.
- V. Lahtinen, Design and Analysis of Interconnection Architectures for On-Chip Digital Systems, PhD Thesis, Tampere University of Technology, Department of Information Technology, June 2004. http://www.tkt.cs.tut.fi/research/daci/pub_open/lahtinen_thesis.pdf
Contents
- Problem statement
- Physical limitations
- Network-on-chip (NoC)
- Extra

See also:
- E. Salminen, A. Kulmala, T.D. Hämäläinen, "Survey of Network-on-chip Proposals", white paper, OCP-IP, April 9, 2008, 13 pages. [online]: http://www.ocpip.org/socket/whitepapers/OCP-IP_Survey_of_NoC_Proposals_White_Paper_April_2008.pdf
- E. Salminen, A. Kulmala, T.D. Hämäläinen, "On Network-on-chip comparison", Euromicro Conf. on Digital System Design, Lübeck, Germany, August 27-31, 2007, pp. 503-510. http://daci.digitalsystems.cs.tut.fi:8180/pubfs/fileservlet?download=true&filedir=dacifs&freal=Salminen_-_On_Network-on-chip_compar.pdf&id=82519
At first
Make sure that simple things work before even trying more complex ones
Problem Statement - SoC Complexity
- SoC consists of heterogeneous components
- Varying communication requirements/profiles
- Not all components communicate with each other

[Figure: communication network connecting Mem_1..Mem_N, Proc_1..Proc_N, Acc_1..Acc_N, and Periph_1..Periph_N]
Problem Statement (2)
- Bandwidth (or throughput): amount of data transferred per unit time [MB/s]
  - High requirement between CPU and memory
  - Low requirement between CPU and peripheral
- Different latency expectations

[Figure: high BW between CPU_1..CPU_N and Mem_1..Mem_N; low BW towards Periph_1..Periph_N and Acc_1..Acc_N]
Several clock domains
- Not possible/practical to use the same clock in every component
- GALS – Globally Asynchronous, Locally Synchronous
  - Components have local clocks
  - Communication needs handshaking/synchronization

[Figure: high-frequency Proc_1..Proc_N and Mem_1..Mem_N vs. low-frequency Periph_1..Periph_N and Acc_1..Acc_N]
Characteristics of offered traffic load
- Spatial: where the data goes
  - Are all sources similar?
  - a) one dst: neighbor, b) one dst: some node, c) few dsts, d) send to all
- Temporal: average data rate
- Temporal: when to transfer
  - a) Short bursts of high transfer activity and long periods of inactivity
  - b) Transfers with constant sizes and intervals

[Figure: data amount vs. time for very bursty, moderately bursty, and constant-bitrate traffic; spatial patterns a)-d) from a source]
Latency
- Delay between start of transfer and completion:
  time(last data ejected) – time(first data enters)
  [n cycles for transferring d words]
- Interrupts usually require low latency
- Real-time systems require guaranteed latency
- Stream data (voice, video) may require constant latency (low jitter)
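The bracketed note above (n cycles for d words) can be made concrete with a toy model. This is only an illustrative sketch: the pipeline-fill overhead and link width below are made-up parameters, not values from the lecture.

```python
import math

def transfer_latency_cycles(d_words, init_cycles, words_per_cycle=1):
    """Latency = time(last word ejected) - time(first word enters).

    Toy model: a fixed pipeline-fill overhead (arbitration, routing)
    plus one cycle for every words_per_cycle words on the link.
    """
    return init_cycles + math.ceil(d_words / words_per_cycle)

# 16-word burst, one word per cycle, 3-cycle fill overhead (assumed)
print(transfer_latency_cycles(16, init_cycles=3))  # 19
```

The same shape (fixed overhead + size-proportional term) reappears later in the bus-latency and acknowledgement slides.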
Measuring load-latency behavior

Measured load-latency curve
Transaction latency components
Scalable Multiprocessors, lecture slides, http://www.cs.princeton.edu/courses/archive/spr07/cos598A/
Physical limitations
ITRS 2003: Interconnect – Chip cross-section
- Wires on top metal levels are wider and taller than on lower levels
- Top layers used for power supply, clock, and global signals
- Several metal layers – less congestion
- Hierarchical scaling

[Figure: chip cross-section with transistors at the bottom and stacked metal layers above]
ITRS 2003: Interconnect
- Delay of global wires does not scale with technology

[Figure: relative delay vs. process generation for gate delay, local signals, global signals with repeaters (bigger area and energy), and plain global signals]

Note! Very important!
Energy breakdown forecast
[Mattan Erez, Stream Architectures – Programmability and Efficiency, Tampere SoC, Nov. 17 2004]

[Figure: forecast of the chip energy breakdown; compare the shares of computation and communication]
Localization
[Mattan Erez, Stream Architectures – Programmability and Efficiency, Tampere SoC, Nov. 17 2004]
- Communication between non-neighboring components requires many hops
- Communication must be localized to avoid long wires, which
  - consume much energy
  - are slow, prone to errors, and cause routing congestion
- Favor several small components instead of a few large ones
Reliability problems
- "Synchronization failures between clock domains will be rare but unavoidable"
- Electrical noise due to crosstalk, electromagnetic interference, radiation...
  - Data errors or upsets, soft errors
- Data transfers become unreliable and nondeterministic
- Design needs both deterministic and stochastic models
Achieving reliability
- Today, designers use physical techniques to overcome reliability problems:
  - Wire sizing
  - Length optimization
  - Repeater insertion
  - Shielding
  - Data coding
  - A bunch of others...
  - Huge design effort required
- In the (near) future, 100% reliability on the physical level cannot be afforded anymore
- Reliability must be increased with additional HW or SW layers:
  - Error detecting/correcting codes
  - Retransmissions
  - Request/acknowledge and time-out counters
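The detect-and-retransmit idea above can be sketched with the simplest possible code, a single parity bit per word. This is a hypothetical minimal example: real on-chip links would use stronger codes (CRC, Hamming) that can also locate or correct errors.

```python
def parity(word: int) -> int:
    """Even parity over a 32-bit word: XOR of all its bits."""
    word &= 0xFFFFFFFF
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

def send(word):
    return (word, parity(word))        # data + check bit travel together

def receive(word, check):
    # Detects any odd number of bit flips; on failure the receiver
    # would withhold the acknowledge, triggering a retransmission.
    return parity(word) == check

data, chk = send(0b1011_0010)
print(receive(data, chk))              # clean transfer accepted
print(receive(data ^ 0b100, chk))      # single bit flip detected
```

Parity misses even numbers of flips, which is exactly why the slide lists it together with retransmissions and time-out counters rather than as a standalone fix.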
Network-on-chip (NoC)
Network-on-Chip (NoC)
NoC motivation:
1. High fab cost and effort in traditional VLSI – design a general-purpose platform, ASSP
2. Flexibility for changing application needs
3. Concurrency in transfers (whole chip)
4. On-chip wires are no longer reliable
5. Only short signal wires due to power and delay problems
- Usually a packet-switched, multi-hop network
Differences between Multiprocessors and SoC
(MP = multiprocessor systems, past; SoC = System-on-Chip, portable device)
- MP: Scalability important after fab (increase nodes). SoC: scalability an issue only at design time (reuse, easy addition of nodes).
- MP: Load balancing and even distribution of computation important for maximum performance. SoC: energy consumption important, idle nodes must be shut down.
- MP: Communication network used as a means of balancing computation and communication (both adjusted for optimal performance). SoC: computation might already be fixed per node (functional partition); the network serves the nodes (only the network is adjusted).
- MP: Dataflow computing. SoC: computation is very heterogeneous, both dataflow and control style.
- MP: In principle any node can compute a given task. SoC: execution of various applications clustered within the SoC (specialized nodes).
- MP: Much experience and well-established research on routing, switching, scalability, and tailoring according to applications. SoC: some research seems to be "re-inventing the wheel"; new challenge: energy saving combined with past multiprocessor research.
Micronetwork protocol stack
- Layers are specialized and optimized according to the application (domain)

[Figure: protocol stack with increasing abstraction: wiring; arbitration and packetization to increase reliability; routing; splitting into packets and reordering; HW-dependent SW]
NoC
- Structure
  - topology – routers and links
  - router design
- Control
  - routing – which way to take
  - flow control and switching – when to transmit
Terminology

[Figure: agent(0) and agent(1), each containing a processing element, attach through network interfaces to ports of router(0), router(1), and router(2) (degree = 4) in the communication network; routers are joined by links. A message (or stream) from a terminal is split into packets, packets into flits, and flits into phits.]

Abbreviations: fl = flit, flow control unit; ph = phit, physical unit; pkt = packet
Homogeneous network
- Replication effect
- Memory dominated anyway
- Solve realization issues once and for all
- Less flexible
- Problematic if processing units are heterogeneous: assumes a uniform size for components and hence a) wastes area, b) components have to be split

H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003.
Heterogeneous network
- Better fit to the application domain – better performance
- Smaller increments
- Components are not uniformly sized
- Hierarchical structure
- Are ASICs possible in the future anymore?

H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003.
Network topology
- Defines the components (e.g. routers) and the connections (e.g. each router connected to 4 neighbours)
- Vast number of topologies proposed in the literature:
  1. Static networks utilize only point-to-point or shared connection lines
  2. Dynamic networks use switches (or routers) for communication
     a) Direct = each processing node connected to a switch
     b) Indirect = some switches are not connected directly to any processing node
Network topology (2)
- Can be modeled with graphs: node = processing unit, edge = data stream
- Number of nodes denoted with N
- Average path length L
  - Average number of edges between all node pairs in the graph
  - Small L desired for small latency
- Average degree <k>
  - Average number of edges at each switch
  - Large <k> may decrease L, but the implementation also gets more complex
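The metrics L and <k> can be computed from the adjacency lists with a plain BFS. This is a graph-level sketch only: the per-topology L values quoted on later slides follow the lecture's own hop-counting conventions (e.g. counting switch traversals), so this code will not reproduce those exact numbers.

```python
from collections import deque

def avg_path_length_and_degree(adj):
    """L = average hop count over all ordered node pairs (via BFS),
    <k> = average number of edges per node."""
    n = len(adj)
    total_hops, pairs = 0, 0
    for s in range(n):
        dist = {s: 0}
        q = deque([s])
        while q:                       # standard breadth-first search
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total_hops += sum(dist.values())
        pairs += n - 1
    avg_k = sum(len(neigh) for neigh in adj) / n
    return total_hops / pairs, avg_k

# bidirectional 4-node ring: each node has two neighbours
L, k = avg_path_length_and_degree([[1, 3], [0, 2], [1, 3], [0, 2]])
print(L, k)  # 1.333... and 2.0
```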
Network topology: Bisection bandwidth
- The design is partitioned into two (nearly) equal halves; bisection bandwidth is the minimum number of wires that must cross between the halves, considering all possible partitions
  - Number of nodes in the halves differs by at most 1
  - Other definitions also exist...
- A high number means a higher number of possible routes and hence increased flexibility and fault tolerance
- Should increase with the number of nodes in scalable networks
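The definition above can be checked by brute force on tiny graphs: enumerate every balanced partition and count the crossing edges. This exhaustive sketch is exponential in the node count and is meant only to make the definition concrete, not as a practical tool.

```python
from itertools import combinations

def bisection_width(n_nodes, edges):
    """Minimum number of edges crossing any balanced two-way partition.
    Brute force over all size-floor(n/2) halves - tiny graphs only."""
    best = None
    for half in combinations(range(n_nodes), n_nodes // 2):
        half = set(half)
        cut = sum(1 for u, v in edges if (u in half) != (v in half))
        best = cut if best is None else min(best, cut)
    return best

# 4-node ring 0-1-2-3-0: any balanced cut severs at least 2 links
print(bisection_width(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # 2
```

The result matches the value 2 given for the ring in the static-analysis tables later in this deck; a chain (tree-like) graph gives 1, illustrating the bottleneck noted on the tree slide.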
Generic router

[Figure: generic router with input ports and output ports connected through a crossbar, controlled by a routing arbitrator]
Routing algorithm
- Selects the route from source to destination
1. Deterministic
   - Same route always used between a source and a destination
   - E.g. 2-D mesh: first find the correct row, then the correct column
   - All packets arrive in order
   - One blocked (or faulty) link/router blocks all packets on that route
2. Adaptive
   - Route varies according to blockage
   - Better performance (at least when reordering is neglected)
   - Better fault tolerance
   - Deadlock avoidance needs extra care
   - Data may arrive out of order
     - Reordering buffers required at the receiver
     - Buffers may consume large area/energy
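Dimension-ordered routing, the deterministic 2-D mesh example above, fits in a few lines. This is a minimal sketch (here walking x first, then y); it returns the unique route, which is why packets stay in order and why one faulty link on it blocks everything.

```python
def xy_route(src, dst):
    """Deterministic dimension-ordered (XY) routing in a 2-D mesh:
    move along x until the column matches, then along y.
    src and dst are (x, y) router coordinates."""
    x, y = src
    path = []
    while x != dst[0]:                 # resolve the x dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then the y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))  # [(1, 0), (2, 0), (2, 1)]
```

An adaptive router would instead pick among the productive directions based on congestion, trading the in-order guarantee for fault tolerance.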
Buffering
Buffering has a big impact on NoC performance and router area
1. Store-and-forward
   - Data forwarded when the whole packet has been received
   - Whole packet buffered; increases area and latency
2. Virtual cut-through: data forwarded ASAP
   - Whole packet buffered if the output is blocked
3. Wormhole: data forwarded ASAP
   - Buffer sizes can be independent of the packet size
   - Reserves the whole transfer path and hence increases contention
- Some schemes drop packets when contention is high
  - Highly nondeterministic
  - Acknowledgements required (doubles latency, buffers for retransfers)
  - Not recommended
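The latency cost of store-and-forward versus the cut-through schemes can be seen in the standard zero-load approximation (one flit per link per cycle, one-cycle router delay, no contention). These formulas are a textbook simplification, not figures from the lecture; at zero load virtual cut-through behaves like wormhole.

```python
def store_and_forward_latency(hops, flits):
    """Whole packet is serialized again at every router."""
    return hops * flits

def wormhole_latency(hops, flits):
    """Header pipelines through the routers; the body streams
    behind it, so hop count and packet length add instead of multiply.
    Same zero-load formula applies to virtual cut-through."""
    return hops + flits

# 8-flit packet over 4 hops
print(store_and_forward_latency(4, 8))  # 32 cycles
print(wormhole_latency(4, 8))           # 12 cycles
```

The gap grows with the hop count, which is why multi-hop NoCs favor wormhole or virtual cut-through; the price is the path reservation (wormhole) or full-packet buffers (cut-through) noted above.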
(Shared multimaster) bus
- Bus = set of signals connected to all devices
- Shared resource
  - One connection between devices reserves the whole interconnection
  - Bandwidth shared among devices
  - Bandwidth may be scaled by adding links
- Most common SoC network
  - Low implementation cost, simple
  - Long signal lines problematic

Single bus: N = 16, L = 1, <k> = -
Multiple bus: N = 16, L = 1, <k> = -
Bus arbitration / addr decoding
- Arbitration decides which master may use the shared resource (e.g. bus or memory)
  - A single-master system does not need arbitration
  - E.g. priority, round-robin, TDMA
  - Two-level: e.g. TDMA + priority
  - May be pipelined with the previous transfer
- Decoding is needed to determine the target
  - Centralized / distributed schemes
  - Address and data are broadcast to every node
  - The decoder selects which node reads the data or responds
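Of the arbitration policies listed above, round-robin is the easiest to sketch: grant the first requesting master after the previous winner, wrapping around. A behavioural model (in hardware this would be a priority encoder with a rotating mask):

```python
def round_robin_grant(requests, last_grant, n_masters):
    """requests: set of master ids currently asserting a request.
    Scan from the master after the previous winner, wrapping around,
    so every persistent requester is served within n_masters rounds."""
    for i in range(1, n_masters + 1):
        candidate = (last_grant + i) % n_masters
        if candidate in requests:
            return candidate
    return None  # no requests pending

reqs = {0, 2, 3}                       # masters 0, 2, 3 request the bus
g = round_robin_grant(reqs, last_grant=0, n_masters=4)
print(g)                               # 2
print(round_robin_grant(reqs, g, 4))   # 3
```

A fixed-priority arbiter would instead always scan from master 0, which is simpler but can starve low-priority masters; that is why the slide mentions two-level schemes such as TDMA + priority.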
Centralized / Distributed
(M = master, S = slave)

[Figure 2: Centralized vs. distributed control. a) Centralized: masters M1-M3 and slaves S1-S3 share a single arbiter (request + grant) and a single decoder (select). b) Distributed: each agent A1-A5 has its own arbiter/decoder.]
Complex bus topologies
- Hierarchical bus – several bus segments connected with bridges
  - Fast access as long as the target is in the same segment
  - Requires locality of accesses
  - Theoretical max. speed-up = number of segments
  - Segments either circuit- or packet-switched together
  - Packet switching provides more parallelism with added buffering
- Split-bus
  - No data storage – only three-state buffers
  - If the switches are non-conducting: smaller effective capacitance and, hence, smaller energy

Hierarchical bus (chain + tree): N = 16, L = 2.1, <k> = 2.5
Hierarchical bus (chain): N = 16, L = 2.3, <k> = 2
Other topologies
- Ring: N = 16, L = 6.3, <k> = 3
  - Simple layout
  - Unidirectional ring may result in long latency
  - Good for pipelines
- Fully connected, point-to-point network: N = 16, L = 1, <k> = -
  - Highest performance
  - Clearly not a scalable approach
- 3-D hypercube: N = 8, L = 3.7, <k> = 8
  - 3-D topologies are hard to map on a 2-D silicon die
Topologies: mesh and torus
- 2-D mesh and torus are very popular
- Simple layout for uniformly sized nodes
- Wrap-around wires in the torus need special attention

2-D mesh: N = 16, L = 4.7, <k> = 4
2-D torus: N = 16, L = 4.1, <k> = 5
Topologies: Tree
- Traditional tree has bisection bandwidth = 1
  - Bottleneck for uniform traffic
  - Does not matter when the traffic is localized
- Fat tree has more (or wider) links near the root
  - Becoming more popular as a NoC topology
- Trees can also be constructed so that each node is a processing node

Rooted, complete, binary tree: N = 16, L = 6.5, <k> = 2.9
Fat tree with butterfly elements and fanout of 2 (binary fat tree): N = 16, L = 6.5, <k> = 3.5
Topologies: static analysis
- Some basic properties may be analyzed statically
- Simulation with real applications preferred (i.e. dynamic analysis)

Lahtinen 2004, Table 3.3: Implementation costs

Network                          | Number of switches | Number of wires   | Links
Single bus                       | 0                  | 1                 | Bi
Multiple bus                     | 0                  | e                 | Bi
Hierarchical bus (chain)         | e-1                | e                 | Bi
Crossbar                         | N^2/4              | N^2/2             | Bi
One-sided crossbar               | N^2/2              | N^2-N/2           | Bi
Binary tree                      | N-1                | 2(N-1)            | Bi
Fat tree (fanout 2)              | N*log2(N)          | 2N*log2(N)        | Bi
Ring                             | N                  | 2N                | Bi
3-D hypercube                    | N                  | N+(N/2)*log2(N)   | Bi
2-D mesh                         | N                  | 3N-2*sqrt(N)      | Bi
2-D torus                        | N                  | 3N                | Bi
Point-to-point, fully connected  | 0                  | (N^2-N)/2         | Bi
Omega network (MIN)              | (N/4)(log2(N)-1)   | (N/2)*log2(N)     | Uni

Lahtinen 2004, Table 3.2: Performance

Network                          | Parallel transactions | Longest path | Bisection bandwidth | Links
Single bus                       | 1                     | 1            | 1                   | Bi
Multiple bus                     | e (e <= N)            | 1            | e                   | Bi
Hierarchical bus (chain)         | e (e <= N)            | e (e <= N)   | 1                   | Bi
Crossbar                         | N                     | N            | N-1                 | Bi
One-sided crossbar               | N                     | 2N-1         | N/2                 | Bi
Binary tree                      | N                     | 2*log2(N)    | 1                   | Bi
Fat tree (fanout 2)              | N                     | 2*log2(N)    | N                   | Bi
Ring                             | N                     | N/2+2        | 2                   | Bi
3-D hypercube                    | N                     | log2(N)+2    | N/2                 | Bi
2-D mesh                         | N                     | 2*sqrt(N)    | sqrt(N)             | Bi
2-D torus                        | N                     | sqrt(N)+2    | 2*sqrt(N)           | Bi
Point-to-point, fully connected  | N                     | 1            | (N/2)*(N/2)         | Bi
Omega network (MIN)              | N/2                   | log2(N)      | N                   | Uni
Average NoC 2008

Average NoC 2008 (2)

Salminen et al., Survey of NoC proposals, OCP-IP, 2008
Overview of Managing On-Chip Communications

[Figure: interconnect options ordered from dedicated point-to-point links, through a single bus, hierarchical bus structures, and regular multi-hop topologies, to customized multi-hop networks. Trade-off axes: number of network elements; latency & BW (always guaranteed vs. best-effort/predictable); scalability & flexibility (limited vs. arbitrary); number of IP blocks; network reuse (design once, general purpose vs. IP-block specific); complexity (simple vs. very complex).]
Conclusion
- Many small components with different requirements
- Wire delays and power consumption becoming very problematic
- Big difference between local and global (or off-chip) communication
- Fully synchronous approach becoming infeasible
- Network-on-chip = multi-hop on-chip network
  - Often packet-switched
  - Buffering, routing, and topology are important design decisions
Extra

Case Study: Managing Interconnection Complexity in Heterogeneous IP Block Interconnection (HIBI)
Lessons Learned
- Many communication networks have been studied at TUT
  - On-chip communication research started in 1997
- A regular topology can be fitted well to an algorithm-specific, comp/comm-balanced implementation
- In the general case there is no optimal topology
- Communication-centric design was successfully conducted for performance
- Important to exploit features of the application(s) to optimize the interconnection
- Established parallel processing doctrines can be applied to SoC
- The SoC challenge is heterogeneity in computation
Interconnection Implementation View
- Make the lowest-level data transfer mechanisms simple and efficient
  - Minimum number of signals
  - "Every clock edge carries useful data in a transaction"
- Perform all high-level operations on the basic mechanisms
  - Layered protocol model, OCP compatible
  - Message passing
- Use identical HW modules to compose the overall interconnection
  - Translate IP-specific communication operations to the network
  - Support all (practical) topologies
  - No limits to the number of IP blocks (whole design)
  - Support (re-)configurability
  - Fit to all communication needs – from memories to peripherals

"Gives body to build the interconnect"
System Design View
- Make the interconnection aware of application functionality
A) System design time
  - Communication profiled from application processes
  - Clustering: localization of communication
  - Allocation of communication resources (segments, buffers)
  - Optimization of non-reconfigurable parameters
  - Initial QoS and other transfer parameters
B) Run time
  - Utilize knowledge of predictable communication events if available
  - Guaranteed QoS in transfers
  - Track communication – change QoS & other parameters if required
  - Totally change the mode of operation if required
- The HIBI Design Flow is 80% of the HIBI interconnect scheme

"Gives brains to the communication"
HIBI Identical Interconnection Modules
- The HIBI wrapper is the only building block used everywhere in the interconnection
  - Between the network and IP blocks
  - Between network segments
- The wrapper is parametrizable, modular, and configurable
- Asynchronous FIFO buffering

[Figure: processors P1..PN, memories Mem1..MemN, and accelerators Acc1..AccN each attach through a HIBI wrapper (FIFO / OCP interface between IP and wrapper) to the HIBI network]
HIBI Network
- The HIBI network consists of bus segments and bridges
  - Transfers within a segment: synchronous, circuit-switched
  - Transfers across bridges: asynchronous, packet-switched
  - Scales from a serial point-to-point link to an arbitrary topology
- Identical signals between wrappers on the network side
  - No dedicated point-to-point signals
  - All signals shared within a network segment
  - Wrapper layout is independent of the number of agents
- Totally distributed arbitration
  - No central arbiter
  - Each wrapper is aware of the communication details
HIBI Network Example

[Figure: several clock domains, each a bus segment with IP blocks attached via HIBI wrappers; segments connected with bridges, each bridge formed by a pair of HIBI wrappers back-to-back]
Bus latency
Total latency consists of several phases. From: K. Kuusilinna, PhD Thesis, TUT, 2001.

Arbitration latency:
- Request bus ownership (available methods: central arbiter, daisy chain, wired-OR, connectionless arbitration)
- Wait for higher-priority transactions to complete / arbitration (round-robin, hierarchical round-robin, time-slot, fixed priority, adaptive)
- Bus ownership granted (see Request)
- Waiting time may be long during high contention

Initial latency:
- Begin transaction; wait for master ready / wait for target ready (address/data multiplexing, handshaking)
- Transfer first data

Subsequent data latency:
- Transfer data; wait for master ready / wait for target ready
- Until all data has been transferred or a limit for data transfers per burst is reached
- Optimizing this phase has the biggest impact in long transfers

Turn-around latency:
- Drive or wait for the bus to settle to its idle state

Figure: Bus latency
HIBI Quality of Service
- TDMA (time division multiple access) with freely run-time adjustable frame length and slot durations and allocations
- Re-synchronization to the application phase
- Also traditional priority / round-robin

[Figure: repeating time frames with time slots allocated to agents A1-A3 plus competition slots; priority and round-robin orderings shown over time for comparison]
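The TDMA frame described above can be modeled as an ordered list of (agent, slot duration) pairs that repeats every frame. The schedule below is a made-up example for illustration; in HIBI the frame length and slot allocations are run-time adjustable, which here would simply mean replacing the list.

```python
def tdma_owner(cycle, slots):
    """slots = ordered (agent, duration) pairs forming one time frame.
    Returns which agent owns the bus on the given clock cycle."""
    frame_len = sum(d for _, d in slots)
    t = cycle % frame_len              # position inside the repeating frame
    for agent, duration in slots:
        if t < duration:
            return agent
        t -= duration

frame = [("A1", 3), ("A2", 2), ("A3", 3)]   # hypothetical 8-cycle frame
print([tdma_owner(c, frame) for c in range(9)])
# ['A1', 'A1', 'A1', 'A2', 'A2', 'A3', 'A3', 'A3', 'A1']
```

Because ownership is a pure function of the cycle counter and the (shared) schedule, every wrapper can evaluate it locally, which is what makes fully distributed arbitration possible.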
HIBI Basic Transfer
- Pipelined with arbitration
- Split transactions
- Burst transfers
- No wait cycles allowed
- Non-pre-emptive transfers
- QoS is guaranteed with TDMA or with a combination of Send Max + Priority/Round-Robin

[Figure: pipelined, split transactions over time: write address and data, read-request address, and returned address and data interleaved on the bus]
HIBI Wrapper Structure (v.2)

[Figure: Tx FSM fed by high- and low-priority tx FIFOs; Rx FSM with an address decoder feeding high- and low-priority rx FIFOs; a mux/demux pair between the HIBI signals in/out and the IP signals in/out; a configuration memory]
Wrapper Configuration Memory
- Stores all the information needed for distributed arbitration
  - Permanent: ROM, 1 page
  - Semi run-time configurable: ROM with several pages
  - Fully run-time configurable: RAM with pages

[Figure: a cycle counter and time-slot logic select between current and new configuration values via mux/demux; current page and configuration page]
HIBI Wrapper Area

[Figure: bar chart of sub-block areas in gates (Tx ctrl, Rx ctrl, config mem 1-page RAM and ROM, FIFOs 10x8b and 5x8b, read mux, addr decoder, write demux), values ranging from 15 to 1161 gates]
- Flip-flop-based buffers need a large area
HIBI Wrapper Area in ASIC

[Figure: area in gates (0 to 35 000) for 8/16/32/64-bit wrappers, ROM vs. RAM configuration memory, in three configurations: 3/3 low-priority and 0/0 high-priority FIFOs with 1-page memory; 5/5 low- and 5/5 high-priority FIFOs with 1-page memory; 10/5 low- and 10/5 high-priority FIFOs with 2-page memory]
HIBI Wrapper Area in FPGA

[Figure: area in logic elements (0 to 9 000) for the same 8/16/32/64-bit wrapper configurations, ROM vs. RAM]
- More efficient boundary optimization
Effect of Virtual Cut-Through / Store-and-Forward on HIBI Interconnection Area
- Application: 2 shared memories, 8 processors, each performing read-process-write of 100 bytes
- Solid lines: minimal buffering in the wrapper (virtual cut-through routing)
  - HIBI v.1 area is larger than v.2 due to separate address and data FIFOs
- Dashed lines: all data buffered in the wrapper (store-and-forward routing)

[Figure: interconnection logic area [mm2] (0-5) vs. transfer size [bytes] (0-120) for 1x8b and 2x8b HIBI v.1 and v.2, plus 1x8b excess-buffering variants of v.1 and v.2]
NoC component comparison

Runtime comparison

Salminen et al., SAMOS 2005.
NoC extras
Problems with Current NoC Discussion
- What is a "NoC" – no common definition
  - Something new, good by definition (needs no proof), ...
- General purpose – but to what extent?
  - Arbitrary connectivity between any nodes?
  - Uniform overall transfer distribution?
- Discussion about an "optimal topology"
  - Multiprocessor architectures for scientific computations?
  - Can massive fine-grain parallelism be utilized in realistic SoC applications?
- Copying computer network ideas without criticism
  - In-network data buffering, routing tables and algorithms
  - Compare to current TCP/IP or past ATM routers!
- Toy test case applications
  - A billion transistors – executes a single FFT?
  - Common benchmarks should be designed!
Wiring hierarchy
[H. Corporaal, Advanced Computer Architecture 5Z008 - Multiprocessors & Interconnect, course material, 2003]
- Wire layers: global, intermediate, local
- How far can a signal reach in one local clock cycle? Depends on
  - frequency (i.e. duration of the clock cycle)
  - wiring parameters (layer, width, height, density, shielding)
- Not far, anyway...
- Global wires will function as lossy transmission lines
  - Today's RC models become inaccurate
  - 3-D modeling is s-l-o-w and difficult
Crosstalk impact
P. Liljeberg et al., Self-timed Approach for Noise Reduction in NoC, in "Interconnect-centric design for advanced SoC and NoC", Kluwer, 2004
- Long, fast-switching wires close to each other
- Switching on neighboring wires affects delay
- Delay on wire 4 shown in Table 2
Impact of DMA

[Figure: i) an agent containing a CPU core, instruction and data memories, a DMA unit, a network interface, and other peripherals; ii) execution timelines with and without DMA for a) short communication time, b) equal computation and communication time, c) long communication time – DMA overlaps communication with computation]
Retransfer buffers
- If packets are dropped or corrupted in delivery, they (usually) have to be retransferred
- Variable latencies are problematic: is a packet dropped or just experiencing a longer latency?
  - If the time-out latency is exceeded, the packet is assumed to be missing
- The source must store packets until it receives an acknowledgement of successful transfer
  - Sending an acknowledgement after each packet results in a small buffer but (at least) double latency
  - Sending an ack after every N packets requires bigger buffers but gives better performance

a) ack for each packet: latency per pkt = send_latency + ack_latency
b) ack for each N packets: latency per pkt = (N*send_latency + ack_latency) / N

[Figure: source-side retransfer buffers and acknowledgements between source and destination for both schemes]
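The two formulas above can be evaluated directly to see the amortization effect. The 10-cycle send and ack latencies below are illustrative numbers only.

```python
def latency_per_pkt_ack_each(send_latency, ack_latency):
    """Scheme a): wait for an acknowledgement after every packet."""
    return send_latency + ack_latency

def latency_per_pkt_ack_every_n(send_latency, ack_latency, n):
    """Scheme b): one acknowledgement per n packets, amortized."""
    return (n * send_latency + ack_latency) / n

# assumed: 10-cycle send latency, 10-cycle ack latency, ack every 4 packets
print(latency_per_pkt_ack_each(10, 10))        # 20 cycles per packet
print(latency_per_pkt_ack_every_n(10, 10, 4))  # 12.5 cycles per packet
```

As n grows the per-packet cost approaches the bare send latency, but the source must keep n packets buffered for possible retransmission, which is exactly the buffer-vs-latency trade-off stated above.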
Reordering buffers
- Packets arriving out of order may require huge reordering buffers
  - Sometimes processing units may accept out-of-order delivery, or the buffers can be integrated with the internal memory of the processing unit
- If an ack is sent after 4 packets, a buffer for 4 packets is needed
- Furthermore, separate buffers are needed for each source, as data may be received in an interleaved manner
  - E.g. (pkt_<n>_<src>) received: pkt_1_1, pkt_4_1, pkt_4_2, pkt_3_3...
- E.g. if an ack is sent after every N packets and there are S sources:
  reorder buffer size = N*S packets
- Ack forces in-order delivery

[Figure: a) ack for each packet, b) ack for each N packets; the destination keeps per-source buffers]
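The sizing rule and the reordering itself can be sketched as follows: keep a per-source map of pending packets and release each source's stream as soon as its sequence numbers become contiguous. A behavioural sketch only; a hardware implementation would use the fixed N*S slot budget derived above.

```python
def reorder_buffer_size(n_ack, n_sources):
    """Worst case: N unacknowledged packets in flight per source,
    S sources interleaving -> N*S buffer slots."""
    return n_ack * n_sources

def deliver_in_order(received, n_sources):
    """received = (seq, src) tuples in arrival order.
    Buffers out-of-order packets per source, releases contiguous runs."""
    pending = {s: set() for s in range(n_sources)}
    next_seq = {s: 1 for s in range(n_sources)}
    delivered = []
    for seq, src in received:
        pending[src].add(seq)
        while next_seq[src] in pending[src]:   # release contiguous run
            delivered.append((next_seq[src], src))
            pending[src].remove(next_seq[src])
            next_seq[src] += 1
    return delivered

print(reorder_buffer_size(4, 2))  # 8 slots, matching N*S above
print(deliver_in_order([(2, 0), (1, 1), (1, 0), (2, 1)], 2))
# [(1, 1), (1, 0), (2, 0), (2, 1)]
```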
Buffer reservation

[Figure: two handshake variants between a sender agent and a receiver agent. i) Sender notifies the receiver of the next tx; the receiver reserves a buffer and configures its rx DMA; ACK; actual data; (optional ACK); the receiver consumes the data. ii) The receiver reserves a buffer in advance and notifies the sender of the reserved buffer; actual data is sent (copied) and consumed; the buffer is reserved again, etc. The observed tx duration differs between the two variants.]
Intertwined/Reordering
- Transfers from different sources may be arbitrarily intertwined
- In addition, packets may arrive out of order

[Figure: source0 sends aa bb cc and source1 sends dd ee through the network; destination0 receives them interleaved (dd aa bb ee cc). These are either single words, bursts, or packets, depending on the network. i) fixed-length packets: "FIFO"-like buffers suffice; ii) variable-length packets: linked-list buffers are needed]
Irregular IP size
- IPs tend to have irregular size and shape
  - The largest IP per row/column decides its height/width
  - Some space is wasted; links will have varying length
- Reordering the IPs reduces area (<19.5% reduction in area>)
  - Ensure that frequently communicating IPs are still close to each other
Customized mesh
- Connect more than one IP to one router
  - Somewhat smaller bandwidth available per IP
  - Usually enough, though
- Adopt a totally customized topology (the rightmost fig.)