NoC: MPSoC Communication Fabric
Interconnection Networks (ELE 580)
Shougata Ghosh, 18th Apr, 2006
Outline
MPSoC
Network-On-Chip
Cases: IBM CoreConnect, CrossBow IPs, Sonic Silicon Backplane
What are MPSoCs?
MPSoC – Multiprocessor System-On-Chip
Most SoCs today use multiple processing cores
MPSoCs are characterised by heterogeneous multiprocessors:
CPUs, IPs (Intellectual Property blocks), DSP cores, memory, communication handlers (USB, UART, etc.)
Where are MPSoCs used?
Cell phones
Network processors (used by telecom and networking to handle high data rates)
Digital television and set-top boxes
High-definition television
Video games (PlayStation 2 Emotion Engine)
Challenges
All MPSoC designs must balance the following requirements:
Speed
Power
Area
Application performance
Time to market
Why Reinvent the Wheel?
Why not use a uniprocessor (3.4 GHz!!)? PDAs are usually uniprocessor
Cannot keep up with real-time processing requirements; too slow for real-time data
Real-time processing requires "real" concurrency
Uniprocessors provide only "apparent" concurrency through multitasking (OS)
Multiprocessors can provide the concurrency required to handle real-time events
Need multiple processors
Why not SMPs?
+ SMPs are cheaper (reuse)
+ Easier to program
- Unpredictable delays (e.g. snoopy cache coherence)
- Need buffering to handle unpredictability
Area Concerns
Configured SMPs would have unused resources
Special-purpose PEs don't need to support unwanted processes:
Faster
Area efficient
Power efficient
Can exploit known memory access patterns
Smaller caches (area savings)
MPSoC Architecture
Components
Hardware: multiple processors, non-programmable IPs, memory, communication interface
Communication interface: bridges heterogeneous components to the communication network
Communication network: hierarchical (busses) or NoC
Design Flow
System-level synthesis: top-down approach; a synthesis algorithm derives the SoC architecture + SW model from system-level specs
Platform-based design: starts with a functional system spec + a predesigned platform; maps and schedules functions onto HW/SW
Component-based design: bottom-up approach
Platform-Based Design
Start with a functional spec: task graphs

Task Graphs
Nodes: tasks to complete
Edges: communication and dependence between tasks
Execution times annotate the nodes; data communicated annotates the edges
Map tasks onto predesigned HW
Use an Extended Task Graph for SW and communication mapping onto HW
Gantt chart: scheduling of task execution & timing analysis
Extended Task Graph adds communication nodes (reads and writes)
ILP and heuristic algorithms schedule tasks and communication onto HW and SW
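The ILP formulation is beyond a slide, but the greedy list-scheduling idea behind the heuristic alternative can be sketched as follows. The task names, execution times and the two-PE platform below are made up for illustration:

```python
# Hypothetical sketch: a tiny task graph list-scheduled onto two PEs.
# Nodes carry execution time; edges carry dependence between tasks.

exec_time = {"A": 2, "B": 3, "C": 1, "D": 2}
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

def list_schedule(exec_time, deps, num_pes=2):
    """Greedy list scheduler: place each ready task on the earliest-free PE."""
    finish = {}                      # task -> finish time
    pe_free = [0] * num_pes          # next free time per PE
    schedule = {}                    # task -> (pe, start, end), i.e. a Gantt chart
    remaining = set(exec_time)
    while remaining:
        # tasks whose predecessors have all finished
        ready = [t for t in remaining if all(p in finish for p in deps[t])]
        for t in sorted(ready):
            earliest = max([finish[p] for p in deps[t]], default=0)
            pe = min(range(num_pes), key=lambda i: max(pe_free[i], earliest))
            start = max(pe_free[pe], earliest)
            end = start + exec_time[t]
            schedule[t] = (pe, start, end)
            pe_free[pe] = end
            finish[t] = end
            remaining.discard(t)
    return schedule

sched = list_schedule(exec_time, deps)
makespan = max(end for _, _, end in sched.values())
```

The resulting `schedule` dictionary is exactly the information a Gantt chart displays: which PE runs which task over which interval.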
Component-Based Design
Conceptual MPSoC platform: SW, processors, IPs, communication fabric
Parallel development using APIs
Quicker time to market
Design Flow Schematic
Communication Fabric
Has been mostly Bus based IBM CoreConnect, Sonic Silicon Backplane, etc.
Busses not scalable!! Usually 5 Processors – rarely more than 10!
Number of cores has been increasing Push towards NoC
NoC
NoC NoC NoC-ing on Heaven's Door!!
Typical Network-On-Chip (Regular)
Regular NoC
A grid of tiles; each tile has input (inject into network) and output (receive from network) ports
Input port => 256-bit data, 38-bit control
The network handles both static and dynamic traffic:
Static: flow of data from camera to MPEG encoder
Dynamic: memory request from a PE (or CPU)
Dedicated VCs are used for static traffic; dynamic traffic goes through arbitration
Control Bits
Control bit fields:
Type (2 bits): Head, Body, Tail, Idle
Size (4 bits): data size 0 (1 bit) to 8 (256 bits)
VC Mask (8 bits): mask to select the VC (out of 8); can be used to prioritise
Route (16 bits): source routing
Ready (8 bits): signal from the network indicating it is ready to accept the next flit (presumably one bit per VC)
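The 38-bit control word above can be modelled by packing the five fields into one integer. The field ordering chosen here is an assumption, since the slides only give the widths:

```python
# Illustrative packing of the 38-bit control word (type 2b, size 4b,
# VC mask 8b, route 16b, ready 8b). Field order is assumed, not specified.

TYPE_HEAD, TYPE_BODY, TYPE_TAIL, TYPE_IDLE = range(4)

def pack_control(ftype, size, vc_mask, route, ready):
    assert ftype < 4 and size < 16 and vc_mask < 256 and route < 65536 and ready < 256
    word = ftype
    word = (word << 4) | size
    word = (word << 8) | vc_mask
    word = (word << 16) | route
    word = (word << 8) | ready
    return word                      # fits in 38 bits

def unpack_control(word):
    ready = word & 0xFF; word >>= 8
    route = word & 0xFFFF; word >>= 16
    vc_mask = word & 0xFF; word >>= 8
    size = word & 0xF; word >>= 4
    return word & 0x3, size, vc_mask, route, ready

w = pack_control(TYPE_HEAD, 8, 0b00000001, 0x1234, 0xFF)
assert w.bit_length() <= 38
assert unpack_control(w) == (TYPE_HEAD, 8, 1, 0x1234, 0xFF)
```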
Flow Control
Virtual-channel flow control
Router with input and output controllers
The input controller has a buffer and state for each VC
The input controller strips routing info from the head flit
Flits arbitrate for an output VC
Each output VC has a buffer for a single flit, used to store a flit trying to get an input buffer at the next hop
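The per-VC input-controller state (a buffer plus the routing decision stripped from the head flit and held for the rest of the packet) can be sketched like this; buffer depth and flit layout are illustrative, not taken from the design:

```python
# Toy model of one virtual channel's input-controller state.

class VCState:
    def __init__(self, depth=4):
        self.buffer = []
        self.depth = depth
        self.out_port = None              # set when the head flit is routed

    def accept(self, flit):
        """Try to buffer one flit; return False if there is no space."""
        if len(self.buffer) >= self.depth:
            return False                  # flit must wait at the previous hop
        if flit["type"] == "HEAD":
            self.out_port = flit["route"][0]   # strip next hop from head flit
        self.buffer.append(flit)
        return True

vc = VCState(depth=2)
assert vc.accept({"type": "HEAD", "route": [2, 1]})
assert vc.out_port == 2                   # body/tail flits reuse this decision
assert vc.accept({"type": "TAIL"})
assert not vc.accept({"type": "BODY"})    # buffer full
```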
Input and Output Controllers
NoC Issues
Basic differences between NoC and inter-chip or inter-board networks:
Wires and pins are ABUNDANT in NoC
Buffer space is limited in NoC
On-chip "pins" per tile could be 24,000, compared to ~1,000 for inter-chip designs
Designers can trade wiring resources for network performance!
Channels: on-chip => 300 bits; inter-chip => 8-16 bits
Topology
The previous design used a folded torus
A folded torus has twice the wire demand and twice the bisection BW of a mesh
Converts plentiful wires into bandwidth (performance)
Not hard to implement on-chip; however, it can be more power hungry
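A quick back-of-the-envelope comparison of the two topologies' bisection bandwidth, assuming a k x k tile array, w-bit channels and a 1 GHz clock (the numbers are illustrative, not from the slides):

```python
# Bisection bandwidth sketch: a k x k torus's wraparound links double
# the number of channels crossing the bisection relative to a mesh.

def bisection_bw_gbps(k, channel_bits, ghz, torus=False):
    links = 2 * k if torus else k       # channels crossing the middle cut
    return links * channel_bits * ghz   # Gb/s across the bisection

mesh_bw = bisection_bw_gbps(4, 300, 1.0)             # 4x4 mesh, 300-bit channels
torus_bw = bisection_bw_gbps(4, 300, 1.0, torus=True)
assert torus_bw == 2 * mesh_bw                        # twice the bisection BW
```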
Flow Control Decision
Area is scarce in on-chip designs; buffers use up a LOT of area
Flow control schemes that need fewer buffers are favourable
However, this must be balanced against performance:
Packet-dropping FC requires the least buffering, but at the expense of performance
Misrouting works when there is enough path diversity
High Performance Circuits
Wiring is regular and known at design time, so it can be accurately modeled (R, L, C)
This enables:
Low-swing signaling – 100 mV compared to 1 V: HUGE power saving
Overdrive produces 3x the signal velocity of full-swing drivers
Overdrive increases repeater spacing: again, significant power savings
Heterogeneous NoC
Regular topologies facilitate modular design and scale easily by replication
However, for heterogeneous systems, regular topologies lead to overdesign!!
Heterogeneous NoCs can optimise local bottlenecks
Solution? A complete application-specific NoC synthesis flow with customised topology and NoC building blocks
xPipes Lite
Application-specific NoC library: creates application-specific NoCs
Uses a library of NIs, switches and links
Parameterised library modules optimised for frequency and low latency
Packet-switched communication, source routing, wormhole flow control
Topologies: torus, mesh, B-tree, butterfly
NoC Architecture Block Diagram
xPipes Lite
Uses OCP to communicate with cores
OCP advantages:
Industry-wide standard for the communication protocol between cores and NoC
Allows parallel development of cores and NoC
Smoother development of modules
Faster time to market
xPipes Lite – Network Interface
Bridges the OCP interface and the NoC switching fabric
Functions:
Synchronisation between OCP and xPipes timing
Packeting of OCP transactions into flits
Route calculation
Flit buffering to improve performance

NI
Uses 2 registers to interface with OCP:
Header register to store the address (sent once)
Payload register to store data (sent multiple times for burst transfers)
Flits are generated from the registers:
Header flit from the header register
Body/payload flits from the payload register
Routing info goes in the header flit; the route is determined from a LUT using the destination address
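The packetisation step can be sketched as a route lookup keyed by destination address, followed by flit generation from the two registers. The LUT contents and flit tuples here are hypothetical:

```python
# NI packetization sketch: header flit carries the source route from a
# LUT; payload words become body flits, the last word becomes the tail.

ROUTE_LUT = {0x1000: [1, 2, 0], 0x2000: [0, 3]}   # dest addr -> output-port hops

def packetize(dest_addr, payload_words):
    route = ROUTE_LUT[dest_addr]                   # route calculation
    flits = [("HEAD", route)]                      # header flit (address sent once)
    flits += [("BODY", w) for w in payload_words[:-1]]
    flits.append(("TAIL", payload_words[-1]))      # tail flit closes the packet
    return flits

flits = packetize(0x1000, [0xA, 0xB, 0xC])
assert flits[0][0] == "HEAD" and flits[-1][0] == "TAIL"
```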
Network Interface
Bidirectional NI
Output stage identical to xPipes switches
Input stage uses dual-flit buffers
Uses the same flow control as the switches
Switch Architecture
The xPipes switch is the basic building block of the switching fabric
2-cycle latency
Output-queued router
Fixed and round-robin priority arbitration on input lines
Flow control: ACK/nACK with Go-Back-N semantics
CRC error detection
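The slides do not give the CRC details; a generic CRC-8 (polynomial 0x07, no reflection, chosen purely for illustration) shows the residue property a receiving switch can exploit to detect corrupted flits:

```python
# Generic MSB-first CRC-8 sketch (polynomial x^8 + x^2 + x + 1, i.e. 0x07).
# Not the actual xPipes CRC, which the slides don't specify.

def crc8(data: bytes, poly=0x07):
    crc = 0x00
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

# Residue property: appending the CRC to the message yields remainder 0,
# so the receiver just recomputes over flit + checksum and compares to 0.
c = crc8(b"flit")
assert crc8(b"flit" + bytes([c])) == 0
```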
Switch
The allocator module does the arbitration for the head flit and holds the path until the tail flit
Routing info requests the output port
The switch is parameterisable in: number of inputs/outputs, arbitration policy, output buffer sizes
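A round-robin arbiter of the kind used on the input lines can be sketched as follows; the interface is hypothetical, the rotating-priority behaviour is the standard one:

```python
# Round-robin arbiter sketch: priority rotates so the most recently
# granted input becomes the lowest-priority requester.

class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.last = n - 1     # so that input 0 has highest priority first

    def grant(self, requests):
        """requests: list of bools, one per input; returns granted index or None."""
        for off in range(1, self.n + 1):
            i = (self.last + off) % self.n
            if requests[i]:
                self.last = i
                return i
        return None

arb = RoundRobinArbiter(4)
g1 = arb.grant([True, True, False, False])
g2 = arb.grant([True, True, False, False])
assert (g1, g2) == (0, 1)     # priority rotated away from input 0 after its grant
```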
Switch Flow Control
An input flit is dropped if:
The requested output port is held by a previous packet
The output buffer is full
It lost the arbitration
A NACK is sent back; all subsequent flits of that packet are dropped until the header flit reappears (Go-Back-N flow control)
The header updates routing info for the next switch
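The drop-until-the-header-reappears behaviour of this Go-Back-N scheme can be modelled with a single state bit per input; the names and interface below are illustrative:

```python
# Minimal model of ACK/nACK Go-Back-N at one switch input: after a NACK,
# everything is dropped until the packet's header flit is retransmitted.

class GoBackNInput:
    def __init__(self):
        self.dropping = False

    def on_flit(self, flit_type, can_accept):
        """Return 'ACK' or 'NACK' for one incoming flit.
        can_accept is False when the output port is held, the output
        buffer is full, or the flit lost arbitration."""
        if flit_type == "HEAD":
            self.dropping = False       # retransmitted header restarts the packet
        if self.dropping:
            return "NACK"
        if not can_accept:
            self.dropping = True        # drop the rest of this packet
            return "NACK"
        return "ACK"

port = GoBackNInput()
log = [port.on_flit("HEAD", True), port.on_flit("BODY", False),
       port.on_flit("BODY", True), port.on_flit("TAIL", True),
       port.on_flit("HEAD", True), port.on_flit("BODY", True)]
assert log == ["ACK", "NACK", "NACK", "NACK", "ACK", "ACK"]
```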
xPipes Lite – Links
The links are pipelined to overcome the interconnect delay problem
xPipes Lite uses shallow pipelines for all modules (NI, switch):
Low latency
Smaller buffer requirement
Area savings
Higher frequency
xPipes Lite Design Flow
IBM CoreConnect
CoreConnect Bus Architecture
An open 32-, 64-, 128-bit on-chip core bus standard
Communication fabric for IBM Blue Logic and other non-IBM devices
Provides high bandwidth with a hierarchical bus structure:
Processor Local Bus (PLB)
On-Chip Peripheral Bus (OPB)
Device Control Register bus (DCR)
Performance Features
CoreConnect Components
PLB, OPB, DCR
PLB Arbiter, OPB Arbiter
PLB-to-OPB Bridge, OPB-to-PLB Bridge
PLB
Processor Local Bus: fully synchronous, supports up to 8 masters
32-, 64-, and 128-bit architecture versions; extendable to 256 bits
Separate read/write data buses enable overlapped transfers and higher data rates
High bandwidth capabilities:
Burst transfers, variable and fixed length supported
Pipelining
Split transactions
DMA transfers
No on-chip tri-states required
Cache-line transfers
Overlapped arbitration, programmable priority fairness
Processor Local Bus (cont’d.)