NoC: MPSoC Communication Fabric
Interconnection Networks (ELE 580)
Shougata Ghosh, 18th Apr, 2006
Outline
MPSoC
Network-On-Chip
Cases: IBM CoreConnect, CrossBow IPs, Sonic Silicon Backplane
What are MPSoCs?
MPSoC – Multiprocessor System-On-Chip
Most SoCs today use multiple processing cores
MPSoCs are characterised by heterogeneous multiprocessors:
CPUs, IPs (Intellectual Property blocks), DSP cores, memory, communication handlers (USB, UART, etc.)
Where are MPSoCs used?
Cell phones
Network processors (used by telecom and networking to handle high data rates)
Digital television and set-top boxes
High-definition television
Video games (PlayStation 2 Emotion Engine)
Challenges
All MPSoC designs must balance the following requirements:
Speed
Power
Area
Application performance
Time to market
Why Reinvent the Wheel?
Why not use a uniprocessor (3.4 GHz!!)? PDAs are usually uniprocessor
Cannot keep up with real-time processing requirements; too slow for real-time data
Real-time processing requires "real" concurrency
Uniprocessors provide only "apparent" concurrency through multitasking (OS)
Multiprocessors can provide the concurrency required to handle real-time events
Need multiple processors
Why not SMPs?
+ SMPs are cheaper (reuse)
+ Easier to program
- Unpredictable delays (e.g. snoopy cache coherence)
- Need buffering to handle unpredictability
Area Concerns
Configured SMPs would have unused resources
Special-purpose PEs don't need to support unwanted processes:
Faster
Area efficient
Power efficient
Can exploit known memory access patterns
Smaller caches (area savings)
MPSoC Architecture
Components
Hardware: multiple processors, non-programmable IPs, memory, communication interface
Communication interface: bridges heterogeneous components to the communication network
Communication network: hierarchical (busses) or NoC
Design Flow
System-level synthesis: top-down approach; a synthesis algorithm derives the SoC architecture + SW model from system-level specs
Platform-based design: starts with a functional system spec + a predesigned platform; maps and schedules functions onto HW/SW
Component-based design: bottom-up approach
Platform-Based Design
Start with a functional spec: task graphs

Task Graphs
Nodes: tasks to complete
Edges: communication and dependence between tasks
Execution times annotate the nodes; data communicated annotates the edges
Map tasks onto predesigned HW
Use an Extended Task Graph for SW and communication mapping onto HW
Gantt chart: scheduling of task execution & timing analysis
Extended Task Graph adds communication nodes (reads and writes)
ILP and heuristic algorithms schedule tasks and communication onto HW and SW
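The ILP formulation is beyond a slide, but the greedy list-scheduling idea behind the heuristic alternative can be sketched as follows. The task names, execution times and the two-PE platform below are made up for illustration:

```python
# Hypothetical sketch: a tiny task graph list-scheduled onto two PEs.
# Nodes carry execution time; edges carry dependence between tasks.

exec_time = {"A": 2, "B": 3, "C": 1, "D": 2}
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

def list_schedule(exec_time, deps, num_pes=2):
    """Greedy list scheduler: place each ready task on the earliest-free PE."""
    finish = {}                      # task -> finish time
    pe_free = [0] * num_pes          # next free time per PE
    schedule = {}                    # task -> (pe, start, end), i.e. a Gantt chart
    remaining = set(exec_time)
    while remaining:
        # tasks whose predecessors have all finished
        ready = [t for t in remaining if all(p in finish for p in deps[t])]
        for t in sorted(ready):
            earliest = max([finish[p] for p in deps[t]], default=0)
            pe = min(range(num_pes), key=lambda i: max(pe_free[i], earliest))
            start = max(pe_free[pe], earliest)
            end = start + exec_time[t]
            schedule[t] = (pe, start, end)
            pe_free[pe] = end
            finish[t] = end
            remaining.discard(t)
    return schedule

sched = list_schedule(exec_time, deps)
makespan = max(end for _, _, end in sched.values())
```

The resulting `schedule` dictionary is exactly the information a Gantt chart displays: which PE runs which task over which interval.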
Component-Based Design
Conceptual MPSoC platform: SW, processors, IPs, communication fabric
Parallel development using APIs
Quicker time to market
Design Flow Schematic
Communication Fabric
Has been mostly Bus based IBM CoreConnect, Sonic Silicon Backplane, etc.
Busses not scalable!! Usually 5 Processors – rarely more than 10!
Number of cores has been increasing Push towards NoC
NoC
NoC NoC NoC-ing on Heaven's Door!!
Typical Network-On-Chip (Regular)
Regular NoC
A grid of tiles; each tile has input (inject into network) and output (receive from network) ports
Input port => 256-bit data, 38-bit control
The network handles both static and dynamic traffic:
Static: flow of data from camera to MPEG encoder
Dynamic: memory request from a PE (or CPU)
Dedicated VCs are used for static traffic; dynamic traffic goes through arbitration
Control Bits
Control bit fields:
Type (2 bits): Head, Body, Tail, Idle
Size (4 bits): data size 0 (1 bit) to 8 (256 bits)
VC Mask (8 bits): mask to select the VC (out of 8); can be used to prioritise
Route (16 bits): source routing
Ready (8 bits): signal from the network indicating it is ready to accept the next flit (presumably one bit per VC)
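The 38-bit control word above can be modelled by packing the five fields into one integer. The field ordering chosen here is an assumption, since the slides only give the widths:

```python
# Illustrative packing of the 38-bit control word (type 2b, size 4b,
# VC mask 8b, route 16b, ready 8b). Field order is assumed, not specified.

TYPE_HEAD, TYPE_BODY, TYPE_TAIL, TYPE_IDLE = range(4)

def pack_control(ftype, size, vc_mask, route, ready):
    assert ftype < 4 and size < 16 and vc_mask < 256 and route < 65536 and ready < 256
    word = ftype
    word = (word << 4) | size
    word = (word << 8) | vc_mask
    word = (word << 16) | route
    word = (word << 8) | ready
    return word                      # fits in 38 bits

def unpack_control(word):
    ready = word & 0xFF; word >>= 8
    route = word & 0xFFFF; word >>= 16
    vc_mask = word & 0xFF; word >>= 8
    size = word & 0xF; word >>= 4
    return word & 0x3, size, vc_mask, route, ready

w = pack_control(TYPE_HEAD, 8, 0b00000001, 0x1234, 0xFF)
assert w.bit_length() <= 38
assert unpack_control(w) == (TYPE_HEAD, 8, 1, 0x1234, 0xFF)
```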
Flow Control
Virtual-channel flow control
Router with input and output controllers
The input controller has a buffer and state for each VC
The input controller strips routing info from the head flit
Flits arbitrate for an output VC
Each output VC has a buffer for a single flit, used to store a flit trying to get an input buffer at the next hop
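The per-VC input-controller state (a buffer plus the routing decision stripped from the head flit and held for the rest of the packet) can be sketched like this; buffer depth and flit layout are illustrative, not taken from the design:

```python
# Toy model of one virtual channel's input-controller state.

class VCState:
    def __init__(self, depth=4):
        self.buffer = []
        self.depth = depth
        self.out_port = None              # set when the head flit is routed

    def accept(self, flit):
        """Try to buffer one flit; return False if there is no space."""
        if len(self.buffer) >= self.depth:
            return False                  # flit must wait at the previous hop
        if flit["type"] == "HEAD":
            self.out_port = flit["route"][0]   # strip next hop from head flit
        self.buffer.append(flit)
        return True

vc = VCState(depth=2)
assert vc.accept({"type": "HEAD", "route": [2, 1]})
assert vc.out_port == 2                   # body/tail flits reuse this decision
assert vc.accept({"type": "TAIL"})
assert not vc.accept({"type": "BODY"})    # buffer full
```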
Input and Output Controllers
NoC Issues
Basic differences between NoC and inter-chip or inter-board networks:
Wires and pins are ABUNDANT in NoC
Buffer space is limited in NoC
On-chip "pins" per tile could be 24,000, compared to ~1,000 for inter-chip designs
Designers can trade wiring resources for network performance!
Channels: on-chip => 300 bits; inter-chip => 8-16 bits
Topology
The previous design used a folded torus
A folded torus has twice the wire demand and twice the bisection BW of a mesh
Converts plentiful wires into bandwidth (performance)
Not hard to implement on-chip; however, it can be more power hungry
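A quick back-of-the-envelope comparison of the two topologies' bisection bandwidth, assuming a k x k tile array, w-bit channels and a 1 GHz clock (the numbers are illustrative, not from the slides):

```python
# Bisection bandwidth sketch: a k x k torus's wraparound links double
# the number of channels crossing the bisection relative to a mesh.

def bisection_bw_gbps(k, channel_bits, ghz, torus=False):
    links = 2 * k if torus else k       # channels crossing the middle cut
    return links * channel_bits * ghz   # Gb/s across the bisection

mesh_bw = bisection_bw_gbps(4, 300, 1.0)             # 4x4 mesh, 300-bit channels
torus_bw = bisection_bw_gbps(4, 300, 1.0, torus=True)
assert torus_bw == 2 * mesh_bw                        # twice the bisection BW
```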
Flow Control Decision
Area is scarce in on-chip designs; buffers use up a LOT of area
Flow control schemes that need fewer buffers are favourable
However, this must be balanced against performance:
Packet-dropping FC requires the least buffering, but at the expense of performance
Misrouting works when there is enough path diversity
High Performance Circuits
Wiring is regular and known at design time, so it can be accurately modeled (R, L, C)
This enables:
Low-swing signaling – 100 mV compared to 1 V: HUGE power saving
Overdrive produces 3x the signal velocity of full-swing drivers
Overdrive increases repeater spacing: again, significant power savings
Heterogeneous NoC
Regular topologies facilitate modular design and scale easily by replication
However, for heterogeneous systems, regular topologies lead to overdesign!!
Heterogeneous NoCs can optimise local bottlenecks
Solution? A complete application-specific NoC synthesis flow with customised topology and NoC building blocks
xPipes Lite
Application-specific NoC library: creates application-specific NoCs
Uses a library of NIs, switches and links
Parameterised library modules optimised for frequency and low latency
Packet-switched communication, source routing, wormhole flow control
Topologies: torus, mesh, B-tree, butterfly
NoC Architecture Block Diagram
xPipes Lite
Uses OCP to communicate with cores
OCP advantages:
Industry-wide standard for the communication protocol between cores and NoC
Allows parallel development of cores and NoC
Smoother development of modules
Faster time to market
xPipes Lite – Network Interface
Bridges the OCP interface and the NoC switching fabric
Functions:
Synchronisation between OCP and xPipes timing
Packeting of OCP transactions into flits
Route calculation
Flit buffering to improve performance

NI
Uses 2 registers to interface with OCP:
Header register to store the address (sent once)
Payload register to store data (sent multiple times for burst transfers)
Flits are generated from the registers:
Header flit from the header register
Body/payload flits from the payload register
Routing info goes in the header flit; the route is determined from a LUT using the destination address
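The packetisation step can be sketched as a route lookup keyed by destination address, followed by flit generation from the two registers. The LUT contents and flit tuples here are hypothetical:

```python
# NI packetization sketch: header flit carries the source route from a
# LUT; payload words become body flits, the last word becomes the tail.

ROUTE_LUT = {0x1000: [1, 2, 0], 0x2000: [0, 3]}   # dest addr -> output-port hops

def packetize(dest_addr, payload_words):
    route = ROUTE_LUT[dest_addr]                   # route calculation
    flits = [("HEAD", route)]                      # header flit (address sent once)
    flits += [("BODY", w) for w in payload_words[:-1]]
    flits.append(("TAIL", payload_words[-1]))      # tail flit closes the packet
    return flits

flits = packetize(0x1000, [0xA, 0xB, 0xC])
assert flits[0][0] == "HEAD" and flits[-1][0] == "TAIL"
```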
Network Interface
Bidirectional NI
Output stage identical to xPipes switches
Input stage uses dual-flit buffers
Uses the same flow control as the switches
Switch Architecture
The xPipes switch is the basic building block of the switching fabric
2-cycle latency
Output-queued router
Fixed and round-robin priority arbitration on input lines
Flow control: ACK/nACK with Go-Back-N semantics
CRC error detection
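The slides do not give the CRC details; a generic CRC-8 (polynomial 0x07, no reflection, chosen purely for illustration) shows the residue property a receiving switch can exploit to detect corrupted flits:

```python
# Generic MSB-first CRC-8 sketch (polynomial x^8 + x^2 + x + 1, i.e. 0x07).
# Not the actual xPipes CRC, which the slides don't specify.

def crc8(data: bytes, poly=0x07):
    crc = 0x00
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

# Residue property: appending the CRC to the message yields remainder 0,
# so the receiver just recomputes over flit + checksum and compares to 0.
c = crc8(b"flit")
assert crc8(b"flit" + bytes([c])) == 0
```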
Switch
The allocator module does the arbitration for the head flit and holds the path until the tail flit
Routing info requests the output port
The switch is parameterisable in: number of inputs/outputs, arbitration policy, output buffer sizes
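A round-robin arbiter of the kind used on the input lines can be sketched as follows; the interface is hypothetical, the rotating-priority behaviour is the standard one:

```python
# Round-robin arbiter sketch: priority rotates so the most recently
# granted input becomes the lowest-priority requester.

class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.last = n - 1     # so that input 0 has highest priority first

    def grant(self, requests):
        """requests: list of bools, one per input; returns granted index or None."""
        for off in range(1, self.n + 1):
            i = (self.last + off) % self.n
            if requests[i]:
                self.last = i
                return i
        return None

arb = RoundRobinArbiter(4)
g1 = arb.grant([True, True, False, False])
g2 = arb.grant([True, True, False, False])
assert (g1, g2) == (0, 1)     # priority rotated away from input 0 after its grant
```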
Switch Flow Control
An input flit is dropped if:
The requested output port is held by a previous packet
The output buffer is full
It lost the arbitration
A NACK is sent back; all subsequent flits of that packet are dropped until the header flit reappears (Go-Back-N flow control)
The header updates routing info for the next switch
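The drop-until-the-header-reappears behaviour of this Go-Back-N scheme can be modelled with a single state bit per input; the names and interface below are illustrative:

```python
# Minimal model of ACK/nACK Go-Back-N at one switch input: after a NACK,
# everything is dropped until the packet's header flit is retransmitted.

class GoBackNInput:
    def __init__(self):
        self.dropping = False

    def on_flit(self, flit_type, can_accept):
        """Return 'ACK' or 'NACK' for one incoming flit.
        can_accept is False when the output port is held, the output
        buffer is full, or the flit lost arbitration."""
        if flit_type == "HEAD":
            self.dropping = False       # retransmitted header restarts the packet
        if self.dropping:
            return "NACK"
        if not can_accept:
            self.dropping = True        # drop the rest of this packet
            return "NACK"
        return "ACK"

port = GoBackNInput()
log = [port.on_flit("HEAD", True), port.on_flit("BODY", False),
       port.on_flit("BODY", True), port.on_flit("TAIL", True),
       port.on_flit("HEAD", True), port.on_flit("BODY", True)]
assert log == ["ACK", "NACK", "NACK", "NACK", "ACK", "ACK"]
```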
xPipes Lite – Links
The links are pipelined to overcome the interconnect delay problem
xPipes Lite uses shallow pipelines for all modules (NI, switch):
Low latency
Smaller buffer requirement
Area savings
Higher frequency
xPipes Lite Design Flow
IBM CoreConnect
CoreConnect Bus Architecture
An open 32-, 64-, 128-bit on-chip core bus standard
Communication fabric for IBM Blue Logic and other non-IBM devices
Provides high bandwidth with a hierarchical bus structure:
Processor Local Bus (PLB)
On-Chip Peripheral Bus (OPB)
Device Control Register bus (DCR)
Performance Features
CoreConnect Components
PLB, OPB, DCR
PLB Arbiter, OPB Arbiter
PLB-to-OPB Bridge, OPB-to-PLB Bridge
PLB
Processor Local Bus: fully synchronous, supports up to 8 masters
32-, 64-, and 128-bit architecture versions; extendable to 256 bits
Separate read/write data buses enable overlapped transfers and higher data rates
High bandwidth capabilities:
Burst transfers, variable and fixed length supported
Pipelining
Split transactions
DMA transfers
No on-chip tri-states required
Cache-line transfers
Overlapped arbitration, programmable priority fairness
Processor Local Bus (cont’d.)