CS294-6 Reconfigurable Computing Day 25 Heterogeneous Systems and Interfacing.
CS294-6 Reconfigurable Computing
Day 25
Heterogeneous Systems and Interfacing
Previously
• Homogeneous model of computational array
– single word granularity, depth, interconnect
– all post-fabrication programmable
• Understand tradeoffs of each
Today
• Heterogeneous architectures
– Why?
– How?
• catalog of techniques
• fit in framework
• optimization and mapping
Why?
• Why would we be interested in a heterogeneous architecture?
– E.g.?
Why?
• Applications have a mix of characteristics
• Already accepted
– seldom can afford to build the most general (unstructured) array
• bit-level, deep context, p=1
– => already picking some structure to exploit
• May be beneficial to have portions of the computation optimized for different structure conditions
Examples
• Processor+FPGA
• Processor or FPGA adds:
– multiplier or MAC unit
– FPU
– Motion Estimation coprocessor
Optimization Prospect
• Less capacity (area × time) for the composite than either pure solution:
– (A1+A2)·T12 < A1·T1
– (A1+A2)·T12 < A2·T2
Optimization Prospect Example
• Floating Point
– Task: I integer ops + F FP-adds
– Aproc = 125 Mλ²
– AFPU = 40 Mλ²
– cycles per FP op on processor alone = 60
– compare 125·(I+60F) vs. 165·(I+F)
• breakeven: I/F = (7500 - 165)/40 ≈ 183
• adding the FPU wins when I/F < 183
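The breakeven arithmetic above can be checked with a short script; the area and cycle figures are the slide's illustrative values, and the function names are mine.

```python
# Breakeven analysis for adding an FPU to a processor (sketch).
A_PROC = 125e6       # processor area, the slide's 125 Mlambda^2
A_FPU = 40e6         # FPU area, 40 Mlambda^2
CYCLES_PER_FP = 60   # cycles to emulate one FP add on the processor alone

def area_time(i_ops, f_ops, with_fpu):
    """Area-time product for a task of i_ops integer ops and f_ops FP adds."""
    if with_fpu:
        # composite: FP add takes one cycle on the FPU
        return (A_PROC + A_FPU) * (i_ops + f_ops)
    # pure processor: each FP add emulated in CYCLES_PER_FP cycles
    return A_PROC * (i_ops + CYCLES_PER_FP * f_ops)

# Breakeven ratio I/F where the two designs tie:
# 125(I + 60F) = 165(I + F)  =>  I/F = (7500 - 165)/40
breakeven = (A_PROC * CYCLES_PER_FP - (A_PROC + A_FPU)) / A_FPU
print(breakeven)  # 183.375: the FPU pays off when I/F < ~183
```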
How?
• Design issues:
– Interconnect
• space and time
– Control
– Instructions
• configuration path and control
• Mapping
• Cost/Benefits:
– Costs
• Area, Power
– Performance
• Bandwidth, latency
Interconnect
• Bus (degenerate network)
• Memory (shared retiming resource)
• RF/Coproc (traditional processor inter.)
• Network
Interconnect: Bus
• Minimal physical network
– shared with memory and/or other peripherals
– 10s-100s of cycles away from the processor (FPGA)
– low-to-moderate bandwidth
– can handle multiple, different functional units
• but the serial bottleneck of the bus prevents simultaneous communication among devices
Interconnect: Bus Example
• XC6200
Interconnect: Memory
• Use a memory (retiming) block to buffer data between heterogeneous regions
– DMA (usually implies a shared bus)
– FIFO
– dual-port or shared RAM
• decoupled, moderate latency (10-100 cycles)
• moderate bandwidth
Interconnect: Memory Example
• PAM, SPLASH
Interconnect: RF/Coproc
• Coupled directly to the processor datapath
– low latency (1-2 cycles)
– moderately high bandwidth
• limited by RF ports and control
Interconnect: RF/Coproc Examples
• GARP, Chimaera
– (more on this case Thursday)
Interconnect: Network
• Unified spatial network composing the various heterogeneous components
– high bandwidth
– latency varies with distance
– supports simultaneous operation and data transfer
– potentially the dominant cost
• Ainterconnect > Afunction
– Granularity question
• coarse (large blocks of each type)
• fine (interleaved)
Interconnect: Network Coarse
• Cheops, Pleiades
HSRA Heterogeneous Blocks
Interconnect: Network Coarse vs. Fine
• Multiplier/FPGA example
Interconnect: Network Coarse vs. Fine
• Fine
– possibly share interconnect
– locality
– uniform tiling
– if not shared
• may get concentrations of heavy/no use
• interconnect limits use as independent resources
– ratio less flexible?
– more difficult design
• Coarse
– flexible ratio
– easier to keep dense homogeneous blocks
– requires own interconnect
– doesn't disrupt base layout(s)
– non-local route to/from
• more/longer wires
– boundaries in net
Admin
• For POWER: update on
– rcore simulation
– HSRA energy
– Jsim size problems?
• Fix in works
Control
• As before:
– How many controllers?
– How many pinsts slaved off of each?
• Common classes:
– Single Controller / Lock-step
– Decoupled, datastream
– Autonomous MIMD
Control: Lockstep
• Master controller (usually the processor)
– issues instruction (instruction tag)
• every cycle
• explicitly when device should operate
– Single thread of control
• everything known to be in synch
– Idle while processor doing other tasks
– Ex. VLIW (TriMedia), PRISC, GARP
Control: Data Stream
• Configure, then run on data decoupled from the control processor
– runs in parallel with the processor
• processor runs orthogonal tasks
• maybe several simultaneous tasks running on the spatial array
– unit not typically fed by the processor directly
– need to synchronize data transfer and operation
• polling, interrupt, semaphore
– Ex. Cheops, PADDI-2, Pleiades, SPLASH
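A minimal sketch of the decoupled, datastream style: a "processor" thread streams operands through a FIFO to a separately running "array" thread, with blocking queue operations standing in for the semaphore synchronization. All names and the squaring computation are illustrative, not taken from any of the systems listed.

```python
import queue
import threading

fifo = queue.Queue(maxsize=4)   # shared retiming buffer between the two regions
results = []

def array_task():
    # runs autonomously once "configured"; the processor only feeds data
    while True:
        x = fifo.get()          # blocks until data arrives (semaphore-style sync)
        if x is None:           # sentinel marks end of stream
            break
        results.append(x * x)   # stand-in for the spatial computation

t = threading.Thread(target=array_task)
t.start()
for v in range(5):              # processor streams operands...
    fifo.put(v)                 # ...and is free to do other work between puts
fifo.put(None)                  # signal end of stream
t.join()
print(results)  # [0, 1, 4, 9, 16]
```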
Control: Autonomous (MIMD)
• Multiple (potential) control processors
– not necessarily slaved
– distributed control
– more care needed in synchronization
– Ex. Floe (MagicEight)
HSRA Multi-Hetero
• Coupling
– unifying networks
• Balance
– sequential/spatial
– control units w/ management task
Configuration
• Shared interface bus
– config and data
• XC6200, PAM, SPLASH
– config and memory path
• GARP
• Separate path / network
– VLIW, Pleiades
• Explicit
– XC6200, PAM, SPLASH, …
• Implicit
– GARP/PRISC
Mapping
• Mapping
– often a choice of where an operation runs
– must sort out what goes where
• faster in one resource?
• …but limited number of each resource
Mapping: Limited Resource
• What runs on the faster, limited resource?
– E.g. Tim's C extraction last time
– General:
• what is allocated to the resource
• when to reconfigure
• N candidate ops -> each a choice
– Greedy:
• break into temporal regions
– local working set and points of reconfiguration
• while resource available
– add the op offering the most benefit
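The greedy loop within one temporal region might be sketched as follows. Using benefit per unit cost as the selection metric is my assumption (the slide only says "most benefit"), and all names are illustrative.

```python
def greedy_allocate(ops, capacity):
    """Greedily place ops on a limited fast resource within one temporal region.

    ops: list of (name, benefit, cost) tuples; capacity: units of the fast
    resource available. Returns the names placed, picking the highest
    benefit-per-unit-cost candidates first while capacity remains.
    """
    chosen = []
    remaining = capacity
    # consider the most profitable candidates first
    for name, benefit, cost in sorted(ops, key=lambda o: o[1] / o[2], reverse=True):
        if cost <= remaining:   # "while resource available"
            chosen.append(name)
            remaining -= cost
    return chosen
```

For example, with capacity 6 and candidates `("mul", 10, 2)`, `("add", 1, 1)`, `("fft", 30, 5)`, the loop takes `fft` (ratio 6.0), skips `mul` (no room left), and takes `add`.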
Mapping: Spatial Choice
• Different kinds of resources
– (e.g. LUTs, multipliers)
• Multiple resources can solve same problem
• Limited number of each resource
• Match users with resources
Mapping: Bipartite Partitioning
• => Bipartite matching
– deals with unit resource consumption
– also with regional/interconnect constraints
– does not directly deal with performance…
• Post-pass(?) allocate
– faster resources to critical path
• ? N of R1 vs. M of R2
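A sketch of the matching formulation using a standard augmenting-path algorithm: each op lists the resource instances that can implement it, and the matching maximizes the number of ops placed. This is a generic bipartite-matching routine with illustrative names, not the formulation from Liu's FPGA'98 paper, and regional/interconnect constraints are not modeled.

```python
def match_ops(compat, n_resources):
    """Maximum bipartite matching of ops to resource instances.

    compat[op] = list of resource indices that can implement op.
    Returns (match, placed): match[r] is the op assigned to resource r
    (-1 if free), placed is the number of ops successfully placed.
    """
    match = [-1] * n_resources

    def try_assign(op, seen):
        # try each compatible resource; evict and re-place its op if needed
        for r in compat[op]:
            if r not in seen:
                seen.add(r)
                if match[r] == -1 or try_assign(match[r], seen):
                    match[r] = op
                    return True
        return False

    placed = sum(try_assign(op, set()) for op in range(len(compat)))
    return match, placed
```

If op 0 can only use resource 0 while op 1 can use either resource, the augmenting path moves op 1 onto resource 1 so both ops fit.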
Example/Details: Liu FPGA98
Mapping
• More common:
– can solve with:
• 12 A's and 2 B's, or
• 4 A's and 4 B's
– common need:
• 4 A's and 2 B's
– choice:
• 8 A's vs. 2 B's
Highlights
• Fits into existing framework
– not that much new here
– new issue: who shares resources, and how
• Issues: interconnect, control
• + density when balance is hit
• - efficiency when balance is mismatched
• - harder mapping (resource sharing)