CS294-6 Reconfigurable Computing Day 25 Heterogeneous Systems and Interfacing.
CS294-6 Reconfigurable Computing
Day 25
Heterogeneous Systems and Interfacing
Previously
• Homogeneous model of computational array
– single word granularity, depth, interconnect
– all post-fabrication programmable
• Understand tradeoffs of each
Today
• Heterogeneous architectures
– Why?
– How?
• catalog of techniques
• fit in framework
• optimization and mapping
Why?
• Why would we be interested in a heterogeneous architecture?
– E.g.?
Why?
• Applications have a mix of characteristics
• Already accepted
– seldom can afford to build the most general (unstructured) array
• bit-level, deep context, p=1
– => already picking some structure to exploit
• May be beneficial to have portions of the computation optimized for different structure conditions
Examples
• Processor+FPGA
• Processor or FPGA adds:
– multiplier or MAC unit
– FPU
– Motion Estimation coprocessor
Optimization Prospect
• Less capacity (area × time) for the composite than either pure solution:
– (A1+A2)·T12 < A1·T1
– (A1+A2)·T12 < A2·T2
Optimization Prospect Example
• Floating Point
– Task: I integer ops + F FP-adds
– Aproc = 125 Mλ²
– AFPU = 40 Mλ²
– cycles per FP op on processor alone = 60
– compare 125·(I+60F) vs. 165·(I+F)
• breakeven: I/F = (7500 - 165)/40 ≈ 183
• adding the FPU wins when I/F < 183
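The breakeven arithmetic above can be checked with a short script; the area and cycle figures are the slide's illustrative values, and the function names are mine.

```python
# Breakeven analysis for adding an FPU to a processor (sketch).
A_PROC = 125e6       # processor area, the slide's 125 Mlambda^2
A_FPU = 40e6         # FPU area, 40 Mlambda^2
CYCLES_PER_FP = 60   # cycles to emulate one FP add on the processor alone

def area_time(i_ops, f_ops, with_fpu):
    """Area-time product for a task of i_ops integer ops and f_ops FP adds."""
    if with_fpu:
        # composite: FP add takes one cycle on the FPU
        return (A_PROC + A_FPU) * (i_ops + f_ops)
    # pure processor: each FP add emulated in CYCLES_PER_FP cycles
    return A_PROC * (i_ops + CYCLES_PER_FP * f_ops)

# Breakeven ratio I/F where the two designs tie:
# 125(I + 60F) = 165(I + F)  =>  I/F = (7500 - 165)/40
breakeven = (A_PROC * CYCLES_PER_FP - (A_PROC + A_FPU)) / A_FPU
print(breakeven)  # 183.375: the FPU pays off when I/F < ~183
```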
How?
• Design issues:
– Interconnect
• space and time
– Control
– Instructions
• configuration path and control
• Mapping
• Cost/Benefits:
– Costs
• Area, Power
– Performance
• Bandwidth, latency
Interconnect
• Bus (degenerate network)
• Memory (shared retiming resource)
• RF/Coproc (traditional processor inter.)
• Network
Interconnect: Bus
• Minimal physical network
– shared with memory and/or other peripherals
– 10s-100s of cycles away from the processor (FPGA)
– low-to-moderate bandwidth
– can handle multiple, different functional units
• but the serial bottleneck of the bus prevents simultaneous communication among devices
Interconnect: Bus Example
• XC6200
Interconnect: Memory
• Use a memory (retiming) block to buffer data between heterogeneous regions
– DMA (usually implies a shared bus)
– FIFO
– dual-port or shared RAM
• decoupled, moderate latency (10-100 cycles)
• moderate bandwidth
Interconnect: Memory Example
• PAM, SPLASH
Interconnect: RF/Coproc
• Coupled directly to the processor datapath
– low latency (1-2 cycles)
– moderately high bandwidth
• limited by RF ports and control
Interconnect: RF/Coproc Examples
• GARP, Chimaera
– (more on this case Thursday)
Interconnect: Network
• Unified spatial network composing the various heterogeneous components
– high bandwidth
– latency varies with distance
– supports simultaneous operation and data transfer
– potentially the dominant cost
• Ainterconnect > Afunction
– Granularity question
• coarse (large blocks of each type)
• fine (interleaved)
Interconnect: Network Coarse
• Cheops, Pleiades
HSRA Heterogeneous Blocks
Interconnect: Network Coarse vs. Fine
• Multiplier/FPGA example
Interconnect: Network Coarse vs. Fine
• Fine
– possibly share interconnect
– locality
– uniform tiling
– if not shared
• may get concentrations of heavy/no use
• interconnect limits use as independent resources
– ratio less flexible?
– more difficult design
• Coarse
– flexible ratio
– easier to keep dense homogeneous blocks
– requires own interconnect
– doesn't disrupt base layout(s)
– non-local route to/from
• more/longer wires
– boundaries in net
Admin
• For POWER: update on
– rcore simulation
– HSRA energy
– Jsim size problems?
• Fix in works
Control
• As before:
– How many controllers?
– How many pinsts slaved off of each?
• Common classes:
– Single Controller / Lock-step
– Decoupled, datastream
– Autonomous MIMD
Control: Lockstep
• Master controller (usually the processor)
– issues instruction (instruction tag)
• every cycle
• explicitly when device should operate
– Single thread of control
• everything known to be in synch
– Idle while processor doing other tasks
– Ex. VLIW (TriMedia), PRISC, GARP
Control: Data Stream
• Configure, then run on data decoupled from the control processor
– runs in parallel with the processor
• processor runs orthogonal tasks
• maybe several simultaneous tasks running on the spatial array
– unit not typically fed by the processor directly
– need to synchronize data transfer and operation
• polling, interrupt, semaphore
– Ex. Cheops, PADDI-2, Pleiades, SPLASH
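A minimal sketch of the decoupled, datastream style: a "processor" thread streams operands through a FIFO to a separately running "array" thread, with blocking queue operations standing in for the semaphore synchronization. All names and the squaring computation are illustrative, not taken from any of the systems listed.

```python
import queue
import threading

fifo = queue.Queue(maxsize=4)   # shared retiming buffer between the two regions
results = []

def array_task():
    # runs autonomously once "configured"; the processor only feeds data
    while True:
        x = fifo.get()          # blocks until data arrives (semaphore-style sync)
        if x is None:           # sentinel marks end of stream
            break
        results.append(x * x)   # stand-in for the spatial computation

t = threading.Thread(target=array_task)
t.start()
for v in range(5):              # processor streams operands...
    fifo.put(v)                 # ...and is free to do other work between puts
fifo.put(None)                  # signal end of stream
t.join()
print(results)  # [0, 1, 4, 9, 16]
```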
Control: Autonomous (MIMD)
• Multiple (potential) control processors
– not necessarily slaved
– distributed control
– more care needed in synchronization
– Ex. Floe (MagicEight)
HSRA Multi-Hetero
• Coupling
– unifying networks
• Balance
– sequential/spatial
– control units w/ management task
Configuration
• Shared interface bus
– config and data
• XC6200, PAM, SPLASH
– config and memory path
• GARP
• Separate path / network
– VLIW, Pleiades
• Explicit
– XC6200, PAM, SPLASH, …
• Implicit
– GARP/PRISC
Mapping
• Mapping
– often a choice of where an operation runs
– must sort out what goes where
• faster in one resource?
• …but limited number of each resource
Mapping: Limited Resource
• What runs on the faster, limited resource?
– E.g. Tim's C extraction last time
– General:
• what is allocated to the resource
• when to reconfigure
• N candidate ops -> each a choice
– Greedy:
• break into temporal regions
– local working set and points of reconfiguration
• while resource available
– add the op offering the most benefit
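The greedy loop within one temporal region might be sketched as follows. Using benefit per unit cost as the selection metric is my assumption (the slide only says "most benefit"), and all names are illustrative.

```python
def greedy_allocate(ops, capacity):
    """Greedily place ops on a limited fast resource within one temporal region.

    ops: list of (name, benefit, cost) tuples; capacity: units of the fast
    resource available. Returns the names placed, picking the highest
    benefit-per-unit-cost candidates first while capacity remains.
    """
    chosen = []
    remaining = capacity
    # consider the most profitable candidates first
    for name, benefit, cost in sorted(ops, key=lambda o: o[1] / o[2], reverse=True):
        if cost <= remaining:   # "while resource available"
            chosen.append(name)
            remaining -= cost
    return chosen
```

For example, with capacity 6 and candidates `("mul", 10, 2)`, `("add", 1, 1)`, `("fft", 30, 5)`, the loop takes `fft` (ratio 6.0), skips `mul` (no room left), and takes `add`.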
Mapping: Spatial Choice
• Different kinds of resources
– (e.g. LUTs, multipliers)
• Multiple resources can solve same problem
• Limited number of each resource
• Match users with resources
Mapping: Bipartite Partitioning
• => Bipartite matching
– deals with unit resource consumption
– also with regional/interconnect constraints
– does not directly deal with performance…
• Post-pass(?) allocate
– faster resources to critical path
• ? N of R1 vs. M of R2
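A sketch of the matching formulation using a standard augmenting-path algorithm: each op lists the resource instances that can implement it, and the matching maximizes the number of ops placed. This is a generic bipartite-matching routine with illustrative names, not the formulation from Liu's FPGA'98 paper, and regional/interconnect constraints are not modeled.

```python
def match_ops(compat, n_resources):
    """Maximum bipartite matching of ops to resource instances.

    compat[op] = list of resource indices that can implement op.
    Returns (match, placed): match[r] is the op assigned to resource r
    (-1 if free), placed is the number of ops successfully placed.
    """
    match = [-1] * n_resources

    def try_assign(op, seen):
        # try each compatible resource; evict and re-place its op if needed
        for r in compat[op]:
            if r not in seen:
                seen.add(r)
                if match[r] == -1 or try_assign(match[r], seen):
                    match[r] = op
                    return True
        return False

    placed = sum(try_assign(op, set()) for op in range(len(compat)))
    return match, placed
```

If op 0 can only use resource 0 while op 1 can use either resource, the augmenting path moves op 1 onto resource 1 so both ops fit.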
Example/Details: Liu FPGA98
Mapping
• More common:
– can solve with:
• 12 A's and 2 B's, or
• 4 A's and 4 B's
– common need:
• 4 A's and 2 B's
– choice:
• 8 A's vs. 2 B's
Highlights
• Fits into existing framework
– not that much new here
– new issue: who shares resources, and how
• Issues: interconnect, control
• + density when balance is hit
• - efficiency when balance is mismatched
• - harder mapping (resource sharing)