Architecture and Synthesis for Multi-Cycle On-Chip Communication
Jason Cong
VLSI CAD Lab, Computer Science Department
University of California, Los Angeles
[email protected]
http://cadlab.cs.ucla.edu
Joint work with Y. Fan, G. Han, X. Yang, Z. Zhang
Outline
- Needs for Multi-Cycle On-Chip Communication
- Regular Distributed Register (RDR) Architecture
- MCAS: Multi-Cycle Communication Architectural Synthesis System
  - Scheduling-driven placement
  - Placement-driven rescheduling & rebinding
- Experimental Results
- Application in Pilot System: A Platform-Based HW/SW Synthesis System
- Conclusions & Future Work
Interconnect Bottleneck in Nanometer Designs
- 1st challenge: interconnect delay exceeds gate delay
  - Source of the “timing closure” problem
  - Happened in the mid 1990s; addressed by new physical synthesis/prototyping tools
- Example: ITRS'01 0.07 um technology
  - 5.63 GHz across-chip clock
  - 800 mm2 die (28.3 mm x 28.3 mm)
  - IPEM BIWS estimations with buffer size 100x and driver/receiver size 100x
  - On the semi-global layer (tier 3), a signal can travel up to 11.4 mm in one cycle
  - 5 clock cycles are needed from corner to corner
[Figure: 28.3 mm x 28.3 mm die with axis marks at 11.4, 22.8 and 28.3 mm, showing the regions reachable in 1 to 5 clock cycles]
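The five-cycle figure can be reproduced from the die size and the per-cycle reach if the corner-to-corner route is assumed to be Manhattan. That routing model is an assumption, and the function name below is hypothetical. A minimal sketch:

```python
from math import ceil

def cycles_to_cross(width_mm, height_mm, reach_mm_per_cycle):
    """Clock cycles for a corner-to-corner signal, assuming a Manhattan route."""
    distance_mm = width_mm + height_mm
    return ceil(distance_mm / reach_mm_per_cycle)

# 28.3 mm x 28.3 mm die, 11.4 mm reachable per cycle (values from the slide)
print(cycles_to_cross(28.3, 28.3, 11.4))
```

With the slide's numbers the route is 56.6 mm long, which needs five cycles at 11.4 mm per cycle.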
Interconnect Bottleneck in Nanometer Designs
- 2nd challenge: single-cycle full-chip synchronization is no longer possible
  - Not supported by the current CAD toolset
  - About to happen soon
Single-Cycle Full-Chip Synchronization No Longer Possible: FPGA Example
- Altera Stratix EP1S80B-C6
  - Large size: 79,040 LEs
  - 22 DSP blocks …
- Corner-to-corner interconnect delay: 7.154 ns
- With a 300 MHz clock, corner-to-corner communication takes 3 clock cycles!
[Figure: Stratix floorplan with MegaRAM blocks (9), DSP blocks (22), M4K RAM blocks (364), M512 RAM blocks (767) and logic array blocks (79,040 LEs)]
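The three-cycle count for the Stratix example is the interconnect delay divided by the clock period, rounded up. A short sketch (the helper name is hypothetical):

```python
from math import ceil

def cycles_for_delay(delay_ns, clock_mhz):
    """Clock cycles needed to cover a wire delay at the given clock frequency."""
    period_ns = 1000.0 / clock_mhz
    return ceil(delay_ns / period_ns)

# 7.154 ns corner-to-corner delay at 300 MHz (period of about 3.33 ns)
print(cycles_for_delay(7.154, 300))
```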
Possible Solutions
- Asynchronous designs
  - Triggered by events instead of clocks
    - Bridging capabilities: provide interfaces for systems of different speeds
    - Greater flexibility: circuits in a system do not have to share common timing
  - Delay-insensitive
  - Reduced power consumption?
  - Improved performance?
- Synchronous designs with multi-cycle communication
  - Much better understood
  - Can leverage existing tools/flows
  - Our current focus
Multi-Cycle Interconnect Communication at Logic / Physical Level
- Simultaneous retiming + placement / floorplanning
  - Retiming + multilevel partitioning [Cong et al., ICCAD'00] and coarse placement [Cong et al., DAC'03]
  - Retiming + floorplanning [Chong & Brayton, IWLS'01]
  - Retiming + placement for FPGAs [Singh & Brown, FPGA'02]
Need of Considering Retiming during Placement: Retiming/Pipelining on Global Interconnects
- Multiple clock cycles are needed to cross the chip
- Proper placement allows retiming to hide global interconnect delays
[Figure: two placements of nodes a, b, c, d with d(v) = 1, WL = 6 and d(e) ∝ WL. Placement 1: φ = 5.0 before retiming, φ = 3.0 after retiming. Placement 2: φ = 4.0 before retiming, φ = 4.0 after retiming. Placement 2 is the better initial placement, but Placement 1 is better after retiming.]
Simultaneous Coarse Placement with Retiming on Interconnects
- Difficulties
  - How to consider retiming/pipelining over global interconnects
    - Flip-flop boundaries are not fixed during placement, so static timing analysis is difficult
  - How to handle the high complexity of the combined problem
- Our solution
  - Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)
  - Minimize the longest sequential path by improving the placement solution in the multilevel coarse placement framework
Sequential Arrival Time (SAT)
- Definition [Pan et al., TCAD'98]
  - l(v) = max delay from PIs to v after optimal retiming under a given clock period φ
  - l(v) = max over fanin edges (u,v) of { l(u) − φ·w(u,v) + d(u,v) + d(v) }
  - Relation to retiming: r(v) = ⌈l(v)/φ⌉ − 1
  - Theorem: P can be retimed to φ + max{d(e)} iff l(POs) ≤ φ
- SAT can be computed iteratively in O(VE) time (linear time in practice)
- Example: d(v) = 1, d(e) = 2, φ = 5, l(u) = 7, l(w) = 3, w(u,v) = 1, w(w,v) = 0, so l(v) = max{7 − 5·1 + 2 + 1, 3 + 2 + 1} = 6
[Figure: node v with fanins u and w, annotated with l(u), w(u,v) and d(v)]
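The iterative O(VE) computation can be sketched as a Bellman-Ford style longest-path relaxation over the recurrence above. This is an illustrative sketch, not the implementation from [Pan et al.]; the function name, the edge tuple layout and the `initial` argument (preset labels for source nodes) are assumptions made for the example.

```python
def sequential_arrival_times(nodes, edges, d_node, phi, initial):
    """Iteratively compute l(v) = max{l(u) - phi*w(u,v) + d(u,v) + d(v)}.

    edges   : list of (u, v, w, d_edge) with register count w and wire delay d_edge
    d_node  : dict of node delays d(v)
    initial : dict of preset labels for source nodes (hypothetical convention)
    Returns the label dict, or None if the labels never converge
    (a positive cycle, meaning phi is infeasible).
    """
    neg_inf = float("-inf")
    label = {v: initial.get(v, neg_inf) for v in nodes}
    for _ in range(len(nodes)):          # at most |V| relaxation passes
        changed = False
        for u, v, w, d_edge in edges:    # each pass visits all |E| edges
            if label[u] == neg_inf:
                continue
            candidate = label[u] - phi * w + d_edge + d_node[v]
            if candidate > label[v]:
                label[v] = candidate
                changed = True
        if not changed:
            return label
    return None

# The slide's example: l(u) = 7, l(w) = 3, d(v) = 1, d(e) = 2, phi = 5
labels = sequential_arrival_times(
    nodes=["u", "w", "v"],
    edges=[("u", "v", 1, 2), ("w", "v", 0, 2)],
    d_node={"v": 1},
    phi=5,
    initial={"u": 7, "w": 3},
)
print(labels["v"])
```

On the slide's example the two candidates are 5 (via u) and 6 (via w), so the label of v settles at 6 after one pass.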
Limitation of Exploring Multi-Cycle Interconnect Communication during Logic/Physical Synthesis
- The minimum clock period that can be achieved by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit
- This requires consideration of multi-cycle communication during architecture & behavior synthesis
- Example: a loop with 4 logic cells and 2 registers
  - Cell delay = 1 ns
  - Interconnect delay = 1 ns
  - DR ratio = (D_logic + D_int) / #registers = (4 + 4) / 2 = 4 ns
  - Clock cycle >= 4 ns
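The bound in this example can be written as a small calculation. The sketch below uses hypothetical names and takes each loop as explicit lists of cell and wire delays; it is only meant to show how the lower bound on the clock period follows from the worst loop.

```python
def dr_ratio(logic_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: total delay divided by register count."""
    return (sum(logic_delays_ns) + sum(wire_delays_ns)) / num_registers

def min_clock_period(loops):
    """Lower bound on the clock period: the maximum DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)

# The slide's loop: 4 cells of 1 ns, 4 wires of 1 ns, 2 registers
slide_loop = ([1.0] * 4, [1.0] * 4, 2)
print(min_clock_period([slide_loop]))
```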
Our Contributions
- Regular Distributed Register (RDR) micro-architecture
  - Highly regular
  - Direct support of multi-cycle on-chip communication
- MCAS: Architectural Synthesis for Multi-Cycle Communication
  - Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
  - Targets RDR architectures
Regular Distributed Register Architecture (1)
[Figure: array of islands connected by global interconnect; each island contains a local computational cluster (LCC), a register file and an FSM]
- Distribute registers to each “island”
- Choose the island size such that local computation and communication in each island can be done in a single cycle:
  D_intra-island = D_logic + D_int ≤ D_logic-opt + D_int-opt(2(W_i + H_i)) ≤ T
[Figure: one island of width W_i and height H_i, containing an FSM, a register file and a local computational cluster (LCC) with ADD, MUX and MUL units; the cluster is subject to an area constraint]
Regular Distributed Register Architecture (2)
- Use register banks:
  - Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication
  - Highly regular
[Figure: the island array with global interconnect, and one island (width W_i, height H_i) whose register file is split into 1-cycle, 2-cycle, …, k-cycle banks next to the FSM and the LCC (ADD, MUX, MUL) with its area constraint]
ASIC Example: Regular Distributed Register Architecture for 70nm Technology
- ITRS'01 70nm technology
- Chip dimension: 800 mm2 (28.3 mm x 28.3 mm)
- 5.63 GHz across-chip clock
  - A wire can travel up to 11.4 mm within 1 clock cycle under interconnect optimization
  - 5 clock cycles are needed to cross the chip
- Each island base dimension
  - W_i = H_i = 3.94 mm
  - = critical length (the longest length that a wire can run without buffer insertion), estimated by IPEM BIWS assuming buffer size 2x and driver/receiver size 2x
  - Logic volume: 19.63M min-size 2-NAND gates
- 8x8 island-base array
- Local registers are partitioned into 5 banks
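The 8x8 array and the five register banks are consistent with the chip and island dimensions listed above. The sketch below recomputes both; it assumes that the number of banks equals the number of cycles needed for the longest Manhattan route across the chip, which is an interpretation rather than something the slide states, and the function names are hypothetical.

```python
from math import ceil

def island_grid(chip_w_mm, chip_h_mm, island_w_mm, island_h_mm):
    """Number of island columns and rows needed to cover the chip."""
    return ceil(chip_w_mm / island_w_mm), ceil(chip_h_mm / island_h_mm)

def register_banks(chip_w_mm, chip_h_mm, reach_mm_per_cycle):
    """Banks needed if one bank serves each communication latency 1..k cycles,
    with k taken as the cycle count of a corner-to-corner Manhattan route
    (an assumption for this sketch)."""
    return ceil((chip_w_mm + chip_h_mm) / reach_mm_per_cycle)

print(island_grid(28.3, 28.3, 3.94, 3.94))   # columns, rows
print(register_banks(28.3, 28.3, 11.4))      # k
```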
FPGA Example: Regular Distributed Register Architecture for a Stratix Device
- Device: Stratix EP1S80-C6
  - Large size: 79,040 LEs
  - Corner-to-corner interconnect delay: 7.154 ns
- To achieve a 250 MHz clock frequency
  - 4x6 island array
  - Intra-island interconnect delay: 2.616 ns
  - Logic delay of a 16-bit adder: 1.239 ns
  - Total delay < 4 ns
- Each island contains (on average)
  - 3290 LEs (for functional units)
  - 1 DSP block (8 9x9-bit multipliers)
  - 32 M512 RAM blocks (register banks)
  - 15 M4K RAM blocks (register banks)
- MegaRAM blocks: global resources
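The numbers on this slide can be cross-checked with a few lines of arithmetic: the intra-island delay budget against the 250 MHz clock period, and the per-island resource averages against the device totals quoted earlier in the talk (79,040 LEs, 22 DSP blocks, 767 M512 blocks, 364 M4K blocks). Variable names are illustrative only.

```python
clock_mhz = 250
period_ns = 1000.0 / clock_mhz                 # 4.0 ns budget per cycle
intra_island_delay_ns = 2.616 + 1.239          # wire delay + 16-bit adder delay

islands = 4 * 6
device_totals = {"LE": 79040, "DSP": 22, "M512": 767, "M4K": 364}
per_island = {name: count / islands for name, count in device_totals.items()}

print(intra_island_delay_ns < period_ns)
for name, avg in per_island.items():
    print(name, round(avg, 1))
```

The averages come out close to the rounded per-island figures on the slide (about 3290 LEs, 1 DSP block, 32 M512 blocks and 15 M4K blocks).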
Example: Multicluster Architecture of the DEC Alpha 21264
Source: "The Multicluster Architecture: Reducing Cycle Time Through Partitioning" by Keith I. Farkas et al.
RDR Architecture vs. DRA
- Distributed Register File Architecture (DRA)
  - Behavior-to-Placed RTL Synthesis with Performance-Driven Placement [Kim et al., ICCAD'01]
- Similarities:
  - Distributes registers near the local computational units
  - Supports multi-cycle communication
  - Allows concurrent computation and communication
- Distinction:
  - The RDR architecture is highly regular
    - Facilitates interconnect delay estimation
    - Enables systematic exploration of the cycle-time/latency tradeoff by varying the size of the basic island
Example: Impact of Interconnect on Scheduling
- Data flow graph (DFG) extracted from a discrete cosine transform (DCT)
- Nodes with the same color are assigned to the same functional unit
- Functional units:
  FU          Delay   Num
  ALU         1 ns    2
  Multiplier  2 ns    2
- Wirelength-driven placement with binding Alu1 = {1, 5, 10}, Alu2 = {2, 6, 9}, Mul1 = {4, 11, 12}, Mul2 = {3, 7, 8}
- Long interconnect: 2 ns; short interconnect: 1 ns
[Figure: 12-node DFG of +, - and * operations with register markers, and its wirelength-driven placement on four LCCs with register files]
Single-Cycle vs. Multi-Cycle Interconnect Communication
- Single-cycle interconnect communication
  - Scheduled in 6 clock cycles
  - Clock period is 4 ns
  - Total latency is 24 ns
- Multi-cycle interconnect communication
  - Scheduled in 9 clock cycles
  - Clock period is 2 ns
  - Total latency is 18 ns
[Figure: the 12-node DFG scheduled in cycles 1 to 6 (single-cycle communication) and in cycles 1 to 9 (multi-cycle communication)]
Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization
- With placement integrated with scheduling, the critical path is reduced
- The DFG can be scheduled in 8 clock cycles with a clock period of 2 ns
- The total latency is 16 ns
[Figure: scheduling-driven placement with binding Alu1 = {1, 5, 10}, Alu2 = {2, 6, 9}, Mul1 = {4, 11, 12}, Mul2 = {3, 7, 8}, and the DFG scheduled in cycles 1 to 8]
Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization
- With placement integrated with scheduling and binding, the critical path is further reduced
- The DFG can be scheduled in 7 clock cycles with a clock period of 2 ns
- The total latency is 14 ns
[Figure: simultaneous placement, scheduling and binding with Alu1 = {1, 5, 10}, Alu2 = {2, 6, 9}, Mul1 = {4, 8, 12}, Mul2 = {3, 7, 11}, and the DFG scheduled in cycles 1 to 7]
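The four schedules discussed on these slides differ only in cycle count and clock period, so their total latencies can be compared directly. A small sketch with the values from the slides (dictionary keys are descriptive labels, not names used by the authors):

```python
# (clock cycles, clock period in ns) for the DCT example
schedules = {
    "single-cycle communication": (6, 4),
    "multi-cycle communication": (9, 2),
    "placement + scheduling": (8, 2),
    "placement + scheduling + binding": (7, 2),
}

latency_ns = {name: cycles * period for name, (cycles, period) in schedules.items()}
for name, value in latency_ns.items():
    print(f"{name}: {value} ns")
```

More cycles at a shorter period already win over the single-cycle schedule (18 ns vs. 24 ns), and each enhancement removes one further cycle.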
MCAS: Placement-Driven Architectural Synthesis Using RDR Architecture
[Figure: MCAS flow]
- Inputs: C / VHDL, RDR architecture specification, target clock period
- CDFG generation, producing the CDFG
- Resource allocation, producing resource constraints
- Functional unit binding, producing the interconnected component graph (ICG)
- Scheduling-driven placement, producing location information
- Placement-driven rebinding & scheduling
- Register and port binding
- Datapath & FSM generation
- Outputs: RTL VHDL files, floorplan constraints, multi-cycle path constraints
- Running example: after scheduling-driven placement, Alu1 = {1, 5, 10}, Alu2 = {2, 6, 9}, Mul1 = {4, 8, 11}, Mul2 = {3, 7, 12}; after placement-driven rebinding & scheduling, Mul1 = {4, 8, 12}, Mul2 = {3, 7, 11}, scheduled in cycles 1 to 7
MCAS: Scheduling-Driven Placement (1)
- Basic approach:
  - Integrate scheduling with an SA-based coarse placement [Chang et al., ISPD'02]
  - Overlap computation with communication
  - Hide critical data transfers inside islands by reducing weighted wirelength
- Distinction between our placement and conventional performance-driven placement
  - Problem size: relatively small (<10^3) vs. huge
  - Input: ICG (general graph) vs. netlist (acyclic graph)
  - Objective: minimize the number of clock cycles vs. the clock period
[Figure: placement of Alu1 = {1, 5, 10}, Alu2 = {2, 6, 9}, Mul1 = {4, 11, 12}, Mul2 = {3, 7, 8} on islands with register files]
MCAS: Scheduling-Driven Placement (2)
- Scheduling-based timing analysis
  - Timing analysis is performed on the original CDFG instead of the ICG
    - A fast list scheduling is performed on the CDFG instead of classical static timing analysis
    - Critical edges in the ICG are assigned high weights
  - Timing analysis by scheduling
[Figure: weight assignment on the 12-node DFG scheduled in cycles 1 to 8]
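The idea of using a fast schedule as the timing analysis can be illustrated with a simplified list scheduler that charges a placement-dependent transfer delay on every data edge. This is a sketch under assumptions, not the MCAS algorithm: operations are taken in a given topological order rather than by a priority function, each functional unit runs one operation at a time, and all names (`list_schedule`, `wire_cycles`, the tiny three-operation graph) are hypothetical.

```python
def list_schedule(ops, deps, binding, op_cycles, wire_cycles):
    """Schedule ops (given in topological order) on their bound functional units.

    deps        : dict op -> list of predecessor ops
    binding     : dict op -> functional unit name
    op_cycles   : dict op -> execution time in cycles
    wire_cycles : function (src_fu, dst_fu) -> transfer time in cycles,
                  which depends on where the two units are placed
    Returns (start cycle per op, total latency in cycles).
    """
    start, fu_free = {}, {}
    for op in ops:
        fu = binding[op]
        # Data is ready once every predecessor has finished and its result
        # has travelled from the producer's unit to this unit.
        ready = max(
            (start[p] + op_cycles[p] + wire_cycles(binding[p], fu)
             for p in deps.get(op, [])),
            default=0,
        )
        start[op] = max(ready, fu_free.get(fu, 0))
        fu_free[fu] = start[op] + op_cycles[op]
    latency = max(start[op] + op_cycles[op] for op in ops)
    return start, latency

ops = ["a", "b", "c"]
deps = {"b": ["a"], "c": ["b"]}
binding = {"a": "Alu1", "b": "Mul1", "c": "Alu1"}
op_cycles = {"a": 1, "b": 1, "c": 1}

# Units placed far apart: one extra cycle per transfer between different units.
_, far_latency = list_schedule(ops, deps, binding, op_cycles,
                               lambda s, d: 0 if s == d else 1)
# Units placed in the same island: transfers fit in the producing cycle.
_, near_latency = list_schedule(ops, deps, binding, op_cycles, lambda s, d: 0)
print(far_latency, near_latency)
```

The latency returned by such a schedule is the quantity a scheduling-driven placer would try to reduce, and the edges that delay the schedule are the natural candidates for high weights.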
MCAS: Simultaneous Rescheduling & Rebinding (1)
- Simultaneous list scheduling and binding to minimize total schedule latency
- Previous approach [Jeon et al., ASPDAC'01]
  - cpl(i, j) = critical path length of the fanout cone rooted at node i when node i is bound to functional unit j
  - Perform list scheduling using the priority function min_j(cpl(i, j))
  - Bind node i to the functional unit j with the minimum cpl(i, j) at the earliest feasible control step
[Figure: small DFG of + and * nodes bound to functional units, annotated with est(i, j) and cpl(i, j)]
MCAS: Simultaneous Rescheduling & Rebinding (2)
- Our contributions
  - Use force-directed list scheduling and binding with interconnect delay estimation
  - Consider resource constraints
    - During scheduling (for selecting deferred nodes)
    - During binding (as part of the scheduling process)
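Experiment Settings
[Figure: three experimental flows sharing a common front end and back end]
- Inputs: C / VHDL, RDR architecture specification, target clock period
- Front end: CDFG generation (CDFG), then functional unit allocation & binding (interconnected component graph)
- Flow 1: scheduling
- Flow 2: scheduling-driven placement (location information), then placement-driven scheduling
- Flow 3: scheduling-driven placement (location information), then placement-driven rebinding & rescheduling
- Back end: register and port binding, then datapath & FSM generation, producing RTL VHDL files, floorplan constraints and multi-cycle path constraints for a commercial FPGA development system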
Experimental Results (1)
Cycle number (CS), clock period (CP) and overall latency (Lat) comparison
            Flow 1                   Flow 2                   Flow 3
Design      CS  CP(ns)  Lat(ns)      CS  CP(ns)  Lat(ns)      CS  CP(ns)  Lat(ns)
pr          27  5.79    156.33       29  3.53    102.37       29  3.66    106.14
wang        14  7.54    105.56       20  4.14    82.8         20  3.81    76.2
lee         20  6.25    125          27  3.36    90.72        26  3.38    87.88
mcm         34  7.64    259.76       39  4.81    187.59       38  4.57    173.66
honda       23  7.58    174.34       24  3.78    90.72        24  4.18    100.32
dir         50  7.03    351.5        51  4.41    224.91       51  4.33    220.83
chem        50  8.27    413.5        53  4.64    245.92       52  4.49    233.48
u5ml12      68  9.3     632.4        70  5.34    373.8        70  4.3     301
Ave Ratio   1   1       1            1.14 0.57   0.65         1.13 0.56   0.63
- Flow 1: conventional approach
- Flow 2: scheduling-driven placement
- Flow 3: scheduling-driven placement + placement-driven rebinding & rescheduling
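The "Ave Ratio" row can be reproduced by averaging, over the eight designs, each flow's latency divided by the Flow 1 latency. The sketch below does this for the latency columns; the data is copied from the table and the helper name is hypothetical.

```python
# Latency (ns) per design for Flow 1, Flow 2, Flow 3, copied from the table.
latency = {
    "pr":     (156.33, 102.37, 106.14),
    "wang":   (105.56, 82.8, 76.2),
    "lee":    (125.0, 90.72, 87.88),
    "mcm":    (259.76, 187.59, 173.66),
    "honda":  (174.34, 90.72, 100.32),
    "dir":    (351.5, 224.91, 220.83),
    "chem":   (413.5, 245.92, 233.48),
    "u5ml12": (632.4, 373.8, 301.0),
}

def average_ratio(flow_index):
    """Mean of (flow latency / Flow 1 latency) over all designs."""
    ratios = [row[flow_index] / row[0] for row in latency.values()]
    return sum(ratios) / len(ratios)

print(round(average_ratio(1), 2), round(average_ratio(2), 2))  # 0.65 0.63
```

Each latency entry is itself the product of the cycle count and the clock period, for example 27 x 5.79 ns = 156.33 ns for pr under Flow 1.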
Experimental Results (2)
Total latency comparison
[Figure: bar chart of latency (ns, 0 to 700) for pr, wang, lee, mcm, honda, dir, chem and u5ml12 under Flow 1, Flow 2 and Flow 3]
- Flow 1: conventional approach
- Flow 2: scheduling-driven placement
- Flow 3: scheduling-driven placement + placement-driven rebinding & rescheduling
Synopsys Flow: Behavioral Compiler vs. MCAS
[Figure: comparison flow. Equivalent high-level data flow descriptions are given in VHDL (to Behavioral Compiler) and in C (to MCAS, and to gcc). Both paths produce RTL VHDL, which Design Compiler maps to VHDL for Stratix FPGAs; Altera Quartus-II then produces the report and the VHDL output for simulation in Modelsim.]
Experimental Results (3)
MCAS basic flow vs. Synopsys Behavioral Compiler (BC)
- Synopsys Behavioral Compiler setting: default (optimizing latency)
- Average latency ratio of MCAS vs. BC: 76%
Design  Flow         Cycles  Reg  ALU  MULT  fmax (MHz)  LE    Latency (ns)  MCAS vs. BC
pr      Synopsys BC  25      28   5    8     90.31       2945  276.82
        MCAS         27      34   6    2     96.74       2476  279.10        100.82%
wang    Synopsys BC  29      36   7    8     83.61       3605  346.85
        MCAS         14      35   5    8     103.76      4242  134.93        38.90%
mcm     Synopsys BC  43      142  23   7     79.65       6253  539.86
        MCAS         34      35   6    3     72.05       3876  471.89        87.41%
honda   Synopsys BC  29      44   8    14    85.14       6128  340.62
        MCAS         23      42   6    8     87.11       5523  264.03        77.52%
[Figure: two bar charts comparing Synopsys BC and MCAS for pr, wang, mcm and honda: Latency (0 to 600) and Resource (0 to 7000)]
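The latency column and the 76% average follow from the cycle counts and fmax values in the table: latency is cycles divided by fmax, and the summary figure is the mean of the four per-design MCAS/BC ratios. A short sketch with the table's numbers (names are illustrative):

```python
# design: ((BC cycles, BC fmax in MHz), (MCAS cycles, MCAS fmax in MHz))
results = {
    "pr":    ((25, 90.31), (27, 96.74)),
    "wang":  ((29, 83.61), (14, 103.76)),
    "mcm":   ((43, 79.65), (34, 72.05)),
    "honda": ((29, 85.14), (23, 87.11)),
}

def latency_ns(cycles, fmax_mhz):
    """Total latency when running at the maximum reported clock frequency."""
    return cycles * 1000.0 / fmax_mhz

ratios = {
    design: latency_ns(*mcas) / latency_ns(*bc)
    for design, (bc, mcas) in results.items()
}
average = sum(ratios.values()) / len(ratios)
print({d: f"{r:.2%}" for d, r in ratios.items()}, f"{average:.0%}")
```

Note that the average hides a wide spread: MCAS is slightly slower than BC on pr and far faster on wang.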
Pilot: A Platform-Based HW/SW Synthesis System
- Platform-based synthesis
  - Starts from a system-level design description
  - Targets the FPSoC platform
  - Automates the process as much as possible
- System Data Model (SDM)
  - MOC: Model of Computation
    - System-level synthesis algorithms
    - Incorporates models such as the Funstate model
  - Internal representation
    - Covers the whole life cycle of the flow
    - SDM-API supports interoperability of CAD tools
Platforms Used in Our Research
- Highly programmable platforms
  - Xilinx Virtex II Pro, Altera Stratix, etc.
  - Concentrate on reconfigurability
    - Deliver reconfigurable processor + programmable logic
- Xilinx Virtex II Pro
  - Up to 4 IBM PowerPC cores in the FPGA fabric
  - Up to 24 embedded Rocket I/O transceivers
  - Up to 556 18x18 multipliers
  - Over 10 Mb embedded block RAM
  - Up to 125,136 logic elements (LEs)
- Altera Stratix
  - Nios embedded processor
  - High-bandwidth I/O & high-speed interfaces
  - Up to 176 embedded multipliers & up to 22 high-performance DSP blocks
  - Up to 7 Mb embedded memory
  - Up to 79,040 logic elements (LEs)
[Figure: Virtex II Pro block diagram with four PowerPC405 cores, Rocket I/O transceivers and programmable logic]
Pilot Design Flow
- Tools developed:
  - Converter: translates SpecC to SDM
  - Simulator: validates the design in SDM and simulates the design at different levels of abstraction
  - SW code generator: generates C source code from SDM for the target platform
  - HW code generator: generates VHDL source code from SDM for the target platform
  - Profiler: generates a profile based on the generated SW/HW system
  - HW synthesis: MCAS system
[Figure: Pilot flow. The design spec. is converted into the System Data Model, which is used for simulation and synthesis (partitioning, scheduling, interface synthesis, SW synthesis, HW synthesis with the MCAS system) together with platform info. and estimation; SW code gen produces C code for the target SW and HW code gen produces VHDL for the target PLD.]
Work Accomplished: Jpeg Encoder
- Jpeg encoder: an example to validate the design flow
- Input: 116x96x8 .bmp format (12214 bytes)
- Output: 116x96x8 .jpg format (1704 bytes)
Jpeg Example: Program Structure
[Figure: pipeline BMP image file -> image fragmentation -> DCT -> quantization -> entropy coding -> JPG image file]
- JPEG: a standard for image compression
- DCT: Discrete Cosine Transform (ChenDCT)
- Four modes of operation in the JPEG standard
  - Sequential DCT-based mode (the mode used here)
  - Progressive DCT-based mode
  - Lossless mode
  - Hierarchical mode
Jpeg Example: Run-Time Results
Run-time result of the Jpeg example (time in 10^-6 s, rate as a share of the total; parenthesized values as on the slide)
Module Name    NIOS(SW)            NIOS(SW+HW1)        NIOS(SW+HW2)        NIOS(SW+HW3)
               time     rate(%)    time     rate(%)    time     rate(%)    time     rate(%)
HandleData     50.31    1.22%      50.31    1.92%      50.31    1.84%      50.31    4.59%
               (19878.67)          (19878.67)          (19878.67)          (19878.67)
DCT            3160.56  76.46%     1641.04  62.78%     1756.67  64.35%     123.51   11.26%
               (316.4)             (609.37)            (569.26)            (8096.46)
Quantization   176.42   4.27%      176.42   6.75%      176.42   6.46%      176.42   16.09%
               (5668.41)           (5668.41)           (5668.41)           (5668.41)
HuffmanEncode  746.29   18.05%     746.29   28.55%     746.29   27.34%     746.29   68.06%
               (1339.96)           (1339.96)           (1339.96)           (1339.96)
Total          4133.57  100.00%    2614.05  100.00%    2729.68  100.00%    1096.52  100.00%
- HW1: half DCT implementation with message-passing communication
- HW2: full DCT implementation with buffering communication
- HW3: full DCT implementation with shared-memory communication
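Because only the DCT time changes between configurations, the overall speedup is limited by the modules that stay in software. The sketch below derives the totals and speedups from the table's numbers; the variable names are illustrative, and the non-DCT time is simply the sum of the three unchanged rows.

```python
# DCT time (10^-6 s) per configuration, from the table
dct_time = {"SW": 3160.56, "SW+HW1": 1641.04, "SW+HW2": 1756.67, "SW+HW3": 123.51}
# Time of the three modules that are identical in every configuration
other_time = 50.31 + 176.42 + 746.29

total = {cfg: t + other_time for cfg, t in dct_time.items()}
overall_speedup = {cfg: total["SW"] / t for cfg, t in total.items()}
dct_speedup = {cfg: dct_time["SW"] / t for cfg, t in dct_time.items()}

for cfg in dct_time:
    print(f"{cfg}: total {total[cfg]:.2f}, overall x{overall_speedup[cfg]:.2f}, "
          f"DCT x{dct_speedup[cfg]:.2f}")
```

With the shared-memory implementation (HW3) the DCT itself becomes more than 25 times faster, while the whole encoder gains a factor of roughly 3.8, since the remaining software modules now dominate the run time.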
Conclusions & Future Work
- Conclusions:
  - Multi-cycle communication is needed for multi-gigahertz designs
  - The regular distributed register (RDR) architecture provides high regularity and direct support of
    - Multi-cycle communication
    - Integrated resource binding, scheduling, and physical planning
  - Experimental results demonstrate the effectiveness of the MCAS synthesis algorithms
- Future work:
  - Further refinement of synthesis for multi-cycle synchronous designs
    - Support of control-intensive applications, e.g. distributed controller generation
    - Steering logic optimization, e.g. layout-driven distributed MUX tree generation
  - Synthesis solutions for asynchronous designs
Acknowledgements
- Thanks for the support from the MARCO/DARPA Giga-Scale System Research Center (GSRC) and the Semiconductor Research Corporation (SRC)