Architecture and Synthesis for Multi-Cycle On-Chip...

Architecture and Synthesis for Multi-Cycle On-Chip Communication

Jason Cong

VLSI CAD Lab, Computer Science Department

University of California, Los Angeles
[email protected]

http://cadlab.cs.ucla.edu

Joint work with Y. Fan, G. Han, X. Yang, Z. Zhang

Outline

• Needs for Multi-Cycle On-Chip Communication
• Regular Distributed Register (RDR) Architecture
• MCAS: Multi-Cycle Communication Architectural Synthesis System
  • Scheduling-driven placement
  • Placement-driven rescheduling & rebinding
• Experimental Results
• Application in Pilot System -- A Platform-Based HW/SW Synthesis System
• Conclusions & Future Work

Interconnect Bottleneck in Nanometer Designs

• 1st challenge: Interconnect delay exceeds gate delay
  • Source of the “timing closure” problem
  • Happened in the mid 1990s. Addressed by new physical synthesis/prototyping tools

Interconnect Bottleneck in Nanometer Designs

• 2nd challenge: Single-cycle full chip synchronization is no longer possible
  • Not supported by the current CAD toolset
  • About to happen soon
• ITRS’01 0.07 um technology
  • 5.63 GHz across-chip clock
  • 800 mm2 (28.3 mm x 28.3 mm)
  • IPEM BIWS estimations
    • Buffer size: 100x
    • Driver/receiver size: 100x
• On semi-global layer (tier 3):
  • Can travel up to 11.4 mm in one cycle
  • Need 5 clock cycles from corner to corner

[Figure: 28.3 mm x 28.3 mm chip divided into regions reachable in 1, 2, 3, 4, and 5 clock cycles; axis ticks at 0, 11.4, 22.8, and 28.3 mm]

Single-cycle Full Chip Synchronization No Longer Possible -- FPGA Example

• Altera Stratix: EP1S80B-C6
  • Large size: 79,040 LEs
  • 22 DSP blocks …
• Corner-to-corner interconnect delay: 7.154 ns
• With clock frequency: 300 MHz
• Corner-to-corner communication: 3 clock cycles!

[Figure: Stratix device floorplan showing MegaRAM blocks (9), DSP blocks (22), M4K RAM blocks (364), M512 RAM blocks (767), and logic array blocks (79,040 LEs)]
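
The cycle counts quoted on these two slides follow from simple arithmetic. The sketch below reproduces them under one assumption that the slides do not state: the ASIC corner-to-corner route is Manhattan, so its length is 28.3 mm + 28.3 mm. The function names are illustrative only.

```python
import math

def cycles_for_delay(delay_ns, clock_mhz):
    """Clock cycles needed to cover a wire delay at a given clock frequency."""
    period_ns = 1000.0 / clock_mhz
    return math.ceil(delay_ns / period_ns)

def cycles_for_distance(distance_mm, reach_mm_per_cycle):
    """Clock cycles needed when a wire covers reach_mm_per_cycle in one cycle."""
    return math.ceil(distance_mm / reach_mm_per_cycle)

# FPGA example: 7.154 ns corner-to-corner delay at 300 MHz.
fpga_cycles = cycles_for_delay(7.154, 300)

# ASIC example: 28.3 mm x 28.3 mm die, 11.4 mm reach per cycle.
# Assumption: corner-to-corner wiring is Manhattan (x distance + y distance).
asic_cycles = cycles_for_distance(28.3 + 28.3, 11.4)

print(fpga_cycles, asic_cycles)
```

Both results match the slide figures of 3 cycles (FPGA) and 5 cycles (ASIC).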

Possible Solutions

• Asynchronous designs
  • Triggered by events instead of clocks
    • Bridging capabilities: provide interfaces for systems of different speeds
    • Greater flexibility: circuits in a system do not have to share common timing
  • Delay-insensitive
  • Reduced power consumption?
  • Improved performance?
• Synchronous designs with multi-cycle communication
  • Much better understood
  • Can leverage existing tools/flows
  • Our current focus

Multi-Cycle Interconnect Communication at Logic / Physical Level

• Simultaneous retiming + placement / floorplanning
  • Retiming + multilevel partitioning [Cong et al, ICCAD’00] and coarse placement [Cong et al, DAC’03]
  • Retiming + floorplanning [Chong & Brayton, IWLS’01]
  • Retiming + placement for FPGAs [Singh & Brown, FPGA’02]

Need for Considering Retiming during Placement -- Retiming/pipelining on global interconnects

• Multiple clock cycles are needed to cross the chip
• Proper placement allows retiming to hide global interconnect delays

[Figure: two placements of nodes a, b, c, d, each with d(v) = 1, WL = 6, and d(e) ∝ WL. Placement 1 (node order a b c d) and Placement 2 (node order a c b d). Labels: "Before retiming, φ = 5.0", "After retiming, φ = 3.0", "Better Initial Placement!!"]


Simultaneous Coarse Placement with Retiming on Interconnects

• Difficulties
  • How to consider retiming/pipelining over global interconnects
    • Flip-flop boundaries are not fixed during placement, so static timing analysis is difficult
  • How to handle the high complexity of the combined problem
• Our solution
  • Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)
  • Minimize the longest sequential path by improving the placement solution in the multilevel coarse placement framework

Sequential Arrival Time (SAT)

• Definition [Pan et al, TCAD98]
  • $l(v)$ = max delay from PIs to $v$ after optimal retiming under a given clock period $\phi$
  • $l(v) = \max_{(u,v)} \{\, l(u) - \phi \cdot w(u,v) + d(u,v) + d(v) \,\}$
  • Relation to retiming: $r(v) = \lceil l(v)/\phi \rceil - 1$
  • Theorem: $P$ can be retimed to $\phi + \max\{d(e)\}$ iff $l(\mathrm{POs}) \le \phi$
• SAT can be computed iteratively in $O(VE)$ time (linear time in practice)

Example: $d(v) = 1$, $d(e) = 2$, $\phi = 5$, $l(u) = 7$, $l(w) = 3$:
$l(v) = \max\{7 - 5 \cdot 1 + 2 + 1,\ 3 + 2 + 1\} = 6$

[Figure: node v with fan-in edges from u (l(u) = 7, one register on the edge) and w (l(w) = 3); annotations l(u), w(u,v), d(v)]
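
The iterative SAT computation can be sketched as a Bellman-Ford-style relaxation of the recurrence for l(v). This is a minimal illustration, not the authors' implementation: the function name, the edge tuple layout, and the convention that primary-input labels are supplied by the caller are all assumptions. `phi` stands for the clock period (the symbol extracted as f or φ on the slide).

```python
NEG_INF = float("-inf")

def sequential_arrival_times(nodes, edges, node_delay, phi, source_labels):
    """Relax l(v) = max{ l(u) - phi*w(u,v) + d(u,v) + d(v) } until stable.

    edges: list of (u, v, w, d_e), where w is the register count on the
           edge and d_e is its interconnect delay.
    source_labels: fixed labels for the primary inputs (assumed given).
    Returns the label dict, or None if the labels have not converged after
    |V| passes, which signals a loop whose delay cannot be absorbed at
    this clock period.
    """
    label = {v: NEG_INF for v in nodes}
    label.update(source_labels)
    for _ in range(len(nodes)):          # |V| passes of O(E) work: O(VE)
        changed = False
        for u, v, w, d_e in edges:
            if label[u] == NEG_INF or v in source_labels:
                continue
            candidate = label[u] - phi * w + d_e + node_delay[v]
            if candidate > label[v]:
                label[v] = candidate
                changed = True
        if not changed:
            return label
    return None

# The slide example: l(u) = 7, l(w) = 3, d(v) = 1, d(e) = 2, phi = 5,
# one register on edge (u, v) and none on edge (w, v).
slide_labels = sequential_arrival_times(
    nodes=["u", "w", "v"],
    edges=[("u", "v", 1, 2), ("w", "v", 0, 2)],
    node_delay={"u": 1, "w": 1, "v": 1},
    phi=5,
    source_labels={"u": 7, "w": 3},
)
l_v = slide_labels["v"]
print(l_v)
```

On the slide example this yields l(v) = 6, matching the worked value. Most graphs converge in far fewer than |V| passes, which is consistent with the "linear time in practice" remark.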

Limitation of Exploring Multi-cycle Interconnect Communication during Logic/Physical Synthesis

• The minimum clock period that can be achieved by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit
• Multi-cycle communication therefore needs to be considered during architecture & behavior synthesis

Example:
• In a loop: 4 logic cells, 2 registers
• Cell delay = 1 ns
• Interconnect delay = 1 ns
• DR ratio = (D_logic + D_int) / #Registers = (4 + 4) / 2 = 4 ns
• Clock cycle >= 4 ns
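
The DR-ratio bound is easy to check numerically. The sketch below uses hypothetical helper names and represents each loop as a tuple of cell delays, wire delays, and register count; it reproduces the 4 ns figure from the slide example.

```python
def dr_ratio(cell_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: total delay per register."""
    return (sum(cell_delays_ns) + sum(wire_delays_ns)) / num_registers

def clock_period_lower_bound(loops):
    """Retiming cannot beat the worst DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)

# Slide example: 4 cells of 1 ns, 4 wire segments of 1 ns, 2 registers.
slide_loop = ([1, 1, 1, 1], [1, 1, 1, 1], 2)
bound_ns = clock_period_lower_bound([slide_loop])
print(bound_ns)
```

Adding registers to the loop lowers the ratio, but that changes the behavior of the loop, which is why the fix has to come from architecture or behavioral synthesis rather than retiming alone.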

Our Contributions

• Regular Distributed Register (RDR) micro-architecture
  • Highly regular
  • Direct support of multi-cycle on-chip communication
• MCAS: Architectural Synthesis for Multi-cycle Communication
  • Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
  • Targets RDR architectures

Regular Distributed Register Architecture (1)

• Distribute registers to each “island”
• Choose the island size such that local computation and communication in each island can be done in a single cycle:

[Figure: RDR architecture: an array of islands connected by global interconnect; each island contains a local computational cluster (LCC), a register file, and an FSM]

$$D_{intra\text{-}island} = D_{logic} + D_{int\text{-}opt} \le D_{logic} + 2\,D_{int\text{-}opt}(W_i + H_i) \le T$$

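[Figure: island detail of width Wi and height Hi, containing a local computational cluster (LCC) of functional units such as ADD, MUL, and MUX (a cluster with an area constraint), a register file, and an FSM]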

Regular Distributed Register Architecture (2)

• Use register banks:
  • Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication in each island
• Highly regular

[Figure: the RDR island array and island detail (LCC, register file, FSM, dimensions Wi and Hi), with the register file of each island split into banks labeled 1 cycle, 2 cycle, …, k cycle]

ASIC Example: Regular Distributed Register Architecture for 70nm Technology

• ITRS’01 70 nm technology
• Chip dimension: 800 mm2 (28.3 mm x 28.3 mm)
• 5.63 GHz across-chip clock
  • Wire can travel up to 11.4 mm within 1 clock cycle under interconnect optimization
  • Need 5 clock cycles to cross the chip
• Base dimension of each island
  • Wi = Hi = 3.94 mm
  • = critical length (longest length that a wire can run without buffer insertion), estimated by IPEM BIWS assuming buffer size: 2x, driver/receiver size: 2x
  • Logic volume: 19.63M min-size 2-NAND gates
• 8x8 island-based array
• Local registers are partitioned into 5 banks

FPGA Example: Regular Distributed Register Architecture for a Stratix Device

• Stratix: EP1S80-C6
  • Large size: 79,040 LEs
  • Corner-to-corner interconnect delay: 7.154 ns
• To achieve 250 MHz clock frequency
  • 4x6 island array
  • Intra-island interconnect delay: 2.616 ns
  • Logic delay of a 16-bit adder: 1.239 ns
  • Total delay < 4 ns
• Each island contains (on average)
  • 3290 LEs (for function units)
  • 1 DSP block (8 9x9-bit multipliers)
  • 32 M512 RAM blocks (register banks)
  • 15 M4K RAM blocks (register banks)
• MegaRAM blocks: global resources
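
The island sizing for the Stratix example can be sanity-checked with a few lines of arithmetic. This is only a sketch: the device totals are the block counts listed on the earlier Stratix slide, and the variable names are illustrative.

```python
clock_mhz = 250
period_ns = 1000 / clock_mhz            # 4.0 ns clock period

intra_island_wire_ns = 2.616            # intra-island interconnect delay
adder16_ns = 1.239                      # logic delay of a 16-bit adder
single_cycle_ok = intra_island_wire_ns + adder16_ns < period_ns

# Spread the device resources evenly over the 4x6 island array.
islands = 4 * 6
device = {"LE": 79040, "DSP": 22, "M512": 767, "M4K": 364}
per_island = {name: count / islands for name, count in device.items()}

print(single_cycle_ok, {k: round(v, 1) for k, v in per_island.items()})
```

The per-island averages land close to the rounded figures on the slide (about 3290 LEs, 1 DSP block, 32 M512 blocks, and 15 M4K blocks).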

Example: Multicluster Architectures of DEC Alpha 21264

Source: "The Multicluster Architecture: Reducing Cycle Time Through Partitioning" by Keith I. Farkas, et al

RDR Architecture vs. DRA

• Distributed Register File Architecture (DRA)
  • Behavior-to-Placed RTL Synthesis with Performance-Driven Placement [Kim, et al, ICCAD’01]
• Similarities:
  • Distributes registers near the local computational units
  • Supports multi-cycle communication
  • Allows concurrent computation and communication
• Distinction:
  • The RDR architecture is highly regular
    • Facilitates interconnect delay estimation
    • Enables the systematic exploration of the cycle-time/latency tradeoff by varying the size of the basic island

Example: Impact of Interconnect on Scheduling

• Data flow graph extracted from the discrete cosine transform (DCT)
• The nodes with the same color are assigned to the same functional unit

FU          Delay   Num
ALU         1 ns    2
Multiplier  2 ns    2

[Figure: data flow graph with 12 numbered operation nodes (+, -, and *); the legend marks long interconnects (2 ns), short interconnects (1 ns), and registers]

[Figure: wirelength-driven placement on four islands, each with an LCC and register file: Alu1 (nodes 1, 5, 10), Alu2 (nodes 2, 6, 9), Mul1 (nodes 4, 11, 12), Mul2 (nodes 3, 7, 8)]

Single-cycle vs. Multi-cycle Interconnect Communication

• Single-cycle interconnect communication
  • Scheduled in 6 clock cycles
  • Clock period is 4 ns
  • Total latency is 24 ns
• Multi-cycle interconnect communication
  • Scheduled in 9 clock cycles
  • Clock period is 2 ns
  • Total latency is 18 ns

[Figure: two schedules of the 12-node DFG, one spanning Cycle 1 to Cycle 6 (single-cycle communication) and one spanning Cycle 1 to Cycle 9 (multi-cycle communication)]
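
The latency comparison is cycles times clock period. The sketch below also shows one plausible way the two clock periods arise from the FU and interconnect delays of the running example. That derivation is an assumption for illustration (including the reading that the long interconnect is the 2 ns one); the slides only state the resulting periods.

```python
fu_delay_ns = {"ALU": 1, "Multiplier": 2}
wire_delay_ns = {"short": 1, "long": 2}   # assumed mapping of the figure legend

# Assumption: with single-cycle communication, one period must cover the
# slowest functional unit followed by the longest wire.
single_cycle_period = max(fu_delay_ns.values()) + max(wire_delay_ns.values())

# Assumption: with multi-cycle communication, the period only has to cover
# the slowest functional unit; long transfers take extra cycles instead.
multi_cycle_period = max(fu_delay_ns.values())

def latency_ns(cycles, period_ns):
    """Total schedule latency."""
    return cycles * period_ns

single_latency = latency_ns(6, single_cycle_period)
multi_latency = latency_ns(9, multi_cycle_period)
print(single_latency, multi_latency)
```

More cycles at a shorter period still win here: 9 cycles at 2 ns beats 6 cycles at 4 ns.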

Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization

• With placement integrated with scheduling, the critical path is reduced.
• The DFG can be scheduled in 8 clock cycles, with a clock period of 2 ns.
• The total latency is 16 ns.

[Figure: scheduling-driven placement of Alu1 (nodes 1, 5, 10), Alu2 (nodes 2, 6, 9), Mul1 (nodes 4, 11, 12), and Mul2 (nodes 3, 7, 8), with the resulting DFG schedule spanning Cycle 1 to Cycle 8]

Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization

• With placement integrated with scheduling and binding, the critical path is further reduced.
• The DFG can be scheduled in 7 clock cycles, with a clock period of 2 ns.
• The total latency is 14 ns.

[Figure: simultaneous placement, scheduling and binding: Alu1 (nodes 1, 5, 10), Alu2 (nodes 2, 6, 9), Mul1 (nodes 4, 8, 12), Mul2 (nodes 3, 7, 11), with the resulting DFG schedule spanning Cycle 1 to Cycle 7]

MCAS: Placement-Driven Architectural Synthesis Using RDR Architecture

[Figure: MCAS flow]
• Inputs: C / VHDL, RDR Arch. Spec., target clock period
• CDFG generation -> CDFG
• Resource allocation -> resource constraints
• Functional unit binding -> Interconnected Component Graph (ICG) with nodes Mult1, Mult2, Alu1, Alu2
• Scheduling-driven placement -> location information (Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 8, 11; Mul2: 3, 7, 12)
• Placement-driven rebinding & scheduling -> schedule spanning Cycle1 to Cycle7 (Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 8, 12; Mul2: 3, 7, 11)
• Register and port binding
• Datapath & FSM generation
• Outputs: RTL VHDL files, floorplan constraints, multi-cycle path constraints

MCAS: Scheduling-Driven Placement (1)

• Basic approach:
  • Integrate scheduling with an SA-based coarse placement [Chang et al, ISPD’02]
  • Overlap computation with communication
  • Hide critical data transfers inside islands (intra-island) by reducing weighted wirelength
• Distinction between our placement and conventional performance-driven placement
  • Problem size: relatively small (<10^3) vs. huge
  • Input: ICG (general graph) vs. netlist (acyclic graph)
  • Objective to minimize: # of clock cycles vs. clock period

[Figure: island placement of Alu1 (nodes 1, 5, 10), Alu2 (nodes 2, 6, 9), Mul1 (nodes 4, 11, 12), and Mul2 (nodes 3, 7, 8)]

MCAS: Scheduling-Driven Placement (2)

• Scheduling-based timing analysis
  • Timing analysis is performed on the original CDFG instead of the ICG
    • A fast list scheduling is performed on the CDFG instead of the classical static timing analysis
    • Critical edges in the ICG are assigned high weights

[Figure: timing analysis by scheduling: the DFG schedule (Cycle 1 to Cycle 8) drives weight assignment on the ICG edges]
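
The weighted-wirelength objective can be illustrated with a small cost function of the kind a simulated-annealing placer would evaluate after each move. This is a sketch only: the Manhattan distance on island grid coordinates, the edge weights, and the two candidate placements are hypothetical and not taken from the slides.

```python
def weighted_wirelength(location, edge_weight):
    """Sum of weight * Manhattan distance over the ICG edges.

    location: component name -> (x, y) island coordinates.
    edge_weight: (component, component) -> weight; critical edges found by
                 the scheduling-based timing analysis get larger weights.
    """
    total = 0
    for (a, b), weight in edge_weight.items():
        (xa, ya), (xb, yb) = location[a], location[b]
        total += weight * (abs(xa - xb) + abs(ya - yb))
    return total

# Hypothetical ICG: the Alu1-Mul1 transfer is critical, so it is weighted 5.
edge_weight = {("Alu1", "Mul1"): 5, ("Alu1", "Mul2"): 1, ("Alu2", "Mul2"): 1}

placement_a = {"Alu1": (0, 0), "Alu2": (1, 0), "Mul1": (1, 1), "Mul2": (0, 1)}
placement_b = {"Alu1": (0, 0), "Alu2": (1, 0), "Mul1": (0, 1), "Mul2": (1, 1)}

cost_a = weighted_wirelength(placement_a, edge_weight)
cost_b = weighted_wirelength(placement_b, edge_weight)
print(cost_a, cost_b)
```

Placement B puts the critically weighted pair in adjacent islands and so scores lower, which is the behavior the weighting is meant to encourage.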

MCAS: Simultaneous Rescheduling & Rebinding (1)

• Simultaneous list scheduling and binding to minimize total schedule latency
• Previous approach [Jeon et al, ASPDAC’01]
  • cpl(i, j) = critical path length of the fanout cone rooted at node i, when node i is bound to functional unit j
  • Perform list scheduling using priority function min_j(cpl(i, j))
  • Bind node to functional unit j with the minimum cpl(i, j) at the earliest feasible control step

[Figure: example DFG with nodes *1, *2, +3, +4, *5, *6, -7, +8, labels X48 and X40, and annotations est(i, j) and cpl(i, j)]
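
A rough sketch of the cpl(i, j) priority follows. It makes simplifying assumptions that are not in the slides: interconnect delay is ignored, every successor is assumed to take its own best binding, and the graph, unit names, and delays are invented for illustration.

```python
from functools import lru_cache

def make_cpl(successors, delay):
    """Build cpl(node, unit) for a DAG.

    successors: node -> list of fanout nodes.
    delay: node -> {unit: execution delay of the node on that unit}.
    """
    @lru_cache(maxsize=None)
    def cpl(node, unit):
        # Longest remaining path below this node, with each successor
        # assumed to pick its own best unit (a simplification).
        tail = max(
            (min(cpl(s, u) for u in delay[s]) for s in successors.get(node, ())),
            default=0,
        )
        return delay[node][unit] + tail
    return cpl

def priority(cpl, delay, node):
    """List-scheduling priority min_j cpl(i, j) and the unit achieving it."""
    best_unit = min(delay[node], key=lambda unit: cpl(node, unit))
    return cpl(node, best_unit), best_unit

# Hypothetical two-node chain a -> b with a fast and a slow unit.
successors = {"a": ["b"], "b": []}
delay = {"a": {"fast": 1, "slow": 2}, "b": {"fast": 2, "slow": 3}}
cpl = make_cpl(successors, delay)
print(cpl("a", "slow"), priority(cpl, delay, "a"))
```

An interconnect-aware variant would add a transfer delay between the units chosen for a node and its successor, which is the kind of estimate the MCAS flow gets from placement.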

MCAS: Simultaneous Rescheduling & Rebinding (2)

• Our contributions
  • Use force-directed list scheduling and binding with interconnect delay estimation
  • Consider resource constraints
    • During scheduling (for selecting deferred nodes)
    • During binding (as part of the scheduling process)

Experiment Settings

[Figure: experimental flow. Inputs: C / VHDL, RDR Arch. Spec., target clock period. Steps: CDFG generation (CDFG); functional unit allocation & binding (interconnected component graph); then one of three paths labeled 1, 2, and 3 through scheduling, scheduling-driven placement (location information) with placement-driven scheduling, or scheduling-driven placement with placement-driven rebinding & rescheduling; register and port binding; datapath & FSM generation (RTL VHDL files; floorplan constraints; multi-cycle path constraints); commercial FPGA development system]

Experimental Results (1)

• Cycle number (CS), clock period (CP), and overall latency (Lat) comparison

            Flow 1                    Flow 2                    Flow 3
Benchmark   CS   CP (ns)  Lat (ns)    CS    CP (ns)  Lat (ns)   CS    CP (ns)  Lat (ns)
pr          27   5.79     156.33      29    3.53     102.37     29    3.66     106.14
wang        14   7.54     105.56      20    4.14     82.8       20    3.81     76.2
lee         20   6.25     125         27    3.36     90.72      26    3.38     87.88
mcm         34   7.64     259.76      39    4.81     187.59     38    4.57     173.66
honda       23   7.58     174.34      24    3.78     90.72      24    4.18     100.32
dir         50   7.03     351.5       51    4.41     224.91     51    4.33     220.83
chem        50   8.27     413.5       53    4.64     245.92     52    4.49     233.48
u5ml12      68   9.3      632.4       70    5.34     373.8      70    4.3      301
Ave Ratio   1    1        1           1.14  0.57     0.65       1.13  0.56     0.63

• Flow 1: Conventional approach
• Flow 2: Scheduling-driven placement
• Flow 3: Scheduling-driven placement + placement-driven rebinding & rescheduling
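
The "Ave Ratio" row can be recomputed from the per-benchmark numbers: latency is cycle count times clock period, and each flow's ratio is averaged over the eight benchmarks against Flow 1. The sketch below uses the table values as given; the helper names are illustrative.

```python
# (cycles, clock period in ns) for Flow 1, Flow 2, Flow 3.
results = {
    "pr":     [(27, 5.79), (29, 3.53), (29, 3.66)],
    "wang":   [(14, 7.54), (20, 4.14), (20, 3.81)],
    "lee":    [(20, 6.25), (27, 3.36), (26, 3.38)],
    "mcm":    [(34, 7.64), (39, 4.81), (38, 4.57)],
    "honda":  [(23, 7.58), (24, 3.78), (24, 4.18)],
    "dir":    [(50, 7.03), (51, 4.41), (51, 4.33)],
    "chem":   [(50, 8.27), (53, 4.64), (52, 4.49)],
    "u5ml12": [(68, 9.3),  (70, 5.34), (70, 4.3)],
}

def latency_ns(cycles, period_ns):
    return cycles * period_ns

def average_latency_ratio(flow_index):
    """Mean over benchmarks of latency(flow) / latency(Flow 1)."""
    ratios = [
        latency_ns(*flows[flow_index]) / latency_ns(*flows[0])
        for flows in results.values()
    ]
    return sum(ratios) / len(ratios)

flow2_ratio = round(average_latency_ratio(1), 2)
flow3_ratio = round(average_latency_ratio(2), 2)
print(flow2_ratio, flow3_ratio)
```

The recomputed averages agree with the table: Flow 2 and Flow 3 reach about 65% and 63% of the Flow 1 latency, despite using roughly 13 to 14% more cycles.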


• Total latency comparison

[Figure: bar chart of latency (ns), axis 0 to 700, for Flow 1, Flow 2, and Flow 3 on benchmarks pr, wang, lee, mcm, honda, dir, chem, u5ml12]

Synopsys Flow -- Behavioral Compiler vs. MCAS

• MCAS basic flow vs. Synopsys’ Behavioral Compiler
• Synopsys Behavioral Compiler setting: default (optimizing latency)
• Average latency ratio of MCAS vs. BC: 76%

[Figure: comparison flow. Tools: Behavioral Compiler, MCAS, Design Compiler, Altera Quartus-II, Modelsim, gcc. Data: equivalent high-level data flow description (VHDL and C), RTL VHDL, mapped VHDL for Stratix FPGAs, VHDL output for simulation, report]

[Figure: two bar charts comparing Synopsys BC and MCAS on pr, wang, mcm, and honda: latency (axis 0 to 600) and resource (axis 0 to 7000)]

Design  Flow         Cycles  Reg  ALU  MULT  fmax (MHz)  LE    Latency (ns)  MCAS vs. BC
pr      Synopsys BC  25      28   5    8     90.31       2945  276.82
        MCAS         27      34   6    2     96.74       2476  279.10        100.82%
wang    Synopsys BC  29      36   7    8     83.61       3605  346.85
        MCAS         14      35   5    8     103.76      4242  134.93        38.90%
mcm     Synopsys BC  43      142  23   7     79.65       6253  539.86
        MCAS         34      35   6    3     72.05       3876  471.89        87.41%
honda   Synopsys BC  29      44   8    14    85.14       6128  340.62
        MCAS         23      42   6    8     87.11       5523  264.03        77.52%
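
The latency column of the comparison is the cycle count divided by the achieved fmax, and the 76% figure is the mean of the four per-design ratios. A short sketch with illustrative names:

```python
def latency_ns(cycles, fmax_mhz):
    """Latency of a schedule running at the achieved maximum frequency."""
    return cycles * 1000.0 / fmax_mhz

# design -> ((BC cycles, BC fmax in MHz), (MCAS cycles, MCAS fmax in MHz))
runs = {
    "pr":    ((25, 90.31), (27, 96.74)),
    "wang":  ((29, 83.61), (14, 103.76)),
    "mcm":   ((43, 79.65), (34, 72.05)),
    "honda": ((29, 85.14), (23, 87.11)),
}

ratios = {
    name: latency_ns(*mcas) / latency_ns(*bc)
    for name, (bc, mcas) in runs.items()
}
average_ratio = sum(ratios.values()) / len(ratios)
print({name: round(r, 4) for name, r in ratios.items()}, round(average_ratio, 2))
```

The spread is wide: MCAS is on par with Behavioral Compiler on pr and well under half its latency on wang.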

Pilot: A Platform-based HW/SW Synthesis System

• Platform-based Synthesis
  • Start from a system-level design description
  • Target the FPSoC platform
  • Automate the process as much as possible
• System Data Model
  • MOC -- Model of Computation
    • System-level synthesis algorithms
    • Incorporates models such as the Funstate model
  • Internal Representation
    • Covers the whole life-cycle of the flow
    • SDM-API supports interoperability of CAD tools

Platforms Used in Our Research

• Highly Programmable Platforms
  • Xilinx Virtex II Pro, Altera Stratix, etc.
  • Concentrate on reconfigurability
    • Deliver reconfigurable processor + programmable logic
• Xilinx Virtex II Pro
  • Up to 4 IBM PowerPC in FPGA fabric
  • Up to 24 embedded Rocket I/O transceivers
  • Up to 556 18x18 multipliers
  • Over 10 Mb embedded block RAM
  • Up to 125,136 logic elements (LEs)
• Altera Stratix
  • Nios embedded processor
  • High-bandwidth I/O & high-speed interfaces
  • Up to 176 embedded multipliers & up to 22 high-performance DSP blocks
  • Up to 7 Mb embedded memory
  • Up to 79,040 logic elements (LEs)

[Figure: Virtex II Pro block diagram with four PowerPC405 cores, Rocket I/O transceivers, and programmable logic]

Pilot Design Flow

• Tools developed:
  • Converter: translates SpecC to SDM
  • Simulator: validates the design in SDM; simulates the design at different levels of abstraction
  • SW code generator: generates C source code from SDM for the target platform
  • HW code generator: generates VHDL source code from SDM for the target platform
  • Profiler: generates a profile based on the generated SW/HW system
  • HW synthesis: MCAS system

[Figure: Pilot design flow. Design Spec. and Platform Info. feed the System Data Model, which is used by Simulation, Estimation, and Synthesis (partitioning, scheduling, interface synthesis, SW synthesis, and HW synthesis with the MCAS system); SW Code Gen produces C code for the target SW, and HW Code Gen produces VHDL for the target PLD]

Work Accomplished: Jpeg Encoder

• Jpeg Encoder:
  • An example to validate the design flow

[Figure: test image in 116x96x8 .bmp format (12214 bytes) and the encoded result in 116x96x8 .jpg format (1704 bytes)]

Jpeg Example: Program Structure

[Figure: BMP Image File -> Image Fragmentation -> DCT -> Quantization -> Entropy Coding -> JPG Image File]

• JPEG: a standard for image compression
• DCT: Discrete Cosine Transform (ChenDCT)
• Four modes of operation in the JPEG standard
  ✓ Sequential DCT-based mode
  • Progressive DCT-based mode
  • Lossless mode
  • Hierarchical mode

Jpeg Example: Run-time Results

• Run-time result of Jpeg example

Module Name     NIOS(SW)                 NIOS(SW+HW1)             NIOS(SW+HW2)             NIOS(SW+HW3)
                time (10^-6 s)  rate(%)  time (10^-6 s)  rate(%)  time (10^-6 s)  rate(%)  time (10^-6 s)  rate(%)
HandleData      50.31           1.22%    50.31           1.92%    50.31           1.84%    50.31           4.59%
                (19878.67)               (19878.67)               (19878.67)               (19878.67)
DCT             3160.56         76.46%   1641.04         62.78%   1756.67         64.35%   123.51          11.26%
                (316.4)                  (609.37)                 (569.26)                 (8096.46)
Quantization    176.42          4.27%    176.42          6.75%    176.42          6.46%    176.42          16.09%
                (5668.41)                (5668.41)                (5668.41)                (5668.41)
HuffmanEncode   746.29          18.05%   746.29          28.55%   746.29          27.34%   746.29          68.06%
                (1339.96)                (1339.96)                (1339.96)                (1339.96)
Total           4133.57         100.00%  2614.05         100.00%  2729.68         100.00%  1096.52         100.00%

• HW1: Half DCT implementation with message-passing communication
• HW2: Full DCT implementation with buffering communication
• HW3: Full DCT implementation with shared-memory communication
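
The run-time table supports a quick speedup calculation. The sketch below uses the totals and DCT times from the table; the Amdahl-style upper bound (the speedup if the DCT cost nothing at all) is added here for context and is not from the slides.

```python
total_us = {"SW": 4133.57, "SW+HW1": 2614.05, "SW+HW2": 2729.68, "SW+HW3": 1096.52}
dct_us = {"SW": 3160.56, "SW+HW1": 1641.04, "SW+HW2": 1756.67, "SW+HW3": 123.51}

# End-to-end and DCT-only speedup of each configuration over pure software.
total_speedup = {cfg: total_us["SW"] / t for cfg, t in total_us.items()}
dct_speedup = {cfg: dct_us["SW"] / t for cfg, t in dct_us.items()}

# Amdahl-style bound: only the DCT is accelerated, so the other modules
# (HandleData, Quantization, HuffmanEncode) set a floor on total run time.
non_dct_us = total_us["SW"] - dct_us["SW"]
amdahl_bound = total_us["SW"] / non_dct_us

print(round(total_speedup["SW+HW3"], 2),
      round(dct_speedup["SW+HW3"], 2),
      round(amdahl_bound, 2))
```

With shared-memory communication (HW3) the DCT itself runs about 25.6x faster, giving an overall speedup of about 3.77x against a ceiling of about 4.25x; Huffman encoding then dominates the remaining run time.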

Conclusions & Future Work

• Conclusions:
  • Multi-cycle communication is needed for multi-gigahertz designs
  • Regular distributed register (RDR) architecture provides high regularity and direct support of
    • Multi-cycle communication
    • Integrated resource binding, scheduling, and physical planning
  • Experimental results demonstrate the effectiveness of MCAS synthesis algorithms
• Future Work:
  • Further refinement of synthesis for multi-cycle synchronous designs
    • Support of control-intensive applications, e.g. distributed controller generation
    • Steering logic optimization, e.g. layout-driven distributed MUX tree generation
  • Synthesis solutions for asynchronous designs

Acknowledgements

• Thanks for the support from the MARCO/DARPA Giga-Scale System Research Center (GSRC) and the Semiconductor Research Corporation (SRC)
