Architecture-Level Synthesis for Automatic Interconnect Pipelining
Jason Cong, Yiping Fan, Zhiru Zhang
VLSI CAD Lab
Computer Science Department
University of California, Los Angeles
{cong, fanyp, zhiruz}@cs.ucla.edu
Funded by GSRC, NSF, and Altera Corp.
Outline
Motivation
Our contributions
RDR-Pipe micro-architecture
• Regular Distributed Register micro-architecture with interconnect pipelining
Synthesis flow and algorithms
• MCAS-Pipe: automatic interconnect pipelining and sharing
Experimental results
Conclusions
Interconnect Bottleneck in Nanometer Designs
Challenge: single-cycle full-chip communication will no longer be possible
Not supported by the current CAD toolset
[Figure: 28.3 mm x 28.3 mm die with distance marks at 0, 11.4, 22.8 and 28.3 mm; regions reachable in 1, 2, 3, 4 and 5 cycles]
ITRS’01 0.07 um technology
5.63 GHz across-chip clock
800 mm2 die (28.3 mm x 28.3 mm)
IPEM BIWS estimations
• Buffer size: 100x
• Driver/receiver size: 100x
• Semi-global layer (Tier 3)
A signal can travel up to 11.4 mm in one cycle
5 clock cycles are needed from corner to corner
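The corner-to-corner figure can be checked with a short calculation. The sketch below assumes Manhattan (horizontal plus vertical) routing, which the slide does not state explicitly; the function name and constants are illustrative only.

```python
import math

DIE_EDGE_MM = 28.3          # die edge length from the slide
REACH_MM_PER_CYCLE = 11.4   # distance a signal covers in one clock cycle


def cycles_to_cross(distance_mm, reach_mm_per_cycle=REACH_MM_PER_CYCLE):
    """Number of whole clock cycles needed to cover distance_mm."""
    return math.ceil(distance_mm / reach_mm_per_cycle)


# Assumption: a corner-to-corner route runs along both die edges.
corner_to_corner = cycles_to_cross(2 * DIE_EDGE_MM)
one_edge = cycles_to_cross(DIE_EDGE_MM)
```

Under this assumption the corner-to-corner distance is 56.6 mm, which rounds up to the 5 cycles quoted on the slide.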
Related Work
Retiming with placement or floorplanning
• Retiming + multilevel partitioning [Cong et al., ICCAD’00] and coarse placement [Cong et al., DAC’03]
• Retiming + floorplanning [Chong & Brayton, IWLS’01]
• Retiming + placement for FPGAs [Singh & Brown, FPGA’02]
Global wire pipelining in the Itanium(TM) processor [McInerney et al., ISPD’00]
Buffer and flip-flop insertion in RTL [Lu et al., DATE’02] [Cocchini, ICCAD’02]
Limitations of Exploring Multicycle Communication at the Logic/Physical Level
The minimum clock period achievable by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit [Papaefthymiou, MST’94]
• In a loop: 4 logic cells, 2 registers
• Cell delay = 1 ns
• Interconnect delay = 1 ns
• DR ratio = (Dlogic + Dint) / #Registers = (4 + 4) / 2 = 4 ns
• Clock period ≥ 4 ns
Interconnect pipelining by flip-flop insertion?
• Requires a considerable amount of manual rework on the original RTL descriptions
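The DR-ratio bound above can be reproduced with a minimal sketch. The function names and the tuple layout for a loop are assumptions made for illustration; only the formula itself comes from the slide.

```python
def dr_ratio(logic_delay_ns, interconnect_delay_ns, num_registers):
    """Delay-to-register ratio of one loop: (Dlogic + Dint) / #Registers."""
    if num_registers <= 0:
        raise ValueError("a loop must contain at least one register")
    return (logic_delay_ns + interconnect_delay_ns) / num_registers


def clock_period_lower_bound(loops):
    """Lower bound on the clock period: the largest DR ratio over all loops.

    loops is an iterable of (logic_delay_ns, interconnect_delay_ns,
    num_registers) tuples, a hypothetical representation chosen here.
    """
    return max(dr_ratio(*loop) for loop in loops)


# Slide example: 4 cells of 1 ns each, 4 ns of interconnect, 2 registers.
example_loop = (4 * 1.0, 4 * 1.0, 2)
bound_ns = clock_period_lower_bound([example_loop])
```

For the slide's loop this gives 4 ns, so no amount of retiming alone can bring the clock period below that value.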
Our Approach
Consideration of multicycle communication during architectural (or behavioral) synthesis [Cong et al., ISPD’03] [Cong et al., ICCAD’03]
Regular Distributed Register (RDR) micro-architecture
• Highly regular
• Direct support of multicycle on-chip communication
MCAS: Architectural Synthesis for Multi-cycle Communication
• Efficiently maps the behavioral descriptions to the RDR uArch
• Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
This work
Extension of RDR and MCAS for interconnect pipelining
Regular Distributed Register Micro-Architecture
[Figure: array of islands (width Wi, height Hi) connected by global interconnect; each island contains a Local Computational Cluster (LCC) with units such as ALU, MUL and MUX, an FSM, and a register file; inter-island communication takes 1 cycle, 2 cycles, …, k cycles]
Distribute registers to each “island”
Choose the island size such that local computation and communication in each island can be done in a single cycle
Use register banks: registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication in each island
Wiring Overhead in RDR Designs
Data transfers r1→r3 and r2→r4 are overlapped
Two dedicated global wires are needed
[Figure: ALU1 and MUL1 connected by interconnects with a delay of 2 cycles; schedule over cycles 1 to 5 showing sender registers r1, r2 and receiver registers r3, r4]
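The overhead can be illustrated with a small sketch that counts how many dedicated wires a set of multicycle transfers needs. It assumes that an unpipelined wire stays occupied for the full delay of each transfer; the function name and the start cycles used in the example are hypothetical, not read off the slide.

```python
def wires_needed_unpipelined(start_cycles, delay):
    """Peak number of simultaneously busy wires.

    Each transfer holds its wire during cycles [start, start + delay).
    """
    events = []
    for start in start_cycles:
        events.append((start, 1))            # wire becomes busy
        events.append((start + delay, -1))   # wire is released
    # Tuples sort so that a release at cycle c precedes an acquire at c,
    # letting a wire freed at c be reused in the same cycle.
    events.sort()
    busy = peak = 0
    for _, change in events:
        busy += change
        peak = max(peak, busy)
    return peak


# Two 2-cycle transfers started in consecutive cycles overlap in time.
overlapped = wires_needed_unpipelined([1, 2], delay=2)
# The same transfers started two cycles apart can reuse one wire.
serialized = wires_needed_unpipelined([1, 3], delay=2)
```

With overlapping transfers the count is 2, matching the two dedicated global wires on the slide.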
Architectural Solution: RDR-Pipe
Keep the intra-island structures
Inter-island pipeline register station (PRS) for global communications
PRS performs autonomous store-and-forward
• Synchronous design
• No global control signal needed for PRS
[Figure: islands (LCC, FSM, register file) separated by horizontal (H) and vertical (V) channels, with pipeline register stations (PRS) on the channels]
Reducing Wiring Overhead in RDR-Pipe
Data transfers are pipelined
One wire with a pipeline register is enough
[Figure: ALU1 and MUL1 connected by a 2-cycle communication link; schedule over cycles 1 to 5 showing the sender register, the pipeline register and the receiver registers for r1 to r4]
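A minimal cycle-by-cycle model shows why one pipelined wire is enough. It treats the link as a shift register with `delay - 1` pipeline registers followed by the receiver register, each capturing its upstream value on every clock edge with no control signal. The function name, the dictionary interface and the example values are assumptions for illustration.

```python
def simulate_pipelined_link(sends, delay):
    """Simulate one pipelined link.

    sends maps a cycle to the value held in the sender register during
    that cycle. Returns a dict mapping each arrival cycle to the value
    latched by the receiver register.
    """
    if not sends:
        return {}
    # regs[:-1] are the pipeline registers, regs[-1] is the receiver.
    regs = [None] * delay
    received = {}
    for cycle in range(min(sends), max(sends) + delay + 1):
        # Clock edge: every register stores what its predecessor held.
        regs = [sends.get(cycle - 1)] + regs[:-1]
        if regs[-1] is not None:
            received[cycle] = regs[-1]
    return received


# Two transfers issued back to back on a single 2-cycle link.
arrivals = simulate_pipelined_link({1: "r1", 2: "r2"}, delay=2)
```

Both values arrive intact, each exactly `delay` cycles after it was sent, even though the two transfers are in flight on the same wire at the same time.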
Synthesis Flow: MCAS-Pipe System
[Figure: MCAS-Pipe flow. C / VHDL → CDFG generation → CDFG → Resource allocation & functional unit binding → ICG → Scheduling-driven placement → Locations → Placement-driven rescheduling & rebinding → Global interconnect sharing → Register and port binding → Datapath & FSM generation → RTL VHDL & floorplan constraints]
Global interconnect sharing
• After scheduling and functional unit binding
• Before register and port binding
• Enables multiple data communications to share a physical link (a wire with pipeline registers)
Advantages over MCAS
• Expected to reduce global wiring demand
• No multicycle path constraint needed
Global Interconnect Sharing
Conflicted data transfers: two physical links are needed to support the concurrent data transfers
[Figure: islands A and B with D = 2; producers pe, pg and consumers ce, cg; schedule over cycles 1 to 7 with pipeline, sender and receiver registers marked]
Compatible data transfers: only one physical link is required to support the scheduled data transfers
[Figure: islands A and B with D = 2; producers pe, pg and consumers ce, cg; schedule over cycles 1 to 7]
Now the two producer registers can be merged, since their lifetimes become compatible
[Figure: islands A and B with D = 2; merged producer register pe, pg and consumers ce, cg]
Global Pipelined Interconnect Minimization
Definitions
Data links: pipelined global interconnects
Channel: set of data links between two islands
• Width of a channel: number of its data links
Data transfer: movement of data from a producer to a consumer
Architectural assumption
Channels cannot share interconnects
Theorem
Global pipelined interconnects are minimized if and only if the width of every channel is minimized
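One way to see why the theorem holds is sketched below. The symbols W (total number of data links), w_c (width of channel c) and the channel set C are notation introduced here, and the argument assumes the operation schedule is already fixed, so that each channel's width depends only on its own transfers.

```latex
W \;=\; \sum_{c \in \mathcal{C}} w_c ,
\qquad
w_c \;\ge\; w_c^{\min} \quad \text{for every } c \in \mathcal{C}
\;\;\Longrightarrow\;\;
W \;\ge\; \sum_{c \in \mathcal{C}} w_c^{\min} ,
\quad \text{with equality iff } w_c = w_c^{\min} \text{ for all } c .
```

Because channels cannot share interconnects, the total is a plain sum with no cross terms, which is what lets each channel be minimized independently.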
Transfer Scheduling for a Single Channel
A decision problem formulation
Given:
• A channel (A, B) containing m data links
• A data transfer set {e | pe ∈ A and ce ∈ B}, where each transfer e is associated with an arrival time T(pe)+1, a deadline T(ce)−D(A, B), and unit effective occupancy time
Fact: for every time slot, at most one transfer can be issued on a data link
Objective: to find a feasible transfer schedule on these data links
Transfer scheduling is polynomially solvable
A special real-time scheduling problem [J. Blazewicz, 1979]
• Binary search for the minimum feasible channel width m
• For each width, apply Earliest-Deadline-First (EDF) scheduling: O(n log n)
• Overall time complexity: O(n log² n)
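The binary search over EDF feasibility checks can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the function names, the `(arrival, deadline)` tuple representation and the inclusive treatment of the deadline slot are all assumptions.

```python
import heapq


def transfer_window(t_producer, t_consumer, distance):
    """Issue window of one transfer: arrival T(pe)+1, deadline T(ce)-D."""
    return t_producer + 1, t_consumer - distance


def edf_feasible(windows, m):
    """True if every transfer can be issued within its window on m links.

    Each transfer occupies one link for one time slot; in every slot the
    m released transfers with the earliest deadlines are issued.
    """
    if not windows:
        return True
    if m < 1:
        return False
    pending = sorted(windows)      # ordered by arrival slot
    ready = []                     # min-heap of deadlines of released transfers
    i, n = 0, len(pending)
    slot = pending[0][0]
    while i < n or ready:
        if not ready:              # idle gap: jump to the next arrival
            slot = max(slot, pending[i][0])
        while i < n and pending[i][0] <= slot:
            heapq.heappush(ready, pending[i][1])
            i += 1
        for _ in range(min(m, len(ready))):
            if heapq.heappop(ready) < slot:
                return False       # this transfer missed its deadline
        slot += 1
    return True


def min_channel_width(windows):
    """Smallest feasible number of data links, or None if none exists."""
    n = len(windows)
    if n == 0:
        return 0
    if not edf_feasible(windows, n):
        return None                # some window is empty (arrival > deadline)
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_feasible(windows, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Each feasibility check sorts once and performs one heap push and pop per transfer, and the search tries O(log n) widths, which is consistent with the O(n log² n) bound on the slide. For example, `min_channel_width([(1, 1), (1, 2), (2, 2)])` needs two links, while three transfers that all share deadline 3 but arrive in slots 1, 2 and 3 fit on one.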
EDF-Based Transfer Scheduling Example
[Figure: six data transfers (1 to 6) plotted against time slots and assigned to Data Link 1 and Data Link 2]
Ordered by Earliest-Deadline-First: successfully scheduled onto 2 data links
Ordered by left edge: failed for 2 data links!
Experiment Settings
[Figure: experimental flows from C / VHDL input to Altera QuartusII + Stratix. Flow labels: Conventional flow, MCAS flow, MCAS-Pipe flow. Steps shown: CDFG generation; conventional scheduling; functional unit allocation & binding; scheduling-driven placement; placement-driven rebinding & rescheduling; global interconnect sharing; register and port binding; datapath & control generation. Inputs: uArch. spec., target clock period]
RTL VHDL files (for all flows)
Floorplan constraints (for MCAS and MCAS-Pipe)
Multicycle path constraints (for MCAS only)
Experimental Results: Register and LE Usage
Design    Node#   MCAS           CONV / MCAS    MCAS-Pipe / MCAS
                  Reg#   LE      Reg#   LE      Reg#   LE
PR        46      31     1181    0.71   0.95    1.19   0.95
WANG      52      40     1435    0.63   0.81    1.20   0.85
LEE       53      29     988     0.76   0.96    1.00   0.84
MCM       98      57     2467    0.75   1.00    1.05   1.19
HONDA     101     41     2542    0.83   0.90    1.05   1.01
DIR       152     44     2260    0.75   0.95    1.05   1.01
Average   -       -      -       0.74   0.93    1.09   0.98
Design environment: Altera QuartusII, Stratix EP1S40
MCAS vs. conventional flow:
• Uses more registers and logic elements (LE)
MCAS-Pipe vs. MCAS:
• Slightly more registers, and comparable logic element cost
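The Average row can be cross-checked against the per-design ratios. The sketch below assumes the averages are arithmetic means of the ratio columns, which the slide does not state; the variable and function names are illustrative.

```python
# (design, CONV/MCAS Reg#, CONV/MCAS LE, MCAS-Pipe/MCAS Reg#, MCAS-Pipe/MCAS LE)
ROWS = [
    ("PR",    0.71, 0.95, 1.19, 0.95),
    ("WANG",  0.63, 0.81, 1.20, 0.85),
    ("LEE",   0.76, 0.96, 1.00, 0.84),
    ("MCM",   0.75, 1.00, 1.05, 1.19),
    ("HONDA", 0.83, 0.90, 1.05, 1.01),
    ("DIR",   0.75, 0.95, 1.05, 1.01),
]
REPORTED_AVERAGE = (0.74, 0.93, 1.09, 0.98)


def column_means(rows):
    """Arithmetic mean of each of the four ratio columns."""
    count = len(rows)
    return tuple(sum(row[i] for row in rows) / count for i in range(1, 5))


def matches_reported(means, reported, tolerance=0.005 + 1e-9):
    """True if every mean agrees with the reported value to two decimals."""
    return all(abs(m - r) <= tolerance for m, r in zip(means, reported))


means = column_means(ROWS)
consistent = matches_reported(means, REPORTED_AVERAGE)
```

Under the arithmetic-mean assumption, all four reported averages agree with the table rows to within rounding.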
Experimental Results: Performance
Design environment: Altera QuartusII, Stratix EP1S40
MCAS vs. conventional flow:
• 36% reduction in clock period and 30% in total latency
MCAS-Pipe vs. MCAS:
• Comparable design performance (4% better)
[Figure: two bar charts comparing Conventional, MCAS and MCAS-Pipe on PR, WANG, LEE, MCM, HONDA, DIR and Average. Panels: clock period (ns), axis 0 to 12; total latency (ns), axis 0 to 600]
Interconnect Structure of Altera’s Stratix
[Figure: Stratix routing resources. Local: LL, LO. Short global: H4, V4, H8, V8. Long global: V16, H24]
Experimental Results: Wirelength
Wire types
• LL, LO: local wires
• H4, V4, H8, V8: short global wires
• V16, H24: long global wires
MCAS-Pipe vs. MCAS:
• 28.8% reduction in long global wires, 19.3% reduction in total wirelength
[Figure: bar chart for PR, WANG, LEE, MCM, HONDA, DIR and Average with series LL+LO, H4+V4, H8+V8, V16+H24 and Total; axis 0 to 1.4]
Conclusions
High-level automatic on-chip interconnect pipelining
RDR-Pipe: extension of the RDR micro-architecture
• Micro-architecture supporting interconnect pipelining
MCAS-Pipe: enhancement of the MCAS synthesis system
• Adds a novel global interconnect sharing algorithm to effectively reduce the global wiring
Experimental results
• Matches or exceeds the RDR-based approach (MCAS) in performance
• Reduces wiring demand: 28.8% fewer long global wires and 19.3% less total wirelength than MCAS
Thank you