Algorithms for Simultaneous Consideration of Multiple Physical Synthesis Transforms for Timing Closure

Huan Ren and Shantanu Dutt

Dept. of Electrical and Computer Engineering, University of Illinois Chicago

Outline

• Problem formulation & prior work
• Network flow model
• Methodology flow
• Discretization requirements
• Structures for accurate objective function cost
• Simultaneous detailed placement: a holistic approach!
• Experimental results
• Conclusions

Problem Statement

Simultaneously apply a given set T of synthesis and replacement transforms to cells and nets on critical paths of an initially placed circuit to improve circuit delay near-optimally while satisfying area constraints.

For the current experiments, T = {cell resizing, replication, replacement, type-1 & type-2 buffer insertions}

Critical paths (CP) = paths with delay > (1-α) fraction of circuit delay. We choose α=0.1.
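The critical-path definition above can be sketched in a few lines. This is only an illustration: the function name `near_critical_paths` and the path delays are hypothetical, and circuit delay is assumed to be the delay of the slowest path.

```python
def near_critical_paths(path_delays, alpha=0.1):
    """Return names of paths whose delay exceeds (1 - alpha) * circuit delay."""
    circuit_delay = max(path_delays.values())  # assumed: delay of the slowest path
    threshold = (1 - alpha) * circuit_delay
    return sorted(p for p, d in path_delays.items() if d > threshold)


# Hypothetical path delays (arbitrary time units).
delays = {"p1": 10.0, "p2": 9.5, "p3": 8.0, "p4": 8.9}
critical = near_critical_paths(delays)
```

With α = 0.1 the threshold here is 90% of the largest delay, so only `p1` and `p2` are treated as critical.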

Timing objective function [Dutt et al., ICCAD’06]

$$F_t = \sum_{n_i \in CP} \; \sum_{u_j \in CS(n_i)} D(u_j, n_i) / S_a(n_i)$$

• CS(n_i): critical sinks of n_i in CP
• D(u_j, n_i): delay of n_i at sink u_j
• S_a(n_i): allocated slack of n_i, which is the path slack of the most critical path through the net divided by the number of nets in the path
• A magnifying parameter of F_t (its symbol is not recoverable here) allows exponential magnification of the timing function for critical nets, in order to approximate minimizing the maximum net timing function, i.e., minimizing delays in CP
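As a rough numeric illustration of the quantities defined for the timing objective, the sketch below computes an allocated slack and sums the ratio D(u_j, n_i) / S_a(n_i) over the critical sinks of each net. It assumes only that plain ratio, with no magnifying exponent, and the names `allocated_slack`, `timing_cost` and the data layout are hypothetical.

```python
def allocated_slack(path_slack, nets_on_path):
    """Slack of the most critical path through a net, shared equally by its nets."""
    return path_slack / nets_on_path


def timing_cost(nets):
    """Sum D(u_j, n_i) / S_a(n_i) over all critical sinks of all nets in CP."""
    total = 0.0
    for net in nets:
        for delay in net["sink_delays"]:
            total += delay / net["alloc_slack"]
    return total


# Hypothetical net: two critical sinks, most critical path has slack 1.0 over 4 nets.
nets = [{"sink_delays": [2.0, 4.0], "alloc_slack": allocated_slack(1.0, 4)}]
cost = timing_cost(nets)
```

A net with small allocated slack contributes a large term, which is what steers the optimization toward the most critical nets.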

Post-placement Incremental Physical Synthesis

Why necessary?
• Wire load estimation is very inaccurate prior to placement
• This leaves large room for improvement

Various transforms

Cell sizing: effective for improving timing
• Continuous sizing [Fishburn et al., ICCAD’85] and discrete sizing [Hu et al., DAC’07], [Ren et al., IWLS’08]
• Options: different cell sizes available in the library (s options for s sizes)

Incremental global placement
• Re-place a subset of cells targeting the metric of interest for design closure [Dutt et al., ICCAD’06], [Wonjoon et al., ICCAD’03]
• Transform options:
  - Remain in the position in the initial placement
  - Move to the new position determined in an incremental global placement process

Various Transforms (continued)

Buffer insertion
• Usually associated with routing tree generation
• Can be estimated after placement using two different types of buffers [Jiang et al., TVLSI’98]
• Transform options for each buffer type:
  - Do not insert any buffer
  - Insert a buffer in one of the sizes available in the library (s options for s sizes)

[Figure: a driving buffer (type 1) and an isolating buffer (type 2) on a net with critical and non-critical sinks]

Various Transforms (continued)

Cell replication
• Can both improve drive capability and isolate sinks
• Need to partition sinks between the two drivers [Srivastava et al., TVLSI’04], [Lillis et al., ISCAS’96]
• Transform options:
  - Do not replicate a cell
  - Replicate a driver cell with one of several possible partitions of the sink cells among the two replicas (k options for k partitions)

Combining Multiple Synthesis Transforms: Past Work

• Usually timing-driven
• Most methods simply apply the transforms sequentially
• [Donath et al., DATE’00]
  - Incorporates different synthesis transforms in different partition levels in a partition-based placement
  - Transforms are not unified
• [Jiang et al., TVLSI’98]
  - Considers both cell resizing and buffer insertion
  - Dynamic sequencing but greedy: each time, choose the transform with the largest ratio of delay improvement to area increase for a net/cell
  - Can be trapped in local optima
  - Hard to handle other transforms (e.g., incremental placement, which causes no area increase)
• Our method: simultaneous, with unified transforms

[Figure: coarse and detailed partition levels, labeled with TD placement adjustment, cell resizing, and replication and buffering]

Network Flow Model

An example: a simple transform selection graph (TSG) for one net

• Nodes: transform options for each net (and its cells)
• Arcs: those in complete bipartite graphs between the transform option sets for a net, so all combinations are available as flow paths
• Flow has a binary meaning: flow through a node means the option for that node is selected
• Flow also has a quantitative meaning: in constraint satisfaction problems, flow amount = constraint metric value = (in our case) sizes of selected options
• Flow cost is equal to the timing objective function value with the selected options
• Timing-optimal transform options = the min-cost flow

[Figure: TSG for one net, with option node sets such as Ores(u) joined by complete bipartite graphs; one arc is labeled with the TD function value for its choice of options 1 (res) and 2 (b1)]
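To see why complete bipartite graphs between option sets make every transform combination a flow path, the toy sketch below enumerates all paths through two option sets and picks the cheapest. This is a brute-force stand-in for intuition only, not the min-cost flow method used in the talk; the option names and cost values are hypothetical.

```python
from itertools import product


def best_combination(option_sets, cost):
    """Enumerate every path through the option sets and return the cheapest one."""
    return min(product(*option_sets), key=cost)


# Hypothetical option sets for one net: a resizing choice, then a type-1 buffer choice.
resize_options = ["res_small", "res_large"]
buffer_options = ["no_buf", "buf_x1"]

# Hypothetical timing objective value for each combination (one value per path).
toy_cost = {
    ("res_small", "no_buf"): 9.0,
    ("res_small", "buf_x1"): 7.5,
    ("res_large", "no_buf"): 6.0,
    ("res_large", "buf_x1"): 6.5,
}

paths = list(product(resize_options, buffer_options))
best = best_combination([resize_options, buffer_options], toy_cost.get)
```

Enumeration grows exponentially with the number of transforms, which is the motivation for posing the selection as a flow problem instead.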

Overall Model

Mini-TSG is constructed for each net in CP (net structures)

If two nets have common cells, their net structures are connected by a spanning structure.

Spanning structures

Flows indicating selected cell sizes and positions are sent to the DPG to perform detailed placement

Detailed placement “cost” is also considered when selecting options to reach an overall near-optimal solution

Methodology Flow

1) Determine the set CP of near-critical paths = {paths with delay >= (1-α)(critical path delay)}
2) Determine transform options from transform set T for every net in CP (from the library or using known algorithms, e.g., for replication)
3) Construct the transform selection graph (TSG) and couple it with the detailed placement graph (DPG) [Dutt et al., ICCAD’06]
4) Determine F- (objective) and C- (discretization) costs for arcs in the TSG
5) Determine the min-cost flow through TSG + DPG using the “concave-cost” min-cost method of [Kim & Pardalos, OR Letters, ’99]
6) Determine transforms across all cells and nets in CP and their legalized detailed placement from the above flow

Discretization Requirements in the Network Flow Model

• Mutually exclusive arcs (MEAs) for the output arc and/or input arc sets of some nodes: at most one arc in an MEA set can have flow through it
• Hyperarc flow: hyperarcs may be needed in some problems to model k-way dependencies (k > 2). For example, they are needed in our physical synthesis problem to accurately reflect the change in objective metric value caused by flow through the nodes in a hyperarc.

[Figure: MEA sets at option nodes such as Ores(u) and Ob1, and a 4-ary hyperarc modeled as a star graph with only 2 states, no flow and all flow; example flows are marked valid and invalid]

Net Structure and F-cost

First attempt: A linear structure

Product-term-based arc cost
• The order of a product term in the timing objective function is the number of transforms the term is a function of
• E.g., objective function (linear delay model): d(u,v) + d(u,w) = 2cRdL(ni) + 2RdCv + 2RdCw
• Rd(Ores(u), Ob1) · Cv(Ores(v), Orep(u), Orep(v)) has order 5
• Each flow path is a transform combination
• Set {paths} = Set {transform combos}

[Figure: linear structure from a distribution node to a gathering node through option node sets Ores(u), Ob1, Ores(v), Ores(w), Orep(u), Orep(v), for the delays d(u, v) and d(u, w) of driver u, sinks v and w, and replicas u' and v']
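The expansion d(u,v) + d(u,w) = 2cRdL(ni) + 2RdCv + 2RdCw is consistent with a linear delay model in which each sink delay is the driver resistance times the total load (wire capacitance c·L plus both sink capacitances). The sketch below checks that reading numerically; the per-sink formula is an assumption inferred from the expansion, and all values are hypothetical.

```python
def sink_delay(r_d, c, length, sink_caps):
    """Assumed linear model: driver resistance times total net load."""
    return r_d * (c * length + sum(sink_caps))


def expanded_total(r_d, c, length, c_v, c_w):
    """The expanded objective 2*c*Rd*L + 2*Rd*Cv + 2*Rd*Cw."""
    return 2 * c * r_d * length + 2 * r_d * c_v + 2 * r_d * c_w


# Hypothetical values: driver resistance, unit wire cap, wire length, sink caps.
r_d, c, length, c_v, c_w = 2.0, 0.5, 4.0, 1.0, 3.0

d_uv = sink_delay(r_d, c, length, [c_v, c_w])
d_uw = sink_delay(r_d, c, length, [c_v, c_w])
total = d_uv + d_uw
```

Each product such as Rd·Cv then depends on every transform that changes Rd or Cv, which is where terms of order greater than 2 come from.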

Linear Structure—Issues in Objective Function Cost

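Drawbacks of the linear structure
• Cannot handle terms with order > 2
• Cannot handle terms that depend on two “non-adjacent” transforms

[Figure: linear structure from a supply node through option nodes Ox, Oy, Oz to a gathering node. Arc costs of the form T(Ox, Oy) fit between adjacent option sets, but terms such as T(Ox, Oy, Oz) and T(Ox, Oz) have no bipartite graph to carry them]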

Hyperarcs: Accurate Objective Function Cost

Product-term-based arc cost
• Order of a product term in the timing objective function: the number of transforms the term is a function of
• Ex: simple linear delay model: d(u,v) + d(u,w) = 2cRdL(ni) + 2RdCv + 2RdCw
• Rd(Ores(u), Ob1) · Cv(Ores(v), Orep(v), Orep(u)) has order 5
• Assuming 2 options per transform and order = n, there are 2^n hyperarcs; there are m^n hyperarcs if there are m options per transform

[Figure: meta-hyperarc H for the above order-5 term of d(u, v), spanning option node sets including Ores(u), Ores(v), Orep(v), Orep(u), with its “combination” hyperarcs; d(u, w) is handled similarly]
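The hyperarc count follows from taking one option per transform in every possible way. A minimal sketch, with hypothetical option labels, that enumerates the combinations for an order-n term:

```python
from itertools import product


def hyperarc_combinations(option_sets):
    """One hyperarc per way of choosing one option from each transform."""
    return list(product(*option_sets))


# Order-5 term with 2 hypothetical options per transform: 2**5 combinations.
order5 = hyperarc_combinations([["a", "b"]] * 5)

# Order-2 term with m = 3 options per transform: 3**2 combinations.
order2 = hyperarc_combinations([[1, 2, 3]] * 2)
```

The exponential growth in the order n is why a compact structure for representing hyperarcs matters.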

Hyperarcs: Star Graph Structure

• Arcs in a network flow graph can only be between two nodes
• Parallel arcs are placed between the central transform and the parallel transform
• Each parallel arc, together with the arcs to the regular transform option nodes it represents, corresponds to one hyperarc

[Figure: star graph for a hyperarc representing an order-3 cost term value T(Ox, Oy, Oz). Option node i of the central transform Ox and option node j of the parallel transform Oy are joined by m parallel arcs with costs T(Ox^i, Oy^j, Oz^1) … T(Ox^i, Oy^j, Oz^m), one for each of the m options of the regular transform Oz]

[Figure: meta star graph, in which the parallel arc sets between the multiple option nodes of Ox and Oy are drawn as a meta arc standing for multiple arcs; flow f is valid and flow f’ is invalid]

MEA Satisfaction via Arc C-costs

Besides the objective-function-based cost (F-cost), an objective-function-independent C-cost is added.

Total arc cost = F-cost + C-cost (the cost is a step function, incurred once for any flow amount)

Theorem: A min-cost flow with C-costs on MEA arcs ensures MEA satisfaction.

Setting the C-cost:
1) Heuristically or randomly select a valid flow and determine its cost C1
2) Obtain a standard min-cost flow of cost C2 without discretization constraints
3) Let CΔ = C1 – C2
4) Set MEA arc cost = CΔ + 1

Comparing a valid flow with the min-cost invalid flow:
• F-cost diff >= -CΔ
• C-cost diff >= CΔ + 1
• Total diff >= 1

[Figure: MEA sets with arc C-costs CΔ + 1, and bars comparing the F-cost and F+C-cost of the valid flow and of the min-cost invalid flow]
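The C-cost recipe can be checked on toy numbers. The sketch below assumes a single MEA set in which a valid flow uses exactly one arc and an invalid flow uses two, so the invalid flow pays the step C-cost twice; the function names and cost values are hypothetical.

```python
def mea_arc_c_cost(valid_flow_cost, relaxed_min_cost):
    """C-cost per MEA arc: C_delta + 1, with C_delta = C1 - C2."""
    c_delta = valid_flow_cost - relaxed_min_cost
    return c_delta + 1


def total_cost(f_cost, mea_arcs_with_flow, c_cost):
    """F-cost plus a step C-cost, paid once per MEA arc that carries any flow."""
    return f_cost + c_cost * mea_arcs_with_flow


# Hypothetical costs: C1 = 50 for some valid flow, C2 = 42 for the relaxed optimum.
c_cost = mea_arc_c_cost(valid_flow_cost=50, relaxed_min_cost=42)

valid_total = total_cost(f_cost=50, mea_arcs_with_flow=1, c_cost=c_cost)
invalid_total = total_cost(f_cost=42, mea_arcs_with_flow=2, c_cost=c_cost)
```

Even though the invalid flow has the lower F-cost, its extra C-cost makes it more expensive overall, so a min-cost solver prefers the valid flow.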

Hyperarc-Consistent Flows via Arc C-costs

Consistent hyperarc flow
• Idea: only the total capacity of a parallel arc and the arcs to its consistent regular option nodes can be equal to the incoming flow amount f
• How: use prime numbers
• For k total regular option nodes (across all regular transforms), select k prime numbers p1 < p2 < … < pk such that 1/p1 + … + 1/pk > (pk-1)/pk
• Capacity of non-parallel arcs: f(1/pj)
• Capacity of parallel arcs: f - (capacity of its consistent non-parallel arcs)
• C-cost is proportional to arc capacity: Cunit * cap(e), where Cunit = (CΔ + 1)/Δcapmin and Δcapmin = min{cap of invalid arc sets – f}

Theorem: A min-cost flow with C-costs on star graph arcs ensures hyperarc-consistent flows in star graphs.

[Figure: star graph with parallel arc capacities f(1-1/3) and f(1-1/5) and regular arc capacities f(1/3) and f(1/5); consistent arc sets have total capacity = f, inconsistent ones have total capacity < f or > f]
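The prime-based capacities can be checked with exact fractions. This sketch covers only the simplest case, in which each parallel arc is consistent with a single regular option node, using the primes 3 and 5 from the capacity labels f(1/3), f(1/5), f(1-1/3), f(1-1/5). The helper names are hypothetical.

```python
from fractions import Fraction


def regular_cap(f, p):
    """Capacity of the arc to a regular option node assigned prime p."""
    return Fraction(f) / p


def parallel_cap(f, p):
    """Capacity of the parallel arc consistent with the option assigned prime p."""
    return Fraction(f) - regular_cap(f, p)


f = 15  # incoming flow amount, chosen so that all capacities are integers

consistent = parallel_cap(f, 3) + regular_cap(f, 3)  # matching primes
too_small = parallel_cap(f, 3) + regular_cap(f, 5)   # mismatched: total below f
too_large = parallel_cap(f, 5) + regular_cap(f, 3)   # mismatched: total above f
```

Only the matching pair has total capacity exactly f; a mismatched pair either cannot carry the flow or leaves excess capacity, which the capacity-proportional C-cost then penalizes.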

Discrete Arc Cost

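• Standard linear flow cost: cost grows with flow up to Cap(e), with slope = cost(e)/cap(e)
• Step function cost (concave): Cost(e) is incurred in full for any nonzero flow up to Cap(e)
• Min-cost flow with such costs is a well-studied NP-hard problem [Kim et al., ORL’99]; we use their min-cost algorithm
• Total arc cost = F-cost + C-cost (incurred once for any amount of flow), so arc cost is discrete

[Figure: arc cost versus flow for the standard linear cost and for the step function cost]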
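The difference between the two arc cost models is easy to state in code. A minimal sketch with hypothetical function names and values:

```python
def linear_cost(flow, cost, cap):
    """Standard model: cost grows with flow at slope cost/cap."""
    return cost * flow / cap


def step_cost(flow, cost):
    """Discrete model: the full arc cost is paid once for any nonzero flow."""
    return cost if flow > 0 else 0


# Hypothetical arc with cost 10 and capacity 4.
linear_values = [linear_cost(x, cost=10, cap=4) for x in (0, 1, 4)]
step_values = [step_cost(x, cost=10) for x in (0, 1, 4)]
```

Under the step model a unit of flow costs as much as a full arc, so the cost depends on which arcs are used rather than how much flow they carry, which is what lets arc costs encode option selection.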

Multiple Cost Terms: Intersecting Hyperarcs & Overlapping Star Graphs

• There is one star graph structure for each term in the objective function
• Option nodes for common transforms between different terms are combined
• Example: consider three transforms: gate sizing (res), replication (rep) and isolating buffer (b2)
• Affected parameters for ni: driver R: Rd(Ores(u)); WL: Li(Orep(u), Ob2); sink C: Cv(Ores(v), Orep(v), Orep(u)), Cw(Ores(w), Orep(w), Orep(u))
• Order > 2 terms: 2RdCv (order 4), 2c · RdLi (order 3), 2RdCw (order 4)
• MEA constraint ensures consistent option selection for common transforms in different star graphs

[Figure: sub-TSG for net ni between a distribution node and a gathering node, with overlapping star graphs over option nodes Ores(u), Orep(u), Ob2, Ores(v), Orep(v), Ores(w), Orep(w); a meta arc carries the 2c · RdLi term]
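The orders quoted for the three terms follow from counting the distinct transforms that each product depends on. A small sketch of that count, with the dependency sets transcribed from the parameter list above and a hypothetical helper name:

```python
def term_order(*parameter_deps):
    """Order of a product term: number of distinct transforms it depends on."""
    return len(set().union(*parameter_deps))


# Transform dependencies of each parameter of net ni.
R_d = {"res(u)"}
L_i = {"rep(u)", "b2"}
C_v = {"res(v)", "rep(v)", "rep(u)"}
C_w = {"res(w)", "rep(w)", "rep(u)"}

orders = {
    "2RdCv": term_order(R_d, C_v),
    "2cRdLi": term_order(R_d, L_i),
    "2RdCw": term_order(R_d, C_w),
}
```

The replication transform of u appears in all three terms, so its option nodes are shared by the three star graphs.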

Background: Incremental Detailed Placement [Dutt et al., ICCAD’06]

[Figure: rows of placed cells (C11 to C14, C21 to C24, C31 to C33) with the cells to be legalized marked, and the corresponding flow graph from the source through cell nodes to white spaces such as W1 and W2]

• Flow amount: cell movement
• Arcs: possible movement directions
• Arc cost: deterioration of the objective metric caused by the corresponding movement
• Cells to be legalized are connected to the source
• White spaces are connected to the sink
• Flows from the source to the sink perform cell legalization via white spaces

Simultaneous Detailed Placement & Area Constraint Satisfaction

• Directly send branch flows to the detailed placement network flow graph (DPG) to perform simultaneous detailed placement
• Flow is sent from the replacement option node of a cell to the corresponding position in the DPG
• Flow amount means the selected size of the cell
• Coupling between the flow and the size option nodes is needed: shunting structure

[Figure: shunting structure for cell u. Placement option nodes Opl(u) for positions i and j of u send flow to the DPG, and a shunting arc couples them with the size option node Ores(u); arc labels include Amax - Aj(u), Aj(u), (Amax, 0) and (Aj(u), 0)]

Experimental Results—Benchmarks

Three benchmark sets: TD-Dragon [Yang et al., ICCAD’02], ISCAS’85, TD-IBM

Available options
• For cell sizing and type-1, type-2 buffers: 4 options for TD-Dragon and ISCAS’85, and 5 options for TD-IBM
• For replication: 4 options: 3 replication options with different partitions of sink cells and a no-replication option
• For replacement: 2 options: a timing-driven position of each cell is calculated using the method in [Dutt et al., ICCAD’06]; a cell can either stay at its original position or be moved to its timing-driven position

3% extra white space is added to the initial circuits in TD-IBM, and 10% extra white space is added to the circuits in ISCAS’85 and TD-Dragon

Sequential Application of Transforms

We compare our results to the sequential application of transforms.

The order of transform application matters in sequential application. We tested three different orders:
1) Decreasing order of ΔT/ΔA (ΔT = 25.92%): replacement, isolating buffer, cell resizing, drive buffer, replication
2) Decreasing order of ΔT (ΔT = 18.11%): replacement, cell resizing, isolating buffer, replication, drive buffer
3) Increasing order of ΔA (ΔT = 22.64%): replacement, isolating buffer, drive buffer, cell resizing, replication

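Experimental Results

[Figure: bar chart comparing Ours and Seq on the TD-IBM benchmarks td-ibm01, td-ibm02, td-ibm06, td-ibm9, td-ibm14, td-ibm17, td-ibm18; legible data labels: 25.9, 8.9; annotated “relatively better”]

[Figure: bar chart comparing Ours and Seq on the ISCAS’85 benchmarks C499, C880, C3540, C5315, C7552 and Avg.; legible data label: 7.9; annotated “relatively better”]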

Experimental Results

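[Figure: bar chart of % timing improvement on the TD-Dragon benchmarks Matrix, VP2, MAC32, MAC64 and Avg.; legible data labels: 8.8, 6.3; annotated “relatively better”]

[Figure: two run time plots with linear fits, y = 4x - 924 against # of cells in CP (axis 0 to 2000) and y = 0.026x + 710 against # of cells (k) (axis 0 to 200)]

• Our run time is about 1.5 times that of the sequential approach
• Run time increases linearly w.r.t. the number of cells on CP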

Conclusions

• A general discretized network flow based approach to TD post-placement multiple physical synthesis; can handle most transforms in a unified manner
• Considers transform applications simultaneously
• Obtains high-quality solutions; is not trapped in local optima
• Performs simultaneous detailed placement (DP) so that DP cost is considered when selecting transform options
• Reasonable run time, good scalability and high-quality solutions
• Demonstrates the power of using continuous optimization with well-structured discretizations
• Applicable to other constrained optimization problems (e.g., power optimization with area and timing constraints)
• Future work: (a) application to mixed-cell designs; (b) considering global re-routing as a transform for signal integrity

Thank you
