Parallelization by SimPLification: A Case Study in VLSI Placement

PAPA2011, University of Michigan


Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov
Dept. of EECS, University of Michigan

Complexities of Parallel Algorithms & SW
1. Objectives of parallelization
A. Improve completion time by using multiple cores in parallel
B. Improve throughput by using stream processing (latency may increase and become less predictable)
C. Improve power consumption (by decreasing clock rate)
2. Not an objective (a pitfall)
− Come up with a slow algorithm that is easy to parallelize
■ In this talk: how to accomplish 1.A without 2
− Take a leading algorithm and speed up its bottlenecks
− Design a new algorithm that is (a) better, (b) easy to parallelize

CAD Algorithms
■ Sequence of optimizations
− Subject to Amdahl’s law
− The more stages, the harder to parallelize effectively
■ Additional complications
− Elaborate data structures may entail overhead for parallel access
− When processing is light, memory bandwidth may become a bottleneck (with 4+ threads)
■ Recommendations
− A simpler algorithm is often easier to parallelize (fewer stages, simpler data structures)
− Using standard solvers, e.g., for linear algebra, helps reuse previous work on parallelization
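As a rough illustration of why a multi-stage flow limits the achievable speedup, Amdahl's law can be evaluated directly. The sketch below is not from the talk: the function name and the 60% parallel fraction are hypothetical values chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Overall speedup when only `parallel_fraction` of the runtime scales with threads."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


# Hypothetical profile: 60% of the runtime parallelizes perfectly, 40% stays serial.
for n in (1, 2, 4, 8):
    print(n, "threads ->", round(amdahl_speedup(0.6, n), 2), "x")
```

Even with perfect scaling of the parallel part, the serial stages bound the speedup at 1 / serial_fraction, which is why fewer serial stages matter.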

Global Placement: Motivation
■ Interconnect lagging in performance while transistors continue scaling
− Circuit delay, power dissipation and area dominated by interconnect
− Routing quality highly controlled by placement
■ Circuit size and complexity rapidly increasing
− Scalable placement algorithm is critical
− Simplicity, integration with other optimizations
[Figure labels: unloaded, coupling, IR drop, RC delay]

Goals in Placement
■ Find good relative ordering of cells
− Minimize wire length and congestion
− Maximize timing slack
■ Find good spacing of cells
− Eliminate wiring congestion problems
− Provide space for post-placement stages
– clock trees
– buffer insertion
– timing correction
■ Find good global position

Optimize Relative Order
To spread ... or not to spread
Place to the left ... or to the right
Without whitespace, placement is dominated by ordering

Example of Global Placement (APlace 2.04 from UCSD)

Example of Global Placement (mFar from UCSB)

Placement Formulation
■ Objective: Minimize estimated wirelength
− Half-perimeter wirelength (HPWL)
− (max X – min X) + (max Y – min Y)
■ Subject to constraints:
− Legality: Row-based placement with no overlaps
− Routability: Limiting local interconnect congestion for successful routing
− Timing: Meeting performance target of a design
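The HPWL objective above can be sketched in a few lines. This is a minimal illustration, assuming each net is a list of cell names and each cell is a point (pin offsets are ignored); the names `hpwl`, `nets` and `positions` are hypothetical.

```python
def hpwl(nets, positions):
    """Sum over nets of (max X - min X) + (max Y - min Y) of the net's cells."""
    total = 0.0
    for net in nets:
        xs = [positions[cell][0] for cell in net]
        ys = [positions[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Toy instance: one 3-pin net and one 2-pin net.
positions = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 1.0)}
nets = [["a", "b", "c"], ["a", "c"]]
print(hpwl(nets, positions))
```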

Quadratic Placement
■ Consider a graph first, not a hypergraph
■ Minimize $\sum_{e_{ij}} \left[(x_i - x_j)^2 + (y_i - y_j)^2\right]$ (the sum is over edges $e_{ij}$)
− Seems unrelated to $\sum_{e_{ij}} \left(|x_i - x_j| + |y_i - y_j|\right)$ but can still be separated into x- and y-components
■ Physical analogy: Hooke’s law
− Consider an elastic spring, stretched by x
− Force $F = -kx$ (k is the spring constant)
− Energy $E = \tfrac{1}{2} k x^2$
− Our goal: minimize the energy of the system
A system of springs will only settle in a minimum
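To make the spring analogy concrete, here is a small one-dimensional sketch. It is an illustration only, not the solver used in the talk: it relaxes each movable cell to the weighted average of its neighbors (Gauss-Seidel sweeps), which is the equilibrium condition of the quadratic objective. The netlist, pad locations and function name are hypothetical.

```python
def relax_springs(edges, fixed, movable, sweeps=200):
    """Minimize sum of w * (x_i - x_j)^2 over edges by repeated local averaging.

    edges:   list of (node_i, node_j, weight)
    fixed:   {node: x} for pads that do not move
    movable: list of nodes whose x is optimized
    """
    x = {node: 0.0 for node in movable}
    x.update(fixed)
    neighbors = {node: [] for node in movable}
    for i, j, w in edges:
        if i in neighbors:
            neighbors[i].append((j, w))
        if j in neighbors:
            neighbors[j].append((i, w))
    for _ in range(sweeps):
        for node in movable:
            total_w = sum(w for _, w in neighbors[node])
            # Zero net force: the cell sits at the weighted mean of its neighbors.
            x[node] = sum(w * x[n] for n, w in neighbors[node]) / total_w
    return x


# Chain pad0 - c1 - c2 - c3 - pad4 with unit springs and pads at x=0 and x=4.
edges = [("pad0", "c1", 1.0), ("c1", "c2", 1.0), ("c2", "c3", 1.0), ("c3", "pad4", 1.0)]
x = relax_springs(edges, fixed={"pad0": 0.0, "pad4": 4.0}, movable=["c1", "c2", "c3"])
print({k: round(v, 6) for k, v in x.items()})
```

The movable cells settle evenly between the pads. The y-coordinates would be handled by an identical, independent pass, which is the separability noted above.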

Iterative Optimization

Prior Work
■ Ideal placer
− Low runtime without sacrificing solution quality
− Simplicity, integration with other optimizations
[Figure: placers positioned by solution quality. Groups: non-convex optimization (mPL6, APlace2, NTUPlace3); quadratic and force-directed (mFAR, Kraftwerk2, FastPlace3); ideal placer.]

Key features of SimPL
■ Flat quadratic placement
■ Primal-dual optimization
− Closing the gap between upper and lower bounds
[Figure labels: Initial WL Opt.; Iteration; Lower-Bound Solution by Linear System Solver; Upper-Bound Solution by Look-ahead Legalization; Final Solution; Final Legal Solution]

Common Analytical Placement Flow
Placement Instance → Initial WL Optimization → Global Placement (repeat until converged) → Legalization and Detailed Placement

SimPL Flow
Placement Instance → Initial WL Optimization → Global Placement → Legalization and Detailed Placement
■ Initial WL Optimization: B2B Graph Building → Linear System Solver, repeated until WL converges
■ Global Placement: Look-ahead Legalization (Upper-Bound) → Pseudonet Insertion → B2B Graph Building → Linear System Solver (Lower-Bound), repeated until converged
■ B2B net model [P. Spindler, et al., “Kraftwerk2 - A Fast Force-Directed Quadratic Placement Approach Using an Accurate Net Model,” TCAD 2008]
■ We delegate final legalization and detailed placement to FastPlace-DP [M. Pan, et al., “An Efficient and Effective Detailed Placement Algorithm,” ICCAD 2005]

SimPL: Look-ahead Legalization
■ Purpose: Produces almost-legal placement (Upper-Bound) while preserving the relative cell ordering given by linear system solver (Lower-Bound)
■ Identify target region
− Find overflow bin b
− Create a minimal bin cluster B around b that is wide enough
■ Perform geometric top-down partitioning
− Find cell-area median (Cc) and whitespace median (CB)
− Assign cells (Cc) to corresponding partitions (CB)
■ Non-linear scaling
− Form stripe regions
− Move cells across stripe regions in-order based on whitespace
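The partitioning step can be sketched in one dimension. This is a simplified illustration under stated assumptions, not the talk's implementation: there are no obstacles, so whitespace is uniform; cells are split at the cell-area median; and the region is cut so each side gets capacity proportional to the cell area assigned to it, which is the midpoint when the area split is exact. It recurses down to single cells, whereas the talk stops once regions are small enough and then applies non-linear scaling. The function name and cell tuples are hypothetical.

```python
def spread_in_order(cells, lo, hi):
    """Spread cells over [lo, hi] while preserving their left-to-right order.

    cells: list of (name, x, width) with positive widths.
    Returns {name: new_x_center}.
    """
    if not cells:
        return {}
    if len(cells) == 1:
        return {cells[0][0]: (lo + hi) / 2.0}
    cells = sorted(cells, key=lambda c: c[1])       # keep the solver's ordering
    total = sum(w for _, _, w in cells)
    acc, split = 0.0, 1
    for k in range(1, len(cells)):                   # locate the cell-area median
        acc += cells[k - 1][2]
        split = k
        if acc >= total / 2.0:
            break
    left, right = cells[:split], cells[split:]
    left_area = sum(w for _, _, w in left)
    cut = lo + (hi - lo) * left_area / total         # matching cut of the region
    placed = spread_in_order(left, lo, cut)
    placed.update(spread_in_order(right, cut, hi))
    return placed


# Four unit-width cells clumped near x=5 are spread over the region [0, 8].
clump = [("a", 5.0, 1.0), ("b", 5.1, 1.0), ("c", 5.2, 1.0), ("d", 5.3, 1.0)]
print(spread_in_order(clump, 0.0, 8.0))
```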

SimPL: Look-ahead Legalization (1)
Performing geometric top-down partitioning
[Figure labels: overfilled bin; bin cluster (B); cell-area median (Cc); whitespace median (CB)]
[Figure labels: obstacle borders; uniform cutlines; cell ordering; per-stripe linear scaling]

SimPL: Look-ahead Legalization (3)
■ Example (adaptec1)
Look-ahead legalization stops when target regions become small enough

SimPL: Using legal locations as anchors
■ Purpose: Gradually perturb the linear system to generate lower-bound solutions with less overlap
■ Anchors and Pseudonets
− Look-ahead locations used as fixed, zero-area anchors
− Anchors and original cells connected with 2-pin pseudonets
− Pseudonet weights grow linearly with iterations
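The effect of a growing pseudonet weight can be seen on a single cell. The sketch below is a toy under stated assumptions: one movable cell is tied to one fixed pad by a unit-weight net and to its anchor by a pseudonet whose weight grows as `step * k` at iteration k. The `step` value and function names are hypothetical and not taken from the talk.

```python
def solve_with_anchor(pad_x, pad_w, anchor_x, anchor_w):
    """Minimizer of pad_w*(x - pad_x)^2 + anchor_w*(x - anchor_x)^2.

    In a full linear system A x = b, a pseudonet of weight w to anchor location a
    would add w to the cell's diagonal entry of A and w*a to its entry of b.
    """
    return (pad_w * pad_x + anchor_w * anchor_x) / (pad_w + anchor_w)


def anchor_sweep(pad_x, anchor_x, iterations, step=0.5):
    """Cell position per iteration as the pseudonet weight grows linearly."""
    return [solve_with_anchor(pad_x, 1.0, anchor_x, step * k)
            for k in range(1, iterations + 1)]


# The pad pulls toward x=0 (low wirelength); the anchor pulls toward x=10 (legalized spot).
for k, x in enumerate(anchor_sweep(0.0, 10.0, 5), start=1):
    print(k, round(x, 3))
```

Each iteration moves the lower-bound position closer to the legalized location without ever snapping to it.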

Next illustration: Tug-of-war between low-wirelength and legalized placements

SimPL Iterations on Adaptec1 (1)
Iteration=0 (Init WL Opt.), Iteration=1 (Upper Bound), Iteration=2 (Lower Bound), Iteration=3 (Upper Bound)

SimPL Iterations on Adaptec1 (2)
Iteration=10 (Lower Bound), Iteration=11 (Upper Bound)

SimPL Iterations on Adaptec1 (3)
Iteration=30 (Lower Bound), Iteration=31 (Upper Bound)

Convergence of SimPL
■ Legal solution is formed between two bounds

Empirical Results: ISPD05 Benchmarks
■ Experimental setup
− Single-threaded runs on a 3.2GHz Intel Core i7 Quad CPU Q660 Linux workstation
− HPWL is computed by GSRC Bookshelf Evaluator
< 5000 lines of code in C++, including CG solver for sparse linear systems (with Jacobi preconditioner)
[Figure: runtime profile. Initial placement 8%; CG solver 31%; sparse matrix and B2B net modeling 8%; look-ahead legalization 14%; pseudo-net insertion 1%; post global placement]

Speeding Up Placement Using Parallelism
■ SimPL has very few components (5KLOC)
■ Each bottleneck is amenable to some form of parallelism
− Thread-level
− Instruction-level

Parallelism in Conjugate Gradient Solver
■ Coarse-grain row partitioning
− Implemented using OpenMP 3.0 compiler directives
■ SSE2 (Streaming SIMD Extensions) instructions
− Process 4 data elements with a single instruction
− Marginal runtime improvement in SpMxV (sparse matrix-vector multiplication)
■ Reducing memory bandwidth demand of SpMxV
− CSR (Compressed Sparse Row) format
Y. Saad, “Iterative Methods for Sparse Linear Systems,” SIAM 2003
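The row-partitioned SpMxV inside a Jacobi-preconditioned CG can be sketched as follows. This is an illustration of the decomposition only, with assumptions labeled: the talk's implementation is C++ with OpenMP and SSE2, while this sketch uses Python threads, which show how rows are divided but give no real speedup under the interpreter lock. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def spmv_rows(csr, x, row_begin, row_end):
    """Multiply rows [row_begin, row_end) of a CSR matrix by x."""
    values, col_idx, row_ptr = csr
    return [sum(values[k] * x[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(row_begin, row_end)]


def spmv(csr, x, pool, n_chunks):
    """Coarse-grain row partitioning: each chunk of rows is an independent task."""
    n = len(csr[2]) - 1
    bounds = [n * i // n_chunks for i in range(n_chunks + 1)]
    parts = pool.map(lambda b: spmv_rows(csr, x, b[0], b[1]), zip(bounds, bounds[1:]))
    return [v for part in parts for v in part]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def pcg_jacobi(csr, b, diag, pool, n_chunks=2, tol=1e-10, max_iter=100):
    """Conjugate gradient with a Jacobi (diagonal) preconditioner."""
    x = [0.0] * len(b)
    r = list(b)
    z = [ri / di for ri, di in zip(r, diag)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        ap = spmv(csr, p, pool, n_chunks)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Tridiagonal system of a 3-cell spring chain with pads at x=0 and x=4.
csr = ([2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0],   # values
       [0, 1, 0, 1, 2, 1, 2],                      # column indices
       [0, 2, 5, 7])                               # row pointers
with ThreadPoolExecutor(max_workers=2) as pool:
    print([round(v, 6) for v in pcg_jacobi(csr, [0.0, 0.0, 4.0], [2.0, 2.0, 2.0], pool)])
```

CSR stores only nonzeros plus one pointer per row, so each row chunk streams a contiguous slice of `values` and `col_idx`, which is what keeps memory traffic low.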

Parallelism in CG Solver - Example

Parallelism in B2B Model Update
■ B2B net model update
− B2B model is separable
− Can process the x and y cases in parallel
− Additionally, split the nets of the netlist into equal groups that can be processed by multiple threads
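The separability of the B2B model can be sketched as below. Assumptions are labeled: the edge weight is taken as 1 / ((P − 1) · |x_i − x_j|) with objective Σ w (x_i − x_j)², a scaling under which the quadratic cost of a net equals its HPWL along that axis; the cited Kraftwerk2 paper may scale weights differently. The `eps` guard for coincident pins and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def b2b_edges(coords, eps=1e-9):
    """Bound2Bound edges (i, j, weight) of one net along one axis."""
    p = len(coords)
    if p < 2:
        return []
    lo = min(range(p), key=coords.__getitem__)
    hi = max(range(p), key=coords.__getitem__)
    if lo == hi:                      # all pins coincide: pick any second pin
        hi = (lo + 1) % p

    def weight(i, j):
        return 1.0 / ((p - 1) * max(abs(coords[i] - coords[j]), eps))

    edges = [(lo, hi, weight(lo, hi))]
    for k in range(p):
        if k != lo and k != hi:       # inner pins connect to both boundary pins
            edges.append((k, lo, weight(k, lo)))
            edges.append((k, hi, weight(k, hi)))
    return edges


def quadratic_cost(coords, edges):
    return sum(w * (coords[i] - coords[j]) ** 2 for i, j, w in edges)


def build_b2b(nets):
    """nets: list of nets, each a list of (x, y) pins. x and y are built concurrently."""
    xs = [[pin[0] for pin in net] for net in nets]
    ys = [[pin[1] for pin in net] for net in nets]
    with ThreadPoolExecutor(max_workers=2) as pool:
        fx = pool.submit(lambda: [b2b_edges(c) for c in xs])
        fy = pool.submit(lambda: [b2b_edges(c) for c in ys])
        return fx.result(), fy.result()


nets = [[(0.0, 0.0), (1.0, 3.0), (4.0, 1.0), (2.5, 2.0)]]
edges_x, edges_y = build_b2b(nets)
print(len(edges_x[0]), len(edges_y[0]))
```

Because each net only reads pin coordinates and emits its own edges, the per-net loop could also be divided into equal groups of nets, one group per thread, as the last bullet describes.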

SSE optimization affects Runtime Profile
[Figure: runtime profiles. Before: CG solver 31%, modeling 8%, look-ahead legalization 14%, pseudo-net insertion 1%. After: CG solver 19%, modeling 10%, pseudo-net insertion 1%.]

Parallelism in Look-ahead Legalization (1)
■ Look-ahead legalization (LAL) started consuming a significant fraction of overall runtime
■ Top-down geometric partitioning and non-linear scaling (T&N) are amenable to parallelization
− Top-down partitioning generates an increasing number of subtasks of similar sizes which can be solved in parallel
− After each level of T&N on a bin cluster, each thread generates two sub-clusters with similar numbers of cells

Parallelism in Look-ahead Legalization (2)
■ LAL keeps a global queue of bin clusters Q
■ Static partitioning
− Assign initial bin clusters to available threads such that each thread starts with a similar number of bin clusters
■ Subtask updates
− Thread ti processes one of two sub-clusters (for the next level of T&N); the other is added to the global cluster queue Q
■ Dynamic task scheduling
− When thread ti is idle, it dynamically retrieves clusters from the global cluster queue Q. The number of clusters to be retrieved is N = max(Q.size()/N_threads, 1)
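The scheduling scheme above can be sketched with a shared queue and a lock. This is a minimal illustration under stated assumptions: a cluster is just a list of cells, `split_in_half` stands in for one level of T&N, and `min_size` is a hypothetical stopping threshold. The termination bookkeeping (`pending`) is one possible design and is not described in the talk.

```python
import threading
import time
from collections import deque


def split_in_half(cluster):
    """Stand-in for one level of T&N: two sub-clusters of similar size."""
    mid = len(cluster) // 2
    return cluster[:mid], cluster[mid:]


def process_clusters(initial_clusters, n_threads, min_size=2):
    queue = deque()                       # global cluster queue Q
    lock = threading.Lock()
    pending = [len(initial_clusters)]     # clusters created but not yet finished
    finished = []

    def worker(local):
        while True:
            while local:
                cluster = local.pop()
                while len(cluster) > min_size:
                    # Keep one sub-cluster, publish the other to Q.
                    cluster, other = split_in_half(cluster)
                    with lock:
                        queue.append(other)
                        pending[0] += 1
                with lock:
                    finished.append(cluster)
                    pending[0] -= 1
            with lock:
                if queue:
                    # Idle thread retrieves N = max(Q.size()/N_threads, 1) clusters.
                    n = max(len(queue) // n_threads, 1)
                    local.extend(queue.popleft() for _ in range(n))
                elif pending[0] == 0:
                    return                # nothing queued and nothing in flight
            if not local:
                time.sleep(0.0005)        # other threads may still publish work

    # Static partitioning: deal the initial clusters round-robin to the threads.
    threads = [threading.Thread(target=worker, args=(list(initial_clusters[t::n_threads]),))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished


leaves = process_clusters([list(range(16)), list(range(16, 24))], n_threads=3)
print(len(leaves), "leaf clusters")
```

Taking a share of Q proportional to 1/N_threads, rather than a single cluster, reduces how often idle threads contend for the lock while still leaving work for the others.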

Empirical Results – Overall Speed-ups
■ Experimental setup
− Multithreaded runs on an 8-core AMD-based system with four dual-core CPUs and 16 GB RAM
− Each CPU was an Opteron 880 processor running at 2.4GHz with 1024KB cache

Empirical Results – Component Speed-ups


Extending to Routability-driven Placement
■ Ongoing work: simultaneous place-and-route

Simultaneous Place-and-Route
■ After Look-Ahead Legalization (LAL), perform Look-Ahead Routing (LAR)
− Integrate an in-house router through a clean API
− Cell locations in, accurate congestion maps out
− The placer accounts for congestion in addition to density (slightly modified formulas, almost no extra work)
■ ISPD 2011 contest organized by IBM Research
− New, large benchmarks
− Placements evaluated by a common global router

SimPL → SimPLR
■ Key metric is #overflows (OF)
■ Also shown: routed WL (RtWL)

Conclusions
■ New flat quadratic placement algorithm: SimPL
− Novel primal-dual based approach
− Amenable to integration with physical synthesis
■ Self-contained, compact implementation
− Fastest among available academic placers
− Highly competitive solution quality
− Amenable to parallelism
− Easy to extend to simultaneous place-and-route

Questions and Answers

Thank you! Time for Questions
