Floorplanning by Annealing
on a Hypercube Architecture
Rajeev Jayaraman
Department of Electrical and Computer Engineering
Carnegie-Mellon University
Pittsburgh, PA 15213
A project report submitted in partial fulfillment of the requirements for the degree of
Master of Science in Computer Engineering
March, 1987
This research has been funded by the Semiconductor Research Corporation
To my parents.
Table of Contents

Acknowledgements
Abstract
1. Introduction
2. Background and Motivation
  2.1. The Floorplanning Task
  2.2. Floorplanning Methods
    2.2.1. Mincut Techniques
    2.2.2. Rectangular Dualization Techniques
    2.2.3. Simulated Annealing Techniques
  2.3. Optimization and Parallelism in Simulated Annealing
    2.3.1. Serial Optimization
    2.3.2. Parallelism and Parallel Simulated Annealing
    2.3.3. Shared-Memory Implementations
    2.3.4. Hypercube Implementations
  2.4. Motivation for Research
3. Serial Floorplanner
  3.1. Approach to Floorplanning
  3.2. Annealing Algorithm Implementation
    3.2.1. Move Set
    3.2.2. Objective Function
    3.2.3. Annealing Schedule
  3.3. Performance Evaluation of the Serial Algorithm
4. Parallel Floorplanning Algorithms
  4.1. Hypercube Architecture
  4.2. Uncertainty in Parallel Move Evaluation
  4.3. Partitioning Strategy 1: Static Parallel Algorithm
  4.4. Partitioning Strategy 2: Simple Pipeline Algorithm
  4.5. Partitioning Strategy 3: Modified Pipeline Algorithm
  4.6. Comparison of Partitioning Strategies
5. Parallel Implementation
  5.1. Parallel Programming Environment
    5.1.1. iPSC Hardware and Software
    5.1.2. iPSC Interprocessor Communication Mechanisms
  5.2. Parallel Implementation Details
    5.2.1. Efficient Message Passing Patterns
    5.2.2. Message Composition
    5.2.3. Data Structures
  5.3. Debugging
6. Performance Evaluation of Parallel Algorithms
  6.1. Methodology
  6.2. Speedup Results
  6.3. Convergence Results
  6.4. Summary
7. Conclusions
References
List of Figures

Figure 2-1: Mincut Partitioning
Figure 2-2: Polar Graph Representation
Figure 2-3: Slicing Tree Structure
Figure 2-4: Rectangular Dualization
Figure 2-5: The Simulated Annealing Algorithm
Figure 2-6: Variation of the Cost Function During Annealing
Figure 2-7: Heuristic Spanning
Figure 2-8: Multiple-Seed Collusion
Figure 3-1: Move Set for PASHA
Figure 3-2: Center Weighting Function for Overlap Cost Evaluation
Figure 3-3: Comparison of Overlap Penalty Functions
Figure 3-4: Final Floorplans Produced by PASHA
Figure 3-5: MASON and PASHA Solutions for a Non-Slicing Structure
Figure 4-1: Topology for 2, 3 and 4-Dimensional Hypercubes
Figure 4-2: Static Parallel Algorithm on a 3-Dimensional Hypercube
Figure 4-3: Pipeline Algorithm for a 4-Dimensional Hypercube
Figure 4-4: Percolation of Update Information among Pipelines: Lazy Updating
Figure 4-5: Topology for a Modified Pipeline Algorithm on a 3-Dimensional Hypercube
Figure 4-6: Modified Pipeline Algorithm for a 4-Dimensional Hypercube
Figure 5-1: Message Communication Overhead of the Intel iPSC Hypercube
Figure 5-2: Sequence of Messages for a Broadcast Tree
Figure 5-3: Binary Reflected Gray Code and its Topology on a 3-Dimensional Hypercube
Figure 5-4: Bin Data Structure
Figure 6-1: Percentage of Time Spent in Each Move Task
Figure 6-2: Execution Times for the Static Parallel Algorithm
Figure 6-3: Effect of Lazy Updates in the Modified Pipeline Algorithm
Figure 6-4: Total Execution Times for Modified Pipeline Algorithm using Benchmark B
Figure 6-5: Execution Times for Different Pipeline Lengths
Figure 6-6: Variation of Time Taken per Temperature
Figure 6-7: Speedup for Parallel Algorithms
Figure 6-8: Quality of Parallel Solutions
Figure 6-9: Wirelength Variation in a Serial and a Parallel Algorithm
List of Tables

Table 3-1: Comparison between PASHA and MASON
Table 4-1: Comparison of Partitioning Strategies
Acknowledgements
I would like to express my sincere thanks to my research advisor Prof. Rob Rutenbar. His
intellectual insight, penchant for perfection and infectious enthusiasm have been
instrumental in the completion of this work. I wish to place on record my grateful
acknowledgement to Mr. George Dodd of the Computer Science Department at the General
Motors Technical Center, Warren, Michigan, for allowing me to use the Intel iPSC machine at
their premises. My grateful thanks are also due to Mr. Alan Baum and Mr. Don McMillan of
General Motors Technical Center for introducing me to the pleasures and pains of
programming on the iPSC system. I wish to acknowledge the support given by Intel Corp.
in giving me access to an Intel iPSC hypercube here at CMU. I would also like to express
my sincere appreciation to my committee members: Prof. Andrzej Strojwas and Prof. Zary
Segall.
I would like to acknowledge the work done by Dave Bohman in the installation of the
hypercube. I also would like to thank many members of the ECE community, especially
Saul Kravitz and Jim Quinlan for the many fruitful discussions which have subtly moulded
the nature of this work, and Dottie Setliff for suggesting an elegant acronym for this
research effort. And finally, a word of special thanks to all my officemates for the friendly,
fun-filled environment which has been so very conducive to my work.
Abstract
Simulated annealing algorithms for VLSI layout tasks produce solutions of high quality
but are computationally expensive. This thesis examines some parallel approaches to
accelerate simulated annealing using message-passing multiprocessors with a hypercube
architecture. Floorplanning is chosen as a typical application of annealing in physical
design.
Different partitioning strategies which map this annealing algorithm onto a hypercube
architecture are presented. The objective in the design of these partitioning strategies is
to exploit maximum parallelism in the algorithm within the constraints of a message-
passing multiprocessor environment. Besides utilizing the limited parallelism inherent in
individual move evaluations, we also exploit the tolerance of annealing to errors in the
value of the system cost function as seen locally in each processor. To map these
partitioning strategies onto hypercube architectures, optimized message patterns are
developed.
Two parallel algorithms based on these partitioning strategies have been implemented on
a 16 node Intel hypercube. Practical speedups roughly between 4 and 8 have been
obtained on 16 processors for different strategies. The performance and solution quality
of these algorithms is presented and critically analyzed. With respect to solutions
produced by the analogous serial annealing algorithm, it is shown experimentally that the
introduction of uncertainty in the parallel algorithms does not compromise the solution
quality.
Chapter 1
Introduction
VLSI systems are becoming increasingly complex and, consequently, their design times
are also increasing. To enable the designer to complete chips at a faster rate, design
methodologies which result in shorter design cycles are employed. To manage the
complexity of the design of such VLSI systems, most of these design methodologies try to
decompose the entire design into smaller, more easily manageable tasks. Decisions made
early in the design process can profoundly affect the final quality of the design. Hence, it
is very important to predict the implications of early design decisions on the final quality of
the design. Typically, these methodologies stress the need for a hierarchical approach,
and the necessity for high-level planning at the start of the actual design process. A
hierarchical approach enables the designer to understand the implications of early design
decisions more completely, and reduces the possibility of design flaws. This results in
fewer iterations and faster turnaround times.
Physical design is the phase of the IC design process in which the functional design of a
piece of hardware is actually mapped onto the surface of silicon. Layout tasks must try to
optimize layout parameters which directly affect system performance, for example, the
aggregate wirelength, and exact geometric shape of each module. In physical design, the
floorplanning task determines a suitable geometric arrangement for the basic functional
blocks of the system, and perhaps the rough shape of the blocks themselves. The
floorplanner must optimize critical parameters, such as the total estimated areas and
wirelength, in order to ensure the success of subsequent design steps such as placement
and routing.
For our purposes, floorplanning produces a geometric arrangement of the functional
blocks, and a set of possible shapes for each block. Floorplanning, like most physical
design problems, is an NP-hard problem. The complexity of this class of problems grows
exponentially and, therefore, large floorplanning problems may require enormous amounts
of time to determine an optimal solution. For practical reasons, heuristics which
strive to find fast, near-optimal solutions are employed to solve such problems. Such
heuristics differ in the tradeoffs they make between execution time and the optimality of
their solutions. Iterative improvement methods are a class of heuristic methods which
often give good solutions, but very often tend to be slow. In addition, they do not
guarantee convergence to near-optimal solutions. Iterative improvement methods typically
start with some initial solution and iteratively improve, or refine this solution until no further
improvement is possible. This sometimes causes these methods to get stuck in locally
optimal but globally inferior solutions. Thus, the final solution produced by a typical
iterative improvement algorithm may be extremely sensitive to the initial starting solution.
Simulated annealing methods represent an alternative to classical iterative improvement
techniques. Annealing methods, which are also iterative improvement techniques, avoid
one major disadvantage common to most iterative methods: they provide a controlled
mechanism for the system to climb out of local minima to reach global minima. In a variety
of physical design applications simulated annealing algorithms have produced excellent
solutions, but they are almost always computationally very expensive to run. Since the
results of simulated annealing have been very encouraging, there have been various
modifications proposed which try to accelerate the basic serial algorithm. Of primary
interest to us are multiprocessor implementations which try to exploit the inherent
concurrency of annealing algorithms. Different partitioning strategies have been used to
exploit this concurrency by dividing the computation involved in annealing among
cooperating processors. These partitioning strategies are usually specific to the target
machine on which they are to be implemented. Most of the work in parallel implementations
to date has been on shared-memory multiprocessors. The focus of this thesis is the study
of parallel partitioning schemes for annealing algorithms running on message-passing
multiprocessors, in particular, hypercube multiprocessors. These machines differ from
shared-memory machines in that they lack any global, transparently shareable memory; all
synchronization operations and data sharing for parallel computation are done by
messages. One of the main reasons for the attractiveness of message-passing machines
such as hypercube multiprocessors is that they can be incrementally upgraded to larger
systems more easily than many shared-memory machines.
In this project we have implemented a basic simulated annealing algorithm, and then used
this serial algorithm as a vehicle to study parallel algorithm partitioning schemes for
implementation on a hypercube. We have chosen a floorplanning task as a typical
application of simulated annealing in physical design. Although our serial floorplanner
requires some minor extensions and tuning to be of use as a practical tool, it nevertheless
exhibits all salient characteristics of a good application of simulated annealing, and
consequently suffices as a benchmark for our studies of annealing on hypercube
architectures. We examine different partitioning strategies and message passing patterns
which exploit the inherent concurrency of the basic serial algorithm. In particular, the
parallel schemes we propose exploit the tolerance of simulated annealing to errors in cost
function evaluation during individual iterative improvement steps.
This thesis is organized as follows. Chapter 2 discusses the formulation of the
floorplanning problem. We also review simulated annealing algorithms and some previous
work in the area of accelerating annealing algorithms. This is followed by a comparative
review of parallel simulated annealing algorithms. Chapter 3 discusses the serial version
of the simulated annealing floorplanner. Chapter 4 examines general issues in parallel
partitioning strategies and hypercube message passing patterns. Specific strategies are
proposed to map our serial algorithm onto hypercube machines. We also examine in detail
the error tolerance property of simulated annealing and its potential uses in parallel
annealing. Chapter 5 discusses implementation details of the serial and parallel algorithms.
In Chapter 6 we present results obtained by implementing the proposed parallel algorithms
on an Intel iPSC hypercube. Results of experiments performed with the parallel algorithms
are analyzed, and the advantages and shortcomings of these are critically reviewed.
Finally, a brief summary of the contributions of this thesis is presented and areas of future
research are identified in Chapter 7.
Chapter 2
Background and Motivation
In this chapter previous work related to this thesis is reviewed. We begin the discussion
with a specification of the floorplanning task. This is followed by a review of some basic
techniques for solving floorplanning problems. In this context a basic review of simulated
annealing is presented along with some of its applications to floorplanning. Prior efforts in
optimizing simulated annealing algorithms are reviewed, followed by a discussion of
recent parallel approaches to simulated annealing. We conclude this chapter by a
discussion of the motivation for this thesis and the goals to be accomplished in this work.
2.1. The Floorplanning Task
Floorplanning is the process of choosing geometrical attributes for hierarchically
partitioned functional modules so as to satisfy a set of electrical and topological
constraints. After the entire design is partitioned into a set of modules, the physical layout
of these modules must be determined so as to optimize the total interconnect wirelength,
total area and other layout parameters. The placement of these modules and the optimal
choice of their attributes in planning the area of the chip are the goals of the floorplanning
task.
The functional modules or cells have certain geometric constraints to be satisfied during
floorplanning. These constraints typically result in a number of possible shapes and sizes
for each module and often reflect different possible layout styles for this cell. In addition,
the area of the chip is also sometimes constrained, either by its aspect ratio or maximum
allowable size. One primary objective of most floorplanners is the minimization of the
interconnection wirelengths while retaining maximum routability. The process of
floorplanning decides the optimal shapes and arrangement of all the modules, attempts to
pack all the modules in a compact rectangular area, and attempts to minimize the total
wirelength and area occupied by the floorplan.
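The wirelength term in such an objective function is commonly estimated with the half-perimeter of each net's bounding box. The sketch below illustrates that common estimator; the module names, positions and nets are invented for illustration and do not come from this thesis:

```python
def half_perimeter_wirelength(nets, positions):
    """Estimate total wirelength as the sum, over all nets, of the
    half-perimeter of the bounding box of the net's module centers."""
    total = 0.0
    for net in nets:                      # each net is a list of module names
        xs = [positions[m][0] for m in net]
        ys = [positions[m][1] for m in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Hypothetical example: three modules, three two-pin nets
positions = {"m1": (0.0, 0.0), "m2": (4.0, 0.0), "m3": (4.0, 3.0)}
nets = [["m1", "m2"], ["m2", "m3"], ["m1", "m3"]]
print(half_perimeter_wirelength(nets, positions))  # 4 + 3 + 7 = 14.0
```

The half-perimeter metric is cheap to update incrementally after a move, which matters when an annealer evaluates thousands of perturbations per temperature.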
Floorplanning and placement seem to be very similar tasks, but they differ in many ways.
Unlike placement, which occurs later in the layout, floorplanning is one of the earliest
tasks in layout design. Placement determines the arrangement of cells that have fixed
shapes and sizes. Floorplanning, on the other hand, not only determines the arrangement
of the cells but also decides the shapes and sizes of the cells which optimize the layout. In
addition, I/0 pin connections may have variable locations in some cells and their optimal
positions are also determined during floorplanning. Floorplanning typically deals with
fewer than 150 modules, while placement very often must handle several hundred. In
our model of floorplanning, the floorplanner determines the size and rough
arrangement of modules. Subsequently a detailed placement phase is required to
determine the precise positions of the modules and routing areas.
2.2. Floorplanning Methods
There are many different methods to solve the floorplanning problem. These methods
can be broadly classified into mincut techniques, rectangular dualization methods and
simulated annealing methods. Some of these techniques solve the floorplanning problem
in the absence of variable shapes and pin locations on modules. In such cases the module
is often abstracted as a macrocell, i.e., a cell with a definite shape. This subset of the
floorplanning problem is referred to as macrocell placement. Some floorplanning and
macrocell placement techniques are reviewed in the following sections.
2.2.1. Mincut Techniques
These techniques are based on a partitioning technique referred to as mincut partitioning
[Kernighan 70, Breuer 77a, Breuer 77b]. Assuming that we have a certain placement of
modules, a cutline is a horizontal or vertical line which divides the modules into two
distinct sets, one on each side of the cutline. There is, typically, an objective function that
assigns a cost to placing the cutline at a particular location. This cost of the cutline is
usually a function of the number of nets which cross the cutline, for example, the number
of nets connecting modules on different sides of the cutline and the relative imbalance
between the total areas of the modules on each side of the cutline. Weighted sums of
crossing net count and area imbalance are common objective functions. Starting with the
entire chip area, an optimal partition into two areas, each containing a subset of the total
number of modules, is done first. The process of partitioning continues recursively, on
each of these two areas and so forth, until the entire chip area is divided into rectangles
each enclosing a single module. Fig.2-1 illustrates the determination of cutlines in mincut
partitioning.
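The cutline objective described above, a weighted sum of crossing nets and area imbalance, can be sketched as follows; the weights, the data layout, and the example modules are illustrative assumptions:

```python
def cutline_cost(cut_x, modules, nets, w_cross=1.0, w_balance=0.5):
    """Cost of a vertical cutline at x = cut_x: the number of nets
    crossing the cut plus the area imbalance, each weighted."""
    left = {name for name, (x, area) in modules.items() if x < cut_x}
    crossing = sum(1 for net in nets
                   if any(m in left for m in net)
                   and any(m not in left for m in net))
    area_left = sum(area for name, (x, area) in modules.items() if name in left)
    area_right = sum(area for name, (x, area) in modules.items() if name not in left)
    return w_cross * crossing + w_balance * abs(area_left - area_right)

# Hypothetical input: modules as {name: (x_position, area)}
modules = {"a": (1.0, 10.0), "b": (2.0, 12.0), "c": (5.0, 8.0), "d": (6.0, 14.0)}
nets = [["a", "b"], ["b", "c"], ["c", "d"]]
print(cutline_cost(3.0, modules, nets))  # one crossing net, zero imbalance -> 1.0
```

A mincut partitioner would evaluate this cost for candidate cut positions (or candidate module partitions) and keep the cheapest, then recurse on each side.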
Lauther [Lauther 79] employs the mincut technique with a unique graph representation
to solve the macrocell placement problem. The layout is represented by two mutually dual,
acyclic and planar graphs each representing one of the two dimensions: vertical and
horizontal. Each macrocell is a rectangle, and is represented by a pair of edges, one in
each graph, where each edge represents one of the two dimensions of the rectangle. The
basic idea is to start with a rectangular area, with a size equal to the total aggregate area
of each cell to be placed, and proceed by recursively dissecting this area to obtain a final
topological placement for the modules. Each dissection partitions a region of the area into
two subregions; mincut techniques decide which modules go in each dissected subregion.
Each dissection contributes nodes or edges to the two graphs: the two graphs are
constructed in parallel with the dissections, and represent the topological placement of
Figure 2-1: Mincut Partitioning (first, second, and third cutlines)
the modules. Fig.2-2 illustrates a polar graph representation of a simple topology. The
process of finding cutlines and partitioning the modules into subregions continues
recursively until every region consists exactly of a single module. The final graphs can
then be converted to a detailed layout which shows the true cell dimensions while
maintaining the neighbour relations obtained from the graph.
Another mincut-based approach is the slicing technique [Brooks 40]. Slicing is a
technique in which a rectangular area is divided by a set of parallel line segments into
smaller rectangles. Each smaller rectangle so obtained is called a slice. Slicing methods
partition the modules into subsets, usually optimizing some function of net connectivity
across the slices, such that every subset can be placed within its corresponding slice. A
slicing tree is a graph used to represent a slicing structure. Each node of the slicing tree
represents a rectangular region which entirely encloses all the modules in each of the
nodes’ subtrees. In a complete slicing tree the leaves correspond to the individual
Figure 2-2: Polar Graph Representation (slicing representation with its horizontal and vertical graphs)
modules. The levels of a slicing tree represent either horizontal or vertical cuts, and the
slices at each level alternate between horizontal and vertical cuts. An optimal slicing tree
that determines the topological configuration is first found. A final rectangular dissection
is then derived from the topological configuration of the slicing tree and the shape
constraints of the modules. Fig.2-3 illustrates a binary slicing tree which is a specific
case of the general slicing tree.
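To make the slicing representation concrete, here is one possible sketch of a binary slicing tree and the bottom-up computation of each region's enclosing rectangle; the node layout and the example modules are assumptions for illustration, not any particular tool's data structure:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SliceNode:
    """Node of a binary slicing tree. Internal nodes carry a cut
    direction ('H' or 'V'); leaves carry a module's (width, height)."""
    cut: Optional[str] = None                     # 'H', 'V', or None for a leaf
    shape: Optional[Tuple[float, float]] = None   # (width, height) for a leaf
    left: Optional["SliceNode"] = None
    right: Optional["SliceNode"] = None

def region_shape(node):
    """Bottom-up: the rectangle enclosing all modules under this node.
    A horizontal cut stacks the children; a vertical cut abuts them."""
    if node.cut is None:
        return node.shape
    lw, lh = region_shape(node.left)
    rw, rh = region_shape(node.right)
    if node.cut == 'H':                           # children stacked vertically
        return (max(lw, rw), lh + rh)
    return (lw + rw, max(lh, rh))                 # 'V': children side by side

# Hypothetical tree: (m1 beside m2) stacked above m3
m1 = SliceNode(shape=(2.0, 3.0))
m2 = SliceNode(shape=(1.0, 3.0))
m3 = SliceNode(shape=(3.0, 1.0))
root = SliceNode(cut='H', left=SliceNode(cut='V', left=m1, right=m2), right=m3)
print(region_shape(root))  # (3.0, 4.0)
```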
A floorplanning tool, MASON [Lapotin 85], uses similar partitioning heuristics for mincut
placement of arbitrarily shaped cells. The problem specification consists of a standard
graph in which the nodes correspond to the modules and the edges to the interconnections
between the modules. This graph is partitioned repeatedly until each partition contains
exactly one node. The partitioning optimizes a weighted sum of the nets crossing the cut
and the relative area imbalance on each side of the cut. Partitioning of small graphs is done
Figure 2-3: Slicing Tree Structure (slicing tree with horizontal and vertical cuts, and its floorplan equivalent)
using exhaustive search, but heuristics are employed to partition large graphs. This
partitioning is followed by the construction of a binary slicing tree. The final phase of the
algorithm converts this binary slicing tree to detailed layout. This is performed by two
slicing tree traversals. The first tree traversal is a Depth-First traversal that walks up the
slicing tree and evaluates the effect of alternate module dimensions on the quality of the
floorplan. At the completion of this traversal, optimal module dimensions are determined.
The second traversal is a pre-order traversal to determine the actual module positions that
satisfy the topological constraints laid down by the slicing tree.
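One common way to realize the first, bottom-up traversal is shape-list combination: each node keeps the non-dominated (width, height) alternatives obtainable from its children's alternatives. The sketch below handles a vertical cut; the pruning rule and example shapes are standard illustrative choices, not necessarily MASON's exact procedure:

```python
def combine_vertical(shapes_left, shapes_right):
    """Combine the alternate (width, height) shapes of two subregions
    placed side by side, keeping only non-dominated combinations."""
    candidates = [(wl + wr, max(hl, hr))
                  for (wl, hl) in shapes_left
                  for (wr, hr) in shapes_right]
    # Prune dominated shapes: sort by width, keep strictly decreasing heights.
    candidates.sort()
    pareto = []
    for w, h in candidates:
        if not pareto or h < pareto[-1][1]:
            pareto.append((w, h))
    return pareto

# Hypothetical input: each child region offers two alternate shapes
print(combine_vertical([(2, 4), (4, 2)], [(1, 3), (3, 1)]))
# [(3, 4), (5, 3), (7, 2)]
```

Walking this combination up the tree yields, at the root, the set of achievable chip outlines; the second traversal can then fix positions consistent with the chosen shapes.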
The main advantage of mincut and slicing approaches lies in their inherent routability.
Cutlines in the slicing tree correspond to routing channels and the slicing tree always
yields cycle-free routing [Supowit 83, Szepieniec 80]. The routing process can be
completed using global routing followed by detailed channel routing. Due to the
minimization of the cutline net crossings at each step of the algorithm, channel congestion
is minimized. Mincut methods are also very popular because of their clarity of
representation and their speed. A disadvantage of a strict mincut approach is its relative
inflexibility. For example, user defined constraints such as alternate shapes for modules,
or a priori fixed module locations are difficult to handle. MASON [Lapotin 85] proposes
some extensions to the mincut approach which are able to handle some of these
problems. Like any iterative improvement technique, mincut approaches tend to get stuck
in locally optimal but globally inferior solutions. One method to get around this is to
conduct multiple runs of the algorithm with different initial configurations, and then select
the best available final solution.
2.2.2. Rectangular Dualization Techniques
Another technique used for floorplanning is the rectangular dualization method [Heller
82, Leinwald 84, Kozminski 84]. In this technique, a configuration is represented as a
graph in which the vertices represent the modules, and the edges represent the module
interconnections. The rectangular dual of this graph is constructed. The dual graph has
vertices which map to the rectangular faces of the modules and edges which correspond
to the adjacent sides of modules in the rectangular dual. Construction of the dual graph
involves branch and bound techniques to generate an exhaustive list of possible
configurations which yield minimum module area. A necessary condition to construct a
dual is that the original graph must be planar; non-planar graphs have to be planarized
before this method is applied. Non-planarity of the original graphs is due to the existence
of wiring crossovers that cannot be routed in the same plane. Consequently, planarization
is done by the introduction of some auxiliary modules which represent these wiring
crossovers.
Rectangular dualization is a very elegant graph theoretic characterization of the problem.
The representation of the problem is entirely geometric and mapping from dual graph to
floorplan and vice versa is very simple. One drawback of this approach is that it is
time-consuming, since it involves exhaustive evaluation of all possible duals of a given
graph. Another disadvantage of this approach is the possibility of the absence of any
satisfactory dual of a graph. As with the mincut approach, suboptimal solutions are very
likely here. In addition, during the planarization of the graph additional nodes
corresponding to wiring crossovers are introduced, creating a problem of determining
placement for these wiring crossovers in the dual. Fig.2-4 illustrates two rectangular dual
graphs and their equivalent geometric representation.
Figure 2-4: Rectangular Dualization (dual graphs and their equivalent slicing representation)
2.2.3. Simulated Annealing Techniques
Simulated annealing [Kirkpatrick 83] is an iterative improvement method for attacking
combinatorial problems. This algorithm follows the analogy of finding a minimum energy
state in a physical system by annealing. Physical annealing consists of heating some
material to very high temperatures until it melts, followed by a gradual, thermodynamically
reversible cooling until the material freezes. At each of these intermediate temperatures
the constituent components of the system, e.g., molecules or atoms, rearrange themselves
in lower and lower energy configurations. Finally, when the system is frozen and no further
rearrangements are possible, the configuration of the system is in the lowest possible
energy state, called the ground state. The simulated annealing algorithm, as its name
suggests, uses an analogy to this process of annealing. To optimize the arrangement of
components in some system, we assume a certain objective function, analogous to the
energy, which is to be minimized. Random perturbations, called moves, are made to the
system, analogous to random molecule rearrangement occurring in the physical system.
Similar to the temperature in the physical system, we have a control parameter T which
regulates the acceptance of perturbations in the system during simulated annealing.
Random perturbations are attempted, and then evaluated by computation of the objective
function. If the change in the objective function ΔE is negative, i.e., if this change results in
an improvement of the objective function, then the change is accepted. On the other hand,
if the change causes an increase in the objective function and worsens the arrangement,
the perturbation is accepted with a probability p(T, ΔE). Boltzmann-like probability
distributions are commonly used, for example:
p(T, ΔE) = e^(-ΔE/T)    (ΔE > 0)
Thermal equilibrium is simulated by attempting a sufficient number of moves at every
temperature so as to explore a large fraction of the state space. Subsequent lowering of
the temperature reduces the probability of accepting positive changes and fewer uphill
moves are accepted. Finally, when the system is frozen, essentially no uphill moves are
accepted and since the objective function is near a minimum, few downhill moves are
found. The pseudo-code given in Fig.2-5 illustrates the simulated annealing algorithm.
    start with a sufficiently high initial temperature (T = T0);
    while ("the state is still changing") {
        while ("state is not in thermal equilibrium at the current temperature") {
            make a random perturbation (move) to the configuration;
            evaluate the change in objective function (ΔE) due to this perturbation;
            if (improvement in the objective function, i.e. ΔE < 0)
                accept the change and update the configuration;
            else {
                evaluate the probability of acceptance p(T, ΔE);
                accept the move with this probability and update if necessary;
            }
        }
        lower the temperature;    /* T = αT */
    }

Figure 2-5: The Simulated Annealing Algorithm
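Written out in a conventional language, the loop of Fig.2-5 looks roughly as follows. Purely for illustration it is specialized to a toy one-dimensional minimization; the schedule constants and the objective are arbitrary choices, not values used in this thesis:

```python
import math
import random

def simulated_anneal(cost, perturb, state, t0=10.0, alpha=0.9,
                     moves_per_t=200, t_min=1e-3):
    """Generic annealing loop: downhill moves are always accepted, uphill
    moves with probability exp(-dE/T); cooling is geometric (T = alpha*T)."""
    t = t0
    e = cost(state)
    while t > t_min:
        for _ in range(moves_per_t):    # crude stand-in for "thermal equilibrium"
            candidate = perturb(state)
            de = cost(candidate) - e
            if de < 0 or random.random() < math.exp(-de / t):
                state, e = candidate, e + de
        t *= alpha                      # lower the temperature
    return state, e

# Toy objective with two minima; the deeper basin lies near x = -1.3.
f = lambda x: x**4 - 3 * x**2 + x
random.seed(0)                          # reproducible run
best, energy = simulated_anneal(f, lambda x: x + random.uniform(-0.5, 0.5), 2.0)
print(round(best, 2), round(energy, 2))
```

Starting from x = 2.0, in the shallower basin, the controlled acceptance of uphill moves lets the search cross the barrier and settle in the deeper minimum, which a greedy downhill search from the same start would miss.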
Simulated annealing has an advantage over greedy, downhill-only algorithms in its ability
to climb out of local minima. The presence of a controlled mechanism for the acceptance
of uphill moves is a critical new feature of these algorithms. Simulated annealing has been
used quite successfully to solve a variety of physical layout problems such as standard
cell placement [Sechen 84], macro cell placement [Jepsen 83], global routing [Vecchi
83], and gate matrix layout [Devadas 86].
We shall now briefly discuss some applications of simulated annealing in floorplanning.
Annealing approaches to floorplanning can be broadly classified into two categories
based on their problem representation. One method is a direct geometrical approach, in
which the floorplanning problem is modeled as a geometrical problem consisting of many
rectangles, each of which has to be placed to minimize the overall objective function.
Another method is to convert the floorplan to an abstract representation such as a polar
graph. Subsequently the transformed problem is annealed to get a solution which is then
mapped back to its geometrical equivalent.
Jepsen and Gelatt [Jepsen 83] propose a simulated annealing method for the placement
of macrocells with arbitrary rectangular sizes. This algorithm tries to minimize the total
wirelength of the placement, hence its objective function consists in part of a wirelength
estimator. Here annealing uses a direct geometric approach of moving the rectangles
around to find optimal placements. Consequently, the algorithm utilizes a move set
consisting of random relocations of the modules: moving a cell in either the horizontal or
vertical directions, rotating the cell in any of the four orientations, or reflecting the cell
along the vertical or horizontal axis. Apart from this, special macrocells like I/O cells are
further constrained in that they are allowed to move only along the periphery of the chip.
The key innovation here is that overlaps among the macrocells are allowed during
annealing. These allowed overlaps greatly simplify the move set, but they clearly
represent an infeasible solution. Consequently, overlaps are penalized by the addition of
an overlap penalty to the objective function. The annealing schedule lowers the
temperature by a constant factor α and identifies the stopping criterion for annealing when
no moves have been accepted for three successive temperatures.
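An overlap penalty of the kind described can be sketched as the total pairwise intersection area of the module rectangles; the linear weighting and the example rectangles below are illustrative assumptions, not the penalty function of [Jepsen 83]:

```python
def overlap_penalty(rects, weight=1.0):
    """Sum of pairwise overlap areas between axis-aligned rectangles
    given as (x, y, width, height); a feasible layout scores zero."""
    total = 0.0
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            x1, y1, w1, h1 = rects[i]
            x2, y2, w2, h2 = rects[j]
            dx = min(x1 + w1, x2 + w2) - max(x1, x2)
            dy = min(y1 + h1, y2 + h2) - max(y1, y2)
            if dx > 0 and dy > 0:
                total += dx * dy        # overlapping area of this pair
    return weight * total

# Two 4x4 cells overlapping in a 2x2 square, plus one disjoint cell
print(overlap_penalty([(0, 0, 4, 4), (2, 2, 4, 4), (10, 0, 4, 4)]))  # 4.0
```

Because the penalty decays to zero as overlaps shrink, the annealer can pass through infeasible intermediate states at high temperature yet be driven toward an overlap-free floorplan as the temperature falls.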
The TimberWolf package [Sechen 84] also includes a simulated annealing algorithm for
macrocell placement, and also anneals a direct geometric representation of the macrocell
placement problem. The objective function consists in part of wirelength minimization, and
an overlap penalty function similar to the one proposed in [Jepsen 83]. Another
component of the objective function reflects the cost of different I/0 locations on cells.
Pin locations are allowed to vary on individual modules, moving from site to site, where
each site has a limited capacity for pins. The objective function penalises pin sites which
exceed their allowable capacity. The proposed move set in this algorithm is richer than the
move set proposed in [Jepsen 83] and includes: single macro cell displacements along
any arbitrary direction, position swapping between two macro cells, aspect ratio changes
in the shape of a single macrocell, and assignment of pins to new sites. The annealing
schedule of TimberWolf also uses T_new = α·T_old, but varies the value of α dynamically
during the annealing process to proceed quickly through very hot and very cold
temperatures, and slowly through the critical intermediate temperatures. TimberWolf also
makes use of the concept of range limiting to avoid proposing unreasonably
large-perturbation moves at low temperatures. This ensures that a large percentage of moves
are not wastefully evaluated only to be rejected at low temperatures.
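Range limiting can be sketched as follows; the logarithmic shrink rate and the function names are illustrative assumptions rather than TimberWolf's actual schedule:

```python
import math

def move_window(temperature, hot_temperature, chip_span):
    """Range limiter (sketch): the maximum displacement allowed for a
    proposed move shrinks as the temperature falls, so low-temperature
    annealing does not waste effort on large perturbations that are
    almost certain to be rejected. The logarithmic shrink is an assumed
    rate, chosen only for illustration."""
    frac = math.log(1.0 + temperature) / math.log(1.0 + hot_temperature)
    return max(1.0, chip_span * min(1.0, frac))
```

At the hot starting temperature the window spans the whole chip; as the temperature drops the window contracts, and proposed displacements are drawn from inside it.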
The approach adopted by Otten [Otten 84] uses a polar graph representation of the
floorplan for annealing, and differs distinctly from the two previously discussed
approaches. Moves are essentially transformations on the polar graph: the polar graph
itself is annealed to get a solution. The fundamental move in this algorithm is an exchange
of positions of the macro cells. The distance between the swapping macro cells is a
parameter which is used effectively to range-limit the moves. Wirelength minimization is
the sole objective function of the algorithm. The move set always explores only feasible
placements which are represented by polar graphs. Overlaps cannot occur in any
floorplan produced by this method and hence the objective function does not contain any
penalty function for overlaps. The starting temperature is derived empirically by attempting
a few moves and determining a temperature that will allow a very high percentage of the
uphill moves to be accepted. The value of α is derived theoretically, unlike the use of an
empirical value of α as is the case with the previous two methods.
Another approach to floorplanning using simulated annealing has been proposed by
Wong and Liu [Wong 86]. This algorithm uses a slicing tree representation called a
Normalized Polish Expression. The Normalized Polish Expression consists of a string of
symbols. The symbols are classified either as operands or operators. Operands represent
the modules and operators define the slicing cuts which dissect the entire floorplan.
There are two types of operators corresponding to the vertical and horizontal cuts. An
expression defines a complete layout in terms of its equivalent slicing tree. The objective
function consists of a total wirelength metric and a total area estimator. Moves consist of
manipulating symbols in an expression, such as swapping two operands or swapping two
adjacent operands and an operator. Swapping two operands, or complementing
a subexpression, always results in a legal Normalized Polish Expression. On the other hand,
some moves, such as swapping an adjacent operator and an operand, may sometimes
yield an invalid Polish Expression. Hence the validity of this move must be established
before attempting it. The algorithm allows modules to have arbitrary rectilinear shapes
defined by a bounding curve. The bounding curve essentially determines the range of
feasible dimensions of the enclosing rectangular area of the module. A piecewise linear
bounding curve can define any rectilinear shape. The minimum area floorplan realization, a
task performed to evaluate a move, is done by adding the bounding curves of the modules
while walking up the slicing tree corresponding to the Polish expression. Incremental
methods of evaluating the minimum area floorplan realization are used to speed up
execution times. This representation of the floorplan reduces the number of neighbouring
states for each state and, consequently, enables the algorithm to search many feasible
floorplans very quickly.
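Evaluating a Polish expression can be sketched for the simplified case in which every module has a single fixed shape; the full bounding-curve machinery of [Wong 86] is omitted here:

```python
def floorplan_dims(expr, sizes):
    """Evaluate a Normalized Polish Expression given in postfix order.
    Operands are module names with fixed (width, height) in `sizes`;
    'V' is a vertical cut (operands placed side by side, widths add),
    'H' is a horizontal cut (operands stacked, heights add).
    Returns the (width, height) of the resulting slicing floorplan."""
    stack = []
    for sym in expr:
        if sym == 'V':
            (w1, h1), (w2, h2) = stack.pop(), stack.pop()
            stack.append((w1 + w2, max(h1, h2)))
        elif sym == 'H':
            (w1, h1), (w2, h2) = stack.pop(), stack.pop()
            stack.append((max(w1, w2), h1 + h2))
        else:
            stack.append(sizes[sym])
    return stack.pop()
```

For example, the expression `a b V` places modules a and b beside each other; moves such as swapping two operands simply permute symbols in the list and are re-evaluated by walking the expression again (or incrementally, as described above).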
Simulated annealing is an approach which has considerable flexibility compared to the
methods of mincut techniques and rectangular dualization. Many user-defined constraints
which cannot be easily handled by the previous two approaches can be implemented in
the algorithm by a simple change in the objective function. Another advantage of simulated
annealing is its controlled hill climbing mechanism which provides a way to climb out of
locally optimal solutions towards a globally optimal solution. However, these advantages
do not come without cost. Simulated annealing is a computationally expensive technique
and typically requires very long execution times. Various parameters in any actual
annealing algorithm must be tuned to a great degree to optimize performance, resulting in a
slight loss of generality. Nevertheless, the fact that simulated annealing is a general
approach for the solution of many different layout problems contributes to its popularity.
In the next chapter, we describe our own version of a floorplanning algorithm using
simulated annealing; we employ a direct geometrical representation similar to that used in
[Jepsen 83] and by Sechen [Sechen 84] in TimberWolf. The objective function consists
of a wirelength estimator, an area estimator and a penalty function for module overlaps. We
have used the idea of an overlap penalty function similar to that in [Jepsen 83] with some
modifications to more accurately reflect the overlap situation. The move set used by our
algorithm is specifically adapted to our specification of the floorplanning task and is richer
than the simple move set proposed in [Jepsen 83].
2.3. Optimization and Parallelism in Simulated Annealing
Simulated annealing essentially refines a random solution, cooling it from a high starting
temperature to the final frozen state through many intermediate temperatures. Computation
at every temperature involves the processing of thousands of moves. This means that
moves have to be proposed and evaluated, and configurations updated millions of times
during an entire annealing schedule. This is a computationally intensive process and
efforts have been made to optimize simulated annealing algorithms to improve their speed,
while at the same time maintaining the high quality of their solutions. Efforts to accelerate
annealing algorithms have been primarily in two directions. One method, focusing on serial
algorithms only, is to incorporate modifications in the algorithm to reduce the
computational complexity of the long sequence of moves to be evaluated. The other method
focuses on parallelism in annealing, using multiprocessors and parallel algorithm
partitioning strategies to accelerate the computation. This section reviews serial and
parallel strategies to accelerate annealing.
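The process just described, reduced to its essentials, is the familiar Metropolis loop; the sketch below is generic and does not represent the exact procedure of any particular algorithm discussed here:

```python
import math
import random

def anneal(state, propose, cost, t_start, alpha, moves_per_t, t_final):
    """Generic serial annealing loop (sketch): at each temperature,
    propose and evaluate a batch of moves, accepting uphill moves with
    the Metropolis probability exp(-delta/T); the temperature is then
    lowered geometrically by the factor alpha."""
    t, current = t_start, cost(state)
    while t > t_final:
        for _ in range(moves_per_t):
            candidate = propose(state)
            delta = cost(candidate) - current
            if delta <= 0 or random.random() < math.exp(-delta / t):
                state, current = candidate, current + delta
        t *= alpha
    return state, current
```

Even this toy loop makes the computational burden apparent: tens of temperatures times thousands of moves per temperature, each requiring a full cost evaluation.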
2.3.1. Serial Optimization
This subsection reviews serial strategies to accelerate annealing: optimal temperature
scales for annealing, rejectionless methods, optimal annealing schedules and error
tolerance in annealing.
Concepts of Scale: The cost function varies dynamically during annealing; Fig.2-6
illustrates this variation. As can be seen in Fig.2-6, the objective function does not
change appreciably at very high temperatures. Due to a high probability of acceptance of
uphill moves, annealing in this hot regime results in randomizing the configuration. This
suggests a modification to the basic serial algorithm which reduces the amount of high
temperature annealing to an extent sufficient to retain the optimality of the solution. White
[White 84] gives an empirical method to identify the optimum starting temperature based
on the parameters of the problem being solved. Certain assumptions are made regarding
the energy of the system, for example, the existence of finite energy maxima and energy
minima in the solution space. By using concepts from statistical thermodynamics, White
[White 84] shows that the standard deviation of the energy states defines a temperature
scale. These temperature scales identify the starting temperature to which the system
must be heated to obtain optimal solutions and also the freezing temperatures to which the
system must be cooled to get a good result. Knowledge of these temperatures tightens
the annealing schedule and results in faster annealing.
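A minimal sketch of this temperature-scale idea follows; the safety factor k is an assumed parameter introduced here for illustration, not a value prescribed by [White 84]:

```python
import statistics

def hot_temperature(delta_samples, k=20.0):
    """Temperature-scale heuristic (sketch): sample the cost changes of
    a handful of random moves on the problem instance, and set the
    starting temperature to a multiple of their standard deviation, so
    that nearly all uphill moves are initially accepted. The factor k
    is an assumed safety margin."""
    return k * statistics.pstdev(delta_samples)
```

A starting temperature large relative to the spread of energy states guarantees the initial randomizing behaviour without heating the system far beyond the point where that behaviour is already achieved.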
Rejectionless methods: In a standard annealing algorithm, every move must be evaluated
in its entirety before its acceptance criterion is determined. Rejected moves, therefore,
result in a waste of computation. Greene and Supowit [Greene 84] propose a
modification of the simulated annealing algorithm which involves fewer rejected moves.
The move proposal stage is biased towards moves which will be eventually accepted. For
each move, a value is stored which is a weighted function of the change in cost it causes.
Figure 2-6: Variation of the Cost Function During Annealing
A move is selected with a probability given by a function of this value. This is followed by
regular updating of the state. As can be expected, this modification does not yield any
improvement in computation time over the basic algorithm at high temperatures, when the
acceptance rate of moves is high. However, at low temperatures, when only a small
percentage of moves are accepted, significant speed-ups are obtained. A crossover point
is determined, specific to the problem being annealed, and the selection of moves is
changed to the rejectionless method dynamically during annealing after this crossover
point. Range limiters, which are employed in several annealing algorithms, use a broadly
similar concept for their operation. Rejectionless methods are especially attractive when
the time to calculate the expected change to the objective function due to a move is
considerably less than the time to evaluate a move in its entirety. Greene [Greene 84]
uses this approach for a logic partitioning problem where it is easy to quickly evaluate
the expected change in the move. More complex problems such as floorplanning are not
amenable to this technique since there is no method to quickly establish expected
changes caused by moves.
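The biased move-proposal stage can be sketched as a weighted selection over the expected cost changes of candidate moves; using the Metropolis acceptance probability as the weight is an illustrative choice:

```python
import math
import random

def pick_move(deltas, temperature):
    """Rejectionless-style selection (sketch): choose among candidate
    moves with probability proportional to their Metropolis acceptance
    weight, so that moves destined to be rejected are rarely proposed
    at all."""
    weights = [min(1.0, math.exp(-d / temperature)) for d in deltas]
    total = sum(weights)
    r = random.random() * total
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(deltas) - 1
```

At low temperatures the weights of strongly uphill moves vanish, so almost every proposed move is one that will be accepted; this is precisely the regime where the method pays off.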
Optimal Annealing Schedules: Choice of good annealing schedules increases the rate of
convergence of simulated annealing. Annealing schedules have been proposed with
optimal starting and stopping temperatures, temperature decrements, and thermal
equilibrium criteria. Huang et al. [Huang 86] have proposed an annealing schedule which
optimizes each of these parameters of the schedule to get higher performance. Their
starting temperature is effectively infinite since they accept every move. They determine
the next temperature by the assumption that at that temperature any configuration whose
cost is worse by 3σ than that of the present configuration must be accepted with a very high
probability, where σ is the standard deviation of energy states at this temperature. Since
thermal equilibrium is the establishment of a steady-state probability distribution of the
states of the system, the proposed annealing schedule identifies thermal equilibrium when
the ratio of the number of new states generated whose cost changes lie within a certain
fraction of σ of the average cost reaches a stable value. This speeds up the establishment
of the equilibrium condition. Results with new annealing schedules typically show about a
factor of 2 improvement in the rate of convergence.
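One decrement rule in the spirit of [Huang 86] can be sketched as follows; the control parameter lam and its default value are assumed tuning constants, not values taken from the paper:

```python
import math

def next_temperature(t, sigma, lam=0.7):
    """Huang-style temperature decrement (sketch): cool so that the
    expected decrease in average cost between temperatures stays within
    a fraction of sigma, the standard deviation of the cost at the
    current temperature. When the cost landscape is rough (large sigma)
    the schedule cools faster in absolute terms but stays within the
    same statistical distance of equilibrium."""
    return t * math.exp(-lam * t / sigma)
```

The decrement is adaptive: it is computed from statistics measured at the current temperature rather than from a fixed geometric factor.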
Error Tolerance: A recent result by Grover [Grover 86] explains why it is possible for the
simulated annealing algorithm to tolerate uncertainties. Uncertainties arise when the
evaluation of the objective function after a move has some error or ambiguity. The error
tolerance of simulated annealing implies that optimal solutions can be found even if the
exact value of the energy function can be in error by an amount ΔE. With concepts of
statistical mechanics, it is derived that the error tolerance of simulated annealing depends
very closely on the temperature of the system. It has been shown that when the error in
the energy function evaluation (ΔE) is very much smaller than the temperature T (|ΔE| ≪
T) the algorithm preserves its convergence properties in spite of the errors and converges
to a good solution. This fact can be exploited to accelerate annealing algorithms by using
fast, approximate methods of move evaluation at high temperatures instead of slow, exact
methods of evaluation. This constraint on the error tolerance denotes an upper limit for the
error tolerance; errors beyond this limit may affect the optimality and convergence of the
algorithm. This result presents a way to exploit parallelism by allowing fast, parallel
evaluations of moves with some errors.
2.3.2. Parallelism and Parallel Simulated Annealing
All the optimizations to the serial simulated annealing by way of modifications to move
computations or the annealing schedule have rarely contributed to speedups greater than
2. To obtain faster rates of convergence, efforts to accelerate simulated annealing
algorithms have been focussed more recently on the use of multiprocessors to exploit
parallelism inherent in annealing algorithms.
A close examination of simulated annealing algorithms reveals that there is potential
parallelism involved in the move evaluation process. Recent research in this area has
resulted in different ways of utilising this inherent parallelism to adapt annealing algorithms
to parallel execution on various multiprocessors. Speedups here are obtained by efficient
partitioning schemes and by the use of a large number of processors. To date, most
parallel algorithms published for simulated annealing have been implemented on shared-
memory machines. Shared-memory machines have a disadvantage in that they cannot be
trivially expanded past some fixed limits arising from processor memory bandwidth
limitations and bus limitations. Typical commercial shared-memory machines have up to
32 processors. Hypercube multiprocessors, on the other hand, are very nearly
incrementally expandable because they do not rely on global busses. Speedups are
limited almost entirely by algorithm performance. Current commercial hypercubes have 16
to 1024 processors. These considerations have prompted us to study parallel
implementations of simulated annealing on hypercube architectures. In the following two
sections we review some of the main ideas in previous parallel implementations, both on
shared-memory architectures and on message-passing architectures.
2.3.3. Shared-Memory Implementations
One of the earliest approaches to exploit parallelism in simulated annealing by Kravitz
[Kravitz 86a, Kravitz 86b] uses a shared memory multiprocessor to do standard cell
placement, and identifies different parallel partitioning strategies. They identify two basic
kinds of parallelism in simulated annealing: Parallel-moves, which involves simultaneous
evaluation of a number of moves, and move-decomposition, which consists of
decomposing a single move into subtasks each of which can be performed
simultaneously. It is noted that these two types of parallelism are essentially orthogonal:
one can perform many separate moves in parallel, and also decompose each move into
parallel subtasks.
For the Parallel-moves scheme, the concept of a Serializable subset is introduced;
moves which form a serializable subset can be evaluated in parallel due to their
non-interacting nature and give the same result as a serial evaluation of the moves in some
known order. A simple serializable subset is the set of moves consisting of one accepted
move and the rest being rejected moves. Parallel moves are implemented by evaluating
moves in parallel on all processors until the first move is accepted. The acceptance of a
move automatically aborts other parallel move evaluations. The necessary updates
corresponding to this accepted move are done and parallel move evaluations begin all
over again. Move-decomposition schemes are also employed which use functional move
decompositions that divide the entire move evaluation into functional subtasks and assign
the evaluation of each of these subtasks to different processors.
The Parallel-moves algorithm works very well at low temperatures of annealing. This is
so because at low temperatures very few moves are accepted and large serializable
subsets can be found. However, at high temperatures the functional decomposition
strategies yield better results than the parallel moves scheme. An adaptive strategy is
suggested which changes partitioning strategies during the cooling process to produce
the best speedup across the entire temperature range. Kravitz and Rutenbar [Kravitz
86a, Kravitz 86b] report speedups of about 3 for a 4-processor VAX 11/784
implementation.
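The serializable-subset scheme can be sketched as follows: several moves are evaluated against the same base state, and only the first accepted one is committed. Here the parallel workers are simulated serially, which by construction yields the same result as the genuinely concurrent version:

```python
import math
import random

def parallel_moves_step(state, propose, cost, temperature, n_workers):
    """One step of the parallel-moves scheme (sketch): up to n_workers
    moves are evaluated against the same base state; the first accepted
    move is committed and the remaining evaluations are discarded,
    which is equivalent to some serial ordering of rejections followed
    by one acceptance."""
    base = cost(state)
    for _ in range(n_workers):
        candidate = propose(state)
        delta = cost(candidate) - base
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            return candidate  # accepting aborts the other evaluations
    return state
```

At low temperatures most of the n_workers evaluations end in rejection, so the work discarded after an acceptance is small and nearly linear speedup is possible; at high temperatures the first worker usually accepts and the remaining processors contribute nothing, which is why the scheme degrades there.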
A parallel simulated annealing algorithm for macro cell placement is presented by
Casotto [Casotto 86]. Their objective function includes a wirelength estimator, a total
area estimator and a penalty function for the total overlap. A parallel moves scheme is
employed to exploit parallelism. Each processor has the responsibility to independently
propose, evaluate and accept moves pertaining to a certain set of modules. Since each
processor is evaluating moves in parallel and also accepting them asynchronously there is
some error involved in the move evaluation process. Every processor does not have
entirely correct information about the state of each module before it tries each new move,
unlike [Kravitz 86a, Kravitz 86b] who always accept only one move and throw away the
rest. The algorithm accepts all acceptable parallel moves and in the process introduces
uncertainty in the value of the objective function. Experimentally they show that such an
uncertainty does not cause any serious problems with the convergence properties of the
annealing algorithm, as predicted by the results of [Grover 86]. To force this uncertainty
to extremely small values at very low temperatures the concept of clustering cost is
introduced as part of the objective function. The clustering cost tries to force modules
which interact strongly amongst themselves to be allocated to the same processor node.
In effect, the partitioning of modules among physical processors is itself annealed, just as
the placement of cells on the chip is annealed. This clustering tries to find an optimal
partitioning of the modules that reduces the uncertainty of move evaluation by ensuring
that all the modules interacting with a move reside in the same processor; consequently,
the uncertainty in the move evaluation is reduced. Speedups of about 6 have been
reported while using 8 processors on a Sequent Balance 8000 shared-memory
multiprocessor.
Rose [Rose 86] proposes three parallel algorithms which replace different phases of
simulated annealing for a standard cell placement task. The first technique, referred to as
Heuristic Spanning, is used to entirely replace annealing in the hot regime. With the help of
some mincut based heuristics, Heuristic Spanning searches for coarse interim placements
i.e., the sort of placements found during high temperature annealing. Once the Heuristic
Spanning phase is over, the best partial solution thus obtained is selected and several
independent, low temperature annealings are done in parallel. Each processor thus tries to
improve this placement with low temperature annealing. When annealing is completed in
each processor, the best solution is accepted as the final solution. Fig.2-7 illustrates this
algorithm.
The second technique is called Multiple-Seed Collusion. Similar to the previous method,
each processor carries out annealing in parallel, independently of the other processors.
After a certain number of moves, the partial solution in each of the processors is
examined. The best solution is accepted, and this is selected as the next configuration
from which all the processors repeat the whole procedure of independent annealing. This
process, intuitively at least, enables quick identification of search paths that lead to non-
optimal solutions. These paths are then discarded from the search space, thereby
reducing the complexity of search. The granularity of this method, which is the number of
moves after which the processors synchronize to select the best partial solution amongst
them as the new seed, turns out to be an important parameter. If this parameter is too
small, the probabilistic hill climbing property is essentially destroyed, and the
Figure 2-7: Heuristic Spanning
convergence of the algorithm to optimal solutions is degraded. Also, this involves a large
interprocessor communication overhead. On the other hand, if this parameter is large the
search space is not reduced and the problem of expensive searching along non-optimal
search paths is not addressed efficiently. Rose [Rose 86] compares and contrasts these
two techniques and concludes that the Multiple-Seed Collusion method does not yield
very good solutions. Fig.2-8 illustrates the Multiple-Seed Collusion algorithm.
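One round of Multiple-Seed Collusion can be sketched as follows; anneal_n is an assumed helper representing a single worker's independent annealing run of a fixed number of moves:

```python
def multiple_seed_round(seed, n_workers, n_moves, anneal_n):
    """One synchronization round of Multiple-Seed Collusion (sketch):
    every worker anneals independently for n_moves starting from the
    shared seed configuration, then the best interim solution found by
    any worker becomes the seed for the next round. anneal_n is assumed
    to map (state, n_moves) to a (state, cost) pair."""
    results = [anneal_n(seed, n_moves) for _ in range(n_workers)]
    best_state, _ = min(results, key=lambda r: r[1])
    return best_state
```

The granularity parameter n_moves is exactly the quantity discussed above: too small and the synchronization destroys hill climbing while flooding the network, too large and the pruning of poor search paths never takes effect.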
The third approach used in [Rose 86] uses geographical partitioning of modules on
processors. Processors assume the responsibility to move only those modules which lie
in its area. Processors propose, evaluate and accept moves independently of moves
occurring on other processors. Due to this independence in move evaluation, an effort
must be made to maintain information integrity within reasonable limits. This integrity is
maintained by three different communication patterns among the processors.

Figure 2-8: Multiple-Seed Collusion

In the Gross Collusion method, the processors are always responsible for the same subset of modules.
After making a certain number of moves, all the processors send a message to the master
processor, which calculates the updated state of the system and sends it back to the
individual processors.
move is made in the Full Broadcast scheme. This scheme involves heavy message traffic
between processors and can result in significant communication overheads. To minimize
the message traffic generated by the Full Broadcast scheme the Need to Know scheme is
proposed. The Need to Know strategy involves interprocess communication only to
update the processors which need to know the update information during subsequent
move evaluations. This reduces interprocessor communication to minimal required levels.
Speedups of about 4 are reported for the Full Broadcast and the Need to Know strategies
running on a 5-processor multiprocessor.
2.3.4. Hypercube Implementations
An interesting solution of the travelling salesman problem (TSP) by simulated annealing
using a hypercube is given by Felten et al. [Felten 85]. There is no shared memory here,
and hence all synchronization is implemented by message-passing. Each processor is
assigned a set of cities and a random initial tour is chosen. A move constitutes the
swapping of the positions of the cities on the tour. This swapping can be between cities in
the same processor or between cities residing on different processors. This is followed
by a global update phase which enables the cities to redistribute themselves throughout
the hypercube. After the entire annealing process, the cities come to reside in their proper
nodes and hence they have migrated to their proper location in the tour. Evaluating
Hyperswaps, or swapping between cities residing on different processors, is done by
using the message links existing between adjacent nodes. Since the hyperswaps
represent large changes their acceptance is very low at low temperatures. They are
useful at high temperatures since they can force the system to diffuse quickly out of local
minima. At low temperatures the adjacent pair swaps are more predominant. With this
algorithm speedups of 55 for a 64 node 6 dimensional hypercube have been reported. The
TSP has a very elegant characterization, i.e., it has a simple move set and a simple
objective function. Moves do not interact very much and can be evaluated in parallel fairly
accurately, thereby maintaining information integrity in the parallel moves scheme. This
fact makes the parallel moves scheme quite successful and 86% utilization of processors
is reported.
Banerjee [Banerjee 86] presents a parallel simulated annealing algorithm for standard
cell placement on a hypercube. Their approach to the problem consists of partitioning the
modules by area amongst the processors. Moves consist of displacement moves and
swapping moves. The evaluation of these moves is performed in parallel with the help of
message passing. To help in the move evaluation, every processor also keeps all relevant
information about modules not in the area for which it is responsible. Once a move is
evaluated and its acceptance is decided, the necessary updates are made in individual
processors. Propagation of the update information to all other processors is done by the
use of a Hamiltonian circuit in the hypercube topology. This algorithm entails a very heavy
volume of message traffic which is disadvantageous. Another disadvantage of this
approach is that the communication overhead in message passing becomes very
expensive when the ratio of the communication time between processors to the
computation time on a single processor is significant. To cope with the heavy amounts of
message traffic entailed by this communication pattern a different strategy is suggested in
[Banerjee 87]. Broadcast trees are used for broadcasting information to all the
processors. Broadcast trees route broadcast messages in a hypercube topology in times
proportional to the dimension of the cube. Speedups are predicted in the range from 6 to
13 for a 6-dimensional hypercube. These speedups are predicted from simulation times on
a hypercube simulator.
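The dimension-by-dimension broadcast can be sketched as a schedule of message sends; this is the standard recursive-doubling pattern for hypercubes, not necessarily the exact trees used in [Banerjee 87]:

```python
def broadcast_schedule(dim, root=0):
    """Hypercube broadcast tree (sketch): at step k, every node that
    already holds the message forwards it to its neighbour across
    dimension k (node id XOR 2**k). A broadcast to all 2**dim nodes
    therefore completes in dim steps -- time proportional to the cube
    dimension, as noted above. Returns the list of (src, dst) sends
    performed at each step."""
    holders = {root}
    steps = []
    for k in range(dim):
        sends = [(node, node ^ (1 << k)) for node in sorted(holders)]
        holders |= {dst for _, dst in sends}
        steps.append(sends)
    return steps
```

The number of senders doubles every step (1, 2, 4, ...), which is why the schedule is logarithmic in the number of processors while a Hamiltonian-circuit propagation is linear.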
2.4. Motivation for Research
Simulated annealing is a technique which can be used to solve a variety of CAD
problems with extremely good results. However, given the extreme runtimes for typical
annealing algorithms, the need for methods to accelerate annealing techniques cannot be
overemphasized. The use of multiprocessors and parallel annealing strategies present
themselves as very interesting and useful areas of research. Most of the current research
in this area has been focussed towards parallel annealing algorithms on shared memory
machines. Our efforts in this area focus on the implementation of parallel simulated
annealing algorithms for machines with a hypercube message passing architecture.
This thesis examines different partitioning schemes to attack simulated annealing
problems on a hypercube multiprocessor. An effort has been made to identify the inherent
concurrency in a particular annealing algorithm, and deduce appropriate parallel algorithm
decomposition strategies. Since message passing primitives are our only tools for
synchronization and data sharing, we are forced to focus on optimizing the allocation of
computations and data to different processors. In an inappropriate decomposition, the
message communication overhead can sometimes become prohibitively expensive. This
fact has prompted us to study optimized patterns of message traffic for parallel annealing.
An effort has been made to partition the algorithm into large-grain subtasks to increase the
ratio of computation time to communication time.
We have chosen to implement a floorplanning algorithm as a typical application of
simulated annealing. Compared to the placement and routing problem, floorplanning has
many more degrees of freedom and hence a wider variety of solutions. The move
evaluation phase has many different subtasks, and has, therefore, a larger granularity than
the move evaluation phase for a placement or routing problem. This provides us with a
richer move set as compared to other simulated annealing problems. We have first
implemented a serial version of the floorplanner which serves as a vehicle for the parallel
implementations. The serial floorplanner that we have implemented is a "no frills"
floorplanner: it does not attempt to solve the floorplanning problem in its entirety. Instead,
we have made an attempt to capture the most important features of the floorplanning
problem which reflect the power of the simulated annealing technique, without making the
problem unnecessarily complex. The next chapter discusses the design of the serial
floorplanning algorithm.
Chapter 3
Serial Floorplanner
This research effort attempts to investigate some parallel approaches to floorplanning
using simulated annealing. The algorithms, both serial and parallel, which implement these
parallel approaches are collectively referred to as PASHA1. This chapter describes the
floorplanning algorithm which has been implemented in the serial version of PASHA.
Design considerations for this floorplanner are discussed. Simplifications of the problem,
which have been made to reduce the complexity of implementation, are critically reviewed.
This is followed by a performance evaluation of this serial implementation. Benchmarks are
run on the serial version of PASHA and the quality of the final solution is compared to that
obtained by another floorplanning program: MASON [Lapotin 85].
3.1. Approach to Floorplanning
The design for the serial version of PASHA has been influenced greatly by the macro cell
placement of Jepsen and Gelatt [Jepsen 83]. The approach that we have chosen uses a
representation of the problem which is akin to the geometric nature of the problem. We do
not use any indirect graph based representation, such as a polar graph or slicing tree, for
the layout. Instead, modules are represented by rectangles which are moved and resized
by the annealing algorithm. Annealing attempts to find an optimal arrangement for these
rectangles together with their optimal shapes and sizes.
1 PASHA: Parallel Approach to Simulated annealing on Hypercube Architectures
We now discuss a general set of objectives and constraints that characterize an ideal
floorplanner. Essential input consists of modules and their interconnections. Modules can
have varying shapes and sizes depending on the layout style of the cell. Consequently,
the specification of modules includes a list of such alternate shapes and sizes. The
objective of the floorplanning process is to choose optimal shapes and sizes for the
modules from among these prespecified alternatives. Besides their shapes and sizes, the
positions of the I/O connections on the boundaries of these modules are also variable,
depending on their internal layouts. Optimal positions of the I/O connections must be
determined during the floorplanning process to minimize the total wirelength. In addition,
some global topology constraints may also exist. These constraints, typically, force some
modules to be positioned in specific configurations; for example, we might force some
modules to be placed adjacent to other modules, or force modules to be placed only in
some fixed area of the chip. These constraints arise mainly due to I/O considerations. Bus
topology is another factor which critically affects system performance. Consequently,
optimum bus topology must also be determined, subject to a similar set of constraints.
The floorplanning area (the total acceptable area of the layout) is also usually
constrained. These constraints limit the size of the layout and may also restrict the aspect
ratios of the floorplan area. These constraints reflect fabrication and packaging
considerations. Floorplanning attempts to achieve a highly compact layout while
satisfying these constraints on the area. The layout, thus produced, must also ensure the
routability of all the nets.
In our implementation of a floorplanner, we have made a number of engineering
approximations and design judgements to reduce the complexity of the implementation
while still preserving the core of the problem. Instead of solving the floorplanning problem
in its entirety, our simplifications attempt to solve a sufficiently large subset of the actual
floorplanning problem. This subset accurately reflects the important characteristics of the
actual floorplanning problem. We shall now discuss these simplifications and their effects
on the floorplanning task.
We have chosen to calculate the net wirelengths by using the half perimeter method.
This method involves calculating the bounding box of each net (i.e., the bounding box of all
the modules which the net connects). The half perimeter of this bounding box
approximates the net wirelength. This method is chosen over other methods, such as
center-to-center wirelength evaluation and minimum Steiner tree estimations, because it
is a fairly accurate estimator of net wirelength and, more importantly, provides for faster
evaluations. Compared to the other wirelength estimation methods, this bounding box
metric always overestimates the wirelength of nets. The exact position of I/0 connections
on the boundary of each module is ignored in our simplification. This simplification affects
wirelength minimization minimally because of the overestimation of wirelength by the
bounding box metric.
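To make the metric concrete, the half-perimeter computation can be sketched as follows. The `Point` record and the function name are illustrative only; they do not reflect PASHA's actual data structures.

```c
/* One connection point per module the net touches (illustrative record). */
typedef struct { double x, y; } Point;

/* Half-perimeter wirelength of one net: build the bounding box of all
 * module positions the net connects, then return width + height. */
double half_perimeter(const Point *pins, int n)
{
    double xmin = pins[0].x, xmax = pins[0].x;
    double ymin = pins[0].y, ymax = pins[0].y;
    for (int i = 1; i < n; i++) {
        if (pins[i].x < xmin) xmin = pins[i].x;
        if (pins[i].x > xmax) xmax = pins[i].x;
        if (pins[i].y < ymin) ymin = pins[i].y;
        if (pins[i].y > ymax) ymax = pins[i].y;
    }
    return (xmax - xmin) + (ymax - ymin);
}
```

A single pass over the net's pins suffices, which is what makes this estimator faster than Steiner tree approximations.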
In an ideal floorplanner, the modules have unconstrained space in which to move around
during the annealing process. As annealing proceeds, the modules rearrange themselves
in close proximity to occupy a compact area. However, the implementation of such an
"infinite" space for the modules presents a difficult problem. Therefore, we have chosen
to represent this space as a finite area by establishing some auxiliary constraints on
moves which allow modules to move only within this finite space. The dimensions of this
area are determined as a function of the estimated area of the floorplan. The estimated
area in turn is a function of the actual module sizes. In our implementation, the size of this
"playing" space is given by c × (sum of the maximum module sizes), where the value of the
constant c is user-defined. The aspect ratio of the "playing" field is the same as the desired
aspect ratio. The choice of a "playing" field effectively produces a rigid boundary inside
which the modules are constrained to move. Moves which would take a module outside this
rigid boundary are termed illegal and are disallowed. This restriction on the floorplan area
keeps the aspect ratios of the final floorplan within reasonable limits of the desired value.
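A minimal sketch of the legality test follows, assuming a module is represented by its lower-left corner and dimensions (the `Rect` layout and field names are assumptions, not PASHA's real representation):

```c
typedef struct { double x, y, w, h; } Rect;  /* lower-left corner + size */

/* The "playing" field is a rigid boundary of size field_w x field_h.
 * A lateral shift by (dx, dy) is legal only if the shifted rectangle
 * stays entirely inside that boundary. */
int shift_is_legal(const Rect *m, double dx, double dy,
                   double field_w, double field_h)
{
    double nx = m->x + dx, ny = m->y + dy;
    return nx >= 0.0 && ny >= 0.0 &&
           nx + m->w <= field_w && ny + m->h <= field_h;
}
```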
Topology constraints, such as those which limit the position of modules to be in specific
areas of the floorplan, are not considered in our implementation. Such constraints tend to
clutter an otherwise clean characterization of the problem and hence we have avoided this
class of constraints altogether. Implementation of such constraints can be done with
minimal change in the problem representation. Future versions of PASHA will have the
capacity to tackle such constraints.
Our implementation does not allow bus constraints to be specified. Buses are unique
objects and it is desirable to handle them separately. Most floorplanning algorithms,
especially those with polar graphs and slicing trees, cannot handle different types of
objects. This is a basic drawback in their problem representation. A geometrical
representation like the one PASHA uses is easier to tailor to represent buses and bus
constraints. Another aspect of floorplanning which has not been implemented in the
current version of PASHA is the ability to handle external pins and pads.
Routing space for nets in the final floorplan is addressed by overestimating the areas of
the modules. Modules are expanded artificially just before the annealing process and
these expanded sizes are used during annealing. When annealing finally terminates and a
final floorplan is obtained, the modules are shrunk back to their original sizes. This results
in the creation of some routing space between modules. Presently, there is a user-specified
option to overestimate the size of a module by a fixed fraction of its area. A module which
has a large number of nets connected to it needs more routing space, so the amount of
overestimation in the area of the module should, consequently, be a function of the number
of nets which connect to the module. Future versions
of PASHA will incorporate this feature.
3.2. Annealing Algorithm Implementation
The design of a good annealing algorithm involves determination of essentially four
aspects of annealing: the move set, the objective function, the annealing schedule, and the
data structures. We shall briefly discuss these aspects with respect to our implementation
of an annealing algorithm for floorplanning.
3.2.1. Move Set
The move set for our annealing algorithm for floorplanning is designed to enable the
system to explore all possible degrees of freedom, and reach any feasible configuration.
The move set must specifically attempt to reconfigure the system by moving the modules
in the floorplan, and by exploring different shapes of the module. The move set for PASHA
is as follows:
Lateral shifts: These moves laterally shift the modules in any of the four compass directions. This is the most basic kind of movement the modules can make in order to rearrange themselves during the process of annealing. A movement of a module to an arbitrary location can be decomposed into at most two lateral shifts. The simplicity of this move and the possibility of decomposing all other movements into a sequence of lateral shifts prompted its inclusion in the move set.
Swap: Two modules can exchange their positions in the layout. Though this move can essentially be decomposed into a set of lateral shifts, we have incorporated it since it results in a sufficiently big perturbation to the system. Large perturbations help the system to climb out of local minima quickly, or to proceed downhill quickly.
Rotate: A module can be located in any orientation in the final floorplan. This move serves to explore optimum orientations of the modules in the floorplan. Rotation of a module is done along any of the four directions. Since we are dealing with Manhattan geometry alone, modules are only rotated in multiples of 90°.
Change size: To choose the optimum size of a module from among the specified sizes, this move simply explores alternative sizes. It is defined only for modules which have a list of alternate sizes specified. The move consists of picking a random size for the module from this list.
Fig.3-1 illustrates the different types of moves in the move set for PASHA. Moves are
Figure 3-1: Move set for PASHA (lateral shift, swap, rotate, change size)
chosen at random from this move set. However, the relative proportion in which different
types of moves are chosen from the move set is critical to the performance of the
algorithm. An empirically determined optimal proportion of move types can enhance the
convergence of the algorithm. For example, the TimberWolf package [Sechen 84] uses an
empirical ratio of 10 to 1 as the ratio of single module moves to module exchanges for
optimal results. In our implementation, we have maintained the same ratio of 10 to 1
between single module moves and module exchanges. Further, since our single module
moves comprise three different types of moves, we have chosen a ratio of 3:1:1 among
single module moves as the proportion of lateral shifts, rotates, and size changes,
respectively.
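One way to realize these proportions is to discretize them as integer weights: a 3:1:1 split of the ten single-module parts gives weights of 6:2:2 for shifts, rotates, and size changes, plus 1 for swaps, out of 11. A sketch (the enum names and the 6:2:2:1 discretization are our own reading, not PASHA's literal code):

```c
typedef enum { MOVE_SHIFT, MOVE_ROTATE, MOVE_RESIZE, MOVE_SWAP } MoveType;

/* Map a uniform random draw r in 0..10 to a move type, honoring the
 * 10:1 ratio of single-module moves to swaps and the 3:1:1 split among
 * single-module moves -- i.e. weights 6:2:2:1 out of 11. */
MoveType move_for_draw(int r)
{
    if (r < 6)  return MOVE_SHIFT;   /* 6/11: lateral shifts */
    if (r < 8)  return MOVE_ROTATE;  /* 2/11: rotations */
    if (r < 10) return MOVE_RESIZE;  /* 2/11: size changes */
    return MOVE_SWAP;                /* 1/11: module exchanges */
}
```

A caller would invoke it as `move_for_draw(rand() % 11)` each time a move is proposed.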
3.2.2. Objective Function
The objective function for the simulated annealing algorithm for floorplanning must
accurately quantify the goals of floorplanning within the framework of the various
constraints. Wirelength minimization is one of the primary objectives of floorplanning.
Consequently the objective function has a wirelength estimator. The wirelength estimator
employed in our implementation is the half perimeter metric. Floorplanning must also
attempt to pack the cells in the minimum possible area. Packaging considerations dictate
certain optimum aspect ratios of the chip area. Consequently, the floorplan area must be
optimized with respect to the total area and optimum aspect ratios. These considerations
are taken into account by the presence of an area estimator in the objective function.
An estimated area for the floorplan is calculated as the sum of the maximum possible
areas of the modules. The space in which the modules move around is made a fraction
greater than this estimated area. During the course of annealing, when the floorplan area
shrinks from the area of the "playing" field to smaller values, the floorplan aspect ratios
always remain within a fraction of the desired aspect ratios. Floorplans which have a
greater area than the estimated area can be packed into more compact layouts. On the
other hand, floorplans with areas smaller than the estimated area might imply some
residual overlaps. To account for this, the area estimator in the objective function is a
function of the difference between the floorplan area and estimated area.
The move set that we have chosen to implement perturbs the location of rectangles in
the floorplan. Such perturbations, as dictated by the move set, allow the rectangles to
overlap. Overlaps of modules in the fioorplan represent an infeasibility in the layout and
must be penalisedo The introduction of an overlap penalty function in the objective function
drives away overlaps during annealing. The TimberWolf package [’Sechen 84"1 uses a
simple overlap penalty function which is proportional to the square of the area of overlap
between modules. We have chosen to implement a more sophisticated overlap penalty
function, similar to the concept of a centre weighting function proposed in [Jepsen 83].
The motivation for implementing a centre weighting function comes from the fact that the
total overlap area is not a very good estimator of the overlap penalty since it does not
account for the position of the overlap with respect to the module. Overlaps confined to
the periphery of modules are less harmful than overlaps near their centers, but the simple
method based on total area of overlap evaluates both these cases identically. The method
we have implemented penalizes the overlap depending on its position with respect to the
module.
Our representation of an overlap weighting function consists of the construction of two
imaginary pyramids on the two overlapping modules. The bases of the pyramids are the
areas of the modules. The heights of the pyramids are equal and are user-defined. When two
modules overlap, the two pyramids intersect. To calculate the value of the centre-
weighting function, we roughly approximate the total intersected volume between the two
pyramids. This volume represents the overlap penalty. Fig.3-2 illustrates this center
weighting function for the overlap penalties.
This simple characterization yields a center weighting function which reflects the
overlap penalty more accurately. Small modules overlapping big modules are penalised
more accurately than with a simple overlap area measurement. The closer the overlap area
is to the centre of a module, the higher the overlap penalty. Consequently, overlaps are
repelled away from the centre of modules. Eventually, when annealing is complete very
few overlaps remain and the residual overlaps tend to be on the periphery of the modules
and not near the center. Fig.3-3 compares the two overlap penalty functions: the simple
overlap area estimator and our implementation of a centre-weighting function. Notice that
configurations C, D and E yield the same overlap penalty in the simple overlap area
function while the centre weighting function yields different overlap penalties. The overlap
Figure 3-2: Center Weighting Function for Overlap Cost Evaluation
penalty for configuration D is maximum due to the central overlap between the modules
while the peripheral overlaps of configuration C and E are penalised to a lesser extent. In
our implementation of the centre weighting function, when more than two modules overlap
with each other, all the pairwise overlap penalties are calculated and added to obtain the
total overlap penalty. The aggregate objective function is a simple weighted sum of the
values of the total wirelength, total area and the total overlap between modules. The
relative values of the weights attached to each aspect of the objective function are very
important. Biasing the weights towards one of the parameters yields solutions which are
optimal with respect to that parameter but non-optimal with respect to the others. These
weights must be carefully balanced so as to improve the final quality of the solution with
respect to all the parameters.
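The aggregate objective can be stated as a one-line weighted sum; the weight values shown are placeholders to be tuned per problem, as discussed above.

```c
typedef struct { double w_wire, w_area, w_overlap; } Weights;

/* Aggregate objective: weighted sum of the wirelength, area, and overlap
 * estimators.  Biasing any single weight optimizes that term at the
 * expense of the others. */
double objective(double wirelength, double area_term, double overlap,
                 const Weights *w)
{
    return w->w_wire    * wirelength
         + w->w_area    * area_term
         + w->w_overlap * overlap;
}
```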
Figure 3-3: Comparison of Overlap Penalty Functions (simple overlap area penalty function vs. centre weighting overlap penalty function, for configurations A through E)
3.2.3. Annealing Schedule
The choice of a good annealing schedule involves the determination of four essential
parameters: the starting temperature, a temperature reduction technique, a thermal
equilibrium criterion and finally the stopping criterion. For the annealing to proceed to a
globally optimal solution, the starting temperature must be sufficiently high for efficient
traversal of the search space, but not so high as to cause unnecessary and expensive
computation at high temperatures. We choose a starting temperature which is hot enough
to enable randomization of the system without unnecessary computation at high
temperatures. The algorithm dynamically determines the starting temperature for each
problem. A large number of random moves are initially proposed and evaluated. The
average value of the change in cost function due to these moves is determined. The
starting temperature is chosen such that a large percentage (~ 95%) of these moves
would be accepted. This method gives a very good estimate of the value of the starting
temperature.
The temperature of the system is lowered by a constant factor. This is implemented by
using a simple method where T_new = α × T_old (α is a constant less than 1). A value of α
greater than 0.95 gives a very conservative annealing schedule and can be very time
consuming; on the other hand, a value of 0.7 or less can result in quenching the system to
non-optimal solutions. After some experimentation we have chosen a value of 0.9 for α.
The criterion to decide thermal equilibrium is another aspect of the annealing schedule
which is highly empirical. Usually thermal equilibrium is said to be attained when a
sufficient number of moves have been tried to explore a large percentage of the search
space at that temperature. Typically, this is implemented by attempting a certain number
of moves per module. As the degrees of freedom of the problem increase more moves
must be attempted per cell to attain thermal equilibrium. Empirically, we have determined
that attempting 200 moves per module gives good results for the problems attempted.
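Putting the schedule together, a skeleton of the cooling loop looks like the following; the move body is elided, and the function simply returns the total number of moves attempted so the schedule's cost can be seen.

```c
#define ALPHA           0.9   /* geometric cooling factor */
#define MOVES_PER_CELL  200   /* attempted moves per module per temperature */

/* Skeleton of the annealing schedule: cool geometrically from t_start
 * down to the stopping temperature t_stop, attempting
 * MOVES_PER_CELL * n_modules moves at each temperature. */
long anneal_schedule(double t_start, double t_stop, int n_modules)
{
    long attempted = 0;
    for (double t = t_start; t >= t_stop; t *= ALPHA) {
        for (long i = 0; i < (long)MOVES_PER_CELL * n_modules; i++) {
            /* propose a move, evaluate the cost change,
               Metropolis accept/reject at temperature t */
            attempted++;
        }
    }
    return attempted;
}
```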
In keeping with our objective to implement a "no frills" floorplanner, this conservative
annealing schedule serves its purpose and produces good results. Many improvements
can be made to this conservative annealing schedule to speed up convergence; these have
not been implemented in the current version of PASHA but will be added in future
versions.
3.3. Performance Evaluation of the Serial Algorithm
We have implemented a serial version of PASHA, a floorplanner based on the annealing
algorithm discussed in previous sections. PASHA consists of approximately 3000 lines of
code written in C and runs under 4.2 BSD Unix. It accepts input in a very simple format,
similar to the one used by MASON [Lapotin 85]. Input consists of a list of alternative sizes
and shapes of the modules and a netlist. PASHA plots a picture of the final floorplan in
GKS format for GKS-supported graphic displays. Additional output routines are also
provided which enable the user to view intermediate configurations dynamically during
annealing.
To evaluate the quality of floorplans produced by PASHA, we have chosen three
benchmarks representing small, medium and large floorplanning problems. Benchmark A
is a small floorplanning problem with 20 modules. A medium size problem with about 40
modules is Benchmark B. Benchmark C contains 60 modules and is a large problem
obtained from industry. The weights associated with each aspect of the objective function
are tuned for each benchmark to obtain the best solution. The values of the
objective function and the CPU time are measured at the termination of annealing. Fig.3-4
shows the final floorplans for the different benchmarks. It must be mentioned that these
solutions can be further tuned, and they are presented here to demonstrate the fact that
PASHA performs reasonably as a "no frills" floorplanner. It can be observed that there are
a lot of residual overlaps in the solution for Benchmark C. Benchmark C has modules
ranging in complexity from a complete RAM to a single inverter. Consequently, modules
have widely varying sizes. Center weighting does not seem to compensate for this
problem perfectly, although we conjecture it probably works better than the simpler
overlap schemes. Typical floorplanning problems have modules with more similar
complexity and the serial version of PASHA with centre weighting is able to tackle such
problems fairly effectively. To illustrate this, we reduce the disparity of complexity among
modules in Benchmark C by combining several closely connected lower-level modules
into fewer high-level modules. This modified version yields better solutions. The modified
Benchmark C contains 32 modules of approximately equal complexity. The final solution
of this modified Benchmark is shown in Fig.3-4.
We have used MASON to compare the quality of solutions obtained by PASHA. However,
it must be noted that there are some factors which must be taken into account in making
this comparison. First, both MASON and PASHA have an extensive set of different tuning
parameters. One of the critical user-defined parameters in MASON is the relative use of
heuristic and exhaustive search methods. The wirelength metric used by MASON and
PASHA also differ: MASON uses a centre to centre approximation for the wirelength while
PASHA uses a bounding box approximation. Due to these factors, only rough comparisons
can be made between MASON and PASHA. Nevertheless, these comparisons are made to
demonstrate that PASHA gives reasonable solutions. Table 3-1 compares the wirelength
and area objective functions obtained by PASHA and MASON for the three benchmarks.
The wirelength of the final floorplan produced by MASON is processed to determine the
bounding box wirelength for the sake of comparison. It must be noted that definite
conclusions regarding the quality of solutions cannot be drawn by comparing these
values. Nevertheless, this comparison serves to establish that the solutions of PASHA are
reasonable and of comparable quality to those produced by another floorplanning tool.
It can be seen from Table 3-1 that PASHA gives solutions of comparable quality with
those produced by MASON. However, PASHA is very slow compared to MASON. MASON
uses a slicing tree approach and is, consequently, very fast. Unlike PASHA, MASON also
performs global routing of the final floorplan. On the other hand, the main advantage of
PASHA over MASON is its flexibility. It is easier to add new constraints to the objective
function in PASHA than in MASON. In addition, sometimes the most compact and optimal
floorplans cannot be represented as slicing trees. Due to the slicing tree approach,
Figure 3-4: Final Floorplans produced by PASHA (Benchmark A: 20 modules; Benchmark B: 38 modules; Benchmark C: 66 modules; Modified Benchmark C: 32 modules)
Benchmark                     Wirelength            Area
                            MASON    PASHA      MASON    PASHA
Benchmark A (20 Modules)     4538     4216      84882    81900
Benchmark B (40 Modules)     4970     4594      60800    50176

Table 3-1: Comparison between PASHA and MASON
MASON cannot find floorplans that do not have a slicing structure, whereas PASHA can
reach such a solution. To demonstrate this, we set up a synthetic problem with 9 modules
and a known non-slicing optimal packing. MASON and PASHA were run on this problem.
PASHA obtains the optimal solution for this problem, which MASON cannot obtain. Fig.3-5
illustrates the optimal solution for the synthetic benchmark and the results of PASHA and
MASON.
Figure 3-5: MASON and PASHA Solutions for a Non-Slicing Structure (optimal packing, MASON result, and PASHA result for modules 0 through 8)
We have discussed the serial implementation of PASHA in this chapter. This
implementation of an annealing algorithm for floorplanning is used as a vehicle in our
studies of parallel strategies for simulated annealing on a hypercube.
Chapter 4
Parallel Floorplanning Algorithms
This chapter deals with parallel simulated annealing algorithms for floorplanning. These
algorithms have been targeted towards implementation on a multiprocessor with a
message passing architecture, in particular, a hypercube. In this chapter we propose three
partitioning strategies for floorplanning by annealing on a hypercube. Details of these
strategies are discussed, along with a critical evaluation of their advantages and
disadvantages. We propose some approaches which modify the basic annealing algorithm
to create a greater degree of parallelism. The additional parallelism, achieved by the
introduction of error in move evaluation, is exploited for faster execution. We begin with a
brief review of hypercube architectures. This is followed by the discussion of uncertainty
in move evaluation caused by parallel move evaluation. The final sections describe three
proposed parallel floorplanning algorithms.
4.1. Hypercube Architecture
All our parallel annealing algorithms have been targeted towards a hypercube
multiprocessor. A typical hypercube is a distributed-memory, message-passing
multiprocessor: all the processors have local memory, and they synchronize their
computation by sending messages among themselves through an interconnection
network [Seitz 85]. The topology of the interconnection network is that of a hypercube,
where the nodes of the hypercube correspond to the individual processors and the edges
correspond to the message links between them. A hypercube of d dimensions consists of
2^d nodes. The nodes are tagged with binary coded integers from 0 through 2^d - 1. Two nodes
whose tags differ by exactly one bit are connected by a link, and since the tags are bit
strings of length d every node has exactly d links to other nodes. The nodes send
messages to adjacent nodes through these links. Messages sent between non-adjacent
nodes are routed through intermediate nodes until they reach their target node. Efficient
routing algorithms exist which route messages in such a way that the path length of the
message route is equal to the number of bits in which the binary tags of the source and
target nodes differ. For example, the binary tags of a source node and a target node in an
N-dimensional hypercube cannot differ by more than N bits and hence the maximum path
length for a message in this case is N. The routing is not guaranteed to be commutative,
i.e., a path from node i to node j is not necessarily the same as the path from j to i. Many
paths exist between any two nodes. These additional paths can be utilised to increase the
communication bandwidth or to enhance the fault tolerance of the hypercube. A simple
2-cube consists of four processors. A hypercube of any desired dimension can be
constructed with two hypercubes of the immediate lower dimension by connecting their
corresponding nodes. The number of interconnection links per processor, therefore,
grows only logarithmically with the number of processors. This is an advantage of the
hypercube topology, since a large number of processors can be used without prohibitively
complex interconnection networks. Fig.4-1 illustrates the topology of 2, 3 and 4-
dimensional hypercubes. Another advantage of the hypercube topology is that numerous
other network topologies such as trees, meshes, and rings can be easily mapped onto
hypercubes.
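The adjacency and routing-distance rules above reduce to a Hamming distance computation on the node tags; a minimal sketch:

```c
/* Number of bits in which two node tags differ (Hamming distance).
 * This equals the minimum routing distance between the two nodes. */
int hamming(unsigned a, unsigned b)
{
    unsigned x = a ^ b;
    int bits = 0;
    while (x) { bits += x & 1u; x >>= 1; }
    return bits;
}

/* Two hypercube nodes are joined by a link iff their tags differ in
 * exactly one bit. */
int adjacent(unsigned a, unsigned b) { return hamming(a, b) == 1; }
```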
Figure 4-1: Topology for 2, 3 and 4-Dimensional Hypercubes
4.2. Uncertainty in Parallel Move Evaluation
When moves are evaluated in parallel, each move cannot predict the changes caused by
other moves being concurrently evaluated. Sometimes it is possible for the parallel moves
to attempt to move the same object. Ambiguities in updating here must be resolved by
arbitrarily accepting one of the parallel moves while discarding all other parallel moves
which attempt to move the same object [Kravitz 86a, Kravitz 86b]. This results in wasted
move computations. One way of circumventing this problem is to use mutual exclusion
during the move generation stage; locking of objects serves the purpose of mutual
exclusion and prevents multiple parallel moves from moving the same object [Casotto 86].
To implement the locking arrangement, we use the concept of ownership of objects. Each
processor is allowed to generate, evaluate and update moves pertaining only to those
objects on which it has ownership rights. Ownership rights of a processor are unique and
each module has an exclusive owner.
There is another type of ambiguity associated with parallel move evaluation. The
evaluation of a move on a processor relies on the local state, i.e., the state of the system
as seen by that processor, to determine the cost function. Though the local evaluation is
correct, globally the move evaluation could be erroneous. This error is the cause of
uncertainty in parallel move evaluation. Grover [Grover 86] has theoretically examined the
effects of error in move evaluation on the convergence of simulated annealing algorithms.
From the equations of statistical mechanics, Grover [Grover 86] shows that as long as
the magnitude of error in the energy function, ΔE_max, is less than the temperature, the value
of the partition function, a measure of the state of the system, is not changed significantly.
An important conclusion from this is that the state of the system can tolerate errors as
long as the maximum error is less than the value of the temperature. This property allows
us to introduce errors into simulated annealing without affecting the convergence
properties of the algorithm. Clearly, the tolerance to errors is dynamically dependent on
the temperature: large errors can be tolerated at high temperatures but low temperatures
allow very small error tolerance.
This property of error tolerance is critical in the partitioning strategies we have
developed. To reduce message traffic while doing parallel moves, our strategies employ
long sequences of relatively fast, but possibly erroneous moves before global updates are
made. This property gives us a good idea about the error that can be introduced in the
annealing without loss of convergence.
4.3. Partitioning Strategy 1: Static Parallel Algorithm
In this section we present the first of the three partitioning strategies that we have
developed for PASHA. The parallel moves scheme presents itself as a fairly simple
strategy to exploit parallelism in annealing and forms the basis of this strategy.
Independent moves are performed in parallel by processors. Every parallel move is
accepted or rejected independent of the other moves. The processors update their state
depending on the moves they perform, completely independent of the moves in progress
in other processors. The state of the system in PASHA refers to the locations of modules
and the wirelengths of the nets connecting these modules. After a set of parallel moves is
performed on any processor, and the necessary updates made, the state of the system in
each processor is no longer identical. The next set of parallel moves are evaluated with
respect to different states in different processors. The state in each processor changes
with each accepted move and becomes more and more out of step with the states in the
other processors. This situation cannot be allowed to continue forever since the error in
the move evaluation successively increases; e.g., eventually every module will have been
moved, and no processor will have even an approximately correct state. To remedy this
situation and restore the integrity of the state among the processors, a global update is
performed. This update correlates the states from each processor and determines a new
global state which is then relayed to all the processors. Following each global update
phase all processors asynchronously continue parallel move evaluation.
To perform global updates, each processor sends a copy of its state to some
synchronizing processor. The synchronizing processor calculates the true global state
from the information about changes in the states done in each processor and then sends a
copy of this global state to each processor. Since this process of updating entails
messages between every processor and the synchronizing processor, the volume and
density of message traffic involved in a global update is heavy. The actual process of
determining the global state from the individual states of the processors is done using the
concept of object ownership as discussed previously. For floorplanning in PASHA, the
objects which are modified directly by moves are modules. Each processor owns a set of
modules. Ownership of a module gives a processor exclusive rights to move that module.
This implies that the change in the state in each processor is entirely due to modules
which a processor owns. This fact is used by the synchronizing node in determining a
global state of the system during a global update. Besides updating the global state of the
system, the synchronizing node also performs several other chores such as reallotment of
ownerships among modules, determination of equilibrium, and evaluation of the stopping
criterion. Note that swaps between two modules can be performed by a processor only
when both the modules being swapped are owned by that processor. Consequently,
random reallotment of ownership is performed after each global update. This enables any
two modules to be swapped eventually. If reallotment of module ownerships is not done,
the effectiveness of swaps to provide large perturbations necessary to climb out of local
minima is reduced. Our algorithm simply randomizes the ownerships of modules after
every global update. However, instead of simple random reallotment of ownership,
changes of module ownership also appear in the parallel macrocell algorithm of [Casotto
86] in such a way that strongly connected modules tend to be owned by the same
processors. This method is chosen to coerce strongly interacting modules to be owned
by the same processor during the "freezing" stages of annealing. This reduces the
number of interacting moves across processors and, consequently, reduces the error in
move evaluation at low temperatures.
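The ownership rule makes the synchronizing node's merge step simple: since only a module's owner may have moved it since the last update, the true global state takes each module's geometry from its owner's local copy. A sketch, with illustrative data structures (`local[p][m]` is processor p's copy of module m; `owner[m]` names m's owning processor):

```c
typedef struct { double x, y, w, h; } Rect;  /* module geometry */

/* Merge performed by the synchronizing node during a global update:
 * for each module, take the state reported by its owning processor. */
void merge_global_state(Rect *global, Rect **local,
                        const int *owner, int n_modules)
{
    for (int m = 0; m < n_modules; m++)
        global[m] = local[owner[m]][m];
}
```

The merged `global` array is then broadcast back to all processors, restoring a consistent state before the next sequence of parallel moves.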
Global updates result in considerable communication overheads due to the message
traffic. All processors except the synchronizing processor are idle during the global
updating which wastes computing resources. It is, therefore, desirable to reduce the
frequency of global updates. In our proposed scheme a sequence of moves is performed
by each processor before a global update. This makes global updates less frequent, but
also introduces more error into the computation since fewer updates mean that each
processor sees a less correct state of the system. The number of moves which each
processor performs before a global update is a crucial factor in determining the error in
evaluation and, consequently, the convergence of the algorithm.
To exploit an additional source of parallelism, we also introduce functional move
decomposition while evaluating individual moves. The task of a move evaluation is shared
between two processors. In every such pair of processors, one processor proposes a
move, evaluates a small part of the move, and decides whether the move is accepted, while
the complementary processor in the pair performs the remaining subtasks of move
evaluation. Since the first processor controls the move evaluation by actually proposing
and accepting moves it is referred to as the master processor. The complementary
processor is called the slave processor. During a global update it is the master processor
which sends out the local state to the synchronizing processor and receives the updated
global state. This global state is subsequently passed on to the slave processor.
We divide the entire hypercube into pairs of adjacent nodes. Each master-slave pair is
chosen to reside on physically adjacent nodes of the hypercube. Since they form a unit of
move computation, the message traffic between the two processors in the pair is relatively
high and keeping them as adjacent nodes reduces the communication overhead. The
hypercube topology enables us to divide the cube into such pairs of adjacent processors.
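Because two tags that differ in exactly one bit label adjacent hypercube nodes, pairing each node with the node whose tag differs in the lowest bit always yields adjacent pairs. A sketch of this convention (the function names are illustrative, not from the PASHA source):

```c
/* Pair each hypercube node with the node whose binary tag differs in
 * the lowest bit; such nodes are always physically adjacent. By
 * convention the even-tagged node of each pair acts as master and
 * its odd-tagged neighbour as slave. (Illustrative sketch.) */
int partner_of(int tag) { return tag ^ 1; }
int is_master(int tag)  { return (tag & 1) == 0; }
```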
We also chose not to employ a separate synchronizing node. Instead, one of the master
processors does double duty as a master processor as well as a synchronizing node. The
processor pairs are static and do not vary during execution. Moreover, the functional
subtasks performed by the processors also remain static and do not change during
annealing. Hence, we refer to this strategy as the static parallel algorithm. Fig.4-2
illustrates the static parallel algorithm on a 3-dimensional hypercube in which each
master-slave pair performs N local moves before a global update.
[Figure: each master-slave pair completes n moves; the synchronizing node then updates the global state.]
Figure 4-2: Static Parallel Algorithm on a 3-Dimensional Hypercube
4.4. Partitioning Strategy 2: Simple Pipeline Algorithm
Pipelining is a fundamental strategy used in increasing the throughput of any task. This
section examines pipelining with a similar objective: to increase the throughput of move
computation. We consider processors that are arranged logically and physically in a
contiguous sequence to form a pipeline. As a move propagates from the beginning to the
end of the pipeline, every stage performs a unique subtask of the move. The move
computation is completed in its entirety at the end of the pipeline. The first stage of the
pipeline proposes the move, the last stage of the pipeline performs the update
corresponding to the move, and the intermediate stages perform intermediate parts of the
move evaluation. Every move is broken into a number of functional subtasks and, hence,
this division of subtasks among the stages of the pipeline is a functional decomposition.
After a set of moves has been completed, only the last stage in the pipeline has the
updated state.
Due to the decomposition of the moves into smaller subtasks the computation time of
each subtask is small. Typical subtasks of a move in PASHA take a few milliseconds,
whereas the communication time for a message between adjacent nodes can be a
millisecond. Hence, the communication time between adjacent processors in the pipeline
can become comparable to the computation time of a subtask in a stage. This increases
the message communication overhead. One method of reducing this overhead is to amortize the communication cost over several move computations by increasing the ratio of computation time to communication time. We achieve this by
grouping moves together while sending them through the pipeline. All the moves in each
group are evaluated before sending them to the next stage in the pipeline. This increases
the computation time while keeping the communications time essentially constant, thereby
reducing the communications overhead. Grouping moves is essentially a parallel moves
scheme since all the moves grouped together are evaluated independently of each other.
The length of a pipeline, in such a case where the stages perform functionally different
subtasks, cannot be increased indefinitely due to the coarse-grained parallelism of functional
decomposition. To utilize more processors, we propose to employ multiple, parallel
pipelines. Consequently, global updates must be performed to synchronize the individual
states of all the pipes. A single processor performs these synchronizing functions. Notice
that a multiple pipeline strategy is similar to the static parallel algorithm in the sense that
the pipelines each process a group of moves before globally updating all the states in all
the pipes. We employ the concept of ownership, similar to the Static Parallel algorithm, to
avoid ambiguous moves. Modules are owned by pipes, instead of individual processors.
One of the main factors in deciding the length of each pipeline is that it must be a power
of 2 to enable a clean topological division of the hypercube into an integral number of
pipes. This restriction on the length of the pipeline also makes it possible to find adjacent
processors corresponding to adjacent stages in the pipeline. The message traffic between
adjacent stages in the pipeline is especially heavy and such an arrangement reduces the
communication overhead. If pipeline lengths of 2^k are used then an arrangement is always
possible which enables the adjacent stages of the pipeline to be topologically adjacent
nodes. A move computation in PASHA can be broken into roughly four functionally
different subtasks: move proposal and updating, wirelength evaluation, overlap evaluation
and area evaluation. Consequently, we have used a pipeline with 4 stages in our pipeline
algorithm. The first stage proposes a group of moves, evaluates the wirelength parameters
for all these moves in the group and sends these moves off to the second stage. The
second stage evaluates the change in overlaps due to these moves while the first stage is
proposing the next set of moves. In a similar manner the move propagates through the third
stage of the pipeline to the fourth and final stage where the move is accepted or rejected
followed by necessary updates in the system state.
An interesting observation about such a pipeline system is that there is no direct communication between the last and first stages of the pipeline. Due to this lack of direct
communication, update information present in the final stage of the pipeline does not pass
to the other stages in the pipeline. As a result of this, some of the moves proposed by the
first stage, which are perfectly legal with respect to its copy of the system state, become
illegal with respect to the system state of the last stage. In our algorithm such moves are
rejected, regardless of their cost. As we shall describe in the next
section, we have proposed certain modifications which allow a mechanism for updating
the stages of the pipeline with the state changes made in the last stage. Fig.4-3 illustrates
the pipeline algorithm for a four dimensional hypercube.
[Figure: each pipeline performs n moves; the synchronizing node then updates the state.]
Figure 4-3: Pipeline Algorithm for a 4-Dimensional Hypercube
4.5. Partitioning Strategy 3: Modified Pipeline Algorithm
The two previously discussed parallel algorithms have one common drawback: the
global update phase in each of the algorithms represents a serial bottleneck in an
otherwise parallel algorithm. We have attempted to address this problem in our design of
another partitioning strategy: the Modified Pipeline algorithm. This is the most complex
partitioning strategy we have adopted for PASHA. This algorithm, as the name suggests, is
similar to the pipeline scheme just discussed. However, some important distinctions exist
between this algorithm and the simple pipeline algorithm. Topologically, this algorithm is
structured in such a way that the last stage of each pipeline communicates with the first
stage of the pipeline. Moreover, individual pipes are arranged in such a way that the
interconnection between neighbouring pipes itself forms a ring. The ring connection
between individual pipes enables the pipes to pass state update information among
themselves, neighbour to neighbour, to reduce the number of global updates.
An object decomposition is used to split the move computation across the different
stages of the pipeline. Unlike the simple pipeline strategy, each stage in this algorithm
owns a set of objects like nets and modules. Each stage evaluates the contribution of
each of its owned objects to the move. Moves propagate through the pipeline and are
completely evaluated when they reach the last stage of the pipeline. The first stage
proposes the move and computes some part of the move before passing it on to the next
stage. After it passes on the move information to the next stage it immediately begins the
process of proposing another move. Each stage in turn computes a part of the move and
passes the information to the next stage and waits for the next move to be sent to it from
its previous stage. When the move reaches the last stage it has been completely
evaluated. The last stage then decides whether the move is accepted. If the move is
accepted, the last stage updates the state of the system and, unlike the simple Pipeline
algorithm, passes this updated state to the first stage of the pipe. This small detail differs
from the previous pipeline algorithm where there is no mechanism, local to the pipe, to
communicate the updated system state to the other stages of the pipe. Notice that this
updated system state which is sent back to the first stage of the pipeline "percolates"
through the rest of the pipe along with new moves. This percolation of the updated state
ensures that the error in a move evaluation is never worse than that caused by a few delayed updates. The delay, obviously, is the time it takes for the state update caused by an accepted move to travel from the last stage of the pipe to every other stage in the pipeline: the length of the pipeline.
As explained in the previous section, the length of the pipe must be a power of 2.
Moreover, since our algorithm uses object decomposition to partition the move
computation across the stages of the pipe, the number of stages in the pipeline is a
function of the number of objects. Unlike the functional move decomposition across the
stages in the simple pipeline algorithm, an object decomposition enables the division of an
entire move computation into many fine-grained subtasks. However, the number of
subtasks which a move can be decomposed into depends on the number of objects
involved in the problem. Consequently, the length of the pipeline cannot always be
increased when more processors are used. In such situations multiple pipes are used. The
presence of multiple pipes forces the need for a global update phase which synchronizes
the different streams of annealing in each of the pipes. In the global update phase of the
Static Parallel and Pipeline algorithms, messages are sent to the synchronizing node by all
nodes performing moves, after which the synchronizing node transmits messages
containing the updated state back to these nodes. The global update phase, therefore,
presents a serial bottleneck which must be avoided to improve performance. We propose
a new technique for global update which essentially distributes the process of a global
update among the processors and partially mitigates the serial bottleneck. In this approach
there is no complete global update in the strictest sense of the term. Instead, we replace
some of the synchronized global updates with a partial update that is distributed in this
sense: all nodes are not updated simultaneously and the update information may be
slightly stale by the time it reaches all the other nodes. Instead of updating all other
pipes, each pipe updates only its neighbour pipe. This kind of global update is always
incomplete. As discussed earlier, we have structured the pipes such that they form a ring
by themselves. Updating neighbour pipes which are interconnected to form a ring ensures
that eventually the state change in a pipe percolates to all the other pipes. For example,
after n updates (n being the number of pipes) the update information reaches all the other
pipes. We call this type of updating delayed global updating or lazy updating. Such a
partial updating mechanism implies that there is never a complete synchronization
between the multiple streams of annealing. The advantage of lazy updating is that it is
faster than synchronized global updates since no global synchronization messages are
sent and only local update messages are sent. However, in the absence of a synchronized
update the state as seen by the processors may become completely out of step. This can
cause a runaway effect on the magnitude of the error in move evaluations. To avoid this
runaway effect a mechanism for a complete global update is also implemented and the
ratio of complete global updates to the number of lazy updates is user-defined. This ratio
can be chosen as a tradeoff between the serial bottleneck of a global update and the
runaway of the error in move evaluation. Fig.4-4 illustrates the percolation of update
information in lazy updating.
The topology of processors and their interconnections as required by this algorithm can
be very easily mapped onto a hypercube. The hypercube topology allows us to structure
the pipes in such a way that the pipes themselves form a ring. Since the last stage of every
pipeline communicates with the first stage, the topology of each pipeline should also be
such that it forms a ring interconnection. Fig.4-5 illustrates a way of structuring a 3-
dimensional hypercube network consisting of 8 processors into four pipes, each of length
2 where the pipes themselves are connected in a circular fashion.
[Figure: over time steps 1 through 8, each pipe generates a set of moves and partially evaluates them, while update information passes from pipe to pipe.]
Figure 4-4: Percolation of Update Information among Pipelines: Lazy Updating
Similar to the Static Parallel algorithm and the Simple Pipeline algorithm, the existence of
multiple pipes calls for an implementation for ownership of modules by pipes. However, in
the Modified Pipeline algorithm two levels of ownership can be distinguished. The first
level of ownership of modules is among the pipes; individual pipes possess ownership
rights over individual modules. This level of ownership, which determines which modules a
pipe can move, is referred to as the move-proposal ownership. In addition, another set of
ownership rights exists among the individual stages of a pipeline. These rights determine
[Figure: logical structure showing connections within a pipe, connections between pipes, and unused connections, together with the actual physical structure on a 3-dimensional hypercube.]
Figure 4-5: Topology for the Modified Pipeline Algorithm on a 3-Dimensional Hypercube
the objects each stage is responsible for during move computation. This level of
ownership is referred to as the move-computation ownership. The move-computation
ownership includes both nets and modules, unlike the move-proposal ownership which
includes only modules. The move-proposal ownership must be dynamic during annealing
to enable complete exploration of the search space, while the move-computation
ownership among the stages of the pipeline need not be dynamic. Fig.4-6 illustrates the
modified pipeline algorithm for a four stage pipeline decomposition in a 4-dimensional
hypercube.
4.6. Comparison of Partitioning Strategies
Table 4-1 compares and contrasts our proposed partitioning strategies with respect to three components: how individual moves are decomposed into parallel subtasks on
cooperating processors; how complete, independent parallel moves are attempted; and
how state update information is distributed to keep uncertainty about the current system
state within acceptable limits.
Figure 4-6: Modified Pipeline Algorithm for a 4-Dimensional Hypercube
Static Parallel Algorithm

Move-Decomposition:
1. Functional decomposition of moves.
2. Two processors cooperating for a single move.

Parallel-Moves:
1. Multiple processor pairs, with each processor pair evaluating individual moves.
2. Groups of moves performed by each processor before a global update.

Update:
1. Synchronized global update by a single node: node 0.
2. Local updating done in the master processor. The slave processor is updated at the next evaluation.

Simple Pipeline Algorithm

Move-Decomposition:
1. Functional decomposition of moves.
2. Four processors per pipeline cooperating for a single move evaluation. Fixed pipeline length.

Parallel-Moves:
1. Multiple pipelines, each evaluating individual moves.
2. Groups of moves performed by each pipeline before a global update.

Update:
1. Synchronized global update by a single node: node 3.
2. Local updating done only in the last stage of the pipeline. Updates not passed to other stages.

Modified Pipeline Algorithm

Move-Decomposition:
1. Object decomposition of moves.
2. Pipelines of varying length (restricted to powers of 2). Processors in a pipeline cooperate for a single move evaluation.

Parallel-Moves:
1. Multiple pipelines, each generating a move. Besides move-generation ownership of modules, ownership of nets and modules for move-computation.
2. Groups of moves performed by each pipeline before a global update.

Update:
1. Lazy updating: each pipeline updates its successive neighbour pipe in the ring and is updated, in turn, by its previous neighbour.
2. Global updates provided to synchronize the states between pipelines; performed after several lazy updates have been made to ensure convergence.
3. Local updates within a pipeline are done in the last stage. Updates are passed back to the first stage.

Table 4-1: Comparison of Partitioning Strategies
Chapter 5
Parallel Implementation
This chapter examines implementation issues for the serial and parallel versions of
PASHA. The parallel programming environment in which all the parallel algorithms have
been implemented is reviewed. Other pertinent implementation issues, such as message
passing mechanisms, parallel programming details, data structures, and debugging in a
parallel environment, are discussed.
5.1. Parallel Programming Environment
All the parallel algorithms of PASHA have been implemented on an Intel iPSC hypercube.
In this section, we shall briefly review the hardware and software of the iPSC hypercube.
This is followed by discussion of message passing mechanisms on the iPSC.
5.1.1. iPSC Hardware and Software
The Intel iPSC is a commercially available parallel computer system with a hypercube
architecture. Individual processors on the nodes of the hypercube are Intel 80286
processors, each with Intel 80287 numeric processing units and 512Kb of memory. The
iPSC machine on which PASHA is implemented is a 4-dimensional hypercube (16
processors). This machine is a large memory version with 4Mb of memory per node. An
Intel 82586 communications coprocessor takes care of most of the communications while
iPSC (Intel Personal SuperComputer) is a trademark of Intel Corp.
reducing the communications overhead on the main node processor. The host machine
(called the cube manager) is both a control processor and a user interface to the nodes.
The host machine is an Intel 310 microcomputer running the Xenix operating system.
Adjacent nodes on the hypercube are connected by bidirectional communication links. All interprocessor communication is performed by message passing over these links. There
is an exclusive communication link to the host processor from each of the node
processors.
A typical application on the hypercube consists of two different programs: the host
program and the node program. Both the programs are compiled on the cube manager and
linked to different sets of libraries. The host machine first loads the node-kernel, followed
by the object-code on each node of the hypercube. Once the object-code is loaded on
the nodes, the nodes commence execution asynchronously, coordinated and
synchronized by messages from other nodes. These messages either contain data or
control information. After completion, results are communicated to the user by shipping
them back to the host, where necessary I/O is performed. Hypercubes of all dimensions
have essentially the same topology; consequently, typical applications are written in such
a way that they can be scaled to higher dimensions at runtime.
5.1.2. iPSC Interprocessor Communication Mechanisms
All communication between the nodes of the hypercube is done solely by messages. A
message consists of a string of bytes in a message buffer. Messages are limited in the
iPSC to a maximum length of 16K bytes. Messages which are longer than 1K are
automatically split into chunks of 1K, transparent to the user, and transmitted to the
destination processor. We notice that small messages have a high communication cost
per byte. Beyond a length of 1K the communication cost sharply rises and then falls again.
This can be explained by the extra communication overhead incurred when messages,
larger than 1K bytes, are split into chunks of 1K for separate transmission. To illustrate
this, we performed a very simple experiment. Messages of user-specified length are
passed around the nodes in the form of a ring. The elapsed time for the message beginning
from the time it is generated to the time it gets back to the original node after completing a
full circle is measured. Fig.5-1 shows the results of this experiment. As explained earlier,
the sharp increase when the message length is 1K can be noticed.
[Figure: communication time per message byte (in milliseconds) versus message length (in bytes, up to 2048), for a ring of length 4.]
Figure 5-1: Message Communication Overhead of the Intel iPSC Hypercube
Sending and receiving messages by a processor must be done by executing specific
software routines. These routines can be blocking or non-blocking. A blocking routine
waits for the completion of the operation (transmission or reception) it initiated before
returning to its calling process. On the other hand, non-blocking routines simply initiate
message transmission or reception and return to their calling process without waiting for
the completion of the initiated task. Care must be taken while using non-blocking routines:
it is possible to overwrite the message buffer and corrupt its contents before
the initiated message transmission/reception is complete. We use non-blocking routines
wherever possible to obtain parallelism in execution.
5.2. Parallel Implementation Details
This section describes the implementation details of the parallel versions of PASHA.
Some efficient message passing patterns are discussed. This is followed by a discussion
of some common concerns in controlling message traffic. Data structures, used for the
serial and parallel version of PASHA, are described. Finally some mechanisms used for
debugging in a parallel environment are mentioned.
5.2.1. Efficient Message Passing Patterns
In this section we shall discuss an efficient message passing pattern to perform the
global update phase which is an essential aspect of the three parallel implementations.
During global update all node processors send a message to a single synchronizing node
which sequentially receives these messages and then, after necessary updating, transmits
the updated system states back to each node by executing a set of sequential sends.
Global update is an O(n) operation, where n is the number of nodes in the cube. To expedite this process, we use the concept of a broadcast tree, which accomplishes the task of sending messages from one node to all nodes in the cube in O(log n) time.
If the binary tag of a node is x, then its neighbours in an n-dimensional hypercube can be determined by the following simple equation:

    x ⊕ 2^i   for i = 0, 1, 2, ..., n-1
A broadcast tree with its root at node 0 has a very simple algorithm [Brandenburg 86]. Every node sends the message only to those neighbours with tag x ⊕ 2^i such that 2^i > x.
Every node, except the root node 0, has exactly one parent from which it receives the
message and sends it to its children. Fig.5-2 illustrates a broadcast tree for a 4-
dimensional hypercube with the root at 0. The table in Fig.5-2 shows the exact sequence
in which the messages are sent according to this broadcast tree. The topology of a
Broadcast Tree with Root at Node 0

Time Step 1: 0→1
Time Step 2: 0→2, 1→3
Time Step 3: 0→4, 1→5, 2→6, 3→7
Time Step 4: 0→8, 1→9, 2→10, 3→11, 4→12, 5→13, 6→14, 7→15

Figure 5-2: Sequence of Messages for a Broadcast Tree
broadcast tree for a particular dimension remains the same and a broadcast tree with a
root other than node 0 can be easily derived from the broadcast tree with root 0.
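The forwarding rule can be sketched as a small routine that, for a node with tag x on a d-dimensional cube, computes the children to which the message must be forwarded (an illustrative sketch, not the PASHA source):

```c
/* Destinations to which node x forwards a broadcast rooted at node 0
 * on a d-dimensional hypercube: every neighbour x ^ (1 << i) with
 * (1 << i) > x. Returns the number of destinations written to dest[].
 * (Illustrative sketch.) */
int broadcast_children(int x, int d, int dest[])
{
    int i, n = 0;
    for (i = 0; i < d; i++)
        if ((1 << i) > x)
            dest[n++] = x ^ (1 << i);
    return n;
}
```

For d = 4 this reproduces the sequence in Fig.5-2: node 0 sends to 1, 2, 4, and 8, while node 3 sends only to 7 and 11.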
The topology of a hypercube allows us to organize the interconnection network of the
hypercube in a variety of ways depending on the application such as pairs of adjacent
processors for the Static Parallel Algorithm, and pipes/rings for the Pipeline and Modified
Pipeline Algorithms. Identification of these interconnection topologies is greatly facilitated by the use of a Binary Reflected Gray Code [McBryan 86]. This is a sequence of binary
numbers where each number differs from its neighbours by one bit in its binary
representation. The numbers in the BRGC are interpreted as processor numbers.
Consequently, processors with numbers adjacent in the BRGC sequence are physically
adjacent in the hypercube. The interconnection topology represented by the BRGC itself
is a ring since the BRGC sequence, by definition, wraps around, i.e., the last number in the
sequence is the neighbour of the first number in the sequence. Other interconnection
topologies can be deduced from this sequence. The Binary Reflected Gray Code (BRGC)
sequence used for our implementation is given in Fig.5-3 along with its representation of a
ring on the hypercube.
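The i-th BRGC entry has the well-known closed form i ^ (i >> 1), which makes both the ring and the adjacency test easy to compute (an illustrative sketch; the function names are not from the PASHA source):

```c
/* i-th entry of the Binary Reflected Gray Code: successive entries
 * differ in exactly one bit, so interpreting them as node tags
 * embeds a ring in the hypercube (the sequence wraps around). */
int brgc(int i) { return i ^ (i >> 1); }

/* Hypercube nodes are physically adjacent iff their tags differ in
 * exactly one bit, i.e. the XOR of the tags is a power of two. */
int hypercube_adjacent(int a, int b)
{
    int x = a ^ b;
    return x != 0 && (x & (x - 1)) == 0;
}
```

For a 3-dimensional cube this generates exactly the sequence of Fig.5-3: 0, 1, 3, 2, 6, 7, 5, 4, wrapping back to 0.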
5.2.2. Message Composition
All messages must be kept as short as possible for reasons discussed previously. To
ensure that messages are short, all excess message data must be reduced. This is done
by realising that some of the information about nets and modules is essentially static.
Once the information is separated into the static and dynamic categories only the dynamic
information needs to be sent as messages during annealing. The static information needs
to be passed just once at the beginning of the annealing. For example, the list of alternate
sizes that a module can take is static information, whereas the coordinates of the module are dynamic information.
Also, since message communication overhead is high, where we have the option, we
pack maximum information into a single message packet rather than send several
individual packets of information. Sending several unrelated pieces of data at the same
Binary Reflected Gray Code: 000 (0), 001 (1), 011 (3), 010 (2), 110 (6), 111 (7), 101 (5), 100 (4), wrapping around to 000 (0); interpreted as node tags, the sequence forms a ring on the hypercube.

Figure 5-3: Binary Reflected Gray Code and its Topology on a 3-Dimensional Hypercube
time is done simply by packing them into a single message. Messages are packed at the
source and unpacked at the destination. This keeps the number of messages transmitted
to a minimum and, consequently, reduces communication overhead.
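The packing itself can be sketched with a pair of helpers that copy items into a buffer at a running offset (an illustrative sketch of the idea, not the PASHA message format):

```c
#include <string.h>

/* Append len bytes from src into the message buffer at offset off;
 * returns the new offset. The destination node unpacks the items in
 * the same order with the mirror-image routine below.
 * (Illustrative sketch, not the PASHA message format.) */
size_t pack(char *buf, size_t off, const void *src, size_t len)
{
    memcpy(buf + off, src, len);
    return off + len;
}

size_t unpack(const char *buf, size_t off, void *dst, size_t len)
{
    memcpy(dst, buf + off, len);
    return off + len;
}
```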
5.2.3. Data Structures
There are three main data structures in the floorplanner: the modules structure, the nets
structure and the bin structure. The basic framework of data structures essentially
remains the same for both the serial and parallel implementations of PASHA with a few
minor modifications in each parallel implementations. A brief description of these data
structures is given below.
- Modules/Cells: Modules are basically rectangles represented by the following information:
    - The (x,y) coordinates of the lower left corner of the module.
    - The length and the width of the module.
    - An array of alternate sizes which the cell can assume, as well as the current dimensions.
    - A list of nets to which the cell is connected.
- Nets: This data structure stores the information about the nets needed for cost computation, and includes:
    - The coordinates of the bounding rectangle of the net, used to calculate the half-perimeter wirelength metric.
    - A list of modules to which the net is connected.
- Bin structure: The entire chip area is divided into a number of rectangular areas referred to as bins. Cells are represented by their four edges, and these bins keep track of the cell edges which fall in their area. The edges of the cells which lie in the area covered by a bin are sorted and stored within that bin in a circular, doubly linked list.4 The bins are implemented by splitting the entire area into vertical and horizontal strips; a two-dimensional array of pointers (vertical and horizontal) keeps track of the edges of the cells. Fig.5-4 illustrates the bin structure for a simple floorplan.
5.3. Debugging
Debugging the parallel versions of PASHA for the hypercube turned out to be a very
challenging task. All the applications were first developed on an Intel supplied hypercube
simulator running on a VAX 11/785 under 4.2BSD UNIX. The simulator uses UNIX process
creation and interprocess communication primitives to simulate processes on nodes and
communication between them. The first step in the implementation of the parallel versions
involved the implementation of the skeletal structure of the required message passing for
the particular strategy. This step enabled us to debug the message passing patterns and
to ensure, for example, that there were no potential deadlock situations. Once the
communication patterns were debugged, routines for move computation were added. Most
of these move computation routines were those used in the serial version of PASHA.
4 A version of PASHA uses a binary tree data structure instead of the doubly linked list, and will yield better performance when the average number of edges per bin is large.
Figure 5-4: Bin Data Structure
Once the application worked on the simulator, the next step was to port it to the iPSC. Debugging on the iPSC itself was minimal; primitive I/O techniques, such as printing data values into a common logfile, were used to debug the consolidated code. We note
that the absence of a distributed debugger on the hypercube is an inconvenience. The
following chapter discusses the results of the parallel implementations of PASHA.
Chapter 6
Performance Evaluation of Parallel Algorithms
This chapter evaluates the algorithms discussed in the previous chapters.
Multiprocessor experiments are performed and the results are analyzed. This chapter
begins with a brief discussion of the methodology adopted for evaluating the parallel
algorithms. This is followed by a presentation of the results which are then analyzed to
gain a critical insight into the behaviour of the parallel algorithms. The relative speedup
obtained in the parallel algorithms is presented. These algorithms are compared, with
respect to the quality of their solutions, to the serial algorithm. This is followed by a
discussion of the effect of parallelism on convergence.
6.1. Methodology
All the parallel versions of PASHA have been written in the C programming language and
have been implemented on the iPSC hypercube. The Static Parallel and the Simple Pipeline
algorithm implementations are each roughly 3000 lines of code, while the more
sophisticated Modified Pipeline algorithm consists of about 4000 lines of code.
We evaluate the comparative performance of these parallel algorithms on the small benchmark presented in Chapter 3 on page 44. This choice is motivated by the relatively short execution times required for this benchmark, which permit many experiments to be attempted. It is also, in a sense, a more challenging task than a larger benchmark, simply because it is harder to extract much parallelism from a small problem: we are annealing only 20 moveable objects on 16 processors.
The serial algorithm was tuned to a high degree for the small benchmark and these tuned
global parameters, such as the weights of the wirelength, overlap and area objective
functions, were used in the parallel algorithms.
Since annealing is a dynamic process, the time taken to perform all the moves at each temperature was recorded during the execution of all parallel annealing algorithms. A fixed number of moves is performed at each temperature; the number of moves attempted has been set to 200 times the number of modules in the problem, which was found to give good results. All the parallel algorithms are started from the same initial temperature as the serial algorithm. Their execution continues until the serial algorithm's stopping criterion, three consecutive temperatures with no change in the cost function, is satisfied. The quality of the solutions obtained is compared with the serial solution. The execution time was measured as elapsed time rather than CPU time. All the experiments on the hypercube were run with single-user status to reduce competition with other processes.
One common factor which is varied for all the parallel implementations is the number of
processors in the cube. Every algorithm has several parameters which were varied in the
course of the experiments to observe their effects on algorithm performance. For
example, the frequency of lazy updates in the Modified Pipeline algorithm is varied to
determine its effect on speedup and convergence. The number of processors and the
number of moves performed before a global update are parameters which decide the
magnitude of "parallelism" in the algorithm. A rough estimate of the "parallelism" is the
total number of moves performed by all the processors before a global update is made.
Comparison between different algorithms was performed by keeping this "parallelism"
constant. Speedup for an algorithm is a function of a variety of parameters; consequently, relative speedups are measured. We define relative speedup as the ratio of the elapsed time for the parallel algorithm running on the smallest allowable number of processors to the elapsed time for the parallel algorithm running on n processors, while keeping all the other parameters of the parallel algorithm the same.
6.2. Speedup Results
A small experiment on the serial algorithm measured the average amount of time spent in the different subtasks of accepting a move. A statistically large sample of moves was generated, evaluated and forced to be accepted. The time spent by each move in its different subtasks was then noted and averaged over the number of performed moves. The results of this experiment are shown in the pie-chart in Fig.6-1. Wirelength evaluation takes approximately 42% of the time, overlap penalty calculation takes about 32%, and area estimation takes about 3%. The remaining time is taken by the move proposal stage and the move updating stage. These times are measured for the small 20-module benchmark.
Dividing the move evaluation functionally, with different processors evaluating the wirelength, overlap and area objective functions, results in a heavily imbalanced decomposition.
Consequently, the Simple Pipeline algorithm that we have suggested is an ineffective
decomposition for this benchmark. However, the load balancing for the Static Parallel
algorithm seems to be fairly good since the master processor performs the move proposal
and wirelength evaluation tasks while the slave processor performs the overlap evaluation
and area estimation. Move updating is shared between the two; the master processor
updates the wirelength while the slave processor updates the overlaps and areas. Due to
its object decomposition of each move across the stages of a pipeline, each stage in the
Modified Pipeline algorithm performs every move evaluation subtask on the modules it
owns. The Simple Pipeline algorithm uses functional decomposition and, consequently,
different stages in the pipeline perform wirelength evaluation, area estimation, etc. Fig.6-1
clearly illustrates that load balancing for this case is not good. Consequently, we did not
perform any experiments on the Simple Pipeline algorithm.
Figure 6-1: Percentage of Time Spent in Each Move Task (Wirelength Evaluation 42%, Overlap Evaluation 32%, Move Proposal and Updating 23%, Area Evaluation 3%)
Fig.6-2 illustrates the execution times for Benchmark A with 20 modules using the Static Parallel algorithm. We vary the size of the hypercube, and vary the number of moves performed per processor before a global update. Notice that as the number of moves before a global update increases, the execution time decreases. This can be explained by the fact that the synchronization overhead of global updates is reduced by amortizing it over a large number of moves. An interesting point to note here is that applications with similar "parallelism", such as the 16-processor, 1-move-per-update case and the 8-processor, 2-moves-per-update case, have very similar execution times. We conclude that this simple measure of "parallelism", roughly defined as the product of the number of processors and the number of moves performed by each processor before a global update, is a fundamental factor in determining the speedup which can be obtained from the Static Parallel algorithm.
Figure 6-2: Execution Times for the Static Parallel Algorithm. (a): Total Execution Time (total time to floorplan, in minutes) and (b): Average Time per Temperature (in minutes), each plotted against the number of parallel moves per processor pair before a global update (1 to 9), for hypercube sizes up to 16 processors.
Fig.6-3 plots the execution times of the Modified Pipeline algorithm running the 20-module benchmark. The length of each pipeline is fixed at 1, and we vary the number of moves per pipeline before a lazy/global update. If we compare the execution times for the Static Parallel algorithm with identical parameters from Fig.6-2 with the Modified Pipeline algorithm with no lazy updates, we see that the Modified Pipeline algorithm is faster. This is due to the fact that unlike the Static Parallel algorithm, where two processors cooperate on a single move computation, the Modified Pipeline algorithm with a pipeline length of 1 has a single processor performing a complete move computation. This yields better load balancing between the processors and lower communication overheads and, consequently, faster execution times. Fig.6-3 also shows the effect of lazy updates on the
execution times. As expected, the execution time decreases with increasing number of
lazy updates, due to reduced frequency of costly global updates. Our design of lazy
updating was intended to reduce the communication overhead in global updating. In this
respect our algorithm has been quite successful, but it has been seen that complete lazy
updating does not allow convergence of the algorithm. There is a certain tradeoff between
the number of lazy updates to be used to reduce communication overheads and the
number of global synchronized updates used to preserve convergence. An interesting result is that as the number of lazy updates is increased, the execution times for large "parallelism" do not show significant improvement. However, increasing the number of
lazy updates shows significant improvement in the average time per temperature. This
anomaly can be explained by the fact that the introduction of error in annealing causes the
annealing to proceed through more temperatures before reaching an optimal solution. In
other words, with more lazy updates annealing at each temperature runs faster, but we
require more temperatures to reach the same stopping criterion.
Fig.6-4 illustrates the total execution times for the Modified Pipeline algorithm on the larger Benchmark B with 40 modules. Notice that in this example 16 processors are being used with pipeline length 1, and the maximum "parallelism" employs 8 moves in each pipeline before a lazy/global update. Even this large number of parallel moves does not appear to affect the solution drastically.
Figure 6-3: Effect of Lazy Updates in the Modified Pipeline Algorithm (8 processors, pipeline length 1). (a): Total Execution Time (in minutes) and (b): Average Time per Temperature (in minutes), each plotted against the number of parallel moves per pipe before a lazy/global update (1 to 4), for no lazy updates, a complete update after 1 lazy update, and a complete update after 3 lazy updates.
Figure 6-4: Total Execution Times for the Modified Pipeline Algorithm using Benchmark B (16 processors, pipeline length 1; total time to floorplan, in minutes, against the number of parallel moves per pipeline before a global update, with no lazy updates and with a complete update after 2 lazy updates).

Another experiment varied the length of the pipeline. It was seen that as the number of stages in the pipeline increased, execution times did not decrease as expected. In fact, the execution times increased almost linearly with larger lengths. This can be
attributed to improper load balancing between the stages of the pipeline. In particular, increased throughput in a pipeline scheme can be obtained only if the "filling" and "emptying" effects of pipelining are negligible relative to the total computation. These effects can be amortized only if a large group of moves propagates through the pipeline. Large groups of moves cannot be attempted in parallel for this benchmark since it consists of only 20 modules, so this benchmark is not suited to test this particular kind of parallelism. Fig.6-5 illustrates these results.
A larger fraction of the parallel moves attempted at high temperatures is accepted than at low temperatures. Consequently, the overhead involved in synchronizing the states after a global update must be high at high temperatures and should decrease as the temperature is lowered. This effect is illustrated in Fig.6-6, which plots the time taken per temperature for the Modified Pipeline
algorithm as a function of the temperature.

Figure 6-5: Execution Times for Different Pipeline Lengths (Modified Pipeline algorithm, 8 processors, 4 moves before a global update). (a): Total Execution Time (in minutes) and (b): Average Time per Temperature (in minutes), each plotted against the length of the pipeline (powers of 2 only).

Note that since the communication overhead is essentially constant over all temperatures, the reduction in the time per temperature
almost entirely reflects the reduction in updating overhead. The updating overhead
reflects both the updating within a stream of parallel moves and global updating between
different streams.
Figure 6-6: Variation of Time Taken per Temperature (Modified Pipeline algorithm, 8 processors; time per temperature, in seconds, plotted against temperature from 0.035 to 3500).
The relative speedup curves for the Static Parallel algorithm and the Modified Pipeline
algorithm are given in Fig.6-7. Note that for the Static Parallel Algorithm the smallest
allowable number of processors is 2.
6.3. Convergence Results
All this would be a futile exercise if it could not be demonstrated that the parallel implementations of PASHA do indeed converge to solutions of quality comparable to those obtained by the serial algorithm. Fig.6-8 tabulates the quality of results obtained for the different cases of the Static Parallel and Modified Pipeline algorithms. The quality of results is measured by the total wirelength, total area and the existence of residual overlap (i.e., module overlaps remaining at the end of annealing) as compared to the best results obtained from a tuned serial version of PASHA. Residual overlaps, if any, were always peripheral in nature and usually only between two modules. We note that the results are largely within 7-9% of the serial solutions. Due to its inherent statistical nature
Figure 6-7: Speedup for Parallel Algorithms (relative speedup against number of processors, with ideal linear speedup shown for reference). (a): Static Parallel Algorithm (6 moves before a global update, 4 to 16 processors). (b): Modified Pipeline Algorithm (6 moves before an update, 1 to 16 processors, with no lazy updates and with 1 lazy update).
the statistical sample size required for such a comparison ought to be larger, and comparisons should be made only between the distributions of results obtained from larger samples. Time constraints on this thesis forced us to use a very small sample size; nevertheless, the fact that the parallel solutions were reasonably close to the serial answers is very encouraging. In fact, in some cases they are even better than
the corresponding serial results. This can be attributed to the fact that parallel moves
enable the system to explore a greater breadth in the search space of solutions. This
observation is strengthened by the fact that small amounts of parallelism almost always
tend to give better solutions than the serial algorithm. For the sake of proper comparison, module sizes are not overestimated during annealing in either the serial or the parallel versions of PASHA.
The objective function that we use has three separate parts, and each of these components was observed to vary differently during annealing. The area reduces at early temperatures, followed a little later by the reduction in overlaps. Wirelengths tend to reduce last, during the final stages of annealing. This can also be
observed from Fig.6-8 where a majority of wirelength values for the parallel cases tend to
be higher than the serial value, indicating that the wirelength values tend to freeze out last
during annealing. Introducing error in annealing corresponds to the introduction of some uncontrolled hill climbing. Instead of using the entire cost function, we use the wirelength variation to illustrate this effect. The use of wirelength variation for this purpose is justified
by the fact that the variation in wirelength is analogous to the variation of the entire cost
function. Looking at the wirelength variation during annealing in the typical serial and
parallel cases shown in Fig.6-9, we can see that the parallel case fluctuates more than the
serial case before settling down to an optimal solution. This is equivalent to shifting the
entire temperature schedule towards lower temperatures. Temperatures in serial annealing
correspond to higher equivalent temperatures in parallel annealing.
Figure 6-8: Quality of Parallel Solutions. (a): Static Parallel Algorithm, tabulated by cube dimension and number of moves before a global update. (b): Modified Pipeline with Lazy Updates (8 processors), tabulated by type of update (complete global update, or complete update after 1, 2 or 3 lazy updates) and number of moves before an update. Each entry reports the parallel/serial wirelength ratio, the parallel/serial area ratio, and whether any residual overlap remained.
Figure 6-9: Wirelength Variation in a Serial and a Parallel Algorithm (wirelength plotted against temperature, from 0.0035 to 3500, for a typical parallel annealing curve and a serial annealing curve).
6.4. Summary
We have run several experiments on the Static Parallel algorithm and the Modified Pipeline algorithm. It was noted that to reduce the communication overhead it was essential to amortize this overhead over many move computations. Consequently, in cases where the number of moves before a global update was higher, execution times were better. As expected, for the Static Parallel algorithm the execution times improved with a greater number of processors.
The Modified Pipeline algorithm was faster than a Static Parallel algorithm running on the
same number of processors with the same degree of "parallelism". It was observed that
increasing the pipeline length in the Modified Pipeline did not reduce execution times for
this benchmark. This was due to the inability to extract large parallelism from this small
benchmark. The lazy updates in the Modified Pipeline reduced the average time per
temperature but did not significantly reduce the total execution time. A speedup by a factor of 4 was obtained with 16 processors for the Static Parallel algorithm, and a speedup by a factor of 6 was obtained for the Modified Pipeline algorithm on 16 processors with no lazy updates. Using lazy updates we obtained a speedup of 7.5 for the Modified Pipeline algorithm running on 16 processors.
All the parallel implementations yield solutions of high quality compared to the serial solutions. The number of parallel moves before a global update is crucial in determining the uncertainty in annealing and, consequently, the convergence of the
algorithm. In addition, it was found that lazy updates tend to disturb convergence to a
greater extent than global synchronized updates. The following chapter discusses the
conclusions and contributions of this thesis and identifies topics of future research in this
context.
Chapter 7
Conclusions
Annealing is a general purpose optimization method which holds great promise. The main
drawback of annealing is that it is computationally expensive. This research effort has
focused on the objective of accelerating annealing algorithms with the use of hypercube
multiprocessors. Floorplanning was chosen as a typical application of simulated
annealing. Several parallel algorithms were presented for partitioning the annealing
algorithm across the processors of a hypercube. We have shown that larger parallelism can be extracted by introducing a certain amount of error into the algorithm. We proposed some new partitioning strategies in PASHA which were implemented and tested on a 16-processor Intel iPSC hypercube. Two of these strategies were tested completely: Static
Parallel and Modified Pipeline. Results obtained show a very encouraging trend. A speedup
of 4 was obtained for the Static Parallel algorithm running on 16 processors. The Modified
Pipeline algorithm running on 16 processors yielded a speedup of roughly 6 when not
using lazy updates, while it gave a speedup of a factor of 7.5 with the use of lazy updates.
Solutions obtained by these algorithms are of comparable quality to those obtained in the serial case. This research opens up new directions towards which future work can be directed. The short-term goals prompted by this work include:
• Addition of some sophisticated annealing schedules to PASHA. We believe that the addition of sophisticated serial annealing schedules can easily improve the speed by a factor of 2.

• Evaluation of the performance of PASHA when running on other hypercube multiprocessors, such as the NCUBE hypercube, to determine the exact effect of communication time on the execution time of the algorithm.

• Minor tuning of the move evaluation tasks. This should lead to better load balancing in all the algorithms.

• Addition of new constraints to the floorplanning algorithm, such as orientation of pins on modules, bus constraints, etc.

• Improvement in the data structures to reduce move evaluation time. Already, the latest version of the serial PASHA incorporates some optimized data structures.
During the work with PASHA we have come across some areas where long-term goals can be focussed. Specifically, we feel that parallel annealing schedules form an area which deserves considerable investigation. Presently, efforts to parallelize annealing applications have used simple serial annealing schedules. Parallel simulated annealing differs in many respects from the serial algorithm, and annealing schedules which take the error caused by parallelism into account will greatly enhance the parallelism which can be exploited. Efforts must also be made to formalise parallel annealing algorithms with theoretical models. Such efforts will go a long way towards quantifying the effects of parallel moves and error on annealing.