Berkeley: Sept 15, 1999

Physical Design Challenges of Reconfigurable Computing Systems

Majid Sarrafzadeh, NuCAD

Department of ECE, Northwestern University

Ryan Kastner, Todd Haverkos, Kia Bazargan, Seda Ogrenci, Eli Bozorgzadeh, Candice McGrew

Sponsored: DARPA, Motorola, AT&T, NSF


Faculty Position

• In VLSI Design & CAD (1-2 openings)

• VLSI Design & CAD: One of the six focused research areas in the department

• Assistant/Associate/Full Professor
  – Northwestern rank: top 10
  – ECE: top 20 (top 10 in 5 years)

• Contact: [email protected]


Field Programmable Gate Array: FPGA


FPGA (Xilinx)


[Figure: degraded image and restored image]


System A: FPGA chip. The image is stored in on-chip memory; the circuit to process the image resides on the rest of the chip.

System B: FPGA chip with on-board memory, where the image is stored.

System C: FPGA chip with a host processor (the image is stored here).


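The Architecture of a Reconfigurable System

[Figure: block diagram of the CPU, the RFU, the data memory, and the instruction memory (program). The instruction stream carries CPU instructions and RFUOPs; control and data lines connect the blocks.]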


Field Programmable Gate Array: FPGA

• SRAM cells used in configuration
  – Reconfigurable (runtime)
  – Static vs. dynamic configuration
• Hardware functions implemented as rectangular areas on the FPGA

[Figure: the RFU, showing programmable logic, programmable connections, and SRAM cells]


System Components

[Figure: block diagram with the following blocks: configuration memory (config. bits), RFU manager, placement engine, cache manager, prefetch/branch prediction unit, program manager, instruction memory (program), CPU, RFU, and data memory. Edge labels: RFUOPs, CPU instructions, control, data.]


System Behavior

• Two kinds of instructions
  – CPU instructions: always run on the CPU
    • Assume known runtime
  – RFUOPs: might be performed on the CPU if there is not enough room on the RFU
    • Assume known runtime and reconfiguration time
• Runtime profiles and RFU status are used to decide between CPU and RFU
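The slide does not spell out the decision rule, so the following is only a minimal sketch of one plausible rule, assuming the known runtimes and reconfiguration time listed above. The names `RfuOp`, `choose_unit`, and `free_rfu_area`, and the use of a single area number for "room on the RFU", are hypothetical.

```python
from typing import NamedTuple


class RfuOp(NamedTuple):
    """Profile of one RFUOP (all fields are illustrative assumptions)."""
    name: str
    cpu_runtime: float    # known runtime if executed on the CPU
    rfu_runtime: float    # known runtime if executed on the RFU
    reconfig_time: float  # known time to load the configuration onto the RFU
    area: int             # RFU area required, in arbitrary cell units


def choose_unit(op: RfuOp, free_rfu_area: int) -> str:
    """Return "RFU" or "CPU" for one RFUOP.

    Assumed rule: fall back to the CPU when there is not enough room,
    otherwise use the RFU only if runtime plus reconfiguration time
    beats the CPU runtime.
    """
    if op.area > free_rfu_area:
        return "CPU"
    if op.rfu_runtime + op.reconfig_time < op.cpu_runtime:
        return "RFU"
    return "CPU"
```

A real RFU manager would also consult the cache of already loaded configurations; that is left out here.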


PD Challenges

• Problem: Given RFUOPs to be performed on the RFU and DFG constraints, schedule them in time and assign them physical locations.
• Must be very fast (mtools achieve 1000 cells per minute). Existing tools/techniques are very slow. Quality is less important.
• New PD algorithms/paradigms are needed.
• In this presentation:
  – placement,
  – routing,
  – an application on reconfigurable systems.


Firm Macros

• Not hard (too rigid), not soft (takes too much time to utilize the flexibility)
• Each unit is 80%-100% pre-designed: can “break” the macros in limited ways
• We have defined a network algebra for combining circuits (based on parameterization using VHDL generics): combine a fast and a slow adder in multiple ways




Execution of a Sample Program

Code:

    x = 3*a - b;
    …
    c = RFUOP1(x,5);
    y = 4*x - c;
    for (i=0;i<3;i++){
      x += RFUOP2(y);
      ++y;
    }
    z = RFUOP1(x,3);
    a = z - y;
    b = RFUOP3(a,b);
    c = a - b;
    …

[Figure: the code next to its DFG, with the RFU drawn on axes x, y, and t. Arrows mark which statements run on the CPU and which run on the RFU. Some RFUOPs run in parallel; where there is no room on the RFU to run all in parallel, they run in sequence.]


Placement

• On-line placement
  – RFU calls need to be executed as the program proceeds
• Off-line placement
  – Have a complete or partial profile of the operation


Online Placement

• When a new RFUOP arrives
  – Is there enough space to place the RFUOP?
  – If yes, which location is best to place it?
• Decision 1: Managing the empty space
  – Fast but sub-optimal
    • Keep only O(n) empty rectangles
    • Shorter Seg. (SSEG), Square Empty Rects. (SQR), ...
  – Efficient use of RFU real estate
    • KAMER: Keep all O(n^2) maximal empty rectangles
• Decision 2: Packing rule
  – Best Fit, Bottom Left, First Fit
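The packing rules named in Decision 2 can be sketched as follows, assuming the empty space is already available as a list of rectangles. The `Rect` type and the function names are hypothetical, and "wasted space" for Best Fit is taken here to mean leftover area, which is an assumption.

```python
from typing import List, NamedTuple, Optional


class Rect(NamedTuple):
    x: int  # bottom-left corner, x coordinate
    y: int  # bottom-left corner, y coordinate
    w: int  # width
    h: int  # height


def fits(r: Rect, w: int, h: int) -> bool:
    """True if a w x h module fits inside empty rectangle r."""
    return r.w >= w and r.h >= h


def first_fit(empty: List[Rect], w: int, h: int) -> Optional[Rect]:
    """FF: take the first empty rectangle that can hold the module."""
    for r in empty:
        if fits(r, w, h):
            return r
    return None


def best_fit(empty: List[Rect], w: int, h: int) -> Optional[Rect]:
    """BF: take the fitting rectangle that leaves the least leftover area."""
    candidates = [r for r in empty if fits(r, w, h)]
    return min(candidates, key=lambda r: r.w * r.h - w * h, default=None)


def bottom_left(empty: List[Rect], w: int, h: int) -> Optional[Rect]:
    """BL: take the fitting rectangle with the lowest bottom-left corner,
    breaking ties by picking the leftmost one."""
    candidates = [r for r in empty if fits(r, w, h)]
    return min(candidates, key=lambda r: (r.y, r.x), default=None)
```

Each function returns `None` when no empty rectangle can hold the module, which corresponds to the RFUOP not being accepted on the RFU.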


Keeping All Empty Rectangles

Keeping O(n) Empty Rectangles - SSEG

[Figure: an empty-space partition in which the annotated module cannot fit]


Heuristics for Choosing an Empty Rectangle

[Figure: current placement with empty rectangles A and B (bottom-left corners P1 and P2) + new module to be inserted = ?]

• BF (Best Fit): places the new module in the empty rectangle which causes less wasted space. Area(A) < Area(B): choose A.
• FF (First Fit): any of A or B could be chosen for placing the new module.
• BL (Bottom Left): places the new module in the rectangle with the lower bottom-left corner, breaking ties by picking the leftmost one. y(P2) < y(P1): choose B.


Heuristics for Choosing a Segment

[Figure: an empty region with two candidate dividing segments, S1 and S2, and the resulting empty rectangles labeled A, B, C, and D]

• SSEG (Shorter Seg): chooses the shorter of the two segments. Example: S1 < S2.
• LSEG (Longer Seg): chooses the longer of the two segments. Example: S1 < S2.
• BER (Balanced Empty Rects): chooses the segment which creates less area difference. Example: Area(B) - Area(A) > Area(D) - Area(C).
• LSQR (Larger Rect Square): chooses the segment which creates the larger rectangle closer to square. Example: AspectRatio(B) > AspectRatio(D).
• LER (Large Empty Rects): chooses the segment which creates the larger empty rectangle. Example: Area(B) > Area(D).
• SQR (Square Rects): chooses the segment which creates empty rectangles closer to squares. Example: Max{AR(A),AR(B)} < Max{AR(C),AR(D)}, where AR = AspectRatio.
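Four of these rules can be sketched in a few lines. The geometry below is an assumption and is not taken from the slides: a w x h module is placed in the bottom-left corner of a W x H empty rectangle, and the remaining L-shaped space is split into two rectangles by either a vertical or a horizontal segment. All function names are hypothetical.

```python
def candidate_splits(W, H, w, h):
    """Return the two ways to split the free space left by a w x h module
    placed in the bottom-left corner of a W x H empty rectangle.

    Each candidate is (segment_length, [(width, height), (width, height)]).
    """
    # Vertical segment: extends the module's right edge up to the top.
    vertical = (H - h, [(W - w, H), (w, H - h)])
    # Horizontal segment: extends the module's top edge to the right side.
    horizontal = (W - w, [(W, H - h), (W - w, h)])
    return [vertical, horizontal]


def area(rect):
    return rect[0] * rect[1]


def sseg(cands):
    """SSEG: choose the split with the shorter dividing segment."""
    return min(cands, key=lambda c: c[0])


def lseg(cands):
    """LSEG: choose the split with the longer dividing segment."""
    return max(cands, key=lambda c: c[0])


def ler(cands):
    """LER: choose the split that creates the larger empty rectangle."""
    return max(cands, key=lambda c: max(area(r) for r in c[1]))


def ber(cands):
    """BER: choose the split with the smaller area difference."""
    return min(cands, key=lambda c: abs(area(c[1][0]) - area(c[1][1])))
```

For a 10 x 8 empty rectangle and a 4 x 3 module, SSEG and BER pick the vertical split in this sketch, while LSEG and LER pick the horizontal one, which illustrates how the rules can disagree on the same input.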


Online Placement Results

Bin-Pack  Data set  KAMER  SSEG   BER    LSQR   LSEG   LER    SQR
FF        ra2048    79.25  74.26  61.52  70.36  52.83  73.87  70.36
FF        ra4096    84.59  79.1   66.84  74.39  58.37  79.49  74.73
FF        ra8192    79.71  73.39  63.23  69.87  55.87  74.88  68.11
FF        ra16384   81.35  75.08  63.59  70.42  55.73  76.13  69.38
FF        Avg(FF)   81.23  75.46  63.80  71.26  55.70  76.09  70.65
BF        ra2048    82.52  77.49  67.18  75.05  58.93  76.46  74.66
BF        ra4096    87.06  81.76  73.22  80.32  64.57  81.66  79.78
BF        ra8192    82.28  77.57  67.85  73.91  59.04  76.12  73.77
BF        ra16384   84.04  78.81  68.5   75.36  60.92  78.25  75.44
BF        Avg(BF)   83.97  78.91  69.19  76.16  60.86  78.12  75.91
BL        ra2048    81.84  76.22  61.72  73.29  55.57  76.07  71.83
BL        ra4096    86.18  81.93  70.29  78.56  62.33  81.42  78.54
BL        ra8192    81.17  75.71  65.04  72.9   59.71  76.54  72.18
BL        ra16384   83.46  77.39  64.97  74.53  58.23  78.29  73.25
BL        Avg(BL)   83.16  77.81  65.50  74.82  58.96  78.08  73.95

Table 1. Percentage of accepted modules using different bin-packing and empty space partitioning rules


Online Placement Results

[Figure: bar chart, "Penalties for different partitioning heuristics when BF is used". x-axis: partitioning heuristic (KAMER, SSEG, BER, LSQR, LSEG, LER, SQR). y-axis: penalty, 0.0E+00 to 1.8E+08, annotated "Volume that does not fit". Series: A2048, A4096, A8192, A16384. One result is annotated "BEST".]


Online Placement Results (cont.)

[Figure: bar chart, "Running Time Comparison (time to place "A16384" file)". y-axis: time (sec.), 0 to 40. Series: BF, FF, BL. Bar labels for KAMER: 35.77, 34.27, 34.74. Bar labels for SSEG: 2.23, 2.12, 2.24.]


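Off-line placement: 3-D Floorplanning

[Figure: the DFG and its schedule on the RFU and the CPU, drawn as a 3-D floorplan. The x and y axes span the RFU area and the t axis is time.]


3-D Floorplanning

By deleting this RFUOP (CPU performs the operation)...

[Figure: 3-D floorplan (axes x, y, t) with the RFU and CPU schedule]


3-D Floorplanning

[Figure: 3-D floorplan (axes x, y, t) with the RFU and CPU schedule]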


Our 3-D Floorplanner: No change in the schedule

• Pure annealing
  – Move set
    • Move operation from CPU set to RFU set
    • Move operation from RFU set to CPU set
    • Displace an already placed RFUOP on the RFU
  – Cost function: Volume
  – Very poor results
• Start with an ASAP schedule, use on-line placement to get an initial solution, then apply low-temperature annealing
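The second strategy can be illustrated with a deliberately simplified sketch. It makes several assumptions that are not in the slides: each operation is a tuple `(area, start, end)` with a fixed schedule, the RFU is modeled only by a total area `capacity` (no 2-D geometry, so the "displace" move is omitted), the penalty is the volume (area x duration) of operations left on the CPU, and the initial solution is a greedy first-come pass standing in for the on-line placer. All names and parameter values are hypothetical.

```python
import math
import random


def feasible(ops, on_rfu, capacity):
    """Area-only check: at every start time, active RFU ops must fit."""
    for i in on_rfu:
        t = ops[i][1]
        used = sum(ops[j][0] for j in on_rfu if ops[j][1] <= t < ops[j][2])
        if used > capacity:
            return False
    return True


def penalty(ops, on_rfu):
    """Volume (area x duration) of the operations that run on the CPU."""
    return sum(a * (e - s) for i, (a, s, e) in enumerate(ops) if i not in on_rfu)


def online_initial(ops, capacity):
    """Greedy stand-in for on-line placement: accept ops in start order."""
    on_rfu = set()
    for i in sorted(range(len(ops)), key=lambda k: ops[k][1]):
        if feasible(ops, on_rfu | {i}, capacity):
            on_rfu.add(i)
    return on_rfu


def anneal(ops, capacity, temperature=1.0, cooling=0.95, steps=2000, seed=0):
    """Low-temperature annealing over the CPU/RFU assignment."""
    rng = random.Random(seed)
    current = online_initial(ops, capacity)
    best = set(current)
    for _ in range(steps):
        i = rng.randrange(len(ops))
        candidate = current ^ {i}  # move op i between the CPU and RFU sets
        if feasible(ops, candidate, capacity):
            delta = penalty(ops, candidate) - penalty(ops, current)
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current = candidate
                if penalty(ops, current) < penalty(ops, best):
                    best = set(current)
        temperature = max(temperature * cooling, 1e-9)
    return best
```

Because `best` starts from the greedy solution and is only replaced by strictly better feasible assignments, the result is never worse than the initial solution under this model.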


Offline Placement Results

Algorithm      Dataset  Offline Penalty  Online Penalty  Ratio
LTSA, X=100%   T50      147287           213153          69.10%
LTSA, X=100%   T100     253566           307879          82.36%
LTSA, X=100%   S100     464049           508923          91.18%
LTSA, X=100%   S200     539435           612623          88.05%
LTSA, X=100%   A1024    427761           456627          93.68%
LTSA, X=20%    T50      148975           213153          69.89%
LTSA, X=20%    T100     225603           307879          73.28%
LTSA, X=20%    S100     287153           508923          56.42%
LTSA, X=20%    S200     359980           612623          58.76%
LTSA, X=20%    A1024    213036           456627          46.65%

Place X% of the largest-volume modules using on-line placement


Flexibility of the Modules

• The library of modules has different implementations for each RFUOP
  – Experimental results with our online algorithms show about 60% reduction in penalty.
• 3-4 implementations are enough


Faster Routing: mostly offline

VPR CAD flow:

• Inputs: technology-mapped netlist, architecture description file
• VPR
  – Place circuit or read in existing placement
  – Perform either global or combined global/detailed routing
• Outputs: placement and routing output files


Routing Algorithm (VPR)

Call VPR's router with an arbitrary channel width

• Based on the PathFinder negotiated congestion algorithm
  – Step 1: Each net is routed by the shortest path that can be found (regardless of any overuse of wiring segments).
  – Step 2: Every net in the circuit is sequentially ripped up and re-routed (by the lowest-cost path found).


Fast Pattern Routing

• Maze-based routing performs well, but it is very slow.
• So, speed up the router by partially using pattern routing.

If an arbitrary net is picked and routed differently, it does not change the result significantly.


Independent subset of nets

Two geometrically independent sets of nets

[Figure: nets drawn in two classes, Class 1 and Class 2]


Routing Patterns

[Figure: 2-terminal net patterns and multi-terminal net patterns (MST & RSTs)]

Cost = L + const / Flexibility


Implementation of Algorithm

• First choose the 2-terminal nets to route
  – More than 50% of the nets are 2-terminal nets.
  – In order to get the maximum independent sets, sort the 2-terminal nets in terms of their bounding boxes.
  – Classify the 2-terminal nets into geometrically independent classes.
  – Route the classes sequentially by pattern routing.
• Next choose the multi-terminal nets (low fan-out)
  – Route them in their corresponding RST patterns.
• Finally, let the rest of the nets be routed by the traditional router.
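The classification step for 2-terminal nets can be sketched as a greedy grouping. This is only an illustration under stated assumptions: a net is a pair of `(x, y)` terminals, "geometrically independent" is read as non-overlapping bounding boxes, nets are sorted by bounding-box area, and each net joins the first class it does not conflict with. The function names are hypothetical.

```python
def bbox(net):
    """Bounding box (xmin, ymin, xmax, ymax) of a 2-terminal net."""
    (x1, y1), (x2, y2) = net
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def overlaps(a, b):
    """True if two bounding boxes share any point (touching counts)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])


def bbox_area(net):
    xmin, ymin, xmax, ymax = bbox(net)
    return (xmax - xmin) * (ymax - ymin)


def classify(nets):
    """Greedily group nets so that bounding boxes within a class are disjoint."""
    classes = []
    for net in sorted(nets, key=bbox_area):  # assumed order: smallest box first
        box = bbox(net)
        for cls in classes:
            if all(not overlaps(box, bbox(other)) for other in cls):
                cls.append(net)
                break
        else:
            classes.append([net])  # conflicts with every class: open a new one
    return classes
```

Nets inside one class cannot compete for the same routing region under this reading, so each class could be pattern-routed without the nets in it interfering with one another.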


Experimental Results

                 VPR                                Pattern router
MCNC benchmark   channel width  WL        run time  channel width  WL     run time  speed-up %
alu4             10             18601     334.49    10             19188  273.87    23%
apex2            10             28410     830.32    11             29056  459.8     80%
apex4            11             20503     443.15    12             20137  424.6     4.4%
ex5p             12             17585     459.68    13             18020  357.65    28.5%
frisc            11             49799     1920      11             50919  1870      2.7%
diffeq           7              13796     155.45    8              13684  102.36    51.8%
dsip             7              13128     113.19    7              13363  49.24     130%
misex3           10             19557     345.59    10             20184  194.7     77.5%
pdc              15             92249     6700      17             90988  2430      175%
s298             7              19018     207.710   8              18794  74.69     178%
s3841            7              55885     1110      8              55573  332.6     234%
s38584.1         8              51658     1110      8              52610  603.74    84%
seq              10             26130     939.84    11             26694  437.84    114.5%
spla             12             59290     4030      12             60874  2350      71.5%
tseng            6              8531      96.45     6              8780   39.63     143.4%
des              8              20305     479.56    10             20439  311.62    54%
ex1010           10             63699     2400      12             62662  914.67    162.4%
bigkey           7              15808     135.94    7              16158  113.64    19.6%
average          9.3            30310.11  1122.57   10             33229  630       82.46%




Image Restoration

The value of the center pixel in the next iteration:

x_{k+1} = β*y + x_k - β*(d ** x_k)

y: the pixel value from the original degraded image

x_k: the pixel value from the previous iteration

d ** x_k denotes the weighted sum r1 * (eight neighbor pixels) + r0 * center pixel

[Figure: 3 x 3 weight mask with r0 at the center and r1 at the eight neighbors]
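The update rule above can be sketched directly on a small image stored as a list of rows. The coefficient name `beta`, the choice to treat pixels outside the image as zero, and the choice to start the iteration from the degraded image are all assumptions made for this illustration.

```python
def weighted_sum(x, i, j, r0, r1):
    """d ** x at pixel (i, j): r1 * (eight neighbor pixels) + r0 * center pixel."""
    rows, cols = len(x), len(x[0])
    total = r0 * x[i][j]
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            ni, nj = i + di, j + dj
            # Assumption: neighbors outside the image contribute nothing.
            if 0 <= ni < rows and 0 <= nj < cols:
                total += r1 * x[ni][nj]
    return total


def restore_step(x, y, beta, r0, r1):
    """One iteration: x_next = beta*y + x - beta*(d ** x), for every pixel."""
    return [
        [
            beta * y[i][j] + x[i][j] - beta * weighted_sum(x, i, j, r0, r1)
            for j in range(len(x[0]))
        ]
        for i in range(len(x))
    ]


def restore(y, beta, r0, r1, iterations):
    """Run a fixed number of iterations, starting from the degraded image y."""
    x = [row[:] for row in y]
    for _ in range(iterations):
        x = restore_step(x, y, beta, r0, r1)
    return x
```

With `r0 = 1` and `r1 = 0` the mask is the identity, so the degraded image is a fixed point of the iteration; this gives a quick sanity check of the update.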


Incentive: Processing of large images using FPGAs with limited resources

1. Segmentation of the image into smaller images suitable for the FPGA. Segments of size m x n are surrounded by an overlap of o.

2. Pixels of individual segments are restored in parallel by hardware.

3. Restored segments are written back after the overlap is discarded.

[Figure: a segment of size m x n with overlap o is read from memory into the RFU]
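The segmentation scheme can be sketched in software as follows. The function and parameter names are hypothetical, `m` is taken as the segment height and `n` as its width (the slide does not say which is which), and `restore_segment` stands in for whatever the hardware does to one padded segment.

```python
def restore_in_segments(image, m, n, o, restore_segment):
    """Restore an image segment by segment.

    Each m x n segment is extended by an overlap of o pixels on every side
    (clipped at the image border), restored on its own, and written back
    with the overlap discarded.
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, rows, m):
        for left in range(0, cols, n):
            # Bounds of the segment plus its overlap, clipped to the image.
            r0, r1 = max(0, top - o), min(rows, top + m + o)
            c0, c1 = max(0, left - o), min(cols, left + n + o)
            padded = [row[c0:c1] for row in image[r0:r1]]
            restored = restore_segment(padded)
            # Write back only the m x n core; the overlap is thrown away.
            for i in range(top, min(rows, top + m)):
                for j in range(left, min(cols, left + n)):
                    out[i][j] = restored[i - r0][j - c0]
    return out
```

The segments are processed one after another here; on the RFU the pixels of a segment would be restored in parallel, as the slide states.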


How bad is the segmentation?

• Theorem: The error introduced is about (w)^o. Example: (1/16)^2 = 1/256.
• Proof: By induction

[Figure: a segment of size m x n surrounded by an overlap of o]

Comparison of Image Qualities

[Figure: line chart. x-axis: segment sizes (8, 16, 32, 64, 128). y-axis: ISNR (dB), 1.6 to 3.2. Series: Cameraman (segmented), Cameraman (sequential), Moon (sequential), Moon (segmented).]


System A: FPGA chip. The image is stored in on-chip memory; the circuit to process the image resides on the rest of the chip.

System B: FPGA chip with on-board memory, where the image is stored.

System C: FPGA chip with a host processor (the image is stored here).


Image      Software Running Time (sec)  Running Time for System A (msec)  Running Time for System C (msec)
cameraman  4.772                        9.157                             91.960
moon       2.812                        5.725                             54.494
circle     2.987                        4.254                             42.722
animals    6.761                        8.826                             88.628
fish       7.029                        14.026                            140.850
barbara    21.741                       36.630                            367.840
yacht      12.367                       34.079                            342.227
soccer     12.360                       34.079                            342.227
announcer  13.462                       34.079                            342.227
bluegirl   10.158                       34.079                            342.227
cablecar   12.354                       34.079                            342.227
cornfield  13.458                       34.079                            342.227

Running Times of the Application on Software and on Different Systems (ignoring reconfiguration)


Conclusions

• Need a radical departure (new algorithms, etc.) from traditional PD algorithms.
• Fast (and lower quality) place & route tools
• Do as much as possible (building complex libraries, hierarchical routing, …) before compilation
• All of the above (and more) are needed to make reconfigurable computing a reality.

