Berkeley: Sept 15, 1999 1 Physical Design Challenges of Reconfigurable Computing Systems Majid...
Berkeley: Sept 15, 1999 1
Physical Design Challenges of Reconfigurable Computing Systems
Majid Sarrafzadeh
NuCAD, Department of ECE, Northwestern University
Ryan Kastner, Todd Haverkos, Kia Bazargan, Seda Ogrenci, Eli Bozorgzadeh, Candice McGrew
Sponsored: DARPA, Motorola, AT&T, NSF
Berkeley: Sept 15, 1999 2
Faculty Position
• In VLSI Design & CAD (1-2 openings)
• VLSI Design & CAD: One of the six focused research areas in the department
• Assistant/Associate/Full Professor
– Northwestern rank: top 10
– ECE: top 20 (top 10 in 5 years)
• Contact: [email protected]
Berkeley: Sept 15, 1999 7
System A: image stored in on-chip memory; the circuit to process the image resides on the rest of the FPGA chip
System B: FPGA chip with on-board memory, where the image is stored
System C: FPGA chip with a host processor (image is stored here)
Berkeley: Sept 15, 1999 8
The Architecture of a Reconfigurable System
[Figure: CPU and RFU with Data Memory and Instruction Memory (Program); the instruction stream carries CPU instructions and RFUOPs; blocks are connected by control and data paths]
Berkeley: Sept 15, 1999 9
Field Programmable Gate Array: FPGA
• SRAM cells used in configuration
– Reconfigurable (runtime)
– Static vs. dynamic configuration
• Hardware functions implemented as rectangular areas on the FPGA
[Figure: RFU made of programmable logic and programmable connections, configured by SRAM cells]
Berkeley: Sept 15, 1999 10
System Components
[Figure: block diagram of the system components. Blocks shown: Configuration Memory (config. bits, RFUOPs), RFU Manager, Placement Engine, Cache Manager, Prefetch/Branch Prediction Unit, Program Manager, Instruction Mem. (Prog.) (CPU instructions), CPU, RFU, Data Memory; connected by control and data paths]
Berkeley: Sept 15, 1999 11
System Behavior
• Two kinds of instructions
– CPU instructions => always run on CPU
• Assume known runtime
– RFUOPs, might be performed on CPU if there is not enough room on RFU
• Assume known runtime and reconfiguration time
• Runtime profiles and RFU status are used to decide between CPU and RFU
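A minimal sketch of the CPU-versus-RFU decision described above. The slides only say that runtime profiles and RFU status drive the choice; the function name `choose_target`, its parameters, and the simple "pick the smaller total time" rule are assumptions made for illustration, not the authors' policy.

```python
def choose_target(cpu_time, rfu_time, reconfig_time, rfu_has_room,
                  already_loaded=False):
    """Pick where to run one RFUOP (hypothetical decision rule).

    cpu_time      -- profiled runtime of the operation on the CPU
    rfu_time      -- profiled runtime of the operation on the RFU
    reconfig_time -- time to load the configuration onto the RFU
    rfu_has_room  -- RFU status: is there space to place the RFUOP?
    already_loaded -- assumption: a resident RFUOP needs no reconfiguration
    """
    if already_loaded:
        # No placement or reconfiguration needed, only compare runtimes.
        return "RFU" if rfu_time < cpu_time else "CPU"
    if not rfu_has_room:
        # Not enough room on the RFU: fall back to the CPU.
        return "CPU"
    # Reconfiguration time is charged to the RFU alternative.
    return "RFU" if rfu_time + reconfig_time < cpu_time else "CPU"


print(choose_target(10.0, 2.0, 3.0, rfu_has_room=True))
print(choose_target(10.0, 2.0, 9.0, rfu_has_room=True))
```

The second call falls back to the CPU because reconfiguration makes the RFU path slower overall, even though the RFU runtime alone is shorter.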
Berkeley: Sept 15, 1999 12
PD Challenges
• Problem: Given RFUOPs to be performed on the RFU and DFG constraints, schedule them in time and assign them physical locations.
• Must be very fast (mtools achieve 1000 cells per minute). Existing tools/techniques are very slow. Quality is less important.
• New PD algorithms/paradigms are needed.
• In this presentation:
– placement,
– routing,
– an application on reconfigurable systems.
Berkeley: Sept 15, 1999 13
Firm Macros
• Not hard (too rigid), not soft (takes too much time to utilize the flexibility)
• Each unit is 80%-100% pre-designed: Can “break” the macros in limited ways
• We have defined a network algebra for combining circuits (based on parameterization using VHDL generics): combine a fast and a slow adder in multiple ways
Berkeley: Sept 15, 1999 15
Execution of a Sample Program
[Figure: RFU occupancy drawn on x, y, and t axes]
x = 3*a - b;…
c = RFUOP1(x,5);
y = 4*x - c;
for (i=0;i<3;i++){
x += RFUOP2(y);
++y;
}
z = RFUOP1(x,3);
a = z - y;
b = RFUOP3(a,b);
c = a - b;
…
[Figure: Code and DFG. Arrows mark each statement as running on the CPU or on the RFU. RFUOPs run in parallel when they fit; when there is no room on the RFU to run all in parallel, they run in sequence.]
Berkeley: Sept 15, 1999 16
Placement
• On-line placement
– RFU calls need to be executed as the program proceeds
• Off-line placement
– Have a complete or partial profile of the operation
Berkeley: Sept 15, 1999 17
Online Placement
• When a new RFUOP arrives
– Is there enough space to place the RFUOP?
– If yes, which location is best to place it?
• Decision 1: Managing the empty space
– Fast but sub-optimal
• Keep only O(n) empty rectangles
– Shorter Seg. (SSEG), Square Empty Rects. (SQR), ...
– Efficient use of RFU real estate
• KAMER: Keep all O(n^2) maximal empty rectangles
• Decision 2: Packing rule
– Best Fit, Bottom Left, First Fit
Berkeley: Sept 15, 1999 18
[Figure: two panels, "Keeping All Empty Rectangles" and "Keeping O(n) Empty Rectangles - SSEG"; in the SSEG panel a module is annotated "Cannot fit this"]
Berkeley: Sept 15, 1999 19
Heuristics for Choosing an Empty Rectangle
[Figure: current placement with empty rectangles A and B (bottom-left corners P1 and P2), plus a new module to be inserted]
• BF (Best Fit): Places the new module in the empty rectangle which causes less wasted space. Area(A) < Area(B): choose A
• FF (First Fit): Any of A or B could be chosen for placing the new module.
• BL (Bottom Left): Places the new module in the rectangle with the lower bottom-left corner, breaking ties by picking the leftmost one. y(P2) < y(P1): choose B
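The three packing rules can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the rectangle encoding `(x, y, width, height)`, the function name `choose_rectangle`, and measuring "wasted space" as the empty rectangle's area minus the module's area are all assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: (x, y, width, height), (x, y) = bottom-left corner.
Rect = Tuple[int, int, int, int]


def choose_rectangle(empty: List[Rect], mod_w: int, mod_h: int,
                     rule: str) -> Optional[Rect]:
    """Return the empty rectangle selected by FF, BF, or BL, or None."""
    candidates = [r for r in empty if r[2] >= mod_w and r[3] >= mod_h]
    if not candidates:
        return None  # the module is rejected (or sent to the CPU)
    if rule == "FF":
        # First Fit: first rectangle in list order that is large enough.
        return candidates[0]
    if rule == "BF":
        # Best Fit: least wasted area after placing the module.
        return min(candidates, key=lambda r: r[2] * r[3] - mod_w * mod_h)
    if rule == "BL":
        # Bottom Left: lowest bottom-left corner, ties broken by leftmost.
        return min(candidates, key=lambda r: (r[1], r[0]))
    raise ValueError(f"unknown rule: {rule}")


empty_rects = [(0, 5, 6, 4), (7, 2, 3, 3), (0, 0, 8, 2)]
for rule in ("FF", "BF", "BL"):
    print(rule, choose_rectangle(empty_rects, 2, 2, rule))
```

With a 2 x 2 module, the three rules pick three different rectangles from the same list, which is why the choice of packing rule shows up in the acceptance percentages.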
Berkeley: Sept 15, 1999 20
Heuristics for Choosing a Segment
[Figure: the two candidate segments S1 and S2; one split produces empty rectangles A and B, the other produces C and D]
• SSEG (Shorter Seg): Chooses the shorter of the two segments. S1 < S2
• LSEG (Longer Seg): Chooses the longer of the two segments. S1 < S2
• BER (Balanced Empty Rects): Chooses the segment which creates less area difference. Area(B) - Area(A) > Area(D) - Area(C)
• LSQR (Larger Rect Square): Chooses the segment which creates the larger rectangle closer to square. AspectRatio(B) > AspectRatio(D)
• LER (Large Empty Rects): Chooses the segment which creates the larger empty rectangle. Area(B) > Area(D)
• SQR (Square Rects): Chooses the segment which creates empty rectangles closer to squares. Max{AR(A),AR(B)} < Max{AR(C),AR(D)}, AR = AspectRatio
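A hedged sketch of four of the segment rules. It assumes the module is placed in the bottom-left corner of a W x H empty rectangle, so the leftover L-shaped region can be cut along a horizontal or a vertical segment starting at the module's top-right corner. That geometry, the names `split_options` and `choose_split`, and the dictionary layout are illustrative assumptions; the slides do not spell them out.

```python
def split_options(W, H, w, h):
    """Two ways to cut the L-shape left after placing a w x h module
    in the bottom-left corner of a W x H empty rectangle.
    Each option records the cut length and the two (width, height) pieces."""
    horizontal = {"name": "horizontal", "length": W - w,
                  "rects": [(W - w, h), (W, H - h)]}
    vertical = {"name": "vertical", "length": H - h,
                "rects": [(w, H - h), (W - w, H)]}
    return horizontal, vertical


def _areas(option):
    return [a * b for a, b in option["rects"]]


def choose_split(W, H, w, h, rule):
    a, b = split_options(W, H, w, h)
    if rule == "SSEG":   # shorter cutting segment
        return min((a, b), key=lambda o: o["length"])
    if rule == "LSEG":   # longer cutting segment
        return max((a, b), key=lambda o: o["length"])
    if rule == "LER":    # split that yields the larger empty rectangle
        return max((a, b), key=lambda o: max(_areas(o)))
    if rule == "BER":    # split with the smaller area difference
        return min((a, b), key=lambda o: abs(_areas(o)[0] - _areas(o)[1]))
    raise ValueError(f"unknown rule: {rule}")


for rule in ("SSEG", "LSEG", "LER", "BER"):
    print(rule, choose_split(10, 8, 4, 3, rule)["name"])
```

Keeping only the two rectangles from the chosen split is what holds the free-space list at O(n) entries, at the price of sometimes missing a placement that the full set of maximal empty rectangles would allow.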
Berkeley: Sept 15, 1999 21
Online Placement Results
Bin-Pack  Data set   KAMER   SSEG    BER     LSQR    LSEG    LER     SQR
FF        ra2048     79.25   74.26   61.52   70.36   52.83   73.87   70.36
FF        ra4096     84.59   79.1    66.84   74.39   58.37   79.49   74.73
FF        ra8192     79.71   73.39   63.23   69.87   55.87   74.88   68.11
FF        ra16384    81.35   75.08   63.59   70.42   55.73   76.13   69.38
FF        Avg(FF)    81.23   75.46   63.80   71.26   55.70   76.09   70.65
BF        ra2048     82.52   77.49   67.18   75.05   58.93   76.46   74.66
BF        ra4096     87.06   81.76   73.22   80.32   64.57   81.66   79.78
BF        ra8192     82.28   77.57   67.85   73.91   59.04   76.12   73.77
BF        ra16384    84.04   78.81   68.5    75.36   60.92   78.25   75.44
BF        Avg(BF)    83.97   78.91   69.19   76.16   60.86   78.12   75.91
BL        ra2048     81.84   76.22   61.72   73.29   55.57   76.07   71.83
BL        ra4096     86.18   81.93   70.29   78.56   62.33   81.42   78.54
BL        ra8192     81.17   75.71   65.04   72.9    59.71   76.54   72.18
BL        ra16384    83.46   77.39   64.97   74.53   58.23   78.29   73.25
BL        Avg(BL)    83.16   77.81   65.50   74.82   58.96   78.08   73.95
Table 1. Percentage of accepted modules using different bin-packing and empty space partitioning rules
Berkeley: Sept 15, 1999 22
Online Placement Results
[Figure: Penalties for different partitioning heuristics when BF is used. x-axis: partitioning heuristic (KAMER, SSEG, BER, LSQR, LSEG, LER, SQR); y-axis: penalty, 0.0E+00 to 1.8E+08; series: A2048, A4096, A8192, A16384; annotations: "Volume that does not fit", "BEST"]
Berkeley: Sept 15, 1999 23
Online Placement Results (cont.)
[Figure: Running Time Comparison (time to place the "A16384" file), in seconds, for the BF, FF, and BL packing rules. KAMER bars: 35.77, 34.27, 34.74; SSEG bars: 2.23, 2.12, 2.24]
Berkeley: Sept 15, 1999 24
Off-line placement: 3-D Floorplanning
[Figure: DFG and its schedule on RFU and CPU; RFUOPs are drawn as boxes in a space whose x and y axes span the RFU area and whose t axis is time]
Berkeley: Sept 15, 1999 25
3-D Floorplanning
[Figure: DFG and its schedule on RFU and CPU, drawn on x, y, and t axes; annotation: "By deleting this RFUOP (CPU performs the operation)..."]
Berkeley: Sept 15, 1999 27
Our 3-D Floorplanner: No change in the schedule
• Pure annealing
– Move set
• Move operation from CPU set to RFU set
• Move operation from RFU set to CPU set
• Displace an already placed RFUOP on the RFU
– Cost function: Volume
– Very poor results
• Start with an ASAP schedule, use on-line placement to get an initial solution, then low-temperature annealing
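The following is a heavily simplified sketch of low-temperature annealing over the CPU/RFU assignment only. It ignores the 3-D geometry and the "displace" move, replaces the floorplan feasibility check with a single volume capacity, and takes the penalty to be the total volume left off the RFU. All of these, along with the name `low_temperature_anneal` and its parameters, are assumptions for illustration rather than the authors' floorplanner.

```python
import math
import random


def low_temperature_anneal(volumes, capacity, start,
                           temperature=1.0, steps=2000, seed=0):
    """Flip operations between the CPU set and the RFU set.

    volumes  -- volume (area x time) of each RFUOP
    capacity -- assumed total RFU volume available (stand-in for geometry)
    start    -- initial feasible assignment, True = on RFU
                (e.g. produced by an on-line placement pass)
    """
    rng = random.Random(seed)

    def penalty(on_rfu):
        used = sum(v for v, placed in zip(volumes, on_rfu) if placed)
        if used > capacity:
            return math.inf  # infeasible: does not fit on the RFU
        return sum(v for v, placed in zip(volumes, on_rfu) if not placed)

    current = list(start)
    cost = penalty(current)
    best, best_cost = list(current), cost
    for _ in range(steps):
        i = rng.randrange(len(current))
        current[i] = not current[i]          # move op CPU <-> RFU
        new_cost = penalty(current)
        delta = new_cost - cost
        # Metropolis rule; at low temperature uphill moves are rare.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(current), cost
        else:
            current[i] = not current[i]      # undo the rejected move
    return best, best_cost


ops = [4, 3, 2, 1]
assignment, cost = low_temperature_anneal(ops, capacity=6, start=[False] * 4)
print(assignment, cost)
```

Starting from a feasible solution and keeping the temperature low means the search mostly refines the initial placement instead of exploring freely, which matches the motivation given on the slide for abandoning pure annealing.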
Berkeley: Sept 15, 1999 28
Offline Placement Results
Place X% of the largest-volume modules using on-line placement

Algorithm      Dataset  Offline Penalty  Online Penalty  Ratio
LTSA, X=100%   T50      147287           213153          69.10%
LTSA, X=100%   T100     253566           307879          82.36%
LTSA, X=100%   S100     464049           508923          91.18%
LTSA, X=100%   S200     539435           612623          88.05%
LTSA, X=100%   A1024    427761           456627          93.68%
LTSA, X=20%    T50      148975           213153          69.89%
LTSA, X=20%    T100     225603           307879          73.28%
LTSA, X=20%    S100     287153           508923          56.42%
LTSA, X=20%    S200     359980           612623          58.76%
LTSA, X=20%    A1024    213036           456627          46.65%
Berkeley: Sept 15, 1999 29
Flexibility of the Modules
• The library of modules has different implementations for each RFUOP
– Experimental results with our online algorithms show about 60% reduction in penalty.
• 3-4 implementations are enough
Berkeley: Sept 15, 1999 30
Faster Routing: mostly offline
[Figure: VPR CAD flow. Inputs: technology-mapped netlist and architecture description file. VPR: place circuit or read in existing placement, then perform either global or combined global/detailed routing. Outputs: placement and routing output files]
Berkeley: Sept 15, 1999 31
Routing Algorithm (VPR)
Call VPR's router with an arbitrary channel width
• Based on the PathFinder negotiated congestion algorithm
Step 1: Each net is routed by the shortest path that can be found (regardless of any overuse of wiring segments)
Step 2: Sequentially rip up and re-route every net in the circuit (by the lowest-cost path found)
Berkeley: Sept 15, 1999 32
Fast Pattern Routing
• Maze-based routing gives good results but is very slow.
So,
• Speed up the router by partially using pattern routing
If an arbitrary net is picked and routed differently, the result does not change significantly.
Berkeley: Sept 15, 1999 33
Independent subset of nets
Two geometrical independent sets of nets
- Class 1
- Class 2
Berkeley: Sept 15, 1999 34
Routing Patterns
2-terminal net patterns; multi-terminal net patterns (MST & RSTs)
Cost = L + const / Flexibility
Berkeley: Sept 15, 1999 35
Implementation of Algorithm
• First choose the 2-terminal nets to route
- More than 50% of the nets are 2-terminal nets.
- In order to get the maximum independent sets, sort the 2-terminal nets by their bounding boxes.
- Classify the 2-terminal nets into geometrically independent classes
- Route the classes sequentially by pattern routing.
• Next choose the multi-terminal nets (low fan-out)
- Route them in their corresponding RST patterns
• Finally, let the rest of the nets be routed by the traditional router
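The first stage above can be sketched as follows. This is an illustration under stated assumptions: nets are pairs of grid points, two nets count as geometrically independent when their bounding boxes do not overlap, nets are sorted by bounding-box area, classes are built greedily (which does not guarantee maximum independent sets), and the only patterns generated are the two L-shapes. The names `classify` and `l_patterns` are hypothetical.

```python
def bbox(net):
    """Bounding box (xmin, ymin, xmax, ymax) of a 2-terminal net."""
    (x1, y1), (x2, y2) = net
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def overlaps(a, b):
    """True if two bounding boxes share at least one grid point."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])


def classify(nets):
    """Greedily group nets so that boxes within a class never overlap."""
    def area(net):
        x0, y0, x1, y1 = bbox(net)
        return (x1 - x0) * (y1 - y0)

    classes = []
    for net in sorted(nets, key=area):       # smallest boxes first
        for cls in classes:
            if all(not overlaps(bbox(net), bbox(other)) for other in cls):
                cls.append(net)
                break
        else:
            classes.append([net])            # start a new class
    return classes


def l_patterns(net):
    """The two L-shaped routes of a 2-terminal net, as corner lists."""
    (x1, y1), (x2, y2) = net
    return [[(x1, y1), (x2, y1), (x2, y2)],
            [(x1, y1), (x1, y2), (x2, y2)]]


nets = [((0, 0), (2, 2)), ((5, 5), (7, 6)), ((1, 1), (3, 3))]
for cls in classify(nets):
    print(cls)
print(l_patterns(nets[0]))
```

Nets in one class cannot compete for the same wiring inside their bounding boxes, so each class can be pattern-routed without the rip-up and re-route iterations that make the maze router slow.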
Berkeley: Sept 15, 1999 36
Experimental Results
                 VPR                                  PATTERN ROUTER
MCNC benchmark   channel width  WL        run time    channel width  WL     run time  speed-up %
alu4             10             18601     334.49      10             19188  273.87    23%
apex2            10             28410     830.32      11             29056  459.8     80%
apex4            11             20503     443.15      12             20137  424.6     4.4%
ex5p             12             17585     459.68      13             18020  357.65    28.5%
frisc            11             49799     1920        11             50919  1870      2.7%
diffeq           7              13796     155.45      8              13684  102.36    51.8%
dsip             7              13128     113.19      7              13363  49.24     130%
misex3           10             19557     345.59      10             20184  194.7     77.5%
pdc              15             92249     6700        17             90988  2430      175%
s298             7              19018     207.710     8              18794  74.69     178%
s3841            7              55885     1110        8              55573  332.6     234%
s38584.1         8              51658     1110        8              52610  603.74    84%
seq              10             26130     939.84      11             26694  437.84    114.5%
spla             12             59290     4030        12             60874  2350      71.5%
tseng            6              8531      96.45       6              8780   39.63     143.4%
des              8              20305     479.56      10             20439  311.62    54%
ex1010           10             63699     2400        12             62662  914.67    162.4%
bigkey           7              15808     135.94      7              16158  113.64    19.6%
average          9.3            30310.11  1122.57     10             33229  630       82.46%
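The speed-up column is not defined on the slide. Most rows appear consistent with the ratio of the two run times minus one, expressed as a percentage; that reading is an inference from the numbers, and the helper name `speedup_percent` is made up for this sketch.

```python
def speedup_percent(vpr_time, pattern_time):
    """Speed-up of the pattern router over VPR, in percent
    (assumed definition: ratio of run times minus one)."""
    return (vpr_time / pattern_time - 1.0) * 100.0


# Run times copied from the ex5p, tseng, and apex4 rows of the table.
for name, vpr, pattern in [("ex5p", 459.68, 357.65),
                           ("tseng", 96.45, 39.63),
                           ("apex4", 443.15, 424.6)]:
    print(name, round(speedup_percent(vpr, pattern), 1))
```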
Berkeley: Sept 15, 1999 38
Image Restoration
The value of the center pixel in the next iteration:
x_{k+1} = ( )*y + x_k - ( )*(d ** x_k)
y: the pixel value from the original degraded image
x_k: the pixel value from the previous iteration
d ** x_k denotes the weighted sum r1 * (eight neighbor pixels) + r0 * center pixel
[Figure: 3 x 3 weight mask with r0 at the center and r1 at the eight neighbors]
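A small sketch of one restoration iteration on a grayscale image stored as nested lists. The two scalar coefficients in the update rule did not survive in the slide text, so `beta` and `alpha` below are hypothetical parameter names standing in for them. Leaving border pixels unchanged is also an assumption.

```python
def weighted_sum(img, i, j, r0, r1):
    """d ** x at pixel (i, j): r1 * (eight neighbors) + r0 * center."""
    neighbors = sum(img[i + di][j + dj]
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)
                    if (di, dj) != (0, 0))
    return r1 * neighbors + r0 * img[i][j]


def restore_step(x, y, r0, r1, beta, alpha):
    """One iteration: x_next = beta*y + x - alpha*(d ** x).

    x -- image from the previous iteration
    y -- original degraded image
    beta, alpha -- hypothetical names for the two lost coefficients
    """
    rows, cols = len(x), len(x[0])
    nxt = [row[:] for row in x]              # borders are copied as-is
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            nxt[i][j] = (beta * y[i][j] + x[i][j]
                         - alpha * weighted_sum(x, i, j, r0, r1))
    return nxt


x0 = [[1.0] * 3 for _ in range(3)]
y0 = [[2.0] * 3 for _ in range(3)]
print(restore_step(x0, y0, r0=0.5, r1=0.0625, beta=1.0, alpha=1.0))
```

Each output pixel depends only on a 3 x 3 neighborhood of the previous iterate, which is what makes the computation a natural fit for parallel hardware.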
Berkeley: Sept 15, 1999 39
Incentive: Processing of large images using FPGAs with limited resources
1. Segmentation of the image into smaller images suitable for the FPGA
Segments of size m x n are surrounded by an overlap of o.
[Figure: segment of size m x n with overlap o]
Berkeley: Sept 15, 1999 40
2. Pixels of individual segments are restored in parallel by hardware
3. Restored segments are written back after the overlap is discarded
[Figure: segments of size m x n with overlap o are transferred between MEMORY and the RFU]
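The segmentation scheme can be sketched as below. This is a software illustration only: segments are processed one after another instead of in parallel hardware, m is taken as the number of rows and n as the number of columns, the overlap is clipped at the image border, and `restore` stands for whatever per-segment operation the RFU performs. All names are hypothetical.

```python
def segment_bounds(height, width, m, n, o):
    """Yield (core, padded) index boxes (top, left, bottom, right).

    core   -- the m x n segment that is written back
    padded -- the core grown by an overlap of o pixels, clipped to the image
    """
    for top in range(0, height, m):
        for left in range(0, width, n):
            bottom, right = min(top + m, height), min(left + n, width)
            padded = (max(top - o, 0), max(left - o, 0),
                      min(bottom + o, height), min(right + o, width))
            yield (top, left, bottom, right), padded


def process_segmented(image, m, n, o, restore):
    """Restore each padded segment, then keep only its core region."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for (t, l, b, r), (pt, pl, pb, pr) in segment_bounds(height, width, m, n, o):
        tile = [row[pl:pr] for row in image[pt:pb]]   # segment plus overlap
        restored = restore(tile)
        for i in range(t, b):
            # Discard the overlap: copy back only the core columns.
            out[i][l:r] = restored[i - pt][l - pl:r - pl]
    return out


img = [[r * 4 + c for c in range(4)] for r in range(4)]
doubled = process_segmented(img, 2, 2, 1,
                            lambda tile: [[2 * v for v in row] for row in tile])
print(doubled)
```

The overlap gives border pixels of a segment the neighbors they need during restoration; only the core is trusted and written back.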
Berkeley: Sept 15, 1999 41
How bad is the segmentation?
• Theorem: The error introduced is about (w)**o; example: (1/16)**2 = 1/256
• Proof: By induction
[Figure: segment of size m x n with overlap o]
Berkeley: Sept 15, 1999 42
[Figure: Comparison of Image Qualities. x-axis: segment sizes (8, 16, 32, 64, 128); y-axis: ISNR (dB), 1.6 to 3.2; series: Cameraman (segmented), Cameraman (sequential), Moon (sequential), Moon (segmented)]
Berkeley: Sept 15, 1999 46
Image      Software Running Time (sec)  Running Time for System A (msec)  Running Time for System C (msec)
cameraman  4.772                        9.157                             91.960
moon       2.812                        5.725                             54.494
circle     2.987                        4.254                             42.722
animals    6.761                        8.826                             88.628
fish       7.029                        14.026                            140.850
barbara    21.741                       36.630                            367.840
yacht      12.367                       34.079                            342.227
soccer     12.360                       34.079                            342.227
announcer  13.462                       34.079                            342.227
bluegirl   10.158                       34.079                            342.227
cablecar   12.354                       34.079                            342.227
cornfield  13.458                       34.079                            342.227
Running Times of the Application on Software and on Different Systems (ignoring reconfiguration)
Berkeley: Sept 15, 1999 47
Conclusions
• Need a radical departure (new algorithms, etc.) from traditional PD algorithms.
• Fast (and lower quality) place & route tools
• Do as much as possible (building complex libraries, hierarchical routing, …) before compilation
• All of the above (and more) needed to make reconfigurable computing a reality.