Architecture-Specific Packing for Virtex-5 FPGAs

Taneem Ahmed, Paul Kundarewich, Jason Anderson, Brad Taylor, Rajat Aggarwal

February 25th, 2008


Overview

• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Results
• Summary


Simplified FPGA Logic Element

[Figure: a 4-LUT with inputs A1 to A4 and output O4, feeding a flip-flop (FF).]

Simplified FPGA Logic Block

[Figure: a logic block of four 4-LUT/FF pairs, each connected to the general interconnect.]

Virtex-5 Logic Block

[Figure: a Virtex-5 CLB containing two SLICEs, each with four 6-LUT/FF pairs, connected to the general interconnect.]

Dual-Output 6-LUT

[Figure: a 6-LUT with inputs A1 to A6 and two outputs, O6 and O5.]

Dual-Output 6-LUT Usage

[Figure: the 6-LUT drawn as two 5-LUTs that share inputs A1 to A5, with input A6 and outputs O6 and O5.]

Dual-Output Packing

[Figure: LogicX (inputs x, y; output X) and LogicY (inputs a, b; output Y) are first shown in two separate 6-LUTs, then packed into the two 5-LUTs of a single dual-output 6-LUT; the packed version carries a VCC label. Number of 6-LUTs used: 2 before packing, 1 after.]
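The dual-output packing on this slide only works when both functions fit in the two 5-LUT halves of one 6-LUT. A minimal sketch of that feasibility check, assuming the two halves read the same five inputs (A1 to A5), so the merged pair may use at most five distinct signals. The function name and the representation of a LUT as a set of input net names are hypothetical, not taken from the slides.

```python
def can_share_6lut(inputs_x, inputs_y, shared_pins=5):
    """Return True if two LUT functions could occupy one dual-output 6-LUT.

    Assumption: both 5-LUT halves read the same `shared_pins` inputs,
    so the union of the two input sets must fit on those pins.
    """
    return len(set(inputs_x) | set(inputs_y)) <= shared_pins


# The slide's example: LogicX reads x, y and LogicY reads a, b (4 distinct nets).
print(can_share_6lut({"x", "y"}, {"a", "b"}))

# Two 3-input functions with no common inputs need 6 distinct nets.
print(can_share_6lut({"a", "b", "c"}, {"d", "e", "f"}))
```

Shared inputs count only once, so two larger functions can still be candidates for a merge when they overlap enough.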

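Virtex-5 LUT/FF Pair

[Figure: one LUT/FF pair of a SLICE, showing the 6-LUT outputs O6 and O5, the carry logic (CIN, CY, XOR), the F7 mux, the AX input, the flip-flop (FF, output AQ), and the A and AMUX outputs.]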

Dual-Output Packing Tradeoff

[Figure: LUT/FF pair showing the 6-LUT outputs O6 and O5 together with the AX input, the F7 mux, and the FF.]

Dual-Output Packing in Placer

• Goal: To reduce area without performance hit
  – Can be done pre-placement
    • Will be sub-optimal without delay estimates
  – Use delay estimates available during placement to make good decisions on when to merge two LUTs
• Approach:
  – Allow second 5-LUT to be used, when performance impact is small
  – Incorporate LUT packing in placer’s cost function

Placer Cost Function

• Previous cost function:
  – Cost = a * W + b * T
  – W: wirelength cost; T: timing performance cost
• Extend cost function with two new terms
  – One based on 6-LUT utilization (L)
  – One based on SLICE utilization (S)
  – Cost = a * W + b * T + c * L + d * S
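To illustrate how the two extra terms change a placer's decisions, here is a minimal sketch of the weighted sum. The slide's definitions of L and S are not reproduced here, so both are simply passed in as numbers. The helper `accept_merge`, the weights, and the sample values are hypothetical and chosen only for illustration.

```python
def placer_cost(W, T, L, S, a=1.0, b=1.0, c=0.0, d=0.0):
    """Weighted placer cost: a*W + b*T + c*L + d*S.

    W: wirelength cost, T: timing cost,
    L: 6-LUT utilization term, S: SLICE utilization term.
    With c = d = 0 this reduces to the previous cost function.
    """
    return a * W + b * T + c * L + d * S


def accept_merge(before, after, weights):
    """Hypothetical helper: accept a LUT merge if it does not raise the cost."""
    return placer_cost(*after, **weights) <= placer_cost(*before, **weights)


# Illustrative numbers only: the merge frees one 6-LUT (L: 10 -> 9)
# but slightly worsens wirelength and timing.
before = (100, 50, 10, 4)
after = (101, 52, 9, 4)

print(accept_merge(before, after, {"a": 1, "b": 1}))                  # previous cost
print(accept_merge(before, after, {"a": 1, "b": 1, "c": 5, "d": 2}))  # extended cost
```

In this made-up case the previous cost function rejects the merge, while the extended one accepts it because the saving in the L term outweighs the small wirelength and timing penalty.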


6-LUT Utilization Term

• L is computed based on all the used 6-LUT slots

• Where

SLICE Utilization Term

• S is computed based on all the available SLICEs
• Let:
  – Ni = Number of used 5-LUTs in SLICE i (at most 8)

$S = \sum_{i=0}^{m} S_i$

Performance Recovery

• It is helpful to prohibit packing in certain cases for performance reasons

• Other used elements in a SLICE may block the “good” path from the O5 output to external interconnect.


Performance Recovery: XOR

[Figure: the Virtex-5 LUT/FF pair (LUT6 outputs O6 and O5, carry logic CIN/CY/XOR, F7 mux, AX input, FF with output AQ, outputs A and AMUX), illustrating the XOR case.]

Performance Recovery: F7

[Figure: the Virtex-5 LUT/FF pair (LUT6 outputs O6 and O5, carry logic CIN/CY/XOR, F7 mux, AX input, FF with output AQ, outputs A and AMUX), illustrating the F7 case.]

6-LUT Reduction

[Chart: % 6-LUT reduction (y-axis, 0 to 16) for each benchmark design (x-axis). Annotation: 5.5% 6-LUT reduction.]

SLICE Reduction

[Chart: % SLICE reduction (y-axis, 0 to 25) for each benchmark design (x-axis). Annotation: 10.23% SLICE reduction.]

Performance Results

[Chart: performance loss (%) (y-axis, -15 to 25) versus SLICE reduction (%) (x-axis, 0 to 25). Annotation: 3.3% performance degradation.]

Overview

• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Summary


New Type of Packing Problem

• Traditionally, packing is considered to be a problem of just LUTs and flops

• However, Virtex-5 contains large IP blocks that present their own packing problem


Virtex-5 Block RAMs

[Figure: a 36 Kb RAM shown alongside two 18 Kb RAMs.]

• A 36 Kbit block RAM tile can store:
  a) a single 36 Kb RAM
  b) two independent 18 Kb RAMs
• Block RAM has configurable “aspect ratio”
• 18 Kb RAM can be configured as: 16K x 1, 8K x 2, 2K x 9, or 1K x 18
• Tools decide which independent 18 Kb block RAMs to locate in which tile

Virtex-5 DSP48E Block

• A multiply-accumulate operation, pervasive in DSP circuits, can be realized in a single DSP48E.
• Multiple DSP48Es can be chained together to form more complex functions through the PCIN and PCOUT ports

[Figure: DSP48E block diagram with inputs A (25-bit), B (18-bit), C (48-bit) and PCIN; optional pipeline register/routing logic stages; a 25x18 multiplier; routing logic; a 48-bit ALU; pattern detect; and outputs P and PCOUT.]

Block RAM and DSP Floorplan

• Block RAM and DSP48E tiles are organized in columns

[Figure: device floorplan with columns of Block RAM tiles and a column of Virtex-5 DSP tiles, each DSP tile holding two DSP48Es.]

Block RAM/DSP Packing

• Problem: Placer algorithms are heuristic and sometimes do not find an optimal block RAM packing

• Goal: Leverage preferred block RAM packing patterns to achieve high performance

• Target area: DSP designs
  – DSP designs make heavy use of block RAMs and DSP blocks

DSP Block RAM Designs

• The most common DSP application is the finite impulse response (FIR) filter
  – FIR filters have multiple instances of a “tap”, each of which involves DSP blocks and block RAMs

FIR Filter

• A finite impulse response (FIR) filter is a digital filter that takes a weighted average of the signals in a delay line
• An N-tap filter can be expressed as: y[n] = c0*x[n] + c1*x[n-1] + … + c(N-1)*x[n-N+1]
  – Where:
    • y[n] is the output of the filter at time n
    • x[n] is the data input “signal” at time n
    • ci is the i-th coefficient
• Each coefficient/data product in the sum is referred to as a “tap”
  – DSP units used for the multiply and accumulate
  – Block RAMs used to store the data and coefficients
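The N-tap sum on this slide can be sketched in software as a direct evaluation of one output sample. This is only an illustration of the arithmetic, not of the hardware mapping; the function name is hypothetical, and treating samples before time 0 as zero is an assumption.

```python
def fir_output(coeffs, samples, n):
    """Compute y[n] = sum of c_i * x[n - i] for an N-tap FIR filter.

    coeffs:  the N coefficients c_0 .. c_(N-1)
    samples: the input signal x[0], x[1], ...
    Samples before time 0 are treated as zero (an assumption).
    """
    acc = 0
    for i, c in enumerate(coeffs):
        if n - i >= 0:
            acc += c * samples[n - i]  # one "tap": multiply and accumulate
    return acc


# 2-tap example with c0 = 1, c1 = 2 and input x = 3, 4, 5.
coeffs = [1, 2]
x = [3, 4, 5]
print([fir_output(coeffs, x, n) for n in range(len(x))])
```

Each loop iteration corresponds to one tap: a multiply followed by an accumulate, which is the operation the slides assign to a DSP unit.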


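FIR Designs – Use Case 1

• 2-tap FIR filter involving small block RAMs

[Figure: data RAMs RAMD0 and RAMD1 and coefficient RAMs RAMC0 and RAMC1, each an 18 Kb block RAM, feed the A and B inputs of DSP0 (Tap 0) and DSP1 (Tap 1). The two DSPs are chained through PCOUT/PCIN between the data input and the data output. The figure also labels a 36 Kb block RAM tile.]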

Packing for Use Case 1

• Packing both 18k Block RAMs into a Block RAM tile permits a natural alignment between the DSP and Block RAMs

[Figure: floorplan in which each Block RAM tile, operating as two independent 18 Kb block RAMs, sits next to a Virtex-5 DSP tile of two DSP48Es. Annotation: High Performance!]

FIR Designs – Use Case 2

• 2-tap FIR filter involving larger block RAMs

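[Figure: DSP0 (Tap 0) and DSP1 (Tap 1), chained through PCOUT/PCIN, take their A and B inputs from data RAMs RAMD0 and RAMD1 and coefficient RAMs RAMC0 and RAMC1. The figure labels both an 18 Kb block RAM and a 36 Kb block RAM.]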

Packing for Use Case 2

• Two Block RAM columns feed one DSP column
• Again provides a natural alignment between the DSP and Block RAMs

[Figure: floorplan with one column of Virtex-5 DSP tiles (two DSP48Es each) alongside two columns of Block RAM tiles.]

Block RAM Chains

• Use Case: 18k Block RAM’s data input and output pins connected together (e.g. FIFO)

• Algorithm: Look for such chains and pack them together into single block RAM tile

• Special Case: 18k block RAMs separated by registers

[Figure: two 18 Kb block RAMs in a chain: in → RAM0 (dia, doa, addra) → RAM1 (dib, dob, addrb) → out.]
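The chain search described on this slide might look roughly like the sketch below. It is not the authors' algorithm: the netlist is reduced to a hypothetical dictionary `drives` that maps each 18 Kb RAM to the RAM its data output feeds (or `None`), every RAM is assumed to drive at most one other RAM, and the special case of RAMs separated by registers is not handled.

```python
def pair_ram_chains(drives):
    """Pair up 18 Kb RAMs whose data output feeds another RAM's data input.

    `drives` maps a RAM name to the RAM its output drives (or None).
    Returns a list of (source, destination) pairs, each a candidate
    for packing into a single 36 Kb block RAM tile.
    """
    driven = {dst for dst in drives.values() if dst is not None}
    # Chain heads are RAMs that no other RAM drives.
    heads = [ram for ram in drives if ram not in driven]

    visited, pairs = set(), []
    for head in heads:
        ram = head
        while ram is not None and ram not in visited:
            visited.add(ram)
            nxt = drives.get(ram)
            if nxt is not None and nxt not in visited:
                pairs.append((ram, nxt))  # pack these two together
                visited.add(nxt)
                ram = drives.get(nxt)     # continue after the paired RAM
            else:
                ram = nxt
    return pairs


# The slide's example: RAM0 feeds RAM1.
print(pair_ram_chains({"RAM0": "RAM1", "RAM1": None}))
```

Walking each chain from its head and pairing neighbours greedily leaves the last RAM of an odd-length chain unpaired, so it remains free to share a tile with some other 18 Kb RAM.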

Block RAM/DSP Packing Results

Circuit     Perf. RAM Packing (MHz)   Perf. Baseline (MHz)   Percent Improvement
Circuit 1   500                       400                    25%
Circuit 2   450                       365                    23%
Circuit 3   500                       470                    6%
Circuit 4   425                       435                    -2%
Circuit 5   215                       200                    8%
Geomean     400                       359                    11%
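As a check on the summary row, the geometric means can be recomputed from the five per-circuit frequencies in the table. The helper name `geomean` is ours; the data are the table's values.

```python
import math


def geomean(values):
    """Geometric mean, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


packed = [500, 450, 500, 425, 215]    # Perf. RAM Packing (MHz)
baseline = [400, 365, 470, 435, 200]  # Perf. Baseline (MHz)

g_packed = geomean(packed)
g_base = geomean(baseline)
improvement = (g_packed / g_base - 1) * 100

# Rounded, these reproduce the Geomean row: 400 MHz, 359 MHz, 11%.
print(round(g_packed), round(g_base), round(improvement))
```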


Summary

• Described two architecture-specific packing approaches for a 65nm commercial FPGA: Xilinx Virtex-5
  – Dual-output LUT packing in placement:
    • Achieves 10.2% SLICE reduction and 5.5% LUT reduction
  – Packing for DSPs and block RAMs:
    • Achieves 11% performance improvement

Questions
