
Evaluation of running FFTs on the Cray XD1 with attached FPGAs

May 19, 2005

David Strenski, Cray, Inc., stren@cray.com

Michael Babst, DSPlogic, Inc.

Roderick Swift, DSPlogic, Inc.


Cray XD1 Overview


The Rebirth of Co-processing

1976: 8086 Processor, 8087 Coprocessor

2004: AMD Opteron, Xilinx Virtex II Pro FPGA


Application Acceleration

Reconfigurable Computing
  Tightly coupled to Opteron
  FPGA acts like a programmable co-processor
  Performs vector operations
  Well-suited for:
    Searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, random number generation.

Application Accelerator


Programming the FFT on the FPGA


Data Flow Management

FFT Application Architecture

[Block diagram with an Opteron side and an Application Accelerator side. Labels: Opteron Application, User Memory, IO Library, Cray AAAPI, Result Buffer (2 MB), Cray RT Core, I/O Control, Input FIFO (4 MB), Output FIFO (4 MB), Control State Machine, FFT Config/Status, FFT Core, FPGA Application.]


FFT Application Architecture

Four Layer Architecture
  Cray Driver Layer
  I/O Management Layer
  FFT Library
  End-User Application

Cray Driver Layer
  Low-level interface between Opteron and FPGA
  Cray Application Accelerator software API
  Rapid-Array Transport Core
  QDRII Core


FFT Application Architecture

I/O Management Layer
  Simplified Software and FPGA Interfaces
    Framed, streaming data interface
  Software API
    io_init(fpgafile)
    io_config(frame length)
    loadframe(pointer to input data)
    getresult(pointer to output location)
    txrx(input pointer, output pointer, dataset size)
  I/O Management Core
    Input/output FPGA FIFOs decouple Opteron and FPGA processing
    Simple Data I/O bus for FPGA FFT application
    User Definable Control/Status Registers


FFT Application Architecture

FFT Library
  FFT Core
    Combination of off-the-shelf cores and custom VHDL optimized for XD1
    Cooley-Tukey Algorithm
    Radix-2 decimation-in-frequency
    Streaming data at 1 sample per clock
  Software API
    fft_init(fft length, direction)
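The radix-2 decimation-in-frequency structure named on this slide can be sketched in software. The following is a minimal floating-point illustration of the algorithm only, not the FPGA core: the function names and the `inverse` flag (standing in for the "direction" argument) are assumptions, and the real core uses fixed-point arithmetic and streams one sample per clock.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_dif_radix2(samples, inverse=False):
    """Iterative radix-2 decimation-in-frequency FFT (unscaled)."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = [complex(s) for s in samples]
    sign = 1.0 if inverse else -1.0
    span = n
    while span >= 2:
        half = span // 2
        for start in range(0, n, span):
            for k in range(half):
                # DIF butterfly: sum goes to the upper half, the
                # twiddled difference goes to the lower half.
                w = cmath.exp(sign * 2j * cmath.pi * k / span)
                a = data[start + k]
                b = data[start + k + half]
                data[start + k] = a + b
                data[start + k + half] = (a - b) * w
        span = half
    # DIF leaves the outputs in bit-reversed order; undo that here.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, value in enumerate(data):
        out[bit_reverse(i, bits)] = value
    return out
```

The inverse transform is left unscaled, so dividing by the length recovers the original samples.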


FFT Application Architecture

End-User Application
  Uses both FFT Library and I/O Library
  Initialization
    io_init()
    io_config()
    fft_init()
  Data Transfer
    Serial Data I/O
      loadframe()
      getresult()
    Optimized Data I/O
      txrx()
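The call order on this slide can be illustrated with a software mock. The sketch below is only a stand-in: the class name `MockAccelerator`, the Python signatures, the `depth` parameter, and the direction strings are all assumptions, and a plain DFT replaces the FPGA. Only the function names come from the slides, and the frame sizes used with it are far smaller than the real limits.

```python
import cmath
from collections import deque


class MockAccelerator:
    """Software stand-in for the FPGA; NOT the real I/O or FFT library."""

    def io_init(self, fpgafile):
        self.fpgafile = fpgafile   # the real call would load an FPGA image
        self.fifo = deque()        # models frames that are in flight

    def io_config(self, frame_length):
        self.k = frame_length      # K samples per frame

    def fft_init(self, fft_length, direction):
        self.n = fft_length        # N samples per FFT
        self.sign = -1.0 if direction == "forward" else 1.0

    def _dft(self, block):
        # Direct DFT; stands in for the hardware FFT core.
        n = len(block)
        return [sum(x * cmath.exp(self.sign * 2j * cmath.pi * j * t / n)
                    for t, x in enumerate(block)) for j in range(n)]

    def loadframe(self, frame):
        if len(frame) != self.k:
            raise ValueError("frame must hold exactly K samples")
        self.fifo.append(list(frame))

    def getresult(self):
        # The mock computes on read; real hardware works in the background.
        frame = self.fifo.popleft()
        out = []
        for start in range(0, self.k, self.n):
            out.extend(self._dft(frame[start:start + self.n]))
        return out

    def txrx(self, dataset, depth=2):
        """Keep up to `depth` frames in flight before reading results."""
        results = []
        for start in range(0, len(dataset), self.k):
            self.loadframe(dataset[start:start + self.k])
            if len(self.fifo) == depth:
                results.extend(self.getresult())
        while self.fifo:
            results.extend(self.getresult())
        return results
```

Serial use alternates `loadframe()` and `getresult()` one frame at a time; `txrx()` produces the same results in the same order while letting several frames queue up first.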


Data Flow


Data Format/Frame Processing

[Diagram: a dataset of M samples is split into frames F0 ... F(M/K-1) of K samples each; each frame is split into FFT blocks B0 ... B(K/N-1) of N samples each; each FFT block holds samples D0 ... D(N-1).]

Complex Input Sample: Im[15:0], Re[15:0]
Complex Output Sample: Im[31:0], Re[31:0]
64 bits / sample

Frame Length: 1024 < K < 65536
FFT Size: 32 < N < 65536, non-overlapping
Dataset Size: 32 < M < Available Memory
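The dataset, frame, and FFT-block hierarchy on this slide amounts to simple index arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the size limits are treated as inclusive because the experimental table uses K = 1024 and N = 32, which is an assumption about how the bounds are meant.

```python
def partition_dataset(m, k, n):
    """Return (number of frames, FFTs per frame) for M, K, N samples."""
    if not 1024 <= k <= 65536:
        raise ValueError("frame length K out of range")
    if not 32 <= n <= 65536:
        raise ValueError("FFT size N out of range")
    if k % n or m % k:
        raise ValueError("K must be a multiple of N, and M a multiple of K")
    return m // k, k // n


def block_slices(m, k, n):
    """Yield (frame, block, start, stop) sample ranges in dataset order."""
    frames, per_frame = partition_dataset(m, k, n)
    for f in range(frames):
        for b in range(per_frame):
            start = f * k + b * n   # non-overlapping blocks
            yield f, b, start, start + n
```

For M = 524288, K = 1024, and N = 32 this gives 512 frames with 32 FFTs per frame.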


Data Pipelining
  Opteron-FPGA communications latency
  Processing latency
  Data pipelining required to achieve maximum FFT performance

Latency to Initial FFT Result

$$T_L \approx (3.125K + 1.5N)\,T_{clk} + \frac{8K}{1.1 \times 10^{9}}\ \text{sec}$$

  Send multiple frames prior to receiving first result
  Further optimization of latency possible
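As a worked example, the latency expression on this slide can be evaluated numerically, reading it as T_L ≈ (3.125K + 1.5N)·T_clk + 8K/(1.1×10^9) seconds. The function and parameter names below are assumptions. The defaults are the 200 MHz clock and the roughly 1.1 GB/s memory rate quoted elsewhere in the deck.

```python
def initial_latency(k, n, f_clk_hz=200e6, mem_rate=1.1e9):
    """Approximate latency in seconds to the first FFT result.

    k: frame length in samples, n: FFT length in samples.
    mem_rate: bytes per second for the final copy to user memory.
    """
    t_clk = 1.0 / f_clk_hz
    fabric_and_processing = (3.125 * k + 1.5 * n) * t_clk
    copy_to_user_memory = 8.0 * k / mem_rate   # 8 bytes per sample
    return fabric_and_processing + copy_to_user_memory


for size in (1024, 65536):
    print(size, initial_latency(size, size))
```

With K = N = 1024 this comes to about 31 microseconds, and with K = N = 65536 to about 2 milliseconds.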


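Processing Pipeline

[Timing diagram: frames F0 ... F4 are issued by successive loadframe() calls and pass through the stages Load Input Fifo, Load FPGA Processor, Process Data (processing delay), To Output Fifo, To Result Buffer, and To User Memory; getresult() returns each frame to user memory. Marked intervals: Tff, Tfo, Tfr, Tfc, and Dfft.]

$$T_{ff} = K\,T_{clk} = \frac{K}{F_{clk}}$$

$$T_{fr} = \frac{9}{8}K\,T_{clk} = \frac{9K}{8F_{clk}}$$

$$D_{fft} \approx 1.5\,N\,T_{clk}$$

$$T_{fc} \approx \frac{8K}{1.1 \times 10^{9}}\ \text{sec}$$

$$T_L \approx (3.125K + 1.5N)\,T_{clk} + \frac{8K}{1.1 \times 10^{9}}\ \text{sec}$$

Fabric Limited (1.6 GB/s)
Fabric Limited (1.4 GB/s)
Memory Access Limited (≅1.1 GB/sec)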
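The stage times on this slide can be turned into per-stage data rates. The sketch below reads the slide's expressions as T_ff = K·T_clk, T_fr = (9/8)·K·T_clk, D_fft ≈ 1.5·N·T_clk, and T_fc ≈ 8K/(1.1×10^9). The function names are hypothetical. Dividing the 8K bytes of a frame by each transfer time gives rates that line up with the three labels on the slide (1.6, about 1.42, and 1.1 GB/s); that pairing is an inference from the arithmetic.

```python
BYTES_PER_SAMPLE = 8  # 64 bits per complex sample


def stage_times(k, n, f_clk_hz=200e6, mem_rate=1.1e9):
    """Per-frame stage durations in seconds."""
    t_clk = 1.0 / f_clk_hz
    return {
        "T_ff": k * t_clk,
        "T_fr": 9.0 * k * t_clk / 8.0,
        "D_fft": 1.5 * n * t_clk,
        "T_fc": BYTES_PER_SAMPLE * k / mem_rate,
    }


def stage_rates(k, f_clk_hz=200e6, mem_rate=1.1e9):
    """Bytes per second moved by each transfer stage for a K-sample frame."""
    times = stage_times(k, k, f_clk_hz, mem_rate)
    frame_bytes = BYTES_PER_SAMPLE * k
    return {name: frame_bytes / times[name]
            for name in ("T_ff", "T_fr", "T_fc")}


rates = stage_rates(4096)
slowest = min(rates, key=rates.get)
print(rates, slowest)
```

In a full pipeline the slowest stage sets the sustained rate, which with these defaults is the memory-limited copy to user memory.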


FFT Latency

[Figure: latency (sec), log scale from 1.00E-05 to 1.00E-02, versus log2(Nfft) from 5 to 16. Series: Fc=200, Rm=1.6G; Fc=175, Rm=1.6G; Fc=175, Rm=1.1G.]


Accuracy


Accuracy Methodology

Difference between fixed/floating point
  32-bit Single Precision
    23-bit mantissa, 8-bit exponent, 1 sign bit
  Fixed point has limited dynamic range
  Variable mantissa for programmable precision
  16-bit input / 32-bit output example

Total FFT error also depends on other factors
  Rounding/Truncation at each stage
  Twiddle factor precision

Normalized metric, independent of length

$$\mathrm{compare}(a, b) = \frac{\lVert a - b \rVert_n}{\lVert b \rVert_n}$$

$$L_n = \lVert x \rVert_n = \left( \sum_{i=0}^{len-1} \lvert x_i \rvert^{n} \right)^{1/n}$$
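The normalized error metric on this slide can be written as a short function. This sketch reads the metric as the L_n norm of the difference divided by the L_n norm of the reference; the names `l_norm` and `compare` and the default n = 2 are assumptions chosen to match the L2norm results reported later.

```python
def l_norm(x, n=2):
    """L_n norm of a sequence of real or complex values."""
    return sum(abs(v) ** n for v in x) ** (1.0 / n)


def compare(a, b, n=2):
    """Relative error of a against the reference b in the L_n norm."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    diff = [ai - bi for ai, bi in zip(a, b)]
    return l_norm(diff, n) / l_norm(b, n)


reference = [1.0, 2.0, 3.0, 4.0]
approx = [1.0, 2.0, 3.0, 4.001]
print(compare(approx, reference))
```

Because numerator and denominator grow together, repeating both sequences does not change the result, which is what makes the metric independent of length.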


FFTW L2norm Accuracy


Measured Accuracy Results

N = 1024
  L2NormError(x_int16, x_r64) = 1.526792830336569e-005
  L2NormError(FFTWFFT(x_int16), FFTWFFT(x_r64)) = 1.526948880682192e-005
  L2NormError(FPGAFFT(x_int16), FFTWFFT(x_r64)) = 2.635449438537074e-005

N = 65536
  L2NormError(x_int16, x_r64) = 1.526792830336569e-005
  L2NormError(FFTWFFT(x_int16), FFTWFFT(x_r64)) = 1.526796487132111e-005
  L2NormError(FPGAFFT(x_int16), FFTWFFT(x_r64)) = 3.216349170399355e-005
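The size of the input rounding error can be reproduced approximately in software. The sketch below assumes the test signal is uniformly distributed in [-1, 1) and rounded to 16-bit integers; the slides do not state the signal that was used, so this is an illustration rather than a reconstruction of the experiment. The function name and seed are hypothetical.

```python
import math
import random


def int16_quantization_error(count, seed=0):
    """Relative L2 error from rounding uniform [-1, 1) data to int16."""
    rng = random.Random(seed)
    scale = 2 ** 15
    err_sq = 0.0
    ref_sq = 0.0
    for _ in range(count):
        x = rng.uniform(-1.0, 1.0)
        # Round to the nearest 16-bit code and clip to the int16 range.
        code = max(-scale, min(scale - 1, round(x * scale)))
        q = code / scale
        err_sq += (q - x) ** 2
        ref_sq += x * x
    return math.sqrt(err_sq / ref_sq)


print(int16_quantization_error(65536))
```

Under this assumption the expected value is close to 2^-16, about 1.53e-5, the same order as the reported L2NormError(x_int16, x_r64). Since the DFT preserves relative L2 error in exact arithmetic (Parseval's theorem), the same figure is expected to carry through to a double-precision FFT of the rounded input.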


Accuracy Summary

Comparable to other single-precision FFTs
Initial rounding of data causes most error
Input less accurate, output more accurate than single precision float
Precision may be traded off for speed, FFT size, etc.
Dynamic range limits


Performance


FFT Computation Rate

1.6 GBytes/sec FPGA FFT computation rate w/ 200 MHz clock
  Not realizable

Expected Rates
  1.4 GBytes/sec theoretical max
  1.1 GBytes/sec realistic with one or more of the following enhancements
    I/O joint R/W software optimization
    Increased result buffer size beyond 2 MB (future Cray release)
    Bidirectional DMA (future Cray release)
  550 Mbytes/sec approximate worst case
    Time-shared Opteron transmit and receive
  Current optimized rate ~830 Mbytes/sec

64-bits / complex sample
  R = 1.4 GB/sec = 5.7 ns/point
  R = 1.1 GB/sec = 7.3 ns/point
  R = 830 MB/sec = 12 ns/point
  R = 550 MB/sec = 14.6 ns/point

The average FFT computation rate is determined entirely by I/O throughput
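The ns/point figures follow from dividing the 8 bytes of one complex sample by the I/O rate. A minimal sketch of that conversion, with a hypothetical function name:

```python
def ns_per_point(rate_bytes_per_sec, bytes_per_sample=8):
    """Time per complex sample in nanoseconds at a given I/O rate."""
    return bytes_per_sample / rate_bytes_per_sec * 1e9


for rate in (1.4e9, 1.1e9, 830e6, 550e6):
    print(f"{rate / 1e6:7.0f} MB/sec -> {ns_per_point(rate):5.2f} ns/point")
```

By this arithmetic, 1.4 GB/sec, 1.1 GB/sec, and 550 MB/sec give about 5.7, 7.3, and 14.5 ns/point. 830 MB/sec gives about 9.6 ns/point rather than the 12 ns/point listed. A figure of 12 ns/point would correspond to roughly 667 MB/sec.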


FFT Computation Rate

[Figure: time, log scale up to 10000, versus log2(Nfft) from 5 to 16. Series: R = 1.42; R = 1.1; R = 1.1 / 2; FFTW(cpignol64-246).]


FFT Computation Rate

[Figure: time, log scale from 1 to 10000, versus log2(Nfft) from 10 to 16. Series: R = 1.42; R = 1.1; R = 1.1 / 2; FFTW(cpignol64-246).]


Experimental Results

Un-optimized

N = FFT size, K = frame size, F = number of frames, K/N = FFTs per frame, Nproc = total processed.
The first Total/TX/RX group is the average duration (usec); the second is the throughput (Mbytes/sec).

    N      K    F  K/N   Nproc   Total      TX      RX   Total       TX       RX
65536  65536    8    1  524288  932.63  403.00  526.00  562.16  1300.96   996.75
32768  32768   16    1  524288  462.50  205.88  255.06  566.80  1273.32  1027.77
16384  16384   32    1  524288  231.31  106.28  123.88  566.65  1233.26  1058.10
 8192   8192   64    1  524288  116.86   56.05   60.31  560.81  1169.30  1086.62
 4096   4096  128    1  524288   60.45   30.04   29.74  542.11  1090.85  1101.74
 2048   2048  256    1  524288   32.81   17.96   14.50  499.44   912.20  1129.62
 1024   1024  512    1  524288   19.23   11.09    7.90  426.02   738.95  1037.22
  512   1024  512    2  524288    9.27    5.44    3.73  441.76   753.50  1098.71
  256   1024  512    4  524288    4.62    2.75    1.81  443.67   745.54  1130.24
  128   1024  512    8  524288    2.31    1.38    0.91  442.91   744.73  1126.51
   64   1024  512   16  524288    1.16    0.69    0.46  441.00   740.96  1117.90
   32   1024  512   32  524288    0.58    0.35    0.23  441.38   742.03  1127.75
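The throughput columns appear to follow from the duration columns as 8 bytes per sample times N, divided by the average duration per FFT. The sketch below checks that relationship against two rows; the function name is hypothetical, and small mismatches on the shortest FFTs are expected because the listed durations are rounded to two decimals.

```python
def throughput_mbytes_per_sec(n, duration_usec, bytes_per_sample=8):
    """Throughput for one N-point FFT; bytes per microsecond equals MB/sec."""
    return bytes_per_sample * n / duration_usec


# (N, total duration in usec, listed total throughput in Mbytes/sec)
rows = [(65536, 932.63, 562.16), (32, 0.58, 441.38)]
for n, duration, listed in rows:
    print(n, round(throughput_mbytes_per_sec(n, duration), 2), listed)
```

The same relationship applies to the TX and RX columns, for example 8 × 65536 / 403.00 usec ≈ 1300.96 Mbytes/sec.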


Experimental Results

[Figure: throughput (MBytes/sec), 0 to 900, versus frame length from 1024 to 65536. Caption: Sustained throughput, improved TX/RX software optimization.]


Speed Improvement Ratio

[Figure: T(fftw)/T(fpga), 0.00 to 9.00, versus Nfft from 32 to 65536. Series: R=1.4G; R=1.1G; R=800M.]


Conclusions
  Up to 4.75x speed gains are achievable today
  FPGA performance enhancement increases with FFT length
  Multiple FFTs utilize pipeline and provide efficiency
  Latency limits usefulness for single computations of small FFT sizes
  FFT L2norm accuracy ~10^-5, similar to other single-precision algorithms
  Modular architecture
    Separate I/O and application optimization
    Rapid application development