FPGA based Acceleration of Linear Algebra Computations.

IEEE Globecom-2006, NXG-02: Broadband Access. ©Copyright 2005-2006, All Rights Reserved. B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde, Prof. Sachin Patkar, Prof. H. Narayanan


Page 1: FPGA based Acceleration of Linear Algebra Computations.

FPGA based Acceleration of Linear Algebra Computations.

B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde

Prof. Sachin Patkar, Prof. H. Narayanan

Page 2: FPGA based Acceleration of Linear Algebra Computations.

Outline

Double Precision Dense Matrix-Matrix Multiplication: Motivation, Related Work, Algorithm, Design, Results, Conclusions

Double Precision Sparse Matrix-Vector Multiplication: Introduction, Prasanna, DeLorimier, David Gregg et al., What can we do?

Page 3: FPGA based Acceleration of Linear Algebra Computations.

FPGA based Double Precision Dense Matrix-Matrix Multiplication.

Page 4: FPGA based Acceleration of Linear Algebra Computations.

Motivation

FPGAs have been making inroads into HiPC.

Accelerating BLAS-3 is achieved by accelerating matrix multiplications.

Modern FPGAs provide an abundance of resources; we must capitalise upon these.

Page 5: FPGA based Acceleration of Linear Algebra Computations.

Related Work{1/2}

The two main works are by Dou and Prasanna. Both are based on linear arrays, both use memory switching, and both sustain their peak.

Dou: Optimised for a large Virtex-II Pro device (Xilinx). Created his own MAC (not fully compliant). Sub-block dimensions must be powers of 2. Optimised for low IO bandwidth.

Page 6: FPGA based Acceleration of Linear Algebra Computations.

Related Work{2/2}

Prasanna:

Scaling results in speed degradation of about 35% (2 PEs to 20 PEs).

2.1 GFLOPS on a Cray XD1 with Virtex-II Pros (XC2VP50).

For the design only (XC2VP125), they report 15% clock degradation going from 2 to 24 PEs.

» They state that they have not made any platform-specific optimisations for the implemented design.

Page 7: FPGA based Acceleration of Linear Algebra Computations.

Algorithm

1. Broadcast 'A'; keep a unique 'B' per PE.

2. Multiply, and put in the pipeline of the multiplier.

3. Output is fed directly to the Adder+RAM (accumulator).

4. When the updated C is ready, take it out.
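
The steps above can be sketched in software as a sum of rank-one updates. This is only an illustrative model, not the authors' hardware design: the function name `rank_one_matmul`, the `num_pes` parameter, and the way columns of B are assigned to PEs are assumptions made for the example.

```python
def rank_one_matmul(A, B, num_pes):
    """Compute C = A * B as a sum of rank-one updates (one per index k).

    Hypothetical software model: each "PE" holds one element of row k of B,
    every element of column k of A is broadcast to all PEs, and each PE
    accumulates into its own column of C (the Adder+RAM accumulator).
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for k in range(m):                          # one rank-one update per k
        for j0 in range(0, p, num_pes):         # columns handled in groups of num_pes
            pe_b = B[k][j0:j0 + num_pes]        # step 1: unique B value per PE
            for i in range(n):
                a = A[i][k]                     # step 1: broadcast A value
                for pe, b in enumerate(pe_b):
                    C[i][j0 + pe] += a * b      # steps 2-3: multiply, accumulate
    return C                                    # step 4: read out the updated C


if __name__ == "__main__":
    print(rank_one_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], num_pes=2))
```

The result does not depend on `num_pes`; the parameter only changes how the columns of C are grouped across the modelled PEs.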

Page 8: FPGA based Acceleration of Linear Algebra Computations.

Design-I

Page 9: FPGA based Acceleration of Linear Algebra Computations.

Design-II

Page 10: FPGA based Acceleration of Linear Algebra Computations.

FPGA Synthesis/PAR data{1/2}

Table: Resource Utilisation for SX95T and SX240T (post PAR)

PE           DSP48Es  FIFO  BRAM  Slice Reg  Slice LUT
1            16       1     2     2511       1374
4            64       4     8     10377      5451
8            128      8     16    20865      10886
16           256      16    32    41841      21750
20 (SX240)   320      20    40    52329      27176
40 (SX240)   640      40    80    103335     53914

Table: Clock Speed in MHz for the overall design for different numbers of PEs

Device/PE  1    4    8    16   19   20   40
SX95T-3    377  374  373  373  372  201  -
SX240T-2   374  373  344  -    -    372  371.7

Page 11: FPGA based Acceleration of Linear Algebra Computations.

FPGA Synthesis/PAR data{2/2}

Table: Resource Utilisation for Virtex-II Pro XC2VP100 (post PAR)

            15 PE         20 PE
MULT18x18   240 (54%)     304 (68%)
RAMB16s     90 (20%)      114 (26%)
Slices      30218 (68%)   37023 (83%)
Speed       133.94 MHz    133.79 MHz

Page 12: FPGA based Acceleration of Linear Algebra Computations.

Conclusions

We propose a variation of the rank-one update algorithm for matrix multiplication.

We introduce a scalable processing element for this algorithm, targeted at a Virtex-5 SX240T FPGA.

The two designs clearly show the effect of local storage on IO bandwidth.

The designs achieved a design speed of 373 MHz with 40 PEs and a sustained performance of 29.8 GFLOPS on a single FPGA. We also achieve 5.3 GFLOPS on an XC2VP100.
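
The quoted figures are consistent with a simple peak estimate, sketched below. The assumption (not stated on the slides) is that each PE completes one multiply and one add per clock cycle, i.e. 2 floating-point operations per PE per cycle; `peak_gflops` is a hypothetical helper name. The 20-PE clock of 133.79 MHz is taken from the XC2VP100 table.

```python
# Assumption: one multiply + one add per PE per cycle.
FLOPS_PER_PE_PER_CYCLE = 2


def peak_gflops(num_pes, clock_mhz):
    """Peak throughput in GFLOPS for num_pes PEs running at clock_mhz."""
    return num_pes * FLOPS_PER_PE_PER_CYCLE * clock_mhz * 1e6 / 1e9


virtex5 = peak_gflops(40, 373.0)      # SX240T, 40 PEs
virtex2p = peak_gflops(20, 133.79)    # XC2VP100, 20 PEs

if __name__ == "__main__":
    print(f"Virtex-5 SX240T: {virtex5:.2f} GFLOPS")
    print(f"XC2VP100:        {virtex2p:.2f} GFLOPS")
```

Under that assumption the estimates come out at about 29.8 GFLOPS and about 5.35 GFLOPS, in line with the reported 29.8 and 5.3 GFLOPS.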

Page 13: FPGA based Acceleration of Linear Algebra Computations.

FPGA based Double Precision Sparse Matrix-Vector Multiplication.

Page 14: FPGA based Acceleration of Linear Algebra Computations.

Introduction

There are three main papers we will be looking at:

Viktor Prasanna: Hybrid method using HLL + S/W + HDL

Michael DeLorimier: Maximum performance but unrealistic

David Gregg et al.: Most realistic assumptions w.r.t. DRAM

Page 15: FPGA based Acceleration of Linear Algebra Computations.

Prasanna

Use of pre-existing IP cores, specifically for an iterative solver (CG)

4-input reduction circuit does the dot product; results in partial sums as output

Adder loop with array does the summation of the dot product; created using HLL

Reduction circuit at the end uses a B-Tree to create the final value

IPs are available

DRAM looked at, but not realistically

Order of matrices is small

DRAM is the bottleneck

With their IPs they have a good architecture; however, one could change the IP and modify the datapath, e.g. Dou's MAC

Page 16: FPGA based Acceleration of Linear Algebra Computations.

DeLorimier

Use BRAMs for everything.

Use for iterative Solver – specifically CG

MAC requires interleaving

They do load balancing in their partitioner, which requires a communication stage; this is very matrix/partitioner dependent.

Communication is the bottleneck

Performance: 750 MFLOPS / processor

16 Virtex II 6000s

Each has 5 PE + 1 CE

Page 17: FPGA based Acceleration of Linear Algebra Computations.

David Gregg et al. (SPAR)

They only report the use of the SPAR architecture for FPGAs

They use very pessimistic DRAM access times. Emphasis on cache-miss removal

Not using their Block RAMs well; maybe something interesting can be done here

128 MFLOPS for 3 parallel SPAR units, but remove cache misses and we get a peak of 570 MFLOPS

Page 18: FPGA based Acceleration of Linear Algebra Computations.

What can we do?

Both use CSR. This is not required; why not modify the representation?

Two approaches: we can try both simultaneously

Prasanna: split across dot products (same row, many PEs)

DeLorimier: split across rows (many rows, one PE)

Use data from SPAR, a viable approach. Both do zero multiplies; we get away with one zero multiply per column

Minimise communication or overlap it. We can do interleaving for this: while one stage computes, the previous one communicates.
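
The two splitting strategies can be contrasted with a small CSR matrix-vector sketch. This is a software illustration under assumptions, not either group's actual architecture: the function names, the round-robin assignment of rows or nonzeros to PEs, and the `num_pes` parameter are all hypothetical.

```python
def spmv_split_rows(values, col_idx, row_ptr, x, num_pes):
    """Row split: each modelled PE computes whole rows (many rows, one PE)."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for pe in range(num_pes):
        for r in range(pe, n, num_pes):            # rows assigned round-robin
            acc = 0.0
            for k in range(row_ptr[r], row_ptr[r + 1]):
                acc += values[k] * x[col_idx[k]]
            y[r] = acc
    return y


def spmv_split_dot(values, col_idx, row_ptr, x, num_pes):
    """Dot-product split: one row's nonzeros are spread over many PEs,
    then the per-PE partial sums are reduced to the final value."""
    y = []
    for r in range(len(row_ptr) - 1):
        partial = [0.0] * num_pes                  # one partial sum per PE
        for i, k in enumerate(range(row_ptr[r], row_ptr[r + 1])):
            partial[i % num_pes] += values[k] * x[col_idx[k]]
        y.append(sum(partial))                     # reduction stage
    return y


if __name__ == "__main__":
    # CSR form of [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    x = [1.0, 2.0, 3.0]
    print(spmv_split_rows(values, col_idx, row_ptr, x, num_pes=2))
    print(spmv_split_dot(values, col_idx, row_ptr, x, num_pes=2))
```

Both functions produce the same vector; the difference is where the work is divided, which is what determines whether a reduction stage or inter-PE communication is needed.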

Page 19: FPGA based Acceleration of Linear Algebra Computations.

Questions?

Page 20: FPGA based Acceleration of Linear Algebra Computations.

THANK YOU
