An FPGA Implementation of the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

An FPGA Implementation of the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE) Sam Lee

Page 1: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

An FPGA Implementation of the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Sam Lee

Page 2: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

What is this Thesis about?

Implementation: Reciprocal Sum Compute Engine (RSCE). FPGA based. Accelerates part of a Molecular Dynamics simulation. Uses the Smooth Particle Mesh Ewald algorithm.

Investigation: Precision requirement. Speedup capability. Parallelization strategy.

Page 3: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Outline

What is Molecular Dynamics Simulation?

What calculations are involved?

How do we accelerate and parallelize the calculations?

What did we find out about precision?

What did we find out about speedup?

What is left to be done?

Page 4: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Molecular Dynamics Simulation

Page 5: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Molecular Dynamics Simulation

[Figure: three charged particles A, B, and C (charge labels 1+, 2-, and 1+), with arrows for each pairwise force: A on B, B on A, A on C, C on A, B on C, and C on B.]

E = -∇V (Electric Field = -Gradient of Potential)

F = QE (Force = Charge x Electric Field)

F = ma (Force = Mass x Acceleration)

Time integration => New Positions and Velocities
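The chain on this slide (potential, field, force, acceleration, new positions and velocities) can be sketched as a single time step. This is a minimal illustration, not the thesis code: the function names, the reduced units (Coulomb constant set to 1), and the choice of the velocity Verlet integrator are all assumptions.

```python
import math

def coulomb_forces(pos, charge):
    """Pairwise Coulomb forces, O(N^2). Coulomb constant taken as 1 (assumed units)."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [pos[i][k] - pos[j][k] for k in range(3)]
            r = math.sqrt(sum(c * c for c in d))
            scale = charge[i] * charge[j] / r ** 3
            for k in range(3):
                forces[i][k] += scale * d[k]  # force by j on i
                forces[j][k] -= scale * d[k]  # equal and opposite force by i on j
    return forces

def velocity_verlet_step(pos, vel, charge, mass, dt):
    """Advance positions and velocities by one timestep dt (F = ma, then integrate)."""
    n = len(pos)
    f_old = coulomb_forces(pos, charge)
    new_pos = [[pos[i][k] + vel[i][k] * dt + 0.5 * f_old[i][k] / mass[i] * dt * dt
                for k in range(3)] for i in range(n)]
    f_new = coulomb_forces(new_pos, charge)
    new_vel = [[vel[i][k] + 0.5 * (f_old[i][k] + f_new[i][k]) / mass[i] * dt
                for k in range(3)] for i in range(n)]
    return new_pos, new_vel
```

The double loop in `coulomb_forces` is the O(N^2) cost that motivates the rest of the talk.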

Page 6: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

MD Simulation

Problem scientists are facing:

SLOW!

O(N^2).

N = 10^5, time-span = 1 ns, timestep size = 1 fs => 10^22 calculations.

A 3 GHz computer takes 5.8 x 10^12 days to finish!!

Page 7: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Solution

Accelerate with FPGA

Especially: The O(N^2) calculations.

To be more specific, the thesis addresses: Reciprocal Electrostatic energy and force calculations. Smooth Particle Mesh Ewald algorithm.

Page 8: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Previous Work

Software Implementations: Original PME Package written by Toukmaji. NAMD2. AMBER.

Hardware Implementations: No previous hardware implementations of SPME. MD-Grape & MD-Engine used Ewald Summation. Ewald Summation is O(N^2); SPME is O(N log N)!

Page 9: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Calculations Involved

Smooth Particle Mesh Ewald

Page 10: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Electrostatic Interaction

Coulombic equation:

$$v_{coulomb} = \frac{q_1 q_2}{4 \pi \varepsilon_0 r}$$

Under the Periodic Boundary Condition, summation is only … Conditionally Convergent.

$$U = \frac{1}{2} \sum_{n}{}^{'} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{q_i q_j}{r_{ij,n}}$$

Page 11: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Periodic Boundary Condition

[Figure: a simulation cell containing particles 1 to 5, replicated as image cells labelled A through I.]

To combat Surface Effect… Replication

Page 12: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Ewald Summation Used For PBC

[Figure: plots of charge q against distance r, showing the charge distribution split into a Direct Sum part and a Reciprocal Sum part.]

Used to calculate the Coulombic interactions. O(N^2) Direct Sum + O(N^2) Reciprocal Sum.
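The split into a Direct Sum and a Reciprocal Sum rests on the identity 1/r = erfc(βr)/r + erf(βr)/r. The sketch below only illustrates that identity for a single pair distance; the function name is hypothetical and `beta` stands for the Ewald coefficient.

```python
import math

def split_coulomb(r, beta):
    """Return the (direct, reciprocal) parts of 1/r for Ewald coefficient beta."""
    direct = math.erfc(beta * r) / r     # decays quickly, summed in real space
    reciprocal = math.erf(beta * r) / r  # smooth, summed in reciprocal space
    return direct, reciprocal

direct, reciprocal = split_coulomb(2.0, 0.35)
print(direct + reciprocal)  # the two parts add back up to 1/r
```

A larger `beta` makes the direct part decay faster, which moves work from the Direct Sum to the Reciprocal Sum.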

Page 13: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Smooth Particle Mesh Ewald

Shift the workload to the Reciprocal Sum.

Use Fast Fourier Transform.

O(N) Real + O(N log N) Reciprocal.

RSCE calculates the Reciprocal Sum using the SPME algorithm.

Page 14: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

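SPME Reciprocal Energy

$$B(m_1, m_2, m_3) = |b_1(m_1)|^2 \cdot |b_2(m_2)|^2 \cdot |b_3(m_3)|^2$$

$$b_i(m_i) = \exp\left(\frac{2 \pi i (n-1) m_i}{K_i}\right) \times \left[\sum_{k=0}^{n-2} M_n(k+1) \exp\left(\frac{2 \pi i m_i k}{K_i}\right)\right]^{-1}$$

$$C(m_1, m_2, m_3) = \frac{1}{\pi V} \frac{\exp(-\pi^2 m^2 / \beta^2)}{m^2} \text{ for } m \neq 0, \quad C(0, 0, 0) = 0$$

$$\tilde{E}_{rec} = \frac{1}{2 \pi V} \sum_{m \neq 0} \frac{\exp(-\pi^2 m^2 / \beta^2)}{m^2} B(m_1, m_2, m_3) F(Q)(m_1, m_2, m_3) F(Q)(-m_1, -m_2, -m_3)$$

$$\tilde{E}_{rec} = \frac{1}{2} \sum_{m_1=0}^{K_1-1} \sum_{m_2=0}^{K_2-1} \sum_{m_3=0}^{K_3-1} Q(m_1, m_2, m_3) (\theta_{rec} \star Q)(m_1, m_2, m_3)$$

The F(Q) factors and the convolution are evaluated with FFTs.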
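The B(m1, m2, m3) factor is built from cardinal B-splines M_n evaluated at integer points. The sketch below computes M_n with the standard recursion and the per-axis factor |b(m)|^2. It is an illustration under assumptions, not the RSCE datapath: the function names are hypothetical, and only an even interpolation order n is exercised.

```python
import cmath

def bspline(n, u):
    """Cardinal B-spline M_n(u) via the standard recursion; zero outside [0, n]."""
    if n == 2:
        return 1.0 - abs(u - 1.0) if 0.0 <= u <= 2.0 else 0.0
    return (u / (n - 1)) * bspline(n - 1, u) \
        + ((n - u) / (n - 1)) * bspline(n - 1, u - 1.0)

def b_mod2(n, m, K):
    """|b(m)|^2 along one axis for interpolation order n and K mesh points.

    The numerator exp(2*pi*i*(n-1)*m/K) has unit modulus, so only the
    denominator sum contributes to the squared magnitude.
    """
    s = sum(bspline(n, k + 1) * cmath.exp(2j * cmath.pi * m * k / K)
            for k in range(n - 1))
    return 1.0 / abs(s) ** 2

# The product over the three axes gives B(m1, m2, m3).
print(b_mod2(4, 1, 32) * b_mod2(4, 2, 32) * b_mod2(4, 3, 32))
```

Because M_n depends only on the interpolation order, these factors can be computed once per mesh size and reused every timestep.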

Page 15: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

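SPME Reciprocal Force

$$F_{\alpha i} = -\frac{\partial \tilde{E}_{rec}}{\partial r_{\alpha i}} = -\sum_{m_1=0}^{K_1-1} \sum_{m_2=0}^{K_2-1} \sum_{m_3=0}^{K_3-1} \frac{\partial Q}{\partial r_{\alpha i}}(m_1, m_2, m_3) (\theta_{rec} \star Q)(m_1, m_2, m_3)$$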

Page 16: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Reciprocal Sum Compute Engine (RSCE)

Page 17: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

RSCE Validation Environment

[Figure: block diagram of the validation setup. Labels: Host (NAMD2, RSCE Driver), RS232, Multimedia Board, Virtex-II FPGA, MicroBlaze Softcore, OPB, OPB RS232, RSCE, and five ZBT Memory blocks.]

Page 18: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

RSCE Architecture

[Figure: block diagram of the RSCE. Blocks: B-Spline Coefficient Calculator (BCC), Mesh Composer (MC), 3D-FFT / 3D-IFFT, Energy Calculator (EC), Force Calculator (FC). Memories: PIM ZBT (upper half and lower half), BLM ZBT, ETM ZBT, QMMR/QMMI ZBT. Signal labels: Address for particle i, Fractional Part of Coordinates, Integer Part of Coordinates and Charge, Lookup key, func and slope, Theta and dTheta, Theta(x, y, z)[1..P], Energy, Forces.]

Page 19: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

RSCE Verification Testbench

[Figure: block diagram of RSCE_TB. Labels: Testcase, SystemC Wrapper, RSCE SystemC Fixed Point Model, RSCE Verilog RTL, opbBfm, OpbPkt, memAccessPkt, upPkt, Checker, Memory model & checker, and the memories PIM, ETM, BLM, QMMR, QMMI.]

Page 20: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

RSCE SystemC Model

[Figure: class diagram of the SystemC model. Labels: MAIN, HOST<DP, FXP>, RSCE<DP, FXP>, BCC<DP, FXP>, MC<DP, FXP>, 3D-FFT<DP, FXP>, EC<DP, FXP>, FC<DP, FXP>, ETMC<FXP>, BCC<FXP>; memories QMMI, QMMR, ETM, BLM, PIM; operations Fill, Update, Get, Transform.]

Page 21: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

MD Simulations with the RSCE

Page 22: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

RSCE Precision Goal

Goal: Relative error < 10^-5.

Two major calculation steps: B-Spline Calculation. 3D-FFT Calculation.

Due to limited logic resources + limited-precision FFT LogiCore => Precision goal CANNOT be achieved.
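To make the relative-error goal concrete, the sketch below quantizes a value to fixed point and measures the relative error against the double-precision value. It is a hedged illustration only: the function names and the bit widths are assumptions, not the RSCE's actual number formats.

```python
def to_fixed_point(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no overflow handling)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def relative_error(approx, exact):
    """Relative error |approx - exact| / |exact|."""
    return abs(approx - exact) / abs(exact)

exact = 0.1
for frac_bits in (8, 16, 24):  # hypothetical widths, for illustration only
    err = relative_error(to_fixed_point(exact, frac_bits), exact)
    print(frac_bits, err, err < 1e-5)
```

This covers a single quantization step; in a long chain such as a 3D-FFT the rounding errors of each stage accumulate.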

Page 23: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

MD Simulation with RSCE

RMS Energy Error Fluctuation:

$$\text{RMS Energy Fluctuation} = \frac{\sqrt{\langle E^2 \rangle - \langle E \rangle^2}}{|\langle E \rangle|}$$
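As a sketch of how this metric can be evaluated from a list of per-timestep total energies, assuming the usual definition sqrt(<E^2> - <E>^2) / |<E>| (the function name is hypothetical):

```python
import math

def rms_energy_fluctuation(energies):
    """sqrt(<E^2> - <E>^2) / |<E>| over a sequence of total energies."""
    n = len(energies)
    mean = sum(energies) / n
    mean_sq = sum(e * e for e in energies) / n
    variance = max(mean_sq - mean * mean, 0.0)  # guard against negative round-off
    return math.sqrt(variance) / abs(mean)

print(rms_energy_fluctuation([1.0, 3.0]))
```

A perfectly conserved energy gives 0, so a smaller value indicates a more stable simulation.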

Page 24: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

FFT Precision Vs. Energy Fluctuation

Page 25: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Speedup Analysis

RSCE vs. Software Implementation

Page 26: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

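RSCE Speedup

RSCE @ 100 MHz vs. Intel P4 @ 2.4 GHz.

Speedup: 3x to 14x

RSCE Computation time:

$$T_{comp} = \left(t_{MC} + t_{3D\text{-}IFFT} + t_{3D\text{-}FFT} + t_{FC}\right) \times \frac{1}{Freq}$$

[Equation: T_comp expanded in terms of N, PxPxP, KxKxK, and Freq; the expression is not recoverable from this transcript.]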
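A small sketch of how the computation time and the speedup relate, assuming T_comp is the sum of the cycle counts of the four stages (mesh composition, 3D-IFFT, 3D-FFT, force calculation) divided by the clock frequency. The function names and every number in the example are hypothetical, not measurements from the thesis.

```python
def rsce_compute_time(t_mc, t_ifft, t_fft, t_fc, freq_hz):
    """T_comp in seconds from per-stage cycle counts and the clock frequency."""
    return (t_mc + t_ifft + t_fft + t_fc) / freq_hz

def speedup(t_software, t_rsce):
    """How many times faster the RSCE is than the software run."""
    return t_software / t_rsce

# Hypothetical cycle counts at a 100 MHz clock, and a hypothetical software time.
t_rsce = rsce_compute_time(1_000_000, 1_000_000, 1_000_000, 1_000_000, 100e6)
print(t_rsce, speedup(0.4, t_rsce))
```

Since the stages run one after another, the slowest stage bounds how much any single optimization can help.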

Page 27: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

RSCE Speedup

Why so insignificant? QMM bandwidth limitation. Sequential nature of the SPME algorithm.

Solution: Use more QMM memories.

Slight design modifications required.

Page 28: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

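Multi-QMM RSCE Speedup

$$T_{comp} = \left(t_{MC} + t_{3D\text{-}IFFT} + t_{3D\text{-}FFT} + t_{FC}\right) \times \frac{1}{Freq}$$

NQ-QMM RSCE Computation time:

[Equation: T_comp for NQ QMM memories, in terms of N, PxPxP, KxKxK, NQ, and Freq; the expression is not recoverable from this transcript.]

The 4-QMM RSCE Speedup: 14x to 20x.

Assume N is of the same order as KxKxK: Speedup: 3(NQ-1)x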

2

Page 29: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

RSCE Speedup

N      P  K    Single-QMM Speedup  Four-QMM Speedup    Four-QMM Speedup
               against Software    against Single-QMM  against Software
20000  4  32   5.44x               3.37x               18x
20000  4  64   6.97x               2.10x               14x
20000  4  128  10.70x              1.46x               15x
20000  8  32   3.72x               3.90x               14x
20000  8  64   5.17x               3.37x               17x
20000  8  128  7.94x               2.10x               16x
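Reading the three speedup figures in each row as Single-QMM against software, Four-QMM against Single-QMM, and Four-QMM against software, the third should be the product of the first two. The sketch below checks that with the numbers from this slide; truncating the product to an integer is an assumption about how the slide rounded, and the names are hypothetical.

```python
# (N, P, K, single_vs_software, four_vs_single, four_vs_software)
ROWS = [
    (20000, 4, 32, 5.44, 3.37, 18),
    (20000, 4, 64, 6.97, 2.10, 14),
    (20000, 4, 128, 10.70, 1.46, 15),
    (20000, 8, 32, 3.72, 3.90, 14),
    (20000, 8, 64, 5.17, 3.37, 17),
    (20000, 8, 128, 7.94, 2.10, 16),
]

def combined_speedup(single_vs_software, four_vs_single):
    """Four-QMM speedup against software as the product of the two stages."""
    return single_vs_software * four_vs_single

for n, p, k, single, four, reported in ROWS:
    product = combined_speedup(single, four)
    print(n, p, k, round(product, 2), reported, int(product) == reported)
```

The gain from the extra QMM memories shrinks as K grows for a fixed P, while the Single-QMM speedup grows.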

Page 30: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Parallelization Strategy

When Multiple RSCEs are Used Together

Page 31: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

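RSCE Parallelization Strategy

Assume a 2-D Simulation. Assume P=2, K=8, N=6. Assume NumP = 4.

[Figure: two panels. One, captioned "An 8x8x8 mesh", shows particles A to F on a mesh with axes 0 to Kx and 0 to Ky. The other, captioned "Four 4x4x4 Mini Meshes", shows the same particles with the mesh divided among processors P1 to P4.]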
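One way to picture the mini-mesh split for the 2-D example (K=8, P=2, NumP=4) is a mapping from a grid point to the processor that owns it. The sketch below is an assumption-laden illustration: the row-major, 0-based processor numbering, the function names, and the convention that a particle spreads charge to the P x P points starting at its anchor point are all hypothetical.

```python
def mini_mesh_owner(ix, iy, K=8, procs_per_side=2):
    """0-based index of the processor whose mini mesh contains grid point (ix, iy)."""
    side = K // procs_per_side  # mini-mesh edge length, 4 in the example
    return (iy // side) * procs_per_side + (ix // side)

def touched_owners(ix, iy, P=2, K=8, procs_per_side=2):
    """Owners of the P x P grid points a particle anchored at (ix, iy) contributes to."""
    return sorted({
        mini_mesh_owner((ix + dx) % K, (iy + dy) % K, K, procs_per_side)
        for dx in range(P) for dy in range(P)
    })

print(touched_owners(1, 1))  # interior of one mini mesh
print(touched_owners(3, 3))  # at the corner where all four mini meshes meet
```

Particles near a mini-mesh boundary contribute to more than one mini mesh, which is where inter-processor communication comes from.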

Page 32: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

RSCE Parallelization Strategy

Mini-mesh composed -> 2D-IFFT. 2D-IFFT = two passes of 1D-FFT (X and Y).

[Figure: two panels, "X Direction FFT" and "Y Direction FFT". Each shows the mesh (axes 0 to Kx and 0 to Ky) divided among P1 to P4 for the 1D FFT in that direction.]
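The statement that a 2D transform equals two passes of 1D transforms (X, then Y) can be checked with a naive DFT. This is a pure-Python sketch for illustration, using an O(n^2) DFT rather than a real FFT; the function names are hypothetical, and the forward transform is shown (the inverse differs only in the sign of the exponent and a scale factor).

```python
import cmath

def dft1d(x):
    """Naive 1D discrete Fourier transform."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft2d_two_pass(grid):
    """2D DFT as a 1D pass along X (rows) followed by a 1D pass along Y (columns)."""
    rows = [dft1d(row) for row in grid]              # X direction
    cols = [dft1d(list(col)) for col in zip(*rows)]  # Y direction
    return [list(r) for r in zip(*cols)]             # transpose back to [ky][kx]

def dft2d_direct(grid):
    """2D DFT straight from the definition, for comparison."""
    ny, nx = len(grid), len(grid[0])
    return [[sum(grid[y][x] * cmath.exp(-2j * cmath.pi * (kx * x / nx + ky * y / ny))
                 for y in range(ny) for x in range(nx))
             for kx in range(nx)] for ky in range(ny)]

grid = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 0, 2, 0], [5, 5, 1, 1]]
a, b = dft2d_two_pass(grid), dft2d_direct(grid)
print(max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4)))
```

Each 1D pass only needs whole rows or whole columns, which is why the data layout across processors differs between the X pass and the Y pass.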

Page 33: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Parallelization Strategy

2D-IFFT -> Energy Calculation -> 2D-FFT. 2D-FFT -> Force Calculation.

$$E_{Total} = \sum_{P=0}^{3} E_P$$

[Figure: two panels, "Energy Calculation" and "Force Calculation", showing the mesh (axes 0 to Kx and 0 to Ky) divided among P1 to P4 and the particles A to F.]

Page 34: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

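Multi-RSCE System

[Equation: T_comm, in terms of the mesh dimensions K_X, K_Y, K_Z, NumP, the grid precision, and the transfer rate; the expression is not recoverable from this transcript.]

[Equation: T_comp, in terms of N, PxPxP, KxKxK, NumP, and the system clock frequency; the expression is not recoverable from this transcript.]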

Page 35: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Conclusion

Successful integration of the RSCE into NAMD2.

Single-QMM RSCE Speedup = 3x to 14x.

NQ-QMM RSCE Speedup = 14x to 20x.

When N≈KxKxK, NQ-QMM Speedup = 3(NQ-1)x.

Multi-RSCE system is still a better alternative to the Multi-FPGA Ewald Summation system.

Page 36: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Future Work

Input Precision Analysis.

More in-depth FFT Precision Analysis.

Implementation of block floating-point FFT.

More investigation on how different simulation settings (K, P, and N) affect the RSCE speedup.

Investigate how to better parallelize the SPME algorithm.

Page 37: An FPGA Implementation  of   the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Questions?