An FPGA Implementation of the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)

Sam Lee

What is this Thesis about?

Implementation: Reciprocal Sum Compute Engine (RSCE). FPGA based. Accelerates part of a Molecular Dynamics simulation. Uses the Smooth Particle Mesh Ewald algorithm.

Investigation: Precision requirement. Speedup capability. Parallelization strategy.

Outline

What is Molecular Dynamics Simulation?

What calculations are involved?

How do we accelerate and parallelize the calculations?

What did we find out about precision?

What did we find out about speedup?

What is left to be done?

Molecular Dynamics Simulation

Molecular Dynamics Simulation

[Figure: three charged particles A, B and C (charge labels 1+, 1+ and 2-) with the pairwise forces between them: force by A on B, force by B on A, force by A on C, force by C on A, force by B on C, and force by C on B.]

E = -∇V (Electric Field = -Gradient of Potential)
F = QE (Force = Charge x Electric Field)
F = ma (Force = Mass x Acceleration)
Time integration => New Positions and Velocities
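The chain from field to force to acceleration to new positions and velocities can be sketched in a few lines. This is only an illustration: the slide does not name an integrator, so velocity Verlet is an assumption here, and `uniform_field` and `simulate` are hypothetical helpers used to keep the example checkable by hand.

```python
def velocity_verlet_step(pos, vel, charge, mass, field, dt):
    """Advance one particle by one timestep.

    field(pos) returns the electric field vector E at pos (E = -grad V).
    """
    # F = QE and a = F/m at the current position.
    acc = [charge * e / mass for e in field(pos)]
    # New position from the current velocity and acceleration.
    new_pos = [x + v * dt + 0.5 * a * dt * dt for x, v, a in zip(pos, vel, acc)]
    # Acceleration at the new position.
    new_acc = [charge * e / mass for e in field(new_pos)]
    # New velocity from the average of the old and new accelerations.
    new_vel = [v + 0.5 * (a + b) * dt for v, a, b in zip(vel, acc, new_acc)]
    return new_pos, new_vel


def uniform_field(_pos):
    """Hypothetical test field: constant E along x."""
    return [1.0, 0.0, 0.0]


def simulate(steps, dt):
    """Integrate a unit charge of unit mass starting at rest at the origin."""
    pos, vel = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    for _ in range(steps):
        pos, vel = velocity_verlet_step(pos, vel, 1.0, 1.0, uniform_field, dt)
    return pos, vel
```

In a real MD run the field at each particle comes from all the other particles, which is where the O(N^2) cost arises.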

MD Simulation

Problem scientists are facing: SLOW!

O(N^2).

N=10^5, time-span=1ns, timestep size=1fs => 10^22 calculations.

A 3GHz computer takes 5.8 x 10^12 days to finish!!

Solution

Accelerate with FPGA.

Especially: the O(N^2) calculations.

To be more specific, the thesis addresses: Reciprocal electrostatic energy and force calculations. Smooth Particle Mesh Ewald algorithm.

Previous Work

Software Implementations: Original PME Package written by Toukmaji. NAMD2. AMBER.

Hardware Implementations: No previous hardware implementations of SPME. MD-Grape & MD-Engine used Ewald Summation. Ewald Summation is O(N^2); SPME is O(N log N)!

Calculations Involved

Smooth Particle Mesh Ewald

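Electrostatic Interaction

Coulombic equation:

$$v_{coulomb} = \frac{q_1 q_2}{4 \pi \epsilon_0 r}$$

Under the Periodic Boundary Condition, summation is only … Conditionally Convergent.

$$U = \frac{1}{2} \sum_{\mathbf{n}}{}^{'} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{q_i q_j}{r_{ij,\mathbf{n}}}$$

(The prime on the sum over image cells n means that the i = j term is omitted when n = 0.)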
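The primed lattice sum can be written out directly for a small number of image cells. The sketch below is a brute-force illustration only, assuming a cubic box, units in which 1/(4*pi*eps0) = 1, and a hypothetical cutoff `n_max` on the image-cell index; it is not how the RSCE or SPME evaluates the sum.

```python
import itertools
import math


def coulomb_energy(positions, charges, box, n_max):
    """U = 1/2 * sum over image cells n, particles i and j of q_i*q_j / r_ij,n.

    Only image cells with |n_x|, |n_y|, |n_z| <= n_max are included.
    Units assume 1/(4*pi*eps0) = 1.
    """
    shifts = range(-n_max, n_max + 1)
    total = 0.0
    for n in itertools.product(shifts, repeat=3):
        for i, (ri, qi) in enumerate(zip(positions, charges)):
            for j, (rj, qj) in enumerate(zip(positions, charges)):
                if n == (0, 0, 0) and i == j:
                    continue  # primed sum: no self-interaction in the home cell
                d = [rj[a] + n[a] * box - ri[a] for a in range(3)]
                total += qi * qj / math.sqrt(sum(c * c for c in d))
    return 0.5 * total
```

Each added shell of image cells multiplies the work, and because the sum is only conditionally convergent the result depends on how the cells are added up, which is the motivation for the Ewald approach.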

Periodic Boundary Condition

[Figure: a simulation cell containing particles 1 to 5, surrounded by nine replicated cells labeled A to I, each containing the same five particles.]

To combat Surface Effect…

Replication

Ewald Summation

Used for PBC to calculate the Coulombic interactions.

[Figure: three plots of charge q against distance r, illustrating the split into a Direct Sum part and a Reciprocal Sum part.]

O(N^2) Direct Sum + O(N^2) Reciprocal Sum.

Smooth Particle Mesh Ewald

Shift the workload to the Reciprocal Sum.

Use Fast Fourier Transform.

O(N) Real + O(N log N) Reciprocal.

RSCE calculates the Reciprocal Sum using the SPME algorithm.
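To get a feel for what the change in complexity means, the two growth rates can be compared numerically. The functions below are a rough sketch that drops all constant factors (an assumption made purely for illustration), so they show scaling trends only, not actual run times.

```python
import math


def ewald_cost(n):
    """Growth of plain Ewald summation: O(N^2) direct + O(N^2) reciprocal."""
    return 2 * n * n


def spme_cost(n):
    """Growth of SPME: O(N) real-space + O(N log N) reciprocal."""
    return n + n * math.log2(n)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, ewald_cost(n) / spme_cost(n))
```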

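SPME Reciprocal Energy

$$B(m_1, m_2, m_3) = |b_1(m_1)|^2 \, |b_2(m_2)|^2 \, |b_3(m_3)|^2$$

$$b_i(m_i) = \exp\left(\frac{2 \pi i (n-1) m_i}{K_i}\right) \left[ \sum_{k=0}^{n-2} M_n(k+1) \exp\left(\frac{2 \pi i m_i k}{K_i}\right) \right]^{-1}$$

$$C(m_1, m_2, m_3) = \frac{1}{\pi V} \frac{\exp(-\pi^2 \mathbf{m}^2 / \beta^2)}{\mathbf{m}^2} \text{ for } \mathbf{m} \neq 0, \quad C(0, 0, 0) = 0$$

$$\tilde{E}_{rec} = \frac{1}{2 \pi V} \sum_{\mathbf{m} \neq 0} \frac{\exp(-\pi^2 \mathbf{m}^2 / \beta^2)}{\mathbf{m}^2} \, B(m_1, m_2, m_3) \, F(Q)(m_1, m_2, m_3) \, F(Q)(-m_1, -m_2, -m_3)$$

$$\tilde{E}_{rec} = \frac{1}{2} \sum_{m_1=0}^{K_1-1} \sum_{m_2=0}^{K_2-1} \sum_{m_3=0}^{K_3-1} Q(m_1, m_2, m_3) \, (\theta_{rec} \star Q)(m_1, m_2, m_3)$$

Slide annotations: FFT, FFT (marking the terms evaluated with the Fast Fourier Transform).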
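The b_i(m_i) factors in the reciprocal energy are built from the cardinal B-spline M_n. A minimal sketch of both, using the standard recursion for M_n, is given below. The function names are illustrative and floating point is used throughout, so this shows the math only and not the fixed-point lookup scheme of the hardware.

```python
import cmath


def bspline(n, u):
    """Cardinal B-spline M_n(u) of order n >= 2, nonzero only for 0 < u < n."""
    if n == 2:
        return 1.0 - abs(u - 1.0) if 0.0 <= u <= 2.0 else 0.0
    # Standard recursion from order n-1 to order n.
    return (u * bspline(n - 1, u) + (n - u) * bspline(n - 1, u - 1.0)) / (n - 1)


def b_coefficient(n, m, k_size):
    """b(m) = exp(2*pi*i*(n-1)*m/K) / sum_{k=0}^{n-2} M_n(k+1) * exp(2*pi*i*m*k/K)."""
    numerator = cmath.exp(2j * cmath.pi * (n - 1) * m / k_size)
    denominator = sum(
        bspline(n, k + 1) * cmath.exp(2j * cmath.pi * m * k / k_size)
        for k in range(n - 1)
    )
    return numerator / denominator
```

The weights M_n(u), M_n(u+1), ..., M_n(u+n-1) for a fractional offset u always sum to one, which is what lets a particle's charge be spread over n mesh points per dimension without changing the total charge.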

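SPME Reciprocal Force

$$F_{\alpha i}^{rec} = -\frac{\partial \tilde{E}_{rec}}{\partial r_{\alpha i}} = -\sum_{m_1=0}^{K_1-1} \sum_{m_2=0}^{K_2-1} \sum_{m_3=0}^{K_3-1} \frac{\partial Q}{\partial r_{\alpha i}}(m_1, m_2, m_3) \, (\theta_{rec} \star Q)(m_1, m_2, m_3)$$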

Reciprocal Sum Compute Engine (RSCE)

RSCE Validation Environment

[Figure: block diagram with the following labels. Host: NAMD2, RSCE Driver. RS232 link. Multimedia Board: Virtex-II FPGA (MicroBlaze Softcore, OPB RS232, RSCE) and five ZBT Memory blocks.]

RSCE Architecture

[Figure: block diagram of the RSCE. Blocks: B-spline Coefficient Calculator (BCC), Mesh Composer (MC), 3D-FFT / 3D-IFFT, Energy Calculator (EC), Force Calculator (FC). Memories: PIM ZBT (upper half and lower half), BLM ZBT, ETM ZBT, QMMR / QMMI ZBT. Signal labels: address for particle i, fractional part of coordinates, integer part of coordinates and charge, lookup key, func and slope, Theta and dTheta, Theta(x, y, z)[1..P], Energy, Forces.]

RSCE Verification Testbench

[Figure: block diagram of the RSCE_TB testbench. Labels: Testcase, SystemC Wrapper, RSCE SystemC Fixed Point Model, opbBfm, RSCE Verilog RTL, Checker, Memory model & checker; memories PIM, ETM, BLM, QMMR, QMMI; packet types OpbPkt, memAccessPkt, upPkt.]

RSCE SystemC Model

[Figure: structure of the RSCE SystemC model. Labels: MAIN, HOST<DP, FXP>, RSCE<DP, FXP>, BCC<DP, FXP>, MC<DP, FXP>, 3D-FFT<DP, FXP>, EC<DP, FXP>, FC<DP, FXP>, BCC<FXP>, ETMC<FXP>; memories PIM, BLM, ETM, QMMR, QMMI; operations Fill, Update, Get, Transform.]

MD Simulations with the RSCE

RSCE Precision Goal

Goal: Relative error < 10^-5.

Two major calculation steps: B-Spline Calculation. 3D-FFT Calculation.

Due to limited logic resources and the limited precision of the FFT LogiCore, the precision goal CANNOT be achieved.
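How a relative error target relates to fixed-point word length can be illustrated with a small quantization experiment. This sketch only models rounding a single value to a given number of fractional bits. The helper names are hypothetical, and it says nothing about error growth inside the B-spline or FFT stages.

```python
def to_fixed(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def relative_error(value, frac_bits):
    """Relative error introduced by quantizing a nonzero value."""
    return abs(to_fixed(value, frac_bits) - value) / abs(value)


if __name__ == "__main__":
    for bits in (8, 16, 24):
        print(bits, relative_error(0.3, bits))
```

With 8 fractional bits the value 0.3 already misses a 10^-5 relative error target on its own, while 24 fractional bits leave several orders of magnitude of margin for this single rounding step.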

MD Simulation with RSCE

RMS Energy Error Fluctuation:

$$\text{RMS Energy Fluctuation} = \frac{\sqrt{\langle E^2 \rangle - \langle E \rangle^2}}{|\langle E \rangle|}$$
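The fluctuation metric is straightforward to compute from a list of per-timestep total energies. The sketch below assumes the usual definition, the standard deviation of the energy divided by the magnitude of its mean; the function name is illustrative.

```python
import math


def rms_energy_fluctuation(energies):
    """sqrt(<E^2> - <E>^2) / |<E>| over a sequence of total energies."""
    count = len(energies)
    mean = sum(energies) / count
    mean_sq = sum(e * e for e in energies) / count
    # Clamp tiny negative values caused by floating-point rounding.
    variance = max(mean_sq - mean * mean, 0.0)
    return math.sqrt(variance) / abs(mean)
```

A perfectly conserved energy gives zero, and a trajectory alternating between -99 and -101 gives 0.01.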

FFT Precision Vs. Energy Fluctuation

Speedup Analysis

RSCE vs. Software Implementation

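RSCE Speedup

RSCE @ 100MHz vs. Intel P4 @ 2.4GHz.

Speedup: 3x to 14x

RSCE Computation time:

$$T_{comp} = \left( t_{MC} + t_{3D\text{-}IFFT} + t_{3D\text{-}FFT} + t_{FC} \right) \times \frac{1}{Freq}$$

[Equation: T_comp expanded in terms of N, P, K and Freq; the exact expression is not recoverable from the extracted text.]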

RSCE Speedup

Why so insignificant? QMM bandwidth limitation. Sequential nature of the SPME algorithm.

Solution: Use more QMM memories.

Slight design modifications required.

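Multi-QMM RSCE Speedup

NQ-QMM RSCE Computation time:

$$T_{comp} = \left( t_{MC} + t_{3D\text{-}IFFT} + t_{3D\text{-}FFT} + t_{FC} \right) \times \frac{1}{Freq}$$

[Equation: T_comp expanded in terms of N, P, K, N_Q and Freq, with terms divided by N_Q; the exact expression is not recoverable from the extracted text.]

The 4-QMM RSCE Speedup: 14x to 20x.

Assume N is of the same order as KxKxK: Speedup: 3(NQ-1)x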

RSCE Speedup

N      P   K     Single-QMM Speedup   Four-QMM Speedup     Four-QMM Speedup
                 against Software     against Single-QMM   against Software
20000  4   32    5.44x                3.37x                18x
20000  4   64    6.97x                2.10x                14x
20000  4   128   10.70x               1.46x                15x
20000  8   32    3.72x                3.90x                14x
20000  8   64    5.17x                3.37x                17x
20000  8   128   7.94x                2.10x                16x
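The three speedup columns are related: the four-QMM speedup against software should be the product of the other two. The check below uses only the figures from the table; the `ROWS` layout and function name are illustrative.

```python
# (N, P, K, single-QMM vs software, four-QMM vs single-QMM, four-QMM vs software)
ROWS = [
    (20000, 4, 32, 5.44, 3.37, 18),
    (20000, 4, 64, 6.97, 2.10, 14),
    (20000, 4, 128, 10.70, 1.46, 15),
    (20000, 8, 32, 3.72, 3.90, 14),
    (20000, 8, 64, 5.17, 3.37, 17),
    (20000, 8, 128, 7.94, 2.10, 16),
]


def combined_speedup(single_vs_software, four_vs_single):
    """Four-QMM speedup against software implied by the two partial speedups."""
    return single_vs_software * four_vs_single


if __name__ == "__main__":
    for n, p, k, single, four, reported in ROWS:
        print(n, p, k, round(combined_speedup(single, four), 2), reported)
```

For every row the product, rounded down to a whole number, matches the reported four-QMM speedup against software.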

Parallelization Strategy

When Multiple RSCEs are Used Together

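RSCE Parallelization Strategy

[Figure: two panels. Left, "An 8x8x8 mesh": particles A to F placed on a mesh with axes 0..Kx and 0..Ky. Right, "Four 4x4x4 Mini Meshes": the same mesh and particles divided among processors P1 to P4.]

Assume a 2-D Simulation. Assume P=2, K=8, N=6. Assume NumP = 4.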
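For the 2-D case with K=8 and NumP=4, splitting the mesh into mini meshes amounts to mapping each mesh point to a processor. The sketch below assumes a 2 x 2 arrangement of 4 x 4 mini meshes numbered row by row from 0 (the actual P1 to P4 placement in the figure is not recoverable), and assumes the usual SPME convention that a particle at scaled coordinate u touches mesh points floor(u), floor(u)-1, ..., floor(u)-P+1 with wrap-around. All names are hypothetical.

```python
def owner_of(kx, ky, mesh_size=8, procs_per_side=2):
    """Index (0-based, row-major) of the processor owning mesh point (kx, ky)."""
    mini = mesh_size // procs_per_side
    return (ky // mini) * procs_per_side + (kx // mini)


def owners_touched(x, y, order=2, mesh_size=8, procs_per_side=2):
    """Set of processors whose mini meshes receive charge from a particle.

    x and y are non-negative scaled coordinates in [0, mesh_size).
    """
    xs = [(int(x) - k) % mesh_size for k in range(order)]
    ys = [(int(y) - k) % mesh_size for k in range(order)]
    return {owner_of(gx, gy, mesh_size, procs_per_side) for gx in xs for gy in ys}
```

A particle well inside a mini mesh touches one processor, while a particle near a mini-mesh edge or the periodic boundary contributes to several, which is what makes boundary particles the interesting case for any parallel mesh composition.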

RSCE Parallelization Strategy

Mini-mesh composed -> 2D-IFFT.

2D-IFFT = two passes of 1D-FFT (X and Y).

[Figure: two panels, "X Direction FFT" and "Y Direction FFT", showing how the mesh (axes 0..Kx and 0..Ky) is divided among processors P1 to P4 for the 1D FFT in the X direction and for the 1D FFT in the Y direction.]
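That a 2-D transform can be done as two passes of 1-D transforms is easy to verify numerically. The sketch below uses a naive O(K^2) DFT in place of a real FFT to stay dependency-free (an assumption for illustration only), and compares the row-then-column result against the direct 2-D definition.

```python
import cmath


def dft_1d(values):
    """Naive 1-D discrete Fourier transform (stand-in for a 1-D FFT)."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_2d_by_passes(mesh):
    """2-D DFT as an X-direction pass followed by a Y-direction pass."""
    rows = [dft_1d(row) for row in mesh]              # X direction
    cols = [dft_1d(list(col)) for col in zip(*rows)]  # Y direction
    return [list(row) for row in zip(*cols)]          # back to mesh[y][x] layout


def dft_2d_direct(mesh):
    """2-D DFT straight from its definition, for comparison."""
    ky, kx = len(mesh), len(mesh[0])
    return [
        [
            sum(
                mesh[y][x] * cmath.exp(-2j * cmath.pi * (x * u / kx + y * v / ky))
                for y in range(ky)
                for x in range(kx)
            )
            for u in range(kx)
        ]
        for v in range(ky)
    ]
```

Because each pass only needs complete rows or complete columns, the rows can be split among processors for the X pass and the columns for the Y pass, with a data exchange in between.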

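Parallelization Strategy

2D-IFFT -> Energy Calculation -> 2D-FFT

2D-FFT -> Force Calculation

$$E_{Total} = \sum_{P=0}^{3} E_P$$

[Figure: two panels, "Energy Calculation" and "Force Calculation". The mesh (axes 0..Kx and 0..Ky) is divided among processors P1 to P4; the per-processor energies are summed into E_Total, and after the 2D-FFT the forces on particles A to F are calculated.]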

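Multi-RSCE System

[Equations: the communication time T_comm, expressed in terms of K_X, K_Y, K_Z, NumP, the grid precision (Precision_Grid) and the transfer rate (Freq_TransferRate); and the computation time T_comp, expressed in terms of N, P, K, NumP and the system clock frequency (Freq_SystemClk). The exact expressions are not recoverable from the extracted text.]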

Conclusion

Successful integration of the RSCE into NAMD2.

Single-QMM RSCE Speedup = 3x to 14x.

NQ-QMM RSCE Speedup = 14x to 20x (for NQ = 4).

When N≈KxKxK, NQ-QMM Speedup = 3(NQ-1)x.

Multi-RSCE system is still a better alternative to the Multi-FPGA Ewald Summation system.

Future Work

Input Precision Analysis.

More in-depth FFT Precision Analysis.

Implementation of block floating-point FFT.

More investigation of how different simulation settings (K, P, and N) affect the RSCE speedup.

Investigate how to better parallelize the SPME algorithm.

Questions?
