An FPGA Implementation of the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (RSCE)
An FPGA Implementation of
the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine
(RSCE)
Sam Lee
2
What is this Thesis about?
Implementation:
  Reciprocal Sum Compute Engine (RSCE).
  FPGA based.
  Accelerates part of Molecular Dynamics simulation.
  Smooth Particle Mesh Ewald.
Investigation:
  Precision requirement.
  Speedup capability.
  Parallelization strategy.
3
Outline
What is Molecular Dynamics Simulation?
What calculations are involved?
How do we accelerate and parallelize the calculations?
What did we find out about precision?
What did we find out about speedup?
What is left to be done?
4
Molecular Dynamics Simulation
5
Molecular Dynamics Simulation
[Figure: three charged particles A, B, and C (charge labels 1+, 2-, 1+) with arrows for the pairwise forces: A on B, B on A, A on C, C on A, B on C, and C on B.]
E = -∇V (Electric Field = -Gradient of Potential)
F = QE (Force = Charge x Electric Field)
F = ma (Force = Mass x Acceleration)
Time integration => New Positions and Velocities
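The force and time-integration relations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: reduced units (the Coulomb constant is set to 1), no periodic boundary, and a simple explicit update of velocities and positions. The names `coulomb_forces` and `step` are hypothetical and do not come from the thesis.

```python
import math

def coulomb_forces(pos, q, k=1.0):
    """Direct O(N^2) pairwise Coulomb forces (assumed reduced units, k = 1)."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [pos[i][a] - pos[j][a] for a in range(3)]  # vector from j to i
            r = math.sqrt(sum(c * c for c in d))
            s = k * q[i] * q[j] / r**3                     # force magnitude divided by r
            for a in range(3):
                forces[i][a] += s * d[a]   # force by j on i
                forces[j][a] -= s * d[a]   # equal and opposite force by i on j
    return forces

def step(pos, vel, q, mass, dt):
    """One time step: a = F / m, then update velocities and positions in place."""
    f = coulomb_forces(pos, q)
    for i in range(len(pos)):
        for a in range(3):
            vel[i][a] += f[i][a] / mass[i] * dt
            pos[i][a] += vel[i][a] * dt
    return pos, vel
```

Every pair is visited once per step, which is where the O(N^2) cost of the direct calculation comes from.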
6
MD Simulation
Problem scientists are facing:
SLOW!
O(N^2).
N=10^5, time-span=1ns, timestep size=1fs => 10^22 calculations.
A 3GHz computer takes 5.8 x 10^12 days to finish!!
7
Solution
Accelerate with FPGA.
Especially: the O(N^2) calculations.
To be more specific, the thesis addresses:
  Reciprocal electrostatic energy and force calculations.
  Smooth Particle Mesh Ewald algorithm.
8
Previous Work
Software Implementations:
  Original PME Package written by Toukmaji.
  NAMD2.
  AMBER.
Hardware Implementations:
  No previous hardware implementations of SPME.
  MD-Grape & MD-Engine used Ewald Summation.
  Ewald Summation is O(N^2); SPME is O(N log N)!
9
Calculations Involved
Smooth Particle Mesh Ewald
10
Electrostatic Interaction
Coulombic equation:
$$v_{coulomb} = \frac{q_1 q_2}{4 \pi \varepsilon_0 r}$$
Under the Periodic Boundary Condition, summation is only … Conditionally Convergent.
$$U = \frac{1}{2} \sum_{n}{}^{\prime} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{q_i q_j}{r_{ij,n}}$$
11
Periodic Boundary Condition
[Figure: a simulation box containing particles 1 to 5, and its replication into image boxes labeled A to I, each containing the same particles 1 to 5.]
To combat the Surface Effect: Replication.
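Replication of the simulation box can be sketched as below. This is a minimal 2D illustration that assumes a square box of side `box` and a 3x3 block of copies with the original box at offset (0, 0); the function name `replicate_box_2d` and the offset convention are assumptions made for the example, not details from the slides.

```python
def replicate_box_2d(particles, box):
    """Return {(ix, iy): shifted copies} for the 3x3 block of boxes around the original."""
    images = {}
    for iy in (-1, 0, 1):
        for ix in (-1, 0, 1):
            # Every particle in the original box reappears shifted by a whole box length.
            images[(ix, iy)] = [(x + ix * box, y + iy * box) for x, y in particles]
    return images
```

Each particle then interacts with the other particles and with their shifted copies, so no particle sits next to an artificial surface.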
12
Ewald Summation
Used for PBC.
[Figure: three plots of q against r, illustrating the split of the point charges into a Direct Sum part and a Reciprocal Sum part.]
To calculate the Coulombic interactions.
O(N^2) Direct Sum + O(N^2) Reciprocal Sum.
13
Smooth Particle Mesh Ewald
Shift the workload to the Reciprocal Sum.
Use Fast Fourier Transform.
O(N) Real + O(N log N) Reciprocal.
RSCE calculates the Reciprocal Sum using the SPME algorithm.
14
SPME Reciprocal Energy
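$$B(m_1, m_2, m_3) = |b_1(m_1)|^2 \cdot |b_2(m_2)|^2 \cdot |b_3(m_3)|^2$$
$$b_i(m_i) = \exp\left(\frac{2 \pi i (n-1) m_i}{K_i}\right) \times \left[\sum_{k=0}^{n-2} M_n(k+1) \exp\left(\frac{2 \pi i m_i k}{K_i}\right)\right]^{-1}$$
$$C(m_1, m_2, m_3) = \frac{1}{\pi V} \frac{\exp(-\pi^2 m^2 / \beta^2)}{m^2} \quad (m \neq 0), \qquad C(0, 0, 0) = 0$$
$$\tilde{E} = \frac{1}{2 \pi V} \sum_{m \neq 0} \frac{\exp(-\pi^2 m^2 / \beta^2)}{m^2} B(m_1, m_2, m_3) F(Q)(m_1, m_2, m_3) F(Q)(-m_1, -m_2, -m_3)$$
$$\tilde{E}_{rec} = \frac{1}{2} \sum_{m_1=0}^{K_1-1} \sum_{m_2=0}^{K_2-1} \sum_{m_3=0}^{K_3-1} Q(m_1, m_2, m_3) (\theta_{rec} \star Q)(m_1, m_2, m_3)$$
Slide annotations on the energy equations: FFT, FFT.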
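The B-spline weights M_n and the charge mesh Q in the equations above can be illustrated with a small 1D sketch. It uses the standard cardinal B-spline recursion, which the slides do not spell out, and it spreads one charge onto a periodic 1D mesh rather than the 3D mesh the RSCE works on. The names `bspline` and `spread_charge_1d`, and the parameter name `order` for the interpolation order, are hypothetical.

```python
import math

def bspline(n, u):
    """Cardinal B-spline M_n(u), nonzero only for 0 < u < n."""
    if n == 2:
        return max(0.0, 1.0 - abs(u - 1.0))
    # Standard recursion from order n-1 to order n.
    return (u * bspline(n - 1, u) + (n - u) * bspline(n - 1, u - 1.0)) / (n - 1)

def spread_charge_1d(u, q, K, order):
    """Spread charge q at scaled coordinate u (0 <= u < K) onto a periodic 1D mesh."""
    mesh = [0.0] * K
    base = math.floor(u)
    for j in range(order):
        k = base - j                          # the grid points where M_n(u - k) can be nonzero
        mesh[k % K] += q * bspline(order, u - k)
    return mesh
```

Because the weights M_n(u - k) sum to 1 over the grid points, the mesh holds the same total charge as the particle; for example, `sum(spread_charge_1d(5.3, 2.0, 8, 4))` equals 2.0 up to floating-point round-off.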
15
SPME Reciprocal Force
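The reciprocal force on particle i along axis α is obtained from the gradient of the reciprocal energy:
$$\frac{\partial \tilde{E}_{rec}}{\partial r_{\alpha i}} = \sum_{m_1=0}^{K_1-1} \sum_{m_2=0}^{K_2-1} \sum_{m_3=0}^{K_3-1} \frac{\partial Q}{\partial r_{\alpha i}}(m_1, m_2, m_3) (\theta_{rec} \star Q)(m_1, m_2, m_3)$$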
16
Reciprocal Sum Compute Engine
(RSCE)
17
RSCE Validation Environment
[Figure: block diagram. A Multimedia Board with a Virtex-II FPGA containing the RSCE, a MicroBlaze softcore, OPB, and RS232, plus five ZBT memories; an RS232 link to the Host, which runs NAMD2 and the RSCE Driver.]
18
RSCE Architecture
[Figure: RSCE block diagram.
Blocks: B-Spline Coefficient Calculator (BCC), Mesh Composer (MC), 3D-FFT / 3D-IFFT, Energy Calculator (EC), Force Calculator (FC).
Memories: PIM ZBT (upper half and lower half), BLM ZBT, ETM ZBT, QMMI/QMMR ZBT.
Signal labels: address for particle i, fractional part of coordinates, integer part of coordinates and charge, lookup key, func and slope, Theta and dTheta, Theta(x, y, z)[1..P], Energy, Forces.]
19
RSCE Verification Testbench
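[Figure: RSCE_TB testbench. A Testcase drives, through a SystemC Wrapper, both the RSCE SystemC Fixed Point Model and the RSCE Verilog RTL. Other blocks: opbBfm, Checker, and a memory model & checker for the PIM, ETM, BLM, QMMR, and QMMI memories. Packet labels: OpbPkt, memAccessPkt, upPkt.]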
20
RSCE SystemC Model
[Figure: structure of the SystemC model. Classes: MAIN, HOST<DP, FXP>, RSCE<DP, FXP>, BCC<DP, FXP>, BCC<FXP>, MC<DP, FXP>, 3D-FFT<DP, FXP>, EC<DP, FXP>, FC<DP, FXP>, ETMC<FXP>. Memories: PIM, BLM, ETM, QMMI, QMMR. Operation labels: Fill, Update, Get, Transform.]
21
MD Simulations with the RSCE
22
RSCE Precision Goal
Goal: Relative error < 10^-5.
Two major calculation steps:
  B-Spline Calculation.
  3D-FFT Calculation.
Due to limited logic resources + the limited precision of the FFT LogiCore => Precision goal CANNOT be achieved.
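How a limited fixed-point width interacts with a relative-error goal can be sketched as follows. The bit widths used here are purely illustrative and are not the number formats of the RSCE or of the FFT LogiCore; `to_fixed` and `relative_error` are hypothetical helper names.

```python
def to_fixed(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (fixed-point quantization)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def relative_error(approx, exact):
    """Relative error of approx with respect to a nonzero exact value."""
    return abs(approx - exact) / abs(exact)
```

With these helpers, quantizing 0.3 to 8 fractional bits misses a 10^-5 relative-error goal, while 24 fractional bits meets it. Errors introduced in each calculation step accumulate, so the width needed in practice depends on the whole datapath.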
23
MD Simulation with RSCE
RMS Energy Error Fluctuation:
$$\text{RMS Energy Fluctuation} = \frac{\sqrt{\langle E^2 \rangle - \langle E \rangle^2}}{|\langle E \rangle|}$$
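The RMS energy fluctuation can be computed from a list of per-timestep total energies as sketched below. The angle brackets are taken to be plain averages over the recorded timesteps, which is an assumption; the function name `rms_energy_fluctuation` is hypothetical.

```python
import math

def rms_energy_fluctuation(energies):
    """sqrt(<E^2> - <E>^2) / |<E>| over a sequence of total energies."""
    n = len(energies)
    mean_e = sum(energies) / n
    mean_e2 = sum(e * e for e in energies) / n
    variance = max(0.0, mean_e2 - mean_e * mean_e)  # clamp tiny negative round-off
    return math.sqrt(variance) / abs(mean_e)        # abs: total energies are often negative
```

A perfectly conserved energy gives 0, and energies alternating between 9 and 11 give 0.1.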
24
FFT Precision Vs. Energy Fluctuation
25
Speedup Analysis
RSCE vs. Software Implementation
26
RSCE Speedup
RSCE @ 100MHz vs. Intel P4 @ 2.4GHz.
Speedup: 3x to 14x
RSCE Computation time:
$$T_{comp} = (t_{MC} + t_{3D-IFFT} + t_{3D-FFT} + t_{FC}) \times \frac{1}{Freq}$$
[Equation: T_comp expanded in terms of N, P, K, and Freq.]
27
RSCE Speedup
Why so insignificant?
  QMM bandwidth limitation.
  Sequential nature of the SPME algorithm.
Solution: Use more QMM memories.
Slight design modifications required.
28
Multi-QMM RSCE Speedup
$$T_{comp} = (t_{MC} + t_{3D-IFFT} + t_{3D-FFT} + t_{FC}) \times \frac{1}{Freq}$$
NQ-QMM RSCE Computation time:
[Equation: T_comp expanded in terms of N, P, K, N_Q, and Freq.]
The 4-QMM RSCE Speedup: 14x to 20x.
Assume N is of the same order as KxKxK: Speedup: 3(NQ-1)x
29
RSCE Speedup
N      P  K    Single-QMM Speedup   Four-QMM Speedup     Four-QMM Speedup
               against Software     against Single-QMM   against Software
20000  4  32   5.44x                3.37x                18x
20000  4  64   6.97x                2.10x                14x
20000  4  128  10.70x               1.46x                15x
20000  8  32   3.72x                3.90x                14x
20000  8  64   5.17x                3.37x                17x
20000  8  128  7.94x                2.10x                16x
30
Parallelization Strategy
When Multiple RSCEs are Used Together
31
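RSCE Parallelization Strategy
[Figure: two panels showing particles A to F on a Kx by Ky mesh (origin 0). Left panel: "An 8x8x8 mesh". Right panel: "Four 4x4x4 Mini Meshes", with the mesh divided among P1, P2, P3, and P4.]
Assume a 2-D Simulation.
Assume P=2, K=8, N=6.
Assume NumP = 4.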
32
RSCE Parallelization Strategy
[Figure: two panels, "X Direction FFT" and "Y Direction FFT". Each shows the Kx by Ky mesh (origin 0) divided among P1 to P4, with the 1D FFT running along the X direction and along the Y direction respectively.]
Mini-mesh composed -> 2D-IFFT.
2D-IFFT = two passes of 1D-FFT (X and Y).
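That a 2D transform equals a pass of 1D transforms along X followed by a pass along Y can be checked with a small sketch. A naive O(n^2) DFT stands in for the 1D FFT core here, and the forward transform is used; the inverse transform is separable in the same way. All function names are hypothetical.

```python
import cmath

def dft_1d(x):
    """Naive O(n^2) 1D DFT (stand-in for a 1D FFT core)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]

def dft_2d_two_pass(mesh):
    """2D transform as a pass of 1D transforms along X, then along Y."""
    rows = [dft_1d(row) for row in mesh]              # X direction, one row at a time
    cols = [dft_1d(list(col)) for col in zip(*rows)]  # Y direction, one column at a time
    return [list(row) for row in zip(*cols)]          # transpose back to [y][x] layout

def dft_2d_direct(mesh):
    """Direct 2D DFT from the definition, for comparison."""
    ky, kx = len(mesh), len(mesh[0])
    return [[sum(mesh[y][x] * cmath.exp(-2j * cmath.pi * (mx * x / kx + my * y / ky))
                 for y in range(ky) for x in range(kx))
             for mx in range(kx)] for my in range(ky)]
```

Each 1D transform in a pass touches a single row or a single column, which is why a pass can be split among P1 to P4 as long as every processor holds complete rows (or columns) for that pass.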
33
Parallelization Strategy
[Figure: two panels, "Energy Calculation" and "Force Calculation", showing the Kx by Ky mesh (origin 0) with particles A to F divided among P1 to P4.]
$$E_{Total} = \sum_{P=0}^{3} E_P$$
2D-IFFT -> Energy Calculation -> 2D-FFT.
2D-FFT -> Force Calculation.
34
Multi-RSCE System
[Equation: T_comm, the communication time, in terms of K_X, K_Y, K_Z, NumP, Precision_Grid, and Freq_TransferRate.]
[Equation: T_comp, the computation time, in terms of N, P, K, NumP, and Freq_SystemClk.]
35
Conclusion
Successful integration of the RSCE into NAMD2.
Single-QMM RSCE Speedup = 3x to 14x.
NQ-QMM RSCE Speedup = 14x to 20x.
When N≈KxKxK, NQ-QMM Speedup = (NQ-1)3x.
Multi-RSCE system is still a better alternative than the Multi-FPGA Ewald Summation system.
36
Future Work
Input Precision Analysis.
More in-depth FFT Precision Analysis.
Implementation of a block floating-point FFT.
More investigation of how different simulation settings (K, P, and N) affect the RSCE speedup.
Investigate how to better parallelize the SPME algorithm.
37
Questions?