An FPGA Implementation of the Ewald Direct Space and Lennard-Jones Compute Engines By: David Chui Supervisor: Professor P. Chow

Page 1:

An FPGA Implementation of the Ewald Direct Space and Lennard-Jones Compute Engines

By: David Chui

Supervisor: Professor P. Chow

Page 2:

Overview

1. Introduction and Motivation
2. Background and Previous Work
3. Hardware Compute Engines
4. Results and Performance
5. Conclusions and Future Work

Page 3:

1. Introduction and Motivation

Page 4:

What is Molecular Dynamics (MD) simulation?

Biomolecular simulations
Structure and behavior of biological systems
Uses classical mechanics to model a molecular system
Newtonian equations of motion (F = ma)
Compute forces and integrate acceleration through time to move atoms
A large scale MD system takes years to simulate

Page 5:

Why is this an interesting computational problem?

Quantity                                                      Value
Physical time for simulation                                  1e-4 sec
Time-step size                                                1e-15 sec
Number of time-steps                                          1e11
Number of atoms in a protein system                           32,000
Number of interactions                                        1e9
Number of instructions/force calculation                      1e3
Total number of machine instructions                          1e23
Estimated simulation time on a petaflop/sec capacity machine  3 years
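The figures on this slide chain together as simple arithmetic. The sketch below reproduces the estimate using only the slide's numbers; the variable names are illustrative, and treating a petaflop/sec machine as 1e15 instructions per second is an assumption made for the calculation.

```python
# Back-of-the-envelope estimate built only from the numbers on the slide.
physical_time_s = 1e-4                # physical time to simulate
time_step_s = 1e-15                   # time-step size
interactions_per_step = 1e9           # pair interactions per time-step
instructions_per_interaction = 1e3    # instructions per force calculation
machine_rate_per_s = 1e15             # assumption: 1 petaflop/sec = 1e15 instructions/sec

num_steps = physical_time_s / time_step_s
total_instructions = num_steps * interactions_per_step * instructions_per_interaction
seconds = total_instructions / machine_rate_per_s
years = seconds / (365.25 * 24 * 3600)

print(f"time-steps: {num_steps:.0e}")
print(f"total instructions: {total_instructions:.0e}")
print(f"estimated run time: {years:.1f} years")
```

The result comes out at a little over three years, which matches the slide's rounded figure.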

Page 6:

Motivation

Special-purpose computers for MD simulation have become an interesting application

FPGA technology:
  Reconfigurable
  Low cost for system prototype
  Short turn around time and development cycle
  Latest technology
  Design portability

Page 7:

Objectives

Implement the compute engines on FPGA
Calculate the non-bonded interactions in an MD simulation (Lennard-Jones and Ewald Direct Space)
Explore the hardware resources
Study the trade-off between hardware resources and computational precision
Analyze the hardware pipeline performance
Serve as components of a larger project in the future

Page 8:

2. Background and Previous Work

Page 9:

Lennard-Jones Potential

Attraction due to instantaneous dipole of molecules
Pair-wise non-bonded interactions, O(N^2)
Short range force
Use cut-off radius to reduce computations
Reduced complexity close to O(N)

Page 10:

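Lennard-Jones Potential of Argon gas

[Figure: Lennard-Jones potential of argon gas; x-axis: r (nm), from 0.3 to 1.5; y-axis: v(r)/kb (K), from -150 to 300]

$U_{LJ}(r) = 4\epsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right]$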
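A minimal sketch of the Lennard-Jones form $U_{LJ}(r) = 4\epsilon[(\sigma/r)^{12} - (\sigma/r)^{6}]$ with a cut-off radius follows. The argon parameters (ε/kb of about 120 K and σ of about 0.34 nm) are commonly quoted textbook values, not numbers taken from the slides, and the function names and the 1.0 nm cut-off are hypothetical.

```python
def lennard_jones(r_nm, epsilon_over_kb=120.0, sigma_nm=0.34):
    """Return U_LJ(r)/kb in kelvin for a separation r given in nm.

    The default parameters are assumed textbook values for argon.
    """
    sr6 = (sigma_nm / r_nm) ** 6
    return 4.0 * epsilon_over_kb * (sr6 * sr6 - sr6)


def lennard_jones_cutoff(r_nm, cutoff_nm=1.0, **params):
    """Same potential, but pairs beyond the cut-off radius contribute zero."""
    if r_nm >= cutoff_nm:
        return 0.0
    return lennard_jones(r_nm, **params)


if __name__ == "__main__":
    for r in (0.34, 0.38, 0.5, 0.9, 1.2):
        print(f"r = {r:.2f} nm  U/kb = {lennard_jones_cutoff(r):9.3f} K")
```

The potential crosses zero at r = σ and reaches its minimum of -ε at r = 2^(1/6) σ, which is consistent with the plotted range of roughly -150 K to 300 K.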

Page 11:

Electrostatic Potential

Attraction and repulsion due to electrostatic charge of particles (long range force)

Reformulate using Ewald Summation
Decompose to Direct Space and Reciprocal Space
Direct Space computation similar to Lennard-Jones
Direct Space complexity close to O(N)

Page 12:

Ewald Summation - Direct Space

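$U = \frac{1}{2} \sum_{\mathbf{n}}{}' \sum_{i,j}^{N} \frac{q_i q_j \, \mathrm{erfc}(\alpha r_{ij,\mathbf{n}})}{r_{ij,\mathbf{n}}}$

[Figure: erfc(x) plotted against x, for x from 0 to 7; y-axis from 0 to 1.2]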
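The direct-space term is a pairwise sum of q_i q_j erfc(α r) / r, so it can be evaluated with the same cut-off loop as Lennard-Jones. The sketch below assumes Gaussian-style units (no 1/(4πε0) prefactor), restricts the sum to the central box (n = 0), and uses hypothetical function names; α stands for the Ewald splitting parameter, whose value the slides do not give.

```python
import math


def ewald_direct_pair(q_i, q_j, r, alpha):
    """Direct-space contribution of one pair: q_i * q_j * erfc(alpha * r) / r."""
    return q_i * q_j * math.erfc(alpha * r) / r


def ewald_direct_energy(charges, positions, alpha, cutoff):
    """Sum the pair term over unique pairs closer than the cut-off.

    Summing over i < j is equivalent to the 1/2 prefactor applied to
    the full double sum over i and j.
    """
    energy = 0.0
    for i in range(len(charges)):
        for j in range(i + 1, len(charges)):
            r = math.dist(positions[i], positions[j])
            if r < cutoff:
                energy += ewald_direct_pair(charges[i], charges[j], r, alpha)
    return energy


if __name__ == "__main__":
    charges = [1.0, -1.0, 1.0]
    positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
    print(ewald_direct_energy(charges, positions, alpha=0.35, cutoff=2.5))
```

Because erfc decays quickly (it is already close to zero by x = 3 on the plotted curve), a finite cut-off discards only negligible terms, which is what keeps the direct-space cost close to O(N).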

Page 13:

Previous Hardware Developments

Project      Technology   Year
MD-GRAPE     0.6 µm       1996
MD-Engine    0.8 µm       1997
BlueGene/L   0.13 µm      2003
MD-GRAPE3    0.13 µm      2004

Page 14:

Recent work - FPGA based MD simulator

Transmogrifier-3 FPGA system, University of Toronto (2003)
  Estimated speedup of over 20 times over software with better hardware resources
  Fixed-point arithmetic, function table lookup, and interpolation

Xilinx Virtex-II Pro XC2VP70 FPGA, Boston University (2005)
  Achieved a speedup of over 88 times over software
  Fixed-point arithmetic, function table lookup, and interpolation

Page 15:

MD Simulation software - NAMD

Parallel runtime system (Charm++/Converse)
Highly scalable
Largest system simulated has over 300,000 atoms on 1000 processors
Spatial decomposition
Double precision floating point

Page 16:

NAMD - Spatial Decomposition

[Figure: spatial decomposition diagram. Labels: Simulation Box, Cell, Cutoff Radius]
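Spatial decomposition can be illustrated with a generic cell-list search: the simulation box is divided into cells at least one cut-off radius wide, so each atom only needs to be tested against atoms in its own cell and the 26 neighbouring cells. The sketch below is a plain cell list with hypothetical function names and no periodic boundaries; it is not NAMD's actual implementation.

```python
from collections import defaultdict
from itertools import product


def build_cells(positions, cutoff):
    """Bin atom indices into cubic cells whose edge equals the cut-off radius."""
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        key = (int(x // cutoff), int(y // cutoff), int(z // cutoff))
        cells[key].append(idx)
    return cells


def pairs_within_cutoff(positions, cutoff):
    """Return the set of index pairs (i, j), i < j, closer than the cut-off."""
    cells = build_cells(positions, cutoff)
    cutoff_sq = cutoff * cutoff
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        # Visit the cell itself and its 26 neighbours.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = cells.get((cx + dx, cy + dy, cz + dz))
            if not neighbours:
                continue
            for i in members:
                for j in neighbours:
                    if i < j:
                        d2 = sum((a - b) ** 2
                                 for a, b in zip(positions[i], positions[j]))
                        if d2 < cutoff_sq:
                            pairs.add((i, j))
    return pairs


if __name__ == "__main__":
    atoms = [(0.1, 0.1, 0.1), (0.9, 0.1, 0.1), (1.2, 0.1, 0.1), (5.0, 5.0, 5.0)]
    print(sorted(pairs_within_cutoff(atoms, cutoff=1.0)))
```

With a roughly uniform density, the number of atoms per cell is bounded, so the pair search grows close to linearly with the number of atoms instead of quadratically.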

Page 17:

3. Hardware Compute Engines

Page 18:

Purpose and Design Approach

Implement the functionality of the software compute object

Calculate the non-bonded interactions given the particle information

Fixed-point arithmetic, function table lookup, and interpolation

Pipelined architecture

Page 19:

Compute Engine Block Diagram

[Figure: compute engine block diagram. Labels: i(x, y, z) and j(x, y, z), with ix in {7.25} format; function |Δr|² = |Δx|² + |Δy|² + |Δz|²; ZBT Memory; Lookup/Linear Interpolation; constant; Multiplication/Addition; F(x, y, z) and E]
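The first pipeline stage computes |Δr|² = |Δx|² + |Δy|² + |Δz|² on fixed-point coordinates. The sketch below models that step with Python integers, assuming that {7.25} means 7 integer bits and 25 fractional bits (the slides do not spell the notation out). The helper names are hypothetical, and range checking and sign handling for the 7-bit integer part are omitted.

```python
FRAC_BITS = 25  # assumption: {7.25} = 7 integer bits, 25 fractional bits


def to_fixed(value, frac_bits=FRAC_BITS):
    """Quantize a real coordinate to an integer with frac_bits fractional bits."""
    return int(round(value * (1 << frac_bits)))


def fixed_to_float(value, frac_bits):
    """Convert a fixed-point integer back to a float for inspection."""
    return value / (1 << frac_bits)


def squared_distance_fixed(pos_i, pos_j):
    """Return |dr|^2 for two fixed-point positions.

    Each product of two values with 25 fractional bits carries 50
    fractional bits, so the result is in a wider format than the inputs.
    """
    total = 0
    for a, b in zip(pos_i, pos_j):
        delta = a - b
        total += delta * delta
    return total


if __name__ == "__main__":
    atom_i = tuple(to_fixed(c) for c in (1.5, 2.0, 0.25))
    atom_j = tuple(to_fixed(c) for c in (0.5, 0.0, 0.25))
    r2 = squared_distance_fixed(atom_i, atom_j)
    print(fixed_to_float(r2, 2 * FRAC_BITS))
```

Working with |Δr|² directly avoids a square root in hardware, which is one reason the lookup table in such designs is indexed by the squared distance.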

Page 20:

Function Lookup Table

The function to be looked up is a function of |r|² (the squared separation distance between a pair of atoms)

Block floating point lookup
Partition function based on different precision

Page 21:

Function Lookup Table

[Figure: function lookup table layout. Labels: Partition; Value and Slope stored per entry; ZBT Memory Bank; r]
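A table that stores a value and a slope per entry supports linear interpolation with one multiply and one add. The sketch below uses a single uniform table in floating point, indexed by x = |r|²; the partitioning into regions of different precision (block floating point) and the ZBT memory layout are not modeled, and the function names and the example function x^-3 (that is, r^-6) are illustrative assumptions.

```python
def build_table(func, x_min, x_max, entries):
    """Precompute (value, slope) pairs on a uniform grid over [x_min, x_max]."""
    step = (x_max - x_min) / entries
    table = []
    for k in range(entries):
        x0 = x_min + k * step
        y0 = func(x0)
        y1 = func(x0 + step)
        table.append((y0, (y1 - y0) / step))
    return table, step


def lookup(table, step, x_min, x):
    """Evaluate the tabulated function at x by linear interpolation."""
    k = int((x - x_min) / step)
    k = min(max(k, 0), len(table) - 1)  # clamp to the table range
    value, slope = table[k]
    return value + slope * (x - (x_min + k * step))


if __name__ == "__main__":
    # x is the squared distance, so r**-6 becomes x**-3 and needs no square root.
    table, step = build_table(lambda x: x ** -3, 0.5, 4.0, 1024)
    for x in (0.6, 1.0, 2.5):
        approx = lookup(table, step, 0.5, x)
        print(f"x = {x}: table = {approx:.6f}, exact = {x ** -3:.6f}")
```

The interpolation error shrinks with the square of the grid spacing, which is why the table size is one of the precision parameters varied in the results.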

Page 22:

Hardware Testing Configuration

[Figure: hardware testing configuration. Labels: NAMD main( ); Compute Object Ewald( ); Compute Object Lennard_Jones( ); Communication Bus; Ewald Hardware Engine; Lennard-Jones Hardware Engine]

Page 23:

4. Results and Performance

Page 24:

Simulation Overview

Software model
Different coordinate precisions and lookup table sizes
Obtain the error compared to computation using double precision

Page 25:

Total Energy Fluctuation

[Figure: Total Energy Fluctuation: Ewald Direct Space. y-axis: log(Relative rms Fluctuation), from -7 to 0; x-axis: Various Precision (FP, 16K, 4K, 1K, 10^1x, 10^2x, 10^3x, 10^4x, 10^5x); series: Time-step 1.0 fs and Time-step 0.1 fs]
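The slides do not define the plotted quantity. The sketch below assumes the usual definition of relative RMS fluctuation, sqrt(<E²> - <E>²) / |<E>|, taken over the total energy samples of a run, and assumes the logarithm is base 10; the function names are hypothetical.

```python
import math


def relative_rms_fluctuation(energies):
    """Return sqrt(<E^2> - <E>^2) / |<E>| for a sequence of energy samples."""
    n = len(energies)
    mean = sum(energies) / n
    variance = sum((e - mean) ** 2 for e in energies) / n
    return math.sqrt(variance) / abs(mean)


def log_relative_rms_fluctuation(energies):
    """Base-10 logarithm of the relative RMS fluctuation (assumed base)."""
    return math.log10(relative_rms_fluctuation(energies))


if __name__ == "__main__":
    samples = [99.0, 101.0, 99.0, 101.0]  # made-up values for illustration only
    print(log_relative_rms_fluctuation(samples))
```

Under this definition, a well-behaved integration conserves total energy closely, so a more negative value indicates that a given precision setting disturbs the dynamics less.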

Page 26:

Average Total Energy

[Figure: Average Total Energy: Ewald Direct Space. y-axis: |<E>|, from 268 to 282; x-axis: Various Precision (FP, 16K, 4K, 1K, 10^1x, 10^2x, 10^3x, 10^4x, 10^5x); series: Time-step 1.0 fs and Time-step 0.1 fs]

Page 27:

Operating Frequency

                     Compute Engine   Arithmetic Core
Lennard-Jones        43.6 MHz         80.0 MHz
Ewald Direct Space   47.5 MHz         82.2 MHz

Page 28:

Latency and Throughput

                     Latency     Throughput
Lennard-Jones        59 clocks   33.33%
Ewald Direct Space   44 clocks   100%
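Latency and throughput combine into a simple estimate of how long a pipeline takes to process a batch of atom pairs. The sketch below assumes that a throughput of 33.33% means one result every three clocks and 100% means one result per clock; that reading, the function names, and the batch size are assumptions, while the latencies and compute-engine frequencies are the figures reported in the slides.

```python
def pipeline_clocks(num_pairs, latency_clocks, throughput):
    """Clocks to push num_pairs through a pipeline.

    The first result appears after latency_clocks; each further result
    follows 1/throughput clocks later.
    """
    if num_pairs <= 0:
        return 0.0
    return latency_clocks + (num_pairs - 1) / throughput


def pipeline_seconds(num_pairs, latency_clocks, throughput, frequency_hz):
    """Wall-clock time for the batch at the given operating frequency."""
    return pipeline_clocks(num_pairs, latency_clocks, throughput) / frequency_hz


if __name__ == "__main__":
    pairs = 1_000_000  # hypothetical batch size
    lj = pipeline_seconds(pairs, 59, 1 / 3, 43.6e6)
    ewald = pipeline_seconds(pairs, 44, 1.0, 47.5e6)
    print(f"Lennard-Jones:      {lj * 1e3:.1f} ms")
    print(f"Ewald Direct Space: {ewald * 1e3:.1f} ms")
```

For large batches the latency term is negligible and the throughput dominates, which is why the Lennard-Jones engine stands to gain most from avoiding shared multipliers.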

Page 29:

Hardware Improvement

Operating frequency:
  Place-and-route constraints
  More pipeline stages

Throughput:
  More hardware resources
  Avoid sharing of multipliers

Page 30:

Compared with previous work

Pipelined adders and multipliers
Block floating point memory lookup
Support different types of atoms

Lennard-Jones:

System             Latency (clocks)   Operating Frequency (MHz)
Transmogrifier3    11                 26.0
Xilinx Virtex-II   59                 80.0

Page 31:

5. Conclusions and Future Work

Page 32:

Hardware Precision

A combination of fixed-point arithmetic, function table lookup, and interpolation can achieve high precision
  Similar result in RMS energy fluctuation and average energy
  Coordinate precision of {7.41}
  Table lookup size of 1K

Block floating point memory
  Data precision maximized
  Different types of functions

Page 33:

Hardware Performance

Compute engines operating frequency:
  Ewald Direct Space: 82.2 MHz
  Lennard-Jones: 80.0 MHz

Achieving 100 MHz is feasible with newer FPGAs

Page 34:

Future Work

Study different types of MD systems
Simulate computation error with different table lookup sizes and interpolation orders
Hardware usage: storing data in block RAMs instead of external ZBT memory