Post on 20-May-2020
MD-CUDA
Presented by Wes Toland
Syed Nabeel
Outline
Objectives
Project Organization
CPU
GPU
GPGPU
CUDA
N-body problem
MD on CUDA
Evaluation
Future Work
Objectives
Understand molecular dynamics (MD) simulations
Analyze various algorithms for parallelizing MD simulations
Study GPGPU architectures
Learn the CUDA programming model and API
Port existing parallel MD code to parallel GPGPU code
Evaluate the scalability of our parallel GPGPU MD application
Project Organization
Research parallel MD algorithms
Research parallel GPGPU MD algorithms
Install CUDA drivers
Install the CUDA API
Use the N-body work-distributing framework to make a parallel MD code
Evaluation
Problems Encountered (1/2)
Installing the CUDA API took some time because it required several packages that had specific glibc dependencies.
This was solved by installing the following modules via YaST (on SUSE platforms):
– gcc
– gcc-c++
– freeglut-devel
– glibc-devel-32bit
– kernel-source
Problems Encountered (2/2)
Once the CUDA APIs were successfully installed, we encountered runtime errors when attempting to run Nvidia benchmarks.
The correct Nvidia GPU kernel module was not installed.
Solution: download and install the kernel module for the Nvidia Quadro FX 5600 GPU driver.
CPU Instruction Level Parallelism (ILP)
Instructions are re-ordered and combined into groups
The groups of instructions are then executed in parallel without changing the result of the program
Modern processors have multi-stage instruction pipelines
Each stage in the pipeline corresponds to a different action the processor performs on the instruction in that stage
CPU Limitations
ILP is increasingly difficult to extract from the instruction stream
Control hardware dominates microprocessors
– Complex, difficult to build and verify
– Takes a substantial fraction of the die
– Scales poorly
– Control hardware does not do any math
GPU
A specialized processor attached to a graphics card, dedicated to calculating floating-point operations
Incorporates custom microchips that implement special mathematical operations
Stream processing: applications can use multiple computational units without explicitly managing allocation, synchronization, or communication among those units
GPU Limitations
High learning curve
The programming model of most graphics processors is inadequate for most non-graphics applications and application programmers
Limitations in writing and scattering information
DRAM memory bandwidth bottleneck
GPGPU
General-purpose computing on the GPU
The GPU is viewed as a compute device capable of executing a large number of threads in parallel
The GPU operates as a co-processor to the main CPU
Portions of code that can be parallelized and are computationally intensive can be off-loaded onto the GPU
GPGPU Applications
Physical modeling
Computational engineering
Game effects (FX) physics
Image processing
Matrix algebra
Convolution
Correlation
Sorting
GPGPU Challenges
Unintuitive graphics API
Confusing addressing modes
Shader capabilities are limited
Limited instruction sets
Limited communication
– Explicit data transfer is typically required
CUDA to the Rescue
CUDA: Compute Unified Device Architecture
CUDA manages GPU computations with a parallel model similar to certain CPU paradigms
– The user does not need to map computations to a graphics API
Layers
– Hardware driver
– Application programming interface (API)
– CUDA runtime
– CUFFT
– CUBLAS
CUDA Layers
GPU Architecture in CUDA: Memory Addressing Modes
– General DRAM memory addressing
– Shared memory addressing
General DRAM Memory Addressing
More programming flexibility than traditional GPUs
Ability to read and write data at any location in DRAM, just like on a CPU
Shared Memory
Parallel data cache
Very fast general read and write access
Minimizes over-fetch and round-trips to DRAM
GPU as a Computation Device
Certain parallelizable or computationally intensive portions of code are executed on the multi-threaded GPU
Kernel: the common function compiled to the instruction set of the device and downloaded to it
Separate DRAM for host and device
– Host memory
– Device memory
Thread Blocks
A batch of threads that operates with a limited amount of shared memory
Synchronization points are used to coordinate shared memory access
Each thread knows its 1D, 2D, or 3D thread ID
Grid of Thread Blocks
A set of blocks of the same size executing the same kernel
A single block has a limited number of threads
Many more threads are available in the grid
Inter-thread communication between blocks is avoided
Each block is identified by its 1D, 2D, or 3D block ID
CUDA Memory Model
Read-write per-thread registers
Read-write per-thread local memory
Read-write per-block shared memory
Read-write per-grid global memory
Read-only per-grid constant memory
Read-only per-grid texture memory
Hardware Implementation (1/2)
A set of SIMD multi-processors with on-chip shared memory
Hardware Implementation (2/2)
SIMD behavior through groups of threads called warps
One or more thread blocks are executed on each multi-processor using time-slicing
The issue order of the warps within a block is undefined
The issue order of the blocks within a grid of thread blocks is undefined
N-body Problem
Numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body
– Astrophysical simulation: the bodies attract each other through the gravitational force
– Protein folding: simulations calculate electrostatic and van der Waals forces
– Turbulent fluid flow simulation
– Global illumination computation in computer graphics
All-pairs Approach to N-body Simulation
A brute-force technique to evaluate all pair-wise interactions among N bodies
– Relatively simple method
– O(N²) computational complexity
A typical example of a routine that is used as a kernel to determine the forces in close-range interactions
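A minimal CPU sketch of this brute-force pass (hypothetical helper names, G taken as 1, and a small softening term `eps2` to avoid division by zero; the actual MD-CUDA kernels appear later in the slides):

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body { float x, y, z, m; };  // position and mass

// Brute-force all-pairs pass: O(N^2) interactions, each row of the pair-wise
// grid summed into one acceleration vector. With eps2 > 0, the j == i term
// contributes exactly zero (r is the zero vector), so no special case is needed.
std::vector<std::array<float, 3>> allPairsAccel(const std::vector<Body>& b,
                                                float eps2) {
    std::vector<std::array<float, 3>> acc(b.size(), {0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < b.size(); ++i) {
        for (std::size_t j = 0; j < b.size(); ++j) {
            float rx = b[j].x - b[i].x;
            float ry = b[j].y - b[i].y;
            float rz = b[j].z - b[i].z;
            float distSqr = rx * rx + ry * ry + rz * rz + eps2;
            float invDist = 1.0f / std::sqrt(distSqr);
            float s = b[j].m * invDist * invDist * invDist;  // m_j / |r|^3
            acc[i][0] += s * rx;
            acc[i][1] += s * ry;
            acc[i][2] += s * rz;
        }
    }
    return acc;
}
```

The double loop makes the O(N²) cost explicit; the tiling discussed later restructures exactly this computation for data reuse.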
Accelerating the All-pairs Approach
The all-pairs component requires substantial time to compute, making it a target for acceleration
Optimal reuse of data
– Computation is arranged in tiles
– Interactions in each row are evaluated in sequential order, updating the acceleration vector
– Separate rows are evaluated in parallel
Force Computations
Given N bodies with an initial position x_i and velocity v_i for 1 ≤ i ≤ N, the force vector f_ij on body i caused by its gravitational attraction to body j is:

f_ij = G · (m_i · m_j / |r_ij|²) · (r_ij / |r_ij|)

where m_i and m_j are the masses of bodies i and j, r_ij = x_j − x_i is the vector from body i to body j, and G is the gravitational constant.
Total Force
The total force F_i on body i, due to its interactions with the other N − 1 bodies, is obtained by summing all interactions:

F_i = Σ_{1 ≤ j ≤ N, j ≠ i} f_ij = G · m_i · Σ_{1 ≤ j ≤ N, j ≠ i} (m_j · r_ij) / |r_ij|³
Softening Factor
As bodies approach each other, the force between them grows without bound, which is an undesirable situation for numerical integration
A softening factor ε² > 0 is added to the denominator:

F_i ≈ G · m_i · Σ_{1 ≤ j ≤ N} (m_j · r_ij) / (|r_ij|² + ε²)^(3/2)

The softening factor enforces a limit on the magnitude of the force between the bodies (and the j = i term now contributes zero, so the condition j ≠ i can be dropped)
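A quick numeric check of this bound, as a sketch with an assumed helper name: the softened per-pair term m_j · r / (r² + ε²)^(3/2) vanishes at r = 0, stays finite for every r, and approaches the unsoftened m_j/r² at large distances.

```cpp
#include <cmath>

// Magnitude of one softened interaction term along a single axis.
// With eps2 > 0 this is 0 at r = 0 and bounded for all r, unlike the
// unsoftened m_j / r^2, which diverges as r -> 0.
float softenedTerm(float r, float mj, float eps2) {
    return mj * r / std::pow(r * r + eps2, 1.5f);
}
```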
Acceleration Calculation
To integrate over time, we need the acceleration a_i = F_i/m_i to update the position and velocity of body i
The integrator used to update the positions and velocities is a Leapfrog-Verlet integrator
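One leapfrog step in its common kick-drift-kick form can be sketched as follows (a one-dimensional illustration with hypothetical names; the MD-CUDA code applies the same scheme per body in 3D):

```cpp
#include <cmath>

// One Leapfrog-Verlet step (kick-drift-kick form): advance the velocity by a
// half step, the position by a full step, then the velocity by the remaining
// half step using the acceleration at the updated position.
// accel stands in for whatever force routine supplies a(x).
void leapfrogStep(float& x, float& v, float dt, float (*accel)(float)) {
    v += 0.5f * dt * accel(x);  // half kick
    x += dt * v;                // drift
    v += 0.5f * dt * accel(x);  // half kick at the new position
}
```

For a constant acceleration the scheme reproduces the exact kinematics, which makes it easy to sanity-check.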
Code Organization (1/3)
C++ files (CPU)
– md.cpp
  Main function
  Instantiates an mdsystemcuda object
  Uses the mdsystemcuda object to make CUDA calls that initialize CPU and GPU memory and run the CUDA kernel
– mdsystemcuda.cpp
  Provides functionality to initialize CPU and GPU memory
  Has several wrapper functions for calls to the CUDA kernel
Code Organization (2/3)
CUDA files (GPU)
– mdsystemcuda.cu
  Provides functions to copy the position and velocity arrays to GPU memory
  integrateMDSystem is the top-level kernel function called by an mdsystemcuda object. This function calls the lower-level kernel function, integrateBodies.
– md_kernel.cu
  Contains the integrateBodies function, which updates the position and velocity arrays
Code Organization (3/3)
Input files
– md.in
  InitUcell[3]: determines N, the number of bodies
  StepLimit, StepAverage: determine the number of iterations
  p: N/p is the number of thread blocks
  q: threads per body
Compiling MD-CUDA (1/2)
Compiling MD-CUDA (2/2)
Compile C++ files:
g++ $(CFLAGS) -o mdsystemcuda.cpp_o -c mdsystemcuda.cpp
g++ $(CFLAGS) -o md.cpp_o -c md.cpp
Compile CUDA files:
nvcc $(CFLAGS) -o mdsystemcuda.cu_o -c mdsystemcuda.cu
Compile the final executable:
g++ -fPIC -o MD-CUDA mdsystemcuda.cpp_o md.cpp_o mdsystemcuda.cu_o $(LDFLAGS) -lcudart -lcutil
CUDA Implementation of N-body
The all-pairs algorithm calculates each entry f_ij in an N×N grid of all pair-wise forces
The total force F_i (or acceleration a_i) on body i is obtained from the sum of all entries in row i
Each entry can be computed independently
– O(N²) available parallelism
– Requires O(N²) memory
– Would be substantially limited by memory bandwidth
Computational Tile (1/3)
Serialize some of the computations
– Achieves the data reuse needed to reach peak performance of the arithmetic units
– Reduces the memory bandwidth required
Computational tile: a square region of the grid of pair-wise forces consisting of p rows and p columns
Computational Tile (2/3)
Only 2p body descriptions are required to evaluate all p² interactions in the tile
– p descriptions can be reused later
These body descriptions can be stored in shared memory or in registers
The total effect of the interactions in the tile on the p bodies is captured as an update to p acceleration vectors
Computational Tile (3/3)
For optimal data reuse, tile computation is organized so that:
– The interactions in each row are evaluated in sequential order, updating the acceleration vector
– The separate rows are evaluated in parallel
[Figure: I/O for tile computation and evaluation order]
Body-Body Force Calculation
The interaction between a pair of bodies is implemented as an entirely serial computation
The bodyBodyInteraction function does the following:
– Computes the force on body i from its interaction with body j
– Updates the acceleration a_i of body i as a result of this interaction
FLOPS in bodyBodyInteraction: 20
– Additions
– Multiplications
– A sqrtf() call
– A division (or reciprocal)
Code for bodyBodyInteraction
– r_ij: 3 FLOPS
– distSqr = dot(r_ij, r_ij) + EPS²: 6 FLOPS
– invDistCube = 1/distSqr^(3/2): 4 FLOPS (2 mul, 1 sqrt, 1 inv)
– s = m_j · invDistCube: 1 FLOP
– a_i = a_i + s * r_ij: 6 FLOPS
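A CPU rendition of the annotated steps above, as a sketch only (hypothetical struct names, and an assumed value for the softening constant EPS2; the real kernel operates on device float4/float3 data):

```cpp
#include <cmath>

struct Vec3  { float x, y, z; };
struct Body4 { float x, y, z, w; };  // position in x,y,z; mass in w

const float EPS2 = 1e-9f;  // softening factor (assumed value)

// Per-pair computation following the slide's FLOP accounting (20 total):
// r_ij (3), distSqr (6), invDistCube (2 mul + 1 sqrt + 1 inv = 4),
// s = m_j * invDistCube (1), a_i update (6).
Vec3 bodyBodyInteraction(Body4 bi, Body4 bj, Vec3 ai) {
    Vec3 r = {bj.x - bi.x, bj.y - bi.y, bj.z - bi.z};         // 3 FLOPS
    float distSqr = r.x * r.x + r.y * r.y + r.z * r.z + EPS2; // 6 FLOPS
    float distSixth = distSqr * distSqr * distSqr;            // 2 FLOPS
    float invDistCube = 1.0f / std::sqrt(distSixth);          // sqrt + inv
    float s = bj.w * invDistCube;                             // 1 FLOP
    ai.x += s * r.x; ai.y += s * r.y; ai.z += s * r.z;        // 6 FLOPS
    return ai;
}
```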
float4 Data Type
This data type is used for accelerations stored in GPU device memory
It allows coalesced memory access to arrays of data in device memory
– Results in more efficient memory requests and transfers
Each body's mass is stored in the w field of the body's float4 position
3D vectors are stored as float3 variables
– Register space is an issue and coalesced access is not
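The layout can be mirrored on the host to see why it helps (a sketch with a hypothetical struct name): packing the mass into w makes each body description a single aligned 16-byte record, the shape that coalesced loads of body arrays rely on.

```cpp
#include <cstddef>

// Host-side mirror of CUDA's float4 body layout: position in x,y,z and the
// body's mass packed into w, so one 16-byte record carries a full description.
struct alignas(16) BodyF4 { float x, y, z, w; };

static_assert(sizeof(BodyF4) == 16, "one body description = one 16-byte load");

// The mass rides along in the w field, so position and mass arrive together.
float massOf(const BodyF4& b) { return b.w; }
```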
Tile Calculation
A tile is evaluated by p threads performing the same sequence of operations on different data
Each thread updates the acceleration of one body as a result of its interaction with p other bodies
Load p body descriptions from the GPU device memory into the shared memory provided to each thread block
Each thread in the block evaluates p successive interactions
The result of the tile calculation is p updated accelerations
Tile Calculation (continued)
Each of the p threads:
– executes the function body in parallel
– iterates over the same p bodies
– computes the acceleration of its individual body as a result of interaction with p other bodies
In the kernel, one local variable holds the position of the body for the executing thread, and an array of body descriptions resides in shared memory
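The tiling scheme can be walked through on the CPU (a sketch with assumed names; loops stand in for threads). Bodies are consumed p at a time: the copy into `shared` plays the role of the shared-memory load, after which every "thread" evaluates p successive interactions against the same cached bodies.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct BodyT { float x, y, z, m; };

// CPU walk-through of the tiling scheme. The copy into `shared` models the
// per-block shared-memory load; the i loop models the p threads of a block
// (here run over all bodies), each consuming the same p cached bodies.
// Assumes n is a multiple of p. The result equals the plain all-pairs sum.
std::vector<std::array<float, 3>> tiledAccel(const std::vector<BodyT>& b,
                                             std::size_t p, float eps2) {
    const std::size_t n = b.size();
    std::vector<std::array<float, 3>> acc(n, {0.0f, 0.0f, 0.0f});
    for (std::size_t tile = 0; tile < n; tile += p) {
        std::vector<BodyT> shared(b.begin() + tile, b.begin() + tile + p);
        for (std::size_t i = 0; i < n; ++i) {      // each "thread" owns body i
            for (const BodyT& bj : shared) {       // p successive interactions
                float rx = bj.x - b[i].x, ry = bj.y - b[i].y, rz = bj.z - b[i].z;
                float d2 = rx * rx + ry * ry + rz * rz + eps2;
                float inv = 1.0f / std::sqrt(d2);
                float s = bj.m * inv * inv * inv;
                acc[i][0] += s * rx; acc[i][1] += s * ry; acc[i][2] += s * rz;
            }
        }
    }
    return acc;
}
```

Because the tiles merely reorder the same additions, any valid p yields the same accelerations; on the GPU the payoff is that each body is staged through shared memory instead of being re-read from DRAM.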
Clustering Tiles into Thread Blocks
A thread block has p threads that execute some number of tiles in sequence
– Sized to balance parallelism with data reuse
– The degree of parallelism (i.e., the number of rows) must be sufficiently large to interleave multiple warps (to hide latencies in interaction evaluation)
The amount of data reuse grows with the number of columns
– It also governs the size of the transfer of bodies from device memory into shared memory
The size of the tile also determines the register space and shared memory required
Thread Blocks for the N-Body Implementation (1/3)
For this implementation, tiles are square, of size p by p
Before executing a tile:
– Each thread fetches one body into shared memory
– Threads synchronize after the fetch
Consequently, each tile starts with p successive bodies in the shared memory
Thread Blocks for the N-Body Implementation (2/3)
Time spans the horizontal direction
Parallelism spans the vertical direction
Heavy lines demarcate the tiles of computation, showing
– where shared memory is loaded
– where a barrier synchronization is performed
In a thread block:
– There are N/p tiles
– p threads compute the forces on p bodies (one thread per body)
– Each thread computes all N interactions for one body
Thread Blocks for the N-Body Implementation (3/3)
Multiple threads work from left to right
Synchronization takes place at the end of each tile of computation
CUDA Kernel Code to Calculate N-body Forces for a Thread Block
The kernel receives pointers to global device memory for the positions (devX) and the accelerations (devA) of the bodies. The input parameters are assigned to local pointers with type conversion so they can be indexed as arrays. One synchronization point ensures that all shared memory locations are populated before the gravitation computation proceeds; another ensures that all threads finish their gravitation computation before advancing to the next tile.
Defining a Grid of Thread Blocks
The kernel program in the previous section calculates the acceleration of p bodies in a system, caused by their interaction with all N bodies in the system
The kernel is invoked on a grid of thread blocks to compute the acceleration of all N bodies
There are p threads per block and one thread per body
The number of thread blocks needed to cover all N bodies is N/p
Define a 1D grid of size N/p
The result is a total of N threads that perform N force calculations each, for a total of N² interactions
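The sizing arithmetic can be captured in a few lines (hypothetical helper names; N is assumed to be a multiple of p, as in the slides' configurations):

```cpp
// Launch-shape arithmetic from the slides: p threads per block, one thread
// per body, so a 1D grid of N/p blocks covers all N bodies. Each thread then
// performs N force calculations, for N^2 interactions in total.
struct LaunchShape { int blocks; int threadsPerBlock; };

LaunchShape gridFor(int n, int p) { return LaunchShape{n / p, p}; }

long long totalInteractions(int n) { return 1LL * n * n; }
```

For the evaluation runs below (N = 16384, p = 256) this gives a 1D grid of 64 blocks.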
Evaluation of the Full Grid of Interactions
The vertical dimension shows the parallelism of the 1D grid of N/p independent thread blocks with p threads each
The horizontal dimension shows the sequential processing of N force calculations in each thread
A thread block reloads its shared memory every p steps to share p positions of data
MD.cpp
MDSystemCUDA.cpp
Execution Configuration
Calls to a __global__ function must specify the execution configuration
An expression with the following form appears between the function name and the args list:

<<< Dg, Db, Ns, S >>>

– Dg is of type dim3 and specifies the dimension and size of the grid
– Db is of type dim3 and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block
– Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per block; Ns is an optional argument which defaults to 0
– S is of type cudaStream_t and specifies the associated stream
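A hypothetical launch matching the configuration the slides describe might look as follows (the kernel name `integrateBodies` and the `devX`/`devA` pointers are taken from earlier slides; the exact argument list of the real kernel is assumed):

```cuda
// Sketch only: a 1D grid of N/p blocks, p threads per block, and p float4
// body descriptions of dynamically allocated shared memory per block.
// The optional stream argument S is omitted here.
int numBlocks = N / p;
size_t sharedMemSize = p * sizeof(float4);  // one body description per thread
integrateBodies<<< numBlocks, p, sharedMemSize >>>(devX, devA, N);
```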
MDSystemCUDA.cu
Testbed
CPU
– 8 Intel Xeon processors
– 4 cores @ 3.00 GHz per processor
– 4 GB RAM
– SUSE 10.2 (x86-64)
GPU
– Nvidia Quadro FX 5600
Quadro FX 5600 Specifications
– Memory size: 1.5 GB GDDR3
– Memory interface: 384-bit
– Memory bandwidth: 76.8 GB/sec
– Max power consumption: 171 W
– Number of slots: 2
– Display connectors: DVI-I, DVI-I Stereo
– Dual-link DVI: 2
Performance Testing
Three tests are performed to understand the performance of MD-CUDA:
– Test 1: find the number of iterations that achieves near-maximum utilization
– Test 2: find a value of N (number of bodies) that achieves the highest performance at a reasonable execution time
– Test 3: find the optimal p and q values; full utilization occurs for p*q = 256
Test 1: GFLOPS/s for Different Numbers of Iterations
N = 16384, p = 256, q = 1
Standard deviation: min = 0.0008367, max = 0.0139
1000 iterations produced an average performance of 214.6474 GFLOPS/s, compared to 214.7892 GFLOPS/s for 3000 iterations
The runtime of 1000 iterations was 2.998x faster than that of 3000 iterations
Test 2: GFLOPS/s for Different Problem Sizes
p = 256, q = 1, # of iterations = 1000
Varied N to analyze scalability
Standard deviation: min = 0.000859, max = 0.0244
N = 16384 performed at an average of 214.6454 GFLOPS/s, whereas N = 32768 produced an average of 216.602 GFLOPS/s
The runtime of N = 16384 was 3.963x faster than that of N = 32768
Test 3: GFLOPS/s for Different Values of p and q
Test 3: GFLOPS/s for Different p and q
[Chart: higher GFLOPS/s are observed for q = 1 and p > 1]
Future Work
The accumulation of the potential energy is not an easy task to perform in parallel
One naïve solution would be to have every thread update a single shared memory location with the potential for a single body-body interaction, in a synchronized fashion
– This serializes a portion of the code and is not acceptable
Another naïve solution would be to copy the position and velocity arrays off the GPU and serially compute the potential energy on the CPU
– This also has poor performance and requires some redundant computations to obtain the total potential energy
We need to develop a parallel summation algorithm that possibly uses partial summation within thread blocks
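The partial-summation idea can be sketched on the CPU (hypothetical helper name; on the GPU each per-block partial sum would be produced by a shared-memory reduction rather than a serial loop):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Two-level summation: each "thread block" first reduces its own slice of
// per-interaction potential terms to one partial sum, and a final pass sums
// the partials. blockSize must be > 0.
float blockwiseSum(const std::vector<float>& terms, std::size_t blockSize) {
    std::vector<float> partials;
    for (std::size_t start = 0; start < terms.size(); start += blockSize) {
        std::size_t end = std::min(start + blockSize, terms.size());
        partials.push_back(std::accumulate(terms.begin() + start,
                                           terms.begin() + end, 0.0f));
    }
    // Final reduction over the per-block partial sums.
    return std::accumulate(partials.begin(), partials.end(), 0.0f);
}
```

The appeal of this structure is that the first level maps onto the existing thread blocks, leaving only a short sum of N/p partials to be done at the end.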
References
Test 3 (1/5)
Test 3 (2/5)
Test 3 (3/5)
Test 3 (4/5)