Post on 20-May-2020
MD-CUDA
Presented by Wes Toland
Syed Nabeel
Outline
Objectives
Project Organization
CPU
GPU
GPGPU
CUDA
N-body problem
MD on CUDA
Evaluation
Future Work
Objectives
Understand molecular dynamics (MD) simulations
Analyze various algorithms for parallelizing MD simulations
Study GPGPU architectures
Learn the CUDA programming model and API
Port existing parallel MD code to parallel GPGPU code
Evaluate the scalability of our parallel GPGPU MD application
Project Organization
Research parallel MD algorithms
Research parallel GPGPU MD algorithms
Install CUDA drivers
Install the CUDA API
Use the N-body work-distributing framework to make a parallel MD code
Evaluation
Problems Encountered (1/2)
Installing the CUDA API took some time because it required several packages that had specific glibc dependencies.
This was solved by installing the following modules via YaST (on SUSE platforms):
– gcc
– gcc-c++
– freeglut-devel
– glibc-devel-32bit
– kernel-source
Problems Encountered (2/2)
Once the CUDA APIs were successfully installed, we encountered runtime errors when attempting to run Nvidia benchmarks.
The correct Nvidia GPU kernel module was not installed.
Solution: download and install the kernel module for the Nvidia Quadro FX 5600 GPU driver.
CPU Instruction Level Parallelism (ILP)
Instructions are re-ordered and combined into groups
The groups of instructions are then executed in parallel without changing the result of the program
Modern processors have multi-stage instruction pipelines
Each stage in the pipeline corresponds to a different action the processor performs on the instruction in that stage
CPU Limitations
ILP is increasingly difficult to extract from the instruction stream
Control hardware dominates microprocessors
– Complex, difficult to build and verify
– Takes a substantial fraction of the die
– Scales poorly
– Control hardware does not do any math
GPU
A specialized processor attached to a graphics card, dedicated to calculating floating-point operations
Incorporates custom microchips that implement special mathematical operations
Stream processing: applications can use multiple computational units without explicitly managing allocation, synchronization, or communication among those units
GPU Limitations
High learning curve
The programming model of most graphics processors is inadequate for most non-graphics applications and application programmers
Limitations in writing and scattering information
DRAM memory bandwidth bottleneck
GPGPU
General-purpose computing on the GPU
The GPU is viewed as a compute device capable of executing a large number of threads in parallel
The GPU operates as a co-processor to the main CPU
Portions of code that can be parallelized and are computationally intensive can be off-loaded onto the GPU
GPGPU Applications
Physical modeling
Computational engineering
Game effects (FX) physics
Image processing
Matrix algebra
Convolution
Correlation
Sorting
GPGPU Challenges
Unintuitive graphics API
Confusing addressing modes
Shader capabilities are limited
Limited instruction sets
Limited communication
– Explicit data transfer is typically required
CUDA to the Rescue
CUDA: Compute Unified Device Architecture
CUDA manages GPU computations with a parallel model similar to certain CPU paradigms
– The user does not need to map computations to a graphics API
Layers
– Hardware driver
– Application programming interface (API)
– CUDA runtime
– CUFFT
– CUBLAS
CUDA Layers
GPU Architecture in CUDA: Memory Addressing Modes
– General DRAM memory addressing
– Shared memory addressing
General DRAM Memory Addressing
More programming flexibility than traditional GPUs
Ability to read and write data at any location in DRAM, just like on a CPU
Shared Memory
Parallel data cache
Very fast general read and write access
Minimizes over-fetch and round-trips to DRAM
GPU as a Computation Device
Certain parallelizable or computationally intensive portions of code are executed on the multi-threaded GPU
Kernel: the common function compiled to the instruction set of the device and downloaded to it
Separate DRAM for host and device
– Host memory
– Device memory
Thread Blocks
A batch of threads that operates with a limited amount of shared memory
Synchronization points are used to coordinate shared memory access
Each thread knows its 1D, 2D, or 3D thread ID
Grid of Thread Blocks
A set of blocks of the same size executing the same kernel
A single block has a limited number of threads
Many more threads are available in the grid
Inter-thread communication between blocks is avoided
Each block is identified by its 1D, 2D, or 3D block ID
CUDA Memory Model
Read-write per-thread registers
Read-write per-thread local memory
Read-write per-block shared memory
Read-write per-grid global memory
Read-only per-grid constant memory
Read-only per-grid texture memory
Hardware Implementation (1/2)
A set of SIMD multi-processors with on-chip shared memory
Hardware Implementation (2/2)
SIMD behavior through groups of threads called warps
One or more thread blocks are executed on each multi-processor using time-slicing
The issue order of the warps within a block is undefined
The issue order of the blocks within a grid of thread blocks is undefined
N-body Problem
Numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body
– Astrophysical simulation: the bodies attract each other through the gravitational force
– Protein folding: simulations calculate electrostatic and van der Waals forces
– Turbulent fluid flow simulation
– Global illumination computation in computer graphics
All-pairs Approach to N-body Simulation
A brute-force technique to evaluate all pair-wise interactions among N bodies
– Relatively simple method
– O(N²) computational complexity
A typical example of a routine that is used as a kernel to determine the forces in close-range interactions
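A minimal CPU sketch of this brute-force pass (hypothetical helper names, G taken as 1, and a small softening term `eps2` to avoid division by zero; the actual MD-CUDA kernels appear later in the slides):

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body { float x, y, z, m; };  // position and mass

// Brute-force all-pairs pass: O(N^2) interactions, each row of the pair-wise
// grid summed into one acceleration vector. With eps2 > 0, the j == i term
// contributes exactly zero (r is the zero vector), so no special case is needed.
std::vector<std::array<float, 3>> allPairsAccel(const std::vector<Body>& b,
                                                float eps2) {
    std::vector<std::array<float, 3>> acc(b.size(), {0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < b.size(); ++i) {
        for (std::size_t j = 0; j < b.size(); ++j) {
            float rx = b[j].x - b[i].x;
            float ry = b[j].y - b[i].y;
            float rz = b[j].z - b[i].z;
            float distSqr = rx * rx + ry * ry + rz * rz + eps2;
            float invDist = 1.0f / std::sqrt(distSqr);
            float s = b[j].m * invDist * invDist * invDist;  // m_j / |r|^3
            acc[i][0] += s * rx;
            acc[i][1] += s * ry;
            acc[i][2] += s * rz;
        }
    }
    return acc;
}
```

The double loop makes the O(N²) cost explicit; the tiling discussed later restructures exactly this computation for data reuse.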
Accelerating the All-pairs Approach
The all-pairs component requires substantial time to compute, making it a target for acceleration
Optimal reuse of data
– Computation is arranged in tiles
– Interactions in each row are evaluated in sequential order, updating the acceleration vector
– Separate rows are evaluated in parallel
Force Computations
Given N bodies with an initial position x_i and velocity v_i for 1 ≤ i ≤ N, the force vector f_ij on body i caused by its gravitational attraction to body j is:

f_ij = G · (m_i · m_j / |r_ij|²) · (r_ij / |r_ij|)

where m_i and m_j are the masses of bodies i and j, r_ij = x_j − x_i is the vector from body i to body j, and G is the gravitational constant.
Total Force
The total force F_i on body i, due to its interactions with the other N − 1 bodies, is obtained by summing all interactions:

F_i = Σ_{1 ≤ j ≤ N, j ≠ i} f_ij = G · m_i · Σ_{1 ≤ j ≤ N, j ≠ i} (m_j · r_ij) / |r_ij|³
Softening Factor
As bodies approach each other, the force between them grows without bound, which is an undesirable situation for numerical integration
A softening factor ε² > 0 is added to the denominator:

F_i ≈ G · m_i · Σ_{1 ≤ j ≤ N} (m_j · r_ij) / (|r_ij|² + ε²)^(3/2)

The softening factor enforces a limit on the magnitude of the force between the bodies (and the j = i term now contributes zero, so the condition j ≠ i can be dropped)
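A quick numeric check of this bound, as a sketch with an assumed helper name: the softened per-pair term m_j · r / (r² + ε²)^(3/2) vanishes at r = 0, stays finite for every r, and approaches the unsoftened m_j/r² at large distances.

```cpp
#include <cmath>

// Magnitude of one softened interaction term along a single axis.
// With eps2 > 0 this is 0 at r = 0 and bounded for all r, unlike the
// unsoftened m_j / r^2, which diverges as r -> 0.
float softenedTerm(float r, float mj, float eps2) {
    return mj * r / std::pow(r * r + eps2, 1.5f);
}
```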
Acceleration Calculation
To integrate over time, we need the acceleration a_i = F_i/m_i to update the position and velocity of body i
The integrator used to update the positions and velocities is a Leapfrog-Verlet integrator
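One leapfrog step in its common kick-drift-kick form can be sketched as follows (a one-dimensional illustration with hypothetical names; the MD-CUDA code applies the same scheme per body in 3D):

```cpp
#include <cmath>

// One Leapfrog-Verlet step (kick-drift-kick form): advance the velocity by a
// half step, the position by a full step, then the velocity by the remaining
// half step using the acceleration at the updated position.
// accel stands in for whatever force routine supplies a(x).
void leapfrogStep(float& x, float& v, float dt, float (*accel)(float)) {
    v += 0.5f * dt * accel(x);  // half kick
    x += dt * v;                // drift
    v += 0.5f * dt * accel(x);  // half kick at the new position
}
```

For a constant acceleration the scheme reproduces the exact kinematics, which makes it easy to sanity-check.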
Code Organization (1/3)
C++ files (CPU)
– md.cpp
  Main function
  Instantiates an mdsystemcuda object
  Uses the mdsystemcuda object to make CUDA calls that initialize CPU and GPU memory and run the CUDA kernel
– mdsystemcuda.cpp
  Provides functionality to initialize CPU and GPU memory
  Has several wrapper functions for calls to the CUDA kernel
Code Organization (2/3)
CUDA files (GPU)
– mdsystemcuda.cu
  Provides functions to copy the position and velocity arrays to GPU memory
  integrateMDSystem is the top-level kernel function called by an mdsystemcuda object. This function calls the lower-level kernel function, integrateBodies.
– md_kernel.cu
  Contains the integrateBodies function, which updates the position and velocity arrays
Code Organization (3/3)
Input files
– md.in
  InitUcell[3]: determines N, the number of bodies
  StepLimit, StepAverage: determine the number of iterations
  p: N/p is the number of thread blocks
  q: threads per body
Compiling MD-CUDA (1/2)
Compiling MD-CUDA (2/2)
Compile C++ files:
g++ $(CFLAGS) -o mdsystemcuda.cpp_o -c mdsystemcuda.cpp
g++ $(CFLAGS) -o md.cpp_o -c md.cpp
Compile CUDA files:
nvcc $(CFLAGS) -o mdsystemcuda.cu_o -c mdsystemcuda.cu
Compile the final executable:
g++ -fPIC -o MD-CUDA mdsystemcuda.cpp_o md.cpp_o mdsystemcuda.cu_o $(LDFLAGS) -lcudart -lcutil
CUDA Implementation of N-body
The all-pairs algorithm calculates each entry f_ij in an N×N grid of all pair-wise forces
The total force F_i (or acceleration a_i) on body i is obtained from the sum of all entries in row i
Each entry can be computed independently
– O(N²) available parallelism
– Requires O(N²) memory
– Would be substantially limited by memory bandwidth
Computational Tile (1/3)
Serialize some of the computations
– Achieves the data reuse needed to reach peak performance of the arithmetic units
– Reduces the memory bandwidth required
Computational tile: a square region of the grid of pair-wise forces consisting of p rows and p columns
Computational Tile (2/3)
Only 2p body descriptions are required to evaluate all p² interactions in the tile
– p descriptions can be reused later
These body descriptions can be stored in shared memory or in registers
The total effect of the interactions in the tile on the p bodies is captured as an update to p acceleration vectors
Computational Tile (3/3)
For optimal data reuse, tile computation is organized so that:
– The interactions in each row are evaluated in sequential order, updating the acceleration vector
– The separate rows are evaluated in parallel
[Figure: I/O for tile computation and evaluation order]
Body-Body Force Calculation
The interaction between a pair of bodies is implemented as an entirely serial computation
The bodyBodyInteraction function does the following:
– Computes the force on body i from its interaction with body j
– Updates the acceleration a_i of body i as a result of this interaction
FLOPS in bodyBodyInteraction: 20
– Additions
– Multiplications
– A sqrtf() call
– A division (or reciprocal)
Code for bodyBodyInteraction
– r_ij: 3 FLOPS
– distSqr = dot(r_ij, r_ij) + EPS²: 6 FLOPS
– invDistCube = 1/distSqr^(3/2): 4 FLOPS (2 mul, 1 sqrt, 1 inv)
– s = m_j · invDistCube: 1 FLOP
– a_i = a_i + s * r_ij: 6 FLOPS
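A CPU rendition of the annotated steps above, as a sketch only (hypothetical struct names, and an assumed value for the softening constant EPS2; the real kernel operates on device float4/float3 data):

```cpp
#include <cmath>

struct Vec3  { float x, y, z; };
struct Body4 { float x, y, z, w; };  // position in x,y,z; mass in w

const float EPS2 = 1e-9f;  // softening factor (assumed value)

// Per-pair computation following the slide's FLOP accounting (20 total):
// r_ij (3), distSqr (6), invDistCube (2 mul + 1 sqrt + 1 inv = 4),
// s = m_j * invDistCube (1), a_i update (6).
Vec3 bodyBodyInteraction(Body4 bi, Body4 bj, Vec3 ai) {
    Vec3 r = {bj.x - bi.x, bj.y - bi.y, bj.z - bi.z};         // 3 FLOPS
    float distSqr = r.x * r.x + r.y * r.y + r.z * r.z + EPS2; // 6 FLOPS
    float distSixth = distSqr * distSqr * distSqr;            // 2 FLOPS
    float invDistCube = 1.0f / std::sqrt(distSixth);          // sqrt + inv
    float s = bj.w * invDistCube;                             // 1 FLOP
    ai.x += s * r.x; ai.y += s * r.y; ai.z += s * r.z;        // 6 FLOPS
    return ai;
}
```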
float4 Data Type
This data type is used for accelerations stored in GPU device memory
It allows coalesced memory access to arrays of data in device memory
– Results in more efficient memory requests and transfers
Each body's mass is stored in the w field of the body's float4 position
3D vectors are stored as float3 variables
– Register space is an issue and coalesced access is not
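The layout can be mirrored on the host to see why it helps (a sketch with a hypothetical struct name): packing the mass into w makes each body description a single aligned 16-byte record, the shape that coalesced loads of body arrays rely on.

```cpp
#include <cstddef>

// Host-side mirror of CUDA's float4 body layout: position in x,y,z and the
// body's mass packed into w, so one 16-byte record carries a full description.
struct alignas(16) BodyF4 { float x, y, z, w; };

static_assert(sizeof(BodyF4) == 16, "one body description = one 16-byte load");

// The mass rides along in the w field, so position and mass arrive together.
float massOf(const BodyF4& b) { return b.w; }
```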
Tile Calculation
A tile is evaluated by p threads performing the same sequence of operations on different data
Each thread updates the acceleration of one body as a result of its interaction with p other bodies
Load p body descriptions from the GPU device memory into the shared memory provided to each thread block
Each thread in the block evaluates p successive interactions
The result of the tile calculation is p updated accelerations
Tile Calculation (continued)
Each of the p threads:
– executes the function body in parallel
– iterates over the same p bodies
– computes the acceleration of its individual body as a result of interaction with p other bodies
In the kernel, one local variable holds the position of the body for the executing thread, and an array of body descriptions resides in shared memory
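The tiling scheme can be walked through on the CPU (a sketch with assumed names; loops stand in for threads). Bodies are consumed p at a time: the copy into `shared` plays the role of the shared-memory load, after which every "thread" evaluates p successive interactions against the same cached bodies.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct BodyT { float x, y, z, m; };

// CPU walk-through of the tiling scheme. The copy into `shared` models the
// per-block shared-memory load; the i loop models the p threads of a block
// (here run over all bodies), each consuming the same p cached bodies.
// Assumes n is a multiple of p. The result equals the plain all-pairs sum.
std::vector<std::array<float, 3>> tiledAccel(const std::vector<BodyT>& b,
                                             std::size_t p, float eps2) {
    const std::size_t n = b.size();
    std::vector<std::array<float, 3>> acc(n, {0.0f, 0.0f, 0.0f});
    for (std::size_t tile = 0; tile < n; tile += p) {
        std::vector<BodyT> shared(b.begin() + tile, b.begin() + tile + p);
        for (std::size_t i = 0; i < n; ++i) {      // each "thread" owns body i
            for (const BodyT& bj : shared) {       // p successive interactions
                float rx = bj.x - b[i].x, ry = bj.y - b[i].y, rz = bj.z - b[i].z;
                float d2 = rx * rx + ry * ry + rz * rz + eps2;
                float inv = 1.0f / std::sqrt(d2);
                float s = bj.m * inv * inv * inv;
                acc[i][0] += s * rx; acc[i][1] += s * ry; acc[i][2] += s * rz;
            }
        }
    }
    return acc;
}
```

Because the tiles merely reorder the same additions, any valid p yields the same accelerations; on the GPU the payoff is that each body is staged through shared memory instead of being re-read from DRAM.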
Clustering Tiles into Thread Blocks
A thread block has p threads that execute some number of tiles in sequence
– Sized to balance parallelism with data reuse
– The degree of parallelism (i.e., the number of rows) must be sufficiently large to interleave multiple warps (to hide latencies in interaction evaluation)
The amount of data reuse grows with the number of columns
– It also governs the size of the transfer of bodies from device memory into shared memory
The size of the tile also determines the register space and shared memory required
Thread Blocks for the N-Body Implementation (1/3)
For this implementation, tiles are square, of size p by p
Before executing a tile:
– Each thread fetches one body into shared memory
– Threads synchronize after the fetch
Consequently, each tile starts with p successive bodies in the shared memory
Thread Blocks for the N-Body Implementation (2/3)
Time spans the horizontal direction
Parallelism spans the vertical direction
Heavy lines demarcate the tiles of computation, showing
– where shared memory is loaded
– where a barrier synchronization is performed
In a thread block:
– There are N/p tiles
– p threads compute the forces on p bodies (one thread per body)
– Each thread computes all N interactions for one body
Thread Blocks for the N-Body Implementation (3/3)
Multiple threads work from left to right
Synchronization takes place at the end of each tile of computation
CUDA Kernel Code to Calculate N-body Forces for a Thread Block
The kernel receives pointers to global device memory for the positions (devX) and the accelerations (devA) of the bodies. The input parameters are assigned to local pointers with type conversion so they can be indexed as arrays. One synchronization point ensures that all shared memory locations are populated before the gravitation computation proceeds; another ensures that all threads finish their gravitation computation before advancing to the next tile.
Defining a Grid of Thread Blocks
The kernel program in the previous section calculates the acceleration of p bodies in a system, caused by their interaction with all N bodies in the system
The kernel is invoked on a grid of thread blocks to compute the acceleration of all N bodies
There are p threads per block and one thread per body
The number of thread blocks needed to cover all N bodies is N/p
Define a 1D grid of size N/p
The result is a total of N threads that perform N force calculations each, for a total of N² interactions
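The sizing arithmetic can be captured in a few lines (hypothetical helper names; N is assumed to be a multiple of p, as in the slides' configurations):

```cpp
// Launch-shape arithmetic from the slides: p threads per block, one thread
// per body, so a 1D grid of N/p blocks covers all N bodies. Each thread then
// performs N force calculations, for N^2 interactions in total.
struct LaunchShape { int blocks; int threadsPerBlock; };

LaunchShape gridFor(int n, int p) { return LaunchShape{n / p, p}; }

long long totalInteractions(int n) { return 1LL * n * n; }
```

For the evaluation runs below (N = 16384, p = 256) this gives a 1D grid of 64 blocks.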
Evaluation of the Full Grid of Interactions
The vertical dimension shows the parallelism of the 1D grid of N/p independent thread blocks with p threads each
The horizontal dimension shows the sequential processing of N force calculations in each thread
A thread block reloads its shared memory every p steps to share p positions of data
MD.cpp
MDSystemCUDA.cpp
Execution Configuration
Calls to a __global__ function must specify the execution configuration
An expression with the following form appears between the function name and the args list:

<<< Dg, Db, Ns, S >>>

– Dg is of type dim3 and specifies the dimension and size of the grid
– Db is of type dim3 and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block
– Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per block; Ns is an optional argument which defaults to 0
– S is of type cudaStream_t and specifies the associated stream
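A hypothetical launch matching the configuration the slides describe might look as follows (the kernel name `integrateBodies` and the `devX`/`devA` pointers are taken from earlier slides; the exact argument list of the real kernel is assumed):

```cuda
// Sketch only: a 1D grid of N/p blocks, p threads per block, and p float4
// body descriptions of dynamically allocated shared memory per block.
// The optional stream argument S is omitted here.
int numBlocks = N / p;
size_t sharedMemSize = p * sizeof(float4);  // one body description per thread
integrateBodies<<< numBlocks, p, sharedMemSize >>>(devX, devA, N);
```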
MDSystemCUDA.cu
Testbed
CPU
– 8 Intel Xeon processors
– 4 cores @ 3.00 GHz per processor
– 4 GB RAM
– SUSE 10.2 (x86-64)
GPU
– Nvidia Quadro FX 5600
Quadro FX 5600 Specifications
– Memory size: 1.5 GB GDDR3
– Memory interface: 384-bit
– Memory bandwidth: 76.8 GB/sec
– Max power consumption: 171 W
– Number of slots: 2
– Display connectors: DVI-I, DVI-I Stereo
– Dual-link DVI: 2
Performance Testing
Three tests are performed to understand the performance of MD-CUDA:
– Test 1: find the number of iterations that achieves near-maximum utilization
– Test 2: find a value of N (number of bodies) that achieves the highest performance at a reasonable execution time
– Test 3: find the optimal p and q values; full utilization occurs for p*q = 256
Test 1: GFLOPS/s for Different Numbers of Iterations
N = 16384, p = 256, q = 1
Standard deviation: min = 0.0008367, max = 0.0139
1000 iterations produced an average performance of 214.6474 GFLOPS/s, compared to 214.7892 GFLOPS/s for 3000 iterations
The runtime of 1000 iterations was 2.998x faster than that of 3000 iterations
Test 2: GFLOPS/s for Different Problem Sizes
p = 256, q = 1, # of iterations = 1000
Varied N to analyze scalability
Standard deviation: min = 0.000859, max = 0.0244
N = 16384 performed at an average of 214.6454 GFLOPS/s, whereas N = 32768 produced an average of 216.602 GFLOPS/s
The runtime of N = 16384 was 3.963x faster than that of N = 32768
Test 3: GFLOPS/s for Different Values of p and q
Test 3: GFLOPS/s for Different p and q
[Chart: higher GFLOPS/s are observed for q = 1 and p > 1]
Future Work
The accumulation of the potential energy is not an easy task to perform in parallel
One naïve solution would be to have every thread update a single shared memory location with the potential for a single body-body interaction, in a synchronized fashion
– This serializes a portion of the code and is not acceptable
Another naïve solution would be to copy the position and velocity arrays off the GPU and serially compute the potential energy on the CPU
– This also has poor performance and requires some redundant computations to obtain the total potential energy
We need to develop a parallel summation algorithm that possibly uses partial summation within thread blocks
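The partial-summation idea can be sketched on the CPU (hypothetical helper name; on the GPU each per-block partial sum would be produced by a shared-memory reduction rather than a serial loop):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Two-level summation: each "thread block" first reduces its own slice of
// per-interaction potential terms to one partial sum, and a final pass sums
// the partials. blockSize must be > 0.
float blockwiseSum(const std::vector<float>& terms, std::size_t blockSize) {
    std::vector<float> partials;
    for (std::size_t start = 0; start < terms.size(); start += blockSize) {
        std::size_t end = std::min(start + blockSize, terms.size());
        partials.push_back(std::accumulate(terms.begin() + start,
                                           terms.begin() + end, 0.0f));
    }
    // Final reduction over the per-block partial sums.
    return std::accumulate(partials.begin(), partials.end(), 0.0f);
}
```

The appeal of this structure is that the first level maps onto the existing thread blocks, leaving only a short sum of N/p partials to be done at the end.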
References
Test 3 (1/5)
Test 3 (2/5)
Test 3 (3/5)
Test 3 (4/5)