
CUDA
Jacob Lauzon, Ryan McLaughlin

Agenda
1. Background
2. Overview
3. Basic GPU architecture
4. CUDA Programming Model
5. CUDA Example: Modeling the Heart

Background
Computationally expensive problems illustrate the need for parallel processing.

Some of the key solutions today:
● CPU: MPI
● GPU: CUDA, OpenCL

Background cont.
(Note: today, "general-purpose GPU" and "GPU" are effectively synonymous.)
Before OpenGL, accelerators (GPUs) could only perform image operations.

1992 - OpenGL opens the graphics pipeline to programmers, but exploiting it required extensive knowledge of the GPU, of shader usage, and of the implementation language.

1999 - The NVIDIA GeForce 256 is marketed as the "world's first GPU" and is the first to let developers control computations, albeit only polygons and shading via a shader language.

2006 - CUDA is unveiled, absorbing the efforts of Ian Buck (the Brook stream programming model, 2003) to bring GPU programming to the C language.

Overview: What is CUDA?

● CUDA: Compute Unified Device Architecture
● Parallel computing platform created by NVIDIA
● Released in June 2007
● Implemented on NVIDIA graphics processing units (GPUs)
● CUDA allows GPUs to be used as general-purpose processors; this is known as GPGPU

Overview - GPUs
● GPU architectures favor many "slower" concurrent threads rather than a single very fast thread
● Optimized for data-parallel computations with very high throughput
● Can tolerate higher memory latency than CPUs
● Hardware is focused heavily on computation

NVIDIA GPU Architecture
● Made up of two main components:
  o Global memory
    ▪ Like RAM for a CPU
    ▪ Shared among all cores of the GPU and reachable from the CPU
    ▪ Usually 1-12 GB per GPU
    ▪ High bandwidth
  o Streaming Multiprocessors (SMs; SMX on the Kepler architecture)
    ▪ Where the computation is performed
    ▪ Contain their own control units, registers, execution pipelines, and caches
    ▪ Contain "CUDA cores"/streaming processors (SIMD), which hold logic and computation units, including units dedicated to IEEE 754-2008 floating-point arithmetic
    ▪ Also contain warp schedulers, which schedule groups of 32 parallel threads ("warps")
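These hardware parameters can be read at runtime. A minimal sketch using the CUDA runtime's device-query API (printed values vary by card):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Device:           %s\n", prop.name);
    printf("Global memory:    %zu MB\n", prop.totalGlobalMem >> 20);   // the GPU's "RAM"
    printf("SM count:         %d\n", prop.multiProcessorCount);        // streaming multiprocessors
    printf("Warp size:        %d\n", prop.warpSize);                   // threads per warp (32)
    printf("Shared mem/block: %zu KB\n", prop.sharedMemPerBlock >> 10);
    return 0;
}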

Architecture cont. (Kepler)

Memory System (Kepler)
● Each SMX has the following:
  o 64 KB of configurable memory, which can be split between L1 cache and shared memory
  o 48 KB of read-only cache (Kepler only)
● Each GPU has 1536 KB of L2 cache
  o Shared among all SMXs and the primary point for sharing data between them
● Each GPU also has its main device memory
  o Much like RAM to a CPU
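The L1/shared-memory split can be requested per kernel from host code. A minimal sketch using the runtime API (myKernel is a placeholder name):

#include <cuda_runtime.h>

__global__ void myKernel() { /* ... */ }  // placeholder kernel

void configureCache() {
    // Ask for the larger shared-memory share of the 64 KB region.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    // Alternatives: cudaFuncCachePreferL1, cudaFuncCachePreferEqual,
    // cudaFuncCachePreferNone (device default).
}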

Processing Flow

1. Copy data from CPU memory (main memory) to GPU memory

2. CPU instructs the GPU on how to process

3. Parallel processing on the GPU

4. Results are copied from GPU memory back to CPU memory
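A minimal sketch of this four-step flow, assuming a simple element-wise vector addition (the kernel and sizes are illustrative, not from the deck):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

    // 1. Copy data from CPU memory to GPU memory
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // 2./3. CPU launches the kernel; the GPU processes in parallel
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    // 4. Copy results from GPU memory back to CPU memory
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}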

Programming Model
● CUDA libraries exist for most programming languages (C, C++, Java, Python, ...)
● Keywords exist for each language that map to various CUDA "commands"
● The CPU acts as the "host"
● CUDA functions are created to run on the GPU, but sequential code still runs on the CPU
● The parallel portion of the application is called a kernel

Kernels
● The parallel portion of the application that is to be executed by the entire GPU
● Kernels are divided into many threads that execute the same code
● Each thread has an ID (much like MPI)
● Threads are grouped into blocks, and blocks are grouped into a grid (see the sketch after this list)
● The grid is what gets executed by the GPU
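A minimal sketch of how each thread derives its ID from the built-in index variables; the 2D layout and names are illustrative assumptions:

__global__ void whoAmI(int *ids, int width, int height) {
    // Each thread computes its (x, y) position from built-in variables.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        ids[y * width + x] = y * width + x;  // unique global thread ID
}

// Launch: a 2D grid of 2D blocks covering a width x height domain.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// whoAmI<<<grid, block>>>(d_ids, width, height);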

Kernel Execution

Communication and Synchronization
● Since a block of threads runs on a single SM, its threads can communicate through shared memory via cooperative loads/stores and can use barrier synchronization
● Thread blocks are completely independent and can execute sequentially or concurrently, in any order
  o This allows for scalability to any number of SMs
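A minimal sketch of both mechanisms: a block-level sum using shared memory and __syncthreads() barriers (assumes a 256-thread, power-of-two block; not from the deck):

__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float buf[256];          // visible to all threads in this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;  // cooperative load into shared memory
    __syncthreads();                    // barrier: all loads finish before reads

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                // barrier between reduction steps
    }
    if (tid == 0) blockSums[blockIdx.x] = buf[0];  // one partial sum per block
}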

Advantages
● Massively parallel: compute many independent values simultaneously
● Highly specialized to NVIDIA hardware: achieves higher efficiency than OpenCL
● Commercially available for less than a many-processor system
● Easily add more "systems" by adding GPGPU cards
● Multiple warps allow memory latency and ALU delay to be hidden
● Shared memory access for warps
● Full support for integer and floating-point operations

Limitations
● Limited to NVIDIA cards only
● High specialization requires intricate details that OpenCL does not need
● Suffers greatly under thread divergence (if statements, sequential sections)
● No built-in exception handling; thread divergence makes it too severe
● Substantial performance degradation when communicating with the host
● Device memory is not paged like host memory, so capacity is quickly exhausted
  o Paged host memory can be used instead, but with a large performance loss (sketched below)
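A minimal sketch of that workaround: mapping page-locked host memory into the GPU's address space (zero-copy). The buffer size and names are assumptions:

#include <cuda_runtime.h>

void mapHostBuffer() {
    float *hBuf, *dView;
    size_t bytes = size_t(1) << 30;  // 1 GiB, chosen for illustration

    cudaSetDeviceFlags(cudaDeviceMapHost);                    // set before the CUDA context is created
    cudaHostAlloc((void**)&hBuf, bytes, cudaHostAllocMapped); // pinned + mapped host memory
    cudaHostGetDevicePointer((void**)&dView, hBuf, 0);        // device-side alias of hBuf
    // Kernels can now read/write dView directly over the PCIe bus,
    // at far lower bandwidth than on-device global memory.
    cudaFreeHost(hBuf);
}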

Example: Modeling the Heart
In a similar fashion to the neuron, the action potential (the voltage difference between the inside and outside of a cell) may be modeled mathematically.

The voltage is controlled by various voltages, diffusion, gates, and ion concentrations.

Fox Model
States/variables such as currents, concentrations, and gate variables.

Such models are sometimes referred to by their number of current variables; e.g., the Fox model is a 14-current model, while the Mitchell model has only 2 currents.

These models may be used to represent 1D, 2D, or 3D simulations.

State Variables
Every state variable for each cell must be stored. For a 2D simulation, there must be a 2D grid per state variable, with one entry per cell. A 60-variable model would need to store 60 Height x Width arrays.

Internally, this data must be precise (double), though precision may be reduced as a space optimization when an iteration is written out for observation. These states may be saved in order to resume from a prior iteration.

Because memory use changes only with the number and size of the states, the program is limited by the GPU's memory and the number/size of states, but not by the length of time being simulated.
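As a worked example (the grid size is an assumption, not from the deck): a 60-variable model on a 1024 x 1024 grid in double precision needs 60 × 1024 × 1024 × 8 bytes ≈ 0.5 GB of state, which fits in a 1-12 GB GPU no matter how long the simulation runs.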

Processing
Computation step: for a given coordinate, update the state based on the prior states and the update equations. This may include integrating, differentiating, multiplication, division, etc. The equations are specified by the model (Fox).

Diffusion step: in the second step, neighbors diffuse their voltages into the given coordinate's voltage. The stencil could be 5-point, 9-point, etc. (see the sketch below).

Afterwards, the algorithm repeats, utilizing the new state values.
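A minimal sketch of the diffusion step as a 5-point stencil kernel; the layout, coefficient D, and mirrored boundaries are illustrative assumptions consistent with the deck's description:

__global__ void diffuse5pt(const double *vIn, double *vOut,
                           int width, int height, double D) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Mirror neighbor indices at the edges (the "mirroring method"
    // for boundary conditions, without per-cell special cases).
    int xm = max(x - 1, 0), xp = min(x + 1, width - 1);
    int ym = max(y - 1, 0), yp = min(y + 1, height - 1);

    double c = vIn[y * width + x];
    double lap = vIn[y * width + xm] + vIn[y * width + xp]
               + vIn[ym * width + x] + vIn[yp * width + x] - 4.0 * c;
    vOut[y * width + x] = c + D * lap;  // explicit diffusion update
}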

Implementation Details (CUDA C)
● The number and distribution (in blocks) of threads is calculated based on the problem size
● Each of the state variable grids is allocated both on the host and on the GPU
● The state variables are then initialized on the host and copied to the GPU
● The kernel, denoted using the <<<>>> notation, uses the grid/block details and its x, y, z coordinates to calculate and store the appropriate value
● Afterwards, the data is copied back to the host before being deallocated on the GPU
● Along the way, various low-accuracy outputs are copied back and stored as short integers for checking/observation
● The compute/diffusion steps are called iteratively from the host for each time step
● Boundary conditions: mirroring method (wrap-around could also be used)
A host-side sketch of this flow follows.
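A minimal host-side sketch of the steps above for a single state grid; fox_compute, the initial value, and the sizes are illustrative assumptions (diffuse5pt is the stencil sketched earlier):

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void fox_compute(double *state, int w, int h);  // assumed model-update kernel
__global__ void diffuse5pt(const double *in, double *out, int w, int h, double D);

void simulate(int w, int h, int steps) {
    size_t bytes = size_t(w) * h * sizeof(double);
    double *hV = (double*)malloc(bytes);  // host copy of one state grid
    double *dV, *dTmp;
    cudaMalloc(&dV, bytes);               // device allocations
    cudaMalloc(&dTmp, bytes);

    // Initialize on the host, then copy to the GPU.
    for (size_t i = 0; i < size_t(w) * h; ++i) hV[i] = -85.0;  // assumed resting voltage (mV)
    cudaMemcpy(dV, hV, bytes, cudaMemcpyHostToDevice);

    // Thread count and block distribution derived from the problem size.
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

    for (int t = 0; t < steps; ++t) {     // compute + diffuse every time step
        fox_compute<<<grid, block>>>(dV, w, h);
        diffuse5pt<<<grid, block>>>(dV, dTmp, w, h, 0.001);
        double *swap = dV; dV = dTmp; dTmp = swap;  // new state feeds next step
    }

    cudaMemcpy(hV, dV, bytes, cudaMemcpyDeviceToHost);  // copy results back
    cudaFree(dV); cudaFree(dTmp);                        // deallocate on GPU
    free(hV);
}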

Problems & Solutions
● Keeping all data on hand is most convenient, but limits the simulation size
● Data size
● Data comparison/checking/visualization
● If statements are discouraged -> special cases, boundary conditions
● The diffusion step requires accessing multiple addresses per cell, every iteration

Visualization Examples (3D)

See 3D PDF example

Results
From the results on hand, using the same simulation settings, for 1 million computation nodes:
● GPU: 400 seconds per simulated second
● Sequential: 27,000 seconds per simulated second
● Speedup = 27,000 / 400 = 67.5

Questions?

References
● http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
● http://www.nvidia.com/object/cuda_home_new.html
● http://www.nvidia.com/object/tesla-workstations.html
● http://international.download.nvidia.com/pdf/kepler/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
● CUDA by Example, Jason Sanders & Edward Kandrot