
Using Parallel Programming to Solve Complex Problems

Evan Barnett and Jordan Haskins

Outline

I. Serial Processing

II. Parallel Processing

III. CUDA

IV. Programming Examples

Serial Processing

• Traditionally, software has been written for serial computation

• Serial processing is processing that occurs sequentially

• Instructions are executed one after another

• Only one instruction may execute at any moment in time

Serial Processing

(Figure: www.computing.llnl.gov/tutorials/parallel_comp/)

Moore's Law

• The number of transistors on a chip doubles roughly every two years

• Power Wall: clock speeds can no longer keep rising because of power and heat limits, which pushes designers toward parallelism

Udacity.com, Introduction to Parallel Programming

Parallel Processing

• The simultaneous use of multiple compute resources to solve a computational problem

• The problem is broken into discrete parts that can be solved concurrently

• Each part is further broken down to a series of instructions

• Instructions from each part execute simultaneously on different CPUs and/or GPUs

Multiprocessors

• Started as multiprocessor systems that were typically used with servers

• Dual-core and multi-core processors
  • Multiple CPU cores on the same chip that run in parallel through multi-threading

Parallel Processing

(Figure: www.computing.llnl.gov/tutorials/parallel_comp/)

Parallel Processing

• Latency vs. Throughput

• Goal: higher-performing processors
  • Minimize latency
  • Maximize throughput

• Latency: time to complete one task (e.g., seconds)

• Throughput: work completed per unit time (e.g., jobs/hour)

GPUs

• GPUs are designed to handle parallel processing more efficiently

www.nvidia.com/object/what-is-gpu-computing.html

• NVIDIA has introduced a way to harness the impressive computational power of the GPU

CUDA

• CUDA (Compute Unified Device Architecture)
  • A parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU)

• Developed and released by NVIDIA in 2006

• Can run on all of NVIDIA's latest discrete GPUs

• Extension of the C, C++, and Fortran languages

• Operating systems that support it: Windows XP and later, Linux, and Mac OS X

CUDA

• Restricted to certain hardware (NVIDIA cards)

• Compute capability (1.0-3.5): determines which coding features are available on a given card (see the query sketch below)

• Hardware we used during our research:
  • NVIDIA GeForce 9600M GT: compute capability 1.1
  • NVIDIA GeForce GTX 580: compute capability 2.0
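A minimal sketch of how a program can query a card's compute capability at run time with cudaGetDeviceProperties (the device index 0 and the printed fields are only illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the properties of GPU 0 (the first CUDA-capable device found).
    cudaGetDeviceProperties(&prop, 0);
    // major/minor give the compute capability, e.g. 1.1 or 2.0.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}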

CUDA Architecture

• By using CUDA functions, the CPU is able to utilize the parallelization capabilities of the GPU.

CUDA

• Kernel function: runs on the GPU
  • Launched as cudaFxn<<<grid, block>>>(int var, int var) (a sketch follows this list)

• GPU hardware
  • Streaming Multiprocessors (SMs): run in parallel and independently; contain streaming processors and memory; execute one warp at a time
  • Streaming Processors (SPs): each runs one thread
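A minimal sketch of the kernel/launch pattern named above, assuming a hypothetical kernel addOne, a device array d_data that is already allocated, and an element count n:

// Hypothetical kernel: each thread adds 1 to one element of the array.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global index
    if (i < n)
        data[i] += 1;
}

// Host-side launch: <<<grid, block>>> chooses the number of blocks and the
// number of threads per block (here 256 threads per block, enough blocks for n).
void launchAddOne(int *d_data, int n) {
    addOne<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();   // wait for the kernel to finish
}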

CUDA

• Threads: a path of execution through the program code

• Warps: groups of 32 threads

• Blocks: made up of threads that can communicate within their own block

• Grids: made up of independent blocks

• Inside to out: threads -> blocks -> grids (see the index sketch below)
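A short sketch of how this hierarchy appears inside a kernel through CUDA's built-in index variables (the kernel name and output array are hypothetical):

__global__ void hierarchyDemo(int *out) {
    // threadIdx.x : this thread's position within its block
    // blockIdx.x  : this block's position within the grid
    // blockDim.x  : threads per block;  gridDim.x : blocks per grid
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads are scheduled in warps of 32 consecutive threads.
    int warpId = globalId / 32;
    out[globalId] = warpId;   // each thread records which warp it belongs to
}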

CUDA Functions

• cudaMalloc(void** devicePointer, size)
  • Allocates size bytes of linear memory on the device; devicePointer points to the allocated memory

• cudaFree(devicePointer)
  • Frees memory allocated with cudaMalloc()
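A minimal allocation/free sketch; the pointer name d_data and element count n are placeholders:

// Allocate, use, and free n integers of device memory.
void deviceBufferExample(int n) {
    int *d_data = NULL;                     // will point to GPU (device) memory
    size_t bytes = n * sizeof(int);
    cudaMalloc((void**)&d_data, bytes);     // allocate bytes of linear memory on the device
    // ... launch kernels that read and write d_data ...
    cudaFree(d_data);                       // release the allocation when finished
}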

CUDA Functions

• cudaMemcpy(destination, source, count, kind)
  • Copies count bytes from source to destination; kind specifies the direction of the copy

• __syncthreads()
  • Called in the kernel function
  • Threads wait here until all threads in the same block reach that point
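A sketch of the usual copy-in / copy-out pattern around a kernel launch (someKernel, d_data, h_data, bytes, and the launch configuration are placeholders); __syncthreads() is shown in the shared-memory sketch on the next slide:

// Copy input to the device, run a (hypothetical) kernel, copy the result back.
void copyRunCopyBack(int *h_data, int *d_data, size_t bytes, int n,
                     int numBlocks, int threadsPerBlock) {
    // Host -> device copy before the kernel runs.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    someKernel<<<numBlocks, threadsPerBlock>>>(d_data, n);

    // Device -> host copy after the kernel finishes.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
}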

CUDA Memory

• GPU memory model
  • Local: each thread has its own local memory
  • Shared: each block has its own shared memory that can be accessed by any thread in that block
  • Global: shared by all blocks and threads and can be accessed at any time

• CPU memory is called host memory
  • Normally data is copied from host memory to global memory on the GPU and back

• Speed: Local > Shared > Global > Host

CUDA applications

• Medical Imaging

• Computational Fluid Dynamics

• Image modification

• Numerical Calculations

Grayscale Conversion: Parallel Program

• CUDA program that converts a color image into a grayscale image

• Manipulating images and videos is a perfect task for the GPU

• Each pixel is converted to grayscale simultaneously using:
  • Y' = 0.299*R + 0.587*G + 0.114*B

• Conversion is much quicker in parallel than sequentially

Grayscale Conversion: Parallel Program

• CUDA code of the grayscale conversion (written in C++)
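The code itself is not reproduced in this transcript; a minimal sketch of what such a kernel could look like, assuming the image is stored as separate R, G, and B byte arrays (all names are illustrative):

__global__ void rgbToGray(const unsigned char *r, const unsigned char *g,
                          const unsigned char *b, unsigned char *gray, int numPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {
        // Weighted luminance formula from the previous slide.
        gray[i] = (unsigned char)(0.299f * r[i] + 0.587f * g[i] + 0.114f * b[i]);
    }
}

// Launch with one thread per pixel, e.g. 256 threads per block:
// rgbToGray<<<(numPixels + 255) / 256, 256>>>(d_r, d_g, d_b, d_gray, numPixels);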

Accessible Population: Sequential Program

• Purpose: to determine the accessible population within a 300 km radius of most US cities

• Sequential code populates the large dists array (a sketch follows the next slide title)

• 27 of the top 30 most accessible cities are in California

• Jackson, TN ranks 827th out of 988

Accessible Population: Sequential Program
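The sequential code is not reproduced in this transcript; a sketch of what the O(N²) distance loop might look like, with hypothetical lat/lon/pop arrays and a placeholder distanceKm() helper:

// Hypothetical sequential version: lat, lon, pop hold each city's coordinates and
// population; dists is the N x N distance matrix; distanceKm() is a placeholder helper.
void accessiblePopulation(const double *lat, const double *lon, const double *pop,
                          double *dists, double *accessiblePop, int numCities) {
    for (int i = 0; i < numCities; i++) {
        double accessible = pop[i];
        for (int j = 0; j < numCities; j++) {
            dists[i * numCities + j] = distanceKm(lat[i], lon[i], lat[j], lon[j]);
            if (i != j && dists[i * numCities + j] <= 300.0)
                accessible += pop[j];       // city j lies within 300 km of city i
        }
        accessiblePop[i] = accessible;
    }
}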

Accessible Population: Parallel Program

• Parallel code populates the large dists array inside the GPU kernel

• Initialize the device variables

Accessible Population: Parallel Program

• cudaMalloc() all memory that needs to be allocated for the GPU kernel

• cudaMemcpy() the memory that needs to be transferred to the kernel

• cudaFree() the memory on the GPU after the kernel runs (a sketch of these steps follows)
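A sketch of how those steps might fit together for this problem; all names, including the __device__ helper distanceKm(), are placeholders rather than the authors' actual code:

// Kernel: one thread per source city fills that city's row of the dists matrix.
__global__ void fillDists(const double *lat, const double *lon,
                          double *dists, int numCities) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numCities) {
        for (int j = 0; j < numCities; j++)
            dists[i * numCities + j] = distanceKm(lat[i], lon[i], lat[j], lon[j]);
    }
}

// Host side: allocate, copy in, launch, copy out, free.
void computeDists(const double *lat, const double *lon, double *dists, int numCities) {
    double *d_lat, *d_lon, *d_dists;
    size_t vec = numCities * sizeof(double);
    size_t mat = (size_t)numCities * numCities * sizeof(double);
    cudaMalloc((void**)&d_lat, vec);
    cudaMalloc((void**)&d_lon, vec);
    cudaMalloc((void**)&d_dists, mat);
    cudaMemcpy(d_lat, lat, vec, cudaMemcpyHostToDevice);
    cudaMemcpy(d_lon, lon, vec, cudaMemcpyHostToDevice);

    fillDists<<<(numCities + 255) / 256, 256>>>(d_lat, d_lon, d_dists, numCities);

    cudaMemcpy(dists, d_dists, mat, cudaMemcpyDeviceToHost);
    cudaFree(d_lat);
    cudaFree(d_lon);
    cudaFree(d_dists);
}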

Sequential

N       N²          Time (ms)
100     10000       0.0
200     40000       3.8
600     360000      62.8
988     976144      167.3
1976    3904576     686.0
3952    15618304    2706.8
7904    62473216    10803.3

Parallel

N       N²          Time (ms)
100     10000       105.0
200     40000       101.3
600     360000      132.5
988     976144      148.3
1976    3904576     347.5
3952    15618304    1493.8
7904    62473216    4789.0

Accessible Population: Sequential vs. Parallel

• For small values of N, sequential is the faster choice

• For larger values of N, parallel becomes the much faster approach

Determinant of a Matrix using Minors

N     N!             Time
1     1              -
2     2              0 ms
3     6              0 ms
4     24             0 ms
5     120            0 ms
6     720            0 ms
7     5040           0 ms
8     40320          20 ms
9     362880         150 ms
10    3628800        1.4 secs
11    39916800       17 secs
12    479001600      3.8 mins

N     N!             Time
13    6.2E+09        53.5 mins
14    8.7E+10        13.2 hours
15    1.3E+12        8.3 days
16    2.1E+13        132.6 days
17    3.6E+14        6.2 years
18    6.4E+15        1 century
19    1.2E+17        2 millennia
20    2.4E+18        42 millennia
21    5.1E+19        887 millennia
22    1.1E+21        2 epochs
23    2.6E+22        4.5 eras
24    6.2E+23        21.5 eons

Determinant of a Matrix using Minors

• Using the brute-force method, this is an N! complexity algorithm (see the table above and the sketch below)

• Using parallel code, studies have shown a reduction in execution time of more than 40%
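A sketch of the brute-force expansion by minors that produces the N! growth shown in the table above (plain recursive C++, not the authors' code):

#include <vector>

// Determinant by cofactor (minor) expansion along the first row.
// Each call makes n recursive calls on (n-1) x (n-1) minors, so the work grows like n!.
double determinant(const std::vector<std::vector<double>> &m) {
    int n = (int)m.size();
    if (n == 1) return m[0][0];
    double det = 0.0;
    for (int col = 0; col < n; col++) {
        // Build the minor that drops row 0 and the current column.
        std::vector<std::vector<double>> minor(n - 1, std::vector<double>(n - 1));
        for (int i = 1; i < n; i++)
            for (int j = 0, k = 0; j < n; j++)
                if (j != col) minor[i - 1][k++] = m[i][j];
        double sign = (col % 2 == 0) ? 1.0 : -1.0;
        det += sign * m[0][col] * determinant(minor);
    }
    return det;
}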

Questions?

• Question and Answer Time