Case Study: Accelerating Full Waveform Inversion via OpenCL™ on AMD GPUs

Chris Mason, Acceleware Product Manager

March 5, 2014

About Acceleware

Software and services company specializing in HPC product development, developer training and consulting services

OpenCL training for AMD GPUs
– Progressive lectures and hands-on lab exercises
– Experienced instructors
– Delivered worldwide

High performance consulting
– Feasibility studies
– Porting and optimization
– Code commercialization

Outline

What is Full Waveform Inversion?
The Project
OpenCL
Optimizations
– Coalescing
– Iterative kernel for stencil operations
– Fusing kernels together to eliminate redundant memory accesses
Key Performance Results

What is Full Waveform Inversion?

Seismic inversion technique
Used to build Earth models from recorded seismic data
Uses a finite-difference solution to the acoustic wave equation
Computationally expensive

What is FWI?

From a basic starting point...

... to an accurate velocity model

FWI Algorithm

Initial Model Estimate
Forward Propagate Source → Residuals
Back Propagate Residuals → Gradient
Forward Propagation(s) → Step Length
Update Model
Increase Frequency

Loop over shots
Loop over frequencies
Loop until convergence
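The control flow of the flowchart above can be sketched in plain C. This is a toy under loud assumptions, not GeoTomo's solver: the model is a single scalar, `toy_forward` is a hypothetical linear operator standing in for finite-difference wave propagation, and the loop nesting (frequencies outermost, shots innermost) is one common reading of the flowchart rather than something the slides state. All names are invented for illustration.

```c
#include <assert.h>

#define N_SHOTS 4
#define N_FREQS 3

/* Hypothetical stand-in for wave propagation: linear in the model value m. */
static double toy_forward(double m, int shot, int freq)
{
    return m * (1.0 + shot + 0.5 * freq);
}

/* Runs the FWI-style loop nest and returns the recovered model value. */
double toy_fwi(double m_true, double m_initial, double tol, int max_iters)
{
    assert(max_iters > 0);
    double m = m_initial;                                /* initial model estimate */

    for (int freq = 0; freq < N_FREQS; ++freq) {         /* loop over frequencies  */
        for (int it = 0; it < max_iters; ++it) {         /* loop until convergence */
            double gradient = 0.0;
            double misfit = 0.0;

            for (int shot = 0; shot < N_SHOTS; ++shot) { /* loop over shots        */
                /* "Recorded" data, generated here from the true model. */
                double observed = toy_forward(m_true, shot, freq);
                /* Forward propagate source -> residual. */
                double residual = toy_forward(m, shot, freq) - observed;
                misfit += 0.5 * residual * residual;
                /* Back propagate residual -> gradient (adjoint of the toy operator). */
                gradient += residual * toy_forward(1.0, shot, freq);
            }
            if (misfit < tol)
                break;

            /* Extra forward propagations of the gradient direction -> step length. */
            double denom = 0.0;
            for (int shot = 0; shot < N_SHOTS; ++shot) {
                double p = toy_forward(gradient, shot, freq);
                denom += p * p;
            }
            double step = (denom > 0.0) ? (gradient * gradient) / denom : 0.0;

            m -= step * gradient;                        /* update model           */
        }
    }
    return m;
}
```

Because the toy operator is linear, the exact line search recovers the true value almost immediately; a real FWI run needs many iterations per frequency, which is where the compute cost comes from.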

FWI Compute Cost

Cluster size of 10s to 100s of CPU nodes
Many days of runtime
Accuracy and quality reduced to keep runtime acceptable

The Project

GeoTomo develops high-end geophysical software products that help geophysicists around the world to image beneath the subsurface

GeoTomo had a pre-existing, cluster-ready, multi-threaded (OpenMP-based) CPU FWI solution

GeoTomo required their FWI application to run faster so they could deliver results to their clients more quickly
– Looked to AMD GPUs to potentially accelerate their FWI and approached Acceleware for help to make it happen

Why use GPUs? Performance!

                                     AMD Opteron 6386 SE   AMD FirePro W9000   AMD FirePro S10000
Memory Bandwidth                     59.7 GB/s             264 GB/s            480 GB/s
Peak Gflops (single)                 ~410                  4000                5910
Peak Gflops (double)                 ~205                  1000                1480
Total Memory                         >>6 GB                6 GB                6 GB
Power Consumption                    140 W                 274 W               375 W
Gflops per Watt (single precision)   <3                    14.59               15.76
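The last row of the table follows from the two rows above it. As a quick check, here is a minimal sketch that divides peak single-precision Gflops by power consumption; the function name is made up for illustration, and the inputs are the table's own figures.

```c
#include <assert.h>

/* Peak single-precision Gflops divided by power draw in watts. */
double gflops_per_watt(double peak_gflops, double watts)
{
    assert(watts > 0.0);
    return peak_gflops / watts;
}
```

With the table's numbers, `gflops_per_watt(4000, 274)` is about 14.6, `gflops_per_watt(5910, 375)` is 15.76, and `gflops_per_watt(410, 140)` is about 2.9, consistent with the "<3" entry for the CPU.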

OpenCL Overview

Parallel computing architecture standardized by the Khronos Group

OpenCL:
– Is a royalty-free standard
– Provides an API to coordinate parallel computation across heterogeneous processors
  Of interest because heterogeneous devices can significantly accelerate certain (primarily data-parallel) workloads
– Defines a cross-platform programming language
– Used on handheld/embedded devices through supercomputers

OpenCL Programming Model

Heterogeneous model, including provisions for a host connected to one or more devices
– Example: GPUs, CPUs

[Figure: Device 1 (GPU), Device 2 (GPU), … Device N]

The OpenCL Programming Model

Data-parallel portions of an algorithm are executed on the device as kernels
– Kernels are C functions with some restrictions and a few language extensions
– Many (parallel) work-items execute the kernel

The host executes serial code between device kernel launches
– Memory management
– Data exchange to/from device (usually)
– Error handling

[Figure: a Device executing two ND Ranges, one made of Work-Groups (0,0) through (1,2) and one made of Work-Groups (0,0) through (2,1)]
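To make the relationship between work-groups and work-items concrete, the sketch below reproduces on the host the arithmetic OpenCL uses for a work-item's global index along one dimension, assuming a zero global offset. The helper names are hypothetical; inside a kernel the same values come from `get_group_id()`, `get_local_size()`, `get_local_id()` and `get_global_id()`.

```c
#include <assert.h>
#include <stddef.h>

/* Global index of a work-item along one dimension, assuming no global offset:
 * get_global_id(d) == get_group_id(d) * get_local_size(d) + get_local_id(d). */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Number of work-groups along one dimension. OpenCL 1.x requires the global
 * size to be a multiple of the work-group size, so that is asserted here. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, with 64 work-items per work-group, work-item 5 of work-group 2 has global index 133, and an ND range of 256 work-items splits into 4 work-groups.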

OpenCL Memory Model

OpenCL kernels have access to four distinct memory regions:
– Global
  Allows read/write access from all work-items in all work-groups
  Persistent across kernels
– Local
  Memory that is local to all work-items within a work-group
– Constant
  Region of memory that remains constant (read-only) during the execution of a kernel
– Private
  Memory that is private to a work-item

OpenCL vendors map memory regions into physical resources
– Local/constant/private memory usually several orders of magnitude lower capacity but orders of magnitude faster than global memory

OpenCL Syntax – Memory Spaces

Host and device have separate memory spaces
– Data is explicitly moved between them
  Typically over PCIe bus

Host functions to allocate, copy, and free memory on device, e.g.
– clCreateBuffer()
– clEnqueueReadBuffer()
– clEnqueueWriteBuffer()
– clReleaseMemObject()

Putting It All Together

Operation: Cx = Ax + Bx

A0 A1 A2 A3 A4 A5 A6 A7
B0 B1 B2 B3 B4 B5 B6 B7
C0 C1 C2 C3 C4 C5 C6 C7

One work-item per element

__kernel void VectorAdd(__global float* a, __global float* b, __global float* c)
{
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}

Each work-item has a unique index, typically used to index into arrays
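For readers without an OpenCL device at hand, the one-work-item-per-element idea can be emulated serially in plain C. This is only a sketch: the loop counter stands in for `get_global_id(0)`, the function names are invented, and a real device would run the work-items in parallel rather than in order.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the kernel for a single work-item; idx plays the role of
 * get_global_id(0). */
static void vector_add_work_item(size_t idx, const float *a, const float *b,
                                 float *c)
{
    c[idx] = a[idx] + b[idx];
}

/* Serial stand-in for launching n work-items over a 1D ND range. */
void vector_add_emulated(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t idx = 0; idx < n; ++idx)
        vector_add_work_item(idx, a, b, c);
}
```

Since no work-item reads a value written by another, the order of execution does not matter, which is what makes this operation a natural fit for a data-parallel kernel.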

Vector Add – Host Code

void VectorAdd(float* aH, float* bH, float* cH, int N)
{
    int N_BYTES = N * sizeof(float);

    // Device management code
    …

    // Allocate memory on device
    cl_mem aD = clCreateBuffer(…,N_BYTES, …);
    cl_mem bD = clCreateBuffer(…,N_BYTES, …);
    cl_mem cD = clCreateBuffer(…,N_BYTES, …);

    // Transfer input arrays to device
    clEnqueueWriteBuffer(...,aD,…,N_BYTES,aH,…);
    clEnqueueWriteBuffer(...,bD,…,N_BYTES,bH,…);

    // Pass kernel arguments and launch kernel
    …
    clEnqueueNDRangeKernel(…, &N, …);

    // Transfer output array to host
    clEnqueueReadBuffer(...,cD,…,N_BYTES,cH,…);
}

Project Steps

1) Profiling
– Acquired code, datasets and reference benchmarks from GeoTomo
– Set up local machines with near-equivalent hardware, compiled code and confirmed reference benchmark numbers
– Augmented code with timers to determine time spent in parallel regions, areas of interest

Project Steps

2) Feasibility Analysis
– Investigated memory footprint for FWI jobs
  GPU memory limited to 6GB per card
– Investigated potential speedup / time to port code
  Maximum speedup determined by time spent in parallel regions (Amdahl’s Law)
  Time to port dependent on feature set
  – E.g. domain decomposition across multiple GPUs
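Both feasibility questions come down to short calculations. The sketch below shows Amdahl's Law and a simple footprint estimate; the function names, the single-precision assumption, and the idea of counting whole volumes are illustrative choices, not details of the GeoTomo code.

```c
#include <assert.h>
#include <stdint.h>

/* Amdahl's Law: overall speedup when a fraction p of the runtime is
 * accelerated by a factor s and the rest is left unchanged. */
double amdahl_speedup(double parallel_fraction, double parallel_speedup)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(parallel_speedup > 0.0);
    return 1.0 / ((1.0 - parallel_fraction)
                  + parallel_fraction / parallel_speedup);
}

/* Upper bound on speedup as the accelerated part becomes infinitely fast. */
double amdahl_limit(double parallel_fraction)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}

/* Bytes needed for n_volumes single-precision volumes of nx*ny*nz cells
 * (assumption: 4-byte floats, no padding). */
uint64_t footprint_bytes(uint64_t nx, uint64_t ny, uint64_t nz,
                         uint64_t n_volumes)
{
    return nx * ny * nz * n_volumes * (uint64_t)sizeof(float);
}
```

As an illustration with made-up inputs: if timers show 95% of the runtime in parallel regions, the ceiling is 20x, and a 20x kernel speedup gives roughly 10x overall. Four 512 x 512 x 512 volumes need 2 GiB, which fits on a 6GB card; a job that does not fit is where domain decomposition across multiple GPUs comes in.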

Project Steps

3) Implementation
– Creating testing harnesses
– Kernel implementation
– Resolving hardware driver issues
– Enabling multi-GPU device support
– Optimization iterations

4) Wrap-up
– Delivery of port, along with installation documentation
– Trained GeoTomo developer on OpenCL

Key GeoTomo Optimizations

1) Coalescing
– Changing memory access patterns in the kernels to those best suited for GPUs
  Global memory is accessed via a request for a multi-byte word
  Combine load/store requests from consecutive work-items to reduce the number of requested words
  – Fewer requests mean less contention for global memory
  Make one big multi-word burst request to global memory whenever possible
  – Contiguous bursts mean less global memory overhead
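The effect of the access pattern can be estimated by counting how many aligned memory segments a group of consecutive work-items touches. The sketch below is a simplified model, not a description of any specific GPU: the segment size is passed in as a parameter, the accesses are assumed to start at a segment boundary, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the distinct aligned segments touched when work-item i loads one
 * element at element index i * stride_elems. Addresses increase with i, so
 * counting changes of segment number is enough. */
size_t segments_touched(size_t n_items, size_t stride_elems,
                        size_t elem_bytes, size_t segment_bytes)
{
    assert(n_items > 0 && stride_elems > 0);
    assert(elem_bytes > 0 && segment_bytes % elem_bytes == 0);

    size_t count = 0;
    size_t last_segment = (size_t)-1;   /* sentinel: no segment seen yet */
    for (size_t i = 0; i < n_items; ++i) {
        size_t segment = (i * stride_elems * elem_bytes) / segment_bytes;
        if (segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

Taking 64 work-items loading 4-byte floats and an assumed 64-byte segment, a unit stride touches 4 segments, a stride of 2 touches 8, and a stride of 16 touches 64, one per work-item. Keeping consecutive work-items on consecutive addresses is what lets the hardware combine their requests.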

Key GeoTomo Optimizations

2) Iterative kernel for stencil operations

[Figure: Input Volumes * Stencil Kernels]

• Outputs are weighted combinations of surrounding elements from input volumes
• Off-axis weights are zero

Acknowledgement: Paulius Micikevicius, 2009
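A plain C reference version of such a stencil makes the definition concrete and is useful as a correctness baseline when porting. The sketch below is a minimal example under stated assumptions: a radius of 2, the same weight for all six neighbours at a given distance, x as the fastest-varying dimension, and only interior points computed. None of these choices are taken from the GeoTomo code.

```c
#include <assert.h>
#include <stddef.h>

#define RADIUS 2   /* illustrative stencil radius */

/* Linear index into a volume stored with x fastest, then y, then z. */
static size_t idx3(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* Axis-aligned 3D stencil: each interior output is w[0] times the centre
 * plus w[r] times the six neighbours at distance r along x, y and z.
 * Off-axis neighbours are never read. Boundary cells are left untouched. */
void stencil_reference(const float *in, float *out,
                       size_t nx, size_t ny, size_t nz,
                       const float w[RADIUS + 1])
{
    assert(nx > 2 * RADIUS && ny > 2 * RADIUS && nz > 2 * RADIUS);

    for (size_t z = RADIUS; z < nz - RADIUS; ++z)
        for (size_t y = RADIUS; y < ny - RADIUS; ++y)
            for (size_t x = RADIUS; x < nx - RADIUS; ++x) {
                float sum = w[0] * in[idx3(x, y, z, nx, ny)];
                for (size_t r = 1; r <= RADIUS; ++r)
                    sum += w[r] * (in[idx3(x - r, y, z, nx, ny)] +
                                   in[idx3(x + r, y, z, nx, ny)] +
                                   in[idx3(x, y - r, z, nx, ny)] +
                                   in[idx3(x, y + r, z, nx, ny)] +
                                   in[idx3(x, y, z - r, nx, ny)] +
                                   in[idx3(x, y, z + r, nx, ny)]);
                out[idx3(x, y, z, nx, ny)] = sum;
            }
}
```

Each output reads 6 x RADIUS + 1 inputs, and neighbouring outputs read almost the same set, which is the redundancy a GPU implementation has to manage.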

Key GeoTomo Optimizations

Naïve implementation would have each work-item read all of its neighboring elements directly from global memory
– Possible to hit maximum GPU memory bandwidth but redundant reads hurt performance

Key GeoTomo Optimizations

Alternative: Iterating over 2D slices along slowest dimension
– Each work-item is responsible for a column of the output array
– Work-group caches 2D plane of input in local memory
– Work-items store inputs in direction of iteration in registers
– Reduces required number of global memory reads significantly

[Figure: Single Work-item View, showing values held in registers and in local memory]

Acknowledgement: Paulius Micikevicius, 2009
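A back-of-the-envelope count shows why the sliced scheme helps. The sketch below compares global memory reads per output point for the naive approach and for the sliced approach. It is an idealized model with assumptions stated in the comments: hardware caches and the start and end of the iteration are ignored, and the work-group size and radius in the example are illustrative rather than the values used in the project.

```c
#include <assert.h>

/* Naive scheme: every work-item reads its centre value plus `radius`
 * neighbours on each side along all three axes from global memory. */
double naive_reads_per_output(int radius)
{
    assert(radius >= 0);
    return 6.0 * radius + 1.0;
}

/* Sliced scheme: for each 2D plane, a tile_x * tile_y work-group loads its
 * tile plus halo strips of width `radius` on four sides into local memory.
 * Corners are not needed because off-axis weights are zero, and values
 * along the iteration direction are reused from registers. */
double sliced_reads_per_output(int radius, int tile_x, int tile_y)
{
    assert(radius >= 0 && tile_x > 0 && tile_y > 0);
    double tile = (double)tile_x * tile_y;
    double halo = 2.0 * radius * (tile_x + tile_y);
    return (tile + halo) / tile;
}
```

For a radius of 4 and a 16 x 16 work-group, the naive count is 25 reads per output while the sliced count is 2, roughly a twelvefold reduction in global memory traffic under this model.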

Key GeoTomo Optimizations

3) Kernel Fusion
– Reduce redundant memory accesses by fusing kernels that operate on the same volume together
– Improves performance by reducing redundant global memory reads
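Kernel fusion can be illustrated with ordinary loops standing in for kernels. In the sketch below, the two per-element operations (a scale followed by an add) are hypothetical and chosen only to show the memory traffic; they are not operations from the FWI code.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two passes over memory. The second pass re-reads `volume` and
 * also reads the temporary written by the first pass. */
void scale_then_add_unfused(const float *volume, float *tmp, float *out,
                            float k, size_t n)
{
    assert(volume != NULL && tmp != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)      /* "kernel 1": 1 read, 1 write  */
        tmp[i] = k * volume[i];
    for (size_t i = 0; i < n; ++i)      /* "kernel 2": 2 reads, 1 write */
        out[i] = tmp[i] + volume[i];
}

/* Fused: one pass. Each element of `volume` is read once and kept in a
 * local variable (a register on a GPU); the temporary array disappears. */
void scale_then_add_fused(const float *volume, float *out, float k, size_t n)
{
    assert(volume != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {    /* fused kernel: 1 read, 1 write */
        float v = volume[i];
        out[i] = k * v + v;
    }
}
```

In this example the unfused version performs three reads and two writes per element against one read and one write for the fused version, and it needs an extra buffer. On a bandwidth-bound GPU kernel, that difference in traffic translates fairly directly into runtime.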

4) Kernel Fission
– Improve occupancy by lowering kernel resource requirements (registers) via kernel simplification
– Allows for more work-items to run concurrently on GPU, improving masking of global memory latency
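The link between register usage and occupancy can be sketched with a one-line calculation. The model below is deliberately simplified: it considers only vector registers, ignores allocation granularity and other limits such as local memory, and takes the register budget and wavefront cap as parameters. The figures used in the example (256 registers per lane and at most 10 wavefronts per SIMD) are assumptions based on numbers commonly quoted for AMD's GCN architecture, not values from the slides.

```c
#include <assert.h>

/* Wavefronts that can be resident on one SIMD when each work-item needs
 * `vgprs_per_work_item` vector registers, given a per-lane register budget
 * and a hardware cap on resident wavefronts. */
int waves_per_simd(int vgprs_per_work_item, int vgpr_budget, int max_waves)
{
    assert(vgprs_per_work_item > 0 && vgpr_budget > 0 && max_waves > 0);
    int by_registers = vgpr_budget / vgprs_per_work_item;
    return by_registers < max_waves ? by_registers : max_waves;
}
```

Under those assumed figures, a kernel using 128 registers per work-item allows 2 resident wavefronts, while splitting it into simpler kernels that use 64 or 32 registers allows 4 or 8. More resident wavefronts give the scheduler more work to switch to while a global memory load is outstanding.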

Performance Results

FWI 15 Hz, 15 shots
– GPU version: 7997 seconds
– CPU (5 cores per shot): 67086 seconds [8.4X]
– CPU (30 cores per shot): 166948 seconds [20.9X]

GPU: Sapphire Radeon HD 7970 GHz Edition – 6GB model
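The bracketed factors are the CPU runtimes divided by the GPU runtime. A minimal sketch of that arithmetic, with a made-up function name and the runtimes reported above as inputs:

```c
#include <assert.h>

/* Speedup of the accelerated run relative to a baseline run. */
double speedup(double baseline_seconds, double accelerated_seconds)
{
    assert(baseline_seconds > 0.0 && accelerated_seconds > 0.0);
    return baseline_seconds / accelerated_seconds;
}
```

`speedup(67086, 7997)` is about 8.39 and `speedup(166948, 7997)` is about 20.88, matching the reported 8.4X and 20.9X after rounding.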

Performance Results

“Using GPU’s we can use higher frequencies and more if not all of the shots to improve the resolution and coverage.”

James Jackson, President, GeoTomo

Questions?

OpenCL Courses
June 3-6, 2014, Calgary, Canada
Private onsite classes also available
Acceleware.com/opencl-training

OpenCL Consulting
Feasibility studies
Code commercialization
Porting and optimization
Mentoring
Acceleware.com/services

Contact Us
Tel: +1 403.249.9099
Email: services@acceleware.com
