Post on 11-Jan-2016
Case Study: Accelerating Full Waveform Inversion via OpenCL™
on AMD GPUs
©2014 Acceleware Ltd. All rights reserved.
Chris Mason, Acceleware Product Manager
March 5, 2014

About Acceleware
Software and services company specializing in HPC product development, developer training and consulting services
OpenCL training for AMD GPUs
– Progressive lectures and hands-on lab exercises
– Experienced instructors
– Delivered worldwide
High performance consulting
– Feasibility studies
– Porting and optimization
– Code commercialization

Outline
What is Full Waveform Inversion?
The Project
OpenCL Optimizations
– Coalescing
– Iterative kernel for stencil operations
– Fusing kernels together to eliminate redundant memory accesses
Key Performance Results

What is Full Waveform Inversion?
Seismic inversion technique
Used to build Earth models from recorded seismic data
Uses a finite-difference solution to the acoustic wave equation
Computationally expensive
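
To make the finite-difference idea concrete, here is a minimal sketch of one explicit time step of the 1D acoustic wave equation, second order in time and space. The slides do not say which discretization the project uses, so the scheme, the function name `wave_step_1d`, and all parameter names are assumptions chosen for illustration.

```c
#include <stddef.h>

/* One explicit time step of the 1D acoustic wave equation
 *   d2p/dt2 = v^2 * d2p/dx2
 * using second-order central differences in time and space.
 * Illustrative sketch only: the scheme and all names are assumptions,
 * not the discretization used in the project.
 *   p_prev, p_cur : pressure at time steps k-1 and k (length n)
 *   p_next        : output, pressure at time step k+1 (length n)
 *   vel           : velocity model, one value per grid point
 *   dt, dx        : time step and grid spacing
 * The two boundary points are held at zero.
 */
void wave_step_1d(const float *p_prev, const float *p_cur, float *p_next,
                  const float *vel, size_t n, float dt, float dx)
{
    if (n == 0)
        return;
    p_next[0] = 0.0f;
    p_next[n - 1] = 0.0f;
    for (size_t i = 1; i + 1 < n; ++i) {
        float c = vel[i] * dt / dx;  /* Courant number at this point */
        float lap = p_cur[i - 1] - 2.0f * p_cur[i] + p_cur[i + 1];
        p_next[i] = 2.0f * p_cur[i] - p_prev[i] + c * c * lap;
    }
}
```

A full 3D propagation repeats a step like this for every grid point and every time step, for every shot, which is where the computational expense comes from.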

What is FWI?
From a basic starting point...
... to an accurate velocity model

FWI Algorithm
Initial Model Estimate
Forward Propagate Source → Residuals
Back Propagate Residuals → Gradient
Forward Propagation(s) → Step Length
Update Model
Increase Frequency
Loop over shots
Loop over frequencies
Loop until convergence
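
The flowchart can be read as three nested loops. The sketch below counts wavefield propagations for one plausible nesting. Several things here are assumptions, not taken from the slides: that the shot loop is innermost, that the convergence loop runs a fixed number of iterations, and that the step-length search costs `line_search_props` forward propagations per iteration. The names `PropagationCount` and `count_fwi_propagations` are hypothetical.

```c
/* Propagation counts for an assumed FWI loop nesting. */
typedef struct {
    long forward;   /* forward propagations */
    long backward;  /* back propagations of residuals */
} PropagationCount;

PropagationCount count_fwi_propagations(int num_freqs, int iters_per_freq,
                                        int num_shots, int line_search_props)
{
    PropagationCount n = {0, 0};
    for (int f = 0; f < num_freqs; ++f) {              /* loop over frequencies */
        for (int it = 0; it < iters_per_freq; ++it) {  /* loop until convergence (fixed count assumed) */
            for (int s = 0; s < num_shots; ++s) {      /* loop over shots */
                n.forward += 1;   /* forward propagate source -> residuals */
                n.backward += 1;  /* back propagate residuals -> gradient */
            }
            n.forward += line_search_props;  /* forward propagation(s) -> step length */
            /* the model update would happen here */
        }
        /* the frequency increase would happen here */
    }
    return n;
}
```

Under these assumptions, 3 frequencies, 10 iterations each, 15 shots and 2 line-search propagations give 510 forward and 450 back propagations, each one a full finite-difference simulation.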

FWI Compute Cost
Cluster size of 10s to 100s of CPU nodes
Many days of runtime
Accuracy and quality reduced to keep runtime acceptable

The Project
GeoTomo develops high-end geophysical software products that help geophysicists around the world to image the subsurface
GeoTomo had a pre-existing, cluster-ready, multi-threaded (OpenMP-based) CPU FWI solution
GeoTomo required their FWI application to run faster so they could deliver results to their clients more quickly
– Looked to AMD GPUs to potentially accelerate their FWI and approached Acceleware for help to make it happen

Why use GPUs? Performance!

                                     AMD Opteron 6386 SE   AMD FirePro W9000   AMD FirePro S10000
Memory Bandwidth                     59.7 GB/s             264 GB/s            480 GB/s
Peak Gflops (single)                 ~410                  4000                5910
Peak Gflops (double)                 ~205                  1000                1480
Total Memory                         >>6 GB                6 GB                6 GB
Power Consumption                    140 W                 274 W               375 W
Gflops per Watt (single precision)   <3                    14.59               15.76
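
The last row follows from the two rows above it: peak single-precision Gflops divided by power consumption. A minimal sketch of that arithmetic (the function name `gflops_per_watt` is hypothetical):

```c
/* Peak Gflops divided by board power in watts. */
double gflops_per_watt(double peak_gflops, double watts)
{
    return peak_gflops / watts;
}
```

For example, `gflops_per_watt(5910, 375)` gives 15.76 for the S10000, and `gflops_per_watt(410, 140)` gives just under 3 for the Opteron.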

OpenCL Overview
Parallel computing architecture standardized by the Khronos Group
OpenCL:
– Is a royalty-free standard
– Provides an API to coordinate parallel computation across heterogeneous processors
  Of interest because heterogeneous devices can significantly accelerate certain (primarily data-parallel) workloads
– Defines a cross-platform programming language
– Used on devices ranging from handheld/embedded systems to supercomputers

OpenCL Programming Model
Heterogeneous model, including provisions for a host connected to one or more devices
– Example: GPUs, CPUs
[Figure: Host connected to Device 1 (GPU), Device 2 (GPU), …, Device N (GPU)]

The OpenCL Programming Model
Data-parallel portions of an algorithm are executed on the device as kernels
– Kernels are C functions with some restrictions and a few language extensions
– Many (parallel) work-items execute the kernel
The host executes serial code between device kernel launches
– Memory management
– Data exchange to/from device (usually)
– Error handling
[Figure: execution alternates between Host and Device; each kernel launch covers an ND Range of work-groups, shown once as Work-Group (0,0) to Work-Group (1,2) and once as Work-Group (0,0) to Work-Group (2,1)]
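
The relationship between work-groups and work-items in an ND range can be sketched in plain C. With a zero global offset, the value OpenCL returns from `get_global_id(d)` equals `get_group_id(d) * get_local_size(d) + get_local_id(d)`. The helper names below are hypothetical and cover one dimension only.

```c
#include <stddef.h>

/* Global index of a work-item along one dimension, assuming a zero
 * global offset: group index * work-group size + local index. */
size_t global_id_1d(size_t group_id, size_t local_size, size_t local_id)
{
    return group_id * local_size + local_id;
}

/* Number of work-groups needed to cover n elements along one dimension
 * (rounds up when n is not a multiple of the work-group size). */
size_t num_groups_1d(size_t n, size_t local_size)
{
    return (n + local_size - 1) / local_size;
}
```

For instance, work-item 5 of work-group 2 with 64 work-items per group has global index 133, and covering 1000 elements with groups of 64 takes 16 groups.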

OpenCL Memory Model
OpenCL kernels have access to four distinct memory regions:
– Global
  Allows read/write access from all work-items in all work-groups
  Persistent across kernels
– Local
  Memory that is shared by all work-items within a work-group
– Constant
  Region of memory that remains constant (read-only) during the execution of a kernel
– Private
  Memory that is private to a work-item
OpenCL vendors map memory regions into physical resources
– Local/constant/private memory usually has several orders of magnitude lower capacity than global memory, but is orders of magnitude faster

OpenCL Syntax – Memory Spaces
Host and device have separate memory spaces
– Data is explicitly moved between them
  Typically over the PCIe bus
Host functions to allocate, copy, and free memory on device, e.g.
– clCreateBuffer()
– clEnqueueReadBuffer()
– clEnqueueWriteBuffer()
– clReleaseMemObject()

Putting It All Together
Operation: Cx = Ax + Bx
One work-item per element
A0 A1 A2 A3 A4 A5 A6 A7
B0 B1 B2 B3 B4 B5 B6 B7
C0 C1 C2 C3 C4 C5 C6 C7
__kernel void VectorAdd(__global float* a, __global float* b, __global float* c)
{
    // Each work-item handles the element at its own global index
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}
Each work-item has a unique index, typically used to index into arrays
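
A serial C stand-in can help when checking kernel output on the host. The sketch below runs the same per-element body once per index, with a loop taking the place of the ND range. The function names are hypothetical, and on a GPU the iterations would run in parallel with no guaranteed order.

```c
#include <stddef.h>

/* Body of the VectorAdd kernel written as a plain C function; idx plays
 * the role of get_global_id(0). */
static void vector_add_item(const float *a, const float *b, float *c, size_t idx)
{
    c[idx] = a[idx] + b[idx];
}

/* Serial stand-in for a 1D ND range of n work-items. */
void vector_add_reference(const float *a, const float *b, float *c, size_t n)
{
    for (size_t idx = 0; idx < n; ++idx)
        vector_add_item(a, b, c, idx);
}
```

Because no work-item reads what another writes, the result does not depend on execution order, which is what makes the operation safe to run in parallel.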

Vector Add – Host Code
void VectorAdd(float* aH, float* bH, float* cH, int N)
{
    size_t N_BYTES = N * sizeof(float);
    size_t globalSize = N;  // the global work size is passed as a size_t

    // Device management code
    …

    // Allocate memory on device
    cl_mem aD = clCreateBuffer(…, N_BYTES, …);
    cl_mem bD = clCreateBuffer(…, N_BYTES, …);
    cl_mem cD = clCreateBuffer(…, N_BYTES, …);

    // Transfer input arrays to device
    clEnqueueWriteBuffer(…, aD, …, N_BYTES, aH, …);
    clEnqueueWriteBuffer(…, bD, …, N_BYTES, bH, …);

    // Pass kernel arguments and launch kernel
    …
    clEnqueueNDRangeKernel(…, &globalSize, …);

    // Transfer output array to host
    clEnqueueReadBuffer(…, cD, …, N_BYTES, cH, …);
}

Project Steps
1) Profiling
– Acquired code, datasets and reference benchmarks from GeoTomo
– Set up local machines with near-equivalent hardware, compiled code and confirmed reference benchmark numbers
– Augmented code with timers to determine time spent in parallel regions, areas of interest

Project Steps
2) Feasibility Analysis
– Investigated memory footprint for FWI jobs
  GPU memory limited to 6 GB per card
– Investigated potential speedup / time to port code
  Maximum speedup determined by time spent in parallel regions (Amdahl's Law)
  Time to port dependent on feature set
  – E.g. domain decomposition across multiple GPUs
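
Both feasibility checks reduce to short calculations. The sketch below gives Amdahl's Law in its usual form and a footprint estimate for single-precision volumes. The function names are hypothetical, and the number of volumes an FWI job actually needs is not given in the slides.

```c
/* Amdahl's Law: overall speedup when a fraction p of the runtime
 * (0 <= p <= 1) is accelerated by a factor s. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound on the speedup as s grows without limit. Requires p < 1. */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}

/* Bytes needed for num_volumes single-precision volumes of nx*ny*nz. */
unsigned long long volume_bytes(unsigned long long nx, unsigned long long ny,
                                unsigned long long nz, unsigned num_volumes)
{
    return nx * ny * nz * num_volumes * sizeof(float);
}
```

As an illustration with made-up numbers: if timers showed 95% of the runtime in parallel regions, no port could exceed about 20x overall, and four 512 x 512 x 512 float volumes would take 2 GiB of the 6 GB card.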

Project Steps
3) Implementation
– Creating testing harnesses
– Kernel implementation
– Resolving hardware driver issues
– Enabling multi-GPU device support
– Optimization iterations
4) Wrapup
– Delivery of port, along with installation documentation
– Trained GeoTomo developer on OpenCL

Key GeoTomo Optimizations
1) Coalescing
– Changing memory access patterns in the kernels to those best suited for GPUs
  Global memory is accessed via a request for a multi-byte word
  Combine load/store requests from consecutive work-items to reduce the number of requested words
  – Fewer requests -> less contention for global memory
  Make one big multi-word burst request to global memory whenever possible
  – Contiguous bursts -> less global memory overhead
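
The effect of the access pattern can be estimated by counting how many aligned memory segments a group of consecutive work-items touches. The sketch below is a simplified model, not a description of any specific GPU: the segment size is a parameter, the base address is assumed to be segment-aligned, and the name `segments_touched` is hypothetical.

```c
#include <stddef.h>

/* Counts the distinct aligned memory segments touched when num_items
 * consecutive work-items each load one float from base[item * stride].
 * Simplified model: base is assumed segment-aligned and the segment
 * size depends on the hardware. */
size_t segments_touched(size_t num_items, size_t stride, size_t segment_bytes)
{
    size_t count = 0;
    size_t last_segment = (size_t)-1;  /* sentinel: no segment seen yet */
    for (size_t item = 0; item < num_items; ++item) {
        size_t byte_addr = item * stride * sizeof(float);
        size_t segment = byte_addr / segment_bytes;
        if (segment != last_segment) {  /* addresses never decrease */
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

With an assumed 64-byte segment, 64 work-items reading consecutive floats (stride 1) touch 4 segments, while the same work-items reading every 16th float touch 64, one per work-item.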

Key GeoTomo Optimizations
2) Iterative kernel for stencil operations
[Figure: Input Volumes * Stencil Kernels]
– Outputs are weighted combinations of surrounding elements from input volumes
– Off-axis weights are zero
Acknowledgement: Paulius Micikevicius, 2009

Key GeoTomo Optimizations
Naïve implementation would have each work-item read all of its neighboring elements directly from global memory
– Possible to hit maximum GPU memory bandwidth but redundant reads hurt performance

Key GeoTomo Optimizations
Alternative: Iterating over 2D slices along slowest dimension
– Each work-item is responsible for a column of the output array
– Work-group caches 2D plane of input in local memory
– Work-items store inputs in direction of iteration in registers
– Reduces required number of global memory reads significantly
[Figure: Single Work-item View, showing which inputs are held in registers and which in local memory]
Acknowledgement: Paulius Micikevicius, 2009
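
The two approaches can be compared in a plain C sketch. It is a serial host-side illustration, not the project's kernel: `RADIUS`, the weight layout `w[0..RADIUS]`, and all function names are assumptions. The loop over (x, y) stands in for the work-items, the small array `q` stands in for each work-item's registers, and the in-plane reads that a GPU would serve from local memory are read directly from the input here.

```c
#include <stddef.h>

#define RADIUS 2  /* stencil half-width; value chosen for illustration */

/* Linear index into an nx*ny*nz volume with x fastest and z slowest. */
static size_t idx3(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* In-plane (x and y) contribution at one point. On a GPU these reads
 * would come from the work-group's local-memory tile. */
static float in_plane(const float *in, const float *w,
                      size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    float acc = 0.0f;
    for (size_t r = 1; r <= RADIUS; ++r)
        acc += w[r] * (in[idx3(x - r, y, z, nx, ny)] + in[idx3(x + r, y, z, nx, ny)] +
                       in[idx3(x, y - r, z, nx, ny)] + in[idx3(x, y + r, z, nx, ny)]);
    return acc;
}

/* Naive version: every output re-reads all of its z neighbours. */
void stencil_naive(const float *in, float *out, const float *w,
                   size_t nx, size_t ny, size_t nz)
{
    for (size_t z = RADIUS; z + RADIUS < nz; ++z)
        for (size_t y = RADIUS; y + RADIUS < ny; ++y)
            for (size_t x = RADIUS; x + RADIUS < nx; ++x) {
                float acc = w[0] * in[idx3(x, y, z, nx, ny)];
                for (size_t r = 1; r <= RADIUS; ++r)
                    acc += w[r] * (in[idx3(x, y, z - r, nx, ny)] +
                                   in[idx3(x, y, z + r, nx, ny)]);
                out[idx3(x, y, z, nx, ny)] = acc + in_plane(in, w, x, y, z, nx, ny);
            }
}

/* Sliding-window version: one (x, y) column per "work-item", marching
 * along z. Values along z sit in a small queue, so each step loads
 * only one new z value. */
void stencil_sliding(const float *in, float *out, const float *w,
                     size_t nx, size_t ny, size_t nz)
{
    if (nz < 2 * RADIUS + 1)
        return;
    for (size_t y = RADIUS; y + RADIUS < ny; ++y)
        for (size_t x = RADIUS; x + RADIUS < nx; ++x) {
            float q[2 * RADIUS + 1];  /* q[RADIUS] is the current z value */
            for (size_t k = 0; k < 2 * RADIUS + 1; ++k)
                q[k] = in[idx3(x, y, k, nx, ny)];
            for (size_t z = RADIUS; z + RADIUS < nz; ++z) {
                float acc = w[0] * q[RADIUS];
                for (size_t r = 1; r <= RADIUS; ++r)
                    acc += w[r] * (q[RADIUS - r] + q[RADIUS + r]);
                out[idx3(x, y, z, nx, ny)] = acc + in_plane(in, w, x, y, z, nx, ny);
                /* Shift the queue and load the single new value ahead. */
                for (size_t k = 0; k < 2 * RADIUS; ++k)
                    q[k] = q[k + 1];
                if (z + 1 + RADIUS < nz)
                    q[2 * RADIUS] = in[idx3(x, y, z + 1 + RADIUS, nx, ny)];
            }
        }
}
```

Both functions write only interior points and leave the halo of `out` untouched. Along z, the naive version reads 2 * RADIUS + 1 values per output, while the sliding version reads one new value per output after the initial fill.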

Key GeoTomo Optimizations
3) Kernel Fusion
– Reduce redundant memory accesses by fusing kernels that operate on the same volume together
– Improves performance by reducing redundant global memory reads
4) Kernel Fission
– Improve occupancy by lowering kernel resource requirements (registers) via kernel simplification
– Allows for more work-items to run concurrently on GPU, improving masking of global memory latency
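
Kernel fusion can be illustrated with two hypothetical element-wise passes, written here as plain C loops standing in for kernels. These are not the project's kernels. In the unfused form the intermediate array makes a round trip through memory; in the fused form it stays in a register.

```c
#include <stddef.h>

/* Two separate passes ("kernels"): tmp is written to memory by the
 * first pass and read back by the second. */
void scale_then_add(const float *a, const float *b, float *tmp, float *out,
                    float s, size_t n)
{
    for (size_t i = 0; i < n; ++i)  /* pass 1: n reads, n writes */
        tmp[i] = s * a[i];
    for (size_t i = 0; i < n; ++i)  /* pass 2: 2n reads, n writes */
        out[i] = tmp[i] + b[i];
}

/* Fused pass: the intermediate value never leaves a register. */
void scale_add_fused(const float *a, const float *b, float *out,
                     float s, size_t n)
{
    for (size_t i = 0; i < n; ++i) {  /* 2n reads, n writes in total */
        float t = s * a[i];
        out[i] = t + b[i];
    }
}
```

In this example fusion cuts memory traffic from 5n accesses to 3n. Fission is the opposite trade: splitting a kernel costs some extra traffic but lowers per-work-item register use.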

Performance Results
FWI 15 Hz, 15 shots
– GPU version: 7997 seconds
– CPU (5 cores per shot): 67086 seconds [8.4X]
– CPU (30 cores per shot): 166948 seconds [20.9X]
GPU: Sapphire Radeon HD 7970 GHz Edition – 6GB model
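
The bracketed factors are the CPU runtime divided by the GPU runtime. A minimal sketch of that arithmetic (the function name `speedup` is hypothetical):

```c
/* Speedup of the GPU run relative to a CPU run: cpu_seconds / gpu_seconds. */
double speedup(double cpu_seconds, double gpu_seconds)
{
    return cpu_seconds / gpu_seconds;
}
```

`speedup(67086, 7997)` is about 8.39 and `speedup(166948, 7997)` is about 20.88, matching the 8.4X and 20.9X quoted above after rounding.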

Performance Results
“Using GPU’s we can use higher frequencies and more if not all of the shots to improve the resolution and coverage.”
James Jackson, President, GeoTomo

Questions?
OpenCL Courses
– June 3-6, 2014, Calgary, Canada
– Private onsite classes also available
– Acceleware.com/opencl-training
OpenCL Consulting
– Feasibility studies
– Code commercialization
– Porting and optimization
– Mentoring
– Acceleware.com/services
Contact Us
– Tel: +1 403.249.9099
– Email: services@acceleware.com