An OpenMP Runtime for UVM-Capable GPU’s · 2021. 2. 5. · Runtime for UVM-Capable GPUs Hashim...
Developing an OpenMP Offloading Runtime for UVM-Capable GPUs
Hashim Sharif and Vikram Adve
University of Illinois at Urbana-Champaign
Hal Finkel
Argonne National Laboratory
Lingda Li
Brookhaven National Laboratory
Heterogeneous Programming
“Many-Core” CPUs and GPUs
➢ Heterogeneous programming allows for optimizing application subcomponents to their specific computation needs
➢ OpenMP 4.0/4.5 supports heterogeneous programming
CUDA Unified Virtual Memory
New technology!
➢ UVM provides a single memory space accessible by all GPUs and CPUs in the system
➢ Pascal introduces a page migration engine to enable automatic page migration on data access
➢ With Pascal, UVM allocations are limited only by the overall system memory size
CUDA UVM Example
CUDA Code (explicit data copies)
int main( int argc, char* argv[] ) {
    // Allocate memory for each vector on GPU
    cudaMalloc(&X, bytes);
    cudaMalloc(&Y, bytes);
    cudaMalloc(&Z, bytes);
    // Copy host vectors to device (cudaMemcpy takes the destination first)
    cudaMemcpy(X, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(Y, y, bytes, cudaMemcpyHostToDevice);
    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(X, Y, Z);
    // Copy the vector add result back to the host
    cudaMemcpy(z, Z, bytes, cudaMemcpyDeviceToHost);
}
CUDA UVM Code (no explicit copies)
int main( int argc, char* argv[] ) {
    // Allocate memory in Unified virtual space
    cudaMallocManaged(&X, bytes);
    cudaMallocManaged(&Y, bytes);
    cudaMallocManaged(&Z, bytes);
    // NOTE: No need for explicit memory copies
    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(X, Y, Z);
    // Wait for the kernel to finish before the host reads Z
    cudaDeviceSynchronize();
}
OpenMP Offloading
[Diagram: Processor X and Processor Y, each with its own cache (Cache X, Cache Y), access shared data in RAM]
OpenMP threads share memory
[Diagram: the same host, connected over a bus to an accelerator with its own device memory]
OpenMP 4 includes directives for mapping data to device memory
OpenMP Target Directives - Saxpy
#pragma omp target data map(to: x[0:N])
#pragma omp target data map(tofrom: y[0:N])
{
    #pragma omp target
    #pragma omp teams
    #pragma omp distribute parallel for
    for (long int i = 0; i < N; i++)
        y[i] = a*x[i] + y[i];
}
➢ target: directs target compilation
➢ map: specifies data movement
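The Saxpy fragment on the slide leaves out the surrounding function and declarations. As a minimal sketch, it can be wrapped as shown here. The function name `saxpy_offload` is hypothetical, and the combined `target teams distribute parallel for` construct is used in place of the slide's nested directives as a matter of convenience. When the compiler has no offloading support, or OpenMP is disabled, the loop simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper around the slide's Saxpy loop: y = a*x + y.
 * map(to: ...) copies x to the device before the region runs;
 * map(tofrom: ...) copies y to the device and back to the host afterwards.
 * Without offloading support the pragma is ignored or falls back to the host. */
void saxpy_offload(float a, const float *x, float *y, size_t n) {
    assert(x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

A caller passes ordinary host arrays, for example `saxpy_offload(2.0f, x, y, N);`. Under the explicit-copy model the runtime performs the transfers that the `map` clauses describe.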
OpenMP Offloading Runtime - libomptarget
[Diagram: a Clang/LLVM binary contains host code and device code; non-offloading constructs go to the host OpenMP runtime, target constructs go to the offloading runtime]
➢ Host OpenMP RT (libomp): serves OpenMP runtime calls for non-target regions
➢ Offloading Runtime (libomptarget): device-agnostic OpenMP runtime that implements support for target offload regions
➢ CUDA Device Plugin (libomptarget), built on the CUDA Driver API: device plugins implement device-specific operations, e.g. data copies and kernel execution
An OpenMP Framework for UVM
Goals
➢ Improved Performance
➢ Performance Portability
Design Considerations
➢ How can we leverage OpenMP target constructs?
➢ Does performance scale with large datasets?
An OpenMP Framework for UVM
Compiler Technology
➢ Developed as LLVM transformations
➢ Extracts important application-specific information
  ➢ e.g. data access probability
OpenMP Runtime
➢ Our implementation extends the libomptarget library
  ➢ More specifically, we extend the CUDA device plugin
➢ Developed a UVM-compatible offloading plugin
➢ Includes UVM-specific performance optimizations
Optimizations - Prefetching
#pragma omp target data map(to: x[0:N])
#pragma omp target data map(tofrom: y[0:N])
{
    #pragma omp target
    #pragma omp teams
    #pragma omp distribute parallel for
    for (long int i = 0; i < N; i++)
        y[i] = a*x[i] + y[i];
}
Runtime, without prefetching:
ExecuteTargetRegion()
➢ Data pages are fetched on demand, which leads to page fault processing overhead
Runtime, with prefetching:
PrefetchToDevice(x)
PrefetchToDevice(y)
ExecuteTargetRegion()
PrefetchToHost(y)
➢ Prefetching the data pages that will be used helps avoid page fault processing overhead
Prefetching Workflow - Compiler
Goal: The compiler extracts the access probabilities associated with the OpenMP mapped data
[Workflow: App source → (Build) → LLVM IR → Profile Application → LLVM Pass: Extract Access Probability → LLVM Pass: Add Access Probability → Transformed IR → (Build) → Binary]
Prefetching Workflow - Runtime
[Workflow: Executable → API calls including access probabilities → Offloading Runtime (libomptarget) → CUDA Device Plugin (libomptarget) → Data Prefetching between RAM and Device Memory]
➢ Uses a cost model to determine the profitability of prefetching
➢ Prefetch data with high access probability
Memory Oversubscription
➢ With device memory oversubscription, naïve data prefetching leads to memory thrashing
[Timeline: Prefetch → Compute]
➢ Pipelining partial prefetches with partial compute
[Timeline: Prefetch (Partial) → Compute (Partial) → Prefetch (Partial) → Compute (Partial); now everything fits]
Pipelining Workflow - Compiler
[Workflow: App source → (Build) → LLVM IR → LLVM Pass: Extract OpenMP Loop bounds → LLVM Pass: Extract memory access expressions → LLVM Pass: Add OpenMP Region Info → Transformed IR → (Build) → Binary]
➢ Loop bounds: required for chunking the iteration space
➢ Memory access expressions: required for chunking the data prefetches
Pipelining Workflow - Runtime
[Workflow: Application Binary → tgt_run_target() → libomptarget → tgt_run_target() → Device Plugin]
➢ The application calls into the runtime to invoke the target offload region; the runtime API call includes the OpenMP Region Info
➢ If the device memory is being oversubscribed, the device plugin chunks a kernel into multiple smaller kernels:
chunkCount = getChunkCount()
for(i=0; i < chunkCount; i++){
    prefetchChunk()
    executeChunk()
    synchronizeTask()
}
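The chunking loop above can be sketched on the host to show how loop bounds and per-iteration data size determine the chunks. This is a minimal illustration under stated assumptions, not the plugin's code: `get_chunk_count`, `get_chunk_range`, `saxpy_chunked`, and the `budget_bytes` parameter (the device memory assumed to be available per chunk) are all hypothetical, and the prefetch and synchronize steps appear only as comments.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range of loop iterations [begin, end) handled by one chunk. */
typedef struct {
    size_t begin;
    size_t end;
} IterRange;

/* Number of chunks needed so that the data touched by each chunk fits in
 * budget_bytes (ceiling division over iterations per chunk). */
size_t get_chunk_count(size_t n_iters, size_t bytes_per_iter, size_t budget_bytes) {
    assert(bytes_per_iter > 0 && budget_bytes >= bytes_per_iter);
    size_t iters_per_chunk = budget_bytes / bytes_per_iter;
    return (n_iters + iters_per_chunk - 1) / iters_per_chunk;
}

/* Splits the iteration space evenly; the last chunk may be shorter. */
IterRange get_chunk_range(size_t chunk, size_t chunk_count, size_t n_iters) {
    assert(chunk < chunk_count);
    size_t per_chunk = (n_iters + chunk_count - 1) / chunk_count;
    IterRange r;
    r.begin = chunk * per_chunk;
    r.end = r.begin + per_chunk;
    if (r.begin > n_iters) r.begin = n_iters;
    if (r.end > n_iters) r.end = n_iters;
    return r;
}

/* Host-only stand-in for the pipelined Saxpy: each chunk would be prefetched,
 * executed as a smaller kernel, and synchronized before the next one. */
void saxpy_chunked(float a, const float *x, float *y, size_t n, size_t budget_bytes) {
    size_t bytes_per_iter = 2 * sizeof(float); /* one element of x and one of y */
    size_t chunk_count = get_chunk_count(n, bytes_per_iter, budget_bytes);
    for (size_t c = 0; c < chunk_count; c++) {
        IterRange r = get_chunk_range(c, chunk_count, n);
        /* prefetchChunk(): would migrate x[r.begin..r.end) and y[r.begin..r.end) */
        for (size_t i = r.begin; i < r.end; i++) /* executeChunk() */
            y[i] = a * x[i] + y[i];
        /* synchronizeTask(): would wait for this chunk's kernel to finish */
    }
}
```

The loop bounds give the iteration space to split, and the memory access expressions (here simply `x[i]` and `y[i]`) give the bytes touched per iteration, which is why the compiler passes extract both.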
Experiments
➢Do the optimizations help improve performance for computational kernels?
➢Do the optimizations scale with increasing dataset sizes?
➢Baseline: Demand Paging
➢Comparisons: Prefetching, Pipelined Prefetching
➢Benchmarks: Saxpy, Kmeans (Rodinia)
Experimental Setup
Hardware
➢ Summitdev cluster at ORNL
➢ Pascal P100 GPU cards
  ➢ 16GB device memory
  ➢ NVLink interconnect
Software
➢ Clang, LLVM
  ➢ clang-ykt project
➢ libomp, libomptarget
  ➢ clang-ykt project
![Page 57: An OpenMP Runtime for UVM-Capable GPU’s · 2021. 2. 5. · Runtime for UVM-Capable GPUs Hashim Sharif and Vikram Adve University of Illinois at Urbana-Champaign Hal Finkel Argonne](https://reader036.fdocuments.in/reader036/viewer/2022062611/61278e303010fd355d3bbc2a/html5/thumbnails/57.jpg)
Results - SAXPY
[Plot: Time (%) on the y-axis (0 to 120) against log(N) elements on the x-axis (27 to 31); series: Paging, Prefetch, Pipelining+Prefetching]
Annotation: "This does not fit in device memory"
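To give a feel for when a SAXPY working set exceeds the 16GB device memory listed in the experimental setup, here is a rough footprint calculation. It rests on assumptions the slides do not state: that log(N) is base 2, that the kernel touches two arrays, that elements are 4-byte floats by default, and that 16GB means 2^34 bytes. The helper names are hypothetical.

```cpp
#include <cstdint>

// Bytes touched by SAXPY on 2^log_n elements.
// ASSUMPTION: two arrays (x and y) of elem_bytes-byte elements; the slides
// do not state the element type, so 4 (float) is only a default guess.
constexpr std::uint64_t saxpy_footprint_bytes(unsigned log_n,
                                              std::uint64_t elem_bytes = 4) {
  return 2 * elem_bytes * (std::uint64_t{1} << log_n);
}

// True if the working set is strictly smaller than the device capacity.
// A working set equal to the capacity is treated as not fitting, since
// some device memory is always needed for other purposes.
constexpr bool fits_in_device(std::uint64_t footprint_bytes,
                              std::uint64_t device_bytes) {
  return footprint_bytes < device_bytes;
}
```

Under these assumptions the footprint doubles with each step of log(N), from 2^30 bytes at log(N) = 27 to 2^34 bytes at log(N) = 31, so only the largest float case reaches the device capacity. With 8-byte elements the same happens one step earlier, at log(N) = 30. Which data point the slide's annotation marks is not recoverable from the transcript.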
Kmeans – 10 Iterations
[Plot: Time (%) on the y-axis (0 to 120) against log(N) points on the x-axis (27 to 31); series: Paging, Prefetch, Pipelining+Prefetching]
Annotation: "This does not fit in device memory"
Kmeans – 20 Iterations
[Plot: Time (%) on the y-axis (0 to 200) against log(N) points on the x-axis (27 to 31); series: Paging, Prefetch, Pipelining]
Annotation: "Demand paging performs better – hardware learns to use a more suitable eviction policy"
Summary
➢ Developed an OpenMP framework for UVM-capable GPUs
➢ Developed optimizations to reduce page fault processing overhead
  ➢ Prefetching
  ➢ Pipelining
➢ Optimizations enable reasonable improvements on benchmarks
➢ Future work: developing more sophisticated pipelining strategies
Thanks & Questions?
Acknowledgements
• The LLVM community (including our many contributing vendors)
• The SOLLVE project (part of the Exascale Computing Project)
• OLCF, ORNL for access to the SummitDev system
• ALCF, ANL, and DOE
References
• https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
• http://on-demand.gputechconf.com/gtc/2016/presentation/s6510-jeff-larkin-targeting-gpus-openmp.pdf
• https://drive.google.com/file/d/0B-jX56_FbGKRM21sYlNYVnB4eFk/view
• https://www.nersc.gov/assets/Uploads/XE62011OpenMP.pdf
• http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
• https://llvm-hpc3-workshop.github.io/slides/Bertolli.pdf