Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam...

43
Kripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen www.libocca.org Advisor: Prof. Tim Warburton This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC LLNL-PRES-675166

Transcript of Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam...

Page 1: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: OCCA Port

David S. Medina

LLNL Mentors: Jeff Keasler, Adam Kunen

www.libocca.org

Advisor: Prof. Tim Warburton

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Lawrence Livermore National Security, LLCLLNL-PRES-675166

Page 2: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

2

Summary

Many front-ends (more coming) together with multiple back-ends

Interface Hardware

x86

Xeon Phi

AMD GPU

NVIDIA GPU

OpenCL

Intel COI

NVIDIA CUDA

Pthreads

Application

OCCA API

LLNL-PRES-675166

Page 3: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

3

Summary

Many front-ends (more coming) together with multiple back-ends

Hardware

x86

Xeon Phi

AMD GPU

NVIDIA GPU

Application

OCCA API

LLNL-PRES-675166

Page 4: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Summary: Approaches

4

API Type Front-ends Kernel Language

Back-ends

Kokkos ND Arrays C++11 Templates CUDA, OpenMP

OCCA API + DSL + source-to-source

C, C++, Fortran, Python + others

Custom Kernel Language

Pthreads, OMP, OCL, CUDA, COI

RAJA Lambdas C++11 Loop-Macros OMP, CUDA, ACC

CU2CL Source-to-Source Stand-alone CUDA OpenCL

Insieme Intermediate Interpretation C OpenMP, Cilk,

MPI, OpenCLOpenCL, MPI,

Insieme Runtime

OmpSs Directives + Kernels C, C++ Hybrid OpenMP,

OpenCL, CUDAOpenMP, OpenCL,

CUDA

Ocelot PTX Translator CUDA CUDA CUDA, OpenCL, x86

OpenMP 4.0 #pragma C++11 #pragma OpenMP/CUDA

OpenACC #pragma C++11 #pragma OpenMP/CUDA

Some Portability Approaches

Com

pilerA

PI

LLNL-PRES-675166

Page 5: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Summary: Application Snapshot

5

#include "occa.hpp"

int main(int argc, char **argv){ float *a = new float[5]; float *b = new float[5]; float *ab = new float[5];

for(int i = 0; i < 5; ++i){ a[i] = i; b[i] = 1 - i; ab[i] = 0; }

occa::device device; occa::kernel addVectors; occa::memory o_a, o_b, o_ab;

device.setup(“mode = OpenCL, platformID = 0, deviceID = 0”);

o_a = device.malloc(5*sizeof(float)); o_b = device.malloc(5*sizeof(float)); o_ab = device.malloc(5*sizeof(float)); o_a.copyFrom(a); o_b.copyFrom(b);

addVectors = device.buildKernelFromSource(“addVectors.okl", "addVectors");

addVectors(5, o_a, o_b, o_ab);

o_ab.copyTo(ab);

for(int i = 0; i < 5; ++i) std::cout << i << ": " << ab[i] << '\n';

kernel void addVectors(const int entries, const float *a, const float *b, float *ab){

for(int i = 0; i < entries; ++i; tile(16)){ if(i < entries) ab[i] = a[i] + b[i]; }}

https://github.com/libocca/occa/tree/master/examples/addVectorsLLNL-PRES-675166

Page 6: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Summary: Application Snapshot

6

#include "occa.hpp"

int main(int argc, char **argv){ float *a = new float[5]; float *b = new float[5]; float *ab = new float[5];

for(int i = 0; i < 5; ++i){ a[i] = i; b[i] = 1 - i; ab[i] = 0; }

occa::device device; occa::kernel addVectors; occa::memory o_a, o_b, o_ab;

device.setup(“mode = OpenCL, platformID = 0, deviceID = 0”);

o_a = device.malloc(5*sizeof(float)); o_b = device.malloc(5*sizeof(float)); o_ab = device.malloc(5*sizeof(float)); o_a.copyFrom(a); o_b.copyFrom(b);

addVectors = device.buildKernelFromSource(“addVectors.okl", "addVectors");

addVectors(5, o_a, o_b, o_ab);

o_ab.copyTo(ab);

for(int i = 0; i < 5; ++i) std::cout << i << ": " << ab[i] << '\n';

kernel void addVectors(const int entries, const float *a, const float *b, float *ab){

for(int i = 0; i < entries; ++i; tile(16)){ if(i < entries) ab[i] = a[i] + b[i]; }}

https://github.com/libocca/occa/tree/master/examples/addVectorsLLNL-PRES-675166

Page 7: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Summary: Application Snapshot

7

#include "occa.hpp"

int main(int argc, char **argv){ float *a = new float[5]; float *b = new float[5]; float *ab = new float[5];

for(int i = 0; i < 5; ++i){ a[i] = i; b[i] = 1 - i; ab[i] = 0; }

occa::device device; occa::kernel addVectors; occa::memory o_a, o_b, o_ab;

device.setup(“mode = OpenCL, platformID = 0, deviceID = 0”);

o_a = device.malloc(5*sizeof(float)); o_b = device.malloc(5*sizeof(float)); o_ab = device.malloc(5*sizeof(float)); o_a.copyFrom(a); o_b.copyFrom(b);

addVectors = device.buildKernelFromSource(“addVectors.okl", "addVectors");

addVectors(5, o_a, o_b, o_ab);

o_ab.copyTo(ab);

for(int i = 0; i < 5; ++i) std::cout << i << ": " << ab[i] << '\n';

kernel void addVectors(const int entries, const float *a, const float *b, float *ab){

for(int i = 0; i < entries; ++i; tile(16)){ if(i < entries) ab[i] = a[i] + b[i]; }}

https://github.com/libocca/occa/tree/master/examples/addVectorsLLNL-PRES-675166

Page 8: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Summary: Application Snapshot

8

#include "occa.hpp"

int main(int argc, char **argv){ float *a = new float[5]; float *b = new float[5]; float *ab = new float[5];

for(int i = 0; i < 5; ++i){ a[i] = i; b[i] = 1 - i; ab[i] = 0; }

occa::device device; occa::kernel addVectors; occa::memory o_a, o_b, o_ab;

device.setup(“mode = OpenCL, platformID = 0, deviceID = 0”);

o_a = device.malloc(5*sizeof(float)); o_b = device.malloc(5*sizeof(float)); o_ab = device.malloc(5*sizeof(float)); o_a.copyFrom(a); o_b.copyFrom(b);

addVectors = device.buildKernelFromSource(“addVectors.okl", "addVectors");

addVectors(5, o_a, o_b, o_ab);

o_ab.copyTo(ab);

for(int i = 0; i < 5; ++i) std::cout << i << ": " << ab[i] << '\n';

kernel void addVectors(const int entries, const float *a, const float *b, float *ab){

for(int i = 0; i < entries; ++i; tile(16)){ if(i < entries) ab[i] = a[i] + b[i]; }}

https://github.com/libocca/occa/tree/master/examples/addVectorsLLNL-PRES-675166

Page 9: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Interfaces Hardware

OCCA Overview

9

x86

Xeon Phi

AMD GPU

NVIDIA GPU

OpenCL

Intel COI

NVIDIA CUDA

Pthreads

LLNL-PRES-675166

Page 10: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

OCCA Overview

10

ç

Interfaces Hardware

x86

Xeon Phi

AMD GPU

NVIDIA GPU

OpenCL

Intel COI

NVIDIA CUDA

Pthreads

IR

Intermediate Representation

LLNL-PRES-675166

Page 11: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Parser

OKL OFL

Unified Languages

OCCA Overview

11

Interfaces Hardware

x86

Xeon Phi

AMD GPU

NVIDIA GPU

OpenCL

Intel COI

NVIDIA CUDA

Pthreads

IRIRIR

Intermediate Representation

OCCA Fortran Language

OCCA Kernel Language

LLNL-PRES-675166

Page 12: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Parser

OKL OFL

Unified Languages

OCCA Overview

12

Interfaces Hardware

x86

Xeon Phi

AMD GPU

NVIDIA GPU

OpenCL

Intel COI

NVIDIA CUDA

Pthreads

IRIR

OpenCL

NVIDIA CUDA

OCCA Fortran Language

OCCA Kernel Language

LLNL-PRES-675166

Page 13: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

OCCA Overview

13

Interface Hardware

x86

Xeon Phi

AMD GPU

NVIDIA GPU

OpenCL

Intel COI

NVIDIA CUDA

Pthreads

OpenCL

NVIDIA CUDA

Parser IR

OFLOKL

Unified LanguagesApplication

OCCA API

LLNL-PRES-675166

Page 14: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Computational hierarchies are similar

Parallelization Paradigm

14

Global Memory

Group 0

SharedShared

Registers

Item (0,0)

Shared

Group 1 Group 2

Half-WarpCore≈

Stream Core

Processing Element ≈

CPU Architecture GPU Architecture

L1 Cache

Global Memory

Core 0 Core 1 Core 2

L1 Cache L1 Cache

L3 Cache

Processing Element

LLNL-PRES-675166

Page 15: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Computational hierarchies are similar

Parallelization Paradigm

15

Global Memory

Group 0

SharedShared

Registers

Item (0,0)

Shared

Group 1 Group 2

CPU Architecture GPU Architecture

L1 Cache

Global Memory

Core 0 Core 1 Core 2

L1 Cache L1 Cache

L3 Cache

Processing Element

void cpuFunction(){ #pragma omp parallel for for(int i = 0; i < work; ++i){

Do [hopefully thread-independent] work

}

}

__kernel void gpuFunction(){// for each work-group {// for each work-item in group {

Do [group-independent] work

// }// } }

LLNL-PRES-675166

Page 16: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Computational hierarchies are similar

Parallelization Paradigm

16

Global Memory

Group 0

SharedShared

Registers

Item (0,0)

Shared

Group 1 Group 2

CPU Architecture GPU Architecture

L1 Cache

Global Memory

Core 0 Core 1 Core 2

L1 Cache L1 Cache

L3 Cache

Processing Element

void ompFunction(){// for each thread { for(thread’s work){

Do [hopefully thread-independent] work

}// }}

__kernel void gpuFunction(){// for each work-group {// for each work-item in group {

Do [group-independent] work

// }// } }

LLNL-PRES-675166

Page 17: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

kernel void kernelName(...){ ...

for(outer){ for(inner){ } }

...}

kernel void kernelName(...){ ...

for(int groupZ = 0; groupZ < zGroups; ++groupZ; outer2){ for(int groupY = 0; groupY < yGroups; ++groupY; outer1){ for(int groupX = 0; groupX < xGroups; ++groupX; outer0){ // Work-group implicit loops

for(int itemZ = 0; itemZ < zItems; ++itemZ; inner2){ for(int itemY = 0; itemY < yItems; ++itemY; inner1){ for(int itemX = 0; itemX < xItems; ++itemX; inner0){ // Work-item implicit loops // GPU Kernel Scope }}} }}}

...}

Description • Minimal extensions to C, familiar for regular programmers

• Translates to OCCA IR with code transformations

• Parallel loops are explicit through the fourth for-loop inner and outer labels

OKL: OCCA Kernel Language

17The concept of iterating over groups and items is simple

Shared Shared Shared

dim3 blockDim(xGroups,yGroups,zGroups);dim3 threadDim(xItems,yItems,zItems);kernelName<<< blockDim , threadDim >>>(…);

LLNL-PRES-675166

Page 18: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

kernel void kernelName(...){ for(outer){ for(inner){ } }

for(outer){ for(inner){ } }

for(outer){ for(inner){ } }}

OKL: OCCA Kernel Language

18Execute a set of outer-loops is equivalent to launching a CUDA/OpenCL kernel

Multiple outer-loops • Sets of outer-loops are synonymous with CUDA and OpenCL kernels

• Extension: allow for multiple outer-loops per kernel

LLNL-PRES-675166

Page 19: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

OKL: OCCA Kernel Language

19Execute a set of outer-loops is equivalent to launching a CUDA/OpenCL kernel

kernel void kernelName(...){ ...

for(outer){ for(inner){ } }

...}

Multiple outer-loops • Sets of outer-loops are synonymous with CUDA and OpenCL kernels

• Extension: allow for multiple outer-loops per kernel

LLNL-PRES-675166

Page 20: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

20Execute a set of outer-loops is equivalent to launching a CUDA/OpenCL kernel

kernel void kernelName(…){

if(expr){ for(outer){ for(inner){ } } else{ for(outer){ for(inner){ } } }

while(expr){ for(outer){ for(inner){ } } }

}

OKL: OCCA Kernel Language

Multiple outer-loops • Sets of outer-loops are synonymous with CUDA and OpenCL kernels

• Extension: allow for multiple outer-loops per kernel

LLNL-PRES-675166

Page 21: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

for(int groupX = 0; groupX < xGroups; ++groupX; outer0){ // Work-group implicit loops exclusive int exclusiveVar, exclusiveArray[10];

for(int itemX = 0; itemX < 16; ++ itemX; inner0){ // Work-item implicit loops exclusiveVar = itemX; // Pre-fetch }

// Auto-insert [barrier(localMemFence);]

for(int itemX = 0; itemX < 16; ++ itemX; inner0){ // Work-item implicit loops int i = exclusiveVar; // Use pre-fetched data }}

for(int groupX = 0; groupX < xGroups; ++groupX; outer0){ // Work-group implicit loops shared int sharedVar[16];

for(int itemX = 0; itemX < 16; ++ itemX; inner0){ // Work-item implicit loops sharedVar[itemX] = itemX; }

// Auto-insert [barrier(localMemFence);]

for(int itemX = 0; itemX < 16; ++ itemX; inner0){ // Work-item implicit loops int i = (sharedVar[itemX] + sharedVar[(itemX + 1) % 16]); }}

Shared Memory

OKL: OCCA Kernel Language

21Local barriers are auto-inserted (gives a warning)

Register Memory (SIMD Variables)

LLNL-PRES-675166

Page 22: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

for(int groupX = 0; groupX < xGroups; ++groupX; outer0){ // Work-group implicit loops exclusive int exclusiveVar, exclusiveArray[10];

for(int itemX = 0; itemX < 16; ++ itemX; inner0){ // Work-item implicit loops exclusiveVar = itemX; // Pre-fetch }

// Auto-insert [barrier(localMemFence);]

for(int itemX = 0; itemX < 16; ++ itemX; inner0){ // Work-item implicit loops int i = exclusiveVar; // Use pre-fetched data }}

for(int groupX = 0; groupX < xGroups; ++groupX; outer0){ // Work-group implicit loops shared int sharedVar[16];

for(int itemX = 0; itemX < 16; ++ itemX; inner0){ // Work-item implicit loops sharedVar[itemX] = itemX; }

// Auto-insert [barrier(localMemFence);]

for(int itemX = 0; itemX < 16; ++ itemX; inner0){ // Work-item implicit loops int i = (sharedVar[itemX] + sharedVar[(itemX + 1) % 16]); }}

Shared Memory

OKL: OCCA Kernel Language

22Local barriers are auto-inserted (gives a warning)

Register Memory (SIMD Variables)

LLNL-PRES-675166

Page 23: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

kernel subroutine kernelName(...) ...

DO groupY = 1, yGroups, outer1 DO groupX = 1, xGroups, outer0 // Work-group implicit loops DO itemY = 1, yItems, inner1 DO itemX = 1, xItems, inner0 // Work-item implicit loops // GPU Kernel Scope END DO END DO END DO END DO

...end subroutine kernelName

Description • Translates to OKL and then to OCCA IR with code transformations

• Parallel loops are explicit through the inner and outer DO-labels

OFL: OCCA Fortran Language

23Because [OFL -> OKL], all features added to OKL are inherintly added to OFL

Shared and Exclusive Memory

integer(4), shared :: sharedVar(16,30)integer(4), exclusive :: exclusiveVar, exclusiveArray(10)

LLNL-PRES-675166

Page 24: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

OCCA API: Original

24

#include "occa.hpp"

int main(int argc, char **argv){ occa::device device; occa::kernel addVectors; occa::memory o_a, o_b, o_ab;

device.setup(“mode = OpenCL, platformID = 0, deviceID = 0”);

float *a = new float[5]; float *b = new float[5]; float *ab = new float[5];

for(int i = 0; i < 5; ++i){ a[i] = i; b[i] = 1 - i; ab[i] = 0; }

o_a = device.malloc(5*sizeof(float)); o_b = device.malloc(5*sizeof(float)); o_ab = device.malloc(5*sizeof(float)); o_a.copyFrom(a); o_b.copyFrom(b);

addVectors = device.buildKernel("addVectors.okl", "addVectors");

addVectors(5, o_a, o_b, o_ab);

o_ab.copyTo(ab);

for(int i = 0; i < 5; ++i) std::cout << i << ": " << ab[i] << '\n';

https://github.com/libocca/occa/tree/master/examples/addVectorsLLNL-PRES-675166

Page 25: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

OCCA API: Original + UVA + Managed Memory

25

#include "occa.hpp"

int main(int argc, char **argv){ occa::device device; occa::kernel addVectors; occa::memory o_a, o_b, o_ab;

device.setup(“mode = OpenCL, platformID = 0, deviceID = 0”); float *a = (float*) device.managedAlloc(5 * sizeof(float)); float *b = (float*) device.managedAlloc(5 * sizeof(float)); float *ab = (float*) device.managedAlloc(5 * sizeof(float));

for(int i = 0; i < 5; ++i){ a[i] = i; b[i] = 1 - i; ab[i] = 0; }

o_a = device.malloc(5*sizeof(float)); o_b = device.malloc(5*sizeof(float)); o_ab = device.malloc(5*sizeof(float)); o_a.copyFrom(a); o_b.copyFrom(b);

addVectors = device.buildKernel("addVectors.okl", "addVectors");

addVectors(5, a, b, ab);

device::finish(); // occa::memcpy(ab, o_ab, 5*sizeof(float));

for(int i = 0; i < 5; ++i) std::cout << i << ": " << ab[i] << '\n';

https://github.com/libocca/occa/tree/master/examples/uvaAddVectorsLLNL-PRES-675166

Page 26: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

OCCA API: Original + Managed + BG Device

26

#include "occa.hpp"

int main(int argc, char **argv){ occa::device device; occa::kernel addVectors; occa::memory o_a, o_b, o_ab;

occa::setDevice(“mode = OpenCL, platformID = 0, deviceID = 0”); float *a = (float*) occa::managedAlloc(5 * sizeof(float)); float *b = (float*) occa::managedAlloc(5 * sizeof(float)); float *ab = (float*) occa::managedAlloc(5 * sizeof(float));

for(int i = 0; i < 5; ++i){ a[i] = i; b[i] = 1 - i; ab[i] = 0; }

o_a = (float*) device.uvaAlloc(5*sizeof(float)); o_b = (float*) device.uvaAlloc(5*sizeof(float)); o_ab = (float*) device.uvaAlloc(5*sizeof(float)); occa::memcpy(o_a, a, 5*sizeof(float)); occa::memcpy(o_b, b, 5*sizeof(float));

addVectors = occa::buildKernel("addVectors.okl", "addVectors");

addVectors(5, a, b, ab);

occa::finish() // occa::memcpy(ab, o_ab, 5*sizeof(float));

for(int i = 0; i < 5; ++i) std::cout << i << ": " << ab[i] << '\n';

https://github.com/libocca/occa/tree/master/examples/uvaAddVectorsLLNL-PRES-675166

Page 27: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke Port

LLNL-PRES-675166

Page 28: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

for(g){ for(d){ for(z){ // Work ... } }}

for(g){ for(z){ for(d){ // Work ... } }}

for(d){ for(g){ for(z){ // Work ... } }}

for(d){ for(z){ for(g){ // Work ... } }}

for(z){ for(d){ for(g){ // Work ... } }}

for(z){ for(g){ for(d){ // Work ... } }}

Kripke: Goals

28

Loop Reordering (Run-Time) • Simplify loop re-ordering for testing

• Avoid re-writing code (less code, less bugs)

for(g; loopOrder(lo_g)){ for(d; loopOrder(lo_d)){ for(z; loopOrder(lo_z)){ // Work ... } }}

LLNL-PRES-675166

Page 29: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Goals

29

Index Ordering (Run-Time) • Swapping loop-order changes access order

• Avoid re-writing code (less code, less bugs) ... again

double *psi @dim(groups, moments, zones);

psi(g,m,z) = 0;

double *psi;

psi[z + zones*(m + moments*g)] = 0;

LLNL-PRES-675166

Page 30: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Goals

30

Index Ordering (Run-Time) • Swapping loop-order changes access order

• Avoid re-writing code (less code, less bugs) ... again

double *psi @(dim(groups, moments, zones), idxOrder(0,1,2));

psi(g,m,z) = 0;

double *psi;

psi[g + groups*(m + moments*z)] = 0;

LLNL-PRES-675166

Page 31: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Goals

31

Index Ordering (Run-Time) • Swapping loop-order changes access order

• Avoid re-writing code (less code, less bugs) ... again

double * KRESTRICT phi = sdom.phi->ptr(group0, 0, 0);double * KRESTRICT psi_ptr = sdom.psi->ptr();for(g){ double * KRESTRICT ell_nm = sdom.ell->ptr(); for(m){ double * KRESTRICT psi = psi_ptr; for(d){ double ell_nm_d = ell_nm[d]; for(z){ phi[z] += ell_nm_d * psi[z]; } psi += num_zones; } ell_nm += num_local_directions; phi += num_zones; } psi_ptr += num_zones * num_local_directions;}

LLNL-PRES-675166

Page 32: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Goals

32

Index Ordering (Run-Time) • Swapping loop-order changes access order

• Avoid re-writing code (less code, less bugs) ... again

kernel void LTimes( double * restrict phi @(dim(PHI_DIM), idxOrder(PHI_IDX)), const double * restrict ell @(dim(ELL_DIM) idxOrder(PHI_IDX)), const double * restrict psi @(dim(PSI_DIM) idxOrder(PHI_IDX))){

for(g; loopOrder(lo_g)){ for(m; loopOrder(lo_m)){ for(d; loopOrder(lo_d)){ for(z; loopOrder(lo_z)){ phi(group0 + g,m,z) += ell(d,m) * psi(g,d,z); } } } }}

Kripke Kernels: 2004 linesAverage File : 334 lines

OCCA Kernels: 275 lines

LLNL-PRES-675166

Page 33: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Goals

33

Index Ordering (Run-Time) • Swapping loop-order changes access order

• Avoid re-writing code (less code, less bugs) ... again

kernel void LTimes( double * restrict phi @(dim(PHI_DIM), idxOrder(PHI_IDX)), const double * restrict ell @(dim(ELL_DIM) idxOrder(PHI_IDX)), const double * restrict psi @(dim(PSI_DIM) idxOrder(PHI_IDX))){

for(g; loopOrder(lo_g)){ for(m; loopOrder(lo_m)){ for(d; loopOrder(lo_d)){ const double r_ell = ell(d,m); for(z; loopOrder(lo_z)){ phi(group0 + g,m,z) += r_ell * psi(g,d,z); } } } }}

Kripke Kernels: 2004 linesAverage File : 334 lines

OCCA Kernels: 275 lines

LLNL-PRES-675166

Page 34: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Goals

34

Command-Line Options • Compatibility : --nest DGZ works to compare with the original Kripke code

• Loop Ordering : --lo GZDM,GZDM,ZMGP,XG,KJIGD

• Index Ordering : -io GMZ,GDZ,DM,DM,GCP,GZ

• Help : --help, --moreHelp

--LTimesOrder GZDM--LPlusTimesOrder GZDM--scatteringOrder ZMGP--sourceOrder XG--sweepOrder KJIGD

--phiOrder GMZ--psiOrder GDZ--ellOrder DM--ellPlusOrder DM--sigsOrder GCP--sigtOrder GZ

G: Group (From in scattering)P: Group (To in scattering)Z: ZoneD: DirectionsM: MomentX: MixK: K spatial dimensionJ: J spatial dimensionI: I spatial dimensionC: Coefficient

LLNL-PRES-675166

Page 35: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Results

35

LTimes Kernel (DGZ on [cab])T

ime

Take

n (s

)

0.08

0.09

0.1

0.11

0.12

Original

| | |

OCCA+ Prefetches

OCCA

LLNL-PRES-675166

Page 36: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Results

36

LPlusTimes Kernel (DGZ on [cab])T

ime

Take

n (s

)

0

0.033

0.065

0.098

0.13

Original

| | |

OCCA+ Prefetches

OCCA

LLNL-PRES-675166

Page 37: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Results

37

Scattering Kernel (DGZ on [cab])T

ime

Take

n (s

)

0

0.75

1.5

2.25

3

Original

| | |

OCCA+ Prefetches

OCCA

LLNL-PRES-675166

Page 38: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Results (Only with Prefetches)

38

Scattering Kernel (DGZ on [cab])T

ime

Take

n (s

)

0

0.225

0.45

0.675

0.9

Original

| | |

OCCA+ Prefetches

OCCA

LLNL-PRES-675166

Page 39: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Results

39

Source Kernel (DGZ on [cab])

Tim

e Ta

ken

(s)

0.00000

0.00013

0.00025

0.00038

0.00050

Original

| | |

OCCA+ Prefetches

OCCA

Probably OCCA launch overhead

LLNL-PRES-675166

Page 40: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Results

40

Sweep Kernel (DGZ on [cab])T

ime

Take

n (s

)

0

0.035

0.07

0.105

0.14

Original

| | |

OCCA+ Prefetches

OCCA

LLNL-PRES-675166

Page 41: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Results

41

Solve Time (DGZ on [cab])T

ime

Take

n (s

)

0

1

2

3

4

Original

| | |

OCCA+ Prefetches

OCCA

LLNL-PRES-675166

Page 42: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Kripke: Results (Only with Prefetches)

42

Solve Time (DGZ on [cab])T

ime

Take

n (s

)

0

0.4

0.8

1.2

1.6

Original

| | |

OCCA+ Prefetches

OCCA

LLNL-PRES-675166

Page 43: Kripke: OCCA Port - ComputingKripke: OCCA Port David S. Medina LLNL Mentors: Jeff Keasler, Adam Kunen rg Advisor: Prof. Tim Warburton This work was performed under the auspices of

Live Demo

www.libocca.org

LLNL-PRES-675166