GMAC Global Memory for Accelerators

Wen-mei W. Hwu, Isaac Gelado and Javier Cabezas

GMAC in a nutshell

• GMAC: Unified Virtual Address Space for OpenCL

– Simplifies the CPU code

– Exploits advanced OpenCL features for free

– Transparent memory consistency management

• Vector addition example – Really simple kernel code

– But, what about the CPU code?

__kernel void vecAdd(__global float *c, __global float *a, __global float *b)
{
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}
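For comparison, the same computation can be written as a plain C loop, which is handy as a host-side reference when checking the kernel's output. This is a sketch, not code from the slides; the function name vec_add_reference is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for the vector addition kernel.
 * The loop index i plays the role of get_global_id(0) in the kernel.
 * vec_add_reference is a hypothetical name, not part of OpenCL or GMAC. */
void vec_add_reference(float *c, const float *a, const float *b, size_t n)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Each loop iteration corresponds to one work-item of the kernel, so comparing both outputs element by element is a simple correctness check.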

AMD Fusion Summit 2011, 6/15/11

CPU OpenCL code (I)

• Set up OpenCL

int main(int argc, char *argv[])
{
    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_command_queue command_queue;
    cl_program program;
    cl_kernel kernel;
    cl_int error_code;
    float *a, *b, *c;
    cl_mem d_a, d_b, d_c;

    /* Start setting up OpenCL */
    error_code = clGetPlatformIDs(1, &platform, NULL);
    assert(error_code == CL_SUCCESS);
    error_code = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    assert(error_code == CL_SUCCESS);
    context = clCreateContext(0, 1, &device, NULL, NULL, &error_code);
    assert(error_code == CL_SUCCESS);
    command_queue = clCreateCommandQueue(context, device, 0, &error_code);
    assert(error_code == CL_SUCCESS);

    /* Build the program and look up the kernel by name */
    program = clCreateProgramWithSource(context, 1, &kernel_source, NULL, &error_code);
    assert(error_code == CL_SUCCESS);
    error_code = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    assert(error_code == CL_SUCCESS);
    kernel = clCreateKernel(program, "vecAdd", &error_code);
    assert(error_code == CL_SUCCESS);

CPU OpenCL code (II)

• Allocate memory and initialize data

    /* Alloc & init input data */
    assert((a = (float *)malloc(vecSize * sizeof(float))) != NULL);
    d_a = clCreateBuffer(context, CL_MEM_READ_WRITE, vecSize * sizeof(float), NULL, &error_code);
    assert(error_code == CL_SUCCESS);
    read_file("vector_A.data", a, vecSize);

    assert((b = (float *)malloc(vecSize * sizeof(float))) != NULL);
    d_b = clCreateBuffer(context, CL_MEM_READ_WRITE, vecSize * sizeof(float), NULL, &error_code);
    assert(error_code == CL_SUCCESS);
    read_file("vector_B.data", b, vecSize);

    /* Alloc output data */
    assert((c = (float *)malloc(vecSize * sizeof(float))) != NULL);
    d_c = clCreateBuffer(context, CL_MEM_READ_WRITE, vecSize * sizeof(float), NULL, &error_code);
    assert(error_code == CL_SUCCESS);

    /* Copy data to the device (non-blocking; the in-order queue runs these before the kernel) */
    assert(clEnqueueWriteBuffer(command_queue, d_a, CL_FALSE, 0, vecSize * sizeof(float), a, 0, NULL, NULL) == CL_SUCCESS);
    assert(clEnqueueWriteBuffer(command_queue, d_b, CL_FALSE, 0, vecSize * sizeof(float), b, 0, NULL, NULL) == CL_SUCCESS);

CPU OpenCL code (III)

• Call the kernel and save the output

    /* Set kernel arguments */
    assert(clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_c) == CL_SUCCESS);
    assert(clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_a) == CL_SUCCESS);
    assert(clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_b) == CL_SUCCESS);

    /* Call the kernel */
    size_t global_size = vecSize;
    assert(clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL) == CL_SUCCESS);
    assert(clFinish(command_queue) == CL_SUCCESS);

    /* Get the results back (blocking read, so c is valid before it is saved) */
    assert(clEnqueueReadBuffer(command_queue, d_c, CL_TRUE, 0, vecSize * sizeof(float), c, 0, NULL, NULL) == CL_SUCCESS);
    save_file("vector_C.data", c, vecSize);

    /* Release memory and OpenCL objects */
    clReleaseMemObject(d_c); free(c);
    clReleaseMemObject(d_b); free(b);
    clReleaseMemObject(d_a); free(a);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);
    return 0;
}

GMAC code sample

int main(int argc, char *argv[])
{
    float *a, *b, *c;

    assert(eclCompileSource(kernel_source) == eclSuccess);

    /* Alloc & init input data */
    assert(eclMalloc((void **)&a, vecSize * sizeof(float)) == eclSuccess);
    read_file("vector_A.data", a, vecSize);
    assert(eclMalloc((void **)&b, vecSize * sizeof(float)) == eclSuccess);
    read_file("vector_B.data", b, vecSize);

    /* Alloc output data */
    assert(eclMalloc((void **)&c, vecSize * sizeof(float)) == eclSuccess);

    /* Call the kernel; the same pointers are used on the CPU and the GPU */
    ecl_kernel kernel;
    size_t globalSize = vecSize;
    assert(eclGetKernel("vecAdd", &kernel) == eclSuccess);
    assert(eclSetKernelArgPtr(kernel, 0, c) == eclSuccess);
    assert(eclSetKernelArgPtr(kernel, 1, a) == eclSuccess);
    assert(eclSetKernelArgPtr(kernel, 2, b) == eclSuccess);
    assert(eclCallNDRange(kernel, 1, NULL, &globalSize, NULL) == eclSuccess);

    /* No explicit copy back: c is read directly */
    save_file("vector_C.data", c, vecSize);

    eclFree(a);
    eclFree(b);
    eclFree(c);
    return 0;
}

GMAC Supported Platforms

• Any OpenCL 1.1 compatible stack, with optimizations for:

– AMD Fusion devices

– AMD Radeon HD

– NVIDIA Tesla

• Windows 7 (64 and 32 bits)

• GNU/Linux (64 and 32 bits)

Outline

• Introduction

• GMAC Memory Model

– Asymmetric Memory

– Global Memory

• Performance Evaluation

• Conclusions

GMAC Memory Model

• Unified CPU / GPU virtual address space

• Asymmetric address space accessibility

[Figure: memory layout with CPU data and shared data; shared data is accessed by the CPU and GPU via the same pointer]

GMAC Implementation

• Fusion APU

• AMD Radeon HD

[Figure: physical memory organization on each platform; labels: Physical Memory, CPU, GPU, Coherence]

GMAC Consistency Model

• Implicit acquire / release primitives at accelerator call / return boundaries

GMAC Coherence

• Avoid unnecessary data copies

• Lazy-update:

– Call: transfer modified data

– Return: transfer when needed

[Figure: data movement between system memory and accelerator memory; labels: Accelerator, System Memory, Accelerator Memory]
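The lazy-update rules above, together with the implicit release at kernel call and acquire at kernel return, can be pictured with a toy model. The sketch below is an assumption-laden simulation and not GMAC's implementation: the accelerator memory is an ordinary array, and all names (shared_obj, cpu_read, cpu_write, call_kernel, double_all) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { OBJ_LEN = 4 };

/* State of the copy held in system memory (toy model). */
typedef enum { HOST_INVALID, HOST_CLEAN, HOST_DIRTY } host_state;

typedef struct {
    float host[OBJ_LEN];    /* copy in system memory */
    float device[OBJ_LEN];  /* stand-in for accelerator memory */
    host_state state;
    int to_device;          /* number of host-to-accelerator transfers */
    int to_host;            /* number of accelerator-to-host transfers */
} shared_obj;

void obj_init(shared_obj *o)
{
    memset(o, 0, sizeof *o);
    o->state = HOST_CLEAN;  /* both copies start out identical */
}

/* "Return: transfer when needed": fetch only on the first CPU access. */
void fetch_if_stale(shared_obj *o)
{
    if (o->state == HOST_INVALID) {
        memcpy(o->host, o->device, sizeof o->host);
        o->to_host++;
        o->state = HOST_CLEAN;
    }
}

float cpu_read(shared_obj *o, size_t i)
{
    assert(i < OBJ_LEN);
    fetch_if_stale(o);
    return o->host[i];
}

void cpu_write(shared_obj *o, size_t i, float v)
{
    assert(i < OBJ_LEN);
    fetch_if_stale(o);
    o->host[i] = v;
    o->state = HOST_DIRTY;
}

/* "Call: transfer modified data": copy to the accelerator only if dirty.
 * After the kernel returns, the host copy is stale but nothing is copied. */
void call_kernel(shared_obj *o, void (*kernel)(float *, size_t))
{
    if (o->state == HOST_DIRTY) {
        memcpy(o->device, o->host, sizeof o->device);
        o->to_device++;
    }
    kernel(o->device, OBJ_LEN);
    o->state = HOST_INVALID;
}

/* Example "kernel" used to exercise the model. */
void double_all(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0f;
}
```

In this model, two back-to-back kernel calls with no CPU access in between cost a single host-to-accelerator transfer, and the result is copied back only when the CPU first reads it.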

GMAC Memory API

• Allocate shared memory

eclError_t eclMalloc(void **ptr, size_t size)

– ptr: allocated memory address (returned by reference)

– size: size in bytes of the data to be allocated

– Returns an error code, eclSuccess if no error

• Example usage

#include <gmac/opencl.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    eclError_t error;

    if ((error = eclMalloc((void **)&foo, FOO_SIZE)) != eclSuccess)
        FATAL("Error allocating memory %s", eclErrorString(error));
    . . .
}

GMAC Memory API

• Release shared memory

eclError_t eclFree(void *ptr)

– ptr: memory address to be released

• Example usage

#include <gmac/opencl.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    eclError_t error;

    if ((error = eclMalloc((void **)&foo, FOO_SIZE)) != eclSuccess)
        FATAL("Error allocating memory %s", eclErrorString(error));
    . . .
    eclFree(foo);
}

• Functions overridden (interposition) by GMAC:

– Standard C Library memory functions: memset(), memcpy()

– Standard C Library I/O: fread(), fwrite(), read(), write()

– MPI: MPI_Send(), MPI_Recv()
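To illustrate what an interposed function might do once it has been called, here is a minimal sketch of a memcpy wrapper that checks whether either pointer falls in a registered shared range. It is not GMAC code: register_shared, is_shared, wrapped_memcpy and shared_copies are hypothetical names, and true interposition replaces the library symbol itself (on Linux this is commonly done with a preloaded shared library that forwards to the original through dlsym(RTLD_NEXT, ...)), which a single-file example cannot show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MAX_RANGES = 8 };

typedef struct {
    const void *base;
    size_t size;
} shared_range;

shared_range ranges[MAX_RANGES];
size_t n_ranges;
int shared_copies;  /* copies that touched a registered shared range */

/* Record a range as "shared" (stand-in for what an allocator would do). */
void register_shared(const void *base, size_t size)
{
    assert(n_ranges < MAX_RANGES);
    ranges[n_ranges].base = base;
    ranges[n_ranges].size = size;
    n_ranges++;
}

int is_shared(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    for (size_t i = 0; i < n_ranges; i++) {
        uintptr_t lo = (uintptr_t)ranges[i].base;
        if (addr >= lo && addr - lo < ranges[i].size)
            return 1;
    }
    return 0;
}

/* Wrapper with the same signature as memcpy. A real runtime could choose a
 * device-aware transfer path here; this sketch only counts such calls and
 * then forwards to the standard memcpy. */
void *wrapped_memcpy(void *dst, const void *src, size_t n)
{
    if (is_shared(dst) || is_shared(src))
        shared_copies++;
    return memcpy(dst, src, n);
}
```

The point of the design is that application code keeps calling the familiar function; only the library underneath decides whether the pointer needs special handling.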

• Get advanced OpenCL features for free

– Asynchronous, highly optimized data transfers

– Pre-pinned memory

GMAC Built-in Optimizations

[Figure: annotations "Calls to fread()" and "Data transfers wait for kernel completion"]

GMAC Global Memory

• For multi-GPU systems: data accessible by all accelerators, but owned by the CPU

• Example: medium matrix in FDTD simulations

GMAC Global Memory

• Read-only data structures

– Zero-copy memory if read only once by the GPU

– Replicated data if read often by the GPU

• GMAC Global memory:

– Pre-pinned zero-copy in AMD Fusion

– Discrete GPU (e.g. AMD Radeon HD):

• Replicated data copies if enough GPU memory

• Pre-pinned zero-copy otherwise
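The placement rules above amount to a small decision function. The sketch below restates them in C under stated assumptions: the enum and function names are hypothetical, and "enough GPU memory" is modeled as a simple comparison against a free-memory figure supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEV_FUSION_APU, DEV_DISCRETE_GPU } device_kind;
typedef enum { PLACE_ZERO_COPY, PLACE_REPLICATED } placement;

/* Decide where a global (read-only) allocation lives, following the slide:
 *  - Fusion APU: pre-pinned zero-copy memory
 *  - discrete GPU: replicate if the data fits, otherwise zero-copy
 * All names here are hypothetical, not GMAC API. */
placement choose_global_placement(device_kind kind, size_t bytes,
                                  size_t free_gpu_bytes)
{
    assert(bytes > 0);
    if (kind == DEV_FUSION_APU)
        return PLACE_ZERO_COPY;
    return bytes <= free_gpu_bytes ? PLACE_REPLICATED : PLACE_ZERO_COPY;
}
```

Replication trades GPU memory for faster repeated reads, which is why the fallback to zero-copy applies only when the data does not fit.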

GMAC Global memory API

• Allocate global shared memory

eclError_t eclGlobalMalloc(void **ptr, size_t size)

– ptr: allocated memory address (returned by reference)

– size: size in bytes of the data to be allocated

• Example usage

#include <gmac/opencl.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    eclError_t error;

    if ((error = eclGlobalMalloc((void **)&foo, FOO_SIZE)) != eclSuccess)
        FATAL("Error allocating memory %s", eclErrorString(error));
    . . .
}

GMAC Performance

• Vector Addition: worst case scenario

[Figure: speedup versus vector size for OpenCL and GMAC]

GMAC Performance

• Sobel filtering on video stream

• OpenCL:

– 2.5ms per frame

– 192 lines of code

• GMAC:

– 1.5ms per frame

– 91 lines of code

• Both OpenCL and GMAC are faster than a CPU implementation
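As a quick worked calculation on the Sobel figures reported above (plain arithmetic on the stated numbers, not an additional measurement):

```latex
\[
\text{speedup of GMAC over OpenCL} = \frac{2.5\ \text{ms}}{1.5\ \text{ms}} \approx 1.67\times,
\qquad
\text{reduction in lines of code} = \frac{192 - 91}{192} \approx 52.6\%
\]
```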

GMAC Hands-on

• Sobel Filtering Example

• Bullet Particle Collision Demo

– OpenCL

– GMAC

Conclusions

• Single virtual address space for CPUs and GPUs

• Use OpenCL advanced features

– Automatic overlap of data communication and computation

– Get access to any GPU from any CPU thread

• Get more performance from your application more easily

http://www.multicorewareinc.com

Backup Slides

Rolling Update Data Transfers

• Overlap CPU execution and data transfers

• Minimal transfer on-demand

• Rolling-update:

– Memory-block size granularity

[Figure: block-by-block data movement between system memory and accelerator memory; labels: Accelerator, System Memory, Accelerator Memory]
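One way to picture block-granularity transfers that overlap with CPU execution is the toy model below. It is a sketch under assumptions the slides do not state: the block size, the cap on dirty blocks (MAX_DIRTY) and the oldest-first eviction policy are illustrative choices, accelerator memory is a plain array, and every name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK_BYTES = 4, NUM_BLOCKS = 8, MAX_DIRTY = 2 };

typedef struct {
    unsigned char host[NUM_BLOCKS * BLOCK_BYTES];
    unsigned char device[NUM_BLOCKS * BLOCK_BYTES]; /* stand-in for accelerator memory */
    int dirty[NUM_BLOCKS];
    size_t fifo[MAX_DIRTY];  /* dirty blocks, oldest first */
    size_t n_dirty;
    int early_flushes;       /* blocks sent while the CPU was still writing */
    int call_flushes;        /* blocks sent at the kernel call */
} rolling_buf;

void rolling_init(rolling_buf *r)
{
    memset(r, 0, sizeof *r);
}

/* Transfer exactly one block and mark it clean. */
void flush_block(rolling_buf *r, size_t blk)
{
    memcpy(r->device + blk * BLOCK_BYTES, r->host + blk * BLOCK_BYTES, BLOCK_BYTES);
    r->dirty[blk] = 0;
}

/* CPU write: when a new block becomes dirty and the cap is reached,
 * the oldest dirty block is sent to the accelerator first. */
void rolling_write(rolling_buf *r, size_t offset, unsigned char value)
{
    assert(offset < sizeof r->host);
    size_t blk = offset / BLOCK_BYTES;
    if (!r->dirty[blk]) {
        if (r->n_dirty == MAX_DIRTY) {
            flush_block(r, r->fifo[0]);
            memmove(&r->fifo[0], &r->fifo[1], (MAX_DIRTY - 1) * sizeof r->fifo[0]);
            r->n_dirty--;
            r->early_flushes++;
        }
        r->fifo[r->n_dirty++] = blk;
        r->dirty[blk] = 1;
    }
    r->host[offset] = value;
}

/* Kernel call: only the blocks that are still dirty need to be sent. */
void rolling_kernel_call(rolling_buf *r)
{
    for (size_t i = 0; i < r->n_dirty; i++) {
        flush_block(r, r->fifo[i]);
        r->call_flushes++;
    }
    r->n_dirty = 0;
}
```

Writing three consecutive blocks with a cap of two dirty blocks sends the first block early and leaves two for the kernel call, so less data remains to be transferred when the kernel is launched.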

GMAC and Multi-threading

• In the past, one host thread had one CPU

• In GMAC, each host thread has:

– One CPU

– One GPU

• A GMAC thread runs either on the GPU or on the CPU, but not on both at the same time

• Create threads using what you already know

– pthread_create(...)
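A minimal sketch of the threading pattern, using only standard POSIX threads: each host thread owns one slice of the data and performs the work for that slice. The worker body is a plain CPU loop standing in for a kernel launch on the thread's accelerator; the names slice_job, worker and add_with_threads are hypothetical, and no GMAC call is shown.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { NUM_THREADS = 2, SLICE = 4 };

typedef struct {
    const float *a;
    const float *b;
    float *c;
    size_t n;
} slice_job;

/* Thread body: a CPU loop standing in for "launch a kernel on this
 * thread's accelerator and wait for it to return". */
void *worker(void *arg)
{
    slice_job *job = arg;
    for (size_t i = 0; i < job->n; i++)
        job->c[i] = job->a[i] + job->b[i];
    return NULL;
}

/* Adds NUM_THREADS * SLICE elements, one slice per host thread.
 * Returns 0 on success, -1 if a thread could not be created or joined. */
int add_with_threads(const float *a, const float *b, float *c)
{
    pthread_t tid[NUM_THREADS];
    slice_job jobs[NUM_THREADS];
    int created = 0;
    int status = 0;

    assert(a != NULL && b != NULL && c != NULL);
    for (int t = 0; t < NUM_THREADS; t++) {
        jobs[t] = (slice_job){ a + t * SLICE, b + t * SLICE, c + t * SLICE, SLICE };
        if (pthread_create(&tid[t], NULL, worker, &jobs[t]) != 0) {
            status = -1;
            break;
        }
        created++;
    }
    /* Join every thread that was started, even after a failure. */
    for (int t = 0; t < created; t++)
        if (pthread_join(tid[t], NULL) != 0)
            status = -1;
    return status;
}
```

On some toolchains this needs to be built with the -pthread compiler flag.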

GMAC and Multi-threading

• Virtual memory accessibility:

– Complete address space in CPU mode

– Partial address space in GPU mode

[Figure: diagram labeled CPU, GPU, and GPU Memory]

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.
