ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
HETEROGENEOUS SYSTEM
ARCHITECTURE (HSA): ARCHITECTURE
AND ALGORITHMS
ISCA TUTORIAL - JUNE 15, 2014
TOPICS
Introduction
HSAIL Virtual Parallel ISA
HSA Runtime
HSA Memory Model
HSA Queuing Model
HSA Applications
HSA Compilation
© Copyright 2014 HSA Foundation. All Rights Reserved
The HSA Specifications are not at 1.0 final so all content is subject to change
SCHEDULE
Time Topic Speaker
8:45am Introduction to HSA Phil Rogers, AMD
9:30am HSAIL Virtual Parallel ISA Ben Sander, AMD
10:30am Break
10:50am HSA Runtime Yeh-Ching Chung, National Tsing Hua University
12 noon Lunch
1pm HSA Memory Model Benedict Gaster, Qualcomm
2pm HSA Queuing Model Hakan Persson, ARM
3pm Break
3:15pm HSA Compilation Technology Wen Mei Hwu, University of Illinois
4pm HSA Application Programming Wen Mei Hwu, University of Illinois
4:45pm Questions All presenters
INTRODUCTION
PHIL ROGERS, AMD CORPORATE FELLOW & PRESIDENT OF HSA FOUNDATION
HSA FOUNDATION
Founded in June 2012
Developing a new platform for heterogeneous
systems
www.hsafoundation.com
Specifications under development in working
groups to define the platform
Membership consists of 43 companies and 16
universities
Adding 1-2 new members each month
© Copyright 2014 HSA Foundation. All Rights Reserved
DIVERSE PARTNERS DRIVING FUTURE OF
HETEROGENEOUS COMPUTING
(Member logos grouped by level: Founders, Promoters, Supporters, Contributors, Academic.)
MEMBERSHIP TABLE

Membership Level | Number | List
Founder | 6 | AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
Promoter | 1 | LG Electronics
Contributor | 25 | Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics Inc., Sony Mobile Communications, Swarm 64 GmbH, Synopsys, Tensilica Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
Supporter | 13 | Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
Academic | 17 | Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab (National Tsing Hua University), Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
HETEROGENEOUS PROCESSORS HAVE
PROLIFERATED — MAKE THEM BETTER
Heterogeneous SOCs have arrived and are a
tremendous advance over previous platforms
SOCs combine CPU cores, GPU cores and
other accelerators, with high bandwidth access
to memory
How do we make them even better?
Easier to program
Easier to optimize
Higher performance
Lower power
HSA unites accelerators architecturally
Early focus on the GPU compute accelerator,
but HSA will go well beyond the GPU
INFLECTIONS IN PROCESSOR DESIGN
(Chart: three eras of processor design, each plotted as performance over time with a “we are here” marker.)

Single-Core Era: single-thread performance over time. Enabled by Moore’s Law and voltage scaling; constrained by power and complexity. Programming: Assembly, C/C++, Java, …

Multi-Core Era: throughput performance over time (# of processors). Enabled by Moore’s Law and SMP architecture; constrained by power, parallel software, and scalability. Programming: pthreads, OpenMP / TBB, …

Heterogeneous Systems Era: modern application performance over time (data-parallel exploitation). Enabled by abundant data parallelism and power-efficient GPUs; temporarily constrained by programming models and communication overhead. Programming: Shader, CUDA, OpenCL, C++ and Java.
LEGACY GPU COMPUTE
(Diagram: CPUs sharing coherent system memory, connected over PCIe™ to a GPU with its own non-coherent memory.)

The limiters:
Multiple memory pools
Multiple address spaces
High overhead dispatch
Data copies across PCIe
New languages for programming
Dual source development
Proprietary environments
Expert programmers only

Need to fix all of this to unleash our programmers
EXISTING APUS AND SOCS
Physical Integration
(Diagram: CPU cores 1..N and GPU compute units 1..M integrated on one chip, with coherent system memory and a separate non-coherent GPU memory.)
Good first step
Some copies gone
Two memory pools remain
Still queue through the OS
Still requires expert programmers
Need to finish the job
AN HSA ENABLED SOC
Unified Coherent Memory enables data sharing across all processors
Processors architected to operate cooperatively
Designed to enable the application to run on different processors at different times
Unified Coherent Memory
(Diagram: CPU cores 1..N and GPU compute units 1..M all sharing the unified coherent memory.)
PILLARS OF HSA*
Unified addressing across all processors
Operation into pageable system memory
Full memory coherency
User mode dispatch
Architected queuing language
Scheduling and context switching
HSA Intermediate Language (HSAIL)
High level language support for GPU compute processors
* All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
HSA SPECIFICATIONS
HSA System Architecture Specification
Version 1.0 Provisional, Released April 2014
Defines discovery, memory model, queue management, atomics, etc
HSA Programmers Reference Specification
Version 1.0 Provisional, Released June 2014
Defines the HSAIL language and object format
HSA Runtime Software Specification
Version 1.0 Provisional, expected to be released in July 2014
Defines the APIs through which an HSA application uses the platform
All released specifications can be found at the HSA Foundation web site:
www.hsafoundation.com/standards
HSA - AN OPEN PLATFORM
Open Architecture, membership open to all
HSA Programmers Reference Manual
HSA System Architecture
HSA Runtime
Delivered via royalty free standards
Royalty Free IP, Specifications and APIs
ISA agnostic for both CPU and GPU
Membership from all areas of computing
Hardware companies
Operating Systems
Tools and Middleware
Applications
Universities
HSA INTERMEDIATE LAYER — HSAIL
HSAIL is a virtual ISA for parallel programs
Finalized to ISA by a JIT compiler or “Finalizer”
ISA independent by design for CPU & GPU
Explicitly parallel
Designed for data parallel programming
Support for exceptions, virtual functions,
and other high level language features
Lower level than OpenCL SPIR
Fits naturally in the OpenCL compilation stack
Suitable to support additional high level languages and programming models:
Java, C++, OpenMP, Python, etc.
HSA MEMORY MODEL
Defines visibility ordering between all
threads in the HSA System
Designed to be compatible with
C++11, Java, OpenCL and .NET
Memory Models
Relaxed consistency memory model
for parallel compute performance
Visibility controlled by:
Load.Acquire
Store.Release
Fences
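The Load.Acquire / Store.Release controls above map naturally onto C11 acquire/release atomics. Below is a minimal CPU-side sketch of that pairing in plain C with pthreads; it is not HSA runtime code, and `handoff_demo` and its variable names are illustrative only:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative only (not HSA API): the HSA memory model's
 * Load.Acquire / Store.Release pairing behaves like C11
 * acquire/release atomics, demonstrated here on a plain CPU. */

static int payload;             /* ordinary (non-atomic) data   */
static atomic_int ready = 0;    /* release/acquire synchronizer */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                           /* plain store   */
    atomic_store_explicit(&ready, 1, memory_order_release); /* Store.Release */
    return NULL;
}

/* Spawns a producer, then spins with an acquire load until the flag is
 * set; the acquire load guarantees the payload store is visible after. */
int handoff_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                   /* Load.Acquire  */
    pthread_join(t, NULL);
    return payload;
}
```

The release store and acquire load form the same happens-before edge the slide's relaxed-consistency model relies on.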
HSA QUEUING MODEL
User mode queuing for low latency dispatch
Application dispatches directly
No OS or driver required in the dispatch path
Architected Queuing Layer
Single compute dispatch path for all hardware
No driver translation, direct to hardware
Allows for dispatch to queue from any agent
CPU or GPU
GPU self enqueue enables lots of solutions
Recursion
Tree traversal
Wavefront reforming
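The user-mode dispatch path above can be pictured as a ring buffer that the application writes directly, with no OS call in the path. A hypothetical sketch follows; this is not the real AQL packet format, and all names (`user_queue_t`, `queue_dispatch`, etc.) are illustrative:

```c
#include <stdint.h>

/* Hypothetical user-mode dispatch queue in the spirit of the
 * Architected Queuing Layer: a power-of-two ring buffer indexed by
 * monotonically increasing read/write counters. */

#define QUEUE_SIZE 8  /* must be a power of two */

typedef struct { uint64_t kernel_object; uint64_t kernarg; } packet_t;

typedef struct {
    packet_t ring[QUEUE_SIZE];
    uint64_t write_index;  /* bumped by the producer (application) */
    uint64_t read_index;   /* bumped by the consumer (packet processor) */
} user_queue_t;

/* Producer side: returns the slot written, or -1 if the ring is full. */
int queue_dispatch(user_queue_t *q, packet_t p) {
    if (q->write_index - q->read_index >= QUEUE_SIZE)
        return -1;                        /* ring full           */
    uint64_t slot = q->write_index % QUEUE_SIZE;
    q->ring[slot] = p;
    q->write_index++;                     /* publish the packet  */
    return (int)slot;
}

/* Consumer side: pops the oldest packet; 0 on success, -1 if empty. */
int queue_pop(user_queue_t *q, packet_t *out) {
    if (q->read_index == q->write_index)
        return -1;                        /* ring empty          */
    *out = q->ring[q->read_index % QUEUE_SIZE];
    q->read_index++;
    return 0;
}
```

Because both sides are plain memory operations, any agent (CPU or GPU) that can see the ring can enqueue, which is what enables GPU self-enqueue.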
HSA SOFTWARE
(Diagram: evolution from today’s driver stack, with apps over domain libraries, OpenCL™/DX runtimes, user mode drivers, and a graphics kernel mode driver, to the HSA software stack, with apps over task queuing libraries, HSA domain libraries and the OpenCL™ 2.x runtime, the HSA runtime, the HSA JIT, and an HSA kernel mode driver, all on the hardware (APUs, CPUs, GPUs). User mode and kernel mode components are distinguished; some components are contributed by third parties.)
EVOLUTION OF THE SOFTWARE STACK
OPENCL™ AND HSA
HSA is an optimized platform architecture
for OpenCL
Not an alternative to OpenCL
OpenCL on HSA will benefit from
Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU
OpenCL 2.0 leverages HSA Features
Shared Virtual Memory
Platform Atomics
ADDITIONAL LANGUAGES ON HSA
In development
Language | Body | More Information
Java | Sumatra OpenJDK | http://openjdk.java.net/projects/sumatra/
LLVM | LLVM | Code generator for HSAIL
C++ AMP | Multicoreware | https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
OpenMP, GCC | AMD, SUSE | https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425
SUMATRA PROJECT OVERVIEW
AMD/Oracle sponsored Open Source (OpenJDK) project
Targeted at Java 9 (2015 release)
Allows developers to efficiently represent data parallel algorithms in
Java
Sumatra ‘repurposes’ Java 8’s multi-core Stream/Lambda APIs to enable either CPU or GPU computing
At runtime, Sumatra enabled Java Virtual Machine (JVM) will dispatch
‘selected’ constructs to available HSA enabled devices
Developers of Java libraries are already refactoring their library code to
use these same constructs
So developers using existing libraries should see GPU acceleration
without any code changes
http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/
(Diagram: at development time, Application.java is compiled by the Java Compiler to Application.class; at runtime, the Sumatra-enabled JVM runs the application’s Lambda/Stream API code on the CPU ISA or, through the HSA Finalizer, on the GPU ISA.)
HSA OPEN SOURCE SOFTWARE
HSA will feature an open source Linux execution and compilation stack
Allows a single shared implementation for many components
Enables university research and collaboration in all areas
Because it’s the right thing to do
Component Name IHV or Common Rationale
HSA Bolt Library Common Enable understanding and debug
HSAIL Code Generator Common Enable research
LLVM Contributions Common Industry and academic collaboration
HSAIL Assembler Common Enable understanding and debug
HSA Runtime Common Standardize on a single runtime
HSA Finalizer IHV Enable research and debug
HSA Kernel Driver IHV For inclusion in linux distros
WORKLOAD EXAMPLE
SUFFIX ARRAY CONSTRUCTION: CLOUD SERVER WORKLOAD
SUFFIX ARRAYS
Suffix Arrays are a fundamental data structure
Designed for efficient searching of a large text
Quickly locate every occurrence of a substring S in a text T
Suffix Arrays are used to accelerate in-memory cloud workloads
Full text index search
Lossless data compression
Bio-informatics
ACCELERATED SUFFIX ARRAY
CONSTRUCTION ON HSA
M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, Submitted to ”Principles and Practice of Parallel Programming, (PPoPP’13)” February 2013.
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM
By offloading data parallel computations to
GPU, HSA increases performance and
reduces energy for Suffix Array
Construction.
By efficiently sharing data between CPU and
GPU, HSA lets us move compute to data
without penalty of intermediate copies.
+5.8x increased performance
5x decreased energy

(Chart: Skew Algorithm for Compute SA, staged as Merge Sort::GPU, Radix Sort::GPU, Compute SA::CPU, Lexical Rank::CPU, Radix Sort::GPU.)
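As a point of reference for what the pipeline above computes, here is a deliberately naive suffix-array builder: sequential and O(n² log n), standing in for the parallel skew (DC3) algorithm the slide describes. All names are illustrative:

```c
#include <stdlib.h>
#include <string.h>

/* Simplified illustration only: build a suffix array by sorting suffix
 * start offsets with strcmp. The GPU skew algorithm above produces the
 * same array far more efficiently. */

static const char *sa_text;  /* text shared with the qsort comparator */

static int suffix_cmp(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    return strcmp(sa_text + i, sa_text + j);  /* compare whole suffixes */
}

/* Fills sa[0..n-1] with suffix start offsets in lexicographic order. */
void build_suffix_array(const char *text, int *sa) {
    int n = (int)strlen(text);
    for (int i = 0; i < n; i++)
        sa[i] = i;                       /* one entry per suffix */
    sa_text = text;
    qsort(sa, n, sizeof(int), suffix_cmp);
}
```

For "banana" the sorted suffixes are "a", "ana", "anana", "banana", "na", "nana", so the array is {5, 3, 1, 0, 4, 2}; a binary search over this array then locates any substring.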
EASE OF PROGRAMMING: CODE COMPLEXITY VS. PERFORMANCE
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
(Chart: lines of code broken down into Init, Compile, Copy, Launch, Algorithm, and Copy-back, plotted against performance, for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt, on an exemplary ISV “Hessian” kernel.)
THE HSA FUTURE
Architected heterogeneous processing on the SOC
Programming of accelerators becomes much easier
Accelerated software that runs across multiple hardware vendors
Scalability from smart phones to super computers on a common architecture
GPU acceleration of parallel processing is the initial target, with DSPs
and other accelerators coming to the HSA system architecture model
Heterogeneous software ecosystem evolves at a much faster pace
Lower power, more capable devices in your hand, on the wall, in the cloud
JOIN US!
WWW.HSAFOUNDATION.COM
HETEROGENEOUS SYSTEM
ARCHITECTURE (HSA): HSAIL VIRTUAL
PARALLEL ISA
BEN SANDER, AMD
TOPICS
Introduction and Motivation
HSAIL – what makes it special?
HSAIL Execution Model
How to program in HSAIL?
Conclusion
STATE OF GPU COMPUTING
Today’s Challenges
Separate address spaces
Copies
Can’t share pointers
New language required for compute kernel
EX: OpenCL™ runtime API
Compute kernel compiled separately from host code
Emerging Solution
HSA Hardware
Single address space
Coherent
Virtual
Fast access from all components
Can share pointers
Bring GPU computing to existing, popular programming models
Single-source, fully supported by compiler
HSAIL compiler IR (Cross-platform!)
• GPUs are fast and power efficient: high compute density per-mm and per-watt
• But: Can be hard to program
THE PORTABILITY CHALLENGE
CPU ISAs
ISA innovations added incrementally (e.g. NEON, AVX, etc.)
ISA retains backwards-compatibility with previous generation
Two dominant instruction-set architectures: ARM and x86
GPU ISAs
Massive diversity of architectures in the market
Each vendor has its own ISA - and often several in the market at the same time
No commitment (or attempt!) to provide any backwards compatibility
Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction
HSAIL : WHAT MAKES IT SPECIAL?
WHAT IS HSAIL?
Intermediate language for parallel compute in HSA
Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)
Expresses parallel regions of code
Binary format of HSAIL is called “BRIG”
Goal: Bring parallel acceleration to mainstream programming languages
main() {
  …
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
  }
  …
}

(Flow: High-Level Compiler → BRIG → Finalizer → Component ISA; the High-Level Compiler also emits the Host ISA.)
KEY HSAIL FEATURES
Parallel
Shared virtual memory
Portable across vendors in HSA Foundation
Stable across multiple product generations
Consistent numerical results (IEEE-754 with defined min accuracy)
Fast, robust, simple finalization step (no monthly updates)
Good performance (little need to write in ISA)
Supports all of OpenCL™
Supports Java, C++, and other languages as well
HSAIL INSTRUCTION SET - OVERVIEW
Similar to assembly language for a RISC CPU
Load-store architecture
Destination register first, then source registers
140 opcodes (Java™ bytecode has 200)
Floating point (single, double, half (f16))
Integer (32-bit, 64-bit)
Some packed operations
Branches
Function calls
Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
Synchronize host CPU and HSA Component!
Text and Binary formats (“BRIG”)
ld_global_u64 $d0, [$d6 + 120] ; $d0 = load($d6+120)
add_u64 $d1, $d0, 24 ; $d1 = $d0 + 24
SEGMENTS AND MEMORY (1/2)
7 segments of memory
global, readonly, group, spill, private, arg, kernarg
Memory instructions can (optionally) specify a segment
Control data sharing properties and communicate intent
Global Segment
Visible to all HSA agents (including host CPU)
Group Segment
Provides high-performance memory shared in the work-group.
Group memory can be read and written by any work-item in the work-group
HSAIL provides sync operations to control visibility of group memory
ld_global_u64 $d0,[$d6]
ld_group_u64 $d0,[$d6+24]
st_spill_f32 $s1,[$d6+4]
SEGMENTS AND MEMORY (2/2)
Spill, Private, Arg Segments
Represent different regions of a per-work-item stack
Typically generated by compiler, not specified by programmer
Compiler can use these to convey intent – e.g. spills
Kernarg Segment
Programmer writes kernarg segment to pass arguments to a kernel
Read-Only Segment
Remains constant during execution of kernel
FLAT ADDRESSING
Each segment mapped into virtual address space
Flat addresses can map to segments based on virtual address
Instructions with no explicit segment use flat addressing
Very useful for high-level language support (e.g. classes, libraries)
Aligns well with OpenCL 2.0 “generic” addressing feature
ld_global_u64 $d6, [%_arg0] ; global
ld_u64 $d0,[$d6+24] ; flat
REGISTERS
Four classes of registers:
S: 32-bit, Single-precision FP or Int
D: 64-bit, Double-precision FP or Long Int
Q: 128-bit, Packed data.
C: 1-bit, Control Registers (Compares)
Fixed number of registers
S, D, Q share a single pool of resources
S + 2*D + 4*Q <= 128
Up to 128 S or 64 D or 32 Q (or a blend)
Register allocation done in high-level compiler
Finalizer doesn’t perform expensive register allocation
(Diagram: register file layout. Control registers c0–c7 stand alone; in the shared pool, each q register overlays two d registers and four s registers, e.g. q0 = d0–d1 = s0–s3, continuing up to q31 = d62–d63 = s124–s127.)
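The shared-pool rule above (S + 2*D + 4*Q <= 128) can be captured in a one-line helper; a trivial sketch with an illustrative name:

```c
/* Sketch of the HSAIL register budget rule from the slide: S, D and Q
 * registers share one pool, where a D costs two slots and a Q costs
 * four, and the pool holds 128 S-sized slots in total. */

/* Returns 1 if an (s, d, q) register mix fits in the shared pool. */
int hsail_regs_fit(int s, int d, int q) {
    return s + 2 * d + 4 * q <= 128;
}
```

This is why the extremes quoted on the slide (128 S, or 64 D, or 32 Q) all just fit, and any blend in between is checked the same way.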
SIMT EXECUTION MODEL
HSAIL presents a “SIMT” execution model to the programmer
“Single Instruction, Multiple Thread”
Programmer writes program for a single thread of execution
Each work-item appears to have its own program counter
Branch instructions look natural
Hardware Implementation
Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
Actually one program counter for the entire SIMD instruction
Branches implemented with predication
SIMT Advantages
Easier to program (branch code in particular)
Natural path for mainstream programming models and existing compilers
Scales across a wide variety of hardware (programmer doesn’t see vector width)
Cross-lane operations available for those who want peak performance
WAVEFRONTS
Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”
Lanes in wavefront can be “active” or “inactive”
Inactive lanes consume hardware resources but don’t do useful work
Tradeoffs:
“Wavefront-aware” programming can be useful for peak performance
But results in less portable code (since wavefront width is encoded in the algorithm)
if (cond) {
operationA; // cond=True lanes active here
} else {
operationB; // cond=False lanes active here
}
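The if/else example above can be mimicked on a CPU by running both sides of the branch over every lane under a predicate mask, which is how predication executes it on SIMD hardware. A hypothetical sketch (the function name and the 8-lane wavefront width are illustrative):

```c
/* Hypothetical SIMT sketch: one "program counter" walks both sides of
 * a branch across the whole wavefront, and a per-lane predicate mask
 * decides which lanes commit results, as in the slide's if/else. */

#define WAVEFRONT 8  /* lanes per wavefront (illustrative width) */

/* out[i] = cond[i] ? a[i] + 1 : a[i] - 1, executed predication-style. */
void simt_branch(const int *cond, const int *a, int *out) {
    /* "then" side: only cond=true lanes are active */
    for (int lane = 0; lane < WAVEFRONT; lane++)
        if (cond[lane]) out[lane] = a[lane] + 1;   /* operationA */
    /* "else" side: only cond=false lanes are active */
    for (int lane = 0; lane < WAVEFRONT; lane++)
        if (!cond[lane]) out[lane] = a[lane] - 1;  /* operationB */
}
```

Note both loops always run: the inactive lanes in each pass are the "consume hardware resources but don't do useful work" cost the slide mentions.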
CROSS-LANE OPERATIONS
Example HSAIL cross-lane operation: “activelaneid”
Dest set to count of earlier work-items that are active for this instruction
Useful for compaction algorithms
Example HSAIL cross-lane operation: “activelaneshuffle”
Each workitem reads value from another lane in the wavefront
Supports selection of “identity” element for inactive lanes
Useful for wavefront-level reductions

activelaneid_u32 $s0
activelaneshuffle_b32 $s0, $s1, $s2, 0, 0 // s0 = dest, s1 = source, s2 = lane select, no identity
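The activelaneid semantics described above (each active lane receives the count of earlier active work-items) can be emulated sequentially; a hypothetical sketch, with illustrative names and lane count:

```c
/* Hypothetical emulation of HSAIL's "activelaneid" cross-lane
 * operation: each active lane gets the count of lower-numbered active
 * lanes, which is exactly the output index needed for compaction. */

#define LANES 8  /* illustrative wavefront width */

/* ids[i] is meaningful only where active[i] != 0. */
void activelaneid(const int *active, int *ids) {
    int count = 0;
    for (int lane = 0; lane < LANES; lane++) {
        if (active[lane])
            ids[lane] = count++;  /* earlier active lanes seen so far */
        else
            ids[lane] = -1;       /* inactive lanes get no id here    */
    }
}
```

In a compaction kernel, each surviving element is simply stored at its activelaneid, producing a dense output with no gaps.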
HSAIL MODES
Working group strived to limit optional modes and features in HSAIL
Minimize differences between HSA target machines
Better for compiler vendors and application developers
Two modes survived
Machine Models
Small: 32-bit pointers, 32-bit data
Large: 64-bit pointers, 32-bit or 64-bit data
Vendors can support one or both models
“Base” and “Full” Profiles
Two sets of requirements for FP accuracy, rounding, exception reporting, hard preemption
HSA PROFILES

Feature | Base | Full
Addressing Modes | Small, Large | Small, Large
All 32-bit HSAIL operations according to the declared profile | Yes | Yes
F16 support (IEEE 754 or better) | Yes | Yes
F64 support | No | Yes
Precision for add/sub/mul | 1/2 ULP | 1/2 ULP
Precision for div | 2.5 ULP | 1/2 ULP
Precision for sqrt | 1 ULP | 1/2 ULP
HSAIL Rounding: Near | Yes | Yes
HSAIL Rounding: Up / Down / Zero | No | Yes
Subnormal floating-point | Flush-to-zero | Supported
Propagate NaN Payloads | No | Yes
FMA | Yes | Yes
Arithmetic Exception reporting | None | DETECT or BREAK
Debug trap | Yes | Yes
Hard Preemption | No | Yes
HSA PARALLEL EXECUTION
MODEL
HSA PARALLEL EXECUTION MODEL

Basic Idea:
Programmer supplies an HSAIL “kernel” that is run on each work-item.
Kernel is written as a single thread of execution.
Programmer specifies grid dimensions (scope of problem) when launching the kernel.
Each work-item has a unique coordinate in the grid.
Programmer optionally specifies work-group dimensions (for optimized communication).
CONVOLUTION / SOBEL EDGE FILTER

Gx = [ -1 0 +1 ]
     [ -2 0 +2 ]
     [ -1 0 +1 ]

Gy = [ -1 -2 -1 ]
     [  0  0  0 ]
     [ +1 +2 +1 ]

G = sqrt(Gx² + Gy²)

(Diagram, built up over three slides: the kernel runs once per work-item of a 2D grid covering the image; the work-items are grouped into 2D work-groups.)
HOW TO PROGRAM HSA?
WHAT DO I TYPE?
HSA PROGRAMMING MODELS : CORE PRINCIPLES
Single source
Host and device code side-by-side in same source file
Written in same programming language
Single unified coherent address space
Freely share pointers between host and device
Similar memory model as multi-core CPU
Parallel regions identified with existing language syntax
Typically same syntax used for multi-core CPU
HSAIL is the compiler IR that supports these programming models
GCC OPENMP : COMPILATION FLOW
SUSE GCC Project
Adding HSAIL code generator to GCC compiler infrastructure
Supports OpenMP 3.1 syntax
No data movement directives required!

main() {
  …
  // Host code.
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
  }
  …
}

(Flow: GCC OpenMP Compiler → BRIG → Finalizer → Component ISA; the compiler also emits the Host ISA.)
GCC OpenMP flow

C/C++/Fortran OpenMP application, e.g.:
#pragma omp for
for (j = 0; j < n; j++) { b[j] = a[j]; }

Compile time (GNU Compiler, GCC):
Compiles host code and emits runtime calls with kernel name, parameters, and launch attributes
Lowers OpenMP directives, converts GIMPLE to BRIG, and embeds BRIG into the host code

Run time:
Pragmas map to calls into the HSA Runtime
Kernels are finalized from BRIG to ISA once and cached
Kernels are dispatched to the GPU
MCW C++AMP : COMPILATION FLOW
C++AMP : Single-source C++ template parallel programming model
MCW compiler based on CLANG/LLVM
Open-source and runs on Linux
Leverage open-source LLVM->HSAIL code generator
main() {
  …
  parallel_for_each(grid<1>(extent<256>(…)), …);
  …
}

(Flow: C++AMP Compiler → BRIG → Finalizer → Component ISA; the compiler also emits the Host ISA.)
JAVA: RUNTIME FLOW
JAVA 8 – HSA ENABLED APARAPI
Java 8 brings Stream + Lambda API
More natural way of expressing data parallel algorithms
Initially targeted at multi-core

APARAPI will:
Support Java 8 Lambdas
Dispatch code to HSA enabled devices at runtime via HSAIL

(Diagram: Java Application → APARAPI + Lambda API → JVM → HSA Finalizer & Runtime → CPU and GPU)
Future Java – HSA ENABLED JAVA (SUMATRA)
Adds native GPU acceleration to Java Virtual Machine (JVM)
Developer uses JDK Lambda, Stream API
JVM uses GRAAL compiler to generate HSAIL
(Diagram: Java Application → Java JDK Stream + Lambda API → JVM with Java GRAAL JIT backend → HSA Finalizer & Runtime → CPU and GPU)
AN EXAMPLE (IN JAVA 8)
//Example computes the percentage of total scores achieved by each player on a team.
class Player {
private Team team; // Note: Reference to the parent Team.
private int scores;
private float pctOfTeamScores;
public Team getTeam() {return team;}
public int getScores() {return scores;}
public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
};
// “Team” class not shown
// Assume “allPlayers” is an initialized array of Players.
Arrays.stream(allPlayers). // wrap the array in a stream
parallel(). // developer indication that lambda is thread-safe
forEach(p -> {
int teamScores = p.getTeam().getScores();
float pctOfTeamScores = (float)p.getScores()/(float) teamScores;
p.setPctOfTeamScores(pctOfTeamScores);
});
HSAIL CODE EXAMPLE
version 0:95: $full : $large;
// static method HotSpotMethod<Main.lambda$2(Player)>
kernel &run (
    kernarg_u64 %_arg0 // Kernel signature for lambda method
) {
    ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register
    workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord

    cvt_u64_s32 $d2, $s2; // Convert X gid to long
    mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref
    add_u64 $d2, $d2, 24; // Adjust for actual elements start
    add_u64 $d2, $d2, $d6; // Add to array ref ptr
    ld_global_u64 $d6, [$d2]; // Load from array element into reg
@L0:
    ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()
    mov_b64 $d3, $d0;
    ld_global_s32 $s3, [$d6 + 40]; // p.getScores()
    cvt_f32_s32 $s16, $s3;
    ld_global_s32 $s0, [$d0 + 24]; // Team getScores()
    cvt_f32_s32 $s17, $s0;
    div_f32 $s16, $s16, $s17; // p.getScores()/teamScores
    st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
    ret;
};
HOW TO PROGRAM HSA?
OTHER PROGRAMMING TOOLS
HSAIL ASSEMBLER
kernel &run (kernarg_u64 %_arg0)
{
ld_kernarg_u64 $d6, [%_arg0];
workitemabsid_u32 $s2, 0;
cvt_u64_s32 $d2, $s2;
mul_u64 $d2, $d2, 8;
add_u64 $d2, $d2, 24;
add_u64 $d2, $d2, $d6;
ld_global_u64 $d6, [$d2];
. . .
(Flow: HSAIL text → HSAIL Assembler → BRIG → Finalizer → Machine ISA)

• HSAIL has a text format and an assembler
OPENCL™ OFFLINE COMPILER (CLOC)
__kernel void vec_add(
__global const float *a,
__global const float *b,
__global float *c,
const unsigned int n)
{
int id = get_global_id(0);
// Bounds check
if (id < n)
c[id] = a[id] + b[id];
}
(Flow: OpenCL kernel → CLOC → BRIG → Finalizer → Machine ISA)
•OpenCL split-source model cleanly isolates kernel
•Can express many HSAIL features in OpenCL Kernel Language
•Higher productivity than writing in HSAIL assembly
•Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware)
•Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model
KEY TAKEAWAYS
HSAIL
Thin, robust, fast finalizer
Portable (multiple HW vendors and parallel architectures)
Supports shared virtual memory and platform atomics
HSA brings GPU computing to mainstream programming models
Shared and coherent memory bridges “faraway accelerator” gap
HSAIL provides the common IL for high-level languages to benefit from
parallel computing
Languages and Compilers
HSAIL support in GCC, LLVM, Java JVM
Leverage same language syntax designed for multi-core CPUs
Can use pointer-containing data structures
HSA RUNTIME
YEH-CHING CHUNG, NATIONAL TSING HUA UNIVERSITY
OUTLINE
Introduction
HSA Core Runtime API (Pre-release 1.0 provisional)
Initialization and Shut Down
Notifications (Synchronous/Asynchronous)
Agent Information
Signals and Synchronization (Memory-Based)
Queues and Architected Dispatch
Summary
INTRODUCTION (1)
The HSA core runtime is a thin, user-mode API that provides the interface necessary for
the host to launch compute kernels to the available HSA components.
The overall goal of the HSA core runtime design is to provide a high-performance dispatch
mechanism that is portable across multiple HSA vendor architectures.
The dispatch mechanism differentiates the HSA runtime from other language runtimes by
architected argument setting and kernel launching at the hardware and specification level.
The HSA core runtime API is standard across all HSA vendors, such that languages which use the
HSA runtime can run on different vendor’s platforms that support the API.
The implementation of the HSA runtime may include kernel-level components (required for some hardware components, e.g. AMD Kaveri) or may be entirely user-space (for example, simulators or CPU implementations).
(Diagram: without the HSA runtime, each vendor (Vendor 1 … Vendor m) exposes its Components 1..N through its own driver; with the HSA runtime, each HSA vendor’s Components 1..N sit behind the common HSA Runtime and HSA Finalizer.)
INTRODUCTION (2)
Programming Model → Language Runtime
(Diagram: OpenCL, Java, OpenMP, and DSL applications over their respective language runtimes, shown both without the HSA runtime and with the HSA runtime underneath.)
INTRODUCTION (3)
(Diagram: program flow through the OpenCL runtime vs. the HSA runtime, from Start Program to Exit Program: platform, device, and context initialization vs. HSA runtime initialization and topology discovery; kernel build vs. HSAIL finalization and linking; SVM allocation and kernel-argument setting vs. HSA memory allocation; command-queue enqueue vs. enqueue of a dispatch packet; resource deallocation vs. HSA runtime close.)
INTRODUCTION (4)
HSA Platform System Architecture Specification support
Runtime initialization and shutdown
Notifications (synchronous/asynchronous)
Agent information
Signals and synchronization (memory-based)
Queues and Architected dispatch
Memory management
HSAIL support
Finalization, linking, and debugging
Image and Sampler support
(Diagram: HSA Runtime flow: HSA Runtime Initialization and Topology Discovery → HSAIL Finalization and Linking → HSA Memory Allocation → Enqueue Dispatch Packet → HSA Runtime Close.)
RUNTIME INITIALIZATION AND
SHUTDOWN
OUTLINE
Runtime Initialization API
hsa_init
Runtime Shut Down API
hsa_shut_down
Examples
HSA RUNTIME INITIALIZATION
When the API is invoked for the first time in a given process, a runtime
instance is created.
A typical runtime instance may contain information about the platform, topology, reference
count, queues, signals, etc.
The API can be called multiple times by applications.
Only a single runtime instance will exist for a given process.
Whenever the API is invoked, the reference count is increased by one.
HSA RUNTIME SHUT DOWN
When the API is invoked, the reference count is decreased by 1.
When the reference count reaches zero:
All the resources associated with the runtime instance (queues, signals, topology
information, etc.) are considered invalid, and any attempt to reference them in
subsequent API calls results in undefined behavior.
The user might call hsa_init to initialize the HSA runtime again.
The HSA runtime might release resources associated with it.
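The init/shutdown behavior above can be sketched as a reference-counted pair. This is a toy mock, not the HSA runtime implementation: the names hsa_init, hsa_shut_down, and HSA_STATUS_SUCCESS follow the slides, while runtime_instance_t, ref_count, and the helper accessors are invented here for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { HSA_STATUS_SUCCESS = 0, HSA_STATUS_ERROR = 1 } hsa_status_t;

typedef struct {
    int ref_count;    /* hsa_init calls not yet matched by hsa_shut_down */
    int initialized;  /* stands in for queues, signals, topology, ... */
} runtime_instance_t;

static runtime_instance_t runtime = {0, 0};

hsa_status_t hsa_init(void) {
    if (runtime.ref_count == 0) {
        runtime.initialized = 1;   /* allocate resources, build topology table */
    }
    runtime.ref_count++;           /* every call bumps the reference count */
    return HSA_STATUS_SUCCESS;
}

hsa_status_t hsa_shut_down(void) {
    if (runtime.ref_count == 0)
        return HSA_STATUS_ERROR;   /* shutdown without a matching init */
    runtime.ref_count--;
    if (runtime.ref_count == 0) {
        runtime.initialized = 0;   /* release queues, signals, topology, ... */
    }
    return HSA_STATUS_SUCCESS;
}

int runtime_ref_count(void) { return runtime.ref_count; }
int runtime_is_live(void)   { return runtime.initialized; }
```

Only the last matching hsa_shut_down actually releases resources; earlier calls merely decrement the count.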
EXAMPLE – RUNTIME INITIALIZATION (1)
Data structure for the runtime instance
If hsa_init is called more than once, increase the ref_count by 1
EXAMPLE – RUNTIME INITIALIZATION (2)
When hsa_init is called the first time: allocate resources and set the reference count
Get the number of HSA agents
Initialize the agents
Create an empty agent list
If initialization fails, release resources
Create the topology table
EXAMPLE - RUNTIME INSTANCE (1)
Platform Name: Generic (Agent: 2, Memory: 1, Cache: 1)
Agent-0: node_id 0, id 0, type CPU, vendor Generic, name Generic, wavefront_size 0, queue_size 200, group_memory 0, fbarrier_max_count 1, is_pic_supported 0
Agent-1: node_id 0, id 0, type GPU, vendor Generic, name Generic, wavefront_size 64, queue_size 200, group_memory 64, fbarrier_max_count 1, is_pic_supported 1
Memory: node_id 0, id 0, segment_type 111111, address_base 0x0001, size 2048 MB, peak_bandwidth 6553.6 mbps
Cache: node_id 0, id 0, levels 1, associativity 1, cache size 64KB, cache line size 4, is_inclusive 1
EXAMPLE - RUNTIME INSTANCE (2)
Platform Header File: base_address = 0x00001, size = 2048, system_timestamp_frequency_mhz = 200, signal_maximum_wait = 1/200, node_id list (no_nodes = 1), agent_list (no_agents = 2), memory_descriptor_list (no_memory_descriptors = 1), cache_descriptor_list (no_cache_descriptors = 1)
Agent-0: node_id = 0, id = 0, agent_type = 1 (CPU), vendor[16] = Generic, name[16] = Generic, wavefront_size = 0, queue_size = 200, group_memory_size_bytes = 0, fbarrier_max_count = 1, is_pic_supported = 0
Agent-1: node_id = 0, id = 0, agent_type = 2 (GPU), vendor[16] = Generic, name[16] = Generic, wavefront_size = 64, queue_size = 200, group_memory_size_bytes = 64, fbarrier_max_count = 1, is_pic_supported = 1
Memory: node_id = 0, id = 0, supported_segment_type_mask = 111111, virtual_address_base = 0x0001, size_in_bytes = 2048MB, peak_bandwidth_mbps = 6553.6
Cache: node_id = 0, id = 0, levels = 1, with NULL-terminated descriptor lists for associativity (1), cache_size (64KB), cache_line_size (4), and is_inclusive (1)
EXAMPLE – RUNTIME SHUT DOWN
Decrease the ref_count by 1; if it drops below 1, free the list and release the resources.
NOTIFICATIONS
(SYNCHRONOUS/ASYNCHRONOUS)
OUTLINE
Synchronous Notifications
hsa_status_t
hsa_status_string
Asynchronous Notifications
Example
SYNCHRONOUS NOTIFICATIONS
Notifications (errors, events, etc.) reported by the runtime can be synchronous or
asynchronous
The HSA runtime uses the return values of API functions to pass notifications
synchronously.
A status code is defined as an enumeration, hsa_status_t, to capture the return value
of any API function that has been executed, except accessors/mutators.
The notification is a status code that indicates success or error.
Success is represented by HSA_STATUS_SUCCESS, which is equivalent to zero.
An error status is assigned a positive integer and its identifier starts with the
HSA_STATUS_ERROR prefix.
The status code can help to determine the cause of the unsuccessful execution.
STATUS CODE QUERY
Query additional information on a status code
Parameters
status (input): Status code that the user is seeking more information on
status_string (output): An ISO/IEC 646 encoded English language string that potentially
describes the error status
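A minimal sketch of the synchronous-notification pattern: every API call returns an hsa_status_t, zero means success, and a non-zero code can be mapped to a human-readable string. The error codes beyond HSA_STATUS_SUCCESS and the my_status_string/check helpers are invented stand-ins here (the real API provides hsa_status_string for this purpose).

```c
#include <assert.h>
#include <string.h>

typedef enum {
    HSA_STATUS_SUCCESS = 0,                  /* from the slides: success is zero */
    HSA_STATUS_ERROR = 1,                    /* mock error codes for illustration */
    HSA_STATUS_ERROR_INVALID_ARGUMENT = 2
} hsa_status_t;

/* Invented stand-in for hsa_status_string: map a code to a description. */
const char *my_status_string(hsa_status_t status) {
    switch (status) {
    case HSA_STATUS_SUCCESS:                 return "success";
    case HSA_STATUS_ERROR:                   return "generic error";
    case HSA_STATUS_ERROR_INVALID_ARGUMENT:  return "invalid argument";
    default:                                 return "unknown status";
    }
}

/* Typical call-site pattern: treat any non-zero return as an error. */
const char *check(hsa_status_t status) {
    if (status != HSA_STATUS_SUCCESS)
        return my_status_string(status);     /* report and handle the error */
    return "ok";
}
```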
ASYNCHRONOUS NOTIFICATIONS
The runtime passes asynchronous notifications by calling user-defined
callbacks.
For instance, queues are a common source of asynchronous events because the
tasks queued by an application are asynchronously consumed by the packet
processor. Callbacks are associated with queues when they are created. When the
runtime detects an error in a queue, it invokes the callback associated with that
queue and passes it an error flag (indicating what happened) and a pointer to the
erroneous queue.
The HSA runtime does not implement any default callbacks.
A callback that blocks and does not return can leave the runtime in an undefined state.
EXAMPLE - CALLBACK
Pass the callback function when creating the queue
If the queue is empty, set the event and invoke the callback
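The callback mechanism described above can be sketched as follows. All of the types here (mock_queue_t, queue_error_callback_t, runtime_report_error) are invented stand-ins for the real queue-creation API, which takes the callback as an argument at queue-creation time.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { HSA_STATUS_SUCCESS = 0, HSA_STATUS_ERROR = 1 } hsa_status_t;

struct mock_queue;
typedef void (*queue_error_callback_t)(hsa_status_t error, struct mock_queue *q);

typedef struct mock_queue {
    queue_error_callback_t on_error;  /* registered when the queue is created */
    int error_count;                  /* user state updated by the callback */
} mock_queue_t;

/* Callback is associated with the queue at creation time. */
mock_queue_t queue_create(queue_error_callback_t cb) {
    mock_queue_t q = { cb, 0 };
    return q;
}

/* What the runtime side would do on detecting an error in the queue:
 * invoke the associated callback with an error flag and the queue pointer. */
void runtime_report_error(mock_queue_t *q, hsa_status_t error) {
    if (q->on_error)                  /* the runtime has no default callback */
        q->on_error(error, q);
}

/* User-defined callback; it must not block (runtime state becomes undefined). */
void count_errors(hsa_status_t error, mock_queue_t *q) {
    (void)error;
    q->error_count++;
}
```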
AGENT INFORMATION
OUTLINE
Agent information
hsa_node_t
hsa_agent_t
hsa_agent_info_t
hsa_component_feature_t
Agent Information manipulation APIs
hsa_iterate_agents
hsa_agent_get_info
Example
INTRODUCTION
The runtime exposes a list of agents that are available in the system.
An HSA agent is a hardware component that participates in the HSA memory model.
An HSA agent can submit AQL packets for execution.
An HSA agent may also, but is not required to, be an HSA component. It is possible for
a system to include HSA agents that are neither an HSA component nor a host CPU.
HSA agents are defined as opaque handles of type hsa_agent_t.
The HSA runtime provides APIs for applications to traverse the list of available
agents and query attributes of a particular agent.
AGENT INFORMATION (1)
Opaque agent handle
Opaque NUMA node handle
An HSA memory node is a node that delineates a set of
system components (host CPUs and HSA Components) with
“local” access to a set of memory resources attached to the
node's memory controller and appropriate HSA-compliant
access attributes.
AGENT INFORMATION (2)
Component features
An HSA component is a hardware or software component that can be a target of AQL queues
and conforms to the HSA memory model.
Values
HSA_COMPONENT_FEATURE_NONE = 0
No component capabilities. The device is an agent, but not a component.
HSA_COMPONENT_FEATURE_BASIC = 1
The component supports the HSAIL instruction set and all the AQL packet types except Agent
dispatch.
HSA_COMPONENT_FEATURE_ALL = 2
The component supports the HSAIL instruction set and all the AQL packet types.
AGENT INFORMATION (3)
Agent attributes
Values
HSA_AGENT_INFO_MAX_GRID_DIM
HSA_AGENT_INFO_MAX_WORKGROUP_DIM
HSA_AGENT_INFO_QUEUE_MAX_PACKETS
HSA_AGENT_INFO_CLOCK
HSA_AGENT_INFO_CLOCK_FREQUENCY
HSA_AGENT_INFO_MAX_SIGNAL_WAIT
HSA_AGENT_INFO_NAME
HSA_AGENT_INFO_NODE
HSA_AGENT_INFO_COMPONENT_FEATURES
HSA_AGENT_INFO_VENDOR_NAME
HSA_AGENT_INFO_WAVEFRONT_SIZE
HSA_AGENT_INFO_CACHE_SIZE
AGENT INFORMATION MANIPULATION (1)
Iterate over the available agents, and invoke an application-defined callback on
every iteration
If callback returns a status other than HSA_STATUS_SUCCESS for a particular
iteration, the traversal stops and the function returns that status value.
Parameters
callback (input): Callback to be invoked once per agent
data (input): Application data that is passed to callback on every iteration. Can be
NULL.
AGENT INFORMATION MANIPULATION (2)
Get the current value of an attribute for a given agent
Parameters
agent (input): A valid agent
attribute (input): Attribute to query
value (output): Pointer to a user-allocated buffer in which to store the value of the
attribute. If the buffer passed by the application is not large enough to hold the value
of the attribute, the behavior is undefined.
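The traversal-plus-query pattern can be sketched with mocks shaped after the example runtime instance earlier in this section (one CPU agent with wavefront_size 0, one GPU agent with wavefront_size 64). mock_iterate_agents and mock_agent_get_info are invented stand-ins for hsa_iterate_agents and hsa_agent_get_info, and the early-exit status code is invented too.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { HSA_STATUS_SUCCESS = 0, HSA_STATUS_INFO_BREAK = 1 } hsa_status_t;
typedef enum { AGENT_INFO_WAVEFRONT_SIZE, AGENT_INFO_QUEUE_MAX_PACKETS } agent_info_t;

typedef struct { const char *name; int wavefront_size; int queue_size; } mock_agent_t;

/* Mock topology: the CPU and GPU agents from the example runtime instance. */
static mock_agent_t agents[] = {
    { "CPU", 0, 200 },
    { "GPU", 64, 200 },
};

hsa_status_t mock_agent_get_info(const mock_agent_t *a, agent_info_t attr, void *value) {
    switch (attr) {
    case AGENT_INFO_WAVEFRONT_SIZE:    *(int *)value = a->wavefront_size; break;
    case AGENT_INFO_QUEUE_MAX_PACKETS: *(int *)value = a->queue_size;     break;
    }
    return HSA_STATUS_SUCCESS;
}

typedef hsa_status_t (*agent_cb_t)(mock_agent_t *agent, void *data);

/* Traversal stops early if the callback returns a non-success status. */
hsa_status_t mock_iterate_agents(agent_cb_t cb, void *data) {
    for (size_t i = 0; i < sizeof agents / sizeof agents[0]; ++i) {
        hsa_status_t s = cb(&agents[i], data);
        if (s != HSA_STATUS_SUCCESS) return s;
    }
    return HSA_STATUS_SUCCESS;
}

/* Example callback: remember the first agent with a non-zero wavefront size. */
hsa_status_t find_component(mock_agent_t *agent, void *data) {
    int wf;
    mock_agent_get_info(agent, AGENT_INFO_WAVEFRONT_SIZE, &wf);
    if (wf > 0) { *(mock_agent_t **)data = agent; return HSA_STATUS_INFO_BREAK; }
    return HSA_STATUS_SUCCESS;
}
```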
EXAMPLE - AGENT ATTRIBUTE QUERY
Copy agent attribute information
Get the agent handle of Agent 0
SIGNALS AND SYNCHRONIZATION
(MEMORY-BASED)
OUTLINE
Signal
Signal manipulation API
Create/Destroy
Query
Send
Atomic Operations
Signal wait
Get time out
Signal Condition
Example
SIGNAL (1)
HSA agents can communicate with each other by using coherent global memory,
or by using signals.
A signal is represented by an opaque signal handle.
A signal carries a value, which can be updated or conditionally waited upon via
an API call or HSAIL instruction.
The value occupies four or eight bytes depending on the machine model in use.
SIGNAL (2)
Updating the value of a signal is equivalent to sending the signal.
In addition to the update (store) of signals, the API for sending signals must
support other atomic operations with specific memory order semantics
Atomic operations: AND, OR, XOR, Add, Subtract, Exchange, and CAS
Memory order semantics: Release and Relaxed
SIGNAL CREATE/DESTROY
Create a signal
Parameters
initial_value (input): Initial value of the signal.
signal_handle (output): Signal handle.
Destroy a signal previously created by hsa_signal_create
Parameter
signal_handle (input): Signal handle.
SIGNAL LOAD/STORE
Atomically read the current signal value with acquire semantics
Atomically read the current signal value with relaxed semantics
Send and atomically set the value of a signal with release semantics
Send and atomically set the value of a signal with relaxed semantics
SIGNAL ADD/SUBTRACT
Send and atomically increment the value of a signal by a given amount with release semantics
Send and atomically increment the value of a signal by a given amount with relaxed semantics
Send and atomically decrement the value of a signal by a given amount with release semantics
Send and atomically decrement the value of a signal by a given amount with relaxed semantics
SIGNAL AND (OR, XOR)/EXCHANGE
Send and atomically perform a logical AND operation on the value of a signal and a given value with release semantics
Send and atomically perform a logical AND operation on the value of a signal and a given value with relaxed semantics
Send and atomically set the value of a signal and return its previous value with release semantics
Send and atomically set the value of a signal and return its previous value with relaxed semantics
SIGNAL WAIT (1)
The application may wait on a signal, with a condition specifying the terms of the
wait.
Signal wait condition operator
Values
HSA_EQ: The two operands are equal.
HSA_NE: The two operands are not equal.
HSA_LT: The first operand is less than the second operand.
HSA_GTE: The first operand is greater than or equal to the second operand.
SIGNAL WAIT (2)
The wait can be done either in the HSA component via an HSAIL wait instruction
or via a runtime API defined here.
Waiting on a signal returns the current value at the opaque signal object;
The wait may have a runtime defined timeout which indicates the maximum amount of time that an
implementation can spend waiting.
The signal infrastructure allows for multiple senders/waiters on a single signal.
Wait reads the value, hence acquire synchronizations may be applied.
SIGNAL WAIT (3)
Signal wait
Parameters
signal_handle (input): A signal handle
condition (input): Condition used to compare the passed and signal values
compare_value (input): Value to compare with
return_value (output): A pointer into which the current signal value is read
SIGNAL WAIT (4)
Signal wait with timeout
Parameters
signal_handle (input): A signal handle
timeout (input): Maximum wait duration (a value of zero indicates no maximum)
long_wait (input): Hint indicating that the signal value is not expected to meet the given condition in
a short period of time. The HSA runtime may use this hint to optimize the wait implementation.
condition (input): Condition used to compare the passed and signal values
compare_value (input): Value to compare with
return_value (output): A pointer into which the current signal value is read
EXAMPLE – SIGNAL WAIT (1)
[Timeline: the signal value is initially 0.
thread_1 calls hsa_signal_wait_timeout_acquire(value == 2) and is blocked.
thread_2 calls hsa_signal_add_relaxed with 3 (value = 3).
thread_2 calls hsa_signal_subtract_relaxed with 1 (value = 2).
The condition is satisfied: the wait returns the signal value and the execution of thread_1 continues.]
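The timeline above can be reproduced on a CPU with an illustrative toy signal built from a pthreads mutex and condition variable. Everything here (toy_signal_t, toy_signal_add, toy_signal_wait_eq, run_signal_example) is an invented stand-in that only approximates hsa_signal_add_relaxed and a wait with the HSA_EQ condition; it is not how a real runtime implements signals.

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    long value;
} toy_signal_t;

void toy_signal_init(toy_signal_t *s, long initial) {
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->value = initial;
}

/* ~ hsa_signal_add_relaxed: updating the value "sends" the signal. */
void toy_signal_add(toy_signal_t *s, long v) {
    pthread_mutex_lock(&s->lock);
    s->value += v;
    pthread_cond_broadcast(&s->cond);   /* wake any waiters */
    pthread_mutex_unlock(&s->lock);
}

/* ~ hsa_signal_wait with the HSA_EQ condition: block until value == expected,
 * then return the observed value. */
long toy_signal_wait_eq(toy_signal_t *s, long expected) {
    pthread_mutex_lock(&s->lock);
    while (s->value != expected)        /* re-check the condition on every wake */
        pthread_cond_wait(&s->cond, &s->lock);
    long v = s->value;
    pthread_mutex_unlock(&s->lock);
    return v;
}

static void *producer(void *arg) {      /* plays thread_2 from the timeline */
    toy_signal_t *s = arg;
    toy_signal_add(s, 3);               /* value: 0 -> 3 */
    toy_signal_add(s, -1);              /* value: 3 -> 2, satisfies the waiter */
    return NULL;
}

long run_signal_example(void) {
    toy_signal_t s;
    toy_signal_init(&s, 0);
    pthread_t t;
    pthread_create(&t, NULL, producer, &s);
    long v = toy_signal_wait_eq(&s, 2); /* thread_1 blocks until value == 2 */
    pthread_join(t, NULL);
    return v;
}
```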
EXAMPLE – SIGNAL WAIT (2)
If signal_handle is invalid, then return a signal-invalid status
Compare tmp->value with compare_value to see if the condition is satisfied
If timeout = 0, then return a signal-timeout status
Signal wait condition function
If the condition is satisfied, then return the signal value and status
QUEUES AND ARCHITECTED
DISPATCH
OUTLINE
Queues
Queue Types and Structure
HSA runtime API for Queue Manipulations
Architected Queuing Language (AQL) Support
Packet type
Packet header
Examples
Enqueue Packet
Packet Processor
INTRODUCTION (1)
An HSA-compliant platform supports the allocation of multiple user-level command queues.
A user-level command queue is characterized as runtime-allocated, user-level accessible virtual
memory of a certain size, containing packets defined in the Architected Queuing Language (AQL
packets).
Queues are allocated by HSA applications through the HSA runtime.
HSA software receives memory-based structures to configure the hardware queues,
allowing for efficient software management of the hardware queues of the HSA agents.
This queue memory shall be processed by the HSA Packet Processor as a ring buffer.
Queues are read-only data structures.
Writing values directly to a queue structure results in undefined behavior.
However, HSA agents can directly modify the contents of the buffer pointed to by base_address, or use
runtime APIs to access the doorbell signal or the service queue.
INTRODUCTION (2)
Two queue types, AQL and Service Queues, are supported
AQL Queues consume AQL packets, which specify the kernel functions that will be executed
on the HSA component
Service Queues consume agent dispatch packets, which specify runtime-defined or user-registered
functions that will be executed on the agent (typically, the host CPU)
INTRODUCTION (3)
[Figure: AQL queue structure, including base_address, the doorbell signal, and the service queue.]
INTRODUCTION (4)
In addition to the data held in the queue structure, the queue also defines two
properties (readIndex and writeIndex) that define the location of “head” and “tail”
of the queue.
readIndex: The read index is a 64-bit unsigned integer that specifies the packetID of
the next AQL packet to be consumed by the packet processor.
writeIndex: The write index is a 64-bit unsigned integer that specifies the packetID of
the next AQL packet slot to be allocated.
Neither index is directly exposed to the user, who can only access them through
dedicated HSA core runtime APIs.
The available index functions differ in the index of interest (read or write), the action to be
performed (addition, compare and swap, etc.), and the memory consistency model
(relaxed, release, etc.).
INTRODUCTION (5)
The read index is automatically advanced when a packet is read by the packet
processor.
When the packet processor observes that the read index matches the write index,
the queue can be considered empty; when the write index is greater than or equal to
the sum of the read index and the size of the queue, the queue is full.
The doorbell_signal field of a queue contains a signal that is used by the agent
to inform the packet processor to process the packets it writes.
The value signaled on the doorbell is equal to the ID of the packet that is ready to be
launched.
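The empty/full conditions above can be sketched with plain 64-bit indices. This toy queue_indices_t is an invented stand-in, but the comparisons mirror the rules just described: the indices increase monotonically and never wrap, and a packet ID maps to the ring-buffer slot packet ID modulo queue size.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t read_index;   /* packet ID of the next packet to be consumed */
    uint64_t write_index;  /* packet ID of the next slot to be allocated */
    uint64_t size;         /* number of packet slots in the ring buffer */
} queue_indices_t;

int queue_empty(const queue_indices_t *q) {
    return q->read_index == q->write_index;       /* read index matches write index */
}

int queue_full(const queue_indices_t *q) {
    return q->write_index >= q->read_index + q->size;
}

/* ~ hsa_queue_add_write_index_*: returns the previous write index, i.e. the
 * packet ID of the slot the caller just claimed. */
uint64_t add_write_index(queue_indices_t *q, uint64_t n) {
    uint64_t prev = q->write_index;
    q->write_index += n;
    return prev;
}

uint64_t slot_of(const queue_indices_t *q, uint64_t packet_id) {
    return packet_id % q->size;   /* ring-buffer position of a packet ID */
}
```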
INTRODUCTION (6)
The new task might be consumed by the packet processor even before the
doorbell signal has been signaled by the agent.
This is because the packet processor might be already processing some other
packets and observes that there is new work available, so it processes the new
packets.
In any case, the agent must ring the doorbell for every batch of packets it writes.
QUEUE CREATE/DESTROY
Create a user mode queue
When a queue is created, the runtime also
allocates the packet buffer and the completion
signal.
The application should only rely on the status
code returned to determine if the queue is valid
Destroy a user mode queue
A queue must not be accessed after being destroyed.
When a queue is destroyed, the state of the AQL packets
that have not yet been fully processed becomes undefined.
GET READ/WRITE INDEX
Atomically retrieve read index of a queue with
acquire semantics
Atomically retrieve write index of a queue with
acquire semantics
Atomically retrieve read index of a queue with
relaxed semantics
Atomically retrieve write index of a queue with
relaxed semantics
SET READ/WRITE INDEX
Atomically set the read index of a queue with
release semantics
Atomically set the read index of a queue with
relaxed semantics
Atomically set the write index of a queue with
release semantics
Atomically set the write index of a queue with
relaxed semantics
COMPARE AND SWAP WRITE INDEX
Atomically compare and set the write index of a queue with
acquire/release/relaxed/acquire-release semantics
Parameters
queue (input): A queue
expected (input): The expected index value
val (input): Value to copy to the write index if expected matches the observed write index
Return value
Previous value of the write index
ADD WRITE INDEX
Atomically increment the write index of a
queue by an offset with
release/acquire/relaxed/acquire-release
semantics
Parameters
queue (input): A queue
val (input): The value to add to the write index
Return value
Previous value of the write index
ARCHITECTED QUEUING LANGUAGE (AQL)
An HSA-compliant system provides a command interface for the dispatch of
HSA agent commands.
This command interface is provided by the Architected Queuing Language (AQL).
AQL allows HSA agents to build and enqueue their own command packets,
enabling fast and low-power dispatch.
AQL also provides support for HSA component queue submissions
The HSA component kernel can write commands in AQL format.
AQL PACKET (1)
AQL packet format
Values
Always reserved packet (0): Packet format is set to always reserved when the queue is initialized.
Invalid packet (1): Packet format is set to invalid when the readIndex is incremented, making the packet slot available to the HSA agents.
Dispatch packet (2): Dispatch packets contain jobs for the HSA component and are created by HSA agents.
Barrier packet (3): Barrier packets can be inserted by HSA agents to delay processing subsequent packets. All queues support barrier packets.
Agent dispatch packet (4): Agent dispatch packets contain jobs for the HSA agent and are created by HSA agents.
AQL PACKET (2)
HSA signaling object handle used to indicate completion of the job
EXAMPLE - ENQUEUE AQL PACKET (1)
An HSA agent submits a task to a queue by performing the following steps:
Allocate a packet slot (by incrementing the writeIndex)
Initialize the packet and copy packet to a queue associated with the Packet Processor
Mark packet as valid
Notify the Packet Processor of the packet (With doorbell signal)
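The four steps can be sketched against a toy queue. toy_queue_t and enqueue_dispatch are invented stand-ins, and only the packet format codes follow the AQL packet slide; this is not the real AQL packet layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Format codes from the AQL packet slide. */
enum { FMT_ALWAYS_RESERVED = 0, FMT_INVALID = 1, FMT_DISPATCH = 2 };

#define QUEUE_SIZE 4

typedef struct { int format; int kernel_id; } toy_packet_t;

typedef struct {
    toy_packet_t slots[QUEUE_SIZE];
    uint64_t read_index, write_index;
    uint64_t doorbell;   /* last packet ID the doorbell was rung with */
} toy_queue_t;

int enqueue_dispatch(toy_queue_t *q, int kernel_id) {
    if (q->write_index >= q->read_index + QUEUE_SIZE)
        return -1;                                   /* queue is full */
    uint64_t id = q->write_index++;                  /* 1. allocate a packet slot */
    toy_packet_t *p = &q->slots[id % QUEUE_SIZE];
    p->kernel_id = kernel_id;                        /* 2. initialize and copy the packet */
    p->format = FMT_DISPATCH;                        /* 3. mark the packet as valid */
    q->doorbell = id;                                /* 4. notify via the doorbell signal */
    return 0;
}
```

A real enqueue would use an atomic add on the write index (or a lock) so that multiple producers each claim a distinct slot.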
EXAMPLE - ENQUEUE AQL PACKET (2)
[Figure: dispatch queue with ReadIndex and WriteIndex markers.
Allocate an AQL packet slot.
Initialize the packet.
Copy the packet into the queue; note that a lock can be used here to prevent race conditions in a multithreaded environment.
Send the doorbell signal.]
EXAMPLE - PACKET PROCESSOR
[Figure: dispatch queue with ReadIndex and WriteIndex markers.
Receive the doorbell signal.
If there is any packet in the queue, process it:
Get the packet content.
Check if it is a barrier packet.
Update the readIndex, change the packet state to invalid, and send the completion signal.]
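The consume side described above can be sketched the same way. proc_queue_t and process_packets are invented stand-ins for the packet processor, with an integer counter standing in for the completion signal.

```c
#include <assert.h>
#include <stdint.h>

enum { FMT_INVALID = 1, FMT_DISPATCH = 2, FMT_BARRIER = 3 };

#define QSIZE 4

typedef struct { int format; int kernel_id; } pkt_t;

typedef struct {
    pkt_t slots[QSIZE];
    uint64_t read_index, write_index;
    int completed;   /* stands in for the completion signal */
} proc_queue_t;

/* Process everything between read_index and write_index; returns the number
 * of packets consumed. */
int process_packets(proc_queue_t *q) {
    int n = 0;
    while (q->read_index < q->write_index) {       /* queue is not empty */
        pkt_t *p = &q->slots[q->read_index % QSIZE];
        if (p->format == FMT_BARRIER) {
            /* a real processor would delay subsequent packets here */
        }
        /* ... launch the job described by the packet ... */
        p->format = FMT_INVALID;                   /* slot is reusable by agents */
        q->read_index++;                           /* advance the read index */
        q->completed++;                            /* send the completion signal */
        n++;
    }
    return n;
}
```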
MEMORY MANAGEMENT
OUTLINE
Memory registration and deregistration
Memory region and memory segment
APIs for memory region manipulation
APIs for memory registration and deregistration
INTRODUCTION
One of the key features of HSA is its ability to share global pointers between the
host application and code executing on the HSA component.
This ability means that an application can directly pass a pointer to memory allocated on the host
to a kernel function dispatched to a component without an intermediate copy
When a buffer created on the host is also accessed by a component,
programmers are encouraged to register the corresponding address range
beforehand.
Registering memory expresses an intention to access (read or write) the passed buffer from a
component other than the host. This is a performance hint that allows the runtime implementation
to know ahead of time which buffers will be accessed by some of the components.
When an HSA program no longer needs to access a registered buffer from a device,
the user should deregister that virtual address range.
MEMORY REGION/SEGMENT
A memory region represents a virtual memory interval that is visible to a particular agent,
and contains properties about how memory is accessed or allocated from that agent.
Memory segments
Values
HSA_SEGMENT_GLOBAL = 1
HSA_SEGMENT_PRIVATE = 2
HSA_SEGMENT_GROUP = 4
HSA_SEGMENT_KERNARG = 8
HSA_SEGMENT_READONLY = 16
HSA_SEGMENT_IMAGE = 32
MEMORY REGION INFORMATION
Attributes of a memory region
Values
HSA_REGION_INFO_BASE_ADDRESS
HSA_REGION_INFO_SIZE
HSA_REGION_INFO_NODE
HSA_REGION_INFO_MAX_ALLOCATION_SIZE
HSA_REGION_INFO_SEGMENT
HSA_REGION_INFO_BANDWIDTH
HSA_REGION_INFO_CACHED
MEMORY REGION MANIPULATION (1)
Get the current value of an attribute of a region
Iterate over the memory regions that are visible to an agent, and invoke an
application-defined callback on every iteration
If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the
traversal stops and the function returns that status value.
MEMORY REGION MANIPULATION (2)
Allocate a block of memory
Deallocate a block of memory previously allocated
using hsa_memory_allocate
Copy a block of memory
Copying a number of bytes larger than the size of the
memory regions pointed to by dst or src results in
undefined behavior.
MEMORY REGISTRATION/DEREGISTRATION
Register memory
Parameters
address (input): A pointer to the base of
the memory region to be registered. If a
NULL pointer is passed, no operation is
performed.
size (input): Requested registration size
in bytes. A size of zero is only allowed if
address is NULL.
Deregister memory previously registered
using hsa_memory_register
Parameter
address (input): A pointer to the base of the
memory region to be deregistered. If a NULL
pointer is passed, no operation is performed.
EXAMPLE
Allocate a memory space
Use hsa_region_get_info to get the
size in bytes of this memory space
Register this memory space as a
performance hint
When finished, deregister and
free this memory space
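The walkthrough above can be sketched as a full lifecycle. mock_memory_register and mock_memory_deregister are invented stand-ins for hsa_memory_register/hsa_memory_deregister (which are performance hints, not allocators), and plain malloc/free play the role of the allocation step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { HSA_STATUS_SUCCESS = 0, HSA_STATUS_ERROR = 1 } hsa_status_t;

static size_t registered_bytes = 0;   /* toy stand-in for runtime bookkeeping */

hsa_status_t mock_memory_register(void *address, size_t size) {
    if (address == NULL) return HSA_STATUS_SUCCESS;  /* NULL: no operation */
    if (size == 0) return HSA_STATUS_ERROR;          /* zero size only valid with NULL */
    registered_bytes += size;                        /* record the hinted range */
    return HSA_STATUS_SUCCESS;
}

hsa_status_t mock_memory_deregister(void *address, size_t size) {
    if (address == NULL) return HSA_STATUS_SUCCESS;
    registered_bytes -= size;
    return HSA_STATUS_SUCCESS;
}

/* The whole lifecycle: allocate, hint, use, un-hint, free. */
hsa_status_t buffer_lifecycle(size_t size) {
    void *buf = malloc(size);                        /* allocate on the host */
    if (!buf) return HSA_STATUS_ERROR;
    if (mock_memory_register(buf, size) != HSA_STATUS_SUCCESS) {
        free(buf);
        return HSA_STATUS_ERROR;
    }
    memset(buf, 0, size);                            /* ... a kernel would use it ... */
    mock_memory_deregister(buf, size);               /* done with device access */
    free(buf);
    return HSA_STATUS_SUCCESS;
}
```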
SUMMARY
Covered
HSA Core Runtime API (Pre-release 1.0 provisional)
Runtime Initialization and Shutdown (Open/Close)
Notifications (Synchronous/Asynchronous)
Agent Information
Signals and Synchronization (Memory-Based)
Queues and Architected Dispatch
Memory Management
Not covered
Extension of Core Runtime
HSAIL Finalization, Linking, and Debugging
Images and Samplers
QUESTIONS?
HSA MEMORY MODEL
BEN GASTER, ENGINEER, QUALCOMM
OUTLINE
HSA Memory Model
OpenCL 2.0
Has a memory model too
Obstruction-free bounded deques
An example using the HSA memory model
HSA MEMORY MODEL
TYPES OF MODELS
Shared memory computers and programming languages divide complexity into
models:
1. Memory model specifies safety
e.g. what values can a load return?
This is what this section of the tutorial will focus on
2. Execution model specifies liveness
e.g. can a work-item prevent others from progressing?
Described in Ben Sander's tutorial section on HSAIL
3. Performance model specifies the big picture
e.g. caches or branch divergence
Specific to particular implementations and outside the scope of today's tutorial
THE PROBLEM
Assume all locations (a, b, …) are initialized to 0
What are the values of $s2 and $s4 after execution?
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
Equivalently, in C-like pseudo-code (initially *a = 0 && *b = 0):
Work-item 0:  *a = 1;  int x = *b;
Work-item 1:  *b = 1;  int y = *a;
THE SOLUTION
The memory model:
Defines the visibility of writes to memory at any given point
Provides us with a set of possible executions
WHAT MAKES A GOOD MEMORY MODEL*
Programmability: A good model should make it (relatively) easy to write multi-work-item
programs. The model should be intuitive to most users, even to those
who have not read the details
Performance: A good model should facilitate high-performance implementations
at reasonable power, cost, etc. It should give implementers broad latitude in
options
Portability: A good model would be adopted widely, or at least provide backward
compatibility or the ability to translate among models
* S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department,
University of Wisconsin–Madison, Nov. 1993.
SEQUENTIAL CONSISTENCY (SC)*
Axiomatic Definition
A single processor (core) is sequential if "the result of an execution is the same as if the
operations had been executed in the order specified by the program."
A multiprocessor is sequentially consistent if "the result of any execution is the same as if the
operations of all processors (cores) were executed in some sequential order, and the
operations of each individual processor (core) appear in this sequence in the order specified by
its program."
But HW/Compiler actually implements more relaxed models, e.g. ARMv7
* L. Lamport. How to Make a Multiprocessor Computer that Correctly
Executes Multiprocessor Programs. IEEE Transactions on Computers,
C-28(9):690–691, Sept. 1979.
SEQUENTIAL CONSISTENCY (SC)
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
$s2 = 0 && $s4 = 1
BUT WHAT ABOUT ACTUAL HARDWARE?
Sequential consistency is (reasonably) easy to understand, but limits
optimizations that the compiler and hardware can perform
Many modern processors implement many reordering optimizations
Store buffers (TSO*): work-items can see their own stores early
Reorder buffers (XC*): work-items can see other work-items' stores early
*TSO – Total Store Order as implemented by SPARC and x86
*XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
RELAXED CONSISTENCY (XC)
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
ld_global_u32 $s2, [&b] ;
ld_global_u32 $s4, [&a] ;
st_global_u32 $s1, [&a] ;
st_global_u32 $s3, [&b] ;
$s2 = 0 && $s4 = 0
WHAT ARE OUR 3 Ps?
Programmability: XC makes it really pretty hard for the programmer to reason about
what will be visible when
many memory model experts have been known to get it wrong!
Performance: XC is good for performance; the hardware (compiler) is free to
reorder many loads and stores, opening the door for performance and power
enhancements
Portability: XC is very portable as it places very few constraints
MY CHILDREN AND COMPUTER
ARCHITECTS ALL WANT
To have their cake and eat it!
HSA provides the ability for programmers to reason with the (relatively)
intuitive model of SC, while still achieving the
benefits of XC!
SEQUENTIAL CONSISTENCY FOR DRF*
HSA adopts the same approach as Java, C++11, and OpenCL 2.0, adopting SC for Data-Race-Free (DRF)
plus some new capabilities!
(Informally) A data race occurs when two (or more) work-items access the same memory
location such that:
At least one of the accesses is a WRITE
There are no intervening synchronization operations
SC for DRF asks:
Programmers to ensure programs are DRF under SC
Implementers to ensure that all executions of DRF programs on the relaxed model are also SC
executions
*S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the
17th Annual International Symposium on Computer Architecture, pp. 2–14, May
1990.
HSA SUPPORTS RELEASE CONSISTENCY
HSA’s memory model is based on RCsc: All atomic_ld_scacq and atomic_st_screl operations are SC
This implies coherence on all atomic_ld_scacq and atomic_st_screl operations to a single
address
All atomic_ld_scacq and atomic_st_screl operations are program ordered per work-
item (actually: sequence-ordered by language constraints)
Similar model adopted by ARMv8
HSA extends RCsc to SC for HRF*, to access the full capabilities of
modern heterogeneous systems, containing CPUs, GPUs, and DSPs,
for example.
© Copyright 2014 HSA Foundation. All Rights Reserved
*Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric
Memory Models for Heterogeneous Platforms. D. R. Hower, B. M. Beckmann, B. R.
Gaster, B. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. MSPC’13.
MAKING RELAXED CONSISTENCY WORK
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
atomic_st_global_u32_screl $s1, [&a] ;
atomic_ld_global_u32_scacq $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
atomic_st_global_u32_screl $s3, [&b] ;
atomic_ld_global_u32_scacq $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
atomic_st_global_u32_screl $s1, [&a] ;
atomic_ld_global_u32_scacq $s2, [&b] ;
atomic_st_global_u32_screl $s3, [&b] ;
atomic_ld_global_u32_scacq $s4, [&a] ;
$s2 = 0 && $s4 = 1
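The litmus test above can be written with C++11 sequentially consistent atomics, which play the role of HSA's atomic_st_screl/atomic_ld_scacq pairs here (a sketch; the register names s2/s4 are carried over for comparison). Under SC at least one of the two stores is ordered before the other work-item's load, so the outcome s2 == 0 && s4 == 0 is forbidden; s2 == 0 && s4 == 1 (and the symmetric cases) remain possible.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> a{0}, b{0};

// Work-item 0: store a, then load b (defaults are memory_order_seq_cst).
int wi0() { a.store(1); return b.load(); }
// Work-item 1: store b, then load a.
int wi1() { b.store(1); return a.load(); }

// Runs the litmus test once and returns s2 + s4.
int run() {
    a = 0; b = 0;
    int s2 = 0, s4 = 0;
    std::thread t0([&] { s2 = wi0(); });
    std::thread t1([&] { s4 = wi1(); });
    t0.join(); t1.join();
    // SC forbids s2 == 0 && s4 == 0, so the sum is always at least 1.
    return s2 + s4;
}
```

With relaxed atomics instead, both loads could return 0, which is exactly the XC outcome shown earlier.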
SEQUENTIAL CONSISTENCY FOR DRF
Two memory accesses participate in a data race if they
access the same location
at least one access is a store
can occur simultaneously
i.e. appear as adjacent operations in an interleaving.
A program is data-race-free if no possible execution results in a data race.
Sequential consistency for data-race-free programs
Avoid everything else
HSA: Not good enough!
© Copyright 2014 HSA Foundation. All Rights Reserved
ALL ARE NOT EQUAL – OR SOME CAN SEE
BETTER THAN OTHERS
Remember the HSAIL
Execution Model
© Copyright 2014 HSA Foundation. All Rights Reserved
device scope
group scope
wave scope
platform scope
DATA-RACE-FREE IS NOT ENOUGH
t1 t2 t3 t4
st_global 1, [&X]
atomic_st_global_screl 0, [&flag]
atomic_cas_global_scar 1, 0, [&flag]
...
atomic_st_global_screl 0, [&flag]
atomic_cas_global_scar 1, 0, [&flag]
ld_global (??), [&x]
group #1-2 group #3-4
Two ordinary memory accesses participate in a data race if they
Access same location
At least one is a store
Can occur simultaneously
Not a data race…
Is it SC?
Well that depends
(Diagram: t1 and t2 synchronize through scope S12, t3 and t4 through scope S34, with SGlobal spanning all four.)
visibility implied by
causality?
© Copyright 2014 HSA Foundation. All Rights Reserved
SEQUENTIAL CONSISTENCY FOR
HETEROGENEOUS-RACE-FREE
Two memory accesses participate in a heterogeneous race if
access the same location
at least one access is a store
can occur simultaneously
i.e. appear as adjacent operations in an interleaving.
Are not synchronized with “enough” scope
A program is heterogeneous-race-free if no possible execution results in a
heterogeneous race.
Sequential consistency for heterogeneous-race-free programs
Avoid everything else
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA HETEROGENEOUS RACE FREE
HRF0: Basic Scope Synchronization
“enough” = both threads synchronize using identical scope
Recall example:
Contains a heterogeneous race in HSA
t1 t2 t3 t4
st_global 1, [&X]
atomic_st_global_screl_wg 0, [&flag]
...
atomic_cas_global_scar_wg 1, 0, [&flag]
ld_global (??), [&x]
Workgroup #1-2 Workgroup #3-4
HSA Conclusion:
This is bad. Don’t do it.
© Copyright 2014 HSA Foundation. All Rights Reserved
HOW TO USE HSA WITH SCOPES
Use smallest scope that includes all
producers/consumers of shared data
HSA Scope Selection Guideline
Implication:
Producers/consumers must be known at synchronization time
Want: For performance, use smallest scope possible
What is safe in HSA?
Is this a valid assumption?
© Copyright 2014 HSA Foundation. All Rights Reserved
REGULAR GPGPU WORKLOADS
(Diagram: define the N x M problem space; partition it hierarchically; communicate locally N times; communicate globally M times.)
Well defined (regular) data partitioning +
Well defined (regular) synchronization pattern =
Producer/consumers are always known
Generally: HSA works well with
regular data-parallel workloads
© Copyright 2014 HSA Foundation. All Rights Reserved
t1 t2 t3 t4
st_global 1, [&X]
atomic_st_global_screl_plat 0, [&flag]
atomic_cas_global_scar_plat 1, 0, [&flag]
...
atomic_st_global_screl_plat 0, [&flag]
atomic_cas_global_scar_plat 1, 0, [&flag]
ld $s1, [&x]
IRREGULAR WORKLOADS
HSA: the example is a race
Must upgrade wg (workgroup) -> plat (platform)
HSA memory model says:
ld $s1, [&x], will see value (1)!
Workgroup #1-2 Workgroup #3-4
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL HAS MEMORY MODELS TOO
MAPPING ONTO HSA’S MEMORY MODEL
It is straightforward to provide a mapping from OpenCL 1.x to the proposed model
OpenCL 1.x atomics are unordered and so map to atomic_op_X
Mapping for fences not shown but straightforward
OPENCL 1.X MEMORY MODEL MAPPING
OpenCL Operation    HSA Memory Model Operation
Atomic load         atomic_ld_global_wg / atomic_ld_group_wg
Atomic store        atomic_st_global_wg / atomic_st_group_wg
atomic_op           atomic_op_global_comp / atomic_op_group_wg
barrier(…)          fence ; barrier_wg
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 BACKGROUND
Provisional specification released at SIGGRAPH’13, July 2013.
Huge update to OpenCL to account for the evolving hardware landscape and
emerging use cases (e.g. irregular workloads)
Key features:
Shared virtual memory, including platform atomics
Formally defined memory model based on C11 plus support for scopes
Includes an extended set of C11 atomic operations
Generic address space that subsumes global, local, and private
Device-to-device enqueue
Out-of-order device-side queuing model
Backwards compatible with OpenCL 1.x
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 MEMORY MODEL MAPPING
OpenCL Operation                   HSA Memory Model Operation
Load, memory_order_relaxed         atomic_ld_[global | group]_relaxed_scope
Store, memory_order_relaxed        atomic_st_[global | group]_relaxed_scope
Load, memory_order_acquire         atomic_ld_[global | group]_scacq_scope
Load, memory_order_seq_cst         atomic_ld_[global | group]_scacq_scope
Store, memory_order_release        atomic_st_[global | group]_screl_scope
Store, memory_order_seq_cst        atomic_st_[global | group]_screl_scope
atomic_op, memory_order_acq_rel    atomic_op_[global | group]_scar_scope
atomic_op, memory_order_seq_cst    atomic_op_[global | group]_scar_scope
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 MEMORY SCOPE MAPPING
OpenCL Scope                    HSA Scope
memory_scope_sub_group          _wave
memory_scope_work_group         _wg
memory_scope_device             _component
memory_scope_all_svm_devices    _platform
© Copyright 2014 HSA Foundation. All Rights Reserved
OBSTRUCTION-FREE
BOUNDED DEQUES
AN EXAMPLE USING THE HSA MEMORY MODEL
CONCURRENT DATA-STRUCTURES
Why do we need such a memory model in practice?
One important application of memory consistency is in the development and use
of concurrent data-structures
In particular, there is a class of data-structure implementations that provide non-
blocking guarantees:
wait-free: An algorithm is wait-free if every operation has a bound on the number of
steps the algorithm will take before the operation completes
In practice it is very hard to build efficient data-structures that meet this requirement
lock-free: An algorithm is lock-free if, given enough time, at least one of
the work-items (or threads) makes progress
In practice lock-free algorithms are implemented by work-items cooperating with one
another enough to allow progress
obstruction-free: An algorithm is obstruction-free if a work-item, running in isolation, can
make progress
© Copyright 2014 HSA Foundation. All Rights Reserved
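The lock-free guarantee above is the one most concurrent data-structures target in practice. A minimal C++ sketch of a lock-free (Treiber-style) stack, which is not the tutorial's deque but shows the same CAS-retry pattern: when a CAS fails it is because another thread succeeded, so the system as a whole makes progress.

```cpp
#include <atomic>

struct Node { int value; Node* next; };

std::atomic<Node*> top{nullptr};

// Lock-free push: on CAS failure, n->next is reloaded with the current
// top and we retry; some thread always succeeds.
void push(int v) {
    Node* n = new Node{v, top.load(std::memory_order_relaxed)};
    while (!top.compare_exchange_weak(n->next, n,
                                      std::memory_order_release,
                                      std::memory_order_relaxed)) {
        // retry with the refreshed n->next
    }
}

// Lock-free pop; returns false when the stack is empty.
// (A production version also needs a memory-reclamation scheme;
// here the popped node is simply leaked for brevity.)
bool pop(int& out) {
    Node* n = top.load(std::memory_order_acquire);
    while (n && !top.compare_exchange_weak(n, n->next,
                                           std::memory_order_acquire,
                                           std::memory_order_acquire)) {
        // n was refreshed with the current top; retry
    }
    if (!n) return false;
    out = n->value;
    return true;
}
```

An obstruction-free structure such as the HLM deque weakens this further: progress is only guaranteed for a work-item running in isolation, which permits simpler algorithms.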
BUT WHY NOT JUST USE MUTUAL
EXCLUSION?
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: an emerging compute cluster. Four Krait CPUs with a 2MB L2, an Adreno GPU, and a Hexagon DSP, each behind MMUs, share a fabric and memory controller.)
Diversity in a heterogeneous system, such as
different clock speeds, different scheduling
policies, and more can mean traditional
mutual exclusion is not the right choice
CONCURRENT DATA-STRUCTURES
Emerging heterogeneous compute clusters means we need:
To adapt existing concurrent data-structures
Develop new concurrent data-structures
Lock based programming may still be useful but often these algorithms will need
to be lock-free
Of course, this is a key application of the HSA memory model
To showcase this we highlight the development of a well known (HLM)
obstruction-free deque*
© Copyright 2014 HSA Foundation. All Rights Reserved
*Herlihy, M., Luchangco, V., and Moir, M. 2003. Obstruction-free
synchronization: double-ended queues as an example. In Proceedings of
ICDCS 2003, 522–529.
HLM - OBSTRUCTION-FREE DEQUE
Uses a fixed length circular queue
At any given time, reading from left to right, the array will contain:
Zero or more left-null (LN) values
Zero or more dummy-null (DN) values
Zero or more right-null (RN) values
At all times there must be:
At least two different null values
At least one LN or DN, and at least one DN or RN
Memory consistency is required to allow multiple producers and multiple
consumers, potentially happening in parallel from the left and right ends, to see
changes from other work-items (HSA Components) and threads (HSA Agents)
© Copyright 2014 HSA Foundation. All Rights Reserved
HLM - OBSTRUCTION-FREE DEQUE
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: circular array with LN entries on the left, stored values in the middle, and RN entries on the right, bracketed by the left and right hint indices.)
Key:
LN – left null value
RN – right null value
v – value
left – left hint index
right – right hint index
C REPRESENTATION OF DEQUE
struct node {
uint64_t type : 2; // null type (LN, RN, DN)
uint64_t counter : 8; // version counter to avoid ABA
uint64_t value : 54; // index value stored in queue
};
struct queue {
unsigned int size; // size of bounded buffer
node * array; // backing store for deque itself
};
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL REPRESENTATION
Allocate a deque in global memory using HSAIL
@deque_instance:
align 64 global_u32 &size;
align 8 global_u64 &array;
© Copyright 2014 HSA Foundation. All Rights Reserved
ORACLE
Assume a function:
function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);
Which, given a deque:
returns (%k) the position of the left-most RN
atomic_ld_global_scacq used to read nodes from the array
Makes one if necessary (i.e. if there are only LN or DN values)
atomic_cas_global_scar required to make a new RN
returns (%left) the left node (i.e. the value to the left of the left-most RN position)
returns (%right) the right node (i.e. the value at position (%k))
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP
function &right_pop (arg_u32 %err, arg_u64 %result) (arg_u64 %deque) {
// load queue address
ld_arg_u64 $d0, [%deque];
@loop_forever:
// setup and call right oracle to get next RN
arg_u32 %k; arg_u64 %current; arg_u64 %next;
call &rcheck_oracle (%k, %current, %next) (%deque) ;
ld_arg_u32 $s0, [%k]; ld_arg_u64 $d1, [%current]; ld_arg_u64 $d2, [%next];
// current.type($d5)
shr_u64 $d5, $d1, 62;
// current.counter($d6)
and_u64 $d6, $d1, 0x3FC0000000000000;
shr_u64 $d6, $d6, 54;
// current.value($d7)
and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
// next.counter($d8)
and_u64 $d8, $d2, 0x3FC0000000000000; shr_u64 $d8, $d8, 54;
brn @loop_forever ;
}
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP – TEST FOR EMPTY
// current.type($d5) == LN || current.type($d5) == DN
cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN;
and_b1 $c0, $c0, $c1;
cbr $c0, @not_empty ;
// current node address (%deque($d0) + (%k($s0) - 1) * 8)
add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 8; cvt_u64_u32 $d3, $s1; add_u64 $d3, $d0, $d3;
atomic_ld_global_scacq_u64 $d4, [$d3];
cmp_neq_b1_u64 $c0, $d4, $d1;
cbr $c0, @not_empty;
st_arg_u32 EMPTY, [%err]; // deque empty so return EMPTY
ret;
@not_empty:
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP – TRY READ/REMOVE NODE
// $d9 = node(RN, next.cnt+1, 0)
add_u64 $d8, $d8, 1;
shl_u64 $d8, $d8, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d9, $d8;
// cas(deq+k, next, node(RN, next.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$s0], $d2, $d9;
cmp_neq_b1_u64 $c0, $d9, $d2;
cbr $c0, @cas_failed;
// $d9 = node(RN, current.cnt+1, 0)
add_u64 $d6, $d6, 1;
shl_u64 $d6, $d6, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d9, $d6;
// cas(deq+(k-1), current, node(RN, current.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$s1], $d1, $d9;
cmp_neq_b1_u64 $c0, $d9, $d1;
cbr $c0, @cas_failed;
st_arg_u32 SUCCESS, [%err];
st_arg_u64 $d7, [%result];
ret;
@cas_failed:
// loop back around and try again
© Copyright 2014 HSA Foundation. All Rights Reserved
TAKE AWAYS
HSA provides a powerful and modern memory model
Based on the well-known SC for DRF
Defined as Release Consistency
Extended with scopes as defined by HRF
OpenCL 2.0 introduces a new memory model
Also based on SC for DRF
Also defined in terms of Release Consistency
Also extended with scopes as defined in HRF
Has a well defined mapping to HSA
Concurrent algorithm development for emerging heterogeneous compute
clusters can benefit from the HSA and OpenCL 2.0 memory models
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA QUEUING MODEL
HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER, ARM
HSA QUEUEING, MOTIVATION
MOTIVATION (TODAY’S PICTURE)
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: today's offload flow between Application, OS, and GPU. The application transfers a buffer to the GPU; the OS copies/maps memory; the application queues the job; the OS schedules the job; the GPU starts and finishes it; the OS then schedules the application, which gets the buffer back via another copy/map.)
HSA QUEUEING: REQUIREMENTS
REQUIREMENTS
Three key technologies are used to build the user mode queueing
mechanism
Shared Virtual Memory
System Coherency
Signaling
AQL (Architected Queueing Language) enables any agent
to enqueue tasks
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
SHARED VIRTUAL MEMORY (TODAY)
Multiple Virtual memory address spaces
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: CPU0 uses VIRTUAL MEMORY1 and the GPU uses VIRTUAL MEMORY2; mappings VA1->PA1 and VA2->PA1 point both into the same PHYSICAL MEMORY.)
SHARED VIRTUAL MEMORY (HSA)
Common Virtual Memory for all HSA agents
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: CPU0 and the GPU share one VIRTUAL MEMORY; the same VA->PA mapping points into PHYSICAL MEMORY for both.)
SHARED VIRTUAL MEMORY
Advantages
No mapping tricks, no copying back-and-forth between different PA addresses
Send pointers (not data) back and forth between HSA agents.
Implications
Common Page Tables (and common interpretation of architectural semantics such as shareability, protection, etc).
Common mechanisms for address translation (and servicing address translation faults)
Concept of a process address space (PASID) to allow multiple, per process virtual address spaces within the system.
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
Specifics
Minimum supported VA width is 48b for 64b systems, and 32b for
32b systems.
HSA agents may reserve VA ranges for internal use via system
software.
All HSA agents other than the host unit must use the lowest privilege
level
If present, read/write access flags for page tables must be
maintained by all agents.
Read/write permissions apply to all HSA agents, equally.
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING THERE …
© Copyright 2014 HSA Foundation. All Rights Reserved
(Same Application/OS/GPU offload flow diagram as in the motivation, with the memory copy/map steps now eliminated by shared virtual memory.)
CACHE COHERENCY
CACHE COHERENCY DOMAINS (1/3)
Data accesses to global memory segment from all HSA Agents shall be
coherent without the need for explicit cache maintenance.
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (2/3)
Advantages
Composability
Reduced SW complexity when communicating between agents
Lower barrier to entry when porting software
Implications
Hardware coherency support between all HSA agents
Can take many forms
Stand alone Snoop Filters / Directories
Combined L3/Filters
Snoop-based systems (no filter)
Etc …
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (3/3)
Specifics
No requirement for instruction memory accesses to be coherent
Only applies to the Primary memory type.
No requirement for HSA agents to maintain coherency to any memory location where the HSA agents do not specify the same memory attributes
Read-only image data is required to remain static during the execution of an HSA kernel.
No double mapping (via different attributes) in order to modify. Must remain static
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING CLOSER …
© Copyright 2014 HSA Foundation. All Rights Reserved
(Same Application/OS/GPU offload flow diagram, with further steps now eliminated by cache coherency.)
SIGNALING
SIGNALING (1/3)
HSA agents support the ability to use signaling objects
All creation/destruction of signaling objects occurs via HSA
runtime APIs
From an HSA agent you can directly access signaling objects:
Signaling a signal object (this will wake up HSA agents
waiting upon the object)
Querying the current object value
Waiting on the current object value (various conditions supported)
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (2/3)
Advantages
Enables asynchronous events between HSA agents, without involving the kernel
Common idiom for work offload
Low power waiting
Implications
Runtime support required
Commonly implemented on top of cache coherency flows
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (3/3)
Specifics
Only supported within a PASID
Supported wait conditions are =, !=, < and >=
Wait operations may return sporadically (no guarantee against false positives)
Programmer must test.
Wait operations have a maximum duration before returning.
The HSAIL atomic operations are supported on signal objects.
Signal objects are opaque
Must use dedicated HSAIL/HSA runtime operations
© Copyright 2014 HSA Foundation. All Rights Reserved
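Because wait operations may return spuriously and have a maximum duration, the caller must retest the condition in a loop. A sketch of that idiom, modeling the signal as a plain atomic rather than the real runtime object (the helper names are mine; only the retest pattern is the point):

```cpp
#include <atomic>
#include <cstdint>

// Model of an HSA signal's value (illustrative only; real signal objects
// are opaque and accessed via dedicated HSAIL/runtime operations).
std::atomic<int64_t> signal_value{0};

// Hypothetical wait primitive: like an HSA signal wait, it may return
// before the condition holds (false positive or timeout), so its return
// value is only a hint.
int64_t wait_hint() { return signal_value.load(std::memory_order_acquire); }

// Correct usage: retest the condition after every return from the wait.
int64_t wait_until_eq(int64_t expected) {
    int64_t v;
    do {
        v = wait_hint();   // may return sporadically; always retest
    } while (v != expected);
    return v;
}
```

The same loop works for the other supported conditions (!=, <, >=); only the comparison in the retest changes.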
ALMOST THERE…
© Copyright 2014 HSA Foundation. All Rights Reserved
(Same Application/OS/GPU offload flow diagram, with further steps now eliminated by signaling.)
USER MODE QUEUING
ONE BLOCK LEFT
© Copyright 2014 HSA Foundation. All Rights Reserved
(Same Application/OS/GPU offload flow diagram; only the OS queue/schedule steps remain to be eliminated.)
USER MODE QUEUEING (1/3)
User mode Queueing
Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.
Queues are created/destroyed via calls to the HSA runtime.
One (or many) agents enqueue packets, a single agent dequeues packets.
Requires coherency and shared virtual memory.
© Copyright 2014 HSA Foundation. All Rights Reserved
USER MODE QUEUEING (2/3)
Advantages
Avoid involving the kernel/driver when dispatching work for an Agent.
Lower latency job dispatch enables finer granularity of offload
Standard memory protection mechanisms may be used to protect communication with
the consuming agent.
Implications
Packet formats/fields are Architected – standard across vendors!
Guaranteed backward compatibility
Packets are enqueued/dequeued via an Architected protocol (all via memory
accesses and signaling)
© Copyright 2014 HSA Foundation. All Rights Reserved
SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
(Same Application/OS/GPU offload flow diagram, with every OS step now eliminated by user mode queuing.)
SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Queue Job
Start Job
Finish Job
ARCHITECTED QUEUEING
LANGUAGE, QUEUES
ARCHITECTED QUEUEING LANGUAGE
HSA Queues look just like standard shared memory queues, supporting multi-producer, single-consumer
Single producer variant defined with some optimizations possible.
Queues consist of storage, read/write indices, ID, etc.
Queues are created/destroyed via calls to the HSA runtime
“Packets” are placed in queues directly from user mode, via an architected protocol
Packet format is architected
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: multiple Producers and a single Consumer share a queue with a Read Index, a Write Index, and packet Storage in coherent, shared memory.)
ARCHITECTED QUEUING LANGUAGE
Packets are read and dispatched for execution from the queue in order, but may complete in any order.
There is no guarantee that more than one packet will be processed in parallel at a time
There may be many queues. A single agent may also consume from several queues.
Any HSA agent may enqueue packets
CPUs
GPUs
Other accelerators
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE
© Copyright 2014 HSA Foundation. All Rights Reserved
Offset (bytes)  Size (bytes)  Field           Notes
0               4             queueType       Differentiate different queues
4               4             queueFeatures   Indicate supported features
8               8             baseAddress     Pointer to packet array
16              8             doorbellSignal  HSA signaling object handle
24              4             size            Packet array cardinality
28              4             queueId         Unique per process
32              8             serviceQueue    Queue for callback services
intrinsic       8             writeIndex      Packet array write index
intrinsic       8             readIndex       Packet array read index
QUEUE VARIANTS
queueType and queueFeatures together define queue semantics and
capabilities
Two queueType values defined, other values reserved:
MULTI – queue supports multiple producers
SINGLE – queue supports single producer
queueFeatures is a bitfield indicating capabilities
DISPATCH (bit 0) if set then queue supports DISPATCH packets
AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets
All other bits are reserved and must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE DETAILS
Queue doorbells are HSA signaling objects with restrictions
Created as part of the queue – lifetime tied to queue object
Atomic read-modify-write not allowed
size field value must be a power of 2
serviceQueue can be used by an HSA kernel for callback services
Provided by application when queue is created
Can be mapped to HSA runtime provided serviceQueue, an application serviced
queue, or NULL if no serviceQueue required
© Copyright 2014 HSA Foundation. All Rights Reserved
READ/WRITE INDICES
readIndex and writeIndex properties are part of the queue, but not visible in the queue structure
Accessed through HSA runtime API and HSAIL operations
HSA runtime/HSAIL operations defined to
Read readIndex or writeIndex property
Write readIndex or writeIndex property
Add constant to writeIndex property (returns previous writeIndex value)
CAS on writeIndex property
readIndex & writeIndex operations treated as atomic in memory model
relaxed, acquire, release and acquire-release variants defined as applicable
readIndex and writeIndex never wrap
PacketID – the index of a particular packet
Uniquely identifies each packet of a queue
© Copyright 2014 HSA Foundation. All Rights Reserved
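Because readIndex and writeIndex never wrap and the packet array size is a power of two, a PacketID maps to an array slot with a single mask, the same computation the enqueue/dequeue algorithms on the later slides use. A small sketch:

```cpp
#include <cstdint>

// PacketIDs increase monotonically and never wrap; with a power-of-two
// packet array, the backing slot for a PacketID is just a mask. The
// fullness check compares the 64-bit indices directly, never the slots.
constexpr uint32_t slot_for(uint64_t packetID, uint32_t size) {
    return static_cast<uint32_t>(packetID & (uint64_t{size} - 1));
}
```

Keeping the indices monotonic is what makes `packetID >= readIndex + size` a valid "queue full" test: no modular arithmetic ambiguity can arise.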
PACKET ENQUEUE
Packet enqueue follows a few simple steps:
Reserve space
Multiple packets can be reserved at a time
Write packet to queue
Mark packet as valid
Producer no longer allowed to modify packet
Consumer is allowed to start processing packet
Notify consumer of packet through the queue doorbell
Multiple packets can be notified at a time
Doorbell signal should be signaled with last packetID notified
On the small machine model the lower 32 bits of the packetID are used
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET RESERVATION
Two flows envisaged
Atomic add writeIndex with number of packets to reserve
Producer must wait until packetID < readIndex + size before writing to packet
Queue can be sized so that wait is unlikely (or impossible)
Suitable when many threads use one queue
Check queue not full first, then use atomic CAS to update writeIndex
Can be inefficient if many threads use the same queue
Allows different failure model if queue is congested
© Copyright 2014 HSA Foundation. All Rights Reserved
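The two reservation flows above can be sketched against a simplified queue (field names follow the queue structure slide; the spin/backoff details and function names are illustrative, not runtime API):

```cpp
#include <atomic>
#include <cstdint>

struct Queue {
    uint32_t size = 0;                 // packet array cardinality (power of 2)
    std::atomic<uint64_t> writeIndex{0};
    std::atomic<uint64_t> readIndex{0};
};

// Flow 1: unconditionally reserve with an atomic add, then wait until
// the reserved slot is free. Indices never wrap, so a plain comparison
// suffices; sizing the queue generously makes the wait unlikely.
uint64_t reserve_add(Queue& q) {
    uint64_t packetID = q.writeIndex.fetch_add(1, std::memory_order_relaxed);
    while (packetID >= q.readIndex.load(std::memory_order_relaxed) + q.size) {
        // queue full: spin (or back off) until the consumer catches up
    }
    return packetID;
}

// Flow 2: check for space first, then CAS the writeIndex. Fails fast
// when the queue is congested, letting the caller choose what to do.
bool reserve_cas(Queue& q, uint64_t& packetID) {
    uint64_t w = q.writeIndex.load(std::memory_order_relaxed);
    if (w >= q.readIndex.load(std::memory_order_relaxed) + q.size)
        return false;                  // full: report failure, don't wait
    if (!q.writeIndex.compare_exchange_strong(w, w + 1,
                                              std::memory_order_relaxed))
        return false;                  // lost the race: caller may retry
    packetID = w;
    return true;
}
```

Flow 1 suits many producers sharing one queue (the add always succeeds); flow 2 trades CAS-retry inefficiency under contention for an explicit failure path when the queue is full.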
QUEUE OPTIMIZATIONS
Queue behavior is loosely defined to allow optimizations
Some potential producer behavior optimizations:
Keep local copy of readIndex, update when required
For single producer queues:
Keep local copy of writeIndex
Use store operation rather than add/cas atomic to update writeIndex
Some potential consumer behavior optimizations:
Use packet format field to determine whether a packet has been submitted rather than writeIndex property
Speculatively read multiple packets from the queue
Not update readIndex for each packet processed
Rely on value used for doorbellSignal to notify new packets
Especially useful for single producer queues
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL MULTI-PRODUCER ALGORITHM
// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);
// Wait until the queue is no longer full
uint64_t rdIdx;
do {
rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));
// Calculate index
uint32_t arrayIdx = packetID & (q->size - 1);
// Copy over the packet; the format field is INVALID
q->baseAddress[arrayIdx] = pkt;
// Update format field with release semantics
q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);
// Ring doorbell (could also amortize over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL CONSUMER ALGORITHM
// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);
// Calculate the index
uint32_t arrayIdx = readIndex & (q->size - 1);
// Spin while empty (could also perform low-power wait on doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }
// Copy over the packet
pkt = q->baseAddress[arrayIdx];
// Set the format field to INVALID
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the readIndex using HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex + 1);
// Now process <pkt>!
© Copyright 2014 HSA Foundation. All Rights Reserved
ARCHITECTED QUEUEING
LANGUAGE, PACKETS
PACKETS
© Copyright 2014 HSA Foundation. All Rights Reserved
Packets come in three main types with architected layouts
Always reserved & Invalid
Do not contain any valid tasks and are not processed (queue will not progress)
Dispatch
Specifies kernel execution over a grid
Agent Dispatch
Specifies a single function to perform with a set of parameters
Barrier
Used for task dependencies
COMMON PACKET HEADER
Start Offset (Bytes)  Format    Field Name             Description
0                     uint16_t  format:8               Contains the packet type (Always reserved, Invalid, Dispatch, Agent Dispatch, and Barrier). Other values are reserved and should not be used.
                                barrier:1              If set then processing of the packet will only begin when all preceding packets are complete.
                                acquireFenceScope:2    Determines the scope and type of the memory fence operation applied before the packet enters the active phase. Must be 0 for Barrier packets.
                                releaseFenceScope:2    Determines the scope and type of the memory fence operation applied after kernel completion but before the packet is completed.
                                reserved:3             Must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
DISPATCH PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset (Bytes)  Format    Field Name               Description
0                     uint16_t  header                   Packet header
2                     uint16_t  dimensions:2             Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
                                reserved:14              Must be 0.
4                     uint16_t  workgroupSize.x          x dimension of work-group (measured in work-items).
6                     uint16_t  workgroupSize.y          y dimension of work-group (measured in work-items).
8                     uint16_t  workgroupSize.z          z dimension of work-group (measured in work-items).
10                    uint16_t  reserved2                Must be 0.
12                    uint32_t  gridSize.x               x dimension of grid (measured in work-items).
16                    uint32_t  gridSize.y               y dimension of grid (measured in work-items).
20                    uint32_t  gridSize.z               z dimension of grid (measured in work-items).
24                    uint32_t  privateSegmentSizeBytes  Total size in bytes of private memory allocation request (per work-item).
28                    uint32_t  groupSegmentSizeBytes    Total size in bytes of group memory allocation request (per work-group).
32                    uint64_t  kernelObjectAddress      Address of an object in memory that includes an implementation-defined executable ISA image for the kernel.
40                    uint64_t  kernargAddress           Address of memory containing kernel arguments.
48                    uint64_t  reserved3                Must be 0.
56                    uint64_t  completionSignal         Address of HSA signaling object used to indicate completion of the job.
AGENT DISPATCH PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset (Bytes)  Format    Field Name        Description
0                     uint16_t  header            Packet header
2                     uint16_t  type              The function to be performed by the destination agent. The type value is split into the following ranges:
                                                  0x0000:0x3FFF – Vendor specific
                                                  0x4000:0x7FFF – HSA runtime
                                                  0x8000:0xFFFF – User registered function
4                     uint32_t  reserved2         Must be 0.
8                     uint64_t  returnLocation    Pointer to location to store the function return value in.
16                    uint64_t  arg[0]            64-bit direct or indirect arguments.
24                    uint64_t  arg[1]
32                    uint64_t  arg[2]
40                    uint64_t  arg[3]
48                    uint64_t  reserved3         Must be 0.
56                    uint64_t  completionSignal  Address of HSA signaling object used to indicate completion of the job.
BARRIER PACKET
Used for specifying dependences between packets
HSA agent will not launch any further packets from this queue until the barrier
packet signal conditions are met
Used for specifying dependences on packets dispatched from any queue.
Execution phase completes only when all of the dependent signals (up to five) have
been signaled (with the value of 0).
Or if an error has occurred in one of the packets upon which we have a dependence.
© Copyright 2014 HSA Foundation. All Rights Reserved
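The dependency convention above (barrier proceeds when each dependent signal reaches 0, and completing packets decrement their completion signal) can be modeled in C++ with an atomic counter; an illustrative sketch, not runtime API:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// A dependent signal initialized to the number of packets that must
// complete before the barrier's condition (value == 0) is met.
std::atomic<int> depSignal{0};

// Completion phase of one packet: release semantics on the decrement
// make the packet's results visible to whoever later observes the 0.
void complete_packet() {
    depSignal.fetch_sub(1, std::memory_order_release);
}

// The barrier's wait: its execution phase ends only when the signal
// has been decremented to 0 by every dependent packet.
int wait_barrier() {
    while (depSignal.load(std::memory_order_acquire) != 0) { /* spin */ }
    return depSignal.load();
}

// Run `packets` completing packets against one barrier; returns the
// signal value the barrier observed (0 once all have completed).
int run_barrier_demo(int packets) {
    depSignal.store(packets);
    std::vector<std::thread> ts;
    for (int i = 0; i < packets; ++i) ts.emplace_back(complete_packet);
    int v = wait_barrier();
    for (auto& t : ts) t.join();
    return v;
}
```

With up to five such signals per barrier packet, one packet can gate on work dispatched from several different queues at once.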
BARRIER PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset (Bytes)  Format    Field Name        Description
0                     uint16_t  header            Packet header, see 2.8.1 Packet header (p. 16).
2                     uint16_t  reserved2         Must be 0.
4                     uint32_t  reserved3         Must be 0.
8                     uint64_t  depSignal0        Addresses of dependent signaling objects to be evaluated by the packet processor.
16                    uint64_t  depSignal1
24                    uint64_t  depSignal2
32                    uint64_t  depSignal3
40                    uint64_t  depSignal4
48                    uint64_t  reserved4         Must be 0.
56                    uint64_t  completionSignal  Address of HSA signaling object used to indicate completion of the job.
DEPENDENCES
A user may never assume more than one packet is being executed by an HSA
agent at a time.
Implications:
Packets can’t poll on shared memory values which will be set by packets issued from
other queues, unless the user has ensured the proper ordering.
To ensure all previous packets from a queue have been completed, use the Barrier
bit.
To ensure specific packets from any queue have completed, use the Barrier packet.
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA QUEUEING, PACKET EXECUTION
PACKET EXECUTION
Launch phase
Initiated when launch conditions are met
All preceding packets in the queue must have exited launch phase
If the barrier bit in the packet header is set, then all preceding packets in the queue must have exited completion phase
Includes memory acquire fence
Active phase
Execute the packet
Barrier packets remain in Active phase until conditions are met.
Completion phase
First step is memory release fence – make results visible.
completionSignal field is then signaled with a decrementing atomic.
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET EXECUTION – BARRIER BIT
© Copyright 2014 HSA Foundation. All Rights Reserved
(Timeline diagram: Pkt1 and Pkt2 launch, execute, and complete, possibly overlapping. Pkt3 has barrier=1, so it launches only when all preceding packets in the queue have completed, and then executes.)
PUTTING IT ALL TOGETHER (FFT)
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: an 8-point FFT over X[0]..X[7] built from three stages of dispatch packets (1-2, 3-4, 5-6), with a barrier between successive stages.)
PUTTING IT ALL TOGETHER
© Copyright 2014 HSA Foundation. All Rights Reserved
AQL Pseudo Code
// Send the packets to do the first stage
aql_dispatch(pkt1);
aql_dispatch(pkt2);
// Send the next two packets, setting the barrier bit so we
// know packets 1 & 2 will be complete before 3 and 4 are launched
aql_dispatch_with_barrier_bit(pkt3);
aql_dispatch(pkt4);
// Same as above (make sure 3 & 4 are done before issuing 5 & 6)
aql_dispatch_with_barrier_bit(pkt5);
aql_dispatch(pkt6);
// This packet will notify us when 5 & 6 are complete
aql_dispatch_with_barrier_bit(finish_pkt);
PACKET EXECUTION – BARRIER PACKET
Figure: queue Q2 holds a Barrier packet followed by task T2; queue Q1 holds task T1. Signal X is initialized to 1 and serves as both the Barrier packet's depSignal0 and T1's completionSignal. The Barrier packet and T1 launch and execute; when T1 completes it decrements signal X, and the Barrier packet completes when signal X is signalled with 0. T2 launches once the Barrier packet is complete.
© Copyright 2014 HSA Foundation. All Rights Reserved
DEPTH FIRST CHILD TASK EXECUTION
Consider two generations of child tasks
Task T submits tasks T.1 & T.2
Task T.1 submits tasks T.1.1 & T.1.2
Task T.2 submits tasks T.2.1 & T.2.2
Desired outcome
Depth first child task execution
I.e. T, T.1, T.1.1, T.1.2, T.2, T.2.1, T.2.2
T is passed a signal (allComplete) to decrement when all tasks are complete (T and its children etc.)
© Copyright 2014 HSA Foundation. All Rights Reserved
Figure: task tree with T at the root, children T.1 and T.2, and grandchildren T.1.1, T.1.2, T.2.1, T.2.2.
HOW TO DO THIS WITH HSA QUEUES?
Use a separate user mode queue for each recursion level
Task T submits to queue Q1
Tasks T.1 & T.2 submit tasks to queue Q2
Queues could be passed in as parameters to task T
Depth first requires ordering of T.1, T.2 and their children
Use additional signal object (childrenComplete) to track completion of the children of
T.1 & T.2
childrenComplete set to number of children (i.e. 2) by each of T.1 & T.2
© Copyright 2014 HSA Foundation. All Rights Reserved
A PICTURE SAYS MORE THAN 1000 WORDS
© Copyright 2014 HSA Foundation. All Rights Reserved
Figure: queue Q1 holds T.1, a Barrier packet, T.2, and a second Barrier packet; queue Q2 holds T.1.1, T.1.2, T.2.1, T.2.2. Each Barrier packet waits on the childrenComplete signal before the next task on Q1 may launch, and allComplete is decremented when everything has finished.
SUMMARY
© Copyright 2014 HSA Foundation. All Rights Reserved
KEY HSA TECHNOLOGIES
HSA combines several mechanisms to enable low overhead task
dispatch
Shared Virtual Memory
System Coherency
Signaling
AQL
User mode queues – from any compatible agent
Architected packet format
Rich dependency mechanism
Flexible and efficient signaling of completion
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA APPLICATIONS
WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS
WITH J.P. BORDES AND JUAN GOMEZ
USE CASES SHOWING HSA ADVANTAGE

Pointer-based Data Structures. Use case: binary tree searches; the GPU performs parallel searches in a CPU-created binary tree. HSA advantage: CPU and GPU have access to the entire unified coherent memory, so the GPU can access existing data structures containing pointers.

Platform Atomics. Use cases: work-group dynamic task management, where the GPU directly operates on a task pool managed by the CPU for algorithms with dynamic computation loads; and binary tree updates, with CPU and GPU operating simultaneously on the tree, both doing modifications. HSA advantage: CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.

Large Data Sets. Use case: hierarchical data searches; applications include object recognition, collision detection, global illumination, BVH. HSA advantage: CPU and GPU have access to the entire unified coherent memory, so the GPU can operate on huge models in place, reducing copy and kernel launch overhead.

CPU Callbacks. Use case: middleware user callbacks; the GPU processes work items, some of which require a call to a CPU function to fetch new data. HSA advantage: the GPU can invoke CPU functions from within a GPU kernel; simpler programming that does not require “split kernels”; higher performance through parallel operations.

© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY
FOR POINTER-BASED DATA
STRUCTURES
UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES
Legacy
Figure (animation): the CPU builds a pointer-based tree and a result buffer in system memory. The legacy GPU cannot follow CPU pointers, so the tree is first flattened and copied, along with the result buffer, into GPU memory; the kernel then searches the flat copy, and the result buffer is copied back to system memory.
© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES
HSA and full OpenCL 2.0
Figure (animation): the kernel runs directly on the CPU-created tree and result buffer in system memory; no flattening and no copies are needed, and results land in place.
© Copyright 2014 HSA Foundation. All Rights Reserved
POINTER DATA STRUCTURES - CODE COMPLEXITY
Figure: side-by-side source listings of the HSA and legacy versions.
© Copyright 2014 HSA Foundation. All Rights Reserved
POINTER DATA STRUCTURES - PERFORMANCE
Figure: binary tree search rate (nodes/ms, 0 to 60,000) vs. tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 core), Legacy APU, and HSA APU.
Measured in AMD labs Jan 1-3 on system shown in backup slide
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS FOR
DYNAMIC TASK MANAGEMENT
PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT
Legacy*
*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010
Figure (animation): the task pool and the per-queue counters (NUM. WRITTEN TASKS, NUM. CONSUMED TASKS) start in system memory, with mirror copies of queues 1 and 2 and their counters in GPU memory. The host fills the pool and asynchronously transfers the queue contents and the written-tasks count (4) into GPU memory; work-groups 1-4 then drain the queue, stepping the consumed-tasks count from 0 to 4 with one atomic add per task; the final count is transferred back to system memory. Every hand-off between CPU and GPU is an explicit copy.
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT
HSA and full OpenCL 2.0
Figure (animation): the task pool and the written/consumed task counters live in host coherent memory, directly visible to all four work-groups; GPU memory is not used for staging. The CPU publishes tasks with a plain memcpy (the written-tasks count becomes 4), and work-groups claim them in place, stepping the shared consumed-tasks count from 0 to 4 with one platform atomic add per task. Zero-copy: no transfers between system and GPU memory are needed.
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS – CODE COMPLEXITY
Legacy: host enqueue function is 102 lines of code.
HSA: host enqueue function is 20 lines of code.
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS - PERFORMANCE
Figure: execution time (ms, 0 to 700) for 64, 128, 256, and 512 tasks per insertion, at task-pool sizes of 4096 and 16384, comparing the legacy implementation against the HSA implementation.
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS FOR
CPU/GPU COLLABORATION
PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION
Legacy
Figure (animation): only the GPU kernel can work on the input buffer and tree; concurrent CPU processing is not possible.
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION
HSA and full OpenCL 2.0
Figure (animation): the GPU kernel and two CPU cores (CPU 0 and CPU 1) operate on the same tree and input buffer concurrently, synchronizing through platform atomics.
© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY
FOR LARGE
DATA SETS
PROCESSING LARGE DATA SETS
The CPU creates a large data structure in System Memory. Computations
using the data are offloaded to the GPU.
SYSTEM MEMORY
GPU
© Copyright 2014 HSA Foundation. All Rights Reserved
PROCESSING LARGE DATA SETS
The CPU creates a large data structure in System Memory. Computations using the data are offloaded to the GPU.
Compare HSA and Legacy methods
Figure: a large 3D spatial data structure, organized as a five-level hierarchy (Levels 1-5), resides in system memory alongside the GPU.
© Copyright 2014 HSA Foundation. All Rights Reserved
LEGACY ACCESS USING GPU MEMORY
Legacy
GPU Memory is smaller
Have to copy and process in chunks
© Copyright 2014 HSA Foundation. All Rights Reserved
LEGACY ACCESS TO LARGE STRUCTURES: COPY AND PROCESS ONE CHUNK AT A TIME
Figure (animation): the five-level hierarchy cannot fit in GPU memory, so the host streams it in chunks. A copy of the top two levels of the hierarchy is made into GPU memory and the first kernel processes it; then the bottom three levels of one branch are copied over and a second kernel processes them; the copy/process cycle repeats with a different branch each pass until the Nth kernel has run.
© Copyright 2014 HSA Foundation. All Rights Reserved
LARGE SPATIAL DATA STRUCTURE: GPU CAN TRAVERSE ENTIRE HIERARCHY
HSA and full OpenCL 2.0
Figure (animation): the kernel runs directly against the full five-level hierarchy (Levels 1-5) in system memory, traversing it in place with no copies and a single kernel launch.
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
CALLBACKS
Parallel processing algorithm with branches
A seldom taken branch requires new data from the CPU
On legacy systems, the algorithm must be split:
Process Kernel 1 on GPU
Check for CPU callbacks and if any, process on CPU
Process Kernel 2 on GPU
Example algorithm from Image Processing
Perform a filter
Calculate average LUMA in each tile
Compare LUMA against threshold and call CPU callback if exceeded (rare)
Perform special processing on tiles with callbacks
A COMMON SITUATION IN HETEROGENEOUS COMPUTING
Figure: input image and output image
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
Legacy
Figure: GPU threads 0 through N run the first kernel; threads that need a CPU callback must wait until the kernel ends, and a continuation kernel finishes up the work, resulting in poor GPU utilization.
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
Input Image
1 Tile = 1 OpenCL Work Item
Output Image
GPU
• Work items compute average RGB value of all the pixels in a tile
• Work items also compute average Luma from the average RGB
• If average Luma > threshold, workgroup invokes CPU CALLBACK
• In parallel with callback, continue compute
CPU
• For selected tiles, update average Luma value (set to RED)
GPU
• Work items apply the Luma value to all pixels in the tile
GPU to CPU callbacks use Shared
Virtual Memory (SVM) Semaphores,
implemented using Platform Atomic
Compare-and-Swap.
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
HSA and full OpenCL 2.0
Figure: GPU threads 0 through N keep running; the few kernel threads that need CPU callback services are serviced immediately by the CPU, in parallel with the rest of the kernel.
© Copyright 2014 HSA Foundation. All Rights Reserved
SUMMARY - HSA ADVANTAGE

Pointer-based Data Structures. Use case: binary tree searches; the GPU performs parallel searches in a CPU-created binary tree. HSA advantage: CPU and GPU have access to the entire unified coherent memory, so the GPU can access existing data structures containing pointers.

Platform Atomics. Use cases: work-group dynamic task management, where the GPU directly operates on a task pool managed by the CPU for algorithms with dynamic computation loads; and binary tree updates, with CPU and GPU operating simultaneously on the tree, both doing modifications. HSA advantage: CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.

Large Data Sets. Use case: hierarchical data searches; applications include object recognition, collision detection, global illumination, BVH. HSA advantage: CPU and GPU have access to the entire unified coherent memory, so the GPU can operate on huge models in place, reducing copy and kernel launch overhead.

CPU Callbacks. Use case: middleware user callbacks; the GPU processes work items, some of which require a call to a CPU function to fetch new data. HSA advantage: the GPU can invoke CPU functions from within a GPU kernel; simpler programming that does not require “split kernels”; higher performance through parallel operations.

© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
HSA COMPILATION
WEN-MEI HWU, CTO, MULTICOREWARE INC
WITH RAY I-JUI SUNG
KEY HSA FEATURES FOR COMPILATION
ALL-PROCESSORS-EQUAL
GPU and CPU have equal
flexibility to create and
dispatch work items
EQUAL ACCESS TO ENTIRE SYSTEM MEMORY
GPU and CPU have
uniform visibility into entire
memory space
Figure: CPU and GPU share a Unified Coherent Memory and a Single Dispatch Path.
© Copyright 2014 HSA Foundation. All Rights Reserved
A QUICK REVIEW OF OPENCL
CURRENT STATE OF PORTABLE HETEROGENEOUS PARALLEL PROGRAMMING
DEVICE CODE IN OPENCL
SIMPLE MATRIX MULTIPLICATION
__kernel void
matrixMul(__global float* C, __global float* A, __global float* B, int wA, int wB) {
int tx = get_global_id(0);
int ty = get_global_id(1);
float value = 0;
for (int k = 0; k < wA; ++k)
{
float elementA = A[ty * wA + k];
float elementB = B[k * wB + tx];
value += elementA * elementB;
}
C[ty * wA + tx] = value;
}
Explicit thread index usage.
Reasonably readable.
Portable across CPUs, GPUs, and FPGAs
© Copyright 2014 HSA Foundation. All Rights Reserved
HOST CODE IN OPENCL -
CONCEPTUAL
1. allocate and initialize memory on host side
2. Initialize OpenCL
3. allocate device memory and move the data
4. Load and build device code
5. Launch kernel
a. append arguments
6. move the data back from device
© Copyright 2014 HSA Foundation. All Rights Reserved
int main(int argc, char** argv){
// set seed for rand()
srand(2006);
/****************************************************/
/* Allocate and initialize memory on Host Side */
/****************************************************/
// allocate and initialize host memory for matrices A and B
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float*) malloc(mem_size_A);
unsigned int size_B = WB * HB;
unsigned int mem_size_B = sizeof(float) * size_B;
float* h_B = (float*) malloc(mem_size_B);
randomInit(h_A, size_A);
randomInit(h_B, size_B);
// allocate host memory for the result C
unsigned int size_C = WC * HC;
unsigned int mem_size_C = sizeof(float) * size_C;
float* h_C = (float*) malloc(mem_size_C);
/*****************************************/
/* Initialize OpenCL */
/*****************************************/
// OpenCL specific variables
cl_context clGPUContext;
cl_command_queue clCommandQue;
cl_program clProgram;
cl_kernel clKernel;
size_t dataBytes;
size_t kernelLength;
cl_int errcode;
// OpenCL device memory pointers for matrices
cl_mem d_A;
cl_mem d_B;
cl_mem d_C;
clGPUContext = clCreateContextFromType(0,
CL_DEVICE_TYPE_GPU,
NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// get the list of GPU devices associated with context
errcode = clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, 0, NULL,
&dataBytes);
cl_device_id *clDevices = (cl_device_id *)
malloc(dataBytes);
errcode |= clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, dataBytes,
clDevices, NULL);
shrCheckError(errcode, CL_SUCCESS);
//Create a command-queue
clCommandQue = clCreateCommandQueue(clGPUContext,
clDevices[0], 0, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// 3. Allocate device memory and move data
d_C = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE,
mem_size_C, NULL, &errcode);
d_A = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
mem_size_A, h_A, &errcode);
d_B = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
mem_size_B, h_B, &errcode);
// 4. Load and build OpenCL kernel
char *clMatrixMul = oclLoadProgSource("kernel.cl",
"// My comment\n",
&kernelLength);
shrCheckError(clMatrixMul != NULL, shrTRUE);
clProgram = clCreateProgramWithSource(clGPUContext,
1, (const char **)&clMatrixMul,
&kernelLength, &errcode);
shrCheckError(errcode, CL_SUCCESS);
errcode = clBuildProgram(clProgram, 0,
NULL, NULL, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
clKernel = clCreateKernel(clProgram,
"matrixMul", &errcode);
shrCheckError(errcode, CL_SUCCESS);
// 5. Launch OpenCL kernel
size_t localWorkSize[2], globalWorkSize[2];
int wA = WA;
int wC = WC;
errcode = clSetKernelArg(clKernel, 0,
sizeof(cl_mem), (void *)&d_C);
errcode |= clSetKernelArg(clKernel, 1,
sizeof(cl_mem), (void *)&d_A);
errcode |= clSetKernelArg(clKernel, 2,
sizeof(cl_mem), (void *)&d_B);
errcode |= clSetKernelArg(clKernel, 3,
sizeof(int), (void *)&wA);
errcode |= clSetKernelArg(clKernel, 4,
sizeof(int), (void *)&wC);
shrCheckError(errcode, CL_SUCCESS);
localWorkSize[0] = 16;
localWorkSize[1] = 16;
globalWorkSize[0] = 1024;
globalWorkSize[1] = 1024;
errcode = clEnqueueNDRangeKernel(clCommandQue,
clKernel, 2, NULL, globalWorkSize,
localWorkSize, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
// 6. Retrieve result from device
errcode = clEnqueueReadBuffer(clCommandQue,
d_C, CL_TRUE, 0, mem_size_C,
h_C, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
// 7. clean up memory
free(h_A);
free(h_B);
free(h_C);
clReleaseMemObject(d_A);
clReleaseMemObject(d_C);
clReleaseMemObject(d_B);
free(clDevices);
free(clMatrixMul);
clReleaseContext(clGPUContext);
clReleaseKernel(clKernel);
clReleaseProgram(clProgram);
clReleaseCommandQueue(clCommandQue);}
Almost 100 lines of code – tedious and hard to maintain.
It does not take advantage of HSA features.
It will likely need to be changed for OpenCL 2.0.
COMPARING SEVERAL HIGH-LEVEL PROGRAMMING INTERFACES
C++AMP: C++ language extension proposed by Microsoft
Thrust: library proposed by NVIDIA (CUDA)
Bolt: library proposed by AMD
OpenACC: annotations and pragmas proposed by PGI
SYCL: C++ wrapper for OpenCL
All these proposals aim to reduce tedious boilerplate code and provide transparent porting to future systems (future proofing).
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENACC
HSA ENABLES SIMPLER IMPLEMENTATION OR BETTER OPTIMIZATION
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENACC- SIMPLE MATRIX MULTIPLICATION EXAMPLE
1. void MatrixMulti(float *C, const float *A, const float *B, int hA, int wA, int wB)
2 {
3 #pragma acc parallel loop copyin(A[0:hA*wA]) copyin(B[0:wA*wB]) copyout(C[0:hA*wB])
4 for (int i=0; i<hA; i++) {
5 #pragma acc loop
6 for (int j=0; j<wB; j++) {
7 float sum = 0;
8 for (int k=0; k<wA; k++) {
9 float a = A[i*wA+k];
10 float b = B[k*wB+j];
11 sum += a*b;
12 }
13 C[i*wB+j] = sum;
14 }
15 }
16 }
Little Host Code Overhead
Programmer annotation of
kernel computation
Programmer annotation of data movement
© Copyright 2014 HSA Foundation. All Rights Reserved
ADVANTAGE OF HSA FOR OPENACC
Flexibility in copyin and copyout implementation
Flexible code generation for nested acc parallel loops
E.g., inner loop bounds that depend on outer loop iterations
Compiler data affinity optimization (especially OpenACC kernel regions)
The compiler does not have to undo programmer managed data transfers
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP
HSA ENABLES EFFICIENT COMPILATION OF AN EVEN HIGHER-LEVEL PROGRAMMING INTERFACE
© Copyright 2014 HSA Foundation. All Rights Reserved
C++ AMP
● C++ Accelerated Massive Parallelism
● Designed for data level parallelism
● Extension of C++11 proposed by Microsoft
● An open specification with multiple implementations aiming at standardization
● MS Visual Studio 2013
● MulticoreWare CLAMP
● GPU data modeled as C++14-like containers for multidimensional arrays
● GPU kernels modeled as C++11 lambdas
● Minimal extension to C++ for simplicity and future proofing
© Copyright 2014 HSA Foundation. All Rights Reserved
MATRIX MULTIPLICATION IN C++AMP
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int* productMatrix,
                     int ha, int hb, int hc) {
  array_view<int, 2> a(ha, hb, aMatrix);
  array_view<int, 2> b(hb, hc, bMatrix);
  array_view<int, 2> product(ha, hc, productMatrix);
  parallel_for_each(
    product.extent,
    [=](index<2> idx) restrict(amp) {
      int row = idx[0];
      int col = idx[1];
      for (int inner = 0; inner < hb; inner++) {
        product[idx] += a(row, inner) * b(inner, col);
      }
    }
  );
  product.synchronize();
}
// For comparison: the equivalent OpenCL host setup and kernel
clGPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                       NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// get the list of GPU devices associated with the context
errcode = clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES,
                           0, NULL, &dataBytes);
cl_device_id *clDevices = (cl_device_id *) malloc(dataBytes);
errcode |= clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES,
                            dataBytes, clDevices, NULL);
shrCheckError(errcode, CL_SUCCESS);
// create a command-queue
clCommandQue = clCreateCommandQueue(clGPUContext, clDevices[0],
                                    0, &errcode);
shrCheckError(errcode, CL_SUCCESS);

__kernel void
matrixMul(__global float* C, __global float* A,
          __global float* B, int wA, int wB) {
  int tx = get_global_id(0);
  int ty = get_global_id(1);
  float value = 0;
  for (int k = 0; k < wA; ++k) {
    float elementA = A[ty * wA + k];
    float elementB = B[k * wB + tx];
    value += elementA * elementB;
  }
  C[ty * wB + tx] = value;
}
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();}
GPU data modeled as data containers
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();}
Kernels modeled as lambdas; arguments are implicitly modeled as captured variables, so programmers do not need to specify copyin and copyout
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();
}
Execution interface; marking an implicitly parallel region for GPU execution
© Copyright 2014 HSA Foundation. All Rights Reserved
MCW C++AMP (CLAMP)
● Runs on Linux and Mac OS X
● Output code compatible with all major OpenCL stacks: AMD, Apple/Intel (OS X),
NVIDIA and even POCL
● Clang/LLVM-based, open source
o Translate C++AMP code to OpenCL C or OpenCL 1.2 SPIR
o With template helper library
● Runtime: OpenCL 1.1/HSA Runtime and GMAC for non-HSA systems
● One of the two C++ AMP implementations recognized by the HSA Foundation
© Copyright 2014 HSA Foundation. All Rights Reserved
MCW C++ AMP COMPILER
● Device Path
o generates OpenCL C code and SPIR
o emits the kernel function
● Host Path
o prepares to launch the code
Diagram: C++ AMP source code → Clang/LLVM 3.3 → Device Code and Host Code
© Copyright 2014 HSA Foundation. All Rights Reserved
TRANSLATION
parallel_for_each(product.extent,
  [=](index<2> idx) restrict(amp) {
    int row = idx[0];
    int col = idx[1];
    for (int inner = 0; inner < 2; inner++) {
      product[idx] += a(row, inner) * b(inner, col);
    }
  });

__kernel void
matrixMul(__global float* C, __global float* A,
          __global float* B, int wA, int wB) {
  int tx = get_global_id(0);
  int ty = get_global_id(1);
  float value = 0;
  for (int k = 0; k < wA; ++k) {
    float elementA = A[ty * wA + k];
    float elementB = B[k * wB + tx];
    value += elementA * elementB;
  }
  C[ty * wB + tx] = value;
}

● Append the arguments
● Set the index
● Emit the kernel function
● Implicit memory management
EXECUTION ON NON-HSA OPENCL PLATFORMS
Diagram: C++ AMP source code → Clang/LLVM 3.3 → Device Code, and C++ AMP source code → Clang/LLVM 3.3 → Host Code → gmac → OpenCL ("our work" is the compiler path; gmac and OpenCL form the runtime).
GMAC
● Unified virtual address space in software
● Can sometimes have high overhead
● On HSA systems (e.g., AMD Kaveri), GMAC is no longer needed
Gelado, et al., ASPLOS 2010
© Copyright 2014 HSA Foundation. All Rights Reserved
CASE STUDY: BINOMIAL OPTION PRICING
Chart: lines of code (counted by cloc, split into host and kernel portions) for C++AMP vs. OpenCL; y-axis 0 to 350 lines.
© Copyright 2014 HSA Foundation. All Rights Reserved
PERFORMANCE ON NON-HSA SYSTEMS: BINOMIAL OPTION PRICING
Chart: time in seconds (total GPU time and kernel-only) for OpenCL vs. C++AMP on an NVIDIA Tesla C2050; y-axis 0 to 0.12 s.
© Copyright 2014 HSA Foundation. All Rights Reserved
EXECUTION ON HSA
Diagram: at compile time, C++ AMP source code → Clang/LLVM 3.3 → Device SPIR and Host SPIR; at runtime, both are handed to the HSA Runtime.
© Copyright 2014 HSA Foundation. All Rights Reserved
WHAT DO WE NEED TO DO?
● Kernel function
o emit the kernel function with the required arguments
● On the host side
o a function that recursively traverses the object and appends the arguments to the OpenCL stack
● On the device side
o reconstruct the object in the device code for future use
© Copyright 2014 HSA Foundation. All Rights Reserved
WHY COMPILING C++AMP TO OPENCL IS
NOT TRIVIAL
● C++AMP → LLVM IR → OpenCL C or SPIR
● argument passing (lambda capture vs. function calls)
● explicit vs. implicit memory transfer
● The heavy lifting is done by the compiler and runtime
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE
struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

struct C c;
c.c = 100;
auto fn = [=] () { int qq = c.c; };
© Copyright 2014 HSA Foundation. All Rights Reserved
TRANSLATION
parallel_for_each(product.extent,
  [=](index<2> idx) restrict(amp) {
    int row = idx[0];
    int col = idx[1];
    for (int inner = 0; inner < 2; inner++) {
      product[idx] += a(row, inner) * b(inner, col);
    }
  });

__kernel void
matrixMul(__global float* C, __global float* A,
          __global float* B, int wA, int wB) {
  int tx = get_global_id(0);
  int ty = get_global_id(1);
  float value = 0;
  for (int k = 0; k < wA; ++k) {
    float elementA = A[ty * wA + k];
    float elementB = B[k * wB + tx];
    value += elementA * elementB;
  }
  C[ty * wB + tx] = value;
}
● Compiler
o turns captured variables into OpenCL arguments
o populates the index<N> in the OpenCL kernel
● Runtime
o implicit memory management
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved