ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial


Transcript of ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

Page 1: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HETEROGENEOUS SYSTEM

ARCHITECTURE (HSA): ARCHITECTURE

AND ALGORITHMS

ISCA TUTORIAL - JUNE 15, 2014

Page 2: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

TOPICS

Introduction

HSAIL Virtual Parallel ISA

HSA Runtime

HSA Memory Model

HSA Queuing Model

HSA Applications

HSA Compilation

© Copyright 2014 HSA Foundation. All Rights Reserved

The HSA Specifications are not at 1.0 final so all content is subject to change

Page 3: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SCHEDULE

© Copyright 2014 HSA Foundation. All Rights Reserved

Time Topic Speaker

8:45am Introduction to HSA Phil Rogers, AMD

9:30am HSAIL Virtual Parallel ISA Ben Sander, AMD

10:30am Break

10:50am HSA Runtime Yeh-Ching Chung, National Tsing Hua University

12 noon Lunch

1pm HSA Memory Model Benedict Gaster, Qualcomm

2pm HSA Queuing Model Hakan Persson, ARM

3pm Break

3:15pm HSA Compilation Technology Wen Mei Hwu, University of Illinois

4pm HSA Application Programming Wen Mei Hwu, University of Illinois

4:45pm Questions All presenters

Page 4: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION

PHIL ROGERS, AMD CORPORATE FELLOW & PRESIDENT OF HSA FOUNDATION

Page 5: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA FOUNDATION

Founded in June 2012

Developing a new platform for heterogeneous systems

www.hsafoundation.com

Specifications under development in working groups to define the platform

Membership consists of 43 companies and 16 universities

Adding 1-2 new members each month

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 6: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

DIVERSE PARTNERS DRIVING FUTURE OF HETEROGENEOUS COMPUTING

© Copyright 2014 HSA Foundation. All Rights Reserved

Founders

Promoters

Supporters

Contributors

Academic


Page 7: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MEMBERSHIP TABLE

Membership Level | Number | List

Founder | 6 | AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd

Promoter | 1 | LG Electronics

Contributor | 25 | Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics, Inc, Sony Mobile Communications, Swarm 64 GmbH, Synopsys, Tensilica, Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation

Supporter | 13 | Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC, Operator of Argonne National Laboratory

Academic | 17 | Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab, National Tsing Hua University, Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 8: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HETEROGENEOUS PROCESSORS HAVE PROLIFERATED — MAKE THEM BETTER

Heterogeneous SOCs have arrived and are a tremendous advance over previous platforms

SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory

How do we make them even better?

Easier to program

Easier to optimize

Higher performance

Lower power

HSA unites accelerators architecturally

Early focus on the GPU compute accelerator, but HSA will go well beyond the GPU

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 9: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INFLECTIONS IN PROCESSOR DESIGN

© Copyright 2014 HSA Foundation. All Rights Reserved

[Figure: three eras of processor design, each plotted as performance over time with a "we are here" marker]

Single-Core Era: single-thread performance vs. time. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity.

Multi-Core Era: throughput performance vs. time (# of processors). Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW, scalability.

Heterogeneous Systems Era: modern application performance vs. time (data-parallel exploitation). Enabled by: abundant data parallelism, power-efficient GPUs. Temporarily constrained by: programming models, communication overhead.

Programming models across the eras: Assembly, C/C++, Java ...; pthreads, OpenMP / TBB ...; Shader, CUDA, OpenCL, C++ and Java

Page 10: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

LEGACY GPU COMPUTE

[Diagram: CPU cores with coherent system memory connected over PCIe to a GPU with compute units (CUs) and non-coherent GPU memory]

The limiters:

Multiple memory pools

Multiple address spaces

High overhead dispatch

Data copies across PCIe

New languages for programming

Dual source development

Proprietary environments

Expert programmers only

Need to fix all of this to unleash our programmers

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 11: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXISTING APUS AND SOCS

[Diagram: physical integration of CPU cores 1..N and GPU compute units CU 1..M on one SOC; coherent system memory for the CPU plus separate non-coherent GPU memory]

Good first step

Some copies gone

Two memory pools remain

Still queue through the OS

Still requires expert programmers

Need to finish the job

Page 12: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AN HSA ENABLED SOC

Unified Coherent Memory enables data sharing across all processors

Processors architected to operate cooperatively

Designed to enable the application to run on different processors at different times

[Diagram: CPU cores 1..N and compute units CU 1..M all sharing one unified coherent memory]

Page 13: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PILLARS OF HSA*

Unified addressing across all processors

Operation into pageable system memory

Full memory coherency

User mode dispatch

Architected queuing language

Scheduling and context switching

HSA Intermediate Language (HSAIL)

High level language support for GPU compute processors

© Copyright 2014 HSA Foundation. All Rights Reserved

* All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors

Page 14: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA SPECIFICATIONS

HSA System Architecture Specification

Version 1.0 Provisional, Released April 2014

Defines discovery, memory model, queue management, atomics, etc

HSA Programmers Reference Specification

Version 1.0 Provisional, Released June 2014

Defines the HSAIL language and object format

HSA Runtime Software Specification

Version 1.0 Provisional, expected to be released in July 2014

Defines the APIs through which an HSA application uses the platform

All released specifications can be found at the HSA Foundation web site:

www.hsafoundation.com/standards

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 15: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA - AN OPEN PLATFORM

Open Architecture, membership open to all

HSA Programmers Reference Manual

HSA System Architecture

HSA Runtime

Delivered via royalty free standards

Royalty Free IP, Specifications and APIs

ISA agnostic for both CPU and GPU

Membership from all areas of computing

Hardware companies

Operating Systems

Tools and Middleware

Applications

Universities

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 16: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA INTERMEDIATE LAYER — HSAIL

HSAIL is a virtual ISA for parallel programs

Finalized to ISA by a JIT compiler or “Finalizer”

ISA independent by design for CPU & GPU

Explicitly parallel

Designed for data parallel programming

Support for exceptions, virtual functions, and other high level language features

Lower level than OpenCL SPIR

Fits naturally in the OpenCL compilation stack

Suitable to support additional high level languages and programming models:

Java, C++, OpenMP, Python, etc

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 17: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA MEMORY MODEL

Defines visibility ordering between all threads in the HSA System

Designed to be compatible with C++11, Java, OpenCL and .NET memory models

Relaxed consistency memory model for parallel compute performance

Visibility controlled by:

Load.Acquire

Store.Release

Fences

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 18: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA QUEUING MODEL

User mode queuing for low latency dispatch

Application dispatches directly

No OS or driver required in the dispatch path

Architected Queuing Layer

Single compute dispatch path for all hardware

No driver translation, direct to hardware

Allows for dispatch to queue from any agent

CPU or GPU

GPU self enqueue enables lots of solutions

Recursion

Tree traversal

Wavefront reforming

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 19: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA SOFTWARE

Page 20: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EVOLUTION OF THE SOFTWARE STACK

[Diagram: today's driver stack vs. the HSA software stack, both running on the same hardware (APUs, CPUs, GPUs)]

Driver Stack (today): Apps call domain libraries and the OpenCL™ / DX runtimes, which sit on user mode drivers above the graphics kernel mode driver.

HSA Software Stack: Apps call task queuing libraries, HSA domain libraries and the OpenCL™ 2.x runtime, which sit on the HSA Runtime and HSA JIT above the HSA kernel mode driver.

[Legend: user mode components, kernel mode components, and components contributed by third parties]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 21: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OPENCL™ AND HSA

HSA is an optimized platform architecture

for OpenCL

Not an alternative to OpenCL

OpenCL on HSA will benefit from

Avoidance of wasteful copies

Low latency dispatch

Improved memory model

Pointers shared between CPU and GPU

OpenCL 2.0 leverages HSA Features

Shared Virtual Memory

Platform Atomics

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 22: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ADDITIONAL LANGUAGES ON HSA

In development

© Copyright 2014 HSA Foundation. All Rights Reserved

Language | Body | More Information

Java (Sumatra) | OpenJDK | http://openjdk.java.net/projects/sumatra/

LLVM | LLVM | Code generator for HSAIL

C++ AMP | Multicoreware | https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home

OpenMP, GCC | AMD, Suse | https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425

Page 23: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SUMATRA PROJECT OVERVIEW

AMD/Oracle sponsored Open Source (OpenJDK) project

Targeted at Java 9 (2015 release)

Allows developers to efficiently represent data parallel algorithms in Java

Sumatra 'repurposes' Java 8's multi-core Stream/Lambda APIs to enable both CPU and GPU computing

At runtime, a Sumatra enabled Java Virtual Machine (JVM) will dispatch 'selected' constructs to available HSA enabled devices

Developers of Java libraries are already refactoring their library code to use these same constructs

So developers using existing libraries should see GPU acceleration without any code changes

http://openjdk.java.net/projects/sumatra/

https://wikis.oracle.com/display/HotSpotInternals/Sumatra

http://mail.openjdk.java.net/pipermail/sumatra-dev/

© Copyright 2014 HSA Foundation. All Rights Reserved

[Diagram: Development: Application.java is compiled by the Java compiler into Application.class. Runtime: a Sumatra enabled JVM runs the application through the Lambda/Stream API, emitting CPU ISA for the CPU and using the HSA Finalizer to produce GPU ISA for the GPU]

Page 24: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA OPEN SOURCE SOFTWARE

HSA will feature an open source linux execution and compilation stack

Allows a single shared implementation for many components

Enables university research and collaboration in all areas

Because it’s the right thing to do

© Copyright 2014 HSA Foundation. All Rights Reserved

Component Name | IHV or Common | Rationale

HSA Bolt Library | Common | Enable understanding and debug

HSAIL Code Generator | Common | Enable research

LLVM Contributions | Common | Industry and academic collaboration

HSAIL Assembler | Common | Enable understanding and debug

HSA Runtime | Common | Standardize on a single runtime

HSA Finalizer | IHV | Enable research and debug

HSA Kernel Driver | IHV | For inclusion in linux distros

Page 25: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

WORKLOAD EXAMPLE

SUFFIX ARRAY CONSTRUCTION: CLOUD SERVER WORKLOAD

Page 26: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SUFFIX ARRAYS

Suffix Arrays are a fundamental data structure

Designed for efficient searching of a large text

Quickly locate every occurrence of a substring S in a text T

Suffix Arrays are used to accelerate in-memory cloud workloads

Full text index search

Lossless data compression

Bio-informatics

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 27: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ACCELERATED SUFFIX ARRAY

CONSTRUCTION ON HSA

© Copyright 2014 HSA Foundation. All Rights Reserved

M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, Submitted to ”Principles and Practice of Parallel Programming, (PPoPP’13)” February 2013.

AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM

By offloading data parallel computations to GPU, HSA increases performance and reduces energy for Suffix Array Construction.

By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without penalty of intermediate copies.

[Chart: +5.8x increased performance and 5x decreased energy. Stages of the skew algorithm for Compute SA, with their assigned processors: Merge Sort (GPU), Radix Sort (GPU), Compute SA (CPU), Lexical Rank (CPU), Radix Sort (GPU)]

Page 28: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EASE OF PROGRAMMING: CODE COMPLEXITY VS. PERFORMANCE

Page 29: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS

AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.

Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta

[Chart: lines of code (0 to 350 LOC) and relative performance (0 to 35) for an exemplary ISV "Hessian" kernel implemented as Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt. Code for each model is broken down into Init, Compile, Copy, Launch, Algorithm, and Copy-back portions; the later models need progressively fewer portions, down to only Algorithm and Launch]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 30: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

THE HSA FUTURE

Architected heterogeneous processing on the SOC

Programming of accelerators becomes much easier

Accelerated software that runs across multiple hardware vendors

Scalability from smart phones to super computers on a common architecture

GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model

Heterogeneous software ecosystem evolves at a much faster pace

Lower power, more capable devices in your hand, on the wall, in the cloud

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 31: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

JOIN US!

WWW.HSAFOUNDATION.COM

Page 32: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HETEROGENEOUS SYSTEM

ARCHITECTURE (HSA): HSAIL VIRTUAL

PARALLEL ISA

BEN SANDER, AMD

Page 33: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

TOPICS

Introduction and Motivation

HSAIL – what makes it special?

HSAIL Execution Model

How to program in HSAIL?

Conclusion

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 34: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

STATE OF GPU COMPUTING

Today’s Challenges

Separate address spaces

Copies

Can’t share pointers

New language required for compute kernel

Ex: OpenCL™ runtime API

Compute kernel compiled separately from host code

Emerging Solution

HSA Hardware

Single address space

Coherent

Virtual

Fast access from all components

Can share pointers

Bring GPU computing to existing, popular programming models

Single-source, fully supported by compiler

HSAIL compiler IR (Cross-platform!)

• GPUs are fast and power efficient : high compute density per-mm and per-watt

• But: Can be hard to program


Page 35: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

THE PORTABILITY CHALLENGE

CPU ISAs

ISA innovations added incrementally (e.g. NEON, AVX)

ISA retains backwards-compatibility with previous generation

Two dominant instruction-set architectures: ARM and x86

GPU ISAs

Massive diversity of architectures in the market

Each vendor has its own ISA - and often several in the market at the same time

No commitment (or attempt!) to provide any backwards compatibility

Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 36: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSAIL : WHAT MAKES IT SPECIAL?

Page 37: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

WHAT IS HSAIL?

Intermediate language for parallel compute in HSA

Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)

Expresses parallel regions of code

Binary format of HSAIL is called “BRIG”

Goal: Bring parallel acceleration to mainstream programming languages

© Copyright 2014 HSA Foundation. All Rights Reserved

main() {
  #pragma omp parallel for
  for (int i=0; i<N; i++) {
  }
}

High-Level Compiler → BRIG → Finalizer → Component ISA (host code → Host ISA)

Page 38: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

KEY HSAIL FEATURES

Parallel

Shared virtual memory

Portable across vendors in HSA Foundation

Stable across multiple product generations

Consistent numerical results (IEEE-754 with defined min accuracy)

Fast, robust, simple finalization step (no monthly updates)

Good performance (little need to write in ISA)

Supports all of OpenCL™

Supports Java, C++, and other languages as well

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 39: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSAIL INSTRUCTION SET - OVERVIEW

Similar to assembly language for a RISC CPU

Load-store architecture

Destination register first, then source registers

140 opcodes (Java™ bytecode has 200)

Floating point (single, double, half (f16))

Integer (32-bit, 64-bit)

Some packed operations

Branches

Function calls

Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas

Synchronize host CPU and HSA Component!

Text and Binary formats (“BRIG”)

ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120)

add_u64 $d1, $d0, 24 ; $d1= $d0+24

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 40: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SEGMENTS AND MEMORY (1/2)

7 segments of memory

global, readonly, group, spill, private, arg, kernarg

Memory instructions can (optionally) specify a segment

Control data sharing properties and communicate intent

Global Segment

Visible to all HSA agents (including host CPU)

Group Segment

Provides high-performance memory shared in the work-group.

Group memory can be read and written by any work-item in the work-group

HSAIL provides sync operations to control visibility of group memory

ld_global_u64 $d0,[$d6]

ld_group_u64 $d0,[$d6+24]

st_spill_f32 $s1,[$d6+4]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 41: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SEGMENTS AND MEMORY (2/2)

Spill, Private, Arg Segments

Represent different regions of a per-work-item stack

Typically generated by compiler, not specified by programmer

Compiler can use these to convey intent – e.g. spills

Kernarg Segment

Programmer writes kernarg segment to pass arguments to a kernel

Read-Only Segment

Remains constant during execution of kernel

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 42: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

FLAT ADDRESSING

Each segment mapped into virtual address space

Flat addresses can map to segments based on virtual address

Instructions with no explicit segment use flat addressing

Very useful for high-level language support (e.g. classes, libraries)

Aligns well with OpenCL 2.0 “generic” addressing feature

ld_global_u64 $d6, [%_arg0] ; global

ld_u64 $d0,[$d6+24] ; flat

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 43: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

REGISTERS

Four classes of registers:

S: 32-bit, Single-precision FP or Int

D: 64-bit, Double-precision FP or Long Int

Q: 128-bit, Packed data.

C: 1-bit, Control Registers (Compares)

Fixed number of registers

S, D, Q share a single pool of resources

S + 2*D + 4*Q <= 128

Up to 128 S or 64 D or 32 Q (or a blend)

Register allocation done in high-level compiler

Finalizer doesn’t perform expensive register allocation

[Diagram: register file layout. Control registers c0..c7; the shared S/D/Q pool overlays registers in groups: s0..s3 alias d0..d1 alias q0, s4..s7 alias d2..d3 alias q1, ... up to s124..s127 alias d62..d63 alias q31]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 44: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIMT EXECUTION MODEL

HSAIL Presents a “SIMT” execution model to the programmer

“Single Instruction, Multiple Thread”

Programmer writes program for a single thread of execution

Each work-item appears to have its own program counter

Branch instructions look natural

Hardware Implementation

Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency

Actually one program counter for the entire SIMD instruction

Branches implemented with predication

SIMT Advantages

Easier to program (branch code in particular)

Natural path for mainstream programming models and existing compilers

Scales across a wide variety of hardware (programmer doesn’t see vector width)

Cross-lane operations available for those who want peak performance

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 45: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

WAVEFRONTS

Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”

Lanes in wavefront can be “active” or “inactive”

Inactive lanes consume hardware resources but don’t do useful work

Tradeoffs:

"Wavefront-aware" programming can be useful for peak performance

But results in less portable code (since wavefront width is encoded in algorithm)

if (cond) {

operationA; // cond=True lanes active here

} else {

operationB; // cond=False lanes active here

}

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 46: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CROSS-LANE OPERATIONS

Example HSAIL cross-lane operation: “activelaneid”

Dest set to count of earlier work-items that are active for this instruction

Useful for compaction algorithms

Example HSAIL cross-lane operation: “activelaneshuffle”

Each workitem reads value from another lane in the wavefront

Supports selection of “identity” element for inactive lanes

Useful for wavefront-level reductions

activelaneid_u32 $s0

activelaneshuffle_b32 $s0, $s1, $s2, 0, 0 // s0 = dest, s1 = source, s2 = lane select, no identity

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 47: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSAIL MODES

Working group strived to limit optional modes and features in HSAIL

Minimize differences between HSA target machines

Better for compiler vendors and application developers

Two modes survived

Machine Models

Small: 32-bit pointers, 32-bit data

Large: 64-bit pointers, 32-bit or 64-bit data

Vendors can support one or both models

“Base” and “Full” Profiles

Two sets of requirements for FP accuracy, rounding, exception reporting, hard preemption

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 48: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA PROFILES

Feature | Base | Full

Addressing Modes | Small, Large | Small, Large

All 32-bit HSAIL operations according to the declared profile | Yes | Yes

F16 support (IEEE 754 or better) | Yes | Yes

F64 support | No | Yes

Precision for add/sub/mul | 1/2 ULP | 1/2 ULP

Precision for div | 2.5 ULP | 1/2 ULP

Precision for sqrt | 1 ULP | 1/2 ULP

HSAIL Rounding: Near | Yes | Yes

HSAIL Rounding: Up / Down / Zero | No | Yes

Subnormal floating-point | Flush-to-zero | Supported

Propagate NaN Payloads | No | Yes

FMA | Yes | Yes

Arithmetic Exception reporting | None | DETECT or BREAK

Debug trap | Yes | Yes

Hard Preemption | No | Yes

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 49: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA PARALLEL EXECUTION

MODEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 50: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA PARALLEL EXECUTION MODEL

Basic Idea:

Programmer supplies an HSAIL "kernel" that is run on each work-item. Kernel is written as a single thread of execution.

Programmer specifies grid dimensions (scope of problem) when launching the kernel.

Each work-item has a unique coordinate in the grid.

Programmer optionally specifies work-group dimensions (for optimized communication).

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 51: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CONVOLUTION / SOBEL EDGE FILTER

Gx = [ -1 0 +1 ]

[ -2 0 +2 ]

[ -1 0 +1 ]

Gy = [ -1 -2 -1 ]

[ 0 0 0 ]

[ +1 +2 +1 ]

G = sqrt(Gx² + Gy²)

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 52: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CONVOLUTION / SOBEL EDGE FILTER

Gx = [ -1 0 +1 ]

[ -2 0 +2 ]

[ -1 0 +1 ]

Gy = [ -1 -2 -1 ]

[ 0 0 0 ]

[ +1 +2 +1 ]

G = sqrt(Gx² + Gy²)

2D grid

workitem

kernel

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 53: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CONVOLUTION / SOBEL EDGE FILTER

Gx = [ -1 0 +1 ]

[ -2 0 +2 ]

[ -1 0 +1 ]

Gy = [ -1 -2 -1 ]

[ 0 0 0 ]

[ +1 +2 +1 ]

G = sqrt(Gx² + Gy²)

2D work-group

2D grid

workitem

kernel

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 54: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HOW TO PROGRAM HSA?

WHAT DO I TYPE?

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 55: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA PROGRAMMING MODELS : CORE PRINCIPLES

Single source

Host and device code side-by-side in same source file

Written in same programming language

Single unified coherent address space

Freely share pointers between host and device

Similar memory model as multi-core CPU

Parallel regions identified with existing language syntax

Typically same syntax used for multi-core CPU

HSAIL is the compiler IR that supports these programming models

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 56: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GCC OPENMP : COMPILATION FLOW

SUSE GCC Project

Adding HSAIL code generator to GCC compiler infrastructure

Supports OpenMP 3.1 syntax

No data movement directives required!

main() {
  // Host code.
  #pragma omp parallel for
  for (int i=0; i<N; i++) {
    C[i] = A[i] + B[i];
  }
}

GCC OpenMP Compiler → BRIG → Finalizer → Component ISA (host code → Host ISA)

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 57: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GCC OPENMP FLOW

Application: C/C++/Fortran OpenMP application, e.g.:

#pragma omp for
for (j = 0; j < n; j++) { b[j] = a[j]; }

Compile time (GNU Compiler, GCC): compiles host code and emits runtime calls with kernel name, parameters and launch attributes; lowers OpenMP directives, converts GIMPLE to BRIG, and embeds the BRIG into the host code.

Run time: pragmas map to calls into the HSA Runtime; the kernel is finalized from BRIG to ISA (finalized once and cached) and dispatched to the GPU.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 58: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MCW C++AMP : COMPILATION FLOW

C++AMP : Single-source C++ template parallel programming model

MCW compiler based on CLANG/LLVM

Open-source and runs on Linux

Leverage open-source LLVM->HSAIL code generator

main() {
  parallel_for_each(grid<1>(extent<256>(…)
}

C++AMP Compiler → BRIG → Finalizer → Component ISA (host code → Host ISA)

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 59: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

JAVA: RUNTIME FLOW

© Copyright 2014 HSA Foundation. All Rights Reserved

JAVA 8 – HSA ENABLED APARAPI

Java 8 brings Stream + Lambda API.‒ More natural way of expressing data parallel algorithms‒ Initially targeted at multi-core.

APARAPI will:
‒ Support Java 8 Lambdas
‒ Dispatch code to HSA enabled devices at runtime via HSAIL

JVM

Java Application

HSA Finalizer & Runtime

APARAPI + Lambda API

GPU / CPU

Future Java – HSA ENABLED JAVA (SUMATRA)

Adds native GPU acceleration to Java Virtual Machine (JVM)

Developer uses JDK Lambda, Stream API

JVM uses GRAAL compiler to generate HSAIL

JVM

Java Application

HSA Finalizer & Runtime

Java JDK Stream + Lambda API

Java GRAAL JITbackend

GPU / CPU

Page 60: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AN EXAMPLE (IN JAVA 8)

© Copyright 2014 HSA Foundation. All Rights Reserved

//Example computes the percentage of total scores achieved by each player on a team.

class Player {

private Team team; // Note: Reference to the parent Team.

private int scores;

private float pctOfTeamScores;

public Team getTeam() {return team;}

public int getScores() {return scores;}

public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }

};

// “Team” class not shown

// Assume “allPlayers” is an initialized array of Players.

Arrays.stream(allPlayers). // wrap the array in a stream

parallel(). // developer indication that lambda is thread-safe

forEach(p -> {

int teamScores = p.getTeam().getScores();

float pctOfTeamScores = (float)p.getScores()/(float) teamScores;

p.setPctOfTeamScores(pctOfTeamScores);

});

Page 61: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSAIL CODE EXAMPLE

© Copyright 2014 HSA Foundation. All Rights Reserved

01: version 0:95: $full : $large;

02: // static method HotSpotMethod<Main.lambda$2(Player)>

03: kernel &run (

04: kernarg_u64 %_arg0 // Kernel signature for lambda method

05: ) {

06: ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register

07: workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord

08:

09: cvt_u64_s32 $d2, $s2; // Convert X gid to long

10: mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref

11: add_u64 $d2, $d2, 24; // Adjust for actual elements start

12: add_u64 $d2, $d2, $d6; // Add to array ref ptr

13: ld_global_u64 $d6, [$d2]; // Load from array element into reg

14: @L0:

15: ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()

16: mov_b64 $d3, $d0;

17: ld_global_s32 $s3, [$d6 + 40]; // p.getScores ()

18: cvt_f32_s32 $s16, $s3;

19: ld_global_s32 $s0, [$d0 + 24]; // Team getScores()

20: cvt_f32_s32 $s17, $s0;

21: div_f32 $s16, $s16, $s17; // p.getScores()/teamScores

22: st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()

23: ret;

24: };

Page 62: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HOW TO PROGRAM HSA?

OTHER PROGRAMMING TOOLS

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 63: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSAIL ASSEMBLER

kernel &run (kernarg_u64 %_arg0)

{

ld_kernarg_u64 $d6, [%_arg0];

workitemabsid_u32 $s2, 0;

cvt_u64_s32 $d2, $s2;

mul_u64 $d2, $d2, 8;

add_u64 $d2, $d2, 24;

add_u64 $d2, $d2, $d6;

ld_global_u64 $d6, [$d2];

. . .

HSAIL Assembler -> BRIG -> Finalizer -> Machine ISA

• HSAIL has a text format and an assembler

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 64: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OPENCL™ OFFLINE COMPILER (CLOC)

__kernel void vec_add(

__global const float *a,

__global const float *b,

__global float *c,

const unsigned int n)

{

int id = get_global_id(0);

// Bounds check

if (id < n)

c[id] = a[id] + b[id];

}

CLOC -> BRIG -> Finalizer -> Machine ISA

•OpenCL split-source model cleanly isolates kernel

•Can express many HSAIL features in OpenCL Kernel Language

•Higher productivity than writing in HSAIL assembly

•Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware)

•Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 65: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

KEY TAKEAWAYS HSAIL

Thin, robust, fast finalizer

Portable (multiple HW vendors and parallel architectures)

Supports shared virtual memory and platform atomics

HSA brings GPU computing to mainstream programming models

Shared and coherent memory bridges “faraway accelerator” gap

HSAIL provides the common IL for high-level languages to benefit from

parallel computing

Languages and Compilers

HSAIL support in GCC, LLVM, Java JVM

Leverage same language syntax designed for multi-core CPUs

Can use pointer-containing data structures

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 66: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA RUNTIME

YEH-CHING CHUNG, NATIONAL TSING HUA UNIVERSITY

Page 67: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OUTLINE Introduction

HSA Core Runtime API (Pre-release 1.0 provisional)

Initialization and Shut Down

Notifications (Synchronous/Asynchronous)

Agent Information

Signals and Synchronization (Memory-Based)

Queues and Architected Dispatch

Summary

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 68: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION (1)

The HSA core runtime is a thin, user-mode API that provides the interface necessary for

the host to launch compute kernels to the available HSA components.

The overall goal of the HSA core runtime design is to provide a high-performance dispatch

mechanism that is portable across multiple HSA vendor architectures.

The dispatch mechanism differentiates the HSA runtime from other language runtimes by

architected argument setting and kernel launching at the hardware and specification level.

The HSA core runtime API is standard across all HSA vendors, such that languages which use the

HSA runtime can run on different vendors' platforms that support the API.

The implementation of the HSA runtime may include kernel-level components (required for

some hardware components, e.g., AMD Kaveri) or may be entirely user-space (for example,

simulators or CPU implementations).

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 69: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION (2)

Programming Model / Language Runtime layer: OpenCL App, Java App, OpenMP App, and DSL App, each running on its own language runtime (OpenCL Runtime, Java Runtime, OpenMP Runtime, DSL Runtime).

The software architecture stack without HSA runtime: each vendor (Vendor 1 … Vendor m) exposes its components (Component 1 … Component N) through a vendor-specific driver.

The software architecture stack with HSA runtime: each HSA vendor (HSA Vendor 1 … HSA Vendor m) exposes its components through the common HSA Runtime and HSA Finalizer.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 70: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION (3)

Program flow comparison (Agent actions vs. runtime calls):

Agent: Start Program -> … -> Exit Program

OpenCL Runtime: Platform, Device, and Context Initialization -> Build Kernel -> SVM Allocation and Kernel Arguments Setting -> Command Queue -> Resource Deallocation

HSA Runtime: HSA Runtime Initialization and Topology Discovery -> HSAIL Finalization and Linking -> HSA Memory Allocation -> Enqueue Dispatch Packet -> HSA Runtime Close

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 71: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION (4)

HSA Platform System Architecture Specification support

Runtime initialization and shutdown

Notifications (synchronous/asynchronous)

Agent information

Signals and synchronization (memory-based)

Queues and Architected dispatch

Memory management

HSAIL support

Finalization, linking, and debugging

Image and Sampler support

HSA Runtime

HSA Memory Allocation

Enqueue Dispatch Packet

HSA Runtime Close

HSA Runtime Initialization and

Topology Discovery

HSAIL Finalization and Linking

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 72: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

RUNTIME INITIALIZATION AND

SHUTDOWN

Page 73: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OUTLINE

Runtime Initialization API

hsa_init

Runtime Shut Down API

hsa_shut_down

Examples

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 74: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA RUNTIME INITIALIZATION

When the API is invoked for the first time in a given process, a runtime

instance is created.

A typical runtime instance may contain information of platform, topology, reference

count, queues, signals, etc.

The API can be called multiple times by applications.

Only a single runtime instance will exist for a given process.

Whenever the API is invoked, the reference count is increased by one.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 75: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA RUNTIME SHUT DOWN

When the API is invoked, the reference count is decreased by 1.

When the reference count < 1

All the resources associated with the runtime instance (queues, signals, topology

information, etc.) are considered invalid and any attempt to reference them in

subsequent API calls results in undefined behavior.

The user might call hsa_init to initialize the HSA runtime again.

The HSA runtime might release resources associated with it.
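The reference-counting lifecycle that hsa_init and hsa_shut_down describe on these two slides can be sketched in a few lines. This is a Python model for illustration only: the real API is C, and the class and field names here are invented.

```python
class RuntimeModel:
    """Toy model of the single, process-wide HSA runtime instance."""

    def __init__(self):
        self.ref_count = 0
        self.resources = None  # queues, signals, topology info, etc.

    def hsa_init(self):
        # The first call in a process creates the runtime instance;
        # every call increments the reference count by one.
        if self.ref_count == 0:
            self.resources = {"agents": [], "queues": [], "signals": []}
        self.ref_count += 1
        return "HSA_STATUS_SUCCESS"

    def hsa_shut_down(self):
        # Each call decrements the reference count; when it drops
        # below 1, the associated resources become invalid.
        if self.ref_count == 0:
            return "HSA_STATUS_ERROR_NOT_INITIALIZED"
        self.ref_count -= 1
        if self.ref_count < 1:
            self.resources = None
        return "HSA_STATUS_SUCCESS"
```

The model makes the pairing rule visible: resources stay valid as long as at least one hsa_init call has not yet been matched by an hsa_shut_down.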

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 76: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE – RUNTIME INITIALIZATION (1)

Data structure for

runtime instance

If hsa_init is called more than once,

increase the ref_count by 1

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 77: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE – RUNTIME INITIALIZATION (2)

hsa_init is called the first time, allocate

resources and set the reference count

Get the number of HSA agents

Initialize agents

Create an empty agent list

If initialization failed, release resources

Create topology table

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 78: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

Agent-0

node_id 0

id 0

type CPU

vendor Generic

name Generic

wavefront_size 0

queue_size 200

group_memory 0

fbarrier_max_count 1

is_pic_supported 0
…

EXAMPLE - RUNTIME INSTANCE (1)

Platform Name: Generic

Memory

node_id 0

id 0

segment_type 111111

address_base 0x0001

size 2048 MB

peak_bandwidth 6553.6 mbps

Agent-1

node_id 0

id 0

type GPU

vendor Generic

name Generic

wavefront_size 64

queue_size 200

group_memory 64

fbarrier_max_count 1

is_pic_supported 1

Cache

node_id 0

id 0

levels 1

associativity 1

cache size 64KB

cache line size 4

is_inclusive 1

Agent: 2

Memory: 1

Cache: 1

… …

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 79: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

Agent-0

node_id = 0

id = 0

agent_type = 1 (CPU)

vendor[16] = Generic

name[16] = Generic

wavefront_size = 0

queue_size =200

group_memory_size_bytes =0

fbarrier_max_count = 1

is_pic_supported = 0

Platform Header File

*base_address = 0x00001

Size = 248

system_timestamp_frequency_

mhz = 200

signal_maximum_wait = 1/200

*node_id

no_nodes = 1

*agent_list

no_agent = 2

*memory_descriptor_list

no_memory_descriptor = 1

*cache_descriptor_list

no_cache_descriptor = 1

EXAMPLE - RUNTIME INSTANCE (2)

cache

node_id = 0

Id = 0

Levels = 1

* associativity -> 1

* cache_size -> 64KB

* cache_line_size -> 4

* is_inclusive -> 1

Memory

node_id = 0

Id = 0

supported_segment_type_mask =

111111

virtual_address_base = 0x0001

size_in_bytes = 2048MB

peak_bandwidth_mbps = 6553.6


Agent-1

node_id = 0

id = 0

agent_type = 2 (GPU)

vendor[16] = Generic

name[16] = Generic

wavefront_size = 64

queue_size =200

group_memory_size_bytes =64

fbarrier_max_count = 1

is_pic_supported = 1

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 80: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE – RUNTIME SHUT DOWN

© Copyright 2014 HSA Foundation. All Rights Reserved

If ref_count < 1, then free the list;

Otherwise decrease the ref_count

by 1.

Page 81: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

NOTIFICATIONS

(SYNCHRONOUS/ASYNCHRONOUS)

Page 82: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OUTLINE

Synchronous Notifications

hsa_status_t

hsa_status_string

Asynchronous Notifications

Example

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 83: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SYNCHRONOUS NOTIFICATIONS

Notifications (errors, events, etc.) reported by the runtime can be synchronous or

asynchronous

The HSA runtime uses the return values of API functions to pass notifications

synchronously.

A status code is defined as an enumeration, hsa_status_t, to capture the return value of any API function that has been executed, except accessors/mutators.

The notification is a status code that indicates success or error.

Success is represented by HSA_STATUS_SUCCESS, which is equivalent to zero.

An error status is assigned a positive integer and its identifier starts with the

HSA_STATUS_ERROR prefix.

The status code can help to determine a cause of the unsuccessful execution.
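The convention above (success is zero, errors are positive, and a query API turns a code into a descriptive string) can be modelled compactly. This Python sketch is illustrative only; the constant names beyond HSA_STATUS_SUCCESS and the description strings are assumptions, not the official enumeration.

```python
# Toy status-code scheme mirroring the convention described above:
# success is zero, error identifiers carry the HSA_STATUS_ERROR prefix.
HSA_STATUS_SUCCESS = 0
HSA_STATUS_ERROR = 1
HSA_STATUS_ERROR_INVALID_ARGUMENT = 2

_STATUS_STRINGS = {
    HSA_STATUS_SUCCESS: "The function has been executed successfully.",
    HSA_STATUS_ERROR: "A generic error has occurred.",
    HSA_STATUS_ERROR_INVALID_ARGUMENT: "One of the arguments is invalid.",
}

def hsa_status_string(status):
    """Model of the status-query API: map a code to an English description."""
    return _STATUS_STRINGS.get(status, "Unknown status code.")
```

A caller can therefore test `status != HSA_STATUS_SUCCESS` (i.e., nonzero) as a uniform error check across all non-accessor API functions.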

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 84: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

STATUS CODE QUERY

Query additional information on status code

Parameters status (input): Status code that the user is seeking more information on

status_string (output): An ISO/IEC 646 encoded English language string that potentially

describes the error status

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 85: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ASYNCHRONOUS NOTIFICATIONS

The runtime passes asynchronous notifications by calling user-defined

callbacks.

For instance, queues are a common source of asynchronous events because the

tasks queued by an application are asynchronously consumed by the packet

processor. Callbacks are associated with queues when they are created. When the

runtime detects an error in a queue, it invokes the callback associated with that

queue and passes it an error flag (indicating what happened) and a pointer to the

erroneous queue.

The HSA runtime does not implement any default callbacks.

When using blocking functions within the callback implementation, a callback that does not return can leave the runtime in an undefined state.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 86: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE - CALLBACK

Pass the callback function

when create queue

If the queue is empty, set the

event and invoke callback

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 87: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AGENT INFORMATION

Page 88: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OUTLINE

Agent information

hsa_node_t

hsa_agent_t

hsa_agent_info_t

hsa_component_feature_t

Agent Information manipulation APIs

hsa_iterate_agents

hsa_agent_get_info

Example

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 89: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION

The runtime exposes a list of agents that are available in the system.

An HSA agent is a hardware component that participates in the HSA memory model.

An HSA agent can submit AQL packets for execution.

An HSA agent may also but is not required to be an HSA component. It is possible for

a system to include HSA agents that are neither an HSA component nor a host CPU.

HSA agents are defined as opaque handles of type hsa_agent_t .

The HSA runtime provides APIs for applications to traverse the list of available

agents and query attributes of a particular agent.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 90: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AGENT INFORMATION (1)

Opaque agent handle

Opaque NUMA node handle

An HSA memory node is a node that delineates a set of

system components (host CPUs and HSA Components) with

“local” access to a set of memory resources attached to the

node's memory controller and appropriate HSA-compliant

access attributes.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 91: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AGENT INFORMATION (2)

Component features

An HSA component is a hardware or software component that can be a target of AQL queues and conforms to the memory model of the HSA.

Values

HSA_COMPONENT_FEATURE_NONE = 0

No component capabilities. The device is an agent, but not a component.

HSA_COMPONENT_FEATURE_BASIC = 1

The component supports the HSAIL instruction set and all the AQL packet types except Agent

dispatch.

HSA_COMPONENT_FEATURE_ALL = 2

The component supports the HSAIL instruction set and all the AQL packet types.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 92: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AGENT INFORMATION (3)

Agent attributes

Values

HSA_AGENT_INFO_MAX_GRID_DIM

HSA_AGENT_INFO_MAX_WORKGROUP_DIM

HSA_AGENT_INFO_QUEUE_MAX_PACKETS

HSA_AGENT_INFO_CLOCK

HSA_AGENT_INFO_CLOCK_FREQUENCY

HSA_AGENT_INFO_MAX_SIGNAL_WAIT

HSA_AGENT_INFO_NAME

HSA_AGENT_INFO_NODE

HSA_AGENT_INFO_COMPONENT_FEATURES

HSA_AGENT_INFO_VENDOR_NAME

HSA_AGENT_INFO_WAVEFRONT_SIZE

HSA_AGENT_INFO_CACHE_SIZE

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 93: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AGENT INFORMATION MANIPULATION (1)

Iterate over the available agents, and invoke an application-defined callback on

every iteration

If callback returns a status other than HSA_STATUS_SUCCESS for a particular

iteration, the traversal stops and the function returns that status value.

Parameters

callback (input): Callback to be invoked once per agent

data (input): Application data that is passed to callback on every iteration. Can be

NULL.
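The traversal-with-early-stop behaviour described above can be sketched as follows. This is a Python model, not the real C API: the agent dictionaries and the HSA_STATUS_INFO_BREAK name are invented for illustration.

```python
HSA_STATUS_SUCCESS = 0
HSA_STATUS_INFO_BREAK = 1  # illustrative non-success code used to stop early

def hsa_iterate_agents(agents, callback, data=None):
    """Invoke callback once per agent; if the callback returns a status
    other than success, stop the traversal and return that status."""
    for agent in agents:
        status = callback(agent, data)
        if status != HSA_STATUS_SUCCESS:
            return status
    return HSA_STATUS_SUCCESS

# Usage: collect the first GPU agent, then stop the traversal.
def find_first_gpu(agent, data):
    if agent["type"] == "GPU":
        data.append(agent)
        return HSA_STATUS_INFO_BREAK
    return HSA_STATUS_SUCCESS

found = []
agents = [{"type": "CPU", "id": 0}, {"type": "GPU", "id": 1},
          {"type": "GPU", "id": 2}]
hsa_iterate_agents(agents, find_first_gpu, found)
```

Returning a non-success status from the callback is thus the idiom for "found what I needed": the traversal ends without visiting the remaining agents.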

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 94: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AGENT INFORMATION MANIPULATION (2)

Get the current value of an attribute for a given agent

Parameters

agent (input): A valid agent

attribute (input): Attribute to query

value (output): Pointer to a user-allocated buffer where to store the value of the

attribute. If the buffer passed by the application is not large enough to hold the value

of attribute, the behavior is undefined.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 95: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE - AGENT ATTRIBUTE QUERY

Copy agent attribute information

Get the agent handle of Agent 0

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 96: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNALS AND SYNCHRONIZATION

(MEMORY-BASED)

Page 97: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OUTLINE

Signal

Signal manipulation API

Create/Destroy

Query

Send

Atomic Operations

Signal wait

Get time out

Signal Condition

Example

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 98: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNAL (1)

HSA agents can communicate with each other by using coherent global memory,

or by using signals.

A signal is represented by an opaque signal handle

A signal carries a value, which can be updated or conditionally waited upon via

an API call or HSAIL instruction.

The value occupies four or eight bytes depending on the machine model in use.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 99: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNAL (2)

Updating the value of a signal is equivalent to sending the signal.

In addition to the update (store) of signals, the API for sending signals must support other atomic operations with specific memory order semantics

Atomic operations: AND, OR, XOR, Add, Subtract, Exchange, and CAS

Memory order semantics : Release and Relaxed
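The listed send operations are read-modify-write updates on the signal value. The sketch below models them in Python with a lock standing in for hardware atomicity; the memory-order variants (release vs. relaxed) are deliberately omitted, and the class and method names are illustrative, not the C API.

```python
import threading

class SignalModel:
    """Toy memory-based signal: a value updated atomically.
    Release/relaxed memory ordering is not modelled here."""

    def __init__(self, initial_value=0):
        self._value = initial_value
        self._lock = threading.Lock()

    def store(self, v):          # send: set the value
        with self._lock:
            self._value = v

    def load(self):              # read the current value
        with self._lock:
            return self._value

    def add(self, v):            # send: increment by a given amount
        with self._lock:
            self._value += v

    def subtract(self, v):       # send: decrement by a given amount
        with self._lock:
            self._value -= v

    def exchange(self, v):       # send: set value, return previous value
        with self._lock:
            old, self._value = self._value, v
            return old

    def cas(self, expected, v):  # compare-and-swap; returns observed value
        with self._lock:
            old = self._value
            if old == expected:
                self._value = v
            return old
```

AND, OR, and XOR would follow the same pattern as add/subtract, replacing the arithmetic with the corresponding bitwise operator.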

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 100: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNAL CREATE/DESTROY

Create a signal

Parameters

initial_value (input): Initial value of the

signal.

signal_handle (output): Signal handle.

Destroy a signal previous created by

hsa_signal_create

Parameter

signal_handle (input): Signal handle.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 101: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

Send and atomically set the value of a signal

with release semantics

SIGNAL LOAD/STORE

Atomically read the current signal value with

acquire semantics

Atomically read the current signal value with

relaxed semantics

Send and atomically set the value of a signal with

relaxed semantics

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 102: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

Send and atomically increment the value of a

signal by a given amount with release semantics

SIGNAL ADD/SUBTRACT

Send and atomically decrement the value of a

signal by a given amount with release semantics

Send and atomically increment the value of a

signal by a given amount with relaxed semantics

Send and atomically decrement the value of a

signal by a given amount with relaxed semantics

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 103: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

Send and atomically perform a logical AND operation

on the value of a signal and a given value with

release semantics

SIGNAL AND (OR, XOR)/EXCHANGE

Send and atomically set the value of a signal and

return its previous value with release semantics

Send and atomically perform a logical AND operation

on the value of a signal and a given value with

relaxed semantics

Send and atomically set the value of a signal and

return its previous value with relaxed semantics

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 104: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNAL WAIT (1)

The application may wait on a signal, with a condition specifying the terms of

wait.

Signal wait condition operator

Values

HSA_EQ: The two operands are equal.

HSA_NE: The two operands are not equal.

HSA_LT: The first operand is less than the second operand.

HSA_GTE: The first operand is greater than or equal to the second operand.
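The four condition operators reduce to ordinary comparisons between the observed signal value and the compare value. A minimal Python sketch (the table keys mirror the names above; the helper function is invented for illustration):

```python
import operator

# Wait-condition operators, as comparisons between the observed
# signal value (first operand) and the compare value (second operand).
_CONDITIONS = {
    "HSA_EQ": operator.eq,
    "HSA_NE": operator.ne,
    "HSA_LT": operator.lt,
    "HSA_GTE": operator.ge,
}

def condition_met(condition, signal_value, compare_value):
    """Return True when the wait condition is satisfied."""
    return _CONDITIONS[condition](signal_value, compare_value)
```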

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 105: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNAL WAIT (2)

The wait can be done either in the HSA component via an HSAIL wait instruction

or via a runtime API defined here.

Waiting on a signal returns the current value at the opaque signal object;

The wait may have a runtime defined timeout which indicates the maximum amount of time that an

implementation can spend waiting.

The signal infrastructure allows for multiple senders/waiters on a single signal.

Wait reads the value, hence acquire synchronizations may be applied.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 106: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNAL WAIT (3)

Signal wait

Parameters

signal_handle (input): A signal handle

condition (input): Condition used to compare the passed and signal values

compare_value (input): Value to compare with

return_value (output): A pointer where the current signal value must be read into

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 107: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNAL WAIT (4)

Signal wait with timeout

Parameters

signal_handle (input): A signal handle

timeout (input): Maximum wait duration (A value of zero indicates no maximum)

long_wait (input): Hint indicating that the signal value is not expected to meet the given condition in

a short period of time. The HSA runtime may use this hint to optimize the wait implementation.

condition (input): Condition used to compare the passed and signal values

compare_value (input): Value to compare with

return_value (output): A pointer where the current signal value must be read into

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 108: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE – SIGNAL WAIT (1)

thread_1 thread_2

thread_1 is blocked

hsa_signal_add_relaxed

(value = value + 3)

Return signal value

Condition satisfied, the

execution of thread_1

continues

value = 0

Timeline Timeline

value = 3

hsa_signal_subtract_relaxed
(value = value - 1)

value = 2

hsa_signal_wait_timeout_acquire

(value == 2)
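The two-thread timeline above (thread_1 blocks until the value equals 2; thread_2 adds 3 and then subtracts 1) can be reproduced with a small Python model. This is a sketch only: a condition variable stands in for the HSA signal machinery, and the method names loosely echo, but are not, the real API.

```python
import threading

class WaitableSignal:
    """Sketch of a signal supporting add/subtract sends and a
    conditional wait (condition fixed to equality for brevity)."""

    def __init__(self, value=0):
        self.value = value
        self._cond = threading.Condition()

    def add_relaxed(self, amount):
        with self._cond:
            self.value += amount
            self._cond.notify_all()   # wake any waiters to re-check

    def subtract_relaxed(self, amount):
        with self._cond:
            self.value -= amount
            self._cond.notify_all()

    def wait_eq(self, compare_value):
        # Block until value == compare_value; return the observed value.
        with self._cond:
            self._cond.wait_for(lambda: self.value == compare_value)
            return self.value

sig = WaitableSignal(0)
observed = []

def thread_1():
    observed.append(sig.wait_eq(2))   # blocks until the value reaches 2

def thread_2():
    sig.add_relaxed(3)        # value: 0 -> 3 (condition still unmet)
    sig.subtract_relaxed(1)   # value: 3 -> 2, releases thread_1

t1 = threading.Thread(target=thread_1)
t2 = threading.Thread(target=thread_2)
t1.start(); t2.start()
t1.join(); t2.join()
```

Note that the waiter re-checks the condition after every send, so it does not matter whether thread_1 starts waiting before or after thread_2's updates.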

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 109: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE – SIGNAL WAIT (2)

If signal_handle is invalid, then return signal invalid status

Compare tmp->value with compare_value to see if the condition is satisfied

If timeout = 0 then return signal time out status

Signal wait condition function

If the condition is satisfied, then return the signal value and status

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 110: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

QUEUES AND ARCHITECTED

DISPATCH

Page 111: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OUTLINE

Queues

Queue Types and Structure

HSA runtime API for Queue Manipulations

Architected Queuing Language (AQL) Support

Packet type

Packet header

Examples

Enqueue Packet

Packet Processor

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 112: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION (1)

An HSA-compliant platform supports the allocation of multiple user-level command queues.

A user-level command queue is characterized as runtime-allocated, user-level accessible virtual memory of a certain size, containing packets defined in the Architected Queuing Language (AQL packets).

Queues are allocated by HSA applications through the HSA runtime.

HSA software receives memory-based structures to configure the hardware queues to

allow for efficient software management of the hardware queues of the HSA agents.

This queue memory shall be processed by the HSA Packet Processor as a ring buffer.

Queues are read-only data structures.

Writing values directly to a queue structure results in undefined behavior.

But HSA agents can directly modify the contents of the buffer pointed to by base_address, or use runtime APIs to access the doorbell signal or the service queue.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 113: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

Two queue types, AQL and Service Queues, are supported

AQL Queue consumes AQL packets that are used to specify the information of kernel functions

that will be executed on the HSA component

Service Queue consumes agent dispatch packets that are used to specify runtime-defined or user-registered functions that will be executed on the agent (typically, the host CPU)

INTRODUCTION (2)

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 114: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION (3)

AQL queue structure

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 115: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION (4)

In addition to the data held in the queue structure, the queue also defines two

properties (readIndex and writeIndex) that define the location of “head” and “tail”

of the queue.

readIndex: The read index is a 64-bit unsigned integer that specifies the packetID of

the next AQL packet to be consumed by the packet processor.

writeIndex: The write index is a 64-bit unsigned integer that specifies the packetID of

the next AQL packet slot to be allocated.

Both indices are not directly exposed to the user, who can only access them by using

dedicated HSA core runtime APIs.

The available index functions differ on the index of interest (read or write), action to be

performed (addition, compare and swap, etc.), and memory consistency model

(relaxed, release, etc.).

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 116: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION (5)

The read index is automatically advanced when a packet is read by the packet

processor.

When the packet processor observes that the read index matches the write index, the queue can be considered empty; when it observes that the write index is greater than or equal to the sum of the read index and the size of the queue, the queue is full.
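The empty/full rules follow directly from treating the two monotonically increasing 64-bit indices as packet IDs over a ring buffer. A minimal sketch (function names are invented for illustration):

```python
def queue_empty(read_index, write_index):
    # Everything that was allocated has been consumed.
    return read_index == write_index

def queue_full(read_index, write_index, size):
    # All `size` ring-buffer slots hold packets not yet consumed.
    return write_index >= read_index + size

def packet_slot(packet_id, size):
    # A monotonically increasing packet ID maps to a ring-buffer
    # slot by wrapping on the queue size.
    return packet_id % size
```

Because the indices never wrap back (only their ring-buffer slots do), the producer can always compute how many free slots remain as `size - (write_index - read_index)`.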

The doorbell_signal field of a queue contains a signal that is used by the agent

to inform the packet processor to process the packets it writes.

The value signaled on the doorbell is equal to the ID of the packet that is ready to be launched.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 117: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION (6)

The new task might be consumed by the packet processor even before the

doorbell signal has been signaled by the agent.

This is because the packet processor might be already processing some other

packets and observes that there is new work available, so it processes the new

packets.

In any case, the agent must ring the doorbell for every batch of packets it writes.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 118: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

QUEUE CREATE/DESTROY

Create a user mode queue

When a queue is created, the runtime also

allocates the packet buffer and the completion

signal.

The application should only rely on the status

code returned to determine if the queue is valid

Destroy a user mode queue

A queue must not be accessed after being destroyed.

When a queue is destroyed, the state of the AQL packets

that have not been yet fully processed becomes undefined.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 119: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GET READ/WRITE INDEX

Atomically retrieve read index of a queue with

acquire semantics

Atomically retrieve write index of a queue with

acquire semantics

Atomically retrieve read index of a queue with

relaxed semantics

Atomically retrieve write index of a queue with

relaxed semantics

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 120: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SET READ/WRITE INDEX

Atomically set the read index of a queue with

release semantics

Atomically set the read index of a queue with

relaxed semantics

Atomically set the write index of a queue with

release semantics

Atomically set the write index of a queue with

relaxed semantics

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 121: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

COMPARE AND SWAP WRITE INDEX

Atomically compare and set the write index of a

queue with acquire/release/relaxed/acquire-

release semantics

Parameters queue (input): A queue

expected (input): The expected index value

val (input): Value to copy to the write index if expected

matches the observed write index

Return value

Previous value of the write index
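The compare-and-swap semantics described above can be sketched with C11 atomics. The function name is invented for illustration; this is a model of the behavior, not the runtime implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative model of CAS on the write index: the index is set to
 * `val` only if it currently equals `expected`; the previous value of
 * the write index is returned either way. */
static uint64_t cas_write_index(_Atomic uint64_t *write_index,
                                uint64_t expected, uint64_t val)
{
    /* On failure, C11 CAS stores the observed value into `expected`,
     * which is exactly the "previous value" we need to return. */
    atomic_compare_exchange_strong(write_index, &expected, val);
    return expected;
}
```

A caller can therefore detect success by comparing the return value against the `expected` argument it passed in.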

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 122: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ADD WRITE INDEX

Atomically increment the write index of a

queue by an offset with

release/acquire/relaxed/acquire-release

semantics

Parameters

queue (input): A queue

val (input): The value to add to the write index

Return value

Previous value of the write index

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 123: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ARCHITECTED QUEUING LANGUAGE (AQL)

An HSA-compliant system provides a command interface for the dispatch of

HSA agent commands.

This command interface is provided by the Architected Queuing Language (AQL).

AQL allows HSA agents to build and enqueue their own command packets,

enabling fast and low-power dispatch.

AQL also provides support for HSA component queue submissions

The HSA component kernel can write commands in AQL format.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 124: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AQL PACKET (1)

AQL packet format

Values

Always reserved packet (0): Packet format is set to always reserved when the queue is initialized.

Invalid packet (1): Packet format is set to invalid when the readIndex is incremented, making the packet slot available to the HSA agents.

Dispatch packet (2): Dispatch packets contain jobs for the HSA component and are created by HSA agents.

Barrier packet (3): Barrier packets can be inserted by HSA agents to delay processing subsequent packets. All queues support barrier packets.

Agent dispatch packet (4): Agent dispatch packets contain jobs for the HSA agent and are created by HSA agents.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 125: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AQL PACKET (2)

HSA signaling object handle used to indicate completion of the job

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 126: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE - ENQUEUE AQL PACKET (1)

An HSA agent submits a task to a queue by performing the following steps:

Allocate a packet slot (by incrementing the writeIndex)

Initialize the packet and copy packet to a queue associated with the Packet Processor

Mark packet as valid

Notify the Packet Processor of the packet (with the doorbell signal)
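The four steps map naturally onto atomic operations. Below is a toy single-producer sketch in C11; the type and function names are invented for illustration and are not the HSA runtime API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy single-producer sketch of the four enqueue steps (illustrative
 * names, not the HSA runtime API). */
enum { FMT_INVALID = 1, FMT_DISPATCH = 2 };

typedef struct {
    _Atomic uint16_t format;   /* packet header format field */
    uint64_t kernel_object;    /* stand-in for the rest of the payload */
} toy_packet_t;

typedef struct {
    toy_packet_t *ring;
    uint32_t size;                 /* number of slots, a power of two */
    _Atomic uint64_t write_index;
    _Atomic uint64_t doorbell;     /* models the doorbell signal value */
} toy_queue_t;

static void toy_enqueue(toy_queue_t *q, uint64_t kernel_object)
{
    /* 1. Allocate a packet slot by atomically bumping the write index. */
    uint64_t idx = atomic_fetch_add(&q->write_index, 1);
    toy_packet_t *p = &q->ring[idx & (q->size - 1)];

    /* 2. Initialize the packet while its format is still invalid. */
    p->kernel_object = kernel_object;

    /* 3. Mark the packet valid; release ordering makes the payload
     *    visible no later than the format change. */
    atomic_store_explicit(&p->format, FMT_DISPATCH, memory_order_release);

    /* 4. Ring the doorbell with the ID of the packet that is ready. */
    atomic_store_explicit(&q->doorbell, idx, memory_order_release);
}
```

In the real runtime these steps correspond to the add/CAS write-index operations and the doorbell signal described on the surrounding slides.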

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 127: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE - ENQUEUE AQL PACKET (2)

Dispatch Queue (figure, with WriteIndex and ReadIndex markers):

Allocate an AQL packet slot

Initialize the packet

Copy the packet into the queue. Note that a lock can be used here to prevent race

conditions in a multithreaded environment

Send the doorbell signal

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 128: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE - PACKET PROCESSOR

Dispatch Queue (figure, with WriteIndex and ReadIndex markers):

Receive doorbell

If there is any packet in the queue, process the packet:

Get the packet content

Check if it is a barrier packet

Update the readIndex, change the packet state to invalid, and send the completion signal
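The consumer side can be modeled the same way. Again this is a toy sketch with invented names, not the real packet processor, and barrier-packet handling is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy sketch of the consumer side (illustrative names only). */
enum { FMT_INVALID = 1, FMT_DISPATCH = 2 };

typedef struct {
    _Atomic uint16_t format;
    uint64_t kernel_object;
} toy_packet_t;

typedef struct {
    toy_packet_t *ring;
    uint32_t size;                 /* power of two */
    _Atomic uint64_t read_index;
    _Atomic uint64_t write_index;
    _Atomic uint64_t completion;   /* models the completion signal */
} toy_queue_t;

/* Drain every valid packet; returns how many packets were processed. */
static int toy_process(toy_queue_t *q, void (*run)(uint64_t))
{
    int n = 0;
    while (atomic_load(&q->read_index) != atomic_load(&q->write_index)) {
        uint64_t idx = atomic_load(&q->read_index);
        toy_packet_t *p = &q->ring[idx & (q->size - 1)];
        /* Acquire pairs with the producer's release store of the format. */
        if (atomic_load_explicit(&p->format, memory_order_acquire)
                != FMT_DISPATCH)
            break;                     /* slot not yet marked valid */
        if (run)
            run(p->kernel_object);     /* execute the dispatch */
        /* Invalidate the slot, free it, and signal completion. */
        atomic_store(&p->format, FMT_INVALID);
        atomic_fetch_add(&q->read_index, 1);
        atomic_fetch_add(&q->completion, 1);
        n++;
    }
    return n;
}
```

Draining until the read index catches the write index is also why a packet can be consumed before its doorbell is rung, as noted on the earlier slide.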

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 129: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MEMORY MANAGEMENT

Page 130: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OUTLINE

Memory registration and deregistration

Memory region and memory segment

APIs for memory region manipulation

APIs for memory registration and deregistration

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 131: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

INTRODUCTION

One of the key features of HSA is its ability to share global pointers between the

host application and code executing on the HSA component.

This ability means that an application can directly pass a pointer to memory allocated on the host

to a kernel function dispatched to a component without an intermediate copy

When a buffer created in the host is also accessed by a component,

programmers are encouraged to register the corresponding address range

beforehand.

Registering memory expresses an intention to access (read or write) the passed buffer from a

component other than the host. This is a performance hint that allows the runtime implementation

to know which buffers will be accessed by some of the components ahead of time.

When an HSA program no longer needs to access a registered buffer in a device,

the user should deregister that virtual address range.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 132: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MEMORY REGION/SEGMENT

A memory region represents a virtual memory interval that is visible to a particular agent,

and contains properties about how memory is accessed or allocated from that agent.

Memory segments

Values

HSA_SEGMENT_GLOBAL = 1

HSA_SEGMENT_PRIVATE = 2

HSA_SEGMENT_GROUP = 4

HSA_SEGMENT_KERNARG = 8

HSA_SEGMENT_READONLY = 16

HSA_SEGMENT_IMAGE = 32

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 133: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MEMORY REGION INFORMATION

Attributes of a memory region

Values

HSA_REGION_INFO_BASE_ADDRESS

HSA_REGION_INFO_SIZE

HSA_REGION_INFO_NODE

HSA_REGION_INFO_MAX_ALLOCATION_SIZE

HSA_REGION_INFO_SEGMENT

HSA_REGION_INFO_BANDWIDTH

HSA_REGION_INFO_CACHED

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 134: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MEMORY REGION MANIPULATION (1)

Get the current value of an attribute of a region

Iterate over the memory regions that are visible to an agent, and invoke an

application-defined callback on every iteration

If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the

traversal stops and the function returns that status value.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 135: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MEMORY REGION MANIPULATION (2)

Allocate a block of memory

Deallocate a block of memory previously allocated

using hsa_memory_allocate

Copy block of memory

Copying a number of bytes larger than the size of the

memory regions pointed to by dst or src results in

undefined behavior.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 136: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MEMORY REGISTRATION/DEREGISTRATION

Register memory

Parameters

address (input): A pointer to the base of

the memory region to be registered. If a

NULL pointer is passed, no operation is

performed.

size (input): Requested registration size

in bytes. A size of zero is only allowed if

address is NULL.

Deregister memory previously registered

using hsa_memory_register

Parameter

address (input): A pointer to the base of the

memory region to be deregistered. If a NULL

pointer is passed, no operation is performed.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 137: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE

Allocate a memory space

Use hsa_region_get_info to get the

size in bytes of this memory space

Register this memory space for a

performance hint

Finish operation, deregister and

free this memory space

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 138: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SUMMARY

Page 139: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SUMMARY

Covered

HSA Core Runtime API (Pre-release 1.0 provisional)

Runtime Initialization and Shutdown (Open/Close)

Notifications (Synchronous/Asynchronous)

Agent Information

Signals and Synchronization (Memory-Based)

Queues and Architected Dispatch

Memory Management

Not covered

Extension of Core Runtime

HSAIL Finalization, Linking, and Debugging

Images and Samplers

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 140: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

QUESTIONS?

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 141: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA MEMORY MODEL

BEN GASTER, ENGINEER, QUALCOMM

Page 142: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OUTLINE

HSA Memory Model

OpenCL 2.0

Has a memory model too

Obstruction-free bounded deques

An example using the HSA memory model

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 143: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA MEMORY MODEL

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 144: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

TYPES OF MODELS

Shared memory computers and programming languages divide complexity into

models:

1. Memory model specifies safety

e.g. which value is a load allowed to observe?

This is what this section of the tutorial will focus on

2. Execution model specifies liveness

Described in Ben Sander’s tutorial section on HSAIL

e.g. can a work-item prevent others from progressing

3. Performance model specifies the big picture

e.g. caches or branch divergence

Specific to particular implementations and outside the scope of today’s tutorial

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 145: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

THE PROBLEM

Assume all locations (a, b, …) are initialized to 0

What are the values of $s2 and $s4 after execution?

© Copyright 2014 HSA Foundation. All Rights Reserved

Work-item 0

mov_u32 $s1, 1 ;

st_global_u32 $s1, [&a] ;

ld_global_u32 $s2, [&b] ;

Work-item 1

mov_u32 $s3, 1 ;

st_global_u32 $s3, [&b] ;

ld_global_u32 $s4, [&a] ;

Equivalent C (initially *a == 0 && *b == 0):

Work-item 0: *a = 1; int x = *b;

Work-item 1: *b = 1; int y = *a;

Page 146: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

THE SOLUTION

The memory model tells us:

Defines the visibility of writes to memory at any given point

Provides us with a set of possible executions

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 147: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

WHAT MAKES A GOOD MEMORY MODEL*

Programmability ; A good model should make it (relatively) easy to write multi-

work-item programs. The model should be intuitive to most users, even to those

who have not read the details

Performance ; A good model should facilitate high-performance implementations

at reasonable power, cost, etc. It should give implementers broad latitude in

options

Portability ; A good model would be adopted widely or at least provide backward

compatibility or the ability to translate among models

* S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department,

University of Wisconsin–Madison, Nov. 1993.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 148: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SEQUENTIAL CONSISTENCY (SC)*

Axiomatic Definition

A single processor (core) is sequential if “the result of an execution is the same as if the

operations had been executed in the order specified by the program.”

A multiprocessor is sequentially consistent if “the result of any execution is the same as if the

operations of all processors (cores) were executed in some sequential order, and the

operations of each individual processor (core) appear in this sequence in the order specified by

its program.”

© Copyright 2014 HSA Foundation. All Rights Reserved

But HW/Compiler actually implements more relaxed models, e.g. ARMv7

* L. Lamport. How to Make a Multiprocessor Computer that Correctly

Executes Multiprocessor Programs. IEEE Transactions on Computers,

C-28(9):690–91, Sept. 1979.

Page 149: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SEQUENTIAL CONSISTENCY (SC)

© Copyright 2014 HSA Foundation. All Rights Reserved

Work-item 0

mov_u32 $s1, 1 ;

st_global_u32 $s1, [&a] ;

ld_global_u32 $s2, [&b] ;

Work-item 1

mov_u32 $s3, 1 ;

st_global_u32 $s3, [&b] ;

ld_global_u32 $s4, [&a] ;

mov_u32 $s1, 1 ;

mov_u32 $s3, 1;

st_global_u32 $s1, [&a] ;

ld_global_u32 $s2, [&b] ;

st_global_u32 $s3, [&b] ;

ld_global_u32 $s4, [&a] ;

$s2 = 0 && $s4 = 1
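One way to convince yourself which outcomes SC permits is to enumerate all six interleavings that respect each work-item's program order. A small illustrative C sketch (not tutorial code):

```c
#include <assert.h>
#include <stdbool.h>

/* Brute-force check of the litmus test above under SC. Each work-item
 * has two operations in program order (WI0: a = 1; s2 = b.  WI1: b = 1;
 * s4 = a.), so an SC execution is one of the six interleavings below. */
static bool sc_allows(int want_s2, int want_s4)
{
    static const int scheds[6][4] = {
        {0,0,1,1}, {0,1,0,1}, {0,1,1,0},
        {1,0,0,1}, {1,0,1,0}, {1,1,0,0}
    };
    for (int i = 0; i < 6; i++) {
        int a = 0, b = 0, s2 = -1, s4 = -1, pc0 = 0, pc1 = 0;
        for (int j = 0; j < 4; j++) {
            if (scheds[i][j] == 0) {       /* next op of work-item 0 */
                if (pc0++ == 0) a = 1; else s2 = b;
            } else {                       /* next op of work-item 1 */
                if (pc1++ == 0) b = 1; else s4 = a;
            }
        }
        if (s2 == want_s2 && s4 == want_s4)
            return true;
    }
    return false;
}
```

Under SC, the outcomes (0, 1), (1, 0), and (1, 1) are all reachable, but (0, 0) is not, because each load follows its own work-item's store in program order.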

Page 150: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

BUT WHAT ABOUT ACTUAL HARDWARE

Sequential consistency is (reasonably) easy to understand, but limits

optimizations that the compiler and hardware can perform

Many modern processors implement many reordering optimizations

Store buffers (TSO*), work-items can see their own stores early

Reorder buffers (XC*), work-items can see other work-items’ stores early

© Copyright 2014 HSA Foundation. All Rights Reserved

*TSO – Total Store Order as implemented by Sparc and x86

*XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno

Page 151: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

RELAXED CONSISTENCY (XC)

© Copyright 2014 HSA Foundation. All Rights Reserved

Work-item 0

mov_u32 $s1, 1 ;

st_global_u32 $s1, [&a] ;

ld_global_u32 $s2, [&b] ;

Work-item 1

mov_u32 $s3, 1 ;

st_global_u32 $s3, [&b] ;

ld_global_u32 $s4, [&a] ;

mov_u32 $s1, 1 ;

mov_u32 $s3, 1;

ld_global_u32 $s2, [&b] ;

ld_global_u32 $s4, [&a] ;

st_global_u32 $s1, [&a] ;

st_global_u32 $s3, [&b] ;

$s2 = 0 && $s4 = 0

Page 152: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

WHAT ARE OUR 3 Ps?

Programmability ; XC is really pretty hard for the programmer to reason about

what will be visible when

many memory model experts have been known to get it wrong!

Performance ; XC is good for performance, the hardware (compiler) is free to

reorder many loads and stores, opening the door for performance and power

enhancements

Portability ; XC is very portable as it places very few constraints

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 153: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MY CHILDREN AND COMPUTER

ARCHITECTS ALL WANT

To have their cake and eat it!

© Copyright 2014 HSA Foundation. All Rights Reserved

HSA Provides: The ability to enable

programmers to reason with (relatively)

intuitive model of SC, while still achieving the

benefits of XC!

Page 154: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SEQUENTIAL CONSISTENCY FOR DRF*

HSA adopts the same approach as Java, C++11, and OpenCL 2.0: SC for Data

Race Free (DRF)

plus some new capabilities !

(Informally) A data race occurs when two (or more) work-items access the same memory

location such that:

At least one of the accesses is a WRITE

There are no intervening synchronization operations

SC for DRF asks:

Programmers to ensure programs are DRF under SC

Implementers to ensure that all executions of DRF programs on the relaxed model are also SC

executions
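In C11 terms (which HSAIL's scacq/screl operations resemble), a data-race-free handoff looks like the sketch below: the ordinary payload accesses are separated by a release/acquire pair on an atomic flag. This is a hedged illustration, not tutorial code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static int payload;            /* ordinary (non-atomic) shared data     */
static atomic_int flag;        /* synchronization variable, initially 0 */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                          /* ordinary store */
    atomic_store_explicit(&flag, 1, memory_order_release); /* ~ st_screl    */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                  /* ~ ld_scacq    */
    /* The release/acquire pair orders the payload write before this read,
     * so the accesses to payload never race and must observe 42. */
    return (void *)(intptr_t)payload;
}
```

Because the only concurrent accesses to `payload` are separated by the synchronizing pair, the program is DRF, and the implementation must therefore give it an SC execution.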

© Copyright 2014 HSA Foundation. All Rights Reserved

*S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the

17th Annual International Symposium on Computer Architecture, pp. 2–14, May

1990

Page 155: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA SUPPORTS RELEASE CONSISTENCY

HSA’s memory model is based on RCSC: All atomic_ld_scacq and atomic_st_screl are SC

This means coherence on all atomic_ld_scacq and atomic_st_screl to a single

address.

All atomic_ld_scacq and atomic_st_screl are program ordered per work-

item (actually: sequence-ordered by language constraints)

Similar model adopted by ARMv8

HSA extends RCSC to SC for HRF*, to access the full capabilities of

modern heterogeneous systems, containing CPUs, GPUs, and DSPs,

for example.

© Copyright 2014 HSA Foundation. All Rights Reserved

*D. R. Hower, B. M. Beckmann, B. R. Gaster, B. Hechtman, M. D. Hill, S. K. Reinhardt,

and D. A. Wood. Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric

Memory Models for Heterogeneous Platforms. MSPC’13.

Page 156: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MAKING RELAXED CONSISTENCY WORK

© Copyright 2014 HSA Foundation. All Rights Reserved

Work-item 0

mov_u32 $s1, 1 ;

atomic_st_global_u32_screl $s1, [&a] ;

atomic_ld_global_u32_scacq $s2, [&b] ;

Work-item 1

mov_u32 $s3, 1 ;

atomic_st_global_u32_screl $s3, [&b] ;

atomic_ld_global_u32_scacq $s4, [&a] ;

mov_u32 $s1, 1 ;

mov_u32 $s3, 1;

atomic_st_global_u32_screl $s1, [&a] ;

atomic_ld_global_u32_scacq $s2, [&b] ;

atomic_st_global_u32_screl $s3, [&b] ;

atomic_ld_global_u32_scacq $s4, [&a] ;

$s2 = 0 && $s4 = 1

Page 157: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SEQUENTIAL CONSISTENCY FOR DRF

Two memory accesses participate in a data race if they

access the same location

at least one access is a store

can occur simultaneously

i.e. appear as adjacent operations in interleaving.

A program is data-race-free if no possible execution results in a data race.

Sequential consistency for data-race-free programs

Avoid everything else

HSA: Not good enough!

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 158: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ALL ARE NOT EQUAL – OR SOME CAN SEE

BETTER THAN OTHERS

Remember the HSAIL

Execution Model

© Copyright 2014 HSA Foundation. All Rights Reserved

device scope

group scope

wave

scope

platform scope

Page 159: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

DATA-RACE-FREE IS NOT ENOUGH

t1 t2 t3 t4

st_global 1, [&X]

atomic_st_global_screl 0, [&flag]

atomic_cas_global_scar 1, 0, [&flag]

...

atomic_st_global_screl 0, [&flag]

atomic_cas_global_scar 1, 0, [&flag]

ld_global (??), [&x]

group #1-2 group #3-4

Two ordinary memory accesses participate in a data race if they

Access same location

At least one is a store

Can occur simultaneously

Not a data race…

Is it SC?

Well that depends

[Figure: scope tree — t1 and t2 under scope S12, t3 and t4 under scope S34, both under

SGlobal. Is visibility implied by causality?]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 160: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SEQUENTIAL CONSISTENCY FOR

HETEROGENEOUS-RACE-FREE

Two memory accesses participate in a heterogeneous race if

access the same location

at least one access is a store

can occur simultaneously

i.e. appear as adjacent operations in interleaving.

Are not synchronized with “enough” scope

A program is heterogeneous-race-free if no possible execution results in a

heterogeneous race.

Sequential consistency for heterogeneous-race-free programs

Avoid everything else

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 161: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA HETEROGENEOUS RACE FREE

HRF0: Basic Scope Synchronization

“enough” = both threads synchronize using identical scope

Recall example:

Contains a heterogeneous race in HSA

t1 t2 t3 t4

st_global 1, [&X]

atomic_st_global_screl_wg 0, [&flag]

...

atomic_cas_global_scar_wg 1, 0, [&flag]

ld_global (??), [&x]

Workgroup #1-2 Workgroup #3-4

HSA Conclusion:

This is bad. Don’t do it.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 162: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HOW TO USE HSA WITH SCOPES

Use smallest scope that includes all

producers/consumers of shared data

HSA Scope Selection Guideline

Implication:

Producers/consumers must be known at synchronization time

Want: For performance, use smallest scope possible

What is safe in HSA?

Is this a valid assumption?

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 163: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

REGULAR GPGPU WORKLOADS

N

M

Define

Problem Space

Partition

Hierarchically

Communicate

Locally

N times

Communicate

Globally

M times

Well defined (regular) data partitioning +

Well defined (regular) synchronization pattern =

Producer/consumers are always known

Generally: HSA works well with

regular data-parallel workloads

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 164: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

t1 t2 t3 t4

st_global 1, [&X]

atomic_st_global_screl_plat 0, [&flag]

atomic_cas_global_scar_plat 1, 0, [&flag]

...

atomic_st_global_screl_plat 0, [&flag]

atomic_cas_global_scar_plat 1, 0, [&flag]

ld $s1, [&x]

IRREGULAR WORKLOADS

HSA: this example is a race

Must upgrade wg (workgroup) -> plat (platform)

HSA memory model says:

ld $s1, [&x], will see value (1)!

Workgroup #1-2 Workgroup #3-4

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 165: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OPENCL HAS MEMORY MODELS TOO

MAPPING ONTO HSA’S MEMORY MODEL

Page 166: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

It is straightforward to provide a mapping from OpenCL 1.x to the proposed model

OpenCL 1.x atomics are unordered and so map to atomic_op_X

Mapping for fences not shown but straightforward

OPENCL 1.X MEMORY MODEL MAPPING

OpenCL Operation    HSA Memory Model Operation

Atomic load         ld_global_wg / ld_group_wg

Atomic store        atomic_st_global_wg / atomic_st_group_wg

atomic_op           atomic_op_global_comp / atomic_op_group_wg

barrier(…)          fence ; barrier_wg

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 167: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OPENCL 2.0 BACKGROUND

Provisional specification released at SIGGRAPH’13, July 2013.

Huge update to OpenCL to account for the evolving hardware landscape and

emerging use cases (e.g. irregular work loads)

Key features:

Shared virtual memory, including platform atomics

Formally defined memory model based on C11 plus support for scopes

Includes an extended set of C11 atomic operations

Generic address space, that subsumes global, local, and private

Device to device enqueue

Out-of-order device side queuing model

Backwards compatible with OpenCL 1.x

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 168: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OPENCL 2.0 MEMORY MODEL MAPPING

OpenCL Operation                 HSA Memory Model Operation

Load  memory_order_relaxed       atomic_ld_[global | group]_relaxed_scope

Store memory_order_relaxed       atomic_st_[global | group]_relaxed_scope

Load  memory_order_acquire       atomic_ld_[global | group]_scacq_scope

Load  memory_order_seq_cst       atomic_ld_[global | group]_scacq_scope

Store memory_order_release       atomic_st_[global | group]_screl_scope

Store memory_order_seq_cst       atomic_st_[global | group]_screl_scope

atomic_op memory_order_acq_rel   atomic_op_[global | group]_scar_scope

atomic_op memory_order_seq_cst   atomic_op_[global | group]_scar_scope

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 169: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OPENCL 2.0 MEMORY SCOPE MAPPING

OpenCL Scope HSA Scope

memory_scope_sub_group _wave

memory_scope_work_group _wg

memory_scope_device _component

memory_scope_all_svm_devices _platform

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 170: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OBSTRUCTION-FREE

BOUNDED DEQUES

AN EXAMPLE USING THE HSA MEMORY MODEL

Page 171: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CONCURRENT DATA-STRUCTURES

Why do we need such a memory model in practice?

One important application of memory consistency is in the development and use

of concurrent data-structures

In particular, there is a class of data-structure implementations that provide non-

blocking guarantees:

Wait-free ; An algorithm is wait-free if every operation has a bound on the number of

steps the algorithm will take before the operation completes

In practice it is very hard to build efficient data-structures that meet this requirement

Lock-free ; An algorithm is lock-free if, given enough time, at least one of the

work-items (or threads) makes progress

In practice lock-free algorithms are implemented by work-items cooperating with one

another enough to allow progress

Obstruction-free ; An algorithm is obstruction-free if a work-item, running in isolation, can

make progress
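As a concrete illustration of the lock-free class, here is a minimal CAS-based stack push in C11 (not from the tutorial): a failed CAS means some other thread's operation succeeded in the meantime, so the system as a whole makes progress even though this work-item retries.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal lock-free stack push. No thread ever holds a lock, so a
 * stalled thread cannot block the others; but an individual push may
 * retry indefinitely, which is why this is lock-free, not wait-free. */
typedef struct node { struct node *next; int value; } node_t;

static _Atomic(node_t *) top;

static void push(node_t *n)
{
    node_t *old = atomic_load_explicit(&top, memory_order_relaxed);
    do {
        n->next = old;   /* link on top of whatever we last observed */
    } while (!atomic_compare_exchange_weak_explicit(
                 &top, &old, n,
                 memory_order_release, memory_order_relaxed));
}
```

The release ordering on the successful CAS publishes the node's contents to any thread that later acquires `top`, mirroring the screl/scacq pairing discussed earlier.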

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 172: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

Emerging Compute Cluster

BUT WHY NOT JUST USE MUTUAL

EXCLUSION?

© Copyright 2014 HSA Foundation. All Rights Reserved

[Figure: emerging compute cluster — four Krait CPUs sharing a 2MB L2, an Adreno GPU,

and a Hexagon DSP, each with MMUs, connected through a fabric & memory controller]

Diversity in a heterogeneous system, such as

different clock speeds, different scheduling

policies, and more can mean traditional

mutual exclusion is not the right choice

Page 173: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CONCURRENT DATA-STRUCTURES

Emerging heterogeneous compute clusters means we need:

To adapt existing concurrent data-structures

Develop new concurrent data-structures

Lock based programming may still be useful but often these algorithms will need

to be lock-free

Of course, this is a key application of the HSA memory model

To showcase this we highlight the development of a well-known (HLM)

obstruction-free deque*

© Copyright 2014 HSA Foundation. All Rights Reserved

*M. Herlihy, V. Luchangco, and M. Moir. Obstruction-Free Synchronization:

Double-Ended Queues as an Example. In Proceedings of the 23rd International

Conference on Distributed Computing Systems (ICDCS), pp. 522–529, 2003.

Page 174: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HLM - OBSTRUCTION-FREE DEQUE

Uses a fixed-length circular array

At any given time, reading from left to right, the array will contain:

Zero or more left-null (LN) values

Zero or more dummy-null (DN) values

Zero or more right-null (RN) values

At all times there must be:

At least two different null values

At least one LN or DN, and at least one DN or RN

Memory consistency is required to allow multiple producers and multiple

consumers, potentially happening in parallel from the left and right ends, to see

changes from other work-items (HSA Components) and threads (HSA Agents)

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 175: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HLM - OBSTRUCTION-FREE DEQUE

© Copyright 2014 HSA Foundation. All Rights Reserved

[Figure: circular array containing LN values, stored values (v), and RN values, with left

and right hint indices]

left right

Key:

LN – left null value

RN – right null value

v – value

left – left hint index

right – right hint index

Page 176: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

C REPRESENTATION OF DEQUE

struct node {

uint64_t type : 2; // null type (LN, RN, DN)

uint64_t counter : 8 ; // version counter to avoid ABA

uint64_t value : 54 ; // index value stored in queue

};

struct queue {

unsigned int size; // size of bounded buffer

node * array; // backing store for deque itself

};

© Copyright 2014 HSA Foundation. All Rights Reserved
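The bit layout can be made explicit with small helper functions. This is a sketch consistent with the masks and shifts used by the HSAIL on the following slides: type in bits 63..62, counter in bits 61..54, value in bits 53..0 (C bit-field ordering itself is implementation-defined, so the HSAIL works with explicit shifts).

```c
#include <assert.h>
#include <stdint.h>

/* Pack/unpack helpers matching the shifts and masks in the HSAIL:
 * shr 62 for the type, mask 0x3FC0000000000000 then shr 54 for the
 * counter, mask 0x3FFFFFFFFFFFFF for the 54-bit value. */
enum { LN = 0, RN = 1, DN = 2 };

static uint64_t pack_node(uint64_t type, uint64_t counter, uint64_t value)
{
    return (type << 62) | ((counter & 0xFF) << 54)
                        | (value & 0x3FFFFFFFFFFFFFull);
}

static uint64_t node_type(uint64_t n)    { return n >> 62; }
static uint64_t node_counter(uint64_t n) { return (n & 0x3FC0000000000000ull) >> 54; }
static uint64_t node_value(uint64_t n)   { return n & 0x3FFFFFFFFFFFFFull; }
```

The 8-bit version counter exists so that a CAS on a whole 64-bit node can detect the ABA problem: re-inserting the same value bumps the counter, so the stale node no longer compares equal.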

Page 177: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSAIL REPRESENTATION

Allocate a deque in global memory using HSAIL

@deque_instance:

align 64 global_u32 &size;

align 8 global_u64 &array;

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 178: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ORACLE

Assume a function:

function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);

Which given a deque

returns (%k) the position of the left most of RN

atomic_ld_global_scacq used to read node from array

Makes one if necessary (i.e. if there are only LN or DN)

atomic_cas_global_scar, required to make new RN

returns (%left) the left node (i.e. the value to the left of the left most RN position)

returns (%right) the right node (i.e. the value at position (%k))

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 179: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

RIGHT POP

function &right_pop(arg_u32 %err, arg_u64 %result) (arg_u64 %deque) {

// load queue address

ld_arg_u64 $d0, [%deque];

@loop_forever:

// setup and call right oracle to get next RN

arg_u32 %k; arg_u64 %current; arg_u64 %next;

call &rcheck_oracle (%k, %current, %next) (%deque) ;

ld_arg_u32 $s0, [%k]; ld_arg_u64 $d1, [%current]; ld_arg_u64 $d2, [%next];

// current.type($d5)

shr_u64 $d5, $d1, 62;

// current.counter($d6)

and_u64 $d6, $d1,

0x3FC0000000000000;

shr_u64 $d6, $d6, 54;

// current.value($d7)

and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;

// next.counter($d8)

and_u64 $d8, $d2, 0x3FC0000000000000; shr_u64 $d8, $d8, 54;

brn @loop_forever ;

}

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 180: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

RIGHT POP – TEST FOR EMPTY

// not empty if current.type($d5) != LN && current.type($d5) != DN

cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN;

and_b1 $c0, $c0, $c1;

cbr $c0, @not_empty ;

// current node index (%deque($d0) + (%k($s1) - 1) * 16)

add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 16; add_u64 $d3, $d0, $s1;

atomic_ld_global_scacq_u64 $d4, [$d3];

cmp_neq_b1_u64 $c0, $d4, $d1;

cbr $c0, @not_empty;

st_arg_u32 EMPTY, [%err]; // deque empty so return EMPTY

ret;

@not_empty:

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 181: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

RIGHT POP – TRY READ/REMOVE NODE

// $d9 = (RN, next.cnt+1, 0)

add_u64 $d8, $d8, 1;

shl_u64 $d8, $d8, 54;

shl_u64 $d9, RN, 62;

or_u64 $d9, $d8, $d9;

// cas(deq+k, next, node(RN, next.cnt+1, 0))

atomic_cas_global_scar_u64 $d9, [$s0], $d2, $d9;

cmp_neq_b1_u64 $c0, $d9, $d2;

cbr $c0, @cas_failed;

// $d9 = (RN, current.cnt+1, 0)

add_u64 $d6, $d6, 1;

shl_u64 $d6, $d6, 54;

shl_u64 $d9, RN, 62;

or_u64 $d9, $d9, $d6;

// cas(deq+(k-1), current, node(RN, current.cnt+1, 0))

atomic_cas_global_scar_u64 $d9, [$d3], $d1, $d9;

cmp_neq_b1_u64 $c0, $d9, $d1;

cbr $c0, @cas_failed;

st_arg_u32 SUCCESS, [&err];

st_arg_u64 $d7, [&value];

ret;

@cas_failed:

// loop back around and try again

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 182: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

TAKE AWAYS

HSA provides a powerful and modern memory model

Based on the well-known SC for DRF

Defined as Release Consistency

Extended with scopes as defined by HRF

OpenCL 2.0 introduces a new memory model

Also based on SC for DRF

Also defined in terms of Release Consistency

Also extended with scopes as defined in HRF

Has a well defined mapping to HSA

Concurrent algorithm development for emerging heterogeneous computing clusters can benefit from the HSA and OpenCL 2.0 memory models

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 183: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA QUEUING MODEL

HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER, ARM

Page 184: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA QUEUEING, MOTIVATION

Page 185: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MOTIVATION (TODAY’S PICTURE)

© Copyright 2014 HSA Foundation. All Rights Reserved

[Diagram: today's flow across Application, OS, and GPU. The application copies/maps memory to transfer a buffer to the GPU and queues the job; the OS schedules the job; the GPU starts and finishes the job; the OS schedules the application, which gets the buffer back via another copy/map.]

Page 186: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA QUEUEING: REQUIREMENTS

Page 187: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

REQUIREMENTS

Three key technologies are used to build the user mode queueing

mechanism

Shared Virtual Memory

System Coherency

Signaling

AQL (Architected Queueing Language) enables any agent to enqueue tasks

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 188: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SHARED VIRTUAL MEMORY

Page 189: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PHYSICAL MEMORY

SHARED VIRTUAL MEMORY (TODAY)

Multiple Virtual memory address spaces

© Copyright 2014 HSA Foundation. All Rights Reserved

CPU0 GPU

VIRTUAL MEMORY1

PHYSICAL MEMORY

VA1->PA1 VA2->PA1

VIRTUAL MEMORY2

Page 190: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PHYSICAL MEMORY

SHARED VIRTUAL MEMORY (HSA)

Common Virtual Memory for all HSA agents

© Copyright 2014 HSA Foundation. All Rights Reserved

CPU0 GPU

VIRTUAL MEMORY

PHYSICAL MEMORY

VA->PA VA->PA

Page 191: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SHARED VIRTUAL MEMORY

Advantages

No mapping tricks, no copying back-and-forth between different PA addresses

Send pointers (not data) back and forth between HSA agents.

Implications

Common Page Tables (and common interpretation of architectural semantics such as shareability, protection, etc).

Common mechanisms for address translation (and servicing address translation faults)

Concept of a process address space (PASID) to allow multiple, per process virtual address spaces within the system.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 192: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SHARED VIRTUAL MEMORY

Specifics

Minimum supported VA width is 48b for 64b systems, and 32b for 32b systems.

HSA agents may reserve VA ranges for internal use via system software.

All HSA agents other than the host unit must use the lowest privilege level.

If present, read/write access flags for page tables must be maintained by all agents.

Read/write permissions apply to all HSA agents, equally.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 193: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GETTING THERE …

© Copyright 2014 HSA Foundation. All Rights Reserved

[Same Application/OS/GPU flow diagram as on page 185.]

Page 194: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CACHE COHERENCY

Page 195: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CACHE COHERENCY DOMAINS (1/3)

Data accesses to global memory segment from all HSA Agents shall be

coherent without the need for explicit cache maintenance.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 196: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CACHE COHERENCY DOMAINS (2/3)

Advantages

Composability

Reduced SW complexity when communicating between agents

Lower barrier to entry when porting software

Implications

Hardware coherency support between all HSA agents

Can take many forms

Stand alone Snoop Filters / Directories

Combined L3/Filters

Snoop-based systems (no filter)

Etc …

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 197: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CACHE COHERENCY DOMAINS (3/3)

Specifics

No requirement for instruction memory accesses to be coherent

Only applies to the Primary memory type.

No requirement for HSA agents to maintain coherency to any memory location where the HSA agents do not specify the same memory attributes

Read-only image data is required to remain static during the execution of an HSA kernel.

No double mapping (via different attributes) in order to modify. Must remain static

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 198: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GETTING CLOSER …

© Copyright 2014 HSA Foundation. All Rights Reserved

[Same Application/OS/GPU flow diagram as on page 185.]

Page 199: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNALING

Page 200: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNALING (1/3)

HSA agents support the ability to use signaling objects

All creation/destruction of signaling objects occurs via HSA runtime APIs

From an HSA agent you can directly access signaling objects:

Signal a signal object (this will wake up HSA agents waiting upon the object)

Query the current object value

Wait on the current value (various conditions supported)

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 201: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNALING (2/3)

Advantages

Enables asynchronous events between HSA agents, without involving the kernel

Common idiom for work offload

Low power waiting

Implications

Runtime support required

Commonly implemented on top of cache coherency flows

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 202: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SIGNALING (3/3)

Specifics

Only supported within a PASID

Supported wait conditions are =, !=, < and >=

Wait operations may return sporadically (no guarantee against false positives)

Programmer must test.

Wait operations have a maximum duration before returning.

The HSAIL atomic operations are supported on signal objects.

Signal objects are opaque

Must use dedicated HSAIL/HSA runtime operations
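The four supported wait conditions can be sketched as a small helper. This is an illustrative mock: the enum and function names below are invented for the sketch, not part of the HSA runtime API.

```cpp
#include <cstdint>

// Hypothetical encoding of the four architected wait conditions (=, !=, <, >=).
enum hsa_wait_cond_t { COND_EQ, COND_NE, COND_LT, COND_GTE };

// Returns true when the observed signal value satisfies the wait condition.
// Since real waits may return sporadically or on timeout, the programmer
// must re-test the condition exactly like this after every wake-up.
bool wait_condition_met(hsa_wait_cond_t cond, int64_t value, int64_t compare) {
    switch (cond) {
        case COND_EQ:  return value == compare;
        case COND_NE:  return value != compare;
        case COND_LT:  return value <  compare;
        case COND_GTE: return value >= compare;
    }
    return false;
}
```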

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 203: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ALMOST THERE…

© Copyright 2014 HSA Foundation. All Rights Reserved

[Same Application/OS/GPU flow diagram as on page 185.]

Page 204: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

USER MODE QUEUING

Page 205: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ONE BLOCK LEFT

© Copyright 2014 HSA Foundation. All Rights Reserved

[Same Application/OS/GPU flow diagram as on page 185.]

Page 206: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

USER MODE QUEUEING (1/3)

User mode Queueing

Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.

Queues are created/destroyed via calls to the HSA runtime.

One (or many) agents enqueue packets, a single agent dequeues packets.

Requires coherency and shared virtual memory.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 207: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

USER MODE QUEUEING (2/3)

Advantages

Avoid involving the kernel/driver when dispatching work for an Agent.

Lower latency job dispatch enables finer granularity of offload

Standard memory protection mechanisms may be used to protect communication with

the consuming agent.

Implications

Packet formats/fields are Architected – standard across vendors!

Guaranteed backward compatibility

Packets are enqueued/dequeued via an Architected protocol (all via memory

accesses and signaling)

More on this later……

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 208: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SUCCESS!

© Copyright 2014 HSA Foundation. All Rights Reserved

[Same Application/OS/GPU flow diagram as on page 185.]

Page 209: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SUCCESS!

© Copyright 2014 HSA Foundation. All Rights Reserved

Application OS GPU

Queue Job

Start Job

Finish Job

Page 210: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ARCHITECTED QUEUEING

LANGUAGE, QUEUES

Page 211: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ARCHITECTED QUEUEING LANGUAGE

HSA Queues look just like standard shared memory queues, supporting multi-producer, single-consumer

Single producer variant defined with some optimizations possible.

Queues consist of storage, read/write indices, ID, etc.

Queues are created/destroyed via calls to the HSA runtime

“Packets” are placed in queues directly from user mode, via an architected protocol

Packet format is architected

© Copyright 2014 HSA Foundation. All Rights Reserved

[Diagram: multiple producers place Packets at the Write Index into storage in coherent, shared memory; a single consumer removes them at the Read Index.]

Page 212: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ARCHITECTED QUEUING LANGUAGE

Packets are read and dispatched for execution from the queue in order, but may complete in any order.

There is no guarantee that more than one packet will be processed in parallel at a time

There may be many queues. A single agent may also consume from several queues.

Any HSA agent may enqueue packets

CPUs

GPUs

Other accelerators

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 213: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

QUEUE STRUCTURE

© Copyright 2014 HSA Foundation. All Rights Reserved

Offset (bytes) Size (bytes) Field Notes

0 4 queueType Differentiate different queues

4 4 queueFeatures Indicate supported features

8 8 baseAddress Pointer to packet array

16 8 doorbellSignal HSA signaling object handle

24 4 size Packet array cardinality

28 4 queueId Unique per process

32 8 serviceQueue Queue for callback services

intrinsic 8 writeIndex Packet array write index

intrinsic 8 readIndex Packet array read index
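The table above can be mirrored as a plain C++ struct whose natural layout reproduces the architected offsets. This is a sketch, assuming the doorbellSignal handle occupies 8 bytes (as the offsets of the following fields imply); the struct name is illustrative, not an HSA runtime type.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the queue structure laid out per the table above.
struct hsa_queue_layout {
    uint32_t queueType;      // offset 0: differentiate different queues
    uint32_t queueFeatures;  // offset 4: indicate supported features
    uint64_t baseAddress;    // offset 8: pointer to packet array
    uint64_t doorbellSignal; // offset 16: HSA signaling object handle
    uint32_t size;           // offset 24: packet array cardinality (power of 2)
    uint32_t queueId;        // offset 28: unique per process
    uint64_t serviceQueue;   // offset 32: queue for callback services
};

// The natural C++ layout reproduces the architected offsets.
static_assert(offsetof(hsa_queue_layout, baseAddress) == 8, "baseAddress");
static_assert(offsetof(hsa_queue_layout, doorbellSignal) == 16, "doorbellSignal");
static_assert(offsetof(hsa_queue_layout, size) == 24, "size");
static_assert(offsetof(hsa_queue_layout, serviceQueue) == 32, "serviceQueue");
```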

Page 214: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

QUEUE VARIANTS

queueType and queueFeatures together define queue semantics and

capabilities

Two queueType values defined, other values reserved:

MULTI – queue supports multiple producers

SINGLE – queue supports single producer

queueFeatures is a bitfield indicating capabilities

DISPATCH (bit 0) if set then queue supports DISPATCH packets

AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets

All other bits are reserved and must be 0

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 215: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

QUEUE STRUCTURE DETAILS

Queue doorbells are HSA signaling objects with restrictions

Created as part of the queue – lifetime tied to queue object

Atomic read-modify-write not allowed

size field value must be a power of 2

serviceQueue can be used by HSA kernel for callback services

Provided by application when queue is created

Can be mapped to an HSA runtime provided serviceQueue, an application serviced queue, or NULL if no serviceQueue is required

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 216: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

READ/WRITE INDICES

readIndex and writeIndex properties are part of the queue, but not visible in the queue structure

Accessed through HSA runtime API and HSAIL operations

HSA runtime/HSAIL operations defined to

Read readIndex or writeIndex property

Write readIndex or writeIndex property

Add constant to writeIndex property (returns previous writeIndex value)

CAS on writeIndex property

readIndex & writeIndex operations treated as atomic in memory model

relaxed, acquire, release and acquire-release variants defined as applicable

readIndex and writeIndex never wrap

PacketID – the index of a particular packet

Uniquely identifies each packet of a queue
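Because the indices never wrap and the packet array size is a power of two, mapping a packetID to its array slot is a single mask. A small illustrative helper (the function name is invented, not an HSA runtime API):

```cpp
#include <cstdint>

// readIndex/writeIndex grow monotonically and never wrap, so a packetID is a
// global position in the stream; the packet's array slot is packetID modulo
// the queue size, computed with a mask since the size is a power of two.
inline uint32_t packet_slot(uint64_t packetID, uint32_t queueSize) {
    return static_cast<uint32_t>(packetID & (uint64_t(queueSize) - 1));
}
```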

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 217: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PACKET ENQUEUE

Packet enqueue follows a few simple steps:

Reserve space

Multiple packets can be reserved at a time

Write packet to queue

Mark packet as valid

Producer no longer allowed to modify packet

Consumer is allowed to start processing packet

Notify consumer of packet through the queue doorbell

Multiple packets can be notified at a time

Doorbell signal should be signaled with the last packetID notified

On the small machine model, the lower 32 bits of the packetID are used

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 218: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PACKET RESERVATION

Two flows envisaged

Atomic add writeIndex with number of packets to reserve

Producer must wait until packetID < readIndex + size before writing to packet

Queue can be sized so that wait is unlikely (or impossible)

Suitable when many threads use one queue

Check queue not full first, then use atomic CAS to update writeIndex

Can be inefficient if many threads use the same queue

Allows different failure model if queue is congested
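The two reservation flows above can be sketched against a mock writeIndex using C++ atomics. `mock_queue`, `reserve_add`, and `reserve_cas` are invented names for illustration, not HSA runtime API.

```cpp
#include <atomic>
#include <cstdint>

// Mock of the queue indices involved in packet reservation.
struct mock_queue {
    std::atomic<uint64_t> writeIndex{0};
    std::atomic<uint64_t> readIndex{0};
    uint32_t size = 0; // packet array cardinality (power of 2)
};

// Flow 1: unconditional atomic add. The caller must then wait until
// packetID < readIndex + size before writing the packet slot.
uint64_t reserve_add(mock_queue& q, uint32_t count) {
    return q.writeIndex.fetch_add(count, std::memory_order_relaxed);
}

// Flow 2: check the queue is not full first, then CAS the writeIndex.
// Fails instead of blocking when the queue is congested.
bool reserve_cas(mock_queue& q, uint64_t& packetID) {
    uint64_t w = q.writeIndex.load(std::memory_order_relaxed);
    if (w >= q.readIndex.load(std::memory_order_relaxed) + q.size)
        return false; // queue full: report failure rather than spin
    if (q.writeIndex.compare_exchange_strong(w, w + 1,
                                             std::memory_order_relaxed)) {
        packetID = w;
        return true;
    }
    return false; // another producer raced us; caller may retry
}
```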

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 219: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

QUEUE OPTIMIZATIONS

Queue behavior is loosely defined to allow optimizations

Some potential producer behavior optimizations:

Keep local copy of readIndex, update when required

For single producer queues:

Keep local copy of writeIndex

Use store operation rather than add/cas atomic to update writeIndex

Some potential consumer behavior optimizations:

Use the packet format field to determine whether a packet has been submitted rather than the writeIndex property

Speculatively read multiple packets from the queue

Do not update readIndex for each packet processed

Rely on the value used for doorbellSignal to notify new packets

Especially useful for single producer queues

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 220: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

POTENTIAL MULTI-PRODUCER ALGORITHM

// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);

// Wait until the queue is no longer full.
uint64_t rdIdx;
do {
  rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));

// Calculate index
uint32_t arrayIdx = packetID & (q->size - 1);

// Copy over the packet; the format field is INVALID
q->baseAddress[arrayIdx] = pkt;

// Update format field with release semantics
q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);

// Ring doorbell (could also amortize over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 221: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

POTENTIAL CONSUMER ALGORITHM

// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);

// Calculate the index
uint32_t arrayIdx = readIndex & (q->size - 1);

// Spin while empty (could also perform low-power wait on doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }

// Copy over the packet
pkt = q->baseAddress[arrayIdx];

// Set the format field to invalid
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);

// Update the readIndex using HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex + 1);

// Now process <pkt>!
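The producer and consumer protocols above can be exercised together in a self-contained mock ring. All names here are illustrative (not HSA runtime API); the format-field handshake follows the release/acquire pattern of the two algorithms.

```cpp
#include <atomic>
#include <cstdint>

// Mock AQL-style ring using the packet format field as the valid flag.
enum pkt_format : uint8_t { FMT_INVALID = 0, FMT_DISPATCH = 1 };

struct ring_packet {
    std::atomic<uint8_t> format{FMT_INVALID};
    uint32_t payload{0};
};

struct ring_queue {
    static constexpr uint32_t size = 4; // power of 2
    ring_packet slots[size];
    std::atomic<uint64_t> writeIndex{0};
    std::atomic<uint64_t> readIndex{0};
};

void produce(ring_queue& q, uint32_t payload) {
    uint64_t id = q.writeIndex.fetch_add(1, std::memory_order_relaxed);
    ring_packet& p = q.slots[id & (q.size - 1)];
    p.payload = payload;                       // write body while still INVALID
    p.format.store(FMT_DISPATCH, std::memory_order_release); // mark valid last
}

uint32_t consume(ring_queue& q) {
    uint64_t id = q.readIndex.load(std::memory_order_relaxed);
    ring_packet& p = q.slots[id & (q.size - 1)];
    while (p.format.load(std::memory_order_acquire) == FMT_INVALID) { } // spin
    uint32_t payload = p.payload;
    p.format.store(FMT_INVALID, std::memory_order_relaxed); // recycle the slot
    q.readIndex.store(id + 1, std::memory_order_release);
    return payload;
}
```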

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 222: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ARCHITECTED QUEUEING

LANGUAGE, PACKETS

Page 223: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PACKETS

© Copyright 2014 HSA Foundation. All Rights Reserved

Packets have architected layouts and come in three main types, plus two placeholder formats:

Always reserved & Invalid

Do not contain any valid tasks and are not processed (the queue will not progress)

Dispatch

Specifies kernel execution over a grid

Agent Dispatch

Specifies a single function to perform with a set of parameters

Barrier

Used for task dependencies

Page 224: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

COMMON PACKET HEADER

Start Offset (Bytes) Format Field Name Description

0 uint16_t format:8 Contains the packet type (Always reserved, Invalid, Dispatch, Agent Dispatch, and Barrier). Other values are reserved and should not be used.

barrier:1 If set then processing of the packet will only begin when all preceding packets are complete.

acquireFenceScope:2 Determines the scope and type of the memory fence operation applied before the packet enters the active phase. Must be 0 for Barrier packets.

releaseFenceScope:2 Determines the scope and type of the memory fence operation applied after kernel completion but before the packet is completed.

reserved:3 Must be 0

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 225: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

DISPATCH PACKET

© Copyright 2014 HSA Foundation. All Rights Reserved

Start Offset (Bytes) Format Field Name Description

0 uint16_t header Packet header

2 uint16_t dimensions:2 Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.

reserved:14 Must be 0.

4 uint16_t workgroupSize.x x dimension of work-group (measured in work-items).

6 uint16_t workgroupSize.y y dimension of work-group (measured in work-items).

8 uint16_t workgroupSize.z z dimension of work-group (measured in work-items).

10 uint16_t reserved2 Must be 0.

12 uint32_t gridSize.x x dimension of grid (measured in work-items).

16 uint32_t gridSize.y y dimension of grid (measured in work-items).

20 uint32_t gridSize.z z dimension of grid (measured in work-items).

24 uint32_t privateSegmentSizeBytes Total size in bytes of private memory allocation request (per work-item).

28 uint32_t groupSegmentSizeBytes Total size in bytes of group memory allocation request (per work-group).

32 uint64_t kernelObjectAddress Address of an object in memory that includes an implementation-defined executable ISA image for the kernel.

40 uint64_t kernargAddress Address of memory containing kernel arguments.

48 uint64_t reserved3 Must be 0.

56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
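As a sketch, the Dispatch packet table maps onto a 64-byte C++ struct; the dimensions:2/reserved:14 bitfields are kept as one raw 16-bit field for portability, and the struct name is illustrative rather than an HSA runtime type.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the Dispatch packet layout from the table above.
struct dispatch_packet {
    uint16_t header;                  // 0: packet header
    uint16_t dimensions_and_reserved; // 2: dimensions:2, reserved:14
    uint16_t workgroupSizeX;          // 4
    uint16_t workgroupSizeY;          // 6
    uint16_t workgroupSizeZ;          // 8
    uint16_t reserved2;               // 10
    uint32_t gridSizeX;               // 12
    uint32_t gridSizeY;               // 16
    uint32_t gridSizeZ;               // 20
    uint32_t privateSegmentSizeBytes; // 24
    uint32_t groupSegmentSizeBytes;   // 28
    uint64_t kernelObjectAddress;     // 32
    uint64_t kernargAddress;          // 40
    uint64_t reserved3;               // 48
    uint64_t completionSignal;        // 56
};

// Natural alignment reproduces every architected offset and the 64-byte size.
static_assert(offsetof(dispatch_packet, gridSizeX) == 12, "gridSize.x");
static_assert(offsetof(dispatch_packet, kernelObjectAddress) == 32, "kernelObjectAddress");
static_assert(offsetof(dispatch_packet, completionSignal) == 56, "completionSignal");
static_assert(sizeof(dispatch_packet) == 64, "64-byte packet");
```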

Page 226: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

AGENT DISPATCH PACKET

© Copyright 2014 HSA Foundation. All Rights Reserved

Start Offset (Bytes) Format Field Name Description

0 uint16_t header Packet header

2 uint16_t type The function to be performed by the destination agent. The type value is split into the following ranges:

0x0000:0x3FFF – Vendor specific

0x4000:0x7FFF – HSA runtime

0x8000:0xFFFF – User registered function

4 uint32_t reserved2 Must be 0.

8 uint64_t returnLocation Pointer to the location to store the function return value in.

16 uint64_t arg[0] 64-bit direct or indirect arguments.

24 uint64_t arg[1]

32 uint64_t arg[2]

40 uint64_t arg[3]

48 uint64_t reserved3 Must be 0.

56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.

Page 227: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

BARRIER PACKET

Used for specifying dependences between packets

HSA agent will not launch any further packets from this queue until the barrier

packet signal conditions are met

Used for specifying dependences on packets dispatched from any queue.

Execution phase completes only when all of the dependent signals (up to five) have

been signaled (with the value of 0).

Or if an error has occurred in one of the packets upon which we have a dependence.
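The completion rule above — all dependent signals (up to five) signalled with 0 — can be sketched as a simple predicate over observed signal values. An illustrative helper only, not the packet processor itself; error propagation from dependent packets is omitted.

```cpp
#include <cstdint>

// Sketch of the Barrier packet completion rule: the execution phase ends
// only when every tracked dependent signal value has reached 0.
bool barrier_conditions_met(const int64_t* depSignalValues, int count) {
    for (int i = 0; i < count; ++i)
        if (depSignalValues[i] != 0)
            return false; // still waiting on this dependency
    return true;          // all (up to five) dependencies signalled with 0
}
```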

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 228: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

BARRIER PACKET

© Copyright 2014 HSA Foundation. All Rights Reserved

Start Offset (Bytes) Format Field Name Description

0 uint16_t header Packet header, see 2.8.1 Packet header (p. 16).

2 uint16_t reserved2 Must be 0.

4 uint32_t reserved3 Must be 0.

8 uint64_t depSignal0

Address of dependent signaling objects to be evaluated by the packet processor.

16 uint64_t depSignal1

24 uint64_t depSignal2

32 uint64_t depSignal3

40 uint64_t depSignal4

48 uint64_t reserved4 Must be 0.

56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.

Page 229: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

DEPENDENCES

A user may never assume more than one packet is being executed by an HSA

agent at a time.

Implications:

Packets can’t poll on shared memory values which will be set by packets issued from

other queues, unless the user has ensured the proper ordering.

To ensure all previous packets from a queue have been completed, use the Barrier

bit.

To ensure specific packets from any queue have completed, use the Barrier packet.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 230: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA QUEUEING, PACKET EXECUTION

Page 231: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PACKET EXECUTION

Launch phase

Initiated when launch conditions are met

All preceding packets in the queue must have exited launch phase

If the barrier bit in the packet header is set, then all preceding packets in the queue must have exited completion phase

Includes memory acquire fence

Active phase

Execute the packet

Barrier packets remain in Active phase until conditions are met.

Completion phase

First step is memory release fence – make results visible.

completionSignal field is then signaled with a decrementing atomic.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 232: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PACKET EXECUTION – BARRIER BIT

© Copyright 2014 HSA Foundation. All Rights Reserved

[Timeline: Pkt1 and Pkt2 launch and execute concurrently and complete; Pkt3 (barrier=1) launches only after both have completed, then executes.]

Pkt3 launches when all preceding packets in the queue have completed.

Page 233: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PUTTING IT ALL TOGETHER (FFT)

© Copyright 2014 HSA Foundation. All Rights Reserved

[Diagram: FFT dataflow over inputs X[0]–X[7]. Packets 1 and 2 compute the first stage; a barrier; Packets 3 and 4 compute the second stage; another barrier; Packets 5 and 6 compute the final stage.]

Page 234: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PUTTING IT ALL TOGETHER

© Copyright 2014 HSA Foundation. All Rights Reserved

AQL Pseudo Code

// Send the packets to do the first stage.
aql_dispatch(pkt1);
aql_dispatch(pkt2);

// Send the next two packets, setting the barrier bit so we
// know packets 1 & 2 will be complete before 3 and 4 are
// launched.
aql_dispatch_with_barrier_bit(pkt3);
aql_dispatch(pkt4);

// Same as above (make sure 3 & 4 are done before issuing 5 & 6)
aql_dispatch_with_barrier_bit(pkt5);
aql_dispatch(pkt6);

// This packet will notify us when 5 & 6 are complete
aql_dispatch_with_barrier_bit(finish_pkt);

Page 235: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PACKET EXECUTION – BARRIER PACKET

© Copyright 2014 HSA Foundation. All Rights Reserved

[Timeline: queue Q1 runs task T1; queue Q2 holds a Barrier packet followed by task T2. The Barrier's depSignal0 references signal X, initialized to 1; T1's completionSignal decrements X. The Barrier completes when signal X is signalled with 0, and T2 launches once the Barrier is complete.]

Page 236: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

DEPTH FIRST CHILD TASK EXECUTION

Consider two generations of child tasks

Task T submits tasks T.1 & T.2

Task T.1 submits tasks T.1.1 & T.1.2

Task T.2 submits tasks T.2.1 & T.2.2

Desired outcome

Depth first child task execution

I.e. T → T.1 → T.1.1 → T.1.2 → T.2 → T.2.1 → T.2.2

T is passed a signal (allComplete) to decrement when all tasks are complete (T and its children, etc.)

© Copyright 2014 HSA Foundation. All Rights Reserved

[Task tree: T → T.1, T.2; T.1 → T.1.1, T.1.2; T.2 → T.2.1, T.2.2]

Page 237: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HOW TO DO THIS WITH HSA QUEUES?

Use a separate user mode queue for each recursion level

Task T submits to queue Q1

Tasks T.1 & T.2 submit tasks to queue Q2

Queues could be passed in as parameters to task T

Depth first requires ordering of T.1, T.2 and their children

Use additional signal object (childrenComplete) to track completion of the children of

T.1 & T.2

childrenComplete set to number of children (i.e. 2) by each of T.1 & T.2

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 238: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

A PICTURE SAYS MORE THAN 1000 WORDS

© Copyright 2014 HSA Foundation. All Rights Reserved

[Diagram: the same task tree mapped onto queues. Q1 holds T, then T.1 and T.2, then a T.1 Barrier and a T.2 Barrier; each barrier waits on the childrenComplete signal, and the final barrier signals allComplete. Q2 holds T.1.1, T.1.2, T.2.1, T.2.2.]

Page 239: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SUMMARY

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 240: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

KEY HSA TECHNOLOGIES

HSA combines several mechanisms to enable low overhead task

dispatch

Shared Virtual Memory

System Coherency

Signaling

AQL

User mode queues – from any compatible agent

Architected packet format

Rich dependency mechanism

Flexible and efficient signaling of completion

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 241: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

QUESTIONS?

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 242: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA APPLICATIONS

WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS

WITH J.P. BORDES AND JUAN GOMEZ

Page 243: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

USE CASES SHOWING HSA ADVANTAGE

Programming Technique / Use Case / Description / HSA Advantage

Pointer-based Data Structures / Binary tree searches / GPU performs parallel searches in a CPU-created binary tree. / CPU and GPU have access to the entire unified coherent memory; the GPU can access existing data structures containing pointers.

Platform Atomics / Work-group dynamic task management / GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads. / CPU and GPU can synchronize using platform atomics.

Platform Atomics / Binary tree updates / CPU and GPU operate simultaneously on the tree, both doing modifications. / Higher performance through parallel operations, reducing the need for data copying and reconciling.

Large Data Sets / Hierarchical data searches / Applications include object recognition, collision detection, global illumination, BVH. / CPU and GPU have access to the entire unified coherent memory; the GPU can operate on huge models in place, reducing copy and kernel launch overhead.

CPU Callbacks / Middleware user-callbacks / GPU processes work items, some of which require a call to a CPU function to fetch new data. / The GPU can invoke CPU functions from within a GPU kernel; simpler programming (no "split kernels"); higher performance through parallel operations.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 244: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY

FOR POINTER-BASED DATA STRUCTURES

Page 245: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY
MORE EFFICIENT POINTER DATA STRUCTURES

Legacy

[Diagram: SYSTEM MEMORY holds the pointer-based TREE (L/R links) and a RESULT BUFFER; GPU MEMORY holds a FLAT TREE and its own RESULT BUFFER; the GPU KERNEL operates on the GPU-memory copies.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 246: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY
MORE EFFICIENT POINTER DATA STRUCTURES

Legacy

[Same legacy diagram as on page 245 (animation frame).]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 247: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY
MORE EFFICIENT POINTER DATA STRUCTURES

Legacy

[Same legacy diagram as on page 245 (animation frame).]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 248: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY
MORE EFFICIENT POINTER DATA STRUCTURES

Legacy

[Same legacy diagram as on page 245 (animation frame).]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 249: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES

Legacy

[Animation frame of the Legacy diagram: the GPU KERNEL searches the flat tree and writes to the RESULT BUFFER.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 250: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES

Legacy

[Animation frame of the Legacy diagram.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 251: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES

Legacy

[Animation frame of the Legacy diagram: results are copied back from GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 252: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES

HSA and full OpenCL 2.0

[Diagram: the GPU KERNEL traverses the CPU-built pointer TREE in place in SYSTEM MEMORY and writes directly to the RESULT BUFFER; no flattening or copying is required.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 253: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES

HSA

[Animation frame of the HSA diagram: in-place traversal.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 254: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES

HSA

[Animation frame of the HSA diagram.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 255: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES

HSA

[Animation frame of the HSA diagram.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 256: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES

HSA

[Animation frame of the HSA diagram.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 257: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

POINTER DATA STRUCTURES - CODE COMPLEXITY

[Side-by-side source listings comparing the HSA and Legacy versions of the search code.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 258: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

POINTER DATA STRUCTURES - PERFORMANCE

[Chart: Binary Tree Search. Y-axis: search rate (nodes/ms), 0 to 60,000; X-axis: tree size (# nodes): 1M, 5M, 10M, 25M. Series: CPU (1 core), CPU (4 core), Legacy APU, HSA APU.]

Measured in AMD labs Jan 1-3 on system shown in backup slide

© Copyright 2014 HSA Foundation. All Rights Reserved
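To make the pointer-chasing concrete, here is a minimal binary tree search in plain C (an illustrative sketch of ours, not code from the tutorial). The point of the HSA numbers above is that, with unified coherent memory, a GPU kernel can walk exactly this kind of CPU-built node layout in place, with no flattening step:

```c
#include <stddef.h>

/* A CPU-built search tree node with left/right (L/R) child pointers. */
typedef struct Node {
    int key;
    struct Node *left, *right;
} Node;

/* Iterative search: follow pointers until the key is found or we fall
 * off the tree. The structure is searched exactly as the CPU built it;
 * nothing is serialized or copied. */
int tree_contains(const Node *n, int key) {
    while (n != NULL) {
        if (key == n->key)
            return 1;
        n = (key < n->key) ? n->left : n->right;
    }
    return 0;
}
```

In the legacy flow the same search must first flatten these nodes into an index-based array; in the HSA flow the `left`/`right` pointers remain valid on both agents.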

Page 259: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS FOR DYNAMIC TASK MANAGEMENT

Page 260: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

[Diagram: the CPU keeps a TASKS POOL and per-queue counters (NUM. WRITTEN TASKS, NUM. CONSUMED TASKS) for QUEUE 1 and QUEUE 2 in SYSTEM MEMORY; GPU WORK-GROUPS 1-4 work from shadow copies of the queues and counters in GPU MEMORY. All counters start at 0.]

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 261: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: the CPU asynchronously transfers tasks from the TASKS POOL into the queues in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 262: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: the CPU-side NUM. WRITTEN TASKS for QUEUE 1 is updated to 4.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 263: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: the written-tasks count (4) is asynchronously transferred to the work-groups' copy in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 264: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: the work-groups now see NUM. WRITTEN TASKS = 4.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 265: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: a work-group claims a task with an atomic add; the GPU-side NUM. CONSUMED TASKS becomes 1.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 266: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: NUM. CONSUMED TASKS = 1.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 267: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: another atomic add; NUM. CONSUMED TASKS becomes 2.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 268: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: NUM. CONSUMED TASKS = 2.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 269: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: another atomic add; NUM. CONSUMED TASKS becomes 3.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 270: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: NUM. CONSUMED TASKS = 3.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 271: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: a final atomic add; NUM. CONSUMED TASKS becomes 4.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 272: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy

[Animation frame: the consumed count (4) is propagated back to the CPU-side counters (labelled "Zero-copy").]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 273: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Diagram: the TASKS POOL, QUEUE 1, QUEUE 2, and the NUM. WRITTEN TASKS / NUM. CONSUMED TASKS counters live in HOST COHERENT MEMORY, where both the CPU and GPU WORK-GROUPS 1-4 access them directly; GPU MEMORY holds no shadow copies. All counters start at 0.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 274: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Animation frame: the CPU memcpys tasks from the TASKS POOL into the queues in HOST COHERENT MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 275: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Animation frame: NUM. WRITTEN TASKS is updated to 4.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 276: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Animation frame: the work-groups read NUM. WRITTEN TASKS = 4 directly from HOST COHERENT MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 277: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Animation frame: a work-group claims a task with a platform atomic add; NUM. CONSUMED TASKS becomes 1.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 278: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Animation frame: NUM. CONSUMED TASKS = 1.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 279: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Animation frame: another platform atomic add; NUM. CONSUMED TASKS becomes 2.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 280: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Animation frame: NUM. CONSUMED TASKS = 2.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 281: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Animation frame: another platform atomic add; NUM. CONSUMED TASKS becomes 3.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 282: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Animation frame: NUM. CONSUMED TASKS = 3.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 283: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Animation frame: the fourth task is claimed with a platform atomic add; NUM. CONSUMED TASKS becomes 4, immediately visible to the CPU.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 284: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS – CODE COMPLEXITY

HSA | Legacy

Host enqueue function: 20 lines of code (HSA) vs. 102 lines of code (Legacy)

© Copyright 2014 HSA Foundation. All Rights Reserved
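The shared counters in the HSA version can be sketched with ordinary C11 atomics (an illustration under names and simplifications of our own, not the tutorial's actual 20-line enqueue function). On an HSA system the same read-modify-write operations would target counters placed in host coherent memory, visible to both the CPU producer and the GPU work-groups:

```c
#include <stdatomic.h>

/* Hypothetical task-queue bookkeeping shared by producer and consumers.
 * With platform atomics, both agents update the same counters in place
 * instead of shuttling whole queue copies between memories. */
typedef struct {
    atomic_int num_written;   /* tasks the producer has published      */
    atomic_int num_consumed;  /* tasks handed out to consumers so far  */
} TaskQueue;

/* Producer side: publish n new tasks (after writing the task records). */
void publish_tasks(TaskQueue *q, int n) {
    atomic_fetch_add_explicit(&q->num_written, n, memory_order_release);
}

/* Consumer side: atomically claim one task index, or -1 if none left.
 * Simplified sketch: the undo on overshoot is adequate for this demo
 * but a production queue would use a bounded-claim scheme instead. */
int claim_task(TaskQueue *q) {
    int mine = atomic_fetch_add_explicit(&q->num_consumed, 1,
                                         memory_order_acquire);
    if (mine < atomic_load_explicit(&q->num_written, memory_order_acquire))
        return mine;                            /* exclusive ownership  */
    atomic_fetch_sub(&q->num_consumed, 1);      /* nothing to take      */
    return -1;
}
```

The Legacy path needs the extra asynchronous transfers shown in the frames above precisely because these counters cannot be updated atomically across the two memories.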

Page 285: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS - PERFORMANCE

[Chart: execution time (ms), 0 to 700, vs. tasks per insertion (64, 128, 256, 512) for task pool sizes 4096 and 16384. Series: Legacy implementation (ms), HSA implementation (ms).]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 286: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS FOR CPU/GPU COLLABORATION

Page 287: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION

Legacy

Only the GPU can work on the input array; concurrent processing is not possible.

[Diagram: the GPU KERNEL operating alone on the TREE / INPUT BUFFER.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 288: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION

Legacy

[Animation frame of the Legacy diagram.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 289: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION

Legacy

[Animation frame of the Legacy diagram.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 290: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION

HSA and full OpenCL 2.0

Both CPU and GPU operate on the same data structure concurrently.

[Diagram: the GPU KERNEL and CPU 0 / CPU 1 working on the TREE / INPUT BUFFER together.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 291: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION

HSA and full OpenCL 2.0

[Animation frame of the HSA diagram.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 292: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

UNIFIED COHERENT MEMORY FOR LARGE DATA SETS

Page 293: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESSING LARGE DATA SETS

The CPU creates a large data structure in System Memory. Computations using the data are offloaded to the GPU.

[Diagram: SYSTEM MEMORY and GPU.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 294: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESSING LARGE DATA SETS

The CPU creates a large data structure in System Memory. Computations using the data are offloaded to the GPU. Compare the HSA and Legacy methods.

[Diagram: a large 3D spatial data structure in SYSTEM MEMORY, organized as a hierarchy of Level 1 through Level 5, next to the GPU.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 295: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

LEGACY ACCESS USING GPU MEMORY

Legacy

GPU Memory is smaller, so the data has to be copied and processed in chunks.

[Diagram: SYSTEM MEMORY, GPU, GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 296: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

LEGACY ACCESS TO LARGE STRUCTURES

Legacy

[Diagram: the five-level (Level 1 to Level 5) large 3D spatial data structure in SYSTEM MEMORY, with the GPU and its smaller GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 297: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

COPY ONE CHUNK AT A TIME

Legacy

[Animation frame: a copy of the top 2 levels of the hierarchy moves from SYSTEM MEMORY into GPU MEMORY for the KERNEL.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 298: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESS ONE CHUNK AT A TIME

Legacy

[Animation frame: the FIRST KERNEL processes the copied top levels in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 299: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESS ONE CHUNK AT A TIME

Legacy

[Animation frame: the FIRST KERNEL continues processing in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 300: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESS ONE CHUNK AT A TIME

Legacy

[Animation frame: the FIRST KERNEL continues processing in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 301: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

COPY ONE CHUNK AT A TIME

Legacy

[Animation frame of the Legacy copy step.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 302: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

COPY ONE CHUNK AT A TIME

Legacy

[Animation frame: a copy of the bottom 3 levels of one branch of the hierarchy moves into GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 303: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESS ONE CHUNK AT A TIME

Legacy

[Animation frame: the SECOND KERNEL processes the copied branch in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 304: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESS ONE CHUNK AT A TIME

Legacy

[Animation frame: the SECOND KERNEL continues processing in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 305: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESS ONE CHUNK AT A TIME

Legacy

[Animation frame: the SECOND KERNEL continues processing in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 306: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

COPY ONE CHUNK AT A TIME

Legacy

[Animation frame of the Legacy copy step.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 307: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

COPY ONE CHUNK AT A TIME

Legacy

[Animation frame of the Legacy copy step.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 308: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

COPY ONE CHUNK AT A TIME

Legacy

[Animation frame: a copy of the bottom 3 levels of a different branch of the hierarchy moves into GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 309: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESS ONE CHUNK AT A TIME

Legacy

[Animation frame: the Nth KERNEL processes its chunk in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 310: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESS ONE CHUNK AT A TIME

Legacy

[Animation frame: the Nth KERNEL continues processing in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 311: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PROCESS ONE CHUNK AT A TIME

Legacy

[Animation frame: the Nth KERNEL continues processing in GPU MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 312: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

LARGE SPATIAL DATA STRUCTURE

HSA and full OpenCL 2.0

[Diagram: the five-level large 3D spatial data structure stays in SYSTEM MEMORY, and the GPU KERNEL accesses it there directly.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 313: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GPU CAN TRAVERSE ENTIRE HIERARCHY

HSA

[Diagram: the KERNEL traverses the five-level hierarchy in place in SYSTEM MEMORY.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 314: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GPU CAN TRAVERSE ENTIRE HIERARCHY

HSA

[Animation frame: hierarchy traversal continues.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 315: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GPU CAN TRAVERSE ENTIRE HIERARCHY

HSA

[Animation frame: hierarchy traversal continues.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 316: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GPU CAN TRAVERSE ENTIRE HIERARCHY

HSA

[Animation frame: hierarchy traversal continues.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 317: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GPU CAN TRAVERSE ENTIRE HIERARCHY

HSA

[Animation frame: hierarchy traversal continues.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 318: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CALLBACKS

Page 319: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CALLBACKS

Parallel processing algorithm with branches

A seldom taken branch requires new data from the CPU

On legacy systems, the algorithm must be split:

Process Kernel 1 on GPU

Check for CPU callbacks and if any, process on CPU

Process Kernel 2 on GPU

Example algorithm from Image Processing

Perform a filter

Calculate average LUMA in each tile

Compare LUMA against threshold and call CPU callback if exceeded (rare)

Perform special processing on tiles with callbacks

COMMON SITUATION IN HC

[Images: input image and output image.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 320: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CALLBACKS

Legacy

[Timeline diagram: GPU THREADS 0, 1, 2, ..., N across the two split kernels.]

The continuation kernel finishes up the first kernel's work, which results in poor GPU utilization.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 321: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CALLBACKS

[Images: input image and output image; 1 tile = 1 OpenCL work item.]

GPU

• Work items compute the average RGB value of all the pixels in a tile

• Work items also compute the average Luma from the average RGB

• If the average Luma > threshold, the workgroup invokes a CPU CALLBACK

• In parallel with the callback, compute continues

CPU

• For selected tiles, update the average Luma value (set to RED)

GPU

• Work items apply the Luma value to all pixels in the tile

GPU to CPU callbacks use Shared Virtual Memory (SVM) semaphores, implemented using Platform Atomic Compare-and-Swap.

© Copyright 2014 HSA Foundation. All Rights Reserved
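The SVM semaphore mentioned on this slide can be sketched with a C11 compare-and-swap (an illustrative sketch with names of our own, not the tutorial's implementation). A platform atomic compare-and-swap on a word in shared virtual memory is what lets a GPU work-group signal a CPU service thread, and the CPU signal completion back:

```c
#include <stdatomic.h>

/* A binary semaphore in memory visible to both agents. 0 = free, 1 = held. */
typedef atomic_int svm_sem;

/* Try to acquire: succeeds only on an atomic 0 -> 1 transition.
 * Returns 1 on success, 0 if the semaphore was already held. */
int sem_try_acquire(svm_sem *s) {
    int expected = 0;
    return atomic_compare_exchange_strong(s, &expected, 1);
}

/* Release: store 0 with release ordering so the other agent observes
 * all writes made while the semaphore was held. */
void sem_release(svm_sem *s) {
    atomic_store_explicit(s, 0, memory_order_release);
}
```

A callback handshake would pair two such flags: the work-group raises one to request service and spins (or yields) on the second, which the CPU raises when the new data is ready.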

Page 322: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CALLBACKS

HSA and full OpenCL 2.0

A few kernel threads need CPU callback services, but they are serviced immediately.

[Timeline diagram: GPU THREADS 0, 1, 2, ..., N run as a single kernel; CPU callbacks are serviced in parallel while the remaining threads continue.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 323: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

SUMMARY - HSA ADVANTAGE

Programming Technique | Use Case | Description | HSA Advantage

Pointer-based Data Structures | Binary tree searches | GPU performs parallel searches in a CPU-created binary tree. | CPU and GPU have access to the entire unified coherent memory. GPU can access existing data structures containing pointers.

Platform Atomics | Work-group dynamic task management | GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads. | CPU and GPU can synchronize using Platform Atomics.

Platform Atomics | Binary tree updates | CPU and GPU operate simultaneously on the tree, both doing modifications. | Higher performance through parallel operations, reducing the need for data copying and reconciling.

Large Data Sets | Hierarchical data searches | Applications include object recognition, collision detection, global illumination, BVH. | CPU and GPU have access to the entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.

CPU Callbacks | Middleware user-callbacks | GPU processes work items, some of which require a call to a CPU function to fetch new data. | GPU can invoke CPU functions from within a GPU kernel. Simpler programming: does not require “split kernels”. Higher performance through parallel operations.

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 324: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

QUESTIONS?

Page 325: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HSA COMPILATION

WEN-MEI HWU, CTO, MULTICOREWARE INC

WITH RAY I-JUI SUNG

Page 326: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

KEY HSA FEATURES FOR COMPILATION

ALL-PROCESSORS-EQUAL: GPU and CPU have equal flexibility to create and dispatch work items.

EQUAL ACCESS TO ENTIRE SYSTEM MEMORY: GPU and CPU have uniform visibility into the entire memory space.

[Diagram: CPU and GPU sharing Unified Coherent Memory and a Single Dispatch Path.]

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 327: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

A QUICK REVIEW OF OPENCL: CURRENT STATE OF PORTABLE HETEROGENEOUS PARALLEL PROGRAMMING

Page 328: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

DEVICE CODE IN OPENCL

SIMPLE MATRIX MULTIPLICATION

__kernel void

matrixMul(__global float* C, __global float* A, __global float* B, int wA, int wB) {

int tx = get_global_id(0);

int ty = get_global_id(1);

float value = 0;

for (int k = 0; k < wA; ++k)

{

float elementA = A[ty * wA + k];

float elementB = B[k * wB + tx];

value += elementA * elementB;

}

C[ty * wB + tx] = value;

}

Explicit thread index usage.

Reasonably readable.

Portable across CPUs, GPUs, and FPGAs

© Copyright 2014 HSA Foundation. All Rights Reserved
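For reference, here is the same computation written as a plain serial C loop nest (a sketch of ours, not part of the tutorial's slides). The two outer loops correspond to the 2D index space the work items cover via get_global_id(1) and get_global_id(0):

```c
/* Serial reference for the matrixMul kernel: C = A * B in row-major
 * layout, where A is hA x wA and B is wA x wB (so C is hA x wB). Each
 * (ty, tx) iteration is exactly what one OpenCL work item computes. */
void matrixMulHost(float *C, const float *A, const float *B,
                   int hA, int wA, int wB) {
    for (int ty = 0; ty < hA; ++ty) {       /* work item dimension 1 */
        for (int tx = 0; tx < wB; ++tx) {   /* work item dimension 0 */
            float value = 0;
            for (int k = 0; k < wA; ++k)
                value += A[ty * wA + k] * B[k * wB + tx];
            C[ty * wB + tx] = value;
        }
    }
}
```

The OpenCL runtime replaces the two outer loops with a 2D NDRange launch; only the innermost dot-product loop remains in the kernel body.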

Page 329: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

HOST CODE IN OPENCL - CONCEPTUAL

1. Allocate and initialize memory on the host side

2. Initialize OpenCL

3. Allocate device memory and move the data

4. Load and build device code

5. Launch kernel

a. Append arguments

6. Move the data back from the device

© Copyright 2014 HSA Foundation. All Rights Reserved

Page 330: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

int main(int argc, char** argv){

// set seed for rand()

srand(2006);

/****************************************************/

/* Allocate and initialize memory on Host Side */

/****************************************************/

// allocate and initialize host memory for matrices A and B

unsigned int size_A = WA * HA;

unsigned int mem_size_A = sizeof(float) * size_A;

float* h_A = (float*) malloc(mem_size_A);

unsigned int size_B = WB * HB;

unsigned int mem_size_B = sizeof(float) * size_B;

float* h_B = (float*) malloc(mem_size_B);

randomInit(h_A, size_A);

randomInit(h_B, size_B);

// allocate host memory for the result C

unsigned int size_C = WC * HC;

unsigned int mem_size_C = sizeof(float) * size_C;

float* h_C = (float*) malloc(mem_size_C);

/*****************************************/

/* Initialize OpenCL */

/*****************************************/

// OpenCL specific variables

cl_context clGPUContext;

cl_command_queue clCommandQue;

cl_program clProgram;

size_t dataBytes;

size_t kernelLength;

cl_int errcode;

cl_kernel clKernel;

// OpenCL device memory pointers for matrices

cl_mem d_A;

cl_mem d_B;

cl_mem d_C;

clGPUContext = clCreateContextFromType(0,

CL_DEVICE_TYPE_GPU,

NULL, NULL, &errcode);

shrCheckError(errcode, CL_SUCCESS);

// get the list of GPU devices associated with context

errcode = clGetContextInfo(clGPUContext,

CL_CONTEXT_DEVICES, 0, NULL,

&dataBytes);

cl_device_id *clDevices = (cl_device_id *)

malloc(dataBytes);

errcode |= clGetContextInfo(clGPUContext,

CL_CONTEXT_DEVICES, dataBytes,

clDevices, NULL);

shrCheckError(errcode, CL_SUCCESS);

//Create a command-queue

clCommandQue = clCreateCommandQueue(clGPUContext,

clDevices[0], 0, &errcode);

shrCheckError(errcode, CL_SUCCESS);

// 3. Allocate device memory and move data

d_C = clCreateBuffer(clGPUContext,

CL_MEM_READ_WRITE,

mem_size_C, NULL, &errcode);

d_A = clCreateBuffer(clGPUContext,

CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,

mem_size_A, h_A, &errcode);

d_B = clCreateBuffer(clGPUContext,

CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,

mem_size_B, h_B, &errcode);

// 4. Load and build OpenCL kernel

char *clMatrixMul = oclLoadProgSource("kernel.cl",

"// My comment\n",

&kernelLength);

shrCheckError(clMatrixMul != NULL, shrTRUE);

clProgram = clCreateProgramWithSource(clGPUContext,

1, (const char **)&clMatrixMul,

&kernelLength, &errcode);

shrCheckError(errcode, CL_SUCCESS);

errcode = clBuildProgram(clProgram, 0,

NULL, NULL, NULL, NULL);

shrCheckError(errcode, CL_SUCCESS);

clKernel = clCreateKernel(clProgram,

"matrixMul", &errcode);

shrCheckError(errcode, CL_SUCCESS);

// 5. Launch OpenCL kernel

size_t localWorkSize[2], globalWorkSize[2];

int wA = WA;

int wC = WC;

errcode = clSetKernelArg(clKernel, 0,

sizeof(cl_mem), (void *)&d_C);

errcode |= clSetKernelArg(clKernel, 1,

sizeof(cl_mem), (void *)&d_A);

errcode |= clSetKernelArg(clKernel, 2,

sizeof(cl_mem), (void *)&d_B);

errcode |= clSetKernelArg(clKernel, 3,

sizeof(int), (void *)&wA);

errcode |= clSetKernelArg(clKernel, 4,

sizeof(int), (void *)&wC);

shrCheckError(errcode, CL_SUCCESS);

localWorkSize[0] = 16;

localWorkSize[1] = 16;

globalWorkSize[0] = 1024;

globalWorkSize[1] = 1024;

errcode = clEnqueueNDRangeKernel(clCommandQue,

clKernel, 2, NULL, globalWorkSize,

localWorkSize, 0, NULL, NULL);

shrCheckError(errcode, CL_SUCCESS);

// 6. Retrieve result from device

errcode = clEnqueueReadBuffer(clCommandQue,

d_C, CL_TRUE, 0, mem_size_C,

h_C, 0, NULL, NULL);

shrCheckError(errcode, CL_SUCCESS);

// 7. clean up memory

free(h_A);

free(h_B);

free(h_C);

clReleaseMemObject(d_A);

clReleaseMemObject(d_C);

clReleaseMemObject(d_B);

free(clDevices);

free(clMatrixMul);

clReleaseContext(clGPUContext);

clReleaseKernel(clKernel);

clReleaseProgram(clProgram);

clReleaseCommandQueue(clCommandQue);}

Almost 100 lines of code – tedious and hard to maintain.

It does not take advantage of HSA features.

It will likely need to be changed for OpenCL 2.0.

Page 331: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

COMPARING SEVERAL HIGH-LEVEL

PROGRAMMING INTERFACES

C++AMP: C++ language extension proposed by Microsoft

Thrust: C++ template library proposed by NVIDIA for CUDA

Bolt: C++ template library proposed by AMD

OpenACC: annotations and pragmas proposed by PGI

SYCL: C++ wrapper for OpenCL

All these proposals aim to reduce tedious boilerplate code and provide transparent porting to future systems (future proofing).


Page 332: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OPENACC

HSA ENABLES SIMPLER IMPLEMENTATION OR BETTER OPTIMIZATION


Page 333: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

OPENACC- SIMPLE MATRIX MULTIPLICATION EXAMPLE

void MatrixMulti(float *C, const float *A, const float *B, int hA, int wA, int wB)
{
  #pragma acc parallel loop copyin(A[0:hA*wA]) copyin(B[0:wA*wB]) copyout(C[0:hA*wB])
  for (int i = 0; i < hA; i++) {
    #pragma acc loop
    for (int j = 0; j < wB; j++) {
      float sum = 0;
      for (int k = 0; k < wA; k++) {
        float a = A[i*wA+k];
        float b = B[k*wB+j];
        sum += a*b;
      }
      C[i*wB+j] = sum;
    }
  }
}

Little Host Code Overhead

Programmer annotation of

kernel computation

Programmer annotation of data movement
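Because the OpenACC directives are annotations on ordinary C, the loop nest also compiles and runs unchanged as serial code when the pragmas are ignored, which makes it easy to sanity-check on the host. A self-contained restatement (pragmas shown as comments so any plain C++ compiler accepts it; the result is indexed with row stride wB):

```cpp
#include <cassert>

// Serial form of the OpenACC example. An OpenACC compiler would offload
// the loops marked below; without the directives this is plain C.
void MatrixMulti(float *C, const float *A, const float *B,
                 int hA, int wA, int wB) {
    // #pragma acc parallel loop copyin(A[0:hA*wA]) copyin(B[0:wA*wB]) copyout(C[0:hA*wB])
    for (int i = 0; i < hA; i++) {
        // #pragma acc loop
        for (int j = 0; j < wB; j++) {
            float sum = 0;
            for (int k = 0; k < wA; k++)
                sum += A[i*wA+k] * B[k*wB+j];
            C[i*wB+j] = sum;
        }
    }
}
```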


Page 334: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

ADVANTAGE OF HSA FOR OPENACC

Flexibility in copyin and copyout implementation

Flexible code generation for nested acc parallel loops

E.g., inner loop bounds that depend on outer loop iterations

Compiler data affinity optimization (especially OpenACC kernel regions)

The compiler does not have to undo programmer managed data transfers


Page 335: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

C++AMP

HSA ENABLES EFFICIENT COMPILATION OF AN EVEN HIGHER LEVEL OF PROGRAMMING INTERFACE


Page 336: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

C++ AMP

● C++ Accelerated Massive Parallelism

● Designed for data level parallelism

● Extension of C++11 proposed by Microsoft

● An open specification with multiple implementations aiming at standardization

● MS Visual Studio 2013

● MulticoreWare CLAMP

● GPU data modeled as C++14-like containers for multidimensional arrays

● GPU kernels modeled as C++11 lambdas

● Minimal extension to C++ for simplicity and future proofing
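As a rough picture of the "data container" idea, array_view<int, 2> a(rows, cols, ptr) presents a flat host buffer as a 2D array without owning it. A much-simplified stand-in (`View2D` is an invented name with none of array_view's synchronization or offload behavior):

```cpp
#include <cassert>

// Minimal non-owning 2D view over existing host memory, echoing how
// array_view<int, 2> a(rows, cols, ptr) indexes a flat buffer.
struct View2D {
    int rows, cols;
    int* data;                          // non-owning, like array_view
    int& operator()(int r, int c) { return data[r * cols + c]; }
};
```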


Page 337: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MATRIX MULTIPLICATION IN C++AMP

void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix,

int ha, int hb, int hc) {

array_view<int, 2> a(ha, hb, aMatrix);

array_view<int, 2> b(hb, hc, bMatrix);

array_view<int, 2> product(ha, hc, productMatrix);

parallel_for_each(

product.extent,

[=](index<2> idx) restrict(amp) {

int row = idx[0];

int col = idx[1];

for (int inner = 0; inner < hb; inner++) {

product[idx] += a(row, inner) * b(inner, col);

}

}

);

product.synchronize();}

clGPUContext = clCreateContextFromType(0,

CL_DEVICE_TYPE_GPU,

NULL, NULL, &errcode);

shrCheckError(errcode, CL_SUCCESS);

// get the list of GPU devices associated

// with context

errcode = clGetContextInfo(clGPUContext,

CL_CONTEXT_DEVICES, 0, NULL,

&dataBytes);

cl_device_id *clDevices = (cl_device_id *)

malloc(dataBytes);

errcode |= clGetContextInfo(clGPUContext,

CL_CONTEXT_DEVICES, dataBytes,

clDevices, NULL);

shrCheckError(errcode, CL_SUCCESS);

//Create a command-queue

clCommandQue =

clCreateCommandQueue(clGPUContext,

clDevices[0], 0, &errcode);

shrCheckError(errcode, CL_SUCCESS);

__kernel void

matrixMul(__global float* C, __global float* A,

__global float* B, int wA, int wB) {

int tx = get_global_id(0);

int ty = get_global_id(1);

float value = 0;

for (int k = 0; k < wA; ++k)

{

float elementA = A[ty * wA + k];

float elementB = B[k * wB + tx];

value += elementA * elementB;

}

C[ty * wA + tx] = value;}


Page 338: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

C++AMP PROGRAMMING MODEL

void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {

array_view<int, 2> a(3, 2, aMatrix);

array_view<int, 2> b(2, 3, bMatrix);

array_view<int, 2> product(3, 3, productMatrix);

parallel_for_each(

product.extent,

[=](index<2> idx) restrict(amp) {

int row = idx[0];

int col = idx[1];

for (int inner = 0; inner < 2; inner++) {

product[idx] += a(row, inner) * b(inner, col);

}

}

);

product.synchronize();}

GPU data

modeled as

data container


Page 339: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

C++AMP PROGRAMMING MODEL

void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {

array_view<int, 2> a(3, 2, aMatrix);

array_view<int, 2> b(2, 3, bMatrix);

array_view<int, 2> product(3, 3, productMatrix);

parallel_for_each(

product.extent,

[=](index<2> idx) restrict(amp) {

int row = idx[0];

int col = idx[1];

for (int inner = 0; inner < 2; inner++) {

product[idx] += a(row, inner) * b(inner, col);

}

}

);

product.synchronize();}

Kernels modeled as lambdas; arguments are implicitly modeled as captured variables, so programmers do not need to specify copyin and copyout.
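The compiler can do this because a [=] lambda is, under the hood, an unnamed class whose data members are the captured values; those members are exactly what gets forwarded as individual kernel arguments. A plain C++ sketch of the equivalence (`KernelClosure` is an illustrative name, not compiler output):

```cpp
#include <cassert>

// What a compiler conceptually generates for [=](int i){ return scale*i + bias; }:
// a struct holding each captured variable by value. A C++AMP compiler can
// forward these same members as individual OpenCL kernel arguments.
struct KernelClosure {
    int scale, bias;                        // the captured "arguments"
    int operator()(int i) const { return scale * i + bias; }
};
```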


Page 340: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

C++AMP PROGRAMMING MODEL

void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {

array_view<int, 2> a(3, 2, aMatrix);

array_view<int, 2> b(2, 3, bMatrix);

array_view<int, 2> product(3, 3, productMatrix);

parallel_for_each(

product.extent,

[=](index<2> idx) restrict(amp) {

int row = idx[0];

int col = idx[1];

for (int inner = 0; inner < 2; inner++) {

product[idx] += a(row, inner) * b(inner, col);

}

}

);

product.synchronize();

}

Execution

interface; marking

an implicitly

parallel region for

GPU execution


Page 341: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MCW C++AMP (CLAMP)

● Runs on Linux and Mac OS X

● Output code compatible with all major OpenCL stacks: AMD, Apple/Intel (OS X),

NVIDIA and even POCL

● Clang/LLVM-based, open source

o Translate C++AMP code to OpenCL C or OpenCL 1.2 SPIR

o With template helper library

● Runtime: OpenCL 1.1/HSA Runtime and GMAC for non-HSA systems

● One of the two C++ AMP implementations recognized by the HSA Foundation


Page 342: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

MCW C++ AMP COMPILER

● Device Path

o generate OpenCL C code and SPIR

o emit kernel function

● Host Path

o preparation to launch the code

[Diagram: C++ AMP source code is compiled by Clang/LLVM 3.3 into Device Code and Host Code]


Page 343: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

TRANSLATION

parallel_for_each(product.extent,

[=](index<2> idx) restrict(amp) {

int row = idx[0];

int col = idx[1];

for (int inner = 0; inner < 2; inner++) {

product[idx] += a(row, inner) * b(inner, col);

}

});

__kernel void

matrixMul(__global float* C, __global float*

A,

__global float* B, int wA, int wB){

int tx = get_global_id(0);

int ty = get_global_id(1);

float value = 0;

for (int k = 0; k < wA; ++k)

{

float elementA = A[ty * wA + k];

float elementB = B[k * wB + tx];

value += elementA * elementB;

}

C[ty * wA + tx] = value;}

● Append the arguments

● Set the index

● Emit the kernel function

● Implicit memory management


Page 344: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXECUTION ON NON-HSA OPENCL

PLATFORMS

[Diagram: C++ AMP source code is compiled by Clang/LLVM 3.3 into Device Code and Host Code; at runtime the host code runs on top of GMAC and OpenCL (components labeled "Our work")]


Page 345: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

GMAC

● unified virtual address space in

software

● Can have high overhead

sometimes

● In HSA systems (e.g., AMD Kaveri), GMAC is no longer needed

Gelado, et al, ASPLOS 2010


Page 346: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

CASE STUDY: BINOMIAL OPTION PRICING

[Bar chart: Lines of code measured by cloc (0–350 scale), comparing the C++AMP and OpenCL versions, split into Host and Kernel code]


Page 347: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

PERFORMANCE ON NON-HSA SYSTEMS

BINOMIAL OPTION PRICING

[Bar chart: Time in seconds (0–0.12 scale) on an NVIDIA Tesla C2050, comparing OpenCL and C++AMP for total GPU time and kernel-only time]


Page 348: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXECUTION ON HSA

[Diagram: at compile time, C++ AMP source code is compiled by Clang/LLVM 3.3 into Device SPIR and Host SPIR; at runtime both execute on the HSA Runtime]


Page 349: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

WHAT DO WE NEED TO DO?

● Kernel function

o emit the kernel function with required arguments

● On Host side

o a function that recursively traverses the object and appends the arguments to the OpenCL stack.

● On Device side

o reconstruct the object in the device code for later use.


Page 350: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

WHY COMPILING C++AMP TO OPENCL IS

NOT TRIVIAL

● C++AMP → LLVM IR → OpenCL C or SPIR

● argument passing (lambda capture vs. function calls)

● explicit vs. implicit memory transfer

● Heavy lifting is done by compiler and runtime


Page 351: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

EXAMPLE

struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

struct C c;

c.c = 100;

auto fn = [=] () { int qq = c.c; };
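When a captured object is a nested aggregate like c above, the host side can walk it depth-first and append each scalar field as a separate kernel argument, and the device side rebuilds the object from those fields. A hedged sketch of that traversal using the slide's structs (`append_args` is a hypothetical helper, not the CLAMP API):

```cpp
#include <cassert>
#include <vector>

struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

// Recursively flatten a captured object into a list of scalar arguments,
// base subobjects and members first - mirroring how a runtime could
// serialize a capture into individual OpenCL kernel arguments.
void append_args(const A& x, std::vector<int>& args) {
    args.push_back(x.a);
}
void append_args(const B& x, std::vector<int>& args) {
    append_args(static_cast<const A&>(x), args);   // base subobject first
    args.push_back(x.b);
}
void append_args(const C& x, std::vector<int>& args) {
    append_args(x.b, args);                        // recurse into member
    args.push_back(x.c);
}
```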


Page 352: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

TRANSLATION

parallel_for_each(product.extent,

[=](index<2> idx) restrict(amp) {

int row = idx[0];

int col = idx[1];

for (int inner = 0; inner < 2; inner++) {

product[idx] += a(row, inner) * b(inner, col);

}

});

__kernel void

matrixMul(__global float* C, __global float* A,

__global float* B, int wA, int wB){

int tx = get_global_id(0);

int ty = get_global_id(1);

float value = 0;

for (int k = 0; k < wA; ++k)

{

float elementA = A[ty * wA + k];

float elementB = B[k * wB + tx];

value += elementA * elementB;

}

C[ty * wA + tx] = value;}

● Compiler

● Turn captured variables into

OpenCL arguments

● Populate the index<N> in OCL

kernel

● Runtime

● Implicit memory management


Page 353: ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

QUESTIONS?
