ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial
HETEROGENEOUS SYSTEM
ARCHITECTURE (HSA): ARCHITECTURE
AND ALGORITHMS
ISCA TUTORIAL - JUNE 15, 2014
TOPICS
Introduction
HSAIL Virtual Parallel ISA
HSA Runtime
HSA Memory Model
HSA Queuing Model
HSA Applications
HSA Compilation
© Copyright 2014 HSA Foundation. All Rights Reserved
The HSA Specifications are not at 1.0 final so all content is subject to change
SCHEDULE
Time Topic Speaker
8:45am Introduction to HSA Phil Rogers, AMD
9:30am HSAIL Virtual Parallel ISA Ben Sander, AMD
10:30am Break
10:50am HSA Runtime Yeh-Ching Chung, National Tsing Hua University
12 noon Lunch
1pm HSA Memory Model Benedict Gaster, Qualcomm
2pm HSA Queuing Model Hakan Persson, ARM
3pm Break
3:15pm HSA Compilation Technology Wen Mei Hwu, University of Illinois
4pm HSA Application Programming Wen Mei Hwu, University of Illinois
4:45pm Questions All presenters
INTRODUCTION
PHIL ROGERS, AMD CORPORATE FELLOW & PRESIDENT OF HSA FOUNDATION
HSA FOUNDATION
Founded in June 2012
Developing a new platform for heterogeneous
systems
www.hsafoundation.com
Specifications under development in working
groups to define the platform
Membership consists of 43 companies and 16
universities
Adding 1-2 new members each month
© Copyright 2014 HSA Foundation. All Rights Reserved
DIVERSE PARTNERS DRIVING FUTURE OF
HETEROGENEOUS COMPUTING
(Member logos grouped by level: Founders, Promoters, Supporters, Contributors, Academic.)
MEMBERSHIP TABLE

Membership Level | Number | List
Founder | 6 | AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
Promoter | 1 | LG Electronics
Contributor | 25 | Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics Inc., Sony Mobile Communications, Swarm 64 GmbH, Synopsys, Tensilica Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
Supporter | 13 | Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
Academic | 17 | Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab (National Tsing Hua University), Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
HETEROGENEOUS PROCESSORS HAVE
PROLIFERATED — MAKE THEM BETTER
Heterogeneous SOCs have arrived and are a
tremendous advance over previous platforms
SOCs combine CPU cores, GPU cores and
other accelerators, with high bandwidth access
to memory
How do we make them even better?
Easier to program
Easier to optimize
Higher performance
Lower power
HSA unites accelerators architecturally
Early focus on the GPU compute accelerator,
but HSA will go well beyond the GPU
INFLECTIONS IN PROCESSOR DESIGN
(Chart: three eras of processor design, each plotted as performance over time with a “we are here” marker.)

Single-Core Era: single-thread performance over time. Enabled by Moore’s Law and voltage scaling; constrained by power and complexity. Programming: Assembly, C/C++, Java, …

Multi-Core Era: throughput performance over time (# of processors). Enabled by Moore’s Law and SMP architecture; constrained by power, parallel software, and scalability. Programming: pthreads, OpenMP / TBB, …

Heterogeneous Systems Era: modern application performance over time (data-parallel exploitation). Enabled by abundant data parallelism and power-efficient GPUs; temporarily constrained by programming models and communication overhead. Programming: Shader, CUDA, OpenCL, C++ and Java.
LEGACY GPU COMPUTE
(Diagram: CPUs sharing coherent system memory, connected over PCIe™ to a GPU with its own non-coherent memory.)

The limiters:
Multiple memory pools
Multiple address spaces
High overhead dispatch
Data copies across PCIe
New languages for programming
Dual source development
Proprietary environments
Expert programmers only

Need to fix all of this to unleash our programmers
EXISTING APUS AND SOCS
Physical Integration
(Diagram: CPU cores 1..N and GPU compute units 1..M integrated on one chip, with coherent system memory and a separate non-coherent GPU memory.)
Good first step
Some copies gone
Two memory pools remain
Still queue through the OS
Still requires expert programmers
Need to finish the job
AN HSA ENABLED SOC
Unified Coherent Memory enables data sharing across all processors
Processors architected to operate cooperatively
Designed to enable the application to run on different processors at different times
Unified Coherent Memory
(Diagram: CPU cores 1..N and GPU compute units 1..M all sharing the unified coherent memory.)
PILLARS OF HSA*
Unified addressing across all processors
Operation into pageable system memory
Full memory coherency
User mode dispatch
Architected queuing language
Scheduling and context switching
HSA Intermediate Language (HSAIL)
High level language support for GPU compute processors
* All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
HSA SPECIFICATIONS
HSA System Architecture Specification
Version 1.0 Provisional, Released April 2014
Defines discovery, memory model, queue management, atomics, etc
HSA Programmers Reference Specification
Version 1.0 Provisional, Released June 2014
Defines the HSAIL language and object format
HSA Runtime Software Specification
Version 1.0 Provisional, expected to be released in July 2014
Defines the APIs through which an HSA application uses the platform
All released specifications can be found at the HSA Foundation web site:
www.hsafoundation.com/standards
HSA - AN OPEN PLATFORM
Open Architecture, membership open to all
HSA Programmers Reference Manual
HSA System Architecture
HSA Runtime
Delivered via royalty free standards
Royalty Free IP, Specifications and APIs
ISA agnostic for both CPU and GPU
Membership from all areas of computing
Hardware companies
Operating Systems
Tools and Middleware
Applications
Universities
HSA INTERMEDIATE LAYER — HSAIL
HSAIL is a virtual ISA for parallel programs
Finalized to ISA by a JIT compiler or “Finalizer”
ISA independent by design for CPU & GPU
Explicitly parallel
Designed for data parallel programming
Support for exceptions, virtual functions,
and other high level language features
Lower level than OpenCL SPIR
Fits naturally in the OpenCL compilation stack
Suitable to support additional high level languages and programming models:
Java, C++, OpenMP, Python, etc.
HSA MEMORY MODEL
Defines visibility ordering between all
threads in the HSA System
Designed to be compatible with
C++11, Java, OpenCL and .NET
Memory Models
Relaxed consistency memory model
for parallel compute performance
Visibility controlled by:
Load.Acquire
Store.Release
Fences
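The Load.Acquire / Store.Release controls above map naturally onto C11 acquire/release atomics. Below is a minimal CPU-side sketch of that pairing in plain C with pthreads; it is not HSA runtime code, and `handoff_demo` and its variable names are illustrative only:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative only (not HSA API): the HSA memory model's
 * Load.Acquire / Store.Release pairing behaves like C11
 * acquire/release atomics, demonstrated here on a plain CPU. */

static int payload;             /* ordinary (non-atomic) data   */
static atomic_int ready = 0;    /* release/acquire synchronizer */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                           /* plain store   */
    atomic_store_explicit(&ready, 1, memory_order_release); /* Store.Release */
    return NULL;
}

/* Spawns a producer, then spins with an acquire load until the flag is
 * set; the acquire load guarantees the payload store is visible after. */
int handoff_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                   /* Load.Acquire  */
    pthread_join(t, NULL);
    return payload;
}
```

The release store and acquire load form the same happens-before edge the slide's relaxed-consistency model relies on.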
HSA QUEUING MODEL
User mode queuing for low latency dispatch
Application dispatches directly
No OS or driver required in the dispatch path
Architected Queuing Layer
Single compute dispatch path for all hardware
No driver translation, direct to hardware
Allows for dispatch to queue from any agent
CPU or GPU
GPU self enqueue enables lots of solutions
Recursion
Tree traversal
Wavefront reforming
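The user-mode dispatch path above can be pictured as a ring buffer that the application writes directly, with no OS call in the path. A hypothetical sketch follows; this is not the real AQL packet format, and all names (`user_queue_t`, `queue_dispatch`, etc.) are illustrative:

```c
#include <stdint.h>

/* Hypothetical user-mode dispatch queue in the spirit of the
 * Architected Queuing Layer: a power-of-two ring buffer indexed by
 * monotonically increasing read/write counters. */

#define QUEUE_SIZE 8  /* must be a power of two */

typedef struct { uint64_t kernel_object; uint64_t kernarg; } packet_t;

typedef struct {
    packet_t ring[QUEUE_SIZE];
    uint64_t write_index;  /* bumped by the producer (application) */
    uint64_t read_index;   /* bumped by the consumer (packet processor) */
} user_queue_t;

/* Producer side: returns the slot written, or -1 if the ring is full. */
int queue_dispatch(user_queue_t *q, packet_t p) {
    if (q->write_index - q->read_index >= QUEUE_SIZE)
        return -1;                        /* ring full           */
    uint64_t slot = q->write_index % QUEUE_SIZE;
    q->ring[slot] = p;
    q->write_index++;                     /* publish the packet  */
    return (int)slot;
}

/* Consumer side: pops the oldest packet; 0 on success, -1 if empty. */
int queue_pop(user_queue_t *q, packet_t *out) {
    if (q->read_index == q->write_index)
        return -1;                        /* ring empty          */
    *out = q->ring[q->read_index % QUEUE_SIZE];
    q->read_index++;
    return 0;
}
```

Because both sides are plain memory operations, any agent (CPU or GPU) that can see the ring can enqueue, which is what enables GPU self-enqueue.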
HSA SOFTWARE
(Diagram: evolution from today’s driver stack, with apps over domain libraries, OpenCL™/DX runtimes, user mode drivers, and a graphics kernel mode driver, to the HSA software stack, with apps over task queuing libraries, HSA domain libraries and the OpenCL™ 2.x runtime, the HSA runtime, the HSA JIT, and an HSA kernel mode driver, all on the hardware (APUs, CPUs, GPUs). User mode and kernel mode components are distinguished; some components are contributed by third parties.)
EVOLUTION OF THE SOFTWARE STACK
OPENCL™ AND HSA
HSA is an optimized platform architecture
for OpenCL
Not an alternative to OpenCL
OpenCL on HSA will benefit from
Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU
OpenCL 2.0 leverages HSA Features
Shared Virtual Memory
Platform Atomics
ADDITIONAL LANGUAGES ON HSA
In development
Language | Body | More Information
Java | Sumatra OpenJDK | http://openjdk.java.net/projects/sumatra/
LLVM | LLVM | Code generator for HSAIL
C++ AMP | Multicoreware | https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
OpenMP, GCC | AMD, SUSE | https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425
SUMATRA PROJECT OVERVIEW
AMD/Oracle sponsored Open Source (OpenJDK) project
Targeted at Java 9 (2015 release)
Allows developers to efficiently represent data parallel algorithms in
Java
Sumatra ‘repurposes’ Java 8’s multi-core Stream/Lambda APIs to enable either CPU or GPU computing
At runtime, Sumatra enabled Java Virtual Machine (JVM) will dispatch
‘selected’ constructs to available HSA enabled devices
Developers of Java libraries are already refactoring their library code to
use these same constructs
So developers using existing libraries should see GPU acceleration
without any code changes
http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/
(Diagram: at development time, Application.java is compiled by the Java Compiler to Application.class; at runtime, the Sumatra-enabled JVM runs the application’s Lambda/Stream API code on the CPU ISA or, through the HSA Finalizer, on the GPU ISA.)
HSA OPEN SOURCE SOFTWARE
HSA will feature an open source Linux execution and compilation stack
Allows a single shared implementation for many components
Enables university research and collaboration in all areas
Because it’s the right thing to do
Component Name IHV or Common Rationale
HSA Bolt Library Common Enable understanding and debug
HSAIL Code Generator Common Enable research
LLVM Contributions Common Industry and academic collaboration
HSAIL Assembler Common Enable understanding and debug
HSA Runtime Common Standardize on a single runtime
HSA Finalizer IHV Enable research and debug
HSA Kernel Driver IHV For inclusion in linux distros
WORKLOAD EXAMPLE
SUFFIX ARRAY CONSTRUCTION: CLOUD SERVER WORKLOAD
SUFFIX ARRAYS
Suffix Arrays are a fundamental data structure
Designed for efficient searching of a large text
Quickly locate every occurrence of a substring S in a text T
Suffix Arrays are used to accelerate in-memory cloud workloads
Full text index search
Lossless data compression
Bio-informatics
ACCELERATED SUFFIX ARRAY
CONSTRUCTION ON HSA
M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, Submitted to ”Principles and Practice of Parallel Programming, (PPoPP’13)” February 2013.
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM
By offloading data parallel computations to
GPU, HSA increases performance and
reduces energy for Suffix Array
Construction.
By efficiently sharing data between CPU and
GPU, HSA lets us move compute to data
without penalty of intermediate copies.
+5.8x increased performance
5x decreased energy

(Chart: Skew Algorithm for Compute SA, staged as Merge Sort::GPU, Radix Sort::GPU, Compute SA::CPU, Lexical Rank::CPU, Radix Sort::GPU.)
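As a point of reference for what the pipeline above computes, here is a deliberately naive suffix-array builder: sequential and O(n² log n), standing in for the parallel skew (DC3) algorithm the slide describes. All names are illustrative:

```c
#include <stdlib.h>
#include <string.h>

/* Simplified illustration only: build a suffix array by sorting suffix
 * start offsets with strcmp. The GPU skew algorithm above produces the
 * same array far more efficiently. */

static const char *sa_text;  /* text shared with the qsort comparator */

static int suffix_cmp(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    return strcmp(sa_text + i, sa_text + j);  /* compare whole suffixes */
}

/* Fills sa[0..n-1] with suffix start offsets in lexicographic order. */
void build_suffix_array(const char *text, int *sa) {
    int n = (int)strlen(text);
    for (int i = 0; i < n; i++)
        sa[i] = i;                       /* one entry per suffix */
    sa_text = text;
    qsort(sa, n, sizeof(int), suffix_cmp);
}
```

For "banana" the sorted suffixes are "a", "ana", "anana", "banana", "na", "nana", so the array is {5, 3, 1, 0, 4, 2}; a binary search over this array then locates any substring.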
EASE OF PROGRAMMING: CODE COMPLEXITY VS. PERFORMANCE
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
(Chart: lines of code broken down into Init, Compile, Copy, Launch, Algorithm, and Copy-back, plotted against performance, for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt, on an exemplary ISV “Hessian” kernel.)
THE HSA FUTURE
Architected heterogeneous processing on the SOC
Programming of accelerators becomes much easier
Accelerated software that runs across multiple hardware vendors
Scalability from smart phones to super computers on a common architecture
GPU acceleration of parallel processing is the initial target, with DSPs
and other accelerators coming to the HSA system architecture model
Heterogeneous software ecosystem evolves at a much faster pace
Lower power, more capable devices in your hand, on the wall, in the cloud
JOIN US!
WWW.HSAFOUNDATION.COM
HETEROGENEOUS SYSTEM
ARCHITECTURE (HSA): HSAIL VIRTUAL
PARALLEL ISA
BEN SANDER, AMD
TOPICS
Introduction and Motivation
HSAIL – what makes it special?
HSAIL Execution Model
How to program in HSAIL?
Conclusion
STATE OF GPU COMPUTING
Today’s Challenges
Separate address spaces
Copies
Can’t share pointers
New language required for compute kernel
EX: OpenCL™ runtime API
Compute kernel compiled separately from host code
Emerging Solution
HSA Hardware
Single address space
Coherent
Virtual
Fast access from all components
Can share pointers
Bring GPU computing to existing, popular programming models
Single-source, fully supported by compiler
HSAIL compiler IR (Cross-platform!)
• GPUs are fast and power efficient: high compute density per-mm and per-watt
• But: Can be hard to program
THE PORTABILITY CHALLENGE
CPU ISAs
ISA innovations added incrementally (e.g. NEON, AVX, etc.)
ISA retains backwards-compatibility with previous generation
Two dominant instruction-set architectures: ARM and x86
GPU ISAs
Massive diversity of architectures in the market
Each vendor has its own ISA - and often several in the market at the same time
No commitment (or attempt!) to provide any backwards compatibility
Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction
HSAIL : WHAT MAKES IT SPECIAL?
WHAT IS HSAIL?
Intermediate language for parallel compute in HSA
Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)
Expresses parallel regions of code
Binary format of HSAIL is called “BRIG”
Goal: Bring parallel acceleration to mainstream programming languages
main() {
  …
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
  }
  …
}

(Flow: High-Level Compiler → BRIG → Finalizer → Component ISA; the High-Level Compiler also emits the Host ISA.)
KEY HSAIL FEATURES
Parallel
Shared virtual memory
Portable across vendors in HSA Foundation
Stable across multiple product generations
Consistent numerical results (IEEE-754 with defined min accuracy)
Fast, robust, simple finalization step (no monthly updates)
Good performance (little need to write in ISA)
Supports all of OpenCL™
Supports Java, C++, and other languages as well
HSAIL INSTRUCTION SET - OVERVIEW
Similar to assembly language for a RISC CPU
Load-store architecture
Destination register first, then source registers
140 opcodes (Java™ bytecode has 200)
Floating point (single, double, half (f16))
Integer (32-bit, 64-bit)
Some packed operations
Branches
Function calls
Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
Synchronize host CPU and HSA Component!
Text and Binary formats (“BRIG”)
ld_global_u64 $d0, [$d6 + 120] ; $d0 = load($d6+120)
add_u64 $d1, $d0, 24 ; $d1 = $d0 + 24
SEGMENTS AND MEMORY (1/2)
7 segments of memory
global, readonly, group, spill, private, arg, kernarg
Memory instructions can (optionally) specify a segment
Control data sharing properties and communicate intent
Global Segment
Visible to all HSA agents (including host CPU)
Group Segment
Provides high-performance memory shared in the work-group.
Group memory can be read and written by any work-item in the work-group
HSAIL provides sync operations to control visibility of group memory
ld_global_u64 $d0,[$d6]
ld_group_u64 $d0,[$d6+24]
st_spill_f32 $s1,[$d6+4]
SEGMENTS AND MEMORY (2/2)
Spill, Private, Arg Segments
Represent different regions of a per-work-item stack
Typically generated by compiler, not specified by programmer
Compiler can use these to convey intent – e.g. spills
Kernarg Segment
Programmer writes kernarg segment to pass arguments to a kernel
Read-Only Segment
Remains constant during execution of kernel
FLAT ADDRESSING
Each segment mapped into virtual address space
Flat addresses can map to segments based on virtual address
Instructions with no explicit segment use flat addressing
Very useful for high-level language support (e.g. classes, libraries)
Aligns well with OpenCL 2.0 “generic” addressing feature
ld_global_u64 $d6, [%_arg0] ; global
ld_u64 $d0,[$d6+24] ; flat
REGISTERS
Four classes of registers:
S: 32-bit, Single-precision FP or Int
D: 64-bit, Double-precision FP or Long Int
Q: 128-bit, Packed data.
C: 1-bit, Control Registers (Compares)
Fixed number of registers
S, D, Q share a single pool of resources
S + 2*D + 4*Q <= 128
Up to 128 S or 64 D or 32 Q (or a blend)
Register allocation done in high-level compiler
Finalizer doesn’t perform expensive register allocation
(Diagram: register file layout. Control registers c0–c7 stand alone; in the shared pool, each q register overlays two d registers and four s registers, e.g. q0 = d0–d1 = s0–s3, continuing up to q31 = d62–d63 = s124–s127.)
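The shared-pool rule above (S + 2*D + 4*Q <= 128) can be captured in a one-line helper; a trivial sketch with an illustrative name:

```c
/* Sketch of the HSAIL register budget rule from the slide: S, D and Q
 * registers share one pool, where a D costs two slots and a Q costs
 * four, and the pool holds 128 S-sized slots in total. */

/* Returns 1 if an (s, d, q) register mix fits in the shared pool. */
int hsail_regs_fit(int s, int d, int q) {
    return s + 2 * d + 4 * q <= 128;
}
```

This is why the extremes quoted on the slide (128 S, or 64 D, or 32 Q) all just fit, and any blend in between is checked the same way.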
SIMT EXECUTION MODEL
HSAIL presents a “SIMT” execution model to the programmer
“Single Instruction, Multiple Thread”
Programmer writes program for a single thread of execution
Each work-item appears to have its own program counter
Branch instructions look natural
Hardware Implementation
Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
Actually one program counter for the entire SIMD instruction
Branches implemented with predication
SIMT Advantages
Easier to program (branch code in particular)
Natural path for mainstream programming models and existing compilers
Scales across a wide variety of hardware (programmer doesn’t see vector width)
Cross-lane operations available for those who want peak performance
WAVEFRONTS
Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”
Lanes in wavefront can be “active” or “inactive”
Inactive lanes consume hardware resources but don’t do useful work
Tradeoffs:
“Wavefront-aware” programming can be useful for peak performance
But results in less portable code (since wavefront width is encoded in the algorithm)
if (cond) {
operationA; // cond=True lanes active here
} else {
operationB; // cond=False lanes active here
}
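The if/else example above can be mimicked on a CPU by running both sides of the branch over every lane under a predicate mask, which is how predication executes it on SIMD hardware. A hypothetical sketch (the function name and the 8-lane wavefront width are illustrative):

```c
/* Hypothetical SIMT sketch: one "program counter" walks both sides of
 * a branch across the whole wavefront, and a per-lane predicate mask
 * decides which lanes commit results, as in the slide's if/else. */

#define WAVEFRONT 8  /* lanes per wavefront (illustrative width) */

/* out[i] = cond[i] ? a[i] + 1 : a[i] - 1, executed predication-style. */
void simt_branch(const int *cond, const int *a, int *out) {
    /* "then" side: only cond=true lanes are active */
    for (int lane = 0; lane < WAVEFRONT; lane++)
        if (cond[lane]) out[lane] = a[lane] + 1;   /* operationA */
    /* "else" side: only cond=false lanes are active */
    for (int lane = 0; lane < WAVEFRONT; lane++)
        if (!cond[lane]) out[lane] = a[lane] - 1;  /* operationB */
}
```

Note both loops always run: the inactive lanes in each pass are the "consume hardware resources but don't do useful work" cost the slide mentions.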
CROSS-LANE OPERATIONS
Example HSAIL cross-lane operation: “activelaneid”
Dest set to count of earlier work-items that are active for this instruction
Useful for compaction algorithms
Example HSAIL cross-lane operation: “activelaneshuffle”
Each workitem reads value from another lane in the wavefront
Supports selection of “identity” element for inactive lanes
Useful for wavefront-level reductions

activelaneid_u32 $s0
activelaneshuffle_b32 $s0, $s1, $s2, 0, 0 // s0 = dest, s1 = source, s2 = lane select, no identity
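The activelaneid semantics described above (each active lane receives the count of earlier active work-items) can be emulated sequentially; a hypothetical sketch, with illustrative names and lane count:

```c
/* Hypothetical emulation of HSAIL's "activelaneid" cross-lane
 * operation: each active lane gets the count of lower-numbered active
 * lanes, which is exactly the output index needed for compaction. */

#define LANES 8  /* illustrative wavefront width */

/* ids[i] is meaningful only where active[i] != 0. */
void activelaneid(const int *active, int *ids) {
    int count = 0;
    for (int lane = 0; lane < LANES; lane++) {
        if (active[lane])
            ids[lane] = count++;  /* earlier active lanes seen so far */
        else
            ids[lane] = -1;       /* inactive lanes get no id here    */
    }
}
```

In a compaction kernel, each surviving element is simply stored at its activelaneid, producing a dense output with no gaps.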
HSAIL MODES
Working group strived to limit optional modes and features in HSAIL
Minimize differences between HSA target machines
Better for compiler vendors and application developers
Two modes survived
Machine Models
Small: 32-bit pointers, 32-bit data
Large: 64-bit pointers, 32-bit or 64-bit data
Vendors can support one or both models
“Base” and “Full” Profiles
Two sets of requirements for FP accuracy, rounding, exception reporting, hard preemption
HSA PROFILES

Feature | Base | Full
Addressing Modes | Small, Large | Small, Large
All 32-bit HSAIL operations according to the declared profile | Yes | Yes
F16 support (IEEE 754 or better) | Yes | Yes
F64 support | No | Yes
Precision for add/sub/mul | 1/2 ULP | 1/2 ULP
Precision for div | 2.5 ULP | 1/2 ULP
Precision for sqrt | 1 ULP | 1/2 ULP
HSAIL Rounding: Near | Yes | Yes
HSAIL Rounding: Up / Down / Zero | No | Yes
Subnormal floating-point | Flush-to-zero | Supported
Propagate NaN Payloads | No | Yes
FMA | Yes | Yes
Arithmetic Exception reporting | None | DETECT or BREAK
Debug trap | Yes | Yes
Hard Preemption | No | Yes
HSA PARALLEL EXECUTION
MODEL
HSA PARALLEL EXECUTION MODEL

Basic Idea:
Programmer supplies an HSAIL “kernel” that is run on each work-item.
Kernel is written as a single thread of execution.
Programmer specifies grid dimensions (scope of problem) when launching the kernel.
Each work-item has a unique coordinate in the grid.
Programmer optionally specifies work-group dimensions (for optimized communication).
CONVOLUTION / SOBEL EDGE FILTER

Gx = [ -1 0 +1 ]
     [ -2 0 +2 ]
     [ -1 0 +1 ]

Gy = [ -1 -2 -1 ]
     [  0  0  0 ]
     [ +1 +2 +1 ]

G = sqrt(Gx² + Gy²)

(Diagram, built up over three slides: the kernel runs once per work-item of a 2D grid covering the image; the work-items are grouped into 2D work-groups.)
HOW TO PROGRAM HSA?
WHAT DO I TYPE?
HSA PROGRAMMING MODELS : CORE PRINCIPLES
Single source
Host and device code side-by-side in same source file
Written in same programming language
Single unified coherent address space
Freely share pointers between host and device
Similar memory model as multi-core CPU
Parallel regions identified with existing language syntax
Typically same syntax used for multi-core CPU
HSAIL is the compiler IR that supports these programming models
GCC OPENMP : COMPILATION FLOW
SUSE GCC Project
Adding HSAIL code generator to GCC compiler infrastructure
Supports OpenMP 3.1 syntax
No data movement directives required!

main() {
  …
  // Host code.
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
  }
  …
}

(Flow: GCC OpenMP Compiler → BRIG → Finalizer → Component ISA; the compiler also emits the Host ISA.)
GCC OpenMP flow

C/C++/Fortran OpenMP application, e.g.:
#pragma omp for
for (j = 0; j < n; j++) { b[j] = a[j]; }

Compile time (GNU Compiler, GCC):
Compiles host code and emits runtime calls with kernel name, parameters, and launch attributes
Lowers OpenMP directives, converts GIMPLE to BRIG, and embeds BRIG into the host code

Run time:
Pragmas map to calls into the HSA Runtime
Kernels are finalized from BRIG to ISA once and cached
Kernels are dispatched to the GPU
MCW C++AMP : COMPILATION FLOW
C++AMP : Single-source C++ template parallel programming model
MCW compiler based on CLANG/LLVM
Open-source and runs on Linux
Leverage open-source LLVM->HSAIL code generator
main() {
  …
  parallel_for_each(grid<1>(extent<256>(…)), …);
  …
}

(Flow: C++AMP Compiler → BRIG → Finalizer → Component ISA; the compiler also emits the Host ISA.)
JAVA: RUNTIME FLOW
JAVA 8 – HSA ENABLED APARAPI
Java 8 brings Stream + Lambda API
More natural way of expressing data parallel algorithms
Initially targeted at multi-core

APARAPI will:
Support Java 8 Lambdas
Dispatch code to HSA enabled devices at runtime via HSAIL

(Diagram: Java Application → APARAPI + Lambda API → JVM → HSA Finalizer & Runtime → CPU and GPU)
Future Java – HSA ENABLED JAVA (SUMATRA)
Adds native GPU acceleration to Java Virtual Machine (JVM)
Developer uses JDK Lambda, Stream API
JVM uses GRAAL compiler to generate HSAIL
(Diagram: Java Application → Java JDK Stream + Lambda API → JVM with Java GRAAL JIT backend → HSA Finalizer & Runtime → CPU and GPU)
AN EXAMPLE (IN JAVA 8)
//Example computes the percentage of total scores achieved by each player on a team.
class Player {
private Team team; // Note: Reference to the parent Team.
private int scores;
private float pctOfTeamScores;
public Team getTeam() {return team;}
public int getScores() {return scores;}
public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
};
// “Team” class not shown
// Assume “allPlayers” is an initialized array of Players.
Arrays.stream(allPlayers). // wrap the array in a stream
parallel(). // developer indication that lambda is thread-safe
forEach(p -> {
int teamScores = p.getTeam().getScores();
float pctOfTeamScores = (float)p.getScores()/(float) teamScores;
p.setPctOfTeamScores(pctOfTeamScores);
});
HSAIL CODE EXAMPLE
version 0:95: $full : $large;
// static method HotSpotMethod<Main.lambda$2(Player)>
kernel &run (
    kernarg_u64 %_arg0 // Kernel signature for lambda method
) {
    ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register
    workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord

    cvt_u64_s32 $d2, $s2; // Convert X gid to long
    mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref
    add_u64 $d2, $d2, 24; // Adjust for actual elements start
    add_u64 $d2, $d2, $d6; // Add to array ref ptr
    ld_global_u64 $d6, [$d2]; // Load from array element into reg
@L0:
    ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()
    mov_b64 $d3, $d0;
    ld_global_s32 $s3, [$d6 + 40]; // p.getScores()
    cvt_f32_s32 $s16, $s3;
    ld_global_s32 $s0, [$d0 + 24]; // Team getScores()
    cvt_f32_s32 $s17, $s0;
    div_f32 $s16, $s16, $s17; // p.getScores()/teamScores
    st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
    ret;
};
HOW TO PROGRAM HSA?
OTHER PROGRAMMING TOOLS
HSAIL ASSEMBLER
kernel &run (kernarg_u64 %_arg0)
{
ld_kernarg_u64 $d6, [%_arg0];
workitemabsid_u32 $s2, 0;
cvt_u64_s32 $d2, $s2;
mul_u64 $d2, $d2, 8;
add_u64 $d2, $d2, 24;
add_u64 $d2, $d2, $d6;
ld_global_u64 $d6, [$d2];
. . .
(Flow: HSAIL text → HSAIL Assembler → BRIG → Finalizer → Machine ISA)

• HSAIL has a text format and an assembler
OPENCL™ OFFLINE COMPILER (CLOC)
__kernel void vec_add(
__global const float *a,
__global const float *b,
__global float *c,
const unsigned int n)
{
int id = get_global_id(0);
// Bounds check
if (id < n)
c[id] = a[id] + b[id];
}
(Flow: OpenCL kernel → CLOC → BRIG → Finalizer → Machine ISA)
•OpenCL split-source model cleanly isolates kernel
•Can express many HSAIL features in OpenCL Kernel Language
•Higher productivity than writing in HSAIL assembly
•Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware)
•Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model
KEY TAKEAWAYS
HSAIL
Thin, robust, fast finalizer
Portable (multiple HW vendors and parallel architectures)
Supports shared virtual memory and platform atomics
HSA brings GPU computing to mainstream programming models
Shared and coherent memory bridges “faraway accelerator” gap
HSAIL provides the common IL for high-level languages to benefit from
parallel computing
Languages and Compilers
HSAIL support in GCC, LLVM, Java JVM
Leverage same language syntax designed for multi-core CPUs
Can use pointer-containing data structures
HSA RUNTIME
YEH-CHING CHUNG, NATIONAL TSING HUA UNIVERSITY
OUTLINE
Introduction
HSA Core Runtime API (Pre-release 1.0 provisional)
Initialization and Shut Down
Notifications (Synchronous/Asynchronous)
Agent Information
Signals and Synchronization (Memory-Based)
Queues and Architected Dispatch
Summary
INTRODUCTION (1)
The HSA core runtime is a thin, user-mode API that provides the interface necessary for
the host to launch compute kernels to the available HSA components.
The overall goal of the HSA core runtime design is to provide a high-performance dispatch
mechanism that is portable across multiple HSA vendor architectures.
The dispatch mechanism differentiates the HSA runtime from other language runtimes by
architected argument setting and kernel launching at the hardware and specification level.
The HSA core runtime API is standard across all HSA vendors, such that languages which use the
HSA runtime can run on different vendor’s platforms that support the API.
The implementation of the HSA runtime may include kernel-level components (required for some hardware components, e.g. AMD Kaveri) or may be entirely user-space (for example, simulators or CPU implementations).
(Diagram: without the HSA runtime, each vendor (Vendor 1 … Vendor m) exposes its Components 1..N through its own driver; with the HSA runtime, each HSA vendor’s Components 1..N sit behind the common HSA Runtime and HSA Finalizer.)
INTRODUCTION (2)
Programming Model → Language Runtime
(Diagram: OpenCL, Java, OpenMP, and DSL applications over their respective language runtimes, shown both without the HSA runtime and with the HSA runtime underneath.)
INTRODUCTION (3)
(Diagram: program flow through the OpenCL runtime vs. the HSA runtime, from Start Program to Exit Program: platform, device, and context initialization vs. HSA runtime initialization and topology discovery; kernel build vs. HSAIL finalization and linking; SVM allocation and kernel-argument setting vs. HSA memory allocation; command-queue enqueue vs. enqueue of a dispatch packet; resource deallocation vs. HSA runtime close.)
INTRODUCTION (4)
HSA Platform System Architecture Specification support
Runtime initialization and shutdown
Notifications (synchronous/asynchronous)
Agent information
Signals and synchronization (memory-based)
Queues and Architected dispatch
Memory management
HSAIL support
Finalization, linking, and debugging
Image and Sampler support
(Diagram: HSA Runtime flow: HSA Runtime Initialization and Topology Discovery → HSAIL Finalization and Linking → HSA Memory Allocation → Enqueue Dispatch Packet → HSA Runtime Close.)
RUNTIME INITIALIZATION AND
SHUTDOWN
OUTLINE
Runtime Initialization API
hsa_init
Runtime Shut Down API
hsa_shut_down
Examples
HSA RUNTIME INITIALIZATION
When the API is invoked for the first time in a given process, a runtime
instance is created.
A typical runtime instance may contain information about the platform, topology, reference
count, queues, signals, etc.
The API can be called multiple times by applications.
Only a single runtime instance will exist for a given process.
Whenever the API is invoked, the reference count is increased by one.
HSA RUNTIME SHUT DOWN
When the API is invoked, the reference count is decreased by 1.
When the reference count reaches zero:
All the resources associated with the runtime instance (queues, signals, topology
information, etc.) are considered invalid, and any attempt to reference them in
subsequent API calls results in undefined behavior.
The user might call hsa_init to initialize the HSA runtime again.
The HSA runtime might release resources associated with it.
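The init/shutdown behavior above can be sketched as a reference-counted pair. This is a toy mock, not the HSA runtime implementation: the names hsa_init, hsa_shut_down, and HSA_STATUS_SUCCESS follow the slides, while runtime_instance_t, ref_count, and the helper accessors are invented here for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { HSA_STATUS_SUCCESS = 0, HSA_STATUS_ERROR = 1 } hsa_status_t;

typedef struct {
    int ref_count;    /* hsa_init calls not yet matched by hsa_shut_down */
    int initialized;  /* stands in for queues, signals, topology, ... */
} runtime_instance_t;

static runtime_instance_t runtime = {0, 0};

hsa_status_t hsa_init(void) {
    if (runtime.ref_count == 0) {
        runtime.initialized = 1;   /* allocate resources, build topology table */
    }
    runtime.ref_count++;           /* every call bumps the reference count */
    return HSA_STATUS_SUCCESS;
}

hsa_status_t hsa_shut_down(void) {
    if (runtime.ref_count == 0)
        return HSA_STATUS_ERROR;   /* shutdown without a matching init */
    runtime.ref_count--;
    if (runtime.ref_count == 0) {
        runtime.initialized = 0;   /* release queues, signals, topology, ... */
    }
    return HSA_STATUS_SUCCESS;
}

int runtime_ref_count(void) { return runtime.ref_count; }
int runtime_is_live(void)   { return runtime.initialized; }
```

Only the last matching hsa_shut_down actually releases resources; earlier calls merely decrement the count.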
EXAMPLE – RUNTIME INITIALIZATION (1)
Data structure for the runtime instance
If hsa_init is called more than once, increase the ref_count by 1
EXAMPLE – RUNTIME INITIALIZATION (2)
When hsa_init is called the first time: allocate resources and set the reference count
Get the number of HSA agents
Initialize the agents
Create an empty agent list
If initialization fails, release resources
Create the topology table
EXAMPLE - RUNTIME INSTANCE (1)
Platform Name: Generic (Agent: 2, Memory: 1, Cache: 1)
Agent-0: node_id 0, id 0, type CPU, vendor Generic, name Generic, wavefront_size 0, queue_size 200, group_memory 0, fbarrier_max_count 1, is_pic_supported 0
Agent-1: node_id 0, id 0, type GPU, vendor Generic, name Generic, wavefront_size 64, queue_size 200, group_memory 64, fbarrier_max_count 1, is_pic_supported 1
Memory: node_id 0, id 0, segment_type 111111, address_base 0x0001, size 2048 MB, peak_bandwidth 6553.6 mbps
Cache: node_id 0, id 0, levels 1, associativity 1, cache size 64KB, cache line size 4, is_inclusive 1
EXAMPLE - RUNTIME INSTANCE (2)
Platform Header File: base_address = 0x00001, size = 2048, system_timestamp_frequency_mhz = 200, signal_maximum_wait = 1/200, node_id list (no_nodes = 1), agent_list (no_agents = 2), memory_descriptor_list (no_memory_descriptors = 1), cache_descriptor_list (no_cache_descriptors = 1)
Agent-0: node_id = 0, id = 0, agent_type = 1 (CPU), vendor[16] = Generic, name[16] = Generic, wavefront_size = 0, queue_size = 200, group_memory_size_bytes = 0, fbarrier_max_count = 1, is_pic_supported = 0
Agent-1: node_id = 0, id = 0, agent_type = 2 (GPU), vendor[16] = Generic, name[16] = Generic, wavefront_size = 64, queue_size = 200, group_memory_size_bytes = 64, fbarrier_max_count = 1, is_pic_supported = 1
Memory: node_id = 0, id = 0, supported_segment_type_mask = 111111, virtual_address_base = 0x0001, size_in_bytes = 2048MB, peak_bandwidth_mbps = 6553.6
Cache: node_id = 0, id = 0, levels = 1, with NULL-terminated descriptor lists for associativity (1), cache_size (64KB), cache_line_size (4), and is_inclusive (1)
EXAMPLE – RUNTIME SHUT DOWN
Decrease the ref_count by 1; if it drops below 1, free the list and release the resources.
NOTIFICATIONS
(SYNCHRONOUS/ASYNCHRONOUS)
OUTLINE
Synchronous Notifications
hsa_status_t
hsa_status_string
Asynchronous Notifications
Example
SYNCHRONOUS NOTIFICATIONS
Notifications (errors, events, etc.) reported by the runtime can be synchronous or
asynchronous
The HSA runtime uses the return values of API functions to pass notifications
synchronously.
A status code is defined as an enumeration, hsa_status_t, to capture the return value
of any API function that has been executed, except accessors/mutators.
The notification is a status code that indicates success or error.
Success is represented by HSA_STATUS_SUCCESS, which is equivalent to zero.
An error status is assigned a positive integer and its identifier starts with the
HSA_STATUS_ERROR prefix.
The status code can help to determine the cause of the unsuccessful execution.
STATUS CODE QUERY
Query additional information on a status code
Parameters
status (input): Status code that the user is seeking more information on
status_string (output): An ISO/IEC 646 encoded English language string that potentially
describes the error status
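A minimal sketch of the synchronous-notification pattern: every API call returns an hsa_status_t, zero means success, and a non-zero code can be mapped to a human-readable string. The error codes beyond HSA_STATUS_SUCCESS and the my_status_string/check helpers are invented stand-ins here (the real API provides hsa_status_string for this purpose).

```c
#include <assert.h>
#include <string.h>

typedef enum {
    HSA_STATUS_SUCCESS = 0,                  /* from the slides: success is zero */
    HSA_STATUS_ERROR = 1,                    /* mock error codes for illustration */
    HSA_STATUS_ERROR_INVALID_ARGUMENT = 2
} hsa_status_t;

/* Invented stand-in for hsa_status_string: map a code to a description. */
const char *my_status_string(hsa_status_t status) {
    switch (status) {
    case HSA_STATUS_SUCCESS:                 return "success";
    case HSA_STATUS_ERROR:                   return "generic error";
    case HSA_STATUS_ERROR_INVALID_ARGUMENT:  return "invalid argument";
    default:                                 return "unknown status";
    }
}

/* Typical call-site pattern: treat any non-zero return as an error. */
const char *check(hsa_status_t status) {
    if (status != HSA_STATUS_SUCCESS)
        return my_status_string(status);     /* report and handle the error */
    return "ok";
}
```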
ASYNCHRONOUS NOTIFICATIONS
The runtime passes asynchronous notifications by calling user-defined
callbacks.
For instance, queues are a common source of asynchronous events because the
tasks queued by an application are asynchronously consumed by the packet
processor. Callbacks are associated with queues when they are created. When the
runtime detects an error in a queue, it invokes the callback associated with that
queue and passes it an error flag (indicating what happened) and a pointer to the
erroneous queue.
The HSA runtime does not implement any default callbacks.
A callback that blocks and does not return can leave the runtime in an undefined state.
EXAMPLE - CALLBACK
Pass the callback function when creating the queue
If the queue is empty, set the event and invoke the callback
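The callback mechanism described above can be sketched as follows. All of the types here (mock_queue_t, queue_error_callback_t, runtime_report_error) are invented stand-ins for the real queue-creation API, which takes the callback as an argument at queue-creation time.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { HSA_STATUS_SUCCESS = 0, HSA_STATUS_ERROR = 1 } hsa_status_t;

struct mock_queue;
typedef void (*queue_error_callback_t)(hsa_status_t error, struct mock_queue *q);

typedef struct mock_queue {
    queue_error_callback_t on_error;  /* registered when the queue is created */
    int error_count;                  /* user state updated by the callback */
} mock_queue_t;

/* Callback is associated with the queue at creation time. */
mock_queue_t queue_create(queue_error_callback_t cb) {
    mock_queue_t q = { cb, 0 };
    return q;
}

/* What the runtime side would do on detecting an error in the queue:
 * invoke the associated callback with an error flag and the queue pointer. */
void runtime_report_error(mock_queue_t *q, hsa_status_t error) {
    if (q->on_error)                  /* the runtime has no default callback */
        q->on_error(error, q);
}

/* User-defined callback; it must not block (runtime state becomes undefined). */
void count_errors(hsa_status_t error, mock_queue_t *q) {
    (void)error;
    q->error_count++;
}
```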
AGENT INFORMATION
OUTLINE
Agent information
hsa_node_t
hsa_agent_t
hsa_agent_info_t
hsa_component_feature_t
Agent Information manipulation APIs
hsa_iterate_agents
hsa_agent_get_info
Example
INTRODUCTION
The runtime exposes a list of agents that are available in the system.
An HSA agent is a hardware component that participates in the HSA memory model.
An HSA agent can submit AQL packets for execution.
An HSA agent may also, but is not required to, be an HSA component. It is possible for
a system to include HSA agents that are neither an HSA component nor a host CPU.
HSA agents are defined as opaque handles of type hsa_agent_t.
The HSA runtime provides APIs for applications to traverse the list of available
agents and query attributes of a particular agent.
AGENT INFORMATION (1)
Opaque agent handle
Opaque NUMA node handle
An HSA memory node is a node that delineates a set of
system components (host CPUs and HSA Components) with
“local” access to a set of memory resources attached to the
node's memory controller and appropriate HSA-compliant
access attributes.
AGENT INFORMATION (2)
Component features
An HSA component is a hardware or software component that can be a target of AQL queues
and conforms to the HSA memory model.
Values
HSA_COMPONENT_FEATURE_NONE = 0
No component capabilities. The device is an agent, but not a component.
HSA_COMPONENT_FEATURE_BASIC = 1
The component supports the HSAIL instruction set and all the AQL packet types except Agent
dispatch.
HSA_COMPONENT_FEATURE_ALL = 2
The component supports the HSAIL instruction set and all the AQL packet types.
AGENT INFORMATION (3)
Agent attributes
Values
HSA_AGENT_INFO_MAX_GRID_DIM
HSA_AGENT_INFO_MAX_WORKGROUP_DIM
HSA_AGENT_INFO_QUEUE_MAX_PACKETS
HSA_AGENT_INFO_CLOCK
HSA_AGENT_INFO_CLOCK_FREQUENCY
HSA_AGENT_INFO_MAX_SIGNAL_WAIT
HSA_AGENT_INFO_NAME
HSA_AGENT_INFO_NODE
HSA_AGENT_INFO_COMPONENT_FEATURES
HSA_AGENT_INFO_VENDOR_NAME
HSA_AGENT_INFO_WAVEFRONT_SIZE
HSA_AGENT_INFO_CACHE_SIZE
AGENT INFORMATION MANIPULATION (1)
Iterate over the available agents, and invoke an application-defined callback on
every iteration
If callback returns a status other than HSA_STATUS_SUCCESS for a particular
iteration, the traversal stops and the function returns that status value.
Parameters
callback (input): Callback to be invoked once per agent
data (input): Application data that is passed to callback on every iteration. Can be
NULL.
AGENT INFORMATION MANIPULATION (2)
Get the current value of an attribute for a given agent
Parameters
agent (input): A valid agent
attribute (input): Attribute to query
value (output): Pointer to a user-allocated buffer in which to store the value of the
attribute. If the buffer passed by the application is not large enough to hold the value
of the attribute, the behavior is undefined.
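The traversal-plus-query pattern can be sketched with mocks shaped after the example runtime instance earlier in this section (one CPU agent with wavefront_size 0, one GPU agent with wavefront_size 64). mock_iterate_agents and mock_agent_get_info are invented stand-ins for hsa_iterate_agents and hsa_agent_get_info, and the early-exit status code is invented too.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { HSA_STATUS_SUCCESS = 0, HSA_STATUS_INFO_BREAK = 1 } hsa_status_t;
typedef enum { AGENT_INFO_WAVEFRONT_SIZE, AGENT_INFO_QUEUE_MAX_PACKETS } agent_info_t;

typedef struct { const char *name; int wavefront_size; int queue_size; } mock_agent_t;

/* Mock topology: the CPU and GPU agents from the example runtime instance. */
static mock_agent_t agents[] = {
    { "CPU", 0, 200 },
    { "GPU", 64, 200 },
};

hsa_status_t mock_agent_get_info(const mock_agent_t *a, agent_info_t attr, void *value) {
    switch (attr) {
    case AGENT_INFO_WAVEFRONT_SIZE:    *(int *)value = a->wavefront_size; break;
    case AGENT_INFO_QUEUE_MAX_PACKETS: *(int *)value = a->queue_size;     break;
    }
    return HSA_STATUS_SUCCESS;
}

typedef hsa_status_t (*agent_cb_t)(mock_agent_t *agent, void *data);

/* Traversal stops early if the callback returns a non-success status. */
hsa_status_t mock_iterate_agents(agent_cb_t cb, void *data) {
    for (size_t i = 0; i < sizeof agents / sizeof agents[0]; ++i) {
        hsa_status_t s = cb(&agents[i], data);
        if (s != HSA_STATUS_SUCCESS) return s;
    }
    return HSA_STATUS_SUCCESS;
}

/* Example callback: remember the first agent with a non-zero wavefront size. */
hsa_status_t find_component(mock_agent_t *agent, void *data) {
    int wf;
    mock_agent_get_info(agent, AGENT_INFO_WAVEFRONT_SIZE, &wf);
    if (wf > 0) { *(mock_agent_t **)data = agent; return HSA_STATUS_INFO_BREAK; }
    return HSA_STATUS_SUCCESS;
}
```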
EXAMPLE - AGENT ATTRIBUTE QUERY
Copy agent attribute information
Get the agent handle of Agent 0
SIGNALS AND SYNCHRONIZATION
(MEMORY-BASED)
OUTLINE
Signal
Signal manipulation API
Create/Destroy
Query
Send
Atomic Operations
Signal wait
Get time out
Signal Condition
Example
SIGNAL (1)
HSA agents can communicate with each other by using coherent global memory,
or by using signals.
A signal is represented by an opaque signal handle.
A signal carries a value, which can be updated or conditionally waited upon via
an API call or HSAIL instruction.
The value occupies four or eight bytes depending on the machine model in use.
SIGNAL (2)
Updating the value of a signal is equivalent to sending the signal.
In addition to the update (store) of signals, the API for sending signals must
support other atomic operations with specific memory order semantics
Atomic operations: AND, OR, XOR, Add, Subtract, Exchange, and CAS
Memory order semantics: Release and Relaxed
SIGNAL CREATE/DESTROY
Create a signal
Parameters
initial_value (input): Initial value of the signal.
signal_handle (output): Signal handle.
Destroy a signal previously created by hsa_signal_create
Parameter
signal_handle (input): Signal handle.
SIGNAL LOAD/STORE
Atomically read the current signal value with acquire semantics
Atomically read the current signal value with relaxed semantics
Send and atomically set the value of a signal with release semantics
Send and atomically set the value of a signal with relaxed semantics
SIGNAL ADD/SUBTRACT
Send and atomically increment the value of a signal by a given amount with release semantics
Send and atomically increment the value of a signal by a given amount with relaxed semantics
Send and atomically decrement the value of a signal by a given amount with release semantics
Send and atomically decrement the value of a signal by a given amount with relaxed semantics
SIGNAL AND (OR, XOR)/EXCHANGE
Send and atomically perform a logical AND operation on the value of a signal and a given value with release semantics
Send and atomically perform a logical AND operation on the value of a signal and a given value with relaxed semantics
Send and atomically set the value of a signal and return its previous value with release semantics
Send and atomically set the value of a signal and return its previous value with relaxed semantics
SIGNAL WAIT (1)
The application may wait on a signal, with a condition specifying the terms of the
wait.
Signal wait condition operator
Values
HSA_EQ: The two operands are equal.
HSA_NE: The two operands are not equal.
HSA_LT: The first operand is less than the second operand.
HSA_GTE: The first operand is greater than or equal to the second operand.
SIGNAL WAIT (2)
The wait can be done either in the HSA component via an HSAIL wait instruction
or via a runtime API defined here.
Waiting on a signal returns the current value at the opaque signal object;
The wait may have a runtime defined timeout which indicates the maximum amount of time that an
implementation can spend waiting.
The signal infrastructure allows for multiple senders/waiters on a single signal.
Wait reads the value, hence acquire synchronizations may be applied.
SIGNAL WAIT (3)
Signal wait
Parameters
signal_handle (input): A signal handle
condition (input): Condition used to compare the passed and signal values
compare_value (input): Value to compare with
return_value (output): A pointer into which the current signal value is read
SIGNAL WAIT (4)
Signal wait with timeout
Parameters
signal_handle (input): A signal handle
timeout (input): Maximum wait duration (a value of zero indicates no maximum)
long_wait (input): Hint indicating that the signal value is not expected to meet the given condition in
a short period of time. The HSA runtime may use this hint to optimize the wait implementation.
condition (input): Condition used to compare the passed and signal values
compare_value (input): Value to compare with
return_value (output): A pointer into which the current signal value is read
EXAMPLE – SIGNAL WAIT (1)
[Timeline: the signal value is initially 0.
thread_1 calls hsa_signal_wait_timeout_acquire(value == 2) and is blocked.
thread_2 calls hsa_signal_add_relaxed with 3 (value = 3).
thread_2 calls hsa_signal_subtract_relaxed with 1 (value = 2).
The condition is satisfied: the wait returns the signal value and the execution of thread_1 continues.]
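The timeline above can be reproduced on a CPU with an illustrative toy signal built from a pthreads mutex and condition variable. Everything here (toy_signal_t, toy_signal_add, toy_signal_wait_eq, run_signal_example) is an invented stand-in that only approximates hsa_signal_add_relaxed and a wait with the HSA_EQ condition; it is not how a real runtime implements signals.

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    long value;
} toy_signal_t;

void toy_signal_init(toy_signal_t *s, long initial) {
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->value = initial;
}

/* ~ hsa_signal_add_relaxed: updating the value "sends" the signal. */
void toy_signal_add(toy_signal_t *s, long v) {
    pthread_mutex_lock(&s->lock);
    s->value += v;
    pthread_cond_broadcast(&s->cond);   /* wake any waiters */
    pthread_mutex_unlock(&s->lock);
}

/* ~ hsa_signal_wait with the HSA_EQ condition: block until value == expected,
 * then return the observed value. */
long toy_signal_wait_eq(toy_signal_t *s, long expected) {
    pthread_mutex_lock(&s->lock);
    while (s->value != expected)        /* re-check the condition on every wake */
        pthread_cond_wait(&s->cond, &s->lock);
    long v = s->value;
    pthread_mutex_unlock(&s->lock);
    return v;
}

static void *producer(void *arg) {      /* plays thread_2 from the timeline */
    toy_signal_t *s = arg;
    toy_signal_add(s, 3);               /* value: 0 -> 3 */
    toy_signal_add(s, -1);              /* value: 3 -> 2, satisfies the waiter */
    return NULL;
}

long run_signal_example(void) {
    toy_signal_t s;
    toy_signal_init(&s, 0);
    pthread_t t;
    pthread_create(&t, NULL, producer, &s);
    long v = toy_signal_wait_eq(&s, 2); /* thread_1 blocks until value == 2 */
    pthread_join(t, NULL);
    return v;
}
```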
EXAMPLE – SIGNAL WAIT (2)
If signal_handle is invalid, then return a signal-invalid status
Compare tmp->value with compare_value to see if the condition is satisfied
If timeout = 0, then return a signal-timeout status
Signal wait condition function
If the condition is satisfied, then return the signal value and status
QUEUES AND ARCHITECTED
DISPATCH
OUTLINE
Queues
Queue Types and Structure
HSA runtime API for Queue Manipulations
Architected Queuing Language (AQL) Support
Packet type
Packet header
Examples
Enqueue Packet
Packet Processor
INTRODUCTION (1)
An HSA-compliant platform supports the allocation of multiple user-level command queues.
A user-level command queue is characterized as runtime-allocated, user-level accessible virtual
memory of a certain size, containing packets defined in the Architected Queuing Language (AQL
packets).
Queues are allocated by HSA applications through the HSA runtime.
HSA software receives memory-based structures to configure the hardware queues,
allowing for efficient software management of the hardware queues of the HSA agents.
This queue memory shall be processed by the HSA Packet Processor as a ring buffer.
Queues are read-only data structures.
Writing values directly to a queue structure results in undefined behavior.
However, HSA agents can directly modify the contents of the buffer pointed to by base_address, or use
runtime APIs to access the doorbell signal or the service queue.
INTRODUCTION (2)
Two queue types, AQL and Service Queues, are supported
AQL Queues consume AQL packets, which specify the kernel functions that will be executed
on the HSA component
Service Queues consume agent dispatch packets, which specify runtime-defined or user-registered
functions that will be executed on the agent (typically, the host CPU)
INTRODUCTION (3)
[Figure: AQL queue structure, including base_address, the doorbell signal, and the service queue.]
INTRODUCTION (4)
In addition to the data held in the queue structure, the queue also defines two
properties (readIndex and writeIndex) that define the location of “head” and “tail”
of the queue.
readIndex: The read index is a 64-bit unsigned integer that specifies the packetID of
the next AQL packet to be consumed by the packet processor.
writeIndex: The write index is a 64-bit unsigned integer that specifies the packetID of
the next AQL packet slot to be allocated.
Neither index is directly exposed to the user, who can only access them through
dedicated HSA core runtime APIs.
The available index functions differ in the index of interest (read or write), the action to be
performed (addition, compare and swap, etc.), and the memory consistency model
(relaxed, release, etc.).
INTRODUCTION (5)
The read index is automatically advanced when a packet is read by the packet
processor.
When the packet processor observes that the read index matches the write index,
the queue can be considered empty; when the write index is greater than or equal to
the sum of the read index and the size of the queue, the queue is full.
The doorbell_signal field of a queue contains a signal that is used by the agent
to inform the packet processor to process the packets it writes.
The value signaled on the doorbell is equal to the ID of the packet that is ready to be
launched.
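The empty/full conditions above can be sketched with plain 64-bit indices. This toy queue_indices_t is an invented stand-in, but the comparisons mirror the rules just described: the indices increase monotonically and never wrap, and a packet ID maps to the ring-buffer slot packet ID modulo queue size.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t read_index;   /* packet ID of the next packet to be consumed */
    uint64_t write_index;  /* packet ID of the next slot to be allocated */
    uint64_t size;         /* number of packet slots in the ring buffer */
} queue_indices_t;

int queue_empty(const queue_indices_t *q) {
    return q->read_index == q->write_index;       /* read index matches write index */
}

int queue_full(const queue_indices_t *q) {
    return q->write_index >= q->read_index + q->size;
}

/* ~ hsa_queue_add_write_index_*: returns the previous write index, i.e. the
 * packet ID of the slot the caller just claimed. */
uint64_t add_write_index(queue_indices_t *q, uint64_t n) {
    uint64_t prev = q->write_index;
    q->write_index += n;
    return prev;
}

uint64_t slot_of(const queue_indices_t *q, uint64_t packet_id) {
    return packet_id % q->size;   /* ring-buffer position of a packet ID */
}
```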
INTRODUCTION (6)
The new task might be consumed by the packet processor even before the
doorbell signal has been signaled by the agent.
This is because the packet processor might be already processing some other
packets and observes that there is new work available, so it processes the new
packets.
In any case, the agent must ring the doorbell for every batch of packets it writes.
QUEUE CREATE/DESTROY
Create a user mode queue
When a queue is created, the runtime also
allocates the packet buffer and the completion
signal.
The application should only rely on the status
code returned to determine if the queue is valid
Destroy a user mode queue
A queue must not be accessed after being destroyed.
When a queue is destroyed, the state of the AQL packets
that have not yet been fully processed becomes undefined.
GET READ/WRITE INDEX
Atomically retrieve read index of a queue with
acquire semantics
Atomically retrieve write index of a queue with
acquire semantics
Atomically retrieve read index of a queue with
relaxed semantics
Atomically retrieve write index of a queue with
relaxed semantics
SET READ/WRITE INDEX
Atomically set the read index of a queue with
release semantics
Atomically set the read index of a queue with
relaxed semantics
Atomically set the write index of a queue with
release semantics
Atomically set the write index of a queue with
relaxed semantics
COMPARE AND SWAP WRITE INDEX
Atomically compare and set the write index of a queue with
acquire/release/relaxed/acquire-release semantics
Parameters
queue (input): A queue
expected (input): The expected index value
val (input): Value to copy to the write index if expected matches the observed write index
Return value
Previous value of the write index
ADD WRITE INDEX
Atomically increment the write index of a
queue by an offset with
release/acquire/relaxed/acquire-release
semantics
Parameters
queue (input): A queue
val (input): The value to add to the write index
Return value
Previous value of the write index
ARCHITECTED QUEUING LANGUAGE (AQL)
An HSA-compliant system provides a command interface for the dispatch of
HSA agent commands.
This command interface is provided by the Architected Queuing Language (AQL).
AQL allows HSA agents to build and enqueue their own command packets,
enabling fast and low-power dispatch.
AQL also provides support for HSA component queue submissions
The HSA component kernel can write commands in AQL format.
AQL PACKET (1)
AQL packet format
Values
Always reserved packet (0): Packet format is set to always reserved when the queue is initialized.
Invalid packet (1): Packet format is set to invalid when the readIndex is incremented, making the packet slot available to the HSA agents.
Dispatch packet (2): Dispatch packets contain jobs for the HSA component and are created by HSA agents.
Barrier packet (3): Barrier packets can be inserted by HSA agents to delay processing subsequent packets. All queues support barrier packets.
Agent dispatch packet (4): Agent dispatch packets contain jobs for the HSA agent and are created by HSA agents.
AQL PACKET (2)
HSA signaling object handle used to indicate completion of the job
EXAMPLE - ENQUEUE AQL PACKET (1)
An HSA agent submits a task to a queue by performing the following steps:
Allocate a packet slot (by incrementing the writeIndex)
Initialize the packet and copy packet to a queue associated with the Packet Processor
Mark packet as valid
Notify the Packet Processor of the packet (With doorbell signal)
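The four steps can be sketched against a toy queue. toy_queue_t and enqueue_dispatch are invented stand-ins, and only the packet format codes follow the AQL packet slide; this is not the real AQL packet layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Format codes from the AQL packet slide. */
enum { FMT_ALWAYS_RESERVED = 0, FMT_INVALID = 1, FMT_DISPATCH = 2 };

#define QUEUE_SIZE 4

typedef struct { int format; int kernel_id; } toy_packet_t;

typedef struct {
    toy_packet_t slots[QUEUE_SIZE];
    uint64_t read_index, write_index;
    uint64_t doorbell;   /* last packet ID the doorbell was rung with */
} toy_queue_t;

int enqueue_dispatch(toy_queue_t *q, int kernel_id) {
    if (q->write_index >= q->read_index + QUEUE_SIZE)
        return -1;                                   /* queue is full */
    uint64_t id = q->write_index++;                  /* 1. allocate a packet slot */
    toy_packet_t *p = &q->slots[id % QUEUE_SIZE];
    p->kernel_id = kernel_id;                        /* 2. initialize and copy the packet */
    p->format = FMT_DISPATCH;                        /* 3. mark the packet as valid */
    q->doorbell = id;                                /* 4. notify via the doorbell signal */
    return 0;
}
```

A real enqueue would use an atomic add on the write index (or a lock) so that multiple producers each claim a distinct slot.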
EXAMPLE - ENQUEUE AQL PACKET (2)
[Figure: dispatch queue with ReadIndex and WriteIndex markers.
Allocate an AQL packet slot.
Initialize the packet.
Copy the packet into the queue; note that a lock can be used here to prevent race conditions in a multithreaded environment.
Send the doorbell signal.]
EXAMPLE - PACKET PROCESSOR
[Figure: dispatch queue with ReadIndex and WriteIndex markers.
Receive the doorbell signal.
If there is any packet in the queue, process it:
Get the packet content.
Check if it is a barrier packet.
Update the readIndex, change the packet state to invalid, and send the completion signal.]
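The consume side described above can be sketched the same way. proc_queue_t and process_packets are invented stand-ins for the packet processor, with an integer counter standing in for the completion signal.

```c
#include <assert.h>
#include <stdint.h>

enum { FMT_INVALID = 1, FMT_DISPATCH = 2, FMT_BARRIER = 3 };

#define QSIZE 4

typedef struct { int format; int kernel_id; } pkt_t;

typedef struct {
    pkt_t slots[QSIZE];
    uint64_t read_index, write_index;
    int completed;   /* stands in for the completion signal */
} proc_queue_t;

/* Process everything between read_index and write_index; returns the number
 * of packets consumed. */
int process_packets(proc_queue_t *q) {
    int n = 0;
    while (q->read_index < q->write_index) {       /* queue is not empty */
        pkt_t *p = &q->slots[q->read_index % QSIZE];
        if (p->format == FMT_BARRIER) {
            /* a real processor would delay subsequent packets here */
        }
        /* ... launch the job described by the packet ... */
        p->format = FMT_INVALID;                   /* slot is reusable by agents */
        q->read_index++;                           /* advance the read index */
        q->completed++;                            /* send the completion signal */
        n++;
    }
    return n;
}
```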
MEMORY MANAGEMENT
OUTLINE
Memory registration and deregistration
Memory region and memory segment
APIs for memory region manipulation
APIs for memory registration and deregistration
INTRODUCTION
One of the key features of HSA is its ability to share global pointers between the
host application and code executing on the HSA component.
This ability means that an application can directly pass a pointer to memory allocated on the host
to a kernel function dispatched to a component without an intermediate copy
When a buffer created on the host is also accessed by a component,
programmers are encouraged to register the corresponding address range
beforehand.
Registering memory expresses an intention to access (read or write) the passed buffer from a
component other than the host. This is a performance hint that allows the runtime implementation
to know ahead of time which buffers will be accessed by some of the components.
When an HSA program no longer needs to access a registered buffer from a device,
the user should deregister that virtual address range.
MEMORY REGION/SEGMENT
A memory region represents a virtual memory interval that is visible to a particular agent,
and contains properties about how memory is accessed or allocated from that agent.
Memory segments
Values
HSA_SEGMENT_GLOBAL = 1
HSA_SEGMENT_PRIVATE = 2
HSA_SEGMENT_GROUP = 4
HSA_SEGMENT_KERNARG = 8
HSA_SEGMENT_READONLY = 16
HSA_SEGMENT_IMAGE = 32
MEMORY REGION INFORMATION
Attributes of a memory region
Values
HSA_REGION_INFO_BASE_ADDRESS
HSA_REGION_INFO_SIZE
HSA_REGION_INFO_NODE
HSA_REGION_INFO_MAX_ALLOCATION_SIZE
HSA_REGION_INFO_SEGMENT
HSA_REGION_INFO_BANDWIDTH
HSA_REGION_INFO_CACHED
MEMORY REGION MANIPULATION (1)
Get the current value of an attribute of a region
Iterate over the memory regions that are visible to an agent, and invoke an
application-defined callback on every iteration
If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the
traversal stops and the function returns that status value.
MEMORY REGION MANIPULATION (2)
Allocate a block of memory
Deallocate a block of memory previously allocated
using hsa_memory_allocate
Copy a block of memory
Copying a number of bytes larger than the size of the
memory regions pointed to by dst or src results in
undefined behavior.
MEMORY REGISTRATION/DEREGISTRATION
Register memory
Parameters
address (input): A pointer to the base of
the memory region to be registered. If a
NULL pointer is passed, no operation is
performed.
size (input): Requested registration size
in bytes. A size of zero is only allowed if
address is NULL.
Deregister memory previously registered
using hsa_memory_register
Parameter
address (input): A pointer to the base of the
memory region to be deregistered. If a NULL
pointer is passed, no operation is performed.
EXAMPLE
Allocate a memory space
Use hsa_region_get_info to get the
size in bytes of this memory space
Register this memory space as a
performance hint
When finished, deregister and
free this memory space
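The walkthrough above can be sketched as a full lifecycle. mock_memory_register and mock_memory_deregister are invented stand-ins for hsa_memory_register/hsa_memory_deregister (which are performance hints, not allocators), and plain malloc/free play the role of the allocation step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { HSA_STATUS_SUCCESS = 0, HSA_STATUS_ERROR = 1 } hsa_status_t;

static size_t registered_bytes = 0;   /* toy stand-in for runtime bookkeeping */

hsa_status_t mock_memory_register(void *address, size_t size) {
    if (address == NULL) return HSA_STATUS_SUCCESS;  /* NULL: no operation */
    if (size == 0) return HSA_STATUS_ERROR;          /* zero size only valid with NULL */
    registered_bytes += size;                        /* record the hinted range */
    return HSA_STATUS_SUCCESS;
}

hsa_status_t mock_memory_deregister(void *address, size_t size) {
    if (address == NULL) return HSA_STATUS_SUCCESS;
    registered_bytes -= size;
    return HSA_STATUS_SUCCESS;
}

/* The whole lifecycle: allocate, hint, use, un-hint, free. */
hsa_status_t buffer_lifecycle(size_t size) {
    void *buf = malloc(size);                        /* allocate on the host */
    if (!buf) return HSA_STATUS_ERROR;
    if (mock_memory_register(buf, size) != HSA_STATUS_SUCCESS) {
        free(buf);
        return HSA_STATUS_ERROR;
    }
    memset(buf, 0, size);                            /* ... a kernel would use it ... */
    mock_memory_deregister(buf, size);               /* done with device access */
    free(buf);
    return HSA_STATUS_SUCCESS;
}
```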
SUMMARY
Covered
HSA Core Runtime API (Pre-release 1.0 provisional)
Runtime Initialization and Shutdown (Open/Close)
Notifications (Synchronous/Asynchronous)
Agent Information
Signals and Synchronization (Memory-Based)
Queues and Architected Dispatch
Memory Management
Not covered
Extension of Core Runtime
HSAIL Finalization, Linking, and Debugging
Images and Samplers
QUESTIONS?
HSA MEMORY MODEL
BEN GASTER, ENGINEER, QUALCOMM
OUTLINE
HSA Memory Model
OpenCL 2.0
Has a memory model too
Obstruction-free bounded deques
An example using the HSA memory model
HSA MEMORY MODEL
TYPES OF MODELS
Shared memory computers and programming languages divide complexity into
models:
1. Memory model specifies safety
e.g. what values can a load return?
This is what this section of the tutorial will focus on
2. Execution model specifies liveness
e.g. can a work-item prevent others from progressing?
Described in Ben Sander's tutorial section on HSAIL
3. Performance model specifies the big picture
e.g. caches or branch divergence
Specific to particular implementations and outside the scope of today's tutorial
THE PROBLEM
Assume all locations (a, b, …) are initialized to 0
What are the values of $s2 and $s4 after execution?
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
Equivalently, in C-like pseudo-code (initially *a = 0 && *b = 0):
Work-item 0:  *a = 1;  int x = *b;
Work-item 1:  *b = 1;  int y = *a;
THE SOLUTION
The memory model:
Defines the visibility of writes to memory at any given point
Provides us with a set of possible executions
WHAT MAKES A GOOD MEMORY MODEL*
Programmability: A good model should make it (relatively) easy to write multi-work-item
programs. The model should be intuitive to most users, even to those
who have not read the details
Performance: A good model should facilitate high-performance implementations
at reasonable power, cost, etc. It should give implementers broad latitude in
options
Portability: A good model would be adopted widely, or at least provide backward
compatibility or the ability to translate among models
* S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department,
University of Wisconsin–Madison, Nov. 1993.
SEQUENTIAL CONSISTENCY (SC)*
Axiomatic Definition
A single processor (core) is sequential if "the result of an execution is the same as if the
operations had been executed in the order specified by the program."
A multiprocessor is sequentially consistent if "the result of any execution is the same as if the
operations of all processors (cores) were executed in some sequential order, and the
operations of each individual processor (core) appear in this sequence in the order specified by
its program."
But HW/Compiler actually implements more relaxed models, e.g. ARMv7
* L. Lamport. How to Make a Multiprocessor Computer that Correctly
Executes Multiprocessor Programs. IEEE Transactions on Computers,
C-28(9):690–691, Sept. 1979.
SEQUENTIAL CONSISTENCY (SC)
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
$s2 = 0 && $s4 = 1
BUT WHAT ABOUT ACTUAL HARDWARE?
Sequential consistency is (reasonably) easy to understand, but limits
optimizations that the compiler and hardware can perform
Many modern processors implement many reordering optimizations
Store buffers (TSO*): work-items can see their own stores early
Reorder buffers (XC*): work-items can see other work-items' stores early
*TSO – Total Store Order as implemented by SPARC and x86
*XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
RELAXED CONSISTENCY (XC)
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
ld_global_u32 $s2, [&b] ;
ld_global_u32 $s4, [&a] ;
st_global_u32 $s1, [&a] ;
st_global_u32 $s3, [&b] ;
$s2 = 0 && $s4 = 0
WHAT ARE OUR 3 Ps?
Programmability: XC makes it really pretty hard for the programmer to reason about
what will be visible when
many memory model experts have been known to get it wrong!
Performance: XC is good for performance; the hardware (compiler) is free to
reorder many loads and stores, opening the door for performance and power
enhancements
Portability: XC is very portable as it places very few constraints
MY CHILDREN AND COMPUTER
ARCHITECTS ALL WANT
To have their cake and eat it!
HSA provides the ability for programmers to reason with the (relatively)
intuitive model of SC, while still achieving the
benefits of XC!
SEQUENTIAL CONSISTENCY FOR DRF*
HSA adopts the same approach as Java, C++11, and OpenCL 2.0, adopting SC for Data-Race-Free (DRF)
plus some new capabilities!
(Informally) A data race occurs when two (or more) work-items access the same memory
location such that:
At least one of the accesses is a WRITE
There are no intervening synchronization operations
SC for DRF asks:
Programmers to ensure programs are DRF under SC
Implementers to ensure that all executions of DRF programs on the relaxed model are also SC
executions
*S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the
17th Annual International Symposium on Computer Architecture, pp. 2–14, May
1990.
HSA SUPPORTS RELEASE CONSISTENCY
HSA’s memory model is based on RCsc: All atomic_ld_scacq and atomic_st_screl operations are SC
This implies coherence on all atomic_ld_scacq and atomic_st_screl operations to a single
address
All atomic_ld_scacq and atomic_st_screl operations are program ordered per work-
item (actually: sequence-ordered by language constraints)
Similar model adopted by ARMv8
HSA extends RCsc to SC for HRF*, to access the full capabilities of
modern heterogeneous systems, containing CPUs, GPUs, and DSPs,
for example.
© Copyright 2014 HSA Foundation. All Rights Reserved
*Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric
Memory Models for Heterogeneous Platforms. D. R. Hower, B. M. Beckmann, B. R.
Gaster, B. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. MSPC’13.
MAKING RELAXED CONSISTENCY WORK
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
atomic_st_global_u32_screl $s1, [&a] ;
atomic_ld_global_u32_scacq $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
atomic_st_global_u32_screl $s3, [&b] ;
atomic_ld_global_u32_scacq $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
atomic_st_global_u32_screl $s1, [&a] ;
atomic_ld_global_u32_scacq $s2, [&b] ;
atomic_st_global_u32_screl $s3, [&b] ;
atomic_ld_global_u32_scacq $s4, [&a] ;
$s2 = 0 && $s4 = 1
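The litmus test above can be written with C++11 sequentially consistent atomics, which play the role of HSA's atomic_st_screl/atomic_ld_scacq pairs here (a sketch; the register names s2/s4 are carried over for comparison). Under SC at least one of the two stores is ordered before the other work-item's load, so the outcome s2 == 0 && s4 == 0 is forbidden; s2 == 0 && s4 == 1 (and the symmetric cases) remain possible.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> a{0}, b{0};

// Work-item 0: store a, then load b (defaults are memory_order_seq_cst).
int wi0() { a.store(1); return b.load(); }
// Work-item 1: store b, then load a.
int wi1() { b.store(1); return a.load(); }

// Runs the litmus test once and returns s2 + s4.
int run() {
    a = 0; b = 0;
    int s2 = 0, s4 = 0;
    std::thread t0([&] { s2 = wi0(); });
    std::thread t1([&] { s4 = wi1(); });
    t0.join(); t1.join();
    // SC forbids s2 == 0 && s4 == 0, so the sum is always at least 1.
    return s2 + s4;
}
```

With relaxed atomics instead, both loads could return 0, which is exactly the XC outcome shown earlier.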
SEQUENTIAL CONSISTENCY FOR DRF
Two memory accesses participate in a data race if they
access the same location
at least one access is a store
can occur simultaneously
i.e. appear as adjacent operations in an interleaving.
A program is data-race-free if no possible execution results in a data race.
Sequential consistency for data-race-free programs
Avoid everything else
HSA: Not good enough!
© Copyright 2014 HSA Foundation. All Rights Reserved
ALL ARE NOT EQUAL – OR SOME CAN SEE
BETTER THAN OTHERS
Remember the HSAIL
Execution Model
© Copyright 2014 HSA Foundation. All Rights Reserved
device scope
group scope
wave scope
platform scope
DATA-RACE-FREE IS NOT ENOUGH
t1 t2 t3 t4
st_global 1, [&X]
atomic_st_global_screl 0, [&flag]
atomic_cas_global_scar 1, 0, [&flag]
...
atomic_st_global_screl 0, [&flag]
atomic_cas_global_scar 1, 0, [&flag]
ld_global (??), [&x]
group #1-2 group #3-4
Two ordinary memory accesses participate in a data race if they
Access same location
At least one is a store
Can occur simultaneously
Not a data race…
Is it SC?
Well that depends
(Diagram: t1 and t2 synchronize through scope S12, t3 and t4 through scope S34, with SGlobal spanning all four.)
visibility implied by
causality?
© Copyright 2014 HSA Foundation. All Rights Reserved
SEQUENTIAL CONSISTENCY FOR
HETEROGENEOUS-RACE-FREE
Two memory accesses participate in a heterogeneous race if
access the same location
at least one access is a store
can occur simultaneously
i.e. appear as adjacent operations in an interleaving.
Are not synchronized with “enough” scope
A program is heterogeneous-race-free if no possible execution results in a
heterogeneous race.
Sequential consistency for heterogeneous-race-free programs
Avoid everything else
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA HETEROGENEOUS RACE FREE
HRF0: Basic Scope Synchronization
“enough” = both threads synchronize using identical scope
Recall example:
Contains a heterogeneous race in HSA
t1 t2 t3 t4
st_global 1, [&X]
atomic_st_global_screl_wg 0, [&flag]
...
atomic_cas_global_scar_wg 1, 0, [&flag]
ld_global (??), [&x]
Workgroup #1-2 Workgroup #3-4
HSA Conclusion:
This is bad. Don’t do it.
© Copyright 2014 HSA Foundation. All Rights Reserved
HOW TO USE HSA WITH SCOPES
Use smallest scope that includes all
producers/consumers of shared data
HSA Scope Selection Guideline
Implication:
Producers/consumers must be known at synchronization time
Want: For performance, use smallest scope possible
What is safe in HSA?
Is this a valid assumption?
© Copyright 2014 HSA Foundation. All Rights Reserved
REGULAR GPGPU WORKLOADS
(Diagram: define the N x M problem space; partition it hierarchically; communicate locally N times; communicate globally M times.)
Well defined (regular) data partitioning +
Well defined (regular) synchronization pattern =
Producer/consumers are always known
Generally: HSA works well with
regular data-parallel workloads
© Copyright 2014 HSA Foundation. All Rights Reserved
t1 t2 t3 t4
st_global 1, [&X]
atomic_st_global_screl_plat 0, [&flag]
atomic_cas_global_scar_plat 1, 0, [&flag]
...
atomic_st_global_screl_plat 0, [&flag]
atomic_cas_global_scar_plat 1, 0, [&flag]
ld $s1, [&x]
IRREGULAR WORKLOADS
HSA: the example is a race
Must upgrade wg (workgroup) -> plat (platform)
HSA memory model says:
ld $s1, [&x], will see value (1)!
Workgroup #1-2 Workgroup #3-4
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL HAS MEMORY MODELS TOO
MAPPING ONTO HSA’S MEMORY MODEL
It is straightforward to provide a mapping from OpenCL 1.x to the proposed model
OpenCL 1.x atomics are unordered and so map to atomic_op_X
Mapping for fences not shown but straightforward
OPENCL 1.X MEMORY MODEL MAPPING
OpenCL Operation    HSA Memory Model Operation
Atomic load         atomic_ld_global_wg / atomic_ld_group_wg
Atomic store        atomic_st_global_wg / atomic_st_group_wg
atomic_op           atomic_op_global_comp / atomic_op_group_wg
barrier(…)          fence ; barrier_wg
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 BACKGROUND
Provisional specification released at SIGGRAPH’13, July 2013.
Huge update to OpenCL to account for the evolving hardware landscape and
emerging use cases (e.g. irregular workloads)
Key features:
Shared virtual memory, including platform atomics
Formally defined memory model based on C11 plus support for scopes
Includes an extended set of C11 atomic operations
Generic address space that subsumes global, local, and private
Device-to-device enqueue
Out-of-order device-side queuing model
Backwards compatible with OpenCL 1.x
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 MEMORY MODEL MAPPING
OpenCL Operation                   HSA Memory Model Operation
Load, memory_order_relaxed         atomic_ld_[global | group]_relaxed_scope
Store, memory_order_relaxed        atomic_st_[global | group]_relaxed_scope
Load, memory_order_acquire         atomic_ld_[global | group]_scacq_scope
Load, memory_order_seq_cst         atomic_ld_[global | group]_scacq_scope
Store, memory_order_release        atomic_st_[global | group]_screl_scope
Store, memory_order_seq_cst        atomic_st_[global | group]_screl_scope
atomic_op, memory_order_acq_rel    atomic_op_[global | group]_scar_scope
atomic_op, memory_order_seq_cst    atomic_op_[global | group]_scar_scope
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 MEMORY SCOPE MAPPING
OpenCL Scope                    HSA Scope
memory_scope_sub_group          _wave
memory_scope_work_group         _wg
memory_scope_device             _component
memory_scope_all_svm_devices    _platform
© Copyright 2014 HSA Foundation. All Rights Reserved
OBSTRUCTION-FREE
BOUNDED DEQUES
AN EXAMPLE USING THE HSA MEMORY MODEL
CONCURRENT DATA-STRUCTURES
Why do we need such a memory model in practice?
One important application of memory consistency is in the development and use
of concurrent data-structures
In particular, there is a class of data-structure implementations that provide non-
blocking guarantees:
wait-free: An algorithm is wait-free if every operation has a bound on the number of
steps the algorithm will take before the operation completes
In practice it is very hard to build efficient data-structures that meet this requirement
lock-free: An algorithm is lock-free if, given enough time, at least one of
the work-items (or threads) makes progress
In practice lock-free algorithms are implemented by work-items cooperating with one
another enough to allow progress
obstruction-free: An algorithm is obstruction-free if a work-item, running in isolation, can
make progress
© Copyright 2014 HSA Foundation. All Rights Reserved
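The lock-free guarantee above is the one most concurrent data-structures target in practice. A minimal C++ sketch of a lock-free (Treiber-style) stack, which is not the tutorial's deque but shows the same CAS-retry pattern: when a CAS fails it is because another thread succeeded, so the system as a whole makes progress.

```cpp
#include <atomic>

struct Node { int value; Node* next; };

std::atomic<Node*> top{nullptr};

// Lock-free push: on CAS failure, n->next is reloaded with the current
// top and we retry; some thread always succeeds.
void push(int v) {
    Node* n = new Node{v, top.load(std::memory_order_relaxed)};
    while (!top.compare_exchange_weak(n->next, n,
                                      std::memory_order_release,
                                      std::memory_order_relaxed)) {
        // retry with the refreshed n->next
    }
}

// Lock-free pop; returns false when the stack is empty.
// (A production version also needs a memory-reclamation scheme;
// here the popped node is simply leaked for brevity.)
bool pop(int& out) {
    Node* n = top.load(std::memory_order_acquire);
    while (n && !top.compare_exchange_weak(n, n->next,
                                           std::memory_order_acquire,
                                           std::memory_order_acquire)) {
        // n was refreshed with the current top; retry
    }
    if (!n) return false;
    out = n->value;
    return true;
}
```

An obstruction-free structure such as the HLM deque weakens this further: progress is only guaranteed for a work-item running in isolation, which permits simpler algorithms.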
BUT WHY NOT JUST USE MUTUAL
EXCLUSION?
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: an emerging compute cluster. Four Krait CPUs with a 2MB L2, an Adreno GPU, and a Hexagon DSP, each behind MMUs, share a fabric and memory controller.)
Diversity in a heterogeneous system, such as
different clock speeds, different scheduling
policies, and more can mean traditional
mutual exclusion is not the right choice
CONCURRENT DATA-STRUCTURES
Emerging heterogeneous compute clusters means we need:
To adapt existing concurrent data-structures
Develop new concurrent data-structures
Lock based programming may still be useful but often these algorithms will need
to be lock-free
Of course, this is a key application of the HSA memory model
To showcase this we highlight the development of a well known (HLM)
obstruction-free deque*
© Copyright 2014 HSA Foundation. All Rights Reserved
*Herlihy, M., Luchangco, V., and Moir, M. 2003. Obstruction-free
synchronization: double-ended queues as an example. In Proceedings of
ICDCS 2003, 522–529.
HLM - OBSTRUCTION-FREE DEQUE
Uses a fixed length circular queue
At any given time, reading from left to right, the array will contain:
Zero or more left-null (LN) values
Zero or more dummy-null (DN) values
Zero or more right-null (RN) values
At all times there must be:
At least two different null values
At least one LN or DN, and at least one DN or RN
Memory consistency is required to allow multiple producers and multiple
consumers, potentially happening in parallel from the left and right ends, to see
changes from other work-items (HSA Components) and threads (HSA Agents)
© Copyright 2014 HSA Foundation. All Rights Reserved
HLM - OBSTRUCTION-FREE DEQUE
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: circular array with LN entries on the left, stored values in the middle, and RN entries on the right, bracketed by the left and right hint indices.)
Key:
LN – left null value
RN – right null value
v – value
left – left hint index
right – right hint index
C REPRESENTATION OF DEQUE
struct node {
uint64_t type : 2; // null type (LN, RN, DN)
uint64_t counter : 8; // version counter to avoid ABA
uint64_t value : 54; // index value stored in queue
};
struct queue {
unsigned int size; // size of bounded buffer
node * array; // backing store for deque itself
};
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL REPRESENTATION
Allocate a deque in global memory using HSAIL
@deque_instance:
align 64 global_u32 &size;
align 8 global_u64 &array;
© Copyright 2014 HSA Foundation. All Rights Reserved
ORACLE
Assume a function:
function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);
Which, given a deque:
returns (%k) the position of the left-most RN
atomic_ld_global_scacq used to read nodes from the array
Makes one if necessary (i.e. if there are only LN or DN values)
atomic_cas_global_scar required to make a new RN
returns (%left) the left node (i.e. the value to the left of the left-most RN position)
returns (%right) the right node (i.e. the value at position (%k))
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP
function &right_pop (arg_u32 %err, arg_u64 %result) (arg_u64 %deque) {
// load queue address
ld_arg_u64 $d0, [%deque];
@loop_forever:
// setup and call right oracle to get next RN
arg_u32 %k; arg_u64 %current; arg_u64 %next;
call &rcheck_oracle (%k, %current, %next) (%deque) ;
ld_arg_u32 $s0, [%k]; ld_arg_u64 $d1, [%current]; ld_arg_u64 $d2, [%next];
// current.type($d5)
shr_u64 $d5, $d1, 62;
// current.counter($d6)
and_u64 $d6, $d1, 0x3FC0000000000000;
shr_u64 $d6, $d6, 54;
// current.value($d7)
and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
// next.counter($d8)
and_u64 $d8, $d2, 0x3FC0000000000000; shr_u64 $d8, $d8, 54;
brn @loop_forever ;
}
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP – TEST FOR EMPTY
// current.type($d5) == LN || current.type($d5) == DN
cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN;
and_b1 $c0, $c0, $c1;
cbr $c0, @not_empty ;
// current node address (%deque($d0) + (%k($s0) - 1) * 8)
add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 8; cvt_u64_u32 $d3, $s1; add_u64 $d3, $d0, $d3;
atomic_ld_global_scacq_u64 $d4, [$d3];
cmp_neq_b1_u64 $c0, $d4, $d1;
cbr $c0, @not_empty;
st_arg_u32 EMPTY, [%err]; // deque empty so return EMPTY
ret;
@not_empty:
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP – TRY READ/REMOVE NODE
// $d9 = node(RN, next.cnt+1, 0)
add_u64 $d8, $d8, 1;
shl_u64 $d8, $d8, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d9, $d8;
// cas(deq+k, next, node(RN, next.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$s0], $d2, $d9;
cmp_neq_b1_u64 $c0, $d9, $d2;
cbr $c0, @cas_failed;
// $d9 = node(RN, current.cnt+1, 0)
add_u64 $d6, $d6, 1;
shl_u64 $d6, $d6, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d9, $d6;
// cas(deq+(k-1), current, node(RN, current.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$s1], $d1, $d9;
cmp_neq_b1_u64 $c0, $d9, $d1;
cbr $c0, @cas_failed;
st_arg_u32 SUCCESS, [%err];
st_arg_u64 $d7, [%result];
ret;
@cas_failed:
// loop back around and try again
© Copyright 2014 HSA Foundation. All Rights Reserved
TAKE AWAYS
HSA provides a powerful and modern memory model
Based on the well-known SC for DRF
Defined as Release Consistency
Extended with scopes as defined by HRF
OpenCL 2.0 introduces a new memory model
Also based on SC for DRF
Also defined in terms of Release Consistency
Also extended with scopes as defined in HRF
Has a well defined mapping to HSA
Concurrent algorithm development for emerging heterogeneous compute
clusters can benefit from the HSA and OpenCL 2.0 memory models
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA QUEUING MODEL
HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER, ARM
HSA QUEUEING, MOTIVATION
MOTIVATION (TODAY’S PICTURE)
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: today's offload flow between Application, OS, and GPU. The application transfers a buffer to the GPU; the OS copies/maps memory; the application queues the job; the OS schedules the job; the GPU starts and finishes it; the OS then schedules the application, which gets the buffer back via another copy/map.)
HSA QUEUEING: REQUIREMENTS
REQUIREMENTS
Three key technologies are used to build the user mode queueing
mechanism
Shared Virtual Memory
System Coherency
Signaling
AQL (Architected Queueing Language) enables any agent
to enqueue tasks
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
SHARED VIRTUAL MEMORY (TODAY)
Multiple Virtual memory address spaces
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: CPU0 uses VIRTUAL MEMORY1 and the GPU uses VIRTUAL MEMORY2; mappings VA1->PA1 and VA2->PA1 point both into the same PHYSICAL MEMORY.)
SHARED VIRTUAL MEMORY (HSA)
Common Virtual Memory for all HSA agents
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: CPU0 and the GPU share one VIRTUAL MEMORY; the same VA->PA mapping points into PHYSICAL MEMORY for both.)
SHARED VIRTUAL MEMORY
Advantages
No mapping tricks, no copying back-and-forth between different PA addresses
Send pointers (not data) back and forth between HSA agents.
Implications
Common Page Tables (and common interpretation of architectural semantics such as shareability, protection, etc).
Common mechanisms for address translation (and servicing address translation faults)
Concept of a process address space (PASID) to allow multiple, per process virtual address spaces within the system.
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
Specifics
Minimum supported VA width is 48b for 64b systems, and 32b for
32b systems.
HSA agents may reserve VA ranges for internal use via system
software.
All HSA agents other than the host unit must use the lowest privilege
level
If present, read/write access flags for page tables must be
maintained by all agents.
Read/write permissions apply to all HSA agents, equally.
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING THERE …
© Copyright 2014 HSA Foundation. All Rights Reserved
(Same Application/OS/GPU offload flow diagram as in the motivation, with the memory copy/map steps now eliminated by shared virtual memory.)
CACHE COHERENCY
CACHE COHERENCY DOMAINS (1/3)
Data accesses to global memory segment from all HSA Agents shall be
coherent without the need for explicit cache maintenance.
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (2/3)
Advantages
Composability
Reduced SW complexity when communicating between agents
Lower barrier to entry when porting software
Implications
Hardware coherency support between all HSA agents
Can take many forms
Stand alone Snoop Filters / Directories
Combined L3/Filters
Snoop-based systems (no filter)
Etc …
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (3/3)
Specifics
No requirement for instruction memory accesses to be coherent
Only applies to the Primary memory type.
No requirement for HSA agents to maintain coherency to any memory location where the HSA agents do not specify the same memory attributes
Read-only image data is required to remain static during the execution of an HSA kernel.
No double mapping (via different attributes) in order to modify. Must remain static
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING CLOSER …
© Copyright 2014 HSA Foundation. All Rights Reserved
(Same Application/OS/GPU offload flow diagram, with further steps now eliminated by cache coherency.)
SIGNALING
SIGNALING (1/3)
HSA agents support the ability to use signaling objects
All creation/destruction of signaling objects occurs via HSA
runtime APIs
From an HSA agent you can directly access signaling objects:
Signaling a signal object (this will wake up HSA agents
waiting upon the object)
Querying the current object value
Waiting on the current object value (various conditions supported)
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (2/3)
Advantages
Enables asynchronous events between HSA agents, without involving the kernel
Common idiom for work offload
Low power waiting
Implications
Runtime support required
Commonly implemented on top of cache coherency flows
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (3/3)
Specifics
Only supported within a PASID
Supported wait conditions are =, !=, < and >=
Wait operations may return sporadically (no guarantee against false positives)
Programmer must test.
Wait operations have a maximum duration before returning.
The HSAIL atomic operations are supported on signal objects.
Signal objects are opaque
Must use dedicated HSAIL/HSA runtime operations
© Copyright 2014 HSA Foundation. All Rights Reserved
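Because wait operations may return spuriously and have a maximum duration, the caller must retest the condition in a loop. A sketch of that idiom, modeling the signal as a plain atomic rather than the real runtime object (the helper names are mine; only the retest pattern is the point):

```cpp
#include <atomic>
#include <cstdint>

// Model of an HSA signal's value (illustrative only; real signal objects
// are opaque and accessed via dedicated HSAIL/runtime operations).
std::atomic<int64_t> signal_value{0};

// Hypothetical wait primitive: like an HSA signal wait, it may return
// before the condition holds (false positive or timeout), so its return
// value is only a hint.
int64_t wait_hint() { return signal_value.load(std::memory_order_acquire); }

// Correct usage: retest the condition after every return from the wait.
int64_t wait_until_eq(int64_t expected) {
    int64_t v;
    do {
        v = wait_hint();   // may return sporadically; always retest
    } while (v != expected);
    return v;
}
```

The same loop works for the other supported conditions (!=, <, >=); only the comparison in the retest changes.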
ALMOST THERE…
© Copyright 2014 HSA Foundation. All Rights Reserved
(Same Application/OS/GPU offload flow diagram, with further steps now eliminated by signaling.)
USER MODE QUEUING
ONE BLOCK LEFT
© Copyright 2014 HSA Foundation. All Rights Reserved
(Same Application/OS/GPU offload flow diagram; only the OS queue/schedule steps remain to be eliminated.)
USER MODE QUEUEING (1/3)
User mode Queueing
Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.
Queues are created/destroyed via calls to the HSA runtime.
One (or many) agents enqueue packets, a single agent dequeues packets.
Requires coherency and shared virtual memory.
© Copyright 2014 HSA Foundation. All Rights Reserved
USER MODE QUEUEING (2/3)
Advantages
Avoid involving the kernel/driver when dispatching work for an Agent.
Lower latency job dispatch enables finer granularity of offload
Standard memory protection mechanisms may be used to protect communication with
the consuming agent.
Implications
Packet formats/fields are Architected – standard across vendors!
Guaranteed backward compatibility
Packets are enqueued/dequeued via an Architected protocol (all via memory
accesses and signaling)
© Copyright 2014 HSA Foundation. All Rights Reserved
SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
(Same Application/OS/GPU offload flow diagram, with every OS step now eliminated by user mode queuing.)
SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Queue Job
Start Job
Finish Job
ARCHITECTED QUEUEING
LANGUAGE, QUEUES
ARCHITECTED QUEUEING LANGUAGE
HSA Queues look just like standard shared memory queues, supporting multi-producer, single-consumer
Single producer variant defined with some optimizations possible.
Queues consist of storage, read/write indices, ID, etc.
Queues are created/destroyed via calls to the HSA runtime
“Packets” are placed in queues directly from user mode, via an architected protocol
Packet format is architected
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: multiple Producers and a single Consumer share a queue with a Read Index, a Write Index, and packet Storage in coherent, shared memory.)
ARCHITECTED QUEUING LANGUAGE
Packets are read and dispatched for execution from the queue in order, but may complete in any order.
There is no guarantee that more than one packet will be processed in parallel at a time
There may be many queues. A single agent may also consume from several queues.
Any HSA agent may enqueue packets
CPUs
GPUs
Other accelerators
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE
© Copyright 2014 HSA Foundation. All Rights Reserved
Offset (bytes)  Size (bytes)  Field           Notes
0               4             queueType       Differentiate different queues
4               4             queueFeatures   Indicate supported features
8               8             baseAddress     Pointer to packet array
16              8             doorbellSignal  HSA signaling object handle
24              4             size            Packet array cardinality
28              4             queueId         Unique per process
32              8             serviceQueue    Queue for callback services
intrinsic       8             writeIndex      Packet array write index
intrinsic       8             readIndex       Packet array read index
QUEUE VARIANTS
queueType and queueFeatures together define queue semantics and
capabilities
Two queueType values defined, other values reserved:
MULTI – queue supports multiple producers
SINGLE – queue supports single producer
queueFeatures is a bitfield indicating capabilities
DISPATCH (bit 0) if set then queue supports DISPATCH packets
AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets
All other bits are reserved and must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE DETAILS
Queue doorbells are HSA signaling objects with restrictions
Created as part of the queue – lifetime tied to queue object
Atomic read-modify-write not allowed
size field value must be a power of 2
serviceQueue can be used by an HSA kernel for callback services
Provided by application when queue is created
Can be mapped to HSA runtime provided serviceQueue, an application serviced
queue, or NULL if no serviceQueue required
© Copyright 2014 HSA Foundation. All Rights Reserved
READ/WRITE INDICES
readIndex and writeIndex properties are part of the queue, but not visible in the queue structure
Accessed through HSA runtime API and HSAIL operations
HSA runtime/HSAIL operations defined to
Read readIndex or writeIndex property
Write readIndex or writeIndex property
Add constant to writeIndex property (returns previous writeIndex value)
CAS on writeIndex property
readIndex & writeIndex operations treated as atomic in memory model
relaxed, acquire, release and acquire-release variants defined as applicable
readIndex and writeIndex never wrap
PacketID – the index of a particular packet
Uniquely identifies each packet of a queue
© Copyright 2014 HSA Foundation. All Rights Reserved
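Because readIndex and writeIndex never wrap and the packet array size is a power of two, a PacketID maps to an array slot with a single mask, the same computation the enqueue/dequeue algorithms on the later slides use. A small sketch:

```cpp
#include <cstdint>

// PacketIDs increase monotonically and never wrap; with a power-of-two
// packet array, the backing slot for a PacketID is just a mask. The
// fullness check compares the 64-bit indices directly, never the slots.
constexpr uint32_t slot_for(uint64_t packetID, uint32_t size) {
    return static_cast<uint32_t>(packetID & (uint64_t{size} - 1));
}
```

Keeping the indices monotonic is what makes `packetID >= readIndex + size` a valid "queue full" test: no modular arithmetic ambiguity can arise.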
PACKET ENQUEUE
Packet enqueue follows a few simple steps:
Reserve space
Multiple packets can be reserved at a time
Write packet to queue
Mark packet as valid
Producer no longer allowed to modify packet
Consumer is allowed to start processing packet
Notify consumer of packet through the queue doorbell
Multiple packets can be notified at a time
Doorbell signal should be signaled with last packetID notified
On the small machine model the lower 32 bits of the packetID are used
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET RESERVATION
Two flows envisaged
Atomic add writeIndex with number of packets to reserve
Producer must wait until packetID < readIndex + size before writing to packet
Queue can be sized so that wait is unlikely (or impossible)
Suitable when many threads use one queue
Check queue not full first, then use atomic CAS to update writeIndex
Can be inefficient if many threads use the same queue
Allows different failure model if queue is congested
© Copyright 2014 HSA Foundation. All Rights Reserved
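The two reservation flows above can be sketched against a simplified queue (field names follow the queue structure slide; the spin/backoff details and function names are illustrative, not runtime API):

```cpp
#include <atomic>
#include <cstdint>

struct Queue {
    uint32_t size = 0;                 // packet array cardinality (power of 2)
    std::atomic<uint64_t> writeIndex{0};
    std::atomic<uint64_t> readIndex{0};
};

// Flow 1: unconditionally reserve with an atomic add, then wait until
// the reserved slot is free. Indices never wrap, so a plain comparison
// suffices; sizing the queue generously makes the wait unlikely.
uint64_t reserve_add(Queue& q) {
    uint64_t packetID = q.writeIndex.fetch_add(1, std::memory_order_relaxed);
    while (packetID >= q.readIndex.load(std::memory_order_relaxed) + q.size) {
        // queue full: spin (or back off) until the consumer catches up
    }
    return packetID;
}

// Flow 2: check for space first, then CAS the writeIndex. Fails fast
// when the queue is congested, letting the caller choose what to do.
bool reserve_cas(Queue& q, uint64_t& packetID) {
    uint64_t w = q.writeIndex.load(std::memory_order_relaxed);
    if (w >= q.readIndex.load(std::memory_order_relaxed) + q.size)
        return false;                  // full: report failure, don't wait
    if (!q.writeIndex.compare_exchange_strong(w, w + 1,
                                              std::memory_order_relaxed))
        return false;                  // lost the race: caller may retry
    packetID = w;
    return true;
}
```

Flow 1 suits many producers sharing one queue (the add always succeeds); flow 2 trades CAS-retry inefficiency under contention for an explicit failure path when the queue is full.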
QUEUE OPTIMIZATIONS
Queue behavior is loosely defined to allow optimizations
Some potential producer behavior optimizations:
Keep local copy of readIndex, update when required
For single producer queues:
Keep local copy of writeIndex
Use store operation rather than add/cas atomic to update writeIndex
Some potential consumer behavior optimizations:
Use packet format field to determine whether a packet has been submitted rather than writeIndex property
Speculatively read multiple packets from the queue
Not update readIndex for each packet processed
Rely on value used for doorbellSignal to notify new packets
Especially useful for single producer queues
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL MULTI-PRODUCER ALGORITHM
// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);
// Wait until the queue is no longer full
uint64_t rdIdx;
do {
rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));
// Calculate index
uint32_t arrayIdx = packetID & (q->size - 1);
// Copy over the packet; the format field is INVALID
q->baseAddress[arrayIdx] = pkt;
// Update format field with release semantics
q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);
// Ring doorbell (could also amortize over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL CONSUMER ALGORITHM
// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);
// Calculate the index
uint32_t arrayIdx = readIndex & (q->size - 1);
// Spin while empty (could also perform low-power wait on doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }
// Copy over the packet
pkt = q->baseAddress[arrayIdx];
// Set the format field to INVALID
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the readIndex using HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex + 1);
// Now process <pkt>!
© Copyright 2014 HSA Foundation. All Rights Reserved
ARCHITECTED QUEUEING
LANGUAGE, PACKETS
PACKETS
© Copyright 2014 HSA Foundation. All Rights Reserved
Packets come in three main types with architected layouts
Always reserved & Invalid
Do not contain any valid tasks and are not processed (queue will not progress)
Dispatch
Specifies kernel execution over a grid
Agent Dispatch
Specifies a single function to perform with a set of parameters
Barrier
Used for task dependencies
COMMON PACKET HEADER
Start Offset (Bytes)  Format    Field Name             Description
0                     uint16_t  format:8               Contains the packet type (Always reserved, Invalid, Dispatch, Agent Dispatch, and Barrier). Other values are reserved and should not be used.
                                barrier:1              If set then processing of the packet will only begin when all preceding packets are complete.
                                acquireFenceScope:2    Determines the scope and type of the memory fence operation applied before the packet enters the active phase. Must be 0 for Barrier packets.
                                releaseFenceScope:2    Determines the scope and type of the memory fence operation applied after kernel completion but before the packet is completed.
                                reserved:3             Must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
DISPATCH PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset (Bytes)  Format    Field Name               Description
0                     uint16_t  header                   Packet header
2                     uint16_t  dimensions:2             Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
                                reserved:14              Must be 0.
4                     uint16_t  workgroupSize.x          x dimension of work-group (measured in work-items).
6                     uint16_t  workgroupSize.y          y dimension of work-group (measured in work-items).
8                     uint16_t  workgroupSize.z          z dimension of work-group (measured in work-items).
10                    uint16_t  reserved2                Must be 0.
12                    uint32_t  gridSize.x               x dimension of grid (measured in work-items).
16                    uint32_t  gridSize.y               y dimension of grid (measured in work-items).
20                    uint32_t  gridSize.z               z dimension of grid (measured in work-items).
24                    uint32_t  privateSegmentSizeBytes  Total size in bytes of private memory allocation request (per work-item).
28                    uint32_t  groupSegmentSizeBytes    Total size in bytes of group memory allocation request (per work-group).
32                    uint64_t  kernelObjectAddress      Address of an object in memory that includes an implementation-defined executable ISA image for the kernel.
40                    uint64_t  kernargAddress           Address of memory containing kernel arguments.
48                    uint64_t  reserved3                Must be 0.
56                    uint64_t  completionSignal         Address of HSA signaling object used to indicate completion of the job.
AGENT DISPATCH PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset (Bytes)  Format    Field Name        Description
0                     uint16_t  header            Packet header
2                     uint16_t  type              The function to be performed by the destination agent. The type value is split into the following ranges:
                                                  0x0000:0x3FFF – Vendor specific
                                                  0x4000:0x7FFF – HSA runtime
                                                  0x8000:0xFFFF – User registered function
4                     uint32_t  reserved2         Must be 0.
8                     uint64_t  returnLocation    Pointer to location to store the function return value in.
16                    uint64_t  arg[0]            64-bit direct or indirect arguments.
24                    uint64_t  arg[1]
32                    uint64_t  arg[2]
40                    uint64_t  arg[3]
48                    uint64_t  reserved3         Must be 0.
56                    uint64_t  completionSignal  Address of HSA signaling object used to indicate completion of the job.
BARRIER PACKET
Used for specifying dependences between packets
HSA agent will not launch any further packets from this queue until the barrier
packet signal conditions are met
Used for specifying dependences on packets dispatched from any queue.
Execution phase completes only when all of the dependent signals (up to five) have
been signaled (with the value of 0).
Or if an error has occurred in one of the packets upon which we have a dependence.
© Copyright 2014 HSA Foundation. All Rights Reserved
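The dependency convention above (barrier proceeds when each dependent signal reaches 0, and completing packets decrement their completion signal) can be modeled in C++ with an atomic counter; an illustrative sketch, not runtime API:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// A dependent signal initialized to the number of packets that must
// complete before the barrier's condition (value == 0) is met.
std::atomic<int> depSignal{0};

// Completion phase of one packet: release semantics on the decrement
// make the packet's results visible to whoever later observes the 0.
void complete_packet() {
    depSignal.fetch_sub(1, std::memory_order_release);
}

// The barrier's wait: its execution phase ends only when the signal
// has been decremented to 0 by every dependent packet.
int wait_barrier() {
    while (depSignal.load(std::memory_order_acquire) != 0) { /* spin */ }
    return depSignal.load();
}

// Run `packets` completing packets against one barrier; returns the
// signal value the barrier observed (0 once all have completed).
int run_barrier_demo(int packets) {
    depSignal.store(packets);
    std::vector<std::thread> ts;
    for (int i = 0; i < packets; ++i) ts.emplace_back(complete_packet);
    int v = wait_barrier();
    for (auto& t : ts) t.join();
    return v;
}
```

With up to five such signals per barrier packet, one packet can gate on work dispatched from several different queues at once.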
BARRIER PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset (Bytes)  Format    Field Name        Description
0                     uint16_t  header            Packet header, see 2.8.1 Packet header (p. 16).
2                     uint16_t  reserved2         Must be 0.
4                     uint32_t  reserved3         Must be 0.
8                     uint64_t  depSignal0        Addresses of dependent signaling objects to be evaluated by the packet processor.
16                    uint64_t  depSignal1
24                    uint64_t  depSignal2
32                    uint64_t  depSignal3
40                    uint64_t  depSignal4
48                    uint64_t  reserved4         Must be 0.
56                    uint64_t  completionSignal  Address of HSA signaling object used to indicate completion of the job.
DEPENDENCES
A user may never assume more than one packet is being executed by an HSA
agent at a time.
Implications:
Packets can’t poll on shared memory values which will be set by packets issued from
other queues, unless the user has ensured the proper ordering.
To ensure all previous packets from a queue have been completed, use the Barrier
bit.
To ensure specific packets from any queue have completed, use the Barrier packet.
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA QUEUEING, PACKET EXECUTION
PACKET EXECUTION
Launch phase
Initiated when launch conditions are met
All preceding packets in the queue must have exited launch phase
If the barrier bit in the packet header is set, then all preceding packets in the queue must have exited completion phase
Includes memory acquire fence
Active phase
Execute the packet
Barrier packets remain in Active phase until conditions are met.
Completion phase
First step is memory release fence – make results visible.
completionSignal field is then signaled with a decrementing atomic.
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET EXECUTION – BARRIER BIT
© Copyright 2014 HSA Foundation. All Rights Reserved
(Timeline diagram: Pkt1 and Pkt2 launch, execute, and complete, possibly overlapping. Pkt3 has barrier=1, so it launches only when all preceding packets in the queue have completed, and then executes.)
PUTTING IT ALL TOGETHER (FFT)
© Copyright 2014 HSA Foundation. All Rights Reserved
(Diagram: an 8-point FFT over X[0]..X[7] built from three stages of dispatch packets (1-2, 3-4, 5-6), with a barrier between successive stages.)
PUTTING IT ALL TOGETHER
© Copyright 2014 HSA Foundation. All Rights Reserved
AQL Pseudo Code
// Send the packets to do the first stage
aql_dispatch(pkt1);
aql_dispatch(pkt2);
// Send the next two packets, setting the barrier bit so we
// know packets 1 & 2 will be complete before 3 and 4 are launched
aql_dispatch_with_barrier_bit(pkt3);
aql_dispatch(pkt4);
// Same as above (make sure 3 & 4 are done before issuing 5 & 6)
aql_dispatch_with_barrier_bit(pkt5);
aql_dispatch(pkt6);
// This packet will notify us when 5 & 6 are complete
aql_dispatch_with_barrier_bit(finish_pkt);
PACKET EXECUTION – BARRIER PACKET
Figure: queue Q2 holds a Barrier packet followed by task T2; queue Q1 holds task T1. Signal X is initialized to 1 and serves as both the Barrier packet's depSignal0 and T1's completionSignal. The Barrier packet and T1 launch and execute; when T1 completes it decrements signal X, and the Barrier packet completes when signal X is signalled with 0. T2 launches once the Barrier packet is complete.
© Copyright 2014 HSA Foundation. All Rights Reserved
DEPTH FIRST CHILD TASK EXECUTION
Consider two generations of child tasks
Task T submits tasks T.1 & T.2
Task T.1 submits tasks T.1.1 & T.1.2
Task T.2 submits tasks T.2.1 & T.2.2
Desired outcome
Depth first child task execution
I.e. T, T.1, T.1.1, T.1.2, T.2, T.2.1, T.2.2
T is passed a signal (allComplete) to decrement when all tasks are complete (T and its children etc.)
© Copyright 2014 HSA Foundation. All Rights Reserved
Figure: task tree with T at the root, children T.1 and T.2, and grandchildren T.1.1, T.1.2, T.2.1, T.2.2.
HOW TO DO THIS WITH HSA QUEUES?
Use a separate user mode queue for each recursion level
Task T submits to queue Q1
Tasks T.1 & T.2 submit tasks to queue Q2
Queues could be passed in as parameters to task T
Depth first requires ordering of T.1, T.2 and their children
Use additional signal object (childrenComplete) to track completion of the children of
T.1 & T.2
childrenComplete set to number of children (i.e. 2) by each of T.1 & T.2
© Copyright 2014 HSA Foundation. All Rights Reserved
A PICTURE SAYS MORE THAN 1000 WORDS
© Copyright 2014 HSA Foundation. All Rights Reserved
Figure: queue Q1 holds T.1, a Barrier packet, T.2, and a second Barrier packet; queue Q2 holds T.1.1, T.1.2, T.2.1, T.2.2. Each Barrier packet waits on the childrenComplete signal before the next task on Q1 may launch, and allComplete is decremented when everything has finished.
SUMMARY
© Copyright 2014 HSA Foundation. All Rights Reserved
KEY HSA TECHNOLOGIES
HSA combines several mechanisms to enable low overhead task
dispatch
Shared Virtual Memory
System Coherency
Signaling
AQL
User mode queues – from any compatible agent
Architected packet format
Rich dependency mechanism
Flexible and efficient signaling of completion
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA APPLICATIONS
WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS
WITH J.P. BORDES AND JUAN GOMEZ
USE CASES SHOWING HSA ADVANTAGE

Pointer-based Data Structures. Use case: binary tree searches; the GPU performs parallel searches in a CPU-created binary tree. HSA advantage: CPU and GPU have access to the entire unified coherent memory, so the GPU can access existing data structures containing pointers.

Platform Atomics. Use cases: work-group dynamic task management, where the GPU directly operates on a task pool managed by the CPU for algorithms with dynamic computation loads; and binary tree updates, with CPU and GPU operating simultaneously on the tree, both doing modifications. HSA advantage: CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.

Large Data Sets. Use case: hierarchical data searches; applications include object recognition, collision detection, global illumination, BVH. HSA advantage: CPU and GPU have access to the entire unified coherent memory, so the GPU can operate on huge models in place, reducing copy and kernel launch overhead.

CPU Callbacks. Use case: middleware user callbacks; the GPU processes work items, some of which require a call to a CPU function to fetch new data. HSA advantage: the GPU can invoke CPU functions from within a GPU kernel; simpler programming that does not require “split kernels”; higher performance through parallel operations.

© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY
FOR POINTER-BASED DATA
STRUCTURES
UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES
Legacy
Figure (animation): the CPU builds a pointer-based tree and a result buffer in system memory. The legacy GPU cannot follow CPU pointers, so the tree is first flattened and copied, along with the result buffer, into GPU memory; the kernel then searches the flat copy, and the result buffer is copied back to system memory.
© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES
HSA and full OpenCL 2.0
Figure (animation): the kernel runs directly on the CPU-created tree and result buffer in system memory; no flattening and no copies are needed, and results land in place.
© Copyright 2014 HSA Foundation. All Rights Reserved
POINTER DATA STRUCTURES - CODE COMPLEXITY
Figure: side-by-side source listings of the HSA and legacy versions.
© Copyright 2014 HSA Foundation. All Rights Reserved
POINTER DATA STRUCTURES - PERFORMANCE
Figure: binary tree search rate (nodes/ms, 0 to 60,000) vs. tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 core), Legacy APU, and HSA APU.
Measured in AMD labs Jan 1-3 on system shown in backup slide
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS FOR
DYNAMIC TASK MANAGEMENT
PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT
Legacy*
*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010
Figure (animation): the task pool and the per-queue counters (NUM. WRITTEN TASKS, NUM. CONSUMED TASKS) start in system memory, with mirror copies of queues 1 and 2 and their counters in GPU memory. The host fills the pool and asynchronously transfers the queue contents and the written-tasks count (4) into GPU memory; work-groups 1-4 then drain the queue, stepping the consumed-tasks count from 0 to 4 with one atomic add per task; the final count is transferred back to system memory. Every hand-off between CPU and GPU is an explicit copy.
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT
HSA and full OpenCL 2.0
Figure (animation): the task pool and the written/consumed task counters live in host coherent memory, directly visible to all four work-groups; GPU memory is not used for staging. The CPU publishes tasks with a plain memcpy (the written-tasks count becomes 4), and work-groups claim them in place, stepping the shared consumed-tasks count from 0 to 4 with one platform atomic add per task. Zero-copy: no transfers between system and GPU memory are needed.
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS – CODE COMPLEXITY
Legacy: host enqueue function is 102 lines of code.
HSA: host enqueue function is 20 lines of code.
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS - PERFORMANCE
Figure: execution time (ms, 0 to 700) for 64, 128, 256, and 512 tasks per insertion, at task-pool sizes of 4096 and 16384, comparing the legacy implementation against the HSA implementation.
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS FOR
CPU/GPU COLLABORATION
PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION
Legacy
Figure (animation): only the GPU kernel can work on the input buffer and tree; concurrent CPU processing is not possible.
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION
HSA and full OpenCL 2.0
Figure (animation): the GPU kernel and two CPU cores (CPU 0 and CPU 1) operate on the same tree and input buffer concurrently, synchronizing through platform atomics.
© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY
FOR LARGE
DATA SETS
PROCESSING LARGE DATA SETS
The CPU creates a large data structure in System Memory. Computations
using the data are offloaded to the GPU.
SYSTEM MEMORY
GPU
© Copyright 2014 HSA Foundation. All Rights Reserved
PROCESSING LARGE DATA SETS
The CPU creates a large data structure in System Memory. Computations using the data are offloaded to the GPU.
Compare HSA and Legacy methods
Figure: a large 3D spatial data structure, organized as a five-level hierarchy (Levels 1-5), resides in system memory alongside the GPU.
© Copyright 2014 HSA Foundation. All Rights Reserved
LEGACY ACCESS USING GPU MEMORY
Legacy
GPU Memory is smaller
Have to copy and process in chunks
© Copyright 2014 HSA Foundation. All Rights Reserved
LEGACY ACCESS TO LARGE STRUCTURES: COPY AND PROCESS ONE CHUNK AT A TIME
Figure (animation): the five-level hierarchy cannot fit in GPU memory, so the host streams it in chunks. A copy of the top two levels of the hierarchy is made into GPU memory and the first kernel processes it; then the bottom three levels of one branch are copied over and a second kernel processes them; the copy/process cycle repeats with a different branch each pass until the Nth kernel has run.
© Copyright 2014 HSA Foundation. All Rights Reserved
LARGE SPATIAL DATA STRUCTURE: GPU CAN TRAVERSE ENTIRE HIERARCHY
HSA and full OpenCL 2.0
Figure (animation): the kernel runs directly against the full five-level hierarchy (Levels 1-5) in system memory, traversing it in place with no copies and a single kernel launch.
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
CALLBACKS
Parallel processing algorithm with branches
A seldom taken branch requires new data from the CPU
On legacy systems, the algorithm must be split:
Process Kernel 1 on GPU
Check for CPU callbacks and if any, process on CPU
Process Kernel 2 on GPU
Example algorithm from Image Processing
Perform a filter
Calculate average LUMA in each tile
Compare LUMA against threshold and call CPU callback if exceeded (rare)
Perform special processing on tiles with callbacks
A COMMON SITUATION IN HETEROGENEOUS COMPUTING
Figure: input image and output image
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
Legacy
Figure: GPU threads 0 through N run the first kernel; threads that need a CPU callback must wait until the kernel ends, and a continuation kernel finishes up the work, resulting in poor GPU utilization.
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
Input Image
1 Tile = 1 OpenCL Work Item
Output Image
GPU
• Work items compute average RGB value of all the pixels in a tile
• Work items also compute average Luma from the average RGB
• If average Luma > threshold, workgroup invokes CPU CALLBACK
• In parallel with callback, continue compute
CPU
• For selected tiles, update average Luma value (set to RED)
GPU
• Work items apply the Luma value to all pixels in the tile
GPU to CPU callbacks use Shared
Virtual Memory (SVM) Semaphores,
implemented using Platform Atomic
Compare-and-Swap.
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
HSA and full OpenCL 2.0
Figure: GPU threads 0 through N keep running; the few kernel threads that need CPU callback services are serviced immediately by the CPU, in parallel with the rest of the kernel.
© Copyright 2014 HSA Foundation. All Rights Reserved
SUMMARY - HSA ADVANTAGE

Pointer-based Data Structures. Use case: binary tree searches; the GPU performs parallel searches in a CPU-created binary tree. HSA advantage: CPU and GPU have access to the entire unified coherent memory, so the GPU can access existing data structures containing pointers.

Platform Atomics. Use cases: work-group dynamic task management, where the GPU directly operates on a task pool managed by the CPU for algorithms with dynamic computation loads; and binary tree updates, with CPU and GPU operating simultaneously on the tree, both doing modifications. HSA advantage: CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.

Large Data Sets. Use case: hierarchical data searches; applications include object recognition, collision detection, global illumination, BVH. HSA advantage: CPU and GPU have access to the entire unified coherent memory, so the GPU can operate on huge models in place, reducing copy and kernel launch overhead.

CPU Callbacks. Use case: middleware user callbacks; the GPU processes work items, some of which require a call to a CPU function to fetch new data. HSA advantage: the GPU can invoke CPU functions from within a GPU kernel; simpler programming that does not require “split kernels”; higher performance through parallel operations.

© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
HSA COMPILATION
WEN-MEI HWU, CTO, MULTICOREWARE INC
WITH RAY I-JUI SUNG
KEY HSA FEATURES FOR COMPILATION
ALL-PROCESSORS-EQUAL
GPU and CPU have equal
flexibility to create and
dispatch work items
EQUAL ACCESS TO ENTIRE SYSTEM MEMORY
GPU and CPU have
uniform visibility into entire
memory space
Figure: CPU and GPU share a Unified Coherent Memory and a Single Dispatch Path.
© Copyright 2014 HSA Foundation. All Rights Reserved
A QUICK REVIEW OF OPENCL
CURRENT STATE OF PORTABLE HETEROGENEOUS PARALLEL PROGRAMMING
DEVICE CODE IN OPENCL
SIMPLE MATRIX MULTIPLICATION
__kernel void
matrixMul(__global float* C, __global float* A, __global float* B, int wA, int wB) {
int tx = get_global_id(0);
int ty = get_global_id(1);
float value = 0;
for (int k = 0; k < wA; ++k)
{
float elementA = A[ty * wA + k];
float elementB = B[k * wB + tx];
value += elementA * elementB;
}
C[ty * wA + tx] = value;
}
Explicit thread index usage.
Reasonably readable.
Portable across CPUs, GPUs, and FPGAs
© Copyright 2014 HSA Foundation. All Rights Reserved
HOST CODE IN OPENCL -
CONCEPTUAL
1. allocate and initialize memory on host side
2. Initialize OpenCL
3. allocate device memory and move the data
4. Load and build device code
5. Launch kernel
a. append arguments
6. move the data back from device
© Copyright 2014 HSA Foundation. All Rights Reserved
int main(int argc, char** argv){
// set seed for rand()
srand(2006);
/****************************************************/
/* Allocate and initialize memory on Host Side */
/****************************************************/
// allocate and initialize host memory for matrices A and B
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float*) malloc(mem_size_A);
unsigned int size_B = WB * HB;
unsigned int mem_size_B = sizeof(float) * size_B;
float* h_B = (float*) malloc(mem_size_B);
randomInit(h_A, size_A);
randomInit(h_B, size_B);
// allocate host memory for the result C
unsigned int size_C = WC * HC;
unsigned int mem_size_C = sizeof(float) * size_C;
float* h_C = (float*) malloc(mem_size_C);
/*****************************************/
/* Initialize OpenCL */
/*****************************************/
// OpenCL specific variables
cl_context clGPUContext;
cl_command_queue clCommandQue;
cl_program clProgram;
cl_kernel clKernel;
size_t dataBytes;
size_t kernelLength;
cl_int errcode;
// OpenCL device memory pointers for matrices
cl_mem d_A;
cl_mem d_B;
cl_mem d_C;
clGPUContext = clCreateContextFromType(0,
CL_DEVICE_TYPE_GPU,
NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// get the list of GPU devices associated with context
errcode = clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, 0, NULL,
&dataBytes);
cl_device_id *clDevices = (cl_device_id *)
malloc(dataBytes);
errcode |= clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, dataBytes,
clDevices, NULL);
shrCheckError(errcode, CL_SUCCESS);
//Create a command-queue
clCommandQue = clCreateCommandQueue(clGPUContext,
clDevices[0], 0, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// 3. Allocate device memory and move data
d_C = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE,
mem_size_C, NULL, &errcode);
d_A = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
mem_size_A, h_A, &errcode);
d_B = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
mem_size_B, h_B, &errcode);
// 4. Load and build OpenCL kernel
char *clMatrixMul = oclLoadProgSource("kernel.cl",
"// My comment\n",
&kernelLength);
shrCheckError(clMatrixMul != NULL, shrTRUE);
clProgram = clCreateProgramWithSource(clGPUContext,
1, (const char **)&clMatrixMul,
&kernelLength, &errcode);
shrCheckError(errcode, CL_SUCCESS);
errcode = clBuildProgram(clProgram, 0,
NULL, NULL, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
clKernel = clCreateKernel(clProgram,
"matrixMul", &errcode);
shrCheckError(errcode, CL_SUCCESS);
// 5. Launch OpenCL kernel
size_t localWorkSize[2], globalWorkSize[2];
int wA = WA;
int wC = WC;
errcode = clSetKernelArg(clKernel, 0,
sizeof(cl_mem), (void *)&d_C);
errcode |= clSetKernelArg(clKernel, 1,
sizeof(cl_mem), (void *)&d_A);
errcode |= clSetKernelArg(clKernel, 2,
sizeof(cl_mem), (void *)&d_B);
errcode |= clSetKernelArg(clKernel, 3,
sizeof(int), (void *)&wA);
errcode |= clSetKernelArg(clKernel, 4,
sizeof(int), (void *)&wC);
shrCheckError(errcode, CL_SUCCESS);
localWorkSize[0] = 16;
localWorkSize[1] = 16;
globalWorkSize[0] = 1024;
globalWorkSize[1] = 1024;
errcode = clEnqueueNDRangeKernel(clCommandQue,
clKernel, 2, NULL, globalWorkSize,
localWorkSize, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
// 6. Retrieve result from device
errcode = clEnqueueReadBuffer(clCommandQue,
d_C, CL_TRUE, 0, mem_size_C,
h_C, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
// 7. clean up memory
free(h_A);
free(h_B);
free(h_C);
clReleaseMemObject(d_A);
clReleaseMemObject(d_C);
clReleaseMemObject(d_B);
free(clDevices);
free(clMatrixMul);
clReleaseContext(clGPUContext);
clReleaseKernel(clKernel);
clReleaseProgram(clProgram);
clReleaseCommandQueue(clCommandQue);}
Almost 100 lines of code – tedious and hard to maintain.
It does not take advantage of HSA features.
It will likely need to be changed for OpenCL 2.0.
COMPARING SEVERAL HIGH-LEVEL PROGRAMMING INTERFACES
C++AMP: C++ language extension proposed by Microsoft
Thrust: library proposed by NVIDIA (CUDA)
Bolt: library proposed by AMD
OpenACC: annotations and pragmas proposed by PGI
SYCL: C++ wrapper for OpenCL
All these proposals aim to reduce tedious boilerplate code and provide transparent porting to future systems (future proofing).
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENACC
HSA ENABLES SIMPLER IMPLEMENTATION OR BETTER OPTIMIZATION
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENACC- SIMPLE MATRIX MULTIPLICATION EXAMPLE
1. void MatrixMulti(float *C, const float *A, const float *B, int hA, int wA, int wB)
2 {
3 #pragma acc parallel loop copyin(A[0:hA*wA]) copyin(B[0:wA*wB]) copyout(C[0:hA*wB])
4 for (int i=0; i<hA; i++) {
5 #pragma acc loop
6 for (int j=0; j<wB; j++) {
7 float sum = 0;
8 for (int k=0; k<wA; k++) {
9 float a = A[i*wA+k];
10 float b = B[k*wB+j];
11 sum += a*b;
12 }
13 C[i*wB+j] = sum;
14 }
15 }
16 }
Little Host Code Overhead
Programmer annotation of
kernel computation
Programmer annotation of data movement
© Copyright 2014 HSA Foundation. All Rights Reserved
ADVANTAGE OF HSA FOR OPENACC
Flexibility in copyin and copyout implementation
Flexible code generation for nested acc parallel loops
E.g., inner loop bounds that depend on outer loop iterations
Compiler data affinity optimization (especially OpenACC kernel regions)
The compiler does not have to undo programmer managed data transfers
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP
HSA ENABLES EFFICIENT COMPILATION OF AN EVEN HIGHER-LEVEL PROGRAMMING INTERFACE
© Copyright 2014 HSA Foundation. All Rights Reserved
C++ AMP
● C++ Accelerated Massive Parallelism
● Designed for data level parallelism
● Extension of C++11 proposed by Microsoft
● An open specification with multiple implementations aiming at standardization
● MS Visual Studio 2013
● MulticoreWare CLAMP
● GPU data modeled as C++14-like containers for multidimensional arrays
● GPU kernels modeled as C++11 lambdas
● Minimal extension to C++ for simplicity and future proofing
© Copyright 2014 HSA Foundation. All Rights Reserved
MATRIX MULTIPLICATION IN C++AMP
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int* productMatrix,
                     int ha, int hb, int hc) {
  array_view<int, 2> a(ha, hb, aMatrix);
  array_view<int, 2> b(hb, hc, bMatrix);
  array_view<int, 2> product(ha, hc, productMatrix);
  parallel_for_each(
    product.extent,
    [=](index<2> idx) restrict(amp) {
      int row = idx[0];
      int col = idx[1];
      for (int inner = 0; inner < hb; inner++) {
        product[idx] += a(row, inner) * b(inner, col);
      }
    }
  );
  product.synchronize();
}
// For comparison: the equivalent OpenCL host setup and kernel
clGPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                       NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// get the list of GPU devices associated with the context
errcode = clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES,
                           0, NULL, &dataBytes);
cl_device_id *clDevices = (cl_device_id *) malloc(dataBytes);
errcode |= clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES,
                            dataBytes, clDevices, NULL);
shrCheckError(errcode, CL_SUCCESS);
// create a command-queue
clCommandQue = clCreateCommandQueue(clGPUContext, clDevices[0],
                                    0, &errcode);
shrCheckError(errcode, CL_SUCCESS);

__kernel void
matrixMul(__global float* C, __global float* A,
          __global float* B, int wA, int wB) {
  int tx = get_global_id(0);
  int ty = get_global_id(1);
  float value = 0;
  for (int k = 0; k < wA; ++k) {
    float elementA = A[ty * wA + k];
    float elementB = B[k * wB + tx];
    value += elementA * elementB;
  }
  C[ty * wB + tx] = value;
}
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();}
GPU data modeled as data containers
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();}
Kernels modeled as lambdas; arguments are implicitly modeled as captured variables, so programmers do not need to specify copyin and copyout
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();
}
Execution interface; marking an implicitly parallel region for GPU execution
© Copyright 2014 HSA Foundation. All Rights Reserved
MCW C++AMP (CLAMP)
● Runs on Linux and Mac OS X
● Output code compatible with all major OpenCL stacks: AMD, Apple/Intel (OS X),
NVIDIA and even POCL
● Clang/LLVM-based, open source
o Translate C++AMP code to OpenCL C or OpenCL 1.2 SPIR
o With template helper library
● Runtime: OpenCL 1.1/HSA Runtime and GMAC for non-HSA systems
● One of the two C++ AMP implementations recognized by the HSA Foundation
© Copyright 2014 HSA Foundation. All Rights Reserved
MCW C++ AMP COMPILER
● Device Path
o generates OpenCL C code and SPIR
o emits the kernel function
● Host Path
o prepares to launch the code
Diagram: C++ AMP source code → Clang/LLVM 3.3 → Device Code and Host Code
© Copyright 2014 HSA Foundation. All Rights Reserved
TRANSLATION
parallel_for_each(product.extent,
  [=](index<2> idx) restrict(amp) {
    int row = idx[0];
    int col = idx[1];
    for (int inner = 0; inner < 2; inner++) {
      product[idx] += a(row, inner) * b(inner, col);
    }
  });

__kernel void
matrixMul(__global float* C, __global float* A,
          __global float* B, int wA, int wB) {
  int tx = get_global_id(0);
  int ty = get_global_id(1);
  float value = 0;
  for (int k = 0; k < wA; ++k) {
    float elementA = A[ty * wA + k];
    float elementB = B[k * wB + tx];
    value += elementA * elementB;
  }
  C[ty * wB + tx] = value;
}

● Append the arguments
● Set the index
● Emit the kernel function
● Implicit memory management
EXECUTION ON NON-HSA OPENCL PLATFORMS
Diagram: C++ AMP source code → Clang/LLVM 3.3 → Device Code, and C++ AMP source code → Clang/LLVM 3.3 → Host Code → gmac → OpenCL ("our work" is the compiler path; gmac and OpenCL form the runtime).
GMAC
● Unified virtual address space in software
● Can sometimes have high overhead
● On HSA systems (e.g., AMD Kaveri), GMAC is no longer needed
Gelado, et al., ASPLOS 2010
© Copyright 2014 HSA Foundation. All Rights Reserved
CASE STUDY: BINOMIAL OPTION PRICING
Chart: lines of code (counted by cloc, split into host and kernel portions) for C++AMP vs. OpenCL; y-axis 0 to 350 lines.
© Copyright 2014 HSA Foundation. All Rights Reserved
PERFORMANCE ON NON-HSA SYSTEMS: BINOMIAL OPTION PRICING
Chart: time in seconds (total GPU time and kernel-only) for OpenCL vs. C++AMP on an NVIDIA Tesla C2050; y-axis 0 to 0.12 s.
© Copyright 2014 HSA Foundation. All Rights Reserved
EXECUTION ON HSA
Diagram: at compile time, C++ AMP source code → Clang/LLVM 3.3 → Device SPIR and Host SPIR; at runtime, both are handed to the HSA Runtime.
© Copyright 2014 HSA Foundation. All Rights Reserved
WHAT DO WE NEED TO DO?
● Kernel function
o emit the kernel function with the required arguments
● On the host side
o a function that recursively traverses the object and appends the arguments to the OpenCL stack
● On the device side
o reconstruct the object in the device code for future use
© Copyright 2014 HSA Foundation. All Rights Reserved
WHY COMPILING C++AMP TO OPENCL IS
NOT TRIVIAL
● C++AMP → LLVM IR → OpenCL C or SPIR
● argument passing (lambda capture vs. function calls)
● explicit vs. implicit memory transfer
● The heavy lifting is done by the compiler and runtime
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE
struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

struct C c;
c.c = 100;
auto fn = [=] () { int qq = c.c; };
© Copyright 2014 HSA Foundation. All Rights Reserved
TRANSLATION
parallel_for_each(product.extent,
  [=](index<2> idx) restrict(amp) {
    int row = idx[0];
    int col = idx[1];
    for (int inner = 0; inner < 2; inner++) {
      product[idx] += a(row, inner) * b(inner, col);
    }
  });

__kernel void
matrixMul(__global float* C, __global float* A,
          __global float* B, int wA, int wB) {
  int tx = get_global_id(0);
  int ty = get_global_id(1);
  float value = 0;
  for (int k = 0; k < wA; ++k) {
    float elementA = A[ty * wA + k];
    float elementB = B[k * wB + tx];
    value += elementA * elementB;
  }
  C[ty * wB + tx] = value;
}
● Compiler
o turns captured variables into OpenCL arguments
o populates the index<N> in the OpenCL kernel
● Runtime
o implicit memory management
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved