AMD GCN Port - GNU Project


AMD GCN Port

September, 2018

Andrew Stubbs


Overview

GCN Architecture
Port History
Current Status
How To
GCC Porting Challenges


GCN Architecture Overview


It’s a GPU! (Not a CPU.)

64 identical Compute Units (currently), each having ...
– 4 “SIMD” units, each having …
  ● 800 32-bit scalar registers (SGPRs).
  ● 256 vector registers (VGPRs), each having …
    – 64 32-bit lanes.

Each SIMD unit can run 10 threads (“wavefronts”, interleaved).
➔ 40 threads per Compute Unit.
➔ 2560 threads per GPU.
➔ Total of 163840 active vector lanes (“work items”, confusingly also called “threads” in GPU-land).


Register File Sharing

Instruction encodings can access ...
– 102 of the 800 SGPRs.
– All 256 VGPRs.

But, for maximum occupancy of 10 threads per SIMD unit, each thread must only use ...
– 80 SGPRs
– 24 VGPRs
  ● Registers are allocated in blocks of 4, so 256 ÷ 10 rounds down to 24 each, with 16 left unused.

The number of registers must be reserved in the object file header.


NUMA Features

The GPU has
– Access to the large graphics memory.
  ● Via the L2 cache.
– 64K low-latency “Global Data Store” (for inter-CU communication).

Each CU has …
– An L1 cache.
– 64K low-latency “Local Data Store”.
– A separate Constant Cache (“Kcache”) for scalar accesses only.

Virtual address spaces
– Different instruction types use different addresses to access the same main memory.


From the Manual …


GCN ISA

32/64-bit variable-length instruction encoding.
– 32-bit basic encoding, plus one of
  ● 32-bit immediate
  ● 32-bit sub-word access modifier
  ● 32-bit data-parallel-processing modifier
  ● 32-bit additional opcode-specific encoding
– Many instructions allow multiple operands to use register/immediate encodings, although only one may use the additional word.

Separate instruction sets for scalar and vector operations.
– The scalar instruction set is incomplete, however, so some scalar operations must happen in vector registers.
– Control flow is scalar-only.


Memory Access

GCN only has Scatter/Gather
– There’s no general instruction that can access a contiguous vector.
  ● (“Flat scratch” can, but only within the stack space.)
– We must build a vector of addresses.
  ● Or, use a base address with a vector of offsets or indices.

The ISA provides a range of load/store instruction varieties.
– “buffer” uses a descriptor with many fields.
– “ds” uses 32-bit addresses for LDS and GDS.
– “flat” uses 64-bit addresses for main memory, LDS, and “flat scratch”.
– “global” uses 64-bit addresses for main memory only, but permits offsets.
– “atomic” is like flat, but with various atomic operators.
– “scalar” uses 64-bit addresses, but through a different cache.
– “scratch”, “image” … you get the idea: memory is hard.


Port History


Origin

Summer 2016: Honza Hubicka and Martin Jambor start the port.

Presented at the 2016 Cauldron.


Mentor Graphics / CodeSourcery

Early 2017: AMD hire Mentor Graphics to create a new GCC port.
– Requirement: GFortran working with OpenACC and OpenMP.

We contacted Honza, and used his port as the starting point.

Implemented
– Much wider ISA support.
– Function calls (with a custom ABI).
– “gcn-run” launcher for stand-alone programs.
  ● The testsuite now runs correctly.
– OpenMP/OpenACC/Libgomp from x86_64.
  ● With Discrete GPU support (rather than the existing HSA APU support).
– Libgfortran (minimal mode).


Late 2017: First Fortran-only binary toolchain release made.
– Based on GCC 7.
– Fortran/OpenACC/OpenMP work as per GCC 7 state.
  ● Plus a few small back-ports from the openacc-gcc-7-branch.
– Supports GCN3, with limited auto-vectorization support.

Spring 2018: C/C++/Fortran binary release made.
– Now with GCN5 support.
– C++ is only supported for OpenMP/OpenACC offloaded code.
  ● i.e. it can be compiled through the LTO mechanism.
– You can download it now!

https://www.mentor.com/embedded-software/sourcery-tools/sourcery-codebench/editions/lite-edition/



Current Port Status

A new binary release will be made in November.
– Updated to GCC 8.
– Incorporates openacc_gcc_8_branch for the freshest OpenACC support.
– Expanded vector support:
  ● Fully masked loops.
  ● Scatter/gather.
  ● More operators implemented.
– Many bug fixes and minor improvements.


Other tools

Still no Binutils port.
– We’re using LLVM 6.0, for the assembler and linker.

Newlib port complete.
– Supports malloc, stdout/stderr, exit, abort …
– Supports dynamic re-entrancy for thread-safe malloc, etc.
– Enough for offloading and testing.


Upstreaming

Posted Wednesday September 5th.
– Stand-alone only; no OpenACC/OpenMP/libgomp support yet.

Hoping to get the backend in GCC 9.


To Do

Sub-word vector operations.
– Requires explicit truncate operations other architectures don’t need.
– Maybe a vector equivalent to WORD_REGISTER_OPERATIONS.

Register sharing to allow more than 4 threads per CU.

Many cleanups & optimization tasks.

C++ support
– Exceptions, static constructors, libstdc++, etc.
– (The offloading compiler can already handle C++ code parsed by the host compiler, as long as it doesn’t call into the standard library.)


How To Build And Use The Toolchain


How To Use The GCN Toolchain

1) Build LLVM 6.0 for GCN, and extract the assembler (llvm-mc) and linker (lld).
2) Build cross-GCC and Newlib.
   ● Use the usual cross-build technique.
   ● Enable only C and Fortran.
3) Install the ROCm drivers, and HSA runtime libraries.
4) Compile a standard “Hello world”, no modification required.
5) export LD_LIBRARY_PATH=/opt/rocm/lib
6) Run with “gcn-run a.out”.
   ● “gcn-run” can be found under libexec.
   ● It launches single-threaded programs, accepting normal CLI arguments and returning a normal numeric result.
   ● Programs can write to stdout and stderr, but cannot read stdin or access files.


How To Use OpenACC/OpenMP

1) Build the GCN toolchain, as above, but with ...
   --enable-as-accelerator-for=x86_64-none-linux-gnu
   ● NOTE: The initially upstreamed sources will not include offload support.
2) Build an x86_64 toolchain, however you like, but with …
   --enable-offload-targets=amdgcn-unknown-amdhsa
3) Install the ROCm drivers and HSA runtime libraries.
4) Compile one of the OpenACC/OpenMP examples, using …
   -fopenacc or -fopenmp
5) export LD_LIBRARY_PATH=/opt/rocm/lib:$TC/x86_64-none-linux-gnu/lib64
6) Run with “./a.out”.
   ● Use “export GCN_DEBUG=1” to watch the kernels getting offloaded to the GPU.


GCC Porting Challenges

or

Aspects of GCN that are challenging in GCC

(excluding OpenACC and OpenMP)


Challenge 1:

Porting Choices

Follow existing GPU practice?
– Only NVPTX exists.
– PTX models a thread for each vector lane.
  ● HSAIL likewise.
– GCC converts vector loops to “fork” operators, and treats everything as scalars.

Model GCN as a CPU?
– Use SIMD instructions and the autovectorizer.

➔ We chose CPU-style SIMD.


Challenge 2:

Reload and moves

Move insns are not permitted to require additional reloads
– (When emitted during reload.*)
– But, GCN vector moves depend on the value of the EXEC register.
  ● Only the values in enabled lanes will be moved.

So, moves emitted during reload must ignore EXEC.
– We needed to write an md_reorg pass to fix up EXEC around all moves.
– The register allocator can automatically handle EXEC for all other insn types.

It took a long time to get past the “90 reload” ICE headache!

* Actually we use LRA


Challenge 3:

Scalars In Vector Registers


There is not a full set of scalar instructions.
– We must do some scalar operations using vector instructions.
– We may choose to do some scalar operations in vectors, even when scalar instructions are available.

How to do it?
– Disable all but one lane?
– Duplicate across all lanes?
  ● HSA/PTX would handle scalars redundantly.
  ● But, some operators must not be duplicated (e.g. atomics).

➔ Do both!
– We disable all but lane zero for loads/stores.
– We leave other lanes enabled where it’s harmless.


Challenge 4:

Vector Size & Elements

There are many places in GCC that assume that a “vector register” has a fixed number of bits that can be divided between a variable number of elements.
– E.g. V16QI/V8HI/V4SI/V2DI
– There are features to allow differently sized vector units (cf. AVX), but once you’ve chosen a size you’re stuck with it.

GCN has a fixed number of elements (64), and therefore vector size varies according to element size.
– V64DI = 4096 bits, V64SI = 2048 bits, V64HI = 1024 bits.
– Problem: If the first type vectorized in a function is 32-bit, GCC will refuse to vectorize 64-bit types because it can’t build a V32DImode vector!

➔ Various middle-end patches are required.


Challenge 5:

Extend RTL?

Honza suggested for_each_lane and lane_index.


No, we didn’t …
– We struggled on with what’s there already.

But, we solved the specific problem using scatter/gather loads.
– In GCC 7 we had MEM with vectors of addresses.
  ● Expanded post-reload.
– In GCC 8 we use gather/scatter.
  ● Expanded at any point.

And, we extended the semantics of vec_select to permit non-constant lane numbers.

Extended RTL would still be useful though.
– See “SUBREGs of Vectors”, below.


Challenge 6:

Vectorization with Zero Stride

Consider this function (from vect-strided-store.c):

  void f(int * __restrict dest, int * __restrict src, int stride, int n)
  {
    for (int i = 0; i < n; i++)
      dest[i*stride] = src[i];
  }

– GCC vectorizes the loop into a mask_load and mask_scatter_store.

When stride==0, C semantics would require this effect: dest[0] = src[n-1]

Most vector architectures seem to do so.
– (I don’t see reports of vect-strided-store.c failing.)


GCN tries to write all the values of src[i] to dest.
– But, the “winner” is undefined: dest[0] = src[rand()%n]

Intermittent test failure!
– I don’t know how to fix this without disallowing scatter stores everywhere?


Challenge 7:

Address Spaces

GCN has 5 different memory address spaces.
– Flat, Global, and Scratch are 64-bit.
– LDS and GDS are 32-bit.

GCN has multiple memory instructions, each supporting different address spaces and properties (such as swizzling).

We use GCC address spaces to select the instruction type, as well as the memory region.
– Also use a different default address space on GCN3 vs. GCN5.
– And a different default for functions that might receive Flat LDS pointers.


Challenge 7:

Address Space Trouble

1) Address spaces are not propagated everywhere.
   ● E.g. Built-in functions.
2) gather_load/scatter_store don’t support address spaces at all.
3) Address spaces with different pointer sizes are often broken.
   ● Several sources of ICEs.
   ● They work fine once in the back-end.


Challenge 8:

Swizzled Stacks

Swizzling is a way in which each vector lane can feel like it has its own stack without actually killing cache performance.

Here’s a valiant attempt to explain it!


This makes sense for HSA using the “each vector lane is a thread” model.

It does not make sense for C programs because
– Stack addresses cannot be used in a general way.
  ● The memory is not laid out how the compiler expects.
  ● The address is in a non-default address space, so pointers can’t be passed anywhere.
– Stack-pointer adjustment calculations are totally screwed up.
– It’s hard to distinguish stack accesses from global memory accesses.

➔ The GCN Port uses a plain stack.
– We use neither “buffer” nor “scratch” instructions.


Challenge 9:

Gather/Scatter Offsets Size

gather_load/scatter_store take offset vectors that match the size of the data vector
– E.g. DImode offset for V64DFmode vectors.
– Presumably due to architectures with lane count determined by element size.

But GCN offsets are always 32-bit.
– DImode offsets are excessive.
– QImode offsets are unnecessarily limited.

It would be better to have “gather_loadmn” where “n” is the offset size, and have the vectorizer select an appropriate size from those available.


Challenge 10:

Mask Modes

There’s an inconvenient mismatch
– vec_merge uses a DImode bitmask.
  ● Which is natural for GCN.
– mask_load et al must use V64BImode (or some vector type)
  ● Which must be converted at expand time somehow.

We do it with SUBREG everywhere:

  (subreg:DI (reg:V64BI …) 0)

But, simplify_subreg can’t cope with that when computing reg_equiv.
➔ We’ve implemented handling of multiple elements per byte.


Challenge 11:

SUBREGs of Vectors

GCC has no way to express “the lowpart of each element”.
– Compare: (subreg:SI (reg:DI …) 0)
– with this: (subreg:V64SI (reg:V64DI …) 0)

GCN uses two 32-bit-per-element vector registers to hold a 64-bit-per-element vector (still with 64 elements).
– Currently we must wait until a hardreg is allocated before we split.
– Splitting late means we miss optimization opportunities.


Challenge 12:

vec_merge Everywhere

This has been a problem from the very start.


Every vector insn is written as a vec_merge pattern:

  (set (<dest>)
       (vec_merge (<op>)
                  (<dest> or UNSPEC_UNDEF)
                  (<exec reg>)))

Advantages:
– It represents what the hardware really does.
– Reload handles the EXEC register (lane mask) automatically.
– We could allow any operation to be masked.


Disadvantages:
– It basically renders the combine pass useless.
– Many optimizations don’t recognize operations any more.
  ● E.g. No-op moves are not removed.
– GCC actually supports only selected operations with masks (mostly loads and stores), so it is unnecessary complexity.
– Every define_insn requires a separate define_expand to add the additional operands.
– The “destination or undef” thing is clunky.
  ● Sometimes a “U0” constraint is sufficient (but the matching constraint causes early-clobbers to be considered free).
  ● Sometimes separate patterns are needed with (match_dup 0).


Challenge 12:

vec_merge Solutions

1) Simply remove the vec_merge.
   – Rely on md_reorg to manage EXEC.
   – Retain vec_merge for insns where GCC supports masks, of course.
   – Add back vec_merge patterns for use of Combine, where useful.
2) Add additional support for Combining vec_merge.
   – And teach other optimizations too.
