JASM: A Java Library for the Generation and Scheduling of PTX Assembly
Christos Kartsaklis
ICHEC
Purpose
• NVIDIA GPUs
  – Many mutually conflicting parameters affect performance.
  – Some are not tunable through nvcc.
• JASM
  – Similar to a compiler back-end, but itself programmable.
  – Different constructs to generate variants of the same kernel.
  – Explore the optimisation-strategy space faster.
  – Aimed at CUDA programmers and at instruction-level bottlenecks.
Structure
1. Introduction
   1. Features absent from nvcc
   2. Code compilation
   3. JASM
2. Library Features
   1. Dependencies
   2. Aliasing
   3. Predication
   4. Snippets
   5. Reverse compilation
3. Summary
Features absent from nvcc – 1/3
• What is PTX?
  – A virtual-machine ISA for NVIDIA's GPUs.
  – Generated by nvcc (-ptx) and compiled by ptxas.
• nvcc limitation
  – Cannot write inlined PTX in CUDA C/C++.
  – Some extra #pragmas would be needed.
Features absent from nvcc – 2/3
• Predication
  – Not exposed; we would like to have:
      predicate p = ...; // some condition
      #pragma predicate using p
      if (p) { ...
• Address-space derivation
  – nvcc must be able to determine which space a pointer refers to; one cannot do:
      *(__shared__ float*) someint = ...;
      double d = *(__constant__ double*) foo;
Features absent from nvcc – 3/3
• Latency Hints
  – Hard for the compiler to determine (non-)coalesced accesses; ideally:
      #pragma !coalesced
      foo = bar[i];
• Reflection
      __device__ float foo(...) { ... }
      #pragma N = registersUsedBy(foo);
Code compilation – 1/2
• Instruction Generation
  – The compiler does it for you:
      1. High-level code → intermediary form (I/F).
      2. Transform the I/F.
      3. Generate machine code from the transformed I/F.
  – How good is the generated code?
      • You need to inspect it manually.
Code compilation – 2/2
• Instruction Scheduling
  – A core part of any compiler.
      • Determines the order in which instructions will execute.
  – Purpose
      • Correctness, latency hiding and ILP.
  – Problems
      • Hard to steer from a high-level language (see the sketch below).
      • The compiler often generates code of its own.
      • The only controls are #pragma directives & compiler options.
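To make the scheduling problem concrete, here is a generic list-scheduling skeleton in Java. It illustrates the textbook idea, namely that any topological order of the dependency graph is correct and a scheduler chooses among those orders; it is not JASM's or nvcc's actual algorithm:

    import java.util.*;

    class Scheduler {
        // deps.get(i) holds the indices that must execute before instruction i.
        // Kahn's algorithm: emit any instruction whose dependencies are met.
        // A real scheduler would pick from `ready` via a cost model to hide latency.
        static List<Integer> schedule(int n, List<List<Integer>> deps) {
            int[] pending = new int[n];
            Deque<Integer> ready = new ArrayDeque<>();
            for (int i = 0; i < n; i++)
                if ((pending[i] = deps.get(i).size()) == 0) ready.add(i);
            List<Integer> order = new ArrayList<>();
            while (!ready.isEmpty()) {
                int i = ready.poll();
                order.add(i);
                for (int j = 0; j < n; j++)
                    if (deps.get(j).contains(i) && --pending[j] == 0) ready.add(j);
            }
            return order;
        }
    }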
JASM – 1/3
• Dedicated tools, such as BAGEL
  – The user selects instructions and BAGEL schedules them.
  – The generating code is written in C using the BAGEL library.
  – Uses the notion of "abstract instructions".
• JASM
  – Similar philosophy; enhanced functionality.
• Focus:
  – Reflection.
  – User-programmable instruction scheduling.
JASM – 2/3
• Basic Block (BB)– A bunch of instructions that do not change the flow of
execution.
• Control Flow Graph (CFG)– A directed graph of BBs where edges indicate
changes in execution flow.
• Instructions Stream– The order of instructions in memory.– Each instruction is “notionally” part of a BB.
10
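A minimal model of the three concepts in plain Java; illustrative only, these are not JASM's actual classes:

    import java.util.*;

    // A basic block: straight-line code; control only enters at the top
    // and only leaves at the bottom.
    class BasicBlock {
        final List<String> instructions = new ArrayList<>();
        final List<BasicBlock> successors = new ArrayList<>(); // CFG edges
    }

    // The instruction stream is one linear layout of every block's
    // instructions in memory; each instruction stays notionally tied
    // to the block it came from.
    class Program {
        BasicBlock entry;
        static List<String> stream(List<BasicBlock> layout) {
            List<String> out = new ArrayList<>();
            for (BasicBlock bb : layout) out.addAll(bb.instructions);
            return out;
        }
    }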
JASM – 3/3
• Examples– Append an instruction to a basic block:
• lBB.append(LDSHAREDS32, lArg1, 4, lArg2);
– Branching:• lSource.branchTo(lTargetBB, false);
– Reorder:• lBB = lBB.reorder(BASIC_COST_FUNCTION,
DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);
– Predicate:• lBB = lBB.predicate(lP, lRegisterFile);
– Obtain macro:• SnippetDescriptor lSD = CToHLA.obtain(“x*x / (y-z)”,...);
BasicBlock lNewBB = lSD.instantiate(lArg1, ...);
11
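Strung together, the calls above suggest a workflow like the following. This sketch reuses only the API fragments shown on this slide, so the surrounding declarations (lRegisterFile, lP, the argument registers, the opcode constant) are assumptions:

    // Build a block, schedule it, guard it, and wire it into the CFG.
    BasicBlock lBB = ...;                       // obtained from JASM (elided)
    lBB.append(LDSHAREDS32, lArg1, 4, lArg2);   // append a shared-memory load
    lBB = lBB.reorder(BASIC_COST_FUNCTION,      // reschedule within the block
                      DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);
    lBB = lBB.predicate(lP, lRegisterFile);     // guard every instruction by lP
    lSource.branchTo(lBB, false);               // add the CFG edge to the new block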
Structure
1. Introduction
   1. Features absent from nvcc
   2. Code compilation
   3. JASM
2. Library Features
   1. Dependencies
   2. Aliasing
   3. Predication
   4. Snippets
   5. Reverse compilation
3. Summary
Dependencies – 1/2
• All contribute to the final instructions stream.– What is the ideal layout?
• What complicates the compiler– Not enough information to distinguish between true
and false dependencies.– Variable-latency instructions
• E.g. coalesced vs non-coalesced accesses.
13
Dependencies – 2/2
• JASM determines instruction order based on:– Dependency Analysis (DA) modules & Cost Function.
• Full space exploration – no heuristics:– DAs constrain instructions’ motion in the stream.– Cost function estimates execution time for any
stream.– Scheduling done by external constraints solver.
• Only applicable to basic blocks.
14
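A minimal sketch of that division of labour as Java interfaces; this illustrates the roles described above and is not JASM's actual API:

    import java.util.List;

    // A DA module contributes ordering constraints: may instruction j be
    // moved above instruction i? True (read-after-write) dependencies always
    // forbid it; false (write-after-read/write) dependencies also constrain
    // motion unless removed, e.g. by register renaming.
    interface DependencyAnalysis {
        boolean mustPrecede(int i, int j, List<String> stream);
    }

    // The cost function scores one candidate stream, e.g. by penalising a
    // use of a load's result before the load's latency has elapsed.
    interface CostFunction {
        long estimateCycles(List<String> stream);
    }

    // An external solver then enumerates the orders satisfying every
    // mustPrecede constraint and keeps the one with the lowest estimate.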
Aliasing – 1/3
• PTX is not the final thing.– Further optimised by ptxas before machine code
generation.
• Want to specify exactly what is and what’s not aliased.– No #pragma aliased / disjoint in PTX.
• Goal:– Simplify declaration of aliasing/disjoint accesses.– Handle all memory spaces.
15
Aliasing – 2/3
• JASM Addressable Memory Regions (AMRs)• An n-ary tree of AMRs where:
– Root nodes represent spaces (e.g. shared, global)– Each node is a memory region and a sub-region of its
parent’s.– Siblings are disjoint regions, collectively making their
parent’s
• Instructions are associated with AMRs.– AMRs predefined for CUDA memory spaces.
16
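The tree structure gives a simple disjointness rule; a minimal sketch, assuming nothing about JASM's real classes:

    import java.util.*;

    // Two AMRs can alias only if one contains the other: siblings partition
    // their parent, so regions are disjoint iff neither is an ancestor of
    // the other.
    class AMR {
        final AMR parent;
        final List<AMR> children = new ArrayList<>();
        AMR(AMR parent) {
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }
        boolean ancestorOf(AMR other) {
            for (AMR a = other; a != null; a = a.parent)
                if (a == this) return true;
            return false;
        }
        boolean disjointFrom(AMR other) {
            return !ancestorOf(other) && !other.ancestorOf(this);
        }
    }

Two memory instructions tagged with disjoint AMRs may then be reordered freely; instructions on the same AMR, or on nested ones, keep their relative order.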
Aliasing – 3/3
• Example: four accesses attached to the global-memory AMR (shown on the original slide as a tree indicating which instructions share a sub-region):
      01: st.global.f32 [%r0], %f5
      02: ld.global.f64 [%r1], %lf8
      03: st.global.f32 [%r3], %f4
      04: st.global.s32 [%r8], %s1
• Generally:
  – No need for pointer grouping (a la "#pragma disjoint" etc.).
  – We work with instructions, not pointers.
Predication – 1/3
• Conditional execution– if (foo==0) bar=5;– Thread divergence.
• Predicated execution– setp.eq.s32 %p, foo, 0;@%p mov.s32 bar, 5;
– Non-divergent cycle-burner.
• Fine line between the two.• Cannot predicate code explicitly in CUDA.
18
Predication – 2/3
• Explicit– Can allocate predicates, modify them and predicate
instructions.
• Example:– Direct:
Register lP = lRegFile.allocate(BT.PRED, “%p1”);// @%p1 mov.s32 bar, 5lBB.append(PMOVS32, lP, lBar, new Immediate(5));
– By reflection:Instruction lT = new InstructionImpl( MOVS32, lBar, new Immediate(5));lBB.append(lT.descriptor().toPredicatedVersion(), lP, lArg1, lArg2);
19
Predication – 3/3
• Any basic block can be predicated.– Including already-predicated instructions.
• Example:– %p mov.s32 %d, %s; // if (%p) %d = %s;– Predicate by %q
• Output:– @ %q and.pred %t, %p, %q;@!%q mov.pred %t, %q; // i.e. %t = %q ? (%p && %q) : false;@ %t mov.s32 %d, %s;
20
Snippets – 1/3
• Problem:– Certain tasks require knowing in advance how the
compiler treats a piece of code.
• Software pipelining– template<typename T>vmult( T* aDst, T* aSrc1, T* aSrc2) { for(int i=0 ; i<N ; i++) aDst[i] = aSrc2[i] * aSrc2[i];}
21
Snippets – 2/3
• Consider H/W vs S/W instructions– Tradeoff between pipeline stall & register pressure.
• However, register pressure:– Is also a function of the # of thread blocks.
• Ideally– Want to generate pipelined code for a variable
number of dependencies.
• Solution:– Encapsulate function in a reflective macro
• Reflect instructions & dependencies.
22
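Why register pressure and the number of thread blocks interact, as a worked sketch; the hardware figures here are assumptions for illustration, not numbers from the talk:

    class Occupancy {
        // With an assumed 16384 registers per multiprocessor (a GT200-class
        // figure), register use per thread caps how many blocks are resident.
        static int blocksPerSM(int regsPerThread, int threadsPerBlock) {
            final int REGS_PER_SM = 16384;  // assumed for illustration
            return REGS_PER_SM / (regsPerThread * threadsPerBlock);
        }
        // blocksPerSM(32, 256) == 2, but pipelining more deeply, say to
        // 40 registers per thread, leaves blocksPerSM(40, 256) == 1.
    }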
Snippets – 3/3
• Consider the complex “multiplication”– (a+ib)*(c+id) (ac+bd)+i(ad+bc)– 2 stages: 2 muls, then 2 madds.
• Snippet descriptor organisation:– Group 0:
• mul.f32 ?x, ?a, ?c; mul.f32 ?y, ?a, ?d
– Group 1:• mad.f32 ?x, ?x, ?b, ?d; mad.f32 ?y, ?y, ?b, ?c
– ?* items are parameters.
• Any basic block can be “snippetised”.
23
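One way to picture a snippet descriptor's grouped, parameterised form as plain Java data; illustrative only, not JASM's internal representation:

    import java.util.List;

    class ComplexMulSnippet {
        // Each group is a dependency stage; "?"-prefixed operands are
        // parameters bound when the snippet is instantiated into a block.
        record Instr(String opcode, String... operands) {}
        record Group(List<Instr> instructions) {}

        static final List<Group> STAGES = List.of(
            new Group(List.of(new Instr("mul.f32", "?x", "?a", "?c"),
                              new Instr("mul.f32", "?y", "?a", "?d"))),
            new Group(List.of(new Instr("mad.f32", "?x", "?b", "?d", "?x"),
                              new Instr("mad.f32", "?y", "?b", "?c", "?y"))));
    }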
Reverse compilation – 1/3
• What to do with legacy CUDA code?– Option: Manually re-write in JASM.
• No. Any PTX file can be loaded in JASM.– Not just loaded in. – Organised in basic blocks within a Control Flow
Graph.– Malleable from thereon like every JASM code.
24
Reverse compilation – 2/3
• Inlined C in JASM– Idea: obtain a snippet from a C function.– Opposite of “inlined assembly in C”.– Why?
• Reuse what nvcc makes available.• Enjoy the benefits that come with snippets.
25
Reverse compilation – 3/3
• At the moment, we can do the following:– Code:
• SnippetDescriptor lSD= CToHLA.obtain( “(x*y) % 3”, “int”, “r”, “int”, “x”, “int”, “y”);
– r is the return parameter; x & y are arguments.– Equivalent to:
• int r = (x*y) % 3;
• Now we can write:– if(lSD.numberOfRegisters() > 5) { ...
26
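Reflection like this can drive variant selection; a sketch built only from the calls shown in these slides (the alternative expression, the register arguments lR/lX/lY and the surrounding setup are assumptions):

    // Prefer the straightforward snippet, but fall back to an equivalent
    // formulation if nvcc's code for it is too register-hungry.
    SnippetDescriptor lSD = CToHLA.obtain(
        "(x*y) % 3", "int", "r", "int", "x", "int", "y");
    BasicBlock lBody;
    if (lSD.numberOfRegisters() > 5) {
        SnippetDescriptor lLean = CToHLA.obtain(
            "((x%3) * (y%3)) % 3", "int", "r", "int", "x", "int", "y");
        lBody = lLean.instantiate(lR, lX, lY);
    } else {
        lBody = lSD.instantiate(lR, lX, lY);
    }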
Summary
• NVIDIA GPUs– Many self-conflicting parameters affect performance.– Some not nvcc-tunable.
• JASM– Similar to a compiler back-end but programmable
itself.– Different constructs to generate variants of the same
kernel.– Explore the optimisations strategy space faster.
• The optimisations are expressed as a function of the code.
27