Compiling for Parallel Machines

Transcript of "Compiling for Parallel Machines" (Titanium)
CS264, Kathy Yelick
Two General Research Goals
• Correctness: help programmers eliminate bugs
– Analysis to detect bugs statically (and conservatively)
– Tools such as debuggers to help detect bugs dynamically
• Performance: help make programs run faster
– Static compiler optimizations
» May use analyses similar to above to ensure compiler is correctly transforming code
» In many areas, the open problem is determining which transformations should be applied when
– Link or load-time optimizations, including object code translation
– Feedback-directed optimization
– Runtime optimization
• For parallel machines, if you can’t get good performance, what’s the point?
A Little History
• Most research on compiling for parallel machines is
– automatic parallelization of serial code
– loop-level parallelization (usually Fortran)
• Most parallel programs are written using explicit parallelism, either:
A) message passing with a single program, multiple data (SPMD) model, usually MPI with either Fortran or mixed C++ and Fortran, for scientific applications
B) shared memory with a thread and synchronization library in C or Java, for non-scientific applications
– Option B is easier to program, but requires hardware support that is still unproven for more than 200 processors
Titanium Overview
• Give programmers a global address space
– Useful for building large, complex data structures that are spread over the machine
– But, don't pretend it will have uniform access time (i.e., not quite shared memory)
• Use an explicit parallelism model
– SPMD for simplicity
• Extend a "standard" language with data structures for a specific problem domain: grid-based scientific applications
– Small amount of syntax added for ease of programming
– General idea: build domain-specific features into the language and optimization framework
Titanium Goals
• Performance
– close to C/FORTRAN + MPI, or better
• Portability
– develop on uniprocessor, then SMP, then MPP/Cluster
• Safety
– as safe as Java, extended to parallel framework
• Expressiveness
– close to usability of threads
– add minimal set of features
• Compatibility, interoperability, etc.
– no gratuitous departures from Java standard
Titanium
• Take the best features of threads and MPI
– global address space like threads (ease of programming)
– SPMD parallelism like MPI (for performance)
– local/global distinction, i.e., layout matters (for performance)
• Based on Java, a cleaner C++
– classes, memory management
• Language is extensible through classes
– domain-specific language extensions
– current support for grid-based computations, including AMR
• Optimizing compiler
– communication and memory optimizations
– synchronization analysis
– cache and other uniprocessor optimizations
New Language Features
• Scalable parallelism
– SPMD model of execution with global address space
• Multidimensional arrays
– points and index sets as first-class values to simplify programs
– iterators for performance
• Checked synchronization
– single-valued variables and globally executed methods
• Global communication library
• Immutable classes
– user-definable non-reference types for performance
• Operator overloading
– by demand from our user community
• Semi-automated zone-based memory management
– as safe as a garbage-collected language
– better parallel performance and scalability
Lecture Outline
• Language and compiler support for uniprocessor performance
– Immutable classes
– Multidimensional arrays
– foreach
• Language support for parallel computation
• Analysis of parallel code
• Summary and future directions
Java: A Cleaner C++
• Java is an object-oriented language
– classes (no standalone functions) with methods
– inheritance between classes; multiple interface inheritance only
• Documentation on web at java.sun.com
• Syntax similar to C++:
class Hello {
  public static void main(String[] argv) {
    System.out.println("Hello, world!");
  }
}
• Safe
– Strongly typed: checked at compile time, no unsafe casts
– Automatic memory management
• Titanium is (almost) strict superset
Java Objects
• Primitive scalar types: boolean, double, int, etc.
– implementations will store these on the program stack
– access is fast -- comparable to other languages
• Objects: user-defined and from the standard library
– passed by pointer value (object sharing) into functions
– have an implicit level of indirection (accessed through a pointer)
– simple model, but inefficient for small objects
[Figure: primitives (2.6, 3, true) stored directly on the stack; a Complex object with fields r: 7.1, i: 4.3 accessed through a pointer]
Java Object Example
class Complex {
  private double real;
  private double imag;
  public Complex(double r, double i) { real = r; imag = i; }
  public Complex add(Complex c) {
    return new Complex(c.real + real, c.imag + imag);
  }
  public double getReal() { return real; }
  public double getImag() { return imag; }
}
Complex c = new Complex(7.1, 4.3);
c = c.add(c);
class VisComplex extends Complex { ... }
Immutable Classes in Titanium
• For small objects, we would sometimes prefer
– to avoid the level of indirection
– pass by value (copying of entire object)
– especially when objects are immutable -- fields are unchangeable
» extends the idea of primitive values (1, 4.2, etc.) to user-defined values
• Titanium introduces immutable classes
– all fields are implicitly final
– cannot inherit from (extend) or be inherited by other classes
– must have a 0-argument constructor, e.g., Complex()

immutable class Complex { ... }
Complex c = new Complex(7.1, 4.3);
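The closest one can get in plain Java is a class whose fields are all final, so instances behave like values; a minimal sketch (Titanium's `immutable` additionally removes the pointer indirection, which ordinary Java cannot express):

```java
// Plain-Java analogue of Titanium's immutable Complex: all fields final,
// a required 0-argument constructor, and "mutating" operations that
// return a fresh value instead of updating in place.
final class ImmutableComplex {
    final double real;
    final double imag;

    ImmutableComplex() { this(0.0, 0.0); }           // 0-arg constructor
    ImmutableComplex(double r, double i) { real = r; imag = i; }

    ImmutableComplex add(ImmutableComplex c) {
        return new ImmutableComplex(real + c.real, imag + c.imag);
    }
}
```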
Arrays in Java
• Arrays in Java are objects
• Only 1D arrays are directly supported
• Array bounds are checked
• Multidimensional arrays as arrays-of-arrays are slow
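To see why arrays-of-arrays cost extra, compare a Java `double[][]` (one indirection and bounds check per row) with a manually flattened 1D array; the class name and layout here are illustrative, not Titanium's implementation:

```java
// Illustrative comparison (not Titanium code): a Java 2D array is an
// array of row pointers, so a[i][j] costs an extra load per access; a
// flattened 1D array computes the element address directly.
class FlatArray {
    final int rows, cols;
    final double[] data;           // row-major storage: data[i*cols + j]

    FlatArray(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    double get(int i, int j)           { return data[i * cols + j]; }
    void   set(int i, int j, double v) { data[i * cols + j] = v; }

    // Copy from a Java array-of-arrays; Java rows may even be ragged,
    // which is exactly what makes such arrays hard to optimize.
    static FlatArray from(double[][] a) {
        FlatArray f = new FlatArray(a.length, a[0].length);
        for (int i = 0; i < f.rows; i++)
            for (int j = 0; j < f.cols; j++)
                f.set(i, j, a[i][j]);
        return f;
    }
}
```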
Multidimensional Arrays in Titanium
• New kind of multidimensional array added
– Two arrays may overlap (unlike Java arrays)
– Indexed by Points (tuple of ints)
– Constructed over a set of Points, called Domains
– RectDomains are special case of domains
– Points, Domains and RectDomains are built-in immutable classes
• Support for adaptive meshes and other mesh/grid operations
RectDomain<2> d = [0:n,0:n];
Point<2> p = [1, 2];
double [2d] a = new double [d];
a[0,0] = a[9,9];
Naïve MatMul with Titanium Arrays
public static void matMul(double [2d] a, double [2d] b, double [2d] c) {
  int n = c.domain().max()[1];  // assumes square
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      for (int k = 0; k < n; k++) {
        c[i,j] += a[i,k] * b[k,j];
      }
    }
  }
}
Two Performance Issues
• In any language, uniprocessor performance is often dominated by memory hierarchy costs
– algorithms that are “blocked” for the memory hierarchy (caches and registers) can be much faster
• In Titanium, the representation of arrays is fast, but the access methods are expensive
– need optimizations on Titanium arrays
» common subexpression elimination
» eliminate (or hoist) bounds checking
» strength reduce: e.g., naïve code has 1 divide per dimension for each array access
• See Geoff Pike's work
– goal: competitive with C/Fortran performance, or better
Matrix Multiply (blocked, or tiled)
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize
for i = 1 to N
for j = 1 to N
{read block C(i,j) into fast memory}
for k = 1 to N
{read block A(i,k) into fast memory}
{read block B(k,j) into fast memory}
C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks}
{write block C(i,j) back to slow memory}
[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j) on blocks]
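The blocked pseudocode above can be sketched as runnable Java on flattened row-major arrays (illustrative: for brevity it assumes the block size divides n):

```java
// Blocked (tiled) matrix multiply: C += A * B on n-by-n row-major arrays.
// Each b-by-b block of C is updated while the needed blocks of A and B
// stay resident in fast memory, as in the pseudocode above.
class MatMulBlocked {
    static void matmul(double[] a, double[] b, double[] c, int n, int blk) {
        for (int i0 = 0; i0 < n; i0 += blk)
            for (int j0 = 0; j0 < n; j0 += blk)
                for (int k0 = 0; k0 < n; k0 += blk)
                    // multiply one pair of blocks into the C(i0,j0) block
                    for (int i = i0; i < i0 + blk; i++)
                        for (int j = j0; j < j0 + blk; j++) {
                            double s = c[i * n + j];
                            for (int k = k0; k < k0 + blk; k++)
                                s += a[i * n + k] * b[k * n + j];
                            c[i * n + j] = s;
                        }
    }
}
```

With blk = n this degenerates to the naive triple loop; smaller blocks trade loop overhead for cache reuse.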
Memory Hierarchy Optimizations: MatMul
Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops
Unordered iteration
• Often useful to reorder iterations for caches
• Compilers can do this for simple operations, e.g., matrix multiply, but hard in general
• Titanium adds unordered iteration on rectangular domains
foreach (p within r) { ... }
– p is a new Point, scoped only within the foreach body
– r is a previously-declared RectDomain
• Foreach simplifies bounds checking as well
• Additional operations on domains and arrays to subset and transform
Better MatMul with Titanium Arrays
public static void matMul(double [2d] a, double [2d] b, double [2d] c) {
  foreach (ij within c.domain()) {
    double [1d] aRowi = a.slice(1, ij[1]);
    double [1d] bColj = b.slice(2, ij[2]);
    foreach (k within aRowi.domain()) {
      c[ij] += aRowi[k] * bColj[k];
    }
  }
}
Current compiler eliminates array overhead, making it comparable to C performance for 3 nested loops
Automatic tiling still TBD
Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for parallel computation
– SPMD execution
– Global and local references
– Communication
– Barriers and single
– Synchronized methods and blocks (as in Java)
• Analysis of parallel code
• Summary and future directions
SPMD Execution Model
• Java programs can be run as Titanium, but the result will be that all processors do all the work
• E.g., parallel hello world:
class HelloWorld {
  public static void main (String [] argv) {
    System.out.println("Hello from proc " + Ti.thisProc());
  }
}
• Any non-trivial program will have communication and synchronization between processors
SPMD Execution Model
• A common style is compute/communicate
• E.g., in each timestep of a fish simulation with gravitational attraction:
read all fish and compute forces on mine
Ti.barrier();
write to my fish using new forces
Ti.barrier();
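The compute/communicate pattern above can be sketched with ordinary Java threads and a CyclicBarrier standing in for Ti.barrier() (a sketch, not Titanium's implementation; the force/position arrays are illustrative):

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Compute/communicate in plain Java: each thread computes into its own
// slot, waits at the barrier, then reads everyone's results. The barrier
// plays the role of Ti.barrier() in the Titanium pseudocode.
class BarrierDemo {
    static double[] run(int procs) {
        double[] forces = new double[procs];
        double[] positions = new double[procs];
        CyclicBarrier barrier = new CyclicBarrier(procs);
        Thread[] ts = new Thread[procs];
        for (int p = 0; p < procs; p++) {
            final int me = p;
            ts[p] = new Thread(() -> {
                try {
                    forces[me] = me + 1;        // "compute forces on mine"
                    barrier.await();            // Ti.barrier()
                    double sum = 0;             // read all forces
                    for (double f : forces) sum += f;
                    positions[me] = sum;        // "write to my fish"
                    barrier.await();            // Ti.barrier()
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new RuntimeException(e);
                }
            });
            ts[p].start();
        }
        try { for (Thread t : ts) t.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return positions;
    }
}
```

CyclicBarrier.await() establishes the happens-before ordering that makes the reads after the barrier safe, just as the barrier does in the slide.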
SPMD Model
• All processors start together and execute the same code, but not in lock-step
• Sometimes they take different branches:
if (Ti.thisProc() == 0) { ... do setup ... }
for (all data I own) { ... compute on data ... }
• A common source of bugs is barriers or other global operations inside branches or loops:
barrier, broadcast, reduction, exchange
• A "single" method is one called by all procs:
public single static void allStep(...)
• A "single" variable has the same value on all procs:
int single timestep = 0;
SPMD Execution Model
• Barriers and single in FishSimulation (n-body):
class FishSim {
  public static void main (String [] argv) {
    int single allTimestep = 0;
    int single allEndTime = 100;
    for (; allTimestep < allEndTime; allTimestep++) {
      // read all fish and compute forces on mine
      Ti.barrier();
      // write to my fish using new forces
      Ti.barrier();
    }
  }
}
• Single methods can be inferred; see David Gay's work
Global Address Space
• Processes allocate locally
• References can be passed to other processes
class C { int val; ... }
C gv;          // global pointer
C local lv;    // local pointer
if (Ti.thisProc() == 0) {
  lv = new C();
}
gv = broadcast lv from 0;
gv.val = ...;  // full
... = gv.val;  // functionality
[Figure: process 0 and the other processes each hold lv and gv in their local heaps; after the broadcast, every gv points to the C object in process 0's heap]
Use of Global / Local
• Default is global
– easier to port shared-memory programs
– performance bugs common: global pointers are more expensive
– harder to use sequential kernels
• Use local declarations in critical sections
• Compiler can infer many instances of “local”
• See Liblit’s work on LQI (Local Qualification Inference)
Local Pointer Analysis [Liblit, Aiken]
• Global references simplify programming, but incur overhead, even when data is local
– Split-C therefore requires that global pointers be declared explicitly
– Titanium pointers are global by default: easier, better portability
• Automatic "local qualification" inference
[Figure: effect of LQI -- running time (sec), original vs. after LQI, for the applications cannon, lu, sample, gsrb, poisson]
Parallel performance
• Speedup on Ultrasparc SMP
• AMR largely limited by
– current algorithm
– problem size
– 2 levels, with top one serial
• Not yet optimized with “local” for distributed memory
[Figure: speedup of em3d and amr on 1, 2, 4, and 8 processors]
Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for parallel computation
• Analysis and optimization of parallel code
– Tolerate network latency: Split-C experience
– Hardware trends and reordering
– Semantics: sequential consistency
– Cycle detection: parallel dependence analysis
– Synchronization analysis: parallel flow analysis
• Summary and future directions
Split-C Experience: Latency Overlap
• Titanium borrowed ideas from Split-C
– global address space
– SPMD parallelism
• But, Split-C had non-blocking accesses built in to tolerate network latency on remote read/write
• Also one-way communication
• Conclusion: useful, but complicated
int *global p;
x := *p;         /* get */
*p := 3;         /* put */
sync;            /* wait for my puts/gets */
*p :- x;         /* store */
all_store_sync;  /* wait globally */
Other sources of Overlap
• Would like compiler to introduce put/get/store.
• Hardware also reorders
– out-of-order execution
– write buffers with read bypass
– non-FIFO write buffers
– weak memory models in general
• Software already reorders too
– register allocation
– any code motion
• System provides enforcement primitives
– e.g., memory fence, volatile, etc.
– these tend to be heavyweight and have unpredictable performance
• Can the compiler hide all this?
Semantics: Sequential Consistency
• When compiling sequential programs, reordering
  x = expr1;            y = expr2;
  y = expr2;    into    x = expr1;
is valid if y is not in expr1 and x is not in expr2 (roughly).
• When compiling parallel code, this test is not sufficient:
  Initially flag = data = 0
  Proc A                Proc B
  data = 1;             while (flag == 0);
  flag = 1;             ... = ... data ...;
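In Java terms, the handoff above is only guaranteed if flag is declared volatile; otherwise the compiler or hardware may reorder the two writes. A minimal sketch:

```java
// The flag/data handoff from the slide in plain Java. Declaring `flag`
// volatile forbids reordering the write of `data` past the write of
// `flag`, so once the spinning reader sees flag == 1 it must also see
// data == 1. Without volatile, the accesses may be reordered and the
// reader can observe stale data.
class FlagData {
    static volatile int flag = 0;
    static int data = 0;

    static int run() {
        Thread a = new Thread(() -> {
            data = 1;   // Proc A: write data first...
            flag = 1;   // ...then set the volatile flag
        });
        a.start();
        while (flag == 0) { Thread.yield(); }  // Proc B spins on flag
        return data;    // 1, guaranteed by the Java memory model
    }
}
```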
Cycle Detection: Dependence Analog
• Processors define a “program order” on accesses from the same thread
P is the union of these total orders
• The memory system defines an "access order" on accesses to the same variable
A is access order (read/write & write/write pairs)
• A violation of sequential consistency is a cycle in P ∪ A
• Intuition: time cannot flow backwards
write data read flag
write flag read data
Cycle Detection
• Generalizes to arbitrary numbers of variables and processors
• Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor [Shasha & Snir]
write x write y read y
read y write x
Static Analysis for Cycle Detection
• Approximate P by the control flow graph
• Approximate A by undirected “dependence” edges
• Let the “delay set” D be all edges from P that are part of a minimal cycle
• The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code)
• Synchronization analysis is also critical [Krishnamurthy]
write z read x
read y write z
write y read x
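Since a violation is a cycle in P ∪ A, the check can be illustrated as a small graph search over access events (an illustrative sketch, not Krishnamurthy's implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Detect a cycle in P (program order) union A (the observed access
// order), as on the slides: build the directed graph over access events
// and run a DFS looking for a back edge.
class CycleDetect {
    static boolean hasCycle(int n, List<int[]> edges) {
        List<List<Integer>> g = new ArrayList<>();
        for (int i = 0; i < n; i++) g.add(new ArrayList<>());
        for (int[] e : edges) g.get(e[0]).add(e[1]);
        int[] state = new int[n];   // 0 = unvisited, 1 = on stack, 2 = done
        for (int v = 0; v < n; v++)
            if (state[v] == 0 && dfs(g, state, v)) return true;
        return false;
    }

    private static boolean dfs(List<List<Integer>> g, int[] state, int v) {
        state[v] = 1;
        for (int w : g.get(v)) {
            if (state[w] == 1) return true;   // back edge: cycle found
            if (state[w] == 0 && dfs(g, state, w)) return true;
        }
        state[v] = 2;
        return false;
    }
}
```

Encoding the flag/data example (0 = write data, 1 = write flag, 2 = read flag, 3 = read data) with P edges 0→1, 2→3 and observed A edges 1→2, 3→0 yields a cycle, i.e., the reordered execution is not sequentially consistent.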
Automatic Communication Optimization
• Implemented in subset of C with limited pointers [Krishnamurthy, Yelick]
• Experiments on the NOW; 3 synchronization styles
• Future: pointer analysis and optimizations for AMR [Jeh, Yelick]
[Figure: time (normalized) for the three synchronization styles]
Other Language Extensions
Java extensions for expressiveness & performance
• Operator overloading
• Zone-based memory management
• Foreign function interface
The following is not yet implemented in the compiler
• Parameterized types (aka templates)
Implementation
• Strategy
– compile Titanium into C
– Solaris or POSIX threads for SMPs
– Active Messages (Split-C library) for communication
• Status
– runs on SUN Enterprise 8-way SMP
– runs on Berkeley NOW
– runs on the Tera (not fully tested)
– T3E port partially working
– SP2 port under way
Titanium Status
• Titanium language definition complete.
• Titanium compiler running.
• Compiles for uniprocessors, NOW, Tera, T3E, SMPs, SP2 (under way).
• Application developments ongoing.
• Lots of research opportunities.
Future Directions
• Super-optimizers for targeted kernels
– e.g., PHiPAC, Sparsity, FFTW, and ATLAS
– include feedback and some runtime information
• New application domains
– unstructured grids (aka graphs and sparse matrices)
– I/O-intensive applications such as information retrieval
• Optimizing I/O as well as communication
– uniform treatment of memory hierarchy optimizations
• Performance heterogeneity from the hardware
– related to dynamic load balancing in software
• Reasoning about parallel code
– correctness analysis: race condition and synchronization analysis
– better analysis: aliases and threads
– Java memory model and hiding the hardware model
Backup Slides
Point, RectDomain, Arrays in General
Point<2> lb = [1, 1];
Point<2> ub = [10, 20];
RectDomain<2> r = [lb : ub : [2, 2]];
double [2d] A = new double[r];
...
foreach (p in A.domain()) {
  A[p] = B[2 * p + [1, 1]];
}
• Points specified by a tuple of ints
• RectDomains are given by:
– lower bound point
– upper bound point
– stride point
• Array given by RectDomain and element type
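The strided RectDomain above (lower bound, upper bound, stride) can be emulated in plain Java with nested strided loops (illustrative only; this assumes inclusive upper bounds as in the [lb : ub : stride] examples, and Titanium's foreach additionally makes no ordering promise):

```java
import java.util.ArrayList;
import java.util.List;

// Enumerate the points of a 2D RectDomain [lb : ub : stride], row by
// row. Each point is an int pair, mirroring Titanium's Point<2>.
class RectDomain2 {
    static List<int[]> points(int[] lb, int[] ub, int[] stride) {
        List<int[]> pts = new ArrayList<>();
        for (int i = lb[0]; i <= ub[0]; i += stride[0])
            for (int j = lb[1]; j <= ub[1]; j += stride[1])
                pts.add(new int[]{i, j});
        return pts;
    }
}
```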
AMR Poisson
• Poisson Solver [Semenzato, Pike, Colella]
– 3D AMR
– finite domain
– variable coefficients
– multigrid across levels
• Performance of Titanium implementation
– Sequential multigrid performance within +/- 20% of Fortran
– On a fixed, well-balanced problem of 8 patches, each 72^3: parallel speedup of 5.5 on 8 processors
Distributed Data Structures
• Build distributed data structures:
– broadcast or exchange
RectDomain<1> single allProcs = [0 : Ti.numProcs() - 1];
RectDomain<1> myFishDomain = [0 : myFishCount - 1];
Fish [1d] single [1d] allFish =
  new Fish [allProcs][1d];
Fish [1d] myFish = new Fish [myFishDomain];
allFish.exchange(myFish);
• Now each processor has an array of global pointers, one to each processor's chunk of fish
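The effect of exchange can be sketched with Java threads: each "processor" deposits a reference to its local chunk into its slot of a shared directory, and after a barrier every thread can reach every chunk (illustrative names, not Titanium's runtime; in Titanium the deposited references would be global pointers):

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Sketch of exchange: thread p allocates its own fish chunk, writes a
// reference into allFish[p], and after the barrier every thread holds
// references to all chunks -- the picture after allFish.exchange(myFish).
class ExchangeDemo {
    static double[][] run(int procs) {
        double[][] allFish = new double[procs][];
        CyclicBarrier barrier = new CyclicBarrier(procs);
        Thread[] ts = new Thread[procs];
        for (int p = 0; p < procs; p++) {
            final int me = p;
            ts[p] = new Thread(() -> {
                try {
                    double[] myFish = new double[me + 1]; // my local chunk
                    allFish[me] = myFish;                 // deposit reference
                    barrier.await();                      // exchange complete
                    // now this thread may read any allFish[q]
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new RuntimeException(e);
                }
            });
            ts[p].start();
        }
        try { for (Thread t : ts) t.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return allFish;
    }
}
```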
Consistency Model
• Titanium adopts the Java memory consistency model
• Roughly: accesses to shared variables that are not synchronized have undefined behavior.
• Use synchronization to control access to shared variables.
– barriers
– synchronized methods and blocks
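Titanium inherits Java's synchronized methods and blocks; a minimal sketch of using them to control access to a shared variable:

```java
// A shared counter protected by synchronized methods, as the slide
// suggests. Without `synchronized`, concurrent increments could be lost;
// with it, each increment is atomic and the update is visible to the
// next thread that acquires the lock.
class SyncCounter {
    private int count = 0;

    synchronized void increment() { count++; }
    synchronized int get() { return count; }

    static int run(int threads, int perThread) {
        SyncCounter c = new SyncCounter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) c.increment();
            });
            ts[i].start();
        }
        try { for (Thread t : ts) t.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return c.get();
    }
}
```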
Example: Domain
Point<2> lb = [0, 0];
Point<2> ub = [6, 4];
RectDomain<2> r = [lb : ub : [2, 2]];
...
Domain<2> red = r + (r + [1, 1]);
foreach (p in red) {
  ...
}
[Figure: r spans (0,0) to (6,4); r + [1,1] spans (1,1) to (7,5); red spans (0,0) to (7,5)]
• Domains in general are not rectangular
• Built using set operations
– union, +
– intersection, *
– difference, -
• Example is red-black algorithm
Example using Domains and foreach
• Gauss-Seidel red-black computation in multigrid
void gsrb() {
  boundary(phi);
  for (Domain<2> d = red; d != null;
       d = (d == red ? black : null)) {
    foreach (q in d)   // unordered iteration
      res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)]) * 4
                + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                - 20.0 * phi[q] - k * rhs[q]) * 0.05;
    foreach (q in d) phi[q] += res[q];
  }
}
Applications
• Three-D AMR Poisson Solver (AMR3D)
– block-structured grids
– 2000-line program
– algorithm not yet fully implemented in other languages
– tests performance and effectiveness of language features
• Other 2D Poisson Solvers (under development)
– infinite domains
– based on method of local corrections
• Three-D Electromagnetic Waves (EM3D)
– unstructured grids
• Several smaller benchmarks