Introduction to X10x10.sourceforge.net/documentation/papers/X10... · reified generics templates;...
Transcript of Introduction to X10x10.sourceforge.net/documentation/papers/X10... · reified generics templates;...
Introduction to X10
Olivier Tardieu IBM T.J. Watson Research Center
This material is based upon work supported by the Defense Advanced Research Projects Agency, by the Air Force Office of Scientific Research, and by the Department of Energy.
Take Away
§ X10 is § a programming language
§ derived from and interoperable with Java § an open source tool chain
§ compilers, runtime, IDE § developed at IBM Research since 2004 with support from DARPA, DoE, and AFOSR § >100 contributors (IBM and Academia)
§ a growing community § >100 papers § workshops, tutorials, courses
§ X10 tackles the challenge of programming at scale § first HPC, then clusters, now cloud § scale out: run across many distributed nodes § scale up: exploit multi-core and accelerators § elasticity and resilience § double goal: productivity and performance
2
Links
§ Main X10 website http://x10-lang.org
§ X10 Language Specification http://x10.sourceforge.net/documentation/languagespec/x10-latest.pdf
§ A Brief Introduction to X10 (for the HPC Programmer)http://x10.sourceforge.net/documentation/intro/intro-223.pdf
§ X10 2.4.3 release (command line tools only) http://sourceforge.net/projects/x10/files/x10/2.4.3/
§ X10DT 2.4.3 release (Eclipse-based IDE) http://sourceforge.net/projects/x10/files/x10dt/2.4.3/
3
Agenda
§ X10 overview § APGAS programming model § X10 programming language
§ Tool chain
§ Implementation
§ Applications
§ X10 2013/2014 Highlights
§ Onward
4
X10 Overview
5
Asynchronous Partitioned Global Address Space (APGAS) Memory abstraction § Message passing
§ each task lives in its own address space; example: MPI § Shared memory
§ shared address space for all the tasks; example: OpenMP § PGAS
§ global address space: single address space across all tasks § partitioned address space: each partition must fit within a shared-memory node § examples: Titanium, UPC, Co-array Fortran, X10, Chapel
Execution model § SPMD
§ symmetric tasks progressing in lockstep; examples: MPI, OpenMP 3, CUDA kernels § APGAS
§ asynchronous tasks; fork/join synchronization; examples: Cilk, X10
6
Task parallelism § async S!§ finish S!
Place-shifting operations § at(p) S!§ at(p) e!
………
…
Tasks
Local Heap
Place 0
………
Tasks
Local Heap
Place N
…
Global Reference
Places and Tasks
Concurrency control § when(c) S!§ atomic S!
Distributed heap § GlobalRef[T]!§ PlaceLocalHandle[T]!
7
Idioms
§ Remote procedure call v = at(p) evalThere(arg1, arg2);!
§ Active message at(p) async runThere(arg1, arg2);
§ Divide-and-conquer parallelism def fib(n:Long):Long { if(n < 2) return n; val f1:Long; val f2:Long; finish { async f1 = fib(n-1); f2 = fib(n-2); } return f1 + f2; }!!
§ SPMD finish for(p in Place.places()) { at(p) async runEverywhere(); }
§ Atomic remote update at(ref) async atomic ref() += v;!
§ Computation/communication overlap val acc = new Accumulator(); while(cond) { finish { val v = acc.currentValue(); at(ref) async ref() = v; acc.updateValue(); } }!!
8
BlockDistRail.x10
public class BlockDistRail[T] {! protected val sz:Long; // block size! protected val raw:PlaceLocalHandle[Rail[T]];!!
public def this(sz:Long, places:Long){T haszero} {! this.sz = sz;! raw = PlaceLocalHandle.make[Rail[T]](PlaceGroup.make(places), ()=>new Rail[T](sz));!
}!! public operator this(i:Long) = (v:T) { at(Place(i/sz)) raw()(i%sz) = v; }!!
public operator this(i:Long) = at(Place(i/sz)) raw()(i%sz);!! public static def main(Rail[String]) {!
val rail = new BlockDistRail[Long](5, 4);! rail(7) = 8; Console.OUT.println(rail(7));! }!}!
9
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 Place 0 Place 1 Place 2 Place 3
Like Java
§ Objects § classes and interfaces
§ single-class inheritance, multiple interfaces § fields, methods, constructors § virtual dispatch, overriding, overloading, static methods
§ Packages and files § Garbage collector § Variables and values (final variables, but final is the default)
§ definite assignment (extended to tasks) § Expressions and statements
§ control statements: if, switch, for, while, do-while, break, continue, return § Exceptions
§ try-catch-finally, throw § Comprehension loops and iterators
10
Beyond Java
§ Syntax § types “x:Long” rather than “Long x” § declarations val, var, def!§ function literals (a:Long, b:Long) => a < b ? a : b!§ ranges 0..(size-1)!§ operators user-defined behavior for standard operators
§ Types § local type inference val b = false;!§ function types (Long, Long) => Long!§ typedefs type BinOp[T] = (T, T) => T; § structs headerless inline objects; extensible primitive types § arrays multi-dimensional, distributed; implemented in X10 § properties and constraints extended static checking § reified generics templates; constrained kinds
11
gradual typing
Tool Chain
12
Tool Chain
§ Eclipse Public License
§ “Native” X10 implementation § C++ based; CUDA support § distributed multi-process (one place per process + one place per GPU) § C/POSIX network abstraction layer (X10RT) § x86, x86_64, Power; Linux, AIX, OS X, Windows/Cygwin, BG/Q; TCP/IP, PAMI, MPI
§ “Managed” X10 implementation § Java 6/7 based; no CUDA support § distributed multi-JVM (one place per JVM) § pure Java implementation over TCP/IP or using X10RT via JNI (Linux & OS X)
§ X10DT (Eclipse-based IDE) available for Windows, Linux, OS X § supports many core development tasks including remote build & execute facilities
13
Compilation and Execution
X10 Source
Parsing / Type Check
AST Optimizations AST Lowering X10 AST
X10 AST
Java Code Generation
C++ Code Generation
Java Source C++ Source
Java Compiler Platform Compilers XRJ XRC XRX
Java Byteode Native executable
X10RT
X10 Compiler Front-End
Java Back-End
C++ Back-End
Native Environment (CPU, GPU, etc) Java VMs
JNI
Managed X10 Native X10
Existing Java Application Existing Native (C/C++/etc) Application
Java Interop
Cuda Source
14
X10DT
Building
Editing
Browsing
Debug
Source navigation, syntax highlighting, parsing errors,
folding, hyperlinking, outline and quick outline, hover
help, content assist, type hierarchy, format, search,
call graph, quick fixes - Java/C++ support - Local and remote
Launching
15
Implementation
16
Native Runtime
XRX
Runtime
§ X10RT (X10 runtime transport) § core API: active messages § extended API: collectives & RDMAs
§ emulation layer § two versions: C (+JNI bindings) or pure Java
§ Native runtime § processes, threads, atomic ops § object model (layout, RTTI, serialization) § two versions: C++ and Java
§ XRX (X10 runtime in X10) § async, finish, at, when, atomic § X10 code compiled to C++ or Java
§ Core X10 libraries § x10.array, io, util, util.concurrent
X10 Application
X10RT
PAMI TCP/IP
X10 Core Class Libraries
MPI CUDA
17
SHM
APGAS Constructs
§ One process per place
§ Local tasks: async & finish § thread pool; cooperative work-stealing scheduler
§ Remote tasks: at(p) async § source side: synthetize active message
§ async id + serialized heap + control state (finish, clocks) § compiler identifies captured variables (roots); runtime serializes heap reachable from roots
§ destination side: decode active message § polling (when idle + on runtime entry)
§ Distributed finish § complex and potentially costly due to message reordering § pattern-based specialization; program analysis
18
Applications
19
HPC Challenge 2012 – X10 at Petascale – Power 775
589231 22.4
18.0
16.00
18.00
20.00
22.00
24.00
0
200000
400000
600000
800000
0 16384 32768
Gflo
ps/p
lace
Gflo
ps
Places
G-HPL
396614
7.23
7.12
0
5
10
15
0
100000
200000
300000
400000
500000
0 27840 55680
GB
/s/p
lace
GB
/s
Places
EP Stream (Triad)
844
0.82 0.82
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0
100
200
300
400
500
600
700
800
900
0 8192 16384 24576 32768
Gup
s/pl
ace
Gup
s
Places
G-RandomAccess
26958
0.88
0.82
0.00 0.20 0.40 0.60 0.80 1.00 1.20
0 5000
10000 15000 20000 25000 30000
0 16384 32768 G
flops
/pla
ce
Gflo
ps
Places
G-FFT
356344
596451 10.93 10.87
10.71
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
0
100000
200000
300000
400000
500000
600000
700000
0 13920 27840 41760 55680
Mill
ion
node
s/s/
plac
e
Mill
ion
node
s/s
Places
UTS
20
Community and Applications
§ X10 applications and frameworks § ANUChem, [Milthorpe IPDPS 2011], [Limpanuparb JCTC 2013] § ScaleGraph [Dayarathna et al X10 2012] § Invasive Computing [Bungartz et al X10 2013] § XAXIS [Suzumura et al X10 2012]; used in Megaffic (IBM Mega Traffic Simulator) § Global Matrix Library: distributed sparse and dense matrices § Global Load Balancing [Zhang et al PPAA 2014]
§ X10 as a coordination language for scale-out § SatX10 [Bloom et al SAT’12 Tools] § Power system contingency analysis [Khaitan & McCalley X10 2013]
§ X10 as a target language § MatLab [Kumar & Hendren X10 2013] § StreamX10 [Wei et al X10 2012]
21
2013/2014 Highlights
22
Releases
§ X10 2.4.0 – October 2013 § array redesign § long is the default integer type
§ X10 2.4.1 – December 2013 § resilient X10: ability to detect and program around place failure § pure Java implementation of the x10rt network layer => end JNI dependency § resume CUDA support; upgrade to CUDA Toolkit 6.0
§ X10 2.4.2 – February 2014 § GLB: global load balancing framework (limitation: single-threaded) § pure multi-place single-host Java launcher => end Cygwin dependency on Windows
§ X10 2.4.3 – May 2014 § source-level debugging in X10DT (breakpoints, stepping…) § busy waiting fix
23
backward incompatible
Arrays
§ Primitive arrays: Rail[T] § fixed-size, zero-based, dense, 1d array with elements of type T § long indices, bounds checking § generic X10 class with native implementations
§ x10.array package: Array[T], DistArray[T] § fixed-size, zero-based, dense, multi-dimensional, rectangular arrays of type T § row-major 1d, 2d, 3d arrays, blocked 1d and 2d distributed arrays § pure X10 code built on top of Rail[T] => extensible collection
§ x10.regionarray package: Point, Region, Dist, Array[T], DistArray[T] § arrays defined over arbitrary (user-definable) regions and distributions § fully general; descendent of original X10 array sub-language § pure X10 code built on top of Rail[T] => extensible collection
24
Resilient X10
§ Key ideas § enhance X10 to tolerate place failure § natural extension of APGAS programming model § build in minimal support § enable applications & frameworks to implement their own fault tolerance strategies
This work was funded in part by the Air Force Office of Scientific Research under Contract No. FA8750-13-C-0052. 25
MPI Existing X10 (fast)
Hadoop Checkpoint & Restart
X10-FT (slow)
Resilient X10 (fast)
Non-resilient / manual Transparent fault tolerance Failure awareness
Keynote at 13:35 David Cunningham
Global Load Balancing Framework
26
Mapper
Reducer
…
Result
TQ TQ
TB
TQ TaskBag
TaskQueue
Worker Worker …
TB
Onward
27
Ongoing Work
§ Elastic X10 § add places dynamically § “recover” failed places § dynamic place groups
§ Grid X10 § combine resilient/elastic X10 with external data grid § Hazelcast bindings for X10 § better serialization for mixed code (Java & X10)
§ Cloud X10 § ease deployment over clouds
28
Examples
29
Hello.x10
package examples;!import x10.io.Console;!!
public class Hello { // class! protected val n:Long; // field!! public def this(n:Long) { this.n = n; } // constructor!! public def test() = n > 0; // method!
! public static def main(args:Rail[String]) { // main method! Console.OUT.println("Hello world! ");! val foo = new Hello(args.size); // inferred type! var result:Boolean = foo.test(); // no inference for vars! if(result) Console.OUT.println("The first arg is: " + args(0));!
}!}!
30
Compiling and Running X10 Programs
§ C++ backend x10c++ -O Hello.x10 -o hello; ./hello!
§ Java backend x10c -O Hello.x10; x10 examples.Hello!
§ Compiler flags -O generate optimized code -NO_CHECKS disable generation of all runtime checks (null, bounds…) -x10rt <impl> select x10rt implementation: sockets (default), pami, mpi…
§ Runtime configuration: environment variables X10_NPLACES=<n> number of places (x10rt sockets on localhost) X10_NTHREADS=<n> number of worker threads per place
31
Primitive Types and Structs
§ Structs cannot extend other data types or be extended or have mutable fields § Structs are allocated inline and have no header
§ Primitive types are structs with native implementations § Boolean, Char, Byte, Short, Int, Long, Float, Double, UByte, UShort, UInt, ULong !
public struct Complex implements Arithmetic[Complex] {! public val re:Double;!
public val im:Double;!! public @Inline def this(re:Double, im:Double) { this.re = re; this.im = im; }!
! public operator this + (that:Complex) = Complex(re + that.re, im + that.im);!! // and more!
}!!// a:Rail[Complex](N) has same layout as b:Rail[Double](2*N) in memory!// a(i).re ~ b(2*i) and a(i).im ~ b(2*i+1)!
32
ArraySum.x10
package examples;!import x10.array.*;!!
public class ArraySum {! static N = 10;!! static def reduce[T](a:Array[T], f:(T,T)=>T){T haszero} {! var result:T = Zero.get[T]();! for(v in a) result = f(result, v);!
return result;! } !! public static def main(Rail[String]) {! val a = new Array_2[Double](N, N);! for(var i:Long=0; i<N; ++i) for(j in 0..(N-1)) a(i,j) = i+j;!
Console.OUT.println("Sum: " + reduce(a, (x:Double,y:Double)=>x+y));! }!}!
33
Vector.x10
package examples;!!public class Vector[T](size:Long){T haszero,T<:Arithmetic[T]} {! val raw:Rail[T]{self!=null,self.size==this.size};!
// this refers to the object instance, self to the value being constrained!! def this(size:Long) { property(size); raw = new Rail[T](size); }!
! def add(vec:Vector[T]){vec.size==this.size}:Vector[T]{self.size==this.size} {! for(i in 0..(size-1)) raw(i) += vec.raw(i);! return this;!
}!! public static def main(Rail[String]) {!
val v = new Vector[Long](4); ! val w = new Vector[Long](5);! v.add(w);! }!
}!// fails compiling with -STATIC_CHECKS or throws x10.lang.FailedDynamicCheckException!
34
Monte Carlo Pi
package examples;!import x10.util.Random;!!public class SeqPi {!
public static def main(args:Rail[String]) {! val N = Long.parse(args(0));! var result:Double = 0;!
val rand = new Random();! for(1..N) {! val x = rand.nextDouble();! val y = rand.nextDouble();!
if(x*x + y*y <= 1) result++;! }! val pi = 4*result/N;!
Console.OUT.println("The value of pi is " + pi);! }!}!
35
Parallel Monte Carlo Pi with Atomic
public class ParPi {! public static def main(args:Rail[String]) {! val N = Long.parse(args(0)); val P = Long.parse(args(1));! var result:Double = 0;!
finish for(1..P) async {! val myRand = new Random();! var myResult:Double = 0;!
for(1..(N/P)) {! val x = myRand.nextDouble();! val y = myRand.nextDouble();! if(x*x + y*y <= 1) myResult++;!
}! atomic result += myResult;! }!
val pi = 4*result/N;! Console.OUT.println("The value of pi is " + pi);! }!}!
36
Distributed Monte Carlo Pi with Atomic and GlobalRef
public class DistPi {! public static def main(args:Rail[String]) {! val N = Long.parse(args(0));! val result = GlobalRef[Cell[Double]](new Cell[Double](0));!
finish for(p in Place.places()) at(p) async {! val myRand = new Random();! var myResult:Double = 0;!
for(1..(N/Place.MAX_PLACES)) {! val x = myRand.nextDouble();! val y = myRand.nextDouble();! if(x*x + y*y <= 1) myResult++;!
}! val myFinalResult = myResult;! at(result) async atomic result()() += myFinalResult;!
}! val pi = 4*result()()/N;! Console.OUT.println("The value of pi is " + pi);! }!
}!!!!
! 37
Parallel Monte Carlo Pi with Collecting Finish
public class CollectPi {! public static def main(args:Rail[String]) {! val N = Long.parse(args(0)); val P = Long.parse(args(1));! val result = finish(Reducible.SumReducer[Double]()) {!
for(1..P) async {! val myRand = new Random();! var myResult:Double = 0;!
for(1..(N/P)) {! val x = myRand.nextDouble();! val y = myRand.nextDouble();! if(x*x + y*y <= 1) myResult++;!
}! offer myResult;! }!
};! val pi = 4*result/N;! Console.OUT.println("The value of pi is " + pi);! }!
}!
38
Distributed Monte Carlo Pi with Collecting Finish
public class MontyPi {! public static def main(args:Rail[String]) {! val N = Long.parse(args(0));! val result = finish(Reducible.SumReducer[Double]()) {!
for(p in Place.places()) at(p) async {! val myRand = new Random();! var myResult:Double = 0;!
for(1..(N/Place.MAX_PLACES)) {! val x = myRand.nextDouble();! val y = myRand.nextDouble();! if(x*x + y*y <= 1) myResult++;!
}! offer myResult;! }!
};! val pi = 4*result/N;! Console.OUT.println("The value of pi is " + pi);! }!
}!!!!
! 39