Introduction to X10x10.sourceforge.net/documentation/papers/X10... · reified generics templates;...

Introduction to X10

Olivier Tardieu IBM T.J. Watson Research Center

This material is based upon work supported by the Defense Advanced Research Projects Agency, by the Air Force Office of Scientific Research, and by the Department of Energy.

Take Away

§  X10 is §  a programming language

§  derived from and interoperable with Java §  an open source tool chain

§  compilers, runtime, IDE §  developed at IBM Research since 2004 with support from DARPA, DoE, and AFOSR §  >100 contributors (IBM and Academia)

§  a growing community §  >100 papers §  workshops, tutorials, courses

§  X10 tackles the challenge of programming at scale §  first HPC, then clusters, now cloud §  scale out: run across many distributed nodes §  scale up: exploit multi-core and accelerators §  elasticity and resilience §  double goal: productivity and performance

2

Links

§  Main X10 website http://x10-lang.org

§  X10 Language Specification http://x10.sourceforge.net/documentation/languagespec/x10-latest.pdf

§  A Brief Introduction to X10 (for the HPC Programmer)http://x10.sourceforge.net/documentation/intro/intro-223.pdf

§  X10 2.4.3 release (command line tools only) http://sourceforge.net/projects/x10/files/x10/2.4.3/

§  X10DT 2.4.3 release (Eclipse-based IDE) http://sourceforge.net/projects/x10/files/x10dt/2.4.3/

3

Agenda

§  X10 overview §  APGAS programming model §  X10 programming language

§  Tool chain

§  Implementation

§  Applications

§  X10 2013/2014 Highlights

§  Onward

4

X10 Overview

5

Asynchronous Partitioned Global Address Space (APGAS) Memory abstraction §  Message passing

§  each task lives in its own address space; example: MPI §  Shared memory

§  shared address space for all the tasks; example: OpenMP §  PGAS

§  global address space: single address space across all tasks §  partitioned address space: each partition must fit within a shared-memory node §  examples: Titanium, UPC, Co-array Fortran, X10, Chapel

Execution model §  SPMD

§  symmetric tasks progressing in lockstep; examples: MPI, OpenMP 3, CUDA kernels §  APGAS

§  asynchronous tasks; fork/join synchronization; examples: Cilk, X10

6

Task parallelism §  async S!§  finish S!

Place-shifting operations §  at(p) S!§  at(p) e!

………

…

Tasks

Local Heap

Place 0

………

Tasks

Local Heap

Place N

…

Global Reference

Places and Tasks

Concurrency control §  when(c) S!§  atomic S!

Distributed heap §  GlobalRef[T]!§  PlaceLocalHandle[T]!

7

Idioms

§  Remote procedure call v = at(p) evalThere(arg1, arg2);!

§  Active message at(p) async runThere(arg1, arg2);

§  Divide-and-conquer parallelism def fib(n:Long):Long { if(n < 2) return n; val f1:Long; val f2:Long; finish { async f1 = fib(n-1); f2 = fib(n-2); } return f1 + f2; }!!

§  SPMD finish for(p in Place.places()) { at(p) async runEverywhere(); }

§  Atomic remote update at(ref) async atomic ref() += v;!

§  Computation/communication overlap val acc = new Accumulator(); while(cond) { finish { val v = acc.currentValue(); at(ref) async ref() = v; acc.updateValue(); } }!!

8

BlockDistRail.x10

public class BlockDistRail[T] {! protected val sz:Long; // block size! protected val raw:PlaceLocalHandle[Rail[T]];!!

public def this(sz:Long, places:Long){T haszero} {! this.sz = sz;! raw = PlaceLocalHandle.make[Rail[T]](PlaceGroup.make(places), ()=>new Rail[T](sz));!

}!! public operator this(i:Long) = (v:T) { at(Place(i/sz)) raw()(i%sz) = v; }!!

public operator this(i:Long) = at(Place(i/sz)) raw()(i%sz);!! public static def main(Rail[String]) {!

val rail = new BlockDistRail[Long](5, 4);! rail(7) = 8; Console.OUT.println(rail(7));! }!}!

9

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 Place 0 Place 1 Place 2 Place 3

Like Java

§  Objects §  classes and interfaces

§  single-class inheritance, multiple interfaces §  fields, methods, constructors §  virtual dispatch, overriding, overloading, static methods

§  Packages and files §  Garbage collector §  Variables and values (final variables, but final is the default)

§  definite assignment (extended to tasks) §  Expressions and statements

§  control statements: if, switch, for, while, do-while, break, continue, return §  Exceptions

§  try-catch-finally, throw §  Comprehension loops and iterators

10

Beyond Java

§  Syntax §  types “x:Long” rather than “Long x” §  declarations val, var, def!§  function literals (a:Long, b:Long) => a < b ? a : b!§  ranges 0..(size-1)!§  operators user-defined behavior for standard operators

§  Types §  local type inference val b = false;!§  function types (Long, Long) => Long!§  typedefs type BinOp[T] = (T, T) => T; §  structs headerless inline objects; extensible primitive types §  arrays multi-dimensional, distributed; implemented in X10 §  properties and constraints extended static checking §  reified generics templates; constrained kinds

11

gradual typing

Tool Chain

12

Tool Chain

§  Eclipse Public License

§  “Native” X10 implementation §  C++ based; CUDA support §  distributed multi-process (one place per process + one place per GPU) §  C/POSIX network abstraction layer (X10RT) §  x86, x86_64, Power; Linux, AIX, OS X, Windows/Cygwin, BG/Q; TCP/IP, PAMI, MPI

§  “Managed” X10 implementation §  Java 6/7 based; no CUDA support §  distributed multi-JVM (one place per JVM) §  pure Java implementation over TCP/IP or using X10RT via JNI (Linux & OS X)

§  X10DT (Eclipse-based IDE) available for Windows, Linux, OS X §  supports many core development tasks including remote build & execute facilities

13

Compilation and Execution

X10 Source

Parsing / Type Check

AST Optimizations AST Lowering X10 AST

X10 AST

Java Code Generation

C++ Code Generation

Java Source C++ Source

Java Compiler Platform Compilers XRJ XRC XRX

Java Byteode Native executable

X10RT

X10 Compiler Front-End

Java Back-End

C++ Back-End

Native Environment (CPU, GPU, etc) Java VMs

JNI

Managed X10 Native X10

Existing Java Application Existing Native (C/C++/etc) Application

Java Interop

Cuda Source

14

X10DT

Building

Editing

Browsing

Debug

Source navigation, syntax highlighting, parsing errors,

folding, hyperlinking, outline and quick outline, hover

help, content assist, type hierarchy, format, search,

call graph, quick fixes - Java/C++ support - Local and remote

Launching

15

Implementation

16

Native Runtime

XRX

Runtime

§  X10RT (X10 runtime transport) §  core API: active messages §  extended API: collectives & RDMAs

§  emulation layer §  two versions: C (+JNI bindings) or pure Java

§  Native runtime §  processes, threads, atomic ops §  object model (layout, RTTI, serialization) §  two versions: C++ and Java

§  XRX (X10 runtime in X10) §  async, finish, at, when, atomic §  X10 code compiled to C++ or Java

§  Core X10 libraries §  x10.array, io, util, util.concurrent

X10 Application

X10RT

PAMI TCP/IP

X10 Core Class Libraries

MPI CUDA

17

SHM

APGAS Constructs

§  One process per place

§  Local tasks: async & finish §  thread pool; cooperative work-stealing scheduler

§  Remote tasks: at(p) async §  source side: synthetize active message

§  async id + serialized heap + control state (finish, clocks) §  compiler identifies captured variables (roots); runtime serializes heap reachable from roots

§  destination side: decode active message §  polling (when idle + on runtime entry)

§  Distributed finish §  complex and potentially costly due to message reordering §  pattern-based specialization; program analysis

18

Applications

19

HPC Challenge 2012 – X10 at Petascale – Power 775

589231 22.4

18.0

16.00

18.00

20.00

22.00

24.00

0

200000

400000

600000

800000

0 16384 32768

Gflo

ps/p

lace

Gflo

ps

Places

G-HPL

396614

7.23

7.12

0

5

10

15

0

100000

200000

300000

400000

500000

0 27840 55680

GB

/s/p

lace

GB

/s

Places

EP Stream (Triad)

844

0.82 0.82

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0

100

200

300

400

500

600

700

800

900

0 8192 16384 24576 32768

Gup

s/pl

ace

Gup

s

Places

G-RandomAccess

26958

0.88

0.82

0.00 0.20 0.40 0.60 0.80 1.00 1.20

0 5000

10000 15000 20000 25000 30000

0 16384 32768 G

flops

/pla

ce

Gflo

ps

Places

G-FFT

356344

596451 10.93 10.87

10.71

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

0

100000

200000

300000

400000

500000

600000

700000

0 13920 27840 41760 55680

Mill

ion

node

s/s/

plac

e

Mill

ion

node

s/s

Places

UTS

20

Community and Applications

§  X10 applications and frameworks §  ANUChem, [Milthorpe IPDPS 2011], [Limpanuparb JCTC 2013] §  ScaleGraph [Dayarathna et al X10 2012] §  Invasive Computing [Bungartz et al X10 2013] §  XAXIS [Suzumura et al X10 2012]; used in Megaffic (IBM Mega Traffic Simulator) §  Global Matrix Library: distributed sparse and dense matrices §  Global Load Balancing [Zhang et al PPAA 2014]

§  X10 as a coordination language for scale-out §  SatX10 [Bloom et al SAT’12 Tools] §  Power system contingency analysis [Khaitan & McCalley X10 2013]

§  X10 as a target language §  MatLab [Kumar & Hendren X10 2013] §  StreamX10 [Wei et al X10 2012]

21

2013/2014 Highlights

22

Releases

§  X10 2.4.0 – October 2013 §  array redesign §  long is the default integer type

§  X10 2.4.1 – December 2013 §  resilient X10: ability to detect and program around place failure §  pure Java implementation of the x10rt network layer => end JNI dependency §  resume CUDA support; upgrade to CUDA Toolkit 6.0

§  X10 2.4.2 – February 2014 §  GLB: global load balancing framework (limitation: single-threaded) §  pure multi-place single-host Java launcher => end Cygwin dependency on Windows

§  X10 2.4.3 – May 2014 §  source-level debugging in X10DT (breakpoints, stepping…) §  busy waiting fix

23

backward incompatible

Arrays

§  Primitive arrays: Rail[T] §  fixed-size, zero-based, dense, 1d array with elements of type T §  long indices, bounds checking §  generic X10 class with native implementations

§  x10.array package: Array[T], DistArray[T] §  fixed-size, zero-based, dense, multi-dimensional, rectangular arrays of type T §  row-major 1d, 2d, 3d arrays, blocked 1d and 2d distributed arrays §  pure X10 code built on top of Rail[T] => extensible collection

§  x10.regionarray package: Point, Region, Dist, Array[T], DistArray[T] §  arrays defined over arbitrary (user-definable) regions and distributions §  fully general; descendent of original X10 array sub-language §  pure X10 code built on top of Rail[T] => extensible collection

24

Resilient X10

§  Key ideas §  enhance X10 to tolerate place failure §  natural extension of APGAS programming model §  build in minimal support §  enable applications & frameworks to implement their own fault tolerance strategies

This work was funded in part by the Air Force Office of Scientific Research under Contract No. FA8750-13-C-0052. 25

MPI Existing X10 (fast)

Hadoop Checkpoint & Restart

X10-FT (slow)

Resilient X10 (fast)

Non-resilient / manual Transparent fault tolerance Failure awareness

Keynote at 13:35 David Cunningham

Global Load Balancing Framework

26

Mapper

Reducer

…

Result

TQ TQ

TB

TQ TaskBag

TaskQueue

Worker Worker …

TB

Onward

27

Ongoing Work

§  Elastic X10 §  add places dynamically §  “recover” failed places §  dynamic place groups

§  Grid X10 §  combine resilient/elastic X10 with external data grid §  Hazelcast bindings for X10 §  better serialization for mixed code (Java & X10)

§  Cloud X10 §  ease deployment over clouds

28

Examples

29

Hello.x10

package examples;!import x10.io.Console;!!

public class Hello { // class! protected val n:Long; // field!! public def this(n:Long) { this.n = n; } // constructor!! public def test() = n > 0; // method!

! public static def main(args:Rail[String]) { // main method! Console.OUT.println("Hello world! ");! val foo = new Hello(args.size); // inferred type! var result:Boolean = foo.test(); // no inference for vars! if(result) Console.OUT.println("The first arg is: " + args(0));!

}!}!

30

Compiling and Running X10 Programs

§  C++ backend x10c++ -O Hello.x10 -o hello; ./hello!

§  Java backend x10c -O Hello.x10; x10 examples.Hello!

§  Compiler flags -O generate optimized code -NO_CHECKS disable generation of all runtime checks (null, bounds…) -x10rt <impl> select x10rt implementation: sockets (default), pami, mpi…

§  Runtime configuration: environment variables X10_NPLACES=<n> number of places (x10rt sockets on localhost) X10_NTHREADS=<n> number of worker threads per place

31

Primitive Types and Structs

§  Structs cannot extend other data types or be extended or have mutable fields §  Structs are allocated inline and have no header

§  Primitive types are structs with native implementations §  Boolean, Char, Byte, Short, Int, Long, Float, Double, UByte, UShort, UInt, ULong !

public struct Complex implements Arithmetic[Complex] {! public val re:Double;!

public val im:Double;!! public @Inline def this(re:Double, im:Double) { this.re = re; this.im = im; }!

! public operator this + (that:Complex) = Complex(re + that.re, im + that.im);!! // and more!

}!!// a:Rail[Complex](N) has same layout as b:Rail[Double](2*N) in memory!// a(i).re ~ b(2*i) and a(i).im ~ b(2*i+1)!

32

ArraySum.x10

package examples;!import x10.array.*;!!

public class ArraySum {! static N = 10;!! static def reduce[T](a:Array[T], f:(T,T)=>T){T haszero} {! var result:T = Zero.get[T]();! for(v in a) result = f(result, v);!

return result;! } !! public static def main(Rail[String]) {! val a = new Array_2[Double](N, N);! for(var i:Long=0; i<N; ++i) for(j in 0..(N-1)) a(i,j) = i+j;!

Console.OUT.println("Sum: " + reduce(a, (x:Double,y:Double)=>x+y));! }!}!

33

Vector.x10

package examples;!!public class Vector[T](size:Long){T haszero,T<:Arithmetic[T]} {! val raw:Rail[T]{self!=null,self.size==this.size};!

// this refers to the object instance, self to the value being constrained!! def this(size:Long) { property(size); raw = new Rail[T](size); }!

! def add(vec:Vector[T]){vec.size==this.size}:Vector[T]{self.size==this.size} {! for(i in 0..(size-1)) raw(i) += vec.raw(i);! return this;!

}!! public static def main(Rail[String]) {!

val v = new Vector[Long](4); ! val w = new Vector[Long](5);! v.add(w);! }!

}!// fails compiling with -STATIC_CHECKS or throws x10.lang.FailedDynamicCheckException!

34

Monte Carlo Pi

package examples;!import x10.util.Random;!!public class SeqPi {!

public static def main(args:Rail[String]) {! val N = Long.parse(args(0));! var result:Double = 0;!

val rand = new Random();! for(1..N) {! val x = rand.nextDouble();! val y = rand.nextDouble();!

if(x*x + y*y <= 1) result++;! }! val pi = 4*result/N;!

Console.OUT.println("The value of pi is " + pi);! }!}!

35

Parallel Monte Carlo Pi with Atomic

public class ParPi {! public static def main(args:Rail[String]) {! val N = Long.parse(args(0)); val P = Long.parse(args(1));! var result:Double = 0;!

finish for(1..P) async {! val myRand = new Random();! var myResult:Double = 0;!

for(1..(N/P)) {! val x = myRand.nextDouble();! val y = myRand.nextDouble();! if(x*x + y*y <= 1) myResult++;!

}! atomic result += myResult;! }!

val pi = 4*result/N;! Console.OUT.println("The value of pi is " + pi);! }!}!

36

Distributed Monte Carlo Pi with Atomic and GlobalRef

public class DistPi {! public static def main(args:Rail[String]) {! val N = Long.parse(args(0));! val result = GlobalRef[Cell[Double]](new Cell[Double](0));!

finish for(p in Place.places()) at(p) async {! val myRand = new Random();! var myResult:Double = 0;!

for(1..(N/Place.MAX_PLACES)) {! val x = myRand.nextDouble();! val y = myRand.nextDouble();! if(x*x + y*y <= 1) myResult++;!

}! val myFinalResult = myResult;! at(result) async atomic result()() += myFinalResult;!

}! val pi = 4*result()()/N;! Console.OUT.println("The value of pi is " + pi);! }!

}!!!!

! 37

Parallel Monte Carlo Pi with Collecting Finish

public class CollectPi {! public static def main(args:Rail[String]) {! val N = Long.parse(args(0)); val P = Long.parse(args(1));! val result = finish(Reducible.SumReducer[Double]()) {!

for(1..P) async {! val myRand = new Random();! var myResult:Double = 0;!

for(1..(N/P)) {! val x = myRand.nextDouble();! val y = myRand.nextDouble();! if(x*x + y*y <= 1) myResult++;!

}! offer myResult;! }!

};! val pi = 4*result/N;! Console.OUT.println("The value of pi is " + pi);! }!

}!

38

Distributed Monte Carlo Pi with Collecting Finish

public class MontyPi {! public static def main(args:Rail[String]) {! val N = Long.parse(args(0));! val result = finish(Reducible.SumReducer[Double]()) {!

for(p in Place.places()) at(p) async {! val myRand = new Random();! var myResult:Double = 0;!

for(1..(N/Place.MAX_PLACES)) {! val x = myRand.nextDouble();! val y = myRand.nextDouble();! if(x*x + y*y <= 1) myResult++;!

}! offer myResult;! }!

};! val pi = 4*result/N;! Console.OUT.println("The value of pi is " + pi);! }!

}!!!!

! 39

Introduction to X10x10.sourceforge.net/documentation/papers/X10... · reified generics templates;...

Documents

Transcript of Introduction to X10x10.sourceforge.net/documentation/papers/X10... · reified generics templates;...