Transcript of Arvind Sujeeth's ScalaDays 2012 talk
Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun
Stanford University Pervasive Parallelism Laboratory (PPL)
Tiark Rompf, Aleksandar Prokopec, Vojin Jovanovic, Philipp Haller, Martin Odersky
École Polytechnique Fédérale de Lausanne (EPFL), Programming Methods Laboratory (LAMP)
Squeryl
DSLs can be used for high performance, too
Applications: Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering

Heterogeneous hardware: Cray Jaguar, Sun T2, Nvidia Fermi, Altera FPGA
Low-level programming models: MPI, PGAS, Pthreads, OpenMP, CUDA, OpenCL, Verilog, VHDL

Too many different programming models; DSLs can sit between the applications and the heterogeneous hardware.
- Tiark Rompf's talk yesterday. In case you missed it:
  - Techniques for rewriting high-level programs to high-performance programs
  - Build an intermediate representation (IR) of Scala programs at runtime
  - The IR can be optimized and code generated
- Introduction to existing Delite DSLs
- Constructing your own Delite DSL
- Not covered (under the covers): implementation details about the Delite framework
  - See http://cgo2012.hyperdsls.org/
- Syntax is legal Scala
- Staged to build an IR (metaprogramming)
- Optimized at a high level
- Compiled to different low-level target architectures

[Diagram: expression IR, e.g. the tree for (A * B) + (A * C)]
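The staging idea above can be sketched in a few lines of plain Scala: operations on values build IR nodes instead of computing results, and a separate pass consumes the IR. The toy types below (`Exp`, `Const`, `Plus`, `Times`, `eval`) are invented for illustration; they are not the actual LMS/Delite API.

```scala
// Toy staging sketch: arithmetic on Exp values builds an IR tree.
object TinyStaging {
  sealed trait Exp
  case class Const(v: Int) extends Exp
  case class Plus(a: Exp, b: Exp) extends Exp
  case class Times(a: Exp, b: Exp) extends Exp

  // make + and * available on Exp, mirroring how a DSL adds infix operators
  implicit class Ops(a: Exp) {
    def +(b: Exp): Exp = Plus(a, b)
    def *(b: Exp): Exp = Times(a, b)
  }

  // walk the IR; a real code generator would emit code strings here instead
  def eval(e: Exp): Int = e match {
    case Const(v)    => v
    case Plus(a, b)  => eval(a) + eval(b)
    case Times(a, b) => eval(a) * eval(b)
  }
}
```

Writing `a * b + a * c` over these `Exp` values produces exactly the `(A * B) + (A * C)` tree shown on the slide, which an optimizer can then rewrite before any computation happens.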
- OptiML (Machine Learning)
- OptiQL (Data querying)
- OptiGraph (Large-scale graph analysis)
- OptiCollections (Scala collections)
- OptiMesh (Mesh-based PDE solvers)

Coming soon:
- OptiSDR (Software-defined radio)
- OptiCVX (Convex optimization)
- Provides a familiar (MATLAB-like) language and API for writing ML applications
  - Ex. val c = a * b  (a, b are Matrix[Double])
- Implicitly parallel data structures
  - Base types: Vector[T], Matrix[T], Graph[V,E], Stream[T]
  - Subtypes: TrainingSet, IndexVector, Image, …
- Implicitly parallel control structures
  - sum{…}, (0::end) {…}, gradient { … }, untilconverged { … }
  - Arguments to control structures are anonymous functions with restricted semantics
OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning, ICML 2011
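For intuition, untilconverged can be read as ordinary fixpoint iteration. Below is a plain, unstaged Scala sketch with an invented signature; the real OptiML construct is staged and its closure argument has restricted semantics.

```scala
// Unstaged sketch of untilconverged: iterate step until successive
// values differ by less than tol (toy signature, not the OptiML API).
object UntilConverged {
  def untilconverged(init: Double, tol: Double, maxIter: Int = 1000)
                    (step: Double => Double): Double = {
    var cur = init
    var iters = 0
    var delta = Double.MaxValue
    while (delta > tol && iters < maxIter) {
      val next = step(cur)
      delta = math.abs(next - cur)
      cur = next
      iters += 1
    }
    cur
  }
}
```

For example, Newton's iteration for the square root of 2 is `untilconverged(1.0, 1e-12)(x => (x + 2.0 / x) / 2)`.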
untilconverged(mu, tol) { mu =>
  // calculate distances to current centroids
  val c = (0::m) { i =>
    val allDistances = mu mapRows { centroid =>
      dist(x(i), centroid)
    }
    allDistances.minIndex
  }

  // move each cluster centroid to the
  // mean of the points assigned to it
  val newMu = (0::k,*) { i =>
    val (weightedpoints, points) = sum(0,m) { j =>
      if (c(j) == i) (x(j), 1)
    }
    val d = if (points == 0) 1 else points
    weightedpoints / d
  }
  newMu
}

Slide annotation: fused
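The cluster-assignment step above has a direct sequential analogue in ordinary Scala. The helper names (`dist`, `assign`) are invented for this sketch; unlike the DSL version, nothing here runs in parallel.

```scala
// Sequential analogue of the (0::m){ i => ... } assignment step:
// result(i) = index of the centroid nearest to point x(i).
object AssignClusters {
  // Euclidean distance between two points
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)

  def assign(x: Array[Array[Double]], mu: Array[Array[Double]]): Array[Int] =
    x.map { xi =>
      val allDistances = mu.map(centroid => dist(xi, centroid))
      allDistances.indexOf(allDistances.min)
    }
}
```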
- Data querying of in-memory collections, inspired by LINQ
- SQL-like declarative language
- Uses high-level semantic knowledge to implement a query optimizer

// lineItems: Iterable[LineItem]
// Similar to Q1 of the TPCH benchmark
val q = lineItems Where(_.l_shipdate <= Date("19981201")).
  GroupBy(l => l.l_linestatus).
  Select(g => new Result {
    val lineStatus = g.key
    val sumQty = g.Sum(_.l_quantity)
    val sumDiscountedPrice = g.Sum(r => r.l_extendedprice * (1.0 - r.l_discount))
    val avgPrice = g.Average(_.l_extendedprice)
    val countOrder = g.Count
  }) OrderBy(_.returnFlag) ThenBy(_.lineStatus)

Slide annotations: hoisted, fused
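For readers unfamiliar with LINQ-style operators, the same aggregation can be written against ordinary Scala collections. This is an illustrative sketch only: the case class and field names are invented, and none of OptiQL's optimizations (hoisting, fusion) apply here.

```scala
// Plain-collections version of the TPCH-Q1-style query above.
object PlainScalaQ1 {
  case class LineItem(lineStatus: String, quantity: Double,
                      extendedPrice: Double, discount: Double)

  // group by line status, then aggregate each group
  def query(lineItems: Seq[LineItem]): Seq[(String, Double, Double, Double)] =
    lineItems
      .groupBy(_.lineStatus)
      .map { case (status, g) =>
        (status,
         g.map(_.quantity).sum,                                // sumQty
         g.map(r => r.extendedPrice * (1.0 - r.discount)).sum, // sumDiscountedPrice
         g.map(_.extendedPrice).sum / g.size)                  // avgPrice
      }
      .toSeq
      .sortBy(_._1)                                            // order by status
}
```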
- A DSL for large-scale graph analysis based on Green-Marl
- Directed and undirected graphs, nodes, edges
- Collections for node/edge storage: set, sequence, order
- Deferred assignment and parallel reductions with bulk synchronous consistency

Green-Marl: A DSL for Easy and Efficient Graph Analysis (Hong et al.), ASPLOS '12

for (t <- G.Nodes) {
  val rank = ((1.0 - d) / N) + d * Sum(t.InNbrs){ w => PR(w) / w.OutDegree }
  PR <= (t, rank)
  diff += Math.abs(rank - PR(t))
}

The iteration is implicitly parallel; PR <= (t, rank) is a deferred assignment and diff += is a scalar reduction. Writes become visible after the loop completes.
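Operationally, deferred assignment amounts to double buffering: new ranks go to a separate array and are published only after the loop. A plain-Scala sketch of one PageRank step under that reading (the array-based graph representation is an assumption for illustration, not OptiGraph's):

```scala
// One PageRank iteration with deferred writes: reads see the old pr
// array throughout; the new array is "published" only on return.
object PageRankStep {
  def step(inNbrs: Array[Array[Int]],   // inNbrs(t): nodes linking to t
           outDegree: Array[Int],
           pr: Array[Double],
           d: Double): (Array[Double], Double) = {
    val n = pr.length
    val next = new Array[Double](n)
    var diff = 0.0
    for (t <- 0 until n) {
      val rank = (1.0 - d) / n + d * inNbrs(t).map(w => pr(w) / outDegree(w)).sum
      next(t) = rank                    // deferred: pr itself is never mutated
      diff += math.abs(rank - pr(t))    // scalar reduction
    }
    (next, diff)
  }
}
```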
- A port of a subset of Scala collections to a staged Delite DSL
- Demonstrates the benefits of high-level optimization and code generation

// Reverse web-link benchmark in OptiCollections
val sourcedests = pagelinks flatMap { l =>
  val sd = l.split(":")
  val source = Long.parseLong(sd(0))
  val dests = sd(1).trim.split(" ")
  dests.map(d => (Integer.parseInt(d), source))
}
val inverted = sourcedests groupBy (x => x._1)

(Tuples are encoded as longs in the back-end.)
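The benchmark above is nearly valid unstaged Scala as well. A runnable plain-collections version for comparison (illustrative only; it has none of OptiCollections' fusion or long-encoding of tuples):

```scala
// Plain-Scala reverse web-link: map each "source: d1 d2 ..." line to
// (dest, source) pairs, then group by destination.
object ReverseWebLink {
  def invert(pagelinks: Seq[String]): Map[Int, Seq[(Int, Long)]] = {
    val sourcedests = pagelinks.flatMap { l =>
      val sd = l.split(":")
      val source = sd(0).trim.toLong
      val dests = sd(1).trim.split(" ")
      dests.map(d => (d.toInt, source)).toSeq
    }
    sourcedests.groupBy(_._1)
  }
}
```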
Program at a high level, get high performance.

Generated Scala:

def apply(x388: Int, x423: Int, x389: Int,
          x419: Array[Double], x431: Int,
          x433: Array[Double]) {
  val x418 = x413 * x389
  val x912_zero = { 0 }
  val x912_zero_2 = { 1.7976931348623157E308 }
  var x912 = x912_zero
  var x912_2 = x912_zero_2
  var x425 = 0
  while (x425 < x423) {
    val x430 = x425 * 1
    val x432 = x430 * x431
    val x916_zero = { 0.0 }
    . . .

Generated CUDA:

__device__ int dev_collect_x478_x478(int x423, int x389,
    DeliteArray<double> x419, int x431,
    DeliteArray<double> x433, int x413) {
  int x418 = x413 * x389;
  int x919 = 0;
  double x919_2 = 1.7976931348623157E308;
  int x425 = 0;
  while (x425 < x423) {
    int x430 = x425 * 1;
    int x432 = x430 * x431;
    double x923 = 0.0;
    int x450 = 0;
    . . .
[Performance charts (normalized execution time, 1-8 CPU threads, plus GPU where noted):
- TPCH-Q1 and TPCH-Q2: OptiQL vs. LINQ
- Template Matching: OptiML
- k-means: OptiML vs. C++ (1-8 CPUs and GPU)
- PageRank (100k nodes x 800k edges; 8M nodes x 64M edges): OptiGraph vs. Green-Marl
- Reverse web-link benchmark (75 MB and 463 MB inputs): OptiCollections vs. Scala Parallel Collections]
How do I build my own Delite DSL?
[Architecture diagram: Domain-specific languages (Machine Learning/OptiML, Data Analytics/OptiQL, Graph Analysis/OptiGraph, Physics/OptiMesh) are embedded in Scala and expressed as parallel patterns via modular staging. The Delite Compiler performs static optimizations and heterogeneous code generation; the Delite Runtime performs walk-time optimizations and locality-aware scheduling on heterogeneous hardware (SMP, GPU).]
1. Types: abstract, front-end
2. Operations: language operators and methods available on types; represented by IR nodes
3. Data structures: platform-specific concrete implementations, back-end
4. Code generators: Scala traits that define how to emit code as strings for various IR nodes and platforms
5. Analyses and optimizations (optional): IR rewriting via pattern matching, traversals/transformations (e.g. fusion)
abstract class Vector[T] extends DeliteCollection[T]
abstract class Matrix[T] extends DeliteCollection[T]
abstract class Image[T] extends Matrix[T]
These are placeholders for static type checking and method dispatch; they are not bound to any implementation.
trait VectorOps {
  // add an infix + operator to Rep[Vector[A]]
  def infix_+(lhs: Rep[Vector[A]], rhs: Rep[Vector[A]]) = vector_plus(lhs, rhs)

  // abstract; applications cannot inspect what happens
  // when these methods are called
  def vector_length(lhs: Rep[Vector[A]]): Rep[Int]
  def vector_plus(lhs: Rep[Vector[A]], rhs: Rep[Vector[A]]): Rep[Vector[A]]
}

(Vector here is the same abstract Vector we defined earlier.)
trait VectorOpsExp extends VectorOps with Expressions {
  // a Delite parallel op IR node
  case class VectorPlus(inA: Exp[Vector[A]], inB: Exp[Vector[A]])
    extends DeliteOpZipWith[Vector[A], Vector[A], Vector[A]] {

    // number of elements in the input collections
    def size = inA.length
    // the output collection
    def alloc = Vector[A](inA.length)
    // the ZipWith function
    def func = (a, b) => a + b
  }

  // construct IR nodes
  def vector_plus(lhs: Exp[Vector[A]], rhs: Exp[Vector[A]]) = VectorPlus(lhs, rhs)
}
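The contract behind the ZipWith pattern is simple enough to state in unstaged Scala. The helper below is an illustrative sketch of the pattern itself, not the Delite op:

```scala
// zipWith: pairwise combine two equally sized collections with func;
// each output element is independent, which is what makes the op parallelizable.
object ZipWithSketch {
  def zipWith[A, B, C](inA: Seq[A], inB: Seq[B])(func: (A, B) => C): Seq[C] = {
    require(inA.size == inB.size, "inputs must have the same size")
    inA.zip(inB).map { case (a, b) => func(a, b) }
  }
}
```

In VectorPlus above, `size` is the shared input length, `alloc` builds the output collection, and `func` is the pairwise combiner `(a, b) => a + b`.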
// a concrete, back-end Scala data structure;
// will be instantiated by generated code
class Vector[T](__length: Int) {
  var _length = __length
  var _data: Array[T] = new Array[T](_length)
}

// corresponding data structures for other back-ends
// (CUDA, OpenCL, etc.)
// . . .
trait ScalaGenVectorOps extends ScalaGen {
  val IR: VectorOpsExp
  import IR._

  override def emitNode(sym: Sym[Any], rhs: Def[Any])
                       (implicit stream: PrintWriter) =
    // generate code for particular IR nodes
    rhs match {
      case v@VectorNew(length) =>
        emitValDef(sym, "new " + remap("Vector") + "(" + quote(length) + ")")
      case VectorLength(x) =>
        emitValDef(sym, quote(x) + "._length")
      case _ => super.emitNode(sym, rhs)
    }
}

(_length is the exact back-end field name we defined earlier.)
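A self-contained toy version of this emitNode pattern shows the match-and-emit-strings structure; the IR node classes here are invented for illustration, not the Delite ScalaGen API:

```scala
// Toy code generator: pattern match on IR node types and emit
// Scala source as strings.
object TinyCodeGen {
  sealed trait Def
  case class VectorNew(length: Int) extends Def
  case class VectorLength(vec: String) extends Def

  def emitNode(sym: String, rhs: Def): String = rhs match {
    case VectorNew(length) => s"val $sym = new Vector($length)"
    case VectorLength(v)   => s"val $sym = $v._length"
  }
}
```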
override def matrix_plus[A:Manifest:Arith]
    (x: Exp[Matrix[A]], y: Exp[Matrix[A]]) =
  (x, y) match {
    // (AB + AD) == A(B + D)
    case (Def(MatrixTimes(a, b)), Def(MatrixTimes(c, d))) if (a == c) =>
      // return optimized version
      matrix_times(a, matrix_plus(b, d))
    // other rewrites
    // case . . .
    case _ => super.matrix_plus(x, y)
  }
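This rewrite can be demonstrated on a toy IR: a smart constructor inspects its arguments' definitions and factors out the common multiplicand. The case classes below are invented for illustration, not Delite's Def/Exp machinery:

```scala
// Smart-constructor rewrite: plus(a*b, a*d) => a*(b+d).
object RewriteSketch {
  sealed trait Exp
  case class Sym(name: String) extends Exp
  case class Plus(a: Exp, b: Exp) extends Exp
  case class Times(a: Exp, b: Exp) extends Exp

  def plus(x: Exp, y: Exp): Exp = (x, y) match {
    // (A*B + A*D) == A*(B + D): one multiply instead of two
    case (Times(a, b), Times(c, d)) if a == c => Times(a, plus(b, d))
    case _ => Plus(x, y)
  }
}
```

Because IR nodes are ordinary case classes, the pattern match and the equality guard `a == c` come for free, which is what makes these algebraic rewrites so compact.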
trait OptiML extends OptiMLScalaOpsPkg
  with VectorOps with MatrixOps with ...

trait OptiMLExp extends OptiMLScalaOpsPkgExp
  with VectorOpsExp with MatrixOpsExp with ...

trait OptiMLCodeGenScala extends OptiMLScalaCodeGenPkg
  with ScalaGenVectorOps with ScalaGenMatrixOps with ...

trait OptiMLCodeGenCuda extends OptiMLCudaCodeGenPkg
  with CudaGenVectorOps with CudaGenMatrixOps with ...
- Delite DSLs target high-performance architectures from Scala
- Open source: use them to accelerate your apps or build your own!
  http://github.com/stanford-ppl/Delite
- Mailing list: http://groups.google.com/group/delite-devel

Thank you