Transcript of Arvind Sujeeth's ScalaDays 2012 talk
Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun
Stanford University Pervasive Parallelism Laboratory (PPL)
Tiark Rompf, Aleksandar Prokopec, Vojin Jovanovic, Philipp Haller, Martin Odersky
École Polytechnique Fédérale de Lausanne (EPFL), Programming Methods Laboratory (LAMP)
Squeryl
DSLs can be used for high performance, too
Applications: Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering

Heterogeneous hardware: Cray Jaguar, Sun T2, Nvidia Fermi, Altera FPGA
Low-level programming models: MPI, PGAS, Pthreads, OpenMP, CUDA, OpenCL, Verilog, VHDL

Too many different programming models; DSLs can sit between the applications and the heterogeneous hardware.
- Tiark Rompf's talk yesterday. In case you missed it:
  - Techniques for rewriting high-level programs to high-performance programs
  - Build an intermediate representation (IR) of Scala programs at runtime
  - The IR can be optimized and code generated
- Introduction to existing Delite DSLs
- Constructing your own Delite DSL
- Not covered (under the covers): implementation details about the Delite framework
  - See http://cgo2012.hyperdsls.org/
- Syntax is legal Scala
- Staged to build an IR (metaprogramming)
- Optimized at a high level
- Compiled to different low-level target architectures

[Diagram: expression IR, e.g. the tree for (A * B) + (A * C)]
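The staging idea above can be sketched in a few lines of plain Scala: operations on values build IR nodes instead of computing results, and a separate pass consumes the IR. The toy types below (`Exp`, `Const`, `Plus`, `Times`, `eval`) are invented for illustration; they are not the actual LMS/Delite API.

```scala
// Toy staging sketch: arithmetic on Exp values builds an IR tree.
object TinyStaging {
  sealed trait Exp
  case class Const(v: Int) extends Exp
  case class Plus(a: Exp, b: Exp) extends Exp
  case class Times(a: Exp, b: Exp) extends Exp

  // make + and * available on Exp, mirroring how a DSL adds infix operators
  implicit class Ops(a: Exp) {
    def +(b: Exp): Exp = Plus(a, b)
    def *(b: Exp): Exp = Times(a, b)
  }

  // walk the IR; a real code generator would emit code strings here instead
  def eval(e: Exp): Int = e match {
    case Const(v)    => v
    case Plus(a, b)  => eval(a) + eval(b)
    case Times(a, b) => eval(a) * eval(b)
  }
}
```

Writing `a * b + a * c` over these `Exp` values produces exactly the `(A * B) + (A * C)` tree shown on the slide, which an optimizer can then rewrite before any computation happens.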
- OptiML (Machine Learning)
- OptiQL (Data querying)
- OptiGraph (Large-scale graph analysis)
- OptiCollections (Scala collections)
- OptiMesh (Mesh-based PDE solvers)

Coming soon:
- OptiSDR (Software-defined radio)
- OptiCVX (Convex optimization)
- Provides a familiar (MATLAB-like) language and API for writing ML applications
  - Ex. val c = a * b  (a, b are Matrix[Double])
- Implicitly parallel data structures
  - Base types: Vector[T], Matrix[T], Graph[V,E], Stream[T]
  - Subtypes: TrainingSet, IndexVector, Image, …
- Implicitly parallel control structures
  - sum{…}, (0::end) {…}, gradient { … }, untilconverged { … }
  - Arguments to control structures are anonymous functions with restricted semantics
OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning, ICML 2011
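For intuition, untilconverged can be read as ordinary fixpoint iteration. Below is a plain, unstaged Scala sketch with an invented signature; the real OptiML construct is staged and its closure argument has restricted semantics.

```scala
// Unstaged sketch of untilconverged: iterate step until successive
// values differ by less than tol (toy signature, not the OptiML API).
object UntilConverged {
  def untilconverged(init: Double, tol: Double, maxIter: Int = 1000)
                    (step: Double => Double): Double = {
    var cur = init
    var iters = 0
    var delta = Double.MaxValue
    while (delta > tol && iters < maxIter) {
      val next = step(cur)
      delta = math.abs(next - cur)
      cur = next
      iters += 1
    }
    cur
  }
}
```

For example, Newton's iteration for the square root of 2 is `untilconverged(1.0, 1e-12)(x => (x + 2.0 / x) / 2)`.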
untilconverged(mu, tol) { mu =>
  // calculate distances to current centroids
  val c = (0::m) { i =>
    val allDistances = mu mapRows { centroid =>
      dist(x(i), centroid)
    }
    allDistances.minIndex
  }

  // move each cluster centroid to the
  // mean of the points assigned to it
  val newMu = (0::k,*) { i =>
    val (weightedpoints, points) = sum(0,m) { j =>
      if (c(j) == i) (x(j), 1)
    }
    val d = if (points == 0) 1 else points
    weightedpoints / d
  }
  newMu
}

Slide annotation: fused
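The cluster-assignment step above has a direct sequential analogue in ordinary Scala. The helper names (`dist`, `assign`) are invented for this sketch; unlike the DSL version, nothing here runs in parallel.

```scala
// Sequential analogue of the (0::m){ i => ... } assignment step:
// result(i) = index of the centroid nearest to point x(i).
object AssignClusters {
  // Euclidean distance between two points
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)

  def assign(x: Array[Array[Double]], mu: Array[Array[Double]]): Array[Int] =
    x.map { xi =>
      val allDistances = mu.map(centroid => dist(xi, centroid))
      allDistances.indexOf(allDistances.min)
    }
}
```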
- Data querying of in-memory collections, inspired by LINQ
- SQL-like declarative language
- Uses high-level semantic knowledge to implement a query optimizer

// lineItems: Iterable[LineItem]
// Similar to Q1 of the TPCH benchmark
val q = lineItems Where(_.l_shipdate <= Date("19981201")).
  GroupBy(l => l.l_linestatus).
  Select(g => new Result {
    val lineStatus = g.key
    val sumQty = g.Sum(_.l_quantity)
    val sumDiscountedPrice = g.Sum(r => r.l_extendedprice * (1.0 - r.l_discount))
    val avgPrice = g.Average(_.l_extendedprice)
    val countOrder = g.Count
  }) OrderBy(_.returnFlag) ThenBy(_.lineStatus)

Slide annotations: hoisted, fused
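For readers unfamiliar with LINQ-style operators, the same aggregation can be written against ordinary Scala collections. This is an illustrative sketch only: the case class and field names are invented, and none of OptiQL's optimizations (hoisting, fusion) apply here.

```scala
// Plain-collections version of the TPCH-Q1-style query above.
object PlainScalaQ1 {
  case class LineItem(lineStatus: String, quantity: Double,
                      extendedPrice: Double, discount: Double)

  // group by line status, then aggregate each group
  def query(lineItems: Seq[LineItem]): Seq[(String, Double, Double, Double)] =
    lineItems
      .groupBy(_.lineStatus)
      .map { case (status, g) =>
        (status,
         g.map(_.quantity).sum,                                // sumQty
         g.map(r => r.extendedPrice * (1.0 - r.discount)).sum, // sumDiscountedPrice
         g.map(_.extendedPrice).sum / g.size)                  // avgPrice
      }
      .toSeq
      .sortBy(_._1)                                            // order by status
}
```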
- A DSL for large-scale graph analysis based on Green-Marl
- Directed and undirected graphs, nodes, edges
- Collections for node/edge storage: set, sequence, order
- Deferred assignment and parallel reductions with bulk synchronous consistency

Green-Marl: A DSL for Easy and Efficient Graph Analysis (Hong et al.), ASPLOS '12

for (t <- G.Nodes) {
  val rank = ((1.0 - d) / N) + d * Sum(t.InNbrs){ w => PR(w) / w.OutDegree }
  PR <= (t, rank)
  diff += Math.abs(rank - PR(t))
}

The iteration is implicitly parallel; PR <= (t, rank) is a deferred assignment and diff += is a scalar reduction. Writes become visible after the loop completes.
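Operationally, deferred assignment amounts to double buffering: new ranks go to a separate array and are published only after the loop. A plain-Scala sketch of one PageRank step under that reading (the array-based graph representation is an assumption for illustration, not OptiGraph's):

```scala
// One PageRank iteration with deferred writes: reads see the old pr
// array throughout; the new array is "published" only on return.
object PageRankStep {
  def step(inNbrs: Array[Array[Int]],   // inNbrs(t): nodes linking to t
           outDegree: Array[Int],
           pr: Array[Double],
           d: Double): (Array[Double], Double) = {
    val n = pr.length
    val next = new Array[Double](n)
    var diff = 0.0
    for (t <- 0 until n) {
      val rank = (1.0 - d) / n + d * inNbrs(t).map(w => pr(w) / outDegree(w)).sum
      next(t) = rank                    // deferred: pr itself is never mutated
      diff += math.abs(rank - pr(t))    // scalar reduction
    }
    (next, diff)
  }
}
```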
- A port of a subset of Scala collections to a staged Delite DSL
- Demonstrates the benefits of high-level optimization and code generation

// Reverse web-link benchmark in OptiCollections
val sourcedests = pagelinks flatMap { l =>
  val sd = l.split(":")
  val source = Long.parseLong(sd(0))
  val dests = sd(1).trim.split(" ")
  dests.map(d => (Integer.parseInt(d), source))
}
val inverted = sourcedests groupBy (x => x._1)

(Tuples are encoded as longs in the back-end.)
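The benchmark above is nearly valid unstaged Scala as well. A runnable plain-collections version for comparison (illustrative only; it has none of OptiCollections' fusion or long-encoding of tuples):

```scala
// Plain-Scala reverse web-link: map each "source: d1 d2 ..." line to
// (dest, source) pairs, then group by destination.
object ReverseWebLink {
  def invert(pagelinks: Seq[String]): Map[Int, Seq[(Int, Long)]] = {
    val sourcedests = pagelinks.flatMap { l =>
      val sd = l.split(":")
      val source = sd(0).trim.toLong
      val dests = sd(1).trim.split(" ")
      dests.map(d => (d.toInt, source)).toSeq
    }
    sourcedests.groupBy(_._1)
  }
}
```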
Program at a high level, get high performance.

Generated Scala:

def apply(x388: Int, x423: Int, x389: Int,
          x419: Array[Double], x431: Int,
          x433: Array[Double]) {
  val x418 = x413 * x389
  val x912_zero = { 0 }
  val x912_zero_2 = { 1.7976931348623157E308 }
  var x912 = x912_zero
  var x912_2 = x912_zero_2
  var x425 = 0
  while (x425 < x423) {
    val x430 = x425 * 1
    val x432 = x430 * x431
    val x916_zero = { 0.0 }
    . . .

Generated CUDA:

__device__ int dev_collect_x478_x478(int x423, int x389,
    DeliteArray<double> x419, int x431,
    DeliteArray<double> x433, int x413) {
  int x418 = x413 * x389;
  int x919 = 0;
  double x919_2 = 1.7976931348623157E308;
  int x425 = 0;
  while (x425 < x423) {
    int x430 = x425 * 1;
    int x432 = x430 * x431;
    double x923 = 0.0;
    int x450 = 0;
    . . .
[Performance charts (normalized execution time, 1-8 CPU threads, plus GPU where noted):
- TPCH-Q1 and TPCH-Q2: OptiQL vs. LINQ
- Template Matching: OptiML
- k-means: OptiML vs. C++ (1-8 CPUs and GPU)
- PageRank (100k nodes x 800k edges; 8M nodes x 64M edges): OptiGraph vs. Green-Marl
- Reverse web-link benchmark (75 MB and 463 MB inputs): OptiCollections vs. Scala Parallel Collections]
How do I build my own Delite DSL?
[Architecture diagram: Domain-specific languages (Machine Learning/OptiML, Data Analytics/OptiQL, Graph Analysis/OptiGraph, Physics/OptiMesh) are embedded in Scala and expressed as parallel patterns via modular staging. The Delite Compiler performs static optimizations and heterogeneous code generation; the Delite Runtime performs walk-time optimizations and locality-aware scheduling on heterogeneous hardware (SMP, GPU).]
1. Types: abstract, front-end
2. Operations: language operators and methods available on types; represented by IR nodes
3. Data structures: platform-specific concrete implementations, back-end
4. Code generators: Scala traits that define how to emit code as strings for various IR nodes and platforms
5. Analyses and optimizations (optional): IR rewriting via pattern matching, traversals/transformations (e.g. fusion)
abstract class Vector[T] extends DeliteCollection[T]
abstract class Matrix[T] extends DeliteCollection[T]
abstract class Image[T] extends Matrix[T]
These are placeholders for static type checking and method dispatch; they are not bound to any implementation.
trait VectorOps {
  // add an infix + operator to Rep[Vector[A]]
  def infix_+(lhs: Rep[Vector[A]], rhs: Rep[Vector[A]]) = vector_plus(lhs, rhs)

  // abstract; applications cannot inspect what happens
  // when these methods are called
  def vector_length(lhs: Rep[Vector[A]]): Rep[Int]
  def vector_plus(lhs: Rep[Vector[A]], rhs: Rep[Vector[A]]): Rep[Vector[A]]
}

(Vector here is the same abstract Vector we defined earlier.)
trait VectorOpsExp extends VectorOps with Expressions {
  // a Delite parallel op IR node
  case class VectorPlus(inA: Exp[Vector[A]], inB: Exp[Vector[A]])
    extends DeliteOpZipWith[Vector[A], Vector[A], Vector[A]] {

    // number of elements in the input collections
    def size = inA.length
    // the output collection
    def alloc = Vector[A](inA.length)
    // the ZipWith function
    def func = (a, b) => a + b
  }

  // construct IR nodes
  def vector_plus(lhs: Exp[Vector[A]], rhs: Exp[Vector[A]]) = VectorPlus(lhs, rhs)
}
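The contract behind the ZipWith pattern is simple enough to state in unstaged Scala. The helper below is an illustrative sketch of the pattern itself, not the Delite op:

```scala
// zipWith: pairwise combine two equally sized collections with func;
// each output element is independent, which is what makes the op parallelizable.
object ZipWithSketch {
  def zipWith[A, B, C](inA: Seq[A], inB: Seq[B])(func: (A, B) => C): Seq[C] = {
    require(inA.size == inB.size, "inputs must have the same size")
    inA.zip(inB).map { case (a, b) => func(a, b) }
  }
}
```

In VectorPlus above, `size` is the shared input length, `alloc` builds the output collection, and `func` is the pairwise combiner `(a, b) => a + b`.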
// a concrete, back-end Scala data structure;
// will be instantiated by generated code
class Vector[T](__length: Int) {
  var _length = __length
  var _data: Array[T] = new Array[T](_length)
}

// corresponding data structures for other back-ends
// (CUDA, OpenCL, etc.)
// . . .
trait ScalaGenVectorOps extends ScalaGen {
  val IR: VectorOpsExp
  import IR._

  override def emitNode(sym: Sym[Any], rhs: Def[Any])
                       (implicit stream: PrintWriter) =
    // generate code for particular IR nodes
    rhs match {
      case v@VectorNew(length) =>
        emitValDef(sym, "new " + remap("Vector") + "(" + quote(length) + ")")
      case VectorLength(x) =>
        emitValDef(sym, quote(x) + "._length")
      case _ => super.emitNode(sym, rhs)
    }
}

(_length is the exact back-end field name we defined earlier.)
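A self-contained toy version of this emitNode pattern shows the match-and-emit-strings structure; the IR node classes here are invented for illustration, not the Delite ScalaGen API:

```scala
// Toy code generator: pattern match on IR node types and emit
// Scala source as strings.
object TinyCodeGen {
  sealed trait Def
  case class VectorNew(length: Int) extends Def
  case class VectorLength(vec: String) extends Def

  def emitNode(sym: String, rhs: Def): String = rhs match {
    case VectorNew(length) => s"val $sym = new Vector($length)"
    case VectorLength(v)   => s"val $sym = $v._length"
  }
}
```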
override def matrix_plus[A:Manifest:Arith]
    (x: Exp[Matrix[A]], y: Exp[Matrix[A]]) =
  (x, y) match {
    // (AB + AD) == A(B + D)
    case (Def(MatrixTimes(a, b)), Def(MatrixTimes(c, d))) if (a == c) =>
      // return optimized version
      matrix_times(a, matrix_plus(b, d))
    // other rewrites
    // case . . .
    case _ => super.matrix_plus(x, y)
  }
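This rewrite can be demonstrated on a toy IR: a smart constructor inspects its arguments' definitions and factors out the common multiplicand. The case classes below are invented for illustration, not Delite's Def/Exp machinery:

```scala
// Smart-constructor rewrite: plus(a*b, a*d) => a*(b+d).
object RewriteSketch {
  sealed trait Exp
  case class Sym(name: String) extends Exp
  case class Plus(a: Exp, b: Exp) extends Exp
  case class Times(a: Exp, b: Exp) extends Exp

  def plus(x: Exp, y: Exp): Exp = (x, y) match {
    // (A*B + A*D) == A*(B + D): one multiply instead of two
    case (Times(a, b), Times(c, d)) if a == c => Times(a, plus(b, d))
    case _ => Plus(x, y)
  }
}
```

Because IR nodes are ordinary case classes, the pattern match and the equality guard `a == c` come for free, which is what makes these algebraic rewrites so compact.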
trait OptiML extends OptiMLScalaOpsPkg
  with VectorOps with MatrixOps with ...

trait OptiMLExp extends OptiMLScalaOpsPkgExp
  with VectorOpsExp with MatrixOpsExp with ...

trait OptiMLCodeGenScala extends OptiMLScalaCodeGenPkg
  with ScalaGenVectorOps with ScalaGenMatrixOps with ...

trait OptiMLCodeGenCuda extends OptiMLCudaCodeGenPkg
  with CudaGenVectorOps with CudaGenMatrixOps with ...
- Delite DSLs target high-performance architectures from Scala
- Open source: use them to accelerate your apps or build your own!
  http://github.com/stanford-ppl/Delite
- Mailing list: http://groups.google.com/group/delite-devel

Thank you