Arvindsujeeth scaladays12

34
Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun Stanford University Pervasive Parallelism Laboratory (PPL) Tiark Rompf, Aleksandar Prokopec, Vojin Jovanovic, Philipp Haller, Martin Odersky Ecole Polytechnique Federale de Lausanne (EPFL) Programming Methods Laboratory (LAMP)

Transcript of Arvindsujeeth scaladays12

Page 1: Arvindsujeeth scaladays12

Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun

Stanford University Pervasive Parallelism Laboratory (PPL)

Tiark Rompf, Aleksandar Prokopec, Vojin Jovanovic, Philipp Haller, Martin Odersky

Ecole Polytechnique Federale de Lausanne (EPFL) Programming Methods Laboratory (LAMP)

Page 2: Arvindsujeeth scaladays12

Squeryl

Page 3: Arvindsujeeth scaladays12

DSLs can be used for high performance, too

Page 4: Arvindsujeeth scaladays12

Cray Jaguar

Sun T2

Nvidia Fermi

Altera FPGA

MPI PGAS

Pthreads OpenMP

CUDA OpenCL

Verilog VHDL

Page 5: Arvindsujeeth scaladays12

Cray Jaguar

Sun T2

Nvidia Fermi

Altera FPGA

MPI PGAS

Pthreads OpenMP

CUDA OpenCL

Verilog VHDL

Virtual Worlds

Personal Robotics

Data Informatics

Scientific Engineering

Applications

Page 6: Arvindsujeeth scaladays12

Too many different programming models

Cray Jaguar

Sun T2

Nvidia Fermi

Altera FPGA

MPI PGAS

Pthreads OpenMP

CUDA OpenCL

Verilog VHDL

Virtual Worlds

Personal Robotics

Data Informatics

Scientific Engineering

Applications

DSLs

Page 7: Arvindsujeeth scaladays12

n  Tiark Rompf’s talk yesterday

n  In case you missed it: n  Techniques for rewriting high-level

programs to high-performance programs

n  Build an intermediate representation (IR) of Scala programs at runtime

n  IR can be optimized and code generated

Page 8: Arvindsujeeth scaladays12

n  Introduction to existing Delite DSLs

n Constructing your own Delite DSL

n Not covered – under the covers: n  Implementation details about the Delite

framework

n  See http://cgo2012.hyperdsls.org/

Page 9: Arvindsujeeth scaladays12

n  Syntax is legal Scala

n  Staged to build an IR (metaprogramming)

n Optimized at a high level

n Compiled to different low-level target architectures

A B A C

* *

+

Page 10: Arvindsujeeth scaladays12

n  OptiML (Machine Learning) n  OptiQL (Data querying) n  OptiGraph (Large-scale graph analysis) n  OptiCollections (Scala collections) n  OptiMesh (Mesh-based PDE solvers)

n  OptiSDR (Software-defined radio) n  OptiCVX (Convex optimization)

Coming soon:

Page 11: Arvindsujeeth scaladays12

n  Provides a familiar (MATLAB-like) language and API for writing ML applications n  Ex. val  c  =  a  *  b  (a, b are Matrix[Double])

n  Implicitly parallel data structures n  Base types: Vector[T], Matrix[T], Graph[V,E], Stream[T] n  Subtypes: TrainingSet, IndexVector, Image, …

n  Implicitly parallel control structures n  sum{…}, (0::end) {…}, gradient { … }, untilconverged { … } n  Arguments to control structures are anonymous functions with

restricted semantics

OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning, ICML 2011

Page 12: Arvindsujeeth scaladays12

untilconverged(mu,  tol){  mu  =>          //  calculate  distances  to  current  centroids  

       //  move  each  cluster  centroid  to  the          //  mean  of  the  points  assigned  to  it  

}  

Page 13: Arvindsujeeth scaladays12

untilconverged(mu,  tol){  mu  =>          //  calculate  distances  to  current  centroids          val  c  =  (0::m){i  =>                val  allDistances  =  mu  mapRows  {  centroid  =>                      dist(x(i),  centroid)                }                allDistances.minIndex          }  

       //  move  each  cluster  centroid  to  the          //  mean  of  the  points  assigned  to  it  

}  

Page 14: Arvindsujeeth scaladays12

fused

untilconverged(mu,  tol){  mu  =>          //  calculate  distances  to  current  centroids          val  c  =  (0::m){i  =>                val  allDistances  =  mu  mapRows  {  centroid  =>                      dist(x(i),  centroid)                              }                allDistances.minIndex          }  

       //  move  each  cluster  centroid  to  the          //  mean  of  the  points  assigned  to  it          val  newMu  =  (0::k,*){  i  =>                val  (weightedpoints,  points)  =  sum(0,m)  {  j  =>                      if  (c(i)  ==  j)  (x(i),1)                }                val  d  =  if  (points  ==  0)  1  else  points                  weightedpoints  /  d          }          newMu  }  

Page 15: Arvindsujeeth scaladays12

n Data querying of in-memory collections n  inspired by LINQ

n  SQL-like declarative language

n Use high-level semantic knowledge to implement query optimizer

Page 16: Arvindsujeeth scaladays12

//  lineItems:  Iterable[LineItem]  //  Similar  to  Q1  of  the  TPCH  benchmark  val  q  =  lineItems  Where(_.l_shipdate  <=  Date(‘‘19981201’’)).      GroupBy(l  =>  (l.l_linestatus)).      Select(g  =>  new  Result  {          val  lineStatus  =  g.key          val  sumQty  =  g.Sum(_.l_quantity)          val  sumDiscountedPrice  =              g.Sum(r  =>  r.l_extendedprice*(1.0-­‐r.l_discount))          val  avgPrice  =  g.Average(_.l_extendedprice)          val  countOrder  =  g.Count      })  OrderBy(_.returnFlag)  ThenBy(_.lineStatus)  

hoisted

fused

Page 17: Arvindsujeeth scaladays12

n  A DSL for large-scale graph analysis based on Green-Marl

n  Directed and undirected graphs, nodes, edges

n  Collections for node/edge storage n  Set, sequence, order

n  Deferred assignment and parallel reductions with bulk synchronous consistency

Green-Marl: A DSL for Easy and Efficient Graph Analysis (Hong et. al.), ASPLOS ’12

Page 18: Arvindsujeeth scaladays12

for(t  <-­‐  G.Nodes)  {      val  rank  =  ((1.0  d)/  N)  +                              d  *  Sum(t.InNbrs){w  =>  PR(w)  /  w.OutDegree}      PR  <=  (t,rank)      diff  +=  Math.abs(rank  -­‐  PR(t))  }  

Implicitly parallel iteration

Deferred assignment and scalar reduction

Writes become visible after the loop completes

Page 19: Arvindsujeeth scaladays12

n  A port of a subset of Scala collections to a staged Delite DSL

n  Demonstrates the benefits of high-level optimization and code generation

val  sourcedests  =  pagelinks  flatMap  {  l  =>      val  sd  =  l.split(":")      val  source  =  Long.parseLong(sd(0))      val  dests  =  sd(1).trim.split("  ")      dests.map(d  =>  (Integer.parseInt(d),  source))  }  val  inverted  =  sourcedests  groupBy  (x  =>  x._1)  

Reverse web-link benchmark in OptiCollections

Tuples encoded as longs in back-end

Page 20: Arvindsujeeth scaladays12

Program at a high level Get high performance

Page 21: Arvindsujeeth scaladays12

Scala def  apply(x388:Int,x423:Int,x389:Int,      x419:Array[Double],x431:Int,      x433:Array[Double])  {  val  x418  =  x413  *  x389  val  x912_zero      =  {  0  }  val  x912_zero_2  =  {        1.7976931348623157E308  }  var  x912      =  x912_zero  var  x912_2  =  x912_zero_2  var  x425  =  0  while  (x425  <  x423)  {          val  x430  =  x425  *  1      val  x432  =  x430  *  x431      val  x916_zero  =  {      0.0      }  .  .  .  

CUDA __device__  int  

dev_collect_x478_x478(int  x423,int  x389,DeliteArray<double>  x419,int  x431,DeliteArray<double>  x433,int  x413)  {  

int  x418  =  x413  *  x389;  int  x919  =  0;    double  x919_2  =  1.7976931348623157E308;  int  x425  =  0;  while  (x425  <  x423)  {          int  x430  =  x425  *  1;      int  x432  =  x430  *  x431;      double  x923  =  0.0;      int  x450  =  0;  .  .  .  

Page 22: Arvindsujeeth scaladays12

1.0

2.1

0.52

1.2

0

0.5

1

1.5

2

1P 8P Nor

mal

ized

Exe

cuti

on T

ime

TPCH-Q1

1.0

6.7

0.63

2.3

0

0.4

0.8

1.2

1.6

1P 8P

OptiQL LINQ

TPCH-Q2

0.00 0.20 0.40 0.60 0.80 1.00 1.20 1.40 1.60

1 CPU 2 CPU 4 CPU 8 CPU Nor

mal

ized

Exe

cuti

on T

ime

Template Matching

OptiML

OptiQL

0

0.2

0.4

0.6

0.8

1

1 CPU 2 CPU 4 CPU 8 CPU GPU

k-means

OptiML

C++

1.6

1.9

3.6

5.1

10.6

Page 23: Arvindsujeeth scaladays12

OptiGraph

OptiCollections

1.7 1.7 1.7

0

0.2

0.4

0.6

0.8

1

1P 2P 4P 8P

Nor

mal

ized

Exe

cuti

on

Tim

e

100k nodes x 800k edges

2.1

3.9 4.8

1.3

2.4

3.8 4.3

0

0.2

0.4

0.6

0.8

1

1P 2P 4P 8P

OptiGraph Green Marl

8M nodes x 64M edges

(PageRank)

1.0

2.2

3.8 5.6

0.61

1.2

2.1

3.4

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

1P 2P 4P 8P

OptiCollections

Scala Parallel Collections

463 MB

1.0 1.3 2.0 3.1

0.30

0.52

0.71 0.82

0

0.5

1

1.5

2

2.5

3

3.5

4

1P 2P 4P 8P

75 MB

(Reverse web-link benchmark)

Page 24: Arvindsujeeth scaladays12

How do I build my own Delite DSL?

Page 25: Arvindsujeeth scaladays12

Domain Embedding Language (Scala)

Delite Runtime

Modular Staging

Heterogeneous Hardware

Delite: DSL Infrastructure

Walk-time Optimizations

Delite Compiler

Static Optimizations Heterogeneous Code Generation

Locality-aware Scheduling

Physics (OptiMesh)

Machine Learning (OptiML)

Domain Specific

Languages

SMP GPU

Parallel Patterns

Data Analytics (OptiQL)

Graph Analysis

(OptiGraph)

Page 26: Arvindsujeeth scaladays12

1.  Types n  abstract, front-end

2.  Operations n  language operators and methods available on types;

represented by IR nodes

3.  Data Structures n  platform-specific concrete implementation, back-end

4.  Code Generators n  Scala traits that define how to emit code as strings for

various IR nodes and platforms

5.  Analyses and Optimizations (Optional) n  IR rewriting via pattern matching, traversals/transformations

(e.g. fusion)

Page 27: Arvindsujeeth scaladays12

abstract  class  Vector[T]  extends  DeliteCollection[T]  

abstract  class  Matrix[T]  extends  DeliteCollection[T]  

abstract  class  Image[T]  extends  Matrix[T]  

placeholders for static type checking and method dispatch;

not bound to any implementation

Page 28: Arvindsujeeth scaladays12

trait  VectorOps  {      //  add  an  infix  +  operator  to  Rep[Vector[A]]      def  infix_+(lhs:  Rep[Vector[A]],  rhs:  Rep[Vector[A]])  =          vector_plus(lhs,  rhs)  

   //  abstract,  applications  cannot  inspect  what  happens        //  when  methods  are  called      def  vector_length(lhs:  Rep[Vector[A]]):  Rep[Int]      def  vector_plus(lhs:  Rep[Vector[A]],                                      rhs:  Rep[Vector[A]]):  Rep[Vector[A]]  }  

The same abstract Vector we defined earlier

Page 29: Arvindsujeeth scaladays12

trait  VectorOpsExp  extends  VectorOps  with  Expressions  {  //  a  Delite  parallel  op  IR  node  case  class  VectorPlus(inA:  Exp[Vector[A]],  inB:  Exp[Vector[A]])  

extends  DeliteOpZipWith[Vector[A],  Vector[A],  Vector[A]]  {    //  number  of  elements  in  the  input  collections    def  size  =  inA.length    //  the  output  collection    def  alloc  =  Vector[A](inA.length)    //  the  ZipWith  function    def  func  =  (a,b)  =>  a  +  b  

}  //  construct  IR  nodes  def  vector_plus(lhs:  Exp[Vector[A]],  rhs:  Exp[Vector[A]])    =  VectorPlus(lhs,  rhs)  

}  

Page 30: Arvindsujeeth scaladays12

//  a  concrete,  back-­‐end  Scala  data  structure  //  will  be  instantiated  by  generated  code  class  Vector[T](__length:  Int)  {    var  _length  =  __length    var  _data:  Array[T]  =  new  Array[T](_length)  

}  

//  corresponding  data  structures  for  other  back-­‐ends  //  (CUDA,  OpenCL,  etc.)  //  .  .  .  

Page 31: Arvindsujeeth scaladays12

trait  ScalaGenVectorOps  extends  ScalaGen  {    val  IR:  VectorOpsExp    import  IR._  

 override  def  emitNode(sym:  Sym[Any],  rhs:  Def[Any])    (implicit  stream:  PrintWriter)  =  

     //  generate  code  for  particular  IR  nodes        rhs  match  {            case  v@VectorNew(length)  =>  

                 emitValDef(sym,  “new  "  +  remap("Vector")+"("  +                              quote(length)  +  ")")  

         case  VectorLength(x)  =>                  emitValDef(sym,  quote(x)  +  ".  _length")            case  _  =>  super.emitNode(sym,  rhs)  

       }  }  

The exact back-end field name we defined earlier

Page 32: Arvindsujeeth scaladays12

override  def  matrix_plus[A:Manifest:Arith]    (x:  Exp[Matrix[A]],  y:  Exp[Matrix[A]])  =  

     (x,  y)  match  {                //  (AB  +  AD)  ==  A(B  +  D)                case  (Def(MatrixTimes(a,  b)),                            Def(MatrixTimes(c,  d)))  if  (a  ==  c)  =>                        //  return  optimized  version                    matrix_times(a,  matrix_plus(b,d))  

   //  other  rewrites      //  case  .  .  .  

         case  _  =>  super.matrix_plus(x,  y)        }  

Page 33: Arvindsujeeth scaladays12

trait  OptiML  extends  OptiMLScalaOpsPkg  with  VectorOps  with  MatrixOps  \\  with  ...  

trait  OptiMLExp  extends  OptiMLScalaOpsPkgExp  with  VectorOpsExp  with  MatrixOpsExp  \\  with  ...  

trait  OptiMLCodeGenScala  extends  OptiMLScalaCodeGenPkg  with  ScalaGenVectorOps  with  ScalaGenMatrixOps  \\  with  ...  

trait  OptiMLCodeGenCuda  extends  OptiMLCudaCodeGenPkg  with  CudaGenVectorOps  with  CudaGenMatrixOps  \\  with  ...  

Page 34: Arvindsujeeth scaladays12

n  Delite DSLs target high performance architectures from Scala

n  Open source – use them to accelerate your apps or build your own! n  http://github.com/stanford-ppl/Delite

n  Mailing List: n  http://groups.google.com/group/delite-devel

n  Thank you