Arvind Sujeeth, ScalaDays 2012 (transcript)


Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun

Stanford University Pervasive Parallelism Laboratory (PPL)

Tiark Rompf, Aleksandar Prokopec, Vojin Jovanovic, Philipp Haller, Martin Odersky

École Polytechnique Fédérale de Lausanne (EPFL), Programming Methods Laboratory (LAMP)

DSLs already exist in Scala (e.g. Squeryl); DSLs can be used for high performance, too.

Heterogeneous hardware: Cray Jaguar, Sun T2, Nvidia Fermi, Altera FPGA

Too many different programming models: MPI, PGAS, Pthreads, OpenMP, CUDA, OpenCL, Verilog, VHDL

Applications: Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering

DSLs: a layer between the applications and the heterogeneous hardware targets

- Tiark Rompf's talk yesterday, in case you missed it:
  - Techniques for rewriting high-level programs to high-performance programs
  - Build an intermediate representation (IR) of Scala programs at runtime
  - IR can be optimized and code generated
- Introduction to existing Delite DSLs
- Constructing your own Delite DSL
- Not covered (under the covers): implementation details about the Delite framework
  - See http://cgo2012.hyperdsls.org/

- Syntax is legal Scala
- Staged to build an IR (metaprogramming)
- Optimized at a high level
- Compiled to different low-level target architectures

[Figure: the staged IR for A*B + A*C, a tree with leaves A, B, A, C feeding two * nodes and one + node]
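The staging idea can be sketched in a few lines of plain Scala. This is a toy illustration only: the names `Exp`, `Sym`, `Plus`, `Times`, and `ExpOps` are made up for the sketch, not Delite's actual classes. Arithmetic on symbolic values builds the IR tree above instead of computing numbers.

```scala
// Toy staging sketch: operations on Exp values record an IR tree
// rather than evaluating. (Illustrative names, not Delite's.)
sealed trait Exp
case class Sym(name: String) extends Exp
case class Plus(a: Exp, b: Exp) extends Exp
case class Times(a: Exp, b: Exp) extends Exp

implicit class ExpOps(val lhs: Exp) {
  def +(rhs: Exp): Exp = Plus(lhs, rhs)
  def *(rhs: Exp): Exp = Times(lhs, rhs)
}

val a = Sym("A"); val b = Sym("B"); val c = Sym("C")
// builds Plus(Times(A,B), Times(A,C)) -- the tree in the figure
val ir = a * b + a * c
```

Once the program exists as a tree like `ir`, it can be inspected, rewritten, and fed to code generators, which is exactly what the IR in the real system enables.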

Existing Delite DSLs:

- OptiML (Machine Learning)
- OptiQL (Data querying)
- OptiGraph (Large-scale graph analysis)
- OptiCollections (Scala collections)
- OptiMesh (Mesh-based PDE solvers)

Coming soon:

- OptiSDR (Software-defined radio)
- OptiCVX (Convex optimization)

- Provides a familiar (MATLAB-like) language and API for writing ML applications
  - Ex. val c = a * b (a, b are Matrix[Double])
- Implicitly parallel data structures
  - Base types: Vector[T], Matrix[T], Graph[V,E], Stream[T]
  - Subtypes: TrainingSet, IndexVector, Image, ...
- Implicitly parallel control structures
  - sum{...}, (0::end) {...}, gradient {...}, untilconverged {...}
  - Arguments to control structures are anonymous functions with restricted semantics

OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning, ICML 2011

untilconverged(mu, tol){ mu =>
  // calculate distances to current centroids
  val c = (0::m){ i =>
    val allDistances = mu mapRows { centroid =>
      dist(x(i), centroid)
    }
    allDistances.minIndex
  }

  // move each cluster centroid to the
  // mean of the points assigned to it
  val newMu = (0::k,*){ i =>
    val (weightedpoints, points) = sum(0,m){ j =>
      if (c(j) == i) (x(j), 1)
    }
    val d = if (points == 0) 1 else points
    weightedpoints / d
  }
  newMu
}

The maps and reductions inside the loop are fused by the compiler.
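For reference, the semantics of one such k-means update can be mirrored with ordinary sequential Scala collections. Everything below (`dist`, `step`, the sample data) is a hypothetical sketch showing what the OptiML version computes, not OptiML code, and none of it is parallel.

```scala
// Sequential k-means update sketch with plain Scala collections.
// x: data points, mu: current centroids; returns the moved centroids.
type Point = Vector[Double]

def dist(p: Point, q: Point): Double =
  math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)

def step(x: Seq[Point], mu: Seq[Point]): Seq[Point] = {
  // assign each point to its nearest centroid
  val c = x.map(p => mu.indices.minBy(i => dist(p, mu(i))))
  // move each centroid to the mean of the points assigned to it
  mu.indices.map { i =>
    val assigned = x.zip(c).collect { case (p, ci) if ci == i => p }
    if (assigned.isEmpty) mu(i)
    else assigned.reduce((p, q) => p.zip(q).map { case (a, b) => a + b })
                 .map(_ / assigned.size)
  }
}

val x     = Seq(Vector(0.0, 0.0), Vector(0.1, 0.0), Vector(5.0, 5.0))
val mu    = Seq(Vector(0.0, 0.0), Vector(4.0, 4.0))
val newMu = step(x, mu)  // centroids move to (0.05, 0.0) and (5.0, 5.0)
```

In OptiML, the assignment map and the per-centroid reduction become implicitly parallel patterns, and `untilconverged` repeats `step` until the centroids stop moving.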

- Data querying of in-memory collections, inspired by LINQ
- SQL-like declarative language
- Uses high-level semantic knowledge to implement the query optimizer

// lineItems: Iterable[LineItem]
// Similar to Q1 of the TPCH benchmark
val q = lineItems Where(_.l_shipdate <= Date("19981201")).
  GroupBy(l => l.l_linestatus).
  Select(g => new Result {
    val lineStatus = g.key
    val sumQty = g.Sum(_.l_quantity)
    val sumDiscountedPrice =
      g.Sum(r => r.l_extendedprice*(1.0-r.l_discount))
    val avgPrice = g.Average(_.l_extendedprice)
    val countOrder = g.Count
  }) OrderBy(_.returnFlag) ThenBy(_.lineStatus)

Slide annotations: the date constant is hoisted, and the per-group aggregations are fused.
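The same query shape can be written against ordinary Scala collections. This is a plain-Scala sketch of what the TPC-H Q1-style query above computes, with a made-up `LineItem` case class (field names follow the TPC-H schema) and toy data; it gets none of OptiQL's hoisting or fusion.

```scala
// Plain-Scala sketch of the Q1-style query above.
// LineItem is a stand-in for the benchmark schema.
case class LineItem(l_shipdate: String, l_linestatus: String,
                    l_quantity: Double, l_extendedprice: Double,
                    l_discount: Double)

val lineItems = Seq(
  LineItem("19980801", "O", 10, 100.0, 0.1),
  LineItem("19980901", "O", 20, 200.0, 0.0),
  LineItem("19990101", "F",  5,  50.0, 0.2)  // filtered out by shipdate
)

val q = lineItems
  .filter(_.l_shipdate <= "19981201")            // Where
  .groupBy(_.l_linestatus)                       // GroupBy
  .map { case (status, g) =>                     // Select
    (status,
     g.map(_.l_quantity).sum,                                   // sumQty
     g.map(r => r.l_extendedprice * (1.0 - r.l_discount)).sum,  // sumDiscountedPrice
     g.map(_.l_extendedprice).sum / g.size,                     // avgPrice
     g.size)                                                    // countOrder
  }.toSeq.sortBy(_._1)                           // OrderBy
```

Each OptiQL operator has a direct collections analogue; what OptiQL adds is the staged IR that lets the compiler reorder and fuse these passes instead of running them one by one.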

- A DSL for large-scale graph analysis based on Green-Marl
- Directed and undirected graphs, nodes, edges
- Collections for node/edge storage: set, sequence, order
- Deferred assignment and parallel reductions with bulk synchronous consistency

Green-Marl: A DSL for Easy and Efficient Graph Analysis (Hong et al.), ASPLOS '12

for (t <- G.Nodes) {
  // implicitly parallel iteration
  val rank = ((1.0 - d) / N) +
             d * Sum(t.InNbrs){ w => PR(w) / w.OutDegree }
  // deferred assignment and scalar reduction:
  // writes become visible after the loop completes
  PR <= (t, rank)
  diff += Math.abs(rank - PR(t))
}
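The semantics of one such PageRank iteration can be checked against a sequential plain-Scala sketch. The adjacency-list representation below (`inNbrs`, `outDeg`) is made up for the example; computing every new rank from the old `pr` map before any write is visible is exactly what OptiGraph's deferred assignment guarantees.

```scala
// One PageRank iteration, sequential sketch over a 3-node graph.
// inNbrs(t) lists nodes linking to t; outDeg(w) is w's out-degree.
val d = 0.85
val inNbrs = Map(0 -> Seq(1, 2), 1 -> Seq(2), 2 -> Seq(0))
val outDeg = Map(0 -> 1, 1 -> 1, 2 -> 2)
val n  = inNbrs.size
val pr = Map(0 -> 1.0 / n, 1 -> 1.0 / n, 2 -> 1.0 / n)

// rank = (1-d)/N + d * sum over in-neighbours of PR(w)/outDegree(w),
// computed for all nodes from the old pr (deferred assignment)
val newPr = inNbrs.map { case (t, nbrs) =>
  t -> ((1.0 - d) / n + d * nbrs.map(w => pr(w) / outDeg(w)).sum)
}
val diff = newPr.map { case (t, r) => math.abs(r - pr(t)) }.sum
```

`diff` plays the role of the reduction variable in the OptiGraph loop: the surrounding (elided) convergence loop would stop once it falls below a tolerance.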

- A port of a subset of Scala collections to a staged Delite DSL
- Demonstrates the benefits of high-level optimization and code generation

val sourcedests = pagelinks flatMap { l =>
  val sd = l.split(":")
  val source = Long.parseLong(sd(0))
  val dests = sd(1).trim.split(" ")
  dests.map(d => (Integer.parseInt(d), source))
}
val inverted = sourcedests groupBy (x => x._1)

Reverse web-link benchmark in OptiCollections; tuples are encoded as longs in the back-end.
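The same computation runs almost unchanged on standard Scala collections, which is the point of OptiCollections. A runnable sketch with a tiny made-up input (each line is "source: dest dest ..."):

```scala
// Reverse web-link sketch on standard Scala collections.
val pagelinks = Seq("1: 2 3", "2: 3")

val sourcedests = pagelinks.flatMap { l =>
  val sd = l.split(":")
  val source = sd(0).trim.toLong
  val dests = sd(1).trim.split(" ")
  dests.map(d => (d.toInt, source))
}
// dest -> all (dest, source) pairs pointing at it
val inverted = sourcedests.groupBy(x => x._1)
```

The staged version produces the same grouping, but the compiler is free to pick a denser representation (e.g. packing the (Int, Long) pairs into longs) because the program is an IR, not eagerly executed library calls.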

Program at a high level; get high performance. Excerpts of the generated code:

Scala:

def apply(x388: Int, x423: Int, x389: Int,
          x419: Array[Double], x431: Int,
          x433: Array[Double]) {
  val x418 = x413 * x389
  val x912_zero   = { 0 }
  val x912_zero_2 = { 1.7976931348623157E308 }
  var x912   = x912_zero
  var x912_2 = x912_zero_2
  var x425 = 0
  while (x425 < x423) {
    val x430 = x425 * 1
    val x432 = x430 * x431
    val x916_zero = { 0.0 }
    . . .

CUDA:

__device__ int
dev_collect_x478_x478(int x423, int x389, DeliteArray<double> x419,
                      int x431, DeliteArray<double> x433, int x413) {
  int x418 = x413 * x389;
  int x919 = 0;
  double x919_2 = 1.7976931348623157E308;
  int x425 = 0;
  while (x425 < x423) {
    int x430 = x425 * 1;
    int x432 = x430 * x431;
    double x923 = 0.0;
    int x450 = 0;
    . . .

[Performance charts (normalized execution time; labels give speedups over the 1P baseline):
- TPCH-Q1 and TPCH-Q2, 1P and 8P: OptiQL vs. LINQ
- Template Matching, 1 to 8 CPUs: OptiML
- k-means, 1 to 8 CPUs and GPU: OptiML vs. C++
- PageRank, 1P to 8P, on 100k nodes x 800k edges and 8M nodes x 64M edges: OptiGraph vs. Green-Marl
- Reverse web-link benchmark, 1P to 8P, on 75 MB and 463 MB inputs: OptiCollections vs. Scala Parallel Collections]

How do I build my own Delite DSL?

[Delite stack diagram:
- Domain Specific Languages: OptiML (Machine Learning), OptiQL (Data Analytics), OptiGraph (Graph Analysis), OptiMesh (Physics)
- Domain embedding language (Scala) with Lightweight Modular Staging
- Delite Compiler (DSL infrastructure): parallel patterns, static optimizations, heterogeneous code generation
- Delite Runtime: walk-time optimizations, locality-aware scheduling
- Heterogeneous hardware: SMP, GPU]

1. Types: abstract, front-end
2. Operations: language operators and methods available on types; represented by IR nodes
3. Data Structures: platform-specific concrete implementations, back-end
4. Code Generators: Scala traits that define how to emit code as strings for various IR nodes and platforms
5. Analyses and Optimizations (optional): IR rewriting via pattern matching, traversals/transformations (e.g. fusion)

abstract class Vector[T] extends DeliteCollection[T]
abstract class Matrix[T] extends DeliteCollection[T]
abstract class Image[T]  extends Matrix[T]

These are placeholders for static type checking and method dispatch; they are not bound to any implementation.

trait VectorOps {
  // add an infix + operator to Rep[Vector[A]]
  def infix_+[A](lhs: Rep[Vector[A]], rhs: Rep[Vector[A]]) =
    vector_plus(lhs, rhs)

  // abstract: applications cannot inspect what happens
  // when methods are called
  def vector_length[A](lhs: Rep[Vector[A]]): Rep[Int]
  def vector_plus[A](lhs: Rep[Vector[A]],
                     rhs: Rep[Vector[A]]): Rep[Vector[A]]
}

Vector here is the same abstract Vector we defined earlier.

trait VectorOpsExp extends VectorOps with Expressions {
  // a Delite parallel op IR node
  case class VectorPlus[A](inA: Exp[Vector[A]], inB: Exp[Vector[A]])
      extends DeliteOpZipWith[Vector[A], Vector[A], Vector[A]] {
    // number of elements in the input collections
    def size = inA.length
    // the output collection
    def alloc = Vector[A](inA.length)
    // the ZipWith function
    def func = (a, b) => a + b
  }

  // construct IR nodes
  def vector_plus[A](lhs: Exp[Vector[A]], rhs: Exp[Vector[A]]) =
    VectorPlus(lhs, rhs)
}

// a concrete, back-end Scala data structure;
// will be instantiated by generated code
class Vector[T](__length: Int) {
  var _length = __length
  var _data: Array[T] = new Array[T](_length)
}

// corresponding data structures for other back-ends
// (CUDA, OpenCL, etc.)
// . . .

trait ScalaGenVectorOps extends ScalaGen {
  val IR: VectorOpsExp
  import IR._

  override def emitNode(sym: Sym[Any], rhs: Def[Any])
      (implicit stream: PrintWriter) =
    // generate code for particular IR nodes
    rhs match {
      case v@VectorNew(length) =>
        emitValDef(sym, "new " + remap("Vector") + "(" +
          quote(length) + ")")
      case VectorLength(x) =>
        emitValDef(sym, quote(x) + "._length")
      case _ => super.emitNode(sym, rhs)
    }
}

_length is the exact back-end field name we defined earlier.
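The string-emitting pattern in the code generator can be illustrated standalone. The toy IR below (`Node`, `VectorNew`, `VectorLength`, `emitNode`) is illustrative only, but it follows the same shape as the Delite generator above: pattern-match an IR node, print a line of target code binding the node's symbol.

```scala
// Toy string-based code generation over a small IR,
// mirroring the emitNode pattern above (illustrative names).
sealed trait Node
case class VectorNew(length: Int) extends Node
case class VectorLength(sym: String) extends Node

def emitNode(sym: String, rhs: Node): String = rhs match {
  case VectorNew(n)    => s"val $sym = new Vector($n)"  // allocation
  case VectorLength(x) => s"val $sym = $x._length"      // field access
}

val code = Seq(emitNode("x1", VectorNew(10)),
               emitNode("x2", VectorLength("x1"))).mkString("\n")
```

A generator per back-end (Scala, CUDA, OpenCL) implements the same dispatch but emits that platform's syntax, which is how one IR yields several target programs.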

override def matrix_plus[A:Manifest:Arith]
    (x: Exp[Matrix[A]], y: Exp[Matrix[A]]) =
  (x, y) match {
    // (A*B + A*D) == A*(B + D)
    case (Def(MatrixTimes(a, b)),
          Def(MatrixTimes(c, d))) if (a == c) =>
      // return the optimized version
      matrix_times(a, matrix_plus(b, d))

    // other rewrites
    // case . . .

    case _ => super.matrix_plus(x, y)
  }
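The rewrite mechanism can be shown standalone with a toy expression IR whose "smart constructor" for addition pattern-matches its arguments, exactly as matrix_plus does above. The names `E`, `Lit`, `Add`, `Mul`, and `plus` are made up for the sketch.

```scala
// Rewrite by pattern matching: plus(Mul(a,b), Mul(a,d)) is rebuilt
// as Mul(a, plus(b,d)), saving one multiplication in the IR.
sealed trait E
case class Lit(name: String) extends E
case class Add(a: E, b: E) extends E
case class Mul(a: E, b: E) extends E

def plus(x: E, y: E): E = (x, y) match {
  // (A*B + A*D) == A*(B + D)
  case (Mul(a, b), Mul(c, d)) if a == c => Mul(a, plus(b, d))
  case _                                => Add(x, y)
}

val a = Lit("A"); val b = Lit("B"); val d = Lit("D")
val e = plus(Mul(a, b), Mul(a, d))  // rewritten to Mul(A, Add(B, D))
```

Because IR nodes are constructed through such functions rather than directly, every rewrite fires automatically as the staged program is built, before any code is generated.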

trait OptiML extends OptiMLScalaOpsPkg
  with VectorOps with MatrixOps with ...

trait OptiMLExp extends OptiMLScalaOpsPkgExp
  with VectorOpsExp with MatrixOpsExp with ...

trait OptiMLCodeGenScala extends OptiMLScalaCodeGenPkg
  with ScalaGenVectorOps with ScalaGenMatrixOps with ...

trait OptiMLCodeGenCuda extends OptiMLCudaCodeGenPkg
  with CudaGenVectorOps with CudaGenMatrixOps with ...

- Delite DSLs target high-performance architectures from Scala
- Open source: use them to accelerate your apps or build your own!
  - http://github.com/stanford-ppl/Delite
- Mailing list: http://groups.google.com/group/delite-devel

Thank you