
Page 1:


Efficiency Programming for the (Productive) Masses

Armando Fox, Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Ben Carpenter, Erin Carson,

Krste Asanovic, Dave Patterson, Kurt Keutzer

UC Berkeley Parallel Computing Lab/UPCRC

Page 2:


Make productivity programmers efficient, and efficiency programmers productive?

Productivity-level languages (PLLs): Python, Ruby. High-level abstractions well matched to the application domain => 5x faster development and 3-10x fewer lines of code. Used by >90% of programmers.

Efficiency-level languages (ELLs): C/C++, CUDA, OpenCL. >5x longer development time, but potential 10x-100x performance by exposing the hardware model. Used by <10% of programmers, yet their work is poorly reused.

The goal: 5x faster development time AND 10x-100x performance!

Can we raise the level of abstraction and still get performance?

Page 3:


Capture patterns instead of “domains”?

Efficiency programmers know how to target computation patterns to hardware: stencil/SIMD codes => GPUs; sparse matrix => communication-avoiding algorithms on multicore; “big finance” Monte Carlo simulation => MapReduce.

Libraries? Useful, but they don’t raise the abstraction level.

How do we make ELL work accessible to more PLL programmers?

Page 4:


“Stovepipes”: Connect Pattern to Platform

[Figure: two layer diagrams contrasted.
Traditional layers: application domains (virtual worlds, data viz., robotics, music) sit atop computation domains (rendering, probabilistic, physics, linear algebra), a common language substrate, a thick runtime & OS, and hardware (OOO, GPU, SIMD, FPGA, cloud).
“Stovepipes”: applications sit atop motifs/patterns (sparse matrix, dense matrix, stencil), each wired to hardware through a narrow stovepipe (Dense to GPU, Stencil to SIMD, Stencil to FPGA, Dense to OoO) over a thin runtime. Humans must produce these stovepipes.]

Page 5:


SEJITS: Selective, Embedded Just-in-Time Specialization

Productivity programmers write in a general-purpose, modern, high-level PLL.

The SEJITS infrastructure specializes computation patterns selectively at runtime.

Specialization uses runtime information to generate and JIT-compile ELL code targeted to the hardware.

Embedded because the PLL’s own machinery enables this (vs. extending the PLL interpreter).

Page 6:


Specifically...

When a “specializable” function is called: determine whether a specializer is available for the current platform; if not, continue executing normally in the PLL.

If a specializer is found, it can: manipulate/traverse the AST of the function; emit and JIT-compile ELL source code; dynamically link the compiled code into the PLL interpreter.

Specializers are themselves written in the PLL.

The necessary features are present in modern PLLs, but absent from older widely used PLLs.
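
For concreteness, a minimal Python sketch of that dispatch rule. The names (specializable, _specializers) are illustrative, not the actual Par Lab API; the point is only the check-then-fall-back control flow:

    import platform

    _specializers = {}   # platform -> {function name: specializer}; toy registry

    def specializable(fn):
        # Dispatch rule from this slide: if no specializer is available
        # for the current platform, the call proceeds normally in the PLL.
        def wrapper(*args, **kwargs):
            spec = _specializers.get(platform.machine(), {}).get(fn.__name__)
            if spec is None:
                return fn(*args, **kwargs)    # fall back: plain Python
            return spec(fn, args, kwargs)     # specializer emits, compiles, links, runs
        return wrapper

    @specializable
    def double(xs):
        return [2 * x for x in xs]

    double([1, 2, 3])   # no specializer registered, so this runs as ordinary Python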

Page 7:


[Figure: a productivity app (.py) calls f() plus decorated functions @g() and @h(). SEJITS intercepts the decorated calls and hands them to a Specializer, which emits C source (.c), compiles and links it (cc/ld) into a shared library (.so) dynamically loaded into the interpreter, all running on the OS/HW; a cache ($) avoids recompiling.]

SEJITS makes tuning decisions per-function (not per-app).

Page 8:


[Figure: same diagram as Page 7, now annotated with the four elements of the name:]

Selective

Embedded

JIT

Specialization

Page 9:


Example: Stencil Computation in Ruby


    class LaplacianKernel < Kernel
      def kernel(in_grid, out_grid)
        in_grid.each_interior do |point|
          point.neighbors(1).each do |x|
            out_grid[point] += 0.2 * x.val
          end
        end
      end
    end

    VALUE kern_par(int argc, VALUE* argv, VALUE self) {
      /* unpack arrays into in_grid and out_grid */
      #pragma omp parallel for default(shared) private(t_6,t_7,t_8)
      for (t_8 = 1; t_8 < 256-1; t_8++) {
        for (t_7 = 1; t_7 < 256-1; t_7++) {
          for (t_6 = 1; t_6 < 256-1; t_6++) {
            int center = INDEX(t_6,t_7,t_8);
            out_grid[center] = out_grid[center] + 0.2*in_grid[INDEX(t_6-1,t_7,t_8)];
            ...
            out_grid[center] = out_grid[center] + 0.2*in_grid[INDEX(t_6,t_7,t_8+1)];
          }
        }
      }
      return Qtrue;
    }

The specializer emits OpenMP code; the result is 1000x-2000x faster than pure Ruby.

The specializer uses introspection to grab parameters and inspect the AST of the computation.
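
In Python (the deck's other PLL) that introspection step looks roughly like this; a minimal sketch for a plain function object, not the actual Ruby implementation:

    import ast, inspect

    def grab_computation(fn):
        # Recover the live function's source, parse it into an AST the
        # specializer can traverse and rewrite, and grab its parameters.
        tree = ast.parse(inspect.getsource(fn))
        params = list(inspect.signature(fn).parameters)
        return tree, params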

Page 10:


Example: Sparse Matrix-Vector Multiply in Python


    # “Gather nonzero entries,
    #  multiply them by vector,
    #  do for each column”
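
The slide's Python source did not survive extraction; as a stand-in, here is a minimal pure-Python sketch of the computation the comment describes (the (row, value)-per-column layout and the name spmv are assumptions, not the original code):

    def spmv(columns, x, y):
        # columns[j] holds the nonzeros of column j as (row, value) pairs
        for j, col in enumerate(columns):      # "do for each column"
            for i, v in col:                   # "gather nonzero entries"
                y[i] += v * x[j]               # "multiply them by vector"
        return y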

The specializer outputs CUDA, compiled by nvcc (the generated code is not reproduced here).

SEJITS leverages downstream toolchains

B. Catanzaro et al., joint work with NVIDIA Research

Page 11:


SEJITS in the Cloud

[Figure: the same pipeline targeting the cloud. The productivity app (.py) with f(), @g(), @h() hands decorated calls to the Specializer, which emits Scala source (.scala), compiled by scalac and shipped to Spark workers running on Nexus on Eucalyptus or EC2; a cache ($) avoids recompiling.]

Spark & Nexus:

Spark enables cloud-distributed, persistent, fault-tolerant shared parallel data structures.

Spark relies on the Scala runtime and its data-parallel abstractions.

Spark relies on the Nexus cloud resource management layer.

Page 12:


Example: Logistic regression using Spark/Scala (in progress)
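
The slide's own code did not survive extraction. The computation being specialized is the standard logistic regression gradient step from the Spark paper cited below; a minimal pure-Python sketch of the PLL-side version (all names illustrative):

    import math, random

    def logistic_regression(points, dims, iterations):
        # points: list of (x, y) with feature vector x and label y in {-1, +1}
        w = [random.uniform(-1, 1) for _ in range(dims)]
        for _ in range(iterations):
            grad = [0.0] * dims
            # The data-parallel map/reduce below is what SEJITS would ship to Spark.
            for x, y in points:
                dot = sum(wi * xi for wi, xi in zip(w, x))
                s = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
                for d in range(dims):
                    grad[d] += s * x[d]
            w = [wi - gi for wi, gi in zip(w, grad)]
        return w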

M. Zaharia et al., Spark: Cluster Computing with Working Sets, HotCloud ’09.

B. Hindman et al., Nexus: A Common Substrate for Cluster Computing, HotCloud ’09.

Page 13:


SEJITS in the Cloud

[Figure: the same pipeline targeting Hadoop. The Specializer emits Java source (.java), compiled by javac and shipped to a Hadoop master running on Nexus in the cloud; a cache ($) avoids recompiling.]

Page 14:


SEJITS for Cloud Computing

Idea: the same Python app runs on the desktop, on manycore, and in the cloud.

Cloud/multicore synergy: specialize intra-node as well as generate cloud code.

Cloud: emit JIT-able code for Spark (Scala), Hadoop (Java), MPI (C), ...

Single node: emit JIT-able code for OpenCL, CUDA, OpenMP, ...

Combine abstractions in one app. And remember: you can always fall back to the PLL.

Page 15:


Questions

Won’t we need lots and lots of specializers? If the Par Lab “motifs” bet is correct, tens of specializers will go a long way.

What about libraries, frameworks, etc.? SEJITS is complementary to frameworks. Most libraries are written for ELLs, and ELLs lack features that promote code reuse and don’t raise the abstraction level.

Why isn’t this just as hard as a “magic compiler”? Specializers are written by human experts; SEJITS allows “crowdsourcing” them.

Will programmers accustomed to Matlab/Fortran learn functional style, list comprehensions, etc.?

Page 16:


Conclusion

SEJITS enables a code-generation strategy per-function, not per-app.

A uniform approach to productive programming: the same app runs on cloud, on multicore, or against autotuned libraries.

Combine multiple frameworks/abstractions in the same app.

A research enabler: incrementally develop specializers for different motifs or for prototype hardware; you don’t need a full compiler and toolchain just to get started.

Page 17:


Questions
