
The Alea Reactive Dataflow System for GPU Parallelization

Philipp Kramer, Luc Bläser
Institute for Software, HSR Rapperswil

Daniel Egloff
QuantAlea, Zurich

Funded by the Swiss Commission for Technology and Innovation, Project No. 16130.2 PFES-ES

HLPGPU 2016, 19 Jan 2016

1

GPU Parallelization Is Tough

Massive Parallel Power

Not easy to use

□ Vector parallel and machine-centric programming models

□ Invocation and data management logic

[Image: GPU with 5760 cores]

2

Simplify GPU Programming

GPU parallel programming for (almost) everyone

□ No GPU experience required

□ Fast development

□ Good performance

Built on .NET

□ Available for C#, F#, VB etc.

3

Alea Dataflow Programming Model

Dataflow

□ Graph of operations (pre-fabricated)

□ Data propagated through graph

Reactive

□ Feed input at arbitrary intervals

□ Listen for asynchronous output (see the sketch below)
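As a minimal sketch of this reactive usage, assuming the MatrixVectorProduct operation and the Send/OnReceive port members that appear on the following slides (all other details are illustrative assumptions, not the library's documented API):

// Illustrative sketch only: port names and Send/OnReceive are taken from the later slides.
var mvp = new MatrixVectorProduct<float>();

// Listen for asynchronous output: the callback fires whenever a result becomes available.
mvp.OnReceive(result => Console.WriteLine(result[0]));

// Feed input at arbitrary intervals, e.g. whenever new data arrives.
var M = new float[,] { { 0.9f, 0.1f }, { 0.2f, 0.8f } };
var v = new float[] { 1f, 0f };
mvp.Left.Send(M);
mvp.Right.Send(v);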

4

Graph Construction

e.g. Markov-Chain-Step:

var splitter = new Splitter<float[,], float[]>();    // splits the (M, v) tuple into M and v
var mvp = new MatrixVectorProduct<float>();          // computes M · v
var merger = new Merger<float[,], float[]>();        // recombines M with a vector

splitter.First.ConnectTo(mvp.Left);       // M goes into the product
splitter.First.ConnectTo(merger.First);   // M is also carried along to the merger
splitter.Second.ConnectTo(mvp.Right);     // v goes into the product
mvp.ConnectTo(merger.Second);             // the product result completes the merged tuple

splitter.Second.OnReceive(Console.WriteLine);   // observe the vector flowing through the splitter

5

[Diagram: the resulting graph — Splitter (T[,], T[] → T[,] and T[]), MatrixVectorProduct (T[,], T[] → T[]), Merger (T[,] and T[] → T[,], T[])]

Dataflow Execution

6

[Diagram: four snapshots of the same graph showing the tuple (M, v) entering the Splitter, M and v propagating to the MatrixVectorProduct and the Merger, and the Merger emitting (M, v′)]

splitter.Input.Send(Tuple.Create(M, v));

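For reference, the value the graph produces for each input is the matrix-vector product v′ = M·v; a plain sequential C# version of one Markov-chain step, for comparison with the GPU dataflow above:

// CPU reference for a single Markov-chain step: v' = M * v (assumes a square n×n matrix).
static float[] Step(float[,] M, float[] v)
{
    int n = v.Length;
    var next = new float[n];
    for (int i = 0; i < n; i++)
    {
        float sum = 0f;
        for (int j = 0; j < n; j++)
            sum += M[i, j] * v[j];
        next[i] = sum;
    }
    return next;
}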

Cyclic Graph

Corresponds to an iterative calculation

e.g. Markov-Chain steady-state:

Func<int, float[], float[], bool> steadyCondition = …;
Func<float[], float[], float[]> aggregation = …;
var approx = new Approximator<float[,], float[]>(steadyCondition, aggregation);

approx.Iterate.ConnectTo(splitter);
merger.ConnectTo(approx.Continue);
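The two delegate bodies are elided on the slide; purely as a hypothetical illustration (the parameter roles — iteration count, previous vector, current vector — are assumptions read off the signatures), they could look like this:

// Hypothetical: stop when successive iterates differ by less than a tolerance
// or an iteration cap is reached (parameter roles assumed from the signature).
Func<int, float[], float[], bool> steadyCondition = (iteration, previous, current) =>
{
    float maxDiff = 0f;
    for (int i = 0; i < current.Length; i++)
        maxDiff = Math.Max(maxDiff, Math.Abs(current[i] - previous[i]));
    return maxDiff < 1e-6f || iteration > 1000;
};

// Hypothetical: simply keep the latest iterate as the aggregated result.
Func<float[], float[], float[]> aggregation = (previous, current) => current;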

7

[Diagram: cyclic graph — the Approximator's Iterate output feeds the Splitter, the Merger's output feeds the Approximator's Continue input, closing the loop around Splitter, MatrixVectorProduct and Merger]

e.g. Monte Carlo Pi

var randoms = new RandomMersenne<float>(0, 1);
randoms
    .Pairing()
    .Map(p => p.Left * p.Left + p.Right * p.Right <= 1 ? 1f : 0f)
    .Average()
    .OnReceive(x => Console.WriteLine(4 * x));

Fluent Graph Construction
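Why 4·average estimates π: a point (x, y) drawn uniformly from [0, 1)² falls inside the unit quarter circle with probability π/4, so the mean of the 0/1 indicator, multiplied by 4, converges to π. A plain sequential C# version of the same estimate, for comparison:

// Sequential reference estimate of pi with the same quarter-circle test.
var rng = new Random(42);
int samples = 1000000;
int inside = 0;
for (int i = 0; i < samples; i++)
{
    double x = rng.NextDouble();
    double y = rng.NextDouble();
    if (x * x + y * y <= 1.0) inside++;
}
Console.WriteLine(4.0 * inside / samples);   // ≈ 3.14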

8

[Diagram: fluent pipeline — Random (fed an int count) → Pairing → Map → Average, passing float[] batches between stages and producing a single float]

Runtime System behind the Scenes

Operations provide GPU and/or CPU implementations

Automatic memory copying

□ only when needed

GPU garbage collection

Multi-GPU support

Scheduling with .NET TPL (see the illustrative sketch below)
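As an illustration of the scheduling idea only (this is not Alea's implementation, and all names below are invented): a node can fire its operation as a TPL task once all of its inputs have arrived, so independent operations run concurrently and results are delivered asynchronously.

using System;
using System.Threading.Tasks;

// Invented illustration of TPL-based scheduling; synchronization is omitted for brevity.
class TwoInputNode<TLeft, TRight, TResult>
{
    private TLeft left;   private bool hasLeft;
    private TRight right; private bool hasRight;
    private readonly Func<TLeft, TRight, TResult> body;
    public event Action<TResult> OnResult;

    public TwoInputNode(Func<TLeft, TRight, TResult> body) { this.body = body; }

    public void SendLeft(TLeft value)   { left = value;  hasLeft = true;  TryFire(); }
    public void SendRight(TRight value) { right = value; hasRight = true; TryFire(); }

    private void TryFire()
    {
        if (!hasLeft || !hasRight) return;
        var l = left; var r = right;        // capture the current input pair
        hasLeft = hasRight = false;
        // Run the operation on the thread pool and deliver the result asynchronously.
        Task.Run(() => body(l, r)).ContinueWith(t => OnResult?.Invoke(t.Result));
    }
}

A product-like node would then be constructed as new TwoInputNode<float[,], float[], float[]>(...) with a body that launches the GPU kernel; the real runtime additionally decides between GPU and CPU implementations and inserts copies only when needed.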

[Diagram: operations placed on GPU and CPU, with copies between host and device memory inserted only where data crosses the boundary]

9

GPU Parallelization Can Be Easy

Pre-fabricated and custom operations

□ Vector parallel and machine-centric programming models

Automatic orchestration

□ Invocation and data management logic

□ Garbage collection

□ Synchronization

[Image: GPU with 5760 cores]

10

Performance Comparison to CUDA C

GPU kernels written in C# in CUDA C style

Comparison of GPU kernel performance to CUDA C

□ GeForce GTX Titan Black (2880 cores)

Benchmark               Speedup
Average                 1.12
Matrix Convolution      1.03
Matrix AB + BC + CA     0.81
Monte Carlo π           1.06

11

Applications and Performance

GeForce GTX Titan Black (2880 cores) vs. CPU (4-core Intel Xeon E5-2609, 2.4 GHz)

Option Pricing Case (single GPU, 32 options)

Configuration               Speedup
16k paths/it, 30 days       6
32k paths/it, 90 days       18
128k paths/it, 360 days     31

Training Phase of Neural Network Case (MNIST data)

Configuration           Unopt. Speedup    Opt. Speedup
30 hidden neurons       2.4               3.3
480 hidden neurons      24                33
7680 hidden neurons     93                124

12

Conclusions

Simple GPU parallelization in .NET

□ Fast development with generic prefabricated operations

□ Cleaner and slimmer code by separation of concerns

□ Possibility to define custom operations

Efficient runtime system

□ Automatic data movement and GPU garbage collection

□ Low overhead compared to CUDA C

13

Further Information

www.quantalea.com

Prof. Dr. Luc Bläser
HSR Hochschule für Technik Rapperswil
Institute for Software
lblaeser@hsr.ch
http://concurrency.ch

Dr. Daniel Egloff
QuantAlea GmbH
Switzerland
daniel.egloff@quantalea.net
www.quantalea.com

14