Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow...
Transcript of Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow...
![Page 1: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/1.jpg)
Alea Reactive DataflowGPU Parallelization Made Simple
Luc BläserIFS Institute for Software, HSR Rapperswil
D. Egloff, O. Knobel, P. Kramer, X. Zhang, D. Fabian(Joint HSR and QuantAlea Zurich)
REBLS’1421 Oct 2014
![Page 2: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/2.jpg)
GPU Programming Today
Massive parallel power
□ Very specific pattern: vector-parallelism
High obstacles
□ Particular algorithms needed
□ Machine-centric programming models
□ Poor language and runtime integration
Good excuses against it - unfortunately
□ Too difficult, costly, error-prone, marginal benefit
5760 Cores
![Page 3: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/3.jpg)
Our Goal
GPU parallel programming for (almost) everyone
Radical simplification
□ No GPU experience required
□ Fast development
□ High performance comes automatically
□ Guaranteed memory safety
Broad community
□ .NET in general: C#, F#, VB etc.
□ Based on Alea cuBase F# runtime
![Page 4: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/4.jpg)
Alea Dataflow Programming Model
Dataflow
□ Graph of operations
□ Data propagated through graph
Reactive
□ Feed input in arbitrary intervals
□ Listen for asynchronous output
![Page 5: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/5.jpg)
The Descriptive Power
Program is purely descriptive
□ What, not how
Efficient execution behind the scenes
□ Vector-parallel operations
□ Stream operations on GPU
□ Minimize memory copying
□ Hybrid multi-platform scheduling
□ Tune degree of parallelization
□ …
![Page 6: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/6.jpg)
Operation
Unit of calculation (typically vector-parallel)
Input and output ports
Port = stream of typed data
Consumes input, produces output
Map
Input: T[]
Output: U[]
MatrixProduct
Left: T[,] Right: T[,]
Output: T[,]
Splitter
Input: Tuple<T, U>
First: T Second: U
![Page 7: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/7.jpg)
Graph
Random
Pairing
Map
Average
var randoms = new Random<float>(0, 1);var coordinates = new Pairing<float>();var inUnitCircle = new Map<Pair<float>, float>(p => p.Left * p.Left + p.Right * p.Right <= 1
? 1f : 0f);var average = new Average<float>();
randoms.Output.ConnectTo(coordinates.Input);coordinates.Output.ConnectTo(inUnitCircle.Input);inUnitCircle.Output.ConnectTo(average.Input);
float[]
int[]
Pair<float>[]
![Page 8: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/8.jpg)
Dataflow
Send data to input port
Receive from output port
All asynchronousRandom
Pairing
Map
Average
average.Output.OnReceive(x =>Console.WriteLine(4 * x));
random.Input.Send(1000);random.Input.Send(1000000);
1000
Write(…)
1000000
Write(…)
![Page 9: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/9.jpg)
Short Fluent Notation
var randoms = new Random<float>(0, 1);randoms.Pairing().Map(p => p.Left * p.Left + p.Right * p.Right <= 1 ? 1f : 0f).Average().OnReceive(x => Console.WriteLine(4 * x));
randoms.Send(100);randoms.Send(100000);
![Page 10: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/10.jpg)
Algebraic Computation
A * B + C
MatrixProduct
MatrixSum
var product = new MatrixProduct<float>();var sum = new MatrixSum<float>();
product.Output.ConnectTo(sum.Left);
sum.Output.OnReceive(Console.WriteLine);
product.Left.Send(A);product.Right.Send(B);sum.Right.Send(C);
A B C
![Page 11: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/11.jpg)
Iterative Computation
𝑏𝑖+1 = 𝐴 ∙ 𝑏𝑖 (until 𝑏𝑖+1 = 𝑏𝑖)
var source = new Splitter<double[,], double[]>();var product = source.First.Multiply(source.Second);
var steady = product.Compare(source.Second, fun x y ->
Math.Abs(x – y) < 1E-6);var next = source.First.Merge(product);var branch = steady.Switch(next);branch.False.ConnectTo(source.Input);
branch.True.OnReceive(Console.WriteLine)source.Send(new Tuple<double[,], double>(A, b0));
Splitter
MatrixVector
Product
Merger Comparison
(A, bi)
(A, bi+1)
A
bi
bi+1
Switch
bool
(A, bi+1)
![Page 12: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/12.jpg)
Current Scheduler
Operation implement GPU and/or CPU
GPU operations combined to stream
Memory copy only when needed
Host scheduling with .NET TPL
Automatic free memory management
GPU
CPU
COPY
COPY
COPY
COPY
![Page 13: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/13.jpg)
Operation Catalogue
Prefabricated generic operations□ Switch, Merger, Splitter, Comparison
□ Map, Reduction, Average, Pairing
□ Random, MatrixProduct, MatrixSum, MatrixVectorProduct, VectorSum, ScalarProduct
□ More to come…
Custom operations can be added
Good performance□ Nearly as fast as native C CUDA: overhead about 10%
• Performance depends on operation implementation
• Small overhead (cross-compilation, managed to unmanaged interop, scheduler)
![Page 14: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/14.jpg)
Related Works
Rx.NET / TPL Dataflow□ Single input and output port□ Not for GPU
Xcelerit□ Not reactive: single flow per graph□ No generic operations with functors
MSR PTasks / Dandelion□ Synchronous receive, on C++, no generic operations□ .NET LINQ integration (pull instead of push)
Fastflow□ Not reactive (sync run of the graph)□ More low-level C++ tasks, no functors
Nikola□ Implicit dataflow described by functions□ Limited set of operations
![Page 15: Alea Reactive Dataflow GPU Parallelization Made Simple · Related Works Rx.NET / TPL Dataflow Single input and output port Not for GPU Xcelerit Not reactive: single flow per graph](https://reader031.fdocuments.in/reader031/viewer/2022021511/5ac56b707f8b9ae06c8dc0cd/html5/thumbnails/15.jpg)
Conclusions
Simple but powerful GPU parallelization in .NET
□ No low-level GPU artefacts
□ Fast and condensed problem formulation
□ Efficient and safe execution by the scheduler
The descriptive paradigm is the key
□ Reactive makes it very general: cycles, infinite etc.
□ Practical suitability depends on operations
Future directions
□ Advanced schedulers: multi GPUs, cluster, optimizations
□ Larger operation catalogue, optimized operations