Task and Data Parallelism
Transcript of Task and Data Parallelism
Sasha Goldshtein
CTO, Sela Group
Task and Data Parallelism
Agenda
• Multicore machines have been a cheap commodity for >10 years
• Adoption of concurrent programming is still slow
• Patterns and best practices are scarce
• We discuss the APIs first…
• …and then turn to examples, best practices, and tips
TPL Evolution
Tasks
• A task is a unit of work
– May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
– Much more than threads, and yet much cheaper
```csharp
Task<string> t = Task.Factory.StartNew(() => {
    return DnaSimulation(…);
});
t.ContinueWith(r => Show(r.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
    TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();
```

```csharp
try {
    // The C# 5.0 version
    var task = Task.Run(DnaSimulation);
    DisplayProgress();
    Show(await task);
} catch (Exception ex) {
    Show(ex);
}
```
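A common composition of the above is to fan out several independent tasks and await them together with `Task.WhenAll`. A minimal sketch — `DnaSimulation` here is a hypothetical stand-in for any CPU-bound function:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class FanOut {
    // Hypothetical stand-in for a CPU-bound computation.
    static string DnaSimulation(int seed) => $"result-{seed}";

    static async Task Main() {
        // Start several independent tasks; the scheduler may run them in parallel.
        Task<string>[] tasks = Enumerable.Range(0, 4)
            .Select(i => Task.Run(() => DnaSimulation(i)))
            .ToArray();

        // Await all of them; an exception in any task surfaces here.
        string[] results = await Task.WhenAll(tasks);
        Console.WriteLine(string.Join(", ", results));
    }
}
```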
Parallel Loops
• Ideal for parallelizing work over a collection of data
• Easy porting of for and foreach loops
– Beware of inter-iteration dependencies!
```csharp
Parallel.For(0, 100, i => {
    ...
});

Parallel.ForEach(urls, url => {
    webClient.Post(url, options, data);
});
```
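To make the inter-iteration dependency warning concrete: the sketch below races on a shared accumulator, then fixes the race with `Interlocked` (illustrative only; for heavy aggregation, thread-local state is cheaper):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class LoopRace {
    static void Main() {
        long racySum = 0, safeSum = 0;

        // BROKEN: += on shared state is not atomic; parallel iterations race
        // and the result is usually (but not always) wrong.
        Parallel.For(0, 100_000, i => { racySum += i; });

        // CORRECT: Interlocked.Add makes each update atomic.
        Parallel.For(0, 100_000, i => { Interlocked.Add(ref safeSum, i); });

        Console.WriteLine($"racy: {racySum}, atomic: {safeSum}");
    }
}
```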
Parallel LINQ
• Mind-bogglingly easy parallelization of LINQ queries
• Can introduce ordering into the pipeline, or preserve order of original elements

```csharp
var query = from monster in monsters.AsParallel()
            where monster.IsAttacking
            let newMonster = SimulateMovement(monster)
            orderby newMonster.XP
            select newMonster;

query.ForAll(monster => Move(monster));
```
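On preserving the order of the original elements: `AsOrdered` opts the query into source ordering at some cost in parallel throughput. A small sketch:

```csharp
using System;
using System.Linq;

class OrderedPlinq {
    static void Main() {
        int[] numbers = Enumerable.Range(1, 10).ToArray();

        // Without AsOrdered, PLINQ may emit results in any order;
        // with it, output order matches source order.
        var squares = numbers.AsParallel().AsOrdered()
                             .Select(n => n * n)
                             .ToArray();

        Console.WriteLine(string.Join(", ", squares));  // 1, 4, 9, ..., 100
    }
}
```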
Measuring Concurrency
•Visual Studio Concurrency Visualizer to the rescue
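Outside the profiler, a rough first check is simply to time the sequential and parallel variants of the same work. A sketch — wall-clock timing is noisy, so a single run is only indicative:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class MeasureSpeedup {
    // Arbitrary CPU-bound placeholder work.
    static void Work() { for (int k = 0; k < 10_000; ++k) Math.Sqrt(k); }

    static void Main() {
        const int N = 1000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; ++i) Work();
        long sequentialMs = sw.ElapsedMilliseconds;

        sw.Restart();
        Parallel.For(0, N, i => Work());
        long parallelMs = sw.ElapsedMilliseconds;

        Console.WriteLine($"sequential: {sequentialMs} ms, parallel: {parallelMs} ms");
    }
}
```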
Recursive Parallelism Extraction
• Divide-and-conquer algorithms are often parallelized through the recursive call
– Be careful with the parallelization threshold and watch out for dependencies

```csharp
void FFT(float[] src, float[] dst, int n, int r, int s) {
    if (n == 1) {
        dst[r] = src[r];
    } else {
        FFT(src, dst, n/2, r, s*2);
        FFT(src, dst, n/2, r+s, s*2);
        // Combine the two halves in O(n) time
    }
}
```

```csharp
Parallel.Invoke(
    () => FFT(src, dst, n/2, r, s*2),
    () => FFT(src, dst, n/2, r+s, s*2));
```
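The same threshold idea applies to any divide-and-conquer algorithm. A sketch of a recursively parallel quicksort with a sequential cutoff — the `Threshold` value is a made-up starting point to be tuned by measurement:

```csharp
using System;
using System.Threading.Tasks;

class ParallelQuickSort {
    const int Threshold = 4096;  // below this, parallel recursion costs more than it saves

    // Sorts a[lo..hi) in place.
    public static void Sort(int[] a, int lo, int hi) {
        if (hi - lo <= 1) return;
        int p = Partition(a, lo, hi);
        if (hi - lo < Threshold) {
            Sort(a, lo, p);            // small range: recurse sequentially
            Sort(a, p + 1, hi);
        } else {
            Parallel.Invoke(           // large range: recurse on both halves in parallel
                () => Sort(a, lo, p),
                () => Sort(a, p + 1, hi));
        }
    }

    // Lomuto partition: returns the pivot's final index.
    static int Partition(int[] a, int lo, int hi) {
        int pivot = a[hi - 1], i = lo;
        for (int j = lo; j < hi - 1; ++j)
            if (a[j] < pivot) { (a[i], a[j]) = (a[j], a[i]); ++i; }
        (a[i], a[hi - 1]) = (a[hi - 1], a[i]);
        return i;
    }
}
```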
DEMO: Recursive parallel QuickSort
Symmetric Data Processing
• For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
• Inter-iteration dependencies complicate things (think in-place blur)

```csharp
Parallel.For(0, image.Rows, i => {
    for (int j = 0; j < image.Cols; ++j) {
        destImage.SetPixel(i, j, PixelBlur(image, i, j));
    }
});
```
Uneven Work Distribution
• With non-uniform data items, use custom partitioning or manual distribution
– Primes: 7 is easier to check than 10,320,647

```csharp
var work = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(n => Task.Run(() =>
        CountPrimes(start + chunk*n, start + chunk*(n+1))));
Task.WaitAll(work.ToArray());
```

versus

```csharp
Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
    chunk => CountPrimes(chunk.Item1, chunk.Item2));
```
DEMO: Uneven workload distribution
Complex Dependency Management
• Must extract all dependencies and incorporate them into the algorithm
– Typical scenarios: 1D loops, dynamic programming algorithms
– Edit distance: each cell depends on its predecessors, so compute as a wavefront

```csharp
C = x[i-1] == y[j-1] ? 0 : 1;
D[i, j] = min(
    D[i-1, j] + 1,
    D[i, j-1] + 1,
    D[i-1, j-1] + C);
```

(Diagram: the wavefront sweeps the D matrix diagonally from cell 0,0 to cell m,n.)
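The wavefront can be sketched as a sweep over anti-diagonals: every cell on diagonal i + j = d depends only on diagonals d-1 and d-2, so each diagonal can be computed in parallel. A per-diagonal `Parallel.For` is shown for clarity; real code would use coarser chunks to amortize the per-diagonal overhead:

```csharp
using System;
using System.Threading.Tasks;

class Wavefront {
    // Edit distance computed in anti-diagonal sweeps.
    public static int EditDistance(string x, string y) {
        int m = x.Length, n = y.Length;
        var D = new int[m + 1, n + 1];
        for (int i = 0; i <= m; ++i) D[i, 0] = i;
        for (int j = 0; j <= n; ++j) D[0, j] = j;

        for (int d = 2; d <= m + n; ++d) {             // diagonal: i + j == d
            int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
            Parallel.For(iMin, iMax + 1, i => {        // cells on one diagonal are independent
                int j = d - i;
                int c = x[i - 1] == y[j - 1] ? 0 : 1;
                D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                                   D[i - 1, j - 1] + c);
            });
        }
        return D[m, n];
    }
}
```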
DEMO: Dependency management
Synchronization > Aggregation
• Excessive synchronization brings parallel code to its knees
– Try to avoid shared state
– Aggregate thread- or task-local state and merge at the end

```csharp
Parallel.ForEach(
    Partitioner.Create(Start, End, ChunkSize),
    () => new List<int>(),              // initial local state
    (range, pls, localPrimes) => {      // aggregator
        for (int i = range.Item1; i < range.Item2; ++i)
            if (IsPrime(i))
                localPrimes.Add(i);
        return localPrimes;
    },
    localPrimes => {                    // combiner
        lock (primes)
            primes.AddRange(localPrimes);
    });
```
DEMO: Aggregation
Creative Synchronization
• We implement a collection of stock prices, initialized with 10^5 name/price pairs
– 10^7 reads/s, 10^6 "update" writes/s, 10^3 "add" writes/day
– Many reader threads, many writer threads

```
GET(key):
    if safe contains key then return safe[key]
    lock { return unsafe[key] }

PUT(key, value):
    if safe contains key then safe[key] = value
    lock { unsafe[key] = value }
```
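A literal C# rendering of the GET/PUT pseudocode above might look like this. It is a sketch, not production code: it leans on the slide's assumptions that the "safe" dictionary's key set is fixed after construction, that updating an existing entry in place never restructures the dictionary, and that reads/writes of the value are atomic:

```csharp
using System;
using System.Collections.Generic;

// Two-level scheme: keys known at startup live in a read-mostly "safe"
// dictionary accessed without locks; rare new keys go through a locked
// "unsafe" dictionary.
class StockPrices {
    readonly Dictionary<string, double> _safe;    // fixed key set, lock-free access
    readonly Dictionary<string, double> _unsafe = new Dictionary<string, double>();
    readonly object _lock = new object();

    public StockPrices(IDictionary<string, double> initial) {
        _safe = new Dictionary<string, double>(initial);
    }

    public double Get(string key) {
        if (_safe.TryGetValue(key, out var price)) return price;
        lock (_lock) return _unsafe[key];
    }

    public void Put(string key, double value) {
        if (_safe.ContainsKey(key)) { _safe[key] = value; return; }
        lock (_lock) _unsafe[key] = value;
    }
}
```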
Lock-Free Patterns (1)
• Try to avoid Windows synchronization and use hardware synchronization
– Primitive operations such as Interlocked.Increment and Interlocked.CompareExchange
– A retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms
```csharp
int InterlockedMultiply(ref int x, int y) {
    int t, r;
    do {
        t = x;
        r = t * y;
    } while (Interlocked.CompareExchange(ref x, r, t) != t);
    return r;
}
```

(Diagram: CompareExchange stores the new value only if the location still holds the comparand — the old value.)
Lock-Free Patterns (2)
• User-mode spinlocks (the SpinLock class) can replace locks that you acquire very often and that protect tiny computations

```csharp
class __DontUseMe__SpinLock {
    private volatile int _lck;
    public void Enter() {
        while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0) ;
    }
    public void Exit() {
        _lck = 0;
    }
}
```
Miscellaneous Tips (1)
• Don’t mix several concurrency frameworks in the same process
• Some parallel work is best organized in pipelines – TPL DataFlow
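A minimal DataFlow pipeline, assuming the System.Threading.Tasks.Dataflow package is referenced: one transform stage feeding one action stage, with completion propagated down the chain. The stages run concurrently, and each can be given its own degree of parallelism:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;  // NuGet package: System.Threading.Tasks.Dataflow

class Pipeline {
    static async Task Main() {
        // Stage 1: parse strings to ints, up to 4 items at a time.
        var parse = new TransformBlock<string, int>(s => int.Parse(s),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // Stage 2: consume the parsed values.
        var print = new ActionBlock<int>(n => Console.WriteLine(n * n));

        // Link the stages; completing the head completes the tail.
        parse.LinkTo(print, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var s in new[] { "1", "2", "3" }) parse.Post(s);
        parse.Complete();
        await print.Completion;   // wait for the whole pipeline to drain
    }
}
```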
Miscellaneous Tips (2)
•Some parallel work can be offloaded to the GPU – C++ AMP
```cpp
void vadd_exp(float* x, float* y, float* z, int n) {
    array_view<const float, 1> avX(n, x), avY(n, y);
    array_view<float, 1> avZ(n, z);
    avZ.discard_data();
    parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
        avZ[i] = avX[i] + fast_math::exp(avY[i]);
    });
    avZ.synchronize();
}
```
Miscellaneous Tips (3)
•Invest in SIMD parallelization of heavy math or data-parallel algorithms
–Already available on Mono (Mono.Simd)
•Make sure to take cache effects into account, especially on MP systems
```
START:
    movups xmm0, [esi+4*ecx]
    addps  xmm0, [edi+4*ecx]
    movups [ebx+4*ecx], xmm0
    sub    ecx, 4
    jns    START
```
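On the cache-effects point: when per-thread counters sit in adjacent array slots, they share a cache line and the line ping-pongs between cores ("false sharing"). A sketch that contrasts adjacent slots with padded slots — the measured gap varies widely by machine:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharing {
    const long Iterations = 10_000_000;

    // Each worker hammers its own slot. With stride 1 the slots share a
    // cache line; with stride 8 (8 longs = 64 bytes) each slot gets its own line.
    public static long TimeCounters(int stride) {
        var counters = new long[Environment.ProcessorCount * stride];
        var sw = Stopwatch.StartNew();
        Parallel.For(0, Environment.ProcessorCount, p => {
            for (long k = 0; k < Iterations; ++k)
                counters[p * stride]++;
        });
        return sw.ElapsedMilliseconds;
    }

    static void Main() {
        Console.WriteLine($"adjacent slots: {TimeCounters(1)} ms");
        Console.WriteLine($"padded slots:   {TimeCounters(8)} ms");
    }
}
```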
Summary
• Avoid shared state and synchronization
• Parallelize judiciously and apply thresholds
• Measure and understand performance gains or losses
• Concurrency and parallelism are still hard
• A body of best practices, tips, patterns, and examples is being built
Additional References
THANK YOU!
Sasha Goldshtein
CTO, Sela Group
blog.sashag.net
@goldshtn