Task and Data Parallelism: Real-World Examples
![Page 1: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/1.jpg)
Sasha Goldshtein
CTO, Sela Group
@goldshtn
blog.sashag.net
Task and Data Parallelism: Real-World Examples
![Page 2: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/2.jpg)
www.devconnections.com
GARBAGE COLLECTION PERFORMANCE TIPS
AGENDA
Multicore machines have been a cheap commodity for >10 years
Adoption of concurrent programming is still slow
Patterns and best practices are scarce
We discuss the APIs first…
…and then turn to examples, best practices, and tips
![Page 3: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/3.jpg)
TPL EVOLUTION
The Future
2012
• DataFlow in .NET 4.5 (NuGet)
• Augmented with language support (await, async methods)
2010
• Released in full glory with .NET 4.0
2008
• Incubated for 3 years as “Parallel Extensions for .NET”
![Page 4: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/4.jpg)
TASKS
A task is a unit of work
May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
Much more than threads, and yet much cheaper

    Task<string> t = Task.Factory.StartNew(
        () => { return DnaSimulation(…); });
    t.ContinueWith(r => Show(r.Exception),
        TaskContinuationOptions.OnlyOnFaulted);
    t.ContinueWith(r => Show(r.Result),
        TaskContinuationOptions.OnlyOnRanToCompletion);
    DisplayProgress();

    // The C# 5.0 version
    try {
        var task = Task.Run(DnaSimulation);
        DisplayProgress();
        Show(await task);
    }
    catch (Exception ex) {
        Show(ex);
    }
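A minimal, compilable sketch of the await-based version. The `TaskDemo` class and its toy `DnaSimulation` are hypothetical stand-ins (the real simulation is not shown in the slides); only `Task.Run` and `await` come from the slide.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskDemo
{
    // Hypothetical stand-in for the slide's DnaSimulation: any CPU-bound function works.
    public static string DnaSimulation(int length)
    {
        var bases = new[] { 'A', 'C', 'G', 'T' };
        var result = new char[length];
        for (int i = 0; i < length; ++i)
            result[i] = bases[(i * 7 + 3) % 4];
        return new string(result);
    }

    public static async Task<string> RunAsync(int length)
    {
        // Task.Run queues the work to the thread pool and returns immediately.
        Task<string> task = Task.Run(() => DnaSimulation(length));

        // Other work (e.g. progress display) could run here while the task is in flight.
        try
        {
            // await yields the task's result, or rethrows the exception it failed with.
            return await task;
        }
        catch (Exception ex)
        {
            return "failed: " + ex.Message;
        }
    }
}
```

A caller that cannot itself be async can block on the result with `TaskDemo.RunAsync(8).GetAwaiter().GetResult()`.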
![Page 5: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/5.jpg)
PARALLEL LOOPS
Ideal for parallelizing work over a collection of data
Easy porting of for and foreach loops
Beware of inter-iteration dependencies!

    Parallel.For(0, 100, i => {
        ...
    });

    Parallel.ForEach(urls, url => {
        webClient.Post(url, options, data);
    });
![Page 6: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/6.jpg)
PARALLEL LINQ
Mind-bogglingly easy parallelization of LINQ queries
Can introduce ordering into the pipeline, or preserve order of original elements

    var query = from monster in monsters.AsParallel()
                where monster.IsAttacking
                let newMonster = SimulateMovement(monster)
                orderby newMonster.XP
                select newMonster;
    query.ForAll(monster => Move(monster));
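Preserving the order of the original elements is opt-in. A small sketch, assuming a hypothetical `PlinqDemo` helper: `AsOrdered` asks PLINQ to keep source order in the result, whereas `ForAll` makes no ordering promise.

```csharp
using System.Linq;

public static class PlinqDemo
{
    // Squares the even numbers in parallel while keeping them in source order.
    public static int[] SquaresOfEvens(int[] input)
    {
        return input.AsParallel()
                    .AsOrdered()                 // preserve the order of the source elements
                    .Where(x => x % 2 == 0)
                    .Select(x => x * x)
                    .ToArray();
    }
}
```

Without `AsOrdered`, the same query may return the squares in any order, which is cheaper when order does not matter.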
![Page 7: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/7.jpg)
MEASURING CONCURRENCY
Visual Studio Concurrency Visualizer to the rescue
![Page 8: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/8.jpg)
RECURSIVE PARALLELISM EXTRACTION
Divide-and-conquer algorithms are often parallelized through the recursive call
Be careful with the parallelization threshold and watch out for dependencies

    void FFT(float[] src, float[] dst, int n, int r, int s) {
        if (n == 1) {
            dst[r] = src[r];
        } else {
            FFT(src, dst, n/2, r, s*2);
            FFT(src, dst, n/2, r+s, s*2);
            // Combine the two halves in O(n) time
        }
    }

    // Parallel version of the two recursive calls
    Parallel.Invoke(
        () => FFT(src, dst, n/2, r, s*2),
        () => FFT(src, dst, n/2, r+s, s*2));
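The threshold idea can be sketched on a simpler divide-and-conquer problem than the FFT. The example below is an illustration, not the talk's code: a hypothetical `ParallelSum.Sum` splits an array range recursively with `Parallel.Invoke` and falls back to a sequential loop once the range is smaller than `threshold`.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelSum
{
    // Sums data[lo..hi) recursively; ranges of at most 'threshold' elements run sequentially.
    public static long Sum(int[] data, int lo, int hi, int threshold)
    {
        threshold = Math.Max(1, threshold);   // guard against non-terminating recursion
        if (hi - lo <= threshold)
        {
            long total = 0;
            for (int i = lo; i < hi; ++i)
                total += data[i];
            return total;
        }

        int mid = lo + (hi - lo) / 2;
        long left = 0, right = 0;
        // The two halves touch disjoint ranges, so there are no dependencies between them.
        Parallel.Invoke(
            () => left = Sum(data, lo, mid, threshold),
            () => right = Sum(data, mid, hi, threshold));
        return left + right;
    }
}
```

A threshold that is too small drowns the useful work in scheduling overhead; one that is too large leaves cores idle. The right value is something to measure rather than guess.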
![Page 9: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/9.jpg)
SYMMETRIC DATA PROCESSING
For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
Inter-iteration dependencies complicate things (think in-place blur)

    Parallel.For(0, image.Rows, i => {
        for (int j = 0; j < image.Cols; ++j) {
            destImage.SetPixel(i, j, PixelBlur(image, i, j));
        }
    });
![Page 10: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/10.jpg)
UNEVEN WORK DISTRIBUTION
With non-uniform data items, use custom partitioning or manual distribution
Primes: 7 is easier to check than 10,320,647

    var work = Enumerable.Range(0, Environment.ProcessorCount)
        .Select(n => Task.Run(() =>
            CountPrimes(start+chunk*n, start+chunk*(n+1))));
    Task.WaitAll(work.ToArray());

vs.

    Parallel.ForEach(
        Partitioner.Create(Start, End, chunkSize),
        chunk => CountPrimes(chunk.Item1, chunk.Item2));
![Page 11: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/11.jpg)
COMPLEX DEPENDENCY MANAGEMENT
Must extract all dependencies and incorporate them into the algorithm
Typical scenarios: 1D loops, dynamic programming algorithms
Edit distance: each task depends on 2 predecessors (wavefront computation)

    C = x[i-1] == y[j-1] ? 0 : 1;
    D[i, j] = min(D[i-1, j] + 1,
                  D[i, j-1] + 1,
                  D[i-1, j-1] + C);

(Figure: the wavefront sweeps the D matrix from cell 0,0 to cell m,n)
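One way to sketch the wavefront, assuming a hypothetical `EditDistance.Distance` helper: all cells on the anti-diagonal i + j = d depend only on the two previous anti-diagonals, so each diagonal can be filled with a parallel loop. This is an illustration of the dependency structure; per-cell iterations are far too fine-grained to pay off, and a practical version would group cells into larger tiles.

```csharp
using System;
using System.Threading.Tasks;

public static class EditDistance
{
    public static int Distance(string x, string y)
    {
        int m = x.Length, n = y.Length;
        var D = new int[m + 1, n + 1];
        for (int i = 0; i <= m; ++i) D[i, 0] = i;   // delete i characters
        for (int j = 0; j <= n; ++j) D[0, j] = j;   // insert j characters

        // Sweep anti-diagonals d = i + j; cells on one diagonal are mutually independent.
        for (int d = 2; d <= m + n; ++d)
        {
            int iMin = Math.Max(1, d - n);
            int iMax = Math.Min(m, d - 1);
            Parallel.For(iMin, iMax + 1, i =>
            {
                int j = d - i;
                int c = x[i - 1] == y[j - 1] ? 0 : 1;
                D[i, j] = Math.Min(
                    Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                    D[i - 1, j - 1] + c);
            });
        }
        return D[m, n];
    }
}
```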
![Page 12: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/12.jpg)
SYNCHRONIZATION > AGGREGATION
Excessive synchronization brings parallel code to its knees
Try to avoid shared state, or minimize access to it
Aggregate thread- or task-local state and merge later

    Parallel.ForEach(
        Partitioner.Create(Start, End, ChunkSize),
        () => new List<int>(),                  // initial local state
        (range, pls, localPrimes) => {          // aggregator
            for (int i = range.Item1; i < range.Item2; ++i)
                if (IsPrime(i)) localPrimes.Add(i);
            return localPrimes;
        },
        localPrimes => {                        // combiner
            lock (primes)
                primes.AddRange(localPrimes);
        });
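A self-contained version of the same aggregation pattern, as a sketch. The `PrimeAggregation` class, its trial-division `IsPrime`, and the final sort are assumptions added to make the fragment compile and give a deterministic result; the `Parallel.ForEach` call has the same shape as on the slide.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PrimeAggregation
{
    // Simple trial division; assumed here only to make the sketch self-contained.
    public static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int d = 2; (long)d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    public static List<int> FindPrimes(int start, int end, int chunkSize)
    {
        var primes = new List<int>();
        Parallel.ForEach(
            Partitioner.Create(start, end, chunkSize),    // ranges [Item1, Item2)
            () => new List<int>(),                        // one private list per worker
            (range, loopState, localPrimes) =>
            {
                for (int i = range.Item1; i < range.Item2; ++i)
                    if (IsPrime(i)) localPrimes.Add(i);   // no lock: the list is not shared
                return localPrimes;
            },
            localPrimes =>
            {
                lock (primes)                             // one lock per worker, not per prime
                    primes.AddRange(localPrimes);
            });
        primes.Sort();   // workers finish in arbitrary order
        return primes;
    }
}
```

The lock is taken once per worker instead of once per prime found, which is the point of the pattern.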
![Page 13: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/13.jpg)
CREATIVE SYNCHRONIZATION
We implement a collection of stock prices, initialized with 10^5 name/price pairs
10^7 reads/s, 10^6 “update” writes/s, 10^3 “add” writes/day
Many reader threads, many writer threads

    GET(key):
        if safe contains key then return safe[key]
        lock { return unsafe[key] }

    PUT(key, value):
        if safe contains key then safe[key] = value
        else lock { unsafe[key] = value }
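The GET/PUT pseudocode could be realized along these lines. This is a sketch under stated assumptions, not the talk's implementation: the `StockPrices` class, the `Cell` holder, and prices stored as `long` cents are all hypothetical. The "safe" dictionary is never modified structurally after construction, so concurrent lookups need no lock; updates to existing keys go through `Interlocked` on a per-key cell.

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class StockPrices
{
    // Mutable holder so that updates never touch the dictionary itself.
    private sealed class Cell { public long Cents; }

    private readonly Dictionary<string, Cell> _safe = new Dictionary<string, Cell>();
    private readonly Dictionary<string, long> _unsafe = new Dictionary<string, long>();
    private readonly object _lock = new object();

    public StockPrices(IEnumerable<KeyValuePair<string, long>> initial)
    {
        foreach (var pair in initial)
            _safe[pair.Key] = new Cell { Cents = pair.Value };
    }

    public bool TryGet(string key, out long cents)
    {
        Cell cell;
        if (_safe.TryGetValue(key, out cell))        // lock-free fast path
        {
            cents = Interlocked.Read(ref cell.Cents);
            return true;
        }
        lock (_lock)                                 // rare path: keys added later
            return _unsafe.TryGetValue(key, out cents);
    }

    public void Put(string key, long cents)
    {
        Cell cell;
        if (_safe.TryGetValue(key, out cell))        // frequent "update" write
        {
            Interlocked.Exchange(ref cell.Cents, cents);
            return;
        }
        lock (_lock)                                 // rare "add" write
            _unsafe[key] = cents;
    }
}
```

The design bets on the workload numbers: nearly all traffic hits the initial keys, so the lock is only contended by the rare added keys.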
![Page 14: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/14.jpg)
LOCK-FREE PATTERNS (1)
Try to avoid Windows synchronization and use hardware synchronization
Primitive operations such as Interlocked.Increment, Interlocked.CompareExchange
Retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms

    int InterlockedMultiply(ref int x, int y) {
        int t, r;
        do {
            t = x;
            r = t * y;
        } while (Interlocked.CompareExchange(ref x, r, t) != t);
        return r;
    }

In Interlocked.CompareExchange(ref x, r, t), r is the new value, t is the comparand, and the return value is the old value
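The same retry pattern applies to other read-modify-write operations. As an illustration, a hypothetical `LockFree.InterlockedMax` (not part of the .NET `Interlocked` class) raises a shared value to a candidate if the candidate is larger, retrying whenever another thread changed the value in between.

```csharp
using System.Threading;

public static class LockFree
{
    // Atomically sets 'location' to max(location, candidate) and returns the resulting maximum.
    public static int InterlockedMax(ref int location, int candidate)
    {
        int seen;
        do
        {
            seen = Volatile.Read(ref location);
            if (candidate <= seen)
                return seen;              // nothing to do: current value is already larger
            // Publish 'candidate' only if 'location' still holds the value we read.
        } while (Interlocked.CompareExchange(ref location, candidate, seen) != seen);
        return candidate;
    }
}
```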
![Page 15: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/15.jpg)
LOCK-FREE PATTERNS (2)
User-mode spinlocks (SpinLock class) can replace locks you acquire very often, which protect tiny computations

    class __DontUseMe__SpinLock {
        private int _lck;
        public void Enter() {
            while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
        }
        public void Exit() {
            _lck = 0;
            Thread.MemoryBarrier();
        }
    }
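In place of a hand-rolled lock, the framework's `System.Threading.SpinLock` can be used roughly as follows. The `SpinLockDemo` class and its counter are hypothetical; the protected region is deliberately tiny, which is the case the slide describes.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class SpinLockDemo
{
    // SpinLock is a struct: keep it in a non-readonly field and never copy it.
    private static SpinLock _spin = new SpinLock(enableThreadOwnerTracking: false);
    private static int _counter;

    public static int CountTo(int iterations)
    {
        _counter = 0;
        Parallel.For(0, iterations, _ =>
        {
            bool taken = false;
            try
            {
                _spin.Enter(ref taken);   // spins in user mode instead of blocking
                _counter++;               // tiny critical section
            }
            finally
            {
                if (taken) _spin.Exit();
            }
        });
        return _counter;
    }
}
```

For a plain counter, `Interlocked.Increment` would be simpler still; the spinlock is worth considering when the protected region is a few instructions that cannot be expressed as a single interlocked operation.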
![Page 16: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/16.jpg)
MISCELLANEOUS TIPS (1)
Don’t mix several concurrency frameworks in the same process
Some parallel work is best organized in pipelines – TPL DataFlow
BroadcastBlock<Uri> → TransformBlock<Uri, byte[]> → TransformBlock<byte[], string> → ActionBlock<string>
![Page 17: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/17.jpg)
MISCELLANEOUS TIPS (2)
Some parallel work can be offloaded to the GPU – C++ AMP

    void vadd_exp(float* x, float* y, float* z, int n) {
        array_view<const float,1> avX(n, x), avY(n, y);
        array_view<float,1> avZ(n, z);
        avZ.discard_data();
        parallel_for_each(avZ.extent, [=](index<1> i) ... {
            avZ[i] = avX[i] + fast_math::exp(avY[i]);
        });
        avZ.synchronize();
    }
![Page 18: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/18.jpg)
MISCELLANEOUS TIPS (3)
Invest in SIMD parallelization of heavy math or data-parallel algorithms
Make sure to take cache effects into account, especially on MP systems

    START:
        movups xmm0, [esi+4*ecx]
        addps  xmm0, [edi+4*ecx]
        movups [ebx+4*ecx], xmm0
        sub    ecx, 4
        jns    START
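From managed code, a comparable vector addition can be sketched with `System.Numerics.Vector<T>`, which the JIT maps to SIMD instructions where the hardware supports them. This API is newer than the talk and is offered as an assumption about how one might do it today; the `SimdDemo` class is hypothetical.

```csharp
using System.Numerics;

public static class SimdDemo
{
    // result[i] = a[i] + b[i]; all three arrays are assumed to have the same length.
    public static void Add(float[] a, float[] b, float[] result)
    {
        int width = Vector<float>.Count;   // floats per SIMD register on this machine
        int i = 0;

        // Vectorized body: 'width' additions per iteration.
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(result, i);
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < a.Length; ++i)
            result[i] = a[i] + b[i];
    }
}
```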
![Page 19: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/19.jpg)
SUMMARY
Avoid shared state and synchronization
Parallelize judiciously and apply thresholds
Measure and understand performance gains or losses
Concurrency and parallelism are still hard
A body of best practices, tips, patterns, examples is being built
![Page 20: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/20.jpg)
ADDITIONAL REFERENCES
![Page 21: Task and Data Parallelism: Real-World Examples](https://reader036.fdocuments.in/reader036/viewer/2022062707/5585a760d8b42a6c1a8b4bcf/html5/thumbnails/21.jpg)
THANK YOU!
Sasha Goldshtein
@goldshtn