B1 Data Parallel
![Page 1: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/1.jpg)
ARCS 2008
Data parallel algorithms, algorithmic building blocks, precision vs. accuracy
Robert Strzodka
ARCS 2008 - Architecture of Computing Systems, GPGPU and CUDA Tutorials
Dresden, Germany, February 25 2008
![Page 2: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/2.jpg)
Overview
• Parallel Processing on GPUs
• Types of Parallel Data Flow
• Parallel Prefix or Scan
• Precision and Accuracy
![Page 8: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/8.jpg)
The GPU is a Fast, Parallel Array Processor
Input Arrays: 1D, 2D (typical), 3D
Vertex Processor (VP): kernel changes index regions of output arrays
Rasterizer: creates data streams from index regions
Stream of array elements, order unknown
Fragment Processor (FP): kernel changes each datum independently, reads more input arrays
Output Arrays: 1D, 2D (typical), 3D (slice)
![Page 11: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/11.jpg)
The GPU is a Fast, Highly Multi-Threaded Processor
Input Arrays: nD
Start thousands of parallel threads in groups of m, e.g. 32
Each group operates in a SIMD fashion, with predication if necessary
In general all threads are independent, but certain collections of groups may use local memory to exchange data
Output Arrays: nD
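SIMD execution with predication can be sketched in plain Python. This is only an illustration under the assumption that a group evaluates both sides of a divergent branch and a per-lane mask selects the result; the function name `simd_group_select` and the sample data are hypothetical and not taken from the slides.

```python
def simd_group_select(values):
    """Emulate one SIMD group running `x = -x if x < 0 else 2 * x` with predication."""
    # Every lane evaluates the predicate for its own element.
    mask = [v < 0 for v in values]
    # The whole group executes BOTH branches in lockstep ...
    then_result = [-v for v in values]
    else_result = [2 * v for v in values]
    # ... and the predicate decides which result each lane keeps.
    return [t if m else e for m, t, e in zip(mask, then_result, else_result)]


group = [3, -1, 7, 0, -4, 1, -6, 3]   # one small group of "threads"
result = simd_group_select(group)
print(result)  # [6, 1, 14, 0, 4, 2, 6, 6]
```

The cost of a divergent branch in this model is the sum of both paths, which is why uniform control flow within a group is preferable.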
![Page 12: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/12.jpg)
Native Memory Layout - Data Locality
CPU
• 1D input
• 1D output
• Other dimensions with offsets
GPU
• 2D input
• 2D output
• Other dimensions with offsets
Color-coded locality: red (near), blue (far)
![Page 13: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/13.jpg)
Primitive Index Regions in Output Arrays
Output regions:
• Quads and triangles
– Fastest option
• Line segments
– Slower, try to pair lines to 2xh, wx2 quads
• Point clouds
– Slowest, try to gather points into larger forms
![Page 14: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/14.jpg)
GPUs are Optimized for Local Data Access
• CPU
– Large cache
– Few processing elements
– Optimized for spatial and temporal data reuse
• GPU
– Small cache
– Many processing elements
– Optimized for sequential (streaming) data access
Chart: memory access types (Cache, Sequential, Random) for GeForce 7800 GTX and Pentium 4; chart courtesy of Ian Buck
![Page 15: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/15.jpg)
Input and Output Arrays
CPU
• Input and output arrays may overlap
GPU
• Input and output arrays must not overlap
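A common way to respect the no-overlap rule in an iterative computation is to alternate between two buffers ("ping-pong"). The sketch below is a plain Python stand-in, not GPU code; the smoothing kernel, `smooth_step` and `run_ping_pong` are hypothetical names chosen for illustration.

```python
def smooth_step(src, dst):
    """One smoothing pass: dst is only written, src is only read."""
    n = len(src)
    for i in range(n):  # every i is independent, so this loop could run in parallel
        left = src[max(i - 1, 0)]
        right = src[min(i + 1, n - 1)]
        dst[i] = (left + src[i] + right) / 3.0


def run_ping_pong(data, passes):
    """Alternate two separate buffers so input and output never overlap."""
    read_buf = list(data)
    write_buf = [0.0] * len(data)
    for _ in range(passes):
        smooth_step(read_buf, write_buf)
        read_buf, write_buf = write_buf, read_buf  # swap roles for the next pass
    return read_buf  # holds the result of the last pass


result = run_ping_pong([0.0, 0.0, 9.0, 0.0, 0.0], passes=1)
print(result)  # [0.0, 3.0, 3.0, 3.0, 0.0]
```

Writing into `src` directly would let later elements read already-updated neighbours, which changes the result and is exactly what the separate output array prevents.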
![Page 16: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/16.jpg)
Configuration Overhead
Chart: configuration-limited vs. computation-limited regimes; chart courtesy of Ian Buck
![Page 17: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/17.jpg)
Overview
• Parallel Processing on GPUs
• Types of Parallel Data Flow
• Parallel Prefix or Scan
• Precision and Accuracy
![Page 18: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/18.jpg)
Parallel Data-Flow: Map, Gather and Scatter
Map: x = f(a)
Gather: x = f( a(1,2), a(3,5), … )
Scatter: x(2,3) = f(a), x(6,7) = g(a), …
General: x(2,3) = f( a(1,2), a(3,5), … ), x(6,7) = f( a(6,2), a(7,5), … ), …
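The three basic data-flow types can be illustrated on a 1D array. This is a minimal sequential sketch; the helper names `map_op`, `gather_op` and `scatter_op` and the index lists are illustrative assumptions, not part of the original slides.

```python
def map_op(f, a):
    """Map: x[i] = f(a[i]); output i depends only on input i."""
    return [f(v) for v in a]


def gather_op(f, a, index_lists):
    """Gather: x[i] = f(a[j0], a[j1], ...); each output reads from computed addresses."""
    return [f([a[j] for j in idx]) for idx in index_lists]


def scatter_op(f, a, targets, size, fill=0):
    """Scatter: x[targets[i]] = f(a[i]); each input writes to a computed address."""
    x = [fill] * size
    for value, t in zip(a, targets):
        x[t] = f(value)
    return x


a = [3, 1, 7, 0]
mapped = map_op(lambda v: v * v, a)                                # [9, 1, 49, 0]
gathered = gather_op(sum, a, [[0, 1], [1, 2], [2, 3]])             # [4, 8, 7]
scattered = scatter_op(lambda v: v + 1, a, [2, 0, 3, 1], size=4)   # [2, 1, 4, 8]
print(mapped, gathered, scattered)
```

In a gather the write address is fixed and the read addresses are computed; in a scatter it is the other way round, which is why colliding scatter targets need extra care in a parallel setting.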
![Page 19: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/19.jpg)
Performance of Gather and Scatter
Chart courtesy of Naga Govindaraju
![Page 20: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/20.jpg)
Parallel Reduction, e.g. Maximum of an Array
• input array
• N/2 x N/2 output
• gather 2x2 regions for each output
• first output: maximum of 2x2 region
• intermediates
• input, intermediates, result
![Page 27: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/27.jpg)
Parallel Reduction, e.g. Maximum of an Array
• For commutative operators (e.g. +, *, max) this is encapsulated into a single function call.
• For a more detailed discussion, see Mark Harris' CUDA optimization talk from SC 2007: http://www.gpgpu.org/sc2007/
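The 2x2 gather reduction can be sketched as follows. The code is a sequential Python illustration that assumes a square input whose side length is a power of two; `reduce_step`, `parallel_reduce` and the sample grid are hypothetical.

```python
def reduce_step(grid, op):
    """Gather each 2x2 region of an N x N grid into one element of an N/2 x N/2 grid."""
    half = len(grid) // 2
    return [
        [
            op(grid[2 * r][2 * c], grid[2 * r][2 * c + 1],
               grid[2 * r + 1][2 * c], grid[2 * r + 1][2 * c + 1])
            for c in range(half)   # every output element is independent
        ]
        for r in range(half)
    ]


def parallel_reduce(grid, op):
    """Repeat the 2x2 gather until a single value is left (log2 N passes)."""
    while len(grid) > 1:
        grid = reduce_step(grid, op)
    return grid[0][0]


grid = [
    [3, 1, 7, 0],
    [4, 1, 6, 3],
    [2, 9, 5, 8],
    [0, 4, 2, 1],
]
intermediate = reduce_step(grid, max)   # [[4, 7], [9, 8]]
result = parallel_reduce(grid, max)     # 9
print(intermediate, result)
```

Each pass writes a new, smaller array, so the input and output of a pass never overlap.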
![Page 28: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/28.jpg)
Overview
• Parallel Processing on GPUs
• Types of Parallel Data Flow
• Parallel Prefix or Scan
• Precision and Accuracy
Slides courtesy of Shubho Sengupta
![Page 29: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/29.jpg)
Motivation
• Stream Compaction
Input: 3 0 7 0 4 1 0 3 (0 = Null)
Output: 3 7 4 1 3
• Split
Input: T F F T F F T F
Output: F F F F F T T T
![Page 30: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/30.jpg)
Motivation
• Common scenarios in parallel computing
– Variable output per thread
– Threads want to perform a split: radix sort, building trees
• "What came before/after me?"
• "Where do I start writing my data?"
• Scan answers these questions
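How scan answers "where do I start writing my data?" can be shown with stream compaction: an exclusive scan over keep-flags gives every surviving element its output position. This is a sequential sketch with hypothetical helper names; on a GPU the scan and the final scatter would each run in parallel.

```python
def exclusive_scan(values):
    """out[i] = sum of values[0..i-1] (sequential stand-in for a parallel scan)."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def compact(values, keep):
    """Remove elements for which keep(v) is false, preserving order."""
    flags = [1 if keep(v) else 0 for v in values]
    offsets = exclusive_scan(flags)          # write position of each kept element
    count = offsets[-1] + flags[-1] if values else 0
    out = [None] * count
    for v, flag, offset in zip(values, flags, offsets):
        if flag:                             # independent scatter per element
            out[offset] = v
    return out


result = compact([3, 0, 7, 0, 4, 1, 0, 3], lambda v: v != 0)
print(result)  # [3, 7, 4, 1, 3]
```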
![Page 31: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/31.jpg)
Scan
• Each element is a sum of all the elements to the left of it (Exclusive)
• Each element is a sum of all the elements to the left of it and itself (Inclusive)
Input: 3 1 7 0 4 1 6 3
Exclusive: 0 3 4 11 11 15 16 22
Inclusive: 3 4 11 11 15 16 22 25
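The two definitions can be checked against the example input 3 1 7 0 4 1 6 3 with a short reference implementation. The function names are illustrative, and the code is a sequential specification of the result rather than a parallel algorithm.

```python
from itertools import accumulate
import operator


def inclusive_scan(values, op=operator.add):
    """out[i] = values[0] op ... op values[i]."""
    return list(accumulate(values, op))


def exclusive_scan(values, op=operator.add, identity=0):
    """out[i] = identity op values[0] op ... op values[i-1]."""
    if not values:
        return []
    return [identity] + inclusive_scan(values, op)[:-1]


data = [3, 1, 7, 0, 4, 1, 6, 3]
exclusive = exclusive_scan(data)   # [0, 3, 4, 11, 11, 15, 16, 22]
inclusive = inclusive_scan(data)   # [3, 4, 11, 11, 15, 16, 22, 25]
print(exclusive, inclusive)
```

The exclusive result is the inclusive result shifted right by one position with the identity element inserted at the front.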
![Page 32: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/32.jpg)
Scan - the past
• First proposed in APL (1962)
• Used as a data parallel primitive in the Connection Machine (1990)
• Guy Blelloch used scan as a primitive for various parallel algorithms (1990)
![Page 33: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/33.jpg)
Scan - the present
• First GPU implementation by Daniel Horn (2004), O(n log n)
• Subsequent GPU implementations by
– Hensley (2005) O(n log n), Sengupta (2006) O(n), Greß (2006) O(n) 2D
• NVIDIA CUDA implementation by Mark Harris (2007), O(n), space efficient
![Page 34: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/34.jpg)
Scan - Reduce
3 1 7 0 4 1 6 3
3 4 7 7 4 5 6 9
3 4 7 11 4 5 6 14
3 4 7 11 4 5 6 25
• log n steps
• Work halves each step
• O(n) work
• In place, space efficient
![Page 35: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/35.jpg)
Scan - Down Sweep
3 4 7 11 4 5 6 25
3 4 7 11 4 5 6 0
3 4 7 0 4 5 6 11
3 0 7 4 4 11 6 16
0 3 4 11 11 15 16 22
• log n steps
• Work doubles each step
• O(n) work
• In place, space efficient
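The reduce and down-sweep phases together form a work-efficient exclusive scan. The sketch below runs the steps sequentially in Python and assumes a power-of-two input length; the function name is hypothetical. Each inner loop touches disjoint elements, so its iterations could execute in parallel.

```python
def work_efficient_exclusive_scan(values):
    """Exclusive sum scan via reduce (up-sweep) and down-sweep on a copy of the input."""
    x = list(values)
    n = len(x)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")

    # Reduce phase: log n steps, the number of additions halves each step.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            x[i] += x[i - stride]
        stride *= 2

    # Down-sweep phase: log n steps, the number of updates doubles each step.
    x[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = x[i - stride]
            x[i - stride] = x[i]   # pass the running prefix to the left child
            x[i] += left           # right child gets prefix plus left subtree sum
        stride //= 2
    return x


result = work_efficient_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])
print(result)  # [0, 3, 4, 11, 11, 15, 16, 22]
```

After the reduce phase the array holds 3 4 7 11 4 5 6 25, and the down-sweep turns it into the exclusive scan, matching the rows shown on the two slides.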
![Page 36: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/36.jpg)
Segmented Scan
• Input: 3 1 | 7 0 4 | 1 6 3
• Scan within each segment in parallel
• Output: 0 3 | 0 7 7 | 0 1 7
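A segmented exclusive scan over the segments 3 1 | 7 0 4 | 1 6 3 can be specified with head flags that mark the first element of each segment. This is a sequential reference sketch; the head-flag representation and the function name are assumptions made for illustration.

```python
def segmented_exclusive_scan(values, head_flags):
    """Exclusive sum scan that restarts at every element whose head flag is set."""
    out, running = [], 0
    for v, head in zip(values, head_flags):
        if head:
            running = 0          # a new segment starts here
        out.append(running)
        running += v
    return out


values = [3, 1, 7, 0, 4, 1, 6, 3]
head_flags = [1, 0, 1, 0, 0, 1, 0, 0]   # segments: 3 1 | 7 0 4 | 1 6 3
result = segmented_exclusive_scan(values, head_flags)
print(result)  # [0, 3, 0, 7, 7, 0, 1, 7]
```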
![Page 37: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/37.jpg)
Segmented Scan
• Introduced by Schwartz (1980)
• Forms the basis for a wide variety of algorithms
– Radixsort, Quicksort
– Sparse Matrix-Vector Multiply
– Convex Hull
– Solving recurrences
– Tree operations
![Page 38: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/38.jpg)
Segmented Scan - Large Input
![Page 39: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/39.jpg)
Segmented Scan - Advantages
• Operations in parallel over all the segments
• Irregular workload since segments can be of any length
• Can simulate divide-and-conquer recursion since additional segments can be generated
![Page 40: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/40.jpg)
Segmented Scan Variant: Tridiagonal Solver
• Tridiagonal system of n rows solved in parallel
• Then for each of the m columns in parallel
• Read pattern is similar to but more complex than scan
![Page 41: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/41.jpg)
Overview
• Parallel Processing on GPUs
• Types of Parallel Data Flow
• Parallel Prefix or Scan
• Precision and Accuracy
![Page 42: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/42.jpg)
The Erratic Roundoff Error
Figure: roundoff error for 0 = f(a) := |(1+a)^3 - (1+3a^2) - (3a+a^3)| in single precision and double precision. x-axis: x = log2( 1/a ), a = 1 / 2^x, from 0 to 50; y-axis: y = log2( f(a) ), from -100 to -20, where an error of 0 is plotted at 2^-100. Smaller is better.
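The experiment behind the plot can be reproduced approximately in Python. The sketch below is an assumption-laden illustration: single precision is emulated by rounding every intermediate result through `struct`, the evaluation order inside `f_single` is one possible choice, and all function names are hypothetical. The exact rational evaluation confirms that f(a) is identically zero, so any nonzero floating point value is pure roundoff.

```python
import struct
from fractions import Fraction


def to_single(x):
    """Round a Python float (double) to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def f_exact(a):
    """f(a) in exact rational arithmetic; always 0."""
    a = Fraction(a)
    return abs((1 + a) ** 3 - (1 + 3 * a ** 2) - (3 * a + a ** 3))


def f_double(a):
    """f(a) evaluated in double precision."""
    return abs((1 + a) ** 3 - (1 + 3 * a ** 2) - (3 * a + a ** 3))


def f_single(a):
    """f(a) with every intermediate result rounded to single precision."""
    s = to_single
    a = s(a)
    one_plus_a = s(1 + a)
    cube = s(s(one_plus_a * one_plus_a) * one_plus_a)
    a2 = s(a * a)
    a3 = s(a2 * a)
    t1 = s(1 + s(3 * a2))
    t2 = s(s(3 * a) + a3)
    return abs(s(s(cube - t1) - t2))


rows = []
for x in range(0, 51, 5):
    a = 2.0 ** -x                      # a = 1 / 2^x as on the slide
    rows.append((x, f_exact(a), f_double(a), f_single(a)))

for x, exact, dbl, sgl in rows:
    print(f"x={x:2d}  exact={exact}  double={dbl:.3e}  single={sgl:.3e}")
```

Printing the table for a finer range of x shows how irregularly the error changes with a, and that a higher working precision does not guarantee a smaller error for every input.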
![Page 43: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/43.jpg)
Precision and Accuracy
• There is no monotonic relation between the computational precision and the accuracy of the final result.
• Increasing precision can decrease accuracy!
• The increase or decrease of precision in different parts of a computation can have very different impact on the accuracy.
• The above can be exploited to significantly reduce the precision in parts of a computation without a loss in accuracy.
• We obtain a mixed precision method.
![Page 44: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/44.jpg)
Quantization - Preserving Accuracy
– Watch out for cancellation
• a ≈ b, r = c*a - c*b
• r = c(a-b)
– Maximize operations on the same scale
• a ∈ [0,1], b,c ∈ [10^-4, 10^-3], r = a+b+c
• r = a+(b+c)
– Make implicit relations between constants explicit
• a_i = 0.01, i = 0..99, r = Σ_{i<100} a_i ≠ 1
• a_99 = 1 - (Σ_{i<99} a_i), r = Σ_{i<100} a_i = 1
– Use symmetric intervals for multiplication
• a ~ [-1, 1], r = 0.1134*(a+1)
• r = 0.1134a + 0.1134
– Minimize the number of multiplications
• r = 0.25a + 0.1b + 0.15c
• r = 0.1(a+b) + 0.15(a+c)
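The rule about implicit relations between constants can be tried out in double precision. In this sketch (variable and function names are illustrative), one hundred copies of 0.01 are summed with a plain running sum, which in general does not give exactly 1 because 0.01 has no exact binary representation. Defining the last constant as 1 minus the sum of the first 99 makes the relation explicit and the total exact.

```python
def running_sum(terms):
    """Plain left-to-right floating point summation."""
    total = 0.0
    for t in terms:
        total += t
    return total


# Implicit relation: the author knows the constants should add up to 1,
# but the arithmetic does not.
naive = running_sum([0.01] * 100)

# Explicit relation: the last constant is DEFINED so that the sum is 1.
first_99 = [0.01] * 99
partial = running_sum(first_99)
last = 1.0 - partial                 # exact, since partial lies between 0.5 and 2
fixed = running_sum(first_99 + [last])

print(naive, naive == 1.0)
print(fixed, fixed == 1.0)
```

The subtraction `1.0 - partial` is exact for operands this close together, and adding it back reproduces 1.0 exactly, so `fixed == 1.0` holds.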
![Page 45: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/45.jpg)
Resources for Signed Integer Operations

| Operation | Area | Latency |
|---|---|---|
| min(r,0), max(r,0) | b+1 | 2 |
| add(r1,r2), sub(r1,r2) | 2b | b |
| add(r1,r2,r3)→add(r4,r5) | 2b | 1 |
| mult(r1,r2), sqr(r) | b(b-2) | b ld(b) |
| sqrt(r) | 2c(c-5) | c(c+3) |

b: bitlength of argument, c: bitlength of result
![Page 46: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/46.jpg)
FPGA Results: Conjugate Gradient (CG)
Figure: area of s??e11 float kernels on the xc2v500/xc2v8000 (CG). x-axis: bits of mantissa (20 to 50); y-axis: number of slices (0 to 1600); series: Adder, Multiplier, CG kernel normalized (1/30). Smaller is better.
[Göddeke et al. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, IJPEDS 2007]
![Page 47: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/47.jpg)
High Precision Emulation
• Given an m x m bit unsigned integer multiplier, we want to build an n x n multiplier with an n = k*m bit result
$$\sum_{i=1}^{k} 2^{-im} a_i \cdot \sum_{j=1}^{k} 2^{-jm} b_j = \sum_{\substack{i,j=1 \\ i+j \le k+1}}^{k} 2^{-(i+j)m} a_i b_j + \sum_{\substack{i,j=1 \\ i+j > k+1}}^{k} 2^{-(i+j)m} a_i b_j$$
• The evaluation of the first sum requires k(k+1)/2 multiplications, the evaluation of the second depends on the rounding mode
• For floating point numbers additional operations are necessary because of the mantissa/exponent distinction
• A double emulation with two aligned s23e8 single floats is less complex than an exact s52e11 double emulation, achieves a s46e8 precision and still requires 10-20 single float operations
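Building a wide multiplier from narrow ones can be sketched with Python integers. This illustration uses the integer view, in which limb i carries the weight 2^(i*m) instead of the fractional weights of a mantissa; the function names and test operands are hypothetical. Every partial product uses one m x m bit multiplication, and the helper at the end counts the index pairs with i + j <= k + 1, i.e. the terms of the first sum.

```python
def split_limbs(value, m, k):
    """Split a non-negative integer below 2**(k*m) into k limbs of m bits, least significant first."""
    mask = (1 << m) - 1
    return [(value >> (i * m)) & mask for i in range(k)]


def multiply_with_small_multiplier(a, b, m, k):
    """Full product of two k*m-bit integers using only m x m bit multiplications."""
    a_limbs = split_limbs(a, m, k)
    b_limbs = split_limbs(b, m, k)
    product, multiplications = 0, 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            product += (ai * bj) << ((i + j) * m)   # one narrow multiply, then shift and add
            multiplications += 1
    return product, multiplications


def leading_term_count(k):
    """Number of index pairs (i, j), 1 <= i, j <= k, with i + j <= k + 1."""
    return sum(1 for i in range(1, k + 1) for j in range(1, k + 1) if i + j <= k + 1)


a, b = 0xDEADBEEF, 0x12345678          # two 32-bit operands
product, multiplications = multiply_with_small_multiplier(a, b, m=8, k=4)
print(product == a * b, multiplications, leading_term_count(4))
```

The full product needs k*k narrow multiplications, while a result truncated to the leading n bits only needs the k(k+1)/2 leading terms plus whatever the rounding mode requires from the rest.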
![Page 48: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/48.jpg)
Precision - Performance Rough Estimates
• Reconfigurable device, e.g. FPGA
– 2x float add ≈ 1x double add
– 4x float mul ≈ 1x double mul
• Hardware emulation (compute area limited), e.g. GPU
– 2x float add ≈ 1x double add
– 5x float mul ≈ 1x double mul
• Hardware emulation (data path limited), e.g. CPU
– 2x float add ≈ 1x double add
– 2x float mul ≈ 1x double mul
• Software emulation
– 10x float add ≈ 1x double add
– 20x float mul ≈ 1x double mul
![Page 49: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/49.jpg)
Mixed Precision Iterative Refinement Ax=b
• Exploit the speed of low precision and obtain a result of high accuracy
d_k = b - A x_k: Compute in high precision (cheap)
A c_k = d_k: Solve in low precision (fast)
x_{k+1} = x_k + c_k: Correct in high precision (cheap)
k = k+1: Iterate until convergence in high precision
• Low precision solution is used as a pre-conditioner in a high precision iterative method
– A is small and dense: Solve A c_k = d_k directly
– A is large and sparse: Solve (approximately) A c_k = d_k with an iterative method itself
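The refinement loop for a small dense system can be sketched in pure Python. Several assumptions apply: single precision is emulated by rounding through `struct`, the low precision solver is a plain Gaussian elimination with partial pivoting that is redone in every iteration (a real code would factor A once and reuse the factors), and the matrix, right-hand side and function names are made up for illustration.

```python
import struct


def to_single(x):
    """Round a double to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(A, d):
    """Solve A c = d by Gaussian elimination, rounding every result to single precision."""
    n = len(d)
    M = [[to_single(v) for v in row] + [to_single(rhs)] for row, rhs in zip(A, d)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = to_single(M[r][col] / M[col][col])
            for c in range(col, n + 1):
                M[r][c] = to_single(M[r][c] - to_single(factor * M[col][c]))
    c_vec = [0.0] * n
    for r in range(n - 1, -1, -1):          # back substitution
        acc = M[r][n]
        for c in range(r + 1, n):
            acc = to_single(acc - to_single(M[r][c] * c_vec[c]))
        c_vec[r] = to_single(acc / M[r][r])
    return c_vec


def residual(A, x, b):
    """d = b - A x in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, iterations=10):
    x = [0.0] * len(b)
    for _ in range(iterations):
        d = residual(A, x, b)                    # compute in high precision
        c = solve_low_precision(A, d)            # solve in low precision
        x = [xi + ci for xi, ci in zip(x, c)]    # correct in high precision
    return x


A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]                            # exact solution is [1, 2, 3]
x = refine(A, b)
single_only = solve_low_precision(A, b)
print(x, single_only)
```

For a well-conditioned matrix like this one, each pass removes roughly as many digits of error as single precision carries, so a few iterations reach double precision accuracy although every solve runs in single precision.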
![Page 50: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/50.jpg)
CPU Results: LU Solver
Chart courtesy of Jack Dongarra. Larger is better.
[Langou et al. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems), SC 2006]
![Page 51: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/51.jpg)
GPU Results: Conjugate Gradient (CG) and Multigrid (MG)
Figure: performance of double precision CPU and mixed precision CPU-GPU solvers. x-axis: data level (6 to 10); y-axis: seconds per grid node (5e-7 to 5e-4); series: CG CPU, CG GPU, MG2+2 CPU, MG2+2 GPU. Smaller is better.
[Göddeke et al. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, IJPEDS 2007]
![Page 52: B1 Data Parallel](https://reader033.fdocuments.in/reader033/viewer/2022051417/55cf9152550346f57b8c8ef4/html5/thumbnails/52.jpg)
Conclusions
• Parallel Processing on GPUs is about identifying independent work and preserving data locality
• Map, gather, scatter are basic types of parallel data-flow.
• Parallel prefix (scan) enables the parallelization of many seemingly inherently sequential algorithms
• Precision ≠ accuracy! Mixed precision methods can reduce resource requirements quadratically.