Beyond Shared Memory Loop Parallelism in the Polyhedral Model
Tomofumi Yuki, Ph.D. Dissertation
10/30/2012
The Problem
Figure from www.spiral.net/problem.html
2
Parallel Processing
n A small niche in the past, hot topic today n Ultimate Solution: Automatic Parallelization
n Extremely difficult problem n After decades of research, limited success
n Other solutions: Programming Models n Libraries (MPI, OpenMP, CnC, TBB, etc.) n Parallel languages (UPC, Chapel, X10, etc.) n Domain Specific Languages (stencils, etc.)
3
Contributions
[Diagram: thesis contributions — MPI Code Generation and Polyhedral X10, built with the AlphaZ system (MDE) on top of the Polyhedral Model: 40+ years of research in linear algebra and ILP, and tools such as CLooG, ISL, Omega, and PLuTo]
4
Polyhedral State-of-the-art
n Tiling-based parallelization n Extensions to parameterized tile sizes
n First step [Renganarayana2007] n Parallelization + imperfectly nested loops [Hartono2010, Kim2010]
n PLuTo approach is now used by many people n Wave-front of tiles: better strategy than maximum parallelism [Bondhugula2008]
n Many advances in shared memory context
5
How far can shared memory go?
n The Memory Wall is still there n Does it make sense for 1000 cores to share memory? [Berkeley View, Shalf 07, Kumar 05]
n Power n Coherency overhead n False sharing n Hierarchy? n Data volume (tera- to peta-bytes)
6
Distributed Memory Parallelization
n Problems implicitly handled by shared memory now need explicit treatment n Communication
n Which processors need to send/receive? n Which data to send/receive? n How to manage communication buffers?
n Data partitioning n How do you allocate memory across nodes?
7
MPI Code Generator
n Distributed Memory Parallelization n Tiling-based n Parameterized tile sizes n C+MPI implementation
n Uniform dependences as the key enabler n Many affine dependences can be uniformized
n Shared memory performance carried over to distributed memory n Scales as well as PLuTo, but across multiple nodes
8
Related Work (Polyhedral)
n Polyhedral Approaches n Initial idea [Amarasinghe1993] n Analysis for fixed-size tiling [Claßen2006] n Further optimization [Bondhugula2011]
n “Brute force” polyhedral analysis for handling communication n No hope of handling parametric tile sizes n Can handle arbitrary affine programs
9
Outline
n Introduction n “Uniform-ness” of Affine Programs
n Uniformization n Uniform-ness of PolyBench
n MPI Code Generation n Tiling n Uniform-ness simplifies everything n Comparison against PLuTo with PolyBench
n Conclusions and Future Work
10
Affine vs Uniform
n Affine Dependences: f = Ax+b n Examples
n (i,j->j,i) n (i,j->i,i) n (i->0)
n Uniform Dependences: f = Ix+b n Examples
n (i,j->i-1,j) n (i->i-1)
11
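To make the distinction concrete, here is a minimal C sketch (the array A, the bound N, and the loop bodies are illustrative assumptions, not taken from the slides); the first loop follows the affine dependence (i,j->j,i), the second the uniform dependence (i,j->i-1,j):

#define N 100
double A[N][N];

/* Affine: the accessed iteration (j,i) is a general affine function
 * of (i,j), i.e. the dependence function (i,j -> j,i). */
void affine_example(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[j][i] + 1.0;
}

/* Uniform: the accessed iteration (i-1,j) is at a constant offset
 * from (i,j), i.e. the dependence (i,j -> i-1,j). */
void uniform_example(void) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[i-1][j] + 1.0;
}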
Uniformization
n Example: the affine dependence (i->0) is replaced by the uniform dependence (i->i-1), propagating the value along a chain
12
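One possible reading of this slide as C code (the arrays x, y, spread and the bound N are assumptions for illustration): the broadcast-like affine dependence (i->0) is replaced by a chain of uniform (i->i-1) dependences that carries the value step by step.

#define N 100

/* Before: every iteration reads x[0], an affine dependence (i -> 0)
 * whose distance grows with i. */
void before(double x[N], double y[N]) {
    for (int i = 1; i < N; i++)
        y[i] = x[i] + x[0];
}

/* After uniformization by pipelining: x[0] is carried along an extra
 * array, so every read is the uniform dependence (i -> i-1). */
void after(double x[N], double y[N]) {
    double spread[N];
    spread[0] = x[0];
    for (int i = 1; i < N; i++) {
        spread[i] = spread[i - 1];   /* propagate the broadcast value */
        y[i] = x[i] + spread[i];
    }
}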
Uniformization
n Uniformization is a classic technique n “solved” in the 1980s n has been “forgotten” in the multi-core era
n Any affine dependence can be uniformized n by adding a dimension [Roychowdhury1988]
n Nullspace pipelining n simple technique for uniformization n many dependences are uniformized
13
Uniformization and Tiling
n Uniformization does not influence tilability
14
PolyBench [Pouchet2010]
n Collection of 30 polyhedral kernels n Proposed by Pouchet as a benchmark for polyhedral compilation
n Goal: a benchmark small enough that individual results are reported; no averages
n Kernels from: n data mining n linear algebra kernels, solvers n dynamic programming n stencil computations
15
Uniform-ness of PolyBench
n 5 of the 30 kernels are “incorrect” and are excluded
n Embedding: Match dimensions of statements n Phase Detection: Separate program into phases
n The output of one phase is used as input to another
Stage                   Number of Fully Uniform Programs
Uniform at Start        8/25 (32%)
After Embedding         13/25 (52%)
After Pipelining        21/25 (84%)
After Phase Detection   24/25 (96%)
16
Outline
n Introduction n Uniform-ness of Affine Programs
n Uniformization n Uniform-ness of PolyBench
n MPI Code Generation n Tiling n Uniform-ness simplifies everything n Comparison against PLuTo with PolyBench
n Conclusions and Future Work
17
Basic Strategy: Tiling
n We focus on tilable programs
18
Dependences in Tilable Space
n All in the non-positive direction
19
Wave-front Parallelization
n All tiles with the same color can run in parallel
20
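A minimal C sketch of wave-front execution over a 2D grid of tiles (the tile counts T1, T2 and the compute_tile placeholder are assumptions; in the generated code the diagonals come from the D-Tiling equation shown later):

/* Tiles (t1, t2) with the same t1 + t2 lie on one anti-diagonal (one
 * "color") and are mutually independent; diagonals run in order. */
void wavefront(int T1, int T2) {
    for (int t = 0; t <= (T1 - 1) + (T2 - 1); t++) {
        /* every tile on this diagonal may run in parallel */
        for (int t1 = 0; t1 < T1; t1++) {
            int t2 = t - t1;
            if (t2 < 0 || t2 >= T2) continue;
            /* compute_tile(t1, t2); */
        }
    }
}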
Assumptions
n Uniform in at least one of the dimensions n The uniform dimension is made outermost
n Tilable space is fully permutable
n One-dimensional processor allocation n Large enough tile sizes
n Dependences do not span multiple tiles
n Then, communication is extremely simplified
21
Processor Allocation
n Outermost tile loop is distributed
[Figure: tile grid over (i1, i2); each column of tiles along i1 is assigned to one of the processors P0–P3]
22
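A hedged sketch of distributing the outermost tile loop over MPI ranks (the cyclic mapping of tile columns to ranks and all names here are assumptions for illustration; the generated code's actual allocation may differ):

#include <mpi.h>

/* Each column of tiles along i1 is owned by exactly one processor;
 * here the columns are dealt out cyclically to the ranks. */
void run_my_columns(int num_tile_columns) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int t1 = 0; t1 < num_tile_columns; t1++) {
        if (t1 % size != rank) continue;  /* another rank owns this column */
        /* ... execute the tiles of column t1 (all i2 tile indices) ... */
    }
}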
Values to be Communicated
n Faces of the tiles (may be thicker than one element)
[Figure: tile grid over (i1, i2), processors P0–P3; the values to communicate are the faces of the tiles]
23
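A small C sketch of packing one tile face into a contiguous send buffer (the array A, its extent N2, and a face thickness of one are assumptions; the real generated code packs faces whose thickness matches the dependence distances):

#define N2 1024   /* assumed extent of the i2 dimension */

/* Copy the last i1-plane of a tile (the face read by the next
 * processor) into a contiguous communication buffer. */
void pack_face(double A[][N2], int i1_last, int i2_lo, int i2_hi,
               double *buf) {
    int k = 0;
    for (int i2 = i2_lo; i2 < i2_hi; i2++)
        buf[k++] = A[i1_last][i2];
}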
Naïve Placement of Send and Receive Codes
n Receiver is the consumer tile of the values
[Figure: tile grid over (i1, i2), processors P0–P3; S marks the tiles that send values and R marks the consumer tiles that receive them]
24
Problems in Naïve Placement
n Receiver is in the next wave-front time
[Figure: tile grid over (i1, i2), processors P0–P3, with wave-front times t=0 to t=3; the sending tiles (S) and their receivers (R) lie in different wave-front times]
25
Problems in Naïve Placement
n Receiver is in the next wave-front time n Number of communications “in-flight” = amount of parallelism
n MPI_Send will deadlock
n May not return control if system buffer is full
n Asynchronous communication is required n Must manage your own buffer n required buffer = amount of parallelism
n i.e., number of virtual processors
26
Proposed Placement of Send and Receive Codes
n Receiver is one tile below the consumer
[Figure: tile grid over (i1, i2), processors P0–P3; each receive (R) is placed one tile below the consumer of the corresponding send (S)]
27
Placement within a Tile
n Naïve Placement: n Receive -> Compute -> Send
n Proposed Placement: n Issue asynchronous receive (MPI_Irecv) n Compute n Issue asynchronous send (MPI_Isend) n Wait for values to arrive
n Overlap of computation and communication n Only two buffers per physical processor
[Figure: per-tile timeline overlapping computation with communication, using one receive buffer and one send buffer; see the C+MPI sketch after this slide]
28
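A minimal C+MPI sketch of the per-tile pattern on the previous slide (the neighbor ranks, tag, buffer size, and the compute_tile placeholder are assumptions; the real generated code also packs and unpacks tile faces):

#include <mpi.h>

/* One tile's worth of work on a physical processor: post the receive
 * early, compute, send asynchronously, then wait. Only one receive
 * buffer and one send buffer are needed per processor. */
void process_tile(double *recv_buf, double *send_buf, int count,
                  int prev, int next, MPI_Comm comm) {
    MPI_Request rreq, sreq;

    /* 1. Issue the asynchronous receive (MPI_Irecv). */
    MPI_Irecv(recv_buf, count, MPI_DOUBLE, prev, 0, comm, &rreq);

    /* 2. Compute the current tile. */
    /* compute_tile(...); */

    /* 3. Issue the asynchronous send of this tile's face (MPI_Isend). */
    MPI_Isend(send_buf, count, MPI_DOUBLE, next, 0, comm, &sreq);

    /* 4. Wait for the incoming values to arrive and for the send
          buffer to become reusable before the next tile. */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
}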
Evaluation
n Compare performance with PLuTo n Shared memory version with the same strategy
n Cray: 24 cores per node, up to 96 cores n Goal: Similar scaling to PLuTo n Tile sizes are searched, starting from educated guesses n PolyBench
n 7 are too small n 3 cannot be tiled or have limited parallelism n 9 cannot be used due to a PLuTo/PolyBench issue
29
Performance Results
30
n Linear extrapolation from the speedup at 24 cores n Broadcast cost at most 2.5 seconds
[Plot: Summary of AlphaZ performance comparison with PLuTo — speedup with respect to PLuTo with 1 core (0 to 100) for correlation, covariance, 2mm, 3mm, gemm, syr2k, syrk, lu, fdtd-2d, jacobi-2d-imper, seidel-2d; series: PLuTo 24 cores, AlphaZ 24 cores, PLuTo 96 cores (extrapolated), AlphaZ 96 cores (No Bcast)]
AlphaZ System
n System for polyhedral design space exploration n Key features not explored by other tools:
n Memory allocation n Reductions
n Case studies to illustrate the importance of unexplored design space [LCPC2012]
n Polyhedral Equational Model [WOLFHPC2012]
n MDE applied to compilers [MODELS2011]
31
Polyhedral X10 [PPoPP2013?]
n Work with Vijay Saraswat and Paul Feautrier n Extension of array data flow analysis to X10
n supports finish/async but not clocks
n finish/async can express more than doall n Focus of polyhedral model so far: doall
n Dataflow result is used to detect races n With polyhedral precision, we can guarantee program regions to be race-free
32
Conclusions
n Polyhedral Compilation has lots of potential n Memory/reductions are not explored n Successes in automatic parallelization n Race-free guarantee
n Handling arbitrary affine dependences may be overkill n Uniformization makes a lot of sense n Distributed memory parallelization made easy n Can handle most of PolyBench
33
Future Work
n Many direct extensions n Hybrid MPI+OpenMP with multi-level tiling n Partial uniformization to satisfy pre-condition n Handling clocks in Polyhedral X10
n Broader applications of the polyhedral model n Approximations n Larger granularity: blocks of computations instead of statements n Abstract interpretations [Alias2010]
34
Acknowledgements
n Advisor: Sanjay Rajopadhye n Committee members:
n Wim Böhm n Michelle Strout n Edwin Chong
n Unofficial co-advisor: Steven Derrien n Members of Mélange, HPCM, CAIRN
n Dave Wonnacott, Haverford students
35
Backup Slides
36
Uniformization and Tiling
n Tilability is preserved
37
D-Tiling Review [Kim2011]
n Parametric tiling for shared memory n Uses non-polyhedral skewing of tiles
n Required for wave-front execution of tiles
n The key equation: time = Σ_{i=1}^{d} (ti_i / ts_i)
n where
n d: number of tiled dimensions n ti: tile origins n ts: tile sizes
38
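As a worked example with illustrative numbers (not from the slides): for d = 2 with tile sizes ts = (32, 32), the tile whose origins are ti = (64, 96) lies on wave-front time

\[
  \mathit{time} \;=\; \sum_{i=1}^{d} \frac{ti_i}{ts_i}
  \;=\; \frac{64}{32} + \frac{96}{32} \;=\; 5,
\]

and conversely, knowing time = 5 and ti_1 = 64 determines the last origin as ti_2 = (5 - 64/32) * 32 = 96, which is how the generated code recovers the missing tile origin.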
D-Tiling Review cont.
n The equation enables skewing of tiles n If either the time or one of the tile origins is unknown, it can be computed from the others
n Generated Code: (tix is the (d-1)-th tile origin)
39
for (time = start : end)
  for (ti1 = ti1LB : ti1UB)
    ...
      for (tix = tixLB : tixUB) {
        tid = f(time, ti1, ..., tix);
        // compute tile (ti1, ti2, ..., tix, tid)
      }
Placement of Receive Code using D-Tiling
n Slight modification to the use of the equation
n Visit tiles in the next wave-front time
40
for (time = start : end)
  for (ti1 = ti1LB : ti1UB)
    ...
      for (tix = tixLB : tixUB) {
        tidNext = f(time+1, ti1, ..., tix);
        // receive and unpack buffer for
        // tile (ti1, ti2, ..., tix, tidNext)
      }
Proposed Placement of Send and Receive codes n Receiver is one tile below the consumer
[Figure: tile grid over (i1, i2), processors P0–P3; each receive (R) is placed one tile below the consumer of the corresponding send (S)]
41
Extensions to Schedule Independent Mapping
n Schedule Independent Mapping [Strout1998]
n Universal Occupancy Vectors (UOVs) n Legal storage mapping for any legal execution n Uniform dependence programs only
n Universality of UOVs can be restricted n e.g., to tiled execution
n For tiled execution, shortest UOV can be found without any search
42
LU Decomposition
[Plot: lu — speedup with respect to PLuTo with 1 core vs. number of cores (8 to 96); series: PLuTo, AlphaZ, AlphaZ (No Bcast)]
43
seidel-2d
[Plot: seidel-2d — speedup with respect to PLuTo with 1 core vs. number of cores (8 to 96); series: PLuTo, AlphaZ, AlphaZ (No Bcast)]
44
seidel-2d (no 8x8x8)
[Plot: seidel-2d without 8x8x8 tiles — speedup with respect to PLuTo with 1 core vs. number of cores (8 to 96); series: PLuTo, AlphaZ, AlphaZ (No Bcast)]
45
jacobi-2d-imper
[Plot: jacobi-2d-imper — speedup with respect to PLuTo with 1 core vs. number of cores (8 to 96); series: PLuTo, AlphaZ, AlphaZ (No Bcast)]
46
Related Work (Non-Polyhedral)
n Global communications [Li1990] n Translation from shared memory programs n Pattern matching for global communications
n Paradigm [Banerjee1995] n No loop transformations n Finds parallel loops and inserts the necessary communications
n Tiling-based [Goumas2006] n Perfectly nested uniform dependences
47
adi.c: Performance
n PLuTo does not scale because the outer loop is not tiled
[Plots: speedup of optimized code over the original code vs. number of threads (cores), comparing AlphaZ and PLuTo — on Xeon (1 to 8 cores) and on Cray XT6m (up to 24 cores)]
48
UNAfold: Performance
n Complexity reduction is empirically confirmed
[Plots: execution time of UNAfold in seconds vs. sequence length N (200 to 1400), original vs. simplified; log-log plot of execution time vs. sequence length, with fits y = 4x + b1 (original) and y = 3x + b2 (simplified)]
49
Contributions
n The AlphaZ System n Polyhedral compiler giving full control to the user n Equational view of the polyhedral model
n MPI Code Generator n The first code generator with parametric tiling n Double buffering
n Polyhedral X10 n Extension to the polyhedral model n Race-free guarantee of X10 programs
50