CR18: Advanced Compilers L06: Code Generation Tomofumi Yuki 1.
CR18: Advanced Compilers
L06: Code Generation
Tomofumi Yuki

Code Generation

- Completing the transformation loop
- Problem: how to generate code to scan
  - a polyhedron?
  - a union of polyhedra?
- how to generate tiled code?
- how to generate parametrically tiled code?

Evolution of Code Gen

- Ancourt & Irigoin 1991
  - single polyhedron scanning
- LooPo: Griebl & Lengauer 1996
  - 1st step to unions of polyhedra
  - scan bounding box + guards
- Omega Code Gen 1995
  - generate inefficient code (convex hull + guards)
  - then try to remove inefficiencies

Evolution of Code Gen

- LoopGen: Quilleré-Rajopadhye-Wilde 2000
  - efficiently scanning unions of polyhedra
- CLooG: Bastoul 2004
  - improvements to QRW algorithm
  - robust and well maintained implementation
- AST Generation: Grosser 2015
  - Polyhedral AST generation is more than scanning polyhedra
  - scanning is not enough!

Scanning a Polyhedron

- Scanning Polyhedra with DO Loops [1991]
- Problem: generate bounds on loops
  - outermost loop: constants and params
  - inner loop: + surrounding iterators
- Approach: Fourier-Motzkin elimination
  - projecting out variables
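The projection step can be sketched in a few lines. The following is a minimal illustration, not the implementation from the 1991 paper: it assumes every constraint is stored as a coefficient tuple meaning "linear expression >= 0", works over the rationals only (no integer tightening), and the name `eliminate` is hypothetical.

```python
def eliminate(constraints, k):
    """Project out variable k by Fourier-Motzkin elimination.

    Each constraint is a tuple (a_0, ..., a_{n-1}, c) meaning
    a_0*x_0 + ... + a_{n-1}*x_{n-1} + c >= 0.
    """
    lower = [c for c in constraints if c[k] > 0]    # lower bounds on x_k
    upper = [c for c in constraints if c[k] < 0]    # upper bounds on x_k
    result = [c for c in constraints if c[k] == 0]  # x_k does not appear
    for lo in lower:
        for up in upper:
            # Positive multiples chosen so that the x_k terms cancel.
            combined = tuple(-up[k] * a + lo[k] * b for a, b in zip(lo, up))
            if combined not in result:
                result.append(combined)
    return result


# Variables are ordered (i, j, N); the last entry is the constant term.
triangle = [
    (-1, 0, 1, 0),  # N - i >= 0
    (0, 1, 0, 0),   # j >= 0
    (1, -1, 0, 0),  # i - j >= 0
]
outer_bounds = eliminate(triangle, 1)  # project out j
print(outer_bounds)
```

Projecting out j leaves N - i >= 0 and i >= 0, which are the bounds of an outermost i loop that depends only on constants and parameters.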

Single Polyhedron Example

What is the loop nest for lex. scan?

Domain (figure: triangle in the (i, j) plane): i≤N, j≥0, i-j≥0

    for i = 0 .. N
      for j = 0 .. i
        S;

Single Polyhedron Example

What is the loop nest for permuted case? j as the outer loop

Domain (figure: triangle in the (i, j) plane): i≤N, j≥0, i-j≥0

    for j = 0 .. N
      for i = j .. N
        S;
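As a quick sanity check, the two loop nests can be enumerated and compared against the domain i≤N, j≥0, i-j≥0. This is only an illustrative sketch; the function names are hypothetical.

```python
def scan_i_outer(N):
    # for i = 0 .. N / for j = 0 .. i
    return [(i, j) for i in range(N + 1) for j in range(i + 1)]


def scan_j_outer(N):
    # for j = 0 .. N / for i = j .. N
    return [(i, j) for j in range(N + 1) for i in range(j, N + 1)]


def domain(N):
    # Brute-force enumeration of { (i, j) : i <= N, j >= 0, i - j >= 0 }.
    return {(i, j) for i in range(N + 1) for j in range(N + 1) if i - j >= 0}


print(scan_i_outer(2))
print(scan_j_outer(2))
```

Both nests visit exactly the same points; only the visiting order differs.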

Scanning Unions of Polyhedra

- Consider scanning two statements
- Naïve approach: bounding box

Domains and schedules:

    S1: [N]->{ S1[i]->[i] : 0≤i≤N }
    S2: [N]->{ S2[i]->[i+5] : 0≤i≤N }

After change of basis (CoB):

    S1: [N]->{ [i] : 0≤i≤N }
    S2: [N]->{ [i] : 5≤i≤N+5 }

Bounding box with guards:

    for (i=0 .. N+5)
      if (0<=i && i<=N) S1;
      if (5<=i && i<=N+5) S2;

Slightly Better than BBox

- Make disjoint domains (shown here assuming N≥4)

    for (i=0 .. 4) S1;
    for (i=5 .. N) S1; S2;
    for (i=N+1 .. N+5) S2;

- But this is also problematic: code size can quickly grow, e.g. when the relative order of the bounds is unknown:

    S1: [N]->{ S1[i]->[i] : 0≤i<N }
    S2: [N]->{ S2[i]->[i] : 0≤i<M }
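The two strategies can be compared by recording the statement instances each one executes. The sketch below is illustrative only; it assumes N≥4, starts the middle loop at i=5 (where the domain of S2 begins), and uses hypothetical function names.

```python
def bbox_scan(N):
    # Bounding box 0 .. N+5 with one guard per statement.
    trace = []
    for i in range(0, N + 6):
        if 0 <= i <= N:
            trace.append(("S1", i))
        if 5 <= i <= N + 5:
            trace.append(("S2", i))
    return trace


def disjoint_scan(N):
    # Three guard-free loops over disjoint pieces; valid only for N >= 4.
    assert N >= 4
    trace = []
    for i in range(0, 5):
        trace.append(("S1", i))
    for i in range(5, N + 1):
        trace.append(("S1", i))
        trace.append(("S2", i))
    for i in range(N + 1, N + 6):
        trace.append(("S2", i))
    return trace


print(bbox_scan(8) == disjoint_scan(8))
```

The disjoint version executes no guards inside the loops, at the price of three loop bodies instead of one.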

QRW Algorithm

- Key: Recursive Splitting
- Given a set of n-D domains to scan
  - start at d=1 and context=parameters
  1. Restrict the domains to the context
  2. Project the domains to outer d dimensions
  3. Make the projections disjoint
  4. Recurse for each disjoint projection
     - d=d+1, context=a piece of the projection
  5. Sort the resulting loops
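Step 3 is the heart of the algorithm. The sketch below shows it for the simplest case only: one-dimensional projections given as integer intervals. It is an illustration under that assumption, not the QRW implementation, which separates general polyhedra; the name `separate` and the choice N=8 are hypothetical.

```python
def separate(intervals):
    """Split 1-D projections into disjoint pieces.

    intervals maps a statement name to inclusive integer bounds (lo, hi).
    Returns (lo, hi, active statements) triples in increasing order.
    """
    cuts = sorted(
        {lo for lo, _ in intervals.values()}
        | {hi + 1 for _, hi in intervals.values()}
    )
    pieces = []
    for start, nxt in zip(cuts, cuts[1:]):
        active = sorted(
            name
            for name, (lo, hi) in intervals.items()
            if lo <= start and nxt - 1 <= hi
        )
        if active:  # skip gaps where no statement is defined
            pieces.append((start, nxt - 1, active))
    return pieces


# S1 on 0..N and S2 on 5..N+5, with N = 8.
print(separate({"S1": (0, 8), "S2": (5, 13)}))
```

Each returned piece becomes one loop at the current depth, and the recursion continues inside it with that piece as the new context.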

Example

Scan the following domains (figure: domains S1 and S2 in the (i, j) plane)

d=1, context=universe

- Step 1: Projection
- Step 2: Separation
- Step 3: Recurse
- Step 4: Sort

Example

d=1, context=universe; loops after projection and separation:

    for (i=0..1) ...
    for (i=2..6) ...

Example

d=2, context=0≤i≤2; recursing into the first piece

    for (i=0..1) ...

becomes

    for (i=0..1)
      for (j=0..4)
        S1;

Example

d=2, context=2≤i≤6

    for (i=2..6) ...

Example

d=2, context=2≤i≤6; separation yields four pieces, labeled L1 to L4 in the figure

    for (i=2..6) ...

Example

d=2, context=2≤i≤6; after sorting the pieces L1 to L4

    for (i=2..6)
      L2
      L1
      L3
      L4

CLooG: Chunky Loop Generator

- A few problems in QRW Algorithm
  - high complexity
  - code size is not controlled
- CLooG uses:
  - pattern matching to avoid costly polyhedral operations during separation
  - may stop the recursion at some depth and generate loops with guards to reduce size

Tiled Code Generation

- Tiling with fixed size
  - we did this earlier
- Tiling with parametric size
  - problem: non-affine!

Tiling Review

What does the tiled code look like?

    for (i=0; i<=N; i++)
      for (j=0; j<=N; j++)
        S;

With tile size ts, iterating over tile origins:

    for (ti=0; ti<=N; ti+=ts)
      for (tj=0; tj<=N; tj+=ts)
        for (i=ti; i<min(N+1,ti+ts); i++)
          for (j=tj; j<min(N+1,tj+ts); j++)
            S

or, iterating over tile indices:

    for (ti=0; ti<=floor(N/ts); ti++)
      for (tj=0; tj<=floor(N/ts); tj++)
        for (i=ti*ts; i<min(N+1,(ti+1)*ts); i++)
          for (j=tj*ts; j<min(N+1,(tj+1)*ts); j++)
            S
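A direct transcription of the two tiled forms makes it easy to check that they visit the same points as the original nest. This is only a sketch with hypothetical function names; note the product ti*ts in the second form, which is what makes the bounds non-affine once ts is a parameter.

```python
def original(N):
    return [(i, j) for i in range(N + 1) for j in range(N + 1)]


def tiled_by_origin(N, ts):
    # Tile loops step by ts and carry the tile origin.
    points = []
    for ti in range(0, N + 1, ts):
        for tj in range(0, N + 1, ts):
            for i in range(ti, min(N + 1, ti + ts)):
                for j in range(tj, min(N + 1, tj + ts)):
                    points.append((i, j))
    return points


def tiled_by_index(N, ts):
    # Tile loops step by 1 and carry the tile index; bounds use ti * ts.
    points = []
    for ti in range(N // ts + 1):
        for tj in range(N // ts + 1):
            for i in range(ti * ts, min(N + 1, (ti + 1) * ts)):
                for j in range(tj * ts, min(N + 1, (tj + 1) * ts)):
                    points.append((i, j))
    return points


print(sorted(tiled_by_origin(10, 4)) == original(10))
```

The min() expressions in the point loops are the guards that handle partial tiles at the boundary.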

Two Approaches

- Use fixed-size tiling
  - if the tile size is a constant, the code stays affine
  - pragmatic choice by many tools
- Use non-polyhedral code generation
  - much better for tuning tile sizes
  - makes sense for semi-automatic tools

Difficulties in Tiled Code Gen

- This is still a very simplified view
- In practice, we tile after transformation
  - skewing, etc.
- Let's see the tiled iteration space with tvis

    for (ti=0; ti<=N; ti+=ts)
      for (tj=0; tj<=N; tj+=ts)
        for (i=ti; i<min(N+1,ti+ts); i++)
          for (j=tj; j<min(N+1,tj+ts); j++)
            S

Full Tiles, Inset / Outset

- Partial tiles have a lot of control overhead
- Challenges for parametric tiled code gen
  - make sure to scan the outset
  - but also separate the inset
  - use efficient point loops for inset
- All without polyhedral analysis

Point Loops for Full/Partial Tile

Full Tile Point Loop:

    for (i=ti; i<ti+si; i++)
      for (j=tj; j<tj+sj; j++)
        for (k=tk; k<tk+sk; k++)
          ...

Partial/Empty Tile Point Loop:

    for (i=max(ti,...); i<min(ti+si,...); i++)
      for (j=max(tj,...); j<min(tj+sj,...); j++)
        for (k=max(tk,...); k<min(tk+sk,...); k++)
          if (...)
            ...

Progression of Parametric Tiling

- Perfectly nested, single loop
  - TLoG [Renganarayana et al. 2007]
- Multiple levels of tiling
  - HiTLoG [Renganarayana et al. 2007]
  - PrimeTile [Hartono 2009]
- Parallelizing the tiles
  - DynTile [Hartono 2010]
  - D-Tiling [Kim 2011]

Computing the Outset

- We start with some domain
- expand in each dimension by (symbolic) tile size - 1
  - except for upper bounds

Domain:

    {[i,j]: 0≤i≤10 and i≤j≤i+10}

Outset:

    {[i,j]: -(ts-1)≤i≤10 and -(ts-1)+i≤j and -(ts-1)+j≤i+10}

Computing the Inset

- We start with some domain
- shrink in each dimension by (symbolic) tile size - 1
  - except for lower bounds

Domain:

    {[i,j]: 0≤i≤10 and i≤j≤i+10}

Inset:

    {[i,j]: 0≤i≤10-(ts-1) and i≤j-(ts-1) and j≤i+10-(ts-1)}
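The outset and inset formulas for this domain can be checked by brute force. The sketch below assumes square ts×ts tiles whose origin (ti, tj) is the lower-left corner, and it tests origins only inside a finite search window; all function names are hypothetical.

```python
def in_domain(i, j):
    return 0 <= i <= 10 and i <= j <= i + 10


def in_outset(ti, tj, ts):
    s = ts - 1
    return -s <= ti <= 10 and ti - s <= tj and tj - s <= ti + 10


def in_inset(ti, tj, ts):
    s = ts - 1
    return 0 <= ti <= 10 - s and ti <= tj - s and tj <= ti + 10 - s


def tile_points(ti, tj, ts):
    return [(i, j) for i in range(ti, ti + ts) for j in range(tj, tj + ts)]


def check(ts, lo=-10, hi=30):
    """Return True if both properties hold for every origin in the window."""
    for ti in range(lo, hi):
        for tj in range(lo, hi):
            points = tile_points(ti, tj, ts)
            # Any tile touching the domain must have its origin in the outset.
            if any(in_domain(i, j) for i, j in points):
                if not in_outset(ti, tj, ts):
                    return False
            # Any origin in the inset must describe a full tile.
            if in_inset(ti, tj, ts):
                if not all(in_domain(i, j) for i, j in points):
                    return False
    return True


print(check(3), check(4))
```

Tiles whose origin lies in the outset but not in the inset are the partial (or empty) tiles that need guarded point loops.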

Syntactic Manipulation

- We cannot use polyhedral code generators
  - so back to modifying AST
- Modify the loop bounds to get loops that visit outset
- get guards to switch point-loops
- Up to here is HiTLoG/PrimeTile

Problem: Parallelization

- After tiling, there is parallelism
  - However, it requires skewing of tiles
- We need non-polyhedral skewing
- The key equation (shown on the original slide, not reproduced in this transcript) relates the wave-front time to the tile origins, where
  - d: number of tiled dimensions
  - ti: tile origins
  - ts: tile sizes

D-Tiling

- The equation enables skewing of tiles
- If one of time or the tile origins is unknown, it can be computed from the others
- Generated Code: (tix is the (d-1)th tile origin)

    for (time=start:end)
      for (ti1=ti1LB:ti1UB)
        …
          for (tix=tixLB:tixUB) {
            tid = f(time, ti1, …, tix);
            //compute tile ti1,ti2,…,tix,tid
          }
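The structure of the generated code can be illustrated for d=2. The sketch below rests on an assumption that is not taken from the slides: the wave-front time is taken to be the sum of the tile indices, ti1/ts1 + ti2/ts2, so the last tile origin can be solved for once time and ti1 are known. The function name and the rectangular n1×n2 space are hypothetical.

```python
def wavefront_tiles(n1, n2, ts1, ts2):
    """List the tile origins of an n1 x n2 space, grouped by wave-front time.

    ASSUMPTION: time = ti1/ts1 + ti2/ts2 (sum of tile indices).
    """
    count1 = -(-n1 // ts1)  # number of tiles along dimension 1 (ceiling)
    count2 = -(-n2 // ts2)  # number of tiles along dimension 2 (ceiling)
    schedule = []
    for time in range(count1 + count2 - 1):
        front = []
        for ti1 in range(0, n1, ts1):
            # Solve the assumed equation for the last tile origin.
            ti2 = (time - ti1 // ts1) * ts2
            if 0 <= ti2 < n2:  # discard origins outside the tiled space
                front.append((ti1, ti2))
        schedule.append(front)
    return schedule


for time, front in enumerate(wavefront_tiles(10, 6, 4, 3)):
    print(time, front)
```

All tiles listed for the same time value form one wave-front; the loop over time is sequential, while the loop over ti1 is the one that can run in parallel.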

Distributed Memory Parallelization

- Problems implicitly handled by the shared memory now need explicit treatment
- Communication
  - Which processors need to send/receive?
  - Which data to send/receive?
  - How to manage communication buffers?
- Data partitioning
  - How do you allocate memory across nodes?

MPI Code Generator

- Distributed Memory Parallelization
  - Tiling based
  - Parameterized tile sizes
  - C+MPI implementation
- Uniform dependences as key enabler
  - Many affine dependences can be uniformized
- Shared memory performance carried over to distributed memory
  - Scales as well as PLuTo but to multiple nodes

Related Work (Polyhedral)

- Polyhedral Approaches
  - Initial idea [Amarasinghe1993]
  - Analysis for fixed sized tiling [Claßen2006]
  - Further optimization [Bondhugula2011]
- “Brute Force” polyhedral analysis for handling communication
  - No hope of handling parametric tile size
  - Can handle arbitrary affine programs

Outline

- Introduction
- “Uniform-ness” of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
- Comparison against PLuTo with PolyBench
- Conclusions and Future Work

Affine vs Uniform

- Affine Dependences: f = Ax+b
  - Examples: (i,j->j,i), (i,j->i,i), (i->0)
- Uniform Dependences: f = Ix+b
  - Examples: (i,j->i-1,j), (i->i-1)
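The distinction comes down to whether the linear part A is the identity matrix. A minimal check, with a hypothetical function name and dependences written as a matrix A plus an offset vector b:

```python
def is_uniform(A, b):
    """Return True if f(x) = A x + b has A equal to the identity matrix."""
    n = len(A)
    if len(b) != n:
        return False
    return all(
        len(row) == n and all(row[c] == (1 if r == c else 0) for c in range(n))
        for r, row in enumerate(A)
    )


# (i,j -> j,i): a permutation, affine but not uniform.
print(is_uniform([[0, 1], [1, 0]], [0, 0]))
# (i,j -> i-1,j): identity plus a constant offset, uniform.
print(is_uniform([[1, 0], [0, 1]], [-1, 0]))
```

For a uniform dependence, the offset b alone describes the dependence, independent of the iteration point.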

Uniformization

(Figure: the affine dependence (i->0) is replaced by the uniform dependence (i->i-1))

Uniformization

- Uniformization is a classic technique
  - “solved” in the 1980’s
  - has been “forgotten” in the multi-core era
- Any affine dependence can be uniformized
  - by adding a dimension [Roychowdhury1988]
- Nullspace pipelining
  - simple technique for uniformization
  - many dependences are uniformized
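The idea behind pipelining can be seen on the broadcast dependence (i->0): instead of every iteration reading the value at iteration 0, each iteration forwards a copy to its successor, giving the uniform dependence (i->i-1). The sketch below is a toy illustration with hypothetical names, not the general nullspace construction.

```python
def with_broadcast(x0, n):
    # Every iteration i reads the value at iteration 0: dependence (i -> 0).
    return [x0 + i for i in range(n)]


def with_pipelining(x0, n):
    # p[i] carries a copy of x0 forward one step at a time:
    # every read now follows the dependence (i -> i-1).
    p = [None] * n
    out = []
    for i in range(n):
        p[i] = x0 if i == 0 else p[i - 1]
        out.append(p[i] + i)
    return out


print(with_broadcast(7, 5) == with_pipelining(7, 5))
```

The results are identical; the pipelined version trades one extra copy per iteration for dependences that are all nearest-neighbor.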
Uniformization and Tiling
Uniformization does not influence tilability

PolyBench [Pouchet2010]

- Collection of 30 polyhedral kernels
- Proposed by Pouchet as a benchmark for polyhedral compilation
- Goal: Small enough benchmark so that individual results are reported; no averages
- Kernels from:
  - data mining
  - linear algebra kernels, solvers
  - dynamic programming
  - stencil computations

Uniform-ness of PolyBench

- 5 of them are “incorrect” and are excluded
- Embedding: Match dimensions of statements
- Phase Detection: Separate program into phases
  - Output of a phase is used as inputs to the other

| Stage | Number of Fully Uniform Programs |
| --- | --- |
| Uniform at Start | 8/25 (32%) |
| After Embedding | 13/25 (52%) |
| After Pipelining | 21/25 (84%) |
| After Phase Detection | 24/25 (96%) |
Basic Strategy: Tiling
We focus on tilable programs

Dependences in Tilable Space

- All in the non-positive direction
Wave-front Parallelization
All tiles with the same color can run in parallel

Assumptions

- Uniform in at least one of the dimensions
  - The uniform dimension is made outermost
- Tilable space is fully permutable
- One-dimensional processor allocation
- Large enough tile sizes
  - Dependences do not span multiple tiles
- Then, communication is extremely simplified

Processor Allocation

- Outermost tile loop is distributed

(Figure: tiled (i1, i2) space with tiles assigned to processors P0, P1, P2, P3)

Values to be Communicated

- Faces of the tiles (may be thicker than 1)

(Figure: tiled (i1, i2) space across processors P0, P1, P2, P3)

Naïve Placement of Send and Receive Codes

- Receiver is the consumer tile of the values

(Figure: send (S) and receive (R) tiles in the (i1, i2) space across processors P0, P1, P2, P3)

Problems in Naïve Placement

- Receiver is in the next wave-front time

(Figure: send (S) and receive (R) tiles in the (i1, i2) space across processors P0, P1, P2, P3, with wave-front times t=0 to t=3)

Problems in Naïve Placement

- Receiver is in the next wave-front time
- Number of communications “in-flight” = amount of parallelism
- MPI_Send will deadlock
  - May not return control if system buffer is full
- Asynchronous communication is required
  - Must manage your own buffer
  - required buffer = amount of parallelism
    - i.e., number of virtual processors

Proposed Placement of Send and Receive codes

- Receiver is one tile below the consumer

(Figure: send (S) and receive (R) tiles in the (i1, i2) space across processors P0, P1, P2, P3)

Placement within a Tile

- Naïve Placement:
  - Receive -> Compute -> Send
- Proposed Placement:
  - Issue asynchronous receive (MPI_Irecv)
  - Compute
  - Issue asynchronous send (MPI_Isend)
  - Wait for values to arrive
- Overlap of computation and communication
- Only two buffers per physical processor

(Figure: timeline showing the overlap, with a Recv Buffer and a Send Buffer)
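The proposed ordering can be simulated without MPI. The sketch below is only an analogy under stated assumptions: Python threads stand in for processors, unbounded queues stand in for the network, a background `get` plays the role of MPI_Irecv, and a non-blocking `put` plays the role of MPI_Isend. The pipeline shape (each processor consumes one value per tile from its left neighbor) and the name `run_pipeline` are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(num_procs, num_tiles):
    inboxes = [queue.Queue() for _ in range(num_procs)]
    results = [[None] * num_tiles for _ in range(num_procs)]

    def worker(p):
        with ThreadPoolExecutor(max_workers=1) as io:
            # Value needed by the first tile (processor 0 has no neighbor).
            incoming = inboxes[p].get() if p > 0 and num_tiles > 0 else 0
            for t in range(num_tiles):
                # 1. Post the receive for the value the NEXT tile consumes.
                pending = None
                if p > 0 and t + 1 < num_tiles:
                    pending = io.submit(inboxes[p].get)
                # 2. Compute the current tile while that receive is in flight.
                value = incoming + 1
                results[p][t] = value
                # 3. Send the produced value to the right neighbor.
                if p + 1 < num_procs:
                    inboxes[p + 1].put(value)
                # 4. Wait for the posted receive to complete.
                if pending is not None:
                    incoming = pending.result()

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(num_procs)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


print(run_pipeline(4, 3))
```

Each worker holds only two values at a time, `incoming` and the one behind `pending`, which mirrors the two-buffer claim; in a real C+MPI code these would be the receive and send buffers.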

Evaluation

- Compare performance with PLuTo
  - Shared memory version with same strategy
- Cray: 24 cores per node, up to 96 cores
- Goal: Similar scaling as PLuTo
- Tile sizes are searched with educated guesses
- PolyBench
  - 7 are too small
  - 3 cannot be tiled or have limited parallelism
  - 9 cannot be used due to PLuTo/PolyBench issue

Performance Results

- Linear extrapolation from speed up of 24 cores
- Broadcast cost at most 2.5 seconds