CR18: Advanced Compilers L05: Scheduling for Locality Tomofumi Yuki 1.
1
CR18: Advanced Compilers
L05: Scheduling for Locality
Tomofumi Yuki
2
Recap
Last time: we saw scheduling techniques; max. parallelism != best performance.
This time: how can we do better?
3
Pluto Strategy
We want: only 1D parallelism; coarse-grained (outer) parallelism; good data locality.
We want tiling: wave-front parallelism is guaranteed; each tile can be executed atomically; good for sequential performance.
4
Intuition of Pluto Algorithm
Skew and Tile
[Figure: two iteration-space diagrams with axes i and j]
5
Tiling Hyper-Planes
Another name for a 1D schedule θ; a set of θs defines a tiling.
Defines the transform (i,j -> i+j,i), which corresponds to the skew in the previous slide.
[Figure: iteration space with axes i and j and hyper-planes θ1=i+j, θ2=i]
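A minimal sketch of what the transform (i,j -> i+j,i) does to dependences. The distance vectors below are illustrative assumptions (only (1,-1) appears later in the slides), and the helper names are hypothetical. Because the transform is linear, a uniform distance vector can be pushed through it directly.

```python
def skew(i, j):
    """Apply the transform (i, j -> i + j, i) from the slide."""
    return (i + j, i)


def skewed_distance(distance):
    """Distance vector of a uniform dependence after skewing.

    The transform is linear, so the distance can be transformed directly.
    """
    di, dj = distance
    return skew(di, dj)


# Assumed uniform dependences (di, dj), chosen for illustration.
DISTANCES = [(1, -1), (1, 0), (1, 1)]

for d in DISTANCES:
    print(d, "->", skewed_distance(d))
```

In the original space the distance (1,-1) has a negative component, so rectangular tiles would depend on each other cyclically. After skewing, every component of every distance is non-negative, which is what rectangular tiling needs.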
6
Legality of Tiling
Each tiling hyper-plane must satisfy:
What is the difference from the causality condition? Note that this is about an affine transform, not a schedule.
It must be weakly satisfied for each dimension!
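For reference, a sketch of the two conditions in the notation commonly used for Pluto. The symbols e, s, t, S_i and S_j are assumptions introduced here; they are not taken from the slide.

```latex
% For every dependence e from instance s of statement S_i
% to instance t of statement S_j:

% tiling legality: weak satisfaction, required for every hyper-plane
\theta_{S_j}(\vec{t}) - \theta_{S_i}(\vec{s}) \ge 0

% causality for a 1D schedule: strict satisfaction
\theta_{S_j}(\vec{t}) - \theta_{S_i}(\vec{s}) \ge 1
```

Read this way, the difference is that a tiling hyper-plane only has to avoid going backwards along a dependence, while a 1D schedule has to move strictly forward.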
7
What does the condition mean? 1. Fully Permutable
Recall that the θs define the transform: all statements are mapped to a common d-D space; let i1, ..., in be the new indices.
Weakly satisfied in all dimensions: i1≥i'1, ..., in≥i'n for all dependences.
This is a reformulation of the fully permutable condition; it works for scheduling imperfect loop nests.
8
What does the condition mean? 2. All statements are fused
Somewhat implied by full permutability. What are the possible dependences, from S1 to S2 and from S2 to S1?
Exception: when S1 does not use the value of S2.
    for i
        for j
            S1
        for j
            S2
9
Selecting Tiling Hyper-planes
Which is better?
[Figure: two candidate tilings of the iteration space with axes i and j]
10
Cost Functions in Pluto
Formulated as:
What does this capture?
dep: (i,j -> i+1,j-1)
δ1 = (i+1+j-1) - (i+j) = 0
δ2 = (i+1) - (i) = 1
[Figure: iteration space with axes i and j and hyper-planes θ1=i+j, θ2=i]
11
Cost Functions in Pluto
Formulated as:
What does this capture?
dep: (i,j -> i+1,j-1)
δ1 = (i+1+j-1) - (i+j) = 0
δ2 = (i+1-(j-1)) - (i-j) = 2
[Figure: iteration space with axes i and j and hyper-planes θ1=i+j, θ2=i-j]
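The two δ computations above can be reproduced mechanically. This is a small sketch, not Pluto's implementation: the hyper-planes are plain Python functions, and the sample point (5, 7) is an arbitrary assumption (the result does not depend on it because the dependence is uniform).

```python
def dep(i, j):
    """Target of the dependence (i, j -> i + 1, j - 1)."""
    return (i + 1, j - 1)


# The two candidate sets of hyper-planes from the slides.
CHOICE_A = [lambda i, j: i + j, lambda i, j: i]      # θ1 = i+j, θ2 = i
CHOICE_B = [lambda i, j: i + j, lambda i, j: i - j]  # θ1 = i+j, θ2 = i-j


def deltas(hyperplanes, point):
    """δ_k = θ_k(target) - θ_k(source) for each hyper-plane."""
    target = dep(*point)
    return [theta(*target) - theta(*point) for theta in hyperplanes]


print(deltas(CHOICE_A, (5, 7)))  # [0, 1]
print(deltas(CHOICE_B, (5, 7)))  # [0, 2]
```

Both choices are legal (no δ is negative), but the first one keeps the dependence shorter along the second hyper-plane, which is what the cost function rewards.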
12
Reuse Distance
When the θ corresponds to a sequential loop.
Two dependences: (i,j -> i+1,j) and (i,0 -> i,j) : j>0. What are the δs?
δ represents the number of iterations in the loop (corresponding to θ) until reuse via e.
[Figure: iteration space with axes i and j and hyper-planes θ1=i, θ2=j]
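One way to work out the δs for the two dependences, assuming the hyper-planes θ1=i and θ2=j shown in the figure (the labels e1 and e2 are introduced here for convenience):

```latex
\begin{aligned}
e_1 : (i,j) \to (i+1,j) &: \quad \delta_1 = (i+1) - i = 1, \quad \delta_2 = j - j = 0 \\
e_2 : (i,0) \to (i,j),\ j>0 &: \quad \delta_1 = i - i = 0, \quad \delta_2 = j - 0 = j
\end{aligned}
```

Under this reading, e1 has a constant reuse distance of one iteration of the θ1 loop, while the reuse distance of e2 along θ2 is not constant: it grows with j.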
13
Communication Volume
When the θ corresponds to a parallel loop.
Let si, sj be the tile sizes.
Horizontal dependence: sj values to the horizontal neighbor.
Vertical dependence: si values to N/sj tiles.
Constant is better; 0 is even better!
[Figure: tiled iteration space with axes i and j]
14
Iterative Search
We need d hyper-planes for a d-D space. Note that we are not looking for parallelism; parallelism comes with the tile wave-fronts.
Approach: find one θ for each statement; constrain the space to be linearly independent of the θs already found; repeat.
15
Tilable Band
Band of loops/schedules: a consecutive sequence of dimensions.
Tilable band: a band that satisfies the legality condition for a common set of dependences.
PLuTo tiles the outermost tilable band
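To make the legality condition concrete, here is a small sketch of rectangular tiling of a 2D band, with a brute-force check that every dependence source runs before its target. The function names, the 6x6 space, the 3x3 tiles and the distance vectors are all assumptions for illustration; this is not how Pluto generates code.

```python
def tiled_order(n_i, n_j, s_i, s_j):
    """Execution order of an n_i x n_j space with s_i x s_j tiles."""
    order = []
    for ti in range(0, n_i, s_i):                          # tile loops
        for tj in range(0, n_j, s_j):
            for i in range(ti, min(ti + s_i, n_i)):        # point loops
                for j in range(tj, min(tj + s_j, n_j)):
                    order.append((i, j))
    return order


def respects(order, distances):
    """True if no dependence target executes before its source."""
    position = {point: k for k, point in enumerate(order)}
    for (i, j), k in position.items():
        for di, dj in distances:
            target = (i + di, j + dj)
            if target in position and position[target] < k:
                return False
    return True


order = tiled_order(6, 6, 3, 3)
print(respects(order, [(0, 1), (1, 1), (2, 1)]))  # all components >= 0
print(respects(order, [(1, -1)]))                 # one negative component
```

With distances that are non-negative in both dimensions the tiled order is valid; with the distance (1,-1) a point such as (0,3) would run after its target (1,2), so the band is not tilable without a transform such as skewing.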
16
So, which is better?
What are the θs and δs? What is the order?
[Figure: two tilings of the iteration space with axes i and j]
17
Solving with ILP
Farkas Lemma again; we had enough of Farkas last time.
There is a problem when the constraint is:
18
The “Practical” Choice
Given the schedule prototype:
Constrain the coefficients to:
What does this mean?
Relaxed recently by a paper on PLUTO+
and
19
Example 1: Jacobi 1D
One example implementation:
but it is rather contrived due to limitations in polyhedral compilers
The dependences are simple
    for t = 0 .. T
        for i = 1 .. N-1
            S1: B[i] = foo(A[i], A[i-1], A[i+1]);
        for i = 1 .. N-1
            S2: A[i] = foo(B[i], B[i-1], B[i+1]);

    for t = 0 .. T
        for i = 1 .. N-1
            S1: A[t,i] = foo(A[t-1,i], A[t-1,i-1], A[t-1,i+1]);
20
Example 1: Jacobi 1D
Prototype: θS1(t,i) = a1t+a2i+a0
S1[t,i] -> S1[t+1,i]: δ1 = θS1(t+1,i)-θS1(t,i) = a1(t+1)+a2i+a0-(a1t+a2i+a0) = a1
S1[t,i] -> S1[t+1,i+1]: δ2 = θS1(t+1,i+1)-θS1(t,i) = a1(t+1)+a2(i+1)+a0-(a1t+a2i+a0) = a1+a2
S1[t,i] -> S1[t+1,i-1]: δ3 = θS1(t+1,i-1)-θS1(t,i) = a1(t+1)+a2(i-1)+a0-(a1t+a2i+a0) = a1-a2
21
Example 1: Jacobi 1D
Same prototype, dependences, and δs as on the previous slide.
The second hyper-plane must be linearly independent of the previous one.
22
Example 1: Jacobi 1D
We have a set of hyper-planes: θS1(t,i) = (t,t+i)
[Figure: two iteration-space diagrams with axes t and i]
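The iterative search can be imitated for Jacobi 1D with a brute-force enumeration instead of an ILP. This is a sketch under stated assumptions: coefficients are restricted to small non-negative integers, the cost is simply the largest δ (with the coefficient sum as a tie-breaker), and the independence test only handles the 2D case. None of these choices is claimed to match Pluto's actual formulation.

```python
from itertools import product

# (dt, di) for the three Jacobi 1D dependences listed on the slides.
DISTANCES = [(1, 0), (1, 1), (1, -1)]


def deltas(coeffs):
    """δ for each dependence under θ(t, i) = a1*t + a2*i (a0 cancels)."""
    a1, a2 = coeffs
    return [a1 * dt + a2 * di for dt, di in DISTANCES]


def is_independent(coeffs, found):
    """2D-only test: non-zero and not parallel to any hyper-plane found."""
    if coeffs == (0, 0):
        return False
    return all(f[0] * coeffs[1] - f[1] * coeffs[0] != 0 for f in found)


def find_hyperplanes(max_coeff=2):
    found = []
    for _ in range(2):  # one hyper-plane per dimension of the 2D space
        candidates = [c for c in product(range(max_coeff + 1), repeat=2)
                      if is_independent(c, found) and min(deltas(c)) >= 0]
        # Assumed cost: smallest maximum δ, then smallest coefficient sum.
        found.append(min(candidates, key=lambda c: (max(deltas(c)), sum(c))))
    return found


print(find_hyperplanes())  # [(1, 0), (1, 1)], i.e. θ = (t, t+i)
```

The first round picks (a1, a2) = (1, 0), and the second round, restricted to directions independent of it, picks (1, 1), which agrees with the set of hyper-planes (t, t+i) stated above.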
23
Example 2: 2mm
Simplified a bit:
    for i = 0 .. N
        for j = 0 .. N
            for k = 0 .. N
                S1: C[i,j] += A[i,k] * B[k,j];

    for i = 0 .. N
        for j = 0 .. N
            for k = 0 .. N
                S2: E[i,j] += C[i,k] * D[k,j];
S1[i,j,k] -> S1[i,j,k+1]
S2[i,j,k] -> S2[i,j,k+1]
S1[i,j,N] -> S2[i’,j’,k’]: i=i’ and j=k’
24
Example 2: 2mm (dim 1)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Easy ones:
S1[i,j,k] -> S1[i,j,k+1]: a3=0
S2[x,y,z] -> S2[x,y,z+1]: b3=0
Interesting case is the inter-statement dep.
S1[i,j,N] -> S2[x,y,z]: i=x and j=z
Written from the target side: S2[i,j,k] -> S1[i,k,N], or S2[x,y,z] -> S1[x,z,N]
25
Example 2: 2mm (dim 1)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Interesting case is the inter-statement dep.
S1[i,j,N] -> S2[x,y,z]: i=x and j=z
or S2[x,y,z] -> S1[x,z,N]
δ = (b1x+b2y+b3z+b0) - (a1x+a2z+a3N+a0)
  = (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
With a3=b3=0, and leaving out the constant part b0-a0: (b1-a1)x+b2y-a2z
26
Example 2: 2mm (dim 1)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Minimize: subject to a1+a2+b1+b2≠0, a,b≥0 (plus weakly satisfied)
We get θS1(i,j,k) = i
θS2(x,y,z) = x
(b1-a1)x+b2y-a2z
27
Example 2: 2mm (dim 2)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Minimize: subject to (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
linearly independent with the previous
We get
θS1(i,j,k) = j
θS2(x,y,z) = z
(b1-a1)x+b2y-a2z
29
Example 2: 2mm (dim 3)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Minimize: subject to (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
linearly independent with the previous
Does θS1=k and θS2=y work? (a3=1, b2=1, rest 0)
(b1-a1)x+b2y-a2z
S1[i,j,N] -> S2[x,y,z]: i=x and j=z
or S2[x,y,z] -> S1[x,z,N]:
30
Example 2: 2mm (dim 3)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Minimize: subject to (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
linearly independent with the previous
We have to split here: θS1=0 and θS2=1
(b1-a1)x+b2y-a2z
S1[i,j,N] -> S2[x,y,z]: i=x and j=z
or S2[x,y,z] -> S1[x,z,N]:
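The choices made for the first three dimensions can be checked by brute force against the inter-statement dependence S1[x,z,N] -> S2[x,y,z]. This is a sketch under assumptions: the helper names are hypothetical, N = 4 is an arbitrary small size, and the domain is taken as 0..N inclusive to match the loop bounds on the slides.

```python
from itertools import product


def theta(coeffs, point):
    """Evaluate c1*p1 + c2*p2 + c3*p3 + c0 for coeffs = (c1, c2, c3, c0)."""
    c1, c2, c3, c0 = coeffs
    p1, p2, p3 = point
    return c1 * p1 + c2 * p2 + c3 * p3 + c0


def min_inter_statement_delta(a, b, n):
    """Smallest θS2(x,y,z) - θS1(x,z,N) over 0 <= x, y, z <= N."""
    return min(theta(b, (x, y, z)) - theta(a, (x, z, n))
               for x, y, z in product(range(n + 1), repeat=3))


N = 4  # assumed problem size, for illustration only
CANDIDATES = {
    "dim 1: θS1=i, θS2=x": ((1, 0, 0, 0), (1, 0, 0, 0)),
    "dim 2: θS1=j, θS2=z": ((0, 1, 0, 0), (0, 0, 1, 0)),
    "dim 3: θS1=k, θS2=y": ((0, 0, 1, 0), (0, 1, 0, 0)),
    "split: θS1=0, θS2=1": ((0, 0, 0, 0), (0, 0, 0, 1)),
}
for name, (a, b) in CANDIDATES.items():
    print(name, "min δ =", min_inter_statement_delta(a, b, N))
```

The first two candidates give δ = 0 everywhere, so they are weakly satisfied. The third gives δ = y - N, which is negative for y < N, so θS1=k with θS2=y is not legal. The scalar split gives δ = 1 and strictly satisfies the dependence.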
31
Example 2: 2mm (dim 4)
Proceed to the 4th dimension, because the 3rd dimension is only for statement ordering.
Now solve the problem independently for each statement.
Case S1: linearly independent with [i] and [j]
Case S2: linearly independent with [x] and [z]
We get [k] and [y]
S1[i,j,k] -> S1[i,j,k+1]
S2[x,y,z] -> S2[x,y,z+1]
32
Example 2: 2mm
Finally, we have a set of hyper-planes:
θS1(i,j,k) = (i,j,0,k)
θS2(i,j,k) = (i,k,1,j)
Tilable Band
    for i = 0 .. N
        for j = 0 .. N
            for k = 0 .. N
                S1: C[i,j] += A[i,k] * B[k,j];

    for i = 0 .. N
        for j = 0 .. N
            for k = 0 .. N
                S2: E[i,j] += C[i,k] * D[k,j];

    for i = 0 .. N
        for j = 0 .. N {
            for k = 0 .. N
                S1: C[i,j] += A[i,k] * B[k,j];
            for k = 0 .. N
                S2: E[i,k] += C[i,j] * D[j,k];
        }
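A quick way to convince oneself that the fused loop nest computes the same result as the original is to run both on small integer matrices. This is a sketch: the function names and the input formulas are made up for the test, and the loops run over 0..n-1 rather than 0..N.

```python
def two_mm_original(A, B, D, n):
    """The two separate loop nests, S1 then S2."""
    C = [[0] * n for _ in range(n)]
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]      # S1
    for i in range(n):
        for j in range(n):
            for k in range(n):
                E[i][j] += C[i][k] * D[k][j]      # S2
    return C, E


def two_mm_fused(A, B, D, n):
    """Fused version following θS1 = (i,j,0,k) and θS2 = (i,k,1,j)."""
    C = [[0] * n for _ in range(n)]
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]      # S1: finishes C[i][j]
            for k in range(n):
                E[i][k] += C[i][j] * D[j][k]      # S2: consumes C[i][j]
    return C, E


n = 4
# Arbitrary integer inputs so the comparison is exact.
A = [[i + 2 * j + 1 for j in range(n)] for i in range(n)]
B = [[3 * i - j for j in range(n)] for i in range(n)]
D = [[i * j + 1 for j in range(n)] for i in range(n)]
print(two_mm_original(A, B, D, n) == two_mm_fused(A, B, D, n))  # True
```

The fused version is valid because each C[i][j] is fully accumulated by the first inner loop before the second inner loop reads it, and every (i, j, k) product contributing to E is still visited exactly once.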
33
Example 2: 2mm
Output of Pluto
34
Summary of Pluto
Paper in 2008; huge impact: 350+ citations already
Works very well as the default strategy
But, it is far from perfect!