CR18: Advanced Compilers L04: Scheduling Tomofumi Yuki 1.
1
CR18: Advanced Compilers
L04: Scheduling
Tomofumi Yuki
2
Today’s Agenda
- Revisiting legality with schedules
- How to find schedules
3
Schedules
- Recall that we had many “schedules”
  - here, we use the one related to time
- In general, a schedule is a function such that
  - input: statement instance
  - output: timestamp
  - where instances mapped to the same timestamp “may happen in parallel”
- We talk about static schedules in this class
4
Legality with Schedule
- Causality Condition
- Given a PRDG with nodes N and edges E
  - src(e) = producer statement
  - dst(e) = consumer statement
  - DS = domain of statement node S
  - De = domain of dependence e
- Check:
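The formula that followed "Check:" on the slide did not survive in this transcript. A plausible reconstruction, assuming θ with a statement subscript denotes that statement's schedule and that it matches the condition stated on the later "Using Farkas Lemma" slide (θS2(y)>θS1(x) for all <x,y> in De), is:

```latex
\forall e \in E,\ \forall \langle x, y \rangle \in D_e :\quad
\theta_{\mathrm{dst}(e)}(y) \;>\; \theta_{\mathrm{src}(e)}(x)
```

In words: every consumer instance must be scheduled strictly after the producer instance it depends on.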
5
Example (uniform case)
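- Back to the legality check with vectors

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i-1][j+1] + B[i][j];

- distance vector: [1,-1] (figure: iteration space with axes i and j)
- schedule: θs(i,j)=i
- dependence e: (i,j->i+1,j-1)
- check: θs(i+1,j-1)>θs(i,j)
  - i+1>i, which always holds, so this schedule is legal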
6
Example (uniform case)
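- Back to the legality check with vectors

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i-1][j+1] + B[i][j];

- distance vector: [1,-1] (figure: iteration space with axes i and j)
- schedule: θs(i,j)=j
- dependence e: (i,j->i+1,j-1)
- check: θs(i+1,j-1)>θs(i,j)
  - j-1>j, which never holds, so this schedule is not legal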
7
Example (uniform case)
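- Back to the legality check with vectors

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i-1][j+1] + B[i][j];

- distance vector: [1,-1] (figure: iteration space with axes i and j)
- schedule: θs(i,j)=i-j
- dependence e: (i,j->i+1,j-1)
- check: θs(i+1,j-1)>θs(i,j)
  - i-j+2>i-j, which always holds, so this schedule is legal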
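The three checks of the uniform example can be replayed by brute force. The sketch below is only an illustration, assuming small hypothetical sizes N=M=6 and a helper name `is_legal` that is not from the slides; it enumerates every producer instance and tests θ(consumer) > θ(producer).

```python
def in_domain(i, j, N, M):
    # Domain of S, read off the loop bounds: 1 <= i < N, 1 <= j < M.
    return 1 <= i < N and 1 <= j < M

def is_legal(theta, N=6, M=6):
    """True if theta(consumer) > theta(producer) for every instance of e."""
    for i in range(1, N):
        for j in range(1, M):
            ci, cj = i + 1, j - 1              # e: (i,j -> i+1,j-1)
            if in_domain(ci, cj, N, M) and not theta(ci, cj) > theta(i, j):
                return False
    return True

candidates = {
    "i": lambda i, j: i,
    "j": lambda i, j: j,
    "i-j": lambda i, j: i - j,
}
for name, theta in candidates.items():
    print(name, is_legal(theta))               # i True, j False, i-j True
```

Enumeration only covers the sizes it is run on; the symbolic checks on the slides are what establish legality for all N and M.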
8
Example (affine case)
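- Back to the legality check with vectors

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i][j-1] + A[i-1][M-j];

- dependence vectors: [0,1] and [1,*] (figure: iteration space with axes i and j)
- schedule: θs(i,j)=i+j
- dependence e: (i,j->i+1,M-j)
- check: θs(i+1,M-j)>θs(i,j)
  - M+i-j+1>i+j
  - (M+1)/2>j, so the condition holds only for instances with j<(M+1)/2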
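To see the bound (M+1)/2>j in action, the following sketch lists the instances where θs(i,j)=i+j breaks the affine dependence. The function name `violations` and the sizes are illustrative assumptions, not part of the lecture.

```python
def violations(M, N=4):
    """Producer instances (i, j) where theta(i+1, M-j) > theta(i, j) fails, theta = i + j."""
    theta = lambda i, j: i + j
    bad = []
    for i in range(1, N - 1):      # the consumer row i+1 must stay below N
        for j in range(1, M):      # M-j then also lies in 1..M-1
            if not theta(i + 1, M - j) > theta(i, j):
                bad.append((i, j))
    return bad

print(sorted({j for _, j in violations(M=6)}))   # [4, 5]
```

For M=6 the failing columns are exactly those with j >= (M+1)/2 = 3.5, which agrees with the inequality derived on the slide.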
9
The Scheduling Problem
- Find θs that satisfy the causality conditions
  - i.e., no dependences are violated
- Connection to loops
  - you can complete the schedule to get the transformation for loops
- Sometimes, the problem is formulated in terms of the transform instead of the schedule
10
Parallel Execution of DO Loops [Lamport 74]
- One of the first papers on automatic parallelization
- Hyper-plane method
- Scope of dependences: uniform + α
- Loops of the form

for I1 = l1 .. u1
  ...
    for In = ln .. un
      body

- are rewritten so that the inner loops are parallel:

for J1 = λ1 .. μ1
  ...
    for Jk = λk .. μk
      forall Jk+1 = λk+1 .. μk+1
        ...
          forall Jn = λn .. μn
            body
11
The Hyper-Plane Method
- The main theorem (simplified)
  - We are looking for a schedule θ such that the inner n-1 loops are parallel
  - θ is restricted to linear: θ=a1I1+...+anIn
- The key idea:
  - given a distance vector c, we want θ(c)>0
  - proof of existence for lex. positive c in the paper
12
The Hyper-Plane Method
- Optimizing the schedule
- What should be the objective function?
  - In this paper, it is min(μ1-λ1)
  - which is min(θ’(μ1-λ1)), where θ’(x)=|a1|x1+...+|an|xn

for I1 = l1 .. u1
  ...
    for In = ln .. un
      body

for J1 = λ1 .. μ1
  forall J2 = λ2 .. μ2
    ...
      forall Jn = λn .. μn
        body
13
Example 1
- With distance vectors [1,0] [0,1]
- θ(i,j)=ai+bj
- Constraints
  - θ([1,0])>0 : a>0
  - θ([0,1])>0 : b>0
- Minimize Ni+Mj for 0≤i<N, 0≤j<M
- (figure: iteration space with axes i and j)
14
Example 2
- With distance vectors [1,-1] [0,1]
- θ(i,j)=ai+bj
- Constraints
  - θ([1,-1])>0 : a>b
  - θ([0,1])>0 : b>0
- Minimize Ni+Mj for 0≤i<N, 0≤j<M
- (figure: iteration space with axes i and j)
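Both hyper-plane examples are small enough to solve by exhaustive search. The sketch below is a hedged illustration: it assumes integer coefficients bounded by a hypothetical `bound`, and it takes the number of time steps spanned by θ over the domain, |a|(N-1)+|b|(M-1), as the objective. That objective is an assumption standing in for the one on the slides, and `best_hyperplane` is a made-up name.

```python
from itertools import product

def best_hyperplane(distances, N, M, bound=3):
    """Smallest-span theta(i,j) = a*i + b*j with theta(c) > 0 for every distance c."""
    best = None
    for a, b in product(range(-bound, bound + 1), repeat=2):
        if all(a * ci + b * cj > 0 for ci, cj in distances):
            span = abs(a) * (N - 1) + abs(b) * (M - 1)   # assumed objective
            if best is None or span < best[0]:
                best = (span, (a, b))
    return best[1] if best else None

print(best_hyperplane([(1, 0), (0, 1)], N=10, M=10))    # Example 1: (1, 1)
print(best_hyperplane([(1, -1), (0, 1)], N=10, M=10))   # Example 2: (2, 1)
```

Under these assumptions Example 1 yields θ=i+j and Example 2 yields θ=2i+j, the smallest coefficients satisfying a>b>0.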
15
The General Plane Method
- Generalizing the Hyper-Plane method
  - when the dependences are no longer uniform
- Given the iteration vector x,
  - Hyper-Plane method is for array accesses: VAR[p(x)+c]
    - where p is a permutation common to the entire body
  - General-Plane method extends to: VAR[d(p(x)+c)]
    - where d “drops” some number of dimensions
16
Final Words on this Paper
- A very early paper, but it does
  - dependence analysis
  - scheduling
  - loop transformation / code generation
- Similar technique by Wolf & Lam for direction vectors (1991)
17
Farkas Scheduling [Feautrier 92]
- Given a PRDG
  - find a schedule θs for each statement S
  - θ is restricted to affine functions
- Affine form of Farkas Lemma
  - given a domain D = {x | Ax+b≥0}
  - an affine form ψ(x) is non-negative in D iff it can be described as a positive combination of the constraints of D
  - the coefficients of this combination are the Farkas multipliers
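The formula from this slide is missing in the transcript. The standard statement of the affine form of the Farkas lemma, written here with assumed multiplier names λ0 and λ and assuming D is non-empty, reads:

```latex
\psi(x) \ge 0 \quad \forall x \in D = \{\, x \mid Ax + b \ge 0 \,\}
\quad\Longleftrightarrow\quad
\psi(x) \equiv \lambda_0 + \lambda^{T}(Ax + b),
\qquad \lambda_0 \ge 0,\ \lambda \ge 0
```

The identity holds for all x, so the coefficients of each variable (and the constant term) can be equated on both sides; this turns a condition over infinitely many points into finitely many linear constraints on the multipliers.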
18
Problem Formulation
- Given a PRDG with nodes N and edges E
- Positivity:
  - all schedules start at 0
- Causality:
  - source/destination instances x, y
  - when the dependence is active
  - note: edge is producer to consumer
19
Using Farkas Lemma
- Given statements S1 and S2 with schedules θS1, θS2 and a dependence e (from S1 to S2)
- We want to make sure θS2(y)>θS1(x) for all <x,y> in De
  - which is θS2(y)-θS1(x)-1≥0 in De
  - make it a single function to get ψe(x,y)≥0 in De
20
The Farkas Method
- Build constraints on the schedule
  - build ψe(x,y) for each e
  - each ψ constrains the Farkas multipliers
  - solve!
21
Example 1
- Consider the following

for (i=0 .. N) {
  for (j=0 .. i-1)
S0:  x[i] = x[i] - L[i,j]*x[j];
S1: x[i] = x[i] / L[i,i];
}

- DS0: {[i,j] : 0≤i≤N and 0≤j<i }
- DS1: {[i] : 0≤i≤N}
- e1: S0[i,j]->S0[i,j-1]
- e2: S1[i]->S0[i,i-1]
- note: direction is consumer to producer
22
Example 2
- Consider the following

for (i=0 .. N) {
S0: x[i] = 0;
  for (j=0 .. N)
S1:  x[i] = x[i] + L[i,j]*b[j];
}

- DS0: {[i] : 0≤i≤N}
- DS1: {[i,j] : 0≤i,j≤N}
- e1: S1[i,j]->S0[i] : j=0
- e2: S1[i,j]->S1[i,j-1] : j>0
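The slide does not state the resulting schedule. As a sanity check, the sketch below tests one candidate pair, θS0(i)=0 and θS1(i,j)=j+1, against both dependences. The candidate, the helper name `check_example2`, and the size N=5 are assumptions for illustration, and the edges are read as consumer to producer, as in Example 1.

```python
def check_example2(theta_s0, theta_s1, N=5):
    """True if both dependences of Example 2 are respected on a small instance."""
    for i in range(N + 1):
        for j in range(N + 1):
            # e1: S1[i,0] consumes the value produced by S0[i].
            if j == 0 and not theta_s1(i, j) > theta_s0(i):
                return False
            # e2: S1[i,j] consumes the value produced by S1[i,j-1] when j > 0.
            if j > 0 and not theta_s1(i, j) > theta_s1(i, j - 1):
                return False
    return True

print(check_example2(lambda i: 0, lambda i, j: j + 1))   # True
print(check_example2(lambda i: 0, lambda i, j: j))       # False: e1 needs a strict gap
```

With the first candidate, all rows i run in parallel and the reduction over j is sequential.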
23
Example 3
- Back to this example

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i][j-1] + A[i-1][M-j];

- θs=a1i+a2j+a0
- e1:(i,j->i,j-1)
- e2:(i,j->i-1,M-j)
- (figure: iteration space with axes i and j)
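A worked version of the Farkas method on this example, as a sketch under stated assumptions: the edges are rewritten producer to consumer, the dependence domains are read off the loop bounds, and the multiplier names λ0..λ4 are introduced here for illustration.

```latex
\begin{align*}
& e_1 : (i,j) \to (i,j+1): \quad
  \psi_{e_1} = \theta(i,j+1) - \theta(i,j) - 1 = a_2 - 1 \ge 0
  \;\Rightarrow\; a_2 \ge 1 \\
& e_2 : (i,j) \to (i+1,M-j): \quad
  \psi_{e_2} = \theta(i+1,M-j) - \theta(i,j) - 1 = a_1 - 1 + a_2 M - 2 a_2 j \\
& D_{e_2} = \{\, i-1 \ge 0,\; N-2-i \ge 0,\; j-1 \ge 0,\; M-1-j \ge 0 \,\} \\
& \psi_{e_2} \equiv \lambda_0 + \lambda_1 (i-1) + \lambda_2 (N-2-i)
  + \lambda_3 (j-1) + \lambda_4 (M-1-j), \qquad \lambda_k \ge 0 \\
& i:\ 0 = \lambda_1 - \lambda_2, \qquad N:\ 0 = \lambda_2, \qquad
  j:\ -2a_2 = \lambda_3 - \lambda_4, \qquad M:\ a_2 = \lambda_4 \\
& \Rightarrow\ \lambda_4 = a_2 \ge 0,\ \ \lambda_3 = -a_2 \ge 0
  \ \Rightarrow\ a_2 = 0, \qquad
  \text{constant: } a_1 - 1 = \lambda_0 \ge 0 \ \Rightarrow\ a_1 \ge 1
\end{align*}
```

Under these assumptions e1 requires a2≥1 while e2 forces a2=0, so no single affine θs satisfies both dependences for all values of M.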
24
Multi-Dimensional Scheduling
- One-dimensional affine schedules are not sufficient
  - linearization of lex. order is polynomial (if you have parameters)
- So we want to find a set of θs for each statement
25
Multi-Dimensional Farkas
- Formulate the problem just like the 1D case
  - each dependence adds constraints
- But, we allow some to be not satisfied
  - recall the causality condition
  - δ<0 : dependence violation
  - δ=0 : weakly satisfied
  - δ>0 : strongly satisfied
26
Greedy Algorithm
- Given a PRDG with edges E
  1. formulate the problem for all edges in E
  2. weakly satisfy all of them
  3. strongly satisfy as many as possible
  4. add the obtained θ to the list
  5. remove strongly satisfied edges from E
  6. repeat until E is empty
- The obtained list of θs is your schedule
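The six steps can be sketched on the running example A[i][j] = A[i][j-1] + A[i-1][M-j]. This is not the Farkas-based formulation: as a stand-in, the sketch enumerates small integer coefficients (a1, a2) and evaluates δ = θ(consumer) - θ(producer) on a few sampled problem sizes. The names `DEPS`, `min_delta`, `greedy`, the sizes, and the coefficient range are all illustrative assumptions.

```python
from itertools import product

def in_domain(i, j, N, M):
    return 1 <= i < N and 1 <= j < M          # loop bounds of S

# Producer-to-consumer maps for the two dependences of the example.
DEPS = {
    "e1": lambda i, j, N, M: (i, j + 1),
    "e2": lambda i, j, N, M: (i + 1, M - j),
}

def min_delta(theta, dep, sizes):
    """Smallest theta(consumer) - theta(producer) over all sampled instances."""
    a1, a2 = theta
    return min(
        (a1 * ci + a2 * cj) - (a1 * i + a2 * j)
        for N, M in sizes
        for i in range(1, N)
        for j in range(1, M)
        for ci, cj in [dep(i, j, N, M)]
        if in_domain(ci, cj, N, M)
    )

def greedy(deps, sizes=((4, 8), (5, 9)), coeffs=range(-2, 3)):
    remaining, schedule = dict(deps), []
    while remaining:                                        # step 6
        best = None
        for theta in product(coeffs, repeat=2):             # step 1 (by enumeration)
            mins = {n: min_delta(theta, d, sizes) for n, d in remaining.items()}
            if min(mins.values()) < 0:
                continue                                    # step 2: no edge may be violated
            strong = [n for n, m in mins.items() if m > 0]  # step 3
            key = (-len(strong), abs(theta[0]) + abs(theta[1]))
            if best is None or key < best[0]:
                best = (key, theta, strong)
        if best is None or not best[2]:
            raise ValueError("no progress with these coefficient bounds")
        schedule.append(best[1])                            # step 4
        for name in best[2]:
            del remaining[name]                             # step 5
    return schedule

print(greedy(DEPS))   # [(1, 0), (0, 1)]
```

The first dimension θ=i strongly satisfies e2 and only weakly satisfies e1; the second dimension θ=j then handles e1, giving the two-dimensional schedule (i, j). Sampling sizes is only a proxy for the parametric reasoning that the Farkas multipliers provide.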
27
Back to the Example
- Back to this example

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i][j-1] + A[i-1][M-j];

- θs=a1i+a2j+a0
- e1:(i,j->i,j-1)
- e2:(i,j->i-1,M-j)
- (figure: iteration space with axes i and j)
28
The Vertex Method
- Another method for scheduling
- Uses the generator representation of polyhedra
  - Constraint representation: intersection of half-spaces
  - Generator representation: convex hull of vertices, rays, and lines
- The Mapping of Linear Recurrence Equations on Regular Arrays, Patrice Quinton and Vincent Van Dongen, 1989
29
The Main Theorem
- A schedule legal for the vertices + rays + lines is also legal for the entire polyhedron generated by them
  - you can compute constraints on schedules
  - no need to reason about a potentially infinite set of iterations
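The principle behind the theorem can be illustrated on a single affine form: checking it on the generators is enough. The sketch below assumes a hypothetical helper `nonneg_on_polyhedron` and handles vertices and rays only (a line would additionally require the linear part to be exactly zero along it).

```python
def affine(c, x):
    """Evaluate c0 + c1*x1 + ... for the coefficient tuple c = (c0, c1, ...)."""
    return c[0] + sum(ck * xk for ck, xk in zip(c[1:], x))

def nonneg_on_polyhedron(c, vertices, rays=()):
    """Check psi >= 0 from generators: psi(v) >= 0 and linear part >= 0 on each ray."""
    linear = (0,) + tuple(c[1:])
    return (all(affine(c, v) >= 0 for v in vertices)
            and all(affine(linear, r) >= 0 for r in rays))

# Triangle 0 <= j <= i <= N with a hypothetical size N = 4.
N = 4
verts = [(0, 0), (N, 0), (N, N)]
psi = (0, 1, -1)                                  # psi(i,j) = i - j
print(nonneg_on_polyhedron(psi, verts))           # True
# Brute-force confirmation on every integer point of the triangle.
print(all(affine(psi, (i, j)) >= 0 for i in range(N + 1) for j in range(i + 1)))
```

Three vertex evaluations replace the enumeration of all points; with rays the same check also covers unbounded domains.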
30
On the Optimality of Scheduling
- Paper by Alain Darte and Frédéric Vivien
- Survey of various methods for scheduling
  - what is the dependence abstraction used?
  - what can you say about optimality?
- Optimality:
  - does the method find all parallelism?
  - how to define “all” parallelism?
31
Scheduling Algorithms
- Allen and Kennedy [1987]
  - targeting vector machines; dependence-levels
- Wolf and Lam [1991]
  - Lamport-like; dependence vectors
- Darte and Vivien [1996]
  - Farkas-like; dependence polyhedra
- Feautrier [1992]
  - Farkas Algorithm; affine dependences
- Lim and Lam [1997]
32
Allen and Kennedy (in short)
- You have dependence-levels only
  - i.e., you know the dimension where the dependence is carried
- Parallelizes the inner loops with no loop carried dependence
  - this paper introduced dependence levels
- Also deals with loop fusion
  - if dependence is carried in some outer common loop, it can safely be fused
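A minimal sketch of the dependence-level idea, assuming distance vectors are available so that levels can be derived from them (Allen and Kennedy work from the levels alone, and the full algorithm also involves loop distribution, which is omitted here). The function names are illustrative.

```python
def dependence_level(distance):
    """1-based index of the first non-zero entry; None if loop-independent."""
    for depth, d in enumerate(distance, start=1):
        if d != 0:
            return depth
    return None

def parallel_loops(distances, depth):
    """Loops (1-based) of a nest of the given depth that carry no dependence."""
    carried = {dependence_level(d) for d in distances}
    return [k for k in range(1, depth + 1) if k not in carried]

print(parallel_loops([(1, -1)], depth=2))           # [2]
print(parallel_loops([(1, 0), (0, 1)], depth=2))    # []
```

In the first case the only dependence is carried by the outer loop, so the inner loop can run in parallel; in the second case each loop carries a dependence and no loop is parallel as written.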
33
Optimality of Allen and Kennedy
- The dependence information is very limited
  - dependence-level only
- Then the parallelism found is actually optimal
  - later proved by Darte and Vivien
34
Wolf and Lam (in short)
- Input: direction vectors
- Output: fully permutable loops
  - what does this mean?
- Context: unimodular transformations
- Optimal parallelism extraction
  - if you only know direction vectors
  - perfectly nested loops
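One common way to answer "what does this mean?": a loop nest is fully permutable when every component of every dependence vector is non-negative, so any loop order keeps the dependences lexicographically non-negative. The sketch below uses distance vectors (a special case of direction vectors) and a skewing matrix chosen for illustration; the names `fully_permutable` and `apply` are hypothetical.

```python
def fully_permutable(distances):
    """All components non-negative: any permutation of the loops stays legal."""
    return all(d >= 0 for vec in distances for d in vec)

def apply(T, vec):
    """Multiply the integer matrix T (list of rows) by a distance vector."""
    return tuple(sum(t * v for t, v in zip(row, vec)) for row in T)

dists = [(1, -1), (0, 1)]
skew = [[1, 0], [1, 1]]                 # unimodular (determinant 1): (i,j) -> (i,i+j)
skewed = [apply(skew, d) for d in dists]
print(fully_permutable(dists), skewed, fully_permutable(skewed))
# False [(1, 0), (0, 1)] True
```

The original nest is not fully permutable because of the -1 component; after skewing, both distance vectors are non-negative and the two loops can be interchanged or tiled.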
35
Optimality of Farkas Algorithm
- Original paper had no claims
  - later proved by Darte and Vivien
- The Greedy algorithm is actually optimal!
- With a few caveats
  - affine schedules
  - one schedule per statement
36
Index-Set Splitting
- Piece-wise affine schedule
  - or split a statement into multiple statements
  - or split an equation into ...
- Main Idea:
  - using one schedule for the entire statement is (sometimes) not optimal
37
Example: Smashing
- Periodic Boundaries
  - can you tile?
- (figure: iteration space with axes i and j)
38
Example: Smashing
- Periodic Boundaries
- (figures: two iteration spaces, each with axes i and j)
39
How Good is Optimal
What does Farkas scheduling bring?