Stanford University CS243, Winter 2006, Wei Li: Loop Transformations and Locality
![Page 1: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/1.jpg)
Wei Li, Stanford University
CS243 Winter 2006
Loop Transformations and Locality
![Page 2: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/2.jpg)
Agenda
• Introduction
• Loop Transformations
• Affine Transform Theory
![Page 3: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/3.jpg)
Memory Hierarchy
[Figure: memory hierarchy with CPU, caches (C), and Memory]
![Page 4: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/4.jpg)
Cache Locality

    for i = 1, 100
        for j = 1, 200
            A[i, j] = A[i, j] + 3
        end_for
    end_for

• Suppose array A has column-major layout:
  A[1,1] A[2,1] … A[1,2] A[2,2] … A[1,3] …
• Loop nest has poor spatial cache locality: the inner j loop steps across columns, so consecutive accesses are a whole column apart in memory.
![Page 5: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/5.jpg)
Loop Interchange

    for i = 1, 100
        for j = 1, 200
            A[i, j] = A[i, j] + 3
        end_for
    end_for

• Suppose array A has column-major layout:
  A[1,1] A[2,1] … A[1,2] A[2,2] … A[1,3] …
• New loop nest has better spatial cache locality.

    for j = 1, 200
        for i = 1, 100
            A[i, j] = A[i, j] + 3
        end_for
    end_for
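The effect of the interchange on spatial locality can be checked by listing the memory offsets each loop order touches. The sketch below is only an illustration: it assumes A has exactly 100 rows and 200 columns (the slides give loop bounds, not array extents), and the helper names `offset` and `inner_strides` are made up for this example.

```python
ROWS, COLS = 100, 200  # assumed array extents, chosen to match the loop bounds

def offset(i, j):
    """Column-major linear offset of A[i, j] with 1-based indices."""
    return (i - 1) + (j - 1) * ROWS

def inner_strides(order):
    """Distances (in elements) between consecutive accesses for a loop order."""
    if order == "ij":   # i outer, j inner: the original nest
        trace = [offset(i, j)
                 for i in range(1, ROWS + 1) for j in range(1, COLS + 1)]
    else:               # j outer, i inner: the interchanged nest
        trace = [offset(i, j)
                 for j in range(1, COLS + 1) for i in range(1, ROWS + 1)]
    return [b - a for a, b in zip(trace, trace[1:])]

original = inner_strides("ij")
interchanged = inner_strides("ji")

# Count how many consecutive accesses are to adjacent memory locations.
unit_original = sum(1 for s in original if s == 1)
unit_interchanged = sum(1 for s in interchanged if s == 1)
print(unit_original, unit_interchanged, len(interchanged))
```

Under these assumptions the original order never touches two adjacent elements in a row, while the interchanged order walks memory sequentially.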
![Page 6: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/6.jpg)
Interchange Loops?

    for i = 2, 100
        for j = 1, 200
            A[i, j] = A[i-1, j+1] + 3
        end_for
    end_for

• e.g. dependence from (3,3) to (4,2)
[Figure: iteration space with axes i and j showing the dependences]
![Page 7: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/7.jpg)
Dependence Vectors
[Figure: iteration space with axes i and j]
• Distance vector (1,-1) = (4,2) - (3,3)
• Direction vector (+, -), from the signs of the distance vector
• Loop interchange is not legal if there exists a dependence (+, -): after the interchange it becomes (-, +), so the dependent iteration would run before the iteration it depends on.
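One way to see why a (+, -) dependence blocks interchange is to run both loop orders on a small array and compare the results. This is a minimal sketch with shrunken bounds (4 x 4 instead of 100 x 200) and made-up initial values; the function name `run` is hypothetical.

```python
def run(order, n=4, m=4):
    """Execute A[i][j] = A[i-1][j+1] + 3 in the given loop order."""
    # Row 0 and column m+1 are padding so that A[i-1][j+1] stays in bounds.
    A = [[10 * r + c for c in range(m + 2)] for r in range(n + 1)]
    if order == "ij":   # original nest: i outer, j inner
        points = [(i, j) for i in range(2, n + 1) for j in range(1, m + 1)]
    else:               # interchanged nest: j outer, i inner
        points = [(i, j) for j in range(1, m + 1) for i in range(2, n + 1)]
    for i, j in points:
        A[i][j] = A[i - 1][j + 1] + 3
    return A

original = run("ij")
interchanged = run("ji")

# Distance vector of the dependence from iteration (3,3) to iteration (4,2).
source, sink = (3, 3), (4, 2)
distance = tuple(t - s for s, t in zip(source, sink))
print(distance, original == interchanged)
```

The two orders produce different arrays, because the interchanged nest reads A[i-1][j+1] before the iteration that should have written it has run.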
![Page 8: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/8.jpg)
Agenda
• Introduction
• Loop Transformations
• Affine Transform Theory
![Page 9: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/9.jpg)
Loop Fusion

Before:

    for i = 1, 1000
        A[i] = B[i] + 3
    end_for
    for j = 1, 1000
        C[j] = A[j] + 5
    end_for

After:

    for i = 1, 1000
        A[i] = B[i] + 3
        C[i] = A[i] + 5
    end_for

• Better reuse between the write of A[i] and the read of A[i], which now occur in the same iteration.
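A quick way to convince oneself that the fused loop computes the same values is to run both versions side by side. The sketch below uses 0-based Python lists instead of the 1-based arrays on the slide; `unfused` and `fused` are illustrative names.

```python
def unfused(B):
    """Two separate loops: A is fully written before C is computed."""
    n = len(B)
    A, C = [0] * n, [0] * n
    for i in range(n):
        A[i] = B[i] + 3
    for j in range(n):
        C[j] = A[j] + 5
    return A, C

def fused(B):
    """One loop: A[i] is consumed right after it is produced."""
    n = len(B)
    A, C = [0] * n, [0] * n
    for i in range(n):
        A[i] = B[i] + 3
        C[i] = A[i] + 5   # A[i] was written in this same iteration
    return A, C

B = list(range(1000))
print(unfused(B) == fused(B))
```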
![Page 10: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/10.jpg)
Loop Distribution

Before:

    for i = 1, 1000
        A[i] = A[i-1] + 3
        C[i] = B[i] + 5
    end_for

After:

    for i = 1, 1000
        A[i] = A[i-1] + 3
    end_for
    for i = 1, 1000
        C[i] = B[i] + 5
    end_for

• The 2nd loop is parallel: it carries no dependence, unlike the 1st loop, where A[i] depends on A[i-1].
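A loop is parallel when its iterations can run in any order. As a rough illustration, the sketch below runs each distributed loop forwards and backwards: the second loop gives the same answer either way, the first does not. Bounds are shrunk to 10 and the helper names are made up for this example.

```python
def first_loop(A, order):
    """A[i] = A[i-1] + 3, visiting iterations in the given order."""
    A = A[:]                      # work on a copy
    for i in order:
        A[i] = A[i - 1] + 3
    return A

def second_loop(B, order):
    """C[i] = B[i] + 5, visiting iterations in the given order."""
    C = [0] * len(B)
    for i in order:
        C[i] = B[i] + 5
    return C

n = 10
forward = list(range(1, n + 1))
backward = forward[::-1]
A0 = [0] * (n + 1)                # index 0 holds the initial A[0]
B = list(range(n + 1))

second_is_order_independent = second_loop(B, forward) == second_loop(B, backward)
first_is_order_independent = first_loop(A0, forward) == first_loop(A0, backward)
print(second_is_order_independent, first_is_order_independent)
```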
![Page 11: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/11.jpg)
Register Blocking

Before:

    for j = 1, 2*m
        for i = 1, 2*n
            A[i, j] = A[i-1, j] + A[i-1, j-1]
        end_for
    end_for

After:

    for j = 1, 2*m, 2
        for i = 1, 2*n, 2
            A[i, j]     = A[i-1, j]   + A[i-1, j-1]
            A[i, j+1]   = A[i-1, j+1] + A[i-1, j]
            A[i+1, j]   = A[i, j]     + A[i, j-1]
            A[i+1, j+1] = A[i, j+1]   + A[i, j]
        end_for
    end_for

• Better reuse between the write of A[i, j] and the later reads of A[i, j] in the same loop body.
![Page 12: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/12.jpg)
Virtual Register Allocation

    for j = 1, 2*M, 2
        for i = 1, 2*N, 2
            r1 = A[i-1, j]
            r2 = r1 + A[i-1, j-1]
            A[i, j] = r2
            r3 = A[i-1, j+1] + r1
            A[i, j+1] = r3
            A[i+1, j] = r2 + A[i, j-1]
            A[i+1, j+1] = r3 + r2
        end_for
    end_for

• Memory operations reduced to register load/store
• 8MN loads reduced to 4MN loads: 2 loads in each of the 4MN original iterations, versus 4 loads in each of the MN blocked iterations.
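The load counts can be checked by instrumenting the array. The sketch below is an assumption-laden illustration: `CountingArray` is a made-up helper that counts reads, the initial values are arbitrary, and row 0 and column 0 serve as the boundary that the slide leaves implicit.

```python
class CountingArray:
    """2-D array that counts element loads (reads)."""
    def __init__(self, rows, cols):
        self.data = [[(7 * r + 3 * c) % 11 for c in range(cols)]
                     for r in range(rows)]
        self.loads = 0

    def load(self, i, j):
        self.loads += 1
        return self.data[i][j]

    def store(self, i, j, value):
        self.data[i][j] = value

def original(M, N):
    A = CountingArray(2 * N + 1, 2 * M + 1)
    for j in range(1, 2 * M + 1):
        for i in range(1, 2 * N + 1):
            A.store(i, j, A.load(i - 1, j) + A.load(i - 1, j - 1))
    return A

def blocked(M, N):
    A = CountingArray(2 * N + 1, 2 * M + 1)
    for j in range(1, 2 * M + 1, 2):
        for i in range(1, 2 * N + 1, 2):
            r1 = A.load(i - 1, j)
            r2 = r1 + A.load(i - 1, j - 1)
            A.store(i, j, r2)
            r3 = A.load(i - 1, j + 1) + r1
            A.store(i, j + 1, r3)
            A.store(i + 1, j, r2 + A.load(i, j - 1))
            A.store(i + 1, j + 1, r3 + r2)
    return A

M, N = 3, 4
before, after = original(M, N), blocked(M, N)
print(before.loads, after.loads, before.data == after.data)
```

Both versions produce the same array contents; only the number of memory reads changes.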
![Page 13: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/13.jpg)
Scalar Replacement

Before:

    for i = 2, N+1
        ... = A[i-1] + 1
        A[i] = ...
    end_for

After:

    t1 = A[1]
    for i = 2, N+1
        ... = t1 + 1
        t1 = ...
        A[i] = t1
    end_for

• Eliminate loads and stores for array references
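To make the saving concrete, the elided parts of the loop have to be filled with something. In the sketch below the consumer `B[i] = ...` and the producer `2 * i` are placeholders invented for this illustration and are not taken from the slide; loads of A are counted by hand.

```python
def before(N):
    """Original loop: one load of A[i-1] per iteration."""
    A, B = [0] * (N + 2), [0] * (N + 2)
    A[1] = 5
    loads = 0
    for i in range(2, N + 2):
        loads += 1                 # load A[i-1]
        B[i] = A[i - 1] + 1        # hypothetical consumer of A[i-1]
        A[i] = 2 * i               # hypothetical producer of A[i]
    return A, B, loads

def after(N):
    """Scalar-replaced loop: A[i-1] is carried in the scalar t1."""
    A, B = [0] * (N + 2), [0] * (N + 2)
    A[1] = 5
    t1 = A[1]                      # the only load of A
    loads = 1
    for i in range(2, N + 2):
        B[i] = t1 + 1
        t1 = 2 * i
        A[i] = t1
    return A, B, loads

A1, B1, loads_before = before(8)
A2, B2, loads_after = after(8)
print(loads_before, loads_after, (A1, B1) == (A2, B2))
```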
![Page 14: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/14.jpg)
Unroll-and-Jam

Before:

    for j = 1, 2*M
        for i = 1, N
            A[i, j] = A[i-1, j] + A[i-1, j-1]
        end_for
    end_for

After:

    for j = 1, 2*M, 2
        for i = 1, N
            A[i, j]   = A[i-1, j]   + A[i-1, j-1]
            A[i, j+1] = A[i-1, j+1] + A[i-1, j]
        end_for
    end_for

• Exposes more opportunity for scalar replacement
![Page 15: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/15.jpg)
Large Arrays

    for i = 1, 1000
        for j = 1, 1000
            A[i, j] = A[i, j] + B[j, i]
        end_for
    end_for

• Suppose arrays A and B have row-major layout
• B has poor cache locality. Loop interchange will not help: it would only move the poor locality from B to A.
![Page 16: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/16.jpg)
Loop Blocking

    for v = 1, 1000, 20
        for u = 1, 1000, 20
            for j = v, v+19
                for i = u, u+19
                    A[i, j] = A[i, j] + B[j, i]
                end_for
            end_for
        end_for
    end_for

• Access to small blocks of the arrays has good cache locality.
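Blocking only reorders the iterations, so the result should match the plain loop nest. A minimal sketch, assuming 0-based square lists and using `min` to clamp the last block when the size is not a multiple of the block size (the slide's 1000 and 20 divide evenly, so it needs no clamping); the function names are illustrative.

```python
def plain(A, B):
    """Straightforward nest: A[i][j] += B[j][i]."""
    n = len(A)
    out = [row[:] for row in A]
    for i in range(n):
        for j in range(n):
            out[i][j] += B[j][i]
    return out

def blocked(A, B, block=20):
    """Same computation, visiting one block x block tile at a time."""
    n = len(A)
    out = [row[:] for row in A]
    for v in range(0, n, block):
        for u in range(0, n, block):
            for j in range(v, min(v + block, n)):
                for i in range(u, min(u + block, n)):
                    out[i][j] += B[j][i]
    return out

n = 50
A = [[r + c for c in range(n)] for r in range(n)]
B = [[r * c for c in range(n)] for r in range(n)]
print(plain(A, B) == blocked(A, B))
```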
![Page 17: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/17.jpg)
Loop Unrolling for ILP

Before:

    for i = 1, 10
        a[i] = b[i];
        *p = ...
    end_for

After:

    for i = 1, 10, 2
        a[i] = b[i];
        *p = ...
        a[i+1] = b[i+1];
        *p = ...
    end_for

• Larger scheduling regions
• Fewer dynamic branches
• Increased code size
![Page 18: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/18.jpg)
Agenda
• Introduction
• Loop Transformations
• Affine Transform Theory
![Page 19: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/19.jpg)
Objective
• Unify a large class of program transformations.
• Example:

    float Z[100];
    for i = 0, 9
        Z[i+10] = Z[i];
    end_for
![Page 20: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/20.jpg)
Iteration Space
• A d-deep loop nest has d index variables, and is modeled by a d-dimensional space. The space of iterations is bounded by the lower and upper bounds of the loop indices.
• Iteration space: i = 0, 1, …, 9

    for i = 0, 9
        Z[i+10] = Z[i];
    end_for
![Page 21: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/21.jpg)
Matrix Formulation
The iterations in a d-deep loop nest can be represented mathematically as

$$\{\, \mathbf{i} \in Z^d \mid B\mathbf{i} + \mathbf{b} \ge \mathbf{0} \,\}$$

where
• Z is the set of integers,
• B is an integer matrix with d columns and one row per loop-bound constraint,
• b is an integer vector with one entry per constraint, and
• 0 is a vector of 0's of the same length as b.
![Page 22: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/22.jpg)
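Example

    for i = 0, 5
        for j = i, 7
            Z[j, i] = 0;

$$\begin{bmatrix} 1 & 0 \\ -1 & 0 \\ -1 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 0 \\ 5 \\ 0 \\ 7 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

E.g. the 3rd row, -i + j ≥ 0, is from the lower bound j ≥ i for loop j.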
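The matrix form can be checked against the loop nest by enumerating integer points. In the sketch below, B and b encode the four bounds i ≥ 0, i ≤ 5, j ≥ i and j ≤ 7; the helper `in_space` and the search box from -2 to 9 are choices made for this illustration.

```python
# Constraints B * (i, j) + b >= 0 for the nest: for i = 0, 5 / for j = i, 7
B = [[1, 0],    #  i      >= 0
     [-1, 0],   # -i + 5  >= 0
     [-1, 1],   # -i + j  >= 0
     [0, -1]]   # -j + 7  >= 0
b = [0, 5, 0, 7]

def in_space(point):
    """True if every row of B * point + b is non-negative."""
    return all(
        sum(coef * x for coef, x in zip(row, point)) + offset >= 0
        for row, offset in zip(B, b)
    )

# Iterations as written in the loop nest.
from_loops = [(i, j) for i in range(0, 6) for j in range(i, 8)]

# Integer points of a larger box that satisfy the matrix constraints.
from_matrix = [(i, j)
               for i in range(-2, 10) for j in range(-2, 10)
               if in_space((i, j))]

print(len(from_loops), from_loops == from_matrix)
```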
![Page 23: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/23.jpg)
Symbolic Constants

    for i = 0, n
        Z[i] = 0;

$$\left\{\, i \in Z \;\middle|\; \begin{bmatrix} -1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ n \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \end{bmatrix} \right\}$$

E.g. the 1st row, -i + n ≥ 0, is from the upper bound i ≤ n.
![Page 24: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/24.jpg)
Data Space
• An n-dimensional array is modeled by an n-dimensional space. The space is bounded by the array bounds.
• Data space: a = 0, 1, …, 99

    float Z[100]
    for i = 0, 9
        Z[i+10] = Z[i];
    end_for
![Page 25: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/25.jpg)
Processor Space
• Initially assume an unbounded number of virtual processors (vp1, vp2, …) organized in a multi-dimensional space.
  (iteration 1, vp1), (iteration 2, vp2), …
• After parallelization, map to physical processors (p1, p2).
  (vp1, p1), (vp2, p2), (vp3, p1), (vp4, p2), …
![Page 26: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/26.jpg)
Affine Array Index Function
• Each array access in the code specifies a mapping from an iteration in the iteration space to an array element in the data space.
• Both i+10 and i are affine.

    float Z[100]
    for i = 0, 9
        Z[i+10] = Z[i];
    end_for
![Page 27: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/27.jpg)
Array Affine Access
An array access in a loop is affine if:
• The bounds of the loop are expressed as affine expressions of the surrounding loop variables and symbolic constants, and
• The index for each dimension of the array is also an affine expression of the surrounding loop variables and symbolic constants.
![Page 28: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/28.jpg)
Matrix Formulation
• An array access maps a vector i within the bounds

$$\{\, \mathbf{i} \in Z^d \mid B\mathbf{i} + \mathbf{b} \ge \mathbf{0} \,\}$$

  to array element location Fi + f.
• E.g. access X[i-1] in loop nest i, j:

$$\begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} -1 \end{bmatrix}$$
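The mapping Fi + f is just a small matrix-vector product. The sketch below applies it to the X[i-1] access from the slide and, as a second case, to the Z[j, i] access from the earlier loop-bounds example; the function name `affine_access` and the sample iteration (4, 7) are assumptions made for illustration.

```python
def affine_access(F, f, iteration):
    """Return F * iteration + f as a tuple of array subscripts."""
    return tuple(
        sum(coef * x for coef, x in zip(row, iteration)) + offset
        for row, offset in zip(F, f)
    )

# X[i-1] in a loop nest with index vector (i, j): F = [1 0], f = [-1]
X_F, X_f = [[1, 0]], [-1]

# Z[j, i] in the same nest: F swaps the two indices, f = (0, 0)
Z_F, Z_f = [[0, 1], [1, 0]], [0, 0]

subscript_x = affine_access(X_F, X_f, (4, 7))   # iteration i = 4, j = 7
subscript_z = affine_access(Z_F, Z_f, (4, 7))
print(subscript_x, subscript_z)
```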
![Page 29: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/29.jpg)
Affine Partitioning
• An affine function to assign iterations in an iteration space to processors in the processor space.
• E.g. iteration i to processor 10-i.

    float Z[100]
    for i = 0, 9
        Z[i+10] = Z[i];
    end_for
![Page 30: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/30.jpg)
Data Access Region
• The set of array elements touched by an access, i.e. the image of the iteration space under the affine array index function.
• Region for Z[i+10] is {a | 10 ≤ a ≤ 19}.

    float Z[100]
    for i = 0, 9
        Z[i+10] = Z[i];
    end_for
![Page 31: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/31.jpg)
Data Dependences
• Solution to linear constraints, as shown in the last lecture.
• A dependence exists if there exist $i_r$, $i_w$ such that

$$0 \le i_r,\, i_w \le 9, \qquad i_w + 10 = i_r$$

    float Z[100]
    for i = 0, 9
        Z[i+10] = Z[i];
    end_for
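For a loop this small, the constraints can simply be solved by brute force. The sketch below enumerates every pair of a read iteration and a write iteration and also lists the elements each access touches; it is an exhaustive check for this one example, not the general integer-programming test from the lecture, and the variable names are made up.

```python
LOW, HIGH = 0, 9   # loop bounds of: for i = 0, 9  Z[i+10] = Z[i]

# All (i_r, i_w) pairs where the write Z[i_w + 10] hits the element read as Z[i_r].
solutions = [(i_r, i_w)
             for i_r in range(LOW, HIGH + 1)
             for i_w in range(LOW, HIGH + 1)
             if i_w + 10 == i_r]

# Elements touched by each access over the whole iteration space.
write_region = {i + 10 for i in range(LOW, HIGH + 1)}
read_region = {i for i in range(LOW, HIGH + 1)}

print(solutions, min(write_region), max(write_region))
```

The constraints have no integer solution here (i_w + 10 is at least 10, while i_r is at most 9), so the write and the read never touch the same element and the iterations are independent.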
![Page 32: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/32.jpg)
Affine Transform
[Figure: iteration space with axes i, j mapped to a new space with axes u, v]

$$\begin{bmatrix} u \\ v \end{bmatrix} = B \begin{bmatrix} i \\ j \end{bmatrix} + b$$
![Page 33: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/33.jpg)
Locality Optimization

Before:

    for i = 1, 100
        for j = 1, 200
            A[i, j] = A[i, j] + 3
        end_for
    end_for

After:

    for u = 1, 200
        for v = 1, 100
            A[v, u] = A[v, u] + 3
        end_for
    end_for

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$
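The interchange can be reproduced mechanically by applying the matrix to every iteration of the old nest. A small sketch, with `transform` as a hypothetical helper: it checks that the transformed points are exactly the iterations of the new nest and that each new iteration touches the same array element as the old one it came from.

```python
T = [[0, 1],
     [1, 0]]   # (u, v) = T * (i, j) = (j, i)

def transform(point):
    """Apply the affine transform T to an iteration vector."""
    return tuple(sum(c * x for c, x in zip(row, point)) for row in T)

old_points = [(i, j) for i in range(1, 101) for j in range(1, 201)]
new_points = [(u, v) for u in range(1, 201) for v in range(1, 101)]

# The new nest visits the transformed points in lexicographic order.
same_iterations = sorted(transform(p) for p in old_points) == new_points

# Old access A[i, j] and new access A[v, u] name the same element.
same_elements = all(
    (i, j) == (v, u)
    for (i, j) in old_points
    for (u, v) in [transform((i, j))]
)
print(same_iterations, same_elements)
```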
![Page 34: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/34.jpg)
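Old Iteration Space

    for i = 1, 100
        for j = 1, 200
            A[i, j] = A[i, j] + 3
        end_for
    end_for

The transform is

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$

The old iteration space is

$$\begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} -1 \\ 100 \\ -1 \\ 200 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

Substituting the inverse transform for (i, j) gives

$$\begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} + \begin{bmatrix} -1 \\ 100 \\ -1 \\ 200 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$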
![Page 35: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/35.jpg)
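New Iteration Space

$$\begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} + \begin{bmatrix} -1 \\ 100 \\ -1 \\ 200 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

Multiplying out the two matrices gives

$$\begin{bmatrix} 0 & 1 \\ 0 & -1 \\ 1 & 0 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} + \begin{bmatrix} -1 \\ 100 \\ -1 \\ 200 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

    for u = 1, 200
        for v = 1, 100
            A[v, u] = A[v, u] + 3
        end_for
    end_for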
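The new bounds follow from one matrix product. The sketch below multiplies the old constraint matrix by the inverse of the interchange matrix (which, for a swap, is the matrix itself) and confirms that the resulting constraints describe the u, v loop nest; `matmul` and `satisfies` are small helpers written for this illustration, and the search box is an arbitrary choice.

```python
def matmul(X, Y):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y)))
             for c in range(len(Y[0]))]
            for r in range(len(X))]

B = [[1, 0], [-1, 0], [0, 1], [0, -1]]   # old bounds on (i, j)
b = [-1, 100, -1, 200]
T = [[0, 1], [1, 0]]                     # (u, v) = T * (i, j)
T_inv = T                                # a swap is its own inverse

new_B = matmul(B, T_inv)                 # bounds on (u, v)

def satisfies(matrix, offsets, point):
    """True if matrix * point + offsets >= 0 in every row."""
    return all(sum(c * x for c, x in zip(row, point)) + off >= 0
               for row, off in zip(matrix, offsets))

new_points = [(u, v)
              for u in range(-5, 206) for v in range(-5, 106)
              if satisfies(new_B, b, (u, v))]
expected = [(u, v) for u in range(1, 201) for v in range(1, 101)]
print(new_B, new_points == expected)
```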
![Page 36: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/36.jpg)
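Old Array Accesses

    for i = 1, 100
        for j = 1, 200
            A[i, j] = A[i, j] + 3
        end_for
    end_for

The transform is

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$

The old access is

$$A\left[\; \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix},\; \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \;\right]$$

Substituting the inverse transform for (i, j) gives

$$A\left[\; \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix},\; \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} \;\right]$$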
![Page 37: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/37.jpg)
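New Array Accesses

$$A\left[\; \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix},\; \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} \;\right]$$

Multiplying out gives

$$A\left[\; \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix},\; \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} \;\right]$$

    for u = 1, 200
        for v = 1, 100
            A[v, u] = A[v, u] + 3
        end_for
    end_for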
![Page 38: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/38.jpg)
Interchange Loops?

    for i = 2, 1000
        for j = 1, 1000
            A[i, j] = A[i-1, j+1] + 3
        end_for
    end_for

• e.g. dependence vector $d_{old} = (1, -1)$
[Figure: iteration space with axes i and j showing the dependences]
![Page 39: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/39.jpg)
Interchange Loops?

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$

$$d_{new} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} d_{old} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

• A transformation is legal if the new dependence is lexicographically positive, i.e. the leading non-zero in the dependence vector is positive.
• Here $d_{new} = (-1, 1)$ is lexicographically negative, so the interchange is not legal.
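The legality rule translates directly into a few lines of code. This is a minimal sketch, assuming the caller passes only loop-carried (non-zero) distance vectors; the names `apply`, `lex_positive` and `is_legal` are invented for the example.

```python
def apply(T, d):
    """Transform a dependence distance vector: d_new = T * d_old."""
    return tuple(sum(c * x for c, x in zip(row, d)) for row in T)

def lex_positive(d):
    """True if the leading non-zero entry of d is positive."""
    for x in d:
        if x != 0:
            return x > 0
    return False   # all-zero vector: treated as not positive in this sketch

def is_legal(T, dependences):
    """A transform is legal if every transformed dependence stays positive."""
    return all(lex_positive(apply(T, d)) for d in dependences)

INTERCHANGE = [[0, 1],
               [1, 0]]

d_new = apply(INTERCHANGE, (1, -1))
print(d_new, is_legal(INTERCHANGE, [(1, -1)]), is_legal(INTERCHANGE, [(1, 1)]))
```

With the dependence (1, -1) from the slide the check rejects the interchange, whereas a nest whose only dependence is (1, 1) could be interchanged.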
![Page 40: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d3a5503460f94a1445d/html5/thumbnails/40.jpg)
Summary
• Locality Optimizations
• Loop Transformations
• Affine Transform Theory