Programming with Tiles
Jia Guo, Ganesh Bikshandi*, Basilio B. Fraguela+, Maria J. Garzaran, David Padua
University of Illinois at Urbana-Champaign
*IBM, India    +Universidade da Coruña, Spain
2
Motivation
The importance of tiles
A natural way to express many algorithms
Partitioning data is an effective way to enhance locality
Pervasive in parallel computing
Limitations of today's programming languages
Lack of constructs to express tiles directly
Complicated indexing and loop structures to traverse an array by tiles
No support for storage by tile; the algorithm design is mixed with implementation detail
3
Contributions
Our group developed Hierarchically Tiled Arrays (HTAs) to support tiles that enhance locality and parallelism (PPoPP 06).
Designed new language constructs based on our experience:
Dynamic partitioning
Overlapped tiling
Evaluated both productivity and performance
4
Outline
Introduction of HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions
5
HTA overview
A: an HTA (hierarchically tiled array)
A(0,0): the tile at tile coordinates (0,0)
A(1, 0:1): the row of tiles (1,0) and (1,1)
A(0,1)[0,1]: the scalar element [0,1] inside tile (0,1)
Operations apply tile by tile: A + B
Tile assignment: A(0, 0:1) = B(1, 0:1)
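The two-level addressing above can be sketched with a minimal, hypothetical tiled-array class. This is an illustration of the idea only; `Tile`, `TiledArray`, and `at` are made-up names, not the real HTA API:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of two-level (tile, element) addressing.
struct Tile {
    int n;                        // elements per side
    std::vector<double> data;
    explicit Tile(int n_) : n(n_), data(n_ * n_, 0.0) {}
    double& at(int i, int j) { return data[i * n + j]; }  // element inside the tile
};

struct TiledArray {
    int nt;                       // tiles per side
    std::vector<Tile> tiles;
    TiledArray(int nt_, int n) : nt(nt_), tiles(nt_ * nt_, Tile(n)) {}
    Tile& operator()(int ti, int tj) { return tiles[ti * nt + tj]; }  // tile selection
};
```

With this sketch, `A(0,1).at(0,1)` plays the role of the slide's A(0,1)[0,1]: first select the tile, then the element inside it.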
6
HTA library implementation
Implemented in MATLAB and in C++
Supports sequential and parallel execution, on top of MPI and TBB
Supports linear and non-linear data layouts
Execution model: SPMD
Programming model: single-threaded
7
Outline
Introduction of HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions
8
Why dynamic partitioning?
Some linear algebra algorithms add and remove partitions to sweep through matrices.
Cache-oblivious algorithms create partitions following a divide-and-conquer strategy.
9
Syntax and semantics
Two concepts: partition lines and partitions
Two methods:
void part(Tuple sourcePartition, Tuple offset);
void rmPart(Tuple partition);
[Figure: partition lines are numbered from 0. A.part((1,1),(2,2)) inserts a new partition line in each dimension at offset (2,2) from partition (1,1); A.rmPart((1,1)) removes that partition. NONE marks a dimension that is not repartitioned; partition line 0 is fixed.]
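In one dimension, the semantics of part and rmPart can be sketched as maintaining a set of partition lines. `Partition1D` is a hypothetical model for illustration; the real methods take Tuples and operate per dimension:

```cpp
#include <cassert>
#include <set>

// Hypothetical 1-D model of part/rmPart: a partition is a set of partition
// lines; part() inserts a new line at an offset from an existing one, and
// rmPart() removes a line. Line 0 is fixed and can never be removed.
struct Partition1D {
    std::set<int> lines{0};
    void part(int sourceLine, int offset) { lines.insert(sourceLine + offset); }
    void rmPart(int line) { if (line != 0) lines.erase(line); }
};
```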
10
LU algorithm with dynamic partitioning
A.part((1,1), (nb,nb));  ...  A.rmPart((1,1));
[Figure: one loop iteration. At the beginning of the iteration the top-left part is done and the rest is partially updated; the repartition step creates tiles A11 (nb x nb), A12, A21, and A22; the update step factors and updates them; at the end of the iteration the partition is removed.]
[Figure: the LU algorithm represented in FLAME notation [van de Geijn et al.]: inside the loop, the partition lines sweep through the matrix while the current panels are updated (UPD) with the L and U factors.]
The FLAME algorithm
12
void lu(HTA<double,2> A, HTA<int,1> p, int nb) {
  A.part((0,0), (0,0));
  p.part((0), (0));
  while (A(0,0).lsize(1) < A.lsize(1)) {
    int b = min(A(1,1).lsize(0), nb);
    A.part((1,1), (b,b));
    p.part((1), (b));
    dgetf2(A(1:2,1), p(1));
    dlaswp(A(1:2,0), p(1));
    dlaswp(A(1:2,2), p(1));
    trsm(HtaRight, HtaUpper, HtaNoTrans, HtaUnit, One, A(1,1), A(1,2));
    gemm(HtaNoTrans, HtaNoTrans, MinusOne, A(2,1), A(1,2), One, A(2,2));
    A.rmPart((1,1));
    p.rmPart((1));
  }
}
From algorithm to HTA program: the FLAME algorithm and its HTA implementation
13
FLAME API vs. HTA
A.part((0,0), (0,0));
A.part((1,1), (b,b));
A.rmPart((1,1));
A(1:2, 1);
HTA advantages: 1) general; 2) fewer variables; 3) simple indexing; 4) flexible range selection.
14
A cache-oblivious algorithm: parallel merge
[Figure: in1 and in2 are two sorted input arrays.
1. Take the middle element of in1 (e.g., 22).
2. Find its lower bound in in2 (e.g., 23).
3. Calculate the corresponding partition point in out.
4. Partition (logically) in1, in2, and out.
5. Merge the resulting pieces in parallel.]
15
HTA vs. Threading Building Blocks (TBB)
HTA code TBB code
HTA: partition tiles and operate on tiles in parallel
TBB: prepare the merge range and operate on smaller ranges in parallel
map(PMerge(), out, in1, in2);
parallel_for(PMergeRange(begin1, end1, begin2, end2, out), PMergeBody());
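The five steps above can be sketched in C++. This is a hypothetical helper (`pmerge`), not the HTA or TBB code from the slide; the recursion is sequential here, but steps 4-5 would spawn the two recursive calls as parallel tasks:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Divide-and-conquer merge of two sorted ranges into out.
void pmerge(const double* a, int na, const double* b, int nb, double* out) {
    if (na < nb) { std::swap(a, b); std::swap(na, nb); }  // keep a the larger input
    if (na == 0) return;                                  // both ranges empty
    int ma = na / 2;                                      // 1. middle element of in1
    int mb = std::lower_bound(b, b + nb, a[ma]) - b;      // 2. lower bound in in2
    out[ma + mb] = a[ma];                                 // 3. its final slot in out
    pmerge(a, ma, b, mb, out);                            // 4./5. merge both halves
    pmerge(a + ma + 1, na - ma - 1, b + mb, nb - mb, out + ma + mb + 1);
}
```

Each recursive call touches disjoint pieces of out, which is what lets the two halves run as independent tiles or tasks.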
16
Evaluation
Benchmark       Category                   Mode        Platform
MMM             cache-oblivious algorithm  sequential  3.0 GHz Intel Pentium 4, 16 KB L1, 1 MB L2, 1 GB RAM
Recursive LU    cache-oblivious algorithm  sequential  (same)
Dynamic LU      FLAME's algorithm          sequential  (same)
Sylvester       FLAME's algorithm          sequential  (same)
Parallel merge  cache-oblivious algorithm  parallel    two quad-core 2.66 GHz Xeon processors
17
[Performance figures: Recursive MMM, Recursive LU, Dynamic LU, Sylvester]
18
[Performance figure: Parallel merge]
19
Outline
Introduction of HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions
20
Motivation
Programs with computations based on neighboring points (e.g., iterative PDE solvers) benefit from tiling and require shadow regions.
Problem with the current approach: explicit allocation and update of the shadow regions.
Example: NAS MG (lines of code)
Version  comm3  Residual  Inversion  Projection
MPI      327    23        24         49
CAF      302    25        23         55
OpenMP   26     27        26         59
HTA      33     55        53         64
21
Overlapped tiling
Objective: automate the allocation and update of shadow regions, and allow access to neighbors across neighboring tiles.
The programmer specifies the overlapped region at creation time:
Overlap<DIM> ol = Overlap<DIM>(Tuple negativeDir, Tuple positiveDir,
                               BoundaryMode mode,
                               bool autoUpdate = true);
22
Example of overlapped tiling
Tuple::seq tiling = ((4), (3));
Overlap<DIM> ol((1), (1), zero);
A = HTA<double,1>::alloc(1, tiling, array, ROW, ol);
[Figure: each tile is extended with overlap cells that shadow the border elements of its neighboring tiles; cells at the array ends are handled by the boundary mode.]
23
Indexing
Operations (+, -, map, =, etc.): the conformability rules apply only to the owned regions, which enables operations with non-overlapped HTAs.
T[0:3] = T[ALL] addresses the owned region, while T[-1] and T[4] address the shadow cells on either side of it.
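The owned-versus-shadow indexing can be illustrated with a small, hypothetical 1-D tile (illustration only; the real HTA manages shadow regions per tile and per dimension):

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1-D tile with one shadow cell on each side: the tile owns
// cells 0..owned-1, while indices -1 and owned address the shadow cells.
struct OverlapTile {
    std::vector<double> buf;                          // shadow | owned ... | shadow
    explicit OverlapTile(int owned) : buf(owned + 2, 0.0) {}
    double& operator[](int i) { return buf[i + 1]; }  // shift so index -1 is legal
};
```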
24
Shadow region consistency
Ensure that shadow regions are properly updated and consistent with the corresponding data in the owned tiles.
Use an update-on-read policy, bookkeeping the status of each tile.
In SPMD mode, no communication is needed.
Allow both manual and automatic updates.
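One way to picture the update-on-read policy is a per-tile staleness flag. This is a hypothetical sketch; the library's actual bookkeeping is more involved:

```cpp
#include <cassert>

// Hypothetical sketch of update-on-read bookkeeping: a tile records whether
// its shadow region is stale and refreshes it lazily, only when it is read.
struct ShadowStatus {
    bool stale = true;       // shadow region not yet filled
    int refreshes = 0;       // how many refreshes were actually performed
    void neighborWrote() { stale = true; }    // a neighbor changed its border
    void readShadow() {                       // refresh only if needed
        if (stale) { ++refreshes; stale = false; }
    }
};
```

Repeated reads with no intervening writes trigger no extra updates, which is where the saved communication comes from.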
25
Evaluation
Benchmark             Description                         Mode
Sequential 3D Jacobi  stencil computation on 6 neighbors  sequential
NAS MG                multigrid V-cycle algorithm         parallel
NAS LU                Navier-Stokes equation solver       parallel
Platform: a cluster of 128 nodes, each with two 2 GHz G5 processors and 4 GB of RAM, connected by Myrinet; we used one processor per node. The NAS codes (Fortran + MPI) were compiled with g77 and the HTA codes with g++ 3.3; the -O3 flag was used in both cases.
26
MG comm3 in HTA without overlapped tiling

void comm3(Grid& u) {
  int NX = u.shape().size()[0] - 1;
  int NY = u.shape().size()[1] - 1;
  int NZ = u.shape().size()[2] - 1;
  int nx = u(T(0,0,0)).shape().size()[0] - 1;
  int ny = u(T(0,0,0)).shape().size()[1] - 1;
  int nz = u(T(0,0,0)).shape().size()[2] - 1;
  // north-south
  Traits::Default::async();
  if (NX > 0)
    u((R(0,NX-1), R(0,NY), R(0,NZ)))((R(nx,nx), R(1,ny-1), R(1,nz-1))) =
      u((R(1,NX), R(0,NY), R(0,NZ)))((R(1,1), R(1,ny-1), R(1,nz-1)));
  u((R(NX,NX), R(0,NY), R(0,NZ)))((R(nx,nx), R(1,ny-1), R(1,nz-1))) =
    u((R(0,0), R(0,NY), R(0,NZ)))((R(1,1), R(1,ny-1), R(1,nz-1)));
  if (NX > 0)
    u((R(1,NX), R(0,NY), R(0,NZ)))((R(0,0), R(1,ny-1), R(1,nz-1))) =
      u((R(0,NX-1), R(0,NY), R(0,NZ)))((R(nx-1,nx-1), R(1,ny-1), R(1,nz-1)));
  u((R(0,0), R(0,NY), R(0,NZ)))((R(0,0), R(1,ny-1), R(1,nz-1))) =
    u((R(NX,NX), R(0,NY), R(0,NZ)))((R(nx-1,nx-1), R(1,ny-1), R(1,nz-1)));
  Traits::Default::sync();
  Traits::Default::async();
  // east-west
  if (NY > 0)
    u((R(0,NX), R(0,NY-1), R(0,NZ)))((R(0,nx), R(ny,ny), R(1,nz-1))) =
      u((R(0,NX), R(1,NY), R(0,NZ)))((R(0,nx), R(1,1), R(1,nz-1)));
  u((R(0,NX), R(NY,NY), R(0,NZ)))((R(0,nx), R(ny,ny), R(1,nz-1))) =
    u((R(0,NX), R(0,0), R(0,NZ)))((R(0,nx), R(1,1), R(1,nz-1)));
  if (NY > 0)
    u((R(0,NX), R(1,NY), R(0,NZ)))((R(0,nx), R(0,0), R(1,nz-1))) =
      u((R(0,NX), R(0,NY-1), R(0,NZ)))((R(0,nx), R(ny-1,ny-1), R(1,nz-1)));
  u((R(0,NX), R(0,0), R(0,NZ)))((R(0,nx), R(0,0), R(1,nz-1))) =
    u((R(0,NX), R(NY,NY), R(0,NZ)))((R(0,nx), R(ny-1,ny-1), R(1,nz-1)));
  Traits::Default::sync();
  Traits::Default::async();
  // front-back
  if (NZ > 0)
    u((R(0,NX), R(0,NY), R(0,NZ-1)))((R(0,nx), R(0,ny), R(nz,nz))) =
      u((R(0,NX), R(0,NY), R(1,NZ)))((R(0,nx), R(0,ny), R(1,1)));
  u((R(0,NX), R(0,NY), R(NZ,NZ)))((R(0,nx), R(0,ny), R(nz,nz))) =
    u((R(0,NX), R(0,NY), R(0,0)))((R(0,nx), R(0,ny), R(1,1)));
  if (NZ > 0)
    u((R(0,NX), R(0,NY), R(1,NZ)))((R(0,nx), R(0,ny), R(0,0))) =
      u((R(0,NX), R(0,NY), R(0,NZ-1)))((R(0,nx), R(0,ny), R(nz-1,nz-1)));
  u((R(0,NX), R(0,NY), R(0,0)))((R(0,nx), R(0,ny), R(0,0))) =
    u((R(0,NX), R(0,NY), R(NZ,NZ)))((R(0,nx), R(0,ny), R(nz-1,nz-1)));
  Traits::Default::sync();
}
MG comm3 in HTA with overlapped tiling
Overlap<3>* ol = new Overlap<3>(T<3>(1,1,1), T<3>(1,1,1), PERIODIC);
27
MG comm3 in NAS (Fortran + MPI)

      subroutine comm3(u,n1,n2,n3,kk)
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer n1, n2, n3, kk
      double precision u(n1,n2,n3)
      integer axis
      if( .not. dead(kk) )then
         do axis = 1, 3
            if( nprocs .ne. 1) then
               call ready( axis, -1, kk )
               call ready( axis, +1, kk )
               call give3( axis, +1, u, n1, n2, n3, kk )
               call give3( axis, -1, u, n1, n2, n3, kk )
               call take3( axis, -1, u, n1, n2, n3 )
               call take3( axis, +1, u, n1, n2, n3 )
            else
               call comm1p( axis, u, n1, n2, n3, kk )
            endif
         enddo
      else
         call zero3(u,n1,n2,n3)
      endif
      return
      end
      subroutine give3( axis, dir, u, n1, n2, n3, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3, k, ierr
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id
      buff_id = 2 + dir
      buff_len = 0
      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( 2, i2, i3)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( n1-1, i2, i3)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif
      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( i1, 2, i3)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( i1, n2-1, i3)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif
      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( i1, i2, 2)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( i1, i2, n3-1)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif
      return
      end
      subroutine take3( axis, dir, u, n1, n2, n3 )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3
      double precision u( n1, n2, n3 )
      integer buff_id, indx
      integer status(mpi_status_size), ierr
      integer i3, i2, i1
      call mpi_wait( msg_id( axis, dir, 1 ), status, ierr)
      buff_id = 3 + dir
      indx = 0
      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(n1,i2,i3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(1,i2,i3) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif
      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,n2,i3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,1,i3) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif
      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,n3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,1) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif
      return
      end
      subroutine ready( axis, dir, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, k
      integer buff_id, buff_len, i, ierr
      buff_id = 3 + dir
      buff_len = nm2
      do i=1,nm2
         buff(i,buff_id) = 0.0D0
      enddo
      msg_id(axis,dir,1) = msg_type(axis,dir) + 1000*me
      call mpi_irecv( buff(1,buff_id), buff_len,
     >     dp_type, nbr(axis,-dir,k), msg_type(axis,dir),
     >     mpi_comm_world, msg_id(axis,dir,1), ierr)
      return
      end
28
[Performance figures: 3D Jacobi, NAS MG class C, NAS LU class B, NAS LU class C]
29
Outline
Introduction of HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions
30
Three metrics
Programming effort [Halstead, 1977]
Program volume V: a function of the number of distinct operators and operands and their total number of occurrences.
Potential volume V*: the volume of the most succinct form in a language that has defined or implemented the required operations.
Effort: E = V^2 / V*
Program complexity [McCabe, 1976]: C = P + 1, where P is the number of decision points in the program.
Source lines of code, L.
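As a worked illustration of the McCabe metric (a hypothetical function, not one of the benchmarks): a function with two decision points has complexity C = 2 + 1 = 3.

```cpp
#include <cassert>

// Hypothetical example for the McCabe metric: this function has two decision
// points (the for and the if), so its cyclomatic complexity is C = 2 + 1 = 3.
int clamp_sum(const int* v, int n, int limit) {
    int s = 0;
    for (int i = 0; i < n; ++i)   // decision point 1
        s += v[i];
    if (s > limit)                // decision point 2
        s = limit;
    return s;
}
```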
31
Evaluation
Program    Version  Effort   Complexity  LOC
staticLU   HTA      61,545   3           10
staticLU   NDS      208,074  8           49
staticLU   LAPACK   160,509  6           37
dynamicLU  HTA      51,599   1           13
dynamicLU  FLAME    170,477  1           52
recLU      HTA      85,530   1           18
recLU      ATLAS    186,891  10          40
Sylvester  HTA      423,404  2           47
Sylvester  FLAME    700,629  5           95
32
Related work
FLAME API [Bientinesi et al., 2005]: ad-hoc notations
Sequoia [Fatahalian et al., 2006]: principal construct is the task
HPF [Hiranandani et al., 1992] and Co-Array Fortran [Numrich and Reid, 1998]: tiles are used for distribution; do not address different levels of the memory hierarchy
POOMA [Reynders, 1996]: tiles and shadow regions are accessed as a whole
Global Arrays [Nieplocha et al., 2006]: SPMD programming model
33
Conclusion
HTA makes tiles part of a language:
It provides a generalized framework to express tiles.
It increases productivity: less index calculation, fewer variables and loops, simpler function interfaces.
Dynamic partitioning and overlapped tiling add expressiveness with little performance degradation.
34
35
Code example: 2D Jacobi
Without overlapped tiling vs. with overlapped tiling:
• HTA takes care of the allocation of shadow regions and of data consistency
• Clean indexing syntax
36
Two types of stencil computation
Concurrent computation: each tile can be executed independently. Examples: Jacobi, MG. The shadow regions are updated in every iteration.
Wavefront computation: the execution sequence of tiles follows a certain order. Examples: LU, SSOR. The shadow regions are updated starting from the second iteration.
With the update-on-read policy, a minimal number of communications is achieved.