Programming with Tiles
Jia Guo, Ganesh Bikshandi*, Basilio B. Fraguela+, Maria J. Garzaran, David Padua
University of Illinois at Urbana-Champaign
*IBM, India    +Universidade da Coruña, Spain
2
Motivation
The importance of tiles
A natural way to express many algorithms
Partitioning data is an effective way to enhance locality
Pervasive in parallel computing
Limitations of today's programming languages
Lack of constructs to express tiles directly
Complicated indexing and loop structures to traverse an array by tiles
No support for storage by tile; the algorithm design is mixed with implementation detail
3
Contributions
Our group developed Hierarchically Tiled Arrays (HTAs) to support tiles that enhance locality and parallelism (PPoPP 06).
Designed new language constructs based on our experience:
Dynamic partitioning
Overlapped tiling
Evaluated both productivity and performance
4
Outline
Introduction of HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions
5
HTA overview
A: an HTA (hierarchically tiled array)
A(0,0): the tile at tile coordinates (0,0)
A(1, 0:1): the row of tiles (1,0) and (1,1)
A(0,1)[0,1]: the scalar element [0,1] inside tile (0,1)
Operations apply tile by tile: A + B
Tile assignment: A(0, 0:1) = B(1, 0:1)
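The two-level addressing above can be sketched with a minimal, hypothetical tiled-array class. This is an illustration of the idea only; `Tile`, `TiledArray`, and `at` are made-up names, not the real HTA API:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of two-level (tile, element) addressing.
struct Tile {
    int n;                        // elements per side
    std::vector<double> data;
    explicit Tile(int n_) : n(n_), data(n_ * n_, 0.0) {}
    double& at(int i, int j) { return data[i * n + j]; }  // element inside the tile
};

struct TiledArray {
    int nt;                       // tiles per side
    std::vector<Tile> tiles;
    TiledArray(int nt_, int n) : nt(nt_), tiles(nt_ * nt_, Tile(n)) {}
    Tile& operator()(int ti, int tj) { return tiles[ti * nt + tj]; }  // tile selection
};
```

With this sketch, `A(0,1).at(0,1)` plays the role of the slide's A(0,1)[0,1]: first select the tile, then the element inside it.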
6
HTA library implementation
Implemented in MATLAB and in C++
Supports sequential and parallel execution, on top of MPI and TBB
Supports linear and non-linear data layouts
Execution model: SPMD
Programming model: single-threaded
7
Outline
Introduction of HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions
8
Why dynamic partitioning?
Some linear algebra algorithms add and remove partitions to sweep through matrices.
Cache-oblivious algorithms create partitions following a divide-and-conquer strategy.
9
Syntax and semantics
Two concepts: partition lines and partitions
Two methods:
void part(Tuple sourcePartition, Tuple offset);
void rmPart(Tuple partition);
[Figure: partition lines are numbered from 0. A.part((1,1),(2,2)) inserts a new partition line in each dimension at offset (2,2) from partition (1,1); A.rmPart((1,1)) removes that partition. NONE marks a dimension that is not repartitioned; partition line 0 is fixed.]
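In one dimension, the semantics of part and rmPart can be sketched as maintaining a set of partition lines. `Partition1D` is a hypothetical model for illustration; the real methods take Tuples and operate per dimension:

```cpp
#include <cassert>
#include <set>

// Hypothetical 1-D model of part/rmPart: a partition is a set of partition
// lines; part() inserts a new line at an offset from an existing one, and
// rmPart() removes a line. Line 0 is fixed and can never be removed.
struct Partition1D {
    std::set<int> lines{0};
    void part(int sourceLine, int offset) { lines.insert(sourceLine + offset); }
    void rmPart(int line) { if (line != 0) lines.erase(line); }
};
```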
10
LU algorithm with dynamic partitioning
A.part((1,1), (nb,nb));  ...  A.rmPart((1,1));
[Figure: one loop iteration. At the beginning of the iteration the top-left part is done and the rest is partially updated; the repartition step creates tiles A11 (nb x nb), A12, A21, and A22; the update step factors and updates them; at the end of the iteration the partition is removed.]
[Figure: the LU algorithm represented in FLAME notation [van de Geijn et al.]: inside the loop, the partition lines sweep through the matrix while the current panels are updated (UPD) with the L and U factors.]
The FLAME algorithm
12
void lu(HTA<double,2> A, HTA<int,1> p, int nb) {
  A.part((0,0), (0,0));
  p.part((0), (0));
  while (A(0,0).lsize(1) < A.lsize(1)) {
    int b = min(A(1,1).lsize(0), nb);
    A.part((1,1), (b,b));
    p.part((1), (b));
    dgetf2(A(1:2,1), p(1));
    dlaswp(A(1:2,0), p(1));
    dlaswp(A(1:2,2), p(1));
    trsm(HtaRight, HtaUpper, HtaNoTrans, HtaUnit, One, A(1,1), A(1,2));
    gemm(HtaNoTrans, HtaNoTrans, MinusOne, A(2,1), A(1,2), One, A(2,2));
    A.rmPart((1,1));
    p.rmPart((1));
  }
}
From algorithm to HTA program: the FLAME algorithm and its HTA implementation
13
FLAME API vs. HTA
A.part((0,0), (0,0));
A.part((1,1), (b,b));
A.rmPart((1,1));
A(1:2, 1);
HTA advantages: 1) general; 2) fewer variables; 3) simple indexing; 4) flexible range selection.
14
A cache-oblivious algorithm: parallel merge
[Figure: in1 and in2 are two sorted input arrays.
1. Take the middle element of in1 (e.g., 22).
2. Find its lower bound in in2 (e.g., 23).
3. Calculate the corresponding partition point in out.
4. Partition (logically) in1, in2, and out.
5. Merge the resulting pieces in parallel.]
15
HTA vs. Threading Building Blocks (TBB)
HTA code TBB code
HTA: partition tiles and operate on tiles in parallel
TBB: prepare the merge range and operate on smaller ranges in parallel
map(PMerge(), out, in1, in2);
parallel_for(PMergeRange(begin1, end1, begin2, end2, out), PMergeBody());
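The five steps above can be sketched in C++. This is a hypothetical helper (`pmerge`), not the HTA or TBB code from the slide; the recursion is sequential here, but steps 4-5 would spawn the two recursive calls as parallel tasks:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Divide-and-conquer merge of two sorted ranges into out.
void pmerge(const double* a, int na, const double* b, int nb, double* out) {
    if (na < nb) { std::swap(a, b); std::swap(na, nb); }  // keep a the larger input
    if (na == 0) return;                                  // both ranges empty
    int ma = na / 2;                                      // 1. middle element of in1
    int mb = std::lower_bound(b, b + nb, a[ma]) - b;      // 2. lower bound in in2
    out[ma + mb] = a[ma];                                 // 3. its final slot in out
    pmerge(a, ma, b, mb, out);                            // 4./5. merge both halves
    pmerge(a + ma + 1, na - ma - 1, b + mb, nb - mb, out + ma + mb + 1);
}
```

Each recursive call touches disjoint pieces of out, which is what lets the two halves run as independent tiles or tasks.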
16
Evaluation
Benchmark       Category                   Mode        Platform
MMM             cache-oblivious algorithm  sequential  3.0 GHz Intel Pentium 4, 16 KB L1, 1 MB L2, 1 GB RAM
Recursive LU    cache-oblivious algorithm  sequential  (same)
Dynamic LU      FLAME's algorithm          sequential  (same)
Sylvester       FLAME's algorithm          sequential  (same)
Parallel merge  cache-oblivious algorithm  parallel    two quad-core 2.66 GHz Xeon processors
17
[Performance figures: Recursive MMM, Recursive LU, Dynamic LU, Sylvester]
18
[Performance figure: Parallel merge]
19
Outline
Introduction of HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions
20
Motivation
Programs with computations based on neighboring points (e.g., iterative PDE solvers) benefit from tiling and require shadow regions.
Problem with the current approach: explicit allocation and update of the shadow regions.
Example: NAS MG (lines of code)
Version  comm3  Residual  Inversion  Projection
MPI      327    23        24         49
CAF      302    25        23         55
OpenMP   26     27        26         59
HTA      33     55        53         64
21
Overlapped tiling
Objective: automate the allocation and update of shadow regions, and allow access to neighbors across neighboring tiles.
The programmer specifies the overlapped region at creation time:
Overlap<DIM> ol = Overlap<DIM>(Tuple negativeDir, Tuple positiveDir,
                               BoundaryMode mode,
                               bool autoUpdate = true);
22
Example of overlapped tiling
Tuple::seq tiling = ((4), (3));
Overlap<DIM> ol((1), (1), zero);
A = HTA<double,1>::alloc(1, tiling, array, ROW, ol);
[Figure: each tile is extended with overlap cells that shadow the border elements of its neighboring tiles; cells at the array ends are handled by the boundary mode.]
23
Indexing
Operations (+, -, map, =, etc.): the conformability rules apply only to the owned regions, which enables operations with non-overlapped HTAs.
T[0:3] = T[ALL] addresses the owned region, while T[-1] and T[4] address the shadow cells on either side of it.
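The owned-versus-shadow indexing can be illustrated with a small, hypothetical 1-D tile (illustration only; the real HTA manages shadow regions per tile and per dimension):

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1-D tile with one shadow cell on each side: the tile owns
// cells 0..owned-1, while indices -1 and owned address the shadow cells.
struct OverlapTile {
    std::vector<double> buf;                          // shadow | owned ... | shadow
    explicit OverlapTile(int owned) : buf(owned + 2, 0.0) {}
    double& operator[](int i) { return buf[i + 1]; }  // shift so index -1 is legal
};
```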
24
Shadow region consistency
Ensure that shadow regions are properly updated and consistent with the corresponding data in the owned tiles.
Use an update-on-read policy, bookkeeping the status of each tile.
In SPMD mode, no communication is needed.
Allow both manual and automatic updates.
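One way to picture the update-on-read policy is a per-tile staleness flag. This is a hypothetical sketch; the library's actual bookkeeping is more involved:

```cpp
#include <cassert>

// Hypothetical sketch of update-on-read bookkeeping: a tile records whether
// its shadow region is stale and refreshes it lazily, only when it is read.
struct ShadowStatus {
    bool stale = true;       // shadow region not yet filled
    int refreshes = 0;       // how many refreshes were actually performed
    void neighborWrote() { stale = true; }    // a neighbor changed its border
    void readShadow() {                       // refresh only if needed
        if (stale) { ++refreshes; stale = false; }
    }
};
```

Repeated reads with no intervening writes trigger no extra updates, which is where the saved communication comes from.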
25
Evaluation
Benchmark             Description                         Mode
Sequential 3D Jacobi  stencil computation on 6 neighbors  sequential
NAS MG                multigrid V-cycle algorithm         parallel
NAS LU                Navier-Stokes equation solver       parallel
Platform: a cluster of 128 nodes, each with two 2 GHz G5 processors and 4 GB of RAM, connected by Myrinet; we used one processor per node. The NAS codes (Fortran + MPI) were compiled with g77 and the HTA codes with g++ 3.3; the -O3 flag was used in both cases.
26
MG comm3 in HTA without overlapped tiling

void comm3(Grid& u) {
  int NX = u.shape().size()[0] - 1;
  int NY = u.shape().size()[1] - 1;
  int NZ = u.shape().size()[2] - 1;
  int nx = u(T(0,0,0)).shape().size()[0] - 1;
  int ny = u(T(0,0,0)).shape().size()[1] - 1;
  int nz = u(T(0,0,0)).shape().size()[2] - 1;
  // north-south
  Traits::Default::async();
  if (NX > 0)
    u((R(0,NX-1), R(0,NY), R(0,NZ)))((R(nx,nx), R(1,ny-1), R(1,nz-1))) =
      u((R(1,NX), R(0,NY), R(0,NZ)))((R(1,1), R(1,ny-1), R(1,nz-1)));
  u((R(NX,NX), R(0,NY), R(0,NZ)))((R(nx,nx), R(1,ny-1), R(1,nz-1))) =
    u((R(0,0), R(0,NY), R(0,NZ)))((R(1,1), R(1,ny-1), R(1,nz-1)));
  if (NX > 0)
    u((R(1,NX), R(0,NY), R(0,NZ)))((R(0,0), R(1,ny-1), R(1,nz-1))) =
      u((R(0,NX-1), R(0,NY), R(0,NZ)))((R(nx-1,nx-1), R(1,ny-1), R(1,nz-1)));
  u((R(0,0), R(0,NY), R(0,NZ)))((R(0,0), R(1,ny-1), R(1,nz-1))) =
    u((R(NX,NX), R(0,NY), R(0,NZ)))((R(nx-1,nx-1), R(1,ny-1), R(1,nz-1)));
  Traits::Default::sync();
  Traits::Default::async();
  // east-west
  if (NY > 0)
    u((R(0,NX), R(0,NY-1), R(0,NZ)))((R(0,nx), R(ny,ny), R(1,nz-1))) =
      u((R(0,NX), R(1,NY), R(0,NZ)))((R(0,nx), R(1,1), R(1,nz-1)));
  u((R(0,NX), R(NY,NY), R(0,NZ)))((R(0,nx), R(ny,ny), R(1,nz-1))) =
    u((R(0,NX), R(0,0), R(0,NZ)))((R(0,nx), R(1,1), R(1,nz-1)));
  if (NY > 0)
    u((R(0,NX), R(1,NY), R(0,NZ)))((R(0,nx), R(0,0), R(1,nz-1))) =
      u((R(0,NX), R(0,NY-1), R(0,NZ)))((R(0,nx), R(ny-1,ny-1), R(1,nz-1)));
  u((R(0,NX), R(0,0), R(0,NZ)))((R(0,nx), R(0,0), R(1,nz-1))) =
    u((R(0,NX), R(NY,NY), R(0,NZ)))((R(0,nx), R(ny-1,ny-1), R(1,nz-1)));
  Traits::Default::sync();
  Traits::Default::async();
  // front-back
  if (NZ > 0)
    u((R(0,NX), R(0,NY), R(0,NZ-1)))((R(0,nx), R(0,ny), R(nz,nz))) =
      u((R(0,NX), R(0,NY), R(1,NZ)))((R(0,nx), R(0,ny), R(1,1)));
  u((R(0,NX), R(0,NY), R(NZ,NZ)))((R(0,nx), R(0,ny), R(nz,nz))) =
    u((R(0,NX), R(0,NY), R(0,0)))((R(0,nx), R(0,ny), R(1,1)));
  if (NZ > 0)
    u((R(0,NX), R(0,NY), R(1,NZ)))((R(0,nx), R(0,ny), R(0,0))) =
      u((R(0,NX), R(0,NY), R(0,NZ-1)))((R(0,nx), R(0,ny), R(nz-1,nz-1)));
  u((R(0,NX), R(0,NY), R(0,0)))((R(0,nx), R(0,ny), R(0,0))) =
    u((R(0,NX), R(0,NY), R(NZ,NZ)))((R(0,nx), R(0,ny), R(nz-1,nz-1)));
  Traits::Default::sync();
}
MG comm3 in HTA with overlapped tiling
Overlap<3>* ol = new Overlap<3>(T<3>(1,1,1), T<3>(1,1,1), PERIODIC);
27
MG comm3 in NAS (Fortran + MPI)

      subroutine comm3(u,n1,n2,n3,kk)
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer n1, n2, n3, kk
      double precision u(n1,n2,n3)
      integer axis
      if( .not. dead(kk) )then
         do axis = 1, 3
            if( nprocs .ne. 1) then
               call ready( axis, -1, kk )
               call ready( axis, +1, kk )
               call give3( axis, +1, u, n1, n2, n3, kk )
               call give3( axis, -1, u, n1, n2, n3, kk )
               call take3( axis, -1, u, n1, n2, n3 )
               call take3( axis, +1, u, n1, n2, n3 )
            else
               call comm1p( axis, u, n1, n2, n3, kk )
            endif
         enddo
      else
         call zero3(u,n1,n2,n3)
      endif
      return
      end
      subroutine give3( axis, dir, u, n1, n2, n3, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3, k, ierr
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id
      buff_id = 2 + dir
      buff_len = 0
      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( 2, i2, i3)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( n1-1, i2, i3)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif
      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( i1, 2, i3)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( i1, n2-1, i3)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif
      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( i1, i2, 2)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id) = u( i1, i2, n3-1)
               enddo
            enddo
            call mpi_send( buff(1,buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif
      return
      end
      subroutine take3( axis, dir, u, n1, n2, n3 )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3
      double precision u( n1, n2, n3 )
      integer buff_id, indx
      integer status(mpi_status_size), ierr
      integer i3, i2, i1
      call mpi_wait( msg_id( axis, dir, 1 ), status, ierr)
      buff_id = 3 + dir
      indx = 0
      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(n1,i2,i3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(1,i2,i3) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif
      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,n2,i3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,1,i3) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif
      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,n3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,1) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif
      return
      end
      subroutine ready( axis, dir, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, k
      integer buff_id, buff_len, i, ierr
      buff_id = 3 + dir
      buff_len = nm2
      do i=1,nm2
         buff(i,buff_id) = 0.0D0
      enddo
      msg_id(axis,dir,1) = msg_type(axis,dir) + 1000*me
      call mpi_irecv( buff(1,buff_id), buff_len,
     >     dp_type, nbr(axis,-dir,k), msg_type(axis,dir),
     >     mpi_comm_world, msg_id(axis,dir,1), ierr)
      return
      end
28
[Performance figures: 3D Jacobi, NAS MG class C, NAS LU class B, NAS LU class C]
29
Outline
Introduction of HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions
30
Three metrics
Programming effort [Halstead, 1977]
Program volume V: a function of the number of distinct operators and operands and their total number of occurrences.
Potential volume V*: the volume of the most succinct form in a language that has defined or implemented the required operations.
Effort: E = V^2 / V*
Program complexity [McCabe, 1976]: C = P + 1, where P is the number of decision points in the program.
Source lines of code, L.
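As a worked illustration of the McCabe metric (a hypothetical function, not one of the benchmarks): a function with two decision points has complexity C = 2 + 1 = 3.

```cpp
#include <cassert>

// Hypothetical example for the McCabe metric: this function has two decision
// points (the for and the if), so its cyclomatic complexity is C = 2 + 1 = 3.
int clamp_sum(const int* v, int n, int limit) {
    int s = 0;
    for (int i = 0; i < n; ++i)   // decision point 1
        s += v[i];
    if (s > limit)                // decision point 2
        s = limit;
    return s;
}
```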
31
Evaluation
Program    Version  Effort   Complexity  LOC
staticLU   HTA      61,545   3           10
staticLU   NDS      208,074  8           49
staticLU   LAPACK   160,509  6           37
dynamicLU  HTA      51,599   1           13
dynamicLU  FLAME    170,477  1           52
recLU      HTA      85,530   1           18
recLU      ATLAS    186,891  10          40
Sylvester  HTA      423,404  2           47
Sylvester  FLAME    700,629  5           95
32
Related work
FLAME API [Bientinesi et al., 2005]: ad-hoc notations
Sequoia [Fatahalian et al., 2006]: principal construct is the task
HPF [Hiranandani et al., 1992] and Co-Array Fortran [Numrich and Reid, 1998]: tiles are used for distribution; do not address different levels of the memory hierarchy
POOMA [Reynders, 1996]: tiles and shadow regions are accessed as a whole
Global Arrays [Nieplocha et al., 2006]: SPMD programming model
33
Conclusion
HTA makes tiles part of a language:
It provides a generalized framework to express tiles.
It increases productivity: less index calculation, fewer variables and loops, simpler function interfaces.
Dynamic partitioning and overlapped tiling add expressiveness with little performance degradation.
34
35
Code example: 2D Jacobi
Without overlapped tiling vs. with overlapped tiling:
• HTA takes care of the allocation of shadow regions and of data consistency
• Clean indexing syntax
36
Two types of stencil computation
Concurrent computation: each tile can be executed independently. Examples: Jacobi, MG. The shadow regions are updated in every iteration.
Wavefront computation: the execution sequence of tiles follows a certain order. Examples: LU, SSOR. The shadow regions are updated starting from the second iteration.
With the update-on-read policy, a minimal number of communications is achieved.