Programming with Tiles

Jia Guo, Ganesh Bikshandi*, Basilio B. Fraguela+, Maria J. Garzaran, David Padua
University of Illinois at Urbana-Champaign
*IBM, India    +Universidade da Coruna, Spain


Motivation

The importance of tiles:
- A natural way to express many algorithms
- Partitioning data is an effective way to enhance locality
- Pervasive in parallel computing

Limitations of today's programming languages:
- No programming construct to express tiles directly
- Complicated indexing and loop structures are needed to traverse an array by tiles
- No support for storage by tile; the design of the algorithm is mixed with its detailed implementation
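To make the indexing burden concrete, here is a minimal plain-C++ sketch (not HTA code; the function name and blocking scheme are illustrative) of what traversing an array by tiles looks like in a conventional language: four nested loops and index arithmetic that mix the algorithm with its blocking details.

```cpp
#include <vector>
#include <cstddef>

// Sum an n x n row-major matrix tile by tile with b x b tiles.
// The two outer loops walk over tiles, the two inner loops walk
// within a tile; the flattened index i*n + j must be computed by hand.
double tiled_sum(const std::vector<double>& a, std::size_t n, std::size_t b) {
    double s = 0.0;
    for (std::size_t ti = 0; ti < n; ti += b)                  // tile row
        for (std::size_t tj = 0; tj < n; tj += b)              // tile column
            for (std::size_t i = ti; i < ti + b && i < n; ++i) // row in tile
                for (std::size_t j = tj; j < tj + b && j < n; ++j)
                    s += a[i * n + j];                         // manual indexing
    return s;
}
```

A tile-aware language would let the programmer name the tile directly instead of re-deriving these bounds in every loop nest.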


Contributions

Our group developed Hierarchically Tiled Arrays (HTAs) to support tiles that enhance locality and parallelism (PPoPP 2006).

We designed new language constructs based on our experience:
- Dynamic partitioning
- Overlapped tiling

We evaluated both productivity and performance.


Outline

Introduction to HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions


HTA overview

An HTA A is an array partitioned into tiles. Parentheses select tiles and brackets select scalar elements within a tile:
- A(0,0): the tile at position (0,0)
- A(1, 0:1): the tiles in row 1, columns 0 to 1
- A(0,1)[0,1]: the scalar element [0,1] inside tile (0,1)

Operations (op) work tile by tile, e.g. A + B, or between tile selections:
A(0, 0:1) = B(1, 0:1)
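The two-level indexing above can be illustrated with a toy class (the name TiledArray and its layout are hypothetical, not the real HTA API): parentheses select a tile, brackets an element within it.

```cpp
#include <vector>
#include <cstddef>

// Toy 1D tiled array with uniform tile size: A(t) yields a view of
// tile t, and A(t)[i] then reads element i of that tile, mirroring
// HTA's tile-then-element indexing.
struct TiledArray {
    std::vector<double> data;   // flat storage, tiles are contiguous
    std::size_t tile;           // uniform tile size

    double* operator()(std::size_t t) {            // tile selection
        return &data[t * tile];
    }
    double& at(std::size_t t, std::size_t i) {     // element within tile t
        return data[t * tile + i];
    }
};
```

In the real library tiles can be hierarchical and distributed; this sketch only shows the indexing idea.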


HTA library implementation

Implemented as a library in MATLAB and in C++.
Supports sequential and parallel execution on top of MPI and TBB.
Supports linear and non-linear data layouts.
Execution model: SPMD.
Programming model: single-threaded.


Outline

Introduction to HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions


Why dynamic partitioning?

Some linear algebra algorithms add and remove partitions to sweep through matrices.

Cache-oblivious algorithms create partitions following a divide-and-conquer strategy.


Syntax and semantics

Partition lines and partitions

Two methods:
void part( Tuple sourcePartition, Tuple offset );
void rmPart( Tuple partition );

[Figure: starting from HTA A, A.part((1,1),(2,2)) adds a partition line in each dimension at offset (2,2) from partition line (1,1); A.rmPart((1,1)) removes those lines again. NONE indicates no partitioning in a dimension; the outermost partition lines are fixed.]
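A minimal sketch of what part/rmPart do to the partition state, assuming one dimension and using line positions instead of partition indices (illustrative only, not the HTA implementation): a partition is just a sorted set of partition-line positions, so adding a line splits a tile and removing one merges two tiles.

```cpp
#include <set>
#include <cstddef>

// One-dimensional partition state: the outer lines {0, n} are fixed,
// and every interior line splits the array into one more tile.
struct Partition1D {
    std::set<std::size_t> lines;               // partition-line positions

    explicit Partition1D(std::size_t n) { lines = {0, n}; }

    // Add a line at `offset` elements past an existing line position.
    void part(std::size_t source, std::size_t offset) {
        lines.insert(source + offset);
    }
    // Remove the line at `pos`, merging the two adjacent tiles.
    void rmPart(std::size_t pos) { lines.erase(pos); }

    std::size_t numTiles() const { return lines.size() - 1; }
};
```

The real HTA methods take Tuples and operate per dimension, but the add/remove semantics are the same idea.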


LU algorithm with dynamic partitioning

In the loop: A.part((1,1),(nb,nb)) ... A.rmPart((1,1))

[Figure: at the beginning of each iteration, the top-left part of the matrix is done and the rest is only partially updated; repartitioning with A.part exposes blocks A11, A12, A21, A22 of size governed by nb, the blocks are updated, and removing the partition with A.rmPart at the end of the iteration grows the finished part.]


LU algorithm represented in FLAME [van de Geijn et al.]

[Figure: the FLAME algorithm; in the loop, the blocks exposed by each repartitioning are updated (UPD).]


void lu(HTA<double,2> A, HTA<int,1> p, int nb) {
  A.part((0,0), (0,0));
  p.part((0), (0));
  while (A(0,0).lsize(1) < A.lsize(1)) {
    int b = min(A(1,1).lsize(0), nb);
    A.part((1,1), (b,b));
    p.part((1), (b));
    dgetf2(A(1:2,1), p(1));
    dlaswp(A(1:2,0), p(1));
    dlaswp(A(1:2,2), p(1));
    trsm(HtaRight, HtaUpper, HtaNoTrans, HtaUnit, One, A(1,1), A(1,2));
    gemm(HtaNoTrans, HtaNoTrans, MinusOne, A(2,1), A(1,2), One, A(2,2));
    A.rmPart((1,1));
    p.rmPart((1));
  }
}

From algorithm to HTA program: the FLAME algorithm and its HTA implementation.
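For readers without the HTA library, the same right-looking blocked structure can be sketched in plain C++. This version omits pivoting (dgetf2's pivot search and dlaswp) and replaces trsm/gemm with explicit loops, so it is a simplified analogue of the code above, not an equivalent.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Blocked LU without pivoting on an n x n row-major matrix.
// On return, the strict lower triangle holds L (unit diagonal implied)
// and the upper triangle holds U.
void blocked_lu(std::vector<double>& A, std::size_t n, std::size_t nb) {
    for (std::size_t k = 0; k < n; k += nb) {
        std::size_t b = std::min(nb, n - k);
        // Unblocked LU of the panel (columns k..k+b-1, rows k..n-1).
        for (std::size_t j = k; j < k + b; ++j)
            for (std::size_t i = j + 1; i < n; ++i) {
                A[i*n + j] /= A[j*n + j];                 // multiplier
                for (std::size_t c = j + 1; c < k + b; ++c)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
            }
        // trsm analogue: solve L11 * U12 = A12 for the block row.
        for (std::size_t j = k; j < k + b; ++j)
            for (std::size_t i = j + 1; i < k + b; ++i)
                for (std::size_t c = k + b; c < n; ++c)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
        // gemm analogue: trailing update A22 -= L21 * U12.
        for (std::size_t i = k + b; i < n; ++i)
            for (std::size_t j = k; j < k + b; ++j)
                for (std::size_t c = k + b; c < n; ++c)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
    }
}
```

The dynamic-partitioning version on the slide expresses exactly these three phases through tile operations instead of index ranges.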


FLAME API vs. HTA

A.part((0,0), (0,0));
A.part((1,1), (b,b));
A.rmPart((1,1));
A(1:2, 1);

HTA advantages: (1) general; (2) fewer variables; (3) simple indexing; (4) flexible range selection.


A cache-oblivious algorithm: parallel merge

[Figure: two sorted input arrays in1 and in2, and the output array out.]

1. Take the middle element of in1.
2. Find its lower bound in in2.
3. Calculate the partition point in out.
4. Partition in1, in2, and out (logically).
5. Merge the pieces in parallel.
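The five steps above can be sketched in plain C++ (shown sequentially; the two recursive calls at the bottom are independent, which is exactly where HTA's map or TBB's parallel_for would introduce parallelism):

```cpp
#include <algorithm>
#include <cstddef>

// Divide-and-conquer merge of two sorted ranges into out.
// Splits the longer input at its middle, binary-searches the other
// input, places the pivot, and recurses on the two independent pieces.
void co_merge(const double* a, std::size_t na,
              const double* b, std::size_t nb, double* out) {
    if (na < nb) { co_merge(b, nb, a, na, out); return; } // keep a longer
    if (na == 0) return;
    if (na + nb <= 4) {                                   // small base case
        std::merge(a, a + na, b, b + nb, out);
        return;
    }
    std::size_t ma = na / 2;                                   // step 1
    std::size_t mb = std::lower_bound(b, b + nb, a[ma]) - b;   // step 2
    out[ma + mb] = a[ma];                                      // step 3
    // steps 4-5: the halves are disjoint and could run in parallel
    co_merge(a, ma, b, mb, out);
    co_merge(a + ma + 1, na - ma - 1, b + mb, nb - mb,
             out + ma + mb + 1);
}
```

Because every level splits roughly in half and touches contiguous pieces, the recursion is cache-oblivious: it fits into whatever cache level is available without knowing its size.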


HTA vs. Threading Building Blocks (TBB)

HTA: partition tiles and operate on the tiles in parallel:

map(PMerge(), out, in1, in2);

TBB: prepare the merge range and operate on smaller ranges in parallel:

parallel_for(PMergeRange(begin1, end1, begin2, end2, out), PMergeBody());


Evaluation

Benchmark       Category                    Execution mode  Platform
MMM             Cache-oblivious algorithm   Sequential      3.0 GHz Intel Pentium 4, 16 KB L1, 1 MB L2, 1 GB RAM
Recursive LU    Cache-oblivious algorithm   Sequential      (same)
Dynamic LU      FLAME's algorithm           Sequential      (same)
Sylvester       FLAME's algorithm           Sequential      (same)
Parallel merge  Cache-oblivious algorithm   Parallel        Two quad-core 2.66 GHz Xeon processors


[Performance charts: Recursive MMM, Recursive LU, Dynamic LU, Sylvester.]


[Performance chart: Parallel merge.]


Outline

Introduction to HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions


Motivation

Programs whose computations are based on neighboring points (e.g. iterative PDE solvers):
- Benefit from tiling
- Require shadow regions

Problem with the current approach: explicit allocation and update of shadow regions.

Example: NAS MG (lines of code)

Version  comm3  Residual  Inversion  Projection
MPI      327    23        24         49
CAF      302    25        23         55
OpenMP   26     27        26         59
HTA      33     55        53         64


Overlapped tiling

Objective:
- Automate the allocation and update of shadow regions
- Allow access to neighbors across neighboring tiles

The programmer specifies the overlapped region at creation time:

Overlap<DIM> ol = Overlap<DIM>(Tuple negativeDir, Tuple positiveDir,
                               BoundaryMode mode,
                               bool autoUpdate=true);


Example of overlapped tiling

Tuple::seq tiling = ((4), (3));
Overlap<DIM> ol((1), (1), zero);
A = HTA<double,1>::alloc(1, tiling, array, ROW, ol);

[Figure: a 1D HTA with tiles of 4 and 3 elements; each tile gets a shadow region of one element on each side, and the boundary shadow cells at the two ends of the array are filled with zeros.]


Indexing and operations

T[0:3] = T[ALL] selects the owned region; T[-1] and T[4] index into the shadow regions.

Operations (+, -, map, =, etc.):
- The conformability rules apply only to the owned regions
- Operations with non-overlapped HTAs are enabled


Shadow region consistency

Ensure that shadow regions are properly updated and consistent with the corresponding data in the owned tiles:
- Update-on-read policy, bookkeeping the status of each tile
- SPMD mode: no communication is needed
- Both manual and automatic updates are allowed
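A one-dimensional sketch of the update-on-read bookkeeping (illustrative only, not the HTA implementation; tile layout and names are assumptions): each tile tracks whether its shadow cells are fresh, and the copy from neighboring tiles happens only when a stale shadow cell is actually read.

```cpp
#include <vector>
#include <cstddef>

// A tile with a one-element shadow region on each side:
// v.front() and v.back() are the shadow cells, the rest is owned data.
struct HaloTile {
    std::vector<double> v;
    bool haloFresh = false;   // per-tile consistency bookkeeping
};

// Read a shadow cell of tiles[t]; refresh it from the neighbors first
// if it is stale (update on read).
double readHalo(std::vector<HaloTile>& tiles, std::size_t t, bool left) {
    HaloTile& tile = tiles[t];
    if (!tile.haloFresh) {
        if (t > 0)                         // copy neighbor's last owned cell
            tile.v.front() = tiles[t-1].v[tiles[t-1].v.size() - 2];
        if (t + 1 < tiles.size())          // copy neighbor's first owned cell
            tile.v.back() = tiles[t+1].v[1];
        tile.haloFresh = true;
    }
    return left ? tile.v.front() : tile.v.back();
}
```

A write to a tile's owned region would clear haloFresh on its neighbors; repeated reads then cost nothing until the next write, which is what keeps the number of updates minimal.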


Evaluation

Benchmark   Description                         Execution mode
3D Jacobi   Stencil computation on 6 neighbors  Sequential
NAS MG      Multigrid V-cycle algorithm         Parallel
NAS LU      Navier-Stokes equation solver       Parallel

A cluster of 128 nodes, each with two 2 GHz G5 processors and 4 GB of RAM, connected with Myrinet; we used one processor per node in our experiments. The NAS codes (Fortran + MPI) were compiled with g77 and the HTA codes with g++ 3.3; the -O3 flag was used in both cases.


MG comm3 in HTA without overlapped tiling

void comm3(Grid& u) {
  int NX = u.shape().size()[0] - 1;
  int NY = u.shape().size()[1] - 1;
  int NZ = u.shape().size()[2] - 1;
  int nx = u(T(0,0,0)).shape().size()[0] - 1;
  int ny = u(T(0,0,0)).shape().size()[1] - 1;
  int nz = u(T(0,0,0)).shape().size()[2] - 1;

  // north-south
  Traits::Default::async();
  if (NX > 0)
    u((R(0,NX-1), R(0,NY), R(0,NZ)))((R(nx,nx), R(1,ny-1), R(1,nz-1))) =
        u((R(1,NX), R(0,NY), R(0,NZ)))((R(1,1), R(1,ny-1), R(1,nz-1)));
  u((R(NX,NX), R(0,NY), R(0,NZ)))((R(nx,nx), R(1,ny-1), R(1,nz-1))) =
      u((R(0,0), R(0,NY), R(0,NZ)))((R(1,1), R(1,ny-1), R(1,nz-1)));
  if (NX > 0)
    u((R(1,NX), R(0,NY), R(0,NZ)))((R(0,0), R(1,ny-1), R(1,nz-1))) =
        u((R(0,NX-1), R(0,NY), R(0,NZ)))((R(nx-1,nx-1), R(1,ny-1), R(1,nz-1)));
  u((R(0,0), R(0,NY), R(0,NZ)))((R(0,0), R(1,ny-1), R(1,nz-1))) =
      u((R(NX,NX), R(0,NY), R(0,NZ)))((R(nx-1,nx-1), R(1,ny-1), R(1,nz-1)));
  Traits::Default::sync();

  // east-west
  Traits::Default::async();
  if (NY > 0)
    u((R(0,NX), R(0,NY-1), R(0,NZ)))((R(0,nx), R(ny,ny), R(1,nz-1))) =
        u((R(0,NX), R(1,NY), R(0,NZ)))((R(0,nx), R(1,1), R(1,nz-1)));
  u((R(0,NX), R(NY,NY), R(0,NZ)))((R(0,nx), R(ny,ny), R(1,nz-1))) =
      u((R(0,NX), R(0,0), R(0,NZ)))((R(0,nx), R(1,1), R(1,nz-1)));
  if (NY > 0)
    u((R(0,NX), R(1,NY), R(0,NZ)))((R(0,nx), R(0,0), R(1,nz-1))) =
        u((R(0,NX), R(0,NY-1), R(0,NZ)))((R(0,nx), R(ny-1,ny-1), R(1,nz-1)));
  u((R(0,NX), R(0,0), R(0,NZ)))((R(0,nx), R(0,0), R(1,nz-1))) =
      u((R(0,NX), R(NY,NY), R(0,NZ)))((R(0,nx), R(ny-1,ny-1), R(1,nz-1)));
  Traits::Default::sync();

  // front-back
  Traits::Default::async();
  if (NZ > 0)
    u((R(0,NX), R(0,NY), R(0,NZ-1)))((R(0,nx), R(0,ny), R(nz,nz))) =
        u((R(0,NX), R(0,NY), R(1,NZ)))((R(0,nx), R(0,ny), R(1,1)));
  u((R(0,NX), R(0,NY), R(NZ,NZ)))((R(0,nx), R(0,ny), R(nz,nz))) =
      u((R(0,NX), R(0,NY), R(0,0)))((R(0,nx), R(0,ny), R(1,1)));
  if (NZ > 0)
    u((R(0,NX), R(0,NY), R(1,NZ)))((R(0,nx), R(0,ny), R(0,0))) =
        u((R(0,NX), R(0,NY), R(0,NZ-1)))((R(0,nx), R(0,ny), R(nz-1,nz-1)));
  u((R(0,NX), R(0,NY), R(0,0)))((R(0,nx), R(0,ny), R(0,0))) =
      u((R(0,NX), R(0,NY), R(NZ,NZ)))((R(0,nx), R(0,ny), R(nz-1,nz-1)));
  Traits::Default::sync();
}

MG comm3 in HTA with overlapped tiling:

Overlap<3>* ol = new Overlap<3>(T<3>(1,1,1), T<3>(1,1,1), PERIODIC);


MG comm3 in NAS (Fortran + MPI)

      subroutine comm3(u,n1,n2,n3,kk)
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer n1, n2, n3, kk
      double precision u(n1,n2,n3)
      integer axis

      if( .not. dead(kk) )then
         do axis = 1, 3
            if( nprocs .ne. 1) then
               call ready( axis, -1, kk )
               call ready( axis, +1, kk )
               call give3( axis, +1, u, n1, n2, n3, kk )
               call give3( axis, -1, u, n1, n2, n3, kk )
               call take3( axis, -1, u, n1, n2, n3 )
               call take3( axis, +1, u, n1, n2, n3 )
            else
               call comm1p( axis, u, n1, n2, n3, kk )
            endif
         enddo
      else
         call zero3(u,n1,n2,n3)
      endif
      return
      end

      subroutine give3( axis, dir, u, n1, n2, n3, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3, k, ierr
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id

      buff_id = 2 + dir
      buff_len = 0

      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( 2, i2, i3)
               enddo
            enddo
            call mpi_send( buff(1, buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( n1-1, i2, i3)
               enddo
            enddo
            call mpi_send( buff(1, buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif

      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1, 2, i3)
               enddo
            enddo
            call mpi_send( buff(1, buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1, n2-1, i3)
               enddo
            enddo
            call mpi_send( buff(1, buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif

      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1, i2, 2)
               enddo
            enddo
            call mpi_send( buff(1, buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1, i2, n3-1)
               enddo
            enddo
            call mpi_send( buff(1, buff_id), buff_len, dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif

      return
      end

      subroutine take3( axis, dir, u, n1, n2, n3 )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3
      double precision u( n1, n2, n3 )
      integer buff_id, indx
      integer status(mpi_status_size), ierr
      integer i3, i2, i1

      call mpi_wait( msg_id( axis, dir, 1 ), status, ierr)
      buff_id = 3 + dir
      indx = 0

      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(n1,i2,i3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(1,i2,i3) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif

      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,n2,i3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,1,i3) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif

      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,n3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,1) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif

      return
      end

      subroutine ready( axis, dir, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, k
      integer buff_id, buff_len, i, ierr

      buff_id = 3 + dir
      buff_len = nm2

      do i=1,nm2
         buff(i,buff_id) = 0.0D0
      enddo

      msg_id(axis,dir,1) = msg_type(axis,dir) + 1000*me

      call mpi_irecv( buff(1,buff_id), buff_len,
     >     dp_type, nbr(axis,-dir,k), msg_type(axis,dir),
     >     mpi_comm_world, msg_id(axis,dir,1), ierr)
      return
      end



[Performance charts: 3D Jacobi, NAS MG class C, NAS LU class B, NAS LU class C.]


Outline

Introduction to HTA
Dynamic partitioning
Overlapped tiling
Impact on programming productivity
Related work
Conclusions


Three metrics

Programming effort [Halstead, 1977]:
- Program volume V: a function of the number of operators and operands and their total number of occurrences.
- Potential volume V*: the volume of the most succinct form in a language that has defined or implemented the required operations.
- Effort E = V^2 / V*

Program complexity [McCabe, 1976]: C = P + 1, where P is the number of decision points in the program.

Source lines of code L.
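The metrics above can be stated directly as code. The volume formula V = N * log2(n), with n the number of distinct operators plus operands and N their total occurrences, is the standard Halstead definition and is assumed here; the slide itself only gives the effort and McCabe formulas.

```cpp
#include <cmath>

// Halstead program volume: V = N * log2(n), where n counts distinct
// operators + operands and N counts all their occurrences.
double volume(int distinctOperators, int distinctOperands, int totalOccurrences) {
    int n = distinctOperators + distinctOperands;
    return totalOccurrences * std::log2(static_cast<double>(n));
}

// Halstead programming effort: E = V^2 / V*.
double effort(double V, double Vstar) { return V * V / Vstar; }

// McCabe cyclomatic complexity: C = P + 1, P = number of decision points.
int mccabe(int decisionPoints) { return decisionPoints + 1; }
```

A lower potential volume V* for the HTA versions (tiles are built in, so fewer operators are needed) is what drives the large effort reductions in the next slide's table.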


Evaluation

Program     Version  Effort   Complexity  LOC
static LU   HTA      61,545   3           10
static LU   NDS      208,074  8           49
static LU   LAPACK   160,509  6           37
dynamic LU  HTA      51,599   1           13
dynamic LU  FLAME    170,477  1           52
rec LU      HTA      85,530   1           18
rec LU      ATLAS    186,891  10          40
Sylvester   HTA      423,404  2           47
Sylvester   FLAME    700,629  5           95


Related work

FLAME API [Bientinesi et al., 2005]: ad-hoc notations
Sequoia [Fatahalian et al., 2006]: principal construct is the task
HPF [Hiranandani et al., 1992] and Co-Array Fortran [Numrich and Reid, 1998]: tiles are used for distribution; different levels of the memory hierarchy are not addressed
POOMA [Reynders, 1996]: tiles and shadow regions are accessed as a whole
Global Arrays [Nieplocha et al., 2006]: SPMD programming model


Conclusion

HTA makes tiles part of a language:
- It provides a generalized framework to express tiles
- It increases productivity: less index calculation, fewer variables and loops, simpler function interfaces
- Dynamic partitioning and overlapped tiling
- Little performance degradation


Code example: 2D Jacobi

Without overlapped tiling vs. with overlapped tiling:
- HTA takes care of the allocation of shadow regions and of data consistency
- Clean indexing syntax
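A plain-C++ sketch of what the overlapped-tiling version automates, shown for a 1D three-point stencil over two tiles (illustrative; the tile layout and names are assumptions, not HTA code): once the halos are exchanged, the per-tile Jacobi kernel indexes its neighbors uniformly, whether they live in the same tile or not.

```cpp
#include <vector>
#include <cstddef>

// Each tile stores one shadow cell on each side:
// t.front() and t.back() are halos, the rest is owned data.
using Tile = std::vector<double>;

// Fill every tile's halos from its neighbors (fixed boundary value
// at the two ends of the global array).
void exchangeHalos(std::vector<Tile>& tiles, double boundary) {
    for (std::size_t t = 0; t < tiles.size(); ++t) {
        tiles[t].front() = (t == 0) ? boundary
                                    : tiles[t-1][tiles[t-1].size() - 2];
        tiles[t].back()  = (t + 1 == tiles.size()) ? boundary
                                    : tiles[t+1][1];
    }
}

// One Jacobi sweep; each tile is independent, so this loop is the
// natural unit of parallelism. Reads of t[i-1]/t[i+1] may hit the halo.
void jacobiSweep(std::vector<Tile>& tiles) {
    for (Tile& t : tiles) {
        Tile next(t);
        for (std::size_t i = 1; i + 1 < t.size(); ++i)
            next[i] = 0.5 * (t[i-1] + t[i+1]);
        t = next;
    }
}
```

Without overlapped tiling the programmer writes exchangeHalos (and its buffers, in MPI) by hand; with it, the library performs the exchange automatically under the update-on-read policy.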


Two types of stencil computation:
- Concurrent computation: each tile can be executed independently. Examples: Jacobi, MG. The shadow regions are updated in every iteration.
- Wavefront computation: the execution sequence of the tiles follows a certain order. Examples: LU, SSOR. The shadow regions are updated starting in the second iteration.

With the update-on-read policy, a minimal number of communications is achieved.