Effective Automatic Parallelization of Stencil Computations *

Ohio State Univ

Sriram Krishnamoorthy1

Muthu Baskaran1, Uday Bondhugula1, Atanas Rountev1, J. Ramanujam2, P. Sadayappan1

1 The Ohio State University
2 Louisiana State University

* Work supported by NSF

Introduction

Stencil computations
  Sweep through large data set
  Multiple time iterations
  Simple load balanced schedule

Tiling – essential to improve data locality

  Dependences between tiles
  Pipelined execution
  Skewed iteration spaces – load imbalance

Solution: Adjust tiling – re-enable concurrent execution
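The load imbalance from pipelined execution can be illustrated with a small model. This is a minimal sketch and not the authors' cost model: it assumes each processor owns one column of tiles, every tile costs one step, and a tile can start only after the tile below it in time and the tile to its left have finished. The function name `pipelined_schedule` and its parameters are hypothetical.

```python
def pipelined_schedule(time_tiles, procs):
    """Return (total_steps, utilization) for a simple wavefront pipeline.

    Assumed model (not from the slides): processor p owns column p of tiles
    and can run tile (r, p) only after tiles (r - 1, p) and (r, p - 1).
    """
    finish = {}
    for r in range(time_tiles):
        for p in range(procs):
            ready = max(finish.get((r - 1, p), 0), finish.get((r, p - 1), 0))
            finish[(r, p)] = ready + 1  # every tile costs one step
    total_steps = finish[(time_tiles - 1, procs - 1)]
    # Useful tile executions divided by available processor-steps.
    utilization = (time_tiles * procs) / (total_steps * procs)
    return total_steps, utilization
```

Under these assumptions the last processor sits idle while the pipeline fills, so utilization drops as the processor count grows relative to the number of time tiles.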

Motivation

FOR t = 0 TO T-1
  FOR i = 1 TO N-1
    A[t,i] = (A[t,i-1] + A[t,i] + A[t,i+1]) / 3

[Figure: iteration space of the loop nest, with axes t and i]
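A runnable version of the loop nest helps make the dependences concrete. The sketch below takes one common reading of the pseudocode, which is an assumption here: sweep t reads only the values produced by sweep t - 1 (a Jacobi-style update), and the two end cells are fixed boundary values. The names `jacobi_1d` and `DEPENDENCES` are illustrative and do not come from the slides.

```python
def jacobi_1d(a, steps):
    """Apply `steps` sweeps of a 3-point averaging stencil to list `a`.

    Assumption: sweep t reads only values produced by sweep t - 1, and the
    two end cells are fixed boundary values.
    """
    cur = list(a)
    for _ in range(steps):
        nxt = list(cur)  # end cells are copied unchanged
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3
        cur = nxt
    return cur


# Under that reading, point (t, i) depends on (t-1, i-1), (t-1, i) and
# (t-1, i+1), which gives these (dt, di) dependence vectors.
DEPENDENCES = [(1, -1), (1, 0), (1, 1)]
```

Every dependence vector has a positive time component, which is why all points of a single sweep can be computed at the same time in the untiled loop.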

Notation

Iteration space B: n-dim polyhedron

Dependences D: n-dim vectors

Hyperplanes H: n-dim normal vectors
  Tile bounded by pairs of hyperplanes
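The notation maps directly onto small integer vectors. The sketch below represents D and H as tuples and checks the standard legality condition from the tiling literature, which the slide does not state and is assumed here: a set of tiling hyperplanes is legal when every normal h satisfies h · d >= 0 for every dependence d. The helper names are hypothetical, and the dependence vectors assume a Jacobi-style 3-point stencil.

```python
def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(u, v))


def is_legal_tiling(hyperplanes, dependences):
    """True if no dependence points backward across any hyperplane.

    Assumed condition (standard, not from the slides): h . d >= 0 for
    every hyperplane normal h and every dependence vector d.
    """
    return all(dot(h, d) >= 0 for h in hyperplanes for d in dependences)


# (dt, di) dependences of a Jacobi-style 3-point stencil (assumption).
D = [(1, -1), (1, 0), (1, 1)]

RECTANGULAR = [(1, 0), (0, 1)]  # axis-aligned tiles
SKEWED = [(1, 1), (1, -1)]      # tiles skewed along both diagonals
```

With these vectors `is_legal_tiling(RECTANGULAR, D)` is false, because (0, 1) · (1, -1) is negative, while `is_legal_tiling(SKEWED, D)` is true. This is the sense in which time tiling a stencil forces skewed tiles.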

Approach

Concurrent start in non-tiled iteration space

Identify hyperplanes inhibiting concurrent start in tiled space

Replace one face for each inhibiting pair
  Overlapped Tiling – Replace “back-face”
  Split Tiling – Replace “front-face”

Concurrent Start: Before Tiling

Condition: A boundary that does not carry any dependence

Inter-tile Dependences

Shift vectors
  Tile traversal order
  Normal to all other hyperplanes

Hyperplane carries dependence
  A dependence “pokes” through

Inter-tile dependence vector
  Shift vector
  Corresponding hyperplane carries dependence

Concurrent Start Inhibition

Concurrent start in original iteration space along a boundary

But that boundary carries an inter-tile dependence

A boundary has concurrent start

S_j is an inter-tile dependence

That boundary carries inter-tile dependence

Companion Hyperplane

Hyperplane that destroys the inter-tile dependence

Swivel a hyperplane “backward”

Dependences carried by original hyperplane are “neutralized”
  Incoming dependences become non-incoming
  Outgoing dependences become non-outgoing

Overlapped Tiling

Replace “back face” with companion hyperplane

Additional region is shared with preceding tile
  Region of preceding tile that caused the dependence

Each new tile independent of preceding tile (“do-all” parallelism)

Increased computation cost and communication volume
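The trade-off can be seen in a small 1-D example. This is a minimal sketch in the unskewed view, not the authors' generated code: it assumes a Jacobi-style 3-point stencil with fixed end cells, and the names `jacobi_reference` and `jacobi_overlapped` are hypothetical. Each tile reads a window widened by one cell per time step on both sides and recomputes that overlap redundantly, so no tile waits on a neighbour within a time tile.

```python
def jacobi_reference(a, steps):
    """Untiled 3-point Jacobi sweeps; the two end cells stay fixed."""
    cur = list(a)
    for _ in range(steps):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3
        cur = nxt
    return cur


def jacobi_overlapped(a, steps, tile_width, time_tile):
    """Same result as jacobi_reference, computed tile by tile with overlap."""
    n = len(a)
    cur = list(a)
    for t0 in range(0, steps, time_tile):
        depth = min(time_tile, steps - t0)  # sweeps in this time tile
        out = list(cur)
        # Tiles read only `cur`, so this loop could run as a do-all.
        for lo in range(0, n, tile_width):
            hi = min(lo + tile_width, n)
            # Widen the window by `depth` cells per side: the overlap region.
            xlo, xhi = max(0, lo - depth), min(n, hi + depth)
            local = jacobi_reference(cur[xlo:xhi], depth)
            # Stale window edges corrupt at most `depth` cells per side,
            # all of which lie outside [lo, hi) and are discarded here.
            out[lo:hi] = local[lo - xlo:hi - xlo]
        cur = out
    return cur
```

The widened window is where both costs on the slide come from: every tile fetches extra input cells and repeats work that its neighbour also performs.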

Split Tiling

Replace “front face” with companion hyperplane

Tile split into independent and dependent regions

Execute independent region followed by dependent region

Increased #communications
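The two-phase execution can be sketched for the same 1-D case. This is an illustration under stated assumptions, not the authors' implementation: a Jacobi-style 3-point stencil with fixed end cells, tiles at least twice as wide as the time tile, and hypothetical names throughout. Phase 1 computes the part of each tile that shrinks by one cell per side per sweep and needs no neighbour data. Phase 2 fills the wedge around each tile boundary from values that phase 1 left behind.

```python
def jacobi_reference(a, steps):
    """Untiled 3-point Jacobi sweeps; the two end cells stay fixed."""
    cur = list(a)
    for _ in range(steps):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3
        cur = nxt
    return cur


def jacobi_split(a, steps, tile_width, time_tile):
    """Same result as jacobi_reference, computed in two phases per time tile."""
    assert tile_width >= 2 * time_tile, "sketch assumes wide tiles"
    n = len(a)
    cur = list(a)
    tiles = [(lo, min(lo + tile_width, n)) for lo in range(0, n, tile_width)]

    def update(g, s, i):
        g[s][i] = (g[s - 1][i - 1] + g[s - 1][i] + g[s - 1][i + 1]) / 3

    for t0 in range(0, steps, time_tile):
        depth = min(time_tile, steps - t0)
        # g[s][i] holds cell i after s sweeps of this time tile; rows start
        # as copies of `cur`, so the fixed end cells are already in place.
        g = [list(cur) for _ in range(depth + 1)]
        # Phase 1: independent regions, one shrinking trapezoid per tile.
        for lo, hi in tiles:
            for s in range(1, depth + 1):
                left = lo + s if lo > 0 else 1
                right = hi - s if hi < n else n - 1
                for i in range(left, right):
                    update(g, s, i)
        # Phase 2: dependent regions, one growing wedge per tile boundary.
        for _, b in tiles:
            if b == n:
                continue  # the last tile has no boundary to its right
            for s in range(1, depth + 1):
                for i in range(max(1, b - s), min(n - 1, b + s)):
                    update(g, s, i)
        cur = g[depth]
    return cur
```

No cell is computed twice in this scheme, but each time tile needs two rounds of neighbour exchange instead of one, which matches the increased number of communications noted on the slide.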

Experimental Evaluation

Cluster
  2.8 GHz dual-processor Opteron 254
  1MB L2 cache; 4GB RAM
  Linux 2.6.9; Intel compiler (icc) -O3

Comparison
  Two pipelined schedules – along space and time
  1000 time steps
  1 – 32 processors

Pipelined Execution: Parameters

Space tile size: 1000
Time tile size: 16

64000 elements; 32 processors

Performance with Problem Size

Weak Scaling

Problem size = #procs * 20000
Horizontal line – Linear Scaling

Conclusion

Time tiling stencils – crucial for data locality

Might inhibit concurrent execution

Presented: Two approaches to enabling concurrent execution

Ongoing work: Modeling relative benefits of the two approaches

Thank You!
