Effective Automatic Parallelization of Stencil Computations *
Ohio State Univ
Sriram Krishnamoorthy (1), Muthu Baskaran (1), Uday Bondhugula (1), Atanas Rountev (1), J. Ramanujam (2), P. Sadayappan (1)
(1) The Ohio State University; (2) Louisiana State University
* Work supported by NSF
Introduction
Stencil computations
  Sweep through large data set
  Multiple time iterations
  Simple load-balanced schedule
Tiling – essential to improve data locality
  Dependences between tiles
  Pipelined execution
  Skewed iteration spaces – load imbalance
Solution: Adjust tiling – re-enable concurrent execution
Motivation
FOR t = 0 TO T-1
  FOR i = 1 TO N-1
    A[t,i] = (A[t,i-1] + A[t,i] + A[t,i+1]) / 3
[Figure: iteration space with axes t and i]
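As a point of reference, the loop nest above can be written as a small runnable sketch. It assumes the time-indexed reading in which row t+1 is computed from row t and the two end points are held fixed; the function name `jacobi_1d` and the sample input are illustrative and not from the slides.

```python
def jacobi_1d(initial, steps):
    """Run `steps` sweeps of the 3-point averaging stencil.

    Assumption: each new row is computed from the previous row, and the
    end points (i = 0 and i = N) keep their initial values.
    """
    n = len(initial) - 1
    rows = [list(initial)]
    for _ in range(steps):
        prev = rows[-1]
        cur = list(prev)                      # copies the fixed end points
        for i in range(1, n):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3
        rows.append(cur)
    return rows


grid = jacobi_1d([0.0, 3.0, 0.0, 3.0, 0.0], steps=2)
```

Under this reading every point (t, i) depends only on points of the previous time step, which is what makes all points of one time step independent of each other.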
Notation
Iteration space B: n-dim polyhedron
Dependences D: n-dim vectors
Hyperplanes H: n-dim normal vectors
  Tile bounded by pairs of hyperplanes
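The notation can be made concrete with a short sketch. The dependence vectors below are an assumption (they follow from reading the 3-point stencil as "step t+1 reads step t"), and the helper names `dot`, `is_legal_tiling` and `tile_of` are illustrative. The legality test is the standard one: no dependence may point backward across a tiling hyperplane.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Assumed dependence vectors (dt, di) of the 3-point stencil.
DEPS = [(1, -1), (1, 0), (1, 1)]


def is_legal_tiling(hyperplanes, deps):
    """True if every dependence has a non-negative component along every normal."""
    return all(dot(h, d) >= 0 for h in hyperplanes for d in deps)


def tile_of(point, hyperplanes, sizes):
    """Tile coordinates: index of the slab between consecutive parallel hyperplanes."""
    return tuple(dot(h, point) // s for h, s in zip(hyperplanes, sizes))


rectangular = [(1, 0), (0, 1)]   # normals along t and along i
skewed = [(1, 0), (1, 1)]        # normals along t and along t + i
```

With these vectors, `is_legal_tiling(rectangular, DEPS)` is false because (0, 1) has a negative dot product with (1, -1), while the skewed pair passes the test. This is one way to see why tiled stencil iteration spaces end up skewed.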
Approach
Concurrent start in non-tiled iteration space
Identify hyperplanes inhibiting concurrent start in tiled space
Replace one face for each inhibiting pair
  Overlapped Tiling – Replace “back-face”
  Split Tiling – Replace “front-face”
Concurrent Start: Before Tiling
Condition: A boundary that does not carry any dependence
Inter-tile Dependences
Shift vectors
  Tile traversal order
  Normal to all other hyperplanes
Hyperplane carries dependence
  A dependence “pokes” through
Inter-tile dependence vector
  Shift vector
  Corresponding hyperplane carries dependence
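One way to see dependences "poking" through tile faces is to enumerate them by brute force. The sketch below assumes a time-skewed tiling with normals (1, 0) and (1, 1), a tile size of 4 in both directions, and the stencil dependences (1, -1), (1, 0), (1, 1); none of these values come from the slides. Each result is a difference of tile coordinates, so a unit step in one coordinate stands in for the shift vector of that hyperplane (this correspondence is an interpretation, not the authors' definition).

```python
from itertools import product

DEPS = [(1, -1), (1, 0), (1, 1)]     # assumed stencil dependences (dt, di)
HYPERPLANES = [(1, 0), (1, 1)]       # assumed time-skewed tiling normals
SIZES = (4, 4)                       # assumed tile sizes


def tile_of(point):
    return tuple(
        sum(h_k * p_k for h_k, p_k in zip(h, point)) // s
        for h, s in zip(HYPERPLANES, SIZES)
    )


def inter_tile_dependences(T, N):
    """Tile-coordinate offsets of all dependences that cross a tile face."""
    found = set()
    for t, i in product(range(T), range(1, N)):
        for dt, di in DEPS:
            src = (t - dt, i - di)
            if src[0] < 0 or not 1 <= src[1] < N:
                continue                      # source lies outside the iteration space
            here, there = tile_of((t, i)), tile_of(src)
            if here != there:
                found.add(tuple(a - b for a, b in zip(here, there)))
    return found


crossings = inter_tile_dependences(16, 16)
```

For this configuration the crossings are (1, 0), (0, 1) and (1, 1): both hyperplanes have a dependence passing through them.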
Concurrent Start Inhibition
Concurrent start in original iteration space along a boundary
But that boundary carries an inter-tile dependence
A boundary has concurrent start
S_j is an inter-tile dependence
That boundary carries an inter-tile dependence
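The inhibition can be illustrated by counting how many units of work are ready at the very first step, before and after tiling. This is a brute-force sketch under assumed parameters (time-skewed tiles of size 4, dependences (1, -1), (1, 0), (1, 1), an 8 x 8 iteration space); the function names are illustrative.

```python
DEPS = [(1, -1), (1, 0), (1, 1)]     # assumed stencil dependences (dt, di)


def skewed_tile(point, size=4):
    """Tile coordinates under an assumed time-skewed tiling."""
    t, i = point
    return (t // size, (t + i) // size)


def ready_at_start(T, N, group):
    """Groups (points or tiles) that receive no dependence from another group."""
    pts = {(t, i) for t in range(T) for i in range(1, N)}
    groups, blocked = set(), set()
    for t, i in pts:
        g = group((t, i))
        groups.add(g)
        for dt, di in DEPS:
            src = (t - dt, i - di)
            if src in pts and group(src) != g:
                blocked.add(g)
    return groups - blocked


untiled = ready_at_start(8, 9, group=lambda p: p)   # each point is its own group
tiled = ready_at_start(8, 9, group=skewed_tile)
```

In the untiled space all eight points on the boundary t = 0 are ready together. After skewed tiling only one tile is ready, because every other tile along that boundary waits on its neighbor, which is the pipelined start the slide describes.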
Companion Hyperplane
Hyperplane that destroys the inter-tile dependence
Swivel a hyperplane “backward”
Dependences carried by original hyperplane are “neutralized”
  Incoming dependences become non-incoming
  Outgoing dependences become non-outgoing
Overlapped Tiling
Replace “back face” with companion hyperplane
Additional region is shared with preceding tile
  Region of preceding tile that caused the dependence
Each new tile independent of preceding tile (“do-all” parallelism)
Increased computation cost and communication volume
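A minimal sketch of the idea for the 1-D stencil, written in the common ghost-zone form: each tile copies its own points plus a halo of `tt` points per side, then advances `tt` time steps on its private copy, recomputing the part of the neighbor's region it depends on. This is an illustration under assumptions (fixed end points, a symmetric halo, hypothetical names `overlapped_time_block`, `width`, `tt`), not the code generated by the authors' system.

```python
def reference(a, steps):
    """Plain sweep-by-sweep execution, used only for comparison."""
    a = list(a)
    for _ in range(steps):
        a = [a[0]] + [(a[i - 1] + a[i] + a[i + 1]) / 3
                      for i in range(1, len(a) - 1)] + [a[-1]]
    return a


def overlapped_time_block(a, width, tt):
    """Advance `a` by `tt` steps; tiles never read each other's results."""
    n = len(a) - 1                            # interior points are 1 .. n-1
    out = list(a)
    computed = 0                              # point updates, redundant ones included
    for lo in range(1, n, width):             # iterations are mutually independent
        hi = min(lo + width, n)               # this tile owns [lo, hi)
        left, right = max(lo - tt, 0), min(hi + tt, n + 1)
        local = a[left:right]                 # owned points plus halo
        for s in range(1, tt + 1):
            first = max(lo - (tt - s), 1)     # valid region shrinks by one
            last = min(hi + (tt - s), n)      # point per side and per step
            new = list(local)
            for g in range(first, last):
                j = g - left
                new[j] = (local[j - 1] + local[j] + local[j + 1]) / 3
            local = new
            computed += last - first
        out[lo:hi] = local[lo - left:hi - left]
    return out, computed


data = [float((x * x) % 7) for x in range(21)]
result, work = overlapped_time_block(data, width=5, tt=3)
expected = reference(data, 3)
```

The result matches the plain sweep, while `work` exceeds the 19 x 3 useful point updates. The surplus is the redundant computation, and the larger halo copy corresponds to the higher communication volume noted above.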
Split Tiling
Replace “front face” with companion hyperplane
Tile split into independent and dependent regions
Execute independent region followed by dependent region
Increased #communications
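The two-phase execution can be sketched for the 1-D stencil as follows. Phase 1 computes, for every tile, the region that shrinks inward by one point per time step and needs nothing from neighbors. Phase 2 fills the gaps that open up around each tile boundary, using phase 1 results from both sides. The sketch assumes fixed end points, a tile width of at least `2 * tt`, and an interior size that is a multiple of the width; the names are illustrative and this is not the authors' generated code.

```python
def reference(a, steps):
    """Plain sweep-by-sweep execution, used only for comparison."""
    a = list(a)
    for _ in range(steps):
        a = [a[0]] + [(a[i - 1] + a[i] + a[i + 1]) / 3
                      for i in range(1, len(a) - 1)] + [a[-1]]
    return a


def split_time_block(a, width, tt):
    """Advance `a` by `tt` steps in two phases with a barrier in between."""
    n = len(a) - 1
    if width < 2 * tt or (n - 1) % width != 0:
        raise ValueError("sketch assumes width >= 2*tt and (n-1) divisible by width")
    # level[s][i] holds point i after s steps; None marks "not computed yet".
    level = [list(a)] + [[None] * (n + 1) for _ in range(tt)]
    for s in range(1, tt + 1):
        level[s][0], level[s][n] = a[0], a[n]          # fixed end points

    def update(s, i):
        prev = level[s - 1]
        level[s][i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3

    cuts = list(range(1, n, width)) + [n]              # tile k owns [cuts[k], cuts[k+1])
    # Phase 1: independent regions, one per tile, all tiles concurrently.
    for lo, hi in zip(cuts, cuts[1:]):
        for s in range(1, tt + 1):
            first = lo if lo == 1 else lo + s          # no shrinking at the domain edge
            last = hi if hi == n else hi - s
            for i in range(first, last):
                update(s, i)
    # Phase 2: dependent regions around each interior tile boundary.
    for b in cuts[1:-1]:
        for s in range(1, tt + 1):
            for i in range(b - s, b + s):
                update(s, i)
    return level[tt]


data = [float((x * x) % 7) for x in range(26)]
result = split_time_block(data, width=8, tt=3)
expected = reference(data, 3)
```

No point is computed twice, but each time block needs two rounds of data exchange instead of one, which is consistent with the increased number of communications listed for this scheme.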
Experimental Evaluation
Cluster
  2.8 GHz dual-processor Opteron 254
  1 MB L2 cache; 4 GB RAM
  Linux 2.6.9; Intel compiler (icc) -O3
Comparison
  Two pipelined schedules – along space and time
  1000 time steps
  1 – 32 processors
Pipelined Execution: Parameters
Space tile size: 1000
Time tile size: 16
64000 elements; 32 processors
Performance with Problem Size
Weak Scaling
Problem size = #procs * 20000
Horizontal line – Linear Scaling
Conclusion
Time tiling stencils – crucial for data locality
Might inhibit concurrent execution
Presented: Two approaches to enabling concurrent execution
Ongoing work: Modeling relative benefits of the two approaches
Thank You!