Optimal Parallelogram Selection for Hierarchical Tiling
Authors: Xing Zhou, Maria J. Garzaran, David Padua
University of Illinois
Presenter: Wei Zuo
Motivation and Background
• The importance of loop tiling
• Optimizes multiply nested loops (the most time-consuming code)
• Improves data locality
• Exposes parallelism
What is Hierarchical tiling
• Tile the loop hierarchically to fit the organization of the target machine
Why Hierarchical Tiling
• The advantage of hierarchy-aware optimization
• To unleash the potential of hierarchically organized systems
Challenges of (hierarchical) tiling
• Selection of tile sizes
• Selection of tile shapes
• Shapes have a significant impact on execution time
• Shapes at different levels interact with each other
• Tile shapes cannot be selected at each level separately
• A global model that accounts for hierarchical tiling is needed
Contribution of the paper
• An automatic system for selection of tile shapes in a hierarchical system
• A model that computes the execution time of the tile shapes
• Show that the problem of optimal tile shape selection is a nonlinear bi-level programming problem
Math Concepts
• Iteration Space
• Tiling Representation
• Dependence Vectors
• Execution Time Vector
Iteration space representation
• The edge matrix E of iteration space I
• An example:
• The function span(E) describes the iteration space I
Tiling transformation representation
• Tiling matrix:
• After tiling:
• E’ is the new edge matrix
• A is the transformation matrix
• We have:
A = T^{-1} is the affine transformation of tiling with shape T
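The relation A = T^{-1} can be tried out on a small 2-D case. The sketch below is illustrative only: it assumes that the columns of E and T are the edge vectors, and that the tiled edge matrix is obtained as E’ = A E (the slide's own formula is not preserved in this transcript). The helper names and the example matrices are hypothetical.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Exact inverse of a 2x2 matrix given as two rows."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("tile shape matrix must be non-singular")
    det = Fraction(det)
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(x, y):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

# Assumed convention: columns are edge vectors.
E = [[8, 0], [0, 8]]   # an 8 x 8 rectangular iteration space
T = [[2, 2], [0, 2]]   # tile edges (2, 0) and (2, 2): a skewed parallelogram

A = inverse_2x2(T)     # A = T^{-1}
E_tiled = matmul(A, E) # edge matrix expressed in units of tiles (assumed E' = A E)
```

In this example A maps each tile edge onto a unit axis vector, so `matmul(A, T)` is the identity and `E_tiled` describes the iteration space measured in tiles.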
Hierarchical tiling
• Recursively tile an iteration space I
• Bottom-up. T_0: finest -> T_n: original space
• I_k: the iteration space of the k-th level
• T_k: the tile shape of the k-th level
• E_k: the edge matrix of the k-th level
Dependence Vectors
• Dependence matrix
• A dependence vector d = (d_0, d_1, ..., d_{n−1}) indicates that any iteration i must finish before iteration i + d.
• Assume atomic computation (no overlap of computation and communication)
• It must be possible to topologically sort all tiles
• There can be no cycles in the inter-tile dependence graph
• Hyperplanes defining the tiles must not be crossed by two dependence vectors with different directions.
Dependence Vectors
• No cycle => each dependence vector d must be covered by the cone spanned by the extensions of t_0, …, t_{n-1}
• Tiles are large => inter-tile dependences exist only between adjacent tiles
• Combine together:
• After transformation
• D_k, the dependence at the k-th level of tiling
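The cone condition above can be checked mechanically in 2-D: a dependence vector d lies in the cone spanned by the tile edges t_0 and t_1 exactly when d = c_0 t_0 + c_1 t_1 with c_0, c_1 >= 0. The following is a minimal sketch under that reading; the function names and the sample dependence vectors are assumptions made for illustration, not taken from the slides.

```python
from fractions import Fraction

def cone_coefficients(t0, t1, d):
    """Solve d = c0*t0 + c1*t1 exactly with Cramer's rule (2-D only)."""
    det = t0[0] * t1[1] - t0[1] * t1[0]
    if det == 0:
        raise ValueError("tile edges must be linearly independent")
    c0 = Fraction(d[0] * t1[1] - d[1] * t1[0], det)
    c1 = Fraction(t0[0] * d[1] - t0[1] * d[0], det)
    return c0, c1

def covers_all_dependences(t0, t1, deps):
    """True if every dependence vector lies inside the cone of t0 and t1."""
    return all(c >= 0 for d in deps for c in cone_coefficients(t0, t1, d))

# Hypothetical stencil-like dependences in (time, space) coordinates.
deps = [(1, -1), (1, 0), (1, 1)]

square_ok = covers_all_dependences((4, 0), (0, 4), deps)    # axis-aligned tile
skewed_ok = covers_all_dependences((4, -4), (4, 4), deps)   # skewed tile
```

With these inputs the axis-aligned tile fails because (1, -1) needs a negative multiple of (0, 4), while the skewed tile covers all three vectors.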
• The sequential execution time of a loop with iteration space I is
• Considering parallelism, the ideal execution time is the minimal execution time of an iteration space I that can be achieved by any valid schedule of iterations (E: edge matrix; D: dependence matrix; L(E, D): the length of the longest path of dependent iterations in the iteration space)
• Example:
• After simplification:
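To make the quantity L(E, D) concrete, the sketch below computes the longest chain of dependent iterations by brute force for a small rectangular 2-D iteration space. It is not the method of the paper: it assumes a box-shaped space, counts path length in iterations (the slides do not say whether nodes or edges are counted), and requires the dependence vectors to be acyclic. The function name is hypothetical.

```python
from functools import lru_cache

def longest_dependent_path(extent, deps):
    """Longest chain of dependent iterations in an n0 x n1 box (brute force)."""
    n0, n1 = extent

    @lru_cache(maxsize=None)
    def chain_from(i, j):
        # Length of the longest chain that starts at iteration (i, j).
        best = 1
        for d0, d1 in deps:
            ni, nj = i + d0, j + d1
            if 0 <= ni < n0 and 0 <= nj < n1:
                best = max(best, 1 + chain_from(ni, nj))
        return best

    # The chain may start anywhere, so every iteration is tried as a start.
    return max(chain_from(i, j) for i in range(n0) for j in range(n1))

forward = longest_dependent_path((4, 4), [(1, 0), (0, 1)])
diagonal = longest_dependent_path((4, 4), [(1, -1)])
```

For the 4 x 4 space, `forward` is 7 (corner to corner) and `diagonal` is 4, where the longest chain starts at (0, 3) rather than at the origin. If every iteration took one time unit and resources were unlimited, these would be the ideal times, against 16 units sequentially.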
Execution Model
The Tile Size Selection Model
• Problem statement
• The optimization formulation
• Compute the Longest Dependent Path
• Automated Framework
Tile Size Selection Model
• Problem statement: selecting the tile shape for hierarchical tiling.
• Identify the tile shapes defining an l-level hierarchical tiling that minimizes the execution time of the computation defined by a given n-dimensional hyperparallelepiped-shaped iteration space I and m dependence vectors; i.e., determine the sequence of tiling matrices T_0, T_1, …, T_{l-1}.
• Assumptions
• The model considers parallelogram tile shapes
• At a given level, all non-boundary tiles have the same shape
• Tiling is an affine transformation
• Computation within a tile is atomic
• Infinite resources for parallelism
• Iteration space
• Execution time per-tile for bottom-level tile
• The recursion for upper tiles:
Model Formulation
• Each tile at the level below is considered a single iteration
• The per-iteration execution time is Time(T_{k-1}).
• D_k is the dependence at the k-th level, with D_0 = D
• ts_k is the synchronization and communication overhead of each tile
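The bullets above suggest a bottom-up recursion, although the formula itself is not preserved in this transcript. The sketch below assumes one plausible form, Time(T_k) = L_k * Time(T_{k-1}) + ts_k, where L_k is the longest dependent path measured in level-(k-1) tiles inside one level-k tile. Both that form and the numbers used are assumptions for illustration.

```python
def hierarchical_time(bottom_tile_time, levels):
    """Accumulate execution time from the bottom tile level upward.

    levels: one (longest_path, overhead) pair per upper level, lowest first.
    Assumed recursion: time_k = longest_path_k * time_{k-1} + overhead_k.
    """
    time = bottom_tile_time
    for longest_path, overhead in levels:
        time = longest_path * time + overhead
    return time

# Hypothetical numbers: a bottom tile takes 100 units; level 1 has a longest
# path of 7 tiles and overhead 10; level 2 has a path of 5 and overhead 50.
total = hierarchical_time(100, [(7, 10), (5, 50)])
```

Here level 1 costs 7 * 100 + 10 = 710 units and level 2 costs 5 * 710 + 50 = 3600 units, which shows how a longer dependent path at a low level is multiplied by every level above it.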
Model Formulation
• Optimization:
• Select t_1, …, t_{l-1} to minimize total execution time
• Constraints (dependence)
• Question: How to compute “L” ?
• Computing L(T_k, I_n), 0 < k < n-1
• Computing L(E, I_n)
Computation of L
• Computing L(T_k, I_n)
• By affine transformation
• Since:
• To compute L(T_k, I_n), we must find the longest path P = (p_0, p_1, ..., p_{L−1})
• Therefore: L(T_k, I_n) = max{L}
Computation of L
• Computation of L(E, I_n)
• Since dependence vectors d can point in any direction, the longest dependent path does not necessarily start from the origin (0,0,...,0) of the hypercube iteration space.
• L is estimated approximately using binary optimization
The Automatic Framework
• Multidimensional nonlinear optimization problem without a known analytical solution
• NOMAD
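NOMAD is a derivative-free solver based on mesh adaptive direct search. To give a feel for how such a solver can minimize a black-box cost without an analytical solution, the sketch below uses a much simpler stand-in, a compass (coordinate) search on an integer mesh. It is not NOMAD's algorithm or API, and the quadratic objective is a hypothetical placeholder for the modeled execution time.

```python
def compass_search(objective, start, step=4, min_step=1):
    """Derivative-free local search: poll +/- step on each axis, shrink on failure."""
    point = tuple(start)
    best = objective(point)
    while step >= min_step:
        improved = False
        for axis in range(len(point)):
            for delta in (step, -step):
                candidate = list(point)
                candidate[axis] += delta
                candidate = tuple(candidate)
                value = objective(candidate)
                if value < best:
                    # Accept the better point and keep polling around it.
                    point, best, improved = candidate, value, True
        if not improved:
            step //= 2  # refine the mesh when no poll direction helps
    return point, best

def toy_cost(p):
    """Hypothetical stand-in for the modeled execution time of a tile shape."""
    x, y = p
    return (x - 3) ** 2 + (y + 1) ** 2

result = compass_search(toy_cost, (0, 0))
```

On this toy cost the search ends at the point (3, -1) with value 0. A real tile-shape search would also have to reject candidates that violate the dependence constraints, which this sketch leaves out.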
Experiments
• Platform: Blue Waters supercomputer
• First level: 256 nodes
• Second level: each node has an NVIDIA Tesla GPU accelerator with 2688 CUDA cores
• Tiling schemes
• Schemes 1 & 2: the common tile shapes
• The hierarchical overlapped tiling method
• Note: tile shapes include Square, Diamond, and Skewing 1 & 2
Comparing the performance
Testing the model accuracy
• The accuracy of the analytical model for execution time estimation
• 15% except for 1D-Jacobi
• Reasons for the inaccuracy:
• The communication time and execution time vary across programs
• The hardware resources for parallelism are not unlimited
Conclusion
• An automatic system for selection of tile shapes in a hierarchical system
• A model that computes the execution time of the tile shapes
• Show that the problem of optimal tile shape selection is a nonlinear bi-level programming problem
• Limitations, which can be addressed in future work
• Affine, regular parallelism; adding a hardware resource model; considering different metrics, e.g. power, area …