Optimal Parallelogram Selection for Hierarchical Tiling Authors: Xing Zhou, Maria J. Garzaran, David...

Optimal Parallelogram Selection for Hierarchical Tiling

Authors: Xing Zhou, Maria J. Garzaran, David Padua

University of Illinois

Presenter: Wei Zuo

Motivation and Background

• The importance of loop tiling:
• Optimize multiple nested loops (most time-consuming)
• Improve data locality
• Expose parallelism

What is Hierarchical tiling

• Tile the loop hierarchically to fit the organization of the target machine
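A minimal sketch of the idea, assuming plain rectangular tiles and a two-level machine (for example, nodes and then cores). The function name and the tile sizes are illustrative and do not come from the talk:

```python
def two_level_tiled_iterations(n, outer, inner):
    """Visit every point of an n x n iteration space using two tiling levels.

    outer: edge length of a top-level tile (e.g. work given to one node)
    inner: edge length of a second-level tile (e.g. work given to one core)
    Rectangular tiles are an assumption made for brevity.
    """
    order = []
    for io in range(0, n, outer):                    # top-level tiles
        for jo in range(0, n, outer):
            i_end, j_end = min(io + outer, n), min(jo + outer, n)
            for ii in range(io, i_end, inner):       # second-level tiles
                for ji in range(jo, j_end, inner):
                    for i in range(ii, min(ii + inner, i_end)):      # points
                        for j in range(ji, min(ji + inner, j_end)):
                            order.append((i, j))
    return order


visited = two_level_tiled_iterations(n=8, outer=4, inner=2)
```

Every point is still visited exactly once; only the visiting order changes, so that the points of one inner tile are handled together.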

Why Hierarchical Tiling

• The advantage of hierarchy-aware optimization
• Unleash the potential of hierarchically organized systems

Challenges of (hierarchical) tiling

• Selection of tile sizes
• Selection of tile shapes
• Shapes have a significant impact on execution time
• Shapes at different levels interact with each other
• Tile shapes cannot be selected at each level separately
• A global model that considers the whole hierarchical tiling is needed

Contribution of the paper

• An automatic system for selection of tile shapes in a hierarchical system

• A model that computes the execution time of the tile shapes

• Show that the problem of optimal tile shape selection is a nonlinear bi-level programming problem

Math Concepts

• Iteration Space

• Tiling Representation

• Dependence Vectors

• Execution Time Vector

Iteration space representation

• The edge matrix E of iteration space I

• An example:

• The function span(E) describes the iteration space I

Tiling transformation representation

• Tiling matrix:

• After tiling:

• E’ is the new edge matrix
• A is the transformation matrix

• We have:

A = T^{-1} is the affine transformation of tiling with shape T
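A small numeric sketch of this transformation in two dimensions. The relation E' = A E is assumed here from the definitions of E' and A, and the concrete matrices E and T are made-up examples, not values from the talk:

```python
from fractions import Fraction


def inverse_2x2(m):
    """Exact inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("tiling matrix must be non-singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


# Columns are edge vectors. E spans an 8 x 8 square iteration space.
E = [[8, 0],
     [0, 8]]
# Hypothetical parallelogram tile with edges t0 = (2, 0) and t1 = (2, 2).
T = [[2, 2],
     [0, 2]]

A = inverse_2x2(T)       # A = T^{-1}
E_tiled = matmul(A, E)   # assumed relation E' = A E
```

In the transformed space each tile becomes a unit square, and the edges of the iteration space are expressed in units of tiles.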

Hierarchical tiling

• Recursively tile an iteration space I
• Bottom-up: T0 is the finest level, Tn is the original space
• Ik: the iteration space of the k-th level
• Tk: the tile shape of the k-th level
• Ek: the edge matrix of the k-th level

Dependence Vectors

• Dependence matrix
• A dependence vector d = (d0, d1, ..., dn−1) indicates that any iteration i must finish before iteration i + d.
• Assume atomic computation (no communication overlap)
• It must be possible to topologically sort all tiles
• There can be no cycles in the inter-tile dependence graph
• Hyperplanes defining the tiles must not be crossed by two dependence vectors with different directions.

Dependence Vectors

• No cycles => each dependence vector d must be covered by the cone spanned by the extensions of t0, ..., tn-1
• Tiles are large => inter-tile dependences exist only between adjacent tiles

• Combine together:

• After transformation

• Dk: the dependences at the k-th tiling level
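The two conditions on tile shapes can be checked numerically. The sketch below assumes one common reading: "covered by the cone" means that the coordinates of d in the tile basis (T^{-1} d) are non-negative, and "only between adjacent tiles" means that those coordinates are at most 1. The helper names, tile matrices and dependence vectors are illustrative:

```python
from fractions import Fraction


def tile_coordinates(T, d):
    """Solve T c = d for a 2x2 tiling matrix T (columns are tile edges)."""
    (a, b), (c, e) = T
    det = Fraction(a * e - b * c)
    if det == 0:
        raise ValueError("tiling matrix must be non-singular")
    return ((e * d[0] - b * d[1]) / det, (a * d[1] - c * d[0]) / det)


def is_valid_tile_shape(T, deps):
    """Assumed test: every dependence has tile-basis coordinates in [0, 1]."""
    return all(0 <= x <= 1 for d in deps for x in tile_coordinates(T, d))


deps = [(1, 0), (1, -1)]       # illustrative dependence vectors
square = [[2, 0], [0, 2]]      # tile edges t0 = (2, 0), t1 = (0, 2)
skewed = [[2, 2], [-2, 0]]     # tile edges t0 = (2, -2), t1 = (2, 0)
```

With these inputs the square tile is rejected, because (1, -1) has a negative coordinate along t1, while the skewed tile is accepted.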

• The sequential execution time of a loop with iteration space I is

• Taking parallelism into account, the ideal execution time is the minimal execution time of an iteration space I that can be achieved by any valid schedule of iterations (E: edge matrix; D: dependence matrix; L(E, D) denotes the length of the longest path of dependent iterations in the iteration space)

• Example:

• After simplification:
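The longest path of dependent iterations can be computed by brute force on a small example. This is a sketch under simplifying assumptions: the iteration space is a rectangular box rather than a general parallelogram, the path length is counted in iterations, and the function name is hypothetical:

```python
from functools import lru_cache


def longest_dependent_path(shape, deps):
    """Number of iterations on the longest dependence chain in a 2-D box.

    shape: (rows, cols) of a rectangular iteration space (an assumption).
    deps:  dependence vectors d, meaning iteration i precedes i + d.
           They must be lexicographically positive so no cycles exist.
    """
    rows, cols = shape

    @lru_cache(maxsize=None)
    def chain_from(i, j):
        best = 1  # the iteration itself
        for di, dj in deps:
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols:
                best = max(best, 1 + chain_from(ni, nj))
        return best

    # The longest chain may start anywhere, so try every iteration.
    return max(chain_from(i, j) for i in range(rows) for j in range(cols))


wavefront = longest_dependent_path((4, 4), [(1, 0), (0, 1)])
diagonal = longest_dependent_path((4, 4), [(1, 1)])
```

For the 4 x 4 space with dependences (1, 0) and (0, 1) the chain has 7 iterations, compared with 16 iterations when the loop runs sequentially.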

Execution Model

The Tile Size Selection Model

• Problem Statement

• The Optimization Formulation

• Compute the Longest Dependent Path

• Automated Framework

Tile Size Selection Model

• Problem statement: selecting the tile shapes for hierarchical tiling.
• Identify the tile shapes defining an l-level hierarchical tiling that minimizes the execution time of the computation defined by a given n-dimensional hyperparallelepiped-shaped iteration space I and m dependence vectors, i.e. determine the sequence of tiling matrices T0, T1, … Tl-1.
• Assumptions
• The model considers parallelogram tile shapes
• At a given level, all non-boundary tiles have the same shape
• Tiling is an affine transformation
• Computation within a tile is atomic
• Infinite resources for parallelism

• Iteration space

• Execution time per-tile for bottom-level tile

• The recursion for upper tiles:

Model Formulation

• Each tile at level below is considered as a single iteration

• The per-iteration execution time is Time(T_{k-1}).
• Dk is the dependence at the k-th level, with D0 = D
• t^s_k is the synchronization and communication overhead of each tile
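The bottom-up structure of the model can be sketched as a short recurrence. The exact formula is not recoverable from the transcript, so the form used here, Time(T_k) = L_k * Time(T_{k-1}) + t^s_k, is an assumption, as are the function name and the sample numbers:

```python
def hierarchical_time(path_lengths, sync_overheads, t_iter):
    """Estimate execution time bottom-up across tiling levels.

    path_lengths[k]:   longest dependent path at level k, counted in
                       level-(k-1) tiles (in iterations for k = 0).
    sync_overheads[k]: per-tile synchronization/communication cost t^s_k.
    t_iter:            time of one original loop iteration.
    The recurrence below is an assumed reading of the model, not a
    formula quoted from the talk.
    """
    time_below = t_iter
    for length, overhead in zip(path_lengths, sync_overheads):
        # A tile runs `length` dependent steps, each costing one lower-level
        # tile, and then pays its own overhead once.
        time_below = length * time_below + overhead
    return time_below


estimate = hierarchical_time([4, 3], [2, 10], t_iter=1)
```

Here the bottom-level tile costs 4 * 1 + 2 = 6 time units and the level above costs 3 * 6 + 10 = 28.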

Model Formulation

• Optimization:
• Select t1, … tl-1 to minimize the total execution time

• Constraints (dependence)

• Question: how to compute “L”?

Contribution of the paper




• Computing L(Tk, In) 0<k<n-1

• Computing L(E, In)

Computation of the L

• Computing L(Tk, In)
• By affine transformation

• Since:

• To compute L(Tk,1n), we must find the longest path P (p0, p1,..., pL−1)

• Therefore: L(Tk, 1n) = max{L}

Computation of the L

• Computation of L(E, In)
• Since dependence vectors d can point in any direction, the longest dependent path does not necessarily start from the origin (0,0,...,0) of the hypercube iteration space.

• Estimate L approximately using binary optimization

The Automatic Framework

• Multidimensional non-linear optimization problem without a known analytical solution
• NOMAD
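To give a feel for derivative-free optimization of this kind, here is a sketch of a plain compass (pattern) search. It is a much simpler stand-in and not NOMAD or its interface; the objective is a made-up toy function with no relation to the execution-time model:

```python
def compass_search(objective, start, step=1.0, min_step=1e-3,
                   feasible=lambda x: True):
    """Derivative-free pattern search: poll +/- step along each axis."""
    best = list(start)
    best_val = objective(best)
    while step >= min_step:
        improved = False
        for axis in range(len(best)):
            for sign in (1.0, -1.0):
                trial = list(best)
                trial[axis] += sign * step
                if feasible(trial):
                    val = objective(trial)
                    if val < best_val:
                        best, best_val, improved = trial, val, True
        if not improved:
            step /= 2.0   # refine the step when no poll point helps
    return best, best_val


def toy_cost(x):
    """Hypothetical smooth cost with its minimum at (3, 2)."""
    return (x[0] - 3) ** 2 + (x[1] - 2) ** 2 + 1


point, value = compass_search(toy_cost, [8, 8],
                              feasible=lambda x: min(x) >= 1)
```

The search only needs objective values, which is why methods of this family suit a model with no known analytical solution; the `feasible` callback is where constraints such as the dependence conditions would be plugged in.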

Experiments

• Platform: Blue Waters supercomputer
• First level: 256 nodes
• Second level: each node has an NVIDIA Tesla GPU accelerator with 2688 CUDA cores
• Tiling schemes
• Scheme 1 & 2: the common tile shapes
• The hierarchical overlapped tiling method
• Note: tile shapes include Square, Diamond and Skewing1 & 2

Comparing the performance

Testing the model accuracy

• The accuracy of the analytical model for execution time estimation
• 15% except for 1D-Jacobi
• Reasons for the inaccuracy:
• The variation of communication time and execution time across different programs
• The hardware resources for parallelism are not unlimited

Conclusion




• Review the limitations; these can be future work
• Affine, regular parallelism, adding the hardware resource model, considering different metrics, e.g. power, area …
