Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and...

Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology
Presented by: Luis Ortiz, Department of Computer Science, The University of Texas at San Antonio
F. Hanning, H. Dutta, W. Tichy, and Jürgen Teich, University of Erlangen-Nuremberg, Germany
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)


Page 1

Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology

Presented by: Luis Ortiz

Department of Computer Science
The University of Texas at San Antonio

F. Hanning, H. Dutta, W. Tichy, and Jürgen Teich
University of Erlangen-Nuremberg, Germany

Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)

Page 2

Outline

• Overview
• The Problem
• Reconfigurable Architectures
• Design Flow for Regular Mapping
• Parallelizing Transformations
• Constraints Related to CG Reconfigurable Arrays
• Case Study
• Results
• Conclusions and Future Work

Page 3

Overview

Constructing a parallel program is equivalent to specifying its execution order
• the operations of a program form a set, and its execution order is a binary, transitive and asymmetric relation
• the relevant sets are (unions of) Z-polytopes
• most of the optimizations may be presented as transformations of the original program

The problem of automatic parallelization
• given a set of operations E and a strict total order on it
• find a partial order on E such that execution of E under it is determinate and gives the same results as the original program
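The parallelization problem stated above can be illustrated with a minimal sketch. The toy program, the operation names `s0` to `s4`, and the helper functions are hypothetical and not taken from the slides. The sketch derives the ordering constraints that any legal execution must keep (pairs of operations touching a common cell where at least one writes it) and groups the operations into wavefronts that may run in parallel. The partial order itself is the transitive closure of the pairs computed here.

```python
from itertools import combinations

# Hypothetical toy program: (name, cell written, cells read).
# The list order is the original strict total order.
OPS = [
    ("s0", "a", ()),          # a = 1
    ("s1", "b", ()),          # b = 2
    ("s2", "c", ("a", "b")),  # c = a + b
    ("s3", "d", ("a",)),      # d = a * 10
    ("s4", "a", ("c", "d")),  # a = c + d
]

def conflicts(p, q):
    """True if p and q touch a common cell and at least one of them writes it."""
    (_, wp, rp), (_, wq, rq) = p, q
    return wp == wq or wp in rq or wq in rp

def dependence_pairs(ops):
    """Pairs (earlier, later) that every legal execution order must preserve."""
    return {(p[0], q[0]) for p, q in combinations(ops, 2) if conflicts(p, q)}

def wavefronts(ops):
    """Group operations into levels; operations in one level are independent."""
    pairs = dependence_pairs(ops)
    level = {}
    for name, _, _ in ops:  # the sequential order is already a topological order
        preds = [level[a] for a, b in pairs if b == name]
        level[name] = 1 + max(preds, default=-1)
    fronts = {}
    for name, lvl in level.items():
        fronts.setdefault(lvl, []).append(name)
    return [fronts[k] for k in sorted(fronts)]

print(wavefronts(OPS))  # [['s0', 's1'], ['s2', 's3'], ['s4']]
```

Five sequential steps shrink to three parallel steps here, while every read still sees the value it saw in the original order.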

Page 4

Overview (cont.)

Defining a polyhedron
• a set of linear inequalities: Ax + a ≥ 0
• the polyhedron is the set of all x which satisfy these inequalities
• the basic property of a polyhedron is convexity:
  • if two points a and b belong to a polyhedron, then so do all their convex combinations
  • λa + (1 – λ)b, 0 ≤ λ ≤ 1
• a bounded polyhedron is called a polytope
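As a small illustration of the definition above, the following sketch tests membership in a polyhedron given as Ax + a ≥ 0 and samples convex combinations of two member points. The triangle used as an example and the function names are assumptions made for illustration only.

```python
def in_polyhedron(A, a, x):
    """Return True if A x + a >= 0 holds in every row."""
    return all(sum(r * xi for r, xi in zip(row, x)) + ai >= 0
               for row, ai in zip(A, a))

def convex_combination(p, q, lam):
    """Return lam * p + (1 - lam) * q, component by component."""
    return [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]

# Hypothetical example: the triangle x1 >= 0, x2 >= 0, x1 + x2 <= 4.
# It is bounded, so it is also a polytope.
A = [[1, 0], [0, 1], [-1, -1]]
a = [0, 0, 4]

p, q = [0, 4], [4, 0]  # two points of the triangle
samples = [convex_combination(p, q, k / 4) for k in range(5)]

print(all(in_polyhedron(A, a, s) for s in samples))  # True
print(in_polyhedron(A, a, [3, 3]))                   # False
```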

Page 5

Overview (cont.)

The essence of the polytope model is to apply affine transformations to the iteration spaces of a program
• the iteration domain of statement S: Dom(S) = {x | D_S x + d_S ≥ 0}
• D_S and d_S are the matrix and constant vector which define the iteration polytope. d_S may depend linearly on the structure parameters
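A minimal sketch of an iteration domain, assuming a hypothetical triangular loop nest `for i in range(N): for j in range(i + 1)`. The matrix `D`, the vector `d`, and the skewing transformation `T` are illustrative choices, not values from the slides. Note that only `d` contains the structure parameter `N`.

```python
from itertools import product

def domain(D, d, box):
    """Integer points x inside the bounding box with D x + d >= 0."""
    return [x for x in product(*(range(lo, hi + 1) for lo, hi in box))
            if all(sum(c * xi for c, xi in zip(row, x)) + di >= 0
                   for row, di in zip(D, d))]

N = 4  # structure parameter (hypothetical value)

# Rows encode: i >= 0, N - 1 - i >= 0, j >= 0, i - j >= 0
D = [[1, 0], [-1, 0], [0, 1], [1, -1]]
d = [0, N - 1, 0, 0]

points = domain(D, d, box=[(0, N - 1), (0, N - 1)])
loop_points = [(i, j) for i in range(N) for j in range(i + 1)]
print(points == loop_points)  # True

# An affine transformation x -> T x + t of the iteration space, here a skew.
T, t = [[1, 1], [0, 1]], [0, 0]
skewed = [tuple(sum(c * xi for c, xi in zip(row, x)) + ti
                for row, ti in zip(T, t))
          for x in points]
print(len(set(skewed)) == len(points))  # True
```

The second check confirms that this particular skew maps distinct iterations to distinct points, so no iteration is lost or merged by the transformation.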

Page 6

Overview (cont.)

Coarse-grained reconfigurable architectures
• provide the flexibility of software combined with the performance of hardware
• but hardware complexity is a problem due to a lack of mapping tools

Parallelization techniques and compilers
• map computationally intensive algorithms efficiently to coarse-grained reconfigurable arrays

Page 7

The Problem

“Mapping a certain class of regular nested loop programs onto a dedicated processor array”

Page 8

Reconfigurable Architectures

Span a wide range of abstraction levels
• from fine-grained Look-Up Table (LUT) based reconfigurable logic devices to distributed and hierarchical systems with heterogeneous reconfigurable components

Efficiency comparison
• standard arithmetic is less efficient on fine-grained architectures
  • due to the large routing area overhead

Little research work deals with compilation to coarse-grained reconfigurable architectures

Page 9

Design Flow for Regular Mapping

Page 10

Design Flow for Regular Mapping (cont.)

A piecewise regular algorithm contains N quantified equations S1[I], …, SN[I]
• each equation Si[I] is of the form xi[I] = fi(…, xj[I − dji], …) for all I ∈ Ii
• xi[I] are indexed variables
• fi are arbitrary functions
• dji ∈ ℤ^n are constant data dependence vectors, and “…” denotes similar arguments
• Ii are called index spaces

Page 11

Design Flow for Regular Mapping (cont.)

Linearly bounded lattice
• a set of integer points bounded by linear inequalities is mapped onto iteration vectors I using an affine transformation

Block pipelining period
• time interval between the initiations of two successive problem instances (β)
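The formulas of the linearly bounded lattice slide are not preserved in this transcript. The sketch below assumes the definition commonly used in the literature, I = Mκ + c for all integer κ with Aκ ≥ b. The symbols `M`, `c`, `A`, `b`, `kappa` and the example values are assumptions under that definition, not content recovered from the slide.

```python
from itertools import product

def matvec(mat, v):
    """Plain integer matrix-vector product."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in mat]

def lattice_points(M, c, A, b, kappa_box):
    """Enumerate I = M*kappa + c for integer kappa in the box with A*kappa >= b."""
    result = []
    for kappa in product(*(range(lo, hi + 1) for lo, hi in kappa_box)):
        if all(lhs >= rhs for lhs, rhs in zip(matvec(A, kappa), b)):
            result.append(tuple(x + ci for x, ci in zip(matvec(M, kappa), c)))
    return result

# Hypothetical example: 0 <= kappa1 <= 2 and 0 <= kappa2 <= 2 ...
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [0, -2, 0, -2]
# ... mapped to every second column, shifted up by one row.
M = [[2, 0], [0, 1]]
c = [0, 1]

pts = lattice_points(M, c, A, b, kappa_box=[(-1, 3), (-1, 3)])
print(len(pts))        # 9
print((4, 3) in pts)   # True
print((1, 1) in pts)   # False: odd first coordinates are not on the lattice
```

The example shows why a lattice is more general than a polytope of integer points: the image can have regular holes, here every odd first coordinate.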

Page 12

Parallelizing Transformations

Based on the representation of equations and index spaces, several combinations of parallelizing transformations in the polytope model can be applied
• Affine Transformations
• Localization
• Operator Splitting
• Exploration of Space-Time Mappings
• Partitioning
• Control Generation
• HDL Generation & Synthesis
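To make the space-time mapping step more concrete, here is a minimal sketch for a three-dimensional index space (i, j, k). The schedule vector `LAMBDA`, the allocation matrix `Q`, and the dependence vectors are hypothetical textbook-style choices, not the values explored in the paper. The sketch checks that each dependence is delayed by at least one time step and that no processor receives two iterations at the same time.

```python
from itertools import product

N = 3  # hypothetical problem size

# Assumed dependence vectors of a localized matrix multiplication.
DEPS = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}

LAMBDA = (1, 1, 1)          # time: t = LAMBDA . I
Q = ((1, 0, 0), (0, 1, 0))  # space: p = Q I (projection along k)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def space_time(I):
    """Map an iteration vector to (processor coordinates, time step)."""
    return tuple(dot(row, I) for row in Q), dot(LAMBDA, I)

# Causality: a value must be produced at least one step before its use.
causal = all(dot(LAMBDA, d) >= 1 for d in DEPS.values())

images = [space_time(I) for I in product(range(N), repeat=3)]
conflict_free = len(set(images)) == len(images)
processors = {p for p, _ in images}
latency = max(t for _, t in images) - min(t for _, t in images) + 1

print(causal, conflict_free)      # True True
print(len(processors), latency)   # 9 7
```

With this choice, 27 iterations run on a 3 x 3 array in 7 time steps. Exploring other `LAMBDA` and `Q` pairs trades array size against latency.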

Page 13

Constraints Related to CG Reconfigurable Arrays

Coarse-grained (re)configurable architectures consist of
• an array of processor elements (PE)
  • one or more dedicated functional units or
  • one or more arithmetic logic units (ALU)
• memory
  • local memory → register files
  • memory banks
  • an instruction memory is required if the PE contains an instruction-programmable ALU
• interconnect structures
• I/O ports
• synchronization and reconfiguration mechanisms

Page 14

Case Study

Regular mapping methodology applied to a matrix multiplication algorithm
• target architecture
  • PACT XPP64-A reconfigurable processor array
  • 64 ALU-PAEs with a 24-bit data width in an 8x8 array
  • each ALU-PAE consists of three objects
    • the ALU-PAE
    • Back-Register-object (BREG)
    • Forward-Register-object (FREG)
  • all objects are connected to horizontal routing channels

Page 15

Case Study (cont.)

• RAM-PAEs are located in two columns at the left and the right border of the array, with two ports for independent read/write operations
• RAM can be configured to FIFO mode
• each RAM-PAE has a 512x24 bit storage capacity
• four independent I/O interfaces are located in the corners of the array
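The resource figures listed for the XPP64-A can be collected in a small model and used for first feasibility checks. This is a sketch only: the class `ArrayModel` and its methods are hypothetical, the ALU check ignores routing and register objects, and the signed two's-complement word format is an assumption the slides do not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArrayModel:
    """Default values are the figures from the slides; the checks are assumptions."""
    rows: int = 8             # 8x8 array of ALU-PAEs
    cols: int = 8
    word_bits: int = 24       # data width
    ram_words: int = 512      # words per RAM-PAE
    io_interfaces: int = 4    # one per array corner

    @property
    def alu_paes(self) -> int:
        return self.rows * self.cols

    def enough_alus(self, pe_rows: int, pe_cols: int, alus_per_pe: int = 1) -> bool:
        """Necessary (not sufficient) condition: routing is not modeled."""
        return pe_rows * pe_cols * alus_per_pe <= self.alu_paes

    def word_fits(self, value: int) -> bool:
        """Assumes signed two's-complement words."""
        half = 1 << (self.word_bits - 1)
        return -half <= value < half

XPP64A = ArrayModel()
print(XPP64A.alu_paes)                           # 64
print(XPP64A.enough_alus(4, 4))                  # True
print(XPP64A.enough_alus(8, 8, alus_per_pe=3))   # False
print(XPP64A.word_fits(2**23))                   # False
```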

Page 16

Case Study (cont.)

[Figure: Structure of the PACT XPP64-A reconfigurable processor]

[Figure: ALU-PAE objects]

Page 17

Case Study (cont.)

Matrix multiplication algorithm
• C = A * B
• A ∈ ℤ^(N×N)
• B ∈ ℤ^(N×N)
• computations may be represented by a dependence graph (DG)
• dependence graphs can be represented in a reduced form
  • Reduced Dependence Graph: to each edge e = (vi, vj) there is associated a dependence vector dij ∈ ℤ^n
• virtual Processor Elements (VPEs) are used to map the PEs obtained from the design flow to the given architecture
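The relation between a reduced dependence graph and the full dependence graph can be sketched as follows. The edge list `RDG` is a hypothetical reduced graph for a localized matrix multiplication with variables `a`, `b`, `z`, and `c`. It is not copied from the paper's figure, and input and output nodes at the array border are left out.

```python
from itertools import product

# Hypothetical RDG over indices (i, j, k): (source, target, dependence vector).
RDG = [
    ("a", "a", (0, 1, 0)),  # a is propagated along j
    ("b", "b", (1, 0, 0)),  # b is propagated along i
    ("a", "z", (0, 0, 0)),  # z = a * b at the same index point
    ("b", "z", (0, 0, 0)),
    ("z", "c", (0, 0, 0)),  # c accumulates z ...
    ("c", "c", (0, 0, 1)),  # ... along k
]

def expand(rdg, N):
    """Unroll the RDG over the N x N x N index space into a full DG."""
    space = set(product(range(N), repeat=3))
    variables = {v for src, dst, _ in rdg for v in (src, dst)}
    nodes = {(v, I) for v in variables for I in space}
    edges = []
    for src, dst, d in rdg:
        for I in sorted(space):
            J = tuple(x - y for x, y in zip(I, d))  # producer index I - d
            if J in space:
                edges.append(((src, J), (dst, I)))
    return nodes, edges

nodes, edges = expand(RDG, N=2)
print(len(RDG), len(nodes), len(edges))  # 6 32 36
```

Six reduced edges describe a graph whose unrolled size grows with N cubed, which is why the reduced form is the one used for mapping.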

Page 18

Case Study (cont.)

[Figure: Matrix multiplication algorithm, C-code]

[Figure: Matrix multiplication algorithm after parallelization, operator splitting, embedding, and localization]
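The two code figures of this slide are not preserved in the transcript. As a stand-in, the following sketch contrasts a plain triple loop with a localized single-assignment form in the spirit of the caption. The variable names `a`, `b`, `z`, `c` and the exact equations are assumptions, and Python is used here instead of the C of the original figure.

```python
def matmul_plain(A, B):
    """Reference triple loop: C[i][j] accumulates A[i][k] * B[k][j]."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_localized(A, B):
    """Single-assignment form: every value comes from a neighboring index point."""
    n = len(A)
    a, b, c = {}, {}, {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Localization: A[i][k] enters at j == 0 and is passed along j,
                # B[k][j] enters at i == 0 and is passed along i.
                a[i, j, k] = A[i][k] if j == 0 else a[i, j - 1, k]
                b[i, j, k] = B[k][j] if i == 0 else b[i - 1, j, k]
                # Operator splitting: multiply first, then accumulate along k.
                z = a[i, j, k] * b[i, j, k]
                c[i, j, k] = z if k == 0 else c[i, j, k - 1] + z
    return [[c[i, j, n - 1] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_localized(A, B))                        # [[19, 22], [43, 50]]
print(matmul_localized(A, B) == matmul_plain(A, B))  # True
```

In the localized form each index point reads only from itself or a direct neighbor, which is what allows a mapping onto processor elements with short, local interconnect.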

Page 19

Case Study (cont.)

[Figure: DG of transformed matrix multiplication algorithm, N = 2]

[Figure: Reduced dependence graph]

[Figure: 4 x 4 processor array]

Page 20

Case Study (cont.)

Output data
• Ox: the output-variable space of variable x of the space-time mapped or partitioned index space
• the output can be two-dimensional
• the transformed output variables are distributed over the entire array
• collect the data from one processor line PL and feed them out to an array border
• m ∈ ℤ^(1×n) denotes the time instances t ∈ Tx(Pi,j) where the variable x produces an output at processor element Pi,j

Page 21

Case Study (cont.)

• if one of the following conditions holds, output data can be serialized

Page 22

Case Study (cont.)

[Figure: Partitioned implementation of the matrix multiplication algorithm]
a) Dataflow graph of the LPGS-partitioned matrix multiplication 4 x 4 example
b) Dataflow graph after performing localization inside each tile
c) Array implementation of the partitioned example
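The idea behind LPGS (local parallel, global sequential) partitioning can be sketched as follows. The function `lpgs` and the 2 x 2 physical array size are hypothetical choices for illustration. Only the 4 x 4 problem size is taken from the figure caption. Virtual processors inside one tile run in parallel on the physical array, and the tiles are processed one after another.

```python
def lpgs(N, P):
    """Assign an N x N virtual processor array to a P x P physical array.

    Returns {tile: {physical PE: virtual PE}}; tiles are visited sequentially,
    the PEs inside a tile work in parallel.
    """
    schedule = {}
    for i in range(N):
        for j in range(N):
            tile = (i // P, j // P)   # global sequential part
            local = (i % P, j % P)    # local parallel part
            schedule.setdefault(tile, {})[local] = (i, j)
    return schedule

plan = lpgs(N=4, P=2)
for tile, pes in plan.items():
    print(tile, pes)

print(len(plan))             # 4 tiles
print(plan[(1, 0)][(0, 1)])  # (2, 1)
```

Each physical PE serves N²/P² virtual PEs in turn, so the array shrinks by that factor while the execution time grows accordingly.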

Page 23

Results

Both implementations (full-size and partitioned) show optimal utilization of resources

Each configured MAC unit performs one operation per cycle

With a better implementation, more performance per cycle can be achieved using fewer resources

The number of ALUs is reduced from O(3N) to O(N)

Merging and writing of output data streams is overlapped with computations in the PEs

Page 24

Conclusions and Future Work

The mapping methodology based on loop parallelization in the polytope model provides results that are efficient in terms of resource utilization and execution time

Future work focuses on the automatic compilation of nested loop programs