Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How

Gagan Agrawal, Wei Du, Tahsin Kurc, Umit Catalyurek, Joel Saltz

The Ohio State University

Overall Context

NGS grant titled ``An Integrated Middleware and Language/Compiler Framework for Data-Intensive Applications’’, funded September 2002 – August 2005.

Project Components
– Runtime Optimizations in the DataCutter System
– Compiler Optimization of DataCutter filters
– Automatic Generation of DataCutter filters

Focus of this talk

General Motivation

Language and Compiler Support for Parallelism of many forms has been explored
– Shared memory parallelism
– Instruction-level parallelism
– Distributed memory parallelism
– Multithreaded execution

Application and technology trends are making another form of parallelism desirable and feasible
– Coarse-Grained Pipelined Parallelism

Coarse-Grained Pipelined Parallelism (CGPP)

Definition
– Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units

Example: K-nearest Neighbor

Given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a query point (a, b, c), we want to find the nearest K neighbors of the query point within R.

[Figure: two pipeline stages, Range_query followed by Find the K-nearest neighbors]
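The two stages of this example can be sketched as a pair of chained functions. This is only an illustration in Python, not the DataCutter or dialect code from the talk: the names `range_query` and `k_nearest`, the tuple representation of points, and the Euclidean distance are assumptions. The stages run lazily in one process here, whereas under CGPP each stage would run on its own computing unit.

```python
import heapq
import math

def range_query(points, lo, hi):
    """Stage 1: stream out only the points inside the 3-D range R = <lo, hi>."""
    for p in points:
        if all(l <= c <= h for c, l, h in zip(p, lo, hi)):
            yield p

def k_nearest(stream, query, k):
    """Stage 2: keep the k points of the incoming stream closest to the query point."""
    # Euclidean distance is an assumption; the slides do not name the metric.
    return heapq.nsmallest(k, stream, key=lambda p: math.dist(p, query))

if __name__ == "__main__":
    data = [(0, 0, 0), (1, 1, 1), (3, 3, 3), (5, 5, 5), (9, 9, 9)]
    in_range = range_query(data, lo=(0, 0, 0), hi=(5, 5, 5))
    print(k_nearest(in_range, query=(1, 1, 1), k=2))
```

Because the first stage discards every point outside R, only the surviving points have to travel to the second stage, which is the kind of communication saving the pipeline is meant to expose.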

Coarse-Grained Pipelined Parallelism is Desirable & Feasible

Application scenarios

[Figure: data repositories accessed across the Internet]


A new class of data-intensive applications
– Scientific data analysis
– Data mining
– Data visualization
– Image analysis

Two direct ways to implement such applications
– Downloading all the data to the user’s machine: often not feasible
– Computing at the data repository: usually too slow

Our belief
– A coarse-grained pipelined execution model is a good match


Coarse-Grained Pipelined Parallelism needs Compiler Support

Computation needs to be decomposed into stages

Decomposition decisions are dependent on the execution environment
– How many computing sites are available
– How many computing cycles are available on each site
– What the available communication links are
– What the bandwidth of each link is

Code for each stage follows the same processing pattern, so it can be generated by the compiler

Shared or distributed memory parallelism needs to be exploited

High-level language and compiler support are necessary

Outline
– Coarse-grained pipelined parallelism is desirable & feasible
– Coarse-grained pipelined parallelism needs high-level language & compiler support
– An entire picture of the system
– DataCutter runtime system & language dialect
– Overview of the challenges for the compiler
– Compiler Techniques
– Experimental results
– Related work
– Future work & Conclusions

An Entire Picture

[Figure: system overview with the components Java Dialect, Compiler Support (Decomposition, Code Generation), and DataCutter Runtime System]

DataCutter Runtime System

Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al.)

Targets a distributed, heterogeneous environment

Allows decomposition of application-specific data processing operations into a set of interacting processes

Provides a specific low-level interface
– filter
– stream
– layout & placement

[Figure: filter1 → stream → filter2 → stream → filter3]

Language Dialect

Goal
– To give the compiler information about independent collections of objects, parallel loops and reduction operations, and pipelined parallelism

Extensions of Java
– Pipelined_loop
– Domain & Rectdomain
– Foreach loop
– Reduction variables

ISO-Surface Extraction Example Code

    public class isosurface {
      public static void main(String arg[]) {
        float iso_value;
        RectDomain<1> CubeRange = [min:max];
        CUBE[1d] InputData = new CUBE[CubeRange];
        Point<1> p, b;
        RectDomain<1> PacketRange = [1:runtime_def_num_packets];
        RectDomain<1> EachRange = [1:(max-min)/runtime_def_num_packets];
        Pipelined_loop (b in PacketRange) {
          Foreach (p in EachRange) {
            InputData[p].ISO_SurfaceTriangles(iso_value,…);
          }
          … …
        }
      }
    }

Sequential loop shown for comparison (slide callout):

    for (int i = min; i < max-1; i++) {
      // operate on InputData[i]
    }

General form of a Pipelined_loop:

    Pipelined_loop (b in PacketRange) {
      0. foreach ( …) { … }
      1. foreach ( …) { … }
      … …
      n-1. S;
    }
    Merge

Example packet range (slide callout):

    RectDomain<1> PacketRange = [1:4];

Overview of the Challenges for the Compiler

Filter Decomposition
– To identify the candidate filter boundaries
– Compute communication volume between two consecutive filters
– Cost Model
– Compute a mapping from computations in a loop to computing units in a pipeline

Filter Code Generation

Identify the Candidate Filter Boundaries

Three types of candidate boundaries
– Start & end of a foreach loop
– Conditional statement

      if ( point[p].inRange(high, low) ) {
        local_KNN(point[p]);
      }

– Start & end of a function call within a foreach loop

Any non-foreach loop must be completely inside a single filter

Compute Required Communication

ReqComm(b) = the set of values that need to be communicated through boundary b

Cons(B) = the set of variables that are used in B, not defined in B

Gens(B) = the set of variables that are defined in B, still alive at the end of B

ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B)

[Figure: code segment B lies between boundary b2 at its start and boundary b1 at its end]
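The recurrence ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B) can be applied from the last boundary backwards to obtain the required communication at every candidate boundary. The sketch below is a minimal illustration in Python, assuming Cons and Gens are already known for each code segment. The function name, the `live_out` parameter, and the variable names in the example are hypothetical, not taken from the compiler described in the talk.

```python
def req_comm_at_boundaries(segments, live_out=frozenset()):
    """Return ReqComm for the boundary before each segment, plus the final boundary.

    segments: list of (cons, gens) pairs, one per code segment, in program order.
    live_out: values still needed after the last segment (an assumed input).
    """
    req = [frozenset(live_out)]
    # Walk the segments backwards: ReqComm(b2) = ReqComm(b1) - Gens(B) + Cons(B).
    for cons, gens in reversed(segments):
        req.append((req[-1] - frozenset(gens)) | frozenset(cons))
    req.reverse()
    return req

if __name__ == "__main__":
    # Hypothetical three-segment loop body; names are for illustration only.
    segments = [
        ({"data"}, {"cube"}),
        ({"cube", "iso_value"}, {"triangles"}),
        ({"triangles"}, {"image"}),
    ]
    for i, values in enumerate(req_comm_at_boundaries(segments, {"image"})):
        print(i, sorted(values))
```

The size of each resulting set, weighted by the size of the values in it, is what a cost model would use as the communication volume at that boundary.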

Cost Model

– A sequence of m computing units, C1, …, Cm, with computing powers P(C1), …, P(Cm)
– A sequence of m-1 network links, L1, …, Lm-1, with bandwidths B(L1), …, B(Lm-1)
– A sequence of n candidate filter boundaries b1, …, bn

Cost Model

[Figure: pipeline C1 → L1 → C2 → L2 → C3 with a timing diagram of stage against time]

Say L2 is the bottleneck stage:
T = T(C1)+T(L1)+T(C2)+N*T(L2)+T(C3)

Say C2 is the bottleneck stage:
T = T(C1)+T(L1)+N*T(C2)+T(L2)+T(C3)
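Both formulas are the same rule: every stage is paid once, and the bottleneck stage is paid N times instead of once. The Python sketch below states that rule and checks it against a small step-by-step simulation. It assumes that N is the number of packets, that each stage takes a fixed time per packet, and that buffering between stages is unbounded. None of these are stated on the slides, and the function names are hypothetical.

```python
def pipeline_time(stage_times, num_packets):
    """Predicted total time: each stage once, plus (N - 1) extra turns of the bottleneck."""
    return sum(stage_times) + (num_packets - 1) * max(stage_times)

def simulate(stage_times, num_packets):
    """Schedule packets one by one; a stage starts a packet when both are ready."""
    finish = [0.0] * len(stage_times)  # when each stage finished its latest packet
    for _ in range(num_packets):
        ready = 0.0  # when the current packet left the previous stage
        for s, t in enumerate(stage_times):
            ready = max(ready, finish[s]) + t
            finish[s] = ready
    return finish[-1]

if __name__ == "__main__":
    # Per-packet times for C1, L1, C2, L2, C3 (made-up numbers); C2 is the bottleneck.
    times = [2, 1, 5, 3, 1]
    print(pipeline_time(times, 4), simulate(times, 4))
```

With C2 as the bottleneck, `pipeline_time` evaluates T(C1)+T(L1)+N*T(C2)+T(L2)+T(C3), which matches the second formula above.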

Filter Decomposition

[Figure: computing units C1, C2, …, Cm-1, Cm connected by links L1, …, Lm-1; filters f1, f2, …, fn, fn+1 separated by candidate boundaries b1, …, bn]

Goal: find a mapping Li → bj that minimizes the predicted execution time, where 1 ≤ i ≤ m-1 and 1 ≤ j ≤ n.

Intuitively, the candidate filter boundary bj is inserted between computing units Ci and Ci+1.

Exhaustive search: $\binom{n+1+m-1}{m-1}$ possible mappings

Filter Decomposition: A Greedy Algo.

To minimize the predicted execution time

[Figure: computing units C1, …, C4 connected by links L1, L2, L3; filters f1, …, f5 separated by candidate boundaries b1, …, b4]

Estimated cost of placing link L1 at each candidate boundary
– L1 to b1 : T1
– L1 to b2 : T2
– L1 to b3 : T3
– L1 to b4 : T4

Min{T1 … T4} = T2, so L1 is placed at b2 and filters f1, f2 run on C1
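The greedy step shown for L1 can be repeated for each later link. The Python sketch below is one possible reading, not the authors' implementation. It assumes a per-packet work estimate for each filter and a per-packet communication volume for each boundary. It also assumes that, while a link is being placed, all remaining filters are charged to the next computing unit, because the slides do not say how each Ti is estimated. All names are hypothetical, and indices are 0-based, so boundary j sits between filter j and filter j+1.

```python
def pipeline_time(stage_costs, num_packets):
    """Each stage once, plus (N - 1) extra turns of the bottleneck stage."""
    return sum(stage_costs) + (num_packets - 1) * max(stage_costs)

def stages_for(cuts, work, volume, power, bandwidth):
    """Per-packet cost of C1, L1, C2, ... when link i is placed at boundary cuts[i]."""
    costs, start = [], 0
    for i, cut in enumerate(cuts):
        costs.append(sum(work[start:cut + 1]) / power[i])  # unit i runs filters start..cut
        costs.append(volume[cut] / bandwidth[i])           # link i carries the data crossing the cut
        start = cut + 1
    costs.append(sum(work[start:]) / power[len(cuts)])     # next unit is charged all remaining filters
    return costs

def greedy_decompose(work, volume, power, bandwidth, num_packets):
    """Place links one at a time, each at the boundary with the lowest estimated time."""
    cuts = []
    for _ in range(len(bandwidth)):
        lowest = cuts[-1] if cuts else 0  # a link may not be placed before the previous one
        best = min(
            range(lowest, len(volume)),
            key=lambda j: pipeline_time(
                stages_for(cuts + [j], work, volume, power, bandwidth), num_packets),
        )
        cuts.append(best)
    return cuts

if __name__ == "__main__":
    work = [4, 2, 6, 1, 3]   # filters f1..f5 (made-up costs)
    volume = [8, 2, 6, 4]    # boundaries b1..b4 (made-up volumes)
    print(greedy_decompose(work, volume, power=[1, 1], bandwidth=[1], num_packets=10))
```

Each link costs at most n evaluations of the cost model, so the greedy pass needs on the order of (m-1)·n evaluations rather than enumerating every mapping, at the price of possibly missing the optimum.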

Code Generation

Abstraction of the work each filter does
– Read in a buffer of data from input stream
– Iterate over the set of data
– Write out the results to output stream

Code generation issues
– How to get the Cons(b) from the input stream (unpacking data)
– How to organize the output data for the successive filter (packing data)
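The read, iterate, write pattern and the two packing issues can be sketched as a generic filter loop. This Python sketch is illustrative only and does not use the DataCutter API. Streams are modelled with `queue.Queue`, records are packed with the standard `struct` module, and `run_filter`, `END_OF_STREAM`, and the record formats are hypothetical names chosen for the example.

```python
import queue
import struct

END_OF_STREAM = None  # hypothetical sentinel marking the end of a stream

def run_filter(in_stream, out_stream, process, fmt_in, fmt_out):
    """Generic filter body: unpack each input buffer, process it, pack the results."""
    rec_in, rec_out = struct.Struct(fmt_in), struct.Struct(fmt_out)
    while True:
        buf = in_stream.get()                      # read a buffer from the input stream
        if buf is END_OF_STREAM:
            out_stream.put(END_OF_STREAM)          # propagate end-of-stream downstream
            return
        records = rec_in.iter_unpack(buf)          # unpacking: recover the values this filter consumes
        results = [process(*r) for r in records]   # iterate over the set of data
        packed = b"".join(rec_out.pack(*r) for r in results)
        out_stream.put(packed)                     # packing: write results for the next filter

if __name__ == "__main__":
    src, dst = queue.Queue(), queue.Queue()
    src.put(struct.pack("<3f", 1, 2, 3) + struct.pack("<3f", 4, 5, 6))
    src.put(END_OF_STREAM)
    # Example per-record computation: sum the three coordinates.
    run_filter(src, dst, lambda x, y, z: (x + y + z,), "<3f", "<f")
    print(struct.unpack("<2f", dst.get()))
```

Only the record formats and the per-record computation change from one filter to the next, which is why this code pattern lends itself to compiler generation.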

Experimental Results

Goal
– To show compiler-generated code is efficient

Environment settings
– 700 MHz Pentium machines
– Connected through Myrinet LANai 7.0

Configurations (# data sites --- # computing sites --- user machine)
– 1-1-1
– 2-2-1
– 4-4-1

Experimental Results

Versions
– Default version
  Site hosting the data only reads and transmits data, no processing at all
  User’s desktop only views the results, no processing at all
  All the work is done by the compute nodes
  Result: computing nodes' workload is heavy; communication volume is high
– Compiler-generated version
  Intelligent decomposition is done by the compiler
  More computations are performed on the end nodes to reduce the communication volume
  Result: workload is balanced between the nodes; communication volume is reduced
– Manual version
  Hand-written DataCutter filters with similar decomposition as the compiler-generated version

Experimental Results: ISO-Surface Rendering (Z-Buffer Based)

[Figure: two bar charts comparing the Decomp and Default versions against width of pipeline (1, 2, 4); panels: small dataset, 150M, and large dataset, 600M]

Small dataset: speedup 1.92, 3.34
Large dataset: speedup 1.99, 3.82

20% improvement over default version

Experimental Results: ISO-Surface Rendering (Active Pixel Based)

[Figure: two bar charts comparing the Decomp and Default versions at x-axis values 1, 2, 4; panels: small dataset, 150M, and large dataset, 600M]

Speedup close to linear

> 15% improvement over default version

Experimental Results: KNN

[Figure: two bar charts comparing the Decomp, Manual, and Default versions at x-axis values 1, 2, 4; panels: K = 3, 108M, and K = 200, 108M]

K = 3: speedup 1.89, 3.38
K = 200: speedup 1.87, 3.82

> 150% improvement over default version

Experimental Results: Virtual Microscope

[Figure: two bar charts comparing the Decomp, Manual, and Default versions at x-axis values 1, 2, 4; panels: small query, 800M, 512*512, and large query, 800M, 2048*2048]

≈40% improvement over default version

Experimental Results

Summary
– The compiler-decomposed versions achieve an improvement between 10% and 150% over default versions
– In most cases, increasing the width of the pipeline results in near-linear speedup
– Compared with the manual version, the compiler-decomposed versions are generally quite close

Ongoing and Future Work

– Buffer size optimization
– Cost model refinement & implementation
– More applications
– More realistic environment settings: resources dynamically available

Conclusions

– Coarse-Grained Pipelined Parallelism is desirable & feasible
– Coarse-Grained Pipelined Parallelism needs language & compiler support
– An algorithm for required communication analysis is given
– A greedy algorithm for filter decomposition is developed
– A cost model is designed
– Results of detailed evaluation of our compiler are encouraging
