Compiler Support for Exploiting Coarse-Grained Pipelined Parallelism

Wei Du
Renato Ferreira
Gagan Agrawal

Ohio-State University

Coarse-Grained Pipelined Parallelism (CGPP)

Definition
– Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units

Example: K-nearest Neighbor

Given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point (a, b, c), we want to find the K nearest neighbors of that point within R.

Pipeline stages: Range_query → Find the K-nearest neighbors
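To make the two stages of the K-nearest-neighbor example concrete, here is a minimal plain-Java sketch. It is not the dialect or the DataCutter code from the talk: the class and method names (`KnnPipelineSketch`, `rangeQuery`, `kNearest`) are hypothetical, points are assumed to be `double[3]` arrays, and both stages run in one process, whereas in a CGPP setting each would run on a different computing unit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the two pipeline stages; not the authors' code.
final class KnnPipelineSketch {

    // Stage 1 (Range_query): keep only the points inside the 3-D box [low, high].
    static List<double[]> rangeQuery(List<double[]> points, double[] low, double[] high) {
        List<double[]> inside = new ArrayList<>();
        for (double[] p : points) {
            boolean inRange = true;
            for (int d = 0; d < 3; d++) {
                if (p[d] < low[d] || p[d] > high[d]) {
                    inRange = false;
                }
            }
            if (inRange) {
                inside.add(p);
            }
        }
        return inside;
    }

    // Squared Euclidean distance; enough for ranking neighbors.
    static double squaredDistance(double[] p, double[] q) {
        double sum = 0.0;
        for (int d = 0; d < 3; d++) {
            double diff = p[d] - q[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Stage 2: among the candidates from stage 1, return the k closest to the query point.
    static List<double[]> kNearest(List<double[]> candidates, double[] query, int k) {
        List<double[]> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((double[] p) -> squaredDistance(p, query)));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

Only the output of `rangeQuery` has to cross the link between the two stages, which is why running the range query close to the data can reduce communication volume.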

Coarse-Grained Pipelined Parallelism is Desirable & Feasible

Application scenarios

[Figure: application scenarios; labels: Internet, data]

Our belief
– A coarse-grained pipelined execution model is a good match

Coarse-Grained Pipelined Parallelism needs Compiler Support

Computation needs to be decomposed into stages

Decomposition decisions are dependent on execution environment
– availability and capacity of computing sites and communication links

Code for each stage follows the same processing pattern, so it can be generated by compiler

Shared or distributed memory parallelism needs to be exploited

High-level language and compiler support are necessary

Outline

Motivation

Overview of the system

DataCutter runtime system & Language dialect

Compiler techniques

Experimental results

Related work

Future work & Conclusions

Overview

[Figure: system overview; components: Java Dialect, Compiler Support (Decomposition, Code Generation), DataCutter Runtime System]

DataCutter Runtime System

Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al)

Targets a distributed, heterogeneous environment

Allows decomposition of application-specific data processing operations into a set of interacting processes

Provides a specific low-level interface
– filter
– stream
– layout & placement

[Figure: filter1 → stream → filter2 → stream → filter3]

Language Dialect

Goal
– to give compiler information about independent collections of objects, parallel loops, reduction operations, and pipelined parallelism

Extensions of Java
– Pipelined_loop
– Domain & Rectdomain
– Foreach loop
– reduction variables

ISO-Surface Extraction Example Code

public class isosurface {
    public static void main(String arg[]) {
        float iso_value;
        RectDomain<1> CubeRange = [min:max];
        CUBE[1d] InputData = new CUBE[CubeRange];
        Point<1> p, b;

        RectDomain<1> PacketRange = [1:runtime_def_num_packets];

        RectDomain<1> EachRange = [1:(max-min)/runtime_define_num_packets];

        Pipelined_loop (b in PacketRange) {
            Foreach (p in EachRange) {
                InputData[p].ISO_SurfaceTriangles(iso_value,…);
            }
            … …
        }
    }
}

Equivalent sequential loop:

for (int i = min; i < max-1; i++) {
    // operate on InputData[i]
}

General form of a Pipelined_loop:

Pipelined_loop (b in PacketRange) {
    0.   foreach ( … ) { … }
    1.   foreach ( … ) { … }
    … …
    n-1. S;
}
Merge

Example packet range:

RectDomain<1> PacketRange = [1:4];

Overview of the Challenges for the Compiler

Filter Decomposition
– To identify the candidate filter boundaries
– Compute communication volume between two consecutive filters
– Cost Model
– Determine a mapping from computations in a loop to processing units in a pipeline

Filter Code Generation

Compute Required Communication

ReqComm(b) = the set of values that need to be communicated through this boundary

Cons(B) = the set of variables that are used in B, not defined in B

Gens(B) = the set of variables that are defined in B, still alive at the end of B

ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B)

[Figure: code block B between boundaries b2 and b1]

Filter Decomposition

[Figure: a pipeline of computing units C1, C2, …, Cm-1, Cm connected by links L1, …, Lm-1; a chain of filters f1, f2, …, fn, fn+1 separated by candidate boundaries b1, …, bn]

Goal: Find a mapping Li → bj that minimizes the predicted execution time, where 1 ≤ i ≤ m-1, 1 ≤ j ≤ n.

Intuitively, the candidate filter boundary bj is inserted between computing units Ci and Ci+1.

Exhaustive search: (n+m-1 choose m-1) candidate mappings

Filter Decomposition: Dynamic Programming

[Figure: computing units Cm-2, Cm-1, Cm with links Lm-2, Lm-1 and filters fn, fn+1]

T[i,j]: min cost of doing computations f1, …, fi on computing units C1, …, Cj, where the results of fi are on Cj.

T[i,j] = min { T[i-1,j] + Cost_comp(P(Cj), Task(fi)),
               T[i,j-1] + Cost_comm(B(Lj-1), Vol(fi)) }

Goal: T[n+1,m]

Cost: O(mn)
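The recurrence can be sketched in plain Java as below. This is an illustration under stated assumptions, not the authors' implementation: the computation cost is taken to be Task(fi) / P(Cj) and the cost of moving the output of fi over link Lj-1 to be Vol(fi) / B(Lj-1); the base row T[0,j] (forwarding the raw input, whose volume is `vol[0]`) is also an assumption, since the slide does not state it. All names (`FilterDecompositionSketch`, `minCost`, the array parameters) are hypothetical, and the arrays are 0-indexed while the slide counts from 1.

```java
// Hypothetical sketch of the dynamic program T[i,j]; names and cost functions are assumptions.
final class FilterDecompositionSketch {

    /**
     * @param task      task[i-1] = amount of work in filter f_i (n+1 filters in total)
     * @param vol       vol[0] = raw input volume, vol[i] = output volume of filter f_i
     * @param power     power[j] = computing power of unit C_(j+1) (m units in total)
     * @param bandwidth bandwidth[j] = bandwidth of the link between units j and j+1 (m-1 links)
     * @return minimum predicted cost of running all filters with the result on the last unit
     */
    static double minCost(double[] task, double[] vol, double[] power, double[] bandwidth) {
        int filters = task.length;
        int units = power.length;
        double[][] t = new double[filters + 1][units];

        // Row 0: no filter has run yet, so reaching unit j only means forwarding the raw input.
        t[0][0] = 0.0;
        for (int j = 1; j < units; j++) {
            t[0][j] = t[0][j - 1] + vol[0] / bandwidth[j - 1];
        }

        for (int i = 1; i <= filters; i++) {
            for (int j = 0; j < units; j++) {
                // Option 1: filter i runs on unit j, after filters 1..i-1 finished on units 0..j.
                double best = t[i - 1][j] + task[i - 1] / power[j];
                // Option 2: filter i already ran upstream; ship its output over link j-1.
                if (j > 0) {
                    best = Math.min(best, t[i][j - 1] + vol[i] / bandwidth[j - 1]);
                }
                t[i][j] = best;
            }
        }
        return t[filters][units - 1];
    }
}
```

The table has (n+2) × m entries and each entry is filled in constant time, which matches the O(mn) cost on the slide. Note that this sketch minimizes a sum of costs for one packet; folding in the bottleneck-stage cost model would change the cost functions, not the shape of the recurrence.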

Code Generation

Abstraction of the work each filter does
– Read in a buffer of data from input stream
– Iterate over the set of data
– Write out the results to output stream

Code generation issues
– How to get the Cons(b) from the input stream --- unpacking data
– How to organize the output data for the successive filter --- packing data

Experimental Results

Goal
– To show compiler-generated code is efficient

Configurations: # data sites --- # computing sites --- user machine
– 1-1-1
– 2-2-1
– 4-4-1
– width of a pipeline

[Figure: the three configurations, drawn as data → compute → user pipelines of width 1, 2, and 4]

Experimental Results

Versions
– Default version
    Site hosting the data only reads and transmits data, no processing at all
    User's desktop only views the results, no processing at all
    All the work is done by the computing nodes
    (Computing nodes' workload heavy; communication volume high)

– Compiler-generated version
    Intelligent decomposition is done by the compiler
    More computations are performed on the end nodes to reduce the communication volume
    (Workload balanced between each node; communication volume reduced)

– Manual version
    Hand-written DataCutter filters with similar decomposition as the compiler-generated version

Experimental Results: ISO-Surface Rendering

[Figure: two bar charts comparing the Decomp and Default versions; x-axis: width of pipeline (1, 2, 4). Panels: small dataset, 150M; large dataset, 600M]

Speedup 1.92 3.34 (small dataset)    Speedup 1.99 3.82 (large dataset)

20% improvement over default version

Experimental Results: KNN

[Figure: two bar charts comparing the Decomp, Manual, and Default versions; x-axis ticks: 1, 2, 4. Panels: K = 3, 108M; K = 200, 108M]

Speedup 1.89 3.38 (K = 3)    Speedup 1.87 3.82 (K = 200)

>150% improvement over default version

Experimental Results: Virtual Microscope

[Figure: two bar charts comparing the Decomp, Manual, and Default versions; x-axis ticks: 1, 2, 4. Panels: small query, 800M, 512*512; large query, 800M, 2048*2048]

≈40% improvement over default version

Experimental Results

Summary
– The compiler-decomposed versions achieve an improvement between 10% and 150% over default versions

– In most cases, increasing the width of the pipeline results in near-linear speedup

– Compared with the manual version, the compiler-decomposed versions are generally quite close

Related Work

No previous work on language & compiler support for CGPP

StreamIt (MIT)
– Targets streaming applications
– A language for communication-exposed architectures
– A compiler performs stream-specific optimizations
– Lower-level language interface
– Targets a different architecture

Ziegler et al (USC/ISI)
– Target pipelined FPGA architectures
– Consider different granularity of communication between FPGAs

Related Work

Run-time support for CGPP
– Stampede (Georgia Tech)
    Multimedia applications, support is in the form of cluster-wide threads and shared objects

– Yang et al (Penn State)
    Scheduler for vision applications, executed in a pipelined fashion within a cluster

– Remos (CMU)
    Resource monitoring system for network-aware applications to get info. about execution environment

– Active Stream (Georgia Tech)
    A middleware approach for distributed applications

Future Work & Conclusion

Future Work
– Buffer size optimization
– Cost model refinement & implementation
– More applications
– More realistic environment settings: resource dynamically available --- compiler directed adaptation

Future Work & Conclusion

Conclusion
– Coarse-Grained Pipelined Parallelism is desirable & feasible
– Coarse-Grained Pipelined Parallelism needs language & compiler support
– An algorithm for required communication analysis is given
– A dynamic programming algorithm for filter decomposition is developed
– A cost model is designed
– Results of detailed evaluation of our compiler are encouraging

Thank you!

Cost Model

– A sequence of m computing units, C1, …, Cm, with computing powers P(C1), …, P(Cm)

– A sequence of m-1 network links, L1, …, Lm-1, with bandwidths B(L1), …, B(Lm-1)

– A sequence of n candidate filter boundaries b1, …, bn

[Figure: computing units C1, C2, C3 connected by links L1, L2]

Say, L2 is bottleneck stage,

T = T(C1)+T(L1)+T(C2)+N*T(L2)+T(C3)

Identify the Candidate Filter Boundaries

Three types of candidate boundaries
– Start & end of a foreach loop
– Conditional statement
    if ( point[p].inRange(high, low) ) {
        local_KNN(point[p]);
    }
– Start & end of a function call within a foreach loop

Any non-foreach loop must be completely inside a single filter


Coarse-Grained Pipelined Parallelism is Desirable & Feasible

A new class of data-intensive applications
– scientific data analysis
– data mining
– data visualization
– image analysis
– and more …

Two direct ways to implement such applications
– Downloading all the data to user's machine
– Computing at the data repository

Compute Required Communication

[Figure: two consecutive code blocks separated by boundaries, listed here in program order]

b2: ReqComm(b2) = {A}
    block: Cons = {A}, Gens = {X, Y}
b1: ReqComm(b1) = {X, Y}
    block: Cons = {X, Y}, Gens = {Z}
b0: ReqComm(b0) = { }

ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B)
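The backward propagation rule above can be sketched in plain Java as follows. This is a minimal illustration, assuming a straight-line sequence of blocks whose Cons and Gens sets are already known; the class and method names (`ReqCommSketch`, `reqComm`) are hypothetical and variables are represented as strings.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the ReqComm propagation; not the authors' compiler code.
final class ReqCommSketch {

    /**
     * Blocks are given in program order. The returned list has one more entry than
     * there are blocks: entry k is ReqComm at the boundary just before block k, and
     * the last entry is ReqComm at the boundary after the final block (empty).
     */
    static List<Set<String>> reqComm(List<Set<String>> cons, List<Set<String>> gens) {
        LinkedList<Set<String>> result = new LinkedList<>();
        Set<String> current = new TreeSet<>();      // ReqComm at the end boundary = { }
        result.addFirst(new TreeSet<>(current));

        // Walk the blocks from last to first, applying
        // ReqComm(before B) = ReqComm(after B) - Gens(B) + Cons(B).
        for (int k = cons.size() - 1; k >= 0; k--) {
            current.removeAll(gens.get(k));
            current.addAll(cons.get(k));
            result.addFirst(new TreeSet<>(current));
        }
        return result;
    }
}
```

Feeding it the two blocks from the slide (Cons = {A}, Gens = {X, Y}, followed by Cons = {X, Y}, Gens = {Z}) yields {A}, {X, Y}, and { } for b2, b1, and b0, matching the values shown.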

Compute Required Communication

Cons(B) = the set of variables that are used in B, not defined in B

Gens(B) = the set of variables that are defined in B, still alive at the end of B

Statement        Cons(s)   Gens(s)   Cons(B)   Gens(B)
Z = A + 48       A         Z         A         X, Z
If Z > 0         Z                   Z, A      X
    Y = Z * A    Z, A                Z, A      X
X = Z + A        Z, A      X         Z, A      X

Code Generation

Two ways to organize data in a buffer
– Instance-wise
– Field-wise

class C {
    int x;
    float y;
    int z;
}

Instance-wise: X Y Z X Y Z . . .

Field-wise: X X . . . Y Y . . . Z . . .

Code Generation

Ways that fields of an object are used
– In the same loop (instance-wise)

    for (int i=0; i<count; i++) {
        … = InputData[i].x + …;
        … = … + InputData[i].y;
    }

– In different loops (field-wise)

    for (int i=0; i<count; i++) {
        … = InputData[i].x + …;
    }

    for (int i=0; i<count; i++) {
        … = … + InputData[i].y;
    }
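The two buffer layouts can be illustrated with `java.nio.ByteBuffer`. This is only a sketch of the layouts: the class `PackingSketch` and its methods are hypothetical, the three fields of class C are passed as parallel arrays for brevity, and the generated DataCutter filter code may pack data quite differently.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the two packing layouts for fields x (int), y (float), z (int).
final class PackingSketch {

    private static final int BYTES_PER_INSTANCE = 4 + 4 + 4; // int + float + int

    // Instance-wise: x0 y0 z0 x1 y1 z1 ... ; suits loops that touch all fields of one object.
    static ByteBuffer packInstanceWise(int[] x, float[] y, int[] z) {
        ByteBuffer buf = ByteBuffer.allocate(x.length * BYTES_PER_INSTANCE);
        for (int i = 0; i < x.length; i++) {
            buf.putInt(x[i]).putFloat(y[i]).putInt(z[i]);
        }
        buf.flip(); // switch the buffer from writing to reading
        return buf;
    }

    // Field-wise: x0 x1 ... y0 y1 ... z0 z1 ... ; suits separate loops that each read one field.
    static ByteBuffer packFieldWise(int[] x, float[] y, int[] z) {
        ByteBuffer buf = ByteBuffer.allocate(x.length * BYTES_PER_INSTANCE);
        for (int v : x) {
            buf.putInt(v);
        }
        for (float v : y) {
            buf.putFloat(v);
        }
        for (int v : z) {
            buf.putInt(v);
        }
        buf.flip();
        return buf;
    }
}
```

With the field-wise layout, a receiving filter whose loop reads only `x` walks a contiguous run of the buffer, whereas the instance-wise layout keeps the fields of one object adjacent for loops that use them together.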

Cost Model

[Figure: time vs. stage diagram for a pipeline of stages C1, L1, C2, L2, C3]

Say, L2 is bottleneck stage,

T = T(C1)+T(L1)+T(C2)+N*T(L2)+T(C3)

Say, C2 is bottleneck stage,

T = T(C1)+T(L1)+N*T(C2)+T(L2)+T(C3)
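Both formulas are instances of one rule: every stage is paid for once, and the slowest stage is paid for N times. A minimal Java sketch follows, assuming N is the number of packets sent through the pipeline (the slide does not define N) and that per-packet stage times are already known; the class and method names are hypothetical.

```java
// Hypothetical sketch of the bottleneck-stage cost model.
final class PipelineCostSketch {

    /**
     * @param stageTimes per-packet time of each stage in pipeline order,
     *                   e.g. {T(C1), T(L1), T(C2), T(L2), T(C3)}
     * @param packets    number of packets N flowing through the pipeline (assumed meaning of N)
     * @return predicted total time: sum of all stages + (N - 1) * slowest stage
     */
    static double totalTime(double[] stageTimes, long packets) {
        double sum = 0.0;
        double bottleneck = 0.0;
        for (double t : stageTimes) {
            sum += t;
            bottleneck = Math.max(bottleneck, t);
        }
        // Equivalent to the slide's form: the bottleneck appears N times, every other stage once.
        return sum + (packets - 1) * bottleneck;
    }
}
```

For stage times {1, 2, 3, 5, 1} and N = 4, the fourth stage (L2) is the bottleneck and the estimate is 1 + 2 + 3 + 4*5 + 1 = 27.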

Experimental Results: ISO-Surface Rendering (Active Pixel Based)

[Figure: two bar charts comparing the Decomp and Default versions; x-axis ticks: 1, 2, 4. Panels: small dataset, 150M; large dataset, 600M]

Speedup close to linear

> 15% improvement over default version
