Compiler Support for Exploiting Coarse-Grained Pipelined Parallelism
Wei Du
Renato Ferreira
Gagan Agrawal
Ohio-State University
Coarse-Grained Pipelined Parallelism (CGPP)
Definition
– Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units
Example: K-nearest Neighbor
Given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point (a, b, c), we want to find the K nearest neighbors of that point within R.
Stages: Range_query, then Find the K-nearest neighbors
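The two stages of the K-nearest-neighbor example can be sketched as two plain functions. This is only an illustration under assumptions: the function names `range_query` and `k_nearest`, the tuple representation of points, and the use of squared Euclidean distance are not taken from the slides.

```python
import heapq

def range_query(points, lo, hi):
    """Stage 1: keep only the points that lie inside the 3-D box [lo, hi]."""
    return [p for p in points
            if all(l <= c <= h for l, c, h in zip(lo, p, hi))]

def k_nearest(points, target, k):
    """Stage 2: pick the k points closest to target (squared Euclidean distance)."""
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(p, target))
    return heapq.nsmallest(k, points, key=dist2)

# Hypothetical data: the last point falls outside the range and is dropped by stage 1.
points = [(0, 0, 0), (1, 1, 1), (2, 2, 3), (9, 9, 9)]
in_range = range_query(points, lo=(0, 0, 0), hi=(3, 3, 3))
nearest = k_nearest(in_range, target=(2, 2, 2), k=2)
```

Stage 1 shrinks the data before stage 2 runs, which is why placing it close to the data source can reduce the volume sent down the pipeline.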
Coarse-Grained Pipelined Parallelism is Desirable & Feasible
Application scenarios
[Figure: application scenario; labels: data, Internet]
Our belief
– A coarse-grained pipelined execution model is a good match
Coarse-Grained Pipelined Parallelism needs Compiler Support
Computation needs to be decomposed into stages
Decomposition decisions are dependent on execution environment
– availability and capacity of computing sites and communication links
Code for each stage follows the same processing pattern, so it can be generated by compiler
Shared or distributed memory parallelism needs to be exploited
High-level language and compiler support are necessary
Outline
Motivation
Overview of the system
DataCutter runtime system & Language dialect
Compiler techniques
Experimental results
Related work
Future work & Conclusions
Overview
[Figure: system overview; labels: Java Dialect, Compiler Support, Decomposition, Code Generation, DataCutter Runtime System]
DataCutter Runtime System
Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al)
Targets a distributed, heterogeneous environment
Allows decomposition of application-specific data processing operations into a set of interacting processes
Provides a specific low-level interface
– filter
– stream
– layout & placement
[Figure: filter1, filter2, filter3 connected by streams]
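The filter and stream abstraction can be pictured with a small thread-and-queue sketch. This is not the DataCutter API: `run_filter`, `run_pipeline`, and the `END` sentinel are hypothetical names, and each stream is modeled as an in-process `queue.Queue` rather than a network connection.

```python
import queue
import threading

END = object()  # sentinel that marks the end of a stream

def run_filter(work, inp, out):
    """One filter: read buffers from inp, apply work, write results to out."""
    while True:
        buf = inp.get()
        if buf is END:
            out.put(END)  # propagate end-of-stream to the next filter
            return
        out.put(work(buf))

def run_pipeline(stages, buffers):
    """Connect the stage functions by streams and push the buffers through."""
    streams = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=run_filter,
                                args=(f, streams[k], streams[k + 1]))
               for k, f in enumerate(stages)]
    for t in threads:
        t.start()
    for b in buffers:
        streams[0].put(b)
    streams[0].put(END)
    results = []
    while True:
        r = streams[-1].get()
        if r is END:
            break
        results.append(r)
    for t in threads:
        t.join()
    return results

# Three hypothetical filters: double, increment, then reduce each buffer to a sum.
stages = [lambda b: [x * 2 for x in b], lambda b: [x + 1 for x in b], sum]
out = run_pipeline(stages, [[1, 2], [3]])
```

Because every filter runs in its own thread, filter1 can already work on the second buffer while filter2 handles the first, which is the pipelining effect the runtime is meant to provide.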
Language Dialect
Goal
– to give compiler information about independent collections of objects, parallel loops, reduction operations, and pipelined parallelism
Extensions of Java
– Pipelined_loop
– Domain & Rectdomain
– Foreach loop
– reduction variables
ISO-Surface Extraction Example Code
public class isosurface {
  public static void main(String arg[]) {
    float iso_value;
    RectDomain<1> CubeRange = [min:max];
    CUBE[1d] InputData = new CUBE[CubeRange];
    Point<1> p, b;
    // number of packets is fixed at run time
    RectDomain<1> PacketRange = [1:runtime_def_num_packets];
    // elements handled per packet
    RectDomain<1> EachRange = [1:(max-min)/runtime_def_num_packets];
    Pipelined_loop (b in PacketRange) {
      Foreach (p in EachRange) {
        InputData[p].ISO_SurfaceTriangles(iso_value, …);
      }
      … …
    }
  }
}
for (int i = min; i < max-1; i++) { // operate on InputData[i] }

Pipelined_loop (b in PacketRange) {
  0. foreach ( … ) { … }
  1. foreach ( … ) { … }
  … …
  n-1. S;
}
Merge

RectDomain<1> PacketRange = [1:4];
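The packet structure of a Pipelined_loop can be mimicked in ordinary Python. This is a sequential sketch under assumptions: `split_into_packets` and `pipelined_loop` are hypothetical helpers, each stage is applied element by element like a foreach, and the merge step is passed in by the caller. In the real system, different packets would occupy different stages at the same time.

```python
def split_into_packets(lo, hi, num_packets):
    """Split the index range [lo, hi) into num_packets contiguous packets."""
    size, extra = divmod(hi - lo, num_packets)
    packets, start = [], lo
    for k in range(num_packets):
        end = start + size + (1 if k < extra else 0)
        packets.append(range(start, end))
        start = end
    return packets

def pipelined_loop(data, num_packets, stages, merge):
    """Run every packet through all stages, then merge the partial results."""
    partials = []
    for packet in split_into_packets(0, len(data), num_packets):
        values = [data[i] for i in packet]
        for stage in stages:                      # statements 0 .. n-1 of the loop body
            values = [stage(v) for v in values]   # each one behaves like a foreach
        partials.append(values)
    return merge(partials)

# Hypothetical use with 4 packets, matching PacketRange = [1:4].
sizes = [len(p) for p in split_into_packets(0, 10, 4)]
total = pipelined_loop(list(range(10)), 4,
                       stages=[lambda v: v + 1, lambda v: v * v],
                       merge=lambda parts: sum(sum(p) for p in parts))
```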
Overview of the Challenges for the Compiler
Filter Decomposition
– To identify the candidate filter boundaries
– Compute communication volume between two consecutive filters
– Cost Model
– Determine a mapping from computations in a loop to processing units in a pipeline
Filter Code Generation
Compute Required Communication
ReqComm(b) = the set of values that need to be communicated through boundary b
Cons(B) = the set of variables that are used in B, not defined in B
Gens(B) = the set of variables that are defined in B, still alive at the end of B
ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B)
[Figure: code block B between boundary b2 (before B) and boundary b1 (after B)]
Filter Decomposition
[Figure: computing units C1, C2, …, Cm-1, Cm connected by links L1, …, Lm-1; computations f1, f2, …, fn, fn+1 separated by candidate boundaries b1, …, bn]
Goal: Find a mapping Li → bj that minimizes the predicted execution time, where 1 ≤ i ≤ m-1, 1 ≤ j ≤ n.
Intuitively, the candidate filter boundary bj is inserted between computing units Ci and Ci+1.
Exhaustive search: (n+m-1 choose m-1) candidate mappings
Filter Decomposition: Dynamic Programming
[Figure: placing fn and fn+1 on computing units Cm-2, Cm-1, Cm connected by links Lm-2, Lm-1]
T[i,j]: min cost of doing computations f1, …, fi on computing units C1, …, Cj, where the results of fi are on Cj.
T[i,j] = min { T[i-1,j] + Cost_comp(P(Cj), Task(fi)),
               T[i,j-1] + Cost_comm(B(Lj-1), Vol(fi)) }
Goal: T[n+1,m]
Cost: O(mn)
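The recurrence can be filled in as a table in O(mn) time. The sketch below rests on several assumptions that the slides do not state: Cost_comp is taken as task size divided by computing power, the communication cost as volume divided by bandwidth, the input data starts on C1 (so T[0,1] = 0), and the costs are simply added, ignoring the bottleneck term of the cost model. The function name `decompose` and the sample numbers are hypothetical.

```python
def decompose(task, vol, power, bandwidth):
    """Fill T[i][j] for filters f_1..f_{n+1} and units C_1..C_m.

    task[i-1]      : work of f_i
    vol[i]         : data volume after f_i (vol[0] is the raw input volume)
    power[j-1]     : computing power P(C_j)
    bandwidth[j-1] : bandwidth B(L_j) of the link between C_j and C_{j+1}
    Returns (T[n+1][m], unit chosen for each filter).
    """
    n1, m = len(task), len(power)
    INF = float("inf")
    T = [[INF] * (m + 1) for _ in range(n1 + 1)]
    T[0][1] = 0.0  # assumption: the input initially resides on C_1
    for i in range(n1 + 1):
        for j in range(1, m + 1):
            if i == 0 and j == 1:
                continue
            # run f_i on C_j, or ship the output of f_i over link L_{j-1}
            stay = T[i - 1][j] + task[i - 1] / power[j - 1] if i > 0 else INF
            move = T[i][j - 1] + vol[i] / bandwidth[j - 2] if j > 1 else INF
            T[i][j] = min(stay, move)
    # Walk back through the table to recover where each filter runs.
    placement = [0] * n1
    i, j = n1, m
    while i > 0 or j > 1:
        stay = T[i - 1][j] + task[i - 1] / power[j - 1] if i > 0 else INF
        if i > 0 and T[i][j] == stay:
            placement[i - 1] = j
            i -= 1
        else:
            j -= 1
    return T[n1][m], placement

# Hypothetical instance: 3 filters, 3 units, the middle unit is twice as fast.
cost, placement = decompose(task=[4, 8, 2], vol=[100, 10, 10, 5],
                            power=[1, 2, 1], bandwidth=[10, 10])
```

In this instance the first filter stays next to the data because it shrinks the volume from 100 to 10 before anything crosses a link, and the remaining filters run on the faster middle unit.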
Code Generation
Abstraction of the work each filter does
– Read in a buffer of data from input stream
– Iterate over the set of data
– Write out the results to output stream
Code generation issues
– How to get the Cons(b) from the input stream --- unpacking data
– How to organize the output data for the successive filter --- packing data
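The read, iterate, write pattern of a generated filter can be sketched as follows. The names `unpack`, `pack`, and `filter_step` are hypothetical, and the buffer is assumed to hold little-endian 8-byte floats; the slides do not specify a wire format.

```python
import struct

def unpack(buf):
    """Unpacking: turn the bytes of an input buffer into a list of values."""
    n = len(buf) // 8
    return list(struct.unpack(f"<{n}d", buf))

def pack(values):
    """Packing: lay the results out as bytes for the next filter."""
    return struct.pack(f"<{len(values)}d", *values)

def filter_step(in_buf, op):
    """One filter invocation: read a buffer, iterate over it, write the results."""
    data = unpack(in_buf)
    results = [op(v) for v in data]
    return pack(results)

# Hypothetical use: square every value in one buffer.
out_buf = filter_step(pack([1.0, 2.0, 3.0]), lambda v: v * v)
```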
Experimental Results
Goal
– To show Compiler-generated code is efficient
Configurations
# data sites --- # computing sites --- user machine
– 1-1-1
– 2-2-1
– 4-4-1
– width of a pipeline
[Figure: the three configurations, each drawn as data and compute nodes feeding one user machine]
Experimental Results
Versions
– Default version
    Site hosting the data only reads and transmits data, no processing at all
    User's desktop only views the results, no processing at all
    All the work is done by the computing nodes
    (Computing nodes workload heavy; communication volume high)
– Compiler-generated version
    Intelligent decomposition is done by the compiler
    More computations are performed on the end nodes to reduce the communication volume
    (Workload balanced between each node; communication volume reduced)
– Manual version
    Hand-written DataCutter filters with similar decomposition as the compiler-generated version
Experimental Results: ISO-Surface Rendering
[Figure: two bar charts comparing Decomp and Default; x-axis: width of pipeline (1, 2, 4). Left: small dataset, 150M. Right: large dataset, 600M]
Speedup (width 2, width 4): small dataset 1.92, 3.34; large dataset 1.99, 3.82
20% improvement over default version
Experimental Results: KNN
[Figure: two bar charts comparing Decomp, Manual, and Default; x-axis: width of pipeline (1, 2, 4). Left: K = 3, 108M. Right: K = 200, 108M]
Speedup (width 2, width 4): K = 3: 1.89, 3.38; K = 200: 1.87, 3.82
>150% improvement over default version
Experimental Results: Virtual Microscope
[Figure: two bar charts comparing Decomp, Manual, and Default; x-axis: width of pipeline (1, 2, 4). Left: small query, 800M, 512*512. Right: large query, 800M, 2048*2048]
≈40% improvement over default version
Experimental Results
Summary
– The compiler-decomposed versions achieve an improvement between 10% and 150% over default versions
– In most cases, increasing the width of the pipeline results in near-linear speedup
– Compared with the manual version, the compiler-decomposed versions are generally quite close
Related Work
No previous work on language & compiler support for CGPP
StreamIt (MIT)
– Targets streaming applications
– A language for communication-exposed architectures
– A compiler performs stream-specific optimizations
– Lower-level language interface
– Targets a different architecture
Ziegler et al (USC/ISI)
– Target pipelined FPGA architectures
– Consider different granularity of communication between FPGAs
Related Work
Run-time support for CGPP
– Stampede (Georgia Tech)
    Multimedia applications, support is in the form of cluster-wide threads and shared objects
– Yang et al (Penn State)
    Scheduler for vision applications, executed in a pipelined fashion within a cluster
– Remos (CMU)
    Resource monitoring system for network-aware applications to get info. about execution environment
– Active Stream (Georgia Tech)
    A middleware approach for distributed applications
Future Work & Conclusion
Future Work
– Buffer size optimization
– Cost model refinement & implementation
– More applications
– More realistic environment settings: resource dynamically available --- compiler directed adaptation
Conclusion
– Coarse-Grained Pipelined Parallelism is desirable & feasible
– Coarse-Grained Pipelined Parallelism needs language & compiler support
– An algorithm for required communication analysis is given
– A dynamic programming algorithm for filter decomposition is developed
– A cost model is designed
– Results of detailed evaluation of our compiler are encouraging
Thank you !!!
Cost Model
– A sequence of m computing units, C1, …, Cm, with computing powers P(C1), …, P(Cm)
– A sequence of m-1 network links, L1, …, Lm-1, with bandwidths B(L1), …, B(Lm-1)
– A sequence of n candidate filter boundaries b1, …, bn
[Figure: pipeline of computing units C1, C2, C3 connected by links L1, L2]
Say L2 is the bottleneck stage:
T = T(C1)+T(L1)+T(C2)+N*T(L2)+T(C3)
Identify the Candidate Filter Boundaries
Three types of candidate boundaries
– Start & end of a foreach loop
– Conditional statement
    if ( point[p].inRange(high, low) ) {
      local_KNN(point[p]);
    }
– Start & end of a function call within a foreach loop
Any non-foreach loop must be completely inside a single filter
Coarse-Grained Pipelined Parallelism is Desirable & Feasible
A new class of data-intensive applications
– scientific data analysis
– data mining
– data visualization
– image analysis
– and more …
Two direct ways to implement such applications
– Downloading all the data to user's machine
– Computing at the data repository
Compute Required Communication
ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B)
[Figure: two consecutive code blocks delimited by boundaries b2 (top), b1, and b0 (bottom)]
b2: ReqComm(b2) = {A}
    block: Cons = {A}, Gens = {X, Y}
b1: ReqComm(b1) = {X, Y}
    block: Cons = {X, Y}, Gens = {Z}
b0: ReqComm(b0) = { }
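The backward propagation in this example can be checked with a few lines of set arithmetic. The helper names `req_comm` and `propagate` are hypothetical; the Cons and Gens sets are the ones from the slide, and "–" and "+" are read as set difference and set union.

```python
def req_comm(after, cons, gens):
    """ReqComm before a block = (ReqComm after it - Gens(B)) + Cons(B)."""
    return (after - gens) | cons

def propagate(blocks, at_end=frozenset()):
    """blocks: (Cons, Gens) pairs in program order.

    Returns ReqComm at every boundary, starting from the last boundary
    and moving backward to the first.
    """
    result = [set(at_end)]
    for cons, gens in reversed(blocks):
        result.append(req_comm(result[-1], cons, gens))
    return result

# The two blocks from the slide, in program order.
blocks = [({"A"}, {"X", "Y"}),
          ({"X", "Y"}, {"Z"})]
b0, b1, b2 = propagate(blocks)
```

With these sets, `b0`, `b1`, and `b2` come out as the empty set, {X, Y}, and {A}, matching the values annotated on the slide.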
Compute Required Communication
Cons(B) = the set of variables that are used in B, not defined in B
Gens(B) = the set of variables that are defined in B, still alive at the end of B

Statement       Cons(s)   Gens(s)   Cons(B)   Gens(B)
Z = A + 48      A         Z         A         X, Z
If Z > 0        Z                   Z, A      X
  Y = Z * A     Z, A                Z, A      X
X = Z + A       Z, A      X         Z, A      X
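The block-level sets in this example can be reproduced by folding the per-statement sets from the last statement backward. The combination rule used here is an assumption, not something the slides spell out: Cons of a statement followed by a block is Cons(s) plus whatever the block consumes that s does not generate, and Gens simply accumulates. The function name `block_sets` is hypothetical.

```python
def block_sets(statements):
    """statements: (Cons(s), Gens(s)) pairs in program order.

    Returns, for each statement, (Cons(B), Gens(B)) of the block that
    starts at that statement and runs to the end.
    """
    cons, gens = set(), set()
    table = []
    for s_cons, s_gens in reversed(statements):
        cons = s_cons | (cons - s_gens)  # later uses satisfied by s are not consumed
        gens = gens | s_gens             # simplification: generated values stay alive
        table.append((set(cons), set(gens)))
    table.reverse()
    return table

# Per-statement sets as listed on the slide.
statements = [({"A"}, {"Z"}),        # Z = A + 48
              ({"Z"}, set()),        # If Z > 0
              ({"Z", "A"}, set()),   #   Y = Z * A
              ({"Z", "A"}, {"X"})]   # X = Z + A
table = block_sets(statements)
```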
Code Generation
Two ways to organize data in a buffer
– Instance-wise
– Field-wise
class C {
  int x;
  float y;
  int z;
}
Instance-wise: X Y Z X Y Z . . .
Field-wise: X X . . . Y Y . . . Z . . .
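The two buffer layouts can be made concrete with Python's `struct` module. This is a sketch under assumptions: objects of class C are modeled as `(x, y, z)` tuples, fields are 4-byte little-endian values without padding, and the function names are hypothetical.

```python
import struct

RECORD = "<ifi"                      # int x, float y, int z
RECORD_SIZE = struct.calcsize(RECORD)

def pack_instance_wise(objs):
    """Layout X Y Z X Y Z ...: all fields of one object are adjacent."""
    return b"".join(struct.pack(RECORD, x, y, z) for x, y, z in objs)

def unpack_instance_wise(buf):
    return [struct.unpack_from(RECORD, buf, off)
            for off in range(0, len(buf), RECORD_SIZE)]

def pack_field_wise(objs):
    """Layout X X ... Y Y ... Z Z ...: all values of one field are adjacent."""
    n = len(objs)
    xs = [o[0] for o in objs]
    ys = [o[1] for o in objs]
    zs = [o[2] for o in objs]
    return struct.pack(f"<{n}i{n}f{n}i", *xs, *ys, *zs)

def unpack_field_wise(buf):
    n = len(buf) // RECORD_SIZE
    flat = struct.unpack(f"<{n}i{n}f{n}i", buf)
    return list(zip(flat[:n], flat[n:2 * n], flat[2 * n:]))

objs = [(1, 1.5, 2), (3, 2.5, 4)]    # hypothetical instances of class C
iw = pack_instance_wise(objs)
fw = pack_field_wise(objs)
```

Both buffers have the same size; only the order of the bytes differs, so the choice affects how the receiving filter walks the buffer rather than how much data is sent.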
Code Generation
Ways that fields of an object are used
– In the same loop (instance-wise)
    for (int i=0; i<count; i++) {
      … = InputData[i].x + …;
      … = … + InputData[i].y;
    }
– In different loops (field-wise)
    for (int i=0; i<count; i++) {
      … = InputData[i].x + …;
    }
    for (int i=0; i<count; i++) {
      … = … + InputData[i].y;
    }
Cost Model
[Figure: time vs. stage diagram for the pipeline C1, L1, C2, L2, C3]
Say L2 is the bottleneck stage:
T = T(C1)+T(L1)+T(C2)+N*T(L2)+T(C3)
Say C2 is the bottleneck stage:
T = T(C1)+T(L1)+N*T(C2)+T(L2)+T(C3)
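Both formulas are instances of one rule: every stage contributes its time once, except the bottleneck stage, which contributes N times. The sketch below states that rule and cross-checks it against a direct simulation of N packets. It assumes every packet takes the same time at a given stage and that N is at least 1; the function names and the sample stage times are hypothetical.

```python
def predicted_time(stage_times, num_packets):
    """Sum of all stage times, with the bottleneck stage counted N times."""
    bottleneck = max(stage_times)
    return sum(stage_times) + (num_packets - 1) * bottleneck

def simulated_time(stage_times, num_packets):
    """Push packets through the stages one by one and track finish times."""
    finish = [0.0] * len(stage_times)   # when each stage last became free
    for _ in range(num_packets):
        ready = 0.0                     # when this packet leaves the previous stage
        for s, t in enumerate(stage_times):
            start = max(ready, finish[s])
            finish[s] = start + t
            ready = finish[s]
    return finish[-1]

# Stage order C1, L1, C2, L2, C3 with L2 as the bottleneck, N = 10 packets.
l2_bound = predicted_time([2, 1, 3, 5, 1], 10)
# Same pipeline shape with C2 as the bottleneck, N = 4 packets.
c2_bound = predicted_time([2, 1, 6, 3, 1], 4)
```

For the first case the formula gives 2 + 1 + 3 + 10*5 + 1 = 57, and the simulation reaches the same value.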
Experimental Results: ISO-Surface Rendering (Active Pixel Based)
[Figure: two bar charts comparing Decomp and Default; x-axis: width of pipeline (1, 2, 4). Left: small dataset, 150M. Right: large dataset, 600M]
Speedup close to linear
> 15% improvement over default version