Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How
Gagan Agrawal, Wei Du, Tahsin Kurc, Umit Catalyurek, Joel Saltz
The Ohio State University
Overall Context
NGS grant titled "An Integrated Middleware and Language/Compiler Framework for Data-Intensive Applications", funded September 2002 – August 2005.
Project Components
– Runtime Optimizations in the DataCutter System
– Compiler Optimization of DataCutter filters
– Automatic Generation of DataCutter filters
Focus of this talk
General Motivation
Language and Compiler Support for Parallelism of many forms has been explored
– Shared memory parallelism
– Instruction-level parallelism
– Distributed memory parallelism
– Multithreaded execution
Application and technology trends are making another form of parallelism desirable and feasible
– Coarse-Grained Pipelined Parallelism
Coarse-Grained Pipelined Parallelism (CGPP)
Definition
– Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units
Example: K-nearest Neighbor
Given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point (a, b, c), we want to find the K nearest neighbors of that point within R.
Pipeline stages: Range_query → Find the K-nearest neighbors
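The two stages named on this slide can be sketched as a small Python pipeline in which a generator plays the role of the first stage and feeds the second. The function names `range_query` and `k_nearest` and the sample points are assumptions made for illustration; they are not taken from the talk.

```python
import heapq
import math

def range_query(points, lo, hi):
    """Stage 1: yield only the points inside the 3-D range R = <lo, hi>."""
    for pt in points:
        if all(l <= c <= h for c, l, h in zip(pt, lo, hi)):
            yield pt

def k_nearest(candidates, query, k):
    """Stage 2: return the k candidates closest to the query point, nearest first."""
    return heapq.nsmallest(k, candidates, key=lambda pt: math.dist(pt, query))

# Illustrative data only.
points = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (5, 5, 5), (9, 9, 9)]
in_range = range_query(points, (0, 0, 0), (5, 5, 5))
result = k_nearest(in_range, (1, 1, 1), 2)
```

Because the first stage is a generator, the second stage consumes points as they are produced, which mirrors the way data would stream between pipeline stages.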
Coarse-Grained Pipelined Parallelism is Desirable & Feasible
Application scenarios
[Figure: multiple data sources connected through the Internet]
Coarse-Grained Pipelined Parallelism is Desirable & Feasible
A new class of data-intensive applications
– Scientific data analysis
– data mining
– data visualization
– image analysis
Two direct ways to implement such applications
– Downloading all the data to user's machine: often not feasible
– Computing at the data repository: usually too slow
Our belief
– A coarse-grained pipelined execution model is a good match
[Figure: data sources connected through the Internet]
Coarse-Grained Pipelined Parallelism needs Compiler Support
Computation needs to be decomposed into stages
Decomposition decisions are dependent on the execution environment
– How many computing sites are available
– How many computing cycles are available on each site
– What the available communication links are
– What the bandwidth of each link is
Code for each stage follows the same processing pattern, so it can be generated by the compiler
Shared or distributed memory parallelism needs to be exploited
High-level language and compiler support are necessary
Outline
Coarse-grained pipelined parallelism is desirable & feasible
Coarse-grained pipelined parallelism needs high-level language & compiler support
An entire picture of the system
DataCutter runtime system & language dialect
Overview of the challenges for the compiler
Compiler Techniques
Experimental results
Related work
Future work & Conclusions
An Entire Picture
[Figure: Java Dialect → Compiler Support (Decomposition, Code Generation) → DataCutter Runtime System]
DataCutter Runtime System
Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al.)
Targets a distributed, heterogeneous environment
Allows decomposition of application-specific data processing operations into a set of interacting processes
Provides a specific low-level interface
– filter
– stream
– layout & placement
[Figure: filter1 → stream → filter2 → stream → filter3]
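To make the filter and stream abstraction concrete, here is a minimal sketch that connects three filters with queues and runs each filter in its own thread. It is not the DataCutter API: `run_filter`, `run_pipeline`, the `END` sentinel, and the three example filters are all assumptions of this sketch.

```python
import queue
import threading

END = object()  # end-of-stream marker (an assumption of this sketch)

def run_filter(work, inp, out):
    """Read buffers from the input stream, apply `work`, write to the output stream."""
    while True:
        buf = inp.get()
        if buf is END:
            out.put(END)  # pass end-of-stream on to the next filter
            return
        out.put(work(buf))

def run_pipeline(stages, buffers):
    """Connect the stage functions with streams (queues); one thread per filter."""
    streams = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_filter, args=(f, streams[i], streams[i + 1]))
        for i, f in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for buf in buffers:
        streams[0].put(buf)
    streams[0].put(END)
    results = []
    while True:
        item = streams[-1].get()
        if item is END:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

out = run_pipeline(
    [
        lambda b: [x for x in b if x % 2 == 0],  # filter1: select
        lambda b: [x * x for x in b],            # filter2: transform
        sum,                                     # filter3: reduce each buffer
    ],
    [[1, 2, 3, 4], [5, 6, 7, 8]],
)
```

While filter2 works on the first buffer, filter1 can already process the second one; that overlap is the pipelined parallelism the talk is about.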
Language Dialect
Goal
– to give the compiler information about independent collections of objects, parallel loops and reduction operations, and pipelined parallelism
Extensions of Java
– Pipelined_loop
– Domain & Rectdomain
– Foreach loop
– reduction variables
ISO-Surface Extraction Example Code
  public class isosurface {
    public static void main(String arg[]) {
      float iso_value;
      RectDomain<1> CubeRange = [min:max];
      CUBE[1d] InputData = new CUBE[CubeRange];
      Point<1> p, b;
      RectDomain<1> PacketRange = [1:runtime_def_num_packets];
      RectDomain<1> EachRange = [1:(max-min)/runtime_def_num_packets];
      Pipelined_loop (b in PacketRange) {
        Foreach (p in EachRange) {
          InputData[p].ISO_SurfaceTriangles(iso_value,…);
        }
        … …
      }
    }
  }
Conventional loop over the same data:
  for (int i = min; i < max-1; i++) { // operate on InputData[i] }
General structure of a Pipelined_loop:
  Pipelined_loop (b in PacketRange) {
    0. foreach ( … ) { … }
    1. foreach ( … ) { … }
    … …
    n-1. S;
  }
  Merge
Example packet range:
  RectDomain<1> PacketRange = [1:4];
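The following sketch shows one plausible sequential reading of a pipelined loop: the data is split into packets, each packet passes through the stages in order, and the per-packet results are merged at the end. The helper names, the packet-splitting rule, and the two example stages are assumptions of this sketch, not the semantics defined by the dialect.

```python
def split_into_packets(lo, hi, num_packets):
    """Split the index range [lo, hi) into num_packets contiguous packets."""
    size, extra = divmod(hi - lo, num_packets)
    packets, start = [], lo
    for b in range(num_packets):
        end = start + size + (1 if b < extra else 0)
        packets.append(range(start, end))
        start = end
    return packets

def pipelined_loop(data, packets, stages, merge):
    """Every packet goes through every stage in order; partial results are merged."""
    partials = []
    for packet in packets:                       # Pipelined_loop (b in PacketRange)
        values = [data[p] for p in packet]       # elements covered by this packet
        for stage in stages:                     # stage 0 .. stage n-1
            values = [stage(v) for v in values]  # foreach over the packet
        partials.append(values)
    return merge(partials)                       # Merge

data = list(range(10))
packets = split_into_packets(0, len(data), 4)    # four packets, as in PacketRange = [1:4]
total = pipelined_loop(
    data,
    packets,
    [lambda v: v + 1, lambda v: v * 2],
    merge=lambda parts: sum(sum(p) for p in parts),
)
```

In a pipelined execution, stage 1 of packet 1 would overlap with stage 0 of packet 2; the sequential version above only fixes what result that execution has to produce.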
Overview of the Challenges for the Compiler
Filter Decomposition
– To identify the candidate filter boundaries
– Compute communication volume between two consecutive filters
– Cost Model
– Compute a mapping from computations in a loop to computing units in a pipeline
Filter Code Generation
Identify the Candidate Filter Boundaries
Three types of candidate boundaries
– Start & end of a foreach loop
– Conditional statement
    if ( point[p].inRange(high, low) ) {
      local_KNN(point[p]);
    }
– Start & end of a function call within a foreach loop
Any non-foreach loop must be completely inside a single filter
Compute Required Communication
ReqComm(b) = the set of values that need to be communicated through boundary b
Cons(B) = the set of variables that are used in B but not defined in B
Gens(B) = the set of variables that are defined in B and still alive at the end of B
ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B), where – is set difference and + is set union
[Figure: code segment B lies between boundary b2 (before B) and boundary b1 (after B)]
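The recurrence above can be applied backwards over a chain of code segments, starting from whatever must be available after the last one. In this sketch the function name `req_comm_chain`, the representation of each segment as a `(cons, gens)` pair of sets, and the example variable names are all hypothetical.

```python
def req_comm_chain(blocks, live_out):
    """Walk the segments backwards and return ReqComm at every boundary.

    blocks: list of (cons, gens) set pairs in program order.
    live_out: values required after the last segment.
    Entry i of the result is ReqComm at the boundary just before blocks[i];
    the last entry is live_out itself.
    """
    req = [set(live_out)]
    for cons, gens in reversed(blocks):
        # ReqComm(b2) = ReqComm(b1) - Gens(B) + Cons(B)
        req.append((req[-1] - gens) | cons)
    req.reverse()
    return req

# Hypothetical segments loosely modelled on an isosurface pipeline.
blocks = [
    ({"data", "iso_value"}, {"cubes"}),
    ({"cubes", "iso_value"}, {"triangles"}),
    ({"triangles", "view"}, {"image"}),
]
boundaries = req_comm_chain(blocks, live_out={"image"})
```

The size of each set in `boundaries` is what a cost model would turn into a communication volume for a candidate filter boundary.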
Cost Model
– A sequence of m computing units, C1, …, Cm, with computing powers P(C1), …, P(Cm)
– A sequence of m-1 network links, L1, …, Lm-1, with bandwidths B(L1), …, B(Lm-1)
– A sequence of n candidate filter boundaries b1, …, bn
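Cost Model
[Figure: time versus stage diagram for a pipeline of computing units C1, C2, C3 and links L1, L2; stages in order C1, L1, C2, L2, C3]
If L2 is the bottleneck stage:
T = T(C1) + T(L1) + T(C2) + N*T(L2) + T(C3)
If C2 is the bottleneck stage:
T = T(C1) + T(L1) + N*T(C2) + T(L2) + T(C3)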
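Both formulas follow one pattern: the bottleneck stage is paid N times and every other stage once. The sketch below encodes that pattern. It assumes that N is the number of packets pushed through the pipeline, which the slide does not state, and the stage times in the example are made-up numbers.

```python
def predicted_time(stage_times, num_packets):
    """Bottleneck stage counted num_packets times, every other stage once.

    stage_times: per-packet time of each stage in pipeline order,
    for example [T(C1), T(L1), T(C2), T(L2), T(C3)].
    """
    bottleneck = max(stage_times)
    return sum(stage_times) + (num_packets - 1) * bottleneck

# L2 (time 5.0) is the bottleneck: 2 + 1 + 3 + 10*5 + 1
t = predicted_time([2.0, 1.0, 3.0, 5.0, 1.0], num_packets=10)
```

Writing the total as "sum of all stages plus (N - 1) times the slowest stage" avoids a separate case for each possible bottleneck.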
Filter Decomposition
[Figure: filters f1, …, fn+1 separated by candidate boundaries b1, …, bn, to be mapped onto computing units C1, …, Cm connected by links L1, …, Lm-1]
Goal: find a mapping Li → bj that minimizes the predicted execution time, where 1 ≤ i ≤ m-1, 1 ≤ j ≤ n.
Intuitively, the candidate filter boundary bj is inserted between computing units Ci and Ci+1.
Exhaustive search: $\binom{n+1+m-1}{m-1}$ possible mappings
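A brute-force version of this search can be sketched by enumerating every way of cutting the chain of n+1 filters into m consecutive (possibly empty) groups, which gives the binomial coefficient "n+1+m-1 choose m-1" placements. Several things here are assumptions of the sketch rather than content of the talk: units may be left without filters, a stage time is work divided by computing power or volume divided by bandwidth, the total follows the bottleneck pattern of the cost model with N taken as the number of packets, and all numbers are invented.

```python
from itertools import combinations_with_replacement
from math import comb

def predicted_time(times, num_packets):
    """Bottleneck stage counted once per packet, every other stage once."""
    return sum(times) + (num_packets - 1) * max(times)

def stage_times(cuts, work, volume, power, bandwidth):
    """cuts[i] = number of filters placed before link i.

    volume[k] is the data volume crossing the pipeline after k filters,
    so volume has len(work) + 1 entries.
    Returns times in the order C1, L1, C2, L2, ..., Cm.
    """
    bounds = [0, *cuts, len(work)]
    times = []
    for i in range(len(power)):
        times.append(sum(work[bounds[i]:bounds[i + 1]]) / power[i])
        if i < len(bandwidth):
            times.append(volume[cuts[i]] / bandwidth[i])
    return times

def exhaustive_search(work, volume, power, bandwidth, num_packets):
    """Try every non-decreasing tuple of cut positions; return (best, tried)."""
    best, tried = None, 0
    for cuts in combinations_with_replacement(range(len(work) + 1), len(power) - 1):
        tried += 1
        t = predicted_time(stage_times(cuts, work, volume, power, bandwidth), num_packets)
        if best is None or t < best[0]:
            best = (t, cuts)
    return best, tried

work = [4, 6, 2]            # n + 1 = 3 filters
volume = [100, 40, 10, 1]   # volume after 0, 1, 2, 3 filters
power = [1, 2, 1]           # m = 3 computing units
bandwidth = [10, 10]        # m - 1 = 2 links
best, tried = exhaustive_search(work, volume, power, bandwidth, num_packets=10)
```

Even for this toy instance the search evaluates ten placements, and the count grows quickly with n and m, which motivates a cheaper heuristic.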
Filter Decomposition: A Greedy Algorithm
[Figure: filters f1, …, f5 with candidate boundaries b1, …, b4, to be mapped onto computing units C1, …, C4 connected by links L1, L2, L3]
To minimize the predicted execution time, estimate the cost of each choice for L1:
– L1 to b1: T1
– L1 to b2: T2
– L1 to b3: T3
– L1 to b4: T4
Min{T1 … T4} = T2, so L1 is mapped to b2 and f1, f2 are placed on C1
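One way to read the greedy step on this slide is: fix the links one at a time, starting with L1, and for each link pick the boundary with the lowest estimated cost. The slides do not say how a partial placement is costed, so this sketch assumes that all filters not yet placed are provisionally put on the next computing unit. The cost functions, the treatment of N as the number of packets, and the numbers are likewise assumptions.

```python
def predicted_time(times, num_packets):
    """Bottleneck stage counted once per packet, every other stage once."""
    return sum(times) + (num_packets - 1) * max(times)

def stage_times(cuts, work, volume, power, bandwidth):
    """cuts[i] = number of filters placed before link i; volume[k] = volume after k filters."""
    bounds = [0, *cuts, len(work)]
    times = []
    for i in range(len(power)):
        times.append(sum(work[bounds[i]:bounds[i + 1]]) / power[i])
        if i < len(bandwidth):
            times.append(volume[cuts[i]] / bandwidth[i])
    return times

def greedy_decomposition(work, volume, power, bandwidth, num_packets):
    """Fix the cut for L1, then L2, ..., each time taking the cheapest estimate."""
    n_filters, n_links = len(work), len(bandwidth)
    chosen = []
    for i in range(n_links):
        start = chosen[-1] if chosen else 0
        # Assumption: remaining links are provisionally placed after the last filter,
        # so every unplaced filter runs on the unit right after link i.
        rest = [n_filters] * (n_links - i - 1)
        best_cut, best_t = None, None
        for c in range(start, n_filters + 1):
            trial = chosen + [c] + rest
            t = predicted_time(stage_times(trial, work, volume, power, bandwidth), num_packets)
            if best_t is None or t < best_t:
                best_cut, best_t = c, t
        chosen.append(best_cut)
    final = predicted_time(stage_times(chosen, work, volume, power, bandwidth), num_packets)
    return tuple(chosen), final

cuts, t = greedy_decomposition(
    work=[4, 6, 2], volume=[100, 40, 10, 1], power=[1, 2, 1],
    bandwidth=[10, 10], num_packets=10,
)
```

The greedy pass evaluates at most (m-1)(n+2) placements instead of the full binomial count. On this toy input it returns the cuts (1, 3), but a greedy choice is not guaranteed to be optimal in general.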
Code Generation
Abstraction of the work each filter does
– Read in a buffer of data from input stream
– Iterate over the set of data
– Write out the results to output stream
Code generation issues
– How to get the Cons(b) from the input stream: unpacking data
– How to organize the output data for the successive filter: packing data
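The read, iterate, write pattern and the packing and unpacking steps can be illustrated with a small sketch. The record layout (three little-endian floats per point), the function names, and the range test used as the filter's work are assumptions chosen for illustration; they do not show what the compiler actually emits.

```python
import struct

# Assumed record layout: three little-endian 32-bit floats (x, y, z).
RECORD = struct.Struct("<3f")

def unpack_buffer(buf):
    """Unpack an input-stream buffer into a list of (x, y, z) records."""
    return list(RECORD.iter_unpack(buf))

def pack_buffer(records):
    """Pack records into one contiguous buffer for the next filter."""
    return b"".join(RECORD.pack(*rec) for rec in records)

def filter_body(in_buf, lo, hi):
    """Read a buffer, iterate over its records, write the results out."""
    kept = [
        rec for rec in unpack_buffer(in_buf)
        if all(l <= c <= h for c, l, h in zip(rec, lo, hi))
    ]
    return pack_buffer(kept)

in_buf = pack_buffer([(0.5, 0.5, 0.5), (2.0, 2.0, 2.0), (0.25, 0.75, 1.0)])
out_buf = filter_body(in_buf, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
```

Only the fields that the next filter needs would have to be packed, which is where the required-communication analysis feeds into code generation.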
Experimental Results
Goal
– To show compiler-generated code is efficient
Environment settings
– 700 MHz Pentium machines
– Connected through Myrinet LANai 7.0
Configurations (# data sites --- # computing sites --- user machine)
– 1-1-1
– 2-2-1
– 4-4-1
Experimental Results
Versions
– Default version
  Site hosting the data only reads and transmits data, no processing at all
  User's desktop only views the results, no processing at all
  All the work is done by the compute nodes
  Result: computing nodes' workload is heavy, communication volume is high
– Compiler-generated version
  Intelligent decomposition is done by the compiler
  More computations are performed on the end nodes to reduce the communication volume
  Result: workload is balanced between the nodes, communication volume is reduced
– Manual version
  Hand-written DataCutter filters with similar decomposition as the compiler-generated version
Experimental Results: ISO-Surface Rendering (Z-Buffer Based)
[Figure: two bar charts comparing the Decomp and Default versions; x-axis: width of pipeline (1, 2, 4); panels: small dataset (150M) and large dataset (600M)]
Speedup at widths 2 and 4: 1.92, 3.34 (small dataset); 1.99, 3.82 (large dataset)
20% improvement over default version
Experimental Results: ISO-Surface Rendering (Active Pixel Based)
[Figure: two bar charts comparing the Decomp and Default versions; x-axis: width of pipeline (1, 2, 4); panels: small dataset (150M) and large dataset (600M)]
Speedup close to linear
> 15% improvement over default version
Experimental Results: KNN
[Figure: two bar charts comparing the Decomp, Manual and Default versions; x-axis: width of pipeline (1, 2, 4); panels: K = 3 (108M) and K = 200 (108M)]
Speedup at widths 2 and 4: 1.89, 3.38 (K = 3); 1.87, 3.82 (K = 200)
> 150% improvement over default version
Experimental Results: Virtual Microscope
[Figure: two bar charts comparing the Decomp, Manual and Default versions; x-axis: width of pipeline (1, 2, 4); panels: small query (800M, 512*512) and large query (800M, 2048*2048)]
≈ 40% improvement over default version
Experimental Results
Summary
– The compiler-decomposed versions achieve an improvement between 10% and 150% over default versions
– In most cases, increasing the width of the pipeline results in near-linear speedup
– Compared with the manual version, the compiler-decomposed versions are generally quite close
Ongoing and Future Work
Buffer size optimization
Cost model refinement & implementation
More applications
More realistic environment settings: resources dynamically available
Conclusions
Coarse-Grained Pipelined Parallelism is desirable & feasible
Coarse-Grained Pipelined Parallelism needs language & compiler support
An algorithm for required communication analysis is given
A greedy algorithm for filter decomposition is developed
A cost model is designed
Results of detailed evaluation of our compiler are encouraging