Sputnik - University of California, Davis · thesispresentation.ppt Author: Sean Peisert Created...
Sputnik: Automated Decomposition on
Heterogeneous Clusters of Multiprocessors

Sean Philip Peisert
[email protected]
http://www.sdsc.edu/~peisert
March 20, 2000
Photo courtesy of NASA.
Copyright © 2000 Sean Philip Peisert
Thesis Committee
• Scott B. Baden, advisor
• Larry Carter
• Jeanne Ferrante
• Sidney Karin
Motivation
• Supercomputers in science are evolving: fewer and fewer are vector machines and mainframe multicomputers. Most are now clusters of multiprocessors.
• A multiprocessor is a shared-memory machine, whereas a multicomputer is a “shared-nothing,” or distributed-memory, machine.
Multiprocessor
[Diagram: four processors (0-3), each with its own L1 and L2 cache, connected by a bus to shared main memory.]
Clusters of Multiprocessors
• Multiprocessors are built largely from component parts. They are also very modular.
• It is easy to upgrade a portion of the nodes in the cluster with new nodes.
• Having nodes of different speeds in the cluster means that programs have to be written differently.
Photo courtesy of SGI.
Heterogeneous Cluster of Multiprocessors
Heterogeneity Problem
• A parallel program will only run as fast as the slowest node.
• For example, if one adds new nodes to a cluster that run faster than the existing nodes, the new nodes will finish first and idle until the slower nodes finish.
Heterogeneity Problem
• Idle processors waste processing power that could be spent processing data.
• The program is not performing optimally and could run faster.
• The machine is not being used efficiently.
Related Work
• Fink and Baden: KeLP2
• Foster and Karonis: Grid-Enabled MPI
• Anglano, Schopf, Wolski and Berman: Zoom
• Crandall and Quinn: Decomposition Advisory System
• Wolski, Spring and Peterson: Network Weather Service
Heterogeneity Solution
• Optimize the program individually for each separate node, based on prior information about each node.
• This is not easy.
Goals of Sputnik
• Allow a programmer to write software for a heterogeneous cluster as if the cluster were homogeneous; in other words, without adding much more complexity.
• Improve the performance of the program being run and the utilization and efficiency of the cluster.
The Sputnik Model
• A two-stage process for optimizing performance on a heterogeneous cluster:
• ClusterDiscovery
• ClusterOptimization
ClusterDiscovery
• Performs a “resource discovery” — a search of a defined parameter space — to understand how the application in question runs on each individual node in the cluster.
• Runs the kernel repeatedly inside a “shell” to determine the best-performing optimizations.
ClusterDiscovery
• Saves the best optimization data inside a file for future use.
ClusterOptimization
• Makes the specific optimizations for each node based on what the first stage has discovered.
• Some possible optimizations include: adjusting the number of threads per node, cache tiling, data partitioning, and machine and data “class” optimization.
The Sputnik API
• Focuses on just a few specific optimizations.
• The Sputnik API is built on top of KeLP, which is used for data description and inter-node communication.
• OpenMP is used for intra-node parallelism.
ClusterDiscovery (API)
• Instead of searching the entire parameter space for possible optimizations, I focused on two:
– Adjusting the number of OpenMP threads.
– Repartitioning the data.
ClusterDiscovery (API)
• Runs the kernel repeatedly with different numbers of threads on each node. When it finds the optimal number of threads for a given node, it saves the timing for that node.
• The saved timing is compared with the other timings in the cluster, and ratios are formed.
ClusterOptimization (API)
• The ratios from ClusterDiscovery are used to re-partition the data so that the fraction of the data each node works on matches that node’s share of the cluster’s total power.
ClusterOptimization (API)
[Diagram: nodes 0, 1, and 2 with equal-sized blocks under the original partitioning; under the new partitioning, node 0’s block is twice the size of the others.]
Example: Node 0 runs twice as fast as node 1 and node 2.
API Limitations/Assumptions
• One-dimensional decomposition.
• Confined to two tiers of parallelism.
• Assumes that a node given a smaller chunk of data will run at the same MFLOPS rate as with the original-size chunk.
• The problems do not fit into the highest level of cache.
API Limitations/Assumptions
• Assumes no node is less than half as fast as any other node.
main()
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);   // Initialize MPI
    InitKeLP(argc, argv);     // Initialize KeLP

    // Call Sputnik's main routine, which in turn
    // will then call SputnikMain().
    SputnikGo(argc, argv);

    MPI_Finalize();           // Close off MPI
    return 0;
}
SputnikGo()
while (i < MAX_THREADS &&
       time[last iteration] < time[second-to-last iteration]) {
    omp_set_num_threads(i);
    time[i] = SputnikMain(argc, argv, NULL);
    i = i * 2;
}

i = iteration before the best we found in the previous loop;

while (time[last iteration] < time[second-to-last iteration]) {
    omp_set_num_threads(i);
    time[i] = SputnikMain(argc, argv, NULL);
    i = i + 1;
}

omp_set_num_threads(optimal number);
time[i] = SputnikMain(argc, argv, bestTimes);
SputnikMain()
double SputnikMain(int argc, char **argv, double *SputnikTimes) {
    double start, finish;
    ...
    <declarations, initializations>
    ...
    start = MPI_Wtime();    // start timing
    kernel();               // call the kernel function
    finish = MPI_Wtime();   // finish timing
    ...
    return finish - start;
}
Application Study
• The purpose of the experiment is to determine the effect of these optimizations.
• The kernel solves Poisson’s equation using the Gauss-Seidel method with red-black ordering.
!$OMP PARALLEL PRIVATE(jj,ii,k,j,i,jk)
      do jj = ul1+1, uh1-1, sj
        do ii = ul0+1, uh0-1, si
!$OMP DO SCHEDULE(STATIC)
          do k = ul2+1, uh2-1
            do j = jj, min(jj+sj-1,uh1-1)
              jk = mod(j+k,2)
              do i = ii+jk, min(ii+jk+si-1,uh0-1), 2
                u(i,j,k) = c *
     2            ((u(i-1,j,k) + u(i+1,j,k)) + (u(i,j-1,k) +
     3            u(i,j+1,k)) + (u(i,j,k+1) + u(i,j,k-1) -
     4            c2*rhs(i,j,k)))
              end do
            end do
          end do
!$OMP END DO
        end do
      end do
!$OMP END PARALLEL
Image courtesy of SGI.
Computing Hardware: SGI Origin2000s

balder.ncsa.uiuc.edu
• 256 250-MHz R10000 processors
• 128 GB main memory

aegir.ncsa.uiuc.edu
• 128 250-MHz R10000 processors
• 64 GB main memory
Virtual Cluster
• The Sputnik API is designed for commodity clusters. None were available, so a pair of SGI Origin 2000s at NCSA was used.
• The API allows the number of OpenMP threads to be set manually.
• Different numbers of threads were used on each Origin to simulate heterogeneity.
Predicted Time
redblack3D - 48 threads on balder
[Chart: time (seconds, 0-70) vs. threads on aegir (24, 30, 36, 42, 48). Series: balder Original Computation, aegir Original Computation, balder New Computation, aegir New Computation, Predicted Computation.]
redblack3D Speedup with 48 threads on balder
[Chart: speedup (0-1.6) vs. threads on aegir (24, 30, 36, 42, 48). Series: Speedup, Theoretical Speedup.]
redblack3D with 32 threads on balder
[Chart: time (seconds, 0-100) vs. number of threads on aegir (16, 20, 24, 28, 32). Series: balder Original Computation, aegir Original Computation, balder New Computation, aegir New Computation, Predicted Time.]
redblack3D Speedup with 32 threads on balder
[Chart: speedup (0-1.6) vs. number of threads on aegir (16, 20, 24, 28, 32). Series: Speedup, Theoretical Speedup.]
Validation
• Results for the API indicate better than 35% improvement in the situations where balder is running twice as many threads as aegir.
• The model and the API both succeed in the goal of being easy to program and improving performance.
Anomalies
• The code demonstrated scaling, but ran several times slower than with MPI alone (without OpenMP).
• OpenMP thread binding and memory distribution are both complex issues on the Origin 2000 and are the probable causes of the slowdown.
• The real target of the Sputnik API is a commodity cluster.
Image courtesy of SGI.
Future Work
• Tests on more applications and computing hardware, especially a cluster of Sun servers.
• Dynamic repartitioning for grid/metacomputing applications.
• Supporting Phenomenally Heterogeneous Clusters (PHCs) — not just multicomputer-based clusters.
Future Work
• Different types of optimizations (not just repartitioning and adjusting the number of OpenMP threads).
Thank you to my family and friends, everyone in the UCSD Computer Science Department, the Parallel Computation Lab, SDSC, NCSA, my thesis committee, and my thesis advisor, Scott Baden.
Sputnik: Automated Decomposition on
Heterogeneous Clusters of Multiprocessors

Sean Philip Peisert
[email protected]
http://www.sdsc.edu/~peisert
March 20, 2000
Photo courtesy of NASA.