Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories...
![Page 1: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/1.jpg)
Michael Bender, SUNY Stony Brook
David Bunde, Knox College
Vitus Leung, Sandia National Laboratories
Kevin Pedretti, Sandia National Laboratories
Cynthia Phillips, Sandia National Laboratories
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.
New Experimental Results in Communication-Aware Processor Allocation for Supercomputers
![Page 2: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/2.jpg)
• Commodity-based supercomputers at Sandia National Laboratories (off-the-shelf components)
• Up to 2048 processors
• Production computing environment
• Our Job: Improve parallel node allocation on Cplant to optimize performance.
Computational Plant (Cplant)
![Page 3: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/3.jpg)
The Cplant System
• DEC Alpha processors
• Myrinet interconnect (Sandia modified)
• MPI
• Different sizes/topologies: usually 2D or 3D grid with toroidal wraps
– Ross = 2048-proc 3D mesh
– Zermatt = 128-proc 2D mesh
– Alaska = ~600, heavily-augmented 2D mesh (cannibalized).
• Modified Linux OS (now public domain)
• Four processors/switch (compute, I/O, service nodes)
![Page 4: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/4.jpg)
Scheduling Environment
• Users submit jobs to queue (online)
• Users specify number of processors and runtime estimate
– If a job runs past this estimate by 5 min, it is killed
• No preemption, no migration, no multitasking (security)
• Actual runtime depends on set of processors allocated and placement of other jobs
Goals:
• User - minimum response time
• Bureaucracy (GAO) - high utilization
![Page 5: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/5.jpg)
Scheduler/Allocator Association
Scheduler and allocator affect each other's performance.
[Diagram: Scheduler and Allocator, linked by performance dependencies]
![Page 6: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/6.jpg)
Scheduler/Allocator Dissociation
• Scheduler enforces policy
– Management sets priorities for access, utilization policy
• Allocator can optimize performance
Job: user, executable, # processors, requested time
[Diagram: Job → queue → PBS scheduler → node allocator → Cplant]
![Page 7: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/7.jpg)
What’s a Good Allocation?
Objective: Allocate jobs to processors so as to minimize network contention, which favors processor locality.
• Especially important for commodity networks
Good allocation for 2D mesh
Bad allocation for 2D mesh
![Page 8: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/8.jpg)
Quantitative Effect of Processor Locality
But, speed-up anomaly
[Figure: one allocation shown running faster than another; legend entries: "= 2", "= empty processor"]
![Page 9: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/9.jpg)
Communication Hops on a 2D grid
• L1 distance = # hops (~ # switches) between 2 processors on grid
[Figure: example grid paths labeled 5 and 4 hops]
![Page 10: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/10.jpg)
Allocation Problem
• Given n available points on grid (some unavailable)
• Find a set of k available points with minimum average (or total) L1 distance.
• Example: green allocation: 3(2) + 3(1) = 9
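The objective above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the exhaustive search is only practical for tiny instances, and the T-shaped point set is an assumed configuration that happens to reproduce the slide's arithmetic (three pairs at distance 2 and three at distance 1); the slide's actual green allocation is not recoverable from the transcript.

```python
from itertools import combinations

def l1(p, q):
    """L1 (Manhattan) distance = number of hops between two grid points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_pairwise_l1(points):
    """Sum of L1 distances over all unordered pairs of allocated points."""
    return sum(l1(p, q) for p, q in combinations(points, 2))

def best_allocation(available, k):
    """Exhaustive search over all k-subsets of the available points.

    Exponential in k; shown only to make the objective concrete.
    """
    return min(combinations(sorted(available), k), key=total_pairwise_l1)

# Assumed example: a T-shaped set of 4 processors.
t_shape = [(0, 0), (1, 0), (2, 0), (1, 1)]
print(total_pairwise_l1(t_shape))  # 3 pairs at distance 2 + 3 at distance 1 = 9

# On a fully free 3x3 grid, the best 4-processor set is a 2x2 square.
grid = {(x, y) for x in range(3) for y in range(3)}
best = best_allocation(grid, 4)
print(best, total_pairwise_l1(best))
```

Minimizing the total and minimizing the average are equivalent for fixed k, since the number of pairs is k(k-1)/2 either way.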
![Page 11: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/11.jpg)
Empirical Correlation
Leung et al, 2002
Related support: Mache and Lo, 1996
![Page 12: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/12.jpg)
Previous Work
• Various work forcing a convex set
– Insufficient processor utilization
• Mache, Lo, Windisch MC algorithm
• Krumke et al 2-approximation, NP-hard w/general metric
• Complexity open for grids
• Dispersion problem (max distance) linear time for fixed k (Fekete and Meijer)
![Page 13: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/13.jpg)
Optimal Unconstrained Shape [Bender, Bender, Demaine, Fekete 2004]
Almost a circle but not quite.
Only .05 percent difference in area.
0.650 245 952 951
![Page 14: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/14.jpg)
Previous Results (Bender et al 2005)
• 7/4-approximation ($2 - \frac{1}{2d}$ in $d$ dimensions)
• PTAS ($(1+\varepsilon)$-approximation in poly time for fixed $\varepsilon$)
• MC is a 4-approximation
• Linear-time exact dynamic program 1D
• O(n log n) time for k=3
• Simulations (performance on job streams)
![Page 15: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/15.jpg)
Experiments: Placement Algorithm MC
• Search in shell from minimum-size region of preferred shape.
• Weight processors by shells
• Return processor set with minimum weight.
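The three steps above might look roughly like the following for the 1x1 variant (MC1x1). This is a simplified sketch under stated assumptions, not the Mache, Lo, Windisch implementation: the function name is hypothetical, every free processor is tried as a 1x1 center, and a processor's weight is taken to be its shell index, modeled here as the Chebyshev distance from the center.

```python
def mc1x1(free, k):
    """Simplified MC-style allocation with a 1x1 preferred shape.

    free: set of (x, y) coordinates of free processors.
    Returns a list of k processors, or None if fewer than k are free.
    """
    if len(free) < k:
        return None
    best_cost, best_set = None, None
    for cx, cy in sorted(free):
        # Assumption: shell i holds the processors at Chebyshev distance i
        # from the center, and each processor is weighted by its shell index.
        weighted = sorted(
            (max(abs(x - cx), abs(y - cy)), (x, y)) for x, y in free
        )
        chosen = weighted[:k]                 # k lowest-weight free processors
        cost = sum(w for w, _ in chosen)
        if best_cost is None or cost < best_cost:
            best_cost, best_set = cost, [p for _, p in chosen]
    return best_set

# Usage: on an empty 4x4 mesh, a 4-processor job gets a compact 2x2 block.
free = {(x, y) for x in range(4) for y in range(4)}
print(mc1x1(free, 4))
```

Ties are broken by coordinate order here, which is an arbitrary choice made for determinism.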
![Page 16: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/16.jpg)
Alternative: One-Dimensional Reduction
• Order processors so that close in linear order ⇒ close in physical processor graph
• Consider one-dimensional processor allocation
– Bin packing (best fit, first fit, sum of squares)
– Pack jobs onto the line (or ring), allowing fragmentation
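As a rough sketch of the one-dimensional reduction, the code below orders a 2D grid along a boustrophedon ("S") curve and then allocates in rank space with first fit or best fit. All names are hypothetical, sum of squares is omitted, and the fragmentation fallback (take the k free ranks with the smallest span) is an assumption made for illustration, not necessarily the rule used on Cplant.

```python
def snake_order(width, height):
    """Boustrophedon ('S' curve) order over a width x height grid."""
    order = []
    for y in range(height):
        xs = range(width) if y % 2 == 0 else range(width - 1, -1, -1)
        order.extend((x, y) for x in xs)
    return order

def free_intervals(free_ranks):
    """Group free ranks into maximal runs of consecutive ranks [start, end]."""
    runs = []
    for r in sorted(free_ranks):
        if runs and r == runs[-1][1] + 1:
            runs[-1][1] = r
        else:
            runs.append([r, r])
    return runs

def allocate_1d(free_ranks, k, strategy="first"):
    """Pick k free ranks on the line; returns a list of ranks or None."""
    ranks = sorted(free_ranks)
    if len(ranks) < k:
        return None
    fits = [(s, e) for s, e in free_intervals(ranks) if e - s + 1 >= k]
    if fits:
        if strategy == "best":
            # Best fit: the smallest free interval that still holds the job.
            s, _ = min(fits, key=lambda run: run[1] - run[0])
        else:
            # First fit: the first free interval that holds the job.
            s, _ = fits[0]
        return list(range(s, s + k))
    # Assumed fallback when no interval fits: allow fragmentation and take
    # the window of k free ranks with the smallest span.
    i = min(range(len(ranks) - k + 1), key=lambda j: ranks[j + k - 1] - ranks[j])
    return ranks[i:i + k]

order = snake_order(4, 3)                     # rank -> (x, y)
free = {0, 1, 2, 3, 6, 7, 10}
print([order[r] for r in allocate_1d(free, 2, "best")])
```

Keeping the allocator in rank space means the curve can be swapped (S curve, Hilbert curve, and so on) without touching the packing logic.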
![Page 17: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/17.jpg)
New System Red Storm
• 12,960 Dual-Core AMD Opteron 2.4 GHz
• 39.19 TB Memory, 340 TB disk
• 124 TF peak performance
• 3D Mesh
![Page 18: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/18.jpg)
Impact
• Changed the node allocator on Cplant
– 1D default allocator
– 2D algorithms implemented
– Carried over to Red Storm system software
  • 1D and 2D algorithms implemented
  • Selectable at compilation
• R&D 100 winner (Leung, Bender, Bunde, Pedretti, Phillips 2006)
![Page 19: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/19.jpg)
Red Storm Development Machine
1 Cray XT3/4 Cabinet
I/O node Compute node
![Page 20: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/20.jpg)
Does Bandwidth Make a Difference?
Real time (seconds) User time (seconds) Sys time (seconds)
1/4 link bandwidth 15623.353 1012.302 50.298
Full bandwidth 6314.818 1010.752 50.003
• Yes!
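A quick check of the timings above makes the point concrete: real (wall-clock) time grows by roughly 2.5x at quarter bandwidth, while user and system time barely move. The snippet only restates the slide's numbers; the variable names are illustrative.

```python
# Timings in seconds, copied from the slide.
quarter = {"real": 15623.353, "user": 1012.302, "sys": 50.298}
full = {"real": 6314.818, "user": 1010.752, "sys": 50.003}

# Ratio of quarter-bandwidth time to full-bandwidth time, per category.
ratios = {name: quarter[name] / full[name] for name in quarter}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}x")
```

Because CPU time is nearly identical in both runs, the extra wall-clock time is spent waiting on the network rather than computing.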
![Page 21: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/21.jpg)
Red Storm Development Machine
YZ S Curve
I/O node Compute node
![Page 22: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/22.jpg)
Red Storm Development Machine
ZY S Curve
I/O node Compute node
![Page 23: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/23.jpg)
Hilbert (Space-Filling) Curves
• For 2D and 3D grids
• Previous applications
– I/O efficient and cache-oblivious computation
– Compression (images)
– Domain decomposition
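For reference, a 2D Hilbert ordering can be generated with the standard index-to-coordinate conversion, sketched below. This is the textbook construction for a power-of-two square grid, not the Zoltan routine or the spliced variant used on the development machine; the function name is illustrative.

```python
def hilbert_d2xy(n, d):
    """Map index d (0 <= d < n*n) to (x, y) on an n x n Hilbert curve.

    n must be a power of two.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect the sub-square before rotating it.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x                       # rotate the quadrant
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Linear order for a 4x4 grid: consecutive ranks are always one hop apart.
order = [hilbert_d2xy(4, d) for d in range(16)]
print(order)
```

The property that matters for allocation is locality: processors that are close in this linear order are also close on the grid, so a contiguous run of ranks maps to a compact region.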
![Page 24: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/24.jpg)
Red Storm Development Machine
Zoltan Hilbert-Space-Filling Curve
I/O node Compute node
![Page 25: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/25.jpg)
Red Storm Development Machine
Spliced Hilbert-Space-Filling Curve
I/O node Compute node
![Page 26: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/26.jpg)
Results (Makespan in Seconds)
YZ ZY random Zoltan spliced
MC1x1 5807.1
SS 5830.6 7003.2 6610.1 6699.6 6021.1
FF 5868.6 7039.5 6639.6 6758.7 6052.3
BF 5826.2 7022.6 6631.9 6739.1 6023.4
simple 6102.4
• Consistent with simulations (Bender et al 2005)
![Page 27: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/27.jpg)
Results (Makespan Normalized)
YZ ZY random Zoltan spliced
MC1x1 1
SS 1.0040 1.206 1.1383 1.1537 1.0369
FF 1.0106 1.2122 1.1434 1.1639 1.0422
BF 1.0033 1.2093 1.1420 1.1605 1.0372
simple 1.0509
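The normalized values appear to be each makespan divided by the MC1x1 makespan (5807.1 s). The short check below recomputes them from the raw table; the dictionary layout and names are illustrative.

```python
BASELINE = 5807.1  # MC1x1 makespan in seconds, from the raw results

CURVES = ["YZ", "ZY", "random", "Zoltan", "spliced"]
makespans = {
    "SS": [5830.6, 7003.2, 6610.1, 6699.6, 6021.1],
    "FF": [5868.6, 7039.5, 6639.6, 6758.7, 6052.3],
    "BF": [5826.2, 7022.6, 6631.9, 6739.1, 6023.4],
}

def normalize(rows, baseline):
    """Divide every makespan by the baseline and round to 4 decimals."""
    return {name: [round(v / baseline, 4) for v in vals] for name, vals in rows.items()}

normalized = normalize(makespans, BASELINE)
for name, vals in normalized.items():
    print(name, dict(zip(CURVES, vals)))
print("simple", round(6102.4 / BASELINE, 4))
```

Read this way, the 1D allocators on the YZ ordering land within about 1% of MC1x1, while the ZY ordering costs roughly 21%.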
![Page 28: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/28.jpg)
Red Storm Development Machine
Is it I/O or interprocess communication?
I/O node Compute node
![Page 29: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/29.jpg)
Results (Makespan Normalized)
YZ ZY random Zoltan spliced
BF 1 1.2053 1.1383 1.1567 1.0338
BF2 1 1.2398 1.176 1.1828 1.0443
• Not I/O
• Consistent with Cplant experiments (Leung et al 2002)
• Consistent with Pittsburgh Supercomputing Center experiments (Weisser et al 2006)
![Page 30: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/30.jpg)
Experiments: Test Set
• All-to-All Communications
Job Size Number of Jobs
2 1820
5 660
15 620
20 660
• High communication, best-case for runtime improvements
• Small number of repetitions (3)
![Page 31: Michael Bender, SUNY Stony Brook David Bunde, Knox College Vitus Leung, Sandia National Laboratories Kevin Pedretti, Sandia National Laboratories Cynthia.](https://reader035.fdocuments.in/reader035/viewer/2022081515/56649d2c5503460f94a01b2e/html5/thumbnails/31.jpg)
Questions
• What’s the right allocation for a stream (online)?
• Scheduling + Allocation
[Diagram: Jobs → MPP]