A Ping Too Far: Real-World Network Latency Measurement
Gary Jackson
JHU/APL
Work done while at the University of Maryland
Pete Keleher and Alan Sussman
University of Maryland
Department of Computer Science
Introduction
Context: Peer-to-peer HPC resource discovery and management
Goal: Collect a high-quality all-to-all network latency map
– Campus or department scale, including HPC resources
– As opposed to Internet-scale, which is well-trod ground
Purpose:
– Compare latency prediction techniques
– Increase the fidelity of peer-to-peer system simulations
Solved many problems
– Technical solutions to technical and policy obstacles
Managed only partial success
– Could not get measurements on more than one HPC-equipped cluster, so it's not useful to us
– But maybe the data set is useful to someone else
Four Policy Challenges
1. Where to measure?
– Ask for access ✖
– Compel stakeholders ✖
– Find existing resources that meet needs ✔
2. Work around policy obstacles
– Cannot run persistent daemons on resources
3. Minimal change
– Cannot ask for significant changes to environment or other policies
4. Non-disruptive
– Use of resources cannot disrupt other users
Five Technical Challenges
1. Load interferes with measurement
2. User-level programs on both ends
3. Quick measurements
4. Quality measurements
5. Fix technical obstacles
The Plan
Use local resources
UMIACS HTCondor Pool
– ~160 nodes spread out over several clusters
– Two clusters equipped with InfiniBand (IB)
"Backfills" HTC jobs onto clusters managed with TORQUE
OSU MPI microbenchmarks
Distributed system to schedule & collect measurements
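
To make the plan concrete, here is a minimal sketch of driving one measurement, assuming OpenMPI's mpirun and the OSU osu_latency binary are on the PATH (host names are hypothetical):

    # Sketch: one point-to-point latency measurement with the OSU
    # microbenchmarks over OpenMPI. Host names are hypothetical.
    import subprocess

    def measure_latency(host_a: str, host_b: str) -> dict:
        """Run osu_latency between two hosts; return {msg_size: usec}."""
        cmd = [
            "mpirun", "-np", "2",
            "--host", f"{host_a},{host_b}",  # one MPI rank per host
            "osu_latency",
        ]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        results = {}
        for line in out.stdout.splitlines():
            fields = line.split()
            # Data lines look like "<size> <latency_usec>"; headers start with '#'.
            if len(fields) == 2 and fields[0].isdigit():
                results[int(fields[0])] = float(fields[1])
        return results

    print(measure_latency("node01", "node02"))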
Particulars of the Environment
Scheduling
– Cannot schedule arbitrary pairs of nodes in HTCondor
Static slots
– 1 job per slot, 1 slot per core
– All slots must be controlled for exclusive measurement
Node heterogeneity
Lesson learned: Compute environment exists to support somebody's research, but maybe not yours.
Aside: Load Affects Network Latency
Space-sharing application model
Measurements between two IB connected nodes
Varied CPU load
Higher load leads to:
– Increased latency
– Unpredictable latency
Lesson learned: Environment for measurement should match model environment.
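
For illustration, a sketch of how such a load experiment can be driven: busy-loop workers load the CPUs while a measurement function (e.g., the measure_latency sketch above) runs. Process counts and durations are arbitrary choices, not the talk's actual methodology:

    # Sketch: run a measurement while synthetic CPU load is applied, to
    # observe the effect of load on latency. Counts/durations are arbitrary.
    import multiprocessing
    import time

    def busy(seconds: float) -> None:
        """Spin a CPU for roughly the given number of seconds."""
        end = time.monotonic() + seconds
        while time.monotonic() < end:
            pass

    def measure_under_load(n_procs: int, seconds: float, measure):
        """Start n_procs busy-loop workers, run measure(), then clean up."""
        workers = [multiprocessing.Process(target=busy, args=(seconds,))
                   for _ in range(n_procs)]
        for w in workers:
            w.start()
        try:
            return measure()  # e.g. the measure_latency sketch above
        finally:
            for w in workers:
                w.join()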
Solved Technical Obstacles
OpenMPI is finicky about OS & libraries
– Build OpenMPI separately for every single host
OpenMPI over TCP mysteriously hangs
– Bogus bridge interface for virtualization
– Tell OpenMPI not to use it (sketched below)
User limits for mapped memory prevent RDMA over IB
– Had to modify HTCondor init script
IB library provided by OS didn't work
– Had to build it ourselves on Cluster E
– First hint that something was really wrong
Lesson learned: There are going to be a lot of little problems along the way.
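
A concrete illustration of two of these fixes; btl_tcp_if_exclude is a real OpenMPI MCA parameter, but the interface name "virbr0" and the in-process rlimit call are assumptions (the actual fix lived in the HTCondor init script):

    # Sketch: tell OpenMPI's TCP transport to skip a bogus virtualization
    # bridge via the btl_tcp_if_exclude MCA parameter. "virbr0" is a guess.
    cmd = [
        "mpirun", "-np", "2", "--host", "node01,node02",
        "--mca", "btl_tcp_if_exclude", "lo,virbr0",
        "osu_latency",
    ]

    # The memlock fix raised RLIMIT_MEMLOCK for HTCondor's daemons; the
    # equivalent in-process call (raising the hard limit needs privilege):
    import resource
    resource.setrlimit(resource.RLIMIT_MEMLOCK,
                       (resource.RLIM_INFINITY, resource.RLIM_INFINITY))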
Solved Policy Obstacles
Local Resource Management Systems
– Cannot schedule arbitrary pairs of nodes
– Cannot run processes outside of HTCondor & TORQUE
– Cannot ask to change the way resources are allocated
– Solution: Built a distributed system to schedule & collect measurements
Accounts
– Cannot get accounts on some systems
– Solution: Workaround to start OpenMPI daemon processes on both ends without SSH (sketched below)
Lesson learned: Sometimes there are technical solutions to policy problems.
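
A minimal sketch of how an SSH-less rendezvous can work (not the actual implementation): each job phones home to a coordinator, which pairs endpoints and tells each one its peer:

    # Sketch: rendezvous coordinator that pairs endpoints without SSH.
    # Each HTCondor job connects here on startup; the address is hypothetical.
    import socket

    def coordinator(host: str = "0.0.0.0", port: int = 9999) -> None:
        srv = socket.create_server((host, port))
        waiting = None
        while True:
            conn, addr = srv.accept()
            if waiting is None:
                waiting = (conn, addr)  # first endpoint of a pair waits
            else:
                peer_conn, peer_addr = waiting
                # Tell each endpoint its peer's address, then disconnect;
                # the pair can now start the MPI daemons and measure.
                conn.sendall(peer_addr[0].encode())
                peer_conn.sendall(addr[0].encode())
                conn.close()
                peer_conn.close()
                waiting = None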
Setting the Stage for Failure
Cluster E
– One of the two clusters in the pool with IB
– Upstream IB libraries from OS vendor didn't work
– IB used exclusively for IPoIB to support Lustre
– Nodes have a large amount of memory
OpenMPI processes crashing
– Despite rebuild of IB libraries from the hardware vendor
Fatal Obstacle
IB driver has a tunable parameter for the amount of memory that can be mapped (64 GB)
Nodes have twice that much physical memory (128 GB)
The parameter needs to be twice the physical memory size (256 GB)
OS vendor has no guidelines for adjusting that value
Unknown impact on the Lustre filesystem using IPoIB
So this can't be fixed
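
To make the arithmetic concrete, a sketch assuming a Mellanox mlx4-style driver, where mappable memory is MTT count times segment size times page size; the parameter names are our assumption, the talk only gives the totals:

    # Sketch of the sizing arithmetic, assuming an mlx4-style driver where
    # mappable memory = (2**log_num_mtt) * (2**log_mtts_per_seg) * PAGE_SIZE.
    # Parameter names are assumptions; the talk only gives the totals.
    PAGE_SIZE = 4096
    GiB = 2**30

    def mappable(log_num_mtt: int, log_mtts_per_seg: int) -> int:
        return (2**log_num_mtt) * (2**log_mtts_per_seg) * PAGE_SIZE

    print(mappable(24, 0) // GiB)  # 64  -> the limit as configured
    print(mappable(26, 0) // GiB)  # 256 -> 2x the nodes' 128 GB of RAM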
Lesson learned: Sometimes there's nothing you can do.
☠
Cannot Lay Blame
Sysadmins? No, they made a conservative decision to support primary stakeholders
IB vendor? No, the driver straight from the IB vendor probably would have worked
OS vendor? No, they support what they intended to support (IPoIB)
Me? No, using native RDMA over IB isn't asking too much
Lesson learned: Sometimes it's no one's fault.
Results
Ping is not a good predictor of application-level latency
– Tends to overestimate
Compared latency prediction techniques:
– Distributed Tree Metric (DTM)
– Vivaldi
– Global Network Positioning (GNP)
Result: DTM continues to perform better than the other techniques
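
Of the three, Vivaldi is the simplest to sketch: each node keeps synthetic coordinates and nudges them after every RTT sample so that coordinate distance predicts latency. A minimal 2-D step, without the full algorithm's adaptive error weighting:

    # Sketch: one Vivaldi coordinate-update step in 2-D, without the full
    # algorithm's adaptive timestep and error weighting. Illustrative only.
    import math

    def vivaldi_step(xi, xj, rtt, delta=0.25):
        """Nudge xi so that |xi - xj| better predicts the measured rtt."""
        dx = [a - b for a, b in zip(xi, xj)]
        dist = math.hypot(*dx) or 1e-9      # avoid division by zero
        error = rtt - dist                  # positive: nodes should move apart
        unit = [d / dist for d in dx]
        return [a + delta * error * u for a, u in zip(xi, unit)]

    # After many samples, coordinate distance approximates measured RTT.
    x = vivaldi_step([0.0, 0.0], [1.0, 0.0], rtt=10.0)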
Takeaway
IF you are building a big system/thesis that will rely on many different systems/admin domains
THEN you need to check all the potential choke-points in advance
If the work is self-contained, this is much easier.
I should have tested MPI over IB RDMA on that cluster much earlier in the process.
Policy
Cannot ask for invasive changes to policy or implementation
Cannot disrupt HTCondor pool
Cannot interfere with TORQUE users
Cannot get accounts on compute nodes
Must be prepared for preemption
Lesson learned: Policies exist to support someone's research, but maybe not yours.
Seizing a Node
Submitter: query HTCondor and submit master & slave jobs
Node masters & slaves: seize exclusive control over a node
For a node with n slots:
– Submit n-1 slaves
– Submit 1 master
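
A minimal sketch of the seizing step, driving condor_submit from Python; "Machine" is a standard HTCondor ClassAd attribute, but the executables and the stdin-submit style are assumptions about the setup:

    # Sketch: seize a node by pinning 1 master and n-1 slave jobs to it.
    # The slave.sh/master.sh executables are hypothetical.
    import subprocess

    def seize(machine: str, n_slots: int) -> None:
        submit = f"""
        universe     = vanilla
        requirements = (Machine == "{machine}")
        executable   = slave.sh
        queue {n_slots - 1}

        executable   = master.sh
        queue 1
        """
        subprocess.run(["condor_submit"], input=submit, text=True, check=True)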
Scheduling Measurements
When all slaves & master are running, contact central control
Slaves & master yield periodically to allow other jobs to run
Scheduler:
– Schedules measurements between masters
– Collects & stores results from masters
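
A sketch of the scheduler's core loop (the data structures and the measure/store callables are hypothetical):

    # Sketch: the scheduler's core loop, pairing ready masters all-to-all.
    # The measure() and store() callables are hypothetical.
    import itertools

    def schedule_round(ready_masters, measure, store):
        """Run one measurement for every pair of currently ready masters."""
        for a, b in itertools.combinations(sorted(ready_masters), 2):
            result = measure(a, b)  # e.g. osu_latency between the two
            store(a, b, result)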