A Ping Too Far: Real-World Network Latency Measurement
Gary Jackson
JHU/APL
Work done while at the University of Maryland
Pete Keleher and Alan Sussman
University of Maryland
Department of Computer Science
Introduction
Context: Peer-to-peer HPC resource discovery and management
Goal: Collect a high-quality all-to-all network latency map
– Campus or department scale, including HPC resources
– As opposed to Internet-scale, which is well-trod ground
Purpose:
– Compare latency prediction techniques
– Increase the fidelity of peer-to-peer system simulations
Solved many problems
– Technical solutions to technical and policy obstacles
Managed only partial success
– Could not get measurements on more than one HPC-equipped cluster, so it's not useful to us
– But maybe the data set is useful to someone else
Four Policy Challenges
1. Where to measure?
– Ask for access ✖
– Compel stakeholders ✖
– Find existing resources that meet needs ✔
2. Work around policy obstacles
– Cannot run persistent daemons on resources
3. Minimal change
– Cannot ask for significant changes to environment or other policies
4. Non-disruptive
– Use of resources cannot disrupt other users
Five Technical Challenges
1. Load interferes with measurement
2. User-level programs on both ends
3. Quick measurements
4. Quality measurements
5. Fix technical obstacles
The Plan
Use local resources
UMIACS HTCondor Pool
– ~160 nodes spread out over several clusters
– Two clusters equipped with InfiniBand (IB)
"Backfills" HTC jobs onto clusters managed with TORQUE
OSU MPI microbenchmarks
Distributed system to schedule & collect measurements
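
To make the plan concrete, here is a minimal sketch of driving one measurement, assuming OpenMPI's mpirun and the OSU osu_latency binary are on the PATH (host names are hypothetical):

    # Sketch: one point-to-point latency measurement with the OSU
    # microbenchmarks over OpenMPI. Host names are hypothetical.
    import subprocess

    def measure_latency(host_a: str, host_b: str) -> dict:
        """Run osu_latency between two hosts; return {msg_size: usec}."""
        cmd = [
            "mpirun", "-np", "2",
            "--host", f"{host_a},{host_b}",  # one MPI rank per host
            "osu_latency",
        ]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        results = {}
        for line in out.stdout.splitlines():
            fields = line.split()
            # Data lines look like "<size> <latency_usec>"; headers start with '#'.
            if len(fields) == 2 and fields[0].isdigit():
                results[int(fields[0])] = float(fields[1])
        return results

    print(measure_latency("node01", "node02"))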
Particulars of the Environment
Scheduling
– Cannot schedule arbitrary pairs of nodes in HTCondor
Static slots
– 1 job per slot, 1 slot per core
– All slots must be controlled for exclusive measurement
Node heterogeneity
Lesson learned: Compute environment exists to support somebody's research, but maybe not yours.
Aside: Load Affects Network Latency
Space-sharing application model
Measurements between two IB connected nodes
Varied CPU load
Higher load leads to:
– Increased latency
– Unpredictable latency
Lesson learned: Environment for measurement should match model environment.
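
For illustration, a sketch of how such a load experiment can be driven: busy-loop workers load the CPUs while a measurement function (e.g., the measure_latency sketch above) runs. Process counts and durations are arbitrary choices, not the talk's actual methodology:

    # Sketch: run a measurement while synthetic CPU load is applied, to
    # observe the effect of load on latency. Counts/durations are arbitrary.
    import multiprocessing
    import time

    def busy(seconds: float) -> None:
        """Spin a CPU for roughly the given number of seconds."""
        end = time.monotonic() + seconds
        while time.monotonic() < end:
            pass

    def measure_under_load(n_procs: int, seconds: float, measure):
        """Start n_procs busy-loop workers, run measure(), then clean up."""
        workers = [multiprocessing.Process(target=busy, args=(seconds,))
                   for _ in range(n_procs)]
        for w in workers:
            w.start()
        try:
            return measure()  # e.g. the measure_latency sketch above
        finally:
            for w in workers:
                w.join()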
Solved Technical Obstacles
OpenMPI is finicky about OS & libraries
– Build OpenMPI separately for every single host
OpenMPI over TCP mysteriously hangs
– Bogus bridge interface for virtualization
– Tell OpenMPI not to use it (sketched below)
User limits for mapped memory prevent RDMA over IB
– Had to modify HTCondor init script
IB library provided by OS didn't work
– Had to build it ourselves on Cluster E
– First hint that something was really wrong
Lesson learned: There are going to be a lot of little problems along the way.
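
A concrete illustration of two of these fixes; btl_tcp_if_exclude is a real OpenMPI MCA parameter, but the interface name "virbr0" and the in-process rlimit call are assumptions (the actual fix lived in the HTCondor init script):

    # Sketch: tell OpenMPI's TCP transport to skip a bogus virtualization
    # bridge via the btl_tcp_if_exclude MCA parameter. "virbr0" is a guess.
    cmd = [
        "mpirun", "-np", "2", "--host", "node01,node02",
        "--mca", "btl_tcp_if_exclude", "lo,virbr0",
        "osu_latency",
    ]

    # The memlock fix raised RLIMIT_MEMLOCK for HTCondor's daemons; the
    # equivalent in-process call (raising the hard limit needs privilege):
    import resource
    resource.setrlimit(resource.RLIMIT_MEMLOCK,
                       (resource.RLIM_INFINITY, resource.RLIM_INFINITY))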
Solved Policy Obstacles
Local Resource Management Systems
– Cannot schedule arbitrary pairs of nodes
– Cannot run processes outside of HTCondor & TORQUE
– Cannot ask to change the way resources are allocated
– Solution: Built a distributed system to schedule & collect measurements
Accounts
– Cannot get accounts on some systems
– Solution: Workaround to start OpenMPI daemon processes on both ends without SSH (sketched below)
Lesson learned: Sometimes there are technical solutions to policy problems.
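
A minimal sketch of how an SSH-less rendezvous can work (not the actual implementation): each job phones home to a coordinator, which pairs endpoints and tells each one its peer:

    # Sketch: rendezvous coordinator that pairs endpoints without SSH.
    # Each HTCondor job connects here on startup; the address is hypothetical.
    import socket

    def coordinator(host: str = "0.0.0.0", port: int = 9999) -> None:
        srv = socket.create_server((host, port))
        waiting = None
        while True:
            conn, addr = srv.accept()
            if waiting is None:
                waiting = (conn, addr)  # first endpoint of a pair waits
            else:
                peer_conn, peer_addr = waiting
                # Tell each endpoint its peer's address, then disconnect;
                # the pair can now start the MPI daemons and measure.
                conn.sendall(peer_addr[0].encode())
                peer_conn.sendall(addr[0].encode())
                conn.close()
                peer_conn.close()
                waiting = None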
Setting the Stage for Failure
Cluster E
– One of the two clusters in the pool with IB
– Upstream IB libraries from OS vendor didn't work
– IB used exclusively for IPoIB to support Lustre
– Nodes have a large amount of memory
OpenMPI processes crashing
– Despite rebuild of IB libraries from the hardware vendor
Fatal Obstacle
IB driver has a tunable parameter for the amount of memory that can be mapped (64 GB)
Nodes have twice that much physical memory (128 GB)
The parameter needs to be twice the physical memory size (256 GB)
OS vendor has no guidelines for adjusting that value
Unknown impact on the Lustre filesystem using IPoIB
So this can't be fixed
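
To make the arithmetic concrete, a sketch assuming a Mellanox mlx4-style driver, where mappable memory is MTT count times segment size times page size; the parameter names are our assumption, the talk only gives the totals:

    # Sketch of the sizing arithmetic, assuming an mlx4-style driver where
    # mappable memory = (2**log_num_mtt) * (2**log_mtts_per_seg) * PAGE_SIZE.
    # Parameter names are assumptions; the talk only gives the totals.
    PAGE_SIZE = 4096
    GiB = 2**30

    def mappable(log_num_mtt: int, log_mtts_per_seg: int) -> int:
        return (2**log_num_mtt) * (2**log_mtts_per_seg) * PAGE_SIZE

    print(mappable(24, 0) // GiB)  # 64  -> the limit as configured
    print(mappable(26, 0) // GiB)  # 256 -> 2x the nodes' 128 GB of RAM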
Lesson learned: Sometimes there's nothing you can do.
☠
Cannot Lay Blame
Sysadmins? No, they made a conservative decision to support primary stakeholders
IB vendor? No, the driver straight from the IB vendor probably would have worked
OS vendor? No, they support what they intended to support (IPoIB)
Me? No, using native RDMA over IB isn't asking too much
Lesson learned: Sometimes it's no one's fault.
Results
Ping is not a good predictor of application-level latency
– Tends to overestimate
Compared latency prediction techniques:
– Distributed Tree Metric (DTM)
– Vivaldi
– Global Network Positioning (GNP)
Result: DTM continues to perform better than the other techniques
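
Of the three, Vivaldi is the simplest to sketch: each node keeps synthetic coordinates and nudges them after every RTT sample so that coordinate distance predicts latency. A minimal 2-D step, without the full algorithm's adaptive error weighting:

    # Sketch: one Vivaldi coordinate-update step in 2-D, without the full
    # algorithm's adaptive timestep and error weighting. Illustrative only.
    import math

    def vivaldi_step(xi, xj, rtt, delta=0.25):
        """Nudge xi so that |xi - xj| better predicts the measured rtt."""
        dx = [a - b for a, b in zip(xi, xj)]
        dist = math.hypot(*dx) or 1e-9      # avoid division by zero
        error = rtt - dist                  # positive: nodes should move apart
        unit = [d / dist for d in dx]
        return [a + delta * error * u for a, u in zip(xi, unit)]

    # After many samples, coordinate distance approximates measured RTT.
    x = vivaldi_step([0.0, 0.0], [1.0, 0.0], rtt=10.0)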
Takeaway
IF you are building a big system/thesis that will rely on many different systems/admin domains
THEN you need to check all the potential choke-points in advance
If the work is self-contained, this is much easier.
I should have tested MPI over IB RDMA on that cluster much earlier in the process.
Policy
Cannot ask for invasive changes to policy or implementation
Cannot disrupt HTCondor pool
Cannot interfere with TORQUE users
Cannot get accounts on compute nodes
Must be prepared for preemption
Lesson learned: Policies exist to support someone's research, but maybe not yours.
Seizing a Node
Submitter: query HTCondor and submit master & slave jobs
Node masters & slaves: seize exclusive control over a node
For a node with n slots:
– Submit n-1 slaves
– Submit 1 master
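
A minimal sketch of the seizing step, driving condor_submit from Python; "Machine" is a standard HTCondor ClassAd attribute, but the executables and the stdin-submit style are assumptions about the setup:

    # Sketch: seize a node by pinning 1 master and n-1 slave jobs to it.
    # The slave.sh/master.sh executables are hypothetical.
    import subprocess

    def seize(machine: str, n_slots: int) -> None:
        submit = f"""
        universe     = vanilla
        requirements = (Machine == "{machine}")
        executable   = slave.sh
        queue {n_slots - 1}

        executable   = master.sh
        queue 1
        """
        subprocess.run(["condor_submit"], input=submit, text=True, check=True)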
Scheduling Measurements
When all slaves & master are running, contact central control
Slaves & master yield periodically to allow other jobs to run
Scheduler:
– Schedules measurements between masters
– Collects & stores results from masters
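
A sketch of the scheduler's core loop (the data structures and the measure/store callables are hypothetical):

    # Sketch: the scheduler's core loop, pairing ready masters all-to-all.
    # The measure() and store() callables are hypothetical.
    import itertools

    def schedule_round(ready_masters, measure, store):
        """Run one measurement for every pair of currently ready masters."""
        for a, b in itertools.combinations(sorted(ready_masters), 2):
            result = measure(a, b)  # e.g. osu_latency between the two
            store(a, b, result)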