LIBI: The Lightweight Infrastructure-Bootstrapping Infrastructure
description
Transcript of LIBI: The Lightweight Infrastructure-Bootstrapping Infrastructure
Department of Computer Science
LIBI: The Lightweight Infrastructure-Bootstrapping
Infrastructure Joshua Goehner and Dorian Arnold
University of New Mexico(In collaboration with LLNL and U. of Wisconsin
Key statistics from Top500 (Nov. 2011)◦ 149/500 (30%) more than 8,192 core
In 2006, this number was 12/500 (2.4%)◦ 7 systems: more than 64K cores◦ 6 systems: between 64K and 128K cores◦ 3 systems: more than 128K cores
Later this year: LLNL’s Sequoia w/ 1.6M cores 2018?: Exascale systems w/ 107 or 108 cores
A Little Preaching to the Choir
Software systems must scale as well!
Before bootstrapping:◦ Program image on storage device◦ Set of (allocated) computer nodes
After bootstrapping: Application processes started on computer nodes Application’s configuration complete
ready for primary operation
Software Infrastructure-bootstrapping
Given a node allocation, start infrastructure's composite processes and propagate necessary startup information.
“I gotta get my application up and running!”
(1) “First, I start all the processes on the appropriate nodes”
(2) “Next, I must disseminate some initialization information”
Bootstrapping is complete when theinfrastructure is ready for steady-state usage.
Basic Bootstrapping
Sequential instantiation◦ e.g. rsh or ssh-based
Point-to-point propagation
0 50 100 150 200 250 3000
2
4
6
8
10
Number of ProcessesIn
stan
tiatio
n Ti
me
(sec
onds
)
70 seconds to start 2K processes!
Infrastructure-specific, scalable mechanisms◦ Still limited by sequential operations◦ Not portable to other infrastructures
Using high-performance resource managers◦Myriad interfaces◦ No communication facilities
Generic bootstrapping infrastructures◦ LaunchMON: targets tools with wrapper for existing RMs◦ ScELA: sequentially launch agents that create local processes
More Scalable Approaches
MRNet Sequential-ish Bootstrapping
Parent creates children
Local fork()/exec()
Remote rsh-based mechanism
Integrated instantiation and information propagation
MRNet’s “standard”
MRNet XT (hybrid) process launch
Bulk-launch 1 process per node
Process launches collocated processes
Information propagated insimilar fashion
LIBI Approach LIBI: Lightweight infrastructure-
bootstrapping infrastructure
◦ Name is no longer a work in progress
◦ Generic service for scalable distributed software infrastructure bootstrapping
Process launch
Scalable, low-level collectives
Large Scale Distributed Software
Job Launchers Communication Services
LIBI
rsh/ssh
SLURMOpenRTE
ALPS
LaunchMON
Debuggers System Monitors
Applications
Performance Analyzers Overlay Networks
COBO
MPI
session: set of processes (to be) deployed◦ session master manages other members◦ session front-end interacts with session master◦ LIBI currently supports only master/member communication
host-distribution: where to create processes◦ <hostname, num-processes>
process distribution: how/where to create processes◦ <session-id, executable, arguments, host-distribution,
environment>
LIBI API
launch(process-distribution-array)
◦ instantiate processes according to input distributions
[send|recv]UsrData(session-id, msg)
◦ communicate between front-end and session master
broadcast(), scatter(), gather(), barrier()
◦ communicate between session master and members
LIBI API (cont’d)
front-end( ){LIBI_fe_init();LIBI_fe_createSession(sess);
proc_dist_req_t pd;pd.sessionHandle = sess;pd.proc_path = get_ExePath();pd.proc_argv = get_ProgArgs();pd.hd = get_HostDistribution();
LIBI_fe_launch(pd);
//test broadcast and barrierLIBI_fe_sendUsrData(sess1, msg, len );LIBI_fe_recvUsrData(sess1, msg, len);
//test scatter and gatherLIBI_fe_sendUsrData(sess1, msg, len );LIBI_fe_recvUsrData(sess1, msg, len);
return 0;}
Example LIBI Front-end
session_member() {LIBI_init();
//test broadcast and barrierLIBI_recvUsrData(msg, msg_length);LIBI_broadcast(msg, msg_length);LIBI_barrier();LIBI_sendUsrData(msg, msg_length)
//test scatter and gatherLIBI_recvUsrData(msg, msg_length);LIBI_scatter(msg, sizeof(rcvmsg), rcvmsg);LIBI_gather(sndmsg, sizeof(sndmsg), msg);LIBI_sendUsrData(msg, msg_length);
LIBI_finalize();}
Example LIBI-launched Application
LaunchMON-based runtime
◦ Tested SLURM or rsh launching
◦ COBO PMGR service
LIBI Implementation StatusLarge Scale Distributed Software
Job Launchers Communication Services
LIBI
rsh/ssh
SLURMOpenRTE
ALPS
LaunchMON
Debuggers System Monitors
Applications
Performance Analyzers Overlay Networks
COBO
MPI
Goal to demonstrate ease-of-integration and performance/scalability enhancements
Microbenchmark
◦ Time to instantiate a set of processes
◦ Time to broadcast 128 bytes followed by barrier
◦ Time to scatter and gather 128 bytes
LIBI Early Evaluation
LIBI Microbenchmark Results
0 500 1000 1500 2000 2500 30000
1
2
3
4
5
6
7
8 ProjectedSequentialLIBI
Number of Processes
Inst
antia
tion
Tim
e (s
econ
ds)
LIBI Microbenchmark Results
0 500 1000 1500 2000 2500 30000
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
LIBI Broadcast & Barrier
LIBI Scatter & Gather
Sequential Broadcast & Barrier
Sequential Scatter & Gather
Number of Processes
Com
mun
icati
on T
ime
(sec
onds
)
MRNet uses LIBI to launch all MRNet processes
◦ Parse topology file and setup/call LIBI_launch()
Session front-end gathers/scatters startup information
◦ Parent listening socket (IP/port)
MRNet/LIBI Integration
STAT over MRNet over LIBI performance results Basic, scalable rsh-based mechanism Hybrid (XT-like) launch approach Mechanisms to alleviate filesystem contention◦ Like our scalable binary relocation service
More flexible process and host distributions◦ Instantiating different images in same session◦ Integrating allocating and launching
Integrated scalable communication infrastructure
Some Future Work