Holistic Aggregate Resource Environment
Eric Van Hensbergen (IBM Research), Ron Minnich (Sandia National Labs),
Jim McKie (Bell Labs), Charles Forsyth (Vita Nuova),
David Eckhardt (CMU)
Overview
Sequoia
BG/L
Red Storm
Research Topics
• Prerequisite: reliability and application-driven design are pervasive in all explored areas
• Offload/Acceleration Deployment Model
• The supercomputer needs to become an extension of the scientist's desktop rather than a batch-driven, non-standard run-time environment.
• Leverage aggregation as a first-class systems construct to help manage complexity and provide a foundation for scalability, reliability, and efficiency.
• Distribute system services throughout the machine (not just on the I/O nodes)
• Interconnect Abstractions & Utilization
• Leverage HPC interconnects in system services (file system, etc.)
• Sockets & TCP/IP don't map well to HPC interconnects (torus and collective) and are inefficient when the hardware already provides reliability
Right Weight Kernel
• General purpose multi-thread, multi-user environment
• Pleasantly Portable
• Relatively Lightweight (relative to Linux)
• Core Principles
• All resources are synthetic file hierarchies
• Local & remote resources accessed via simple API
• Each thread can organize local and remote resources through a dynamic private namespace (a minimal sketch follows below)
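As a concrete illustration of these principles, here is a minimal sketch in Plan 9 C: a remote service is attached into the calling process's private namespace with mount, after which its resources read like ordinary local files. The dial address, mount point, and file path are illustrative placeholders, not part of the HARE configuration.

/* Minimal sketch (Plan 9 C): attach a remote resource tree into this
   process's private namespace, then read a synthetic file from it.
   Address, mount point, and path are placeholders. */
#include <u.h>
#include <libc.h>

void
main(void)
{
	char buf[512];
	int fd, n;

	/* Dial a (hypothetical) remote 9P file server... */
	fd = dial("tcp!io-node!564", nil, nil, nil);
	if(fd < 0)
		sysfatal("dial: %r");
	/* ...and splice its hierarchy into our private namespace. */
	if(mount(fd, -1, "/n/remote", MREPL, "") < 0)
		sysfatal("mount: %r");

	/* Local or remote, every resource is now just a file. */
	fd = open("/n/remote/dev/sysstat", OREAD);
	if(fd < 0)
		sysfatal("open: %r");
	n = read(fd, buf, sizeof buf - 1);
	if(n > 0){
		buf[n] = '\0';
		print("%s", buf);
	}
	close(fd);
	exits(nil);
}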
Aggregation
• Extend the BG/P aggregation model beyond the I/O-node / CPU-node barrier
• Allow grouping of nodes into collaborating aggregates with distributed system services and dedicated service nodes
• Allow specialized kernels for file service, monitoring, checkpointing, and network routing
• Parameterized redundancy, reliability, and scaling
• Allows dynamic (re-)organization of the programming model to match the (changing) workload
(Diagram: a local service and a proxy service compose remote services into an aggregate service.)
Topology
Desktop Extension
• Users want super computers to be an extension of their desktop
• The current parallel model is the traditional batch model
• Workloads must use specialized compilers and be scheduled from a special front-end node; results are collected into a separate file system
• Monitoring and job control are through a web interface or the MMCS command line
• The difficult development environment and lack of interactivity limit the productivity of the execution environment
• Proposed Research
• Leverage library-OS commercial scale-out work to allow tighter coupling between the desktop environment and supercomputer resources
• Construct a runtime environment that includes a reasonable subset of typical Linux run-time requirements (glibc, Python, etc.); a sketch of the desktop-side coupling follows below
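A minimal sketch of what the desktop side of this coupling might look like, assuming a Linux workstation with the kernel's 9p (v9fs) client and a reachable 9P service on the front end; the server name, port, and mount point are hypothetical.

/* Hedged sketch (Linux C): attach a remote 9P service into the
   desktop's file namespace.  Server, port, and mount point are
   placeholders; requires the 9p filesystem module and privileges. */
#include <stdio.h>
#include <sys/mount.h>

int
main(void)
{
	if (mount("frontend.example.org", "/mnt/hare", "9p", 0,
	          "trans=tcp,port=564") != 0) {
		perror("mount 9p");
		return 1;
	}
	/* Supercomputer resources now appear as files under /mnt/hare. */
	printf("HARE services mounted on /mnt/hare\n");
	return 0;
}

Once mounted, ordinary desktop tools (editors, scripts, debuggers) can operate on those files directly.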
Extension Example
(Diagram: a Mac desktop running OS X and brasil reaches a pSeries front end, running Linux and brasil, via ssh over the Internet; the front end connects over 10 GB Ethernet to Plan 9 I/O nodes, which reach the Plan 9 CPU nodes running the applications over the collective and torus networks.)
Native Interconnects
• Blue Gene's specialized networks are used primarily by the user-space run-time
• The hardware is accessed directly by the user-space run-time environment and is not shared, leading to poor utilization
• Exclusive use of tree network for I/O limits bandwidth and reliability
• Proposed Solution
• Lightweight system software interfaces to the interconnects so that they can be leveraged for system management, monitoring, and resource sharing as well as by user applications
Protocol Exploration
• The Blue Gene networks are unusual (e.g., a 3D torus carrying 240-byte payloads)
• IP works, but isn't well matched to the underlying capabilities
• We want an efficient transport protocol to carry 9P messages & other data streams
• Related Work: IBM’s ‘one-sided’ messaging operations [Blocksome et al]
• It supports both MPI and non-MPI applications such as Global Arrays
• Inspired by the IBM messaging protocol, we think we might do better than just IP
• Years ago there was much work on lightweight protocols for high-speed networks
• We are using ideas from that earlier research to implement an efficient protocol to carry 9P conversations (an illustrative framing sketch follows below)
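As a purely illustrative sketch (not the project's actual protocol), the fragment below shows the kind of framing such a transport must do: a 9P message is split into fixed 240-byte torus payloads with a small sequencing header. torus_send() is a stand-in for a hypothetical hardware send primitive.

/* Illustrative framing sketch: split a 9P message into fixed
   240-byte torus payloads with a small sequencing header. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { PAYLOAD = 240, CHUNK = PAYLOAD - 8 };

struct frag {
	uint32_t msgid;   /* which 9P message this fragment belongs to */
	uint16_t seq;     /* fragment index within the message */
	uint16_t nbytes;  /* valid bytes in data[] */
	uint8_t  data[CHUNK];
};

static void
torus_send(int dest, const struct frag *f)
{
	/* Stand-in: a real driver would enqueue f on the torus DMA. */
	printf("to node %d: msg %u frag %u (%u bytes)\n",
	       dest, (unsigned)f->msgid, (unsigned)f->seq, (unsigned)f->nbytes);
}

static void
send_9p(int dest, uint32_t msgid, const uint8_t *msg, size_t len)
{
	struct frag f;
	uint16_t seq = 0;

	while (len > 0) {
		size_t n = len < CHUNK ? len : CHUNK;
		f.msgid = msgid;
		f.seq = seq++;
		f.nbytes = (uint16_t)n;
		memcpy(f.data, msg, n);
		torus_send(dest, &f);
		msg += n;
		len -= n;
	}
}

int
main(void)
{
	uint8_t msg[600] = { 0 };       /* placeholder 9P message body */

	send_9p(3, 1, msg, sizeof msg); /* splits into three fragments */
	return 0;
}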
Project Roadmap
(Gantt chart over project years 0-3 with three tracks: Hardware Support; Systems Infrastructure; Evaluation, Scaling, & Tuning.)
Milestones (Year 1)
(Timeline over weeks 0-50 of 2009, running from BASIC to BASELINE: Initial Boot, 10 GB Ethernet, Collective Network, Initial Infrastructure, SMP Support, Large Pages, Torus Network, Native Protocol, Baseline. The chart also carries a BG/L (circa 2007) label.)
PUSH
Figure 1: The structure of the PUSH shell
We have added two additional pipeline operators, a multiplexing fan-out (|<[n]) and a coalescing fan-in (>|). This combination allows PUSH to distribute I/O to and from multiple simultaneous threads of control. The fan-out argument n specifies the desired degree of parallel threading. If no argument is specified, the default of spawning a new thread per record (up to the limit of available cores) is used. This can also be overridden by command-line options or environment variables. The pipeline operators provide implicit grouping semantics, allowing natural nesting and composability. While their complementary nature usually leads to symmetric mappings (where the number of fan-outs equals the number of fan-ins), there is nothing within our implementation which enforces it. Normal redirections as well as application-specific sources and sinks can provide alternate data paths. Remote thread distribution and interconnect are composed and managed using synthetic file systems in much the same manner as Xcpu [11], pushing the distributed complexity into the middleware in a language- and runtime-neutral fashion.
PUSH also differs from traditional shells by implementing native support for record-based input handling over pipelines. This facility is similar to the argument field separators, IFS and OFS, in traditional shells, which use a pattern to determine how to tokenize arguments. PUSH provides two variables, ORS and IRS, which point to record separator modules. These modules (called multiplexors in PUSH) split data on record boundaries, emitting individual records that the system distributes and coalesces.
The choice of which multipipe to target is left as a decision to the module. Different data formats may have different output requirements. Demultiplexing from a multipipe is performed by creating a many-to-one communications channel within the shell. The shell creates reader processes which connect to each pipe in the multipipe. When the data reaches an appropriate record boundary, a buffer is passed from the reader to the shell, which then writes each record buffer to the output pipeline.
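The following is an illustrative sketch of the fan-in idea, not the PUSH implementation: one reader per input pipe hands complete records to a shared output, so records from parallel workers are never interleaved mid-record. It assumes newline-delimited records and POSIX threads purely for brevity.

/* Illustrative fan-in sketch (POSIX C): coalesce newline-delimited
   records from several pipes without splitting any record. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t outlock = PTHREAD_MUTEX_INITIALIZER;

static void *
reader(void *arg)
{
	FILE *in = arg;
	char *rec = NULL;
	size_t cap = 0;

	/* Emit each complete record atomically to the shared output. */
	while (getline(&rec, &cap, in) != -1) {
		pthread_mutex_lock(&outlock);
		fputs(rec, stdout);
		pthread_mutex_unlock(&outlock);
	}
	free(rec);
	return NULL;
}

int
main(void)
{
	/* Two placeholder "worker" pipelines standing in for |< stages. */
	FILE *a = popen("seq 1 5", "r");
	FILE *b = popen("seq 6 10", "r");
	pthread_t ta, tb;

	if (a == NULL || b == NULL)
		return 1;
	pthread_create(&ta, NULL, reader, a);
	pthread_create(&tb, NULL, reader, b);
	pthread_join(ta, NULL);
	pthread_join(tb, NULL);
	pclose(a);
	pclose(b);
	return 0;
}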
An example from our particular experience, Natural Language Processing, is to apply an analyzer to a large set of files, a "corpus". User programs go through each file, which contains a list of sentences, one sentence per line. They then tokenize each sentence into words, finding the part of speech and morphology of the words that make up the sentence. This sort of task maps very well to the DISC model. There are a large number of discrete sets of data whose order is not necessarily important. We need to perform a computationally intensive task on each of the sentences, which are small, discrete records and an ideal target for parallelization.
PUSH was designed to exploit this mapping. For example, to get a histogram of the distribution of words in a corpus:
push -c '{ ORS=./blm.dis du -an files |< xargs os chasen | awk '{print \$1}' | sort | uniq -c >| sort -rn }'
Early FTQ Results
(Plot: Strid3, Y = AX + Y; time in seconds for 1024 iterations versus "stride", i.e., the distance between scalars.)
Application Support
• Native
• Inferno Virtual Machine
• CNK Binary Support
• ELF converter
• Extended proc interface to mark processes as “cnk procs”
• Transition once the process execs, and not before
• Shim in syscall trap code to adapt arg passing conventions
• Linux Binary Support
• Basic Linux binary support
• Functional enough to run basic programs (Python, etc.)
Publications
• Unified Execution Model for Cloud Computing; Eric Van Hensbergen, Noah Evans, Phillip Stanley-Marbell. Submitted to LADIS 2009; October 2009.
• PUSH, a DISC Shell; Eric Van Hensbergen, Noah Evans. To Appear in the Proceedings of the Principles of Distributed Computing Conference; August 2009.
• Measuring Kernel Throughput on BG/P with the Plan 9 Research Operating System; Ron Minnich, John Floren, Aki Nyrhinen. Submitted to SC 09; November 2009.
• XCPU2: Distributed Seamless Desktop Extension; Eric Van Hensbergen, Latchesar Ionkov. Submitted to IEEE Clusters 2009; October 2009.
• Service Oriented File Systems; Eric Van Hensbergen, Noah Evans, Phillip Stanley-Marbell. IBM Research Report (RC24788), June 2009
• Experiences Porting the Plan 9 Research Operating System to the IBM Blue Gene Supercomputers; Ron Minnich, Jim McKie. To appear in the Proceedings of the International Conference on Supercomputing (ISC); June 2009.
• System Support for Many Task Computing; Eric Van Hensbergen and Ron Minnich. In the Proceedings of the Workshop on Many Task Computing on Grids and Supercomputers; November 2008.
• Holistic Aggregate Resource Environment; Charles Forsyth, Jim McKie, Ron Minnich and Eric Van Hensbergen. In the ACM Operating Systems Review; January 2008.
• Night of the Lepus: A Plan 9 Perspective on Blue Gene's Interconnects; Charles Forsyth, Jim McKie, Ron Minnich and Eric Van Hensbergen. In the proceedings of the second annual international workshop on Plan 9; December 2007
• Petascale Plan 9. USENIX 2007
Next Steps
• Infrastructure Scale Out
• File Services
• Command Execution
• Alternate Internode Communication Models
• Fail in place software RAS models
• Applications (Linux binaries and native support)
• Large Scale LINPACK Run
• Explore Mantevo Application Suite
• (http://software.sandia.gov/mantevo)
• CMU Working on Native Quake port
Acknowledgments
• Computational Resources Provided by DOE INCITE Program. Thanks to the patient folks at ANL who have supported us bringing up Plan 9 on their development BG/P
• Thanks to IBM Research Blue Gene team and the Kittyhawk Team for guidance and support.
Questions? Discussion? http://www.research.ibm.com/hare
Backup
Plan 9 Characteristics
Kernel Breakdown - Lines of Code
Architecture-Specific Code (BG/P): ~14,000 lines of code
Portable Code (Port): ~25,000 lines of code
TCP/IP Stack: ~14,000 lines of code
Binary Sizes: 415k text + 140k data + 107k BSS
Why not Linux?
Not a distributed system
Core systems inflexible
VM based on x86 MMU
Networking tightly tied to sockets & TCP/IP w/long call-path
Typical installations extremely overweight and noisy
Benefits of modularity and open source overcome by complexity, dependencies, and rapid rate of change
Community has become conservative
Support for alternative interfaces waning
Support for large systems that hurts small systems is not acceptable
Ultimately a customer constraint
FastOS was developed to prevent OS monoculture in HPC
Few Linux projects were even invited to submit final proposals
Everything Represented as File Systems
Hardware Devices
  Disk: /dev/hda1, /dev/hda2
  Network: /dev/eth0
System Services
  TCP/IP Stack: /net/arp, /net/udp, /net/tcp (clone, stats, and per-connection directories 0, 1, ... each with ctl, data, listen, local, remote, status)
  DNS: /net/cs, /net/dns
Application Services
  GUI: /win (clone and per-window directories 0, 1, 2, ... each with ctl, data, refresh)
  Console, Audio, Etc.
  Wiki, Authentication, and Service Control
  Process Control, Debug, Etc.
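For example, the classic Plan 9 idiom for driving the /net hierarchy shown above is to clone a connection directory, write a connect request to its ctl file, and then use its data file as the byte stream. A minimal sketch in Plan 9 C follows; the destination address is a placeholder.

/* Minimal sketch (Plan 9 C): open a TCP connection purely through
   the /net synthetic file system.  The destination is a placeholder. */
#include <u.h>
#include <libc.h>

void
main(void)
{
	char dir[40], path[64];
	int cfd, dfd, n;

	/* Cloning allocates a fresh connection directory; reading the
	   clone file yields its number. */
	cfd = open("/net/tcp/clone", ORDWR);
	if(cfd < 0)
		sysfatal("clone: %r");
	n = read(cfd, dir, sizeof dir - 1);
	if(n <= 0)
		sysfatal("read clone: %r");
	dir[n] = '\0';

	/* Writing to the held ctl file establishes the connection... */
	if(fprint(cfd, "connect 192.168.0.1!564") < 0)
		sysfatal("connect: %r");

	/* ...and the data file carries the byte stream. */
	snprint(path, sizeof path, "/net/tcp/%d/data", atoi(dir));
	dfd = open(path, ORDWR);
	if(dfd < 0)
		sysfatal("open data: %r");
	write(dfd, "hello", 5);
	close(dfd);
	close(cfd);
	exits(nil);
}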
Plan 9 Networks
(Diagram: CPU servers, a file server, and content-addressable storage share a high-bandwidth (10 GB/s) network; terminals sit on a LAN (1 GB/s) network; PDAs, smartphones, set-top boxes, and screen phones connect over WiFi/Edge and cable/DSL through the Internet.)
Aggregation as a First Class Concept
(Diagram: a local service and a proxy service compose multiple remote services into an aggregate service.)
Issues of Topology
File Cache Example
Proxy service monitors access to the remote file server & local resources
Local cache mode
Collaborative cache mode
Designated cache server(s)
Integrate replication and redundancy
Explore write coherence via "territories" a la Envoy
Based on experiences with the Xget deployment model
Leverage the natural topology of the machine where possible.
Monitoring Example
Distribute monitoring throughout the system
Use for system health monitoring and load balancing
Allow for application-specific monitoring agents
Distribute filtering & control agents at key points in the topology
Allow for localized monitoring and control as well as high-level global reporting and control
Explore both push and pull methods of modeling
Based on experiences with the supermon system.
Workload Management Example
Provide file system interface to job execution and scheduling.
Allows scheduling of new work from within the cluster, using localized as well as global scheduling controls.
Can allow for more organic growth of workloads as well as top-down and bottom-up models.
Can be extended to allow direct access from end-user workstations.
Based on experiences with Xcpu mechanism.
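As a purely illustrative sketch of this kind of interface (the mount point and file names are hypothetical, not the actual Xcpu layout), a client could launch work simply by writing a request into a synthetic control file that a local or global scheduler interprets:

/* Hypothetical sketch: submit a job by writing to a synthetic
   control file exposed by a workload-management file server. */
#include <stdio.h>

int
main(void)
{
	FILE *ctl = fopen("/mnt/exec/ctl", "w");

	if (ctl == NULL) {
		perror("open ctl");
		return 1;
	}
	/* The file server behind /mnt/exec schedules and runs the job. */
	fprintf(ctl, "exec 1024 /bin/myapp input.dat\n");
	fclose(ctl);
	return 0;
}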
Right Weight Kernels Project (Phase I)
Motivation
OS Effect on Applications: metric is based on OS interference on FWQ & FTQ benchmarks.
AIX/Linux has more capability than many apps need
LWK and CNK have less capability than apps want
Approach
Customize the kernel to the application
Ongoing Challenges
Need to balance capability with overhead
Why Blue Gene?
Readily available large-scale cluster
Minimum allocation is 37 nodes
Easy to get 512 and 1024 node configurations
Up to 8192 nodes available upon request internally
FastOS will make 64k configuration available
DOE interest – Blue Gene was a specified target
Variety of interconnects allows exploration of alternatives
Embedded core design provides a simple architecture that is quick to port to and doesn't require heavyweight systems software management, device drivers, or firmware