HPC-Colony
Transcript of HPC-Colony
www.HPC-Colony.org
June 2005 PI Meeting
Terry Jones, LLNL, Coordinating PI
Laxmikant Kale, UIUC, PI
Jose Moreira, IBM, PI
Celso Mendes, UIUC
Sayantan Chakravorty, UIUC
Todd Inglett, IBM
Andrew Tauferner, IBM
Outline
Recap Colony Project Goals & Approach
Status Update
Identify possibilities for Collaborations
Identify resources which may be of interest to other Projects
Recap Colony Project Goals & Approach
Status Update
Identify possibilities for Collaborations
Identify resources which may be of interest to other Projects
Colony Project Overview
Title: Services and Interfaces to Support Systems with Very Large Numbers of Processors
Collaborators:
— Lawrence Livermore National Laboratory (at mtg: Terry Jones)
— University of Illinois at Urbana-Champaign (at mtg: Laxmikant Kale, Celso Mendes)
— International Business Machines
Topics:
— Scalable Load Balancing
— OS mechanisms for Migration
— Processor Virtualization for Fault Tolerance
— Single system management space
— Parallel Awareness and Coordinated Scheduling of Services
— Linux OS for Blue Gene-like machines
Motivation & Goals
Parallel Resource Management
— Today, application programmers must explicitly manage these complex resources. We address scaling issues and porting issues by delegating resource management tasks to a sophisticated parallel OS.
— “Managing resources” includes balancing CPU time, network utilization, and memory usage across the entire machine.
Global System Management
— Linux Everywhere
— Enhance operating system support for parallel execution by providing coordinated scheduling and improved management services for very large machines.
— Virtual memory management across the entire system.
Fault Tolerance
— Scalable strategies for dealing with faults. Checkpoint/restart may not be adequate in all cases (and it may be too expensive).
Approach
Top Down
— Our work will start from an existing full-featured OS and remove excess baggage with a “top down” approach.
Processor Virtualization
— One of our core techniques: the programmer divides the computation into a large number of entities, which are mapped to the available processors by an intelligent runtime system.
Leverage Advantages of Full Featured OS & Single System Image
— Applications on these extreme-scale systems will benefit from extensive services and interfaces; managing these complex systems will require an improved “logical view”.
Utilize Blue Gene
— Suitable platform for ideas intended for very large numbers of processors.
Recap Colony Project Goals & Approach
Status Update
Identify possibilities for Collaborations
Identify resources which may be of interest to other Projects
Recent Accomplishments: Project Overview
Started consolidating previous fault-tolerance work
— Checkpoint/Restart Techniques
— In-Memory Checkpointing
— Preliminary Sender-based Message Logging Scheme
Continued parallel resource-management work
— Measurement-based load balancing
– New classes of balancers designed & implemented
— Support for object & thread migration on BG/L
Global system management work
— Linux to BG/L compute nodes
– Comparison performance measurements
— Parallel-aware scheduling
Fault Tolerance Work: Status
Current focus: Proactive fault tolerance
— Approach: Migrate tasks away from processors where faults are imminent
— Rationale: Many faults may have advance indicators
– Hardware / O.S. notification
— Basis: Processor virtualization (many virtual processors mapped to physical processors)
— Implementation: “anytime-migration” support in Charm++ and AMPI (for MPI codes), used for processor evacuation (see the sketch below)
Tests: Jacobi (C, MPI), Sweep3d (Fortran, MPI)
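Below is a minimal sketch of the standard Charm++ idiom that makes an object migratable, and therefore evacuatable on a fault warning: a migration constructor plus a pup (pack/unpack) routine that serializes the object's state. The class name, its fields, and the jacobiChunk interface/generated headers are illustrative, not taken from the Colony or test codes.

// Sketch of a migratable Charm++ array element (illustrative names; the
// .ci interface file and the generated jacobiChunk.decl.h are assumed).
#include "jacobiChunk.decl.h"   // hypothetical generated declarations
#include "pup_stl.h"            // PUP support for STL containers
#include <vector>

class JacobiChunk : public CBase_JacobiChunk {
  std::vector<double> grid;   // this chunk's piece of the global dataset
  int iteration;
public:
  JacobiChunk() : grid(1000 * 1000, 0.0), iteration(0) {}

  // Migration constructor: the runtime invokes this on the destination
  // processor when the element arrives there.
  JacobiChunk(CkMigrateMessage *m) : CBase_JacobiChunk(m) {}

  // PUP routine: packs the element's state on the source processor and
  // unpacks it on the destination, which is what lets the runtime move
  // the object at (almost) any time, e.g. on a fault warning.
  void pup(PUP::er &p) {
    p | grid;
    p | iteration;
  }
};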
Fault Tolerance Work
Example: Jacobi relaxation 2D (C, MPI)
— 6,000x6,000 dataset, run on 8 Xeon processors
[Figure: processor-utilization timeline showing the fault warning, migration off the failing processor (7 processors remaining), and subsequent load balancing]
Fault Tolerance Work
Example: sweep3d
— Processor-utilization snapshots
[Figure: utilization snapshots prior to the fault, after the first fault/migration, and after load balancing; the load from the failed processor (1) is redistributed]
Fault Tolerance: Recent Accomplishments
Proactive Fault Tolerance Lessons:
— Evacuation time is proportional to dataset size per processor and to network speed – good scalability
— Subsequent load balancing step is critical for good performance
— Current scheme can tolerate multiple faults that are not simultaneous
Current Status:
— Working on selecting appropriate load balancers
— Analyzing how to improve performance between evacuation and the next load balancing
— Reducing time between notice of impending failure and stable post-evacuation state
— Integrating into the regular Charm++/AMPI distribution
Parallel Resource Management Work
Recent focus: Preliminary work on Advanced Load Balancers
— Multi-phase load balancing (see the sketch after this list)
– Balance for each program phase separately
– Phases may be identified by user-inserted calls
– Sample results following
— Asynchronous load balancing
– Hiding load balancing overhead
– Overlap of load balancing and computation
— Topology-aware load balancing
– Considers network topology
– Major goal: optimize the hop-bytes metric
– Ongoing work
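A small self-contained sketch of what "phases identified by user-inserted calls" might look like from the application side. The phase_begin/phase_end markers are hypothetical stand-ins, not the actual Charm++/AMPI hooks; they only indicate which phase the measured load should be attributed to.

// Sketch of user-inserted phase markers for multi-phase load balancing.
#include <cstdio>

static int current_phase = 0;

void phase_begin(int phase) { current_phase = phase; }  // hypothetical marker
void phase_end(int /*phase*/) { current_phase = 0; }    // hypothetical marker

void compute_forces()      { /* phase 1: compute-heavy work */ }
void exchange_boundaries() { /* phase 2: communication-heavy work */ }

void timestep() {
  phase_begin(1);
  compute_forces();        // load measured here is attributed to phase 1
  phase_end(1);

  phase_begin(2);
  exchange_boundaries();   // load measured here is attributed to phase 2
  phase_end(2);
  // A multi-phase balancer records per-object load for each phase
  // separately and computes a mapping that balances each phase on its own.
}

int main() {
  for (int step = 0; step < 10; ++step) timestep();
  std::printf("done\n");
  return 0;
}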
Parallel Resource Management Work
Example: Phase load balancing
[Figure: Processor 0 / Processor 1 timelines over Phase 1 and Phase 2 for three cases: the original execution, “good balancing, but bad performance”, and phase balancing]
Parallel Resource Management Work
Example: synthetic 2-phase program
Tests: 32 pe’s, Xeon cluster
[Figure: processor-utilization profiles for no load balance, the greedy load balancer, and the multi-phase load balancer; the chart annotates utilization levels of 60% and 70%]
Parallel Resource Management: Recent Accomplishments
Load Balancing Trade-offs:
— Trade-off in centralized load balancers
– Good balance can be achieved
– Expensive memory and network usage for LB
— Trade-off in fully distributed load balancers
– Cheap in terms of memory/network usage
– May be slow to achieve acceptable balance
Current Status:
— Working on hybrid load-balancing schemes (see the sketch below)
– Idea: divide processors into a hierarchy, balance load at each level first, then across levels
— Integrating balancers into the regular Charm++/AMPI
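A toy sketch of the hybrid idea, assuming per-processor load can be summarized as a single number: balance within fixed-size groups first, then compare only the group summaries across groups. Real balancers move objects rather than scalar loads; the names and numbers below are illustrative.

// Two-level (hybrid) balancing sketch: intra-group, then inter-group.
#include <cstdio>
#include <numeric>
#include <vector>

double group_average(const std::vector<double> &loads, size_t begin, size_t end) {
  return std::accumulate(loads.begin() + begin, loads.begin() + end, 0.0) /
         static_cast<double>(end - begin);
}

int main() {
  std::vector<double> load = {4, 1, 3, 2, 9, 1, 2, 0};  // per-processor load
  const size_t group_size = 4;

  // Level 1: balance within each group (setting each processor to the
  // group average stands in for intra-group object migration).
  std::vector<double> group_load;
  for (size_t g = 0; g < load.size(); g += group_size) {
    double avg = group_average(load, g, g + group_size);
    for (size_t p = g; p < g + group_size; ++p) load[p] = avg;
    group_load.push_back(avg);
  }

  // Level 2: balance across groups using only the summarized loads, so
  // far less data crosses group boundaries than in a centralized scheme.
  double global_avg = group_average(group_load, 0, group_load.size());
  for (size_t g = 0; g < group_load.size(); ++g)
    std::printf("group %zu: load %.2f, shift %+.2f toward average\n",
                g, group_load[g], global_avg - group_load[g]);
  return 0;
}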
Global System Management: Recent Accomplishments
Defined team for Blue Gene/L Linux Everywhere
Research and development of Linux Everywhere
— Limitations of the Compute Node Kernel
— Motivation for Linux on the compute nodes
— Challenges of a Linux solution for Blue Gene/L
— Initial experiments and results
Measurements for Parallel Aware Scheduling
Global System Management: Linux Everywhere Solution for BG/L
Would significantly extend the spectrum of applications that can be easily ported to Blue Gene/L – more penetration in industrial and commercial markets
Multiple processes/threads per compute node
Leverage set of middleware (libraries, run-time) from I/O node to compute node
An alternative to cross-compilation environments
Platform for research on extreme scalability of Linux
Possible solution for follow-on machine (Blue Gene/P)
Global System Management: Challenges of Linux Everywhere
Scalability challenges:
— Will the more asynchronous nature of Linux lead to scalability problems?
— Sources of asynchronous events in Linux: timer interrupts, TLB misses, process/thread scheduler, I/O (device) events
— Parallel file system for 100,000+ processors?
— If Linux cannot scale to the 10,000-100,000 processor range, can it be a solution for smaller systems?
Reliability challenges:
— Despite some difficulties with file system and Ethernet devices, Linux on BG/L has not presented reliability problems
— Test, test, test, and more testing …
Global System Management: Full Linux on Blue Gene Nodes
5 out of 8 NAS class A benchmarks fit in 256MB
Same compile (gcc -O3) and execute options, and similar libraries (GLIBC)
No daemons or extra processes running in Linux; mainly user-space code
— Difference suspected to be caused by handling of TLB misses

NAS Serial Benchmark (class A) Results: Execution time (in seconds)
Benchmark  BSS size  CNK      Linux    Difference
CG         62MB      59.33    65.23    9.94%
IS         100MB     9.75     35.61    265.23%
LU         47MB      1020.07  1372.97  34.59%
SP         83MB      1342.75  1672.62  24.56%
EP         11MB      374.48   385.48   2.93%
Global System Management: Impact of TLB misses (single node)
[Figure: “Cycles for DGEMM 160” – cycles for CNK with no TLB misses and with 1kB, 4kB, 16kB, and 64kB pages, compared with Linux]
CNK with a 4kB page performs just like Linux
CNK with 64kB page performs just like CNK without TLBs
Normally, CNK does not have TLB misses (memory is directly mapped) – we added a TLB handler with the same overhead as Linux (~160 cycles) and varied the page size
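A back-of-the-envelope sketch of why larger pages help, assuming the ~160-cycle handler cost quoted above and an illustrative amount of memory touched. This is a rough lower-bound model of handler invocations, not the measured DGEMM experiment.

// Rough model: distinct pages touched (each costing at least one TLB-miss
// handler invocation when the working set exceeds the TLB reach) shrinks
// linearly with page size. Numbers are illustrative placeholders.
#include <cstdio>

int main() {
  const double handler_cycles = 160.0;                   // per-miss cost from the slide
  const double bytes_touched  = 3.0 * 1000 * 1000 * 8;   // e.g. three 1000x1000 double matrices
  const long   page_sizes[]   = {1024, 4096, 16384, 65536};

  for (long page : page_sizes) {
    double pages = bytes_touched / page;                 // distinct pages touched
    std::printf("%6ld B pages: >= %.0f pages touched, >= %.2e handler cycles\n",
                page, pages, pages * handler_cycles);
  }
  return 0;
}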
Global System Management: Impact of TLB misses (multi-node)
[Figure: Linpack on 512 nodes (CNK) – achieved Teraflops (roughly 1.5 to 2.1) for no TLB and for 1kB, 4kB, 16kB, 64kB, 256kB, and 1MB pages]
CNK with 64kB pages performs within 2% of no TLBs
CNK with 1MB page performs just like CNK without TLBs
Global System Management: Linux Everywhere Conclusions
A Linux Everywhere (I/O and compute nodes) solution for Blue Gene/L can significantly increase the spectrum of applications for Blue Gene/L
Significant challenges in scaling Linux to 100,000 processors, primarily due to interference on running applications
Large pages are effective in reducing the impact of TLB misses – 64kB pages seem like a good target
Next steps:
— Implement large-page support in Linux for Blue Gene/L
— Study other sources of interference (timers)
— Devise a file system solution
— Decide what to do about the lack of coherence
Global System Management: Parallel Aware Scheduling
Miranda (Instability & Turbulence): high-order hydrodynamics code for computing fluid instabilities and turbulent mix
Results @ 1024 Tasks (Favored Priority: 41, Unfavored Priority: 100, Percent to Application: 99.995%, Total Duration: 20 seconds)
Without Co-Scheduler: Mean 452.52, Standard Dev 108.45
With Co-Scheduler: Mean 254.45, Standard Dev 5.45
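An illustrative single-node sketch of the co-scheduling idea using standard POSIX priorities: hold the application at a favored priority for almost all of each cycle and step back briefly so system work runs in the same short window everywhere. The priority values, cycle lengths, and mechanism are placeholders, not the settings or scheduler used in the Miranda experiment, and cross-node coordination is not shown.

// Toy co-scheduler sketch (may require privileges to raise priority).
#include <sys/resource.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv) {
  if (argc < 2) { std::fprintf(stderr, "usage: cosched <pid>\n"); return 1; }
  pid_t app = static_cast<pid_t>(std::atoi(argv[1]));

  const int favored_nice   = -10;   // placeholder "favored" priority
  const int unfavored_nice =  10;   // placeholder "unfavored" priority

  for (;;) {
    // Most of the cycle: application runs at the favored priority.
    if (setpriority(PRIO_PROCESS, app, favored_nice) != 0) perror("setpriority");
    usleep(999000);

    // Brief window: step back so daemons and OS work can run, ideally at
    // the same point in the cycle on every node (coordination not shown).
    if (setpriority(PRIO_PROCESS, app, unfavored_nice) != 0) perror("setpriority");
    usleep(1000);
  }
}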
Recap Colony Project Goals & Approach
Status Update
Identify possibilities for Collaborations
Identify resources which may be of interest to other Projects
Needs
Early fault indicators
— What kinds of fault indicators can the O.S. provide?
— Can these be provided in a uniform interface?
O.S. support for task migration
— Is it possible to allocate/reserve virtual memory consistently and “globally”?
— How to handle thread migration in 32-bit systems?
O.S. support for load balancing
— Production of updated computational load and communication information
— Fast data collection (via extra-network support?)
Interfaces between schedulers and jobs
Collaboration Opportunities
Extending Linux for extreme scale
Addressing OS interference
Addressing how system resources are managed (processor virtualization and SSI strategies)
Migration (for both load-balancing and run through failure)
How far can we push a full-featured OS?
Recap Colony Project Goals & Approach
Status Update
Identify possibilities for Collaborations
Identify resources which may be of interest to other Projects
We can provide…
Several “scalable” applications from LLNL and UIUC
A full-featured OS comparison point (if given a metric or application)
Feedback on how novel ideas affect the key aspects of our approach (e.g. “That would affect our model of process migration in the following way…”)
And, as mentioned earlier, we’re very interested in collaborating in any of a number of areas…
Fault Tolerance Work
Example: sweep3d (Fortran, MPI)
— 150³ case, run on 8 Xeon processors, 32 VPs
[Figure: processor-utilization view after evacuation, with 6 processors remaining]
Parallel Resource Management Work
Example: LeanMD, communication-aware load balancing
Time per step (s) versus number of processors (HCA 30K atoms data, run on PSC LeMieux):

Processors     64      128     256     512
w/o LB         1.605   0.825   0.445   0.38
GreedyLB       0.96    0.515   0.28    0.175
GreedyCommLB   0.95    0.495   0.265   0.165
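A minimal sketch of a centralized greedy balancer in the spirit of GreedyLB: objects are sorted by measured load and each is placed on the currently least-loaded processor. The communication-aware variant (GreedyCommLB) additionally biases placement toward processors that already host an object's communication partners; that term is only indicated by a comment. Loads and processor counts are illustrative.

// Greedy load-balancing sketch.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

int main() {
  std::vector<double> object_load = {3.0, 1.5, 2.2, 0.7, 4.1, 1.1};  // measured loads
  const int num_procs = 3;

  std::sort(object_load.begin(), object_load.end(), std::greater<double>());

  // Min-heap of (current load, processor id).
  using Entry = std::pair<double, int>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
  for (int p = 0; p < num_procs; ++p) procs.push({0.0, p});

  for (double load : object_load) {
    Entry best = procs.top(); procs.pop();
    // A communication-aware variant would adjust the choice here using
    // bytes exchanged with already-placed objects, trading balance for locality.
    std::printf("object (load %.1f) -> processor %d\n", load, best.second);
    best.first += load;
    procs.push(best);
  }
  return 0;
}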
Parallel Resource Management Work
Example: Topology-aware load balancing
[Figure: average hops per byte (0 to 16) for Random versus TopoLB mappings on Torus2D, Torus3D, and Torus4D topologies]
LeanMD code, 1024-processor run with the HCA atom benchmark
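A small self-contained sketch of the hop-bytes metric that a topology-aware balancer such as TopoLB tries to minimize: bytes exchanged by each pair of communicating objects multiplied by the torus distance between the processors they are mapped to. The torus dimensions and communication graph below are illustrative.

// Hop-bytes computation on a 3D torus.
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

int torus_hops(std::array<int,3> a, std::array<int,3> b, std::array<int,3> dims) {
  int hops = 0;
  for (int d = 0; d < 3; ++d) {
    int diff = std::abs(a[d] - b[d]);
    hops += std::min(diff, dims[d] - diff);   // wrap-around links on a torus
  }
  return hops;
}

int main() {
  std::array<int,3> dims = {8, 8, 8};          // e.g. a 512-node torus
  struct Edge { std::array<int,3> src, dst; std::int64_t bytes; };
  std::vector<Edge> comm = {                   // illustrative communication graph
      {{0,0,0}, {0,0,1}, 1 << 20},
      {{0,0,0}, {4,4,4}, 1 << 18},
  };

  double hop_bytes = 0, total_bytes = 0;
  for (const Edge &e : comm) {
    hop_bytes   += double(torus_hops(e.src, e.dst, dims)) * e.bytes;
    total_bytes += e.bytes;
  }
  std::printf("average hops per byte: %.3f\n", hop_bytes / total_bytes);
  return 0;
}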
Global System Management: Limitations of the CNK Solution
Less than half of the Linux system calls are supported – in particular, the following are missing:
— Process and thread creation
— Server-side sockets
— Shared memory segments / memory-mapped files
CNK requires a cross-compilation environment, which has been a challenge for some applications
CNK also requires its own separate set of run-time libraries – maintenance and test issue
We could keep extending CNK, but that would eventually defeat its very reason for being (simplicity)
Instead, we want to investigate a “Linux Everywhere” solution for Blue Gene/L – Linux in I/O and compute nodes
Global System Management: Linux on the Compute Nodes
Our priority is to guarantee scaling of “well behaved” Linux use in the compute nodes:
— Small number of processes/threads
— No daemons (leave them to the I/O nodes)
— Rely on MPI for high-speed communication
Ideally, MPI applications with one or two tasks per node will perform just as well with Linux as with the Compute Node Kernel
TCP and UDP support over the Blue Gene/L networks will be important for a broader set of applications that do not use MPI
For now, our study has focused on reducing the impact of TLB misses
Global System Management: Blue Gene/L Compute Node Kernel
The Blue Gene/L Compute Node Kernel was developed from scratch – simple and deterministic for scalability and reliability (about 13,000 lines of code)
Every user-level thread is backed by a processor – deterministic execution and no processor sharing
The only timer interrupt is for counter virtualization every 6 seconds, which is synchronous across the partition
No TLB misses – memory is directly mapped
This deterministic execution has been the key to scalability of (for example) the FLASH code
Implements a subset of Linux system calls – complex calls (mostly I/O) are actually executed on the I/O node
GNU and XL run-time libraries ported to CNK – many applications have required just “compile and go”
Global System Management: Blue Gene/L Kernels
In Blue Gene/L, Linux is used in a limited role
— Runs on the I/O nodes, in support of file I/O, socket operations, job control, and process debugging
— Linux on the I/O nodes acts as an extension of the Compute Node Kernel – operations not directly supported by the CNK (e.g., file I/O) are function shipped for execution on the I/O node (see the sketch below)
— Linux is also used on the front-end nodes (job compilation, submission, debugging) and service nodes (machine control, machine monitoring, job scheduling)
Compute nodes run the lightweight Compute Node Kernel for reliability and scalability
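A conceptual sketch of function shipping, assuming a toy request/reply format: the compute-node stub packs an unsupported call into a request and hands it to the I/O node for execution under Linux. The real CNK/I/O-node protocol and the Blue Gene/L network transport are not reproduced here; the "send and wait" step is stubbed out as a direct call.

// Function-shipping sketch (transport and message format are stand-ins).
#include <cstddef>
#include <cstdio>
#include <string>

struct IoRequest { int syscall_no; int fd; std::string payload; };
struct IoReply   { long result; };

// Stand-in for "execute on the I/O node under Linux".
IoReply io_node_execute(const IoRequest &req) {
  if (req.syscall_no == 4 /* illustrative tag for write */)
    return { static_cast<long>(std::fwrite(req.payload.data(), 1,
                                           req.payload.size(), stdout)) };
  return { -1 };
}

// Compute-node stub: looks like a local write() to the application.
long shipped_write(int fd, const void *buf, std::size_t count) {
  IoRequest req{4, fd, std::string(static_cast<const char *>(buf), count)};
  IoReply   rep = io_node_execute(req);   // stands in for send + wait for reply
  return rep.result;
}

int main() {
  const char msg[] = "hello from a compute node\n";
  return shipped_write(1, msg, sizeof(msg) - 1) > 0 ? 0 : 1;
}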
Global System Management: Scalability of CNK Solution (FLASH)
[Figure: “Sod Discontinuity Scaling Test - Total Time” – total time in seconds (0 to 800) versus number of processors (1 to 100,000, log scale) for QSC, Seaborg, Jazz/Myrinet2000, MCR/1proc/Quadrics, Watson BG/L, and Big BG/L]
Global System Management: Parallel Aware Scheduling
Miranda (Instability & Turbulence)
— High-order hydrodynamics code for computing fluid instabilities and turbulent mix
— Employs FFTs and band-diagonal matrix solvers to compute spectrally-accurate derivatives, combined with high-order integration methods for time advancement
— Contains solvers for both compressible and incompressible flows
— Has been used primarily for studying Rayleigh-Taylor (R-T) and Richtmyer-Meshkov (R-M) instabilities, which occur in supernovae and Inertial Confinement Fusion (ICF)
Top Down (start with Full-Featured OS)
Why?
Broaden domain of applications that can run on the most powerful machines through OS support
— More general approaches to processor virtualization, load balancing and fault tolerance
— Increased interest in applications such as parallel discrete event simulation
— Multi-threaded apps and libraries
Why not?
Difficult to sustain performance with increasing levels of parallelism
— Many parallel algorithms are extremely sensitive to serializations
— We will address this difficulty with parallel aware scheduling
Question: How much should system software offer in terms of features?
Answer: Everything required, and as much desired as possible.
Parallel Resource Management: Processor Virtualization
Divide the computation into a large number of pieces
— Independent of the number of processors
Let the runtime system map objects to processors
Implementations: Charm++, Adaptive-MPI (AMPI)
[Figure: user view of many objects versus the system implementation mapping them onto processors P0, P1, P2]
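A skeletal Charm++ program illustrating over-decomposition: the main chare creates many more array elements than processors and lets the runtime place (and later migrate) them. The module name and the generated virt.decl.h/virt.def.h headers are assumed from the interface file sketched in the comment; termination logic is omitted.

// Over-decomposition sketch in Charm++.
//
//   // virt.ci (hypothetical module)
//   mainmodule virt {
//     mainchare Main { entry Main(CkArgMsg *m); };
//     array [1D] Worker { entry Worker(); entry void doWork(); };
//   };
#include "virt.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    int pieces = 16 * CkNumPes();                    // many pieces per processor
    CProxy_Worker workers = CProxy_Worker::ckNew(pieces);
    workers.doWork();                                // broadcast to all elements
    delete m;
    // (Termination via CkExit() after a reduction is omitted in this sketch.)
  }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage *m) : CBase_Worker(m) {}
  void doWork() {
    CkPrintf("element %d running on processor %d\n", thisIndex, CkMyPe());
  }
};

#include "virt.def.h"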
Parallel Resource Management: AMPI (MPI with Virtualization)
Each MPI process implemented as a user-level thread embedded in a Charm++ object
[Figure: MPI “processes” implemented as virtual processes (user-level migratable threads) mapped onto the real processors]
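Under AMPI, an unmodified MPI program such as the ring exchange below runs with each rank as a user-level, migratable thread, so more ranks than physical processors can be requested at launch (e.g. something along the lines of "./charmrun +p8 ./ring +vp32"; the exact flags depend on the AMPI version).

// Plain MPI ring exchange; nothing AMPI-specific in the source.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // reports virtual ranks under AMPI

  // Each rank sends its id to the next rank and receives from the previous one.
  int token = rank, received = -1;
  int next = (rank + 1) % size, prev = (rank + size - 1) % size;
  MPI_Sendrecv(&token, 1, MPI_INT, next, 0,
               &received, 1, MPI_INT, prev, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  std::printf("rank %d of %d received token from rank %d\n", rank, size, received);

  MPI_Finalize();
  return 0;
}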
Research Using Processor Virtualization
Efficient resource management
— Dynamic load balancing based on object migration
— Optimized inter-processor communication
— Measurement-based runtime decisions
– Communication volumes
– Computational loads
— Focus on highly scalable strategies
– Centralized / Distributed / Hybrid
Fault-tolerance approaches for large systems
— Proactive reaction to impending faults
– Migrate objects when a fault is imminent
– Keep “good” processors running at full pace
– Refine load balance after migrations
– Appropriate for system failures affecting a small subset of a large job
— Automatic checkpointing / fault-detection / restart
– In-memory checkpointing of objects (see the sketch below)
– Using message logging to tolerate frequent faults in a scalable fashion
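A self-contained sketch of the double in-memory checkpointing idea: each object keeps one checkpoint copy locally and one on a "buddy" processor, so the loss of any single processor leaves every object recoverable. Storage is modeled with plain vectors; no real communication or the actual Charm++ protocol is shown, and the buddy choice is illustrative.

// Buddy-based in-memory checkpointing sketch.
#include <cstdio>
#include <vector>

struct Checkpoint { int object_id; std::vector<double> state; };

int main() {
  const int num_procs = 4;
  // checkpoint_store[p] holds the checkpoints resident in processor p's memory.
  std::vector<std::vector<Checkpoint>> checkpoint_store(num_procs);

  // Each object (owned by processor `home`) checkpoints locally and on a buddy.
  for (int object_id = 0; object_id < 8; ++object_id) {
    int home  = object_id % num_procs;
    int buddy = (home + 1) % num_procs;           // simple buddy choice
    Checkpoint cp{object_id, std::vector<double>(3, object_id * 1.0)};
    checkpoint_store[home].push_back(cp);
    checkpoint_store[buddy].push_back(cp);
  }

  // If processor 2 fails, every object it hosted still has a surviving copy.
  const int failed = 2;
  for (int p = 0; p < num_procs; ++p) {
    if (p == failed) continue;
    for (const Checkpoint &cp : checkpoint_store[p])
      if (cp.object_id % num_procs == failed)
        std::printf("object %d recoverable from processor %d\n", cp.object_id, p);
  }
  return 0;
}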
Leverage Advantages of Full Featured OSs
Make Available on BlueGene/L Compute Nodes
Enable a large scale test bed for our work
Increase flexibility for target applications
Global System Management: Single System Image
Move from Physical view of machine to Logical View of machine
In a single process space model, all processes running on the parallel machine belong to a single pool of processes. A process can interact with any other process independent of their physical location on the machine. The single process space mechanism will implement the following process services:
By coalescing several individual operations into one collective operation, and raising the semantic level of the operations, the aggregate model will allow us to address performance issues that arise.
Global System Management: The Logical View…
Single Process Space
— Process query: These services provide information about which processes are running on the machine, who they belong to, how they are grouped into jobs, and how many resources they are using.
— Process creation: These services support the creation of new processes and their grouping into jobs.
— Process control: These services support suspension, continuation, and termination of processes.
— Process communication: These services implement communications between processes.
Single File Space
— Any completely qualified file descriptor (e.g., “/usr/bin/ls”) represents exactly the same file to all the processes running on the machine.
Single Communication Space
— We will provide mechanisms by which any two processes can establish a channel between them.
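As a rough illustration of how these services might be surfaced to tools or schedulers, here is a hypothetical interface sketch; none of these type or method names come from the Colony design, they only mirror the service groups listed above.

// Hypothetical single-process-space interface sketch.
#include <cstdint>
#include <string>
#include <vector>

struct GlobalProcess {
  std::uint64_t global_pid;   // unique across the whole machine
  int           node;         // physical location, hidden from most callers
  std::string   owner;
  std::uint64_t job_id;
  double        cpu_seconds;
};

class SingleProcessSpace {
public:
  // Process query: enumerate processes machine-wide, independent of node.
  virtual std::vector<GlobalProcess> query(const std::string &owner) = 0;
  // Process creation: start `count` processes as one job; placement is chosen by the system.
  virtual std::uint64_t spawn_job(const std::string &command, int count) = 0;
  // Process control: suspend/continue/terminate by global pid.
  virtual bool signal(std::uint64_t global_pid, int signo) = 0;
  // Process communication: open a channel between any two processes.
  virtual int connect(std::uint64_t from_pid, std::uint64_t to_pid) = 0;
  virtual ~SingleProcessSpace() = default;
};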
Utilize Blue Gene
Port, optimize, and scale existing Charm++/AMPI applications on BlueGene/L
– Molecular Dynamics: NAMD (Univ. Illinois)
  – Collection of (charged) atoms, with bonds
  – Simulations with millions of timesteps desired
– Cosmology: PKDGRAV (Univ. Washington)
  – N-body problem, with gravitational forces
  – Simulation and analysis/visualization done in parallel
– Quantum Chemistry: CPAIMD (IBM/NYU/others)
  – Car-Parrinello Ab Initio Molecular Dynamics
  – Fine-grain parallelization, long-time simulations
Most of these efforts will leverage current collaborations/grants
Task Migration Time in Charm++
Migration time for 5-point stencil, 16 processors