HPC-Colony


Transcript of HPC-Colony

www.HPC-Colony.org

June 2005 PI Meeting

Terry Jones, LLNL, Coordinating PI
Laxmikant Kale, UIUC, PI
Jose Moreira, IBM, PI
Celso Mendes, UIUC
Sayantan Chakravorty, UIUC
Todd Inglett, IBM
Andrew Tauferner, IBM


Outline

Recap Colony Project Goals & Approach
Status Update
Identify possibilities for Collaborations
Identify resources which may be of interest to other Projects




Colony Project Overview

Title: Services and Interfaces to Support Systems with Very Large Numbers of Processors

Collaborators:
Lawrence Livermore National Laboratory (at mtg: Terry Jones)
University of Illinois at Urbana-Champaign (at mtg: Laxmikant Kale, Celso Mendes)
International Business Machines

Topics:
Scalable Load Balancing
OS mechanisms for Migration
Processor Virtualization for Fault Tolerance
Single system management space
Parallel Awareness and Coordinated Scheduling of Services
Linux OS for Blue Gene-like machines


Motivation & Goals

Parallel Resource Management
— Today, application programmers must explicitly manage these complex resources. We address scaling issues and porting issues by delegating resource management tasks to a sophisticated parallel OS.
— “Managing Resources” includes balancing CPU time, network utilization, and memory usage across the entire machine.

Global System Management
— Linux Everywhere
— Enhance operating system support for parallel execution by providing coordinated scheduling and improved management services for very large machines.
— Virtual memory management across the entire system.

Fault Tolerance
— Scalable strategies for dealing with faults. Checkpoint/restart may not be adequate in all cases (and it may be too expensive).


Approach

Top Down
— Our work will start from an existing full-featured OS and remove excess baggage with a “top down” approach.

Processor virtualization
— One of our core techniques: the programmer divides the computation into a large number of entities, which are mapped to the available processors by an intelligent runtime system.

Leverage Advantages of Full Featured OS & Single System Image
— Applications on these extreme-scale systems will benefit from extensive services and interfaces; managing these complex systems will require an improved “logical view”.

Utilize Blue Gene
— Suitable platform for ideas intended for very large numbers of processors.


Recap Colony Project Goals & Approach
Status Update
Identify possibilities for Collaborations
Identify resources which may be of interest to other Projects


Recent Accomplishments: Project Overview

Started consolidating previous fault-tolerance work
— Checkpoint/Restart Techniques
— In-Memory Checkpointing
— Preliminary Sender-based Message Logging Scheme

Continued parallel resource-management work
— Measurement-based load balancing
– New classes of balancers designed & implemented
— Support for object & thread migration on BG/L

Global system management work
— Linux to BG/L compute nodes
– Comparison performance measurements
— Parallel aware scheduling


Fault Tolerance Work: Status

Current focus: Proactive fault-tolerance
— Approach: Migrate tasks away from processors where faults are imminent
— Rationale: Many faults may have advance indicators
– Hardware / O.S. notification
— Basis: Processor virtualization (many virtual processors mapped to physical processors)
— Implementation: “anytime-migration” support in Charm++ and AMPI (for MPI codes) used for processor evacuation

Tests: Jacobi (C,MPI), Sweep3d (Fortran,MPI)
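To make the migration-on-warning flow concrete, here is a minimal C++ sketch, assuming a hypothetical warning callback and a trivial choice of target processor; the real mechanism is the anytime-migration support in Charm++/AMPI, whose API is not shown here.

// Conceptual sketch of proactive evacuation (hypothetical names, not the Charm++/AMPI interface).
#include <cstdio>
#include <vector>

struct VirtualProcessor { int id; int host; };   // a migratable work unit

static std::vector<VirtualProcessor> vps;        // many VPs mapped to few physical processors

// Called when hardware / the OS signals that processor 'failing' is likely to die soon.
void on_fault_warning(int failing, int num_hosts) {
    int target = (failing + 1) % num_hosts;      // any healthy processor would do
    for (auto &vp : vps) {
        if (vp.host == failing) {
            vp.host = target;                    // evacuate this object
            std::printf("migrated VP %d away from processor %d to %d\n", vp.id, failing, target);
        }
    }
    // A subsequent load-balancing pass evens out the extra load on the survivors.
}

int main() {
    const int hosts = 8;
    for (int i = 0; i < 32; ++i) vps.push_back({i, i % hosts});  // 32 VPs on 8 processors
    on_fault_warning(3, hosts);
    return 0;
}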


Fault Tolerance Work

Example: Jacobi-relaxation 2D (C,MPI)
— 6,000x6,000 dataset, run on 8 Xeon processors

[Figure: processor-utilization timeline showing the fault warning and migration, the 7 remaining processors, and the subsequent load-balancing step]


Fault Tolerance Work

Example: sweep3d
— Processor-utilization snapshots

[Figure: utilization snapshots prior to the fault, after the first fault/migration (showing the load from the failed processor), and after load-balancing]


Fault Tolerance: Recent Accomplishments

Proactive Fault-Tolerance Lessons:
— Evacuation time is proportional to dataset size per processor and inversely proportional to network speed – good scalability
— Subsequent load-balancing step is critical for good performance
— Current scheme can tolerate multiple faults that are not simultaneous

Current Status:
— Working on selecting appropriate load balancers
— Analyzing how to improve performance between evacuation and next load-balancing
— Reducing time between notice of impending failure and stable post-evacuation state
— Integrating into the regular Charm++/AMPI distribution
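As a rough illustration of the first lesson above, evacuation time can be estimated as the per-processor data volume divided by the network bandwidth. The numbers in the sketch below are placeholders, not project measurements.

// Toy estimate of evacuation time: the failing processor's data must cross the network,
// so t_evac ~ bytes_per_processor / bandwidth. All values here are illustrative.
#include <cstdio>

int main() {
    const double dataset_bytes = 6000.0 * 6000.0 * 8.0;  // e.g. a 6000x6000 array of doubles
    const double processors    = 8.0;
    const double bandwidth_Bps = 100e6;                  // placeholder: ~100 MB/s link

    double bytes_per_pe = dataset_bytes / processors;
    double t_evac = bytes_per_pe / bandwidth_Bps;         // seconds, ignoring latency
    std::printf("~%.0f MB per processor -> ~%.2f s to evacuate\n", bytes_per_pe / 1e6, t_evac);
    return 0;
}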


Parallel Resource Management Work

Recent focus: Preliminary work on Advanced Load Balancers
— Multi-Phase load balancing
– balance for each program phase separately
– phases may be identified by user-inserted calls
– sample results following
— Asynchronous load balancing
– hiding load balancing overhead
– overlap of load balancing and computation
— Topology-aware load balancing
– considers network topology
– major goal: optimize the hop-bytes metric (see the sketch below)
– ongoing work
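The hop-bytes metric targeted by the topology-aware balancer is the sum, over all messages, of message bytes times the number of network hops between the sender's and receiver's processors. A small sketch for a 3D torus, with made-up placement and message data:

// Hop-bytes on a 3D torus: sum over messages of bytes * hop distance.
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Coord { int x, y, z; };
struct Message { int src_obj, dst_obj; long bytes; };

// Shortest hop count along one wrap-around (torus) dimension.
int torus_dist(int a, int b, int dim) {
    int d = std::abs(a - b);
    return std::min(d, dim - d);
}

int hops(const Coord &a, const Coord &b, const Coord &dims) {
    return torus_dist(a.x, b.x, dims.x) + torus_dist(a.y, b.y, dims.y) + torus_dist(a.z, b.z, dims.z);
}

int main() {
    const Coord dims{8, 8, 8};                         // a 512-node 3D torus
    // placement[obj] = torus coordinate of the processor the object is mapped to
    std::vector<Coord> placement = {{0, 0, 0}, {7, 0, 0}, {3, 4, 1}};
    std::vector<Message> msgs = {{0, 1, 1 << 20}, {1, 2, 4 << 20}};

    long long hop_bytes = 0;
    for (const auto &m : msgs)
        hop_bytes += (long long)m.bytes * hops(placement[m.src_obj], placement[m.dst_obj], dims);
    std::printf("hop-bytes = %lld\n", hop_bytes);       // the quantity a topology-aware LB minimizes
    return 0;
}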


Parallel Resource Management Work

Example: Phase load balancing

[Figure: per-processor timelines (Processor 0, Processor 1) over Phase 1 and Phase 2, comparing the original execution, a schedule with good balancing but bad performance, and phase-by-phase balancing]


Parallel Resource Management Work

Example: synthetic 2-phase program

Tests: 32 PEs, Xeon cluster

[Figure: utilization timelines with no load balance, with the greedy load balancer, and with the multi-phase load balancer; utilization labels of 60% and 70% mark two of the runs]


Parallel Resource Management: Recent Accomplishments

Load Balancing Trade-offs:
— Trade-off in centralized load balancers
– good balance can be achieved
– expensive memory and network usage for LB
— Trade-off in fully distributed load balancers
– cheap in terms of memory/network usage
– may be slow to achieve acceptable balance

Current Status:
— Working on hybrid load-balancing schemes
– Idea: divide processors into a hierarchy, balance load at each level first, then across levels (a sketch follows below)
— Integrating balancers into the regular Charm++/AMPI
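A conceptual sketch of the hybrid idea: balance greedily inside each processor group, then compare group totals and move work only between groups. The grouping and the greedy strategy here are illustrative stand-ins, not the project's actual balancers.

// Hybrid load-balancing sketch: per-group greedy balancing first, then a cheaper
// cross-group comparison, instead of shipping all load data to one place.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Greedy balance of object loads onto 'pes' processors; returns per-PE load.
std::vector<double> greedy_balance(std::vector<double> objs, int pes) {
    std::vector<double> load(pes, 0.0);
    std::sort(objs.rbegin(), objs.rend());                 // heaviest objects first
    for (double w : objs)
        *std::min_element(load.begin(), load.end()) += w;  // assign to least-loaded PE
    return load;
}

int main() {
    // Two groups of 4 PEs; each group balances its own objects locally.
    std::vector<std::vector<double>> group_objs = {{5, 4, 3, 3, 2, 1}, {9, 8, 2, 1}};
    std::vector<double> group_total;
    for (auto &objs : group_objs) {
        auto load = greedy_balance(objs, 4);
        group_total.push_back(std::accumulate(load.begin(), load.end(), 0.0));
    }
    // A second pass would migrate work between the groups whose totals differ most.
    std::printf("group loads: %.1f vs %.1f\n", group_total[0], group_total[1]);
    return 0;
}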


Global System Management: Recent Accomplishments

Defined team for Blue Gene/L Linux Everywhere
Research and development of Linux Everywhere
— Limitations of the Compute Node Kernel
— Motivation for Linux on the compute nodes
— Challenges of a Linux solution for Blue Gene/L
— Initial experiments and results

Measurements for Parallel Aware Scheduling


Global System Management: Linux Everywhere Solution for BG/L

Would significantly extend the spectrum of applications that can be easily ported to Blue Gene/L – more penetration in industrial and commercial markets
Multiple processes/threads per compute node
Leverage set of middleware (libraries, run-time) from I/O node to compute node
An alternative to cross-compilation environments
Platform for research on extreme scalability of Linux
Possible solution for follow-on machine (Blue Gene/P)


Global System Management: Challenges of Linux Everywhere

Scalability challenges:
— Will the more asynchronous nature of Linux lead to scalability problems?
— Sources of asynchronous events in Linux: timer interrupts, TLB misses, process/thread scheduler, I/O (device) events
— Parallel file system for 100,000+ processors?
— If Linux cannot scale to the 10,000-100,000 processor range, can it be a solution for smaller systems?

Reliability challenges:
— Despite some difficulties with file system and Ethernet devices, Linux on BG/L has not presented reliability problems
— Test, test, test, and more testing…


Global System Management: Full Linux on Blue Gene Nodes

5 out of 8 NAS class A benchmarks fit in 256MB
Same compile (gcc -O3) and execute options, and similar libraries (GLIBC)
No daemons or extra processes running in Linux; mainly user space code

NAS Serial Benchmark (class A) Results: execution time in seconds

Benchmark   BSS size   CNK       Linux     Difference
CG          62MB       59.33     65.23     9.94%
IS          100MB      9.75      35.61     265.23%
LU          47MB       1020.07   1372.97   34.59%
SP          83MB       1342.75   1672.62   24.56%
EP          11MB       374.48    385.48    2.93%

— difference suspected to be caused by handling of TLB misses


Global System Management: Impact of TLB misses (single node)

[Chart: "Cycles for DGEMM 160" (0 to 1.6e11 cycles) for CNK with no TLB misses, CNK with 1kB, 4kB, 16kB, and 64kB pages, and Linux]

CNK with a 4kB page performs just like Linux
CNK with a 64kB page performs just like CNK without TLBs
Normally, CNK does not have TLB misses (memory is directly mapped) – we added a TLB handler with the same overhead as Linux (~160 cycles) and varied the page size
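Simple arithmetic shows why larger pages help: with a fixed-size TLB, the amount of the working set the TLB can cover grows with the page size. The TLB capacity and working set below are assumptions for illustration, not BG/L hardware parameters.

// Pages needed to map a working set at several page sizes, and how much data a
// fixed-size TLB can then cover. The 64-entry TLB and 200 MB working set are
// illustrative assumptions only.
#include <cstdio>

int main() {
    const long long working_set = 200LL << 20;   // assume ~200 MB touched by DGEMM
    const long long tlb_entries = 64;            // hypothetical TLB capacity
    const long long page_sizes[] = {1 << 10, 4 << 10, 16 << 10, 64 << 10, 1 << 20};

    for (long long ps : page_sizes) {
        long long pages = (working_set + ps - 1) / ps;
        std::printf("%8lld B pages: %9lld pages needed, TLB covers %lld bytes\n",
                    ps, pages, tlb_entries * ps);
    }
    // With 1 kB pages this example TLB covers only 64 kB, so a large DGEMM misses
    // constantly; with 64 kB to 1 MB pages it covers 4-64 MB, matching the trend
    // in the CNK/Linux comparison above.
    return 0;
}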


Global System Management: Impact of TLB misses (multi-node)

[Chart: Linpack on 512 nodes (CNK), 1.5 to 2.1 Teraflops, for No TLB and 1kB, 4kB, 16kB, 64kB, 256kB, and 1MB pages]

CNK with 64kB pages performs within 2% of no TLBs
CNK with a 1MB page performs just like CNK without TLBs


Global System Management: Linux Everywhere Conclusions

A Linux Everywhere (I/O and compute nodes) solution for Blue Gene/L can significantly increase the spectrum of applications for Blue Gene/L
Significant challenges in scaling Linux to 100,000 processors, primarily due to interference with running applications
Large pages are effective in reducing the impact of TLB misses – 64kB pages seem like a good target

Next steps:
— Implement large page support in Linux for Blue Gene/L
— Study other sources of interference (timers)
— Devise file system solution
— Decide what to do with lack of coherence


Global System Management: Parallel Aware Scheduling

Miranda -- Instability & Turbulence: high order hydrodynamics code for computing fluid instabilities and turbulent mix

Results @ 1024 Tasks
Favored Priority: 41
Unfavored Priority: 100
Percent to Application: 99.995%
Total Duration: 20 seconds
Without Co-Scheduler: Mean 452.52, Standard Dev 108.45
With Co-Scheduler: Mean 254.45, Standard Dev 5.45
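One plausible shape for such a co-scheduler, inferred only from the favored/unfavored priority and application-share figures above: every node runs the application at the favored priority for almost all of each period and drops it to the unfavored priority during a short, globally synchronized slice, so that OS noise lands at the same time everywhere. The set_task_priority() stub, the one-second period, and the whole structure are assumptions for illustration; the slides do not describe the actual mechanism.

// Coordinated-scheduling sketch (illustrative only; not the actual co-scheduler).
#include <chrono>
#include <cstdio>
#include <thread>

void set_task_priority(int prio) {                // hypothetical stub; a real version would call the OS
    std::printf("priority -> %d\n", prio);
}

int main() {
    const int favored = 41, unfavored = 100;      // values quoted on the results slide
    const double app_share = 0.99995;             // "Percent to Application"
    const auto period = std::chrono::milliseconds(1000);  // assumed scheduling period

    for (int cycle = 0; cycle < 3; ++cycle) {     // a few periods, for demonstration
        set_task_priority(favored);               // application gets almost the whole period
        std::this_thread::sleep_for(
            std::chrono::duration_cast<std::chrono::milliseconds>(period * app_share));
        set_task_priority(unfavored);             // synchronized slice for daemons / noise
        std::this_thread::sleep_for(
            std::chrono::duration_cast<std::chrono::milliseconds>(period * (1.0 - app_share)));
    }
    return 0;
}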


Recap Colony Project Goals & Approach
Status Update
Identify possibilities for Collaborations
Identify resources which may be of interest to other Projects


Needs

Early fault indicators
— What kinds of fault indicators can the O.S. provide?
— Can these be provided in a uniform interface?

O.S. support for task migration
— Is it possible to allocate/reserve virtual memory consistently and “globally”?
— How to handle thread migration in 32-bit systems?

O.S. support for load-balancing
— Production of updated computational load and communication information
— Fast data collection (via extra-network support?)

Interfaces between schedulers and jobs


Collaboration Opportunities

Extending Linux for extreme scale

Addressing OS interference

Addressing how system resources are managed (processor virtualization and SSI strategies)

Migration (for both load-balancing and run through failure)

How far can we push a full-featured OS?


Recap Colony Project Goals & Approach
Status Update
Identify possibilities for Collaborations
Identify resources which may be of interest to other Projects


We can provide…

Several “scalable” applications from LLNL and UIUC

A full-featured OS comparison point (if given a metric or application)

Feedback on how novel ideas affect the key aspects of our approach (e.g. “That would affect our model of process migration in the following way…”)

And, as mentioned earlier, we’re very interested in collaborating in any of a number of areas…


Extra Viewgraphs


Fault Tolerance Work

Example: sweep3d (Fortran,MPI)
— 150^3 case, run on 8 Xeon processors, 32 VPs

[Figure: processor utilization with 6 processors remaining after evacuation]


Parallel Resource Management Work

Example: LeanMD, communication-aware load balancing

HCA 30K atoms data, run on PSC LeMieux

Time per step (s) by number of processors:

Balancer        64      128     256     512
w/o LB          1.605   0.825   0.445   0.38
GreedyLB        0.96    0.515   0.28    0.175
GreedyCommLB    0.95    0.495   0.265   0.165


Parallel Resource Management Work

Example: Topology-aware load balancing

[Chart: average hops per byte (0 to 16) for Random placement vs. TopoLB on Torus2D, Torus3D, and Torus4D topologies]

LeanMD code, 1024 processor run with HCA atom benchmark


Global System Management: Limitations of the CNK Solution

Less than half of the Linux system calls are supported – in particular the following are missing:
— Process and thread creation
— Server-side sockets
— Shared memory segments / memory-mapped files

CNK requires a cross-compilation environment, which has been a challenge for some applications
CNK also requires its own separate set of run-time libraries – maintenance and test issue
We could keep extending CNK, but that would eventually defeat its very reason for being (simplicity)
Instead, we want to investigate a “Linux Everywhere” solution for Blue Gene/L – Linux in I/O and compute nodes


Global System Management: Linux on the Compute Nodes

Our priority is to guarantee scaling of “well behaved” Linux use in the compute nodes:
— Small number of processes/threads
— No daemons (leave them to the I/O nodes)
— Rely on MPI for high-speed communication

Ideally, MPI applications with one or two tasks per node will perform just as well with Linux as with the Compute Node Kernel
TCP and UDP support over the Blue Gene/L networks will be important for a broader set of applications that do not use MPI
For now, our study has focused on reducing the impact of TLB misses


Global System Management: Blue Gene/L Compute Node Kernel

The Blue Gene/L Compute Node Kernel was developed from scratch – simple and deterministic for scalability and reliability (about 13,000 lines of code)
Every user-level thread is backed by a processor – deterministic execution and no processor sharing
Only timer interrupt is counter virtualization every 6 seconds, which is synchronous across the partition
No TLB misses – memory is directly mapped
This deterministic execution has been the key to scalability of (for example) the FLASH code
Implements a subset of Linux system calls – complex calls (mostly I/O) are actually executed on the I/O node
GNU and XL run-time libraries ported to CNK – many applications have required just “compile and go”
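The function-shipping arrangement described above can be pictured as the compute-node kernel packaging an I/O system call into a request message that the I/O node executes. The request layout and helper names below are invented for illustration; the real CNK protocol and transport are not shown here.

// Function-shipping sketch: a compute node packages a write() request and an
// I/O-node stand-in performs the real call. The message format is illustrative only.
#include <cstdio>
#include <cstring>
#include <string>

struct IoRequest {            // what a compute node might ship to its I/O node
    int  opcode;              // e.g. 1 = write
    int  fd;
    char data[64];
    int  len;
};

// Stand-in for the I/O-node side: performs the real (POSIX-like) operation.
int io_node_serve(const IoRequest &req) {
    if (req.opcode == 1)
        return (int)std::fwrite(req.data, 1, req.len, stdout);  // the "real" I/O happens here
    return -1;
}

// Stand-in for the compute-node kernel: ships the call instead of doing I/O locally.
int shipped_write(int fd, const char *buf, int len) {
    IoRequest req{1, fd, {}, len};
    std::memcpy(req.data, buf, len);
    return io_node_serve(req);    // in reality: sent over the network, wait for the reply
}

int main() {
    const std::string msg = "hello from a compute node\n";
    shipped_write(1, msg.c_str(), (int)msg.size());
    return 0;
}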


Global System Management: Blue Gene/L Kernels

In Blue Gene/L, Linux is used in a limited role
— Runs on the I/O nodes, in support of file I/O, socket operations, job control, and process debugging
— Linux on the I/O nodes acts as an extension of the Compute Node Kernel – operations not directly supported by the CNK (e.g., file I/O) are function shipped for execution on the I/O node
— Linux also used in the front-end nodes (job compilation, submission, debugging) and service nodes (machine control, machine monitoring, job scheduling)

Compute nodes run the lightweight Compute Node Kernel for reliability and scalability


Global System Management: Scalability of CNK Solution (FLASH)

[Chart: "Sod Discontinuity Scaling Test - Total Time"; total time (s, 0 to 800) vs. number of processors (1 to 100,000) for QSC, Seaborg, Jazz/Myrinet2000, MCR/1proc/Quadrics, Watson BG/L, and Big BG/L]


Global System Management: Parallel Aware Scheduling

Miranda -- Instability & Turbulence: high order hydrodynamics code for computing fluid instabilities and turbulent mix. Employs FFTs and band-diagonal matrix solvers to compute spectrally-accurate derivatives, combined with high-order integration methods for time advancement. Contains solvers for both compressible and incompressible flows. Has been used primarily for studying Rayleigh-Taylor (R-T) and Richtmyer-Meshkov (R-M) instabilities, which occur in supernovae and Inertial Confinement Fusion (ICF).


Top Down (start with Full-Featured OS)

Why?

Broaden domain of applications that can run on the most powerful machines through OS support

— More general approaches to processor virtualization, load balancing and fault tolerance

— Increased interest in applications such as parallel discrete event simulation

— Multi-threaded apps and libraries

Why not?

Difficult to sustain performance with increasing levels of parallelism

— Many parallel algorithms are extremely sensitive to serializations

— We will address this difficulty with parallel aware scheduling

Question: How much should system software offer in terms of features?
Answer: Everything required, and as much desired as possible


Parallel Resource Management: Processor Virtualization

Divide the computation into a large number of pieces
— Independent of the number of processors
Let the runtime system map objects to processors
Implementations: Charm++, Adaptive-MPI (AMPI)

[Figure: user view (many objects) vs. system implementation (objects mapped onto processors P0, P1, P2)]
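In schematic form, virtualization means the program creates many more objects than processors, and a runtime layer owns (and may later change) the object-to-processor mapping. The sketch below is a plain C++ stand-in for that idea, not Charm++ code.

// Over-decomposition sketch: N_objects >> N_processors; the runtime, not the
// application, decides (and may later change) where each object runs.
#include <cstdio>
#include <map>

class Runtime {                              // stand-in for the Charm++/AMPI runtime
public:
    explicit Runtime(int pes) : pes_(pes) {}
    int where(int obj) const {               // current object -> processor mapping
        auto it = moved_.find(obj);
        return it == moved_.end() ? obj % pes_ : it->second;
    }
    void migrate(int obj, int to) { moved_[obj] = to; }   // load balancing or evacuation
private:
    int pes_;
    std::map<int, int> moved_;
};

int main() {
    const int processors = 8, objects = 64;  // user decomposes independently of PE count
    Runtime rt(processors);
    rt.migrate(5, 3);                        // runtime decision, invisible to the program
    for (int o = 0; o < objects; ++o)
        if (o < 8) std::printf("object %d runs on PE %d\n", o, rt.where(o));
    return 0;
}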


Parallel Resource Management: AMPI (MPI with Virtualization)

Each MPI process is implemented as a user-level thread embedded in a Charm++ object

[Figure: MPI “processes” implemented as virtual processes (user-level migratable threads) mapped onto real processors]
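From the programmer's point of view, AMPI code is ordinary MPI. The fragment below is standard MPI that, under AMPI, could be launched with many more ranks (virtual processors) than physical CPUs, since each rank is a migratable user-level thread; the exact launch options depend on the Charm++/AMPI version and are not shown here.

// Ordinary MPI program; under AMPI each "rank" becomes a user-level thread that the
// runtime can migrate, so the rank count may greatly exceed the number of CPUs.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);   // e.g. 32 virtual processors on 8 CPUs
    double local = rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) std::printf("%d ranks, sum = %g\n", nranks, sum);
    MPI_Finalize();
    return 0;
}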


Research Using Processor Virtualization

Efficient resource management
Dynamic load balancing based on object migration
Optimized inter-processor communication
Measurement-based runtime decisions
— Communication volumes
— Computational loads
Focus on highly scalable strategies
— Centralized / Distributed / Hybrid

Fault-tolerance approaches for large systems
Proactive reaction to impending faults
— Migrate objects when a fault is imminent
— Keep “good” processors running at full pace
— Refine load balance after migrations
Appropriate for system failures consisting of a small subset of a large job
Automatic checkpointing / fault-detection / restart
In-memory checkpointing of objects (a sketch follows below)
Using message-logging to tolerate frequent faults in a scalable fashion
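A schematic of the in-memory checkpointing idea above, assuming each object's checkpoint is kept both locally and on a partner ("buddy") processor so that one failure never destroys the only copy; the real Charm++ protocol differs in its details.

// In-memory double checkpointing sketch: each object's state is stored locally and
// on a buddy processor. The buddy pairing and data layout here are illustrative.
#include <cstdio>
#include <vector>

struct Checkpoint { int owner_pe; std::vector<double> state; };

int buddy_of(int pe, int npes) { return (pe + npes / 2) % npes; }  // simple pairing

int main() {
    const int npes = 8;
    std::vector<std::vector<Checkpoint>> stored(npes);   // stored[pe] = copies kept in pe's memory

    for (int pe = 0; pe < npes; ++pe) {
        Checkpoint ck{pe, std::vector<double>(4, pe)};    // fake object state
        stored[pe].push_back(ck);                         // local copy
        stored[buddy_of(pe, npes)].push_back(ck);         // remote (buddy) copy
    }

    int failed = 3;                                       // recover PE 3 from its buddy
    for (const auto &ck : stored[buddy_of(failed, npes)])
        if (ck.owner_pe == failed)
            std::printf("restored object of PE %d from PE %d's memory\n",
                        failed, buddy_of(failed, npes));
    return 0;
}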


Leverage Advantages of Full Featured OSs

Make Available on BlueGene/L Compute Nodes

Enable a large scale test bed for our work

Increase flexibility for target applications


Global System ManagementSingle System Image

Move from Physical view of machine to Logical View of machine

In a single process space model, all processes running on the parallel machine belong to a single pool of processes. A process can interact with any other process independent of their physical location on the machine. The single process space mechanism will implement the following process services:

By coalescing several individual operations into one collective operation, and raising the semantic level of the operations, the aggregate model will allow us to address performance issues that arise.


Global System Management: The Logical View…

Single Process Space (an interface sketch follows after this list)
— Process query: These services provide information about which processes are running on the machine, who they belong to, how they are grouped into jobs, and how many resources they are using.
— Process creation: These services support the creation of new processes and their grouping into jobs.
— Process control: These services support suspension, continuation, and termination of processes.
— Process communication: These services implement communications between processes.

Single File Space
— Any fully qualified file name (e.g., “/usr/bin/ls”) represents exactly the same file to all the processes running on the machine.

Single Communication Space
— We will provide mechanisms by which any two processes can establish a channel between them.
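One way to picture the single-process-space services above is as a small interface. Every name below is hypothetical, sketched only to show the shape such an API might take; it is not the project's actual design.

// Hypothetical outline of single-process-space services (all names invented for
// illustration; this is not the Colony project's actual interface).
#include <cstdint>
#include <string>
#include <vector>

struct ProcInfo {                      // process query: who runs what, and where
    uint64_t global_pid;               // unique across the whole machine
    uint32_t node, job_id, owner_uid;
    uint64_t cpu_time_us, rss_bytes;   // resource usage
};

class SingleProcessSpace {
public:
    // Process query: machine-wide view, independent of physical location.
    virtual std::vector<ProcInfo> query(uint32_t job_id) = 0;
    // Process creation: spawn processes anywhere and group them into jobs.
    virtual uint64_t spawn(const std::string &exe, uint32_t job_id) = 0;
    // Process control: suspend, continue, or terminate by global pid.
    virtual void signal(uint64_t global_pid, int signo) = 0;
    // Process communication: a location-transparent channel between any two processes.
    virtual int connect(uint64_t from_pid, uint64_t to_pid) = 0;
    virtual ~SingleProcessSpace() = default;
};

int main() { return 0; }               // interface outline only; no implementation here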


Utilize Blue Gene

Port, optimize and scale existing Charm++/AMPI applications on BlueGene/L
– Molecular Dynamics: NAMD (Univ. Illinois)
– Collection of (charged) atoms, with bonds
– Simulations with millions of timesteps desired
– Cosmology: PKDGRAV (Univ. Washington)
– N-Body problem, with gravitational forces
– Simulation and analysis/visualization done in parallel
– Quantum Chemistry: CPAIMD (IBM/NYU/others)
– Car-Parrinello Ab Initio Molecular Dynamics
– Fine-grain parallelization, long time simulations

Most of these efforts will leverage current collaborations/grants


Task Migration Time in Charm++

Migration time for 5-point stencil, 16 processors


Task Migration Time in Charm++

Migration time for 5-point stencil, 268 MB total data