Progress Towards Petascale Virtual Machines

Progress Towards Petascale Virtual Machines Al Geist Oak Ridge National Laboratory www.csm.ornl.gov/~geist EuroPVM-MPI 2003 Venice, Italy September 30, 2003


Transcript of Progress Towards Petascale Virtual Machines

Page 1: Progress Towards  Petascale Virtual Machines

Progress Towards Petascale Virtual Machines

Al Geist, Oak Ridge National Laboratory

www.csm.ornl.gov/~geist

EuroPVM-MPI 2003, Venice, Italy

September 30, 2003

Page 2: Progress Towards  Petascale Virtual Machines

Petascale Virtual Machine: Another kind of “PVM”

This talk will describe:

DOE Genomes to Life Project
– PVM use today in the Genomics Integrated Supercomputer Toolkit for fault tolerance and high availability in a dynamic environment

Harness Project (next generation of PVM) and its features to help scale to petascale systems
– Distributed peer-to-peer control
– H2O: the self-adapting core of Harness
– FT-MPI: fault-tolerant MPI

Latest superscalable algorithms with natural fault tolerance for petascale environments.

Page 3: Progress Towards  Petascale Virtual Machines

Understanding the Essential Processes of Living Systems

Follow-on to Human Genome Program
– Determined the entire DNA sequence for humans
– 24 chromosomes in 6 ft of DNA
– 3 billion nucleotides code for 35,000 genes
– Only 0.0001% difference between people.

Instructions to build a human fits on a DVD (3GB)

The Genomes to Life Program goal is to read the instructions, starting with simple single-cell organisms (microbes):
– Molecular Machines
– Regulatory Pathways
– Multi-cell Communities

Develop new computational methods to understand complex biological systems

DOE Genomes to Life Program

PVM

$100M effort

Page 4: Progress Towards  Petascale Virtual Machines

www.genomes-to-life.org

Many interlinked proteins form interacting machines

From The Machinery of Life, David S. Goodsell,

Springer-Verlag, New York, 1993.

Molecular Machines Fill Cells

Page 5: Progress Towards  Petascale Virtual Machines

www.genomes-to-life.org

Gene regulation controls what genes are expressed

Proteome changes over time and due to environmental conditions

Regulatory Networks Control the Machines

- And -

Page 6: Progress Towards  Petascale Virtual Machines

[Figure: computing requirement (1 TF to 1000 TF, where TF = teraflops; current U.S. computing marked) versus biological complexity. Applications plotted: comparative genomics; constrained rigid docking; constraint-based flexible docking; genome-scale protein threading; protein machine interactions; molecular machine classical simulation; cell, pathway, and network simulation; community metabolic, regulatory, and signaling simulations; molecule-based cell simulation; cell-based community simulation.]

GTL will Require Petascale Systems

Page 7: Progress Towards  Petascale Virtual Machines

GTL is going to rely on high-performance computing and data analysis to process high-throughput experimental data

Biology for the 21st Century

The new computational biology environments will be conceptually integrated “knowledge enabling” environments that couple diverse sets of distributed data, advanced informatics methods, experiments, modeling, and simulation.

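[Figure: a cycle of experiment, data analysis, modeling, and simulation built around shared data (raw data, genomes, regulatory elements, protein structure, pathways, models). An example pathway diagram is labeled with early endosomes, late endosomes, lysosomes, Golgi, EGFR, erbB-2, Shc, Cbl, eps8, PLC, Grb-2, Annexin II, Src, Eps15, AP-2, and ERK; numbered steps and question marks indicate unknown links.]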

Page 8: Progress Towards  Petascale Virtual Machines

GIST is a framework for large-scale biological application deployment:
– provides a transparent, high-performance interface to biological applications
– provides transparent access to distributed data sets
– uses PVM to launch and manage jobs across a wide diversity of supercomputers
– is highly fault tolerant and adapts to dynamic changes in the environment using PVM
– next step: deploy across ORNL, ANL, PNNL, and SNL as a multi-site “Bio-Grid”
– thousands of users for execution of genome analysis and simulation

Genome Integrated Supercomputer Toolkit

[Figure: PVM across heterogeneous supercomputers. A web portal and a protein analysis engine exchange XML and reach raw data, genomes, and pathways. Machines shown: IBM p690 (864 processors), Cray X1 (256 processors), P4 cluster (64 processors), SGI Altix (256 processors).]

Page 9: Progress Towards  Petascale Virtual Machines

The GIST Developers really want Harness

They ask us regularly about the next generation of PVM called Harness because they want the increased adaptability and fault tolerance that Harness promises.

Harness is being developed by the same team that developed PVM:

– Vaidy Sunderam, Emory University
– Al Geist, Oak Ridge National Lab
– Jack Dongarra, University of Tennessee and ORNL

Page 10: Progress Towards  Petascale Virtual Machines

Harness II Design Goals

Harness is a distributed virtual machine environment that goes beyond the features of PVM:

Allows users to dynamically customize, adapt, and extend a virtual machine's features
• to more closely match the needs of their application
• to optimize the virtual machine for the underlying computer resources

Is being designed to scale to petascale virtual machines
• distributed control
• minimized global state
• no single point of failure

Allows multiple virtual machines to join and split in temporary micro-grids

Page 11: Progress Towards  Petascale Virtual Machines

HARNESS II Architecture

[Figure: Hosts A through D each run a component-based HARNESS daemon, built on top of the H2O kernel with the DVM pluglet loaded, and together form a virtual machine. Operation within the VM uses distributed control. User features are customized and extended by dynamically adding pluglets (shown: DVM, FT-MPI, process control). The VM can merge or split with another VM.]

Page 12: Progress Towards  Petascale Virtual Machines

• No single point (or set of points) of failure for Harness. It survives as long as one member still lives.

• All members know the state of the virtual machine, and their knowledge is kept consistent w.r.t. the order of changes of state. (Important parallel programming requirement!)

• No member is more important than any other (at any instant), i.e. there isn’t a pass-around “control token”.

• For Petascale Systems the control members can be a distributed subset of all the processors in the system

Symmetric Peer-to-Peer Distributed Control

Characteristics

Page 13: Progress Towards  Petascale Virtual Machines

Harness Distributed Control

[Figure: the distributed control loop, annotated: supports multiple simultaneous updates (e.g. add host); supports fast host adding; fast host delete or recovery from fault; parallel recovery from multiple host failures.]

Control is Asynchronous and Parallel

Page 14: Progress Towards  Petascale Virtual Machines

HARNESS: Petascale Virtual Machine

Size S of the virtual machine's control loop: 1 <= S <= (size of VM)

For S = 1, distributed control becomes a simple client/server model.

For a small VM and ultimate fault tolerance, S = (size of VM).

For a large VM, a random selection of a few hosts (e.g. S = 10) gives a balance of multi-point failure tolerance and performance.

Variable Distributed Control Loop Size
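The trade-off in choosing S can be sketched numerically. The following is a minimal illustration, not Harness code. It assumes that hosts fail independently with the same probability p (an assumption the talk does not make), picks S control members at random, and computes the chance that every control member is lost at once. The function names `pick_control_loop` and `prob_all_control_members_fail` are hypothetical.

```python
import random

def pick_control_loop(hosts, s, rng):
    """Randomly choose s hosts to act as control members (1 <= s <= len(hosts))."""
    if not 1 <= s <= len(hosts):
        raise ValueError("control loop size must satisfy 1 <= S <= size of VM")
    return rng.sample(hosts, s)

def prob_all_control_members_fail(p, s):
    """Probability that all s control members fail, assuming independent
    failures with per-host probability p (an assumption of this sketch)."""
    return p ** s

if __name__ == "__main__":
    rng = random.Random(0)
    hosts = [f"host{i}" for i in range(100_000)]
    loop = pick_control_loop(hosts, 10, rng)
    print(len(loop), "control members chosen")
    for s in (1, 10):
        prob = prob_all_control_members_fail(0.01, s)
        print(f"S={s}: P(all control members fail) = {prob:.3g}")
```

Under these assumptions the probability of losing the whole control loop falls geometrically with S, while the number of members that must agree on each state change grows only linearly, which is one way to read the "balance" described on the slide.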

Page 15: Progress Towards  Petascale Virtual Machines

H2O kernel - Overview

H2O is a multithreaded, lightweight kernel that is dynamically configured by loading “pluglets”.

Resources are provided as services through pluglets. Services may be deployed by any authorized party: provider, client, or third-party reseller.

H2O is stateless and resource independent.

In Harness, the DVM service, which includes distributed control of services, must be installed on the host.

Pluglets can provide multiple programming models.

Java and C implementations being developed
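The pluglet idea can be illustrated with a toy example. This is a hedged sketch only: it is not the H2O API, and the class `Kernel`, its methods, and `echo_pluglet` are hypothetical names invented for illustration. It shows a kernel that offers no services of its own until pluglets are loaded, and it leaves out authorization, remote deployment, and threading.

```python
class Kernel:
    """Toy stand-in for a lightweight kernel: it provides no services
    until pluglets are loaded into it."""

    def __init__(self):
        self._pluglets = {}

    def load(self, name, pluglet):
        """Deploy a pluglet (any callable) under a service name."""
        if name in self._pluglets:
            raise ValueError(f"pluglet {name!r} already loaded")
        self._pluglets[name] = pluglet

    def unload(self, name):
        """Remove a previously loaded pluglet."""
        del self._pluglets[name]

    def call(self, name, *args):
        """Invoke the service provided by a loaded pluglet."""
        return self._pluglets[name](*args)

    def services(self):
        """List the names of the services currently available."""
        return sorted(self._pluglets)


def echo_pluglet(message):
    """Hypothetical pluglet: a plain callable providing an 'echo' service."""
    return f"echo: {message}"


if __name__ == "__main__":
    kernel = Kernel()
    kernel.load("echo", echo_pluglet)
    print(kernel.services())
    print(kernel.call("echo", "hello"))
```

The point of the design is that the set of services is a runtime property of the kernel rather than something fixed when it is built.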

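[Figure: H2O kernel layering. Clients use programming models (FT-MPI, PVM, Java RMI, active objects, OGSA, P2P) that are provided through the functional interfaces of pluglets loaded into the kernel; one interface is labeled "Suspendible".]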

Page 16: Progress Towards  Petascale Virtual Machines

H2O is built on top of a flexible P2P communication layer called RMIX

– Provides interoperability between kernels and other web services
– Adopts common RMI semantics
– Designed for easy porting between protocols
– Dynamic protocol negotiation
– Scalable P2P design

[Figure: a Java H2O kernel and a second H2O kernel, hosting pluglets A through F, communicate through RMIX layers over the network (RPC, IIOP, JRMP, SOAP, …) and interoperate with RPC clients, SOAP clients, and web services.]

H2O kernel – RMIX Communication

Page 17: Progress Towards  Petascale Virtual Machines

[Figure: H2O deployment scenarios among providers, clients, resellers, developers, and repositories, including legacy applications and native code. Registration and discovery (publish/find) may use e-mail, phone, JNDI, UDDI, LDAP, DNS, GIS, ...]

Cluster computing, like:
– PVM
– Harness
– LAM/MPI

Grid Web portal, like:
– Genome Channel
– Biology Workbench
– Web service

Internet computing, like:
– SETI@home
– Entropia
– United Devices

H2O can support a wide range of distributed computing models

Flexibility beyond the PVM/MPI model

Page 18: Progress Towards  Petascale Virtual Machines

Harness Fault Tolerant MPI Plug-in

FT-MPI is built in layers with tuned collectives, tuned derived data type handling, and good point-to-point bandwidth.

Works with MPE profiling and tools such as JUMPSHOT from ANL.

[Figure: FT-MPI layering. On each of two hosts, an MPI application linked with libftmpi runs above a startup plugin in H2O; a Name Service and an Ftmpi_notifier sit between them.]

Application performance on par with MPICH-2.

FT-MPI available at SC2003

Page 19: Progress Towards  Petascale Virtual Machines

Harness Fault Tolerant MPI Plug-in

FT-MPI is a system level Fault Tolerant full MPI 1.2 implementation.

Process failures are detected and passed back to the user's application using MPI objects. The user's application decides how best to reconfigure the system and continue.

Recovery options for affected communicators:
– ABORT: just do as other implementations do, i.e. checkpoint/restart
– BLANK: leave a hole
– SHRINK: re-order processes to make a contiguous communicator
– REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD

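[Figure: communicator options. A nine-process communicator (1 2 3 4 5 6 7 8 9) loses three processes, marked X. The rows shown are 1 2 5 6 8 9 (holes left in place), 1 2 3 4 5 6 (survivors renumbered contiguously), and 1 2 3 4 5 6 7 8 9 (lost processes re-spawned).]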
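The effect of each recovery option on a communicator's membership can be sketched as follows. This is an illustrative model only: it does not use or reproduce the FT-MPI API, and the function `recover`, the use of `None` for a hole, and the `new<id>` labels for re-spawned processes are assumptions made for the example.

```python
def recover(ranks, failed, mode):
    """Return the communicator's member list after recovery.

    ranks  -- process ids in rank order before the failure
    failed -- set of process ids that died
    mode   -- "ABORT", "BLANK", "SHRINK" or "REBUILD"
    """
    if mode == "ABORT":
        # Modelled as an exception: the whole job stops and would be
        # restarted from a checkpoint.
        raise RuntimeError("job aborted; restart from a checkpoint")
    if mode == "BLANK":
        # Survivors keep their ranks; dead slots become holes (None).
        return [None if r in failed else r for r in ranks]
    if mode == "SHRINK":
        # Survivors are packed into a shorter, contiguous rank list.
        return [r for r in ranks if r not in failed]
    if mode == "REBUILD":
        # Dead slots are filled with freshly spawned replacements.
        return [f"new{r}" if r in failed else r for r in ranks]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    ranks = list(range(1, 10))
    failed = {3, 4, 7}
    for mode in ("BLANK", "SHRINK", "REBUILD"):
        print(mode, recover(ranks, failed, mode))
```

In this model the list index plays the role of the rank: under BLANK and REBUILD every survivor keeps its index, while under SHRINK the indices of survivors after a failed process change, so an application that stores ranks has to account for the renumbering.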

Page 20: Progress Towards  Petascale Virtual Machines

Large-scale Fault Tolerance

Developing fault tolerant algorithms is not trivial. Anything beyond simple checkpoint/restart is beyond most scientists. Many recovery issues must be addressed.

Doing a restart of 90,000 tasks because of the failure of 1 task may be a very inefficient use of resources.

When and what are the recovery options for large-scale simulations?

Taking fault tolerance beyond checkpoint/restart.

Page 21: Progress Towards  Petascale Virtual Machines

Fault Tolerance – a petascale perspective

Future systems are being designed with 100,000 processors.

The time before some failure will be measured in minutes.

Checkpointing and restarting this large a system could take longer than the time to the next failure!
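A back-of-the-envelope calculation shows why. The sketch below assumes that processors fail independently at a constant rate, so that failure rates add and the system's mean time between failures (MTBF) is the per-processor MTBF divided by the processor count. The 5-year per-processor MTBF and the 30-minute checkpoint/restart cycle are illustrative assumptions, not figures from the talk, and the function names are hypothetical.

```python
def system_mtbf_minutes(node_mtbf_years, n_processors):
    """MTBF of the whole machine in minutes, assuming independent failures
    at a constant rate (so per-processor failure rates simply add)."""
    minutes_per_year = 365 * 24 * 60
    return node_mtbf_years * minutes_per_year / n_processors


def checkpoint_restart_is_viable(node_mtbf_years, n_processors, cycle_minutes):
    """True if one checkpoint-plus-restart cycle is shorter than the
    system MTBF."""
    return cycle_minutes < system_mtbf_minutes(node_mtbf_years, n_processors)


if __name__ == "__main__":
    mtbf = system_mtbf_minutes(5, 100_000)
    print(f"system MTBF: {mtbf:.2f} minutes")
    print("30-minute cycle viable:", checkpoint_restart_is_viable(5, 100_000, 30))
```

With these assumed numbers the system MTBF is about 26 minutes, shorter than the assumed 30-minute cycle, so the machine would on average fail again before a checkpoint and restart could complete.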

Development of algorithms that are naturally fault tolerant, i.e. failure anywhere can be ignored and the algorithm still gets the right answer:
– No monitoring
– No notification
– No recovery

Is this possible? YES!

What to do? Autonomic? Self-healing?

Page 22: Progress Towards  Petascale Virtual Machines

Demonstrated that scale invariance and natural fault tolerance can exist for local and global algorithms

Progress on Super-scalar algorithms

local

global

Finite Difference (Christian Engelmann)
– Demonstrated natural fault tolerance with chaotic relaxation, meshless, finite difference solution of Laplace and Poisson problems

Global information (Kasidit Chanchio)
– Demonstrated natural fault tolerance in a global max problem with random, directed graphs

Gridless Multigrid (Ryan Adams)
– Combines the fast convergence of multigrid with the natural fault tolerance property. Hierarchical implementation of the finite difference method above.
– Three different asynchronous updates explored
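The global max item above can be illustrated with a small simulation. This is a hedged sketch of the general idea, not the implementation used in the work cited: each live node repeatedly pushes its current estimate along its outgoing edges, messages to failed nodes are silently dropped, and nobody monitors, is notified of, or recovers from a failure. The functions `build_graph` and `gossip_max` are hypothetical, and the two ring edges added to every node are an assumption of this sketch that keeps the live nodes connected as long as no two adjacent nodes fail.

```python
import random


def build_graph(n, extra_edges, rng):
    """Directed graph: each node points to its next two ring neighbours
    (an assumption of this sketch) plus a few random targets."""
    edges = []
    for i in range(n):
        targets = {(i + 1) % n, (i + 2) % n}
        targets.update(rng.randrange(n) for _ in range(extra_edges))
        targets.discard(i)  # no self-loops
        edges.append(sorted(targets))
    return edges


def gossip_max(values, edges, failed, rounds):
    """Each live node pushes its current estimate to its out-neighbours.
    A message to a failed node is dropped and the sender does not react."""
    est = list(values)
    live = [i for i in range(len(values)) if i not in failed]
    for _ in range(rounds):
        for i in live:
            for j in edges[i]:
                if j not in failed and est[i] > est[j]:
                    est[j] = est[i]
    return est


if __name__ == "__main__":
    rng = random.Random(7)
    n = 20
    edges = build_graph(n, 2, rng)
    values = list(range(n))
    failed = {3, 10, 19}  # node 19 holds the largest value but has failed
    est = gossip_max(values, edges, failed, rounds=n)
    print(sorted({est[i] for i in range(n) if i not in failed}))
```

In this model the nodes are failed from the start, so the surviving nodes agree on the maximum over surviving values. A node that failed after it had already forwarded its value would still contribute to the result.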

Page 23: Progress Towards  Petascale Virtual Machines

Further Information

www.csm.ornl.gov/~geist

Genomes to Life

Harness

Naturally Fault Tolerant Algorithms

Questions?