Alliance Clusters, Cluster in a Box


Transcript of Alliance Clusters, Cluster in a Box

Page 1: Alliance Clusters,  Cluster in a Box

National Computational Science

Alliance Clusters, Cluster in a Box

Rob Pennington, Acting Associate Director

Computing and Communications Division, NCSA

How to stuff a penguin in a box and make everyone happy, even the penguin.

Page 2: Alliance Clusters,  Cluster in a Box


Distributed systems:
• Gather (unused) resources
• System SW manages resources
• System SW adds value
• 10% - 20% overhead is OK
• Resources drive applications
• Time to completion is not critical
• Time-shared
• Commercial: PopularPower, United Devices, Centrata, ProcessTree, Applied Meta, etc.

MP systems:
• Bounded set of resources
• Apps grow to consume all cycles
• Application manages resources
• System SW gets in the way
• 5% overhead is maximum
• Apps drive purchase of equipment
• Real-time constraints
• Space-shared

Where Do Clusters Fit?

[Figure: spectrum of systems between the two models, labeled with Internet, SETI@home, Condor, Legion/Globus, Berkeley NOW, Beowulf, superclusters, and ASCI Red Tflops; annotated with "15 TF/s delivered" and "1 TF/s delivered".]

Src: B. Maccabe, UNM; R. Pennington, NCSA

Page 3: Alliance Clusters,  Cluster in a Box


Alliance Clusters Overview

• Major Alliance cluster systems
  – NT-based cluster at NCSA
  – Linux-based clusters
    – University of New Mexico - Roadrunner, Los Lobos
    – Argonne National Lab - Chiba City
  – “Develop Locally, Run Globally”
    – Local clusters used for development and parameter studies

• Issues
  – Compatible Software Environments
  – Compatible Hardware
  – Evaluate Technologies at Multiple Sites
    – OS, Processors, Interconnect, Middleware

• Computational resource for users

Page 4: Alliance Clusters,  Cluster in a Box


Cluster in a Box Rationale

• Conventional wisdom: Building a cluster is easy
  – Recipe:
    – Buy hardware from Computer Shopper, Best Buy or Joe’s place
    – Find a grad student not making enough progress on thesis work and distract him/her with the prospect of playing with the toys
    – Allow to incubate for a few days to weeks
    – Install your application, run and be happy

• Building it right is a little more difficult
  – Multi-user cluster, security, performance tools
  – Basic question - what works reliably?

• Building it to be compatible with Grid/Alliance...
  – Compilers, libraries
  – Accounts, file storage, reproducibility

• Hardware configs may be an issue

Page 5: Alliance Clusters,  Cluster in a Box


Alliance Cluster Growth: 1 TFLOP IN 2 YEARS

[Chart: Intel processors in Alliance clusters vs. time, Jan-98 through Oct-00, rising from 0 toward 1,800. Labeled systems: NCSA 192p (HP, Compaq), UNM 128p (Alta), NCSA 128p (HP), NCSA 32p (SGI), ANL 512p (IBM, VA Linux), NCSA 128p (HP), UNM 512p (IBM), 256p NT Cluster. Total: 1600+ Intel CPUs.]

Page 6: Alliance Clusters,  Cluster in a Box


Alliance Cluster Status

• UNM Los Lobos
  – Linux
  – 512 processors
  – May 2000 – operational system
  – First performance tests
  – Friendly users

• Argonne Chiba City
  – Linux
  – 512 processors
  – Myrinet interconnect
  – November 1999 – deployment

• NCSA NT Cluster
  – Windows NT 4
  – 256 processors
  – Myrinet
  – December 1999
  – Review Board Allocations

• UNM Roadrunner
  – Linux, 128 processors
  – Myrinet
  – September 1999
  – Review Board Allocations

Page 7: Alliance Clusters,  Cluster in a Box


NT Cluster Usage - Large, Long Jobs

[Chart: NT Cluster usage by number of processors, May 1999 to Jul 2000. CPU hours (0 to 500,000) binned by job size: 1-31, 32-63, and 64-256 processors.]

Page 8: Alliance Clusters,  Cluster in a Box


A Pyramid Scheme: (Involve Your Friends and Win Big)

[Diagram: a pyramid of systems, from small, private systems in labs/offices, through Alliance resources at partner sites, to full production resources at major site(s).]

Can a “Cluster in a Box” support all of the different configs at all of the sites?

No, but it can provide an established & tested base configuration.

This is a non-exclusive club at all levels!

Page 9: Alliance Clusters,  Cluster in a Box


Cluster in a Box Goals

• Open source software kit for scientific computing
  – Surf the ground swell
  – Some things are going to be add-ons
  – Invest in compilers, vendors have spent BIG $ optimizing them

• Integration of commonly used components
  – Minimal development effort
  – Time to delivery is critical

• Initial target is small to medium clusters
  – Up to 64 processors
  – ~1 interconnect switch

• Compatible environment for development and execution across different systems (Grid, anyone?)
  – Common libraries, compilers

Page 10: Alliance Clusters,  Cluster in a Box


Key Challenges and Opportunities

• Technical and Applications
  – Development Environment
  – Compilers, Debuggers
  – Performance Tools
  – Storage Performance
  – Scalable Storage
  – Common Filesystem
  – Admin Tools
  – Scalable Monitoring Tools
  – Parallel Process Control
  – Node Size
  – Resource Contention
  – Shared Memory Apps
  – Few Users => Many Users
  – 600 Users/month on O2000
  – Heterogeneous Systems
  – New generations of systems
  – Integration with the Grid

• Organizational
  – Integration with Existing Infrastructure
  – Accounts, Accounting
  – Mass Storage
  – Training
  – Acceptance by Community
  – Increasing Quickly
  – Software environments

Page 11: Alliance Clusters,  Cluster in a Box


Cluster Configuration

[Diagram: cluster building blocks — compute nodes, frontend nodes, I/O nodes, management nodes, visualization nodes, debug nodes, and a systems testbed, plus user logins, the network, storage, and HSM. Green: present generation clusters.]
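As a purely illustrative sketch, the node roles in this configuration could be described by a small record per node. Everything below is an assumption made for the example (the role names, hostnames, and the one-frontend-plus-four-compute layout, which loosely mirrors the prototyping testbed on a later slide); it is not taken from the Alliance cluster software.

```c
/* Hypothetical sketch of a cluster description in C.
 * Role names, hostnames, and the sample layout are invented
 * for illustration; they are not the actual Alliance configuration. */
#include <stdio.h>

typedef enum {
    NODE_FRONTEND,      /* user logins, compiling, job submission */
    NODE_COMPUTE,       /* batch-scheduled application runs */
    NODE_IO,            /* filesystem, storage, and HSM access */
    NODE_MGMT,          /* administration and monitoring */
    NODE_VISUALIZATION,
    NODE_DEBUG
} node_role;

typedef struct {
    const char *hostname;
    node_role   role;
    int         cpus;
} node_desc;

int main(void) {
    /* Example layout: one interactive/frontend node plus four
       dual-CPU compute nodes. */
    node_desc cluster[] = {
        { "front-0",  NODE_FRONTEND, 2 },
        { "node-001", NODE_COMPUTE,  2 },
        { "node-002", NODE_COMPUTE,  2 },
        { "node-003", NODE_COMPUTE,  2 },
        { "node-004", NODE_COMPUTE,  2 },
    };
    int nodes = (int)(sizeof cluster / sizeof cluster[0]);
    int cpus = 0;
    for (int i = 0; i < nodes; i++)
        cpus += cluster[i].cpus;
    printf("%d nodes, %d CPUs\n", nodes, cpus);
    return 0;
}
```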

Page 12: Alliance Clusters,  Cluster in a Box


Space Sharing Example on 64 Nodes

[Diagram: six applications, App1 through App6, each occupying its own block of nodes on a 64-node cluster.]

Users “own” the nodes allocated to them.
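A minimal sketch of the space-sharing policy itself, assuming a simple first-fit allocation of whole nodes: each job is handed a contiguous block of nodes that it owns exclusively until it finishes. This is illustrative only; on the Alliance clusters the batch system (PBS) does the real allocation, and every name below is invented for the sketch.

```c
/* Illustrative first-fit space-sharing allocator for a 64-node cluster.
 * Nodes are never time-shared: a job owns its nodes until it releases
 * them. This is a sketch, not the actual Alliance/PBS scheduler. */
#include <stdio.h>

#define NODES 64

static int owner[NODES];   /* 0 = free, otherwise the owning job id */

/* Reserve `count` contiguous free nodes for `job`.
   Returns the first node index, or -1 if no block is available. */
static int allocate(int job, int count) {
    for (int start = 0; start + count <= NODES; start++) {
        int free_run = 1;
        for (int i = start; i < start + count; i++)
            if (owner[i] != 0) { free_run = 0; break; }
        if (free_run) {
            for (int i = start; i < start + count; i++)
                owner[i] = job;
            return start;
        }
    }
    return -1;   /* job waits until enough whole nodes free up */
}

/* Return all of a job's nodes to the free pool when it finishes. */
static void release(int job) {
    for (int i = 0; i < NODES; i++)
        if (owner[i] == job) owner[i] = 0;
}

int main(void) {
    /* Six jobs of assorted sizes, echoing App1-App6 on the slide. */
    int sizes[6] = { 16, 8, 8, 16, 8, 8 };
    for (int j = 0; j < 6; j++) {
        int at = allocate(j + 1, sizes[j]);
        printf("App%d: %d nodes starting at node %d\n",
               j + 1, sizes[j], at);
    }
    release(3);   /* App3 finishes; its 8 nodes become free again */
    return 0;
}
```

The point of space sharing is that applications never compete for cycles on a node; they only wait for whole nodes to become free.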

Page 13: Alliance Clusters,  Cluster in a Box


OSCAR: A(nother) Package for Linux Clustering

OSCAR (Open Source Cluster Application Resources) is a snapshot of the best known methods for building and using cluster software.

Page 14: Alliance Clusters,  Cluster in a Box


The OSCAR Consortium

• OSCAR is being developed by:
  – NCSA/Alliance
  – Oak Ridge National Laboratory
  – Intel
  – IBM
  – Veridian Systems

• Additional supporters are:
  – SGI, HP, Dell, MPI Software Technology, MSC

Page 15: Alliance Clusters,  Cluster in a Box


OSCAR Components: Status

Packaging: Integration underway. Documentation under development.

Job Management: PBS validated and awaiting integration. Long-term replacement for PBS under consideration.

Cluster Management: C3/M3C core complete, but further refinement is planned. Evaluation of alternative solutions underway.

Installation & Cloning: Configuration database design is complete. LUI is complete and awaiting integration with the database.

OS: Core validation OSs selected (Red Hat, TurboLinux, and SuSE). Integration support issues being worked.

Src: N. Gorsuch, NCSA

Page 16: Alliance Clusters,  Cluster in a Box


Open Source Cluster Application Resources

• Open source cluster on a “CD”
• Integration meeting v0.5 - September 2000
• Integration meeting at ORNL October 24 & 25 - v1.0
• v1.0 to be released at Supercomputing 2000 (November 2000)

• Research and industry consortium
  – NCSA, ORNL, Intel, IBM, MSC Software, SGI, HP, Veridian, Dell

• Components
  – OS layer: Linux (Red Hat, TurboLinux, SuSE, etc.)
  – Installation and cloning: LUI
  – Security: openssh for now
  – Cluster management: C3/M3C
  – Job management: OpenPBS
  – Programming environment: gcc etc.
  – Packaging: OSCAR

Src: N. Gorsuch, NCSA

Page 17: Alliance Clusters,  Cluster in a Box


OSCAR Cluster Installation Process

• Install Linux on cluster master or head node
• Copy contents of OSCAR CD into cluster head
• Collect cluster information and enter into LUI database
  – This is a manual phase right now
• Run the pre-client installation script
• Boot the clients and let them install themselves
  – Can be done over the net or from a floppy

• Run the post-client installation script

KEEP IT SIMPLE!

Page 18: Alliance Clusters,  Cluster in a Box


Testbeds

• Basic cluster configuration for prototyping at NCSA
  – Interactive node + 4 compute nodes
  – Development site for OSCAR contributors
  – 2nd set of identical machines for testbed
  – Rolling development between the two testbeds

• POSIC - Linux
  – 56 dual-processor nodes
  – Mixture of Ethernet and Myrinet
  – User-accessible testbed for apps porting and testing

Page 19: Alliance Clusters,  Cluster in a Box


IA-64 Itanium Systems at NCSA

• Prototype systems
  – Early hardware
  – Not running at production spec
  – Code porting and validation
  – Community codes
  – Required software infrastructure

• Running 64-bit Linux and Windows
  – Dual boot capable
  – Usually one OS for extended periods

• Clustered IA-64 systems
  – Focused on MPI applications porting/testing
  – Myrinet, Ethernet, Shared Memory

Page 20: Alliance Clusters,  Cluster in a Box


HPC Applications Running on Itanium

[Diagram: IA-64 test cluster — IA-64 compute nodes (4p and 2p, running Linux or Win64) plus IA-32 compile nodes (Linux and Win32), connected by Myrinet.]

Applications/Packages: Cactus, MILC, ARPI-3D, ATLAS, sPPM, WRF; PUPI, ASPCG, HDF4, HDF5, PBS, FFTW, Globus; compilers for C/C++/F90.

Interconnects: shared memory; Fast Enet + MPICH; Myrinet + GM + VMI + MPICH.
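Because the testbed exists for MPI application porting and testing over these interconnects, a minimal MPI program of the kind it runs is sketched below. It uses only standard MPI calls; the build and launch commands in the comments (mpicc, mpirun) assume an MPICH-style environment and are not taken from the slides.

```c
/* Minimal MPI check: each rank reports itself and rank 0 sums the ranks.
 * Standard MPI-1 calls only. Assumed MPICH-style usage:
 *   mpicc hello_mpi.c -o hello_mpi
 *   mpirun -np 4 ./hello_mpi
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d is alive\n", rank, size);

    /* A tiny collective so the run actually exercises the interconnect
       (shared memory, Fast Ethernet + MPICH, or Myrinet + GM/VMI). */
    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}
```

The same source should build unchanged against any of the listed interconnect stacks, which is the portability point of focusing the testbed on MPI.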

Page 21: Alliance Clusters,  Cluster in a Box


Future

• Scale up Current Cluster Efforts
  – Capability computing at NCSA and Alliance sites
  – NT and Linux clusters expand
  – Scalable Computing Platforms
  – Commodity turnkey systems
  – Current technology has 1 TF within reach: <1000 IA-32 processors

• Teraflop Systems Integrated With the Grid
  – Multiple systems within the Alliance
  – Complement to current SGI SMP systems at NCSA
  – Next generation of technologies
  – Itanium at ~3 GFLOP, 1 TF is ~350 processors
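As a quick check of that last figure: $1\ \text{TF} \div 3\ \text{GFLOP per processor} \approx 334$ Itanium processors, consistent with the ~350-processor estimate; the "<1000 IA-32 processors" figure likewise corresponds to roughly 1 GFLOP or more per IA-32 processor.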