Vision for System and Resource Management of the Swiss-Tx class of Supercomputers Josef Nemecek ETH Zürich & Supercomputing Systems AG


Vision for System and Resource Management of the Swiss-Tx class of Supercomputers

Josef Nemecek, ETH Zürich & Supercomputing Systems AG


09.03.2000 SOS Workshop 2000 (New Orleans, LA) 2

Agenda

The Supercomputer Lifecycle then and now

The Swiss-T1 Management SW: COSMOS (Commodity Supercomputer Management Operating System)
  The goals of COSMOS
  The concept of COSMOS
  Implementation of COSMOS

Software Integration with existing Parts

Roadmap of COSMOS


Supercomputers – Then and Now

Development by vendor
  Hardware was hand-made
  Software was tailored for hardware

Customers just had to order from the vendor’s catalogue

Lifecycle stages (diagram): Need, Order, Test, Manage

Diagram label: $$$


Supercomputers – Then and Now

System looks like a puzzle
  Commodity parts, multiple vendors
  Zoo of interacting software components

Individual system management
  Millions of lines of code (scripts, daemons)

Lifecycle stages (diagram): Thought, Design, Simulation, Manage

Further diagram labels: Architecture, Topology, Needs, Specification, $$$ & t


COSMOS – Goals

Integrated management for whole lifecycle
  Design the supercomputer on-line
  Simulate the supercomputer performance on-line
  Build the designed and simulated supercomputer
  Manage the built supercomputer

Complete run-time system management
  Fault-tolerance on all (or most) system levels
  Remote manageability of the whole supercomputer
  Low run-time overhead for the system management


COSMOS – Supercomputer Design

Architecture selection
  SAN technology
  Node technology

Topology selection
  Every topology has its +/–

Resource usage
  Cost of the supercomputer
  Space, electrical power

Performance estimation
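
The trade-offs named on this slide can be made concrete with a small estimate. The following is only a sketch: the three topologies, the class name TopologyEstimate and the unit prices are assumptions chosen for illustration and do not come from the slides or from COSMOS itself. It compares link count (a proxy for cost) and diameter (a proxy for worst-case latency) for the same number of nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyEstimate:
    """Rough figures for one candidate SAN topology (illustrative only)."""
    name: str
    nodes: int
    links: int      # number of point-to-point links
    diameter: int   # worst-case hop count between two nodes

    def cost(self, node_cost: float, link_cost: float) -> float:
        # Hypothetical linear cost model: every node and link has a flat price.
        return self.nodes * node_cost + self.links * link_cost


def ring(n: int) -> TopologyEstimate:
    # n >= 3 nodes in a cycle: n links, farthest node is n // 2 hops away.
    return TopologyEstimate("ring", n, n, n // 2)


def torus_2d(k: int) -> TopologyEstimate:
    # k x k torus (k >= 3): 4 links per node, each link shared by 2 nodes.
    return TopologyEstimate("2D torus", k * k, 2 * k * k, 2 * (k // 2))


def fully_connected(n: int) -> TopologyEstimate:
    # One direct link per pair of nodes.
    return TopologyEstimate("fully connected", n, n * (n - 1) // 2, 1)


if __name__ == "__main__":
    # Assumed unit prices, purely for comparison.
    for est in (ring(64), torus_2d(8), fully_connected(64)):
        print(est.name, est.links, est.diameter, est.cost(1000.0, 100.0))
```

For 64 nodes the ring is cheapest in links but has the largest diameter, while the fully connected network reverses that, which is the "+/–" a design tool would have to weigh.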


COSMOS – Goals

Single-system view of whole system
  Allows one-point system management
  Allows remote system management

High availability of the system management
  Allows high overall system up-times
  Allows dynamic configuration changes

Modular software design
  System-independent concept & design
  Interfaces to existing management software modules


COSMOS – Concept

Configuration: Control the system
Monitoring: Observe the system
Planning: When? Who? What?
Security: Stability & independence
Faults & Traps: Help the system
Accounting: Charge the usage

Complete, integrated system management
Remote management from everywhere
No administrative programming necessary


COSMOS – Implementation

System Management modules:

Node Management: State control and monitoring of the nodes, accounting
SAN Management: SAN-dependent management and monitoring, accounting
Process Management: Support of and co-operation with parallel environments such as MPI/FCI
Resource Management: Priorities, allocation, queues
Storage Management: Vendor-dependent storage management software
LAN Management: SNMP-based management of used LAN components
User Interface: User-privilege-based management and monitoring
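
One way to picture the modular structure listed above is a common interface that every management module implements, with a system-management layer that aggregates them. This is a minimal sketch under assumptions: the class names ManagementModule, NodeManagement and SystemManagement, and the status() method, are hypothetical and are not taken from the COSMOS code base.

```python
from abc import ABC, abstractmethod


class ManagementModule(ABC):
    """Hypothetical common interface for one management module."""

    name: str

    @abstractmethod
    def status(self) -> dict:
        """Return the module's current view of the part it manages."""


class NodeManagement(ManagementModule):
    name = "Node Management"

    def __init__(self, node_states: dict) -> None:
        # Maps node id -> state string, e.g. {0: "up", 1: "down"}.
        self._states = dict(node_states)

    def status(self) -> dict:
        return {"nodes": dict(self._states)}


class SystemManagement:
    """Aggregates registered modules into a single-system view."""

    def __init__(self) -> None:
        self._modules = {}

    def register(self, module: ManagementModule) -> None:
        self._modules[module.name] = module

    def status(self) -> dict:
        # One entry per module, keyed by module name.
        return {name: m.status() for name, m in self._modules.items()}


if __name__ == "__main__":
    system = SystemManagement()
    system.register(NodeManagement({0: "up", 1: "up", 2: "down"}))
    print(system.status())
```

A vendor-dependent module, such as storage management, would plug in through the same interface by wrapping the existing software rather than replacing it.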


COSMOS – Implementation

[Diagram: Management Center (COSMOS Center), shown three times; Node 0 to Node 3, each with a COSMOS Agent; Process 0 to Process 7]
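
The center/agent split in this diagram can be sketched as agents that periodically report to a center, which flags nodes whose reports have gone stale. Everything here is an assumption for illustration: the names AgentReport and Center, the heartbeat timeout and the two-processes-per-node placement are hypothetical, and the slides do not describe the actual COSMOS protocol.

```python
from dataclasses import dataclass, field


@dataclass
class AgentReport:
    node_id: int
    timestamp: float        # seconds on any shared clock
    processes: list[int]    # ids of processes running on the node


@dataclass
class Center:
    timeout: float = 10.0   # hypothetical heartbeat timeout in seconds
    last_report: dict[int, AgentReport] = field(default_factory=dict)

    def receive(self, report: AgentReport) -> None:
        # Keep only the most recent report per node.
        self.last_report[report.node_id] = report

    def failed_nodes(self, now: float) -> list[int]:
        # A node is suspected failed when its last report is older than timeout.
        return sorted(
            node_id
            for node_id, report in self.last_report.items()
            if now - report.timestamp > self.timeout
        )

    def process_map(self) -> dict[int, int]:
        # Process id -> node id, built from the latest reports.
        return {
            pid: node_id
            for node_id, report in self.last_report.items()
            for pid in report.processes
        }


if __name__ == "__main__":
    center = Center()
    for node in range(4):
        # Assumed placement: two processes per node.
        center.receive(AgentReport(node, 0.0, [2 * node, 2 * node + 1]))
    for node in (0, 1, 3):
        center.receive(AgentReport(node, 8.0, [2 * node, 2 * node + 1]))
    print(center.failed_nodes(now=12.0))
```

Running several centers, as the repeated Management Center boxes may suggest, would require the agents to report to each of them; that part is left out of the sketch.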


Gridware GRD/Codine

Powerful resource management
  Integrates resource and batch management
  Ticket-based job scheduling scheme
  Well-defined interfaces

Some drawbacks at this moment
  GRD/Codine is not topology-aware
  GRD/Codine is a commercial product
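
To illustrate what a ticket-based scheme means in general, the sketch below splits each user's ticket pool across that user's pending jobs and dispatches the jobs with the most tickets first. It is a simplified illustration of the idea only and is not GRD/Codine's actual algorithm; the names Job, assign_tickets and dispatch_order and the ticket numbers are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    user: str


def assign_tickets(jobs: list[Job], user_tickets: dict[str, int]) -> dict[str, float]:
    """Split each user's ticket pool evenly across that user's pending jobs."""
    pending = {}
    for job in jobs:
        pending[job.user] = pending.get(job.user, 0) + 1
    return {
        job.name: user_tickets.get(job.user, 0) / pending[job.user]
        for job in jobs
    }


def dispatch_order(jobs: list[Job], user_tickets: dict[str, int]) -> list[str]:
    """Order jobs by tickets (highest first), breaking ties by job name."""
    tickets = assign_tickets(jobs, user_tickets)
    ranked = sorted(jobs, key=lambda job: (-tickets[job.name], job.name))
    return [job.name for job in ranked]


if __name__ == "__main__":
    jobs = [Job("a1", "alice"), Job("a2", "alice"), Job("a3", "alice"), Job("b1", "bob")]
    print(dispatch_order(jobs, {"alice": 600, "bob": 300}))
```

Because alice's 600 tickets are spread over three jobs, bob's single job outranks each of them even though bob holds fewer tickets in total. Nothing in this ordering looks at where nodes sit in the network, which is the topology-awareness gap noted above.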


COSMOS – Interaction with GRD/Codine

System Management: Node Management, SAN Management, Process Management, Storage Management, LAN Management, User Interface

GRD/Codine: Node Monitoring, Process Monitoring, Resource Management, User Interface, Accounting

Diagram label: Resource Management


Roadmap of COSMOS Development

Prototype release plan for COSMOS
  1Q2000 – Centralised process and SAN management
  2Q2000 – Distributed system management framework
  3Q2000 – Complete non-interactive management
  4Q2000 – Complete interactive management

Interaction between COSMOS & GRD/Codine
  Transfer of topology and configuration information
  Exchange of monitoring information
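
As an illustration of why transferring topology information to the resource manager matters, the sketch below picks the most compact group of free nodes for a job once the topology is known. The ring topology and the function name allocate_on_ring are assumptions made for the example; the slides do not specify the topology or the interface between COSMOS and GRD/Codine.

```python
from typing import Optional


def allocate_on_ring(free: set[int], k: int, ring_size: int) -> Optional[list[int]]:
    """Pick k free nodes on a ring so that they span as few hops as possible.

    Nodes are numbered 0 .. ring_size - 1 around the ring. Returns None when
    fewer than k nodes are free.
    """
    nodes = sorted(free)
    if k <= 0 or k > len(nodes):
        return None
    best_span = None
    best_window = None
    for start in range(len(nodes)):
        # k consecutive free nodes in ring order, wrapping past the last node.
        window = [nodes[(start + offset) % len(nodes)] for offset in range(k)]
        # Hops from the first to the last node of the window, going forward.
        span = (window[-1] - window[0]) % ring_size
        if best_span is None or span < best_span:
            best_span, best_window = span, window
    return best_window


if __name__ == "__main__":
    print(allocate_on_ring({0, 1, 4, 5, 6, 9}, k=3, ring_size=10))
    print(allocate_on_ring({9, 0, 1}, k=3, ring_size=10))
```

A scheduler without topology data could just as well hand out nodes 0, 4 and 9 in the first call; with it, the three adjacent nodes are preferred.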
