Asymmetric / Active-Active High-Availability for High-End Computing

C. Leangsuksun, V.K. Munganuru
Louisiana Tech University, Ruston, Louisiana, USA
{box, vkm001}@latech.edu

T. Liu
Dell Inc., Austin, Texas, USA
Tong_Liu@dell.com

S.L. Scott, C. Engelmann
Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
{scottsl, engelmannc}@ornl.gov

Second International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters
June 19, 2005, Cambridge, Massachusetts, USA
2
Outline
Motivation
Related Work: OSCAR
HA-OSCAR: RAS Management for HPC
Clusters: Self-awareness Approach
Analysis & Experiment
Summary & Future work
3
Motivation
Cluster architecture dominates HPC community.
Cluster architecture is prone to a single point of failure (SPoF).
Cluster sizes have grown significantly, and size and reliability have an inverse relationship…
Self-aware Reliability, Availability and Serviceability management is needed.
[Figure: reliability falls as cluster size grows]
4
Cluster “Beowulf” Architecture
Single Point of Failure
Single Point of Control
5
Availability of HEC Systems
Today’s supercomputers typically need to reboot to recover from a single failure.
Entire systems go down (regularly, and often unscheduled) for any maintenance or repair.
Compute nodes sit idle while a head or service node is down.
Availability will get worse in the future as the MTBI decreases with growing system size.
Productive computation is not done during the checkpoint/restart process.
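The inverse relationship between size and reliability can be illustrated with a simple back-of-the-envelope model (a sketch assuming independent, exponentially distributed node failures; the numbers are illustrative, not from this talk):

```python
# System MTBF under independent node failures: failure rates add, so
# a system of N identical nodes has MTBF = nodal MTBF / N.
def system_mtbf(nodal_mtbf_hours: float, num_nodes: int) -> float:
    return nodal_mtbf_hours / num_nodes

# A node that fails once every ~5 years (43,800 h) looks reliable,
# but 4,096 such nodes fail roughly every 10.7 hours as a system.
for n in (1, 128, 1024, 4096):
    print(n, round(system_mtbf(43_800, n), 1))
```

This is why the mean time between interrupts (MTBI) shrinks as systems grow, even when individual components stay equally reliable.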
6
Availability Measured by the 9’s
9's  Availability*  Downtime/Year      Examples
1    90.0%          36 days, 12 hours  Personal Computers
2    99.0%          87 hours, 36 min   Entry-Level Business
3    99.9%          8 hours, 45.6 min  ISPs, Mainstream Business
4    99.99%         52 min, 33.6 sec   Data Centers
5    99.999%        5 min, 15.4 sec    Banking, Medical
6    99.9999%       31.5 seconds       Military Defense

Enterprise-class hardware + stable Linux kernel = 5+
Substandard hardware + good high-availability package = 2-3
Today's supercomputers = 1-2
My desktop = 1-2

* Based on MTBI (mean time between interrupts), covering both software and hardware interrupts.
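The downtime column follows directly from the availability percentages; a small script (the helper name is ours, for illustration) reproduces the figures:

```python
# Downtime per year implied by an availability level:
# downtime = (1 - availability) * one year.
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability: float) -> float:
    return (1.0 - availability) * MIN_PER_YEAR

for nines, avail in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    print(f"{nines} nines: {downtime_minutes_per_year(avail):.1f} min/year")
# 3 nines -> 525.6 min (8 h 45.6 min), 4 nines -> 52.6 min,
# 5 nines -> 5.3 min, matching the table above.
```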
7
Solution: Active Redundancy
8
Clustering High-Availability Models
Active – Hot-Standby
Asymmetric / Active – Active
Symmetric / Active – Active
9
Open Source Cluster Application Resources
Framework for cluster installation, configuration, and management
Commonly used cluster tools
Wizard-based cluster software installation:
Operating system
Cluster environment
Administration
Operation
Automatically configures cluster components
Increases consistency among cluster builds
Reduces time to build / install a cluster
Reduces need for expertise
[Figure: the OSCAR installation wizard, Step 1 (Start…) through Step 8 (Done)]
10
HA-OSCAR: Active – Hot-Standby
Production-quality open-source Linux-cluster project
Combines HA and HPC clustering techniques to enable critical HPC infrastructure
Self-configuring multi-head Beowulf system
HA-enabled HPC services: Active / Hot-Standby
Self-healing with 3-5 second automatic failover time
The first known field-grade open-source HA Beowulf cluster release
11
HA-OSCAR Serviceability
Self-build and configuration of a multi-head Beowulf system
Adopts the same ease of build and operation as OSCAR
~30-minute installation
Disaster recovery takes almost the same time (provided you are prepared)
Step 1
Step 2: create head image
Step 3: clone image
Step 4: configure standby
Step 5: web admin to add/configure more services
12
Adaptive Recovery State Diagram
[State diagram: adaptive recovery]
Working → failure detected → alert → recovery (restore previous state, increment retry counter).
When the retry threshold is reached after # retries, fail over: the standby switches over and takes control.
After the primary node is repaired, an optional fallback returns control to it.
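The recovery logic in the state diagram can be sketched as a small state machine (a hedged illustration; HA-OSCAR implements this inside its monitoring daemons, and the class and method names below are invented for the sketch):

```python
# States: WORKING -> (failure) -> RECOVERY, retrying up to a threshold
# while restoring the previous state; once the threshold is reached,
# FAILOVER (standby takes control); optional fallback after repair.
class RecoveryFSM:
    def __init__(self, retry_threshold: int = 3):
        self.state = "WORKING"
        self.retries = 0
        self.threshold = retry_threshold

    def on_failure(self):
        # Alert + detect: attempt local recovery first, counting retries.
        self.retries += 1
        if self.retries >= self.threshold:
            self.state = "FAILOVER"  # standby switches over, takes control
        else:
            self.state = "RECOVERY"

    def on_recovered(self):
        self.state = "WORKING"       # previous state restored
        self.retries = 0

    def on_primary_repaired(self, fallback: bool = True):
        if self.state == "FAILOVER" and fallback:
            self.state = "WORKING"   # optional fallback to the primary
            self.retries = 0

fsm = RecoveryFSM()
fsm.on_failure(); fsm.on_failure(); fsm.on_failure()
print(fsm.state)  # FAILOVER once the retry threshold is reached
```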
13
Monitoring & Self-healing cores
Service Monitor: PBS, MAUI, NFS, and HTTP services are monitored
Resource Monitor: load_average, disk_usage, and free_memory are monitored
Health-channel Monitor: eth0 and eth0:1 interfaces are monitored
Self-Healing Daemon
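A minimal sketch of the kinds of checks these monitoring cores perform (the function names and thresholds are our assumptions for illustration; the real daemons watch the services, resources, and interfaces listed above):

```python
import os
import socket

# Service monitor: is a TCP service (e.g. HTTP on the head node) alive?
def service_up(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Resource monitor: 1-minute load average (os.getloadavg is POSIX-only).
def load_ok(threshold: float = 8.0) -> bool:
    return os.getloadavg()[0] < threshold

# Health-channel monitor: does a physical interface (eth0) exist?
# Alias interfaces like eth0:1 do not show up under /sys/class/net,
# so a real monitor needs a different check for those.
def interface_present(name: str) -> bool:
    return os.path.exists(f"/sys/class/net/{name}")
```

In HA-OSCAR these checks feed the self-healing daemon, which drives the adaptive recovery logic shown on the previous slide.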
14
HA-OSCAR RAS Software Stack
HA-OSCAR Management layer
Monitoring and self-healing core
OS application services
Operating system (OS) hardware interface
HPI wrapper
Intelligent sensors
Redundant H/W platform
15
Asymmetric / Active-Active Architecture
16
Failover in the Asymmetric / Active-Active Architecture
17
Asymmetric/Symmetric Active/Active
18
Reality Checks
Great! We have a highly reliable HPC system!
But how much improvement? What is the total uptime? The performance?
Analytical model and prediction
Statistical technique to compare uptime
How many 9's? (downtime per year)
Stochastic Reward Net with the SPNP package
Identical hardware parameters between Beowulf and HA-OSCAR multi-heads
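One way to realize a statistical uptime comparison is a Monte-Carlo simulation (a sketch under simplified assumptions: exponential head failures and fixed outage lengths; this is not the Stochastic Reward Net the authors actually solved with SPNP, so the numbers differ from theirs):

```python
import random

def simulated_availability(mttf_h: float, outage_h: float,
                           years: int = 200, seed: int = 1) -> float:
    """Alternate exponential up-times with fixed outages and
    return the fraction of time the head node was up."""
    rng = random.Random(seed)
    up = down = 0.0
    horizon = years * 8760.0
    while up + down < horizon:
        up += rng.expovariate(1.0 / mttf_h)  # time to next head failure
        down += outage_h                     # repair (24 h) or failover (~10 s)
    return up / (up + down)

beowulf = simulated_availability(2000, outage_h=24)        # repair the head
ha_oscar = simulated_availability(2000, outage_h=10/3600)  # 10 s failover
print(f"{beowulf:.4f} vs {ha_oscar:.6f}")
```

The full SRN model also accounts for scheduled downtime and compute-node effects, which is why the paper's Beowulf figures are lower than this head-only estimate.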
19
Availability vs Unavailability
Planned and unplanned downtime
Scheduled downtime = 200 hrs
Repair time = 24 hrs
Monitoring interval = 10 sec
HA-OSCAR solution vs. traditional Beowulf (total availability is dominated by the service nodes):
Availability: 99.99+% vs. 91+%
Lost computation: ~1K vs. ~10M TFLOP (on a 1-TFLOP system)
Lost investment: ~$70K vs. ~$2M (on a $20M system)
[Chart: availability vs. node-wise mean time to failure (hr), HA-OSCAR vs. traditional Beowulf]

MTTF (hr)  1000      2000      4000      6000      8000      10000
Beowulf    90.580%   91.575%   92.081%   92.251%   92.336%   92.387%
HA-OSCAR   99.9684%  99.9896%  99.9951%  99.9962%  99.9966%  99.9968%

Model assumptions: scheduled downtime = 200 hrs; nodal MTTR = 24 hrs; failover time = 10 s; during maintenance on the head, the standby node acts as primary.
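The gap between the two curves comes almost entirely from the length of the service interruption. A back-of-the-envelope steady-state formula (our simplification, not the full SRN model, so it only approximates the figures above) shows the mechanism:

```python
HOURS_PER_YEAR = 8760.0

def head_availability(mttf_h: float, outage_h: float,
                      scheduled_h: float = 0.0) -> float:
    """Steady-state availability: unplanned up-time fraction times the
    fraction of the year not lost to scheduled downtime."""
    unplanned = mttf_h / (mttf_h + outage_h)
    planned = 1.0 - scheduled_h / HOURS_PER_YEAR
    return unplanned * planned

# Beowulf head: 24 h repair per failure plus 200 h scheduled downtime.
print(head_availability(1000, 24, scheduled_h=200))   # ~0.954
# HA-OSCAR: ~10 s failover; the standby covers scheduled maintenance.
print(head_availability(1000, 10 / 3600))             # ~0.999997
```

The SRN results are lower than this head-only estimate because they also model compute-node failures, but the ordering and the rough number of 9's match.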
[Chart: lost investment (million $) due to unavailability vs. MTTF (hours), based on a $20M system; Beowulf vs. HA-OSCAR]

[Chart: unavailable performance (GFLOP, log scale) vs. MTTF (hours), based on a 1-TFLOP machine over a year; HA-OSCAR solution vs. traditional Beowulf]