BEACON Edited: Dec 11th. Summary Principle Scenarios Existing Technologies.

BEACON

Edited: Dec 11th

Summary

• Principle
• Scenarios
• Existing Technologies

Logical Representation of BEACON and Beacon end-points (Beeps)

[Figure: the BEACON and Exposé backplanes exchange notifications and commands with the Job and Resource manager, the RAS system, and the enclaves. Each enclave contains applications, runtime systems, and nodes (OS, CPUs).]

Logical Representation of BEACON

[Figure: the Job and Resource manager and the RAS channel connect to Global BEACON daemons. Each node (CPUs) hosts a Local BEACON that exchanges notifications and commands with the application, runtime systems, and system software; Local BEACONs exchange notifications and commands with the Global BEACONs. Nodes are grouped into enclaves.]

BEACON Principle

[Figure: two enclaves; each node hosts a Local BEACON and a Global BEACON.]

• Two daemons help with failure containment, fault isolation, and security
• The global Beacon is created when the node boots; it connects to the global Beacons on other active nodes during startup
• The local Beacon is launched with the job in an enclave; it connects to the global Beacon on the same node

Beacon Related Services

[Figure: layered diagram with the following labels.]
• Users of BEACON: OS, Runtime, Applications, RMS, RAS, Enclave services, EXPOSE
• BEACON API
• BEACON Services: Translators, Response management, Logger, Query management?
• BEACON Transport: Unreliable channel, Reliable channel
• Underlying networks: IP multicast, TCP/IP, PAMI (BG/Q), IBM machine?, uGNI (XK6), Aries (XC30)?

BEACON Events

• Beacon will support two types of data
– Internal events (subscriptions, Beacon maintenance, announcements, etc.)
– External events (notifications, commands)
• Internal events can be produced and consumed by Beacon and its services
• External events are produced and consumed by all Beeps
• ? Do we need discrete and stream events? Stream throttling? Scenarios?

BEACON Event Format

Priority:
– reliable or not
– discrete or stream (if needed)

Payload:
– generated and interpreted by Beeps
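The event types and format above could be modeled roughly as follows. This is only a sketch: the class names, the `topic` field, and the `may_consume` helper are assumptions made for illustration and are not part of any BEACON specification.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    INTERNAL = "internal"   # subscriptions, Beacon maintenance, announcements
    EXTERNAL = "external"   # notifications, commands


@dataclass(frozen=True)
class Priority:
    reliable: bool          # reliable or unreliable delivery
    stream: bool = False    # discrete (False) or stream (True), if needed


@dataclass(frozen=True)
class Event:
    kind: Kind
    topic: str              # hypothetical field used to match subscriptions
    priority: Priority
    payload: bytes          # opaque to Beacon; generated and interpreted by Beeps


def may_consume(event: Event, consumer_is_beep: bool) -> bool:
    """Internal events stay inside Beacon and its services; external events go to anyone."""
    if event.kind is Kind.INTERNAL:
        return not consumer_is_beep
    return True
```

Keeping the payload as opaque bytes reflects the statement that only Beeps generate and interpret it.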

BEACON Start-up

• Discovery and Topology
– The Discovery and Topology daemon will reside on a permanent node (similar to the service node in BG)
– It will help establish the topology of the global Beacon daemons; global daemons will contact it for parent discovery
– Scalable, resilient (replication)
– Topology options are still being researched:
  • Small degree
  • Small diameter
  • High resilience
  • Multiple paths
  • CHORD and other P2P topologies are candidates
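As one illustration of parent discovery, the sketch below assumes a fixed-degree tree in which each global daemon's parent is determined by registration order. The slides leave the topology open, so the tree shape, the `DiscoveryDaemon` class, and its `register` method are all hypothetical; replication and failure handling are omitted.

```python
from typing import List, Optional


class DiscoveryDaemon:
    """Hypothetical discovery service that assigns each global daemon a parent."""

    def __init__(self, degree: int = 2):
        self.degree = degree            # maximum number of children per daemon
        self.members: List[str] = []    # registration order defines the tree

    def register(self, node: str) -> Optional[str]:
        """Record `node` and return its parent, or None if it becomes the root."""
        index = len(self.members)
        self.members.append(node)
        if index == 0:
            return None
        # Standard k-ary heap layout: the parent of slot i is slot (i - 1) // k.
        return self.members[(index - 1) // self.degree]
```

A small degree keeps the fan-out per daemon low at the cost of a deeper tree, which is the degree/diameter trade-off listed among the topology options.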

BEACON Transport

• BEACON transport can deliver events reliably or unreliably
• Unreliable delivery: no delivery guarantees
• Reliable delivery: reliability will need to be end-to-end across a distributed chain of agents (a higher-level protocol than TCP)
• Event buffering
– Required because every event message carries a Time-To-Live (TTL)
– The TTL is set by the publisher (from 0 for immediate to a few minutes?)
– Producer produces events, but the subscriber disappears before the event reaches it: the event is dropped after its TTL
– Producer produces events, but the subscription has not yet propagated through the system: the event will be sent to the subscriber (by the logger) if its TTL is still valid
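The buffering rules above can be sketched with a small TTL-aware buffer. The `EventBuffer` class and its method names are assumptions for illustration; the injectable clock exists only so the behaviour can be checked without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BufferedEvent:
    topic: str
    payload: str
    ttl: float             # seconds, chosen by the publisher
    published_at: float    # clock reading at publish time


class EventBuffer:
    """Holds events until their TTL expires so late subscribers can still get them."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.events: List[BufferedEvent] = []

    def publish(self, topic: str, payload: str, ttl: float) -> None:
        self.events.append(BufferedEvent(topic, payload, ttl, self.clock()))

    def _expire(self) -> None:
        # Drop every event whose TTL has elapsed; ttl == 0 is never retained.
        now = self.clock()
        self.events = [e for e in self.events if now - e.published_at < e.ttl]

    def replay(self, topic: str) -> List[str]:
        """Payloads still within TTL, for a subscription that arrived late."""
        self._expire()
        return [e.payload for e in self.events if e.topic == topic]
```

In this sketch a TTL of 0 means "deliver immediately or not at all": such an event is never replayed to a late subscriber.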

BEACON Services

• Use the Beacon API (no other point-to-point messaging)
• Translators – Translate events so that they can be understood semantically between Beeps
• Response Management – Manages responses and coordinates different entities following recovery plans
• Logger – Logs external events and re-publishes events with an unexpired TTL for new (or restarting) subscribers; duplicate events (re-published by the logger) will not be re-delivered to subscribers
• ? Query Management – Manages queries within the BEACON framework ?
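One way to keep logger re-publication from re-delivering duplicates is to give each event an identifier and have the delivery side remember what it has already seen. The sketch below assumes such identifiers exist; the `Logger` and `Subscriber` classes are hypothetical, and TTL expiry is left out to keep the focus on duplicate suppression.

```python
class Subscriber:
    """Delivery endpoint that ignores events it has already received."""

    def __init__(self):
        self.seen = set()       # event ids already delivered
        self.delivered = []     # payloads in delivery order

    def deliver(self, event_id, payload):
        """Return True if the event was new and therefore delivered."""
        if event_id in self.seen:
            return False
        self.seen.add(event_id)
        self.delivered.append(payload)
        return True


class Logger:
    """Records external events and re-publishes them to a subscriber on request."""

    def __init__(self):
        self.log = []           # (event_id, payload) in publication order

    def record(self, event_id, payload):
        self.log.append((event_id, payload))

    def replay_to(self, subscriber):
        """Re-publish the whole log; return how many events were actually new."""
        return sum(subscriber.deliver(eid, payload) for eid, payload in self.log)
```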

Translators

• The translators do not perform actions: they just read an event and publish a new event, using state information to translate the payload
• Subscribers would have to subscribe to events coming from the translators
• For any system that does a mapping and/or allocation, we need a translator that can reverse the mapping
• For ARGO, we will build a specific translator only when there is no other software in the process stack performing that translation (e.g. if MPI can tell that rank Y is failing when 0x1234 fails, then we do not need a translator for that)
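A translator that reverses a mapping can be sketched as a pure function over events. The example assumes a rank-to-node placement table and dictionary-shaped events; the function names, topics, and field names are illustrative only.

```python
from collections import defaultdict


def reverse_mapping(rank_to_node):
    """Invert a rank -> node placement into node -> [ranks]."""
    node_to_ranks = defaultdict(list)
    for rank, node in sorted(rank_to_node.items()):
        node_to_ranks[node].append(rank)
    return dict(node_to_ranks)


def translate_node_failure(event, node_to_ranks):
    """Read a node-failure event and return the rank-failure events to publish.

    The translator takes no recovery action itself; it only rewrites the payload
    using its state (the reversed mapping).
    """
    return [
        {"topic": "rank-failure", "rank": rank, "at": event["at"]}
        for rank in node_to_ranks.get(event["node"], [])
    ]
```

A node that hosts no ranks yields no output events, so subscribers to the translated topic are not notified about failures that do not concern them.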

Example Scenario

• A fan has failed. This will cause several nodes and switches to fail within 5 seconds. The failure will affect several jobs and the network. Some of the jobs can take preventive measures to handle node failures; others cannot.
– The fan controller issues the event “fan 17245 failed at 00:00:00”
– “Translator process” A subscribes to “fan failures in the system”, picks up this message, and issues several messages of the form “node 175 will fail at 00:00:05”
– “Translator process” B subscribes to “node failures in the system”, picks up this message, and issues the message “node 73 of enclave foo will fail at 00:00:05”
– The enclave manager C subscribes to “node failures in enclave foo”, picks up this message, and issues messages of the form “process with rank 25 in MPI_COMM_WORLD will fail at 00:00:05”

Example scenario

Ideally speaking,
• Translator A uses information on the physical system topology; it could also use information on the current system health
• Translator B uses information on the nodes allocated to each enclave (by the global resource manager)
• Translator C uses information on the mapping of MPI processes to the nodes (by the partition manager)

Practically speaking,
• Creation of translators might be scenario based
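The fan-failure chain can be traced end to end with a toy in-process publish/subscribe bus. Everything here is a stand-in: the `Bus` class is not the BEACON API, the topic strings are invented, and the three lookup tables (including the assumption that system node 175 is node 73 of enclave foo) are hypothetical state for translators A, B, and C.

```python
from collections import defaultdict


class Bus:
    """Tiny in-process stand-in for BEACON publish/subscribe."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.trace = []                       # every published (topic, fields)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, **fields):
        self.trace.append((topic, fields))
        for handler in list(self.subscribers[topic]):
            handler(fields)


# Hypothetical state held by each translator.
FAN_TO_NODES = {17245: [175]}                 # A: physical topology
NODE_TO_ENCLAVE = {175: ("foo", 73)}          # B: nodes allocated to enclaves
ENCLAVE_NODE_TO_RANKS = {("foo", 73): [25]}   # C: MPI ranks placed on nodes
FAILURE_DELAY_S = 5                           # nodes fail 5 s after the fan


def wire(bus):
    """Subscribe translators A and B and enclave manager C to their topics."""

    def translator_a(ev):
        for node in FAN_TO_NODES.get(ev["fan"], []):
            bus.publish("node-failure", node=node, at=ev["at"] + FAILURE_DELAY_S)

    def translator_b(ev):
        if ev["node"] in NODE_TO_ENCLAVE:
            enclave, local = NODE_TO_ENCLAVE[ev["node"]]
            bus.publish("node-failure/" + enclave,
                        enclave=enclave, node=local, at=ev["at"])

    def manager_c(ev):
        for rank in ENCLAVE_NODE_TO_RANKS.get((ev["enclave"], ev["node"]), []):
            bus.publish("rank-failure", rank=rank, at=ev["at"])

    bus.subscribe("fan-failure", translator_a)
    bus.subscribe("node-failure", translator_b)
    bus.subscribe("node-failure/foo", manager_c)
```

Publishing a single `fan-failure` event on a wired bus leaves four entries in `bus.trace`, one per stage of the scenario.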

Beacon Scenarios

Double bit error: detected/uncorrectable

• The application and the library can both handle the error; the Response Manager decides which one does the correction
• Example of application: bag of tasks, each task calling linear algebra functions or FFTs (ABFT version)

Double bit error: detected/uncorrectable
In App: App handles

[Figure: two sequence diagrams across the App, Lib, OS, and MemCont layers.]

“Classic” way:
– App registers a handler (Register@Handler)
– Mem access or scrubbing
– Hardware interrupt
– Progress is stopped
– Invocation of signal handler
– Handler fixes or not
– Handler returns to OS
– OS returns control to App

Beacon way (with App-level handler, Lib-level handler, Response Manager, and Beacon):
– Mem access
– Hardware interrupt
– Progress is stopped
– OS uses API to ask for a response
– Manager decides App should fix
– Handler fixes or not
– App handler returns to OS
– OS needs to accept multiple handlers

Double bit error: detected/uncorrectable
In Lib: Lib handles

[Figure: sequence diagram across the App, Lib, OS, and MemCont layers.]

Beacon way (with App-level handler, Lib-level handler, Response Manager, and Beacon):
– Mem access
– Hardware interrupt
– Progress is stopped
– Manager decides Lib should fix
– Handler fixes or not
– Lib handler returns to OS
– OS returns control to Lib

Double bit error: detected/uncorrectable
In Lib: App handles

[Figure: sequence diagram across the App, Lib, OS, and MemCont layers.]

Beacon way (with App-level handler, Lib-level handler, Response Manager, and Beacon):
– Mem access
– Hardware interrupt
– Progress is stopped
– Manager decides App should fix
– Handler fixes or not
– App handler returns to OS


Note that the correction may be attempted in the Lib first; if the Lib does not succeed, the application handler could then be called. The corresponding diagram could be built from this one and the previous one.
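The "Lib first, then App" fallback can be sketched as an ordered chain of handlers. The function and handler names, the address values, and the rule that the library can only repair its own buffers are all assumptions made for illustration; a real implementation would run inside OS-level error handling rather than as plain function calls.

```python
def handle_uncorrectable_error(address, handlers):
    """Try each (name, handler) in order; return the name that fixed the error, else None."""
    for name, handler in handlers:
        if handler(address):
            return name
    return None


# Hypothetical handlers: the library can only repair memory in its own buffers,
# while the application can always recompute the affected task.
LIB_BUFFERS = {0x1000, 0x1008}


def lib_handler(address):
    return address in LIB_BUFFERS


def app_handler(address):
    return True


# Order decided by the response manager: library first, then application.
CHAIN = [("lib", lib_handler), ("app", app_handler)]
```

With this chain, an error at `0x1000` is resolved by the library handler, while an error at an address outside `LIB_BUFFERS` falls through to the application handler.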

Response Management (RM)

• Entities that subscribe to and receive events will want to respond with actions
• A response management framework will need to manage response/recovery authorizations in a systematic manner without compromising system stability
• Phases of the BEACON software: each BEACON-enabled software component will have the following phases:
1. Announcement of capabilities: entities have to announce their response capabilities for various events. Responses are declared on a per-event basis by every component
2. Exchange of events: publish and subscribe to events; receive events
3. Responding to events: the RM will implement a response plan, decide who should take action, and publish the corresponding events. The response/recovery sequence is listed in an admin-provided data file

Response Management

Response manager
– Tracks when components connect and exit
– One exists per enclave. We might add a global response manager, if needed
– Will subscribe to events of topic = “auth-requested”
– Will publish events of topic = “auth-response”, which indicate whether a software component has permission to start recovery
– “auth-response” events are also called commands

Response Manager Protocol in case of multiple recovery options

[Figure: message sequence between a fault-tolerant application, the Migration Manager (MM), the Response Manager, and BEACON.]

Response plan: try the app first, then migration

1. The application and the MM each receive event foo
2. Each publishes “Want Auth for foo”
3. The Response Manager publishes “Auth granted” to (1) the App; (2) the MM
4. Publish “Recovery Started” (the application, per the response plan)
5. Publish “Recovery Failed”
6. Publish “Recovery Started” (the MM, per the response plan)
7. Publish “Recovery Completed”
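The ordering logic of this protocol can be sketched as a small state machine. The `ResponseManager` class below is hypothetical: it assumes the plan is already loaded from the admin-provided file as a dictionary, grants authorization strictly in plan order, and leaves out timeouts, "Recovery Started" tracking, and "Recovery Completed" cleanup.

```python
class ResponseManager:
    """Grants recovery authorization in the order given by the response plan."""

    def __init__(self, plan):
        self.plan = plan        # event -> ordered list of responders
        self.turn = {}          # event -> index of the responder whose turn it is
        self.waiting = {}       # event -> responders that asked for authorization
        self.published = []     # ("auth-response", event, responder) commands

    def on_auth_requested(self, event, responder):
        """Handle an "auth-requested" event from a responder."""
        self.waiting.setdefault(event, set()).add(responder)
        self.turn.setdefault(event, 0)
        self._grant_if_ready(event)

    def on_recovery_failed(self, event, responder):
        """Move to the next responder in the plan when the current one fails."""
        if self._current(event) == responder:
            self.turn[event] = self.turn.get(event, 0) + 1
            self._grant_if_ready(event)

    def _current(self, event):
        order = self.plan.get(event, [])
        index = self.turn.get(event, 0)
        return order[index] if index < len(order) else None

    def _grant_if_ready(self, event):
        responder = self._current(event)
        if responder in self.waiting.get(event, set()):
            self.waiting[event].discard(responder)
            self.published.append(("auth-response", event, responder))
```

Granting only to the responder whose turn it is means the MM's request is held, not rejected, until the application reports "Recovery Failed". A responder that never asks for authorization would stall this sketch, which is where a timeout would be needed.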

Query management

• Currently, no scenarios seem to require this feature
– Wait-and-see approach; reliable BEACON provides a foundation to build this on anyway

Existing Technologies

• Characterization of the system architecture to be used in the ARGO project

• Looked at existing technologies (Astrolabe, Google Dapper, IBM Elastic subscribe)
– Nothing can be picked up and used as-is, since most are designed for the Internet, use gossip protocols, and do not offer reliable delivery
• Other potential technologies under investigation
– CIFTS, AMQP, EVPATH

EVPATH
