Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central...

24
©Gabriel Kliot, Technion 1 Condor week – March 2005 Adding Adding High Availability High Availability to Condor to Condor Central Manager Central Manager Gabi Kliot Technion – Israel Institute of Technology

Transcript of Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central...

Page 1: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 1 Condor week – March 2005

Adding Adding High Availability High Availability

to Condor to Condor Central Manager Central Manager

Gabi KliotTechnion – Israel Institute of Technology

Page 2: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 2 Condor week – March 2005

Outline

Current Condor pool Motivation for Highly Available

Central Manager The solution - HA Daemon Performance impacts Testing Future Work

Page 3: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 3 Condor week – March 2005

Collector

Negotiator

Current Condor Pool

Startd and ScheddStartd and Schedd

Startd and ScheddStartd and ScheddStartd and Schedd

Startd and ScheddStartd and ScheddCentral Manager

Page 4: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 4 Condor week – March 2005

Why Highly Available Central Manager

Central manager is a single-point-of-failure Negotiator’s failure - No additional matches

will be possible Collector’s failure – negotiator is out of job,

tools querying collector won’t work, etc.

Our goal Allow continuous pool functioning in

case of failure

Page 5: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 5 Condor week – March 2005

Highly Available Condor Pool

Startd and ScheddStartd and Schedd

Startd and ScheddStartd and Schedd Startd and Schedd

Startd and ScheddStartd and Schedd

IdleCentral

Manager

IdleCentral

Manager

ActiveCentral

Manager

Highly AvailableCentral Manager

Page 6: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 6 Condor week – March 2005

Highly Available Central Manager

Our solution - Highly Available Central Manager Automatic failure detection Transparent failover to backup matchmaker

(no global configuration change for the pool entities)

“Split brain” reconciliation after network partitions

State replication between active and backups No changes to Negotiator/Collector code

Page 7: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 7 Condor week – March 2005

Highly Available Central Manager

How it works

Collector’s HA is provided by redundancy Negotiator’s HA is provided by HA daemons

Page 8: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 8 Condor week – March 2005

Collector

Collector

Collector

HAD HADHAD

How it works – Election

Election msgElection msg

Election msg

Workstation – Startd and Schedd

Workstation – Startd and Schedd

Page 9: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 9 Condor week – March 2005

How it works – Election

Collector

Leader HAD

Collector

HAD

Collector

HAD

Negotiator

I’m aliveI’m alive

Workstation – Startd and Schedd

Workstation – Startd and Schedd

Page 10: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 10 Condor week – March 2005

How it works – basic scenario

Collector

Leader HAD

Collector

HAD

Collector

HAD

Negotiator

Active CM Idle CMIdle CM

I’m aliveI’m alive

Workstation – Startd and Schedd

Workstation – Startd and Schedd

Page 11: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 11 Condor week – March 2005

Collector

Leader HAD

Collector

HAD

Collector

HAD

Negotiator

Active CM Idle CMIdle CM

Workstation – Startd and Schedd

Workstation – Startd and Schedd

I’m aliveI’m alive

How it works – crash event

Page 12: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 12 Condor week – March 2005

How it works – crash event

Collector

Leader HAD

Collector

HAD

Collector

HAD

Negotiator

Active CM Idle CMIdle CM

Workstation – Startd and Schedd

Workstation – Startd and Schedd

Election

Page 13: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 13 Condor week – March 2005

Collector

Leader HAD

Collector

HAD

Collector

LEADER HAD

Negotiator

Active CM Idle CMActive CM

Workstation – Startd and Schedd

Workstation – Startd and Schedd

I’m Alive

How it works – crash event

Negotiator

Page 14: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 14 Condor week – March 2005

I am alive

PRE PASSIVE

Self ID < Other received ID

Timeout with no I’m Alive message

LEADER ELECTION

Self ID < Other IDKill Negotiator

I am the HAD with highest ID after election timeout

Run Negotiator

High Availability DaemonState machine

Page 15: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 15 Condor week – March 2005

Performance impact

Stabilization time – the time it takes for HA daemons to detect failure and fix it

Depends on number of CMs and network performance

HAD_CONNECT_TIMEOUT – time it takes to establish TCP connection (depends on network type, presence of encryption, etc…)

Assuming it takes up to 2 seconds to establish TCP connection and 2 CMs are used - new Negotiator is up and running after 48 seconds

Page 16: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 16 Condor week – March 2005

Testing

Special automatic distributed testing framework was built: simulation of node crashes network disconnections network partition and merges

Extensive testing effort: distributed testing on 5 linux machines in the

Technion interactive distributed testing in Wisc pool automatic testing with NMI framework

Already deployed and fully functioning for 3 weeks on our production pool in the Technion

Page 17: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 17 Condor week – March 2005

Future development

HAD publishing in Collectors condor_status –had

Accounting file replication current solution is provided for NFS

Software High Availability

Page 18: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 18 Condor week – March 2005

Collaboration with Condor team

Compliance with high Condor coding standards

Peer-reviewed code Integration with NMI framework Automation of testing Open-minded attitude of Condor team to

numerous requests and questions Unique experience of working with large peer-

managed group of talented programmers

Page 19: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 19 Condor week – March 2005

Collaboration with Condor team

This work was a collaborative effort of: Distributed Systems Laboratory

in Technion Prof Assaf Schuster, Mark Silberstein,

Gabi Kliot, Svetlana Kantorovitch, Dedi Carmeli, Artiom Sharov

Condor team Prof Miron Livni, Nick, Todd, Derek, Erik,

Carey, Peter, Becky, Parag, Zack, Dan

Page 20: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 20 Condor week – March 2005

Part of the official 6.7.6 development release

Full support by the Technion team More information:

http://dsl.cs.technion.ac.il/projects/gozal/project_pages/ha/ha.html

more details + configuration on my tutorial tomorrow

Contact: [email protected] [email protected]

You should definitely try it !

Page 21: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 21 Condor week – March 2005

In case of time

Page 22: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 22 Condor week – March 2005

How it works – basic scenario

WorkstationStartd and Schedd

WorkstationStartd and Schedd

WorkstationStartd and Schedd

Collector

Negotiator

Collector Collector

HAD Leader HAD

HAD

Machine Cbackup server

Machine Bactive server

Machine Abackup server

“I‘m alive”“I‘m alive”

Page 23: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 23 Condor week – March 2005

WorkstationStartd and Schedd

WorkstationStartd and Schedd

WorkstationStartd and Schedd

Collector

Negotiator

Collector Collector

HAD

Leader HAD

HAD

Machine Cbackup server

Machine Bactive server

Machine Abackup server

Leader Election

How it works – crash event

Page 24: Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.

©Gabriel Kliot, Technion 24 Condor week – March 2005

Usability and administration

Configuration sanity check perl script

Disable HAD perl script Detailed manual section Full support by Technion team