Condor + Cloud Scheduler Ashok Agarwal, Patrick Armstrong, Andre Charbonneau, Ryan Enge, Kyle Fransham, Colin Leavett-Brown, Michael Paterson, Randall Sobie, Ryan Taylor ATLAS Software & Computing Workshop 2011-10-20 University of Victoria, Victoria, Canada National Research Council of Canada, Ottawa Ian Gable


Condor + Cloud Scheduler

Ashok Agarwal, Patrick Armstrong, Andre Charbonneau, Ryan Enge, Kyle Fransham, Colin Leavett-Brown, Michael Paterson, Randall Sobie, Ryan Taylor

ATLAS Software & Computing Workshop 2011-10-20

University of Victoria, Victoria, Canada
National Research Council of Canada, Ottawa

Ian Gable

Outline

• Strategy and Goals
• Condor + Cloud Scheduler overview
• Proof of concept
• Status
• Summary

Strategy and Goals

• Use Condor + Cloud Scheduler to execute Panda pilots on IaaS VMs.
• Use the CernVM batch node image (it already includes Condor).
• Condor + Cloud Scheduler handles the creation of the virtual machines and the movement of the pilot credentials onto the VMs.
• Make it work without any modifications to how Panda works now.

Condor + Cloud Scheduler

• Cloud Scheduler is a simple Python package created by UVic and NRC to boot VMs on IaaS clouds based on Condor queues.
• Users create a VM with their experiment software installed.
  – A basic VM is created by our group, and users add their analysis or processing software to create their custom VM.
  – For experiments that are supported, CernVM is the best option.
• Users then create Condor batch jobs as they would on a regular cluster, but they specify which VM image should run their jobs.
• Aside from the VM creation step, this is very similar to the regular HTC workflow.

Implementation Details

• Condor used as job scheduler
  – VMs contextualized with Condor pool URL and service certificate.
  – Each VM has the Condor startd daemon installed, which advertises to the central manager at start.
  – GSI host authentication used when VMs join pools.
  – User credentials delegated to VMs after boot by job submission.
  – Condor Connection Broker used to get around private-IP clouds.
• Cloud Scheduler
  – User proxy certs used for authenticating with the IaaS service where possible (Nimbus). Otherwise a secret API key is used (EC2 style).
  – Can communicate with Condor using the SOAP interface (slow at scale) or via condor_q.
  – Primarily supports Nimbus and Amazon EC2, with experimental support for OpenNebula and Eucalyptus.
  – VMs reused until there are no more jobs requiring that VM type.
  – Images pulled via HTTP or must preexist at the site.

Condor Job Description File

    Universe = vanilla
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT

    Requirements = VMType =?= "BaBarAnalysis-52"
    +VMLoc = "http://vmrepo.heprc.uvic.ca/CernVM-Batch"
    +VMCPUArch = "x86_64"
    +VMStorage = "1"
    +VMCPUCores = "1"
    +VMMem = "2555"
    +VMAMI = "ami-64ea1a0d"
    +VMInstanceType = "m1.small"
    +VMJobPerCore = True
    getenv = True

    Queue

The lines prefixed with "+" are custom Condor attributes. Cloud Scheduler uses these custom attributes to compose a call to the IaaS service. EC2 REST interfaces are supported in addition to Nimbus IaaS with X.509.
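As a rough illustration of how custom attributes could be turned into an IaaS request, the sketch below parses a job description like the one above and collects the "+" attributes into a plain dictionary. It is not Cloud Scheduler's actual code: the function names `parse_submit_description` and `build_vm_request`, and the keys of the returned dictionary, are assumptions made for this example.

```python
SAMPLE = '''
Universe = vanilla
Requirements = VMType =?= "BaBarAnalysis-52"
+VMLoc = "http://vmrepo.heprc.uvic.ca/CernVM-Batch"
+VMCPUArch = "x86_64"
+VMStorage = "1"
+VMCPUCores = "1"
+VMMem = "2555"
+VMAMI = "ami-64ea1a0d"
+VMInstanceType = "m1.small"
Queue
'''


def parse_submit_description(text):
    """Split a job description into standard commands and custom (+) attributes."""
    commands, custom = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and bare keywords such as "Queue".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        # Remove surrounding quotes only when the whole value is quoted.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        if key.startswith("+"):
            custom[key[1:]] = value
        else:
            commands[key.lower()] = value
    return commands, custom


def build_vm_request(custom):
    """Collect VM parameters into a dict (hypothetical request layout)."""
    return {
        "image": custom.get("VMLoc"),
        "arch": custom.get("VMCPUArch"),
        "cores": int(custom.get("VMCPUCores", 1)),
        "memory": int(custom.get("VMMem", 0)),
        "storage": int(custom.get("VMStorage", 0)),
        "ami": custom.get("VMAMI"),
        "instance_type": custom.get("VMInstanceType"),
    }


commands, custom = parse_submit_description(SAMPLE)
request = build_vm_request(custom)
```

A real scheduler would read these attributes from the job ClassAds in the Condor queue rather than from the submit file text; parsing the text here only keeps the example self-contained.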

Proof of Concept

Initial Result

• Basic proof of concept works with manual submission of pilot jobs to a single cloud.
• Running on the Hermes cloud at the University of Victoria (~300 cores, in production operation since June 2010; see extra slides for details).
• Data transfer via dccp to the VM is working.
• Log output was transferred to the SE and back to the Panda server.
• Minor problems running the pilot inside CernVM (some outdated configs, and a learning curve for us).

Possible Improvements

Add a pilot factory

Add more clouds

Possible Simplification

• Try using the development version of Condor (7.7+) with support for the EC2 grid resource type.
• UVic has tested with Amazon proper but not with other cloud types.
• What to do with credentials?
• How to manage access to multiple clouds?

Summary

• Basic proof of concept works, tested with a small number of jobs.
  – Work to add a pilot factory.
  – Already known to scale to hundreds of VMs; will it scale beyond?
• Setup is a little complex; it could possibly be simplified with some thought about how to deal with credentials.

Beginning of Extra Slides

About the code

• Relatively small Python package, lots of cloud interaction examples

http://github.com/hep-gc/cloud-scheduler

    Ian-Gables-MacBook-Pro:cloud-scheduler igable$ cat source_files | xargs wc -l
           0 ./cloudscheduler/__init__.py
           1 ./cloudscheduler/__version__.py
         998 ./cloudscheduler/cloud_management.py
        1169 ./cloudscheduler/cluster_tools.py
         362 ./cloudscheduler/config.py
         277 ./cloudscheduler/info_server.py
        1086 ./cloudscheduler/job_management.py
           0 ./cloudscheduler/monitoring/__init__.py
          63 ./cloudscheduler/monitoring/cloud_logger.py
         208 ./cloudscheduler/monitoring/get_clouds.py
         176 ./cloudscheduler/utilities.py
          13 ./scripts/ec2contexthelper/setup.py
          28 ./setup.py
          99 cloud_resources.conf
        1046 cloud_scheduler
         324 cloud_scheduler.conf
         130 cloud_status
        5980 total

HEP Legacy Data Project

• We have been funded in Canada to investigate a possible solution for analyzing BaBar data for the next 5-10 years.
• Collaborating with SLAC, who are also pursuing this goal.
• We are exploiting VMs and IaaS clouds.
• Assume we are going to be able to run BaBar code in a VM for the next 5-10 years.
• We hope that results will be applicable to other experiments.
• 2.5 FTEs for 2 years; ends in October 2011.

Experience with CANFAR

[Figure annotations: "Shorter Jobs", "Cluster Outage"]

VM Run Times (CANFAR)

[Figure: VM run times, n = 32572, with the maximum allowed VM run time (1 week) marked]

32572 VMs booted since July 2010.

VM Boots (CANFAR)

Job Submission

• We use a Condor Rank expression to ensure that jobs only end up on the VMs they are intended for.
• Users use Condor attributes to specify the number of CPUs, memory, and scratch space that should be on their VMs (only works with IaaS that supports this).
• Users can also specify an EC2 Amazon Machine Image (AMI) ID if Cloud Scheduler is configured for access to this type of cloud.
• Users are prevented from running on VMs which were not booted for them (Condor start requirement checked against DN).
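The two checks described above (VM type and owner DN) can be pictured with a small Python predicate. This is only a sketch of the logic: in the real system the checks are Condor ClassAd expressions evaluated by Condor itself, and the dictionary keys `owner_dn` and `booted_for_dn`, like the example DNs, are hypothetical names chosen for illustration.

```python
def vm_accepts_job(vm, job):
    """Return True if the job may start on this VM.

    The job must ask for the VM's type, and the job owner's DN must
    match the DN the VM was booted for.
    """
    same_type = job.get("VMType") == vm.get("VMType")
    owner = job.get("owner_dn")
    same_owner = owner is not None and owner == vm.get("booted_for_dn")
    return same_type and same_owner


# Hypothetical VM and jobs used to exercise the predicate.
vm = {"VMType": "BaBarAnalysis-52", "booted_for_dn": "/C=CA/O=Grid/CN=Example User"}
job_ok = {"VMType": "BaBarAnalysis-52", "owner_dn": "/C=CA/O=Grid/CN=Example User"}
job_other_user = {"VMType": "BaBarAnalysis-52", "owner_dn": "/C=CA/O=Grid/CN=Someone Else"}
job_other_type = {"VMType": "OtherType", "owner_dn": "/C=CA/O=Grid/CN=Example User"}
```

Here `vm_accepts_job(vm, job_ok)` is true, while the other two jobs are rejected, one for the DN mismatch and one for the VM type mismatch.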

BaBar Cloud Configuration

Other Examples

Inside a Cloud VM

BaBar MC production

[Figure annotation: more resources added]

Motivation

• We believe projects requiring modest resources are suitable for Infrastructure-as-a-Service (IaaS) clouds:
  – The High Energy Physics Legacy Data project (2.5 FTEs). Target applications: BaBar MC and user analysis.
  – The Canadian Advanced Network for Astronomical Research (CANFAR), supporting observational astronomy data analysis. Jobs are embarrassingly parallel, like HEP (one image is like one event).
• We expect an increasing number of IaaS clouds to be available for research computing.

Step 1

Research and commercial clouds are made available with some cloud-like interface.

Step 2

The user submits to a Condor job scheduler that has no resources attached to it.

Credentials Movement

• User does a myproxy-logon.
• User submits a Condor job with proxy.
• IaaS API request made to Nimbus with proxy.
• Cluster service certificate is injected into the VM (could be replaced with user cert).
• VM boots and joins the Condor pool using GSI authentication with the service certificate.
• Condor submits job to VM with delegated user proxy.

Step 3

Cloud Scheduler detects that there are waiting jobs in the Condor queues and then makes requests to boot the VMs that match the job requirements.

Step 4

The VMs boot, attach themselves to the Condor queues and begin draining jobs. Once no more jobs require the VMs, Cloud Scheduler shuts them down.
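Steps 3 and 4 amount to a periodic planning pass over the queue. The sketch below shows one way such a pass could be written; it is an illustration under simplifying assumptions, not the scheduling policy Cloud Scheduler actually implements. It assumes one job per VM and a single global VM limit, and the function name `plan_vm_changes`, the `max_vms` parameter, and the VM type names other than "BaBarAnalysis-52" are invented for the example.

```python
from collections import Counter


def plan_vm_changes(jobs, vms, max_vms=10):
    """Decide which VM types to boot and which to shut down.

    jobs: VM type name for every job in the queue (idle or running)
    vms:  VM type name for every VM currently booted
    Returns (to_boot, to_retire) as Counters keyed by VM type.
    """
    demand = Counter(jobs)
    supply = Counter(vms)
    to_boot, to_retire = Counter(), Counter()

    # Boot VMs for job types that lack enough machines, within the limit.
    budget = max(0, max_vms - len(vms))
    for vmtype in sorted(demand):
        missing = max(demand[vmtype] - supply[vmtype], 0)
        count = min(missing, budget)
        if count:
            to_boot[vmtype] = count
            budget -= count

    # Shut down VMs whose type is no longer required by any job.
    for vmtype, count in supply.items():
        if demand[vmtype] == 0:
            to_retire[vmtype] = count

    return to_boot, to_retire


jobs = ["BaBarAnalysis-52"] * 3 + ["ExampleType"]
vms = ["BaBarAnalysis-52", "OldType"]
to_boot, to_retire = plan_vm_changes(jobs, vms, max_vms=4)
```

With a limit of four VMs and two already running, this example boots two more "BaBarAnalysis-52" VMs, leaves the "ExampleType" job waiting for the next pass, and retires the idle "OldType" VM.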

Credentials Step 1

Credentials Step 2

Credentials Step 3

Credentials Step 4

Credentials Step 5