Submit locally and run globally – The GLOW and OSG Experience


Page 1: Submit locally  and run globally – The GLOW and OSG Experience

Miron Livny
Computer Sciences Department
University of Wisconsin-Madison

[email protected]

Submit locally and run globally –

The GLOW and OSG Experience

Page 2: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

What impact does

our computing infrastructure

have on our scientists?

Page 3: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/condor

Page 4: Submit locally  and run globally – The GLOW and OSG Experience

Supercomputing in Social Science

Maria Marta Ferreyra

Carnegie Mellon University

Page 5: Submit locally  and run globally – The GLOW and OSG Experience

What would happen if many families in the largest U.S. metropolitan areas received vouchers for private schools?

Completed my dissertation with Condor’s help

The contributions from my research are made possible by Condor
Questions could not be answered otherwise

Page 6: Submit locally  and run globally – The GLOW and OSG Experience

Research question

Vouchers allow people to choose the type of school they want

Vouchers may affect where families choose to live

Problem has many moving parts (a general equilibrium problem)

Page 7: Submit locally  and run globally – The GLOW and OSG Experience

Why Condor was a great match to my needs

(cont.)(cont.)

I did not have to alter my code

I did not have to pay

Since 19 March 2001 I have used 462,667 hours (about 53 years with one 1 GHz processor)
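A quick check of that figure:
462,667 hours / 8,760 hours per year ≈ 52.8 years of a single processor running non-stop, which matches the "about 53 years" quoted above.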

Page 8: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

The search for SUSY*

› Sanjay Padhi is a UW Chancellor Fellow who is working in the group of Prof. Sau Lan Wu at CERN (Geneva)

› Using Condor Technologies he established a “grid access point” in his office at CERN

› Through this access point he managed to harness, in 3 months (12/05-2/06), more than 500 CPU-years from the LHC Computing Grid (LCG), the Open Science Grid (OSG), the Grid Laboratory of Wisconsin (GLOW), and locally owned desktop resources (a rough scale check follows below).

*Super-Symmetry
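To put that harvest in perspective: 500 CPU-years delivered in roughly a quarter of a year means
500 CPU-years / 0.25 years ≈ 2,000 CPUs kept busy around the clock for the whole three-month period.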

Page 9: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

Claims for “benefits” provided by Distributed Processing Systems

High Availability and Reliability
High System Performance
Ease of Modular and Incremental Growth
Automatic Load and Resource Sharing
Good Response to Temporary Overloads
Easy Expansion in Capacity and/or Function

“What is a Distributed Data Processing System?”, P.H. Enslow, Computer, January 1978

Page 10: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

Democratization of Computing:
You do not need to be a super-person to do super-computing

Page 11: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

We first introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Space Flight Center in July of 1996 and a month later at the European Laboratory for Particle Physics (CERN). In June of 1997 HPCWire published an interview on High Throughput Computing.

High Throughput Computing

Page 12: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

HTC
For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. In other words, they are less concerned about instantaneous computing power. Instead, what matters to them is the amount of computing they can harness over a month or a year --- they measure computing power in units of scenarios per day, wind patterns per week, instruction sets per month, or crystal configurations per year.

Page 13: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

High Throughput Computing

is a 24-7-365 activity

FLOPY = (60*60*24*7*52) * FLOPS
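Spelled out, the factor is just the number of seconds in a 52-week year:
60 * 60 * 24 * 7 * 52 = 31,449,600 seconds
so a machine that sustains 1 GFLOPS for the whole year delivers roughly 3.1 × 10^16 floating-point operations.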

Page 14: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

“ … We claim that these mechanisms, although originally developed in the context of a cluster of workstations, are also applicable to computational grids. In addition to the required flexibility of services in these grids, a very important concern is that the system be robust enough to run in “production mode” continuously even in the face of component failures. … “

Miron Livny & Rajesh Raman, "High Throughput Resource Management", in “The Grid: Blueprint for a New Computing Infrastructure”, 1998.

Page 15: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

HTC leads to a “bottom up” approach to building and operating a distributed computing infrastructure

Page 16: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

My jobs should run …

› … on my laptop if it is not connected to the network

› … on my group resources if my grid certificate expired

› ... on my campus resources if the meta scheduler is down

› … on my national grid if the trans-Atlantic link was cut by a submarine

Page 17: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

Taking HTC to the next level – The Open Science Grid (OSG)

Page 18: Submit locally  and run globally – The GLOW and OSG Experience

What is OSG?

The Open Science Grid is a US national distributed computing facility that supports scientific computing via an open collaboration of science researchers, software developers and computing, storage  and network providers. The OSG Consortium is building and operating the OSG, bringing resources and researchers from universities and national laboratories together and cooperating with other national and international infrastructures to give scientists from a broad range of disciplines access to shared resources worldwide.

Page 19: Submit locally  and run globally – The GLOW and OSG Experience

The OSG Project

Co-funded by DOE and NSF at an annual rate of ~$6M for 5 years starting FY-07.

16 institutions involved – 4 DOE Labs and 12 universities.

Currently main stakeholders are from physics – US LHC experiments, LIGO, the STAR experiment, the Tevatron Run II and Astrophysics experiments.

A mix of DOE-Lab and campus resources.

Active “engagement” effort to add new domains and resource providers to the OSG consortium.

Page 20: Submit locally  and run globally – The GLOW and OSG Experience

OSG PEP - Organization

Page 21: Submit locally  and run globally – The GLOW and OSG Experience

OSG Project Execution Plan (PEP) - FTEs

Area                                  FTEs
Facility operations                    5.0
Security and troubleshooting           4.5
Software release and support           6.5
Engagement                             2.0
Education, outreach & training         2.0
Facility management                    1.0
Extensions in capability and scale     9.0
Staff                                  3.0
Total FTEs                            33.0

Page 22: Submit locally  and run globally – The GLOW and OSG Experience

Part of the OSG Consortium

[Figure: members of the OSG Consortium, grouped into Contributors and the Project.]

Page 23: Submit locally  and run globally – The GLOW and OSG Experience

OSG Principles

Characteristics:
- Provide guaranteed and opportunistic access to shared resources.
- Operate a heterogeneous environment, both in the services available at any site and for any VO, with multiple implementations behind common interfaces.
- Interface to campus and regional grids.
- Federate with other national/international grids.
- Support multiple software releases at any one time.

Drivers:
- Delivery to the schedule, capacity and capability of LHC and LIGO: contributions to/from and collaboration with the US ATLAS, US CMS, and LIGO software and computing programs.
- Support for, and collaboration with, other physics and non-physics communities.
- Partnerships with other grids - especially EGEE and TeraGrid.
- Evolution by deployment of externally developed new services and technologies.

Page 24: Submit locally  and run globally – The GLOW and OSG Experience

Grid of Grids - from Local to Global

[Figure: Community, Campus, and National grids.]

Page 25: Submit locally  and run globally – The GLOW and OSG Experience

Who are you?

A resource can be accessed by a user via the campus, community or national grid.

A user can access a resource with a campus, community or national grid identity.

Page 26: Submit locally  and run globally – The GLOW and OSG Experience

32 Virtual Organizations - participating Groups

3 with >1000 jobs max. (all particle physics)

3 with 500-1000 jobs max. (all outside physics)

5 with 100-500 jobs max. (particle, nuclear, and astrophysics)

Page 27: Submit locally  and run globally – The GLOW and OSG Experience
Page 28: Submit locally  and run globally – The GLOW and OSG Experience

OSG Middleware Layering

NSF Middleware Initiative (NMI): Condor, Globus, Myproxy

Virtual Data Toolkit (VDT): Common Services = NMI + VOMS, CEMon (common EGEE components), MonaLisa, Clarens, AuthZ

OSG Release Cache: VDT + Configuration, Validation, VO management

[Figure: VO application layers - CMS Services & Framework, ATLAS Services & Framework, LIGO Data Grid, and CDF/D0 SAMGrid & Framework - built on top of the shared OSG infrastructure.]

Page 29: Submit locally  and run globally – The GLOW and OSG Experience

OSG Middleware Deployment

1. Domain science requirements.
2. OSG stakeholders and middleware developer (joint) projects - Condor, Globus, Privilege, EGEE, etc.
3. Test on a “VO-specific grid”.
4. Integrate into the VDT release; deploy on the OSG integration grid.
5. Provision in the OSG release and deploy to OSG production.

Page 30: Submit locally  and run globally – The GLOW and OSG Experience

Inter-operability with Campus grids

At this point we have three operational campus grids – Fermi, Purdue and Wisconsin. We are working on adding Harvard (Crimson) and Lehigh.

FermiGrid is an interesting example of the challenges we face when making the resources of a campus grid (in this case a DOE Laboratory) accessible to the OSG community.

Page 31: Submit locally  and run globally – The GLOW and OSG Experience

What is FermiGrid?

• Integrates resources across most (soon all) owners at Fermilab.

• Supports jobs from Fermilab organizations to run on any/all accessible campus (FermiGrid) and national (Open Science Grid) resources.

• Supports jobs from OSG to be scheduled onto any/all Fermilab sites.

• Unified and reliable common interface and services for FermiGrid gateway - including security, job scheduling, user management, and storage.

• More information is available at http://fermigrid.fnal.gov

Page 32: Submit locally  and run globally – The GLOW and OSG Experience

Job Forwarding and Resource Sharing

The gateway currently interfaces to 5 Condor pools with diverse file systems and >1000 job slots. Plans are to grow to 11 clusters (8 Condor, 2 PBS and 1 LSF).

Job scheduling policies and agreements for sharing that are already in place allow fast response to changes in the resource needs of Fermilab and OSG users.

The gateway provides a single bridge between the OSG wide-area distributed infrastructure and the FermiGrid local sites. It consists of a Globus gatekeeper and a Condor-G.

Each cluster has its own Globus gatekeeper.

Storage and job execution policies are applied through site-wide managed security and authorization services.

Page 33: Submit locally  and run globally – The GLOW and OSG Experience

Access to FermiGrid

[Figure: OSG general users, “agreed” OSG users, and Fermilab users submit through the FermiGrid Gateway (a Globus gatekeeper, GT-GK, plus Condor-G), which forwards jobs to the CDF, DZero, CMS, and shared Condor pools, each behind its own Globus gatekeeper.]

Page 34: Submit locally  and run globally – The GLOW and OSG Experience

The Crimson Grid is

• a scalable collaborative computing environment for research at the interface of science and engineering

• a gateway/middleware release service to enable campus/community/national/global computing infrastructures for interdisciplinary research

• a test bed for faculty & IT-industry affiliates within the framework of a production environment for integrating HPC solutions for higher education & research

• a campus resource for skills & knowledge sharing for advanced systems administration & management of switched architectures

Page 35: Submit locally  and run globally – The GLOW and OSG Experience

CrimsonGrid Role as a Campus Grid Enabler

Page 36: Submit locally  and run globally – The GLOW and OSG Experience

Homework?

[Figure: CrimsonGrid, ATLAS, OSG / OSG Tier II, and Campus Grids.]

Page 37: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

HTC on the UW campus

(or what you can do with $1.5M?)

Page 38: Submit locally  and run globally – The GLOW and OSG Experience

Grid Laboratory of Wisconsin

2003 initiative funded by NSF (MRI) and UW at ~$1.5M.

Six initial GLOW sites:
• Computational Genomics, Chemistry
• Amanda, IceCube, Physics/Space Science
• High Energy Physics/CMS, Physics
• Materials by Design, Chemical Engineering
• Radiation Therapy, Medical Physics
• Computer Science

Diverse users with different deadlines and usage patterns.

Page 39: Submit locally  and run globally – The GLOW and OSG Experience

Example Uses

• Chemical Engineering
- Students do not know where the computing cycles are coming from - they just do it - largest user group

• ATLAS
- Over 15 million proton collision events simulated, at 10 minutes each

• CMS
- Over 70 million events simulated, reconstructed and analyzed (total ~10 minutes per event) in the past year

• IceCube / Amanda
- Data filtering used 12 CPU-years in one month

• Computational Genomics
- Prof. Shwartz asserts that GLOW has opened up a new paradigm of work patterns in his group
- They no longer think about how long a particular computational job will take - they just do it

Page 40: Submit locally  and run globally – The GLOW and OSG Experience

GLOW Usage 4/04-9/05

Over 7.6 million CPU-Hours (865 CPU-Years) served!

Takes advantage of “shadow” jobs

Takes advantage of checkpointing jobs

Leftover cycles available for “Others”

Page 41: Submit locally  and run globally – The GLOW and OSG Experience

UW Madison Campus Grid

• Condor pools in various departments, made accessible via Condor “flocking” (see the configuration sketch below)
- Users submit jobs to their own private or department Condor scheduler.
- Jobs are dynamically matched to available machines.

• Crosses multiple administrative domains.
- No common uid-space across campus.
- No cross-campus NFS for file access.

• Users rely on Condor remote I/O, file-staging, AFS, SRM, gridftp, etc.
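A minimal sketch of how flocking between two pools is wired up in HTCondor configuration; the host names below are hypothetical and the real GLOW settings are more involved:

```
# condor_config on a department submit machine (hypothetical host names):
# pools to try, in order, when no local machine is available
FLOCK_TO = condor.cs.wisc.edu, condor.hep.wisc.edu

# condor_config on the central manager of a pool that accepts flocked jobs:
FLOCK_FROM = submit.chem.wisc.edu, submit.physics.wisc.edu
# the flocked-from submit machines must also be granted write access
# (via ALLOW_WRITE / HOSTALLOW_WRITE) for their jobs to be accepted
```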

Page 42: Submit locally  and run globally – The GLOW and OSG Experience

Housing the Machines

• Condominium Style
- centralized computing center
- space, power, cooling, management
- standardized packages

• Neighborhood Association Style
- each group hosts its own machines
- each contributes to administrative effort
- base standards (e.g. Linux & Condor) to make sharing of resources easy

• GLOW has elements of both, but leans towards the neighborhood style

Page 43: Submit locally  and run globally – The GLOW and OSG Experience

The value of the big G

• Our users want to collaborate outside the bounds of the campus (e.g. ATLAS and CMS are international).

• We also don’t want to be limited to sharing resources with people who have made identical technological choices.

• The Open Science Grid (OSG) gives us the opportunity to operate at both scales, which is ideal.

Page 44: Submit locally  and run globally – The GLOW and OSG Experience

Submitting Jobs within UW Campus Grid

[Figure: a UW HEP user runs condor_submit to a local schedd (job caretaker); the HEP, CS, and GLOW matchmakers match the jobs, flocking them to available startds (job executors) across the campus pools.]

Supports the full feature set of Condor:
• matchmaking
• remote system calls
• checkpointing
• MPI
• suspension VMs
• preemption policies
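For illustration, a minimal vanilla-universe submit description such a user might hand to condor_submit; the file names, program, and 100-job count are invented for the example:

```
# simulate.sub - hypothetical submit description file
universe   = vanilla
executable = simulate
arguments  = --seed $(Process)
output     = out.$(Process)
error      = err.$(Process)
log        = simulate.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 100
```

Submitted with condor_submit simulate.sub, the jobs stay under the local schedd and run wherever the matchmakers find idle machines, on the home pool first and elsewhere via flocking.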

Page 45: Submit locally  and run globally – The GLOW and OSG Experience

Submitting jobs through OSG to UW Campus Grid

[Figure: an Open Science Grid user runs condor_submit to a schedd (job caretaker) outside the campus; the Condor gridmanager forwards the jobs through the Globus gatekeeper to a schedd on campus, where the HEP, CS, and GLOW matchmakers flock them to available startds (job executors).]
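A sketch of how such an OSG user might target the campus gateway with Condor-G; the grid universe and the gt2 (Globus GRAM) resource syntax are standard Condor features of that era, but the gatekeeper name below is hypothetical:

```
# osg_to_uw.sub - hypothetical Condor-G submit description
universe      = grid
grid_resource = gt2 gatekeeper.glow.wisc.edu/jobmanager-condor
executable    = analyze
output        = analyze.out
error         = analyze.err
log           = analyze.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```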

Page 46: Submit locally  and run globally – The GLOW and OSG Experience

Routing Jobs from UW Campus Grid to OSG

[Figure: jobs submitted with condor_submit to the local schedd (job caretaker) are matched by the HEP, CS, and GLOW matchmakers; a Grid Job Router transforms selected jobs and sends them out through the Condor gridmanager and a Globus gatekeeper.]

Combining both worlds:
• simple, feature-rich local mode
• when possible, transform to a grid job for traveling globally
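A minimal configuration sketch of the Job Router idea; the route name, gatekeeper, and limits are hypothetical, and real deployments also constrain which jobs are eligible for routing:

```
# condor_config fragment on the submit host (illustrative values only)
DAEMON_LIST = $(DAEMON_LIST), JOB_ROUTER

JOB_ROUTER_ENTRIES = \
  [ name = "OSG_Site_X"; \
    GridResource = "gt2 gatekeeper.site-x.example.org/jobmanager-condor"; \
    MaxIdleJobs = 10; \
    MaxJobs = 200; ]
```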

Page 47: Submit locally  and run globally – The GLOW and OSG Experience

GLOW Architecture in a Nutshell

One big Condor pool
• But a backup central manager runs at each site (Condor HAD service)
• Users submit jobs as members of a group (e.g. “CMS” or “MedPhysics”)
• Computers at each site give highest priority to jobs from the same group (via machine RANK; sketched below)
• Jobs run preferentially at the “home” site, but may run anywhere when machines are available
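A sketch of the machine-RANK mechanism, assuming jobs advertise a custom attribute (here called Group) at submit time; the attribute name and value are illustrative, not GLOW's actual configuration:

```
# In a group's submit files (illustrative custom job attribute):
+Group = "CMS"

# In the startd configuration at the CMS-owned GLOW site:
# rank jobs from the owning group above everyone else's
RANK = (TARGET.Group =?= "CMS")
```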

Page 48: Submit locally  and run globally – The GLOW and OSG Experience

Accommodating Special Cases

• Members have flexibility to make arrangements with each other when needed
- Example: granting 2nd priority

• Opportunistic access
- Long-running jobs which can’t easily be checkpointed can be run as bottom feeders that are suspended, instead of being killed, by higher-priority jobs (see the policy sketch below)

• Computing on Demand
- Tasks requiring low latency (e.g. interactive analysis) may quickly suspend any other jobs while they run
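An illustrative startd policy fragment showing the suspend-rather-than-kill idea using Condor's standard SUSPEND/CONTINUE knobs; the trigger macro below is hypothetical and this is not the actual GLOW policy:

```
# Hypothetical signal that a low-latency, on-demand task needs this machine
MachineNeededElsewhere = (KeyboardIdle < 60)

WANT_SUSPEND = True
SUSPEND      = $(MachineNeededElsewhere)
CONTINUE     = ($(MachineNeededElsewhere)) == False
PREEMPT      = False   # suspend and later resume the job instead of evicting it
```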

Page 49: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

Elevating from GLOW to OSG

[Figure: a “schedd on the side” watches the schedd's job queue (Job 1, Job 2, Job 3, Job 4, Job 5, …) and adds a routed copy, Job 4*, of a selected job.]

Page 50: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

The Grid Universe

[Figure: the schedd runs many “RandomSeed” jobs on its local vanilla-universe startds and sends others through a gatekeeper to remote site X.]

• easier to live with private networks
• may use non-Condor resources

• restricted Condor feature set (e.g. no standard universe over the grid)
• must pre-allocate jobs between the vanilla and grid universes

Page 51: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

Dynamic Routing Jobs

[Figure: the schedd runs “RandomSeed” jobs on local startds while a “schedd on the side” routes others through gatekeepers to grid sites X, Y, and Z.]

• dynamic allocation of jobs between the vanilla and grid universes
• not every job is appropriate for transformation into a grid job

Page 52: Submit locally  and run globally – The GLOW and OSG Experience

www.cs.wisc.edu/~miron

What is the right $ balance between HPC & HTC?