3 07/05/2007
What is the OSG?
• A scientific grid consortium and project
• Relies on the commitments of the participants
• Shares common goals and vision with other projects
• An evolution of Grid3
• Provides benefit to large-scale science in the US
4 07/05/2007
Driving Principles for OSG
Simple and flexible
Built from the bottom up
Coherent but heterogeneous
Performing and persistent
Maximize eventual commonality
Principles apply end-to-end
5 07/05/2007
Virtual Organization in OSG
Self-Operated Research VOs: 15
Collider Detector at Fermilab (CDF)
Compact Muon Solenoid (CMS)
CompBioGrid (CompBioGrid)
D0 Experiment at Fermilab (DZero)
Dark Energy Survey (DES)
Functional Magnetic Resonance Imaging (fMRI)
Geant4 Software Toolkit (geant4)
Genome Analysis and Database Update (GADU)
International Linear Collider (ILC)
Laser Interferometer Gravitational-Wave Observatory (LIGO)
nanoHUB Network for Computational Nanotechnology (NCN) (nanoHUB)
Sloan Digital Sky Survey (SDSS)
Solenoidal Tracker at RHIC (STAR)
Structural Biology Grid (SBGrid)
United States ATLAS Collaboration (USATLAS)
Campus Grids: 5
Georgetown University Grid (GUGrid)
Grid Laboratory of Wisconsin (GLOW)
Grid Research and Education Group at Iowa (GROW)
State University of New York at Buffalo (GRASE)
Fermi National Accelerator Laboratory (Fermilab)
Regional Grids: 4
NYSGRID
Distributed Organization for Scientific and Academic Research (DOSAR)
Great Plains Network (GPN)
Northwest Indiana Computational Grid (NWICG)
OSG Operated VOs: 4
Engagement (Engage)
Open Science Grid (OSG)
OSG Education Activity (OSGEDU)
OSG Monitoring & Operations
6 07/05/2007
Timeline
[Timeline, 1999-2009: PPDG (DOE), GriPhyN (NSF), and iVDGL (NSF) combine through Trillium and Grid3 into the OSG (DOE+NSF); campus and regional grids join; LHC construction and preparation lead into LHC operations; LIGO preparation leads into LIGO operation; the European Grid and the Worldwide LHC Computing Grid run in parallel; the OSG Consortium spans the later years.]
7 07/05/2007
Levels of Participation
Participating in the OSG Consortium:
— Using the OSG / Sharing resources on the OSG: either or both, with a minimal entry threshold
— Becoming a stakeholder: all (large-scale) users & providers are stakeholders
— Determining the future of OSG: Council members determine the future
— Taking on responsibility for OSG operations: the OSG Project is responsible for OSG operations
9 07/05/2007
OSG : A Grid of Sites/Facilities
IT Departments at Universities & National Labs make their hardware resources available via OSG interfaces.
— CE: (modified) pre-WS GRAM
— SE: SRM for large volumes; gftp & (N)FS for small volumes
Today's scale:
— 20-50 "active" sites (depending on the definition of "active")
— ~5,000 batch slots
— ~1,000 TB of storage
— ~10 "active" sites with shared 10 Gbps or better connectivity
Expected scale for the end of 2008:
— ~50 "active" sites
— ~30-50,000 batch slots
— A few PB of storage
— ~25-50% of sites with shared 10 Gbps or better connectivity
10 07/05/2007
OSG Components: Compute Element
[Diagram: an OSG gateway machine + services connects local resources (from ~20-CPU department computers to 10,000-CPU supercomputers, with jobs run under any local batch system) to the network & other OSG resources.]
• Globus GRAM interface (pre-WS), which supports many different local batch systems
• Priorities and policies: through VO role mapping and batch-queue priority settings according to site policies and priorities
Gateway software stack: OSG Base (OSG 0.6.0), OSG Environment/Publication, OSG Monitoring/Accounting, EGEE Interop
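For illustration only (not from the slide): a minimal sketch of driving a pre-WS GRAM gateway from Python. It assumes the Globus client tools and a valid grid proxy are available; the gatekeeper contact string and jobmanager name are hypothetical.

import subprocess

# Hypothetical gatekeeper contact: gateway host plus the pre-WS GRAM jobmanager
# for whatever local batch system the site runs (Condor, PBS, LSF, ...).
contact = "osg-gw.example.edu/jobmanager-condor"

# globus-job-run submits a simple job through pre-WS GRAM and returns its output.
result = subprocess.run(
    ["globus-job-run", contact, "/bin/hostname"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())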
11 07/05/2007
Disk Areas in an OSG site
Shared filesystem as the applications area at a site
— Read-only from the compute cluster
— Role-based installation via GRAM
Batch-slot-specific local work space
— No persistency beyond the batch-slot lease
— Not shared across batch slots
— Read & write access (of course)
SRM/gftp-controlled data area
— "Persistent" data store beyond job boundaries
— Job-related stage in/out
— SRM v1.1 today
— SRM v2.2 expected in Q2 2007 (space reservation)
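A job-side sketch of how these three areas are typically used (illustrative only; $OSG_APP, $OSG_DATA, and $OSG_WN_TMP are the conventional OSG environment variable names for the areas, and the exact names, paths, and file names here should be treated as assumptions).

import os
import shutil

# Conventional OSG environment variables for the three disk areas (assumed names).
app_area  = os.environ["OSG_APP"]     # shared application area, read-only from batch slots
data_area = os.environ["OSG_DATA"]    # SRM/gftp-controlled area, persistent beyond the job
work_area = os.environ["OSG_WN_TMP"]  # batch-slot-local scratch, gone when the lease ends

# Typical pattern: read pre-installed software and staged input, write only to the
# local work space, then copy the result out to the persistent data area.
input_file  = os.path.join(data_area, "myvo/input/events.dat")   # hypothetical path
output_file = os.path.join(work_area, "result.out")

with open(output_file, "w") as f:
    f.write("placeholder result\n")   # stands in for the VO application's real output

shutil.copy(output_file, os.path.join(data_area, "myvo/output/result.out"))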
12 07/05/2007
OSG Components: Storage Element
[Diagram: an OSG SE gateway in front of any shared storage (e.g., dCache), from 20-GByte disk caches to 4-petabyte robotic tape systems, connects it to the network & other OSG resources.]
• Storage services: access storage through the Storage Resource Manager (SRM) interface and GridFTP
• (Typically) VO-oriented: allocation of shared storage through agreements between a site and VO(s), facilitated by OSG
gsiftp://mygridftp.nowhere.edu   srm://myse.nowhere.edu   (srm protocol ~ https protocol)
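For illustration, a sketch of moving a file through these interfaces from Python; it assumes the SRM client (srmcp) and globus-url-copy are installed and a grid proxy exists. The hostnames follow the slide's examples, while the port and paths are made up.

import subprocess

# SRM copy: srmcp negotiates a transfer URL via the SRM interface, then moves the
# data (typically over GridFTP). Endpoint port and paths are hypothetical.
subprocess.run(
    ["srmcp",
     "srm://myse.nowhere.edu:8443/data/myvo/events.root",
     "file:////tmp/events.root"],
    check=True,
)

# Plain GridFTP copy of the same file via the gsiftp endpoint.
subprocess.run(
    ["globus-url-copy",
     "gsiftp://mygridftp.nowhere.edu/data/myvo/events.root",
     "file:///tmp/events.root"],
    check=True,
)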
13 07/05/2007
Authentication and Authorization
OSG responsibilities
— X.509-based middleware
— Accounts may be dynamic/static, shared/FQAN-specific
VO responsibilities
— Instantiate VOMS
— Register users & define/manage their roles
Site responsibilities
— Choose the security model (which accounts are supported)
— Choose which VOs to allow
— Default is to accept all users in a VO, but individuals or groups within a VO can be denied
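On the user side this model boils down to an X.509 proxy carrying VOMS attributes, which the site's authorization layer then maps to an account. A sketch, with a hypothetical VO name and role:

import subprocess

# Create a proxy with VOMS FQANs; "myvo" and the production role are hypothetical.
subprocess.run(["voms-proxy-init", "-voms", "myvo:/myvo/Role=production"], check=True)

# Inspect the attributes the site will use (e.g., via GUMS) to choose an account.
subprocess.run(["voms-proxy-info", "-all"], check=True)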
14 07/05/2007
User Management
User obtains a certificate from a CA vetted by the TAGPMA. User registers with a VO and is added to the VO's VOMS.
— VO is responsible for registering its VOMS with the OSG GOC.
— VO is responsible for having its users sign the AUP.
— VO is responsible for VOMS operations.
— Some VOs share a VOMS for operations on multiple grids globally.
— A default OSG VO exists for new communities & single PIs.
Sites decide which VOs to support (striving for default admit).
— Site populates GUMS daily from the VOMSes of all VOs.
— Site chooses a uid policy for each VO & role:
— dynamic vs. static vs. group accounts.
User uses whatever services the VO provides in support of its users.
— VOs generally hide the grid behind a portal.
Any and all support is the responsibility of the VO:
— Helping its users.
— Responding to complaints from grid sites about its users.
15 07/05/2007
Resource Management
• Many resources are owned by, or statically allocated to, one user community
– The institutions that own resources typically have ongoing relationships with (a few) particular user communities (VOs)
• The remainder of an organization's available resources can be "used by everyone or anyone else"
– An organization can decide against supporting particular VOs
– OSG staff are responsible for monitoring and, if needed, managing this usage
• Our challenge is to maximize good - successful - output from the whole system
16 07/05/2007
Applications and Runtime Model
• Condor-G client
• Pre-WS or WS GRAM as the site gateway
• Priority through VO role and policy, mitigated by site policy
• The user-specific portion comes with the job
• The VO-specific portion is preinstalled and published
• CPU access policies vary from site to site
Ideal runtime ~ O(hours):
small enough not to lose too much work to preemption policies,
large enough to be efficient despite the long scheduling times of the grid middleware.
17 07/05/2007
Simple Workflow
Install application software at site(s)
— VO admin installs via GRAM.
— VO users have read-only access from batch slots.
"Download" data to site(s)
— VO admin moves data via SRM/gftp.
— VO users have read-only access from batch slots.
Submit job(s) to site(s) (see the Condor-G sketch below)
— VO users submit job(s)/DAGs via Condor-G.
— Jobs run in batch slots, writing output to local disk.
— Jobs copy output from local disk to the SRM/gftp data area.
Collect output from site(s)
— VO users collect output from site(s) via SRM/gftp as part of the DAG.
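A minimal sketch of the submission step, assuming Condor-G is installed and a proxy exists; the gatekeeper name and file names are hypothetical, and "gt2" selects the pre-WS GRAM protocol.

import subprocess
from pathlib import Path

# Condor-G submit description for one grid-universe job (hypothetical gatekeeper).
submit_description = """\
universe      = grid
grid_resource = gt2 osg-gw.example.edu/jobmanager-condor
executable    = myjob.sh
arguments     = events.dat
output        = job.out
error         = job.err
log           = job.log
queue
"""

Path("myjob.sub").write_text(submit_description)
subprocess.run(["condor_submit", "myjob.sub"], check=True)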
18 07/05/2007
Late Binding(A Strategy)
The grid is a hostile environment:
Scheduling policies are unpredictable.
Many sites preempt, and only idle resources are free.
There is inherent diversity among Linux variants.
Not everybody is truthful in their advertisements.
Submit "pilot" jobs instead of user jobs.
Bind a user job to a pilot only after a batch slot at a site has been successfully leased and "sanity checked".
Re-bind user jobs to a new pilot upon failure (sketched below).
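A toy sketch of the late-binding idea in Python (purely illustrative, not the code of any real pilot system; the in-memory list stands in for the VO's actual task queue):

import os
import shutil
import subprocess

def sanity_check():
    # Cheap checks on the leased slot before any user job is bound to it.
    return os.access("/tmp", os.W_OK) and shutil.disk_usage("/tmp").free > 1 << 30

def fetch_user_job(queue):
    # Late binding: a user job is pulled only once the slot is known to be usable.
    return queue.pop(0) if queue else None

def run_pilot(queue):
    if not sanity_check():
        return  # bad slot: release it without wasting a user job on it
    while (job := fetch_user_job(queue)) is not None:
        rc = subprocess.run(job).returncode
        if rc != 0:
            # In a real system the VO scheduler would re-queue the job so it
            # re-binds to a different pilot; this toy only records the failure.
            print("job failed:", job)

# Hypothetical user jobs; in practice these come from the VO's own queue.
run_pilot([["/bin/echo", "user job 1"], ["/bin/echo", "user job 2"]])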
20 07/05/2007
OSG Activity Breakdown
• Software (UW) – provide a software stack that meets the needs of OSG sites, OSG VOs, and OSG operations while supporting interoperability with other national and international cyberinfrastructures
• Integration (UC) – verify, test, and evaluate the OSG software
• Operations (IU) – coordinate the OSG sites, monitor the facility, and maintain and operate centralized services
• Security (FNAL) – define and evaluate procedures and the software stack to prevent unauthorized activities and minimize interruptions in service due to security concerns
• Troubleshooting (UIOWA) – help sites and VOs identify and resolve unexpected behavior of the OSG software stack
• Engagement (RENCI) – identify VOs and sites that can benefit from joining the OSG and "hold their hand" while they become productive members of the OSG community
• Resource Management (UF) – manages resources
• Facility Management (UW) – overall facility coordination
21 07/05/2007
OSG Facility Management Activities
• Led by Miron Livny (Wisconsin, Condor)
• Help sites join the facility and enable effective guaranteed and opportunistic usage of their resources by remote users
• Help VOs join the facility and enable effective guaranteed and opportunistic harnessing of remote resources
• Identify (through active engagement) new sites and VOs
22 07/05/2007
OSG Software Activities
• Package the Virtual Data Toolkit (led by the Wisconsin Condor team)
– Requires local building and testing of all components
– Tools for incremental installation
– Tools for verification of configuration
– Tools for functional testing
• Integration of the OSG stack
– Verification Testbed (VTB)
– Integration Testbed (ITB)
• Deployment of the OSG stack
– Build and deploy Pacman caches
23 07/05/2007
OSG Software Release Process
Input from stakeholders and OSG directors → VDT release → test on the OSG Validation Testbed → OSG Integration Testbed release → OSG Production release
24 07/05/2007
How Much Software?
[Chart: number of major VDT components, Jan 2002 through Jan 2007, growing to roughly 45 across VDT 1.1.x, 1.2.x, 1.3.x, 1.4.0, 1.5.x, and 1.6.x. Milestones: VDT 1.0 (Globus 2.0b, Condor-G 6.3.1); VDT 1.1.8, adopted by LCG; VDT 1.1.11, Grid2003; VDT 1.2.0; VDT 1.3.0; VDT 1.3.6, for OSG 0.2; VDT 1.3.9, for OSG 0.4; VDT 1.6.1, for OSG 0.6.0; plus more dev releases.]
Both added and removed software
15 Linux-like platforms supported
~45 components built on 8 platforms
25 07/05/2007
OSG Security Activities
• Infrastructure is X.509 certificate based
• Operational security a priority
• Exercise incident response
• Prepare signed agreements, template policies
• Audit, assess and train
26 07/05/2007
Operations & Troubleshooting Activities
• Well-established Grid Operations Center at Indiana University
• User support is distributed, including [email protected] community support.
• Site coordinator supports the team of sites.
– Accounting and Site Validation are required services of sites.
• Troubleshooting (U Iowa) looks at targeted end-to-end problems
– Partnering with the LBNL troubleshooting work on auditing and forensics.
28 07/05/2007
Campus Grids
• Sharing across compute clusters is a change and a challenge for many universities.
• OSG, TeraGrid, Internet2, and Educause are working together.
29 07/05/2007
OSG and TeraGrid
Complementary and interoperating infrastructures
TeraGrid: networks supercomputer centers. OSG: includes small to large clusters and organizations.
TeraGrid: based on the Condor & Globus s/w stack built at the Wisconsin Build and Test facility. OSG: based on the same versions of Condor & Globus in the Virtual Data Toolkit.
TeraGrid: development of user portals / science gateways. OSG: supports jobs/data from TeraGrid science gateways.
TeraGrid: currently relies mainly on remote login. OSG: no login access; many sites expect VO attributes in the proxy certificate.
Training covers both OSG and TeraGrid usage.
30 07/05/2007
International Activities
• Interoperate with Europe for large physics users.
– Deliver the US-based infrastructure for the Worldwide LHC Computing Grid (WLCG) in support of the LHC experiments.
• Include off-shore sites when approached.
• Help bring common interfaces and best practices to the standards forums.
32 07/05/2007
Particle Physics and Computing
Science driver: event rate = luminosity × cross section
LHC revolution starting in 2008
— Luminosity × 10
— Cross section × 150 (e.g., top quark)
Computing challenge: 20 PB in the first year of running, ~100 MSpecInt2000, close to 100,000 cores
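Written out, the slide's scaling argument reads as follows (the factor of 1500 is my arithmetic from the two numbers above, presumably quoted relative to the previous-generation collider):

R = \mathcal{L} \cdot \sigma,
\qquad
\frac{R_{\mathrm{LHC}}}{R_{\mathrm{before}}}
  = \frac{\mathcal{L}_{\mathrm{LHC}}}{\mathcal{L}_{\mathrm{before}}}
    \times \frac{\sigma_{\mathrm{LHC}}}{\sigma_{\mathrm{before}}}
  \approx 10 \times 150 = 1500
\quad \text{(e.g., for top-quark production)}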
33 07/05/2007
CMS Experiment
[Map: the CMS Experiment (a p-p collision particle physics experiment) spans CERN plus USA@FNAL, Caltech, Florida, Wisconsin, UCSD, Purdue, MIT, and UNL on the OSG side, and Germany, Taiwan, UK, Italy, and France on the EGEE side.]
Data & jobs move locally, regionally & globally within the CMS grid,
transparently across grid boundaries, from campus to global.
35 07/05/2007
Opportunistic Resource Use
• In Nov '06, D0 asked to use 1,500-2,000 CPUs for 2-4 months for re-processing of an existing dataset (~500 million events) for science results for the summer conferences in July '07 (a rough back-of-envelope estimate follows below).
• The Executive Board estimated there were sufficient opportunistically available resources on OSG to meet the request; we also looked into the local storage and I/O needs.
• The Council members agreed to contribute resources to meet this request.
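As a rough back-of-envelope reading of those numbers (my arithmetic, using midpoints of the quoted ranges and rounding freely), the request corresponds to roughly 25-30 CPU-seconds per event:

\frac{1750~\text{CPUs} \times 90~\text{days} \times 86{,}400~\text{s/day}}
     {5 \times 10^{8}~\text{events}}
\;\approx\; 27~\text{CPU-seconds per event}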
36 07/05/2007
D0 Throughput
[Charts: D0 event throughput per week and D0 OSG CPU-hours per week for weeks 1-23 of 2007, broken down by site: CIT_CMS_T2, FNAL_DZEROOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, MIT_CMS, MWT2_IU, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-RCAC, SPRACE, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE.]
37 07/05/2007
Lessons Learned from D0 Case
• Consortium members contributed significant opportunistic resources as promised.
– VOs can use a significant number of sites they "don't own" to achieve a large effective throughput.
• Combined teams make large production runs effective.
• How does this scale?
– How are we going to support multiple requests that oversubscribe the resources? We anticipate this may happen soon.
38 07/05/2007
Use Cases by Other Disciplines
• Rosetta @ Kuhlman lab (protein research): in production across ~15 sites since April
• Weather Research and Forecasting (WRF): MPI job running on 1 OSG site; more to come
• CHARMM molecular dynamics simulation applied to the problem of water penetration in staphylococcal nuclease
• Genome Analysis and Database Update system (GADU): portal across OSG & TeraGrid; runs BLAST
• nanoHUB at Purdue: BioMOCA and nanowire production
39 07/05/2007
OSG Usage By Numbers
39 Virtual Communities
6 VOs with >1,000 jobs max. (5 particle physics & 1 campus grid)
4 VOs with 500-1,000 max. (two outside physics)
10 VOs with 100-500 max. (campus grids and physics)
41 07/05/2007
Jobs Running at Sites
>1k max: 5 sites
>0.5k max: 10 sites
>100 max: 29 sites
Total: 47 sites
Many small sites, or sites with mostly local activity.
42 07/05/2007
CMS Xfer on OSG in June ‘06
All CMS sites exceeded 5 TB per day in June 2006; Caltech, Purdue, UCSD, UFL, and UW exceeded 10 TB/day.
450 MByte/sec
43 07/05/2007
CPUHours/Day on OSG During 2007
[Chart: CPU-hours per day on the OSG from 1/1/07 through 5/28/07, broken down by site: AGLT2, ASGC_OSG, BNL_OSG, BNL_PANDA, CIT_CMS_T2, FIU-PG, FNAL_CDFOSG_1, FNAL_CDFOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, GRASE-GENESEO-OSG, GROW-PROD, HEPGRID_UERJ, IPAS_OSG, Lehigh Coral, MIT_CMS, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OCHEP_SWT2, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-Lear, Purdue-RCAC, SPRACE, STAR-BNL, STAR-WSU, TTU-ANTAEUS, UC_ATLAS_MWT2, UCRHEP, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE, USCMS-FNAL-WC1-CE2, UTA_SWT2, UTA-DPCC, UWMilwaukee, Vanderbilt.]
44 07/05/2007
Summary
• OSG facility utilization is steadily increasing
– ~2,000-4,500 jobs running at all times
– HEP, astro, and nuclear physics, but also bio/eng/med
• Constant effort/troubleshooting is being poured in to make OSG usable, robust, and performant.
• Show its use to other sciences.
• Trying to bring campuses into a pervasive distributed infrastructure.
• Bring research into a ubiquitous appreciation of the value of (distributed, opportunistic) computation.
• Educate people to utilize the resources.
46 07/05/2007
Principle: Simple and Flexible
The OSG architecture will follow the principles of symmetry and recursion
wherever possible
This principle guides us in our approaches to:
- Support for hierarchies of, and property inheritance among, VOs.
- Federations and interoperability of grids (grids of grids).
- Treatment of policies and resources.
47 07/05/2007
Principle: Coherent but heterogeneous
The OSG architecture is VO based.
Most services are instantiated within the
context of a VO.
This principle guides us in our approaches to
- Scope of namespaces & action of services.
- Definition of and services in support of an OSG-wide VO.
- No concept of “global” scope.
- Support for new and dynamic VOs should be light-weight.
48 07/05/2007
OSG Security Activities (continued…)
[Diagram: trust relationships among user, VO, and site. Jobs flow from the user through the VO infrastructure to the site's CE and worker nodes; data flows to the site's storage. Trust statements on the diagram: "I trust it is the VO (or agent)", "I trust it is the user", "I trust it is the user's job", "I trust the job is for the VO".]
49 07/05/2007
Principle: Bottom up/Persistency
All services should function and operate in the local environment when
disconnected from the OSG environment.
This principle guides us in our approaches to:
- The architecture of services, e.g., services are required to manage their own state, ensure their internal state is consistent, and report their state accurately.
- Development and execution of applications in a local context, without an active connection to the distributed services.
50 07/05/2007
Principle: Commonality
OSG will provide baseline services and a reference implementation.
The infrastructure will support incremental upgrades.
The OSG infrastructure should have minimal impact on a site. Services that must run with superuser privileges will be minimized.
Users are not required to interact directly with resource providers.
51 07/05/2007
Scale needed in 2008/2009
• 20-30 petabytes of tertiary automated tape storage at 12 centers worldwide for physics and other scientific collaborations.
• High availability (365x24x7) and high data-access rates (1 GByte/sec) locally and remotely.
• Evolving and scaling smoothly to meet evolving requirements.
• E.g., for a single experiment.
52 07/05/2007
OSG Software Concerns
• How quickly (and at what FTE cost) can we patch the OSG stack and redeploy it?
– Critical for security patches
– Very important for stability and QoS
• How dependable is our software?
– Focus on testing (at all phases), troubleshooting of deployed software, and careful adoption of new software
• Functionality of our software
– Close consultation with stakeholders
• Impact on other cyber-infrastructures
– Critical for interoperability