INFSO-RI-508833 © Members of EGEE collaboration PUBLIC 1 / 135

EGEE

INFRASTRUCTURE PLANNING GUIDE (“COOK-BOOK”)

EU DELIVERABLE: DSA1.7

Document identifier: EGEE-DSA1_7-Cookbook-1-489462-v2_0_0.doc

Date: 16/12/2005

Activity: SA1: EGEE Operations

Lead Partner: CERN

Document status: FINAL

Document link: https://edms.cern.ch/document/489462

Abstract: This document is the deliverable DSA1.7 for SA1 Operations Activity. This deliverable is the Infrastructure Planning Guide (“Cook-book”).

The Infrastructure Planning Guide (“Cook-book”) is intended as a summary of the experience and knowledge gained during the building of the EGEE grid infrastructure. The document is intended to explain some of the decisions and choices made in planning, deploying, and operating the infrastructure, and should be helpful to others who consider building grid infrastructures or participating in existing grids or in projects related to EGEE such as EELA. It draws on the experience gained with both the LCG and gLite middleware stacks. It is certainly not intended to be definitive, but rather to explain the issues and the experience with the hope that others can benefit.

Copyright (c) Members of the EGEE Collaboration. 2004.

See http://public.eu-egee.org/partners/ for details on the copyright holders.

EGEE (“Enabling Grids for E-sciencE”) is a project funded by the European Union. For more information on the project, its partners and contributors please see http://www.eu-egee.org.

You are permitted to copy and distribute verbatim copies of this document containing this copyright notice, but modifying this document is not allowed. You are permitted to copy this document in whole or in part into other documents if you attach the following reference to the copied elements: "Copyright (C) 2004. Members of the EGEE Collaboration. http://www.eu-egee.org".

The information contained in this document represents the views of EGEE as of the date they are published. EGEE does not guarantee that any information contained herein is error-free, or up to date.

EGEE MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, BY PUBLISHING THIS DOCUMENT.

Delivery Slip

             Name                Partner/Activity     Date      Signature
From         Ian Bird            Manager, SA1, CERN   14/11/05
Reviewed by  Mike Mineter        NA3
             Massimo Sgaravatto  JRA1
             Gabriel Zaquine     JRA2
             Ignacio Blanquer    NA4
Approved by  PEB                                      15/12/05

Document Log

Issue  Date      Comment                                       Author
2.0.0  16/12/05  Prepared for delivery                         Anna Cook
1.9.0  16/12/05  Prepared for delivery                         Alistair Mills
1.6.0  14/12/05  Prepared for PEB/PMB approval                 Alistair Mills
1.5.0  11/12/05  Prepared for moderation - 3                   Alistair Mills
1.4.0  04/12/05  Prepared for moderation - 2                   Alistair Mills
1.0    15/11/05  Prepared for moderation                       Alistair Mills
0.9    15/11/05  Integration of §10, and changes to §4         Alistair Mills
0.8    09/11/05  Switched off tracking and introduced changes  Alistair Mills
0.7    01/11/05  Switched on tracking                          Alistair Mills
0.6    31/10/05  Prepared for moderation                       Alistair Mills
0.5    20/10/05  Moved §§3 and 4 to their own documents        Alistair Mills
0.4    07/09/05  Minor revision                                Alistair Mills
0.3    07/09/05  Minor revision                                Alistair Mills
0.2    02/09/05  Minor revision                                Alistair Mills
0.1    02/09/05  First draft                                   Alistair Mills

Document Change Record

Issue  Item  Reason for Change

CONTENTS

INFRASTRUCTURE PLANNING GUIDE (“COOK-BOOK”) .... 1
DELIVERY SLIP .... 2
DOCUMENT LOG .... 2
DOCUMENT CHANGE RECORD .... 2

1 INTRODUCTION .... 9
1.1 PURPOSE OF THE DOCUMENT .... 9
1.2 APPLICATION AREA .... 9
1.3 COLLECTION OF REFERENCES .... 9
1.4 DOCUMENT AMENDMENT PROCEDURE .... 9
1.5 GLOSSARY .... 9
1.6 REFERENCES .... 9

2 EXECUTIVE SUMMARY .... 11
2.1 INTRODUCTION .... 11
2.2 STANDARDS AND INTEROPERABILITY .... 13
2.3 SUMMARY .... 13

3 STRUCTURE OF THIS DOCUMENT .... 15
3.1 HOW TO USE THIS DOCUMENT .... 15
3.2 INTENDED READERS OF THIS DOCUMENT .... 16
3.2.1 Application users .... 16
3.2.2 Security officers .... 16
3.2.3 Service managers .... 16
3.2.4 Grid operators .... 16
3.2.5 User supporters .... 17
3.2.6 Fabric managers .... 17
3.2.7 Developers .... 17
3.2.8 Virtual organisation managers .... 17
3.2.9 Trainers .... 18

4 ARCHITECTURE .... 19
4.1 INTRODUCTION .... 19
4.1.1 Grid Functionality and Services .... 20
4.1.2 Interoperability .... 20
4.2 EGEE MIDDLEWARE .... 20
4.2.1 Site Services .... 21
4.2.2 VO or Global Services .... 24
4.2.3 VDT .... 29
4.3 OPERATING SYSTEM PLATFORMS .... 29
4.4 REFERENCES .... 30

5 CERTIFICATION AND TESTING .... 33
5.1 INTRODUCTION .... 33

5.1.1 The team .... 33
5.1.2 Software integration .... 34
5.1.3 Software testing .... 34
5.1.4 Software certification .... 34
5.1.5 Release manager .... 34
5.2 INTEGRATION, TESTING AND CERTIFICATION PROCESS .... 35
5.2.1 Certification tested .... 35
5.2.2 Pre-production service .... 37
5.2.3 Testing process .... 37
5.2.4 Certification process .... 38
5.2.5 Creation of the candidate release .... 39
5.2.6 Creation of the final release .... 43
5.2.7 Distribution process .... 43
5.3 EGEE SOFTWARE .... 43
5.3.1 Its origins .... 43
5.3.2 Importance of user applications for testing .... 44
5.3.3 User requirements impact on software development .... 44
5.4 THE BIRTH OF THE EGEE SOFTWARE .... 45
5.4.1 LCG-0, learning the process .... 45
5.4.2 LCG-1, making the software useful, debugging .... 45
5.4.3 LCG-2, the deployment release .... 46
5.5 EVOLUTION OF THE RELEASE PROCESS .... 46
5.5.1 As software matures, the processes have to follow .... 46
5.5.2 Adding the pre-production service .... 47
5.5.3 Deployment feed-back becomes more important .... 47
5.6 LESSONS LEARNED WITH LCG .... 48
5.6.1 Grid software development must be driven by its deployment .... 48
5.6.2 The release process requires developers’ cooperation and discipline .... 48
5.6.3 Software development and release is not a democratic process .... 48
5.6.4 Too many languages used for software development lead to a maintenance nightmare .... 49
5.6.5 Software release process is not a glamorous activity .... 49
5.6.6 The certification test bed is where the published software is made .... 49
5.6.7 LAN is difficult but WAN is impossible at the beginning .... 49
5.6.8 Work with developers before they deliver the next version .... 49
5.6.9 Make sure users are satisfied from the beginning .... 49
5.6.10 Always have a handful of highly qualified developers around .... 50
5.6.11 Never accept a software component that does not comply with the agreed level of quality .... 50
5.6.12 Fast debugging feedback with quick software reinstallations is the key to success .... 50
5.6.13 Release manager has to request strict discipline yet keep an open mind for changes .... 50
5.6.14 A software project needs a good bug tracking and process management tool .... 50
5.6.15 Release manager is not a threat to anybody .... 50
5.7 LESSONS LEARNED DURING EGEE .... 50
5.7.1 Anticipate components .... 50
5.7.2 Encourage good patch submissions .... 51
5.7.3 Delay release dates if necessary .... 51
5.7.4 Capitalise on external expertise .... 51
5.7.5 Use updates .... 51
5.7.6 Test updates on the test bed with the release .... 51
5.7.7 Pre-production is good for finding problems .... 51
5.7.8 Documentation is a key component .... 51

6 SECURITY .... 53
6.1 STATEMENT OF PROBLEM .... 53
6.2 SOLUTION IN USE WITH EGEE .... 53

6.2.1 Certification Authorities and PMAs .... 53
6.2.2 Virtual Organisations (VOs) .... 54
6.2.3 Policy framework – trust .... 54
6.2.4 User registration process .... 55
6.2.5 Access to resources .... 56
6.2.6 Authentication and authorization control points .... 56
6.2.7 Use of credential stores - myproxy .... 57
6.2.8 Site registration process .... 57
6.3 LESSONS LEARNED .... 58
6.3.1 Scaling the trust domain .... 58
6.3.2 Authorization control .... 58
6.3.3 Proxy renewal complexity .... 59
6.3.4 Privacy, legal issues and accounting .... 59
6.3.5 Operational security issues .... 59
6.4 REFERENCES .... 59

7 GRID OPERATIONS AND SUPPORT .... 63
7.1 EGEE GRID OPERATIONS .... 63
7.2 OPERATIONS OF THE GRID .... 65
7.2.1 Operation of the Operations Management Centre (OMC) .... 65
7.2.2 Operation of the pre-production service (PPS) .... 66
7.2.3 Operation of the Core Infrastructure Centre (CIC) .... 68
7.2.4 Operation of the CIC On Duty (COD) .... 68
7.2.5 Operation of a ROC participating in CIC .... 71
7.2.6 Operation of a ROC not participating in CIC .... 74
7.2.7 Grid Operations Centre (GOC) .... 76
7.2.8 Site Functional Tests (SFT) .... 83
7.2.9 Incident operation security .... 86
7.2.10 Monitoring, automation, alarms .... 86
7.2.11 Support .... 87
7.2.12 Tools .... 87
7.2.13 Procedures, escalation .... 87
7.2.14 Metrics .... 87
7.2.15 SLAs .... 87
7.3 LESSONS LEARNED .... 87
7.3.1 Lessons from the operation of the OMC .... 87
7.3.2 Lessons from the pre-production service .... 88
7.3.3 Lessons from the operation of the CIC .... 88
7.3.4 Lessons from the operation of the COD .... 88
7.3.5 Lessons from the operation of a ROC participating in CIC .... 90
7.3.6 Lessons from the operation of a ROC not in CIC .... 90
7.3.7 Lessons learned from the GOC .... 91
7.4 WAYS FORWARD TO EGEE-II AND BEYOND .... 91
7.4.1 Ways forward for the pre-production service .... 91
7.4.2 Ways forward for the COD .... 92
7.5 REFERENCES .... 92

8 USER SUPPORT .... 95
8.1 STATEMENT OF PROBLEM .... 95
8.2 SOLUTION IN USE WITH EGEE .... 95
8.2.1 Management initiatives to provide user support .... 96
8.3 SUPPORT MODEL .... 96

8.3.1 Central user support .... 98
8.3.2 Regional user support .... 99
8.3.3 Core infrastructure user support .... 99
8.3.4 Virtual organisation user support .... 99
8.3.5 Support unit user support .... 99
8.4 PASSIVE SUPPORT .... 100
8.4.1 Documentation .... 100
8.4.2 Training .... 100
8.5 MANAGEMENT MATTERS .... 100
8.5.1 Procedures and escalations .... 100
8.5.2 Metrics .... 100
8.5.3 Executive Support Committee (ESC) .... 100
8.5.4 Service Level Agreements (SLA) .... 101
8.6 LESSONS LEARNED .... 101
8.6.1 Document the support model .... 101
8.6.2 Get software for making help desks .... 101
8.6.3 Work with what the responsible units already have .... 101
8.6.4 Have experienced support people working at the front of the ticketing system .... 101
8.6.5 The Virtual Organisations must be at the front end of the ticketing system .... 101
8.6.6 Provide a simple mechanism for submitting a ticket .... 101
8.6.7 Document the work flows .... 102
8.6.8 Document the user interface .... 102
8.6.9 Keep the documentation up to date .... 102
8.6.10 Make a list of the responsible units .... 102
8.6.11 Avoid the development of alternatives .... 102
8.6.12 Encourage people with suitable needs to join the central support organisation .... 102
8.6.13 Get the agreement of the responsible units to implement their part of the model .... 102
8.6.14 Get the responsible units to name a person who is responsible for providing support .... 102
8.6.15 Treat all responsible units similarly - do not allow exceptions .... 102
8.6.16 Document the agreement with each of the responsible units .... 103
8.6.17 Train the supporters in the responsible units to carry out their tasks .... 103
8.6.18 Provide documentation to the supporters .... 103
8.6.19 Test that the ticket routing through the system works between the responsible units .... 103
8.6.20 Involve supporters in the development of the system .... 103
8.6.21 Have a program of upgrades to the support system .... 103
8.6.22 Provide a backup system to avoid a single point of failure .... 103
8.6.23 Collect statistics on the use of the system .... 103
8.6.24 Establish a mechanism for agreeing change and use it .... 103
8.7 REFERENCES .... 104

9 FABRIC MANAGEMENT .... 105
9.1 OVERVIEW OF GRID CLUSTER COMPONENTS .... 105

9.1.1 Computing Element (CE) .... 105
9.1.2 Worker Node (WN) .... 105
9.1.3 User Interface (UI) .... 105
9.1.4 Storage Element (SE) .... 105
9.1.5 Resource Broker (RB), Workload Management System (WMS) .... 105
9.1.6 Information Index (II, BDII), Service Discovery (SD) .... 105
9.1.7 Replica Location Service (RLS) .... 106
9.1.8 Proxy Server (PX) .... 106
9.1.9 Monitor Service (MON), R-GMA Server (R-GMA) .... 106
9.1.10 Virtual Organisation Server (VOS) .... 106
9.1.11 LCG File Catalog and FiReMan (LFC and FC) .... 106
9.1.12 File Transfer Service (FTS) .... 106
9.1.13 VO Box .... 106
9.1.14 I/O Server (IO) .... 107
9.1.15 DataGrid Accounting Server (DGAS) .... 107
9.2 PLANNING FOR A GRID CLUSTER – INGREDIENTS .... 107
9.2.1 Grid server requirements .... 108
9.2.2 Worker Node requirements .... 108
9.2.3 Storage requirements .... 109
9.2.4 Networking .... 109
9.2.5 Security .... 109
9.3 SOFTWARE NOT INCLUDED IN GRID PACKAGE .... 110
9.3.1 Batch system .... 110
9.3.2 Cluster monitoring .... 110
9.3.3 OS Installation and OS updates .... 110
9.4 SETUP, INSTALLATION, VERIFICATION .... 110
9.4.1 Where to install various service components? .... 110
9.4.2 Batch system .... 110
9.4.3 Cron jobs .... 110
9.4.4 Site verification .... 110
9.5 MONITORING AND MAINTENANCE .... 111
9.5.1 What to watch and how often? .... 111
9.5.2 Tool recommendations .... 113
9.6 LESSONS LEARNED .... 113
9.7 REFERENCES .... 113

10 DEPLOYING ADDITIONAL SOFTWARE COMPONENTS .... 115
10.1 INTRODUCTION .... 115
10.2 DEPLOYING MPI .... 115
10.2.1 Why deploy MPI? .... 115
10.2.2 Documentation on deployment .... 115
10.2.3 Security matters .... 115
10.2.4 Other matters to be considered .... 116
10.2.5 Open problems .... 116
10.2.6 Lessons learned .... 116
10.3 REFERENCES .... 117

11 APPENDIX – MIDDLEWARE REQUIREMENTS .... 119
11.1 INTRODUCTION .... 119
11.2 GENERAL REQUIREMENTS .... 119
11.3 INSTALLATION AND CONFIGURATION REQUIREMENTS .... 121
11.4 DEVELOPMENT REQUIREMENTS .... 122

12 TABLES OF REFERENCES AND GLOSSARY .... 123

13 INDEX .... 135

FIGURES

Figure 1: EGEE certification test bed .... 35
Figure 2: Resources in the Pre-Production Service .... 37
Figure 3: The certification process .... 38
Figure 4: Web page produced by the certification matrix .... 41
Figure 5: Description of tests .... 42
Figure 6: Roadmap to LCG-2 – September 2003-December 2003 .... 46
Figure 7: EGEE Certification and Release Process .... 47
Figure 8: Typical software release cycle .... 48
Figure 9: Policy Framework .... 55
Figure 10: Front Page for the GOC .... 77
Figure 11: Accounting Flow Diagram .... 78
Figure 12: Front page for accounting reports .... 79
Figure 13: Page for selecting report .... 79
Figure 14: Report on the number of jobs for Biomed .... 80
Figure 15: Main page for the certificate monitor .... 81
Figure 16: Alert page for certificates .... 82
Figure 17: RGMA and its integration with GIIS and GOC-DB .... 83
Figure 18: Flow of control and data with SFT .... 84
Figure 19: Summary of SFT for SW Grid .... 85
Figure 20: Historical results for CERN .... 86
Figure 21: Work flow for a ticket entered to [email protected] .... 97
Figure 22: Work flow when a ticket is entered for a VO .... 98

TABLES

Table 1: Application areas for the chapters of this document .... 18
Table 2: Guide to the parts of this chapter .... 63
Table 3: References .... 123
Table 4: Glossary of terms .... 128

1 INTRODUCTION

1.1 PURPOSE OF THE DOCUMENT

This document is a summary of the experience and knowledge gained during the building of the EGEE grid infrastructure. The document is intended to explain some of the decisions and choices made in planning, deploying, and operating the infrastructure, and should be helpful to others who consider building grid infrastructures or participating in existing grids. It is not intended to be definitive, but rather to explain the issues and the experience with the hope that others can benefit.

1.2 APPLICATION AREA

This document is intended for readers both internal and external to the project. Its aim is to summarise achievements and issues in managing production operations and to reference all established procedures. To aid readers with specific interests, the sections of most relevance to each audience are summarised in §3, Structure of this document.

1.3 COLLECTION OF REFERENCES

Almost all of the references in the text are to documents which are available on the Internet. A link to the document is provided at the point of reference. Chapters which contain many references also collect these links in a single location at the end of the chapter. In §12, page 123, there is a table (Table 3: References) which provides the entire collection of references.

1.4 DOCUMENT AMENDMENT PROCEDURE

This document is under the responsibility of CERN. Amendments, comments and suggestions should be sent to Alistair Mills ([email protected]). The procedures documented in the EGEE “Document Management Procedure” will be followed [INTRO 4].

1.5 GLOSSARY

In §12, page 128, there is a table (Table 4: Glossary of terms) which provides a summary of the most frequently used acronyms and abbreviations in this document. For a complete EGEE glossary please visit the regularly updated web page provided by JRA2 [INTRO 5].

1.6 REFERENCES

INTRO 1 Details on the copyright holders of project EGEE http://public.eu-egee.org/partners/

INTRO 2 Project web site containing details of the project, its partners and contributors http://www.eu-egee.org

INTRO 3 DSA1.7 Infrastructure Planning Guide ("Cook-book") https://edms.cern.ch/document/489462

INTRO 4 JRA2 Document management procedure http://egee-jra2.web.cern.ch/EGEE-JRA2/Procedures/DocManagmtProcedure/DocMngmt.htm

INTRO 5 JRA2 Glossary of EGEE terms http://egee-jra2.web.cern.ch/EGEE-JRA2/Glossary/Glossary.html

2 EXECUTIVE SUMMARY

2.1 INTRODUCTION

The operation and management of the EGEE grid infrastructure is a wide-ranging activity covering the main aspects of support and operation. Although the EGEE project began in April 2004, the grid infrastructure was based on that developed and set up during the previous 18 months by the LHC Computing Grid (LCG) project. Although the present document describes the operation of the EGEE infrastructure, it therefore makes significant reference to the work done in LCG, since several of the important lessons (or at least their precursors) that we attempt to cover here were learned during that time.

When EGEE started in April 2004, the grid infrastructure that the EGEE Service Activity (SA1) took over responsibility for running was already in place at a significant scale. This was based on the work done in the LCG Grid Deployment team since autumn 2002. At that time the infrastructure itself and the middleware distribution deployed on it were labelled "LCG" and "LCG-2" respectively. In many places this nomenclature has stuck, and it has been difficult to separate the view of LCG (and the four large experiment virtual organisations that it represents) from that of EGEE as the infrastructure provider and maintainer. Significant effort funded through LCG was indeed used to put the EGEE infrastructure in place, and this effort remains very important for the continued success of the EGEE infrastructure. LCG continues to be the largest customer of the EGEE infrastructure. However, it is very important that the other EGEE application communities now increase their roles and visibility in order to strengthen the infrastructure and ensure its long-term viability and versatility.

Since the beginning of the EGEE project, the EGEE SA1 activity has been responsible for all aspects of deploying, supporting, and maintaining the infrastructure, including all the associated activities essential in such an endeavour: user support, application support, operational security in all its aspects, and operational support to sites. In this document we describe some of the processes that EGEE has put in place to manage all of these responsibilities, and we extract some of the lessons learned in the process so that other grid infrastructure projects can benefit from our experiences.

A key point has been that a production infrastructure must be treated as such. Thus, the middleware distribution and grid services deployed on the infrastructure have undergone a robust testing and certification process before being deployed. This has proved invaluable in providing a stable infrastructure. Of course, very many real problems remain to be solved, but given the prototype nature of much of the middleware available today this process was crucial.

The LCG middleware distribution was based on middleware from other sources - the project never proposed to develop its own, but to integrate existing components. The LCG-2 distribution that was the basis of the EGEE distribution essentially consists of the Globus toolkit (Globus 2.x), provided via the Virtual Data Toolkit (VDT) together with Condor, and several components provided by the European DataGrid (EDG) project. These components included a workload management system and data management components. The LCG teams worked on improving several of the components, for example providing a re-implementation of the information system that is fully compatible with the Globus MDS reference implementation. As EGEE started, this distribution became the responsibility of the SA1 teams to maintain, and gradually over the EGEE project other components have been added from the project development activity (JRA1), with gLite components adding to or replacing existing functionality. In addition, the SA1 teams have continued to work on debugging, patching, and improving existing components where necessary, and on providing tools to replace or supplement functionality, or as "glue" to bring components together. This process has been done in an evolutionary way, avoiding "big-bang" changes, to ensure the stability of the production system and to ensure application backwards compatibility across changes. It is unfortunate that the name of the distribution has not evolved to reflect the significant changes and has continued to be labelled "LCG-2.x", which has undoubtedly led to confusion in many places. However, it is now planned to rename the distribution in early 2006.
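
To make the information system mentioned above more concrete: BDII/MDS instances publish resource information over plain LDAP using the GLUE schema, and any LDAP client can query them. The following is a minimal sketch in Python, assuming the third-party ldap3 package; the host name is hypothetical, and the conventional port (2170), search base ("o=grid"), and GLUE attribute names should be checked against an actual deployment.

    # A minimal sketch, not project code: query a BDII information index
    # for the computing elements it publishes in the GLUE schema.
    # Requires the third-party ldap3 package; the host name is invented.
    from ldap3 import Server, Connection, ALL

    # BDII instances conventionally serve anonymous LDAP on port 2170.
    server = Server("lcg-bdii.example.org", port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # anonymous bind

    # List every computing element, with its state and CPU count.
    conn.search(
        search_base="o=grid",
        search_filter="(objectClass=GlueCE)",
        attributes=["GlueCEUniqueID", "GlueCEStateStatus",
                    "GlueCEInfoTotalCPUs"],
    )
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateStatus,
              entry.GlueCEInfoTotalCPUs)

The same query pattern underlies both resource brokering (matching jobs to sites) and much of the monitoring described later in this document.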

At the beginning of EGEE the infrastructure supported around 40 resource centres (sites) providing computing and storage resources. In the 18 months since, it has grown to around 180 sites, and the available resources have increased six-fold. During the same period the project has helped many applications get started and deployed on the production infrastructure, while very many other applications, often in national or local contexts, have become active in smaller ways. There are currently around 20 applications active on a large scale and probably an equal number active on a more limited scale.

The process of integrating, testing, and certifying the middleware and building a deployable distribution has continuously evolved, to address problems as they have come up and to manage the larger scale of the system. The tools to install and configure sites have changed significantly, becoming simpler from the site administrator's point of view, in order to make the process of installation and configuration more straightforward and reliable. This has been a major point and is responsible in large part for the rapid expansion of the infrastructure. It is important to include external sites in the certification process and to make test deployments before a production release; the introduction in EGEE of a pre-production service is an important ingredient in this process, helping to ensure that major problems are uncovered before production. It also provides a better mechanism for the application communities to test new functionality and adapt their software before going to production.

SA1 has put in place a strong and well-managed operations oversight process. This was essential in resolving one of the major problems at the beginning of the project: the instability of the grid sites in terms of availability and reliability, which caused real problems for applications. There is a weekly rotation of responsibility for providing "grid operator on duty" oversight of the operations, backed up by well-defined and well-followed management processes (problem reporting, follow-up, and escalation). Together with a continuously evolving set of functional tests of each site, coupled to a tool that allows applications to dynamically select the sites that are currently stable (sketched below), the problems of stability have been addressed. It is an important lesson that in a large infrastructure many problems can happen, and badly managed sites, unforeseen problems (power outages), and normal operational issues (service machines crashing) together cause a continuous ongoing "problem" that leads to overall instability if not well managed. It is important for the longer term that both applications and grid services developers consider all of these issues from the outset and design software that can operate in this environment. It is also important to note that the operations management is a truly distributed activity - there is no single central point, and the participating organisations take responsibility for providing, running, and maintaining various tools.
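
The fragment below is an illustrative sketch, not the actual EGEE tool, of the idea behind the dynamic site selection just described: a broker consults the latest functional-test results and builds its candidate list only from sites that currently pass. All names (Site, stable_sites, the field names) are invented for illustration.

    # Illustrative sketch only: filter a candidate site list by the most
    # recent Site Functional Test (SFT) outcome, the way a resource
    # broker could build a white-list of currently stable sites.
    from dataclasses import dataclass

    @dataclass
    class Site:
        name: str
        passed_last_sft: bool   # outcome of the latest functional test
        in_maintenance: bool    # declared scheduled downtime

    def stable_sites(sites):
        """Keep only sites that passed their last SFT and are not in
        scheduled maintenance."""
        return [s for s in sites
                if s.passed_last_sft and not s.in_maintenance]

    candidates = [
        Site("site-a", True, False),
        Site("site-b", False, False),  # failed last test: excluded
        Site("site-c", True, True),    # in maintenance: excluded
    ]
    print([s.name for s in stable_sites(candidates)])  # ['site-a']

The essential design choice is that sites are excluded automatically on fresh test evidence rather than by manual blacklisting, which is what allowed the infrastructure to present a usable service despite individual site failures.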

Together with the operations process, a strong and reliable user support system is necessary. This has been much more complex than anticipated and consequently has taken longer to set up and make usable. The complexity arises from the fact that there is a need both for a central support point and for local helpdesks and call centres. Many of these exist of course in their own right, either as part of existing computer centre support processes or put in place by the Regional Operations Centres to support users in their regions, providing essential localisation of support. On the other hand, the importance of having a central point where any user can go to find help, documentation, FAQs, etc., and where a knowledge database can be provided, cannot be overstated. Thus the support system is centrally coordinated, through the Global Grid User Support (GGUS) centre, with strong regional and local efforts. Most of these organisations now provide problem ticketing systems interlinked with that at the GGUS, so that tickets flow into the central point and allow the building of a knowledge database, even though many of them are treated and managed entirely locally. It is vital that sufficient effort be dedicated to oversight and management of these processes; in EGEE this has been underestimated and understaffed. In addition, the people who actually resolve technical problems, while very active in answering problems on ad-hoc mailing lists, feel constrained when brought into a managed process, and managers must be careful not to damage the willingness of people to address problems. Again, much of the pressure could be relieved by having sufficient dedicated staff performing triage on the problems and resolving the common or straightforward ones before the busy experts are asked to help.
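
The central-plus-regional flow described above can be pictured with a small sketch. This is not the GGUS implementation; the categories, unit names, and routing rule are invented purely to illustrate the principle that every ticket is registered centrally (feeding the knowledge database) and then handled either regionally or by a central support unit.

    # Illustrative sketch only: central registration plus regional or
    # central assignment of a support ticket.
    from dataclasses import dataclass, field

    @dataclass
    class Ticket:
        ticket_id: int
        region: str              # region of the submitting user
        category: str            # e.g. "site", "middleware", "vo"
        history: list = field(default_factory=list)

    CENTRAL_UNITS = {"middleware": "deployment team", "vo": "VO support"}

    def route(ticket):
        """Register the ticket centrally, then assign it: site problems
        go to the submitter's regional helpdesk, everything else to a
        central support unit (or to triage if no unit matches)."""
        ticket.history.append("registered centrally (knowledge database)")
        if ticket.category == "site":
            unit = f"ROC {ticket.region} helpdesk"
        else:
            unit = CENTRAL_UNITS.get(ticket.category, "triage")
        ticket.history.append(f"assigned to {unit}")
        return unit

    print(route(Ticket(1, "South-West", "site")))  # regional helpdesk
    print(route(Ticket(2, "CERN", "middleware")))  # central unit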

The other aspect of user support is what we have called "VO support". This is expert assistance working directly with the applications to help integrate their software with the grid middleware and services. This has been found to be very important at this stage, and within LCG dedicated teams were provided. This effort was not really foreseen in EGEE, although individuals have taken this role with some of the other applications. For the medium term future, it is important that this need be recognised and staffed more fully. These teams have also provided vital direct feedback to the deployment and development teams on the usability, functionality, or otherwise of the grid.

In such a large infrastructure, which ties together in a real sense many independent computer centres, the importance of security in all of its aspects cannot be overstated. The operational aspects of security are managed within SA1. There is a strong policy group that, although it started within LCG, has from the beginning involved representatives from other grid projects, particularly the US Open Science Grid (OSG) project. In EGEE this has been managed within JRA3, but with strong representation and input from SA1. In terms of operational security, SA1 provides a Security Officer who is responsible for making operational decisions. It was found important to have this role, since in the event of a security problem or vulnerability discovered in a middleware component, someone must decide whether rapid reaction is required. It is important that such responsibility be well defined. All ROCs and sites must provide a security contact who is responsible for ensuring coordination of operational security problems. An Operational Security Coordination Team coordinates the overall response and is responsible for managing incident response. Recently, SA1 has set up a vulnerability group tasked with looking for and managing vulnerabilities in middleware. Problems are reported to the developers, and the operational team manages urgent vulnerabilities. This is a new team and very little experience has yet been documented.

2.2 STANDARDS AND INTEROPERABILITY

A wide variety of grid projects and grid middleware have emerged over the past few years. They are all capable of providing some or all fundamental grid services such as grid job submission and management, data management, and information services. This broad range has raised the problem of interoperability and standards. So far the grid community has not delivered widely accepted, implemented, and usable standards.

We therefore have to accept that for the medium term there will be several middleware implementations, and that we will have to work to bring these together from a practical point of view, while at the same time working in the standards bodies to ensure that this experience gets reflected in the process.

In the last year the problem of interoperability between EGEE and other grid infrastructures has risen in importance. Several application communities have, or expect to have, the need to run applications on more than one infrastructure. In the longer term this is an important driver of standardisation. EGEE has worked with OSG to achieve cross submission of jobs between the EGEE and OSG services. This has been possible because of common underlying de-facto standards (GSI, SRM for storage, GT2) and the explicit use of common schemas for many things (information system, accounting, etc.). There is also strong interest in sharing operational management experience and eventually perhaps sharing operational oversight. This will also drive the importance of common monitoring schemas, bringing together support workflows and systems. All of these activities will reinforce the need for standards at all levels.

Today we are in a situation where there are a number of de-facto standards in use by many grid communities, and at the high level a number of proposed standards being discussed in various bodies (such as the Global Grid Forum). It is important that these two complementary views be brought together. We need to ensure that the experience with the "bottom-up" de-facto standards currently in use informs the standards process, but it is equally important for the longer term that well thought-out standards be brought into the deployed infrastructure. For example, something missing from current middleware is a consistent architecture that provides a well-defined service container into which an application can deploy application-specific services. This can only be done with an overall architecture that defines what such containers look like and how they behave, so that a site can agree to deploy such a service.

2.3 SUMMARY

Throughout this document we try to express some of the important lessons learned during the experience of the past few years, some of which are mentioned above. It is interesting that there are a number of common themes which run through the following chapters. Some of these lessons are about technology; most of them, however, are about the management of the people involved in the project. Many of the people involved in preparing the material included in this document commented that one should not underestimate how difficult it is to get this to work in a reliable, persistent manner. It is important to recognise that, although this is a report on a single project, there is no direct line management. This is rather different from large companies running projects on this scale. Here we must realise that the management activity is rather one of consensus and coordination, and that traditional management techniques do not apply.

Some of the common themes include the following:

• the importance of documentation at all levels;

• the need for tools which support distributed working for distributed teams;

• the need for distributed test beds for several aspects, but the overhead introduced should not be ignored;

• the need for adequate and timely training;

• the need to test things before they are put into service;

• the need to allow sufficient time to prepare things for service;

• the need to work with the international community of people involved in standards for grid;

• the need for monitoring of many things to verify that they are working correctly.

The knowledge gained in the current phase of EGEE, and the experiences and lessons learned here, will be taken as important input for an eventual second phase of the project.

3 STRUCTURE OF THIS DOCUMENT

The operation of the infrastructure of the grid is a complex matter involving the work of many people in many locations. Any organisation joining the grid has to plan its entry into the grid and its organisation. The purpose of this document is to provide information which will assist such organisations to do this.

3.1 HOW TO USE THIS DOCUMENT

The document follows the standard arrangement for an EGEE document.

In the standard arrangement §1, Introduction, contains tables of references, a glossary, revision information, and so on.

In the standard arrangement §2, Executive summary, contains the executive summary.

In this document §3 contains a chapter on the Structure of this document.

The document then contains content chapters on the following topics:

§4 Architecture;

§5 Certification and testing;

§6 Security;

§7 Grid operations and support;

§8 User support;

§9 Fabric management;

§10 Deploying additional software components.

The document contains an appendix:

§11 Appendix – Middleware requirements

The content chapters generally cover the following aspects of the topic of the chapter:

• Description;

• Operations;

• Security;

• Monitoring, automation, alarms;

• Support;

• Tools;

• Procedures;

• Metrics;

• Accounting;

• Service level agreements;

• Lessons learned from the experience to date.

Each chapter has its own list of references and draws on a common set of acronyms. The acronyms are collected in §1. There is a single table (Table 3: References) in §1 which collects all of the references from each of the chapters.

3.2 INTENDED READERS OF THIS DOCUMENT

It is expected that there may be readers of this document in the following categories:

• Application users;

• Security officers;

• Service managers;

• Grid operators;

• User supporters;

• Fabric managers;

• Developers;

• VO Managers;

• Trainers.

All people working with the grid have to be familiar with its architecture and be sensitive to the need for security. For that reason, the chapters entitled Architecture and Security, respectively, should be read by all readers. The table at the end of this chapter (Table 1: Application areas for the chapters of this document) provides details of the chapters of most relevance to different readers.

3.2.1 Application users

This document is not intended for the users of applications on the grid. It may be of interest to some users to read this document to appreciate the organisation of the people working on the grid. However, this is unlikely. A computing grid should provide services to users without the users knowing the details of the operation of the grid. By comparison with other networks such as telephone networks, the user should only have to know about the user interface to the system and the point of contact to report a problem or to ask for help. It is not in the scope of this document to address this.

3.2.2 Security officers

A security officer is a person who is responsible for the security of operations at a site. §6, entitled Security, is relevant to a security officer.

3.2.3 Service managers

Service managers are people who are responsible for the operation of a service on the grid. For example, a service manager may be responsible for the provision of a storage element. The chapter entitled Grid operations and support describes the operations of the grid and is relevant to a service manager. In the event of a failure of a service for which the service manager is responsible, the service manager will receive a trouble ticket alerting them to the problem.

3.2.4 Grid operators

Grid operations and support people are those who deal with the operations of the grid. Generally they work at a resource centre, at a regional operations centre, or at the operations management centre of the grid. They generally specialise in some aspect of the operations of the grid such as monitoring, accounting, security, service problem location, or service problem resolution. The chapter entitled Grid operations and support addresses the operations of the grid infrastructure and its organisation.

3.2.5 User supporters

The use of grids is new and introduces new challenges. In order to migrate users and workloads from earlier architectures to grid architectures, there is a need for persons who can assist the users who have responsibility for the applications and workloads. The chapter of this document entitled User support documents the support system and the role of the user supporter.

3.2.6 Fabric managers

A fabric manager is a person at a resource centre that makes services available to the grid. The chapter Fabric management addresses the matters which a fabric manager must consider when joining and participating in the grid.

3.2.7 Developers

A developer is a person who makes software to support the services on the grid. This document is not intended for developers of middleware for grid services. However, it may be of interest to them to know how the operational infrastructure works, and to plan accordingly. The most obvious area of immediate concern to a developer is to be able to pass on code for integration, testing, and certification, and then for deployment. It is therefore expected that a developer should be familiar with the contents of the chapter entitled Certification and testing.

There is an operational area where the developer is expected to be involved, and that is the support of the middleware in which the developer is an expert. The support system searches for an appropriate expert in the event of a problem being detected with the middleware. The developer may therefore find that trouble tickets are assigned to his/her team and that the team is expected to deal with them, or to provide appropriate advice. The chapter dealing with the ticketing system entitled User support describes the operation of this.

3.2.8 Virtual organisation managers

The purpose of a grid is to provide an environment in which useful work can be done. The ownership of this work is the domain of the Virtual Organisation (VO). The person who represents the virtual organisation is the Virtual Organisation Manager. The virtual organisation should not have to concern itself with the delivery of the computing service, but it must concern itself with the delivery of service to its users. There is a chapter of this document entitled User support which documents the support system and the role of the virtual organisation in user support.

3.2.9 Trainers

A trainer is a person who has responsibility for providing training courses and materials for people so that they can do their jobs. As such, the trainer can represent all users of the grid including all of the other groups mentioned here. Depending on the audience for the work of the trainer, any of the chapters are potentially of interest.

Table 1: Application areas for the chapters of this document

Domain of reader    Primary chapter(s)               Secondary chapters
Application user    None                             None
Security officer    §6                               §§3, 4
Service manager     §7                               §§3, 4, 6
Grid operator       §7                               §§3, 4, 6
User supporter      §8                               §§3, 4, 6
Fabric manager      §9                               §§3, 4, 6
Developer           §§5, 8, 11                       §§3, 4, 6
VO manager          §8                               §§3, 4, 6
Trainer             §§3, 4, 5, 6, 7, 8, 9, 10, 11

4 ARCHITECTURE

4.1 INTRODUCTION

The EGEE architecture consists of an agreed set of services and applications running on the grid infrastructure provided by the EGEE partners. The EGEE infrastructure brings together many of the national and regional grid programmes into a single unified infrastructure. In addition, many sites in the Asia-Pacific region run the EGEE middleware stack and appear as an integral part of the EGEE infrastructure.

This chapter describes the set of middleware and services that are running in the production service. It is not intended as an overall guide to the full set of services anticipated in future, but rather as a statement of what is actually available and on which the experiences and lessons described in subsequent chapters are based.

It is important to understand some of the history behind the middleware distribution currently deployed. It is a hybrid of software taken from a variety of sources, and is continually evolving to reflect the experience and needs of the applications. To avoid confusion, the following terms are used throughout this document:

LCG-2: the name of the current (November 2005) middleware distribution. This name is somewhat misleading now, as the contents of this distribution, as described below, are derived from many sources.

gLite: The middleware components, developed or re-engineered within the EGEE project itself (the JRA1 activity). Many of these components are re-worked versions of middleware originally developed within the European Data Grid (EDG) project.

The LCG project worked for 18 months prior to the start of the EGEE project and put together the LCG-2 middleware release, taking components from various sources and learning how to create a production service, as opposed to the previous generation of grid test beds. The result of this effort, comprising both the middleware distribution itself (“LCG-2.x”) and the infrastructure built up during that time, was the basis of the EGEE infrastructure when the project started, and the reason the project could rapidly build up the infrastructure and deploy real applications.

In parallel, the JRA1 activity began work on the gLite middleware, as far as possible using a web service architecture. The goal was to replace or supplement existing middleware and services with these components redeveloped in a consistent architecture and based on cumulative experience with LCG-2 and elsewhere.

Many of the gLite components, because they are re-engineered versions of existing middleware, are (or will be) replacements for components currently in the LCG-2 release. It is important that this be understood, because the goal for the production service is that the deployed middleware distribution evolves gradually, so that the production service does not break. It is recognised that continuing to label the distribution as LCG-2 is confusing, since more and more gLite components are now adding to or replacing existing older services.

The middleware distribution contains components from a variety of sources, integrated into a coherent whole through the integration, testing, and certification process described later. The sources of components are:

• Globus toolkit (GT 2.x) for essential underlying components, GSI, etc.;

• Virtual Data Toolkit (VDT), packaging Globus and providing Condor and other components;

• components from the European Data Grid (EDG) project, such as some data management tools and workload management;

• LCG-provided components – re-engineering of the GIIS and GRIS of the Globus MDS in the information system; command line tools;

• SA1-provided components – such as the Disk Pool Manager (DPM), monitoring tools, etc.;

• gLite components – gradually including more and more (November 2005: R-GMA, VOMS, File Transfer Service);

• various other tools from within the project and from other projects as required.

The essential grid services should be provided to Virtual Organisations (VO) according to the needs of the VOs and by agreement between EGEE, the sites, and the VOs as to how these services are made available. These services and other issues of operability are discussed in this section and also in the discussion on Grid operations and support (§ 7). The architecture of gLite is described in [ARC 3].

The information contained in this chapter is quite detailed. Similar information is presented in §9, Fabric management, and some readers may prefer the presentation there; §9 provides detail which is not contained in this chapter, as it is intended for a different audience.

4.1.1 Grid Functionality and Services

The set of services that are currently available has evolved since autumn 2002, driven initially by LCG and, since EGEE started, by the full range of EGEE applications. In the sections below, we describe the basic and higher-level grid services to be found in the production service, and we note their origin. The pre-production service allows new services to be demonstrated to applications prior to production release.

4.1.2 Interoperability

This section outlines the basic essential services that must be provided to the VOs by all Grid implementations. The majority of these deal with the basic interfaces from the Grid services to the local computing and storage fabrics, and the mechanisms by which to interact with those fabrics. It is clear that these must be provided in such a way that the application should not have to be concerned with which Grid infrastructure it is running on.

At the basic level of the CE and SE, both EGEE and the U.S. Open Science Grid (OSG) use the same middleware and implementations, both being based on the Virtual Data Toolkit (VDT). In addition, both use the same schema for describing these services, and have agreed to collaborate in ensuring that these remain compatible, preferably by agreeing to use a common implementation of the information system and information providers. Common work is also in hand on other basic services such as VOMS and its management interfaces. In addition, both the EGEE and OSG projects are defining activities to ensure that interoperability remains a visible and essential component of the systems.

The EGEE Resource Broker, as it is based on Condor-G, can submit jobs to many middleware flavours. When the Glue2 information system schema, being defined jointly by several Grid projects, is available, it will enable the EGEE Resource Broker to schedule resources at sites running ARC. Further steps towards interoperability in the areas of workload management and data management are planned by the NorduGrid Collaboration. Other activities are being undertaken by EGEE, LCG, ARC developers, and others to foster and support standards and community agreements. These include participation in the Rome Compute Resource Management Interfaces Initiative [ARC 8] and in the Global Grid Forum.

These activities will improve interoperability between different middleware implementations, and in the longer term we can expect standards to emerge and be supported by future versions of the software. For the medium term, however, the approach taken is to define a set of basic services that can be deployed in all of the existing Grid infrastructures, taking account of their different technical constraints. In this way sites providing resources to EGEE will be able to provide these essential services in a transparent way to the applications.

4.2 EGEE MIDDLEWARE

The EGEE middleware deployed on the EGEE infrastructure consists of a packaged suite of functional components providing a basic set of Grid services including job management, information and monitoring, and data management services. The LCG-2 middleware distribution, currently deployed at over 170 sites worldwide, originated from Condor, EDG, Globus, VDT, and other projects. It is anticipated that the LCG-2 middleware will evolve to include the majority of the functionality of the gLite middleware provided by the EGEE project. The architecture of gLite is described in [ARC 3]. The rest of this chapter describes, respectively, the LCG-2 middleware services and the gLite ones.

The middleware can in general be further categorized into site services and Virtual Organization (VO) services as described below.

4.2.1 Site Services

4.2.1.1 Security

All EGEE middleware services rely on the Grid Security Infrastructure (GSI). Users obtain and renew their (long-term) certificate from an accredited Certificate Authority (CA). Short-term proxies are then created and used throughout the system for authentication and authorization. These short-term proxies may be annotated with VO membership and group information obtained from the Virtual Organization Membership Service (VOMS). Access to (site) services is controlled by the Java authorization framework (Java services) and LCAS (C services). When necessary, in particular for job submission, mappings between the users’ Distinguished Names (DNs) and local accounts are created (and periodically checked) using the LCAS and LCMAPS services. When longer-term proxies are needed, MyProxy services can be used to renew the proxy. The sites maintain Certificate Revocation Lists (CRLs) so that usage by a revoked Grid user is rejected.

The introduction of VOMS and the local authorisation services described above is still in the early stages. The majority of the production services still rely on the local lookup of a gridmap file, mapping grid users to local users.

VOMS and VOMS administrator documentation are available in the references [ARC 7 and ARC 13].
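
As an illustration of the flow described above from the user's side, the following is a minimal sketch of the typical shell commands involved. The VO name "myvo" and the MyProxy host are placeholders, and option names may differ between client versions.

  # Create a short-term proxy annotated with VO membership, group, and
  # role information obtained from the VOMS server registered for the VO:
  voms-proxy-init --voms myvo

  # Inspect the proxy, including its remaining lifetime and attributes:
  voms-proxy-info --all

  # For long-running work, deposit a medium-term credential in a MyProxy
  # server so that services can renew the short-term proxy on the user's
  # behalf (myproxy.example.org is a placeholder):
  myproxy-init -s myproxy.example.org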

4.2.1.2 Computing Element

The Computing Elements (CEs), often dubbed head nodes, provide the grid interfaces to Local Resource Management Systems (LRMS), i.e. site batch systems. They normally require external network connectivity.

The Computing Element (CE) is the set of services that provide access to a local batch system running on a compute farm. Typically the CE provides access to a set of job queues within the batch system. How these queues are set up and configured is the responsibility of the site and is not discussed here.

A CE is expected to provide the following functions and interfaces:

• A mechanism by which work may be submitted to the local batch system. At present this is typically implemented by the Globus gatekeeper in the EGEE/LCG-2 release, with modifications that avoid the scalability issues in GRAM, and with additional job managers that supplement the standard versions, again avoiding scalability problems.

• Publication of information through the Grid information system and associated information providers, according to the GLUE schema, that describes the resources available at a site and the current state of those resources. With the introduction of new CE implementations we would expect that the GLUE schema, and evolutions of it, should be maintained as the common description of such information.

• Publication of accounting information, in an agreed schema and at agreed intervals. Presently the schema used in EGEE and OSG follows the GGF accounting schema (usage record). It is expected that this will be maintained and evolved as a common schema for this purpose.

• A mechanism by which users or Grid operators can query the status of jobs submitted to that site.

• The Computing Element and associated local batch systems must provide authentication and authorization mechanisms based on the VOMS model. How that is implemented in terms of mapping Grid user DNs to local users and groups, how roles and subgroups are implemented, may be through different mechanisms in different Grid infrastructures. However, the basic requirement is clear — the user presents an extended X509 proxy certificate, which may include a set of roles, groups, and subgroups for which the user is authorized, and the CE/batch system should respect those through appropriate mappings locally.

It is anticipated that a new CE from gLite, based on Condor-C, will also be deployed as a replacement for the existing Globus GRAM-based CEs within EGEE.

4.2.1.2.1 LCG-2 Computing Element

The LCG-2 Computing Element (CE) handles job submission (including staging of required files), cancellation (subject to support by the Local Resource Management System, LRMS), and job status inquiry. It only works in push mode, where a job is sent to the CE by a Resource Broker (RB). Internally the LCG-2 CE makes use of the Globus gatekeeper, LCAS/LCMAPS, and the Globus Resource Allocation Manager (GRAM) for submitting jobs to the LRMS. It also interfaces to the logging and bookkeeping services to keep track of the jobs during their lifetime.

The LCG-2 CE interfaces with the following LRMS: BQS, Condor, LSF, PBS and its variants (e.g. Torque), and others.

4.2.1.2.2 gLite Computing Element

The gLite Computing Element (CE) handles job submission (including staging of required files), cancellation, suspension, and resumption (subject to support by the LRMS), as well as job status inquiry and notification. The CE is able to work in a push model (where a job is pushed to a CE for its execution) or in a pull model (where a CE asks a known Workload Manager, or a set of Workload Managers, for jobs). Internally, the gLite CE makes use of the new Condor-C technology, CEMon [ARC 31], BLAH [ARC 32], GSI, and LCAS/LCMAPS, as well as the Globus gatekeeper. Another web-service-based CE, called CREAM [ARC 33], is under implementation in gLite. The CE is expected to evolve into a VO-based scheduler that will allow a VO to dynamically deploy its scheduling agents. The gLite CE makes use of the logging and bookkeeping services to keep track of the jobs during their lifetime.

The gLite CE interfaces with the following LRMS: PBS and its variants (Torque), LSF and Condor. Work to interface to BQS (IN2P3) and SUN Grid Engine (Imperial College) is under way.

4.2.1.3 Storage Element

The Storage Element (SE) provides the grid interfaces to site storage (which may or may not be mass storage). SEs normally require external network connectivity.

A Storage Element (SE) is a logical entity that provides the following services and interfaces:

• A Mass Storage System (MSS), either disk cache or disk cache front-end backed by a tape system. Mass storage management systems currently in use include CASTOR, Enstore-dCache, HPSS and Tivoli for tape/disk systems, and dCache, DPM, and DRM for disk-only systems;

• A Storage Resource Manager (SRM) interface to provide a common way to access the MSS whatever the implementation of the MSS. The SRM defines a set of functions and services that a storage system provides in an MSS-implementation-independent way. The Baseline Services working group of LCG [ARC 29] has defined a set of SRM functionality that is required by all LCG sites. This work has been endorsed by the EGEE PEB and applies to all EGEE sites as well. This set is based on SRM v1.1 with additional functionality (such as space reservation) from SRM v2.1. Existing SRM implementations currently deployed include CASTOR-SRM, dCache-SRM, DRM/HRM from LBNL, and the DPM;

• GridFTP service to provide data transfer in and out of the SE to and from the grid. This is the essential basic mechanism by which data is imported to and exported from the SE. The implementation of this service must scale to the bandwidth required. Normally the GridFTP transfer will be invoked indirectly via the File Transfer Service or through srmcopy;

• Local POSIX-like input/output facilities to the local site, providing application access to the data on the SE. Currently this is available through rfio, dCap, AIOD, or rootd, according to the implementation. Various mechanisms for hiding this complexity also exist, including the Grid File Access Library in LCG-2 and the gLite I/O service developed in gLite. Both of these mechanisms also include connections to the grid file catalogues to enable an application to open a file based on its LFN or GUID;

• Authentication, authorization, and audit/accounting facilities. The SE should provide and respect ACLs for files and datasets that it owns, with access control based on the use of extended X509 proxy certificates with a user DN and attributes based on VOMS roles and groups. It is essential that an SE provide sufficient information to allow tracing of all activities for an agreed historical period, permitting audit of the activities. It should also provide information and statistics on the use of the storage resources, according to schema and policies to be defined.

A site may provide multiple SEs providing different qualities of storage. For example, it may be considered convenient to provide an SE for data intended to remain for extended periods and a separate SE for data that is transient, and is needed only for the lifetime of a job or set of jobs. Large sites with MSS-based SEs may also deploy disk-only SEs for such a purpose or for general use.

Basic-level data transfer is provided by GridFTP. This may be invoked directly via the globus-url-copy command or through the srmcopy command, which provides third-party copy between SRM systems (in a third-party copy the user submits a request to copy a file from one location to another and trusts the third party to act on his or her behalf; the first party is the user). However, for reliable data transfer it is expected that an additional service above srmcopy or GridFTP will be used. This is generically referred to as a reliable file transfer service (rfts). A specific implementation of this, known as gLite FTS, has been included in the current LCG-2 distribution. It can be used for third-party transfers between sites that provide an SE. No service needs to be installed at the remote site apart from the basic SE services described above. However, tools are available to allow the remote site to manage the transfer service.
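
To make the layering concrete, the following hedged sketch shows a direct GridFTP invocation and a third-party transfer; the host names and paths are placeholders.

  # Push a local file to the GridFTP door of an SE:
  globus-url-copy file:///home/user/data.root \
      gsiftp://se.example.org/storage/myvo/data.root

  # Third-party copy between two GridFTP servers, performed on behalf
  # of the user holding the proxy:
  globus-url-copy gsiftp://se1.example.org/storage/myvo/data.root \
      gsiftp://se2.example.org/storage/myvo/data.root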

File placement services, which would provide a layer above a reliable file transfer service, providing routing and implementing replication policies, are not part of the current infrastructure layer. An FPS component is being developed in gLite and will be included in future releases.

4.2.1.3.1 LCG-2 Storage Elements

The LCG-2 SE can be either a ‘classic’ SE or an SRM SE. The classic SE provides a GridFTP interface (efficient FTP functionality with GSI security) to disk storage. The RFIO protocol can be used for directly accessing the data on a classic SE. An SRM SE provides the GridFTP interface to a Storage Resource Manager (SRM), a common interface to Mass Storage Systems such as the CERN Advanced Storage Manager (CASTOR) or dCache/Enstore from DESY and FNAL.

Recently, a more lightweight and simpler SRM has been made available, the LCG Disk Pool Manager (DPM), which is targeted at smaller disk pools. The DPM is a natural replacement for the classic SE.

Some applications require the ability to perform POSIX-like I/O operations on files (open, read, write, etc.) directly from the application, and solutions are being deployed to allow this. The LCG Grid File Access Library and the gLite I/O service are examples of different implementations of such a service.

It is anticipated that all applications and libraries that provide this facility will communicate with Grid file catalogues (local or remote) and with the SRM interface of the SE, so that file access can be done via the file's LFN or GUID. Thus these libraries will hide this complexity from the user.

It is not expected that remote file I/O to applications from other sites will be needed in the short term, although the mechanisms described above could provide it. Rather, data should be moved to the local storage element before access, or new files should be written locally and subsequently copied remotely.

4.2.1.3.2 GFAL

The Grid File Access Library (GFAL) is a POSIX-like I/O layer for access to grid files via their logical name. It provides open/read/write/close-style calls to access files while interfacing to a file catalogue. GFAL currently interfaces to the LFC and the RLS catalogues. A set of command line tools for file replication called lcg-utils, supporting SRMs and classic SEs, has been built on top of GFAL and the catalogue tools.
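
As a hedged illustration of how these tools are typically used, the sketch below copies a local file onto the grid, replicates it, and retrieves it. The VO name, SE hosts, and LFN path are placeholders, and option details may vary between releases.

  # Copy a local file to an SE and register it in the catalogue under
  # a logical file name (the command prints the new file's GUID):
  lcg-cr --vo myvo -d se.example.org \
      -l lfn:/grid/myvo/user/data.root file:/home/user/data.root

  # Replicate the file, identified by its LFN, to a second SE:
  lcg-rep --vo myvo -d se2.example.org lfn:/grid/myvo/user/data.root

  # Copy a grid file back to local disk:
  lcg-cp --vo myvo lfn:/grid/myvo/user/data.root file:/tmp/data.root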

4.2.1.3.3 gLite Storage Element

A gLite Storage Element consists of an SRM (such as CASTOR, dCache, or the Disk Pool Manager) presenting an SRM 1.1 interface, a GridFTP server as the data movement vehicle, and gLite I/O providing POSIX-like access to the data. gLite itself does not provide an SRM or a GridFTP server; these must be obtained from the standard sources.

4.2.1.3.4 gLite I/O

gLite I/O is a POSIX-like I/O service for access to grid files via their logical name. It provides open/read/write/close-style calls to access files while interfacing to a file catalogue. It enforces the file ACLs specified in the catalogue if appropriate. gLite I/O currently interfaces to the FiReMan and the RLS catalogues.

An overview of gLite data management can be found in the references [ARC 2], while detailed usage of gLite I/O command lines and programmatic interfaces are available in reference [ARC 11].

4.2.1.4 Monitoring and Accounting Services

The monitoring and accounting services retrieve information on grid services provided at a site as well as respective usage data, and publish them. User information (in particular related to job execution progress) may be published as well.

4.2.1.4.1 LCG-2 Monitoring and Accounting Services

The LCG-2 monitoring service is based on information providers which inspect the status of grid services and publish their data into the LDAP-based BDII system. Accounting data is collected by the Accounting Processor for Event Logs (APEL) system, which publishes its data into the R-GMA system. R-GMA requires a server running at a site to produce and consume information.

4.2.1.4.2 gLite Monitoring and Accounting Services

gLite relies on the same services as described in §4.2.1.4.1. In addition, an R-GMA-based service discovery system is provided. The gLite accounting system (DGAS) is under evaluation. The CEMon service on the CE provides CE information to interested clients, in particular the WMS.

DGAS collects information about the usage of grid resources by users and groups of users (including VOs). This information can be used to generate reports and billing, but also to implement resource quotas. Access to the accounting information is protected by ACLs. More information on DGAS is available in reference [ARC 12].

4.2.2 VO or Global Services

4.2.2.1 Virtual Organization Membership Service

As noted in §3.2.8, the ownership of the work done on the grid is the domain of the Virtual Organisation (VO), and the person who represents the virtual organisation is the Virtual Organisation Manager.

One of the roles of the VO manager is to provide authorisation for members of the virtual organisation to use the resources of the grid. In some VOs the scope of this authorisation is wide, and all members of the VO enjoy equal access to the resources. In other VOs the scope is very detailed, with each user having limited, controlled access. The VOMS software has been deployed so that the VO manager can manage the membership of the VOs. It provides a service to generate extended proxy certificates for registered users which contain information about their authorized use of resources for that VO [ARC 7, ARC 13].

The VO manager must also negotiate for resources for the VO. Processes are still under development to deal with this.

In order to support users of the grid, there is a support system called GGUS [ARC 30]. VO managers have to work with GGUS to ensure that GGUS can contact the people inside the VO who provide support, and so that these people in the VO can be supported by others in the wider community. This is discussed in §8, User support.

The Virtual Organization Membership Service (VOMS) annotates short-term proxies with information on VO and group membership, roles, and capabilities. It originated from the EDG project. It is used in particular by the Workload Management System, and by the FiReMan catalogue for ACL support, to provide the functionality identified by LCG. The main evolution from EDG/LCG is support for SLC3, bug fixes, and better conformance to IETF RFCs.

A single VOMS server can serve multiple VOs. A VOMS Administrator Web interface is available for managing VO membership through the use of a web browser.

There is no significant functional difference between the VOMS in LCG-2 and in gLite. VOMS 1.5 and higher supports both MySQL and Oracle.

For a detailed description of VOMS and its interfaces, see references [ARC 7 and ARC 13].

4.2.2.2 Workload Management Systems

Various mechanisms are currently available to provide workflow and workload management. These may be at the application level or may be provided by the Grid infrastructure as services to the applications. The general feature of these services is that they provide a mechanism through which the application can express its resource requirements, and the service will determine a site that fulfils those requirements and submit the work to that site.

The area of job workflow and workload management is one where there are expected to be continuing evolutions over the next few years, and these implementations will surely evolve and mature.

The present middleware provides the EDG-developed Resource Broker component. This is expected to be succeeded in the near future by the re-engineered version from gLite, which provides additional functionality and addresses some of the important performance issues in the current implementation.

The Resource Broker takes a job description from the user, written in a Job Description Language (JDL) devised in EDG, and creates a job request, which it then uses to find resources sufficient to run the job. Once a site is found that satisfies the job request the job is dispatched to the site. The changing state of the job is registered in a Logging and Bookkeeping Service.

4.2.2.2.1 LCG-2 Workload Management System

The Workload Management System in LCG-2.x originated from the EDG project. It essentially provides the facilities to manage jobs and to inquire about their status. It makes use of Condor and Globus technologies and relies on GSI security. It dispatches jobs to appropriate CEs, depending on job requirements and available resources. BDII and RLS are used for retrieving information about the resources.

The user interfaces to the WMS using a Job Description Language (JDL) based on Condor ClassAds, which is specified in reference [ARC 2].
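
By way of illustration, the sketch below submits a trivial job through the Resource Broker. The JDL attributes shown are standard, but the requirement expression and the VO name "myvo" are illustrative placeholders, and command options may vary between releases. A minimal JDL file, hostname.jdl, might contain:

  Executable    = "/bin/hostname";
  StdOutput     = "std.out";
  StdError      = "std.err";
  OutputSandbox = {"std.out", "std.err"};
  Requirements  = other.GlueCEPolicyMaxCPUTime >= 60;

It is then submitted and tracked with:

  edg-job-submit --vo myvo hostname.jdl   # returns a job identifier
  edg-job-status <job-id>                 # query the job's state in the L&B
  edg-job-get-output <job-id>             # retrieve the output sandbox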

4.2.2.2.2 gLite Workload Management System

The Workload Management System in gLite is an evolution of the one in LCG-2. As such, it currently relies on BDII as an information system, and will be able to use CEMon or R-GMA. It is interoperable with LCG-2 CEs.

The Workload Management System (WMS) operates via the following components and functional blocks.

The Workload Manager (WM), or Resource Broker, is responsible for accepting and satisfying job management requests coming from its clients. The WM passes job submission requests to appropriate CEs for execution, taking into account the requirements and preferences expressed in the job description. The decision as to which resource should be used is the outcome of a matchmaking process between submission requests and available resources. This depends not only on the state of resources, but also on the policies that sites or VO administrators have put in place (on the CEs).

Interfaces to data management, allowing the WMS to locate sites where the requested data is available, exist for RLS, the Data Location Interface (DLI), and the Storage Index interface (a set of two methods listing the SEs for a given LFN or GUID, implemented by the FiReMan catalogue); catalogues exposing one of these interfaces can be queried by the WMS.

The WMProxy component, providing a Web service interface to the WMS as well as bulk submission and parameterized job capabilities, is foreseen to be deployed before the end of the EGEE project.

The user interacts with the WMS through the same Condor ClassAds-based Job Description Language, using a Command Line Interface or APIs; C++ and Java are supported. For a detailed description of the WMS and its interfaces, see reference [ARC 2].

4.2.2.3 File Catalogues

Files on grids can be replicated in many places. The users or applications do not need to know where the files actually are, and use Logical File Names (LFNs) to refer to them. It is the responsibility of the file catalogues to locate and give access to the data. In order to ensure that a file is uniquely identified worldwide, Globally Unique Identifiers (GUIDs) are usually used.

The VO models for locating datasets and files vary somewhat between the different VOs, but all rely on Grid file catalogues with a common set of features. These features include:

• mapping of logical file names to GUIDs and storage locations (SURLs);

• management of replicas of logical file names;

• hierarchical namespace (directory structure);

• access control: at directory level in the catalogue; directories in the catalogue for all users; a well-defined set of roles (admin, production, etc.);

• interfaces to: workload management systems (e.g. the Data Location Interface / Storage Index interfaces); a POSIX-like I/O service.

The deployment models also vary between the VOs, and are described in detail elsewhere in this document. The important points to note here are that each VO expects a central catalogue which provides the look-up ability to determine the location of replicas of datasets or files. This central catalogue may be supported by read-only copies of it, regularly and frequently replicated locally or to a certain set of sites. There is, however, in all cases a single master copy that receives all updates and from which the replicas are generated. Obviously this must be based on a very reliable database service.

The central catalogues must also provide an interface to the various workload management systems. These interfaces provide the location of Storage Elements that contain a file (or dataset) (specified by GUID or by logical file name) that the workload management system can use to determine which set of sites contain the data that the job needs. This interface should be based on the Storage Index of gLite or the Data Location Interface originally proposed by LCG. Both of these are very similar in function. Any catalogue providing these interfaces could be immediately usable by, for example, the Resource Broker or other similar workload managers.

The catalogues are required to provide authenticated and authorized access based on a set of roles, groups, and sub-groups. The user will present an extended proxy certificate, generated by the VOMS system. The catalogue implementations should provide access control at the directory level, and respect ACLs specified either by the user creating the entry or by the VO catalogue administrator.

It is expected that a common set of command-line catalogue management utilities be provided by all implementations of the catalogues. These will be based on the catalogue-manipulation tools in the lcg-utils set with various implementations for the different catalogues, but using the same set of commands and functionality.
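
As a sketch of what such a common command set looks like in the lcg-utils implementation, the commands below resolve an LFN in the catalogue; the VO, LFN, and GUID are placeholders.

  lcg-lr --vo myvo lfn:/grid/myvo/user/data.root   # list all replicas (SURLs)
  lcg-lg --vo myvo lfn:/grid/myvo/user/data.root   # resolve the LFN to its GUID
  lcg-la --vo myvo guid:<guid>                     # list the LFN aliases of a GUID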

There are currently two implementations of file catalogues: the LFC and FiReMan. The LFC is a lightweight catalogue developed to respond to the above requirements, and was explicitly designed as a local or central catalogue. It is currently deployed in the production system and is in use by several applications. FiReMan, on the other hand, was designed as a more general catalogue, including the ability to act as a distributed catalogue. In its initial implementation it is available on the pre-production service for testing by the applications. In the long term it is anticipated that the two implementations might converge, combining the strengths of both.

Further information on LFC and Fireman is available in the references [ARC 1] and [ARC 3].

4.2.2.3.1 EDG RMS

The services provided by the RMS, originating from EDG, are the Replica Location Service (RLS) and the Replica Metadata Catalogue (RMC). The RLS maintains information about the physical locations of the replicas. The RMC stores mappings between GUIDs and LFNs. A final component is the Replica Manager, offering a single interface to users, applications, or Resource Brokers. The command line interfaces and the APIs for Java and C++ are available in references [ARC 17] and [ARC 18] respectively. It is anticipated that the EDG RMS will gradually be phased out.

4.2.2.3.2 LCG File Catalogue

The LCG File Catalogue (LFC) offers a hierarchical view of the logical file name space. The two functions of the catalogue are to provide Logical File Name to Storage URL translation (via a GUID) and to locate the site at which a given file resides. The LFC provides Unix-style permissions and POSIX Access Control Lists (ACLs). It exposes a transactional API. The catalogue exposes a so-called Data Location Interface (DLI) that can be used by applications and Resource Brokers. Simple metadata can be associated with file entries. The LFC supports Oracle and MySQL databases. The LFC provides a command line interface and can be interfaced through Python [ARC 19].
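
A brief, hedged example of browsing the LFC name space with its command line tools follows; the server name and paths are placeholders, and the ACL entry syntax (which follows POSIX setfacl conventions) is illustrative.

  export LFC_HOST=lfc.example.org              # catalogue server to contact

  lfc-mkdir /grid/myvo/user                    # create a directory in the name space
  lfc-ls -l /grid/myvo/user                    # Unix-style listing with permissions
  lfc-setacl -m u:someuser:rwx /grid/myvo/user # add an ACL entry for a user
  lfc-getacl /grid/myvo/user                   # show the directory's ACLs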

4.2.2.3.3 gLite Fireman catalogue

The gLite File and Replica Catalogue (FiReMan) presents a hierarchical view of a logical file name space. The two main functions of the catalogue are to provide Logical File Name to Storage URL translation (via a GUID) and to locate the site at which a given file resides. The catalogue provides Unix-style permissions and Access Control List (ACL) support via Distinguished Names or VOMS roles. File access is secured via these ACLs. The FiReMan catalogue provides Web Services interfaces with the full WSDL available. Bulk operations are supported. The catalogue exposes the so-called Storage Index interface used by the gLite Workload Management System to dispatch jobs to the relevant site. Metadata capabilities are supported through the use of key/value pairs on directories. FiReMan supports Oracle and MySQL database back-ends. An overview of gLite data management can be found in reference [ARC 9], while the FiReMan catalogue command line interface and the Java and C++ APIs are described in reference [ARC 20].

4.2.2.4 Information Services

Information services publish and maintain data about resources in grids. This information in EGEE is modelled after the Grid Laboratory Uniform Environment schema (GLUE).

4.2.2.4.1 BDII

The Berkeley Database Information Index (BDII) is an implementation of the Globus Grid Index Information Service (GIIS), but allows for more scalability. Information provided by the BDII adheres to the GLUE information model. Interfacing with the BDII is done via LDAP operations, for which commands and APIs exist. Both LCG-2 and gLite currently rely on the BDII for proper operation.
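
Since the BDII is a plain LDAP server, it can be queried with standard LDAP tools. The sketch below lists the Computing Elements it knows about; the host name is a placeholder, while port 2170 and the o=grid base are the conventional settings for the deployed information system.

  # List all CEs known to the BDII, with selected GLUE attributes
  # describing their current state:
  ldapsearch -x -H ldap://bdii.example.org:2170 -b o=grid \
      '(objectClass=GlueCE)' GlueCEUniqueID GlueCEStateStatus \
      GlueCEStateFreeCPUs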

4.2.2.4.2 R-GMA

R-GMA is an implementation of the Grid Monitoring Architecture of GGF and presents a relational view of the collected data. It is basically a producer/consumer service with command line interfaces as well as an API for Java, C, C++ and Python and a Web interface. R-GMA models the information infrastructure of a grid as a set of consumers (that request information), producers (that provide information) and a central registry which mediates the communication between producers and consumers. R-GMA (via GIN) can use the same information providers as used by BDII.

Recently, a Service Discovery mechanism using R-GMA has been implemented. Detailed information is available in reference [ARC 21].

R-GMA is currently also used to collect EGEE accounting records.

The R-GMA and Service Discovery command line interface, Java, C, C++, and Python APIs are available in reference [ARC 22].
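
As a rough sketch of the consumer side, the rgma command line tool offers an SQL-style prompt; the table name below follows the GLUE-derived tables republished via GIN, but is an assumption that depends on the deployed schema.

  rgma
  # At the prompt, query a table published into R-GMA (table name assumed):
  > SELECT * FROM GlueCE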

4.2.2.4.3 Logging and Bookkeeping

The Logging and Bookkeeping service (LB) tracks jobs during their lifetime in terms of events (important points of the job's life, such as submission, the start of execution, etc.) gathered from the WMs and the CEs, which are instrumented with LB calls. The events are first passed to a local logger and then to bookkeeping servers. More information on the Logging and Bookkeeping service is available in reference [ARC 2].

4.2.2.4.4 Job Provenance

Job Provenance Services are being prototyped; their role is to keep track of submitted jobs (completed or failed), including execution conditions and environment as well as important points of the job life-cycle, for long periods (months to years). This information can then be reprocessed for debugging, post-mortem analysis, comparison of job executions, and re-execution of jobs. More information on Job Provenance Services is available in reference [ARC 23]. This service is not yet in production.

4.2.2.5 File Transfer Services

4.2.2.5.1 LCG-2 File Transfer Services

LCG-2 did not provide a File Transfer Service per se; rather, it was up to the user to issue the relevant commands to replicate files from one Storage Element to another. During Service Challenge 2 in 2004, however, a set of ad hoc tools (Radiant) was developed for managing the huge number of files to be moved from site to site.

4.2.2.5.2 gLite Transfer Services

The gLite File Placement Service (FPS) takes data movement requests and executes them according to defined policies. It maintains a persistent transfer queue, thus providing reliable data transfer even in the case of a network outage, and interacts fully with the FiReMan catalogue. The File Placement Service can also be used without interaction with the catalogue, in which case it is referred to as the File Transfer Service (FTS). The gLite File Transfer Service was used for the Service Challenge 3 project in the summer of 2005 and is now a production component. The FTS command line interface and API are described in reference [ARC 24].
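
The key property described here, a transfer queue that survives restarts and outages because it is persistent, can be sketched with a small SQLite-backed queue; the schema and retry policy below are invented for illustration and are not the FPS/FTS implementation, which additionally manages channels, concurrency and proxies.

```python
# Sketch of a persistent transfer queue in the spirit of the FPS/FTS
# description above: requests survive a restart because they live in
# SQLite; schema and retry policy are invented for illustration.
import sqlite3

class TransferQueue:
    def __init__(self, path="transfers.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("""CREATE TABLE IF NOT EXISTS transfers (
            id INTEGER PRIMARY KEY, source TEXT, dest TEXT,
            state TEXT DEFAULT 'PENDING', retries INTEGER DEFAULT 0)""")
        self.db.commit()

    def submit(self, source, dest):
        self.db.execute("INSERT INTO transfers (source, dest) VALUES (?, ?)",
                        (source, dest))
        self.db.commit()

    def run_once(self, copy_fn, max_retries=3):
        """Attempt every pending transfer once; failures stay queued."""
        rows = self.db.execute(
            "SELECT id, source, dest, retries FROM transfers "
            "WHERE state = 'PENDING'").fetchall()
        for tid, src, dst, retries in rows:
            try:
                copy_fn(src, dst)
                self.db.execute("UPDATE transfers SET state='DONE' WHERE id=?",
                                (tid,))
            except Exception:
                state = 'FAILED' if retries + 1 >= max_retries else 'PENDING'
                self.db.execute(
                    "UPDATE transfers SET retries=?, state=? WHERE id=?",
                    (retries + 1, state, tid))
        self.db.commit()

q = TransferQueue()
q.submit("srm://se1.example.org/f1", "srm://se2.example.org/f1")
q.run_once(lambda s, d: print("copy", s, "->", d))
```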


4.2.2.5.3 VO Agents

Some VOs require a mechanism that allows them to run long-lived agents at a site. These agents perform activities on behalf of the VO and its applications, such as scheduling database updates. No such general service currently exists, but solutions are being prototyped. Currently such actions are performed by VO software running in the batch system, but this is not a good mechanism in the longer term as it could be seen as a misuse of the batch system. It is better to provide a generic solution which is accepted by the sites and which provides the facilities needed by the applications; a sketch of the intended behaviour is given below.
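
Pending such a generic service, the desired behaviour, a long-lived loop acting periodically on behalf of the VO, can be sketched as follows; the task (a stand-in database update) and all names are hypothetical.

```python
# Sketch of a long-lived VO agent: a loop that periodically performs
# VO work (here, a stand-in "database update"); purely illustrative.
import time

def update_vo_database():
    # Stand-in for real VO work such as refreshing a conditions database.
    print("refreshing VO database ...")

def run_agent(task, interval_seconds=3600, max_cycles=None):
    """Run `task` every `interval_seconds`; max_cycles=None means forever."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        task()
        cycles += 1
        time.sleep(interval_seconds)

run_agent(update_vo_database, interval_seconds=1, max_cycles=3)
```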

4.2.2.5.4 Application Software Installation Facilities

Currently each Grid site provides an area of disk space, generally on a network file system, where VOs can install application software. Tools are provided in the middleware, or by the VOs themselves to install software into these areas, and to later validate that installation. Generally, write access to these areas is limited to the VO software manager. These tools will continue to be provided, and will be further developed to provide the functionality required by the VOs.

4.2.2.5.5 Job Monitoring Tools

The ability to monitor and trace jobs submitted to the Grid is an essential functionality. There are some partial solutions available in the current systems (e.g., the Workload Management System provides a comprehensive logging and bookkeeping database); however, they are far from being complete solutions. Effort must be put into continuing to develop these basic tools, and into providing users with appropriate mechanisms through which jobs can be traced and monitored.

4.2.3 VDT

The Virtual Data Toolkit (VDT) is an ensemble of grid middleware that can be easily installed and configured. The VDT was originally created to serve as a delivery channel for grid technologies developed and hardened by the NSF-funded GriPhyN [ARC 25] and iVDGL [ARC 26] projects, and these two projects continue to be the primary sources of funding for the VDT. However, the role of the VDT has expanded and now supports the LHC Computing Grid Project (LCG) [ARC 27] and the Particle Physics Data Grid (PPDG) [ARC 28].

Both LCG-2 and gLite middleware components rely on the VDT versions of Condor, Globus, ClassAds and MyProxy. VDT provides direct support to EGEE/LCG for those packages. LCG-2 and gLite components such as VOMS, CEMon, and information providers have been added to VDT.

4.3 OPERATING SYSTEM PLATFORMS

The resource centres of EGEE require the middleware stack on a large variety of platforms and operating systems, in several flavours and versions. Therefore, in order to guarantee portability, the software must be written following the most common standards in terms of programming languages and operating systems. Applications area software is routinely developed and run with a number of different compilers and operating systems, including Red Hat Linux, Microsoft Windows, and Apple Mac OSX, both with gcc and with the platforms' proprietary C++ compilers. This approach helps to ensure conformance to language standards and allows the project to manage dependencies on platform-specific features, on both 32-bit and 64-bit hardware architectures.

The production platforms currently supported for both the LCG and gLite software stacks are:

• Red Hat 7.3 with gcc 3.2 and gcc 3.2.3 - the Linux reference platform for the LHC VOs and for the main computer centres. Support for Red Hat 7.3 will be discontinued by the end of 2005;

• Scientific Linux 3 with gcc 3.2.3, and in the near future also with gcc 3.4.3 - the new Linux reference platform for CERN and other large HEP laboratories. This is binary compatible with Red Hat Enterprise 3.


In addition, ‘development-only’ platforms are supported that have better development tools and are therefore used by many programmers and users to increase productivity and assure software quality:

• Microsoft Windows, with the Visual C++ 7.1 compiler and CygWin;

• Mac OSX 10.3 with gcc 3.3, and soon 10.4, probably with gcc 4.

Platforms that will likely be supported in the near future are:

• SLC3 Linux on AMD 64-bit processors, as an additional production platform;

• the gcc 3.4.3 compiler on all Linux platforms, to take advantage of better performance;

• Mac OSX 10.4 as a development platform, to resolve issues related to the loading of dynamic libraries.

The range of operating environments which work correctly for worker nodes is much wider than the above list. When sites have a collection of computers which they wish to make available on the grid, and which do not run one of the supported environments, they can implement the interface to the grid with a small number of gateway systems running one of the supported environments, and need only port the small subset of the middleware which is required on the worker nodes. Usually there is another site which has already ported the software and can provide assistance. Although this situation is not ideal, it does not represent a severe impediment to participation in the grid.

4.4 REFERENCES

ARC 1 LCG File Catalog (LFC) Administrator's Guide, https://edms.cern.ch/document/579088

ARC 2 LB Service User's Guide https://edms.cern.ch/document/571273

ARC 3 gLite Architecture https://edms.cern.ch/document/476451

ARC 4 DPM Administrator's Guide https://edms.cern.ch/document/591600

ARC 5 Pool Of Persistent Objects for LHC http://lcgapp.cern.ch/project/persist

ARC 6 NA5 - Policy And International Cooperation http://public.eu-egee.org/activities/na5_details.html

ARC 7 User’s Guide for the VOMS Core Services https://edms.cern.ch/document/571991

ARC 8 Rome Compute Resource Management Interfaces Initiative http://www.pd.infn.it/grid/crm

ARC 9 EGEE User’s Guide https://edms.cern.ch/document/572406

ARC 10 EGEE gLite User’s Guide - Overview Of gLite Data Management https://edms.cern.ch/document/570643

ARC 11 EGEE gLite User’s Guide - gLite I/O https://edms.cern.ch/document/570771

ARC 12 User’s Guides for the DGAS Services https://edms.cern.ch/document/571271


ARC 13 VOMS admin user's guide https://edms.cern.ch/document/572406

ARC 14 JDL Attributes http://server11.infn.it/workload-grid/docs/DataGrid-01-TEN-0142-0_2.pdf

ARC 15 JDL Attributes Specification https://edms.cern.ch/document/555796

ARC 16 WMS User’s Guide https://edms.cern.ch/document/572489

ARC 17 User Guide for the EDG Replica Manager 1.5.4 http://cern.ch/edg-wp2/replication/docu/r2.1/edg-replica-manager-userguide.pdf

ARC 18 Developer Guide for the EDG Replica Manager 1.5.4 http://cern.ch/edg-wp2/replication/docu/r2.1/edg-replica-manager-devguide.pdf

ARC 19 LCG Data Management Documentation https://uimon.cern.ch/twiki/bin/view/LCG/DataManagementDocumentation

ARC 20 Fireman Catalogue User Guide https://edms.cern.ch/document/570780

ARC 21 Service Discovery User Guide https://edms.cern.ch/document/578147

ARC 22 gLite Release 1 Web Page http://hepunx.rl.ac.uk/egee/jra1-uk/glite-r1

ARC 23 JP usage guide http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/jp_usage.shtml

ARC 24 File Transfer Service User Guide https://edms.cern.ch/document/591792

ARC 25 Grid Physics Network (GriPhyN) Web Page http://www.griphyn.org

ARC 26 International Virtual Data Grid Laboratory (iVDGL) Web Page http://www.ivdgl.org

ARC 27 LHC Computing Grid (LCG) - Web Page http://lcg.web.cern.ch/LCG

ARC 28 Particle Physics Data Grid (PPDG) Web Page http://www.ppdg.net

ARC 29 Baseline Services Working Group Report http://lcg.web.cern.ch/LCG/PEB/BS/BSReport-v1.0.pdf

ARC 30 Global Grid User Support http://ggus.org

ARC 31 CE Mon http://grid.pd.infn.it/cemon/field.php


ARC 32 BLAH http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/ce_blahp.shtml

ARC 33 CREAM CE for gLite http://grid.pd.infn.it/cream/field.php


5 CERTIFICATION AND TESTING

5.1 INTRODUCTION

The goal of the software certification process is to provide EGEE with production quality software satisfying agreed requirements.

5.1.1 The team: members, roles, tasks

The Certification and Testing team (CT) is responsible for delivering the grid software distribution. The central team consists of seven permanent members plus the leader; in addition, other partners provide effort on specific tasks, and occasionally visitors have stayed for one or two months, but rarely longer. Each member has at any given time a well defined set of tasks that must be done so that the certification process can move forward. The following is a non-exhaustive list of the tasks addressed:

• Provision of the certification test bed. The certification test bed is described later in this document (see §5.2.1). The team leader decides, in discussion with team members, the priorities for the test bed, its configuration, the type of tests to run, and which version of software to run on which test bed cluster or node. In some sense the CT is the architect of the test bed.

• Software integration. This involves cleaning up the software packages to remove wrong, incorrect or unnecessary dependencies. This is very tedious and difficult work; at the beginning of the LCG certification and testing activities, it proved so demanding that two people had to be dedicated to this task alone for two months.

• Test bed maintenance. This involves system administration: installing (and removing) patches, and frequent reinstallation of software, as the tests require continuous verification of the software for consistency.

• Bug tracking. This includes information sessions with developers to "integrate" them into the certification process so they can get a real feeling for the software quality.

• Defining and writing tests. This is a task with large scope for individual contributions from many partners.

• Analysis of test results;

• Problem isolation on the test bed. It is of the utmost importance to isolate a problem as much as possible, to define the conditions under which it can be seen, and in the ideal case to find the conditions under which it can be reproduced on demand. This is of tremendous help to developers.

• Finding a fix, if possible, and testing it. This can serve as the starting point for developers to develop the proper correction code. Often, however, the correction code developed in the CT team proved to be the right one.

• Verifying the quality of the external software intended for deployment. It is a very demanding task to convince other parties that software they deliver for inclusion in a future release package has to conform to some minimum requirements.

There are clearly many smaller tasks that do not need special mention here. The tasks assigned to each team member change continuously as the software testing progresses. This organization allows a well balanced workload for all members, and fairly rapid progress through the very high number of problems encountered.


It should be noted that this team is often subject to many conflicting pressures: its desire to do as good a job as possible before releasing software to the users; the expectation from developers that their software be put into production as quickly as possible; and the expectation from the user community of functional, reliable software. It is important that these conflicting objectives be balanced one against another.

5.1.2 Software integration

The software which the development team2 provides to the Certification team for integration, testing and certification, and eventually for public release, first has to be put together in a relatively homogeneous package that can actually be compiled, linked, configured and installed. This process is called software integration. It ensures that all software pieces depend on the same versions of other software, such as libraries. It is necessary to limit third-party software to the absolute minimum, otherwise one may end up with a product which cannot be maintained.

5.1.3 Software testing

Integrated software components first have to be tested for:

• internal consistency3, as they come from a number of sources;
• features conforming to their advertised functions;
• installation consistency, repeatability, and upgradeability.

This requires the development of tests focused on particular details of software features, with the goal of determining the reason for a software failure or inefficiency. These are not necessarily stable tests; they change very often and mostly require collaboration with the developers. They are often tests designed to find software problems that are rare and difficult to reproduce, and may consequently need many hours of running on the test bed to detect a failure, as such problems are typically associated with memory leaks, socket connections not being released, and similar errors.
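
Such long-running tests usually amount to repeating an operation while watching a resource counter drift. A minimal Linux-only sketch, counting open file descriptors via the /proc file system against a deliberately leaky operation, is shown below.

```python
# Minimal soak-test sketch for leak hunting on Linux: repeat an operation
# and watch the process's open-file-descriptor count drift; the operation
# shown (an unclosed open) is deliberately artificial.
import os

def open_fds():
    """Number of file descriptors currently open (Linux /proc)."""
    return len(os.listdir("/proc/self/fd"))

leaked = []

def leaky_operation():
    leaked.append(open("/dev/null"))   # file handle is never closed

baseline = open_fds()
for i in range(1, 501):
    leaky_operation()
    if i % 100 == 0:
        print(f"after {i} iterations: {open_fds() - baseline} fds above baseline")
```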

5.1.4 Software certification

Certification is the repeatable process of passing functionality, performance and stress tests with the goal of finding the limits of the tested software, in order to verify its solidity and functionality. These tests are usually stable (once debugged) and do not change much over time, so that meaningful comparisons with previous releases are possible.

5.1.5 Release manager

The Release Manager (RM) is an essential element of the release process and is the person (not a group or a committee) leading it. The RM is responsible for bringing the new version of the software to the high quality expected by the project management. The RM consults with the various groups within the project to determine the details of the next release and proposes its content to the project management. The RM has the authority to:

• determine priorities for the release process, including testing, bug fixing, feature development (within the current release cycle) to ensure the planned release date is met;

• determine the new issues to be addressed by the user support group as a result of planned new features or changed behaviour of existing software;

• stimulate the development of new tests or test suites for testing the newly developed features or improving the existing tests to better cover the changing software;

2 The term development team is a collective term for the source of the middleware to be integrated. In the context of EGEE, this is the JRA1 and JRA3 activities. Prior to EGEE, there were other development sources within other projects such as EDG. 3 Internal consistency means that all of the parts of the software are collectively consistent within the collection of software. At a minimum, it means that it is possible to compile, link and execute all of the components without uncovering inconsistencies.


• determine the most suitable date for the beta test period (typically 2-3 weeks before the release date) and make sure the special beta test period rules are enforced;

• organize the release of the software (for example, making sure all files/CDs are properly tested and all documentation is included);

• ensure the release date is met. If for any reason this appears not to be feasible, the RM has to warn the project management so that an appropriate decision can be taken in the interests of the project. This may include changing the release date or dropping some features/fixes in favour of maintaining the original date;

• verify that bugs entered into the tracking system are indeed addressed by software changes included in the release.

The RM has the authority to make decisions about the criticality of all modifications during the beta period.

5.2 INTEGRATION, TESTING AND CERTIFICATION PROCESS

The complete release process consists of a continuous iterative cycle that also involves feedback from the EGEE production system and input from the software providers. The certification test bed is the main tool used for this activity.

5.2.1 Certification test bed

The certification test bed is a large set of machines (between 50 and 100) that serves as the tool for the integration, testing and certification of the grid release. It is literally where the grid release is made. It has to be flexible to accommodate the large number of (sometimes contradictory) requirements that must be satisfied in order to test and certify the large number of different combinations of hardware, grid software and operating system expected in the deployment. The following figure (Figure 1: EGEE certification test bed) shows a snapshot of the test bed at one point in time of the EGEE release cycle:

Figure 1: EGEE certification test bed

The certification test bed has been organized as a set of connected clusters, each consisting of service machines and a number of worker nodes, representing different sites in the deployment. Clusters were kept as independent of each other as possible and installed independently, each running different software services: different batch systems (Condor, LSF, OpenPBS), different sets of grid services (RB, CE, SE, MDS), and different Data Management services such as RLS with a MySQL or Oracle back-end (later, Castor and dCache services were added). It was possible to add more worker nodes to clusters to test various scalability issues. These clusters acted as a grid on a LAN.

This organization proved extremely valuable because of its great flexibility. It was in principle relatively easy to change or update software in a whole cluster, or in just one service, and continue with testing. In reality, however, it required close control of all actions (software upgrades, reconfigurations, bug fix patch installations and the like), because of the paramount requirement to be able to undo any action at any time so that the source of a new problem could be unequivocally determined.

In spite of the great complexity of the certification test bed, one can say without hesitation that it was the utmost care devoted to its difficult maintenance that allowed the very steady and fairly quick progress of the whole EGEE/LCG-2 testing, certification and release process.

Configuration, LAN and WAN

The CT team's intention was always to have a test bed with clusters on both a LAN and a WAN, in order to have sufficiently realistic testing; real-world network failures cannot be realistically tested on a LAN. Unfortunately, we understood very quickly that including the WAN would slow the testing down so much that sufficiently quick progress inside the release cycle would become impractical. To understand this, one has to go back in time and realize that the quality of the first versions of the grid components was such that many of the machines on the CT test bed had to have their operating system reinstalled several times a day to allow fast testing of various fixes. This was not a practical option on a WAN, as physical access to each machine is required. We were forced to abandon the WAN idea for the certification test bed on the LCG-2 time scale in favour of the speed of local installation on the LAN. Later on, when things got better, we were able to run on a WAN; today, the Pre-Production Service with the gLite middleware stack runs on a WAN.

We compensated for the lack of remote cluster connections by introducing what we called destructive tests. These were tests in which we deliberately disconnected, for example, the Ethernet connection of some machine, reconnected it a short time later, and observed the behaviour of the software. In our belief this simulated real-world network interruptions. This way of testing network problems was very successful and found many problems that could be fixed before they hit the deployment field.
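
A destructive test of this kind can be scripted. The sketch below shows the general shape: cut a network link, restore it, and check that the service recovers. It must run as root; the interface name, timings and the service probe are illustrative stand-ins.

```python
# Shape of a "destructive" network test as described above: cut a link,
# wait, restore it, and check the service recovered. Requires root;
# interface name, timings, and check_service() are illustrative.
import subprocess
import time

IFACE = "eth0"

def set_link(state):
    subprocess.run(["ip", "link", "set", IFACE, state], check=True)

def check_service():
    # Stand-in for a real probe, e.g. submitting a test job or copying a file.
    return True

def destructive_network_test(outage_seconds=30, recovery_grace=120):
    assert check_service(), "service unhealthy before the test"
    set_link("down")                 # simulate the network failure
    time.sleep(outage_seconds)
    set_link("up")                   # reconnect
    deadline = time.time() + recovery_grace
    while time.time() < deadline:    # give the software time to recover
        if check_service():
            return True
        time.sleep(5)
    return False

if __name__ == "__main__":
    print("recovered:", destructive_network_test())
```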

Remarks on the certification test bed

The certification test bed is a large, expensive tool that is unmatched in its capability of allowing the real certification of grid software. It is a unique resource required to satisfy many tasks, and as such it requires careful management to maximize its efficiency. It is a fast changing environment (upgrades, different tests, bug fixing, testing of fixes) which unfortunately can sometimes be used for only one activity at a time. It provides a highly controlled environment for systematic, exhaustive and repeatable testing of all grid components. As such it is irreplaceable for its task.


5.2.2 Pre-production service

The Pre-Production Service (PPS) is not a test bed. It is a grid in its own right and provides a service in a manner similar to the production grid. However, the PPS is focussed on providing an environment where new middleware can be exercised before being brought into service. The PPS was introduced at the start of the EGEE project and operates as a distributed grid with participation from many of the partners in EGEE.

The following figure (Figure 2: Resources in the Pre-Production Service) shows the current participation of the sites in the PPS.

§7 contains a section entitled Operation of the pre-production service (PPS), which provides further details of the service.

Figure 2: Resources in the Pre-Production Service

5.2.3 Testing process

Testing is an iterative process of running different tests or test suites to verify the functionality of all components of the grid software. Grid Unit Testing (GUT) consists of testing the following:

• Basic grid functionality;
• Grid services;
• Security, certificates, proxies;
• Information system;
• Resource brokering, load distribution, resource saturation;
• Data access, data catalogue, data replication, catalogue consistency;
• Connectivity;
• Configurability;
• Error recovery;
• Real world applications;
• Site verification suite.

Grid Services Testing (GST) consists of testing:

• Service interactions;
• Jobs with input data;
• Jobs with Mass Storage access;
• Different batch systems (OpenPBS, LSF, Condor);
• Job submission tests;
• Single and multiple job streams;
• Resources fully and equally utilized;
• Large numbers of jobs to a single cluster (testing the CE resilience);
• Matching of resources to requirements;
• Data management tests (copy, replicate, and register) for single and multiple streams;
• Proxy renewal, using long sleeper jobs which need at least two proxy renewals;
• Matchmaking with input data, to check that jobs are dispatched only to clusters which allow access to specific resources with specific protocols.

5.2.4 Certification process

The certification process is a highly iterative process of running the specially designed test suites and/or sets of tests mentioned above under different, well defined circumstances that allow a run to be repeated exactly, perhaps under a different configuration next time. This permits direct comparisons between different releases and allows the discovery of difficult problems, such as performance inefficiencies newly introduced by bug fixes or new features. The certification test suite is a highly automated test procedure that performs interactive as well as batch tests of various lengths, from several hours to several days if needed, carries out performance evaluation and stress testing, and provides statistics about problems (their frequency during long runs).

The suite of Site Functional Tests (SFT) that runs in the production and pre-production services is the logical extension of the certification test suite. The SFT allows mis-configurations and losses of service to be spotted, and provides assurance that a site has qualified for connection to the grid.

The entire certification test suite forms a certification matrix, which is the key to providing a grid software distribution of acceptable quality. The whole certification process is perhaps best illustrated by the following figure (Figure 3: The certification process):

Figure 3: The certification process

Release cycle

The CT team obtains software from different sources, currently mainly from the EGEE and VDT projects; other external software is used as well. To create a release, a strict release cycle is followed.


5.2.5 Creation of the candidate release

Define desired features

Before beginning the integration of new parts of the middleware, we define the scope of the next release. The requirements are defined by the VOs and by bug fixes supplied by our internal debugging team and by external groups. At any step in the certification process bugs can be reported and corrected before the release. We try to create incremental releases without too many changes at a time; the target is to permit VOs to adapt their code easily to new releases. This method allows us to produce a new release about once per month. Each release is made available to the Infrastructure team, who decide whether the release is worth deploying. By this method, the EGEE functionality increases steadily without long pauses.

What are the different parts of the middleware?

The middleware of the grid is generic: we need job management, data management, an information system, accounting and monitoring. This functionality is provided by EGEE and VDT software. The aim is for the middleware to be compatible with the most important batch systems and storage systems already installed in computing centres; we also have to provide open source solutions for centres without any batch or storage system. The VDT distribution of Globus is still the heart of the system, and we collaborate with the VDT team, reporting and fixing bugs. We also try to increase interoperability with other grids: a good connection between EGEE and other grids would be an elegant way to increase the potential storage capacity and computing power available to each participating grid.

Certification of the release

We use the certification test bed to support various configurations at the same time. The certification process is defined by four different series of tests, a milestone where we create a release candidate, and finally another set of VO tests:

1. Installation test;
2. Basic functionality test;
3. Certification matrix (intensive test);
4. Special tests for new functionality;
5. Tagging of the release candidate;
6. VO-specific tests (on the Experiments' (VOs') Integration test bed).

All supported configurations are certified by the Certification and Testing team. We test different batch systems:

• OpenPBS;
• LSF;
• Condor;
• Torque.

Tests are combined with different storage systems:

• Classic SE (simple GridFTP server);
• dCache SRM;
• DPM SRM;
• CASTOR SRM.

Installation test

Currently the main platform is still Scientific Linux 3 (both IA32 and IA64 versions are supported), which is compatible with Red Hat Enterprise Linux 3. We also currently support the Red Hat 7.3 operating system. The RH 7.3 certification test bed is mostly installed using the LCFGng fabric management system, while the SL3 test bed is installed using our newer YAIM installation tool. We ensure that nodes can both be upgraded from the previous release and reinstalled from scratch with the new release. We use the manual installation guide for the previous release and provide a document describing the differences in the new configuration. The package lists are managed in cpp format, as for LCFGng. We create a set of scripts to download and install packages, or manage an APT repository.


Basic functionality test

A number of tests are run to verify that the newly prepared software is not trivially broken. The tests verify the Resource Broker functions, the data transfer functions, user utilities and backward compatibility. Some of these tests will eventually be moved to the Certification Matrix.

Certification matrix (intensive tests)

Every night 25 series of tests are run on the test bed. 40% of them are functionality tests for various types of job submission, data registration, replication, copying, and removal. 60% are stress tests to assess the limits of the system. Today those tests are simple job storms, three types of data storms, a very intensive GridFTP storm, and matchmaking storms for two protocols (letting Resource Brokers match jobs to sites, with constraints on input or output data). The certification matrix is regularly extended to test new functionality and to stress it. We may include special tests created by the developers of the software or by the Integration team. A web page with the results of the tests is produced, as shown in the following figure (Figure 4: Web page produced by the certification matrix). The test matrix is not fixed, but evolves with the middleware.
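
In essence, the nightly matrix runs a named set of tests, records pass/fail and duration for each, and renders a results page like the one in Figure 4. A toy runner of that shape follows; the test registry and HTML layout are invented for illustration.

```python
# Toy nightly-matrix runner of the shape described above: run each
# registered test, record OK/FAIL and duration, emit a results page.
# The test registry and the HTML layout are invented for illustration.
import time

TESTS = {
    "DNS":      lambda: True,          # stand-ins for real test suites
    "JS_sleep": lambda: True,
    "MM_rfio":  lambda: False,
}

def run_matrix():
    results = []
    for name, test in TESTS.items():
        start = time.time()
        try:
            ok = test()
        except Exception:
            ok = False                  # a crashing test counts as a failure
        results.append((name, "OK" if ok else "FAIL",
                        int(time.time() - start)))
    return results

def to_html(results):
    rows = "".join(
        f"<tr><td>{n}</td><td>[{status}]</td><td>{secs} sec</td></tr>"
        for n, status, secs in results)
    return ("<table><tr><th>Test</th><th>Result</th><th>Duration</th></tr>"
            f"{rows}</table>")

print(to_html(run_matrix()))
```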


No   Title          Main log file          Result   Duration
 1   DNS            01_DNS-ReverseDNS      [OK]        1 sec
 2   US_script      02_UserStorm           [OK]      391 sec
 3   US_jdl         03_UserStorm           [OK]      326 sec
 4   FTP            04_GridFTP             [OK]      339 sec
 5   RMS_lcgcr      05_RMSetupTest         [FAIL]     55 sec
 6   RMS_All        06_RMSetupTest         [FAIL]     64 sec
 7   CEGate         07_GlobusGatekeeper    [FAIL]    999 sec
 8   CECycle        08_CECycle             [OK]      428 sec
 9   RB_val         09_PileStorm           [OK]      510 sec
10   CalStormR3     10_Sleep               [OK]     1009 sec
11   CalStormR0     11_Sleep               [OK]     1074 sec
12   JS_sleep       12_JobStorm            [OK]      619 sec
13   JS_multi       13_JobStorm            [OK]      592 sec
14   CStorm         14_CopyStorm           [OK]      517 sec
15   GS_All         15_GfalStorm           [FAIL]   1413 sec
16   GS_Castor      16_GfalStorm           [OK]      484 sec
17   KStorm         17_CheckStorm          [OK]      638 sec
18   DS_All         18_DataStorm           [FAIL]   1169 sec
19   DS_lcgcr       19_DataStorm           [FAIL]   1167 sec
20   DS_Castor      20_DataStorm           [OK]      603 sec
21   DS_DefSE       21_DataStorm           [FAIL]   1229 sec
22   MultiDMStorm   22_MultiDStorm         [FAIL]   2428 sec
23   HStorm         23_DavidStorm          [FAIL]   1094 sec
24   MM_gridftp     24_MatchMaking         [FAIL]     42 sec
25   MM_rfio        25_MatchMaking         [FAIL]     41 sec

(In the original web page each row also linked to detailed results and to the test documentation.)

Figure 4: Web page produced by the certification matrix


A brief description of each test is given in the following figure (Figure 5: Description of tests):

No   Name           Description
 1   DNS            Reverse DNS test for all CTB nodes
 2   US_script      User Storm with only a script provided
 3   US_jdl         User Storm with JDL and script provided
 4   FTP            Overall GridFTP test targeting all GridFTP servers (SE, CE, RB)
 5   RMS_lcgcr      Replica Management test (from the UI) with lcg-utils commands
 6   RMS_All        Replica Management test (from the UI) with edg-rm commands
 7   CEGate         Unit testing for all Gatekeepers (CEs) with direct globus-* commands
 8   CECycle        Unit testing for all available CEs in turn (with CE target forced)
 9   RB_val         Unit testing for RB validation with 5 functions (with RB target forced)
10   CalStormR3     Free storm with RetryCount=3 (short sleeper jobs)
11   CalStormR0     Free storm with RetryCount=0 (short sleeper jobs)
12   JS_sleep       Job Storm with short sleeper jobs (neither CE nor RB target selection)
13   JS_multi       Job Storm with short sleeper jobs with CE and RB target selection
14   CStorm         Copy Storm (from WNs, GridFTP test over 2 SEs at random)
15   GS_All         Gfal Storm: Gfal unit testing from WNs, over all SE types
16   GS_Castor      Gfal Storm: Gfal unit testing from WNs, with Castor SEs only
17   KStorm         Checksum Storm: sandbox transfers back and forth with checksums
18   DS_All         Data Management Storm: from WNs, edg-rm transfers towards 2 different SEs (for all types of SEs)
19   DS_lcgcr       Same as 18, but with lcg-utils commands
20   DS_Castor      Same as 18, but for Castor SE targets only
21   DS_DefSE       Same as 18, but testing Default Close SEs
22   MultiDMStorm   Data Management Storm, but with 50 files per job created over the target SEs
23   HStorm         Data Management Storm, but with all files sent back to the SE close to the UI
24   MM_gridftp     RB matchmaking test for the gridftp protocol
25   MM_rfio        Same as 24, but with the rfio protocol

Figure 5: Description of tests

Special tests for new functionality

These are newly developed tests for the new features in the release being prepared. Eventually, some of these tests, suitably reworked to conform to the rules and format, will become part of the Certification Matrix. The tests are prepared well before the release, as part of the CT team's work with the developers.

Tagging of the release candidate

When the certification matrix results confirm that we have reached the targets for a new release, we tag the middleware and configuration template versions that have just been certified on the test bed. A release is a set of LCFGng files listing all RPMs to be used, configuration file templates for LCFGng modules, and YAIM configuration modules and examples. On our website we create download pages for the RPMs.

VO-specific tests (on the VOs' Integration test bed)

The Experiments' Integration Support team (EIS) installs the candidate release on the VOs' integration test bed, thereby first testing the installation procedure itself. Next, they test their latest tools for the installation of VO software and invite the VOs to test the release candidate. The VOs certify that their own software runs with the new middleware, and report bugs that must be corrected before the new release is put on the production system. Only on the production system is the middleware subjected to fully realistic conditions, and any problems observed there are reported back to the Certification and Testing team for correction in future releases. This was originally an activity specific to LCG, but it was found to be so important and useful to the applications that in future this kind of support (and the associated facilities) must be extended to the other applications. This is partly addressed by the pre-production service.

5.2.6 Creation of the final release

After any corrections, the candidate release is given to the Infrastructure team, who remove all CERN-specific settings and packages and adapt the middleware to the VOs' requirements (VO settings). They then create manual installation scripts and documentation for LCFGng, YAIM and manual installations, and test the manual installation procedure.

5.2.7 Distribution process

All the middleware is distributed through CVS and the web. Our website contains:

• All the information required for deployment of the middleware;
• A release page providing the CVS commands to download all required files for each release;
• Download pages providing all required packages for all releases;
• Documentation pages providing information on how to use our tools in general (specific documentation for the middleware installation is in the tag given by the release page);
• Links to the CVS repository, to allow browsing of RPM lists, configuration templates, and source code;
• Links to the Savannah tracking system, so that bugs, patches and tasks can be reported;
• Links for grid users, VOs, and computing centre system administrators.

We provide a ready-to-use solution based on YAIM for centres without any fabric management tools. We also provide documentation for manual installation, permitting centres to adapt the installation to their local fabric management software.

5.3 EGEE SOFTWARE

5.3.1 Its origins

The software components in the EGEE production service come from a number of different sources, as described in §4, and thus are likely to be of different quality and may not operate together before integration.

Different sources

The grid software components have many sources:

• Globus and Condor from VDT (iVDGL);
• dCache from DESY/FNAL;
• the Workload Management System from EDG;
• the Data Management System from EDG;
• FTS from EGEE;
• LFC, DPM, GFAL and BDII from LCG;
• R-GMA from EGEE;
• the accounting subsystem from EGEE;
• the GLUE Information Schema from the GLUE community, as used in DataTag;
• the JRA1 activity of EGEE;
• the JRA3 activity of EGEE.

The list changes with time as more sources are added.

Different quality of components

It was very quickly found that the delivered software components, having such different origins, were not of the same quality, had not been tested to the same or a similar degree, and in some cases did not work together.

Difficulties in making components work together

Sometimes the implementations were not based on the same set of API definitions, so the relevant software modules could not talk to each other; or the implementations simply did not follow the same set of rules established among the different development partners, so that when all the modules were put together it was found that several versions of the same external software libraries were required for correct execution. Correcting this required patient debugging and close contact with the developers. Obviously, problems of this type introduce delays into the integration schedule. Whenever possible, the CT team developed fixes itself while waiting for the "official" fix.

5.3.2 Importance of user applications for testing

User participation is absolutely necessary, but not too early in the process. If the final destination for any software is public usage, then user participation in shaping the software is absolutely essential. It is extremely profitable for the quality of the released software to be able to run a representative set of user programs to prove the software's usefulness for general use. In a competitive environment such a set of user test jobs is called a benchmark. There is, however, an important consideration to be taken into account, namely the state of the software development.

Indeed, there is a huge difference between software in an (almost) mature state and software at the beginning of a development cycle, when reliability leaves a lot to be desired. In the latter case, allowing users access to the certification test bed too early would be disastrous, since it would completely disrupt the systematic software debugging: the user environment is by definition chaotic, and a chaotic environment does not make software debugging easier. In the end, the user impression would not be favourable either.

We judged it more profitable for the project to spend a little more time on systematic debugging using our artificial test suite before agreeing with the VOs to use their software in our debugging environment, under a well controlled organization and progressing slowly from simpler to more complex tasks. The benefit of this collaboration was visible precisely because the basic grid software had already been debugged to a level that prevented user jobs from discovering the hundreds of trivial errors that make users' lives so miserable.

In conclusion, it goes without saying that there is a very fine line to walk when deciding when to bring users onto a test bed running immature software. We believe that this careful approach has largely proved itself, to the benefit of EGEE.

5.3.3 User requirements impact on software development

Running user software as early as possible has one undeniable advantage: if one is lucky and the system can actually run the user software, it may expose profound deficiencies of design or implementation that cannot be revealed by an artificially constructed test suite. It may be argued that the role of the CT team is not to judge the architectural design of the software (the decisive part of such a judgment should have been made well before the software components reach the CT), but in practice this is unavoidable. Theoretically, discovering such problems as early as possible leaves more time for redesign and reimplementation. On the other hand, redesigns at this stage are always very costly in terms of time and human effort, and consequently such problems inevitably introduce possibly large delays into the release schedule. It may just as well be argued that it is better to get the software running as-is, with the obvious necessity of redesigning the faulty parts later and including the changes properly in the release scheduling for the next cycle.


5.4 THE BIRTH OF THE EGEE SOFTWARE

This section of the document deals with the history of how the EGEE middleware evolved into the system which is in production today. It is not necessary to know this information to understand this chapter; all of what is discussed here occurred before EGEE. However, we believe it is relevant because many lessons were learned during that time. The time period under discussion is most of 2003. Readers may skip this section if they wish.

5.4.1 LCG-0, learning the process

In order to bring all members of the CT team together on the ideas and practices of software certification and testing, we first released the LCG-0 version. This was not meant to be a public release by any means, though we behaved exactly as if it were. A major part of understanding what is involved in a public software release was insistence on the basics:

• Installation scripts must be fool-proof; they must allow a repeatable installation, both from scratch and as an upgrade;

• A release must be accompanied by a proper installation manual or guide, which must mention problems found at the last moment before the release that did not warrant a delay, together with possible help or workarounds on how to avoid those problems after installation. Such a release note must contain all the necessary information describing the target hardware, operating system, and other environmental details, to make clear to the installer under what conditions the software has been tested;

• Proper user documentation of the software must be included;

• At least a minimal set of tests must be included to allow the installer some basic verification of the installation.

The LCG-0 release was made on February 28, 2003 and contained the following modules of middleware:

• VDT 1.1.6 (Globus and Condor);
• EDG 1.4.3 (workload management services and data management services);
• the EDT GLUE schema and information providers.

This initial attempt to bring all CT members together was definitely a very good step in creating in everyone an overall awareness of the problems involved.

5.4.2 LCG-1, making the software useful, debugging

Starting around June 2003, with the first availability of the EDG software, we began work on the LCG-1 release. This was intended for public use, even if perhaps in a somewhat restricted fashion. At this point much of the software was relatively untested in real use. A list of the most critical bugs and issues was introduced, updated daily, and distributed to all developers and managers to keep them informed about the day-to-day state of the software testing, and to make clear that at this stage we were emphasizing bug fixing at the expense of putting in new features. We also started short daily meetings with some developers to advance the bug fixing as quickly as possible.

It has to be said that all these actions helped quite dramatically, and fixes were produced fairly quickly. Several releases of LCG-1 were made and it started to be used for real work.

The first version of LCG-1 was released on July 1, 2003 and contained the following modules:

• VDT 1.8;
• a minimal resource broker;
• the MDS information system with a number of modifications and fixes;
• the LRC;
• full support for the LCFGng installation method.

Soon afterwards, new versions of almost all components had been released by the development teams (about three months after the work on LCG-1 had started). Because they included all the fixes for the bugs we had found in LCG-1, together with many new features we knew the VOs required, we decided to stop working on LCG-1, to start working immediately on the next version, called LCG-2, and to put LCG-1 into purely minimal maintenance mode.

We established a careful roadmap for moving the CT test bed from LCG-1 to LCG-2 while keeping LCG-1 available, so that we could still react to user reports. For historical interest, the roadmap is included here in the following figure (Figure 6: Roadmap to LCG-2 – September 2003-December 2003). It consisted of careful planning of the various features, creating a small CT test bed for LCG-1 and a big CT test bed for the new LCG-2 work. The split brought the additional work of maintaining the small test bed but, surprisingly, in practice it did not cause any real problems.

Figure 6: Roadmap to LCG-2 – September 2003-December 2003

5.4.3 LCG-2, the deployment release

The work on LCG-2 started in September 2003. The CT team already had considerable experience with the work, so we set up a tight internal schedule to try to meet the release deadline, even though it looked quite difficult. To achieve this, it was important to establish good collaboration and feedback with the developers. That proved priceless: the debugging and problem fixing proceeded unexpectedly well and quickly.

We included selected users in our testing, which further enhanced the quality of the software, and the LCG-2 release was made on December 15, 2003 as an upgrade to LCG-1. The new component added was GFAL (the Grid File Access Library).

LCG-2 was not perfect, but it proved usable enough that VOs started believing it was worth their effort to invest in using it.

5.5 EVOLUTION OF THE RELEASE PROCESS

5.5.1 As software matures, the processes have to follow

As the grid software matured, we needed to adapt the certification process to it. There were several reasons for modifying the certification process:


1. The software installation was completely redesigned to be as generic as reasonably possible, which allowed much easier installation and support for several operating systems;

2. The grid software itself appeared to have fewer bugs than before; in particular, most of the basic bugs that prevented services and user jobs from running had been dealt with. Consequently, the testing and certification process described above became slightly lighter;

3. It became realistic to include users in the process earlier;

4. It became possible to include WAN sites in the certification process.

5.5.2 Adding the pre-production service

A new system, called the Pre-Production Service (PPS), has been included in the process. Its purpose is to allow faster exposure of users to a new software release. It provides a beta-testing environment for the next release and, because it includes WAN sites, it permits earlier exposure to network problems. The original CT test bed is, of course, still necessary. The new process now looks like this (Figure 7: EGEE Certification and Release Process):

Figure 7: EGEE Certification and Release Process

It should be noted, however, that at present the certification test bed is used to prepare the LCG-2 release for deployment, while the PPS runs the gLite software. The pre-production service is discussed in the section entitled Grid operations and support.

5.5.3 Deployment feedback becomes more important

The most important element in grid software development is the feedback obtained from its real-world deployment. It is real users, with their real programs and requirements, who determine the success or failure of the software. It is absolutely necessary to listen to what users have to say about a release, to correct what appear to be bugs, to discuss their suggested ideas, and to propose better solutions. Put simply, it is necessary to establish a permanent dialogue with the users.


5.6 LESSONS LEARNED WITH LCG

Many lessons were learned during the period before EGEE began, and they are documented in this section (§5.6). The section following (§5.7) deals with similar matters during the time of EGEE. It is sometimes necessary to learn things many times, as we repeat some of the errors of the past. The reason for the repetition is that there is a continuous supply of new people and new software components; the EGEE project does not live in a world with much control over the supply of either. These matters are documented in the belief that they are things to avoid.

5.6.1 Grid software development must be driven by its deployment

All software is developed for a particular purpose, and so the end usage determines where the most effort has to go: to algorithms, to performance, to features, to user requirements. This project has been defined as a deployment project. Consequently, it must be the requirements of deployment, together with user priorities, that drive the whole project, collect requirements, set priorities, and steer the development in the right direction. It cannot work differently; otherwise the final product will never reflect the VOs' needs.

5.6.2 The release process requires developers' cooperation and discipline

The software release process has its particular rules that are very well understood and strictly followed by any software development house (even if every company has its own variations). It is shown in the following figure (Figure 8: Typical software release cycle). Basically it means the following:

• The release date for the next release is determined by the project management at the end of the previous release cycle;

• The beta-release date is set to typically 2-3 weeks before the release date.

Figure 8: Typical software release cycle

It is absolutely necessary for the release manager to have total control over the activities in the release tree. The RM and the CT must be the sole authority controlling which fixes go into the release, based on the defined priorities.

The problem we encountered was that developers sometimes supplied the fixes required but added, within the same fix, other corrections and even new features! This proved disastrous for the progress of testing and certification of the new release, because the testing process ceased to be repeatable: new corrections broke things that worked before. The solution was to give the release manager the authority to control content. This is described in the following paragraph.

5.6.3 Software development and release is not a democratic process

It became evident during testing that the developers of the software were making changes to it while the certification team were testing it. CT decided to create its own CVS source repository and synchronize it with the official CVS repository. CT had to administer this CVS repository, controlling its write access and consequently which patches were introduced to the source code. This was a hard and unpopular decision, but it was in fact the most sensible and necessary solution to a recurrent problem.

At the time, the RM was the leader of the CT team, and the RM found it difficult to make this decision. There was a delay between October 2003 and April 2004 while the RM resolved various non-technical conflicts. In retrospect, the project would have benefited from this decision having been taken earlier. The lesson is to act decisively and as early as possible when a decision is necessary, even if it is not popular.


5.6.4 Too many languages used for software development lead to a maintenance nightmare

Software of this size and general use, which is required to be installable on many different operating systems, needs to be developed in as few languages as possible; possibly only C/C++ plus a scripting language should be needed. The preference for C/C++ and bash is a consequence of experience with deployed software: software written in these languages has proved more reliable in deployment than software written in others, because reliable implementations of both C/C++ and bash are available for all the operating environments expected for the middleware. Nowadays it is relatively easy to write portable software in C/C++, and good compilers exist on every modern OS, so porting to another OS is a fairly easy task.

The argument that development takes more time than with other languages (such as Java, Perl, etc.) is, of course, valid to a certain extent. However, one has to see the problem in its entirety: using other languages invariably brings in other software needed by those languages, which leads to new software dependencies and can dramatically increase the potential for bugs introduced by this third-party software. Most importantly, it adds complexity to the integration and testing process. Developers have to understand that it is perfectly all right to use their favourite language for proof-of-concept work, but it is not acceptable for production software with world-wide distribution requirements.

5.6.5 Software release process is not a glamorous activity

The software release process is hard, time-consuming and nerve-wracking, and not a glamorous activity, but it provides benefits to others. This is important to realize in order to keep the morale of the team high and to attract people to this kind of irreplaceable work.

5.6.6 The certification test bed is where the published software is made

The certification test bed is a very expensive, one-of-a-kind tool. It needs to be well maintained and well organized, because it is the place where many potentially conflicting activities take place, often under time pressure.

5.6.7 LAN is difficult but WAN is impossible at the beginning

When completely new software comes out of development, it is generally unstable and unreliable. It is highly advisable to have solid, well-known and proven procedures for testing, debugging and releasing, rather than new ones which may be sound in development but are very difficult to use in testing. This is even truer when the software is targeted at a highly networked environment. It has proved far more practical and valuable to debug it first in the local environment (LAN) and then move the testing to the remote environment (WAN), rather than attacking all problems at once.

5.6.8 Work with developers before they deliver the next version

Software development is a difficult process, and distributed software development is even more difficult. It requires strict discipline, obeying rules and meeting target dates. When the development process becomes too loose, the quality of the software delivered for certification is poor. This only increases the problems for the CT team, because the team suddenly has to play the role of unit testers, which is obviously the role of the developers. It then becomes impossible to keep the release dates, which is always detrimental to the acceptance of any software by users. And because the CT team is last in the chain of software making, the tendency (and the reality) of blaming the team for delays is very high. There is nothing better at destroying the morale of people working under stress, trying to make up the time lost in previous steps, than blaming them for delays. As a remedy, the release manager should spend enough informal time with the developers during all phases of development to get an accurate feeling for the state of the software.

5.6.9 Make sure users are satisfied from the beginning

It is very hard to rely only on "official" software when users require different features. Always make sure the users' point of view on different features has been sufficiently discussed and understood. The users' opinion may not necessarily be the best, but take the time to explain the alternatives. Do not put the users in the situation of facing a "fait accompli".


5.6.10 Always have a handful of highly qualified developers around

It is necessary and wise to have "at hand" a small but very competent group of developers for debugging and for developing ad-hoc software required by deployment. In rare but highly exposed cases, do not hesitate to engage those people in well-defined and well-targeted development. In this way, several important tools were provided to fill gaps in functionality.

5.6.11 Never accept a software component that does not comply with the agreed level of quality

The release process has to have some, even loosely defined, set of quality standards as a starting point for accepting a component. It proved very wise to be uncompromising and to refuse anything below that standard. Obviously, in many cases this needed strong project management support, owing to the political impact of such a refusal. We were very fortunate to have this.

5.6.12 Fast debugging feedback with quick software reinstallations is the key to success

Several things have to be in place to advance sufficiently quickly with the debugging of new software. The first essential ingredient is the developers' capability: that special, characteristic but indescribable debugger's mind that finds the root cause of a problem which other people may never be able to find. The next is the ability to describe the observed failure precisely to the developer, so that the developer can use the analysis to prepare a fix, or so that the correct fix can be devised and produced within the CT team. The next is to define a hopefully simple yet exhaustive set of tests to verify the correction and, more importantly, its possible side effects. Finally, it is necessary to be able to install the fix for testing. This is not necessarily simple, because it has to bypass the standard installation procedures, so ad-hoc scripts have to be written quickly, and they are for the most part not reusable. The test bed may also need to be quickly reconfigured to test the fix efficiently.

5.6.13 Release manager has to request strict discipline yet keep an open mind for changes

The release process is a very difficult task that has to absorb developers' delays, outside pressures and the project's own deadlines. It needs well-established rules, policies and habits, just as it needs a release manager with the open mind to modify them as the pressure for the next release rises. It has to fit into the established software development culture.

5.6.14 A software project needs a good bug tracking and process management tool

This often overlooked issue is quite an important one. Every project uses something, clearly; the quality of the tool, however, has a direct influence on project efficiency and people's initiative. A good tool with an easy-to-use graphical user interface (GUI), a well co-ordinated set of features and a simple automatic ticket processing system will certainly help people to be effective. In such a system, developers can browse new problems with much more interest, knowing that in a short time they are guaranteed a fairly complete picture of the reported problems and their interconnections.

5.6.15 Release manager is not a threat to anybody

The release process has to have a release manager who is empowered by the project to make decisions on the details and timing of the release process. This person, usually highly exposed to internal and external influences, is in fact the key to a quick, reliable and responsible software release process.

5.7 LESSONS LEARNED DURING EGEE

The lessons learned during EGEE cover experiences with both the LCG software stack and the gLite software stack.

5.7.1 Anticipate components

The arrival of a new service in the middleware stack places particular pressures on the release process. The components' dependencies, intended functionality and configuration must be understood by the staff involved. Appropriate tests must be designed and written. The earlier this process can start, the better.


5.7.2 Encourage good patch submissions

Patch submissions from developers should be a rich source of information on the middleware. Give developers the best opportunity possible to supply or reference all necessary documentation and dependencies here. A field to capture changes since the last submission is particularly useful.

5.7.3 Delay release dates if necessary

EGEE has suffered from delayed releases. This is inevitable if the release date is decided before the software has been fully evaluated; a short delay is more acceptable than a defective release. Announcing the date later in the process is preferable, even if this results in a release being held 'on ice' for a few days.

5.7.4 Capitalise on external expertise

Over the last two years the grid-wide expertise on EGEE has deepened considerably, and there is an appetite to contribute. The project funds deployment staff at many sites. These resources should be used! We have, for example, seen configuration components contributed and release testing performed by external sites.

5.7.5 Use updates

We have found strictly defining a release to be inflexible, and the ability to push out updates offers considerable advantages (as well as being industry practice). The end of the certification process is no longer a 'last chance' for developers to get non-critical fixes in, so delays are reduced. More importantly, a good update process enables us to be much more responsive to post-release developments.

5.7.6 Test updates on the test bed with the release

An area in the test bed must be reserved to run the current production release in order to test updates to it. Further, it would be useful to keep nodes running all supported historical releases in order to test interoperability between releases.

5.7.7 Pre-production is good for finding problems

Some classes of problems are very difficult to uncover in a controlled and limited test environment such as the certification test bed. Broadening the process to include other sites, with different deployment scenarios, infrastructure and assumptions, is effective at finding those problems which would otherwise only appear in production.

5.7.8 Documentation is a key component

Good documentation can drastically reduce the support load on the group. It enables administrators to educate themselves and gives them the material they need to educate others locally. That said, the best documentation in the world is of no use if it cannot be found easily; the management of the documentation is important too.


6 SECURITY

6.1 STATEMENT OF PROBLEM

The most common solution to controlling access to resources at a site is the password model, where the site and the user share a secret: the password. By proving knowledge of the secret, the user is able to prove his/her identity. Having thus authenticated to the site infrastructure, the associated user account is able to assert privileges such as compute and storage quotas or read/write access to files. These privileges, or authorizations, are set by the local administrators. In the grid environment, the user has neither direct control over which site the tasks are sent to, nor any prior relationship with the target site. In such cases, especially with a large and dynamic community of users and resources, the shared-secret model of authentication and local account-based authorization cannot function.

Any solution to the problems posed by the grid must also take account of the increasingly hostile network environment experienced by any user or resource connected to the public internet. Password-based authentication is, in many cases, considered too weak compared to the value of the resources accessed or the potential liabilities incurred by unauthorized use.

Today's scientists often participate in several overlapping and geographically distributed collaborations. The grid model was designed to accommodate this working pattern, and appropriate authentication and authorization infrastructures must be carefully designed to allow efficient use of resources in this environment. Note must also be taken of the need to monitor and account for users' access to and use of remote resources, whilst adhering to site, national and trans-national rules regarding the storage and distribution of personal data.

6.2 SOLUTION IN USE WITH EGEE

In conformance with best security practice, authentication and authorization in EGEE are separated. Authentication is based on the Globus [SEC 1] Grid Security Infrastructure [SEC 2] using X.509 PKI [SEC 3] digital certificates issued to users by a network of trusted third parties: Certification Authorities (CAs). For Java-based services, equivalent authentication is provided through the EGEE Java trust manager [SEC 4].

Authorization decisions are based on policies derived both from the Virtual Organisation and locally at the resources. Further details of the EGEE security architecture beyond the following discussion are available in the project Security Architecture document [SEC 5] and in other JRA3 activity documents [SEC 6].

The use of X.509 certificates for authentication allows a single identity token (the certificate) to be used to authenticate to any grid project deploying the same authentication infrastructure: the so-called "single sign-on". This potentially reduces the problem of users having to manage multiple identity tokens, such as one for each collaboration in which they participate.
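To make the identity token concrete, the sketch below shows how a user might inspect the certificate that carries their grid identity using the standard openssl tool. The file location follows the common Globus convention, and the subject and issuer shown are hypothetical; nothing here is EGEE-specific.

    # Inspect the user certificate acting as the grid identity token.
    # ~/.globus/usercert.pem is the conventional Globus location (assumption).
    openssl x509 -in ~/.globus/usercert.pem -noout -subject -issuer -dates

    # Hypothetical output: the subject line is the certificate identity
    # (the unique text string used in the authorization decisions below).
    #   subject= /C=CH/O=Example-CA/OU=Institute/CN=Jane Grid-User
    #   issuer= /C=CH/O=Example-CA/CN=Example Certification Authority
    #   notBefore=Jan  1 12:00:00 2005 GMT
    #   notAfter=Jan  1 12:00:00 2006 GMT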

6.2.1 Certification Authorities and PMAs

Technical details of a PKI infrastructure are beyond the scope of this document. However, in understanding deployment issues the reader must be aware that the critical requirement, namely that a network service (resource) can properly identify a user (or other service) connecting from a remote network location, depends on a number of things:

• the performance of a proper identification process by the CA during the certificate issuing process;

• the distribution to all participating services of the certificates belonging to accepted CAs. (These certificates are sometimes referred to as the 'roots of trust' since the trust placed in the authenticity of the information in the certificates digitally signed by the CA certificate is the basis of all authentication and authorization decisions in the infrastructure.);

• the private keys associated with certificates are securely managed by all participating entities: the CAs themselves, users and service administrators;

Page 54: EGEE-DSA1 7-Cookbook-1-489462-v2 0 0 · 2018. 11. 13. · Abstract: This document is the deliverable DSA1.7 for SA1 Operations Activity. This deliverable is the Infrastructure Planning

EGEE-DSA1_7-Cookbook-1-489462-v2_0_0.doc

I N F R A S T R U C T U R E P L A N N I N G G U I D E

Date: 16/12/2005

INFSO-RI-508833 © Members of EGEE collaboration PUBLIC 54 / 135

• the timely availability of information from the CAs regarding the invalidity of any certificates whose private keys may have been compromised.

EGEE distributes CA certificates taken from the repository of the EUGridPMA [SEC 7]. This Policy Management Authority (PMA) is a grouping of CAs which maintains and publishes a set of minimum standards to which CA members of the PMA conform. Each CA must publish policy documents which describe how the CA meets the standards set by the PMA. These policies are peer reviewed by PMA members, and this ensures that a common standard of trust is established for EGEE.

An alternative to using a so-called 'classic' CA to issue certificates to individuals is to base the certificate issuing process on a local site authentication system, or Site Integrated Proxy Services (SIPS). One such system is the Kerberized Certification Authority (KCA) [SEC 8]. The KCA is a service which issues X.509 certificates based on Kerberos credentials acquired as part of the site login process. This has been deployed as part of the Fermilab PKI project [SEC 9], and credentials from this source are also accepted by the EGEE infrastructure. A SIPS avoids the need to duplicate existing organisational identification processes where these already exist, and is generally used to issue short-lived credentials, thereby avoiding the exposure of long-lifetime credentials.

6.2.2 Virtual Organisations (VOs)

Since policies regarding the use of resources are generally set on a per-VO basis, authorization decisions (generally of the form "who can do what") are based on a user's membership of a VO. In the EGEE infrastructure, each VO runs an authorization information service (the actual management of the service may be delegated). This service provides information related to the VO's membership. EGEE is currently in transition between a somewhat trivial authorization model, based on a simple list of the certificate identities (the uniquely assigned text string known as the certificate Subject) of VO members, and a more sophisticated and flexible approach based on assertions as to the group membership and roles a user has within the VO. In the list-based approach, each grid service periodically downloads the full membership list provided by a VO LDAP service; local site configurations provide mappings to local accounts, thereby enforcing authorization policy (a sketch of this mechanism is given below). Simple coarse-grained grouping of users into privileged and non-privileged users is enforced by the convention of two lists per VO.
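As an illustration of the list-based approach, the sketch below shows the style of configuration consumed by a grid-mapfile generation tool of the EDG/LCG era such as edg-mkgridmap, which periodically pulls VO membership lists over LDAP and maps them to local accounts. The hostname, DN and exact directive syntax are hypothetical and varied between releases; treat this as a sketch of the mechanism, not a reference configuration.

    # edg-mkgridmap.conf (illustrative): each "group" line names a VO LDAP
    # URL from which the membership list is fetched, and the local account
    # (here a pool account, indicated by the leading dot) to map members to.
    group ldap://grid-vo.example.org/ou=People,o=myvo,dc=example,dc=org  .myvo
    # A locally maintained file can be merged in for site-specific entries.
    gmf_local /etc/grid-security/grid-mapfile-local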

In the more flexible approach to authorization, a Virtual Organisation Membership Service (VOMS) no longer simply provides lists of users but uses a certificate issued to the VOMS service to digitally sign an attribute certificate which asserts the groups and roles the user has within the VO. This attribute certificate travels with the user's grid jobs and is used by target services to determine how authorization policies should be applied.

6.2.3 Policy framework – trust

To take account of the fact that the grid user may have no direct contact with the site whose resources he or she is using, suitable operating policies for the grid must be made available to provide a contractual relationship between the various grid parties. These policies are negotiated through the LCG/EGEE Joint Security Policy Group [SEC 10], which manages the policy framework illustrated below (Figure 9: Policy Framework):


Figure 9: Policy Framework

The documents in the policy framework above (available at [SEC 11]) provide policy control over how users and sites enter the grid and how information provided by the user must be treated.

For both the LDAP-based and VOMS services, the grid resource providers are placing trust in the process by which a user's certificate identity is entered into the VO. The identity of the VO service is authenticated using certificates as described above, and EGEE, through the Joint Security Policy Group, has established policies and processes for the registration of users, to which all participating VOs and services must conform. These provide for the gathering, appropriate treatment and release of users' personal data, and the registration process requires users to agree to a Grid Acceptable Use Policy [SEC 12]. Together, the Acceptable Use, User Registration and Site Registration policies provide a trust framework within which authorization decisions can be executed.

6.2.4 User registration process

Detailed instructions on registration and use of the grid are available in the User Guide [SEC 13], but the general process is as described below.

1. The user must first obtain a digital certificate from one of the accepted issuing authorities as described above. Information on how to contact the authorities, and on their registration processes, is available through the EUGridPMA website [SEC 7]. The certification process ensures that the user is properly identified and that the certificate identity issued is unique to that individual.

2. After loading the certificate into their web browser, the user is able to access the VO registration interface. Here they must provide personal information as required by the User Registration Requirements policy. It is at this interface that the user must also agree to follow the Acceptable Usage Rules.

3. To verify that the user has provided a valid email address, the user's registration request is temporarily staged until the user visits a unique URL sent in an email to the address provided. This URL is only accessible when authenticated with the user's certificate thereby ensuring that the person in possession of the certificate is also the owner of the contact mail address.


4. Following the verification of the user’s email address, the request to join the VO is passed to the VO manager. If the VO manager is satisfied that the user is able to join the VO then the user’s certificate identity is entered into the VO database along with any VO-group and role assignments appropriate for the particular applicant.

For a small VO the administrative burden placed on a VO manager by the user registration process is relatively slight. However, for VOs such as the large high energy physics experiment collaborations of the LCG [SEC 14], with thousands of members around the globe, VO management is a significant undertaking. These VOs have existing membership processes and databases, and EGEE/LCG, with development provided as part of the VO Management Registration Service (VOMRS) project at Fermilab [SEC 15], are providing a VO management interface which links to the existing organisational databases (e.g. the VO membership databases at CERN). In addition, VOMS and VOMRS provide the facility for the VO manager to delegate the management of groups of individuals within the VO, a facility that can map to the organisational structure of large collaborations.

6.2.5 Access to resources

Having completed the registration process described above, the user is ready to submit work to the grid. This requires that the user's software environment is suitably configured, which is generally accomplished by logging on to a User Interface (UI) node.

Whilst the user's certificate is essential to authenticate to resources, it is not used directly but rather is used to derive a chain of short-lifetime certificates, or proxies [SEC 16]. The first proxy certificate in the chain is digitally signed by the user's long-lifetime certificate private key and so proves that the user was present when it was generated. However, each proxy credential has an unprotected private key (i.e. no passphrase is needed to 'unlock' it) and can thus be used by grid service software, without the presence of the user, as a delegation credential to act on behalf of the user. For example, following the process of job submission:

1. The user creates a proxy credential using the command grid-proxy-init. This creates a new certificate derived from the user's identity, but with a short lifetime and an unprotected private key.

2. As part of the protocol used to submit the user's job to the Workload Management Service (WMS) from the UI, a new proxy credential derived from the original is created on the WMS side. This enables the WMS to act on behalf of the user in passing the job to a Compute Element (CE) service, also by means of delegation.

In this way, the proxy credential chain, started by the user's long-term credential, is used to authenticate an action on behalf of the user at a remote grid service.

The proxy credential presented to the resource on behalf of the user can be used to authenticate the requested action, but it does not provide authorization information. As mentioned above, mapping lists of VO members to local resource accounts provides one means of enforcing basic authorization decisions. More flexible authorization schemes, allowing users to acquire group memberships and roles within a VO, are provided by the VOMS service. In this case the user authenticates to the appropriate VO VOMS server by using the command voms-proxy-init and presenting a normal proxy credential. If the user is a VO member, he or she will then receive an attribute certificate, signed by the VOMS service certificate, asserting the VO groups to which the member belongs and, if properly entitled, the roles requested. This attribute certificate is bound into a proxy certificate which can then be used for delegation as described previously. A sketch of both commands follows.
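The following sketch illustrates the two commands named above as a user might run them on a UI node. The VO name and group/role string are hypothetical, and option syntax varied between middleware releases, so this should be read as an illustration of the mechanism rather than exact reference usage.

    # Create a short-lived proxy from the long-term credential
    # (the user is prompted for the private-key passphrase).
    grid-proxy-init

    # Inspect the proxy: subject, type and remaining lifetime.
    grid-proxy-info

    # For VOMS-based authorization, contact the VO's VOMS server instead.
    # "myvo" and the group/role are hypothetical; the server returns an
    # attribute certificate which is bound into the new proxy.
    voms-proxy-init -voms myvo:/myvo/analysis/Role=production

    # Show the proxy together with the VOMS attributes it carries.
    voms-proxy-info -all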

6.2.6 Authentication and authorization control points

The sections above provide an overview of how a user is properly identified, how that user is entered into the VO, how grid resources authenticate a request made by a service as delegated by the user, and how the proper authorization is provided for the actions in the request. Each of these stages provides control points where invalid requests can be denied. In summary:

Page 57: EGEE-DSA1 7-Cookbook-1-489462-v2 0 0 · 2018. 11. 13. · Abstract: This document is the deliverable DSA1.7 for SA1 Operations Activity. This deliverable is the Infrastructure Planning

EGEE-DSA1_7-Cookbook-1-489462-v2_0_0.doc

I N F R A S T R U C T U R E P L A N N I N G G U I D E

Date: 16/12/2005

INFSO-RI-508833 © Members of EGEE collaboration PUBLIC 57 / 135

• The operating standards applied to user identification defined by the EUGridPMA ensure that all users have presented appropriate proof of identity (usually a government identity document such as a passport or other official photo identification) to the CA Registration Authority. Failure to show possession of the necessary identity document will result in rejection of the certificate request.

• The User Registration Requirements aim to ensure that the VO is provided with sufficient information, in addition to the certificate credential, to properly identify the user before membership of the VO is granted and to contact the user should it be necessary. Failure to provide the necessary contact information will result in rejection of the VO membership request.

• When submitting work to the grid the user must be in possession of a valid certificate from one of the accepted authorities. Credentials which have expired will result in authentication failure. In addition, all CAs must publish lists of certificates which have been deemed to be invalid, possibly due to private key compromise, and these certificate revocation lists (CRLs) are regularly downloaded by resources. User or host authentication requests originating from such a revoked certificate will be rejected.

• If authentication is successful, the grid software then implements a local credential authorization (LCAS) step, where the incoming credential identity is compared to a 'blacklist' policy set of identities. This allows local site administrators to deny access to a given set of identities in cases of emergency, to protect the site, or where it is either not possible or inappropriate to revoke the user's credential.

• Mapping the user's credential identity to a local account, either by a static mapping or by using VOMS attributes bound in the proxy, provides control over the authorization of actions initiated by the user. Should the user request an action for which the mapped local account is unauthorized (e.g. write access to a storage area or excessive use of CPU quota), the action will fail. A sketch of such a mapping is given after this list.
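By way of illustration, the sketch below shows the style of entry found in a grid-mapfile, the classic mechanism for mapping certificate identities to local accounts. The subjects and account names are hypothetical; the leading-dot convention for pool accounts follows the account leasing described in §6.3.2.

    # /etc/grid-security/grid-mapfile (illustrative entries).
    # A static mapping: one certificate subject to one fixed local account.
    "/C=CH/O=Example-CA/OU=Institute/CN=Jane Grid-User"  jdoe
    # A pool-account mapping: the leading dot means each new grid identity
    # with this mapping is leased the next free account from the pool
    # (e.g. myvo001, myvo002, ...).
    "/C=CH/O=Example-CA/OU=Institute/CN=John Poweruser"  .myvo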

6.2.7 Use of credential stores – myproxy

The use of proxy certificate chains for authentication allows delegated actions to take place without the presence of the user or the exposure of a long-term credential private key on the network. The short lifetime of the proxy reduces the exposure due to compromise, since all credentials in the chain will expire in a relatively short time. However, in many situations the user's job must be queued until a resource slot becomes available. This indeterminate time lag introduces a problem: should the time in the queue exceed the proxy lifetime, the eventual authentication would fail. One solution would be simply to create proxy credentials with a lifetime exceeding the expected maximum queue time; however, this clearly defeats the purpose of using limited lifetimes to reduce exposure to credential compromise. The solution adopted by EGEE is to deploy secure credential storage services (myproxy servers [SEC 17]) in which the user is able to store a relatively long-lifetime proxy (still only of the order of days, compared to the certificate lifetime of up to a year). Grid services which queue tasks are configured to contact the relevant myproxy service to acquire new proxy credentials on behalf of the user before the queued credential expires. Myproxy services are configured to allow only known grid services (i.e. those authenticating with a given service certificate identity) and those with a currently valid proxy to renew credentials. Services seeking to renew expiring VOMS attribute credentials must also contact the VOMS service. A usage sketch follows.
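A minimal sketch of how a user might stage a renewable credential before submitting long-queued work. The server name is hypothetical, and the options shown reflect common myproxy client usage of the period, which may differ in detail from any given deployment.

    # Store a medium-lifetime proxy (of the order of days) in the
    # credential store, so that trusted services can renew the short
    # proxies attached to queued jobs.
    myproxy-init -s myproxy.example.org -d -n
    # -s : the myproxy server to contact (hypothetical hostname)
    # -d : use the certificate subject (DN) as the stored credential's username
    # -n : no retrieval passphrase, so that authorized services can renew
    #      the credential without user interaction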

6.2.8 Site registration process

Just as resource providers need to trust the processes described above for the proper authentication and authorization of grid users, grid users and VOs must be able to trust the infrastructure on which their data are stored and processed. To achieve this, the EGEE site registration process [SEC 18] is designed to introduce resources to the production grid only once the necessary contact, support and security management details are logged and the site configuration is properly certified. This reduces instabilities and the possibility of security vulnerabilities being introduced by poorly configured resources, and is centred on a site database at the Grid Operations Centre (GOCDB) [SEC 19]. Database entries for new sites must be created by a supporting Regional Operations Centre (ROC) and, after contact details are logged by the candidate site manager and verified by the ROC, the certification process moves the site from 'candidate' status ultimately into 'production'.


6.3 LESSONS LEARNED

6.3.1 Scaling the trust domain

Whilst the deployed authentication and authorization infrastructures use, with the exception of the VOMS service, relatively mature technologies, establishing a trust framework of global scope takes a long time. Proposed policies must be accepted by all collaboration members, and an appropriate body must be designated to own each policy and give it force. Delays due to political issues can have serious effects, resulting in uneven or broken trust domains. Any planning for wide-scale grid deployment must address these issues at an early stage.

EGEE, through the EUGridPMA, was aware that the inclusion of too many CAs, for example on a per-site basis, would lead to an unmanageable trust domain. It therefore opted for an approximately per-nation granularity. However, given the scale of the project and the overlap with other projects such as LCG and OSG, even this granularity is difficult to manage. As a consequence, EGEE, through the EUGridPMA, is participating in the formation of an 'umbrella' PMA, the International Grid Trust Federation, which will allow standards and policy management to propagate between regional PMAs for Europe, Asia and the Americas.

Operationally, EGEE decoupled the distribution of CA root certificates from the middleware software distribution at an early stage. This allowed more timely distribution of the rapidly growing number of CAs without placing strain on the deployment cycle. Configuration of CAs at a resource is relatively trivial but, for security reasons, care is needed to ensure that the deployment is correct. Similarly, the availability of certificate revocation information, used to identify compromised certificates, is an essential part of the trust infrastructure, and monitoring of the status of the publishing points provided by the CAs must be planned in order to detect stale data resulting from network attacks against a CA. A sketch of the revocation-list refresh is given below.
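As an illustration of keeping revocation data fresh, the sketch below schedules a CRL download utility from cron. fetch-crl is the utility commonly used for this purpose in EGEE/LCG deployments, but the installation path and schedule shown here are assumptions, not prescribed values.

    # /etc/crontab entry (illustrative): refresh CRLs every six hours.
    # Stale CRLs eventually cause authentication failures, so failures
    # of this job should themselves be monitored.
    0 */6 * * *  root  /usr/sbin/fetch-crl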

Despite the large-scale deployment of CAs as membership of the EUGridPMA has expanded, the global nature of large experimental collaborations means that at times there are inevitably individuals not covered by any of the existing CAs. To meet the needs of these people, EGEE and LCG have created two so-called catch-all CAs [SEC 20, SEC 21], which operate under a policy that allows the issuance of certificates under these special circumstances. The use of catch-all CAs allows users and sites to commence grid activities whilst a more appropriate CA infrastructure is put in place to serve their needs, and avoids the necessity of creating and accrediting CAs which serve only a very small community.

6.3.2 Authorization control

Whilst services which are fully 'grid-aware' allow authorization decisions to be enforced entirely on the certificate identity, legacy systems layered under the grid services, such as local file systems, accounts and quota enforcement, require that the user's task be executed under a local account mapping. To avoid the requirement for static mappings, the current EGEE middleware allows a form of account leasing where one of a pool of pre-created local accounts is allocated to each new grid identity appearing at the resource. Whilst relatively efficient and simple, this form of account mapping for legacy services has operational consequences. Firstly, the pool accounts must be pre-created in sufficient number. Secondly, periodic release and recycling of pool accounts is difficult to achieve securely, in order to prevent any possible leakage of information from the former to the present owner; worse still, given a compromised grid account, subtle attacks are possible which deliberately pervert the authorization system. Thirdly, the flexibility afforded by VOMS-based authorization by group and role within the VO is not matched by the simple user and group assignments available under Linux, currently the dominant operating system in use on the grid. As a result, user, group and role assignment and allocation at both the VO service and the resources must be restricted.

In addition to the consequences at the resource of deploying VOMS-based authorization described above, the VO service itself inevitably becomes a more complex component when compared to the simple LDAP-based directory services initially deployed. The VO Management Service components must be subjected to rigorous certification testing, and significant expertise must be acquired by both fabric operators and VO managers when configuring and using the service. As a consequence, the aggregation of a number of VOs onto a single, or relatively few, centrally-managed service instances with appropriate service-level support is expected to be the preferred deployment model, and training for VO management is anticipated.


Much of the discussion above is primarily relevant to the use of processor (CPU) cycles, for which the mapping to a local account is a prerequisite. Whilst the same is true for the use of locally attached storage, this has reduced relevance in the grid scenario, where the bulk of storage is accessed through grid file access services. In this case, local copies are generally downloaded from, and saved to, the service on demand during job execution. This model allows the file service to implement grid credential-based authentication and authorization schemes.

6.3.3 Proxy renewal complexity

The requirement to deploy proxy renewal services, whilst designed to increase security and convenience for the user, also increases complexity. Renewing services must know where to contact for renewal, and credential stores must know whom to trust. With large numbers of services involved, this can scale to a non-trivial configuration management task. Also, the inclusion of two extra services in the task chain (myproxy and VOMS) adds two more potential points of failure and hence reduces overall reliability. As a consequence, these services must be deployed in a manner that minimises critical service down-time.

6.3.4 Privacy, legal issues and accounting

As mentioned above, a grid user will generally have no direct relationship with a site providing resources to the grid. Whilst the Grid Acceptable Usage Rules allow for the appropriate treatment of a user's registration data and require the user to accept the distribution of this data for security and operational purposes, other data associated with the user must be gathered to enable the effective operation of the grid. Accounting data, i.e. the quantified usage of resources, is necessary to provide grid operators and VOs with an accurate picture of resource allocation and consumption. This data should be available in a variety of projections, such as by site, by VO and by user, and in combinations of each. Providing VOs with accounting information at the per-user level is necessary to monitor and control the fair and authorized use of resources allocated to the VO by a site. However, in some countries and under the rules of some sites, such per-user usage pattern data is treated as personal or private information and cannot be released by the site. A number of possible solutions to this problem are currently being considered, including the strengthening of information release clauses in policy, the use of anonymising credentials, and the explicit release of the data by the user.

6.3.5 Operational security issues

Much of the discussion above is aimed at the security aspects of authentication and authorization and the necessary trust relationships between grid participants. These are aimed at knowing who is using resources, ensuring that the resources are used appropriately, and ensuring that this usage is properly accounted for. However, no infrastructure of this nature can be made totally secure whilst remaining usable, and the risks from attack and subsequent misuse cannot be reduced to zero. A risk analysis appropriate to grid deployment, aimed at understanding and prioritizing these risks, was performed by the Joint Security Policy Group [SEC 22]. When security incidents do happen, it is critical to have processes in place to manage the containment of damage and the safe restoration of service. The JSPG has agreed an Incident Response Handling Guide [SEC 23], and the implementation and planning for these events is coordinated by an Operational Security Coordination Team consisting of security contacts from each ROC.

6.4 REFERENCES

SEC 1 Globus Alliance http://www.globus.org

SEC 2 Globus Toolkit Security (GSI) http://www-unix.globus.org/toolkit/security/

SEC 3 Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile http://www.ietf.org/rfc/rfc3280.txt


SEC 4 Trust Manager: certificate validator for Java services http://hep-project-grid-scg.web.cern.ch/hep-project-grid-scg/trustmanager.html

SEC 5 EGEE Global Security Architecture https://edms.cern.ch/document/487004

SEC 6 JRA3 Security Documentation http://egee-jra3.web.cern.ch/egee-jra3/index.html

SEC 7 The European Policy Management Authority for Grid Authentication in e-Science http://www.eugridpma.org

SEC 8 Kerberos Leveraged PKI http://www.citi.umich.edu/projects/kerb_pki/

SEC 9 Fermilab PKI http://security.fnal.gov/pki/

SEC 10 LCG/EGEE Joint Security Policy Group http://cern.ch/proj-lcg-security

SEC 11 JSPG document repository https://edms.cern.ch/cedar/plsql/navigation.tree?cookie=4134107&p_top_id=1763291383&p_top_type=P&p_open_id=1412060393&p_open_type=P

SEC 12 Grid Acceptable Usage Rules https://edms.cern.ch/document/428036/1

SEC 13 LCG User Guide https://edms.cern.ch/file/454439//LCG-2-UserGuide.html

SEC 14 LHC Computing Grid Project (LCG) http://cern.ch/lcg

SEC 15 Virtual Organisation Management Registration Service (VOMRS) http://computing.fnal.gov/docs/products/vomrs/

SEC 16 Internet X.509 Public Key Infrastructure (PKI) Proxy Certificate Profile http://www.ietf.org/rfc/rfc3820.txt

SEC 17 Myproxy Credential Management Service http://grid.ncsa.uiuc.edu/myproxy/

SEC 18 EGEE site registration process https://edms.cern.ch/document/503198

SEC 19 Grid Operations Centre - Database http://goc.grid-support.ac.uk/gridsite/operations/

SEC 20 EGEE “catch-all” CA, CNRS GRID-FR CA https://igc.services.cnrs.fr/GRID-FR/english

SEC 21 LCG “catch-all” CA, CERN http://lcg.web.cern.ch/LCG/catch-all-ca/

SEC 22 Risk Analysis - Joint Security Policy Group http://proj-lcg-security.web.cern.ch/proj-lcg-security/RiskAnalysis/risk.html


SEC 23 Incident Response Handling Guide https://edms.cern.ch/document/428035


7 GRID OPERATIONS AND SUPPORT

Grid operations and support is the term used to describe the work carried out to ensure that the grid is operating, that there are processes in place to monitor it, and that it remains in a usable state. The word support here refers specifically to the support of grid operations; it does not refer to support in a more general sense, such as user support. User support is covered in §8.

This chapter includes sections on the following topics:

Table 2: Guide to the parts of this chapter

§7.1 EGEE grid operations: describes the hierarchy of organisations and people in place to operate the grid.

§7.2 Operations of the grid: provides details of the operation of the OMC [§7.2.1], PPS [§7.2.2], CIC [§7.2.3], COD [§7.2.4], ROC [§§7.2.5, 7.2.6], GOC [§7.2.7] and SFT [§7.2.8]. For each, the sections describe the teams, people, equipment, documentation and other appropriate information.

§7.3 Lessons learned: provides details of the lessons learned in operating these organisations, in the same order as §7.2.

§7.4 Ways forward to EGEE-II and beyond: provides information on the future evolution of the PPS [§7.4.1] and the COD [§7.4.2].

§7.5 References: provides the references from the text.

7.1 EGEE GRID OPERATIONS

In this chapter, the hierarchy of management components for the grid is described, followed by details of the experience of operating a number of these components.

Within the Enabling Grids for E-SciencE project (EGEE) structure there are several different organizations currently defined to provide grid operations and user support. These are:

• Operations Management Centre (OMC);
• Core Infrastructure Centres (CICs);
• Regional Operations Centres (ROCs);
• Global Grid User Support (GGUS).

The Operations Management Centre (OMC) is located at CERN and provides co-ordination for the EGEE grid operation. In addition to management co-ordination, it also co-ordinates the


middleware distributions: their integration, certification and documentation as releases, and the deployment of those distributions. It provides support for problems found in the middleware, both directly through a small team of expert analysts, and as a co-ordination point with the middleware developers and the projects that supply the software.

The Core Infrastructure Centre (CIC) is a single distributed organisation with two roles. The first is to run essential core grid services such as database and catalogue services, VO management services, information services, general-usage resource brokers, etc.; in addition, the centre provides resource and usage monitoring and accounting. The second role is to act as the front line for grid operators and to manage the day-to-day grid operation. Here the partners in the CIC take a week at a time as the primary grid operator, the responsibility being handed between the centres in rotation. The responsibilities include active monitoring of the grid infrastructure and the resource centres, and taking the appropriate action to avoid or recover from problems. Part of the responsibility includes the development and evolution of tools to manage this activity. The CIC must also ensure that recovery procedures for critical services are in place.

There is a tight coupling between the CICs, including the shared operational responsibility and close communication between the operations teams. Each CIC manager reports to the CIC co-ordinator at CERN. There is a weekly operations meeting where the current operational issues are discussed and addressed, and where the hand-over between the on-duty CICs takes place. This meeting also ensures that issues and problems in operations are reported back to the middleware developers, deployment teams and applications groups, where necessary and as appropriate.

CIC centres are currently operational at CERN, RAL (UK), CNAF (Italy), CCIN2P3-Lyon (France) and MSU in Russia. In addition, ASCC-Taipei provides resources to the operations monitoring activities and expects to also participate in the grid operations shifts in the fourth quarter of 2005.

The Regional Operations Centres (ROCs) provide the front-line support in each of the geographical regions of the EGEE project. In regions that also participate in the CIC, these functions overlap within the same organisation. In other regions, the ROCs are distributed activities with staff in several physical locations. The roles of the ROCs include:

• Co-ordinating the deployment of the middleware releases within the region;

• Providing first-level support to resolve operational problems at sites in the region. The ROC must have the necessary expertise and information on the operational state in order to diagnose whether a problem originates in the operation of the site, in a service, or in the middleware itself;

• Providing support to the sites in the region in all aspects of the grid operation, including providing training to the staff in the sites;

• Taking ownership of problems within the region, ensuring that they are addressed and resolved, and referring them to the CICs or OMC for second-level support where needed;

• Providing first-line support to users to resolve problems arising from the operation of the services in the region or within the middleware, involving the VO support teams where necessary;

• Negotiating agreed levels of service and support with the sites in the region, and monitoring the sites to ensure delivery of those levels of service.

The ROC co-ordinator is responsible for ensuring coherent and consistent behaviour of the several ROCs, and reports to the OMC.

The Global Grid User Support centre (GGUS) is currently located at Forschungszentrum Karlsruhe (FZK), with additional resources provided by ASCC-Taipei. This centre provides a central portal for user documentation and support, giving access to a problem ticketing system. The system ensures that problem tickets are dispatched to the appropriate support teams and that problems are addressed and resolved in a timely manner, according to an advertised policy. Although this provides a single point of entry for support for a user, the support teams are


many and widely distributed, and cover all aspects of the operations, services, and the middleware. It is expected that each VO also provides application support personnel who can address problems determined to be application-specific. §8 of this document deals with user support in more detail and so it is generally not addressed further in this chapter.

In building this grid operations infrastructure it was clear that a hierarchical support structure was needed to support the grid sites, since with a large number of sites a central organization would not scale. The ROCs form the core of this hierarchy, with the CIC as second-level support and operational oversight, and the OMC at CERN as the co-ordinating and management centre.

The operations infrastructure described above is that built up during the first year (2004–2005) of the EGEE project. In preparation for the second phase of EGEE, anticipated to run for two years beginning in spring 2006, the distinction between the ROCs and CICs is expected to become less pronounced. Since the sites that operate a CIC are also Regional Operations Centres, with the operations and support teams shared between both sets of roles and responsibilities, the distinction is in any case somewhat blurred. CERN is an exception to this; although CERN operates a CIC it does not formally have a ROC, but it acts as the ultimate support centre for all unresolved issues and for sites that are not part of EGEE but nevertheless participate in the same grid infrastructure. What seems reasonable for the longer term is to have a hierarchy of Regional Operations Centres, some of which take additional responsibility for operations oversight and management, and some of which run some of the essential core grid services according to an agreed service level. It is essential to maintain the hierarchical structure of the operations activity to ensure that no single support centre has to manage more than a reasonable number of sites. In this way the operation of the grid can scale in a manageable and predictable way to a much larger number of sites.
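The scaling argument behind the hierarchy can be made concrete with a small back-of-the-envelope calculation. The fan-out numbers below are purely illustrative assumptions, not EGEE figures:

# Illustrative only: why a hierarchy scales where a flat structure does not.
# The fan-out values are hypothetical assumptions, not EGEE figures.
SITES_PER_ROC = 20        # assumed manageable number of sites per support centre
ROCS_PER_OVERSIGHT = 10   # assumed manageable number of ROCs per oversight centre

total_sites = SITES_PER_ROC * ROCS_PER_OVERSIGHT
print(f"One oversight layer supports {total_sites} sites")            # 200

# Adding a further layer multiplies capacity again without increasing
# the number of entities that any single centre must manage directly.
print(f"Two oversight layers support {total_sites * ROCS_PER_OVERSIGHT} sites")  # 2000

The key property is that each centre, at every level, deals only with its own bounded fan-out, so growth in the total number of sites never concentrates load on a single support centre.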

7.2 OPERATIONS OF THE GRID

7.2.1 Operation of the Operations Management Centre (OMC)

Teams

The execution plan for the SA1 activity of EGEE [OMC 1] provides details of the plan for CERN in EGEE. The role of CERN was to oversee all of the activity and in particular to implement the OMC and the CICs. It therefore has the following teams:

• a management team;
• an OMC team;
• a CIC team.

The management team consists of the following officers:

• an activity manager;
• a planning officer;
• an OMC-ROC coordinator;
• an OMC-CIC coordinator;
• a production service manager;
• a pre-production service manager;
• a security officer.

The OMC has groups to implement the following:

• middleware certification – see §5;
• middleware support – see §8;
• deployment – see §5 and §9;
• pre-production service – see below;
• operational security – see §6.


In order to meet these requirements, a team of 34 people was assembled, of whom 8 were funded by EGEE. The experience level of the individuals varied from very experienced, such as the activity manager, to some who had recently graduated. During the course of the project, the more experienced members of the team have in general remained in post and in role, while the less experienced people have to a greater extent either left their post or changed their role. The management team consists of 7 people; the other 27 are in the OMC/CIC teams.

Documents

An execution plan was written at the beginning of the project [OMC 1]. This was a required deliverable of the project and it has proven to be a useful document as a reference for what was intended at the outset. The activity has written a number of other deliverables as required in the plan. These deliverables provide a reference point for the operation of many parts of the grid at the time of the writing of the deliverable. They often make use of informal documents which are written to respond to needs during operations.

The activity has written a quarterly report as required in the plan. These reports provide a useful source of information on the grid over time.

Many informal documents have been written as required to describe details of the operation of the centre. These are stored in a document repository which is located on the internet. The content of the document repository is maintained by the OMC. The document repository is available at the following location [OMC 2]: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=./&

The documentation repository typically contains a library of several hundred current documents, as well as being an archive for obsolete versions of many documents, or simply versions of obsolete documents. Documents are typically published in PDF format, but may also be made available in original source form as appropriate. A comprehensive search engine is available to help readers locate documents.

7.2.2 Operation of the pre-production service (PPS)

The pre-production service (PPS) is a small grid infrastructure which makes available to a broad user base all of the certified gLite middleware components. The term small is relative to the production service.

Teams

There are 14 sites involved in the PPS:

• ASGC (Asia-Pacific);
• CYFRONET (CE);
• IN2P3 (France);
• FZK (Germany);
• CNAF (Italy);
• NIKHEF (NE);
• University of Athens (SEE);
• University of Macedonia (SEE);
• University of Patras (SEE);
• LIP (SWE);
• CESGA (SWE);
• IFIC (SWE);
• PIC (SWE);
• CERN.

The co-ordination of the PPS is currently carried out by CERN. There is also some basic site and service monitoring, using the Site Functional Test suite (SFT). This is implemented at CERN and is used to determine the state of all sites within the PPS. Wherever possible, the service infrastructure of the EGEE production service is re-used. For example, the production service VOMS servers are incorporated into the PPS, and so the user authentication processes of the production service are also implicitly incorporated into the PPS. As another example, site security within the PPS is managed by the security management team of the ROC in which the PPS site sits.

Each pre-production site typically has a team of 1-3 people. The size of the team depends on the size of the site, the number of core services run by the site and the fraction of time that the individuals have available for supporting and maintaining the site. Each of the site administrators generally reports to their ROC manager, who is ultimately responsible for the pre-production sites within the ROC.

At the time of writing this document, the main user groups running jobs on the PPS are:

• the HEP VOs (LHCb, ATLAS, Alice, CMS);
• BioMed VO;
• DILIGENT;
• ARDA.

In addition to the VOs supporting these groups, there are a small number of informal VOs which are used for testing. They have names such as picard and crusher.

People

The degree of experience possessed by each of the site administrators varies from those with a reasonable knowledge of system administration but little detailed knowledge of grid technologies, to those who are highly conversant with both general site administration and also with the specifics of grid middleware and grid site management. However, as the PPS has developed and matured it has become clear that a drive to succeed and a willingness to learn are much more important attributes than experience of prior grid middleware.

The work profile of the site administrators within the PPS is somewhat different to that in the production service. In the PPS, the administrators are required to upgrade the middleware on a much more frequent basis and also have to install and configure completely new services more often than would occur in the production service. It also falls to the site administrators to be the first-line support for the users of the PPS, as well as acting as a "self-help" group and providing support to each other within the PPS. As the gLite middleware that is deployed on the PPS is relatively immature, the site administrators are also heavily involved with the developers in debugging the middleware. In addition to the above, the team members at each site also carry out the tasks of cluster installation and deployment as well as day-to-day troubleshooting.

The PPS sites are supported by the JRA1 development team. There is a mailing list ([email protected]) which has been set up to allow easy communication between the PPS site administrators and the JRA1 development teams (as well as other parties interested in gLite).

Equipment

All of the sites contribute computing resources to the PPS. Many of the sites also provide core grid infrastructure services; for example, Workload Management Servers (WMS), the File Transfer Service (FTS) and VOMS servers. The sizes of the sites, as measured by computing resources, vary from small (2 WNs) to large (over 1,000 WNs). As might be expected, the majority of the sites are at the lower end of the scale (2-5 dedicated WNs), while three of the sites (CERN, CNAF and PIC) have provided access to their production WN farms, allowing each of them to offer in excess of 150 WNs.

To provide a fully working grid, the PPS also uses the LCG-2 BDII information system, the LCG-2 MyProxy service, the DPM and Castor storage elements (SE).


Documentation

All aspects of setting up and operating the PPS are documented in the PPS wiki pages, which are hosted by the CESGA PPS site. The PPS wiki pages have two access points: a public access point and a restricted access point.

The public address of the wiki site is: https://wiki.egee.cesga.es

The private address of the wiki is: https://pps-public-wiki.egee.cesga.es

The site includes the following types of information:

• the process for new users to obtain access to the PPS;
• the process for a new site to join the PPS;
• the process for VOs to be established on the PPS;
• a current list of the sites involved in the PPS;
• details of the services run by each PPS site;
• geographical maps of the PPS sites;
• example site configurations;
• troubleshooting guides.

Starting up the PPS

As the PPS was set up to be entirely stand-alone from the production system, a strategy was devised for "bootstrapping" the service into existence. This involved separating the start-up phase of the PPS into two stages. In the first stage, the grid core services (WMS, R-GMA schema, FTS, VOMS server, catalogue, MyProxy server, UI under AFS) were established and their details published. In the second stage, the PPS sites were set up and commissioned using these core services.

In the first stage, a request was sent out to the prospective PPS sites for volunteers to run the grid core services. As most of the core services can have multiple instances, it was decided that for each core service there would be one site responsible for running the primary instance of that service, while all other instances of the service would be considered secondary. The site hosting the primary instance is responsible for ensuring a high level of availability for the service.
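The primary/secondary model described above can be sketched as a simple selection rule. The site names and service keys in this sketch are hypothetical, not the actual PPS assignments:

# A minimal sketch of the primary/secondary core-service model described
# above. Site names and service keys are hypothetical illustrations.
CORE_SERVICES = {
    "WMS":     {"primary": "CERN", "secondary": ["CNAF", "PIC"]},
    "VOMS":    {"primary": "CNAF", "secondary": ["CERN"]},
    "MyProxy": {"primary": "CERN", "secondary": []},
}

def select_instance(service, site_up):
    """Return the site to use: the primary if available, otherwise the
    first available secondary instance."""
    entry = CORE_SERVICES[service]
    for site in [entry["primary"]] + entry["secondary"]:
        if site_up(site):
            return site
    raise RuntimeError(f"No instance of {service} is currently available")

# Example: pretend only CNAF is reachable.
print(select_instance("WMS", site_up=lambda s: s == "CNAF"))  # -> CNAF

The design choice is that the site hosting the primary instance carries the availability obligation, while secondaries exist so that a consumer can always fall back without central intervention.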

7.2.3 Operation of the Core Infrastructure Centre (CIC)

The main role of the operations team is to take care of operational problems at Resource Centres (sites) and Core Infrastructure Centres (CICs - core services), and of grid-wide problems.

The CIC has five groups to implement the following:

• provide a CA – see §6;
• provide VO registration services – see §6;
• provide grid monitoring – see §7.2.4;
• grid services – see §7.2.5 and §7.2.6;
• provide user support – see §8.

7.2.4 Operation of the CIC On Duty (COD)

The COD is responsible for detecting problems, coordinating their diagnosis, and starting the follow-up procedure (the so-called escalation procedure) to track issues and document the solutions. This has to be done in coordination with the Regional Operations Centres to allow for a hierarchical approach.


People

The COD is operated by rotating responsibility between five teams located in the ROCs. Each of the teams has the following people:

• a manager for the operations of the COD at the ROC;
• a number of people who are trained and available to carry out the tasks of the COD operator.

The manager for the COD operations is generally a senior person in the ROC and generally reports to the ROC manager. The proportion of time the manager spends on COD management is roughly the same as the proportion of time for which the ROC is responsible for COD operations. At present, there are five ROCs participating in COD operations, so the manager spends about 20% of his or her time on this task. This time is generally spent during the weeks before, during and after the ROC's turn to do COD work.

The COD operators are generally less experienced persons in the ROC and generally report to the COD manager. It is recommended that each ROC have three people doing this work, which ensures that two are available during any week when the ROC has COD responsibility. The proportion of time which a COD operator spends on COD work is similar to that of the COD manager, so at present this is around 20%.

Documentation

The primary document which is used by the COD is the COD Operations Manual [SFT 1]. This is maintained by the CIC, but with the participation of the COD operators. There are also a number of other documents providing details of procedures to be followed by people involved with COD. These include a document describing the registration process for COD operators. The CIC portal provides information which is necessary for the smooth operations of the COD including publishing the schedules of duty.

Equipment

The COD operator can work from almost any computer connected to the network in a suitable way. In general, preparing for COD operations does not require any special arrangements.

The infrastructure of the COD

In creating the COD, the CIC built on existing systems where they were suitable, created new systems, or adapted existing ones to meet their needs.

The description given here is of course a description of an infrastructure which is specific to EGEE. Nevertheless, there are some general principles. The following things are necessary for reliable operations:

• active monitoring of sites and services;
• a database of geographical information;
• tools which collect, store and display grid status information.

The operational infrastructure of the EGEE grid is dependent on the following services:

• the Global Grid User Support portal - http://ggus.org;
• the CIC portal - http://cic.in2p3.fr;
• the Grid Operations Centre portal - http://goc.grid-support.ac.uk;
• the Site Functional Tests portal - http://lcgtestzonereports.web.cern.ch/lcgtestzonereports/cgibin/listreports.cgi;
• the CERN mail list server portal - http://listboxservices.web.cern.ch/listboxservices/.

The CIC portal was the only part of this infrastructure which was created especially for the COD. The reason for its creation was to provide a single location from which the COD operator could work without having to know the details of the operations of many of the other systems. For example, the COD operator can send e-mails to many locations from a single interface, and it is the function of the CIC portal to deal with the various e-mail addresses and other details of the mail interface. In particular, some of the mail addresses require access privileges which are provided by the portal, not by the operator.

The Grid Operations Centre (GOC) [§7.2.7] was in development before the COD started, but added functionality to meet the operational needs of the COD. The following services of the GOC are of importance to the COD:

• the GOC-DB;
• the GIIS monitor;
• the GOC job monitor;
• the GOC certificate monitor;
• the Wiki for documentation.

The SFT was under development before the COD started, and the COD added regular monitoring and follow-up as a routine service. The SFT is configured to be run by members of the Dteam virtual organisation.

The GGUS system was under development before the COD started but added functionality to work in partnership with the COD.

The CERN mail system did not require any modifications to support the COD. It did of course have to be populated with e-mail addresses and other information to tell it what to do.

Tools in daily use by the COD operator

The tools which the COD operator uses are the following:

• A digital certificate which is recognized by the following services:
o GGUS Portal;
o CIC Portal;
o GOC;
o The Dteam VO.

• A CERN userid which is recognized by the following mailing lists:
o follow up list;
o cic on duty list.

• The GOC GIIS monitor;
• The GOC Data Base;
• The GOC job monitor;
• The GOC certificate monitor;
• The operational documentation;
• The GOC problem FAQs;
• The e-mail templates.

Page 71: EGEE-DSA1 7-Cookbook-1-489462-v2 0 0 · 2018. 11. 13. · Abstract: This document is the deliverable DSA1.7 for SA1 Operations Activity. This deliverable is the Infrastructure Planning

EGEE-DSA1_7-Cookbook-1-489462-v2_0_0.doc

I N F R A S T R U C T U R E P L A N N I N G G U I D E

Date: 16/12/2005

INFSO-RI-508833 © Members of EGEE collaboration PUBLIC 71 / 135

Ten rules for COD operators

The following 10 hierarchical rules for problem identification and resolution are followed by the COD (a sketch of the prioritization in rule 2 follows the list).

1. Actively monitor the results of the SFTs;
2. If there is more than one error situation in the SFT, prioritize action as follows:
• deal with a failure in a replica location server;
• deal with a new site (a site that is new to the grid gets priority over more experienced sites);
• deal with the others using experience.
3. For each site with an SFT problem, look at its status on the GIIS monitor;
4. Check the date and time of the failure;
5. Look at GGUS to see if errors at the site have been reported since the failure;
6. Look at the follow up list to see if an error at the site has been reported;
7. Mail the cic on duty mailing list to ask if a colleague has additional information;
8. Once the problem is identified, provide a solution. If the operator does not know the solution, check the FAQ on the Wiki;
9. If the operator still does not know the solution to the problem, then do not guess. Ask for help from colleagues;
10. When both the problem and its solution are identified, log the problem in GGUS.
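The prioritization in rule 2 amounts to a simple ordering over the outstanding failures. The following sketch expresses that ordering; the data structures are hypothetical, only the ranking mirrors the rules:

# A sketch of the COD prioritization from rules 1-2 above. The failure
# records are hypothetical; the ordering mirrors the stated rules.
def priority(failure):
    """Lower value = handle first."""
    if failure.get("service") == "replica location server":
        return 0   # rule 2, first bullet
    if failure.get("site_is_new"):
        return 1   # rule 2, second bullet: new sites get priority
    return 2       # everything else: handled using experience

failures = [
    {"site": "site-a", "service": "CE", "site_is_new": False},
    {"site": "site-b", "service": "replica location server"},
    {"site": "site-c", "service": "SE", "site_is_new": True},
]

for f in sorted(failures, key=priority):
    print(f["site"])   # prints: site-b, site-c, site-a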

7.2.5 Operation of a ROC participating in CIC

There are currently five CICs which include a ROC: CERN, FR, I, UKI, RU. CERN is a special case in that it is a "catch-all" ROC for sites outside Europe and Asia. Another special case is Taiwan, which is the only non-European ROC (for Asia), and which has contributed a lot to the CIC functions. This part of the document does not deal with those special cases.

While every ROC has the same tasks with respect to EGEE, the CICs have agreed to specialize on different aspects. Their common task is essentially to have a global view of operations, and to ensure that there are fallback solutions for core services and overall procedural problems. It must be mentioned here that this is a contractual obligation of the CICs, not an option as it is for any ROC. Beyond the classical measures to increase availability, such as automatic failover between machines on the same site, RAID disks and so on, they also must provide solutions for outages of complete sites or regions, for example due to network problems. They do this by finding a backup site for critical functions. This site may be another CIC or its associated ROC, or any other viable solution. The bottom line here is that all CICs must know what the other ones have done, and what to do when a failure situation arises.

The Operations Advisory Group (OAG) is an organisation within EGEE composed of members of both NA4 and SA1. One of its roles is to negotiate resources on behalf of VOs. Because of the obligation of the CICs to supply essential grid services, they are the first contact point for the OAG to provide basic VO support, like RLS or equivalent, RB, VOMS and so on.

Whereas the initial operational model allowed for a separation of CICs and ROCs, the initial implementation was always that of a joint CIC/ROC. Most of the technical infrastructure for a ROC is also used by the corresponding CIC, and the core of the ROC team is found in the same place as the core of the CIC team. The ROC, in turn, is normally located in the premises of a major resource centre of the region, for example a tier-1 centre for the LCG.

The CICs rely on various ROCs for different core services. Only when a particular service, especially one for a particular user group or VO, is not available elsewhere does the CIC provide that service itself. As far as possible this is delegated to the associated ROC.



There are genuine CIC services which cannot be found in ROCs. Where these are of a hardware or software nature, they are installed, maintained, and operated under supervision of the CIC but may be controlled by the ROC and/or a specific resource centre. If they are services associated with people, like CIC-on-duty, they are staffed and controlled directly by the CIC.

The CICs rely especially on the on-duty or on-call staff of the ROCs or associated RCs to deal with local technical problems with the grid machinery. No CIC is associated with a ROC which does not have a 24x7 on-duty or at least on-call service. A short explanation of the terms as used here:

• on-duty: a person permanently assigned to the work, and present;
• on-call: a person permanently assigned to the work, who can be alerted by some means if she is not present.

In addition, there are agreed time limits for the reaction of the on-call person.

A typical ROC with a CIC is basically structured the same way as one without, except when the availability requirements for the CIC's grid machines and services are higher than those of the ROC. In that case, the CIC staff is also integrated into the ROC's on-duty staff. Quite often the distinction between CIC staff and ROC staff cannot be made easily, since mostly a person works for both.

This is a potential problem especially for the CIC-on-duty service. The persons providing it do so on a full-time basis, but at the moment only for one week out of five. The CIC and the ROC must have a firm agreement about the assignment of the corresponding people.

Typically, the CIC manager is also the ROC manager of the region. Sometimes the distinction is more visible at the deputy level: one deputy for the ROC work, another for the CIC work. The CIC manager and deputy maintain contact with the other CIC managers, the deputy quite often also with the technical CIC staff of the other CICs. Beyond this, their work resembles the management work of a ROC manager.

The CIC with responsibility for the grid's registry database for the sites, the GOCDB, has a lot of direct interactions with site managers, in addition to those with the representatives of other SA1 bodies. This CIC also provides overall usage statistics and central accounting data collection.

The CIC with responsibility for the operations portal - the CIC portal - also keeps a repository of various related information, such as CIC-on-duty staff membership and data about VOs (management and support contacts, resource requirements, and so on). In addition, and in accordance with the Technical Annex of EGEE, it provides a "catch-all" certification authority for all non-HEP VOs who cannot find any other CA.

The third CIC actively cooperates with the others in the overall organization and development of services, especially for accounting and for the integration of user support and monitoring tools.

The Russian CIC, which joined during 2005, helped a lot in developing the integration procedures for adding a region to the CIC-on-duty service.

The teams of the CICs provide specialization and include:

• the GOCDB development and maintenance team;
• the web development team for the operations portal;
• the CIC-on-duty staff;
• tool developers for the site functional tests (some are in non-CIC regions).


In addition, and beyond their role as "regional service providers" for the CIC as already mentioned, several persons from the ROC teams work "part time" for the CIC. The teams involved vary with the CIC's specialization. Some of them, with their responsibilities:

• deployment and support team, pre-production team: development and maintenance of interfaces to the grid's information system for the operations portal; development and maintenance of some SFTs (Site Functional Tests);
• monitoring and accounting team: development and maintenance of some SFTs;
• helpdesk and user support team: interface for CIC-on-duty requirements to GGUS; interface to VO managers for the operations portal;
• CA team: has some additional staffing because of the catch-all CA commitments;
• core services team: establishment of inter-region failover procedures;
• portal team: mainly working for the operations portal where applicable.

It must be mentioned that all ROC staff are available with their expertise to the CIC, and vice versa.

Underlying administrative support comes mainly from the resource centre where the ROC is located or from regional project management.

People

From the preceding it is obvious that the number of people involved and their qualifications depend on the specialization of the CIC and the type and size of the ROC. In the general case, the CIC/ROC manager functions are combined, whereas the corresponding deputies are specialized. All do their job full time, and all have solid technical and project management or team leader experience, though not necessarily on grid issues. The CIC/ROC manager generally has long experience in international cooperation, especially with CERN.

At least the CIC/ROC manager is in direct and frequent contact with the PMB member of the region, and also has good contacts with somebody in the senior management of LCG in the region, if she herself isn't already involved in representing the region within LCG. This is important for reaching a global view of the project and to avoid sticking to regional particularities.

The specific profiles for the ROC team members can be found in a later section. The additional involvement in CIC work implies particular language skills for most of the CICs and all team members, a good capacity to integrate into international working groups, and a higher than usual autonomy and capability to communicate.

While the preceding sentences are true for all ROC teams, the capacities mentioned become even more important at all levels of the genuine CIC staff. Technical experience is comparatively less important, as long as the education level is fairly high and the ROC has direct access to technically versatile people in the region, or better, in the associated resource centre.

Documentation

In addition to the ROC's documentation, of which the contents and methods are described elsewhere, the CICs have to document their specific parts. This is done according to their local rules and/or web portals. Information covering global CIC aspects is either documented in deliverables, or centrally by the OMC, or again pointed to by the operations portal.

Equipment

The CICs use additional equipment according to their specific roles. These include:

• dedicated servers for portals;
• VO servers such as VOMS;
• RLS;
• RB, if not provided by one of the ROCs or resource centres;


• database services to support the above services.

The members of staff of the CIC are heavily involved in conferences, and so standard desktop equipment is insufficient. Laptops are more useful, and sufficient phone-conferencing equipment and connectivity is crucial for their day-to-day work.

7.2.6 Operation of a ROC not participating in CIC

There are five ROCs which do not include a CIC: CE, NE, SWE, SEE, and D-CH. Although not officially CICs, these ROCs support the majority of CIC-level services, apart from CIC-on-Duty.

The description here captures the operations of the above ROCs on a general level. The actual operations vary from ROC to ROC based on geographical coverage and other regional features.

A typical ROC has a number of teams structured either per country or per functional grouping. Where a number of countries are involved, the functional groupings are replicated per country if needed.

Every ROC provides for central coordination through the ROC manager and a deputy. The ROC manager and the deputy carry out the ROC administrative and technical coordination, including technical strategy and planning. They are also official ROC representatives in SA1 bodies and act as the communications and reporting interface between the ROC and other EGEE operational structures. If a number of countries are involved, ROC branch managers and deputies are provided to deal with administrative and technical coordination specific to a country.

A number of teams are active within the ROC. In most of the cases, these include the following:

• deployment and support team;
• monitoring and accounting team;
• pre-production team;
• helpdesk team;
• CA team;
• security team;
• core services team;
• user support team;
• portal team.

The deployment and operation support team acts as the front line for deployment, operational and user support, providing geographical coverage over a number of countries if necessary. This team supports cluster installation and deployment, upgrades, day-to-day troubleshooting and operational support as well as local user support. A dedicated subset of this team deals with new cluster deployment, providing deployment support to new clusters via support within the regional test-zone, using a set of monitoring tools.

Some ROCs provide a certification/installation test bed. Of the above ROCs, SEE provides this service for EGEE and the region. A dedicated release team is responsible for this activity, whereby a new release is installed on a test cluster and feedback is provided to the OMC deployment team. On the basis of the hands-on experience with the pre-release, this team provides further support to the deployment and operations teams during installation.

The deployment and operational support team makes extensive use of local monitoring tools for different aspects of monitoring. ROCs use various local monitoring tools such as GridIce, on-demand local SFTs, Ganglia, MonaLisa, SmokePing, and others. Typically, a dedicated monitoring team is responsible for the deployment and maintenance of monitoring tools in the ROC. Typically, the same team takes on the responsibility for accounting coordination in the region.


The pre-production team is typically dedicated to deployment and operational support within the pre-production service.

The flow of support information, both operational tickets and user problems, is typically handled through a regional helpdesk which interfaces with GGUS for the exchange of relevant operational (COD or local monitoring) tickets and global and local user tickets. The helpdesk team typically involves an implementation team dealing with the helpdesk deployment, maintenance and interfacing with GGUS, and a team dealing with the organisation of the regional ROC support teams within the helpdesk, and day-to-day operational monitoring and tracking of tickets. The above ROCs also very actively participate in providing the GGUS-level Support-on-Duty or Ticket Processing Managers, responsible for user ticket monitoring and the correct channelling of these tickets.

The ROCs provide per-country CAs managed by dedicated CA teams, and some regions such as SEE also provide catch-all CAs for those countries which do not yet have EUGRIDPMA-accredited CAs.

Another aspect of ROC security management is the ROC operational security team, headed by the ROC security contact, who is responsible for contributing to EGEE-wide security policies through relevant EGEE bodies, as well as managing ROC security operations, which relate to site security, incident response, security challenge coordination, etc. Some ROCs provide specialized security services such as YUMIT.

All of the above ROCs run core services such as VO management services, file catalogues, resource brokers and information services both for the infrastructure and for particular regional and EGEE-wide VOs. Typically, a dedicated and specialized “core service team” subset of the operational support team provides support for these services.

Typically, every ROC has a dedicated user support team dealing with user support issues, as well as general user support coordination in the region and user information and training from the operational point of view.

Every ROC provides a dedicated EGEE ROC-specific portal, managed by the dedicated ROC portal team, which maintains the site. The ROC site integrates all the relevant regional information, including specific detailed documentation and guidelines for operators and users, and provides a single point of access to regional ROC and global monitoring tools, the regional helpdesk, core services pointers, etc. In regions covering multiple languages, each country typically supports a country-level portal providing information in the local language – for both users and operators.

People

The actual structure and size of the teams vary based on the ROC size and type. Typically, every ROC provides a full-time ROC manager and deputy and related administrative support; the ROC manager and deputy positions are held by people with solid technical management skills and across-the-board technical and administrative insight.

Where country ROC representatives are needed, these positions are held by senior people who can provide technical and administrative coordination within the country and can coordinate the specific details of the diverse local resource centres and user base.

Every ROC provides a dedicated security, user support and helpdesk administrator contact, and dedicated CA teams whose size depends on the number of CAs in the region. The security personnel are equipped with security-specific skills, the user support coordinator is required to have in-depth knowledge of the technology as well as relevant communication skills (similar skills are required from members of the user support team), and the helpdesk administrator has specific knowledge of the trouble-ticket (TT) technologies and related processes.

The deployment and support team is composed of a group of experienced grid operators, where again the size and structure depend on the size and structure of the ROC and are in direct relationship with the number of countries and sites. The most experienced members of this team form the core of the core services team, while people with strong development and software skills are engaged in pre-production activities.


Documentation

The ROC documentation is typically available through the regional portal. The ROC portal integrates all the relevant regional information, including specific detailed documentation and guidelines for operators, covering all the steps of site deployment from candidate registration through installation and deployment to run-time operations. Guidelines are also provided for the use of CAs, core services, monitoring tools, the helpdesk, etc. User guidelines covering the beginner level through to advanced production-level users are provided through the portal, with the relevant instructions on what to use and how to use it.

In regions covering multiple languages, each country typically supports a country-level portal providing information in the local language – for both users and operators.

Equipment

A number of dedicated servers are needed to run the grid operations and management components such as monitors, helpdesk, portals, etc.

Dedicated clusters are needed for installation test bed and pre-production.

7.2.7 Grid Operations Centre (GOC)

The Grid Operations Centre (GOC) is a facility provided to the EGEE infrastructure by two of the ROCs. The purpose of this section of the document is to provide an overview of the facilities provided by the GOC. It is not intended to be complete documentation. References are provided to documents which provide more detail on the facilities. This section addresses the following:

• the accounting service;
• the database;
• the certificate monitor;
• the service status monitor;
• the implementation of R-GMA;
• the maps service.

In time, as the grid grows, more ROCs may become involved in the GOC work. At present, the GOC is supported by RAL in the UK and SINICA in Taiwan. The GOC provides a set of services which are commonly used in the operations of the infrastructure of the grid.


The entry point for the GOC on the web can be found at the following reference [GOC 1] and is shown in the following figure (Figure 10: Front Page for the GOC).

Figure 10: Front Page for the GOC

7.2.7.1 Accounting

The term "accounting" has two distinct meanings. One means to keep track of the use of resources by individuals or groups, and the other means to be able to trace and hold an individual or group accountable for the use of a resource, typically following a security incident. This section addresses the first of these two meanings.

In any computing environment where expensive resources are provided for use by one or more communities of users, accounting is important. In a homogeneous and well-defined environment, software is usually available which can perform this task on a routine basis. In the case of the grid, the resources are not homogeneous, and this makes accounting a difficult problem: both the hardware and the software are inhomogeneous. Additionally, the privacy of the information about the work being performed by the users of resources raises legal issues. A complete discussion of legal issues will not be attempted in this section. Instead, this section describes the facilities in daily use on the EGEE infrastructure to monitor the use of the resources.

It is essential that usage of the compute and storage resources provided by the collaborating sites be accounted for and reported to the stakeholders, who are the virtual organisations, the funding agencies, and the EGEE Project Board. Each site must be able to demonstrate that it is providing the resources that it has committed to provide to the project, to the virtual organisations, and that it has been done in a way consistent with the agreed shares. This should be done site-by-site, country-by-country, and project-wide. It is also important that each virtual organisation be able to understand what resources it has consumed and who has used those resources.


In December 2004, the activity introduced the grid accounting system into service. This is described in the document which is associated with this work [ACC 1]. Since its introduction, the number of accounting records has grown steadily, and now stands at five million. The accounting system collects records from resources on the grid, and provides aggregated views of the data (a sketch of such an aggregation follows the list) by:

• VO;
• Country;
• ROC;
• Arbitrary time intervals;
• Graphical views;
• Tabular views.
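The kind of aggregation these views provide can be illustrated with a query over a toy record table. The table and column names below are hypothetical; the real repository schema is not reproduced here:

# Illustrative only: aggregating usage records per VO over a time
# interval, using an in-memory table with hypothetical columns.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE job_record
              (vo TEXT, country TEXT, roc TEXT, end_date TEXT,
               cpu_hours REAL, njobs INTEGER)""")
db.execute("INSERT INTO job_record VALUES ('biomed','FR','FR','2005-10-03',12.5,4)")
db.execute("INSERT INTO job_record VALUES ('atlas','UK','UKI','2005-10-04',40.0,9)")

# Jobs and CPU time per VO over an arbitrary time interval; grouping by
# country or ROC instead gives the other views listed above.
for row in db.execute("""SELECT vo, SUM(njobs), SUM(cpu_hours)
                         FROM job_record
                         WHERE end_date BETWEEN '2005-10-01' AND '2005-10-31'
                         GROUP BY vo"""):
    print(row)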

The following figure (Figure 11: Accounting Flow Diagram) shows the flow of information from the point of collection to the repository to the GOC.

Figure 11: Accounting Flow Diagram


The following figure (Figure 12: Front page for accounting reports) shows the front page for accounting.

Figure 12: Front page for accounting reports

The following figure (Figure 13: Page for selecting report) shows the page for selecting a view of the accounting data.

Figure 13: Page for selecting report


The following figure (Figure 14: Report on the number of jobs for Biomed) shows a view of the accounting data for the biomed VO during 2005.

Figure 14: Report on the number of jobs for Biomed

It is possible to use the accounting information which is stored at the GOC in other applications. For example, the SW federation of EGEE has made its own portal where the accounting information for the resources in that federation is available. The other applications do not have to be part of EGEE, but may be projects which make use of the EGEE infrastructure. Two examples are the LCG project based at CERN and the GridPP project based in the UK. The web sites are available at the references [ACC 2], [ACC 3] and [ACC 4]. The GOC provides support to projects using the data in this way.

7.2.7.2 GOC-Database

The purpose of the GOC database is to store information on resource centres on the grid, the machines available and the support staff who maintain them. It is used by many monitoring frameworks including gstat and SFT, and also by the CIC on duty staff to see when sites have scheduled downtimes.

The principal point of entry for accessing the GOC database is given in the following reference [GOCDB 1].

The GOCDB system uses a MySQL database with a secure web interface, accessible to registered users identified by digital certificates. It incorporates a role-based authentication system that allows various levels of permissions, from editing site information up to suspending entire sites from the grid. The security system for the database is implemented using a software component called GridSite. Further information on MySQL and GridSite is available in the following references [GOCDB 2 and GOCDB 3].

For information on the GOCDB schema or for an account to the MySQL database please contact Matt Thorpe ([email protected]).
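As a purely illustrative sketch, the query below shows how a monitoring tool might retrieve scheduled downtimes from a GOCDB-like MySQL database. The host, credentials, table and column names here are all hypothetical; the real schema must be obtained as described above:

# A hypothetical sketch of querying a GOCDB-like MySQL database for
# scheduled downtimes; assumes the mysqlclient package (imported as
# MySQLdb). All names are illustrative, not the actual GOCDB schema.
import MySQLdb

conn = MySQLdb.connect(host="goc-db.example.org", user="reader",
                       passwd="***", db="gocdb")
cur = conn.cursor()
cur.execute("""SELECT site_name, start_time, end_time
               FROM scheduled_downtime
               WHERE end_time > NOW()
               ORDER BY start_time""")
for site, start, end in cur.fetchall():
    print(f"{site}: down from {start} to {end}")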

7.2.7.3 Certificate Monitor

The central component of ensuring the security of the grid is the digital certificate. All entities on the grid have digital certificates which must come from a trusted source, and which must be valid. Certificates are created with a finite life. The purpose of the certificate monitor is to give early warning of certificates which are about to expire. The COD routinely monitors this information and alerts a site if a certificate is approaching expiry. This is a valuable service to the sites, as once a certificate has expired it can take some time to obtain a new one, and during that time the resource will not be sent work by the grid, as it cannot be trusted.

Each user must have a certificate which identifies him, and each resource must have a certificate which identifies it. Each party validates the other using the certificate. Certificates are issued by a certificate authority, and have a limited lifetime, usually one year. In addition, each certificate authority routinely issues a certificate revocation list. The revocation list has a limited lifetime, usually one month. The revocation list is a list of certificates which the certificate authority has revoked. Each certificate authority also has a certificate and it has a limited lifetime, usually five years. Regardless of the validity of a certificate, it is generally not trusted if the certificate authority's certificate has expired or if the revocation list has expired.
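The trust rule just described can be written down directly. The following sketch uses plain datetimes so that no particular certificate library is assumed:

# A sketch of the trust rule described above, using plain datetimes so
# that no particular certificate library is assumed.
from datetime import datetime

def certificate_trusted(now, cert_expiry, ca_cert_expiry, crl_expiry, revoked):
    """A certificate is trusted only if it, its CA's certificate and the
    CA's revocation list are all unexpired, and it has not been revoked."""
    return (now < cert_expiry
            and now < ca_cert_expiry
            and now < crl_expiry
            and not revoked)

# The monitor performs the early-warning variant of the same check:
def expires_within(now, expiry, warn_days=30):
    return (expiry - now).days <= warn_days

Note that the certificate's own validity is necessary but not sufficient: the expiry of the CA certificate or of the revocation list is enough to make an otherwise valid certificate untrusted, which is why the monitor tracks all three lifetimes.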

The following reference is the principal page for accessing information on the certificate monitor [CM 1].

The following figure (Figure 15: Main page for the certificate monitor) shows the front page for the certificate monitor.

Figure 15: Main page for the certificate monitor


The following figure (Figure 16: Alert page for certificates) shows the alerts page for the certificate monitor.

Figure 16: Alert page for certificates

7.2.7.4 RGMA

R-GMA is used to collect accounting records.

R-GMA is an implementation of the Grid Monitoring Architecture of GGF and presents a relational view of the collected data. It is a producer/consumer service with command line interfaces as well as an API for Java, C, C++ and Python and a Web interface. R-GMA models the information infrastructure of a grid as a set of consumers (that request information), producers (that provide information) and a central registry which mediates the communication between producers and consumers. R-GMA can use the same information providers as used by BDII. Further information on R-GMA is available in the references [RGMA 1].

A service discovery mechanism using R-GMA has been implemented. Detailed information is available in the references [RGMA 2 and RGMA 3].
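The producer/consumer/registry model can be illustrated abstractly. The sketch below is not the R-GMA API, only an illustration of the Grid Monitoring Architecture pattern it implements:

# An abstract sketch of the GMA pattern that R-GMA implements: a central
# registry mediates between producers and consumers. This is NOT the
# R-GMA API; all names here are illustrative.
class Registry:
    def __init__(self):
        self._producers = {}                      # table name -> producers

    def register(self, table, producer):
        self._producers.setdefault(table, []).append(producer)

    def lookup(self, table):
        return self._producers.get(table, [])

class Producer:
    def __init__(self, registry, table):
        self._rows = []
        registry.register(table, self)            # advertise in the registry

    def insert(self, row):
        self._rows.append(row)

    def rows(self):
        return list(self._rows)

def consume(registry, table):
    """A consumer asks the registry where the data is, then reads it."""
    return [row for p in registry.lookup(table) for row in p.rows()]

reg = Registry()
p = Producer(reg, "ServiceStatus")
p.insert({"service": "CE", "site": "site-a", "status": "ok"})
print(consume(reg, "ServiceStatus"))

The design point is that consumers never need to know which producers exist: the registry mediates, which is what gives the collected data its single relational view across a distributed set of sources.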

The following figure shows the distributed nature of the implementation of R-GMA and its relationship to GIIS and GOC-DB (Figure 17: RGMA and its integration with GIIS and GOC-DB).


Figure 17: RGMA and its integration with GIIS and GOC-DB

7.2.7.5 Maps

With the abundance of monitoring data, and of portals providing access to these data, it can often be difficult to find the latest known state of test results for a particular site or service. Maps with the monitoring information provide a fast visual way for the CIC teams to identify sites and services that are failing tests. The GOC provides access to monitoring data generated by:

• Site Functional Tests at [GOCMaps 2];
• Service checks on local RGMA services [GOCMaps 3].

The GOC will soon introduce a map service associated with the certificate monitor.

The presentation of the information uses maps which are provided by Google [GOCMaps 4]. There is further information on how to customize Google maps for this application on the GOC Wiki pages, at the following reference [GOCMaps 5].

7.2.8 Site Functional Tests (SFT)

The Site Functional Test suite (SFT) is used to monitor the performance of sites on the grid. It is the primary tool for monitoring the resources and is used continuously by the COD to monitor the status of the resources on the grid [SFT 1]. Documentation on SFT is available at the following reference [SFT 2].

The results from running the SFT can be collected over time, and this provides useful information on the performance of resources (and aggregates of resources) over time.

The SFT consists of two packages:


SFT Client – this runs on a site's user interface (UI) and submits various job packages to the grid middleware and monitors their execution. Every change of their status is sent to the SFT Server.

SFT Server – this runs on a site's monbox and receives and stores the information sent from the SFT client jobs and presents the information in a dynamic website.

SFT relies on the standard job submission mechanism on a single UI. The tests are in fact different scripts intended for execution on WNs to test grid functionality. When the SFT test package is run, all these scripts are packed into a single job description and are submitted to selected CEs. The administrator is able to choose some predefined testing scripts, and also to define new ones, all of which are packaged together for the SFT execution.

After each test script is run on the WN, the results are published directly to the SFT Server using a web service. It is done in this manner in order to have partial results from the tests even when the job fails to finish after a successful submission. The contacted web service on the SFT server stores the results in the local MySQL database. During the test run or after the test has finished, the SFT administrator can invoke publishing of intermediate or final results, so that they are visible on the SFT website.
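The publishing step just described amounts to each test script reporting its own outcome as soon as it is known, so that a later job failure does not lose earlier results. The endpoint URL and payload fields in this sketch are hypothetical, not the actual SFT web-service interface:

# A sketch of per-test result publishing from a worker node; the
# endpoint and payload fields are hypothetical illustrations.
import json
import urllib.request

SFT_SERVER = "https://sft-server.example.org/publish"  # hypothetical endpoint

def publish(site, test, status, detail=""):
    payload = json.dumps({"site": site, "test": test,
                          "status": status, "detail": detail}).encode()
    req = urllib.request.Request(SFT_SERVER, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)   # one call per test result

# Each test script would publish its own outcome immediately, e.g.:
# publish("site-a", "JobSubmission", "ok")
# publish("site-a", "BrokerInfo", "error", "broker info file missing")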

The following figure (Figure 18: Flow of control and data with SFT) shows the flow of control and data associated with the SFT.

Figure 18: Flow of control and data with SFT


There is a page where the results for the grid can be accessed [SFT 3]. In addition to being used on the production grid, SFT is also used in regional grids in SW Europe [SFT 4] and SE Europe [SFT 5]. The following figure (Figure 19: Summary of SFT for SW Grid) shows an example of the publishing page. The figure shows the page for the SW grid, as that page is much smaller than the one associated with the production grid.

In the implementation of SFT in EGEE, the SFT client runs on the AFS UI at CERN. The SFT server runs on the MON box at CERN.

Figure 19: Summary of SFT for SW Grid


The documentation on the list of tests is available at [SFT 6]. The following tests are available:

• Job Submission;
• WN hostname;
• Software Version;
• CA RPMs Version;
• BrokerInfo;
• R-GMA client;
• CSH test;
• Replication Management using lcg-tools;
• GFAL infosys;
• lcg-cr to defaultSE;
• lcg-cp defaultSE -> WN;
• lcg-rep defaultSE -> central SE;
• 3rd party lcg-cr -> central SE;
• 3rd party lcg-cp central SE to WN;
• 3rd party lcg-rep central SE to defaultSE;
• lcg-del from defaultSE;
• Apel test.

The results shown on the main monitoring page are useful for understanding the status of the grid at a moment in time. However, they can also be used to determine the status of resources over time. The results from running the SFTs are retained and are available at the following location: [SFT 7]. The results are integrated over different time intervals such as weeks or months. The following figure, for example (Figure 20: Historical results for CERN), shows the number of processors available in the CERN federation over a week, and the number of sites available. These statistics are collected for all of the federations.

Figure 20: Historical results for CERN

7.2.9 Incident operation security

§6 of this document deals with security.

7.2.10 Monitoring, automation, alarms

§§7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5 and 7.2.6 deal with these matters with respect to the appropriate operational parts of the grid.


7.2.11 Support

The centre participates actively in the support of the operational infrastructure of the grid and also provides personnel for user support. These two areas are described in §§7 and 8.

7.2.12 Tools

The centre uses a large number of tools in order to perform its role. Some of these tools are described in this chapter; some are part of the grid middleware. In addition, the centre has developed new tools where appropriate, so that it can fulfil its mission.

7.2.13 Procedures, escalation §§7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5 and 7.2.6 deal with these matters with respect to the appropriate operational parts of the grid.

7.2.14 Metrics Metrics on the behaviour of the EGEE infrastructure are collected regularly and reported in the quarterly reports. At the time of writing this document, work is underway to enhance the number of metrics collected.

7.2.15 SLAs The development of Service Level Agreements (SLAs) with all the various parts of the infrastructure of EGEE is in progress. However, at this stage in the development of the infrastructure, EGEE has yet to deploy a SLA.

7.3 LESSONS LEARNED

7.3.1 Lessons from the operation of the OMC

The hierarchy of OMC, CIC and ROC generally works well

The arrangements planned for EGEE, with a single OMC, a single CIC distributed over several sites, and a number of ROCs with regional responsibilities, have worked well. The infrastructure has grown and operated in a reliable manner.

A number of aspects of this process are scalable and will continue to grow with the infrastructure; a number are not scalable and may require changes.

The following scale well, and are of no immediate concern within EGEE:

• the number of resource centres;
• the number of regional operations centres;
• the number of certificate authorities;
• the number of virtual organisations.

There are things which have to change in order to grow further

The following aspects of operations are not scalable and will have to be addressed in EGEE-II:

• the middleware certification process;
• the support arrangements for some of the middleware components;
• the support arrangements for some of the virtual organisations.

New partners should prepare a plan before joining the infrastructure

Organisations considering participating in the grid should identify their role and read the appropriate section of this chapter. Following the experience of the EGEE project, they should do the following:

• prepare a plan for their unit following the model used by EGEE;
• identify the teams required to implement the plan;
• recruit appropriate people to provide the teams;
• plan their equipment requirements;
• plan the service requirements;
• ensure that appropriate monitoring and alarms are in place;
• ensure that their participation in the support systems is understood by other parts of the support system;
• acquire appropriate tools to fulfil the mission;
• ensure that procedures are in place, including the escalation procedures;
• ensure that appropriate metrics are collected and provided to the appropriate parts of the grid;
• ensure that accounting is in place in line with current working practice;
• ensure that SLAs are in place in line with current working practice.

7.3.2 Lessons from the pre-production service

Ensure that sites understand the commitments that are required

As the sites involved in the PPS are all volunteers, it is easy for the people involved to think of the service as being of low priority. However, this is not the case, and so it is imperative that the implications of joining the service and the commitments required in order to join the service are clearly stated and agreed to in advance.

Provide clear information for users and sites on how to join the service

The provision of an entry point (the wiki pages) with useful information about how to join the PPS and how to install gLite helps both new users and new sites to participate in the PPS.

Do not mix versions of immature middleware

Mixing different versions of gLite in the same environment creates additional layers of complexity and problems, in an environment which is already difficult to manage.

Wait for a version of the middleware to be fully tested before installing it

For the sake of expediency, there was often pressure to install versions of the middleware that were not thoroughly tested. This almost always took more time than waiting for the new version to be thoroughly tested before installing it on the PPS would have done.

Do not upgrade the service more than once every 2 months

There were releases of middleware every month and sometimes every 2-3 weeks. This frequency of upgrades leads to the service being unstable for long periods of time.

7.3.3 Lessons from the operation of the CIC

The information on this topic is contained in §§7.3.4, 7.3.5 and 7.3.6.

7.3.4 Lessons from the operation of the COD

Rotating operations requires reliable partners

Each partner in a scheme where responsibility for operations passes from one partner to another on a frequent basis has to take its responsibility seriously and carry out its duties in a reliable and timely manner.

Test the procedures before they are introduced to service

Where possible, procedures should be used for some time by a limited number of persons before they are introduced on a wider basis. This is to ensure that when a procedure is introduced it is known to work well, and has been fully documented.

Document the procedures

The practices which are to be followed must be documented, and the documentation has to be kept up to date in a timely and reliable manner. With distributed operations, word of mouth and other informal practices are not sufficient.

Maintain the documentation

The documentation must be accessible in a distributed way, and it must be possible for appropriate people to be able to change it. However it is not a good idea to allow unfettered access to the documentation. This can lead to inappropriate changes being made.

Operators must be trained

Providing documentation and other forms of passive help to enable people to know how to do their jobs is necessary, but it is also essential that they be appropriately trained. This does not necessarily mean being trained in a classroom. Most of the CIC operators have learned their skills on the job, often assisted by more experienced colleagues.

Operators must follow procedures

When responsibility for a problem has to pass from one person to another over time, the people must follow a standard procedure, so that others can understand their actions. Operators should be encouraged to see how operational procedures can be improved, but not to be innovative on the job!

Have regular meetings of the operators

It is essential to have regular meetings of the people doing the work of the COD. This is easy to arrange at the individual site, but is not so easy to arrange when people are working across a large geographical area. However, the EGEE conferences have provided a suitable opportunity to have such meetings. In addition, there have been meetings held at the Operations Workshops.

Ensure that operators get the right amount of work

People doing COD work should do it regularly so that they build experience and expertise. However it is also important to avoid the situation where someone becomes exhausted by the demands of the job, or bored with the routine of the job.

Trust the operators to do their jobs

Allow the operators to do their jobs, and trust them to act responsibly. If they have to refer minor actions to more senior people for authority, this reduces their effectiveness. This means that the operators can carry out an escalation procedure which may eventually lead to the eviction of a site from the grid.

Give operators authority which matches their responsibilities

The operators are responsible for the moment-to-moment operations of the grid. This does not mean that they are responsible for fixing minor operational problems. They are responsible for monitoring operational problems and reporting each problem to the appropriate place. There is one area in which they have overriding authority, and that is security. In the event of a security situation, the operators have authority to remove sites from the grid in order to protect the interests of other sites on the grid.

Avoid single points of failure, and single points of knowledge

It is not always possible to avoid single points of failure, as many of these are outside the control of the CIC. However, ensuring that more than one person is knowledgeable about such things as passwords, or the location of keys and other security credentials, is generally a matter of local management.

Be tolerant of operators' errors - everyone makes mistakes

When an operator makes a mistake, this is an opportunity for learning for all concerned. The only people who never make mistakes are those who never make anything. When someone makes an error, it may tell us more about the tools or procedures which he was following than about the person who made the mistake.

Building a distributed system of operations support is not easy, but it is worthwhile

It would have been easier to have built the operational infrastructure by locating all of the resources in one location under a single organizational and management structure, than to have built the one described here. However the present one is more scalable than a monolithic one.

Monitor the work

It is not enough to do the work; it is necessary to be seen to be doing the work, to be able to show that the work is being done, and to be able to measure improvement, or otherwise.

Work with the ROCs whenever possible

The purpose of the COD is to provide a service to the regional operations centres so that problems can be resolved at the local level. It is important that the COD and the ROCs co-operate as a team with distinct roles but with a common aim of ensuring reliable grid operations.

7.3.5 Lessons from the operation of a ROC participating in CIC

CIC must have specialists

During the ramp-up of the CIC and the ROC, it has proven very useful to have people specialise in particular subjects. This has to be carefully planned, especially by the CIC management, to avoid problems in fulfilling the commitments during the CIC-on-duty week.

CIC-on-duty staff must have global knowledge

Every staff member should have in-depth knowledge of at least one core service component, and an overall understanding of the others.

Face-to-face meetings of CIC-on-duty staff are important

The integration of those meetings into operations workshops and EGEE conferences has proven to be helpful. Besides getting to know one another (which helps in knowing the right colleague to ask about a particular problem), such meetings are occasions to identify difficulties with tools or with the organisation of the service, and to find solutions. Tool development is often triggered here.

Allow people to work on a wide range of tasks

Get CIC staff to participate in the regular regional operations meetings, and get CIC staff to do some work for the ROC, and vice versa. CIC staff members should regularly attend project-wide meetings in person. This helps in the identification and detection of problem patterns which appear in every region, and this in turn leads to finding solutions.

Pay attention to procedures

Operations are done by people. So try to have clear and concise guidelines for people, and ensure that the detailed information is held by computers and in software as much as possible.

7.3.6 Lessons from the operation of a ROC not in CIC

ROCs should participate in the pre-production service

Involvement in the pre-production service is important for all ROCs, since in this way operators gain hands-on experience with new middleware that can be transferred further to deployment and support groups within the ROC (but manpower is needed for this transfer).

ROCs should have a local installation test bed

Having a local installation test bed in the region is useful: many bugs are found in the pre-release, and common problems in the mass across-the-board installation exercise are thereby avoided.

Every ROC has to run core services to support its local VOs

Every ROC needs to run core services anyway to cater for regional VOs as well as to support global VOs.

Every ROC should have a local test bed

The regional test-bed is crucial so as to ensure that the new sites are thoroughly tested and stable enough to go into production. The related set of pre-deployment monitoring tools is needed locally to support this activity.

Every ROC should have regional tools

Customised regional monitoring tools are useful for monitoring problems and for debugging sites at runtime: in many cases problems are spotted in the regional monitoring tools before COD tickets arrive, and are handled locally anyway.

Every ROC should have a local portal

The EGEE documentation is at times fragmented and the references to LCG documentation confusing. The existence of local portals unifying this information in an easy and accessible way is important.

Every ROC should have “champions”

It is important to have “champions” within the deployment and support team who are the core of the team in terms of experience, responsiveness and support.

7.3.7 Lessons learned from the GOC

Get a clear set of requirements

It has not been easy for the GOC team to get a clear set of requirements from the community when it comes to developing a GOC service.

Providing a persistent reliable service is a lot of work

Prototyping a service can involve many man-hours of effort; improving it into a persistent, reliable service takes even longer.

Prototype services, then react to feedback

Some of the users of the GOC (e.g. the COD) want high-level information only: which sites are up, which are down. Other users (site administrators) want detailed information: which service is down, and when it failed. Much of the recent work of the GOC has been to address both of these requirements. This is especially true for the accounting and reporting web pages. It is necessary to summarise a lot of data at different levels: for projects such as LCG and GridPP, for EGEE and its partners, for the ROCs, and for the Tier-1 and Tier-2 centres. It is important to engage with the community, prototype services quickly, announce them and encourage feedback.

7.4 WAYS FORWARD TO EGEE-II AND BEYOND

7.4.1 Ways forward for the pre-production service

Although the PPS has provided a valuable service in allowing users access to the functionality of the new gLite middleware and allowing site administrators to gain first hand experience of installing, configuring and debugging gLite, there is still a great deal that the PPS can achieve moving forward into EGEE-II.

Until now, the PPS has been deployed, as far as possible, using only the gLite middleware stack; thus it could be said that calling the service "pre-production" was a misnomer. Moving forward, however, the PPS should take on more of a classic pre-production role: the software stack running on the PPS should closely reflect the production service, and decisions regarding the contents of future releases of the middleware to production should be based on the experiences of users and operators on the PPS. Thus the PPS needs to move away from being a gLite service and needs to become an integral step in the middleware release cycle.

At the same time as the above change in role takes place, the operations, monitoring, user support and site support for the PPS need to be taken over by the teams that carry out these roles in the production service. This will allow for better support of both users and site administrators than is possible at the moment. This is important, as the PPS is now reaching a size and complexity where simple mailing lists are insufficient to provide adequate monitoring and support.

7.4.2 Ways forward for the COD

In the future, the COD will have more partners, more rotating responsibilities, more parallel work and more distributed operations. The COD will have to evolve to deal with these changes.

Today, the COD’s principal active ingredient is the Site Functional Tests (SFT). In the near future it will have to provide Core Services Functional Tests (CSFT), and perhaps in the future it will have to provide active monitoring for VO matters – Virtual Organisation Functional Tests (VOFT). Each of these sets of tests adds a dimension of complexity. The amount of work is not proportional to the number of test sets which the CIC does; it is considerably more than this.

7.5 REFERENCES

ACC 1 DSA1.3 - Accounting and reporting web site publicly available: https://edms.cern.ch/document/489455

ACC 2 EGEE view by the CESGA team in SWE federation: http://www.egee.cesga.es/EGEE-SA1-SWE/accounting/reports/

ACC 3 LHC reporting pages: http://www.goc.grid-support.ac.uk/gridsite/accounting/tree/tier1view.php

ACC 4 GridPP reporting pages: http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.php

CM 1 Main page for the certificate monitor: http://goctest.grid-support.ac.uk/gridsite/monitoring/tree/CERTView.php

GOC 1 Main page for the GOC: http://goc.grid-support.ac.uk/gridsite/gocmain/

GOCDB 1 Point of entry to the GOC DB: http://goc.grid-support.ac.uk/gridsite/gocdb

GOCDB 2 MySQL database: http://www.mysql.com

GOCDB 3 GridSite software: http://www.gridsite.org/

GOCMaps 1 The home page for the GOC maps: http://goc.grid-support.ac.uk/googlemaps

GOCMaps 2 The home page for the SFT monitor map: http://goc.grid-support.ac.uk/googlemaps/sft.html

GOCMaps 3 The home page for the RGMA service monitor map: http://goc.grid-support.ac.uk/googlemaps/rgma.html

GOCMaps 4 The home page for the Google map service: http://maps.google.com/

GOCMaps 5 Recipes to build Google maps for the GOC: http://goc.grid.sinica.edu.tw/gocwiki/GoogleMapHowTo

OMC 1 EGEE SA1 Execution Plan: https://edms.cern.ch/document/489453

OMC 2 Document repository for OMC: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=./&

RGMA 1 Description of the RGMA Monitor Architecture: http://goc.grid.sinica.edu.tw/gocwiki/RgmaMonitorArchitecture

RGMA 2 Service Discovery User Guide: https://edms.cern.ch/document/578147

RGMA 3 gLite Release: http://hepunx.rl.ac.uk/egee/jra1-uk/glite-r1

SFT 1 Operations manual for COD: https://cic.in2p3.fr/index.php?id=cic&js_status=2

SFT 2 SFT2 documentation page: http://goc.grid.sinica.edu.tw/gocwiki/Site_Functional_Tests

SFT 3 Typical report for the production grid: https://lcg-sft.cern.ch/sft/lastreport.cgi

SFT 4 Typical report for the SE grid: http://grid-se.marnet.net.mk/sft/lastreport.cgi

SFT 5 Typical report for the SW grid: http://mon.egee.cesga.es/sft/lastreport.cgi

SFT 6 SFT suite list: https://lcg-sft.cern.ch/sft/sftestcases.html

SFT 7 SFT historical metrics: https://lcg-sft.cern.ch/sft/metrics.html

8 USER SUPPORT

8.1 STATEMENT OF PROBLEM

In grid computing, users can experience all the usual difficulties associated with scientific computing. In addition, however, they can experience new types of problems caused by those aspects of the grid which are unique to grid computing. In the description here, a user is an individual who submits work to the grid. The term does not include people associated with the provision and maintenance of the grid, such as system administrators, system managers and so on.

A virtual organization (VO) is an organization which is a client of the grid. The users in effect belong to the VO. The VO typically represents an application, or an application area. Users can experience problems which relate to their participation in a VO, or to their participation in the grid.

The user requires support in the following areas:

• tools which provide passive support5 [§8.4];
• problem ticket submission;
• response from a suitable helpdesk;
• training in the use of grid technology;
• documentation;
• training in the use of the support system.

The managers of a support system require support in the following areas:

• metrics on the behaviour of the system;
• alerts when tickets expire6;
• alerts when service level agreements are breached.

8.2 SOLUTION IN USE WITH EGEE

To address this problem, EGEE has created a distributed organization to provide user support.

The following organizations deal with problem tickets:

• Global Grid User Support (GGUS);
• The regional operational centres (ROCs);
• The core infrastructure centres (CICs);
• The VO support groups (VOs);
• The support units (SUs).

The support organisation is intended to operate as a hierarchical, but distributed system. The use of rotating responsibility for the operations of many parts of the infrastructure provides for enhanced reliability and scalability, but increased complexity of operation.

GGUS appears as a single organisation from the user’s point of view. It is accessible at a single e-mail address, or at a single web site address. The function of GGUS is to manage the overall system. Within GGUS there are rotating components of support.

5 The term passive support is the complement of active support. It refers to providing tools so that users can help themselves. Online documentation, status monitoring tools and search engines are all examples of passive support. Active support is where a person has to deal with the support request.

6 All tickets have a time associated with them. When the time expires, an alert is sent to an appropriate group of people, generally by e-mail.

Each ROC may provide a ticketing system of its own and GGUS has published interfaces where it will exchange tickets with ROCs. Many support tickets are submitted to the local ROC and dealt with within the local region. There are 10 ROCs, most of which have their own ticketing systems which work in partnership with GGUS.

The CIC is itself a distributed organisation, but behaves as a single organisation within the support system. There are features within the GGUS system to support this way of operating. GGUS only recognises one CIC, although it is composed of 5 partners.

The VOs are largely independent of one another. However, they all receive tickets which are directed to them via the ticketing system. There are currently about 20 such organisations recognised by GGUS.

The support units form the second line of support. When a problem cannot be resolved by one of the previous units, the ticket is assigned to one of the support units. The nature of the support units is varied: many provide support for specific software components; others provide support for services which are not mainstream; others provide services and use GGUS as a ticket exchange system.

8.2.1 Management initiatives to provide user support

The management of the support function requires that the following be created:

• an organization to deal with support;
• the problem ticket management system;
• the network of organisations to respond to problem tickets;
• a means of monitoring problem tickets;
• a means of agreeing service level agreements;
• a means of creating documents;
• a means of publishing documents;
• a means of creating training materials;
• a means of delivering training.

8.3 SUPPORT MODEL

The support model is described in the following document, referenced in [User 1]: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/9100_GGUS_Support_Model.pdf

There are two parts to first line support for users in the model.

One part puts the VO at the front of the user ticketing process in GGUS. GGUS provides the end user with an e-mail interface to the VO. The user sends an e-mail and receives replies until the problem is solved. This arrangement makes it easy for VOs to provide a web front end for their support: they can provide documentation and other information relevant to the users of the VO, and provide access to support with a simple button which generates an e-mail.

The other part is an organisation called TPM (ticket processing management). TPM is staffed by people provided by the federations on a rotating basis. This organisation is responsible for the monitoring of all active tickets in the GGUS system. The responsibility for the provision of people for TPM rotates weekly among the federations.

The second line support is formed by many support units. Each support unit is formed from members who are specialists in various areas of grid middleware, or ROC supporters for operations problems, or VO specific supporters. The membership of the support units is maintained on mailing lists.

A single e-mail address is available through which users can request help from GGUS. E-mails sent to this address are automatically converted into tickets.

Tickets are normally assigned to a support unit. This means that the ticket is sent to a mailing list composed of many supporters. One of these supporters assigns the ticket to himself and takes responsibility for it. GGUS monitors tickets. If an "urgent" or "top priority" ticket remains in status "assigned" but not "in progress" for more than 2 hours (no supporter has assigned the ticket to him/herself), TPM assigns the ticket to one person in the support unit. This person is then in charge of the ticket. He either solves it or assigns the ticket to somebody else. The status of the ticket stays set to "in progress" while the ticket is under the responsibility of one supporter, until the ticket has been solved. If an "urgent" or "top priority" ticket stays unsolved for more than 2 days, then TPM is responsible for following the ticket and ensuring that a solution is found. Of course, there can be cases where tickets remain unsolved because of missing features in the middleware or a well identified cause that cannot be fixed in the short term. In this case, the status is set to "unsolved". Part of the management of GGUS is to monitor such situations to ensure that this state is not abused.
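The timing rules in this workflow are mechanical enough to express directly. The following Python sketch encodes the two thresholds described above; the ticket representation and function names are invented for illustration and do not reflect the GGUS implementation.

    from datetime import datetime, timedelta

    ASSIGN_LIMIT = timedelta(hours=2)   # "assigned" but not yet "in progress"
    SOLVE_LIMIT = timedelta(days=2)     # unsolved urgent/top priority ticket

    def tpm_actions(ticket, now=None):
        """Return the TPM follow-up actions due for one ticket."""
        now = now or datetime.utcnow()
        actions = []
        if ticket["priority"] not in ("urgent", "top priority"):
            return actions
        if (ticket["status"] == "assigned"
                and now - ticket["assigned_at"] > ASSIGN_LIMIT):
            actions.append("assign the ticket to one person in the support unit")
        if (ticket["status"] not in ("solved", "unsolved")
                and now - ticket["created_at"] > SOLVE_LIMIT):
            actions.append("follow the ticket until a solution is found")
        return actions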

The user can submit a ticket for support by sending e-mail to a mailing address. The name of the mailing address indicates the VO to which the user belongs. For example, the user can submit a ticket to:

[email protected] where VO is one from the following list of VOs:

VO = {alice, atlas, biomed, cdf, cms, compchem, dteam, egeode, esr, lhcb, magic, planck}.

The user then receives email from [email protected] with:

• request for further information;
• notification of change of status including solved.

If the user responds to any of these e-mails, then the reply is added to the ticket history. The subject of the e-mail includes metadata to ensure the association of the response with the ticket.
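A sketch of this technique follows, with an invented tag format ("[GGUS #12345]"); the actual metadata layout used by GGUS is not specified here.

    import re

    # The ticket number is carried in the mail subject so that a reply can
    # be appended to the right ticket history; the tag format is invented.
    TICKET_TAG = re.compile(r"\[GGUS #(\d+)\]")

    def ticket_id_from_subject(subject):
        """Extract the ticket number from a mail subject, or None."""
        match = TICKET_TAG.search(subject)
        return int(match.group(1)) if match else None

    print(ticket_id_from_subject("Re: [GGUS #12345] job stuck"))  # 12345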

The response includes the ticket history and a link to the ticket within GGUS. To use the link, the user must have a digital certificate in his/her browser in order to see the ticket.

If the user does not know which VO list to use, then the user can use the generic mail address for GGUS which is called: [email protected]

The work flow is summarised in the following figures (Figure 21: Work flow for a ticket entered to [email protected] and Figure 22: Work flow when a ticket is entered for a VO).

Figure 21: Work flow for a ticket entered to [email protected]

Figure 22: Work flow when a ticket is entered for a VO

8.3.1 Central user support

The purpose of the central grid support organization is to do the following:

• provide a web portal which acts as a single port of call for support;
• act as a central point for ticket exchange;
• manage problem tickets, including monitoring;
• provide management reports on ticket response;
• publish documentation;
• operate an organization to deal with support;
• operate a problem ticket management system;
• operate a network of minor organizations to respond to problem tickets;
• operate a means of monitoring problem tickets;
• monitor the location and status of the ticket;
• escalate a ticket as it ages;
• ensure that the response to the ticket is of suitable quality.

8.3.2 Regional user support

The Regional Operations Centres provide support to their users. A document which describes this in general is given in [User 2]: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/2000_FAQ_for_ROC.pdf

In addition, there is a document which describes the implementation for each ROC. The index is available in [User 3]: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\documentation\pdf\&

The documents describing the various ROCs are shown in the following table:

Doc Number   ROC document
2100         CERN
2200         UK-Ireland
2300         Italy
2400         Central Europe
2500         South West Europe
2600         South East Europe
2700         North East Europe
2800         Russia
2900         Germany-Switzerland
3000         France

These documents are written and maintained by the OMC in partnership with each of the ROCs and are working documents. The detail about each of the ROCs is contained in these documents. People working with the GGUS ticketing system refer to these documents in order to obtain detail about the operation of each of the ROCs.

8.3.3 Core infrastructure user support

The CIC does not deal directly with user support. However, the CIC uses the GGUS ticketing system. This is described in [User 4]: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/3900_FAQ_for_CIC.pdf

8.3.4 Virtual organisation user support

The VOs provide support to their users. There is a document which describes this in general in [User 5]: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/4000_FAQ_for_VO.pdf

In addition, there is a document which describes the implementation for each VO. The index is available at the following location [User 6]: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\documentation\pdf\&

8.3.5 Support unit user support

The support units provide support to their users. There is a document which describes this in general at the following location [User 7]: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/7000_FAQ_for_SU.pdf

In addition, there is a document which describes the implementation for each SU. The index is available at the following location [User 8]: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\documentation\pdf\&

8.4 PASSIVE SUPPORT

There are a number of tools which provide passive support to users of the grid. Links to these tools are published on the home page of the GGUS portal. They include links to the following tools:

• CIC-Portal;
• GOC Downtime Report;
• GOC Grid Monitoring;
• Grid-ICE;
• Jobstatus GridKa.

8.4.1 Documentation

The index to the documentation to help the user is contained within GGUS at the following location [User 9]: https://gus.fzk.de/pages/docu.php

8.4.2 Training

At the time of writing this document, the activity is working with NA3 to create materials for training people in the use of the support system. This includes materials for the following groups of people:

• end users;
• people working in support units;
• people working in GGUS itself.

8.5 MANAGEMENT MATTERS

8.5.1 Procedures and escalations

The processes and escalations which are used in GGUS are documented in the documents which are referenced in this chapter.

8.5.2 Metrics

Metrics on the operation of the GGUS ticketing system are collected on both a monthly and a weekly basis. These are published in the documentation system.

The following reference is the index of the monthly reports [User 10]: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\usage\&

The following reference is the index of the weekly reports [User 11]: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\usage\weekly\&

8.5.3 Executive Support Committee (ESC)

The activity manager created an organisation called the Executive Support Committee (ESC). This organisation meets once per month, generally by telephone. It has representatives from parts of SA1, and some members from closely related activities. The role of the ESC is to ensure that the support system being put into place meets the needs of its many users.

8.5.4 Service Level Agreements (SLA)

The need to have SLAs in place with all of the organisations in the GGUS ticketing system is recognised. At the time of writing this version of this document, this work has not started.

8.6 LESSONS LEARNED

Making a support system work in a distributed environment is not a simple matter. In order to make it work, it was necessary to do the following:

8.6.1 Document the support model

At the beginning of EGEE, the only document defining the support arrangements was one which had identified possibilities. It was necessary to write a document which clarified the support model. This new paper defined the model with the hierarchies and other detail.

8.6.2 Get software for making help desks

There are many software systems available for the implementation of help desk systems. These systems are intended to allow rapid development of helpdesk systems. The various ROCs in EGEE have experience with a number of these, including:

• Remedy;
• One-or-Zero;
• Footprints;
• Xoops;
• Request Tracker.

8.6.3 Work with what the responsible units already have

At the outset of EGEE, some of the ROCs were already operating substantial helpdesks. It was necessary to work with these ROCs to integrate their existing help desks with the GGUS help desk. This was a lengthy and difficult process. However, it has been largely successful and has created a system which is scalable and reliable. When local support groups join the grid in future, a part of the integration will be the integration of their help desks. GGUS now has experience in supporting a number of technologies. The implementation using web services and e-mail exchange provides an interoperable platform. Try to keep the number of interfaces to different systems small. It is expected that in the future GGUS can extend this to work with other federal help desks. Exploratory work is underway with a candidate partner in the US.

8.6.4 Have experienced support people working at the front of the ticketing system

At the beginning of EGEE, it was thought that people could easily be trained to deal with tickets when they arrived in the ticketing system, and that the developers of the ticketing system could do this work. This was a mistake. Dealing with incoming tickets is a difficult task which requires skills very different from those needed to develop a ticketing system. It was necessary to change this.

8.6.5 The Virtual Organisations must be at the front end of the ticketing system

The users of the grid effectively belong to the virtual organisations. Many of the users' problems relate to the software of the virtual organisation. It is therefore essential that the virtual organisations are at the front of the ticketing system. There has been great reluctance on the part of the virtual organisations to accept this.

8.6.6 Provide a simple mechanism for submitting a ticket

At the start of EGEE, a user could only submit a ticket if he had a digital certificate which was recognised by GGUS. This is not a barrier to experienced users of the grid, as they are used to using a digital certificate. However, it is an unreasonable requirement for inexperienced users. It was necessary to add an e-mail interface to the system.

8.6.7 Document the work flows

Having a support model is not sufficient for the implementation team to implement the work flows. It is necessary to document the work flows and to deal with the detail of what is to happen. This allows the team to implement the work flow quickly, without the users experiencing surprises!

8.6.8 Document the user interface

Providing a web portal and expecting users to be able to use it with little training and no documentation is not realistic. It is necessary to document the use of the web portal and to ensure that training material is available.

8.6.9 Keep the documentation up to date

The documentation on the system must be readily available and current. There has to be a simple way to update the documentation, and to control who can change it.

8.6.10 Make a list of the responsible units

At the outset it was easy to identify many of the responsible units. However, as time has gone on, more have emerged.

8.6.11 Avoid the development of alternatives

It is easy for an interest group to make its own helpdesk system using one of the free software systems mentioned elsewhere. If a group intends to do this, then there is no reason for the central organisation to object. However, if this is set up as an alternative, then it may confuse users without enhancing the support provided to them.

8.6.12 Encourage people with suitable needs to join the central support organisation

There are other groups in the project with needs which can be met by the central helpdesk. For example, there is a group dealing with external network providers. While it is not part of the remit of user support to provide network support, it is clearly better to avoid the development of other ticketing systems. Some of these third parties have distinct needs. In particular, the security response team requires that their activity be very secure and reliable.

8.6.13 Get the agreement of the responsible units to implement their part of the model

This turned out to be very difficult. Many of the responsible units did not wish to commit resources to support. They were reluctant to talk about agreements or to provide names of people to take responsibility for providing support in their area.

8.6.14 Get the responsible units to name a person who is responsible for providing support

It is easy to identify the need for support in a particular area, but it can be difficult to find someone who can act as a principal point of contact with the responsible unit. It was easy to get names from the ROCs, for example, but very difficult to get names from some of the VOs.

8.6.15 Treat all responsible units similarly - do not allow exceptions

When defining the support agreement with responsible units, it is important to treat them all similarly. Once it is known that a favourable concession has been extended to one unit, the others will want the same concession.

8.6.16 Document the agreement with each of the responsible units

Even if the agreements with each of the responsible units are similar, there are details which vary. It is necessary to have an agreement which is customised to each one. It is like a letter containing a job offer: it is usual practice that before starting work, employer and employee exchange letters (usually written by the employer and endorsed by the employee) providing the essential facts concerning the job.

8.6.17 Train the supporters in the responsible units to carry out their tasks

It is necessary to provide training for supporters. At the very least this should be a document for them to read. It is also necessary to offer additional assistance, on the job if necessary. GGUS provides a help line so that supporters can call and ask for assistance. However, it is necessary to ensure that the training team have the necessary help to include suitable material in their training courses. It may also be necessary to offer courses on a regular basis. In EGEE, progress is being made towards doing all of these things.

8.6.18 Provide documentation to the supporters

GGUS and the ESC have written documents which describe the use of the supporter interface to GGUS.

8.6.19 Test that the ticket routing through the system works between the responsible units

The ticketing flow in GGUS can be quite complicated. For example, a user can submit a ticket at the local help desk in one region, but the ticket is sent via GGUS to another region for processing.

8.6.20 Involve supporters in the development of the system

It is essential to involve the people who are going to work with a system in the development of that system. The ESC was created in EGEE, and one of its roles was to provide a way to ensure that this was the case. This was successful, as it ensured that the interface to the system provided the appropriate functionality for supporters to carry out their work.

8.6.21 Have a program of upgrades to the support system

The developers only update the help desk system on an occasional basis, about once per month. The developers maintain a list of development projects, plan the implementation, and provide notice in advance. This process has been in place since mid-2005, and has led to an increase in supporter satisfaction with the ticketing system.

8.6.22 Provide a backup system to avoid a single point of failure

It is necessary to provide a backup system for the help desk system. If the helpdesk is located in one location and that location is not available, then the help desk system disappears. During 2005, there have been a number of occurrences of this problem, and work is in hand to address it by providing a backup system in another location, to which work can be transferred when necessary.

8.6.23 Collect statistics on the use of the system

GGUS has been doing this, and is working on how to report the statistics with Crystal Reports.

8.6.24 Establish a mechanism for agreeing change and use it

The ESC was established as the mechanism for agreeing changes to the support system in EGEE. The ESC was also charged with the co-ordination of the implementation of change. The GGUS team is part of the ESC. Establishing this mechanism brought a sense of order to changes to the support system; prior to that, there was a tendency to react to the ideas and requirements of individuals. Satisfaction ratings from both users and supporters have improved since this mechanism was introduced.

8.7 REFERENCES

User 1 Support model: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/9100_GGUS_Support_Model.pdf

User 2 General description of the role of a ROC in GGUS: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/2000_FAQ_for_ROC.pdf

User 3 Location of the documents which describe each of the ROCs: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\documentation\pdf\&

User 4 Description of the role of the CIC in GGUS: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/3900_FAQ_for_CIC.pdf

User 5 General description of the role of the VO in GGUS: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/4000_FAQ_for_VO.pdf

User 6 Location of the documents which describe each of the VOs: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\documentation\pdf\&

User 7 General description of the role of the SU in GGUS: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/7000_FAQ_for_SU.pdf

User 8 Location of the documents which describe each of the SUs: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\documentation\pdf\&

User 9 Main documentation index on the GGUS portal: https://gus.fzk.de/pages/docu.php

User 10 User support monthly usage reports: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\usage\&

User 11 User support weekly usage reports: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\usage\weekly\&

9 FABRIC MANAGEMENT

9.1 OVERVIEW OF GRID CLUSTER COMPONENTS

The information contained in this section is intended for an audience of fabric managers. A more detailed description of the architecture is provided in §4. Some readers may prefer the presentation in §4. If readers are familiar with the contents of §4, then they can skip this section and proceed to §9.2.

Most of the parameters for planning will be a function of the size of the site and the VOs that the site would like to support.

9.1.1 Computing Element (CE)

The Computing Element (CE) is the interface between the site's batch system and the grid. It enables grid jobs to be submitted to the site, imposing site- and VO-specific authorization and accounting rules. The CE consists of a front-end (head) node that interfaces to the grid, and a back-end that is the site's batch system, such as PBS/Torque, LSF or Condor.

9.1.2 Worker Node (WN)

The Worker Node (WN) is a (real or virtual) machine that is part of the site's batch system. A grid user's job will run on a WN. The machine needs to have the grid client utilities installed.

9.1.3 User Interface (UI)

If a site has users that want to use the grid, it has to support the User Interface (UI). The UI can be installed by the user on a personal machine, or a centralized, site-wide User Interface can be used.

9.1.4 Storage Element (SE)

The Storage Element (SE) is the interface between the site's storage system and the grid. A Classical SE is a simple, inflexible disk server with a GridFTP service. Advanced storage systems implement the Storage Resource Management (SRM) interface. Such SEs allow disks and server nodes to be added or removed, files to be replicated for robustness or to balance the load, and a large variety of other features. There are three main SRM implementations currently in place:

• CASTOR;
• dCache;
• DPM.

9.1.5 Resource Broker (RB), Workload Management System (WMS)

The Resource Broker (RB) or Workload Management System (WMS) manages the user's jobs. It finds the best (e.g. least loaded) Computing Element that matches the job requirements. Normally it also provides the Logging and Bookkeeping (LB) service, to keep track of the status and the history of the submitted jobs. The RB/WMS is the most complicated grid service. Therefore RBs are mostly run centrally, per federation and at CERN.

9.1.6 Information Index (II, BDII), Service Discovery (SD)

The II is an aggregation service for information from all the sites in the grid. The II is more commonly known as the BDII, as its current implementation is based on a standard Berkeley Database. A top-level BDII queries all site BDIIs and lists all currently available grid services in a single LDAP database that is consulted by grid client tools and by other grid services. The gLite Service Discovery component can either be a BDII or an R-GMA server.
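Since the BDII is an ordinary LDAP server, any LDAP client can consult it. Below is a minimal sketch using the Python ldap3 library, assuming the conventional BDII port (2170), search base ("o=grid") and the GLUE 1.x schema; the host name is a placeholder.

    from ldap3 import ALL, Connection, Server

    # Query a top-level BDII for the computing elements it publishes.
    server = Server("bdii.example.org", port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # anonymous bind

    conn.search(
        search_base="o=grid",
        search_filter="(objectClass=GlueCE)",
        attributes=["GlueCEUniqueID"],
    )
    for entry in conn.entries:
        print(entry.GlueCEUniqueID)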

9.1.7 Replica Location Service (RLS)

The Replica Location Service is a node which maintains a catalog of the physical locations for VO data identified by Logical File Names and provides access to this information. The RLS is centrally deployed for VOs that still need an instance. New catalogs should rather be implemented as LFC services, discussed below.

9.1.8 Proxy Server (PX)

The Proxy Server (PX) is a service that allows a user to store a long-lived credential from which a trusted party can securely obtain short-lived credentials (proxies) to be used by grid jobs. A trusted party has to identify itself with a valid proxy. An RB or FTS uses the PX to renew or obtain the proxy for a user job or file transfer as needed. The PX is more commonly known as the MyProxy server.
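In terms of the standard MyProxy client commands, the interaction looks roughly as follows; the server name and lifetimes are placeholders, and the exact options should be checked against the installed MyProxy version.

    import subprocess

    PX = "myproxy.example.org"  # placeholder PX host

    def store_credential(days=7):
        """Store a long-lived credential on the PX (prompts for a passphrase)."""
        subprocess.run(
            ["myproxy-init", "-s", PX, "-c", str(days * 24)], check=True)

    def get_short_proxy(hours=12):
        """Retrieve a short-lived proxy from the stored credential."""
        subprocess.run(
            ["myproxy-logon", "-s", PX, "-t", str(hours)], check=True)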

9.1.9 Monitor Service (MON), R-GMA Server (R-GMA)

The Monitor Service (MON) is a server for the Relational Grid Monitoring Architecture (R-GMA) information system and for the GridICE information collector. Both R-GMA and GridICE have information providers (sensors) for each type of grid service node, such that all grid-related monitoring information for a site is available through the site's MON, commonly known as the MON Box.

9.1.10 Virtual Organisation Server (VOS)

The Virtual Organisation authorization Server (VOS) provides the list of authorized VO members. This is one of the following:

• a Lightweight Directory Access Protocol (LDAP) server;
• a VO Membership Server (VOMS).

Grid services are to use VOMS for authorization eventually, but many services still need a legacy grid-mapfile. The latter can be generated from VOMS; therefore new LDAP servers should rather be avoided.

9.1.11 LCG File Catalog and FiReMan (LFC and FC)

The LCG File Catalog is a service which maintains the physical locations of VO data identified by Logical File Names and provides access to this information. It obsoletes the RLS. It can either be centrally deployed for a VO, or each site may have a local instance per VO, listing only locally stored files. In the latter case, a central Data Location Index (DLI) or Storage Index (SI) service would map a logical file name to a list of site LFCs which each have detailed information about a replica of the file.

FiReMan is the standard data catalog for the gLite middleware. As far as deployment at a site is concerned, to first order its characteristics are similar to those of the LFC.

9.1.12 File Transfer Service (FTS)

The File Transfer Service (FTS) provides asynchronous management of file transfers and network channels, with catalog plug-ins per VO and retry policies. It uses the PX to obtain or renew the user proxy associated with a transfer. The FTS is typically deployed only at sites with very large storage systems, e.g. Tier-0 and Tier-1 sites in LCG.

9.1.13 VO Box

The VO Box is a node to run VO-specific services that are not easily implemented with standard EGEE/LCG grid components. The exact requirements and wish lists depend very much on the VO. An implementation would have to be negotiated between the VO and the site. A basis for the VO Box security infrastructure is provided: privileged VO members may log in to non-privileged accounts through their proxies, and a renewal service for user proxies is available. An example of the use of a VO Box is to support a large disk array with pre-staged data files which are then made available to jobs via an xrootd interface.

9.1.14 I/O Server (IO)

The gLite I/O Server is a front-end for a Storage Element. Its purpose is to regulate all access to the SE according to ACLs defined in the gLite FiReMan catalog.

9.1.15 DataGrid Accounting Server (DGAS)

The DataGrid Accounting Server collects accounting information per user per job.

9.2 PLANNING FOR A GRID CLUSTER – INGREDIENTS

The minimal set of services required for a grid cluster is the following:

• Computing Element (CE);
• Worker Node (WN);
• Monitor Service (MON)/R-GMA Server (R-GMA).

The following services are optional and depend on the requirements of the site or the VOs supported by the site:

• User Interface (UI);
• Storage Element (SE);
• Resource Broker (RB)/Workload Management System (WMS);
• Information Index (BDII)/Service Discovery (SD);
• Proxy Server (PX);
• VO authorization Server (VOS);
• LCG File Catalog (LFC);
• File Transfer Service (FTS);
• VO Box;
• I/O Server (IO);
• FiReMan Catalog (FC);
• DataGrid Accounting Server (DGAS).

Some services may be combined on a single node, but there are constraints due to the middleware implementation and for security reasons. Combining the following services would lead to an unusable system due to conflicting components:

• CE and BDII;
• CE and RB/WMS;
• SE and RB/WMS.

For security reasons the following services should not be combined:

• UI and any other service;
• WN and any other service;
• PX and any other service;
• VO Box and any other service.

The UI and the VO Box permit interactive logins, which would allow local security weaknesses, if any, to be exploited more easily. Similarly, a WN will run any user job, which might try to abuse local security weaknesses. The VO Box runs VO-specific middleware that has not undergone standard certification, and hence might have security weaknesses that could otherwise have been avoided. The PX contains long-lived credentials that should be very well shielded from unauthorized access.


9.2.1 Grid server requirements

9.2.1.1 Number of servers

At least two grid servers are required: one for the CE and one for the MON. In principle the associated services can run on a single node, but then the node should have at least 2 GB of memory, and even then it may easily be overloaded. More servers are required if other services are to be deployed or if backup servers are used.

9.2.1.2 Hardware

The hardware required depends on the services provided and on the size of the site. For example, a VO that is supported by the site may need each worker node to supply at least 2 GB of memory per job, or at least 10 GB of scratch space per job. A very large farm of worker nodes probably should not be handled by a single front-end computing element node, to avoid scalability problems as well as a single point of failure.

9.2.1.3 OS

The software releases are provided for Scientific Linux. These releases usually work with RHEL (and compatible) distributions. The main VOs that the site supports may require a specific distribution. If so, effort may be required to port the release to that distribution. A number of groups are actively porting the releases to different distributions. The most important point is that the OS must be secure and up to date.

9.2.2 Worker Node requirements

The worker node requirements depend on the VOs that the site supports.

The requirements of the applications which run on the grid are generally set by the requirements of the VOs. It is not possible to generalize on their requirements.

For example, there are VOs which only run one program and that program has predictable and consistent requirements. The requirements are defined by the following parameters:

• memory requirements for the application;
• disk space for the application;
• duration of an instance of the application.

Other VOs have much more complex processing requirements which may include dependencies on items such as the following:

• operating system type;
• specific versions of the operating system;
• third party software such as compilers;
• third party environmental components such as libraries;
• processor type.

There are memoranda of understanding (MoUs) with most of the VOs in EGEE. These MoUs contain information on the requirements of the VO. If a site intends to support a VO, it is generally necessary to understand the requirements of the VO and to configure the worker nodes and other services to meet these requirements.

Today, the bulk of the applications which run on the grid are well suited to a “cafeteria type” service. A cafeteria service is one where the requirements of each client are independent and can be characterized using simple criteria such as the above list of memory, disk and processing time. In the future it is likely that the applications will contain much more complex dependencies, for example on data, time, deadlines, availability, and so on.

9.2.2.1 Hardware

See §§9.2.1.2 and 9.2.3.1. If a Worker Node has more than one CPU, it will typically be allowed to run multiple jobs concurrently. In that case care must be taken that the amount of memory available on average per job satisfies the requirements of the VOs that the site supports. Typically at least 1 GB should be allocated per job slot to avoid paging. Some VOs may want to run jobs that need a few GB of memory.

9.2.2.2 OS

See §9.2.1.3.

9.2.3 Storage requirements

A job running on a worker node may need to access various kinds of storage. It needs scratch space for a work area that can be reclaimed as soon as the job has finished. It may need to access a shared area containing standard programs, libraries etc. for the VO. It may need to access a local Storage Element (SE) for input data or for short- or long-term storage of output data. It may access a remote SE for similar purposes.

9.2.3.1 Local disk

The local disk of the WN should typically allow for at least a few GB of scratch space for the job’s work area, to be reclaimed after the job has finished. The exact amount of space needed very much depends on the VO. For example, a typical job may download many GB of data files before processing any of them, and it may keep files around even when they are no longer needed. The job may also produce many GB of output data that might only be copied to an SE right before the end of the job. If a WN has more than one CPU, it will usually be configured to run more than one job at a time, with a corresponding increase in local disk space requirements.

9.2.3.2 Shared disk

It is usual for a VO to require each WN to have access to a shared file system on which the VO can store commonly used programs, libraries, configuration files etc. Such shared data can easily amount to many tens if not hundreds of GB, in particular when multiple versions of the VO software are to be available to the user jobs. The exact requirements for the amount of shared disk space are to be provided by the MoU for the VOs.

9.2.3.3 File system recommendations

Local file systems typically are of type “ext3” or “xfs”. Shared file systems typically are of type NFSv3, occasionally AFS. In the latter case the site must run a GSSKLOG server that allows certain grid users (DNs) to obtain an AFS token on presenting a valid proxy.

A site should be conservative with respect to immature file system or disk technologies, as they might lead to data corruption or losses, which may be hard to analyze due to the distributed nature of the grid.

9.2.4 Networking

The more a site's services and user jobs are shielded from the network, the better the site is protected from attackers. Due to the very nature of the grid such shielding can never be complete, but the exposure can be very much reduced. Firstly, the worker nodes should not need inbound connectivity at all. Secondly, outbound connectivity is needed, but can be constrained to a well-defined set of grid service ports on a well-defined set of network domains, by setting up appropriate firewall rules; a sketch is given below. This would very much limit the potential damage if a hacker managed to get a job to run on a Worker Node. To save on IPv4 addresses, worker nodes are often put on private networks. An interactive job would connect from the WN to an agent started on the UI at job submission, allowing commands to be sent as if the user were logged in on the WN.
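As an illustration of such firewall rules, here is a minimal sketch using Linux iptables; the service network (192.0.2.0/24) and the Globus port range are hypothetical, and real rules must be derived from the site's security policy and the port lists published with the middleware release:

    # Sketch: restrict outbound traffic from a WN (hypothetical values).
    # Allow replies on connections that are already established.
    iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    # Allow outbound GridFTP control (2811) and a fixed Globus port range,
    # but only towards a well-defined grid service network.
    iptables -A OUTPUT -d 192.0.2.0/24 -p tcp --dport 2811 -j ACCEPT
    iptables -A OUTPUT -d 192.0.2.0/24 -p tcp --dport 20000:25000 -j ACCEPT
    # Drop everything else leaving the node.
    iptables -P OUTPUT DROP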

9.2.5 Security

As usual, the system administrator of any grid service node must ensure that its operating system is up to date with the latest security patches. Similar concerns exist for the grid middleware: security patches will be explicitly announced and should be applied as appropriate. Comprehensive firewall settings would very much limit the potential damage if a node is abused in any way by a hacker.


9.3 SOFTWARE NOT INCLUDED IN GRID PACKAGE

9.3.1 Batch system

A site requires a batch system. It is up to the site which batch system to choose, but if the batch system is not directly supported by the grid middleware release, it is expected that the site invest the effort to solve any compatibility issues. Currently the grid software releases come with two reference implementations, namely PBS/Torque and LSF.

9.3.2 Cluster monitoring

There are many tools to do cluster monitoring. The site is free to choose which tools to use. The grid middleware releases currently do not come with a reference tool. A popular choice is Ganglia [FM 8].

9.3.3 OS Installation and OS updates

The site can choose how to install the OS and how to keep it up to date. It is very important that the OS is secure and up to date. The grid middleware releases provide the necessary hooks for the popular “apt” and “yum” package managers.
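As an example, a site relying on yum might schedule a nightly update via a file in /etc/cron.d; whether updates are applied automatically or first reviewed is a site policy decision, and the schedule below is only a sketch:

    # /etc/cron.d/os-update -- hypothetical nightly OS update job
    30 4 * * * root /usr/bin/yum -y update >> /var/log/yum-cron.log 2>&1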

9.4 SETUP, INSTALLATION, VERIFICATION

9.4.1 Where to install various service components?

The middleware releases are almost completely relocatable; only a few configuration files still need to be installed on the root file system. The utilities etc. by default all go under the /opt directory. For the WN and the UI a relocatable “tar ball” release is provided as an alternative to RPMs, allowing the middleware to be installed without root privileges. For WNs it is recommended to mount the middleware area from a file server (NFS, AFS) to avoid update anomalies.
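A minimal sketch of the NFS approach, assuming a hypothetical file server and mount point, and a profile script whose variable names are illustrative rather than those of any particular release:

    # Mount the middleware area exported by the site file server.
    mount -t nfs fileserver.example.org:/export/grid-mw /opt/grid-mw
    # Let every login shell pick up the middleware environment.
    cat > /etc/profile.d/grid-env.sh <<'EOF'
    export GRID_MW_BASE=/opt/grid-mw
    export PATH=$GRID_MW_BASE/bin:$PATH
    export LD_LIBRARY_PATH=$GRID_MW_BASE/lib:$LD_LIBRARY_PATH
    EOF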

9.4.2 Batch system

The batch system should ideally be installed with the default paths suggested by the batch system documentation; otherwise the middleware may have to be informed of the chosen locations explicitly.

9.4.3 Cron jobs

To simplify the overall management of cron jobs, the middleware will only add cron jobs to the standard directories /etc/cron.d, /etc/cron.daily etc. Dedicated user crontabs are left undisturbed.

9.4.4 Site verification

Site verification can be done by one or more of the following:

• the site;
• the ROC;
• the CIC operations team [FM 1].

Along with the release come instructions for manually testing and debugging the site. There is a suite of Site Functional Tests (SFTs) [FM 6] that can be used for testing the site. Once the site administrator thinks that the services are correctly installed, the site’s ROC can be informed to run the verification tests. Once the ROC has verified that the site is OK, it can be added to the grid. The site will be tested and verified continuously and problems will be reported to the ROC and the system administrator.


9.5 MONITORING AND MAINTENANCE

9.5.1 What to watch and how often?

The site administrator should look at the SFTs [FM 3] and the grid status and statistics tools [FM 5] for indications of whether the services at the site are working. The site administrator can also be proactive by using the site’s fabric monitoring infrastructure to look for common problems, e.g. full disks, crashed machines, etc.

9.5.1.1 Grid components

On nodes that run services, the probability of something going wrong is usually higher than on WNs or UIs. Services might crash, or fill the swap space, or fill a file system with log files or temporary files. A user process on a WN or UI might also fill a file system or the swap space.

9.5.1.2 The functionality of the grid services should be tested by the SFTs

In principle, the SFTs should cover, to the extent possible, all the grid services that are run by the site. In practice this means that the services are tested from a grid perspective. The SFT tests for situations such as the following:

• is the service available at all?
• is the service response time reasonable?

It is up to fabric monitoring tools to report excessive memory or file system usage.
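As an illustration, a fabric monitor could complement the SFTs with simple local probes of availability and response time; the sketch below queries a BDII, assuming the common LDAP port 2170 and base DN, which should be checked against the installed release:

    #!/bin/bash
    # Probe the site BDII for availability and a crude response time.
    start=$(date +%s)
    if ldapsearch -x -h bdii.example.org -p 2170 \
          -b "mds-vo-name=local,o=grid" -s base > /dev/null 2>&1; then
        echo "BDII up, responded in $(( $(date +%s) - start )) s"
    else
        echo "BDII query failed" | mail -s "BDII alarm" root
    fi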

9.5.1.3 Worker nodes

A WN does not run any grid service, but it can become a “black hole” for jobs if it is mis-configured. For example, if the middleware has not been correctly made available, each user job may immediately fail, leaving the CE with the impression that the WN is ready to receive another job, thereby causing a large number of jobs to fail. Commercial batch systems like LSF allow a WN to be put off-line automatically if it is observed to handle too many jobs in a short interval. Public-domain batch systems like PBS/Torque may not yet have such a facility. Here it becomes even more important to pay attention to the daily SFT results, which should expose such problems. This matter is further discussed below.

A user job on the WN is run by a wrapper that cleans up the work area in which the job was started, but not common scratch directories like /tmp. In future middleware versions each user job will run in a virtual machine started just for the job, and its work area will be a self-contained portion of the host file system that can be completely reclaimed once the job has finished. For the time being, the WNs should have /tmp etc. regularly cleaned up by a cron job. Care must be taken here: a multi-CPU WN may be running two or more concurrent jobs for the same user, so one cannot indiscriminately delete all the files owned by a user as soon as one of that user's jobs has finished.
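A conservative sketch of such a cleanup, deleting by age rather than by owner for the reason just given; the retention period of 7 days is an assumption and should be matched to the longest expected job duration:

    #!/bin/bash
    # /etc/cron.daily/clean-tmp -- hypothetical scratch-space cleanup.
    # Delete files untouched for 7 days; clean by age, not by owner, since
    # a multi-CPU WN may run several concurrent jobs for the same user.
    find /tmp -xdev -type f -atime +7 -delete
    find /tmp -xdev -mindepth 1 -type d -empty -delete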

9.5.1.4 Servers

On nodes that run grid services the fabric monitoring should be concerned with the resources used by those services: memory, disk space, file descriptors, sockets, etc. Excessive resource usage should be reported as bugs. Temporary workarounds can be established in collaboration with the ROCs and CICs.

9.5.1.5 Storage

The monitoring of storage depends very much on the MoU the site has with the VOs it supports. In general, it is difficult for a site administrator to determine if the storage is being misused or whether it should be automatically cleaned up occasionally.

9.5.1.6 Proactive grid job monitoring

Mis-configured machines, especially worker nodes, can have a nasty effect on the jobs running at a site. A particularly nasty effect is a so-called black-hole situation. This refers to what happens when a worker node is mis-configured in such a way that jobs which start on it fail immediately. When this happens, the node also becomes immediately free to accept new jobs at the start of the next scheduling cycle. Given a typical scheduling cycle of 30 seconds and a dual-CPU box, in an hour one could direct 240 jobs to such a node.

Many things can cause this. Two famous examples are:

• the machine clock is out of synch;
• home and/or software directories are on shared file systems, and the automount daemon has died.

In the second example, the job comes in, tries to mount e.g. /software/atlas to find the software, the mount fails, the job dies.

Such problems call for proactive measures, since a problem on one such node in a large farm can cause hundreds of job failures before anyone notices.

There are a number of actions one can take to try to prevent this sort of problem. The basic principle is to look for suspicious situations; depending on the severity, if a node is found in such a suspicious state, either an email is sent to the sysadmin list, or the node is taken out of the LRMS active pool.

Here is a list:

1) check important mounts, such as the following:
• home directories, if they are shared;
• grid software directories, if these are shared;
• VO software directories, if shared (almost always the case);
• any other important shared space.

These can be checked via a cron job running every few minutes. If these directories cannot be accessed, the proper action is to remove the node immediately from the LRMS pool and to send a notification email to the sysadmins.

2) check the batch-system log files for error messages on each node.

There are a couple of things one can check:

• on PBS, rejection messages from a node: these mean that the LRMS server has decided to send the job to a certain worker node, but when it tries to do so, the WN rejects the job;

• extract job run times from the LRMS log files, and look for black-hole patterns. If many short-running jobs are seen on a single node in a short period, this may be a black hole. If the jobs are all due to one single user, however, it may well be that the user has just submitted a large number of jobs with a script error to the site; the typical action here is to send an email to the system administrators rather than pull the node out of the pool, as most things that look like black holes are actually due to user error.

3) check health status from outside. There are many things one can check.

A basic place to start is to check that something is listening on important ports; if not, something is almost certainly wrong and the node should be removed from the LRMS pool immediately. A basic list to begin with is:

• sshd (port 22);
• portmapper (port 111);
• the LRMS WN daemons (ports 15002 and 15003 for PBS).

While writing this document, it was found that even this is not enough. Checking that something is listening can apparently succeed when only the kernel is still functioning correctly, while user-space operations are frozen. Situations like this can be trapped by having something on the node periodically contact a server; if the contacts stop, something is wrong. Ganglia [FM 8] can do this.
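To make items 1) and 3) concrete, here is a minimal sketch for a PBS/Torque worker node, to be run every few minutes from cron; the mount points, the mail alias, and the assumption that pbsnodes can be run from the node itself are all site-specific and hypothetical:

    #!/bin/bash
    # Hypothetical WN self-check: take the node offline on a failed check.
    NODE=$(hostname -s)
    ADMIN=root@localhost   # assumed local sysadmin alias

    fail() {
        echo "$NODE: $1" | mail -s "WN check failed on $NODE" "$ADMIN"
        pbsnodes -o "$NODE"   # mark the node offline in Torque
        exit 1
    }

    # 1) Important shared mounts must be accessible (directories assumed).
    #    Note: a hung NFS mount may block here; a watchdog is advisable.
    for d in /home /software; do
        ls "$d" > /dev/null 2>&1 || fail "mount $d not accessible"
    done

    # 3) Something must be listening on the vital ports.
    for p in 22 111 15002 15003; do
        netstat -ltn | grep -q ":$p " || fail "nothing listening on port $p"
    done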


This section is intended to provide ideas. Every system is different, and may see different failure modes. The best thing is to start with a basic list like the one above, and if one sees other failure modes that could possibly be caught and minimized via monitoring, add a script, cron job, or daemon that tests for the condition to the local monitoring suite.

9.5.2 Tool recommendations

A popular open-source cluster monitoring tool is Ganglia [FM 8]. Another possibility is Lemon [FM 9], developed and used by CERN. For fabric management the Quattor package could be considered [FM 11].

9.6 LESSONS LEARNED

Some lessons learned have already been mentioned in the sections to which they pertain. Quite a few problems and solutions have been collected on the various wiki pages, which are to be consulted early on in case of need [FM 7]. Google also has many good pointers. The front page of GGUS has a web search engine which is powered by Google but is customised to search a set of the most useful sites for grid information [FM 4].

In general, grid service nodes may need more attention than other types of nodes deployed at a site, due to the immaturity of the middleware, in particular with respect to disk space and memory management. A service node left on its own may quickly decay into an unusable state.

Site administrators are very strongly advised to:

• deploy fabric management utilities [FM 11];
• monitor the SFTs for the health of their grid services [FM 3, FM 6];
• monitor the GIIS monitor to verify the view of the site from outside [FM 5];
• monitor the rollout list for update or downtime announcements and other concerns [FM 10];
• react promptly to communications on the security contacts list;
• react promptly to GGUS tickets [FM 4].

9.7 REFERENCES

FM 1 CIC Portal: https://cic.in2p3.fr

FM 2 GOC DB: http://goc.grid-support.ac.uk/gridsite/gocdb/

FM 3 Sites Functional Tests - SFT2: https://lcg-sft.cern.ch/sft/lastreport.cgi

FM 4 Global Grid User Support: http://ggus.org

FM 5 GIIS Monitoring pages: http://goc.grid.sinica.edu.tw/gstat

FM 6 SFT2 Documentation page: http://goc.grid.sinica.edu.tw/gocwiki/Site_Functional_Tests

FM 7 GOC Wiki page: http://goc.grid.sinica.edu.tw/gocwiki/

FM 8 Ganglia: http://ganglia.sourceforge.net

FM 9 Lemon: http://lemon.web.cern.ch/lemon/index.htm


FM 10 LCG Rollout Mail List Archive: http://www.listserv.rl.ac.uk/archives/lcg-rollout.html

FM 11 Quattor: http://quattor.web.cern.ch/quattor


10 DEPLOYING ADDITIONAL SOFTWARE COMPONENTS

10.1 INTRODUCTION

This chapter of the document describes software which is not part of the middleware, but which is commonly required in order to make use of the grid. The obvious components which could be described here include:

• user interface components such as shells;
• user interactivity components such as X11;
• user development tools such as compilers and their associated libraries;
• user libraries such as parallel processing libraries;
• common application software components (such as libraries) required by more than one VO.

To limit the range of subjects, only matters pertaining to the message passing library MPI are described. The reason for describing MPI is that it is a common requirement for applications, and although running MPI is possible on the EGEE infrastructure, the solution remains incomplete.

10.2 DEPLOYING MPI

10.2.1 Why deploy MPI?

In order to really make use of the computational power of the grid, parallel computation is necessary. Unfortunately, for most computational problems this means switching to a specially coded parallel version of the program. The Message Passing Interface (MPI) library [MPI 1] has in recent years become the de facto standard for distributed parallel programming. Because the EGEE grid mainly consists of cluster architectures, shared-memory parallelisation is not very useful within EGEE. Therefore the message passing approach, as used in MPI, is in fact the only useful approach for parallel computation within the EGEE grid.

Because many scientific communities, e.g. life sciences, chemistry and climate research, depend on the ability to run MPI programs, it is important to have MPI installed on as many EGEE sites as possible.

10.2.2 Documentation on deployment

Currently the documentation available about installing MPI on EGEE sites is not complete. Furthermore, some of the steps mentioned in the documentation are not portable or are still under discussion.

At the moment the following documents are available:

A document describing the changes needed to support MPI [MPI 2]. The solution assumes a shared home file system for MPI jobs. The document can be found in the GOC wiki at Taiwan: http://goc.grid.sinica.edu.tw/gocwiki/MPI_Support_with_Torque

There is another document describing the support for MPI jobs without a shared home directory [MPI 3]. The document only describes the support for the PBS and Torque batch systems. Currently the solution also supports the LSF job manager, but the document has not yet been updated to reflect this. http://grid-it.cnaf.infn.it/fileadmin/sysadm/mpi-support/MPInotes.txt

10.2.3 Security matters

In order to be able to run MPI jobs, one has to be able to perform password-less logins between worker nodes. Commonly, ssh is used for this, although other solutions like mpiexec [MPI 4] exist. The authentication can be based either on ssh user keys or on host keys. Clearly, one should prevent ssh access to the worker nodes from the outside world, using for example tcp wrappers and/or a firewall.

When making use of user keys, a shared key should be available in the user’s home directory on all the worker nodes. The largest problem with this approach is that it is hard to prevent user jobs from modifying or copying these keys, breaking the authentication mechanism.

The second solution involves host-based authentication. Here the authentication is based on the host keys instead of the user keys. In this case, one should preferably prevent external root logins.
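A sketch of the host-based variant with OpenSSH; the file locations are the standard OpenSSH ones, but the exact setup should follow the MPI deployment notes [MPI 2, MPI 3]:

    # On each WN, in /etc/ssh/sshd_config:
    HostbasedAuthentication yes
    # On each WN, in /etc/ssh/ssh_config:
    HostbasedAuthentication yes
    EnableSSHKeysign yes
    # List the WN host names in /etc/ssh/shosts.equiv, and collect their
    # public host keys into /etc/ssh/ssh_known_hosts, e.g. with ssh-keyscan.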

10.2.4 Other matters to be considered

One of the most common issues with MPI jobs is that they need the executable and input data available on all the worker nodes on which the program will run. This is no problem when making use of a shared home file system, but when each worker node has its own home file system, the program will not run. Therefore the user will need to take some action in a job script in order to copy the files to all the worker nodes. Details can be found in the document [MPI 3].
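As an illustration, a job script might stage the files by hand along the following lines; this is a sketch for PBS/Torque, assuming the password-less ssh between WNs discussed in §10.2.3, and hypothetical file names:

    #!/bin/bash
    # Hypothetical MPI job wrapper: stage executable and input to each node.
    # $PBS_NODEFILE lists the allocated nodes, one line per job slot.
    for host in $(sort -u "$PBS_NODEFILE"); do
        ssh "$host" mkdir -p "$PWD"
        scp my-mpi-prog input.dat "$host:$PWD/"
    done
    mpirun -np "$(wc -l < "$PBS_NODEFILE")" \
           -machinefile "$PBS_NODEFILE" ./my-mpi-prog input.dat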

There is a problem with shared home file systems in the current production grid, which runs the LCG-2 software stack: I/O on the shared file system quickly becomes a bottleneck. In principle it is not very difficult, however, to move single-processor jobs to a local file system. In order to do this, the tmpdir facility of the EGEE-supplied Torque can be used. A patch to the job manager script pbs.pm is also needed to move jobs to the temporary directory [MPI 5]. In the gLite software stack this problem has been addressed: a shared file system is not required, and if the file system is not shared, it is the grid middleware, and not the user, that is responsible for dealing in a transparent way with the staging of the necessary files on the WNs.

10.2.5 Open problems

One of the problems with the methods described earlier is that they do not comply with the way MPI jobs are implemented in EGEE. Both methods propose to submit a user script as the executable. The jobtype MPICH, however, assumes that the executable specified is an MPI executable that is started as an MPI program using mpirun. At the moment, starting scripts as an MPI executable using mpirun works, but this cannot be guaranteed to work in the future.

Furthermore, the solution described in [MPI 3] requires the user to do all the copying of files. In principle the job manager should be able to do all the necessary copying. This problem has been fixed in the gLite software stack.

Another problem described in [MPI 2] is that on the LCG-2 middleware stack, the batch system must be specified as PBS. Specifying Torque will cause MPI jobs to fail. This is a bug that should be fixed. This is not a problem with the gLite software stack.

A second problem, also described in [MPI 2], is that there is no standard way to start MPI programs. One site may want to deploy mpiexec, another uses the standard MPICH mpirun, and yet another site may have a vendor-supplied MPI library that uses some vendor-specific way of starting MPI programs. The last is especially important when a special interconnect such as Myrinet® or InfiniBand® is used.

It is often the case that an MPI application must be recompiled and re-linked depending on the type of interconnect available on the cluster running the MPI program. This implies providing a development environment with the compiling and linking tools for MPI. This may have to be provided on each node.

10.2.6 Lessons learned

Support for MPI jobs within EGEE is not complete with the LCG-2 middleware stack


The conclusion can be drawn that the support for MPI jobs within EGEE is not yet complete. One of the main problems right now is that there is a conflict between the jobtype MPICH as it is used now, and the users’ need and wish to run scripts that start MPI jobs themselves. The latter allows the users to do some pre- and post-processing, while the jobtype MPICH starts the script as if it were an MPI executable.

Changes are required to the job manager to support MPICH and MULTIPLE

This problem may be alleviated slightly by changing the job manager so that it copies the user sandbox to all the worker nodes for the jobtype MPICH. To allow a user to run scripts, the jobtype MULTIPLE could be made available. This jobtype is already available within Globus and would give the users the freedom to do what is required.

Further standardisation of MPI is required

The MPI standard [MPI 1] is not under the control of the EGEE project, but we note that there are some small omissions from the standard. In the future, some standardisation is needed on the following things:

• The way to start MPI programs. In principle the Globus job manager already has some basic support for this, and allows specification of the mpirun command in the configuration file. This option is neglected by the EGEE middleware, however. For user scripts, a macro definition could be useful.

• The way to compile MPI programs. Currently MPICH defines compiler interfaces called mpif77, mpicc, mpicxx etc. Although these are part of the reference implementation for MPI, they are not part of the standard. Including them in the standard would be useful.
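For reference, the MPICH wrappers and launcher mentioned above are used as follows; the program names and the process count are examples only:

    # Compile and link against the MPI library installed on the cluster.
    mpicc -o cpi cpi.c
    mpif77 -o fpi fpi.f
    # Start 8 processes on the hosts listed in a machine file.
    mpirun -np 8 -machinefile machines ./cpi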

10.3 REFERENCES

MPI 1 MPICH: Documentation on the MPI standard and the MPICH reference implementation: http://www-unix.mcs.anl.gov/mpi/

MPI 2 Document describing support for MPI with Torque: http://goc.grid.sinica.edu.tw/gocwiki/MPI_Support_with_Torque

MPI 3 Document describing MPI jobs without a shared home directory: http://grid-it.cnaf.infn.it/fileadmin/sysadm/mpi-support/MPInotes.txt

MPI 4 mpiexec: Documentation on the mpiexec command: http://www.osc.edu/~pw/mpiexec/

MPI 5 Documentation describing the patch to PBS to support the tmpdir facility: There is currently no documentation on this. Send mail to [email protected] for more information.


11 APPENDIX – MIDDLEWARE REQUIREMENTS

11.1 INTRODUCTION

The purpose of this appendix is to document a number of operational requirements which have yet to be met in the middleware. These requirements have already been documented by the requirements capture processes in place in the EGEE project, such as the Project Task Force (PTF) and the Technical Co-ordination Group (TCG).

Many of the requirements are major and extensive and span the usual boundaries between software components. For this reason they are difficult to implement, although they are easy to express. For example, the provision of a unified administrative interface to the grid middleware is clearly a desirable feature, but its implementation may be quite complex. It is possible to implement such a tool, and many well-known software systems such as various Linux distributions have implemented such tools. These tools hide the differences between the components and present a single view which is rational and reasonable for the operator. It is generally possible to implement such a tool without requiring major changes to the software components which the tool will co-ordinate.

Another example is the requirement for standardized logging and messages. To implement this would require much more than a wrapper outside the tools being managed. To implement it in an extensive manner would require changes to all components. This is easy to describe, but will take time to implement.

The focus of all of the requirements expressed in this appendix is deployment and operation. In the past, the focus of software development has been to provide functionality to implement a grid, rather than the functionality for deployment and operation. It is expected that this will change in the future as the middleware matures and continues to become more stable and reliable. A future focus of the development of the middleware will be to address these areas documented in this appendix.

11.2 GENERAL REQUIREMENTS

1. Grid services need to have a common administration interface

There should be a common administration interface for at least:

• Grid monitoring;
• Creating alarms;
• Adding new sites/services to operation;
• Taking sites/services out of operation;
• Redirection of workflow in case of site or service difficulties.

There should be a common, extensible set of APIs so that additional plug-ins can be written easily.

2. Grid services must support the movement of the service from one platform to another

It must be possible to move a service from one node to another in such a way that the state of the service transfers with it.

3. Critical services have to be deployable in a redundant way

There must be a way to have redundant copies of critical services.

4. Grid services (and the resource broker in particular) require a shutdown mechanism

There must be a way of draining a resource broker other than waiting.

5. Critical services have to be locally deployable in a redundant way

This is closely related to the requirement to be able to move a service.


6. Grid services must enable verification of VO SLAs

Grid services must enable the measurement of quantities relevant to the verification of Service Level Agreements (SLAs). This includes which sites can run which VO, what VO software should be installed at a site, perhaps the amount of resources allowed for each VO over a certain period of time, and similar matters.

7. Grid should implement VO priority processing

Sites with different levels of commitment to different VOs need to be able to communicate this somehow to relevant services, not just to the batch system.

8. The grid middleware has to be able to deal with complex resource usage policies

Current software oversimplifies this, leading to confusion and load distribution problems. Currently almost all services are VO-aware in the sense that some configuration is needed per VO. The VO awareness should come from the authorization system and not from replicating services (paths, users, etc.) per VO. This would reduce adding a VO to the level of adding a resource usage policy for the new VO.

9. Simplify the administration of users and VO

The mechanism of mapping grid users to pool accounts needs rethinking. With more than twenty VOs and a few hundred users, this will not work. The addition or removal of a VO has to become a light-weight operation.

10. Standard file formats

All grid services must be able to use a standardized interface for error logs, log files, and accounting messages and files. The level of logging has to be adaptable to support different situations. For example, there may be different levels of detail to support debugging.

11. Full job traceability and accountability should be guaranteed

The minimal level of logging has to guarantee audit trails. The error, log and accounting messages coming from different services have to carry appropriate identifications so that they can be easily synchronized with each other. They should be post-processed (synchronized) for each job and stored in a common place so that job auditing and accounting become easy. It is necessary to take into account specific site policies regarding the amount of information a site allows to be in the public domain. There are serious security issues related to this.

12. Single interface for accounting

The grid accounting should be implemented with a single interface for all services, and the data should be independent of the underlying operating system and hardware.

13. Ability to trace the source of a job

Both the logging information and the operations API have to allow for tracing a job back to the source from which it was sent and to the person who sent it. This implies, for privacy and security reasons, that a subset of this API needs a strong authentication and authorization mechanism.

14. Grid services have to be deployable in a scalable and redundant way

All grid services have to be designed in such a way that they can be deployed in a scalable way. For example: if a single instance of a particular service on a site gets overloaded, the administrator has to be able to add another instance. This has to be transparent to the users. A good example is the DNS-based gridFTP service for CASTOR.

15. Allow services to break

It is understood that there are services which have to be reliable, persistent, safe and efficient, but which do not lend themselves to being distributed. However, for such services, a clear strategy has to be found to allow continuous operation even if this leads to some manual handling of consistency problems when the service has to migrate from one platform to another.


For example: if a central catalogue with write functionality is the only feasible way for jobs to catalogue files, then the jobs should not fail if the catalogue is temporarily not available. A solution such as asynchronous registration should be available without the user or the application being aware of this.

Another example: if the information system top node (BDII) has to be contacted and the connection fails, there should always be a possibility to provide a list of alternate services that are tried before the call finally fails.
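A client-side sketch of such a fallback, trying a hypothetical list of alternate top-level BDIIs in turn until one answers:

    #!/bin/bash
    # Hypothetical fallback: query alternate BDIIs until one responds.
    for bdii in bdii1.example.org bdii2.example.org bdii3.example.org; do
        if ldapsearch -x -h "$bdii" -p 2170 \
              -b "mds-vo-name=local,o=grid" -s base > /dev/null 2>&1; then
            echo "using $bdii"
            break
        fi
    done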

16. Services have to be able to handle unusual, exceptional or non-standard situations gracefully

Examples include services that are tolerant of clients using out-of-date protocols, that receive corrupted data, that receive failures from other services refusing an operation, or that suffer resource exhaustion (CPU, disk). In all of these situations, the service should behave in such a way that it remains operative in some reasonable, if limited, sense.

17. Grid software needs to support heterogeneous clusters

The system has to be able to handle heterogeneous clusters as batch systems. Most larger sites have a history of running batch services and, as a result, a set of very different nodes. Running each of these as an individual cluster is not an efficient solution from the fabric management point of view. Since most LRMSs (batch systems) can handle resource requirements for individual jobs, it would be very advantageous to make use of this through the grid.

18. Grid middleware components have to keep fine granularity

The administrator has to be able to replace an old component with an improved one while minimizing the disturbance to the installation. This can be achieved only if the development keeps the granularity of middleware components as fine as possible.

11.3 INSTALLATION AND CONFIGURATION REQUIREMENTS

19. Installation and configuration must be two separate independent steps

This is necessary because some sites may have their own installation tools and need to be free to integrate the grid middleware into those tools. This may not be possible if the grid configuration step is part of the grid middleware installation.

20. A simple, tool-independent mechanism of installation and configuration has to be provided with middleware releases

21. Middleware installation should be packaged for the platform

The grid middleware release must be provided in the operating system’s native format and in a tar format, both as source and binary packages.

22. All packages to be installed as part of the grid middleware should be relocatable

This requirement applies especially to those components which are installed on a user interface (UI) or on a worker node (WN). Grid middleware cannot interfere with the standard site installations of UIs and WNs.

23. Grid middleware must support different site network configurations

Sites must be allowed to organize the network as they wish for both internal and external connectivity. Local administrative tools and firewalls must be supported without special constraints. WNs must not require outgoing connectivity (but may make use of it if it exists). The same applies to inbound connectivity.

24. Middleware installation and configuration should not require root access

The installation and configuration of the grid middleware should only require root or privileged user access with good justification. Today, the UI and WN do not need root. Some other services do and will continue to do so, as some services need to run as a privileged user, for instance if they have to change UIDs. It should be clearly defined when the usage of root is indispensable.

25. The middleware release should contain the set of standard configuration files

This will considerably simplify the installation at small sites that use the middleware in a standard way. The standard files will also serve as an example for more complex installations.

26. Classify configuration files to simplify configuration

Classify the configuration parameters into:

• those that must be changed for each site installation;
• those that might be changed at some time;
• those that are rarely changed (“when the person knows what he/she is doing”).

Always give valid sensible defaults for each and every installation parameter. The meaning of every installation parameter has to be clearly specified in the release documentation for each service.
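A sketch of what such a classified configuration file might look like; the variable names are modelled loosely on the LCG site configuration conventions but are assumptions here:

    # --- Must be changed for each site installation ---
    SITE_NAME=EXAMPLE-SITE        # no sensible default is possible
    CE_HOST=ce.example.org
    # --- Might be changed at some time ---
    VOS="atlas cms dteam"         # default: the standard set of VOs
    # --- Rarely changed (expert use only) ---
    BDII_PORT=2170                # default matches the release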

11.4 DEVELOPMENT REQUIREMENTS

27. The grid middleware should not depend on any particular version of external software such as perl, python and openssl

This requires developer discipline to avoid using the latest and greatest features of new external software releases. Experience shows that this dramatically simplifies grid middleware installation and maintenance.

28. Usage of different versions of the same external package has to be avoided

Absolutely avoid using several different versions of external packages (libraries, tools), even for different services. This is a no-go; it creates a real nightmare for the installation (different services may be required to run on the same physical hardware).

29. The grid middleware should use a minimal number of languages

It is the current experience that when more languages are used for writing grid middleware, the deployment task gets much harder. It should be entirely adequate to use one compiled language such as C/C++, one scripting language such as python or perl, and scripts written in the bash shell. The usage of C/C++ is a very important issue because these compilers exist for probably all operating systems that could potentially be used for grid deployment; they are well debugged (together with the accompanying libraries), and if consistently used they will make middleware maintenance much easier. They also do not bring with them as much external software as other compiled languages often do. This should be considered not as an impossible dream request but as a hard, critical issue.

30. Grid middleware required on UIs and WNs should be reduced to an absolute minimum

Grid middleware has to avoid interference with the standard site installation as much as possible. WNs and UIs should therefore be simple, easily installable and trivially portable.

31. Grid development must provide bug fixes for all middleware versions actively used in the field deployment

It is unrealistic to expect all sites to run the same version of grid middleware at all times, and so when fixes are provided, they must be available for all of the versions which are in use in the field. Usually there are two in use, and occasionally three.


12 TABLES OF REFERENCES AND GLOSSARY

Table 3: References

Section References

§1.6 Introduction

INTRO 1 Details on the copyright holders of project EGEE http://public.eu-egee.org/partners/

INTRO 2 Project web site containing details of the project, its partners and contributors http://www.eu-egee.org

INTRO 3 DSA1.7 Infrastructure Planning Guide ("Cook-book") https://edms.cern.ch/document/489462

INTRO 4 JRA2 Document management procedure http://egee-jra2.web.cern.ch/EGEE-JRA2/Procedures/DocManagmtProcedure/DocMngmt.htm

INTRO 5 JRA2 Glossary of EGEE terms http://egee-jra2.web.cern.ch/EGEE-JRA2/Glossary/Glossary.html

§4.4 Architecture

ARC 1 LCG File Catalog (LFC) Administrator's Guide, https://edms.cern.ch/document/579088

ARC 2 LB Service User's Guide https://edms.cern.ch/document/571273

ARC 3 gLite Architecture https://edms.cern.ch/document/476451

ARC 4 DPM Administrator's Guide https://edms.cern.ch/document/591600

ARC 5 Pool Of Persistent Objects for LHC http://lcgapp.cern.ch/project/persist

ARC 6 NA5 - Policy And International Cooperation http://public.eu-egee.org/activities/na5_details.html

ARC 7 User’s Guide for the VOMS Core Services https://edms.cern.ch/document/571991

ARC 8 Rome Compute Resource Management Interfaces Initiative http://www.pd.infn.it/grid/crm

ARC 9 EGEE User’s Guide https://edms.cern.ch/document/572406

ARC 10 EGEE gLite User’s Guide - Overview Of gLite Data Management https://edms.cern.ch/document/570643

ARC 11 EGEE gLite User’s Guide - gLite I/O https://edms.cern.ch/document/570771

ARC 12 User’s Guides for the DGAS Services https://edms.cern.ch/document/571271

ARC 13 VOMS admin user's guide https://edms.cern.ch/document/572406


ARC 14 JDL Attributes http://server11.infn.it/workload-grid/docs/DataGrid-01-TEN-0142-0_2.pdf

ARC 15 JDL Attributes Specification https://edms.cern.ch/document/555796

ARC 16 WMS User’s Guide https://edms.cern.ch/document/572489

ARC 17 User Guide For Edg Replica Manager 1.5.4 http://cern.ch/edg-wp2/replication/docu/r2.1/edg-replica-manager-userguide.pdf

ARC 18 Developer Guide For Edg Replica Manager 1.5.4 http://cern.ch/edg-wp2/replication/docu/r2.1/edg-replica-manager-devguide.pdf

ARC 19 LCG Data Management Documentation https://uimon.cern.ch/twiki/bin/view/LCG/DataManagementDocumentation

ARC 20 Fireman Catalogue User Guide https://edms.cern.ch/document/570780

ARC 21 Service Discovery User Guide https://edms.cern.ch/document/578147

ARC 22 gLite Release 1 Web Page http://hepunx.rl.ac.uk/egee/jra1-uk/glite-r1

ARC 23 JP usage guide http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/jp_usage.shtml

ARC 24 File Transfer Service User Guide https://edms.cern.ch/document/591792

ARC 25 Grid Physics Network (GriPhyN) Web Page http://www.griphyn.org

ARC 26 International Virtual Data Grid Laboratory (iVDgL) Web Page http://www.ivdgl.org

ARC 27 LHC Computing Grid (LCG) - Web Page http://lcg.web.cern.ch/LCG

ARC 28 Particle Physics Data Grid (PPDG) Web Page http://www.ppdg.net

ARC 29 Baseline Services Working Group Report http://lcg.web.cern.ch/LCG/PEB/BS/BSReport-v1.0.pdf

ARC 30 Global Grid User Support http://ggus.org

ARC 31 CE Mon http://grid.pd.infn.it/cemon/field.php

ARC 32 BLAH http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/ce_blahp.shtml

ARC 33 CREAM CE for gLite http://grid.pd.infn.it/cream/field.php

§6.4 Security

SEC 1 Globus Alliance http://www.globus.org

SEC 2 Globus Toolkit Security (GSI) http://www-unix.globus.org/toolkit/security/

SEC 3 Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile http://www.ietf.org/rfc/rfc3280.txt

SEC 4 Trust Manager: certificate validator for Java services http://hep-project-grid-scg.web.cern.ch/hep-project-grid-scg/trustmanager.html

SEC 5 EGEE Global Security Architecture https://edms.cern.ch/document/487004

SEC 6 JRA3 Security Documentation http://egee-jra3.web.cern.ch/egee-jra3/index.html

SEC 7 The European Policy Management Authority for Grid Authentication in e-Science http://www.eugridpma.org

SEC 8 Kerberos Leveraged PKI http://www.citi.umich.edu/projects/kerb_pki/

SEC 9 Fermilab PKI http://security.fnal.gov/pki/

SEC 10 LCG/EGEE Joint Security Policy Group http://cern.ch/proj-lcg-security

SEC 11 JSPG document repository https://edms.cern.ch/cedar/plsql/navigation.tree?cookie=4134107&p_top_id=1763291383&p_top_type=P&p_open_id=1412060393&p_open_type=P

SEC 12 Grid Acceptable Usage Rules https://edms.cern.ch/document/428036/1

SEC 13 LCG User Guide https://edms.cern.ch/file/454439//LCG-2-UserGuide.html

SEC 14 LHC Computing Grid Project (LCG) http://cern.ch/lcg

SEC 15 Virtual Organisation Management Registration Service (VOMRS) http://computing.fnal.gov/docs/products/vomrs/

SEC 16 Internet X.509 Public Key Infrastructure (PKI) Proxy Certificate Profile http://www.ietf.org/rfc/rfc3820.txt

SEC 17 Myproxy Credential Management Service http://grid.ncsa.uiuc.edu/myproxy/

SEC 18 EGEE site registration process https://edms.cern.ch/document/503198

SEC 19 Grid Operations Centre - Database http://goc.grid-support.ac.uk/gridsite/operations/

SEC 20 EGEE “catch-all” CA, CNRS GRID-FR CA https://igc.services.cnrs.fr/GRID-FR/english

SEC 21 LCG “catch-all” CA, CERN http://lcg.web.cern.ch/LCG/catch-all-ca/

SEC 22 Risk Analysis - Joint Security Policy Group http://proj-lcg-security.web.cern.ch/proj-lcg-security/RiskAnalysis/risk.html

SEC 23 Incident Response Handling Guide https://edms.cern.ch/document/428035

§7.5 Grid operations and support

ACC 1 DSA1.3 - Accounting and reporting web site publicly available. https://edms.cern.ch/document/489455

ACC 2 EGEE view by the CESGA team in SWE federation: http://www.egee.cesga.es/EGEE-SA1-SWE/accounting/reports/

ACC 3 LHC reporting pages: http://www.goc.grid-support.ac.uk/gridsite/accounting/tree/tier1view.php

ACC 4 GridPP reporting pages: http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.php

CM 1 Main page for the certificate monitor http://goctest.grid-support.ac.uk/gridsite/monitoring/tree/CERTView.php

GOC 1 Main page for the GOC http://goc.grid-support.ac.uk/gridsite/gocmain/

GOCDB 1 Point of entry to the GOC DB: http://goc.grid-support.ac.uk/gridsite/gocdb

GOCDB 2 MySQL database: http://www.mysql.com

GOCDB 3 GridSite software: http://www.gridsite.org/

GOCMaps 1 The home page for the GOC maps http://goc.grid-support.ac.uk/googlemaps

GOCMaps 2 The home page for the SFT monitor map http://goc.grid-support.ac.uk/googlemaps/sft.html

GOCMaps 3 The home page for the RGMA service monitor map http://goc.grid-support.ac.uk/googlemaps/rgma.html

GOCMaps 4 The home page for the Google map service http://maps.google.com/

GOCMaps 5 Recipes to build Google maps for the GOC http://goc.grid.sinica.edu.tw/gocwiki/GoogleMapHowTo

OMC 1 EGEE SA1 Execution Plan https://edms.cern.ch/document/489453

OMC 2 Document repository for OMC http://egee-docs.web.cern.ch/egee-docs/list.php?dir=./&

RGMA 1 Description of the RGMA Monitor Architecture http://goc.grid.sinica.edu.tw/gocwiki/RgmaMonitorArchitecture

RGMA 2 Service Discovery User Guide https://edms.cern.ch/document/578147

RGMA 3 gLite Release http://hepunx.rl.ac.uk/egee/jra1-uk/glite-r1

SFT 1 Operations manual for COD

SFT 2 SFT2 Documentation page: http://goc.grid.sinica.edu.tw/gocwiki/Site_Functional_Tests

SFT 3 Typical report for production grid: https://lcg-sft.cern.ch/sft/lastreport.cgi

SFT 4 Typical report for SE grid: http://grid-se.marnet.net.mk/sft/lastreport.cgi

SFT 5 Typical report for SW grid: http://mon.egee.cesga.es/sft/lastreport.cgi

SFT 6 SFT suite list: https://lcg-sft.cern.ch/sft/sftestcases.html

SFT 7 SFT Historical Metrics https://lcg-sft.cern.ch/sft/metrics.html

§8.7 User support

User 1 Support Model: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/9100_GGUS_Support_Model.pdf

User 2 General description of the role of a ROC in GGUS: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/2000_FAQ_for_ROC.pdf

User 3 Location for the documents which describe each of the ROCs: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\documentation\pdf\&

User 4 Description of the role of the CIC in GGUS: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/3900_FAQ_for_CIC.pdf

User 5 General description of the role of the VO in GGUS: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/4000_FAQ_for_VO.pdf

User 6 Location of the documents which describe each of the VOs: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\documentation\pdf\&

User 7 General description of the role of the SU in GGUS: http://egee-docs.web.cern.ch/egee-docs/support/documentation/pdf/7000_FAQ_for_SU.pdf

User 8 Location of the documents which describe each of the SUs: http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\documentation\pdf\&

User 9 Main documentation index on GGUS portal https://gus.fzk.de/pages/docu.php

User 10 User support monthly usage report http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\usage\&

User 11 User support weekly usage report http://egee-docs.web.cern.ch/egee-docs/list.php?dir=.\support\usage\weekly\&

§9.7 Fabric management

FM 1 CIC Portal: https://cic.in2p3.fr

FM 2 GOC DB: http://goc.grid-support.ac.uk/gridsite/gocdb/

FM 3 Sites Functional Tests - SFT2: https://lcg-sft.cern.ch/sft/lastreport.cgi

FM 4 Global Grid User Support: http://ggus.org

FM 5 GIIS Monitoring pages: http://goc.grid.sinica.edu.tw/gstat

FM 6 SFT2 Documentation page: http://goc.grid.sinica.edu.tw/gocwiki/Site_Functional_Tests

FM 7 GOC Wiki page: http://goc.grid.sinica.edu.tw/gocwiki/

FM 8 Ganglia: http://ganglia.sourceforge.net

FM 9 Lemon: http://lemon.web.cern.ch/lemon/index.htm

FM 10 LCG Rollout Mail List Archive: http://www.listserv.rl.ac.uk/archives/lcg-rollout.html

FM 11 Quattor: http://quattor.web.cern.ch/quattor

§10.3 Deploying additional software components

MPI 1 MPICH: Documentation on the MPI standard and the MPICH reference implementation: http://www-unix.mcs.anl.gov/mpi/

MPI 2 Document describing support for MPI with Torque: http://goc.grid.sinica.edu.tw/gocwiki/MPI_Support_with_Torque

MPI 3 Document describing MPI jobs without a shared home directory: http://grid-it.cnaf.infn.it/fileadmin/sysadm/mpi-support/MPInotes.txt

MPI 4 mpiexec: Documentation on the mpiexec command: http://www.osc.edu/~pw/mpiexec/

MPI 5 Documentation describing the patch to PBS to support the tmpdir facility: There is currently no documentation on this. Send mail to [email protected] for more information.

Table 4: Glossary of terms

Acronym      Meaning
AAA          Authentication Authorization Accounting
ACL          Access Control List
AFC          Administrative Federation Committee
AFM          Administrative Federation Meeting
AGM          Architecture Group Member

ALICE        LHC physics experiment
AliEn        Grid for the ALICE experiment
AM           Activity Manager
APEL         LCG Accounting Application
API          Application Programming Interface
ARDA         Architectural Roadmap Toward Distributed Analysis
ATF          Architecture Task Force (replaced by PTF)
ATLAS        LHC physics experiment
AWG          Application Working Group
BaBar        B and B-bar experiment
BAR          Bandwidth Allocation and Reservation
BCU          Biocomputing Unit
BDII         Berkeley Database Information Index
BI           Bioinformatics
BioMed       BioMedical VO
BMI          Biomedical Informatics
CA           Certification Authority
CA           Consortium Agreement
CB           Collaboration Board
CC           Cost Claims
CDF          Collider Detector Experiment at Fermilab
CE           Compute Element
CERN         European Organisation for Nuclear Research
CHEP         Computing in High Energy Physics
CIC          Core Infrastructure Centre
CIS          Core Infrastructure Services
CM           Cluster Manager
CMS          LHC physics experiment
COD          CIC-on-duty
COMPASS      Physics experiment at CERN
CRL          Certificate Revocation List
CT           Certification and Testing Team
CVS          Concurrent Versions System
CyGRID       Cyprus Grid
D-CH         German-Swiss Federation
D0           DØ Experiment
DAGS         Directed Acyclic GraphS (workflow software from the Condor project)
DAIS         Data Access and Integration Services
DataGrid     European DataGrid
DCS          Dynamic Connectivity Service
DEISA        Distributed European Infrastructure for Supercomputing Applications
diffserv     Differentiated Services
DN           Distinguished Name
DoS          Denial of Service
DPM          Disk Pool Manager
Dteam        Deployment team

e-IRGSP      e-Infrastructure Reflection Group Support Project
EAC          External Advisory Committee
EC           European Commission
EDG          European DataGrid
EDMS         Engineering Data Management Service (CERN document management tool)
EELA         Extending EGEE to Latin America
EGAAP        EGEE Generic Application Advisory Panel
EGEE         Enabling Grids for E-sciencE
EGO          European Grid Organisation
eIRG         e-Infrastructure Reflection Group
EIS          Experiment Integration Support
EL           Enterprise Linux
EMBnet       European Molecular Biology network
EMT          Engineering Management Team
ENOC         EGEE-II Network Operation Centre
EO           Earth Observation
ERA          European Research Area
ESC          EGEE Executive Support Committee
ESR          Earth Science Research
ESUS         Experiment Specific User Support
EU           European Union
EUGridPMA    European Grid Authentication Policy Management Authority for e-Science
FAQ          Frequently Asked Questions
FNAL         Fermi National Accelerator Laboratory
FR           Federation representative
FR           French Federation
FTE          Full Time Equivalent
FTS          File Transfer Service
FZK          Forschungszentrum Karlsruhe
GAE          HEP software
GAG          Grid Application Group
GAP          Gender Action Plan
GD           Grid Deployment
GE           Gigabit Ethernet
GEANT        European Academic Network
GENIUS       Grid Enabled web eNvironment for site Independent User job Submission
GFAL         Grid File Access Library
GGF          Global Grid Forum
GGUS         Global Grid User Support
GigE         Gigabit Ethernet
GIIS         Grid Information Index Server
GILDA        Grid INFN Laboratory for Dissemination Activities
gLite        Codename of the middleware software suite developed by JRA1
GLUE         Grid Laboratory Uniform Environment
GN2          Codename for GEANT-2 (the successor to the GEANT network)
GOC          Grid Operations Centre

GOCDB        Grid Operations Centre Database
Grid3        The Grid3 Project in the USA
GridICE      Grid monitoring software
GridKA       Grid computing centre at Karlsruhe in Germany
GUID         Grid Unique IDentifier
GSC          Grid Support Center
GST          Grid Services Testing
GT           Globus Toolkit
GUT          Grid Unit Testing
HEP          High Energy Physics
HepCAL       HEP Application Grid requirements
I3           Integrated Infrastructure Initiative
IA64         Intel Architecture-64
IAG          Israeli Academy Grid
ISO 9001     International Organization for Standardization: quality assurance standard
IST          Information Society Technologies
IT           Italian Federation
JRA          Joint Research Activity
JRA1         EGEE Middleware Re-engineering and Integration activity
JRA2         EGEE Quality Assurance activity
JRA3         EGEE Security activity
JRA4         EGEE Network Services Development activity
JSPG         Joint (EGEE/LCG) Security Policy Group
KA           Karlsruhe
KCA          Kerberized Certificate Authority
L&B          Logging and Bookkeeping software
LAN          Local Area Network
LCG-0/1/2    LHC Computing Grid middleware releases 0/1/2
LCG          LHC Computing Grid
LDAP         Lightweight Directory Access Protocol
LFC          LHC File Catalogue
LFN          Logical File Name
LHC          Large Hadron Collider
LHCb         LHC physics experiment
LRC          LCG Resource Catalogue
MDS          Monitoring and Discovery System
MoU          Memorandum of Understanding
MPI          Message Passing Interface software
MSS          Mass Storage System
MW           Middleware
NA           Networking Activity
NA1          EGEE Project Management activity
NA2          EGEE Dissemination and Outreach activity
NA3          EGEE User Training and Induction activity
NA4          EGEE Application Identification and Support activity
NA5          EGEE International Cooperation activity

NE           Northern Europe Federation
NEG          Northern European Grid
NGIS         National Grid Initiatives
NOC          Network Operation Centre
NPM          Network Performance Monitoring
NREN         National/Regional e-Network
OAG          Operations Advisory Group
OCC          Operations Coordination Centre
OGSA         Open Grid Services Architecture
OGSI         Open Grid Services Infrastructure
OMC          Operations Management Centre
OMII         Open Middleware Infrastructure Institute
Oracle       Database
OSCT         Operational Security Coordination Team
OSG          Open Science Grid
OSI          Open Source Initiative
PBS          Product Breakdown Structure
PD           Project Director
PDG          Protein Design Group
PEB          Project Executive Board
PIC          Policy and International Cooperation
PKI          Public Key Infrastructure
PM           Person Month / Project Month
PMA          Policy Management Authority
PMB          Project Management Board
PO           Project Office
POOL         HEP software
PPS          Pre-Production Service
PPT          Project Progress Tracking tool
PR           EU Periodic Reports
PR           Public Relations
PROOF        HEP software
PTD          Project Technical Director
PTF          Project Technical Forum
QA           Quality Assurance
QAG          Quality Assurance Group
QAM          Quality Assurance Management
QAR          Quality Assurance Representative
QI           Quality Indicator
QoS          Quality of Service
QR           EU Quarterly Report
QUATTOR      Administration Toolkit for Optimising Resources
R-GMA        Relational Grid Monitoring Architecture
RAID         Redundant Array of Inexpensive Disks
RB           Resource Broker
RBAC         Role-Based Access Control

RC           Resource Centre
RDIG         Russian Data Intensive Grid
Remedy       Problem tracking software for helpdesks
RH           Red Hat
RLS          Replica Location Service
RM           Release Manager
ROC          Regional Operations Centre
RoGRID       Romanian GRID
RSS          Really Simple Syndication
RTAG         Requirements Technical Assessment Group
RU           Russian Federation
SA           Specific Service Activity
SA1          EGEE Operation and Management activity
SA2          EGEE Network Resource Provision activity
SCG          Security Coordination Group
SCM          Software Configuration Management
SE           Storage Element
SEAL         HEP software
SEE          South East Europe
SEEGRID      South East European Grid e-Infrastructure Development
SEEREN       South East European Research Networking
SFT          Site Functional Test
SGM          Security Group Member
SIMBA        CERN mailing list management tool
SIPS         Site Integrated Proxy Services
SL           Scientific Linux
SLA          Service Level Agreement
SLR          Service Level Request
SLS          Service Level Specification
SM           Software Manager
SOA          Service Oriented Architecture
SPI          CERN Software Process Infrastructure
SRB          Storage Resource Broker
SRM          Storage Resource Management
SSA          Special Support Action
SU           Support Unit
SURL         Storage Uniform Resource Location
SuSE         Linux distribution
SW           Software
SW           South West Europe Federation
SweGrid      Swedish national Grid resource for science and research
TPM          Ticket Process Management
UIG          User Information Group
UK-I         UK-Ireland Federation
UML          Unified Modeling Language

US           United States of America
VV           Verification and Validation activity
VDT          Virtual Data Toolkit
VO           Virtual Organisation
VOMS         VO Management Service
WAN          Wide Area Network
WBS          Work Breakdown Structure
WCIT         World Congress on Information Technology
WLM          WorkLoad Management software
WMS          Workload Management Service
WN           Worker Node
WSRF         Web Services Resource Framework
XML          eXtensible Markup Language
YAIM         Yet Another Installation Method

13 INDEX

Lessons learned

CIC, 88

Fabric Management, 113

GGUS, 101

GOC, 91

MPI, 116

OMC, 87

PPS, 88

ROC, 90

Security, 58

References, 30, 59, 92, 104, 113, 117