Operations Automation Strategy

EGEE-II INFSO-RI- 031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks James Casey GDB, July 2008 Operations Automation Strategy


Transcript of Operations Automation Strategy

Page 1: Operations Automation Strategy

EGEE-II INFSO-RI-031688

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE and gLite are registered trademarks

James Casey

GDB, July 2008

Operations Automation Strategy

Page 2: Operations Automation Strategy


Overview

• EGEE MSA1.1: Operations Automation Strategy
  – Due end of PM1
  – Delivered mid-June
  – In review – comments welcome

• https://edms.cern.ch/document/927171/1

• Abstract: In EGEE-III, within the SA1 activity, a group called the ‘Operations Automation Team’ was formed with the task of coordinating operational tools and their development, with the specific goal of advising on the strategic directions to take in terms of automating the operations effort. This will entail replacing manual processes with automated ones in order that the overall staffing level of operations can be significantly reduced in a long-term, sustainable infrastructure.

This document outlines a strategy for achieving this automation using an integration architecture based on messaging. It describes how current tools and processes, such as operational alarming and ticketing will evolve during the lifetime of EGEE-III and lays out a roadmap for this evolution.


Page 3: Operations Automation Strategy


Overview

• Focus on
  – Documenting the current model and issues with it
  – What is the future model?
  – How does this impact current tools?
  – How do we make the tools support this new model?

• Initially restrict (due to time to deliver) to
  – Distributed monitoring at ROC and Site (e.g. SAM, Fabric monitoring)
  – Information Model

• Follow up with
  – Accounting
  – Reporting
  – SLA/SLDs
  – Configuration management


Page 4: Operations Automation Strategy


Document Outline

1. Introduction

2. Executive summary

3. Project constraints on operational tools during EGEE-III

4. Description of core operational Tools

5. Current operations model

6. Outstanding issues arising from current operations

7. Future operational model

8. Architectural principles

9. Information Architecture

10. Tool integration architecture

11. Sharing system management tools

12. Roadmap for integration and deployment


Page 5: Operations Automation Strategy


Core Operational Tools

• Grouped into 5 general areas
  – Provision of information about resources
  – Grid Monitoring and Reporting
  – Grid Accounting and Reporting
  – User support
  – Follow-up of alarms created by monitoring systems


[Figure: core operational tools grouped by area. Area labels: Repositories of Information; Accounting; Monitoring; Ticket Followup; Reporting; Alarms; User Support. Tool labels: GOCDB, Operations Portal; APEL, Accounting Enforcement Portal; SAM, GStat; Operations Dashboard; GridView; Accounting Portal; Site Fabric Monitoring; GGUS.]

Page 6: Operations Automation Strategy


Current Operational Model

• Several teams involved
  – Operations Management (OCC)
  – Monitoring system operators (SAM)
  – Grid operators (COD)
  – Regional Operations Centres (ROC)
  – First line support teams (ROC)
  – Resource Centres/sites (RC)
  – User support team (GGUS)


[Figure: team hierarchy with four levels (Management, Central Operational Teams, Regional, Site-level). Boxes shown: OCC, COD, SAM, GGUS, three ‘ROC / 1st Line support’ boxes, and several RCs.]

Page 7: Operations Automation Strategy


Current operational model(s)


[Figure: two alarm-handling flows, captioned ‘Alarms handled by the COD operator’ and ‘Alarms handled directly by the 1st line support’. Elements shown: Site, SAM, Alarm, Operations Dashboard, Regional Dashboard, ROC 1st Line Support Team, Operations Team (COD), GGUS Ticket, ROC, and an ‘After 24 Hours’ step.]

Page 8: Operations Automation Strategy


Future operational model


[Figure: future alarm flow. Elements shown: Site; Alarm from site or regional monitoring; Regional Dashboard; 1st Line Support; r-COD Team; Local Ticket; Escalation; Central Dashboard; c-COD Team; GGUS Ticket; Provide support to fix problem.]

Page 9: Operations Automation Strategy


Abstract Information Model

• Data providers are entities that are used as a source of information, primary or not

• Service providers use this information to give a service to a set of consumers

• Consumers use the data which comes from the service providers

• Primary data provider: the authoritative source for entities and/or relations between these entities.

• Derived data provider: a service that creates new information out of information provided by primary data providers.
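
The roles above can be sketched in a few lines of Python. This is only an illustration of the primary / derived / service / consumer split: the class names, the sample site record and the pass/fail counts are invented for the example and do not come from the strategy document.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PrimaryDataProvider:
    """Authoritative source for entities (hypothetical in-memory stand-in)."""
    records: Dict[str, dict]

    def get(self, key: str) -> dict:
        return self.records[key]


@dataclass
class DerivedDataProvider:
    """Creates new information out of a primary provider's data."""
    source: PrimaryDataProvider
    derive: Callable[[dict], dict]

    def get(self, key: str) -> dict:
        return self.derive(self.source.get(key))


@dataclass
class ServiceProvider:
    """Uses provider data to give a service to consumers."""
    provider: DerivedDataProvider

    def report(self, key: str) -> str:
        data = self.provider.get(key)
        return f"{data['site']}: {data['ok_fraction']:.0%} of tests passed"


# Invented sample data: one site with pass/fail test counts.
topology = PrimaryDataProvider({"SITE-A": {"site": "SITE-A", "passed": 9, "failed": 1}})
metrics = DerivedDataProvider(
    topology,
    lambda r: {"site": r["site"], "ok_fraction": r["passed"] / (r["passed"] + r["failed"])},
)
service = ServiceProvider(metrics)

# The consumer only ever talks to the service provider.
summary = service.report("SITE-A")
print(summary)
```

Keeping the derivation step behind its own interface means a consumer never needs to know whether a value is authoritative or computed, which is the point of the distinction drawn on this slide.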


Page 10: Operations Automation Strategy


Primary Data Providers

• GOCDB
  – The GOCDB is primary for the Grid infrastructure groups and services along with their relation to users and general info, e.g. lists of administrators for sites, geographical location of site.

• CIC DB
  – Primary for VO Cards which describe a Virtual Organisation and their relations to users and services.

• BDII Information System
  – Grid infrastructure groups, e.g. services at a site
  – Detailed information about services, e.g. endpoints for grid services
  – Relationships between services and VOs and user groups, e.g. access control rules for services

• VO information providers
  – Currently VOs provide attributes about sites and services, such as the list of services that a VO wants to use and the pledged resources they want made available to them.


Page 11: Operations Automation Strategy


Secondary Data Providers

Tool | Primary data | Derived data
SAM | Grid topology | Services metrics
SAM DB | Services metrics | Services status
Gridview | Services status, grid topology | Site availability
Accounting Portal | Accounting records, grid topology, VO info | Accounting reports
FCR | Services status, grid topology | List of working services for a VO
GStat | BDII | Site BDII summary
Operations Portal | Site & ROC test information | Site & ROC reports
Operations Portal | VOMS servers endpoint in VOCard | List of VOMS users
Operations Dashboard | Status, site information | Tickets
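
Because each tool's derived data is another tool's input, the table can be read as a dependency chain. The sketch below encodes a few of the rows and searches for the tools that sit between one kind of data and another. The data structure and the function name are choices made for this illustration, and the search treats any single input as sufficient to run a tool, which is a simplification.

```python
from collections import deque

# (tool, data consumed, data produced), taken from four rows of the table.
# Splitting multi-valued cells into lists is a choice made for this sketch.
ROWS = [
    ("SAM", ["grid topology"], ["services metrics"]),
    ("SAM DB", ["services metrics"], ["services status"]),
    ("Gridview", ["services status", "grid topology"], ["site availability"]),
    ("FCR", ["services status", "grid topology"], ["list of working services for a VO"]),
]


def derivation_path(start, goal):
    """Breadth-first search for the shortest tool chain turning `start` into `goal`."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        data, tools = queue.popleft()
        if data == goal:
            return tools
        for tool, inputs, outputs in ROWS:
            if data in inputs:
                for out in outputs:
                    if out not in seen:
                        seen.add(out)
                        queue.append((out, tools + [tool]))
    return None  # no chain of tools links the two kinds of data


path = derivation_path("services metrics", "site availability")
print(path)
```

A lookup like this shows which derived products are affected when an upstream tool changes or moves to a regional deployment.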


Page 12: Operations Automation Strategy


Service Providers

Tool | Service provided
SAM | Monitor grid services
SAM | Provide interface to monitoring data
SAMAP | Allow real time scheduling of SAM tests
GridView | Calculate site availability
GridView | Report metrics about site availability and reliability
APEL | Provide and process accounting data
Accounting Portal | Report metrics about site accounting
Freedom of Choice (FCR) | Provide a list of working resources for a VO
GStat | Provide test results for site information system
Operations Portal | Populate VO information
Operations Portal | Provide interface to manage VO info
Operations Portal | Provide project-wide communication tools
Operations Portal | Provide repository for operational procedures
Operations Dashboard | Provide dashboard tools for grid operators
Operations Dashboard | Provide dashboard tools for regional support
GOCDB | Populate grid topology info, users and services info
GOCDB | Provide interface to manage topology info


Page 13: Operations Automation Strategy


How do we distribute?


[Figure: distribution of information. Elements shown: three Data Providers, each with a Cache; a Central Collector; three Service Providers; three Consumers.]

Page 14: Operations Automation Strategy


Aggregation models

• Aggregation at project (WLCG), infrastructure (EGEE) levels

• Filtering between ROC and Project
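
A rough sketch of the two bullets above: regional providers keep everything their sites report, and a filter decides what is forwarded for project-level aggregation. The region names, site names, test names and the filter rule are all invented for the example.

```python
from collections import defaultdict

# Invented sample results: (region, site, test name, passed?)
LOCAL_RESULTS = [
    ("ROC-North", "SITE-A", "job-submit", True),
    ("ROC-North", "SITE-A", "local-disk", False),   # site-internal check
    ("ROC-North", "SITE-B", "job-submit", False),
    ("ROC-South", "SITE-C", "job-submit", True),
]

# Hypothetical filter rule: only these tests are forwarded to project level.
PROJECT_TESTS = {"job-submit"}


def regional_view(results):
    """Each regional provider keeps everything reported by its own sites."""
    regions = defaultdict(list)
    for region, site, test, ok in results:
        regions[region].append((site, test, ok))
    return dict(regions)


def project_view(regions):
    """Apply the ROC-to-project filter, then aggregate pass rates per region."""
    summary = {}
    for region, rows in regions.items():
        kept = [ok for _site, test, ok in rows if test in PROJECT_TESTS]
        summary[region] = sum(kept) / len(kept) if kept else None
    return summary


regions = regional_view(LOCAL_RESULTS)
project = project_view(regions)
print(project)
```

The region still sees the failing `local-disk` check, while the project level only receives the tests it has asked for.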


[Figure: aggregation hierarchy. Elements shown: six Local Providers; three Regional Providers; three Filters; a Project Level Provider; a Project Level Service.]

Page 15: Operations Automation Strategy


Messaging for integration

• ActiveMQ as messaging bus to integrate systems
  – Reliable + Scalable

• Already in production for WLCG for OSG interoperation
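
To illustrate why a message bus helps integration, here is a minimal in-process publish/subscribe sketch. It is not ActiveMQ and does not use any ActiveMQ client API. The `MessageBus` class, the topic name and the message fields are assumptions made purely for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MessageBus:
    """Tiny in-process stand-in for a broker; it only shows the decoupling."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver to every subscriber of the topic; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = MessageBus()
dashboard_inbox: List[dict] = []
availability_inbox: List[dict] = []

# Two independent consumers of the same test results (topic name is invented).
bus.subscribe("grid.probe.result", dashboard_inbox.append)
bus.subscribe("grid.probe.result", availability_inbox.append)

# A site-level publisher does not need to know who is listening.
delivered = bus.publish(
    "grid.probe.result",
    {"site": "SITE-A", "service": "CE", "status": "CRITICAL"},
)
print(delivered, dashboard_inbox[0]["status"])
```

Adding a new client (another dashboard, an accounting database) then only requires a new subscription, not a change to the publishers. A real broker adds the reliability and scalability that this sketch leaves out.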


[Figure: message bus connecting clients. Clients shown: Accounting Database; SAM/Gridview; Dashboards; Nagios @ ROC; Nagios @ Site; (… more clients…).]

Page 16: Operations Automation Strategy


Multi-level monitoring

• Based on CEE ROC Nagios prototype
  – Replace central SAM with components at ROC and site
  – Tie together with the messaging system
  – Regional operations dashboard and alarms DB
  – Link into regional ticketing
    Perhaps via GGUS (for integration simplicity)

• Follow new operational model
  – Raise alarms immediately at the site
  – 1st level support sees them and can respond if needed
  – Central COD only involved after 2-3 weeks, e.g. site banning

• Project/Infrastructure can aggregate data for reporting
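
The escalation rule in the new operational model can be written down as a small function. This is a sketch under stated assumptions: the slide gives "2-3 weeks" before central COD is involved, so the 14-day threshold, the `Alarm` fields and the team labels returned here are illustrative choices rather than project policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed threshold: the slide says "2-3 weeks"; 14 days is picked for the sketch.
CENTRAL_ESCALATION_AFTER = timedelta(days=14)


@dataclass
class Alarm:
    site: str
    raised_at: datetime
    resolved_at: Optional[datetime] = None


def responsible_team(alarm: Alarm, now: datetime) -> Optional[str]:
    """Return who should currently be handling the alarm, or None if resolved."""
    if alarm.resolved_at is not None and alarm.resolved_at <= now:
        return None
    if now - alarm.raised_at >= CENTRAL_ESCALATION_AFTER:
        return "c-COD"            # central team, e.g. to consider site banning
    return "1st line support"     # regional handling while the alarm is recent


raised = datetime(2008, 7, 1, 9, 0)
alarm = Alarm(site="SITE-A", raised_at=raised)

print(responsible_team(alarm, raised + timedelta(hours=2)))
print(responsible_team(alarm, raised + timedelta(days=15)))
```

Most alarms would be resolved long before the threshold, which is how the model keeps the central team out of routine handling.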


Page 17: Operations Automation Strategy


Multi level monitoring framework


Page 18: Operations Automation Strategy


The site components


Page 19: Operations Automation Strategy


Sharing tools

• How to use tools developed at ROCs + site more widely?

• Mostly publicity…
  – A ‘Lightning Talks’ session at EGEE conferences and events
  – Encourage developers of tools to publish short articles in iSGTW (http://www.isgtw.org/)

• Maintain repository of tools
  – Build on and extend work done in Hepix/WLCG system management WG https://www.sysadmin.hep.ac.uk/

• Integrate into EGEE releases
  – Additional ‘EGEE-*’ YAIM components on top of gLite base software


Page 20: Operations Automation Strategy


Roadmap for distributed COD

• Milestone ‘rCOD 1’: September 2008
  – 4 ROCs carry out r-COD and 1st line support roles directly. This will be done with a ‘regionalized’ version of the current operations dashboard, and with SAM as the alarm generation system

• Milestone ‘rCOD 2’: April 2009
  – 4 additional ROCs carry out r-COD and 1st line support roles using the regionalized dashboard

• Milestone ‘rCOD 3’: April 2009
  – 2 additional ROCs carry out r-COD and 1st line support roles directly using the new multi-level monitoring framework

• Milestone ‘rCOD 4’: September 2009
  – All 11 ROCs carry out r-COD and 1st line support roles directly. The c-COD is fully established

• Milestone ‘rCOD 5’: December 2009
  – All 11 ROCs carry out r-COD and 1st line support roles using the new multi-level monitoring framework


Page 21: Operations Automation Strategy


Roadmap for tools

• Milestone ‘Messaging 1’: August 2008
  – Production level messaging broker in production. This should have internal failover capabilities, but will not have the WAN failover capabilities of a network of brokers

• Milestone ‘Messaging 2’: December 2008
  – A scalable and reliable network of brokers, consisting of a deployment over at least 3 sites, is in place

• Milestone ‘Site Monitoring 1’: September 2008
  – A release of the site components for the multi-level monitoring, including packaging and configuration as part of an EGEE middleware release, exists and is ready for deployment to the sites.

• Milestone ‘ROC Monitoring 1’: December 2008
  – The ROC components for the multi-site monitoring are ready for deployment to sites.

• Milestone ‘ROC Monitoring 2’: February 2009
  – The alarm component has been integrated with the regionalized dashboard

• Milestone ‘ROC Monitoring 3’: July 2009
  – The regional dashboard is now available to be deployed at the ROCs


Page 22: Operations Automation Strategy


Summary

• First architecture for improving automation of operations

• A roadmap defined for moving operational monitoring (à la SAM/COD) to a regional model
  – This is the area with potential for most gains from automation
  – Other areas to follow

• Comments on document welcome!
