WP4 report


Transcript of WP4 report

Page 1: WP4 report

Olof Bärring – WP4 summary- 4/9/2002 - n° 1


WP4 report

Plans for testbed 2

[email protected]

Page 2: WP4 report


Summary

Reminder on how it all fits together

What’s in R1.2 (deployed and not-deployed but integrated)

Piled up software from R1.3, R1.4

Timeline for R2 developments and beyond

A WP4 problem

Conclusions

Page 3: WP4 report


How it all fits together (job management)

[Diagram] Farm A (LSF), Farm B (PBS), (Mass storage, disk pools)

Grid User / Local User

WP4 subsystems: Fabric Gridification, Resource Management, Monitoring

Other WPs: Resource Broker (WP1), Data Mgmt (WP2), Grid Info Services (WP3), Grid Data Storage (WP5)

Flow:

- Submit job
- Optimized selection of site
- Authorize
- Map grid to local credentials
- Select an optimal batch queue and submit
- Return job status and output
- Publish resource and accounting information
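The flow above (authorize, map grid credentials to a local account, pick a batch queue, submit) can be sketched as follows. This is an illustrative toy, not EDG/LCAS code; all names (GRIDMAP, the farm dict) are hypothetical.

```python
# Toy sketch of the job-management flow: authorize -> map credentials ->
# select an optimal batch queue -> submit. Hypothetical names throughout.

GRIDMAP = {"/C=CH/O=CERN/CN=Grid User": "localuser01"}  # grid DN -> local account

def authorize(grid_dn):
    """Authorize step (the role LCAS plays in WP4 Gridification)."""
    return grid_dn in GRIDMAP

def map_credentials(grid_dn):
    """Map grid credentials to a local account."""
    return GRIDMAP[grid_dn]

def select_queue(farms):
    """Pick the batch queue with the fewest pending jobs."""
    return min(farms, key=farms.get)

def submit_job(grid_dn, farms):
    if not authorize(grid_dn):
        return {"status": "DENIED"}
    local_user = map_credentials(grid_dn)
    queue = select_queue(farms)
    farms[queue] += 1  # job now pending in that queue
    return {"status": "SUBMITTED", "user": local_user, "queue": queue}

farms = {"farmA-lsf": 12, "farmB-pbs": 3}
print(submit_job("/C=CH/O=CERN/CN=Grid User", farms))
# -> submitted to farmB-pbs, the least-loaded queue
```

The "optimized selection of site" done by the Resource Broker (WP1) is the same idea one level up: pick among sites instead of among queues.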

Page 4: WP4 report


How it all fits together (system mgmt)

[Diagram] Farm A (LSF), Farm B (PBS)

WP4 subsystems: Installation & Node Mgmt, Configuration Management, Monitoring & Fault Tolerance, Resource Management

Other WPs. Arrows: Information / Invocation

Flow:

- Update configuration templates
- Node malfunction detected
- Remove node from queue
- Wait for running jobs (?)
- Trigger repair
- Repair (e.g. restart, reboot, reconfigure, …)
- Node OK detected
- Put node back in queue

Automation
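The automation loop above (malfunction detected, drain node from queue, repair, re-enable) can be sketched in a few lines. A hypothetical illustration only; state names and the queue representation are invented.

```python
# Toy sketch of the monitoring / fault-tolerance automation loop:
# detect malfunction -> remove node from queue -> repair -> put back.

def automation_step(node, queue):
    """One pass of the automation loop for a single node."""
    if node["state"] == "malfunction" and node["name"] in queue:
        queue.remove(node["name"])      # remove node from queue
        # (a real system would wait for running jobs to drain here)
        node["state"] = "repairing"     # trigger repair
    elif node["state"] == "repairing":
        node["state"] = "ok"            # repair: e.g. restart, reboot, reconfigure
    elif node["state"] == "ok" and node["name"] not in queue:
        queue.append(node["name"])      # node OK detected: put back in queue

queue = ["node01", "node02"]
node = {"name": "node02", "state": "malfunction"}
for _ in range(3):                      # detect -> repair -> re-enable
    automation_step(node, queue)
print(queue)                            # node02 drained, repaired, re-enabled
```

The "wait for running jobs(?)" step on the slide is exactly the drain phase the comment marks; whether to wait or kill is the open question the slide flags.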

Page 5: WP4 report


How it all fits together (node autonomy)

[Diagram] Central (distributed): Monitoring Measurement Repository, Configuration Database

On the node: Cfg cache (holds the node profile), Monitoring Buffer (a buffer copy is forwarded centrally), Correlation engines, Node mgmt components

Local recovery if possible (e.g. restarting daemons)

Automation
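The node-autonomy idea above can be sketched as follows: measurements land in a local buffer and a local correlation engine recovers what it can (e.g. restarting a daemon) without needing the central repository or configuration database. All metric names and the recovery action are hypothetical.

```python
# Toy sketch of node autonomy: a local monitoring buffer plus a local
# correlation engine that triggers recovery actions on the node itself.

monitoring_buffer = []   # local buffer; a copy would be forwarded centrally

def sample(metric, value):
    """Monitoring measurement: append one sample to the local buffer."""
    monitoring_buffer.append((metric, value))

def correlation_engine():
    """Scan the local buffer and decide on local recovery actions."""
    actions = []
    for metric, value in monitoring_buffer:
        if metric == "daemon.httpd.alive" and value == 0:
            actions.append("restart httpd")   # local recovery if possible
    return actions

sample("cpu.load", 0.7)
sample("daemon.httpd.alive", 0)
print(correlation_engine())   # -> ['restart httpd']
```

The point of the design is that this loop keeps working when the central services are unreachable; only problems the node cannot fix locally escalate upward.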

Page 6: WP4 report


What’s in R1.2 (and deployed)

Gridification: Library implementation of LCAS

Page 7: WP4 report


What’s in R1.2 but not used/deployed

Resource management: Information provider for Condor (not fully tested, because a complete testbed including a Condor cluster is needed)

Monitoring: Agent + first prototype repository server + basic linuxproc sensors. No LCFG object, so not deployed.

Installation mgmt: LCFG light exists in R1.2. Please give us feedback on any problems you have with it.

Page 8: WP4 report


Piled up software from R1.3, R1.4

Everything mentioned here is ready, unit tested and documented (and rpms are built by autobuild)

Gridification: LCAS with dynamic plug-ins (already in R1.2.1???)

Resource mgmt: Complete prototype enterprise-level batch system management with proxy for PBS (see next slide). Includes LCFG object.

Monitoring: New agent, production quality. Already used on CERN production clusters, sampling some 110 metrics/node. Has also been tested on Solaris. LCFG object.

Installation mgmt: Next-generation LCFG: LCFGng for RH6.2 (RH7.2 almost ready)

Page 9: WP4 report


Enterprise level batch system mgmt prototype (R1.3)

[Diagram] A batch system (PBS, LSF, etc.) with queues and resources, a Scheduler and a Runtime Control System. Jobs arrive from the Grid and from local fabric users through a gatekeeper (Globus or WP4). Jobs job 1 … job n are handled by job managers JM 1 … JM n (scheduled jobs, new jobs).

- user queue 1, user queue 2: stopped, visible for users (users submit here)
- execution queue: started, invisible for users
- Operations: submit, get job info, move job, exec job
- Legend: RMS components, Globus components, PBS-/LSF-cluster
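The queue layout above can be illustrated with a toy model (hypothetical names, not the RMS code): user queues are stopped but visible, the execution queue is started but invisible, and the proxy moves jobs into it as the scheduler picks them.

```python
# Toy sketch of the R1.3 batch-management prototype: jobs pile up in
# stopped, user-visible queues; a scheduler moves them one at a time
# into a started, user-invisible execution queue.

queues = {
    "user1": {"started": False, "visible": True,  "jobs": ["job1", "job2"]},
    "user2": {"started": False, "visible": True,  "jobs": ["job3"]},
    "exec":  {"started": True,  "visible": False, "jobs": []},
}

def move_job(src, dst):
    """Move the oldest job from one queue to another (the 'move job' arrow)."""
    job = queues[src]["jobs"].pop(0)
    queues[dst]["jobs"].append(job)
    return job

def schedule():
    """Pick the longest user-visible queue and move one job to exec."""
    src = max((q for q in queues if queues[q]["visible"]),
              key=lambda q: len(queues[q]["jobs"]))
    return move_job(src, "exec")

print(schedule())             # moves job1 from user1 to the exec queue
print(queues["exec"]["jobs"])
```

Keeping the execution queue invisible is the design point: users see and control their own queues, while the RMS proxy retains full control over what actually runs on the cluster.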

Page 10: WP4 report


Timeline for R2 developments

Configuration management: complete central part of framework

- High Level Definition Language: 30/9/2002
- PAN compiler: 30/9/2002
- Configuration Database (CDB): 31/10/2002

Installation mgmt

- LCFGng for RH72: 30/9/2002

Monitoring: complete final framework

- TCP transport: 30/9/2002
- Repository server: 30/9/2002
- Repository API WSDL: 30/9/2002
- Oracle DB support: 31/10/2002
- Alarm display: 30/11/2002
- Open Source DB (MySQL or PostgreSQL): mid-December 2002

Page 11: WP4 report


Timeline for R2 developments

Resource mgmt

- GLUE info providers: 15/9/2002
- Maintenance support API (e.g. enable/disable a node in the queue): 30/9/2002
- Provide accounting information to WP1 accounting group: 30/9/2002
- Support Maui as scheduler

Fault tolerance framework

- Various components already delivered
- Complete framework by end of November

Page 12: WP4 report


Beyond release 2

Conclusion from the WP4 workshop, June 2002: LCFG is not the future for EDG (see the WP4 quarterly report for 2Q02) because:

- Inherent LCFG constraints on the configuration schema (per-component config)
- LCFG is a project of its own, and our objectives do not always coincide

We have learned a lot from the LCFG architecture and we continue to collaborate with the LCFG team.

EDG future: first release by end-March 2003

- Proposal for a common schema for all fabric configuration information to be stored in the configuration database, implemented using the HLDL.
- New configuration client and node management replacing the LCFG client (the server side is already delivered in October).
- New software package management (replacing updaterpms) split into two modules: an OS-independent part and an OS-dependent part (packager).

Page 13: WP4 report


Global schema tree

- hardware: CPU, harddisk (sys_name, interface_type, size, …), memory, …
- system: hostname, architecture, partitions (hda1 (size, type, id), hda2, …), services, …
- sw: packages, known_repositories, edg_lcas (version, repositories, …), cluster, …

Component specific configuration

The population of the global schema is an ongoing activity: http://edms.cern.ch/document/352656/1
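The schema tree above can be pictured as nested data with path-style lookups, in the spirit of the HLDL paths used by PAN. A hypothetical sketch only; the sample values and the lookup helper are invented for illustration.

```python
# Toy model of the global schema tree as nested data, with a
# /-separated path lookup. Values are made up for illustration.

schema = {
    "hardware": {"CPU": {},
                 "harddisk": {"sys_name": "hda", "interface_type": "IDE",
                              "size": 40}},
    "system": {"hostname": "node01", "architecture": "i686",
               "partitions": {"hda1": {"size": 1024, "type": "ext2", "id": 1}},
               "services": {}},
    "sw": {"packages": {}, "known_repositories": {},
           "edg_lcas": {"version": "1.0", "repositories": []}},
}

def lookup(tree, path):
    """Resolve a path such as /system/partitions/hda1/size in the tree."""
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]
    return node

print(lookup(schema, "/system/partitions/hda1/size"))   # -> 1024
```

A single agreed tree like this is what makes the common schema proposal work: every component (e.g. edg_lcas) reads its configuration from a well-known subtree instead of its own private format.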

Page 14: WP4 report


SW repository structure (maintained by repository managers):

/sw/known_repositories/Arep/url = (host, protocol, prefix dir)
                           /owner =
                           /extras =
                           /directories/dir_name_X/path = (asis)
                                                  /platform = (i386_rh61)
                                                  /packages/pck_a/name = (kernel)
                                                                 /version = (2.4.9)
                                                                 /release = 31.1.cern
                                                                 /architecture = (i686)
                                       /dir_name_Y/path = (sun_system)
                                                  /platform = (sun4_58)
                                                  /packages/pck_b/name = (SUNWcsd)
                                                                 /version = 11.7.0
                                                                 /release = 1998.09.01.04.16
                                                                 /architecture = (?)

Global schema example
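The repository structure above supports queries such as "which version of a package is available for a given platform". A hypothetical sketch (the nested-dict representation and helper are invented; names follow the slide):

```python
# Toy model of the SW repository structure: directories keyed by name,
# each with a platform and a set of packages, plus a version query.

repository = {
    "dir_name_X": {
        "path": "asis", "platform": "i386_rh61",
        "packages": {"pck_a": {"name": "kernel", "version": "2.4.9",
                               "release": "31.1.cern", "architecture": "i686"}},
    },
    "dir_name_Y": {
        "path": "sun_system", "platform": "sun4_58",
        "packages": {"pck_b": {"name": "SUNWcsd", "version": "11.7.0",
                               "release": "1998.09.01.04.16"}},
    },
}

def find_version(repo, platform, package_name):
    """Return the version of a named package for one platform, else None."""
    for directory in repo.values():
        if directory["platform"] != platform:
            continue
        for pck in directory["packages"].values():
            if pck["name"] == package_name:
                return pck["version"]
    return None

print(find_version(repository, "i386_rh61", "kernel"))   # -> 2.4.9
```

Because the structure is platform-keyed, the same repository can describe both Linux (i386_rh61) and Solaris (sun4_58) package sets, which is what the slide's two directory entries show.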

Page 15: WP4 report


Problem

Very little of the delivered WP4 software is of any interest to the EDG application WPs, possibly with the exception of producing nice colour plots of the CPU loads when a job was run…

This is normal, but… site administrators do not grow on trees. Because of the lack of good system admin tools, like the ones WP4 tries to develop, the configuration, installation and supervision of the testbed installations require a substantial amount of manual work.

However, thanks to Bob's new priority list, the need for automated configuration and installation has bubbled up the required-features stack to become absolutely vital for assuring good quality.

Page 16: WP4 report


Summary

A substantial amount of s/w has piled up from R1.3 and R1.4, to be deployed now

R2 also includes two large components:

- LCFGng – the migration is non-trivial, but we already perform much of the non-trivial part ourselves, so TB integration should be smooth
- Complete monitoring framework

Beyond R2: LCFG is not the future for EDG WP4. First version of the new configuration and node management system in March 2003.