Grid Job and Information Management (JIM) for D0 and CDF
Gabriele Garzoglio for the JIM Team
Overview
- Introduction
- Grid-level Management
  - SAM-Grid = SAM + JIM
  - Job Management
  - Information Management
- Fabric-level Management
  - Running jobs on grid resources
  - Local sandbox management
  - The DZero Application Framework
- Running MC at UWisc
Context
- The D0 Grid project started in 2001-2002 to handle D0's expanded needs for globally distributed computing
- JIM complements the data handling system (SAM) with job and information management
- JIM is funded by PPDG (our team here) and GridPP (Rod Walker in the UK)
- Collaborative effort with the experiments; CDF joined later, in 2002
History
- Delivered JIM prototype for D0, Oct 10, 2002:
  - remote job submission
  - brokering based on cached data
  - Web-based monitoring
- SC2002 demo with 11 sites (D0, CDF): a big success
- May 2003: started deployment of V1
- Now: working on running MC in production on the Grid
SAM-Grid Logistics
[Architecture diagram: at each of several sites, a User Interface and Grid Client submit jobs to a Global Job Queue. A Resource Selector (Match Making, fed by an Info Collector and Info Gatherer) picks an execution site. Global Data Handling services (SAM Naming Server, SAM Log Server, Resource Optimizer, SAM DB Server with the RC MetaData Catalog, Bookkeeping Service) back the per-site Data Handling layer (SAM Station plus other services, SAM Stagers). Each site runs a Grid Gateway in front of a Local Job Handler (CAF, D0MC, BS, ...), JIM Advertise, the cluster's Worker Nodes, AAA, a distributed FS, an MSS/cache, and an Info Manager (XML DB server holding the site configuration and the global/local JID map, Info Providers, MDS). A web server exposes Grid Monitoring and User Tools. Arrows show the flow of jobs, data, and meta-data.]
Job Management Highlights
- We distinguish grid-level (global) job scheduling (selection of a cluster to run on) from local scheduling (distribution of the job within the cluster)
- We consider 3 types of jobs:
  - analysis: data intensive
  - monte carlo: CPU intensive
  - reconstruction: data and CPU intensive
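As a sketch of how these job profiles could drive grid-level cluster selection, here is a minimal ranking function in Python. The weights, attribute names (`cached_fraction`, `free_cpus`), and site dictionary are illustrative assumptions, not JIM's actual brokering code.

```python
# A minimal sketch of type-aware site ranking. The profiles, weights,
# and site attributes below are illustrative, not JIM's actual values.

JOB_PROFILES = {
    "analysis":       {"data": 0.8, "cpu": 0.2},  # data intensive
    "monte_carlo":    {"data": 0.1, "cpu": 0.9},  # CPU intensive
    "reconstruction": {"data": 0.5, "cpu": 0.5},  # data and CPU intensive
}

def rank_site(job_type, site):
    """Score a candidate cluster for a job; higher is better."""
    w = JOB_PROFILES[job_type]
    return (w["data"] * site["cached_fraction"]               # data locality
            + w["cpu"] * site["free_cpus"] / site["total_cpus"])  # spare CPU
```

With a site that has 70% of the job's data cached but only 40% of its CPUs free, an analysis job would rank it higher than a monte carlo job would, which is the intended behavior of brokering on cached data.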
Job Management – Distinct JIM Features
- Decision making is based on both:
  - information existing irrespective of jobs (resource description)
  - functions of (job, resource) pairs
- Decision making is interfaced with the data handling middleware
- Decision making is entirely in the Condor framework (no resource broker of our own): a strong promotion of standards and interoperability
- Brokering algorithms can be extended via plug-ins
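Because the decision making lives in the Condor match-making framework, site preferences can be written directly as ClassAd Requirements/Rank expressions. A hypothetical job ad is shown below; attribute names such as CachedDataFraction and SamStationUp are invented for illustration and are not JIM's actual schema.

```
// Illustrative Condor-style job ClassAd; attribute names other than
// Requirements/Rank are hypothetical, not JIM's actual schema.
[
  Type         = "Job";
  JobKind      = "analysis";
  Requirements = other.Type == "Machine" && other.SamStationUp;
  Rank         = other.CachedDataFraction   // prefer sites with the data cached
]
```

A plug-in brokering algorithm would then amount to publishing extra attributes into the ads and referencing them in Rank.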
[Architecture diagram: a job goes from the User Interface through a Submission Client to the Broker / Match Making Service, which consults an Information Collector fed by Grid Sensors at each site. The selected Execution Site (#1 ... #n) receives the job at its Queuing System / Computing Element; Storage Elements and the Data Handling System serve the job's data.]
Information Management
In JIM's view, this includes:
- configuration framework
- resource description for job brokering
- infrastructure for monitoring

Main features:
- site (resource) and job monitoring
- distributed knowledge about jobs etc.
- incremental knowledge building
- GMA for current-state inquiries, logging for recent-history studies
- all Web based
Information Management via Site Configuration
[Diagram: a main site/cluster configuration stored in an XML DB is the single source from which XSLT stylesheets, starting from a template XML, generate the resource advertisement (classad), the monitoring configuration (LDIF), service instantiation parameters (XML), and more.]
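A site configuration in this scheme might look roughly like the fragment below; the element and attribute names are invented for illustration, not the actual JIM schema. XSLT stylesheets would then derive the classad, LDIF, and service-instantiation XML from this one source, so each site is described exactly once.

```xml
<!-- Hypothetical site configuration fragment; element and attribute
     names are illustrative, not the actual JIM schema. -->
<site name="uwisc">
  <cluster name="main" batch_system="condor" worker_nodes="1000"/>
  <sam_station>uwisc-d0</sam_station>
  <monitoring ldap_port="2135"/>
</site>
```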
Running jobs on Grid resources
- The trend: Grid resources are not dedicated to a single experiment. Translation:
  - no daemons running on the worker nodes of a Batch System
  - no experiment-specific software installed
Running jobs on Grid resources
- The situation today is transitional:
  - Worker nodes typically access the software via a shared FS: not scalable!
  - Generally, experiments can install specific services on a node close to the cluster
  - Local resource configuration is still too diverse to plug easily into the Grid
The JIM local sandbox management
- Keeps the job executable (from the Grid) at the head node and knows where its product dependencies are
- Transports and installs the software to the worker node
- Can instantiate services at the worker node
- Sets up the environment for the job to run
- Packages the output and hands it over to the Grid, so that it becomes available for download at the submission site
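The staging and output steps above can be sketched in Python. The function names, tarball-based transport, and the PRODUCTS variable are assumptions for illustration, not JIM's actual implementation.

```python
# Hypothetical sketch of sandbox staging on a worker node and output
# packaging; names and layout are illustrative, not JIM's implementation.
import os
import tarfile

def stage_sandbox(executable_tarball, product_tarballs, workdir):
    """Unpack the job executable and its product dependencies,
    then set up the environment the job expects."""
    os.makedirs(workdir, exist_ok=True)
    for tb in [executable_tarball] + product_tarballs:
        with tarfile.open(tb) as t:
            t.extractall(workdir)
    env = dict(os.environ)
    env["PRODUCTS"] = os.path.join(workdir, "products")  # assumed variable
    return env

def package_output(workdir, output_files):
    """Bundle output so the Grid can return it to the submission site."""
    out = os.path.join(workdir, "job_output.tar.gz")
    with tarfile.open(out, "w:gz") as t:
        for f in output_files:
            t.add(f, arcname=os.path.basename(f))
    return out
```

Service instantiation on the worker node, which the real JIM sandbox also handles, is omitted here.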
Running a DZero application
- We have the JIM sandbox: where is the problem now?
- The JIM sandbox could immediately use the DZero Run Time Environment (RTE), but:
  - not all the DZero packages are RTE compliant
  - users don't have the experience/incentives to use it today
Running Monte Carlo at UWisc
- The University of Wisconsin offered DZero the opportunity to use a 1000-node non-dedicated Condor cluster
- We are concentrating on putting it to use to run MC with mc_runjob (in production by year end)
The challenges I
- MC code is not RTE compliant today
- The chain has 3-5 stages; each binary is 50-200 MB, dynamically linked
- Binaries are compiled from 40 packages (of 621 total for D0); these packages are needed at run time for RCP files
- Root, Motif, X11, and Ace libraries turn up as dependencies (for MC generators, ...)
- MC tarballs exist but are hand-crafted (and bug-prone) every time; size unpacked is 2 GB (versus the 12-15 GB full D0 application tree)
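As an illustration of how the hand-crafted tarballs could be automated, here is a minimal sketch that packs only a declared package list and fails fast on a missing dependency. The helper name and manifest format are hypothetical, not an existing D0 tool.

```python
# Hypothetical sketch: build the MC tarball from a declared package list
# instead of by hand. Helper name and layout are illustrative.
import os
import tarfile

def build_mc_tarball(release_root, packages, out_path):
    """Pack only the ~40 packages an MC chain needs (~2 GB unpacked)
    rather than the full 12-15 GB D0 application tree."""
    missing = [p for p in packages
               if not os.path.isdir(os.path.join(release_root, p))]
    if missing:
        # Fail fast instead of shipping an incomplete, bug-prone tarball
        raise RuntimeError("packages not in release: " + ", ".join(missing))
    with tarfile.open(out_path, "w:gz") as t:
        for p in packages:
            t.add(os.path.join(release_root, p), arcname=p)
    return out_path
```

Driving this from a versioned manifest would make the tarball reproducible for each release instead of depending on manual assembly.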
The challenges II
- About every advanced C++ feature, every libc library call, and every system call is used
- One can get different results on two RedHat 7.2 systems
- The total release tree takes hours (up to 20+) to build: not something easy to do dynamically at a remote site
Summary
- The SAM-Grid offers an extensible working framework for Grid-level Job/Data/Info Management
- JIM provides Fabric-level management tools for sandboxing
- The applications need to be improved to run on Grid resources