Grid Job and Information Management (JIM) for D0 and CDF
Gabriele Garzoglio for the JIM Team
Overview
- Introduction
- Grid-level Management
  - SAM-Grid = SAM + JIM
  - Job Management
  - Information Management
- Fabric-level Management
  - Running jobs on grid resources
  - Local sandbox management
  - The DZero Application Framework
- Running MC at UWisc
Context
- The D0 Grid project started in 2001-2002 to handle D0's expanded needs for globally distributed computing
- JIM complements the data handling system (SAM) with job and information management
- JIM is funded by PPDG (our team here) and GridPP (Rod Walker in the UK)
- Collaborative effort with the experiments; CDF joined later, in 2002
History
- Delivered JIM prototype for D0, Oct 10, 2002:
  - remote job submission
  - brokering based on cached data
  - Web-based monitoring
- SC2002 demo with 11 sites (D0, CDF): a big success
- May 2003: started deployment of V1
- Now: working on running MC in production on the Grid
SAM-Grid Logistics
[Architecture diagram: at each of several sites, a User Interface and Grid Client submit jobs to a Global Job Queue. A Resource Selector (Match Making, fed by an Info Collector and Info Gatherer) picks an execution site. Global Data Handling services (SAM Naming Server, SAM Log Server, Resource Optimizer, SAM DB Server with the RC MetaData Catalog, Bookkeeping Service) back the per-site Data Handling layer (SAM Station plus other services, SAM Stagers). Each site runs a Grid Gateway in front of a Local Job Handler (CAF, D0MC, BS, ...), JIM Advertise, the cluster's Worker Nodes, AAA, a distributed FS, an MSS/cache, and an Info Manager (XML DB server holding the site configuration and the global/local JID map, Info Providers, MDS). A web server exposes Grid Monitoring and User Tools. Arrows show the flow of jobs, data, and meta-data.]
Job Management Highlights
- We distinguish grid-level (global) job scheduling (selection of a cluster to run on) from local scheduling (distribution of the job within the cluster)
- We consider 3 types of jobs:
  - analysis: data intensive
  - monte carlo: CPU intensive
  - reconstruction: data and CPU intensive
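As a sketch of how these job profiles could drive grid-level cluster selection, here is a minimal ranking function in Python. The weights, attribute names (`cached_fraction`, `free_cpus`), and site dictionary are illustrative assumptions, not JIM's actual brokering code.

```python
# A minimal sketch of type-aware site ranking. The profiles, weights,
# and site attributes below are illustrative, not JIM's actual values.

JOB_PROFILES = {
    "analysis":       {"data": 0.8, "cpu": 0.2},  # data intensive
    "monte_carlo":    {"data": 0.1, "cpu": 0.9},  # CPU intensive
    "reconstruction": {"data": 0.5, "cpu": 0.5},  # data and CPU intensive
}

def rank_site(job_type, site):
    """Score a candidate cluster for a job; higher is better."""
    w = JOB_PROFILES[job_type]
    return (w["data"] * site["cached_fraction"]               # data locality
            + w["cpu"] * site["free_cpus"] / site["total_cpus"])  # spare CPU
```

With a site that has 70% of the job's data cached but only 40% of its CPUs free, an analysis job would rank it higher than a monte carlo job would, which is the intended behavior of brokering on cached data.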
Job Management – Distinct JIM Features
- Decision making is based on both:
  - information existing irrespective of jobs (resource description)
  - functions of (job, resource) pairs
- Decision making is interfaced with the data handling middleware
- Decision making is entirely in the Condor framework (no resource broker of our own): a strong promotion of standards and interoperability
- Brokering algorithms can be extended via plug-ins
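Because the decision making lives in the Condor match-making framework, site preferences can be written directly as ClassAd Requirements/Rank expressions. A hypothetical job ad is shown below; attribute names such as CachedDataFraction and SamStationUp are invented for illustration and are not JIM's actual schema.

```
// Illustrative Condor-style job ClassAd; attribute names other than
// Requirements/Rank are hypothetical, not JIM's actual schema.
[
  Type         = "Job";
  JobKind      = "analysis";
  Requirements = other.Type == "Machine" && other.SamStationUp;
  Rank         = other.CachedDataFraction   // prefer sites with the data cached
]
```

A plug-in brokering algorithm would then amount to publishing extra attributes into the ads and referencing them in Rank.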
[Architecture diagram: a job goes from the User Interface through a Submission Client to the Broker / Match Making Service, which consults an Information Collector fed by Grid Sensors at each site. The selected Execution Site (#1 ... #n) receives the job at its Queuing System / Computing Element; Storage Elements and the Data Handling System serve the job's data.]
Information Management
In JIM's view, this includes:
- configuration framework
- resource description for job brokering
- infrastructure for monitoring

Main features:
- site (resource) and job monitoring
- distributed knowledge about jobs etc.
- incremental knowledge building
- GMA for current-state inquiries, logging for recent-history studies
- all Web based
Information Management via Site Configuration
[Diagram: a main site/cluster configuration stored in an XML DB is the single source from which XSLT stylesheets, starting from a template XML, generate the resource advertisement (classad), the monitoring configuration (LDIF), service instantiation parameters (XML), and more.]
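A site configuration in this scheme might look roughly like the fragment below; the element and attribute names are invented for illustration, not the actual JIM schema. XSLT stylesheets would then derive the classad, LDIF, and service-instantiation XML from this one source, so each site is described exactly once.

```xml
<!-- Hypothetical site configuration fragment; element and attribute
     names are illustrative, not the actual JIM schema. -->
<site name="uwisc">
  <cluster name="main" batch_system="condor" worker_nodes="1000"/>
  <sam_station>uwisc-d0</sam_station>
  <monitoring ldap_port="2135"/>
</site>
```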
Running jobs on Grid resources
- The trend: Grid resources are not dedicated to a single experiment. Translation:
  - no daemons running on the worker nodes of a Batch System
  - no experiment-specific software installed
Running jobs on Grid resources
- The situation today is transitional:
  - Worker nodes typically access the software via a shared FS: not scalable!
  - Generally, experiments can install specific services on a node close to the cluster
  - Local resource configuration is still too diverse to plug easily into the Grid
The JIM local sandbox management
- Keeps the job executable (from the Grid) at the head node and knows where its product dependencies are
- Transports and installs the software to the worker node
- Can instantiate services at the worker node
- Sets up the environment for the job to run
- Packages the output and hands it over to the Grid, so that it becomes available for download at the submission site
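The staging and output steps above can be sketched in Python. The function names, tarball-based transport, and the PRODUCTS variable are assumptions for illustration, not JIM's actual implementation.

```python
# Hypothetical sketch of sandbox staging on a worker node and output
# packaging; names and layout are illustrative, not JIM's implementation.
import os
import tarfile

def stage_sandbox(executable_tarball, product_tarballs, workdir):
    """Unpack the job executable and its product dependencies,
    then set up the environment the job expects."""
    os.makedirs(workdir, exist_ok=True)
    for tb in [executable_tarball] + product_tarballs:
        with tarfile.open(tb) as t:
            t.extractall(workdir)
    env = dict(os.environ)
    env["PRODUCTS"] = os.path.join(workdir, "products")  # assumed variable
    return env

def package_output(workdir, output_files):
    """Bundle output so the Grid can return it to the submission site."""
    out = os.path.join(workdir, "job_output.tar.gz")
    with tarfile.open(out, "w:gz") as t:
        for f in output_files:
            t.add(f, arcname=os.path.basename(f))
    return out
```

Service instantiation on the worker node, which the real JIM sandbox also handles, is omitted here.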
Running a DZero application
- We have the JIM sandbox: where is the problem now?
- The JIM sandbox could immediately use the DZero Run Time Environment (RTE), but:
  - not all the DZero packages are RTE compliant
  - users don't have the experience/incentives to use it today
Running Monte Carlo at UWisc
- The University of Wisconsin offered DZero the opportunity to use a 1000-node non-dedicated Condor cluster
- We are concentrating on putting it to use to run MC with mc_runjob (in production by year end)
The challenges I
- MC code is not RTE compliant today
- The chain has 3-5 stages; each binary is 50-200 MB, dynamically linked
- Binaries are compiled from 40 packages (of 621 total for D0); these packages are needed at run time for RCP files
- Root, Motif, X11, and Ace libraries turn up as dependencies (for MC generators, ...)
- MC tarballs exist but are hand-crafted (and bug-prone) every time; size unpacked is 2 GB (versus the 12-15 GB full D0 application tree)
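As an illustration of how the hand-crafted tarballs could be automated, here is a minimal sketch that packs only a declared package list and fails fast on a missing dependency. The helper name and manifest format are hypothetical, not an existing D0 tool.

```python
# Hypothetical sketch: build the MC tarball from a declared package list
# instead of by hand. Helper name and layout are illustrative.
import os
import tarfile

def build_mc_tarball(release_root, packages, out_path):
    """Pack only the ~40 packages an MC chain needs (~2 GB unpacked)
    rather than the full 12-15 GB D0 application tree."""
    missing = [p for p in packages
               if not os.path.isdir(os.path.join(release_root, p))]
    if missing:
        # Fail fast instead of shipping an incomplete, bug-prone tarball
        raise RuntimeError("packages not in release: " + ", ".join(missing))
    with tarfile.open(out_path, "w:gz") as t:
        for p in packages:
            t.add(os.path.join(release_root, p), arcname=p)
    return out_path
```

Driving this from a versioned manifest would make the tarball reproducible for each release instead of depending on manual assembly.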
The challenges II
- About every advanced C++ feature, every libc library call, and every system call is used
- One can get different results on two RedHat 7.2 systems
- The total release tree takes hours (up to 20+) to build: not something easy to do dynamically at a remote site
Summary
- The SAM-Grid offers an extensible working framework for Grid-level Job/Data/Info Management
- JIM provides Fabric-level management tools for sandboxing
- The applications need to be improved to run on Grid resources