Enhancements to Condor-G for the ATLAS Tier 1 at BNL
John Hover
Group Leader, Experiment Services (Grid Group)
RACF, BNL
Condor Week 2010, 15 April 2010
Outline
• Background
• Problems
• Solutions
• Results
• Acknowledgements
Our (Odd?) Situation
• ATLAS Pilot-based Grid Workload System: PanDA (Production and Distributed Analysis)
  – Individual (pilot) jobs are identical.
  – Individual (pilot) jobs are not valuable.
  – Jobs can, unpredictably, be very short (~3-5 minutes).
• Brookhaven National Laboratory's Role:
  – BNL Tier 1 responsible for sending pilots to all ATLAS sites in OSG (U.S. Cloud).
  – Central PanDA services located at CERN.
PanDA Autopilot
• Runs on top of Condor-G.
  – One automatic scheduler process for each PanDA 'queue'.
  – Each runs condor_q, parses the output, and runs condor_submit.
  – (Nearly) all run as a single UNIX user.
  – Each minute:
    • Queries Condor-G for job status (per queue per gatekeeper).
    • Queries PanDA Server for current nqueue value.
    • Decides how many pilots to submit.
• At BNL ATLAS Tier 1:
  – 5 submit hosts (3 primary).
  – Serving 92 PanDA queues at 43 gatekeepers (some overlap).
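
The per-minute decision step can be sketched roughly as below. This is an illustration only, not the Autopilot code: the function name `pilots_to_submit`, the "top up idle pilots to nqueue" rule, and the `max_per_cycle` cap are all assumptions made for the example.

```python
def pilots_to_submit(idle_pilots: int, nqueue: int, max_per_cycle: int = 50) -> int:
    """Return how many new pilots to submit this cycle (hypothetical rule).

    idle_pilots:   pilots already queued but not yet running at the gatekeeper
    nqueue:        target number of queued pilots reported by the PanDA server
    max_per_cycle: assumed safety cap on submissions in one cycle
    """
    if idle_pilots < 0 or nqueue < 0 or max_per_cycle < 0:
        raise ValueError("counts must be non-negative")
    # Submit only the shortfall, never a negative number, never above the cap.
    shortfall = nqueue - idle_pilots
    return max(0, min(shortfall, max_per_cycle))
```

With 5 idle pilots and nqueue = 20, this rule would submit 15; once the idle count exceeds nqueue it submits none.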
Condor-G Interaction Diagram
Problems (Opportunities) 1
• ~5000 job ceiling
  – General scaling issues in Condor-G in 6.8.x.
  – Manual operation and cron job often needed to clean up stuck jobs and restart Condor-G processes.
• HELD jobs
  – Held jobs “clog” queue and interfere with further submission.
• GRAM <-> Condor communication glitches
  – Condor-G loses track of jobs at site. Requires gahp_server restart. Slow/no job status update.
  – Memory leak in Globus client?
• Inter-site effects
  – Error condition on one site/gatekeeper can affect another.
Problems 2
• Grid Manager <-> Grid Monitor issues
  – When a problem occurs, a new Grid Monitor is not started for an hour.
• Difficulty troubleshooting
  – condor_status info oriented toward local batch.
Solutions 1
• Establish the goal and set up coordination between BNL and Condor Team members.
  – Formal meeting at BNL to discuss plans.
  – Full login access for Condor devs on BNL production hosts.
  – Frequent email and phone communication to track progress.
  – Clear problem list and action items for teams.
  – This was a prerequisite for all further progress.
• Ultimately, establish stress testbed at U.Wisc. to which we submit.
Solutions 2
• Internal efficiency fixes:
  – Jaime found loops (that cycle through internal data structures) that were inefficient at 5000+ job scales. Fixed.
• HELD jobs never needed. Pilots are expendable.
  – +Nonessential = True
  – When pilot jobs fail, we don't care. Just remove and discard them rather than saving them for later execution.
  – Unconditional removal and cleanup enabled.
• Grid Monitor restart behavior fix
  – Made this configurable: GRID_MONITOR_DISABLE_TIME
  – But required refining the error handling on the Grid Manager side to avoid accidentally flooding site with Grid Monitors.
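
As a rough sketch of where +Nonessential = True goes, the helper below renders a Condor-G submit description for a batch of identical pilots. Only the +Nonessential line comes from the slide; the function name, the gatekeeper host, the jobmanager name and the executable are placeholders assumed for illustration.

```python
def pilot_submit_description(gatekeeper: str, jobmanager: str,
                             executable: str, count: int) -> str:
    """Render a grid-universe submit description for `count` identical pilots.

    All argument values are supplied by the caller; nothing here reflects the
    actual BNL configuration.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    lines = [
        "universe = grid",
        # GRAM2 (gt2) resource: <gatekeeper host>/<jobmanager name>
        f"grid_resource = gt2 {gatekeeper}/{jobmanager}",
        f"executable = {executable}",
        # Pilots are expendable: mark them so failures are discarded, not held.
        "+Nonessential = True",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `pilot_submit_description("gk.example.org", "jobmanager-condor", "pilot.sh", 10)` yields a five-line description ending in `queue 10`.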
Solutions 3
• Grid Manager Tweaks
  – Previously, one GridManager per user on submit host. Since all sites served by a single user, only one started.
  – GRIDMANAGER_SELECTION_EXPR = GridResource
  – Determines how many GridManagers get started, by providing an expression used to hash resources. Now we have a separate GridManager per gatekeeper, per user on submit host.
• GAHP Server fixes
  – Frequent source of communication errors.
  – Jaime worked with Globus dev (Joe Bester) to integrate upstream fixes into GAHP.
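
The effect of the selection expression can be modelled with a small grouping function. This is a simplified sketch, not Condor's implementation: it assumes jobs are plain dictionaries and treats the expression as a single attribute lookup, whereas Condor evaluates a ClassAd expression. The function name is hypothetical.

```python
from collections import defaultdict


def partition_jobs(jobs, selection_attr=None):
    """Group jobs the way separate GridManagers would see them.

    jobs:           list of dicts, each with at least an "Owner" key
    selection_attr: job attribute standing in for the selection expression;
                    None models the old one-GridManager-per-user behaviour
    """
    groups = defaultdict(list)
    for job in jobs:
        if selection_attr is None:
            key = (job["Owner"],)
        else:
            key = (job["Owner"], job[selection_attr])
        groups[key].append(job)
    return dict(groups)
```

With one user and two distinct GridResource values, `partition_jobs(jobs)` returns one group, while `partition_jobs(jobs, "GridResource")` returns two, so a problem at one gatekeeper is confined to its own group.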
Solutions 4
• Separate throttle on limiting jobmanager processes based on their role:
  – Previously Condor-G had one throttle for the total number of jobmanagers invoked on the remote CE.
    • A surge in job completions/removals will stall new job submission, and vice-versa.
  – Now the throttle limit is broken in half, one for job submission, the other for job completion/cleanup.
  – Sum controlled by: GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE
  – (Might be nice to have distinct settings.)
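
The split-throttle idea can be sketched as two independent budgets carved out of one total. This is a minimal model under stated assumptions, not Condor-G's code: the class name, the role names "submit" and "cleanup", and the way an odd total is divided are all invented for the example.

```python
class SplitThrottle:
    """Two independent jobmanager budgets carved from one total limit."""

    def __init__(self, max_jobmanagers: int):
        if max_jobmanagers < 2:
            raise ValueError("need at least 2 slots to split")
        half = max_jobmanagers // 2
        # Assumption: any odd remainder goes to the cleanup side.
        self.limits = {"submit": half, "cleanup": max_jobmanagers - half}
        self.in_use = {"submit": 0, "cleanup": 0}

    def acquire(self, role: str) -> bool:
        """Take one slot for `role`; return False if that role's half is full."""
        if self.in_use[role] >= self.limits[role]:
            return False
        self.in_use[role] += 1
        return True

    def release(self, role: str) -> None:
        """Return one slot previously acquired for `role`."""
        if self.in_use[role] == 0:
            raise RuntimeError(f"no {role} slot is currently held")
        self.in_use[role] -= 1
```

With a total of 10, a burst of cleanups can hold at most 5 slots, so `acquire("submit")` still succeeds while the cleanup half is saturated.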
Solutions 5
• Improved condor_status -grid output:

[root@gridui11 condor-g-probe]# condor_status -grid
Name                            NumJobs  Allowed  Wanted  Running  Idle
gt2 abitibi.sbgrid.org:2119          20       20       0        0    20
gt2 cmsosgce3.fnal.gov:2119           7        0       0        0     7
gt2 cobalt.uit.tufts.edu:2119        90       90       0       38    52
gt2 fester.utdallas.edu:2119        119      119       0       80    39
gt2 ff-grid3.unl.edu:2119           162      162       0        0   162
gt2 gate01.aglt2.org:2119          2398     2398       0     2017   381
gt2 gk01.atlas-swt2.org:2119         20       20       0        0    20
gt2 gk04.swt2.uta.edu:2119          535      535       0      510    25
gt2 gridgk04.racf.bnl.gov:2119     1994     1994       0      712  1282
gt2 gridgk05.racf.bnl.gov:2119     1410     1398       0      648   737
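
Output in this shape is easy to post-process. The sketch below assumes the column layout shown on this slide (a two-word name followed by five integer columns); the function names, the "stalled" heuristic and its `min_idle` threshold are assumptions for illustration, not part of Condor.

```python
def parse_grid_status(text: str):
    """Parse `condor_status -grid` style rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        # Split from the right so the name may itself contain spaces.
        parts = line.strip().rsplit(None, 5)
        if len(parts) != 6:
            continue
        name, *counts = parts
        try:
            numjobs, allowed, wanted, running, idle = map(int, counts)
        except ValueError:
            continue  # header or other non-data line
        rows.append({"name": name, "numjobs": numjobs, "allowed": allowed,
                     "wanted": wanted, "running": running, "idle": idle})
    return rows


def stalled_resources(rows, min_idle: int = 100):
    """Names of resources with many idle jobs and none running (heuristic)."""
    return [r["name"] for r in rows
            if r["running"] == 0 and r["idle"] >= min_idle]
```

Applied to the listing above with the default threshold, this heuristic would single out gt2 ff-grid3.unl.edu:2119, which has 162 idle jobs and none running.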
Solutions 6
• Establish a stress testbed to explore limits.
  – One submit host at BNL.
  – Four gatekeepers at Wisconsin, in front of a Condor pool of ~7000 nodes.
  – Test job:
    • Sleep 1200
    • 500KB input and output for staging
  – Runs Condor development release.
Solutions (Summary)
• Generally, over a ~6 month period (mid 2009 to early 2010) Jaime and the Condor team:
  – Responded promptly to problem reports.
  – Actively helped us troubleshoot mysterious behavior.
  – Rapidly developed fixes and tweaks to address issues.
  – Provided us with pre-release binaries to test.
  – Made sure we understood how to leverage newly-added features.
Results 1
• Scalability
  – ~5000 job ceiling now up to ~50000(?) per submit host.
  – We are now limited by contention issues and concern about hardware failures more than raw performance.
• Functionality
  – Nonessential jobs enabled.
  – HELD job behavior. Unconditional removal.
• Configurability
  – Tunable via new configuration variables.
• “Monitor-ability”
  – Enhancements to 'condor_status -grid' help us notice and solve problems.
Results 2
• Stress test results:
  – Comfortable limit reached.
  – Manage 50,000 jobs from one submit host.
  – Submit 30,000 jobs to one remote gatekeeper.
    • Gatekeeper runs only GRAM/GridFTP, no other OSG services running on it.
    • 30,000 is a hard limit, restricted by the number of subdirs allowed by the file system. Now exceeded at BNL with BlueArc NFS appliance.
  – All stress test improvements are included in the just-released Condor 7.4.0.
    • Now used on our production submit hosts.
Results 3: The numbers
• ATLAS jobs (pilots) run on OSG (from OSG Gratia report): ~280,000 jobs a day.
Results 4: Nov '08 - Oct '09
Results 5
• Generally all-around improved reliability.
  – Fewer crashed processes.
  – Fewer communication failures.
  – Fewer mysterious anomalies.
• We all sleep better at night.
The Future
• Continue to refine 'condor_status -grid' output.
  – More info, laid out in intuitive fashion.
• Add time-integrated metric information to augment instantaneous info.
  – How many jobs were submitted in the last 10 minutes to Site X?
  – How many finished in the last 10 minutes?
  – Rates rather than absolute numbers.
• Finer-grained internal queue categorization to avoid contention.
  – When multiple queues are served by one GridResource: PanDA considers them separate, while Condor-G thinks they are the same.
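
The time-integrated metric asked for above ("how many in the last 10 minutes?") could be computed with a sliding window over event timestamps. This is one possible sketch, not a planned Condor feature: the class name and methods are hypothetical, and it assumes timestamps arrive in non-decreasing order and are never later than `now`.

```python
from collections import deque


class EventRate:
    """Count events (e.g. pilot submissions to one site) in a sliding window."""

    def __init__(self, window_seconds: float = 600.0):
        if window_seconds <= 0:
            raise ValueError("window must be positive")
        self.window = window_seconds
        self._times = deque()

    def record(self, timestamp: float) -> None:
        """Note one event; timestamps must be non-decreasing."""
        self._times.append(timestamp)

    def count(self, now: float) -> int:
        """Number of events in the half-open interval (now - window, now]."""
        cutoff = now - self.window
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times)

    def per_minute(self, now: float) -> float:
        """Average events per minute over the window."""
        return self.count(now) * 60.0 / self.window
```

One `EventRate` instance per site and per event type (submitted, finished) would answer both questions on the slide and give rates rather than absolute numbers.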
Acknowledgments: Thanks!!
• Jaime Frey
  – Condor-G lead developer.
• Todd Tannenbaum
  – Condor lead developer.
• Xin Zhao
  – BNL OSG Gatekeepers, Condor-G and PanDA Autopilot wrangler.
• Miron Livny
  – Condor Team Leader.