ATLAS Data Challenge Production Experience

Kaushik De
University of Texas at Arlington

Oklahoma D0 SARS Meeting
September 26, 2003
ATLAS Data Challenges
Original Goals (Nov 15, 2001)
- Test computing model, its software, its data model, and ensure the correctness of the technical choices to be made
- Data Challenges should be executed at the prototype Tier centres
- Data Challenges will be used as input for a Computing Technical Design Report due by the end of 2003 (?) and for preparing a MoU

Current Status
- Goals are evolving as we gain experience
- Computing TDR ~end of 2004
- DCs are a ~yearly sequence of increasing scale & complexity
- DC0 and DC1 (completed); DC2 (2004), DC3, and DC4 planned
- Grid deployment and testing is a major part of the DCs
ATLAS DC1: July 2002-April 2003
Goals:
- Produce the data needed for the HLT TDR
- Get as many ATLAS institutes involved as possible
- Worldwide collaborative activity

Participation: 56 institutes (39 in phase 1)
Australia, Austria, Canada, CERN, China, Czech Republic, Denmark*, France, Germany, Greece, Israel, Italy, Japan, Norway*, Poland, Russia, Spain, Sweden*, Taiwan, UK, USA*

(New countries or institutes highlighted in the original slide; * = using Grid)
DC1 Statistics (G. Poulard, July 2003)

Process                   | No. of events | CPU time (kSI2k.months) | CPU-days (400 SI2k) | Volume of data (TB)
Simulation, Physics evt.  | 10^7          | 415                     | 30000               | 23
Simulation, Single part.  | 3x10^7        | 125                     | 9600                | 2
Lumi02 Pile-up            | 4x10^6        | 22                      | 1650                | 14
Lumi10 Pile-up            | 2.8x10^6      | 78                      | 6000                | 21
Reconstruction            | 4x10^6        | 50                      | 3750                |
Reconstruction + Lvl1/2   | 2.5x10^6      | (84)                    | (6300)              |
Total                     |               | 690 (+84)               | 51000 (+6300)       | 60
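As a rough check of the units in the table (assuming about 30 days per month and processors rated at 400 SI2k = 0.4 kSI2k), the kSI2k.months column converts to the CPU-days column as follows:

```python
# Rough unit check for the table above (assumption: ~30 days per month,
# one processor rated at 400 SI2k = 0.4 kSI2k).
def ksi2k_months_to_cpu_days(ksi2k_months, cpu_ksi2k=0.4, days_per_month=30):
    """Convert CPU time in kSI2k.months to CPU-days on a 400 SI2k processor."""
    return ksi2k_months / cpu_ksi2k * days_per_month

print(ksi2k_months_to_cpu_days(415))  # ~31000, close to the quoted 30000 CPU-days
print(ksi2k_months_to_cpu_days(22))   # 1650, matching the Lumi02 pile-up row
```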
DC2: Scenario & Time scale (G. Poulard)

- End-July 03: Release 7 - put in place, understand & validate: Geant4; POOL; LCG applications; Event Data Model
- Mid-November 03: pre-production release - digitization; pile-up; byte-stream; conversion of DC1 data to POOL; large-scale persistency tests and reconstruction
- February 1st 04: Release 8 (production) - testing and validation; run test-production
- April 1st 04: start final validation
- June 1st 04: "DC2" - start simulation; pile-up & digitization; event mixing; transfer data to CERN
- July 15th 04: intensive reconstruction on "Tier0"; distribution of ESD & AOD; calibration; alignment; start physics analysis; reprocessing
U.S. ATLAS DC1 Data Production
- Year-long process, Summer 2002-2003
- Played 2nd largest role in ATLAS DC1
- Exercised both farm and grid based production
- 10 U.S. sites participating
  - Tier 1: BNL; Tier 2 prototypes: BU, IU/UC; Grid Testbed sites: ANL, LBNL, UM, OU, SMU, UTA (UNM & UTPA will join for DC2)
- Generated ~2 million fully simulated, piled-up and reconstructed events
- U.S. was the largest grid-based DC1 data producer in ATLAS
- Data used for the HLT TDR, Athens physics workshop, reconstruction software tests...
U.S. ATLAS Grid Testbed
- BNL - U.S. Tier 1, 2000 nodes, 5% for ATLAS, 10 TB, HPSS through Magda
- LBNL - pdsf cluster, 400 nodes, 5% for ATLAS (more if idle, ~10-15% used), 1 TB
- Boston U. - prototype Tier 2, 64 nodes
- Indiana U. - prototype Tier 2, 64 nodes
- UT Arlington - new 200 CPUs, 50 TB
- Oklahoma U. - OSCER facility
- U. Michigan - test nodes
- ANL - test nodes, JAZZ cluster
- SMU - 6 production nodes
- UNM - Los Lobos cluster
- U. Chicago - test nodes
U.S. Production Summary
Sample           | Files in Magda | Events | CPU hours: Simulation | CPU hours: Pile-up | CPU hours: Reconstruction
25 GeV di-jets   | 41k            | 1M     | ~60k                  | 56k                | 60k+
50 GeV di-jets   | 10k            | 250k   | ~20k                  | 22k                | 20k+
Single particles | 24k            | 200k   | 17k                   | 6k                 |
Higgs sample     | 11k            | 50k    | 8k                    | 2k                 |
SUSY sample      | 7k             | 50k    | 13k                   | 2k                 |
minbias sample   | 7k             | ?      | ?                     |                    |

* Total ~30 CPU years delivered to DC1 from the U.S.
* Total produced file size ~20 TB on HPSS tape system, ~10 TB on disk.
* Black rows - majority grid produced; blue rows - majority farm produced (color coding from the original slide).

- Exercised both farm and grid based production
- Valuable large-scale grid-based production experience
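A quick back-of-the-envelope check of the "~30 CPU years" figure, using only the approximate CPU-hour entries from the table above:

```python
# Approximate CPU-hour entries from the table above (simulation, pile-up, reconstruction).
cpu_hours = [60e3, 56e3, 60e3,   # 25 GeV di-jets
             20e3, 22e3, 20e3,   # 50 GeV di-jets
             17e3, 6e3,          # single particles
             8e3, 2e3,           # Higgs sample
             13e3, 2e3]          # SUSY sample
print(sum(cpu_hours) / (24 * 365))  # ~33 CPU-years, consistent with "~30 CPU years"
```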
Grid Production Statistics
Figure: Pie chart of the sites where DC1 single particle simulation jobs were processed (UTA 33%, OU 20%, LBL 47%). Only three grid testbed sites were used for this production in August 2002.

Figure: Pie chart of the number of pile-up jobs successfully completed at various U.S. grid sites for dataset 2001 (25 GeV di-jets). A total of 6000 partitions were generated.
These are examples of some datasets produced on the Grid. Many other large samples were produced, especially at BNL using batch.
DC1 Production Systems
- Local batch systems - bulk of production
- GRAT - grid scripts, ~50k files produced in U.S.
- NorduGrid - grid system, ~10k files in Nordic countries
- AtCom - GUI, ~10k files at CERN (mostly batch)
- GCE - Chimera based, ~1k files produced
- GRAPPA - interactive GUI for individual users
- EDG - test files only
- + systems I forgot...
- More systems coming for DC2: LCG, GANGA, DIAL
GRAT Software
- GRid Applications Toolkit, developed by KD, Horst Severini, Mark Sosebee, and students
- Based on Globus, Magda & MySQL
- Shell & Python scripts, modular design
- Rapid development platform - quickly develop packages as needed by DC:
  - Physics simulation (GEANT/ATLSIM)
  - Pile-up production & data management
  - Reconstruction
- Test grid middleware, test grid performance
- Modules can be easily enhanced or replaced, e.g. EDG resource broker, Chimera, replica catalogue... (in progress); see the sketch after this list
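The toolkit itself is not reproduced here; the following is only a minimal sketch, with hypothetical function names and hosts, of the kind of modular Python driver the slide describes, where a stage such as job submission can be swapped for a different middleware backend:

```python
# Minimal sketch of GRAT-style modularity (hypothetical names, not the real GRAT code).
# Each stage is a small, replaceable function; e.g. the Globus submitter could be
# swapped for an EDG resource-broker submitter without touching the other stages.

def globus_submit(gatekeeper, script):
    """Submit a job script through a Globus gatekeeper (e.g. via globus-job-submit)."""
    print(f"globus submit {script} -> {gatekeeper}")

def edg_submit(resource_broker, script):
    """Alternative backend: hand the same job to an EDG resource broker instead."""
    print(f"edg submit {script} -> {resource_broker}")

def run_partition(partition, target, submit=globus_submit):
    """Drive one DC1 partition; the submission backend is a pluggable module."""
    script = f"simulate_partition_{partition}.sh"   # produced by the job-creation stage
    submit(target, script)

run_partition(42, "gatekeeper.example.edu/jobmanager-pbs")     # default Globus path
run_partition(43, "rb.example.org", submit=edg_submit)         # swapped-in EDG path
```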
GRAT Execution Model
Steps (see the sketch after this list):
1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-staging
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-staging
9. Cataloging
10. Monitoring

Figure: Workflow diagram connecting the DC1 production database (UTA), the remote gatekeeper, local replica storage, Magda (BNL), the parameter database (CERN), batch execution, and scratch space through the numbered steps above.
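A compact way to picture the loop is the ordered step list below; the comments tie each step to the components in the diagram, and none of these names are the actual GRAT interfaces:

```python
# Illustrative outline of the ten GRAT steps above (names are not the real GRAT API).
STEPS = [
    "resource_discovery",    # 1. find an available remote gatekeeper
    "partition_selection",   # 2. reserve a partition in the DC1 production DB (UTA)
    "job_creation",          # 3. build the job script for that partition
    "pre_staging",           # 4. fetch inputs and parameters (parameter DB at CERN)
    "batch_submission",      # 5. submit through the remote gatekeeper
    "job_parameterization",  # 6. set run and random-number parameters
    "simulation",            # 7. run GEANT/ATLSIM in scratch space
    "post_staging",          # 8. move output to local replica storage
    "cataloging",            # 9. register output files in Magda (BNL)
    "monitoring",            # 10. update job status in the production DB
]

def run_one_partition(handlers):
    """Run a single partition through every step; `handlers` maps step name -> callable."""
    for step in STEPS:
        handlers[step]()
```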
Databases used in GRAT
- Production database: defines logical job parameters & filenames; tracks job status, updated periodically by scripts (see the sketch below)
- Data management (Magda): file registration/catalogue; grid-based file transfers
- Virtual Data Catalogue: simulation job definition; job parameters, random numbers
- Metadata catalogue (AMI): post-production summary information; data provenance
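As an illustration of the production-database bookkeeping described in the first bullet, here is a minimal sketch with a hypothetical schema (the real system used MySQL; sqlite3 is used only to keep the example self-contained and runnable):

```python
# Hypothetical sketch of production-database bookkeeping (not the actual DC1 schema).
# The real GRAT production database ran on MySQL; sqlite3 keeps this example runnable.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    partition INTEGER PRIMARY KEY,  -- logical partition number
    dataset   TEXT,                 -- e.g. dataset 2001 (25 GeV di-jets)
    site      TEXT,                 -- grid site running the job
    status    TEXT                  -- defined / running / done / failed
)""")

# Define a logical job, then let the grid scripts update its status periodically.
db.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)", (17, "dataset 2001", None, "defined"))
db.execute("UPDATE jobs SET site = ?, status = ? WHERE partition = ?", ("UTA", "running", 17))
print(db.execute("SELECT * FROM jobs").fetchone())  # (17, 'dataset 2001', 'UTA', 'running')
```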
U.S. Middleware Evolution
Figure: Middleware evolution table. The status notes read: "Used for 95% of DC1 production"; "Used successfully for simulation"; "Tested for simulation, used for all grid-based reconstruction"; "Used successfully for simulation (complex pile-up workflow not yet)". The corresponding middleware names appeared only in the original graphic.
U.S. Experience with DC1
- ATLAS software distribution worked well for DC1 farm production, but was not well suited for grid production
- No integration of databases - caused many problems
- Magda & AMI very useful - but we are missing a data management tool for truly distributed production
- Required a lot of people to run production in the U.S., especially with so many sites on both grid and farm
- Startup of grid production was slow - but we learned useful lessons
- Software releases were often late - leading to a chaotic last-minute rush to finish production
Plans for New DC2 Production System
- Need a unified system for ATLAS
  - for efficient usage of facilities, improved scheduling, better QC
  - should support all varieties of grid middleware (& batch?)
- First "technical" meeting at CERN, August 11-12, 2003
  - phone meetings, forming code development groups
  - all grid systems represented
  - design document is being prepared
  - planning a Supervisor/Executor model (see fig. next slide)
  - first prototype software should be released in ~6 months
  - U.S. well represented in this common ATLAS effort
  - still unresolved - Data Management System
  - strong coordination with database group
Schematic of New DC2 System
Main features:
- Common production database for all of ATLAS
- Common ATLAS supervisor run by all facilities/managers
- Common data management system a la Magda
- Executors developed by middleware experts (LCG, NorduGrid, Chimera teams)
- Final verification of data done by the supervisor

U.S. involved in almost all aspects - could use more help. A rough sketch of the Supervisor/Executor split follows below.
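The sketch below illustrates the split only in outline; the class and method names are purely hypothetical, since the real interfaces are being defined in the DC2 design document:

```python
# Hypothetical Supervisor/Executor sketch (not the actual DC2 interfaces).
# One common supervisor reads the common production database and hands generic
# job definitions to middleware-specific executors (LCG, NorduGrid, Chimera, ...).

class Executor:
    """Base class each middleware team would implement."""
    def submit(self, job):        # translate a generic job definition and submit it
        raise NotImplementedError
    def status(self, job_id):     # report back: running / done / failed
        raise NotImplementedError

class Supervisor:
    def __init__(self, prod_db, data_mgmt, executors):
        self.prod_db = prod_db        # common production database
        self.data_mgmt = data_mgmt    # common data management system, a la Magda
        self.executors = executors    # {"LCG": ..., "NorduGrid": ..., "Chimera": ...}

    def run_cycle(self):
        # hand ready jobs to the appropriate middleware-specific executor
        for job in self.prod_db.fetch_ready_jobs():
            job_id = self.executors[job["flavour"]].submit(job)
            self.prod_db.mark_submitted(job, job_id)
        # final verification of produced data is done by the supervisor
        for job in self.prod_db.fetch_finished_jobs():
            if self.data_mgmt.verify_output(job):
                self.prod_db.mark_validated(job)
```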
Conclusion
- Data Challenges are important for ATLAS software and computing infrastructure readiness
- U.S. playing a major role in DC planning & production
- 12 U.S. sites ready to participate in DC2
- UTA & OU - major role in production software development
- Physics analysis will be the emphasis of DC2 - new experience
- Involvement by more U.S. physicists is needed in DC2:
  - to verify quality of data
  - to tune physics algorithms
  - to test scalability of the physics analysis model