The first year of LHC physics analysis using the GRID: Prospects from ATLAS
The first year of LHC physics analysis using the GRID: Prospects from ATLAS
Davide Costanzo, University of Sheffield
Giant detector, giant computing
ATLAS Computing
• Grid-based, multi-tier computing model:
– Tier-0 at CERN. First step processing (within 24 hours), storage of Raw data, first-pass calibration
– Tier-1. About 10 worldwide. Reprocessing, data storage (real data and simulation), …
– Tier-2. Regional facilities. Storage of Analysis Object Data, simulation, …
– Tier-3. Small clusters, users’ desktops.
• 3 different “flavors” of grid middleware:
– LCG in Europe, Canada and the Far East
– OSG in the US
– NorduGrid in Scandinavia and a few other countries
Event Processing Data Flow
• Raw Data Objects (RDO): detector output (bytestream object view) or simulation output. Size/event: ~3 MBytes
• Detector Reconstruction → Event Summary Data (ESD): Tracks, Segments, Calorimeter Towers, … Size/event: 500 KBytes (target for stable data taking)
• Combined Reconstruction → Analysis Object Data (AOD): analysis objects – Electron, Photon, Muon, TrackParticle, … Size/event: 100 KBytes
• AOD → User Analysis
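The per-event sizes above fix the storage budget at each tier. A back-of-the-envelope helper makes the arithmetic explicit (sizes are the ones quoted on the slide; the 10M-event sample is purely illustrative):

```python
# Approximate event sizes from the slide, in kBytes/event.
EVENT_SIZE_KB = {
    "RDO": 3000.0,  # ~3 MBytes, detector/simulation output
    "ESD": 500.0,   # target for stable data taking
    "AOD": 100.0,   # analysis objects shipped to Tier-2s
}

def dataset_size_gb(n_events: int, fmt: str) -> float:
    """Estimated dataset size in GBytes for n_events in the given format."""
    return n_events * EVENT_SIZE_KB[fmt] / 1e6

# Example: a 10M-event AOD sample needs about 1 TB of Tier-2 disk.
print(dataset_size_gb(10_000_000, "AOD"))  # 1000.0
```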
Simplified ATLAS Analysis
• Ideal Scenario (several times/day):
– Read AOD and create ntuple
– Loop over ntuple and make histograms
– Use ROOT, make plots – go to ICHEP (or other conference)
• Realistic Scenario (few times/week):
– Customization in the AOD building stage
– Different analyses have different needs
• Start-up Scenario (once a month?):
– Iterations needed on some data samples to improve Detector Reconstruction
• Distributed event processing (on the Grid)
– Data sets “scattered” across several grid systems
– Need distributed analysis
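The "loop over ntuple and make histograms" step would normally use ROOT's TH1 classes; a pure-Python stand-in (column and variable names are illustrative, not from any real ntuple) shows the pattern:

```python
def fill_hist(values, nbins, lo, hi):
    """Fill a fixed-binning 1D histogram; out-of-range values are dropped."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts

# Hypothetical ntuple column: electron pT in GeV.
electron_pt = [12.0, 25.5, 41.0, 18.3, 77.2]
hist = fill_hist(electron_pt, nbins=4, lo=0.0, hi=100.0)
print(hist)  # [2, 2, 0, 1]
```

In the ideal scenario this loop runs locally over a small ntuple; the rest of the talk is about what happens when the input no longer fits on one disk.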
ATLAS and the Grid: Past experience
• 2002-3 Data Challenge 1
– Contribution from about 50 sites. First use of the grid
– Prototype distributed data management system
• 2004 Data Challenge 2
– Full use of the grid
– ATLAS middleware not fully ready
– Long delays, simulation data not accessible
– Physics validation not possible; events not used for physics analysis
• 2005 “Rome Physics Workshop” and combined test beam
– Centralized job definition
– First users’ exposure to the Grid (deliver ~10M validated events)
– Learn pros and cons of Distributed Data Management (DDM)
ATLAS and the Grid: Present (and Future)
• 2006 Computing System Commissioning (CSC) and Calibration Data Challenge
– Use subsequent bug-fix software releases to ramp up the system (validation)
– Access (distributed) database data (eg calibration data)
– Decentralize job definition
– Test distributed analysis system
• 2006-7 Collection of about 25 physics notes
– Use events produced for CSC
– Concentrate on techniques to estimate Standard Model background
– Prepare physicists for the LHC challenge
• 2006 and beyond. Data taking
– ATLAS is already taking cosmic data
– Collider data is about to start
– Exciting physics is around the corner
ATLAS Distributed Data Management
• ATLAS reviewed all its own Grid systems during the first half of 2005
• A new Distributed Data Management System (DDM) was designed:
– A hierarchical definition of datasets
– Central dataset catalogues
– Data-blocks as units of file storage and replication
– Distributed file catalogues
– Automatic data transfer mechanisms using distributed services (dataset subscription system)
• The DDM system allows the implementation of the basic ATLAS Computing Model concepts, as described in the Computing Technical Design Report (June 2005):
– Distribution of raw and reconstructed data from CERN to the Tier-1s
– Distribution of AODs (Analysis Object Data) to Tier-2 centres for analysis
– Storage of simulated data (produced by Tier-2s) at Tier-1 centres for further distribution and/or processing
ATLAS DDM organization
ATLAS Data Management Model
• Tier-1s send AOD data to Tier-2s
• Tier-2s produce simulated data and send them to Tier-1s
• In an ideal world (perfect network communication hardware and software) we would not need to define default Tier-1–Tier-2 associations
• In practice, it turns out to be convenient (robust) to partition the Grid so that there are default (not compulsory) data paths between Tier-1s and Tier-2s
• In this model, a number of data management services are installed only at Tier-1s and act also on their “associated” Tier-2s
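The "default but not compulsory" association amounts to a lookup with a fallback. A minimal sketch, with entirely hypothetical site names:

```python
# Hypothetical default Tier-2 -> Tier-1 associations (names are illustrative).
DEFAULT_T1 = {
    "T2_Sheffield": "T1_RAL",
    "T2_Clermont": "T1_Lyon",
}

def upload_path(tier2, associations=DEFAULT_T1, fallback="T1_BNL"):
    """Default (not compulsory) Tier-1 for a Tier-2's output.

    Any Tier-1 may still be used if the default path is unavailable;
    the fallback here stands in for that choice.
    """
    return associations.get(tier2, fallback)
```

The robustness comes from the partitioning: each Tier-1 runs data management services on behalf of its associated Tier-2s, so a Tier-2 never needs global knowledge of the Grid.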
Job Management: Productions
• Once we have data distributed in the correct way, we can rework the distributed production system to optimise job distribution, by sending jobs to the data (or as close as possible to them)
– This was not the case previously, as jobs were sent to free CPUs and had to copy the input file(s) to the local WN, from wherever in the world the data happened to be
• Next: make better use of the task and dataset concepts
– A “task” acts on a dataset and produces more datasets
– Use bulk submission functionality to send all jobs of a given task to the location of their input datasets
– Minimize the dependence on file transfers and the waiting time before execution
– Collect output files belonging to the same dataset at the same SE and transfer them asynchronously to their final locations
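"Sending jobs to the data" is, at its core, a brokerage decision: restrict the candidate sites to those holding a replica of the input dataset, then break ties by load. An illustrative sketch (not the real ProdSys brokerage algorithm):

```python
def choose_site(dataset, replicas, queued_jobs):
    """Pick a site to run a job on its input dataset.

    replicas:    site -> set of dataset names held at that site
    queued_jobs: site -> number of jobs already waiting there
    Returns None when no site holds the data (a transfer is needed first).
    """
    candidates = [site for site, held in replicas.items() if dataset in held]
    if not candidates:
        return None
    # Among sites with the data, prefer the shortest queue.
    return min(candidates, key=lambda s: queued_jobs.get(s, 0))
```

Under the old model the candidate list was "any free CPU", which is exactly what forced the world-wide input copies the slide describes.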
ATLAS Production System (2006)
[Diagram: a central production database (prodDB, holding the jobs) and the Data Management system (DMS, built on DQ2) feed the Eowyn supervisor, which takes tasks and drives Python-based executors, one per Grid flavor: Lexor and Lexor-CG for EGEE, Dulcinea for NorduGrid, PanDA for OSG, plus an LSF executor for the CERN Tier-0 (T0MS).]
Job Management: Analysis
• A system based on a central database (job queue) is good for scheduled productions (as it allows proper priority settings), but too heavy for user tasks such as analysis
• Lacking a global way to submit jobs, a few tools have been developed to submit Grid jobs in the meantime:
– LJSF (Lightweight Job Submission Framework) can submit ATLAS jobs to the LCG/EGEE Grid
– Pathena (a parallel version of the ATLAS software framework, Athena) can generate ATLAS jobs that act on a dataset and submit them to PanDA on the OSG Grid
• The ATLAS baseline tool to help users submit Grid jobs is Ganga
– First Ganga tutorial given to ATLAS 3 weeks ago
– Ganga and pathena integrated to submit jobs to different grids
ATLAS Analysis Work Model
1. Job Preparation
– Local system (shell): prepare JobOptions, run Athena (interactive or batch), get output
2. Medium-scale testing
– Local system (Ganga): prepare JobOptions, find dataset from DDM, generate & submit jobs
– Grid: run Athena
– Local system (Ganga): job book-keeping, access output from Grid, merge results
3. Large-scale running
– Local system (Ganga): prepare JobOptions, find dataset from DDM, generate & submit jobs
– ProdSys: run Athena on Grid, store output on Grid
– Local system (Ganga): job book-keeping, get output
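The "generate & submit jobs" and "merge results" steps of stages 2 and 3 boil down to splitting a dataset's file list into sub-jobs and recombining their outputs. A minimal sketch of that split/merge pattern (function names are illustrative, not the Ganga API):

```python
def split_by_files(files, files_per_job):
    """Split a dataset's file list into per-sub-job input lists."""
    return [files[i:i + files_per_job]
            for i in range(0, len(files), files_per_job)]

def merge_histograms(outputs):
    """Merge per-job histograms (identical binning assumed) bin by bin."""
    merged = [0] * len(outputs[0])
    for hist in outputs:
        for i, count in enumerate(hist):
            merged[i] += count
    return merged

# Five input files, two per sub-job -> three Grid jobs.
jobs = split_by_files(["f1", "f2", "f3", "f4", "f5"], files_per_job=2)
print(len(jobs))  # 3
```

Book-keeping then amounts to remembering which sub-job got which files, so failed jobs can be resubmitted without rerunning the rest.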
Distributed analysis use cases
• Statistics analyses (eg W mass) on datasets of several million events:
– All data files may not be kept on a local disk
– Jobs are sent to the AODs on the grid to make ntuples for analysis
– Parallel processing required
• Select a few interesting (candidate) events to analyze (eg H→4ℓ):
– Information on AODs may not be enough
– ESD files accessed to make a loose selection and copy candidate events to a local disk
• Use cases to be exercised in the coming Computing System Commissioning tests
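The second use case is a skim: a deliberately loose filter run on the Grid, with only the surviving candidates copied home. A toy version of such a filter (the lepton-count and pT threshold are illustrative, not an official ATLAS selection):

```python
def loose_4lepton_skim(events, pt_min_gev=5.0):
    """Keep events with at least four leptons above a loose pT cut,
    as a stand-in for an H->4l candidate pre-selection."""
    selected = []
    for event in events:
        hard_leptons = [pt for pt in event["lepton_pt"] if pt > pt_min_gev]
        if len(hard_leptons) >= 4:
            selected.append(event)
    return selected
```

The cut is kept loose on purpose: the candidates are studied in detail later from the local copy, so the skim only has to be efficient, not pure.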
From managed production to Distributed Analysis
• Centrally managed production is now “routine work”:
– Request a dataset from a physics group convener
– Physics groups collect requests
– Physics coordination keeps track of all requests and passes them to the computing operations team
– Pros: well organized, uniform software used, well documented
– Cons: bureaucratic! Takes time to get what you need…
• Delegate definition of jobs to physics and combined performance working groups:
– Remove a management layer
– Still requires central organization to avoid duplication of effort
– Accounting and priorities?
• Job definition/submission for every ATLAS user:
– Pros: you get what you want
– Cons: no uniformity, some duplication of effort
Resource Management
• In order to provide a usable global system, a few more pieces must work as well:
– Accounting at user and group level
– Fair share (job priorities) for workload management
– Storage quotas for data management
• Define ~25 groups and ~3 roles in VOMS:
– Perhaps they are not trivial
– Perhaps they must force re-thinking of some of the current implementations
• In any case we cannot advertise a system that is “free for all” (no job priorities, no storage quotas)
– Therefore we need these features “now”
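One common way to turn group shares into job priorities is a deficit rule: always serve the group whose consumed fraction lags furthest behind its nominal share. A toy sketch of that idea (this is a generic fair-share heuristic, not the actual VOMS/workload-management implementation; group names and shares are invented):

```python
def next_group(shares, used):
    """Pick the group whose used fraction lags furthest behind its
    nominal share (deficit-based fair share)."""
    total_share = sum(shares.values())
    total_used = sum(used.values()) or 1  # avoid division by zero at start-up

    def deficit(group):
        return shares[group] / total_share - used.get(group, 0) / total_used

    return max(shares, key=deficit)
```

With ~25 groups the bookkeeping stays trivial; the hard part the slide points at is enforcing the result across three middleware flavors.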
Conclusions
• ATLAS is currently using GRID resources for MC-based studies and real data from the combined test-beam and cosmic rays
– A user community is emerging
– Continue to review critical components to make sure we have everything we need
• Now we need stability and reliability more than new functionality
– New components may be welcome in production, if they are shown to provide better performance than existing ones, but only after thorough testing in pre-production service instances
• The challenge of data taking is still in front of us!
– Simulation exercises can teach us several lessons, but they are just the beginning of the story…