The first year of LHC physics analysis using the GRID: Prospects from ATLAS
The first year of LHC physics analysis using the GRID: Prospects from ATLAS
Davide Costanzo, University of Sheffield
Giant detector, giant computing
ATLAS Computing
• Grid-based, multi-tier computing model:
– Tier-0 at CERN. First step processing (within 24 hours), storage of Raw data, first-pass calibration
– Tier-1. About 10 worldwide. Reprocessing, data storage (real data and simulation), …
– Tier-2. Regional facilities. Storage of Analysis Object Data, simulation, …
– Tier-3. Small clusters, users’ desktops.
• 3 different “flavors” of grid middleware:
– LCG in Europe, Canada and the Far East
– OSG in the US
– NorduGrid in Scandinavia and a few other countries
Event Processing Data Flow
• Raw Data Objects (RDO): detector output (bytestream object view) or simulation output. Size/event: ~3 MBytes
• Detector Reconstruction → Event Summary Data (ESD): Tracks, Segments, Calorimeter Towers, … Size/event: 500 KBytes (target for stable data taking)
• Combined Reconstruction → Analysis Object Data (AOD): analysis objects – Electron, Photon, Muon, TrackParticle, … Size/event: 100 KBytes
• AOD → User Analysis
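The per-event sizes above fix the storage budget at each tier. A back-of-the-envelope helper makes the arithmetic explicit (sizes are the ones quoted on the slide; the 10M-event sample is purely illustrative):

```python
# Approximate event sizes from the slide, in kBytes/event.
EVENT_SIZE_KB = {
    "RDO": 3000.0,  # ~3 MBytes, detector/simulation output
    "ESD": 500.0,   # target for stable data taking
    "AOD": 100.0,   # analysis objects shipped to Tier-2s
}

def dataset_size_gb(n_events: int, fmt: str) -> float:
    """Estimated dataset size in GBytes for n_events in the given format."""
    return n_events * EVENT_SIZE_KB[fmt] / 1e6

# Example: a 10M-event AOD sample needs about 1 TB of Tier-2 disk.
print(dataset_size_gb(10_000_000, "AOD"))  # 1000.0
```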
Simplified ATLAS Analysis
• Ideal Scenario (several times/day):
– Read AOD and create ntuple
– Loop over ntuple and make histograms
– Use ROOT, make plots – go to ICHEP (or other conference)
• Realistic Scenario (few times/week):
– Customization in the AOD building stage
– Different analyses have different needs
• Start-up Scenario (once a month?):
– Iterations needed on some data samples to improve Detector Reconstruction
• Distributed event processing (on the Grid)
– Data sets “scattered” across several grid systems
– Need distributed analysis
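The "loop over ntuple and make histograms" step would normally use ROOT's TH1 classes; a pure-Python stand-in (column and variable names are illustrative, not from any real ntuple) shows the pattern:

```python
def fill_hist(values, nbins, lo, hi):
    """Fill a fixed-binning 1D histogram; out-of-range values are dropped."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts

# Hypothetical ntuple column: electron pT in GeV.
electron_pt = [12.0, 25.5, 41.0, 18.3, 77.2]
hist = fill_hist(electron_pt, nbins=4, lo=0.0, hi=100.0)
print(hist)  # [2, 2, 0, 1]
```

In the ideal scenario this loop runs locally over a small ntuple; the rest of the talk is about what happens when the input no longer fits on one disk.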
ATLAS and the Grid: Past experience
• 2002-3 Data Challenge 1
– Contribution from about 50 sites. First use of the grid
– Prototype distributed data management system
• 2004 Data Challenge 2
– Full use of the grid
– ATLAS middleware not fully ready
– Long delays, simulation data not accessible
– Physics validation not possible; events not used for physics analysis
• 2005 “Rome Physics Workshop” and combined test beam
– Centralized job definition
– First users’ exposure to the Grid (deliver ~10M validated events)
– Learn pros and cons of Distributed Data Management (DDM)
ATLAS and the Grid: Present (and Future)
• 2006 Computing System Commissioning (CSC) and Calibration Data Challenge
– Use subsequent bug-fix software releases to ramp up the system (validation)
– Access (distributed) database data (eg calibration data)
– Decentralize job definition
– Test distributed analysis system
• 2006-7 Collection of about 25 physics notes
– Use events produced for CSC
– Concentrate on techniques to estimate Standard Model background
– Prepare physicists for the LHC challenge
• 2006 and beyond. Data taking
– ATLAS is already taking cosmic data
– Collider data is about to start
– Exciting physics is around the corner
ATLAS Distributed Data Management
• ATLAS reviewed all its own Grid systems during the first half of 2005
• A new Distributed Data Management System (DDM) was designed:
– A hierarchical definition of datasets
– Central dataset catalogues
– Data-blocks as units of file storage and replication
– Distributed file catalogues
– Automatic data transfer mechanisms using distributed services (dataset subscription system)
• The DDM system allows the implementation of the basic ATLAS Computing Model concepts, as described in the Computing Technical Design Report (June 2005):
– Distribution of raw and reconstructed data from CERN to the Tier-1s
– Distribution of AODs (Analysis Object Data) to Tier-2 centres for analysis
– Storage of simulated data (produced by Tier-2s) at Tier-1 centres for further distribution and/or processing
ATLAS DDM organization
ATLAS Data Management Model
• Tier-1s send AOD data to Tier-2s
• Tier-2s produce simulated data and send them to Tier-1s
• In an ideal world (perfect network communication hardware and software) we would not need to define default Tier-1–Tier-2 associations
• In practice, it turns out to be convenient (robust) to partition the Grid so that there are default (not compulsory) data paths between Tier-1s and Tier-2s
• In this model, a number of data management services are installed only at Tier-1s and act also on their “associated” Tier-2s
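The "default but not compulsory" association amounts to a lookup with a fallback. A minimal sketch, with entirely hypothetical site names:

```python
# Hypothetical default Tier-2 -> Tier-1 associations (names are illustrative).
DEFAULT_T1 = {
    "T2_Sheffield": "T1_RAL",
    "T2_Clermont": "T1_Lyon",
}

def upload_path(tier2, associations=DEFAULT_T1, fallback="T1_BNL"):
    """Default (not compulsory) Tier-1 for a Tier-2's output.

    Any Tier-1 may still be used if the default path is unavailable;
    the fallback here stands in for that choice.
    """
    return associations.get(tier2, fallback)
```

The robustness comes from the partitioning: each Tier-1 runs data management services on behalf of its associated Tier-2s, so a Tier-2 never needs global knowledge of the Grid.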
Job Management: Productions
• Once we have data distributed in the correct way, we can rework the distributed production system to optimise job distribution, by sending jobs to the data (or as close as possible to them)
– This was not the case previously, as jobs were sent to free CPUs and had to copy the input file(s) to the local WN, from wherever in the world the data happened to be
• Next: make better use of the task and dataset concepts
– A “task” acts on a dataset and produces more datasets
– Use bulk submission functionality to send all jobs of a given task to the location of their input datasets
– Minimize the dependence on file transfers and the waiting time before execution
– Collect output files belonging to the same dataset at the same SE and transfer them asynchronously to their final locations
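"Sending jobs to the data" is, at its core, a brokerage decision: restrict the candidate sites to those holding a replica of the input dataset, then break ties by load. An illustrative sketch (not the real ProdSys brokerage algorithm):

```python
def choose_site(dataset, replicas, queued_jobs):
    """Pick a site to run a job on its input dataset.

    replicas:    site -> set of dataset names held at that site
    queued_jobs: site -> number of jobs already waiting there
    Returns None when no site holds the data (a transfer is needed first).
    """
    candidates = [site for site, held in replicas.items() if dataset in held]
    if not candidates:
        return None
    # Among sites with the data, prefer the shortest queue.
    return min(candidates, key=lambda s: queued_jobs.get(s, 0))
```

Under the old model the candidate list was "any free CPU", which is exactly what forced the world-wide input copies the slide describes.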
ATLAS Production System (2006)
[Diagram: a central production database (prodDB, holding the jobs) and the Data Management system (DMS, built on DQ2) feed the Eowyn supervisor, which takes tasks and drives Python-based executors, one per Grid flavor: Lexor and Lexor-CG for EGEE, Dulcinea for NorduGrid, PanDA for OSG, plus an LSF executor for the CERN Tier-0 (T0MS).]
Job Management: Analysis
• A system based on a central database (job queue) is good for scheduled productions (as it allows proper priority settings), but too heavy for user tasks such as analysis
• Lacking a global way to submit jobs, a few tools have been developed to submit Grid jobs in the meantime:
– LJSF (Lightweight Job Submission Framework) can submit ATLAS jobs to the LCG/EGEE Grid
– Pathena (a parallel version of the ATLAS software framework, Athena) can generate ATLAS jobs that act on a dataset and submit them to PanDA on the OSG Grid
• The ATLAS baseline tool to help users submit Grid jobs is Ganga
– First Ganga tutorial given to ATLAS 3 weeks ago
– Ganga and pathena integrated to submit jobs to different grids
ATLAS Analysis Work Model
1. Job Preparation
– Local system (shell): prepare JobOptions, run Athena (interactive or batch), get output
2. Medium-scale testing
– Local system (Ganga): prepare JobOptions, find dataset from DDM, generate & submit jobs
– Grid: run Athena
– Local system (Ganga): job book-keeping, access output from Grid, merge results
3. Large-scale running
– Local system (Ganga): prepare JobOptions, find dataset from DDM, generate & submit jobs
– ProdSys: run Athena on Grid, store output on Grid
– Local system (Ganga): job book-keeping, get output
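The "generate & submit jobs" and "merge results" steps of stages 2 and 3 boil down to splitting a dataset's file list into sub-jobs and recombining their outputs. A minimal sketch of that split/merge pattern (function names are illustrative, not the Ganga API):

```python
def split_by_files(files, files_per_job):
    """Split a dataset's file list into per-sub-job input lists."""
    return [files[i:i + files_per_job]
            for i in range(0, len(files), files_per_job)]

def merge_histograms(outputs):
    """Merge per-job histograms (identical binning assumed) bin by bin."""
    merged = [0] * len(outputs[0])
    for hist in outputs:
        for i, count in enumerate(hist):
            merged[i] += count
    return merged

# Five input files, two per sub-job -> three Grid jobs.
jobs = split_by_files(["f1", "f2", "f3", "f4", "f5"], files_per_job=2)
print(len(jobs))  # 3
```

Book-keeping then amounts to remembering which sub-job got which files, so failed jobs can be resubmitted without rerunning the rest.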
Distributed analysis use cases
• Statistics analyses (eg W mass) on datasets of several million events:
– All data files may not be kept on a local disk
– Jobs are sent to the AODs on the grid to make ntuples for analysis
– Parallel processing required
• Select a few interesting (candidate) events to analyze (eg H→4ℓ):
– Information on AODs may not be enough
– ESD files accessed to make a loose selection and copy candidate events to a local disk
• Use cases to be exercised in the coming Computing System Commissioning tests
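The second use case is a skim: a deliberately loose filter run on the Grid, with only the surviving candidates copied home. A toy version of such a filter (the lepton-count and pT threshold are illustrative, not an official ATLAS selection):

```python
def loose_4lepton_skim(events, pt_min_gev=5.0):
    """Keep events with at least four leptons above a loose pT cut,
    as a stand-in for an H->4l candidate pre-selection."""
    selected = []
    for event in events:
        hard_leptons = [pt for pt in event["lepton_pt"] if pt > pt_min_gev]
        if len(hard_leptons) >= 4:
            selected.append(event)
    return selected
```

The cut is kept loose on purpose: the candidates are studied in detail later from the local copy, so the skim only has to be efficient, not pure.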
From managed production to Distributed Analysis
• Centrally managed production is now “routine work”:
– Request a dataset from a physics group convener
– Physics groups collect requests
– Physics coordination keeps track of all requests and passes them to the computing operations team
– Pros: well organized, uniform software used, well documented
– Cons: bureaucratic! Takes time to get what you need…
• Delegate definition of jobs to physics and combined performance working groups:
– Remove a management layer
– Still requires central organization to avoid duplication of effort
– Accounting and priorities?
• Job definition/submission for every ATLAS user:
– Pros: you get what you want
– Cons: no uniformity, some duplication of effort
Resource Management
• In order to provide a usable global system, a few more pieces must work as well:
– Accounting at user and group level
– Fair share (job priorities) for workload management
– Storage quotas for data management
• Define ~25 groups and ~3 roles in VOMS:
– Perhaps they are not trivial
– Perhaps they must force re-thinking of some of the current implementations
• In any case we cannot advertise a system that is “free for all” (no job priorities, no storage quotas)
– Therefore we need these features “now”
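One common way to turn group shares into job priorities is a deficit rule: always serve the group whose consumed fraction lags furthest behind its nominal share. A toy sketch of that idea (this is a generic fair-share heuristic, not the actual VOMS/workload-management implementation; group names and shares are invented):

```python
def next_group(shares, used):
    """Pick the group whose used fraction lags furthest behind its
    nominal share (deficit-based fair share)."""
    total_share = sum(shares.values())
    total_used = sum(used.values()) or 1  # avoid division by zero at start-up

    def deficit(group):
        return shares[group] / total_share - used.get(group, 0) / total_used

    return max(shares, key=deficit)
```

With ~25 groups the bookkeeping stays trivial; the hard part the slide points at is enforcing the result across three middleware flavors.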
Conclusions
• ATLAS is currently using GRID resources for MC-based studies and real data from the combined test-beam and cosmic rays
– A user community is emerging
– Continue to review critical components to make sure we have everything we need
• Now we need stability and reliability more than new functionality
– New components may be welcome in production, if they are shown to provide better performance than existing ones, but only after thorough testing in pre-production service instances
• The challenge of data taking is still in front of us!
– Simulation exercises can teach us several lessons, but they are just the beginning of the story…