DØ Computing Model & Monte Carlo & Data Reprocessing Gavin Davies Imperial College London DOSAR...


DØ Computing Model & Monte Carlo & Data Reprocessing
Gavin Davies
Imperial College London
DOSAR Workshop, Sao Paulo, September 2005


Outline
Operational status
  Globally, continuing to do well; a view shared by the recent Run II Computing Review
DØ computing model
  Ongoing, ‘long’-established plan
Production computing
  Monte Carlo
  Reprocessing of Run II data
    10^9 events reprocessed on the grid: the largest HEP grid effort
Looking forward
Conclusions


Snapshot of Current Status
Reconstruction is keeping up with data taking
Data handling is performing well
Production computing is off-site and grid based; it continues to grow and work well
  Over 75 million Monte Carlo events produced in the last year
  Run IIa data set being reprocessed on the grid: 10^9 events
Analysis CPU power has been expanded
Globally doing well; a view shared by the recent Run II Computing Review


Computing Model
Started with distributed computing, evolving to automated use of common tools/solutions on the grid (SAM-Grid) for all tasks
  Scalable
  Not alone: joint effort with others at FNAL and elsewhere, LHC…
1997: original plan
  All Monte Carlo to be produced off-site
  SAM to be used for all data handling; provides a ‘data-grid’
Now: Monte Carlo and data reprocessing with SAM-Grid
Next: other production tasks, e.g. fixing, and then user analysis
Use concept of Regional Centres
  DOSAR one of the pioneers
  Builds local expertise


Reconstruction Release
Periodically update the version of the reconstruction code
  As we develop new / more refined algorithms
  As we get a better understanding of the detector
Frequency of releases decreases with time
  One major release in the last year: p17
  Basis for current Monte Carlo (MC) & data reprocessing
Benefits of p17
  Reco speed-up
  Full calorimeter calibration
  Fuller description of detector material
  Use of zero-bias overlay for MC
(More details: http://cdinternal.fnal.gov/RUNIIRev/runIIMP05.asp)


Data Handling - SAM
SAM continues to perform well, providing a data-grid
  50 SAM sites worldwide
  Over 2.5 PB (50B events) consumed in the last year
  Up to 300 TB moved per month
  Larger SAM cache solved tape access issues
Continued success of SAM shifters
  Often remote collaborators
  Form the first line of defense
SAMTV monitors SAM & SAM stations
http://d0db-prd.fnal.gov/sm_local/SamAtAGlance/
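To put the SAM figures above in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes decimal units (1 PB = 10^15 bytes, 1 TB = 10^12 bytes) and a 30-day month, neither of which is stated on the slide, and the variable names are illustrative.

```python
# Figures quoted on the slide (decimal units are an assumption).
BYTES_CONSUMED_PER_YEAR = 2.5e15      # "over 2.5 PB consumed in the last year"
EVENTS_CONSUMED_PER_YEAR = 50e9       # "50B events"
PEAK_BYTES_MOVED_PER_MONTH = 300e12   # "up to 300 TB moved per month"

SECONDS_PER_MONTH = 30 * 24 * 3600    # assumption: a 30-day month

# Average size of a consumed event, over all data tiers read through SAM.
avg_bytes_per_event = BYTES_CONSUMED_PER_YEAR / EVENTS_CONSUMED_PER_YEAR

# Sustained transfer rate implied by a 300 TB month, in MB/s.
peak_month_rate_mb_s = PEAK_BYTES_MOVED_PER_MONTH / SECONDS_PER_MONTH / 1e6

print(f"average size per consumed event: {avg_bytes_per_event / 1e3:.0f} kB")
print(f"sustained rate in a 300 TB month: {peak_month_rate_mb_s:.0f} MB/s")
```

Under these assumptions the average consumed event is about 50 kB, and a 300 TB month corresponds to a little over 100 MB/s sustained across all sites.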


SAMGrid
SAM: data handling
JIM: job submission & monitoring
SAM + JIM → SAM-Grid
More than 10 DØ execution sites
http://samgrid.fnal.gov:8080/
http://samgrid.fnal.gov:8080/list_of_schedulers.php
http://samgrid.fnal.gov:8080/list_of_resources.php


Remote Production Activities – Monte Carlo - I
Over 75M events produced in the last year, at more than 10 sites
  More than double the previous year’s production
Vast majority on shared sites
  DOSAR a major part of this
SAM-Grid introduced in spring 04, becoming the default
  Based on request system and jobmanager-mc_runjob
  MC software package retrieved via SAMo way, inc central farm
  Average production efficiency ~90%
  Average inefficiency due to grid infrastructure ~1-5%
  http://www-d0.fnal.gov/computing/grid/deployment-issues.html
Continued move to common tools
  DOSAR sites continue the move to SAMGrid from McFarm



Remote Production Activities – Monte Carlo - II
Beyond just ‘shared’ resources
  More than 17M events produced ‘directly’ on LCG via submission from Nikhef
  Good example of a remote site driving the ‘development’
Similar momentum building on/for OSG
  Two good site examples within p17 reprocessing


Remote Production Activities – Reprocessing - I
After significant improvements to reconstruction, reprocess old data
P14: winter 2003/04
  500M events, 100M remotely, from DST
  Based around mc_runjob
  Distributed computing rather than Grid
P17: end of March to ~October
  x10 larger, i.e. 1000M events, 250 TB
  Basically all remote
  From raw, i.e. use of db proxy servers
  SAM-Grid as default (using mc_runjob)
  3200 1 GHz PIIIs for 6 months
  Massive activity: largest grid activity in HEP
http://www-d0.fnal.gov/computing/reprocessing/p17/
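The P17 numbers above imply a per-event cost that is easy to estimate. The sketch below assumes 30-day months, decimal terabytes, and that all 3200 CPUs are busy for the whole period; none of these is stated on the slide, so the CPU figure is an upper bound rather than a measured reconstruction time.

```python
# Figures quoted on the slide.
N_CPUS = 3200            # "3200 1 GHz PIIIs"
MONTHS = 6               # "for 6 months"
N_EVENTS = 1000e6        # "1000M events"
DATA_BYTES = 250e12      # "250 TB" (decimal units are an assumption)

SECONDS_PER_MONTH = 30 * 24 * 3600   # assumption: 30-day months

# Total CPU time available if every CPU ran continuously (assumption).
cpu_seconds_available = N_CPUS * MONTHS * SECONDS_PER_MONTH

# Implied budget per event, in seconds on one 1 GHz PIII.
cpu_seconds_per_event = cpu_seconds_available / N_EVENTS

# Implied average input size per event.
bytes_per_event = DATA_BYTES / N_EVENTS

print(f"CPU budget per event: {cpu_seconds_per_event:.1f} s")
print(f"input size per event: {bytes_per_event / 1e3:.0f} kB")
```

With these assumptions the budget comes out at roughly 50 CPU-seconds per event on a 1 GHz PIII, with about 250 kB of input per event.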


Reprocessing - II
[Figure: “Production” and “Merging” stages; a grid job spawns many batch jobs]


Reprocessing - III
SAMGrid provides
  Common environment & operation scripts at each site
  Effective book-keeping
    SAM avoids data duplication and defines recovery jobs
    JIM’s XML-DB used to ease bug tracing
Tough deploying a product under evolution, with limited manpower, to new sites (we are a running experiment)
  Very significant improvements in JIM (scalability) during this period
Certification of sites: need to check
  SAMGrid vs usual production
  Remote sites vs central site
  Merged vs unmerged files
  FNAL vs SPRACE
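The book-keeping idea behind "avoids data duplication and defines recovery jobs" can be illustrated with plain set logic. This is not SAM's API: the function name `plan_recovery` and the file names are hypothetical. It is only a sketch of how a list of requested files and a list of successfully processed files determine what a recovery job should pick up and what was processed more than once.

```python
from collections import Counter


def plan_recovery(requested_files, completed_files):
    """Return (files still to process, files processed more than once).

    Hypothetical helper for illustration only; not part of SAM or JIM.
    """
    completed = Counter(completed_files)
    # A recovery job covers every requested file with no successful output.
    to_recover = [f for f in requested_files if completed[f] == 0]
    # Files with more than one successful output would duplicate data.
    duplicated = sorted(f for f, n in completed.items() if n > 1)
    return to_recover, duplicated


requested = ["raw_001", "raw_002", "raw_003", "raw_004"]
completed = ["raw_001", "raw_003", "raw_003"]

to_recover, duplicated = plan_recovery(requested, completed)
print("recovery job input:", to_recover)
print("processed more than once:", duplicated)
```

Because the catalogue records what has been consumed, the recovery set follows from the records rather than from manual inspection of failed jobs.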


Reprocessing - IV
Monitoring (illustration): http://samgrid.fnal.gov:8080/cgi-bin/plot_efficiency.cgi
  Overall efficiency and speed, or by site
  [Plot legend: production speed in M events / day; number of batch jobs completing successfully]
  ~855 Mevents done
Status: into the “end-game”
  Data sets all allocated, moving to ‘cleaning-up’
  Must now push on the Monte Carlo
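The monitored quantities (efficiency and speed, overall or by site) can be computed from per-site job records along the following lines. The record layout, the site names, and the sample numbers are invented for illustration, and efficiency is assumed here to mean successful batch jobs divided by finished batch jobs; the real monitoring pages may define it differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SiteDay:
    """One site's totals for one day (hypothetical record layout)."""
    site: str
    jobs_finished: int
    jobs_succeeded: int
    events_processed: int


def efficiency(records):
    """Fraction of finished batch jobs that completed successfully."""
    finished = sum(r.jobs_finished for r in records)
    succeeded = sum(r.jobs_succeeded for r in records)
    return succeeded / finished if finished else 0.0


def speed_mevents_per_day(records, n_days):
    """Production speed in millions of events per day."""
    return sum(r.events_processed for r in records) / 1e6 / n_days


def by_site(records):
    """Group records by site name."""
    groups = defaultdict(list)
    for r in records:
        groups[r.site].append(r)
    return dict(groups)


# Invented sample data for a single day.
records = [
    SiteDay("site_a", 100, 92, 4_000_000),
    SiteDay("site_b", 50, 43, 2_000_000),
]

print(f"overall efficiency: {efficiency(records):.2f}")
print(f"overall speed: {speed_mevents_per_day(records, n_days=1):.1f} M events/day")
for site, recs in by_site(records).items():
    print(f"{site}: efficiency {efficiency(recs):.2f}")

# Progress quoted in the talk: ~855M of ~1000M events done.
FRACTION_DONE = 855e6 / 1000e6
print(f"fraction of data set reprocessed: {FRACTION_DONE:.1%}")
```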


SAM-Grid Interoperability
Need access to greater resources as data sets grow
Ongoing programme on LCG and OSG interoperability
Step 1 (co-existence): use shared resources with a SAM-Grid head-node
  Widely done for both reprocessing and MC
  OSG co-existence shown for data reprocessing
Step 2: SAMGrid-LCG interface
  SAM does data handling & JIM job submission
  Basically a forwarding mechanism
  Prototype established at IN2P3/Wuppertal
  Extending to production level
OSG activity increasing: build on LCG experience
Team work between core developers / sites


Looking Forward
Increased data sets require increased resources for MC, repro etc
Route to these is increased use of grid and common tools
  Have an ongoing joint program, but work to do...
Continue development of SAM-Grid
  Automated production job submission by shifters
Deployment team
  Bring in new sites in a manpower-efficient manner
  ‘Benefit’ of a new site goes well beyond a ‘cpu’ count; we appreciate / value this
Full interoperability
  Ability to access all shared resources efficiently
Additional resources for the above recommended by the Taskforce


Conclusions
Computing model continues to be successful
  Based around grid-like computing, using common tools
  Key part of this is the production computing: MC and reprocessing
Significant advances this year:
  Continued migration to common tools
  Progress on interoperability, both LCG and OSG
    Two reprocessing sites operating under OSG
  P17 reprocessing: a tremendous success
    Strongly praised by the Review Committee
DOSAR a major part of this
  More ‘general’ contribution also strongly acknowledged
Thank you
  Let’s all keep up the good work


Back-up


Terms
Tevatron
  Approx. equivalent challenge to LHC in “today’s” money
  Running experiments
SAM (Sequential Access via Metadata)
  Well-developed metadata and distributed data replication system
  Originally developed by DØ & FNAL-CD
JIM (Job and Information Management)
  Handles job submission and monitoring (all but data handling)
  SAM + JIM → SAM-Grid: computational grid
Tools
  Runjob: handles job workflow management
  dØtools: user interface for job submission
  dØrte: specification of runtime needs


Reminder of Data Flow
Data acquisition (raw data in evpack format)
  Currently limited to 50 Hz Level-3 accept rate
  Request increase to 100 Hz, as planned for Run IIb (see later)
Reconstruction (tmb/DST in evpack format)
  Additional information in tmb → tmb++ (DST format stopped)
  Sufficient for ‘complex’ corrections, inc. track fitting
Fixing (tmb in evpack format)
  Improvements / corrections coming after the cut of a production release
  Centrally performed
Skimming (tmb in evpack format)
  Centralised event streaming based on reconstructed physics objects
  Selection procedures regularly improved
Analysis (out: root histogram)
  Common root-based Analysis Format (CAF) introduced in the last year
  tmb format remains


Remote Production Activities – Monte Carlo


The Good and Bad of the Grid
Only viable way to go…
  Increase in resources (cpu and potentially manpower)
  Work with, not against, LHC
  Still limited
BUT
  Need to conform to standards: dependence on others..
  Long-term solutions must be favoured over short-term idiosyncratic convenience
    Or we won’t be able to maintain adequate resources
  Must maintain production-level service (papers) while increasing functionality
    As transparent as possible to non-experts