ALICE Computing Model
F. Carminati
BNL Seminar, March 21, 2005
Offline framework
AliRoot in development since 1998; entirely based on ROOT; used since the detector TDRs for all ALICE studies
Two packages to install (ROOT and AliRoot), plus the Monte Carlo transport codes
Ported to the most common architectures: Linux IA32, IA64 and AMD, Mac OS X, Digital Tru64, SunOS, ...
Distributed development: over 50 developers and a single CVS repository; 2/3 of the code developed outside CERN
Tight integration with DAQ (data recorder) and HLT (same code base)
Wide use of abstract interfaces for modularity; a "restricted" subset of C++ used for maximum portability
AliRoot layout
Layout diagram: everything is built on top of ROOT. The STEER module provides AliSimulation, AliReconstruction, AliAnalysis and the ESD classes; it steers the detector modules (ITS, TPC, TRD, TOF, PHOS, EMCAL, ZDC, RICH, PMD, CRT, FMD, MUON, START, RALICE, STRUCT), the Virtual MC with its transport engines (G3, G4, FLUKA), and the event generators (HIJING, MEVSIM, PYTHIA6, EVGEN, HBTP, HBTAN, ISAJET). Grid access goes through AliEn/gLite.
Software management
Regular release schedule: major release every six months, minor release (tag) every month
Emphasis on delivering production code: corrections, protections, code cleaning, geometry
Nightly builds produce UML diagrams, code listings, coding-rule violations, build and test results; a single repository holds all the code; no configuration/release management software needed (we have only two packages!)
Advanced code tools under development (collaboration with IRST/Trento): smell detection (already under testing), aspect-oriented programming tools, automated genetic testing
ALICE Detector Construction Database (DCDB)
Specifically designed to aid detector construction in a distributed environment:
Sub-detector groups around the world work independently
All data are collected in a central repository and used to move components from one sub-detector group to another, and during the integration and operation phase at CERN
Multitude of user interfaces: Web-based for humans; LabVIEW and XML for laboratory equipment and other sources; ROOT for visualisation
In production since 2002
A very ambitious project with important spin-offs: Cable Database, Calibration Database
The Virtual MC
Diagram: the user code talks only to the VMC interface and the geometrical modeller; the concrete transport engines (G3, G4, FLUKA) plug in behind the interface, and the same geometry and user code serve reconstruction, visualisation and the event generators.
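The value of the VMC is that simulation code is written once against an abstract interface, and the transport engine is chosen at run time. A minimal sketch of the idea follows; class and method names are illustrative and do not reproduce the real VMC interface.

#include <memory>
#include <stdexcept>
#include <string>

// Abstract transport interface: user code depends only on this.
class VirtualTransport {
public:
  virtual ~VirtualTransport() = default;
  virtual void BuildGeometry() = 0;   // delegate to the geometrical modeller
  virtual void ProcessEvent() = 0;    // transport one event
  virtual std::string Name() const = 0;
};

// Concrete engines would wrap the real packages (bindings not shown here).
class Geant3Transport : public VirtualTransport {
public:
  void BuildGeometry() override { /* call GEANT3 geometry routines */ }
  void ProcessEvent() override  { /* GEANT3 tracking */ }
  std::string Name() const override { return "Geant3"; }
};

class FlukaTransport : public VirtualTransport {
public:
  void BuildGeometry() override { /* call FLUKA geometry routines */ }
  void ProcessEvent() override  { /* FLUKA tracking */ }
  std::string Name() const override { return "FLUKA"; }
};

// The engine is selected by configuration, not by changing user code.
std::unique_ptr<VirtualTransport> MakeTransport(const std::string& engine) {
  if (engine == "G3")    return std::make_unique<Geant3Transport>();
  if (engine == "FLUKA") return std::make_unique<FlukaTransport>();
  throw std::runtime_error("unknown transport engine: " + engine);
}

Switching from G3 to FLUKA then amounts to changing one configuration string, which is the kind of switch behind the Geant3/FLUKA comparisons shown on the following slides.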
TGeo modeller
Figure: performance of the "Where am I" query (microseconds per point, 1 million points) for the physics-case G3 geometries collected in 2002 (Gexam1, Gexam3, Gexam4, ATLAS, CMS, BRAHMS, CDF, MINOS_NEAR, BTEV, TESLA), comparing the ROOT TGeo modeller with G3.
Results
Figures: comparison of Geant3 and FLUKA, 5000 1 GeV/c protons in 60T field; HMPID response to 5 GeV pions; ITS – SPD cluster size (PRELIMINARY).
Reconstruction strategy
Main challenge: reconstruction in the high-flux environment (occupancy in the TPC up to 40%) requires a new approach to tracking
Basic principle: maximum-information approach; use everything you can and you will get the best result
Algorithms and data structures optimized for fast access and usage of all relevant information: localize the relevant information and keep it until it is needed
Tracking strategy – Primary tracks
Incremental process (sketched in code below):
Forward propagation towards the vertex: TPC -> ITS
Back propagation: ITS -> TPC -> TRD -> TOF
Refit inward: TOF -> TRD -> TPC -> ITS
Continuous seeding: track-segment finding in all detectors
Figure: track candidates followed through ITS, TPC, TRD and TOF; where two "best track" hypotheses share clusters, the conflict has to be resolved.
Combinatorial tracking in the ITS: a weighted two-track χ² is calculated, using the effective probability of cluster sharing and the probability for secondary particles not to cross a given layer.
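A minimal sketch of the three-pass flow described above; the class and method names are hypothetical and do not correspond to the actual AliRoot tracker classes.

#include <vector>

struct TrackCandidate { /* Kalman state: parameters, covariance, assigned clusters */ };

// Hypothetical per-detector tracker interface, used only to illustrate the flow.
class DetectorTracker {
public:
  virtual ~DetectorTracker() = default;
  virtual void PropagateIn(std::vector<TrackCandidate>& tracks) = 0;   // towards the vertex
  virtual void PropagateOut(std::vector<TrackCandidate>& tracks) = 0;  // away from the vertex
};

// Three passes over the barrel detectors, as on the slide.
void ReconstructPrimaries(DetectorTracker& its, DetectorTracker& tpc,
                          DetectorTracker& trd, DetectorTracker& tof,
                          std::vector<TrackCandidate>& tracks) {
  // 1) Forward propagation towards the vertex: seeds found in the TPC,
  //    followed into the ITS.
  tpc.PropagateIn(tracks);
  its.PropagateIn(tracks);

  // 2) Back propagation ITS -> TPC -> TRD -> TOF, picking up clusters
  //    (and PID information) in the outer detectors.
  its.PropagateOut(tracks);
  tpc.PropagateOut(tracks);
  trd.PropagateOut(tracks);
  tof.PropagateOut(tracks);

  // 3) Refit inward TOF -> TRD -> TPC -> ITS, using all assigned clusters
  //    to obtain the final track parameters at the vertex.
  tof.PropagateIn(tracks);
  trd.PropagateIn(tracks);
  tpc.PropagateIn(tracks);
  its.PropagateIn(tracks);
}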
Tracking & PID
CPU times on a Pentium IV 3 GHz for a central event (dN/dy ~ 6000): TPC tracking ~40 s, TPC kink finder ~10 s, ITS tracking ~40 s, TRD tracking ~200 s
Figures: tracking efficiency and contamination for the TPC alone and for the combined ITS+TPC+TOF+TRD tracking, and PID performance using ITS, TPC and TOF.
Condition and alignment
Heterogeneous information sources are periodically polled
ROOT files with condition information are created
These files are published on the Grid and distributed as needed by the Grid data management system (DMS)
Files contain validity information and are identified via DMS metadata
No need for a distributed DBMS; the existing Grid services are reused
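As an illustration of this file-based approach, a condition payload can be written to a plain ROOT file together with its validity range, and the same range is repeated as catalogue metadata when the file is published. A minimal sketch; the payload (a pedestal histogram), file name and run range are made up for the example.

#include "TFile.h"
#include "TH1F.h"
#include "TParameter.h"

// Write one condition object together with its validity range into a ROOT file.
void WriteCondition() {
  TH1F pedestals("pedestals", "TPC pedestals (example payload)", 1000, 0., 1000.);

  // Validity range stored inside the file itself ...
  TParameter<int> firstRun("firstRun", 1000);
  TParameter<int> lastRun("lastRun", 1999);

  TFile f("TPC_Pedestals_Run1000_1999.root", "RECREATE");
  pedestals.Write();
  firstRun.Write();
  lastRun.Write();
  f.Close();

  // ... and repeated as file-catalogue metadata when the file is published on
  // the Grid, so that jobs can select the correct file without opening it.
}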
External relations and DB connectivity
Diagram: the calibration and alignment framework exchanges data through APIs with DAQ, Trigger, DCS, ECS, the physics data stream, DCDB and HLT. Calibration procedures produce calibration files; these, together with their metadata, are stored in AliEn/gLite (metadata plus file store) and read back by the calibration classes in AliRoot.
From the User Requirements (URs): source, volume, granularity, update frequency, access pattern, runtime environment and dependencies.
API = Application Program Interface. The relations between the databases are not final, and not all of them are shown. A call for URs has been sent to the subdetectors.
Metadata
Metadata are essential for the selection of events
We hope to be able to use the Grid file catalogue for one part of the metadata; during the Data Challenge we used the AliEn file catalogue for storing part of the metadata
However, these are file-level metadata; we will also need event-level metadata
This can simply be the TAG catalogue with externalisable references; we are discussing this subject with STAR
We will take a decision soon; we would prefer the Grid scenario to be clearer first
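At the event level, the idea is that a compact tag tree holds a few selection variables plus an externalisable reference back to the full event, so that a selection runs over tags only. A minimal sketch with ROOT; the tree, file and variable names are illustrative, not the actual ALICE tag schema.

#include <cstdio>
#include "TChain.h"
#include "TDirectory.h"
#include "TEntryList.h"

void SelectEvents() {
  // Chain of tag files (in production the file list would come from the Grid catalogue).
  TChain tags("TagTree");
  tags.Add("tags_run1000.root");
  tags.Add("tags_run1001.root");

  // Event-level selection on the compact tag variables only; the selected
  // entries keep references back to the full ESD events.
  tags.Draw(">>goodEvents", "nCharged > 100 && triggerMask & 0x1", "entrylist");
  TEntryList* goodEvents = (TEntryList*)gDirectory->Get("goodEvents");
  printf("selected %lld events\n", goodEvents->GetN());
}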
ALICE CDCs

Date                | MB/s | TB to MSS | Offline milestone
10/2002             | 200  | 200       | Rootification of raw data; raw data for TPC and ITS
9/2003              | 300  | 300       | Integration of single-detector HLT code; partial data replication to remote centres
5/2004 -> 3/2005(?) | 450  | 450       | HLT prototype for more detectors; remote reconstruction of partial data streams; raw digits for barrel and MUON
5/2005              | 750  | 750       | Prototype of the final remote data replication (raw digits for all detectors)
5/2006              | 750 (1250 if possible) | 750 (1250 if possible) | Final test (final system)
Use of HLT for monitoring in CDCs
Diagram: AliRoot simulation produces digits and raw data, which pass through the LDCs to the GDC (event builder); alimdc writes the ROOT file to CASTOR and registers it in AliEn; in parallel, the HLT algorithms run on the same data stream and produce ESDs and monitoring histograms.
ALICE Physics Data Challenges

Period (milestone) | Fraction of final capacity | Physics objective
06/01-12/01        | 1%   | pp studies; reconstruction of TPC and ITS
06/02-12/02        | 5%   | First test of the complete chain from simulation to reconstruction for the PPR; simple analysis tools; digits in ROOT format
01/04-06/04        | 10%  | Complete chain used for trigger studies; prototype of the analysis tools; comparison with parameterised Monte Carlo; simulated raw data
05/05-07/05        | TBD  | Test of condition infrastructure and FLUKA; to be combined with SDC 3; speed test of distributing data from CERN
01/06-06/06        | 20%  | Test of the final system for reconstruction and analysis
PDC04 schema
Diagram: production of RAW at CERN and at the Tier1 and Tier2 centres; shipment of RAW to CERN; reconstruction of RAW in all T1s; analysis. AliEn provides the job control and data transfer between CERN, the Tier1s and the Tier2s.
Phase 2 principle
Diagram: a signal event is merged into a signal-free underlying event to produce a mixed-signal event.
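The point of the mixing is that each expensive underlying event can be reused with many cheaper signal events. A sketch of the merging loop, with hypothetical types and a made-up reuse factor; in AliRoot the merging is done at the level of (summable) digits, which is not reproduced here.

#include <vector>

struct Event { /* digits or summable digits */ };

// Hypothetical helper: add the signal digits on top of the underlying digits.
Event Merge(const Event& underlying, const Event& signal) {
  Event mixed = underlying;
  // ... sum the signal contribution into the underlying event ...
  (void)signal;
  return mixed;
}

std::vector<Event> MixSignals(const std::vector<Event>& underlyingEvents,
                              const std::vector<Event>& signalEvents,
                              int reusePerUnderlying /* e.g. 50 */) {
  std::vector<Event> mixed;
  std::size_t s = 0;
  for (const Event& u : underlyingEvents) {
    // Each signal-free underlying event is reused for several signal events.
    for (int i = 0; i < reusePerUnderlying && s < signalEvents.size(); ++i, ++s) {
      mixed.push_back(Merge(u, signalEvents[s]));
    }
  }
  return mixed;
}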
Simplified view of the ALICE Grid with AliEn
Diagram. ALICE VO central services: central task queue, job submission, file catalogue, configuration, accounting, user authentication, workload management, job monitoring, storage volume manager, data transfer.
AliEn site services: computing element, storage element, cluster monitor; they integrate with the existing site components (local scheduler, disk and MSS).
Site services
Unobtrusive, running entirely in user space:
Single user account; all authentication is assured by the central services
Tuned to the existing site configuration: supports various schedulers and storage solutions
Runs on many Linux flavours and platforms (IA32, IA64, Opteron)
Automatic software installation and updates (both service and application)
Scalable and modular: different services can be run on different nodes (in front of or behind firewalls) to preserve site security and integrity
Load-balanced file transfer nodes (on HTAR)
CERN firewall solution for large-volume file transfers: only high ports (50K-55K) are opened for parallel file transport by the AliEn data transfer service, while the other AliEn services stay inside the CERN intranet behind the firewall.
Hardware of the AliEn central services:
HP ProLiant DL380: AliEn proxy server, up to 2500 concurrent client connections
HP ProLiant DL580: AliEn file catalogue, 9 million entries, 400K directories, 10 GB MySQL DB
HP rx2600 server: AliEn job services, 500K archived jobs
HP ProLiant DL360: AliEn to CASTOR (MSS) interface
HP ProLiant DL380: AliEn storage element volume manager, 4 million entries, 3 GB MySQL DB
1 TB SATA disk server: log files, application software storage
Phase 2 job structure
Task: simulate the event reconstruction and remote event storage. Completed September 2004.
Central servers: master job submission, job optimizer (splitting into N sub-jobs), RB, file catalogue, process monitoring and control, SE.
Sub-jobs run on the AliEn CEs directly and, through the AliEn-LCG interface, on LCG CEs via the LCG RB.
Storage: the underlying-event input files are read from CERN CASTOR; the primary copies of the output files are written to the local SEs, with a backup copy kept in CERN CASTOR; the output files are zipped into an archive and registered in the AliEn file catalogue (for an LCG SE the LCG LFN is stored as the AliEn PFN, using edg/lcg copy-and-register).
Production history
ALICE repository: history of the entire DC, ~1000 monitored parameters (running and completed processes, job status and error conditions, network traffic, site status, central services monitoring, ...)
7 GB of data, 24 million records with 1-minute granularity, analysed to improve Grid performance
Statistics: 400 000 jobs, 6 hours/job, 750 MSI2k hours; 9M entries in the AliEn file catalogue; 4M physical files at 20 AliEn SEs in centres world-wide; 30 TB stored in CERN CASTOR; 10 TB stored at remote AliEn SEs plus a 10 TB backup at CERN; 200 TB of network transfer from CERN to the remote computing centres; observed AliEn efficiency above 90%, observed LCG efficiency 60% (see the GAG document)
Job repartition
Jobs (AliEn/LCG): Phase 1: 75%/25%, Phase 2: 89%/11%
More operational sites were added to the ALICE Grid as the PDC progressed
17 permanent sites (33 in total) under direct AliEn control, plus additional resources through Grid federation (LCG)
Figures: job distribution per site for Phase 1 and Phase 2.
Summary of PDC'04
Computing resources: it took some effort to 'tune' the resources at the remote computing centres; the centres' response was very positive, and more CPU and storage capacity was made available during the PDC
Middleware: AliEn proved to be fully capable of executing high-complexity jobs and controlling large amounts of resources; the functionality needed for Phase 3 has been demonstrated, but cannot be used; the LCG middleware proved adequate for Phase 1, but not for Phase 2 or in a competitive environment, and it cannot provide the additional functionality needed for Phase 3
ALICE computing model validation: AliRoot – all parts of the code successfully tested; computing-element configuration validated; the need for a high-functionality MSS was shown; the Phase 2 distributed data storage schema proved robust and fast; data analysis could not be tested
Development of Analysis
Analysis Object Data (AOD) designed for efficiency: contain only the data needed for a particular analysis
Analysis à la PAW: ROOT plus, at most, a small library (a minimal sketch follows below)
Work on the distributed infrastructure has been done by the ARDA project
Batch analysis infrastructure: prototype published at the end of 2004 with AliEn
Interactive analysis infrastructure: demonstration performed at the end of 2004 with AliEn/gLite
Physics working groups are just starting now, so the timing is right to receive requirements and feedback
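"Analysis à la PAW" means a user can loop over AODs with plain ROOT and a small private library. A minimal sketch; the AOD file names and the branch/variable names are illustrative, not the actual ALICE AOD layout.

#include "TChain.h"
#include "TFile.h"
#include "TH1F.h"

void SimpleAnalysis() {
  // Chain the AOD files selected for this analysis (names are examples).
  TChain aod("aodTree");
  aod.Add("AliAOD_run1000.root");
  aod.Add("AliAOD_run1001.root");

  // Book a histogram and fill it directly from the tree; nothing beyond ROOT
  // itself is needed for this kind of lightweight analysis.
  TH1F pt("pt", "track p_{T}; p_{T} (GeV/c); entries", 100, 0., 10.);
  aod.Draw("tracks.fPt >> pt", "abs(tracks.fEta) < 0.9", "goff");

  TFile out("analysis.root", "RECREATE");
  pt.Write();
  out.Close();
}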
Diagram: Grid-enabled interactive analysis with PROOF.
The PROOF client authenticates through a Grid access control service and queries the Grid file/metadata catalogue to retrieve the list of logical files (LFN + MSN); it then sends a booking request with the logical file names through the TGrid UI/queue UI, which triggers the proofd startup and the PROOF master setup.
PROOF slave servers (proofd/rootd) run at the participating sites (site A, site B, ..., LCG) and register in a booking database; an optional site gateway and forward proxies mirror the slave ports on the master host, so the sites need only outgoing connectivity.
New elements with respect to a "standard" PROOF session: the Grid service interfaces, the Grid file/metadata catalogue and the PROOF steering component; the setup is Grid-middleware independent.
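From the user's point of view, an interactive session looks like a normal ROOT session. A minimal sketch of what running on PROOF looks like; the master URL, file names and selector name are placeholders.

#include "TChain.h"
#include "TProof.h"

void RunProofAnalysis() {
  // Open a session on the PROOF master (placeholder URL).
  TProof::Open("proof-master.example.org");

  // Build the chain of input files (in the Grid setup the file list
  // would come from the Grid file catalogue).
  TChain chain("esdTree");
  chain.Add("AliESDs_run1000.root");
  chain.Add("AliESDs_run1001.root");

  // Ship the processing to the PROOF slaves; MySelector is the user's
  // TSelector-based analysis code (hypothetical name).
  chain.SetProof();
  chain.Process("MySelector.C+");
}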
Grid situation
History:
Jan '04: AliEn developers are hired by EGEE and start working on the new middleware
May '04: a prototype derived from AliEn is offered to pilot users (ARDA, Biomed, ...) under the gLite name
Dec '04: the four experiments ask for this prototype to be deployed on a larger preproduction service and to become part of the EGEE release
Jan '05: this is vetoed at management level; AliEn will not be common software
Current situation:
EGEE has vaguely promised to provide the same functionality as the AliEn-derived middleware, but with a delay of at least 2-4 months on top of the one already accumulated; and even this will be just the beginning of the story, since the different components will have to be field-tested in a real environment (it took four years for AliEn)
All experiments have their own middleware; ours is not maintained because our developers have been hired by EGEE
EGEE has formally vetoed any further work on AliEn or AliEn-derived software
LCG has allowed some support for ALICE, but the situation is far from clear
Conclusion
• Phase I and II of our Data Challenge were completed
  LCG support and resources were very important
  LCG middleware has limited the usability of the resources
• Phase III
  With gLite on the verge of being released, we think it would be absurd not to "bite the bullet" and use it now
  We will provide an experience "complementary" to the component-by-component strategy of LCG
  We feel we would gain many months and acquire precious experience with no special additional load on the deployment team
  This will help bootstrap a process which we feel is much too slow and timid
  And we have already done it with AliEn
• ALICE intends to be gLite-only in the shortest possible time
Let's avoid that gLite becomes gLate.
ALICE computing model
For pp, similar to the other experiments: quasi-online data distribution and first reconstruction at T0; further reconstruction passes at the T1s
For AA, a different model: calibration, alignment and pilot reconstructions during data taking; data distribution and first reconstruction at T0 during the four months after the AA run (shutdown); second and third passes distributed at the T1s
For safety, one copy of RAW at T0 and a second one distributed among all T1s
T0: first-pass reconstruction; storage of one copy of RAW, the calibration data and the first-pass ESDs
T1: subsequent reconstructions and scheduled analysis; storage of the second, collective copy of RAW and of one copy of all data to be safely kept (including simulation); disk replicas of ESDs and AODs
T2: simulation and end-user analysis; disk replicas of ESDs and AODs
The network load is very difficult to estimate
ALICE requirements on middleware
One of the main uncertainties of the ALICE computing model comes from the Grid component
ALICE has been developing its computing model assuming that middleware with the quality and functionality that AliEn would have reached two years from now will be deployable on the LCG computing infrastructure
If not, we will still analyse the data (!), but: less efficiency means more computers, more time and more money; more people needed for production means more money
To elaborate an alternative model we need to know: the functionality of the middleware developed by EGEE, the support we can count on from LCG, and our "political" margin of manoeuvre
Possible strategy
If
a) basic services from the LCG/EGEE middleware can be trusted at some level, and
b) we can get some support to port the "higher functionality" middleware onto these services,
then we have a solution.
If a) above is not true, but
a) we have support for deploying the ARDA-tested, AliEn-derived gLite, and
b) we do not have a political "veto",
then we still have a solution.
Otherwise we are in trouble.
ALICE Offline Timeline
Timeline chart, 2004 to 2006: PDC04 and the PDC04 analysis; design of the new components, followed by their development; PDC05, together with the Computing TDR; PDC06 preparation and PDC06, by which time AliRoot must be ready; final development of AliRoot and preparation for first data taking; CDC 04(?) and CDC 05 along the way. A marker indicates "we are here" (March 2005).
Main parameters

Parameters for the ALICE computing model:

Name                                              | Unit | pp     | PbPb
Recording rate                                    | Hz   | 100    | 100
RAW size (single event)                           | MB   | 1.00   | 12.50
RAW size (single event, pp with pile-up)          | MB   |        |
ESD size                                          | MB   | 0.04   | 2.50
AOD size                                          | MB   | 0.004  | 0.250
Event catalogue                                   | MB   | 0.010  | 0.010
Running time per year                             | s    | 1.E+07 | 1.E+06
Shutdown period                                   | s    |        |
Events per year                                   |      | 1.E+09 | 1.E+08
Reconstruction passes per year                    |      | 3      | 3
RAW duplication factor                            |      | 2      | 2
ESD duplication factor                            |      | 2.0    | 2.0
Scheduled analysis passes per event and reco pass |      | 3      | 3
Chaotic analysis passes per event per year        |      | 20     | 20
Total analysis passes per event per year          |      | 23     | 23
Number of T1 centres                              |      |        |
Number of T2 centres                              |      |        |
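As a cross-check of the statistics rows, the yearly event counts and RAW volumes follow directly from the recording rate, event size and running time in the table:

\[
N_{pp} = 100\,\mathrm{Hz} \times 10^{7}\,\mathrm{s} = 10^{9}\ \text{events/year},
\qquad
V^{\mathrm{RAW}}_{pp} = 10^{9} \times 1.0\,\mathrm{MB} = 1\,\mathrm{PB/year}
\]
\[
N_{PbPb} = 100\,\mathrm{Hz} \times 10^{6}\,\mathrm{s} = 10^{8}\ \text{events/year},
\qquad
V^{\mathrm{RAW}}_{PbPb} = 10^{8} \times 12.5\,\mathrm{MB} = 1.25\,\mathrm{PB/year}
\]

These numbers reproduce the Events-per-year row and set the scale of the storage figures in the capacity summary, before the RAW duplication factor of 2 is applied.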
Processing pattern
Table: month-by-month schedule of accelerator running and processing at T0 and the T1s for 2007-2009.
Each year the accelerator delivers a pp run (pp 1, pp 2, pp 3) followed by an AA run with its calibration (AA 1, AA 2, AA 3) and a shutdown.
First-pass reconstruction of each pp run (Run N pp Reco 1) is carried out at T0 during data taking; the first AA pass (Run N AA Reco 1) is carried out at T0 during the shutdown.
The second and third passes of both pp and AA data (Reco 2 and Reco 3) are distributed to the T1s in the following months, overlapping with the next year's data taking.
Summary of computing capacities required by ALICE

                      | Tier0 | Tier1 | Tier1ex | Tier2 | Tier2ex | Total | at CERN | HR / LHCC review
CPU (MSI2k), maximum  | 7.5   | 13.8  | 13.8    | 13.7  | 13.7    | 35.0  | 7.5     | 34.0
  (fraction of total) | 22%   | 39%   | 39%     | 39%   | 39%     | 100%  | 22%     |
CPU (MSI2k), average  | 3.0   | 13.8  | 10.1    | 13.7  | 12.7    | 30.4  | 7.5     | 17.6 / 26.0
  (fraction of total) | 10%   | 45%   | 33%     | 45%   | 42%     | 100%  | 25%     |
Disk (PB)             | 0.1   | 7.5   | 6.5     | 2.6   | 2.5     | 10.2  | 1.3     | 1.6 / 10.0
  (fraction of total) | 1%    | 74%   | 63%     | 25%   | 24%     | 100%  | 13%     |
MS (PB/year)          | 2.3   | 7.5   | 6.4     | -     | -       | 9.8   | 3.3     | 4.7 / 10.7
  (fraction of total) | 23%   | 77%   | 66%     | -     | -       | 100%  | 34%     |
Network in (Gb/s)     | 8.00  | 2.00  | -       | 0.01  | -       | -     | -       |
Network out (Gb/s)    | 6.00  | 1.50  | -       | 0.27  | -       | -     | -       |
Conclusions
ALICE has made a number of technical choices for the computing framework since 1998 that have been validated by experience
The offline development is on schedule, although contingency is scarce
Collaboration between physicists and computer scientists is excellent
Tight integration with ROOT allows a fast prototyping and development cycle
AliEn goes a long way in providing a Grid solution adapted to HEP needs; however, its evolution into a common project has been "stopped", and this is probably the largest single "risk factor" for ALICE computing
Some ALICE-developed solutions have a high potential to be adopted by other experiments and are indeed becoming "common solutions"