The LHC Computing Challenge
Tim Bell
Fabric Infrastructure & Operations Group
Information Technology Department
CERN
2nd April 2009
The Four LHC Experiments…

ATLAS
- General purpose
- Origin of mass
- Supersymmetry
- 2,000 scientists from 34 countries

CMS
- General purpose
- Origin of mass
- Supersymmetry
- 1,800 scientists from over 150 institutes

ALICE
- Heavy-ion collisions, to create quark-gluon plasmas
- 50,000 particles in each collision

LHCb
- To study the differences between matter and antimatter
- Will detect over 100 million b and b-bar mesons each year
… generate lots of data …
The accelerator generates 40 million particle collisions (events) every second at the centre of each of the four experiments’ detectors
… reduced by online computers to a few hundred "good" events per second, which are recorded on disk and magnetic tape at 100-1,000 MegaBytes/sec: ~15 PetaBytes per year for all four experiments.
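As a rough back-of-envelope check (the recording rates are from the slide; the ~350 MB/s average per experiment and ~10^7 live seconds of running per year are assumptions for illustration only), these rates are indeed consistent with ~15 PB/year:

    # Sanity check of the ~15 PB/year figure. Assumed values, not slide
    # data: average recording rate per experiment and live seconds/year.
    experiments = 4
    avg_rate_mb_s = 350      # assumed average of the 100-1,000 MB/s range
    live_seconds = 1.0e7     # assumed effective data-taking time per year
    total_pb = experiments * avg_rate_mb_s * live_seconds / 1e9  # MB -> PB
    print(f"~{total_pb:.0f} PB/year")  # ~14 PB/year, close to the quoted ~15 PB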
Data Handling and Computation for Physics Analysis

[Figure: dataflow from the detector through the event filter (selection & reconstruction) to raw data, then via reconstruction (and event reprocessing) to event summary data; event simulation produces simulated events; batch physics analysis turns processed data into analysis objects (extracted by physics topic) for interactive physics analysis.]
Summary of Computing Resource Requirements
All experiments, 2008 (from LCG TDR, June 2005)

                         CERN   All Tier-1s   All Tier-2s   Total
CPU (MSPECint2000s)        25            56            61     142
Disk (PetaBytes)            7            31            19      57
Tape (PetaBytes)           18            35             -      53

Shares: CPU - CERN 18%, Tier-1s 39%, Tier-2s 43%; Disk - CERN 12%, Tier-1s 55%, Tier-2s 33%; Tape - CERN 34%, Tier-1s 66%.
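The shares quoted above follow directly from the table; a quick illustrative sketch recomputing them (the slide rounds the Tier-1 disk share up to 55%):

    # Recompute the CERN / Tier-1 / Tier-2 shares from the table above.
    resources = {
        "CPU":  (25, 56, 61),
        "Disk": (7, 31, 19),
        "Tape": (18, 35, 0),   # the table lists no Tier-2 tape
    }
    for name, tiers in resources.items():
        total = sum(tiers)
        shares = ", ".join(f"{100 * t / total:.0f}%" for t in tiers)
        print(f"{name}: {shares} of {total}")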
… leading to a high box count: ~2,500 PCs for CPU, plus another ~1,500 boxes for disk and tape.
Computing Service Hierarchy

Tier-0 - the accelerator centre
- Data acquisition & initial processing
- Long-term data curation
- Distribution of data to Tier-1 centres

Tier-1 - "online" to the data acquisition process; high availability
- Managed Mass Storage
- Data-heavy analysis
- National, regional support

Tier-1 centres: Canada - TRIUMF (Vancouver); France - IN2P3 (Lyon); Germany - Forschungszentrum Karlsruhe; Italy - CNAF (Bologna); Netherlands - NIKHEF/SARA (Amsterdam); Nordic countries - distributed Tier-1; Spain - PIC (Barcelona); Taiwan - Academia Sinica (Taipei); UK - CLRC (Oxford); US - FermiLab (Illinois) and Brookhaven (NY)

Tier-2 - ~100 centres in ~40 countries
- Simulation
- End-user analysis - batch and interactive
The Grid
• Timely Technology!
• Deploy to meet LHC computing needs.
• Challenges for the Worldwide LHC Computing Grid Project due to
  - worldwide nature (competing middleware…)
  - newness of technology (competing middleware…)
  - scale
  - …
Interoperability in action
Reliability

Site Reliability: Tier-2 Sites - 83 Tier-2 sites being monitored
Why Linux?
• 1990s - Unix wars - 6 different Unix flavours
• Linux allowed all users to align behind a single OS which was low cost and dynamic
• Scientific Linux is based on Red Hat with extensions of key usability and performance features
  - AFS global file system
  - XFS high performance file system
• But how to deploy without proprietary tools?

See the EDG/WP4 report on current technology (http://cern.ch/hep-proj-grid-fabric/Tools/DataGrid-04-TED-0101-3_0.pdf) or "Framework for Managing Grid-enabled Large Scale Computing Fabrics" (http://cern.ch/quattor/documentation/poznanski-phd.pdf) for reviews of various packages.
Deployment
• Commercial Management Suites
  - (Full) Linux support rare (5+ years ago…)
  - Much work needed to deal with specialist HEP applications; insufficient reduction in staff costs to justify license fees.
• Scalability
  - 5,000+ machines to be reconfigured
  - 1,000+ new machines per year
  - Configuration change rate of 100s per day
Dataflows and rates - remember this figure!

[Dataflow diagram: average tier-to-tier transfer rates of 1430MB/s, 1120MB/s, 700MB/s (twice) and 420MB/s, with peaks of (1600MB/s) and (2000MB/s).]

These are averages! Need to be able to support 2x for recovery! Scheduled work only!
Volumes & Rates
• 15PB/year. Peak rate to tape >2GB/s
  - 3 full SL8500 robots/year
• Requirement in first 5 years to reread all past data between runs
  - 60PB in 4 months: 6GB/s (see the arithmetic sketch below)
• Can run drives at sustained 80MB/s
  - 75 drives flat out merely for controlled access
• Data volume has an interesting impact on choice of technology
  - Efficient media use is advantageous: high-end technology (3592, T10K) favoured over LTO.
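The reread requirement can be checked directly from the slide's own numbers; a quick sketch (the slide rounds up to 6 GB/s and 75 drives):

    # Reread arithmetic: 60 PB in 4 months at 80 MB/s sustained per drive.
    reread_pb = 60
    window_s = 4 * 30 * 86400                  # ~4 months in seconds
    needed_mb_s = reread_pb * 1e9 / window_s   # 1 PB = 1e9 MB
    drive_mb_s = 80                            # sustained rate per drive
    drives = needed_mb_s / drive_mb_s
    print(f"{needed_mb_s/1000:.1f} GB/s -> {drives:.0f} drives")
    # ~5.8 GB/s -> ~72 drives; rounded up on the slide to 6 GB/s and 75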
Castor Architecture

[Architecture diagram, detailed view: clients and central services (Stager, NameServer, VDQM, VMGR, Scheduler, and the DBSvc, JobSvc, QrySvc and ErrorSvc database services); a disk cache subsystem (disk servers running movers, plus StagerJob, MigHunter, GC, RTCPClientD, request handling and a DB); and a tape archive subsystem (tape servers running the tape daemon and RTCPD).]
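To make the disk-cache-in-front-of-tape idea concrete, here is a minimal sketch of the staging pattern Castor implements; every name in it is invented for illustration, and none of this is Castor's actual API:

    # Illustrative sketch of a disk cache fronting a tape archive - the
    # general pattern behind Castor. All names here are hypothetical.

    def read_tape(path):
        # Stand-in for mounting a tape and streaming the file back
        # (the slow, queued operation the scheduler manages).
        return b"..."

    class DiskCache:
        def __init__(self):
            self.resident = {}             # path -> file bytes on disk servers

        def open(self, path):
            if path not in self.resident:  # cache miss: recall from tape
                self.resident[path] = read_tape(path)
            return self.resident[path]     # cache hit: serve at disk speed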
Castor Performance
Long lifetime
• LEP, CERN's last accelerator, started in 1989 and was shut down 10 years later.
  - First data recorded to IBM 3480s; at least 4 different technologies used over the period.
  - All data ever taken, right back to 1989, was reprocessed and reanalysed in 2001/2.
• LHC starts in 2007 and will run until at least 2020.
  - What technologies will be in use in 2022 for the final LHC reprocessing and reanalysis?
• Data repacking required every 2-3 years.
  - Time consuming
  - Data integrity must be maintained (see the sketch below)
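Since integrity during repacking is the hard requirement, here is a minimal sketch of a checksum-verified copy; it assumes nothing about CERN's actual repack tooling:

    # Minimal sketch of integrity-checked repacking (illustrative only):
    # copy each file to new media and refuse to proceed unless the
    # checksum of the copy matches the source.
    import hashlib, shutil

    def sha1(path, bufsize=1 << 20):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def repack(src, dst):
        before = sha1(src)
        shutil.copyfile(src, dst)
        if sha1(dst) != before:   # data integrity must be maintained
            raise IOError(f"checksum mismatch repacking {src}")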
Disk capacity & I/O rates

Year   Drive capacity   Drive I/O rate   I/O available per 1TB
1996   4GB              10MB/s           250 x 10MB/s = 2,500MB/s
2000   50GB             20MB/s           20 x 20MB/s = 400MB/s
2006   500GB            60MB/s           2 x 60MB/s = 120MB/s

CERN now purchases two different storage server models: capacity-oriented and throughput-oriented.
• fragmentation increases management complexity
• (purchase overhead also increased…)
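The trend in the table is easy to recompute; a small sketch using the slide's drive figures, showing how the bandwidth available per terabyte collapses as drives grow faster than their I/O rates:

    # (year, drive capacity in GB, drive I/O rate in MB/s) from the slide.
    drives = [(1996, 4, 10), (2000, 50, 20), (2006, 500, 60)]
    for year, cap_gb, rate_mb_s in drives:
        n = 1000 // cap_gb   # drives needed to hold 1TB
        print(f"{year}: {n} drives -> {n * rate_mb_s} MB/s per TB")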
.. and backup - TSM on Linux
• Daily backup volumes of around 18TB to 10 Linux TSM servers
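As a rough check of the backup load (assuming, purely for illustration, that the 18TB is spread evenly over 24 hours and across the 10 servers):

    # Aggregate and per-server backup rates implied by 18TB/day.
    daily_tb, servers = 18, 10
    aggregate_mb_s = daily_tb * 1e6 / 86400   # 1 TB = 1e6 MB
    print(f"{aggregate_mb_s:.0f} MB/s total, "
          f"{aggregate_mb_s / servers:.0f} MB/s per TSM server")
    # ~208 MB/s total, ~21 MB/s per server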
Capacity Requirements

[Chart: Predicted Growth in Offline Computing Requirements, 2007-2020, for CPU, Disk and Tape - left axis 0-1200 (MSI2K or Disk PB), right axis 0-300 (Tape PB).]
Power Outlook

[Chart: Predicted Growth in Electrical Power Demand, 2007-2020, for CPU, Disk and Other Services, in MW (0-25).]
Summary
• Immense Challenges & Complexity
  - Data rates, developing software, lack of standards, worldwide collaboration, …
• Considerable Progress in the last ~5-6 years
  - WLCG service exists
  - Petabytes of data transferred
• But more data is coming in November…
  - Will the system cope with chaotic analysis?
  - Will we understand the system well enough to identify problems, and fix underlying causes?
  - Can we meet requirements given the power available?