Ian Willers
Information: CMS participation in MONARC and RD45
Slides: Paolo Capiluppi, Irwin Gaines, Harvey Newman, Les Robertson, Jamie Shiers, Lucas Taylor

Hardware Resource Needs of the CMS Experiment (start-up 2005)
2
Contents
- Why is LHC computing different
- The MONARC project and proposed architecture
- An LHC offline computing facility at CERN
- A Regional Centre
- LHC data management
- The Particle Physics Data Grid
- Summary
3
CMS structure showing sub-detectors
4
Not covered: CMS Software Professionals
- Professional software personnel ramping up to ~33 FTEs (by 2003)
- Engineers support a much larger number of physicist developers (~4 times as many)
- Shortfall: 10 FTEs (1999)
5
Not covered: Cost of Hardware
Total computing cost to 2006 incl.: ~120 MCHF, ~consistent with the canonical 1/3 : 2/3 rule
- ~40 MCHF (Tier0, central systems at CERN)
- ~40 MCHF (Tier1, ~5 Regional Centres, each ~20% of the central systems)
- ~40 MCHF (?) (universities, Tier2 centres, MC, etc.)
(Figures being revised)
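As a quick sanity check on the split quoted above (a sketch; the MCHF figures are the slide's own and were marked as being revised):

```python
# Back-of-envelope check of the canonical 1/3 : 2/3 cost split
tier0_cern = 40        # MCHF, central systems at CERN
tier1_rcs  = 40        # MCHF, ~5 Regional Centres
tier2_univ = 40        # MCHF, universities, Tier2 centres, MC production
total = tier0_cern + tier1_rcs + tier2_univ          # ~120 MCHF
cern_share = tier0_cern / total                      # 1/3 at CERN
outside_share = (tier1_rcs + tier2_univ) / total     # 2/3 outside
print(f"total ~{total} MCHF, CERN {cern_share:.0%}, outside {outside_share:.0%}")
```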
6
LHC Computing: Different from Previous Experiment Generations
- Geographical dispersion: of people and resources
- Complexity: the detector and the LHC environment
- Scale: petabytes per year of data
1800 physicists, 150 institutes, 32 countries

Major challenges are associated with:
- Coordinated use of distributed computing resources
- Remote software development and physics analysis
- Communication and collaboration at a distance
R&D: new forms of distributed systems
7
Comparisons with an LHC-sized experiment in 2006: CMS at CERN [*]
(Onsite CPU in SI95; 1 SI95 = 40 MIPS)

Experiment   Onsite CPU  Disk (TB)  Tape (TB)  LAN capacity     Data import/export     Box count
LHC (2006)   520,000*    540        3000       46 GB/s          10 TB/day (sustained)  ~1400
CDF-2        12,000      20         800        1 Gb/s           18 MB/s                ~250
D0-2         7,000       20         600        300 Mb/s         10 MB/s                ~250
BaBar        ~6000       8          ~300       100 + 1000 Mb/s  ~400 GB/day            ~400
D0           295         1.5        65         300 Mb/s         ?                      180
CDF          280         2          100        100 Mb/s         ~100 GB/day            ?
ALEPH        300         1.8        30         1 Gb/s           ?                      70
DELPHI       515         1.2        60         1 Gb/s           ?                      80
L3           625         2          40         1 Gb/s           ?                      160
OPAL         835         1.6        22         1 Gb/s           ?                      220
NA45         587         1.3        2          1 Gb/s           5 GB/day               30

[*] Total CPU: CMS or ATLAS ~1.5-2.0 MSI95. Estimates for the disk/tape ratio will change (technology evolution).
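To put the jump in scale in perspective, a rough calculation based only on the numbers in the table above:

```python
# Rough scale factors: LHC (2006) versus a LEP-era experiment (OPAL)
lhc_cpu, opal_cpu = 520_000, 835          # SI95
lhc_tape, opal_tape = 3000, 22            # TB
print(f"CPU:  x{lhc_cpu / opal_cpu:,.0f}")     # ~x620
print(f"Tape: x{lhc_tape / opal_tape:,.0f}")   # ~x136
# 1 SI95 = 40 MIPS, so 520,000 SI95 is ~20.8 million MIPS onsite
print(f"LHC onsite CPU in MIPS: {lhc_cpu * 40:,}")
```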
8
CPU needs for the baseline analysis process
(100% efficiency, no AMS overhead)

Activity                     Who                Frequency         Response time/pass  Disk I/O (MB/s)  Total CPU (SI95)  Total disk (TB)
Reconstruction               Experiment         Once/Year         100 Days            500              116k              200
Re-processing                Experiment         3 times/Year      2 Months            300              100k              150
Re-definition (AOD & TAG)    Experiment         Once/Month        10 Days             12,000           190k              200
Selection                    Groups (20)        Once/Month        1 Day               1,200            1k                11
Analysis (AOD, TAG & DPD)    Individuals (500)  4 Times/Day       4 Hours             7,500            3k                20
Analysis (ESD 1%)            Individuals (500)  4 Times/Day       4 Hours             52k              1050k             ?
Simulation + Reconstruction  Experiment/Group   ~10^6 events/Day  ~300 Days           10               -                 -
Total utilized                                                                                         ~1400k            ~580+x
Total installed                                                                                        ~1700k            ~700+y
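A quick consistency check of the totals row (a sketch; the per-activity figures are taken from the table above, and the slide's "utilized" totals are rounded):

```python
# Sum the per-activity CPU and disk figures and compare with the quoted totals
cpu_ksi95 = {"Reconstruction": 116, "Re-processing": 100,
             "Re-definition": 190, "Selection": 1,
             "Analysis AOD/TAG/DPD": 3, "Analysis ESD 1%": 1050}
disk_tb = {"Reconstruction": 200, "Re-processing": 150,
           "Re-definition": 200, "Selection": 11,
           "Analysis AOD/TAG/DPD": 20}   # ESD-analysis disk is '?' on the slide
print(f"CPU sum:  {sum(cpu_ksi95.values())}k SI95 (slide: ~1400k utilized)")
print(f"Disk sum: {sum(disk_tb.values())} TB + x (slide: ~580+x TB utilized)")
# The installed figures (~1700k SI95, ~700+y TB) carry ~20% headroom
```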
9
Major activities foreseen at CERN: reality check (Les Robertson, Jan. '99)

Activity                               CPU (SI95)  Disk I/O (MB/s)  Tape I/O (MB/s)
Reconstruction + data recording        35,000      500              100
Copy raw data                          -           120              120
Re-processing                          150,000     400              400
Pass 1+2 analysis (4 groups)           4,000       1,600            -
User analysis (50 simultaneous jobs)   100,000     750              -
Totals                                 289,000     3,410            620
% of 2006 capacity                     56%         4%               124%

Based on 520,000 SI95; Les's present estimate is 600,000 SI95 + 200,000 SI95/year.
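The percentage row can be reproduced from the totals (a sketch; the CPU capacity is stated on the slide, while the disk and tape I/O capacities are inferred here by inverting the quoted percentages, which is an assumption):

```python
# Reproduce the '% of 2006 capacity' row
cpu_total, cpu_capacity = 289_000, 520_000
print(f"CPU: {cpu_total / cpu_capacity:.0%}")                             # 56%
tape_total = 620                                                          # MB/s
print(f"implied tape I/O capacity: {tape_total / 1.24:.0f} MB/s")         # ~500 MB/s
disk_total = 3_410                                                        # MB/s
print(f"implied disk I/O capacity: {disk_total / 0.04 / 1000:.0f} GB/s")  # ~85 GB/s
```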
10
MONARC: Common Project. Models Of Networked Analysis At Regional Centres
Caltech, CERN, Columbia, FNAL, Heidelberg, Helsinki, INFN, IN2P3, KEK, Marseilles, MPI Munich, Orsay, Oxford, Tufts

PROJECT GOALS
- Develop "Baseline Models"
- Specify the main parameters characterizing the Model's performance: throughputs, latencies
- Verify resource requirement baselines (computing, data handling, networks)

TECHNICAL GOALS
- Define the Analysis Process
- Define RC Architectures and Services
- Provide Guidelines for the final Models
- Provide a Simulation Toolset for further model studies
[Diagram: Model circa 2005. CERN (520k SI95, 540 TB disk; robot) connects to Tier1 Regional Centres such as FNAL/BNL (100k SI95, 100 TB disk; robot) over 622 Mbits/s links (N x 622 Mbits/s in aggregate); Tier2 centres (20k SI95, 20 TB disk; robot) and universities (Univ 1, Univ 2, ... Univ M) connect onward at 622 Mbits/s.]
11
CMS Analysis Model Based on MONARC and ORCA: "Typical" Tier1 RC
- CPU power: ~100 kSI95
- Disk space: ~200 TB
- Tape capacity: 600 TB, 100 MB/sec
- Link speed to Tier2: 10 MB/sec (1/2 of 155 Mbps)
- Raw data: 5% = 50 TB/year
- ESD data: 100% = 200 TB/year
- Selected ESD: 25% = 10 TB/year [*]
- Revised ESD: 25% = 20 TB/year [*]
- AOD data: 100% = 2 TB/year [**]
- Revised AOD: 100% = 4 TB/year [**]
- TAG/DPD: 100% = 200 GB/year
- Simulated data: 25% = 25 TB/year
[*] Covering five analysis groups, each selecting ~1% of annual ESD or AOD data for a typical analysis
[**] Covering all analysis groups
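Adding up the annual per-category volumes gives a feel for how the ~200 TB of disk and 600 TB of tape fill up (rough arithmetic using only the figures above):

```python
# Annual data volume arriving at a "typical" Tier1 Regional Centre (TB/year)
tb_per_year = {"raw (5%)": 50, "ESD (100%)": 200, "selected ESD": 10,
               "revised ESD": 20, "AOD": 2, "revised AOD": 4,
               "TAG/DPD": 0.2, "simulated (25%)": 25}
total = sum(tb_per_year.values())
print(f"total inflow: ~{total:.0f} TB/year")          # ~311 TB/year
# More than the ~200 TB of disk, so older data must migrate to the
# 600 TB tape store, which fills in roughly two years at this rate
print(f"years to fill 600 TB of tape: ~{600 / total:.1f}")
```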
12
MONARC Data Hierarchy

[Diagram: the tiered model.]
- Tier 0: CERN computer centre (>20 TIPS), fed by the online system at ~100 MBytes/sec (bunch crossing every 25 nsec, 100 triggers per second, each event ~1 MByte in size; raw detector output ~PBytes/sec); offline farm ~20 TIPS
- Tier 1: regional centres, e.g. Fermilab (~4 TIPS) and centres in France, Italy and Germany, linked to CERN at ~2.4 Gbits/sec (or ~622 Mbits/sec, or air freight)
- Tier 2: centres of ~1 TIPS each, linked at ~622 Mbits/sec
- Tier 3: institutes (~0.25 TIPS), linked at 100 - 1000 Mbits/sec
- Tier 4: physicists' workstations

Physicists work on analysis "channels". Each institute has ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server (physics data cache).

1 TIPS = 25,000 SpecInt95; a PC (1999) is ~15 SpecInt95.
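The units on this slide translate as follows (simple arithmetic from the definitions given):

```python
# TIPS arithmetic from the slide's own definitions
si95_per_tips = 25_000        # 1 TIPS = 25,000 SpecInt95
pc_1999 = 15                  # a 1999 PC is ~15 SpecInt95
print(f"PCs per TIPS: ~{si95_per_tips / pc_1999:,.0f}")   # ~1,700 PCs
# Trigger output into Tier 0: 100 events/s at ~1 MB each
print(f"online rate: ~{100 * 1:.0f} MB/s")                # matches ~100 MB/s
# The CERN centre at >20 TIPS, expressed in SI95 terms
print(f">20 TIPS = >{20 * si95_per_tips:,} SI95")         # >500,000 SI95
```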
13
MONARC Analysis Process Example

[Diagram: the analysis chain, starting from DAQ/RAW and slow control/calibration data; ~20 and ~25 are unlabelled figures from the diagram.]
- 4 times per year? (per experiment): coordinated activity
- 20 large jobs per month: coordinated activity
- A huge number of "small" jobs per day: chaotic activity
14
[Diagram: Regional Centre architecture.]
The Regional Centre holds tapes, tape mass storage & disk servers and database servers, fed by the network from CERN and the network from Tier 2 and simulation centres. It supports three classes of activity:
- Production reconstruction (raw/sim -> rec objs): scheduled, predictable; experiment/physics groups
- Production analysis (selection: rec objs -> AOD & TAG): scheduled; physics groups
- Individual analysis (selection: TAG -> plots): chaotic; physicists' desktops
Support services: physics software development, R&D systems and testbeds, info servers, code servers, web servers, telepresence servers, training, consulting, help desk. The centre serves Tier 2 centres and local institutes.
15
Offline Computing Facility for CMS at CERN

Purpose of the study: investigate the feasibility of building LHC computing facilities using current cluster architectures and conservative assumptions about technology evolution, with attention to scale & performance, technology, power, footprint, cost, reliability and manageability.
16
Background & assumptions

Sizing:
- Data are estimates from the experiments
- MONARC analysis group working papers and presentations
Architecture:
- CERN is the Tier0 centre and also acts as a Tier1 centre
- CERN distributed architecture (in the same room and across the site)
- Simplest components (hyper-sensitive to cost, aversion to complication)
- Throughput (before performance)
- Resilience (mostly up all of the time)
- A computing fabric for flexibility and scalability: avoid special-purpose components; everything can do anything (which does not mean that parts are not dedicated to specific applications, periods, ..)
Example CMS Offline Farm at CERN circa 2006

[Diagram: 1400 boxes in 160 clusters and 40 sub-farms on the farm network (3 Gbps* per cluster, 12 Gbps* per sub-farm, 480 Gbps* aggregate); LAN-SAN routers (250 Gbps) to the storage network serving 5400 disks in 340 arrays (0.8 Gbps, 5 Gbps) and 100 tape drives (1.5 Gbps, 12 Gbps); LAN-WAN routers (8 Gbps) with 0.8 Gbps to CERN and 0.8 Gbps from the DAQ.]
18
Components (1)

Processors:
- The then-current low-end PC server (equivalent of the dual-CPU boards of 1999): 4 CPUs, each >100 SI95
- Creation of AOD and analysis may need better (more expensive) processors
- Assembled into clusters and sub-farms according to practical considerations like the throughput of the first-level LAN switch, rack capacity, power & cooling, ...
- Each cluster comes with a suitable chunk of I/O capacity
LAN:
- No issue: since the computers are high-volume components, the computer-LAN interface is standard (then-current Ethernet!)
- Higher layers need higher throughput, but only about a Tbps

Processor cluster:
- Basic box: four 100 SI95 processors, standard network connection (~2 Gbps); 15% of systems are configured as I/O servers (disk server, disk-tape mover, Objy AMS, ..) with an additional connection to the storage network
- Cluster: 9 basic boxes with a network switch (<10 Gbps)
- Sub-farm: 4 clusters with a second-level network switch (<50 Gbps); one sub-farm fits in one rack (36 boxes, 144 CPUs, 5 m2)
Cluster and sub-farm sizing is adjusted to fit conveniently the capabilities of network switch, racking and power distribution components. [Diagram: cluster connections to the farm network and the storage network (3 Gbps*, 1.5 Gbps).]
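The cluster and sub-farm numbers compose as follows (simple arithmetic from the figures above):

```python
# Compose the farm from the building blocks described above
cpus_per_box, si95_per_cpu = 4, 100
boxes_per_cluster, clusters_per_subfarm = 9, 4
boxes_per_subfarm = boxes_per_cluster * clusters_per_subfarm   # 36 boxes
cpus_per_subfarm = boxes_per_subfarm * cpus_per_box            # 144 CPUs
print(f"sub-farm: {boxes_per_subfarm} boxes, {cpus_per_subfarm} cpus, "
      f"~{cpus_per_subfarm * si95_per_cpu / 1000:.1f} kSI95")
# 40 sub-farms give the farm-level totals quoted on the earlier diagram
print(f"farm: {40 * boxes_per_subfarm} boxes (~1400), "
      f"~{40 * cpus_per_subfarm * si95_per_cpu:,} SI95")       # ~576,000 SI95
```

The resulting ~576,000 SI95 sits between the 520,000 SI95 planning figure and Les Robertson's revised 600,000 SI95 estimate.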
20
Components (2)

Disks:
- Inexpensive RAID arrays
- Capacity limited to ensure a sufficient number of independent accessors (say ~100 GB with the current size of the disk farm)
SAN (Storage Area Network):
- If this market develops into high-volume, low-cost (?), hopefully using the standard network medium
- Otherwise use the current model: LAN-connected storage servers instead of special-purpose SAN-connected storage controllers

Disk sub-system:
- Array: two RAID controllers, dual-attached disks; the controllers connect to the storage network; sizing of the array is subject to the components available
- Rack: an integral number of arrays, with first-level network switches
In the main model, half-height 3.5" disks are assumed, 16 per shelf of a 19" rack. With space for 18 shelves in the rack (two-sided), half of the shelves are populated with disks, the remainder housing controllers, network switches and power distribution. A 19" rack, 1 m deep, occupies 1.1 m2 with space for doors and holds 14 TB. The disk size is restricted to give a disk count which matches the number of processors (and thus the number of active processes). [Diagram: 0.8 Gbps and 5 Gbps connections to the storage network.]
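These rack figures are mutually consistent (a quick check using the ~100 GB accessor-limited disk size quoted above):

```python
# Disk rack capacity check
disks_per_shelf, shelves, populated = 16, 18, 0.5
disks_per_rack = int(disks_per_shelf * shelves * populated)    # 144 disks
gb_per_disk = 100                                              # accessor-limited size
print(f"rack: {disks_per_rack} disks, "
      f"~{disks_per_rack * gb_per_disk / 1000:.1f} TB")        # ~14.4 TB
# Farm level: 5400 disks in 340 arrays
print(f"disks per array: ~{5400 / 340:.0f}")                   # ~16
print(f"racks needed: ~{5400 / disks_per_rack:.0f}")           # ~38 racks
# 144 disks per rack also mirrors the 144 CPUs per sub-farm rack:
# roughly one independent accessor per active process
```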
22
I/O models to be investigated
This is a fabric, so it should support any model.
1. The I/O server, or Objectivity AMS, model: all I/O requests must pass through an intelligent processor; data passes across the SAN to the I/O server and then across the LAN to the application server
2. As above, but the SAN and the LAN are the same, or the SAN is accessed via the LAN: all I/O requests pass twice across the LAN, doubling the network rates in the drawing
3. The global shared file system: no I/O servers or database servers; all data is accessed directly from all application servers; the LAN is the SAN
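The network implication of the three models fits in a few lines (a sketch; the 10 Gbps application read rate is an illustrative parameter, not a figure from the slides):

```python
# LAN traffic per unit of application I/O under the three models
def lan_traffic(app_read_rate_gbps: float, model: int) -> float:
    """Return the LAN load for a given aggregate application read rate."""
    if model == 1:   # separate SAN; data crosses the LAN once (server -> app)
        return app_read_rate_gbps
    if model == 2:   # SAN carried over the LAN; every byte crosses it twice
        return 2 * app_read_rate_gbps
    if model == 3:   # global shared file system; the LAN *is* the SAN
        return app_read_rate_gbps
    raise ValueError("model must be 1, 2 or 3")

for m in (1, 2, 3):
    print(f"model {m}: {lan_traffic(10.0, m):.0f} Gbps of LAN traffic")
```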
23
Components (3)

Tapes (unpopular in computer centres; new technology by 2004?):
- Conservative assumption: 100 GB per cartridge
- 20 MB/sec per drive, with 25% achievable (robot, load/unload, position/rewind, retry, ....)
- Let's hope that all of the active data can be held on disk; tape is needed as an archive and for shipping

[Photo: part of the magnetic tape vault at CERN's computer centre. STK robot: 6000 cartridges x 50 GB = 300 TB.]
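The effective tape throughput implied by these assumptions is sobering (simple arithmetic from the figures above):

```python
# Effective tape performance under the slide's conservative assumptions
drive_mb_s, efficiency = 20, 0.25
effective_mb_s = drive_mb_s * efficiency                  # 5 MB/s per drive
cartridge_gb = 100
hours_per_cartridge = cartridge_gb * 1000 / effective_mb_s / 3600
print(f"effective rate: {effective_mb_s:.0f} MB/s per drive")
print(f"one full cartridge: ~{hours_per_cartridge:.1f} h to read")  # ~5.6 h
# The 100 drives on the farm diagram give only ~0.5 GB/s aggregate,
# consistent with tape I/O being the over-committed resource (124%)
print(f"100 drives: ~{100 * effective_mb_s / 1000:.1f} GB/s aggregate")
```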
24
Problems?
- We hope the local area network will not be a problem: the CPU-to-I/O requirement is modest (a few Gbps at the computer node), and suitable switches should be available in a couple of years
- The disk system is probably not an issue: buy more disk than we currently predict, to have enough accessors
- Tapes: already talked about that
- Space: OK, thanks to the visionaries of 1970
- Power & cooling: not a problem, but a new cost
25
The real problem

Management: installation, monitoring, fault determination, re-configuration, integration.
All of this must be fully automated while retaining simplicity and flexibility.
Make sure the full cost of ownership is considered: the current industry cost of ownership of a PC is 10'000 CHF/year, versus a 3'000 CHF purchase price.
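Scaled to the farm, the industry ownership figure explains why management, not hardware, is "the real problem" (a back-of-envelope sketch using the box count from the farm diagram):

```python
# Why management dominates: industry TCO applied naively to the farm
boxes = 1400
purchase_chf, tco_chf_per_year = 3_000, 10_000
print(f"purchase: ~{boxes * purchase_chf / 1e6:.1f} MCHF")   # ~4.2 MCHF
print(f"industry-style ownership: "
      f"~{boxes * tco_chf_per_year / 1e6:.0f} MCHF/year")    # ~14 MCHF/year
# At that rate, ownership would dwarf the hardware budget within a year
# or two; hence the requirement for fully automated management
```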
26
[The Regional Centre architecture diagram from slide 14 is shown again.]
27
[Diagram: the same Regional Centre architecture, annotated with DATAFLOW.]
Data import and data export pass through the mass storage & disk servers and the database servers (tapes, network from CERN, network from Tier 2 and simulation centres). The robotic mass storage feeds a central disk cache, which in turn feeds local disk caches for each activity:
- Production reconstruction (raw/sim -> rec objs): scheduled, predictable; experiment/physics groups
- Production analysis (selection: rec objs -> TAG): scheduled; physics groups
- Individual analysis (selection: TAG -> plots): chaotic; physicists' desktops
Support services (physics software development, R&D systems and testbeds, info servers, code servers, web servers, telepresence servers, training, consulting, help desk) and the links to Tier 2 and local institutes are as before.
28
LHC Data Management

4 experiments mean:
- >10 PB/year in total, at 100 MB - 1.5 GB/s
- ~20 years of running
- ~5000 physicists, ~250 institutes
- ~500 concurrent analysis jobs, 24x7
Solutions must work at CERN and outside, and must:
- Scale from 1-10 users to 100-1000
- Support everything from a laptop to large servers with 100 GB - 1 TB and an HSM
- Span MB/GB/TB? (private data) to many PB
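The quoted rates and the yearly total are consistent (rough arithmetic; the duty-factor remark is my assumption):

```python
# Cross-check: sustained rate versus the >10 PB/year total
seconds_per_year = 365 * 24 * 3600
for rate_mb_s in (100, 1500):
    pb_per_year = rate_mb_s * seconds_per_year / 1e9
    print(f"{rate_mb_s} MB/s sustained -> ~{pb_per_year:.0f} PB/year")
# 100 MB/s -> ~3 PB/year; 1.5 GB/s -> ~47 PB/year. Allowing for the
# accelerator running only part of the year, the band comfortably
# brackets the >10 PB/year figure for the four experiments combined
```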
29
Objectivity/Database Architecture: CMS baseline solution

[Diagram: client/server deployment.]
- Application host: application + Objy client
- Application & data server: application + Objy client + Objy server
- Data server: Objy server
- Data server + HSM: Objy server + HSM client, talking to an HSM server
- Objy lock server: any host
Federation scale: 2^16 - 2^32 files.
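The file-count range translates into federation capacity as follows (a sketch; the per-file database size is an illustrative assumption, not a figure from the slides):

```python
# Federation address space: 2^16 to 2^32 database files
min_files, max_files = 2**16, 2**32
print(f"{min_files:,} to {max_files:,} files")   # 65,536 to ~4.3 billion
# Assuming, say, ~10 GB per database file (illustrative only):
gb_per_file = 10
print(f"~{min_files * gb_per_file / 1e6:.1f} PB to "
      f"~{max_files * gb_per_file / 1e9:.0f} EB addressable")
```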
31
Particle Physics Data Grid (PPDG)

DOE/NGI Next Generation Internet project: ANL, BNL, Caltech, FNAL, JLAB, LBNL, SDSC, SLAC, U.Wisc/CS

First-year goal: optimized cached read access to 1-10 GBytes, drawn from a total data set of order one petabyte.

[Diagram: a Site-to-Site Data Replication Service at 100 MBytes/sec between a primary site (data acquisition, CPU, disk, tape robot) and a secondary site (CPU, disk, tape robot); and a Multi-Site Cached File Access Service linking a primary site (DAQ, tape, CPU, disk, robot), satellite sites (tape, CPU, disk, robot) and universities (CPU, disk, users).]
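At the targeted replication rate, the first-year goal is comfortably interactive while full-dataset movement is not (simple arithmetic from the figures above):

```python
# Time scales at the PPDG replication rate of 100 MB/s
rate_mb_s = 100
for size_gb in (1, 10):
    print(f"{size_gb} GB: ~{size_gb * 1000 / rate_mb_s:.0f} s")   # 10 s, 100 s
petabyte_gb = 1_000_000
days = petabyte_gb * 1000 / rate_mb_s / 86_400
print(f"1 PB at {rate_mb_s} MB/s: ~{days:.0f} days")              # ~116 days
# Hence cached access to GB-scale working sets, not bulk PB movement
```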
32
PPDG: Architecture for Reliable High-Speed Data Delivery

[Diagram: service components.]
- Object-based and file-based application services
- Cache manager
- File access service
- Matchmaking service
- Cost estimation
- File fetching service
- File replication index
- End-to-end network services
- Mass storage manager
- Resource management
- File movers on either side of the site boundary / security domain
33
Distributed Data Delivery and LHC Software Architecture

Architectural flexibility: the GRID will allow resources to be used efficiently
- I/O requests known up-front; data-driven execution; responding to an ensemble of changing cost estimates
- Code movement as well as data movement
- Loosely coupled and dynamic: e.g. an agent-based implementation
34
Summary - Data Issues
- Development of a robust PB-scale networked data access and analysis system is mission-critical
- An effective partnership exists, HENP-wide, through many R&D projects
- An aggressive R&D program is required to develop the necessary systems for reliable data access, processing and analysis across a hierarchy of networks
- Solutions could be widely applicable to data problems in other scientific fields and industry by LHC startup
35
Conclusions
- CMS has a first-order estimate of the needed resources and costs
- CMS has identified the key issues concerning the needed resources
- CMS is doing a lot of focused R&D work to refine the estimates
- A lot of integration is needed for the software and hardware architecture
- We have positive feedback from different institutions for regional centres and development