Ian Willers
Information: CMS participation in MONARC and RD45
Slides: Paolo Capiluppi, Irwin Gaines, Harvey Newman, Les Robertson, Jamie Shiers, Lucas Taylor

Hardware Resource Needs of the CMS Experiment (start-up 2005)
2
Contents
- Why is LHC computing different
- The MONARC project and proposed architecture
- An LHC offline computing facility at CERN
- A Regional Centre
- LHC data management
- The Particle Physics Data Grid
- Summary
3
CMS structure showing sub-detectors
4
Not covered: CMS Software Professionals
- Professional software personnel ramping up to ~33 FTEs (by 2003)
- Engineers support a much larger number of physicist developers (~4 times as many)
- Shortfall: 10 FTEs (1999)
5
Not covered: Cost of Hardware
Total computing cost to 2006 incl.: ~120 MCHF, ~consistent with the canonical 1/3 : 2/3 rule
- ~40 MCHF (Tier0, central systems at CERN)
- ~40 MCHF (Tier1, ~5 Regional Centres, each ~20% of the central systems)
- ~40 MCHF (?) (universities, Tier2 centres, MC, etc.)
(Figures being revised)
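As a quick sanity check on the split quoted above (a sketch; the MCHF figures are the slide's own and were marked as being revised):

```python
# Back-of-envelope check of the canonical 1/3 : 2/3 cost split
tier0_cern = 40        # MCHF, central systems at CERN
tier1_rcs  = 40        # MCHF, ~5 Regional Centres
tier2_univ = 40        # MCHF, universities, Tier2 centres, MC production
total = tier0_cern + tier1_rcs + tier2_univ          # ~120 MCHF
cern_share = tier0_cern / total                      # 1/3 at CERN
outside_share = (tier1_rcs + tier2_univ) / total     # 2/3 outside
print(f"total ~{total} MCHF, CERN {cern_share:.0%}, outside {outside_share:.0%}")
```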
6
LHC Computing: Different from Previous Experiment Generations
- Geographical dispersion: of people and resources
- Complexity: the detector and the LHC environment
- Scale: petabytes per year of data
1800 physicists, 150 institutes, 32 countries

Major challenges are associated with:
- Coordinated use of distributed computing resources
- Remote software development and physics analysis
- Communication and collaboration at a distance
R&D: new forms of distributed systems
7
Comparisons with an LHC-sized experiment in 2006: CMS at CERN [*]
(Onsite CPU in SI95; 1 SI95 = 40 MIPS)

Experiment   Onsite CPU  Disk (TB)  Tape (TB)  LAN capacity     Data import/export     Box count
LHC (2006)   520,000*    540        3000       46 GB/s          10 TB/day (sustained)  ~1400
CDF-2        12,000      20         800        1 Gb/s           18 MB/s                ~250
D0-2         7,000       20         600        300 Mb/s         10 MB/s                ~250
BaBar        ~6000       8          ~300       100 + 1000 Mb/s  ~400 GB/day            ~400
D0           295         1.5        65         300 Mb/s         ?                      180
CDF          280         2          100        100 Mb/s         ~100 GB/day            ?
ALEPH        300         1.8        30         1 Gb/s           ?                      70
DELPHI       515         1.2        60         1 Gb/s           ?                      80
L3           625         2          40         1 Gb/s           ?                      160
OPAL         835         1.6        22         1 Gb/s           ?                      220
NA45         587         1.3        2          1 Gb/s           5 GB/day               30

[*] Total CPU: CMS or ATLAS ~1.5-2.0 MSI95. Estimates for the disk/tape ratio will change (technology evolution).
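To put the jump in scale in perspective, a rough calculation based only on the numbers in the table above:

```python
# Rough scale factors: LHC (2006) versus a LEP-era experiment (OPAL)
lhc_cpu, opal_cpu = 520_000, 835          # SI95
lhc_tape, opal_tape = 3000, 22            # TB
print(f"CPU:  x{lhc_cpu / opal_cpu:,.0f}")     # ~x620
print(f"Tape: x{lhc_tape / opal_tape:,.0f}")   # ~x136
# 1 SI95 = 40 MIPS, so 520,000 SI95 is ~20.8 million MIPS onsite
print(f"LHC onsite CPU in MIPS: {lhc_cpu * 40:,}")
```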
8
CPU needs for the baseline analysis process
(100% efficiency, no AMS overhead)

Activity                     Who                Frequency         Response time/pass  Disk I/O (MB/s)  Total CPU (SI95)  Total disk (TB)
Reconstruction               Experiment         Once/Year         100 Days            500              116k              200
Re-processing                Experiment         3 times/Year      2 Months            300              100k              150
Re-definition (AOD & TAG)    Experiment         Once/Month        10 Days             12,000           190k              200
Selection                    Groups (20)        Once/Month        1 Day               1,200            1k                11
Analysis (AOD, TAG & DPD)    Individuals (500)  4 Times/Day       4 Hours             7,500            3k                20
Analysis (ESD 1%)            Individuals (500)  4 Times/Day       4 Hours             52k              1050k             ?
Simulation + Reconstruction  Experiment/Group   ~10^6 events/Day  ~300 Days           10               -                 -
Total utilized                                                                                         ~1400k            ~580+x
Total installed                                                                                        ~1700k            ~700+y
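A quick consistency check of the totals row (a sketch; the per-activity figures are taken from the table above, and the slide's "utilized" totals are rounded):

```python
# Sum the per-activity CPU and disk figures and compare with the quoted totals
cpu_ksi95 = {"Reconstruction": 116, "Re-processing": 100,
             "Re-definition": 190, "Selection": 1,
             "Analysis AOD/TAG/DPD": 3, "Analysis ESD 1%": 1050}
disk_tb = {"Reconstruction": 200, "Re-processing": 150,
           "Re-definition": 200, "Selection": 11,
           "Analysis AOD/TAG/DPD": 20}   # ESD-analysis disk is '?' on the slide
print(f"CPU sum:  {sum(cpu_ksi95.values())}k SI95 (slide: ~1400k utilized)")
print(f"Disk sum: {sum(disk_tb.values())} TB + x (slide: ~580+x TB utilized)")
# The installed figures (~1700k SI95, ~700+y TB) carry ~20% headroom
```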
9
Major activities foreseen at CERN: reality check (Les Robertson, Jan. '99)

Activity                               CPU (SI95)  Disk I/O (MB/s)  Tape I/O (MB/s)
Reconstruction + data recording        35,000      500              100
Copy raw data                          -           120              120
Re-processing                          150,000     400              400
Pass 1+2 analysis (4 groups)           4,000       1,600            -
User analysis (50 simultaneous jobs)   100,000     750              -
Totals                                 289,000     3,410            620
% of 2006 capacity                     56%         4%               124%

Based on 520,000 SI95; Les's present estimate is 600,000 SI95 + 200,000 SI95/year.
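The percentage row can be reproduced from the totals (a sketch; the CPU capacity is stated on the slide, while the disk and tape I/O capacities are inferred here by inverting the quoted percentages, which is an assumption):

```python
# Reproduce the '% of 2006 capacity' row
cpu_total, cpu_capacity = 289_000, 520_000
print(f"CPU: {cpu_total / cpu_capacity:.0%}")                             # 56%
tape_total = 620                                                          # MB/s
print(f"implied tape I/O capacity: {tape_total / 1.24:.0f} MB/s")         # ~500 MB/s
disk_total = 3_410                                                        # MB/s
print(f"implied disk I/O capacity: {disk_total / 0.04 / 1000:.0f} GB/s")  # ~85 GB/s
```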
10
MONARC: Common Project. Models Of Networked Analysis At Regional Centres
Caltech, CERN, Columbia, FNAL, Heidelberg, Helsinki, INFN, IN2P3, KEK, Marseilles, MPI Munich, Orsay, Oxford, Tufts

PROJECT GOALS
- Develop "Baseline Models"
- Specify the main parameters characterizing the Model's performance: throughputs, latencies
- Verify resource requirement baselines (computing, data handling, networks)

TECHNICAL GOALS
- Define the Analysis Process
- Define RC Architectures and Services
- Provide Guidelines for the final Models
- Provide a Simulation Toolset for further model studies
[Diagram: Model circa 2005. CERN (520k SI95, 540 TB disk; robot) connects to Tier1 Regional Centres such as FNAL/BNL (100k SI95, 100 TB disk; robot) over 622 Mbits/s links (N x 622 Mbits/s in aggregate); Tier2 centres (20k SI95, 20 TB disk; robot) and universities (Univ 1, Univ 2, ... Univ M) connect onward at 622 Mbits/s.]
11
CMS Analysis Model Based on MONARC and ORCA: "Typical" Tier1 RC
- CPU power: ~100 kSI95
- Disk space: ~200 TB
- Tape capacity: 600 TB, 100 MB/sec
- Link speed to Tier2: 10 MB/sec (1/2 of 155 Mbps)
- Raw data: 5% = 50 TB/year
- ESD data: 100% = 200 TB/year
- Selected ESD: 25% = 10 TB/year [*]
- Revised ESD: 25% = 20 TB/year [*]
- AOD data: 100% = 2 TB/year [**]
- Revised AOD: 100% = 4 TB/year [**]
- TAG/DPD: 100% = 200 GB/year
- Simulated data: 25% = 25 TB/year
[*] Covering five analysis groups, each selecting ~1% of annual ESD or AOD data for a typical analysis
[**] Covering all analysis groups
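Adding up the annual per-category volumes gives a feel for how the ~200 TB of disk and 600 TB of tape fill up (rough arithmetic using only the figures above):

```python
# Annual data volume arriving at a "typical" Tier1 Regional Centre (TB/year)
tb_per_year = {"raw (5%)": 50, "ESD (100%)": 200, "selected ESD": 10,
               "revised ESD": 20, "AOD": 2, "revised AOD": 4,
               "TAG/DPD": 0.2, "simulated (25%)": 25}
total = sum(tb_per_year.values())
print(f"total inflow: ~{total:.0f} TB/year")          # ~311 TB/year
# More than the ~200 TB of disk, so older data must migrate to the
# 600 TB tape store, which fills in roughly two years at this rate
print(f"years to fill 600 TB of tape: ~{600 / total:.1f}")
```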
12
MONARC Data Hierarchy

[Diagram: the tiered model.]
- Tier 0: CERN computer centre (>20 TIPS), fed by the online system at ~100 MBytes/sec (bunch crossing every 25 nsec, 100 triggers per second, each event ~1 MByte in size; raw detector output ~PBytes/sec); offline farm ~20 TIPS
- Tier 1: regional centres, e.g. Fermilab (~4 TIPS) and centres in France, Italy and Germany, linked to CERN at ~2.4 Gbits/sec (or ~622 Mbits/sec, or air freight)
- Tier 2: centres of ~1 TIPS each, linked at ~622 Mbits/sec
- Tier 3: institutes (~0.25 TIPS), linked at 100 - 1000 Mbits/sec
- Tier 4: physicists' workstations

Physicists work on analysis "channels". Each institute has ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server (physics data cache).

1 TIPS = 25,000 SpecInt95; a PC (1999) is ~15 SpecInt95.
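The units on this slide translate as follows (simple arithmetic from the definitions given):

```python
# TIPS arithmetic from the slide's own definitions
si95_per_tips = 25_000        # 1 TIPS = 25,000 SpecInt95
pc_1999 = 15                  # a 1999 PC is ~15 SpecInt95
print(f"PCs per TIPS: ~{si95_per_tips / pc_1999:,.0f}")   # ~1,700 PCs
# Trigger output into Tier 0: 100 events/s at ~1 MB each
print(f"online rate: ~{100 * 1:.0f} MB/s")                # matches ~100 MB/s
# The CERN centre at >20 TIPS, expressed in SI95 terms
print(f">20 TIPS = >{20 * si95_per_tips:,} SI95")         # >500,000 SI95
```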
13
MONARC Analysis Process Example

[Diagram: the analysis chain, starting from DAQ/RAW and slow control/calibration data; ~20 and ~25 are unlabelled figures from the diagram.]
- 4 times per year? (per experiment): coordinated activity
- 20 large jobs per month: coordinated activity
- A huge number of "small" jobs per day: chaotic activity
14
[Diagram: Regional Centre architecture.]
The Regional Centre holds tapes, tape mass storage & disk servers and database servers, fed by the network from CERN and the network from Tier 2 and simulation centres. It supports three classes of activity:
- Production reconstruction (raw/sim -> rec objs): scheduled, predictable; experiment/physics groups
- Production analysis (selection: rec objs -> AOD & TAG): scheduled; physics groups
- Individual analysis (selection: TAG -> plots): chaotic; physicists' desktops
Support services: physics software development, R&D systems and testbeds, info servers, code servers, web servers, telepresence servers, training, consulting, help desk. The centre serves Tier 2 centres and local institutes.
15
Offline Computing Facility for CMS at CERN

Purpose of the study: investigate the feasibility of building LHC computing facilities using current cluster architectures and conservative assumptions about technology evolution, with attention to scale & performance, technology, power, footprint, cost, reliability and manageability.
16
Background & assumptions

Sizing:
- Data are estimates from the experiments
- MONARC analysis group working papers and presentations
Architecture:
- CERN is the Tier0 centre and also acts as a Tier1 centre
- CERN distributed architecture (in the same room and across the site)
- Simplest components (hyper-sensitive to cost, aversion to complication)
- Throughput (before performance)
- Resilience (mostly up all of the time)
- A computing fabric for flexibility and scalability: avoid special-purpose components; everything can do anything (which does not mean that parts are not dedicated to specific applications, periods, ..)
Example CMS Offline Farm at CERN circa 2006

[Diagram: 1400 boxes in 160 clusters and 40 sub-farms on the farm network (3 Gbps* per cluster, 12 Gbps* per sub-farm, 480 Gbps* aggregate); LAN-SAN routers (250 Gbps) to the storage network serving 5400 disks in 340 arrays (0.8 Gbps, 5 Gbps) and 100 tape drives (1.5 Gbps, 12 Gbps); LAN-WAN routers (8 Gbps) with 0.8 Gbps to CERN and 0.8 Gbps from the DAQ.]
18
Components (1)

Processors:
- The then-current low-end PC server (equivalent of the dual-CPU boards of 1999): 4 CPUs, each >100 SI95
- Creation of AOD and analysis may need better (more expensive) processors
- Assembled into clusters and sub-farms according to practical considerations like the throughput of the first-level LAN switch, rack capacity, power & cooling, ...
- Each cluster comes with a suitable chunk of I/O capacity
LAN:
- No issue: since the computers are high-volume components, the computer-LAN interface is standard (then-current Ethernet!)
- Higher layers need higher throughput, but only about a Tbps

Processor cluster:
- Basic box: four 100 SI95 processors, standard network connection (~2 Gbps); 15% of systems are configured as I/O servers (disk server, disk-tape mover, Objy AMS, ..) with an additional connection to the storage network
- Cluster: 9 basic boxes with a network switch (<10 Gbps)
- Sub-farm: 4 clusters with a second-level network switch (<50 Gbps); one sub-farm fits in one rack (36 boxes, 144 CPUs, 5 m2)
Cluster and sub-farm sizing is adjusted to fit conveniently the capabilities of network switch, racking and power distribution components. [Diagram: cluster connections to the farm network and the storage network (3 Gbps*, 1.5 Gbps).]
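The cluster and sub-farm numbers compose as follows (simple arithmetic from the figures above):

```python
# Compose the farm from the building blocks described above
cpus_per_box, si95_per_cpu = 4, 100
boxes_per_cluster, clusters_per_subfarm = 9, 4
boxes_per_subfarm = boxes_per_cluster * clusters_per_subfarm   # 36 boxes
cpus_per_subfarm = boxes_per_subfarm * cpus_per_box            # 144 CPUs
print(f"sub-farm: {boxes_per_subfarm} boxes, {cpus_per_subfarm} cpus, "
      f"~{cpus_per_subfarm * si95_per_cpu / 1000:.1f} kSI95")
# 40 sub-farms give the farm-level totals quoted on the earlier diagram
print(f"farm: {40 * boxes_per_subfarm} boxes (~1400), "
      f"~{40 * cpus_per_subfarm * si95_per_cpu:,} SI95")       # ~576,000 SI95
```

The resulting ~576,000 SI95 sits between the 520,000 SI95 planning figure and Les Robertson's revised 600,000 SI95 estimate.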
20
Components (2)

Disks:
- Inexpensive RAID arrays
- Capacity limited to ensure a sufficient number of independent accessors (say ~100 GB with the current size of the disk farm)
SAN (Storage Area Network):
- If this market develops into high-volume, low-cost (?), hopefully using the standard network medium
- Otherwise use the current model: LAN-connected storage servers instead of special-purpose SAN-connected storage controllers

Disk sub-system:
- Array: two RAID controllers, dual-attached disks; the controllers connect to the storage network; sizing of the array is subject to the components available
- Rack: an integral number of arrays, with first-level network switches
In the main model, half-height 3.5" disks are assumed, 16 per shelf of a 19" rack. With space for 18 shelves in the rack (two-sided), half of the shelves are populated with disks, the remainder housing controllers, network switches and power distribution. A 19" rack, 1 m deep, occupies 1.1 m2 with space for doors and holds 14 TB. The disk size is restricted to give a disk count which matches the number of processors (and thus the number of active processes). [Diagram: 0.8 Gbps and 5 Gbps connections to the storage network.]
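These rack figures are mutually consistent (a quick check using the ~100 GB accessor-limited disk size quoted above):

```python
# Disk rack capacity check
disks_per_shelf, shelves, populated = 16, 18, 0.5
disks_per_rack = int(disks_per_shelf * shelves * populated)    # 144 disks
gb_per_disk = 100                                              # accessor-limited size
print(f"rack: {disks_per_rack} disks, "
      f"~{disks_per_rack * gb_per_disk / 1000:.1f} TB")        # ~14.4 TB
# Farm level: 5400 disks in 340 arrays
print(f"disks per array: ~{5400 / 340:.0f}")                   # ~16
print(f"racks needed: ~{5400 / disks_per_rack:.0f}")           # ~38 racks
# 144 disks per rack also mirrors the 144 CPUs per sub-farm rack:
# roughly one independent accessor per active process
```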
22
I/O models to be investigated
This is a fabric, so it should support any model.
1. The I/O server, or Objectivity AMS, model: all I/O requests must pass through an intelligent processor; data passes across the SAN to the I/O server and then across the LAN to the application server
2. As above, but the SAN and the LAN are the same, or the SAN is accessed via the LAN: all I/O requests pass twice across the LAN, doubling the network rates in the drawing
3. The global shared file system: no I/O servers or database servers; all data is accessed directly from all application servers; the LAN is the SAN
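The network implication of the three models fits in a few lines (a sketch; the 10 Gbps application read rate is an illustrative parameter, not a figure from the slides):

```python
# LAN traffic per unit of application I/O under the three models
def lan_traffic(app_read_rate_gbps: float, model: int) -> float:
    """Return the LAN load for a given aggregate application read rate."""
    if model == 1:   # separate SAN; data crosses the LAN once (server -> app)
        return app_read_rate_gbps
    if model == 2:   # SAN carried over the LAN; every byte crosses it twice
        return 2 * app_read_rate_gbps
    if model == 3:   # global shared file system; the LAN *is* the SAN
        return app_read_rate_gbps
    raise ValueError("model must be 1, 2 or 3")

for m in (1, 2, 3):
    print(f"model {m}: {lan_traffic(10.0, m):.0f} Gbps of LAN traffic")
```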
23
Components (3)

Tapes (unpopular in computer centres; new technology by 2004?):
- Conservative assumption: 100 GB per cartridge
- 20 MB/sec per drive, with 25% achievable (robot, load/unload, position/rewind, retry, ....)
- Let's hope that all of the active data can be held on disk; tape is needed as an archive and for shipping

[Photo: part of the magnetic tape vault at CERN's computer centre. STK robot: 6000 cartridges x 50 GB = 300 TB.]
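The effective tape throughput implied by these assumptions is sobering (simple arithmetic from the figures above):

```python
# Effective tape performance under the slide's conservative assumptions
drive_mb_s, efficiency = 20, 0.25
effective_mb_s = drive_mb_s * efficiency                  # 5 MB/s per drive
cartridge_gb = 100
hours_per_cartridge = cartridge_gb * 1000 / effective_mb_s / 3600
print(f"effective rate: {effective_mb_s:.0f} MB/s per drive")
print(f"one full cartridge: ~{hours_per_cartridge:.1f} h to read")  # ~5.6 h
# The 100 drives on the farm diagram give only ~0.5 GB/s aggregate,
# consistent with tape I/O being the over-committed resource (124%)
print(f"100 drives: ~{100 * effective_mb_s / 1000:.1f} GB/s aggregate")
```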
24
Problems?
- We hope the local area network will not be a problem: the CPU-to-I/O requirement is modest (a few Gbps at the computer node), and suitable switches should be available in a couple of years
- The disk system is probably not an issue: buy more disk than we currently predict, to have enough accessors
- Tapes: already talked about that
- Space: OK, thanks to the visionaries of 1970
- Power & cooling: not a problem, but a new cost
25
The real problem

Management: installation, monitoring, fault determination, re-configuration, integration.
All of this must be fully automated while retaining simplicity and flexibility.
Make sure the full cost of ownership is considered: the current industry cost of ownership of a PC is 10'000 CHF/year, versus a 3'000 CHF purchase price.
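Scaled to the farm, the industry ownership figure explains why management, not hardware, is "the real problem" (a back-of-envelope sketch using the box count from the farm diagram):

```python
# Why management dominates: industry TCO applied naively to the farm
boxes = 1400
purchase_chf, tco_chf_per_year = 3_000, 10_000
print(f"purchase: ~{boxes * purchase_chf / 1e6:.1f} MCHF")   # ~4.2 MCHF
print(f"industry-style ownership: "
      f"~{boxes * tco_chf_per_year / 1e6:.0f} MCHF/year")    # ~14 MCHF/year
# At that rate, ownership would dwarf the hardware budget within a year
# or two; hence the requirement for fully automated management
```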
26
[The Regional Centre architecture diagram from slide 14 is shown again.]
27
[Diagram: the same Regional Centre architecture, annotated with DATAFLOW.]
Data import and data export pass through the mass storage & disk servers and the database servers (tapes, network from CERN, network from Tier 2 and simulation centres). The robotic mass storage feeds a central disk cache, which in turn feeds local disk caches for each activity:
- Production reconstruction (raw/sim -> rec objs): scheduled, predictable; experiment/physics groups
- Production analysis (selection: rec objs -> TAG): scheduled; physics groups
- Individual analysis (selection: TAG -> plots): chaotic; physicists' desktops
Support services (physics software development, R&D systems and testbeds, info servers, code servers, web servers, telepresence servers, training, consulting, help desk) and the links to Tier 2 and local institutes are as before.
28
LHC Data Management

4 experiments mean:
- >10 PB/year in total, at 100 MB - 1.5 GB/s
- ~20 years of running
- ~5000 physicists, ~250 institutes
- ~500 concurrent analysis jobs, 24x7
Solutions must work at CERN and outside, and must:
- Scale from 1-10 users to 100-1000
- Support everything from a laptop to large servers with 100 GB - 1 TB and an HSM
- Span MB/GB/TB? (private data) to many PB
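The quoted rates and the yearly total are consistent (rough arithmetic; the duty-factor remark is my assumption):

```python
# Cross-check: sustained rate versus the >10 PB/year total
seconds_per_year = 365 * 24 * 3600
for rate_mb_s in (100, 1500):
    pb_per_year = rate_mb_s * seconds_per_year / 1e9
    print(f"{rate_mb_s} MB/s sustained -> ~{pb_per_year:.0f} PB/year")
# 100 MB/s -> ~3 PB/year; 1.5 GB/s -> ~47 PB/year. Allowing for the
# accelerator running only part of the year, the band comfortably
# brackets the >10 PB/year figure for the four experiments combined
```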
29
Objectivity/Database Architecture: CMS baseline solution

[Diagram: client/server deployment.]
- Application host: application + Objy client
- Application & data server: application + Objy client + Objy server
- Data server: Objy server
- Data server + HSM: Objy server + HSM client, talking to an HSM server
- Objy lock server: any host
Federation scale: 2^16 - 2^32 files.
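The file-count range translates into federation capacity as follows (a sketch; the per-file database size is an illustrative assumption, not a figure from the slides):

```python
# Federation address space: 2^16 to 2^32 database files
min_files, max_files = 2**16, 2**32
print(f"{min_files:,} to {max_files:,} files")   # 65,536 to ~4.3 billion
# Assuming, say, ~10 GB per database file (illustrative only):
gb_per_file = 10
print(f"~{min_files * gb_per_file / 1e6:.1f} PB to "
      f"~{max_files * gb_per_file / 1e9:.0f} EB addressable")
```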
31
Particle Physics Data Grid (PPDG)

DOE/NGI Next Generation Internet project: ANL, BNL, Caltech, FNAL, JLAB, LBNL, SDSC, SLAC, U.Wisc/CS

First-year goal: optimized cached read access to 1-10 GBytes, drawn from a total data set of order one petabyte.

[Diagram: a Site-to-Site Data Replication Service at 100 MBytes/sec between a primary site (data acquisition, CPU, disk, tape robot) and a secondary site (CPU, disk, tape robot); and a Multi-Site Cached File Access Service linking a primary site (DAQ, tape, CPU, disk, robot), satellite sites (tape, CPU, disk, robot) and universities (CPU, disk, users).]
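At the targeted replication rate, the first-year goal is comfortably interactive while full-dataset movement is not (simple arithmetic from the figures above):

```python
# Time scales at the PPDG replication rate of 100 MB/s
rate_mb_s = 100
for size_gb in (1, 10):
    print(f"{size_gb} GB: ~{size_gb * 1000 / rate_mb_s:.0f} s")   # 10 s, 100 s
petabyte_gb = 1_000_000
days = petabyte_gb * 1000 / rate_mb_s / 86_400
print(f"1 PB at {rate_mb_s} MB/s: ~{days:.0f} days")              # ~116 days
# Hence cached access to GB-scale working sets, not bulk PB movement
```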
32
PPDG: Architecture for Reliable High-Speed Data Delivery

[Diagram: service components.]
- Object-based and file-based application services
- Cache manager
- File access service
- Matchmaking service
- Cost estimation
- File fetching service
- File replication index
- End-to-end network services
- Mass storage manager
- Resource management
- File movers on either side of the site boundary / security domain
33
Distributed Data Delivery and LHC Software Architecture

Architectural flexibility: the GRID will allow resources to be used efficiently
- I/O requests known up-front; data-driven execution; responding to an ensemble of changing cost estimates
- Code movement as well as data movement
- Loosely coupled and dynamic: e.g. an agent-based implementation
34
Summary - Data Issues
- Development of a robust PB-scale networked data access and analysis system is mission-critical
- An effective partnership exists, HENP-wide, through many R&D projects
- An aggressive R&D program is required to develop the necessary systems for reliable data access, processing and analysis across a hierarchy of networks
- Solutions could be widely applicable to data problems in other scientific fields and industry by LHC startup
35
Conclusions
- CMS has a first-order estimate of the needed resources and costs
- CMS has identified the key issues concerning the needed resources
- CMS is doing a lot of focused R&D work to refine the estimates
- A lot of integration is needed for the software and hardware architecture
- We have positive feedback from different institutions for regional centres and development