Fabric Management for CERN Experiments
Past, Present, and Future
Tim Smith, CERN/IT
HEPiX @ JLab, 2000/11/03
Contents
The Fabric of CERN today
The new challenges of LHC computing
What has this got to do with the GRID?
Fabric Management solutions of tomorrow?
The DataGRID Project
Fabric Elements
Functionalities: batch and interactive services, disk servers, tape servers + devices, stage servers, home directory servers, application servers, backup service
Infrastructure: job scheduler, authentication, authorisation, monitoring, alarms, console managers, networks
Fabric Technology at CERN
[Chart: machine multiplicity (log scale, 1 to 10,000) versus year, 1989 to 2005, showing the progression from mainframes (IBM, Cray) through RISC workstations, SMPs (SGI, DEC, HP, SUN) and scalable systems (SP2, CS2) to PC farms]
Architecture Considerations
Physics applications have ideal data parallelism: a mass of independent problems
No message passing; throughput rather than peak performance; resilience rather than ultimate reliability
Can build hierarchies of mass-market components
High Throughput Computing (see the sketch below)
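High throughput rather than high performance is easiest to see with a toy example. The sketch below is hypothetical (the selection cut and event counts are invented, and this is not CERN software): each job runs as a fully independent process with no message passing, so aggregate throughput scales simply by adding more workers.

```python
from multiprocessing import Pool
import random

def process_job(seed):
    """One independent 'job': generates and filters its own toy events; no messages exchanged."""
    rng = random.Random(seed)
    events = (rng.random() for _ in range(100_000))   # toy event sample
    selected = sum(1 for e in events if e > 0.9)      # stand-in for a physics selection cut
    return seed, selected

if __name__ == "__main__":
    # Each job is an independent problem, so total throughput grows with the
    # number of workers; losing one worker costs only its own jobs
    # (resilience rather than ultimate reliability).
    with Pool(processes=4) as pool:
        for seed, n in pool.map(process_job, range(8)):
            print(f"job {seed}: {n} events passed the cut")
```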
Component Architecture
[Diagram: CPU nodes attached to 100/1000baseT switches, and tape servers, disk servers and application servers attached to 1000baseT switches, all interconnected by a high-capacity backbone switch]
Analysis Chain: Farms
[Diagram: data flows from the detector through the event filter (selection & reconstruction) to raw data; event reconstruction and event simulation produce event summary data and processed data; batch physics analysis extracts analysis objects by physics topic, which feed interactive physics analysis]
Multiplication!
[Chart: number of CPUs, July 1997 to January 2000, growing towards ~1200, broken down by cluster: tomog, tapes, pcsf, nomad, na49, na48, na45, mta, lxbatch, lxplus, lhcb, l3c, ion, eff, cms, ccf, atlas, alice]
PC Farms
Shared Facilities: EFF Scheduling 2000
[Chart: number of PCs (0 to 140) allocated per week over weeks 1 to 52, shared between DELPHI, CMS, ALEPH, ATLAS, NA45, COMPASS and ALICE, against the available capacity]
LHC Computing Challenge
The scale will be different:
  CPU   10k SI95  ->  1M SI95
  Disk  30 TB     ->  3 PB
  Tape  600 TB    ->  9 PB
The model will be different: there are compelling reasons why some of the farms and some of the capacity will not be located at CERN
Estimated Disk Capacity at CERN
[Chart: estimated disk storage capacity at CERN, 1998 to 2006, rising towards ~1800 TB, split into non-LHC and LHC contributions and compared with a Moore's Law curve]
Estimated CPU Capacity at CERN
[Chart: estimated CPU capacity at CERN, 1998 to 2006, rising towards ~2500K SI95, split into non-LHC and LHC contributions; the ~10K SI95 of 2000 corresponds to roughly 1200 processors]

Bad News: IO
1996: 4 GB disks @ 10 MB/s, so 1 TB of capacity delivered ~2500 MB/s aggregate
2000: 50 GB disks @ 20 MB/s, so 1 TB of capacity delivers only ~400 MB/s aggregate (see the worked numbers below)
Bad News: Tapes
Less than a factor 2 price reduction in 8 years; a significant fraction of the cost
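The IO numbers above follow from simple arithmetic. A minimal check, assuming the quoted per-disk capacities and transfer rates (the function name is ours, not from the slides):

```python
def aggregate_mb_per_s_per_tb(disk_gb, disk_mb_per_s, capacity_gb=1000):
    """Aggregate streaming rate of the spindles needed to hold `capacity_gb` of data."""
    n_disks = capacity_gb / disk_gb       # spindles required for ~1 TB
    return n_disks * disk_mb_per_s        # MB/s if all spindles stream in parallel

# 1996: 4 GB disks at 10 MB/s -> 250 spindles * 10 MB/s = 2500 MB/s per TB
# 2000: 50 GB disks at 20 MB/s ->  20 spindles * 20 MB/s =  400 MB/s per TB
print(aggregate_mb_per_s_per_tb(4, 10))    # 2500.0
print(aggregate_mb_per_s_per_tb(50, 20))   # 400.0
```

Capacity per disk is growing much faster than per-disk bandwidth, so the bandwidth available per terabyte keeps falling.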
Regional Centres: a Multi-Tier Model (MONARC, http://cern.ch/MONARC)
[Diagram: CERN as Tier 0, connected to Tier 1 centres (FNAL, RAL, IN2P3), which connect to Tier 2 labs and universities and on down to department and desktop level, with 2.5 Gbps, 622 Mbps and 155 Mbps links between the tiers]
More realistically: a Grid Topology (DataGRID, http://cern.ch/grid)
[Diagram: the same elements (CERN Tier 0; Tier 1 at FNAL, RAL, IN2P3; Tier 2 labs and universities; departments and desktops) interconnected as a grid rather than a strict hierarchy, with 2.5 Gbps, 622 Mbps and 155 Mbps links]
Can we build LHC farms?
Positive predictions: CPU and disk price/performance trends suggest that the raw processing and disk storage capacities will be affordable, and raw data rates and volumes look manageable (perhaps not today for ALICE)
Space, power and cooling issues?
So probably yes… but can we manage them?
Understand costs: 1 PC is cheap, but managing 10,000 is not!
Building and managing coherent systems from such large numbers of boxes will be a challenge.
1999: CDR @ 45 MB/s for NA48!
2000: CDR @ 90 MB/s for ALICE!
Management Tasks I
Supporting adaptability
Configuration Management: machine / service hierarchy; automated registration, insertion and removal; dynamic reassignment (see the sketch below)
Automatic Software Installation and Management (OS and applications): version management, application dependencies, controlled (re)deployment
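As a minimal sketch of the machine / service hierarchy idea (class, node and service names are hypothetical, not the WP4 design), nodes register themselves in a configuration store and can be reassigned between services without hand-editing each box:

```python
class FabricConfig:
    """Toy configuration store mapping each registered node to the service it provides."""

    def __init__(self):
        self._assignment = {}              # node name -> service name

    def register(self, node, service):
        """Automated registration / insertion of a node into a service."""
        self._assignment[node] = service

    def remove(self, node):
        """Automated removal, e.g. when a node is retired or fails."""
        self._assignment.pop(node, None)

    def reassign(self, node, new_service):
        """Dynamic reassignment between services, e.g. batch -> interactive."""
        self._assignment[node] = new_service

    def nodes_of(self, service):
        return sorted(n for n, s in self._assignment.items() if s == service)

config = FabricConfig()
config.register("pc001", "batch")          # hypothetical node and service names
config.register("pc002", "batch")
config.register("pc003", "interactive")
config.reassign("pc002", "interactive")    # capacity follows demand, not manual edits
print(config.nodes_of("interactive"))      # ['pc002', 'pc003']
```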
Management Tasks II
Controlling Quality of Service
System Monitoring: orientation to the service, NOT the machine; uniform access to diverse fabric elements; integrated with configuration (change) management (see the sketch below)
Problem Management: identification of root causes (faults + performance); correlation of network / system / application data; highly automated; adaptive and integrated with configuration management
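To illustrate monitoring oriented to the service rather than the machine, here is a small sketch with invented node names and metrics: per-node measurements are rolled up into a health figure for the service that users actually see, and the alarm is raised at that level.

```python
# Toy per-node measurements (hypothetical nodes, services and values).
node_metrics = {
    "pc001": {"service": "batch",       "load": 0.95, "up": True},
    "pc002": {"service": "batch",       "load": 0.40, "up": True},
    "pc003": {"service": "interactive", "load": 0.10, "up": False},
    "pc004": {"service": "interactive", "load": 0.20, "up": True},
}

def service_view(metrics):
    """Roll node-level data up into a per-service view: the unit users care about."""
    view = {}
    for node, m in metrics.items():
        s = view.setdefault(m["service"], {"nodes": 0, "up": 0, "load": 0.0})
        s["nodes"] += 1
        s["up"] += m["up"]
        s["load"] += m["load"]
    for s in view.values():
        s["availability"] = s["up"] / s["nodes"]
        s["mean_load"] = s["load"] / s["nodes"]
    return view

for service, s in sorted(service_view(node_metrics).items()):
    # The alarm names the degraded service, not the individual broken box.
    status = "OK" if s["availability"] >= 0.75 else "DEGRADED"
    print(f"{service}: {status}, availability {s['availability']:.0%}, mean load {s['mean_load']:.2f}")
```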
Relevance to the GRID?
Scalable solutions are needed even in the absence of the GRID!
For the GRID to work it must be presented with information and opportunities: coordinated and efficiently run centres, presentable as a guaranteed-quality resource
'GRID'ification: the interfaces
Mgmt Tasks: A GRID centre
GRID-enable: support external requests as services
Publication: coordinated + 'map'able (see the sketch below)
Security: authentication / authorisation; Policies: allocation / priorities / estimation / cost
Scheduling, reservation, change management
Guarantees: resource availability / QoS
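What 'publication' might look like in practice is sketched below with an invented schema (the field names and site name are ours, not a DataGRID specification): the centre exports a machine-readable summary of its capacity, free resources and policies that an external GRID scheduler could use for allocation, estimation and cost decisions.

```python
import json
from datetime import datetime, timezone

def publish_site_info():
    """Export a machine-readable snapshot of the centre's resources (toy schema)."""
    info = {
        "site": "example-tier1",                         # hypothetical site name
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "cpu_total_si95": 10_000,
        "cpu_free_si95": 2_300,
        "disk_total_tb": 30,
        "disk_free_tb": 7.5,
        "queues": [
            {"name": "short", "max_walltime_h": 4,  "priority": "high"},
            {"name": "long",  "max_walltime_h": 48, "priority": "normal"},
        ],
    }
    return json.dumps(info, indent=2)

# An external scheduler would fetch this through whatever information service
# the GRID provides and decide where to place work; here we just print it.
print(publish_site_info())
```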
Existing Solutions?
The world outside is moving fast!!
Dissimilar problems: virtual supercomputers (~200 nodes); MPI, latency, interconnect topology and bandwidth; Roadrunner, LosLobos, Cplant, Beowulf
Similar problems: ISPs / ASPs (~200 nodes); clustering for high availability / mission-critical services
The DataGRID: Fabric Management WP4
WP4 Partners
CERN (CH): Tim Smith
ZIB (D): Alexander Reinefeld
KIP (D): Volker Lindenstruth
NIKHEF (NL): Kors Bos
INFN (I): Michele Michelotto
RAL (UK): Andrew Sansum
IN2P3 (Fr): Denis Linglin
Concluding Remarks
Years of experience in exploiting inexpensive mass-market components
But we need to marry these with inexpensive, highly scalable management tools
Build the components back together as a resource for the GRID