How Does The ATLAS Experiment Manage Petabytes of Data ?
Vincent Garonne (University of Oslo)
The Changing Scale of Particle Physics
Event Collection in ATLAS
What is the data?
● C++ objects representing tracks, parts of the detector, etc., saved in files. Some geometry information in databases
● Data is reconstructed and reduced through various formats
○ RAW → ESD → AOD → NTUP
Figure from http://cerncourier.com/cws/article/cnl/34054
Do everything at CERN?
● All this requires (just for ATLAS)
○ 250,000 CPUs constantly processing data
○ Storing tens of petabytes (millions of GB) of data per year
● CERN cannot physically handle this
⇨ Distributed (Grid) Computing
Data Processing & Management Tools
● At the time of inception, there was no global/commercial solution for the distributed computing needed for our ‘Big Data’ handling
● ATLAS developed its own tools
○ PanDA (Production and Distributed Analysis) for job management
○ DDM (Distributed Data Management)
DDM in a Nutshell
● The Distributed Data Management project is charged with managing all ATLAS data
● Its purpose is to help the ATLAS collaboration store, manage and process LHC data in a heterogeneous distributed environment
● Requirements:
○ Discover data
○ Transfer data to/from sites
○ Delete data from sites
○ Ensure data consistency at sites
○ Enforce the ATLAS computing model
➥ The current DDM system relies on the Rucio software project, developed during Long Shutdown 1 to address the challenges of Run 2
ATLAS DDM: The Scale
The ATLAS DDM system has demonstrated very large scale data management.
Total:
● 1B file replicas
● 255 PB on 130 sites
Transfers:
● 40M files/month
● 40 PB/month
Download:
● 150M files/month
● 50 PB/month
Deletion:
● 100M files/month
● 40 PB/month
[Timeline: the DQ2 system served Run-1; Rucio took over during LS-1 for Run-2]
Big Data?
Illustration by Sandbox Studio, Chicago. Taken from http://www.symmetrymagazine.org/image/august-2012-big-data
WIRED.com © 2014 Condé Nast. Taken from http://www.wired.com/2013/04/bigdata/
DDM/Rucio - Data Concepts
● At the heart of everything is a file, of course
● Files are grouped into datasets
● Datasets/containers are grouped into containers
➥ Note in particular that Rucio does not organise data in a file-system-like way
● The namespace is split into several sub-spaces, e.g.,
○ User: user.jdoe:004406.EXT0.00011.root
○ Group: group.phys-higgs:08.physicsD3PDSlimmed.root
○ Detector: data11_7TeV:AOD.491965._0042.pool.root.1
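The sub-space examples above all share the `scope:name` shape. A minimal sketch of that naming and of the file → dataset → container grouping, using plain Python structures (the dict-based dataset/container model here is illustrative, not Rucio's actual data model):

```python
# Minimal sketch of Rucio-style data identifiers and grouping.
# The "scope:name" format is from the slide examples; the dataset/
# container dicts below are illustrative only.

def parse_did(did):
    """Split a data identifier string into its (scope, name) pair."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError("identifier must have the form 'scope:name'")
    return scope, name

# Files are grouped into datasets; datasets (and containers) into containers.
dataset = {"did": "user.jdoe:my.dataset",
           "files": ["user.jdoe:004406.EXT0.00011.root"]}
container = {"did": "user.jdoe:my.container",
             "children": [dataset["did"]]}

scope, name = parse_did("data11_7TeV:AOD.491965._0042.pool.root.1")
```

Note that nothing here is a directory path: grouping is by membership, not by a file-system hierarchy, which matches the slide's point.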
Rucio Storage Element (RSE)
● Software abstraction for a storage end-point
● RSEs can be grouped in many logical ways by tagging
○ e.g., Tier-1, T2D
● A tag can be used to refer to a list of RSEs
● A deterministic mapping between the logical file name and its path name removes the need for external file catalog lookups
● Rucio has a plug-in based architecture supporting multiple protocols
○ e.g., POSIX, SRM, HTTP(S), S3, FTP, GridFTP, etc.
○ Cf. pie chart for ATLAS protocols
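The deterministic mapping mentioned above can be sketched as a pure function of scope and name: any client can recompute a file's path without consulting a catalog. The md5-prefix layout below follows the scheme commonly used by Rucio's deterministic RSEs, but treat the details as illustrative:

```python
import hashlib

# Sketch: derive a storage path deterministically from the logical
# file name, removing the need for an external file catalog lookup.

def deterministic_path(scope, name):
    """Compute a storage path from (scope, name) alone."""
    h = hashlib.md5(f"{scope}:{name}".encode()).hexdigest()
    # Two levels of hash-prefix directories spread files evenly
    # across the storage namespace.
    return f"{scope}/{h[0:2]}/{h[2:4]}/{name}"

path = deterministic_path("user.jdoe", "004406.EXT0.00011.root")
```

Because the function is deterministic, every site and client derives the same path independently, which is exactly what makes the catalog lookup unnecessary.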
Object Store Support
Rucio can use an object store as a standard storage endpoint:
● BNL (Ceph), Lancaster (Ceph), RAL (Ceph), CERN (Ceph), MWT2 (Ceph)
Two use cases are supported in production:
● Log files: upload/download are transparently supported in the Rucio clients
● ATLAS Event Service: deletion
○ 300k events deleted per day (> 75 Hz); 35M objects deleted, ~22 TB
Replica Management
● Replica management is based on replication rules to optimize storage space, minimize the number of transfers and automate data distribution
○ A replication rule defines the minimal number of replicas to be stored on a set of Rucio Storage Elements
■ E.g., 2 replicas of file user.jdoe:file_001 on any Tier 1
○ Multiple file ownership is provided by replication rules
○ Global quotas are defined against a set of logical locations
● Subscriptions offer the on-demand creation of replication rules on new/existing data based on pattern/metadata
○ E.g., data12_8TeV*physics_Egamma*AOD on T1s
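The rule example above ("2 replicas on any Tier 1") can be sketched as a tiny resolver: given a replica count and an RSE tag, pick enough matching storage elements. Rucio's real rule engine also handles weights, quotas and rule lifetimes; every name below is illustrative:

```python
import random

# Toy sketch of replication-rule resolution: choose enough RSEs
# matching a tag to satisfy the requested number of replicas.
# RSE names and tags below are made up for illustration.

RSES = {
    "BNL-DATADISK":    {"tags": {"Tier-1"}},
    "TRIUMF-DATADISK": {"tags": {"Tier-1"}},
    "DESY-DATADISK":   {"tags": {"Tier-2"}},
}

def resolve_rule(copies, tag, rses=RSES):
    """Return RSEs chosen to hold the requested number of replicas."""
    candidates = [r for r, info in rses.items() if tag in info["tags"]]
    if len(candidates) < copies:
        raise ValueError("not enough RSEs match the rule")
    return random.sample(candidates, copies)

# "2 replicas of user.jdoe:file_001 on any Tier 1"
placement = resolve_rule(copies=2, tag="Tier-1")
```

A subscription, in these terms, would just be a stored pattern that creates such rules automatically whenever newly registered data matches it.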
Rucio - DDM SW Stack Overview
Strong use of open and standard technologies:
• RESTful APIs
• WSGI server
• HTTP caching
• Token-based authentication
• ...
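The RESTful, token-authenticated style listed above means a client interaction is just an HTTP request carrying a token header. A sketch with the standard library (the endpoint path is made up; the `X-Rucio-Auth-Token` header follows Rucio's documented convention, but this is not the official client):

```python
import urllib.request

# Sketch of a client request against a token-authenticated REST API
# in the style of Rucio's. Illustrative only: the URL and path are
# placeholders, not a real Rucio endpoint.

def build_request(base_url, path, token):
    """Build an authenticated GET request for a REST endpoint."""
    req = urllib.request.Request(f"{base_url}{path}", method="GET")
    req.add_header("X-Rucio-Auth-Token", token)
    req.add_header("Accept", "application/json")
    return req

req = build_request("https://rucio.example.org", "/dids/user.jdoe/", "TOKEN123")
```

Because the API is plain HTTP, standard infrastructure (WSGI servers, HTTP caches, load balancers) can sit in front of it unchanged, which is the point of the "open and standard technologies" choice.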
Daemons
● The Rucio daemons are active components that orchestrate the collaborative work of the whole system:
○ Conveyor - file transfers
○ Reaper - deletion
○ Undertaker - expired datasets
○ Transmogrifier - subscriptions
○ Judge - replication rule engine
○ Hermes - messaging
○ Auditor - consistency
○ Rebalancing
○ ...
● The daemons are lightweight, thread-safe and scale horizontally
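The "lightweight, thread-safe, scales horizontally" claim boils down to a familiar pattern: stateless workers draining a shared work queue, so capacity grows by starting more of them. A minimal sketch (all names illustrative, not Rucio code):

```python
import queue
import threading

# Sketch of the daemon pattern: lightweight workers sharing one
# thread-safe queue; scaling out means starting more workers.

def daemon_loop(work, results, stop):
    """Process items from `work` until the stop event is set."""
    while not stop.is_set():
        try:
            item = work.get(timeout=0.1)
        except queue.Empty:
            continue
        results.append(f"processed:{item}")  # stand-in for a transfer/deletion
        work.task_done()

work, results, stop = queue.Queue(), [], threading.Event()
for f in ["file_001", "file_002", "file_003"]:
    work.put(f)

# Two horizontally scaled workers sharing one queue.
threads = [threading.Thread(target=daemon_loop, args=(work, results, stop))
           for _ in range(2)]
for t in threads:
    t.start()
work.join()   # block until every queued item has been processed
stop.set()
for t in threads:
    t.join()
```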
Persistence
● Rucio has a flexible design with no dependence on a particular RDBMS implementation, thanks to the use of an open source Object Relational Mapper (ORM)
○ Relational database management system
■ Use cases: real-time data, transactional
■ Rucio supports Oracle, PostgreSQL, etc.
■ In production with Oracle for ATLAS
○ Non-relational structured storage
■ Use cases: popularity, accounting, log analysis, ML
■ Hadoop, Elasticsearch, Spark
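The ORM point in miniature: application code talks to a small data-access layer, so the backend (Oracle in production, PostgreSQL elsewhere) can be swapped without touching business logic. This sketch uses Python's built-in sqlite3 purely for illustration; the table and class names are made up:

```python
import sqlite3

# Illustration of backend-independent persistence: business logic
# calls ReplicaStore, never raw SQL for a specific vendor. A real
# ORM generates the vendor-specific SQL; here sqlite3 stands in.

class ReplicaStore:
    """Tiny data-access layer hiding the SQL backend."""
    def __init__(self, url=":memory:"):
        self.db = sqlite3.connect(url)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS replicas "
            "(scope TEXT, name TEXT, rse TEXT)")

    def add(self, scope, name, rse):
        with self.db:  # commit on success, roll back on error
            self.db.execute("INSERT INTO replicas VALUES (?, ?, ?)",
                            (scope, name, rse))

    def list_replicas(self, scope, name):
        rows = self.db.execute(
            "SELECT rse FROM replicas WHERE scope = ? AND name = ?",
            (scope, name))
        return [r[0] for r in rows]

store = ReplicaStore()
store.add("user.jdoe", "file_001", "BNL-DATADISK")
store.add("user.jdoe", "file_001", "TRIUMF-DATADISK")
rses = store.list_replicas("user.jdoe", "file_001")
```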
It’s a lot of data!
● Average ATLAS traffic: 10 GB/s, i.e. ~1 PB/day
● ~200 PB total
● … and getting more data than expected!
Run-1 vs. Run-2
Rucio is scalable, robust and reliable. It keeps up nicely with the load increase:
● Transfers
○ 2M transfers/day
○ Equivalent to Run-1, but with bigger files and peaks at 40 GB/s
○ More load/hotspots on the network
● Deletion
○ 8M deleted files/day
○ A factor 4 increase since Run-1
○ Much more pressure on disk space
[Chart: Run-2 storage volumes across custodial (tape), disk-resident and disk-cache data: 200 PB, +50 PB, 1 PB]
⇨ Focus on optimization, scalability and automation
Disk Usage: Improvements
[Plots: never-accessed data by age, 2014 vs. 2016; data older than one year reduced to 1.5 PB]
Thanks to:
● strategies to keep recent and popular data on disk (LRU deletion) and to avoid data duplication
● the lifetime model
● ATLAS policies/actions
More automation in place:
● Rucio Auditor - consistency
● Dynamic Data Placement daemon
(Automatic) Rebalancing
● When a T1 disk runs full and there is nothing to delete, DDM Operations rebalances the data to other T1 disks to allow more activities
○ A very manual and cumbersome task: select the data with the least impact on users while respecting replication policies
○ An ideal candidate for automation!
● Automatic rebalancing has been applied on T1 datadisks for a few months
○ Data selection is based on policies, age, etc.
● One interesting metric to monitor is the ratio of resident to cache data
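The rebalancing decision can be sketched as: find over-full RSEs, pick the oldest movable datasets, and send them to the least-used destination. Thresholds, fields and the selection policy below are illustrative; the real daemon also weighs replication policies and user impact:

```python
# Toy sketch of automatic rebalancing candidate selection.
# All names, fields and the 90% threshold are assumptions.

FULL_THRESHOLD = 0.90  # assumed: rebalance when an RSE is >90% used

def pick_rebalancing_moves(rses, datasets, bytes_to_free):
    """Select (dataset, source, destination) moves for over-full RSEs."""
    moves, freed = [], 0
    sources = [r for r, u in rses.items()
               if u["used"] / u["total"] > FULL_THRESHOLD]
    # Prefer the emptiest destination.
    targets = sorted((r for r in rses if r not in sources),
                     key=lambda r: rses[r]["used"] / rses[r]["total"])
    # Move the oldest data first (least impact on users, by assumption).
    for ds in sorted(datasets, key=lambda d: d["age_days"], reverse=True):
        if freed >= bytes_to_free or not targets:
            break
        if ds["rse"] in sources:
            moves.append((ds["name"], ds["rse"], targets[0]))
            freed += ds["bytes"]
    return moves

rses = {"RSE-A": {"used": 95, "total": 100},
        "RSE-B": {"used": 40, "total": 100}}
datasets = [{"name": "data15.AOD", "rse": "RSE-A", "bytes": 5, "age_days": 400},
            {"name": "data16.AOD", "rse": "RSE-A", "bytes": 5, "age_days": 100}]
moves = pick_rebalancing_moves(rses, datasets, bytes_to_free=5)
```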
[Plots: comparison of the resident/cache ratio per T1 datadisk, a year ago vs. now; the ratio can be quite unbalanced]
[Plot: global ratio of resident to cache data over time for T1 datadisks]
Computing Model Evolution
Initial model:
● 1 Tier 0 (CERN) for data processing
● 11 Tier 1s for simulation & reprocessing
● ~130 Tier 2s for simulation & user analysis
Current model:
● The network proved better than anyone imagined
● Any job can run anywhere
Network Evolution
● We’ll be more and more reliant on the network as our foundation
● It’s coherent with our approach to use it more and more
○ E.g., remote I/O
● By 2020, 800 Gbps waves should be possible, but not from everywhere…
Network Use in ATLAS
[Plot: monthly transfer volume grown from 12 PB to 40 PB, roughly a factor 4]
40 PB/month ≈ 14 GB/s (100+ Gbit/s)
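A quick sanity check of that conversion, assuming a 31-day month and decimal petabytes:

```python
# 40 PB/month expressed as a sustained transfer rate.
PB = 10**15
seconds_per_month = 31 * 24 * 3600

rate_gb_s = 40 * PB / seconds_per_month / 10**9   # gigabytes per second
rate_gbit_s = rate_gb_s * 8                       # gigabits per second
# rate_gb_s comes out just under 15 GB/s, i.e. well over 100 Gbit/s,
# consistent with the "~14 GB/s (100 Gbit+)" figure on the slide.
```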
Current Challenges
● New trends in data management
○ The original model was based on the network being the weak point
○ But the network has proven to be cheaper and better than expected
○ Break the rigid hierarchical model of data flow and of sending jobs to data
■ Dynamic data placement
■ Remote data access over the wide area network
● Event-level workflows instead of file-level
● Computing architecture changes
○ More cores per CPU, less memory, HPCs
● Need more CPU and disk, but with a flat budget → opportunistic resources
○ High Performance Computing (supercomputers)
○ Volunteer Computing (general public): http://atlasathome.cern.ch
LHC Upgrade Timeline
ATLAS Resource Needs at T1s & T2s
[Chart: projected ATLAS storage needs from Run-1 through Run-4 (2018, 2020, 2022, 2024), rising through 300, 400, 500 and 600 PB. Exascale for Run-4!]
Will it scale? Initial studies of computing for HL-LHC (S. Campana)
Rucio DB: Nominal Load on Oracle
Monitoring per Rucio component (module, client_info, action) shows an I/O-bound application.
Throughput:
● 20 MB/s
● 900 I/O ops/s
Top activities:
● Insert dataset content
● Listing dataset content and replicas
● Dataset content deletion
Rucio DB: Scaling Tests - Load Increase
● x1 load: 20 MB/s, 900 I/O ops/s
● x2 load: 30 MB/s, 2,000 I/O ops/s
● x4 load: 40 MB/s, 3,200 I/O ops/s
[Panels: latency, CPU, user I/O and commit time per load level]
● CPU usage is stable
● I/O and latency increase
About the code - gitlab.cern.ch/rucio01/rucio*
● Rucio exploits commonalities between experiments and other data-intensive sciences to address HEP experiments’ needs and scaling requirements
● One of our goals is to build a broader support community
● Other experiments have their own Rucio instance in production
○ AMS: MySQL backend, ~2M files, ~10 RSEs -- TW collaboration
○ Xenon1t: MariaDB backend, ~500K files, ~10 RSEs
○ ?
* GitHub mirror: https://github.com/rucio01/rucio
gitlab.cern.ch/rucio01/rucio: Statistics
http://rucio.cern.ch/
* Mirror: https://github.com/rucio01/rucio
https://pypi.python.org/pypi/rucio
https://pypi.python.org/pypi/rucio-clients
https://pypi.python.org/pypi/rucio-webui
Summary
● Distributed computing is a vital part of LHC physics
● Many interesting challenges ahead!
● For scalability, it will follow the advances in hardware and in open & standard 'big data' technologies
● One current effort is to provide common solutions to the scientific community
Job Stats