How Does The ATLAS Experiment Manage Petabytes of Data ?
Vincent Garonne (University of Oslo)
The Changing Scale of Particle Physics
Event Collection in ATLAS
What is the data?
● C++ objects representing tracks, parts of the detector, etc., saved in files. Some geometry information in databases
● Data is reconstructed and reduced through various formats
○ RAW → ESD → AOD → NTUP
Figure from http://cerncourier.com/cws/article/cnl/34054
Do everything at CERN?
● All this requires (just for ATLAS)
○ 250,000 CPUs constantly processing data
○ Storing tens of petabytes (millions of GB) of data per year
● CERN cannot physically handle this
⇨ Distributed (Grid) Computing
Data Processing & Management Tools
● At the time of inception, there was no global/commercial solution for the distributed computing needed for our ‘Big Data’ handling
● ATLAS developed its own tools
○ PanDA (Production and Distributed Analysis) for job management
○ DDM (Distributed Data Management)
DDM in a Nutshell
● The Distributed Data Management project is charged with managing all ATLAS data
● Its purpose is to help the ATLAS collaboration store, manage and process LHC data in a heterogeneous distributed environment
● Requirements:
○ Discover data
○ Transfer data to/from sites
○ Delete data from sites
○ Ensure data consistency at sites
○ Enforce the ATLAS computing model
➥ The current DDM system relies on the Rucio software project, developed during Long Shutdown 1 to address the challenges of Run 2
ATLAS DDM: The Scale
The ATLAS DDM system has demonstrated very large scale data management.
Total:
● 1B file replicas
● 255 PB on 130 sites
Transfers:
● 40M files/month
● 40 PB/month
Download:
● 150M files/month
● 50 PB/month
Deletion:
● 100M files/month
● 40 PB/month
[Timeline: the DQ2 system served Run-1; Rucio took over during LS-1 for Run-2]
Big Data?
Illustration by Sandbox Studio, Chicago. Taken from http://www.symmetrymagazine.org/image/august-2012-big-data
WIRED.com © 2014 Condé Nast. Taken from http://www.wired.com/2013/04/bigdata/
DDM/Rucio - Data Concepts
● At the heart of everything is a file, of course
● Files are grouped into datasets
● Datasets/containers are grouped into containers
➥ Note in particular that Rucio does not organise data in a file-system-like way
● The namespace is split into several sub-spaces, e.g.,
○ User: user.jdoe:004406.EXT0.00011.root
○ Group: group.phys-higgs:08.physicsD3PDSlimmed.root
○ Detector: data11_7TeV:AOD.491965._0042.pool.root.1
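The sub-space examples above all share the `scope:name` shape. A minimal sketch of that naming and of the file → dataset → container grouping, using plain Python structures (the dict-based dataset/container model here is illustrative, not Rucio's actual data model):

```python
# Minimal sketch of Rucio-style data identifiers and grouping.
# The "scope:name" format is from the slide examples; the dataset/
# container dicts below are illustrative only.

def parse_did(did):
    """Split a data identifier string into its (scope, name) pair."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError("identifier must have the form 'scope:name'")
    return scope, name

# Files are grouped into datasets; datasets (and containers) into containers.
dataset = {"did": "user.jdoe:my.dataset",
           "files": ["user.jdoe:004406.EXT0.00011.root"]}
container = {"did": "user.jdoe:my.container",
             "children": [dataset["did"]]}

scope, name = parse_did("data11_7TeV:AOD.491965._0042.pool.root.1")
```

Note that nothing here is a directory path: grouping is by membership, not by a file-system hierarchy, which matches the slide's point.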
Rucio Storage Element (RSE)
● Software abstraction for a storage end-point
● RSEs can be grouped in many logical ways by tagging
○ e.g., Tier-1, T2D
● A tag can be used to refer to a list of RSEs
● A deterministic mapping between the logical file name and its path name removes the need for external file catalog lookups
● Rucio has a plug-in based architecture supporting multiple protocols
○ e.g., POSIX, SRM, HTTP(S), S3, FTP, GridFTP, etc.
○ Cf. pie chart for ATLAS protocols
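The deterministic mapping mentioned above can be sketched as a pure function of scope and name: any client can recompute a file's path without consulting a catalog. The md5-prefix layout below follows the scheme commonly used by Rucio's deterministic RSEs, but treat the details as illustrative:

```python
import hashlib

# Sketch: derive a storage path deterministically from the logical
# file name, removing the need for an external file catalog lookup.

def deterministic_path(scope, name):
    """Compute a storage path from (scope, name) alone."""
    h = hashlib.md5(f"{scope}:{name}".encode()).hexdigest()
    # Two levels of hash-prefix directories spread files evenly
    # across the storage namespace.
    return f"{scope}/{h[0:2]}/{h[2:4]}/{name}"

path = deterministic_path("user.jdoe", "004406.EXT0.00011.root")
```

Because the function is deterministic, every site and client derives the same path independently, which is exactly what makes the catalog lookup unnecessary.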
Object Store Support
Rucio can use an object store as a standard storage endpoint:
● BNL (Ceph), Lancaster (Ceph), RAL (Ceph), CERN (Ceph), MWT2 (Ceph)
Two use cases are supported in production:
● Log files: upload/download are transparently supported in the Rucio clients
● ATLAS Event Service: deletion
○ 300k events deleted per day (> 75 Hz); 35M objects deleted, ~22 TB
Replica Management
● Replica management is based on replication rules to optimize storage space, minimize the number of transfers and automate data distribution
○ A replication rule defines the minimal number of replicas to be stored on a set of Rucio Storage Elements
■ E.g., 2 replicas of file user.jdoe:file_001 on any Tier 1
○ Multiple file ownership is provided by replication rules
○ Global quotas are defined against a set of logical locations
● Subscriptions offer the on-demand creation of replication rules on new/existing data based on pattern/metadata
○ E.g., data12_8TeV*physics_Egamma*AOD on T1s
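The rule example above ("2 replicas on any Tier 1") can be sketched as a tiny resolver: given a replica count and an RSE tag, pick enough matching storage elements. Rucio's real rule engine also handles weights, quotas and rule lifetimes; every name below is illustrative:

```python
import random

# Toy sketch of replication-rule resolution: choose enough RSEs
# matching a tag to satisfy the requested number of replicas.
# RSE names and tags below are made up for illustration.

RSES = {
    "BNL-DATADISK":    {"tags": {"Tier-1"}},
    "TRIUMF-DATADISK": {"tags": {"Tier-1"}},
    "DESY-DATADISK":   {"tags": {"Tier-2"}},
}

def resolve_rule(copies, tag, rses=RSES):
    """Return RSEs chosen to hold the requested number of replicas."""
    candidates = [r for r, info in rses.items() if tag in info["tags"]]
    if len(candidates) < copies:
        raise ValueError("not enough RSEs match the rule")
    return random.sample(candidates, copies)

# "2 replicas of user.jdoe:file_001 on any Tier 1"
placement = resolve_rule(copies=2, tag="Tier-1")
```

A subscription, in these terms, would just be a stored pattern that creates such rules automatically whenever newly registered data matches it.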
Rucio - DDM SW Stack Overview
Strong use of open and standard technologies:
• RESTful APIs
• WSGI server
• HTTP caching
• Token-based authentication
• ...
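The RESTful, token-authenticated style listed above means a client interaction is just an HTTP request carrying a token header. A sketch with the standard library (the endpoint path is made up; the `X-Rucio-Auth-Token` header follows Rucio's documented convention, but this is not the official client):

```python
import urllib.request

# Sketch of a client request against a token-authenticated REST API
# in the style of Rucio's. Illustrative only: the URL and path are
# placeholders, not a real Rucio endpoint.

def build_request(base_url, path, token):
    """Build an authenticated GET request for a REST endpoint."""
    req = urllib.request.Request(f"{base_url}{path}", method="GET")
    req.add_header("X-Rucio-Auth-Token", token)
    req.add_header("Accept", "application/json")
    return req

req = build_request("https://rucio.example.org", "/dids/user.jdoe/", "TOKEN123")
```

Because the API is plain HTTP, standard infrastructure (WSGI servers, HTTP caches, load balancers) can sit in front of it unchanged, which is the point of the "open and standard technologies" choice.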
Daemons
● The Rucio daemons are active components that orchestrate the collaborative work of the whole system:
○ Conveyor - file transfers
○ Reaper - deletion
○ Undertaker - expired datasets
○ Transmogrifier - subscriptions
○ Judge - replication rule engine
○ Hermes - messaging
○ Auditor - consistency
○ Rebalancing
○ ...
● The daemons are lightweight, thread-safe and scale horizontally
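The "lightweight, thread-safe, scales horizontally" claim boils down to a familiar pattern: stateless workers draining a shared work queue, so capacity grows by starting more of them. A minimal sketch (all names illustrative, not Rucio code):

```python
import queue
import threading

# Sketch of the daemon pattern: lightweight workers sharing one
# thread-safe queue; scaling out means starting more workers.

def daemon_loop(work, results, stop):
    """Process items from `work` until the stop event is set."""
    while not stop.is_set():
        try:
            item = work.get(timeout=0.1)
        except queue.Empty:
            continue
        results.append(f"processed:{item}")  # stand-in for a transfer/deletion
        work.task_done()

work, results, stop = queue.Queue(), [], threading.Event()
for f in ["file_001", "file_002", "file_003"]:
    work.put(f)

# Two horizontally scaled workers sharing one queue.
threads = [threading.Thread(target=daemon_loop, args=(work, results, stop))
           for _ in range(2)]
for t in threads:
    t.start()
work.join()   # block until every queued item has been processed
stop.set()
for t in threads:
    t.join()
```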
Persistence
● Rucio has a flexible design with no dependence on a particular RDBMS implementation, thanks to the use of an open source Object Relational Mapper (ORM)
○ Relational database management system
■ Use cases: real-time data, transactional
■ Rucio supports Oracle, PostgreSQL, etc.
■ In production with Oracle for ATLAS
○ Non-relational structured storage
■ Use cases: popularity, accounting, log analysis, ML
■ Hadoop, Elasticsearch, Spark
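The ORM point in miniature: application code talks to a small data-access layer, so the backend (Oracle in production, PostgreSQL elsewhere) can be swapped without touching business logic. This sketch uses Python's built-in sqlite3 purely for illustration; the table and class names are made up:

```python
import sqlite3

# Illustration of backend-independent persistence: business logic
# calls ReplicaStore, never raw SQL for a specific vendor. A real
# ORM generates the vendor-specific SQL; here sqlite3 stands in.

class ReplicaStore:
    """Tiny data-access layer hiding the SQL backend."""
    def __init__(self, url=":memory:"):
        self.db = sqlite3.connect(url)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS replicas "
            "(scope TEXT, name TEXT, rse TEXT)")

    def add(self, scope, name, rse):
        with self.db:  # commit on success, roll back on error
            self.db.execute("INSERT INTO replicas VALUES (?, ?, ?)",
                            (scope, name, rse))

    def list_replicas(self, scope, name):
        rows = self.db.execute(
            "SELECT rse FROM replicas WHERE scope = ? AND name = ?",
            (scope, name))
        return [r[0] for r in rows]

store = ReplicaStore()
store.add("user.jdoe", "file_001", "BNL-DATADISK")
store.add("user.jdoe", "file_001", "TRIUMF-DATADISK")
rses = store.list_replicas("user.jdoe", "file_001")
```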
It’s a lot of data!
● Average ATLAS traffic: 10 GB/s, i.e. ~1 PB/day
● ~200 PB total
● … and getting more data than expected!
Run-1 vs. Run-2
Rucio is scalable, robust and reliable. It keeps up nicely with the load increase:
● Transfers
○ 2M transfers/day
○ Equivalent to Run-1, but with bigger files and peaks at 40 GB/s
○ More load/hotspots on the network
● Deletion
○ 8M deleted files/day
○ A factor 4 increase since Run-1
○ Much more pressure on disk space
[Chart: Run-2 storage volumes across custodial (tape), disk-resident and disk-cache data: 200 PB, +50 PB, 1 PB]
⇨ Focus on optimization, scalability and automation
Disk Usage: Improvements
[Plots: never-accessed data by age, 2014 vs. 2016; data older than one year reduced to 1.5 PB]
Thanks to:
● strategies to keep recent and popular data on disk (LRU deletion) and to avoid data duplication
● the lifetime model
● ATLAS policies/actions
More automation in place:
● Rucio Auditor - consistency
● Dynamic Data Placement daemon
(Automatic) Rebalancing
● When a T1 disk runs full and there is nothing to delete, DDM Operations rebalances the data to other T1 disks to allow more activities
○ A very manual and cumbersome task: select the data with the least impact on users while respecting replication policies
○ An ideal candidate for automation!
● Automatic rebalancing has been applied on T1 datadisks for a few months
○ Data selection is based on policies, age, etc.
● One interesting metric to monitor is the ratio of resident to cache data
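The rebalancing decision can be sketched as: find over-full RSEs, pick the oldest movable datasets, and send them to the least-used destination. Thresholds, fields and the selection policy below are illustrative; the real daemon also weighs replication policies and user impact:

```python
# Toy sketch of automatic rebalancing candidate selection.
# All names, fields and the 90% threshold are assumptions.

FULL_THRESHOLD = 0.90  # assumed: rebalance when an RSE is >90% used

def pick_rebalancing_moves(rses, datasets, bytes_to_free):
    """Select (dataset, source, destination) moves for over-full RSEs."""
    moves, freed = [], 0
    sources = [r for r, u in rses.items()
               if u["used"] / u["total"] > FULL_THRESHOLD]
    # Prefer the emptiest destination.
    targets = sorted((r for r in rses if r not in sources),
                     key=lambda r: rses[r]["used"] / rses[r]["total"])
    # Move the oldest data first (least impact on users, by assumption).
    for ds in sorted(datasets, key=lambda d: d["age_days"], reverse=True):
        if freed >= bytes_to_free or not targets:
            break
        if ds["rse"] in sources:
            moves.append((ds["name"], ds["rse"], targets[0]))
            freed += ds["bytes"]
    return moves

rses = {"RSE-A": {"used": 95, "total": 100},
        "RSE-B": {"used": 40, "total": 100}}
datasets = [{"name": "data15.AOD", "rse": "RSE-A", "bytes": 5, "age_days": 400},
            {"name": "data16.AOD", "rse": "RSE-A", "bytes": 5, "age_days": 100}]
moves = pick_rebalancing_moves(rses, datasets, bytes_to_free=5)
```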
[Plots: comparison of the resident/cache ratio per T1 datadisk, a year ago vs. now; the ratio can be quite unbalanced]
[Plot: global ratio of resident to cache data over time for T1 datadisks]
Computing Model Evolution
Initial model:
● 1 Tier 0 (CERN) for data processing
● 11 Tier 1s for simulation & reprocessing
● ~130 Tier 2s for simulation & user analysis
Current model:
● The network proved better than anyone imagined
● Any job can run anywhere
Network Evolution
● We’ll be more and more reliant on the network as our foundation
● It’s coherent with our approach to use it more and more
○ E.g., remote I/O
● By 2020, 800 Gbps waves should be possible, but not from everywhere…
Network Use in ATLAS
[Plot: monthly transfer volume grown from 12 PB to 40 PB, roughly a factor 4]
40 PB/month ≈ 14 GB/s (100+ Gbit/s)
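A quick sanity check of that conversion, assuming a 31-day month and decimal petabytes:

```python
# 40 PB/month expressed as a sustained transfer rate.
PB = 10**15
seconds_per_month = 31 * 24 * 3600

rate_gb_s = 40 * PB / seconds_per_month / 10**9   # gigabytes per second
rate_gbit_s = rate_gb_s * 8                       # gigabits per second
# rate_gb_s comes out just under 15 GB/s, i.e. well over 100 Gbit/s,
# consistent with the "~14 GB/s (100 Gbit+)" figure on the slide.
```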
Current Challenges
● New trends in data management
○ The original model was based on the network being the weak point
○ But the network has proven to be cheaper and better than expected
○ Break the rigid hierarchical model of data flow and of sending jobs to data
■ Dynamic data placement
■ Remote data access over the wide area network
● Event-level workflows instead of file-level
● Computing architecture changes
○ More cores per CPU, less memory, HPCs
● Need more CPU and disk, but with a flat budget → opportunistic resources
○ High Performance Computing (supercomputers)
○ Volunteer Computing (general public): http://atlasathome.cern.ch
LHC Upgrade Timeline
ATLAS Resource Needs at T1s & T2s
[Chart: projected ATLAS storage needs from Run-1 through Run-4 (2018, 2020, 2022, 2024), rising through 300, 400, 500 and 600 PB. Exascale for Run-4!]
Will it scale? Initial studies of computing for HL-LHC (S. Campana)
Rucio DB: Nominal Load on Oracle
Monitoring per Rucio component (module, client_info, action) shows an I/O-bound application.
Throughput:
● 20 MB/s
● 900 I/O ops/s
Top activities:
● Insert dataset content
● Listing dataset content and replicas
● Dataset content deletion
Rucio DB: Scaling Tests - Load Increase
● x1 load: 20 MB/s, 900 I/O ops/s
● x2 load: 30 MB/s, 2,000 I/O ops/s
● x4 load: 40 MB/s, 3,200 I/O ops/s
[Panels: latency, CPU, user I/O and commit time per load level]
● CPU usage is stable
● I/O and latency increase
About the code - gitlab.cern.ch/rucio01/rucio*
● Rucio exploits commonalities between experiments and other data-intensive sciences to address HEP experiments’ needs and scaling requirements
● One of our goals is to build a broader support community
● Other experiments have their own Rucio instance in production
○ AMS: MySQL backend, ~2M files, ~10 RSEs -- TW collaboration
○ Xenon1t: MariaDB backend, ~500K files, ~10 RSEs
○ ?
* GitHub mirror: https://github.com/rucio01/rucio
gitlab.cern.ch/rucio01/rucio: Statistics
http://rucio.cern.ch/
* Mirror: https://github.com/rucio01/rucio
https://pypi.python.org/pypi/rucio
https://pypi.python.org/pypi/rucio-clients
https://pypi.python.org/pypi/rucio-webui
Summary
● Distributed computing is a vital part of LHC physics
● Many interesting challenges ahead!
● For scalability, it will follow the advances in hardware and in open & standard 'big data' technologies
● One current effort is to provide common solutions to the scientific community
Job Stats