Post on 31-Mar-2015
Building a High Performance Mass Storage System for Tier1 LHC site
Vladimir Sapunenko, INFN-CNAF
GRID’2012, July 16 – 21
Dubna, Russia
Vladimir.Sapunenko@cnaf.infn.it, July 18, 2012
Tier1 site at INFN-CNAF
CNAF is the National Center of INFN (National Institute of Nuclear Physics) for research and development in information technologies applied to high-energy physics experiments.
Operational since 2005
Tier1 at a glance
• All 4 LHC experiments
• 20 HEP, space and astrophysics experiments
• Computation farm
  – 1300 WNs
  – 130K HEP-SPEC
  – 13K job slots
• Storage
  – 10 PB on disk
  – 14 PB on tape
Mass Storage Challenge
• Several petabytes of data (online and near-line) need to be accessible at any time from thousands of concurrent processes
• The aggregate data throughput required, on both the Local Area Network (LAN) and the Wide Area Network (WAN), is of the order of several GB/s
• Long-term transparent archiving of data is needed
• Frequent configuration changes
• Independent experiments (with independent production managers and end users) compete for disk and tape resources
  • Chaotic access can lead to traffic jams, which must be treated as quasi-ordinary situations
What do we need to do to meet that challenge?
• We need a Mass Storage Solution with the following features:
  • Grid-enabled
  • high performance
  • modular
  • stable and robust
  • targeted at large computing centers (such as WLCG Tier-1s)
    • “large” means custodial responsibility for O(10) PB of data
  • simple installation and management
  • 24x7 operation with limited manpower
  • centralized administration
Storage HW
• 10 PB of disk
  – 15 disk arrays (8x EMC CX3-80, 7x DDN S2A 9950)
  – ~130 disk servers
    • 40 with 10Gb/s Ethernet (250-300 TB/server)
    • 90 with 2x1Gb/s Ethernet (50-75 TB/server)
• 14 PB of tape
  – SL8500 tape library (10K slots)
    • 20 T10000B drives (1TB cartridge)
    • 10 T10000C drives (5TB cartridge)
  – 1 TSM server (+1 stand-by)
  – 13 HSM nodes
• ~500 SAN ports (FC4/FC8)
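As a rough sanity check against the "several GB/s" requirement, the nominal Ethernet capacity of the disk servers can be summed up. The sketch below uses only the server counts and link speeds quoted on the hardware slide; the function names are illustrative, and protocol overhead, SAN limits and disk performance are deliberately ignored, so the result is an upper bound rather than a measured figure.

```python
# Back-of-the-envelope check of the disk-server network capacity.
# Server counts and link speeds come from the hardware slide; the names
# below are illustrative only.

SERVER_CLASSES = [
    # (number of servers, links per server, Gb/s per link)
    (40, 1, 10),
    (90, 2, 1),
]


def aggregate_gbit_per_s(classes):
    """Sum the nominal Ethernet line rate over all disk servers."""
    return sum(count * links * speed for count, links, speed in classes)


def gbit_to_gbyte(gbit):
    """Convert Gb/s to GB/s (8 bits per byte, protocol overhead ignored)."""
    return gbit / 8


total_gbit = aggregate_gbit_per_s(SERVER_CLASSES)
total_gbyte = gbit_to_gbyte(total_gbit)
print(f"{total_gbit} Gb/s nominal, about {total_gbyte:.1f} GB/s")
```

The nominal figure is well above the required throughput, which leaves headroom for servers taken out of service.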
ATLAS: outside view
• Disk space used by ATLAS, %
  – INFN’s share is 8%
• Data volume processed in 1 month
  – INFN’s share is 10%
• Average efficiency of successfully completed jobs
  – INFN: second in the global ranking
(data from DQ2 ATLAS accounting)
ATLAS: inside view
• 2.3 PB of disk space
  • 3 DDN S2A9950, 2TB SATA, 8xFC8
  • 8 I/O servers (10Gb/s, 24GB RAM, 2xFC8)
  • 2 metadata servers (1Gb/s, 4GB RAM, 2xFC4)
  • 4 gridFTP servers (10Gb/s, 24GB RAM, 2xFC8)
  • 5 StoRM servers (1Gb/s, 4GB RAM)
  • 2 HSM servers (1Gb/s, 4GB RAM)
[Figure: 1-week statistics in GB/s, to/from LAN (farm) and to/from WAN (gridftp)]
LHCb: CPU used at CERN and Tier-1s in 2012
[Figure: share of CPU used in successful jobs and share of CPU used in failed jobs, by site: CERN, CNAF, GRIDKA, RAL, IN2P3, NIKHEF, PIC, SARA]
CNAF is the first centre after CERN for CPU used, and the last when ranked by the fraction of CPU time wasted by jobs failing for any reason.
The main reason: the stability of the storage system!
LHCb
• 0.76 PB of disk space in a single file system
  • 40TB reserved as tape buffer
  • More space can be used if available
• 1 EMC CX4, 1TB SATA, 8xFC4
• 10 I/O servers (2x1Gb/s, 8GB RAM, 2xFC4)
• 2 metadata servers (1Gb/s, 8GB RAM, 2xFC4)
• 4 gridFTP servers (2x1Gb/s, 8GB RAM, 2xFC4)
• 3 StoRM servers (1Gb/s, 4GB RAM)
• 2 HSM servers (1Gb/s, 4GB RAM)
LHCb data by site
[Figure: LHCb data by site, with CNAF labeled]
ALICE (MonALISA)
• I/O activity on disk: IN 100 MB/s, OUT 2.1 GB/s
• I/O activity on tape buffer: IN 5 MB/s, OUT 800 MB/s
ALICE
• 8 XrootD servers
  – 6 for disk-only
  – 2 for tape buffer
  – 8 cores at 2.2GHz, 10Gb/s, 24GB RAM, 2xFC8
• 2 metadata servers
• Storage
  – DDN S2A 9950
  – 1.3PB net space
• Two GPFS file systems
  – 960TB disk-only
  – 385TB tape buffer
• Manages tape recalls directly from GPFS
  – Custom plug-in to interface XrootD with GEMSS (CNAF’s MSS)
    • modified method XrdxFtsOfsFile::open in the XrootD library
  – By F. Noferini and V. Vagnoni
ALICE: Tape Performance
• ALICE is working hard this week, reading heavily from the tape buffer
[Figure: reads from tapes]
Tier1 Storage Group: Tasks and Staff
• Tasks:
  – Disk storage administration (GPFS, GEMSS)
  – Tape library (ACSLS, TSM)
  – SAN maintenance and administration
  – Server installation and configuration
  – Services (SRM, FTS, DB)
  – Monitoring (of all HW and SW components)
  – Procurement (tender definition)
  – HW life cycle management and 1st level support
• Staff:
  – Just 5 FTE (“Full Time Equivalent”)
• How do we manage all this?
Our approach
• Fault tolerance and redundancy everywhere, but without wasting resources
  – Using “Active-Active” configurations as much as possible
    • the load of failed elements is distributed over the remaining ones (SAN, servers, controllers)
• Monitoring and automated recovery procedures
  – NAGIOS event handlers
• Minimizing the number of managed objects
  – Few but BIG storage systems
  – 10Gb servers
• High level of optimization
  – OS and network tuning
• Test everything before deploying
  – A dedicated cluster with all functionality as a testing facility (testbed)
• Relying on industry standards (GPFS, TSM)
• Reducing complexity
  – TSM rather than HPSS
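To make the "Active-Active" point concrete, here is a minimal sketch of how the load on the surviving elements grows when part of a group fails. It assumes an idealised group in which every member carries the same load and the survivors absorb the lost share evenly; real SAN paths, servers and controllers will not balance this perfectly, and the function name is purely illustrative.

```python
def load_after_failures(total_servers, failed_servers, load_per_server=1.0):
    """Per-server load once the failed servers' share is spread evenly.

    Assumes an idealised active-active group: every server carries the same
    load and the survivors absorb the lost capacity in equal parts.
    """
    survivors = total_servers - failed_servers
    if survivors <= 0:
        raise ValueError("no surviving servers to take over the load")
    return load_per_server * total_servers / survivors


# Example: a pool of 8 I/O servers (the size quoted for ATLAS) losing one member.
extra = load_after_failures(8, 1) - 1.0
print(f"each survivor carries {extra:.1%} more load")
```

With no idle stand-by hardware, a single failure costs a modest increase in load on each survivor, which is the trade-off behind preferring active-active over active-passive setups.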
Software components
• GPFS as clustered parallel file system
• TSM as HSM system
• StoRM as SRM
• GEMSS as interface between StoRM, GPFS and TSM
• NAGIOS for alarms and event handling
• QUATTOR as system configuration manager
• LEMON as monitoring tool
GPFS
• General Parallel File System from IBM
  – Clustered (fault tolerance and redundancy)
  – Parallel (scalability)
  – Widely used in industry (very well documented and supported by the user community and by IBM)
  – Always provides maximum performance (no need to replicate data to increase availability)
  – Runs on AIX, Linux (RH, SL) and Windows
  – Is NOT bound to IBM’s HW!
GPFS (2)
• Advanced high-availability features
  • disruption-free maintenance
  • servers and storage devices can be added or removed while keeping the file systems online
  • when storage is added or removed, the data can be dynamically rebalanced to maintain optimal performance
• Centralized administration
  • cluster-wide operations can be managed from any node in the GPFS cluster
  • easy administration model, consistent with standard UNIX file systems
• Supports standard file system functions
  • user quotas, snapshots, etc.
• Many other features not fitting in two slides…
TSM
• Tivoli Storage Manager (IBM)
  – Very powerful
  – Simple
    • DB (DB2) management hidden from the administrator
  – Built-in HSM functionality
    • Transparent data movement
  – Integrated with GPFS
  – Widely used in industry
    • A lot of experience
    • easy to get technical support either from IBM or from the user community
StoRM: STOrage Resource Manager
StoRM is an SRM implementation developed at INFN-CNAF, designed to leverage the advantages of cluster file systems (like GPFS) and standard POSIX file systems in a Grid environment.
– http://storm.forge.cnaf.infn.it
StoRM provides data management capabilities in a Grid environment to share, access and transfer data among heterogeneous and geographically distributed data centers. It supports direct access (native POSIX I/O calls) to shared files and directories, as well as other standard Grid access protocols. StoRM is adopted in the context of the WLCG computational Grid framework.
A little bit of history
CASTOR was the “traditional” solution for Mass Storage at CNAF for all VOs since 2003
  – Large variety of issues, both at set-up/admin level and at VO level (complexity, scalability, stability, …)
  – Successfully used in production, despite a large operational overhead
In parallel to production, in 2006 we started to search for a potentially more scalable, better-performing and more robust solution
  – Q1 2007: after extensive comparison tests, GPFS was chosen as the only solution for disk-based storage (it had already been in use at CNAF for a long time before these tests)
  – Q2 2007: StoRM (developed at INFN) implements the SRM 2.2 specification
  – Q3-Q4 2007: StoRM/GPFS in production for D1T0 for LHCb and ATLAS
    • Clear benefits for both experiments (significantly reduced load on CASTOR)
  – End 2007: a project started at CNAF to build a complete grid-enabled HSM solution based on StoRM/GPFS/TSM
GEMSS
• Grid Enabled Mass Storage System
  – A full HSM (Hierarchical Storage Management) integration of GPFS, TSM and StoRM
  – Combines GPFS- and TSM-specific features with StoRM to provide a transparent, Grid-friendly HSM solution
    • An interface between GPFS and TSM has been implemented to minimize mechanical operations in the tape robotics (mount/unmount, search/rewind)
    • StoRM has been extended to include the SRM methods required to manage the tapes
• Minimizes management effort and increases reliability
• Very positive experience with scalability so far
• Based on a large GPFS installation in production at CNAF since 2005, with increasing disk space and number of users
GEMSS Development Timeline
Milestones shown along a 2007-2012 axis:
• D1T0 Storage Class implemented with StoRM/GPFS for LHCb and ATLAS
• D1T1 Storage Class implemented with StoRM/GPFS/TSM for LHCb
• D0T1 Storage Class implemented with StoRM/GPFS/TSM for CMS
• GEMSS is used by all LHC and non-LHC experiments in production for all Storage Classes
• Introduced DMAPI server (to support GPFS 3.3/3.4)
ATLAS, ALICE, (CMS) and LHCb experiments, together with all other non-LHC experiments (Argo, Pamela, Virgo, AMS), use GEMSS in production!
Components of GEMSS
Disk-centric system with five building blocks
• GPFS: disk-storage software infrastructure
• TSM: tape management system
• StoRM: SRM service
• TSM-GPFS interface
• Globus GridFTP: WAN data transfers
GEMSS recall system
• The selective recall system in GEMSS uses 4 processes: yamssEnqueueRecall, yamssMonitor, yamssReorderRecall, yamssProcessRecall
• yamssEnqueueRecall and yamssReorderRecall manage a FIFO queue of the files to be recalled: files are fetched from the queue and sorted lists with optimal file ordering are built
• yamssProcessRecall actually creates the recall streams, performs the recalls and manages the error conditions (e.g. retries after file recall failures…)
• yamssMonitor is the supervisor of the reorder and recall phases
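The reorder step can be pictured with a small sketch. The slides do not state the actual ordering criterion used by yamssReorderRecall, so the version below is an assumption: requests are grouped by tape and sorted by position on the tape, which is one natural way to reduce mounts and seeks. The `RecallRequest` type, the tape labels and the positions are all hypothetical.

```python
from collections import OrderedDict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class RecallRequest:
    path: str       # file to bring back to disk
    tape: str       # hypothetical tape label
    position: int   # hypothetical position of the file on that tape


def reorder(fifo):
    """Drain a FIFO of requests into per-tape lists sorted by position.

    Tapes keep the order in which they were first requested, so each tape
    needs to be mounted only once and can be read front to back.
    """
    per_tape = OrderedDict()
    while fifo:
        req = fifo.popleft()
        per_tape.setdefault(req.tape, []).append(req)
    for requests in per_tape.values():
        requests.sort(key=lambda r: r.position)
    return per_tape


queue = deque([
    RecallRequest("/data/a", "T00002", 40),
    RecallRequest("/data/b", "T00001", 7),
    RecallRequest("/data/c", "T00002", 3),
])
for tape, requests in reorder(queue).items():
    print(tape, [r.path for r in requests])
```

Each per-tape list could then be handed to a separate recall stream, so that several drives work in parallel without competing for the same cartridge.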
GEMSS interface
• A set of administrative commands has also been developed (for monitoring, stopping and starting migrations and recalls, performance reporting)
• Almost 50 user interface commands/daemons; some examples:
  – yamssEnqueueRecall (command)
    • Simple command line to enqueue into a FIFO the files to recall from tape
  – yamssLogger (daemon)
    • Centralized logging facility: 3 log files (for migrations, premigrations and recalls) are centralized for each YAMSS-managed file system
  – yamssLs (command)
    • “ls”-like interface that also prints the status of each file: premigrated, migrated, disk-resident
• Shipped as an RPM package for installation/distribution
• Provides several STAT files for accurate statistics, which include
  – File name
  – Time stamp
  – File size
  – Tape label
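The three file states printed by yamssLs follow the usual HSM convention: a disk-resident file exists only on disk, a premigrated file has its data on disk and a copy on tape, and a migrated file has its data only on tape. The sketch below encodes that convention; it is not the yamssLs implementation, and the two boolean inputs are an assumed simplification of whatever GPFS and TSM actually report.

```python
from enum import Enum


class FileState(Enum):
    DISK_RESIDENT = "disk-resident"   # data on disk only
    PREMIGRATED = "premigrated"       # data on disk, copy also on tape
    MIGRATED = "migrated"             # data on tape only


def classify(data_on_disk: bool, copy_on_tape: bool) -> FileState:
    """Map the two (assumed) facts about a file to its HSM state."""
    if data_on_disk and copy_on_tape:
        return FileState.PREMIGRATED
    if data_on_disk:
        return FileState.DISK_RESIDENT
    if copy_on_tape:
        return FileState.MIGRATED
    raise ValueError("file has no data on disk and no copy on tape")


def needs_recall(state: FileState) -> bool:
    """Only migrated files must be read back from tape before use."""
    return state is FileState.MIGRATED


for flags in [(True, False), (True, True), (False, True)]:
    state = classify(*flags)
    print(flags, state.value, "recall needed" if needs_recall(state) else "")
```

A premigrated file can be turned into a migrated one simply by freeing its disk blocks, which is why premigration helps keep the tape buffer from filling up.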
Pre-production tests
• ~24 TB of data moved from tape to disk
• Recalls corresponding to five days of typical usage by a large LHC experiment (namely CMS), compacted into one shot and completed in 19h
• Files were spread over ~100 tapes
• Average throughput: ~400MB/s
• 0 failures
• Up to 6 drives used for recalls
• Simultaneously, up to 3 drives used for migrations of new data files
[Figure: tape recalls at ~400 MB/s, up to ~530 MB/s]
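The quoted volume and duration can be cross-checked against the quoted average rate. The sketch below assumes decimal units (1 TB = 10^6 MB) and treats "~24 TB" and "19h" as exact, which they are not; with those rounded inputs the result lands somewhat below the ~400 MB/s on the slide, so it should be read as an order-of-magnitude check only.

```python
def average_throughput_mb_s(volume_tb, hours):
    """Average rate in MB/s, using decimal units (1 TB = 10**6 MB)."""
    return volume_tb * 1_000_000 / (hours * 3600)


rate = average_throughput_mb_s(24, 19)
# Dividing by the maximum number of recall drives gives a lower bound on
# the per-drive average, since fewer than 6 drives were in use at times.
print(f"average: {rate:.0f} MB/s, per drive with 6 drives: {rate / 6:.0f} MB/s")
```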
GEMSS monitoring
• Integration with NAGIOS for the alert system, notifications and automatic actions (e.g. restarting failed TSM daemons)
• Integration with LEMON monitoring
[Figure: T10KB tape drive (SAN traffic)]
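An automatic restart of this kind is typically wired in as a Nagios event handler, a script that Nagios calls with the service state, the state type and the check attempt number as arguments. The sketch below follows the pattern shown in the Nagios documentation (act on the third SOFT CRITICAL check, or on a HARD CRITICAL state); it is not the handler used at CNAF, and `RESTART_CMD` is a hypothetical placeholder for whatever restarts the daemon on a given site.

```python
import subprocess

# Hypothetical restart command; the real one depends on the site setup.
RESTART_CMD = ["/usr/local/sbin/restart-tsm-daemon"]


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether the handler should act.

    Restart on a HARD CRITICAL state, or on the soft_attempt-th SOFT
    CRITICAL check so the problem may be fixed before it turns HARD.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and int(attempt) == soft_attempt


def handle(state, state_type, attempt, runner=subprocess.run):
    """Run the restart command if needed; return True when it was run."""
    if should_restart(state, state_type, attempt):
        runner(RESTART_CMD, check=False)
        return True
    return False


# Dry run with a stand-in runner so that nothing is actually restarted.
calls = []
handle("CRITICAL", "SOFT", "3", runner=lambda cmd, **kwargs: calls.append(cmd))
print(calls)
```

Passing the runner as a parameter keeps the decision logic testable without touching a live service.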
GEMSS in production
• ~11 PB of data have been migrated to tape since GEMSS entered production
  – (some data were deleted by users, so 8.9 PB are now in use)
ATLAS data re-processing
GPFS <=> TSM traffic
– write: recalls from tape to disk for reprocessing
– read: writes to tape from Tier-0 (raw data flow)
Good performance for simultaneous read/write access
4,20% of total processing activity at T1 (170 TB) in 2011
ATLAS computing activity involving massive data recall from tape
High efficiency (99% successful jobs)
Just a few days to complete
Conclusions
• We implemented a full HSM system based on GPFS and TSM that is able to satisfy the requirements of the WLCG experiments operating at the Large Hadron Collider
• StoRM, the SRM service for GPFS, has been extended to manage tape support
• An interface between GPFS and TSM (GEMSS) was built to perform tape recalls in an optimal order, thereby achieving high performance
• A modification to the XrootD library made it possible to interface XrootD with GEMSS
• GEMSS is the storage solution used in production at our Tier1 as a single integrated system for ALL the LHC and non-LHC experiments
• The recent improvements in GEMSS have increased the reliability and performance of storage access
• Results from the experiments’ perspective over the latest years of production show the system’s reliability and high performance with moderate effort
• GEMSS is the treasure!
Contributors
• Alessandro Cavalli, INFN-CNAF
• Luca Dell’agnello, INFN-CNAF
• Daniele Gregori, INFN-CNAF
• Andrea Prosperini, INFN-CNAF
• Francesco Noferini, INFN Enrico Fermi Centre
• Pier Paolo Ricci, INFN-CNAF
• Elisabetta Ronchieri, INFN-CNAF
• Vincenzo Vagnoni, INFN Bologna
Thank you for your attention!
Questions?