ATLAS HK Tier-2 Site Setup & Storage Research in CUHK
By Roger Wong & Runhui Li
Roadmap
• ATLAS HK Tier-2 Site Setup – Presented by Roger Wong ([email protected]), Research Computing Team, Information Technology Services Centre, The Chinese University of Hong Kong
• Storage Research in CUHK
Major Tasks
• HTCondor
• ARC CE + EGIIS
• DPM
• Frontier Squid
• Client

Install Software Components
HTCondor
• Completed
ARC CE + EGIIS
• Basic configuration completed
DPM
• Basic installation completed (all-in-one node)
• It works for protocols such as RFIO and XROOT
Squid
• Completed
Client
• Could access ARC CE, DPM and Frontier Squid
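As a quick way to exercise a working HTCondor pool (not from the slides; the file names and paths are illustrative), a minimal submit description file looks like:

```
# sleep.sub: minimal HTCondor job to verify scheduling works
executable = /bin/sleep
arguments  = 60
output     = sleep.out
error      = sleep.err
log        = sleep.log
queue 1
```

Submit it with `condor_submit sleep.sub` and watch it run with `condor_q`.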
Install Testing Cluster
• 10 VMs
– One EGIIS (for testing the registration process to the CERN grid)
– One ARC CE node with HTCondor manager & submit roles
– Two HTCondor worker nodes
– One ARC CE node with HTCondor manager, submit and execute roles
– One DPM head node
– Two DPM disk nodes
– One Squid server
– One client
• All 10 servers with production host certificates applied from AP Grid PMA
• Would like to try connecting to the CERN grid now (yet to be discussed with counterpart in Lyon)
– Need to tune configuration parameters
– Need to sort out all outstanding issues
Conduct tender of production cluster
• Preliminary specification
– 1,000+ cores
– 1 PB storage
• Target to finalize the cluster specification after the testing cluster is connected to the CERN grid in “test” mode
Upgrade testing cluster into production cluster
• Replacing ARC CE, HTCondor worker nodes and Squid
– Replace VMs in the testing cluster with PMs
– Add more HTCondor worker nodes
• Reinstall DPM with PMs and storage devices
Tentative Timeline
• Connect testing cluster to CERN grid in “test” mode (by 2015)
• Conduct tender for production cluster (Jan 2016)
• Put cluster into production (H2 2016)
Roadmap
• ATLAS HK Tier-2 Site Setup
• Storage Research in CUHK
– Led by Professor Patrick P. C. Lee ([email protected])
– Presented by Runhui Li ([email protected]), Advanced Networking and System Research Lab, Department of Computer Science and Engineering
Storage Research in CUHK
Build dependable storage systems with fault tolerance, recovery, security, and performance in mind
Techniques:
• Erasure coding: provides fault tolerance via “controlled” redundancy (e.g., RAID)
• Deduplication: removes content-level “uncontrolled” redundancy
• Security: ensures data confidentiality and integrity against attacks
Targeted architectures:
• Clouds, data centers, disk arrays, SSDs
Approach:
• Build prototypes, backed by experiments and theoretical analysis
• Open-source software
• http://www.cse.cuhk.edu.hk/~pclee
Storage Research in CUHK
[Figure: research scope matrix. Techniques: erasure coding, deduplication, security. Architectures: cloud, data center, disk array, SSD. Workloads: backup, MapReduce, streaming, primary I/O. Our focus: big data, file and storage systems]
Motivation
Distributed storage systems are widely deployed to provide scalable storage by striping data across multiple nodes
Failures are common
Replication vs. Erasure Coding
Solution: add redundancy
• Replication
• Erasure coding
Enterprises (e.g., Google, Azure, Facebook) are moving to erasure coding to reduce storage footprints amid explosive data growth
• e.g., 3-way replication has 200% storage overhead; erasure coding can reduce the overhead to 33%, with over 50% operational cost saving [Huang, ATC’12]
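The overhead figures above follow from simple arithmetic. A quick sketch (the (12, 9) erasure-code parameters are illustrative, not from the slides):

```python
def storage_overhead(total_chunks, data_chunks):
    """Extra storage beyond the original data, as a fraction.

    Replication with r copies is the special case
    total_chunks = r * data_chunks.
    """
    return (total_chunks - data_chunks) / data_chunks

# 3-way replication: every chunk stored 3 times -> 200% overhead
print(storage_overhead(3, 1))    # 2.0

# An erasure code storing 12 chunks for every 9 data chunks -> ~33% overhead
print(storage_overhead(12, 9))
```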
Background: Erasure Coding
• Divide a file into k data chunks (each with multiple blocks)
• Encode the k data chunks into n-k additional parity chunks
• Distribute the n data/parity chunks to n nodes
• Fault tolerance: any k out of n chunks can recover the file data
[Figure: an (n, k) = (4, 2) example. A file is divided into data chunks (A, B) and (C, D), encoded into parity chunks (A+C, B+D) and (A+D, B+C+D), and the four chunks are distributed to four nodes; any two nodes suffice to recover the file]
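The slide’s (n, k) = (4, 2) layout can be reproduced with XOR arithmetic on byte strings (“+” in the chunk labels is bitwise XOR). A toy sketch, not the production code; it demonstrates recovering the file from the two parity nodes alone:

```python
def xor(x, y):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))

def encode(a, b, c, d):
    """Place blocks A..D on four nodes as in the slide:
    node0=(A,B), node1=(C,D), node2=(A+C,B+D), node3=(A+D,B+C+D)."""
    return [
        (a, b),
        (c, d),
        (xor(a, c), xor(b, d)),
        (xor(a, d), xor(b, xor(c, d))),
    ]

def recover_from_parities(node2, node3):
    """Recover A, B, C, D using only the two parity nodes."""
    p1, p2 = node2           # A+C, B+D
    q1, q2 = node3           # A+D, B+C+D
    c = xor(p2, q2)          # (B+D)+(B+C+D) = C
    d = xor(c, xor(p1, q1))  # (A+C)+(A+D) = C+D, so D = C + (C+D)
    a = xor(p1, c)
    b = xor(p2, d)
    return a, b, c, d

blocks = (b"AAAA", b"BBBB", b"CCCC", b"DDDD")
nodes = encode(*blocks)
assert recover_from_parities(nodes[2], nodes[3]) == blocks
```

Any other pair of surviving nodes works analogously, which is the “any k out of n” property from the previous slide.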
Erasure Coding
Key advantage:
• Reduces storage space while retaining high fault tolerance
Challenges:
• Data chunk updates require parity chunk updates → expensive updates
• k chunks are needed to recover a lost chunk → expensive recovery
Our work: mitigate the performance overhead of erasure coding while preserving storage efficiency
CodFS
Object-based distributed file system
• Splits a large file into smaller segments that are striped across different storage nodes
Erasure coding
• Each segment is independently encoded with erasure coding for fault tolerance
Decoupling metadata and data management
• Metadata updates are off the critical path
Lightweight recovery
• Monitors the health of storage nodes and triggers recovery if needed
“Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage”, USENIX FAST 2014
CodFS: Solving the Update Problem
Novelty: parity logging with reserved space
• Puts parity deltas in a reserved space next to the parity chunks to eliminate disk seeks in parity updates
• Predicts and reclaims the reserved space in a workload-aware manner
• Mitigates both network and disk I/Os in updates and recovery

[Figure: a data delta ∆A from a data node produces parity deltas ∆P = f(∆A) and ∆Q = g(∆A) on the parity nodes]
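The point of the deltas above is that a parity node can absorb an update from the data delta alone, without re-reading the other data chunks: for a linear code, ∆P = f(∆A) where f is the code’s encoding coefficient. A toy XOR-parity sketch (illustrative; CodFS logs such deltas into the reserved space rather than applying them in place):

```python
def xor(x, y):
    return bytes(p ^ q for p, q in zip(x, y))

# Initial data chunks and an XOR parity chunk P = A ^ B
a, b = b"\x10\x20", b"\x01\x02"
p = xor(a, b)

# Client overwrites chunk A
a_new = b"\x33\x44"
delta_a = xor(a, a_new)        # data delta sent to the parity node

# Parity node updates P using only the delta:
# for XOR parity, f is the identity, so dP = dA
p_new = xor(p, delta_a)

assert p_new == xor(a_new, b)  # same result as re-encoding from scratch
```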
CodFS: I/O Workflow
[Figure: the client obtains segment metadata from the MDS, then sends each segment to a primary OSD; the primary OSD encodes the segment into chunks and distributes them to secondary OSDs. MDS: metadata server; OSD: object storage device]
CodFS Implementation
• CodFS architecture exploits parallelization across nodes and within each node
• Provides a file system interface based on FUSE
• OSD: modular design
Results
• Aggregate read/write throughput achieves several hundreds of megabytes per second
• Network bound
Projects on Erasure Coding
Mixed failures
• STAIR codes: a general, space-efficient erasure code for tolerating both device failures and latent sector errors [FAST’14, TOS’14]
• I/O-efficient integrity checking against silent data corruptions [MSST’14]
Efficient updates• CodFS: enhanced parity logging to reduce network and disk I/Os [FAST’14]
Efficient recovery• NCCloud: reduce bandwidth for archival storage [FAST’12, INFOCOM’13, TC’14]
• I/O-efficient recovery schemes for erasure codes [MSST’12, DSN’12, TC’14, TPDS’14]
Integration of erasure coding and Hadoop• CORE: Regenerating code deployment in HDFS [MSST’13, TC’15]
• Degraded-First Scheduling: MapReduce on erasure-coded storage [DSN’14]
• Encoding-Aware Replication: efficient transition from replication to erasure coding on HDFS [DSN’15]
Modeling of SSD RAID• Stochastic model to capture reliability changes as SSDs age [SRDS’13, TC]
Projects on Deduplication
LiveDFS: Linux kernel-space deduplication file system [Middleware’11]
• Extends the Linux file system with deduplication
• Follows the Linux file system layout
• Deployed as a kernel driver module
CloudVS: Tunable version control for virtual machine images on OpenStack [NOMS’12, TSC’15]
• Extends Eucalyptus with deduplication
• Tunable tradeoff between storage efficiency and performance
RevDedup: Reverse deduplication with high read/write throughput on GB/s scale [APSys’13, TOS’15]
• Efficient hybrid inline and out-of-line deduplication
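The common idea behind these systems is chunk-based deduplication: index chunks by a collision-resistant fingerprint and store each unique chunk once. A minimal fixed-size-chunk sketch (illustrative only; LiveDFS and RevDedup make different chunking and layout choices):

```python
import hashlib

def dedup_store(data, store, chunk_size=4):
    """Split data into fixed-size chunks; store each unique chunk once,
    keyed by its SHA-256 fingerprint. Returns the recipe (list of keys)."""
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)   # write only if unseen
        recipe.append(key)
    return recipe

def restore(recipe, store):
    """Rebuild the original data from its recipe."""
    return b"".join(store[k] for k in recipe)

store = {}
recipe = dedup_store(b"ABCDABCDABCDXYZ!", store)
assert restore(recipe, store) == b"ABCDABCDABCDXYZ!"
assert len(store) == 2   # "ABCD" stored once, "XYZ!" once
```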
Projects on Security
FADE: secure access control and assured deletion for cloud storage [SecureComm’10, ICPP Workshop 11, TDSC’12]
FMSR-DIP: remote data checking for regenerating codes [SRDS’12, TPDS’14]
Cryptographic deduplication for cloud storage [TPDS’14, TPDS’15]
CDStore: unifying erasure coding, deduplication, and security via convergent dispersal [HotStorage’14, USENIX ATC’15]
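CDStore’s convergent dispersal builds on convergent encryption: the key is derived from the content itself, so identical plaintexts yield identical ciphertexts and remain deduplicable. A toy sketch; the hash-based XOR keystream stands in for a real cipher and is not secure:

```python
import hashlib

def convergent_encrypt(plaintext):
    """Derive the key from the content (SHA-256), then 'encrypt' with a
    hash-based keystream. Toy stand-in for a real block cipher."""
    key = hashlib.sha256(plaintext).digest()
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(plaintext, stream))

# Identical plaintexts -> identical ciphertexts -> deduplicable
c1 = convergent_encrypt(b"same backup data")
c2 = convergent_encrypt(b"same backup data")
assert c1 == c2
assert c1 != b"same backup data"
```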
DISCUSSION
Connecting to CERN Grid
By Roger Wong
Connecting to CERN Grid (1)
• Registered in GOCDB or OIM
• Step 0: Required services
– SE
• CUHK: set up SRMv2.2 and configure the necessary space tokens
• Lyon: configure FTS channels
• Could we transfer in data from more than one Tier-1 site?
– CE and WNs
• Not that many in the testing cluster when first connecting to the Tier-1 site
• Will add many more WNs within 6 months
– CVMFS
• What does CUHK need to do? Just ensure our client has CVMFS installed?
– Squid
• Install default and fail-over Squid servers
• Manual fail-over?
• Question
– Could CUHK transfer in data from more than one Tier-1 site?
• Outstanding items
– Separate the DPM head node and disk nodes
– SRMv2.2 configuration
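Regarding the Squid fail-over question: the Frontier client accepts a list of proxies and fails over between them automatically, so manual fail-over should not be needed. A hedged sketch of the client-side setting (server and proxy hostnames are placeholders, not the actual CUHK/CERN values):

```
export FRONTIER_SERVER="(serverurl=http://frontier.example.cern.ch:8000/atlr)(proxyurl=http://squid1.cuhk.example:3128)(proxyurl=http://squid2.cuhk.example:3128)"
```

The client tries the proxyurl entries in order, falling back to the next one when a Squid is unreachable.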
Connecting to CERN Grid (2)
• Step 1: Register the site in AGIS
– Register an “Atlas Site” with the site name in GOCDB/OIM
– Is the site name just CUHK?
• Step 2: Register the storage in DDM
– Register “DDM Endpoints” corresponding to the space tokens in AGIS
• CUHK
– SE name
– Space token availability
– Email address of responsible person
– seinfo
– FTS channel information
• Lyon
– Open a DDM Ops Savannah ticket
» Include the DDM endpoint in SiteServices and DeletionServices
» Validate the transfer and deletion steps with one dataset
» DDM endpoints will appear in DaTRI after 24 hours
– Fill in all the information
Connecting to CERN Grid (3)
• Step 3: Set up a Squid
– Register the Squid in AGIS, as well as the Frontier services that it should look up
• Step 4: Panda queues
– CUHK
• CE name and queue name
• vmem size per job slot
• Available disk size (workdir) per job slot
• Wall-time limit, if any
– Lyon
• Register “Panda Site”, “Panda Resources” and associated “Panda Queues” in AGIS
• Question
– Will the ARC CE queue become a Panda queue automatically if CUHK registers as a Panda site? Is no extra setup needed in CUHK?
Connecting to CERN Grid (4)
• Step 4: Panda queues (cont.)
– Panda site
• Usually “Atlas Site” == GOCDB/OIM site name
– Panda resource
• Associated with the “Panda site”
• Production jobs: usually “Panda site” == “Atlas site” == GOCDB/OIM
– Panda queue
• Associated with the “Panda resource”
• Usually the same name as the “Panda resource”
• Associates the CE and the queue
• Set queue status to test
Connecting to CERN Grid (5)
• Step 5: ATLAS SW installation/validation system
– After the “Panda queues” are configured, contact [email protected] to start automatic software installation/validation
Connecting to CERN Grid (6)
• ATLAS functional tests
– DDM FT: tests storage and connectivity stability
– SAM test: tests CE and storage stability
• Step 6: Perform data transfer functional test
– Lyon: include the site in DDM FT (T1→site and Sonar)
Connecting to CERN Grid (7)
• Step 7: Perform production functional test
• Step 8: Perform analysis functional test
– Contact atlas-adc-hammercloud-support and give the “Panda resource” name
– The site should be set automatically in the HC DB within hours
– Test jobs should appear within 24 hours
Connecting to CERN Grid (8)
• Step 9: Analysis activity
– Add the site to the PanDA database (for pathena/prun analysis jobs)
– The site appears in the PanDA Cloud Monitor
– Run GangaRobot jobs with a success rate > 95% for 10 days
• Step 10: Production activity
– The site goes online after a few jobs have run successfully