Scalla/XRootd Advancements
Transcript of Scalla/XRootd Advancements
Scalla/XRootd Advancements
xrootd/cmsd (f.k.a. olbd)
Fabrizio Furano, CERN – IT/PSS
Andrew Hanushevsky, Stanford Linear Accelerator Center
http://xrootd.slac.stanford.edu
28-November-07, http://xrootd.slac.stanford.edu
Outline
- Current elaborations
  - Composite Cluster Name Space
  - POSIX file system access via FUSE+xrootd
- New developments
  - Cluster Management Service (cmsd)
  - Cluster globalization
  - WAN direct data access
- Conclusion
The Distributed Name Space
The Scalla/xrootd suite implements a distributed name space: very scalable and efficient, and sufficient for data analysis.

Some users and applications (e.g., SRM) rely on a centralized name space. This spurred the development of a Composite Name Space (cnsd) add-on, the simplest solution with the least entanglement.
Composite Cluster Name Space
Composite Cluster Name Space (slide diagram): a client contacts the redirector/manager (xrootd:1094); a separate name-space server runs as xrootd:2094. Each data server runs a cnsd and forwards name-space events to it:

    ofs.notify closew, create, mkdir, mv, rm, rmdir |/opt/xrootd/etc/cnsd
    xroot.redirect mkdir myhost:2094

The cnsd relays open/trunc, mkdir, mv, rm, and rmdir to the name-space server, so a client's opendir() refers to the directory structure maintained by xrootd:2094.
cnsd Specifics
- Servers direct name space actions to a common xrootd, which maintains the composite name space; typically these run on the redirector nodes.
- The name space is replicated in the file system: no external database is needed, and the disk footprint is small.
- Deployed at SLAC for Atlas; it still needs synchronization utilities, more documentation, and packaging (see Wei Yang for details).
- A similar MySQL-based system is being developed by CERN/Atlas (Annabelle Leung <[email protected]>).
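The data-server side of the cnsd arrangement can be sketched as a configuration fragment. The directive forms and the helper path are the ones shown on the slides; the host name `myhost` is an illustrative placeholder for the redirector node:

```
# Data server: hand name-space events to the cnsd helper
# (/opt/xrootd/etc/cnsd is the helper path from the slide)
ofs.notify closew, create, mkdir, mv, rm, rmdir |/opt/xrootd/etc/cnsd

# Send mkdir requests to the common name-space xrootd on port 2094
# (myhost is a placeholder for the redirector node)
xroot.redirect mkdir myhost:2094
```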
Data System vs File System
Scalla is a data access system, but some users and applications want file system semantics: more transparent, but often far less scalable.

For years users have asked: can Scalla create a file system experience? The answer: it can, to a degree that may be good enough. We relied on FUSE to show how.
What is FUSE
FUSE (Filesystem in Userspace) is used to implement a file system in a user-space program (Linux 2.4 and 2.6 only; see http://fuse.sourceforge.net/).

FUSE can be used to provide xrootd access that looks like a mounted file system. SLAC and FZK have xrootd-based versions of this:
- Wei Yang at SLAC: tested and practically fully functional
- Andreas Petzold at FZK: in alpha test, not fully functional yet
XrootdFS (Linux/FUSE/Xrootd)
XrootdFS (slide diagram): on the client host, an application issues POSIX calls through the kernel's POSIX file-system interface to FUSE; the user-space FUSE/xroot interface, built on the xrootd POSIX client, forwards opendir, create, mkdir, mv, rm, and rmdir to the redirector (xrootd:1094) and the name-space server (xrootd:2094). You should still run the cnsd on the data servers to capture non-FUSE events.
XrootdFS Performance
Server: Sun V20z, RHEL4, 2x 2.2GHz AMD Opteron, 4GB RAM, 1Gbit/sec Ethernet
Client: VA Linux 1220, RHEL3, 2x 866MHz Pentium 3, 1GB RAM, 100Mbit/sec Ethernet

- Unix dd, globus-url-copy & uberftp: 5-7MB/sec with 128KB I/O block size
- Unix cp: 0.9MB/sec with 4KB I/O block size

Conclusion: better for some things than others.
Why XrootdFS?
Makes some things much simpler:
- Most SRM implementations run transparently
- Avoids pre-load library worries

But it impacts other things:
- Performance is limited: kernel-FUSE interactions are not cheap, and rapid file creation (e.g., tar) is limited
- FUSE must be administratively installed to be used: difficult if that involves many machines (e.g., batch workers), easier if it involves an SE node (i.e., an SRM gateway)
Next Generation Clustering
The Cluster Management Service (cmsd) functionally replaces the olbd. It is compatible with the olbd config file, unless you are using deprecated directives, so migration is straightforward: run either olbd or cmsd everywhere. It is currently in the alpha test phase, available in the CVS head, with documentation on the web site.
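Because the cmsd reads the same configuration file as the olbd, migration is mostly a matter of starting the new daemon. A minimal shared configuration might look like the following sketch; the host name and port are illustrative assumptions, and only the all.role/all.manager directive forms appear elsewhere in this talk:

```
# One file shared by xrootd and cmsd on every node
# (myredirector.example.org:1312 is a placeholder)
all.manager myredirector.example.org:1312

if myredirector.example.org
   all.role manager
else
   all.role server
fi
```

Then start cmsd in place of olbd on every node; as the slide says, run one or the other everywhere, not a mix.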
cmsd Advantages
- Much lower latency
- New, very extensible protocol
- Better fault detection and recovery
- Added functionality:
  - Global clusters
  - Authentication
  - Server selection can include a space-utilization metric
  - Uniform handling of opaque information
  - Cross-protocol messages to better scale xproof clusters
- Better implementation for reduced maintenance cost
Cluster Globalization
Cluster Globalization (slide diagram): BNL runs an xrootd/cmsd pair as the meta manager (all.role meta manager; all.manager meta atlas.bnl.gov:1312). The xrootd/cmsd pairs at UTA, SLAC, and UOM each run all.role manager and subscribe with all.manager meta atlas.bnl.gov:1312, so root://atlas.bnl.gov/ includes the SLAC, UOM, and UTA xroot clusters. Meta managers can be geographically replicated! Note: the security hats will likely require you to use xrootd native proxy support.
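Using the directives from the slide, the BNL meta manager's configuration can be sketched as:

```
# Meta manager at BNL (directives as shown on the slide)
all.role meta manager
all.manager meta atlas.bnl.gov:1312
```

Each local cluster manager at SLAC, UOM, and UTA keeps its usual `all.role manager` line and adds the same `all.manager meta atlas.bnl.gov:1312` directive to subscribe to the meta manager.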
Why Globalize?
- Uniform view of participating clusters
- Can easily deploy a virtual MSS, included as part of the existing MPS framework
- Try out real-time WAN access: you really don't need data everywhere!
- Alice is slowly moving in this direction; the non-uniform name space is an obstacle, and the old approach is slowly changing, though some workarounds could be possible
Virtual MSS
A powerful mechanism to increase reliability:
- The data replication load is widely distributed
- Multiple sites are available for recovery
- Allows virtually unattended operation (based on BaBar experience with a real MSS)

The idea: treat the meta-cluster that a cluster is subscribed to as its MSS:
- Automatic restore after a server failure: files missing in one cluster are fetched from another, typically the fastest one that actually has the file online
- Local cluster file (pre)fetching on demand, which can be transformed into a third-party copy
- Once the cmsd is deployed, there is practically no need to track file location, though metadata repositories are still needed
Virtual MSS – a way to do it
Virtual MSS (slide diagram): CERN runs an xrootd/cmsd pair as the meta manager (all.role meta manager; all.manager meta metaxrd.cern.ch:1312), and the SLAC and GSI clusters subscribe with all.manager meta metaxrd.cern.ch:1312, so root://metaxrd.cern.ch/ includes the SLAC and GSI xroot clusters. Meta managers can be geographically replicated! A local client still continues to work; when a file is missing, the cluster asks the global meta manager and fetches the file from any other collaborating cluster.
Dumb WAN Access**
Setup: client at CERN, data at SLAC; 164ms RTT, available bandwidth < 100Mb/s.

- Test 1: read a large ROOT tree (~300MB, 200k interactions). Expected time: 38000s (latency) + 750s (data) + CPU ➙ 10 hrs!
- Test 2: draw a histogram from that tree data (6k interactions). Measured time: ~15-20 min, using xrootd with WAN optimizations disabled.
**Federico Carminati, The ALICE Computing Status and Readiness, LHCC, November 2007
Smart WAN Access**
Exploit xrootd WAN optimizations:
- TCP multi-streaming: up to 15x improvement in WAN data throughput
- The ROOT TTreeCache provides hints on "future" data accesses
- TXNetFile/XrdClient "slides through", keeping the network pipeline full
- Data transfer proceeds in parallel with computation
- Throughput reaches 70-80% of "batch" file-copy tools, and this is a live analysis, not a file copy!
- Test 1 actual time: 60-70 seconds, compared to 30 seconds using a Gb LAN; very favorable for sparsely used files
- Test 2 actual time: 7-8 seconds, comparable to LAN performance; a 100x improvement over dumb WAN access (i.e., 15-20 minutes)
**Federico Carminati, The ALICE Computing Status and Readiness, LHCC, November 2007
Conclusion
Scalla is a robust framework:
- Elaborative: Composite Name Space, XrootdFS
- Extensible: cluster globalization
- Many opportunities to enhance data analysis speed and efficiency
Acknowledgements

Software collaborators:
- INFN/Padova: Alvise Dorigo
- Root: Fons Rademakers, Gerri Ganis (security), Bertrand Bellenot (windows)
- Alice: Derek Feichtinger, Guenter Kickinger
- CERN: Fabrizio Furano (client), Andreas Peters (Castor)
- STAR/BNL: Pavel Jakl
- Cornell: Gregory Sharp
- SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger, Bill Weeks
- BaBar: Peter Elmer (packaging)
Operational collaborators: BNL, CNAF, FZK, INFN, IN2P3, RAL, SLAC

Funding: US Department of Energy, contract DE-AC02-76SF00515 with Stanford University; formerly INFN (BaBar)