![Page 1](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/1.jpg)
The GRID and the Linux Farm at the RCF
CHEP 2003 – San Diego
March 27, 2003
A. Chan, R. Hogue, C. Hollowell, O. Rind, J. Smith, T. Throwe, T. Wlodek, D. Yu
RHIC Computing Facility
Brookhaven National Laboratory
![Page 2](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/2.jpg)
Outline
• Background
• Hardware
• Software
• Security
• GRID-like capabilities
• Near-term plans
![Page 3](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/3.jpg)
Background
• Used for mass processing of RHIC data
• U.S. Tier 1 Center for ATLAS
• Listed as 3rd largest cluster in http://clusters.top500.org
• Currently staffed with 5 FTE
![Page 4](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/4.jpg)
Growth of the Linux Farm
[Chart: farm capacity in K SpecInt2000 (axis 0 to 1000) by year, 1999 to 2003]
![Page 5](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/5.jpg)
Hardware
• Built with commercially available Intel-based servers
• 1097 rack-mounted, dual CPU servers
• 917,728 SpecInt2000
• Reliable (0.0052 hardware failures per machine-month, about 6 failures/month)
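The "about 6 failures/month" figure follows from the two numbers on this slide. A minimal sketch of the arithmetic; the function name is illustrative and not part of the talk:

```python
# Figures quoted on the Hardware slide.
SERVERS = 1097
FAILURES_PER_MACHINE_MONTH = 0.0052


def expected_failures_per_month(servers: int, rate: float) -> float:
    """Expected hardware failures per month for a farm of `servers` machines."""
    return servers * rate


monthly = expected_failures_per_month(SERVERS, FAILURES_PER_MACHINE_MONTH)
print(f"{monthly:.1f} failures/month")
```

The product is a little under 6, which matches the rounded figure on the slide.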
![Page 6](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/6.jpg)
Breakdown of Hardware Failures
[Pie chart. Categories: DISK, PS, MB, MEM, OTH. Slice values as extracted: 31%, 26%, 13%, 11%, 19%; the pairing of values with categories is not preserved in the transcript.]
![Page 7](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/7.jpg)
The Linux Farm in the RCF
![Page 8](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/8.jpg)
The IBM servers
![Page 9](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/9.jpg)
The VA Linux servers
![Page 10](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/10.jpg)
Hardware (cont.)

| Brand | CPU | RAM | Storage | Quantity |
|---|---|---|---|---|
| VA Linux | 450 MHz | 0.5-1 GB | 9-120 GB | 154 |
| VA Linux | 700 MHz | 0.5 GB | 9-36 GB | 48 |
| VA Linux | 800 MHz | 0.5-1 GB | 18-480 GB | 168 |
| IBM | 1.0 GHz | 0.5-1 GB | 18-144 GB | 315 |
| IBM | 1.4 GHz | 1 GB | 36-144 GB | 160 |
| IBM | 2.4 GHz | 1 GB | 240 GB | 252 |
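The quantities in this table add up to the 1097 servers quoted on the first Hardware slide. A short sketch that tallies them by brand; the variable and function names are illustrative, and only the brand, CPU and quantity columns are used:

```python
# (brand, CPU clock, quantity) rows copied from the hardware table.
INVENTORY = [
    ("VA Linux", "450 MHz", 154),
    ("VA Linux", "700 MHz", 48),
    ("VA Linux", "800 MHz", 168),
    ("IBM", "1.0 GHz", 315),
    ("IBM", "1.4 GHz", 160),
    ("IBM", "2.4 GHz", 252),
]


def servers_by_brand(rows):
    """Sum the quantity column per brand."""
    totals = {}
    for brand, _cpu, quantity in rows:
        totals[brand] = totals.get(brand, 0) + quantity
    return totals


totals = servers_by_brand(INVENTORY)
total_servers = sum(totals.values())
print(totals, total_servers)
```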
![Page 11](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/11.jpg)
Software
• Custom version of Red Hat Linux 7.2
• Linux image installed with Kickstart
• Support for an array of compilers (gcc, PGI, Intel) and debuggers (gdb, TotalView, Intel)
• Support for network file systems (AFS, NFS)
![Page 12](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/12.jpg)
Software (cont.)
• Support for LSF and MDS-compatible batch software
• Mix of open-source, RCF-built and vendor-supplied systems to monitor and control hardware, software and infrastructure
• Cluster management tools based on open-source software
![Page 13](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/13.jpg)
Linux Farm Monitoring
![Page 14](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/14.jpg)
Batch Control & Monitoring
![Page 15](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/15.jpg)
Linux Farm Usage
![Page 16](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/16.jpg)
Remote Power Management
![Page 17](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/17.jpg)
Infrastructure Monitoring
![Page 18](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/18.jpg)
Security
• Firewall to minimize unauthorized access
• User access via SSH through security-enhanced gateway systems
• Most servers closed to direct external access
• Other security measures being developed
![Page 19](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/19.jpg)
Security (cont.)
![Page 20](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/20.jpg)
GRID-like capabilities
• Ganglia (monitoring & job scheduler)
• Condor (batch software)
• GLOBUS & LSF batch
![Page 21](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/21.jpg)
Ganglia
• Open-source monitoring software (http://sourceforge.net/projects/ganglia)
• Can create federation of clusters
• Historical data information
• Can be used as job scheduler in GRID-like environment
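The slides do not say how Ganglia would drive scheduling. One plausible reading is that a scheduler reads the XML report published by Ganglia's `gmond` daemon and picks the least-loaded host. The sketch below assumes that approach; the sample XML, host names and load values are invented for illustration, and only the `HOST`/`METRIC` element shape and the `load_one` metric name follow Ganglia's report format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of a gmond XML report; host names and
# load values are invented for illustration.
SAMPLE_XML = """
<GANGLIA_XML VERSION="2.5.0" SOURCE="gmond">
  <CLUSTER NAME="example-farm">
    <HOST NAME="node001">
      <METRIC NAME="load_one" VAL="1.85" TYPE="float"/>
    </HOST>
    <HOST NAME="node002">
      <METRIC NAME="load_one" VAL="0.10" TYPE="float"/>
    </HOST>
    <HOST NAME="node003">
      <METRIC NAME="load_one" VAL="0.95" TYPE="float"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""


def least_loaded_host(xml_text: str, metric: str = "load_one") -> str:
    """Return the name of the host with the lowest value of `metric`."""
    root = ET.fromstring(xml_text)
    loads = {}
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                loads[host.get("NAME")] = float(m.get("VAL"))
    if not loads:
        raise ValueError(f"no host reports metric {metric!r}")
    return min(loads, key=loads.get)


print(least_loaded_host(SAMPLE_XML))
```

A real scheduler would also need to account for job slots, memory and stale reports; this only shows the selection step.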
![Page 22](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/22.jpg)
Ganglia (cont.)
• Web interface
• Prototype at the RCF
• Scalability issues
• Downside – cannot (yet) restrict data access easily, not easily customized
![Page 23](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/23.jpg)
Ganglia at the RCF
![Page 24](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/24.jpg)
Ganglia at the RCF (1)
![Page 25](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/25.jpg)
Ganglia at the RCF (2)
![Page 26](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/26.jpg)
Condor
• Open-source software (http://www.cs.wisc.edu/condor)
• Supported by Univ. of Wisconsin
• Full-feature batch software with job-queuing mechanism, scheduling policy, priority scheme, checkpoint capability, resource monitoring & management
• Can connect together multiple remote clusters
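To make the job-queuing mechanism concrete: Condor jobs are described in a submit description file that is handed to `condor_submit`. The sketch below renders a minimal vanilla-universe description. The helper function, the file-name patterns and the example executable are assumptions for illustration, not the RCF configuration.

```python
def condor_submit_description(executable, arguments="", count=1):
    """Render a minimal vanilla-universe Condor submit description.

    The output/error/log file names are illustrative choices.
    """
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        f"arguments  = {arguments}",
        "output     = job.$(Cluster).$(Process).out",
        "error      = job.$(Cluster).$(Process).err",
        "log        = job.$(Cluster).log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: queue three copies of a trivial job.
text = condor_submit_description("/bin/hostname", count=3)
print(text)
```

Saving the printed text to a file and passing it to `condor_submit` would queue the jobs on a pool where Condor is installed.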
![Page 27](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/27.jpg)
Condor (cont.)
• Interface with GLOBUS via Condor-G
• Prototype for Linux Farm batch access in GRID-like environment
• Scalability – not yet tested in very large-scale environment?
• MDS-compatible in RCF environment?
![Page 28](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/28.jpg)
Condor & the batch software
![Page 29](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/29.jpg)
GLOBUS & LSF
• GLOBUS tools allow remote users to submit jobs on local cluster
• ATLAS prototype at the RCF
• Gatekeeper acts as interface between GLOBUS and local batch system
• LSF job submitted from gatekeeper
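The gatekeeper step can be pictured as a translation from a remote job request into a local `bsub` call. The sketch below shows only that translation, under stated assumptions: the request dictionary, its keys and the queue name are hypothetical, and a real Globus job manager also handles authentication, account mapping and job-state reporting, none of which appears here.

```python
import shlex


def build_bsub_command(job: dict) -> list:
    """Translate a simple job request into a bsub argument list."""
    argv = ["bsub"]
    if "queue" in job:
        argv += ["-q", job["queue"]]      # target LSF queue
    if "stdout" in job:
        argv += ["-o", job["stdout"]]     # standard output file
    if "stderr" in job:
        argv += ["-e", job["stderr"]]     # standard error file
    argv.append(job["executable"])
    argv += list(job.get("arguments", []))
    return argv


# Hypothetical request as a gatekeeper might receive it.
request = {"executable": "/bin/hostname", "queue": "atlas_test", "stdout": "job.out"}
print(shlex.join(build_bsub_command(request)))
```

Building an argument list rather than a shell string avoids quoting problems when the command is later run with `subprocess`.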
![Page 30](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/30.jpg)
GRID & LSF (cont.)
![Page 31](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/31.jpg)
GLOBUS & LSF
![Page 32](https://reader035.fdocuments.in/reader035/viewer/2022070402/56649f275503460f94c3e6ba/html5/thumbnails/32.jpg)
Near-term plans
• Use mature version of Ganglia for monitoring. Use as job scheduler?
• Roll out Condor as part of new batch software
• Upgrade to LSF v. 5.x – GRID-like features
• Other GRID-like capabilities?
• Security issues