PCGRID ’08 Workshop, Miami, FL, April 18, 2008
Preston Smith <[email protected]>
Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University
• Introduction
  – Environment
  – Motivation
• Challenges
  – Infrastructure
  – Usage Tracking
  – Storage
  – Staffing
• Future Work
• Results
BoilerGrid
BoilerGrid - Growth
How did we get from here….
To here?
BoilerGrid - Rosen Center for Advanced Computing
• Research Computing arm of ITaP - Information Technology at Purdue
• Clusters in RCAC are arranged in larger “Community Clusters”
  – One cluster, one configuration, many owners
  – Leverages economies of scale for purchasing, and provides expertise in systems engineering, user support, and networking
BoilerGrid - Motivation
• Early on, we recognized that the diverse owners of the community clusters don’t use the machines at 100% capacity
  – Community clusters used approximately 70% of capacity
  – Condor installed on community clusters to cycle-scavenge from PBS, the primary scheduler
• Goal: provide a general-purpose high-throughput computing resource on existing hardware
BoilerGrid - Challenges
• In 2005, the Condor deployment at Purdue was unable to scale to the size of the clusters, and ran on an old version of the software
• An overhaul of the Condor infrastructure was needed!
BoilerGrid - Keep Condor Up-to-date
• Upgrading Condor
  – In late 2005, we were running Condor version 6.6.5, which was 1.5 years old.
  – First, we needed to upgrade!
• In a large, busy Condor grid, we found it’s usually advantageous to run the development release of Condor
  – Early access to new features, scalability improvements
BoilerGrid - Pool Design
• Use many machines
  – In 2005, we ran a single Condor pool with ~1800 machines.
• In 2005, the largest single Condor pools in existence were ~1000 machines.
  – We implemented BoilerGrid as a flock of 4 pools, of up to 1200 machines each.
  – Implementing BoilerGrid today?
    • Would have looked much different!
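The sizing arithmetic behind a multi-pool flock can be sketched in a few lines of Python. This is only an illustration of the numbers on this slide, not the procedure used at Purdue: the function name `plan_pools` is hypothetical, and the 4000-machine input is the 2005 pool size from the results table. Flocking itself is configured in Condor, not in a script like this.

```python
def plan_pools(total_machines: int, max_pool_size: int) -> list[int]:
    """Split machines as evenly as possible across the fewest pools
    that keep every pool at or below max_pool_size.

    Hypothetical helper for illustration only.
    """
    if total_machines <= 0 or max_pool_size <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division: fewest pools that can hold every machine.
    pool_count = (total_machines + max_pool_size - 1) // max_pool_size
    base, extra = divmod(total_machines, pool_count)
    # The first `extra` pools take one additional machine each.
    return [base + 1 if i < extra else base for i in range(pool_count)]


if __name__ == "__main__":
    # Assumed inputs: 4000 machines, pools capped at 1200 machines each.
    print(plan_pools(4000, 1200))
```

With a cap of 1200 machines per pool, 4000 machines need four pools, which is consistent with the four-pool flock described above.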
BoilerGrid - Submit Hosts
• Many submit hosts
  – In 2005, a single host ran the Condor schedd and could submit jobs
  – Today, any RCAC machine that allows user login, and in many cases end-user desktops, can submit Condor jobs
BoilerGrid - Challenges
• Usage Tracking
  – Tracking job-level accounting with a large Condor pool is difficult
  – Job history resides on every submit host
  – Recent versions of Condor’s Quill software allow for a central database holding job (and machine) information
    • Deploying this on BoilerGrid now
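To show why per-host history makes accounting awkward, here is a minimal Python sketch that merges job records from several submit hosts into per-user hour totals. It assumes the records have already been collected from each host as dictionaries keyed by the job ClassAd attributes `Owner` and `RemoteWallClockTime` (seconds). The host names, user names, and the function `hours_by_owner` are made up for illustration, and this is not how Quill works internally.

```python
from collections import defaultdict


def hours_by_owner(histories):
    """Sum wall-clock hours per user across all submit hosts.

    histories: mapping of submit host name -> list of job records, where each
    record is a dict with 'Owner' and 'RemoteWallClockTime' (in seconds).
    """
    totals = defaultdict(float)
    for records in histories.values():
        for job in records:
            totals[job["Owner"]] += job["RemoteWallClockTime"] / 3600.0
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample data standing in for two hosts' job histories.
    sample = {
        "submit-a": [
            {"Owner": "alice", "RemoteWallClockTime": 7200},
            {"Owner": "bob", "RemoteWallClockTime": 1800},
        ],
        "submit-b": [
            {"Owner": "alice", "RemoteWallClockTime": 3600},
        ],
    }
    print(hours_by_owner(sample))
```

Every additional submit host is another source that has to be collected before a merge like this can run, which is the problem a central database removes.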
BoilerGrid - Storage
• If your users expect to run jobs using a shared filesystem, a large Condor installation can overwhelm NFS servers.
• DAGMan and user logs on NFS can cause problems
  – The defaults don’t allow this for a reason!
• Train users to rely less on the shared filesystem and take advantage of Condor’s ability to transfer files
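As a rough illustration of what relying on Condor’s file transfer looks like for a user, the Python sketch below renders a submit description that turns file transfer on instead of assuming a shared filesystem. The submit commands (`should_transfer_files`, `when_to_transfer_output`, `transfer_input_files`) are standard Condor ones; the helper `render_submit` and all file names are hypothetical.

```python
def render_submit(executable, input_files, log_name="job.log"):
    """Return the text of a Condor submit description that uses file
    transfer rather than a shared filesystem.

    Hypothetical helper; file names are placeholders.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
    ]
    if input_files:
        # Condor copies these files to the execute machine before the job starts.
        lines.append(f"transfer_input_files = {', '.join(input_files)}")
    lines += [
        "output = job.out",
        "error = job.err",
        # Keep the user log on local disk of the submit host, not on NFS.
        f"log = {log_name}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_submit("simulate", ["params.txt", "mesh.dat"]))
```

The resulting text would be saved to a file and passed to `condor_submit`; where the log file lives is up to the user, and the comment above reflects the NFS caution from this slide.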
BoilerGrid - Expansion
• Successful use of Condor in clusters led us to identify partners around campus
  – Student computer labs operated by sister unit in ITaP (2500 machines and growing)
  – Library terminals (200 machines)
  – Other campuses (500+ machines)
• Management support is critical!
  – Purdue’s CIO supports using Condor on many machines run by ITaP, including the one on his own desk
BoilerGrid - Expansion
• An even better route of expansion
  – Condor users adding their own resources
    • Machines in their own lab
    • All the machines in their department
• With distributed ownership come new challenges
  – Regular contact with owners’ system administration staff
  – Ensure that owners are able to set their own policies
BoilerGrid - Staffing
• Implementing BoilerGrid required minimal staff effort
  – Assuming an IT infrastructure already exists that can operate many machines
  – 0.25 FTE ongoing to maintain Condor and coordinate with distributed Condor installations
• With success comes more demand, and the end-user support to go along with it
  – 1.5 science support consultants assist with porting codes and training users to use Condor effectively
BoilerGrid - Future Work
• TeraGrid (NSF HPCOPS) - Portal for submission and monitoring of Condor jobs
• Centralized Quill database for job and machine state
  – Excellent source of data for future research in distributed systems
BoilerGrid - Results
Year   Pool Size   Jobs        Hours Delivered   Unique Users
2004   1500        43,551      346,000           14
2005   4000        210,717     1,695,000         26
2006   6100        4,251,981   5,527,000         72
2007   7700        9,611,813   9,524,000         117
2008   13000+      ?           ?                 63 so far
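One figure that can be derived from the table is the average number of hours delivered per job. The Python sketch below computes it from the 2004 through 2007 rows; 2008 is left out because its job and hour counts are not yet known. This ratio is a derived illustration, not a number reported in the slides, and the function name `hours_per_job` is hypothetical.

```python
# (jobs, hours delivered) per year, copied from the results table.
results = {
    2004: (43_551, 346_000),
    2005: (210_717, 1_695_000),
    2006: (4_251_981, 5_527_000),
    2007: (9_611_813, 9_524_000),
}


def hours_per_job(jobs: int, hours: int) -> float:
    """Average hours delivered per job for one year."""
    return hours / jobs


if __name__ == "__main__":
    for year, (jobs, hours) in results.items():
        print(f"{year}: {hours_per_job(jobs, hours):.2f} hours per job")
```

The ratio falls from roughly 8 hours per job in 2004 and 2005 to about 1 hour per job in 2007, so job counts grew much faster than delivered hours over that period.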
BoilerGrid - Results
BoilerGrid - Conclusions
• Condor is a powerful tool for getting real science done on otherwise unused hardware
http://www.rcac.purdue.edu/boilergrid
Questions?