CamGrid
Mark Calleja
Cambridge eScience Centre
What is it?
• A number of like-minded groups and departments (10), each running their own Condor pool(s), which federate their resources (12).
• Coordinated by the Cambridge eScience Centre (CeSC), but no overall control.
• Been running now for ~2.5 years, ~70+ users.
• Currently have ~950 processors/cores available.
• “All” linux (various), mostly x86_64, running 24/7.
• Mostly Dell PowerEdge 1950 (like HPCF), four cores with 8GB.
• Around 2M CPU hours to date.
Some details
• Pools run the latest stable version of Condor (currently 6.8.6).
• All machines get an (extra) IP address in a CUDN-only routable range for Condor.
• Each pool sets its own policies, but these must be visible to other users of CamGrid.
• Currently we see vanilla, standard and parallel (MPI) universe jobs.
• Users get accounts on a machine in their local pool; jobs are then distributed around the grid by Condor using its flocking mechanism.
• MPI jobs on single SMP machines have proved very useful.
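As a rough illustration of the submission side, the sketch below generates a Condor submit description for a small vanilla-universe parameter sweep. The helper name make_sweep_submit, the executable name and the file names are hypothetical; the file-transfer settings are an assumption for pools without a shared filesystem, not something stated on the slides. Flocking itself is not set in the submit file: it is normally enabled by pool administrators through configuration macros such as FLOCK_TO and FLOCK_FROM.

```python
def make_sweep_submit(executable, values, universe="vanilla"):
    """Return the text of a Condor submit description, one job per value.

    All names here are illustrative; adjust to the local pool's conventions.
    """
    lines = [
        f"universe = {universe}",
        f"executable = {executable}",
        # Assumption: no shared filesystem between flocked pools, so ask
        # Condor to ship files to the execute node and back again.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "log = sweep.log",
    ]
    for index, value in enumerate(values):
        lines += [
            f"arguments = {value}",
            f"output = run_{index}.out",
            f"error = run_{index}.err",
            "queue",  # each "queue" statement submits one job
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a three-point sweep; the result could be saved and passed
    # to condor_submit on a submit host.
    print(make_sweep_submit("./simulate", [0.1, 0.2, 0.3]))
```

Keeping one job per "queue" statement makes it easy to match each output file to its parameter value when results come back from other pools.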
NTE of Ag3[Co(CN)6] with SMP/MPI sweep
Monitoring Tools
• A number of web-based tools are provided to monitor the state of the grid and of jobs.
• CamGrid is based on trust, so must make sure that machines are fairly configured.
• The university gave us £450k (~$950k) to buy new hardware; need to ensure that it’s online as promised.
CamGrid’s file viewer
• Standard universe uses RPCs to echo I/O operations back to submit host.
• What about other universes? How can I check the health of my long running simulation?
• We’ve provided our own facility, which involves an agent installed on each execute node and accessed via a web interface.
• Works with vanilla and parallel (MPI) jobs.
• Requires local sysadmins to install and run it.
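To make the agent idea concrete, here is a minimal sketch of a read-only HTTP service that returns the last few lines of a file under a given directory. It is not the CamGrid file viewer: the function names (resolve_inside, tail_lines, make_handler, serve), the query parameters "file" and "lines", and the port are all assumptions made for illustration, and a real agent would also need authentication and a way to locate each job's scratch directory.

```python
import collections
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def resolve_inside(root, relative):
    """Resolve `relative` under `root`, refusing paths that escape it."""
    root = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root, relative))
    if os.path.commonpath([root, target]) != root:
        raise PermissionError(f"{relative!r} is outside the served directory")
    return target


def tail_lines(path, n=20):
    """Return the last `n` lines of a text file as a list of strings."""
    with open(path, "r", errors="replace") as handle:
        return list(collections.deque(handle, maxlen=n))


def make_handler(root):
    """Build a request handler that serves file tails from `root` only."""

    class TailHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            query = parse_qs(urlparse(self.path).query)
            relative = query.get("file", [""])[0]
            try:
                count = int(query.get("lines", ["20"])[0])
                body = "".join(tail_lines(resolve_inside(root, relative), count))
                status = 200
            except (OSError, ValueError) as exc:
                body, status = f"{exc}\n", 404
            data = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return TailHandler


def serve(root, port=8080):
    """Run the agent until interrupted (hypothetical entry point)."""
    HTTPServer(("127.0.0.1", port), make_handler(root)).serve_forever()
```

With the agent started via serve("/path/to/scratch"), a request such as /?file=job.out&lines=5 would return the last five lines of job.out, which is enough to check whether a long-running simulation is still making progress.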
Checkpointable vanilla universe
• Standard universe is fine, if you can link to Condor’s libraries (Pete Keller – “getting harder”).
• Investigating using BLCR (Berkeley Lab Checkpoint/Restart) kernel modules for linux.
• Uses kernel resources, and can thus restore resources that user-level libraries cannot.
• Supported by some flavours of MPI (late LAM, OpenMPI).
• The idea was to use Parrot’s user-space FS to wrap a vanilla job and save the job’s state on a chirp server.
• However, currently Parrot breaks some BLCR functionality.
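For comparison, the sketch below shows application-level checkpointing, a different and much simpler technique than the kernel-level BLCR approach described above: the program saves its own state periodically and resumes from the last saved state after an eviction. It only works for codes you can modify, which is exactly the limitation BLCR is meant to remove. All names (save_checkpoint, load_checkpoint, run, the state layout) are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an eviction mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(temp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every=10, stop_after=None):
    """Toy simulation: sum the step indices, checkpointing periodically.

    `stop_after` simulates the job being evicted after that many steps.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated eviction: work since the last checkpoint is lost
        state["acc"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

A job restarted on another execute node repeats at most checkpoint_every steps, provided the checkpoint file is stored somewhere that survives the eviction.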
What doesn’t work so well…
• Each pool is run by local sysadmin(s), but these are of variable quality/commitment.
• We’ve set up mailing lists for users and sysadmins: hardly ever used (don’t want to advertise ignorance?).
• Some pools have used SRIF hardware to redeploy machines committed earlier. Naughty…
• Don’t get me started on merger with UCS’s central resource (~400 nodes).
But generally we’re happy bunnies
• “CamGrid was an invaluable tool allowing us to reliably sample the large parameter space in a reasonable amount of time. A half-year's worth of CPU running was collected in a week.”
-- Dr. Ben Allanach
• “CamGrid was essential in order for us to be able to run the different codes in real time.”
-- Prof. Fernando Quevedo
• “I needed to run simulations that took a couple of weeks each. Without access to the processors on CamGrid, it would have taken a couple of years to get enough results for a publication.”
-- Dr. Karen Lipkow
Current issues
• Protecting resources on execute nodes; Condor seems lax at this, e.g. memory, disk space.
• Increasingly interested in VMs (i.e. Xen). Some pools run it, but not concerted (effects on SMP MPI jobs?).
• Green issues: will we be forced to buy WoL cards in the near future?
• Altruistic computing: a recent wave of interest for BOINC/backfill jobs for medical, protein folding, etc., but who runs the jobs? Audit trail?
• How do we interact with outsiders? Ideally keep it to Condor (some Globus, toyed with VPNs). Most CamGrid stakeholders just dish out conventional, ssh-accessible accounts.
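On the first issue above, one possible stopgap is to start each job through a wrapper that applies operating-system resource limits before the job begins. The sketch below uses POSIX rlimits from Python's standard library; it is an assumption about what a site might do, not a Condor feature or CamGrid practice. The function names are hypothetical, and the file-size limit applies per file, so it does not cap a job's total disk usage.

```python
import resource
import subprocess
import sys


def _lower_limit(which, value):
    """Lower both soft and hard limits to `value` (never raise them)."""
    _soft, hard = resource.getrlimit(which)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(which, (value, value))


def run_limited(argv, max_file_bytes, max_address_space_bytes=None, cwd=None):
    """Run `argv` with a per-file size cap and an optional memory cap.

    Returns the child's exit status (non-zero or negative on failure).
    """

    def apply_limits():
        # Runs in the child process just before the job is executed.
        _lower_limit(resource.RLIMIT_FSIZE, max_file_bytes)
        if max_address_space_bytes is not None:
            _lower_limit(resource.RLIMIT_AS, max_address_space_bytes)

    completed = subprocess.run(argv, cwd=cwd, preexec_fn=apply_limits)
    return completed.returncode


if __name__ == "__main__":
    # Example: a job that tries to write 2 MiB while limited to 1 MiB per file.
    job = [sys.executable, "-c",
           "open('out.bin', 'wb').write(b'x' * (2 * 1024 * 1024))"]
    print("exit status:", run_limited(job, max_file_bytes=1024 * 1024))
```

Because the hard limit is lowered along with the soft one, the job cannot raise the cap again on its own.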
Finally…
• CamGrid: http://www.escience.cam.ac.uk/projects/camgrid/
• Contact: [email protected]
Questions?