CamGrid Mark Calleja Cambridge eScience Centre. What is it? A number of like minded groups and...


CamGrid

Mark Calleja

Cambridge eScience Centre


What is it?

• A number of like-minded groups and departments (10), each running their own Condor pool(s), which federate their resources (12).

• Coordinated by the Cambridge eScience Centre (CeSC), but no overall control.

• Been running now for ~2.5 years, ~70+ users.

• Currently have ~950 processors/cores available.

• “All” linux (various), mostly x86_64, running 24/7.

• Mostly Dell PowerEdge 1950 (like HPCF), four cores with 8GB.

• Around 2M CPU hours to date.


Some details

• Pools run the latest stable version of Condor (currently 6.8.6).

• All machines get an (extra) IP address in a CUDN-only routeable range for Condor.

• Each pool sets its own policies, but these must be visible to other users of CamGrid.

• Currently we see vanilla, standard and parallel (MPI) universe jobs.

• Users get accounts on a machine in their local pool; jobs are then distributed around the grid by Condor using its flocking mechanism.

• MPI jobs on single SMP machines have proved very useful.
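As an illustration of the submit-and-flock workflow above, here is a minimal Python sketch that renders a Condor submit description for a small parameter sweep, plus the `FLOCK_TO` line a submit host would need in order to send idle jobs to other pools. The submit commands and `FLOCK_TO` are standard Condor; the executable name, file names and host names are hypothetical, and this is not CamGrid's actual configuration. File transfer is switched on because flocked jobs cannot assume a shared filesystem.

```python
def render_sweep_submit(executable, parameters, universe="vanilla"):
    """Render a Condor submit description that queues one job per parameter."""
    lines = [
        f"universe = {universe}",
        f"executable = {executable}",
        # Flocked jobs may land in a pool with no shared filesystem,
        # so ask Condor to ship files to and from the execute node.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "log = sweep.log",  # hypothetical log file name
    ]
    for index, value in enumerate(parameters):
        lines += [
            f"arguments = {value}",
            f"output = run_{index}.out",  # hypothetical output naming scheme
            f"error = run_{index}.err",
            "queue",
        ]
    return "\n".join(lines) + "\n"


def render_flocking_config(remote_central_managers):
    """Render the submit-side flocking line listing other pools' central managers."""
    return "FLOCK_TO = " + ", ".join(remote_central_managers)


if __name__ == "__main__":
    print(render_sweep_submit("./simulate", [0.1, 0.2, 0.3]))
    # Host names below are placeholders, not real CamGrid machines.
    print(render_flocking_config(["cm.pool-a.example", "cm.pool-b.example"]))
```

The remote pools must also list the submit host in their own `FLOCK_FROM` setting before they will accept the flocked jobs.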


NTE of Ag3[Co(CN)6] with SMP/MPI sweep


Monitoring Tools

• A number of web based tools provided to monitor the state of the grid and of jobs.

• CamGrid is based on trust, so must make sure that machines are fairly configured.

• The university gave us £450k (~$950k) to buy new hardware; need to ensure that it’s online as promised.


CamGrid’s file viewer

• Standard universe uses RPCs to echo I/O operations back to submit host.

• What about other universes? How can I check the health of my long running simulation?

• We’ve provided our own facility, which involves an agent installed on each execute node and accessed via a web interface.

• Works with vanilla and parallel (MPI) jobs.

• Requires local sysadmins to install and run it.
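The slides do not show how the agent is implemented. As a rough sketch of the core operation such an agent would need, the Python function below returns the tail of a file inside a job's scratch directory and refuses any path that resolves outside it. The function name, arguments and directory layout are assumptions made for illustration only.

```python
import os


def tail_job_file(scratch_root, relative_path, max_bytes=4096):
    """Return up to the last max_bytes of a file under the job's scratch directory."""
    root = os.path.realpath(scratch_root)
    target = os.path.realpath(os.path.join(root, relative_path))
    # Reject "../" tricks, absolute paths and symlinks that leave the scratch area.
    if os.path.commonpath([root, target]) != root:
        raise PermissionError("path escapes the job's scratch directory")
    with open(target, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(0, size - max_bytes))
        # Simulation output may contain partial or non-UTF-8 bytes; never fail on them.
        return handle.read().decode("utf-8", errors="replace")
```

A web front end would call something like this on the execute node and display the returned text, so a user can check on a long-running vanilla or MPI job without waiting for it to finish.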


CamGrid’s file viewer


Checkpointable vanilla universe

• Standard universe is fine, if you can link to Condor’s libraries (Pete Keller – “getting harder”).

• Investigating using BLCR (Berkeley Lab Checkpoint/Restart) kernel modules for linux.

• Uses kernel resources, and can thus restore resources that user-level libraries cannot.

• Supported by some flavours of MPI (late LAM, OpenMPI).

• The idea was to use Parrot’s user-space FS to wrap a vanilla job and save the job’s state on a chirp server.

• However, currently Parrot breaks some BLCR functionality.
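For orientation, the basic BLCR cycle that a checkpointable vanilla job would be wrapped around can be sketched as three commands: launch under `cr_run`, checkpoint with `cr_checkpoint`, and resume with `cr_restart`. The Python below only builds the command lines and does not run them. It assumes BLCR's default context file name of `context.<pid>` and the `--term` option as I understand them from the BLCR user guide; check both against the man pages of the installed version. Neither Parrot nor the chirp server is modelled here.

```python
def blcr_cycle(job_argv, pid):
    """Build the argv lists for one launch / checkpoint / restart cycle.

    job_argv: the job's own command line, e.g. ["./simulate", "input.dat"]
    pid:      process ID of the running job at checkpoint time
    """
    context_file = f"context.{pid}"  # assumed BLCR default checkpoint file name
    return {
        # Start the job with BLCR's checkpoint support loaded.
        "launch": ["cr_run"] + list(job_argv),
        # Save the job's state, then terminate it (e.g. before eviction).
        "checkpoint": ["cr_checkpoint", "--term", str(pid)],
        # Resume from the saved state, possibly on another execute node.
        "restart": ["cr_restart", context_file],
    }


if __name__ == "__main__":
    # Hypothetical job and PID, for illustration only.
    for stage, argv in blcr_cycle(["./simulate", "input.dat"], pid=1234).items():
        print(stage, " ".join(argv))
```

In the scheme described above, the context file is the state that would be saved to the chirp server, so that a restart can happen on a different machine.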


What doesn’t work so well…

• Each pool is run by local sysadmin(s), but these are of variable quality/commitment.

• We’ve set up mailing lists for users and sysadmins: hardly ever used (don’t want to advertise ignorance?).

• Some pools have used SRIF hardware to redeploy machines committed earlier. Naughty…

• Don’t get me started on merger with UCS’s central resource (~400 nodes).


But generally we’re happy bunnies

• “CamGrid was an invaluable tool allowing us to reliably sample the large parameter space in a reasonable amount of time. A half-year's worth of CPU running was collected in a week.”

-- Dr. Ben Allanach

• “CamGrid was essential in order for us to be able to run the different codes in real time.”

-- Prof. Fernando Quevedo

• “I needed to run simulations that took a couple of weeks each. Without access to the processors on CamGrid, it would have taken a couple of years to get enough results for a publication.”

-- Dr. Karen Lipkow


Current issues

• Protecting resources on execute nodes; Condor seems lax at this, e.g. memory, disk space.

• Increasingly interested in VMs (i.e. Xen). Some pools run it, but not concerted (effects on SMP MPI jobs?).

• Green issues: will we be forced to buy WoL cards in the near future?

• Altruistic computing: a recent wave of interest for BOINC/backfill jobs for medical, protein folding, etc., but who runs the jobs? Audit trail?

• How do we interact with outsiders? Ideally keep it to Condor (some Globus, toyed with VPNs). Most CamGrid stakeholders just dish out conventional, ssh-accessible accounts.
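On the first issue, protecting execute nodes, one stopgap is to start each job through a small wrapper that applies operating-system resource limits before the job runs. The sketch below uses Python's standard `resource` module on Linux. It is not a Condor feature and not something the slides say CamGrid deployed, and the function name and limit values are hypothetical. Note that `RLIMIT_FSIZE` caps the size of each individual file, not total disk usage.

```python
import resource
import subprocess
import sys


def run_with_limits(argv, max_address_space_bytes=None, max_file_bytes=None, cwd=None):
    """Run argv with optional memory and per-file size limits; return its exit code."""

    def apply_limits():
        # Runs in the child process just before the job is executed.
        if max_address_space_bytes is not None:
            resource.setrlimit(
                resource.RLIMIT_AS,
                (max_address_space_bytes, max_address_space_bytes),
            )
        if max_file_bytes is not None:
            resource.setrlimit(
                resource.RLIMIT_FSIZE,
                (max_file_bytes, max_file_bytes),
            )

    return subprocess.run(argv, preexec_fn=apply_limits, cwd=cwd).returncode


if __name__ == "__main__":
    # Hypothetical limits: 2 GiB of address space, 1 GiB per output file.
    sys.exit(run_with_limits(sys.argv[1:], 2 * 1024**3, 1024**3))
```

A job that exceeds either limit fails with an error instead of exhausting the node's memory or filling its disk with a single runaway file.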


Finally…

• CamGrid: http://www.escience.cam.ac.uk/projects/camgrid/

• Contact: [email protected]

Questions?