Using the Grid for Astronomy
Roy Williams, Caltech
Enzo Case Study
Simulated dark matter density in early universe
• N-body gravitational dynamics (particle-mesh method)
• Hydrodynamics with PPM and ZEUS finite-difference
• Up to 9 species of H and He
• Radiative cooling
• Uniform UV background (Haardt & Madau)
• Star formation and feedback
• Metallicity fields
Adaptive Mesh Refinement (AMR)
• multilevel grid hierarchy
• automatic, adaptive, recursive
• no limits on depth, complexity of grids
• C++/F77
• Bryan & Norman (1998)
Source: J. Shalf
Distributed Computing Zoo
• Grid Computing
  – Also called High-Performance Computing
  – Big clusters, big data, big pipes, big centers
  – Globus backbone, which now includes Services and Gateways
  – Decentralized control
• Cluster Computing
  – local interconnect between identical CPUs
• Peer-to-Peer (Napster, Kazaa)
  – Systems for sharing data without a central server
• Internet Computing
  – Screensaver cycle scavenging
  – eg SETI@home, Einstein@home, ClimatePrediction.net, etc
• Access Grid
  – A videoconferencing system
• Globus
  – A popular software package to federate resources into a grid
• TeraGrid
  – A $150M award from NSF to the supercomputer centers (NCSA, SDSC, PSC, etc)
• The World Wide Web provides seamless access to information that is stored in many millions of different geographical locations
• In contrast, the Grid is an emerging infrastructure that provides seamless access to computing power and data storage capacity distributed over the globe.
What is the Grid?
• “Grid” was coined by Ian Foster and Carl Kesselman in “The Grid: Blueprint for a New Computing Infrastructure”.
• Analogy with the electric power grid: plug-in to computing power without worrying where it comes from, like a toaster.
• The idea has been around under other names for a while (distributed computing, metacomputing, …).
• Technology is in place to realise the dream on a global scale.
What is the Grid?
The GRID middleware:
• Finds convenient places for the scientist’s “job” (computing task) to be run
• Optimises use of the widely dispersed resources
• Organises efficient access to scientific data
• Deals with authentication to the different sites
• Interfaces to local site authorisation / resource allocation
• Runs the jobs
• Monitors progress
• Recovers from problems
… and … tells you when the work is complete and transfers the result back!
What is Middleware?
Grid as Federation
• Independent centers: flexibility
• Unified interface: power and strength
• Large/small state compromise
Three Big Ideas of Grid
• Federation and Uniformity
  – independent management; uniform face; open standards
• Trust and Security
  – access policy; uniform authentication/authorization
• Distance doesn’t matter
  – 20 Mbyte/sec, global file system
Grid projects in the world
• DOE Science Grid
• NSF National Virtual Observatory
• NSF GriPhyN/iVDGL
• DOE Particle Physics Data Grid
• NSF TeraGrid
• DOE Earth Systems Grid
• NEESGrid
• DOH BIRN
• UK e-Science Grid
• EUROGRID
• DataGrid (CERN, ...)
• EuroGrid (Unicore)
• DataTag (CERN, …)
• GridLab (Cactus Toolkit)
• CrossGrid (Infrastructure Components)
TeraGrid Wide Area Network
TeraGrid Components
• Compute hardware
  – Intel/Linux clusters, Alpha SMP clusters, POWER4 cluster, …
• Large-scale storage systems
  – hundreds of terabytes for secondary storage
• Very high-speed network backbone
  – bandwidth for rich interaction and tight coupling
• Grid middleware
  – Globus, data management, …
• Next-generation applications
TeraGrid Resources
Sites: ANL/UC, Indiana U, NCSA, ORNL, PSC, Purdue U, SDSC, TACC
• Compute resources: 1 TFlop, 1 TFlop, 35 TFlop, 16 TFlop, 18 TFlop, 24 TFlop, 8 TFlop
• Online storage: 20 TByte, 6 TByte, 700 TByte, 200 TByte, 28 TByte, 1000 TByte, 50 TByte
• Archival storage: 150 TByte, 5000 TByte, 2400 TByte, 36 TByte, 7200 TByte, 2000 TByte
• Global FS: 220 TByte, 20 TByte, 220 TByte
• Data collections: Yes at 5 sites
• Visualization: Yes at 5 sites
• Instruments: Yes at 2 sites
• Network (Gb/s, hub): 30 CHI, 10 CHI, 30 CHI, 10 ATL, 30 CHI, 10 CHI, 30 LA, 10 CHI
The TeraGrid Vision
Distributing the resources is better than putting them at one site
• Build new, extensible, grid-based infrastructure
  – New hardware, new networks, new software, new practices, new policies
• Leverage homogeneity
  – Run single job across entire TeraGrid
  – Move executables between sites
• Catch-phrase: Open, Deep and Wide
  – Open to US science community
  – Heroic computing possible by programming Unix
  – Easy to use through science gateways
TeraGrid Allocations Policies
• Any US researcher can request an allocation
  – http://www.teragrid.org
Wide Variety of Usage Scenarios
• Tightly coupled simulation jobs storing vast amounts of data, performing visualization remotely as well as making data available through online collections (ENZO)
• Thousands of independent jobs using data from a distributed data collection (NVO)
• Science Gateways – "not a Unix prompt"!
  – from web browser with security
  – SOAP client for scripting
  – from application, eg IRAF, IDL
Running jobs
Account Security
• Username/Password
  – weak security, too many holes
  – deprecated in many places
• SSH keys
  – put public key on remote machine
  – serves as single sign-on
• X.509 Certificates
  – Proves identity
  – Flexible
Ways to Submit a Job
1. Directly to PBS Batch Scheduler
  – Simple, scripts are portable among PBS TeraGrid clusters
2. Globus common batch script syntax
  – Scripts are portable among other grids using Globus
3. Condor-G = Condor + Globus
4. Use a science gateway, eg Nesssi
  – specific tasks, easy to use
PBS Batch Submission
• Single executable to be run on a single remote machine
  – login to a head node, submit to queue
• Direct, interactive execution
  – mpirun -np 16 ./a.out
• Through a batch job manager
  – qsub my_script
  – where my_script describes executable location, runtime duration, redirection of stdout/err, mpirun specification, …
• ssh tg-login.sdsc.teragrid.org
  – qsub flatten.sh -v "FILE=f544"
  – qstat or showq
  – ls *.dat
  – pbs.out, pbs.err files
Remote submission
• Through Globus
  – globusrun -r [some-teragrid-head-node].teragrid.org/jobmanager -f my_rsl_script
  – where my_rsl_script describes the same details as in the qsub my_script!
• Through Condor-G
  – condor_submit my_condor_script
  – where my_condor_script describes the same details as the globus my_rsl_script!
Condor-G
A Grid-enabled version of Condor that provides robust job management for Globus clients.
– Robust replacement for globusrun
– Provides extensive fault-tolerance
– Can provide scheduling across multiple Globus sites
– Brings Condor’s job management features to Globus jobs
Condor DAGMan
• Manages workflow interdependencies
• Each task is a Condor description file
• A DAG file controls the order in which the Condor files are run
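For concreteness, a minimal sketch of a two-step DAG written and submitted from Python; the submit-file names flatten.sub and mosaic.sub are hypothetical, while the DAG syntax and the condor_submit_dag command are standard DAGMan:

import os

# flatten must finish before mosaic starts
dag = """JOB flatten flatten.sub
JOB mosaic mosaic.sub
PARENT flatten CHILD mosaic
"""
open("pipeline.dag", "w").write(dag)
os.system("condor_submit_dag pipeline.dag")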
Cluster Supercomputer
[Diagram: user; login node; job submission and queueing (Condor, PBS, ...); 100s of nodes; purged /scratch; parallel file system; /home (backed-up); metadata node; parallel I/O; global file system]
MPI parallel programming
• Each node runs same program
  – first finds its number (“rank”)
  – and the number of coordinating nodes (“size”)
• Laplace solver example
Algorithm:
• Each value becomes average of neighbor values
[Diagram: domain split between node 0 and node 1]
Serial:
• for each point, compute average
• remember boundary conditions
Parallel:
• Run algorithm with ghost points
• Use messages to exchange ghost points
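A minimal sketch of the ghost-point exchange for the Laplace solver, assuming mpi4py, numpy, and a 1-D decomposition by rows; production codes such as Enzo are C++/F77, so this only illustrates the pattern:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # this node's number
size = comm.Get_size()              # number of coordinating nodes

nrows, ncols = 100, 400             # rows owned by this rank (illustrative sizes)
u = np.zeros((nrows + 2, ncols))    # two extra ghost rows, one above, one below

up   = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for it in range(1000):
    # exchange ghost rows with the neighbouring ranks
    comm.Sendrecv(u[1, :],  dest=up,   recvbuf=u[-1, :], source=down)
    comm.Sendrecv(u[-2, :], dest=down, recvbuf=u[0, :],  source=up)
    # Jacobi update: each interior value becomes the average of its neighbors
    u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])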
Globus
• Security
  – Single sign-on, certificate handling, CAS, MyProxy
• Execution Management
  – Remote jobs: GRAM and Condor-G
• Data Management
  – GridFTP, reliable FT, 3rd party FT
• Information Services
  – aggregating information from federated grid resources
• Common Runtime Components
  – web services through GT4
The following is a personal opinion; it is NOT the position of the NVO:
• Globus is a complex and difficult installation
• Globus needs frequent maintenance and updates
• Globus is monolithic (all or nothing)
Data storage
Typical types of HPC storage needs (type, typical size, use, requirements):
• Type 1: Home filesystem. Typical size 1-10 TB. A lot of small files, high metadata rates, interactive use.
• Type 2 (optional): Local scratch space. Typical size 100s of GB per CPU. High-bandwidth data cache.
• Type 3: Global filesystem. Typical size 10-100 TB. High aggregate bandwidth; concurrent access to data; moderate latency tolerated.
• Type 4: Archival storage. Typical size 100 TB to PB. Large storage pools with low cost; used for long-term storage of results.
Disk Farms (datawulf)
• Large files striped over disks
• Management node for file creation, access, ls, etc.
• Homogeneous Disk Farm (= parallel file system)
[Diagram: metadata node; parallel I/O to the parallel file system]
Parallel File System
• Large files are striped
  – very fast parallel access
• Medium files are distributed
  – Stripes do not all start in the same place
• Small files choke the PFS manager
  – Either containerize
  – or use blobs in a database
  – not a file system anymore: a pool of 10^8 blobs with logical names
Storage Resource Broker (SRB)
• Single logical namespace while accessing distributed archival storage resources
• Effectively infinite storage
• Data replication
• Parallel transfers
• Interfaces: command-line, API, SOAP, web/portal
Storage Resource Broker (SRB): Virtual Resources, Replication
[Diagram: workstation with SRB client (cmdline or API); NCSA; SDSC; storage resources hpss-sdsc, sfs-tape-sdsc, hpss-caltech, …]
Storage Resource Broker (SRB): Virtual Resources, Replication
[Diagram: browser, SOAP client, or command-line access, with a certificate, to virtual resources: casjobs at JHU, tape at SDSC, myDisk]
• Similar to VOSpace concept
• File may be replicated
• File comes with metadata
  – ... may be customized
Containerizing
• Shared metadata
• Easier for bulk movement
[Diagram: files in a container]
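An illustrative sketch of containerizing in Python, assuming the many small files sit in a local directory named cutouts; the directory and container names are made up:

import os
import tarfile

# bundle many small files into one container, so bulk movement is a single
# large transfer and per-file metadata overhead is avoided
container = tarfile.open("cutouts_container.tar", "w")
for name in sorted(os.listdir("cutouts")):
    container.add(os.path.join("cutouts", name))
container.close()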
Data intensive computing with NVO services
Two Key Ideas for Fault-Tolerance
• Transactions
  – No partial completion: either all or nothing
  – eg copy to a tmp filename, then mv to correct file name
• Idempotent
  – “Acting as if done only once, even if used multiple times”
  – Can run the script repeatedly until finished
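A minimal sketch of both ideas combined, assuming a worker function that writes one output file; shutil.copy stands in for the real processing, and the expected size is the one checked by the driver later in this talk:

import os
import shutil

EXPECTED_SIZE = 1109404800              # size of a correctly made target file

def make_target(src, dst, worker=shutil.copy):
    # idempotent: if the target already exists and looks right, do nothing,
    # so the script can be run repeatedly until everything is finished
    if os.path.exists(dst) and os.path.getsize(dst) == EXPECTED_SIZE:
        return
    # transactional: write to a temporary name, then rename, so a crash
    # never leaves a half-written file under the real target name
    tmp = dst + ".tmp"
    worker(src, tmp)
    os.rename(tmp, dst)                 # atomic on the same filesystem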
DPOSS flattening
• 2650 x 1.1 Gbyte files
• Cropping borders
• Quadratic fit and subtract (sketched below)
• Virtual data
[Diagram: source and target images]
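A minimal sketch of the "quadratic fit and subtract" step, assuming the plate is already in a 2-D numpy array; the real flattening code may differ, and for a full 23552 x 23552 plate one would fit to a subsample of pixels rather than all of them:

import numpy as np

def flatten(image):
    # fit a quadratic surface a + b*x + c*y + d*x^2 + e*x*y + f*y^2
    # by least squares, then subtract it from the image
    ny, nx = image.shape
    yy, xx = np.mgrid[0:ny, 0:nx]
    x = xx.ravel().astype(float)
    y = yy.ravel().astype(float)
    z = image.ravel().astype(float)
    A = np.column_stack([np.ones_like(x), x, y, x*x, x*y, y*y])
    coeffs = np.linalg.lstsq(A, z, rcond=None)[0]
    surface = A.dot(coeffs).reshape(ny, nx)
    return image - surface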
Driving the Queues
import os

# inputDirectory and outputDirectory are set earlier in the full script
def filetime(path):
    # modification time of a file (helper assumed by the original driver)
    return os.path.getmtime(path)

for f in os.listdir(inputDirectory):
    # if the target file exists, with the right size and age, then we keep it
    file = inputDirectory + "/" + f
    ofile = outputDirectory + "/" + f
    if os.path.exists(ofile):
        osize = os.path.getsize(ofile)
        if osize != 1109404800:
            print " -- wrong target size, remaking", osize
        else:
            time_tgt = filetime(ofile)
            time_src = filetime(file)
            if time_tgt < time_src:
                print " -- target older than source, remaking"
            else:
                print " -- already have target file"
                continue
    cmd = "qsub flat.sh -v \"FILE=" + f + "\""
    print " -- submitting batch job: ", cmd
    os.system(cmd)
Here is the driver that makes and submits jobs
PBS script
#!/bin/sh
#PBS -N dposs
#PBS -V
#PBS -l nodes=1
#PBS -l walltime=1:00:00
cd /home/roy/dposs-flat/flat
./flat \
-infile /pvfs/mydata/source/${FILE}.fits \
-outfile /pvfs/mydata/target/${FILE}.fits \
-chop 0 0 1500 23552 \
-chop 0 0 23552 1500 \
-chop 0 22052 23552 23552 \
-chop 22052 0 23552 23552 \
-chop 18052 0 23552 4000
A PBS script. Can do: qsub script.sh -v "FILE=f345"
GET services from Python
import urllib
hyperatlasURL = self.hyperatlasServer + "/getChart?atlas=" + atlas \
+ "&RA=" + str(center1) + "&Dec=" + str(center2)
stream = urllib.urlopen(hyperatlasURL)
# result is a tab-separated line, so use split() to tokenize
tokens = stream.readline().split('\t')
print "Using page ", tokens[0], " of atlas ", atlas
self.scale = float(tokens[1])
self.CTYPE1 = tokens[2]
self.CTYPE2 = tokens[3]
rval1 = float(tokens[4])
rval2 = float(tokens[5])
This code uses a service to find the best hyperatlas page for a given sky location
VOTable parser in Python
import urllib
import xml.dom.minidom
stream = urllib.urlopen(SIAP_URL)
doc = xml.dom.minidom.parse(stream)
# Make a dictionary mapping each column's UCD to its column index
col_ucd_dict = {}
col_counter = 0
for XML_TABLE in doc.getElementsByTagName("TABLE"):
    for XML_FIELD in XML_TABLE.getElementsByTagName("FIELD"):
        col_ucd = XML_FIELD.getAttribute("ucd")
        col_ucd_dict[col_ucd] = col_counter
        col_counter += 1
urlColumn = col_ucd_dict["VOX:Image_AccessReference"]
formatColumn = col_ucd_dict["VOX:Image_Format"]
raColumn = col_ucd_dict["POS_EQ_RA_MAIN"]
deColumn = col_ucd_dict["POS_EQ_DEC_MAIN"]
From a SIAP URL, we get the XML, and extract the columns that have the image references, image format, and image RA/Dec
VOTable parser in Python
import xml.dom.minidom
table = []
for XML_TABLE in doc.getElementsByTagName("TABLE"):
    for XML_DATA in XML_TABLE.getElementsByTagName("DATA"):
        for XML_TABLEDATA in XML_DATA.getElementsByTagName("TABLEDATA"):
            for XML_TR in XML_TABLEDATA.getElementsByTagName("TR"):
                row = []
                for XML_TD in XML_TR.getElementsByTagName("TD"):
                    data = ""
                    for child in XML_TD.childNodes:
                        data += child.data
                    row.append(data)
                table.append(row)
Table is a list of rows, and each row is a list of table cells
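As an illustrative next step (not part of the original slides), one might walk the rows and fetch every FITS image the SIAP service returned, using the column indices found earlier; the output filename pattern is made up:

import urllib

for row in table:
    if row[formatColumn] == "image/fits":
        url = row[urlColumn]
        name = "cutout_%s_%s.fits" % (row[raColumn], row[deColumn])
        # download the image referenced by this row
        urllib.urlretrieve(url, name)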
Science Gateways
Grid Impediments
Learn Globus
Learn MPI
Learn PBS
Port code to Itanium
Get certificate
Get logged in
Wait 3 months for account
Write proposal
and now do some science....
A better way: Graduated Security for Science Gateways
• Web form - anonymous
  – some science....
• Register - logging and reporting
  – more science....
• Authenticate X.509 - browser or cmd line
  – big-iron computing....
• Write proposal - own account
  – power user
2MASS Mosaicking portal
An NVO-TeraGrid project
Caltech IPAC
Three Types of Science Gateways
• Web-based Portals
  – User interacts with community-deployed web interface
  – Runs community-deployed codes
  – Service requests forwarded to grid resources
• Scripted service call
  – User writes code to submit and monitor jobs
• Grid-enabled applications
  – Application programs on users' machines (eg IRAF)
  – Also runs program on grid resource
Nesssi: Secure Web services for astronomy
[Architecture diagram: client; nesssi web portal; nesssi; compute nodes; certificate repository; certificate policies; web form / SOAP / http / queue; fetch proxy; select user account; sandbox storage; open http]
nesssiServer.dpossMosaic.mosaic (“-ra 49.1 -dec 60.1 -rawidth 0.5 -decwidth 0.5 -filt f -bgcorr 0”)
Mosaic service
nesssiServer.hyperatlas.run (“-bandpass z1 -ra 170.08 -dec 13.275 -rawidth 1.0 -decwidth 1.0 “)
Coadd service
Cutout Service
nesssiServer.cutout.run(sessionID, "-surveys PQ:gr,PQ:gi,PQ:z1,PQ:z2,SDSS:r,SDSS:i,SDSS:z,2MASS:k,2MASS:h -size 64")
cutouts from Palomar-Quest, SDSS, 2MASS of sources from the Veron quasar catalog
Amazon Grid (who will pay?)
Amazon Grid
• Simple Storage Service
• Write, read, and delete objects.
• Each object has a unique, developer-assigned key.
• Authentication mechanisms. Objects can be private or public; rights can be granted to specific users.
• REST and SOAP interfaces.
• Default download protocol is HTTP. BitTorrent(TM) also available.
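For concreteness, a sketch of the write/read/delete cycle using the boto Python library; boto, the bucket name, and the key name are not part of the talk and are only illustrative, and credentials are assumed to come from the environment:

import boto
from boto.s3.key import Key

conn = boto.connect_s3()                          # uses AWS credentials from the environment
bucket = conn.create_bucket("nvo-images-example") # hypothetical bucket name
k = Key(bucket)
k.key = "dposs/f544.fits"                         # unique, developer-assigned key
k.set_contents_from_filename("f544.fits")         # write
k.get_contents_to_filename("f544_copy.fits")      # read
k.delete()                                        # delete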
Amazon Grid
• Elastic Compute Cloud
• Create an Amazon Machine Image (AMI) containing your applications, libraries, data and associated configuration settings.
• Upload the AMI into Amazon Simple Storage Service.
• Configure security and network access.
• Start, terminate, and monitor as many instances of your AMI as needed.
• Pay for the instance hours and bandwidth that you actually consume.
• $0.10 per instance-hour consumed
• $0.20 per GB of data transferred outside of Amazon
• $0.15 per GB-month of Amazon S3 storage
Amazon Grid
• Simple Queue Service
• Move data between distributed application components performing different tasks, without losing messages or requiring each component to be always available.
• Unlimited number of queues, unlimited number of messages.
• New messages can be added at any time.
• A computer can check a queue at any time for messages waiting to be read.
• REST, SOAP and query interfaces.
• The queue creator determines which other users can write to or read from the queue.
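A sketch of the producer/consumer pattern with boto; again, boto and the queue name are illustrative assumptions, not part of the talk:

import boto
from boto.sqs.message import Message

conn = boto.connect_sqs()                   # uses AWS credentials from the environment
q = conn.create_queue("image-jobs-example") # hypothetical queue name

m = Message()
m.set_body("FILE=f544")                     # e.g. the name of a plate to process
q.write(m)                                  # producer adds a message

msgs = q.get_messages(1)                    # consumer checks the queue later
if msgs:
    print(msgs[0].get_body())
    q.delete_message(msgs[0])               # remove the message once processed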