“UCF” Computing Capabilities at UNM HPC
Timothy L. Thomas, UNM Dept of Physics and Astronomy
Analysis Workshop, 8/8/2002, Charles F. Maguire, Vanderbilt
Run2 Remote Site PISA Simulation Statistics
Vanderbilt Farm: Projects 1 (hadrons), 6 (deuterons), 9 (pizero), and 10 (electrons); 10,000 CPU-hours, >600 GBytes, ~1.5 person-months of effort
UNM Farm: Projects 0 (EMCal HIJING), 7 (EMCal + Muon HIJING), 9 (pizero); 33,000 CPU-hours, >500 GBytes, ~1.5 person-months of effort
LLNL Farm: Projects 4 (EMCal HIJING), 5 (HBT), 23 (high-pT pizero); ~14,000 CPU-hours, >300 GBytes, ~1.5 person-months
SUNYSB Farm: Projects 11, 13, 14, 17, 23 (electron working group requests); ~1,500 CPU-hours, 100 GBytes (?), ~0.5 person-month
WIS Farm: Project 8 (Phi->K+K-, Phi->e+e-); ~6,000 CPU-hours, ~500 GBytes, ~1.0 person-month
Grand totals: ~65,000 CPU-hours (90 CPU-months), ~2 TBytes, ~6 person-months
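A quick cross-check of these totals (a minimal sketch, not from the original slides; it assumes a 30-day, 24-hour CPU-month and uses the per-site CPU-hour figures quoted above):

```python
# Minimal sketch: re-derive the grand totals from the per-site figures above.
# Assumption: 1 CPU-month = 30 days * 24 hours.
cpu_hours = {"Vanderbilt": 10_000, "UNM": 33_000, "LLNL": 14_000,
             "SUNYSB": 1_500, "WIS": 6_000}

total_hours = sum(cpu_hours.values())     # ~64,500, i.e. ~65,000 CPU-hours
cpu_months  = total_hours / (24 * 30)     # ~90 CPU-months
print(f"~{total_hours:,} CPU-hours (~{cpu_months:.0f} CPU-months)")
```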
I have a 200K SU (150K LL CPU hour) grant from the NRAC of the NSF/NCSA, with which UNM HPC (“AHPCC”) is affiliated.
Peripheral Data Vs Simulation
Simulation: Muons From Central Hijing (QM02 Project07)
Data: Centrality by Perp > 60
(Stolen from Andrew…)
Simulated Decay Muons
QM’02 Project07 PISA files (Central HIJING)
Closest cuts possible from PISA file to match data (PT parent > 1 GeV/c, Theta_orig parent 155-161)
Investigating possibility of keeping only muon and parent hits for reconstruction.
17,100 total events distributed over Z = ±10, ±20, ±38
More events available, but only a factor for the smallest error bar
Zeff ~75 cm
"(IDPART==5 || IDPART==6) && IDPARENT > 6 && IDPARENT < 13 && PTHE_PRI > 155 && PTHE_PRI < 161 && IPLANE == 1 && IARM == 2 && LASTGAP > 2002 && PTOT_PRI*sin(PTHE_PRI*acos(0)/90.) > 1."
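Purely as an illustration, a selection string like the one above could be applied to a PISA ntuple with PyROOT roughly as sketched below; the tree name and file name are placeholders (assumptions), not taken from the slides:

```python
# Sketch: apply the decay-muon selection quoted above to a PISA ntuple via PyROOT.
# The tree name ("pisa") and input file name are placeholders, not from the slides.
import ROOT

cut = ("(IDPART==5 || IDPART==6) && IDPARENT > 6 && IDPARENT < 13 && "
       "PTHE_PRI > 155 && PTHE_PRI < 161 && IPLANE == 1 && IARM == 2 && "
       "LASTGAP > 2002 && PTOT_PRI*sin(PTHE_PRI*acos(0)/90.) > 1.")

chain = ROOT.TChain("pisa")                    # placeholder tree name
chain.Add("qm02_project07_muons.root")         # placeholder file name
n_pass = chain.Draw("PTOT_PRI", cut, "goff")   # "goff": count entries, no graphics
print("entries passing the decay-muon cuts:", n_pass)
```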
Now at UNM HPC:
• PBS
• Globus 2.2.x
• Condor-G / Condor
• (GDMP)
…all supported by HPC staff.
In Progress:
A new 1.2 TB RAID 5 disk server, to host:
• AFS cache / PHENIX software
• ARGO file catalog (PostgreSQL)
• Local Objectivity mirror
• Globus 2.2.x (GridFTP and more…)
Pre-QM2002 experience with globus-url-copy…
• Easily saturated UNM bandwidth limitations (as they were at that time)
• PKI infrastructure and sophisticated error-handling are a real bonus over bbftp. (One bug, known at the time, is being / has been addressed.)
(Plot at left: transfer rate in KB/sec, using 10 parallel streams)
Multi-Jet cross section (theory) calculations, run using Condor(/PVM)…
Three years of accumulated CPU time on desktop (MOU) machines at HPCERC and at the University of Wisconsin.
Very CPU-intensive calculations… 6- and 9-dimensional Monte Carlo integrations: a typical job runs for a week and produces only about 100 KB of output histograms, such as those displayed here.
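For a sense of the structure of such a job, here is a minimal sketch of a d-dimensional Monte Carlo integration; the toy integrand stands in for the actual multi-jet matrix element, which is not shown in these slides:

```python
# Toy sketch of a d-dimensional Monte Carlo integration over the unit hypercube.
# The real multi-jet cross-section integrand is not reproduced here.
import random

def mc_integrate(f, dim, n_samples):
    """Estimate the integral of f over [0, 1]^dim by simple sampling."""
    total = 0.0
    for _ in range(n_samples):
        x = [random.random() for _ in range(dim)]
        total += f(x)
    return total / n_samples   # hypercube volume is 1

# Toy 6-dimensional integrand: the integral of sum(x_i^2) over [0,1]^6 is 6/3 = 2.
estimate = mc_integrate(lambda x: sum(xi * xi for xi in x), dim=6, n_samples=200_000)
print(f"MC estimate: {estimate:.3f} (exact value: 2.0)")
```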
LLDIMU.HPC.UNM.EDU
RAID op system issues
• Easy re-installation / update of the op sys
• Grub or Lilo? (MBR or /boot?)
• Machine has an IDE CDROM (but not a burner)!!!
• Rescue CDs and/or floppies…
• Independence of RAID array
  o (1.5 hours for RAID 5 verification step)
  o Should install ext3 on the RAID.
• Partitioning of the system disk:
  o Independence of /home area
  o Independence of /usr/local area?
  o Jonathan says: Linux can’t do more than a 2 GB swap partition
  o Jonathan says: / /usr/local/ /home/ (me: /home1/ /home2/ …?)
• NFS issues…
  o Synchronize UID/GIDs between RAID server and LL.
RAID op system issues
Compilers and glibc…
RAID op system issues
File systems…
• What quotas?
• Ext3? (Quotas working OK?)
• ReiserFS? (Need special kernel modules for this?)
RAID op system issues
Support for the following apps:
• RAID software
• Globus…
• PHENIX application software
• Objectivity
  o gcc 2.95.3
• PostgreSQL
• OpenAFS
• Kerberos 4
RAID op system issues
Security issues…
• IP#: fixed or DHCP?
• What services to run or avoid?
  o NFS…
• Tripwire or equiv…
• Kerberos (for OpenAFS)…
• Globus…
• ipchains firewall rules; /etc/services; /etc/xinetd config; etc…
RAID op system issues
Application-level issues…
• Which framework? Both?
• Who maintains the framework, and how can this job be properly divided up among locals?
• SHOULD THE RAID ARRAY BE PARTITIONED, a la the PHENIX counting house buffer boxes’ /a and /b file systems?
Resources
Filtered events can be analyzed, but not ALL PRDF events.
Many triggers overlap.
Assume 90 KByte/event and 0.1 GByte/hour/CPU.
Trigger                     Lumi [nb^-1]   #Events [M]   Size [GByte]   CPU [hour]   100-CPU [day]   Signal (mu-mu / mu / e-mu)
ERT_electron                193            13.0          1170           11700        4.9             1
MUIDN_1D&BBCLL1             238            34.0          3060           30600        12.8            1 1 1
MUIDN_1D&MUIDS_1D&BBCLL1    59             0.2           18             180          0.1             1
MUIDN_1D1S&BBCLL1           254            4.8           432            4320         1.8             1 1 1
MUIDN_1D1S&NTCN             230            18.0          1620           16200        6.8             1
MUIDS_1D&BBCLL1             274            10.7          963            9630         4.0             1 1 1
MUIDS_1D1S&BBCLL1           293            1.3           117            1170         0.5             1
MUIDS_1D1S&NTCS             278            5.0           450            4500         1.9             1
ALL PRDF                    350            6600.0        33,000         330,000      137.5
Rough calculation of real-data processing (I/O-intensive) capabilities:
10 M events, PRDF-to-{DST+x}, both mut & mutoo; assume 3 sec/event (*1.3 for LL), ~200 KB/event.
One pass: 7 days on 50 CPUs (25 boxes), using 56% of LL local network capacity.
My 200K “SU” (~150K LL CPU hours) allocation allows for 18 of these passes (4.2 months)
3 MB/sec Internet2 connection = 1.6 TB / 12 nights (MUIDN_1D1S&NTCN)
(Presently) LL is most effective for CPU-intensive tasks: simulations can easily fill the 512 CPUs; e.g., QM02 Project 07.
Caveats: “LLDIMU” is a front-end machine; the LL worker node environment is different from a CAS/RCS node (P. Power…)
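The same estimate in a minimal sketch (using the slide's assumed 10 M events, 3 sec/event, ~200 KB/event, and 50 CPUs; the quoted *1.3 LL factor would stretch the wall-clock time accordingly):

```python
# Minimal sketch of the one-pass estimate above; all inputs are the slide's
# assumptions. The *1.3 LosLobos factor is not applied here.
n_events   = 10e6
sec_per_ev = 3.0
kb_per_ev  = 200.0
n_cpus     = 50

wall_days = n_events * sec_per_ev / n_cpus / 86400.0     # ~7 days per pass
volume_tb = n_events * kb_per_ev / 1e9                   # ~2 TB handled per pass
passes    = 150_000 / (n_events * sec_per_ev / 3600.0)   # ~18 passes in 150K CPU-hours
print(f"{wall_days:.1f} days/pass on {n_cpus} CPUs, "
      f"{volume_tb:.1f} TB/pass, ~{passes:.0f} passes in the allocation")
```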
On UNM Grid activities
T.L.Thomas
. CPU time used: ~33,000 LosLobos hours
. Number of files handled: > 2,200 files
. Data moved to BNL: > 0.5 TB (globus-url-copy)
(NOTE: In 2001, did even more (~110,000 hours), as an exercise… see http://thomas.phys.unm.edu/tlt/phenix_simulations/ )
. Comments: [from late summer... but still relevant]
. Global storage and I/O (disk, network) management is a headache; too human-intensive. --> Throwing more people at the problem (i.e., giving people accounts at more remote sites) is not a particularly efficient way to solve it.
. File naming standard essential (esp. for database issues).
. I have assembled a (still rough; not included here) standard request form for DETAILED information... --> This could be turned into an automatic interface... A PORTAL (to use the buzzword).
. PWG contacts need to assemble as detailed a plan as they can, but without the kinds of system details that are probably going to be changed anyway. (e.g., "chunk" size hints welcome but may be ignored.)
. Use of varied facilities requires flexibility, including an "ATM" approach --> Simulation database needs to reflect this complexity.
. Generator config / management needs to be somewhat more sophisticated. --> E.g., random seeds, "back-end" generation.
. A big issue (that others may understand better): the relationship and interface between the simulation database and other PHENIX databases...
. Multiple levels of logs actually helped bookkeeping! --> Perhaps 'pseudo-parallelism' is the way to go.
. Emerging reality (one of the main motivations for "Grid" technology): no one has enough computing when it's needed, but everyone has too much when they don't need it, which is much of the time. More than enough computing to get the work done is out there; you don't need your own! BUT: those resources are "out there", and this must be dealt with. ==> PHENIX can and should form its own IntraGrid.
Reality Check #1: Perpetual computing person-power shortage; this pertains to both software production and data production, both real and M.C. Given that, M.C. is presently way too much work.
Simple Vision: Transparently distributed processing should allow us to optimize our use of production computing person-power. Observed and projected massive increases in network bandwidth make this a not-so-crazy idea.
Reality Check #2: What? Distributed real-data reco? Get Real! (...?)
Fairly Simple Vision: OK, OK: Implement Simple Vision for M.C. first, see how that goes. If one can process M.C., then one is perhaps 75% of the way to processing real data. (Objy write-back problem is one serious catch.)
(The following slides are from a presentation that I was invited to give to the UNM-led multi-institutional “Internet 2 Day” this past March…)
Internet 2 and the Grid: The Future of Computing for Big Science at UNM
Timothy L. Thomas, UNM Dept of Physics and Astronomy
Grokking The Grid
Grok, v.: To perceive a subject so deeply that one no longer knows it, but rather understands it on a fundamental level. Coined by Robert Heinlein in his 1961 novel, Stranger in a Strange Land.
(Quotes from a colleague of mine…)
Feb 2002: “This grid stuff is garbage.”
Dec 2002: “Hey, these grid visionaries are serious!”
So what is a “Grid”?
Ensemble of distributed resources acting together to solve a problem:
”The Grid is about collaboration, about people working together.”
• Linking people, computing resources, and sensors / instruments
• Idea is decades old, but enabling technologies are recent
• Capacity distributed throughout an infrastructure
Aspects of Grid computing: pervasive, consistent, dependable, inexpensive
Virtual Organizations (VOs)
• Security implications
Ian Foster’s Three Requirements:
• VOs that span multiple administrative domains
• Participant services based on open standards
• Delivery of serious Quality of Service
High Energy Physics Grids
• GriPhyN (NSF): CS research focusing on virtual data, request planning
  o Virtual Data Toolkit: delivery vehicle for GriPhyN products
• iVDGL: International Virtual Data Grid Laboratory (NSF): a testbed for large-scale deployment and validation
• Particle Physics Data Grid (DOE): Grid-enabling six High-Energy/Nuclear Physics experiments
• EU Data Grid (EDG): application areas…
  o Particle physics
  o Earth and planetary sciences: "Earth Observation"
  o "Biology"
• GLUE: Grid Laboratory Uniform Environment: link from US grids to EDG grids
<<< Grid Hype >>>(“Grids: Grease or Glue?”)
Natural Grid Applications
• High-energy elementary particle and Nuclear Physics (HENP)
• Distributed image processing
  o Astronomy…
  o Biological/biomedical research; e.g., pathology…
  o Earth and Planetary Sciences
  o Military applications; e.g., space surveillance
• Engineering simulations
  o NEES Grid
• Distributed event simulations
  o Military applications; e.g., SF Express
  o Medicine: distributed, immersive patient simulations (Project Touch)
  o Biology: complete cell simulations…
Processing requirements
Two examples
Example 1: High-energy Nuclear Physics
• 10’s of petabytes of data per year
• 10’s of teraflops of distributed CPU power
  o Comparable to today’s largest supercomputers…
Example 2: Biological Databases: complex interdependencies
[Diagram: flow of data among GenBank, Swissprot, TRRD, GERD, Transfac, EpoDB, EMBL, DDBJ, BEAD, GAIA]
• Domino effect in data publishing
• Efficiently keep many versions
(Yong Zhao, University of Chicago)
Data Mining Example
…and the role of Internet 2.
It is clear that advanced networking will play a critical role in the development of an intergrid and its eventual evolution into The Grid…
• Broadband capacity
• Advanced networking protocols
• Well-defined, finely graded, clearly-costed high Qualities of Service
Connectivity of the web: one can pass from any node of IN through SCC to any node of OUT. Hanging off IN and OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC. It is possible for a TENDRIL hanging off from IN to be hooked into a TENDRIL leading into OUT, forming a TUBE -- a passage from a portion of IN to a portion of OUT without touching SCC.
…In other words: barely predictable,
but no doubt inevitable, disruptive, transformative,
…and very exciting!