“UCF” Computing Capabilities at UNM HPC
Timothy L. Thomas, UNM Dept of Physics and Astronomy
Analysis Workshop, 8/8/2002, Charles F. Maguire, Vanderbilt
Run2 Remote Site PISA Simulation Statistics
Vanderbilt Farm: Projects 1 (hadrons), 6 (deuterons), 9 (pizero), and 10 (electrons); 10,000 CPU-hours, >600 GBytes, ~1.5 person-months of effort
UNM Farm: Projects 0 (EMCal HIJING), 7 (EMCal + Muon HIJING), 9 (pizero); 33,000 CPU-hours, >500 GBytes, ~1.5 person-months of effort
LLNL Farm: Projects 4 (EMCal HIJING), 5 (HBT), 23 (high-pT pizero); ~14,000 CPU-hours, >300 GBytes, ~1.5 person-months
SUNYSB Farm: Projects 11, 13, 14, 17, 23 (electron working group requests); ~1,500 CPU-hours, 100 GBytes (?), ~0.5 person-month
WIS Farm: Project 8 (Phi->K+K-, Phi->e+e-); ~6,000 CPU-hours, ~500 GBytes, ~1.0 person-month
Grand totals: ~65,000 CPU-hours (90 CPU-months), ~2 TBytes, ~6 person-months
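A quick cross-check of these totals (a minimal sketch, not from the original slides; it assumes a 30-day, 24-hour CPU-month and uses the per-site CPU-hour figures quoted above):

```python
# Minimal sketch: re-derive the grand totals from the per-site figures above.
# Assumption: 1 CPU-month = 30 days * 24 hours.
cpu_hours = {"Vanderbilt": 10_000, "UNM": 33_000, "LLNL": 14_000,
             "SUNYSB": 1_500, "WIS": 6_000}

total_hours = sum(cpu_hours.values())     # ~64,500, i.e. ~65,000 CPU-hours
cpu_months  = total_hours / (24 * 30)     # ~90 CPU-months
print(f"~{total_hours:,} CPU-hours (~{cpu_months:.0f} CPU-months)")
```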
I have a 200K SU (150K LL CPU hour) grant from the NRAC of the NSF/NCSA, with which UNM HPC (“AHPCC”) is affiliated.
Peripheral Data Vs Simulation
Simulation: Muons From Central Hijing (QM02 Project07)
Data: Centrality by Perp > 60
(Stolen from Andrew…)
Simulated Decay Muons
QM’02 Project07 PISA files (Central HIJING)
Closest cuts possible from PISA file to match data (PT parent > 1 GeV/c, Theta_orig parent 155-161)
Investigating possibility of keeping only muon and parent hits for reconstruction.
17,100 total events distributed over Z = ±10, ±20, ±38
More events available, but only a factor for the smallest error bar
Zeff ~75 cm
"(IDPART==5 || IDPART==6) && IDPARENT > 6 && IDPARENT < 13 && PTHE_PRI > 155 && PTHE_PRI < 161 && IPLANE == 1 && IARM == 2 && LASTGAP > 2002 && PTOT_PRI*sin(PTHE_PRI*acos(0)/90.) > 1."
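Purely as an illustration, a selection string like the one above could be applied to a PISA ntuple with PyROOT roughly as sketched below; the tree name and file name are placeholders (assumptions), not taken from the slides:

```python
# Sketch: apply the decay-muon selection quoted above to a PISA ntuple via PyROOT.
# The tree name ("pisa") and input file name are placeholders, not from the slides.
import ROOT

cut = ("(IDPART==5 || IDPART==6) && IDPARENT > 6 && IDPARENT < 13 && "
       "PTHE_PRI > 155 && PTHE_PRI < 161 && IPLANE == 1 && IARM == 2 && "
       "LASTGAP > 2002 && PTOT_PRI*sin(PTHE_PRI*acos(0)/90.) > 1.")

chain = ROOT.TChain("pisa")                    # placeholder tree name
chain.Add("qm02_project07_muons.root")         # placeholder file name
n_pass = chain.Draw("PTOT_PRI", cut, "goff")   # "goff": count entries, no graphics
print("entries passing the decay-muon cuts:", n_pass)
```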
Now at UNM HPC:
• PBS
• Globus 2.2.x
• Condor-G / Condor
• (GDMP)
…all supported by HPC staff.
In Progress:
A new 1.2 TB RAID 5 disk server, to host:
• AFS cache / PHENIX software
• ARGO file catalog (PostgreSQL)
• Local Objectivity mirror
• Globus 2.2.x (GridFTP and more…)
Pre-QM2002 experience with globus-url-copy…
• Easily saturated UNM bandwidth limitations (as they were at that time)
• PKI infrastructure and sophisticated error-handling are a real bonus over bbftp. (One bug, known at the time, is being / has been addressed.)
(Plot at left: transfer rate in KB/sec, using 10 parallel streams)
Multi-Jet cross section (theory) calculations, run using Condor(/PVM)…
Three years of accumulated CPU time on desktop (MOU) machines at HPCERC and at the University of Wisconsin.
Very CPU-intensive calculations… 6- and 9-dimensional Monte Carlo integrations: a typical job runs for a week and produces only about 100 KB of output histograms, such as those displayed here.
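For a sense of the structure of such a job, here is a minimal sketch of a d-dimensional Monte Carlo integration; the toy integrand stands in for the actual multi-jet matrix element, which is not shown in these slides:

```python
# Toy sketch of a d-dimensional Monte Carlo integration over the unit hypercube.
# The real multi-jet cross-section integrand is not reproduced here.
import random

def mc_integrate(f, dim, n_samples):
    """Estimate the integral of f over [0, 1]^dim by simple sampling."""
    total = 0.0
    for _ in range(n_samples):
        x = [random.random() for _ in range(dim)]
        total += f(x)
    return total / n_samples   # hypercube volume is 1

# Toy 6-dimensional integrand: the integral of sum(x_i^2) over [0,1]^6 is 6/3 = 2.
estimate = mc_integrate(lambda x: sum(xi * xi for xi in x), dim=6, n_samples=200_000)
print(f"MC estimate: {estimate:.3f} (exact value: 2.0)")
```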
LLDIMU.HPC.UNM.EDU
RAID op system issues
• Easy re-installation / update of the op sys
• Grub or Lilo? (MBR or /boot?)
• Machine has an IDE CDROM (but not a burner)!!!
• Rescue CDs and/or floppies…
• Independence of RAID array
  o (1.5 hours for RAID 5 verification step)
  o Should install ext3 on the RAID.
• Partitioning of the system disk:
  o Independence of /home area
  o Independence of /usr/local area?
  o Jonathan says: Linux can’t do more than a 2 GB swap partition
  o Jonathan says: / /usr/local/ /home/ (me: /home1/ /home2/ …?)
• NFS issues…
  o Synchronize UID/GIDs between RAID server and LL.
RAID op system issues
Compilers and glibc…
RAID op system issues
File systems…
• What quotas?
• Ext3? (Quotas working OK?)
• ReiserFS? (Need special kernel modules for this?)
RAID op system issues
Support for the following apps:
• RAID software
• Globus…
• PHENIX application software
• Objectivity
  o gcc 2.95.3
• PostgreSQL
• OpenAFS
• Kerberos 4
RAID op system issues
Security issues…
• IP#: fixed or DHCP?
• What services to run or avoid?
  o NFS…
• Tripwire or equiv…
• Kerberos (for OpenAFS)…
• Globus…
• ipchains firewall rules; /etc/services; /etc/xinetd config; etc…
RAID op system issues
Application-level issues…
• Which framework? Both?
• Who maintains the framework, and how can this job be properly divided up among locals?
• SHOULD THE RAID ARRAY BE PARTITIONED, a la the PHENIX counting house buffer boxes’ /a and /b file systems?
Resources
Filtered events can be analyzed, but not ALL PRDF events.
Many triggers overlap.
Assume 90 KByte/event and 0.1 GByte/hour/CPU.
Trigger                     Lumi [nb^-1]   #Events [M]   Size [GByte]   CPU [hour]   100-CPU [day]   Signal (mu-mu / mu / e-mu)
ERT_electron                193            13.0          1170           11700        4.9             1
MUIDN_1D&BBCLL1             238            34.0          3060           30600        12.8            1 1 1
MUIDN_1D&MUIDS_1D&BBCLL1    59             0.2           18             180          0.1             1
MUIDN_1D1S&BBCLL1           254            4.8           432            4320         1.8             1 1 1
MUIDN_1D1S&NTCN             230            18.0          1620           16200        6.8             1
MUIDS_1D&BBCLL1             274            10.7          963            9630         4.0             1 1 1
MUIDS_1D1S&BBCLL1           293            1.3           117            1170         0.5             1
MUIDS_1D1S&NTCS             278            5.0           450            4500         1.9             1
ALL PRDF                    350            6600.0        33,000         330,000      137.5
Rough calculation of real-data processing (I/O-intensive) capabilities:
10 M events, PRDF-to-{DST+x}, both mut & mutoo; assume 3 sec/event (*1.3 for LL), ~200 KB/event.
One pass: 7 days on 50 CPUs (25 boxes), using 56% of LL local network capacity.
My 200K “SU” (~150K LL CPU hours) allocation allows for 18 of these passes (4.2 months)
3 MB/sec Internet2 connection = 1.6 TB / 12 nights (MUIDN_1D1S&NTCN)
(Presently) LL is most effective for CPU-intensive tasks: simulations can easily fill the 512 CPUs; e.g., QM02 Project 07.
Caveats: “LLDIMU” is a front-end machine; the LL worker node environment is different from a CAS/RCS node (P. Power…)
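The same estimate in a minimal sketch (using the slide's assumed 10 M events, 3 sec/event, ~200 KB/event, and 50 CPUs; the quoted *1.3 LL factor would stretch the wall-clock time accordingly):

```python
# Minimal sketch of the one-pass estimate above; all inputs are the slide's
# assumptions. The *1.3 LosLobos factor is not applied here.
n_events   = 10e6
sec_per_ev = 3.0
kb_per_ev  = 200.0
n_cpus     = 50

wall_days = n_events * sec_per_ev / n_cpus / 86400.0     # ~7 days per pass
volume_tb = n_events * kb_per_ev / 1e9                   # ~2 TB handled per pass
passes    = 150_000 / (n_events * sec_per_ev / 3600.0)   # ~18 passes in 150K CPU-hours
print(f"{wall_days:.1f} days/pass on {n_cpus} CPUs, "
      f"{volume_tb:.1f} TB/pass, ~{passes:.0f} passes in the allocation")
```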
On UNM Grid activities
T.L.Thomas
. CPU time used: ~33,000 LosLobos hours
. Number of files handled: > 2,200 files
. Data moved to BNL: > 0.5 TB (globus-url-copy)
(NOTE: In 2001, did even more (~110,000 hours), as an exercise… see http://thomas.phys.unm.edu/tlt/phenix_simulations/ )
. Comments: [from late summer... but still relevant]
. Global storage and I/O (disk, network) management is a headache; too human-intensive. --> Throwing more people at the problem (i.e., giving people accounts at more remote sites) is not a particularly efficient way to solve it.
. File naming standard essential (esp. for database issues).
. I have assembled a (still rough; not included here) standard request form for DETAILED information... --> This could be turned into an automatic interface... A PORTAL (to use the buzzword).
. PWG contacts need to assemble as detailed a plan as they can, but without the kinds of system details that are probably going to be changed anyway. (e.g., "chunk" size hints welcome but may be ignored.)
. Use of varied facilities requires flexibility, including an "ATM" approach --> Simulation database needs to reflect this complexity.
. Generator config / management needs to be somewhat more sophisticated. --> E.g., random seeds, "back-end" generation.
. A big issue (that others may understand better): the relationship and interface between the simulation database and other PHENIX databases...
. Multiple levels of logs actually helped bookkeeping! --> Perhaps 'pseudo-parallelism' is the way to go.
. Emerging reality (one of the main motivations for "Grid" technology): no one has enough computing when it's needed, but everyone has too much when they don't need it, which is much of the time. More than enough computing to get the work done is out there; you don't need your own! BUT: those resources are "out there", and this must be dealt with. ==> PHENIX can and should form its own IntraGrid.
Reality Check #1: Perpetual computing person-power shortage; this pertains to both software production and data production, both real and M.C. Given that, M.C. is presently way too much work.
Simple Vision: Transparently distributed processing should allow us to optimize our use of production computing person-power. Observed and projected massive increases in network bandwidth make this a not-so-crazy idea.
Reality Check #2: What? Distributed real-data reco? Get Real! (...?)
Fairly Simple Vision: OK, OK: Implement Simple Vision for M.C. first, see how that goes. If one can process M.C., then one is perhaps 75% of the way to processing real data. (Objy write-back problem is one serious catch.)
(The following slides are from a presentation that I was invited to give to the UNM-led multi-institutional “Internet 2 Day” this past March…)
Internet 2 and the Grid: The Future of Computing for Big Science at UNM
Timothy L. Thomas, UNM Dept of Physics and Astronomy
Grokking The Grid
Grok, v.: To perceive a subject so deeply that one no longer knows it, but rather understands it on a fundamental level. Coined by Robert Heinlein in his 1961 novel, Stranger in a Strange Land.
(Quotes from a colleague of mine…)
Feb 2002: “This grid stuff is garbage.”
Dec 2002: “Hey, these grid visionaries are serious!”
So what is a “Grid”?
Ensemble of distributed resources acting together to solve a problem:
”The Grid is about collaboration, about people working together.”
• Linking people, computing resources, and sensors / instruments
• Idea is decades old, but enabling technologies are recent
• Capacity distributed throughout an infrastructure
Aspects of Grid computing: pervasive, consistent, dependable, inexpensive
Virtual Organizations (VOs)
• Security implications
Ian Foster’s Three Requirements:
• VOs that span multiple administrative domains
• Participant services based on open standards
• Delivery of serious Quality of Service
High Energy Physics Grids
• GriPhyN (NSF): CS research focusing on virtual data, request planning
  o Virtual Data Toolkit: delivery vehicle for GriPhyN products
• iVDGL: International Virtual Data Grid Laboratory (NSF): a testbed for large-scale deployment and validation
• Particle Physics Data Grid (DOE): Grid-enabling six High-Energy/Nuclear Physics experiments
• EU Data Grid (EDG): application areas…
  o Particle physics
  o Earth and planetary sciences: "Earth Observation"
  o "Biology"
• GLUE: Grid Laboratory Uniform Environment: link from US grids to EDG grids
<<< Grid Hype >>>(“Grids: Grease or Glue?”)
Natural Grid Applications
• High-energy elementary particle and Nuclear Physics (HENP)
• Distributed image processing
  o Astronomy…
  o Biological/biomedical research; e.g., pathology…
  o Earth and Planetary Sciences
  o Military applications; e.g., space surveillance
• Engineering simulations
  o NEES Grid
• Distributed event simulations
  o Military applications; e.g., SF Express
  o Medicine: distributed, immersive patient simulations (Project Touch)
  o Biology: complete cell simulations…
Processing requirements
Two examples
Example 1: High-energy Nuclear Physics
• 10’s of petabytes of data per year
• 10’s of teraflops of distributed CPU power
  o Comparable to today’s largest supercomputers…
Example 2: Biological Databases: complex interdependencies
[Diagram: flow of data among GenBank, Swissprot, TRRD, GERD, Transfac, EpoDB, EMBL, DDBJ, BEAD, GAIA]
• Domino effect in data publishing
• Efficiently keep many versions
(Yong Zhao, University of Chicago)
Data Mining Example
…and the role of Internet 2.
It is clear that advanced networking will play a critical role in the development of an intergrid and its eventual evolution into The Grid…
• Broadband capacity
• Advanced networking protocols
• Well-defined, finely graded, clearly-costed high Qualities of Service
Connectivity of the web: one can pass from any node of IN through SCC to any node of OUT. Hanging off IN and OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC. It is possible for a TENDRIL hanging off from IN to be hooked into a TENDRIL leading into OUT, forming a TUBE -- a passage from a portion of IN to a portion of OUT without touching SCC.
…In other words: barely predictable,
but no doubt inevitable, disruptive, transformative,
…and very exciting!