1
Grid3: an Application Grid Laboratory for Science
Rob Gardner, University of Chicago
on behalf of the Grid3 project
CHEP ’04, Interlaken
September 28, 2004
2
Grid2003: an application grid laboratory
virtual data grid laboratory; virtual data research
end-to-end HENP applications
CERN LHC: US ATLAS testbeds & data challenges
CERN LHC: US CMS testbeds & data challenges
→ Grid3
3
Grid3 at a Glance
Grid environment built from core Globus and Condor middleware, as delivered through the Virtual Data Toolkit (VDT): GRAM, GridFTP, MDS, RLS, VDS
…equipped with VO and multi-VO security, monitoring, and operations services
…allowing federation with other Grids where possible, e.g. the CERN LHC Computing Grid (LCG)
  US ATLAS: GriPhyN VDS execution on LCG sites
  US CMS: storage element interoperability (SRM/dCache)
Delivering the US LHC Data Challenges
4
Grid3 Design
Simple approach:
Sites consisting of a Computing Element (CE), a Storage Element (SE), and information and monitoring services
VO level, and multi-VO: VO information services; operations (iGOC)
Minimal use of grid-wide systems: no centralized workload manager, replica or data management catalogs, or command line interface; higher-level services are provided by individual VOs
[Diagram: many sites, each with a CE and SE, federated under per-VO services and the iGOC]
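The design above can be sketched in code. This is a hypothetical illustration (the class and field names are invented, not Grid3 APIs): each site exposes only a compute element, a storage element, and an information service, while anything "grid-wide" such as a replica catalog lives at the VO layer rather than in the grid itself.

```python
# Illustrative sketch of the Grid3 design: sites stay minimal; higher-level
# services (e.g. a replica catalog) belong to each VO, not to the grid.
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    ce: str            # gatekeeper contact string for the compute element
    se: str            # GridFTP endpoint for the storage element
    info_url: str      # site-level information/monitoring service

@dataclass
class VirtualOrganization:
    name: str
    # Per-VO replica catalog: logical file name -> list of SE locations.
    replica_catalog: dict = field(default_factory=dict)

    def register_output(self, lfn: str, site: Site) -> None:
        self.replica_catalog.setdefault(lfn, []).append(site.se)

# Hypothetical sites and VO (names and endpoints are made up).
sites = [
    Site("uc-grid3", "uc.example.edu/jobmanager", "gsiftp://uc.example.edu", "http://uc.example.edu/info"),
    Site("bnl-grid3", "bnl.example.gov/jobmanager", "gsiftp://bnl.example.gov", "http://bnl.example.gov/info"),
]
usatlas = VirtualOrganization("usatlas")
usatlas.register_output("dc2/evgen.0001", sites[1])
print(usatlas.replica_catalog)  # the catalog is VO-level, not grid-wide
```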
5
Site Services and Installation
Goal: install and configure with minimal human intervention
Uses Pacman and distributed software "caches":
  % pacman -get iVDGL:Grid3
Registers the site with VO and Grid3-level services
Sets up accounts, application install areas ($app) & working directories ($tmp)
Installed components: VDT, VO service, GIIS registration, information providers (Grid3 schema), log management
A full install and validation takes about four hours
6
Multi-VO Security Model
DOEGrids Certificate Authority; PPDG or iVDGL Registration Authority
Authorization service: VOMS
Each Grid3 site generates a Globus gridmap file with an authenticated SOAP query to each VO's VOMS service (iVDGL, US ATLAS, US CMS, LSC, SDSS, BTeV)
Site-specific adjustments or mappings
Group accounts to associate VOs with jobs
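A minimal sketch of the per-site gridmap generation just described. Assumptions are labeled: the per-VO membership lists (which Grid3 actually fetched from each VOMS server over authenticated SOAP) are stubbed out as plain dicts, and the DNs and group-account names are invented.

```python
# Sketch of grid-mapfile generation. In Grid3 the member lists came from
# authenticated SOAP queries to each VO's VOMS server; here they are
# hard-coded stand-ins, and the VO -> group-account mapping is the
# site-specific adjustment mentioned on the slide.
vo_members = {
    "usatlas": ["/DC=org/DC=doegrids/OU=People/CN=Alice Example"],
    "uscms":   ["/DC=org/DC=doegrids/OU=People/CN=Bob Example"],
}

# Site-specific mapping from VO to a local group account (illustrative).
vo_to_account = {"usatlas": "usatlas1", "uscms": "uscms1"}

def build_gridmap(members: dict, accounts: dict) -> str:
    """Render a Globus grid-mapfile: one '"DN" account' line per user."""
    lines = []
    for vo, dns in sorted(members.items()):
        for dn in dns:
            lines.append(f'"{dn}" {accounts[vo]}')
    return "\n".join(lines) + "\n"

print(build_gridmap(vo_members, vo_to_account))
```

Mapping whole VOs to group accounts is what makes the authorization coarse-grained, a point the Lessons slides return to.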
7
iVDGL Operations Center (iGOC)
Co-located with the Abilene NOC (Indianapolis)
Hosts/manages multi-VO services: top-level Ganglia and GIIS collectors; MonALISA web server and archival service; VOMS servers for iVDGL, BTeV, SDSS; Site Catalog service; Pacman caches
Trouble ticket systems: phone (24 hr), web- and email-based collection and reporting
Investigation and resolution of grid middleware problems, at a level of ~30 contacts per week
Weekly operations meetings for troubleshooting
8
Service monitoring
Grid3 – a snapshot of sites, Sep 04:
  30 sites, multi-VO
  shared resources
  ~3000 CPUs (shared)
9
Grid3 Monitoring Framework
c.f. M. Mambelli, B. Kim et al., #490
10
Monitors
Jobs by VO (ACDC)
Job Queues (MonALISA)
Data IO (MonALISA)
Metrics (MDViewer)
11
Use of Grid3 – led by US LHC
7 scientific applications and 3 CS demonstrators
A third HEP experiment and two biology experiments also participated
Over 100 users authorized to run on Grid3
Application execution performed by dedicated individuals: typically a few users ran the applications from a particular experiment
12
US CMS Data Challenge DC04
[Plot: events produced vs. day — CMS dedicated resources (red) and opportunistic, non-CMS use of Grid3 (blue)]
c.f. A. Fanfani, #497
14
Shared infrastructure, last 6 months
[Plot: CPU usage over time, showing the CMS DC04 and ATLAS DC2 periods; snapshot Sep 10]
15
ATLAS DC2 production on Grid3: a joint activity with LCG and NorduGrid
[Plot: number of validated jobs per day, broken down by LCG, NorduGrid, Grid3, and total; G. Poulard, 9/21/04]
c.f. L. Goossens, #501 & O. Smirnova, #499
17
Beyond LHC applications…
Astrophysics and astronomical:
  LIGO/LSC: blind search for continuous gravitational waves
  SDSS: maxBcg, cluster-finding package
Biochemical:
  SnB: bio-molecular program, analyses of X-ray diffraction data to find molecular structures
  GADU/Gnare: genome analysis, compares protein sequences
Computer science, supporting Ph.D. research:
  adaptive data placement and scheduling algorithms
  mechanisms for policy information expression, use, and monitoring
18
Astrophysics: Sloan Sky Survey
Image stripes of the sky from telescope data; sources for galaxy cluster finding, red-shift analysis, weak lensing effects
Analyze weighted images: increase sensitivity by 2 orders of magnitude with object detection and measurement code
Workflow: replicate sky segment data to Grid3 sites; average and analyze; send output to Fermilab
44,000 jobs, 30% complete
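The three-stage workflow above can be sketched as follows. This is an illustration, not SDSS code: the function names and the string-based stand-ins for data transfer and analysis are all invented.

```python
# Illustrative sketch of one SDSS job on Grid3: replicate a sky-segment
# file to a site, run the average/analyze step there, ship output to FNAL.
def replicate(segment: str, site: str) -> str:
    return f"{site}:{segment}"          # stand-in for a GridFTP transfer

def analyze(replica: str) -> str:
    return replica + ".coadd"           # stand-in for the measurement code

def ship_to_fermilab(output: str) -> str:
    return f"fnal:{output}"             # stand-in for the return transfer

def run_segment(segment: str, site: str) -> str:
    """One job: the three workflow stages, in order."""
    return ship_to_fermilab(analyze(replicate(segment, site)))

# Each sky segment becomes an independent job, so the whole survey
# fans out across Grid3 sites (44,000 such jobs in the real run).
results = [run_segment(f"stripe82/seg{i:04d}", "uc-grid3") for i in range(3)]
print(results[0])
```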
19
SDSS Job Statistics on Grid3
Time period: May 1 – Sept. 1, 2004
Total number of jobs: 71,949
Total CPU time: 774 CPU-days
Average job runtime: 0.26 hr
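The statistics above are mutually consistent; a quick check using the slide's rounded values:

```python
# Sanity check on the SDSS statistics: total jobs times average runtime
# should roughly reproduce the quoted total CPU time. The 0.26 hr average
# is rounded, so the agreement is approximate (within about 1%).
total_jobs = 71949
avg_runtime_hr = 0.26
quoted_cpu_days = 774

estimated_cpu_days = total_jobs * avg_runtime_hr / 24
print(f"estimated: {estimated_cpu_days:.0f} CPU-days vs quoted: {quoted_cpu_days}")
```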
20
Structural Biology
SnB is a computer program based on Shake-and-Bake, a dual-space direct-methods procedure for determining molecular crystal structures from X-ray diffraction data.
Difficult molecular structures with as many as 2,000 unique non-H atoms have been solved in a routine fashion.
SnB has been routinely applied to jump-start the solution of large proteins, increasing the number of selenium atoms determined in Se-Met molecules from dozens to several hundred.
SnB is expected to play a vital role in the study of ribosomes and large macromolecular assemblies containing many different protein molecules and hundreds of heavy-atom sites.
21
Genomic Searches and Analysis
Searches for and finds new genomes in public databases (e.g. NCBI)
Each genome is composed of ~4k genes; each gene needs to be processed and characterized
Each gene is handled by a separate process; results are saved for future use
also: BLAST of protein sequences
[Flowchart, reconstructed as steps:]
  User Interface: select genomes to run through the tools
  PDB Acquisition: ftp to public databases (PDB); search for new or updated genomes (exit if none)
  Genome Upload: get new or updated genomes into a local directory; Smart Diff compares the local directory with the PDB directory; create info files for analyzing the genomes
  Pre-HPC: select jobs to run; parse info files
  Submit to bio tools (Blast, Pfam, Blocks): submit genomes in parallel for HPC processing (CHIBA-CITY)
  Check Output: on error, resubmit; when correct, Tool Grabber parses data from output files and Genbank Grabber parses data from annotation files
  Results stored in an Oracle relational DB (Genome Integrated Database); User Notification: notify user regarding updates
Example info file:
  Organism name: Cornybacterium_glutamicum
  Version and GI number: NC_003450.1, GI:19551250
  Definition: Cornybaterium glutamicum, complete genome.
  Sequence qty: 3456
  Path to fasta file: /nfs/............
  Tool: ChibaBlast
GADU Work Flow
GADU: 250 processors; 3M sequences; ID'd: bacterial, viral, vertebrate, mammal
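The per-gene fan-out described above can be sketched as follows. This is an illustration, not GADU code: the gene names, the `characterize` stand-in for running a tool such as BLAST, and the worker count are all invented.

```python
# Illustrative sketch of the GADU fan-out: a genome of ~4k genes becomes
# one job per gene, run in parallel, with results collected for future use.
from concurrent.futures import ThreadPoolExecutor

def characterize(gene: str) -> tuple:
    # Stand-in for processing and characterizing one gene (e.g. BLAST).
    return gene, f"annotation({gene})"

def process_genome(genes: list, workers: int = 8) -> dict:
    """Submit one job per gene and collect the saved results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(characterize, genes))

genes = [f"gene{i:04d}" for i in range(4000)]   # ~4k genes per genome
results = process_genome(genes)
print(len(results), results["gene0000"])
```

Because each gene is an independent process, the same pattern scales from a local thread pool to submitting grid jobs, which is how the workload spread across 250 processors.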
22
Lessons (1)
Human interactions in grid building are costly
Keeping site requirements light led to heavy loads on gatekeeper hosts
A diverse set of sites made exchanging job requirements difficult
Single-point failures rarely happened; expiring certificate revocation lists happened twice
Configuration problems: Pacman helped, but we still spent enormous amounts of time diagnosing problems
23
Lessons (2)
Software updates were either relatively easy or extremely painful
Authorization: simple in Grid3, but coarse-grained
Troubleshooting: efficiency for submitted jobs was not as high as we'd like; a complex system has many failure modes and points of failure
Need fine-grained monitoring tools; need to improve at both the service level and the user level
24
Operations Experience
iGOC and the US ATLAS Tier1 (BNL) developed an operations response model in support of DC2
Tier1 center: core services, an "on-call" person always available; response protocol developed
iGOC: coordinates problem resolution for Tier1 "off hours"; trouble handling for non-ATLAS Grid3 sites; problems resolved at weekly iVDGL operations meetings
~600 trouble tickets (generic); ~20 ATLAS DC2 specific
Extensive use of email lists
25
Not major problems:
bringing sites into single-purpose grids
simple computational grids for highly portable applications
specific workflows as defined by today's JDL and/or DAG approaches
centralized, project-managed grids (to a particular scale; beyond that, yet to be seen)
26
Major problems: two perspectives
Site & service provider perspective:
  maintaining multiple "logical" grids with a given resource; maintaining robustness; long-term management; dynamic reconfiguration; platforms
  complex resource-sharing policies (department, university, projects, collaborative) and user roles
Application developer perspective:
  the challenge of building integrated distributed systems
  end-to-end debugging of jobs, understanding faults
  common workload and data management systems developed separately for each VO
27
Grid3 is evolving into OSG
Main features/enhancements:
  Storage Resource Management
  improved authorization service
  added data management capabilities
  improved monitoring and information services
  service challenges and interoperability with other Grids
Timeline:
  the current Grid3 remains stable through 2004
  service development continues on the Grid3dev platform
c.f. R. Pordes, #192
28
Conclusions
Grid3 taught us many lessons about how to deploy and run a production grid
Breakthrough in the demonstrated use of "opportunistic" resources enabled by grid technologies
Grid3 will be a critical resource for continued data challenges through 2004, and an environment in which to learn how to operate and upgrade large-scale production grids
Grid3 is evolving into OSG with enhanced capabilities
29
Acknowledgements
R. Pordes (Grid3 co-coordinator) and the rest of the Grid3 team, which did all the work: site administrators, VO service administrators, application developers, developers and contributors, the iGOC team, and project teams