Metadata Management on the SCEC PetaSHA Project: Helping Users Describe, Discover, Understand, and Use Simulation Data in a
Large-scale Scientific Collaboration
David Okaya (Univ. Southern California)
Ewa Deelman (ISI)
Phil Maechling (SCEC)
Mona Wong-Barnum (SDSC)
Tom Jordan (USC/SCEC)
David Meyers (SCEC)
AGU • December 14, 2007
Southern California Earthquake Center
Outline
• Southern California Earthquake Center (SCEC) and the cyberinfrastructure PetaSHA earthquake hazards
project.
• PetaSHA computer-based estimations of ground shaking: "Platforms"
• Role of metadata in automated & manual workflow within Platforms.
• Lessons learned - what works and where we need assistance from computer scientists.
Southern California: A Natural Laboratory for Earthquake Hazard & Risk Analysis
• Complex network of over 300 active faults
– high hazard
• Large urban population
– high risk
• Southern California Earthquake Center (SCEC) coordinates a major program of earthquake research
– system-level studies of hazard and risk
Southern California Earthquake Center
• Involves 500+ scientists at 55 institutions worldwide
• Focuses on earthquake system science using Southern California as a natural laboratory
• Translates basic research into practical products for earthquake risk reduction
• SCEC Collaboratory – Grid-enabled Community Modeling Environment (CME) developed under
NSF’s ITR Program
– Partnership with IT organizations in physics-based seismic hazard analysis
• SCEC - Computer Science - Information Technology Collaboration
Cyberinfrastructure layering of the SCEC Collaboratory
– Vertical integration of hardware, software, and wetware into a cyberinfrastructure for earthquake scientists and consumers of research products.
– Across-the-Internet: High performance computing (Terascale, Petascale), Grid services, storage and digital libraries, visualization, portals, validated and optimized scientific codes, scientific workflow technologies.
SCEC/CME Computational Platforms
Vertically integrated computational configurations (hardware + software + wetware) for physics-based seismic hazard analysis
Platform Attributes:
– System-level scale range
– High-performance hardware
– IT/geoscience collaboration
– Validated software framework
– Workflow management tools
– Well-defined interface
Numerical simulations of ground shaking by an earthquake (platforms in order of increased complexity):
• EarthWorks gateway: Simulate one earthquake through earth volume.
• Broadband: Simulate one earthquake through earth volume including high frequency near-surface shaking (e.g., under buildings).
• CyberShake: Simulate hundreds to thousands of variations of earthquakes and statistically calculate earthquake probabilities.
Computing labels on the figure: standard computing; multiple computing; capacity & data-intensive computing.
Platform Metadata: Workflow Provenance and Scientific Content
Traditional (CyberShake):
– domain scientists do on own.
– text, embedded in file headers.
Defined (Broadband):
– history.
– produced by codes, appended to flat file.
Optimized (EarthWorks gateway):
– history and more.
– produced by codes.
– upstream metadata used by downstream codes to determine run; eliminates hardcodes.
Dolan et al. (2003)
3D Velocities of seismic waves
Earthworks Gateway
Linear workflow with choices; Optimized metadata.
scientific workflow management
Two types of metadata
• workflow metadata: resources, provenance.
• scientific content metadata: describes products and can be used by scientific codes.
simulation_codeauthor=Rob_Graves
simulation_codename=emod3d
#set_region ...
region_origin_definition=lat_long
region_latlong_ellipsoid=WGS-84
region_UTM_zone=11
region_origin_latitude=34.00000
region_origin_longitude=-118.00000
region_lengtheast_m=30000.0
region_lengthnorth_m=30000.0
region_depth_shallow=0.0
region_depth_deep=17000.0
region_velocitymodel=SCEC_CVM3.0
#set_simulation_seismic_times.
simulation_tmax=5.000
simulation_dt=0.0050
simulation_timesamples=1001
#define_earthquake_location.
eq_latitude=34.05300
eq_longitude=-117.90000
eq_depth_km=2.0000
eq_depth_m=2000.0
eq_Mw=5.00
source_type=PT_DCOUPLE
source_wavetype=triangle
#set_mesh_info ...
mesh_dx=100.00
mesh_dy=100.00
mesh_dz=100.00
mesh_nx=301
mesh_ny=301
mesh_nz=171
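As a rough illustration of how scientific content metadata like the listing above could be consumed by a downstream code, the Python sketch below parses `key=value` lines and derives the mesh and time-sample counts from the region extents rather than hardcoding them. The function names (`parse_metadata`, `derive_mesh`, `derive_timesamples`) and the "extent / spacing + 1" rule are assumptions made for illustration; the rule reproduces the values in the listing, but the slides do not say how emod3d actually computes them.

```python
def parse_metadata(text):
    """Parse flat 'key=value' lines; lines starting with '#' are treated as comments."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        meta[key.strip()] = value.strip()
    return meta


def derive_mesh(meta):
    """Derive grid-point counts from region extents and grid spacing.

    Assumption: count = extent / spacing + 1 (both end points included).
    """
    depth = float(meta["region_depth_deep"]) - float(meta["region_depth_shallow"])
    return {
        "mesh_nx": round(float(meta["region_lengtheast_m"]) / float(meta["mesh_dx"])) + 1,
        "mesh_ny": round(float(meta["region_lengthnorth_m"]) / float(meta["mesh_dy"])) + 1,
        "mesh_nz": round(depth / float(meta["mesh_dz"])) + 1,
    }


def derive_timesamples(meta):
    """Assumption: number of samples = tmax / dt + 1."""
    return round(float(meta["simulation_tmax"]) / float(meta["simulation_dt"])) + 1


# A subset of the metadata shown on the slide.
EXAMPLE = """\
#set_region ...
region_lengtheast_m=30000.0
region_lengthnorth_m=30000.0
region_depth_shallow=0.0
region_depth_deep=17000.0
#set_simulation_seismic_times.
simulation_tmax=5.000
simulation_dt=0.0050
#set_mesh_info ...
mesh_dx=100.00
mesh_dy=100.00
mesh_dz=100.00
"""

if __name__ == "__main__":
    meta = parse_metadata(EXAMPLE)
    print(derive_mesh(meta))
    print(derive_timesamples(meta))
```

A downstream code written this way can also compare its derived values against `mesh_nx`, `mesh_ny`, `mesh_nz`, and `simulation_timesamples` when an upstream code has already recorded them, and refuse to run on a mismatch.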
Broadband Platform
Parallel workflow with choices; Defined metadata.
• User choice of earthquake and earth volume.
• Create EQ (earthquake description file); choice of code: 1. Stanford 2. UCSB 3. URS Corp
• Library: earth volumes, location info.
• Low frequency simulation (< 1 Hz): choice of codes (workflows).
• High frequency simulation (> 1 Hz): choice of codes (workflows).
• Combine low & high frequency: filter, time shift, sum, etc.
• Broadband seismograms (0-10 Hz).
Mixed type of metadata
• workflow history & limited scientific content metadata: what codes were run, input files.
#Starting a Metadata File for the Broadband Platform
workflow_name=wf_1_urs_urs_urs
workflow_name=urs_genslip
urs_genslip.00003_velmod=indata/6904311/nga_rock1.v1d
urs_genslip.00002_srcfile=indata/6904311/rg_hd4-eq.src
urs_genslip.00001_version=2.3
urs_genslip.00004_genslip=$BIN/genslip-v2.3 read_erf=0 outfile=tmpdata/6904311/tmp_slip stype=urs mag=7.00 nx=128 ny=128 dx=0.281 dy=0.219 dtop=0.000 strike=0 dip=45 rake=90 elon=-118.0000 elat=34.0000 ns=1 nh=1 shypo=10.800 dhypo=22.400 stretch_kcorner=1 dt=0.0250 velfile=indata/6904311/nga_rock1.v1d seed=9
urs_jbrun.00015_wcc_resamp_ardbt=$BIN/wcc_resamp_arbdt newdt=0.025000 infile=tmpdata/6904311/s178.ver outfile=tmpdata/6904311/s178.ver inbin=0 outbin=0
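The "Defined" style above (history produced by codes and appended to a flat file) can be sketched in a few lines of Python. The record layout `<code>.<5-digit sequence>_<label>=<value>` is an assumption read off the sample file, not a documented format, and the helper names (`format_record`, `append_record`, `read_history`) are hypothetical.

```python
import re
from collections import defaultdict

# Assumed record layout, inferred from the sample metadata file:
#   <code>.<5-digit sequence>_<label>=<value>
RECORD = re.compile(r"^(?P<code>\w+)\.(?P<seq>\d{5})_(?P<label>\w+)=(?P<value>.*)$")


def format_record(code, seq, label, value):
    """Build one history line in the assumed layout."""
    return f"{code}.{seq:05d}_{label}={value}"


def append_record(path, code, seq, label, value):
    """Append one history line to the flat metadata file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(format_record(code, seq, label, value) + "\n")


def read_history(lines):
    """Group history records by code name, ordered by sequence number.

    Lines that do not match the assumed layout (comments, workflow_name=...)
    are skipped.
    """
    history = defaultdict(list)
    for line in lines:
        match = RECORD.match(line.strip())
        if match:
            history[match["code"]].append(
                (int(match["seq"]), match["label"], match["value"])
            )
    return {code: sorted(records) for code, records in history.items()}


if __name__ == "__main__":
    sample = [
        "#Starting a Metadata File for the Broadband Platform",
        "workflow_name=urs_genslip",
        "urs_genslip.00003_velmod=indata/6904311/nga_rock1.v1d",
        "urs_genslip.00002_srcfile=indata/6904311/rg_hd4-eq.src",
        "urs_genslip.00001_version=2.3",
    ]
    for code, records in read_history(sample).items():
        print(code, records)
```

Reading the file back this way recovers which codes ran and with which inputs, which is the "workflow history and limited scientific content" that this platform records.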
CyberShake Platform
Ewa Deelman, ISI
ERF (Earthquake Rupture Forecast): general earthquake description.
Rupture Generator: variations of the earthquake.
SGT (Strain Green's Tensor) Generator: numerical simulation of earth response.
GM (Ground Motion) Simulation: makes ground shaking of each EQ variation.
Hazard Curve Calculator: probability of exceeding a specific level of ground shaking.
• Simulates ground motions for potential fault ruptures within 200 km of each site.
• ~12,700 sources in SoCal from USGS 2002 ERF.
• Extends ERF to multiple hypocenters and slip models for each source.
• ~100,000 ground motion simulations for each site.
Thousands of runs; Traditional metadata (coherent set not formalized).
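To make the last pipeline stage concrete, here is a minimal Python sketch of a hazard curve calculation. The slides do not give the formula; this uses the standard Poisson form from probabilistic seismic hazard analysis and assumes that all variations of a rupture are equally likely. The function name `hazard_curve`, the 50-year window, and all numbers in the example are invented for illustration.

```python
import math


def hazard_curve(ruptures, levels, years=50.0):
    """Probability of exceeding each shaking level within `years`.

    ruptures: list of (annual_rate, [peak ground motion per rupture variation]).
    Assumptions: Poisson occurrence in time; variations of one rupture are
    equally likely.
    """
    curve = []
    for level in levels:
        rate = 0.0
        for annual_rate, motions in ruptures:
            # Fraction of this rupture's variations that exceed the level.
            fraction = sum(1 for m in motions if m > level) / len(motions)
            rate += annual_rate * fraction
        curve.append(1.0 - math.exp(-rate * years))
    return curve


if __name__ == "__main__":
    # Invented example: two ruptures with a handful of variations each.
    ruptures = [
        (0.01, [0.1, 0.3, 0.5, 0.7]),
        (0.002, [0.4, 0.9]),
    ]
    for level, prob in zip([0.2, 0.6], hazard_curve(ruptures, [0.2, 0.6])):
        print(f"P(exceed {level}) = {prob:.4f}")
```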
Domain-scientist structured metadata:
• metadata packed into the names of files (thousands of them): earthquake.rupture.variation#
• some metadata stored in datafile header blocks.
• other metadata describing end products stored in a database.
• primary knowledge resides with the operator who runs the codes; not easily searchable.
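One way to make file-name metadata searchable is to parse the names into structured records and index them. The sketch below assumes the three leading dot-separated fields are integer IDs in the order earthquake, rupture, variation; the slides show only the pattern "earthquake.rupture.variation#", so the integer IDs, the `.grm` extension in the example, and the names `RunId`, `parse_run_name`, and `build_index` are all hypothetical.

```python
from typing import NamedTuple, Optional


class RunId(NamedTuple):
    earthquake: int
    rupture: int
    variation: int


def parse_run_name(filename: str) -> Optional[RunId]:
    """Recover (earthquake, rupture, variation) from a file name.

    Assumes the first three dot-separated fields are integer IDs;
    returns None for names that do not fit that pattern.
    """
    parts = filename.split(".")
    if len(parts) < 3:
        return None
    try:
        return RunId(int(parts[0]), int(parts[1]), int(parts[2]))
    except ValueError:
        return None


def build_index(filenames):
    """Map (earthquake, rupture) to the list of file names for its variations."""
    index = {}
    for name in filenames:
        run = parse_run_name(name)
        if run is not None:
            index.setdefault((run.earthquake, run.rupture), []).append(name)
    return index


if __name__ == "__main__":
    files = ["12.3.0.grm", "12.3.1.grm", "7.0.0.grm", "notes.txt"]
    print(build_index(files))
```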
Lessons Learned
• Two tiers of Metadata: (workflow) provenance & scientific content.
– The two tiers rarely cross over, except when choices exist and the scientific content determines the choice.
• Upstream-downstream codes can communicate via metadata.
– "one code's output is another code's input". True for data & metadata.
– allows construction of dynamic workflows. Choice of different codes when available. NO HARDCODED PARAMETER VALUES.
• Domain scientists need metadata tools and strategies for metadata structure and naming conventions.
– tools must be usable from scientific languages such as C and Fortran.
– need to be adopted early, before alternative approaches spring up on their own.
• How: should metadata (and hence codes) be tethered to a database, or kept free-form as flat text or XML?
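The "dynamic workflow, no hardcoded parameter values" lesson can be illustrated with a small dispatch sketch in Python: the workflow reads a field from upstream metadata and uses it to choose which code runs next. `PT_DCOUPLE` is the `source_type` value from the EarthWorks listing; the `FINITE_FAULT` value, the registry, and both stand-in functions are hypothetical and only mark where real simulation codes would be invoked.

```python
def run_point_source(meta):
    """Stand-in for a point-source simulation code (hypothetical)."""
    return f"point-source run, Mw={meta['eq_Mw']}"


def run_finite_fault(meta):
    """Stand-in for a finite-fault simulation code (hypothetical)."""
    return f"finite-fault run, Mw={meta['eq_Mw']}"


# Hypothetical registry mapping a source_type value to the code that handles it.
CODES = {
    "PT_DCOUPLE": run_point_source,
    "FINITE_FAULT": run_finite_fault,  # invented value, for illustration only
}


def dispatch(meta):
    """Choose the downstream code from upstream metadata instead of hardcoding it."""
    source_type = meta["source_type"]
    try:
        code = CODES[source_type]
    except KeyError:
        raise ValueError(f"no code registered for source_type={source_type!r}") from None
    return code(meta)


if __name__ == "__main__":
    upstream = {"source_type": "PT_DCOUPLE", "eq_Mw": "5.00"}
    print(dispatch(upstream))
```

The same pattern works whether the metadata dictionary is filled from a flat text file, an XML document, or a database query, so the storage question above can be decided separately from the dispatch logic.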