Metadata Management on the SCEC PetaSHA Project: Helping Users Describe, Discover, Understand, and Use Simulation Data in a Large-scale Scientific Collaboration

David Okaya (Univ. Southern California), Ewa Deelman (ISI), Phil Maechling (SCEC), Mona Wong-Barnum (SDSC), Tom Jordan (USC/SCEC), David Meyers (SCEC)

AGU • December 14, 2007

Southern California Earthquake Center


Outline

• Southern California Earthquake Center (SCEC) and the cyberinfrastructure PetaSHA earthquake hazards project.

• PetaSHA computer-based estimations of ground shaking: "Platforms"

• Role of metadata in automated & manual workflow within Platforms.

• Lessons learned - what works and where we need assistance from computer scientists.

Southern California: A Natural Laboratory for Earthquake Hazard & Risk Analysis

• Complex network of over 300 active faults

– high hazard

• Large urban population

– high risk

• Southern California Earthquake Center (SCEC) coordinates a major program of earthquake research

– system-level studies of hazard and risk

Southern California Earthquake Center

• Involves 500+ scientists at 55 institutions worldwide

• Focuses on earthquake system science using Southern California as a natural laboratory

• Translates basic research into practical products for earthquake risk reduction

• SCEC Collaboratory – Grid-enabled Community Modeling Environment (CME) developed under NSF’s ITR Program

– Partnership with IT organizations in physics-based seismic hazard analysis


• SCEC - Computer Science - Information Technology Collaboration


Cyberinfrastructure layering of the SCEC Collaboratory

– Vertical integration of hardware, software, and wetware into a cyberinfrastructure for earthquake scientists and consumers of research products.

– Across-the-Internet: High performance computing (Terascale, Petascale), Grid services, storage and digital libraries, visualization, portals, validated and optimized scientific codes, scientific workflow technologies.

[Figure labels: SCEC Focus Groups; Sedimentary Basins; 1857 rupture; San Andreas fault]

SCEC/CME Computational Platforms

Vertically integrated computational configurations (hardware + software + wetware) for physics-based seismic hazard analysis

Platform Attributes:

– System-level scale range

– High-performance hardware

– IT/geoscience collaboration

– Validated software framework

– Workflow management tools

– Well-defined interface

Numerical simulations of ground shaking by an earthquake (platforms in order of increased complexity):

• EarthWorks gateway: Simulate one earthquake through earth volume.

• Broadband: Simulate one earthquake through earth volume including high frequency near-surface shaking (e.g., under buildings).

• CyberShake: Simulate hundreds to thousands of variations of earthquakes and statistically calculate earthquake probabilities.

Computing labels on the diagram: standard computing; multiple computing; capacity & data-intensive computing.

Platform Metadata: Workflow Provenance and Scientific Content

Traditional:

– domain scientists do it on their own.

– text, embedded in file headers.

Defined:

– history.

– produced by codes, appended to flat file.

Optimized:

– history and more.

– produced by codes.

– upstream metadata used by downstream codes to determine run; eliminates hardcodes.

Dolan et al. (2003)

3D Velocities of seismic waves

Earthworks Gateway

Linear workflow with choices; Optimized metadata.

scientific workflow management


Two types of metadata

• workflow metadata: resources, provenance.

• scientific content metadata: describes products and can be used by scientific codes.


simulation_codeauthor=Rob_Graves
simulation_codename=emod3d
#set_region ...
region_origin_definition=lat_long
region_latlong_ellipsoid=WGS-84
region_UTM_zone=11
region_origin_latitude=34.00000
region_origin_longitude=-118.00000
region_lengtheast_m=30000.0
region_lengthnorth_m=30000.0
region_depth_shallow=0.0
region_depth_deep=17000.0
region_velocitymodel=SCEC_CVM3.0
#set_simulation_seismic_times.
simulation_tmax=5.000
simulation_dt=0.0050
simulation_timesamples=1001
#define_earthquake_location.
eq_latitude=34.05300
eq_longitude=-117.90000
eq_depth_km=2.0000
eq_depth_m=2000.0
eq_Mw=5.00
source_type=PT_DCOUPLE
source_wavetype=triangle
#set_mesh_info ...
mesh_dx=100.00
mesh_dy=100.00
mesh_dz=100.00
mesh_nx=301
mesh_ny=301
mesh_nz=171
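A minimal sketch of how a downstream code might read a flat key=value metadata file like the Earthworks sample and derive its run parameters rather than hardcode them. The function names (`parse_metadata`, `expected_mesh_counts`) are hypothetical, and the relation "count = length / spacing + 1" is an assumption inferred from the sample values, not a documented rule of emod3d or the gateway.

```python
def parse_metadata(text):
    """Parse 'key=value' lines; lines starting with '#' are treated as comments."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")  # split at the first '=' only
        meta[key.strip()] = value.strip()
    return meta


def expected_mesh_counts(meta):
    """Derive grid-point counts from region size and mesh spacing.

    Assumption: count = length / spacing + 1 (a node at both ends).
    """
    nx = round(float(meta["region_lengtheast_m"]) / float(meta["mesh_dx"])) + 1
    ny = round(float(meta["region_lengthnorth_m"]) / float(meta["mesh_dy"])) + 1
    depth = float(meta["region_depth_deep"]) - float(meta["region_depth_shallow"])
    nz = round(depth / float(meta["mesh_dz"])) + 1
    return nx, ny, nz


# A subset of the sample metadata file shown in the slides.
SAMPLE = """\
#set_region ...
region_lengtheast_m=30000.0
region_lengthnorth_m=30000.0
region_depth_shallow=0.0
region_depth_deep=17000.0
#set_mesh_info ...
mesh_dx=100.00
mesh_dy=100.00
mesh_dz=100.00
mesh_nx=301
mesh_ny=301
mesh_nz=171
"""

meta = parse_metadata(SAMPLE)
derived = expected_mesh_counts(meta)
stated = tuple(int(meta[k]) for k in ("mesh_nx", "mesh_ny", "mesh_nz"))
print("derived:", derived, "stated:", stated)
```

Comparing the derived counts with the stated ones is one way a downstream code could sanity-check upstream metadata before committing to a run.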

Broadband Platform

Parallel workflow with choices; Defined metadata.

• User choice of earthquake and earth volume (library: earth volumes, location info).

• Create EQ. Choice of code: 1. Stanford, 2. UCSB, 3. URS Corp.

• Earthquake description file.

• Low frequency simulation (< 1 Hz): choice of codes.

• High frequency simulation (> 1 Hz): choice of codes.

• Combine low & high frequency: filter, time shift, sum, etc.

• Broadband seismograms (0-10 Hz).


Mixed type of metadata

• workflow history & limited scientific content metadata: what codes were run, input files.


#Starting a Metadata File for the Broadband Platform

workflow_name=wf_1_urs_urs_urs

workflow_name=urs_genslip

urs_genslip.00003_velmod=indata/6904311/nga_rock1.v1d

urs_genslip.00002_srcfile=indata/6904311/rg_hd4-eq.src

urs_genslip.00001_version=2.3

urs_genslip.00004_genslip=$BIN/genslip-v2.3 read_erf=0 outfile=tmpdata/6904311/tmp_slip stype=urs mag=7.00 nx=128 ny=128 dx=0.281 dy=0.219 dtop=0.000 strike=0 dip=45 rake=90 elon=-118.0000 elat=34.0000 ns=1 nh=1 shypo=10.800 dhypo=22.400 stretch_kcorner=1 dt=0.0250 velfile=indata/6904311/nga_rock1.v1d seed=9

urs_jbrun.00015_wcc_resamp_ardbt=$BIN/wcc_resamp_arbdt newdt=0.025000 infile=tmpdata/6904311/s178.ver outfile=tmpdata/6904311/s178.ver inbin=0 outbin=0
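A minimal sketch of how the Broadband flat file could be read back into an ordered history. It assumes each entry key has the layout `<module>.<sequence>_<name>` and that the sequence number orders entries within a module; both are inferred from the sample lines, not from a published format. The function name `parse_history` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed key layout: <module>.<sequence>_<name>, e.g. urs_genslip.00003_velmod
ENTRY_KEY = re.compile(r"^(?P<module>\w+)\.(?P<seq>\d+)_(?P<name>\w+)$")


def parse_history(text):
    """Return (header, history).

    header:  plain keys such as workflow_name, each mapped to a list of values
    history: module -> list of (sequence, name, value), sorted by sequence
    """
    header = defaultdict(list)
    history = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Values may themselves contain '=', so split at the first one only.
        key, _, value = line.partition("=")
        match = ENTRY_KEY.match(key)
        if match:
            history[match["module"]].append(
                (int(match["seq"]), match["name"], value)
            )
        else:
            header[key].append(value)
    for entries in history.values():
        entries.sort()
    return dict(header), dict(history)


SAMPLE = """\
#Starting a Metadata File for the Broadband Platform
workflow_name=wf_1_urs_urs_urs
workflow_name=urs_genslip
urs_genslip.00003_velmod=indata/6904311/nga_rock1.v1d
urs_genslip.00002_srcfile=indata/6904311/rg_hd4-eq.src
urs_genslip.00001_version=2.3
urs_jbrun.00015_wcc_resamp_ardbt=$BIN/wcc_resamp_arbdt newdt=0.025000 infile=tmpdata/6904311/s178.ver outfile=tmpdata/6904311/s178.ver inbin=0 outbin=0
"""

header, history = parse_history(SAMPLE)
for module, entries in history.items():
    for seq, name, value in entries:
        print(module, seq, name, value)
```

Keeping the module and sequence number as separate fields makes it possible to ask "what did this code run with?" without scanning the whole file by eye.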

CyberShake Platform

Ewa Deelman, ISI

ERF (Earthquake Rupture Forecast): general earthquake description.

Rupture Generator: variations of the earthquake.

SGT (Strain Green's Tensor) Generator: numerical simulation of earth response.

GM (Ground Motion) Simulation: computes ground shaking for each EQ variation.

Hazard Curve Calculator: probability of exceeding a specific level of ground shaking.

Simulates ground motions for potential fault ruptures within 200 km of each site. ~12,700 sources in SoCal from USGS 2002 ERF. Extends ERF to multiple hypocenters and slip models for each source. ~100,000 ground motion simulations for each site.

Thousands of runs; Traditional metadata (coherent set not formalized).
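The Hazard Curve Calculator step can be illustrated with a simplified sketch. It is not the CyberShake implementation: it assumes each rupture has an annual rate, that its variations are equally likely, and that exceedances follow a Poisson process. All numbers and names (`annual_exceedance_rate`, `exceedance_probability`) are invented for illustration.

```python
import math


def annual_exceedance_rate(ruptures, level):
    """Sum over ruptures: annual rate x fraction of variations exceeding `level`.

    ruptures: list of (annual_rate, [peak ground motion per variation])
    Assumption: all variations of a rupture are equally likely.
    """
    total = 0.0
    for annual_rate, peak_motions in ruptures:
        exceeding = sum(1 for gm in peak_motions if gm > level)
        total += annual_rate * exceeding / len(peak_motions)
    return total


def exceedance_probability(rate, years):
    """Poisson probability of at least one exceedance within `years`."""
    return 1.0 - math.exp(-rate * years)


# Invented values: (annual rate, peak motions in g for four variations).
ruptures = [
    (0.01, [0.10, 0.25, 0.40, 0.55]),
    (0.002, [0.30, 0.60, 0.90, 1.20]),
]

levels = [0.2, 0.5, 1.0]
curve = [
    (level, exceedance_probability(annual_exceedance_rate(ruptures, level), 50))
    for level in levels
]
for level, prob in curve:
    print(f"P(exceed {level} g in 50 yr) = {prob:.4f}")
```

With ~100,000 ground motion simulations per site, the per-variation inputs to a calculation like this are exactly the products whose metadata has to be tracked.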


Domain-scientist structured metadata:

• metadata packed into name of files (thousands of them): earthquake.rupture.variation#

• some metadata stored in datafile header blocks.

• other metadata describing end products stored in a database.

• primary knowledge resides with operator who runs codes; not easily searchable.
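A minimal sketch of how metadata packed into file names could be lifted into a searchable index. The slides give the naming pattern only as earthquake.rupture.variation#, so the integer identifiers, the `.grm` extension, and the file names below are hypothetical.

```python
from collections import namedtuple

RunId = namedtuple("RunId", "earthquake rupture variation")


def parse_run_name(filename):
    """Split a name of the assumed form '<earthquake>.<rupture>.<variation>[.ext]'."""
    parts = filename.split(".")
    if len(parts) < 3 or not all(p.isdigit() for p in parts[:3]):
        raise ValueError(f"unrecognized name: {filename!r}")
    return RunId(*(int(p) for p in parts[:3]))


def build_index(filenames):
    """Map (earthquake, rupture) -> sorted list of variation numbers."""
    index = {}
    for name in filenames:
        run = parse_run_name(name)
        index.setdefault((run.earthquake, run.rupture), []).append(run.variation)
    for variations in index.values():
        variations.sort()
    return index


# Hypothetical file names following the assumed pattern.
files = ["12.3.2.grm", "12.3.0.grm", "12.4.0.grm", "7.0.1.grm"]
index = build_index(files)
print(index)
```

An index like this moves knowledge out of the operator's head and the file system listing into something that can be queried.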

Lessons Learned

• Two tiers of Metadata: (workflow) provenance & scientific content.

– The two rarely cross over, except when a workflow offers choices and the scientific content determines the choice.

• Upstream-downstream codes can communicate via metadata.

– "one code's output is another code's input". True for data & metadata.

– allows construction of dynamic workflows. Choice of different codes when available. NO HARDCODED PARAMETER VALUES.

• Domain scientists need metadata tools, plus strategies for metadata structure and naming conventions.

– tools must be usable from scientific languages such as C and Fortran.

– need to be adopted early on, before alternative approaches spring up on their own.

• Open question: should metadata, and hence the codes, be tethered to a database, or be free-form via flat text or XML?
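The point about upstream metadata driving the choice of downstream code can be sketched as a small dispatch table. This is an illustration under stated assumptions: `source_type=PT_DCOUPLE` and `eq_depth_m` appear in the Earthworks sample file, while the `FINITE_FAULT` label and all function names are hypothetical.

```python
def point_source_setup(meta):
    """Stand-in for configuring a point-source code from upstream metadata."""
    return {"code": "point_source", "depth_m": float(meta["eq_depth_m"])}


def finite_fault_setup(meta):
    """Stand-in for configuring a finite-fault code from upstream metadata."""
    return {"code": "finite_fault", "depth_m": float(meta["eq_depth_m"])}


# The scientific content (source_type) selects the downstream code.
SOURCE_HANDLERS = {
    "PT_DCOUPLE": point_source_setup,    # value seen in the sample metadata
    "FINITE_FAULT": finite_fault_setup,  # hypothetical label
}


def choose_source_code(meta):
    """Pick the downstream setup from metadata; no parameter is hardcoded here."""
    source_type = meta["source_type"]
    handler = SOURCE_HANDLERS.get(source_type)
    if handler is None:
        raise ValueError(f"no code registered for source_type={source_type!r}")
    return handler(meta)


upstream = {"source_type": "PT_DCOUPLE", "eq_depth_m": "2000.0"}
plan = choose_source_code(upstream)
print(plan)
```

The same pattern works whether the metadata arrives from a flat file, XML, or a database query, which is why the storage question can be decided separately from the workflow logic.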