Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Performance Applications
Unleashing the BIG in small Science Data (NY Scientific Data Summit)
Transcript of Unleashing the BIG in small Science Data (NY Scientific Data Summit)
UNLEASHING THE BIG IN small SCIENCE DATA
Vision and Reality in the Geosciences
Kerstin Lehnert, Lamont-Doherty Earth Observatory of Columbia University, Palisades, NY, 10964
NYSDS 2015: "Unleashing the BIG in small Science Data"
Small Science Data have BIG Potential
8/3/2015
Long Tail: Environmental and Earth sciences
The Head: Astronomy, Climate, High Energy Physics, Genomics
“The long tail is a breeding ground for new ideas and never before attempted science.”
(Heidorn, B. 2008: “Shedding Light on the Dark Data in the Long Tail of Science”)
Small Data, BIG Investment
• 4 scientists going to Alaska
• 5 weeks on remote islands
• a boat (with crew)
• a helicopter
[Map: Unalaska field plan, with fly camp site priorities 1-3 and alternate ground stops labeled “GS”]
Unalaska, fly camp site priorities 1-3; if the helicopter can stay with us for part of the day, we would make local moves within each area, and we would make ground stops at the alternate sites labeled “GS”. We are flexible. We can probably arrange some charter boat support to visit shoreline sites.
GS en route to Umnak? … just a thought
Collected Samples
Data Acquisition in the Lab
Rock crushing
Sample chemistry
Sample analysis
Anticipated Outcomes
• Samples
  • ~250 rock samples
• Data
  • ~200 major element analyses (Lamont-Doherty Earth Observatory)
  • ~150 trace element analyses (University of South Carolina)
  • 50 U/Pb zircon geochronology (University of Santa Barbara)
  • 30 Ar-Ar ages (Lamont-Doherty Earth Observatory)
  • 80 Sr, Nd, Hf and Pb isotope analyses (Lamont-Doherty Earth Observatory)
• Scientific papers
Small Data “Sharing”
What Makes Data BIG?
Monday’s Musings: Beyond The Three V’s of Big Data – Viscosity and Virality
February 27, 2012 by R "Ray" Wang
http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/
The sixth ‘V’: Value
[Diagram: small data becomes BIG DATA, and yields Value, when it is]
• findable: identification, persistence
• accessible: protection, protocols
• re-usable: context, provenance
• interoperable: harmonized, machine-readable
How to Unleash the BIG in Small Data
Domain Repositories
Data Curation Standards
Discipline-specific Data Standards
Domain-specific Data Facilities
Science Community
Domain specific Data facility
Libraries Archives
CI, Computer Science
Publishers, editors
Discipline-specific data services
• Context & provenance metadata
• Semantics
• Workflows
Funding Agencies
Data Facilities
Registries
Data curation services
CI development
Domain-specific Data Facilities
Domain Expertise
Community Responsiveness
Operational Reliability
Quality Utility
Trust
IEDA: A “Long-Tail” Data Facility
www.iedadata.org
• Multiple core disciplines (focus: solid earth)
  • High-T Geochemistry
  • Low-T Geochemistry
  • Petrology
  • Marine Geophysics & Geology
  • Geochronology
• Cross-disciplinary tools & services
  • Sample registry SESAR
  • IEDA Data Browser
  • Portals (GeoPRISMs, USAP-DCC, etc.)
  • GeoMapApp
  • Data management support
Impact of IEDA Services
from an NSF proposal’s Data Management Plan
Geochemical Observations
• Hundreds of chemical properties of different Earth materials
  • elemental or oxide concentrations
  • isotopes and isotopic ratios
• Large diversity of materials
  • Samples & sub-samples
  • Chemical & physical preparations
• Wide range of analytical methods
  • Customized & optimized configurations
  • Local data reduction code
  • Data quality measures relevant
EarthChem: BIG Data for Geochemistry
• EarthChem Library (File repository & metadata catalog)
  • DOI registration
  • Long-term archiving
  • CC license
  • Data templates & guidelines for data documentation
  • QC by data managers
• Synthesis Databases (PetDB, EarthChem Portal)
  • QA/QC by data managers
  • Data & metadata harmonization
  • Standards-compliant data model
  • Service Oriented Architecture
• Sample Registry SESAR
  • Persistent Unique Identifiers for Samples
  • Searchable metadata catalog
EarthChem Data Systems
[Diagram labels: Investigators, Data, Metadata, EarthChem Library, Search, Repository]
Long-term archiving at CU Libraries
Data templates, review by data curators
Geochemical Data Standards
• Metadata for geochemical methods & samples
  • Sample location, sampling method, sample preparation, analytical method, instrument settings, lab, reproducibility measures (standards), normalization
  • EarthChem XML schema
• Geochronology metadata standards
• Unique sample identifiers (IGSN = International Geo Sample Number) & metadata profiles
• EarthChem engaged community in development
• Worked with publishers on data policies to advance implementation
• Provided incentives, tools, & user support to community to facilitate adoption
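The metadata items listed on this slide could be checked for completeness before a dataset is submitted. The following is a minimal sketch only: the class name, field names, and example values are hypothetical and are not taken from the EarthChem XML schema or any IEDA tool.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AnalysisMetadata:
    """Hypothetical record mirroring the metadata items named on the slide.

    Field names are illustrative assumptions, not the EarthChem XML schema.
    """
    igsn: Optional[str] = None
    sample_location: Optional[str] = None
    sampling_method: Optional[str] = None
    sample_preparation: Optional[str] = None
    analytical_method: Optional[str] = None
    instrument_settings: Optional[str] = None
    lab: Optional[str] = None
    reproducibility_measures: Optional[str] = None
    normalization: Optional[str] = None


def missing_fields(record: AnalysisMetadata) -> List[str]:
    """Return the names of metadata fields that are empty or unset."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Made-up example values; the identifier is a placeholder, not a registered IGSN.
record = AnalysisMetadata(
    igsn="EXAMPLE0001",
    sample_location="Unalaska Island",
    analytical_method="ICP-MS",
    lab="Example Lab",
)
print(missing_fields(record))
```

A check like this would only report which fields are empty; whether a value is scientifically adequate still needs review by a data curator.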
global compilation of geochemical data for volcanic rocks from the ocean floor & mantle xenoliths
• > 80,000 samples
• > 1,660 field programs
• > 3 million measurements
• > 2,000 references
Data Mining: Search & Filter
Filter by method or concentration
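Filtering harmonized measurements by method or concentration could look roughly like the sketch below. It is an illustration under stated assumptions, not the PetDB search interface: the `Measurement` type, the `filter_measurements` function, and all sample values are made up.

```python
from typing import Iterable, List, NamedTuple, Optional


class Measurement(NamedTuple):
    """Hypothetical harmonized measurement; not the PetDB data model."""
    sample_id: str
    method: str
    analyte: str
    value: float  # concentration, in whatever unit the dataset uses


def filter_measurements(
    records: Iterable[Measurement],
    method: Optional[str] = None,
    analyte: Optional[str] = None,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
) -> List[Measurement]:
    """Keep records that satisfy every criterion that was supplied."""
    selected = []
    for r in records:
        if method is not None and r.method != method:
            continue
        if analyte is not None and r.analyte != analyte:
            continue
        if min_value is not None and r.value < min_value:
            continue
        if max_value is not None and r.value > max_value:
            continue
        selected.append(r)
    return selected


# Made-up values for illustration only.
records = [
    Measurement("S1", "XRF", "MgO", 7.9),
    Measurement("S2", "XRF", "MgO", 4.2),
    Measurement("S3", "ICP-MS", "MgO", 8.5),
]
hits = filter_measurements(records, method="XRF", analyte="MgO", min_value=5.0)
print([m.sample_id for m in hits])
```

A filter this simple only works once method names and units are harmonized across datasets, which is the curation step the synthesis databases provide.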
PetDB Impact
• 500-800 downloads per quarter
• >550 citations in the literature
• many fundamental new discoveries & insights
  • Disciplinary
  • Multi-disciplinary
  • Unanticipated purposes
• new scientific approaches
  • Statistical rather than hypothetical reservoirs
Even BIGGER Geochemical Data
EarthChem Portal
Access to federated geochemical databases
• Data visualization
• Interoperability with modeling databases & software
Inventory: ~20 million analytical values for > 850,000 samples
SESAR (www.geosamples.org)
System for Earth Sample Registration
• Integrating physical samples in the Earth Sciences with CI
• Persistent unique identifiers for samples
  • IGSN = International Geo Sample Number
  • Citation and credit for samples!
• Preservation & persistent access of sample metadata
  • tools and services for users to catalog and manage sample metadata
• APIs
  • register sample metadata & obtain IGSNs
  • access to IGSN metadata
• Linking data, publications, and physical samples
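The register, look up, and link pattern described on this slide can be illustrated with a toy in-memory registry. This is a sketch under loud assumptions: it is not the SESAR API, the `SampleRegistry` class and its methods are hypothetical, and the identifiers it produces use a made-up format that is not the IGSN syntax.

```python
from typing import Dict, List


class SampleRegistry:
    """Toy in-memory registry; NOT the SESAR API, and its IDs are not real IGSNs."""

    def __init__(self, prefix: str = "DEMO") -> None:
        self._prefix = prefix
        self._metadata: Dict[str, Dict[str, str]] = {}
        self._links: Dict[str, List[str]] = {}

    def register(self, metadata: Dict[str, str]) -> str:
        """Store sample metadata and return a new unique identifier."""
        sample_id = f"{self._prefix}{len(self._metadata) + 1:06d}"
        self._metadata[sample_id] = dict(metadata)
        self._links[sample_id] = []
        return sample_id

    def metadata(self, sample_id: str) -> Dict[str, str]:
        """Return a copy of the metadata stored for an identifier."""
        return dict(self._metadata[sample_id])

    def link(self, sample_id: str, reference: str) -> None:
        """Attach a dataset or publication reference to a registered sample."""
        if sample_id not in self._metadata:
            raise KeyError(f"unknown sample identifier: {sample_id}")
        self._links[sample_id].append(reference)

    def links(self, sample_id: str) -> List[str]:
        """Return the references linked to a sample, in insertion order."""
        return list(self._links[sample_id])


# Made-up sample and references for illustration only.
registry = SampleRegistry()
sid = registry.register({"name": "rock-01", "material": "basalt"})
registry.link(sid, "dataset: example-dataset")
registry.link(sid, "publication: example-paper")
print(sid, registry.links(sid))
```

The point of the pattern is that the identifier, once assigned, is the single key that datasets and publications cite, so a physical sample can be traced from either direction.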
The BIG Question
How can we grow small data across all Geoscience domains?
EarthCube, COPDESS
EarthCube’s Vision
A future state of geosciences cyberinfrastructure that serves the entire academic geosciences community
• facilitates science on the Earth System
• simple and easy access to data and information
• seamless connection to tools and services
• flexible and community-driven development
Partnership between CISE/ACI and GEO: NSF 13-529
Slide provided by Eva Zanzerkia, NSF
EarthCube’s Strategy
• Open science
  • encourage the publication of all science products so they are discoverable and accessible and can be adapted to solve new problems.
  • enable reproducibility
• Knowledge-rich components
  • ensure rich metadata for scientific resources and products to enable others to understand and reuse them.
  • develop advanced analysis techniques and automated learning techniques for the effective mining, fusion, and assimilation of data into sophisticated inference models.
• Federated organization
  • contribute resources designed to interoperate through agreed standards.
  • coordinate by fostering standards and integration.
EarthCube Foci That Help Small Data Grow
• Fostering new data communities
  • Domain end-user workshops
  • Research Coordination Networks
• Developing and adapting new technologies to structure, transform, integrate, document, harmonize data & metadata
  • Building Blocks
• Advancing coordination, collaboration, and integration
  • Community governance
  • Integrative Activities
RCN: The Internet of Samples in the Earth Sciences (iSamples)
• Build consensus among diverse stakeholders working with physical samples and their digital representation
  • investigators, curators, data managers, publishers
• To develop, promote, and implement best practices for sample identification, documentation, citation, and sharing.
Toward BIG Data for Samples
EarthCube Technologies
Projects | Data Access | Data Discovery | Data Integration | Data Management | Modeling | Semantics
BCube √ √ √ √
CINERGI √ √ √ √
DisConBB √ √ √ √ √ √
Earth System Bridge √ √ √
GeoDeepDive √ √ √ √
GeoSoft √ √ √
GeoWS √ √ √ √
ODSIP √ √ √ √ √
GeoLink √ √ √ √ √ √
CyberConnector √ √ √ √ √
CHORDS √ √ √ √ √
Digital Crust √ √ √ √ √
EarthCollab √ √
GeoDataSpace √ √ √
Geosemantics for Long-Tail √ √ √ √
2013
2014
Slide provided by Ilya Zaslavsky, SDSC
EarthCube Integrative Activity:
Partnership to Leverage Existing Repository Services
EarthCube Integrative Activity: “Interdisciplinary Earth Data Alliance as a Model for Integrating EC Technology Resources and Engaging the Broad Community” (start 9/1/2015)
Coalition for Publishing Data in the Earth & Space Sciences (COPDESS)
• Joint initiative of Earth Science publishers and Data Facilities to better help translate the aspirations of open, available, and useful data from policy into practice.
  • Reaffirm and ensure adherence to existing journal and publishing policies and society position statements regarding open data sharing and archiving of data, tools, and models.
  • Ensure that Earth science data will, to the greatest extent possible, be stored in community approved repositories that can provide additional data services.
• Statement of Commitment signed by all major Earth & Space Science publishers
www.copdess.org
Conclusions
• Small data grows BIG when properly curated, documented, harmonized, and integrated.
• Domain-specific data facilities are essential to ensure quality of data for trusted re-use & community engagement.
• Integration with publications will augment the flow of data into repositories and data products.
• Partnerships among long-tail data communities allow sharing of data publication & preservation infrastructure while supporting domain-specific data curation.
• Community-wide initiatives such as EarthCube help solve the entire range of social, technical, and organizational challenges.