Unleashing the BIG in small Science Data (NY Scientific Data Summit)

UNLEASHING THE BIG IN small SCIENCE DATA: Vision and Reality in the Geosciences. Kerstin Lehnert, Lamont-Doherty Earth Observatory of Columbia University, Palisades, NY 10964

Transcript of Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Page 1: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

UNLEASHING THE BIG IN small SCIENCE DATA

Vision and Reality in the Geosciences

Kerstin Lehnert, Lamont-Doherty Earth Observatory of Columbia University, Palisades, NY 10964

Page 2: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

NYSDS 2015: "Unleashing the BIG in small Science Data"


Small Science Data have BIG Potential

8/3/2015

Long Tail: Environmental and Earth sciences

The Head: Astronomy, Climate, High Energy Physics, Genomics

“The long tail is a breeding ground for new ideas and never before attempted science.”

(Heidorn, B. 2008: “Shedding Light on the Dark Data in the Long Tail of Science”)

Page 3: Unleashing the BIG in small Science Data (NY Scientific Data Summit)


Small Data, BIG Investment

• 4 scientists going to Alaska
• 5 weeks on remote islands
• a boat (with crew)
• a helicopter

Page 4: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

[Map: Unalaska fly camp sites, priority areas 1-3, with alternate ground stops marked GS]

Unalaska, fly camp site priorities 1-3; if the helicopter can stay with us for part of the day, we would make local moves within each area, and we would make ground stops at the alternate sites labeled “GS”. We are flexible. We can probably arrange some charter boat support to visit shoreline sites.

GS en route to Umnak? … just a thought


Page 5: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Collected Samples


Page 6: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Data Acquisition in the Lab


Rock crushing

Sample chemistry

Sample analysis

Page 7: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Anticipated Outcomes

• Samples
  • ~250 rock samples
• Data
  • ~200 major element analyses (Lamont-Doherty Earth Observatory)
  • ~150 trace element analyses (University of South Carolina)
  • 50 U/Pb zircon geochronology (University of California, Santa Barbara)
  • 30 Ar-Ar ages (Lamont-Doherty Earth Observatory)
  • 80 Sr, Nd, Hf and Pb isotope analyses (Lamont-Doherty Earth Observatory)
• Scientific papers


Page 8: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Small Data “Sharing”


Page 9: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Monday’s Musings: Beyond The Three V’s of Big Data – Viscosity and Virality, February 27, 2012, by R "Ray" Wang
http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/

What Makes Data BIG?


The sixth ‘V’: Value

Page 10: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

How to Unleash the BIG in Small Data

[Diagram: small data → BIG DATA (Value)]

• findable: identification, persistence
• accessible: protection, protocols
• re-usable: context, provenance
• interoperable: harmonized, machine-readable

Page 11: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

[Same diagram as on the previous slide (small data → BIG DATA, Value), now annotated with:]

• Domain Repositories
• Data Curation Standards
• Discipline-specific Data Standards

Page 12: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Domain-specific Data Facilities

[Diagram: a domain-specific data facility at the center, connected to:]

• Science Community
• Libraries, Archives
• CI, Computer Science
• Publishers, editors
• Funding Agencies
• Data Facilities
• Registries

Discipline-specific data services
• Context & provenance metadata
• Semantics
• Workflows

Data curation services

CI development

Page 13: Unleashing the BIG in small Science Data (NY Scientific Data Summit)


Domain-specific Data Facilities


Domain Expertise

Community Responsiveness

Operational Reliability


Quality Utility

Trust

Page 14: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

IEDA: A “Long-Tail” Data Facility


www.iedadata.org

• Multiple core disciplines (focus: solid earth)
  • High-T Geochemistry
  • Low-T Geochemistry
  • Petrology
  • Marine Geophysics & Geology
  • Geochronology

• Cross-disciplinary tools & services
  • Sample registry SESAR
  • IEDA Data Browser
  • Portals (GeoPRISMS, USAP-DCC, etc.)
  • GeoMapApp
  • Data management support


Page 15: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Impact of IEDA Services


from an NSF proposal’s Data Management Plan

Page 16: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Geochemical Observations

• Hundreds of chemical properties of different Earth materials
  • elemental or oxide concentrations
  • isotopes and isotopic ratios
• Large diversity of materials
  • Samples & sub-samples
  • Chemical & physical preparations
• Wide range of analytical methods
  • Customized & optimized configurations
  • Local data reduction code
  • Data quality measures relevant


Page 17: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

EarthChem: BIG Data for Geochemistry

• EarthChem Library (File repository & metadata catalog)
  • DOI registration
  • Long-term archiving
  • CC license
  • Data templates & guidelines for data documentation
  • QC by data managers
• Synthesis Databases (PetDB, EarthChem Portal)
  • QA/QC by data managers
  • Data & metadata harmonization
  • Standards-compliant data model
  • Service Oriented Architecture
• Sample Registry SESAR
  • Persistent Unique Identifiers for Samples
  • Searchable metadata catalog
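
The "data & metadata harmonization" step listed for the synthesis databases can be pictured with a small sketch. The record layout, field names, and unit table below are illustrative assumptions, not EarthChem's actual data model; the sketch only shows values reported in different units being brought to one common unit (ppm by mass) before they are combined.

```python
from dataclasses import dataclass

# Conversion factors to parts per million (ppm) by mass.
# Assumed unit vocabulary for this sketch only.
TO_PPM = {"wt%": 10_000.0, "ppm": 1.0, "ppb": 0.001}


@dataclass(frozen=True)
class Measurement:
    sample_id: str  # hypothetical sample label
    analyte: str    # e.g. "K2O"
    value: float
    unit: str       # one of the keys in TO_PPM


def harmonize(m: Measurement) -> Measurement:
    """Return a copy of the measurement expressed in ppm."""
    if m.unit not in TO_PPM:
        raise ValueError(f"unknown unit: {m.unit!r}")
    return Measurement(m.sample_id, m.analyte, m.value * TO_PPM[m.unit], "ppm")


# Three made-up values for the same analyte, reported in three different units.
reported = [
    Measurement("S1", "K2O", 0.5, "wt%"),
    Measurement("S2", "K2O", 1800.0, "ppm"),
    Measurement("S3", "K2O", 150_000.0, "ppb"),
]
harmonized = [harmonize(m) for m in reported]
for m in harmonized:
    print(m.sample_id, m.analyte, round(m.value, 3), m.unit)
```

Rejecting unknown units, rather than passing them through, is a deliberate choice in this sketch: a silent pass-through would mix incompatible values in the harmonized set.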


Page 18: Unleashing the BIG in small Science Data (NY Scientific Data Summit)


EarthChem Data Systems

[Diagram: Investigators, EarthChem Library (repository with data files and a metadata catalog), Search]

• Data templates, review by data curators
• Long-term archiving at CU Libraries

Page 19: Unleashing the BIG in small Science Data (NY Scientific Data Summit)


Geochemical Data Standards

• Metadata for geochemical methods & samples
  • Sample location, sampling method, sample preparation, analytical method, instrument settings, lab, reproducibility measures (standards), normalization
  • EarthChem XML schema
• Geochronology metadata standards
• Unique sample identifiers (IGSN = International Geo Sample Number) & metadata profiles

• EarthChem engaged community in development
• Worked with publishers on data policies to advance implementation
• Provided incentives, tools, & user support to community to facilitate adoption
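
The metadata items named on this slide can be pictured as one record per analysis. The sketch below is a hypothetical Python structure, not the EarthChem XML schema: the class name, field names, and example values are assumptions chosen to mirror the slide's list, and the helper only reports which items are still empty.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AnalysisMetadata:
    """Hypothetical record holding the metadata items listed on the slide."""
    igsn: Optional[str] = None                 # unique sample identifier
    sample_location: Optional[str] = None
    sampling_method: Optional[str] = None
    sample_preparation: Optional[str] = None
    analytical_method: Optional[str] = None
    instrument_settings: Optional[str] = None
    lab: Optional[str] = None
    reference_standards: Optional[str] = None  # reproducibility measures
    normalization: Optional[str] = None


def missing_fields(record: AnalysisMetadata) -> list[str]:
    """Names of metadata items that are still empty (None or empty string)."""
    return [f.name for f in fields(record)
            if getattr(record, f.name) in (None, "")]


record = AnalysisMetadata(
    igsn="XXX000001",                   # placeholder, not a registered IGSN
    sample_location="53.9 N, 166.5 W",  # made-up coordinates
    analytical_method="ICP-MS",
    lab="Example Lab",
)
print(missing_fields(record))
```

A check like this is the sort of thing a data template or a curator review could apply before accepting a submission.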


Page 20: Unleashing the BIG in small Science Data (NY Scientific Data Summit)


• > 80,000 samples
• > 1,660 field programs
• > 3 million measurements
• > 2,000 references

global compilation of geochemical data for volcanic rocks from the ocean floor & mantle xenoliths

Page 21: Unleashing the BIG in small Science Data (NY Scientific Data Summit)


Data Mining: Search & Filter


Filter by method or concentration
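
The kind of query described here, filtering analyses by analytical method or by a concentration range, can be sketched in a few lines. The rows, field names, and values below are made up for illustration; they are not PetDB data or its schema.

```python
from typing import Optional

# Made-up rows for illustration; values are in wt%.
ANALYSES = [
    {"sample": "A", "analyte": "MgO", "method": "XRF", "value": 7.8},
    {"sample": "B", "analyte": "MgO", "method": "EMP", "value": 9.1},
    {"sample": "C", "analyte": "MgO", "method": "XRF", "value": 5.2},
    {"sample": "D", "analyte": "SiO2", "method": "XRF", "value": 50.3},
]


def filter_analyses(rows, analyte: str, method: Optional[str] = None,
                    min_value: Optional[float] = None,
                    max_value: Optional[float] = None) -> list:
    """Keep rows for one analyte, optionally restricted by method and value range."""
    selected = []
    for row in rows:
        if row["analyte"] != analyte:
            continue
        if method is not None and row["method"] != method:
            continue
        if min_value is not None and row["value"] < min_value:
            continue
        if max_value is not None and row["value"] > max_value:
            continue
        selected.append(row)
    return selected


# MgO measured by XRF with at least 6 wt%
hits = filter_analyses(ANALYSES, "MgO", method="XRF", min_value=6.0)
print([row["sample"] for row in hits])
```

Such a filter only works across many small datasets because method and units were recorded consistently in the first place.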


Page 22: Unleashing the BIG in small Science Data (NY Scientific Data Summit)


Page 23: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

PetDB Impact

• 500 - 800 downloads per quarter
• >550 citations in the literature
• many fundamental new discoveries & insights
  • Disciplinary
  • Multi-disciplinary
  • Unanticipated purposes
• new scientific approaches
  • Statistical rather than hypothetical reservoirs


Page 24: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Even BIGGER Geochemical Data


EarthChem Portal: Access to federated geochemical databases

Inventory: ~20 million analytical values for > 850,000 samples

• Data visualization
• Interoperability with modeling databases & software

Page 25: Unleashing the BIG in small Science Data (NY Scientific Data Summit)


SESAR (www.geosamples.org)

System for Earth Sample Registration

• Integrating physical samples in the Earth Sciences with CI
• Persistent unique identifiers for samples
  • IGSN = International Geo Sample Number
  • Citation and credit for samples!
• Preservation & persistent access of sample metadata
  • tools and services for users to catalog and manage sample metadata
• APIs
  • register sample metadata & obtain IGSNs
  • access to IGSN metadata
• Linking data, publications, and physical samples
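
The register, look up, and link pattern described above can be illustrated with a toy in-memory registry. This is a sketch only: it does not call or imitate the actual SESAR API, the class and method names are hypothetical, and the identifiers it issues (prefix `TOY`) and the DOIs in the example are placeholders, not real IGSNs or DOIs.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SampleRecord:
    identifier: str
    metadata: dict
    datasets: list = field(default_factory=list)      # linked dataset identifiers
    publications: list = field(default_factory=list)  # linked publication identifiers


class ToySampleRegistry:
    """In-memory stand-in for a sample registry (hypothetical, not SESAR)."""

    def __init__(self, prefix: str = "TOY") -> None:
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._records = {}

    def register(self, metadata: dict) -> str:
        """Store sample metadata and return a new unique identifier."""
        identifier = f"{self._prefix}{next(self._counter):06d}"
        self._records[identifier] = SampleRecord(identifier, dict(metadata))
        return identifier

    def lookup(self, identifier: str) -> SampleRecord:
        """Return the record for an identifier (KeyError if unknown)."""
        return self._records[identifier]

    def link(self, identifier: str, dataset: Optional[str] = None,
             publication: Optional[str] = None) -> None:
        """Attach a dataset and/or publication reference to a sample."""
        record = self._records[identifier]
        if dataset is not None:
            record.datasets.append(dataset)
        if publication is not None:
            record.publications.append(publication)


registry = ToySampleRegistry()
sample_id = registry.register({"name": "UN-01", "material": "rock"})
registry.link(sample_id, dataset="doi:10.0000/example-dataset",
              publication="doi:10.0000/example-paper")
print(sample_id, registry.lookup(sample_id).datasets)
```

The point of the pattern is that the identifier, once issued, is the single key that ties the physical sample to every dataset and paper that uses it.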


Page 26: Unleashing the BIG in small Science Data (NY Scientific Data Summit)


The BIG Question

How can we grow small data across all Geoscience domains?

EarthCube, COPDESS


Page 27: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

EarthCube’s Vision

A future state of geosciences cyberinfrastructure that serves the entire academic geosciences community
• facilitates science on the Earth System
• simple and easy access to data and information
• seamless connection to tools and services
• flexible and community-driven development

Partnership between CISE/ACI and GEO: NSF 13-529

Slide provided by Eva Zanzerkia, NSF


Page 28: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

EarthCube’s Strategy

• Open science
  • encourage the publication of all science products so they are discoverable and accessible and can be adapted to solve new problems.
  • enable reproducibility
• Knowledge-rich components
  • ensure rich metadata for scientific resources and products to enable others to understand and reuse them.
  • develop advanced analysis techniques and automated learning techniques for the effective mining, fusion, and assimilation of data into sophisticated inference models.
• Federated organization
  • contribute resources designed to interoperate through agreed standards.
  • coordinate by fostering standards and integration.


Page 29: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

EarthCube Foci That Help Small Data Grow

• Fostering new data communities
  • Domain end-user workshops
  • Research Coordination Networks
• Developing and adapting new technologies to structure, transform, integrate, document, harmonize data & metadata
  • Building Blocks
• Advancing coordination, collaboration, and integration
  • Community governance
  • Integrative Activities


Page 30: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

RCN: The Internet of Samples in the Earth Sciences (iSamples)

• Build consensus among diverse stakeholders working with physical samples and their digital representation
  • investigators, curators, data managers, publishers
• To develop, promote, and implement best practices for sample identification, documentation, citation, and sharing.


Page 31: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Toward BIG Data for Samples


Page 32: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

EarthCube Technologies


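[Table: EarthCube projects, grouped by year (2013, 2014), against six capability columns. The column position of individual check marks is not recoverable from this transcript.]

Columns: Data Access, Data Discovery, Data Integration, Data Management, Modeling, Semantics

BCube √ √ √ √
CINERGI √ √ √ √
DisConBB √ √ √ √ √ √
Earth System Bridge √ √ √
GeoDeepDive √ √ √ √
GeoSoft √ √ √
GeoWS √ √ √ √
ODSIP √ √ √ √ √
GeoLink √ √ √ √ √ √
CyberConnector √ √ √ √ √
CHORDS √ √ √ √ √
Digital Crust √ √ √ √ √
EarthCollab √ √
GeoDataSpace √ √ √
Geosemantics for Long-Tail √ √ √ √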

Slide provided by Ilya Zaslavsky, SDSC

Page 33: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

EarthCube Integrative Activity:

Partnership to Leverage Existing Repository Services


EarthCube Integrative Activity: “Interdisciplinary Earth Data Alliance as a Model for Integrating EC Technology Resources and Engaging the Broad Community” (start 9/1/2015)

Page 34: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Coalition for Publishing Data in the Earth & Space Sciences (COPDESS)

• Joint initiative of Earth Science publishers and Data Facilities to better help translate the aspirations of open, available, and useful data from policy into practice.
  • Reaffirm and ensure adherence to existing journal and publishing policies and society position statements regarding open data sharing and archiving of data, tools, and models.
  • Ensure that Earth science data will, to the greatest extent possible, be stored in community approved repositories that can provide additional data services.
• Statement of Commitment signed by all major Earth & Space Science publishers


www.copdess.org

Page 35: Unleashing the BIG in small Science Data (NY Scientific Data Summit)
Page 36: Unleashing the BIG in small Science Data (NY Scientific Data Summit)

Conclusions

• Small data grows BIG when properly curated, documented, harmonized, and integrated.
• Domain-specific data facilities are essential to ensure quality of data for trusted re-use & community engagement.
• Integration with publications will augment the flow of data into repositories and data products.
• Partnerships among long-tail data communities allow sharing of data publication & preservation infrastructure while supporting domain-specific data curation.
• Community-wide initiatives such as EarthCube help solve the entire range of social, technical, and organizational challenges.
