Bonazzi commons bd2 k ahm 2016 v2

The Data Commons An introduction & Overview BD2K AHM, November 29, 2016 Vivien Bonazzi (ADDS)

Transcript of Bonazzi commons bd2 k ahm 2016 v2

Page 1: Bonazzi commons bd2 k ahm 2016 v2

The Data Commons An introduction & Overview

BD2K AHM, November 29, 2016

Vivien Bonazzi (ADDS)

Page 2: Bonazzi commons bd2 k ahm 2016 v2

Outline

What’s driving the need for a Data Commons?

Development of the Data Commons at NIH

Current Data Commons Pilots
• Next steps

Considerations & Concluding Thoughts

Page 3: Bonazzi commons bd2 k ahm 2016 v2

What’s driving the need for a Data Commons?

Page 4: Bonazzi commons bd2 k ahm 2016 v2

Convergence of factors

Mountains of Data

Increasing need and support for Data sharing

Availability of digital technologies and infrastructures that support Data at scale

Page 5: Bonazzi commons bd2 k ahm 2016 v2
Page 6: Bonazzi commons bd2 k ahm 2016 v2
Page 7: Bonazzi commons bd2 k ahm 2016 v2

https://gds.nih.gov/ Went into effect January 25, 2015

NCI guidance: http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data

Requires public sharing of genomic data sets

Page 8: Bonazzi commons bd2 k ahm 2016 v2

Recommendation #4: A national cancer data ecosystem for sharing and analysis.

Create a National Cancer Data Ecosystem to collect, share, and interconnect a broad array of large datasets so that researchers, clinicians, and patients will be able to both contribute and analyze data, facilitating discovery that will ultimately improve patient care and outcomes.

Page 9: Bonazzi commons bd2 k ahm 2016 v2
Page 10: Bonazzi commons bd2 k ahm 2016 v2
Page 11: Bonazzi commons bd2 k ahm 2016 v2

Challenges with Biomedical Data

The Journal Article is the end goal

Data is a means to an end (low value)

Data is not FAIR: Findable, Accessible, Interoperable, Reusable

Limited e-infrastructures to support FAIR data

Page 12: Bonazzi commons bd2 k ahm 2016 v2

What’s Changing?

Digital ecosystems

Page 13: Bonazzi commons bd2 k ahm 2016 v2

Development of the NIH Data Commons

Page 14: Bonazzi commons bd2 k ahm 2016 v2

• How do we find data, software, standards?
• How can we make (large) data, annotations, software, metadata accessible?
• How do we reuse data, tools and standards?
• How do we make more data machine readable?
• How do we leverage existing digital technologies, systems, infrastructures?
• How do we collaborate?
• How do we enable a digital ecosystem?

Changing the conversation around Data sharing and access

NIH Data Commons

Page 15: Bonazzi commons bd2 k ahm 2016 v2

Data Commons enabling data driven science

Enable investigators to leverage all possible data and tools in the effort to accelerate biomedical discoveries, therapies and cures by driving the development of data infrastructure and data science capabilities through collaborative research and robust engineering

Matthew Trunnell, FHC

Page 16: Bonazzi commons bd2 k ahm 2016 v2

Data Commons’s

Page 17: Bonazzi commons bd2 k ahm 2016 v2

Developing a Data Commons

Treats products of research – data, methods, papers etc. – as digital objects

These digital objects exist in a shared virtual space
• Find, Deposit, Manage, Share, and Reuse data, software, metadata and workflows

Digital object compliance through FAIR principles:
• Findable
• Accessible (and usable)
• Interoperable
• Reusable
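
As an illustration only, the FAIR compliance idea on this slide can be sketched as a simple metadata check. The `DigitalObject` record, its field names, and the one-field-per-principle mapping are assumptions made for this sketch; they are not the Commons' compliance criteria, which the slides do not spell out.

```python
from dataclasses import dataclass


@dataclass
class DigitalObject:
    """Hypothetical record for a research product treated as a digital object."""
    identifier: str = ""       # persistent ID, e.g. a DOI (Findable)
    access_url: str = ""       # where the object can be retrieved (Accessible)
    metadata_format: str = ""  # shared schema or vocabulary (Interoperable)
    license: str = ""          # terms under which it may be reused (Reusable)


def fair_gaps(obj: DigitalObject) -> list:
    """Return the FAIR principles for which the record has no supporting field."""
    checks = {
        "Findable": bool(obj.identifier),
        "Accessible": bool(obj.access_url),
        "Interoperable": bool(obj.metadata_format),
        "Reusable": bool(obj.license),
    }
    return [principle for principle, ok in checks.items() if not ok]


# Example with placeholder values: an object that has an ID and a URL,
# but no declared metadata format or license.
dataset = DigitalObject(
    identifier="doi:10.0000/example",
    access_url="https://example.org/data.csv",
)
print(fair_gaps(dataset))  # ['Interoperable', 'Reusable']
```

A real compliance check would test much more than field presence (for example, whether the identifier resolves); the sketch only shows the shape of the idea.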

Page 18: Bonazzi commons bd2 k ahm 2016 v2

The Data Commons is a framework that supports FAIR data access and sharing and fosters the development of a digital ecosystem

https://datascience.nih.gov/commons

Page 19: Bonazzi commons bd2 k ahm 2016 v2

The Data Commons Framework

• Compute Platform: Cloud
• Services: APIs, Containers, Indexing
• Software: Services & Tools, scientific analysis tools/workflows
• Data: “Reference” Data Sets, User defined data
• Digital Object Compliance
• App store/User Interface

Service models shown: PaaS, SaaS, IaaS

https://datascience.nih.gov/commons

Page 20: Bonazzi commons bd2 k ahm 2016 v2

Mapping BD2K Activities and Commons Pilots to the Commons Framework

[Figure: Commons Framework diagram (Compute Platform: Cloud or HPC; Services: APIs, Containers, Indexing; Software: Services & Tools, scientific analysis tools/workflows; Data: “Reference” Data Sets, User defined data; Digital Object Compliance; App store/User Interface)]

Activities placed on the diagram:
• NIH + Community defined data sets
• BD2K Centers, MODS, HMP & Interoperability Supplements
• Cloud credits model (CCM)
• BioCADDIE/Other Indexing
• NCI & NIAID Cloud Pilots + GDC

Page 21: Bonazzi commons bd2 k ahm 2016 v2

Current Data Commons Pilots

Page 22: Bonazzi commons bd2 k ahm 2016 v2

Current Data Commons Pilots

Reference Data Sets
• Making large and/or high impact NIH funded data sets and tools accessible in the cloud

Commons Framework Pilots
• Explore feasibility of the Commons Framework
• Facilitate collaboration and interoperability

Resource Search & Index
• Developing Data and Software indexing methods
• Leveraging BD2K Efforts: bioCADDIE and others
• Collaborating with external groups

Cloud Credit Model
• Provide access to cloud (IaaS) and PaaS/SaaS via credits
• Connecting credits to the grants system

Page 23: Bonazzi commons bd2 k ahm 2016 v2

Reference Data Sets Pilot: Large, High-Impact Datasets in the Cloud

Vivien Bonazzi

Page 24: Bonazzi commons bd2 k ahm 2016 v2

Mapping to the Commons Framework: Large, High-Impact Datasets in the Cloud - Populating the Commons

[Figure: Commons Framework diagram (Compute Platform: Cloud or HPC; Services: APIs, Containers, Indexing; Software: Services & Tools, scientific analysis tools/workflows; Data: “Reference” Data Sets, User defined data; Digital Object Compliance; App store/User Interface); label: Large, High-Impact Data Sets in the Cloud]

Page 25: Bonazzi commons bd2 k ahm 2016 v2

Overview: Large, High-Impact Datasets in the Cloud - Populating the Commons

Make large, high impact, NIH funded data sets available in the cloud/commons

Co-locate large datasets and compute power, to improve access, use, re-use, and sharing of data and tools

Kick-start the Commons with Commons-compliant data and tools

Data must adhere to Commons compliance/FAIR principles

Provide indexable test data sets for bioCADDIE (and other indexing efforts)

Page 26: Bonazzi commons bd2 k ahm 2016 v2

What will we learn: Large, High-Impact Datasets in the Cloud - Populating the Commons

This pilot project will inform NIH on:
• Which Clouds are most functional, practical, and cost effective?
• What is involved in moving data resources to the Cloud?
• What will it cost?
• How to manage challenges associated with both open access and controlled access data?
• How do we find data and resources across clouds?
• How do we compute across clouds?

Page 27: Bonazzi commons bd2 k ahm 2016 v2

Proposed Components: Large, High-Impact Datasets in the Cloud

Biomedical data resources and tools
• Support to migrate large, high-impact datasets and associated tools into multiple cloud providers
• Data and tool sets must be FAIR

Cloud Infrastructure
• Support for cloud storage and architectural engineering to support data and tools

Coordination
• Facilitate activities across the biomedical data resources and cloud providers
• Development of market place/app store approaches
• Auth: Authorization & Access controls
• Tracking metrics (cost, usage etc.) and impact of the overall project

Page 28: Bonazzi commons bd2 k ahm 2016 v2

Reference Data Sets – Next Steps

NIH Data Task Force
• Chaired by Francis Collins
• Involves many NIH ICs
• Developing some shorter term preliminary pilots for larger NIH funded data sets in the cloud
• Expect to see some announcements in Jan/Feb 2017

RFI – engage in dialogue with the community
• Planned Winter 2017

FOAs – Supporting large high impact data sets in the cloud
• Spring 2017

Page 29: Bonazzi commons bd2 k ahm 2016 v2

Commons Framework Pilots

Exploring feasibility of the Commons Framework: Software and Services layer

Valentina Di Francesco

Page 30: Bonazzi commons bd2 k ahm 2016 v2

Commons Framework Pilots (CFPs)

Exploring feasibility of the Commons Framework

Facilitating connectivity, interoperability and access to digital objects

Providing digital research objects to populate the Commons

Page 31: Bonazzi commons bd2 k ahm 2016 v2

Commons Framework Pilots

PI (Parent grant’s IC): Project description

Toga (NIBIB)
• Cloud-hosted data publication system
• Allows the automatic creation and publication of data a personalized data repository

Musen (NIAID)
• Smart APIs – improved handling for metadata within APIs
• Ontological support for metadata within an API
• Improving smart API discoverability: a registry of APIs

Han (NIGMS)
• Docker container hub for BD2K community
• Docker containers for genomic analysis applications and pipelines
• Benchmark, Evaluation & best practices

Cooper/Kohane (NHGRI)
• Cloud based authenticated API access and exchange of causal modeling data, tools + genomic and phenomic data (PIC)
• Docker containers for CCD tools available in AWS

Haussler (NHGRI)
• Secure sharing of germline genetic variations for a targeted panel of breast cancer susceptibility genes and variations
• (GA4GH) API: being able to query this data and metadata

Ohno-Machado (NHLBI)
• Development of an ecosystem for repeatable science
• Easy reuse of data AND software; tracking of provenance
• Use of container technologies for software and data reuse

White (NHGRI)
• The entire HMP1 data set made accessible on AWS
• Analysis tools for microbiome data in AWS

Ma’ayan (NHLBI)
• A Cloud-Based Microscopy Imaging Commons Portal with microscopy data and metadata

Sternberg (NHGRI)
• Development of a cloud-based literature curation system for specific curation tasks of the collaborating sites
• An API to provide programmatic access to the relevant papers in PMC

MODs PIs (NHGRI)
• Development of a common data model for the MODs
• Development of APIs accessing data across the MODs

Page 32: Bonazzi commons bd2 k ahm 2016 v2

Commons Framework Pilots

• APIs
• Containerization: Docker containers, guidelines, registry store
• Workbenches, Connectors
• Indexing
• Market Place/App Store

Page 33: Bonazzi commons bd2 k ahm 2016 v2

Mapping the Commons Framework Pilots to the Commons Framework

[Figure: Commons Framework diagram (Compute Platform: Cloud or HPC; Services: APIs, Containers, Indexing; Software: Services & Tools, scientific analysis tools/workflows; Data: “Reference” Data Sets, User defined data; App store/User Interface)]

Pilots placed on the diagram: White (HMP), Musen, Ma’ayan, Cooper, Han, Haussler, MODs, Sternberg, Ohno-Machado, Toga

Page 34: Bonazzi commons bd2 k ahm 2016 v2

Commons Framework Pilots : Updates

Sept. 2015 – First set of CFPs awarded

Nov. 2015 - CFPs participated in the AHM and the Commons breakout session

Feb. 2016 - Established Commons Framework Working Group (CFWG)

• CFWG members: Pilots’ PIs and/or technical leads; a few PIs of the BD2K interoperability projects

• Meeting in person on March 1, 2016

Page 35: Bonazzi commons bd2 k ahm 2016 v2

Commons Framework Pilots : Updates

March 2016 – CFPs meeting in person

• To develop an initial plan for the implementation of Commons Framework
• Meeting presentations here
• A manuscript describing the outcomes of the meeting was submitted
• Established the Commons Framework Working Group (CFWG) and sub-WGs on the following topics:
  • FAIRness Metrics (Neil McKenna & Michel Dumontier)
  • Data-object registry (Lucila Ohno-Machado, Michel Dumontier, Wei Wang)
  • Interoperability of APIs (Michel Dumontier)
  • Workflow sharing and docker registry (Umberto Ravaioli & Brian O’Connor)
  • Commons Framework Publications (Owen White)

Nov 28, 2016 – Held a CFWG meeting in person

These groups will present a report of their activities at the Commons Session tomorrow at 10:30am

Page 36: Bonazzi commons bd2 k ahm 2016 v2

Commons Framework WG - Next Steps

GET INVOLVED: See Valentina Di Francesco or WG leads for details

A broad announcement to the BD2K research community went out in late summer – we are seeking more participants

Contribute to the implementation of the Commons Framework

Suggest other scientific areas of interest that need coordination

Generate guidelines that all of our peers will use as we begin to jumpstart the NIH Commons

Participate in meetings of the CFWG and hear the latest news

Page 37: Bonazzi commons bd2 k ahm 2016 v2

Commons Framework – Next Steps

FOA: Support investigator-initiated projects to further develop the Data Commons Framework
• Could leverage and expand upon resources developed with the Reference data sets
• Planned Fall 2017

FOA: Making existing data and tools Commons Compliant/FAIR
• Competitive Supplements to existing NIH Awards
• Provide support to existing projects to make current digital resources FAIR & Commons Compliant
• Digital resources could include: data, analytical software, or workflows
• Planned Fall 2017

Page 38: Bonazzi commons bd2 k ahm 2016 v2

Resource Search & Indexing

Discoverability of data and software

Ian Fore, Ron Margolis, Alison Yao, Claire Schulkey, Dawei Lin

Page 39: Bonazzi commons bd2 k ahm 2016 v2

Mapping to the Commons Framework: Large, High-Impact Datasets in the Cloud - Populating the Commons

[Figure: Commons Framework diagram (Compute Platform: Cloud or HPC; Services: APIs, Containers, Indexing; Software: Services & Tools, scientific analysis tools/workflows; Data: “Reference” Data Sets, User defined data; Digital Object Compliance; App store/User Interface); label: Indexing]

Page 40: Bonazzi commons bd2 k ahm 2016 v2

An Indexing Ecosystem for the Commons: a virtual environment for ‘FIND’

Enable biomedical research by providing scientists with the ability to FIND digital resources

Establish a mature resource discovery tool(s) that can be sustained as long as the need for it exists

Focus on characteristics of the tool as infrastructure
• Maintains a defined level of service
• Contribute to a Commons that is reliable, available, easy to use, and adaptable

Page 41: Bonazzi commons bd2 k ahm 2016 v2

Current Activities

Identify indexing activities in and outside NIH
• BD2K: bioCADDIE, Centers of Excellence
• ICs: NLM, NCI, NHGRI, other
• Non-BD2K: Elixir (EBI), Publishers (Elsevier), Repositories, schema.org

Compare ongoing activities and identify needs
• Benchmarking
• Identify gaps in strategy
• Dimensions to consider: Content, Metadata, Platform/Technology

Coordinate with other BD2K PMWGs
• Standards
• Specific Center WGs

Page 42: Bonazzi commons bd2 k ahm 2016 v2

Cloud Credits Model

George Komatsoulis

Page 43: Bonazzi commons bd2 k ahm 2016 v2

Mapping to the Commons Framework: Large, High-Impact Datasets in the Cloud - Populating the Commons

[Figure: Commons Framework diagram (Compute Platform: Cloud or HPC; Services: APIs, Containers, Indexing; Software: Services & Tools, scientific analysis tools/workflows; Data: “Reference” Data Sets, User defined data; Digital Object Compliance; App store/User Interface); label: Cloud Credits Pilot]

Page 44: Bonazzi commons bd2 k ahm 2016 v2

[Figure: Cloud credits workflow, with steps numbered 1–8]

Participants shown: Investigator, Investigator Institution, NIH, CMS FFRDC, The Commons, Cloud Provider A, Cloud Provider B, Cloud Provider C

Step labels shown: [OPTIONAL] Approves Credit Request; Requests Credits; Directs reseller to distribute credits; Distributes; Uses credits; Delivers Funding Recommendation; Review & Approval; Review & Selection

Page 45: Bonazzi commons bd2 k ahm 2016 v2

How do credits work from the point of view of an investigator?

Investigators receive credits worth a certain amount (in dollars) that can be used at the conformant provider(s) of their choice

Credits are pre-purchased and applied to the account of the investigator with the relevant provider(s)

As the investigator uses services with a conformant provider, the provider debits the value of the investigator’s usage against the pre-loaded credits

INVESTIGATORS ARE NOT BILLED BY PROVIDERS AS LONG AS THEY DO NOT EXCEED THEIR CREDIT ALLOCATION.
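
The debit model described above can be sketched in a few lines. This is a hypothetical illustration: the `CreditAccount` class, its method names, and the dollar amounts are not from the pilot, and the handling of usage beyond the allocation (billed to the investigator) is an assumption inferred from the last statement on the slide.

```python
from decimal import Decimal


class CreditAccount:
    """Hypothetical pre-loaded credit account held with one conformant provider."""

    def __init__(self, provider: str, credits: str):
        self.provider = provider
        self.balance = Decimal(credits)  # pre-purchased credits, in dollars
        self.billed = Decimal("0")       # usage not covered by credits (assumed)

    def record_usage(self, cost: str) -> Decimal:
        """Debit usage against the credits; return the total billed so far."""
        cost = Decimal(cost)
        covered = min(cost, self.balance)  # credits absorb as much as they can
        self.balance -= covered
        self.billed += cost - covered      # assumption: overage goes to the investigator
        return self.billed


# Example with made-up amounts: $1,000 of credits, then two rounds of usage.
account = CreditAccount("provider-a", "1000.00")
account.record_usage("400.00")  # fully covered by credits
account.record_usage("700.00")  # $600 covered, $100 over the allocation
print(account.balance, account.billed)
```

`Decimal` is used instead of floats so that dollar amounts add and subtract exactly.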

Page 46: Bonazzi commons bd2 k ahm 2016 v2

Commons Credits Model Pilot

3 year pilot to test this business model to facilitate researcher use of cloud resources (enhance data sharing and potentially reduce costs).

Contract with the CMS Alliance to Modernize Healthcare (CAMH) Federally Funded Research and Development Center (FFRDC) managed by the MITRE Corporation
• FFRDCs are special purpose, government-owned but contractor-managed entities that meet R&D needs that can’t be well managed by traditional grants and contracts
• Examples: National Labs and organizations like RAND

Pilot will not directly interact with the existing grant system.
• Instead is modeled on the mechanisms being used to gain access to NSF and DOE national resources (HPC, light sources, etc.)

The only required qualification for applying for credits will be that the investigator must have an existing NIH grant

Page 47: Bonazzi commons bd2 k ahm 2016 v2

Commons Credits Model Pilot

Current List of Approved Vendors
• DLT = Amazon Web Services Reseller
• IBM
• Onix = Google Reseller
• Broad and ISB NCI Cloud Pilots accessible via Google
• Two more approved but negotiating participation agreement

First batch of credits issued Sep 29, 2016
• 8 Investigators (cohort 1) that are part of an ‘alpha test’
• Only IBM/AWS at the time
• 93% AWS, 7% IBM
• First credits have been used, usage information coming

First “production” credit request period opening this month

Page 48: Bonazzi commons bd2 k ahm 2016 v2

Considerations and Concluding Thoughts

Page 49: Bonazzi commons bd2 k ahm 2016 v2

Considerations

Communication

Metrics – Understanding and accounting of data usage patterns

Cost
• Cloud Storage
• Pay for use cloud compute (NIH credits pilot)
• Indirect costs for cloud

Hybrid Clouds – Institution (private) and commercial (public) clouds

Managing Open vs Controlled access data
• Auth: single sign on - dreams/nightmares?

Archive vs Working Copies of data

Interoperability with other Commons (clouds)

Page 50: Bonazzi commons bd2 k ahm 2016 v2

Standards – Metadata, UIDs, APIs

Discoverability – Finding digital objects across clouds

Interfaces – For users with different needs and capabilities

Consent – Reconsenting data, Dynamic consents?

Policies
• Data sharing policies that are useful and effective
• Keep pace with use of technology (e.g. dbGaP data in the Cloud)

Incentives
• Access to, and shareability of FAIR Data as part of NIH grant review criteria

Governance – Community involvement in governance models

Sustainability – Long term support

Page 51: Bonazzi commons bd2 k ahm 2016 v2

Summary

We need an unprecedented level of convergence and collaboration to drive biomedical science to the next level.

Supporting this model of data-intensive collaborative science requires a shift in academic research culture and new investments in data infrastructure and capabilities.

Matthew Trunnell, FHC

Page 52: Bonazzi commons bd2 k ahm 2016 v2

Acknowledgments

• ADDS Office: Jennie Larkin, Phil Bourne, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso, Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)

• NCBI: George Komatsoulis

• NHGRI: Valentina Di Francesco

• NIGMS: Susan Gregurick

• CIT: Andrea Norris, Debbie Sinmao

• NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr

• NCI Cloud Pilots/GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen

• Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres (BITRIS), Sean Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)

• RIWG Core Team: Ron Margolis (DK), Ian Fore (NCI), Alison Yao (AI), Claire Schulkey (AI), Eric Choi (AI)

• OSP: Dina Paltoo, Kris Langlais, Erin Luetkemeier, Agnes Rooke

• Research and Industry: Matthew Trunnell (FHC), Bob Grossman (Chicago), Toby Bloom (NYGC)

Page 53: Bonazzi commons bd2 k ahm 2016 v2

Acknowledgements - CFPs

NIH CFPs WG
• Valentina Di Francesco
• Sam Moore
• Vivien Bonazzi
• Allen Dearry
• Maria Giovanni
• Susan Gregurick
• Weiniu Gan
• James Luo
• Stacia Friedman-Hill
• Ajay Pillai
• Leslie Derr
• Debbie Sinmao
• Eric Choi
• Claire Schulkey
• George Komatsoulis

CFWG
• Owen White
• Neil McKenna
• Michel Dumontier
• Umberto Ravaioli
• Brian O’Connor
• Lucila Ohno-Machado
• Wei Wang
• All the other members

Page 54: Bonazzi commons bd2 k ahm 2016 v2

Acknowledgements - Credits Model

ADDS Office
• Vivien Bonazzi
• Phil Bourne
• Jennie Larkin
• Mark Guyer

MITRE
• Ari Abrams-Kudan
• Wenling (Eileen) Chang
• Peter Gutgarts
• Lynette Hirschman
• William Kim
• Eldred Rubeiro
• Bruce Shirk
• David Tanenbaum
• Lisa Tutterow

Grant Thornton
• Katie Beringer
• Mike Clifford
• Tamara Reynolds

NIH
• Tanja Davidsen (NCI)
• Valentina Di Francesco (NHGRI)
• Susan Gregurick (NIGMS)
• David Lipman (NCBI)
• Vivek Navale (CIT)
• Jim Ostell (NCBI)
• Debbie Sinmao (CIT)
• Nick Weber (NIAID)

NITRD
• Peter Lyster