EMBL Australian Bioinformatics Resource AHM - Data Commons

BD2K and why bioinformatics matters: relevance to Australia. EMBL - Australia AHM 2016. Vivien Bonazzi, Senior Advisor for Data Science Technologies, ADDS (Assoc. Director for Data Science) Office, Office of the Director (OD), National Institutes of Health (NIH)

Transcript of EMBL Australian Bioinformatics Resource AHM - Data Commons

Page 1: EMBL Australian Bioinformatics Resource AHM   - Data Commons

BD2K and why bioinformatics matters

relevance to Australia

EMBL - Australia AHM 2016

Vivien Bonazzi
Senior Advisor for Data Science Technologies

ADDS (Assoc. Director for Data Science) Office
Office of the Director (OD)
National Institutes of Health (NIH)

Page 2: EMBL Australian Bioinformatics Resource AHM   - Data Commons

The NIH Data Commons Digital Ecosystems for using and sharing FAIR Data

EMBL - Australia AHM 2016

Vivien Bonazzi
Senior Advisor for Data Science Technologies

ADDS (Assoc. Director for Data Science) Office
Office of the Director (OD)
National Institutes of Health (NIH)

Page 3: EMBL Australian Bioinformatics Resource AHM   - Data Commons

http://datascience.nih.gov/bd2k

A word about BD2K

Page 4: EMBL Australian Bioinformatics Resource AHM   - Data Commons

What’s driving the need for a Data Commons?

Page 5: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Convergence of factors

Mountains of Data

Increasing need and support for Data sharing

Availability of digital technologies and infrastructures that support Data at scale

Page 6: EMBL Australian Bioinformatics Resource AHM   - Data Commons
Page 7: EMBL Australian Bioinformatics Resource AHM   - Data Commons
Page 8: EMBL Australian Bioinformatics Resource AHM   - Data Commons

NIH Genomic Data Sharing (GDS) Policy: https://gds.nih.gov/ (went into effect January 25, 2015)

NCI guidance: http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data

Requires public sharing of genomic data sets

Page 9: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Recommendation #4: A national cancer data ecosystem for sharing and analysis.

Create a National Cancer Data Ecosystem to collect, share, and interconnect a broad array of large datasets so that researchers, clinicians, and patients will be able to both contribute and analyze data, facilitating discovery that will ultimately improve patient care and outcomes.

Page 10: EMBL Australian Bioinformatics Resource AHM   - Data Commons
Page 11: EMBL Australian Bioinformatics Resource AHM   - Data Commons
Page 12: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Challenges with Biomedical Data

• The Journal Article is the end goal
• Data is a means to an end (low value)
• Data is not FAIR: Findable, Accessible, Interoperable, Reusable
• Limited e-infrastructures to support FAIR data

Page 13: EMBL Australian Bioinformatics Resource AHM   - Data Commons

What’s Changing?

Digital ecosystems

Page 14: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Development of the NIH Data Commons

Page 15: EMBL Australian Bioinformatics Resource AHM   - Data Commons

• How do we find data, software, standards?
• How can we make (large) data, annotations, software, metadata accessible?
• How do we reuse data, tools and standards?
• How do we make more data machine readable?
• How do we leverage existing digital technologies, systems, infrastructures?
• How do we collaborate?
• How do we enable a digital ecosystem?

Changing the conversation around Data sharing and access

NIH Data Commons

Page 16: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Data Commons enabling data driven science

Enable investigators to leverage all possible data and tools in the effort to accelerate biomedical discoveries, therapies and cures by driving the development of data infrastructure and data science capabilities through collaborative research and robust engineering

Matthew Trunnell, FHC

Page 17: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Data Commons’s

Page 18: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Developing a Data Commons

Treats products of research – data, methods, papers etc. – as digital objects

These digital objects exist in a shared virtual space
• Find, Deposit, Manage, Share, and Reuse data, software, metadata and workflows

Digital object compliance through FAIR principles:
• Findable
• Accessible (and usable)
• Interoperable
• Reusable
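A minimal sketch of what "digital object compliance" could look like in code. The `DigitalObject` fields and the four checks are illustrative assumptions, not an NIH specification; real FAIR assessments are considerably richer.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A research product (data set, tool, workflow, paper) treated as a digital object."""
    identifier: str = ""   # persistent unique ID (hypothetical scheme)
    access_url: str = ""   # where the object can be retrieved
    media_type: str = ""   # documented format, e.g. "text/csv"
    license: str = ""      # terms that permit reuse
    metadata: dict = field(default_factory=dict)


def fair_report(obj: DigitalObject) -> dict:
    """Return one pass/fail flag per FAIR letter, using deliberately simple checks."""
    return {
        "findable": bool(obj.identifier) and bool(obj.metadata.get("title")),
        "accessible": obj.access_url.startswith(("http://", "https://")),
        "interoperable": bool(obj.media_type),
        "reusable": bool(obj.license) and bool(obj.metadata.get("creator")),
    }


def is_compliant(obj: DigitalObject) -> bool:
    """An object is compliant only if every FAIR check passes."""
    return all(fair_report(obj).values())


if __name__ == "__main__":
    dataset = DigitalObject(
        identifier="example:0001",
        access_url="https://example.org/data/0001.csv",
        media_type="text/csv",
        license="CC-BY-4.0",
        metadata={"title": "Example expression matrix", "creator": "Example Lab"},
    )
    print(fair_report(dataset))
    print(is_compliant(dataset))
```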

Page 19: EMBL Australian Bioinformatics Resource AHM   - Data Commons

The Data Commons is a framework that supports FAIR data access and sharing and fosters the development of a digital ecosystem

https://datascience.nih.gov/commons

Page 20: EMBL Australian Bioinformatics Resource AHM   - Data Commons

The Data Commons Framework

• App store/User Interface
• Software: Services & Tools, scientific analysis tools/workflows (SaaS)
• Services: APIs, Containers, Indexing (PaaS)
• Compute Platform: Cloud (IaaS)
• Data: “Reference” Data Sets, User defined data
• Digital Object Compliance

https://datascience.nih.gov/commons

Page 21: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Current Data Commons Pilots

Page 22: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Current Data Commons Pilots

Reference Data Sets
• Making large and/or high impact NIH funded data sets and tools accessible in the cloud

Commons Framework Pilots
• Explore feasibility of the Commons Framework
• Facilitate collaboration and interoperability

Cloud Credit Model
• Provide access to cloud (IaaS) and PaaS/SaaS via credits
• Connecting credits to the grants system

Resource Search & Index
• Developing Data and Software indexing methods
• Leveraging BD2K Efforts: bioCADDIE and others
• Collaborating with external groups

Page 23: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Reference Data Sets Pilot: Large, High-Impact Datasets in the Cloud

Page 24: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Commons Framework Pilots

Software and Services

Page 25: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Commons Framework

• FAIRness Metrics
• Data-object registry
• Interoperability of APIs
• Workflow sharing and docker registry
• Commons Framework Publications
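To make "data-object registry" and "FAIRness metrics" concrete, here is a hedged sketch. The content-hash identifier scheme, the `REQUIRED_FIELDS` list and the scoring rule are all assumptions chosen for illustration; the slides do not say how the pilots implement either idea.

```python
import hashlib

# Assumed metadata fields for scoring; not an official NIH or FAIR list.
REQUIRED_FIELDS = ("title", "creator", "license", "format", "url")


class ObjectRegistry:
    """Toy registry mapping content-derived identifiers to metadata records."""

    def __init__(self):
        self._records = {}

    def register(self, content: bytes, metadata: dict) -> str:
        """Mint an identifier from the SHA-256 of the content and store the metadata."""
        object_id = "sha256:" + hashlib.sha256(content).hexdigest()
        self._records[object_id] = dict(metadata)
        return object_id

    def resolve(self, object_id: str) -> dict:
        """Look up the metadata record for an identifier (KeyError if unknown)."""
        return self._records[object_id]

    def fairness_score(self, object_id: str) -> float:
        """Fraction of the assumed required fields that are present and non-empty."""
        record = self._records[object_id]
        present = sum(1 for name in REQUIRED_FIELDS if record.get(name))
        return present / len(REQUIRED_FIELDS)


if __name__ == "__main__":
    registry = ObjectRegistry()
    oid = registry.register(b"example bytes", {"title": "Example", "creator": "Example Lab"})
    print(oid)
    print(registry.fairness_score(oid))
```

Because the identifier is derived from the bytes, registering the same content twice yields the same ID, which is one simple way to avoid duplicate entries.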

Page 26: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Resource Search & Indexing

Discoverability of data and software
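One common way to make resources discoverable is an inverted index over their descriptions. The sketch below illustrates that general technique under simple assumptions (lowercased word tokens, all query terms must match); it is not how bioCADDIE or any NIH index is actually built, and the class and resource names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set:
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ResourceIndex:
    """Toy inverted index: token -> set of resource IDs whose description contains it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, resource_id: str, description: str) -> None:
        for token in tokenize(description):
            self._postings[token].add(resource_id)

    def search(self, query: str) -> list:
        """Return sorted IDs of resources matching every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return []
        matches = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(matches)


if __name__ == "__main__":
    index = ResourceIndex()
    index.add("dataset-1", "Marsupial genome assembly")
    index.add("tool-1", "Genome alignment workflow")
    print(index.search("genome"))
    print(index.search("marsupial genome"))
```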

Page 27: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Cloud Credits Model

Dollar-denominated NIH credits to use cloud resources (IaaS) and services (PaaS/SaaS)
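The credits idea can be pictured as a simple ledger: a grant holds a dollar balance, and each use of a cloud resource or service debits it. This is a sketch under assumed names (`CreditAccount`, the grant ID, the provider labels); the slides do not describe the pilot's actual accounting.

```python
from decimal import Decimal


class InsufficientCredits(Exception):
    """Raised when a charge exceeds the remaining credit balance."""


class CreditAccount:
    """Toy dollar-denominated credit account tied to a (hypothetical) grant ID."""

    def __init__(self, grant_id: str, balance: str):
        self.grant_id = grant_id
        self.balance = Decimal(balance)
        self.ledger = []  # list of (provider, service_tier, cost) tuples

    def charge(self, provider: str, service_tier: str, amount: str) -> Decimal:
        """Debit the account for IaaS, PaaS or SaaS usage and return the new balance."""
        cost = Decimal(amount)
        if cost <= 0:
            raise ValueError("charge amount must be positive")
        if cost > self.balance:
            raise InsufficientCredits(f"{self.grant_id}: need {cost}, have {self.balance}")
        self.balance -= cost
        self.ledger.append((provider, service_tier, cost))
        return self.balance

    def spend_by_tier(self) -> dict:
        """Total spend per service tier, useful for usage metrics."""
        totals = {}
        for _, tier, cost in self.ledger:
            totals[tier] = totals.get(tier, Decimal("0")) + cost
        return totals


if __name__ == "__main__":
    account = CreditAccount("GRANT-0000", "1000.00")
    account.charge("provider-a", "IaaS", "250.50")
    account.charge("provider-b", "SaaS", "49.50")
    print(account.balance)
    print(account.spend_by_tier())
```

`Decimal` is used rather than `float` so that dollar amounts add and subtract exactly.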

Page 28: EMBL Australian Bioinformatics Resource AHM   - Data Commons

The Data Commons Framework

• App store/User Interface
• Software: Services & Tools, scientific analysis tools/workflows (SaaS)
• Services: APIs, Containers, Indexing (PaaS)
• Compute Platform: Cloud (IaaS)
• Data: “Reference” Data Sets, User defined data
• Digital Object Compliance

https://datascience.nih.gov/commons

Page 29: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Indexing

Authorization/authentication layer

Digital Ecosystem

Page 30: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Considerations and Concluding Thoughts

Page 31: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Considerations

• Metrics – Understanding and accounting of data usage patterns
• Cost
  • Cloud Storage
  • Pay for use cloud compute (NIH credits pilot)
  • Indirect costs for cloud
• Hybrid Clouds – Institution (private) and commercial (public) clouds
• Managing Open vs Controlled access data
  • Auth: single sign on - dreams/nightmares?
• Archive vs Working and versioning Copies of data
• Interoperability with other Commons (clouds)
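The open versus controlled access consideration can be sketched as a small authorization check. The two access tiers and the per-user approval set are assumptions made for illustration; real controlled-access systems involve data access committees, consent codes and audited sign-on.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Dataset:
    name: str
    access: str  # assumed tiers: "open" or "controlled"


@dataclass
class User:
    username: str
    approved_datasets: set = field(default_factory=set)  # controlled sets the user may read


def can_read(user: Optional[User], dataset: Dataset) -> bool:
    """Open data is readable by anyone; controlled data needs an authenticated, approved user."""
    if dataset.access == "open":
        return True
    if dataset.access == "controlled":
        return user is not None and dataset.name in user.approved_datasets
    raise ValueError(f"unknown access tier: {dataset.access!r}")


if __name__ == "__main__":
    reference = Dataset("reference-genomes", "open")
    cohort = Dataset("patient-cohort", "controlled")
    analyst = User("analyst", {"patient-cohort"})
    print(can_read(None, reference))
    print(can_read(None, cohort))
    print(can_read(analyst, cohort))
```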

Page 32: EMBL Australian Bioinformatics Resource AHM   - Data Commons

• Standards – Metadata, UIDs, APIs
• Discoverability – Finding digital objects across clouds
• Interfaces – For users with different needs and capabilities
• Consent – Reconsenting data, Dynamic consents?
• Policies
  • Data sharing policies that are useful and effective
  • Keep pace with use of technology (e.g. dbGaP data in the Cloud)
• Incentives
  • Access to, and shareability of, FAIR Data as part of NIH grant review criteria
• Governance – Community involvement in governance models
• Sustainability – Long term support

Page 33: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Relevance to Australia?

Page 34: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Relevance to Australia

The value of Australian Data
• Unique flora and fauna, e.g. marsupials
• Indigenous Australians
• Understanding of genomic structure – health & disease
• Medicinal products

Making this data (securely) available
• With high quality annotation and metadata
• Attributions to original authors
• On the cloud
• Via open standard APIs

Aggregation of data via an Australia-wide Commons?

Page 35: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Indexing

Authorization/authentication layer

Oz Digital Ecosystem

Page 36: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Summary

We need an unprecedented level of convergence and collaboration to drive biomedical science to the next level.

Supporting this model of data-intensive collaborative science requires a shift in academic research culture and new investments in data infrastructure and capabilities.

Matthew Trunnell, FHC

Page 37: EMBL Australian Bioinformatics Resource AHM   - Data Commons

Acknowledgments

• ADDS Office: Jennie Larkin, Phil Bourne, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso, Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)
• NCBI: George Komatsoulis
• NHGRI: Valentina di Francesco
• NIGMS: Susan Gregurick
• CIT: Andrea Norris, Debbie Sinmao
• NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr
• NCI Cloud Pilots/GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen
• Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres (BITRIS), Sean Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)
• RIWG Core Team: Ron Margolis (DK), Ian Fore (NCI), Alison Yao (AI), Claire Schulkey (AI), Eric Choi (AI)
• OSP: Dina Paltoo, Kris Langlais, Erin Luetkemeier, Agnes Rooke
• Research and Industry: Matthew Trunnell (FHC), Bob Grossman (Chicago), Toby Bloom (NYGC)