Big Data

25
06/15/22 2 PPP

description

Barend Mons on Big Data at the SURFnet Relatiedagen 2012

Transcript of Big Data

Page 1: Big Data


Page 2: Big Data

DISC: the connected data departments of DTL research Hotels

DTL

DISC*

*) DISC = DTL Data Integration & Stewardship Centre

technology research

education & training

technology facilities

Page 3: Big Data

What is bioinformatics?


• The science of storing, retrieving and analysing large amounts of biological information

• An interdisciplinary science involving biologists, biochemists, computer scientists and mathematicians

• At the heart of modern biology

Page 4: Big Data


1 Genomes contain genes

2 Genes are transcribed

3 Transcripts translate to protein sequences

4 Proteins form three-dimensional structures

5 Proteins interact with each other and with small molecules to form pathways

6 Pathways combine to build systems

Bioinformatics underpins life-science research

Page 5: Big Data

Life Science data: Multi-omics, multi-technology, multi organism, multi dimensional

Page 6: Big Data

From molecules to medicine


Molecular components Integration Translation

Genomes

Nucleotides

Transcripts

Proteins

Complexes

Pathways

Small molecules

Structures

Domains

Cells

Biobanks

Tissues and organs

Human populations

Therapies

Disease prevention

Early diagnosis

Human individuals

Page 7: Big Data

The challenge

• Computer speed and storage capacity are doubling every 18 months, and this rate is steady

• DNA sequence data has been doubling every 6-8 months over the last 3 years, and this looks set to continue for this decade


Guy Cochrane, ENA, EMBL-EBI
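
To make the gap between the two doubling rates concrete, here is a rough back-of-the-envelope sketch in Python. It assumes pure exponential growth at the doubling times quoted on the slide (18 months for computing, 6 to 8 months for sequence data) and a ten-year horizon chosen only for illustration; the function name `growth_factor` is hypothetical.

```python
def growth_factor(doubling_months: float, horizon_months: float) -> float:
    """Factor by which a quantity grows, assuming a constant doubling time."""
    return 2.0 ** (horizon_months / doubling_months)


HORIZON = 120  # ten years in months (an assumed horizon, not from the slide)

compute = growth_factor(18, HORIZON)   # computer speed and storage
seq_slow = growth_factor(8, HORIZON)   # sequence data, slower bound
seq_fast = growth_factor(6, HORIZON)   # sequence data, faster bound

print(f"Compute/storage grows by a factor of about {compute:,.0f}")
print(f"Sequence data grows by a factor of {seq_slow:,.0f} to {seq_fast:,.0f}")
print(f"Data outpaces compute by at least a factor of {seq_slow / compute:,.0f}")
```

Under these assumptions, computing capacity grows roughly a hundredfold in a decade while sequence data grows by four to six orders of magnitude, which is the mismatch the slide points at.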

Page 8: Big Data

Europe has already paid for the science


Annual cost of generating new protein structure data in labs around the world

Annual cost of maintaining the data in a central database

Page 9: Big Data

ELIXIR’s mission

To build a sustainable European infrastructure for biological information, supporting life science research and its translation to:

• medicine

• environment

• bioindustries

• society

Page 10: Big Data
Page 11: Big Data

13 ELIXIR Countries


Page 12: Big Data

Part two >>>> eScience in LS

• The way we discover knowledge has changed fundamentally over just a decade.


BIGNORANCE

Page 13: Big Data

The general challenge: Data has far outgrown institutional handling capacity

….The amount of digital data is exploding, with a staggering 1.8 zettabytes in 2011

The Issue: The Data Deluge is everywhere, but Life Sciences is particularly challenged and complex.

More and more, we write ‘about datasets’ that are too large to publish in narrative.

Page 14: Big Data

Cardinal Assertion

1 identical assertion

‘n’ different provenances

Nanopublications & Cardinal Assertions

A Cardinal Assertion aggregates all ‘n’ Nanopublications making the same assertion. It therefore has 1 assertion and ‘n’ provenances, eliminating redundancy.

A Nanopublication is the smallest unit of publishable information containing:

1. Assertion: a statement of concepts in terms of one or more ‘subject -> predicate -> object’ (triple) relationships.

2. Provenance

a) Attribution: who made this assertion, when and where?

b) Supporting information: any other information which is relevant to the assertion (e.g. this assertion is only valid in humans under 18).

Nanopublication
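
The structure described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, field names, and the helper `to_cardinal_assertions` are hypothetical, the example data is invented placeholder content, and the sketch uses plain Python objects rather than any Semantic Web serialization.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One 'subject -> predicate -> object' triple."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Provenance:
    """Attribution (who, when) plus optional supporting information."""
    attributed_to: str
    date: str
    supporting_info: str = ""


@dataclass(frozen=True)
class Nanopublication:
    assertion: Assertion
    provenance: Provenance


def to_cardinal_assertions(nanopubs):
    """Group nanopublications by identical assertion.

    Each key is stored once (1 assertion); its value lists the 'n' provenances.
    """
    grouped = defaultdict(list)
    for nanopub in nanopubs:
        grouped[nanopub.assertion].append(nanopub.provenance)
    return dict(grouped)


# Invented placeholder data, for illustration only.
claim = Assertion("gene_A", "is_associated_with", "disease_B")
other = Assertion("gene_A", "encodes", "protein_C")
nanopubs = [
    Nanopublication(claim, Provenance("lab_1", "2011-05-01")),
    Nanopublication(claim, Provenance("lab_2", "2012-02-14", "humans under 18 only")),
    Nanopublication(other, Provenance("lab_1", "2011-05-01")),
]

cardinal = to_cardinal_assertions(nanopubs)
for assertion, provenances in cardinal.items():
    print(assertion.subject, assertion.predicate, assertion.obj,
          "- provenances:", len(provenances))
```

Because the assertion objects are frozen dataclasses, two nanopublications making the same statement hash to the same key, so the repeated assertion is stored once while every provenance is retained.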

Page 15: Big Data

Under the hood……

Page 16: Big Data

Managing volume & complexity

Individual Nanopublications: > 10^14

Individual Cardinal Assertions: > 10^11

Individual Concept Profiles: ≈ 4 x 10^6

Combining Cardinal Assertions with Concept Profiles reduces the amount of data by ≈ 99.999996%
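
The quoted reduction can be checked with simple arithmetic. The sketch below reads the slide's flattened figures as powers of ten (10^14 nanopublications, 10^11 cardinal assertions, 4 x 10^6 concept profiles); that reading is an interpretation of the extracted text, and the slide gives the first two only as lower bounds, so the results are approximate.

```python
nanopublications = 10**14      # slide gives "> 10^14" (lower bound)
cardinal_assertions = 10**11   # slide gives "> 10^11" (lower bound)
concept_profiles = 4 * 10**6   # slide gives "approximately 4 x 10^6"

# Fraction of items eliminated at each stage, relative to the raw nanopublications.
step_one = 1 - cardinal_assertions / nanopublications
overall = 1 - concept_profiles / nanopublications

print(f"Nanopublications -> Cardinal Assertions: {step_one:.4%} fewer items")
print(f"Nanopublications -> Concept Profiles:    {overall:.6%} fewer items")
```

Under this reading, going from 10^14 items to 4 x 10^6 leaves 4 in every 10^8, which matches the 99.999996% on the slide.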

Page 17: Big Data

The LS concept web: 2 x 2 x 10^6 concepts (profiles)

Page 18: Big Data


A dynamic Concept Web versus a static Ontology

Page 19: Big Data

More mutual information

No increase in concept overlap

Including manual curation

More concepts in common

Removal of low info paths

= Known reference pairs

= non-co-occurrence pairs

Page 20: Big Data
Page 21: Big Data

eScience…. in silico reasoning and in cerebro validation

Expert Skype calls

Reading up

Page 22: Big Data

Organisation of the ecosystem

CA Space (OCS & ICS)

Providers

Original Data Owners

Global Authority Nanopublishers App & Service Providers

Users

Endorse

Assist & Certify

Application development

Reasoning services

technical and process consultancy

project delivery capacity

ONS/INSs Academic & Commercial

Users

Knowledge Management

Knowledge Discovery

Best Practices

Page 23: Big Data


Page 24: Big Data

Acceptance of Semantic Web Approach

• Over the last decade, academic research organisations developed new methodologies and tools to address the Big Data problem.

• Global agreement by leading scientists on the unique Nanopublication solution.

• Hundreds of millions already invested in the base technology.

• Applicable as a technology across (STM) domains and industries.

• Pharmaceutical companies are early adopters (Innovative Medicines Initiative).

Page 25: Big Data

Acknowledging…

The ‘Dutch Team’

• Herman van Haagen, MSc (LUMC)
• Dr. Peter Bram ‘t Hoen (LUMC)
• Dr. Marco Roos (LUMC)
• Dr. Erik Schultes (LUMC)
• Prof. Johan den Dunnen (LUMC)
• Prof. Gertjan van Ommen (LUMC)
• Dr. Erik van Mulligen (EMC)
• Dr. Jan Kors (EMC)
• Dr. Martijn Schuemie (EMC)
• Prof. Johan van der Lei (EMC)
• Dr. Rob Hooft (NBIC)
• Dr. Christine Chichester (NBIC)
• Dr. Leon Mei (NBIC)
• Kees Burger (NBIC)
• Bharat Singh (NBIC/EMC)
• Dr. Marc van Driel (NBIC)
• Dr. Ruben Kok (NBIC)
• Prof. Marcel Reinders (NBIC)
• Prof. Jaap Heringa (NBIC)
• Prof. Gert Vriend (NBIC)
• Dr. Morris Swertz (BBMRI, CWA)
• Dr. Andra Waagmeester (NBIC)
• Dr. Kristina Hettne (LUMC)
• Dr. Rene van Schaik (eScience Centre)
• Drs. Albert Mons (PHORTOS consultants)
• Mr. Drs. Arie Baak (PHORTOS consultants)

CWA - Open PHACTS

• Prof. Amos Bairoch (SIB, Switzerland, CWA)
• Prof. Carole Goble (Manchester, CWA, OPS)
• Prof. Katy Borner (Indiana University, CWA)
• Prof. Mark Musen (NCBO, Stanford, CWA, OPS)
• Dr. Pascale Gaudet (UniProt, ISB, CWA)
• Dr. Mike Conlon (VIVO, UF, CWA)
• Prof. Maryann Martone (Force 11, USC, CWA)
• Dr. Nigam Shah (NCBO, Stanford, CWA, OPS)
• Dr. Mark Wilkinson (Canada, CWA)
• Abel Packer (Brazil, Scielo, CWA, OPS)
• Jan Velterop (ACKnowledge, CWA, OPS)
• Albert Mons (CWA, NBIC)
• Prof. Frank van Harmelen (FUA/LARKC, CWA, OPS)
• Dr. Chris Evelo (Maastricht, CWA, OPS)
• Dr. Antony Williams (RSC/ChemSpider, CWA, OPS)
• Dr. Richard Kidd (RSC, OPS)
• Dr. Paul Groth (FUA, CWA, OPS)
• Dr. Michel Dumontier (Canada, CWA, OPS)
• Dr. Andrew Gibson (UA, CWA, OPS)
• Dr. Bryn Williams-Jones (Pfizer, OPS)
• Dr. Ian Dix (Astra Zeneca, OPS)
• Dr. Niklas Blomberg (Astra Zeneca, OPS)
• Dr. Mike Barnes (GSK, OPS)
• Prof. Jan-Erik Litton (CWA, BBMRI)