Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015


Page 1: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Dr. Sven Nahnsen, Quantitative Biology Center (QBiC)

Data Management for Quantitative Biology

Lecture 1: Introduction and overview

Page 2: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Overview

• Administrative stuff (credits, requirements)

•  Motivation/quick review of relevant contents (Bioinformatics I and II)

•  Introduction to this lecture series

• Semester overview

2

Page 3: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Course requirements

To pass this course you must:

•  regularly and actively participate in the weekly problem sessions,

•  pass the final exam, assignments and projects

•  You have to work on assignments alone

•  You will work in small groups for the problem-orientated research project

3

Page 4: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Course credits and grading

•  Credits

-  MSc Bioinfo: 4 LP, module “Wahlpflichtbereich Bioinformatik”

-  MSc Info: 4 LP, area “Wahlpflichtbereich Informatik”

•  Grade

-  30% assignments

-  20% project

-  50% finals

•  Finals: oral exam (30 minutes) covering the contents of the whole lecture, the assignments and the project

•  Finals will be scheduled at the end of the semester (Thu, 30/07/2015)

4

Page 5: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Recommended literature

•  We will point to relevant papers during the course of the lecture

•  Important overview papers:

§  Hastings et al., 2005, Quantitative Bioscience for the 21st century. BioScience. Vol 55 No. 6

§  Cohen JE (2004) Mathematics Is Biology's Next Microscope, Only Better; Biology Is Mathematics' Next Physics, Only Better. PLoS Biol 2(12): e439.

•  Books

§  Free E-Book: Data management in Bioinformatics (http://en.wikibooks.org/wiki/Data_Management_in_Bioinformatics)

§  Lacroix, Z.; Critchlow, T. (eds): Bioinformatics: Managing Scientific Data. Morgan Kaufmann Publishers, San Francisco 2003

§  Michael E. Wall, Quantitative Biology: From Molecular to Cellular Systems. 2012. Chapman & Hall

§  Pierre Bonnet. Enterprise Data Governance: Reference and Master Data Management Semantic Modeling. 2013. Wiley

•  Web resources

§  http://www.ariadne.ac.uk: Ariadne, Web Magazine for Information Professionals

§  http://www.dama.org: The Global Data Management Community

§  H.D. Ehrich: http://www.ifis.cs.tu-bs.de/node/2855

5

Page 6: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Recommended Software

•  These software tools/frameworks and web servers will be used during the problem sessions

http://www.cisd.ethz.ch/software/openBIS

https://usegalaxy.org

https://vaadin.com/home

https://www.knime.org/

6

Page 7: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Contact and organization

•  Questions concerning the lecture/assignments

§  [email protected]

•  Website

§  abi.inf.uni-tuebingen.de/Teaching/ws-2013-14/CPM

•  Christopher Mohr (Sand 14, C322) , Andreas Friedrich (Sand 14, C 304)

•  Dr. Sven Nahnsen (Quantitative Biology Center, Auf der Morgenstelle 10, C2P43, please send e-mail first)

•  Course material will be available on the website (see above), through social media channels and (on request) as a hard copy during the lecture

facebook.com/qbic.tuebingen twitter.com/qbic_tue

7

Page 8: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Who am I

•  Most information about me and our work can be found here: www.qbic.uni-tuebingen.de

8

Page 9: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Contents of this lecture

Date                    Lecturer   Lecture (8-10 AM)

Thursday, 16 April 15   Nahnsen    Introduction and overview

Thursday, 23 April 15   Nahnsen    Biological Data Management

Thursday, 30 April 15   Czemmel    Data sources ("Next-generation" technologies)

Dr. Stefan Czemmel

9

Page 10: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Contents of this lecture

Date                    Lecturer   Lecture (8-10 AM)

Thursday, 7 May 15      Codrea     Database systems (MySQL, NoSQL, etc.)

Thursday, 14 May 15                Ascension Day (Himmelfahrt)

Thursday, 21 May 15     Czemmel    LIMS and E-Lab books

Thursday, 28 May 15     Kenar      Experimental Design

Dr. Marius Codrea, Erhan Kenar

10

Page 11: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Contents of this lecture

Date                    Lecturer                   Lecture (8-10 AM)

Thursday, 4 June 15                                Corpus Christi Day (Fronleichnam)

Thursday, 11 June 15    Nahnsen                    Data analysis workflows (I)

Thursday, 18 June 15    Nahnsen                    Data analysis workflows (II)

Thursday, 25 June 15    Nahnsen                    Standardization

Thursday, 2 July 15     Nahnsen                    Big Data

Thursday, 9 July 15     Nahnsen                    Integrated data management (OpenBIS, OpenBEB)

Thursday, 16 July 15    Nahnsen                    Applications

Thursday, 23 July 15    Nahnsen                    Exam preparation

Thursday, 30 July 15    Nahnsen, Mohr, Friedrich   EXAMS

11

Page 12: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

What is your background?

Ad hoc collection from the audience, Apr. 16, 2015

•  Computer Science

•  Bioinformatics (immunoinformatics; user front-end; integration, visualization)

•  Biology

•  Drug design

•  Agricultural biology (plant breeding)

•  Bioinformatics (Tx, NGS)

•  Geoecology

•  (ecology)

•  Biochemistry; Molecular Biology

•  Structural Biology

•  Electronic business

12

Page 13: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Let us brainstorm

•  What is data management?

Ad hoc collection from the audience, Apr. 16, 2015

-  Rapid access to data

-  Selective access to data; database queries

-  Combine data; manipulate data efficiently

-  Big data storage/analysis

-  Curating quality

-  Data visualization

-  Make data interpretable

13

Page 14: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Let us brainstorm

•  What is data management?

http://zonese7en.com/wp-content/uploads/2014/04/Data-Management.jpg, accessed Apr 10, 2015, 11 AM

14

Page 15: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data Management

•  The official definition provided by DAMA (Data Management Association) International, the professional organization for those in the data management profession, is: “Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise.”

•  Further, the DAMA Data Management Body of Knowledge (DAMA-DMBOK) states: “Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets.”

Wikipedia: http://en.wikipedia.org/wiki/Data_management, accessed Mar 30, 2015, 10 PM

Page 16: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

10 Data Management functions

According to the DAMA Data Management Body of Knowledge (DMBOK)

16

Page 17: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data governance

•  Strategy

• Organization and roles

•  Policies and standards

•  Projects and services

•  Issues

•  Valuation

Source: DAMA DMBOK Guide, p. 10

“Planning, supervision and control over data management and use”

http://meship.com

17

Page 18: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data quality management

•  Data cleansing

•  Data integrity

•  Data enrichment

•  Data quality

•  Data quality assurance

Source: DAMA DMBOK Guide, p. 10

“defining, monitoring and improving data quality”

http://www.arcplan.com/
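
As a small illustration of data cleansing and quality monitoring, the sketch below normalizes a few sample records, removes exact duplicates and reports remaining problems. The record layout (`sample_id`, `organism`, `concentration_ng_ul`) and the example values are hypothetical and not taken from the DMBOK or from this lecture.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SampleRecord:
    # Hypothetical record layout, chosen only for this illustration.
    sample_id: str
    organism: str
    concentration_ng_ul: Optional[float]


def cleanse(raw_rows: List[Dict[str, str]]) -> List[SampleRecord]:
    """Normalize text fields, parse numbers and drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in raw_rows:
        sample_id = row["sample_id"].strip().upper()
        # Collapse repeated whitespace and unify capitalization.
        organism = " ".join(row["organism"].split()).capitalize()
        raw_value = (row.get("concentration_ng_ul") or "").strip()
        try:
            concentration = float(raw_value) if raw_value else None
        except ValueError:
            concentration = None  # unparsable entries such as "n/a"
        record = SampleRecord(sample_id, organism, concentration)
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


def quality_issues(records: List[SampleRecord]) -> List[Tuple[str, str]]:
    """Report records that still violate simple quality rules."""
    issues = []
    for record in records:
        if record.concentration_ng_ul is None:
            issues.append((record.sample_id, "missing concentration"))
        elif record.concentration_ng_ul < 0:
            issues.append((record.sample_id, "negative concentration"))
    return issues


RAW_ROWS = [
    {"sample_id": " qbic001 ", "organism": "homo  sapiens", "concentration_ng_ul": "12.5"},
    {"sample_id": "QBIC001", "organism": "Homo sapiens", "concentration_ng_ul": "12.5"},
    {"sample_id": "qbic002", "organism": "mus musculus", "concentration_ng_ul": "n/a"},
    {"sample_id": "qbic003", "organism": "Homo sapiens", "concentration_ng_ul": "-3"},
]
```

Calling `cleanse(RAW_ROWS)` keeps three records, because the first two rows describe the same sample; `quality_issues` then flags the missing and the negative concentration.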

18

Page 19: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data architecture management

•  Data architecture

•  Data analysis

•  Data design (modeling)

Source: DAMA DMBOK Guide, p. 10

atasourceconsulting.com

19

Page 20: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data development

•  Analysis

•  Data modeling

•  Database design

•  Implementation

Source: DAMA DMBOK Guide, p. 10

dataone.org

20

“Data development is the process of building a data set for a specific purpose. The process includes identifying what data are required and how feasible it is to obtain the data. Data development includes developing or adopting data standards in consultation with stakeholders to ensure uniform data collection and reporting, and obtaining authoritative approval for the data set.”, A guide to data development, Australian Institute of Health and Welfare Canberra, 2007

Page 21: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Database management

•  Data maintenance

•  Data administration

•  Database management system

Source: DAMA DMBOK Guide, p. 10

Page 22: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data Security Management

•  Standards

•  Classification

•  Administration

•  Authentication/Authorization

•  Auditing

Source: DAMA DMBOK Guide, p. 10

http://www.techieapps.com/wp-content/uploads/2013/07/2-1024x768.jpg

22

Page 23: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Reference and Master Data management

•  External/internal codes

•  Customer Data

•  Product Data

•  Dimension management (why do different dimensions (entities) relate to each other)

•  Taxonomy/Ontology

Source: DAMA DMBOK Guide, p. 10

Master Reference

23

Reference data management

Page 24: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Master data

Page 25: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data warehousing and business intelligence management

•  Architecture

•  Implementation

•  Training and Support

• Monitoring and Tuning

Source: DAMA DMBOK Guide, p. 10

Raw data

Metadata

Summary data

Data warehouse

25

Page 26: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data warehousing and business intelligence management

Raw data

Metadata

Summary data

Data warehouse

Input

Report Business intelligence

26

Page 27: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Document, record and content management

•  Acquisition and storage

•  Backup and Recovery

•  Content Management

•  Retrieval

•  Retention

Source: DAMA DMBOK Guide, p. 10

Page 28: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Metadata management

Metadata is data about data

•  Architecture

•  Integration

•  Control

•  Delivery

Source: DAMA DMBOK Guide, p. 10
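
To make "data about data" concrete, the following sketch derives a small metadata record for a data file and serializes it as JSON. The field names (`experiment_type`, `sample_id`, and so on) are assumptions made for this example; they do not follow any particular metadata standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class DatasetMetadata:
    # Hypothetical fields; a real schema would follow a community standard.
    file_name: str
    size_bytes: int
    sha256: str
    experiment_type: str
    sample_id: str


def describe(path: Path, experiment_type: str, sample_id: str) -> DatasetMetadata:
    """Collect technical metadata from the file and combine it with annotations."""
    payload = path.read_bytes()
    return DatasetMetadata(
        file_name=path.name,
        size_bytes=len(payload),
        sha256=hashlib.sha256(payload).hexdigest(),
        experiment_type=experiment_type,
        sample_id=sample_id,
    )


def to_json(meta: DatasetMetadata) -> str:
    """Serialize the record so it can be stored next to the raw data."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)
```

The checksum and size describe the file itself, while `experiment_type` and `sample_id` have to be supplied by the experimenter; both kinds of metadata are needed to find and interpret the data later.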

Page 29: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

DAMA – DMBOK

•  A broad collection of all disciplines and subtopics that are summarized under the umbrella of data management

•  These concern many business-related issues, but many concepts are very well applicable to the field of bioscience

•  We will come back to various aspects of the DAMA DMBOK during the course

29

Page 30: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data management needs in science and research

•  Survey at the University of Oregon, USA (Brian Westra. "Data Services for the Sciences: A Needs Assessment". July 2010, Ariadne Issue 64 http://www.ariadne.ac.uk/issue64/westra/)

•  Different scientific disciplines

30

Page 31: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data management in science and research

Brian Westra. "Data Services for the Sciences: A Needs Assessment". July 2010, Ariadne Issue 64 http://www.ariadne.ac.uk/issue64/westra/, accessed Apr. 10, 2015, 11 AM

1 Data storage and backup

2 Making scientific data findable by others

3 Connecting data acquisition to data storage

4 Allowing or controlling access to scientific data by others

5 Documenting and tracking updates

6 Data analysis and manipulation

7 Finding and accessing related data from others

8 Connecting data storage to data analysis

9 Linking this data to publications or other assets

10 Ensuring data is secure and trustworthy

11 Others

31

Page 32: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Let us brainstorm

•  What is Quantitative Biology?

Ad hoc collection from the audience, Apr. 16, 2015

-  Not only yes/no, but put amounts to entities

-  Huge amount of data

-  Quantitative methods to study biology

-  System-wide analysis; specific pathways

-  Make results human readable and accessible

32

Page 33: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Quantitative Biology

•  The term quantitative biology was coined by Hastings et al., 2005.

•  High-throughput methods have led to a paradigm shift in biomedical research

•  Traditionally, the focus was on one-molecule-at-a-time for most bio(medical) research projects

•  Now, data on whole genomes, exomes, epigenomes, transcriptomes, proteomes and metabolomes can be generated at low cost.

•  The term quantitative biology is used to describe this paradigm shift. Improvements in this area have been driven mainly by two technological developments:

Hastings et al., 2005, Quantitative Bioscience for the 21st century. BioScience. Vol 55 No. 6

Page 34: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Technological innovations

•  State-of-the-art mass spectrometers coupled to high-performance liquid chromatography through soft ionization techniques (HPLC-ESI-MS) have quickly changed the way we do proteomics, metabolomics, and lipidomics.

•  Next-generation sequencing has similarly changed the way we look at genomes, epigenomes, transcriptomes, and metagenomes. Due to advances in chemistry and imaging, sequencing reactions have been parallelized on a very large scale. The comprehensiveness of the data produced by high-throughput methods makes them particularly interesting as general-purpose analytical and diagnostic techniques.

34

Page 35: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Technological innovations

•  Imaging technologies can now produce high-resolution pictures of fine-grained cellular details at a very high speed

•  Finally, methods from bioinformatics and computational biology have matured to rapidly analyze the huge raw data sets that are generated by these high-throughput technologies

35

Page 36: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Contents from Bioinformatics 2 (high-throughput technologies)

• Most of the high throughput technologies have been introduced during the Bioinformatics II lecture

•  There are specialized lectures on “Transcriptomics” and on “Computational Proteomics and Metabolomics”

• We will give a short Recap on the Bioinformatics II contents that are relevant for this lecture

• More advanced topics on data generation methods will be introduced in lecture 3 by Dr. Stefan Czemmel (focus on next generation sequencing)

36

Page 37: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Origin of the “Central Dogma of Molecular Biology” (Francis Crick, 1956)

The central dogma of molecular biology

•  First articulation by Francis Crick in 1956

•  Published in Nature in 1970

37

Page 38: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

The central dogma – classical view

•  In general, the classic view reflects how biology is (biological data are) organized

•  Genomics, however, enabled a more complex view

Cox Systems Biology Lab | Research, University of Toronto, Canada

Page 39: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Reminder (Bioinformatics 2)

Oltvai-Barabasi, Science, 2002

Page 40: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Recap Bioinformatics II: Systems biology

•  Quantitative data on various levels of biological complexity form the foundation of systems biology

•  Mathematical modeling has been based on gene expression

•  Recent important technological improvements allow the analysis of protein and metabolite profiles to a great depth

•  Important layers for understanding biology

•  New experimental techniques pose tremendous challenges for computational analysis

40

Page 41: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Recap Bioinformatics II: Aims of Systems Biology

•  Describe large-scale organization

•  Quantitative modeling

•  Describe cell as system of networks

-  Fundamental research: time-resolved quantitative understanding of living systems

-  Medicine: enable personalized medicine (e.g., improve treatment strategies for cancer patients)

-  Biotechnology: improve production, degradation, construction of synthetic organisms, etc.

41

Page 42: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Exp. Methods – Transcriptomics

•  Extract and amplify RNA

•  Hybridization on microarray

•  Identify and quantify by fluorescence signal

•  Sequences can be mapped back to genome

Lindsay, Nature Rev. Drug Discovery, 2003, 2, 803

Page 43: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Microarray Data Analysis

•  Key problems in microarray data analysis are

-  Data normalization

-  Clustering

-  Dimension reduction

-  Diagnostics/classification

-  Network inference

-  Visualization of results

Janko Dietzsch, Nils Gehlenborg and Kay Nieselt. Mayday-a microarray data analysis workbench. Bioinformatics 2006 22(8):1010-1012
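
Data normalization, the first problem in the list, can be illustrated with a very simple scheme: log2 transformation followed by median centering of each array. This scheme is chosen here only because it is easy to follow; it is not claimed to be the method used by Mayday or taught in Bioinformatics II, and the intensity values are made up.

```python
import math
from statistics import median
from typing import List


def log2_median_center(intensities: List[float]) -> List[float]:
    """Log2-transform raw intensities and subtract the array median."""
    if any(value <= 0 for value in intensities):
        raise ValueError("intensities must be positive for a log transform")
    logged = [math.log2(value) for value in intensities]
    center = median(logged)
    return [value - center for value in logged]


# Two hypothetical arrays measuring the same genes; the second one is
# globally 8 times brighter, e.g. because more material was loaded.
ARRAY_A = [1.0, 2.0, 4.0]
ARRAY_B = [8.0, 16.0, 32.0]
```

After normalization both arrays yield `[-1.0, 0.0, 1.0]`: the global brightness difference is removed, while the relative differences between genes are kept.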

Page 44: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Genome sequencing

February 15, 2001 February 16, 2001

44

Page 45: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Genome sequencing

•  2001: initial publication

•  2003: 2nd draft “Human Genome”

•  > 13 years of work and > 3*10^9 $

•  2010: 8 days, 1*10^4 $

•  Today: approximately 5.5 days and < 1*10^4 $

•  Future: within 3 years, a biotech company (Pacific Biosciences) expects a similar amount of data in < 15 min for < 1*10^3 $
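
The size of this drop is easier to appreciate as a ratio. The short calculation below uses only the figures from the slide; since the slide gives the first genome as "> 13 years" and "> 3*10^9 $", the results are lower bounds, and leap days are ignored.

```python
def fold_change(before: float, after: float) -> float:
    """How many times smaller 'after' is compared to 'before'."""
    return before / after


COST_FIRST_GENOME_USD = 3e9      # "> 3*10^9 $" on the slide
COST_2010_USD = 1e4              # "1*10^4 $" on the slide
DAYS_FIRST_GENOME = 13 * 365     # "> 13 years", leap days ignored
DAYS_2010 = 8

cost_reduction = fold_change(COST_FIRST_GENOME_USD, COST_2010_USD)
time_reduction = fold_change(DAYS_FIRST_GENOME, DAYS_2010)

print(f"cost: at least {cost_reduction:,.0f}-fold cheaper")   # 300,000-fold
print(f"time: at least {time_reduction:,.0f}-fold faster")    # 593-fold
```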

45

Page 46: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Status genomics/transcriptomics

•  Dramatic drop in cost for genome sequencing

•  Number of sequenced genomes grows continuously

•  Genome is a very static snapshot of living system

•  Biological adaptation is rather slow; long-term information storage

•  Proteins and their reaction products, metabolites, are much closer to reality

•  Genome and transcriptome databases are essential bases for proteomics and metabolomics research

46

Page 47: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Genomics vs. Proteomics

Genomics

-  Genomes rather static

-  ~ 20 k genes

-  established technology (capillary sequencer)

Proteomics

-  Proteomes are dynamic (age, tissue, breakfast, …)

-  up to 1000 k proteins

-  emerging technologies (MS, HPLC/MS, protein chips)

47

Page 48: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Proteomics

http://www.iamashcash.com/wp-content/uploads/2011/03/caterpillar-to-butterfly1.jpg, accessed: 14/10/2013 6 PM

Genome remains the same

Proteome changes

48

Page 49: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Main fields of proteomics

49

Page 50: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Applications of proteomics

50

Page 51: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Shotgun Proteomics

51

Page 52: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Next generation sequencing (in 2008)

1st generation

-  Applied Biosystems 3730xl: 0.08 Mb / run, 1 Mb / 24 h

2nd generation

-  Illumina / Solexa Genome Analyzer: 2000 Mb / run (96 h)

-  Roche / 454 Genome Sequencer FLX: 400 Mb / run (8 h)

-  Applied Biosystems SOLiD: 3000 Mb / run (120 h)

300 : 1 (2008)

>3000 : 1 (2010)

Slide: Prof. Peter Bauer

52

Page 53: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

“3rd” generation sequencing

CompleteGenomics DNB sequencing: 18x 210 Gb / run

>37,000 : 1 (2010)

Drmanac Science (2010) 5961: 78-81.

Slide: Prof. Peter Bauer

53

Page 54: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

High resolution imaging

“Imaging in biology may refer to >15 different technologies”

Prominent and data-intense examples include:

•  Optical (bioluminescence and fluorescence imaging)

•  Magnetic resonance imaging

•  X-ray computed tomography

•  Positron emission tomography

http://en.wikipedia.org/wiki/Biological_imaging, accessed Apr. 13, 4 PM

Page 55: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Imaging workflow

Eliceiri et al., Nature Methods 9, 697–710 (2012)

Page 56: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Database systems

•  Relational databases

-  Example MySQL

•  NoSQL databases

-  Example MongoDB

•  How to query databases

•  Entity relationship models

•  Repositories (e.g. Pride, PeptideAtlas)

-  Annotations

-  Sequences
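
The relational ideas listed here (entities, relationships, queries) can be sketched with Python's built-in `sqlite3` module, used as a lightweight stand-in for MySQL. The two-table schema and the accessions `PROT_A`/`PROT_B` are invented for illustration and do not reflect the actual schema of PRIDE or PeptideAtlas.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a one-to-many relationship."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the relationship
    conn.executescript(
        """
        CREATE TABLE protein (
            accession   TEXT PRIMARY KEY,
            description TEXT NOT NULL
        );
        CREATE TABLE peptide (
            id                INTEGER PRIMARY KEY,
            sequence          TEXT NOT NULL,
            protein_accession TEXT NOT NULL REFERENCES protein(accession)
        );
        """
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [("PROT_A", "hypothetical protein A"), ("PROT_B", "hypothetical protein B")],
    )
    conn.executemany(
        "INSERT INTO peptide (sequence, protein_accession) VALUES (?, ?)",
        [("PEPTIDEK", "PROT_A"), ("SAMPLER", "PROT_A"), ("ELVISK", "PROT_B")],
    )
    return conn


def peptides_per_protein(conn: sqlite3.Connection):
    """Query across both entities: count the peptides of each protein."""
    return conn.execute(
        """
        SELECT p.accession, COUNT(pe.id)
        FROM protein AS p
        LEFT JOIN peptide AS pe ON pe.protein_accession = p.accession
        GROUP BY p.accession
        ORDER BY p.accession
        """
    ).fetchall()
```

`peptides_per_protein(build_demo_db())` returns `[('PROT_A', 2), ('PROT_B', 1)]`. In a document store such as MongoDB, the same information would more typically be kept as one document per protein with its peptides nested inside, which trades the join for some redundancy.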

56

Page 57: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Many database concepts

http://dataconomy.com/wp-content/uploads/2014/07/fig2large.jpg

Page 58: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Databases/Repositories in Genomics/Proteomics

http://www.ebi.ac.uk/ena

http://www.peptideatlas.org

http://www.ebi.ac.uk/pride/archive/

58

Page 59: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Large-scale study data – 1000 Genomes

•  Sample lists and sequencing progress

•  Variant Calls

•  Alignments

•  Raw sequence files

http://www.1000genomes.org/data

Page 60: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Large-scale study data – The cancer genome atlas (TCGA)

•  TCGA aims to help diagnose, treat and prevent cancer

•  Explore the entire spectrum of genomic changes involved in more than 20 types of human cancer

•  Approx. 2 PB of genomic raw data

http://cancergenome.nih.gov

Page 61: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Laboratory information management systems / Electronic Lab Books

•  How to track all information that is generated in the laboratory

•  Automated annotation of all experimental parameters is essential for reproducible science

•  Currently, most experiments are recorded manually in paper lab notebooks

•  Data security (intellectual property versus open data)

61

Page 62: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Experimental design

•  Biological experiments are very complex

•  Statistical significance requires a high number of biological replicates

•  Often many different conditions and time points need to be considered

•  One study can involve many different experiments (multi-omics studies involve different omics layers, e.g. genomics + transcriptomics + proteomics)

•  All experiments come with different metadata requirements

•  For various reasons the experimental design is not always balanced (e.g. 5 samples in group A and only 3 samples available for group B)

Friedrich, A., et al. Biomed Research International, April 2015 – in press. Nahnsen, S., Drug Target, May 2015 – in press.
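
Whether a design is balanced can be checked mechanically once the sample-to-group assignment is recorded. The sketch below reproduces the example from the slide (5 samples in group A, 3 in group B); the sample identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict


def group_sizes(sample_to_group: Dict[str, str]) -> Counter:
    """Count how many samples were assigned to each experimental group."""
    return Counter(sample_to_group.values())


def is_balanced(sizes: Counter) -> bool:
    """A design is balanced when every group has the same number of samples."""
    return len(set(sizes.values())) <= 1


# Hypothetical sample identifiers for the 5-versus-3 example.
DESIGN = {f"A{i}": "A" for i in range(1, 6)}
DESIGN.update({f"B{i}": "B" for i in range(1, 4)})

SIZES = group_sizes(DESIGN)
print(dict(SIZES), "balanced" if is_balanced(SIZES) else "unbalanced")
```

Recording the assignment in this structured form from the start is what makes such checks, and later the statistical analysis, possible without going back to handwritten notes.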

Page 63: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Experimental design

Friedrich, A., et al. Biomed Research International, April 2015 – in press. Nahnsen, S., Drug Target, May 2015 – in press.

Page 64: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data analysis workflows

•  Chain different (heterogeneous) tools

•  Parameter handling

•  Execution in high performance computing environment made easy
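
The first two points, chaining tools and handling their parameters, can be sketched in a few lines. This is only a toy model of what workflow systems such as KNIME or Galaxy do; the three "tools" and the `run_workflow` helper are hypothetical, and no HPC execution is involved.

```python
from typing import Any, Callable, Dict, List, Tuple

# A step is a name, a tool (any callable) and the parameters for that tool.
Step = Tuple[str, Callable[..., Any], Dict[str, Any]]


def to_upper(reads: List[str]) -> List[str]:
    return [read.upper() for read in reads]


def filter_short(reads: List[str], min_length: int) -> List[str]:
    return [read for read in reads if len(read) >= min_length]


def gc_fraction(reads: List[str]) -> List[float]:
    return [(read.count("G") + read.count("C")) / len(read) for read in reads]


def run_workflow(data: Any, steps: List[Step]):
    """Feed the output of each tool into the next and log the parameters used."""
    provenance = []
    for name, tool, params in steps:
        data = tool(data, **params)
        provenance.append((name, dict(params)))
    return data, provenance


STEPS: List[Step] = [
    ("uppercase", to_upper, {}),
    ("length filter", filter_short, {"min_length": 3}),
    ("GC content", gc_fraction, {}),
]

result, provenance = run_workflow(["acgt", "ac", "GGCC"], STEPS)
print(result)  # [0.5, 1.0]
```

Keeping the parameters in the step description rather than inside the tools means the `provenance` log documents exactly how a result was produced, which is the data management aspect of workflows.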

64

Page 65: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Standardization in bioinformatics

•  Many world-wide bioinformatics initiatives need to rely on open standards

•  Development of standards has to be a community effort

•  Standardized data formats are important to guarantee

-  Sustainability

-  Independence of instrument vendors

-  Independence of analysis software

-  Exchangeability of raw data

•  Standard formats increase the amount of data by a factor of x (x = 2-4)

•  Many people refrain from using open standards

65

Page 66: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Big data

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found …

http://en.wikipedia.org/wiki/Big_Data, accessed Apr 24, 2014

66

Page 67: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Big data examples

•  European Council for Nuclear Research (CERN), Geneva, Switzerland: 25 petabytes/year at the LHC (Large Hadron Collider) (~6.2 million DVDs)

CERN LHC data

ep.ph.bham.ac.uk, 2014

Page 68: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Big data examples

•  Google processes 9.1 exabytes/year (300 million DVDs)

GOOGLE data

Mayer-Schönberger, 2013, ititch.com, 2014

Page 69: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Biology and Big data?

•  Classical: observation of nature and its phenomena

•  1950s: breakthrough in molecular biology

In 1956, Francis Crick formulated the central dogma of molecular biology:

DNA → RNA → Proteins

-  DNA: carrier of the genetic information

-  RNA: expression of specific genes

-  Proteins: carry out the required functions in the cell

69

Page 70: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Big data

Vivien Marx, Biology: The big challenges of big data, Nature. 2013, doi:10.1038/498255a

Page 71: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Integrated data management in biology/biomedicine

http://media.americanlaboratory.com/m/20/Article/35231-fig1.jpg

Page 72: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

QBiC infrastructure

72

Page 73: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

NGS Lab

Lab Storage

Data movers

•  Automatically move large to huge file-based data to a remote (central) storage

•  Use an rsync routine; easy configuration via a config file

•  Data mover authentication: public/private key SSH authentication

•  Move data to openBIS dropboxes (individual boxes and users for each of the five member labs)

Data Mover

DataMover:

•  Developed at ETH Zurich as part of openBIS

•  http://www.cisd.ethz.ch/software/Data_Mover
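
The combination of a config file, rsync and key-based SSH can be sketched as follows. This is not the Data Mover implementation or its configuration format: the `[mover]` section, its keys, the host name and all paths are assumptions for illustration. The function only builds the command; running it (for example with `subprocess.run`) is left out.

```python
import configparser
import shlex
from typing import List

# Hypothetical configuration; keys and values are made up for this sketch.
EXAMPLE_CONFIG = """
[mover]
incoming_dir = /data/incoming/
remote_host = storage.example.org
remote_user = mover
remote_dir = /dropboxes/lab1/
ssh_key = /home/mover/.ssh/id_rsa
"""


def build_rsync_command(config_text: str) -> List[str]:
    """Translate the configuration into an rsync call over key-based SSH."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    cfg = parser["mover"]
    # rsync runs over ssh; the private key replaces an interactive password.
    remote_shell = "ssh -i " + shlex.quote(cfg["ssh_key"])
    target = f"{cfg['remote_user']}@{cfg['remote_host']}:{cfg['remote_dir']}"
    return [
        "rsync",
        "--archive",              # keep timestamps and permissions
        "--partial",              # allow resuming interrupted large files
        "--remove-source-files",  # turn the copy into a move
        "-e", remote_shell,
        cfg["incoming_dir"],
        target,
    ]


COMMAND = build_rsync_command(EXAMPLE_CONFIG)
print(" ".join(shlex.quote(part) for part in COMMAND))
```

Giving each lab its own remote user and dropbox directory, as described on the slide, would correspond to one such configuration per lab.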

73

Page 74: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

openBIS (meta) data store

• Open, distributed system for managing biological information

•  Captures different experiment types (OMICS, imaging, screening,...)

•  Tracking, annotating and sharing of experiments, samples and datasets for distributed research

•  Different servers for meta data and bulk raw data

•  Underlying PostgreSQL database

•  ETL routines for extraction of metadata and linking

74

Page 75: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data organization •  http://www.cisd.ethz.ch/software/openBIS

75

Page 76: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Data organization •  http://www.cisd.ethz.ch/software/openBIS

76

Page 77: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Applications

•  Personalized medicine: Individualized vaccination in cancer

•  Large-scale clinical studies: example Hepatocellular carcinoma

77

Page 78: Data Management for Quantitative Biology - Lecture 1, Apr 16, 2015

Contact: Quantitative Biology Center (QBiC) Auf der Morgenstelle 10 72076 Tübingen · Germany [email protected]

Thanks for listening – See you next week