The Transformation of Systems Biology Into A Large Data Science

Is Systems Biology Becoming a Data Intensive Science? Assuming So, Are You Ready? Robert Grossman, Laboratory for Advanced Computing, University of Illinois at Chicago. December 7, 2009.

Transcript of The Transformation of Systems Biology Into A Large Data Science

Page 1: The Transformation of Systems Biology Into A Large Data Science

Is Systems Biology Becoming a Data Intensive Science?

Assuming So, Are You Ready?

Robert Grossman

Laboratory for Advanced Computing

University of Illinois at Chicago

December 7, 2009

Page 2: The Transformation of Systems Biology Into A Large Data Science

Part 1: Biology as a Data Intensive Science

Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).

Page 3: The Transformation of Systems Biology Into A Large Data Science

Growth of Genomic Data

1977: Sanger sequencing

1995: Microarray technology

2001: HGP

2003: ENCODE

2005: 454, Solexa sequencing

Genbank: 10^5, 10^8, 10^10

Page 4: The Transformation of Systems Biology Into A Large Data Science

Growth of Genomic Data

1977: Sanger sequencing

1995: Microarray technology

2001: HGP

2003: ENCODE

2005: 454, Solexa sequencing

Sequence species, sequence individuals, sequence environment

Genbank: 10^5, 10^8, 10^10

2003: GFS

2006: AWS

2008: Hadoop

Page 5: The Transformation of Systems Biology Into A Large Data Science

The Challenge is to Support Cubes of High Throughput Sequence Data

Perturb the environment

Different developmental stages

Each cell in data cube can be ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set.

Different conditions

Page 6: The Transformation of Systems Biology Into A Large Data Science

We Have a Problem

More and more of your colleagues produce so much data that they cannot easily manage & analyze it.

Large projects build their own infrastructure. Everyone else is on their own.

vs…

Page 7: The Transformation of Systems Biology Into A Large Data Science

experimental science

simulation science

data science

1609: 30x

1670: 250x

1976: 10x-100x

2003: 10x-100x

Page 8: The Transformation of Systems Biology Into A Large Data Science

Point of View

Data

Analytic algorithms & statistical models

Analytic infrastructure

To answer today’s biological questions

Page 9: The Transformation of Systems Biology Into A Large Data Science

Part 2: What is a Cloud?

Page 10: The Transformation of Systems Biology Into A Large Data Science

What is a Cloud?


Software as a Service

Page 11: The Transformation of Systems Biology Into A Large Data Science

Is Anything Else a Cloud?


Infrastructure as a Service – based upon scaling Virtual Machines (VMs)

Page 12: The Transformation of Systems Biology Into A Large Data Science

Are There Other Types of Clouds?


Large Data Cloud Services

web search & ad targeting

Page 13: The Transformation of Systems Biology Into A Large Data Science

What is Virtualization?


Page 14: The Transformation of Systems Biology Into A Large Data Science

Idea Dates Back to the 1960s

Virtualization was first widely deployed with IBM VM/370.

[Diagram: an IBM mainframe running IBM VM/370, which hosts three guest operating systems (CMS, MVS, CMS), each running an app.]

Native (full) virtualization. Example: VMware ESX

Page 15: The Transformation of Systems Biology Into A Large Data Science

What Do You Optimize?

Goal: Minimize latency and control heat.

Goal: Maximize data (with matching compute) and control cost.

Page 16: The Transformation of Systems Biology Into A Large Data Science


Scale is new

Page 17: The Transformation of Systems Biology Into A Large Data Science

Elastic, Usage Based Pricing Is New


1 computer in a rack for 120 hours

costs the same as

120 computers in three racks for 1 hour

Elastic, usage based pricing turns capex into opex.

Clouds can be used to manage surges in computing.
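The arithmetic behind this slide can be sketched in a few lines. This is only an illustration of pure usage-based pricing; the function name and the price per machine-hour are assumptions, not figures from the talk.

```python
def usage_cost(machines: int, hours: float, rate_per_machine_hour: float) -> float:
    """Cost under pure usage-based pricing: pay only for machine-hours consumed."""
    return machines * hours * rate_per_machine_hour


# Hypothetical price per machine-hour (an assumption for illustration only).
RATE = 0.10

one_machine_long = usage_cost(machines=1, hours=120, rate_per_machine_hour=RATE)
many_machines_short = usage_cost(machines=120, hours=1, rate_per_machine_hour=RATE)

# Both jobs consume 120 machine-hours, so they are billed the same amount,
# but the second one finishes 120 times sooner.
print(one_machine_long, many_machines_short)
```

Under this model only the total machine-hours matter, which is why a surge of 120 machines for one hour is affordable in a way that buying 120 machines is not.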

Page 18: The Transformation of Systems Biology Into A Large Data Science

Simplicity Offered By the Cloud is New


+ .. and you have a computer ready to work.

A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
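To give a feel for why the MapReduce model is easy to learn, here is a minimal single-process sketch of its three steps (map, group by key, reduce) applied to counting k-mers in sequence reads. The k-mer task, the function names and the toy reads are assumptions chosen for illustration; a real job would run on a framework such as Hadoop or Sector/Sphere rather than in one Python process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_read(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group emitted values by key (done by the framework in practice)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts for one k-mer."""
    return key, sum(values)


def count_kmers(reads: Iterable[str], k: int = 3) -> Dict[str, int]:
    pairs = (pair for read in reads for pair in map_read(read, k))
    return dict(reduce_counts(key, values) for key, values in group_by_key(pairs).items())


# Toy input (hypothetical reads, not real data).
counts = count_kmers(["ACGTAC", "GTACGT"], k=3)
print(counts)
```

The programmer writes only `map_read` and `reduce_counts`; distribution, grouping and fault tolerance are the framework's job.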

Page 19: The Transformation of Systems Biology Into A Large Data Science

Clouds vs Grids

                     Grids                             Clouds
Problem              Too few cycles                    Too much data
Infrastructure       Clusters and supercomputers       Data centers
Architecture         Federated Virtual Organization    Hosted Organization
Programming Model    Powerful, but difficult to use    Not as powerful, but easy to use
Model                NSF, DOD, DOE HPC centers         Google, Amazon, Yahoo, Microsoft
Projects             caBIG, BIRN, …                    CUBioS/Cistrack

Page 20: The Transformation of Systems Biology Into A Large Data Science

Part 3: Case Studies

Page 21: The Transformation of Systems Biology Into A Large Data Science

Case Study 1: Cistrack Large Data Cloud

www.cistrack.org

Page 22: The Transformation of Systems Biology Into A Large Data Science

Cistrack

Resource for cis-regulatory data.

It is open source and based upon CUBioS.

Currently used by the White Lab at University of Chicago for managing ModENCODE fly data.

Contains raw, intermediate, and analyzed data from approximately 240 experiments from Agilent, Affy and Solexa platforms.

Page 23: The Transformation of Systems Biology Into A Large Data Science

CUBioS Applications

Bowtie, TopHat, R pipelines, etc.

RNA-seq, ChIP-seq, DNA capture, etc.

CUBioS

Cistrack is an instance of CUBioS.

Ingestion

Front Ends

Page 24: The Transformation of Systems Biology Into A Large Data Science

Chromatin Developmental Time-Course

H3K4me1      enhancers
H3K4me3      promoters & enhancers
H3K9Ac       activation
H3K9me3      heterochromatin
H3K27Ac      activation
H3K27me3     repression
PolII        transcript. & promoters
CBP          HAT - enhancers
Total RNA    expression

8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)

X

12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)

Page 25: The Transformation of Systems Biology Into A Large Data Science

Cistrack Supports Multi-Dim. Cubes…

Drosophila regulatory elements from Drosophila modENCODE.

ChIP-chip data using Agilent 244K dual-color arrays.

Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP.

Each factor has been studied for 12 different time-points of Drosophila development.
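One way to picture such a cube is as a mapping from (factor, time point) to the data sets stored in that cell. The sketch below is an illustration only, not Cistrack's actual schema: the time-point labels, the `add_dataset` helper and the data set identifier are all hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# The eight factors named on the slide.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]

# Twelve developmental time points; the integer labels are placeholders.
TIME_POINTS = list(range(1, 13))

Cell = Tuple[str, int]

# Each cell of the cube holds a list of data set identifiers.
cube: Dict[Cell, List[str]] = {
    (factor, t): [] for factor, t in product(FACTORS, TIME_POINTS)
}


def add_dataset(cube: Dict[Cell, List[str]], factor: str, time_point: int,
                dataset_id: str) -> None:
    """Attach one data set (e.g. a ChIP-chip array) to a cell of the cube."""
    cube[(factor, time_point)].append(dataset_id)


# Hypothetical identifier, for illustration only.
add_dataset(cube, "H3K4me3", 1, "example-chip-chip-001")

print(len(cube))  # number of cells: 8 factors x 12 time points
```

Adding a further axis, such as experimental condition, only means extending the key tuple.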

Page 26: The Transformation of Systems Biology Into A Large Data Science

… Each Cell in a Cube Can Be Three ChIP-Seq Datasets from a Solexa

Cistrack integrates with large data clouds. Cistrack uses the Sector/Sphere large data cloud.

Page 27: The Transformation of Systems Biology Into A Large Data Science

Hadoop vs Sector

                     Hadoop                      Sector
Storage Cloud        Block-based                 File-based
Programming Model    MapReduce                   UDF & MapReduce
Image processing     Difficult with MapReduce    Easy with UDF
Protocol             TCP                         UDT
Replication          At write                    At write or periodically
Security             Not yet                     HIPAA capable
Language             Java                        C++

Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.

Page 28: The Transformation of Systems Biology Into A Large Data Science

Cistrack Database

Analysis Pipelines & Re-analysis Services

Cistrack Web Portal & Widgets

Cistrack Large Data Cloud Services

Ingestion Services

Cistrack Elastic Cloud Services

Page 29: The Transformation of Systems Biology Into A Large Data Science

Case Study 2: Combinatorial Analysis of Marks

Page 30: The Transformation of Systems Biology Into A Large Data Science

Active Gene - Method

Gene Activeness: Label a transcript t as XYZ

– X=1 if H3K4Me3 binds in [-1800, min(2200, TranscriptLength)]

– Y=1 if Pol II binds in [-1800, min(2200, TranscriptLength)]

– Z=1 if at least one exon has ≥30% covered by RNA, and in total ≥10% covered by RNA.

[Figures: K4Me3 to TSS distance; Pol II to TSS distance]

Source: Jia Chen et al. (ModENCODE)
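The labeling rule above can be written down as a short function. This is a sketch under stated assumptions, not the authors' code: binding sites are given as single offsets in bp relative to the TSS, the window ends are treated as inclusive, and coverage values are fractions between 0 and 1. All function and parameter names are hypothetical.

```python
from typing import Sequence


def binds_in_window(offsets: Sequence[int], transcript_length: int) -> bool:
    """True if any binding site falls in [-1800, min(2200, TranscriptLength)] around the TSS."""
    upper = min(2200, transcript_length)
    return any(-1800 <= offset <= upper for offset in offsets)


def label_transcript(k4me3_offsets: Sequence[int],
                     pol2_offsets: Sequence[int],
                     exon_coverage: Sequence[float],
                     total_coverage: float,
                     transcript_length: int) -> str:
    """Return the three-character label XYZ for one transcript."""
    x = binds_in_window(k4me3_offsets, transcript_length)
    y = binds_in_window(pol2_offsets, transcript_length)
    # Z: at least one exon >= 30% covered by RNA, and >= 10% covered in total.
    z = any(c >= 0.30 for c in exon_coverage) and total_coverage >= 0.10
    return f"{int(x)}{int(y)}{int(z)}"


# Toy transcript: H3K4Me3 peak 500 bp upstream, Pol II peak outside the window,
# one well-covered exon and 20% total RNA coverage.
label = label_transcript(k4me3_offsets=[-500], pol2_offsets=[3000],
                         exon_coverage=[0.5, 0.0], total_coverage=0.2,
                         transcript_length=5000)
print(label)
```

A label such as 111 then marks a transcript supported by all three lines of evidence.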

Page 31: The Transformation of Systems Biology Into A Large Data Science

Promoters: Use H3K4me3, PolII & RNA to Map Active Genes

Source: Jia Chen et al. (ModENCODE)

Page 32: The Transformation of Systems Biology Into A Large Data Science

Active Genes (cont’d)

[Figure: Venn diagram of PolII, H3K4me3 and RNA, with region counts 1350, 332, 6806104, 482, 1418, 753; panels A., B., C.; axes: bp from TSS.]

Source: Jia Chen et al. (ModENCODE)

Page 33: The Transformation of Systems Biology Into A Large Data Science

Interesting Combinatorial Combination of Marks

Itemsets are formed by sliding a window along the genome.

The Apriori algorithm generates interesting itemsets.

Post-processing retains itemsets of biological relevance.

[Figure: marks plotted against probes along the genome]
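The two steps on this slide, building itemsets from a sliding window and mining them with Apriori, can be sketched as follows. The window width, the use of a minimum support threshold as the measure of "interesting", and the toy probe data are all assumptions; the biological post-processing step is not shown.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Set


def window_itemsets(marks_per_probe: Sequence[Set[str]], width: int) -> List[FrozenSet[str]]:
    """One itemset per window position: the marks seen on `width` consecutive probes."""
    return [frozenset().union(*marks_per_probe[i:i + width])
            for i in range(len(marks_per_probe) - width + 1)]


def apriori(transactions: Sequence[FrozenSet[str]],
            min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every itemset whose support (fraction of windows) is >= min_support."""
    n = len(transactions)
    current = {frozenset([item]) for t in transactions for item in t}
    frequent: Dict[FrozenSet[str], float] = {}
    k = 1
    while current:
        support = {c: sum(1 for t in transactions if c <= t) / n for c in current}
        level = {c: s for c, s in support.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune candidates with an infrequent subset.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


# Toy data: marks detected at five consecutive probes (hypothetical).
probes = [{"H3K4me3", "PolII"}, {"H3K4me3"}, {"PolII", "H3K4me3"}, {"H3K27me3"}, set()]
itemsets = apriori(window_itemsets(probes, width=2), min_support=0.6)
print(itemsets)
```

The pruning step is what makes Apriori practical: a combination of marks is only counted if every smaller combination inside it is already frequent.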

Page 34: The Transformation of Systems Biology Into A Large Data Science

Case Study 3: Cistrack Elastic Cloud

Page 35: The Transformation of Systems Biology Into A Large Data Science

Cistrack Elastic Cloud

A rack in Cistrack’s Elastic Cloud contains 128 Cores and 128 TB.

Multiple racks form a data center.

Virtual machines can run pipelines.

Virtual machines have access to large data services.

No need to move large datasets in and out of Amazon public cloud.

Page 36: The Transformation of Systems Biology Into A Large Data Science

Use VMs to Support Reanalysis

At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis, persist the results to a database, and persist the VM to a cloud.

[Diagram: three VMs and a cloud, labeled "Replace"]

Page 37: The Transformation of Systems Biology Into A Large Data Science

Comparing Peak Calling Algorithms for ModENCODE

We’re using the Cistrack Elastic Cloud to rerun peak calls for fly data using the worm pipeline.

Also running the worm peak calling pipeline on the fly data.

Page 38: The Transformation of Systems Biology Into A Large Data Science

Case Study 4: Ensembles of Trees on Clouds

data

100 tree models

10,000??? tree models

Wenxuan Gao, Robert Grossman, Philip S. Yu, Yunhong Gu, Why Naïve Ensembles Do Not Work in Cloud Computing, Proceedings of LSDM, 2009.

Page 39: The Transformation of Systems Biology Into A Large Data Science

Ensembles of Trees for Clouds

Top-k ensembles

– Each node builds single random tree with local data.

– Central node picks k best random trees to predict.

– Lower cost with corresponding lower accuracy.

– Shuffling data can improve accuracy.

Skeleton ensembles

– Central node builds k skeletons of random trees.

– Each local node fills in the skeletons.

– Central node merges all trees from local nodes.

– Greater cost, but more accurate.
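The top-k scheme can be sketched in a single process as follows. This is an illustration only and not the algorithm from the cited paper: a depth-1 random stump stands in for each node's random tree, the "central node" ranks trees by accuracy on a small validation set (an assumption), and prediction is by majority vote. All names and data are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Tuple[float, ...], int]      # (feature vector, class label)
Model = Callable[[Sequence[float]], int]


def build_random_stump(local_data: List[Row], rng: random.Random) -> Model:
    """Stand-in for a random tree: random feature and threshold, leaves set by local majority."""
    feature = rng.randrange(len(local_data[0][0]))
    values = [x[feature] for x, _ in local_data]
    threshold = rng.uniform(min(values), max(values))
    default = Counter(y for _, y in local_data).most_common(1)[0][0]

    def majority(rows: List[Row]) -> int:
        return Counter(y for _, y in rows).most_common(1)[0][0] if rows else default

    left = majority([r for r in local_data if r[0][feature] <= threshold])
    right = majority([r for r in local_data if r[0][feature] > threshold])
    return lambda x: left if x[feature] <= threshold else right


def accuracy(model: Model, data: List[Row]) -> float:
    return sum(model(x) == y for x, y in data) / len(data)


def top_k_ensemble(node_datasets: List[List[Row]], validation: List[Row],
                   k: int, seed: int = 0) -> Model:
    rng = random.Random(seed)
    # Each node builds one model from its local data only.
    trees = [build_random_stump(data, rng) for data in node_datasets]
    # The central node keeps the k best models and predicts by majority vote.
    best = sorted(trees, key=lambda t: accuracy(t, validation), reverse=True)[:k]
    return lambda x: Counter(t(x) for t in best).most_common(1)[0][0]


# Toy data: three nodes, one feature, label is 1 for large values (hypothetical).
nodes = [[((0.1,), 0), ((0.9,), 1)],
         [((0.2,), 0), ((0.8,), 1)],
         [((0.3,), 0), ((0.7,), 1)]]
validation = [((0.0,), 0), ((1.0,), 1)]
predict = top_k_ensemble(nodes, validation, k=2)
print(predict((0.05,)), predict((0.95,)))
```

Because each model sees only its node's local data, accuracy depends on how uniformly the data is spread across nodes, which is where shuffling comes in.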

Page 40: The Transformation of Systems Biology Into A Large Data Science

Experimental Studies

Performed experimental studies on 4 racks (104 nodes) of Open Cloud Testbed.

Standard ensemble based models are more expensive than proposed approaches and can overfit.

Skeleton ensembles are more accurate but more expensive to build.

Shuffling improves accuracy of top-k algorithm.

For KDDCup99 dataset top-k ensembles with shuffling 0.1% of data matches accuracy of skeleton method.

For UCI Census income dataset, 20% shuffle required, which is more expensive than top-k ensemble.

Without knowledge of uniformity of dataset, recommend skeleton ensembles.

Page 41: The Transformation of Systems Biology Into A Large Data Science

[Figures: error rate vs. shuffle rate, and computation cost (seconds) vs. shuffle rate, for the skeleton and top-k methods (the cost plots also show shuffle). Left panels: KDDCup99 dataset, shuffle rates 0 to 5%. Right panels: Census income dataset, shuffle rates 0 to 30%.]

Page 42: The Transformation of Systems Biology Into A Large Data Science

Part 5: Open Cloud Consortium

Biocloud

Page 43: The Transformation of Systems Biology Into A Large Data Science

Open Cloud Testbed

Phase 2: 9 racks, 250+ nodes, 1000+ cores, 10+ Gb/s

Hadoop, Sector/Sphere, Thrift, KVM VMs, Eucalyptus VMs

[Map labels: MREN, CENIC, Dragon, C-Wave]

Page 44: The Transformation of Systems Biology Into A Large Data Science

Open Science Data Cloud


sky cloud

biocloud

additional projects in planning…

Page 45: The Transformation of Systems Biology Into A Large Data Science

OCC Condominium Clouds

In a condominium cloud, you buy your own rack or bunch of racks.

The racks are managed and operated by the condominium association, in this case the OCC.

If your rack is 120 TB, you get the rights to approx. 40 TB of storage in the cloud. The rest is a shared resource.

Page 46: The Transformation of Systems Biology Into A Large Data Science

Acknowledgements

Page 47: The Transformation of Systems Biology Into A Large Data Science

To Get Involved

The Cistrack resource for transcriptional data: www.cistrack.org

Sector/Sphere cloud: sector.sourceforge.net

Page 48: The Transformation of Systems Biology Into A Large Data Science

Thank You

For more information: blog.rgrossman.com or www.rgrossman.com