The Transformation of Systems Biology Into A Large Data Science

Is Systems Biology Becoming a Data Intensive Science?

Assuming So, Are You Ready?

Robert Grossman

Laboratory for Advanced Computing

University of Illinois at Chicago

December 7, 2009

Part 1. Biology as a Data Intensive Science

Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).

Growth of Genomic Data

1977: Sanger sequencing
1995: Microarray technology
2001: HGP
2003: ENCODE
2005: 454, Solexa sequencing

GenBank: 10^5, 10^8, 10^10

Growth of Genomic Data

1977: Sanger sequencing
1995: Microarray technology
2001: HGP
2003: ENCODE
2005: 454, Solexa sequencing

Sequence species, sequence individuals, sequence the environment

GenBank: 10^5, 10^8, 10^10

2003: GFS
2006: AWS
2008: Hadoop

The Challenge is to Support Cubes of High Throughput Sequence Data

Perturb the environment

Different developmental stages

Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set.

Different conditions

We Have a Problem

More and more of your colleagues produce so much data that they cannot easily manage & analyze it.

Large projects build their own infrastructure. Everyone else is on their own.

vs…

experimental science

simulation science

data science

1609: 30x

1670: 250x

1976: 10x-100x

2003: 10x-100x

Point of View

Data

Analytic algorithms & statistical models

Analytic infrastructure

To answer today’s biological questions

Part 2. What is a Cloud?

What is a Cloud?

Software as a Service

Is Anything Else a Cloud?

Infrastructure as a Service – based upon scaling Virtual Machines (VMs)

Are There Other Types of Clouds?

Large Data Cloud Services

web search & ad targeting

What is Virtualization?

Idea Dates Back to the 1960s

Virtualization first widely deployed with IBM VM/370.

Native (Full) Virtualization. Examples: VMware ESX

[Diagram: an IBM mainframe running IBM VM/370, which hosts guest operating systems (CMS, MVS, CMS), each running an app.]

What Do You Optimize?

Goal: Minimize latency and control heat.

Goal: Maximize data (with matching compute) and control cost.

Scale is new

Elastic, Usage Based Pricing Is New

1 computer in a rack for 120 hours

costs the same as

120 computers in three racks for 1 hour

Elastic, usage based pricing turns capex into opex.

Clouds can be used to manage surges in computing.
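The arithmetic behind this comparison can be sketched in a few lines. This is only an illustration of pure usage-based pricing; the function name and the hourly rate are assumptions made for the example and are not taken from any provider's price list.

```python
def usage_cost(machines: int, hours: int, rate_per_machine_hour: float) -> float:
    """Cost when you pay only for the machine-hours actually used."""
    return machines * hours * rate_per_machine_hour


# Hypothetical price per machine-hour (an assumption for illustration only).
RATE = 0.10

one_machine_long = usage_cost(machines=1, hours=120, rate_per_machine_hour=RATE)
many_machines_short = usage_cost(machines=120, hours=1, rate_per_machine_hour=RATE)

# Both jobs consume 120 machine-hours, so they cost the same,
# but the second one finishes 120 times sooner.
print(one_machine_long == many_machines_short)
```

Under this model the bill depends only on total machine-hours, which is why a surge can be handled by renting many machines for a short time.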

Simplicity Offered By the Cloud is New

+ .. and you have a computer ready to work.

A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
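To give a feel for why the model is easy to learn, here is a minimal single-process sketch of the MapReduce pattern. It is not the Hadoop or Sector/Sphere API; the function names, the choice of counting 3-mers, and the toy reads are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain

K = 3  # k-mer length, chosen arbitrarily for the example


def map_phase(read):
    """Map: emit a (k-mer, 1) pair for every k-mer in one sequence read."""
    return [(read[i:i + K], 1) for i in range(len(read) - K + 1)]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


def map_reduce(records):
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


reads = ["ACGTAC", "GTACGT"]  # toy reads, not real data
counts = map_reduce(reads)
print(counts)
```

The programmer writes only the map and reduce functions; in a real large data cloud the framework handles the shuffle, the distribution of records across nodes, and failures.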

Clouds vs Grids

                    Grids                             Clouds
Problem             Too few cycles                    Too much data
Infrastructure      Clusters and supercomputers       Data centers
Architecture        Federated Virtual Organization    Hosted Organization
Programming Model   Powerful, but difficult to use    Not as powerful, but easy to use
Model               NSF, DOD, DOE HPC centers         Google, Amazon, Yahoo, Microsoft
Projects            caBIG, BIRN, …                    CUBioS/Cistrack

Part 3. Case Studies

Case Study 1: Cistrack Large Data Cloud

www.cistrack.org

Cistrack

Resource for cis-regulatory data.

It is open source and based upon CUBioS.

Currently used by the White Lab at University of Chicago for managing ModENCODE fly data.

Contains raw, intermediate, and analyzed data from approximately 240 experiments from Agilent, Affy and Solexa platforms.

CUBioS Applications

Bowtie, TopHat, R pipelines, etc.

RNA-seq, ChIP-seq, DNA capture, etc.

CUBioS

Cistrack is an instance of CUBioS.

Ingestion

Front Ends

Chromatin Developmental Time-Course

H3K4me1     enhancers
H3K4me3     promoters & enhancers
H3K9Ac      activation
H3K9me3     heterochromatin
H3K27Ac     activation
H3K27me3    repression
PolII       transcript. & promoters
CBP         HAT-enhancers
Total RNA   expression

12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)

×

8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)

Cistrack Supports Multi-Dim. Cubes…

Drosophila regulatory elements from Drosophila modENCODE.

ChIP-chip data using Agilent 244K dual-color arrays.

Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP.

Each factor has been studied for 12 different time-points of Drosophila development.

… Each Cell in a Cube Can Be Three ChIP-Seq Datasets from a Solexa
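One way to picture such a cube is as a mapping from (factor, time point) to a list of datasets. The sketch below is an assumption about the shape of the data only, not Cistrack's actual schema; the integer time-point labels, function names, and dataset identifiers are hypothetical.

```python
from itertools import product

# The eight factors named on the slide.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]
# Twelve developmental time points; the integer labels are hypothetical.
TIME_POINTS = list(range(1, 13))


def empty_cube():
    """One cell per (factor, time point); each cell holds a list of datasets."""
    return {(factor, t): [] for factor, t in product(FACTORS, TIME_POINTS)}


def add_dataset(cube, factor, time_point, dataset_id):
    """Attach one dataset (for example one ChIP-seq replicate) to a cell."""
    cube[(factor, time_point)].append(dataset_id)


def time_course(cube, factor):
    """Slice the cube along the time axis for a single factor."""
    return [cube[(factor, t)] for t in TIME_POINTS]


cube = empty_cube()
for replicate in ("rep1", "rep2", "rep3"):
    # Hypothetical dataset identifiers, three per cell as in the slide title.
    add_dataset(cube, "PolII", 1, f"PolII_t1_{replicate}")

print(len(cube), len(cube[("PolII", 1)]))
```

Keeping each cell as a list lets a cell hold several datasets, and adding a dimension (for example, condition) only means extending the key tuple.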

Cistrack integrates with large data clouds. Cistrack uses the Sector/Sphere large data cloud.

Hadoop vs Sector

                    Hadoop                      Sector
Storage Cloud       Block-based                 File-based
Programming Model   MapReduce                   UDF & MapReduce
Image processing    Difficult with MapReduce    Easy with UDF
Protocol            TCP                         UDT
Replication         At write                    At write or periodically
Security            Not yet                     HIPAA capable
Language            Java                        C++

Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.

Cistrack Database

Analysis Pipelines & Re-analysis Services

Cistrack Web Portal & Widgets

Cistrack Large Data Cloud Services

Ingestion Services

Cistrack Elastic Cloud Services

Case Study 2: Combinatorial Analysis of Marks

Active Gene - Method

Gene Activeness: Label a transcript t as XYZ

– X=1 if H3K4me3 binds in [-1800, min(2200, TranscriptLength)]
– Y=1 if Pol II binds in [-1800, min(2200, TranscriptLength)]
– Z=1 if at least one exon is ≥30% covered by RNA, and in total ≥10% is covered by RNA.

[Figure panels: K4Me3 to TSS distance; Pol II to TSS distance]

Source: Jia Chen et al. (ModENCODE)
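The labeling rule above can be sketched as follows. This is an illustration under stated assumptions, not the modENCODE pipeline: binding events are represented as offsets in bp from the TSS, window bounds are treated as inclusive, and "in total" is read as total exon length. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    length: int               # transcript length in bp
    exon_lengths: List[int]   # length of each exon in bp
    exon_covered: List[int]   # bp of each exon covered by RNA signal


def binds_in_window(offsets_from_tss, transcript_length):
    """True if any binding offset falls in [-1800, min(2200, length)]."""
    lo, hi = -1800, min(2200, transcript_length)
    return any(lo <= offset <= hi for offset in offsets_from_tss)


def rna_active(t):
    """At least one exon >= 30% covered, and >= 10% of exon bp covered overall."""
    one_exon = any(c >= 0.30 * n for c, n in zip(t.exon_covered, t.exon_lengths))
    overall = sum(t.exon_covered) >= 0.10 * sum(t.exon_lengths)
    return one_exon and overall


def label(t, k4me3_offsets, pol2_offsets):
    """Return the three-character label XYZ, for example '111' or '101'."""
    x = int(binds_in_window(k4me3_offsets, t.length))
    y = int(binds_in_window(pol2_offsets, t.length))
    z = int(rna_active(t))
    return f"{x}{y}{z}"


# Toy transcript: 1000 bp, two exons, the first half covered by RNA.
toy = Transcript(length=1000, exon_lengths=[200, 300], exon_covered=[100, 0])
print(label(toy, k4me3_offsets=[-500], pol2_offsets=[1500]))
```

In the toy case the Pol II offset of 1500 bp lies beyond the transcript length of 1000 bp, so the window is capped at 1000 and Y is 0.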

Promoters: Use H3K4me3, PolII & RNA to Map Active Genes


[Figure, panels A, B, C: Venn diagram of transcripts marked by PolII, H3K4me3 and RNA; x-axes: bp from TSS]

Active Genes (cont’d)


Interesting Combinatorial Combination of Marks

Item-sets are formed by sliding a window along the genome.

The Apriori algorithm generates interesting itemsets.

Post-processing retains itemsets of biological relevance.

[Figure: marks plotted against probes along the genome]
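The window and itemset steps can be sketched as below. This is a textbook Apriori over toy data, written under assumptions: each probe carries the set of marks detected there, a window's itemset is the union of marks over its probes, and "interesting" is approximated by a minimum support count. The post-processing for biological relevance is not shown, and the function names are hypothetical.

```python
from itertools import combinations


def window_itemsets(probe_marks, width):
    """Union of marks over each window of `width` consecutive probes."""
    return [frozenset().union(*probe_marks[i:i + width])
            for i in range(len(probe_marks) - width + 1)]


def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets meeting min_support."""
    def support(candidate):
        return sum(1 for t in transactions if candidate <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while level:
        kept = {c: n for c, n in ((c, support(c)) for c in level)
                if n >= min_support}
        frequent.update(kept)
        size += 1
        # Join step: merge frequent sets into candidates one item larger.
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, size - 1))}
    return frequent


# Toy probes along the genome, each with the marks detected there.
probes = [{"H3K4me3"}, {"PolII"}, {"H3K4me3", "PolII"}, set(), {"H3K27me3"}]
windows = window_itemsets(probes, width=2)
frequent = apriori(windows, min_support=3)
print(frequent)
```

The prune step is what keeps the search tractable: a combination of marks is only counted if all of its smaller sub-combinations were already frequent.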

Case Study 3: Cistrack Elastic Cloud

Cistrack Elastic Cloud

A rack in Cistrack’s Elastic Cloud contains 128 Cores and 128 TB.

Multiple racks form a data center.

Virtual machines can run pipelines.

Virtual machines have access to large data services.

No need to move large datasets in and out of Amazon public cloud.

Use VMs to Support Reanalysis

At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis, persist the results to a database, and persist the VM to a cloud.

[Diagram: a cloud hosting several VMs]

Comparing Peak Calling Algorithms for ModENCODE

We’re using the Cistrack Elastic Cloud to rerun peak calls for fly data using the worm pipeline.

Also running the worm peak calling pipeline on the fly data.

Case Study 4: Ensembles of Trees on Clouds

data

100 tree models

10,000??? tree models

Wenxuan Gao, Robert Grossman, Philip S. Yu, Yunhong Gu, Why Naïve Ensembles Do Not Work in Cloud Computing, Proceedings of LSDM, 2009.

Ensembles of Trees for Clouds

Top-k ensembles

– Each node builds a single random tree with local data.
– Central node picks the k best random trees to predict.
– Lower cost with corresponding lower accuracy.
– Shuffling data can improve accuracy.

Skeleton ensembles

– Central node builds k skeletons of random trees.
– Each local node fills in the skeletons.
– Central node merges all trees from local nodes.
– Greater cost, but more accurate.
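The top-k idea can be sketched in a single process as below. This is a simplified illustration, not the method from the cited paper: the random trees use uniformly random feature and threshold splits, "best" is assumed to mean highest accuracy on a validation set held by the central node, and prediction is a majority vote. All function names are hypothetical, and class labels must not be tuples because tuples represent internal tree nodes.

```python
import random
from collections import Counter


def build_random_tree(rows, labels, depth, rng):
    """Random splits; each leaf stores the majority label of the local data."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) <= 1:
        return majority
    feature = rng.randrange(len(rows[0]))
    values = [row[feature] for row in rows]
    threshold = rng.uniform(min(values), max(values))
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    if not left or not right:
        return majority
    return (feature, threshold,
            build_random_tree([r for r, _ in left], [y for _, y in left], depth - 1, rng),
            build_random_tree([r for r, _ in right], [y for _, y in right], depth - 1, rng))


def predict_tree(tree, row):
    """Walk from the root to a leaf; internal nodes are 4-tuples."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree


def accuracy(tree, rows, labels):
    return sum(predict_tree(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


def top_k_ensemble(partitions, validation, k, depth=3, seed=0):
    """Each partition plays one node and builds one tree; keep the k best."""
    rng = random.Random(seed)
    trees = [build_random_tree(rows, labels, depth, rng) for rows, labels in partitions]
    val_rows, val_labels = validation
    trees.sort(key=lambda t: accuracy(t, val_rows, val_labels), reverse=True)
    return trees[:k]


def predict_ensemble(trees, row):
    """Majority vote over the selected trees."""
    return Counter(predict_tree(t, row) for t in trees).most_common(1)[0][0]


# Toy data: four "nodes" that each hold the same small local dataset.
rows, labels = [[0.1], [0.2], [0.8], [0.9]], [0, 0, 1, 1]
partitions = [(rows, labels)] * 4
selected = top_k_ensemble(partitions, validation=(rows, labels), k=2)
print(len(selected), predict_ensemble(selected, [0.1]), predict_ensemble(selected, [0.9]))
```

Only the selected trees travel to the central node, which is where the lower cost comes from; when local data differs strongly between nodes, each tree sees a biased sample, which is the situation shuffling is meant to address.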

Experimental Studies

Performed experimental studies on 4 racks (104 nodes) of the Open Cloud Testbed.

Standard ensemble based models are more expensive than the proposed approaches and can overfit.

Skeleton ensembles are more accurate but more expensive to build.

Shuffling improves accuracy of the top-k algorithm.

For the KDDCup99 dataset, top-k ensembles with shuffling of 0.1% of the data match the accuracy of the skeleton method.

For the UCI Census income dataset, a 20% shuffle is required, which is more expensive than top-k ensemble.

Without knowledge of the uniformity of the dataset, recommend skeleton ensembles.

[Figure: four panels plotting error rate and computation cost (seconds) against shuffle rate for the skeleton and top-k methods; the cost panels also show shuffle. KDDCup99 dataset: shuffle rates from 0 to 5%. Census income dataset: shuffle rates from 0 to 30%.]

Part 5. Open Cloud Consortium

Biocloud

Open Cloud Testbed

Phase 2: 9 racks, 250+ nodes, 1000+ cores, 10+ Gb/s

Networks: MREN, CENIC, Dragon, C-Wave

Software: Hadoop, Sector/Sphere, Thrift, KVM VMs, Eucalyptus VMs

Open Science Data Cloud

sky cloud

biocloud

additional projects in planning…

OCC Condominium Clouds

In a condominium cloud, you buy your own rack or bunch of racks.

The racks are managed and operated by the condominium association, in this case the OCC.

If your rack is 120 TB, you get the rights to approx. 40 TB of storage in the cloud. The rest is a shared resource.

Acknowledgements

To Get Involved

The Cistrack resource for transcriptional data: www.cistrack.org

Sector/Sphere cloud: sector.sourceforge.net

Thank You

For more information: blog.rgrossman.com or www.rgrossman.com
