The Transformation of Systems Biology Into A Large Data Science
Is Systems Biology Becoming a Data Intensive Science?
Assuming So, Are You Ready?
Robert Grossman
Laboratory for Advanced Computing
University of Illinois at Chicago
December 7, 2009
Part 1: Biology as a Data Intensive Science
Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
Growth of Genomic Data
1977: Sanger sequencing
1995: Microarray technology
2001: HGP
2003: ENCODE
2005: 454, Solexa sequencing
Sequence species, sequence individuals, sequence environment
Genbank: 10^5, 10^8, 10^10
2003: GFS
2006: AWS
2008: Hadoop
The Challenge is to Support Cubes of High Throughput Sequence Data
Cube axes: perturb the environment, different developmental stages, different conditions.
Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, or other data set.
We Have a Problem
More and more of your colleagues produce so much data that they cannot easily manage & analyze it.
Large projects build their own infrastructure. Everyone else is on their own.
vs…
Experimental science: 1609 (30x), 1670 (250x)
Simulation science: 1976 (10x-100x)
Data science: 2003 (10x-100x)
Point of View
Data
Analytic algorithms & statistical models
Analytic infrastructure
To Answer today’s biological questions
Part 2: What is a Cloud?
What is a Cloud?
Software as a Service
Is Anything Else a Cloud?
Infrastructure as a Service – based upon scaling Virtual Machines (VMs)
Are There Other Types of Clouds?
Large Data Cloud Services
web search & ad targeting
What is Virtualization?
Idea Dates Back to the 1960s
Virtualization first widely deployed with IBM VM/370.
Native (full) virtualization. Example: VMware ESX.
[Figure: an IBM mainframe running IBM VM/370, which hosts guest operating systems (CMS, MVS, CMS), each running an app.]
What Do You Optimize?
Goal: Minimize latency and control heat.
Goal: Maximize data (with matching compute) and control cost.
Scale is new
Elastic, Usage Based Pricing Is New
1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
Elastic, usage-based pricing turns capex into opex.
Clouds can be used to manage surges in computing.
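The equivalence above is simple machine-hour arithmetic. A minimal sketch, assuming a flat price per machine-hour (the rate below is a made-up illustrative value, not a quoted price):

```python
def machine_hours(computers: int, hours: float) -> float:
    """Total machine-hours consumed by a job."""
    return computers * hours


def usage_cost(computers: int, hours: float, rate_per_machine_hour: float) -> float:
    """Cost under pure usage-based pricing: pay only for machine-hours used."""
    return machine_hours(computers, hours) * rate_per_machine_hour


RATE = 0.10  # hypothetical price per machine-hour, for illustration only

slow = usage_cost(computers=1, hours=120, rate_per_machine_hour=RATE)
fast = usage_cost(computers=120, hours=1, rate_per_machine_hour=RATE)
print(slow == fast)  # same bill, but the second job finishes 120 times sooner
```

Under this assumption the only thing that changes between the two jobs is the wall-clock time to a result.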
Simplicity Offered By the Cloud is New
+ .. and you have a computer ready to work.
A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
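To illustrate why the programming model is easy to learn, here is a minimal in-memory sketch of the MapReduce pattern in plain Python. It is not Hadoop code; the `map_reduce` helper, the tab-separated record layout, and the sample reads are assumptions made for the example. The programmer writes only the `mapper` and the `reducer`.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def mapper(line):
    """Emit (chromosome, 1) for each aligned read; assumed format: 'chrom<TAB>position'."""
    chrom, _position = line.split("\t")
    yield chrom, 1


def reducer(chrom, counts):
    """Sum the per-read counts for one chromosome."""
    return sum(counts)


reads = ["chr2L\t100", "chr2L\t250", "chrX\t40"]  # hypothetical sample records
counts = map_reduce(reads, mapper, reducer)
print(counts)
```

In a real large data cloud the grouping step is distributed across nodes, but the two user-written functions keep the same shape.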
Clouds vs Grids
                    Grids                            Clouds
Problem             Too few cycles                   Too much data
Infrastructure      Clusters and supercomputers      Data centers
Architecture        Federated Virtual Organization   Hosted Organization
Programming Model   Powerful, but difficult to use   Not as powerful, but easy to use
Model               NSF, DOD, DOE HPC centers        Google, Amazon, Yahoo, Microsoft
Projects            caBIG, BIRN, …                   CUBioS/Cistrack
Part 3: Case Studies
Case Study 1: Cistrack Large Data Cloud
www.cistrack.org
Cistrack
Resource for cis-regulatory data.
It is open source and based upon CUBioS.
Currently used by the White Lab at the University of Chicago for managing modENCODE fly data.
Contains raw, intermediate, and analyzed data from approximately 240 experiments from Agilent, Affy and Solexa platforms.
CUBioS Applications
Bowtie, TopHat, R pipelines, etc…
RNA-seq, ChIP-seq, DNA capture, etc.
CUBioS
Cistrack is an instance of CUBioS.
Ingestion
Front Ends
Chromatin Developmental Time-Course
H3K4me1     enhancers
H3K4me3     promoters & enhancers
H3K9Ac      activation
H3K9me3     heterochromatin
H3K27Ac     activation
H3K27me3    repression
PolII       transcript. & promoters
CBP         HAT-enhancers
Total RNA   expression
12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)
8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)
Cistrack Supports Multi-Dim. Cubes…
Drosophila regulatory elements from Drosophila modENCODE.
ChIP-chip data using Agilent 244K dual-color arrays.
Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP.
Each factor has been studied for 12 different time-points of Drosophila development.
… Each Cell in a Cube Can Be Three ChIP-Seq Datasets from a Solexa
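One way to picture such a cube is as a mapping from (factor, time point) to a list of datasets. The sketch below is an illustration only, not Cistrack's schema: the helper names, the 1 to 12 numbering of time points, and the dataset identifiers are all assumptions.

```python
from itertools import product

# The eight factors named on the slide: six histone modifications, PolII and CBP.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]
TIME_POINTS = list(range(1, 13))  # 12 time points; the numbering is an assumption

# Each cell of the cube holds a list of dataset identifiers.
cube = {(factor, t): [] for factor, t in product(FACTORS, TIME_POINTS)}


def add_dataset(cube, factor, time_point, dataset_id):
    """Attach one dataset to the cell for (factor, time_point)."""
    cube[(factor, time_point)].append(dataset_id)


def slice_by_factor(cube, factor):
    """Return the time course for one factor as {time_point: datasets}."""
    return {t: datasets for (f, t), datasets in cube.items() if f == factor}


# Hypothetical identifiers: three ChIP-seq datasets in a single cell.
for replicate in ("chipseq-1", "chipseq-2", "chipseq-3"):
    add_dataset(cube, "H3K4me3", 1, replicate)

print(len(cube), len(slice_by_factor(cube, "H3K4me3")[1]))
```

Adding further axes (condition, perturbation) only lengthens the key tuple.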
Cistrack integrates with large data clouds. Cistrack uses the Sector/Sphere large data cloud.
Hadoop vs Sector
                    Hadoop                     Sector
Storage Cloud       Block-based                File-based
Programming Model   MapReduce                  UDF & MapReduce
Image processing    Difficult with MapReduce   Easy with UDF
Protocol            TCP                        UDT
Replication         At write                   At write or period.
Security            Not yet                    HIPAA capable
Language            Java                       C++
Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.
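The "Easy with UDF" row can be pictured as applying one user-defined function to each whole file. The sketch below is plain Python and does not use the Sector/Sphere API; `apply_udf`, `mean_intensity`, and the in-memory "files" are hypothetical names introduced for the illustration.

```python
from typing import Callable, Dict


def apply_udf(files: Dict[str, bytes], udf: Callable[[bytes], object]) -> Dict[str, object]:
    """Apply a user-defined function to each whole file, giving one result per file."""
    return {name: udf(data) for name, data in files.items()}


def mean_intensity(image: bytes) -> float:
    """Toy UDF: average pixel value of a raw 8-bit grayscale image."""
    return sum(image) / len(image)


# Hypothetical images held in memory; a file-based store keeps each one intact.
images = {"a.raw": bytes([0, 128, 255, 1]), "b.raw": bytes([10, 10, 10, 10])}
results = apply_udf(images, mean_intensity)
print(results)
```

Because each image stays in one file, the function sees a complete image without first reassembling blocks or forcing the work into key-value pairs.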
Cistrack Database
Analysis Pipelines & Re-analysis Services
Cistrack Web Portal & Widgets
Cistrack Large Data Cloud Services
Ingestion Services
Cistrack Elastic Cloud Services
Case Study 2: Combinatorial Analysis of Marks
Active Gene - Method
Gene Activeness: Label a transcript t as XYZ
– X=1 if H3K4me3 binds in [-1800, min(2200, TranscriptLength)]
– Y=1 if Pol II binds in [-1800, min(2200, TranscriptLength)]
– Z=1 if at least one exon is ≥30% covered by RNA, and in total ≥10% is covered by RNA.
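The labelling rule above can be sketched as a small function. This is an illustration under stated assumptions, not the authors' pipeline: binding sites are taken to be single positions in bp relative to the TSS, the interval is treated as inclusive, and coverage values are fractions between 0 and 1. All function and parameter names are hypothetical.

```python
from typing import Sequence

UPSTREAM = -1800       # window start relative to the TSS, in bp
DOWNSTREAM_CAP = 2200  # window end is capped at this value or the transcript length


def binds_near_tss(positions: Sequence[int], transcript_length: int) -> bool:
    """True if any binding position falls in [-1800, min(2200, transcript_length)]."""
    window_end = min(DOWNSTREAM_CAP, transcript_length)
    return any(UPSTREAM <= p <= window_end for p in positions)


def label_transcript(transcript_length: int,
                     k4me3_positions: Sequence[int],
                     pol2_positions: Sequence[int],
                     exon_coverage: Sequence[float],
                     total_coverage: float) -> str:
    """Return the three-character XYZ label for one transcript."""
    x = binds_near_tss(k4me3_positions, transcript_length)
    y = binds_near_tss(pol2_positions, transcript_length)
    z = any(c >= 0.30 for c in exon_coverage) and total_coverage >= 0.10
    return f"{int(x)}{int(y)}{int(z)}"


# A 1500 bp transcript: H3K4me3 at -200 (inside the window), Pol II at 1900
# (outside, since the window ends at 1500), one exon 50% covered, 20% in total.
print(label_transcript(1500, [-200], [1900], [0.5, 0.0], 0.2))  # → 101
```

A label of 111 would then mark a transcript with all three lines of evidence.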
[Figure: distributions of K4Me3 to TSS distance and Pol II to TSS distance.]
Source: Jia Chen et al. (ModENCODE)
Promoters: Use H3K4me3, PolII & RNA to Map Active Genes
[Figure, panels A, B, C: overlap of PolII, H3K4me3 and RNA, with counts as extracted: 1350, 332, 6806104, 482, 1418, 753; x-axes: bp from TSS.]
Source: Jia Chen et al. (ModENCODE)
Active Genes (cont’d)
Source: Jia Chen et al. (ModENCODE)
Interesting Combinatorial Combination of Marks
Itemsets are formed by sliding a window along the genome.
The Apriori algorithm generates interesting itemsets.
Post-processing retains itemsets of biological relevance.
[Figure: marks plotted against probes along the genome.]
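The itemset step can be sketched as follows. This is a small, unoptimized illustration of sliding-window itemsets followed by frequent-itemset mining in the style of Apriori; the window width, the support threshold, and the toy probe data are assumptions, and the biological post-processing is not shown.

```python
from itertools import combinations


def window_itemsets(probe_marks, width):
    """Union of the marks seen in each window of `width` consecutive probes."""
    return [frozenset().union(*probe_marks[i:i + width])
            for i in range(len(probe_marks) - width + 1)]


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    frequent = count({frozenset([m]) for t in transactions for m in t})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


# Hypothetical marks detected at five consecutive probes.
probe_marks = [{"H3K4me3"}, {"H3K4me3", "PolII"}, {"PolII"}, set(), {"H3K27me3"}]
windows = window_itemsets(probe_marks, width=2)
frequent_sets = apriori(windows, min_support=2)
print(frequent_sets)
```

On this toy input the pair H3K4me3 and PolII co-occurs in two windows and is kept, while H3K27me3 appears only once and is dropped.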
Case Study 3: Cistrack Elastic Cloud
Cistrack Elastic Cloud
A rack in Cistrack’s Elastic Cloud contains 128 cores and 128 TB.
Multiple racks form a data center.
Virtual machines can run pipelines.
Virtual machines have access to large data services.
No need to move large datasets in and out of the Amazon public cloud.
Use VMs to Support Reanalysis
At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis, persist the results to a database, and persist the VM to a cloud.
Comparing Peak Calling Algorithms for ModENCODE
We’re using the Cistrack Elastic Cloud to rerun peak calls for fly data using the worm pipeline.
Also running the worm peak calling pipeline on the fly data.
Case Study 4: Ensembles of Trees on Clouds
data
100 tree models
10,000??? tree models
Wenxuan Gao, Robert Grossman, Philip S. Yu, Yunhong Gu, Why Naïve Ensembles Do Not Work in Cloud Computing, Proceedings of LSDM, 2009.
Ensembles of Trees for Clouds
Top-k ensembles
– Each node builds a single random tree with local data.
– Central node picks the k best random trees to predict.
– Lower cost with correspondingly lower accuracy.
– Shuffling data can improve accuracy.
Skeleton ensembles
– Central node builds k skeletons of random trees.
– Each local node fills in the skeletons.
– Central node merges all trees from local nodes.
– Greater cost, but more accurate.
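The top-k steps can be sketched in a few lines of plain Python. This is a toy illustration, not the method of the cited paper: depth-1 random stumps stand in for random trees, the central node is assumed to rank trees by accuracy on a held-out validation set, shuffling and skeleton ensembles are omitted, and all names and data are hypothetical.

```python
import random
from collections import Counter


def build_random_stump(local_data, rng):
    """Depth-1 'random tree': random feature and threshold, leaves set by local majority."""
    n_features = len(local_data[0][0])
    feature = rng.randrange(n_features)
    threshold = rng.choice(local_data)[0][feature]
    sides = {True: Counter(), False: Counter()}
    for x, y in local_data:
        sides[x[feature] <= threshold][y] += 1
    overall = Counter(y for _, y in local_data).most_common(1)[0][0]
    leaf = {side: (c.most_common(1)[0][0] if c else overall) for side, c in sides.items()}
    return lambda x: leaf[x[feature] <= threshold]


def accuracy(model, data):
    """Fraction of (x, y) pairs the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def top_k_ensemble(partitions, validation, k, seed=0):
    """Each node builds one tree on its own partition; the central node keeps the k best."""
    rng = random.Random(seed)
    trees = [build_random_stump(part, rng) for part in partitions]
    trees.sort(key=lambda tree: accuracy(tree, validation), reverse=True)
    return trees[:k]


def predict(ensemble, x):
    """Majority vote over the trees in the ensemble."""
    return Counter(tree(x) for tree in ensemble).most_common(1)[0][0]


# Synthetic data: the label depends only on the first feature.
rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(120)]
data = [(p, int(p[0] > 0.5)) for p in points]
partitions = [data[i:i + 25] for i in range(0, 100, 25)]  # four "nodes"
validation = data[100:]
ensemble = top_k_ensemble(partitions, validation, k=3)
labels = [predict(ensemble, x) for x, _ in validation]
```

Only the small tree models travel to the central node, which is where the lower cost comes from; each tree has seen just one partition, which is consistent with the accuracy penalty noted above.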
Experimental Studies
Performed experimental studies on 4 racks (104 nodes) of the Open Cloud Testbed.
Standard ensemble-based models are more expensive than the proposed approaches and can overfit.
Skeleton ensembles are more accurate but more expensive to build.
Shuffling improves the accuracy of the top-k algorithm.
For the KDDCup99 dataset, top-k ensembles with shuffling of 0.1% of the data match the accuracy of the skeleton method.
For the UCI Census income dataset, a 20% shuffle is required, which is more expensive than top-k ensemble.
Without knowledge of the uniformity of the dataset, recommend skeleton ensembles.
[Figure: error rate vs. shuffle rate (series: skeleton, top-k) and computation cost in seconds vs. shuffle rate (series: skeleton, top-k, shuffle). Panels: KDDCup99 dataset (shuffle rates 0 to 5%) and Census income dataset (shuffle rates 0 to 30%).]
Part 5. Open Cloud Consortium
Biocloud
Open Cloud Testbed
Phase 2: 9 racks, 250+ nodes, 1000+ cores, 10+ Gb/s
Networks: MREN, CENIC, Dragon, C-Wave
Software: Hadoop, Sector/Sphere, Thrift, KVM VMs, Eucalyptus VMs
Open Science Data Cloud
sky cloud
biocloud
additional projects in planning…
OCC Condominium Clouds
In a condominium cloud, you buy your own rack or bunch of racks.
The racks are managed and operated by the condominium association, in this case the OCC.
If your rack is 120 TB, you get the rights to approx. 40 TB of storage in the cloud. The rest is a shared resource.
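As a rough worked example of the split, the sketch below assumes the 40-of-120 TB ratio scales proportionally with rack size. That proportionality is an assumption made for illustration; the slide gives only the single approximate data point.

```python
REFERENCE_RACK_TB = 120    # rack size quoted on the slide
REFERENCE_PRIVATE_TB = 40  # approximate storage rights quoted for that rack


def private_share_tb(rack_tb: float) -> float:
    """Storage the rack owner keeps rights to, assuming the same ratio at any size."""
    return rack_tb * REFERENCE_PRIVATE_TB / REFERENCE_RACK_TB


def shared_pool_tb(rack_tb: float) -> float:
    """Storage the rack contributes to the shared resource."""
    return rack_tb - private_share_tb(rack_tb)


print(private_share_tb(120), shared_pool_tb(120))  # → 40.0 80.0
```

Under this assumption, roughly two thirds of each contributed rack becomes shared capacity.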
Acknowledgements
To Get Involved
The Cistrack resource for transcriptional data: www.cistrack.org
Sector/Sphere cloud: sector.sourceforge.net
Thank You
For more information: blog.rgrossman.com or www.rgrossman.com