Bionimbus Cambridge Workshop (3-28-11, v7)

Bionimbus: A Cloud-Based Infrastructure for Managing, Analyzing and Sharing Genomics Data. Robert Grossman, Institute for Genomics & Systems Biology and Computation Institute, University of Chicago, and Open Cloud Consortium. March 29, 2011.


Page 1: Bionimbus Cambridge Workshop (3-28-11, v7)

Bionimbus: A Cloud-Based Infrastructure for Managing, Analyzing and Sharing Genomics Data

Robert Grossman

Institute for Genomics & Systems Biology

Computation Institute

University of Chicago

and

Open Cloud Consortium

March 29, 2011

Page 2: Bionimbus Cambridge Workshop (3-28-11, v7)

Part 1. Biology, Big Data & Clouds

Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).

Page 3: Bionimbus Cambridge Workshop (3-28-11, v7)

Source: Lincoln Stein

Page 4: Bionimbus Cambridge Workshop (3-28-11, v7)

The Challenge is to Support Cubes of Next Gen Sequence Data

Perturb the environment

Different developmental stages

Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set.

Different pathologies

Page 5: Bionimbus Cambridge Workshop (3-28-11, v7)

Genomics as a Big Data Science

Discipline          Duration     Size             # Devices
HEP - LHC           10 years     15 PB/year       One
Astronomy - LSST    10 years     10 PB/year       One
Genomics - NGS      2-4 years    0.5 TB/genome    Hundreds
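To put the table's figures on a common scale, the following sketch converts the yearly LHC and LSST volumes into the number of 0.5 TB genomes that would produce the same volume. It assumes decimal units (1 PB = 1000 TB), which the slide does not state, and the function name is illustrative only.

```python
# Figures taken from the table; decimal units (1 PB = 1000 TB) are an assumption.
TB_PER_PB = 1000
TB_PER_GENOME = 0.5

yearly_volume_pb = {"HEP - LHC": 15, "Astronomy - LSST": 10}


def genomes_equivalent(volume_pb: float) -> float:
    """Number of 0.5 TB genomes that add up to the given volume in PB."""
    return volume_pb * TB_PER_PB / TB_PER_GENOME


for name, pb in yearly_volume_pb.items():
    genomes = genomes_equivalent(pb)
    print(f"{name}: {pb} PB/year is about {genomes:,.0f} genomes/year")
```

Under these assumptions, sequencing on the order of tens of thousands of genomes per year, spread across hundreds of devices, reaches the same yearly volume as a single large physics or astronomy instrument.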

Page 6: Bionimbus Cambridge Workshop (3-28-11, v7)

What is new about clouds?

Page 7: Bionimbus Cambridge Workshop (3-28-11, v7)


Scale is New

Page 8: Bionimbus Cambridge Workshop (3-28-11, v7)

Elastic, On-Demand Computing with Usage Based Pricing Is New

1 computer in a rack for 120 hours

costs the same as

120 computers in three racks for 1 hour
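The pricing claim is simple arithmetic: under usage-based pricing the bill depends only on machine-hours, not on how they are spread over machines. A minimal sketch follows; the hourly rate is a hypothetical value chosen for illustration, and a flat per-machine-hour rate with no minimum charge is an assumption.

```python
def usage_cost(machines: int, hours: float, rate_per_machine_hour: float) -> float:
    """Cost under flat usage-based pricing: machine-hours times the hourly rate."""
    return machines * hours * rate_per_machine_hour


RATE = 0.10  # hypothetical price per machine-hour, not taken from the slides

one_machine = usage_cost(machines=1, hours=120, rate_per_machine_hour=RATE)
many_machines = usage_cost(machines=120, hours=1, rate_per_machine_hour=RATE)

print(f"1 machine x 120 hours:  {one_machine:.2f}")
print(f"120 machines x 1 hour:  {many_machines:.2f}")
print("Same cost:", one_machine == many_machines)
```

The cost is identical, but the second configuration returns results 120 times sooner, provided the workload parallelizes well.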

Page 9: Bionimbus Cambridge Workshop (3-28-11, v7)

Part 2. What is Bionimbus?

www.bionimbus.org

Page 10: Bionimbus Cambridge Workshop (3-28-11, v7)

Bionimbus is a community cloud for storing, analyzing and sharing genomics and related data.

Page 11: Bionimbus Cambridge Workshop (3-28-11, v7)

Bionimbus Private Cloud UC

Bionimbus Community Cloud

Bionimbus Private Cloud XY

Amazon

dbGaP

External Sequencers

IGSB Sequencers

BID Generator

Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc.

Step 2. Send sample to be sequenced.

Step 3a. Return raw reads.

Step 3b. Return variant calls, CNV, annotation…

Step 4. Secure data routing to appropriate cloud based upon BID.

Step 5. Cloud based analysis using IGSB and 3rd party tools and applications.
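The workflow on this slide (issue a BID, then route returned data by BID) can be sketched roughly as follows. This is not the Bionimbus implementation: the class names, the `BID-000001` identifier format, and the set of cloud labels are all hypothetical, chosen only to illustrate Step 1 and Step 4.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical destination labels, loosely following Step 1 on the slide.
CLOUDS = {"private", "community", "public"}


@dataclass(frozen=True)
class BidRecord:
    bid: str      # hypothetical identifier format
    project: str
    cloud: str    # destination chosen when the BID is issued


class BidGenerator:
    """Toy stand-in for the BID Generator box in the diagram."""

    def __init__(self) -> None:
        self._counter = count(1)
        self._records: dict[str, BidRecord] = {}

    def issue(self, project: str, cloud: str) -> BidRecord:
        """Step 1: assign a BID, a project, and a destination cloud."""
        if cloud not in CLOUDS:
            raise ValueError(f"unknown cloud: {cloud!r}")
        record = BidRecord(f"BID-{next(self._counter):06d}", project, cloud)
        self._records[record.bid] = record
        return record

    def route(self, bid: str) -> str:
        """Step 4: look up where data tagged with this BID should go."""
        return self._records[bid].cloud


generator = BidGenerator()
record = generator.issue(project="demo-project", cloud="community")
print(record.bid, "->", generator.route(record.bid))
```

Fixing the destination when the BID is issued means the sequencer only needs to tag its output with the BID; the routing decision never has to be made again downstream.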

Page 12: Bionimbus Cambridge Workshop (3-28-11, v7)

What is a good unit to understand data intensive computing of biological data?

Page 13: Bionimbus Cambridge Workshop (3-28-11, v7)

Bionimbus & OSDC Today

• The NIH in the U.S. currently makes available for download approximately 2 PB of data.

• Bionimbus 2010 consists of 6 racks, 212 nodes, 1568 cores and 0.9 PB of storage.

• Bionimbus is part of the POC Open Science Data Cloud that consists of 14 racks, 472 nodes, 3776 cores and 3+ PB of storage.
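As a quick check on these figures, the sketch below computes cores per node for each system and the share of the Open Science Data Cloud that Bionimbus represents. The OSDC storage is given as "3+ PB", so 3.0 PB is used as a lower bound (an assumption), which makes the storage share an upper bound.

```python
# Figures from the bullets above; OSDC storage of 3.0 PB is a lower bound ("3+ PB").
bionimbus = {"racks": 6, "nodes": 212, "cores": 1568, "storage_pb": 0.9}
osdc = {"racks": 14, "nodes": 472, "cores": 3776, "storage_pb": 3.0}


def cores_per_node(system: dict) -> float:
    """Average number of cores per node."""
    return system["cores"] / system["nodes"]


def share(part: dict, whole: dict) -> dict:
    """Fraction of each resource in `whole` that `part` accounts for."""
    return {key: part[key] / whole[key] for key in part}


print(f"Bionimbus cores/node: {cores_per_node(bionimbus):.1f}")
print(f"OSDC cores/node:      {cores_per_node(osdc):.1f}")
for resource, fraction in share(bionimbus, osdc).items():
    print(f"Bionimbus share of OSDC {resource}: {fraction:.0%}")
```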

Page 14: Bionimbus Cambridge Workshop (3-28-11, v7)

Database Services

Analysis Pipelines & Re-analysis Services

GWT-based Front End

Large Data Cloud Services

Data Ingestion Services

Elastic Cloud Services

Intercloud Services

Page 15: Bionimbus Cambridge Workshop (3-28-11, v7)

Bionimbus Deployment Options

Bionimbus Community Cloud (www.bionimbus.org)

Bionimbus AMIs & Amazon hosted applications

Bionimbus Private Clouds

Page 16: Bionimbus Cambridge Workshop (3-28-11, v7)

Part 3. Some Bionimbus Case Studies

Page 17: Bionimbus Cambridge Workshop (3-28-11, v7)

Case Study: Public Datasets in Bionimbus

Page 18: Bionimbus Cambridge Workshop (3-28-11, v7)

Case Study: modENCODE

• Bionimbus is used to process the modENCODE data from the White lab (over 1000 experiments).

• Bionimbus VMs were used for some of the integrative analysis.

• Bionimbus is used as a backup for the modENCODE DCC.

Page 19: Bionimbus Cambridge Workshop (3-28-11, v7)

Case Study: IGSB

• All samples processed by the Institute for Genomics & Systems Biology High-Throughput Genome Analysis Core (HGAC) at the University of Chicago use Bionimbus.

Page 20: Bionimbus Cambridge Workshop (3-28-11, v7)

Bionimbus Virtual Machine Releases

Peak Calling: MAT, MA2C, PeakSeq, MACS, SPP

Quality Control: Various

Alignment & Genotyping: Bowtie, TopHat, Samtools, Picard

Page 21: Bionimbus Cambridge Workshop (3-28-11, v7)

Part 4. What is the OSDC?

Page 22: Bionimbus Cambridge Workshop (3-28-11, v7)

Open Science Data Cloud

Astronomical data

Biological data (Bionimbus)

NSF-PIRE OSDC Data Challenge

Earth science data (& disaster relief)

Page 23: Bionimbus Cambridge Workshop (3-28-11, v7)

www.opencloudconsortium.org

• U.S.-based not-for-profit corporation.

• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.

• Manages cloud computing testbeds: Open Cloud Testbed.

• Develops reference implementations, benchmarks and standards.

Page 24: Bionimbus Cambridge Workshop (3-28-11, v7)

OCC Members

• Companies: Cisco, Citrix, Yahoo!, …

• Universities: University of Chicago, Calit2, Johns Hopkins, Northwestern Univ., ORNL, University of Illinois at Chicago, …

• Federal agencies: NASA

• Other: National Lambda Rail

• Adding international partners in 2011.

Page 25: Bionimbus Cambridge Workshop (3-28-11, v7)

Infrastructure

• 2010 Proof-of-Concept Infrastructure

  – 450+ nodes

  – 3000+ cores

  – 3+ PB

  – Four data centers (two more to come in 2011)

  – Data centers have 10G network connections (some 100G links in 2011)

• Plan to add approximately 1 PB of data in 2011.

• With current funding, we will refresh 1/3 of the infrastructure in 2011 and 2012.

Page 26: Bionimbus Cambridge Workshop (3-28-11, v7)

Towards a Long Term, Sustainable Model

• Cap Exp about $1M/year

• Op Exp about $1M/year

• Moore Foundation providing $1M/year for 2011 and 2012 to support the Cap Exp.

Page 27: Bionimbus Cambridge Workshop (3-28-11, v7)

[Figure: variety of analysis versus data size.]

Data Size: Small, Medium to Large, Very Large

Variety of analysis: Low, Med, Wide

Infrastructure: No infrastructure, General infrastructure, Dedicated infrastructure

Examples: Scientist with laptop; Open Science Data Cloud; Sequencing centers, LHC, LSST

Page 28: Bionimbus Cambridge Workshop (3-28-11, v7)

[Figure: cycles versus persistent data.]

Cycles: Small, Med, Large

Persistent data

Labels: Single workstations; Small to medium clusters; HPC; Large & spec. clusters; databases; data clouds

Page 29: Bionimbus Cambridge Workshop (3-28-11, v7)

Bionimbus Team*

David Hanley, Nicolas Negre, Elizabeth Bartom, Nicholas Bild, Christopher D. Brown, Marc Domanus, Robert L Grossman, A. Jason Grundstad, Xiangjun Liu, Michal Sabala, Parantu K Shah, Kevin P White

Institute for Genomics & Systems Biology, University of Chicago

Jia Chen, Yunhong Gu and Damian Roqueiro

University of Illinois at Chicago

Lincoln Stein and Zheng Zha

Ontario Institute for Cancer Research

*In alphabetical order

Page 30: Bionimbus Cambridge Workshop (3-28-11, v7)

Acknowledgements

Page 31: Bionimbus Cambridge Workshop (3-28-11, v7)

Questions?

Page 32: Bionimbus Cambridge Workshop (3-28-11, v7)

Thank You

For more information:

www.bionimbus.org

www.opencloudconsortium.org

www.igsb.org

rgrossman.com