Cloud BioLinux: Pre-Configured and On-Demand
High Performance Computing for the Genomics Community
Ntino Krampis, PhD
Next-Gen Sequence Data Management '10, Providence, RI
Expensive sequencing, computing and large organizations
● multi-million, broad-impact sequencing projects
● large sequencing center, with a dedicated bioinformatics department
● large-scale computations on SGE cluster, algorithm acceleration hardware
Bench-top, commodity sequencing and small labs
● small form-factor sequencer available: GS Junior by 454
● sequencing as a standard technique in basic biology and genetics research
● remember microarrays and lengthy assays for protein interactions ?
● RNA-seq and ChIP-seq, and each biologist will be tackling a metagenome
Will small labs become the long tail of sequencing ?
● downstream bioinformatic analysis required for biological discovery
● basic analysis example: large-scale BLAST to public DBs (try 0.5GB at NCBI)
● do not have the hardware, expertise, or time to install and run software locally
[Figure: long-tail curve, amount of sequencing vs. number of labs. Credit: Wikimedia Commons]
Cloud Biolinux: pre-configured and on-demand bioinformatics on the cloud
● a public virtual machine (VM) on EC2 with 100+ bioinformatics tools
● how it came to be, what it offers for sequence analysis
● where and how do I run it, especially if I am not a computer expert
● modifying and sharing VM configurations and data with your peers
● openness and community around Cloud Biolinux
Cloud Biolinux
The Biolinux part
tinyurl.com/BioLinux-NEBC + tinyurl.com/CloudBioLinux-JCVI = Cloud Biolinux
● an Ubuntu Linux desktop for bioinformatics
● NEBC packages the software and maintains the repository
● Ubuntu AMI on EC2, pull packages from repository
● additional software of interest to JCVI
Cloud Biolinux: what comes in the box
● glimmer, hmmer, phylip, rasmol, genespring, clustalw, EMBOSS
● mpiBLAST clusters using EC2 virtual machine instances
● Celera whole genome shotgun assembler
● NX remote desktop, easy to use for benchtop scientists
Cloud Biolinux
The Cloud part
● find our VM on Amazon EC2:
Biolinux 5.0 packages (32-bit): ami-6953b200
Biolinux 6.0 packages (64-bit): ami-6011e409, EBS based
● 17 GB RAM / 6 core instances at 0.5$ / hour, see aws.amazon.com/ec2/pricing
● a small bacterial genome assembly costs a little over 2$
● up to 68 GB RAM / 26 cores, EBS volumes up to 1000 GB in size (0.10$ / GB / month)
● make a copy of our public Biolinux AMI - add your data - make private
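The prices on this slide lend themselves to a quick back-of-the-envelope estimate. The sketch below uses only the 2010 figures quoted above (0.5$ / hour for the instance, 0.10$ / GB / month for EBS). The 4.5-hour run time is a made-up illustration, and rounding partial hours up to full hours is an assumption about how usage was billed at the time; current AWS pricing will differ.

```python
import math

# Prices quoted on the slide (2010); illustrative, not current.
INSTANCE_USD_PER_HOUR = 0.50   # the 17 GB / 6 core instance
EBS_USD_PER_GB_MONTH = 0.10    # EBS storage


def compute_cost(hours, usd_per_hour=INSTANCE_USD_PER_HOUR):
    """Cost of one instance, assuming partial hours are billed as full hours."""
    billed_hours = math.ceil(hours)
    return round(billed_hours * usd_per_hour, 2)


def storage_cost(size_gb, months=1, usd_per_gb_month=EBS_USD_PER_GB_MONTH):
    """Cost of keeping an EBS volume of size_gb for the given number of months."""
    return round(size_gb * months * usd_per_gb_month, 2)


if __name__ == "__main__":
    # Hypothetical 4.5-hour assembly run, billed as 5 hours.
    print("compute:", compute_cost(4.5), "USD")
    # The largest EBS volume mentioned on the slide, kept for one month.
    print("storage:", storage_cost(1000), "USD / month")
```

A run of a few hours at this rate lands in the same range as the "little over 2$" quoted for a small bacterial genome assembly.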
Cloud Biolinux http://tinyurl.com/cloud-biolinux-tutorial (credit to the NEBC team)
simply sign up at aws.amazon.com, then go to aws.amazon.com/console, and:
Cloud Biolinux http://tinyurl.com/cloud-biolinux-tutorial (credit to the NEBC team)
● find the Cloud Biolinux AMI using its ID
● enter the desired password for remote desktop login
● leave all other settings at their defaults
● get a remote desktop client: nomachine.com/download.php
● simply enter the VM's IP address and your password
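The console steps above can also be scripted. This is a minimal sketch, not the tutorial's method: it assumes the boto3 library (which postdates this talk) and configured AWS credentials. The instance type `m2.xlarge` and the region `us-east-1` are assumptions, the helper names are hypothetical, and the AMI IDs from the slides may no longer exist. Setting the remote desktop password is not covered here.

```python
def build_launch_request(ami_id, instance_type="m2.xlarge"):
    """Keyword arguments for the EC2 client's run_instances call.

    instance_type is an assumed default, not a value taken from the slides.
    """
    if not ami_id.startswith("ami-"):
        raise ValueError("expected an AMI ID such as 'ami-6011e409'")
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,   # start exactly one VM
        "MaxCount": 1,
    }


def launch(ami_id, region="us-east-1"):
    """Start one instance and return its instance ID (needs AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**build_launch_request(ami_id))
    return response["Instances"][0]["InstanceId"]


if __name__ == "__main__":
    # Only builds and prints the request; call launch() to actually start a VM.
    print(build_launch_request("ami-6011e409"))
```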
What if I want to share my alignments with a collaborator?
save your data as a new AMI
EBS cost 0.10$ / GB / month
at 15GB, it costs 1.5$ / month
share your data: public or with another AWS user
users with access can boot the AMI with all the software + data
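Sharing an AMI comes down to changing its launch permission. The sketch below builds the permission payload for the two cases on this slide (public, or specific AWS users) and passes it to boto3's `modify_image_attribute`. It assumes boto3 and configured credentials; the function names, the region default, and the account ID in the example are placeholders.

```python
def launch_permission(account_ids=None):
    """LaunchPermission payload: public if no accounts are given, else per-user."""
    if not account_ids:
        return {"Add": [{"Group": "all"}]}           # share publicly
    return {"Add": [{"UserId": account_id} for account_id in account_ids]}


def share_image(image_id, account_ids=None, region="us-east-1"):
    """Apply the launch permission to an AMI you own (needs AWS credentials)."""
    import boto3  # imported here so launch_permission() works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    ec2.modify_image_attribute(
        ImageId=image_id,
        LaunchPermission=launch_permission(account_ids),
    )


if __name__ == "__main__":
    # Placeholder 12-digit account ID of a collaborator.
    print(launch_permission(["123456789012"]))
    print(launch_permission())
```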
Cloud Biolinux
The Cloud part
● run Cloud Biolinux on your private cloud ?
● Eucalyptus open source cloud platform
● API identical to EC2's, without the usage charges
● easy to set up on your lab's cluster, comes with Ubuntu server (UEC)
● download VMs from Sourceforge ( tinyurl.com/CloudBiolinux-SF )
open.eucalyptus.com
Cloud Biolinux
● porting VMs across cloud platforms is not trivial
● e.g. moving Cloud Biolinux VMs from EC2 to Eucalyptus involves the Xen kernel and boot sector
● framework to share VM configurations ( tinyurl.com/bootstrap-cloudbiolinux )
● based on python-fabric automated deployment tool
● simply edit the software list files and share with collaborators
● they start with fresh VM, python-fabric replicates VM setup on their cloud
tinyurl.com/python-fabric
Cloud Biolinux
Collaboration and open source
high-level configuration describing software groups
for each group individual software packages
simply edit the files to change the VM configuration
tinyurl.com/CloudBioLinux-github
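The two-level configuration described on this slide (software groups, then packages per group) can be sketched in a few lines. This is an illustration only: the group names, the in-code dictionaries, and the generated `apt-get` command are assumptions, not the project's actual file format or fabric scripts. The package names are taken from the "what comes in the box" slide.

```python
# Hypothetical stand-ins for the two kinds of configuration files.
SOFTWARE_GROUPS = {
    "alignment": ["clustalw", "hmmer"],
    "phylogeny": ["phylip"],
    "visualization": ["rasmol"],
}
ENABLED_GROUPS = ["alignment", "phylogeny"]   # the high-level selection


def packages_for(enabled, groups):
    """Expand the enabled groups into an ordered package list without duplicates."""
    packages = []
    for group in enabled:
        if group not in groups:
            raise ValueError("unknown software group: " + group)
        for package in groups[group]:
            if package not in packages:
                packages.append(package)
    return packages


def install_command(packages):
    """Shell command a deployment tool could run on a fresh Ubuntu VM."""
    return "apt-get install -y " + " ".join(packages)


if __name__ == "__main__":
    print(install_command(packages_for(ENABLED_GROUPS, SOFTWARE_GROUPS)))
```

Editing `ENABLED_GROUPS` or a group's package list changes what gets installed, which mirrors the idea of sharing a VM setup by sharing the list files rather than the VM image itself.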
Cloud Biolinux
The community
● from JCVI and NEBC to an open-source, community-based project
● community initiated during tele-conference meeting at SC '10, Portland, OR
● first meeting this past July in Boston, tinyurl.com/openbio-codefest-2010
● work done: 64-bit AMIs, NX remote desktop, set up the fabric framework
● next year's meeting at ISMB/BOSC in Vienna, Austria http://metalab.at/
● cloudbiolinux.com and most important, tinyurl.com/cloudbiolinux-lists
Cloud Biolinux
The future
● expand community, receive feedback, add more software to the VM
● genome assemblers, high-memory EC2 instances up to 68GB RAM
● Hadoop / MapReduce (for those running the VM in private clouds)
● analysis pipelines that are used by large sequencing centers
● actively seeking funding to put major effort in development
● tinyurl.com/cloudbiolinux-lists or [email protected]
Acknowledgments & Credits
Brad Chapman - development of the fabric scripts and community organizer
Tim Booth, Bela Tiwari – BioLinux 6.0 development and EC2 documentation
Deepak Singh and AWS - education grant supporting codefest workshop
Justin Johnson – community and sponsorship of cloudbiolinux.com
J. Craig Venter Inst. - time allowed to work on an open-source project
D. Gomez, E. Navarro, J. Shao, I. Singh – JCVI technology innovation
Members of the Cloud Biolinux community:
Enis Afgan, Michael Heuer, Richard Holland, Mark Jensen, Dave Messina, Steffen Möller, Roman Valls
Thank you !