Community demands for cloud computing & challenges: Environmental Genomics Community
• Richard Nichols (QMUL & NBAF) • Yannick Wurm (QMUL)
What does the NERC environmental genomics community do? What do they ask NBAF for?
(real data for 2014-15)
Vertebrates Invertebrates Plants Micro-orgs
RAD seq x x x
Epigenomics x xx
Metagenomics/barcoding x x xxxxxxxxxxx
Long-read methods x x x
Sequence capture/ reseq xxxx x x
Transcriptomics xx xxx xxx x
Genomic sequencing x xx
Challenges to design & delivery of rational provision
UKRO Funding National & International Infrastructure
AWS iPLANT JASMIN National services and facilities
Overseas public & commercial funding
Regional and inter-institution networks
Multi-institution grants & capital expenditure windfalls
Regional HPC Institution-level provision QR funding
Subscription from smaller research grants
Research group level
Dedicated clusters, servers and specialist architectures
Smaller grants
Strategy?
Opportunism?
Exasperation?
Expertise
• Training may not be the answer
  – Remove the need?
  – Provide expertise with other services?
  – Rebalance the community?
http://wurmlab.github.io
Huge variance of genomics compute needs

Task | Repetitiveness | “Disk” Input/Output | Memory | Duration per task
Build 10,000 trees | 10,000x | low | low | short
Trim FASTQ files | 40-400x | high | low | short
One de novo genome assembly | 1 | high | high | long
Many de novo genome assemblies | 20-1000x | high | high | long
Determine which of 10 new tools that promise X can actually do X (once). “genome hacking” | 1 | depends | depends | depends
No easy solutions
Genomics computation is harder than other fields
Dorylus driver ants: ants with no home
© BBC
• Biology/life is complex.
• Field is young.
• Biologists lack computational training.
• Generally, analysis tools suck.
  • badly written
  • badly tested
  • hard to install
  • output quality… often questionable.
• Understanding/visualizing/massaging data is hard.
• Datasets continue to grow!
• Data formats keep changing.
Specific challenges
• Software switching
• Exploring approaches
• Project-specific versions (for reproducibility)
More genomics-specific challenges
Installing things (yourself/sysadmin)
Too complicated
Cloud VM instance creation interfaces (Amazon)
Too slow/complicated/unreliable
mymac:~/2015-06-01-myproject> abyss-pe k=25 reads.fastq.gz
zsh: command not found: abyss-pe
mymac:~/2015-06-01-myproject> oswitch -l
yeban/biolinux:8
ubuntu:14.04
ontouchstart/texlive-full
ipython/ipython
mymac:~/2015-06-01-myproject> oswitch yeban/biolinux
###### You are now running: biolinux in container biolinux-7187. ######
biolinux-7187:~/2015-06-01-myproject> abyss-pe k=25 reads.fastq.gz
[... just works on your files where they are...]
biolinux-7187:~/2015-06-01-myproject> exit
###### Back to your host OS ######
mymac:~/2015-06-01-myproject>
[... output is where you expect it to be ...]
oSwitch: One-line access to other operating systems.
EOS Cloud
https://github.com/wurmlab/oswitch
Or use it as a one-liner, e.g.:
oswitch yeban/biolinux bwa aln -t 48 genome.fna reads.fq
Things feel (largely) unchanged:
• Current working directory
• User name, uid and gid
• Login shell (bash/zsh/fish)
• Home directory (including .dotfiles config)
• Read/write permissions
• Paths (when possible): host-mounted volumes (drives, NAS, USB) are available in the container at the same path.
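To make the "things feel unchanged" list concrete, here is a minimal sketch of how a wrapper could assemble a `docker run` command that preserves the working directory, home directory and numeric user identity. It is not oSwitch's actual implementation; the function name `container_command` and its parameters are hypothetical, and only standard `docker run` flags (`--rm`, `-it`, `--user`, `--volume`, `--env`, `--workdir`) are assumed.

```python
import os
import shlex


def container_command(image, command=(), cwd=None, home=None, uid=None, gid=None):
    """Build a `docker run` argument list that keeps the host's working
    directory, home directory and uid/gid inside the container.

    Hypothetical helper for illustration only; oSwitch itself may work differently.
    """
    cwd = os.path.abspath(cwd or os.getcwd())
    home = os.path.abspath(home or os.path.expanduser("~"))
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid

    args = [
        "docker", "run", "--rm", "-it",
        "--user", f"{uid}:{gid}",      # same numeric identity, so file ownership matches
        "--volume", f"{home}:{home}",  # home directory (and dotfiles) at the same path
        "--env", f"HOME={home}",
    ]
    # Mount the working directory separately only when it lives outside home.
    if os.path.commonpath([cwd, home]) != home:
        args += ["--volume", f"{cwd}:{cwd}"]
    # Start in the directory the user is already in, then run the command.
    args += ["--workdir", cwd, image, *command]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so no Docker install is needed.
    print(shlex.join(container_command("yeban/biolinux", ["bwa", "aln", "-t", "48"])))
```

Mounting each host directory at the identical path inside the container is what lets relative and absolute file names on the command line keep working without translation.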
Specific challenges
• Software switching
• Cloud instance provisioning user experience
More genomics-specific challenges
Usage patterns: CPU/RAM
• 90% of the time: need nothing
• 5% of the time: need small resources
• 5% of the time: need huge resources
  • sometimes on a fat machine (many-core, many-CPU)
  • sometimes via a queuing system
Provisioning strategy?
What users “should” do:
• Development:
  • Confusing cloud interface. Choose small number of cores + RAM. Launch.
  • Develop pipeline; seems to be working.
• Production:
  • Confusing cloud interface. Choose larger number of cores + RAM. Launch.
  • After 5 days it crashes.
  • Confusing cloud interface. Choose even larger number of cores + RAM. Launch.
  • Lucky. Analysis complete.
  • Failed to shut down: huge bill 2 months later.
They’ll be inefficient and frustrated or go elsewhere.
Mylab @ RCUK cloud
• Single place to connect to:
  ssh mylab.rcukcloud.co.uk
• Is *always* there.
• Instance automagically:
  • grows from small to medium or large CPU/RAM machine with increasing CPU and RAM demands (NERC EOS Boost should be transparent).
  • shrinks down to minimum
  • hibernates/sleeps when unused.
• (Also allows queue submission.)
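The grow/shrink/hibernate behaviour wished for above could be driven by a very small sizing policy. The sketch below is only an illustration under stated assumptions: the tier names, core and RAM sizes, the 60-minute idle limit, and the fallback to queue submission when demand exceeds the largest tier are all hypothetical and not taken from any existing RCUK or NERC service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    cores: int
    ram_gb: int


# Hypothetical instance tiers, smallest first; real offerings will differ.
TIERS = (
    Tier("small", 2, 8),
    Tier("medium", 16, 64),
    Tier("large", 48, 512),
)


def choose_state(cores_needed, ram_gb_needed, idle_minutes, idle_limit=60, tiers=TIERS):
    """Pick the target state for a lab instance.

    Returns "hibernate" when the instance has been idle for at least
    idle_limit minutes, otherwise the name of the smallest tier that
    satisfies both the core and RAM demand. If no tier is big enough,
    returns "queue" (assumed fallback: submit the job to a batch queue).
    """
    if idle_minutes >= idle_limit:
        return "hibernate"
    for tier in tiers:  # tiers are ordered smallest first
        if tier.cores >= cores_needed and tier.ram_gb >= ram_gb_needed:
            return tier.name
    return "queue"


if __name__ == "__main__":
    print(choose_state(cores_needed=1, ram_gb_needed=4, idle_minutes=0))
    print(choose_state(cores_needed=4, ram_gb_needed=200, idle_minutes=0))
    print(choose_state(cores_needed=1, ram_gb_needed=1, idle_minutes=120))
```

Because the policy always returns the smallest sufficient tier, the same function covers both growing under load and shrinking back to the minimum once demand drops.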
Specific challenges
• Software switching
• Cloud instance provisioning user experience
• Balancing storage demand: hyperfast (for working) vs large but cheap & easily accessible archival.
More genomics-specific challenges