US DOE Joint Genome Institute Surviving the Deluge Darren Platt Joint Genome Institute


Page 1

Surviving the Deluge
Darren Platt
Joint Genome Institute

Page 2

Surviving the Deluge

• Shrinking Read Lengths
• Hybrid Assemblies
• The Coming Storm
• Improving metabolic flux
• The Future

Page 3

Read Lengths are Getting Shorter

• The debate about the impact of read length on genome assembly has never been resolved
• Are more 650 bp reads better than fewer 750 bp reads?
• What about 100?
• How do you feel about 35?
• Why wait, join the revolution…

Page 4

Assembling with Four Base Pair Reads..

>GATCGATC

Page 5

Not as dire as you might think

• Dramatically simplifies GenBank trace archive submissions
• Only 256 distinct sequences, 65K 3bp overlaps
• Store reads as a single byte

GGGG GGGA GGGT GGGC GGAG GGAA GGAT GGAC GGTG GGTA GGTT GGTC GGCG GGCA GGCT GGCC
GAGG GAGA GAGT GAGC GAAG GAAA GAAT GAAC GATG GATA GATT GATC GACG GACA GACT GACC
GTGG GTGA GTGT GTGC GTAG GTAA GTAT GTAC GTTG GTTA GTTT GTTC GTCG GTCA GTCT GTCC
GCGG GCGA GCGT GCGC GCAG GCAA GCAT GCAC GCTG GCTA GCTT GCTC GCCG GCCA GCCT GCCC
AGGG AGGA AGGT AGGC AGAG AGAA AGAT AGAC AGTG AGTA AGTT AGTC AGCG AGCA AGCT AGCC
AAGG AAGA AAGT AAGC AAAG AAAA AAAT AAAC AATG AATA AATT AATC AACG AACA AACT AACC
ATGG ATGA ATGT ATGC ATAG ATAA ATAT ATAC ATTG ATTA ATTT ATTC ATCG ATCA ATCT ATCC
ACGG ACGA ACGT ACGC ACAG ACAA ACAT ACAC ACTG ACTA ACTT ACTC ACCG ACCA ACCT ACCC
TGGG TGGA TGGT TGGC TGAG TGAA TGAT TGAC TGTG TGTA TGTT TGTC TGCG TGCA TGCT TGCC
TAGG TAGA TAGT TAGC TAAG TAAA TAAT TAAC TATG TATA TATT TATC TACG TACA TACT TACC
TTGG TTGA TTGT TTGC TTAG TTAA TTAT TTAC TTTG TTTA TTTT TTTC TTCG TTCA TTCT TTCC
TCGG TCGA TCGT TCGC TCAG TCAA TCAT TCAC TCTG TCTA TCTT TCTC TCCG TCCA TCCT TCCC
CGGG CGGA CGGT CGGC CGAG CGAA CGAT CGAC CGTG CGTA CGTT CGTC CGCG CGCA CGCT CGCC
CAGG CAGA CAGT CAGC CAAG CAAA CAAT CAAC CATG CATA CATT CATC CACG CACA CACT CACC
CTGG CTGA CTGT CTGC CTAG CTAA CTAT CTAC CTTG CTTA CTTT CTTC CTCG CTCA CTCT CTCC
CCGG CCGA CCGT CCGC CCAG CCAA CCAT CCAC CCTG CCTA CCTT CCTC CCCG CCCA CCCT CCCC

• Challenging to assemble
• Vector trimming out of the question
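The "single byte" bullet follows from simple counting: 4 bases at 2 bits each is 8 bits, and 4^4 = 256 possible reads. Below is a minimal sketch of one such packing. The function names and the numeric codes (G=0, A=1, T=2, C=3, mirroring the order of the table above) are assumptions made for illustration; the slides do not specify an encoding.

```python
# Two bits per base. The G, A, T, C ordering mirrors the slide's table;
# the numeric codes themselves are an arbitrary choice for this sketch.
BASES = "GATC"
CODE = {base: i for i, base in enumerate(BASES)}


def pack_read(read: str) -> int:
    """Pack a 4 bp read into one byte (0-255), first base in the high bits."""
    if len(read) != 4:
        raise ValueError("expected exactly 4 bases")
    value = 0
    for base in read:
        value = (value << 2) | CODE[base]
    return value


def unpack_read(value: int) -> str:
    """Inverse of pack_read: recover the 4 bp read from its byte value."""
    return "".join(BASES[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))
```

With this encoding the table above is simply the byte values 0 through 255 in order, so a whole read set could be held in a `bytes` object.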

Page 6

Testing on a Real Genome

gi|11496567|ref|NC_001830.1| Pear blister canker viroid PBCVd, complete genome
CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGG
GCTTCTCGGCTCGTCGTCGACGAAGGGTCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAA
TCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCTGTCCCGCTAGTCGAGC
GGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGT
TTACCGCGGACCCCCGAGAGGAGGCCCTCGGGTCC

Page 7

ABCD Assembler

• ABCD assembler: Only 500 lines of C++
• Libraries with insert sizes of 4, 5, 6, 8, 10, 20, 40, 100 and 200 bp
• Generated 8Mb of sequence (22K x coverage)
• Results in 2410 unique data points

100 AAAA CGCT
100 AAAA GCTC
100 AAAA GGAG
100 AAAA GGCT
100 AAAA TGGA
100 AAAC GCTT
100 AAAG CTCC
100 AAAG GAGA
100 AACC CTTC
100 AAGA CTTC
…
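The data points above look like (insert size, left read, right read) triples. A rough sketch of how such a tally could be produced from a known sequence follows. It assumes a linear genome, both reads taken from the same strand, and "insert size" treated as the distance between the two read start positions; none of this is confirmed by the slides, and `tally_read_pairs` is a hypothetical name.

```python
from collections import Counter


def tally_read_pairs(genome, offsets, read_len=4):
    """Count (offset, left read, right read) triples over a linear genome.

    Assumption for this sketch: 'offset' is the distance between the start
    of the left read and the start of the right read, same strand.
    """
    tally = Counter()
    for offset in offsets:
        last_start = len(genome) - offset - read_len
        for i in range(last_start + 1):
            left = genome[i:i + read_len]
            right = genome[i + offset:i + offset + read_len]
            tally[(offset, left, right)] += 1
    return tally


# Example call on the first 20 bases of the viroid sequence.
pairs = tally_read_pairs("CTTTCCTGAGGTTCCTGTGG", offsets=[4, 5, 6])
```

The number of distinct keys in the returned `Counter` plays the role of the "unique data points" on the slide, though the exact figure depends on the pairing conventions assumed here.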

Page 8

Performance of the ABCD Assembler

• Genetic Algorithm evolves candidate genomes and compares to observed data frequencies
• Reward and breed genomes that produce similar data
• Penalize genomes that generate unobserved data

• After 3-4 days on a high end CPU..

REF CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGGGCTTCTCGGCTCGTCGTCGACGAAGGG
SEQ CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGGGCTTCTCGGCTCGTCGTCGACGAAGGG

REF TCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAATCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCT
SEQ TCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAATCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCT

REF GTCCCGCTAGTCGAGCGGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGTTTACCGCGGAC
SEQ GTCCCGCTAGTCGAGCGGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGTTTACCGCGGAC

REF CCCCGAGAGGAGGCCCTCGGGTCC
SEQ CCCCGAGAGGAGGCCC
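The reward and penalty idea on this slide can be sketched in a few lines. This is not the ABCD assembler itself (which the slides describe as about 500 lines of C++) but a simplified evolutionary loop using mutation and truncation selection only, with no crossover. The read-pair model, the scoring weights, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

READ_LEN = 4


def read_pairs(genome, offsets):
    """Tally (offset, left read, right read) triples for a linear genome."""
    tally = Counter()
    for offset in offsets:
        for i in range(len(genome) - offset - READ_LEN + 1):
            left = genome[i:i + READ_LEN]
            right = genome[i + offset:i + offset + READ_LEN]
            tally[(offset, left, right)] += 1
    return tally


def fitness(candidate, observed, offsets):
    """Reward triples shared with the observed Counter; penalize unobserved ones."""
    produced = read_pairs(candidate, offsets)
    shared = sum(min(n, observed[key]) for key, n in produced.items())
    unobserved = sum(n for key, n in produced.items() if key not in observed)
    return shared - unobserved


def mutate(genome, rng):
    """Replace one randomly chosen position with a random base."""
    i = rng.randrange(len(genome))
    return genome[:i] + rng.choice("GATC") + genome[i + 1:]


def evolve(observed, length, offsets, generations=200, pop_size=50, seed=0):
    """Keep the fitter half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice("GATC") for _ in range(length))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=lambda g: fitness(g, observed, offsets), reverse=True)
        survivors = population[:pop_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=lambda g: fitness(g, observed, offsets))
```

A candidate that reproduces the observed tally exactly scores the total number of observed pairs, which is the maximum; whether a loop this simple converges on a real sequence in reasonable time is not something this sketch demonstrates.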

Page 9

Read Pair Overview

Page 10

Consensus Alignment

Page 11

Hybrid Assemblies

Page 12

Forge: 454/Sanger Hybrid Assembly

Page 13

Up Close

Page 14

Accurate Consensus Generation is more Challenging

Page 15

The Coming Storm

Page 16

Growth Rates

Page 17

These are just the foothills

Thought exercises
• How to deal with 1-10 microbes/day?
• How best to use 3 Gb/day?
• Will human reseq technologies enable de novo large genome sequencing?
• Remember that organisms aren’t getting larger

Page 18

No problem, Computers are getting faster too..

http://Tomshardware.com
http://intel.com

Page 19

What really holds us back?

Limiting Reagents
• CPU time
• Disk space
• Network Bandwidth
• Human Bandwidth
• Software quality

Page 20

Improving Metabolic Flux

Page 21

JGI as an Organism

[Figure: JGI pipeline diagram. Labels: Prokaryote, Eukaryote, DNA, Library, QC, 4x, 8x, Post Ass QC, Draft Annotation, Finishing, Final Annotation, Annotation Jamboree, IMG, Portals]

Page 22

It’s 2am, where is your Genome..

Page 23

Scaling up Global Project Tracking

How would a 30x increase in production capacity affect tracking?

• PGF has sequenced over 300 species
• More than 100 “active” in freezer
• Wave of new projects propagating through pipeline
• Majority of sequencing is in projects still underway
• Considering use of blog-like features to improve interaction

Page 24

Assembly and Quality Control

[Figure: JGI pipeline diagram, repeated from the “JGI as an Organism” slide]

Page 25

Bimodal GC Content distributions

Page 26

Use test Fosmids to QC WGS data

Page 27

Kitchen sink Blast

Page 28

On a bad day..

Page 29

Annotation

[Figure: JGI pipeline diagram, repeated from the “JGI as an Organism” slide]

Page 30

Scaling Annotation

• “Last year we annotated ~5 genomes, this year plan to do 20,  CSP has twice more requests, does it mean 40 next year? At some point we may need to talk in 100s”

• How to prioritize them and share time for support of each of them?

• Measure CPU consumption in 1000 CPU day units

• Need to fundamentally rethink methods/assumptions
— Algorithms (e.g. gene finders) not improving much
— Need more experimental data, e.g. tiling arrays
— Software quality holds us back

Page 31

Annotation Pipelines

• “So nineties” but still not a well solved problem
• Issues:
— “Non sucking software”
— “Skillset for building distributed scalable systems is rare in CS types, perhaps non-existent in biologists”
— “Moore’s law will succumb to N squared”
  • In 3 years, computers will be 4 times faster, we will have 10 times more genomes and 100 times more comparisons to do if we insist on comparing all against all.
— QA/QC/Reproducibility
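The "Moore's law will succumb to N squared" point is plain arithmetic: all-against-all comparison work grows with the square of the number of genomes, while hardware speed grows more slowly. A small sketch using the slide's own figures follows; the function name is a hypothetical helper, and the model ignores everything except comparison count and raw CPU speed.

```python
def relative_runtime(genome_growth: float, cpu_speedup: float) -> float:
    """Factor by which all-against-all run time changes.

    Comparisons scale with N squared, so the workload grows by
    genome_growth ** 2; dividing by the CPU speedup gives the net change.
    """
    return genome_growth ** 2 / cpu_speedup


# Slide figures: 10 times more genomes, computers 4 times faster.
print(relative_runtime(10, 4))  # 25.0
```

On these numbers the same all-against-all analysis would take roughly 25 times longer in three years, despite the faster machines.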

Page 32

Environmental Interaction

[Figure: JGI pipeline diagram, repeated from the “JGI as an Organism” slide]

Page 33

Data delivery Models

• Continuing Interaction with environment

• Good Luck! Data Delivery Model

Page 34

JGI Genome Portals

• Key tools for presenting large genomes
• Support Jamboree activities
• Attract a lot of web traffic

Page 35

VISTA: Comparative Genomics Tool

Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 2004 Jul 1;32 (Web Server issue):W273-9

Page 36

• IMG allows 3-click comparison of proteomes
• Can rapidly discover functional differences
• BUT…
• 90% of “differences” are annotation quality issues

Page 37

http://regtransbase.lbl.gov

Page 38

RegTransBase statistics, March 2006

Experiment types related to:

Gene/operon activation               2354
Gene/operon repression               1128
Operon structure characterization     666
Promoter mapping                     1410
Regulatory site mapping              1670
Terminator mapping                     46
Regulatory site prediction            733
Plasmid replication                    16

Taxonomy                          Genes   Sites
Alphaproteobacteria                3208    1678
Betaproteobacteria                  103      17
Gammaproteobacteria                4542    2668
  E. coli                          1516     997
Delta/epsilon proteobacteria          1       1
Firmicutes                         3195    1459
  B. subtilis                       666     320
Cyanobacteria                       135     196
Actinobacteria                        3       3
Bacteroidetes/Chlorobi group          1       2
Archaea                               3       4
Multi- or unknown host plasmids,
  transposons and phages           1331     439
TOTAL                             12817    6470

Page 39

The Future?

Page 40

Future of Bioinformatic Data Analysis?

Page 41

Finishing will be reduced to solving the hardest problems

The JGI will sequence 20 times more genomes in 2011 than now.

In a few years we will look back and see that today we are doing low throughput sequencing.

GenBank will be taken over by Google

I think all of the old problems will stay with us :)

Every genome will have several ref sequences (e.g. Male, Female)

Where will users get their CPU time? Who will do the detailed number crunching? As a corollary to all of this, quality & usability of software will need to dramatically improve.

Nanotech will affect computers profoundly .. Hopefully this will ease our data storage problems just as the flood becomes unmanageable

The bottlenecks will be: Integration with other resources. Standardization of data exchange. Get the expert knowledge to database. Integration of expert’s knowledge.

Bandwidth may finally become the bottleneck

The flood of data will force people to think more about data management.

The field of bioinformatics has progressed to the point where the crazy quilt of formats, modules, scripts, etc. is now interfering with people's ability to make additional research progress.

Web-based tools will be much more valuable because of the richness of the data set,

I don't have a sense of whether short reads will really be the future. .. if systematic sequencing errors end up being a problem for all of them, and substantial pairing isn't feasible, we might never be able to do anything other than a prokaryote with them.

Page 42

Writing

http://www.agen.ufl.edu/~chyn/age2062/lect/lect_09/FG10_008.GIF

[Figure labels: Alter, Observe, Understand, Recapitulate]

• Synthetic technologies will improve ergonomics of gene function validation

• E.g.
— Active site confirmation
— Heterologous expression
— Tagged proteins
— Cutting and pasting regulatory elements
— “Simplifying” systems

Page 43

Thanks

Joint Genome Institute

Page 44

Annotation is Time Consuming

Preps (pre-assembly, time = 2+ weeks)
• Identify scope, contributors, resources
• Identify and collect available data (ESTs, FL)
• Develop strategy for annotation

Annotation (once assembly is available, time = 5-8 weeks)
• Identify repeats
• Train gene prediction (1 week)
• Customize, configure & test-run Pipeline (1 week)
• Run Pipeline & other tools (2-4 weeks)
• QC gene models and annotations (1-2 weeks)

Support (post-release, time = ?)
• Analysis, custom data, user support, jamboree, publications

Page 45

Example: Filtered Scaffold Depth Estimate

Page 46

Curator interface