US DOE Joint Genome Institute

Surviving the Deluge
Darren Platt
Joint Genome Institute

http://206.202.10.48/city/fire/flowtest.jpg


Surviving the Deluge

• Shrinking Read Lengths
• Hybrid Assemblies
• The Coming Storm
• Improving metabolic flux
• The Future


Read Lengths are Getting Shorter

• The debate about the impact of read length on genome assembly has never been resolved

• Are more 650 bp reads better than fewer 750 bp reads?

• What about 100?
• How do you feel about 35?
• Why wait, join the revolution…


Assembling with Four Base Pair Reads..

>GATCGATC


Not as dire as you might think

• Dramatically simplifies GenBank trace archive submissions
• Only 256 distinct sequences, 65K 3bp overlaps
• Store reads as a single byte

GGGG GGGA GGGT GGGC GGAG GGAA GGAT GGAC GGTG GGTA GGTT GGTC GGCG GGCA GGCT GGCC
GAGG GAGA GAGT GAGC GAAG GAAA GAAT GAAC GATG GATA GATT GATC GACG GACA GACT GACC
GTGG GTGA GTGT GTGC GTAG GTAA GTAT GTAC GTTG GTTA GTTT GTTC GTCG GTCA GTCT GTCC
GCGG GCGA GCGT GCGC GCAG GCAA GCAT GCAC GCTG GCTA GCTT GCTC GCCG GCCA GCCT GCCC
AGGG AGGA AGGT AGGC AGAG AGAA AGAT AGAC AGTG AGTA AGTT AGTC AGCG AGCA AGCT AGCC
AAGG AAGA AAGT AAGC AAAG AAAA AAAT AAAC AATG AATA AATT AATC AACG AACA AACT AACC
ATGG ATGA ATGT ATGC ATAG ATAA ATAT ATAC ATTG ATTA ATTT ATTC ATCG ATCA ATCT ATCC
ACGG ACGA ACGT ACGC ACAG ACAA ACAT ACAC ACTG ACTA ACTT ACTC ACCG ACCA ACCT ACCC
TGGG TGGA TGGT TGGC TGAG TGAA TGAT TGAC TGTG TGTA TGTT TGTC TGCG TGCA TGCT TGCC
TAGG TAGA TAGT TAGC TAAG TAAA TAAT TAAC TATG TATA TATT TATC TACG TACA TACT TACC
TTGG TTGA TTGT TTGC TTAG TTAA TTAT TTAC TTTG TTTA TTTT TTTC TTCG TTCA TTCT TTCC
TCGG TCGA TCGT TCGC TCAG TCAA TCAT TCAC TCTG TCTA TCTT TCTC TCCG TCCA TCCT TCCC
CGGG CGGA CGGT CGGC CGAG CGAA CGAT CGAC CGTG CGTA CGTT CGTC CGCG CGCA CGCT CGCC
CAGG CAGA CAGT CAGC CAAG CAAA CAAT CAAC CATG CATA CATT CATC CACG CACA CACT CACC
CTGG CTGA CTGT CTGC CTAG CTAA CTAT CTAC CTTG CTTA CTTT CTTC CTCG CTCA CTCT CTCC
CCGG CCGA CCGT CCGC CCAG CCAA CCAT CCAC CCTG CCTA CCTT CCTC CCCG CCCA CCCT CCCC

• Challenging to assemble
• Vector trimming out of the question
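The "single byte" point on this slide can be illustrated with a small sketch. The 2-bit code below is an assumption made for illustration (the slide does not specify an encoding); its G, A, T, C ordering simply mirrors the table above, and the names `pack` and `unpack` are hypothetical.

```python
# Hypothetical 2-bit encoding; the G, A, T, C order mirrors the slide's table.
BASES = "GATC"
CODE = {base: index for index, base in enumerate(BASES)}
READ_LEN = 4

NUM_READS = len(BASES) ** READ_LEN   # 4^4 = 256 distinct reads
NUM_READ_PAIRS = NUM_READS ** 2      # 256^2 = 65536 ordered pairs ("65K")


def pack(read: str) -> int:
    """Pack a 4 bp read into one byte, two bits per base."""
    if len(read) != READ_LEN:
        raise ValueError("expected a 4 bp read")
    value = 0
    for base in read:
        value = (value << 2) | CODE[base]
    return value


def unpack(value: int) -> str:
    """Recover the 4 bp read from its one-byte encoding."""
    return "".join(BASES[(value >> shift) & 3] for shift in (6, 4, 2, 0))


print(pack("GATC"), unpack(pack("GATC")))
```

With this ordering, a read's byte value is also its position in the table above (GGGG is 0, CCCC is 255).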


Testing on a Real Genome

gi|11496567|ref|NC_001830.1| Pear blister canker viroid PBCVd, complete genome
CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGG
GCTTCTCGGCTCGTCGTCGACGAAGGGTCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAA
TCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCTGTCCCGCTAGTCGAGC
GGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGT
TTACCGCGGACCCCCGAGAGGAGGCCCTCGGGTCC


ABCD Assembler

• ABCD assembler: only 500 lines of C++
• Libraries with insert sizes of 4, 5, 6, 8, 10, 20, 40, 100 and 200 bp
• Generated 8 Mb of sequence (22K x coverage)
• Results in 2410 unique data points

100 AAAA CGCT
100 AAAA GCTC
100 AAAA GGAG
100 AAAA GGCT
100 AAAA TGGA
100 AAAC GCTT
100 AAAG CTCC
100 AAAG GAGA
100 AACC CTTC
100 AAGA CTTC
….


Performance of the ABCD Assembler

• Genetic Algorithm evolves candidate genomes and compares to observed data frequencies

• Reward and breed genomes that produce similar data

• Penalize genomes that generate unobserved data

• After 3-4 days on a high end CPU..

REF CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGGGCTTCTCGGCTCGTCGTCGACGAAGGG
SEQ CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGGGCTTCTCGGCTCGTCGTCGACGAAGGG

REF TCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAATCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCT
SEQ TCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAATCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCT

REF GTCCCGCTAGTCGAGCGGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGTTTACCGCGGAC
SEQ GTCCCGCTAGTCGAGCGGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGTTTACCGCGGAC

REF CCCCGAGAGGAGGCCCTCGGGTCC
SEQ CCCCGAGAGGAGGCCC
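The reward/penalize scoring described on this slide might look roughly like the sketch below. It is not the ABCD assembler's code (the slides only say that it is about 500 lines of C++), and it makes several assumptions: an insert of size s is taken to span s bases with a 4 bp read from each end, the genome is treated as linear, and data points are scored by presence or absence rather than by frequency. The names `simulate_pairs` and `fitness` are hypothetical, and the breeding and mutation steps of the genetic algorithm are omitted.

```python
READ_LEN = 4


def simulate_pairs(genome: str, insert_sizes: list[int]) -> set[tuple[int, str, str]]:
    """Every (insert size, left read, right read) data point a genome could yield.

    Assumption: an insert of size s covers s bases, read 4 bp from each end.
    """
    pairs = set()
    for size in insert_sizes:
        for start in range(len(genome) - size + 1):
            insert = genome[start:start + size]
            pairs.add((size, insert[:READ_LEN], insert[-READ_LEN:]))
    return pairs


def fitness(candidate: str, observed: set, insert_sizes: list[int]) -> int:
    """Reward predicted data points that were observed, penalize those that were not."""
    predicted = simulate_pairs(candidate, insert_sizes)
    reward = len(predicted & observed)
    penalty = len(predicted - observed)
    return reward - penalty


# Toy example: a made-up 11 bp "genome" and one single-base mutant.
sizes = [4, 6]
truth = "GATCGATTACA"
observed = simulate_pairs(truth, sizes)
print(fitness(truth, observed, sizes), fitness("GATCGATTAGA", observed, sizes))
```

In this toy setup the true sequence scores highest because it predicts every observed data point and nothing else; a genetic algorithm would keep and recombine the higher-scoring candidates.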


Read Pair Overview


Consensus Alignment


Hybrid Assemblies


Forge: 454/Sanger Hybrid Assembly


Up Close


Accurate Consensus Generation is more Challenging


The Coming Storm


Growth Rates

http://206.202.10.48/city/fire/flowtest.jpg


These are just the foothills

Thought exercises
• How to deal with 1-10 microbes/day?
• How best to use 3 Gb/day?
• Will human reseq technologies enable de novo large genome sequencing?
• Remember that organisms aren’t getting larger


No problem, Computers are getting faster too..

http://Tomshardware.com
http://intel.com


What really holds us back?

Limiting Reagents
• CPU time
• Disk space
• Network Bandwidth
• Human Bandwidth
• Software quality


Improving Metabolic Flux


JGI as an Organism

[Figure: JGI pipeline diagram with Prokaryote and Eukaryote tracks. Stage labels: DNA, Library, QC, 4x, 8x, Post AssQC, Draft Annotation, Finishing, Annotation Jamboree, Final Annotation, Portals, IMG]


It’s 2am, where is your Genome..


Scaling up Global Project Tracking

How would a 30x increase in production capacity affect tracking?

• PGF has sequenced over 300 species

• More than 100 “active” in freezer

• Wave of new projects propagating through pipeline

• Majority of sequencing is in projects still underway

• Considering use of Blog like features to improve interaction


Assembly and Quality Control

[Figure: JGI pipeline diagram with Prokaryote and Eukaryote tracks. Stage labels: DNA, Library, QC, 4x, 8x, Post AssQC, Draft Annotation, Finishing, Annotation Jamboree, Portals]


Bimodal GC Content distributions


Use test Fosmids to QC WGS data


Kitchen sink Blast


On a bad day..


Annotation

[Figure: JGI pipeline diagram with Prokaryote and Eukaryote tracks. Stage labels: DNA, Library, QC, 4x, 8x, Post AssQC, Draft Annotation, Finishing, Annotation Jamboree, Portals]


Scaling Annotation

• “Last year we annotated ~5 genomes, this year plan to do 20, CSP has twice more requests, does it mean 40 next year? At some point we may need to talk in 100s”

• How to prioritize them and share time for support of each of them?

• Measure CPU consumption in 1000 CPU day units

• Need to fundamentally rethink methods/assumptions
— Algorithms (e.g. gene finders) not improving much
— Need more experimental data, e.g. tiling arrays
— Software quality holds us back


Annotation Pipelines

• “So nineties” but still not a well solved problem
• Issues:
— “Non sucking software”
— “Skillset for building distributed scalable systems is rare in CS types, perhaps non-existent in biologists”
— “Moore’s law will succumb to N squared”
• In 3 years, computers will be 4 times faster, we will have 10 times more genomes and 100 times more comparisons to do if we insist on comparing all against all.
— QA/QC/Reproducibility
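The "N squared" arithmetic on this slide can be checked with a short calculation. This is a back-of-the-envelope sketch using the slide's own figures (10 times more genomes, computers 4 times faster); the function name `relative_compute_time` is hypothetical, and the sketch assumes the cost of each pairwise comparison stays constant.

```python
def relative_compute_time(genome_growth: float, cpu_speedup: float) -> float:
    """Time for an all-against-all comparison, relative to today."""
    comparisons_growth = genome_growth ** 2  # pairwise comparisons scale as N squared
    return comparisons_growth / cpu_speedup


# Slide's figures: 10x more genomes, 4x faster computers.
print(relative_compute_time(genome_growth=10, cpu_speedup=4))  # 25.0
```

Under these assumptions the all-against-all workload takes about 25 times longer in three years despite the faster hardware, which is the slide's argument for rethinking the comparison strategy.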


Environmental Interaction

[Figure: JGI pipeline diagram with Prokaryote and Eukaryote tracks. Stage labels: DNA, Library, QC, 4x, 8x, Post AssQC, Draft Annotation, Finishing, Annotation Jamboree, Portals]


Data delivery Models

• Continuing Interaction with environment

• Good Luck! Data Delivery Model


JGI Genome Portals

• Key tools for presenting large genomes

• Support Jamboree activities

• Attract a lot of web traffic


VISTA: Comparative Genomics Tool

Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 2004 Jul 1;32 (Web Server issue):W273-9


• IMG allows 3-click comparison of proteomes

• Can rapidly discover functional differences

• BUT…
• 90% of “differences” are annotation quality issues


http://regtransbase.lbl.gov


RegTransBase statistics, March 2006

Experiment types related to:
Gene/operon activation               2354
Gene/operon repression               1128
Operon structure characterization     666
Promoter mapping                     1410
Regulatory site mapping              1670
Terminator mapping                     46
Regulatory site prediction            733
Plasmid replication                    16

Taxonomy                                                   Genes   Sites
Alphaproteobacteria                                         3208    1678
Betaproteobacteria                                           103      17
Gammaproteobacteria                                         4542    2668
  E. coli                                                   1516     997
Delta/epsilon proteobacteria                                   1       1
Firmicutes                                                  3195    1459
  B. subtilis                                                666     320
Cyanobacteria                                                135     196
Actinobacteria                                                 3       3
Bacteroidetes/Chlorobi group                                   1       2
Archaea                                                        3       4
Multi- or unknown host plasmids, transposons and phages     1331     439
TOTAL                                                      12817    6470


The Future?


Future of Bioinformatic Data Analysis?


Finishing will be reduced to solving the hardest problems

The JGI will sequence 20 times more genomes in 2011 than now.

In a few years we will look back and see that today we are doing low throughput sequencing.

GenBank will be taken over by Google

I think all of the old problems will stay with us :)

Every genome will have several ref sequences (e.g. Male, Female)

Where will users get their CPU time? Who will do the detailed number crunching? As a corollary to all of this, quality & usability of software will need to dramatically improve.

Nanotech will affect computers profoundly .. Hopefully this will ease our data storage problems just as the flood becomes unmanageable

The bottlenecks will be: Integration with other resources. Standardization of data exchange. Get the expert knowledge to database. Integration of expert’s knowledge.

Bandwidth may finally become the bottleneck

The flood of data will force people to think more about data management.

The field of bioinformatics has progressed to the point where the crazy quilt of formats, modules, scripts, etc. is now interfering with people's ability to make additional research progress.

Web-based tools will be much more valuable because of the richness of the data set,

I don't have a sense of whether short reads will really be the future. .. if systematic sequencing errors end up being a problem for all of them, and substantial pairing isn't feasible, we might never be able to do anything other than a prokaryote with them.


Writing

http://www.agen.ufl.edu/~chyn/age2062/lect/lect_09/FG10_008.GIF

Alter

Observe

Understand

Recapitulate

• Synthetic technologies will improve ergonomics of gene function validation

• E.g.
— Active site confirmation
— Heterologous expression
— Tagged proteins
— Cutting and pasting regulatory elements
— “Simplifying” systems


Thanks

Joint Genome Institute


Annotation is Time Consuming

Preps (pre-assembly, time=2+ weeks)
• Identify scope, contributors, resources
• Identify and collect available data (ESTs, FL)
• Develop strategy for annotation

Annotation (once assembly is available, time=5-8 weeks)
• Identify repeats
• Train gene prediction (1 week)
• Customize, configure & test-run Pipeline (1 week)
• Run Pipeline & other tools (2-4 weeks)
• QC gene models and annotations (1-2 weeks)

Support (post-release, time=?)
• Analysis, custom data, user support, jamboree, publications


Example: Filtered Scaffold Depth Estimate


Curator interface