US DOE Joint Genome Institute
Surviving the Deluge
Darren Platt
Joint Genome Institute
[email protected]
Surviving the Deluge
• Shrinking Read Lengths
• Hybrid Assemblies
• The Coming Storm
• Improving metabolic flux
• The Future
Read Lengths are Getting Shorter
• The debate about the impact of read length on genome assembly has never been resolved
• Are more 650 bp reads better than fewer 750 bp reads?
• What about 100?
• How do you feel about 35?
• Why wait, join the revolution…
Assembling with Four Base Pair Reads..
>GATCGATC
Not as dire as you might think
• Dramatically simplifies GenBank trace archive submissions
• Only 256 distinct sequences, 65K 3bp overlaps
• Store reads as a single byte

GGGG GGGA GGGT GGGC GGAG GGAA GGAT GGAC GGTG GGTA GGTT GGTC GGCG GGCA GGCT GGCC
GAGG GAGA GAGT GAGC GAAG GAAA GAAT GAAC GATG GATA GATT GATC GACG GACA GACT GACC
GTGG GTGA GTGT GTGC GTAG GTAA GTAT GTAC GTTG GTTA GTTT GTTC GTCG GTCA GTCT GTCC
GCGG GCGA GCGT GCGC GCAG GCAA GCAT GCAC GCTG GCTA GCTT GCTC GCCG GCCA GCCT GCCC
AGGG AGGA AGGT AGGC AGAG AGAA AGAT AGAC AGTG AGTA AGTT AGTC AGCG AGCA AGCT AGCC
AAGG AAGA AAGT AAGC AAAG AAAA AAAT AAAC AATG AATA AATT AATC AACG AACA AACT AACC
ATGG ATGA ATGT ATGC ATAG ATAA ATAT ATAC ATTG ATTA ATTT ATTC ATCG ATCA ATCT ATCC
ACGG ACGA ACGT ACGC ACAG ACAA ACAT ACAC ACTG ACTA ACTT ACTC ACCG ACCA ACCT ACCC
TGGG TGGA TGGT TGGC TGAG TGAA TGAT TGAC TGTG TGTA TGTT TGTC TGCG TGCA TGCT TGCC
TAGG TAGA TAGT TAGC TAAG TAAA TAAT TAAC TATG TATA TATT TATC TACG TACA TACT TACC
TTGG TTGA TTGT TTGC TTAG TTAA TTAT TTAC TTTG TTTA TTTT TTTC TTCG TTCA TTCT TTCC
TCGG TCGA TCGT TCGC TCAG TCAA TCAT TCAC TCTG TCTA TCTT TCTC TCCG TCCA TCCT TCCC
CGGG CGGA CGGT CGGC CGAG CGAA CGAT CGAC CGTG CGTA CGTT CGTC CGCG CGCA CGCT CGCC
CAGG CAGA CAGT CAGC CAAG CAAA CAAT CAAC CATG CATA CATT CATC CACG CACA CACT CACC
CTGG CTGA CTGT CTGC CTAG CTAA CTAT CTAC CTTG CTTA CTTT CTTC CTCG CTCA CTCT CTCC
CCGG CCGA CCGT CCGC CCAG CCAA CCAT CCAC CCTG CCTA CCTT CCTC CCCG CCCA CCCT CCCC

• Challenging to assemble
• Vector trimming out of the question
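The "single byte" point can be illustrated with a minimal sketch: 4 bases at 2 bits each fill exactly 8 bits. The 2-bit code and the function names `encode_read` and `decode_read` are assumptions for illustration, not anything from the talk. The base order G, A, T, C is chosen only so that byte values 0 to 255 reproduce the order of the 4-mer listing above.

```python
# Hypothetical 2-bit code per base; G, A, T, C order mirrors the 4-mer listing.
BASES = "GATC"
CODE = {base: i for i, base in enumerate(BASES)}

def encode_read(read: str) -> int:
    """Pack a 4 bp read into one byte (2 bits per base, first base in the high bits)."""
    if len(read) != 4:
        raise ValueError("expected a 4 bp read")
    value = 0
    for base in read:
        value = (value << 2) | CODE[base]
    return value

def decode_read(value: int) -> str:
    """Unpack a byte (0-255) back into its 4 bp read."""
    return "".join(BASES[(value >> shift) & 3] for shift in (6, 4, 2, 0))

# Enumerating every byte value yields all 256 distinct reads.
all_reads = [decode_read(v) for v in range(256)]
```

With this packing, a set of reads is just a byte string, and the full read space is the range 0 to 255.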
Testing on a Real Genome
gi|11496567|ref|NC_001830.1| Pear blister canker viroid PBCVd, complete genome
CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGG
GCTTCTCGGCTCGTCGTCGACGAAGGGTCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAA
TCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCTGTCCCGCTAGTCGAGC
GGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGT
TTACCGCGGACCCCCGAGAGGAGGCCCTCGGGTCC
ABCD Assembler
• ABCD assembler: Only 500 lines of C++
• Libraries with insert sizes of
  — 4, 5, 6, 8, 10, 20, 40, 100 and 200 bp
• Generated 8Mb of sequence (22K x coverage)
• Results in 2410 unique data points

100 AAAA CGCT
100 AAAA GCTC
100 AAAA GGAG
100 AAAA GGCT
100 AAAA TGGA
100 AAAC GCTT
100 AAAG CTCC
100 AAAG GAGA
100 AACC CTTC
100 AAGA CTTC
….
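The data points above look like triples of insert size, left read and right read. A rough sketch of how such triples could be tabulated from a known genome follows. Everything here is an assumption for illustration: the reading of the three columns, the definition of insert size as the offset between the two read starts, both reads taken from the same strand, and the circular wrap-around. The function name `paired_observations` is hypothetical and is not the ABCD code.

```python
from collections import Counter

READ_LEN = 4  # four base pair reads, as on the slide

def paired_observations(genome: str, insert_sizes, circular: bool = True) -> Counter:
    """Count (insert size, left read, right read) triples over every start position.

    Assumption: the right read starts `insert` bases after the left read starts,
    and both reads come from the same strand.
    """
    n = len(genome)
    counts = Counter()
    for insert in insert_sizes:
        for start in range(n):
            right_start = start + insert
            if not circular and right_start + READ_LEN > n:
                continue  # pair would run off the end of a linear sequence
            left = "".join(genome[(start + k) % n] for k in range(READ_LEN))
            right = "".join(genome[(right_start + k) % n] for k in range(READ_LEN))
            counts[(insert, left, right)] += 1
    return counts

toy = "GATTACAGATTACC"  # toy sequence, not the viroid
obs = paired_observations(toy, insert_sizes=[4, 5])
for (insert, left, right), count in sorted(obs.items())[:5]:
    print(insert, left, right, count)
```

The number of distinct keys in the counter plays the role of the "unique data points" figure, though this sketch is not expected to reproduce the slide's 2410.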
Performance of the ABCD Assembler
• Genetic Algorithm evolves candidate genomes and compares to observed data frequencies
• Reward and breed genomes that produce similar data
• Penalize genomes that generate unobserved data
• After 3-4 days on a high end CPU..

REF CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGGGCTTCTCGGCTCGTCGTCGACGAAGGG
SEQ CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGGGCTTCTCGGCTCGTCGTCGACGAAGGG

REF TCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAATCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCT
SEQ TCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAATCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCT

REF GTCCCGCTAGTCGAGCGGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGTTTACCGCGGAC
SEQ GTCCCGCTAGTCGAGCGGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGTTTACCGCGGAC

REF CCCCGAGAGGAGGCCCTCGGGTCC
SEQ CCCCGAGAGGAGGCCC
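The reward/penalize idea in the bullets above can be sketched as a toy genetic algorithm. This is not the ABCD assembler: the scoring rule, the `penalty` weight, the one-point crossover and all function names are assumptions made up for illustration. The observation model (same-strand read pairs on a circular genome, insert size as the offset between read starts) is likewise assumed.

```python
import random
from collections import Counter

def observations(genome, insert_sizes, read_len=4):
    """Tabulate (insert, left read, right read) triples on a circular genome."""
    n = len(genome)

    def read(start):
        return "".join(genome[(start + k) % n] for k in range(read_len))

    return Counter((d, read(s), read(s + d)) for d in insert_sizes for s in range(n))

def fitness(candidate, observed, insert_sizes, penalty=1.0):
    """Reward predicted data that was observed; penalize predicted data that was not."""
    predicted = observations(candidate, insert_sizes)
    reward = sum(min(c, observed[k]) for k, c in predicted.items() if k in observed)
    unobserved = sum(c for k, c in predicted.items() if k not in observed)
    return reward - penalty * unobserved

def breed(a, b, rng):
    """One-point crossover of two parents followed by a single point mutation."""
    cut = rng.randrange(1, len(a))
    child = a[:cut] + b[cut:]
    pos = rng.randrange(len(child))
    return child[:pos] + rng.choice("GATC") + child[pos + 1:]

def evolve(observed, length, insert_sizes, generations=200, pop_size=30, seed=0):
    """Keep the fitter half each generation and refill the population by breeding."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("GATC") for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda g: fitness(g, observed, insert_sizes), reverse=True)
        survivors = population[: pop_size // 2]
        children = [breed(rng.choice(survivors), rng.choice(survivors), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=lambda g: fitness(g, observed, insert_sizes))

target = "GATTACAGATTACC"  # toy genome, not the viroid
sizes = [4, 5]
observed = observations(target, sizes)
best = evolve(observed, len(target), sizes, generations=50, pop_size=20)
print(best, fitness(best, observed, sizes))
```

Under the circular model assumed here, every rotation of a genome produces identical observations, so a search of this kind could at best recover the sequence up to rotation.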
Read Pair Overview
Consensus Alignment
Hybrid Assemblies
Forge: 454/Sanger Hybrid Assembly
Up Close
Accurate Consensus Generation is more Challenging
The Coming Storm
These are just the foothills
Thought exercises
• How to deal with 1-10 microbes/day?
• How best to use 3 Gb/day?
• Will human reseq technologies enable de novo large genome sequencing?
• Remember that organisms aren’t getting larger
No problem, Computers are getting faster too..
http://Tomshardware.com
http://intel.com
What really holds us back?
Limiting Reagents
• CPU time
• Disk space
• Network Bandwidth
• Human Bandwidth
• Software quality
Improving Metabolic Flux
JGI as an Organism
Figure: JGI pipeline diagram with Prokaryote and Eukaryote tracks. Stage labels: DNA, Library, QC, 4x, 8x, Post Ass QC, Draft Annotation, Finishing, 8x Final Annotation, Annotation Jamboree, Portals, IMG.
It’s 2am, where is your Genome..
Scaling up Global Project Tracking
How would a 30x increase in production capacity affect tracking?
• PGF has sequenced over 300 species
• More than 100 “active” in freezer
• Wave of new projects propagating through pipeline
• Majority of sequencing is in projects still underway
• Considering use of Blog like features to improve interaction
Assembly and Quality Control
Figure: JGI pipeline diagram (same Prokaryote/Eukaryote stage labels as on the "JGI as an Organism" slide).
Bimodal GC Content distributions
Use test Fosmids to QC WGS data
Kitchen sink Blast
On a bad day..
Annotation
Figure: JGI pipeline diagram (same Prokaryote/Eukaryote stage labels as on the "JGI as an Organism" slide).
Scaling Annotation
• “Last year we annotated ~5 genomes, this year plan to do 20, CSP has twice more requests, does it mean 40 next year? At some point we may need to talk in 100s”
• How to prioritize them and share time for support of each of them?
• Measure CPU consumption in 1000 CPU day units
• Need to fundamentally rethink methods/assumptions
  — Algorithms (e.g. gene finders) not improving much
  — Need more experimental data, e.g. tiling arrays
  — Software quality holds us back
Annotation Pipelines
• “So nineties” but still not a well solved problem
• Issues:
  — “Non sucking software”
  — “Skillset for building distributed scalable systems is rare in CS types, perhaps non-existent in biologists”
  — “Moore’s law will succumb to N squared”
    • In 3 years, computers will be 4 times faster, we will have 10 times more genomes and 100 times more comparisons to do if we insist on comparing all against all.
  — QA/QC/Reproducibility
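The "N squared" arithmetic from the Moore's law bullet can be worked through in a few lines. The 4x and 10x figures come from the slide. The function name `all_vs_all_slowdown` and the simplification that runtime is proportional to the number of pairwise comparisons are assumptions.

```python
cpu_speedup = 4      # computers 4 times faster in 3 years (from the slide)
genome_growth = 10   # 10 times more genomes in 3 years (from the slide)

def all_vs_all_slowdown(genome_growth: float, cpu_speedup: float) -> float:
    """Relative wall-clock time for all-against-all comparison after the growth.

    Assumes runtime is proportional to the number of pairwise comparisons,
    which grows with the square of the number of genomes.
    """
    comparisons = genome_growth ** 2
    return comparisons / cpu_speedup

slowdown = all_vs_all_slowdown(genome_growth, cpu_speedup)
print(slowdown)  # 25.0
```

Under these assumptions the same all-against-all analysis would take about 25 times longer in three years despite the faster hardware, whereas work that scales linearly with genome count would take only 10 / 4 = 2.5 times longer.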
Environmental Interaction
Figure: JGI pipeline diagram (same Prokaryote/Eukaryote stage labels as on the "JGI as an Organism" slide).
Data delivery Models
• Continuing Interaction with environment
• Good Luck! Data Delivery Model
JGI Genome Portals
• Key tools for presenting large genomes
• Support Jamboree activities
• Attract a lot of web traffic
VISTA: Comparative Genomics Tool
Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 2004 Jul 1;32 (Web Server issue):W273-9
• IMG allows 3-click comparison of proteomes
• Can rapidly discover functional differences
• BUT…
• 90% of “differences” are annotation quality issues
http://regtransbase.lbl.gov
RegTransBase statistics, March 2006
Experiment types related to:
  Gene/operon activation               2354
  Gene/operon repression               1128
  Operon structure characterization     666
  Promoter mapping                     1410
  Regulatory site mapping              1670
  Terminator mapping                     46
  Regulatory site prediction            733
  Plasmid replication                    16

Taxonomy                                                   Genes   Sites
Alphaproteobacteria                                         3208    1678
Betaproteobacteria                                           103      17
Gammaproteobacteria                                         4542    2668
  E. coli                                                   1516     997
Delta/epsilon proteobacteria                                   1       1
Firmicutes                                                  3195    1459
  B. subtilis                                                666     320
Cyanobacteria                                                135     196
Actinobacteria                                                 3       3
Bacteroidetes/Chlorobi group                                   1       2
Archaea                                                        3       4
Multi- or unknown host plasmids, transposons and phages     1331     439
TOTAL                                                      12817    6470
The Future?
Future of Bioinformatic Data Analysis?
Finishing will be reduced to solving the hardest problems
The JGI will sequence 20 times more genomes in 2011 than now.
In a few years we will look back and see that today we are doing low throughput sequencing.
GenBank will be taken over by Google
I think all of the old problems will stay with us :)
Every genome will have several ref sequences (e.g. Male, Female)
Where will users get their CPU time?
Who will do the detailed number crunching?
As a corollary to all of this, quality & usability of software will need to dramatically improve.
Nanotech will affect computers profoundly .. Hopefully this will ease our data storage problems just as the flood becomes unmanageable
The bottlenecks will be:
Integration with other resources.
Standardization of data exchange.
Get the expert knowledge to database.
Integration of expert’s knowledge.
Bandwidth may finally become the bottleneck
The flood of data will force people to think more about data management.
The field of bioinformatics has progressed to the point where the crazy quilt of formats, modules, scripts, etc. is now interfering with people's ability to make additional research progress.
Web-based tools will be much more valuable because of the richness of the data set.
I don't have a sense of whether short reads will really be the future. .. if systematic sequencing errors end up being a problem for all of them, and substantial pairing isn't feasible, we might never be able to do anything other than a prokaryote with them.
Writing
http://www.agen.ufl.edu/~chyn/age2062/lect/lect_09/FG10_008.GIF
Figure labels: Alter, Observe, Understand, Recapitulate
• Synthetic technologies will improve ergonomics of gene function validation
• E.g.
  — Active site confirmation
  — Heterologous expression
  — Tagged proteins
  — Cutting and pasting regulatory elements
  — “Simplifying” systems
Thanks
Joint Genome Institute
Annotation is Time Consuming
Preps (pre-assembly, time=2+ weeks)
• Identify scope, contributors, resources
• Identify and collect available data (ESTs, FL)
• Develop strategy for annotation
Annotation (once assembly is available, time=5-8 weeks)
• Identify repeats
• Train gene prediction (1 week)
• Customize, configure & test-run Pipeline (1 week)
• Run Pipeline & other tools (2-4 weeks)
• QC gene models and annotations (1-2 weeks)
Support (post-release, time=?)
• Analysis, custom data, user support, jamboree, publications
Example: Filtered Scaffold Depth Estimate
Curator interface