A New Distributed System for Large-Scale Sequence Analysisrobins/posters_and_presentations/... ·...

1
Time Today TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATT ATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATAC ATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATT TTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATA CTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAAT ATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGT AGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTG AACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCAT TAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAA GCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATA ATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAA CCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAA AATTTTGGGAAAAAATGAATAACGTTATTATTTCCAATAACAAAATAAAACCACATCATT CATATTTTTTAATAGAGGCAAAAGAAAAAGAAATAAACTTTTATGCTAACAATGAATACT TTTCTGTCAAATGTAATTTAAATAAAAATATTGATATTCTTGAACAAGGCTCCTTAATTG TTAAAGGAAAAATTTTTAACGATCTTATTAATGGCATAAAAGAAGAGATTATTACTATTC AAGAAAAAGATCAAACACTTTTGGTTAAAACAAAAAAAACAAGTATTAATTTAAACACAA TTAATGTGAATGAATTTCCAAGAATAAGGTTTAATGAAAAAAACGATTTAAGTGAATTTA ATCAATTCAAAATAAATTATTCACTTTTAGTAAAAGGCATTAAAAAAATTTTTCACTCAG TTTCAAATAATCGTGAAATATCTTCTAAATTTAATGGAGTAAATTTCAATGGATCCAATG GAAAAGAAATATTTTTAGAAGCTTCTGACACTTATAAACTATCTGTTTTTGAGATAAAGC AAGAAACAGAACCATTTGATTTCATTTTGGAGAGTAATTTACTTAGTTTCATTAATTCTT TTAATCCTGAAGAAGATAAATCTATTGTTTTTTATTACAGAAAAGATAATAAAGATAGCT TTAGTACAGAAATGTTGATTTCAATGGATAACTTTATGATTAGTTACACATCGGTTAATG AAAAATTTCCAGAGGTAAACTACTTTTTTGAATTTGAACCTGAAACTAAAATAGTTGTTC AAAAAAATGAATTAAAAGATGCACTTCAAAGAATTCAAACTTTGGCTCAAAATGAAAGAA CTTTTTTATGCGATATGCAAATTAACAGTTCTGAATTAAAAATAAGAGCTATTGTTAATA ATATCGGAAATTCTCTTGAGGAAATTTCTTGTCTTAAATTTGAAGGTTATAAACTTAATA TTTCTTTTAACCCAAGTTCTCTATTAGATCACATAGAGTCTTTTGAATCAAATGAAATAA ATTTTGATTTCCAAGGAAATAGTAAGTATTTTTTGATAACCTCTAAAAGTGAACCTGAAC TTAAGCAAATATTGGTTCCTTCAAGATAATGAATCTTTACGATCTTTTAGAACTACCAAC TACAGCATCAATAAAAGAAATAAAAATTGCTTATAAAAGATTAGCAAAGCGTTATCACCC TGATGTAAATAAATTAGGTTCGCAAACTTTTGTTGAAATTAATAATGCTTATTCAATATT AAGTGATCCTAACCAAAAGGAAAAATATGATTCAATGCTGAAAGTTAATGATTTTCAAAA TCGCATCAAAAATTTAGATATTAGTGTTAGATGACATGAAAATTTCATGGAAGAACTCGA ACTTCGTAAGACCTGAGAATTTGATTTTTTTTCATCTGATGAAGATTTCTTTTATTCTCC ATTTACAAAAAACAAATATGCTTCCTTTTTAGATAAAGATGTTTCTTTAGCTTTTTTTCA GCTTTACAGCAAGGGCAAAATAGATCATCAATTGGAAAAATCTTTATTGAAAAGAAGAGA TGTAAAAGAAGCTTGTCAACAGAATAAAAATTTTATTGAAGTTATAAAAGAGCAATATAA CTATTTTGGTTGAATTGAAGCTAAGCGTTATTTCAATATTAATGTTGAACTTGAGCTCAC ACAGAGAGAGATAAGAGATAGAGATGTTGTTAACCTACCTTTAAAAATTAAAGTTATTAA TAATGATTTTCCAAATCAACTCTGATATGAAATTTATAAAAACTATTCATTTCGCTTATC TTGAGATATAAAAAATGGTGAAATTGCTGAATTTTTCAATAAAGGTAATAGAGCTTTAGG MNNVIISNNKIKPHHSYFLIEAKEKEINFYANNEYFSVKCNLNKNIDILEQGSLIVKGKIFNDLINGIKE EIITIQEKDQTLLVKTKKTSINLNTINVNEFPRIRFNEKNDLSEFNQFKINYSLLVKGIKKIFHSVSNNR EISSKFNGVNFNGSNGKEIFLEASDTYKLSVFEIKQETEPFDFILESNLLSFINSFNPEEDKSIVFYYRK DNKDSFSTEMLISMDNFMISYTSVNEKFPEVNYFFEFEPETKIVVQKNELKDALQRIQTLAQNERTFLCD MQINSSELKIRAIVNNIGNSLEEISCLKFEGYKLNISFNPSSLLDHIESFESNEINFDFQGNSKYFLITS APVLLNMKQALEVYLDHQIDVLVRKTKFVLNKQQERYHILSGLLIAALNIDEVVAIIKKSANNQEAINTL NTKFKLDEIQAKAVLDMRLRSLSVLEVNKLQTEQKELKDSIEFCKKVLADQKLQLKIIKEELQKINDQFG DERRSEILYDISEEIDDESLIKVENVVITMSTNGYLKRIGVDAYNLQHRGGVGVKGLTTYVDDSISQLLV CSTHSDLLFFTDKGKVYRIRAHQIPYGFRTNKGIPAVNLIKIEKDERICSLLSVNNYDDGYFFFCTKNGI VKRTSLNEFINILSNGKRAISFDDNDTLYSVIKTHGNDEIFIGSTNGFVVRFHENQLRVLSRTARGVFGI SLNKGEFVNGLSTSSNGSLLLSVGQNGIGKLTSIDKYRLTKRNAKGVKTLRVTDRTGPVVTTTTVFGNED LLMISSAGKIVRTSLQELSEQGKNTSGVKLIRLKDNERLERVTIFKEELEDKEMQLEDVGSKQI MNLYDLLELPTTASIKEIKIAYKRLAKRYHPDVNKLGSQTFVEINNAYSILSDPNQKEKYDSMLKVNDFQ NRIKNLDISVRWHENFMEELELRKTWEFDFFSSDEDFFYSPFTKNKYASFLDKDVSLAFFQLYSKGKIDH QLEKSLLKRRDVKEACQQNKNFIEVIKEQYNYFGWIEAKRYFNINVELELTQREIRDRDVVNLPLKIKVI NNDFPNQLWYEIYKNYSFRLSWDIKNGEIAEFFNKGNRALGWKGDLIVRMKVVNKVNKRLRIFSSFFEND MEENNKANIYDSSSIKVLEGLEAVRKRPGMYIGSTGEEGLHHMIWEIVDNSIDEAMGGFASFVKLTLEDN FVTRVEDDGRGIPVDIHPKTNRSTVETVFTVLHAGGKFDNDSYKVSGGLHGVGASVVNALSSSFKVWVFR QNKKYFLSFSDGGKVIGDLVQEGNSEKEHGTIVEFVPDFSVMEKSDYKQTVIVSRLQQLAFLNKGIRIDF VDNRKQNPQSFSWKYDGGLVEYIHHLNNEKEPLFNEVIADEKTETVKAVNRDENYTVKVEVAFQYNKTYN QSIFSFCNNINTTEGGTHVEGFRNALVKIINRFAVENKFLKDSDEKINRDDVCEGLTAIISIKHPNPQYE GQTKKKLGNTEVRPLVNSVVSEIFERFMLENPQEANAIIRKTLLAQEARRRSQEARELTRRKSPFDSGSL MGKLADCTTRDPSISELYIVEGDSAGGTAKTGRDRYFQAILPLRGKILNVEKSNFEQIFNNAEISALVMA IGCGIKPDFELEKLRYSKIVIMTDADVDGAHIRTLLLTFFFRFMYPLVEQGNIFIAQPPLYKVSYSHKDL YMHTDVQLEQWKSQNPNVKFGLQRYKGLGEMDALQLWETTMDPKVRTLLKVTVEDASIADKAFSLLMG MAKQQDQVDKIRENLDNSTVKSISLANELERSFMEYAMSVIVARALPDARDGLKPVHRRVLYGAYIGGMH HDRPFKKSARIVGDVMSKFHPHGDMAIYDTMSRMAQDFSLRYLLIDGHGNFGSIDGDRPAAQRYTEARLS KLAAELLKDIDKDTVDFIANYDGEEKEPTVLPAAFPNLLANGSSGIAVGMSTSIPSHNLSELIAGLIMLI DNPQCTFQELLTVIKGPDFPTGANIIYTKGIESYFETGKGNVVIRSKVEIEQLQTRSALVVTEIPYMVNK TTLIEKIVELVKAEEISGIADIRDESSREGIRLVIEVKRDTVPEVLLNQLFKSTRLQVRFPVNMLALVKG Genome Proteome A New Distributed System for A New Distributed System for Large Large- Scale Sequence Analysis Scale Sequence Analysis (434) 982-2207 Douglas Blair and Gabriel Robins Douglas Blair and Gabriel Robins Computer Science Department of School of Engineering and Applied Science University of Virginia {dmb4x, robins}@cs.virginia.edu www.cs.virginia.edu Runaway Growth Advances in sequencing technology Exponentially increasing data volume GenBank: 8.6 billion nucleotides (Jun 2000) 9.5 billion nucleotides (Aug 2000) Data growing faster than computer speeds: - Data volume doubles every 12 months - Moore’s Law: 18-month doubling time Source: http://www.ncbi.nlm.nih.gov Proteins and Evolution YRMFEPKC YRMFEPKCLDAFANLRDFLARFEGLKKISA FRVAKFEIDKYANLNRWYENAKKVTPGWEE PGWEE YRVAFEPTLDAYANLRDFEGVKKITPE YRVAKFELDAYANLRWENVKKITPE FRVAKFELDKYANLRWENVKKIT ENVKKITPGWE YRVFEPDAYANLRDFLEGVKKITSE FRVAKFELDKYANLRWYENAKKITPGWE ENAKKITPGWE YRM YRMFEPKLDAFANLRDFLREGVKKITSA YRMFEPKLDA YRMFEPKLDAFANLRDFLREGVKKITSA YRMFEPKLDAFA YRMFEPKLDAFANLRDFLAREGLKKITSA FRVAKFE---IDKYANLNRW---YENAKKVTPGWEE .:. :: .: .::: . .:. ::.. YRM--FEPKCLDAFANLRDFLARFEGLKKISA Presented at Intelligent Systems for Molecular Biology, August 19-23, 2000, San Diego, CA Sequence Alignment FRVAKFEIDKYANLNRWYENAKKVTPGWEE Y R M F E P K C L D A F A N L R D F L A R F E G L K K I S A Algorithms and Statistics Sequence Comparison Dynamic Programming Algorithms: Needleman-Wunsch [Needleman & Wunsch, 1970] Smith-Waterman [Smith & Waterman, 1981] Smith-Waterman with gaps [Gotoh, 1982] FASTA [Pearson & Lipman, 1988] BLAST [Altschul et al., 1990] Statistical Significance: Distribution of Smith-Waterman scores [Karlin & Altschul 1990] Distribution of n SW scores [Karlin & Altschul 1993] Empirical distribution for gapped scores [Altschul & Gish 1996] Old Sequence Analysis Paradigm Compare to known sequences in database Record new experimentally derived sequence Determine statistical significance of comparison scores Deduce biological and evolutionary relationships ATGCCTATGATACTGGGATAC... ? New Sequence Analysis Paradigm: Genomics and Comparative Genomics S. cerevisiae E. coli D. melan. DATABASE OF KNOWN SEQUENCES GENOMIC DNA E. coli D. melan. S. cerevisiae M. gen. M. gen. Challenges • Computing power growing less quickly than data volume • Heuristic methods are faster but less sensitive Faster Better • Current parallel implementations scale poorly • Computation grows quadratically with data volume Data << Work -- “Square” tasks minimize data/computation Data expansion relatively inexpensive - compression becomes worthwhile Tasks self-contained, compact, independent - exquisitely parallel Parallelism constrained only by the available number of machines Future Directions Solution: Break the Data Bottleneck M Computation Data Transmitted Computer N k Computers (krM)+N Total Data (MrN)/k Work/CPU MrN Total Work M+(N/k) Data/CPU (M+N)rk Total Data (M+N)/k Data/CPU Bioinformatics Bioinformatics Old vs. New Data Work Data expansion takes as long as or longer than comparison operations Entire library required everywhere simultaneously (Poor NFS server…) Parallelism constrained to number of machines not starved for data Paves the way for Massively Parallel Computation Ability to take advantage of more processors encourages use of more sensitive, computationally demanding techniques “Central Dogma” of Molecular Biology Basic mechanisms in all living organisms [Crick, ~1956] DNA RNA Protein Transcription Replication Translation ATGCCTATGATACTGGGATAC... ATGCCTATGATACTGGGATAC... MPMILGY... MPMILGY... AUGCCUAUGAUACUGGGAUAC... AUGCCUAUGAUACUGGGAUAC... 32 Genomes and Proteomes 31 complete microbial genomes (87 in progress) Many new microbial genomes every year Many other higher organisms’ genomes being sequenced Organism Year Genome Size Proteome Size Completed (Base Pairs) (# of Proteins) Mycoplasma Genitalium 1995 ~588,000 480 Escherichia Coli 1997 ~4,600,000 4,289 Saccharomyces Cerevisiae 1997 ~11,000,000 ~6,600 Caenorhabditis Elegans 1998 ~86,000,000 ~14,300 Drosophila Melanogaster 2000 ~137,000,000 ~13,500 Homo Sapiens ~2000 ~3,100,000,000 ~30,000-60,000 Test Platform: Parabon Frontier It is wonderful how much may be done, if we are always doing.” -- Thomas Jefferson, May 5, 1787 “Determine never to be idle. No person will have occasion to complain of the want of time, who never loses any. Yesterday CTGAAGCCAGTTTGAGAA GACCACAGCACCAGCACC ATGCCTATGATACTGGGA TACTGGAACGTCCGCGGA CTGACACACCCGATCCGC ATGCTCCTGGAATACACA GACTCAAGCTATGATGAG AAGAGATACACCATGGGT GACGCTCCCGACTTTGAC AGAAGCCAGTGGCTGAAT GAGAAGTTCAAGCTGGGC CTGGACTTTCCCAATCTG CCTTACTTGATCGATGGA TCACACAAGATCACCCAG MPMILGYWNVRG LTHPIRMLLEYT DSSYDEKRYTMG DAPDFDRSQWLN EKFKLGLDFPNL PYLIDGSHKITQ SNAILRAHWSNK Gene Protein Data Avalanche Data Avalanche Paradigm Shift Paradigm Shift Implementation & Results Implementation & Results Client (UVa) Code Data Job Frontier Server (Housed at Exodus) Providers (Idle Internet Machines) Task Job Results Task Results Internet Internet Internet Internet Results Postprocessing Code & Data Elements Task Definitions •Further Smith-Waterman optimizations •Investigation of novel methods for estimating statistical significance •Other methods (BLAST, FASTA, HMMs, GeneWise, etc.) •Data compression •Implementation of DNA-protein and DNA-DNA comparisons •Large-scale structure-structure comparison •Large-scale sequence-structure threading/comparison •Human Genome vs. GenBank scale searches •Java 1.3 JVM for Provider Compute Engine (Faster than C!) •Other projects (e.g. Maximum Likelihood Tree Searches) 0 10 20 30 40 50 60 70 Data 2 4 9 16 25 36 49 64 # of CPUs 4-1 Compression 2-1 Compression New Method Old Method Data Transmitted y = 37.412x + 271.65 R 2 = 0.9968 0 3000 6000 9000 12000 15000 0 100 200 300 400 CPUs Smith-Waterman Seq. Comparisons/Sec Scalability Time 35

Transcript of A New Distributed System for Large-Scale Sequence Analysisrobins/posters_and_presentations/... ·...

Page 1: A New Distributed System for Large-Scale Sequence Analysisrobins/posters_and_presentations/... · 2007. 1. 27. · Test Platform: Parabon Frontier It is wonderful how much may be

Time

TodayTAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATTATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATACATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATTTTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATACTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAATATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGTAGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTGAACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCATTAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAAGCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATA

ATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAACCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAAAATTTTGGGAAAAAATGAATAACGTTATTATTTCCAATAACAAAATAAAACCACATCATTCATATTTTTTAATAGAGGCAAAAGAAAAAGAAATAAACTTTTATGCTAACAATGAATACTTTTCTGTCAAATGTAATTTAAATAAAAATATTGATATTCTTGAACAAGGCTCCTTAATTGTTAAAGGAAAAATTTTTAACGATCTTATTAATGGCATAAAAGAAGAGATTATTACTATTC

AAGAAAAAGATCAAACACTTTTGGTTAAAACAAAAAAAACAAGTATTAATTTAAACACAATTAATGTGAATGAATTTCCAAGAATAAGGTTTAATGAAAAAAACGATTTAAGTGAATTTAATCAATTCAAAATAAATTATTCACTTTTAGTAAAAGGCATTAAAAAAATTTTTCACTCAGTTTCAAATAATCGTGAAATATCTTCTAAATTTAATGGAGTAAATTTCAATGGATCCAATGGAAAAGAAATATTTTTAGAAGCTTCTGACACTTATAAACTATCTGTTTTTGAGATAAAGCAAGAAACAGAACCATTTGATTTCATTTTGGAGAGTAATTTACTTAGTTTCATTAATTCTTTTAATCCTGAAGAAGATAAATCTATTGTTTTTTATTACAGAAAAGATAATAAAGATAGCTTTAGTACAGAAATGTTGATTTCAATGGATAACTTTATGATTAGTTACACATCGGTTAATGAAAAATTTCCAGAGGTAAACTACTTTTTTGAATTTGAACCTGAAACTAAAATAGTTGTTC

AAAAAAATGAATTAAAAGATGCACTTCAAAGAATTCAAACTTTGGCTCAAAATGAAAGAACTTTTTTATGCGATATGCAAATTAACAGTTCTGAATTAAAAATAAGAGCTATTGTTAATAATATCGGAAATTCTCTTGAGGAAATTTCTTGTCTTAAATTTGAAGGTTATAAACTTAATATTTCTTTTAACCCAAGTTCTCTATTAGATCACATAGAGTCTTTTGAATCAAATGAAATAAATTTTGATTTCCAAGGAAATAGTAAGTATTTTTTGATAACCTCTAAAAGTGAACCTGAACTTAAGCAAATATTGGTTCCTTCAAGATAATGAATCTTTACGATCTTTTAGAACTACCAAC

TACAGCATCAATAAAAGAAATAAAAATTGCTTATAAAAGATTAGCAAAGCGTTATCACCCTGATGTAAATAAATTAGGTTCGCAAACTTTTGTTGAAATTAATAATGCTTATTCAATATT

AAGTGATCCTAACCAAAAGGAAAAATATGATTCAATGCTGAAAGTTAATGATTTTCAAAATCGCATCAAAAATTTAGATATTAGTGTTAGATGACATGAAAATTTCATGGAAGAACTCGA

ACTTCGTAAGACCTGAGAATTTGATTTTTTTTCATCTGATGAAGATTTCTTTTATTCTCCATTTACAAAAAACAAATATGCTTCCTTTTTAGATAAAGATGTTTCTTTAGCTTTTTTTCA

GCTTTACAGCAAGGGCAAAATAGATCATCAATTGGAAAAATCTTTATTGAAAAGAAGAGATGTAAAAGAAGCTTGTCAACAGAATAAAAATTTTATTGAAGTTATAAAAGAGCAATATAACTATTTTGGTTGAATTGAAGCTAAGCGTTATTTCAATATTAATGTTGAACTTGAGCTCAC

ACAGAGAGAGATAAGAGATAGAGATGTTGTTAACCTACCTTTAAAAATTAAAGTTATTAATAATGATTTTCCAAATCAACTCTGATATGAAATTTATAAAAACTATTCATTTCGCTTATC

TTGAGATATAAAAAATGGTGAAATTGCTGAATTTTTCAATAAAGGTAATAGAGCTTTAGG

MNNVIISNNKIKPHHSYFLIEAKEKEINFYANNEYFSVKCNLNKNIDILEQGSLIVKGKIFNDLINGIKE EIITIQEKDQTLLVKTKKTSINLNTINVNEFPRIRFNEKNDLSEFNQFKINYSLLVKGIKKIFHSVSNNR EISSKFNGVNFNGSNGKEIFLEASDTYKLSVFEIKQETEPFDFILESNLLSFINSFNPEEDKSIVFYYRK DNKDSFSTEMLISMDNFMISYTSVNEKFPEVNYFFEFEPETKIVVQKNELKDALQRIQTLAQNERTFLCD MQINSSELKIRAIVNNIGNSLEEISCLKFEGYKLNISFNPSSLLDHIESFESNEINFDFQGNSKYFLITS

APVLLNMKQALEVYLDHQIDVLVRKTKFVLNKQQERYHILSGLLIAALNIDEVVAIIKKSANNQEAINTL NTKFKLDEIQAKAVLDMRLRSLSVLEVNKLQTEQKELKDSIEFCKKVLADQKLQLKIIKEELQKINDQFG DERRSEILYDISEEIDDESLIKVENVVITMSTNGYLKRIGVDAYNLQHRGGVGVKGLTTYVDDSISQLLV CSTHSDLLFFTDKGKVYRIRAHQIPYGFRTNKGIPAVNLIKIEKDERICSLLSVNNYDDGYFFFCTKNGI VKRTSLNEFINILSNGKRAISFDDNDTLYSVIKTHGNDEIFIGSTNGFVVRFHENQLRVLSRTARGVFGI SLNKGEFVNGLSTSSNGSLLLSVGQNGIGKLTSIDKYRLTKRNAKGVKTLRVTDRTGPVVTTTTVFGNED LLMISSAGKIVRTSLQELSEQGKNTSGVKLIRLKDNERLERVTIFKEELEDKEMQLEDVGSKQI

MNLYDLLELPTTASIKEIKIAYKRLAKRYHPDVNKLGSQTFVEINNAYSILSDPNQKEKYDSMLKVNDFQ NRIKNLDISVRWHENFMEELELRKTWEFDFFSSDEDFFYSPFTKNKYASFLDKDVSLAFFQLYSKGKIDH QLEKSLLKRRDVKEACQQNKNFIEVIKEQYNYFGWIEAKRYFNINVELELTQREIRDRDVVNLPLKIKVI NNDFPNQLWYEIYKNYSFRLSWDIKNGEIAEFFNKGNRALGWKGDLIVRMKVVNKVNKRLRIFSSFFEND

MEENNKANIYDSSSIKVLEGLEAVRKRPGMYIGSTGEEGLHHMIWEIVDNSIDEAMGGFASFVKLTLEDN FVTRVEDDGRGIPVDIHPKTNRSTVETVFTVLHAGGKFDNDSYKVSGGLHGVGASVVNALSSSFKVWVFR QNKKYFLSFSDGGKVIGDLVQEGNSEKEHGTIVEFVPDFSVMEKSDYKQTVIVSRLQQLAFLNKGIRIDF VDNRKQNPQSFSWKYDGGLVEYIHHLNNEKEPLFNEVIADEKTETVKAVNRDENYTVKVEVAFQYNKTYN QSIFSFCNNINTTEGGTHVEGFRNALVKIINRFAVENKFLKDSDEKINRDDVCEGLTAIISIKHPNPQYE GQTKKKLGNTEVRPLVNSVVSEIFERFMLENPQEANAIIRKTLLAQEARRRSQEARELTRRKSPFDSGSL

MGKLADCTTRDPSISELYIVEGDSAGGTAKTGRDRYFQAILPLRGKILNVEKSNFEQIFNNAEISALVMA IGCGIKPDFELEKLRYSKIVIMTDADVDGAHIRTLLLTFFFRFMYPLVEQGNIFIAQPPLYKVSYSHKDL YMHTDVQLEQWKSQNPNVKFGLQRYKGLGEMDALQLWETTMDPKVRTLLKVTVEDASIADKAFSLLMG

MAKQQDQVDKIRENLDNSTVKSISLANELERSFMEYAMSVIVARALPDARDGLKPVHRRVLYGAYIGGMHHDRPFKKSARIVGDVMSKFHPHGDMAIYDTMSRMAQDFSLRYLLIDGHGNFGSIDGDRPAAQRYTEARLS KLAAELLKDIDKDTVDFIANYDGEEKEPTVLPAAFPNLLANGSSGIAVGMSTSIPSHNLSELIAGLIMLI DNPQCTFQELLTVIKGPDFPTGANIIYTKGIESYFETGKGNVVIRSKVEIEQLQTRSALVVTEIPYMVNK TTLIEKIVELVKAEEISGIADIRDESSREGIRLVIEVKRDTVPEVLLNQLFKSTRLQVRFPVNMLALVKG

Genome Proteome

A New Distributed System forA New Distributed System forLargeLarge--Scale Sequence AnalysisScale Sequence Analysis

(434) 982-2207Douglas Blair and Gabriel RobinsDouglas Blair and Gabriel Robins

Computer ScienceDepartment of

School of Engineeringand Applied Science

University of Virginia{dmb4x, robins}@cs.virginia.edu

www.cs.virginia.edu

Runaway Growth

Advances in sequencing technology Exponentially increasing data volume

GenBank: 8.6 billion nucleotides (Jun 2000)9.5 billion nucleotides (Aug 2000)

Data growing faster than computer speeds:- Data volume doubles every 12

months- Moore’s Law: 18-month doubling

time

Source: http://www.ncbi.nlm.nih.gov

Proteins and Evolution

YRMFEPKCYRMFEPKCLDAFANLRDFLARFEGLKKISAFRVAKFEIDKYANLNRWYENAKKVTPGWEEPGWEE

YRVAFEPTLDAYANLRDFEGVKKITPE

YRVAKFELDAYANLRWENVKKITPE

FRVAKFELDKYANLRWENVKKITENVKKITPGWE

YRVFEPDAYANLRDFLEGVKKITSE

FRVAKFELDKYANLRWYENAKKITPGWEENAKKITPGWE

YRMYRMFEPKLDAFANLRDFLREGVKKITSA

YRMFEPKLDAYRMFEPKLDAFANLRDFLREGVKKITSA

YRMFEPKLDAFAYRMFEPKLDAFANLRDFLAREGLKKITSA

FRVAKFE---IDKYANLNRW---YENAKKVTPGWEE.:. :: .: .::: . .:. ::.. YRM--FEPKCLDAFANLRDFLARFEGLKKISA

Presented at Intelligent Systems for Molecular Biology, August 19-23, 2000, San Diego, CA

Sequence AlignmentFRVAKFEIDKYANLNRWYENAKKVTPGWEE

YRMFEPKCLDAFANLRDFLARFEGLKKISA

Algorithms and Statistics

Sequence Comparison Dynamic Programming Algorithms:Needleman-Wunsch [Needleman & Wunsch, 1970]

Smith-Waterman [Smith & Waterman, 1981]

Smith-Waterman with gaps [Gotoh, 1982]

FASTA [Pearson & Lipman, 1988]

BLAST [Altschul et al., 1990]

Statistical Significance:Distribution of Smith-Waterman scores [Karlin & Altschul 1990]

Distribution of n SW scores [Karlin & Altschul 1993]

Empirical distribution for gapped scores [Altschul & Gish 1996]

Old Sequence Analysis Paradigm

Compare to known sequences in databaseRecord new experimentally derived sequence

Determine statistical significance of comparison scoresDeduce biological and evolutionary relationships

ATGCCTATGATACTGGGATAC...

?

New Sequence Analysis Paradigm:Genomics and Comparative Genomics

S. cerevisiae E. coli D. melan.

DATABASE OF KNOWN SEQUENCES

GE

NO

MIC

DN

A

E. c

oli

D. m

elan

.S.

cer

evis

iae

M. gen.

M. g

en.

Challenges

• Computing power growing less quickly than data volume

• Heuristic methods are faster but less sensitiveFaster

Better• Current parallel implementations scale poorly

• Computation grows quadratically with data volume

Data << Work -- “Square” tasks minimize data/computation

Data expansion relatively inexpensive - compression becomes worthwhile

Tasks self-contained, compact, independent - exquisitely parallel

Parallelism constrained only by the available number of machines

Future Directions

Solution: Break the Data Bottleneck M

Computation

Data Transmitted

Computer

N

k Computers

(krM)+N Total Data(MrN)/k Work/CPU

MrN Total WorkM+(N/k) Data/CPU

(M+N)r√k Total Data(M+N)/√k Data/CPU

BioinformaticsBioinformatics

Old vs. NewData ≅ Work

Data expansion takes as long as or longer than comparison operations

Entire library required everywhere simultaneously (Poor NFS server…)

Parallelism constrained to number of machines not starved for data

Paves the way for Massively Parallel ComputationAbility to take advantage of more processors encourages use of more sensitive, computationally demanding techniques

“Central Dogma” of Molecular Biology

Basic mechanisms in all living organisms [Crick, ~1956]

DNA RNA

Protein

Transcription

Replication

Translation

ATGCCTATGATACTGGGATAC...ATGCCTATGATACTGGGATAC...

MPMILGY...MPMILGY...

AUGCCUAUGAUACUGGGAUAC...AUGCCUAUGAUACUGGGAUAC...

32

Genomes and Proteomes

31 complete microbial genomes (87 in progress)

Many new microbial genomes every year

Many other higher organisms’ genomes being sequenced

Organism Year Genome Size Proteome SizeCompleted (Base Pairs) (# of Proteins)

Mycoplasma Genitalium 1995 ~588,000 480Escherichia Coli 1997 ~4,600,000 4,289Saccharomyces Cerevisiae 1997 ~11,000,000 ~6,600Caenorhabditis Elegans 1998 ~86,000,000 ~14,300Drosophila Melanogaster 2000 ~137,000,000 ~13,500Homo Sapiens ~2000 ~3,100,000,000 ~30,000-60,000

Test Platform: Parabon Frontier

It is wonderfulhow much maybe done, if we arealways doing.”

-- Thomas Jefferson, May 5, 1787

“Determine never to be idle.No person will have occasionto complain of the want oftime, who never loses any.

YesterdayCTGAAGCCAGTTTGAGAAGACCACAGCACCAGCACCATGCCTATGATACTGGGATACTGGAACGTCCGCGGACTGACACACCCGATCCGCATGCTCCTGGAATACACAGACTCAAGCTATGATGAGAAGAGATACACCATGGGTGACGCTCCCGACTTTGACAGAAGCCAGTGGCTGAATGAGAAGTTCAAGCTGGGCCTGGACTTTCCCAATCTGCCTTACTTGATCGATGGATCACACAAGATCACCCAG

MPMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKLGLDFPNLPYLIDGSHKITQSNAILRAHWSNK

Gene Protein

Data AvalancheData Avalanche Paradigm ShiftParadigm Shift

Implementation & ResultsImplementation & Results

Client(UVa)

Code DataJob

Frontier Server(Housed at Exodus)

Providers(Idle Internet Machines)

Task

Job ResultsTask Results

InternetInternet InternetInternet

ResultsPostprocessing Code & Data

ElementsTask

Definitions

•Further Smith-Waterman optimizations•Investigation of novel methods for estimating statistical significance•Other methods (BLAST, FASTA, HMMs, GeneWise, etc.)•Data compression•Implementation of DNA-protein and DNA-DNA comparisons•Large-scale structure-structure comparison•Large-scale sequence-structure threading/comparison•Human Genome vs. GenBank scale searches•Java 1.3 JVM for Provider Compute Engine (Faster than C!)•Other projects (e.g. Maximum Likelihood Tree Searches)

0

10

20

30

40

50

60

70

Data

2 4 9 16 25 36 49 64

# of CPUs

4-1 Compression2-1 CompressionNew MethodOld Method

Data Transmitted

y = 37.412x + 271.65R2 = 0.9968

0

3000

6000

9000

12000

15000

0 100 200 300 400

CPUs

Smith

-Wat

erm

anSe

q. C

ompa

riso

ns/S

ec

Scalability

Time

35