
WABA Success: A Tool for Sequence Comparison between Large Genomes

David L. Baillie1 and Ann M. Rose2,3

1Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia V5A 1S6 Canada; 2Department of Medical Genetics, University of British Columbia, Vancouver V6T 1Z3 Canada

Whole-genome sequence comparisons between bacterial sequences are one thing, but try comparing two eukaryotic genomes, each containing tens or hundreds of millions of nucleotides. And try to do it on your desktop machine in your office or at home. That is what Kent and Zahler (2000) have tried, and the results are presented in this issue of Genome Research. The use of evolutionary conservation to unveil functional information contained within genomes is not new. In the case of the nematode, comparisons of Caenorhabditis elegans to its close relative Caenorhabditis briggsae go back as far as Emmons et al. (1979). Snutch (1984) made the first C. briggsae genomic library available to the research community. These two nematodes are almost identical in morphology and development, yet their genomes have been separated for a sufficient amount of time to allow intronic and intragenic sequences to become effectively randomized, while protein-encoding sequences and cis-linked regulatory elements are conserved. C. briggsae has diverged from C. elegans in regions of unselected sequence: the middle of large introns, between genes, and at the third nucleotide of synonymous codons in genes not abundantly translated. Based on the now-outdated concept of a constant evolutionary clock, these two species were estimated to have diverged some 30–60 million years ago. These facts prompted The Washington University Genome Sequencing Center in St. Louis, Missouri, to initiate a C. briggsae genome sequencing effort. The project was given a jump-start with the construction of a physical map (Marra et al. 1997) and an open invitation to the research community to participate in the selection of fosmids for sequence analysis. Researchers from around the world probed microarrayed fosmid filters with their favorite gene and contacted The Genome Sequencing Center with a request to sequence the identified fosmid. Currently, approximately 10% of the genome has been completed and is available at http://www.genome.wustl.edu/pub/gsc1/sequence/st.louis/briggsae/finish/. Comparison of C. elegans and C. briggsae sequence has facilitated the identification of cis-linked regulatory elements (Heine and Blumenthal 1986), the cloning of genes as a result of their syntenic relationship (Kuwabara and Shah 1994), and interpretation of complex gene structure (Thacker et al. 1999). In many cases examined, adjacent genes are conserved both in position and orientation, with the occasional interruption caused by a transposable element or an apparent pseudogene.

Prior to the availability of C. briggsae sequence, gene feature identification was done by ab initio prediction. Many computer programs exist that attempt to predict the exon-intron structure of genes from genomic sequence. These programs vary in their accuracy and are at their worst when asked to predict the first and last exons of genes, often failing to correctly identify exons and introns outside the actual coding element. Messenger RNA and EST-based feature detection is often confounded by large transcripts, genes that have low levels of transcription, or genes lacking poly(A)+-containing transcripts. Further complications adding to the problem of gene finding include the still-murky or unknown rules of sequence identification for numerous important genomic elements (snRNAs, ribozymes, transcription factor binding sites, replication origins, chromatin folding and packaging signals, meiotic pairing information, etc.). Because of these issues, the initial annotation of the C. elegans genome was done interactively using GENEFINDER, a program written by Phil Green and LaDeana Hillier (unpubl.). The program, which was ahead of its time, did a remarkable job of gene prediction. By combining its output with manual interpretation, the cDNA data (Kohara 1996), and related genomic sequence data from C. briggsae, gene structure predictions were made for >19,000 C. elegans genes.

3Corresponding author. E-MAIL [email protected]; FAX (604) 822-5348.

Insight/Outlook

Genome Research 10:1071–1073 ©2000 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/00 $5.00; www.genome.org

Kent and Zahler (2000) have taken advantage of the availability of genomic sequence from these two closely related nematodes to test the feasibility of doing large-scale alignments between genomic DNA of different species. They have developed an algorithm for sequence comparison in which every third base (the wobble position of the codon) is ignored. The wobble-aware bulk aligner (WABA) allows the sensitive identification of conserved coding regions. They use this algorithm in a three-tiered process to identify conserved regions between eight million base pairs of C. briggsae genomic sequence and the entire 97 million base pairs of C. elegans. Their analysis was performed on a readily available 450-MHz Intel-based machine. The results they have achieved are remarkable and will provide a useful resource for all in the continuing analysis of genomes. One of the examples given by the authors is the bli-4 (K04F10.4) gene. The structure of this gene was characterized previously by examination of the conservation between C. elegans and C. briggsae (Thacker et al. 1999; Fig. 1B). Traditionally, gene prediction programs have been forced to ignore the computationally complex problem of ab initio prediction of splice variants. Neither ACEDB nor the Genie program predicts the entire set of nine alternate transcripts described in Thacker et al. (1999). Figure 1A shows the WABA/intronator display for the gene. The areas of C. briggsae homology identified by the WABA program would greatly assist researchers by alerting them to the strong possibility that alternative transcripts may exist. The intronator display brings together gene predictions, C. briggsae homologies, and cDNA information in an easily used and intuitive fashion.
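The idea of ignoring the wobble position, mentioned above for WABA, can be illustrated with a toy calculation. The Python sketch below is not Kent and Zahler's implementation; it only assumes two sequences that are already aligned to equal length and a known reading frame. The function names (`plain_identity`, `wobble_masked_identity`), the `frame` parameter, and the two short sequences are hypothetical and were chosen for illustration.

```python
def plain_identity(a: str, b: str) -> float:
    """Fraction of positions at which two pre-aligned sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return matches / len(a)


def wobble_masked_identity(a: str, b: str, frame: int = 0) -> float:
    """Fraction of matching positions, skipping every third codon base.

    `frame` is the 0-based index of the first base of a codon (an
    assumption of this sketch; a real aligner would not know it).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = 0
    compared = 0
    for i, (x, y) in enumerate(zip(a.upper(), b.upper())):
        if (i - frame) % 3 == 2:  # third (wobble) position of a codon
            continue
        compared += 1
        matches += x == y
    return matches / compared if compared else 0.0


# Hypothetical coding fragments that differ only at third codon positions
# (both encode Ala-Glu-Lys-Leu under the standard genetic code).
seq_a = "GCTGAAAAACTG"
seq_b = "GCCGAGAAGCTC"

print(plain_identity(seq_a, seq_b))          # 8 of 12 positions match
print(wobble_masked_identity(seq_a, seq_b))  # all 8 compared positions match
```

In this constructed example the synonymous changes lower the plain identity, whereas the wobble-masked score treats the two fragments as fully conserved, which is the intuition behind the sensitivity to coding regions described above. Calling the function with the wrong `frame` removes that advantage, a reminder that this sketch sidesteps the harder problem of finding the frame and the alignment in the first place.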

One surprising observation made by Kent and Zahler (2000) was the shortness of the syntenic segments, which is partly a consequence of the length of the C. briggsae clones. The number of fragments that clones are broken into by the alignment exhibited a bimodal distribution, which corresponds to some extent to the position of the sequence on the C. elegans autosomes. In the gene-rich central clusters, long alignments were observed, predicted to constitute approximately 40% of the genome. In contrast, the flanking arms were more susceptible to rearrangement. It has been known for some time that meiotic recombination is much higher on the chromosome arms. This fact and the observation that sequence similarities to organisms other than nematodes tend to be lower on the arms led the C. elegans Sequencing Consortium (1998) to suggest that the DNA in the arms might be evolving more rapidly than that in the central regions of the autosomes.

Computational tools, like WABA,will be essential for future analysis of

large vertebrate genomes. Computa-tional comparative genomics is expectedto become an integral part of the analy-sis of human and mouse genomes andwill be needed to identify coding ele-ments and other functional compo-nents resistant to analysis by more con-ventional genetic and molecular ap-proaches.

REFERENCES

C. elegans Sequencing Consortium. 1998. Science 282: 2012–2018.

Emmons, S.W., Klass, M.R., and Hirsh, D. 1979. Proc. Natl. Acad. Sci. 76: 1333–1337.

Heine, U. and Blumenthal, T. 1986. J. Mol. Biol. 188: 189–198.

Kohara, Y. 1996. Protein Nucleic Acid Enzyme 41: 715.

Kuwabara, P.E. and Shah, S. 1994. Nucleic Acids Res. 22: 4414–4418.

Marra, M.A., Kucaba, T.A., Dietrich, N.L., Green, E.D., Brownstein, B., Wilson, R.K., McDonald, K.M., Hillier, L.W., McPherson, J.D., and Waterston, R.H. 1997. Genome Res. 7: 1072–1084.

Snutch, T.P. 1984. Ph.D. thesis, Simon Fraser University, Burnaby, B.C.

Thacker, C., Marra, M.A., Jones, A., Baillie, D.L., and Rose, A.M. 1999. Genome Res. 9: 348–359.

Figure 1 (A) Exon prediction in genomic regions near bli-4 (also called K04F10; from Kent and Zahler at http://www.cse.ucsc.edu/~kent/cgi-bin/tracks.exe?where=K04F10). (B) Schematic representation of the bli-4 gene from C. elegans (top) and C. briggsae (bottom). The common region(s) encoding the signal peptide, prodomain, protease domain, and middle domain are shown in brown. Alternatively spliced exons that encode carboxyl termini unique to the individual isoforms are labeled alphabetically and color coded. The position and extent of the e937 3325-bp deletion is indicated. Open boxes represent noncoding exons or untranslated regions; shaded boxes represent coding exons. Hatched boxes (labeled I–VIII) represent regions of nucleotide homology that may constitute regulatory elements, particularly those at the 5′ end. (Xb) XbaI; (Blp) BlpI; (Bgl) BglII; (Xho) XhoI. The approximate location of the C. briggsae bli-4 gene is indicated by dashed lines on the fosmid clones used in Thacker et al. (1999). The entire sequence of clones G25K01 and G06P23 has been determined, whereas G26K16 was used solely for transformation rescue experiments. (Reprinted with permission from Thacker et al. 1999.)
