Tools for Plant Comparative Genomics



Overview

The unifying goal of this five-year project (currently starting year three) is to apply and adapt computational methods from molecular evolution to plant functional genomics.

The centerpiece of the project is the development of Phytome, an online phylogenomics database and visualization tool enabling exploration of gene families and comparative maps for angiosperms and other land plants. The first version of the database, with information on thousands of protein-coding gene families, is freely available online.

In addition, there are three research goals:

• Refinement of computational methods for analysis of homologous chromosomal segments within and between genomes incorporating gene family and species phylogenies.

• Application of phylogenetic methods to study functional divergence within gene families.

• Analysis of patterns of spatial co-expression and divergence of expression profiles within gene families in Arabidopsis.

The educational and outreach goals include:

• Contribution of secondary education curricular materials to a mobile science lab targeting minority school districts in North Carolina.

• Contribution of material on bioinformatics to a genomics mediabook targeted at college-level biology students.

Comparative Mapping and Gene Family Evolution

References

1. Altschul SF, Madden TL, Schaffer AA et al. (1997) Nucl. Acids Res. 25, 3389-3402.
2. Birney E, Clamp M, Durbin R (2004) Genome Res. 14, 988-995.
3. Blanc G, Hokamp K, Wolfe KH (2003) Genome Res. 13, 137-144.
4. Calabrese PP, Chakravarty S, Vision TJ (2003) Bioinformatics 19, i74-i80.
5. Clamp M, Cuff J, Searle SM, Barton GJ (2004) Bioinformatics 20, 426-427.
6. Eddy SR (2003) HMMER v2, http://hmmer.wustl.edu.
7. Enright AJ, Van Dongen S, Ouzounis CA (2002) Nucl. Acids Res. 30, 1575-1584.
8. Felsenstein J (2004) Phylip v3.6, http://evolution.genetics.washington.edu/phylip.html.
9. Harris MA et al. (2004) Nucl. Acids Res. 32, D258-261.
10. Huan J, Prins J, Wang W, Vision TJ (2003) Proc. IEEE CSB Conf., 484-485.
11. Katoh K, Misawa K, Kuma K, Miyata T (2002) Nucl. Acids Res. 30, 3059-3066.
12. Ku H-M, Vision T, Liu J, Tanksley SD (2000) Proc. Natl. Acad. Sci. USA 97, 9121-9126.
13. Meyers BC, Tej SS, Vu TH et al. (2004) Genome Res. 14, 1641-1653.
14. Morgenstern B (1999) Bioinformatics 15, 211-218.
15. Mulder NJ et al. (2003) Nucl. Acids Res. 31, 315-318.
16. Notredame C, Higgins DG, Heringa J (2000) J. Mol. Biol. 302, 205-217.
17. Remington DL, Vision TJ, Guilfoyle TJ, Reed JW (2004) Plant Physiol. 135, 1738-1752.
18. Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) Bioinformatics 18, 502-504.
19. Simillion C, Vandepoele K, Saeys Y et al. (2004) Genome Res. 14, 1095-1106.
20. Vision TJ, Brown DG, Tanksley SD (2000) Science 290, 2114-2117.
21. Thompson JD, Thierry JC, Poch O (2003) Bioinformatics 19, 1155-1161.
22. Zdobnov EM, Apweiler R (2001) Bioinformatics 17, 847-848.
23. Zmasek CM, Eddy SR (2001) Bioinformatics 17, 383-384.

National Science Foundation Plant Genome Research Program

Young Investigator Award (#0227314)

PI: Todd Vision
Department of Biology
University of North Carolina at Chapel Hill
[email protected]
(919) 843-4507

UNC Personnel
Stefanie Hartmann, postdoctoral associate
Dihui Lu, graduate student
Jason Phillips, computing support

Key Collaborators
Skip Bollenbacher, PMABS
Blake Meyers, University of Delaware

Education and Outreach

We are collaborating with the Partnership for Minority Advancement in Biosciences (PMABS) to develop curricular materials for a mobile teaching lab named Destiny. Destiny brings a modern biology laboratory to over 130 rural and urban secondary schools with high minority enrollment. Our contribution is the design of a module focusing on plant molecular evolution and bioinformatics. The module will be field-tested in March 2005 as part of the UNC Science Spectrum series, in which high school students are brought to campus for a day of science activities.

We are also working with PMABS on developing bioinformatics material for a Genomics Mediabook. This will be a DVD containing a rich variety of multimedia content: narrated animations of the fundamental genomics concepts, case studies to motivate the material, interactive learning and assessment activities, live links to external websites (such as Genbank), and a hyperlinked mini-encyclopedia of genomics. The mediabook is targeted at the undergraduate level.

Phytome

A comparative plant genomics web resource, named Phytome, was publicly launched in September 2004 (http://www.phytome.com). Phytome contains inferred protein sequences (called Unipeptides) from 39 plant species (33 angiosperms and six other land plants). The protein sequences were predicted from public-domain DNA sequences in a variety of databases, including genomic DNA gene models, full-length cDNA sequences, and Expressed Sequence Tags (ESTs). Source databases include:

• NCBI Unigenes
• PlantGDB
• Plant Genome Network
• Sputnik
• TAIR
• TIGR Gene Indices

The first release of Phytome supports phylogenetic and functional analyses of Unipeptides and Families by providing a web-based graphical user interface (GUI) for searching and browsing the results of Phytome's analysis pipeline. The web pages are dynamic HTML documents generated by PHP and backed by a MySQL database.

Figure 1. Phytome analysis pipeline. Processes and computation times are shown at left, database components at right. Total computation time for all processes was ~460 days.

FAMILY PAGE: This page (not shown) can be reached directly from a Family search or from other Unipeptide and Family Pages. It contains the following features:

• Links to related Families (those for which the best pairwise BLASTP E-value is ≤ 1 × 10^-15).

• The list of component Subfamilies.

• The list of Family members excluded from the reduced alignment by REAP.

• The list of species represented within the Family.

• Two tabs that allow the user to view the list of Unipeptides sorted either by Subfamily or by species.

• Additional tabs that allow the user to view the InterPro and GO assignments for the exemplar of each Subfamily.

• A selection tool with which the user can choose groups of Unipeptides and proceed to the Alignment Page.
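The related-Families criterion can be sketched in a few lines. This is only an illustration and not Phytome's own code: the tuple layout of the BLASTP hits, the `family_of` mapping, and the function name `related_families` are assumptions made for the example. Only the 1 × 10^-15 threshold comes from the text above.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-15  # threshold quoted for the Family Page


def related_families(hits, family_of, cutoff=E_VALUE_CUTOFF):
    """Map each Family to the set of other Families it is linked to.

    hits      -- iterable of (query_id, subject_id, e_value) BLASTP hits
    family_of -- dict from Unipeptide ID to Family ID (hypothetical layout)
    """
    # Track the best (smallest) E-value seen between each pair of Families.
    best = {}
    for query, subject, e_value in hits:
        fam_q, fam_s = family_of.get(query), family_of.get(subject)
        if fam_q is None or fam_s is None or fam_q == fam_s:
            continue  # skip unassigned sequences and within-Family hits
        pair = tuple(sorted((fam_q, fam_s)))
        if pair not in best or e_value < best[pair]:
            best[pair] = e_value

    # Two Families are related when their best pairwise hit passes the cutoff.
    related = defaultdict(set)
    for (fam_a, fam_b), e_value in best.items():
        if e_value <= cutoff:
            related[fam_a].add(fam_b)
            related[fam_b].add(fam_a)
    return dict(related)
```

Keeping only the best E-value per Family pair means a single strong hit is enough to link two Families, which matches the "best pairwise" wording of the criterion.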

ALIGNMENT PAGE: This page is presented once a user has selected a set of Unipeptides from a Family Page. The following functions and features are labelled in the screenshot below (Figure 4):

1 Download the full and reduced alignments, the phylogeny, and the Unipeptide sequences.
2 Download the IDs of the original unigenes together with those of their component sequences.
3 View the phylogeny interactively using ATV (Zmasek and Eddy 2001).
4 Judge the reliability of the phylogenetic root based on a molecular clock test.
5 View and edit the alignment using JalView (Clamp et al. 2004).
6 View the reduced multiple sequence alignment, which shows only conserved columns. The number of excluded columns is shown at each corresponding position in the alignment.


SEARCHING PHYTOME

Phytome can be explored in a variety of ways via the web GUI. For example, Unipeptides may be queried with a Genbank accession number, an alias such as a unigene name from a source database, a Gene Ontology term or ID, or an InterPro term or ID. One can search for Families that include or exclude members from particular species or clades. In addition, Unipeptides and Families can be retrieved via BLAST searches. Batch BLAST is available to registered users. Registration is free and can be done online.

BULK DOWNLOADS: This set of files is particularly important for inter-operability with other plant genome databases. Currently, these files show the following correspondences for all Unipeptides and Families in Phytome:

• Genbank Accession Numbers for all component sequences (such as ESTs) of unigenes used to generate Unipeptides.

• Which unigenes from NCBI Unigene, PlantGDB, Plant Genome Network, Sputnik and TIGR Gene Indices share component sequences with each other (and with which Phytome Unipeptides).

• Which Unipeptides are in each Phytome Family and Subfamily.
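The second correspondence (unigenes that share component sequences) amounts to inverting a unigene-to-accession table. The sketch below assumes the download files have already been parsed into a dictionary; that in-memory layout, the example identifiers in the usage note, and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def unigenes_sharing_components(components):
    """Return pairs of unigenes that share at least one component sequence.

    components -- dict from unigene ID to an iterable of accession numbers
                  (a hypothetical in-memory form of the download files)
    The result maps each (unigene_a, unigene_b) pair to the shared accessions.
    """
    # Invert the mapping: accession -> unigenes that contain it.
    by_accession = defaultdict(set)
    for unigene, accessions in components.items():
        for accession in accessions:
            by_accession[accession].add(unigene)

    # Every pair of unigenes listed under the same accession shares it.
    shared = defaultdict(set)
    for accession, unigenes in by_accession.items():
        for a, b in combinations(sorted(unigenes), 2):
            shared[(a, b)].add(accession)
    return dict(shared)
```

For example, if a made-up NCBI unigene and a made-up TIGR gene index entry both contain the same EST accession, the pair appears once in the result with that accession as the evidence.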

UNIPEPTIDE PAGE: This page can be reached a number of ways, including direct search for a Unipeptide, through a BLAST search, or following a link from a Family Page. The following features of this page are labelled in the screenshot below (Figure 3):

1 Family and Subfamily IDs.
2 InterPro and Gene Ontology assignments (if the Unipeptide is the exemplar of its Subfamily).
3 Species of origin.
4 Original unigene sequence translated by Phytome.
5 A list of the unigene's constituent sequences (e.g. ESTs, Arabidopsis AGI numbers).
6 A list of related unigenes (from all sources) that share at least one component sequence.
7 Predicted peptide sequence (available for download in FASTA format).
8 Graphic showing InterPro assignments (if the Unipeptide is the exemplar of its Subfamily).

ANALYSIS PIPELINE

Phytome's analysis pipeline (Figure 1) was designed to maintain data quality as much as possible given the constraints of large-scale automated sequence analysis. The pipeline is built from publicly available software, with some modifications to existing tools and some custom tools developed specifically for this project.

Unigenes were translated using ESTWise (Birney et al. 2004) based on protein sequence templates from Swissprot/TrEMBL. Results of an all-by-all BLAST search (Altschul et al. 1997) of Unipeptides were input to the clustering software Tribe-MCL (Enright et al. 2002). Clusters were refined to produce Families. Multiple sequence alignments were generated by MAFFT (Katoh et al. 2002), T-Coffee (Notredame et al. 2000) or Dialign (Morgenstern 1999) and refined with RASCAL (Thompson et al. 2003). When these tools failed to produce quality alignments, we instead constructed seed alignments for select family members using T-Coffee (Notredame et al. 2000), then used HMMER (Eddy 2003) to align other family members to a hidden Markov model derived from the seed alignment. To ensure positional homology within those columns used for phylogenetic inference, we developed a tool named REAP (Hartmann, Phillips and Vision, unpublished) that pruned extremely "gappy" and divergent columns and discarded sequences that either had little overlap with the rest of the alignment or were obviously misaligned. Midpoint-rooted neighbor-joining phylogenies were computed from the reduced alignments using PHYLIP (Felsenstein 2004). Molecular clock tests were performed using TREE-PUZZLE (Schmidt et al. 2002) to determine the reliability of the midpoint roots. Subfamilies were identified automatically from the rooted phylogenies. Exemplars from each Subfamily were searched for InterPro signatures (Mulder et al. 2003) using InterProScan (Zdobnov and Apweiler 2001) and assigned Gene Ontology terms (Harris et al. 2004).
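To make the idea of pruning "gappy" columns concrete, here is a minimal sketch. It is not REAP, which is unpublished and also handles divergent columns and misaligned sequences. The function name, the dictionary representation of the alignment, and the 0.5 gap-fraction threshold are all assumptions chosen for illustration.

```python
def prune_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment -- dict from sequence ID to aligned sequence (equal lengths)
    Returns (reduced alignment, list of kept column indices).
    """
    sequences = list(alignment.values())
    if not sequences:
        return {}, []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    # Keep a column only if few enough sequences have a gap there.
    kept = []
    for col in range(length):
        gaps = sum(1 for seq in sequences if seq[col] in gap_chars)
        if gaps / len(sequences) <= max_gap_fraction:
            kept.append(col)

    reduced = {
        name: "".join(seq[col] for col in kept)
        for name, seq in alignment.items()
    }
    return reduced, kept
```

Returning the kept column indices alongside the reduced alignment makes it possible to report how many columns were excluded at each position, as the Alignment Page does.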

Figure 2. Phylogenetic tree for the 39 species included in Phytome version 1. The number of Unipeptides for each species is shown at right.

Figure 3. Unipeptide Page. Labels are described above.


Figure 4. Alignment Page. Labels are described above.

The primary task in comparative mapping is identification of homologous chromosomal regions. Until recently, this was accomplished by ad hoc manual methods, and a statistical framework for evaluating the results was lacking. To address this deficiency, we developed a software package named FISH, for Fast Identification of Segmental Homology, that employs a dynamic programming algorithm to identify pairwise segmental homologies and applies an explicit probability model to compute expect values (Calabrese et al. 2003). The documented source code and binaries are freely available from http://www.bio.unc.edu/vision/. We are now extending this approach to the alignment of more than two chromosomal segments (Huan et al. 2003, see also Simillion et al. 2004).
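The flavor of a dynamic programming search for collinear marker matches can be shown with a small sketch. This is not the FISH algorithm or its probability model, for which see Calabrese et al. (2003). It is a generic longest-chain recurrence, and the function name, the `(pos_a, pos_b)` input format, and the `max_gap` parameter are assumptions made for the example.

```python
def longest_collinear_chain(matches, max_gap=5):
    """Find the longest chain of marker matches that is collinear.

    matches -- list of (pos_a, pos_b) pairs: marker positions on two segments
    max_gap -- largest allowed step between consecutive matches on either
               segment (an illustrative parameter)
    Only chains increasing on both segments are considered; inverted
    segments would need a second pass with one coordinate negated.
    """
    points = sorted(set(matches))
    if not points:
        return []
    best_len = [1] * len(points)   # best chain length ending at each point
    prev = [None] * len(points)    # back-pointer for traceback

    for i, (a_i, b_i) in enumerate(points):
        for j in range(i):
            a_j, b_j = points[j]
            step_a, step_b = a_i - a_j, b_i - b_j
            # Extend a chain only by a nearby match further along both segments.
            if 0 < step_a <= max_gap and 0 < step_b <= max_gap:
                if best_len[j] + 1 > best_len[i]:
                    best_len[i] = best_len[j] + 1
                    prev[i] = j

    # Trace back from the end of the longest chain.
    end = max(range(len(points)), key=best_len.__getitem__)
    chain = []
    while end is not None:
        chain.append(points[end])
        end = prev[end]
    return chain[::-1]
```

A chain found this way is only a candidate segmental homology. Deciding whether it is longer than expected by chance is the role of a statistical model such as the one FISH provides.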

In an alignment of multiple genomic segments, a given marker may be observed in one segment while being unobserved in one or more of its syntenic partners. These markers may fail to be observed either because they are actually absent or because there is incomplete marker sampling. For a related Plant Genome project (DB-0110069, PI: Comstock), we have been developing a methodology based on hidden Markov models (HMMs) for calculating the probability that an unobserved marker is absent versus present in a syntenic segment (Xu and Vision, in prep.). We use transition probabilities based upon phylogenetic relationships among syntenic segments, which allows us to model the differing frequency of gene loss after speciation versus large-scale duplication (Ku et al. 2000). The goal is to incorporate this software into the Phytome analysis pipeline, making it easy for experimentalists and molecular breeders to obtain predictions of gene content within QTL candidate regions.
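The distinction between a marker that is truly absent and one that was simply not sampled can be illustrated with Bayes' rule for a single segment. This is a deliberately reduced calculation and not the HMM described above: it ignores the phylogenetic transition structure entirely, and both inputs (the prior probability of loss and the sampling probability) are hypothetical quantities introduced for the example.

```python
def posterior_absent(prior_absent, sampling_prob):
    """P(marker absent | marker not observed) for one syntenic segment.

    prior_absent  -- prior probability that the gene has been lost
    sampling_prob -- probability that a marker which is present is observed
    """
    if not (0.0 <= prior_absent <= 1.0 and 0.0 <= sampling_prob <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    # An absent marker is never observed; a present one is missed with
    # probability 1 - sampling_prob.
    unobserved_and_absent = prior_absent
    unobserved_and_present = (1.0 - prior_absent) * (1.0 - sampling_prob)
    total = unobserved_and_absent + unobserved_and_present
    if total == 0.0:
        raise ValueError("an unobserved marker is impossible under these inputs")
    return unobserved_and_absent / total
```

With sparse sampling (a small `sampling_prob`) the posterior stays close to the prior, so failing to observe a marker says little. With dense sampling the same non-observation becomes strong evidence of loss.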

By combining a gene family phylogeny with information regarding the physical positions of gene family members, one can study the diversification of gene families in a chromosomal context. We have used this approach to study the sister ARF and Aux/IAA gene families in Arabidopsis. The pattern of diversification in the ARF family has been typical of the genome as a whole: a minority of duplication events date back to large-scale duplications (Vision et al. 2000). By contrast, large-scale duplications are responsible for a disproportionate number of splits in the Aux/IAA phylogeny. This suggests that successful duplication of Aux/IAA genes has, since the split with the ARF genes, become contingent on interactions with other loci, either through the dependence on long-range cis-regulatory sequences or because of dosage relationships with other genes (Remington et al. 2004).

We have been examining the evolution of gene expression in Arabidopsis, in collaboration with Blake Meyers (Univ. of Delaware). We have focused on a set of ~500 duplicate gene pairs assayed for expression across five libraries (root, leaf, inflorescence, silique, callus) using MPSS technology (DBI-0110528, Meyers et al. 2004). There are three major findings (Vision and Meyers, in prep.). First, highly divergent young duplicates show that the correlation in expression can decay at or very soon after duplication. Second, an asymmetry is typically observed in which one duplicate is expressed at a higher level in all libraries, even though both are detectably expressed. Finally, among those pairs that were duplicated simultaneously during a single polyploidy event 20-80 MYA (Blanc et al. 2003), there is a strong relationship between the degree of protein sequence divergence and the degree of expression divergence. Thus, it appears that there is a correlation between the degree of functional constraint at the levels of protein sequence and gene expression.
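One simple way to quantify how similar the expression profiles of two duplicates are is the Pearson correlation across the five libraries. The sketch below is an assumption about how such a comparison could be made, not the analysis used in the study: the profile format (a dictionary of expression levels keyed by library name) and the function name are hypothetical.

```python
import math

LIBRARIES = ("root", "leaf", "inflorescence", "silique", "callus")


def expression_correlation(profile_a, profile_b):
    """Pearson correlation between two expression profiles.

    Each profile is a dict from library name to expression level
    (for example, normalized tag counts; the format is hypothetical).
    Returns None when either profile is constant across libraries.
    """
    xs = [profile_a[lib] for lib in LIBRARIES]
    ys = [profile_b[lib] for lib in LIBRARIES]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a flat profile
    return cov / math.sqrt(var_x * var_y)
```

Note that correlation is insensitive to overall level: a pair in which one duplicate is expressed at twice the level of the other in every library still has a correlation of 1, so the asymmetry described above would need a separate measure.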

FUTURE PLANS

Phytome version 1 allows users to study the evolution and functional diversification of plant proteins among lineages. Over the next two years, Phytome will be expanded to include genetic and physical map data. Wherever possible, mapped markers will be related to Unipeptides. Building upon the phylogenetic relationships among mapped Unipeptides in different species, comparative maps will then be constructed as part of the analysis pipeline. The eventual goal is to infer the gene content within chromosome segments from any species for which sequenced markers are available, based upon an evolutionary analysis of synteny conservation among related chromosome segments. Progress in developing software that can accomplish that task is described in the comparative mapping section.

We wish to thank all those who have contributed to the unigene datasets and databases, particularly those who have helped us to integrate these data: Lukas Mueller (PGN), Volker Brendel (PlantGDB), and Stephen Rudd (Sputnik). All statements are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.