J. Bacteriol. 2010 Williams 2305 14

10
JOURNAL OF BACTERIOLOGY, May 2010, p. 2305–2314 Vol. 192, No. 9 0021-9193/10/$12.00 doi:10.1128/JB.01480-09 Copyright © 2010, American Society for Microbiology. All Rights Reserved. Phylogeny of Gammaproteobacteria § Kelly P. Williams,* Joseph J. Gillespie, Bruno W. S. Sobral, Eric K. Nordberg, Eric E. Snyder, Joshua M. Shallom,† and Allan W. Dickerman Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia 24061 Received 11 November 2009/Accepted 4 February 2010 The phylogeny of the large bacterial class Gammaproteobacteria has been difficult to resolve. Here we apply a telescoping multiprotein approach to the problem for 104 diverse gammaproteobacterial genomes, based on a set of 356 protein families for the whole class and even larger sets for each of four cohesive subregions of the tree. Although the deepest divergences were resistant to full resolution, some surprising patterns were strongly supported. A representative of the Acidithiobacillales routinely appeared among the outgroup members, sug- gesting that in conflict with rRNA-based phylogenies this order does not belong to Gammaproteobacteria; instead, it (and, independently, “Mariprofundus”) diverged after the establishment of the Alphaproteobacteria yet before the betaproteobacteria/gammaproteobacteria split. None of the orders Alteromonadales, Pseudo- monadales, or Oceanospirillales were monophyletic; we obtained strong support for clades that contain some but exclude other members of all three orders. Extreme amino acid bias in the highly AT-rich genome of Candidatus Carsonella prevented its reliable placement within Gammaproteobacteria, and high bias caused artifacts that limited the resolution of the relationships of other insect endosymbionts, which appear to have had multiple origins, although the unbiased genome of the endosymbiont Sodalis acted as an attractor for them. Instability was observed for the root of the Enterobacteriales, with nearly equal subsets of the protein families favoring one or the other of two alternative root positions; the nematode symbiont Photorhabdus was identified as a disruptor whose omission helped stabilize the Enterobacteriales root. Although Gammaproteobacteria has only the taxonomic rank of class within the phylum Proteobacteria, it is richer in genera (250) than all bacterial phyla except Firmicutes (11). It in- cludes the paradigmatic bacterium Escherichia coli, well- known pathogens Salmonella, Yersinia, Vibrio, and Pseudomo- nas, additional pathogens occurring at the more basal and less well-resolved positions in the phylogeny like Coxiella and Fran- cisella, and endosymbionts that can be required for survival of their insect hosts. Members exhibit broad ranges of aerobicity, of trophism, including chemoautotrophism and photoautotro- phism, and of temperature adaptation (31). Morphologies in- clude rods, curved rods, cocci, spirilla, and filaments, and the class contains the largest known bacterial cells (30). Interesting combinations of phenotypes occur, such as terrestrial biolumi- nescence with pathogenicity (toward insects and humans) and symbiosis (with nematodes) in the genus Photorhabdus (37). One feature alone, 16S rRNA sequence relationship, has been used to define the class (11). Previous phylogenetic studies of this group indicate deep branches that make it difficult to obtain a large well-resolved phylogeny (10, 40). Taking advantage of the large number of genomes available for the class, we have followed a multipro- tein approach to gammaproteobacterial phylogeny (39). A po- tential challenge to the validity of phylogenetic reconstruction in bacteria is the view that horizontal transfer is so pervasive that the tree interpretation is artifactual (2). However, a cen- tral trend in the phylogenetic signals from multiple bacterial protein families can be detected that probably represents their shared vertical inheritance (25). Our methodology sought to include single-copy families most likely to reflect this central trend, applying a filter to remove the families most discordant with the majority. The resulting tree breaks with current tax- onomy. In contrast to previous studies with a small number of ge- nomes, indicating high phylogenetic concordance among gam- maproteobacterial protein families (20, 24), we found that nearly equal numbers of protein families favored one or the other of two roots for Enterobacteriales. However, much of this effect could be ascribed to a single taxon that was absent from the earlier studies. MATERIALS AND METHODS Genome selection by collapse of a preliminary tree. On 27 July 2007, there were 215 complete or incomplete genomes available for nominal Gammapro- teobacteria at NCBI. Because the phylogenetic distribution was biased, we se- lected a smaller number with balanced distribution. Protein sets were taken for the 215 Gammaproteobacteria and for two outgroup species each from the Al- phaproteobacteria and Betaproteobacteria. An initial set of protein families that approximated one occurrence per ge- nome among 97 Gammaproteobacteria were selected from the GeneTrees data- base (36), and the alignments were used for HMMsearch using HMMer (3) to find instances in the full set of 215 genomes. The 20 gene families with the lowest standard deviation of members per genome, the closest to single copy, were processed further. Phylogenetic trees were built for each family to resolve 54 cases of duplicate or split genes. A multiprotein species tree was prepared, aligning sequences with MUSCLE version 3.6 (7) in default mode and removing ambiguous portions of alignments with Gblocks version 0.91b (35) in default mode except for using the intermediate setting for gap tolerance; the concate- nation of masked alignments containing 5,538 characters was used to prepare the tree applying MrBayes version 3.2 (28) in model-jumping mode as described * Corresponding author. Mailing address: Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061. Phone: (540) 231- 7121. Fax: (540) 231-2606. E-mail: [email protected]. † Present address: Department of Fisheries and Wildlife Sciences, Integrated Life Sciences Building (0913), 1981 Kraft Drive, Blacks- burg, VA 24061. § Supplemental material for this article may be found at http://jb .asm.org/. Published ahead of print on 5 March 2010. 2305 on August 26, 2015 by guest http://jb.asm.org/ Downloaded from

description

Gammaproteobacteria

Transcript of J. Bacteriol. 2010 Williams 2305 14

Page 1: J. Bacteriol. 2010 Williams 2305 14

JOURNAL OF BACTERIOLOGY, May 2010, p. 2305–2314 Vol. 192, No. 90021-9193/10/$12.00 doi:10.1128/JB.01480-09Copyright © 2010, American Society for Microbiology. All Rights Reserved.

Phylogeny of Gammaproteobacteria�§Kelly P. Williams,* Joseph J. Gillespie, Bruno W. S. Sobral, Eric K. Nordberg,

Eric E. Snyder, Joshua M. Shallom,† and Allan W. DickermanVirginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia 24061

Received 11 November 2009/Accepted 4 February 2010

The phylogeny of the large bacterial class Gammaproteobacteria has been difficult to resolve. Here we applya telescoping multiprotein approach to the problem for 104 diverse gammaproteobacterial genomes, based ona set of 356 protein families for the whole class and even larger sets for each of four cohesive subregions of thetree. Although the deepest divergences were resistant to full resolution, some surprising patterns were stronglysupported. A representative of the Acidithiobacillales routinely appeared among the outgroup members, sug-gesting that in conflict with rRNA-based phylogenies this order does not belong to Gammaproteobacteria;instead, it (and, independently, “Mariprofundus”) diverged after the establishment of the Alphaproteobacteriayet before the betaproteobacteria/gammaproteobacteria split. None of the orders Alteromonadales, Pseudo-monadales, or Oceanospirillales were monophyletic; we obtained strong support for clades that contain some butexclude other members of all three orders. Extreme amino acid bias in the highly A�T-rich genome ofCandidatus Carsonella prevented its reliable placement within Gammaproteobacteria, and high bias causedartifacts that limited the resolution of the relationships of other insect endosymbionts, which appear to havehad multiple origins, although the unbiased genome of the endosymbiont Sodalis acted as an attractor for them.Instability was observed for the root of the Enterobacteriales, with nearly equal subsets of the protein familiesfavoring one or the other of two alternative root positions; the nematode symbiont Photorhabdus was identifiedas a disruptor whose omission helped stabilize the Enterobacteriales root.

Although Gammaproteobacteria has only the taxonomic rankof class within the phylum Proteobacteria, it is richer in genera(�250) than all bacterial phyla except Firmicutes (11). It in-cludes the paradigmatic bacterium Escherichia coli, well-known pathogens Salmonella, Yersinia, Vibrio, and Pseudomo-nas, additional pathogens occurring at the more basal and lesswell-resolved positions in the phylogeny like Coxiella and Fran-cisella, and endosymbionts that can be required for survival oftheir insect hosts. Members exhibit broad ranges of aerobicity,of trophism, including chemoautotrophism and photoautotro-phism, and of temperature adaptation (31). Morphologies in-clude rods, curved rods, cocci, spirilla, and filaments, and theclass contains the largest known bacterial cells (30). Interestingcombinations of phenotypes occur, such as terrestrial biolumi-nescence with pathogenicity (toward insects and humans) andsymbiosis (with nematodes) in the genus Photorhabdus (37).One feature alone, 16S rRNA sequence relationship, has beenused to define the class (11).

Previous phylogenetic studies of this group indicate deepbranches that make it difficult to obtain a large well-resolvedphylogeny (10, 40). Taking advantage of the large number ofgenomes available for the class, we have followed a multipro-tein approach to gammaproteobacterial phylogeny (39). A po-tential challenge to the validity of phylogenetic reconstruction

in bacteria is the view that horizontal transfer is so pervasivethat the tree interpretation is artifactual (2). However, a cen-tral trend in the phylogenetic signals from multiple bacterialprotein families can be detected that probably represents theirshared vertical inheritance (25). Our methodology sought toinclude single-copy families most likely to reflect this centraltrend, applying a filter to remove the families most discordantwith the majority. The resulting tree breaks with current tax-onomy.

In contrast to previous studies with a small number of ge-nomes, indicating high phylogenetic concordance among gam-maproteobacterial protein families (20, 24), we found thatnearly equal numbers of protein families favored one or theother of two roots for Enterobacteriales. However, much of thiseffect could be ascribed to a single taxon that was absent fromthe earlier studies.

MATERIALS AND METHODS

Genome selection by collapse of a preliminary tree. On 27 July 2007, therewere 215 complete or incomplete genomes available for nominal Gammapro-teobacteria at NCBI. Because the phylogenetic distribution was biased, we se-lected a smaller number with balanced distribution. Protein sets were taken forthe 215 Gammaproteobacteria and for two outgroup species each from the Al-phaproteobacteria and Betaproteobacteria.

An initial set of protein families that approximated one occurrence per ge-nome among 97 Gammaproteobacteria were selected from the GeneTrees data-base (36), and the alignments were used for HMMsearch using HMMer (3) tofind instances in the full set of 215 genomes. The 20 gene families with the loweststandard deviation of members per genome, the closest to single copy, wereprocessed further. Phylogenetic trees were built for each family to resolve 54cases of duplicate or split genes. A multiprotein species tree was prepared,aligning sequences with MUSCLE version 3.6 (7) in default mode and removingambiguous portions of alignments with Gblocks version 0.91b (35) in defaultmode except for using the intermediate setting for gap tolerance; the concate-nation of masked alignments containing 5,538 characters was used to prepare thetree applying MrBayes version 3.2 (28) in model-jumping mode as described

* Corresponding author. Mailing address: Virginia BioinformaticsInstitute, Virginia Tech, Blacksburg, VA 24061. Phone: (540) 231-7121. Fax: (540) 231-2606. E-mail: [email protected].

† Present address: Department of Fisheries and Wildlife Sciences,Integrated Life Sciences Building (0913), 1981 Kraft Drive, Blacks-burg, VA 24061.

§ Supplemental material for this article may be found at http://jb.asm.org/.

� Published ahead of print on 5 March 2010.

2305

on August 26, 2015 by guest

http://jb.asm.org/

Dow

nloaded from

Page 2: J. Bacteriol. 2010 Williams 2305 14

previously (39). A process of sequential tip pair collapse was applied to this tree,in which at each stage the terminal node with the shortest total length of its twoleaves was collapsed into a single leaf whose length was the average of the two.The distance between the pair of collapsed tips, a measure of diversity reduction,was plotted against the stage of collapse (Fig. 1). The stage with 104 clusters waschosen because it preceded a phase of steep loss of genome diversity and yetyields a reasonable number for phylogenetic analysis. One genome was selectedfrom within each of the 104 clusters, favoring complete genomes and better-known taxa. The total was raised to 110 with additional genomes: three fromunrepresented genera that had become available at NCBI more recently, twothat allowed comparison to a classical set of 13 taxa (20), and a nonpublicgenome that became available to us. Later, two incomplete genomes were re-jected because as described below their rRNA sequences suggested that theywere not bacteriologically pure. The remaining 108 taxa contained four outgrouptaxa, at least one member from each of the 14 gammaproteobacterial orders, and11 taxa classified at NCBI as Gammaproteobacteria but not assigned to an order(see Table S1 in the supplemental material).

rRNA sequence assay of genome purity. All complete or incomplete 16S and23S rRNA sequences were collected from 110 genomes using a curated set ofseed sequences and a BLASTN-based script, resulting in 1,135 sequences (529 of16S rRNA and 606 of 23S rRNA), of which 791 were unique. These sets werealigned using MUSCLE. For 54 rRNA fragments, sequences from the ends of acontig that aligned poorly to whole rRNAs were trimmed, leaving 337 16S rRNAand 452 23S rRNA unique sequences that were realigned using MUSCLE,manually correcting its error of splitting off small terminal portions of incompletesequences. For all intragenomic pairs whose alignment overlapped �40 nt, thesubalignment for the overlap region was taken for all available rRNA sequences(78 subalignments for 16S rRNA, 124 for 23S rRNA). Intragenomic sequencepairs for 16S and 23S rRNA were 409 and 658 for the full alignments and 228 and558 for the subalignments, respectively. For each full and subalignment, a dis-tance matrix was taken using distmat (26). All cases where an intragenomic pairhad a distance score above the tenth percentile (39) for a partner were investi-gated further.

(i) In the 23S rRNA subalignment of positions 808 to 1075, a fragment fromBeggiatoa sp. strain SS had a higher distance score to its intragenomic partnerthan to 99% of all its distance scores and also ranked highly in other 23Ssubalignments. BLAST found one of the pair to have 83% identity to Coxiella,but the other had 96% identity to Haemophilus influenzae PittGG. Because thisincomplete genome, derived from a single bacterial filament isolated from seasediment, contained highly divergent 23S fragments, we considered that its pro-tein sequences might be contaminated with those from additional microbes, andit was rejected from our analysis.

(ii) A 23S rRNA fragment from the incomplete Azotobacter vinelandii genome

had high-ranking intragenomic distance scores and in one subalignment rankedas high as 98% among all distance scores. This fragment showed 96% identity toDesulfovibrio 23S rRNA, while the other eight sequences from this genomematched Azotobacter 23S rRNA. We rejected this genome from our analysis dueto the potential for contamination with protein sequences from other organisms.The complete A. vinelandii genome now available no longer has this rRNAcontamination.

(iii) A short high-scoring 23S rRNA fragment from the complete Chromohalo-bacter genome proved to be a likely regulatory region of an operon encodingproteins known to bind that region of 23S rRNA (38).

(iv) One high-scoring intragenomic rRNA distance was ascribed to misalign-ment by MUSCLE.

(v) The remaining high-scoring intragenomic rRNA distances were from in-complete genomes, and the top BLAST hit of each contender was to the nominalgenus though low scoring. These cases could be ascribed to low-quality sequencedata.

rRNA trees. Analysis began with the rRNA sequences from 108 genomesremaining after the filters of the genomic purity assay. In cases of multiple rRNAsequences, a single full-length sequence was chosen as having the shortest branchlength in a preliminary tree or, if only fragments, were available, they weremerged. With single representative 16S and 23S rRNA sequences for eachgenome, the MUSCLE alignments were further evaluated using the secondarystructure models available at the comparative RNA web site (5). Sequences weremanually adjusted to adhere to the predicted helical and unpaired regions of therRNAs, using compensatory base change evidence to support homology assign-ment (17). Regions of the alignment that included length heterogeneity wereminimized by flanking structural support (i.e., compensatory base change evi-dence) with remaining regions of ambiguous alignment masked (12). In severalcases of mismatch to the structural model, the flanking ends of partial sequenceswere also masked. Resulting alignments were processed using tools from thejRNA web site (13) to create alignments, base pair analyses, and input files forMrBayes annotated for use of the doublet model, which assigns an RNA sub-stitution matrix to base pairs.

Bayesian phylogenetic trees were generated for the 16S and 23S rRNA maskedalignments and for a concatenation of the two. For each, MrBayes was run threetimes as a single MCMC chain for 2 � 106 generations, with a GTR � � � Imodel (� distribution approximated by 4 rate categories) for among-site ratevariation, differing at each site, and the doublet substitution model for base pairs.In all cases, burn-in was attained well before 1.5 � 106 generations, and the treesfrom the final 0.5 � 106 generations were pooled from triplicate runs (7,500 treestotal) to build a consensus tree.

Multiprotein data sets. An automated workflow for gene family selection,discordance filtering, and tree building was implemented through a series of Perlscripts. All proteins for a group of genomes were sorted into families usingall-versus-all BLASTP followed by OrthoMCL version 1.4 (21). Approximatelysingle-copy protein families were selected by choosing (i) a core set of completeand well-behaved ingroup genomes from among the taxa and tolerances for (ii)the number of core genomes missing from the family and (iii) the number ofgenomes with multiple members in the family (see Table 2, footnote a).

For the full taxon set of 108 study taxa, this process yielded 404 families, withtwo taxa especially poorly represented: Candidatus Carsonella in 69 families andCandidatus Endoriftia in 58. Since only 182 proteins are annotated for theextremely small Ca. Carsonella genome, its low family representation was notpursued further. The incomplete Ca. Endoriftia genome had many genes splitamong different contigs, frameshifted by apparent sequencing errors or simplyuncalled. This genome was queried by all members of its unrepresented familiesusing TBLASTN, and 317 of the unrepresented families received good hits, ofwhich the 148 most reliable models for missing proteins were added to thecorresponding families and to the Ca. Endoriftia protein file.

The resulting families were subjected to discordance filtering, that is, removalof the 10% with the most anomalous phylogenetic signal (e.g., potential cases oflateral gene transfer). Family members were aligned (MUSCLE) and the align-ments were masked to remove unreliably aligned regions (Gblocks, with inter-mediate gap tolerance); families obliterated by Gblocks were rejected. For eachfamily, the preferred amino acid substitution model among the 20 available inRAxML 7.0.4 was chosen, most frequently the WAG model, based on a scriptprovided with RAxML (34). Fifty quick bootstraps (RAxML option -x) weretaken for the family to determine high-support bipartitions (HSBs) of the taxaaccording to a chosen cutoff (see Table 2). Families with no HSBs were imme-diately rejected; otherwise, pairwise normalized HSB conflict counts were takenfor all pairwise comparisons of the families, after modifying HSBs for sharedtaxa. Unshared taxa were removed, causing some HSBs to merge and others tobecome trivial. Conflicting HSBs were counted and normalized, dividing by the

FIG. 1. Taxon selection by collapse of a preliminary tree. Collapsewas a sequential process of identifying the shortest leaf pair and re-placing it with a single representative leaf. This maximized diversitybetween clusters of the available genomes. The number of genomesselected this way was 104, since this preceded a phase in the collapseprocess with an especially high loss of diversity.

2306 WILLIAMS ET AL. J. BACTERIOL.

on August 26, 2015 by guest

http://jb.asm.org/

Dow

nloaded from

Page 3: J. Bacteriol. 2010 Williams 2305 14

product of the two families’ HSB counts. A conflict score was assigned to eachfamily (its average pairwise normalized HSB conflict count), which had littlecorrelation with numbers of characters, taxa, or HSBs (data not shown). A givenfraction (10%) of families with the highest conflict scores were removed. Figure2 shows that 10% is an effective and safe rejection rate. Note that discordancefiltering does not refer to (and thereby circularly reinforce) any particular phy-logenetic tree topology.

Phylogenetic analysis. The masked alignments for the remaining families wereconcatenated. We found that each individual MrBayes run would canalize into asingle tree topology, but duplicate runs would find slightly different topologies.We also found that the product of the faster (MIX) mode of RAxML dependedon the starting tree provided. Instead we used the slowest RAxML mode, withgamma-distributed rate heterogeneity, which yielded a single final tree whetherwe started with random trees or with the products from the MIX-mode runs.Protein families were assigned their preferred substitution models found previ-ously. To provide support values, tree building was repeated for 50 whole-proteinjackknife (random removal of half the protein families) resamplings and for allsingle-taxon jackknife resamplings.

The above approach was applied to the full taxon set and to the four maintaxon subgroups as detailed in Table 1. In addition to the use of multiple aminoacid substitution matrices with the gene-partitioned concatenations, we testeduniform usage of the new LG matrix (18). Maximum likelihood trees wereprepared with uniform (unpartitioned) use of LG for the concatenations of thefull taxon set and all four subgroups using RAxML version 7.2.3. All trees hadtopologies identical to those of their counterparts that had been produced withmultimodel partitioning, except that the Basal subset and the full taxon set (inthe Basal region) each had a single case of a branch crossing a single node. Bothcases involved the node with the poorest jackknife support in the whole tree(represented by multifurcation in Fig. 3). Comparing the two different trees inthese two cases, using each of the two substitution matrix sets by the Approxi-mately Unbiased test, the differences were not significant at the 0.05 level, andthe trees from multimodel partitioning scored higher than those from uniformapplication of the LG matrix.

Amino acid bias. To examine amino acid bias, the 61 proteins from the 356all-taxon set that included Ca. Carsonella were used to measure amino acid biasfor each taxon relative to all the gammaproteobacterial taxa according to themethod of Karlin (15) (Fig. 4). To examine the significance of amino acid bias,the chi-squared test implemented in TreePuzzle version 5.2 (29) was applied; 79of 108 genomes were biased at the 5% level. As a second test of the significanceof amino acid bias, we generated an artificial distribution of amino acid biasscores by making 1 million simulated 61-protein concatenations where eachcharacter state was chosen randomly from each column of the study supermatrix;bias against the gammaproteobacterial taxa was calculated for each, and signif-icance levels were taken from the sorted list of scores. According to this test, 105of the 108 genomes were biased at the P � 0.05 level, and 81 (which included all79 of those identified by TreePuzzle) were biased at the P � 0.000001 level.

Statistical tree tests. The Approximately Unbiased test as implemented inCONSEL v0.1j (32) was applied, and site likelihood scores were prepared withRAxML.

RESULTS

Taxon selection. Over 200 complete and incomplete gam-maproteobacterial genomes were available when our projectbegan, but they had a biased taxonomic distribution; the sixgenera Escherichia, Haemophilus, Pseudomonas, Shewanella,Vibrio, and Yersinia accounted for over half the genomes.These near-duplicate genomes contain little additional phylo-genetic information; we selected a subset of the genomes withmaximized diversity by (i) building a preliminary phylogenetictree, (ii) sequentially clustering close relatives, and (iii) choos-ing a stage in the clustering process that preceded a phase ofrapid loss of diversity (Fig. 1). Two incomplete genomes wererejected because analysis of rRNA sequences suggested thatthese two projects might be contaminated with foreign DNA.The final set of 108 genomes included four outgroup taxa (twoeach from the sister class Betaproteobacteria and the subtend-ing class Alphaproteobacteria), at least one representative ofeach of the 14 orders listed in Bergey’s Manual (11), and 11 taxathat had been assigned at NCBI to the class Gammaproteobac-teria but not to a particular order (see Table S1 in the supple-mental material). This genome selection process is summa-rized in Table 1.

rRNA trees. As important a tool as 16S rRNA phylogeny hasbeen in bacteriology, we and others have found that this singlegene contains insufficient phylogenetic information to robustlyrecover deep relationships, such as those of interest in thiswork. For our 108 selected genomes, we made trees for 16Sand 23S rRNAs (see Fig. S1 and S2 in the supplementalmaterial). Both trees were questionable because numerousingroup genomes (21 and 17, respectively) were placed be-tween the two Alphaproteobacteria and Betaproteobacteriaoutgroups; moreover, the two trees differed substantially at

TABLE 1. Selection of genomes

StepNo. of genomes remaining

Ingroup Corea Outgroup

Collect gammaproteobacterial genomesfrom NCBI

215 0

Select alpha-/betaproteobacterialoutgroups

215 4

Collapse preliminary 20-protein tree 100 4Complete the “classical” genome set (20) 102 4Add recently available genomes from

unrepresented genera106 4

Apply rRNA purity filter 104 4Select core genomesa for family selection 104 31 4

Generate main tree, split into isolatedsubgroups

Basal subgroup 19 14 3PO subgroup 27 17 2VAAP subgroup 37 24 2Entero subgroup 19 8 2

a The core genomes are a subset of the ingroup that have complete sequencesand were further selected based on their diversity and their low compositionalbias; these were used in the near-single-copy filter for protein families.

FIG. 2. Family conflict. The conflict scores for each set of proteinfamilies identified those with egregious phylogenetic signals. Familiesscoring above the 90th percentile were rejected.

VOL. 192, 2010 GAMMAPROTEOBACTERIAL PHYLOGENY 2307

on August 26, 2015 by guest

http://jb.asm.org/

Dow

nloaded from

Page 4: J. Bacteriol. 2010 Williams 2305 14

FIG. 3. Gammaproteobacterial phylogeny. Four subgroups of the taxa (excepting Ca. Carsonella, which is omitted here) were found not to mingle,by protein-jackknife subsampling of an all-taxon concatenation of 356 protein alignments. Longer concatenations were prepared for each taxon subgroup(Table 1), and tree topologies were built for each; small black geometric symbols link each subgroup to the outgroups used in building its subtree. Thefour regional trees were merged back into the topology for the all-taxon data set. Branch lengths for the merged topology were computed based on theall-taxon data set, and the four of its bifurcations that had less than 25% protein-jackknife support were collapsed, leaving three multifurcating nodes(dotted lines). Support values are shown (in the following order: all-taxon protein 50% jackknifing, taxon-subgroup protein 50% jackknifing, andtaxon-subgroup single-taxon jackknifing) when any of these are �100%; cases marked NA (not applicable) were for nodes not included in taxon subgroupstudies. The Entero region subtree was prepared by a two-step procedure described in the text that makes the usual support values inapplicable; however,for the first-step subtree prepared for the core Enterobacteriales taxa whose topology persists in this figure, protein and taxon jackknifing gave 100%support at each node. Taxon names are given in parentheses for incomplete genomes and are color coded according to the taxonomic order designationat NCBI. Taxonomic families represented by more than one taxon, except in cases in which the lone family represents the order, are bracketed, extendingbrackets to include unassigned taxa as necessary; only two of these nine families are split in our tree. Two nodes of interest are marked: the X indicatesa node supported by a rare indel (10), and the star indicates a fully supported node that groups members of Pseudomonadales, Alteromonadales, andOceanospirillales while excluding other members of each order. All known members of the latter clade, and no other known bacteria, autoregulate aribosomal protein operon through strong mimicry of a 23S rRNA domain (38).

2308 WILLIAMS ET AL. J. BACTERIOL.

on August 26, 2015 by guest

http://jb.asm.org/

Dow

nloaded from

Page 5: J. Bacteriol. 2010 Williams 2305 14

the deeper nodes. A tree for the concatenated alignment ofthe two rRNAs had the same problems (see Fig. S3 in thesupplemental material). In a multiprotein study of alphapro-teobacterial phylogeny, it was found that the combined rRNAshad the phylogenetic inaccuracy (measured as the level ofdisagreement between two different phylogenetic reconstruc-tion methods) of �2.5 single-copy proteins combined; more-over, accuracy increased as more proteins were employed, atleast up to 100 proteins (39). Therefore, we took a multipro-tein approach to gammaproteobacterial phylogeny.

Telescoping multiprotein approach. The first step in ourmultiprotein approach (Table 2) was to select a large numberof protein families with (nearly) single-copy distribution pat-terns, because these are likely to have similar histories thatreflect the organismal phylogeny. A strict single-copy criterionwould only have allowed one gene family, so the criterion wasslightly relaxed. A set of 31 core taxa was selected that hadcomplete genomes of �1 Mbp and were phylogenetically rep-resentative (see Table S1 in the supplemental material), andthe relaxed single-copy criterion allowed families with up totwo core taxa unrepresented and up to six core taxa withduplicate membership. Cases of multiple membership for ataxon were expediently resolved by removing both (or all)members from that taxon in that family. This yielded 405 genefamilies.

The alignments for these families were masked by removingambiguously aligned regions. Families with no sequence re-maining after masking were rejected, as were families whosebootstrap trees produced no highly supported taxon biparti-tions. These two filters eliminated families that were grosslydeficient in phylogenetic signal.

To further refine the gene families and minimize the poten-tial influence of horizontal transfer, a phylogenetic discordancefilter was imposed to eliminate the 10% of protein familiesleast concordant with the others. The set of highly supportedtaxon bipartitions for each family were compared to generatea normalized conflict matrix (Fig. 2). Families with high aver-age discordance were eliminated without referring to (and

thereby circularly reinforcing) an a priori tree topology, leaving356 families (see Table S2 in the supplemental material). Fif-teen of 19 COG (clusters of orthologous groups) categorieswere significantly overrepresented (translation proteins espe-cially) or underrepresented (uncategorized proteins especially)among these families (see Fig. S4 in the supplemental mate-rial).

The masked alignments for these families were concate-nated into a supermatrix, and a maximum likelihood tree wasproduced. Its robustness was evaluated by whole-protein 50%jackknifing, that is, generating subsets of the data from whicha random half of the protein families had been removed andcomputing a tree for each. Support for each node in the maintree was then evaluated as the fraction of protein jackknifetrees that produced the node. A rationale for the use of thissupport measure is presented in the Discussion. Several nodesreceived very low support. However, the protein jackknife datarevealed five regions of the tree whose boundaries were nevercrossed (except by the anomalous taxon Ca. Carsonella, whichwe discarded, as further described below), indicating isolatedgroups of taxa that could be studied separately, reuniting theresulting telescoped trees into a better-resolved compositetree. One group was the Enterobacterales clade (“Entero” inFig. 3). The second group, termed “VAAP,” consists mainly ofVibrionales, Alteromonadales, Aeromonadales, and Pasteurel-lales; the Entero clade emerges from the VAAP group. A sisterclade to the VAAP and Entero groups is the “PO” group,consisting mainly of Pseudomonadales and Oceanospirillales.The “Basal” region is a paraphyletic group of the most an-ciently diverging lineages in the class. Finally, the “Non-Gamma” region contains the alpha- and betaproteobacterialoutgroups and additionally contains “Mariprofundus,” whichhad been assigned to the Gammaproteobacteria by the NCBItaxonomy but to a novel lineage outside the Gammaproteobac-teria by rRNA analysis (8), and, surprisingly, Acidithiobacillus,which has been considered a member of the Gammaproteobac-teria (16). The Acidithiobacillus result is not due to an anomalyof the particular genome project studied here; a second Acidi-thiobacillus project has since become available, and it showsnear-identity to the earlier project in all the protein familiesemployed here.

FIG. 4. Compositional bias. Amino acid bias was measured accord-ing to the method of Karlin (15) for the set of 61 all-taxon proteinfamilies that contain Ca. Carsonella and plotted against nucleotidecomposition, producing a horn of A�T-rich endosymbionts with Ca.Carsonella at its tip.

TABLE 2. Selection of protein families

StepNo. of families remaining (by genome group)

All-taxon Basal PO VAAP Entero

OrthoMCL 25,292 8,227 12,727 12,482 7,365Near-single-copy filtera 405 684 679 712 1,403Masking filterb 404 684 679 712 1,403High-support bipartition

filterc396 680 678 712 1,402

Discordance filter 356 616 611 640 1,262

a This filter removed families exceeding the tolerance for either absent ormultiple representation of core taxa in the family membership. The followingtolerances were used (absent/multiple): all-taxon, 2/6; Basal, 5/1; PO, 1/1; VAAP,2/1; Entero, 0/0.

b Families with no sequence remaining after masking by Gblocks was removed.c Taxon bipartitions for a protein family were counted as highly supported

when they appeared among �75% of the bootstrap trees for the family (exceptfor the all-taxon families, where a 95% cutoff was applied); the filter removedfamilies with no high-support bipartitions.

VOL. 192, 2010 GAMMAPROTEOBACTERIAL PHYLOGENY 2309

on August 26, 2015 by guest

http://jb.asm.org/

Dow

nloaded from

Page 6: J. Bacteriol. 2010 Williams 2305 14

Aiming to improve resolution, each of the four gammapro-teobacterial regions (Basal, PO, VAAP, and Entero) of thetree were separately reanalyzed with larger numbers (611 to1,262) of protein families, by the same selection/discordance-filtering/tree-building approach described above, and as de-tailed in Table 2. Even the very large Entero collection (1,262protein families) shows functional bias compared to all 4,149encoded proteins of E. coli; overrepresented COG categorieswere translation, coenzyme metabolism, cell division, DNAmetabolism, and protein processing (chi-squared P values of�10�10, �10�10, �10�7, �10�7, �10�4, and �10�3, respec-tively), while the uncategorized, carbohydrate metabolism, andtranscription genes were underrepresented (�10�10, �10�7,and �0.03, respectively) (see Fig. S4 in the supplemental ma-terial). Also, none of the 185 E. coli genes annotated ingenomic islands were among the Entero families. Support val-ues for the nodes in each regional tree were then evaluated bywhole-family 50% jackknifing as described above and also bysingle-taxon jackknifing, in which, for each ingroup taxon, atree was built after that taxon was removed. Before reunitingthese telescoped regional trees, the Enterobacteriales regionwas examined further, initially because it contained the anom-alous Ca. Carsonella genome.

Compositional attraction. The Ca. Carsonella genome (22)is unusual for many reasons: it has an extremely small andA�T-rich genome and is represented in only 61 of our 356families. It was located on an extremely long branch in ouroriginal tree within a group of other insect endosymbionts,themselves (except for Sodalis) on long branches and withsmall, A�T-rich genomes. It was the only taxon that crossed atree regional boundary, found among the insect endosymbiontEnterobacteriales in 48 of the whole-protein jackknife replicatesbut with the A�T-rich Francisella or Rickettsiella genomes(Basal region) in the other two. Either or both of these treepositions may have been artifacts resulting from amino acidbias, in turn arising from nucleotide bias in these genomes.Using the protein families that contain Ca. Carsonella, wemeasured the amino acid bias of each taxon relative to allGammaproteobacteria combined (15). In a plot of amino acidbias against nucleotide bias (Fig. 4), some taxa fell into aprominent A�T-rich horn, with Ca. Carsonella at its far tip,eight insect endosymbionts that are its usual neighbors in thetrees, and three members of the Basal region: CandidatusRuthia, Francisella, and Rickettsiella. We tested more directlythe possibility that the placement of Ca. Carsonella was arti-factual by building a tree after removing its eight neighboringA�T-rich taxa from the analysis; in this tree, Ca. Carsonellajumped to the group with the next-most-similar composition,appearing with Francisella. In a similar tree built after furtherremoving Ca. Ruthia, Francisella, and Rickettsiella, Ca. Car-sonella appeared in yet a new region of the tree, amongPseudomonas species. We conclude that either (i) the place-ment of Ca. Carsonella among the insect endosymbionts isartifactual, (ii) its secondary placement with Francisella is ar-tifactual, or (iii) both these placements are artifactual such thatCa. Carsonella may not even belong to the Gammaproteobac-teria. Ca. Carsonella was excluded from further analysis.

The group of insect endosymbionts with small, A�T-richgenomes, consisting of Buchnera, Candidatus Baumannia, Can-didatus Blochmannia, and Wigglesworthia, appeared as a single

clade in the Entero region of initial trees. These genomes allhave high amino acid bias relative to the remaining Gamma-proteobacteria (Fig. 4), raising the possibility that the clade mayhave been an artifact of compositional attraction. We testedthese genomes as we had Ca. Carsonella, building a tree byadding each one singly after omitting all their A�T-rich neigh-bors. Each member of the group was confirmed to fall amongthe Enterobacteriales in single-member tests using the all-taxondata set. Although each of the Ca. Baumannia-Ca. Blochman-nia-Wigglesworthia strains were placed as before next to Soda-lis, each Buchnera strain was placed at the base of the Entero-bacteriales. These placements were reproduced when thesingle-member tests were repeated with the Enterobacteriales-focused data set (Fig. 5A and B, part i). This split placementsupports the idea that the original grouping of all these endo-symbionts was due to compositional attraction.

The presence of Sodalis was suspected to have influencedthese results, since it is itself an insect endosymbiont (of thesame insect host as Wigglesworthia) with a somewhat reduced,though compositionally unbiased (54% G�C), genome. Omit-ting Sodalis affected the placement of those endosymbiontsthat had clustered with it in the single-member tests; Ca.Blochmannia and Wigglesworthia moved to the base of theEnterobacteriales, while Ca. Baumannia moved to an interme-diate position (Fig. 5A and B, part i). Thus, the basal positionof the Buchnera strains was independent of Sodalis, while thepositions of the other endosymbionts were Sodalis dependentand therefore uncertain.

Instability of the Enterobacteriales root. The above experi-ments revealed instability in the root position of the Enterobac-teriales. Our all-taxon and Enterobacteriales-focused trees, mostof their taxon- or protein-jackknife support trees, and most ofthe endosymbiont study trees show Photorhabdus at the root(“Plu-root” in Fig. 5A), but a substantial number of membersof the last group show the Escherichia-Salmonella-Enterobacterclade at the root (“Eco-root” in Fig. 5A). These two frequentrooting positions are separated by two internodes that them-selves are strongly disfavored as rooting positions by the Ap-proximately Unbiased (AU) test (32) (Fig. 5C).

The Enterobacteriales present an interesting situation whereeach of two different trees are supported by nearly equal sub-sets of the protein families. When the A�T-rich endosymbi-onts were excluded and the Enterobacteriales protein familiesfor the remaining core taxa (and containing at least one out-group) were sorted into those favoring either the Plu- or theEco-root, the family counts were similar (464 and 428, respec-tively) (see Table S3 in the supplemental material). The Soda-lis effect on endosymbiont placement was reexamined with theEnterobacteriales Plu- or Eco-root-favoring family subsets andfound to occur with each subset (Fig. 5B, parts ii and iii). Thesetwo family subsets were not distinct in terms of functionalclassification (COG types) or in terms of fast- or slow-evolvinggenes as judged by total tree branch length for each family, norwere they substantially segregated when mapped to the E. coligenome. For each of the two family subsets, the supermatrixproduced a tree with the respective root, with 100% bootstrapsupport (Fig. 5B, parts ii and iii). One explanation for such asituation could be a massive horizontal transfer event followedby gene sorting; however, there is no single within-Enterobac-teriales genetic exchange path that would convert one tree to

2310 WILLIAMS ET AL. J. BACTERIOL.

on August 26, 2015 by guest

http://jb.asm.org/

Dow

nloaded from

Page 7: J. Bacteriol. 2010 Williams 2305 14

the other, because the alternate topologies are separated bytwo disfavored internodes. We identified the families amongour 356-family all-gammaproteobacterial data set correspond-ing to the Enterobacteriales Plu- or Eco-root-favoring families,and we built all-gammaproteobacterial trees with the superma-trix for each subset. This might have identified a gammapro-teobacterial source for a hypothetical massive horizontal gene

transfer event; instead, the two all-taxon family subsets pro-duced trees like that for the whole set, except again with therespective Enterobacteriales rooting. This test also ruled out theconcern that the alternate-root phenomenon was an artifact ofthe particular two outgroup genomes chosen for the Entero-bacteriales-focused data set, since the test effectively increasedthe outgroup from 2 to 89 members.

The observation of two large gene family subsets that favordifferent tree topologies appears to stand in contrast to earlierstudies that found high concordance among gammaproteobac-terial protein families (20, 24). These studies included onlythree genera (Escherichia, Salmonella, and Yersinia) from ourcore Enterobacteriales set; thus, the extra genomes in our study(Sodalis, Serratia, Photorhabdus, and Pectobacterium) may beresponsible for the difference. Omitting each of these genomesone by one from either the Plu- or Eco-root-favoring superma-trix had no effect on remaining tree topology, except that whenPhotorhabdus was omitted, both supermatrices produced theEco-root; this identified Photorhabdus as a disruptive taxon.These four taxa were also removed one by one from each ofthe 696 Enterobacteriales protein families that contain all coretaxa and both outgroups, assessing preference for the Plu- orEco-root (Fig. 5D). Omission of Photorhabdus made the fam-ilies more concordant, shifting the fraction of Plu-favoringfamilies from 0.54 to 0.31, confirming Photorhabdus as a dis-ruptor.

Final tree. Based on the difficulties with the Enterobacteria-les, its regional tree was reconstructed as follows. The A�T-rich genomes and Photorhabdus were omitted, and a tree wasbuilt only for the remaining taxa; support for each node bywhole-family and single-taxon jackknifing was 100%. Since thefour Buchnera genomes all separately appeared in the sameposition of this subtree, independently of Sodalis, they wereincluded in a second round of tree building, together withPhotorhabdus; these taxa were added back to the data set tobuild a larger tree, constraining to the topology of the initialsubtree (since without constraint the Buchnera and Photorhab-dus genomes altered that topology [Fig. 5B, part i]). Since itsmembership in the Enterobacteriales is unconfirmed, Ca. Car-sonella was excluded from the regional tree, as were theendosymbionts Ca. Baumannia, Ca. Blochmannia, and Wiggles-worthia, due to uncertainty arising from their sensitivity to thepresence of Sodalis.

The four regional trees were then merged back into theirpositions in the original all-taxon tree, yielding the final treedepicted in Fig. 3. Support values reported are the 50% whole-protein jackknife values for the all-taxon data set, the corre-sponding values from each regional subtree, and single-taxonjackknife values from each regional subtree. A few supportvalues were so low as to warrant the introduction of two mul-tifurcations in the Basal region and one in the VAAP region.

DISCUSSION

Multiprotein phylogeny produces more robust trees thansingle-gene approaches, which better serve bacteriologists ininterpreting biological data. A concrete example is the fullysupported clade marked by a star in Fig. 3 whose members(Pseudomonas, Marinomonas, Saccharophagus, etc.) share amolecular biological trait that is unique among bacteria: their

FIG. 5. Positions of the root and endosymbionts among the Entero-bacteriales. (A) Two root positions (Plu and Eco) were obtained in ourstudy; positions taken by A�T-rich endosymbionts (O and S) and twotested intermediate root positions (Int1 and Int2) are marked. Esch.,Escherichia; Salm., Salmonella; Enter., Enterobacter; lum., luminescens;outgp, outgroup. (B) Positions taken by endosymbionts and their effecton rooting. The full Enterobacteriales taxon set was reduced to a coreby removing all eight A�T-rich genomes (retaining the two out-groups), then each alone, or in groups of four, was added back to thedata set. The compositionally unbiased endosymbiont Sodalis was ex-cluded or not, and either the full set of Entero protein families (Fams)(part i) or subsets with at least one outgroup and favoring either thePlu-root (part ii) or the Eco-root (part iii) were used. The resultingroot position and the position taken by the tested endosymbiont isreported. *, bootstrap analysis was performed for these trees, withsupport measuring 100% for every node. Bau., Ca. Baumannia; Blo. orBloch., Ca. Blochmannia; Wig., Wigglesworthia; florid., floridanus;penn., pennsylvanicus. (C) Intermediate root positions disfavored.Each of the 696 protein families containing all core taxa and bothoutgroups were tested for preference of four hypothetical root po-sitions using the P value of the Approximately Unbiased test (pAU),and pAU is reported for the supermatrix (Supermtx) that combinedthese 696 protein families. A pAU of �0.05 is typically taken as arejection. (D) Identifying Photorhabdus as a disruptive taxon. Foreach of the 696 Entero families used in panel C, the indicated coretaxa were singly omitted and pAU was taken for either the Plu-Rootor Eco-Root topology, to determine the topology favored by thefamily; pAU was also taken for the two topologies based on thecorresponding supermatrix.

VOL. 192, 2010 GAMMAPROTEOBACTERIAL PHYLOGENY 2311

on August 26, 2015 by guest

http://jb.asm.org/

Dow

nloaded from

Page 8: J. Bacteriol. 2010 Williams 2305 14

S19 ribosomal protein operon contains a strong mimic of a 23SrRNA fragment that likely confers autoregulation on theoperon (38). This trait would have required a more complexexplanation than simple vertical inheritance according to 16SrRNA trees, since to our knowledge no such trees, includingour own, recover this clade. (It was recovered with weak sup-port in our 23S rRNA tree but was again lost in the combined16S/23S rRNA tree.)

Although performance is improved, multiprotein phylogenyis still subject to some of the same artifacts and difficulties assingle-gene approaches, such as long-branch attraction andpoor resolution of branches that are deep and short. Telescop-ing allowed data from more protein families to be applied toisolated subgroups of taxa, but this was still insufficient toresolve all nodes or solve difficulties with highly biased ge-nomes. We identified one genome as a phylogenetic attractor(Sodalis) and one as a disruptor (Photorhabdus); full explana-tion of the mechanisms of these effects will require furtherstudy and targeted genomic sequencing of more members ofthese clades.

This phylogenetic analysis based on hundreds of proteins forover a hundred taxa strongly supports the splitting of twofamilies (Alteromonadaceae and Pseudoalteromonadaceae) andthree orders (Alteromonadales, Pseudomonadales, and Oceano-spirillales). A fourth order, Thiotrichales, also appears split butwith lower support. Finally, the unity of the order Chromatialescould not be established. Thus, the classification based on 16SrRNA breaks down occasionally at the family level but morefrequently at the order level.

The tree assigns taxonomy to previously unclassified genomes:Reinekea to Oceanospirillaceae and Ca. Ruthia and CandidatusVesicomyosocius to Piscirickettsiaceae; the affiliation of Ca. En-doriftia with Methylococcales requires confirmation as more ge-nomes become available. For four genomes (Congregibacter andthe marine strains HTCC2080, HTCC2143, and HTCC2207) theclosest assigned genome was Saccharophagus of the split Al-teromonadaceae. Two genomes fell among the outgroup andare therefore not part of the Gammaproteobacteria. For one ofthese, Mariprofundus, this placement is consistent with previ-ous analysis of its 16S rRNA sequence that placed it as a deepbranch within the phylum Proteobacteria (8). The other newlyidentified non-gammaproteobacterium is Acidithiobacillus, sur-prisingly, since this genus has long been considered a memberof Gammaproteobacteria (16). A recent tree (K. P. Williams,unpublished data) prepared using 173 protein families from124 bacterium-wide genomes, chosen to represent each avail-able bacterial order, showed the same relationships of Acidi-thiobacillus and Mariprofundus to other proteobacteria as de-picted in Fig. 3, confirming that these two genera representindependent sister groups to the beta-/gammaproteobacteriaclade that arose after divergence from the Alphaproteobacteria.All these observations suggest revisions of the taxonomy atmultiple ranks in the class and phylum.

An early multiprotein study of the Gammaproteobacteriaused over 200 proteins, but only 13 genomes were available atthat time (20). Later studies have used far fewer protein fam-ilies (10 to 35 proteins) and fewer representative genomes (28to 55 genomes) than the present analysis (4, 6, 10, 40). Allthese studies and ours agree on the basal branching patternconnecting the five orders Enterobacteriales, Pasteurellales,

Vibrionales, Pseudomonas, and Xanthomonadales. One of thesenodes in particular (marked by an “X” in Fig. 3) has additionalsupport from a unique 4-amino-acid deletion in RpoB; how-ever, our results do not agree with two other suggestions basedon rare indels: (i) exclusion of Francisella from the Gamma-proteobacteria and (ii) a particular split of Alteromonadales(10). Some of the intermingling of bacterial orders observedhere had been noted in one of these studies (40), but in theother studies the taxa employed or the extent of multifurcationprecluded its detection.

Nearly half of the bacterial genomes currently available areincompletely sequenced, a fraction that may increase in thefuture, given short-read technologies and sequencing projectsthat do not include a goal of closing the genome. These in-complete genomes add greatly to the taxonomic diversity avail-able for study and are nearly as rich in protein information asthe complete ones, so they are worth including in such analysesdespite the minor problems they raise. However, some incom-plete genomes are contaminated with DNA from additionaltaxonomic sources and should be rejected; an rRNA impuritytest was employed here to identify some such mixed genomes.Although highly divergent rRNA alleles can occur within asingle genome (41), the caution exercised here was warranted;the highly divergent rRNA allele that we found in the incom-plete Azotobacter vinelandii genome, thereby rejecting the ge-nome, did not appear in the recently completed version of thisgenome project.

When our five multiprotein supermatrices (i) were parti-tioned according to the substitution matrix favored by eachfamily or (ii) had the new LG substitution matrix applieduniformly, nearly identical trees resulted, with only two casesof crossing nodes, and these nodes were the least well sup-ported in the whole tree. This agrees with a recent survey ofmany alignments and supermatrices that concluded that al-though model choice can affect tree topology, “it rarely affectsevolutionary inferences drawn from the data because differ-ences are mainly confined to poorly supported nodes” (27).

While both the jackknifing and bootstrapping approaches todetermining support values remove columns from the align-ment supermatrix, jackknifing does not then duplicate the re-maining columns and therefore produces smaller subsamplingsthat can speed processing for large data sets. We prefer therandom removal of whole proteins rather than single columnsacross all protein alignments, to better assess variation at acoarser granularity and as a double-check against possible re-maining incongruity of protein families. This 50% protein jack-knifing is a more stringent measure of support than the usualper-column bootstrapping, counteracting the misleadingly highsupport values that the latter brings to long supermatrices (23).

Compositional attraction through amino acid bias made itdifficult to infer the phylogeny of the insect endosymbiontgenomes, whose composition was the most biased in our taxonset. As in most previous studies, the endosymbionts joinedtogether in a clade, with Sodalis when present, when all wereincluded in a simple maximum likelihood analysis. When in-stead the endosymbionts were tested one at a time, Buchneraconsistently joined at the base of the Enterobacteriales; unity ofthe Buchnera genomes is strongly supported by gene orderstudies (4). Ca. Blochmannia and Wigglesworthia may also de-rive from this point yet consistently branch as sister to Sodalis

2312 WILLIAMS ET AL. J. BACTERIOL.

on August 26, 2015 by guest

http://jb.asm.org/

Dow

nloaded from

Page 9: J. Bacteriol. 2010 Williams 2305 14

when that taxon is included, jumping past two tree nodes. Ca.Baumannia appeared with Sodalis when present and, unlikeother endosymbionts, at nearby positions when absent, weaklysupporting the idea that Ca. Baumannia has a unique origin(Fig. 5A and B, part i). Our finding of multiple origins for theendosymbionts agrees with those of other studies that havesought to avoid compositional attraction artifacts, through theuse of either nonhomogeneous substitution matrices or anal-ysis of genomic rearrangements (4, 14). It would appear thatthe pattern of an Enterobacteriales member adopting the en-dosymbiotic lifestyle and becoming highly A�T-rich has oc-curred in multiple independent cases.

Surprisingly, nearly equal subsets of Enterobacteriales pro-teins favored one or the other of two root positions. Massivehorizontal transfer is a plausible explanation in principle, al-though we could not identify a particular single transfer path;the pattern could be a result of numerous transfers alongmultiple paths. It should be noted that our family selectionmethod removed the genes best known for horizontal transfer,those on genomic islands. Earlier studies of a classical set of 13gammaproteobacterial genomes that includes Escherichia,Salmonella, and Yersinia but not Sodalis, Pectobacterium, orPhotorhabdus found very few of the single-copy genes withevidence of horizontal transfer. We found that omittingPhotorhabdus shifted the support by protein families from anear balance for two root positions to a preponderance for oneof these roots, identifying Photorhabdus as a disruptor. Theambiguities left by this study show that the Enterobacterialespresent rich problems in phylogenetic reconstruction that arenot resolved simply by accumulating larger genome-scale pro-tein supermatrices. These problems are probably due both tohorizontal transfer that is especially favored by close contactwith diverse cohabitants in enteric environments and to theextensive genomic alterations in multiple isolated symbioticlineages. Future analysis of these problems within the Entero-bacteriales is favored by the detailed mechanistic informationon mutation and gene transfer known from E. coli and rela-tives.

Much of the failure to fully resolve the deeper regions of thetree can probably be ascribed to stochastic accumulation ofnoise that obscures reconstruction of short and ancient inter-nodes. It may be further compounded by residual cases ofhorizontal gene transfer that passed through our incongruencefilter. Compositional bias may be the greatest challenge facingstudies on such broad scales of bacterial phylogeny as this one.The Gammaproteobacteria make it clear that as clades developnucleotide bias, they can concomitantly develop amino acidbias (33), which would be expected to produce local asymme-tries in amino acid substitution rates. Simple statistical testswould have rejected 73% of the taxa from our study on thebasis of bias. Thus, our data set (like other data sets of broadphylogenetic scope) violated the typical assumptions of phylo-genetic reconstruction algorithms regarding homogeneity andreversibility of amino acid substitution. We were able to man-age the problem in some cases by the approach of placingindividual biased taxa in the absence of similarly biased attrac-tors. Some possible future solutions may be to counter bias ona per-column basis in the supermatrix, to use mixture models(19), and to use maximum-likelihood programs that do notdemand model homogeneity (9). Another promising area for

improving multiprotein phylogeny is the use of rare indels (10),especially if the detection of these markers was automated andif tree building could properly weight indel data in combina-tion with alignment data; most of these are removed in ourcurrent protocol.

The multiprotein approach to gammaproteobacterial phy-logeny, applied here with some methodological advances, hasimproved resolution for this challenging group, and we antic-ipate that additional advances are within reach that will furtherimprove performance of the approach. As more genomes ac-crue, the multiprotein approach should become a new stan-dard, supplanting 16S rRNA as the basis for phylogenetic re-constructions (1).

ACKNOWLEDGMENTS

This study was supported by USDA agreement 2006-55605-16655 toA.W.D., by DOD grant W911SR-04-0045 to B.W.S.S., by NIAID con-tract HHSN266200400035C to B.W.S.S., and by the Virginia Bioinfor-matics Institute. Computing support from Virginia Tech’s AdvancedResearch Computing facility is acknowledged.

We thank David Holmes and Jorge Valdes (Universidad NacionalAndres Bello, Santiago, Chile) for supplying the Acidithiobacillus ge-nome prior to publication.

REFERENCES

1. Bapteste, E., Y. Boucher, J. Leigh, and W. F. Doolittle. 2004. Phylogeneticreconstruction and lateral gene transfer. Trends Microbiol. 12:406–411.

2. Bapteste, E., E. Susko, J. Leigh, D. MacLeod, R. L. Charlebois, and W. F.Doolittle. 2005. Do orthologous gene phylogenies really support tree-think-ing? BMC Evol. Biol. 5:33.

3. Bateman, A., E. Birney, R. Durbin, S. R. Eddy, R. D. Finn, and E. L.Sonnhammer. 1999. Pfam 3.1: 1313 multiple alignments and profile HMMsmatch the majority of proteins. Nucleic Acids Res. 27:260–262.

4. Belda, E., A. Moya, and F. J. Silva. 2005. Genome rearrangement distancesand gene order phylogeny in gamma-Proteobacteria. Mol. Biol. Evol. 22:1456–1467.

5. Cannone, J. J., S. Subramanian, M. N. Schnare, J. R. Collett, L. M. D’Souza,Y. Du, B. Feng, N. Lin, L. V. Madabusi, K. M. Muller, N. Pande, Z. Shang,N. Yu, and R. R. Gutell. 2002. The comparative RNA web (CRW) site: anonline database of comparative sequence and structure information for ri-bosomal, intron, and other RNAs. BMC Bioinformatics 3:2.

6. Ciccarelli, F. D., T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P.Bork. 2006. Toward automatic reconstruction of a highly resolved tree of life.Science 311:1283–1287.

7. Edgar, R. C. 2004. MUSCLE: multiple sequence alignment with high accu-racy and high throughput. Nucleic Acids Res. 32:1792–1797.

8. Emerson, D., J. A. Rentz, T. G. Lilburn, R. E. Davis, H. Aldrich, C. Chan,and C. L. Moyer. 2007. A novel lineage of proteobacteria involved information of marine Fe-oxidizing microbial mat communities. PLoS One2:e667.

9. Galtier, N., and M. Gouy. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequenceevolution for phylogenetic analysis. Mol. Biol. Evol. 15:871–879.

10. Gao, B., R. Mohan, and R. S. Gupta. 2009. Phylogenomics and proteinsignatures elucidating the evolutionary relationships among the Gammapro-teobacteria. Int. J. Syst. Evol. Microbiol. 59:234–247.

11. Garrity, G. M., J. A. Bell, and T. G. Lilburn. 2005. Class III. Gammapro-teobacteria class. nov., p. 1. In D. J. Brenner, N. R. Krieg, J. T. Staley, andG. M. Garrity (ed.), Bergey’s manual of systematic bacteriology, 2nd ed., vol.2. Springer, New York, NY.

12. Gillespie, J. J. 2004. Characterizing regions of ambiguous alignment causedby the expansion and contraction of hairpin-stem loops in ribosomal RNAmolecules. Mol. Phylogenet. Evol. 33:936–943.

13. Gillespie, J. J., M. J. Yoder, and R. A. Wharton. 2005. Predicted secondarystructure for 28S and 18S rRNA from Ichneumonoidea (Insecta: Hymenop-tera: Apocrita): impact on sequence alignment and phylogeny estimation. J.Mol. Evol. 61:114–137.

14. Herbeck, J. T., P. H. Degnan, and J. J. Wernegreen. 2005. Nonhomogeneousmodel of sequence evolution indicates independent origins of primary en-dosymbionts within the enterobacteriales (gamma-Proteobacteria). Mol.Biol. Evol. 22:520–532.

15. Karlin, S. 2001. Detecting anomalous gene clusters and pathogenicity islandsin diverse bacterial genomes. Trends Microbiol. 9:335–343.

16. Kelly, D. P., and A. P. Wood. 2000. Reclassification of some species of

VOL. 192, 2010 GAMMAPROTEOBACTERIAL PHYLOGENY 2313

on August 26, 2015 by guest

http://jb.asm.org/

Dow

nloaded from

Page 10: J. Bacteriol. 2010 Williams 2305 14

Thiobacillus to the newly designated genera Acidithiobacillus gen. nov., Halo-thiobacillus gen. nov. and Thermithiobacillus gen. nov. Int. J. Syst. Evol.Microbiol. 50(part 2):511–516.

17. Kjer, K. M. 1995. Use of rRNA secondary structure in phylogenetic studiesto identify homologous positions: an example of alignment and data presen-tation from the frogs. Mol. Phylogenet. Evol. 4:314–330.

18. Le, S. Q., and O. Gascuel. 2008. An improved general amino acid replace-ment matrix. Mol. Biol. Evol. 25:1307–1320.

19. Le, S. Q., N. Lartillot, and O. Gascuel. 2008. Phylogenetic mixture modelsfor proteins. Philos. Trans. R. Soc. Lond. B Biol. Sci. 363:3965–3976.

20. Lerat, E., V. Daubin, and N. A. Moran. 2003. From gene trees to organismalphylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol.1:E19.

21. Li, L., C. J. Stoeckert, Jr., and D. S. Roos. 2003. OrthoMCL: identificationof ortholog groups for eukaryotic genomes. Genome Res. 13:2178–2189.

22. Nakabachi, A., A. Yamashita, H. Toh, H. Ishikawa, H. E. Dunbar, N. A.Moran, and M. Hattori. 2006. The 160-kilobase genome of the bacterialendosymbiont Carsonella. Science 314:267.

23. Phillips, M. J., F. Delsuc, and D. Penny. 2004. Genome-scale phylogeny andthe detection of systematic biases. Mol. Biol. Evol. 21:1455–1458.

24. Poptsova, M. S., and J. P. Gogarten. 2007. The power of phylogeneticapproaches to detect horizontally transferred genes. BMC Evol. Biol. 7:45.

25. Puigbo, P., Y. I. Wolf, and E. V. Koonin. 2009. Search for a “Tree of Life” inthe thicket of the phylogenetic forest. J. Biol. 8:59.

26. Rice, P., I. Longden, and A. Bleasby. 2000. EMBOSS: the European Molec-ular Biology Open Software Suite. Trends Genet. 16:276–277.

27. Ripplinger, J., and J. Sullivan. 2008. Does choice in model selection affectmaximum likelihood analysis? Syst. Biol. 57:76–85.

28. Ronquist, F., and J. P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogeneticinference under mixed models. Bioinformatics 19:1572–1574.

29. Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002.TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartetsand parallel computing. Bioinformatics 18:502–504.

30. Schulz, H. N., T. Brinkhoff, T. G. Ferdelman, M. H. Marine, A. Teske, andB. B. Jorgensen. 1999. Dense populations of a giant sulfur bacterium inNamibian shelf sediments. Science 284:493–495.

31. Scott, K. M., S. M. Sievert, F. N. Abril, L. A. Ball, C. J. Barrett, R. A. Blake,A. J. Boller, P. S. Chain, J. A. Clark, C. R. Davis, C. Detter, K. F. Do, K. P.Dobrinski, B. I. Faza, K. A. Fitzpatrick, S. K. Freyermuth, T. L. Harmer,L. J. Hauser, M. Hugler, C. A. Kerfeld, M. G. Klotz, W. W. Kong, M. Land,A. Lapidus, F. W. Larimer, D. L. Longo, S. Lucas, S. A. Malfatti, S. E.Massey, D. D. Martin, Z. McCuddin, F. Meyer, J. L. Moore, L. H. Ocampo,Jr., J. H. Paul, I. T. Paulsen, D. K. Reep, Q. Ren, R. L. Ross, P. Y. Sato, P.Thomas, L. E. Tinkham, and G. T. Zeruth. 2006. The genome of deep-seavent chemolithoautotroph Thiomicrospira crunogena XCL-2. PLoS Biol.4:e383.

32. Shimodaira, H. 2002. An approximately unbiased test of phylogenetic treeselection. Syst. Biol. 51:492–508.

33. Singer, G. A., and D. A. Hickey. 2000. Nucleotide bias causes a genomewidebias in the amino acid composition of proteins. Mol. Biol. Evol. 17:1581–1588.

34. Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylo-genetic analyses with thousands of taxa and mixed models. Bioinformatics22:2688–2690.

35. Talavera, G., and J. Castresana. 2007. Improvement of phylogenies afterremoving divergent and ambiguously aligned blocks from protein sequencealignments. Syst. Biol. 56:564–577.

36. Tian, Y., and A. W. Dickerman. 2007. GeneTrees: a phylogenomics resourcefor prokaryotes. Nucleic Acids Res. 35:D328–D331.

37. Waterfield, N. R., T. Ciche, and D. Clarke. 2009. Photorhabdus and a host ofhosts. Annu. Rev. Microbiol. 63:557–574.

38. Williams, K. P. 2008. Strong mimicry of an rRNA binding site for twoproteins by the mRNA encoding both proteins. RNA Biol. 5:145–148.

39. Williams, K. P., B. W. Sobral, and A. W. Dickerman. 2007. A robust speciestree for the alphaproteobacteria. J. Bacteriol. 189:4578–4586.

40. Wu, M., and J. A. Eisen. 2008. A simple, fast, and accurate method ofphylogenomic inference. Genome Biol. 9:R151.

41. Yap, W. H., Z. Zhang, and Y. Wang. 1999. Distinct types of rRNA operonsexist in the genome of the actinomycete Thermomonospora chromogena andevidence for horizontal transfer of an entire rRNA operon. J. Bacteriol.181:5201–5209.

2314 WILLIAMS ET AL. J. BACTERIOL.

on August 26, 2015 by guest

http://jb.asm.org/

Dow

nloaded from