Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

21
Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome R. Eric Collins and Paul G. Higgs Origins Institute and Dept. of Physics and Astronomy, McMaster University, Hamilton, Ontario L8S 4M1, Canada Any pair of closely related bacteria tends to share most—but not all—of the same genes. Usually, the number of core genes decreases slowly as new genomes are observed, while the size of the pangenome (the set of genes found on at least one of the genomes) increases because each genome contains new genes not observed in the others. We analyze 172 complete genomes of Bacilli to compare the properties of the pangenomes and core genomes of monophyletic subsets of this group, and to assess the capacities of evolutionary models to predict these properties. When subdivided into clades of different levels of diversity, including examples at the level of species, genus and higher taxonomic groups, more phylogenetically diverse groups are found to have smaller core genomes and larger pangenomes relative to the mean genome size for the group. The infinitely many genes (IMG) model is a population genetics model that allows prediction of the sizes of the core and pangenomes. It is based on the assumption that each new gene can arise only once, as would be the case for de novo gene invention or horizontal gene transfer from a highly diverse external gene pool. We consider variants of this model that allow for one or more classes of dispensable genes with different insertion and deletion rates, and for the possibility of a class of essential genes that must be present in all species. We show that the predictions of the model depend on the shape of the evolutionary tree that underlies the divergence of the genomes, and we calculate results for coalescent trees, star trees, and arbitrary phylogenetic trees of predefined fixed branch length. In general, we show that the IMG model is useful for comparison with experimental genome data both for species and for widely divergent groups. Software implementing many of the formulae described herein is provided at http://reric.org/work/pangenome/. Introduction One of the major surprises that has resulted from the analysis of large numbers of complete bacterial genomes over the past decade is that the gene content is highly vari- able, even among sets of closely related genomes. For any set of genomes analyzed, the core genome is the set of genes found in all genomes (i.e. the intersection of the gene sets on individual genomes) and the pangenome is the set of genes found on at least one of the genomes (i.e. the union of the sets of genes on individual genomes). For example, Welch et al. (2002) compared three genomes of Escherichia coli for which the mean number of genes was 4769. They found only 2996 genes in the core genome and 7638 genes in the pangenome. This illustrates a gen- eral point that has now been observed with many groups of bacteria: the core genome is always substantially smaller than the mean genome size and the pangenome is always substantially larger. A more recent analysis of 17 E. coli genomes (Rasko et al. 2008) finds that the core genome is reduced to about 2200 genes. The number of genes in the core genome, which we will call G core (n), depends on the num- ber of genomes in the sample, n. In the E. coli example, G core (n) continues to decrease slowly with n, which makes it difficult to estimate the limiting size of the core genome that would be reached if very large numbers of genomes were included. The number of genes in the pangenome, G pan (n), is about 13000 for the 17 E. coli genomes and Key words: bacteria, evolution, genome, pangenome Email: [email protected] Phone: (905) 525-9140 x26870 Fax: (905) 546-1252 Submitted as a Research Article to Mol. Biol. Evol. preprint generated October 5, 2011 URL http://reric.org/papers/pangenome.mbe.pdf increases rapidly as new genomes are added. If G pan (n) continues to increase with n for large n, the pangenome is said to be open, whereas if G pan (n) tends to a maximum limit for large n, the pangenome is said to be closed. The E. coli pangenome appears to be open, or least, G pan (n) is still far from reaching a limit with the available number of genomes. Open pangenomes have also been observed in Streptococcus agalactiae (Tettelin et al. 2005), Prochloro- coccus (Kettler et al. 2007), a combination of E. coli and Shigella (Touchon et al. 2009), Listeria (den Bakker et al. 2010), and in a study taking representative genomes from across the full range of bacteria (Lapierre and Gogarten, 2009). A closed pangenome was found in Clostridium dif- ficile (Scaria et al. 2010), where it was estimated that a sample of 26 genomes would be sufficient to capture the entire pangenome. However, this result may depend on the way the extrapolation to large n was done, and may not indicate a qualitative difference between C. difficile and other bacteria. A quantity closely related to the core and pangenome curves is the gene frequency spectrum G(k|n), which is the number of genes found in k genomes after analysis of n genomes. If G(k|n g ) is known for some current number, n g , of available genomes, then it is possible to calculate G core (n) and G pan (n) for all n 6 n g . Typically G(k|n) has a U-shaped distribution, with many genes found in only 1 or a small number of species, a substantial number of core genes found in all (or nearly all) species, and rather few genes at intermediate values of k. Examples of G(k|n) are given in the papers cited in the previous paragraph, and in our previous work (Collins et al. 2011). Another quantity related to the pangenome size is the number of ‘new’ genes that are found for the first time in the nth genome sequenced, i.e. G new (n)= G pan (n) - G pan (n - 1). It was found by Tettelin et al. (2005) that both G core (n) and G new (n) could be fitted quite well by simple functions that decreased exponentially with n. However, no theoret- the authors 2011

description

Infinitely Many Genes Model

Transcript of Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Page 1: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Testing the Infinitely Many Genes Modelfor the Evolution of the Bacterial Core Genome and Pangenome

R. Eric Collins and Paul G. HiggsOrigins Institute and Dept. of Physics and Astronomy, McMaster University, Hamilton, Ontario L8S 4M1, Canada

Any pair of closely related bacteria tends to share most—but not all—of the same genes. Usually, the number of coregenes decreases slowly as new genomes are observed, while the size of the pangenome (the set of genes found on atleast one of the genomes) increases because each genome contains new genes not observed in the others. We analyze 172complete genomes of Bacilli to compare the properties of the pangenomes and core genomes of monophyletic subsetsof this group, and to assess the capacities of evolutionary models to predict these properties. When subdivided intoclades of different levels of diversity, including examples at the level of species, genus and higher taxonomic groups,more phylogenetically diverse groups are found to have smaller core genomes and larger pangenomes relative to themean genome size for the group. The infinitely many genes (IMG) model is a population genetics model that allowsprediction of the sizes of the core and pangenomes. It is based on the assumption that each new gene can arise onlyonce, as would be the case for de novo gene invention or horizontal gene transfer from a highly diverse external genepool. We consider variants of this model that allow for one or more classes of dispensable genes with different insertionand deletion rates, and for the possibility of a class of essential genes that must be present in all species. We showthat the predictions of the model depend on the shape of the evolutionary tree that underlies the divergence of thegenomes, and we calculate results for coalescent trees, star trees, and arbitrary phylogenetic trees of predefined fixedbranch length. In general, we show that the IMG model is useful for comparison with experimental genome data bothfor species and for widely divergent groups. Software implementing many of the formulae described herein is providedat http://reric.org/work/pangenome/.

Introduction

One of the major surprises that has resulted from theanalysis of large numbers of complete bacterial genomesover the past decade is that the gene content is highly vari-able, even among sets of closely related genomes. For anyset of genomes analyzed, the core genome is the set ofgenes found in all genomes (i.e. the intersection of thegene sets on individual genomes) and the pangenome isthe set of genes found on at least one of the genomes (i.e.the union of the sets of genes on individual genomes). Forexample, Welch et al. (2002) compared three genomes ofEscherichia coli for which the mean number of genes was4769. They found only 2996 genes in the core genomeand 7638 genes in the pangenome. This illustrates a gen-eral point that has now been observed with many groups ofbacteria: the core genome is always substantially smallerthan the mean genome size and the pangenome is alwayssubstantially larger.

A more recent analysis of 17 E. coli genomes(Rasko et al. 2008) finds that the core genome is reducedto about 2200 genes. The number of genes in the coregenome, which we will call Gcore(n), depends on the num-ber of genomes in the sample, n. In the E. coli example,Gcore(n) continues to decrease slowly with n, which makesit difficult to estimate the limiting size of the core genomethat would be reached if very large numbers of genomeswere included. The number of genes in the pangenome,Gpan(n), is about 13000 for the 17 E. coli genomes and

Key words: bacteria, evolution, genome, pangenome

Email: [email protected]: (905) 525-9140 x26870Fax: (905) 546-1252

Submitted as a Research Article to Mol. Biol. Evol.preprint generated October 5, 2011URL http://reric.org/papers/pangenome.mbe.pdf

increases rapidly as new genomes are added. If Gpan(n)continues to increase with n for large n, the pangenome issaid to be open, whereas if Gpan(n) tends to a maximumlimit for large n, the pangenome is said to be closed. TheE. coli pangenome appears to be open, or least, Gpan(n)is still far from reaching a limit with the available numberof genomes. Open pangenomes have also been observed inStreptococcus agalactiae (Tettelin et al. 2005), Prochloro-coccus (Kettler et al. 2007), a combination of E. coli andShigella (Touchon et al. 2009), Listeria (den Bakker et al.2010), and in a study taking representative genomes fromacross the full range of bacteria (Lapierre and Gogarten,2009). A closed pangenome was found in Clostridium dif-ficile (Scaria et al. 2010), where it was estimated that asample of 26 genomes would be sufficient to capture theentire pangenome. However, this result may depend on theway the extrapolation to large n was done, and may notindicate a qualitative difference between C. difficile andother bacteria.

A quantity closely related to the core and pangenomecurves is the gene frequency spectrum G(k|n), which isthe number of genes found in k genomes after analysis ofn genomes. If G(k|ng) is known for some current number,ng, of available genomes, then it is possible to calculateGcore(n) and Gpan(n) for all n 6 ng. Typically G(k|n) hasa U-shaped distribution, with many genes found in only1 or a small number of species, a substantial number ofcore genes found in all (or nearly all) species, and ratherfew genes at intermediate values of k. Examples of G(k|n)are given in the papers cited in the previous paragraph,and in our previous work (Collins et al. 2011). Anotherquantity related to the pangenome size is the number of‘new’ genes that are found for the first time in the nthgenome sequenced, i.e. Gnew(n) = Gpan(n)−Gpan(n−1).It was found by Tettelin et al. (2005) that both Gcore(n)and Gnew(n) could be fitted quite well by simple functionsthat decreased exponentially with n. However, no theoret-

« the authors 2011

Page 2: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 2

ical model was given to explain why this form of fittingfunction should apply. It is the main objective of our cur-rent paper to consider ways of predicting the shape of thecore and pangenome curves and the gene frequency spec-trum beginning from explicit evolutionary models.

An approach that we consider to be very promis-ing is the infinitely many genes (IMG) model, introducedby Baumdicker et al. (2010). The name IMG is a ref-erence to the infinitely many alleles and infinitely manysites models, which are well known in population genetics(Hein et al. 2005). In the infinitely many alleles model, itis assumed that an infinite number of possible alleles couldexist at a locus; hence, each new mutation gives rise to anew allele that did not exist previously. In the infinitelymany sites model, it is assumed that there are an infinitenumber of possible sites at which a mutation could oc-cur; hence, each new mutation occurs at a different site.Similarly, in the IMG model, it is assumed that an infinitenumber of possible genes might arise in a genome; hence,each new gene that arises is different from all previousgenes. Baumdicker et al. (2010) consider a population ofgenomes evolving according to the neutral Wright-Fishermodel. New genes arise in each genome at a constant rate.Genes spread to new genomes when the lineages branchas time goes forward in the Wright-Fisher model. It isalso supposed that each gene currently in a genome canbe deleted at constant rate. The model allows the func-tions Gcore(n), Gpan(n) and G(k|n) to be calculated in anelegant way as a function of the evolutionary parameters(insertion and deletion rates and effective population size).The model has not yet been tested on many data sets. Ourintention here is to test the IMG model on a variety ofsets of genomes, and to compare the results with the pre-dictions of several alternative evolutionary models that wederive here.

The IMG model touches on fundamental questionsabout the mechanisms of bacterial genome evolution,including the processes by which new genes arise ingenomes and the possibility of exchange of genes betweengenomes by horizontal gene transfer (HGT). The funda-mental assumption of the IMG model is that a given genecan be introduced into a population of genomes only once.There are several reasons to think that this might often betrue. The origin of a new open reading frame from a non-coding region is presumed to be rare, but if it occurs, it willlead to a new sequence that is unlikely to have similarityto previous genes; hence it will fit the IMG assumption. Anew gene might also be created by modification of an ex-isting gene by a rapid burst of substitutions, by a substan-tial insertion or deletion, or by shuffling of domains withinthe gene. If the gene becomes sufficiently divergent fromthe other related sequences, it will no longer be classed asbelonging to the same gene family by the similarity crite-ria used for clustering. Thus a new gene will have arisenwithin a single lineage. We presume that this kind of geneorigin is fairly frequent, and it is also likely to create aunique sequence that has not been seen before. Alterna-tively, the new gene could arise by HGT of a gene fromoutside the group of genomes being studied into one of

the genomes in the group. If the diversity of genes in theenvironment is sufficiently large, it will be unlikely thatthe same type of gene is inserted more than once into thegroup being studied, and this will again fit the assumptionsof the IMG model. On the other hand, if the pool of genesavailable for HGT contains a smaller number of genes ofhigh frequency, it is possible that a gene of the same typecould get inserted more than once into the set of genomesunder study. This should show up in systematic deviationsfrom the predictions of the IMG model.

The IMG also assumes that the presence or absenceof a gene does not affect the fitness of the organism. Thegenealogy of the genomes in the sample therefore fol-lows the standard coalescent model for neutral evolution(Hein et al. 2005). The IMG model does allow for thepresence of a class of essential genes that must be presentin every genome, but these essential genes do not affectthe evolution of the non-essential genes or the shape ofthe coalescent tree. It is unlikely that the fairly restrictiveassumptions of the model are exactly true. Nevertheless,we view the IMG model as a useful null model of neutralgenome evolution. Building more complex models fromthis starting point should allow identification of any im-portant additional factors. In fact, as we shall see below,the IMG model and the variants we consider here stand upsurprisingly well to analysis of a large number of data setsof bacterial genomes.

GenomicsGene Clustering

Complete nucleic acid and translated proteome se-quences were obtained from the NCBI Genomes databasefor all 172 Bacteria from the Class Bacilli available inApril 2011 (Table S1). Only complete genome sequenceswere used – no incomplete or draft genomes were in-cluded. Genes encoded on plasmids associated with eachgenome sequence were also included, and were treated aspart of the genome. BLAST databases were created foreach genome with all amino acid sequences encoded byeach genome, and all-against-all searches were performedusing blastp (v. 2.221). For each pair of genomes, in-cluding self pairs, each peptide sequence was used asa query against every peptide sequence in the pairedgenome, and the process was repeated in the reverse direc-tion. A direct link was counted between two genes if eitherBLAST E-value was less than 1e-30 for searches in bothdirections, and if the length of the locally aligned regionfound by BLAST was longer than 70% of the length of thelonger of the two sequences. Sequences were grouped intoclusters using the single-link cluster procedure, i.e. se-quences are part of the same cluster if there is a direct linkbetween them or if there is a chain of direct links that con-nects them. These clustering methods were implementedwith custom Perl scripts that are available from the au-thors. We have previously used these clustering methodsfor analysis of other sets of genomes (Collins et al. 2011)and we have considered the effects of changing the cutoffE-value and the minimum length criterion. In this paper

Page 3: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 3

Table 1 Comparison of properties of subsets of genomes of Bacilli. Abbreviations are as follows: ng, number of genomes in taxonomic group;Ngenes, average number of genes per genome; G0, average number of gene families per genome; Gcore, number of clusters of gene families in coregenome; Gpan, number of clusters in pangenome; dprot , average number of amino acid substitutions per site expected since the most recent commonancestor of the taxa group.

Taxonomic group ng Ngenes G0Ngenes

G0Gcore

GcoreG0

GpanGpanG0

dprot

Staphylococcus aureus 15 2668 2212 1.21 1532 0.69 5522 2.50 0.003Streptococcus pyogenes 13 1853 1593 1.16 1043 0.65 4096 2.57 0.006Streptococcus pneumoniae 14 2129 1787 1.19 1141 0.64 5509 3.08 0.008Listeria spp. 9 2888 2330 1.24 1718 0.74 4430 1.90 0.057Bacillus cereus group 20 5419 4274 1.27 1854 0.43 18801 4.40 0.068Staphyloccoccus spp. 22 2620 2185 1.20 994 0.45 9473 4.34 0.330Streptococcus spp. 50 2000 1693 1.18 487 0.29 16583 9.79 0.355“Bacillaceae 1” 38 4684 3718 1.26 929 0.25 32260 8.68 0.426Lactobacillales 31 2152 1736 1.24 356 0.21 16357 9.42 0.899Bacilli 172 2984 2417 1.23 143 0.06 97860 40.49 1.136

we present only one set of clusters obtained with conser-vative clustering parameters, but similar results were ob-tained for more relaxed clustering parameters as well.

Genome Phylogeny

Gene families present in all genomes and havingexactly one gene per genome were used to build agenome phylogeny. The translated amino acid sequencesof genes from 55 such single-copy clusters (Table S2)were aligned using MUSCLE v3.7 (Edgar 2004) with pa-rameter ’-diags’. Each cluster was aligned indepen-dently; alignments were then concatenated and input intoPhyML v3.0.1 (Guindon et al. 2010) for phylogenetictree construction using default parameters. The multiplesequence alignment and phylogenetic trees are archivedon TreeBASE (URL http://bit.ly/palfvN). Thecomplete tree is shown in Figure S1. From within this treewe selected subsets of genomes for further analysis, asshown in Fig. 1 and Fig. S1. The selected clades are char-acterized by relatively long internal branches separatingthem from other parts of the tree and are strongly sup-ported by the approximate likelihood ratio test (aLRT),obtained from the program PhyML (Fig. S1). The cladesspan multiple taxonomic levels, including species-specificgroups (Bacillus cereus group, Staphylococcus aureus,Streptococcus pneumoniae, and Streptococcus pyogenes),genus-specific groups (Staphylococcus spp., Listeria spp.,and Streptococcus spp.), and higher-level groups: Fam-ily “Bacillaceae 1” (Ludwig et al. 2009), Order Lac-tobacillales, and Class Bacilli. The definitions of taxo-nomic groups like ‘species’ and ‘genus’ are far from set-tled within the Bacteria, and the gene trees may not befully consistent with each other because of horizontalgene transfer, so our intention is not to produce a fullyresolved tree for these genomes. However, the phyloge-netic tree clearly shows that the selected groups are mono-phyletic and, except for the Lactobacillales, nearly ultra-metric, making them particularly suitable for the evolu-tionary analyses presented here.

Core genome and pangenome properties

Gene clusters created by the above method may con-tain more than one gene from the same genome in somecases. We call a group of paralogous genes in the samegenome a gene family. We deal with gene families, notindividual genes, in the following data analysis. Addi-tionally, the IMG model accounts for gain and loss ofgene families but does not account for duplication withina genome; therefore, it also makes sense to work at thefamily level from the point of view of model comparison.Table 1 shows some useful statistics with which to com-pare the different subsets of genomes. In this table, ng isthe number of genomes in each subset, and Ngenes is themean number of genes for genomes in the set. G0 is themean number of gene families in a genome. The meannumber of genes per family is Ngenes/G0, which is closeto 1.2 in every case. As this ratio is only slightly largerthan 1, this indicates that most genes are in single genefamilies. We have previously considered the distributionof sizes of families of paralogues in individual genomes(Collins et al. 2011). There are a substantial number ofparalogues in most genomes, and there are a small numberof gene families with large numbers of genes. However,the majority of genes in any one genome are in single-gene families, and for the purpose of measuring the coreand pangenome sizes, we do not lose much informationby working at the family level.

The numbers of gene clusters in the core andpangenomes for each data set are shown in Table 1. Itis useful to measure these as ratios relative to the meannumber of gene families per genome. The core ratio,Gcore/G0, is in the range 0.6–0.7 for the three species-levelsets (Staphylococcus aureus, Streptococcus pyogenes andStreptococcus pneumoniae), i.e. it is substantially lessthan one, even for genomes nominally in the same species.For the full set of Bacilli, the core ratio is only 0.06, indi-cating that this group is quite diverse with relatively fewgenes conserved across the whole group. The pangenomeratio, Gpan/G0, is in the range 2–3 for the species-leveldata sets and increases to more than 40 for the full set of

Page 4: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 4

Bacilli. The final column shows dprot , the mean phyloge-netic distance on the protein evolution tree (Figs. 1 andS1) from the common ancestor of the group to the tipsof the branches for each genome, measured in units ofamino acid substitutions per site. This phylogenetic dis-tance is a measure of the amount of protein sequence evo-lution within each group, and should be an indicator of thetime since divergence of the group. The datasets are listedin order of increasing dprot . It can be seen that Gpan/G0increases and Gcore/G0 decreases as dprot increases, aswe would expect if the set of genomes becomes more di-verse as the time since the common ancestor increases.The Listeria data set stands out in this table, as it has ahigher Gcore/G0 and a lower Gpan/G0 than we would ex-pect, given the observed amount of protein sequence evo-lution. This probably indicates that the Listeria group isless prone to insertion and deletion of genes than mostother bacteria, although it could also be interpreted asfaster protein sequence evolution within this genus. DenBakker et al. (2010) also commented that although thepangenome of Listeria is not closed, there seems to be avery limited on-going introduction of new genetic materialfrom external gene pools.

Evolutionary ModelsGenome Evolution on a Coalescent Tree

We will now discuss theoretical models that can beused to interpret the core and pangenome sizes. Our mainfocus in this paper will be the IMG model. However, wewill begin with a model in which there are a finite numberM of gene families, each of which can be either presentor absent from a genome. We will refer to this as thefinitely many genes (FMG) model, and we will show thatif the limit M→∞ is taken in a suitable way, the modelbecomes the IMG model that was previously defined byBaumdicker et al. (2010). This provides an alternative wayof deriving the results of the IMG that may be simpler thanthe original.

We suppose that each gene family can be insertedinto a genome at a rate a if it is not already present, andeach family that is present can be deleted at a rate v. Weconsider a population of Ne haploid genomes evolving ac-cording to the neutral Wright-Fisher model. Let x be thefraction of the population that possesses one particularfamily, and let φ(x) be the probability that the family hasfrequency x. The insertion and deletion problem is equiv-alent to the problem of recurrent mutation between twoalleles, for which the stationary distribution φ(x) is wellknown (Wright, 1969):

φ(x) =Γ(σ +ρ)

Γ(σ)Γ(ρ)xσ−1(1− x)ρ−1 (1)

where σ = 2Nea and ρ = 2Nev. The mean value of x isσ /(σ+ρ); hence the mean number of families present inone genome is

G0 =Mσ

σ +ρ(2)

If a sample of n genomes is chosen from the popu-lation, the genealogy of the genomes is described by the

coalescent process (Kingman, 1982; Hein et al. 2005). Atypical coalescent tree (Fig. 2a) has many short branchesclose to the tips and relatively few longer branches closeto the root. The time to coalescence of all the lineages isroughly 2Ne generations, although there are large fluctua-tions about this average.

The probability that k of the sampled genomes con-tain the family, given that its frequency in the populationis x and that n genomes were sampled, is a binomial dis-tribution:

g(k|x,n) = n!k!(n− k)!

xk(1− x)n−k (3)

The gene family frequency spectrum, G(k|n), is the ex-pected number of families that will be found in k genomesafter sampling n. This can be found by integrating over thedistribution of x, assuming that the frequencies of each ofthe M possible families are independent:

G(k|n) = M1∫0

φ(x)g(k|x,n)dx

= Mn!

k!(n− k)!Γ(k+σ)

Γ(σ)

Γ(n− k+ρ)

Γ(ρ)

Γ(σ +ρ)

Γ(n+σ +ρ)(4)

For numerical evaluation of this result, it is useful to notethat the ratio of gamma functions can be written as

Γ(k+σ)

Γ(σ)= (k−1+σ) . . .σ =

k−1

∏i=0

(i+σ) (5)

and equivalent formulae apply for the other ratios. Thenumber of clusters in the core genome is the number offamilies present in every genome:

Gcore(n) = G(n|n)

= MΓ(n+σ)

Γ(σ)

Γ(σ +ρ)

Γ(n+σ +ρ)

(6)

G(0|n) is the number of possible types of families that arenot observed in any of the sampled genomes. The numberof clusters in the pangenome is the number of families thatare present in at least one of the sampled genomes:

Gpan(n) = M−G(0|n)

= M−MΓ(n+ρ)

Γ(ρ)

Γ(σ +ρ)

Γ(n+σ +ρ)

(7)

This completes the results for the FMG case. To ob-tain the IMG results, we note that the overall rate of inser-tion of all types of gene family is u = Ma. It is useful todefine the parameter θ = 2Neu = 2NeMa = Mσ . We takethe limit M→∞, and a→0 keeping u constant, or equiva-lently, σ→0 keeping θ constant. The results for the IMGmodel depend on θ and ρ . These are the same parametersused by Baumdicker et al. (2010). By taking the IMG limitin Equations (2), (4), (6) and (7), we obtain

G0 = θ/ρ (8)

G(k|n) = θ

kn . . .(n− k+1)

(n−1+ρ) . . .(n− k+ρ)(9)

Gcore(n) =θ(n−1)!

(n−1+ρ) . . .ρ(10)

Page 5: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 5

Gpan(n) = M−M(n−1+ρ) . . .ρ

(n−1+σ +ρ) . . .(σ +ρ)

= M−Mn−1

∏k=0

11+ σ

k+ρ

= M−M

(1−σ

n−1

∑k=0

1k+ρ

)

= θ

n−1

∑k=0

1k+ρ

(11)

The pangenome size continues to increase with n,hence the pangenome is open in the IMG model. FromEquation (11), Gpan(n) increases roughly as ln(n) for largen, i.e. it increases less rapidly than linearly. In contrast,the pangenome is closed in the FMG model because therecannot be more than M genes (as in Equation (7)).

So far it was assumed that the same parameters applyfor all gene families. Thus there are two parameters, θ andρ , in the IMG case. Families are assumed to be ‘dispens-able’, i.e. they can be inserted or deleted without affectingthe fitness. A natural extension of the model is to considera second class of essential families that are present in allgenomes and cannot be inserted and deleted. This intro-duces an extra parameter, Gess, the number of essentialgene families. This changes the above results simply byadding the constant Gess to the equations for G0, G(n|n),Gcore(n) and Gpan(n). The gene family frequency spec-trum G(k|n) is unaltered for k < n. A second extension ofthe model is to consider two classes of dispensable fami-lies with different parameters. In this case, there are fourparameters θ1, θ2, ρ1 and ρ2. All the formulae for the two-class models are simply the sum of two terms of the sameform as the single class models.

Genome Evolution on a Star Tree

In the previous section, it was assumed that thegenomes evolve on a coalescent tree. The coalescent pro-cess generates an ensemble of possible trees with well-defined statistical properties. The results for Gcore(n) andGpan(n) etc. are expectation values averaged over all pos-sible coalescent trees. However, results for any one real-ization of a coalescent tree may differ substantially fromthe mean. The coalescent arises in population geneticswhen we consider the genealogy of individuals within aspecies. Here, we want to study groups of genomes thatextend beyond the species level. In this case it is moreappropriate to consider a phylogenetic tree across speciesrather than a coalescent tree within species. The process ofspeciation/diversification giving rise to the genomes in theavailable data is not necessarily equivalent to a coalescent.Therefore, we wish to consider other types of tree in orderto determine the extent to which the predicted behavior ofthe core and pangenomes depends on the shape of the tree.

Here we will consider the star phylogeny shown inFigure 2b, where there is a radiation of lineages at theroot at a time t in the past. We choose the star phylogenyfor three reasons. Firstly, its shape is very different from

a coalescent tree; therefore, if the results are dependenton the tree shape, we would expect a clear difference be-tween these cases. Secondly, the star phylogeny is a sim-ple case for which an analytical solution is easy to ob-tain. Thirdly, we are motivated by the observation of Tet-telin et al. (2005) that the pangenome size appears to in-crease linearly with n for large n, whereas according toEquation (11), the pangenome should increase approxi-mately as ln(n) for large n. We will now show that if evo-lution occurs on a star phylogeny instead of a coalescenttree, then the pangenome does in fact increase linearlywith n.

Consider a set of genomes evolving according to theIMG model with a single class of dispensable gene fam-ilies with deletion rate v, and overall insertion rate u asbefore. Suppose that the genome at the root of the starphylogeny is a typical genome described by this model. Ittherefore contains G0 = u/v gene families. The probabil-ity that a dispensable family is retained on one branch fortime t without deletion is e−vt . The core families are thosethat are retained on all n branches. Therefore, the numberof core families is

Gcore(n) =uv

e−nvt for n > 2 (12)

For a single genome, all families are in the core by def-inition, so Gcore(1) = G0. The probability that a dispens-able family is retained on at least one of the n branches isone minus the probability that it is deleted on all branches.Hence, the number of dispensable families retained sincethe root is

Gret =uv

(1− (1− e−vt)n) (13)

Let Ggain be the number of families that were not presentat the root and are gained along one branch of length t. Wemay write

dGgain

dt= u− vGgain (14)

from which

Ggain =uv(1− e−vt) (15)

There will be Ggain gene families gained on each of the nbranches. Hence the size of the pangenome is

Gpan(n) = Gret +nGgain

=uv

(1− (1− e−vt)n)+ nu

v(1− e−vt)

(16)

From this, the number of new families found for the firsttime in the nth genome is

Gnew(n) =uv

e−vt(1− e−vt)n−1 +uv(1− e−vt) (17)

In order to facilitate comparison between the starphylogeny and the coalescent, it is useful to define θ = ut,and ρ = vt. Note that the time to the root in the star is t,whereas the typical time to the root in a coalescent treeis 2N; hence θ and ρ mean almost the same thing in thetwo cases, however the shape of the tree affects the waythe core and pangenome depend on θ and ρ . As with thecoalescent case, it is possible to add a number Gess of es-sential gene families, or to consider two classes of dis-pensable families with insertion and deletion rates θ1, θ2,

Page 6: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 6

ρ1 and ρ2. Thus, we now have four models that we cancompare to real data: either a coalescent tree or a star phy-logeny can be used, and in each case, we can considereither one class of dispensable families and one class ofessential families, or two classes of dispensable familieswith different insertion and deletion rates.

Before proceeding to data analysis, we wish to makethe connection between the star phylogeny model and thework of Tettelin et al. (2005). These authors found that thenumbers of new and core gene families in Streptococcuscan be fitted by simple exponential decay functions:

Gcore(n) = κc exp(−n/τc)+Ωc (18)

Gnew(n) = κs exp(−n/τs)+Ωs (19)Equations (18) and (19) are equivalent to those given inthe captions to Figures 2 and 3 of Tettelin et al. (2005).We have used the notation Ωs in Equation (19) instead oftg(θ), which was used by Tettelin et al., in order to avoidconfusion with the parameter θ in the IMG model. If wewrite the results from Equations (12) and (17) in terms ofθ and ρ , and we explicitly include the essential familiesin the core, we obtain:

Gcore(n) =θ

ρe−nρ +Gess for n > 2 (20)

Gnew(n) =θe−ρ

ρ(1− e−ρ)(1− e−ρ)n +

θ

ρ(1− e−ρ) (21)

It can be seen that Equation (20) is a simple exponentialdecay, as in Equation (18) if we identify the parametersas Ωc = Gess, κc = θ/ρ , and 1/τc = ρ . Equation (21) isalso equivalent to Equation (19) if we identify the param-eters Ωs, κs and τs in a suitable way. Note that the numberof new families approaches a constant for large n, so thepangenome increases linearly with n for large n. The expo-nential decay functions were originally given as empiricalfitting functions with no theoretical model to justify them.We now see that these fitting functions are the expectedresults for an IMG model on a star phylogeny. However,in our interpretation, there are only three independent pa-rameters Gess, θ and ρ , and all six parameters in Equa-tions (18) and (19) depend on these three.

Model FittingFitting to the core and pangenome sizes

When calculating Gcore(n) and Gpan(n) for the realdata, the result depends on which order the genomes areconsidered. For each of the data sets in Table 1, we con-sidered many random permutations of the ng genomes andcalculated the core and the pangenome curves for n in-creasing from 1 to ng. If the results from the many permu-tations are plotted on one graph, this generates a spreadof points, shown as the shaded area in Figures 3 and 4.The mean values for Gcore(n) and Gpan(n) can be obtainedby averaging over the results for the different permuta-tions. However, a more direct way to obtain these meanfunctions is to use the gene family frequency spectrum,G(k|ng), which is a function of the full data set and does

not depend on the permutation. Consider a family that ispresent in k genomes out of ng. If n genomes are sampled,the probability that the family is present in all n is

Pall(n,k) =

k...(k−n+1)

ng...(ng−n+1) if n 6 k,

0 if n > k.(22)

Therefore the mean size of the core genome is

Gmeancore (n) =

ng

∑k=n

G(k|ng)Pall(n,k) (23)

The probability that the family is absent in all n is

Pabs(n,k) =

(ng−k)...(ng−k−n+1)

ng...(ng−n+1) if n 6 ng− k,

0 if n > ng− k.(24)

The probability that this family is in at least one of the n is1−Pabs(n,k). Therefore the mean size of the pangenomeis

Gmeanpan (n) =

ng

∑k=1

G(k|ng)(1−Pabs(n,k)) (25)

We denote the core and pangenome curves for anyone particular permutation of the genomes in the data asGperm

core (n) and Gpermpan (n). The root mean square deviation

per point between the permutation and the mean is

RMS(perm) =

(1

2ng

( ng

∑n=1

(Gperm

core (n)−Gmeancore (n)

)2+

ng

∑n=1

(Gperm

pan (n)−Gmeanpan (n)

)2))1/2

(26)

We define RMS(data) as the mean value ofRMS(perm) averaged over permutations. RMS(data) is ameasure of the spread of points around the mean, i.e. it isa measure of the inherent uncertainty in the data. Table 2gives RMS(data) for each of the data sets, calculated from10000 random permutations. The units of RMS(data) aregene families, i.e. this value may be compared to G0 inTable 1.

Each theoretical model was fitted to the mean ofthe data using the Nelder-Mead least-squares optimizationroutine in the open-source statistical software package R(R Development Core Team 2011). The squared devia-tions for both the core and the pangenome were used inthe minimization because a good model should be able tofit both curves at the same time. The root mean square de-viation per point is

RMS(theory) =

(1

2ng

( ng

∑n=1

(Gtheory

core (n)−Gmeancore (n)

)2+

ng

∑n=1

(Gtheory

pan (n)−Gmeanpan (n)

)2))1/2

(27)

where the superscript ‘theory’ denotes any one of the the-oretical models described above. For each model, we im-posed a constraint on the parameters such that the mean

Page 7: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 7

number of families per genome, G0, is the same in thetheory as the data. This reduces the number of free pa-rameters by one in each model and means that the theorycurve passes exactly through the mean of the data for then = 1 point in both the core and pangenome curves. Af-ter the parameters are chosen to minimize RMS(theory),the ratio RMS(theory)/RMS(data) is a useful measure ofthe quality of fit of a theoretical model to the data. Thesmaller the ratio, the better the fit. If the ratio is less than1, the theory deviates less from the mean than does a typ-ical permutation, i.e. the curve falls within the spread ofpoints generated by the permutations. A well fitting modelshould have this ratio less than 1.

Of the four models fit to each dataset, the co-alescent model with two classes of dispensable genefamilies (model 2D) was always the best, with aRMS(theory)/RMS(data) much less than 1 in every case(Table 2). The 2D model is a good fit for every casetested, unlike the original IMG model of Baumdicker et al.(2010), which uses a coalescent model with one class ofdispensable gene families plus one class of essential fam-ilies (model D+E). The fitting ratio of model D+E is morethan 1 in several data sets, and is always much worsethan the 2D model. This is shown clearly in the exam-ple of Streptococcus pneumoniae (Fig. 3), where the bestfitting model (2D on the coalescent) fits both the core andpangenome curves exceptionally well. In this example, theD+E models dramatically fail to fit the pangenome andcore genome curves simultaneously. Allowing two classesthat evolve at different rates is thus a significant improve-ment over using a strictly conserved category.

When comparing the D+E models on the star tree andthe coalescent, the star tree is sometimes a better fit. How-ever, the 2D model on the coalescent always fits betterthan the 2D model on the star tree, so in general there isno reason to choose the star tree over the coalescent. In theexample of a more diverse group, “Bacillaceae 1”, the 2Dmodel on the coalescent clearly fits the pangenome better(Fig. 4) because the star model incorrectly predicts a lin-early increasing pangenome. None of the models performparticularly well in fitting the core genome curve for thisdataset because of the shape of the underlying tree, whichwe discuss in further detail below.

The fitted values for the insertion and deletion pa-rameters (θ and ρ) vary greatly between the clades for the2D model on the coalescent (Table 3). In general they arelarger for the more diverse groups (i.e. those for whichdprot is larger; Table 1). If we think in population geneticsterms, this is to be expected, because θ and ρ depend onthe effective population size Ne, which will be larger formore diverse groups. If we think in phylogenetic terms,this is to be expected, because the time since divergenceof the more diverse groups is higher.

The parameters θ1 and ρ1 correspond to gene fam-ilies with slow rates of insertion and deletion, while θ2and ρ2 correspond to families with fast rates. The meannumber of families per genome in the two categoriesare Gslow = θ1/ρ1 and G f ast = θ2/ρ2. Table 3 showsthe fractions of families in the two categories: fslow =

Table 2 Comparison of quality-of-fit for four models using root meansquared (RMS) values. RMS(data) is the RMS of 10,000 replicate per-mutations of the pangenome and core genome data points compared tocalculated means. RMS(theory) is the RMS of the best fit line for evolu-tionary models based on 2 different tree shapes (star tree and coalescenttree) and either 1 dispensable and 1 essential gene class (D+E) or 2 dis-pensable gene classes (2D).

RMS(theory)/RMS(data)RMS Star Tree Coalescent

Taxonomic Group (data) D+E 2D D+E 2DStaph. aureus 126 0.27 0.15 1.03 0.06Strep. pyogenes 109 0.23 0.21 0.74 0.03Strep. pneumoniae 105 0.28 0.25 0.93 0.05Listeria spp. 140 0.12 0.11 0.38 0.03B. cereus group 493 0.36 0.29 1.21 0.08Staph. spp. 335 0.26 0.23 0.99 0.03Strept. spp. 333 1.72 1.72 1.18 0.21“Bacillaceae 1” 703 1.26 1.26 1.30 0.20Lactobacillales 387 1.28 1.28 0.72 0.11Bacilli 1972 1.81 1.81 0.63 0.10

Gslow/(Gslow +G f ast) and f f ast = G f ast/(Gslow +G f ast).There is usually a relatively small fraction of fast evolvingfamilies ( f f ast ) in the range 0.1–0.25 for most data sets),and the deletion rates for the fast families are very muchlarger than the slow families (ρ2 ρ1). The presence offast evolving families, which are easily gained and lost,means that there can be quite rapid divergence betweengenomes that are closely related (such as the species leveldatasets considered here). However, if the majority of fam-ilies are more slowly evolving, this explains why signifi-cant numbers of conserved families are still found in themore diverse data sets. It was previously observed thatgain and loss of families in the B. cereus group appearedto be much faster than in the wider set of Bacilli (Hao andGolding, 2006). This is because the fast changes are dom-inant when considering the short branch lengths in closelyrelated groups, whereas changes at slower time scales arerelevant for the longer branches, and it is not possible tosee this with a model that has only a single rate.

The fitted parameters for the 2D model were used topredict the sizes of the pangenome and core genome whenextrapolated to larger numbers of genomes (Table 3). Sub-stantial numbers of new families are expected to be foundeven if very many genomes are sequenced. For example,the predicted Gnew is in the range of 50 to several hun-dred, even after sequencing 100 genomes. Hundreds oreven thousands of genomes are predicted to be requiredbefore the number of new families falls to less than 1% ofthe mean genome size. It should be remembered that thepangenome is open according to these models, so therewill always be new families, however many genomes aresequenced.

The observed core genome size is slightly less thanGslow for the species level sets, indicating that most of theslowly evolving families are conserved in all the genomesin the closely-related groups (Fig. 5). In the more diver-

Page 8: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 8

Table 3 Fitted parameters for the best fitting model, having two classes of dispensable gene families (2D) on a coalescent tree. The columns fslowand f f ast are the fractions of gene families in each rate category. Extrapolations to large numbers of genomes were performed to predict the numberof new gene families (Gnew(k)) and core clusters (Gcore(k)) detected when k genomes are sequenced. ‘1% Gnew’ is the predicted number of additionalgenome sequences required before the number of new families falls to less than 1% of the mean genome size.

Taxonomic group θ1 ρ1 θ2 ρ2 fslow f f ast Gnew(100) 1% Gnew Gcore(100) Gcore(1000)Staphylococcus aureus 161 0.080 53326 260.2 0.91 0.09 150 2159 1331 1106Streptococcus pyogenes 135 0.097 9507 48.4 0.88 0.12 66 558 852 681Streptococcus pneumoniae 131 0.088 12547 42.7 0.84 0.16 90 668 950 776Listeria spp. 155 0.074 6861 28.6 0.90 0.10 55 274 1430 1205Bacillus cereus group 643 0.179 146941 215.5 0.84 0.16 474 3240 1456 963Staphyloccoccus spp. 335 0.178 160825 538.3 0.86 0.14 256 6840 769 510Streptococcus spp. 297 0.234 16950 39.7 0.75 0.25 125 981 392 228“Bacillaceae 1” 925 0.328 45461 50.5 0.76 0.24 313 1202 556 261Lactobacillales 314 0.304 16475 23.4 0.60 0.40 138 945 229 114Bacilli 3707 1.987 226245 406.4 0.77 0.23 484 9094 0 0

gent groups, the core is much smaller than Gslow, indicat-ing that there is sufficient time for deletion of the slowfamilies as well (Fig. 5). The predicted size of the coregenome continues to slowly decrease well beyond thenumber of genomes currently available in each data set(Table 3). In the 2D models there are no essential fami-lies, so the core will eventually decrease to zero. In someof the most diverse clades, predictions of the core genomesize for large numbers of genomes are improved if a classof essential families is also added (i.e. 2D+E). However,we did not feel that inclusion of this extra class was welljustified in terms of the improvement of the fit to coreand pangenome data for the available number of genomes;therefore these results are not shown. Unfortunately, it willalways be difficult to make accurate extrapolations far be-yond the range of the available data.

Fitting to the gene family frequency spectrum

We also wished to test whether the theoretical modelsprovide a good prediction of the shape of the gene familyfrequency spectrum, G(k|ng). For this, we chose the pa-rameters by minimizing the χ2 function:

χ2 =

ng

∑k=1

(Gtheory(k|ng)−Gdata(k|ng)

)2

Gtheory(k|ng)(28)

This function is weighted such that the theory has to fit thedata across the full range of k, whereas if the unweightedleast squares function is used, the model tends to fit thehigh values at the extremes and ignore the low values atintermediate k.

When fit to the gene family frequency spectrum forthe B. cereus group, the 2D model on the coalescent re-turns a good prediction, whereas the D+E model on thecoalescent does not adequately explain the shape of thiscurve (Fig. 6a). Fig. 6b shows the predicted core andpangenome curves using the 2D model with parametersobtained from fitting the frequency spectrum in Fig. 6a.This shows that the 2D model on the coalescent fits allthese aspects of the data simultaneously.

The B. cereus example is typical of the other data sets

with closely related genomes. However, we found that thegene family frequency spectrum is more difficult to pre-dict adequately for the more diverse sets of genomes thatwe considered. For example, we found that the 2D modelon the coalescent appeared to do a reasonable job at fit-ting the gene family frequency spectrum for the “Bacil-laceae 1”, but when these same parameters were used tocalculate the core and pangenome curves, the predictionwas poor (not shown). However, for this dataset alone, theprediction was noticeably improved if we included twodispensable classes and an essential class (2D+E) in themodel (Fig. 7a). In this case, the predictions for the coreand pangenomes using the same parameters were muchimproved (Fig. 7b).

The predictions of the models on the coalescent areaverages over all possible coalescent trees, whereas theactual data arise on one particular tree, and there is no rea-son to suppose that the real tree conforms to the coales-cent process. We have also calculated the predicted genefamily frequency spectrum using the star phylogeny, butthe fits are noticeably worse than with the coalescent, andwe have not shown them here. We already have an esti-mate of the shape of the real tree calculated from the pro-tein sequence evolution of the conserved genes (Fig. 1).Here, we will show that it is possible to calculate the ex-pected gene family frequency spectrum for the IMG modelon any given fixed tree. The reason for doing this is thatthe spectrum is not simply a smooth curve (as seen inFig. 7), but has structure, such as peaks close to k = 20and k = 30, and dips on either side of these peaks. Theseare signatures of the shape of the particular tree on whichthe genomes evolved. The “Bacillaceae 1” group containsthe 20 genomes of the B. cereus group within it, whichare very closely related to one another in comparison tothe other members of the “Bacillaceae 1” (Fig. 1). Themost likely explanation of the k = 20 peak is that there aremany gene families present in this particular group of 20genomes that are not present in the rest of this group.

The following method can be used to calculateG(k|n) on a fixed tree. We suppose branch lengths of thetree are given in some convenient units. In our case, the

Page 9: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 9

tree is a maximum likelihood tree derived from protein se-quence evolution and the branch lengths are measured inamino acid substitutions per site. Let a be a node in thisfixed tree, let ta be the length of the branch leading to nodea, and let na be the number of genomes in the data that de-scend from node a, as shown in Figure 8. We may write

G(k|n) =2n−1

∑a=1

g(ta)pa(k) (29)

where g(ta) is the expected number of families present ina that arose on the branch leading to a, and pa(k) is theprobability that a family present at a is present in k outof na genomes that descend from a. The sum goes over ntip nodes (current genomes) and n− 1 internal nodes, i.e.2n−1 nodes in total.

For a single class of dispensable gene families in theIMG model,

g(ta) =uv(1− e−vta) (30)

as in Equation (15). If a is the root node, then g(ta) = u/v.If a is a tip node, then pa(1) = 1 and pa(k) = 0 for

k 6=1. If a is an internal node, pa(k) = 0 for k > na andwe can calculate pa(k) for 0 6 k 6 na using the followingrecursion. Let the probability that a family is retained fora time t be r(t) = e−vt , and the probability that it is lostduring time t be l(t) = 1− e−vt . Let c be the parent nodeof a and b, as in Figure 8. Then, for k > 1, we may write

pc(k) = r(ta)l(tb)pa(k)+

l(ta)r(tb)pb(k)+

r(ta)r(tb)k

∑j=0

pa( j)pb(k− j) (31)

and for k = 0,pc(0) =

(l(ta)+ r(ta)pa(0)

)(l(tb)+ r(tb)pb(0)

)(32)

Once these probabilities have been calculated for everynode, Equation (29) gives the gene family frequency spec-trum. If there is more than one category of dispensablefamily, then the spectrum is the sum of the spectra for thetwo classes. If there is a class of essential families, then aconstant Gess is added to G(n|n).

Using the fixed tree for the “Bacillaceae 1” in Fig. 1,we calculated G(k|ng) for the 2D+E model. The parame-ters were chosen by optimizing the χ2 function in Equa-tion (28). The model predicts the most important peaksand dips in the gene family frequency spectrum in a waythat cannot be done with the coalescent model. Having ob-tained G(k|ng), it is possible to obtain the prediction forthe core and pangenome curves on the same tree by us-ing Equations (22)–(25) in the same way as is done forthe observed spectrum. For the “Bacillaceae 1” example,the resulting theoretical curves from the fixed tree and thecoalescent are almost indistinguishable from one another(Fig. 7b). Moving from the gene family frequency spec-trum to the core and pangenome curves smooths out someof the irregularities that arise on the fixed tree. Thus thespectrum is more sensitive to the shape of the tree than arethe core and pangenome curves.

Software

Many of the formulae described herein are availableas functions written in the R programming language, avail-able at http://reric.org/work/pangenome/.A helper script provides example usage and plotting foreach of 7 functions using an example dataset. Users cancompute the gene family frequency spectrum, G(k), froma matrix of gene clusters, and calculate G(k) using the 1D,1D+E, 2D, and 2D+E IMG models on a coalescent. Themean core genome (Gcore) and pangenome (Gpan) curvescan be calculated from G(k), and users can compute per-mutations of these curves from the data. Additional func-tions allow users to calculate Gcore and Gpan using the 1D,1D+E, 2D, and 2D+E models on either a coalescent or astar tree.

Discussion and Conclusions

The original version of the IMG introduced byBaumdicker et al. (2010) uses a single class of dispensablegene families plus a class of essential families, and corre-sponds to our model D+E on the coalescent tree. Theseauthors gave an example where the gene family frequencyspectrum from nine Prochlorococcus genomes was wellfitted with this model. In our data sets, however, we al-ways found that model 2D, with two dispensable classes,gave a much better fit than the D+E model. We note thatthe essential class is a limiting case of a dispensable classwith both u and v very small. This means that the fit mustbe at least as good with the 2D model as the D+E. The2D model has an additional free parameter, but additionof this parameter seems well justified because the ratioRMS(theory)/RMS(data) is very much smaller (Table 2)and because the shape of the fitting curves is clearly muchbetter (Figures 3, 4 & 4). Two classes of families seemsadequate for fitting the core and pangenome curves, andalso most of the gene family frequency spectra.

Not all of the clades were well fit by the 2D model,however. For the gene family frequency spectrum ofthe “Bacillaceae 1” (Fig. 7), there was a noticeable im-provement when a third class was added (model 2D+E).This mathematical model corresponds to the conceptualmodel described by Koonin and Wolf (2008) that in-cludes a ‘core’ of essential gene families, a ‘shell’ of con-served families, and a ‘cloud’ of dispensable families. The“Bacillaceae 1” is one of the most diverse data sets andone in which there is noticeable structure in the spec-trum arising from the unusual shape of the tree. The fre-quency spectrum contains more information than the coreand pangenome curves because it is possible to calculatethose curves from the spectrum, but not vice versa. Forthis reason, more complex models may be justified for fit-ting the spectrum than for fitting the core and pangenomes.For example, the calculation of the frequency spectrum onthe fixed tree is noticeably different than for the coales-cent, although both give almost the same prediction forthe core and pangenomes. However, the original clusterdata contains a lot of information that is lost in calculat-ing the gene family frequency spectrum. For example, in

Page 10: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 10

the “Bacillaceae 1” case, we argued that the peak close tok = 20 arose from the group of 20 closely related genomeswithin the larger set. This should mean that the cluster datacontains many gene families found in this particular groupof 20, rather than in any other random group of 20. In thefuture, we intend to analyze the cluster data directly usingthe IMG model. This should allow a more sensitive test ofsome of the details of the IMG model that are difficult toassess from the core and pangenome curves and from thegene frequency spectrum.

In understanding the parameters of the models, it isinstructive to compare two clades, Listeria and Staphylo-coccus aureus, which have similar average genome sizesand similar core genome sizes. The fitted values for inser-tion and deletion rates of slowly evolving gene familiesare also similar between the two groups, making up about90% of the total in each clade (Fig. 5). However, the pre-dicted rates of insertion and deletion for the fast gene fam-ilies were nearly an order of magnitude lower in Listeriathan in S. aureus, indicating that the variable fraction ofthe genome evolves much more slowly in Listeria. Thisin turn leads to the prediction of a larger (more diverse)pool of gene families available to S. aureus compared toListeria, which is surprising because the sequenced Liste-ria genomes include several species and are much morephylogenetically divergent than the S. aureus genomes.

Assuming that rates of protein sequence evolutionin core genes are not drastically different between Lis-teria and S. aureus, explanations for the unexpectedlysmall Listeria pangenome include differences in lifestyleor mechanisms of evolution. One aspect of lifestyle thatmay vary is habitat or host range. For example, a pathogenthat infects many host species may have a larger effectivepopulation size than an obligate human pathogen, and mayrequire access to a larger repertoire of genes. In our exam-ple organisms, S. aureus is commensal on and an oppor-tunistic pathogen of humans and occasionally other an-imals, while Listeria spp. are pathogens of humans andvarious other animals, in addition to inhabiting environ-ments like soil and water. So, the smaller pangenome sizeof Listeria is not obviously attributable to a less variablehabitat or smaller populations.

An alternative explanation is that the mechanisms ofevolution might vary by clade, particularly those mech-anisms involved in the acquisition or invention of novelgenes. Exogenous genes can be acquired via conjugation(by self-propagating plasmids or transposons), transduc-tion (by phage infection), and natural transformation (bythe uptake of free DNA from the environment). Membersof all of the clades we studied are known to host both plas-mids and bacteriophages, indicating that conjugation andtransduction are both potential sources of new genes inthese genomes. However, not all of the clades are knownto become naturally competent for transformation, a com-plex process that requires a large complement of genes.In particular, no known isolates of Listeria, Staphylococ-cus aureus, or Streptococcus pyogenes—the three cladeswith the lowest proportional pangenome sizes—have beenobserved to become naturally competent in vitro. Strep-

tococcus pyogenes and S. aureus each carry many genesknown to be required for competence and could possiblybecome competent under conditions not yet replicated inthe laboratory, but sequenced Listeria spp. carry only afew of the required genes and are unlikely to ever becomecompetent in their native habitats. Additionally, just onefamily of insertion sequence has been reported for Listeriaspp. compared to more than 20 families in S. aureus andmost other clades we investigated (ISFinder database,Siguier et al., 2006). Compared to S. aureus, the smallpangenome of Listeria could be attributable to a restrictedability to acquire exogenous genes and a decreased rate ofgene innovation due to a paucity of insertion sequences.

An important question that we found difficult to ad-dress with the methods in this paper is the comparison ofthe IMG with the finitely many genes (FMG) case. TheFMG was introduced as a stepping stone to derive theIMG limit. However, it is a valid model in itself, and weattempted to fit the core and pangenome curves using theFMG model. We found that the value of M (the finite num-ber of possible gene families) is very difficult to estimate.M must be at least as large as the observed pangenomesize, but if M is much larger than this, then the predictionsof the model depend principally on the product Mσ , andonly very slightly on the two parameters separately. Thus,from the core and pangenome curves, we could not makea clear distinction between the IMG model and the FMGmodel with a very large but finite number of families. Thisdistinction would be more apparent in a direct phyloge-netic analysis of the cluster data, because the FMG modelwould allow certain patterns of presence and absence tooccur that would be extremely unlikely under the IMGmodel.

In the phylogenetic context, the FMG corresponds toa standard model of a two state 0/1 character, such as thatused by Hao and Golding (2006) to look at gene familypresence/absence patterns. This model allows independenttransitions from 0 to 1 in different parts of a phylogenetictree, which can be interpreted as independent insertions ofthe same kind of family by HGT from a pool of sequencesof finite diversity. In contrast, the IMG is not a standardmodel in the phylogenetic context, because it has an in-finite number of possible characters, and the number ofcharacters actually present in the data is variable. A com-parison of these two approaches on the same data wouldtherefore be interesting.

The distinction between the IMG and FMG is re-lated to the question of whether the pangenome is openor closed. It seems to us that it must almost certainly beopen in real cases. As long as there is some non-zero rateof origination of gene families within individual lineagesor there is a high diversity of families available for HGT,then there will always be some constant rate of origina-tion of gene families that have not been seen before. Thismeans that there are at least some types of families that fitthe IMG assumption, and so the pangenome must be open.It is also possible that there are other families that have ahigh rate of repeated HGT, and which would fit an FMGmodel better. These could be present at the same time as

Page 11: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 11

gene families of the IMG type. But unless all families cor-respond to the FMG case, which seems unlikely to us, thepangenome will be open.

An alternate way to model HGT is to suppose thatgenes can be repeatedly transferred between members ofthe group. In this case the rate of transfer of a gene de-pends on its frequency x in the population in propor-tional to x(1− x), because the donor must possess thegene and the receiver must not. Baumdicker and Pfaffel-huber (2011) have recently extended the IMG to includethis kind of within-group HGT in the population genet-ics context on the coalescent tree. If HGT is frequent, thischanges the shape of the gene frequency spectrum sub-stantially so that it is no-longer U-shaped, but has a broadpeak at intermediate k. The gene frequency spectra for allthe data sets that we studied here were U-shaped, whichleads us to believe that within-group HGT is not a primefactor in our data. However, we have not tested this explic-itly. We also acknowledge the recent claim that the major-ity of gene gains in multi-gene families arise from within-group HGT rather than by gene duplication (Treangen andRocha, 2011). One possible way to reconcile these obser-vations would be if genes in large families have an unusu-ally high rate of HGT, whereas the majority of genes havea very low rate that is consistent with the IMG assumption.Gain of additional members of gene families is not visiblein the way we analyzed the data here because we workedat the level of presence/absence of whole gene families.It is also worth pointing out that it would be extremelydifficult to deal with within-group HGT in a phylogeneticcontext on a fixed tree because the rate of 0 to 1 charac-ter changes in one branch would depend on the characterstates in other branches, which would invalidate the usualkind of phylogenetic recursion relations used for calculat-ing likelihoods.

Our main reason for introducing the star phylogenycalculations was the fact that the pangenome increases lin-early with n on the star phylogeny, and this correspondsto claims made for real data in some of the original workwith pangenomes (Tettelin et al. 2005). However, the IMGmodel was not available at that time, and we have shownhere that the 2D IMG on the coalescent tree gives a betterfit than the star phylogeny. In other words, the pangenomeseems to increase logarithmically with n, and not linearly,after all. Our interpretation of this is that real evolution oc-curs on a tree that is neither a star nor a coalescent. How-ever, typical phylogenetic trees resemble a typical coales-cent more than a star, in that as more genomes are addedto the sample, the length of the tip branch leading to thelast genome added gets shorter and shorter. This explainsqualitatively why the pangenome only increases logarith-mically.

We will note one final interesting connection with thestar tree. The finite supragenome model is an approachdeveloped in a recent series of papers (Hogg et al. 2007;Donati et al. 2010; Boissy et al. 2011) that makes predic-tions on the core and pangenome sizes. This model sup-poses that there are a finite total number of gene familiesthat are divided among K classes. Gene families are as-

sumed to be present with a probability µ in each genomeand absent with a probability 1−µ , with a different valueof µ for each class of family. The gene frequency spec-trum for families in a given class is binomial, and the totalgene family frequency spectrum is a sum over the bino-mial distributions for the different classes. The connectionwith the star phylogeny is that gene families present at theroot of the star have a probability e−vt of being presentin each of the tips. Therefore the frequency spectrum isbinomial, with µ = e−vt . We tried to fit our gene fam-ily frequency spectra with the 2D star tree model, and wefound it to be considerably worse than the 2D coalescentmodel; therefore we did not present these results. In thefinite supragenome model, K = 7 classes were used, and itmay be that the star phylogeny model would fit the data ifa large number of classes were introduced. We do not needto do this however, because the IMG gives a better predic-tion of the frequency spectrum, using either the coalescentor a fixed phylogenetic tree, with only 2 or 3 classes. Moreimportantly, our conclusion is that the pangenome is open,because there is a significant rate of origination of newgene families that have never existed before, and this is notconsistent with the basis of the finite supragenome model,which starts from the assumption that the pangenome isclosed.

In conclusion, we have found that variants of the in-finitely many genes model make good predictions of thesizes of the core and pangenomes and the shape of thegene family frequency spectrum in many groups of bac-terial genomes. The model has a sound basis in popula-tion genetics, and is derived from an explicit evolution-ary model. Therefore it is more useful for interpretationof experimental data than alternative approaches that aresimply empirical fitting functions or statistical models fordistributions that are not associated with an evolutionarymechanism.

Literature CitedBaumdicker F, Hess WR, Pfaffelhuber P. 2010. The diversity of a

distributed genome in bacterial populations. Ann. Appl. Prob.20:1567–1606.

Baumdicker F, Hess WR, Pfaffelhuber P. 2011. The infinitelymany genes model for the distributed genome of bacteria.PNAS forthcoming.

Baumdicker F, Pfaffelhuber P. 2011. Evolution of bacterial pop-ulations under horizontal gene transfer. Preprint, http://arxiv.org/abs/1105.5014

Boissy R, Ahmed A, Janto B et al. (16 co-authors). 2011.Comparative supragenomic analyses among the pathogensStaphylococcus aureus, Streptococcus pneumoniae, andHaemophilus influenzae using a modification of the finitesupragenome model. BMC Genomics 12:187.

Collins RE, Merz H, Higgs PG. 2011. Origin and evolution ofgene families in Bacteria and Archaea. BMC Bioinformatics(Proceedings of the RECOMB Comparative Genomics con-ference – to appear).

den Bakker HC, Cummings CA, Ferreira V, Vatta P, Orsi RH,Degoricija L, Barker M, Petrauskene O, Furtado MR andWiedmann M. 2010. Comparative genomics of the bacterialgenus Listeria: Genome evolution is characterized by lim-ited gene acquisition and limited gene loss. BMC Genomics

Page 12: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 12

11:688.Donati C, Hiller NL, Tettelin H et al. (18 co-authors) 2010.

Structure and dynamics of the pan-genome of Streptococ-cus pneumoniae and closely related species. Genome Biol.11:R107.

Edgar RC. 2004. MUSCLE: multiple sequence alignment withhigh accuracy and high throughput. Nucl. Acids Res. 32:1792–1797.

Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W,Gascuel O. 2010. New algorithms and methods to estimatemaximum-likelihood phylogenies: assessing the performanceof PhyML 3.0. Sys. Biol. 59:307–21.

Hao W, Golding GB. 2006. The fate of laterally transferredgenes: Life in the fast lane to adaptation or death. GenomeRes. 16: 636–643.

Hein J, Schierup MH, Wiuf C. 2005. Gene genalogies, variationand evolution: A primer in coalescent theory. Oxford Univer-sity Press, Oxford.

Hogg JS, Hu FZ, Janto B, Boissy R, Hayes J, Keefe R, Post JC,Ehrlich GD. 2007. Characterization of the Haemophilus in-fluenzae core and supragenomes based on the complete ge-nomic sequences of Rd and 12 clinical notypeable strains.Genome Biol. 8:R103.

Kettler GC, Martiny AC, Huang K, et al. (14 co-authors) 2007.Patterns and implications of gene gain and loss in the evolu-tion of Prochlorococcus. PLoS Genet. 3(12):e231.

Kingman JFC. 1982. The coalescent. Stoch. Proc. Appl. 13:235–248.

Koonin EV and Wolf YI. 2008. Genomics of bacteria and ar-chaea: the emerging dynamic view of the prokaryotic world.Nucleic Acids Res. 36:6688–6719.

Lapierre P, Gogarten JP. 2009. Estimating the size of the bacterialpan-genome. Trends in Genetics 25:107–110.

Ludwig W, Schleifer K-H, Whitman WB. Accessed 9/2011.Revised road map to the phylum Firmicutes. URLhttp://www.bergeys.org/outlines/bergeys_vol_3_roadmap_outline.pdf

R Development Core Team. 2009. R: A language and environ-

ment for statistical computing. R Foundation for StatisticalComputing, Vienna, Austria. R version 2.10.1 (2009-12-14).ISBN 3-900051-07-0. URL http://www.R-project.org.

Rasko DA, Rosovitz MJ, Myers GSA, et al. (13 co-authors)2008. The pangenome structure of Escherichia coli: Compar-ative genomic analysis of E. coli commensal and pathogenicisolates. J. Bacteriol. 190:6881–6893.

Sanderson, MJ, Donoghue MJ, Piel W, Eriksson T. 1994. Tree-BASE: a prototype database of phylogenetic analyses and aninteractive tool for browsing the phylogeny of life. Am. J. Bot.81:183.

Scaria J, Ponnala L, Janvilisri T, Yan W, Mueller LA, andChang YF. 2010. Analysis of ultra-low genome conservationin Clostridium difficile. PLoS ONE 5(12):e15147.

Siguier P, Perochon J, Lestrade L, Mahillon J, and Chandler M.2006. ISfinder: the reference centre for bacterial insertion se-quences. Nucleic Acids Res. 34: D32-D36. URL http://www-is.biotoul.fr

Tettelin H, Masignani V, Cieslewicz MJ, et al. (46 co-authors)2005. Genome analysis of multiple pathogenic isolates ofStreptococcus agalactiae: Implications for the microbialpangenome. Proc. Nat. Acad. Sci. USA 102:13950–13955.

Touchon M, Hoede C, Tenaillon O, et al. (41 co-authors)2009. Organized genome dynamics in the Escherichia colispecies results in highly diverse adaptive paths. PLoS Genet.5(1):e1000344.

Treangen TJ, Rocha EPC. 2011. Horizontal transfer, not geneduplication, drives the expansion of gene families in prokary-otes. PLoS Genet. 7(1):e1001284

Welch RA, Burland V, Plunkett G, et al. (19 co-authors) 2002.Extensive mosaic structure revealed by the complete genomesequence of uropathogenic E. coli. Proc. Nat. Acad. Sci. USA99:17020–17024.

Wright S. 1969. Evolution and the Genetics of Populations. Vol.II. The theory of gene frequencies. University of ChicagoPress.

Page 13: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 13

Figures

FIG. 1.—Phylogenetic tree of 172 Bacilli having complete genome sequences, annotated with the taxonomic groups used in this study. Treetopology and branch lengths were determined by maximum likelihood analysis (using PhyML) of concatenated alignments of amino acid sequencesfrom 55 single-copy core genes (Table S2). The fully annotated tree is given in Fig. S1. Taxonomic groupings demarcated with stars (F) were 100%supported by the approximate likelihood ratio test (as calculated in PhyML) and define the groups used in this study. Scale bar indicates expectednumber of amino acid substitutions per site.

Page 14: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 14

FIG. 2.—Schematic representation of the shapes of (a) a coalescent tree, and (b) a star tree.

Page 15: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 15

FIG. 3.—Fitting four evolutionary models to the core and pangenome curves for 14 sequenced Streptococcus pneumoniae genomes. Models werebased on 2 different tree shapes: the star tree (blue lines) or the coalescent tree (red lines) and either 1 dispensable and 1 essential class of genefamily (D+E; dashed lines) or 2 dispensable classes (2D; solid lines). The lines for the star tree pangenome overlap. The shaded region demarcates themaximum range observed during 10,000 permutations of the data.

Page 16: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 16

FIG. 4.—Fitting four evolutionary models to the core and pangenome curves for 38 sequenced “Bacillaceae 1” genomes. Models were based on2 different tree shapes: the star tree (blue lines) or the coalescent tree (red lines) and either 1 dispensable and 1 essential class of gene family (D+E;dashed lines) or 2 dispensable classes (2D; solid lines). The lines for the star tree overlap. The shaded region demarcates the maximum range observedduring 10,000 permutations of the data.

Page 17: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 17

FIG. 5.—The variation in the number of gene families in the slow category (Gslow; blue bar) and the fast category (G f ast ; red bar), and the totalnumber of gene families per genome (G0 = Gslow +G f ast ) in each of the clades. The symbol shows the observed size of the core genome.

Page 18: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 18

FIG. 6.—Fitting the gene family frequency spectrum, G(k), for 20 sequenced genomes of the Bacillus cereus group (circles). Models were basedon the coalescent tree and either 1 dispensable and 1 essential class of gene family (D+E; dashed lines) or 2 dispensable classes (2D; solid lines). The2D model gives a reasonable fit, whereas the D+E model does not. The lower figure shows the predictions of the 2D model for the core and pangenomecurves with the parameter values estimated from fitting the gene frequency spectrum. The shaded region demarcates the maximum range observedduring 10,000 permutations of the data.

Page 19: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 19

FIG. 7.—Fitting the gene family frequency spectrum, G(k), for 38 sequenced “Bacillaceae 1” genomes (circles). Models were based on 2 differenttree shapes: the coalescent tree (red lines) or the inferred phylogenetic tree (green lines; Fig. S1), using 2 dispensable and 1 essential class of genefamily (2D+E). The 2D+E model on the coalescent gives an adequate smooth fit through the data. The same model calculated on the fixed tree of“Bacillaceae 1” predicts some of the tree-dependent structure in the data. The lower graph shows the (overlapping) predictions of both these modelsfor the core and pangenomes, using the parameters estimated from fitting the gene family frequency spectrum. The shaded region demarcates themaximum range observed during 10,000 permutations of the data.

Page 20: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 20

FIG. 8.—Schematic to explain the algorithm for calculation of the gene frequency spectrum on a fixed tree.

Page 21: Testing the Infinitely Many Genes Model for the Evolution of the Bacterial Core Genome and Pangenome

Evolution of the bacterial pangenome 21

Supplementary Material Captions

Supplementary Table 1: The list of taxa used in the study. The list is archived, with additional links to taxonomicdatabases, on TreeBASE (Sanderson et al., 1994) at the following URL: http://bit.ly/pS075b

Supplementary Table 2: Single-copy core genes used for phylogenetic analysis. Numbers in brackets indicate thefrequency (out of 172 genomes) with which the given annotation was observed.

Supplementary Figure 1: Phylogenetic tree of 172 Bacilli having complete genome sequences. Tree topology andbranch lengths were determined by maximum likelihood analysis of concatenated alignments of amino acid sequencesfrom 55 single-copy core genes (Table S2). Taxonomic groupings are indicated on the tree by stars and by vertical linesadjacent to taxon names. Each grouping was monophyletic and highly supported by the approximate likelihood ratio test(aLRT, as calculated in PhyML); values of aLRT for each taxonomic group are shown adjacent to the stars. The alignmentand tree files are archived on TreeBASE (Sanderson et al., 1994; URL http://bit.ly/qD1Ch2).