Syst. Biol. 50(4):580–601, 2001

Selecting the Best-Fit Model of Nucleotide Substitution

DAVID POSADA AND KEITH A. CRANDALL

Department of Zoology, Brigham Young University, Provo, Utah 84602-5255, USA; E-mail: [email protected] (author for correspondence), keith [email protected]

Abstract. Despite the relevant role of models of nucleotide substitution in phylogenetics, choosing among different models remains a problem. Several statistical methods for selecting the model that best fits the data at hand have been proposed, but their absolute and relative performance has not yet been characterized. In this study, we compare under various conditions the performance of different hierarchical and dynamic likelihood ratio tests, and of Akaike and Bayesian information methods, for selecting best-fit models of nucleotide substitution. We specifically examine the role of the topology used to estimate the likelihood of the different models and the importance of the order in which hypotheses are tested. We do this by simulating DNA sequences under a known model of nucleotide substitution and recording how often this true model is recovered by the different methods. Our results suggest that model selection is reasonably accurate and indicate that some likelihood ratio test methods perform overall better than the Akaike or Bayesian information criteria. The tree used to estimate the likelihood scores does not influence model selection unless it is a randomly chosen tree. The order in which hypotheses are tested, and the complexity of the initial model in the sequence of tests, influence model selection in some cases. Model fitting in phylogenetics has been suggested for many years, yet many authors still arbitrarily choose their models, often using the default models implemented in standard computer programs for phylogenetic estimation. We show here that a best-fit model can be readily identified. Consequently, given the relevance of models, model fitting should be routine in any phylogenetic analysis that uses models of evolution. [AIC; BIC; dynamic LRT; hierarchical LRT; likelihood ratio tests; model selection; substitution models.]

Phylogenetic reconstruction has been regarded as a problem of statistical inference since the pioneering work of Edwards and Cavalli-Sforza (1964). Because statistical inferences cannot be drawn in the absence of a probability model, the use of a model of nucleotide substitution (a model of evolution) becomes necessary when using DNA sequences to estimate phylogenetic relationships among organisms. Models of evolution, sets of assumptions about the process of nucleotide substitution (Fig. 1), are used in phylogenetic analyses to describe the different probabilities of change from one nucleotide to another, with the aim of correcting for unseen changes along the phylogeny. Whereas maximum parsimony implicitly assumes a model of evolution (Farris, 1973; Felsenstein, 1973; Yang, 1996a; Steel and Penny, 2000), distance and maximum likelihood methods estimate parameters according to an explicit model of evolution. However, whereas distance methods estimate only a single parameter (substitutions per site) given the model, maximum likelihood can estimate all the relevant parameters of the substitution model.

Although for the last 30 years an array of models of increasing complexity regarding nucleotide substitution have been described (see Swofford et al., 1996), choosing among models remains a major problem in phylogenetic reconstruction (Cunningham et al., 1998). As is well established, the use of one model of evolution or another may change the results of an analysis (Leitner et al., 1997; Sullivan and Swofford, 1997; Cunningham et al., 1998; Kelsey et al., 1999). Especially, estimates of branch length or bootstrap support can be severely affected (Yang et al., 1994; Buckley et al., 2000).

In general, phylogenetic methods may be less accurate (recover an incorrect tree more often) or may be inconsistent (converge to an incorrect tree with increased amounts of data) when the wrong model of evolution is assumed (Felsenstein, 1978; Huelsenbeck and Hillis, 1993; Penny et al., 1994; Bruno and Halpern, 1999). Because the performance of a method is maximized when its assumptions are satisfied, some indication of the fit of the data to the phylogenetic model is necessary (Huelsenbeck, 1995). Indeed, model selection is not important just because of its consequences in phylogenetic analysis, but because the characterization of the evolutionary process at the sequence level is itself a legitimate pursuit. Moreover, models of evolution are especially critical for estimating substitution


  • 2001 POSADA AND CRANDALL: MODEL SELECTION 581

    FIGURE 1. A comparison of models of nucleotide substitution. Model selection methods selected the best-fit model for the data set at hand among 24 possible models. See Table 1 footnote for explanation of acronyms for methods.

    parameters or for hypothesis testing (Tamura, 1992; Wakeley, 1994; Adachi and Hasegawa, 1995; Yang et al., 1995; Zhang, 1999).

    Unfortunately, and despite these conclusions, the unjustified use of models of evolution is still a common practice in phylogenetic studies. In a quick examination of 13 issues of Systematic Biology (March 1997 to March 2000), 30 empirical articles used models of evolution in the analyses, but the model of nucleotide substitution implemented was statistically justified in only 6 of those studies (20%). If the model of evolution may influence the results of the analysis, then the use of a particular model should be justified. An a priori attractive selection procedure would be the arbitrary use of complex, parameter-rich models, but this approach has several disadvantages: (1) a large number of parameters need to be estimated, making the analysis computationally difficult and requiring a large amount of time, and (2) as more parameters need to

  • 582 SYSTEMATIC BIOLOGY VOL. 50

    be estimated, more error is included in each estimate (Huelsenbeck and Crandall, 1997). Ideally, we would like to incorporate as much complexity as needed.

    The best-fit model of evolution for a particular data set can be selected through statistical testing. Statistical tests of models of nucleotide substitution are of two types: some tests are designed to compare two different models, others to test the overall adequacy of a particular model. In this study we are interested only in the first class of tests, that is, how to select the best-fit model for the specific data set at hand from a given set of alternative models. The likelihood ratio test statistic (LRT) has been suggested for comparing two models of evolution (Felsenstein, 1981, 1988; Goldman, 1993). Several LRTs can be performed hierarchically (hLRT) to select the simplest model that best explains the data among a set of possible models (Yang et al., 1994; Frati et al., 1997; Huelsenbeck and Crandall, 1997; Sullivan et al., 1997; Posada and Crandall, 1998). Rzhetsky and Nei (1995) developed several tests, using linear invariants to assess the applicability of a particular model to the data. They tested whether the deviation from the expected invariant would be significant if the evaluated model

    FIGURE 2. Study design. Sequences were simulated under a known model of substitution. The number of times that a model selection method recovered the true model out of 100 replicates was used as the measure of a method's accuracy. 2′ and 2″ indicate alternative paths to step 2 used in part of the simulations. See text for explanation of model fitting strategies.

    were true. Although these tests do not require the use of an initial phylogeny, and they are independent of evolutionary time, they are model-specific and currently can be applied to a small set of substitution models. A different approach for model selection is the simultaneous comparison of all competing models through the Akaike information criterion (AIC) (Akaike, 1974) or the Bayesian information criterion (BIC) (Schwarz, 1974). The use of LRTs is more extensive in the phylogenetic literature; however, examples of the use of the AIC to select the best-fit model of nucleotide substitution can be found in Hasegawa et al. (1990a,b), Tamura (1994), and Muse (1999). Morozov et al. (2000) applied the BIC to compare different models of protein evolution.

    Although several statistical procedures exist for model selection, the absolute accuracy and relative performance of these procedures are unknown. In this study, we compare the performance of different LRTs with the AIC and BIC model selection procedures under various conditions. We do this by simulating DNA sequences under a known model of nucleotide substitution and recording how often this true model is recovered by the different model-selecting strategies (Fig. 2).


    METHODS

    Data Simulation

    Simulations were conducted in several steps (Table 1). An initial, global simulation was carried out to explore the effect of different broad conditions in model selection (Simulation I). Given the results from this simulation, additional simulations were conducted to explore a more restricted but different parameter space. A 20-taxon clocklike tree (Fig. 3a) was used as the model tree in most of the simulations (Simulations I and IV and part of Simulation II). Other clocklike trees, with 10, 50, and 100 taxa, were used in part of Simulation II, and a nonclock tree was used in Simulation III (Fig. 3b). Clocklike trees were simulated by a birth-death process with complete taxon sampling (Yang and Rannala, 1997) using the program PAML 2.0g (Yang, 1997a). The birth-death process is a continuous-time process in which the probability that a speciation event occurs along a lineage during an infinitesimal time interval Δt is λΔt, the probability that an extinction occurs is μΔt, and the probability that two or more events occur is of order o(Δt) (Rannala and Yang, 1996). Parameters λ and μ are the branching and extinction rates per lineage, respectively. The values used to simulate the trees were λ = 0.1, μ = 0.1, and sampling fraction = 1.0. We parameterized the rate of substitution as the tree height (m), which is the expected number of substitutions per site for a single lineage from the root to the tip of the tree. To study the influence of the substitution rate in model selection we used various tree heights (0.01, 0.10, 0.20, 0.50, and 0.75) (Table 1). The nonclock tree (Fig. 3b) was obtained by arbitrarily changing the length of some branches of the tree represented in Figure 3a. DNA sequences were simulated over the generated trees by using the program SeqGen 1.1 (Rambaut and Grassly, 1997) according to different models of DNA substitution (Table 2). When appropriate, a gamma shape parameter (α) (Yang, 1993; Yang, 1994a; Yang, 1996b) of 0.5 (0.2 and 0.05 in Simulation IV) was used to model rate variation among sites.

    TABLE 1. Simulations scheme. One hundred data sets were simulated for each set of conditions. True tree is the tree upon which sequences were evolved. Tree height is the expected number of substitutions per site from the root to the tip. Ntaxa is the number of taxa. Nchar is the total number of characters. True model is the model of nucleotide substitution upon which sequences were evolved. Alpha (α) is the shape of the Γ distribution. Ncat is the number of discrete categories for the Γ distribution. Base tree is the tree estimated from the data and on which parameters and likelihood scores were estimated.

    Simulation  True tree  Tree height  Ntaxa  Nchar  True model(a)  α     Ncat  Base tree(b)
    I           Clock      0.10         20     100    1–6            0.5   4     1–6, R, T
                Clock      0.10         20     500    1–6            0.5   4     1–6, R, T
                Clock      0.10         20     1,000  18             0.5   4     1–6, R, T
    II          Clock      0.10         10     500    1–6            0.5   4     3
                Clock      0.10         10     1,000  6              0.5   4     3
                Clock      0.10         50     500    1–6            0.5   4     3
                Clock      0.10         100    500    1–6            0.5   4     3
                Clock      0.01         20     500    1–6            0.5   4     3
                Clock      0.20         20     500    1–6            0.5   4     3
                Clock      0.50         20     500    1–6            0.5   4     3
                Clock      0.75         20     500    1–6            0.5   4     3
    III         Nonclock                20     1,000  1–6            0.5   4     1
    IV          Clock      0.10         20     1,000  1–6            0.2   4     1
                Clock      0.10         20     1,000  1–6            0.2   8     1
                Clock      0.10         20     1,000  1–6            0.05  8     1

    (a) True models: JC (1), JC+Γ (2), HKY (3), HKY+Γ (4), GTR (5), GTR+Γ (6). JC: Jukes and Cantor (1969) model; F81: Felsenstein (1981) model; HKY: Hasegawa et al. (1985) model; GTR (also called REV): general time reversible model (Tavaré, 1986). Γ represents the discrete gamma distribution with four rate categories.
    (b) The base tree is a NJ tree estimated according to the specified model (1–6)(a), a random tree (R), or the true tree (T).
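The birth-death process described above can be illustrated with a short event-by-event simulation. This is a minimal sketch, not the PAML procedure used in the study: it only tracks how many lineages exist after a given time, and the function name `simulate_lineage_count` and its arguments are assumptions introduced for illustration.

```python
import random


def simulate_lineage_count(birth_rate, death_rate, total_time, n_start=1, seed=None):
    """Return the number of lineages alive after total_time.

    Hypothetical helper: each lineage speciates at rate birth_rate (lambda)
    and goes extinct at rate death_rate (mu), independently of the others.
    """
    rng = random.Random(seed)
    n, t = n_start, 0.0
    while n > 0:
        # With n lineages the waiting time to the next event is exponential
        # with rate n * (lambda + mu).
        total_rate = n * (birth_rate + death_rate)
        if total_rate == 0:
            break
        t += rng.expovariate(total_rate)
        if t > total_time:
            break
        # The event is a speciation with probability lambda / (lambda + mu).
        if rng.random() < birth_rate / (birth_rate + death_rate):
            n += 1
        else:
            n -= 1
    return n


if __name__ == "__main__":
    # Rates as in the text (lambda = mu = 0.1); the time span is arbitrary.
    print([simulate_lineage_count(0.1, 0.1, 10.0, n_start=20, seed=s) for s in range(5)])
```

With equal branching and extinction rates, as used here, the expected number of lineages stays constant over time, although individual runs drift up or down.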

    Likelihood Estimation and Base Tree

    The likelihood of a tree is calculated as the probability of observing the data if the tree is true, under a given model of nucleotide substitution. To estimate the relative fit of different models to a given data set, we can


    FIGURE 3. True trees used in the simulations. (a) This clock tree was simulated by a birth-death process with birth rate (λ) = 0.1, death rate (μ) = 0.1, and sampling fraction = 1.0. The height of the tree is 0.10. (b) The nonclock tree was obtained by arbitrarily altering the branch lengths in tree (a). The scale of the branch lengths is indicated.

    contrast the likelihoods obtained for a tree estimated from the data (hereafter called the base tree) under the different models compared. The base tree ideally would be the true tree. However, for real data sets, the true tree is unknown and needs to be estimated. To quantify the potential effect of the base tree on model selection, we used eight different base trees: the model tree (= true tree), six neighbor-joining (NJ) (Saitou and Nei, 1987) trees calculated under different arbitrary models of evolution, and a random tree (Table 1). For each set of simulated data and base tree, 24 likelihood scores, corresponding to 24 different models of evolution (Fig. 1), were calculated in PAUP (Swofford, 1998).
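As a rough illustration of the candidate set, the sketch below enumerates 24 models and their numbers of free substitution parameters. It assumes that the 24 models of Figure 1 are the six substitution schemes named in the text (JC, F81, K80, HKY, SYM, GTR) crossed with four among-site rate options (none, +I, +G for gamma, +I+G); the dictionary names are hypothetical, and branch lengths are left out because all models are scored on the same base tree.

```python
from itertools import product

# Free parameters of each substitution scheme: three for unequal base
# frequencies, one for a transition/transversion ratio, five for a full
# set of relative substitution rates.
BASE_MODELS = {"JC": 0, "F81": 3, "K80": 1, "HKY": 4, "SYM": 5, "GTR": 8}

# Extra parameters for rate variation among sites (I: invariable sites,
# G: gamma shape parameter).
RATE_OPTIONS = {"": 0, "+I": 1, "+G": 1, "+I+G": 2}


def candidate_models():
    """Map each model name to its number of free substitution parameters."""
    return {
        base + rate: k_base + k_rate
        for (base, k_base), (rate, k_rate) in product(
            BASE_MODELS.items(), RATE_OPTIONS.items()
        )
    }


if __name__ == "__main__":
    for name, n_params in sorted(candidate_models().items(), key=lambda kv: kv[1]):
        print(f"{name:10s} {n_params}")
```

These parameter counts are what the likelihood ratio tests and information criteria described next use as degrees of freedom and penalties.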

    Model Selection Strategies

    These likelihood scores were used to select the best-fit model of evolution for each data set and conditions, using nine different methods that could be grouped in four classes:

    Hierarchical Likelihood Ratio Tests. In traditional statistical theory, a widely accepted statistic for testing the goodness of fit of models is the LRT statistic,

    $$\delta = 2(\ln L_1 - \ln L_0),$$

    where L1 is the maximum likelihood under the more parameter-rich, complex model (alternative hypothesis) and L0 is the maximum likelihood under the less parameter-rich simple model (null hypothesis). The value of this statistic is always ≥ 0, because the likelihood under the more complex model will always be equal to or greater than the likelihood under the simpler model. Simply put, the superfluous parameters in the complex model provide a better explanation of the stochastic variation in the data than the simpler model does, even if the simple model is the true one. When the models compared are nested (the null hypothesis is a special case of the alternative hypothesis) and the null hypothesis is correct, this statistic is asymptotically distributed as χ² with q degrees of freedom, where q is the difference in number of free parameters between the two models; equivalently, q is the number of restrictions on the parameters of the alternative hypothesis required to derive the particular case of the null hypothesis (Kendall and Stuart, 1979; Goldman, 1993). To preserve the nesting of the models, the likelihood scores are estimated on the same tree topology. Goldman (1993) questioned the appropriateness of the χ² approximation of the LRT statistic when comparing models of evolution, but Yang et al.'s (1995) simulation study suggested that the χ² approximation is acceptable in most cases. However, the χ² distribution may not be appropriate when the null model is equivalent to fixing some parameters at the boundary of the parameter space of the alternative model. An example of this situation is the rate homogeneity among sites test, where the


    TABLE 2. Parameter values used in the simulations. π (πA, πC, πG, πT) describes base frequencies at equilibrium and has three free parameters, because of the constraint that Σ πX = 1. The substitution rates among bases (A-C, A-G, A-T, C-G, C-T, G-T) have five free parameters because substitution rates are expressed relative to G-T, which is set equal to 1. The parameter κ describes the transition/transversion ratio, a specific constraint on the substitution rates, κ = (A-G = C-T)/(A-C = A-T = C-G = G-T). The parameter α is the shape parameter of the gamma distribution (Γ), which was simulated with four discrete categories.

    Parameters  JC    JC+Γ  HKY   HKY+Γ  GTR   GTR+Γ
    πA          0.25  0.25  0.35  0.35   0.35  0.35
    πC          0.25  0.25  0.15  0.15   0.15  0.15
    πG          0.25  0.25  0.25  0.25   0.25  0.25
    πT          0.25  0.25  0.25  0.25   0.25  0.25
    κ                       2     2
    A-C                                  2     2
    A-G                                  4     4
    A-T                                  1.8   1.8
    C-G                                  1.4   1.4
    C-T                                  6     6
    G-T                                  1     1
    α                 0.5         0.5          0.5

    See Table 1 footnote for explanation of abbreviations.

    null hypothesis (rate homogeneity) is a special case of the gamma-distribution model (rate heterogeneity), with shape parameter equal to infinity (Yang, 1996c). Whelan and Goldman (1999) concluded that for comparisons of rate variation across sites and nucleotide frequencies estimated as the observed base frequencies, the observed distribution of the LRT statistic was significantly different from the χ² distribution. To solve this problem, Ota et al. (2000) and Goldman and Whelan (2000) have suggested instead the use of a mixed χ² (or χ̄²) distribution, consisting of 50% χ²₀ and 50% χ²₁, when a parameter in the null model is fixed at the boundary of its parameter space. On the other hand, the difference in likelihood when comparing models may be very large, and the inaccuracy of the χ² approximation might not change the results of the tests in these cases. To study this question, here we used both the standard and mixed χ² for the boundary LRTs, and the standard χ² for the other LRTs.

    When we compare two different nested

    models through a LRT, we are actually testing hypotheses about our data. The hypotheses tested are those represented by the difference in the assumptions among the models compared. Several hypotheses can be tested hierarchically to select the best-fit model for the data set at hand (Frati et al., 1997; Huelsenbeck and Crandall, 1997; Posada and Crandall, 1998). It is to our advantage to test one hypothesis at a time: Are the base frequencies equal? Is there a transition/transversion bias? Are all transition rates equal? Are there invariable sites, or is there rate homogeneity among sites? and so on. For example, to test the equal base frequencies hypothesis, we could do a LRT comparing JC with F81 (see Table 1 for identification of models), models differing only in the fact that F81 allows for unequal base frequencies (alternative hypothesis), whereas JC assumes equal base frequencies (null hypothesis). However, to test this hypothesis, we could also compare JC+Γ with F81+Γ, or K80+I with HKY+I, or SYM with GTR. Which model comparison is used to test which hypothesis depends on the starting model of the hierarchy and on the order in which different hypotheses are applied. For example, we could start with the simple JC or with the most-complex GTR+I+Γ. In the same way, we could perform first a test for equal base frequencies and later a test for rate heterogeneity among sites, or vice versa. Given that the choice of the best-fit model has been suggested to be affected by the parameter addition sequence (Cunningham et al., 1998), the model selection process might also be dependent on the order in which the LRTs are performed. To test whether the order in which hypotheses are tested influences model selection, we used four different hierarchies to perform the LRTs (hLRT1 through hLRT4) (Fig. 4). Indeed, many different hierarchies might be possible, and we also devised a dynamic LRT procedure (dLRT) (Fig. 5),


    FIGURE 4. Hierarchical LRTs. LRTs are used to compare two different models at a time. The simpler model represents the null hypothesis. A model is accepted (A) or rejected (R) and the next LRT in the corresponding path, A or R, is performed until a final model is selected. Several starting models (in parentheses) and several orders of parameter additions were tried (panels a–d).

    explained below. To adjust for the inflation of type I error (rejection of the null model when it is true) when performing multiple LRTs, we applied a standard Bonferroni correction. Because four or five LRTs were carried out in each case, the individual alpha level was set to 0.01 to preserve on average a family alpha level of 0.05. The inflation of type II error (failure to reject the null model when it is false) was not corrected, because there is no obvious procedure to adjust for this kind of error when performing multiple LRTs. Although this is not the most satisfactory solution, it should not influence the model selection procedure, P-values for the LRTs being typically very small.

    Dynamic Likelihood Ratio Tests. An alternative to the use of a predefined hLRT is to let the data set itself determine the order in which the hypotheses are tested; that is, the hierarchy used does not have to be the same for different data sets. The algorithms we suggest (dLRT1 and dLRT2) are as follows:

    1. Start with a simple JC (dLRT1) or a complex GTR+I+Γ (dLRT2) model and calculate its likelihood. This is the current model.

    2. Calculate the likelihood of the alternative (dLRT1) or null (dLRT2) models that differ by one assumption, and perform the corresponding nested LRTs.

    3. dLRT1: If any hypothesis or hypotheses are rejected, the alternative model corresponding to the LRT with smallest associated P-value becomes the current model. In the case of several equally smallest P-values, select the alternative model with the best likelihood.

       dLRT2: If any hypothesis or hypotheses are not rejected, the null model corresponding to the LRT with greatest associated P-value becomes the current model. In the case of several equally greatest P-values, select the null model with the best likelihood.

    4. Repeat steps 2 and 3 until the algorithm converges.
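The forward version of the steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: models are encoded as sets of relaxed assumptions that can be toggled independently, the step from a transition/transversion bias to a full set of substitution rates is omitted, only the standard χ² approximation is used (no mixed χ² for boundary tests), and the names `chi2_sf`, `dlrt1`, `EXTRA_PARAMS`, and `toy_lnl_table` together with all log-likelihood values are hypothetical.

```python
import math
from itertools import combinations


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total


def dlrt1(lnl, extra_params, start=(), alpha=0.01):
    """Forward dynamic LRT over models encoded as sets of relaxed assumptions.

    lnl maps a frozenset of assumption names to a maximized log-likelihood;
    extra_params maps each assumption name to the free parameters it adds.
    """
    current = frozenset(start)
    while True:
        best = None
        for name, df in extra_params.items():
            alternative = current | {name}
            if name in current or alternative not in lnl:
                continue
            delta = 2.0 * (lnl[alternative] - lnl[current])
            p_value = chi2_sf(delta, df)
            if p_value < alpha:
                # Smallest P-value wins; ties go to the higher likelihood.
                key = (p_value, -lnl[alternative])
                if best is None or key < best[0]:
                    best = (key, alternative)
        if best is None:
            return current  # no null hypothesis rejected: converged
        current = best[1]


# Hypothetical assumptions and the number of parameters each one adds.
EXTRA_PARAMS = {"pi": 3, "kappa": 1, "G": 1, "I": 1}


def toy_lnl_table():
    """Made-up log-likelihoods in which the best-supported model is HKY+G."""
    table = {}
    names = list(EXTRA_PARAMS)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            model = frozenset(subset)
            value = -3000.0 + 40.0 * ("pi" in model) + 60.0 * ("kappa" in model)
            if "G" in model:
                value += 80.0 + (0.5 if "I" in model else 0.0)
            elif "I" in model:
                value += 50.0
            table[model] = value
    return table


if __name__ == "__main__":
    print(sorted(dlrt1(toy_lnl_table(), EXTRA_PARAMS)))
```

Starting from the empty set (JC), the search adds rate heterogeneity first because that test has the smallest P-value in the toy data, then the transition/transversion bias, then unequal base frequencies, and stops when adding invariable sites is not significant at the 0.01 level. A backward variant in the spirit of dLRT2 would start from the full set and drop the assumption whose test has the greatest P-value.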

    The alternative paths the algorithm can generate can be represented graphically (Fig. 5). Regarding multiple significance, it


    FIGURE 5. Dynamic LRTs. Starting with the simplest (JC) or the most complex model (GTR+I+Γ), LRTs are performed among the current model and the alternative model that maximizes the difference in likelihood. Hypotheses tested: base frequencies (π), transition/transversion bias (κ), substitution rates among nucleotides, rate heterogeneity among sites (Γ), and proportion of invariable sites (I).

    is not clear how to apply a correction consistently in this case. The number of tests performed may vary in each case. In addition, several tests are performed, but only some of them are actually considered. We decided to use an individual alpha value of 0.01 in all tests. In any case, the P-values obtained are generally so small that the different possible corrections for the type I error inflation should not change the final outcome.

    Akaike Information Criterion. The AIC (Akaike, 1974) is an asymptotically unbiased estimator of the Kullback-Leibler information quantity (Kullback and Leibler, 1951). The smaller the AIC, the better the fit of the model to the data (approximately equivalent to minimizing the expected Kullback-Leibler distance between the true model and the estimated model). An advantage of the AIC is that it can be used to compare both nested and nonnested models. Because the AIC penalizes for the increasing number of parameters in the model, it takes into account not only the goodness of fit but also the variance of the parameter estimates. It is computed as

    $$\mathrm{AIC}_i = -2 \ln L_i + 2 N_i,$$

    where N_i is the number of free parameters in the ith model and L_i is the maximum-likelihood value of the data under the ith model. We also tried to empirically tune the penalty of the AIC by running several simulations and finding which penalty would increase model selection accuracy in those simulations. We use the name AIC1 for the standard definition with a penalty of 2, whereas AIC2 is the empirically tuned AIC ($\mathrm{AIC2}_i = -2 \ln L_i + 5 N_i$).

    Bayesian Information Criterion. The BIC (Schwarz, 1974) provides an approximate solution to the natural log of the Bayes factor, especially when sample sizes are large and competing hypotheses are nested (Kass and Wasserman, 1994). The Bayes factor measures the relative support the data set gives to different models, but its computation often involves difficult integrals and


    an approximation becomes convenient. As with the AIC, the BIC can also be used to compare nested and nonnested models. Its definition is

    $$\mathrm{BIC}_i = -2 \ln L_i + N_i \ln n,$$

    where n is the sample size (sequence length). The smaller the BIC, the better the fit of the model to the data. Because in real data analysis the natural log of n is usually >2, the BIC should tend to choose simpler models than does the AIC1 but more complex models than the AIC2.
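The three criteria defined above differ only in the penalty applied per free parameter, which a few lines of code make explicit. The sketch below is illustrative: the function names and the `CANDIDATES` log-likelihoods are hypothetical, chosen so that the criteria disagree.

```python
import math


def aic(lnl, n_params, penalty=2.0):
    """AIC with a per-parameter penalty (2 for AIC1, 5 for the tuned AIC2)."""
    return -2.0 * lnl + penalty * n_params


def bic(lnl, n_params, n_sites):
    """BIC using the sequence length as the sample size."""
    return -2.0 * lnl + n_params * math.log(n_sites)


def best_model(scores):
    """Return the model name with the smallest criterion value."""
    return min(scores, key=scores.get)


# Hypothetical (ln L, number of free parameters) for three candidate models.
CANDIDATES = {"JC": (-3100.0, 0), "HKY": (-3050.0, 4), "GTR": (-3044.0, 8)}


def select_all(candidates, n_sites):
    """Return the models chosen by AIC1, AIC2, and BIC, in that order."""
    aic1 = {m: aic(lnl, k) for m, (lnl, k) in candidates.items()}
    aic2 = {m: aic(lnl, k, penalty=5.0) for m, (lnl, k) in candidates.items()}
    bics = {m: bic(lnl, k, n_sites) for m, (lnl, k) in candidates.items()}
    return best_model(aic1), best_model(aic2), best_model(bics)


if __name__ == "__main__":
    print(select_all(CANDIDATES, n_sites=500))
```

With these made-up scores and 500 sites, AIC1 prefers GTR whereas AIC2 and the BIC prefer HKY. The per-parameter penalties are 2 (AIC1), 5 (AIC2), and ln n (BIC), the last being about 4.6 for 100 sites and 6.2 for 500 sites, so how the BIC compares with the AIC2 depends on sequence length.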

    RESULTS

    We recorded the number of times a model selection method chose any model as the best-fit model out of 100 replicates. The accuracy of the different model selection strategies was defined as the number of times a method recovered the correct model of evolution out of the 100 replicates, that is, the probability of recovering the true model under which the data were simulated. The absolute and relative accuracy of the different methods varied across the different simulated conditions: base tree and number of characters (Simulation I), number of taxa and tree height (Simulation II), molecular clock (Simulation III), and rate heterogeneity among sites (Simulations I and IV).
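The accuracy measure defined above amounts to a simple tally over replicates. A minimal sketch, with a hypothetical function name and made-up selections:

```python
from collections import Counter


def selection_accuracy(selected_models, true_model):
    """Return (fraction of replicates recovering true_model, counts per model)."""
    counts = Counter(selected_models)
    return counts[true_model] / len(selected_models), counts


if __name__ == "__main__":
    # Made-up outcome of one method over 100 replicates simulated under HKY.
    picks = ["HKY"] * 80 + ["GTR"] * 15 + ["HKY+G"] * 5
    accuracy, counts = selection_accuracy(picks, "HKY")
    print(accuracy, counts.most_common())
```

Keeping the full counts, rather than only the accuracy, also shows which incorrect models a method tends to select, for example whether its errors lean towards more complex models.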

    Base Tree

    The first step in the model selection procedure is the estimation of a base tree (including branch lengths). The use of different models of evolution for estimating the NJ base trees did not affect model selection accuracy (Fig. 6). The use of the true tree as the base tree only increased model selection performance, compared with the use of NJ base trees, when the number of characters was small (100). When the base tree was a random tree, then relative to the use of a NJ tree as the base tree, model selection accuracy slightly increased for 100 characters for some methods, but decreased vastly for 500 and 1,000 characters. However, this decrease in accuracy was mainly due to the overestimation of rate variation. The substitutional pattern and the base frequencies were correctly identified only 10–20% less frequently than when a NJ tree was used as the base tree. When the true model did not include rate variation, the model selected by using a random tree as the base tree was identical to the true model in the substitutional pattern and in the base frequencies assumption, but included rate heterogeneity among sites as modeled by the gamma distribution (i.e., the selected model was model+Γ instead of just model; data not shown). When the true model included rate variation (+Γ), the model selected was identical to the true model except that a significant proportion of invariable sites was also included (i.e., the selected model was model+I+Γ instead of just model+Γ).

    Number of Characters

    Increasing the number of characters rapidly improved the performance of most model selection methods. When the true model was JC (Fig. 7), accuracy values were 95% for all methods, except for the AIC1, which selected the true model 70% of the time. When the true model was HKY (Fig. 8), accuracy was 50–80% for 100 characters and increased to 75–100% with 500 and 1,000 characters. However, the LRT3 selected the true model only 10% of the time, being extremely biased towards more complex models (selecting GTR when the true model was HKY). In addition, the AIC1, with 500 and 1,000 characters, recovered the true model only 75% of the time. When the true model was GTR (Fig. 9), accuracy increased from 10% to 80% and 95% with 100, 500, and 1,000 characters, respectively. The AIC1 performed better than the rest with 100 characters, but the opposite was true with 500 or 1,000 characters. With rate heterogeneity (Figs. 7–9), the patterns were similar but with a decrease in accuracy. This decrease was particularly marked for LRT2 and LRT4 when the number of characters simulated was low (100), and for the AICs and BIC when the true model was GTR + Γ.

    Number of Taxa

    Adding taxa also increased, in general, the accuracy of the different methods (Table 3). In the absence of rate variation, most methods performed quite well (>90%) with 10 taxa when the true model was JC or HKY, or with

  • 590 SYSTEMATIC BIOLOGY VOL. 50

    FIGURE 6. Effect of the base tree in model selection. The y-axis represents the number of times a model was selected as the best-fit model (out of 100 replicates). The x-axis represents the model of nucleotide substitution selected (GTRIG corresponds to the GTR + I + Γ model, and so on). The true model is identified by the black triangle below the x-axis. Different methods for model selection are represented on the z-axis (left to right on the legend is front to back on the z-axis). Data were simulated for 20 taxa and a tree height of 0.10 according to the GTR + Γ model with parameter values as in Table 2. (Continued)

  • 2001 POSADA AND CRANDALLMODEL SELECTION 591

    FIGURE 6. (Continued).

    20 taxa when the true model was GTR. In the presence of rate variation, the performance patterns were more complex, and the LRT1 and LRT3 accuracy decreased with more taxa when the true model was HKY + Γ. For the most complex model simulated, GTR + Γ, and with 500 characters, 50 taxa were needed to obtain accuracies >90%.

    Tree Height

    The effect of tree height depended on the complexity of the true model (Table 4). When the true model was JC or HKY, accuracy was not greatly affected by the different tree heights explored. The LRT3 method, however, showed a dramatic decrease in accuracy with increasing tree height, because of its


    FIGURE 7. Model selection accuracy for the JC and JC + Γ models. The y-axis represents the number of times a model was selected as the best-fit model (out of 100 replicates). The x-axis represents the model of nucleotide substitution selected (GTRIG corresponds to the GTR + I + Γ model, and so on). Different methods for model selection are represented on the z-axis (left to right on the legend is front to back on the z-axis). Data were simulated for 20 taxa and a tree height of 0.10 with parameter values as in Table 2.

    increasing bias towards more complex models. When the true model was GTR, tree heights of 0.10 (0.20 for AIC2 and BIC) were necessary to attain accuracy values >90%.

    The effect of tree height was more complex and dramatic when the true model included rate variation among sites. For JC + Γ the accuracy rapidly increased with a tree height of 0.10. When the true model was HKY + Γ, increasing tree heights >0.10


    FIGURE 8. Model selection accuracy for the HKY and HKY + Γ models. The y-axis represents the number of times a model was selected as the best-fit model (out of 100 replicates). The x-axis represents the model of nucleotide substitution selected (GTRIG corresponds to the GTR + I + Γ model, and so on). Different methods for model selection are represented on the z-axis (left to right on the legend is front to back on the z-axis). Data were simulated for 20 taxa and a tree height of 0.10 with parameter values as in Table 2.


    FIGURE 9. Model selection accuracy for the GTR and GTR + Γ models. The y-axis represents the number of times a model was selected as the best-fit model (out of 100 replicates). The x-axis represents the model of nucleotide substitution selected (GTRIG corresponds to the GTR + I + Γ model, and so on). Different methods for model selection are represented on the z-axis (left to right on the legend is front to back on the z-axis). Data were simulated for 20 taxa and a tree height of 0.10 with parameter values as in Table 2.

    decreased accuracy for the LRT1, whereas the LRT2 and LRT4 reached accuracies of 75–90% only with tree heights of 0.50 or 0.75. Again, the LRT3 showed a dramatic decrease in accuracy with increasing tree height. For the GTR + Γ case, most methods showed high accuracy with tree heights >0.10.

    Molecular Clock

    The presence or absence of rate variation among lineages, or in other words,


    TABLE 3. Effect of the number of taxa on model selection. Alignments with 500 characters were simulated under different topologies with a tree height of 0.10, generated by a birth–death process. Sequences were simulated with parameter values as in Table 2. The base tree was a NJ-HKY tree. Ntaxa is the number of taxa.

    True model  Ntaxa  LRT1  LRT2  LRT3  LRT4  LRT1  LRT2  AIC1  AIC2   BIC
    JC             10    98    98    98    97    98    98    64    96    98
                   20    97    97    97    96    97    96    70    93    96
                   50    97    95    97    95    97    95    64    95    98
                  100    94    96    94    94    94    93    67    95    98
    HKY            10   100   100    37   100   100   100    87   100   100
                   20    95    95    12    95    95    95    78    99    99
                   50    99    99     0   100    99    99    86    98    99
                  100    98    97     0    98    98    98    80   100   100
    GTR            10    59    60    60    60    59    59    71    38    16
                   20    91    89    97    91    91    91    84    63    50
                   50    99   100    97   100    99    99    88    98    97
                  100   100   100    90   100   100   100    81   100   100
    JC + Γ         10    98    98    98    97    98    98    64    96    98
                   20    93    91    93    88    93    91    52    91    94
                   50    92    93    94    93    94    92    61    90    95
                  100    94    97    97    97    97    96    72    95    99
    HKY + Γ        10    97     0    54     0    22    32    27    31    31
                   20    95    98     4    99    99    99    80   100   100
                   50    83    96     0    95    95    95    78    96    97
                  100    70    99     0    99    99    99    89    99    99
    GTR + Γ        10    52    11    67    11    30    29    52     8     1
                   20    82    67    97    67    66    67    84    35    17
                   50    93    94    95    95    95    95    86    74    54
                  100    99    99    98    99    99    99    93    98    98

    the presence or absence of a molecular clock, did not affect model selection accuracy for the 1,000-character data sets simulated (Table 5). The extreme bias of the LRT3 for more complex models (selecting GTR when the true model was HKY, and selecting GTR + Γ when the true model was HKY + Γ) was constant in the presence or absence of a molecular clock.

    Rate Variation Among Sites

    Almost invariably, throughout all the simulations, the presence of rate variation among sites reduced accuracy, especially for low numbers of characters or taxa or for small tree heights. When the true model was JC and 1,000 characters were simulated, the amount of rate variation did not affect model selection (Table 6). When the true model was HKY, the LRT3 performed poorly in the presence of any rate variation (again, bias for more complex models), whereas the LRT1 and LRT2 methods showed a decrease in accuracy with extreme values of rate variation (α = 0.05). In this case, the decrease in accuracy due to rate variation resulted from a trend to infer more complex substitutional patterns (data not shown). When the true model was GTR + Γ, most methods showed lower accuracies with extreme rate variation. In that case, the decrease in accuracy was due to the inference of less complex substitutional patterns (data not shown).

    LRT Distribution and Mixed χ²

    Results from the use of a mixed χ² distribution instead of a standard χ² distribution to approximate the LRT P-values were indistinguishable (data not shown).

    DISCUSSION

    Accuracy of Model Selection

    The results of this simulation study suggest that model selection procedures perform quite well under different conditions. Given a modest number of characters (500) and taxa (20), most model selection procedures are highly accurate in most conditions. The LRT methods performed better than the AIC or BIC methods, although LRT3 performed very badly when the models were of medium complexity (HKY). The differences among the LRT methods were small, and the hierarchical and dynamic approaches seemed


    TABLE 4. Effect of tree height on model selection. Alignments with 20 taxa and 500 characters were simulated under different topologies generated by a birth–death process. Sequences were simulated with parameter values as in Table 2. The base tree was a NJ-HKY tree. Tree height is the probability of substitution per site from the root to the tip.

    True model  Tree height  LRT1  LRT2  LRT3  LRT4  LRT1  LRT2  AIC1  AIC2   BIC
    JC                 0.01    99    98    99    97    99    98    66    98    99
                       0.10    97    97    97    96    97    96    70    93    96
                       0.20    97    95    97    97    97    96    64    97    98
                       0.50    97    96    97    96    97    96    62    96    96
                       0.75    98    99    98    98    98    98    60    97    99
    HKY                0.01    99    99    93    99    99    99    86    99    99
                       0.10    95    95    12    95    95    95    78    99    99
                       0.20    99    99     0    99    99    99    86    99   100
                       0.50    96    98     0    98    96    96    83    96    96
                       0.75    95    95     0    96    95    95    74    95    96
    GTR                0.01    12    13    11    12    12    12    34     1     0
                       0.10    91    89    97    91    91    91    84    63    50
                       0.20    99    99    98    99    99    99    83    98    92
                       0.50    99   100    99   100    99    99    81    97    99
                       0.75    98    98    99    98    98    98    81    97    98
    JC + Γ             0.01    27     0    27     0    13    11    16    16    13
                       0.10    93    91    93    88    93    91    52    91    94
                       0.20    96    96    96    96    96    96    66    94    96
                       0.50    92    92    92    92    92    92    60    93    94
                       0.75    98    96    98    97    98    97    71    97    99
    HKY + Γ            0.01    74     0    67     0    16    17    24    27    20
                       0.10    95    98     4    99    99    99    80   100   100
                       0.20    81    97     0    97    97    97    70    96    99
                       0.50    65   100     0   100   100   100    81   100   100
                       0.75    48    98     0    98    98    98    87    98    99
    GTR + Γ            0.01     6     0     0     0     1     2     9     0     0
                       0.10    82    67    97    67    66    67    84    35    17
                       0.20    91    77    97    78    78    78    84    47    30
                       0.50    90    86    95    86    86    86    91    60    40
                       0.75    90    89    82    90    90    90    88    56    45

    to perform equally well, although the dynamic approaches were more stable in their accuracy patterns. The empirically tuned AIC2 performed better than the original AIC1.
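    For readers less familiar with the information criteria compared here, the sketch below computes the AIC and BIC in their standard forms and picks the model with the smallest score. It is only an illustration under stated assumptions: the log-likelihoods are invented, the parameter counts cover free substitution parameters only (branch lengths are shared by all models evaluated on the same base tree), and the empirically tuned AIC2 is not reproduced.

```python
import math


def aic(log_likelihood, n_params):
    """Standard Akaike information criterion: -2 lnL + 2K."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_sites):
    """Standard Bayesian information criterion: -2 lnL + K ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


# Hypothetical log-likelihoods for a 500-character alignment (not real data).
# Each entry is (lnL, K), with K the number of free substitution parameters.
candidates = {
    "JC": (-3050.0, 0),
    "HKY": (-3000.0, 4),
    "GTR": (-2994.0, 8),
}
n_sites = 500

aic_scores = {m: aic(lnl, k) for m, (lnl, k) in candidates.items()}
bic_scores = {m: bic(lnl, k, n_sites) for m, (lnl, k) in candidates.items()}
best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print("AIC choice:", best_aic)
print("BIC choice:", best_bic)
```

    With these invented values the two criteria disagree: the BIC penalty of ln(500) per parameter is larger than the AIC penalty of 2, so the BIC settles on the simpler model.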

    TABLE 5. Effect of the presence or absence of a molecular clock on model selection. Alignments with 20 taxa and 1,000 characters were simulated under the conditions for the trees shown in Figure 3. Sequences were simulated with parameter values as in Table 2. The base tree was a NJ-JC tree.

    True model  Molecular clock  LRT1  LRT2  LRT3  LRT4  LRT1  LRT2  AIC1  AIC2   BIC
    JC          Clock              95    95    95    95    95    94    63    97    98
                Nonclock           96    92    96    93    96    92    55    95    98
    HKY         Clock              98    98     1    98    98    98    81    97   100
                Nonclock           95    96     0    96    95    95    83    98    99
    GTR         Clock              98    98    98    98    98    98    87    97    94
                Nonclock          100   100   100   100   100   100    91    98    93
    JC + Γ      Clock              98    97    97    96    97    96    65    94    98
                Nonclock           97    96    97    97    97    96    61    97    98
    HKY + Γ     Clock              98    99     0    98    99    99    82    98   100
                Nonclock           92    96     0    96    96    96    82    96    97
    GTR + Γ     Clock              97    94   100    95    95    95    85    84    55
                Nonclock           98    92    99    92    92    92    88    81    50

    Influence of the Base Tree

    The initial tree topology used to estimate the likelihood scores for the different models (the base tree) did not affect model selection as long as it was not a random tree. For as few


    TABLE 6. Effect of rate variation among sites on model selection. Alignments with 20 taxa and 1,000 characters were simulated under the conditions for the tree shown in Figure 3a. The tree height was 0.10. Sequences were simulated with base frequencies and substitution parameters as in Table 2. The base tree was a NJ-JC tree. Alpha (α) is the shape of the Γ distribution; Ncat is the number of discrete categories for the Γ distribution. α = ∞ effectively means that there is no rate variation among sites.

    True model      α  Ncat  LRT1  LRT2  LRT3  LRT4  LRT1  LRT2  AIC1  AIC2   BIC
    JC + Γ          ∞          95    95    95    95    95    94    63    97    98
                  0.5     4    98    97    97    96    97    96    65    94    98
                  0.2     4    96    94    96    96    96    95    64    95    96
                  0.2     8    95    94    96    93    96    95    63    97    98
                 0.05     8    97    70    98    95    98    96    70    97    99
    HKY + Γ         ∞          98    98     1    98    98    98    81    97   100
                  0.5     4    98    99     0    98    99    99    82    98   100
                  0.2     4    81    98     0    99    99    99    74    98    99
                  0.2     8    77    98     0    98    98    98    73    97    98
                 0.05     8    67    54     7    96    97    96    88   100   100
    GTR + Γ         ∞          98    98    98    98    98    98    87    97    94
                  0.5     4    97    94   100    95    95    95    85    84    55
                  0.2     4    92    90   100    94    94    94    91    73    45
                  0.2     8    87    79   100    79    79    79    92    50    17
                 0.05     8    70    64    93    70    70    70    92    39    13

    as 100 characters, the use of the true tree as the base tree increased accuracy. This is explained by the fact that NJ trees are worse estimates of the true tree with 100 characters than with 500 or 1,000 characters. At the same time, the amount of information is less in the 100-character data sets, and the estimation of the different parameters in a reliable topology becomes more relevant. With increased amounts of data, the use of NJ trees as base trees resulted in the same accuracy as the use of the true tree. A simple NJ-JC tree worked as well as a NJ-GTR tree in all cases. It has been found previously that accurate estimates of substitution parameters can be obtained even with an incorrect phylogeny (Yang, 1994b), although the tree used in these simulations had very short internal branches (Sullivan et al., 1996). As shown previously by Sullivan et al. (1996), the use of random trees can lead to an overestimation of rate heterogeneity.

    Adding or Removing Parameters?

    An open debate in statistics is whether model selection procedures should start with a simple model to which parameters might be added (bottom-up) or with a complex model from which parameters might be removed (top-down). For example, to select the best-fit model for the data at hand, we could start with the simple JC model and test whether the addition of one parameter significantly improves the likelihood of the model (e.g., JC vs. JC + Γ). If this is the case, the parameter is included in the model, whereas if the likelihood does not improve significantly, the parameter is not added. On the other hand, we could start with the complex GTR + I + Γ and test whether the removal of one parameter (e.g., GTR + I + Γ vs. GTR + I) decreases the likelihood significantly. If the likelihood does not decrease significantly, the parameter is removed from the model. In the context of selection of models of nucleotide substitution, both approaches have been used (e.g., Sullivan and Swofford, 1997; Kelsey et al., 1999). In this simulation, the bottom-up approaches (LRT1 and LRT3) performed better than the top-down ones (LRT2 and LRT4) for small tree heights or numbers of taxa but showed some biases towards wrong models when the true model was HKY or HKY + Γ. This indicates that starting with the simplest or most complex model may influence model selection, although not in a consistent manner. With LRTs, however, the complexity of the starting model did not change the accuracy of model selection.
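    The contrast between bottom-up and top-down testing can be sketched in a few lines of Python. This is a simplified illustration, not the hierarchies used in this study: it assumes a single linear chain of nested models, a standard χ² approximation for every test, an arbitrary significance level of 0.01, and invented log-likelihoods. All function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1,
    using the recurrence Q(a+1, y) = Q(a, y) + y**a * exp(-y) / gamma(a+1)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # df = 2 closed form
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # df = 1 closed form
    while a < df / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lrt_pvalue(lnl_null, lnl_alt, df):
    """P-value of the likelihood ratio statistic 2 * (lnL_alt - lnL_null)."""
    return chi2_sf(2.0 * (lnl_alt - lnl_null), df)


def bottom_up(chain, alpha=0.01):
    """Start at the simplest model; move up while the next model is a
    significant improvement."""
    i = 0
    while i + 1 < len(chain):
        (_, lnl0, k0), (_, lnl1, k1) = chain[i], chain[i + 1]
        if lrt_pvalue(lnl0, lnl1, k1 - k0) >= alpha:
            break
        i += 1
    return chain[i][0]


def top_down(chain, alpha=0.01):
    """Start at the most complex model; move down while dropping the extra
    parameters does not significantly decrease the likelihood."""
    i = len(chain) - 1
    while i > 0:
        (_, lnl0, k0), (_, lnl1, k1) = chain[i - 1], chain[i]
        if lrt_pvalue(lnl0, lnl1, k1 - k0) < alpha:
            break
        i -= 1
    return chain[i][0]


# Hypothetical chain: (name, lnL, number of free substitution parameters).
chain = [("JC", -3050.0, 0), ("HKY", -3000.0, 4),
         ("GTR", -2997.0, 8), ("GTR+G", -2977.0, 9)]
print("bottom-up:", bottom_up(chain))
print("top-down: ", top_down(chain))
```

    With these invented values the bottom-up walk stops at HKY because the HKY vs. GTR test is not significant, whereas the top-down walk never leaves GTR+G because removing the rate heterogeneity parameter is strongly rejected. The two starting points therefore select different models from the same likelihood scores.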

    The Order of Tests of Hypotheses

    For both the bottom-up and top-down approaches, different parameters of the model may be added or removed in different orders.


    In other words, different hypotheses can be tested earlier or later in the hierarchy of LRTs. The order in which parameters are added or removed determines which hypotheses are tested in the presence of which parameters. For example, the hypothesis of no transition/transversion bias can be tested by comparing JC and K80, with no additional free parameters, or by comparing F81 versus HKY, both of which contain base frequency parameters. If the presence of additional parameters does not affect the outcome of the LRTs, we expect model selection to be independent of the order in which the different hypotheses are tested. Whelan and Goldman (1999) and Goldman and Whelan (2000) suggested this to be the case, but their LRTs were performed by assuming the true model was the null hypothesis. This is not the situation here, where the null hypothesis will be the true model only in a few of the LRTs performed, nor is it the case with real data, because the true model is unknown. On the contrary, Zhang (1999) suggested that LRTs of the transition/transversion bias or rate variation are affected by the presence of other parameters. For example, in Zhang's simulations the failure to take into account unequal base frequencies led to the rejection of the null hypothesis of no transition bias much more often than expected. Cunningham et al. (1998), using an empirically generated phylogeny, observed that the choice of the best-fit model was affected by the order of addition of parameters. However, their conclusion would be the opposite if they had corrected for type I error in the tests presented in their Table 1. In these simulations, the patterns of accuracy of LRT1 and LRT2, and of LRT3 and LRT4, were almost identical most of the time, but the LRT3 clearly showed a bias toward complex models not present in LRT4. This suggests that the order in which parameters are added to or removed from a model may have some effect on model selection under some circumstances. In general, testing first for the base frequencies and the substitution pattern before testing for rate heterogeneity appeared to be slightly more effective, unless rate heterogeneity was strong. Perhaps the inclusion of rate heterogeneity accounts for a large portion of the variation (information) in the data. If included at the beginning of the sequence of tests, especially when starting from simple models, this variation would make it more difficult to effectively test other hypotheses. However, this effect was not significant, because the performance of the hierarchical and dynamic LRTs was more often similar than not.

    The χ² Distribution

    Although the standard χ² distribution may be significantly different from the true LRT distribution in the case of boundary LRTs, the P-values obtained in LRTs of evolutionary hypotheses are often so small that this bias does not affect the results. Still, the appropriate χ² distribution should be used in each case (Goldman and Whelan, 2000; Ota et al., 2000).
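    To see why the choice of distribution rarely changes the outcome, consider a single parameter tested on the boundary of its range. The sketch below assumes the commonly used 50:50 mixture of χ²₀ (a point mass at zero) and χ²₁ for that case, which simply halves the standard one-degree-of-freedom P-value. The function names and the test statistics are hypothetical.

```python
import math


def chi2_1df_pvalue(delta):
    """P(chi2_1 > delta), using the fact that chi2_1 is a squared
    standard normal variate."""
    if delta <= 0:
        return 1.0
    return math.erfc(math.sqrt(delta / 2.0))


def mixed_chi2_pvalue(delta):
    """P-value under an assumed 50:50 mixture of chi2_0 and chi2_1."""
    if delta <= 0:
        return 1.0
    return 0.5 * chi2_1df_pvalue(delta)


# Hypothetical likelihood ratio statistics, 2 * (lnL_alt - lnL_null).
for delta in (3.0, 50.0):
    print(delta, chi2_1df_pvalue(delta), mixed_chi2_pvalue(delta))
```

    For a borderline statistic such as 3.0 the two approximations fall on opposite sides of 0.05, but for a large statistic such as 50 both P-values are vanishingly small and the decision is the same.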

    The Importance of Models

    The relevance of models of evolution to phylogenetic estimation has been extensively discussed in the literature. In general, phylogenetic methods perform worse when the model of evolution assumed is incorrect (Felsenstein, 1978; Huelsenbeck and Hillis, 1993; Huelsenbeck, 1995; Bruno and Halpern, 1999). When substitution rates vary among lineages, the use of an appropriate model is of utmost importance for obtaining a correct tree topology (Takezaki and Gojobori, 1999; Philippe and Germot, 2000). Cases where the use of wrong models increases phylogenetic performance (see Yang, 1997b; Xia, 2000; Posada and Crandall, 2001) are exceptional and rather represent a bias towards the true tree associated with violated assumptions (Bruno and Halpern, 1999). However, the relationship between the fit of the model to the data and the ability of the model to correctly predict topology is not straightforward. Topology estimation by methods such as maximum likelihood is relatively robust to the model used (Fukami-Kobayashi and Tateno, 1991; Gaut and Lewis, 1995; Yang et al., 1995). The evaluation of reliability of the estimated trees depends critically on the model; false or simple models tend to suggest that a tree is significantly supported when it cannot be (Yang et al., 1994; Buckley et al., 2000).

    The use of appropriate models is especially critical for parameter estimation and, consequently, for understanding the evolutionary process. When a relatively simple model of substitution is assumed, the transition/transversion ratio, branch lengths, and sequence divergence are underestimated,


    whereas the shape parameter of the gamma distribution is overestimated (Tamura, 1992; Wakeley, 1994; Yang et al., 1994, 1995; Adachi and Hasegawa, 1995; Buckley et al., 2000). Moreover, the outcome of different tests of evolutionary hypotheses (e.g., molecular clock) may depend on the model of evolution assumed (Zhang, 1999).

    A researcher should then adopt the statistical model-fitting approach (sensu Huelsenbeck, 1995) and select among different models the one that best fits the data. However, is this selection reliable?

    Caveats and Conclusions

    Different model selection methods work well with simulated data sets. What is the relevance of this result to real data sets? The results obtained here pertain to a perfect fit between model and data, and real data rarely fit models perfectly. A future avenue of research might explore more realistic models, such as nonreversible models or codon models for coding sequences. However, we have chosen conservative and meaningful parameter values in an attempt to mimic, as much as possible, empirical data sets. We suggest that if model selection procedures are able to recognize some features of the process of evolution in simulated data sets, these same methods can be expected to recognize these same features in real data sets, selecting the more appropriate, although still imperfect, models of evolution. Moreover, the parameter values used here were conservative. For example, more biased base frequencies (e.g., 0.5:0.1:0.1:0.3, instead of the 0.35:0.15:0.25:0.25 used here) would be expected to increase accuracy values for all model selection methods, because the differences among the models would become more evident.

    Indeed, any model of nucleotide substitution is necessarily a simplification of the actual evolutionary process. Even the best-fit model is far from the true model underlying the evolution of the sequences under study. However, the statistical selection of the model of evolution used in the analysis is, first, philosophically necessary (for justification of the use of a particular model), and second, should provide equal or better estimates. Model selection should be a standard procedure in phylogenetic studies. A program facilitating this task, Modeltest (Posada and Crandall, 1998), can be downloaded at no charge from http://bioag.byu.edu/zoology/crandall lab/modeltest.htm.

    ACKNOWLEDGMENTS

    Many thanks to David Swofford for his discussions on model selection. This work was supported by a BYU Graduate Studies Award (D.P.), the Alfred P. Sloan Foundation, and grant NIH R01-HD 34350-01A1 (K.A.C.).

    REFERENCES

    ADACHI, J., AND M. HASEGAWA. 1995. Improved dating of the human/chimpanzee separation in the mitochondrial DNA tree: Heterogeneity among amino acid sites. J. Mol. Evol. 40:622–628.

    AKAIKE, H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19:716–723.

    BRUNO, W. J., AND A. L. HALPERN. 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16:564–566.

    BUCKLEY, T. R., C. SIMON, AND G. K. CHAMBERS. 2001. Exploring among-site rate variation models in a maximum likelihood framework using empirical data: The effects of model assumptions on estimates of topology, edge lengths, and bootstrap support. Syst. Biol. 50:67–86.

    CUNNINGHAM, C. W., H. ZHU, AND D. M. HILLIS. 1998. Best-fit maximum-likelihood models for phylogenetic inference: Empirical tests with known phylogenies. Evolution 52:978–987.

    EDWARDS, A. W. F., AND L. L. CAVALLI-SFORZA. 1964. Reconstruction of evolutionary trees. Pages 67–76 in Phenetic and phylogenetic classification (J. McNeill, ed.). Systematics Association Publication, London.

    FARRIS, J. S. 1973. A probability model for inferring evolutionary trees. Syst. Zool. 22:250–256.

    FELSENSTEIN, J. 1973. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Zool. 22:240–249.

    FELSENSTEIN, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–410.

    FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368–376.

    FELSENSTEIN, J. 1988. Phylogenies from molecular sequences: Inference and reliability. Annu. Rev. Genet. 22:521–565.

    FRATI, F., C. SIMON, J. SULLIVAN, AND D. L. SWOFFORD. 1997. Gene evolution and phylogeny of the mitochondrial cytochrome oxidase gene in Collembola. J. Mol. Evol. 44:145–158.

    FUKAMI-KOBAYASHI, K., AND Y. TATENO. 1991. Robustness of maximum likelihood tree estimation against different patterns of base substitutions. J. Mol. Evol. 32:79–91.

    GAUT, B. S., AND P. O. LEWIS. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152–162.

    GOLDMAN, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182–198.

    GOLDMAN, N., AND S. WHELAN. 2000. Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics. Mol. Biol. Evol. 17:975–978.

    HASEGAWA, M. 1990a. Mitochondrial DNA evolution in primates: Transition rate has been extremely low in the lemur. J. Mol. Evol. 31:113–121.

    HASEGAWA, M. 1990b. Phylogeny and molecular evolution in primates. Jpn. J. Genet. 65:243–266.

    HASEGAWA, M., K. KISHINO, AND T. YANO. 1985. Dating the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.

    HUELSENBECK, J. P. 1995. Performance of phylogenetic methods in simulation. Syst. Biol. 44:17–48.

    HUELSENBECK, J. P., AND K. A. CRANDALL. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437–466.

    HUELSENBECK, J. P., AND D. M. HILLIS. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247–264.

    JUKES, T. H., AND C. R. CANTOR. 1969. Evolution of protein molecules. Pages 21–132 in Mammalian protein metabolism (H. M. Munro, ed.). Academic Press, New York.

    KASS, R. E., AND L. WASSERMAN. 1994. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania. 16.

    KELSEY, C. R., K. A. CRANDALL, AND A. F. VOEVODIN. 1999. Different models, different trees: The geographic origin of PTLV-I. Mol. Phylogenet. Evol. 13:336–347.

    KENDALL, M., AND STUART. 1979. The advanced theory of statistics. Charles Griffin, London.

    KULLBACK, S., AND R. A. LEIBLER. 1951. On information and sufficiency. Ann. Math. Stat. 22:79–86.

    MOROZOV, P., T. SITNIKOVA, G. CHURCHILL, F. J. AYALA, AND A. RZHETSKY. 2000. A new method for characterizing replacement rate variation in molecular sequences: Application of the Fourier and Wavelet models to Drosophila and mammalian proteins. Genetics 154:381–395.

    MUSE, S. 1999. Modeling the molecular evolution of HIV sequences. Pages 122–152 in The evolution of HIV (K. A. Crandall, ed.). Johns Hopkins Univ. Press, Baltimore.

    OTA, R., P. J. WADDELL, M. HASEGAWA, H. SHIMODAIRA, AND H. KISHINO. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Biol. Evol. 17:798–803.

    PHILIPPE, H., AND A. GERMOT. 2000. Phylogeny of Eukaryotes based in ribosomal RNA: Long-branch attraction and models of sequence evolution. Mol. Biol. Evol. 17:830–834.

    POSADA, D., AND K. A. CRANDALL. 1998. Modeltest: Testing the model of DNA substitution. Bioinformatics 14:817–818.

    POSADA, D., AND K. A. CRANDALL. 2001. Simple (wrong) models for complex trees: Empirical bias. Mol. Biol. Evol. 18:271–275.

    RAMBAUT, A., AND N. C. GRASSLY. 1997. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235–238.

    RANNALA, B., AND Z. YANG. 1996. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. J. Mol. Evol. 43:304–311.

    RZHETSKY, A., AND M. NEI. 1995. Tests of applicability of several substitution models for DNA sequence data. Mol. Biol. Evol. 12:131–151.

    SAITOU, N., AND M. NEI. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.

    SCHWARZ, G. 1974. Estimating the dimension of a model. Ann. Stat. 6:461–464.

    STEEL, M., AND D. PENNY. 2000. Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol. Biol. Evol. 17:839–850.

    SULLIVAN, J., K. E. HOLSINGER, AND C. SIMON. 1996. The effect of topology on estimates of among-site rate variation. J. Mol. Evol. 42:308–312.

    SULLIVAN, J., AND D. L. SWOFFORD. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenies. J. Mammal. Evol. 4:77–86.

    SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL, AND D. M. HILLIS. 1996. Phylogenetic inference. Pages 407–514 in Molecular systematics (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer Associates, Sunderland, Massachusetts.

    SWOFFORD, D. L. 1998. PAUP: Phylogenetic analysis using parsimony and other methods. 4.0 beta. Sinauer Associates, Sunderland, Massachusetts.

    TAKEZAKI, N., AND T. GOJOBORI. 1999. Correct and incorrect vertebrate phylogenies obtained by the entire mitochondrial DNA sequences. Mol. Biol. Evol. 16:590–601.

    TAMURA, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 9:678–687.

    TAMURA, K. 1994. Model selection in the estimation of the number of nucleotide substitutions. Mol. Biol. Evol. 11:154–157.

    TAVARE, S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Pages 57–86 in Some mathematical questions in biology: DNA sequence analysis (R. M. Miura, ed.). Am. Math. Soc., Providence, RI.

    WAKELEY, J. 1994. Substitution-rate variation among sites and the estimation of transition bias. Mol. Biol. Evol. 11:436–442.

    WHELAN, S., AND N. GOLDMAN. 1999. Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics. Mol. Biol. Evol. 16:1292–1299.

    XIA, X. 2000. Phylogenetic relationships among horseshoe crab species: Effect of substitution models in phylogenetic analysis. Syst. Biol. 49:87–100.

    YANG, Z. 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401.

    YANG, Z. 1994a. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306–314.

    YANG, Z. 1994b. Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Syst. Biol. 43:329–342.

    YANG, Z. 1996a. Phylogenetic analysis using parsimony and likelihood methods. J. Mol. Evol. 42:294–307.

    YANG, Z. 1996b. Among-site rate variation and its impact on phylogenetic analysis. Trends Ecol. Evol. 11:367–372.


    YANG, Z. 1996c. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42:587–596.

    YANG, Z. 1997a. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

    YANG, Z. 1997b. How often do wrong models produce better phylogenies? Mol. Biol. Evol. 14:105–108.

    YANG, Z., N. GOLDMAN, AND A. FRIDAY. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316–324.

    YANG, Z., N. GOLDMAN, AND A. FRIDAY. 1995. Maximum likelihood trees from DNA sequences: A peculiar statistical estimation problem. Syst. Biol. 44:384–399.

    YANG, Z., AND B. RANNALA. 1997. Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte Carlo method. Mol. Biol. Evol. 14:717–724.

    ZHANG, J. 1999. Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Mol. Biol. Evol. 16:868–875.

    Received 7 April 2000; accepted 13 June 2000
    Associate Editor: C. Simon