Partitioning and Combining Data in Phylogenetic Analysis

14
384 SYSTEMATIC BIOLOGY VOL. 42 limitation is the restricted availability of these metrics in standard computer pack- ages. REFERENCES CAVALLI-SFORZA, L. L., E. MINCH, AND J. L. MOUNTAIN. 1992. Coevolution of genes and languages revis- ited. Proc. Natl. Acad. Sci. USA 89:5620-5624. CAVALLI-SFORZA, L. L., A. PIAZZA, P. MENOZZI, AND J. L. MOUNTAIN. 1988. Reconstruction of human evolution: Bringing together genetic, archaeologi- cal, and linguistic data. Proc. Natl. Acad. Sci. USA 85:6002-6006. CAVALLI-SFORZA, L. L., A. "PIAZZA, P. MENOZZI, AND J. L. MOUNTAIN. 1989. Genetic and linguistic evo- lution. Science 244:1128-1129. CORBALLIS, M. C. 1992. On the evolution of language and generativity. Cognition 44:197-226. DAY, W. H. E. 1986. Analysis of quartet dissimilarity measures between undirected phylogenetic trees. Syst. Zool. 35:325-333. FOLEY, R. A. 1991. The silence of the past. Nature 353:114-115. HEDGES, S. B., S. KUMAR, K. TAMURA, AND M. STONE- KING. 1992. Human origins and analysis of mi- tochondrial DNA sequences. Science 255:737-739. NOBLE, W., AND I. DAVIDSON. 1991. The evolutionary emergence of modern human behaviour: Language and its archeology. Man 26:223-253. O'GRADY, R. T., I. GODDARD, R. T. BATEMAN, W. D. DIMICHELLE, V. A. FUNK, W. J. KRESS, R. MOOI, AND P. F. CANNELL. 1989. Genes and tongues. Science 243:1651. PENNY, D., AND M. D. HENDY. 1985. The use of tree comparison metrics. Syst. Zool. 34:75-82. PENNY, D., M. D. HENDY, AND M. A. STEEL. 1991. Testing the theory of descent. Pages 155-183 in Phy- logenetic analysis of DNA sequences (M. M. Mi- yamoto and J. Cracraft, eds.). Oxford Univ. Press, New York. PENNY, D., M. D. HENDY, AND M. A. STEEL. 1992. Progress with methods for constructing evolution- ary trees. Trends Ecol. Evol. 7:73-79. ROBINSON, D. F., AND L. R. FOULDS. 1981. Compari- son of phylogenetic trees. Math. Biosci. 53:131-147. RUHLEN, M. 1987. A guide to the world's languages, Volume 1. Classification. Stanford Univ. Press, Stanford, California. STEEL, M. A. 1988. Distributions of the symmetric difference metric on phylogenetic trees. SIAM J. Discr. Math. 1:541-551. STEEL, M. A., AND D. PENNY. 1993. Distributions of tree comparison metrics—Some new results. Syst. Biol. 42:126-141. SWOFFORD, D. L. 1991. When are phylogeny esti- mates from molecular and morphological data in- congruent? Pages 295-333 in Phylogenetic analysis of DNA sequences (M. M. Miyamoto and J. Cracraft, eds.). Oxford Univ. Press, New York. WARD, R. H., B. L. FRAZIER, K. DEW-JAGER, AND S. PAABO. 1991. Extensive mitochondrial diversity within a single Amerindian tribe. Proc. Natl. Acad. Sci. USA 88:8720-8724. WILLIAMS. W. T., AND H. T. CLIFFORD. 1971. On the comparison of two classifications of the same set of elements. Taxon 20:519-522. Received 4 December 1992; accepted 5 March 1993 Sysl. Biol. 42(3):384-397, 1993 Partitioning and Combining Data in Phylogenetic Analysis 1 J. J. BULL, 2 JOHN P. HUELSENBECK, 2 CLIFFORD W. CUNNINGHAM, 2 DAVID L. SWOFFORD, 3 AND PETER J. WADDELL 4 department of Zoology, University of Texas, Austin, Texas 78712, USA laboratory of Molecular Systematics, MRC 534, Smithsonian Institution, Washington, D.C. 20560, USA ^Department of Plant Biology, Massey University, Palmerston North, New Zealand Systematists are confronted with an abundance of data for estimating phylog- enies, and the number of these data is ex- 1 The ideas in this paper were developed indepen- dently by J.J.B., J.P.H., and C.W.C. and by D.L.S. and P.W. panding at a remarkable pace. Rather than proving to be an unqualified blessing, however, this glut poses a new challenge to the field because not all data produce the same estimate of phylogeny. System- atists must thus decide how best to analyze diverse data. Opinions differ on whether at University of Saskatchewan on July 6, 2012 http://sysbio.oxfordjournals.org/ Downloaded from

Transcript of Partitioning and Combining Data in Phylogenetic Analysis

Page 1: Partitioning and Combining Data in Phylogenetic Analysis

384 SYSTEMATIC BIOLOGY VOL. 42

limitation is the restricted availability ofthese metrics in standard computer pack-ages.

REFERENCES

CAVALLI-SFORZA, L. L., E. MINCH, AND J. L. MOUNTAIN.1992. Coevolution of genes and languages revis-ited. Proc. Natl. Acad. Sci. USA 89:5620-5624.

CAVALLI-SFORZA, L. L., A. PIAZZA, P. MENOZZI, AND J.L. MOUNTAIN. 1988. Reconstruction of humanevolution: Bringing together genetic, archaeologi-cal, and linguistic data. Proc. Natl. Acad. Sci. USA85:6002-6006.

CAVALLI-SFORZA, L. L., A. "PIAZZA, P. MENOZZI, AND J.L. MOUNTAIN. 1989. Genetic and linguistic evo-lution. Science 244:1128-1129.

CORBALLIS, M. C. 1992. On the evolution of languageand generativity. Cognition 44:197-226.

DAY, W. H. E. 1986. Analysis of quartet dissimilaritymeasures between undirected phylogenetic trees.Syst. Zool. 35:325-333.

FOLEY, R. A. 1991. The silence of the past. Nature353:114-115.

HEDGES, S. B., S. KUMAR, K. TAMURA, AND M. STONE-KING. 1992. Human origins and analysis of mi-tochondrial DNA sequences. Science 255:737-739.

NOBLE, W., AND I. DAVIDSON. 1991. The evolutionaryemergence of modern human behaviour: Languageand its archeology. Man 26:223-253.

O'GRADY, R. T., I. GODDARD, R. T. BATEMAN, W. D.DIMICHELLE, V. A. FUNK, W. J. KRESS, R. MOOI, ANDP. F. CANNELL. 1989. Genes and tongues. Science243:1651.

PENNY, D., AND M. D. HENDY. 1985. The use of treecomparison metrics. Syst. Zool. 34:75-82.

PENNY, D., M. D. HENDY, AND M. A. STEEL. 1991.Testing the theory of descent. Pages 155-183 in Phy-logenetic analysis of DNA sequences (M. M. Mi-yamoto and J. Cracraft, eds.). Oxford Univ. Press,New York.

PENNY, D., M. D. HENDY, AND M. A. STEEL. 1992.Progress with methods for constructing evolution-ary trees. Trends Ecol. Evol. 7:73-79.

ROBINSON, D. F., AND L. R. FOULDS. 1981. Compari-son of phylogenetic trees. Math. Biosci. 53:131-147.

RUHLEN, M. 1987. A guide to the world's languages,Volume 1. Classification. Stanford Univ. Press,Stanford, California.

STEEL, M. A. 1988. Distributions of the symmetricdifference metric on phylogenetic trees. SIAM J.Discr. Math. 1:541-551.

STEEL, M. A., AND D. PENNY. 1993. Distributions oftree comparison metrics—Some new results. Syst.Biol. 42:126-141.

SWOFFORD, D. L. 1991. When are phylogeny esti-mates from molecular and morphological data in-congruent? Pages 295-333 in Phylogenetic analysisof DNA sequences (M. M. Miyamoto and J. Cracraft,eds.). Oxford Univ. Press, New York.

WARD, R. H., B. L. FRAZIER, K. DEW-JAGER, AND S. PAABO.1991. Extensive mitochondrial diversity within asingle Amerindian tribe. Proc. Natl. Acad. Sci. USA88:8720-8724.

WILLIAMS. W. T., AND H. T. CLIFFORD. 1971. On thecomparison of two classifications of the same set ofelements. Taxon 20:519-522.

Received 4 December 1992; accepted 5 March 1993

Sysl. Biol. 42(3):384-397, 1993

Partitioning and Combining Data in Phylogenetic Analysis1

J. J. BULL,2 JOHN P. HUELSENBECK,2 CLIFFORD W. CUNNINGHAM,2

DAVID L. SWOFFORD,3 AND PETER J. WADDELL4

department of Zoology, University of Texas,Austin, Texas 78712, USA

laboratory of Molecular Systematics, MRC 534, Smithsonian Institution,Washington, D.C. 20560, USA

^Department of Plant Biology, Massey University,Palmerston North, New Zealand

Systematists are confronted with anabundance of data for estimating phylog-enies, and the number of these data is ex-

1 The ideas in this paper were developed indepen-dently by J.J.B., J.P.H., and C.W.C. and by D.L.S. andP.W.

panding at a remarkable pace. Rather thanproving to be an unqualified blessing,however, this glut poses a new challengeto the field because not all data producethe same estimate of phylogeny. System-atists must thus decide how best to analyzediverse data. Opinions differ on whether

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 2: Partitioning and Combining Data in Phylogenetic Analysis

1993 POINTS OF VIEW 385

data sets collected from the same taxashould be (1) combined prior to phyloge-netic analysis (Miyamoto, 1985; Kluge,1989; Barrett et al., 1991; Donoghue andSanderson, 1992), (2) analyzed separately(Pesoleetal.,1991;Shafferetal.,1991;Swof-ford, 1991; Marshall, 1992; de Queiroz, 1993;Rodrigo et al., in press), or (3) analyzedseparately before combining the indepen-dent estimates using consensus methods(Adams, 1972; Mickevich, 1978; Nelson,1979; Hillis, 1987; Swofford, 1991; de Quei-roz, 1993; Rodrigo et al., in press). Contro-versy exists over which of these approach-es is superior, and advocates of differentviews openly acknowledge that the matteris unresolved. Here, we suggest that a com-bined analysis of potentially diverse datais inappropriate unless it is shown that thedifferent data sets are not significantly het-erogeneous with respect to the reconstruc-tion model. If data sets are demonstrablyheterogeneous they should not be com-bined in an analysis that assumes characterhomogeneity.

THE BASIC ARGUMENT

For any modest set of well-known taxa,the systematist may have access to nucleicacid sequences from many different genesplus a variety of morphological, biochem-ical, and physiological characters. Recon-struction based on a broad range of char-acter types offers the appeal of avoidingconsistent biases among characters thatmight yield erroneous reconstructions. Ar-guments for including all available data ina single analysis have typically followedthree basic lines. The first (e.g., Hillis, 1987)suggests that some character classes areuseful in resolving certain areas of the treebut are uninformative for others. If, forexample, one character set resolves nodescloser to the tips of the tree and another ismore useful for basal branching, combi-nation of the two may substantially im-prove the resolution of the full tree. A sec-ond line of argument is that weak, but true,signals may be present in different datasets, but the signal within any single dataset may be masked by noise. By combiningthe data sets, it is hoped that the signals

will be additive and will rise above thenoise (e.g., Barrett et al., 1991). Finally, itis sometimes argued from first principles(e.g., Kluge, 1989) that any phylogeneticanalysis must attempt to explain all avail-able data simultaneously.

Although these arguments clearly havemerit, the inclusion of many differentcharacters also increases the chance thatsupport for true phylogenetic groupingscoming from reliable characters may be di-luted by random or systematic errors fromunreliable characters. In the worst case, acombined analysis can yield an erroneousestimate of phylogeny with increasing cer-tainty as data set size increases (i.e., is "pos-itively misleading"), even when one of thedata sets taken alone provides a reliableestimate. We suggest that it is better to ob-tain one right answer and one wrong an-swer from separate analyses than to get asingle wrong answer in a combined anal-ysis, especially now that statistical meth-ods are becoming available for determin-ing when an answer is likely to be wrongdue to violation of assumptions (Penny etal., 1992).

When faced with potentially diverse setsof data for the same taxa, the data shouldbe compared statistically for phylogeneticincongruence (heterogeneity) and shouldnot automatically be assumed to be ho-mogeneous (see Fig. 1). Statistical evidenceof heterogeneity requires that the phylog-eny estimates obtained from the variousdata sets differ by an amount greater thana threshold (e.g., 95%) calculated accordingto a prescribed sampling process. If the nullhypothesis of homogeneity is supported,the data may be combined in an analysisthat assumes homogeneity. If the null hy-pothesis is rejected, the data should not becombined without further justification.When the basis of heterogeneity is known,it may be possible to develop alternativereconstruction models that accommodatethe heterogeneity in a single analysis. Butif the cause of the heterogeneity remainsunknown, results from the separate anal-yses must await future resolution, and theseparate estimates should each be enter-tained as possible trees (e.g., one of the

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 3: Partitioning and Combining Data in Phylogenetic Analysis

386 SYSTEMATIC BIOLOGY VOL. 42

Test for homogeneitybetween data sets

Accept nullhypothesis ofhomogeneity

Reject nullhypothesis ofhomogeneity

Combine data in single analysis

Known cause of heterogeneity:

Revise reconstruction model

OR

Unknown cause of heterogeneity:

No Obvious Resolution

FIGURE 1. Partitions of character data should be subjected to a statistical test to determine if the partitioneddata violate the null model of data homogeneity with respect to a phylogenetic reconstruction model. If thenull model is rejected and the basis of heterogeneity can be identified and incorporated into a new recon-struction model, then the partitions can be incorporated into a single analysis. If the data heterogeneitycannot be identified, phylogenetic analyses should be applied to each partition, and the different estimatesshould be treated as tenable.

trees is correct or more than one tree iscorrect).

This general procedure of evaluatingheterogeneity before combining data iswidely used in nearly all fields of science,and it is given a formal basis with the sta-tistical procedures of contingency tablesand analyses of variance. There are twoparts to the argument for use of this pro-cedure in systematics. First, any attempt toreconstruct phylogeny rests on an under-lying model (assumptions) of how evolu-tion has operated in the lineages. Second,the failure of a method to recover the truephylogeny reflects a failure of the meth-od's assumptions and/or sampling error.When different data sets yield significantlydifferent phylogenetic estimates (i.e., thedifferences are too great to be attributed tosampling error), the sets of characters musthave evolved under different rules (per-haps even different histories). In this case,the systematist should not be forced toadopt a single set of assumptions aboutcharacter evolution in a combined analy-sis.

Reconstruction Assumes EvolutionaryProcess

Algorithms that attempt to recover evo-lutionary history incorporate models ofevolutionary processes. For example, re-construction methods assume a treelikestructure to evolutionary history and fur-ther assume that the same history is com-mon to all characters included in the anal-ysis. These methods are also sensitive toand thus incorporate assumptions about thecauses of convergence or homoplasy. Fur-thermore, reconstruction methods makeassumptions about rates of character-statechange (e.g., Jukes and Cantor, 1969; Ki-mura, 1980; also see Huelsenbeck and Hil-lis, 1993), in some cases explicitly imposingdifferent costs for various rates of charac-ter-state changes (Camin and Sokal, 1965;Kluge and Farris, 1969; Farris, 1970, 1977;Williams and Fitch, 1989; Swofford and 01-sen, 1990; Wheeler, 1990). The model ofevolutionary process is explicit in maxi-mum-likelihood reconstruction, whereas itis largely implicit in other methods such

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 4: Partitioning and Combining Data in Phylogenetic Analysis

1993 POINTS OF VIEW 387

as parsimony and neighbor joining (Fel-senstein, 1981). Nonetheless, all of theseimplicit reconstruction methods are knownto fail when lineages have evolved undercertain rules; these failures reveal the de-pendence of the methods on underlyingassumptions of reconstruction just as sure-ly as if the underlying model were explicit(Felsenstein, 1978).

Heterogeneity Indicates Model Failureor Different Histories

The consistent failure of a method to re-cover the correct phylogeny (i.e., failurewith a sufficiently large amount of data tonegate sampling error) indicates that atleast one of the method's assumptions aboutevolutionary process is unacceptably false.However, we rarely know the true phy-logeny and thus cannot know whether thereconstruction has failed (except in simu-lation studies). But suppose that estimatesobtained from two different data sets aresignificantly different, such that the dif-ferences cannot be attributed merely tosampling among the different sets of char-acters (Shaffer et al., 1991; Swofford, 1991;de Queiroz, 1993). In this case, either one(or both) of the estimates is wrong or bothestimates are correct and the two sets ofcharacters had different histories. It makesno sense to combine the data in either case.We may not know the reason for the in-congruence, and thus we may not knowwhich of the data sets violates the recon-struction model's assumptions. Yet we arecertain that the combined data set violatesthe model.

By adopting the null hypothesis of ho-mogeneity, data will be combined unlessshown to be significantly heterogeneous.This approach ensures that many cases ofslight (and perhaps even modest) hetero-geneity will be treated as homogeneoussimply because the observed level of het-erogeneity for the given sample size willnot be large enough to reject the null hy-pothesis. Thus the fact that two subsets ofthe total data yield different estimates ofphylogeny will not alone preclude com-bining the data. For the same reasons, therequirements of sufficient sample size to

reject a null hypothesis of homogeneitywill set an effective lower limit to the sizeof the subsets.

An Example

Molecular studies of viruses and bacteriahave revealed cases of horizontal genetransfer, i.e., a small portion of the genomeof one "species" (species A) has replacedthe homologous portion of the genome inanother species (B) (Dykhuizen and Green,1991; Maynard Smith et al., 1991; Medigueet al., 1991; Souza et al., 1992; Valdez andPifLero, 1992). The evolutionary history ofthe species B genome differs for differentgenes, and reconstruction of the evolu-tionary histories of species A and B woulddepend on which characters were ana-lyzed. No rational systematist would sug-gest combining genes with different his-tories to produce a single reconstruction,because combining the data not only ob-scures an important feature of history butruns the risk of producing a reconstructionthat fails to represent either history. Yetthe situation is no different when two datasets have experienced different evolution-ary processes for other reasons; a combinedanalysis obscures the different processesand may also produce an erroneous recon-struction (see also Doyle, 1992). Testing forheterogeneity is the way to determinewhether the rules have been different.

ELABORATION

Estimation Is Not Necessarily Improved byCombining Data

It is perhaps counterintuitive that theaddition of more data to an estimation pro-cedure can fail to improve the estimate.The catch here is that the data are hetero-geneous and thus are derived from differ-ent distributions. In statistics, an estimatoris said to be consistent if it converges onthe true value as more data are added. Invarious analytical and simulation studies,some tree reconstruction methods are ro-bust estimators of phylogenetic relation-ships under a wide range of conditions (seeFelsenstein, 1978; Hendy and Penny, 1989;DeBry, 1992; Huelsenbeck and Hillis, 1993,

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 5: Partitioning and Combining Data in Phylogenetic Analysis

388 SYSTEMATIC BIOLOGY VOL. 42

CM

4 *

Rate 1FIGURE 2. Phylogenetic estimation from combined

data can be worse than that from data sets analyzedseparately. Black squares indicate that the estimatefrom combined data is on average worse than thebetter of the separate estimates. Data were generatednumerically assuming a four-taxon tree with equalbranch lengths (Appendix, Fig. 7). At any point onthe graph, each of the two data sets are evolving atone of two rates—rate 1 (x-axis) or rate 2 (y-axis). Thedata sets are homogeneous with respect to evolution-ary rate along the diagonal from the lower left to theupper right of the graph and depart from homoge-neity as distance from the diagonal increases. The axesrepresent the percentage of characters that would beexpected to change along any given branch of themodel phylogeny and proceed from very low ratesof change to instantaneous rates of change that areinfinite (with an infinite instantaneous rate of change,75% of characters would be expected to change be-tween nodes). The graph contains a 75 x 75 array ofpoints, with each point representing 100 simulatedphylogenies. The size of each data set was standard-ized to 50 variable characters.

for these conditions). In such cases, onemight suppose that combining data in-variably improves the estimate, but this in-tuition fails when the data being combinedhave evolved under different rules (henceare heterogeneous).

Consider a model of nucleotide evolu-tion with four (terminal) taxa. In the firstanalysis, assume that the expected branchlengths are equal for all five branches ofthe phylogeny (Appendix, Fig. 7). How-ever, let one set of sites evolve at one rate(rate 1) and another set of sites evolve at adifferent rate (rate 2). The question iswhether the combination of sites evolving

at rate 1 with sites evolving at rate 2 everfails to provide a better (or equally good)estimate of phylogeny as obtained fromthe best estimate from the two sets ana-lyzed separately. Figure 2 shows in blackthose combinations of parameters for whichthe estimate from the combined data is onaverage worse than the best estimate of theseparate data sets. The basis of this resultis explored in Figure 3 by examining indetail a single point from Figure 2 (rate 1= 35% expected internodal change, rate 2= 60%). This figure shows the relationshipbetween the probability of obtaining thecorrect phylogeny as a function of thenumber of characters from each data set.Whether or not the number of variablecharacters is equal in the rapidly and slow-ly evolving data sets, the probability of cor-rect reconstruction is much higher for theslowly evolving data set than for the rap-idly evolving data set (Figs. 3a, 3b). As aconsequence of this disparity, there areconditions under which the estimate ob-tained from the slowly evolving data setis worsened by adding the rapidly evolv-ing characters (up to 400 combined char-acters, Fig. 3a; up to 1,000 combined char-acters, Fig. 3b).

The example in Figure 3 represents a best-case scenario for combining heteroge-neous data because the estimate from thecombined data is itself consistent, con-verging to 100% probability of correct re-construction with increasing numbers ofcharacters. But if one of the estimates isinconsistent, the estimate from the com-bined data may be seriously compromisedregardless of how many characters aresampled. In a second analysis, we gener-ated one data set as before (equal rates forall branches), but the second data set as-sumed that two opposing peripheralbranches evolved at much higher rates thanthe remaining three branches (e.g., the lin-eages leading to Tl and T3 in Fig. 7 evolvedfaster). Under such a process, parsimonyyields an inconsistent estimate (converg-ing to 100% probability of reconstructingthe wrong phylogeny as the number ofcharacters is increased; Felsenstein, 1978).We could just as easily have created incon-sistent character data using a model of se-

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 6: Partitioning and Combining Data in Phylogenetic Analysis

1993 POINTS OF VIEW 389

Combined data =50% slow + 50% fast characters

Slow

0 100 200 300 400 500 600 700 800 900 1000

Total characters in combined data set

Combined data =20% slow + 80% fast characters

100 200 300 400 500 600 700 800 900 1000

Total characters in combined data set

FIGURE 3. The effect of combining data sets (•)when both provide consistent phylogenetic esti-mates. The probability of estimating the correct phy-logeny is shown as a function of the number of char-acters in each data set, with curves drawn by hand.All branch lengths on the four-taxon tree are equal,but for the slowly evolving data set (•) the branchlengths are 35% expected internodal change, whereasfor the rapidly evolving data set (A) the branch lengthsare 60% expected internodal change. The horizontalscale represents the total number of characters in thecombined data set but is only proportional to thenumber of characters in the slow (fast) set accordingto their fractional contribution to the combined data.See text and Appendix for details, (a) Data sets areequal in size. The estimate from the slowly evolvingdata set is worsened by adding characters from the

lection that acted differently on differentlineages of the tree. Figure 4 presents theresults of this second analysis, showing thatthe estimate from the combined data canbe much worse than the best estimate fromthe separate data sets.

Additional insights into the problemscaused by inconsistency can be gained byexamining the asymptotic behavior of sep-arate and combined analyses—i.e., whatwould happen if an infinite number ofcharacters were sampled? Our approachhere is to assume a model tree, calculatethe expected frequencies of each observedcharacter-state pattern according to two dif-ferent sets of rate parameters, and then ex-amine whether the patterns supporting thetrue tree occur at higher frequency thanthose supporting incorrect trees. For bi-nary (0,1) characters and an unrooted four-taxon tree, Felsenstein (1978:407) derivedequations for calculating the probability ofeach possible character-state pattern givena set of expected branch lengths such asthat shown in Figure 5a; these probabilitiesare tabulated for several branch-lengthcombinations in Table 1. If all charactersare evolving relatively slowly, the numberof characters supporting the true tree isexpected to be much greater than the num-ber of characters supporting the next besttree over a broad range of relative branchlengths (Table 1; Fig. 5b). For example, ifa = 0.07 and b = 0.01, in a data set of 1,000characters we would expect about twice asmany characters to support the true treethan to support the second best tree (9.66vs. 4.24, respectively). But if we add in an-other data set of 1,000 characters that areevolving at a rate five times that of the firstset, many more characters in the combineddata set are expected to support the wrongtree than to support the true tree (61.00/2,000 vs. 50.09/2,000, respectively; see alsoFig. 5c). The situation deteriorates even

rapidly evolving data set for up to 400 characters inthe combined analysis, (b) There are four charactersfrom the rapidly evolving data set for every characterfrom the slowly evolving data set. The estimate fromthe slowly evolving data set is worsened by addingcharacters from the rapidly evolving data set for upto 1,000 characters in the combined analysis.

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 7: Partitioning and Combining Data in Phylogenetic Analysis

390 SYSTEMATIC BIOLOGY VOL. 42

Combined data =50% inconsistent + 50% consistent characters

0 100 200 300 400 500 600 700 800 900 1000

Total characters in combined data set

Combined data =20% consistent + 80% inconsistent characters

o 400 600 800 1000

Total characters in combined data set

FIGURE 4. The effect of combining data sets (•) when one data set provides an inconsistent phylogeneticestimate. The probability of estimating the correct phylogeny is shown as a function of the number of charactersin each data set, with curves drawn by hand. Branch lengths of the consistent data set (•) are equal andevolving at a slow rate (35% expected internodal change), whereas two of the opposing peripheral branchesof the inconsistent data set (A) are evolving at a much higher rate than the three remaining branches (twoopposing peripheral branches at 50% expected internodal change and the remaining three branches at 5%expected internodal change). See text and Appendix for details. Scales as in Figure 3. (a) Data sets are equalin size, (b) There are four characters from the inconsistent data set for every character from the consistentdata set. In both (a) and (b), the estimate from the consistent data set is dramatically worsened by addingcharacters from the inconsistent data set.

more rapidly if the branches leading totaxa 1 and 3 in Figure 5a are evolving at aproportionally higher rate in the more rap-idly evolving data set (Fig. 5d). In this case,combination of the two data sets yields an

inconsistent estimate over approximately33% of the tested parameter space.

Although inconsistency occurs in thefour-taxon example above only when ratesof evolution vary across branches, similar

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 8: Partitioning and Combining Data in Phylogenetic Analysis

1993 POINTS OF VIEW 391

problems can occur for five or more taxaeven when rates of evolution are equalthroughout the tree (Hendy and Penny,1989). The example of Figure 6 shows onesuch case (here we can think of the longbranches leading to the terminal taxa asthe result of ancient splits rather than ac-celerated rates of change, although theproblem is the same under either inter-pretation). For any weighted tree (a com-bination of expected number of changes qtfor each branch i on a given tree), the prob-ability of observing each possible patternof character states in the five taxa can bedirectly determined using the Hadamardconjugation procedure of Hendy and Pen-ny (1989). This procedure could also havebeen used for the unrooted four-taxon casediscussed above, for which it obtains re-sults identical to those of Felsenstein's for-mulas. The beauty of the Hadamard meth-od is that it provides a straightforward wayof calculating the character pattern prob-abilities for a greater number of taxa andwith any choice of branch lengths.

We used David Penny's HADTREE pro-gram to calculate character pattern prob-abilities for the tree of Figure 6 using (1)the rates indicated in the figure ("slow"),(2) these rates multiplied by five ("fast"),and (3) a mixture of equal numbers of char-acters evolving at the two rates (Table 2).By converting these probabilities into ex-pected tree lengths (Table 3), we see thatparsimony analysis will select the correcttree as long as the data set is large enoughto overcome sampling errors (Table 3, col-umn 3). On the other hand, for the rapidlyevolving characters, parsimony analysisconverges to the wrong answer as moredata are accumulated, with 12 trees ex-pected to be shorter than the true tree (Ta-ble 3, column 5; Hendy and Penny, 1989,showed a similar example). If equal num-bers of characters evolving at the two ratesare combined, the true tree is expected tobe the fifth most parsimonious tree (Table3, column 7). Thus, the addition of a dataset of rapidly evolving characters to onecontaining more slowly evolving charac-ters can change a correct estimate of the

(a)

0.09-

0.08-

0.07-

0.06-

0.05-

0.04-

0.03-

0.02-

0.01-

/

1

/

(c)

^—

—3~w^\.

~4%\ \x \ \

(d)

0.09-

0.08-

0.07-

0.06-

0.05-

0.04-

0.03-

0.02-

0.01-

\ -

1 / y

/

1.5

2.5 -~^,

! Ib

o d o o

FIGURE 5. Mixtures of equal numbers of rapidlyand slowly evolving characters can cause parsimonyto be inconsistent for analyses of combined data sets,(a) Model tree and representation of the parametergrids in (b), (c), and (d). The parameters a and b arethe expected actual number of changes along the des-ignated branches. Under a simple Poisson model, theseare converted to expected number of apparent changesusing the relationship P = [1 - exp(-2Q)]/2, whereQ is the expected number of actual changes. In (b)-(d), moving upward in the graphs corresponds tolengthening the a branches; moving to the right cor-responds to lengthening the b branches, (b) Contourplot showing ratio of expected frequency of "true"patterns, i.e., supporting tree ((1, 2)(3, 4)), to that of"false" patterns, i.e., supporting alternate tree ((1, 3)(2,4)). For example, for parameter values correspondingto the line labeled "10," there are expected to be 10times as many characters supporting the true tree thansupporting the tree that incorrectly groups taxa 1 and3. A consistent estimate is made throughout the pa-rameter space, (c) Contour plot as in (b) for an equalmixture of characters evolving at rates shown on axes,but with characters evolving five times faster. Upper-left region of parameter space (ratio of "true" to "false"patterns is <1) leads to inconsistent estimates, (d)Contour plot as in (c) but with more rapidly evolvingcharacters changing an additional three times fasteralong branches leading to taxon 1 and taxon 3 relativeto other branches. Note the large zone of inconsis-tency above and to the left of the "1" contour line.

tree to an erroneous one even when ratesof change are uniform throughout the treeand would be guaranteed to do so in thiscase if an infinite number of characters weresampled.

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 9: Partitioning and Combining Data in Phylogenetic Analysis

392 SYSTEMATIC BIOLOGY VOL. 4 2

TABLE 1. Expected proportionnations of evolutionary

Ratesa

0.010.040.070.100.010.040.070.100.010.040.070.100.010.040.070.10

b

0.010.010.010.010.040.040.040.040.070.070.070.070.100.100.100.10

tree

rates (a, b]

Slow (a, b)

tree

of character) (see Fig. 5)

tree((1,2)(3,4))((1,3)(2,4))((1,4)(2,3)) (

0.0097050.0096910.0096640.0096260.0355400.0354910.0353920.0352520.0570600.0569820.0568240.0565990.0749430.0748400.0746330.074340

0.0001920.0015310.0042380.0080690.0015080.0027330.0052080.0087130.0040680.0051890.0074550.0106630.0075430.0085700.0106470.013586

0.0001920.0007330.0012270.0016800.0007560.0027330.0045410.0061980.0013980.0045610.0074550.0101060.0022070.0063400.0101220.013586

1 patterns

tree(1, 2)(3, 4))

0.0430640.0419910.0404330.0388600.1147240.1119720.1079740.1039370.1406590.1375450.1330220.1284550.1475810.1446850.1404780.136230

supporting each four-taxon tree for various combi-

Fast (5a, 5b)

tree«1,3)(2,4))

0.0041070.0255730.0567560.0882430.0239610.0379050.0581620.0786160.0482510.0575060.0709520.0845290.0688910.0751580.0842630.093456

tree

0.0041070.0131320.0190540.0230220.0147440.0379050.0531040.0632880.0275590.0537590.0709520.0824730.0423740.0667460.0827390.093456

Fast plus long "a"

tree((1, 2)(3,4))

0.0424380.0379050.0348720.0334430.1131190.1014850.0937020.0900330.1388420.1256820.1168780.1127270.1458920.1336500.1254600.121599

(15a, 5b)

tree((1- 3)(2, 4)

0.0166290.1073650.1680680.1966830.0320950.0910380.1304710.1490600.0536500.0927740.1189480.1312870.0725470.0990380.1167610.125116

branches

tree) ((1, 4)(2, 3))

0.0105430.0249380.0296000.0312990.0312620.0682040.0801710.0845320.0462440.088034 .0.1015710.1065040.0597560.0986290.1112220.115810

Process Partitions

We use the term process partition as a di-vision of characters into two or more sub-sets such that characters in each subset haveevolved according to rules that are de-monstrably different from those in other

0.001

FIGURE 6. A tree for five taxa with long terminalbranches and short internal branches. The rate pa-rameters represent the expected number of characterchanges along each branch (not the apparent numberof changes, which underestimates the actual numberof changes; see Fig. 5 caption). Parsimony analysismakes a consistent estimate of the tree for the indi-cated rates, but if the rates are increased by a factorof five, on average the "true" tree will be the 13thmost parsimonious tree. Combination of the two datasets leads to an inconsistent estimate because therewill be more informative characters supporting in-correct trees than supporting the correct tree.

subsets. The importance of process parti-tions to systematics is that reconstructionmethods may need to accommodate themspecifically. Many different partitions arealready known or suspected in moleculardata: (1) first- and second- versus third-codon positions, (2) coding versus noncod-ing regions, (3) different genes and dif-ferent regions within genes, possiblyaccording to constraints on the three-di-mensional protein structure, (4) stem ver-sus loop regions in ribosomal RNA genes,(5) molecules inherited in Mendelian ver-sus non-Mendelian fashion (e.g., nucleargenes vs. plastid DNA), and others (Ki-mura, 1980; Tateno et al., 1982; Nei, 1987;Wheeler and Honeycutt, 1988; Gillespie,1991; Li and Graur, 1991; Sueoka, 1992).Postulated partitions in morphologicalcharacters include larval versus adult, cra-nial versus postcranial, and hard versus softparts (Trueb and Cloutier, 1991). Even thedeliberate choice to include or gather cer-tain classes of data over others is tanta-mount to acknowledging the relevance ofprocess partitions.

Testing for and AccommodatingHeterogeneity

The search for phylogenetic incongru-ence among partitions has been pursued

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 10: Partitioning and Combining Data in Phylogenetic Analysis

1993 POINTS OF VIEW 393

TABLE 2. Probability of obtaining each possiblecharacter pattern from the tree and rate parametersof Figure 6.

Pattern8

O={0}, {1,2,3,4,5}1 = {1}, {2,3,4,5}2 ={2}, {1,3,4,5}3 = {1,2}, {3,4,5}*4 = {3}, {1,2,4,5}5 = {1,3}, {2, 4, 5}*6 = {2,3}, {1,4,5}*7 = {1,2,3}, {4,5}*8 = {4}, {1,2, 3, 5}9 = {1,4}, {2,3,5}*10= {2, 4}, {1,3, 5}*11 = {1,2,4}, {3,5}*12= {3,4}, {1,2, 5}*13 = {1,3,4}, {2,5}*14= {2,3,4}, {1,5}*15 = {1,2,3,4}, {5}

Character pattern probabilities

Rate 1

(slow)b

0.878140.017580.017580.001290.017580.000370.000370.000900.017580.000370.000370.000900.001290.000900.000900.04387

Rate 2(fast)c

0.543940.054680.054680.010110.054680.006790.006790.014160.054680.006790.006790.014160.010110.014160.014160.13333

Combinedd

0.711040.036130.036130.005700.036130.003580.003580.007530.036130.003580.003580.007530.005700.007530.007530.08860

a Each binary character partitions the taxa into two disjointsubsets, one subset with state 0 and a complementary subsetwith state 1. The first partition represents characters that havethe same state in all five taxa. Asterisks indicate informativepartitions.

b Rates shown on tree of Figure 6.c Rates shown on tree of Figure 6 multiplied by five.d Mixture of equal numbers of "slow" and "fast" characters.

in several contexts. The most active arealies in detecting horizontal gene transferand recombination among bacterial ge-nomes, viruses, and plasmids (Dykhuizenand Green, 1991; Maynard Smith et al.,1991; Medigue et al., 1991; Souza et al.,1992; Valdez and Pinero, 1992). Even inmulticellular organisms, systematists areaware that gene trees may differ from spe-cies trees because of polymorphisms exist-ing at the time of speciation and occasionalhybridization among recently separatedlineages. These phenomena are fundamen-tally similar to horizontal gene transfer(Tateno et al., 1982) and may, for example,cause trees estimated from mitochondrialor chloroplast genes to differ from thoseestimated from nuclear genes of the sametaxa (Wilson et al., 1977; Pesole et al., 1991;Doyle, 1992). Evaluation of phylogeneticcongruence is also integral to tests of vi-cariance biogeography (determiningwhether bifurcation events are the samefor different taxa) and in evaluating para-site cospeciation with hosts (Nelson andPlatnick, 1981; Simberloff, 1987; Page,1990a, 1990b; Valdez and Pinero, 1992).

TABLE 3. Expected lengths of the 15 possible trees for five taxa calculated from the character patternfrequencies of Table 2."

Tree

(((1, 2)(3, 4))5)<(((1,3)(2,4))5)(((1, 4)(2, 3))5)((((1, 2)3)4)5)((((1,2)4)3)5)((((1, 3)2)4)5)((((1, 3)4)2)5)((((1, 4)2)3)5)((((1, 4)3)2)5)((((2,3)1)4)5)((((2,3)4)1)5)

((((2,4)1)3)5)((((2,4)3)1)5)((((3, 4)1)2)5)((((3, 4)2)1)5)

Rate

Length13

0.12693c0.12877c0.12877c0.12732c0.12732c0.12824c0.12824c0.12824c0.12824c0.12824c0.12824c

0.12824c0.12824c0.12732c0.12732c

1 (slow)

Rank

110.510.53.53.510.510.510.510.510.510.5

10.510.53.53.5

Rate 2 (fast)

Lengthb

0.53987c0.54651c0.54651c0.53582c0.53582c0.53914c0.53914c0.53914c0.53914c0.53914c0.53914c

0.53914c0.53914c0.53582c0.53582c

Rank

1314.514.52.52.511.511.511.511.511.511.511.511.52.52.5

Combined

Lengthb

0.33340c0.33764c0.33764c0.33157c0.33157c0.33369c0.33369c0.33369c0.33369c0.33369c0.33369c0.33369c0.33369c0.33157c0.33157c

Rank

514.514.52.52.59.59.59.59.59.59.59.59.52.52.5

a For example, consider calculation of expected length for first tree and "slow" rate parameters. Each "uninformative"pattern (1, 2, 4,8,15) requires one step on any tree. Patterns compatible with the specified tree (3,12) also require one step.Patterns not compatible with the specified tree (5-7, 9-11, 13, 14) each require two steps. Thus, the expected length of thistree is (0.01758 + 0.01758 + 0.01758 + 0.01758 + 0.04387) + (0.00129 + 0.00129) + 2(0.00037 + 0.00037 + 0.00090 + 0.00037+ 0.00037 + 0.00090 + 0.00090 + 0.00090) = 0.12693.

b c = total number of characters in data set.c The "true" tree.

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 11: Partitioning and Combining Data in Phylogenetic Analysis

394 SYSTEMATIC BIOLOGY VOL. 42

How can one test for heterogeneity? Anumber of currently available methods mayprovide appropriate tests for heterogene-ity, although a thorough appraisal of themfor this purpose has not been conducted.In broad outline, a researcher begins withtwo or more data sets (D, denotes data seti). One or more optimal trees (T,) are esti-mated from each D, and the test is used todecide whether the variation among theestimates T, from different data sets is sig-nificantly greater than expected. The nullmodel of homogeneity is that the variationamong T, from different D, is no greaterthan the variation inherent in each esti-mate (represented by the hypothetical casethat replicates of each D, could be ob-tained). Thus, any test must incorporate animplicit or explicit sampling process to es-timate the within-data variation. Rodrigoet al. (in press) developed one samplingmethod based on the nonparametric boot-strap. Another approach is to assume an apriori sampling model, as underlies thestatistical tests that have been developedto compare support for two trees against asingle data set (e.g., Templeton, 1983; Ki-shino and Hasegawa, 1989; Faith, 1991). Inall cases, however, it is essential that thetest statistic be developed for the appro-priate null model, and many of the existingmethods that might be applied to detectheterogeneity will require some modifi-cation from their current incarnations (e.g.,Mickevich and Farris, 1981; M. M. Miya-moto, pers. comm. to A. G. Kluge, 1989).

We have shown that combining hetero-geneous data may greatly worsen the phy-logeny estimate, and we have casually not-ed that a variety of tests may be applied todetect heterogeneity. This approach mustbe taken a step further, however, to deter-mine when combining heterogeneous dataworsens versus improves an estimate. Anyevaluation of methods used for demon-strating heterogeneity should include anevaluation of whether the recognition ofheterogeneity actually improves the re-construction. We cannot recommend anapproach without extensive simulationstudy, although preliminary results sug-gest that Faith's (1991) T-PTP test appears

to work quite well, as does the Rodrigo etal. (in press) component analysis approach.

Once heterogeneity is detected, the nextstep is to identify its cause and modify theanalysis. If process partitions are hetero-geneous (strongly specify different trees),the characters in each partition may haveevolved under different histories, in whichcase a combined analysis is inappropriate.The same is true if the source of hetero-geneity is unknown or difficult to identify,as when one partition appears to show lackof independence of characters (Swofford,1991). However, it may be possible to ac-commodate heterogeneity by nonuniformweighting of characters or changes withincharacters when the heterogeneity appearsto be due only to rates of change or trans-formational probability (Farris, 1969;LeQuesne, 1969; Kimura, 1980; Williamsand Fitch, 1989; Wheeler, 1990; Jin and Nei,1990; Reeves, 1992).

CONCLUSIONS

A phylogenetic reconstruction providesa statement about presumed genealogy andabout evolutionary process. Both aspects ofevolution must be addressed when recov-ering either aspect. However, evolutionarybiologists tend to focus on just one of thesetopics rather than both. To the biologistinterested in evolutionary process, thesearch for process partitions is an obviousand interesting exercise because differentpartitions yield direct insight to differentprocesses and mechanisms. To the system-atist interested in phylogeny, the identi-fication and acknowledgment of partitionsmay seem irrelevant at best. Yet to ignorethe details of evolutionary process presup-poses that reconstruction methods are ro-bust to such influences, which they maynot be. Reconstructions of phylogenies areoften riddled with ambiguities. Even in therecent history of systematics, the choice ofthe best phylogenetic estimate has been adynamic process because of an input ofnew data and an increasing appreciationfor details in the evolutionary process(Holmes, 1991; Pesole et al., 1991). Also,obvious inconsistencies remain betweenassumptions of the reconstruction model

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 12: Partitioning and Combining Data in Phylogenetic Analysis

1993 POINTS OF VIEW 395

and rates of character evolution (e.g.,Reeves, 1992).

The presumed individual goal of mostsystematists is to provide the first accuratephylogeny of their chosen taxa and to re-solve any outstanding uncertainties. Thisgoal may often be incompatible with a de-liberate ignorance of the different evolu-tionary processes that have given rise tothe character states used in estimating thephylogeny.

ACKNOWLEDGMENTS

We thank M. Riley, D. Dykhuizen, and D. Hillisfor references and T. Reeder, J. Wiens, P. Chippindale,D. Cannatella, D. Eernisse, D. Hillis, D. Penny, andM. Donoghue for challenging discussions. We alsoare indebted to Alan de Queiroz and an anonymousreviewer for insightful and constructive criticism andto David Penny for providing the HADTREE pro-gram.

REFERENCES

ADAMS, E. N., III. 1972. Consensus techniques andthe comparison of taxonomic trees. Syst. Zool. 21:390-397.

BARRETT, M., M. J. DONOGHUE, AND E. SOBER. 1991.Against consensus. Syst. Zool. 40:486-493.

CAMIN, J. H., AND R. R. SOKAL. 1965. A method fordeducting branching sequences in phylogeny. Evo-lution 19:311-326.

DEBRY, R. W. 1992. The consistency of several phy-logeny-inference methods under varying evolu-tionary rates. Mol. Biol. Evol. 9:537-551.

DE QUEIROZ, A. 1993. For consensus (sometimes).Syst. Biol. 42:368-372.

DONOGHUE, M. J., AND M. J. SANDERSON. 1992. Thesuitability of molecular and morphological evi-dence in reconstructing plant phylogeny. Pages 340-368 in Molecular systematics of plants (P. S. Soltis,D. E. Soltis, and J. J. Doyle, eds.). Chapman andHall, New York.

DOYLE, J. J. 1992. Gene trees and species trees: Mo-lecular systematics as one-character taxonomy. Syst.Bot. 17:144-163.

DYKHUIZEN, D. E., AND L. GREEN. 1991. Recombi-nation in Escherichia coli and the definition of bio-logical species. J. Bacteriol. 173:7257-7268.

FAITH, D. P. 1991. Cladistic permutation tests formonophyly and nonmonophyly. Syst. Zool. 40:366-375.

FARRIS, J. S. 1969. A successive approximations ap-proach to character weighting. Syst. Zool. 18:374-385.

FARRIS, J. S. 1970. Methods for computing Wagnertrees. Syst. Zool. 19:83-92.

FARRIS, J. S. 1977. Phylogenetic analysis under Dol-lo's Law. Syst. Zool. 26:77-88.

FELSENSTEIN, J. 1978. Cases in which parsimony or

compatibility methods will be positively mislead-ing. Syst. Zool. 27:401-410.

FELSENSTEIN, J. 1981. Evolutionary trees from DNAsequences: A maximum likelihood approach. J. Mol.Evol. 17:368-376.

FITCH, W. M. 1971. Toward defining the course ofevolution: Minimal change for a specific tree to-pology. Syst. Zool. 20:406-416.

GILLESPIE, J. H. 1991. The causes of molecular evo-lution. Oxford Univ. Press, New York.

HENDY, M. D., AND D. PENNY. 1989. A frameworkfor the quantitative study of evolutionary trees. Syst.Zool. 38:297-309.

HILLIS, D. M. 1987. Molecular versus morphologicalapproaches to systematics. Annu. Rev. Ecol. Syst.18:23-42.

HOLMES, E. C. 1991. Different rates of substitutionmay produce different phylogenies of the euther-ian mammals. J. Mol. Evol. 33:209-215.

HUELSENBECK, J. P., AND D. M. HILLIS. 1993. Successof phylogenetic methods in the four-taxon case.Syst. Biol. 42:247-264.

JIN, L., AND M. NEI. 1990. Limitations of the evo-lutionary parsimony method of phylogenetic anal-ysis. Mol. Biol. Evol. 7:82-102.

JUKES, T. H., AND C. R. CANTOR. 1969. Evolution ofprotein molecules. Pages 21-132 in Mammalian pro-tein metabolism (H. Munro, ed.). Academic Press,New York.

KIMURA, M. 1980. A simple method for estimatingevolutionary rate of base substitutions throughcomparative studies of nucleotide sequences. J. Mol.Evol. 16:111-120.

KISHINO, H., AND M. HASEGAWA. 1989. Evaluation ofthe maximum likelihood estimate of the evolution-ary tree topologies from DNA sequence data, andthe branching order in Hominoidea. J. Mol. Evol.29:170-179.

KLUGE, A. G. 1989. A concern for evidence and aphylogenetic hypothesis of relationships amongEpicrates (Boidae, Serpentes). Syst. Zool. 38:7-25.

KLUGE, A. G., AND J. S. FARRIS. 1969. Quantitativephyletics and the evolution of anurans. Syst. Zool.18:1-32.

LEQUESNE, W. J. 1969. A method of selection of char-acters in numerical taxonomy. Syst. Zool. 18:201-205.

Li, W.-H., AND D. GRAUR. 1991. Fundamentals ofmolecular evolution. Sinauer, Sunderland, Massa-chusetts.

MARSHALL, C. R. 1992. Character analysis and theintegration of molecular and morphological data inan understanding of sand dollar phylogeny. Mol.Biol. Evol. 9:309-322.

MAYNARD SMITH, J., C. G. DOWSON, AND B. G. SPRATT.1991. Localized sex in bacteria. Nature 349:29-31.

MEDIGUE, C, T. ROUXEL, P. VIGIER, A. HENAUT, ANDA. DANCHIN. 1991. Evidence for horizontal genetransfer in Escherichia coli speciation. J. Mol. Biol.222:851-856.

MICKEVICH, M. F. 1978. Taxonomic congruence. Syst.Zool. 27:143-158.

MICKEVICH, M. F., AND J. S. FARRIS. 1981. The im-

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 13: Partitioning and Combining Data in Phylogenetic Analysis

396 SYSTEMATIC BIOLOGY VOL. 42

plications of congruence in Menidia. Syst. Zool. 30:351-370.

MIYAMOTO, M. M. 1985. Consensus cladograms andgeneral classifications. Cladistics 1:186-189.

NEI, M. 1987. Molecular evolutionary genetics. Co-lumbia Univ. Press, New York.

NELSON, G. J. 1979. Cladistic analysis and synthesis:Principles and definitions, with a historical note onAdanson's Families des Plantes (1763-1764). Syst. Zool.28:1-21.

NELSON, G. J., AND N. PLATNICK. 1981. Systematicsand biogeography: Cladistics and vicariance. Co-lumbia Univ. Press, New York.

PAGE, R. D. M. 1990a. Component analysis: A valiantfailure? Cladistics 6:119-136.

PAGE, R. D. M. 1990b. Temporal congruence andcladistic analysis of biogeography and cospeciation.Syst. Zool. 39:205-226.

PENNY, D., M. D. HENDY, AND M. A. STEEL. 1992.Progress with methods for constructing evolution-ary trees. Trends Ecol. Evol. 7:73-79.

PESOLE, G., E. SBISA, F. MIGNOTTE, AND C. SACCONE.1991. The branching order of mammals: Phylo-genetic trees inferred from nuclear and mitochon-drial molecular data. J. Mol. Evol. 33:537-542.

REEVES, J. H. 1992. Heterogeneity in the substitutionprocess of amino acid sites of proteins coded for bymitochondrial DNA. J. Mol. Evol. 35:17-31.

RODRIGO, A. G., M. KELLY-BORGES, P. R. BERGQUIST,AND P. L. BERGQUIST. In press. A randomizationtest of the null hypothesis that two cladograms aresample estimates of a parametric phylogenetic tree.N.Z. J. Bot.

SHAFFER, H. B., J. M. CLARK, AND F. KRAUS. 1991.When molecules and morphology clash: A phylo-genetic analysis of the North American ambysto-matid salamanders (Caudata: Ambystomatidae). Syst.Zool. 40:284-303.

SIMBERLOFF, D. 1987. Calculating probabilities thatcladograms match: A method of biogeographicalinference. Syst. Zool. 36:175-195.

SOUZA, V. T., T. NGUYEN, R. R. HUDSON, D. PINERO,AND R. E. LENSKI. 1992. Hierarchical analysis oflinkage disequilibrium in Rhizobium populations:Evidence for sex? Proc. Natl. Acad. Sci. USA 89:8389-8393.

SUEOKA, N. 1992. Directional mutation pressure, se-lective constraints, and genetic equilibria. J. Mol.Evol. 34:95-114.

SWOFFORD, D. L. 1991. When are phylogeny esti-mates from molecular and morphological data in-congruent? Pages 295-333 in Phylogenetic analysisof DNA sequences (M. M. Miyamoto and J. Cracraft,eds.). Oxford Univ. Press, New York.

SWOFFORD, D. L., AND G. J. OLSEN. 1990. Phylogenyreconstruction. Pages 411-501 in Molecular system-atics (D. M. Hillis and C. Moritz, eds.). Sinauer,Sunderland, Massachusetts.

TATENO, Y., M. NEI, AND F. TAJIMA. 1982. Accuracyof estimated phylogenetic trees from molecular data.I. Distantly related species. J. Mol. Evol. 18:387-404.

TEMPLETON, A. R. 1983. Phylogenetic inference from

restriction endonuclease cleavage site maps withparticular reference to the humans and apes. Evo-lution 37:221-244.

TRUEB, L., AND R. CLOUTIER. 1991. A phylogeneticinvestigation of the inter- and intrarelationships ofthe Lissamphibia (Amphibia: Temnospondyli).Pages 223-313 in Origins of the higher groups oftetrapods: Controversy and consensus (H.-P.Schultze and L. Trueb, eds.). Cornell Univ. Press,Ithaca, New York.

VALDEZ, A. M., AND D. PINERO. 1992. Phylogeneticestimation of plasmid exchange in bacteria. Evo-lution 46:641-656.

WHEELER, W. C. 1990. Combinatorial weights andphylogenetic analysis: A statistical parsimony pro-cedure. Cladistics 6:269-275.

WHEELER, W. C, AND R. L. HONEYCUTT. 1988. Pairedsequence difference in ribosomal RNAs: Evolution-ary and phylogenetic implications. Mol. Biol. Evol.5:90-96.

WILLIAMS, P. L., AND W. M. FITCH. 1989. Evolution:Computer analysis of protein and nucleic acid se-quences. Methods Enzymol. 183:615-625.

WILSON, A. C, S. S. CARLSON, AND T. J. WHITE. 1977.Biochemical evolution. Annu. Rev. Biochem. 46:473-639.

Received 25 January 1993; accepted 17 May 1993

Note added in proof.—A talk offering several of theideas in this article was presented at the annual meet-ing of the Society of Systematic Biologists (Snowbird,Utah, 19 June 1993) by J. Harshman and S. M. Lanyon.

APPENDIX

An unrooted four-taxon tree (Fig. 7) was simulatedin the analyses underlying Figures 2-4. A Jukes-Can-tor model of sequence evolution was used to describecharacter-state changes along the branches of themodel tree. The Jukes-Cantor model employs a pa-rameter a that describes the rate at which a givennucleotide changes to one of the three remainingnucleotides. Hence, the probability of observing a

FIGURE 7. Simulations for Figures 2-4 assumed anunrooted four-taxon tree with two internal nodes (Nland N2) and four external nodes (Tl, T2, T3, and T4).

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from

Page 14: Partitioning and Combining Data in Phylogenetic Analysis

1993 POINTS OF VIEW 397

nucleotide substitution depends on the product of aand time (t). However, given enough time (or largea) there is a meaningful probability of back mutationsto the original nucleotide, in which case, no substi-tution will be observed. The probability of actuallyobserving a nucleotide substitution is

S =3(1 - e"

where a substitution to each of the three new basesis equally probable (i.e., S/3; Jukes and Cantor, 1969;see also Swofford and Olsen, 1990).

Simulated phylogenies were created in several stepsfollowing the procedure of Huelsenbeck and Hillis(1993): (1) an at (branch length) was chosen and thecorresponding substitution probability (S) was cal-culated; (2) an internal node (Nl in Fig. 7) was as-signed a nucleotide sequence using a pseudorandomnumber generator, the four possible bases appearingwith equal frequency; and (3) for each base of nodeNl, the nucleotide at each descendent node (N2, Tl,

T2) was determined using the probabilities deter-mined in step 1 and a pseudorandom number gen-erator. Using node N2 as the ancestor, the same pro-cess was repeated to derive the sequences of nodesT3 and T4. In generating the data sets of Figure 2 andthe "slow," "fast," and "consistent" data sets for Fig-ures 3 and 4, at was set equal among all five branchesof the tree (within each data set). For the "inconsis-tent" data sets of Figure 4, at was set larger for theterminal branches to Tl and T3 than for the otherthree branches. The character matrices of simulatedDNA data, which contained the simulated nucleo-tides of sequences for nodes Tl, T2, T3, and T4, wereanalyzed using the unordered parsimony criterion(Fitch, 1971). Further details of the simulations areprovided in Huelsenbeck and Hillis (1993).

Note added in proof.—A talk offering several of theideas in this article was presented at the annual meet-ing of the Society of Systematic Biologists (Snowbird,Utah, 19 June 1993) by J. Harshman and S. M. Lanyon.

at University of Saskatchew

an on July 6, 2012http://sysbio.oxfordjournals.org/

Dow

nloaded from