HAL Id: hal-03192460
https://hal.archives-ouvertes.fr/hal-03192460

Submitted on 8 Apr 2021

Comparative Methods for Reconstructing Ancient Genome Organization

Yoann Anselmetti, Nina Luhmann, Sèverine Bérard, Eric Tannier, Cedric Chauve

To cite this version: Yoann Anselmetti, Nina Luhmann, Sèverine Bérard, Eric Tannier, Cedric Chauve. Comparative Methods for Reconstructing Ancient Genome Organization. Comparative Genomics: Methods and Protocols, pp. 343-362, 2017, ⟨10.1007/978-1-4939-7463-4_13⟩. ⟨hal-03192460⟩

Chapter 13

Comparative Methods for Reconstructing Ancient Genome Organization

Yoann Anselmetti, Nina Luhmann, Severine Berard, Eric Tannier, and Cedric Chauve

Abstract

Comparative genomics considers the detection of similarities and differences between extant genomes and, based on more or less formalized hypotheses regarding the evolutionary processes involved, the inference of ancestral states explaining the similarities and of an evolutionary history explaining the differences. In this chapter, we focus on the reconstruction of the organization of ancient genomes into chromosomes. We review different methodological approaches and software, applied to a wide range of datasets from different kingdoms of life and at different evolutionary depths. We discuss relations with genome assembly, and potential approaches to validate computational predictions on ancient genomes that are almost always only accessible through these predictions.

Key words Comparative genomics, Paleogenomics, Ancient genomes, Ancestral genomes

1 Introduction

Rearrangements were the first genome mutations to be discovered [1], long before the discovery of the molecular structure of DNA. Molecular evolutionary studies started with the reconstruction of the organization of ancient Drosophila chromosomes from the comparison of extant ones [2]. However, it took almost 30 more years before the formal introduction of paleogenetics as the field of reconstructing ancient genes [3]. Since then, the development of sequencing technologies and the availability of sequenced genomes have led to the introduction of paleogenomics, a field that aims at reconstructing ancient whole genomes using computational methods. The term paleogenomics can be understood in two ways: ancient genome sequencing [4], or the computational reconstruction of ancestral genome features, given extant sequences, offspring, and relatives [5]. We take it in the latter meaning, though we highlight several links between both interpretations.

Joao C. Setubal et al. (eds.), Comparative Genomics: Methods and Protocols, Methods in Molecular Biology, vol. 1704, https://doi.org/10.1007/978-1-4939-7463-4_13, © Springer Science+Business Media LLC 2018

Despite its early start as a molecular evolution problem, paleogenomics is still in its infancy. Whereas evolution by substitutions has been studied extensively since the 1960s and now has well-established mathematical and computational foundations, evolution by genome-scale events such as rearrangements looks almost like a fallow field. Two reasons can be invoked. First, rearrangement studies require fully assembled genomes, and genome assembly is still an extremely challenging problem, resulting in a small number of available genomes compared to gene sequences, for example. Second, the state space of sequence evolution is very small (4 possible nucleotides or 20 possible amino acids per ancestral locus), leading to computational problems that are much easier than the rearrangement ones, which work on the practically infinite discrete space of possible chromosomal organizations (gene orders, for example). However, neither of these reasons is biological, and recent progress in technology and methodology is likely to change this situation quickly.

There have been tremendous methodological developments over the last 10–15 years. Standard and principled computational methods are now able to propose reconstructions of the organization of ancestral genomes over all kingdoms of life: mammals [6, 7], insects [8, 9], fungi [10], plants such as monocotyledons [11–13] (reviewed in [14]) and dicotyledons [13–16], and bacteria [17–19]. Prospective ad hoc methods have attempted the reconstruction of more ancient animal proto-karyotypes: amniotes [20–22], bony fishes [23–25], vertebrates [20, 21, 26], chordates [27], or even eumetazoa [28].

Here, we review some of the existing methods for reconstructing ancient gene orders, focusing on their methodological principles, strengths, and weaknesses. We detail the data preprocessing steps that are necessary to use these methods. Finally, we review the available software and give some insight into the possible validations of ancestral genomes.

2 Preliminaries: Material and Preprocessing

The starting material of comparative paleogenomics is composed of extant genome sequences and assemblies. These are often available in public databases such as Ensembl and the UCSC Genome Browser [29, 30]. A genome assembly is a set of linear or circular DNA sequences (we refer the reader to [31] for a recent review on genome assembly). Depending on the combination of the properties of the sequenced genomes (repeats in particular), the sequencing technology, and the assembly algorithm, the assembled sequences can be at various levels of completion, from full chromosomes (in which case the genome is said to be fully assembled) to scaffolds or contigs (fragmented assembly); for the sake of exposition, we use here the term chromosome for an assembled contiguous DNA sequence. The fragmentation of extant genome assemblies has a significant impact on the quality of reconstructed ancestral genomes, which will be discussed in Subheading 3.4.

To reconstruct the organization of ancient genomes from the comparison of extant ones, it is first necessary to define sets of markers on extant genomes, that is, DNA segments defined by their coordinates on the genomes (chromosome or scaffold or contig, start position, end position, reading direction). Markers are clustered into families with the desired property that two markers in the same family are homologous over their whole length, and two markers from different families show no or limited homology.

Gene families, available in some databases [29, 32], are good candidates for being markers, though intersecting genes and partial homologies can be a problem for certain methods. Markers can also be obtained by constructing synteny blocks from whole genome multiple alignments ([33]; see also Chapter 11, https://doi.org/10.1007/978-1-4939-7463-4_11), by segmenting genomes according to pairwise alignments [34], or by searching for ultra-conserved elements (UCEs) [35] or virtual probes [36]. These methods are useful, for example, when considering genomes that exhibit low gene density.

Whether the considered markers are genes or other genomic markers, the identification of genomic marker families is both a fundamental initial step toward reconstructing ancestral genomes and a challenging computational biology problem, with links to sequence clustering, whole genome alignment, and phylogenetics, among others. There is currently no standard method or tool that is universally used, and many applied works rely on ad hoc methods for this important preprocessing step.

Depending on the combinatorial properties of the algorithms used to infer ancient genome organization from the comparison of extant genomes, several restrictions might need to be applied to families of genomic markers. Most methods require that no two markers overlap on a genome, as this might induce some ambiguity regarding their relative order along their chromosome. Other methods might also require that every genome contains at most one marker per family (unique markers) or at least one marker per family (universal markers), or both (unique and universal markers). Enforcing such constraints requires extra preprocessing of an initial marker set. From now on, we assume that we have obtained, for a set of extant genomes of interest, a dataset of genomic markers that will serve as input to reconstruct the organization of one or several ancient genomes.
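The restrictions on marker families described above can be illustrated with a short Python sketch. The `Marker` record, its field names, and both helper functions are hypothetical and only serve the illustration; real pipelines rely on their own formats and on ad hoc filtering choices.

```python
from collections import Counter, namedtuple

# Hypothetical marker record: coordinates on an extant genome plus a family label.
Marker = namedtuple("Marker", "species chrom start end strand family")

def has_overlap(markers):
    """Return True if two markers overlap on the same chromosome of a species."""
    by_chrom = {}
    for m in markers:
        by_chrom.setdefault((m.species, m.chrom), []).append(m)
    for group in by_chrom.values():
        group.sort(key=lambda m: m.start)
        # After sorting by start, any overlap shows up between consecutive markers.
        for left, right in zip(group, group[1:]):
            if right.start <= left.end:  # coordinates treated as closed intervals
                return True
    return False

def unique_universal_families(markers, species):
    """Families with exactly one marker in every listed species."""
    counts = Counter((m.family, m.species) for m in markers)
    families = {m.family for m in markers}
    return {f for f in families if all(counts[(f, s)] == 1 for s in species)}

markers = [
    Marker("sp1", "chr1", 100, 200, "+", "f1"),
    Marker("sp1", "chr1", 300, 400, "-", "f2"),
    Marker("sp1", "chr1", 500, 600, "+", "f2"),  # f2 is duplicated in sp1
    Marker("sp2", "chrA", 50, 150, "+", "f1"),
    Marker("sp2", "chrA", 400, 500, "+", "f2"),
    Marker("sp2", "chrA", 700, 800, "+", "f3"),  # f3 is absent from sp1
]
kept = unique_universal_families(markers, ["sp1", "sp2"])  # {"f1"}
```

Keeping only unique and universal families is the most restrictive choice; methods that accept duplicated or missing markers would relax the equality test accordingly.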

Finally, a comparative approach requires phylogenetic information relating one or several ancestral species of interest to a set of extant species whose genome data are available. This information can range from a fully resolved species phylogeny with branch lengths [6, 7], to a partition of the extant species in three non-empty groups that define a single ancestral species (two groups of descendant species and one group of outgroup species). So in the extreme case of considering a single ancestral species, a minimal dataset is composed of genome information for a set of three extant species: two species whose last common ancestor is the ancestral species of interest, and one outgroup [37].

3 Ancestral Reconstruction Methods

All methods consider a genome as a set of circular or linear orderings of markers, representing chromosomes or chromosomal segments. This implies that the exact physical coordinates of the markers are transformed into a relative ordering of markers. This transformation induces a loss of information, which can influence the result [38], but it is universally used. Methods then differ in their strategies: either they model the evolution of these arrangements of markers by evolutionary events such as duplications, losses, and rearrangements, or they model the evolution of more local syntenic features/characters such as the physical proximity of sets of markers. In the following, we call adjacency (resp. interval) a pair of (resp. a set of at least three) markers that either occur contiguously along an extant genome or are assumed to occur contiguously along an ancestral genome.
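As an illustration of the passage from marker orders to adjacencies, the following Python sketch encodes a chromosome as a list of signed integers (the sign gives the reading direction) and each adjacency as an unordered pair of marker extremities. This signed-integer encoding and the tail/head labels `"t"` and `"h"` are assumptions of the sketch, chosen because they are common in the rearrangement literature.

```python
def extremities(marker):
    """Left and right extremities of a signed marker, in reading order."""
    g = abs(marker)
    return ((g, "t"), (g, "h")) if marker > 0 else ((g, "h"), (g, "t"))

def adjacencies(chromosome, circular=False):
    """Set of adjacencies (unordered pairs of extremities) of one chromosome."""
    ends = [extremities(m) for m in chromosome]
    pairs = list(zip(ends, ends[1:]))
    if circular and ends:
        pairs.append((ends[-1], ends[0]))  # close the circle
    # An adjacency joins the right end of a marker to the left end of the next one.
    return {frozenset((left[1], right[0])) for left, right in pairs}

forward = adjacencies([1, -2, 3])
reverse = adjacencies([-3, 2, -1])  # same chromosome read from the other strand
```

Because adjacencies are unordered pairs of extremities, reading a chromosome from either strand yields the same set, which is why `forward` and `reverse` are equal.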

The first strategy (evolution of whole genomes) quickly leads to computational tractability issues. The second strategy (evolution of local syntenic characters such as adjacencies and intervals) benefits from a standard evolutionary toolbox modeling the evolution of the presence or absence of a character, and tractability issues are postponed to a final linearization step where local characters are assembled into chromosome-scale arrangements of markers. Linearization procedures then benefit from standard algorithms originally designed for computing physical maps of extant genomes [39].

3.1 Whole Genome Evolution

We first describe the approach that considers the evolution of genomes seen as sets of linear or circular orders of markers, i.e., roughly permutations that can possibly be separated into several chromosomes. Evolutionary events like inversions, translocations, transpositions, fissions, and fusions, all subsumed in the now standard Double Cut and Join (DCJ) model [40], can alter these genomes. The reconstruction of ancestral genomes then aims, given marker orders representing extant genomes at the leaves of a species phylogeny, at assigning marker orders to all ancestral nodes, optimizing a mathematical criterion according to the chosen evolutionary model. Most of the time this criterion is the parsimony score, which is the minimum number of events transforming one permutation into another [41], also called the distance, although some methods consider a likelihood criterion.

For most rearrangement models that do not include duplications, the distance between two genomes can be computed efficiently. But even the simplest non-pairwise ancestral genome reconstruction problem, the median problem, which consists in reconstructing a genome minimizing the distance in a tree with only three leaves, is already NP-hard [42]. Adding duplications makes all problems hard, even the comparison of two genomes [41]. Hence, with duplications considered, reconstructing rearrangement events that happened along the branches of a tree is not tractable either.
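To make the efficient pairwise case concrete, here is a minimal Python sketch of a DCJ distance computation for two genomes with the same unique markers. It assumes the standard adjacency-graph formula d = N - (C + I/2), where N is the number of markers, C the number of cycles, and I the number of odd paths; the genome encoding (lists of `(signed markers, is_circular)` pairs) and all function names are hypothetical.

```python
def extremities(marker):
    """Left and right extremities of a signed marker, in reading order."""
    g = abs(marker)
    return ((g, "t"), (g, "h")) if marker > 0 else ((g, "h"), (g, "t"))

def partner_map(genome):
    """Map each extremity to the extremity it is adjacent to (telomeres are absent)."""
    partner = {}
    for markers, circular in genome:
        ends = [extremities(m) for m in markers]
        links = [(left[1], right[0]) for left, right in zip(ends, ends[1:])]
        if circular:
            links.append((ends[-1][1], ends[0][0]))
        for x, y in links:
            partner[x], partner[y] = y, x
    return partner

def dcj_distance(genome_a, genome_b):
    """DCJ distance, assuming both genomes carry the same unique markers."""
    pa, pb = partner_map(genome_a), partner_map(genome_b)
    ext = {e for markers, _ in genome_a for m in markers for e in extremities(m)}
    seen, cycles, odd_paths = set(), 0, 0
    # Paths start at telomeres, i.e., extremities without a partner in A or in B.
    for start in ext:
        if start in seen or (start in pa and start in pb):
            continue
        use = pb if start not in pa else pa
        cur, length = start, 1
        seen.add(cur)
        while cur in use:  # alternate between adjacencies of A and of B
            cur = use[cur]
            seen.add(cur)
            length += 1
            use = pa if use is pb else pb
        odd_paths += length % 2
    # Every remaining extremity lies on a cycle alternating between A and B.
    for start in ext:
        if start in seen:
            continue
        cycles += 1
        cur, use = start, pa
        while cur not in seen:
            seen.add(cur)
            cur = use[cur]
            use = pb if use is pa else pa
    return len(ext) // 2 - cycles - odd_paths // 2

linear = [([1, 2, 3], False)]
inverted = [([1, -2, 3], False)]
distance = dcj_distance(linear, inverted)  # one inversion
```

The traversal visits each extremity once, so the computation is linear in the number of markers, in contrast with the median problem on three genomes.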

Heuristics for the ancestral genome reconstruction problem usually follow the strategy of assigning an initial genome arrangement to each internal node of the tree and then iteratively refining the solution by solving the median problem for internal nodes until no further improvement in the overall tree distance can be achieved. The implementation of GASTS [43] improves over previous methods applying this strategy by trying to find a good initial arrangement that avoids local optima. Using adequate subgraphs for heuristic assignment of the median, this method can handle multichromosomal data with unique and universal markers. Another approach is based on the Pathgroup data structure [44], which stores partially completed cycles in a breakpoint graph [41] for each branch in the phylogeny. Graphs are greedily completed and eventually form genomes at all internal nodes. This solution can be used as an initialization prior to local iterative improvements based on the median, again using the Pathgroup approach. An interesting property of Pathgroup is that it can handle whole genome duplications. The method MGRA [45], on the other hand, relies on a multiple breakpoint graph combining all extant genome organizations into one structure. MGRA then searches for breaks in agreement with the species tree structure, transforming the breakpoint graph into an identity breakpoint graph. While MGRA requires unique and universal markers, it has recently been extended to handle unequal marker content [46]. More complex models of evolution have been considered, which include duplications for example [47, 48], but they are tractable only under some specific conditions, such as the hypothesis that rearrangement breakpoints are not reused [47].

Some methods adopt a probabilistic point of view, like Badger [49], a software package using Bayesian analysis under a model where circular genomes can evolve by reversals. It samples phylogenetic trees and rearrangement scenarios from the joint posterior distribution under this model by MCMC, implementing different proposal methods in the Metropolis–Hastings algorithm. This is a local search similar to the heuristics for the minimization problem, but instead of outputting a single solution without guarantee, it provides a sample of solutions from a mathematically grounded distribution. However, it faces the same tractability issues, here concerning the convergence time.

Finally, a simpler rearrangement distance is the Single-Cut-or-Join (SCJ) distance [50], which models cuts and joins of adjacencies. With this model, ancestral reconstruction becomes tractable. Ancestral genomes that minimize the SCJ distance can be computed in polynomial time using a variant of the Fitch algorithm [51]; however, the constraints required to ensure linear or circular ancestral marker orders result in mostly fragmented ancestral genomes. In [52], a Gibbs sampler for sampling rearrangement scenarios under the SCJ model has been described. It starts with an optimal fragmented scenario obtained as described above and then explores the space of co-optima by repeatedly changing the scenarios of single adjacencies.
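The tractability of the SCJ model can be illustrated by the following Python sketch, which treats each adjacency as a binary character and runs a Fitch-style two-pass procedure on a rooted binary tree. The tree encoding (nested pairs of leaf names), the string labels used for adjacencies, and the rule resolving an ambiguous root state toward absence are assumptions of this sketch, not a description of the implementation in [50].

```python
def scj_distance(adjs_a, adjs_b):
    """Number of cuts and joins: adjacencies present in exactly one genome."""
    return len(adjs_a ^ adjs_b)

def internal_nodes(tree):
    """All internal nodes of a tree given as nested pairs of leaf names."""
    if isinstance(tree, str):
        return []
    return [tree] + [n for child in tree for n in internal_nodes(child)]

def fitch_adjacencies(tree, leaf_adjs):
    """Assign to each internal node the adjacencies inferred present by Fitch."""
    result = {node: set() for node in internal_nodes(tree)}
    for adj in set().union(*leaf_adjs.values()):
        sets = {}

        def bottom_up(node):
            if isinstance(node, str):
                sets[node] = {1} if adj in leaf_adjs[node] else {0}
            else:
                left, right = [bottom_up(child) for child in node]
                # Intersection if non-empty, union otherwise.
                sets[node] = (left & right) or (left | right)
            return sets[node]

        def top_down(node, state):
            if isinstance(node, str):
                return
            if state == 1:
                result[node].add(adj)
            for child in node:
                # Keep the parent state when allowed, else the forced child state.
                top_down(child, state if state in sets[child] else min(sets[child]))

        root_set = bottom_up(tree)
        # Assumption: an ambiguous root is resolved toward absence.
        top_down(tree, 1 if root_set == {1} else 0)
    return result

tree = (("A", "B"), "C")
leaf_adjs = {"A": {"ab", "bc"}, "B": {"ab", "bc"}, "C": {"ab", "ac"}}
ancestors = fitch_adjacencies(tree, leaf_adjs)
```

In this toy example the root only receives the adjacency shared by all three leaves, which hints at why ancestral genomes obtained this way tend to be fragmented.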

3.2 Genomes as Sets of Adjacencies and Intervals: Mapping Approaches

The linear or circular orders of markers can be seen as sets of adjacencies and intervals, instead of permutations. Then each adjacency or interval can be considered independently, as a separate syntenic feature, which evolves within the larger context of whole genomes. This independence assumption allows ancestral states for adjacencies and intervals to be computed quickly. The main problem is that the collection of ancestral adjacencies and intervals is not guaranteed to be compatible with a linear or circular ordering.

We describe here a family of approaches that focus on a single ancestral genome and consist of two main steps, which are inspired by the methods initially developed to compute physical maps of extant genomes:

1. Genomes of related extant species are compared to detect common local syntenic features, such as marker adjacencies or intervals, that are then considered candidate ancestral features for the ancestral genome of interest. Common features are not necessarily conserved from an ancestor, due to convergent evolution or assembly errors for example, so this step generates false positives. In some methods, each local syntenic feature is weighted according to its pattern of presence/absence in extant species genomes, to represent a confidence measure in the hypothesis that it is indeed an ancestral syntenic feature.

2. A maximum weight subset of the potentially ancestral local syntenic features (detected in the first step) is selected that is compatible with the genome structure of the considered ancestral species (linear/circular chromosomes, ancestral copy number of markers, etc.) and is then assembled into a more detailed ancestral genome map.

The case of unique markers: The initial applications [6, 7] of these physical mapping principles to ancestral genome organization reconstruction considered unique markers, i.e., markers that are assumed to occur exactly once in the ancestral genome of interest.

In several methods [7, 10, 22, 53], step 1 (the detection of common adjacencies and intervals and the inference of ancestral adjacencies and intervals) is implemented using a Dollo parsimony principle: any group of markers that are colocalized in two genomes of extant species whose evolutionary path in the species phylogeny contains the ancestral species of interest is deemed to be a potential ancestral syntenic feature. Here, by colocalized we mean that the group of markers occurs contiguously in both extant genomes regardless of their relative orders but without any other marker occurring in between; so the marker content of both occurrences of the colocalized group of markers in the extant genomes is identical, while the marker orders can differ. Groups of two markers are adjacencies, while groups of more than two markers are intervals. Variations on the principle outlined above can be considered, such as relaxing the Dollo parsimony criterion, considering only adjacencies (see [6] for example), or considering probabilistic inference of ancestral adjacencies [54, 55].
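Restricted to adjacencies, the Dollo principle above can be sketched in a few lines of Python. The sketch assumes the minimal phylogenetic input described in Section 2, namely a partition of the extant species into two descendant groups and one outgroup: the path between two species taken from different groups then contains the ancestor of interest. The function name, the species names, and the marker labels are hypothetical.

```python
def candidate_ancestral_adjacencies(adjs_by_species, groups):
    """Adjacencies seen in species from at least two of the three groups.

    groups: three lists of species names (two descendant groups, one outgroup).
    """
    support = {}
    for index, group in enumerate(groups):
        for species in group:
            for adj in adjs_by_species[species]:
                support.setdefault(adj, set()).add(index)
    # Two species from different groups are linked by a path through the ancestor.
    return {adj for adj, seen in support.items() if len(seen) >= 2}

ab, bc, cd, de = (frozenset(p) for p in ("ab", "bc", "cd", "de"))
adjs_by_species = {
    "human": {ab, bc},
    "mouse": {ab, cd},
    "opossum": {bc, de},
}
groups = [["human"], ["mouse"], ["opossum"]]
candidates = candidate_ancestral_adjacencies(adjs_by_species, groups)
```

Here `ab` is supported by the two descendant groups and `bc` by one descendant group and the outgroup, while `cd` and `de` are each seen in a single group and are discarded. A weighting scheme could replace the yes/no decision by a score derived from the same support sets.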

Given a set of local ancestral syntenic groups, the second step aims at selecting a maximum weight subset of these groups that is compatible with the considered genome structure and does not contain any syntenic conflict, defined as a marker that is deemed adjacent to more than two other markers. Several methods such as Infercars [6] and MLGO [54] consider only marker adjacencies; these adjacencies define a graph whose vertices are markers and whose weighted edges represent adjacencies, and the methods aim at computing a maximum weight set of adjacencies that form a set of paths, each such path being a linear order of markers called a Contiguous Ancestral Region (CAR). This problem is equivalent to a Traveling Salesman Problem (TSP) and is NP-hard. It is addressed in [6] through a greedy heuristic and in [54] using a standard TSP solver. However, as shown in [56], if the linearity of CARs is relaxed and circular CARs are allowed, the optimization problem of selecting a maximum weight subset of adjacencies that forms a mix of linear and circular CARs is tractable and can be solved by reduction to a Maximum Weight Matching (MWM) problem.
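The selection step can be illustrated by a simple greedy heuristic in Python: adjacencies are scanned by decreasing weight and kept when they create neither a syntenic conflict (a marker with more than two neighbors) nor a cycle. This is a generic sketch under those assumptions, not the heuristic of [6]; the function names, the unsigned integer markers, and the weights are hypothetical.

```python
def greedy_adjacencies(weighted_adjs):
    """Greedily keep adjacencies that form a set of linear paths."""
    parent, degree, kept = {}, {}, []

    def find(x):  # union-find with path halving, used to detect cycles
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ranked = sorted(weighted_adjs.items(), key=lambda kv: (-kv[1], kv[0]))
    for (u, v), _weight in ranked:
        if degree.get(u, 0) >= 2 or degree.get(v, 0) >= 2:
            continue  # syntenic conflict: a marker would get a third neighbor
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # would close a circular CAR
        parent[root_u] = root_v
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        kept.append((u, v))
    return kept

def cars_from_adjacencies(kept):
    """Assemble the kept adjacencies into linear marker orders (CARs)."""
    neighbors = {}
    for u, v in kept:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    cars, seen = [], set()
    for start in sorted(neighbors):
        if start in seen or len(neighbors[start]) != 1:
            continue  # start each path from an unvisited endpoint
        path, prev, cur = [start], None, start
        seen.add(start)
        while True:
            following = [x for x in neighbors[cur] if x != prev]
            if not following:
                break
            prev, cur = cur, following[0]
            path.append(cur)
            seen.add(cur)
        cars.append(path)
    return cars

weights = {(1, 2): 0.9, (2, 3): 0.8, (1, 3): 0.7, (3, 4): 0.6, (2, 4): 0.5}
kept = greedy_adjacencies(weights)
cars = cars_from_adjacencies(kept)  # [[1, 2, 3, 4]]
```

In the example, `(1, 3)` is rejected because it would close a cycle and `(2, 4)` because marker 2 already has two neighbors. Like any greedy scheme, this gives no optimality guarantee, which is consistent with the NP-hardness of the linear case.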

When intervals are considered in addition to adjacencies, ancestral adjacencies and intervals can be encoded by a binary matrix, in the same way as hybridization experiments are encoded by binary matrices in physical mapping algorithms. The problem of extracting a conflict-free maximum weight subset of adjacencies is then NP-hard in all cases, i.e., even if a mix of circular and linear CARs is allowed. Traditionally, it is solved using either greedy heuristics or branch-and-bound algorithms (ensuring an optimal solution when they terminate). Moreover, when intervals are considered, CARs might not be completely defined and are represented using a PQ-tree data structure that has been widely used in physical mapping algorithms [39] and is related to the classical combinatorial concept of the Consecutive Ones Property (C1P) (see [7] and references therein). The software packages ANGES [53] and ROCOCO [57] are, so far, the only ancestral genome reconstruction methods that consider intervals of markers and encode CARs using PQ-trees.
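To make the Consecutive Ones Property concrete, the following Python sketch tests by brute force whether a set of adjacencies and intervals admits a linear marker order in which every feature is contiguous. It enumerates all permutations and is therefore only usable on toy examples; PQ-tree algorithms answer the same question far more efficiently. The function name and the encoding of features as sets of marker labels are assumptions of the sketch.

```python
from itertools import permutations

def consecutive_ones_order(markers, features):
    """Return a marker order making every feature contiguous, or None."""
    for order in permutations(markers):
        position = {m: i for i, m in enumerate(order)}
        # A feature is contiguous when its positions span exactly len(f) slots.
        if all(
            max(position[m] for m in f) - min(position[m] for m in f) == len(f) - 1
            for f in features
        ):
            return list(order)
    return None

compatible = consecutive_ones_order("abcd", [{"a", "b"}, {"b", "c", "d"}, {"c", "d"}])
conflicting = consecutive_ones_order("abc", [{"a", "b"}, {"b", "c"}, {"a", "c"}])
```

The first set of features is satisfied by the order a, b, c, d, whereas the second one forms a triangle of adjacencies that no linear order can satisfy, so `conflicting` is `None`.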

Last, when markers are assumed to be unique in the ancestral genome of interest but are subject to insertion or loss during evolution, the model of common adjacencies and intervals might be too stringent. In this case, the notions of gapped adjacencies and intervals were introduced, which allow for some flexibility in the definition of conserved groups of markers. However, this also implies that the C1P model is too stringent and needs to be relaxed into a gapped C1P model, in which optimization problems are NP-hard [58, 59].

These approaches have been used on various datasets, including mammalian genomes [6, 7, 60], the amniote ancestor [22], fungal genomes [10], insect genomes [8], and plant genomes [12, 16].

Non-unique markers. If markers exhibit varying copy numbers in extant genomes, they cannot all be assumed to occur exactly once in the considered ancestral genome. The first issue is then, for a given marker, to infer its ancestral copy numbers. This is a classical evolutionary genomics problem, arising for example when inferring the gene content of an extinct genome. Given a model of gains and losses of markers, it is possible to compute a most likely ancestral content [61, 62], or a content that minimizes the number of gains and losses [63], by a Dynamic Programming (DP) algorithm following the general pattern of the Sankoff-Rousseau algorithm [64].
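The parsimony variant of this copy number inference can be sketched as a small dynamic program in Python. The sketch assumes a unit cost for each copy gained or lost along a branch, a rooted binary tree encoded as nested pairs of leaf names, and ancestral copy numbers bounded by the largest extant copy number; these choices and the function name are illustrative assumptions, not the algorithm of any cited tool.

```python
def min_gain_loss(tree, leaf_counts):
    """Minimum number of gains and losses, with the optimal root copy numbers."""
    states = range(max(leaf_counts.values()) + 1)

    def cost_table(node):
        # table[s]: minimum cost of the subtree if the node has s copies
        if isinstance(node, str):
            return [0 if s == leaf_counts[node] else float("inf") for s in states]
        tables = [cost_table(child) for child in node]
        return [
            sum(min(abs(s - c) + table[c] for c in states) for table in tables)
            for s in states
        ]

    root = cost_table(tree)
    best = min(root)
    return best, [s for s in states if root[s] == best]

tree = (("A", "B"), "C")
cost, root_copies = min_gain_loss(tree, {"A": 2, "B": 2, "C": 1})
```

In this example a single event suffices, and the root may carry either one copy (one gain on the branch leading to A and B) or two copies (one loss toward C). Such ties illustrate why some methods work with bounds on ancestral copy numbers rather than with a single value.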

Once copy numbers of ancestral markers, or bounds on such copy numbers, have been obtained, the two-step approach outlined in the previous paragraphs can be applied: first, local syntenies (adjacencies and intervals) are detected using similar notions of adjacencies and intervals (we refer the reader to [65] for an overview of interval models when duplicated markers can occur) and are weighted according to their conservation pattern, and, in a second step, a maximum weight subset of local syntenies is computed that is compatible with the marker copy numbers. This second problem is known as the C1P with multiplicity (mC1P) and has been shown to be NP-hard in general; the only tractable case requires considering only adjacencies and allowing an unbounded number of circular CARs [56, 66]. Moreover, when markers have a copy number higher than one and only adjacencies are considered, a conflict-free set of adjacencies does not unambiguously define a set of CARs; this issue is similar to the well-identified problem of determining the location and context of repeats in genome assembly [67]. This issue can be addressed, at least partially, by considering intervals framed by non-repeats (repeat spanning intervals) as described in [68, 69]. Finally, when variation of copy numbers can be attributed to Whole-Genome Duplications (WGD), specific methods based on a combination of gapped adjacencies and TSP algorithms have been proposed and applied to fungi and plant data [70].

3.3 Adjacency Evolution along Gene Phylogenies

We now discuss a variant of the approach described in the previous section, which still considers genomes as sets of adjacencies between markers, but assumes that evolutionary scenarios for marker families are also available and focuses on all ancestral genomes of the species phylogeny at once. Due to its similarity with traditional character-based phylogenetics, we rely on the standard phylogenetic vocabulary and call genomic markers genes. In summary, ancestral adjacencies are inferred, as previously, but using an optimization criterion and the available gene phylogenies both as a guide and as a constraint.

Input: gene trees and adjacencies: This phylogeny-based approach requires as main input a fully binary rooted species phylogeny, and reconciled phylogenies for all gene families. This means that for all gene families, a rooted and annotated phylogenetic tree is required, depicting the whole history of the marker in ancestral and extant species in terms of speciations (S), duplications (D), transfers (T), or losses (L), where a transfer is the event of a species acquiring a genomic segment from another species (horizontal/lateral transfer). These reconciled gene trees can be obtained by several methods and software, depending on the set of evolutionary events one wants to consider (DTL or DL only) and on the models and methods (parsimony or probabilistic approaches, joint or sequential reconstruction of the tree topology and reconciliation) [71–73]. Some databases also provide gene trees or reconciled gene trees [29, 32]. A reconciliation yields a presence pattern of ancestral genes in ancestral species. The leaves of these trees are the extant genes, and their internal nodes and events define ancestral genes.

The other information needed by the methods is the list of the gene adjacencies in the extant genomes. As defined above, we usually consider that two genes are adjacent if there is no other gene of the dataset between them, although here again relaxed notions of adjacencies can be considered.

Adjacency evolution: Like genes and species, gene adjacencies also evolve. They can be gained, lost, duplicated, and transferred, for example. The core element of the phylogeny-based methods we describe in this section is to infer the evolution of these adjacencies along the gene phylogenies, which themselves evolve within the species phylogeny. This leads to the inference of adjacencies between ancestral genes, i.e., ancestral adjacencies, and thus provides elements of the organization of genes in ancestral species. The currently available methods compute an evolutionary history of the adjacencies by either minimizing a discrete parsimony criterion or maximizing a likelihood within a probabilistic framework. The main difficulty in such methods is to infer adjacency evolution scenarios that are consistent with the evolutionary history of the considered genes, encoded in their respective reconciled gene trees (see Fig. 1). The result of this approach, which considers each adjacency independently of all other adjacencies (as in the methods described in Subheading 3.2 but unlike those in Subheading 3.1), is a set of ancestral adjacencies for each ancestral species. As it is not guaranteed that these adjacencies are compatible with a linear structure, linearization methods such as [56] or global evolution methods such as [74] can be applied to infer valid ancestral gene arrangements for each individual ancestor. This approach was followed in DupCAR [75], which imposes some constraints on the gene trees, and in the family of DeCo algorithms that we describe below.

DeCo algorithm family: The inference of adjacency histories is computed by Dynamic Programming techniques, implementing the rules of transmission of an adjacency from an ancestor to a descendant. As in the previous methods, at any point, a rearrangement can break or form an adjacency. But in addition, when a gene undergoes an event (Birth, Duplication, Loss, Transfer), an adjacency that has this gene as an extremity necessarily changes: it can be gained, lost, duplicated, or transferred according to the evolutionary pattern of its extremities. The algorithm proceeds in three steps. A first step groups into classes the adjacencies that may share a common ancestor. Then, each class is examined independently using a DP algorithm that generalizes the Sankoff–Fitch parsimony algorithms on binary alphabets; here the binary character is the presence or absence of an adjacency that evolves along pairs of reconciled gene phylogenies. Last, a backtrack step infers an unambiguous parsimonious evolutionary scenario for each adjacency.

Fig. 1 Propagation of adjacencies (red) along gene phylogenies (black and blue) reconciled with a species phylogeny (green). This figure represents the evolutionary history of five extant adjacencies a1 to a5 sharing a common ancestor a0, in agreement with the history of the genes present at their extremities. The double gene duplication on the left side induces an adjacency duplication, whereas the single-gene duplication on the right side does not. Events such as gene losses or rearrangements make adjacencies lost or broken. (Event labels in the figure: adjacency creation, gene loss, adjacency transfer, adjacency break, gene duplication, double gene duplication.)

This principle was first implemented in a parsimony framework for gene trees reconciled under a duplication/loss model [76]. Following this initial model, several extensions have been proposed, which we briefly outline now. DeCoLT is an extension of DeCo that allows modeling the lateral transfer of genes between species, a frequent event in bacterial evolution [18]. Two probabilistic extensions were recently introduced: in [9], the optimization criterion is a maximum likelihood criterion, while DeClone [77] implements a probabilistic approach to parsimony by allowing the sampling of evolutionary scenarios according to a Boltzmann-Gibbs probability distribution. Last, Art DeCo [78] has been introduced to handle fragmented extant genome assemblies (see the next section). DeCo and its variants all run in polynomial time, which allows using them on large-scale datasets such as 69 eukaryotic genomes [76, 79].

3.4 Handling Fragmented Extant Genomes

Ideally, to reconstruct an accurate and complete organization of one (or several) ancestral genome(s) with a comparative approach, one would like to rely on the complete chromosomal organization of the considered related extant genomes. However, currently, most genome assemblies are incomplete and can even be highly fragmented (see footnote 1). This is due to the prevalence of sequencing technologies producing short and accurate reads that do not allow assembling repeated regions [67]. Recent improvements in sequencing technologies (for example long read sequencing protocols), as well as advances in processing methods (for example hybrid assemblies [80, 81] and gap closing methods [82–84]), make it possible to obtain the complete genome organization of microbial genomes [85]; however, the problem of genome assembly is still hard for large eukaryotic genomes [86].

Fragmented extant genome assemblies are characterized by the fact that chromosomes are split into several contigs or scaffolds, whose relative order and orientation are not known. This missing information on the order and orientation of these scaffolds might hide conserved syntenies such as marker adjacencies, which leads to similarly fragmented ancestral genome organizations. One can see the problem of reconstructing the organization of ancestral genomes as similar to genome mapping or scaffolding problems, in which case ancestral genome reconstruction and extant genome assembly can be considered a single problem that consists in ordering genomic markers, whether ancient or extant. The algorithmic similarity between these two problems has been pointed out [87] by noting a similarity between the breakpoint graph [41], used for the reconstruction of gene order in ancestral genomes, and the de Bruijn graph [88], used in genome assembly. This observation has led to the recent development of approaches aiming at improving extant genome assembly in an evolutionary framework that jointly reconstructs ancient genome organization.

1 See the GOLD database, for example: https://gold.jgi.doe.gov/statistics.

This similarity was first exploited by Munoz et al. [89] to order and orient scaffolds by contig fusion, using the breakpoint graph built from a reference genome and a target genome to assemble. The concept was taken further by Aganezov et al. [90], who considered several related extant genomes (possibly at various levels of fragmentation) and applied simultaneous co-scaffolding of all extant genomes, under the hypothesis that fragmentation breakpoints are not the same (i.e., do not fall between the same markers) in all species, so that conserved syntenies can still be detected, although with a weaker conservation signal. The core of their method is an extension of the classical breakpoint graph to more than two genomes [45, 46] and follows the parsimony principle on permutations (see Subheading 3.1). As a consequence, the method is limited to a small number of species (fewer than 10) and does not handle duplications.
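The intuition behind reference-based scaffolding can be illustrated with a minimal sketch. It is not the algorithm of [89] or [90]: it assumes a single reference genome, unique signed markers encoded as non-zero integers, and simply reports pairs of scaffold ends whose extremities are adjacent in the reference. The function names (`extremities`, `adjacencies`, `candidate_joins`) and the encoding of extremities as `(marker, "h")` or `(marker, "t")` are hypothetical choices made for this illustration.

```python
def extremities(marker):
    """Return the (left, right) extremities of a signed marker read left to right.

    A positive marker is read tail -> head, a negative one head -> tail.
    """
    name = abs(marker)
    if marker > 0:
        return (name, "t"), (name, "h")
    return (name, "h"), (name, "t")


def adjacencies(chromosome):
    """Set of adjacencies (unordered pairs of extremities) of a linear marker order."""
    adj = set()
    for a, b in zip(chromosome, chromosome[1:]):
        adj.add(frozenset((extremities(a)[1], extremities(b)[0])))
    return adj


def candidate_joins(scaffolds, reference):
    """Pairs of scaffold ends whose extremities are adjacent in the reference."""
    ref_adj = set()
    for chrom in reference:
        ref_adj |= adjacencies(chrom)

    # Map each free extremity of the target assembly to its scaffold end.
    ends = {}
    for i, scaffold in enumerate(scaffolds):
        ends[extremities(scaffold[0])[0]] = (i, "start")
        ends[extremities(scaffold[-1])[1]] = (i, "end")

    joins = []
    for adj in ref_adj:
        x, y = tuple(adj)
        if x in ends and y in ends and ends[x][0] != ends[y][0]:
            joins.append(tuple(sorted((ends[x], ends[y]))))
    return sorted(joins)


reference = [[1, 2, 3, 4, 5]]
scaffolds = [[1, 2], [-4, -3], [5]]
print(candidate_joins(scaffolds, reference))
```

In this toy input the second scaffold is reversed with respect to the reference, so the sketch proposes joining the end of scaffold 0 to the end of scaffold 1, and the start of scaffold 1 to the start of scaffold 2. With several related genomes, each possibly fragmented at different positions, the same kind of evidence would have to be weighed across species, which is where the phylogenetic methods discussed in this section come in.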

Another alternative is an extension of the DeCo algorithm (see Subheading 3.3), called Art DeCo [77]. The method scaffolds several fragmented related genomes by reconstructing the evolution of gene adjacencies. It is based on a parsimony principle that considers gains and breaks of adjacencies, as well as the cost of creating scaffolding adjacencies in extant genomes, but it is applied independently to each adjacency, thus avoiding the computational tractability issue of a parsimony approach on permutations. Art DeCo can handle a large number of species (several dozen) as well as gene duplications. The linearization issue, however, propagates to extant genomes: neither extant nor ancestral genomes are guaranteed to be compatible with a linear or circular structure, and linearization algorithms are needed as a post-processing step.
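To make the linearization issue concrete, the sketch below selects a conflict-free subset of weighted adjacencies, so that every marker extremity belongs to at most one adjacency (the condition for a set of linear and circular chromosomes). It is a simple greedy heuristic written for illustration, not the maximum-weight-matching approach used by the published linearization algorithms; the function name `greedy_linearize`, the string encoding of extremities, and the weights are all assumptions.

```python
def greedy_linearize(weighted_adjacencies):
    """Keep adjacencies by decreasing weight, skipping any that reuse an extremity.

    weighted_adjacencies maps a pair of extremities (x, y) to a confidence weight.
    The result is compatible with a mix of linear and circular chromosomes,
    because each extremity appears in at most one kept adjacency.
    """
    used = set()
    kept = []
    ranked = sorted(weighted_adjacencies.items(), key=lambda kv: (-kv[1], kv[0]))
    for (x, y), _weight in ranked:
        if x not in used and y not in used:
            kept.append((x, y))
            used.update((x, y))
    return kept


# Independent per-adjacency inference can propose two neighbours for a_h.
proposed = {
    ("a_h", "b_t"): 0.9,
    ("a_h", "c_t"): 0.4,
    ("b_h", "c_t"): 0.7,
}
print(greedy_linearize(proposed))
```

Here the extremity `a_h` is proposed as a neighbour of both `b_t` and `c_t`; the heuristic keeps the better supported adjacency and discards the conflicting one. A greedy choice is not guaranteed to maximize the total weight kept, which is one reason an exact matching formulation is preferable in practice.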

3.5 Using Ancient DNA

In addition to extant genomes, ancient DNA (aDNA) extracted from archaeological or paleontological remains can provide direct evidence about the contents and structure of an ancient genome. Early works using aDNA concentrated on mitochondrial DNA not older than a few thousand years, recovered for example from quagga [91], extinct moa [92], cave bears [93], or Neanderthal [94]. Later, advances in sequencing technologies and in aDNA recovery protocols [95] opened the way to the sequencing of nuclear aDNA in even older samples of bacteria like Yersinia pestis [96, 97] or mammals like the extinct woolly mammoth [98] or ancient horses [99].


However, due to postmortem DNA decay and degradation by nucleases, only short fragments of aDNA can be recovered. Subsequently, the retrieved sequences are usually aligned to references, and variants are identified keeping aDNA damage patterns in mind, which precludes the analysis of more complex rearrangements between the ancient and extant genomes [100]. While a contig assembly based on such data can be expected to be quite fragmented, classical scaffolding approaches often cannot be applied to aDNA data, due for example to the nature of the aDNA capture process. Hence, comparative phylogenetic methods following principles similar to the ancestral reconstruction methods described above have to be used to order and orient the obtained contigs. Combining aDNA sequencing data with comparative methods is therefore useful in two ways: it scaffolds a fragmented aDNA assembly while improving the reconstruction of other, probably older, ancient genomes in the phylogeny. We outline this approach below.

Given sets of contigs from aDNA assemblies assigned to internal nodes of the species phylogeny, one first needs to define a common set of markers between the ancient contigs and extant genome sequences. Each family of markers should then consist of at least one ancient contig fragment and its occurrences in several extant genomes. An iterative segmentation approach based on mappings of ancient contigs to extant genomes is described in FPSAC [69], although other fragmentation or synteny block construction algorithms can also be applied [34, 101].

Once marker families have been obtained using aDNA and extant DNA data, the methods outlined in the previous sections can be applied directly. For example, the FPSAC method [69] computes copy numbers for markers using discrete parsimony, infers potential ancestral adjacencies using the Dollo parsimony principle, linearizes these adjacencies using the MWM algorithm introduced in [56], and resolves ambiguities due to repeated markers using the algorithms of [68]. Moreover, as the set of markers likely does not cover the whole ancient genome, gaps between adjacent markers in scaffolds need to be filled. In FPSAC, the corresponding extant gaps are identified and their sequences are aligned. Then, for each column of the alignment, the parsimonious ancestral state is reconstructed with the Fitch algorithm [51]. This approach has been successfully applied to a set of aDNA contigs from the human pathogen Yersinia pestis, which was obtained from remains of victims of the Black Death pandemic in the fourteenth century [102].
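The column-wise gap-filling step can be sketched with the bottom-up pass of the Fitch algorithm [51]. This is a minimal illustration, not FPSAC's implementation: it assumes a rooted binary tree given as nested tuples whose root is the ancestor of interest, treats the gap character as an ordinary state, and breaks ties between equally parsimonious states arbitrarily. The names `fitch` and `ancestral_gap_sequence` and the toy alignment are hypothetical.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass.

    tree is a leaf name (str) or a pair (left, right) of subtrees.
    Returns (set of optimal root states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint child sets: one change is needed on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def ancestral_gap_sequence(tree, alignment):
    """Reconstruct one parsimonious root sequence, column by column."""
    length = len(next(iter(alignment.values())))
    sequence = []
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        states, _changes = fitch(tree, column)
        # Arbitrary but deterministic tie-break among optimal states.
        sequence.append(min(states))
    return "".join(sequence)


tree = (("A", "B"), ("C", "D"))
alignment = {"A": "ACGT", "B": "ACGA", "C": "ACTT", "D": "GCTT"}
print(ancestral_gap_sequence(tree, alignment))
```

In the third column the two subtrees disagree (G versus T), so both states are equally parsimonious at the root; reporting the full state set rather than a single character would make that ambiguity visible.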

3.6 Software

We review in Table 1 the main existing software implementing the principles described in the previous sections.

3.7 Validation

Validation is a constant concern in evolutionary studies. Different hypotheses, different methods, and different types of data may lead to different results [103], and their quality is difficult to quantify. Predictions concern events that can be up to 4 billion years old, and no DNA molecule is preserved, even in exceptional conditions, for more than 1 million years. Even in the rare cases where ancient DNA is available, it often does not come from ancestral genomes, and assembly issues make it hard to use for validation purposes (see Subheading 3.5).

Theoretical considerations about the models and methods can help to assess the validity of the results. Agreement with widely accepted biological hypotheses, statistical consistency, computational complexity, clarity, and validity of the underlying hypotheses have to be discussed [104]. For example, a majority of the methods presented in this chapter are based on parsimony, which assumes that the possibility of convergence or reversion is negligible, whereas all statistical studies have tended to show that this is not the case [105]. Models have to find a good balance between realism, consistency, and complexity. An important feature of a methodology is whether it is able to provide several alternative equivalent solutions [43] (most of the time an optimal or a likely solution is not unique), or better, a sampling of possible solutions according to a likelihood [49]. If this is not possible, statistical support for local features such as ancestral adjacencies can at least provide a measure of robustness [77] (see Col. 7 in Table 1).

Table 1
Main methods publicly available for ancient genome reconstruction

| Name | Adjacencies (A), intervals (I), permutations (P) | Parsimony (Pa), probabilistic (Pr) | Insertions and losses | Duplications | Transfers | Exploration of alternative solutions and/or support of solutions |
|---|---|---|---|---|---|---|
| ANGES | A/I | Pa | Y | N | N | Y |
| FPSAC | A/I | Pa | Y | Y | N | N |
| DeCo* | A | Pa/Pr | Y | Y | Y | Y |
| DupCAR | A | Pa | Y | Y | N | N |
| ROCOCO | A/I | Pa | N | N | N | N |
| MGRA2 | P | Pa | Y | N | N | N |
| MGLO | A | Pr | Y | Y | N | N |
| Badger | P | Pr | N | N | N | Y |
| GASTS | P | Pa | N | N | N | Y |
| Pathgroup | P | Pa | N | N | N | N |
| Infercars | A | Pa | N | N | N | Y |

Col. 1 records the name of the method. Col. 2 indicates which type of method it implements, either genomes as permutations (Subheading 3.1), or genomes as sets of adjacencies and intervals (Subheadings 3.2 and 3.3). Col. 3 records whether it uses a parsimony assumption or a probabilistic approach. Col. 4 indicates whether the method allows unequal marker content in extant and ancestral species. Col. 5 indicates whether the underlying evolution model considers gene duplication. Col. 6 indicates whether the underlying evolution model considers gene transfers. Col. 7 indicates whether alternative solutions can be provided (through sampling for example) or whether there is a measure of support for features of the provided solution.

References of the listed methods: ANGES [53], FPSAC [69], DeCo and variants [9, 18, 76–78], DupCAR [75], ROCOCO [57], MGRA2 [45, 46], MGLO [54], Badger [49], GASTS [43], Pathgroup [44], Infercars [6]

Though it is not possible to travel in time, nor to replay the tape of evolution [106], it is possible to experimentally generate some lineages and test reconstruction methods on them [107, 108]. This has been done for ancestral sequence reconstruction purposes, but it is very expensive and time consuming, and it usually generates easy instances on which all methods perform equally well. It has never been done for chromosome organization, although some experiments could theoretically be used as benchmarks [109].

Another validation technique is to compare the results with similar ones produced by independent data and techniques. For example, molecular evolutionary studies can compare their results with fossil data [110, 111]. Bioinformatics ancestral genome reconstructions have, for example, been compared with reconstructions from cytogenetics data [103]. But as for ancient sequences, each kind of protocol has caveats, and none can be considered as the truth.

The main validation tool remains simulation. Genome evolution can be simulated in silico for a much higher number of generations than in experimental evolution, at a lower cost. At least two issues need to be considered for the simulation, on which no general consensus exists: the set of operations applied, and the parameters (e.g., relative frequencies) of the different operations, if more than one type is used. Moreover, simulators are often designed by the team developing the inference method, and even when they are designed to be used by another team for inference [112, 113], they originate from a community interested in proving the validity of inference methods and are based on models similar to those that underlie the reconstruction methods. Situations where the teams developing the inference methods and testing them are separated from the start are very rare [114] and, in their current state, existing testing schemes are not yet complex enough to be used for ancestral genome organization reconstruction. Nevertheless, this is likely an important aspect of ancient genome reconstruction methods that needs to be developed.
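A simulation-based benchmark can be sketched in a few lines. The example below is a deliberately simple assumption-laden toy, not one of the simulators cited above: the only operation is inversion on a single linear chromosome of unique signed markers, breakpoints are drawn uniformly, the phylogeny has just two leaves, and the "inference method" under test keeps the adjacencies shared by both descendants. The function names and parameters are hypothetical.

```python
import random


def invert(genome, i, j):
    """Reverse the segment genome[i:j] and flip the signs of its markers."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def evolve(genome, n_inversions, rng):
    """Apply n_inversions random inversions with uniformly drawn breakpoints."""
    for _ in range(n_inversions):
        i, j = sorted(rng.sample(range(len(genome) + 1), 2))
        genome = invert(genome, i, j)
    return genome


def adjacency_set(genome):
    """Adjacencies of a linear signed marker order, independent of reading strand."""
    return {min((a, b), (-b, -a)) for a, b in zip(genome, genome[1:])}


def benchmark(n_markers, n_inversions, seed=0):
    """Simulate two descendants of a known ancestor and score a naive inference.

    Returns (precision, recall) of the adjacencies shared by both descendants
    with respect to the true ancestral adjacencies. Requires n_markers >= 2.
    """
    rng = random.Random(seed)
    ancestor = list(range(1, n_markers + 1))
    child_a = evolve(ancestor, n_inversions, rng)
    child_b = evolve(ancestor, n_inversions, rng)
    truth = adjacency_set(ancestor)
    inferred = adjacency_set(child_a) & adjacency_set(child_b)
    correct = len(truth & inferred)
    precision = correct / len(inferred) if inferred else 1.0
    recall = correct / len(truth)
    return precision, recall


for k in (0, 5, 20, 80):
    print(k, benchmark(n_markers=100, n_inversions=k, seed=1))
```

Because the simulator and the inference share the same model (unique markers, inversions only), such a benchmark is subject to exactly the circularity discussed above; it mainly shows how the signal for ancestral adjacencies erodes as the number of rearrangements grows.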

4 Conclusion—a Short User Guide

There has been an important effort, mostly over the last 10 years, in the development of computational methods for the reconstruction of ancestral genome organizations. Choosing a method among the many that are available requires considering several variables, such as the nature of available data, the evolutionary properties of the lineages under consideration, and the available computational infrastructure.


If a dataset is large (more than ~10 species), or if it contains many duplications that are deemed important to consider, it is better to look at methods that consider genomes as sets of adjacencies or intervals rather than as permutations. The latter are appropriate for a small number of small genomes, provided duplicate markers can be ignored and a reasonable amount of computing power is available. In that case, probabilistic methods such as Badger should be preferred, because they propose a sample of solutions based on grounded statistical principles instead of a unique solution of a heuristic, although they are the most computationally intensive.

In all other cases, in our opinion, a local approach with adjacencies and intervals should be favored. If duplicates can be ignored (unique markers), ANGES is the most flexible tool and retrieves the most information (common intervals in addition to adjacencies). Otherwise, assuming duplicated markers are important and need to be considered, if good gene or marker phylogenies are available, the DeCo method and its variants are a natural choice, providing the most comprehensive evolutionary scenarios. The choice of the variant depends on whether lateral transfers are considered and on whether the genomes under consideration are poorly assembled. In the absence of reliable gene phylogenies, MGLO and FPSAC (used without aDNA data) are the only available methods.
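The recommendations of this section can be summarized as a small decision procedure. The function below is only one possible reading of the guide: its name, its boolean parameters, and the hard threshold of 10 species (the text says "~10") are assumptions made for illustration, and it returns method names from Table 1 rather than invoking any software.

```python
def suggest_methods(n_species, small_genomes, duplications_matter,
                    good_gene_trees, ample_compute):
    """Return candidate methods following the short user guide above.

    All thresholds are indicative; the guide speaks of "more than ~10 species".
    """
    permutation_friendly = (
        n_species <= 10
        and small_genomes
        and not duplications_matter
        and ample_compute
    )
    if permutation_friendly:
        # Permutation-based setting: prefer a probabilistic sampler.
        return ["Badger"]
    if not duplications_matter:
        # Unique markers: adjacencies and common intervals.
        return ["ANGES"]
    if good_gene_trees:
        # Duplications with reliable gene or marker phylogenies.
        return ["DeCo and variants"]
    # Duplications without reliable gene phylogenies.
    return ["MGLO", "FPSAC"]


print(suggest_methods(6, True, False, True, True))
print(suggest_methods(40, False, True, False, True))
```

Encoding the guide this way makes its branching explicit, but the choice within the DeCo family (lateral transfers, poorly assembled genomes) still has to be made from the data at hand.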

Acknowledgment

C.C. is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 249834. E.T., S.B., and Y.A. are funded by the French Agence Nationale pour la Recherche (ANR) through PIA Grant ANR-10-BINF-01-01 "Ancestrome". N.L. is funded by the International DFG Research Training Group GRK 1906/1.

References

1. Sturtevant AH (1921) A case of rearrangement of genes in Drosophila. Proc Natl Acad Sci U S A 7:235–237

2. Dobzhansky T, Sturtevant AH (1938) Inversions in the chromosomes of Drosophila pseudoobscura. Genetics 23:28–64

3. Pauling L, Zuckerkandl E (1963) Chemical paleogenetics. Acta Chem Scand 17:S9–S16

4. Poinar HN, Schwarz C, Qi J et al (2006) Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science 311:392–394

5. Muffato M, Roest Crollius H (2008) Paleogenomics in vertebrates, or the recovery of lost genomes from the mist of time. Bioessays 30:122–134

6. Ma J, Zhang L, Suh BB et al (2006) Reconstructing contiguous regions of an ancestral genome. Genome Res 16:1557–1565

7. Chauve C, Tannier E (2008) A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genomes. PLoS Comput Biol 4:e1000234

8. Neafsey DE, Waterhouse RM, Abai MR et al (2015) Mosquito genomics. Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes. Science 347:1258522


9. Semeria M, Tannier E, Gueguen L (2015) Probabilistic modeling of the evolution of gene synteny within reconciled phylogenies. BMC Bioinformatics 16(Suppl 14):S5

10. Chauve C, Gavranovic H, Ouangraoua A et al (2010) Yeast ancestral genome reconstructions: the possibilities of computational methods II. J Comput Biol 17:1097–1112

11. Sankoff D, Zheng C, Wall PK et al (2009) Towards improved reconstruction of ancestral gene order in angiosperm phylogeny. J Comput Biol 16:1353–1367

12. Murat F, Xu JH, Tannier E et al (2010) Ancestral grass karyotype reconstruction unravels new mechanisms of genome shuffling as a source of plant evolution. Genome Res 20:1545–1557

13. Ming R, VanBuren R, Wai CM et al (2015) The pineapple genome and the evolution of CAM photosynthesis. Nat Genet 47:1435–1442

14. Salse J (2016) Ancestors of modern plant crops. Curr Opin Plant Biol 30:134–142

15. Murat F, Louis A, Maumus F et al (2015) Understanding Brassicaceae evolution through ancestral genome reconstruction. Genome Biol 16:262

16. Murat F, Zhang R, Guizard S et al (2015) Karyotype and gene order evolution from reconstructed extinct ancestors highlight contrasts in genome plasticity of modern rosid crops. Genome Biol Evol 7:735–749

17. Wang Y, Li W, Zhang T et al (2006) Reconstruction of ancient genome and gene order from complete microbial genome sequences. J Theor Biol 239:494–498

18. Patterson M, Szollosi G, Daubin V et al (2013) Lateral gene transfer, rearrangement, reconciliation. BMC Bioinformatics 14(Suppl 15):S4

19. Darling AE, Miklos I, Ragan MA (2008) Dynamics of genome rearrangement in bacterial populations. PLoS Genet 4:e1000128

20. Kohn M, Hogel J, Vogel W et al (2006) Reconstruction of a 450-my-old ancestral vertebrate protokaryotype. Trends Genet 22:203–210

21. Nakatani Y, Takeda H, Kohara Y et al (2007) Reconstruction of the vertebrate ancestral genome reveals dynamic genome reorganization in early vertebrates. Genome Res 17:1254–1265

22. Ouangraoua A, Tannier E, Chauve C (2011) Reconstructing the architecture of the ancestral amniote genome. Bioinformatics 27:2664–2671

23. Jaillon O, Aury JM, Brunet F et al (2004) Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431:946–957

24. Woods IG, Wilson C, Friedlander B et al (2005) The zebrafish gene map defines ancestral vertebrate chromosomes. Genome Res 15:1307–1314

25. Catchen JM, Conery JS, Postlethwait JH (2008) Inferring ancestral gene order. Methods Mol Biol 452:365–383

26. Naruse K, Tanaka M, Mita K et al (2004) A medaka gene map: the trace of ancestral vertebrate proto-chromosomes revealed by comparative gene mapping. Genome Res 14:820–828

27. Putnam NH, Butts T, Ferrier DEK et al (2008) The amphioxus genome and the evolution of the chordate karyotype. Nature 453:1064–1071

28. Putnam NH, Srivastava M, Hellsten U et al (2007) Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317:86–94

29. Herrero J, Muffato M, Beal K et al (2016) Ensembl comparative genomics resources. Database 2016:bav096. https://doi.org/10.1093/database/bav096

30. Speir ML, Zweig AS, Rosenbloom KR et al (2016) The UCSC genome browser database: 2016 update. Nucleic Acids Res 44:D717–D725

31. Nagarajan N, Pop M (2013) Sequence assembly demystified. Nat Rev Genet 14:157–167

32. Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, Gouy M, Perriere G (2009) Databases of homologous gene families for comparative genomics. BMC Bioinformatics 10(Suppl 6):S3

33. Sankoff D, Nadeau JH (2003) Chromosome rearrangements in evolution: from gene order to genome sequence and back. Proc Natl Acad Sci U S A 100:11188–11189

34. Visnovska M, Vinar T, Brejova B (2013) DNA sequence segmentation based on local similarity. In: ITAT 2013 Proceedings, pp 36–43

35. Dousse A, Junier T, Zdobnov EM (2016) CEGA–a catalog of conserved elements from genomic alignments. Nucleic Acids Res 44:D96–D100

36. Belcaid M, Bergeron A, Chateau A et al (2007) Exploring genome rearrangements using virtual hybridization. In: APBC'07: 5th Asia-Pacific bioinformatics conference. Imperial College Press, pp 205–214



37. Kim J, Larkin DM, Cai Q et al (2013) Reference-assisted chromosome assembly. Proc Natl Acad Sci U S A 110:1785–1790

38. Biller P, Gueguen L, Knibbe C, Tannier E (2016) Breaking good: accounting for the fragility of genomic regions in rearrangement distance estimation. Genome Biol Evol 8(5):1427–1439

39. Alizadeh F, Karp RM, Weisser DK et al (1995) Physical mapping of chromosomes using unique probes. J Comput Biol 2:159–184

40. Yancopoulos S, Attie O, Friedberg R (2005) Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 21:3340–3346

41. Fertin G (2009) Combinatorics of genome rearrangements. MIT Press, Cambridge

42. Tannier E, Zheng C, Sankoff D (2009) Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics 10:120

43. Xu AW, Moret BME (2011) GASTS: parsimony scoring under rearrangements. In: Algorithms in bioinformatics. Springer, Berlin Heidelberg, pp 351–363

44. Zheng C, Sankoff D (2011) On the PATHGROUPS approach to rapid small phylogeny. BMC Bioinformatics 12(Suppl 1):S4

45. Alekseyev MA, Pevzner PA (2009) Breakpoint graphs and ancestral genome reconstructions. Genome Res 19:943–957

46. Avdeyev P, Jiang S, Aganezov S et al (2016) Reconstruction of ancestral genomes in presence of gene gain and loss. J Comput Biol 23:150–164

47. Ma J, Ratan A, Raney BJ et al (2008) The infinite sites model of genome evolution. Proc Natl Acad Sci U S A 105:14254–14261

48. Paten B, Zerbino DR, Hickey G et al (2014) A unifying model of genome evolution under parsimony. BMC Bioinformatics 15:206

49. Simon D, Larget B (2004) Bayesian analysis to describe genomic evolution by rearrangement (BADGER), version 1.02 beta. Department of Mathematics and Computer Science, Duquesne University

50. Feijao P, Meidanis J (2011) SCJ: a breakpoint-like distance that simplifies several rearrangement problems. IEEE/ACM Trans Comput Biol Bioinform 8:1318–1329

51. Fitch WM (1971) Toward defining the course of evolution: minimum change for a specific tree topology. Syst Biol 20:406–416

52. Miklos I, Smith H (2015) Sampling and counting genome rearrangement scenarios. BMC Bioinformatics 16(Suppl 14):S6

53. Jones BR, Rajaraman A, Tannier E et al (2012) ANGES: reconstructing ANcestral GEnomeS maps. Bioinformatics 28:2388–2390

54. Hu F, Zhou J, Zhou L et al (2014) Probabilistic reconstruction of ancestral gene orders with insertions and deletions. IEEE/ACM Trans Comput Biol Bioinform 11:667–672

55. Ma J (2010) A probabilistic framework for inferring ancestral genomic orders. In: Bioinformatics and biomedicine (BIBM), pp 179–184

56. Manuch J, Patterson M, Wittler R et al (2012) Linearization of ancestral multichromosomal genomes. BMC Bioinformatics 13(Suppl 19):S11

57. Stoye J, Wittler R (2009) A unified approach for reconstructing ancient gene clusters. IEEE/ACM Trans Comput Biol Bioinform 6:387–400

58. Manuch J, Patterson M, Chauve C (2012) Hardness results on the gapped consecutive-ones property problem. Discrete Appl Math 160:2760–2768

59. Manuch J, Patterson M (2011) The complexity of the gapped consecutive-ones property problem for matrices of bounded maximum degree. J Comput Biol 18:1243–1253

60. Gavranovic H, Chauve C, Salse J et al (2011) Mapping ancestral genomes with massive gene loss: a matrix sandwich problem. Bioinformatics 27:i257–i265

61. Csuros M (2010) Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics 26:1910–1912

62. De Bie T, Cristianini N, Demuth JP et al (2006) CAFE: a computational tool for the study of gene family evolution. Bioinformatics 22:1269–1271

63. Csuros M (2013) How to infer ancestral genome features by parsimony: dynamic programming over an evolutionary tree. In: Models and algorithms for genome evolution. Springer, London, pp 29–45

64. Sankoff D, Rousseau P (1975) Locating the vertices of a Steiner tree in an arbitrary metric space. Math Prog 9:240–246

65. Bergeron A, Chauve C, Gingras Y (2008) Formal models of gene clusters. In: Bioinformatics algorithms. John Wiley & Sons, Inc, Hoboken, pp 175–202

66. Wittler R, Manuch J, Patterson M et al (2011) Consistency of sequence-based gene clusters. J Comput Biol 18:1023–1039

67. Treangen TJ, Salzberg SL (2012) Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13:36–46

68. Rajaraman A, Zanetti J, Manuch J et al (2016) Algorithms and complexity results for genome mapping problems. IEEE/ACM Trans Comput Biol Bioinform 14(2):418–430. https://doi.org/10.1109/TCBB.2016.2528239

69. Rajaraman A, Tannier E, Chauve C (2013) FPSAC: fast phylogenetic scaffolding of ancient contigs. Bioinformatics 29:2987–2994

70. Gagnon Y, Blanchette M, El Mabrouk N (2012) A flexible ancestral genome reconstruction method based on gapped adjacencies. BMC Bioinformatics 13(Suppl 19):S4

71. Nakhleh L (2013) Computational approaches to species phylogeny inference and gene tree reconciliation. Trends Ecol Evol 28:719–728

72. Szollosi GJ, Tannier E, Daubin V et al (2015) The inference of gene trees with species trees. Syst Biol 64:42–62

73. Jacox E, Chauve C, Szollosi GJ et al (2016) ecceTERA: comprehensive gene tree-species tree reconciliation using parsimony. Bioinformatics 32(13):2056–2058. https://doi.org/10.1093/bioinformatics/btw105

74. Luhmann N, Thevenin A, Ouangraoua A et al (2016) The SCJ small parsimony problem for weighted gene adjacencies. In: Bioinformatics research and applications. Springer, Berlin Heidelberg

75. Ma J, Ratan A, Raney BJ et al (2008) DUPCAR: reconstructing contiguous ancestral regions with duplications. J Comput Biol 15:1007–1027

76. Berard S, Gallien C, Boussau B et al (2012) Evolution of gene neighborhoods within reconciled phylogenies. Bioinformatics 28:i382–i388

77. Chauve C, Ponty Y, Zanetti J (2015) Evolution of genes neighborhood within reconciled phylogenies: an ensemble approach. BMC Bioinformatics 16(Suppl 19):S6

78. Anselmetti Y, Berry V, Chauve C et al (2015) Ancestral gene synteny reconstruction improves extant species scaffolding. BMC Genomics 16(Suppl 10):S11

79. Duchemin W, Anselmetti Y, Patterson M et al (2017) DeCoSTAR: reconstructing the ancestral organization of genes or genomes using reconciled phylogenies. Genome Biol Evol 9:1312–1319

80. Koren S, Schatz MC, Walenz BP et al (2012) Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 30:693–700

81. Antipov D, Korobeynikov A, McLean JS et al (2015) hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32:1009–1015

82. Paulino D, Warren RL, Vandervalk BP et al (2015) Sealer: a scalable gap-closing application for finishing draft genomes. BMC Bioinformatics 16:230

83. Salmela L, Sahlin K, Mäkinen V et al (2016) Gap filling as exact path length problem. J Comput Biol 23:347–361

84. English AC, Richards S, Han Y et al (2012) Mind the gap: upgrading genomes with Pacific Biosciences RS long read sequencing technology. PLoS One 7:e47768

85. Koren S, Phillippy AM (2015) One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol 23:110–120

86. Rhoads A, Au KF (2015) PacBio sequencing and its applications. Genomics Proteomics Bioinformatics 13:278–289

87. Lin Y, Nurk S, Pevzner PA (2014) What is the difference between the breakpoint graph and the de Bruijn graph? BMC Genomics 15(Suppl 6):S6

88. Compeau PEC, Pevzner PA, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 29:987–991

89. Munoz A, Zheng C, Zhu Q et al (2010) Scaffold filling, contig fusion and comparative gene order inference. BMC Bioinformatics 11:304

90. Aganezov S, Sitdykova N, AGC Consortium et al (2015) Scaffold assembly based on genome rearrangement analysis. Comput Biol Chem 57:46–53

91. Higuchi R, Bowman B, Freiberger M et al (1984) DNA sequences from the quagga, an extinct member of the horse family. Nature 312:282–284

92. Cooper A, Lalueza-Fox C, Anderson S et al (2001) Complete mitochondrial genome sequences of two extinct moas clarify ratite evolution. Nature 409:704–707

93. Stiller M, Baryshnikov G, Bocherens H et al (2010) Withering away–25,000 years of genetic decline preceded cave bear extinction. Mol Biol Evol 27:975–978

94. Krings M, Stone A, Schmitz RW et al (1997) Neandertal DNA sequences and the origin of modern humans. Cell 90:19–30

95. Marciniak S, Klunk J, Devault A et al (2015) Ancient human genomics: the methodology behind reconstructing evolutionary pathways. J Hum Evol 79:21–34



96. Rasmussen S, Allentoft ME, Nielsen K et al (2015) Early divergent strains of Yersinia pestis in Eurasia 5,000 years ago. Cell 163:571–582

97. Wagner DM, Klunk J, Harbeck M et al (2014) Yersinia pestis and the plague of Justinian 541–543 AD: a genomic analysis. Lancet Infect Dis 14:319–326

98. Miller W, Drautz DI, Ratan A et al (2008) Sequencing the nuclear genome of the extinct woolly mammoth. Nature 456:387–390

99. Orlando L, Ginolhac A, Zhang G et al (2013) Recalibrating Equus evolution using the genome sequence of an early Middle Pleistocene horse. Nature 499:74–78

100. Peltzer A, Jäger G, Herbig A et al (2016) EAGER: efficient ancient genome reconstruction. Genome Biol 17:1–14

101. Minkin I, Patel A, Kolmogorov M et al (2013) Sibelia: a scalable and comprehensive synteny block generation tool for closely related microbial genomes. In: Algorithms in bioinformatics. Springer, Berlin Heidelberg, pp 215–229

102. Bos KI, Schuenemann VJ, Golding GB et al (2011) A draft genome of Yersinia pestis from victims of the Black Death. Nature 478:506–510

103. Froenicke L, Caldes MG, Graphodatsky A et al (2006) Are molecular cytogenetics and bioinformatics suggesting diverging models of ancestral mammalian genomes? Genome Res 16:306–310

104. Steel M, Penny D (2000) Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol Biol Evol 17:839–850

105. Durrett R, Nielsen R, York TL (2004) Bayesian estimation of genomic distance. Genetics 166:621–629

106. Gould SJ (1990) Wonderful life: the Burgess Shale and the nature of history. Norton, New York

107. Hillis DM, Bull JJ, White ME et al (1992) Experimental phylogenetics: generation of a known phylogeny. Science 255:589–592

108. Randall RN (2012) Experimental phylogenetics: a benchmark for ancestral sequence reconstruction. https://smartech.gatech.edu/handle/1853/48998

109. Barrick JE, Yu DS, Yoon SH et al (2009) Genome evolution and adaptation in a long-term experiment with Escherichia coli. Nature 461:1243–1247

110. Romiguier J, Ranwez V, Douzery EJP et al (2013) Genomic evidence for large, long-lived ancestors to placental mammals. Mol Biol Evol 30:5–13

111. Szollosi GJ, Boussau B, Abby SS et al (2012) Phylogenetic modeling of lateral gene transfer reconstructs the pattern and relative timing of speciations. Proc Natl Acad Sci U S A 109:17513–17518

112. Beiko RG, Charlebois RL (2007) A simulation test bed for hypotheses of genome evolution. Bioinformatics 23:825–831

113. Dalquen DA, Anisimova M, Gonnet GH et al (2012) ALF–a simulation framework for genome evolution. Mol Biol Evol 29:1115–1123

114. Biller P, Knibbe C, Beslon G, Tannier E (2016) Comparative genomics on artificial life. In: Computability in Europe, to appear. Springer, Cham

