    A practical guide to mitochondrial DNA error prevention in clinical, forensic, and population genetics

    Antonio Salas a,*, Angel Carracedo a, Vincent Macaulay b, Martin Richards c, Hans-Jürgen Bandelt d

    a Unidade de Xenética, Instituto de Medicina Legal, Facultade de Medicina, 15782 Universidade de Santiago de Compostela, Centro Nacional de Xenotipado (CeGen), Hospital Clínico Universitario, 15706 Galicia, Spain

    b Department of Statistics, University of Glasgow, Glasgow, UK

    c School of Biology, University of Leeds, Leeds, UK

    d Fachbereich Mathematik, Universität Hamburg, Hamburg, Germany

    Received 26 July 2005

    Available online 8 August 2005

    Abstract

    Several suggestions have been made for avoiding errors in mitochondrial DNA (mtDNA) sequencing and documentation. Unfortunately, the current clinical, forensic, and population genetic literature on mtDNA still delivers a large number of studies with flawed sequence data, which, in extreme cases, damage the whole message of a study. The phylogenetic approach has been shown to be useful for pinpointing most of the errors. However, many geneticists, especially in the forensic and medical fields, are not familiar with either effective search strategies or the evolutionary terminology. We here provide a manual that should help prevent errors at any stage by re-examining data fresh from the sequencer in the light of previously published data. A fictitious case study of a European mtDNA data set (albeit composed from the literature) then demonstrates the steps one has to go through in order to assess the quality of sequencing and documentation.

    © 2005 Elsevier Inc. All rights reserved.

    Keywords: Human mtDNA; HVS-I; HVS-II; Error detection; Networks; Phylogenetic analysis; Haplogroup; SWGDAM

    During the last decade, the evolution of PCR technology, in parallel with the development of new automatic sequencers and sequencing chemistries, has allowed the improvement of electropherogram reliability and the accuracy of the DNA sequence data. Large volumes of mtDNA sequence data are nowadays produced automatically and reported in the literature or in databases. However, many errors still routinely occur, most of which could have been avoided if careful checking strategies had been applied and lab conditions critically re-examined. A lack of knowledge in this regard seems to persist even in population genetics and forensics, as, for example, occasionally expressed by misrepresentation of the error issue [1,2].

    Numerous papers have investigated the error issue to date [3–9], but this effort has not yet been assimilated by the scientific community. Most of the previous publications about errors in mitochondrial data dealt with data sets from population genetics and forensics, but the medical field seems to be most strongly affected by missequencing and misdocumentation. For instance, a list of control region mutations found in colorectal cancer patients was displayed in [10]; according to the authors, position 71 was altered in three out of seven cases (deletion of one G). Taking into account that the majority of mutations found to be unstable in tumors are also common polymorphisms in human populations [11], it is at least surprising to see this highly stable site in human studies as extremely unstable in [10]. A readability problem in the G stretch (66–71) is a possible explanation for this result. In [12], the authors sought to find a G to C mutation at site 73 in the mitochondrial databank, which is an obvious misdocumentation, since position 73 shows an A in the revised Cambridge reference sequence (rCRS). In the medical field, incomplete and incorrect recording of mutations in total sequencing attempts and sample mix-up seem to be committed on a routine basis; see, e.g., the cases reanalyzed in [13,14].

    0006-291X/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
    doi:10.1016/j.bbrc.2005.07.161
    * Corresponding author. Fax: +34 981 580336. E-mail address: [email protected] (A. Salas).
    www.elsevier.com/locate/ybbrc
    Biochemical and Biophysical Research Communications 335 (2005) 891–899

    Many studies (particularly in the medical field) just target the coding region of the mtDNA genome, since there is a general expectation that only mutations in this region could be responsible (as a risk factor) for certain (complex) diseases. In many of these studies, the control region is therefore not treated with much care (and at best semi-sequenced) or not even analyzed at all. However, it is important to screen the control region, or at least the first two hypervariable segments (HVS-I and HVS-II), in order to (i) link the coding-region information to potential motifs in the HVS-I/II database and (ii) obtain feedback for quality assessment.

    As to the feature (i) of linking motifs, the worldwide database of published HVS-I sequences is quite enormous, currently comprising 40,000 sequences, albeit scattered through publicly available databases and articles. It is not prudent, however, to employ a database (on the web) that simply compiled data from tables of published articles, without comparing with the original sources. For example, the data entries in the mtDNA database HVRbase were found to be highly unreliable [15], and they still are, for several reasons (Bandelt et al., unpublished manuscript). We therefore advise against employing HVRbase to circumvent the more laborious task of compiling published data oneself, table by table. When a sufficient number (say, more than 10,000) of HVS-I sequences (plus corresponding HVS-II sequences in some cases) are ready for screening, it will in general be quite easy to connect a particular complete mtDNA sequence from a patient with a particular control-region motif in the database. One can then focus the search for potentially phylogenetically related mtDNAs on those which (nearly) match the patient's HVS-I motif or full control-region motif. As exemplified in [13], complete sequencing of such candidate samples can then help to distinguish shared polymorphisms from private mutations, so that the search for potentially pathogenic mutations becomes much more focused.

    As to the feature (ii) of quality assessment, sequencing of the control region with its hypervariable segments demands high standards of sequencing (because of homopolymeric tracts) and documentation (because of numerous variant nucleotides to be recorded correctly). The large number of HVS-I sequences available for direct comparison aids in pinpointing principal problems in sequencing and documentation. To give a motivating example, the authors of [16] aimed at analyzing the entire mtDNA genomes of three LHON patients (of Chinese origin). One of the mtDNA sequences obtained (allocated to mtDNA haplogroup F1) is reported to bear the transversion T16304G in HVS-I (relative to the rCRS). However, consultation of previous publications [17–19] reveals that the transition T16304C is shared by haplogroup F1 mtDNAs. Moreover, among all published HVS-I sequences we found only two instances of T16304G in the data sets [20,21], which seem to be problematic anyway [3,4]. The best explanation for seeing T16304G is a quite typical documentation error, viz. the confusion of the nucleotides C and G [3]. Thus, being warned by this HVS-I finding, one would then further search for coding region mutations for which C and G seem to have been interchanged by mistake; and indeed, we have observed such instances (Bandelt et al., unpublished manuscript) in the three mtDNA sequences reported by [16]. Moreover, the sequence allocated to haplogroup M10 lacks the haplogroup M marker mutation T489C in the third hypervariable segment (HVS-III), a very unusual finding. This could suggest that some coding-region mutations in that data set might have been overlooked as well, which in fact seems to be the case; compare with the data of [22] re-analyzed by [13].

    Site-specific mutational rates and error detection

    Conceptually, the method of error detection we follow is very simple: any single mtDNA sequence necessarily needs to fit into a specific part of the mtDNA phylogeny that is characterized by specific mutations. We often detect some pieces that do not quite fit the pattern, where we may expect that an artifact, and not a biological process, could explain the profile. For instance, the deletion 249d in HVS-II is almost always connected with one or other of two different HVS-I motifs, viz. either the mutational pair C16223T T16298C or the single change T16304C relative to the rCRS [23]. We then learn from a systematic phylogenetic study that the two motifs C16223T T16298C A73G 249d A263G and T16304C A73G 249d A263G constitute ancestral HVS-I and II sequences for basal branches of the mtDNA phylogeny [19] and are thus inherited as a single haplotype block rather than having arisen de novo multiple times (although it should be borne in mind that the mutational stability of either HVS-I motif is not infallible). Therefore, it would at least be somewhat unusual (but not implausible or unbelievable) to detect a profile that combines a typical West Eurasian HVS-I lineage, such as C16069T T16126C, with 249d, especially when, on the other hand, an expected mutation (here, say, C295T) was missing. It is the combination of several such unusual findings that would point to flawed results and warrant a careful re-examination of the underlying chromatograms.
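    The co-occurrence rule described above can be sketched as a small consistency check. This is only an illustration: the function name check_249d and the representation of a profile as a space-separated string of variants are assumptions made here, and the rule encodes nothing beyond the two HVS-I motifs named in the text. A warning is a prompt to recheck the traces, not proof of an error.

```python
# Hypothetical helper: flag profiles in which the deletion 249d appears
# without either of the HVS-I motifs it usually accompanies (per the text).
EXPECTED_WITH_249D = [
    {"C16223T", "T16298C"},  # mutational pair named in the text
    {"T16304C"},             # single change named in the text
]


def check_249d(profile):
    """Return a list of warnings for a space-separated variant profile."""
    variants = set(profile.split())
    if "249d" not in variants:
        return []  # the rule only applies when 249d is reported
    if any(motif <= variants for motif in EXPECTED_WITH_249D):
        return []  # 249d sits on one of its usual HVS-I backgrounds
    return ["249d without C16223T T16298C or T16304C: recheck the traces"]


if __name__ == "__main__":
    print(check_249d("C16223T T16298C A73G 249d A263G"))
    print(check_249d("C16069T T16126C A73G 249d A263G"))
```

    A single warning of this kind is weak evidence; as the text stresses, it is the accumulation of several unusual findings in one profile that should trigger re-examination.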

    Thus, to decide whether a particular mutation is unusual in a certain sequence context depends on an understanding of how frequently such a mutation would occur in the total database and in which haplogroup background. As a general rule, frequent substitutions would tend to show up in sequences belonging to many different branches of the phylogeny, referred to as haplogroups, while very stable positions would rarely occur mutated in different haplogroups. To be more precise, a haplogroup is a monophyletic clade, that is, it comprises all descendants of a single ancestral mtDNA sequence. Therefore, the descendant lineage inherits the whole block (motif) of mtDNA variants that distinguishes that ancestral sequence from the reference sequence (rCRS), potentially disguised by sporadic recurrent mutations at those motif positions. To identify the haplogroup status of a particular sequence, it is necessary that phylogeny estimation be carried out beforehand, preferably (these days) one based on a large set of complete mtDNA genomes. From such systematic studies [19,24–26], a number of ancestral sequences can be inferred, which can guide in the classification of new (total or partial) mtDNA sequences.

    Table 1 shows some available (preliminary) information on mutation spectra for both hypervariable regions. The spectrum of site-specific mutational rates is difficult to estimate since it needs (i) large test samples (of sizes >1000), (ii) reliable data, and (iii) a robust estimation of the phylogeny, in order to avoid systematic biases. None of the published lists of mutational scores/rates meet all of these requirements, and some attempts even violate most of these criteria, such as the multiply biased approach by [27]; see [4] for a brief comment. For instance, the estimated mutational scores provided by [28] rely on a parsimony analysis of HVS-I and HVS-II sequences plus some full control region sequences drawn from the SWGDAM database [29], including some as yet unpublished data. The otherwise very stable HVS-II sites 42, 256, and 318 appear with mutational counts 5, 7, and 0 in West Eurasian mtDNA data [28] and with scores 0, 0, and 5 in East Asian data [30], but not at all in the African data [31]. This suggests that either some subsets of the data were affected by carrying specific phantom mutations or that the estimated phylogeny has dispersed the members of a (minor) true clade defined by one of those mutations across the tree. There is evidence for the former interpretation from the SWGDAM database itself, at least for positions 42 and 256, which are each listed three times, viz. as 42N (twice), 42.1C, and 256N, 256del, and C256T, respectively. Note that the insertion of one C in the stretch 42–44 has thereby been misscored in THA.ASN.000038: instead of 42.1C the forensic convention [32] requires 44.1C.

    On the other hand, some hypervariable positions may still bear some useful information, e.g., when the mutation rate is high only in one direction. For example, the mutation T16189C is seen in many different Eurasian mtDNA lineages, but it usually creates a long C-stretch which appears rarely to be disrupted by subsequent mutation at exactly the same base position. This means that once this mutation has occurred it has a good chance to remain fixed, as is indeed the case with certain haplogroups, such as haplogroups B and X (with motifs T16189C and T16189C C16223T C16278T, respectively). We also know from phylogenetic analyses of the worldwide mtDNA variation that the short C-stretches interrupted by a single T at position 16189 in most Eurasian mtDNA were positioned differently in an early African ancestor, namely, a T at position 16187 in a considerable portion of African mtDNA. The latter pattern is inherited by members of the African haplogroups L0 and L1 (sensu [33]). Then, we would expect that the C to T change at 16189 in these L0 and L1 African haplogroups is not as frequent as the T to C change in the Afro-Eurasian haplogroup L3. For example, the African haplogroup L1c is characterized by mutations G16129A C16187T T16189C C16223T C16278T C16294T T16311C C16320T; it would be rare to find an L1c sequence lacking the variant T16189C.

    Table 1
    Mutation spectra for HVS-I and HVS-II sites

    HVS-I   (1)   (2)   HVS-I   (1)   (2)   HVS-II  (2)   HVS-II  (2)
    16051    21  n.a.   16261    33    25   73        6   217       5
    16092    18    20   16265    12    21   93       10   228       7
    16093   101    69   16266    12    24   146      36   234       5
    16111    17    35   16274    24    32   150      32   310       7
    16126    14    20   16278    36    34   151      22
    16129    56    54   16298    18    14   152      56
    16145    21    38   16290    12    23   153       5
    16148    12    22   16291    33    42   182      10
    16172    27    50   16292    21    21   185      10
    16189    68    62   16293    21    23   189      17
    16192    45    36   16294     9    27   194      11
    16193    18    14   16304    18    24   195      46
    16213    17    16   16311    89    70   198      16
    16214    18    17   16319    15    20   199      16
    16223    27    12   16320    25    23   200      13
    16234    14    35   16355    12    30   204      19
    16239     6    24   16362    57    54   207      17
    16256    27    34   16390  n.a.  n.a.   215       7

    Bold numbers indicate the top 9 fast sites in HVS-I (of either list) and the top 6 HVS-II sites. To facilitate comparison between HVS-I spectra, we took three times the original scores from [4] (see http://www.stats.gla.ac.uk/~vincent/fingerprint/index.html) and rounded them to integers. For HVS-I we listed all those sites from [4,41] that received a new cumulative score of greater than or equal to 30. We have added 16390 to this list as a candidate for considerable variation outside the scoring frame. For HVS-II we took all sites from [41] with scores greater than or equal to five.

    (1) Data from [4]. HVS-I sequence range: 16051–16365. See also http://www.stats.gla.ac.uk/~vincent/fingerprint/index.html for details on these mutation spectra.

    (2) Scores taken from Figs. 1 and 2 of [41]. HVS-I sequence range: 16092–16365; HVS-II sequence range: 72–297.

    In regard to hypervariability, the five positions 16093, 16129, 16189, 16311, and 16362 rank highest in HVS-I and one can expect to see all possible combinations of transitions at these positions, albeit with drastic differences in frequency, because some combinations occur in basal motifs of continental haplogroups. For instance, when we observe all five transitions together, we can be fairly confident that they occur in some lineage of African ancestry, such as L0a, for example, because the combination G16129A T16189C T16311C is part of an inherited motif in basal sub-Saharan African haplogroups and thus does not have to be generated de novo. But this, of course, does not justify the rather mystical belief that mutational hotspots should or could be explained by ancient mutations plus occasional recombination [34].

    Error classification

    We follow [3] in distinguishing five different categories of sequencing errors (Fig. 1):

    Type I (base shift): a single position or several positions are misscored by an alignment or reading shift or by a column shift during the preparation of a table.

    Type II (reference bias): the oversight of nucleotide variants relative to the reference sequence.

    Type III (phantom mutations): uncommon variants that appear simultaneously in different lineages of the data set. These mutations are easily generated during the sequencing process, facilitated by incorrect use of software for reading and interpreting electropherograms. In case of old samples (with degraded DNA), one should also take the possibility of post-mortem damage into consideration [35].

    Type IV (base misscoring): miswriting the nucleotide letter in a dot table or mistyping a transition as a transversion or vice versa in a motif table.

    Type V (artifactual recombination): mosaic origin of a compound haplotype comprising different fragments from more than one sample. The main cause underlying this kind of error is that the majority of labs analyze two (or three or more) segments of the control region with no (or only minimal) overlap in separate PCRs; therefore, a sample mix-up can easily provoke these artifacts (e.g., an unfortunate mistake when loading the sequence product in the sequencer, mispipetting of the sample during PCR set-up, mistaking the well when setting up gel electrophoresis or pipetting samples into a plate for a multi-capillary sequencer, contamination, etc.). Separate amplifications also become necessary when the sequence under study is interrupted by a homopolymeric tract, such that individual forward and reverse reactions are necessary to complete HVS-I or HVS-II segments. Finally, at the documentation stage, transferring data into a database or table (thereby shifting rows in part, etc.) is a common source of this kind of error.

    Prophylaxis

    How can sequencing errors be avoided from the outset? The, admittedly, not very optimistic answer is that total security is impossible to achieve, although the rather pessimistic perspective of [36] about the persistence of errors in mtDNA sequencing results and databases is perhaps a bit too fatalistic. Documentation errors, in particular, could practically be avoided in total (contra [37]) with a proper protocol for electropherogram reading and interpretation plus visual inspections by at least two people. We recommend several strategies for error prevention that will help in reducing errors by one or two orders of magnitude. The basic idea is to use triple checking at all stages of the sequencing and editing processes.

    Sample handling. The encoding of every sample should combine a machine readable code for scanning alongside a code readable by eye. Automatic sample handling plus visual control would help to avoid sample mix-up, which is a frequent cause of artificial recombinants.

    Fig. 1. The five types of errors in a nutshell: a tree showing a section of the mtDNA phylogeny. A fragment of the western European mtDNA phylogeny is shown at the top, in which existing sequence types coexist with reconstructed ancestral sequences that (as yet) have not been observed. Here, an error of Type I is represented by a 10 base shift at position 285 (#1); the transitions at 73 and 285 are omitted in #2 (Type II); position 16085 suffers from a phantom mutation in #3–5 (Type III); profile #6 shows a base misscoring at site 489 (Type IV); and finally, #7 constitutes a recombinant sequence U1b J (Type V).

    Contamination. The guidelines outlined by [32] may serve as a good starting point for avoiding contamination in the lab.

    Overlapping sequencing (e.g. [38]) allows checking for variants that must appear in two different sequencing segments of the same sample, which would make sample mix-up less likely to remain undetected.
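    A minimal sketch of such an overlap check follows. It assumes two reads of the same sample, each given as a list of variants plus a sequencing range; the function names, the variant notation handled by position(), and the ranges in the example are illustrative choices, not a lab protocol. Ranges are treated as simple linear intervals, so a segment that wraps around position 16569/1 would need extra handling.

```python
import re


def position(variant):
    """Leading rCRS position of a token such as 'T16189C', '249d' or '309+CC'."""
    m = re.search(r"\d+", variant)
    if m is None:
        raise ValueError(f"no position found in {variant!r}")
    return int(m.group())


def overlap_conflicts(read_a, range_a, read_b, range_b):
    """Variants inside the shared window that only one of the two reads reports.

    Returns None when the two ranges do not overlap at all, because then
    no consistency check is possible.
    """
    lo = max(range_a[0], range_b[0])
    hi = min(range_a[1], range_b[1])
    if lo > hi:
        return None
    in_a = {v for v in read_a if lo <= position(v) <= hi}
    in_b = {v for v in read_b if lo <= position(v) <= hi}
    return sorted(in_a ^ in_b, key=position)


if __name__ == "__main__":
    # Hypothetical ranges and variants, chosen only to exercise the check.
    seg1 = (["C16223T", "T16311C"], (16024, 16365))
    seg2 = (["T16362C", "T16519C"], (16300, 16569))
    print(overlap_conflicts(seg1[0], seg1[1], seg2[0], seg2[1]))
```

    An empty list means the two reads agree inside the overlap; any listed variant is a reason to suspect a mix-up or a reading problem in one of the segments.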

    Optimizing chemistry. Background noise is suspicious per se. Change (or modification) of primers or sequencing chemistry and resequencing is necessary in case of high levels of background. In exceptional cases of doubt, cloning experiments could help to identify sources of contamination, although we recognize that cloning is not a very practical technique in standard forensic or medical labs.

    Automatic reading, by employing up-to-date software, should always be accompanied by thorough visual inspection of sequence traces.

    Data presentation and nomenclature. Where automatic generation of data tables is not possible, it would be best to generate two kinds of tables from the raw data independently by two different persons in order to minimize the high risk of documentation errors. Namely, a motif table using the extended medical style (but otherwise following the forensic conventions) and a dot table each emphasize different aspects of the same data in a complementary way. We also recommend highlighting transversions in the motif table style and even in the dot table in order to facilitate a visual inspection of incorrect base assignment or an excess of transversions in the data set. The nomenclature we advocate here follows a style more common in medical studies and is not the conventional one in forensics or population genetics. We suggest that this nomenclature system could be followed at least in the lab routine since it is more convenient for inspection of potential editing errors. We do not follow the forensic recommendation for denoting insertions and deletions since it may be more prone to errors, especially when reporting long insertions and deletions; e.g., 523.1C 524.2A 525.3C 526.4A is represented here in a more compact style, namely 523+CACA. Note that this could alternatively be scored as 524+ACAC when the repetitive pattern is interpreted as an AC short tandem repeat. We notice that even a common polymorphism such as the insertion 309+CC in the homopolymeric HVS-II tract is a very common source of confusion among forensic labs in terms of nomenclature (see for instance [39]).
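    The conversion from per-base insertion tokens to the compact style can be sketched as follows. The function name compact_insertions and the grouping rule are assumptions made for illustration: consecutive tokens whose insertion index increases by one are merged, and the compact token is anchored at the position of the first token in the run. That rule reproduces the example in the text (523+CACA) and also handles tokens written as 309.1C 309.2C.

```python
import re

# Token shape assumed here: <position>.<index><base>, e.g. "309.1C".
INSERTION = re.compile(r"^(\d+)\.(\d+)([ACGT])$")


def compact_insertions(tokens):
    """Merge runs of per-base insertion tokens into 'pos+BASES' tokens."""
    out = []
    run = None  # [anchor position, last index seen, inserted bases so far]

    def flush():
        nonlocal run
        if run is not None:
            out.append(f"{run[0]}+{run[2]}")
            run = None

    for tok in tokens:
        m = INSERTION.match(tok)
        if m is None:
            flush()          # a non-insertion token ends any open run
            out.append(tok)
            continue
        pos, idx, base = int(m.group(1)), int(m.group(2)), m.group(3)
        if run is not None and idx == run[1] + 1:
            run[1] = idx     # same insertion, one base longer
            run[2] += base
        else:
            flush()
            run = [pos, idx, base]
    flush()
    return out


if __name__ == "__main__":
    print(compact_insertions("523.1C 524.2A 525.3C 526.4A".split()))
    print(compact_insertions("A263G 309.1C 309.2C 315.1C".split()))
```

    The sketch does not decide between alternative placements such as 523+CACA and 524+ACAC; that choice depends on the alignment convention a lab adopts.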

    We are aware that some of the above recommendations are not new to many labs. However, it is obvious that these basic rules are not systematically followed by all of them, including some important institutions/labs. For instance, the high incidence of ambiguities located immediately after (uni-directionally) homopolymeric tracts in the SWGDAM database indicates that reverse sequencing was probably not carried out systematically (see supplementary data, Item 3, for more details). It could be argued that the forward part could not be read well but the backward part could be read normally, so that an N was placed whenever not both strands were 100% readable or unambiguous. First, the reason for the unbalanced distribution of N is nowhere explained by the SWGDAM database curators; however, the users of this database should be alerted about this important deficiency (which is completely cryptic when searching the database using the tool made available by the SWGDAM [29]). Second, this explanation would violate the ISFG recommendations. Third, there is no need for such an inadequate practice in a population study, especially when this database is commonly used for frequency estimations to underpin decisions in court. Whatever the real reasons for the unusual accumulation of ambiguities in the SWGDAM database [40] are, a high incidence of ambiguities is not a sign of good quality of the mtDNA sequence profiles.

    Error detection

    A quick self-help guide to detect mistakes was provided by [3]. Here we reproduce and extend these guidelines:

    We should be alerted by an over-representation of transversions in the data set. As throughout the genome, transitions in mtDNA are much more frequent than transversions, especially in HVS-II [9,41]. Since this is a characteristic of the evolutionary process of the molecule (and not the population under study), a similar ratio between the frequencies of transitions and transversions should characterize any population sample. Note that the transition–transversion ratio a/b (usually >30) refers to the probabilities a of a transition and b of each of the two transversions at a sequence position in an underlying model of sequence evolution. In forensics, this ratio is routinely confused with a sample-dependent ratio for the corresponding polymorphism counts. Similarly, most insertions and deletions (indels, for short), except for hotspots around C-stretches or at exceptional positions outside the control region (for example, the prominent 9 base-pair deletion defining mtDNA haplogroup B), are rather rare events.
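    As a rough screening aid, substitutions in a set of profiles can be tallied by type. The sketch below assumes variants written as reference base, position, variant base (e.g. T16304C); the names classify and count_types are illustrative. Note that it produces sample counts of polymorphisms, which, as stated above, are not the same thing as the model ratio a/b.

```python
import re
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def classify(variant):
    """Return 'transition', 'transversion', or 'other' (indels, odd tokens)."""
    m = SUBSTITUTION.match(variant)
    if m is None or m.group(1) == m.group(3):
        return "other"
    pair = {m.group(1), m.group(3)}
    # A transition stays within purines or within pyrimidines.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_types(profiles):
    """Tally variant types over an iterable of space-separated profiles."""
    counts = Counter()
    for profile in profiles:
        for variant in profile.split():
            counts[classify(variant)] += 1
    return counts


if __name__ == "__main__":
    sample = ["C16223T T16298C A73G 249d A263G", "T16304G"]
    print(count_types(sample))
```

    An unexpectedly large share of transversions in such a tally is a cue to look for C/G or A/T confusions in the tables, not a verdict on the data.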

    Errors are easily detectable when they affect the basal motif shared by many other sequences in the database. If only part of the basal motif of a specific haplogroup is present, it is highly desirable to re-analyze the raw data. It is also advisable to examine databases for matching and neighboring haplotypes. For instance, the SWGDAM database [29], although still of somewhat questionable quality in comparison with other published control region sequences [6,40], can serve as a vehicle to search for specific patterns (see below). Data collected from the literature, although tedious to compile, also constitute a good choice. The popular Mitomap database (http://www.mitomap.org/) can also be of help, although it only provides information on single mutational variants but not on haplotypes. We must, however, be aware of the limitations of the databases (i.e., with regard to representativeness, size, etc.). In addition, attention should also be paid to private variants (at the tips of the estimated phylogenies), that is, those not shared by any sister haplotype. The best general advice is, however, to recheck electropherograms (obtained through sequencing either in both directions or by two independent amplifications with different primers) and the final reports as well. These tasks should best be done by more than one person independently.
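    The "only part of the basal motif is present" test can be sketched in a few lines. The example uses the three HVS-II mutations T199C T204C T250C, which the text below cites as characteristic of haplogroup I, together with sequence no. 121 of [44] as quoted there; the function names are illustrative assumptions.

```python
def motif_coverage(profile, motif):
    """Split a motif into the variants present in and missing from a profile."""
    variants = set(profile.split())
    present = sorted(motif & variants)
    missing = sorted(motif - variants)
    return present, missing


def needs_recheck(profile, motif):
    """True when the motif is only partly present, the case the text flags."""
    present, missing = motif_coverage(profile, motif)
    return bool(present) and bool(missing)


if __name__ == "__main__":
    hvs2_motif_I = {"T199C", "T204C", "T250C"}
    no_121 = ("G16129A C16148T C16223T G16391A A73G "
              "T204C T250C A263G 309+C 315+C")
    print(motif_coverage(no_121, hvs2_motif_I))
    print(needs_recheck(no_121, hvs2_motif_I))
```

    A partial motif may reflect a genuine back mutation as well as an oversight, so the outcome is a request to look at the raw data again rather than a correction.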

    Inconsistent numbering can lead to false mismatches in database searches. A deletion or insertion in a homopolymeric tract can yield a common mistake in databases: e.g., a deletion at position 16380 (at the 3′ end of a C-stretch) might be named as 16375delC or briefly as 16375del or 16375d (at the 5′ end). Recommendations for nomenclature have been proposed by [32,42]; nonetheless, many complicated instances are not covered by these recommendations (especially for homopolymeric tracts commonly involving length polymorphisms).
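    One way to avoid such false mismatches is to normalize every deletion in a homopolymeric tract to the same end of the run before searching. The sketch below shifts a deletion to the 3′ end; the function name and the offset parameter are assumptions, and the reference string is a toy fragment whose flanking bases are invented (it is not the rCRS). Only the run boundaries 16375 and 16380 follow the example in the text.

```python
def shift_deletion_3prime(reference, pos, offset=1):
    """Move a single-base deletion to the 3' end of its homopolymer run.

    reference: the reference fragment as a string
    pos:       position of the deleted base in the numbering used
    offset:    position number of reference[0]
    """
    i = pos - offset
    if not 0 <= i < len(reference):
        raise IndexError(f"position {pos} lies outside the fragment")
    base = reference[i]
    # Walk right while the next base is identical to the deleted one.
    while i + 1 < len(reference) and reference[i + 1] == base:
        i += 1
    return i + offset


if __name__ == "__main__":
    toy = "T" + "C" * 6 + "A"   # toy fragment: positions 16374..16381
    print(shift_deletion_3prime(toy, 16375, offset=16374))
```

    With both 16375d and 16380d mapped to the same position, a database search no longer treats the two spellings as different haplotypes.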

    Background noise in an electropherogram can introduce doubts about the existence of point heteroplasmy or even a new variant. Multiple heteroplasmies are not common and would often point to sequencing problems [43]. Therefore, if we detect more than two in a single sequence fragment (such as in, e.g., USA.CAU.001484 from the SWGDAM database) we should take phantom mutations or contamination into consideration as a cause for these variants. Sequencing in only one direction could also explain many of these undetermined variants (see below). When necessary (e.g., when reamplification and resequencing of the region in question continue to indicate contamination), new DNA extraction, amplification, and sequencing of such a sample should be carried out.
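    The "more than two" rule lends itself to an automatic flag. The sketch assumes that undetermined or heteroplasmic positions are written with an upper-case IUPAC ambiguity code after the position (e.g. 42N or T16189Y) and that deletions use lower-case d or del, so that the IUPAC letter D is not mistaken for a deletion; both conventions and the function names are assumptions made here.

```python
import re

# IUPAC codes that stand for more than one base.
AMBIGUITY_CODES = set("RYKMSWBDHVN")
TOKEN = re.compile(r"^[ACGT]?(\d+)([A-Z])$")


def ambiguous_positions(profile):
    """Positions in a space-separated profile scored with an ambiguity code."""
    hits = []
    for tok in profile.split():
        m = TOKEN.match(tok)
        if m and m.group(2) in AMBIGUITY_CODES:
            hits.append(int(m.group(1)))
    return hits


def too_many_ambiguities(profile, limit=2):
    """True when a fragment carries more ambiguous positions than the limit."""
    return len(ambiguous_positions(profile)) > limit


if __name__ == "__main__":
    # Hypothetical profile, used only to exercise the rule.
    print(too_many_ambiguities("T16093Y 16189N A16293R A73G"))
```

    The limit of two follows the text; a flagged fragment is a candidate for resequencing in both directions.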

    Strategies for screening a database

    For the reader of an article with new mtDNA data, electropherograms are normally not available, so that he/she has to make do with the data as published in a database, on a journal website or in the corresponding article. A first step is to check in which context each of the rare mutations found in the new data was already reported in the literature. The next step is to search for combinations of mutations.

    In particular, seeking (near-)matches for one segment could to some extent predict a motif for the other segment or vice versa. An example of such a search strategy follows. We observe that the HVS-I part of the sequence G16129A C16148T C16223T G16391A A73G C150T G185A A263G 315+C (no. 120 of [44]) points to haplogroup I, as indicated by the variant at position 16391. Indeed, in the SWGDAM database we find a perfect match within the restricted sequencing range 16024–16365, viz. the sequence G16129A C16148T C16223T A73G T199C T204C T250C A263G 309+C 315+C (USA.CAU.001343), which bears the three characteristic HVS-II mutations T199C T204C T250C expected for haplogroup I lineages. In [44], we find a companion sequence G16129A C16148T C16223T G16391A A73G T204C T250C A263G 309+C 315+C (no. 121), which bears at least two of those required HVS-II mutations. On the other hand, the HVS-II part of no. 120 would best fit either with the HVS-I motif T16189C C16270T of a haplogroup U5b lineage (viz. compare with USA.CAU.000083 or, e.g., sequence no. 734 from [45]) or with a particular lineage from haplogroup U6a1 (compare with SKE.AFR.000113 from the SWGDAM database). Thus, the HVS-II part of sequence no. 120 in [44] would lead us to the HVS-I motifs T16189C C16270T (U5b1) or T16172C T16189C A16219G C16278T (U6a1). Since the latter would not be expected in northeast Germany (U6 is indigenous to North Africa with some minor diffusion into the Iberian Peninsula), U5b1 would be the best guess (and in any case haplogroup I status would clearly be rejected). A U5b1 type is, in fact, what the corrigendum of [44] eventually shows (now ID no. 205). We are led to conclude that the previous sequence no. 120 was therefore the result of a sample mix-up involving an HVS-I part from haplogroup I and an HVS-II part from haplogroup U5; the mixed status is then denoted here by I U5.
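    The segment-wise search just described can be sketched as follows. The two database entries are the profiles quoted in the text; the dictionary layout, the function names, and the use of 72–297 as the HVS-II comparison window (taken from the note to Table 1) are assumptions made for illustration.

```python
import re


def position(variant):
    """Leading rCRS position of a token such as 'T16189C' or '309+C'."""
    return int(re.search(r"\d+", variant).group())


def segment(profile, lo, hi):
    """Variants of a space-separated profile that fall within lo..hi."""
    return frozenset(v for v in profile.split() if lo <= position(v) <= hi)


def matches(query, database, lo, hi):
    """Names of database entries identical to the query within lo..hi."""
    q = segment(query, lo, hi)
    return [name for name, prof in database.items()
            if segment(prof, lo, hi) == q]


NO_120 = "G16129A C16148T C16223T G16391A A73G C150T G185A A263G 315+C"
DATABASE = {
    "USA.CAU.001343": ("G16129A C16148T C16223T A73G T199C T204C T250C "
                       "A263G 309+C 315+C"),
    "no. 121 of [44]": ("G16129A C16148T C16223T G16391A A73G T204C T250C "
                        "A263G 309+C 315+C"),
}

if __name__ == "__main__":
    # Step 1: match the HVS-I part within the restricted range.
    hits = matches(NO_120, DATABASE, 16024, 16365)
    print(hits)
    # Step 2: compare the HVS-II part predicted by a hit with the query.
    predicted = segment(DATABASE[hits[0]], 72, 297)
    observed = segment(NO_120, 72, 297)
    print(sorted(predicted ^ observed, key=position))
```

    When the HVS-I part matches but the HVS-II parts differ at several positions at once, as they do here, a sample mix-up between segments becomes a plausible explanation.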

    For a motif table one would normally list mutationsin their numerical order. Sometimes one may wish todeviate from this rule by separating mutations intoblocks according to their position in the mtDNA phy-logeny; but then at least within each block the numericalorder would serve as the default ordering. Any deviationfrom this convention is potentially suspicious as it maysuggest a clerical error. For instance, sequence no. 54in[44]perfectly serves to illustrate a strategy for detect-ing and repairing the cause of an error. In no. 54, muta-tion C16291T precedes C16260T. Since HVS-II containsthe rare combination T204C T239C, we first check theSWGDAM database for this mutational pair withinthe sequence range, say, 73240. Then four matchinginstances are found, namely:

    C16193T A16219G T16362C T204C T239C A263G309+CC 315+C (USA.CAU.000648, USA.CAU.000942),

    C16193T A16219G T16362C T204C T239C A263G309+C 315+C (USA.CAU.001058),

    C16193T A16219G A16230G T16362C T152C T204CT239C A263G 309+C 315+C (USA.CAU.001346).

    896 A. Salas et al. / Biochemical and Biophysical Research Communications 335 (2005) 891899

    http://www.mitomap.org/http://www.mitomap.org/
  • 8/14/2019 Salas BBRC 2007.pdf

    7/9

    This clearly suggests that what was recorded as 16291 in no. 54 is 16219 in reality, so that C16291T should be replaced by A16219G. This is indeed confirmed in the corrected data presented in the corrigendum of [44], now with ID code no. 50.
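    The two checks used in this example, spotting a mutation listed out of numerical order and testing whether a transposition of neighbouring digits yields a position seen in matching database entries, can be sketched as follows. This is a minimal illustration with hypothetical function names, not software used for the analysis; only the two mutations of no. 54 quoted above and the HVS-I part shared by the database hits are used as input.

```python
import re


def position(mutation):
    """Extract the numeric position from labels such as 'C16291T' or '309+C'."""
    match = re.search(r"\d+", mutation)
    if match is None:
        raise ValueError(f"no position found in {mutation!r}")
    return int(match.group())


def out_of_order(mutations):
    """Return adjacent pairs that are listed in descending numerical order."""
    return [(a, b) for a, b in zip(mutations, mutations[1:])
            if position(a) > position(b)]


def adjacent_digit_swaps(pos):
    """All positions obtained by swapping two neighbouring digits of pos."""
    digits = str(pos)
    swaps = set()
    for i in range(len(digits) - 1):
        swapped = digits[:i] + digits[i + 1] + digits[i] + digits[i + 2:]
        if swapped != digits:
            swaps.add(int(swapped))
    return swaps


def transposition_candidates(suspect, reference_mutations):
    """Reference mutations whose position is a digit transposition of the
    suspect's position."""
    swaps = adjacent_digit_swaps(position(suspect))
    return [m for m in reference_mutations if position(m) in swaps]


# Fragment of no. 54 as quoted in the text (the full motif is not repeated here).
NO_54_FRAGMENT = ["C16291T", "C16260T"]
# HVS-I part shared by the matching SWGDAM entries listed above.
DATABASE_HVS1 = ["C16193T", "A16219G", "T16362C"]

if __name__ == "__main__":
    print(out_of_order(NO_54_FRAGMENT))
    # [('C16291T', 'C16260T')]
    print(transposition_candidates("C16291T", DATABASE_HVS1))
    # ['A16219G']
```

    Such a check only proposes candidates; whether a candidate is accepted still depends on the phylogenetic context and, where available, on a corrigendum or the raw data.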

    A fictitious European mtDNA data set

    Imagine that we are faced with the preliminary results of some fresh sequencing efforts in the lab, where not all sequences have yet been determined at the targeted full length. These new data come in the form of two tables (Tables S1 and S2 of the supplementary data) as recommended above. Since our imagination cannot compete with reality, we have compiled a fictitious European population sample by selecting real sequences from the literature, which to the best of our knowledge have not yet been revised. We illustrate our points along with an item-by-item guide through this data set.

    In Tables S1 and S2, we have highlighted those mutations, positions, and nucleotides that we suspect are the results of missequencing or misdocumentation. For easier reference, we have also inferred the haplogroup status for each sequence in Table S2. In the case of recombinant sequences, two haplogroups are involved, one for the HVS-I part and the other for the HVS-II/III parts.

    For space reasons, the detailed analysis of Tables S1 and S2 has been moved to the supplementary data. The reader who has gone through this exercise will be prepared to detect further inaccuracies in those data sets from which we have extracted those sequences. Real data tables are, however, less extreme in that there (usually!) exist many entries that happen to be perfectly correct and unsuspicious.

    Final remarks

    As seen in the present report, significant numbers of errors frequently emerge in the literature generated in the fields where human mtDNA is being analyzed. The consequences of errors can be significant. For instance, mtDNA research is commonly used to provide evidence relevant to verdicts made in courts, such that any single sequence mistake can be decisive, e.g., leading to the false exclusion of a suspect. In the medical field, the most worrying cases come from studies of mtDNA instability in tumorigenesis: we have recently estimated that more than 80% of the whole subfield has been severely affected by flawed sequencing results [46].

    The methodology for error prevention consists of three main parts: (i) prophylaxis, (ii) systematic routines for error detection, and (iii) strategies for screening mtDNA databases. Here, we have demonstrated the power of a phylogenetic approach to error detection. This method has allowed us to discover a large number of still unreported sequencing mistakes. We have focused our attention on the most common regions analyzed in the literature, namely, the three hypervariable segments (HVS-I/II/III). Nevertheless, since the analysis of complete mtDNA genomes or partial coding-region information is a growing field of research, we are afraid that significant errors will also emerge in large-scale analysis (e.g., mosaic genomes artificially generated due to the large number of fragments that are involved in the sequence analysis of whole genomes). Paradoxically, the refining of the mtDNA phylogeny mainly favored by these (cleaned) large-scale studies will facilitate the detection of more errors in mtDNA sequence data.

    The phylogenetic approach should be the starting point for a systematic reanalysis of mtDNA data sets, bearing in mind that this tool will detect, on average, around 50% of the errors (authors' unpublished data). Numerous errors will remain that are difficult or impossible to detect by phylogenetic analysis and database comparisons, so that one is able to see perhaps fewer than 10% of the errors in some parts of the database. These estimates depend on both the data provided in the data set under study (HVS-I alone or combined with the HVS-II segment or RFLP/coding-region data, etc.) and the part of the phylogeny which it represents. Thus, for example, some West Eurasian haplogroups, such as haplogroup H, have hardly any diagnostic sites in HVS-I and HVS-II that would facilitate detection of artificial HVS-I HVS-II recombinants using a phylogenetic approach.

    Therefore, a rigorous protocol covering absolutely all the steps that lead to the final report, such as the one we propose here, is to be recommended for all labs. The exhaustive corrigendum of [44] is an encouraging example. In contrast, the first reanalysis of the SWGDAM database is, in this sense, incomplete [2], and so is the second [37]. In fact, this partial corrigendum should have warned those responsible for the database that more clerical errors of the same kind likely exist ([6,40]; see also supplementary data for some new examples).

    Acknowledgments

    This work was supported by grants from the Ministerio de Sanidad y Consumo (Fondo de Investigación Sanitaria; Instituto de Salud Carlos III, PI030893; SCO/3425/2002) and Genoma España (CeGen; Centro Nacional de Genotipado).


    Appendix A. Supplementary data

    Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.bbrc.2005.07.161.

    References

    [1] M.P. Cox, Genetic patterning at Austronesian contact zones. PhD thesis 2003, University of Otago, Dunedin, New Zealand.

    [2] B. Budowle, D. Polanskey, M.W. Allard, R. Chakraborty, Addressing the use of phylogenetics for identification of sequences in error in the SWGDAM mitochondrial DNA database, J. Forensic Sci. 49 (2004) 1–6.

    [3] H.-J. Bandelt, P. Lahermo, M. Richards, V. Macaulay, Detecting errors in mtDNA data by phylogenetic analysis, Int. J. Legal Med. 115 (2001) 64–69.

    [4] H.-J. Bandelt, L. Quintana-Murci, A. Salas, V. Macaulay, The fingerprint of phantom mutations in mitochondrial DNA data, Am. J. Hum. Genet. 71 (2002) 1150–1160.

    [5] H.-J. Bandelt, W. Parson, Fehlerquellen mitochondrialer DNS-Datensätze und Evaluation der mtDNS-Datenbank D-Loop-BASE, Rechtsmedizin 14 (2004) 251–257.

    [6] H.-J. Bandelt, A. Salas, S. Lutz-Bonengel, Artificial recombination in forensic mtDNA population databases, Int. J. Legal Med. 118 (2004) 267–273.

    [7] A. Röhl, B. Brinkmann, L. Forster, P. Forster, An annotated mtDNA database, Int. J. Legal Med. 115 (2001) 29–39.

    [8] P. Forster, To err is human, Ann. Hum. Genet. 67 (2003) 2–4.

    [9] Y.-G. Yao, C.M. Bravi, H.-J. Bandelt, A call for mtDNA data quality control in forensic science, Forensic Sci. Int. 141 (2004) 1–6.

    [10] K. Hibi, H. Nakayama, T. Yamazaki, T. Takase, M. Taguchi, Y. Kasai, K. Ito, S. Akiyama, A. Nakao, Detection of mitochondrial DNA alterations in primary tumors and corresponding serum of colorectal cancer patients, Int. J. Cancer 94 (2001) 429–431.

    [11] A. Vega, A. Salas, E. Gamborino, M.J. Sobrido, V. Macaulay, A. Carracedo, mtDNA mutations in tumors of the central nervous system reflect the neutral evolution of mtDNA in populations, Oncogene 23 (2004) 1314–1320.

    [12] K. Hibi, H. Nakayama, T. Yamazaki, T. Takase, M. Taguchi, Y. Kasai, K. Ito, S. Akiyama, A. Nakao, Mitochondrial DNA alteration in esophageal cancer, Int. J. Cancer 92 (2001) 319–321.

    [13] H.-J. Bandelt, A. Achilli, Q.-P. Kong, A. Salas, S. Lutz-Bonengel, C. Sun, Y.-P. Zhang, A. Torroni, Y.-G. Yao, Low penetrance of phylogenetic knowledge in mitochondrial disease studies, Biochem. Biophys. Res. Commun. 333 (2005) 122–130.

    [14] H.-J. Bandelt, Q.-P. Kong, W. Parson, A. Salas, More evidence for non-maternal inheritance of mitochondrial DNA?, J. Med. Genet. (in press).

    [15] E. Árnason, Genetic heterogeneity of Icelanders, Ann. Hum. Genet. 67 (2003) 5–16.

    [16] Y. Qian, X. Zhou, Y. Hu, Y. Tong, R. Li, F. Lu, H. Yang, J.Q. Mo, J. Qu, M.-X. Guan, Clinical evaluation and mitochondrial DNA sequence analysis in three Chinese families with Leber's hereditary optic neuropathy, Biochem. Biophys. Res. Commun. 332 (2005) 614–621.

    [17] T. Kivisild, H.V. Tolk, J. Parik, Y. Wang, S.S. Papiha, H.-J. Bandelt, R. Villems, The emerging limbs and twigs of the East Asian mtDNA tree, Mol. Biol. Evol. 19 (2002) 1737–1751, erratum in: Mol. Biol. Evol. 20 (2002) 162.

    [18] Y.-G. Yao, Q.-P. Kong, H.-J. Bandelt, T. Kivisild, Y.-P. Zhang, Phylogeographic differentiation of mitochondrial DNA in Han Chinese, Am. J. Hum. Genet. 70 (2002) 635–651.

    [19] Q.-P. Kong, Y.-G. Yao, C. Sun, H.-J. Bandelt, C.-L. Zhu, Y.-P. Zhang, Phylogeny of east Asian mitochondrial DNA lineages inferred from complete sequences, Am. J. Hum. Genet. 73 (2003) 671–676, erratum in: Am. J. Hum. Genet. 75:157.

    [20] A. Sajantila, P. Lahermo, T. Anttinen, M. Lukka, P. Sistonen, M.L. Savontaus, P. Aula, L. Beckman, L. Tranebjaerg, T. Gedde-Dahl, L. Issel-Tarver, A. Di Rienzo, S. Pääbo, Genes and languages in Europe: an analysis of mitochondrial lineages, Genome Res. 5 (1995) 42–52.

    [21] C. Vernesi, G. Di Benedetto, D. Caramelli, E. Secchieri, E. Katti, P. Malaspina, A. Novelletto, A. Terribile Wiel Marin, G. Barbujani, Genetic characterization of the body attributed to the evangelist Luke, Proc. Natl. Acad. Sci. USA 98 (2001) 13460–13463.

    [22] W.-Y. Young, L. Zhao, Y. Qian, Q. Wang, N. Li, J.H. Greinwald Jr., M.-X. Guan, Extremely low penetrance of hearing loss in four Chinese families with the mitochondrial 12S rRNA A1555G mutation, Biochem. Biophys. Res. Commun. 328 (2005) 1244–1251.

    [23] R.M. Andrews, I. Kubacka, P.F. Chinnery, R.N. Lightowlers, D.M. Turnbull, N. Howell, Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA, Nat. Genet. 23 (1999) 147.

    [24] A. Achilli, C. Rengo, C. Magri, V. Battaglia, A. Olivieri, R. Scozzari, F. Cruciani, M. Zeviani, E. Briem, V. Carelli, P. Moral, J.M. Dugoujon, U. Roostalu, E.L. Loogväli, T. Kivisild, H.-J. Bandelt, M. Richards, R. Villems, A.S. Santachiara-Benerecetti, O. Semino, A. Torroni, The molecular dissection of mtDNA haplogroup H confirms that the Franco-Cantabrian glacial refuge was a major source for the European gene pool, Am. J. Hum. Genet. 75 (2004) 910–918.

    [25] A. Achilli, C. Rengo, V. Battaglia, M. Pala, A. Olivieri, S. Fornarino, C. Magri, R. Scozzari, N. Babudri, A.S. Santachiara-Benerecetti, H.-J. Bandelt, O. Semino, A. Torroni, Saami and Berbers – an unexpected mitochondrial DNA link, Am. J. Hum. Genet. 76 (2005) 883–886.

    [26] M.g. Palanichamy, C. Sun, S. Agrawal, H.-J. Bandelt, Q.-P. Kong, F. Khan, C.-Y. Wang, T.K. Chaudhuri, V. Palla, Y.-P. Zhang, Phylogeny of mitochondrial DNA macrohaplogroup N in India, based on complete sequencing: implications for the peopling of South Asia, Am. J. Hum. Genet. 75 (2004) 966–978.

    [27] S. Meyer, G. Weiss, A. von Haeseler, Pattern of nucleotide substitution and rate heterogeneity in the hypervariable regions I and II of human mtDNA, Genetics 152 (1999) 1103–1110.

    [28] M.W. Allard, K. Miller, M. Wilson, K. Monson, B. Budowle, Characterization of the Caucasian haplogroups present in the SWGDAM forensic mtDNA dataset for 1771 human control region sequences. Scientific Working Group on DNA Analysis Methods, J. Forensic Sci. 47 (2002) 1215–1223.

    [29] K.L. Monson, K.W.P. Miller, M.R. Wilson, J.A. DiZinno, B. Budowle, The mtDNA population database: an integrated software and database resource for forensic comparison, Forensic Sci. Commun. 4 (2002) 2.

    [30] M.W. Allard, M.R. Wilson, K.L. Monson, B. Budowle, Control region sequences for East Asian individuals in the Scientific Working Group on DNA Analysis Methods forensic mtDNA data set, Legal Med. 6 (2004) 11–24.

    [31] M.W. Allard, D. Polanskey, K. Miller, M.R. Wilson, K.L. Monson, B. Budowle, Characterization of human control region sequences of the African American SWGDAM forensic mtDNA data set, Forensic Sci. Int. 148 (2005) 169–179.

    [32] A. Carracedo, W. Bär, P. Lincoln, W. Mayr, N. Morling, B. Olaisen, P. Schneider, B. Budowle, B. Brinkmann, P. Gill, M. Holland, G. Tully, M. Wilson, DNA Commission of the International Society for Forensic Genetics: guidelines for mitochondrial DNA typing, Forensic Sci. Int. 110 (2000) 79–85.


    [33] D. Mishmar, E. Ruiz-Pesini, P. Golik, V. Macaulay, A.G. Clark, S. Hosseini, M. Brandon, K. Easley, E. Chen, M.D. Brown, R.I. Sukernik, A. Olckers, D.C. Wallace, Natural selection shaped regional mtDNA variation in humans, Proc. Natl. Acad. Sci. USA 100 (2003) 171–176.

    [34] E. Hagelberg, Recombination or mutation rate heterogeneity? Implications for mitochondrial eve, Trends Genet. 19 (2003) 84–90.

    [35] M.T.P. Gilbert, E. Willerslev, A.J. Hansen, I. Barnes, L. Rudbeck, N. Lynnerup, A. Cooper, Distribution patterns of postmortem damage in human mitochondrial DNA, Am. J. Hum. Genet. 72 (2003) 32–47, erratum in: Am. J. Hum. Genet. 72:779.

    [36] C. Herrnstadt, J.L. Elson, E. Fahy, G. Preston, D.M. Turnbull, C. Anderson, S.S. Ghosh, J.M. Olefsky, M.F. Beal, R.E. Davis, N. Howell, Reduced-median-network analysis of complete mitochondrial DNA coding-region sequences from the major African, Asian, and European haplogroups, Am. J. Hum. Genet. 70 (2002) 1152–1171, erratum in: Am. J. Hum. Genet. 71:448–449.

    [37] D. Polanskey, B. Budowle, Summary of the findings of a quality review of the Scientific Working Group on DNA Analysis Methods mitochondrial DNA database, Forensic Sci. Commun. 7 (2005) 1.

    [38] A. Brandstätter, H. Niederstätter, W. Parson, Monitoring the inheritance of heteroplasmy by computer-assisted detection of mixed basecalls in the entire human mitochondrial DNA control region, Int. J. Legal Med. 118 (2004) 47–54.

    [39] A. Salas, L. Prieto, M. Montesino, C. Albarrán, E. Arroyo, M.R. Paredes-Herrera, A.M. Di Lonardo, C. Doutremepuich, I. Fernández-Fernández, A.G. de la Vega, C. Alves, C.M. López, M. López-Soto, J.A. Lorente, A. Picornell, R.M. Espinheira, A. Hernández, A.M. Palacio, M. Espinoza, J.J. Yunis, A. Pérez-Lezaun, J.J. Pestano, J.C. Carril, D. Corach, M.C. Vide, V. Álvarez-Iglesias, M.F. Pinheiro, M.R. Whittle, A. Brehm, J. Gómez, Mitochondrial DNA error prophylaxis: assessing the causes of errors in the GEP02-03 proficiency testing trial, Forensic Sci. Int. 148 (2005) 191–198.

    [40] H.-J. Bandelt, A. Salas, C. Bravi, Problems in FBI mtDNA database, Science 305 (2004) 1402–1404.

    [41] B.A. Malyarchuk, I.B. Rogozin, Mutagenesis by transient misalignment in the human mitochondrial DNA control region, Ann. Hum. Genet. 68 (2004) 324–339.

    [42] G. Tully, W. Bär, B. Brinkmann, A. Carracedo, P. Gill, N. Morling, W. Parson, P. Schneider, Considerations by the European DNA profiling (EDNAP) group on the working practices, nomenclature and interpretation of mitochondrial DNA profiles, Forensic Sci. Int. 124 (2001) 83–91.

    [43] T. Grzybowski, B.A. Malyarchuk, J. Czarny, D. Miścicka-Śliwka, R. Kotzbach, High levels of mitochondrial DNA heteroplasmy in single hair roots: reanalysis and revision, Electrophoresis 24 (2003) 1159–1165.

    [44] M. Poetsch, H. Wittig, D. Krause, E. Lignitz, Mitochondrial diversity of a northeast German population sample, Forensic Sci. Int. 137 (2003) 125–132, Corrigendum, Forensic Sci. Int. 145 (2004) 73–77.

    [45] S. Hofmann, M. Jaksch, R. Bezold, S. Mertens, S. Aholt, A. Paprotta, K.D. Gerbitz, Population genetics and disease susceptibility: characterization of central European haplogroups by mtDNA gene mutations, correlations with D loop variants and association with disease, Hum. Mol. Genet. 6 (1997) 1835–1846.

    [46] A. Salas, Y.-G. Yao, V. Macaulay, A. Vega, A. Carracedo, H.-J. Bandelt, A critical reassessment of the role of mitochondria in tumorigenesis, PLoS Med. (in press).
