
COMMENTARY

Molecular epidemiology of Mycobacterium leprae: a solid beginning

BARRY G. HALL

Bellingham Research Institute, Bellingham, USA

Accepted for publication 15 June 2009

Until I was invited to present a workshop on phylogenetic methods at an IDEAL consortium

meeting in March of 2007 I knew virtually nothing about Mycobacterium leprae. I was both

astonished and awed as I learned about the many constraints that leprae investigators labour

under; constraints that would drive most microbiologists and molecular biologists

whimpering into a corner. The simple matter of growing a culture in order to prepare a

sample of highly purified DNA for molecular strain typing purposes, something that most of

us think of as a matter of taking a few hours, becomes a prodigious effort often requiring

months to years – and the clonality of that culture is always uncertain. Likewise we take for

granted the high level of genetic variability that permits the straightforward application of

modern DNA technology to the problem of strain typing for epidemiological purposes.

The apparent paucity of variability has forced leprae epidemiologists to turn to the variability

that is inherent in short polynucleotide repeats (VNTR) as a basis of strain typing. At that

March 2007 meeting it was evident that the field lacked a sufficient number of VNTR loci to

reliably type strains or to estimate relationships among those strains. The variability in VNTR

loci is with respect to the number of repeats of a short sequence, and the genotype of each

allele is the repeat number. The papers in this special issue represent a concerted effort to

provide a practical, reliable basis for molecular typing of M. leprae.
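
As a minimal sketch of what "the genotype of each allele is the repeat number" means in practice, the following counts the longest run of back-to-back copies of a motif in a sequence read. The function name, the motif and the example read are hypothetical illustrations; they are not taken from the Gillis et al. SOP, which relies on replicate amplifications and independent readers rather than on a simple pattern match.

```python
import re

def max_tandem_repeats(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    motif = motif.upper()
    # One or more consecutive copies of the motif, matched as a single run.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)

# Hypothetical read: invented flanking sequence around 8 copies of "GTA".
read = "CCGTTAGC" + "GTA" * 8 + "TTCGGACA"
repeat_number = max_tandem_repeats(read, "GTA")
print(repeat_number)
```

A read containing a mixture of products from PCR slippage would not give a single clean answer in this way, which is why the determination is harder than reading the sequence file.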

The Gillis et al. paper identifies 16 potentially useful VNTR loci, provides a standard

operating protocol (SOP) for determining the repeat number at each locus, and rigorously

evaluates the reliability of each locus and its stability through time in a single infection. Six of

the seven remaining papers employ that set of VNTR loci and apply the SOP in

six different countries as a practical field trial of the value of those loci and the SOP.

Taken together, these papers represent both an amazing and an outstanding effort to create

and validate the tools of a modern epidemiological system.

The Gillis et al. paper sets rigorous standards for ‘certification’ of a locus and its SOP: the

PCR conditions must permit reliable amplification, generating a full-length PCR product,

from 10 cells, and the product must be able to be reliably sequenced. All 16 loci meet that

standard. It turns out that determining the repeat number is not as simple as just reading the

output sequence file. Slippage during the PCR amplification process can potentially result in

Correspondence to: Barry G. Hall, Bellingham Research Institute, 218 Chuckanut Point Re, Bellingham, WA 98229, USA (Tel: +360 752 1422; e-mail: [email protected])

Lepr Rev (2009) 80, 246–249

0305-7518/09/064053+04 $1.00 © Lepra

mixed products in which the repeat number is unclear. Applying an uncommonly high

standard, Gillis et al. sequenced multiple independent amplifications in both directions, and

had two independent readers determine the repeat number in order to rank the reliability of

each locus. Two loci, AT15 and TA18, ranked at the bottom of reliability and were not

recommended for use. Gillis et al. judged, quite correctly, that unreliable data are worse than

no data at all.

None of the six surveys, conducted in Colombia, Brazil, India, the Philippines, China and

Thailand, excluded those two loci. By including those ‘unreliable’ loci they provided the

opportunity to determine, under field conditions, whether they are, in fact, any less reliable

than the other loci. Inexplicable (or at least unexplained) was the decision in three studies to

exclude the TA10 locus and in one to exclude the 18-8 locus, both of which ranked at the top

of the Gillis et al. reliability scale. Those decisions leave 14 loci (including the two Gillis et al.

‘unreliable’ loci) that are common to all six survey studies.

Taken together those six studies include (by my manual count) 386 samples, and they

afford the opportunity to evaluate the reliability of those loci and the SOP in the field. One of

the most unusual, important and refreshing aspects of these six studies is that they

report all of the results, not just the positive outcomes. For reasons that I will make clear

below, we can consider a sample to have succeeded if it was possible to determine the repeat

number at all 14 of the loci that were common to all six studies. On that basis, 208, or 53·9%,

of the samples succeeded. The success rates ranged from 91·2% in the Brazil study down to

22% in the Thailand study. Two types of failure occurred: a failure to amplify enough product

for sequencing, or an inability to determine the repeat number, usually meaning that two

possible repeat lengths were observed. The 208 successful samples provide a large set of

common samples that can be analysed in terms of geographical distributions of alleles, and

patterns of those distributions, as discussed below. The 178 failed samples, those that were

not readable at every locus, provide the opportunity for an equally important analysis. The 14

loci can be ranked in terms of reliability. Does unreliability in the laboratory tell us anything

about unreliability in the field? In some cases there were interesting patterns of failures;

i.e. most of the samples from one particular region failed at the same locus. Why? Those same

loci, in the hands of the same investigators, produced reliable results in many other regions.

While the problem might be as simple as a contaminant in the local water supply, it is also

possible that sequence variation at a common site near the 3′ end of one of the primers

interfered with amplification. If so, the site of another SNP may have accidentally been

revealed. Appropriate analysis of the failed samples might suggest whether it is worthwhile

designing some alternative primers for those loci.
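
The success criterion and the per-locus ranking described above can be sketched as follows. The locus names and the four example samples are hypothetical; only the totals 208 and 386 come from the text. A missing or ambiguous repeat number is represented by `None` or by an absent key, which is an assumption about how such data might be recorded.

```python
from typing import Dict, List, Optional

Profile = Dict[str, Optional[int]]  # locus name -> repeat number, or None if undetermined

def is_complete(profile: Profile, loci: List[str]) -> bool:
    """A sample succeeds only if every common locus has a repeat number."""
    return all(profile.get(locus) is not None for locus in loci)

def success_rate(profiles: List[Profile], loci: List[str]) -> float:
    """Percentage of samples readable at all loci."""
    if not profiles:
        return 0.0
    return 100.0 * sum(is_complete(p, loci) for p in profiles) / len(profiles)

def failures_per_locus(profiles: List[Profile], loci: List[str]) -> Dict[str, int]:
    """Count, for each locus, the samples in which it could not be read."""
    return {locus: sum(p.get(locus) is None for p in profiles) for locus in loci}

# Hypothetical data: three loci instead of 14, four samples.
loci = ["locus_A", "locus_B", "locus_C"]
samples = [
    {"locus_A": 8, "locus_B": 12, "locus_C": 5},
    {"locus_A": 8, "locus_B": None, "locus_C": 5},  # ambiguous repeat number at locus_B
    {"locus_A": 9, "locus_C": 4},                   # no amplification at locus_B
    {"locus_A": 7, "locus_B": 12, "locus_C": 5},
]
rate = success_rate(samples, loci)
failures = failures_per_locus(samples, loci)

# The figure quoted in the text: 208 successful samples out of 386.
reported = round(100.0 * 208 / 386, 1)
```

Grouping the failure counts by region as well as by locus would be the natural next step for spotting the regional failure patterns mentioned above.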

It may seem unduly harsh to consider samples in which the repeat number at a single locus

was not determined as ‘failed’, but that really is not the case. Each locus is a character, and the

state of each character is the repeat number. Fourteen is a tiny number of characters to

distinguish among isolates, and every absent character significantly degrades the resolution of

VNTR-based strain typing. However, the most serious problem is not resolution, but the

effect of missing data on the ability to estimate relationships among the isolates. The ability

to understand issues such as sources of infection, transmission, incubation period, modes of

transmission and importance of contact patterns depends completely upon accurate

estimation of the relationships among isolates. Based on a simulation study, I estimated that

about 25 VNTR loci would be required to achieve 90% accuracy of VNTR-based

phylogenetic trees. With only 14 loci that accuracy is reduced to about 77%, and is reduced

about another 2% for each additional locus excluded (Hall, unpublished results).
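
The arithmetic behind that last statement can be written out as a short sketch. It assumes that "about 2%" means two percentage points per locus and that the loss is linear, which is a simplifying reading of the approximate figures quoted above, not a result of the simulation itself. The function name is hypothetical.

```python
def approximate_tree_accuracy(loci_used: int) -> float:
    """Rough accuracy (%) of a VNTR-based tree built from at most 14 loci.

    Assumption: about 77% at 14 loci, minus about 2 percentage points for
    each locus dropped below 14. Not valid above 14 loci.
    """
    if not 0 < loci_used <= 14:
        raise ValueError("this linear approximation only covers 1 to 14 loci")
    return 77.0 - 2.0 * (14 - loci_used)

# A sample missing one locus would be analysed with 13 characters.
accuracy_with_13 = approximate_tree_accuracy(13)
accuracy_with_12 = approximate_tree_accuracy(12)
```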


While estimation of a phylogenetic tree is the most common way to estimate relationships

among isolates, there is serious question about the validity of phylogenetic analysis as applied

to estimating relationships within a bacterial species. Valid phylogenetic analysis depends

completely on the assumption that the taxa are genetically isolated from one another. It is

now well understood that, when there is significant recombination within a species, the

relationships among isolates cannot validly be represented as a tree.1,2 For most microbial

species studied there is sufficient recombination to significantly degrade phylogenetic signal,3

and my comparison of sequenced Mycobacterium tuberculosis genomes suggests

considerable recombination even within that species (Hall, to be published elsewhere).

There is insufficient information to estimate the extent of recombination within M. leprae, but

it would be unwise to assume an absence of recombination, thus unwise to estimate

relationships by phylogenetic analysis.

The problem of taking recombination into account led to an alternative method, eBURST,

for estimating relationships from multi-locus sequence typing (MLST) data.4 eBURST

considers all alleles of a locus to be equidistant from each other whether they differ by one

or several nucleotides. As a consequence recombination, which may introduce many

substitutions from a single event, and mutation are given equal weight. eBURST does not

seek to represent the historical relationships among all isolates on a single diagram (a tree),

instead it seeks to cluster closely related isolates into groups based only upon identity by

state, not identity by descent. eBURST connects the most closely related isolates to each

other, and generates diagrams that are ideal for epidemiological investigations. MLST and

eBURST analysis are by now the standard tools for molecular epidemiology of bacteria.
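
The idea of clustering by identity by state can be illustrated with a deliberately simplified sketch. This is not the eBURST program or its full algorithm: it only links isolates whose profiles differ at no more than one locus and reports the resulting groups, whereas eBURST also infers founders and draws radial diagrams. The isolate names and four-locus profiles are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]  # allele (or repeat) number at each locus, in a fixed order

def loci_differing(a: Profile, b: Profile) -> int:
    """Number of loci at which two profiles differ; the size of a difference is ignored."""
    return sum(x != y for x, y in zip(a, b))

def cluster_single_locus_variants(profiles: Dict[str, Profile]) -> List[List[str]]:
    """Group isolates connected by chains of single-locus (or zero-locus) differences."""
    parent = {name: name for name in profiles}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if loci_differing(profiles[a], profiles[b]) <= 1:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in profiles:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical isolates typed at four loci.
isolates = {
    "iso1": (8, 12, 5, 3),
    "iso2": (8, 12, 5, 9),   # differs from iso1 at one locus (by 6 repeats, still one step)
    "iso3": (8, 13, 5, 9),   # differs from iso2 at one locus
    "iso4": (10, 20, 7, 1),  # differs from all others at every locus
}
clusters = cluster_single_locus_variants(isolates)
```

Note that `iso1` and `iso2` are linked although they differ by six repeats at one locus; that is the sense in which all alleles of a locus are treated as equidistant.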

Epidemiology based on VNTR loci is ideally suited to analysis by eBURST. Where

MLST represents each unique sequence of a locus by an allele number, the allele number of a

VNTR locus is simply the repeat number. MLST defines an allele profile as the set of allele

numbers at the loci under consideration, and assigns to each unique allele profile a sequence

type (ST) number. For VNTR loci the equivalent would be a VT number that represents a

unique profile of the repeat lengths at all 14 loci. The VT thus represents the VNTR-based

genotype as a single number, from which the repeat number at all 14 loci can be deduced.

That sort of representation is ideal for constructing a database of M. leprae genotypes.

However, if the repeat number of a single locus is missing it is impossible to assign a VT and

impossible for the eBURST algorithm to place the isolate in a cluster. Thus, samples missing

the repeat number for a single locus are, indeed, ‘failed’.
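
A minimal sketch of the VT idea, assuming VT numbers are simply issued in order of first appearance (a real database would need a curated, persistent numbering). The function name, isolate names and three-locus profiles are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

def assign_vt(
    profiles: Dict[str, List[Optional[int]]]
) -> Tuple[Dict[str, Optional[int]], Dict[int, Tuple[int, ...]]]:
    """Assign a VT number to each complete repeat-number profile.

    Returns (isolate -> VT or None, VT -> profile). Isolates with any
    missing repeat number receive None, because no VT can be assigned.
    """
    vt_of_profile: Dict[Tuple[int, ...], int] = {}
    vt_table: Dict[int, Tuple[int, ...]] = {}
    assignments: Dict[str, Optional[int]] = {}
    for isolate, repeats in profiles.items():
        if any(r is None for r in repeats):
            assignments[isolate] = None
            continue
        key = tuple(repeats)
        if key not in vt_of_profile:
            vt = len(vt_of_profile) + 1  # next unused VT number
            vt_of_profile[key] = vt
            vt_table[vt] = key
        assignments[isolate] = vt_of_profile[key]
    return assignments, vt_table

# Hypothetical isolates typed at three loci instead of 14.
typed = {
    "A": [8, 12, 5],
    "B": [8, 12, 5],    # same profile as A, so same VT
    "C": [8, 13, 5],
    "D": [8, None, 5],  # one locus unreadable: no VT
}
assignments, vt_table = assign_vt(typed)
```

The lookup table `vt_table` is what allows the repeat number at every locus to be recovered from a single VT number.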

Some of the surveys included SNP data for the samples they reported; however, those data

appear, at first glance, to contribute little to the resolution of the typing scheme. They do,

however, afford the opportunity to ask whether SNP analysis potentially adds anything to

epidemiological studies of leprae by asking, for samples where the SNP data are available, can

we draw any conclusions with the SNP information that could not be drawn without it? That

is an important question because it addresses the issue of efficiency. It requires as much effort

to determine the status of a single polymorphic site (which can have only four possible states)

as it does to determine the status of a VNTR locus that can have many possible states.

A comparison of the two completely sequenced M. leprae genomes shows that there are

only about 125 polymorphic sites in the genes that the two strains have in common (some of

which are at VNTR loci). If SNPs contribute little of value to epidemiological information

it may not be cost effective to try to identify more polymorphic sites.

The common goal of the papers in this special issue was to establish a common set of

loci for strain typing and a common protocol for the application of those loci to investigating


the epidemiology of leprae. To accomplish that goal it is important to agree upon and use a

common terminology. These papers fail to reflect such a common terminology. VNTR loci

are referred to as VNTR, STR, micro-satellites or mini-satellites (depending on repeat

length). Whatever the personal preferences of the authors, they should agree on a single

terminology.

The data in these papers represent a rich epidemiological resource, but effective

exploitation of that resource requires the development of a database that is open to all and that

includes tools such as eBURST for analysing VNTR data. Development of that database

should be given a high priority in the near future, for without it the results of these studies in

six countries will remain effectively isolated from each other. A database, of course, requires

agreement upon not only a common set of loci, but upon what other information will be

associated with each strain. It requires development of the equivalent of an MLST scheme, a

group to host the database, computer hardware, and a person to curate the database. In short, it

requires both funds and commitment. When allocating scarce resources it should be fully

understood that the development of such a database is far more important than are additional or

expanded surveys of the sort that are reported in these papers.

In the end there are really only three ways to distinguish among strains based on their

DNA content: sequence differences among their core genes (SNPs), differences in repeat

numbers of VNTRs, and differences in the presence or absence of genes. The two sequenced

genomes differ only in the presence/absence of 10 genes. One strain has three genes the other

lacks, and the second has seven genes the first lacks. This proportion of ‘distributed’ genes is

the smallest among the 22 species that have been studied (Hall, unpublished results). Only

two M. leprae genomes have been sequenced so far. When another 10 or so have been

sequenced we will have a much better understanding of leprae genome dynamics, but at this

point the prospects for molecular epidemiology based on gene sequence differences do

not appear to be good. It seems that analysis of VNTR loci is likely to be, for the

foreseeable future, as good as it gets for leprae molecular epidemiology. That makes

the continued investment in developing and improving VNTR analysis very worthwhile.

The work reported here is both important and very valuable, but it is only a beginning –

an impressive and solid beginning, but just that, a beginning.

References

1 Feil EJ, Holmes EC, Bessen DE et al. Recombination within natural populations of pathogenic bacteria: short-term empirical estimates and long-term phylogenetic consequences. Proc Natl Acad Sci U S A, 2001; 98: 182–187.

2 Didelot X, Falush D. Inference of bacterial microevolution using multilocus sequence data. Genetics, 2007; 175: 1251–1266.

3 Perez-Losada M, Browne EB, Madsen A et al. Population genetics of microbial pathogens estimated from multilocus sequence typing (MLST) data. Infect Genet Evol, 2006; 6: 97–112.

4 Feil EJ, Li BC, Aanensen DM et al. eBURST: inferring patterns of evolutionary descent among clusters of related bacterial genotypes from multilocus sequence typing data. J Bacteriol, 2004; 186: 1518–1530.

