TRANSPOSABLE ELEMENTS IN A - Genetics · 2003-07-30 · Transposable elements are DNA sequences,...



Copyright © 1983 by the Genetics Society of America

TRANSPOSABLE ELEMENTS IN MENDELIAN POPULATIONS. I. A THEORY

CHARLES H. LANGLEY,* JOHN F. Y. BROOKFIELD*,¹ AND NORMAN KAPLAN**

* Laboratory of Genetics, ** Biometry and Risk Assessment Program, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina 27709

Manuscript received June 22, 1982
Revised copy accepted March 15, 1983

ABSTRACT

Transposable elements are DNA sequences, found throughout eukaryotes, that transpose replicatively and cause numerous genetic and developmental effects on their hosts. A model of the evolution of transposable elements in Mendelian populations is proposed. From its analysis, formulas for the mean copy number and frequency spectrum are obtained.

THE ubiquity of transposable elements (TEs) is becoming more apparent. The similarity of the structure and function of elements from bacteria to mammals suggests either a new and large phylogenetic group or remarkable convergence due to recurrent invasion of a narrow but common niche (TEMIN 1982). Transposable elements appear to be a major factor in the quality and quantity of mutation in bacteria (KLECKNER 1981), yeast (ERREDE et al. 1980) and Drosophila (SPRADLING and RUBIN 1981). Retroviruses, which are structurally related to TEs, are involved in somatic mutations in carcinogenesis (PAYNE, BISHOP and VARMUS 1982) and germ line mutations of the mouse (JENKINS et al. 1981). Finally, the role of the transposable element, P factor, in hybrid dysgenesis of Drosophila melanogaster (RUBIN, KIDWELL and BINGHAM 1982; BINGHAM, KIDWELL and RUBIN 1982) allows the possibility of the rapid establishment of reproductive isolation among allopatric populations. Thus, new species of hosts may be formed with no significant change in their gene pools other than the distribution of the TEs.

The expected frequency distributions of TEs in an individual’s genome, a population, among populations and among species are unknown. So far little theory has been proposed that might offer any predictions. BROOKFIELD (1982) has produced an infinite-population-size model, in which elements reduce the fitness of their hosts. His model fails to yield realistic equilibrium numbers of transposable elements per genome. It cannot be used to consider the problem of determining the equilibrium frequency spectrum of transposable elements. CHARLESWORTH and CHARLESWORTH (1983) have addressed this issue in the context of finite populations. In this paper a simple theory is proposed that addresses these problems. Although a clear and detailed picture of TE biology in eukaryotes is not yet available, sufficient understanding exists to formulate a minimal model. Specifically, this model provides predictions about the expected frequency distribution of copies of the TE in the genomes of a finite population of hosts. Data such as in situ hybridization of cloned TE sequences to Drosophila salivary gland chromosomes can be analyzed to test the assumptions and sufficiency of the model (MONTGOMERY and LANGLEY 1983).

¹ Present address and to whom reprint requests should be addressed: Department of Genetics, University of Leicester, Adrian Building, University Rd, Leicester LE1 7RH, United Kingdom.

Genetics 104: 457-471 July 1983

Before formulating the mathematical model, we will justify the salient assumptions. In this paper, all copies of the TE are identical with each other except for their location in the host genome. Some populations of TEs (families) do, in fact, appear to be quite homogeneous, whereas others are physically, if not functionally, variable (SPRADLING and RUBIN 1981). The TE replicatively transposes to new locations in the host genome. This is perhaps the essential quality that characterizes a DNA sequence as a transposable element. Hence, a population of TEs can increase its numbers by inserting new copies at many possible locations throughout the hosts’ genomes. The dispersed distribution of TEs distinguishes them from other types of repeated sequences which are tandemly repeated.

The transposition process is assumed to be regulated in such a way that the rate of transposition is high in host genomes containing few TEs and low in host genomes containing many. In bacteria there is some evidence to support this assumption (KLECKNER 1981), but in higher organisms the evidence for such regulation is still meager. The phenomena of hybrid dysgenesis in Drosophila, however, are reasonably interpreted as arising from paternal P factors (TEs) transmitted into the egg of a female with a few or no P factors (RUBIN, KIDWELL and BINGHAM 1982). Regulation is consistent with the observations of fairly constant numbers (STROBEL, DUNSMUIR and RUBIN 1979; YOUNG 1979).

It has been observed in prokaryotic and eukaryotic hosts that TEs excise or delete. In bacteria, this appears to be by mechanisms independent of transposition (KLECKNER 1981). Since little is known about this process in eukaryotic TEs, it will be assumed that there is a constant probability per generation of deletion for each TE.

Finally, it is assumed that the frequencies of TEs at particular locations can increase only by random drift, whereas the frequencies can decrease by random drift and deletion. Weak additive selection can also be incorporated (see DISCUSSION and KAPLAN and BROOKFIELD 1983b).

THE MODEL

In this section a mathematical model is studied that describes the evolution of a population of transposable elements in a finite haploid population of size 2N. Since there are a large number of possible locations in the genome at which new copies of the transposable element can be placed, it is assumed that each new copy is inserted at a location that is currently unoccupied in all the genomes of the host population. Any location at which a TE is present in at least one of the 2N genomes of the host population is called a site. The population frequency of a site is the proportion of the population having the TE at that site. It is assumed that once a site is introduced into the population, its frequency process evolves in a random way independent of all other sites, i.e., there is free recombination between sites. Furthermore, the frequency process of any site is assumed to follow a random deletion model, i.e., if in generation t the frequency of a site is $f_t$, then the distribution of $f_{t+1}$ is

$$P\left(f_{t+1} = \frac{j}{2N}\right) = \binom{2N}{j}\,[(1-\mu) f_t]^{j}\,[1-(1-\mu) f_t]^{2N-j}, \qquad 0 \le j \le 2N. \qquad (1)$$

The parameter $\mu$ is interpreted as the probability, each generation, that an individual TE is deleted.
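As a rough illustration of equation (1), the sketch below advances a single site by one generation: each of the 2N gametes of the next generation carries the TE independently with probability (1 - mu) f_t, where mu is the per-generation deletion probability. The function name `next_site_count` and the use of Python's `random` module are choices made here for illustration; they are not part of the original model description.

```python
import random


def next_site_count(count, two_n, mu, rng=random):
    """Sample the number of gametes carrying a site in the next generation.

    count -- gametes (out of two_n) carrying the site now, so f_t = count / two_n
    mu    -- per-generation deletion probability of an individual TE
    """
    p = (1.0 - mu) * count / two_n           # success probability in equation (1)
    # Binomial(two_n, p) draw, built from two_n independent Bernoulli trials.
    return sum(1 for _ in range(two_n) if rng.random() < p)


if __name__ == "__main__":
    rng = random.Random(1)
    count = 1                                 # a newly created site starts in one gamete
    for generation in range(10):
        count = next_site_count(count, two_n=20, mu=0.005, rng=rng)
    print("copies after 10 generations:", count)
```

A site that reaches count 0 stays at 0, which matches the model's assumption that frequencies rise only by drift and that a lost site is never re-created at the same location.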

To complete the description of the model, the mechanisms for introducing new sites into the population each generation need to be specified. First, new sites can result from invasion from outside the population. This is assumed to be a rare phenomenon and so the expected number of sites created each generation in this way is small. Second, new sites can result from transposition. As indicated, this process is assumed to be self-regulating, i.e., when there are few TEs in a genome, the conditions are more favorable for the creation of new copies of the TE than if there are many TEs in the genome.

One way to model this regulation process is the following. Each generation, when the host population replicates itself to form the next generation, random numbers of new sites are created in each of the 2N daughters. Let $J(1), J(2), \ldots, J(2N)$ denote these random quantities. It is assumed that the J’s are independent and identically distributed and that their common distribution depends only on the random quantities $\{p(i), i \ge 0\}$, where $p(i)$ is the fraction of the parent population having exactly i copies of the TE, $i \ge 0$. Since the p(i)’s change from generation to generation, they will be written as $p_t(i)$ to denote the generation. Similarly the J’s will be written as $J_t(1), J_t(2), \ldots, J_t(2N)$. To simplify the notation, $J_t$ will denote any of the $J_t(\cdot)$. The only assumption that is placed on the $J_t$’s is that $\lim_{t\to\infty} E(J_t)$ exists and is finite and nonzero. The model just described will be referred to from now on as the general TE model.

It is instructive to consider some examples of the general TE model.

Model I: The simplest example is when the distribution of $J_t$ does not depend on the $p_t(i)$. In this case $E(J_t)$ is a fixed constant independent of t. This model is not considered in this paper since there is no self-regulation.

Model II: Suppose that each of the N zygotes in the daughter generation is formed by randomly choosing two gametes. In each zygote a random number of new sites is created, and the distribution of this variable is Poisson with mean $\beta + r(y_1) + r(y_2)$. The quantities $y_1$ and $y_2$ are the numbers of TEs in the two gametes forming the zygote. The function r has the properties that r(0) = 0 and r(x) is positive, decreasing for x > 0. Finally, the constant $\beta$ equals the expected number of new sites that are created each generation by invasion from outside the population. The distribution of $J_t$ is, therefore,

In this case,

$$E(J_t) = \frac{\beta}{2} + \sum_i r(i)\,E\big(p_t(i)\big). \qquad (3)$$


It should be noted that for this model the numbers of elements created in each of the N zygotes are independent variables, whereas the numbers in each of the two gametes derived from a particular zygote are correlated.

Model III: An alternative possibility to model II is to assume that the Poisson number of newly created sites in the zygote has mean $\beta + 2r((y_1 + y_2)/2)$. In this case,

$$E(J_t) = \frac{\beta}{2} + \sum_{l}\sum_{k} r\!\left(\frac{l+k}{2}\right) E\big(p_t(l)\,p_t(k)\big). \qquad (4)$$

In APPENDIX I it is shown that $\lim_{t\to\infty} E(J_t)$ exists and is finite and nonzero for models II and III.
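To make the difference between the two regulation schemes concrete, the following sketch evaluates the expected number of new sites per gamete under models II and III for a fixed copy-number distribution. Several things here are assumptions for illustration only: the distribution `p` is treated as fixed rather than random (so the expectation of a product is simply the product), each gamete is taken to receive on average half of the new sites of its zygote, and r(x) = a/x for x > 0 with r(0) = 0. The function names are hypothetical.

```python
def r(x, a=1.0):
    """Illustrative regulation function: 0 at x = 0, a / x for x > 0 (an assumption)."""
    return a / x if x > 0 else 0.0


def mean_new_sites_model2(p, beta, a=1.0):
    """beta/2 + sum_i r(i) p(i): each parental gamete acts through its own copy number."""
    return beta / 2 + sum(r(i, a) * w for i, w in p.items())


def mean_new_sites_model3(p, beta, a=1.0):
    """beta/2 + sum_l sum_k r((l + k)/2) p(l) p(k): regulation by the zygote's average."""
    return beta / 2 + sum(
        r((l + k) / 2, a) * wl * wk for l, wl in p.items() for k, wk in p.items()
    )


if __name__ == "__main__":
    # Half of the gametes carry no TE, half carry 20 copies.
    p = {0: 0.5, 20: 0.5}
    print(mean_new_sites_model2(p, beta=0.001))
    print(mean_new_sites_model3(p, beta=0.001))
```

With a mixed population like the one in the usage example, model III yields more transposition than model II because an empty gamete paired with a full one still produces a zygote whose average copy number is nonzero.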

Let $f_t(1), f_t(2), \ldots$ denote the population frequencies in generation t of the different sites. Several interesting quantities can be expressed as $E(\sum_i h(f_t(i)))$ for suitable choices of the function h. Some examples are:

(a) Let $h(x) = 1$ if $x > 0$ and $h(x) = 0$ if $x = 0$. Then $E(\sum_i h(f_t(i)))$ = expected number of distinct sites in the population in generation t.

(b) Let $h(x) = 1$ if $a \le x \le b$ and $h(x) = 0$ otherwise, where $0 < a < b \le 1$. Then $E(\sum_i h(f_t(i)))$ = expected number of distinct sites in the population in generation t that have frequencies between a and b.

(c) Let $h(f_t(i)) = f_t(i)$. Then $E(\sum_i h(f_t(i)))$ = expected number of sites per haploid genome in generation t.

In view of these examples it is of interest to study the asymptotic behavior of $E(\sum_i h(f_t(i)))$. Let h be any bounded piecewise continuous function on [0, 1]. The random variable $V_t(h) = \sum_i h(f_t(i))$ can be written as

$$V_t(h) = \sum_{l=1}^{t}\sum_{k=1}^{K_l} h\big(f_{t-l}(l,k)\big) \qquad (5)$$

where $K_l$ is the number of new sites introduced into the population in generation l and $f_{t-l}(l,k)$ is the population frequency in generation t of the kth site introduced in generation l, $l \le t$. The key property of the general TE model is that once $K_l$ is given, the random variables $\{f_{t-l}(l,k), 1 \le k \le K_l\}$ are independent and identically distributed. Their common distribution is the same as that of $f_{t-l}$, which is the population frequency of a site in the (t - l)th generation under a random deletion model with parameter $\mu$. Thus, the expectation of $V_t(h)$ satisfies

$$E(V_t(h)) = \sum_{l=1}^{t} E(K_l)\,E\big(h(f_{t-l})\big). \qquad (6)$$


A change of variable in (6) leads to

$$E(V_t(h)) = \sum_{l=0}^{t-1} E(K_{t-l})\,E\big(h(f_l)\big). \qquad (7)$$

Since $K_t = \sum_{i=1}^{2N} J_t(i)$, $E(K_t) = 2N\,E(J_t)$, and consequently by the assumptions of the general TE model, $\lim_{t\to\infty} E(K_t) = E(K_\infty)$ exists and is finite. Thus, as $t \to \infty$, equation (7) becomes in the limit

$$E(V_\infty(h)) = E(K_\infty)\,E\Big(\sum_{l=0}^{\infty} h(f_l)\Big). \qquad (8)$$

It can be shown, using a diffusion approximation (EWENS 1979, Chapter 4), that for a random deletion model with deletion parameter $\mu$,

$$E\Big(\sum_{l} h(f_l)\Big) = 2\int_{(2N)^{-1}}^{1} h(y)\,y^{-1}(1-y)^{\theta-1}\,dy \qquad (9)$$

where $\theta = 4N\mu$. This approximation works best when N is large and $\mu$ is small.

The evaluation of $E(K_\infty)$ is in general difficult as there is no explicit formula that one can use to compute it. The following result can be asserted. Let $L_t$ denote the total number of copies of the transposable element in the population in generation t. Then,

$$E(L_{t+1}) = (1-\mu)\,E(L_t) + E(K_t). \qquad (10)$$

It therefore follows that when $t \to \infty$,

$$E(K_\infty) = \mu\,E(L_\infty). \qquad (11)$$

Equation (11) states the intuitive result that at equilibrium the expected number of new copies of the TE each generation must equal the expected number deleted each generation. Let $A = E(L_\infty/2N)$, which can be interpreted as the expected number of TEs per haploid genome at equilibrium. Equation (8) can be rewritten as

$$E(V_\infty(h)) = \theta A\int_{(2N)^{-1}}^{1} h(y)\,y^{-1}(1-y)^{\theta-1}\,dy. \qquad (12)$$

This formula for $E(V_\infty(h))$ is more convenient since A can be estimated from sample data (KAPLAN and BROOKFIELD 1983). The function $\theta A\,y^{-1}(1-y)^{\theta-1}$ is generally called the frequency spectrum of the process (EWENS 1979, p. 96). It should be noted that, when 2N is large and h(x)/x remains bounded as $x \to 0$, the lower limit of the integral in (12) can be set at 0 without causing appreciable error.
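The integral in (12) is easy to evaluate numerically for any choice of h. The sketch below uses a plain midpoint rule; the function name `expected_sum`, the step count and the parameter values in the usage example are illustrative assumptions, not values taken from the paper. The midpoint rule is used because its evaluation points never touch y = 1, where the integrand is singular for theta < 1 (convergence is slow in that case, so the example uses theta = 2).

```python
import math


def expected_sum(h, theta, mean_copies, two_n, steps=100_000):
    """Midpoint-rule estimate of theta*A * int_{1/2N}^{1} h(y) y^-1 (1-y)^(theta-1) dy."""
    lower = 1.0 / two_n
    width = (1.0 - lower) / steps
    total = 0.0
    for k in range(steps):
        y = lower + (k + 0.5) * width         # midpoints never reach y = 1
        total += h(y) * (1.0 - y) ** (theta - 1.0) / y
    return theta * mean_copies * total * width


if __name__ == "__main__":
    theta, mean_copies, two_n = 2.0, 10.0, 50
    # Example (c): h(y) = y recovers (almost) the mean copy number A.
    print(expected_sum(lambda y: y, theta, mean_copies, two_n))
    # Example (a): h(y) = 1 gives the expected number of distinct sites.
    print(expected_sum(lambda y: 1.0, theta, mean_copies, two_n))
    # Closed form for example (a) when theta = 2: 2A [ln(2N) - (1 - 1/2N)].
    print(2 * mean_copies * (math.log(two_n) - (1 - 1 / two_n)))
```

With h(y) = y and theta = 2 the exact value is A(1 - 1/2N)^2, which approaches A as 2N grows, in line with the remark that the lower limit can be replaced by 0 when h(x)/x is bounded.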

It is useful for analyzing simulation data to develop methods for approximating A. The following two approaches to this problem can be used for both models II and III. For definiteness it will be assumed that the assumptions of model II are in force.

Approach I: It follows from (11) and the assumptions of model II that

$$\frac{\theta}{2}A = 2N\mu\,\frac{E(L_\infty)}{2N} = E(K_\infty) = 2N\Big[\frac{\beta}{2} + \sum_i r(i)\,\pi(i)\Big] \qquad (13)$$

where $\pi(i) = \lim_{t\to\infty} E(p_t(i))$. The important observation about the $\pi(i)$ is that $\sum_i i\,\pi(i) = A$. For the remainder of the discussion it is more convenient to write (13) in terms of expectations. Hence, let Y be a random variable with distribution given by the $\pi(i)$, i.e., Y represents the number of TEs in a randomly chosen genome at equilibrium. Then, by using a two-term Taylor series for r, (13) can be written as

$$\frac{\theta}{2}A = N\beta + 2N\,E(r(Y)) = N\beta + 2N\Big\{r(A) + \tfrac{1}{2}\,E\big(r''(\xi)\,[Y - E(Y)]^2\big)\Big\} \qquad (14)$$

where $r''$ denotes the second derivative of r and $\xi$ is a positive random variable. Thus, if either $r''$ is small or Y has a small variance, then A is approximately equal to the solution of the equation

$$\frac{\theta}{2}x - N\beta - 2N\,r(x) = 0. \qquad (15)$$
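Equation (15) has no closed form for a general r, but it is straightforward to solve numerically. The sketch below uses bisection with r(x) = a/x, the form used in the simulations; the function name `solve_eq15` and the bracketing interval are assumptions made here for illustration. With the parameters listed for simulation 3A in Table 1 (theta = 0.4, 2N = 20, beta = 0.001, a = 1.0) the root is about 10.03, close to the tabulated 10.00; setting beta to 0 gives exactly 10.

```python
def solve_eq15(theta, n, beta, a, lo=0.5, hi=1000.0, tol=1e-10):
    """Bisection root of (theta/2) x - N beta - 2 N r(x) = 0 with r(x) = a / x."""

    def g(x):
        return 0.5 * theta * x - n * beta - 2.0 * n * a / x

    # g is increasing in x because r is decreasing, so a sign change brackets one root.
    if g(lo) > 0 or g(hi) < 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # N = 10 corresponds to a haploid population size 2N = 20.
    print(solve_eq15(theta=0.4, n=10, beta=0.001, a=1.0))
    print(solve_eq15(theta=0.4, n=10, beta=0.0, a=1.0))
```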

When r is convex, $r''$ is positive and so the solution of (15) is less than A.

It is of interest to examine the variance of Y. Let $Y_t$ denote the number of sites in a randomly chosen genome from generation t. The variable $Y_t$ can be written as

$$Y_t = \sum_i x_t(i) \qquad (16)$$

where $x_t(i) = 1$ if site i is present in a randomly chosen genome from generation t, and $x_t(i) = 0$ otherwise.

Given the $f_t(i)$, the $x_t(i)$ are independent and $E_f(x_t(i)) = f_t(i)$, where $E_f(\cdot)$ denotes the conditional expectation of a sample variable given the current population structure. The variance of $Y_t$ can be written as,

The last approximation follows from (12) with h(x) = x(1 - x). It is clear from (17) that the variance of Y, which is equal to $\lim_{t\to\infty} \mathrm{Var}(Y_t)$, can only be small when $\theta$ is small. Thus, when $\theta$ is large and $r''$ is not negligible, an alternative approach for approximating A is needed.
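A sketch of how a bound of this kind can be obtained, assuming the law of total variance is the intended route (the arrangement of the source's equation (17) may differ). Conditional on the site frequencies, the indicators are independent Bernoulli variables, so the conditional variance is a sum of f(1 - f) terms; the equilibrium expectation of that sum follows from (12) with h(x) = x(1 - x) and the lower limit set to 0.

```latex
\begin{align*}
\operatorname{Var}(Y_t)
  &= E\big[\operatorname{Var}_f(Y_t)\big] + \operatorname{Var}\big[E_f(Y_t)\big] \\
  &\ge E\Big[\sum_i f_t(i)\,\big(1 - f_t(i)\big)\Big] \\
  &\approx \theta A \int_0^1 (1-y)^{\theta}\,dy
   \;=\; \frac{\theta A}{1+\theta}
   \qquad (t \to \infty).
\end{align*}
```

For fixed A the lower bound grows from 0 toward A as theta increases, which is consistent with the statement that the variance of Y can be small only when theta is small.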

Approach II: Let $A_t = L_t/2N$. It follows from APPENDIX I that there exists a random variable $A_\infty$ such that $A_t$ converges to $A_\infty$ in distribution as $t \to \infty$. It should be noted that $A = E(A_\infty)$. Let $\lambda$ be the solution of the equation

$$-\mu\lambda + \frac{\beta}{2} + \sum_i r(i)\,\frac{e^{-\lambda}\lambda^{i}}{i!} = 0. \qquad (18)$$

It is shown in APPENDIX II that, if $\theta$ is large, so that sites rarely achieve a high population frequency, then the distribution of $\sqrt{\lambda}\,(A_\infty/\lambda - 1)$ is approximately normal with mean zero and variance $(2\theta)^{-1}$, and the larger $\lambda$ is, the more accurate is the approximation. The obvious approximation of A is thus $\lambda$. It should be noted that for model III an approximation of A is the solution of the equation

$$-\mu\lambda + \frac{\beta}{2} + \sum_i r(i/2)\,e^{-2\lambda}\,\frac{(2\lambda)^{i}}{i!} = 0. \qquad (19)$$
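Equation (18) can be solved numerically in the same spirit as (15). The sketch below assumes r(i) = a/i for i >= 1 and r(0) = 0, as in the simulations, truncates the Poisson sum at a fixed number of terms, and finds the root by bisection. The function names, the truncation point and the bracketing interval are illustrative assumptions. With the parameters listed for simulation 6 in Table 1 (mu = 0.10, beta = 0.001, a = 1.6) the root comes out close to the tabulated 4.55.

```python
import math


def poisson_mean_of_r(lam, a, i_max=400):
    """sum_{i>=1} (a / i) * exp(-lam) * lam**i / i!, i.e. E[r(Y)] for Poisson Y."""
    pmf = math.exp(-lam)                      # Poisson probability of i = 0
    total = 0.0
    for i in range(1, i_max + 1):
        pmf *= lam / i                        # Poisson probability of i
        total += (a / i) * pmf
    return total


def solve_eq18(mu, beta, a, lo=0.01, hi=50.0, tol=1e-10):
    """Bisection root of -mu*lam + beta/2 + sum_i r(i) Poisson(i; lam) = 0."""

    def g(lam):
        return -mu * lam + beta / 2.0 + poisson_mean_of_r(lam, a)

    # g is positive for small lam and negative for large lam in the cases tried here.
    if g(lo) < 0 or g(hi) > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(solve_eq18(mu=0.10, beta=0.001, a=1.6))
```

The same routine handles equation (19) if the Poisson mean is doubled and r is evaluated at i/2, which would require only a small change to `poisson_mean_of_r`.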

THE SIMULATIONS

Simulations of the spread of transposable elements through a population were carried out by an iterative computer program written by Ken Risko. The population is defined as consisting of 2N gametes, each of which has a number of transposable elements at different locations within the genome. The location of a TE is defined by an integer in the range from 0 to 32,767. In the simulations, the total number of different locations in the population at which TEs were found was never greater than 200. Thus, it is assumed that every copy of an element found in the population at a given location is descended from one element generated by a unique transpositional event. The population is acted on, each generation, by the four processes of sampling, deletion, transposition and recombination.

Sampling: Two gametes are sampled at random from the population, with replacement. The processes of deletion, transposition and recombination act sequentially on these two gametes. The process is repeated N times.

Deletion: Each element in each of the two chosen gametes is independently deleted with probability $\mu$.

Transposition: A random number of new elements are added at random sites to each gamete. The number added is a Poisson variable with mean $\beta/2 + [r(x) + r(y)]/2$, where x and y are the numbers of TEs already present in the two genomes. This mechanism of transposition is similar to model II. For all the simulations, the function r(x) equaled zero if x = 0 and a/x if $x \ge 1$. This choice of r has the desired property that it is positive and monotone decreasing.

Recombination: The number of crossovers between the two genomes has a Poisson distribution with mean R, and the locations of crossover points are distributed uniformly in the range 0 to 32,767. At each crossover point, a recombination transfers between the two genomes that part of the genome that comes before the next crossover.
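The four processes above can be sketched in a few lines of Python. This is not the original program by Ken Risko, only a minimal reading of the description. The following points are assumptions made here: both processed gametes enter the next generation, copy numbers x and y are counted after deletion, new locations are drawn uniformly (so a collision with an occupied location is possible, though rare), and recombination is implemented by swapping every segment that lies after an odd number of crossover points. All function names are hypothetical.

```python
import math
import random

GENOME_SIZE = 32768   # locations are the integers 0..32767, as in the text


def poisson(mean, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small means)."""
    if mean <= 0:
        return 0
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def r(x, a):
    """Regulation function of the simulations: 0 if x == 0, a / x if x >= 1."""
    return a / x if x >= 1 else 0.0


def recombine(g1, g2, rate, rng):
    """Exchange the segments lying after an odd number of crossover points."""
    cuts = [rng.randrange(GENOME_SIZE) for _ in range(poisson(rate, rng))]
    out1, out2 = set(), set()
    for source, keep, swap in ((g1, out1, out2), (g2, out2, out1)):
        for loc in source:
            crossed = sum(1 for c in cuts if c <= loc)
            (swap if crossed % 2 else keep).add(loc)
    return out1, out2


def next_generation(pop, mu, beta, a, rate, rng):
    """One generation: sampling, deletion, transposition, recombination."""
    new_pop = []
    for _ in range(len(pop) // 2):
        # Sampling: two gametes drawn at random, with replacement.
        g1, g2 = set(rng.choice(pop)), set(rng.choice(pop))
        # Deletion: each element is lost independently with probability mu.
        g1 = {loc for loc in g1 if rng.random() >= mu}
        g2 = {loc for loc in g2 if rng.random() >= mu}
        # Transposition: Poisson number of new elements per gamete.
        mean = beta / 2 + (r(len(g1), a) + r(len(g2), a)) / 2
        for gamete in (g1, g2):
            for _ in range(poisson(mean, rng)):
                gamete.add(rng.randrange(GENOME_SIZE))
        # Recombination, after which both gametes join the next generation.
        new_pop.extend(recombine(g1, g2, rate, rng))
    return new_pop


if __name__ == "__main__":
    rng = random.Random(0)
    two_n = 20
    pop = [set() for _ in range(two_n)]
    pop[0].add(rng.randrange(GENOME_SIZE))    # start: one gamete with one TE
    for _ in range(200):
        pop = next_generation(pop, mu=0.005, beta=0.001, a=0.5, rate=6.0, rng=rng)
    print("mean copies per gamete:", sum(len(g) for g in pop) / two_n)
```

The usage example uses the parameter values listed for simulation 1 in Table 1; a single run is noisy, so its output should be compared with the tabulated averages only after repeating it many times.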

All simulations were started with one of the 2N gametes having a single TE located at a randomly chosen site and were run for enough generations so that the population of TEs was at or near equilibrium. For each set of state variables, the simulation was repeated a number of times depending on how many generations were needed to approach stationarity.

The simulations provide one way to study the validity of the theory developed. Table 1 reveals that the approximations of A obtained from equations (15) and (18) give values that agree quite well with the observed averages, despite the fact that 2N is small. The close agreement of the approximation given by equation (18) with the simulation results suggests its adequacy when A is not large. It is also evident from Figures 1 and 2 that there is close agreement between the observed frequency spectra and the predicted ones.

When $\theta$ is large, there is typically an excess of sites that are present in only one gamete. This is due to the fact that when data from the simulations are collected, it is done after the processes of transposition, deletion and recombination have taken place in the previous generation, and not after the sampling process. Each generation, transposition introduces new sites that occur only in single gametes, and so censusing the population immediately after transposition and recombination would show an excess of these single sites. When $\theta$ is large, this is particularly true as the likelihood of introducing a large number of TEs each generation is greater.

Finally, an important assumption of the model was that elements evolve independently. The simulation results show that, as expected, this assumption can be made for high rates of recombination. In fact, the simulations indicate that even for low rates of recombination, there is good agreement between the observed and predicted values of the spectra and the average number of TEs per individual (Figure 2).

DISCUSSION

Transposable elements, mobile elements, nomadic sequences and perhaps retroviruses are a new systematic, if not phylogenetic, entity. Their vast range of hosts and the fundamental nature of their possible effects make them important subjects of study. Although most questions about their biology, evolution and effects on their hosts are premature, certain minimal characteristics can be inferred. It is clear that TEs increase their copy number by replicatively copying themselves into a large number of possible locations in the host genome. Further, TEs at particular locations are lost by what appears to be formal excision. Inasmuch as the number of possible locations is much larger than the number ever occupied in a particular host, some form of regulation of copy number is suggested.

Three possible mechanisms can be considered whereby copy number regulation could occur. First, there could be natural selection of hosts based on copy number. That is, individuals with relatively high copy number could have comparatively few surviving offspring. However, although it is possible to hypothesize particular schemes of selection that will stabilize the numbers of transposable elements in such a model, the regulation of copy number is not a necessary consequence of the postulate that host fitness is a monotonic decreasing function of element copy number (BROOKFIELD 1982; CHARLESWORTH and CHARLESWORTH 1983).


TABLE 1

Details of the simulations

| Parameters | Simulation 1 | Simulation 2 | Simulation 3A | Simulation 3B | Simulation 4 | Simulation 5 | Simulation 6 |
|---|---|---|---|---|---|---|---|
| θ | 0.2 | 0.2 | 0.4 | 0.4 | 2 | 5 | 10 |
| Deletion rate (μ) | 0.005 | 0.002 | 0.01 | 0.01 | 0.02 | 0.05 | 0.10 |
| Population size (2N) | 20 | 50 | 20 | 20 | 50 | 50 | 50 |
| Recombination (R) | 6.0 | 6.0 | 0.1 | 6.0 | 6.0 | 6.0 | 6.0 |
| Immigration rate (β) | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| Transposition constant (a) | 0.5 | 0.2 | 1.0 | 1.0 | 2.0 | 1.8 | 1.6 |
| No. of generations (g) | 200 | 1000 | 300 | 200 | 100 | 100 | 100 |
| No. of simulations | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
| Sample mean of A_g ᵃ | 10.34 | 10.44 | 10.53 | 10.18 | 10.48 | 6.64 | 4.61 |
| Approximation of A = E(A_∞) | 10.00ᵇ | 10.00ᵇ | 10.00ᵇ | 10.00ᵇ | 10.MIᶜ | 6.65ᶜ | 4.55ᶜ |
| Sample variance of A_g | 1.72 | 3.87 | 3.37 | 1.63 | 1.63 | 0.62 | 0.23 |

ᵃ A_g = (number of copies of the TE in the population in generation g)/2N.
ᵇ Calculated from equation (15).
ᶜ Calculated from equation (18).
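As a quick consistency check on the parameter columns of Table 1, the relation theta = 4N mu can be verified for each simulation. The sketch below copies theta, mu and 2N from the table as read here; the dictionary name `TABLE_1` and the helper `theta_from` are illustrative.

```python
import math

# (theta, mu, 2N) for each simulation, as listed in Table 1.
TABLE_1 = {
    "1": (0.2, 0.005, 20),
    "2": (0.2, 0.002, 50),
    "3A": (0.4, 0.01, 20),
    "3B": (0.4, 0.01, 20),
    "4": (2.0, 0.02, 50),
    "5": (5.0, 0.05, 50),
    "6": (10.0, 0.10, 50),
}


def theta_from(mu, two_n):
    """theta = 4 N mu, written in terms of the haploid population size 2N."""
    return 2.0 * two_n * mu


if __name__ == "__main__":
    for name, (theta, mu, two_n) in TABLE_1.items():
        ok = math.isclose(theta, theta_from(mu, two_n))
        print(f"simulation {name}: theta = {theta}, 4N*mu = {theta_from(mu, two_n):g}, ok = {ok}")
```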

A second possible mechanism of copy number regulation might be the deletion or loss process. The probability of loss might increase disproportionately with copy number so that, under a constant rate of transposition, copy number would increase to some steady state number. There is no direct evidence for this type of process, but the observation of illegitimate recombination in yeast among Ty1 elements (LIEBMAN, SHALIT and PICOLOGLOU 1981) and increased rearrangement and reversion of insertional mutants in Drosophila (BERG, ENGELS and KREBER 1980; RUBIN, KIDWELL and BINGHAM 1982) affords concrete examples of cooperative loss of TEs as gametes or as zygotes.

The third possible mechanism of regulation of copy number is transposition. There is evidence in bacteria that TEs produce a repressor that inhibits transposition; hence, when copy number is low, transposition is more likely. Analogously the high transposition rate of P factor in M cytotype (free of P factor) suggests that when copy numbers are low, the P factor transposes more often (BINGHAM, KIDWELL and RUBIN 1982). Further evidence from bacteria indicates that loss of TEs is largely host function dependent, whereas transposition and its regulation are at least partially determined in the genome of the TE (KLECKNER 1981).

The model investigated in this paper assumes that copy number is regulated by transposition and that the evolution of each site is independent of the others, i.e., deletion or loss is not copy number dependent. Much of the tractability of this model is dependent on those two assumptions. The independence of the evolution of different sites, and the existence of a stochastic equilibrium between transposition and loss, lead to a formula for the frequency spectrum of the process. This formula is the same as that obtained for the frequency spectrum of the infinite alleles model (EWENS 1979) except that it is multiplied by a constant equal to the average number of TEs per genome. For reasonable amounts of recombination, the simulation results fit the predictions quite well. Using a different approach CHARLESWORTH and CHARLESWORTH (1983) have obtained the same formula for the frequency spectrum.

FIGURE 1. A comparison of the observed (○) and expected (solid line) frequency spectra with different θ values. a, θ = 0.2; simulation conditions 1 (see Table 1). b, θ = 10; simulation conditions 6 (see Table 1). (Horizontal axis: x individuals.)

FIGURE 2. A comparison of the observed (circles) and expected (solid line) frequency spectra with different amounts of recombination. The solid circles represent the outcome of simulations under conditions 3A (see Table 1), R = 0.1. The open circles represent the outcome of simulations under conditions 3B (see Table 1), R = 6. (Horizontal axis: x individuals.)

As with the application of the infinite alleles model, it is difficult to test whether a given data set is adequately described by this model. One approach is to compare the observed frequency spectrum of the data set with that predicted by the model. To calculate the expected frequency spectrum for a given data set, estimates of $\theta$ and A must be made. This problem is discussed by KAPLAN and BROOKFIELD (1983) in the accompanying paper.

There is no single statistic that is optimal for testing goodness of fit of the observed and predicted frequency spectra, especially in the absence of a specific alternative hypothesis. A natural statistic for examining goodness of fit is $H = \sum_i f(i)^2$, which is interpreted as the expected number of sites homozygous for TEs in a randomly chosen zygote [$f(i)$ is the population frequency of site $i$]. This statistic is analogous to the expected homozygosity in the infinite alleles model (WATTERSON 1977, 1978). Several possible explanations suggest themselves if the sample value of $H$ differs significantly from the value predicted by the model. When the sample value of $H$ is lower than predicted, some unknown mechanism may be forcing the TE frequencies into the intermediate range. On the other hand, if the sample value of $H$ is higher than predicted, it is possible, based upon simulation evidence, that the TE population has not reached equilibrium and is still increasing. A more interesting explanation of such a discrepancy would be some variation among sites in the rate of deletion, $\mu$. This could be due to differential deletion rates or mildly deleterious effects on the fitness of the host. It can be proven (KAPLAN and BROOKFIELD 1983a) that any mixture of TEs with different rates of loss will have a higher expected $H$ than predicted when a single $\theta$ value is assumed. So an observed $H$ that is larger than predicted may be consistent with variation in the stability of insertions or in their effect on the fitness of the host. This latter possibility can be corroborated by comparing the autosomal data with those from the sex chromosomes, inasmuch as the latter are subject to hemizygosity and, therefore, more severe selection. If some proportion of TEs have a substantial effect on host fitness, they should be less common on the X chromosome, inasmuch as such deleterious effects are generally recessive.
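
As a minimal sketch of how the statistic might be evaluated on data, the Python fragment below takes per-site occupancy counts from a sample of chromosomes and returns the plug-in value of $H = \sum_i f(i)^2$. The function name `homozygosity_statistic`, the example counts, and the use of raw sample frequencies in place of population frequencies are assumptions made for illustration only; they are not the estimation procedure of the accompanying paper.

```python
from typing import Sequence


def homozygosity_statistic(site_counts: Sequence[int], n_chromosomes: int) -> float:
    """Plug-in value of H = sum_i f(i)^2, with f(i) = count_i / n_chromosomes."""
    if n_chromosomes <= 0:
        raise ValueError("n_chromosomes must be positive")
    total = 0.0
    for count in site_counts:
        if not 0 <= count <= n_chromosomes:
            raise ValueError("each count must lie between 0 and n_chromosomes")
        freq = count / n_chromosomes
        total += freq * freq
    return total


# Hypothetical sample: 20 chromosomes, five occupied sites.
counts = [1, 1, 2, 5, 10]
h_value = homozygosity_statistic(counts, 20)
print(round(h_value, 4))
```

Because each site contributes the square of its frequency, a few common sites dominate the sum, which is why a sample value of $H$ above the prediction points toward an excess of high-frequency sites.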

The proposed model provides a structure for approaching data on TEs from natural populations. The simulation results support the validity of the analysis of the models. The frequencies of sites in chromosomes from natural populations can be compared with those predicted by the model (MONTGOMERY and LANGLEY 1983; KAPLAN and BROOKFIELD 1983b). If the data fit, it may be concluded that the model is adequate to explain the distribution of the TE. If the data do not fit, variation in the stability or host fitness effects of TEs at different sites in the sample may be indicated. With larger samples and other TEs and host populations, more tests of the assumptions of the model will be available.

We thank STANLEY SAWYER for critically reading this paper and providing constructive criticism.

LITERATURE CITED

BERG, R., W. R. ENGELS and R. A. KREBER, 1980 Site-specific X-chromosome rearrangements from hybrid dysgenesis in Drosophila melanogaster. Science 210: 427-429.

BINGHAM, P. M., M. G. KIDWELL and G. M. RUBIN, 1982 The molecular basis of P-M hybrid dysgenesis: the role of the P element, a P-strain-specific transposon family. Cell 29: 995-1004.

BROOKFIELD, J. F. Y., 1982 Interspersed repetitive DNA sequences are unlikely to be parasitic. J. Theor. Biol. 94: 281-299.

CHARLESWORTH, B. and D. CHARLESWORTH, 1983 The population dynamics of transposable elements. Genet. Res. In press.

ERREDE, B., T. S. CARDILLO, F. SHERMAN, E. DUBOIS, J. DESCHAMPS and J. M. WIAME, 1980 Mating signals control expression of mutations resulting from insertion of a transposable repetitive element adjacent to diverse yeast genes. Cell 22: 427-436.

EWENS, W. J., 1979 Mathematical Population Genetics. Lecture Notes in Biomathematics, ed. 9, Springer Verlag, New York.

FELLER, W., 1957 An Introduction to Probability Theory and Its Applications. John Wiley and Sons, New York.

JENKINS, N. A., N. G. COPELAND, B. A. TAYLOR and B. K. LEE, 1981 Dilute (d) coat colour mutation of DBA/2J mice is associated with the site of integration of an ecotropic MuLV genome. Nature 293: 370-374.

KAPLAN, N. and J. F. Y. BROOKFIELD, 1983a Effect on homozygosity of selective differences between sites of transposable elements. Theor. Pop. Biol. In press.

KAPLAN, N. and J. F. Y. BROOKFIELD, 1983b Transposable elements in Mendelian populations. III. Statistical results. Genetics 104: 485-495.

KLECKNER, N., 1981 Transposable elements in prokaryotes. Annu. Rev. Genet. 15: 341-404.

LIEBMAN, S., L. SHALIT and S. PICOLOGLOU, 1981 TY elements are involved in the formation of deletions in DEL1 strains of Saccharomyces cerevisiae. Cell 26: 401-409.

MONTGOMERY, E. A. and C. H. LANGLEY, 1983 Transposable elements in Mendelian populations. II. Distribution of copia-like elements in natural populations. Genetics 104: 473-483.

NORMAN, F., 1974 A central limit theorem for Markov processes that move by small steps. Ann. Prob. 2: 1065-1074.

PAYNE, G. S., J. M. BISHOP and H. E. VARMUS, 1982 Multiple arrangements of viral DNA and activated host oncogene in bursal lymphomas. Nature 295: 209-214.

RUBIN, G. M., M. KIDWELL and P. M. BINGHAM, 1982 The molecular basis of P-M hybrid dysgenesis: The nature of induced mutations. Cell 29: 987-994.

SPRADLING, A. C. and G. M. RUBIN, 1981 Drosophila genome organization: Conserved and dynamic aspects. Annu. Rev. Genet. 15: 219-264.

STROBEL, E., P. DUNSMUIR and G. M. RUBIN, 1979 Polymorphisms in the chromosomal locations of elements of the 412, copia and 297 dispersed repeated gene families in Drosophila. Cell 17: 429-439.

TEMIN, H. M., 1982 Function of the retrovirus long terminal repeat. Cell 28: 3-5.

WATTERSON, G. A., 1977 Heterosis or neutrality. Genetics 85: 789-814.

WATTERSON, G. A., 1978 The homozygosity test of neutrality. Genetics 88: 405-417.

YOUNG, M. W., 1979 Middle repetitive DNA: a fluid component of the Drosophila genome. Proc. Natl. Acad. Sci. USA 76: 6274-6278.

Corresponding editor: B. S. WEIR

APPENDIX I

The general TE model can be described in the following mathematical way. For $j \ge 1$, let $S_j$ denote the set of all $2N \times j$ matrices whose elements are zeros and ones, with the additional proviso that every column have at least one nonzero element. Elements of $S_j$ that differ only by a permutation of the columns are considered to be the same. For $j = 0$, let $S_0$ denote the $2N \times 1$ matrix of all zeros. Let $S = \bigcup_{j=0}^{\infty} S_j$. The set of matrices $S$ is the state space of the general TE model. If in any generation the process is in the state $s$, and $s \in S_j$, then there are $j$ sites in the population, and the genomes in which these $j$ sites are present are given by the ones in the $j$ columns of the matrix $s$.
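
The state space just described can be sketched in code. The fragment below is one possible representation, not the authors' implementation: a state is stored as a sorted tuple of nonzero 0/1 columns, so that matrices differing only by a column permutation compare equal, and the all-zero state for $j = 0$ is represented by the empty tuple. The names `canonical_state` and `copies_per_genome` are hypothetical.

```python
from typing import Iterable, List, Tuple

Column = Tuple[int, ...]      # one site: a 0/1 entry for each of the 2N genomes
State = Tuple[Column, ...]    # canonical state: sorted tuple of nonzero columns


def canonical_state(columns: Iterable[Iterable[int]], two_n: int) -> State:
    """Drop empty sites and sort columns so permuted matrices compare equal."""
    kept = []
    for raw in columns:
        col = tuple(raw)
        if len(col) != two_n or any(x not in (0, 1) for x in col):
            raise ValueError("each column needs 2N entries, all 0 or 1")
        if any(col):          # every retained column contains at least one 1
            kept.append(col)
    return tuple(sorted(kept))


def copies_per_genome(state: State, two_n: int) -> List[int]:
    """Row sums: the number of occupied sites carried by each genome."""
    return [sum(col[g] for col in state) for g in range(two_n)]


# 2N = 4 genomes; the same three sites listed in two different column orders,
# the second time with an extra empty column that is discarded.
a = canonical_state([(1, 0, 0, 1), (0, 1, 0, 0), (1, 1, 1, 0)], 4)
b = canonical_state([(1, 1, 1, 0), (1, 0, 0, 1), (0, 1, 0, 0), (0, 0, 0, 0)], 4)
print(a == b, len(a), copies_per_genome(a, 4))
```

Sorting the columns is only one way to pick a representative of each permutation class; any fixed ordering would serve.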

Suppose in generation $t$ the state of the general TE model is $s$ and $s \in S_j$. The stochastic mechanism for going to generation $t + 1$ has two steps.

Step 1: All j existing sites in generation t are assumed to evolve independently of each other, i.e., there is free recombination between sites. For any of the j sites, the number of genomes in generation t + 1 having this site has the binomial distribution given in (1). Furthermore, the genomes in generation t + 1 having this site are randomly chosen among the 2N in the population.

Step 2: Random numbers of new sites are created each generation by transposition in each of the 2N genomes. The distribution of the number of new sites created in a particular genome in generation $t + 1$ depends on the $p_t(i)$. Furthermore, the $p_t(i)$ can be determined from the matrix $s$.
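
The two steps can be illustrated with a small simulation. The sketch below rests on several assumptions that are not taken from the text: the binomial success probability in Step 1 is taken to be $(1 - \mu)$ times the current site frequency, the $p_t(i)$ are taken to be the distribution of copy number among the 2N parental genomes, and the number of new sites per genome is drawn from a Poisson law whose mean is supplied by a hypothetical function `new_site_mean`. Each site is stored as the set of genomes carrying it.

```python
import math
import random
from typing import Callable, List, Set


def poisson_draw(mean: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def next_generation(sites: List[Set[int]], two_n: int, mu: float,
                    new_site_mean: Callable[[int], float],
                    rng: random.Random) -> List[Set[int]]:
    """One generation of the two-step scheme under the stated assumptions."""
    # Copy number of each parental genome; a random entry is a draw from p_t(i).
    parent_copies = [sum(g in s for s in sites) for g in range(two_n)]
    offspring: List[Set[int]] = []

    # Step 1: every existing site is resampled independently of the others.
    for carriers in sites:
        p = (1.0 - mu) * len(carriers) / two_n      # assumed success probability
        k = sum(rng.random() < p for _ in range(two_n))
        if k:                                       # sites with no carriers vanish
            offspring.append(set(rng.sample(range(two_n), k)))

    # Step 2: each genome founds a random number of brand-new sites.
    for genome in range(two_n):
        copies = rng.choice(parent_copies)
        for _ in range(poisson_draw(new_site_mean(copies), rng)):
            offspring.append({genome})              # a new site starts in one genome
    return offspring


rng = random.Random(1)
state: List[Set[int]] = [{0}, {1, 2}]
for _ in range(50):
    state = next_generation(state, two_n=20, mu=0.05,
                            new_site_mean=lambda c: 0.02 + 0.03 * c, rng=rng)
print(len(state))
```

Choosing the carriers with `rng.sample` places the resampled copies of a site in randomly chosen genomes, which is how the sketch represents free recombination between sites.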

The general TE model is thus a countable Markov chain and so the process has a limiting distribution (FELLER, 1957, p. 356). It only remains to determine conditions when the limiting distribution is nontrivial.

Let $R_t$ denote the number of sites in generation $t$. It follows from example (a), (6) and properties of the random deletion model that

$$\sup_t E(R_t) \le C \sup_t E(J_t) \qquad \text{(A1)}$$

where $C$ is a constant. It follows from (A1) that if $\sup_t E(J_t) < \infty$, then $\lim_{t \to \infty} P(R_t < \infty) = 1$. Thus, either the general model has a legitimate limiting distribution, or else the TE has gone extinct. In models II and III the latter possibility cannot occur inasmuch as invasion from outside the population is assumed. Hence, for models II and III, the process will have a limiting distribution if $\sup_t E(J_t) < \infty$. Clearly, whenever $r$ is bounded, this condition holds.
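
The step from (A1) to the statement about the limit can be made explicit with Markov's inequality. The following is a sketch that assumes only (A1) and that the number of sites is a nonnegative random variable; the threshold $k$ is a symbol introduced here for illustration.

```latex
P(R_t \ge k) \;\le\; \frac{E(R_t)}{k} \;\le\; \frac{C \, \sup_t E(J_t)}{k}
\qquad \text{for every } t \text{ and every } k > 0 .
```

When $\sup_t E(J_t)$ is finite, the right-hand side tends to zero as $k \to \infty$ uniformly in $t$, so no probability mass can escape toward infinitely many sites.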

APPENDIX II

Recall that $L_t$ is the total number of TEs in generation $t$. Then

where $Y_t$ is defined in (16). If $\theta$ is large, most of the $f_t(i)$ are small. Thus the distribution of $Y_t$ given the $f_t(i)$ is approximately Poisson with mean $\sum_i f_t(i) = L_t/2N$ (FELLER 1957, p. 264). Let $\Lambda_t = L_t/2N$. Then,

After rearrangement, equation (A2) can be written as

One can also show that

where

Let $\lambda$ denote the solution of

P -p + - + g(x) = 0. 2

If $\mu = \theta/4N$, $\beta = O(1/2N)$ and r(x) - 2 as $x \to \infty$, then it is not difficult to show that A - as

$N \to \infty$. The key observation is that $g(x)/r(x) \to 1$ as $x \to \infty$. Let $Z_t = \Lambda_t/\lambda$. Then with some algebra, equations (A3) and (A5) can be rewritten as

X

A'Er(Zt+i - Zt) = a(Zt) + 0(1) 645)

and

A2Ef(Zt+l - Zt)2 = - 2a - Zt + 0 (;*) - B A

where a(ZJ = -a Z t - - . One can also show with a little work that

( 2 P E f - ZJ3 s C / P . (A81

It is proved by NORMAN (1974) that equations (A5), (A6) and (A7) imply that as $N \to \infty$ the distribution of d ( Z l ~ s l - s ( t ) ) converges to a normal distribution with mean zero and variance $w(t)$, where $s(t)$ and $w(t)$ satisfy the differential equations $s'(t) = a(s(t))$ and $w'(t) = 2a'(s(t))w(t)$

+ - and $[x]$ = greatest integer in $x$. More importantly, for the purposes of this paper, NORMAN also

shows that the distribution of $\sqrt{\lambda}\,(\Lambda_t/\lambda - 1)$ converges to a normal distribution with mean zero and

variance $1/2\theta$. Thus, $\Lambda_t$ is approximately normal with mean $\lambda$ and variance $\lambda/2\theta$. Hence, if $\lambda/2\theta$ is small, it would be expected that the distribution of $\Lambda_t$ is concentrated about $\lambda$.
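
The Poisson approximation invoked at the start of this appendix can be checked numerically. The sketch below assumes that, given the site frequencies, the number of TEs in a randomly chosen genome is a sum of independent indicators, one per site, each with success probability equal to that site's frequency. This assumption is consistent with free recombination between sites but is not a restatement of (16), and the frequencies used are invented for illustration. The code computes the exact distribution by convolution and its total variation distance from a Poisson law with the same mean.

```python
import math
from typing import List, Sequence


def poisson_binomial_pmf(freqs: Sequence[float]) -> List[float]:
    """Exact pmf of a sum of independent Bernoulli(f_i) variables, by convolution."""
    pmf = [1.0]
    for f in freqs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, p in enumerate(pmf):
            nxt[k] += p * (1.0 - f)     # site absent from the genome
            nxt[k + 1] += p * f         # site present in the genome
        pmf = nxt
    return pmf


def poisson_pmf(mean: float, k: int) -> float:
    """Poisson probability of exactly k events."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def total_variation(freqs: Sequence[float]) -> float:
    """Total variation distance between the exact law and Poisson(sum of freqs)."""
    exact = poisson_binomial_pmf(freqs)
    mean = sum(freqs)
    head = sum(abs(p - poisson_pmf(mean, k)) for k, p in enumerate(exact))
    # Poisson mass beyond the largest value the exact law can take.
    tail = 1.0 - sum(poisson_pmf(mean, k) for k in range(len(exact)))
    return 0.5 * (head + max(tail, 0.0))


rare_sites = [0.02] * 100     # many rare sites, mean copy number 2
common_sites = [0.5] * 4      # few common sites, same mean copy number
tv_rare = total_variation(rare_sites)
tv_common = total_variation(common_sites)
print(tv_rare < tv_common)
```

Le Cam's inequality bounds this distance by the sum of the squared frequencies, so the approximation is good exactly when most site frequencies are small.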

2lY 8

(: )