The coalescent process - Carnegie Mellon School of...

The coalescent process

Introduction

‡ Random drift can be seen in several ways

Forwards in time: variation in allele frequency

Backwards in time: a process of inbreeding/coalescence

Allele frequencies

Random variation in reproduction causes random fluctuations in allele frequency. The variance introduced in a single generation is:

$$\operatorname{var}(p) = \frac{pq}{2N_e}$$

After many generations, the distribution can be approximated by a diffusion.

With random drift and mutation ($P \to Q$ at rate $\mu$, $Q \to P$ at rate $\nu$) the equilibrium distribution is:

$$\operatorname{prob}(p) \sim p^{4N_e\nu - 1}\, q^{4N_e\mu - 1}$$

[Figure: two plots of the equilibrium distribution of $p$. The left-hand plot is for $N_e = 2{,}500$, $\nu = 2.5 \times 10^{-5}$, $\mu = 5 \times 10^{-5}$; the right-hand plot is for $N_e = 20{,}000$.]
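The equilibrium under drift and mutation has the shape of a Beta distribution, so it can be normalised and evaluated directly. The sketch below assumes the exponents $4N_e\nu - 1$ and $4N_e\mu - 1$ given above; the function names are illustrative and not part of the original notes.

```python
import math

def equilibrium_density(p, ne, mu, nu):
    """Normalised density proportional to p^(4 Ne nu - 1) * q^(4 Ne mu - 1)."""
    a = 4.0 * ne * nu   # exponent parameter for p (mutation Q -> P)
    b = 4.0 * ne * mu   # exponent parameter for q (mutation P -> Q)
    # log of the Beta normalising constant Gamma(a+b) / (Gamma(a) Gamma(b))
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm
                    + (a - 1.0) * math.log(p)
                    + (b - 1.0) * math.log(1.0 - p))

def equilibrium_mean(mu, nu):
    """Mean allele frequency, nu / (mu + nu), independent of Ne."""
    return nu / (mu + nu)

# Ne = 2,500 gives exponents below zero (a U-shaped density);
# Ne = 20,000 gives a single interior peak.
small = equilibrium_density(0.5, 2500, 5e-5, 2.5e-5)
large = equilibrium_density(0.25, 20000, 5e-5, 2.5e-5)
```

With $N_e = 20{,}000$ the exponents are 1 and 3, so the density is $20\,p\,q^3$, which peaks at $p = 1/4$.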

The diffusion approximation can also include other forces, such as selection and migration. For example, the equilibrium distribution under mutation, random drift, and selection is:

$$\operatorname{prob}(p) \sim p^{4N_e\nu - 1}\, q^{4N_e\mu - 1}\, \bar{W}^{2N_e}$$

With heterozygote advantage (fitnesses $1-s : 1 : 1-s$), $\bar{W} = 1 - s(p^2 + q^2)$, so that $\bar{W}^{2N_e} \approx \exp[-2N_e s(p^2 + q^2)]$.

With $N_e = 2{,}500$, $\nu = 2.5 \times 10^{-5}$, $\mu = 5 \times 10^{-5}$, and $s = 0.0001, 0.001, 0.004$ (left to right):

[Figure: three plots of the equilibrium distribution of $p$ under heterozygote advantage, one for each value of $s$.]

¤ The key parameters are $N_e\mu$, $N_e\nu$, $N_e s$, which give the strength of drift relative to mutation and selection.

ü Further reading: Kimura, The neutral theory of molecular evolution, Chap.3

Identity by descent

‡ Definition

Wright (1921, 1922), Haldane & Moshinsky (1939), Cotterman (1940) and Malécot (1948) developed the idea of identity by descent.

Two genes are identical by descent if they descend from the same gene in some ancestral population.

ü Note:

- Identity by descent is distinct from identity in state.
- i.b.d. is defined relative to some ancestral reference population.
- Identity measures can extend to many genes; usually, however, we just deal with identity between pairs of genes. This is related to variance of allele frequency, correlation between genes, and homozygosity.
- Relationships among many genes are better thought of in terms of coalescence of lineages in a genealogy.

‡ The probability of identity by descent is easily calculated for pedigrees

e.g. brother-sister mating

Genes are NOT i.b.d. in this case

Probability of identity by descent is 1/4

In general, the probability that two distinct genes in a diploid individual are i.b.d. is

$$f = \sum_{\text{loops}} \left(\tfrac{1}{2}\right)^{n-1} (1 + f_A),$$

where the sum is over all loops in the pedigree, $n$ is the number of individuals in the loop, and $f_A$ is the identity between genes in the common ancestor.

Note that the random element here is in segregation, not reproduction
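The loop formula can be checked against the brother-sister example. This is a minimal sketch, assuming that $n$ counts every individual on the loop including the inbred individual itself (so a loop through one grandparent has $n = 4$: offspring, brother, grandparent, sister); that counting convention is an interpretation chosen because it reproduces the value 1/4 quoted above.

```python
from fractions import Fraction

def inbreeding_coefficient(loops):
    """Sum (1/2)^(n-1) * (1 + f_A) over pedigree loops.

    `loops` is a list of (n, f_A) pairs: n is the number of individuals
    in the loop, f_A the identity between genes in the common ancestor.
    """
    total = Fraction(0)
    for n, f_a in loops:
        total += Fraction(1, 2) ** (n - 1) * (1 + Fraction(f_a))
    return total

# Offspring of a brother-sister mating: one loop through each of the
# two shared grandparents, both assumed non-inbred (f_A = 0).
f_sib_mating = inbreeding_coefficient([(4, 0), (4, 0)])
```

Using exact fractions keeps the result readable (here 1/4) and avoids rounding when many loops are summed.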

‡ The increase in i.b.d. with random mating

ü Wright-Fisher model

Suppose that there are $2N_t$ individuals in a haploid population. In the next generation, there are $2N_{t+1}$, drawn randomly from all $2N_t$ possible parents.

Under this scheme, the number of offspring per individual is close to Poisson distributed.

The Wright-Fisher model also applies to a random-mating diploid population, provided that individuals are as likely to mate with themselves as with anyone else.

Then, the probability that two genes are i.b.d. from the previous generation is $1/(2N_t)$:

$$f_{t+1} = \frac{1}{2N_t} + \left(1 - \frac{1}{2N_t}\right) f_t, \qquad f_0 = 0$$

Writing $h_t \equiv 1 - f_t$:

$$h_{t+1} = \left(1 - \frac{1}{2N_t}\right) h_t, \quad \text{hence} \quad h_t = \prod_{i=0}^{t-1} \left(1 - \frac{1}{2N_i}\right)$$

With constant population size, $h_t$ declines by a factor $(1 - 1/2N)$ per generation, approximately as $\exp(-t/2N)$. The typical timescale for inbreeding and random drift is $2N$ generations.

With fluctuating sizes, $h_t$ declines (approximately) as $\exp\left(-\sum_{i=0}^{t-1} \frac{1}{2N_i}\right) = \exp(-t/2N_H)$, where $N_H$ is the harmonic mean population size.
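The harmonic-mean approximation is easy to compare with the exact product. The sketch below uses a made-up sequence of population sizes (alternating 1000 and 100) purely for illustration; the function names are not from the original notes.

```python
import math

def heterozygosity_remaining(sizes):
    """Exact h_t = prod_i (1 - 1/(2 N_i)), starting from h_0 = 1."""
    h = 1.0
    for n in sizes:
        h *= 1.0 - 1.0 / (2.0 * n)
    return h

def harmonic_mean(sizes):
    """Harmonic mean N_H of the population sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)

def heterozygosity_approx(sizes):
    """Approximation exp(-t / (2 N_H)), with t = number of generations."""
    return math.exp(-len(sizes) / (2.0 * harmonic_mean(sizes)))

# Hypothetical fluctuating population: 100 generations alternating in size.
sizes = [1000, 100] * 50
exact = heterozygosity_remaining(sizes)
approx = heterozygosity_approx(sizes)
```

Here the harmonic mean is about 182, far closer to the smaller size than the arithmetic mean of 550, which is why generations at small size dominate the loss of heterozygosity.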

Coalescence

The ancestry of a sample of neutral genes has a simple statistical distribution: the chance that any two lineages coalesce is $1/(2N_e)$ per generation.

More precisely:
- suppose that each gene leaves $\nu$ descendants
- as $N \to \infty$, the probability that any pair of lineages coalesce, per generation, tends to $\operatorname{var}(\nu)/(2N)$, i.e. $N_e = N/\operatorname{var}(\nu)$

The coalescent process refers to this limit, and is equivalent to the diffusion approximation.

An influential idea:
- DNA sequences are best described by their genealogy
- a variety of mutation models can be superimposed
- tracing back samples of alleles
  - speeds up simulations
  - gives statistical tests on sampled data


ü References

Hudson, R. (1990). Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7, 1-44.

Hudson, R. (1993). The how and why of generating gene genealogies. In Mechanisms of molecular evolution, ed. Takahata N & Clark AG, pp 23-36.

Donnelly, P. and S. Tavaré. (1995). Coalescents and genealogical structure under neutrality. Ann. Rev. Genet. 29, 401-421.

Rosenberg, N. A., and M. Nordborg. (2002). Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics 3, 380-390.

‡ Properties of the coalescent process

The time during which there are $k$ lineages is exponentially distributed with expectation $\frac{1}{\lambda} = \frac{2N_e}{k(k-1)/2}$:

$$P(t_k)\,dt_k = \lambda\, e^{-\lambda t_k}\, dt_k, \quad \text{where } \lambda = \frac{k(k-1)}{4N_e}$$

ü The genealogy is dominated by the deepest split.

The expected depth of the tree is:

$$2N_e\left(\frac{2}{k(k-1)} + \frac{2}{(k-1)(k-2)} + \dots + \frac{1}{6} + \frac{1}{3} + 1\right)$$

$$= 2N_e\left(\left(\frac{2}{k-1} - \frac{2}{k}\right) + \left(\frac{2}{k-2} - \frac{2}{k-1}\right) + \dots + \left(\frac{2}{2} - \frac{2}{3}\right) + \left(\frac{2}{1} - \frac{2}{2}\right)\right)$$

$$= 2N_e\left(\left(1 - \frac{2}{k}\right) + 1\right) \approx 4N_e \text{ for large } k$$

Thus, the tree collapses to 2 lineages in ~ $2N_e$ generations; these take another $2N_e$ generations to coalesce.

Hence, pairwise measures are uninformative.

ü The expected length of the genealogy is ~ $4N_e \ln(1.78\,k)$

The expected length of the tree is:

$$2N_e\left(k\,\frac{2}{k(k-1)} + (k-1)\,\frac{2}{(k-1)(k-2)} + \dots + \frac{4}{6} + \frac{3}{3} + 2\right)$$

$$= 2N_e\left(\frac{2}{k-1} + \frac{2}{k-2} + \dots + \frac{2}{3} + \frac{2}{2} + \frac{2}{1}\right)$$

$$= 4N_e \sum_{j=1}^{k-1} \frac{1}{j} \approx 4N_e \ln(1.78\,k) \text{ for large } k$$

The constant arises because $\sum_{j=1}^{k-1} 1/j \approx \ln k + \gamma$, where $\gamma \approx 0.577$ is Euler's constant and $e^{\gamma} \approx 1.78$.
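Both expectations follow from summing the mean time $4N_e/(j(j-1))$ spent with $j$ lineages. The following sketch computes the sums directly and compares the length with its logarithmic approximation; the function names are illustrative only.

```python
import math

def expected_depth(k, ne):
    """Expected time to the most recent common ancestor of k lineages.

    Sum over j = k..2 of the mean time with j lineages, 4 Ne / (j (j-1)).
    """
    return sum(4.0 * ne / (j * (j - 1)) for j in range(2, k + 1))

def expected_length(k, ne):
    """Expected total branch length: j lineages each persist for t_j."""
    return sum(j * 4.0 * ne / (j * (j - 1)) for j in range(2, k + 1))

def expected_length_approx(k, ne):
    """Large-k approximation 4 Ne ln(1.78 k)."""
    return 4.0 * ne * math.log(1.78 * k)

# Depth saturates at 4 Ne while length keeps growing like log k.
depth_50 = expected_depth(50, 1.0)
length_50 = expected_length(50, 1.0)
```

Setting `ne = 1.0` gives results in units of $N_e$ generations; the depth for 50 lineages is already within 2% of its limit of 4.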

The distribution of length is highly variable:

[Figure 1: tree length $L$ against sample size $n$. The dots show the quantiles at 0.001, 0.01, 0.1, 0.9, 0.99, 0.999.]

‡ Fluctuating population size

Changes in $N_e$ cause changes in timescale:

- The standard coalescent
- Expanding populations → "star phylogeny" (exponential growth: population was 10% of the current size at TMRCA)
- Population bottlenecks → burst of coalescence (a bottleneck equivalent to $2N_e$ 'ordinary' generations of drift)

ü Changing timescales

The "scaled time" is a measure of the total amount of genetic drift that has occurred:

$$T = \int_0^t \frac{dt'}{2N(t')}$$

For a constant population size, $T = t/(2N)$. If the population is growing at a rate $\lambda$, and the present size is $N_0$, then $N = N_0 e^{-\lambda t}$ (with $t$ measured back from the present), and so:

$$T = \int_0^t \frac{e^{\lambda t'}}{2N_0}\,dt' = \frac{1}{2N_0\lambda}\left(e^{\lambda t} - 1\right)$$

The parameter $\lambda$ is a measure of the amount of population growth over the current timescale set by population size, $2N_0$. Here is the transformation for $\lambda = 1.5$:

[Figure: scaled time (vertical axis, 0 to 12) against actual time (horizontal axis, 0 to 2).]
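The time rescaling can be written as a small function, with a numerical integral as a check that works for any size history. This is a sketch only: the function names are invented here, and the example assumes time is measured in units of $2N_0$ generations (i.e. $2N_0 = 1$), which is a guess about the units of the plot rather than something stated in the notes.

```python
import math

def scaled_time_growth(t, n0, lam):
    """Closed form T = (exp(lam t) - 1) / (2 N0 lam) for exponential growth."""
    if lam == 0.0:
        return t / (2.0 * n0)            # constant population size
    return math.expm1(lam * t) / (2.0 * n0 * lam)

def scaled_time_numeric(t, size_at, steps=10000):
    """Midpoint-rule integral of dt / (2 N(t)) for an arbitrary N(t)."""
    dt = t / steps
    return sum(dt / (2.0 * size_at((i + 0.5) * dt)) for i in range(steps))

# Assumed units: 2 N0 = 1, so N0 = 0.5, with lam = 1.5 as in the text.
n0, lam = 0.5, 1.5
closed = scaled_time_growth(2.0, n0, lam)
numeric = scaled_time_numeric(2.0, lambda t: n0 * math.exp(-lam * t))
```

Under that assumption about units, two units of actual time map to roughly 12.7 units of scaled time, because most of the drift occurred when the population was small.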

‡ Branching processes

The coalescent process only applies to samples from a large population.

If all genes are observed, we have a branching process,

e.g. discrete time: # of offspring $i$ follows a Poisson distribution with $E[i] = \lambda$

[Figure: $P$ (vertical axis, up to 1) against $\lambda$ (horizontal axis, 1 to 4).]

More generally, for $\lambda \sim 1$, $P \sim 2(\lambda - 1)/\operatorname{var}(i)$
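Assuming $P$ denotes the probability that a lineage founds a family that survives indefinitely (the notes do not define it explicitly), for Poisson offspring it satisfies the standard equation $P = 1 - e^{-\lambda P}$. The sketch below solves this by bisection and compares it with the approximation $2(\lambda - 1)/\operatorname{var}(i)$, using $\operatorname{var}(i) = \lambda$ for a Poisson distribution.

```python
import math

def survival_probability(lam, tol=1e-12):
    """Positive root of P = 1 - exp(-lam * P); zero when lam <= 1."""
    if lam <= 1.0:
        return 0.0                       # (sub)critical process dies out
    lo, hi = 1e-9, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # g(P) = 1 - exp(-lam P) - P is positive below the root
        if -math.expm1(-lam * mid) - mid > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def survival_approx(lam):
    """Near-critical approximation 2 (lam - 1) / var(i), var(i) = lam."""
    return 2.0 * (lam - 1.0) / lam

p_slow = survival_probability(1.1)   # slowly growing population
p_fast = survival_probability(2.0)   # mean of two offspring
```

For $\lambda = 1.1$ the exact value (about 0.176) is close to the approximation (about 0.182); for $\lambda = 2$ the exact value is about 0.80 and the approximation is no longer accurate.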


[Figure: two genealogies of 20 genes, one generated by the coalescent and one sampled from a branching process with $\lambda = 1.1$.]

Mutation

‡ Infinite alleles

Assuming that every mutation generates a new allele, the probability of identity in allelic state ("homozygosity") is $F = \sum_t f_t (1-\mu)^{2t}$, where $\mu$ is the mutation rate and $f_t$ is the distribution of coalescence times.

$$F \approx E[e^{-2\mu t}] = \int_0^\infty e^{-2\mu t} f_t\,dt = \int_0^\infty e^{-2\mu t}\,\frac{e^{-t/2N_e}}{2N_e}\,dt = \frac{1}{1 + 4N_e\mu}$$

Identity coefficients, $F$, can easily be calculated by going back in time one generation:

$$F = (1-\mu)^2\left(\left(1 - \frac{1}{2N_e}\right)F + \frac{1}{2N_e}\right) \;\Rightarrow\; F = \frac{(1-\mu)^2}{2N_e\left(1 - \left(1 - \frac{1}{2N_e}\right)(1-\mu)^2\right)} \approx \frac{1}{1 + 4N_e\mu}$$
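The exact one-generation solution and the approximation $1/(1 + 4N_e\mu)$ can be compared numerically. This is a minimal sketch with illustrative function names; the parameter values are borrowed from the drift-mutation example earlier in the notes.

```python
def identity_exact(ne, mu):
    """Fixed point of F = (1-mu)^2 ((1 - 1/(2 Ne)) F + 1/(2 Ne))."""
    a = (1.0 - mu) ** 2
    return a / (2.0 * ne * (1.0 - (1.0 - 1.0 / (2.0 * ne)) * a))

def identity_approx(ne, mu):
    """Continuous-time approximation 1 / (1 + 4 Ne mu)."""
    return 1.0 / (1.0 + 4.0 * ne * mu)

def identity_by_iteration(ne, mu, generations):
    """Iterate the recursion forward from F = 0 as an independent check."""
    f = 0.0
    for _ in range(generations):
        f = (1.0 - mu) ** 2 * ((1.0 - 1.0 / (2.0 * ne)) * f + 1.0 / (2.0 * ne))
    return f

ne, mu = 2500, 5e-5           # 4 Ne mu = 0.5
exact = identity_exact(ne, mu)
approx = identity_approx(ne, mu)
```

With $4N_e\mu = 0.5$ both expressions give $F \approx 2/3$; the difference is of order $\mu$ and $1/N_e$.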

Identity coefficients are generating functions for the distribution of coalescence times:

$$F \approx E[e^{-2\mu t}] \;\therefore\; F = 1 \text{ when } \mu = 0$$

$$\frac{dF}{d\mu} \approx E[-2t\,e^{-2\mu t}] \;\therefore\; \frac{dF}{d\mu} = -2E[t] \text{ when } \mu = 0$$

$$\frac{d^2F}{d\mu^2} \approx E[4t^2 e^{-2\mu t}] \;\therefore\; \frac{d^2F}{d\mu^2} = 4E[t^2] \text{ when } \mu = 0$$

‡ More general models of mutation

Bases mutate at rate $\mu$, and change to A, T, G, C with equal probability.

Probability of identity in state of two genes is:

$$F = E\left[\tfrac{1}{4}\left(1 - e^{-2\mu t}\right) + e^{-2\mu t}\right]$$

‡ Infinite sites

For DNA sequences, the 'infinite sites' model is more appropriate: each mutation is at a new site in the sequence.

Two alleles may differ by mutations at 1, 2… sites, giving a measure of the time for which they have been diverging.

If there are mutations on every internal branch, the genealogy can be reconstructed:

[Figure: genealogy of the five genes a, b, c, d, e.]

Gene   Mutation
       1  2  3  4  5  6
a      1  1  1  1  0  0
b      0  0  0  1  0  0
c      0  0  0  0  1  1
d      0  1  1  0  0  0
e      1  0  0  0  0  0

To root the tree, we must know which mutations are derived - which requires an outgroup

Any pair of sites that carries all four combinations of alleles is incompatible with a tree; this implies either recombination or multiple mutations.
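The "all four combinations" criterion is often called the four-gamete test, and it is simple to apply to a 0/1 matrix of sequences. The sketch below is illustrative: the function name and the small example data sets are made up, not taken from the notes.

```python
from itertools import combinations

def incompatible_site_pairs(haplotypes):
    """Return pairs of site indices at which all four gametes occur.

    `haplotypes` is a list of equal-length strings of '0' and '1'.
    """
    n_sites = len(haplotypes[0])
    bad = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:            # 00, 01, 10 and 11 all present
            bad.append((i, j))
    return bad

# Hypothetical data: the first set fits a tree, the second does not.
tree_like = ["110", "100", "001", "000"]
conflicting = ["11", "10", "01", "00"]
```

An empty result is necessary for the sites to fit a single tree under the infinite sites model; any reported pair points to recombination or a repeated mutation between those sites.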

The mean pairwise diversity, $\pi$, is just $E[2\mu t] = 4N_e\mu$.

The number of segregating sites, $n_s$, in a sample is proportional to the total length of the tree: $E[n_s] = \mu L$, where $L = \sum_{j=2}^{k} j\,t_j$ and $t_j$ is the time during which there are $j$ lineages.

$$E[n_s] = E[\mu L] = 4N_e\mu\left(\frac{1}{k-1} + \frac{1}{k-2} + \dots + \frac{1}{3} + \frac{1}{2} + 1\right) \approx 4N_e\mu\,\ln(1.78\,k)$$

Under neutrality, we expect a definite relation between the # of segregating sites and the pairwise diversity
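Both statistics give an estimate of $4N_e\mu$: the pairwise diversity directly, and the number of segregating sites after dividing by $\sum_{j=1}^{k-1} 1/j$ (commonly known as Watterson's estimator). The sketch below computes them from a small illustrative 0/1 matrix; the function names are chosen here for clarity.

```python
from itertools import combinations

def segregating_sites(haps):
    """Number of sites at which the sample is polymorphic."""
    return sum(1 for site in zip(*haps) if len(set(site)) > 1)

def pairwise_diversity(haps):
    """Mean number of differences between pairs of sequences."""
    pairs = list(combinations(haps, 2))
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return diffs / len(pairs)

def theta_from_segregating_sites(haps):
    """Segregating sites divided by sum_{j=1}^{k-1} 1/j."""
    k = len(haps)
    return segregating_sites(haps) / sum(1.0 / j for j in range(1, k))

# Illustrative sample of five sequences at nine sites.
haps = ["000110100", "000110011", "111001000", "111000000", "000000000"]
n_seg = segregating_sites(haps)
pi_hat = pairwise_diversity(haps)
theta_s = theta_from_segregating_sites(haps)
```

Under neutrality the two estimates should agree on average, so a large discrepancy between them is the basis for a test; with only five sequences, as here, the sampling variation is substantial.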

Recombination

‡ Ancestral graphs

With sexual reproduction, genomes have multiple ancestors. Ancestry is described by an ancestral graph:

Coalescence amongst $k$ lineages at a rate $\frac{k(k-1)}{2}\,\frac{1}{2N_e}$

Recombination at a rate $kr$

Pattern depends on $R = 2N_e r$

Each recombination generates a pair of unique junctions. Junctions can disappear if they meet each other in a coalescence.

At any time, any one genome is distributed across several ancestral lineages

$$1 + R - \frac{R^2}{3} + \frac{13}{54}R^3 + O(R^4)$$

(Derrida & Jung-Muller 1999)

‡ Example: R = 50

Number of ancestral lineages:

[Figure: number of ancestral lineages; axis ticks run from 2 to 8 (horizontal) and 5 to 25 (vertical).]

A typical sample, with 18 ancestors:

Six sampled genomes represented by colours ($t/2N_e = 0.6$):


‡ Looking along the genome....

Different regions have different genealogies:


ü Patterns of diversity vary along the genome:

Numbers of segregating sites (20 sampled genomes; $\theta = 4N_e\mu = 30$; sliding window width 0.5):

[Figure: three plots of the number of segregating sites (vertical axis, 0 to 20) against position along the genome (horizontal axis, 0 to 10).]

Mean number of pairwise differences:

[Figure: three plots of the mean number of pairwise differences (vertical axis, 0 to 5) against position along the genome (horizontal axis, 0 to 10).]

‡ Pedigrees - or an infinitely long genome

[Figure: Probability of ancestor repetitions in the genealogical tree of King Edward III. The continuous and dashed lines show simulations of $F[r]$ in a closed population with $2^{11}$ and $2^{12}$ individuals for our model.]

[Figure: Distribution $H[r, t]$ of $r$ repetitions after $t$ generations. $t$ = 9, 13, 15, 17, 19, 21, and 23 for a population with $N = 2^{15}$.]

Derrida, B., S. C. Manrubia, and D. H. Zanette. 1999. Statistical properties of genealogical trees. Physical Review Letters 82:1987-1990.

‡ Forwards in time

What is the fate of a single ancestral genome? In an infinitely large population, this is a branching process.

The chance that the pedigree will survive is ~ 80%.

Any finite piece of genome is certain to be lost, but very slowly.

[Figure: The probability of survival of a neutral genome (S = 0) as a function of map length, R. From top to bottom, the curves show $P_t[R]$ for t = 0, 1, 2... 10; 20, 30...100; and 200, 300...1000 generations.]

[Figure: The distribution of blocks of genome that remain after 50 generations; map length R = 1. The two panels show two random realisations of this process. Each line represents one genome.]

[Figure: (a) The increase in mean block number over time (±1 standard error), compared with the expectation 1+Rt. (b) The mean amount of ancestral material over time, compared with the constant expectation R. (c) The probability of survival, P, compared with the value calculated from Eq. 2. (d) The distribution of block sizes at time t = 30 compared with the expectation. (R=1).]


‡ What do we see?

What is the relation between the ancestry of segments of genome, and the patterns we see?

Patil et al. 2001 Science 294:1719
- 21,676,868 bases, 36000 SNPs
- ~4000 "blocks" identified
- ~2700 SNPs capture ~80% of haplotype variation

What is the actual structure of these 20 chromosomes?


Selection on linked sites

‡ Balancing selection

ü Complete linkage

Kreitman & Aguade (Genetics, 1986) observed excess polymorphism in the Adh region of D. melanogaster.

Hudson, Kreitman & Aguade (Genetics, 1987) introduced the "HKA test" to detect balancing selection.

A polymorphism with two alleles P, Q divides linked markers into two separate gene pools.

Eventually, there will be a set of alleles with homozygosity $\frac{1}{1+4Np\mu}$ associated with P, and a distinct set associated with Q, with homozygosity $\frac{1}{1+4Nq\mu}$. The overall homozygosity is:

$$F = \frac{p^2}{1 + 4N\mu p} + \frac{q^2}{1 + 4N\mu q}$$

e.g. $1-F$ vs $p$ for $4N\mu = 0.1$ (bottom) and $4N\mu = 1$ (top):

[Figure: $1-F$ (vertical axis, 0 to 1) against $p$ (horizontal axis, 0 to 1).]
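The overall homozygosity under complete linkage is a one-line calculation. This sketch takes $\theta = 4N\mu$ as a single argument; the function name is illustrative.

```python
def homozygosity_complete_linkage(p, theta):
    """F = p^2 / (1 + theta p) + q^2 / (1 + theta q), with theta = 4 N mu."""
    q = 1.0 - p
    return p * p / (1.0 + theta * p) + q * q / (1.0 + theta * q)

# Heterozygosity 1 - F at an even polymorphism and at fixation of P.
h_even = 1.0 - homozygosity_complete_linkage(0.5, 0.1)
h_fixed = 1.0 - homozygosity_complete_linkage(1.0, 0.1)
```

At $p = 1$ the expression reduces to the neutral value $1/(1 + 4N\mu)$, while at $p = 1/2$ heterozygosity is raised to just over one half because genes linked to different alleles never share ancestry.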

ü Recombination

We must follow identities between genes both associated with P, $F_{PP}$, both with Q, $F_{QQ}$, or one with each, $F_{PQ}$.

$$F_{PP}' = (1 - rq)^2 F_{PP} + 2rq(1 - rq) F_{PQ} + r^2 q^2 F_{QQ}$$

Assuming $r$ small:

$$\delta F_{PP} = 2rq\,(F_{PQ} - F_{PP})$$
$$\delta F_{PQ} = r\,(q F_{QQ} + p F_{PP} - F_{PQ})$$
$$\delta F_{QQ} = 2rp\,(F_{PQ} - F_{QQ})$$

The effects of mutation and drift can be found in a similar way. Overall:

$$\delta F_{PP} = -2\mu F_{PP} + 2rq\,(F_{PQ} - F_{PP}) + \frac{1 - F_{PP}}{2Np}$$
$$\delta F_{PQ} = -2\mu F_{PQ} + r\,(q F_{QQ} + p F_{PP} - F_{PQ})$$
$$\delta F_{QQ} = -2\mu F_{QQ} + 2rp\,(F_{PQ} - F_{QQ}) + \frac{1 - F_{QQ}}{2Nq}$$

At equilibrium, $\delta F = 0$. The average $F$ is:

$$\bar{F} = \frac{2 + \rho - 4pq\left(1 - N\mu\,(2 + 3\rho + \rho^2)\right)}{2 + \rho + 4N\mu\left(2 + (1 + 4pq)\rho + pq\rho^2\right) + 16N^2\mu^2\,pq\,(2 + 3\rho + \rho^2)}$$

where $\rho = r/\mu$.

Note that the effect is only over recombination rates of order $\mu$.
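The equilibrium average can be evaluated directly to see how quickly the effect decays with recombination. The sketch below codes the expression for the average $F$ in terms of $N\mu$ and the ratio $r/\mu$ (called `rho` here); the function name is illustrative, and the two limits are checks rather than results quoted from the notes.

```python
def mean_identity(p, n_mu, rho):
    """Equilibrium average F for allele frequency p, N*mu and rho = r/mu."""
    q = 1.0 - p
    poly = 2.0 + 3.0 * rho + rho * rho
    num = 2.0 + rho - 4.0 * p * q * (1.0 - n_mu * poly)
    den = (2.0 + rho
           + 4.0 * n_mu * (2.0 + (1.0 + 4.0 * p * q) * rho + p * q * rho * rho)
           + 16.0 * n_mu ** 2 * p * q * poly)
    return num / den

n_mu = 0.025                                   # 4 N mu = 0.1
f_linked = mean_identity(0.5, n_mu, 0.0)       # complete linkage
f_mid = mean_identity(0.5, n_mu, 2.0)          # r = 2 mu
f_free = mean_identity(0.5, n_mu, 1e6)         # effectively unlinked
```

With no recombination this reproduces the complete-linkage value $1/(2 + 4N\mu)$ at $p = 1/2$, and for very large `rho` it approaches the neutral value $1/(1 + 4N\mu)$; most of the change happens while `rho` is of order one.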

ü Plot of heterozygosity $(1 - \bar{F})$ against $r/\mu$ for $4N\mu = 0.1$

[Figure: heterozygosity (vertical axis, 0 to 1) against $r/\mu$ (horizontal axis, 0 to 10).]

    (* Equilibrium identities: set each rate of change to zero and solve.
       Here m is the mutation rate, r the recombination rate, n the population size. *)
    ss = Solve[{
        0 == -2 m FPP + 2 r q (FPQ - FPP) + (1 - FPP)/(2 n p),
        0 == -2 m FPQ + r (q FQQ + p FPP - FPQ),
        0 == -2 m FQQ + 2 r p (FPQ - FQQ) + (1 - FQQ)/(2 n q)},
        {FPP, FPQ, FQQ}];

    (* Rescale so that the result is expressed in terms of rho = r/m; g cancels *)
    ({FPP, FPQ, FQQ, p^2 FPP + 2 p q FPQ + q^2 FQQ} /. ss[[1]] /.
        {m -> g m, r -> g rho m, n -> 1/(g nn), q -> 1 - p} // Cancel) /.
        nn -> 1/n // Simplify

    (* Common denominator of the four results *)
    den = -2 - rho + 16 n^2 (-1 + p) p m^2 (2 + 3 rho + rho^2) +
        4 n m (-2 + (-1 - 4 p + 4 p^2) rho + (-1 + p) p rho^2);

    (* Output: FPP, FPQ, FQQ and the average F
       {((2 + rho) (-1 + 4 n (-1 + p) m (1 + p rho)))/den,
        (rho (-1 + 4 n (-1 + p) p m (2 + rho)))/den,
        ((2 + rho) (-1 + 4 n p m (-1 + (-1 + p) rho)))/den,
        -(2 + rho + 4 p (-1 + n m (2 + 3 rho + rho^2)) -
            4 p^2 (-1 + n m (2 + 3 rho + rho^2)))/den} *)

    (* Heterozygosity 1 - average F against rho, for 4 n m = 0.1 and p = 1/2 *)
    Plot[1 + (2 + rho + 4 p (-1 + n m (2 + 3 rho + rho^2)) -
            4 p^2 (-1 + n m (2 + 3 rho + rho^2)))/den /.
        {n -> 0.025/m, p -> 1/2, rho -> Abs[rho]}, {rho, 0, 10},
        PlotRange -> {{0, 10}, {0, 1}}];

‡ Selective sweeps

Fixation of a single favourable mutation carries with it a segment of linked genome


[Figure: diagram of a selective sweep, with labels: mutation; branching process; ns >> 1; deterministic increase; p << 1; fixation; sample.]

An example: $s = 0.1$, $N = 10^5$, sampled when $p = 0.1$. $r = \{-0.05, 0.15\}$; $s/\log(2N) = 0.008$.

Fixation takes ~ $\frac{\log(2N)}{s}$ generations, so a region of $r \sim \frac{s}{\log(2N)}$ has reduced diversity.
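The two scalings can be evaluated for the example parameters. This is a rough sketch of the order-of-magnitude argument only, using natural logarithms; the function names are illustrative.

```python
import math

def sweep_duration(s, n):
    """Approximate time to fixation, log(2N)/s generations."""
    return math.log(2.0 * n) / s

def sweep_width(s, n):
    """Approximate map distance over which diversity is reduced, s/log(2N)."""
    return s / math.log(2.0 * n)

s, n = 0.1, 1e5
duration = sweep_duration(s, n)   # roughly 120 generations
width = sweep_width(s, n)         # roughly 0.008, as quoted in the example
```

The width is the recombination rate at which about one crossover is expected during the sweep, which is why the two quantities are reciprocals of one another.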


ü References

Maynard Smith, J., and J. Haigh. 1974. The hitch-hiking effect of a favourable gene. Genet. Res. 23:23-35.

Hudson, R. R., and N. L. Kaplan. 1988. The coalescent process in models with selection and recombination. Genetics 120:831-840.

Kaplan, N. L., R. R. Hudson, and C. H. Langley. 1989. The hitch-hiking effect revisited. Genetics 123:887-899.

Barton, N. H. 2000. Genetic hitch-hiking. Philosophical Transactions of the Royal Society (London) B 355:1553-1562.

Kim, Y., and W. Stephan. 2002. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160:765-777.

Gillespie, J. H. 2001. Is the population size of a species relevant to its evolution? Evolution 55:2161-2169.

Monte Carlo methods

ü Generalities

How can we make inferences from genetic data?
- statistics such as # of segregating sites, pairwise diversity…
- likelihood: the probability of observing the data, given some hypothesis

Statistical inference:
- significance tests
- likelihood
- Bayesian inference

ü Griffiths-Tavaré

Griffiths, R. C., and S. Tavaré. 1994. Simulating probability distributions in the coalescent. Theoretical Population Biology 46:131-159.

We observe some configuration of mutations:

       1  2  3  4  5  6  7  8  9
    a  0  0  0  1  1  0  1  0  0
    b  0  0  0  1  1  0  0  1  1
    c  1  1  1  0  0  1  0  0  0
    d  1  1  1  0  0  0  0  0  0
    e  0  0  0  0  0  0  0  0  0

This configuration was produced by this genealogy:

[Figure: genealogy with leaves, from left to right: e, c, d, a, b.]

This rooted genealogy cannot be fully reconstructed, because there were no mutations along the branches leading down to e and to {a,b,c,d}.

ü The algorithm (exact version):

- Work back along the genealogy, until the most recent mutation or coalescence.

- Sites can only lose a mutation if that mutation is represented only in one leaf; let there be $J$ such sites. (In the example above, sites 6, 7, 8, 9 are singletons; $J = 4$.)

- A pair of lineages can only coalesce if they carry the same set of mutations; let there be $K$ such pairs. In the example, there are no such possibilities: $K = 0$.

- With $n$ lineages, the rate of events is $\lambda_n = n\frac{\theta}{2} + \frac{n(n-1)}{2}$, where $\theta = 4N_e\mu$; a sum is taken over these events, with the appropriate probability, and expressed in terms of the probabilities of the simpler configurations generated by loss of a mutation or coalescence.

- This sum over $J+K$ possible previous configurations is weighted by the overall weight $\frac{1}{\lambda_n}$:

$$P[S] = \frac{1}{\lambda_n}\left(\sum_{j=1}^{J} \frac{\theta}{2}\, P[S_j^*] + \sum_{k=1}^{K} P[S_k^*]\right), \quad \text{where } \lambda_n = n\frac{\theta}{2} + \frac{n(n-1)}{2}$$

$S_j^*$ represents deletion of the $j$'th singleton site from $S$, and $S_k^*$ the coalescence of the $k$'th pair.

This algorithm becomes extremely slow for large numbers of mutations and lineages.
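The counts that drive each step of the recursion can be read straight off the mutation matrix. The sketch below only computes the number of removable singleton sites, the number of coalescible pairs and the total event rate for one configuration; it is not a full implementation of the recursion, and the function names are illustrative.

```python
from itertools import combinations

def count_moves(haps):
    """Return (J, K) for a configuration given as strings of '0' and '1'.

    J: sites whose mutation is carried by exactly one lineage.
    K: pairs of lineages carrying identical sets of mutations.
    """
    j = sum(1 for site in zip(*haps) if site.count("1") == 1)
    k = sum(1 for a, b in combinations(haps, 2) if a == b)
    return j, k

def event_rate(n, theta):
    """Total rate of events with n lineages: n theta / 2 + n (n - 1) / 2."""
    return n * theta / 2.0 + n * (n - 1) / 2.0

# The configuration of mutations shown earlier in this section (genes a to e).
config = ["000110100", "000110011", "111001000", "111000000", "000000000"]
moves = count_moves(config)
```

For this configuration the counts agree with the values quoted in the steps above (four singleton sites, no coalescible pairs); once singletons are removed, lineages become identical and coalescences become possible.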

ü Monte Carlo version:

A Monte Carlo estimate can be made by sampling possible paths back through the genealogy, with relative probability $\frac{\phi}{2}$ for possible losses of mutations, and 1 for possible coalescences:

$$P[S] = \left(\frac{\theta}{\phi}\right)^{m} E\left[\prod_i \frac{1}{\lambda_i}\left(\frac{\phi}{2} J_i^* + K_i^*\right)\right]$$

where $J_i^*$ is the number of possible losses of mutations, $K_i^*$ the number of possible coalescences, $m$ the number of segregating sites, and $i$ the current # of lineages.

The parameter $\phi$ can be chosen arbitrarily: it should take a value which minimises the variance of the estimator. Note that while $\phi = \theta$ seems natural, it does not give an optimal estimator.

ü Other applications:

Joint estimation of recombination and mutation ($4N_e r$, $4N_e\mu$):

Kuhner, M. K., J. Yamato, and J. Felsenstein. 2000. Maximum likelihood estimation of recombination rates from population data. Genetics 156:1393-401.

Fearnhead, P., and P. Donnelly. 2001. Estimating recombination rates from population genetic data. Genetics 159:1299-1318.

Estimation of population structure:

Beerli, P., and J. Felsenstein. 2001. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proceedings of the National Academy of Sciences (U.S.A.) 98:4563-4568.

