
Introduction to Coalescent

Pavlos Pavlidis

Pavlos Pavlidis () Introduction to Coalescent 2013/02 1 / 91


Methods in population genetics

Workflow in population genetics studies:

Population genetics Materials and Methods

Sample a population

Study the sample

Infer parameters for the population

Learn about the past of the population

Generalize the results for the whole population


Example of a population genetics study

The Genographic project

National Geographic project

A person can order a kit from the website and provide a DNA sample to the project

Data are analyzed and conclusions about the human population are made


The Genographic Project

It uses a sample of modern humans

Understand population processes and population parameters (e.g. the migration rate)

Learn the history of the population


The phylogeny of languages


We are interested in parameters of the population

Was the population size constant during its history?

Has it evolved neutrally or are there signs of selection?

What is the mutation rate, what is the recombination rate?

Is there gene flow? (migration)


We want to answer these questions by analyzing population samples


Ideas from classical statistics

In statistics we often want to know the parameters of some distribution.


Some distributions and their parameters

[Figure: density plots of four distributions, N(0,1), N(0, 10), Exp(1) and Exp(10); x-axes: x from -10 to 10 for the normal panels and x2 from 0 to 10 for the exponential panels; y-axes: dnorm(x, 0, 1), dnorm(x, 0, 10), dexp(x2, 1), dexp(x2, 10)]


Sampling from a distribution provides information about the properties of the distribution

Inferring population parameters

A major goal in population genetics (similar to statistics) is to estimate the parameters of the population by studying population samples.

Can you name some population parameters?
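The statistical idea on this slide can be tried out directly. The following is a minimal sketch, not part of the original slides: it draws samples from N(0, 10) and Exp(10) and recovers the parameters from the samples alone. It assumes that N(0, 10) means mean 0 and standard deviation 10, and that Exp(10) means rate 10, as the dnorm and dexp labels of the figure suggest. The sample size and the seed are arbitrary choices.

```python
import random
import statistics


def estimate_normal_parameters(sample):
    """Return (mean, standard deviation) estimated from a sample."""
    return statistics.fmean(sample), statistics.stdev(sample)


def estimate_exponential_rate(sample):
    """Return the maximum-likelihood estimate of the rate: 1 / sample mean."""
    return 1.0 / statistics.fmean(sample)


rng = random.Random(1)  # fixed seed so that repeated runs give the same numbers

# Assumed parameterization: N(mean, sd) and Exp(rate).
normal_sample = [rng.gauss(0.0, 10.0) for _ in range(10_000)]
exp_sample = [rng.expovariate(10.0) for _ in range(10_000)]

mean_hat, sd_hat = estimate_normal_parameters(normal_sample)
rate_hat = estimate_exponential_rate(exp_sample)

print(f"N(0, 10): estimated mean = {mean_hat:.2f}, estimated sd = {sd_hat:.2f}")
print(f"Exp(10):  estimated rate = {rate_hat:.2f}")
```

The estimates should land close to the true values, and they get closer as the sample grows. Population genetics follows the same logic, except that the "distribution" is over sets of sequences.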


Hypothesis testing

Does a neutral model (null hypothesis) or a selection model fit the data better?

Was the population size constant (null hypothesis) or did it change over time?

Was the migration rate 0 (null hypothesis) or did migration occur during the evolution of the population?


In population genetics there are several parameters

Population genetics parameters define ‘the probability distribution of sequences’.

θ = 4Nµ is the population-scaled mutation rate.

ρ = 4Nr is the population-scaled recombination rate.

M = 4Nm is the population-scaled migration rate.

These parameters will be explained later.
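As a small arithmetic illustration of the three formulas, here is a sketch that computes θ, ρ and M from per-generation rates. All numbers are hypothetical and chosen only for illustration: the population size, the recombination rate and the migration rate are assumptions, while the per-site mutation rate and the sequence length are borrowed from the hypothesis testing example in these slides.

```python
def scaled_parameters(N, mu, r, m):
    """Return (theta, rho, M) = (4*N*mu, 4*N*r, 4*N*m)."""
    return 4 * N * mu, 4 * N * r, 4 * N * m


# All values below are hypothetical and chosen only for illustration.
N = 10_000          # population size (assumption)
mu = 1e-8 * 1000    # per-locus mutation rate: 1e-8 per base pair times 1000 base pairs
r = 1e-8 * 1000     # per-locus recombination rate (assumed equal to mu here)
m = 1e-4            # fraction of migrants per generation (assumption)

theta, rho, M = scaled_parameters(N, mu, r, m)
print(f"theta = {theta:.2f}, rho = {rho:.2f}, M = {M:.2f}")
```

Each parameter combines a per-generation rate with the population size, so the same θ can result from a large population with a low mutation rate or a small population with a high one.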


Mutation rate: 0.1 versus 0.5

TCCGGTTCCCATTCATATGG
TCCGGTTCCCATTATCTTGG
TCCGGTTCCCATTCATCTGG
TCCGGTTCCCATTCATCTGG
TCCGGTTCCCATTCATCTGG

TGACCACTGCCCAAACAGCT
AATAGAAGAGGTGTACGCCC
AAGGGCCCCCCTGAACGCAC
AATGGCCCAGCTGTACGTGC
AAGGGCCCGGCTGTACGTAC


Recombination rate: 0.1 versus 0.5

CTCCTGCCCCTGAGCGGATG
TTACTCCAACACAGCAGATG
TTACTCCAACACAGCAGATG
TTTGCCACATCCGGCGGATA
TTTGCCACATCCGGTGGATA

TGACCACTGCCCAAACAGCT
AATAGAAGAGGTGTACGCCC
AAGGGCCCCCCTGAACGCAC
AATGGCCCAGCTGTACGTGC
AAGGGCCCGGCTGTACGTAC


Samples and parameters

Parameter values affect what samples look like.

Samples contain information about the parameter values of the population


Hypothesis testing example

Assume a sample of k := 12 homologous genes, that is, k individuals.

Hypothesis: Our sample is from a constant (over time) population of N individuals.

The length of the sequences is l := 1000.

The mutation rate per base pair, per generation is µ := 10^-8.

We observe s = 20 segregating sites (SNPs, Single Nucleotide Polymorphisms).

Is our hypothesis correct?

How probable is it to observe s ≤ 20 segregating sites for a population of size N?


Ideas?

What would you do to solve this problem?


What we need is . . .

. . . a way to calculate sample distributions for population genetics samples


Summary 1

We need methods to study samples

In population genetics studies we are often interested in sample distributions: number of SNPs, pairwise differences between sequences

Sample distributions are often related to hypothesis testing: What is the probability of observing ≤ 20 segregating sites (SNPs) in a sample of 12 individuals given a mutation rate µ and a population size N?

To study sample distributions either analytically or by simulations we need a model for samples

Individuals in a population genetics sample are not independent: They are related due to co-ancestry

We will see later on that the coalescent is a natural way to study sample distributions either analytically or via simulations


The coalescent

is a model that describes the relationships within a sample from the present individuals (sequences) back to the most recent common ancestor.

Thus, it provides a natural way to model samples from populations.

Used for simulations, the coalescent model allows us to repeatedly draw samples from a population with given parameters.

One can also study the properties of the coalescent analytically.

It allows us to estimate population genetics parameters . . .

. . . and to perform hypothesis testing.

The coalescent is NOT a tree reconstruction method. It is a sampling method.


The coalescent differs from the phylogenetic tree concept!

The goals of population genetics analyses usually differ from those in phylogenetics.

In phylogenetics the goal is to obtain the tree that best describes the data. The research question is: What are the evolutionary relationships between the given set of sequences?


In population genetics:

we want to learn something about the population

1 but the genealogy is used exclusively to obtain statistical samples

2 or to analytically infer the properties of samples.

3 there is no single tree (no ‘the tree’): we calculate statistics over (all) possible genealogies

4 A genealogy is a rooted, strictly binary tree!


Let’s see a coalescent . . .

Figure: Schematically the coalescent looks like a tree (leaves labeled 1 to 6)

Figure: It is often drawn upside-down (leaves labeled 1 to 6)


Terminology on the coalescent

coalescent, coalescent tree, genealogy, coancestry, . . .

Branches

Most Recent Common Ancestor (MRCA)

Height

Nodes or Coalescent Events

Total Length

Leaves


Let’s see how a coalescent is built


Constructing a coalescent tree: The discrete way

Assume a population of 20 individuals.

Sample individuals (n = 5). Sampling is random.

Let’s go one generation in the past . . .

. . . and choose parents . . .

. . . let’s go backwards one more generation . . .

. . . and choose parents again . . .

[Figure: successive generations of the population, from t = 0 (present) back to t = 10]


Building a coalescent within the population: The discrete way

What do we need to know to construct a coalescent?

the sample size: 5

the population size of each generation: 20, and we assume that it remains constant

the probability of each individual to be chosen as a parent (here Uniform = neutral)

Why is Uniform = neutral?
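The three ingredients listed above are enough to sketch the discrete construction in code. The following is a minimal illustration, not the author's implementation: the function name, the seed and the returned event list are assumptions made for this example. Each sampled lineage picks its parent uniformly among the 20 individuals of the previous generation, and lineages that pick the same parent merge.

```python
import random


def simulate_discrete_coalescent(pop_size=20, sample_size=5, rng=None):
    """Trace a sample backwards in time under uniform (neutral) parent choice.

    Returns a list of (generation, lineages_before, lineages_after) tuples,
    one for each generation in which at least two lineages shared a parent.
    """
    rng = rng or random.Random()
    # Sampling is random: choose which present-day individuals form the sample.
    lineages = set(rng.sample(range(pop_size), sample_size))
    events = []
    generation = 0
    while len(lineages) > 1:
        generation += 1
        before = len(lineages)
        # Every lineage chooses its parent uniformly from the previous
        # generation; using a set merges lineages that chose the same parent.
        lineages = {rng.randrange(pop_size) for _ in lineages}
        if len(lineages) < before:
            events.append((generation, before, len(lineages)))
    return events


events = simulate_discrete_coalescent(rng=random.Random(42))
for generation, before, after in events:
    print(f"t = {generation}: {before} -> {after} lineages")
print("MRCA reached at t =", events[-1][0])
```

In such a small population more than two lineages can pick the same parent in one generation, so a simulated genealogy is not always strictly binary. Note also that only the sampled lineages are followed, never the other members of the population.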


Why are backward simulations faster than forward simulations?

But...

Anafits by Andre Aberer


In forward simulations we need to simulate the entire population at each generation

[Figure: the entire population at generations t = 1 and t = 0 (present)]

Then uniformly choose the n = 5 individuals of the sample.


In backward simulations we only care about the present sample size (we know the population size)

[Figure: two panels, Forward and Backward, each showing generations t = 0 (present) and t = 1]
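A rough way to see why the backward approach is cheaper is to count how many parent assignments each approach needs. The sketch below is only an illustration under the slide's numbers (population size 20, sample size 5), with an arbitrary choice of 10 generations. The function names are made up for this example, and the count ignores all other bookkeeping costs.

```python
import random


def parent_draws_forward(pop_size, generations):
    """Forward in time, every individual of every generation needs a parent."""
    return pop_size * generations


def parent_draws_backward(pop_size, sample_size, generations, rng):
    """Backward in time, only the ancestors of the sample need a parent."""
    lineages = set(rng.sample(range(pop_size), sample_size))
    draws = 0
    for _ in range(generations):
        draws += len(lineages)
        # Lineages that choose the same parent merge, so the work per
        # generation can only shrink as we go further into the past.
        lineages = {rng.randrange(pop_size) for _ in lineages}
    return draws


rng = random.Random(7)
forward = parent_draws_forward(pop_size=20, generations=10)
backward = parent_draws_backward(pop_size=20, sample_size=5, generations=10, rng=rng)
print("forward parent draws: ", forward)
print("backward parent draws:", backward)
```

The forward count grows with the population size, while the backward count is bounded by the sample size per generation. For realistic population sizes the difference becomes very large.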


Assumptions of the evolutionary model used in this simple version of the coalescent (J. Kingman)

We assume the Wright-Fisher model:

discrete, non-overlapping generations

haploid individuals

population size is constant

neutrality: all individuals are equally fit to survive

no population structure, no migration, . . .

no recombination

Later we will see how to relax some of these assumptions.


Simple mathematical formulas and the coalescent Drawing the coalescent, and the coalescent of two individuals

The probability that a pair of individuals coalesced in the previous generation

This is the probability that two individuals in generation t = i had the same parent in generation t = i + 1.

The second individual has the same parent as the first with probability $p_N = \frac{1}{N}$.

Notice: implicitly we assume neutrality!!
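The value 1/N can be checked by exact enumeration for a small population. This is a sketch only, assuming that both individuals pick their parent uniformly and independently; the function name `prob_same_parent` is made up for this example.

```python
from fractions import Fraction
from itertools import product

def prob_same_parent(pop_size):
    """Exact probability that two individuals picked the same parent,
    when each picks one of pop_size parents uniformly at random."""
    # All equally likely (parent of first, parent of second) combinations.
    pairs = list(product(range(pop_size), repeat=2))
    same = sum(1 for a, b in pairs if a == b)
    return Fraction(same, len(pairs))

p = prob_same_parent(50)
```

Uniform picking is exactly where neutrality enters: a fitter parent would be chosen more often and the probability would no longer be 1/N.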


Question 1

What is the expected number of generations that will pass until a coalescent event occurs (given a sample size of 2 individuals)?

Intuitively, this is just N generations.


More formally

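Geometric distribution with probability of success p: $p(x) = (1-p)^{x-1}\,p$

How many trials do we expect on average until the first success? $E(x) = \sum_{x=1}^{\infty} x\,p(x) = \sum_{x=1}^{\infty} x(1-p)^{x-1}p = \frac{1}{p}$.

The mean value of a geometric distribution with parameter $p = \frac{1}{N}$ is $\frac{1}{p} = N$.

The variance of a geometric distribution is $\frac{1-p}{p^2}$.

With $p = \frac{1}{N}$ the variance is $N(N-1)$, so the standard deviation is almost as large as the mean; thus, the waiting time will often be much longer or much shorter than N generations.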
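The mean and the variance can be verified numerically by truncating the infinite sums. This is a rough sketch; the function name `geometric_moments`, the cut-off `n_terms` and the choice N = 100 are assumptions made for illustration.

```python
def geometric_moments(p, n_terms=200_000):
    """Approximate the mean and variance of p(x) = (1 - p)**(x - 1) * p
    by truncating the infinite sums after n_terms terms."""
    mean = 0.0
    second_moment = 0.0
    prob = p  # p(1)
    for x in range(1, n_terms + 1):
        mean += x * prob
        second_moment += x * x * prob
        prob *= 1.0 - p  # p(x + 1) = p(x) * (1 - p)
    return mean, second_moment - mean ** 2

N = 100
mean, var = geometric_moments(1.0 / N)
```

For N = 100 the sums should come out close to the closed forms N = 100 and N(N-1) = 9900.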


Summary 2

We can construct a coalescent tree of a random sample backward in time.

Neutrality is implied by picking parents uniformly at random.

For a sample of 2 individuals:

The probability of a coalescent event happening one generation before is $\frac{1}{N}$.

The waiting time until this first coalescent event follows a geometric distribution with parameter $p = \frac{1}{N}$.

When there are two individuals, we will have to wait on average for N generations until they coalesce.

The variance of the waiting time is large!


Questions

Assume two populations A and B and two individuals from each of these two populations, A1, A2, B1, B2. Which coalescent tree is expected to be deeper on average if $N_A < N_B$ (A1, A2 or B1, B2)?

Assume two human individuals and two Drosophila individuals ($N_{human} = 10000$, $N_{Droso} = 1000000$). Which coalescent is expected to be deeper on average?


Simple mathematical formulas and the coalescent The coalescent for n > 2 individuals

Usually the sample size is greater than 2

[Figure: genealogy of the sample traced backward in time, generation by generation, from t = 0 (present) to t = 10]


What is the probability of a coalescent event in one generation?

[Figure: the sample at t = 0 (present) and the previous generation at t = 1]


What is the probability of a coalescent event occurring one generation back

It is actually easier to calculate the probability that a coalescent event does not occur in a generation.

If the sample size is k, and the population size is 2N, then . . .

one individual chooses its parent.

the second individual chooses a DIFFERENT parent with probability $\frac{2N-1}{2N}$.

the third individual chooses a DIFFERENT parent with probability $\frac{2N-2}{2N}$.

the k-th individual chooses a DIFFERENT parent with probability $\frac{2N-k+1}{2N}$.


the probability that no coalescent occurs is:

$$P(\text{NOCOAL}) = \frac{2N-1}{2N}\cdot\frac{2N-2}{2N}\cdots\frac{2N-k+1}{2N} = \left(1-\frac{1}{2N}\right)\left(1-\frac{2}{2N}\right)\left(1-\frac{3}{2N}\right)\cdots$$

$$= 1 - \frac{1}{2N} - \frac{2}{2N} - \ldots + \frac{2}{(2N)^2} - \ldots$$

$$\approx 1 - \frac{1}{2N} - \frac{2}{2N} - \frac{3}{2N} - \ldots - \frac{k-1}{2N} = 1 - \sum_{i=1}^{k-1}\frac{i}{2N} = 1 - \binom{k}{2}\frac{1}{2N}$$

(NOTE: $1 + 2 + 3 + \ldots + (k-1) = \frac{k(k-1)}{2} = \binom{k}{2}$)

We neglect the probability that two coalescents occur simultaneously in a single generation. Why?

The probability to observe a single coalescent event one generation back is $\binom{k}{2}\frac{1}{2N}$.


The probability of observing a coalescent event one generation back increases with the sample size.
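How good the first-order approximation is, and how fast the probability grows with the sample size, can be seen by comparing it with the exact product. This is a sketch under the stated model; the function names and the value 2N = 20000 are assumptions chosen for illustration.

```python
from math import comb

def p_no_coalescence_exact(k, two_n):
    """Probability that k lineages pick k different parents among two_n."""
    prob = 1.0
    for i in range(1, k):
        prob *= (two_n - i) / two_n
    return prob

def p_coalescence_approx(k, two_n):
    """First-order approximation: C(k, 2) / (2N)."""
    return comb(k, 2) / two_n

two_n = 20_000
# Each row: (sample size k, exact probability, approximate probability).
rows = [(k, 1.0 - p_no_coalescence_exact(k, two_n), p_coalescence_approx(k, two_n))
        for k in (2, 5, 10, 20)]
```

For k much smaller than 2N the two columns stay close, and both grow roughly with the square of k.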


How does this affect the shape of the coalescent?


Summary: the discrete coalescent for a sample of size k

It proceeds in discrete generation steps backward in time.

It starts with k individuals and stops with 1.

The probability of a coalescent event increases with the number of lineages.

This increasing probability (as a function of the number of lineages) generates coalescent trees with short branches at the leaves and long branches toward the root.


Simple mathematical formulas and the coalescent The continuous coalescent

The continuous coalescent as an approximation of the discrete coalescent

Assume a sample of size k and a population size of 2N.

What is the waiting time until two sequences coalesce?

The probability that no coalescent event occurs in a single generation is $1 - \binom{k}{2}p$, with $p = \frac{1}{2N}$.

The probability of no coalescent occurring until generation τ is

$$P(T > \tau) = \left(1 - \binom{k}{2}\frac{1}{2N}\right)^{\tau}$$

Replace τ with 2Nt. This means: when t = 1, then 2N generations have passed.


$$P(T > 2Nt) = \left(1 - \binom{k}{2}\frac{1}{2N}\right)^{2Nt}$$

For large N:

$$P(T/2N > t) \approx \exp\left(-\binom{k}{2}t\right)$$

$$P(T/2N \le t) \approx 1 - \exp\left(-\binom{k}{2}t\right)$$

This means that the waiting time until a coalescent event occurs can be approximated by an exponentially distributed variable with parameter $\binom{k}{2}$. The time is measured in units of 2N generations.
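The quality of the exponential approximation can be checked directly for a few population sizes. The sketch below assumes k = 5 lineages and t = 0.05; both values, like the function names, are illustrative choices and are not taken from the slides.

```python
from math import comb, exp

def survival_exact(k, two_n, t):
    """P(T > 2N t) in the discrete model, using the per-generation
    coalescence probability C(k, 2) / (2N)."""
    generations = round(two_n * t)
    return (1.0 - comb(k, 2) / two_n) ** generations

def survival_exponential(k, t):
    """Exponential approximation exp(-C(k, 2) t), t in units of 2N generations."""
    return exp(-comb(k, 2) * t)

k, t = 5, 0.05
# Map each population size 2N to (discrete value, exponential value).
comparison = {two_n: (survival_exact(k, two_n, t), survival_exponential(k, t))
              for two_n in (20, 200, 2_000, 20_000, 200_000)}
```

The gap between the two values shrinks as 2N grows, and the discrete value is always the smaller of the two.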


Let’s see how accurate the approximation is . . .

[Figure: P(T > 0.05) as a function of population size (log scale, 1e+00 to 1e+06); exact discrete value compared with the exponential approximation]

Question: What does this imply?


For small population sizes the exponential approximation underestimates the number of coalescent events!


How do we build coalescent trees using exponentially distributed random variables?

Exponentially distributed random variables

The waiting time until a coalescent event is an exponential random variable.


What do exponential variables look like?

[Figure: exponential cumulative distribution P(T <= t) against the waiting time t (0 to 5), with parameter n(n−1)/2 for n = 10]

Let’s draw some random numbers from the plot!
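Drawing a random number "from the plot" corresponds to inverse-transform sampling: choose a uniform value on the vertical axis and read off the matching waiting time. This is a minimal sketch, assuming n = 10 as on the slide; the function name `draw_exponential` and the seed are arbitrary.

```python
import math
import random

def draw_exponential(rate, rng):
    """Inverse-transform sampling: pick u uniformly on the vertical axis
    of the cumulative curve and return the t with P(T <= t) = u."""
    u = rng.random()                    # uniform in [0, 1)
    return -math.log(1.0 - u) / rate    # solves 1 - exp(-rate * t) = u

n = 10
rate = n * (n - 1) / 2                  # 45 for n = 10
rng = random.Random(42)
draws = [draw_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
```

The average of many draws should be close to 1/rate, which is about 0.022 for n = 10.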


Let’s create a coalescent by drawing random numbers from the exponential distribution

t = 0

T = 0.001, n = 10, i: 1, j: 5

T = 0.07, n = 9, i: 7, j: 10

(T is the waiting time that was drawn, n the current number of lineages, and i and j the two lineages that coalesce.)

This process continues until the MRCA (Most Recent Common Ancestor)!!


Continuous coalescent

To construct the continuous coalescent we draw exponential random variables with parameters:

$$\binom{n}{2},\ \binom{n-1}{2},\ \binom{n-2}{2},\ \ldots,\ \binom{2}{2} = 1$$


Waiting times are getting longer and longer as we move back in time (toward the MRCA)!

Recent branches (near the present) are shorter than deeper branches (near the root)!
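The whole construction fits in a short loop: draw a waiting time with the current rate, merge a random pair, repeat. This is a sketch, not the code of any particular simulator; the function name `simulate_coalescent`, the labeling of ancestral lineages and the seed are assumptions made here.

```python
import random

def simulate_coalescent(n, rng):
    """Simulate one coalescent genealogy for n sampled lineages.

    Returns a list of events (waiting_time, k, i, j): with k lineages
    present, the waiting time is drawn from Exp(k choose 2) and the
    lineages labeled i and j coalesce. Time is in units of 2N generations.
    """
    lineages = list(range(1, n + 1))
    next_label = n + 1                   # label for each new ancestral lineage
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        rate = k * (k - 1) / 2
        waiting_time = rng.expovariate(rate)
        i, j = rng.sample(lineages, 2)   # every pair is equally likely
        lineages.remove(i)
        lineages.remove(j)
        lineages.append(next_label)
        next_label += 1
        events.append((waiting_time, k, i, j))
    return events

rng = random.Random(7)
events = simulate_coalescent(10, rng)
tmrca = sum(w for w, _, _, _ in events)  # time to the MRCA
```

Because the rate drops from 45 (ten lineages) to 1 (two lineages), the later waiting times tend to be much longer than the first ones.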


How does this affect the shape of the coalescent?


Practically, the coalescent is constructed by using the continuous approximation

It’s faster: we are only interested in the times of the coalescent events and not in the generations where nothing happens!


Let’s play a bit with coalescent

Coalescent simulator: www.coalescent.dk


Summary

The coalescent is built by using exponential waiting times.

The continuous coalescent represents a good approximation of the discrete coalescent, when the population size is large enough.

We assume that two coalescent events cannot occur simultaneously.

Waiting times increase with the number of coalescent events that have already occurred.

The last waiting time before the root (the MRCA) is on average about as long as the time we need to go from the k observed individuals down to their two ancestors (1 versus 1 − 2/k, in units of 2N generations).
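The last point of the summary follows from the means of the exponential waiting times. A short derivation, writing $T_j$ for the waiting time while j lineages are present (a symbol introduced here, not used in the slides), with time in units of 2N generations:

```latex
E[T_j] = \frac{1}{\binom{j}{2}} = \frac{2}{j(j-1)}, \qquad E[T_2] = 1,
\qquad
\sum_{j=3}^{k} E[T_j] = \sum_{j=3}^{k} 2\left(\frac{1}{j-1} - \frac{1}{j}\right) = 1 - \frac{2}{k}.
```

The expected time to go from k lineages down to two is therefore 1 − 2/k, just under the expected final waiting time of 1, and the expected height of the tree is 2(1 − 1/k).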


Coalescent and polymorphisms

We saw how to build the coalescent, that is, how to model the relationships of individuals within a sample.

How can we generate/simulate sequence data using a coalescent tree?


It’s easy to simulate sequences using a coalescent tree!

Just put mutations on the coalescent tree :-)


Putting mutations on the coalescent tree

[Figure: coalescent tree with leaves C1 to C5; mutations are marked on its branches]

NOTE: INFINITE SITE MODEL!

We need to . . .

choose a position on the tree to put the mutation

choose a position on the sequence where the mutation occurred


How do we choose a position on the tree?

We assume that the mutation rate is θ. This means that the expected number of mutations per unit time on a single branch of the coalescent tree is θ.

The total number of mutations on the coalescent tree is a Poisson random number Poi(θT), where T is the total tree length (the sum of all branch lengths).

Then, we randomly put mutations on the branches of the coalescent tree, each branch receiving a mutation with probability proportional to its length.
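The two steps, drawing the total number of mutations and spreading them over the branches, can be sketched as follows. The branch names and lengths are hypothetical, the function names are made up for this example, and the Poisson draw is built from exponential waiting times because Python's standard library has no Poisson sampler.

```python
import random

def poisson_via_waiting_times(mean, rng):
    """Draw a Poisson(mean) count by adding Exp(1) waiting times until
    their sum exceeds `mean` (events of a rate-1 Poisson process)."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def place_mutations(branch_lengths, theta, rng):
    """Return the number of mutations on each branch.

    The total is Poisson(theta * T), T being the total tree length; each
    mutation then falls on a branch with probability length / T.
    """
    total_length = sum(branch_lengths.values())
    n_mut = poisson_via_waiting_times(theta * total_length, rng)
    names = list(branch_lengths)
    weights = [branch_lengths[b] for b in names]
    counts = {b: 0 for b in names}
    for b in rng.choices(names, weights=weights, k=n_mut):
        counts[b] += 1
    return counts

# Hypothetical branch lengths (units of 2N generations), for illustration only.
branches = {"b1": 0.1, "b2": 0.1, "b3": 0.4, "b4": 0.9}
rng = random.Random(3)
counts = place_mutations(branches, theta=5.0, rng=rng)
```

Under the infinite site model each of these mutations would then be assigned its own new position on the sequence.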


The number of mutations follows a Poisson distribution

Waiting times are exponentially distributed

Events are independent from each other


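Waiting times between mutations

t is measured in units of 2N generations; µ is the mutation probability per generation and θ = 4Nµ.

$$P(T > t) = (1-\mu)^{2Nt} = \left(1 - \frac{\theta}{4N}\right)^{2Nt}$$

Thus as N goes to infinity

$$P(T > t) \to e^{-\theta t/2}$$

Under this scaling mutations arrive on a branch at rate θ/2 per unit time, whereas Poi(θT) above takes θ itself as the per-branch rate; the two differ only in how θ is defined.

Additionally, mutation events are independent from each other.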
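The convergence to the exponential can be checked numerically. This is a small sketch with arbitrary example values θ = 2 and t = 1; the function names are illustrative.

```python
from math import exp

def no_mutation_discrete(theta, t, n):
    """(1 - theta / 4N)**(2 N t): no mutation during 2Nt generations."""
    return (1.0 - theta / (4.0 * n)) ** (2.0 * n * t)

def no_mutation_limit(theta, t):
    """Limit for N -> infinity: exp(-theta * t / 2)."""
    return exp(-theta * t / 2.0)

theta, t = 2.0, 1.0
# Absolute difference between the discrete value and the limit, per N.
errors = [abs(no_mutation_discrete(theta, t, n) - no_mutation_limit(theta, t))
          for n in (10, 100, 1_000, 10_000)]
```

The error falls by roughly a factor of ten each time N grows tenfold.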


Put the mutations randomly on the tree


What is the expected number of mutations?

Mean of Poi(θT) = θT.
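As a quick sanity check, the mean can be recovered by simulation. The Python sketch below draws exponentially distributed waiting times along a stretch of branch of length T and counts how many mutations fit; averaged over many replicates, the count should be close to θT. The function name `count_mutations` and the chosen values of θ and T are assumptions made for illustration.

```python
import random


def count_mutations(rate, length, rng):
    """Count events of a rate-`rate` Poisson process on [0, length).

    Waiting times between consecutive mutations are exponential with the
    given rate (rate must be positive).
    """
    n = 0
    position = rng.expovariate(rate)
    while position < length:
        n += 1
        position += rng.expovariate(rate)
    return n


if __name__ == "__main__":
    theta, T = 2.5, 2.0  # illustrative values only
    rng = random.Random(0)
    draws = [count_mutations(theta, T, rng) for _ in range(20_000)]
    print("sample mean:", sum(draws) / len(draws), "theta * T:", theta * T)
```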


Coalescent and polymorphisms

Explaining data with the coalescent

The goal is to create a coalescent tree and put mutations on it so as to generate (simulate) a dataset with specific properties.
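A compact Python sketch of such a simulation follows. It is not the method of any particular program. Instead of first building the tree and then dropping mutations on it, it uses an equivalent event-by-event formulation: while k lineages remain, the next event is either a coalescence (rate k(k-1)/2) or a mutation on one lineage (rate k times the per-lineage mutation rate). Each mutation is assumed to hit a new site (infinite sites), the output is coded 0 for ancestral and 1 for derived, and `mut_rate` is the per-lineage mutation rate in coalescent time units. All names are hypothetical.

```python
import random


def simulate_haplotypes(n, mut_rate, rng):
    """Simulate n binary sequences under a standard coalescent with mutations.

    Returns a list of n strings, one column per mutation.
    """
    # Each lineage is represented by the set of sampled sequences below it.
    lineages = [frozenset([i]) for i in range(n)]
    sites = []  # for each mutation, the set of sequences that inherit it
    while len(lineages) > 1:
        k = len(lineages)
        coal_rate = k * (k - 1) / 2
        total_mut_rate = k * mut_rate
        if rng.random() < total_mut_rate / (coal_rate + total_mut_rate):
            # Mutation: falls on a lineage chosen uniformly at random.
            sites.append(rng.choice(lineages))
        else:
            # Coalescence: a random pair of lineages merges.
            i, j = rng.sample(range(k), 2)
            merged = lineages[i] | lineages[j]
            lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
            lineages.append(merged)
    return [
        "".join("1" if leaf in site else "0" for site in sites)
        for leaf in range(n)
    ]


if __name__ == "__main__":
    for row in simulate_haplotypes(4, mut_rate=1.0, rng=random.Random(7)):
        print(row)
```

Only the order of events matters for the resulting dataset, so the sketch does not record the actual waiting times; a simulator that also needs branch lengths would draw them from the exponential distribution with the total event rate.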


Coalescent and polymorphisms

Explaining data with the coalescent

Assume a dataset:

seq1 AAATCG
seq2 AAACCG
seq3 TTTCCG
seq4 AAATTC


Coalescent and polymorphisms

Explaining data with the coalescent

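[Figure: coalescent tree with the ancestral sequence AAATCG at the root and the leaves AAATCG, AAATTC, AAACCG, TTTCCG. Mutations on the branches, numbered by site: 1. A->T, 2. A->T, 3. A->T, 4. T->C, 5. C->T, 6. G->C]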
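The Python sketch below replays the six mutations on the ancestral sequence and checks that the four sequences of the dataset come out. The assignment of each mutation to the set of sequences below its branch is a reading of the figure and should be treated as an assumption, as are all names in the code: mutation 4 is taken to lie on the branch ancestral to seq2 and seq3, mutations 1 to 3 on the branch leading to seq3, and mutations 5 and 6 on the branch leading to seq4.

```python
ANCESTOR = "AAATCG"

# (site, ancestral base, derived base, sequences below the mutated branch)
# The carrier sets are an assumed reading of the tree figure.
MUTATIONS = [
    (1, "A", "T", {"seq3"}),
    (2, "A", "T", {"seq3"}),
    (3, "A", "T", {"seq3"}),
    (4, "T", "C", {"seq2", "seq3"}),
    (5, "C", "T", {"seq4"}),
    (6, "G", "C", {"seq4"}),
]


def apply_mutations(ancestor, mutations, leaves):
    """Return {leaf: sequence} after applying each mutation to its carriers."""
    seqs = {leaf: list(ancestor) for leaf in leaves}
    for site, old, new, carriers in mutations:
        for leaf in carriers:
            if seqs[leaf][site - 1] != old:
                raise ValueError(f"site {site} of {leaf} is not {old}")
            seqs[leaf][site - 1] = new
    return {leaf: "".join(bases) for leaf, bases in seqs.items()}


if __name__ == "__main__":
    result = apply_mutations(ANCESTOR, MUTATIONS, ["seq1", "seq2", "seq3", "seq4"])
    for name, seq in result.items():
        print(name, seq)
```

Every site is hit by exactly one mutation here, so each column of the dataset splits the sample into the sequences below the mutated branch and all others.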


Coalescent and polymorphisms

Software in population genetics that can be optimized


Coalescent and polymorphisms

Software in population genetics that can be optimized

The IM model


Coalescent and polymorphisms

Software to construct coalescent trees with recombination and multiple positively selected mutations
