Robust Demographic Inference from Genomic and SNP Data (slides)

27
Robust demographic inference from genomic and SNP data Laurent Excoffier Isabelle Duperret, Emilia Huerta-Sanchez, Matthieu Foll, Vitor Sousa, Isabel Alves Computational and Molecular Population Genetics Lab (CMPG) Institute of Ecology and Evolution University of Berne Swiss Institute of Bioinformatics

Transcript of Robust Demographic Inference from Genomic and SNP Data (slides)

Page 1: Robust Demographic Inference from Genomic and SNP Data (slides)

Robust demographic inference from genomic and SNP dataLaurent ExcoffierIsabelle Duperret, Emilia Huerta-Sanchez, Matthieu Foll, Vitor Sousa, Isabel Alves

Computational and Molecular Population Genetics Lab (CMPG)Institute of Ecology and EvolutionUniversity of BerneSwiss Institute of Bioinformatics

Page 2: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Past demography affect genetic diversityStationary population Recent expansion Recent contraction

Past

Present

Few and mostly rare mutationsMixture of rare and frequent mutations

Very deep lineages separating little differentiated clades

Page 3: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Site Frequency Spectrum (SFS) depends on past demography

Page 4: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Joint SFS (2D-SFS)NA

N1 N2m12 TDIV

Model of Isolation with migration (IM)

Page 5: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Problems with estimation of demographic parameters from SFS

Page 6: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Estimation of demographic parameters from SFS with dadi

dadi estimates the site frequency spectrum based on a diffusion approximation

2009

Program : Diffusion Approximation for Demographic Inference http://code.google.com/p/dadi/

∂a∂i

Page 7: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Advantages of SFS for parameter inference

• Accuracy of estimates increases with data size, but computing time does not

• Can be used to study complex scenarios (e.g. as complex as ABC)

• Very fast estimations (as compared to ABC, or full likelihoods)

Page 8: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Potential problems

• Maximization of the CL is not trivial (precision of the approximation and convergence problems)

• Ignores (assumes no) LD

• Need to repeat estimations to find maximum CL

• Needs genomic data (several Mb)

– difficult to have gene-specific estimates

• Next-generation sequencing data must have high coverage to correctly estimate SFS (likely to miss singletons or show errors).

• SFS needs to be estimated from the NGS reads (ML methods: Nielsen et al. 2013, Keightley and Halligan, 2011)

Page 9: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Estimating the SFS with coalescent simulations

The probability of a SFS entry i can be estimated under a specific model θ from

its expected coalescent tree as (Nielsen 2000) a ratio of expected branch

lengths( | ) / ( | )i ip E t E Tθ θ=

where bkj is the length of the k-th compatible

branch in simulation j.

b2 b2

b2b4

b1

b6

b1

b1b1

ˆi

Z Z

i kj jj k j

p b T∈Φ

=∑∑ ∑

ti : total length of all branches directly leading to i terminal nodes

T : total tree length.

This probability can then be estimated on the basis of Z

simulations as

Page 10: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Likelihood

1

0 01

ˆPr( | ) (1 ) i

nmM S

obs ii

CL SFS P P pθ−

=

= ∝ − ∏M : number of monomorphic sites

S : number of polymorphic sites

P0 : probability of no mutation on the tree

pi : probability of the i-th SFS entry

mi: number of sites with derived frequency i

This can be generalized for the joint SFS of two or more populations

The (composite) likelihood of a model θ is obtained as a multinomial

sampling of sites (Adams and Hudson, 2004)

Page 11: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

fastsimcoal2 program

• Uses coalescent simulations to estimate the SFS and approximate the likelihood– Large number of simulations per point (>50000)

• Uses a conditional expectation maximization (CEM) algorithm to find maxCL parameters

• Relatively fast and can explore wide and unbounded parameter ranges

• Can handle an arbitrary number of populations• For more than 4 populations, we use a composite composite-

likelihood CL1234…= CL12×CL13×CL14×… ×CL23 ×…

Page 12: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Approximation of the SFS5000

5K 500 TDIV

Divergence model

TDIV=10 TDIV=100

Chen (2012) TPBCoalescent approach to infer the expected joint SFS numerically

Page 13: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Bottleneck model

TBOT

NA

NCUR

NBOT

fastsimcoal2 ∂a∂i

Simulation of 20 Mb data10 cases, 50 runs/case

9/10∂a∂i

Page 14: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

IM model

∂a∂i8/10

NA

N1 N2m12

m21

TDIV

Page 15: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Pseudo human evolution model NA

N1

NOUT

NBOT

TDIV

TBOT

106 106

m

8/10∂a∂i

Page 16: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Herarchical island model

12 populations in two continent-island models

Migration rates over 3 orders of magnitude are well recovered !!!

Page 17: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Application: Complete genomics data

Four sampled human populations:

4 Luhya from Kenya (LWK)9 Europeans (CEU)9 Yoruba (YRI)5 African Americans (ASW)

(sequenced at 51-89x per genome)

Data:

Multidimensional SFS estimated from :239, 120 SNPs in non-coding and non CpG regionsEach SNP more than 5 Kb away from the other

Page 18: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Model of admixture in African Americans

Luhya (Kenya)Yoruba (Nigeria)

Northern EuropeansAfr. Am.

European meta-population

West-African meta-population

Ghost (East-African) meta-population

Page 19: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Model of admixture in African Americans

Page 20: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Models of African population divergence

The estimation of each model were performed separately for the San (109,020 SNPs) and the Yoruba (81,383 SNPs) SNP panels

Two models with different degrees of realism and complexity

3 populations 5 populations

IM model 2 continent-island model

Page 21: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Models of African population divergence

Good agreement between panels

IM model

Page 22: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Models of African population divergence

Akaike’s weigths of evidence in favor of model B are close to 1 for both panels

2 continent-island model

Page 23: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Models of African population divergence

138,250 y 258,250 y

1,925 y1,475 y

7,450 y4,250 y

Page 24: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Inference of archaic admixture in modern humans

NANH

2,000

TDIV

admixTDN TBOT

NBOT

NCH

NALT

NN

NAN NH

Simple model (proof of concept)

Data set: Non coding DNA and non CpG sites.Altai Neandertal (Prüfer et al. 2013), unfiltered vcf271,994 regions of 100 bp in non-coding DNAAncestral state deduced by 1000G for 26,466,040 bp (26.5Mb)All regions are at least 5 Kb apart from each other

Complete genomicsCHB or TSI samples

(4 inds / pop)

Altai Neandertal

Other unsampled Neandertal

Page 25: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Inference of archaic admixture in modern humans

Admixture levelCHB: 1.2% (0.94-1.43)TSI: 1.3% (0.85-1.45)

Recent admixture ! !TSI: 875 gen (790-1030)CHB: 950 gen ( 810-1200)<25,000 y (assuming u=2e-8)

Very preliminary results

Page 26: Robust Demographic Inference from Genomic and SNP Data (slides)

© 2012 SIB

Possible extensions

• Multiprocessor version of fsc• MCMC (Beaumont 2004, Garrigan 2009)

• Multilocus SFS• Coalescent simulations through pedigrees

Page 27: Robust Demographic Inference from Genomic and SNP Data (slides)

Thanks to:Isabelle DuperretEmilia Huerta-SanchezIsabel AlvesVitor SousaMatthieu Foll

Rasmus Nielsen

CMPG lab

David ReichNick Patterson