
Alkes Price

Harvard School of Public Health

January 31 & February 2, 2017

EPI 511, Advanced Population and Medical Genetics

Week 2:

• Population structure

• Population admixture





Outline

1. Introduction to population structure

2. Model-based clustering (STRUCTURE, FRAPPE programs)

3. Principal Components Analysis (PCA)

4. Ancestry-informative markers (AIMs)

What is population structure?

Population structure refers to genetic differences

between populations due to geographic ancestry.

Genetic differences between populations are small

5-7% of worldwide human genetic variation is due to

genetic differences between human populations.

The remaining 93-95% of human genetic variation is due to

genetic variation within human populations

(Rosenberg et al. 2002 Science).

Genetic differences between populations are small (International HapMap Consortium 2005 and 2007, Nature)

FST = 0.19

FST = 0.11

FST = 0.16

Populations can be distinguished using

a large number of genetic markers

• Model-based clustering programs such as STRUCTURE (Pritchard et al. 2000 Genetics)

Rosenberg et al. 2002 Science

Populations can be distinguished using

a large number of genetic markers

• Principal components analysis (PCA) (Cavalli-Sforza 1994, The History and Geography of Human Genes)

using 3 million markers

Model-based clustering vs. PCA:

What’s the difference?

Model-based clustering:

• Output for each individual: ancestry in N population clusters

• Fractional ancestry (20% pop1, 80% pop2) may be allowed

• Number N of population clusters must be decided in advance

• Results may be sensitive to number of population clusters


Principal components analysis (PCA):

• Output for each individual: ancestry as principal components

• PCs do not necessarily correspond to specific populations

• Results of top PCs are not sensitive to the number of PCs

Trees can also describe population structure

Unrooted tree Rooted tree Jakobsson et al. 2008 Nature Li et al. 2008 Science

also see Cavalli-Sforza et al. 2003 Nat Genet

Population structure vs. Population admixture:


Population structure: [Tue of Week 2]

• Genetic differences due to geographic ancestry.

• Use genome-wide data to infer genome-wide ancestry.






Population admixture: [Thu of Week 2]

• Mixed ancestry from multiple continental populations.

• e.g. African Americans, Latino Americans, Hawaiians.

• Infer local ancestry at each location in the genome.

Population structure vs. Population stratification:





Population stratification: [Tue of Week 3 & Thu of Week 3]

• Refers specifically to a genotype-phenotype association study.

• Differences in genetic ancestry between cases and controls.

Outline





Model-based clustering when allele frequencies

in ancestral populations are known

Example 1. POP1 and POP2 with known allele frequencies.

              SNP1  SNP2  SNP3  SNP4  …
POP1          0.25  0.57  0.29  0.38  …  (allele frequencies)
POP2          0.40  0.32  0.84  0.22  …  (allele frequencies)
Individual x  2     0     1     1     …  (SNP genotypes)

Does individual x belong to POP1 or POP2?

P(DATA | x is in POP1) is proportional to

$$(0.25)^2(0.75)^0\,(0.57)^0(0.43)^2\,(0.29)^1(0.71)^1\,(0.38)^1(0.62)^1 = 0.0006$$

P(DATA | x is in POP2) is proportional to

$$(0.40)^2(0.60)^0\,(0.32)^0(0.68)^2\,(0.84)^1(0.16)^1\,(0.22)^1(0.78)^1 = 0.0017$$
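The two likelihoods in Example 1 can be checked numerically. The following is a minimal sketch using only the four SNPs shown on the slide; the function name `genotype_likelihood` is illustrative and not part of any program cited here.

```python
import numpy as np

# Allele frequencies and genotypes copied from Example 1 (only the 4 SNPs shown).
p_pop1 = np.array([0.25, 0.57, 0.29, 0.38])
p_pop2 = np.array([0.40, 0.32, 0.84, 0.22])
g = np.array([2, 0, 1, 1])  # genotype counts of individual x


def genotype_likelihood(p, g):
    """Product over SNPs of p^g * (1-p)^(2-g), i.e. P(DATA | population) up to a constant."""
    return float(np.prod(p**g * (1.0 - p) ** (2 - g)))


lik1 = genotype_likelihood(p_pop1, g)
lik2 = genotype_likelihood(p_pop2, g)
best = "POP1" if lik1 > lik2 else "POP2"
print(f"POP1: {lik1:.4f}  POP2: {lik2:.4f}  more likely: {best}")
```

On these four SNPs, POP2 is the more likely source population for individual x.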

(Fractional) model-based clustering when allele

frequencies in ancestral populations are known






If individual x has ancestry α from POP1 and (1–α) from POP2, then what is the most likely value of α?

P(DATA | α) is proportional to

$$
\begin{aligned}
&[0.25\alpha + 0.40(1-\alpha)]^2\,[0.75\alpha + 0.60(1-\alpha)]^0 \\
\times\;&[0.57\alpha + 0.32(1-\alpha)]^0\,[0.43\alpha + 0.68(1-\alpha)]^2 \\
\times\;&[0.29\alpha + 0.84(1-\alpha)]^1\,[0.71\alpha + 0.16(1-\alpha)]^1 \\
\times\;&[0.38\alpha + 0.22(1-\alpha)]^1\,[0.62\alpha + 0.78(1-\alpha)]^1
\end{aligned}
$$

Maximum value 0.0020, attained at α = 0.22.
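The maximizing value of α can be found by a simple grid search. This is a sketch under the assumption that a grid with step 0.01 is fine enough; the programs cited in these slides use EM or MCMC rather than a grid, and the helper name `mixture_likelihood` is illustrative.

```python
import numpy as np

# Frequencies and genotypes from Example 1 (only the 4 SNPs shown on the slide).
p_pop1 = np.array([0.25, 0.57, 0.29, 0.38])
p_pop2 = np.array([0.40, 0.32, 0.84, 0.22])
g = np.array([2, 0, 1, 1])


def mixture_likelihood(alpha):
    """P(DATA | alpha) up to a constant, for ancestry alpha from POP1 and 1-alpha from POP2."""
    p_mix = alpha * p_pop1 + (1.0 - alpha) * p_pop2  # ancestry-weighted allele frequency
    return float(np.prod(p_mix**g * (1.0 - p_mix) ** (2 - g)))


alphas = np.linspace(0.0, 1.0, 101)  # assumed grid: 0.00, 0.01, ..., 1.00
liks = np.array([mixture_likelihood(a) for a in alphas])
alpha_hat = float(alphas[np.argmax(liks)])
max_lik = float(liks.max())
print(f"alpha_hat = {alpha_hat:.2f}, max likelihood = {max_lik:.4f}")
```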



General case: M SNPs (m = 1 to M), N populations (n = 1 to N),

known allele frequency pmn for SNP m in population n,

observed genotype counts gm for SNP m in individual x.

Which population (n = 1 to N) does individual x belong to?

P(DATA | x ~ population n) is proportional to

$$\prod_{m=1}^{M} p_{mn}^{\,g_m}\,(1-p_{mn})^{\,2-g_m}$$

Answer: find the choice of n which maximizes this expression.






If individual x has fractional ancestry αn from each population n,

subject to Σnαn = 1, then what are the most likely values of αn?

P(DATA | x ~ α1, …, αN) is proportional to

$$\prod_{m=1}^{M}\left[\sum_{n=1}^{N}\alpha_n\,p_{mn}\right]^{g_m}\left[\sum_{n=1}^{N}\alpha_n\,(1-p_{mn})\right]^{2-g_m}$$

Answer: find the values of αn which maximize this expression.


(Fractional) model-based clustering when allele

frequencies in ancestral populations are unknown

General case: M SNPs (m = 1 to M), N populations (n = 1 to N),

unknown allele frequency pmn for SNP m in population n,

observed genotype counts gim for SNP m in many individuals xi.

If individual xi has fractional ancestry αin from each population n,

subject to Σnαin = 1, then what are the most likely values of αin?

P(DATA | xi ~ αi1, …, αiN for each i; pmn) is proportional to

$$\prod_{i=1}^{I}\prod_{m=1}^{M}\left[\sum_{n=1}^{N}\alpha_{in}\,p_{mn}\right]^{g_{im}}\left[\sum_{n=1}^{N}\alpha_{in}\,(1-p_{mn})\right]^{2-g_{im}}$$

Answer: find values of αin, pmn which maximize this expression.

How to optimize αin and pmn?




Which ancestries αin and allele frequencies pmn maximize

$$\prod_{i=1}^{I}\prod_{m=1}^{M}\left[\sum_{n=1}^{N}\alpha_{in}\,p_{mn}\right]^{g_{im}}\left[\sum_{n=1}^{N}\alpha_{in}\,(1-p_{mn})\right]^{2-g_{im}} \;?$$

• Approach #1: EM algorithm (Dempster et al. 1977 JRSS B)

(FRAPPE program; Tang et al. 2005 Genet Epidemiol)

also see ADMIXTURE program (Alexander et al. 2009 Genome Res)

The EM algorithm

Want to estimate ancestries αin and allele frequencies pmn.

Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.

Let zihmn = P(Zihm = n) denote expectations of hidden variables.

Tang et al. 2005 Genet Epidemiol

also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet

The EM algorithm




Here h = 0 or 1 (two haplotypes per individual)

Let gihm denote the allele carried by indiv i on hap h at SNP m

Diploid genotype gim = 0 or 1 or 2

Haploid genotype gihm = 0 or 1, Σh gihm = gim

Tang et al. 2005 Genet Epidemiol also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet

The EM algorithm



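If Zihm is known: choose αin and pmn to maximize

$$\prod_{i=1}^{I}\prod_{m=1}^{M}\prod_{h=0}^{1}
\begin{cases}
\alpha_{iZ_{ihm}}\,p_{mZ_{ihm}} & \text{if } g_{ihm}=1\\[2pt]
\alpha_{iZ_{ihm}}\,(1-p_{mZ_{ihm}}) & \text{if } g_{ihm}=0
\end{cases}$$

But Zihm is unknown. What to do?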



The EM algorithm




Initialization step: Assign zihmn arbitrarily.



The EM algorithm




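Expectation step: Compute expectations zihmn from αin and pmn.

$$z_{ihmn} =
\begin{cases}
\dfrac{\alpha_{in}\,p_{mn}}{\sum_{n'=1}^{N}\alpha_{in'}\,p_{mn'}} & \text{if } g_{ihm}=1\\[10pt]
\dfrac{\alpha_{in}\,(1-p_{mn})}{\sum_{n'=1}^{N}\alpha_{in'}\,(1-p_{mn'})} & \text{if } g_{ihm}=0
\end{cases}$$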



The EM algorithm




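Maximization step: Maximize P(DATA | αin and pmn) using zihmn.

$$\alpha_{in} = \frac{1}{2M}\sum_{m=1}^{M}\sum_{h=0}^{1} z_{ihmn}$$

$$p_{mn} = \frac{\sum_{i=1}^{I}\sum_{h=0}^{1} z_{ihmn}\,g_{ihm}}{\sum_{i=1}^{I}\sum_{h=0}^{1} z_{ihmn}}$$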



The EM algorithm




Initialization step.

Maximization step.

Expectation step.

Maximization step.

Expectation step.

Maximization step.

etc. (to convergence)

Tang et al. 2005 Genet Epidemiol also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet
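The initialization, maximization and expectation steps above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the FRAPPE code: the function name `em_admixture`, the fixed iteration count, the clipping constant and the random demo genotypes are all assumptions made here. Because the model treats the two alleles of a genotype symmetrically, the split of each diploid genotype into two haploid alleles is arbitrary.

```python
import numpy as np


def em_admixture(G, n_pops, n_iter=100, seed=0):
    """Toy EM for ancestries alpha (I x N) and allele frequencies p (M x N).

    G is an (I, M) array of genotype counts in {0, 1, 2}.
    """
    rng = np.random.default_rng(seed)
    n_ind, n_snp = G.shape
    # Split each diploid genotype g_im into two haploid alleles g_ihm (h = 0, 1).
    g_hap = np.stack([(G >= 1), (G == 2)], axis=1).astype(float)  # (I, 2, M)
    # Initialization step: assign z_ihmn arbitrarily (random points on the simplex).
    z = rng.dirichlet(np.ones(n_pops), size=(n_ind, 2, n_snp))  # (I, 2, M, N)
    for _ in range(n_iter):
        # Maximization step: update alpha_in and p_mn from the expectations z_ihmn.
        alpha = z.sum(axis=(1, 2)) / (2 * n_snp)  # (I, N)
        denom = np.maximum(z.sum(axis=(0, 1)), 1e-12)  # guard against empty clusters
        p = (z * g_hap[..., None]).sum(axis=(0, 1)) / denom  # (M, N)
        p = np.clip(p, 1e-6, 1 - 1e-6)  # keep frequencies strictly inside (0, 1)
        # Expectation step: recompute z_ihmn from alpha_in and p_mn.
        w1 = alpha[:, None, :] * p[None, :, :]  # used where g_ihm = 1
        w0 = alpha[:, None, :] * (1 - p)[None, :, :]  # used where g_ihm = 0
        w1 = w1 / w1.sum(axis=2, keepdims=True)
        w0 = w0 / w0.sum(axis=2, keepdims=True)
        z = np.where(g_hap[..., None] == 1, w1[:, None], w0[:, None])
    return alpha, p


# Demo on made-up random genotypes (6 individuals, 50 SNPs); not real data.
G_demo = np.random.default_rng(1).integers(0, 3, size=(6, 50))
alpha_hat, p_hat = em_admixture(G_demo, n_pops=2, n_iter=50)
print(np.round(alpha_hat, 2))
```

Each row of `alpha_hat` sums to 1 by construction, since the expectations for each haploid allele sum to 1 over populations. A practical implementation would monitor the likelihood for convergence instead of running a fixed number of iterations.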

Bayesian posterior inference

• Approach #2: Place Bayesian priors on αin and pmn, then

sample from posterior via Markov Chain Monte Carlo (MCMC)

(STRUCTURE program; Pritchard et al. 2000 Genetics)

or variational Bayes approximation

(TeraStructure program; Gopalan et al. 2016 Nat Genet)

$$\prod_{i=1}^{I}\prod_{m=1}^{M}\left[\sum_{n=1}^{N}\alpha_{in}\,p_{mn}\right]^{g_{im}}\left[\sum_{n=1}^{N}\alpha_{in}\,(1-p_{mn})\right]^{2-g_{im}}$$

Next steps to understanding model-based clustering

Let there be rock.

-- Bon S.

Let there be data.

-- Alkes

Application #1: Human Genome Diversity Project

Cann et al. 2002 Science, Cavalli-Sforza et al. 2005 Nat Rev Genet

also see Mallick et al. 2016 Nature (SGDP), Pagani et al. 2016 Nature (EGDP)

STRUCTURE results on HGDP samples

• 1,056 individuals from 52 world populations

• 377 microsatellite markers (multi-allelic)

[Figure: STRUCTURE results; region labels: Africa, Europe, Western Eurasia, East Asia, Oceania, America]
STRUCTURE results: How many clusters?


“We do not claim that our procedure provides an accurate estimate”

(Heuristic procedure for #clusters, Pritchard et al. 2000 Genetics)

[Figure: STRUCTURE results; region labels: Africa, Europe, Western Eurasia, East Asia, Oceania, America]

FRAPPE results on GWAS data from HGDP

Li et al. 2008 Science

• 938 HGDP individuals (118 related individuals removed)

• 51 world populations (N. Han and S. Han merged)

• Illumina 650K chip

FRAPPE results at K=7:

Application #2: diverse African populations

also see Figure 5 of

Cavalli-Sforza et al. 2003 Nat Genet

Language families

of Africa

Application #2: diverse African populations

• 2,432 individuals from 113 African populations

• 1,327 markers (microsatellite markers and indels)

STRUCTURE (Pritchard et al. 2000 Genetics) at K=14.

Tishkoff et al. 2009 Science; also see Gurdasani et al. 2015 Nature

STRUCTURE results on African populations

[Figure: STRUCTURE results at K=14; legend: West African/Bantu, East African, Khoisan, Pygmy, European/Middle Eastern]

[Figure: West African/Bantu component at K=14]

Bantu expansion (2000 BC – 1000 AD)

(Cavalli-Sforza et al. 1994, The History and Geography of Human Genes)


Outline





Principal Components Analysis

[Figure: 10 points in 1,000,000-dimensional space.]

Axes of variation (PCs, eigenvectors)

Axis 1 is the axis explaining the maximum amount of variation.

Axis 2 is the axis explaining the maximum amount of variation among axes orthogonal to Axis 1.

[Figure: the 10 points with Axis 1, Axis 2, Axis 3, ..., Axis 9, Axis 10.]

Top axis of variation

[Figure: the 10 points plotted against Axis 1 and Axis 2, labeled with the values +0.45, +0.02, +0.30, +0.09, -0.36, -0.33, +0.22, -0.08, -0.18, -0.50.]

The math

Let X be an M x N matrix with M > N (e.g. M SNPs, N individuals)

Let Ψ be the N x N covariance matrix of X:

Ψjk = Cov(xj, xk), where xj and xk are jth and kth columns of X.

Matrix diagonalization (Eigen-decomposition):

Ψ = VDV^T, where

D is a diagonal N x N matrix of eigenvalues

V is an N x N matrix whose columns are the eigenvectors of Ψ

Eigenvectors are orthonormal (V^T V = I), thus ΨV = VD, i.e.

Ψvj = djvj (vj = jth eigenvector, dj = jth eigenvalue)

Pearson 1901 Phil Mag, Ser B

Hotelling 1933 J Educ Psychol

Jackson 2003, A User’s Guide to Principal Components

Toy Example

$$X = \begin{pmatrix} 2 & -2\\ 1 & -1\\ 0 & 0\\ -1 & 1\\ -2 & 2 \end{pmatrix},
\qquad
\Psi = \begin{pmatrix} 10 & -10\\ -10 & 10 \end{pmatrix}$$

$$\Psi = V D V^T =
\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2}\\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}
\begin{pmatrix} 20 & 0\\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2}\\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$

Eigenvalue 1 (PC1):

$$\Psi v_1 = d_1 v_1 = \begin{pmatrix} 20/\sqrt{2}\\ -20/\sqrt{2} \end{pmatrix}$$

Eigenvalue 2 (PC2):

$$\Psi v_2 = d_2 v_2 = \begin{pmatrix} 0\\ 0 \end{pmatrix}$$
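The toy example can be reproduced with NumPy. One assumption is worth flagging: the Ψ shown on the slide equals the unnormalized cross-product X^T X of the (already mean-zero) columns, so the sketch below does not divide by the number of rows. The signs of the eigenvectors returned by `np.linalg.eigh` are arbitrary.

```python
import numpy as np

X = np.array([[2, -2], [1, -1], [0, 0], [-1, 1], [-2, 2]], dtype=float)

# Assumption: Psi is the unnormalized cross-product of the mean-zero columns.
Psi = X.T @ X

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Psi)
order = np.argsort(eigvals)[::-1]  # reorder so the largest eigenvalue comes first
d = eigvals[order]
V = eigvecs[:, order]

print("Psi =\n", Psi)
print("eigenvalues:", np.round(d, 6))
print("top eigenvector (sign is arbitrary):", np.round(V[:, 0], 4))
```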

PCA on genotype data

G = M x N matrix of individual genotypes (M SNPs, N individuals)

gij = genotype (0, 1, or 2 alleles) of SNP i in individual j

• Subtract off the mean of SNP i: pi = Avgj gij/2, set gij = gij – 2pi

(Missing data: set gij = 0 if SNP i in individual j is missing data)

• Optional: normalize by $\sqrt{2p_i(1-p_i)}$, i.e. set gij = gij / $\sqrt{2p_i(1-p_i)}$

Ψ = N x N covariance matrix of G

Ψ = VDV^T (Eigen-decomposition)

Columns of V are the eigenvectors of Ψ (principal components, PCs, of G).

Diagonal entries of D are the corresponding eigenvalues.

The hope: Top PCs (PC1, PC2) correspond to genetic ancestry.

Price et al. 2006 Nat Genet, Patterson et al. 2006 PLoS Genet

also see McVean 2009 PLoS Genet, Engelhardt & Stephens 2010 PLoS Genet

Approximating top PCs quickly in genetic data

• Power iteration: a random vector is repeatedly multiplied by the

target matrix A, stretching it along the top eigenvector of A.

• In genetic data, the genetic relationship matrix (GRM) is A = X^T X/M, where X = normalized genotypes.

Multiply the vector by X and X^T in turn to avoid the cost of computing A.

• Can approximate a fixed number of top PCs in time O(MN)

Rokhlin et al. 2009 J Matrix Anal Appl

Halko et al. 2011 SIAM Rev

Galinsky et al. 2016a Am J Hum Genet

http://www.math.drexel.edu/~pg/520/Math520.html
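Power iteration for the top PC can be sketched as follows. This is a plain power iteration under stated assumptions, not the randomized algorithms of the cited papers: the function name `top_pc_power_iteration`, the fixed iteration count and the simulated two-group data are illustrative. Each iteration multiplies by X and then X^T, costing O(MN), and never forms the N x N matrix A.

```python
import numpy as np


def top_pc_power_iteration(X, n_iter=100, seed=0):
    """Approximate the top eigenvector and eigenvalue of A = X^T X / M without forming A.

    X is an (M, N) matrix of normalized genotypes (M SNPs, N individuals).
    """
    n_snp, n_ind = X.shape
    v = np.random.default_rng(seed).standard_normal(n_ind)  # random starting vector
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = X.T @ (X @ v) / n_snp  # two matrix-vector products, O(MN) each
        v = w / np.linalg.norm(w)
    eigval = float(v @ (X.T @ (X @ v)) / n_snp)
    return v, eigval


# Simulated data (an assumption for illustration): two groups of 10 individuals.
rng = np.random.default_rng(1)
labels = np.array([1.0] * 10 + [-1.0] * 10)
X_demo = rng.standard_normal((500, 20)) + np.outer(rng.standard_normal(500), labels)

v_top, d_top = top_pc_power_iteration(X_demo)

# Reference answer from a full eigendecomposition of the 20 x 20 matrix.
ref_vals, ref_vecs = np.linalg.eigh(X_demo.T @ X_demo / X_demo.shape[0])
print("cosine with exact top PC:", round(abs(float(v_top @ ref_vecs[:, -1])), 6))
```

Convergence speed depends on the gap between the top two eigenvalues; the simulated data above has a large gap, so few iterations are needed.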

PCA on genotype data: Toy Example

Genotypes (rows = 7 SNPs, columns = 5 individuals):

1 1 1 0 0
0 1 2 1 2
2 1 1 0 1
0 0 1 2 2
2 1 1 0 0
0 0 1 1 1
2 2 1 1 0

Mean-adjust each SNP:

 0.4  0.4  0.4 -0.6 -0.6
-1.2 -0.2  0.8 -0.2  0.8
 1.0  0.0  0.0 -1.0  0.0
-1.0 -1.0  0.0  1.0  1.0
 1.2  0.2  0.2 -0.8 -0.8
-0.6 -0.6  0.4  0.4  0.4
 0.8  0.8 -0.2 -0.2 -1.2

Covariance matrix (5 individuals x 5 individuals):

 0.9  0.4 -0.2 -0.5 -0.6
 0.4  0.3  0.0 -0.3 -0.4
-0.2  0.0  0.1  0.0  0.1
-0.5 -0.3  0.0  0.4  0.3
-0.6 -0.4  0.1  0.3  0.6

PCA axis of variation (one entry per individual):

0.7 0.3 -0.1 -0.4 -0.5

Price et al. 2006 Nat Genet
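The genotype toy example can be approximately reproduced with NumPy. Two assumptions are made here: the covariance is taken as X^T X divided by the number of SNPs, and the optional normalization step is skipped. The resulting top PC should be close to the axis of variation shown on the slide, though not necessarily identical digit for digit, since the slide values are rounded and the exact normalization used there is not stated.

```python
import numpy as np

# 7 SNPs (rows) x 5 individuals (columns), copied from the toy example.
G = np.array(
    [
        [1, 1, 1, 0, 0],
        [0, 1, 2, 1, 2],
        [2, 1, 1, 0, 1],
        [0, 0, 1, 2, 2],
        [2, 1, 1, 0, 0],
        [0, 0, 1, 1, 1],
        [2, 2, 1, 1, 0],
    ],
    dtype=float,
)

# Mean-adjust each SNP (row mean equals 2 * p_i).
X = G - G.mean(axis=1, keepdims=True)

# Assumption: covariance between individuals = X^T X / (number of SNPs).
Psi = X.T @ X / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(Psi)  # ascending eigenvalues
pc1 = eigvecs[:, -1]  # eigenvector with the largest eigenvalue
if pc1[0] < 0:  # fix the arbitrary sign so individual 1 is positive
    pc1 = -pc1

slide_axis = np.array([0.7, 0.3, -0.1, -0.4, -0.5])
cosine = float(pc1 @ slide_axis / np.linalg.norm(slide_axis))
print("top PC:", np.round(pc1, 2), " cosine with slide axis:", round(cosine, 3))
```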



Next steps to understanding PCA

Let there be rock.

-- Bon S.

Let there be data.

-- Alkes

PCA using genotype data from HapMap

using 3 million markers

from HapMap2

International HapMap Consortium 2007 Nature

PCA using genotype data from HGDP


938 HGDP individuals

Illumina 650K chip

PCA in an admixed population: African Americans

AA: 21% ± 14%

European ancestry

YRI

CHB+JPT

CEU

Price, Patterson et al. 2008 PLoS Genet

also see Smith et al. 2004 Am J Hum Genet; Bryc, Auton et al. 2010 PNAS

PCA using genotype data from Europe

3,192 Europeans

Affymetrix 500K chip

Novembre et al. 2008 Nature

also see Ralph & Coop 2013 PLoS Biol, Leslie et al. 2015 Nature, Haak et al. 2015 Nature

PCA using genotype data from Switzerland

Geographical origin of

European individuals can be

inferred to within 300-700km!

Novembre et al. 2008 Nature

also see Ralph & Coop 2013 PLoS Biol, Leslie et al. 2015 Nature, Haak et al. 2015 Nature

PCA using genotype data from 113,851 UK samples

Galinsky et al. 2016b Am J Hum Genet

also see Leslie et al. 2015 Nature

http://ukmap.facts.co/

European American population structure:

What’s inside the melting pot?

???

PCA using genotype data from European Americans

2745 European Americans


Price, Butler et al. 2008 PLoS Genet; also see Price et al. 2006 Nat Genet, Tian et al. 2008 PLoS Genet, Galinsky et al. 2016a Am J Hum Genet


Galinsky et al. 2016a Am J Hum Genet

Genetic distances (FST) between

European American subpopulations

FST(Northwest, Ashkenazi) = 0.009

FST(Southeast, Ashkenazi) = 0.004

FST(Northwest, Southeast) = 0.005

Price, Butler et al. 2008 PLoS Genet

PCA using SNP weights from external reference panels

Chen et al. 2013 Bioinformatics

PCs do not necessarily reflect population structure

• Batch effects (see Clayton et al. 2005 Nat Genet, Price et al. 2006 Nat Genet)

• Cryptic relatedness (see Patterson et al. 2006 PLoS Genet)

• Long-range LD, e.g. due to inversion polymorphisms

(see Tian et al. 2008 PLoS Genet, Price et al. 2008 Am J Hum Genet)

To LD-prune or not to LD-prune in PCA?

“We recommend inferring population structure using all markers … based on an analysis of HapMap2 data with >3 million markers (45 Chinese and 44 Japanese).”

-- Supp Note 5 of Price et al. 2006 Nat Genet

“We corrected for LD using our regression technique”.

-- Patterson et al. 2006 PLoS Genet (also see Zou et al. 2009 Hum Hered)

“We identified 24 autosomal long-range LD regions, each spanning >2Mb, that explained one of the top PCs [when running PCA] on 327 European Americans genotyped on the Illumina 550K array.”

-- Price et al. 2008 AJHG (also see Tian et al. 2008 PLoS Genet)

PCA of 531 Northern European + 387 Southern European samples sequenced at 202 genes (864kb) [Nelson et al. 2012 Science data]:

r2(PC1, true ancestry) = 0.34; increases to 0.54 with LD-pruning.

-- Galinsky et al. 2016a Am J Hum Genet (Appendix)

Is human population genetic variation

best described by clusters or clines?

“We identified six main genetic clusters, five of which correspond

to major geographic regions.” (Rosenberg et al. 2002 Science)

“When individuals are sampled homogeneously from around the

globe, the pattern seen is one of gradients of allele frequencies,

rather than discrete clusters.” (Serre and Paabo 2004 Genome Res)

“Examination of the relationship between genetic and geographic

distance supports a view in which the clusters arise not as an

artifact of the sampling scheme, but from small discontinuous

jumps in genetic distance on opposite sides of geographic barriers.”

(Rosenberg et al. 2005 PLoS Genet)

Do geographic barriers lead to clusters?

• Continuous geographic distance (along land routes) explains

69% of the variance in genetic distance between two populations.

• Continuous geographic distance (along land routes)

PLUS geographic barriers (ocean, Himalayas, Sahara) explains


This suggests that geographic barriers contribute very slightly

to genetic clustering of world populations.

Rosenberg et al. 2005 PLoS Genet

also see Pagani et al. 2016 Nature

Outline





Ancestry-informative markers (AIMs)

Standard approach to inferring genetic ancestry:

• Genotype each individual on a GWAS chip

(500,000-1,000,000 random genetic markers).

Apply model-based clustering or PCA.

Price, Butler et al. 2008 PLoS Genet


2745 European Americans


OR

AIM approach to inferring genetic ancestry:

• Genotype each individual on a small set of 50-300 AIMs:

markers that are highly informative for genetic ancestry.


Hoggart et al. 2003 Am J Hum Genet

AIMs for northwest vs. southeast Europe

100 AIMs distinguishing NW vs. SE ancestry

• Ascertained using European Americans genotyped at

100,000 to 500,000 markers.

• Validated using a panel of samples of known ancestry:

Swedish, UK, Polish, Greek, Italian, Spanish

Price, Butler et al. 2008 PLoS Genet; reviewed in Seldin & Price 2008 PLoS Genet

also see Seldin et al. 2006 PLoS Genet, Tian et al. 2008 PLoS Genet

300 AIMs for northwest vs. southeast Europe

and southeast Europe vs. Ashkenazi Jewish

100 AIMs distinguishing NW vs. SE ancestry

200 AIMs distinguishing SE vs. AJ ancestry

• Ascertained using European Americans genotyped at

100,000 to 500,000 markers.

• Validated using a panel of samples of known ancestry:

Swedish, UK, Polish, Greek, Italian, Spanish, Ashkenazi



How many AIMs are needed?

Theorem 3:

The squared correlation between an inferred axis of variation

and the true axis of variation (e.g. using genome-wide data) is

≈ x/(1+x), where x = FST times the number of AIMs.

[where FST is measured in the set of AIMs.]

Price, Butler et al. 2008 PLoS Genet, Patterson et al. 2006 PLoS Genet

also see Rosenberg et al. 2003 Am J Hum Genet


Theorem 3:





e.g. Affymetrix 500K chip for northwest vs. southeast Europe:

Effective #markers ≈ 100,000, after accounting for LD.

FST(NW Europe, SE Europe) = 0.005 (for the set of all SNPs)

x = (0.005)(100,000) = 500

x/(1+x) = 0.998.




Theorem 3:





e.g. 100 AIMs for northwest vs. southeast Europe:

FST(NW Europe, SE Europe) = 0.005 (for the set of all SNPs)

FST(NW Europe, SE Europe) = 0.07 for the set of 100 AIMs

x = (0.07)(100) = 7

x/(1+x) = 0.88.




Theorem 3:





e.g. 200 AIMs for southeast Europe vs. Ashkenazi Jewish:

FST(SE Europe, AJ) = 0.004 (for the set of all SNPs)

FST(SE Europe, AJ) = 0.04 for the set of 200 AIMs

x = (0.04)(200) = 8

x/(1+x) = 0.89.
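The three worked examples of Theorem 3 follow from one formula. A minimal sketch, with the helper name `expected_r2` chosen here for illustration:

```python
def expected_r2(fst, n_markers):
    """Approximate squared correlation x / (1 + x), where x = FST * number of markers."""
    x = fst * n_markers
    return x / (1.0 + x)


# (FST measured in the marker set, effective number of markers) from the slides.
examples = {
    "Affymetrix 500K, NW vs. SE Europe": (0.005, 100_000),
    "100 AIMs, NW vs. SE Europe": (0.07, 100),
    "200 AIMs, SE Europe vs. Ashkenazi Jewish": (0.04, 200),
}
for name, (fst, n_markers) in examples.items():
    print(f"{name}: r^2 ~ {expected_r2(fst, n_markers):.3f}")
```

Because x is a product, a small panel of markers chosen for high FST can approach the accuracy of a much larger panel of random markers.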



AIMs for Africa, Europe, Asia, America

Lao et al. 2006 Am J Hum Genet

also see Ruiz-Narvaez et al. 2011 Am J Epidemiol, Galanter et al. 2012 PLoS Genet

STRUCTURE runs

using only 10 AIMs

Conclusions

• Genetic differences between human populations are small, but populations can be distinguished using a large number of genetic markers.

• Model-based clustering is an effective way of modeling genetic variation and inferring ancestry via discrete clusters.

• PCA is an effective way of modeling genetic variation and inferring ancestry via continuous clines.

• Model-based clustering methods and PCA can be applied to random markers, or to ancestry-informative markers (AIMs), to infer genetic ancestry.


Week 2:



Outline

1. Admixture leads to variation in genome-wide ancestry

2. Admixture creates mosaic chromosomes

3. Local ancestry inference

4. Evaluating local ancestry inference algorithms

Hellenthal et al. 2014 Science

What is an admixed population?

An admixed population is a population with recent

ancestry from two or more continents

(e.g. within the past 1,000 years).

Note: the word “admixture” is also sometimes used to

refer to more ancient admixture events. (e.g. Patterson et al. 2012 Genetics, Hellenthal et al. 2014 Science)






Population admixture: [Thu of Week 2]

• Mixed ancestry from multiple continental populations.

• e.g. African Americans, Latino Americans.

• Infer local ancestry at each location in the genome.

Population admixture implies population structure.

Population structure does not imply population admixture.

Examples of admixed populations

African Americans:

• Inherit African and European ancestry

• >10% of U.S. population

Smith et al. 2004 Am J Hum Genet


Hispanic/Latino Americans:

• Inherit European and Native American

or European, Native American and African ancestry

• e.g. Mexican Americans, Puerto Ricans, etc.

• >15% of U.S. population

Bryc, Velez et al. 2010 PNAS


Latinos outside the U.S.:

• Inherit European and Native American

or European, Native American and African ancestry

• hundreds of millions of people throughout Latin America


An aside: Characteristics of African,

European and Native American populations

African populations:

• High within-population diversity, low LD (no bottleneck).

• Low genetic distance (FST) between West African populations

European populations:

• Lower within-population diversity, higher LD (bottleneck).

• Low genetic distance (FST) between European populations

Native American populations:

• Lowest within-population diversity, highest LD due to

multiple population bottlenecks.

• Very high FST between Native American populations

Cavalli-Sforza et al. 1994 The History and Geography of Human Genes

Reich et al. 2012 Nature

Other examples of admixed populations

Native Hawaiians (Polynesian, European, East Asian ancestry)

Uyghurs (East Asian and European-related ancestry)

A population that self-identifies and is described in the academic literature as “South African Coloured” (San African, Bantu African, European, S Asian, SE Asian ancestry)

Haiman et al. 2003 Hum Mol Genet,

Haiman et al. 2007 Nat Genet

Xu, Huang et al. 2008 Am J Hum Genet,

Xu & Jin 2008 Am J Hum Genet

de Wit et al. 2010 Hum Genet, Patterson et al. 2010 Hum Mol Genet,

Tishkoff et al. 2009 Science, Chimusa et al. 2013 Hum Mol Genet

Inferring genome-wide ancestry proportions

Apply the usual clustering programs, allowing fractional ancestry

(see Tue of Week 2 slides):

• STRUCTURE (Pritchard et al. 2000 Genetics, Falush et al. 2003 Genetics)

• FRAPPE (Tang et al. 2005 Genet Epidemiol, Li et al. 2008 Science)

• ADMIXTURE (Alexander et al. 2009 Genome Res)

Or, apply principal components analysis


• PCA (Price et al. 2006 Nat Genet, Patterson et al. 2006 PLoS Genet)

Admixture leads to variation in genome-wide ancestry

AA: 21% ± 14%

European ancestry

YRI

CHB+JPT

CEU

African Americans

Price, Patterson et al. 2008 PLoS Genet

also see Smith et al. 2004 Am J Hum Genet; Bryc, Auton et al. 2010 PNAS

(from Tue of Week 2)

Admixture proportion varies across individuals,

but also varies with U.S. geographic location

Kittles et al. 2007 CJHP

also see Bryc et al. 2015 Am J Hum Genet

% European ancestry in African American populations

Latino populations: 3-way admixture


European

Native American

African

Latino populations: 3-way admixture

Price et al. 2007 Am J Hum Genet; also see Bryc, Velez et al. 2010 PNAS;

Moreno-Estrada et al. 2014 Science; Ruiz-Linares et al. 2014 PLoS Genet

Mexican Americans

50% European, 45% Native American, 5% African on average,

with substantial variation among individuals.

Puerto Ricans


with substantial variation among individuals.

Brazilians and Colombians


with substantial variation among individuals. [For populations sampled. Values may not apply to all populations.]

Different Native American ancestral populations

for Latino populations in different regions

Wang et al. 2008 PLoS Genet

also see Price et al. 2007 Am J Hum Genet

Which HapMap3 populations are admixed?

Code  Population         Location  Samples
CEU   northern European  USA       180
CHB   Chinese            China     90
JPT   Japanese           Japan     90
YRI   Yoruba             Nigeria   180
TSI   Tuscan             Italy     90
CHD   Chinese            USA       100
LWK   Luhya              Kenya     90
MKK   Maasai             Kenya     180
ASW   African-American   USA       90
MXL   Mexican-American   USA       90
GIH   Gujarati-American  USA       90

PCA of all HapMap3 populations

International HapMap3 Consortium 2010 Nature (see Supp Figures)

These populations are “homogeneous”

in their continental ancestry


ASW, MKK and LWK are admixed


Bantu expansion (2000 BC – 1000 AD)

Arab migrations (500 – 1500 AD)

(Cavalli-Sforza et al. 1994, The History and Geography of Human Genes)

X = Ancestral East African population

[Figure: STRUCTURE results at K=14; legend: West African/Bantu, East African, Khoisan, Pygmy, European/Middle Eastern]


MXL (Mexican Americans) are admixed


Are GIH (Gujarati Americans) admixed?


also see Reich et al. 2009 Nature, Basu et al. 2016 PNAS

Which HGDP populations are admixed?


938 HGDP individuals

Illumina 650K chip (from Tue of Week 2)

Which HGDP populations are admixed?


admixture in

Middle East / North Africa?

Recent? Or not? (Price, Tandon et al. 2009 PLoS Genet)

European Americans: 3-way admixture!

Bryc et al. 2015 Am J Hum Genet

European Americans

>99% European,

0.2% Native American,

0.2% African on average

with substantial variation

among individuals.

Trees can also describe population structure

Unrooted tree Rooted tree Jakobsson et al. 2008 Nature Li et al. 2008 Science


also see Cavalli-Sforza et al. 2003 Nat Genet

Trees cannot model recent admixture

[Figure: tree with root and leaves YRI and CEU; two candidate placements of ASW on the tree, each labeled WRONG.]

Outline





Admixture creates mosaic chromosomes

Population 1 Population 2

1 generation later


2 generations later



several generations later


Local ancestry = 0, 1 or 2

copies from population 1






Average segment length (in Morgans) ~ 1/g [> 1/g due to recombination between segments of the same ancestry]

where g = average #generations since admixture

g ≈ 6 for African Americans, g ≈ 10 for Latino populations

Smith et al. 2004 Am J Hum Genet, Price et al. 2007 Am J Hum Genet
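To give a sense of scale, the 1/g approximation can be converted to centimorgans (1 Morgan = 100 cM). This is only the rough approximation stated above; as noted, true segments are somewhat longer. The helper name is illustrative.

```python
def approx_segment_length_cm(g):
    """Approximate average ancestry segment length in centimorgans, using 1/g Morgans."""
    return 100.0 / g


for population, g in [("African Americans", 6), ("Latino populations", 10)]:
    print(f"{population}: g ~ {g}, segment length ~ {approx_segment_length_cm(g):.1f} cM")
```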

Mosaic chromosomes create admixture-LD

Toy example: Admixed population with 50% POP1, 50% POP2

SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.

SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.

P(SNP1=A, SNP2=A) = 50%·0.10·0.10 + 50%·0.90·0.90 = 0.41

P(SNP1=A, SNP2=C) = 50%·0.10·0.90 + 50%·0.90·0.10 = 0.09

P(SNP1=C, SNP2=A) = 50%·0.90·0.10 + 50%·0.10·0.90 = 0.09

P(SNP1=C, SNP2=C) = 50%·0.90·0.90 + 50%·0.10·0.10 = 0.41

SNP1 and SNP2 are in admixture-LD in the admixed population!

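Admixture-LD depends on allele frequency differences

Toy example: Admixed population with 50% POP1, 50% POP2

SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.10 in POP2

SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

P(SNP1=A, SNP2=A) = 50%·0.10·0.10 + 50%·0.10·0.90 = 0.05

P(SNP1=A, SNP2=C) = 50%·0.10·0.90 + 50%·0.10·0.10 = 0.05

P(SNP1=C, SNP2=A) = 50%·0.90·0.10 + 50%·0.90·0.90 = 0.45

P(SNP1=C, SNP2=C) = 50%·0.90·0.90 + 50%·0.90·0.10 = 0.45

No allele frequency difference in SNP1 => no admixture-LD.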
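Both toy examples can be checked numerically, along with the resulting r2 between the two SNPs. The sketch assumes, as the toy example does, that the SNPs are independent within each ancestral population and always share the same local ancestry; the function names are illustrative.

```python
def admixed_haplotype_freqs(w1, freqs_pop1, freqs_pop2):
    """Two-SNP haplotype frequencies in an admixed population.

    w1: proportion of POP1 ancestry (POP2 gets 1 - w1).
    freqs_popK: (A-allele frequency at SNP1, A-allele frequency at SNP2) in POP K.
    """
    out = {}
    for a1 in "AC":
        for a2 in "AC":
            total = 0.0
            for weight, (f1, f2) in ((w1, freqs_pop1), (1.0 - w1, freqs_pop2)):
                q1 = f1 if a1 == "A" else 1.0 - f1
                q2 = f2 if a2 == "A" else 1.0 - f2
                total += weight * q1 * q2  # independence within each population
            out[a1 + a2] = total
    return out


def r_squared(hap):
    """Squared correlation between the A alleles at SNP1 and SNP2."""
    p1 = hap["AA"] + hap["AC"]  # frequency of A at SNP1
    p2 = hap["AA"] + hap["CA"]  # frequency of A at SNP2
    d = hap["AA"] - p1 * p2  # linkage disequilibrium coefficient D
    return d * d / (p1 * (1 - p1) * p2 * (1 - p2))


ex1 = admixed_haplotype_freqs(0.5, (0.10, 0.10), (0.90, 0.90))
ex2 = admixed_haplotype_freqs(0.5, (0.10, 0.10), (0.10, 0.90))
r2_ex1, r2_ex2 = r_squared(ex1), r_squared(ex2)
print({k: round(v, 2) for k, v in ex1.items()}, "r2 =", round(r2_ex1, 4))
print({k: round(v, 2) for k, v in ex2.items()}, "r2 =", round(r2_ex2, 4))
```

In the first example both SNPs differ in frequency between populations and r2 is about 0.41; in the second, SNP1 has no frequency difference and r2 is zero.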


Real example of admixture-LD:

rs164781: 0.42 in CEU, 0.88 in YRI (HapMap3)

rs10495758: 0.88 in CEU, 0.32 in YRI (HapMap3)

These SNPs are located roughly 3Mb apart.

r2 between rs164781 and rs10495758:

0.01 in CEU, 0.01 in YRI, 0.28 in ASW (HapMap3)

rs164781 and rs10495758 are in admixture-LD in ASW!

International HapMap3 Consortium 2010 Nature

SNPs chosen from Tandon et al. 2011 Genet Epidemiol


Collins-Schramm et al. 2003 Hum Genet

No LD in Europeans (P-values for LD not significant)


Collins-Schramm et al. 2003 Hum Genet

Admixture-LD in African Americans (significant P-values)



Local ancestry vs. Genome-wide ancestry

[Figure: local ancestry at a specific locus, contrasted with genome-wide ancestry (e.g. 20% European)]

Outline





Ancestry-informative marker (AIM) panels for

local ancestry inference in African Americans

The most informative ~1% of SNPs provide powerful information about ancestry

[Figure: scatter plot of SNP allele frequencies; x-axis: European American Frequency (0% to 100%), y-axis: West African Frequency (0% to 100%). Smith et al. 2004]

• Choose 1,500-3,000 SNPs with large Δ(EUR,AFR) (unlinked, i.e. not in LD, in ancestral populations)

Smith et al. 2004 Am J Hum Genet

Tian et al. 2006 Am J Hum Genet (slide from David Reich)

The most informative SNPs

provide powerful information

about local ancestry

“African-American

admixture map”

Ancestry-informative marker (AIM) panels for

local ancestry inference in Latino populations

Price et al. 2007 Am J Hum Genet

Mao et al. 2007 Am J Hum Genet

Tian et al. 2007 Am J Hum Genet

The most informative SNPs provide powerful information about local ancestry

“Latino admixture map”

• Choose 1,500-3,000 SNPs with large Δ(EUR,NA) (unlinked, i.e. not in LD, in ancestral populations)



Local ancestry vs. Genome-wide ancestry

Local ancestry at a specific locus: 1,500-3,000 AIMs

Genome-wide ancestry (e.g. 20% European): 25-50 AIMs

Inferring local ancestry using AIM panels

SNP chr position Eur freq Afr freq

rs2814778 1 159,174,683 0% 100%

1 SNP with Δ=100%: perfect information about local ancestry

Duffy blood group locus

see Hamblin et al. 2000 Am J Hum Genet, Hamblin et al. 2002 Am J Hum Genet



SNP        chr  position     Eur freq  Afr freq

rs1962508  1    158,677,077  4%        74%

rs2806424  1    159,423,117  84%       26%

rs1780349  1    161,340,963  44%       99%

Several SNPs with Δ=60-80%: ???




Several SNPs with Δ=60-80%: Hidden Markov Model methods

STRUCTURE (Falush et al. 2003 Genetics), ADMIXMAP (Hoggart et al. 2004 Am J Hum Genet), ANCESTRYMAP (Patterson et al. 2004 Am J Hum Genet)




Overview of Hidden Markov Model approach

• Simplifying assumption: for an individual i, suppose we know

M = genome-wide European ancestry proportion (e.g. 20%)

λ = average number of generations since admixture (e.g. 6)

• Let $X_j$ = the (unobserved) state (0, 1 or 2 European chromosomes) of this individual at marker j along the genome.

INITIAL PROBABILITIES (e.g. left end of chromosome)

TRANSITION PROBABILITIES

EMISSION PROBABILITIES

Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS, Rabiner 1989 Proceedings of the IEEE, Durbin et al. 1998 Biological Sequence Analysis








INITIAL PROBABILITIES:

$P(X_0 = 0) = (1 - M)^2$

$P(X_0 = 1) = 2M(1 - M)$

$P(X_0 = 2) = M^2$









TRANSITION PROBABILITIES:

Let d be the genetic distance (in Morgans) between markers j and j+1.

$P(X_{j+1} = 0 \mid X_j = 0) = e^{-2\lambda d} + 2e^{-\lambda d}(1 - e^{-\lambda d})(1 - M) + (1 - e^{-\lambda d})^2 (1 - M)^2$

(The three terms correspond to 0 of 2, 1 of 2 and 2 of 2 chromosomes recombining.)

$P(X_{j+1} = 1 \mid X_j = 0) = 2e^{-\lambda d}(1 - e^{-\lambda d})M + (1 - e^{-\lambda d})^2 \, 2M(1 - M)$

$P(X_{j+1} = 2 \mid X_j = 0) = (1 - e^{-\lambda d})^2 M^2$











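$P(X_{j+1} = 0 \mid X_j = 1) = e^{-\lambda d}(1 - e^{-\lambda d})(1 - M) + (1 - e^{-\lambda d})^2 (1 - M)^2$

$P(X_{j+1} = 1 \mid X_j = 1) = e^{-2\lambda d} + e^{-\lambda d}(1 - e^{-\lambda d}) + (1 - e^{-\lambda d})^2 \, 2M(1 - M)$

$P(X_{j+1} = 2 \mid X_j = 1) = e^{-\lambda d}(1 - e^{-\lambda d})M + (1 - e^{-\lambda d})^2 M^2$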











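$P(X_{j+1} = 0 \mid X_j = 2) = (1 - e^{-\lambda d})^2 (1 - M)^2$

$P(X_{j+1} = 1 \mid X_j = 2) = 2e^{-\lambda d}(1 - e^{-\lambda d})(1 - M) + (1 - e^{-\lambda d})^2 \, 2M(1 - M)$

$P(X_{j+1} = 2 \mid X_j = 2) = e^{-2\lambda d} + 2e^{-\lambda d}(1 - e^{-\lambda d})M + (1 - e^{-\lambda d})^2 M^2$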
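The nine transition probabilities can also be built from a single-chromosome model, which makes them easier to check. The sketch below is an illustration under stated assumptions rather than the ANCESTRYMAP implementation: the function name is hypothetical, and each of the two chromosomes is assumed to behave independently, either keeping its ancestry with probability $e^{-\lambda d}$ (no recombination) or redrawing it, European with probability M. Each row of the resulting matrix should sum to 1, which gives a quick sanity check on the closed-form expressions.

```python
import math


def transition_matrix(M, lam, d):
    """3x3 matrix T[x][y] = P(X_{j+1} = y | X_j = x), states = # European chromosomes.

    M   -- genome-wide European ancestry proportion
    lam -- number of generations since admixture
    d   -- genetic distance in Morgans between the two markers
    """
    e = math.exp(-lam * d)              # P(no recombination on one chromosome)
    afr_to_afr = e + (1 - e) * (1 - M)  # African chromosome stays African
    afr_to_eur = (1 - e) * M
    eur_to_eur = e + (1 - e) * M
    eur_to_afr = (1 - e) * (1 - M)
    return [
        # from 0 European chromosomes (both African)
        [afr_to_afr ** 2, 2 * afr_to_afr * afr_to_eur, afr_to_eur ** 2],
        # from 1 European chromosome (one African, one European)
        [afr_to_afr * eur_to_afr,
         afr_to_afr * eur_to_eur + afr_to_eur * eur_to_afr,
         afr_to_eur * eur_to_eur],
        # from 2 European chromosomes
        [eur_to_afr ** 2, 2 * eur_to_eur * eur_to_afr, eur_to_eur ** 2],
    ]


T = transition_matrix(M=0.2, lam=6.0, d=0.01)
for row in T:
    print([round(p, 4) for p in row], "row sum =", round(sum(row), 6))
```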









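EMISSION PROBABILITIES:

Let $p_A$ and $p_E$ be the allele frequencies of marker j in AFR and EUR, and let $g_j$ be the genotype (0, 1 or 2 copies of the allele) at marker j.

$P(g_j = 0 \mid X_j = 0) = (1 - p_A)^2$

$P(g_j = 1 \mid X_j = 0) = 2p_A(1 - p_A)$

$P(g_j = 2 \mid X_j = 0) = p_A^2$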











$P(g_j = 0 \mid X_j = 1) = (1 - p_A)(1 - p_E)$

$P(g_j = 1 \mid X_j = 1) = p_A(1 - p_E) + p_E(1 - p_A)$

$P(g_j = 2 \mid X_j = 1) = p_A p_E$











$P(g_j = 0 \mid X_j = 2) = (1 - p_E)^2$

$P(g_j = 1 \mid X_j = 2) = 2p_E(1 - p_E)$

$P(g_j = 2 \mid X_j = 2) = p_E^2$












Then apply forward-backward algorithm to infer P(Xj | genotypes).





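FORWARD PROBABILITIES: $P(X_1 \mid g_1)$, ..., $P(X_j \mid g_1 \dots g_j)$, ..., $P(X_{M-1} \mid g_1 \dots g_{M-1})$, $P(X_M \mid g_1 \dots g_M)$

BACKWARD PROBABILITIES: $P(g_2 \dots g_M \mid X_1)$, ..., $P(g_{j+1} \dots g_M \mid X_j)$, ..., $P(g_M \mid X_{M-1})$, 1

(In these subscripts, M denotes the number of markers, not the genome-wide ancestry.)

Durbin et al. 1998 Biological Sequence Analysis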












(Or, use MCMC to integrate over uncertainty in M, λ, pA, pE.)
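Putting the initial, transition and emission probabilities together, the forward-backward step can be sketched as follows. This is a minimal illustration, not the ANCESTRYMAP implementation: all function names are hypothetical, M, λ and the allele frequencies are treated as known, the initial probabilities are applied directly at the first marker, and each forward and backward vector is rescaled to sum to 1 to avoid numerical underflow. It assumes every observed genotype has nonzero probability under at least one state.

```python
import math


def init_probs(M):
    return [(1 - M) ** 2, 2 * M * (1 - M), M ** 2]


def trans_matrix(M, lam, d):
    e = math.exp(-lam * d)
    aa, ae = e + (1 - e) * (1 - M), (1 - e) * M  # African chrom: stay / switch
    ee, ea = e + (1 - e) * M, (1 - e) * (1 - M)  # European chrom: stay / switch
    return [[aa * aa, 2 * aa * ae, ae * ae],
            [aa * ea, aa * ee + ae * ea, ae * ee],
            [ea * ea, 2 * ee * ea, ee * ee]]


def emit(g, x, p_afr, p_eur):
    """P(genotype g | x European chromosomes), one allele drawn per chromosome."""
    a, b = [p_eur] * x + [p_afr] * (2 - x)
    return [(1 - a) * (1 - b), a * (1 - b) + b * (1 - a), a * b][g]


def normalize(v):
    s = sum(v)
    return [p / s for p in v]


def posterior_local_ancestry(genos, p_afr, p_eur, dists, M, lam):
    """Return P(X_j = x | all genotypes) for each marker j and x in 0, 1, 2.

    genos -- genotypes (0, 1 or 2) at each marker
    dists -- dists[j] is the genetic distance (Morgans) from marker j to j+1
    """
    n = len(genos)
    E = [[emit(genos[j], x, p_afr[j], p_eur[j]) for x in range(3)]
         for j in range(n)]
    T = [trans_matrix(M, lam, d) for d in dists]

    # Forward pass: fwd[j][x] is proportional to P(X_j = x | g_1 ... g_j).
    fwd = []
    for j in range(n):
        if j == 0:
            prior = init_probs(M)
        else:
            prior = [sum(fwd[j - 1][x] * T[j - 1][x][y] for x in range(3))
                     for y in range(3)]
        fwd.append(normalize([prior[y] * E[j][y] for y in range(3)]))

    # Backward pass: bwd[j][x] is proportional to P(g_{j+1} ... g_n | X_j = x).
    bwd = [[1.0, 1.0, 1.0] for _ in range(n)]
    for j in range(n - 2, -1, -1):
        bwd[j] = normalize(
            [sum(T[j][x][y] * E[j + 1][y] * bwd[j + 1][y] for y in range(3))
             for x in range(3)])

    return [normalize([fwd[j][x] * bwd[j][x] for x in range(3)])
            for j in range(n)]


# Hypothetical three-marker example; the middle marker is fully informative.
post = posterior_local_ancestry(
    genos=[1, 2, 1], p_afr=[0.7, 1.0, 0.8], p_eur=[0.1, 0.0, 0.2],
    dists=[0.005, 0.005], M=0.2, lam=6.0)
for j, row in enumerate(post):
    print(j, [round(p, 3) for p in row])
```

In the example, the middle marker has Δ=100%, so a homozygous genotype for the African allele pins the state there to 0 European chromosomes, and the closely spaced neighbors inherit most of that certainty through the transition probabilities.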



Big trouble if markers are in LD in ancestral populations

Example: Admixed population with 80% POP1, 20% POP2 ancestry


SNP1 = A/C SNP with A allele frequency 0.25 in POP1 and 0.75 in POP2

A allele has frequency 80%·0.25 + 20%·0.75 = 0.35 in Admixed pop.

Inference of local ancestry of a haploid chromosome using SNP1:

prob 0.35: P(POP1 | A) = 80%·0.25/(80%·0.25 + 20%·0.75 ) = 57%

prob 0.65: P(POP1 | C) = 80%·0.75/(80%·0.75 + 20%·0.25) = 92%

Overall: P(POP1) = 57%·0.35 + 92%·0.65 = 80%. Unbiased.


Big trouble if markers are in LD in ancestral populations

Example: Admixed population with 80% POP1, 20% POP2 ancestry


SNP2 = A/C SNP in perfect LD with SNP1 in POP1, POP2

A allele has frequency 80%·0.25 + 20%·0.75 = 0.35 in Admixed pop.

Inference of local ancestry of a haploid chr using SNP1, SNP2:

prob 0.35: P(POP1 | AA) = 80%·0.25²/(80%·0.25² + 20%·0.75²) = 31%

prob 0.65: P(POP1 | CC) = 80%·0.75²/(80%·0.75² + 20%·0.25²) = 97%

Overall: P(POP1) = 31%·0.35 + 97%·0.65 = 74%. Biased.
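The bias can be reproduced by treating the same SNP as if it were n independent markers. The sketch below is an illustration with hypothetical function names; it uses the frequencies from the example (A allele at 0.25 in POP1 and 0.75 in POP2, 80% POP1 ancestry) and compares the average inferred POP1 probability when the SNP is counted once versus twice.

```python
def posterior_pop1(prior1, lik1, lik2):
    """Bayes rule for a haploid chromosome: P(POP1 | data)."""
    return prior1 * lik1 / (prior1 * lik1 + (1 - prior1) * lik2)


def average_inferred_pop1(prior1, freq_a_pop1, freq_a_pop2, n_copies):
    """Average inferred P(POP1) when one SNP is (wrongly) counted n_copies times.

    n_copies = 1 is the correct analysis; n_copies = 2 mimics a second SNP in
    perfect LD that the model treats as independent evidence.
    """
    freq_a = prior1 * freq_a_pop1 + (1 - prior1) * freq_a_pop2
    post_given_a = posterior_pop1(
        prior1, freq_a_pop1 ** n_copies, freq_a_pop2 ** n_copies)
    post_given_c = posterior_pop1(
        prior1, (1 - freq_a_pop1) ** n_copies, (1 - freq_a_pop2) ** n_copies)
    return freq_a * post_given_a + (1 - freq_a) * post_given_c


unbiased = average_inferred_pop1(0.80, 0.25, 0.75, n_copies=1)
biased = average_inferred_pop1(0.80, 0.25, 0.75, n_copies=2)
print(round(unbiased, 3), round(biased, 3))
```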


Inferring local ancestry using GWAS chip data

Advantages of AIM panels of 1,500+ SNPs:

• Lower cost: $80/sample (vs. $300+/sample for GWAS chips).

Advantages of GWAS chips:

• Dense SNP coverage enables LD mapping

• More accurate local ancestry inference?


• ANCESTRYMAP using a subset of ~8,000 unlinked AIMs

(Patterson et al. 2004 Am J Hum Genet; Tandon et al. 2011 Genet Epidemiol)

New methods developed for GWAS chip data:

• SABER (Tang et al. 2006 Am J Hum Genet)

• LAMP (Sankararaman et al. 2008 Am J Hum Genet)

• uSWITCH (Sankararaman et al. 2008 Genome Res)

• HAPAA (Sundquist et al. 2008 Genome Res)

• HAPMIX (Price, Tandon et al. 2009 PLoS Genet)

• WINPOP (Pasaniuc et al. 2009 Bioinformatics)

• GEDI-ADMX (Pasaniuc et al. 2009 Lect Notes Comput Sci)

• PCA-based method (Bryc, Auton et al. 2010 PNAS)

• LAMP-LD (Baran et al. 2012 Bioinformatics)

• MULTIMIX (Churchhouse & Marchini 2013 Genet Epidemiol)

• RFMix (Maples et al. 2013 Am J Hum Genet)

reviewed in Seldin et al. 2011 Nat Rev Genet



Inferring local ancestry: LAMP method

LAMP method: (allele frequencies in ancestral populations not known)

• Prune SNP set to restrict to unlinked markers (r² < 0.10)

• Choose fixed window length l

• Infer local ancestry within each window of length l via EM algorithm (unsupervised clustering, integer-valued haploid local ancestries)

• For each SNP, compute majority vote of local ancestry across all windows overlapping that SNP.

Sankararaman et al. 2008 Am J Hum Genet

[Figure: labels 1 to 7 along the chromosome; a window of window length l is marked]

Inferring local ancestry: LAMP-ANC method

LAMP-ANC: (allele frequencies in ancestral populations are known)



• Infer local ancestry within each window of length l via max likelihood (supervised clustering, integer-valued haploid local ancestries)

• For each SNP, compute majority vote of local ancestry across all windows overlapping that SNP.
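The window-and-vote idea can be sketched in a few lines. This is a simplified illustration and not the published LAMP-ANC code: the function names are hypothetical, windows are defined by a fixed number of SNPs and slide one SNP at a time, markers are treated as independent within a window, and the state is the number of European chromosomes (0, 1 or 2) with ancestral allele frequencies assumed known.

```python
import math
from collections import Counter


def geno_lik(g, x, p_afr, p_eur):
    """P(genotype g | x European chromosomes) at one marker."""
    a, b = [p_eur] * x + [p_afr] * (2 - x)
    return [(1 - a) * (1 - b), a * (1 - b) + b * (1 - a), a * b][g]


def window_ml(genos, p_afr, p_eur):
    """Maximum-likelihood ancestry state, assumed constant across the window."""
    def loglik(x):
        return sum(math.log(max(geno_lik(g, x, pa, pe), 1e-300))
                   for g, pa, pe in zip(genos, p_afr, p_eur))
    return max(range(3), key=loglik)


def windowed_majority_vote(genos, p_afr, p_eur, win=3):
    """Call each overlapping window, then take a per-SNP majority vote."""
    n = len(genos)
    votes = [Counter() for _ in range(n)]
    for start in range(0, max(n - win, 0) + 1):
        end = min(start + win, n)
        call = window_ml(genos[start:end], p_afr[start:end], p_eur[start:end])
        for j in range(start, end):
            votes[j][call] += 1
    return [v.most_common(1)[0][0] for v in votes]


# Hypothetical data: 12 markers with one ancestry switch in the middle.
genos = [2] * 6 + [0] * 6
calls = windowed_majority_vote(genos, [0.9] * 12, [0.1] * 12, win=3)
print(calls)
```

Windows that straddle the ancestry switch violate the constant-ancestry assumption, which is why the per-SNP vote across overlapping windows matters near breakpoints.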



Inferring local ancestry: LAMP and LAMP-ANC

LAMP and LAMP-ANC:



• Infer local ancestry within each window of length l

• For each SNP, compute majority vote across windows containing SNP

Choice of window length l is key. If window length is

• too small: not enough information to infer local ancestry

• too big: violates assumption of constant local ancestry within window





Use a window length l that is inversely proportional to the number of generations since admixture, i.e. proportional to ancestry segment lengths.



LAMP and LAMP-ANC:





Advantages:

• Simple and transparent approach, low computational cost

Disadvantages:

• Information from neighboring windows is not used

• Does not make use of haplotype information



WINPOP improvement to LAMP-ANC

LAMP and LAMP-ANC:





WINPOP:

• Allow variable window length l depending on local genetic structure of ancestral populations.

• Explicitly model the possibility of one recombination event per window, enabling larger windows.

Pasaniuc et al. 2009 Bioinformatics

also see Baran et al. 2012 Bioinformatics

Inferring local ancestry: HAPMIX method

HAPMIX method: nested Hidden Markov Models

• Large-scale HMM: transitions between local ancestry states (Patterson et al. 2004 Am J Hum Genet).

• Small-scale HMM: transitions between haplotypes from ancestral reference populations (Li & Stephens 2003 Genetics)

Price, Tandon et al. 2009 PLoS Genet

[Figure: reference haplotypes hap1 to hap5 in POP1 and hap1 to hap5 in POP2]


HAPMIX method: nested Hidden Markov Models

• States: local ancestry AND haplotype from POP1 or POP2.

• Given initial, transition and emission probabilities: use forward-backward algorithm to infer P(states | data). (Durbin et al. 1998 Biological Sequence Analysis + other HMM refs)

Price, Tandon et al. 2009 PLoS Genet



Advantages:

• Large-scale + Small-scal
