Alkes Price
Harvard School of Public Health
January 31 & February 2, 2017
EPI 511, Advanced Population and Medical Genetics
Week 2:
• Population structure
• Population admixture
-
Outline
1. Introduction to population structure
2. Model-based clustering (STRUCTURE, FRAPPE programs)
3. Principal Components Analysis (PCA)
4. Ancestry-informative markers (AIMs)
-
What is population structure?
Population structure refers to genetic differences
between populations due to geographic ancestry.
-
Genetic differences between populations are small
5-7% of worldwide human genetic variation is due to
genetic differences between human populations.
The remaining 93-95% of human genetic variation is due to
genetic variation within human populations
(Rosenberg et al. 2002 Science).
-
Genetic differences between populations are small (International HapMap Consortium 2005 and 2007, Nature)
FST = 0.19
FST = 0.11
FST = 0.16
-
Populations can be distinguished using
a large number of genetic markers
• Model-based clustering programs such as STRUCTURE (Pritchard et al. 2000 Genetics)
Rosenberg et al. 2002 Science
-
Populations can be distinguished using
a large number of genetic markers
• Principal components analysis (PCA) (Cavalli-Sforza 1994, The History and Geography of Human Genes)
using 3 million markers
-
Model-based clustering vs. PCA:
What’s the difference?
Model-based clustering:
• Output for each individual: ancestry in N population clusters
• Fractional ancestry (20% pop1, 80% pop2) may be allowed
• Number N of population clusters must be decided in advance
• Results may be sensitive to number of population clusters
Principal components analysis (PCA):
• Output for each individual: ancestry as principal components
• PCs do not necessarily correspond to specific populations
• Results of top PCs are not sensitive to the number of PCs
-
Trees can also describe population structure
[Figures: unrooted tree; rooted tree]
Jakobsson et al. 2008 Nature; Li et al. 2008 Science
also see Cavalli-Sforza et al. 2003 Nat Genet
-
Population structure vs. Population admixture:
What’s the difference?
Population structure: [Tue of Week 2]
• Genetic differences due to geographic ancestry.
• Use genome-wide data to infer genome-wide ancestry.
Population admixture: [Thu of Week 2]
• Mixed ancestry from multiple continental populations.
• e.g. African Americans, Latino Americans, Hawaiians.
• Infer local ancestry at each location in the genome.
-
Population structure vs. Population stratification:
What’s the difference?
Population structure: [Tue of Week 2]
• Genetic differences due to geographic ancestry.
• Use genome-wide data to infer genome-wide ancestry.
Population stratification: [Tue of Week 3 & Thu of Week 3]
• Refers specifically to a genotype-phenotype association study.
• Differences in genetic ancestry between cases and controls.
-
Outline
1. Introduction to population structure
2. Model-based clustering (STRUCTURE, FRAPPE programs)
3. Principal Components Analysis (PCA)
4. Ancestry-informative markers (AIMs)
-
Model-based clustering when allele frequencies
in ancestral populations are known
Example 1. POP1 and POP2 with known allele frequencies.
SNP1 SNP2 SNP3 SNP4 ………………………
POP1 0.25 0.57 0.29 0.38 … (allele frequencies)
POP2 0.40 0.32 0.84 0.22 … (allele frequencies)
Individual x 2 0 1 1 … (SNP genotypes)
Does individual x belong to POP1 or POP2?
P(DATA | x is in POP1) is proportional to
$$ (0.25)^2 (0.75)^0 (0.57)^0 (0.43)^2 (0.29)^1 (0.71)^1 (0.38)^1 (0.62)^1 = 0.0006 $$
P(DATA | x is in POP2) is proportional to
$$ (0.40)^2 (0.60)^0 (0.32)^0 (0.68)^2 (0.84)^1 (0.16)^1 (0.22)^1 (0.78)^1 = 0.0017 $$
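The two likelihoods in Example 1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `genotype_likelihood` is illustrative and not from the lecture. It multiplies p^g (1-p)^(2-g) over the four SNPs and omits the binomial constant, as on the slide.

```python
import numpy as np

def genotype_likelihood(freqs, genotypes):
    """Return prod_m p_m^g_m * (1 - p_m)^(2 - g_m), dropping the binomial constant."""
    p = np.asarray(freqs, dtype=float)
    g = np.asarray(genotypes, dtype=float)
    return float(np.prod(p**g * (1.0 - p)**(2.0 - g)))

# Allele frequencies and genotypes copied from Example 1.
pop1 = [0.25, 0.57, 0.29, 0.38]
pop2 = [0.40, 0.32, 0.84, 0.22]
x = [2, 0, 1, 1]

lik1 = genotype_likelihood(pop1, x)
lik2 = genotype_likelihood(pop2, x)
best = "POP1" if lik1 > lik2 else "POP2"
print(f"{lik1:.4f} {lik2:.4f} {best}")  # 0.0006 0.0017 POP2
```

With only four SNPs the likelihood ratio is modest (about 3 to 1 in favor of POP2); with many SNPs it is usually safer to sum log-likelihoods than to multiply raw probabilities.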
-
(Fractional) model-based clustering when allele
frequencies in ancestral populations are known
Example 1. POP1 and POP2 with known allele frequencies.
SNP1 SNP2 SNP3 SNP4 ………………………
POP1 0.25 0.57 0.29 0.38 … (allele frequencies)
POP2 0.40 0.32 0.84 0.22 … (allele frequencies)
Individual x 2 0 1 1 … (SNP genotypes)
If individual x has ancestry α from POP1 and (1–α) from POP2,
then what is the most likely value of α?
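P(DATA | α) is proportional to
$$ [0.25\alpha + 0.40(1-\alpha)]^2 \, [0.75\alpha + 0.60(1-\alpha)]^0 \, [0.57\alpha + 0.32(1-\alpha)]^0 \, [0.43\alpha + 0.68(1-\alpha)]^2 $$
$$ \times \, [0.29\alpha + 0.84(1-\alpha)]^1 \, [0.71\alpha + 0.16(1-\alpha)]^1 \, [0.38\alpha + 0.22(1-\alpha)]^1 \, [0.62\alpha + 0.78(1-\alpha)]^1 $$
Maximum value 0.0020, attained at α = 0.22.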
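The maximizing value of α can be found by a simple grid search. This is a sketch under the assumption that a grid of step 0.001 over [0, 1] is fine enough; the helper name `admixed_likelihood` is illustrative. Each SNP's effective allele frequency is the ancestry-weighted mixture of the POP1 and POP2 frequencies.

```python
import numpy as np

# Allele frequencies and genotypes copied from Example 1.
pop1 = np.array([0.25, 0.57, 0.29, 0.38])
pop2 = np.array([0.40, 0.32, 0.84, 0.22])
x = np.array([2, 0, 1, 1])

def admixed_likelihood(alpha):
    """Likelihood of genotypes x given fraction alpha of POP1 ancestry."""
    p = alpha * pop1 + (1.0 - alpha) * pop2   # mixture allele frequency per SNP
    return float(np.prod(p**x * (1.0 - p)**(2 - x)))

alphas = np.linspace(0.0, 1.0, 1001)          # grid with step 0.001
liks = np.array([admixed_likelihood(a) for a in alphas])
best_alpha = float(alphas[liks.argmax()])
max_lik = float(liks.max())
print(f"{best_alpha:.2f} {max_lik:.4f}")      # 0.22 0.0020
```

Because the log-likelihood is a sum of logarithms of linear functions of α, it is concave, so the grid maximum is the global maximum on [0, 1].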
-
Model-based clustering when allele frequencies
in ancestral populations are known
General case: M SNPs (m = 1 to M), N populations (n = 1 to N),
known allele frequency pmn for SNP m in population n,
observed genotype counts gm for SNP m in individual x.
Which population (n = 1 to N) does individual x belong to?
P(DATA | x ~ population n) is proportional to
$$ \prod_{m=1}^{M} p_{mn}^{g_m} (1 - p_{mn})^{2 - g_m} $$
Answer: find the choice of n which maximizes this expression.
-
(Fractional) model-based clustering when allele
frequencies in ancestral populations are known
General case: M SNPs (m = 1 to M), N populations (n = 1 to N),
known allele frequency pmn for SNP m in population n,
observed genotype counts gm for SNP m in individual x.
If individual x has fractional ancestry αn from each population n,
subject to Σnαn = 1, then what are the most likely values of αn?
P(DATA | x ~ α1, …, αN) is proportional to
$$ \prod_{m=1}^{M} \left( \sum_{n=1}^{N} \alpha_n p_{mn} \right)^{g_m} \left( \sum_{n=1}^{N} \alpha_n (1 - p_{mn}) \right)^{2 - g_m} $$
Answer: find the values of αn which maximize this expression.
-
(Fractional) model-based clustering when allele
frequencies in ancestral populations are unknown
General case: M SNPs (m = 1 to M), N populations (n = 1 to N),
unknown allele frequency pmn for SNP m in population n,
observed genotype counts gim for SNP m in many individuals xi.
If individual xi has fractional ancestry αin from each population n,
subject to Σnαin = 1, then what are the most likely values of αin?
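P(DATA | xi ~ αi1, …, αiN for each i; pmn) is proportional to
$$ \prod_{i=1}^{I} \prod_{m=1}^{M} \left( \sum_{n=1}^{N} \alpha_{in} p_{mn} \right)^{g_{im}} \left( \sum_{n=1}^{N} \alpha_{in} (1 - p_{mn}) \right)^{2 - g_{im}} $$
Answer: find values of αin, pmn which maximize this expression.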
-
How to optimize αin and pmn?
General case: M SNPs (m = 1 to M), N populations (n = 1 to N),
unknown allele frequency pmn for SNP m in population n,
observed genotype counts gim for SNP m in many individuals xi.
Which ancestries αin and allele frequencies pmn maximize
$$ \prod_{i=1}^{I} \prod_{m=1}^{M} \left( \sum_{n=1}^{N} \alpha_{in} p_{mn} \right)^{g_{im}} \left( \sum_{n=1}^{N} \alpha_{in} (1 - p_{mn}) \right)^{2 - g_{im}} \; ? $$
• Approach #1: EM algorithm (Dempster et al. 1977 JRSS B)
(FRAPPE program; Tang et al. 2005 Genet Epidemiol)
also see ADMIXTURE program (Alexander et al. 2009 Genome Res)
-
The EM algorithm
Want to estimate ancestries αin and allele frequencies pmn.
Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.
Let zihmn = P(Zihm = n) denote expectations of hidden variables.
Here h = 0 or 1 (two haplotypes per individual)
Let gihm denote the allele carried by indiv i on hap h at SNP m
Diploid genotype gim = 0 or 1 or 2
Haploid genotype gihm = 0 or 1, Σh gihm = gim
Tang et al. 2005 Genet Epidemiol also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet
-
The EM algorithm
Want to estimate ancestries αin and allele frequencies pmn.
Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.
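If Zihm is known: choose αin and pmn to maximize
$$ \prod_{i=1}^{I} \prod_{m=1}^{M} \prod_{h=0}^{1} \begin{cases} \alpha_{i Z_{ihm}} \, p_{m Z_{ihm}} & \text{if } g_{ihm} = 1 \\ \alpha_{i Z_{ihm}} \, (1 - p_{m Z_{ihm}}) & \text{if } g_{ihm} = 0 \end{cases} $$
But Zihm is unknown. What to do?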
Tang et al. 2005 Genet Epidemiol
also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet
-
The EM algorithm
Want to estimate ancestries αin and allele frequencies pmn.
Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.
Let zihmn = P(Zihm = n) denote expectations of hidden variables.
Initialization step: Assign zihmn arbitrarily.
Tang et al. 2005 Genet Epidemiol
also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet
-
The EM algorithm
Want to estimate ancestries αin and allele frequencies pmn.
Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.
Let zihmn = P(Zihm = n) denote expectations of hidden variables.
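Expectation step: Compute expectations zihmn from αin and pmn.
$$ z_{ihmn} = \begin{cases} \dfrac{\alpha_{in} \, p_{mn}}{\sum_{n'=1}^{N} \alpha_{in'} \, p_{mn'}} & \text{if } g_{ihm} = 1 \\[2ex] \dfrac{\alpha_{in} \, (1 - p_{mn})}{\sum_{n'=1}^{N} \alpha_{in'} \, (1 - p_{mn'})} & \text{if } g_{ihm} = 0 \end{cases} $$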
Tang et al. 2005 Genet Epidemiol
also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet
-
The EM algorithm
Want to estimate ancestries αin and allele frequencies pmn.
Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.
Let zihmn = P(Zihm = n) denote expectations of hidden variables.
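Maximization step: Maximize P(DATA | αin and pmn) using zihmn.
$$ \alpha_{in} = \frac{1}{2M} \sum_{m=1}^{M} \sum_{h=0}^{1} z_{ihmn} $$
$$ p_{mn} = \frac{\sum_{i=1}^{I} \sum_{h=0}^{1} z_{ihmn} \, g_{ihm}}{\sum_{i=1}^{I} \sum_{h=0}^{1} z_{ihmn}} $$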
Tang et al. 2005 Genet Epidemiol
also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet
-
The EM algorithm
Want to estimate ancestries αin and allele frequencies pmn.
Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.
Let zihmn = P(Zihm = n) denote expectations of hidden variables.
Initialization step.
Maximization step.
Expectation step.
Maximization step.
Expectation step.
Maximization step.
etc. (to convergence)
Tang et al. 2005 Genet Epidemiol also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet
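The initialization, maximization and expectation steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the FRAPPE implementation: the function name `em_ancestry`, the simulated data, the random Dirichlet initialization, and the fixed iteration count are all assumptions. Unphased diploid genotypes are split into two haploid alleles in an arbitrary order, which is harmless here because the model treats the two haplotypes symmetrically.

```python
import numpy as np

def em_ancestry(G, N, n_iter=50, seed=0):
    """EM for ancestries alpha (I x N) and allele frequencies p (M x N).

    G is an I x M array of diploid genotypes (0, 1, or 2).
    """
    G = np.asarray(G)
    I, M = G.shape
    # Haploid alleles g_ihm with sum over h equal to g_im (arbitrary phase).
    hap = np.stack([G >= 1, G == 2], axis=1).astype(float)      # (I, 2, M)
    rng = np.random.default_rng(seed)
    # Initialization step: assign z_ihmn arbitrarily (each row sums to 1).
    z = rng.dirichlet(np.ones(N), size=(I, 2, M))                # (I, 2, M, N)
    loglik = []
    for _ in range(n_iter):
        # Maximization step.
        alpha = z.sum(axis=(1, 2)) / (2 * M)                     # (I, N)
        p = np.einsum("ihmn,ihm->mn", z, hap) / z.sum(axis=(0, 1))  # (M, N)
        # Expectation step.
        emit = np.where(hap[..., None] == 1, p[None, None], 1 - p[None, None])
        joint = alpha[:, None, None, :] * emit                   # (I, 2, M, N)
        total = joint.sum(axis=-1, keepdims=True)
        z = joint / total
        loglik.append(float(np.log(total).sum()))                # log P(DATA | alpha, p)
    return alpha, p, np.array(loglik)

# Hypothetical simulated data: 20 individuals from each of two populations.
rng = np.random.default_rng(1)
M = 200
p1 = rng.uniform(0.05, 0.95, M)
p2 = rng.uniform(0.05, 0.95, M)
G = np.vstack([rng.binomial(2, p1, size=(20, M)),
               rng.binomial(2, p2, size=(20, M))])

alpha, p, loglik = em_ancestry(G, N=2)
print(alpha[:3].round(2), alpha[-3:].round(2))
```

The recorded log-likelihood should never decrease from one iteration to the next, which is a useful sanity check on any EM implementation. Cluster labels are arbitrary, so which column of `alpha` corresponds to which population can differ between runs.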
-
Bayesian posterior inference
General case: M SNPs (m = 1 to M), N populations (n = 1 to N),
unknown allele frequency pmn for SNP m in population n,
observed genotype counts gim for SNP m in many individuals xi.
Which ancestries αin and allele frequencies pmn maximize
$$ \prod_{i=1}^{I} \prod_{m=1}^{M} \left( \sum_{n=1}^{N} \alpha_{in} p_{mn} \right)^{g_{im}} \left( \sum_{n=1}^{N} \alpha_{in} (1 - p_{mn}) \right)^{2 - g_{im}} \; ? $$
• Approach #2: Place Bayesian priors on αin and pmn, then
sample from posterior via Markov Chain Monte Carlo (MCMC)
(STRUCTURE program; Pritchard et al. 2000 Genetics)
or variational Bayes approximation
(TeraStructure program; Gopalan et al. 2016 Nat Genet)
-
Next steps to understanding model-based clustering
Let there be rock.
-- Bon S.
Let there be data.
-- Alkes
-
Application #1: Human Genome Diversity Project
Cann et al. 2002 Science, Cavalli-Sforza et al. 2005 Nat Rev Genet
also see Mallick et al. 2016 Nature (SGDP), Pagani et al. 2016 Nature (EGDP)
-
STRUCTURE results on HGDP samples
Rosenberg et al. 2002 Science
• 1,056 individuals from 52 world populations
• 377 microsatellite markers (multi-allelic)
[Figure: STRUCTURE bar plot; region labels: Africa, Europe, Western Eurasia, East Asia, Oceania, America]
-
STRUCTURE results: How many clusters?
Rosenberg et al. 2002 Science
“We do not claim that our procedure provides an accurate estimate”
(Heuristic procedure for #clusters, Pritchard et al. 2000 Genetics)
[Figure: STRUCTURE bar plot; region labels: Africa, Europe, Western Eurasia, East Asia, Oceania, America]
-
FRAPPE results on GWAS data from HGDP
Li et al. 2008 Science
• 938 HGDP individuals (118 related individuals removed)
• 51 world populations (N. Han and S. Han merged)
• Illumina 650K chip
FRAPPE results at K=7:
-
Application #2: diverse African populations
also see Figure 5 of
Cavalli-Sforza et al. 2003 Nat Genet
Language families
of Africa
-
Application #2: diverse African populations
• 2,432 individuals from 113 African populations
• 1,327 markers (microsatellite markers and indels)
STRUCTURE (Pritchard et al. 2000 Genetics) at K=14.
Tishkoff et al. 2009 Science; also see Gurdasani et al. 2015 Nature
-
STRUCTURE results on African populations
[Figure: STRUCTURE results at K=14; legend: West African/Bantu, East African, Khoisan, Pygmy, European/Middle Eastern]
Tishkoff et al. 2009 Science; also see Gurdasani et al. 2015 Nature
-
STRUCTURE results on African populations
[Figure: STRUCTURE results at K=14; legend: West African/Bantu]
Bantu expansion (2000 BC – 1000 AD)
(Cavalli-Sforza et al. 1994, The History and Geography of Human Genes)
Tishkoff et al. 2009 Science; also see Gurdasani et al. 2015 Nature
-
Outline
1. Introduction to population structure
2. Model-based clustering (STRUCTURE, FRAPPE programs)
3. Principal Components Analysis (PCA)
4. Ancestry-informative markers (AIMs)
-
Principal Components Analysis
[Figure: scatter plot of 10 points]
10 points in 1,000,000-dimensional space.
-
Axes of variation (PCs, eigenvectors)
[Figure: scatter plot of 10 points with Axis 1 drawn]
Axis 1 is the axis explaining the
maximum amount of variation.
-
Axes of variation (PCs, eigenvectors)
[Figure: scatter plot of 10 points with Axis 1 and Axis 2 drawn]
Axis 2 is the axis explaining the
maximum amount of variation
among axes orthogonal to Axis 1.
-
Axes of variation (PCs, eigenvectors)
[Figure: scatter plot of 10 points with axes labeled Axis 1, Axis 2, Axis 3, Axis 9, Axis 10]
-
Top axis of variation
[Figure: scatter plot of 10 points with Axis 1 and Axis 2 drawn]
Values labeling the 10 points:
+0.45, +0.02, +0.30, +0.09, -0.36, -0.33, +0.22, -0.08, -0.18, -0.50
-
The math
Let X be an M x N matrix with M > N (e.g. M SNPs, N individuals)
Let Ψ be the N x N covariance matrix of X:
Ψjk = Cov(xj, xk), where xj and xk are jth and kth columns of X.
Matrix diagonalization (Eigen-decomposition):
Ψ = V D V^T, where
D is a diagonal N x N matrix of eigenvalues
V is an N x N matrix whose columns are the eigenvectors of Ψ
Eigenvectors are orthonormal (V^T V = I), thus ΨV = VD, i.e.
Ψvj = djvj (vj = jth eigenvector, dj = jth eigenvalue)
Pearson 1901 Phil Mag, Ser B
Hotelling 1933 J Educ Psychol
Jackson 2003, A User’s Guide to Principal Components
-
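Toy Example: Eigenvalue 1 (PC1)
$$ X = \begin{pmatrix} 2 & -2 \\ 1 & -1 \\ 0 & 0 \\ -1 & 1 \\ -2 & 2 \end{pmatrix}, \quad \Psi = \begin{pmatrix} 10 & -10 \\ -10 & 10 \end{pmatrix} = V D V^T $$
$$ V = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}, \quad D = \begin{pmatrix} 20 & 0 \\ 0 & 0 \end{pmatrix} $$
$$ \Psi v_1 = d_1 v_1 = \begin{pmatrix} 20/\sqrt{2} \\ -20/\sqrt{2} \end{pmatrix} $$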
-
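Toy Example: Eigenvalue 2 (PC2)
$$ X = \begin{pmatrix} 2 & -2 \\ 1 & -1 \\ 0 & 0 \\ -1 & 1 \\ -2 & 2 \end{pmatrix}, \quad \Psi = \begin{pmatrix} 10 & -10 \\ -10 & 10 \end{pmatrix} = V D V^T $$
$$ V = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}, \quad D = \begin{pmatrix} 20 & 0 \\ 0 & 0 \end{pmatrix} $$
$$ \Psi v_2 = d_2 v_2 = \begin{pmatrix} 0 \\ 0 \end{pmatrix} $$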
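The toy eigen-decomposition can be reproduced with NumPy. This sketch assumes, as the slide's numbers imply, that Ψ is computed as X^T X without dividing by the number of rows (the columns of X already have mean zero). The sign of each eigenvector returned by `numpy.linalg.eigh` is arbitrary.

```python
import numpy as np

X = np.array([[2, -2], [1, -1], [0, 0], [-1, 1], [-2, 2]], dtype=float)

# Columns of X are already mean-centered, so X^T X is the (unnormalized)
# covariance matrix shown on the slide.
Psi = X.T @ X

# eigh handles symmetric matrices and returns eigenvalues in ascending order;
# reverse so that the top eigenvalue comes first.
evals, evecs = np.linalg.eigh(Psi)
order = np.argsort(evals)[::-1]
d = evals[order]
V = evecs[:, order]

print(Psi)
print(d.round(6))
print(V.round(4))
```

The first column of `V` is PC1, proportional to (1, -1): all of the variation in this toy example lies along the direction in which the two columns of X differ, and PC2 has eigenvalue 0.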
-
PCA on genotype data
G = M x N matrix of individual genotypes
M SNPs, N individuals
gij = genotype (0, 1, or 2 alleles) of SNP i in individual j
• Subtract off the mean of SNP i: pi = Avgj gij/2, set gij = gij - 2pi
(Missing data: set gij = 0 if SNP i in individual j is missing data)
• Optional: normalize by $\sqrt{2 p_i (1 - p_i)}$, i.e. set gij = gij / $\sqrt{2 p_i (1 - p_i)}$
Ψ = N x N covariance matrix of G
Ψ = V D V^T (Eigen-decomposition)
Columns of V are eigenvectors (principal components, PCs) of G.
Diagonal entries of D are eigenvalues of G.
The hope: Top PCs (PC1, PC2) correspond to genetic ancestry.
Price et al. 2006 Nat Genet, Patterson et al. 2006 PLoS Genet
also see McVean 2009 PLoS Genet, Engelhardt & Stephens 2010 PLoS Genet
-
Approximating top PCs quickly in genetic data
• Power iteration: a random vector is repeatedly multiplied by the
target matrix A, stretching it along the top eigenvector of A.
• In genetic data, GRM A = XTX/M , where X = norm. genotypes.
Multiply vector by X and XT in turn to avoid cost of computing A.
• Can approximate a fixed number of top PCs in time O(MN)
Rokhlin et al. 2009 J Matrix Anal Appl
Halko et al. 2011 SIAM Rev
Galinsky et al. 2016a Am J Hum Genet
http://www.math.drexel.edu/~pg/520/Math520.html
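The power-iteration idea can be sketched as follows. This is a simplified single-vector version, not the randomized multi-PC algorithms of the cited papers; the function name `top_pc_power_iteration`, the simulated two-population data, and the iteration count are assumptions made for illustration. The key point is that the vector is multiplied by X and then by X^T, so the N x N matrix A is never formed inside the loop.

```python
import numpy as np

def top_pc_power_iteration(X, n_iter=100, seed=0):
    """Approximate the top eigenpair of A = X^T X / M without forming A.

    X is an M x N matrix of normalized genotypes (M SNPs, N individuals).
    """
    M, N = X.shape
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(N)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = X.T @ (X @ v) / M          # two matrix-vector products, O(MN) each
        v = w / np.linalg.norm(w)
    eigenvalue = float(v @ (X.T @ (X @ v)) / M)
    return eigenvalue, v

# Hypothetical simulated data: two populations whose allele frequencies differ by 0.3.
rng = np.random.default_rng(1)
M, N = 500, 40
base = rng.uniform(0.25, 0.75, size=M)
in_pop1 = np.arange(N)[None, :] < N // 2
freqs = np.where(in_pop1, (base + 0.15)[:, None], (base - 0.15)[:, None])
G = rng.binomial(2, freqs).astype(float)

# Mean-center and normalize each SNP; drop monomorphic SNPs to avoid dividing by zero.
p = G.mean(axis=1) / 2
keep = (p > 0) & (p < 1)
G, p = G[keep], p[keep]
X = (G - 2 * p[:, None]) / np.sqrt(2 * p * (1 - p))[:, None]

eigenvalue, v = top_pc_power_iteration(X)

# Exact answer for comparison (feasible here only because N is small).
A = X.T @ X / X.shape[0]
exact_vals, exact_vecs = np.linalg.eigh(A)
print(abs(v @ exact_vecs[:, -1]))
```

Convergence speed depends on the ratio of the second to the first eigenvalue, so strongly structured data like this simulation converges in far fewer than 100 iterations.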
-
PCA on genotype data: Toy Example
Price et al. 2006 Nat Genet
Genotypes (rows: 7 SNPs; columns: 5 individuals):
1 1 1 0 0
0 1 2 1 2
2 1 1 0 1
0 0 1 2 2
2 1 1 0 0
0 0 1 1 1
2 2 1 1 0
Next step: mean-adjust each SNP.
-
PCA on genotype data: Toy Example
Price et al. 2006 Nat Genet
Mean-adjusted genotypes (rows: 7 SNPs; columns: 5 individuals):
0.4 0.4 0.4 -0.6 -0.6
-1.2 -0.2 0.8 -0.2 0.8
1.0 0.0 0.0 -1.0 0.0
-1.0 -1.0 0.0 1.0 1.0
1.2 0.2 0.2 -0.8 -0.8
-0.6 -0.6 0.4 0.4 0.4
0.8 0.8 -0.2 -0.2 -1.2
Covariance matrix (5 individuals x 5 individuals):
0.9 0.4 -0.2 -0.5 -0.6
0.4 0.3 0.0 -0.3 -0.4
-0.2 0.0 0.1 0.0 0.1
-0.5 -0.3 0.0 0.4 0.3
-0.6 -0.4 0.1 0.3 0.6
-
PCA on genotype data: Toy Example
Price et al. 2006 Nat Genet
Mean-adjusted genotypes (rows: 7 SNPs; columns: 5 individuals):
0.4 0.4 0.4 -0.6 -0.6
-1.2 -0.2 0.8 -0.2 0.8
1.0 0.0 0.0 -1.0 0.0
-1.0 -1.0 0.0 1.0 1.0
1.2 0.2 0.2 -0.8 -0.8
-0.6 -0.6 0.4 0.4 0.4
0.8 0.8 -0.2 -0.2 -1.2
PCA axis of variation (one coordinate per individual):
0.7 0.3 -0.1 -0.4 -0.5
-
PCA on genotype data: Toy Example
Price et al. 2006 Nat Genet
Genotypes (rows: 7 SNPs; columns: 5 individuals):
1 1 1 0 0
0 1 2 1 2
2 1 1 0 1
0 0 1 2 2
2 1 1 0 0
0 0 1 1 1
2 2 1 1 0
PCA axis of variation (one coordinate per individual):
0.7 0.3 -0.1 -0.4 -0.5
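The toy example can be retraced step by step in NumPy. This is a sketch under two assumptions: the covariance is computed as X^T X / M, and the optional per-SNP normalization is skipped. For that reason the result should be expected to match the slide's axis of variation only approximately.

```python
import numpy as np

# Genotype matrix from the toy example: 7 SNPs (rows) x 5 individuals (columns).
G = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 2, 1, 2],
    [2, 1, 1, 0, 1],
    [0, 0, 1, 2, 2],
    [2, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [2, 2, 1, 1, 0],
], dtype=float)
M, N = G.shape

# Step 1: mean-adjust each SNP (p_i is the allele frequency of SNP i).
p = G.mean(axis=1) / 2
X = G - 2 * p[:, None]

# Step 2: N x N covariance matrix across individuals.
Psi = X.T @ X / M

# Step 3: eigen-decomposition; eigh returns eigenvalues in ascending order.
evals, evecs = np.linalg.eigh(Psi)
pc1 = evecs[:, -1]
if pc1[0] < 0:          # eigenvector sign is arbitrary; fix it for display
    pc1 = -pc1

print(X.round(1))
print(Psi.round(1))
print(pc1.round(2))
```

The top PC orders the five individuals along a single gradient, from the first individual at one end to the fifth at the other, which is the pattern visible in the raw genotypes (for example, SNPs 4 and 7 change monotonically across individuals).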
-
Next steps to understanding PCA
Let there be rock.
-- Bon S.
Let there be data.
-- Alkes
-
PCA using genotype data from HapMap
using 3 million markers
from HapMap2
International HapMap Consortium 2007 Nature
-
PCA using genotype data from HGDP
Li et al. 2008 Science
938 HGDP individuals
Illumina 650K chip
-
PCA in an admixed population: African Americans
African Americans (AA): 21% ± 14% European ancestry
[Figure: PCA plot; population labels: YRI, CHB+JPT, CEU]
Price, Patterson et al. 2008 PLoS Genet
also see Smith et al. 2004 Am J Hum Genet; Bryc, Auton et al. 2010 PNAS
-
PCA using genotype data from Europe
3,192 Europeans
Affymetrix 500K chip
Novembre et al. 2008 Nature
also see Ralph & Coop 2013 PLoS Biol, Leslie et al. 2015 Nature, Haak et al. 2015 Nature
-
PCA using genotype data from Switzerland
Geographical origin of
European individuals can be
inferred to within 300-700km!
Novembre et al. 2008 Nature
also see Ralph & Coop 2013 PLoS Biol, Leslie et al. 2015 Nature, Haak et al. 2015 Nature
-
PCA using genotype data from 113,851 UK samples
Galinsky et al. 2016b Am J Hum Genet
also see Leslie et al. 2015 Nature
http://ukmap.facts.co/
-
European American population structure:
What’s inside the melting pot?
???
-
PCA using genotype data from European Americans
2745 European Americans
Affymetrix 500K chip
Price, Butler et al. 2008 PLoS Genet; also see Price et al. 2006 Nat Genet, Tian et al. 2008 PLoS Genet, Galinsky et al. 2016a Am J Hum Genet
-
PCA using genotype data from European Americans
Galinsky et al. 2016a Am J Hum Genet
-
Genetic distances (FST) between
European American subpopulations
FST(Northwest, Ashkenazi) = 0.009
FST(Southeast, Ashkenazi) = 0.004
FST(Northwest, Southeast) = 0.005
Price, Butler et al. 2008 PLoS Genet
-
PCA using SNP weights from external reference panels
Chen et al. 2013 Bioinformatics
-
PCs do not necessarily reflect population structure
• Batch effects (see Clayton et al. 2005 Nat Genet, Price et al. 2006 Nat Genet)
• Cryptic relatedness (see Patterson et al. 2006 PLoS Genet)
• Long-range LD, e.g. due to inversion polymorphisms
(see Tian et al. 2008 PLoS Genet, Price et al. 2008 Am J Hum Genet)
-
To LD-prune or not to LD-prune in PCA?
“We recommend inferring population structure using all markers …
based on an analysis of HapMap2 data with >3 million markers
(45 Chinese and 44 Japanese).”
-- Supp Note 5 of Price et al. 2006 Nat Genet
“We corrected for LD using our regression technique”.
-- Patterson et al. 2006 PLoS Genet (also see Zou et al. 2009 Hum Hered)
“We identified 24 autosomal long-range LD regions, each spanning
>2Mb, that explained one of the top PCs [when running PCA] on
327 European Americans genotyped on the Illumina 550K array.”
-- Price et al. 2008 AJHG (also see Tian et al. 2008 PLoS Genet)
PCA of 531 Northern European + 387 Southern European samples
sequenced at 202 genes (864kb) [Nelson et al. 2012 Science data]:
r2(PC1, true ancestry) = 0.34; increases to 0.54 with LD-pruning.
-- Galinsky et al. 2016a Am J Hum Genet (Appendix)
-
Is human population genetic variation
best described by clusters or clines?
“We identified six main genetic clusters, five of which correspond
to major geographic regions.” (Rosenberg et al. 2002 Science)
“When individuals are sampled homogeneously from around the
globe, the pattern seen is one of gradients of allele frequencies,
rather than discrete clusters.” (Serre and Paabo 2004 Genome Res)
“Examination of the relationship between genetic and geographic
distance supports a view in which the clusters arise not as an
artifact of the sampling scheme, but from small discontinuous
jumps in genetic distance on opposite sides of geographic barriers.”
(Rosenberg et al. 2005 PLoS Genet)
-
Do geographic barriers lead to clusters?
• Continuous geographic distance (along land routes) explains
69% of the variance in genetic distance between two populations.
• Continuous geographic distance (along land routes)
PLUS geographic barriers (ocean, Himalayas, Sahara) explains
73% of the variance in genetic distance between two populations.
This suggests that geographic barriers contribute very slightly
to genetic clustering of world populations.
Rosenberg et al. 2005 PLoS Genet
also see Pagani et al. 2016 Nature
-
Outline
1. Introduction to population structure
2. Model-based clustering (STRUCTURE, FRAPPE programs)
3. Principal Components Analysis (PCA)
4. Ancestry-informative markers (AIMs)
-
Ancestry-informative markers (AIMs)
Standard approach to inferring genetic ancestry:
• Genotype each individual on a GWAS chip
(500,000-1,000,000 random genetic markers).
Apply model-based clustering or PCA.
OR
AIM approach to inferring genetic ancestry:
• Genotype each individual on a small set of 50-300 AIMs:
markers that are highly informative for genetic ancestry.
Apply model-based clustering or PCA.
Hoggart et al. 2003 Am J Hum Genet
-
AIMs for northwest vs. southeast Europe
100 AIMs distinguishing NW vs. SE ancestry
• Ascertained using European Americans genotyped at
100,000 to 500,000 markers.
• Validated using a panel of samples of known ancestry:
Swedish, UK, Polish, Greek, Italian, Spanish
Price, Butler et al. 2008 PLoS Genet; reviewed in Seldin & Price 2008 PLoS Genet
also see Seldin et al. 2006 PLoS Genet, Tian et al. 2008 PLoS Genet
-
300 AIMs for northwest vs. southeast Europe
and southeast Europe vs. Ashkenazi Jewish
100 AIMs distinguishing NW vs. SE ancestry
200 AIMs distinguishing SE vs. AJ ancestry
• Ascertained using European Americans genotyped at
100,000 to 500,000 markers.
• Validated using a panel of samples of known ancestry:
Swedish, UK, Polish, Greek, Italian, Spanish, Ashkenazi
Price, Butler et al. 2008 PLoS Genet; reviewed in Seldin & Price 2008 PLoS Genet
also see Seldin et al. 2006 PLoS Genet, Tian et al. 2008 PLoS Genet
-
300 AIMs for northwest vs. southeast Europe
and southeast Europe vs. Ashkenazi Jewish
Price, Butler et al. 2008 PLoS Genet; reviewed in Seldin & Price 2008 PLoS Genet
also see Seldin et al. 2006 PLoS Genet, Tian et al. 2008 PLoS Genet
-
How many AIMs are needed?
Theorem 3:
The squared correlation between an inferred axis of variation
and the true axis of variation (e.g. using genome-wide data) is
≈ x/(1+x), where x = FST times the number of AIMs.
[where FST is measured in the set of AIMs.]
e.g. Affymetrix 500K chip for northwest vs. southeast Europe:
Effective #markers ≈ 100,000, after accounting for LD.
FST(NW Europe, SE Europe) = 0.005 (for the set of all SNPs)
x = (0.005)(100,000) = 500
x/(1+x) = 0.998.
Price, Butler et al. 2008 PLoS Genet, Patterson et al. 2006 PLoS Genet
also see Rosenberg et al. 2003 Am J Hum Genet
-
How many AIMs are needed?
Theorem 3:
The squared correlation between an inferred axis of variation
and the true axis of variation (e.g. using genome-wide data) is
≈ x/(1+x), where x = FST times the number of AIMs.
[where FST is measured in the set of AIMs.]
e.g. 100 AIMs for northwest vs. southeast Europe:
FST(NW Europe, SE Europe) = 0.005 (for the set of all SNPs)
FST(NW Europe, SE Europe) = 0.07 for the set of 100 AIMs
x = (0.07)(100) = 7
x/(1+x) = 0.88.
Price, Butler et al. 2008 PLoS Genet, Patterson et al. 2006 PLoS Genet
also see Rosenberg et al. 2003 Am J Hum Genet
-
How many AIMs are needed?
Theorem 3:
The squared correlation between an inferred axis of variation
and the true axis of variation (e.g. using genome-wide data) is
≈ x/(1+x), where x = FST times the number of AIMs.
[where FST is measured in the set of AIMs.]
e.g. 200 AIMs for southeast Europe vs. Ashkenazi Jewish:
FST(SE Europe, AJ) = 0.004 (for the set of all SNPs)
FST(SE Europe, AJ) = 0.04 for the set of 200 AIMs
x = (0.04)(200) = 8
x/(1+x) = 0.89.
Price, Butler et al. 2008 PLoS Genet, Patterson et al. 2006 PLoS Genet
also see Rosenberg et al. 2003 Am J Hum Genet
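The three worked examples of Theorem 3 can be reproduced with a short helper. The function names `expected_r2` and `markers_needed` are illustrative; the second is simply the algebraic rearrangement of r² = x/(1+x) with x = FST times the number of markers, and is not stated in the lecture.

```python
def expected_r2(fst, n_markers):
    """Approximate squared correlation between inferred and true axis: x / (1 + x)."""
    x = fst * n_markers
    return x / (1.0 + x)

def markers_needed(fst, target_r2):
    """Number of markers giving the target r2, from solving r2 = x / (1 + x) for x."""
    return target_r2 / ((1.0 - target_r2) * fst)

# Values copied from the three examples.
r2_chip = expected_r2(0.005, 100_000)   # Affymetrix 500K, effective #markers ~ 100,000
r2_nw_se = expected_r2(0.07, 100)       # 100 AIMs, NW vs. SE Europe
r2_se_aj = expected_r2(0.04, 200)       # 200 AIMs, SE Europe vs. Ashkenazi Jewish

print(f"{r2_chip:.3f} {r2_nw_se:.3f} {r2_se_aj:.3f}")
```

The helper makes the trade-off explicit: because AIMs are selected for high FST (0.07 versus 0.005 genome-wide for NW vs. SE Europe), 100 of them reach an r² near 0.88, whereas the same r² with random markers at FST = 0.005 would take about 1,400 independent markers.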
-
AIMs for Africa, Europe, Asia, America
Lao et al. 2006 Am J Hum Genet
also see Ruiz-Narvaez et al. 2011 Am J Epidemiol, Galanter et al. 2012 PLoS Genet
STRUCTURE runs
using only 10 AIMs
-
Conclusions
• Genetic differences between human populations are small, but
populations can be distinguished using a large number of
genetic markers.
• Model-based clustering is an effective way of modeling
genetic variation and inferring ancestry via discrete clusters.
• PCA is an effective way of modeling genetic variation and
inferring ancestry via continuous clines.
• Model-based clustering methods and PCA can be applied to
random markers, or to ancestry-informative markers (AIMs),
to infer genetic ancestry.
-
EPI 511, Advanced Population and Medical Genetics
Week 2:
• Population structure
• Population admixture
-
Outline
1. Admixture leads to variation in genome-wide ancestry
2. Admixture creates mosaic chromosomes
3. Local ancestry inference
4. Evaluating local ancestry inference algorithms
-
Hellenthal et al. 2014 Science
-
What is an admixed population?
An admixed population is a population with recent
ancestry from two or more continents
(e.g. within the past 1,000 years).
-
What is an admixed population?
An admixed population is a population with recent
ancestry from two or more continents
(e.g. within the past 1,000 years).
Note: the word “admixture” is also sometimes used to
refer to more ancient admixture events (e.g. Patterson et al. 2012 Genetics, Hellenthal et al. 2014 Science).
-
Population structure vs. Population admixture:
What’s the difference?
Population structure: [Tue of Week 2]
• Genetic differences due to geographic ancestry.
• Use genome-wide data to infer genome-wide ancestry.
Population admixture: [Thu of Week 2]
• Mixed ancestry from multiple continental populations.
• e.g. African Americans, Latino Americans.
• Infer local ancestry at each location in the genome.
Population admixture implies population structure.
Population structure does not imply population admixture.
-
Examples of admixed populations
African Americans:
• Inherit African and European ancestry
• >10% of U.S. population
Smith et al. 2004 Am J Hum Genet
-
Examples of admixed populations
Hispanic/Latino Americans:
• Inherit European and Native American
or European, Native American and African ancestry
• e.g. Mexican Americans, Puerto Ricans, etc.
• >15% of U.S. population
Bryc, Velez et al. 2010 PNAS
-
Examples of admixed populations
Latinos outside the U.S.:
• Inherit European and Native American
or European, Native American and African ancestry
• hundreds of millions of people throughout Latin America
Bryc, Velez et al. 2010 PNAS
-
An aside: Characteristics of African,
European and Native American populations
African populations:
• High within-population diversity, low LD (no bottleneck).
• Low genetic distance (FST) between West African populations
European populations:
• Lower within-population diversity, higher LD (bottleneck).
• Low genetic distance (FST) between European populations
Native American populations:
• Lowest within-population diversity, highest LD due to
multiple population bottlenecks.
• Very high FST between Native American populations
Cavalli-Sforza et al. 1994 The History and Geography of Human Genes
Reich et al. 2012 Nature
-
Other examples of admixed populations
Native Hawaiians (Polynesian, European, East Asian ancestry)
Uyghurs (East Asian and European-related ancestry)
A population that self-identifies and is described in the academic literature as “South African Coloured” (San African, Bantu African, European, S Asian, SE Asian ancestry)
Haiman et al. 2003 Hum Mol Genet,
Haiman et al. 2007 Nat Genet
Xu, Huang et al. 2008 Am J Hum Genet,
Xu & Jin 2008 Am J Hum Genet
de Wit et al. 2010 Hum Genet, Patterson et al. 2010 Hum Mol Genet,
Tishkoff et al. 2009 Science, Chimusa et al. 2013 Hum Mol Genet
-
Inferring genome-wide ancestry proportions
Apply the usual clustering programs, allowing fractional ancestry
(see Tue of Week 2 slides):
• STRUCTURE (Pritchard et al. 2000 Genetics, Falush et al. 2003 Genetics)
• FRAPPE (Tang et al. 2005 Genet Epidemiol, Li et al. 2008 Science)
• ADMIXTURE (Alexander et al. 2009 Genome Res)
-
Inferring genome-wide ancestry proportions
Apply the usual clustering programs, allowing fractional ancestry
(see Tue of Week 2 slides):
• STRUCTURE (Pritchard et al. 2000 Genetics, Falush et al. 2003 Genetics)
• FRAPPE (Tang et al. 2005 Genet Epidemiol, Li et al. 2008 Science)
• ADMIXTURE (Alexander et al. 2009 Genome Res)
Or, apply principal components analysis
(see Tue of Week 2 slides):
• PCA (Price et al. 2006 Nat Genet, Patterson et al. 2006 PLoS Genet)
-
Admixture leads to variation in genome-wide ancestry
[Figure: African Americans (AA) plotted with YRI, CEU and CHB+JPT reference samples; AA: 21% ± 14% European ancestry]
Price, Patterson et al. 2008 PLoS Genet
also see Smith et al. 2004 Am J Hum Genet; Bryc, Auton et al. 2010 PNAS
(from Tue of Week 2)
-
Admixture proportion varies across individuals,
but also varies with U.S. geographic location
Kittles et al. 2007 CJHP
also see Bryc et al. 2015 Am J Hum Genet
% European ancestry in African American populations
-
Latino populations: 3-way admixture
Bryc, Velez et al. 2010 PNAS
European
Native American
African
-
Latino populations: 3-way admixture
Price et al. 2007 Am J Hum Genet; also see Bryc, Velez et al. 2010 PNAS;
Moreno-Estrada et al. 2014 Science; Ruiz-Linares et al. 2014 PLoS Genet
Mexican Americans
50% European, 45% Native American, 5% African on average,
with substantial variation among individuals.
Puerto Ricans
60% European, 20% Native American, 20% African on average,
with substantial variation among individuals.
Brazilians and Colombians
70% European, 20% Native American, 10% African on average,
with substantial variation among individuals. [For populations sampled. Values may not apply to all populations.]
-
Different Native American ancestral populations
for Latino populations in different regions
Wang et al. 2008 PLoS Genet
also see Price et al. 2007 Am J Hum Genet
-
Which HapMap3 populations are admixed?
CEU   northern European   USA       180
CHB   Chinese             China      90
JPT   Japanese            Japan      90
YRI   Yoruba              Nigeria   180
TSI   Tuscan              Italy      90
CHD   Chinese             USA       100
LWK   Luhya               Kenya      90
MKK   Maasai              Kenya     180
ASW   African-American    USA        90
MXL   Mexican-American    USA        90
GIH   Gujarati-American   USA        90
-
PCA of all HapMap3 populations
International HapMap3 Consortium 2010 Nature (see Supp Figures)
-
These populations are “homogeneous”
in their continental ancestry
International HapMap3 Consortium 2010 Nature (see Supp Figures)
-
ASW, MKK and LWK are admixed
International HapMap3 Consortium 2010 Nature (see Supp Figures)
-
ASW, MKK and LWK are admixed
International HapMap3 Consortium 2010 Nature (see Supp Figures)
Bantu expansion
(2000 BC – 1000 AD)
Arab migrations
(500 – 1500 AD)
(Cavalli-Sforza et al. 1994,
The History and Geography
Of Human Genes)
X Ancestral East African population
-
STRUCTURE results on African populations
= West African/Bantu = East African
= Khoisan
= Pygmy
= European/Middle Eastern
Tishkoff et al. 2009 Science; also see Gurdasani et al. 2015 Nature
K=14:
(from Tue of Week 2)
-
MXL (Mexican Americans) are admixed
International HapMap3 Consortium 2010 Nature (see Supp Figures)
-
Are GIH (Gujarati Americans) admixed?
International HapMap3 Consortium 2010 Nature (see Supp Figures)
also see Reich et al. 2009 Nature, Basu et al. 2016 PNAS
-
Which HGDP populations are admixed?
Li et al. 2008 Science
938 HGDP individuals
Illumina 650K chip (from Tue of Week 2)
-
Which HGDP populations are admixed?
Li et al. 2008 Science
admixture in
Middle East / North Africa?
Recent? Or not? (Price, Tandon et al. 2009 PLoS Genet)
-
European Americans: 3-way admixture!
Bryc et al. 2015 Am J Hum Genet
European Americans
>99% European,
0.2% Native American,
0.2% African on average
with substantial variation
among individuals.
-
Trees can also describe population structure
Unrooted tree (Jakobsson et al. 2008 Nature); Rooted tree (Li et al. 2008 Science)
(from Tue of Week 2)
also see Cavalli-Sforza et al. 2003 Nat Genet
-
Trees cannot model recent admixture
[Figure: two rooted trees relating YRI, CEU and ASW; both are labeled WRONG.]
-
Outline
1. Admixture leads to variation in genome-wide ancestry
2. Admixture creates mosaic chromosomes
3. Local ancestry inference
4. Evaluating local ancestry inference algorithms
-
Admixture creates mosaic chromosomes
Population 1 Population 2
1 generation later
-
Admixture creates mosaic chromosomes
Population 1 Population 2
2 generations later
-
Admixture creates mosaic chromosomes
Population 1 Population 2
several generations later
Local ancestry = 0, 1 or 2 copies from population 1
-
Admixture creates mosaic chromosomes
Population 1 Population 2
several generations later
Local ancestry = 0, 1 or 2 copies from population 1
Average segment length (in Morgans) ~ 1/g
where g = average #generations since admixture
g ≈ 6 for African Americans, g ≈ 10 for Latino populations
Smith et al. 2004 Am J Hum Genet, Price et al. 2007 Am J Hum Genet
-
Admixture creates mosaic chromosomes
Population 1 Population 2
several generations later
Local ancestry = 0, 1 or 2 copies from population 1
Average segment length (in Morgans) ~ 1/g [in practice > 1/g, because recombination between two segments of the same ancestry does not create a visible breakpoint]
where g = average #generations since admixture
g ≈ 6 for African Americans, g ≈ 10 for Latino populations
Smith et al. 2004 Am J Hum Genet, Price et al. 2007 Am J Hum Genet
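As a rough illustration of the 1/g rule, the sketch below converts the number of generations since admixture into an approximate mean segment length in centimorgans. It assumes a single pulse of admixture g generations ago and ignores the upward correction for recombination between segments of the same ancestry; the function name is illustrative.

```python
def mean_segment_length_cm(generations: float) -> float:
    """Approximate mean ancestry segment length: 1/g Morgans = 100/g centimorgans."""
    return 100.0 / generations


# Values of g quoted on the slide.
for label, g in [("African Americans", 6), ("Latino populations", 10)]:
    print(f"{label}: g = {g}, mean segment length ~ {mean_segment_length_cm(g):.1f} cM")
```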
-
Mosaic chromosomes create admixture-LD
Toy example: Admixed population with 50% POP1, 50% POP2
SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2
SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2
SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.
SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.
-
Mosaic chromosomes create admixture-LD
Toy example: Admixed population with 50% POP1, 50% POP2
SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2
SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2
SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.
SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.
P(SNP1=A, SNP2=A) = 50%·0.10·0.10 [POP1] + 50%·0.90·0.90 [POP2] = 0.41
-
Mosaic chromosomes create admixture-LD
Toy example: Admixed population with 50% POP1, 50% POP2
SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2
SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2
SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.
SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.
P(SNP1=A, SNP2=A) = 50%·0.10·0.10 + 50%·0.90·0.90 = 0.41
P(SNP1=A, SNP2=C) = 50%·0.10·0.90 + 50%·0.90·0.10 = 0.09
P(SNP1=C, SNP2=A) = 50%·0.90·0.10 + 50%·0.10·0.90 = 0.09
P(SNP1=C, SNP2=C) = 50%·0.90·0.90 + 50%·0.10·0.10 = 0.41
-
Mosaic chromosomes create admixture-LD
Toy example: Admixed population with 50% POP1, 50% POP2
SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2
SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2
SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.
SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.
P(SNP1=A, SNP2=A) = 50%·0.10·0.10 + 50%·0.90·0.90 = 0.41
P(SNP1=A, SNP2=C) = 50%·0.10·0.90 + 50%·0.90·0.10 = 0.09
P(SNP1=C, SNP2=A) = 50%·0.90·0.10 + 50%·0.10·0.90 = 0.09
P(SNP1=C, SNP2=C) = 50%·0.90·0.90 + 50%·0.10·0.10 = 0.41
SNP1 and SNP2 are in admixture-LD in the admixed population!
-
Admixture-LD depends on allele frequency differences
Toy example: Admixed population with 50% POP1, 50% POP2
SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.10 in POP2
SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2
SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.
SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.
P(SNP1=A, SNP2=A) = 50%·0.10·0.10 + 50%·0.10·0.90 = 0.05
P(SNP1=A, SNP2=C) = 50%·0.10·0.90 + 50%·0.10·0.10 = 0.05
P(SNP1=C, SNP2=A) = 50%·0.90·0.10 + 50%·0.90·0.90 = 0.45
P(SNP1=C, SNP2=C) = 50%·0.90·0.90 + 50%·0.90·0.10 = 0.45
No allele frequency difference in SNP1 => no admixture-LD.
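The two toy examples can be checked numerically. This is a minimal sketch under the slides' assumptions (the two SNPs always share local ancestry and are independent within each ancestral population); the function names and the use of r^2 as the LD summary are choices made here for illustration.

```python
def haplotype_probs(mix1, snp1_freqs, snp2_freqs):
    """Two-SNP haplotype probabilities in the admixed population.

    mix1: fraction of POP1 ancestry.
    snp1_freqs, snp2_freqs: (A-allele frequency in POP1, A-allele frequency in POP2).
    """
    probs = {}
    for a1 in "AC":
        for a2 in "AC":
            total = 0.0
            for weight, k in ((mix1, 0), (1.0 - mix1, 1)):
                f1 = snp1_freqs[k] if a1 == "A" else 1.0 - snp1_freqs[k]
                f2 = snp2_freqs[k] if a2 == "A" else 1.0 - snp2_freqs[k]
                total += weight * f1 * f2  # independence within each population
            probs[a1 + a2] = total
    return probs


def r_squared(probs):
    """Squared correlation between the A alleles at the two SNPs."""
    p1 = probs["AA"] + probs["AC"]  # frequency of A at SNP1
    p2 = probs["AA"] + probs["CA"]  # frequency of A at SNP2
    d = probs["AA"] - p1 * p2       # LD coefficient D
    return d * d / (p1 * (1.0 - p1) * p2 * (1.0 - p2))


both_differ = haplotype_probs(0.5, (0.10, 0.90), (0.10, 0.90))
one_differs = haplotype_probs(0.5, (0.10, 0.10), (0.10, 0.90))
print(both_differ, r_squared(both_differ))
print(one_differs, r_squared(one_differs))
```

In the second case the LD coefficient D is zero because SNP1 carries no information about local ancestry.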
-
Mosaic chromosomes create admixture-LD
Real example of admixture-LD:
rs164781: 0.42 in CEU, 0.88 in YRI (HapMap3)
rs10495758: 0.88 in CEU, 0.32 in YRI (HapMap3)
These SNPs are located roughly 3Mb apart.
r2 between rs164781 and rs10495758:
0.01 in CEU, 0.01 in YRI, 0.28 in ASW (HapMap3)
rs164781 and rs10495758 are in admixture-LD in ASW!
International HapMap3 Consortium 2010 Nature
SNPs chosen from Tandon et al. 2011 Genet Epidemiol
-
Mosaic chromosomes create admixture-LD
Collins-Schramm et al. 2003 Hum Genet
No LD in Europeans (P-values for LD not significant)
-
Mosaic chromosomes create admixture-LD
Collins-Schramm et al. 2003 Hum Genet
Admixture-LD in African Americans (significant P-values)
-
Local ancestry vs. Genome-wide ancestry
Local ancestry = 0, 1 or 2 copies from population 1 at a specific locus
Genome-wide ancestry (e.g. 20% European)
-
Outline
1. Admixture leads to variation in genome-wide ancestry
2. Admixture creates mosaic chromosomes
3. Local ancestry inference
4. Evaluating local ancestry inference algorithms
-
Ancestry-informative marker (AIM) panels for
local ancestry inference in African Americans
The most informative ~1% of SNPs provide powerful information about ancestry
[Figure: scatter plot of West African frequency (y-axis, 0% to 100%) vs. European American frequency (x-axis, 0% to 100%); Smith et al. 2004]
• Choose 1,500-3,000 SNPs with large Δ(EUR,AFR) (unlinked, i.e. not in LD, in ancestral populations)
Smith et al. 2004 Am J Hum Genet
Tian et al. 2006 Am J Hum Genet (slide from David Reich)
The most informative SNPs
provide powerful information
about local ancestry
“African-American
admixture map”
-
Ancestry-informative marker (AIM) panels for
local ancestry inference in Latino populations
Price et al. 2007 Am J Hum Genet
Mao et al. 2007 Am J Hum Genet
Tian et al. 2007 Am J Hum Genet
The most informative SNPs
provide powerful information
about local ancestry
“Latino admixture map”
• Choose 1,500-3,000 SNPs with large Δ(EUR,NA) (unlinked, i.e. not in LD, in ancestral populations)
-
Local ancestry vs. Genome-wide ancestry
Local ancestry = 0, 1 or 2 copies from population 1 at a specific locus: 1,500-3,000 AIMs
Genome-wide ancestry (e.g. 20% European): 25-50 AIMs
-
Inferring local ancestry using AIM panels
SNP chr position Eur freq Afr freq
rs2814778 1 159,174,683 0% 100%
1 SNP with Δ=100%: perfect information about local ancestry
Duffy blood group locus
see Hamblin et al. 2000 Am J Hum Genet, Hamblin et al. 2002 Am J Hum Genet
-
Inferring local ancestry using AIM panels
SNP         chr   position      Eur freq   Afr freq
rs1962508   1     158,677,077   4%         74%
rs2806424   1     159,423,117   84%        26%
rs1780349   1     161,340,963   44%        99%
Several SNPs with Δ=60-80%: ???
-
Inferring local ancestry using AIM panels
SNP         chr   position      Eur freq   Afr freq
rs1962508   1     158,677,077   4%         74%
rs2806424   1     159,423,117   84%        26%
rs1780349   1     161,340,963   44%        99%
Several SNPs with Δ=60-80%: Hidden Markov Model methods
STRUCTURE (Falush et al. 2003 Genetics), ADMIXMAP (Hoggart et al. 2004
Am J Hum Genet), ANCESTRYMAP (Patterson et al. 2004 Am J Hum Genet)
(unobserved) state:
Local ancestry = 0, 1 or 2
copies from population 1
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
INITIAL PROBABILITIES (e.g. left end of chromosome):
TRANSITION PROBABILITIES
EMISSION PROBABILITIES
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
INITIAL PROBABILITIES (e.g. left end of chromosome):
P(X_0 = 0) = (1 - M)^2
P(X_0 = 1) = 2M(1 - M)
P(X_0 = 2) = M^2
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
TRANSITION PROBABILITIES:
Let d be the genetic distance (in Morgans) between markers j and j+1.
P(X_{j+1} = 0 | X_j = 0) = e^{-2λd} + 2e^{-λd}(1 - e^{-λd})(1 - M) + (1 - e^{-λd})^2 (1 - M)^2
[First term: 0 of 2 chromosomes recombine; second term: 1 of 2 chromosomes recombine; third term: 2 of 2 chromosomes recombine.]
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
TRANSITION PROBABILITIES:
Let d be the genetic distance (in Morgans) between markers j and j+1.
P(X_{j+1} = 0 | X_j = 0) = e^{-2λd} + 2e^{-λd}(1 - e^{-λd})(1 - M) + (1 - e^{-λd})^2 (1 - M)^2
P(X_{j+1} = 1 | X_j = 0) = 2e^{-λd}(1 - e^{-λd})M + (1 - e^{-λd})^2 2M(1 - M)
P(X_{j+1} = 2 | X_j = 0) = (1 - e^{-λd})^2 M^2
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
TRANSITION PROBABILITIES:
Let d be the genetic distance (in Morgans) between markers j and j+1.
P(X_{j+1} = 0 | X_j = 1) = e^{-λd}(1 - e^{-λd})(1 - M) + (1 - e^{-λd})^2 (1 - M)^2
P(X_{j+1} = 1 | X_j = 1) = e^{-2λd} + e^{-λd}(1 - e^{-λd}) + (1 - e^{-λd})^2 2M(1 - M)
P(X_{j+1} = 2 | X_j = 1) = e^{-λd}(1 - e^{-λd})M + (1 - e^{-λd})^2 M^2
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
TRANSITION PROBABILITIES:
Let d be the genetic distance (in Morgans) between markers j and j+1.
P(X_{j+1} = 0 | X_j = 2) = (1 - e^{-λd})^2 (1 - M)^2
P(X_{j+1} = 1 | X_j = 2) = 2e^{-λd}(1 - e^{-λd})(1 - M) + (1 - e^{-λd})^2 2M(1 - M)
P(X_{j+1} = 2 | X_j = 2) = e^{-2λd} + 2e^{-λd}(1 - e^{-λd})M + (1 - e^{-λd})^2 M^2
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
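The nine transition probabilities can be collected into a 3x3 matrix and sanity-checked. The sketch below is illustrative only, not the ANCESTRYMAP implementation. In the middle row, the single-recombination term toward state 0 is taken as e^{-λd}(1 - e^{-λd})(1 - M), by symmetry with the term toward state 2, so that every row sums to 1. The function name and the example value of d are assumptions.

```python
import math


def transition_matrix(M, lam, d):
    """T[i][k] = P(X_{j+1} = k | X_j = i); states count European chromosomes (0, 1, 2).

    M: genome-wide European ancestry, lam: generations since admixture,
    d: genetic distance in Morgans between markers j and j+1.
    """
    a = math.exp(-lam * d)  # probability that a given chromosome does not recombine
    b = 1.0 - a             # probability that it recombines (ancestry is redrawn)
    q = 1.0 - M
    return [
        [a * a + 2 * a * b * q + b * b * q * q, 2 * a * b * M + 2 * b * b * M * q, b * b * M * M],
        [a * b * q + b * b * q * q, a * a + a * b + 2 * b * b * M * q, a * b * M + b * b * M * M],
        [b * b * q * q, 2 * a * b * q + 2 * b * b * M * q, a * a + 2 * a * b * M + b * b * M * M],
    ]


T = transition_matrix(M=0.2, lam=6, d=0.01)  # d = 1 cM is a hypothetical spacing
for row in T:
    print([round(x, 4) for x in row], "row sum:", round(sum(row), 12))
```

Two limits are useful checks: at d = 0 the matrix is the identity, and for very large d each row approaches the initial probabilities (1 - M)^2, 2M(1 - M), M^2.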
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
EMISSION PROBABILITIES:
Let pA and pE be the frequencies of the counted allele at marker j in AFR and EUR; gj = 0, 1 or 2 copies of that allele.
P(g_j = 0 | X_j = 0) = (1 - pA)^2
P(g_j = 1 | X_j = 0) = 2pA(1 - pA)
P(g_j = 2 | X_j = 0) = pA^2
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
EMISSION PROBABILITIES:
Let pA and pE be the frequencies of the counted allele at marker j in AFR and EUR; gj = 0, 1 or 2 copies of that allele.
P(g_j = 0 | X_j = 1) = (1 - pA)(1 - pE)
P(g_j = 1 | X_j = 1) = pA(1 - pE) + pE(1 - pA)
P(g_j = 2 | X_j = 1) = pA pE
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
EMISSION PROBABILITIES:
Let pA and pE be the frequencies of the counted allele at marker j in AFR and EUR; gj = 0, 1 or 2 copies of that allele.
P(g_j = 0 | X_j = 2) = (1 - pE)^2
P(g_j = 1 | X_j = 2) = 2pE(1 - pE)
P(g_j = 2 | X_j = 2) = pE^2
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
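The emission probabilities for all three states can be tabulated in one matrix. This is a minimal sketch that treats pA and pE as frequencies of the counted allele and assumes the two chromosomes are independent draws; the function name is illustrative.

```python
def emission_matrix(pA, pE):
    """E[x][g] = P(g_j = g | X_j = x).

    x: number of European chromosomes (0, 1, 2); g: copies of the counted allele.
    pA, pE: frequencies of the counted allele in AFR and EUR.
    """
    return [
        [(1 - pA) ** 2, 2 * pA * (1 - pA), pA ** 2],            # both chromosomes AFR
        [(1 - pA) * (1 - pE), pA * (1 - pE) + pE * (1 - pA), pA * pE],  # one AFR, one EUR
        [(1 - pE) ** 2, 2 * pE * (1 - pE), pE ** 2],            # both chromosomes EUR
    ]


# A fully informative marker (Eur freq 0%, Afr freq 100%, like the Duffy SNP):
# each state produces exactly one genotype, so the genotype reveals the state.
for row in emission_matrix(pA=1.0, pE=0.0):
    print(row)

# A partially informative marker with hypothetical frequencies.
for row in emission_matrix(pA=0.74, pE=0.04):
    print([round(x, 4) for x in row])
```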
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
INITIAL PROBABILITIES (e.g. left end of chromosome):
TRANSITION PROBABILITIES
EMISSION PROBABILITIES
Then apply forward-backward algorithm to infer P(Xj | genotypes).
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
-
Overview of Hidden Markov Model approach
Then apply forward-backward algorithm to infer P(Xj | genotypes).
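P(X_1 | g_1) … P(X_j | g_1…g_j) … P(X_{M-1} | g_1…g_{M-1})   P(X_M | g_1…g_M)
[Here M denotes the number of markers, not the genome-wide ancestry proportion.]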
(FORWARD PROBABILITIES)
Durbin et al 1998 Biological Sequence Analysis
-
Overview of Hidden Markov Model approach
Then apply forward-backward algorithm to infer P(Xj | genotypes).
P(X_1 | g_1) … P(X_j | g_1…g_j) … P(X_{M-1} | g_1…g_{M-1})   P(X_M | g_1…g_M)
(FORWARD PROBABILITIES)
P(g_2…g_M | X_1) … P(g_{j+1}…g_M | X_j) … P(g_M | X_{M-1})   1
(BACKWARD PROBABILITIES)
Durbin et al 1998 Biological Sequence Analysis
-
Overview of Hidden Markov Model approach
Then apply forward-backward algorithm to infer P(Xj | genotypes).
P(X_1 | g_1) … P(X_j | g_1…g_j) … P(X_{M-1} | g_1…g_{M-1})   P(X_M | g_1…g_M)
P(g_2…g_M | X_1) … P(g_{j+1}…g_M | X_j) … P(g_M | X_{M-1})   1
P(X_1 | g_1…g_M) … P(X_j | g_1…g_M) … P(X_{M-1} | g_1…g_M)   P(X_M | g_1…g_M)
Durbin et al 1998 Biological Sequence Analysis
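Putting the pieces together, the forward-backward recursion can be sketched in plain Python. This is an illustrative toy, not the ANCESTRYMAP code: it assumes M (ancestry proportion), lam, pA and pE are known, uses the row-normalized middle transition row, and rescales at every marker to avoid underflow. All example genotypes, frequencies and distances are hypothetical.

```python
import math


def initial_probs(M):
    return [(1 - M) ** 2, 2 * M * (1 - M), M ** 2]


def transition_matrix(M, lam, d):
    a = math.exp(-lam * d)  # a given chromosome does not recombine
    b, q = 1.0 - a, 1.0 - M
    return [
        [a * a + 2 * a * b * q + b * b * q * q, 2 * a * b * M + 2 * b * b * M * q, b * b * M * M],
        [a * b * q + b * b * q * q, a * a + a * b + 2 * b * b * M * q, a * b * M + b * b * M * M],
        [b * b * q * q, 2 * a * b * q + 2 * b * b * M * q, a * a + 2 * a * b * M + b * b * M * M],
    ]


def emission_probs(g, pA, pE):
    """[P(g | X = 0), P(g | X = 1), P(g | X = 2)] for genotype g in {0, 1, 2}."""
    table = [
        [(1 - pA) ** 2, 2 * pA * (1 - pA), pA ** 2],
        [(1 - pA) * (1 - pE), pA * (1 - pE) + pE * (1 - pA), pA * pE],
        [(1 - pE) ** 2, 2 * pE * (1 - pE), pE ** 2],
    ]
    return [row[g] for row in table]


def normalize(v):
    s = sum(v)
    return [x / s for x in v]


def forward_backward(genotypes, pA, pE, dists, M, lam):
    """Return P(X_j | all genotypes) per marker; dists[j] separates markers j and j+1."""
    n = len(genotypes)
    fwd = []  # fwd[j] is proportional to P(X_j, g_1..g_j)
    for j in range(n):
        if j == 0:
            prior = initial_probs(M)
        else:
            T = transition_matrix(M, lam, dists[j - 1])
            prior = [sum(fwd[j - 1][i] * T[i][k] for i in range(3)) for k in range(3)]
        e = emission_probs(genotypes[j], pA[j], pE[j])
        fwd.append(normalize([prior[k] * e[k] for k in range(3)]))
    bwd = [[1.0, 1.0, 1.0] for _ in range(n)]  # bwd[j] proportional to P(g_{j+1}.. | X_j)
    for j in range(n - 2, -1, -1):
        T = transition_matrix(M, lam, dists[j])
        e = emission_probs(genotypes[j + 1], pA[j + 1], pE[j + 1])
        bwd[j] = normalize([sum(T[i][k] * e[k] * bwd[j + 1][k] for k in range(3)) for i in range(3)])
    return [normalize([fwd[j][k] * bwd[j][k] for k in range(3)]) for j in range(n)]


post = forward_backward(
    genotypes=[2, 2, 1, 0], pA=[0.9, 0.8, 0.7, 0.9], pE=[0.1, 0.2, 0.2, 0.1],
    dists=[0.005, 0.005, 0.005], M=0.2, lam=6,
)
for j, p in enumerate(post):
    print(j, [round(x, 3) for x in p])
```

Rescaling each forward and backward vector changes only a per-marker constant, which cancels when the product is normalized into a posterior.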
-
Overview of Hidden Markov Model approach
• Simplifying assumption: for an individual i, suppose we know
M = genome-wide ancestry (e.g. 20%)
λ = average #generations since admixture (e.g. 6)
• Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)
of this individual at marker j along the genome.
INITIAL PROBABILITIES (e.g. left end of chromosome):
TRANSITION PROBABILITIES
EMISSION PROBABILITIES
Then apply forward-backward algorithm to infer P(Xj | genotypes).
(Or, use MCMC to integrate over uncertainty in M, λ, pA, pE.)
Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,
Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
-
Big trouble if markers are in LD in ancestral populations
Example: Admixed population with 80% POP1, 20% POP2 ancestry
SNP1 = A/C SNP, A allele has frequency 0.25 in POP1, 0.75 in POP2
A allele has frequency 80%·0.25 + 20%·0.75 = 0.35 in Admixed pop.
Inference of local ancestry of a haploid chromosome using SNP1:
prob 0.35: P(POP1 | A) = 80%·0.25/(80%·0.25 + 20%·0.75 ) = 57%
prob 0.65: P(POP1 | C) = 80%·0.75/(80%·0.75 + 20%·0.25) = 92%
Overall: P(POP1) = 57%·0.35 + 92%·0.65 = 80%. Unbiased.
Price et al. 2008 Am J Hum Genet
-
Big trouble if markers are in LD in ancestral populations
Example: Admixed population with 80% POP1, 20% POP2 ancestry
SNP1 = A/C SNP, A allele has frequency 0.25 in POP1, 0.75 in POP2
SNP2 = A/C SNP in perfect LD with SNP1 in POP1, POP2
A allele has frequency 80%·0.25 + 20%·0.75 = 0.35 in Admixed pop.
Inference of local ancestry of a haploid chr using SNP1, SNP2 (wrongly treated as independent):
prob 0.35: P(POP1 | AA) = 80%·0.25^2/(80%·0.25^2 + 20%·0.75^2) = 31%
prob 0.65: P(POP1 | CC) = 80%·0.75^2/(80%·0.75^2 + 20%·0.25^2) = 97%
Overall: P(POP1) = 31%·0.35 + 97%·0.65 = 74%. Biased.
Price et al. 2008 Am J Hum Genet
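The bias in this example can be reproduced with a short sketch. It models "n SNPs in perfect LD, wrongly treated as independent" by raising the single-SNP likelihood to the power n; the function names are illustrative.

```python
def posterior_pop1(prior1, lik1, lik2):
    """Bayes' rule for P(POP1 | data) given likelihoods under POP1 and POP2."""
    return prior1 * lik1 / (prior1 * lik1 + (1.0 - prior1) * lik2)


def average_inferred_pop1(prior1, f1, f2, n_copies):
    """Average inferred P(POP1) over haploid chromosomes in the admixed population.

    f1, f2: A-allele frequency in POP1 and POP2.
    n_copies: number of perfectly correlated SNPs counted as if independent.
    """
    freq_a = prior1 * f1 + (1.0 - prior1) * f2  # true A frequency in the admixed pop
    post_a = posterior_pop1(prior1, f1 ** n_copies, f2 ** n_copies)
    post_c = posterior_pop1(prior1, (1 - f1) ** n_copies, (1 - f2) ** n_copies)
    return freq_a * post_a + (1.0 - freq_a) * post_c


print(average_inferred_pop1(0.8, 0.25, 0.75, n_copies=1))  # SNP1 alone
print(average_inferred_pop1(0.8, 0.25, 0.75, n_copies=2))  # SNP1 and SNP2 in perfect LD
```

With one SNP the average recovers the true 80% POP1 ancestry; double-counting the same information shifts the average to about 74%.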
-
Inferring local ancestry using GWAS chip data
Advantages of AIM panels of 1,500+ SNPs:
• Lower cost: $80/sample
(vs. $300+/sample for GWAS chips).
Advantages of GWAS chips:
• Dense SNP coverage enables LD mapping
• More accurate local ancestry inference?
-
Inferring local ancestry using GWAS chip data
• ANCESTRYMAP using a subset of ~8,000 unlinked AIMs
(Patterson et al. 2004 Am J Hum Genet; Tandon et al. 2011 Genet Epidemiol)
New methods developed for GWAS chip data:
• SABER (Tang et al. 2006 Am J Hum Genet)
• LAMP (Sankararaman et al. 2008 Am J Hum Genet)
• uSWITCH (Sankararaman et al. 2008 Genome Res)
• HAPAA (Sundquist et al. 2008 Genome Res)
• HAPMIX (Price, Tandon et al. 2009 PLoS Genet)
• WINPOP (Pasaniuc et al. 2009 Bioinformatics)
• GEDI-ADMX (Pasaniuc et al. 2009 Lect Notes Comput Sci)
• PCA-based method (Bryc, Auton et al. 2010 PNAS)
• LAMP-LD (Baran et al. 2012 Bioinformatics)
• MULTIMIX (Churchhouse & Marchini 2013 Genet Epidemiol)
• RFMix (Maples et al. 2013 Am J Hum Genet)
reviewed in Seldin et al. 2011 Nat Rev Genet
-
Inferring local ancestry: LAMP method
LAMP method: (allele frequencies in ancestral populations not known)
• Prune SNP set to restrict to unlinked markers (r2 < 0.10)
• Choose fixed window length l
• Infer local ancestry within each window of length l via EM algorithm
(Unsupervised clustering, integer-valued haploid local ancestries)
• For each SNP, compute majority vote of local ancestry across
all windows overlapping that SNP.
Sankararaman et al. 2008 Am J Hum Genet
[Figure: windows 1 to 7, each of window length l, along the chromosome]
-
Inferring local ancestry: LAMP-ANC method
LAMP-ANC: (allele frequencies in ancestral populations are known)
• Prune SNP set to restrict to unlinked markers (r2 < 0.10)
• Choose fixed window length l
• Infer local ancestry within each window of length l via max likelihood
(Supervised clustering, integer-valued haploid local ancestries)
• For each SNP, compute majority vote of local ancestry across
all windows overlapping that SNP.
Sankararaman et al. 2008 Am J Hum Genet
[Figure: windows 1 to 7, each of window length l, along the chromosome]
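The window-plus-majority-vote idea can be illustrated with a heavily simplified sketch. This is not the LAMP or LAMP-ANC software: it assumes a single haploid chromosome, two ancestral populations with known allele frequencies, SNPs already pruned to be unlinked, and windows measured in number of SNPs. All names and parameters are hypothetical.

```python
import math
from collections import Counter


def window_ancestry(alleles, f1, f2):
    """Maximum-likelihood ancestry (1 or 2) for one window; alleles are coded 0/1."""
    ll1 = ll2 = 0.0
    for a, p, q in zip(alleles, f1, f2):
        ll1 += math.log(p if a == 1 else 1.0 - p)  # log-likelihood under POP1
        ll2 += math.log(q if a == 1 else 1.0 - q)  # log-likelihood under POP2
    return 1 if ll1 >= ll2 else 2


def majority_vote_ancestry(alleles, f1, f2, window, step):
    """Call each overlapping window, then take a per-SNP majority vote across windows."""
    n = len(alleles)
    votes = [Counter() for _ in range(n)]
    start = 0
    while start < n:
        end = min(start + window, n)
        call = window_ancestry(alleles[start:end], f1[start:end], f2[start:end])
        for j in range(start, end):
            votes[j][call] += 1
        if end == n:
            break
        start += step
    return [v.most_common(1)[0][0] for v in votes]


# Toy chromosome: allele 1 is rare in POP1 (0.1) and common in POP2 (0.9).
alleles = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
f1 = [0.1] * 10
f2 = [0.9] * 10
print(majority_vote_ancestry(alleles, f1, f2, window=3, step=1))
```

The trade-off from the slides shows up directly in the `window` parameter: short windows give noisy calls, while long windows blur the ancestry switch point.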
-
Inferring local ancestry: LAMP and LAMP-ANC
LAMP and LAMP-ANC:
• Prune SNP set to restrict to unlinked markers (r2 < 0.10)
• Choose fixed window length l
• Infer local ancestry within each window of length l
• For each SNP, compute majority vote across windows containing SNP
Choice of window length l is key. If window length is
• too small: not enough information to infer local ancestry
• too big: violates assumption of constant local ancestry within window
window length l
Sankararaman et al. 2008 Am J Hum Genet
-
Inferring local ancestry: LAMP and LAMP-ANC
LAMP and LAMP-ANC:
• Prune SNP set to restrict to unlinked markers (r2 < 0.10)
• Choose fixed window length l
• Infer local ancestry within each window of length l
• For each SNP, compute majority vote across windows containing SNP
Choice of window length l is key. If window length is
• too small: not enough information to infer local ancestry
• too big: violates assumption of constant local ancestry within window
Use window length l which is
inversely proportional to # generations since admixture,
i.e. proportional to ancestry segment lengths
window length l
Sankararaman et al. 2008 Am J Hum Genet
-
Inferring local ancestry: LAMP and LAMP-ANC
LAMP and LAMP-ANC:
• Prune SNP set to restrict to unlinked markers (r2 < 0.10)
• Choose fixed window length l
• Infer local ancestry within each window of length l
• For each SNP, compute majority vote across windows containing SNP
Advantages:
• Simple and transparent approach, low computational cost
Disadvantages:
• Information from neighboring windows is not used
• Does not make use of haplotype information
window length l
Sankararaman et al. 2008 Am J Hum Genet
-
WINPOP improvement to LAMP-ANC
LAMP and LAMP-ANC:
• Prune SNP set to restrict to unlinked markers (r2 < 0.10)
• Choose fixed window length l
• Infer local ancestry within each window of length l
• For each SNP, compute majority vote across windows containing SNP
WINPOP:
• Allow variable window length l depending on local genetic
structure of ancestral populations.
• Explicitly model the possibility of one recombination event per window,
enabling larger windows.
window length l
Pasaniuc et al. 2009 Bioinformatics
also see Baran et al. 2012 Bioinformatics
-
Inferring local ancestry: HAPMIX method
HAPMIX method: nested Hidden Markov Models
• Large-scale HMM: transitions between local ancestry states
(Patterson et al. 2004 Am J Hum Genet).
• Small-scale HMM: transitions between haplotypes from
ancestral reference populations (Li & Stephens 2003 Genetics)
Price, Tandon et al. 2009 PLoS Genet
[Figure: reference haplotypes hap1 to hap5 in POP1 and hap1 to hap5 in POP2]
-
Inferring local ancestry: HAPMIX method
HAPMIX method: nested Hidden Markov Models
• States: local ancestry AND haplotype from POP1 or POP2.
• Given initial, transition and emission probabilities: use
forward-backward algorithm to infer P(states | data).
(Durbin et al. 1998 Biological Sequence Analysis + other HMM refs)
Price, Tandon et al. 2009 PLoS Genet
[Figure: reference haplotypes hap1 to hap5 in POP1 and hap1 to hap5 in POP2]
-
Inferring local ancestry: HAPMIX method
Advantages:
• Large-scale + Small-scal