FST and kinship for arbitrary population structures2017/02/13 · C) European populations FIN FIN...
Transcript of FST and kinship for arbitrary population structures2017/02/13 · C) European populations FIN FIN...
FST and kinship for arbitrary population structures
Alejandro Ochoa
John D. Storey Lab
Center for Statistics and Machine Learning, andLewis-Sigler Institute for Integrative Genomics,
Princeton University
2017-02-13
Why study FST and kinship?
Human geneticsis fascinating!
Pop. structureconfoundsassociationstudies (GWAS)
Heritability ofcomplex traits
Animal and plantbreeding
FST measures population structure and differentiation
●
●
●●
●
●
●●
●
●
●●
●●●●
●
●●
●
●●
●●●●●
●
●●●
●
●
●●
●
●
●
●●●
●●
● ●
●●
●●
●●
●●
00.
20.
4
Alle
le fr
eque
ncy
Median differentiation SNP (rs11692531)F̂ST ≈ 0.081 using Weir-Cockerham estimatorHuman Genome Diversity Project (HGDP)
FST in the independent subpopulation model
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
0.4
0.6
Alle
le fr
eque
ncy
Illustration.
FST =Var(pSi∣∣T)
pTi(1− pTi
) .
FST estimation is constrained to independent subpopulations
T
S1
f TS1
S2
f TS2
...SK
f TSK
Indep. Subpops.T
S1 S2 ... SK
A1 A2 ... An
Admixture
00.
050.
150.
25
Kin
ship
Indi
vidu
als
Our contribution
Previous FST definitions/estimators assume independent subpopulations.
1. We generalize FST for arbitrary populations, in terms of individuals.2. We characterize the bias of popular estimators under arbitrary population
structure, through theory and simulations.3. We develop a new estimator of kinship and FST for arbitrary population
structures.
Confusion: three versions of FST
Definition 1: FST as ameasure of relatedness in apopulation
FST = f̄ TS = θT or θ̄T .
Initially estimated frompedigrees.
Definition 2: FST as aparameter controllingallelic variance
FST =Var(pSi∣∣T)
pTi(1− pTi
) .Def. 1 ⇒ Def. 2 with FST
I Shared across loci i .I No µ or selection.
Definition 3: FST as astatistic of locus-specificvariance
FST,i =σ̂2i
p̄i(1− p̄i).
Goals:I Varies per locus i .I Measures µ and
selection.
Our generalized definition corresponds most closely to Definition 1.
Wright’s FST in cattle
Populations:T : ShorthornS : Dutchessstrain
Wright (1951)
Populations related by a treeT
Ljk
Lj
Lk
kinship orcoancestry
...inbreeding
FST in a subdivided population: Wright (1951)T
S
I
FIS ...
FST ...FIT
(1− FIT) = (1− FIS)(1− FST)
Admixed populations have complex structures
US individuals are often admixed from populations across the world.I European: UK, Ireland, Germany, ItalyI African: West AfricaI Hispanic: Puerto Rico, MexicoI Asian: China, India
African-Americans and Hispanics are recently admixed (5-15 generations ago)from differentiated populations.
Admixture proportions vary (admix. LD) ⇒ complex kinship.
GWAS and heritability estimation in multiethnic or admixed data?
Recently admixed populations
African-Americans
Baharian, et al. (2016)
Hispanics
Moreno-Estrada, et al. (2013)
Admixed siblings from different populations?
Lucy and Maria, UK
Ochoa brothers, MX
High Admixture LD:
Moreno-Estrada, et al. (2013)
Solution: treat every individual as its own population!
SNP dataExample: Genotype CC CT TT
xij 0 1 2Genome\Wide(SNP(Data(
0 2 2 1 1 0 1 0 2 1 0 1 2 . . .
Individuals
SN
Ps
X
An unstructured population
Individuals mate randomly.
In a large population T , genotypes
xij ∼ Binomial(2, pTi ),
at SNP i with reference allele frequencypTi , for any individual j .
This is “Hardy-Weinberg Equilibrium”. 0 1 2
pi = 0.25
Genotype ( xij)
Pro
babi
lity
0.0
0.2
0.4
0.6
Inbreeding coefficient f Tj
f Tj : Probability that the two alleles ofindividual j at a random SNP are“identical by descent” (IBD) given anancestral population T .
A structured population has f Tj > 0.
0 1 2
pi = 0.25, fj = 0.5
Genotype ( xij)
Pro
babi
lity
0.0
0.2
0.4
0.6
Kinship coefficients ϕTjk
ϕTjk : Probability that one allele of
individual j and one of individual k , at arandom SNP, are IBD, given anancestral population T .
Local kinship,given unrelated founders
j , k relation ϕTjk
self 1/2child 1/4sibling 1/4
half sibling 1/8uncle or nephew 1/8first cousins 1/16
second cousins 1/64unrelated 0
Kinship model for genotypes
Let T be the ancestral population. In the absence of selection or mutation, allelefrequencies drift randomly from the ancestral frequency pTi , with covariancesmodulated by the kinship coefficients:
E[xij |T ] = 2pTi ,
Var(xij |T ) = 2pTi(1− pTi
)(1 + f Tj ),
Cov(xij , xik |T ) = 4pTi(1− pTi
)ϕTjk .
Note that ϕTjj = 1
2
(1 + f Tj
).
(Wright 1921, Malécot 1948, Wright 1951, Jacquard 1970).
Individual-level analogs of FIT, FIS, FST“Total” coef., analogous to FIT:f Tj and ϕT
jk are relative to T .
“Local” coef., analogous to FIS:fLjj is relative to Lj ,ϕLjkjk is relative to Ljk .
“Structural” coef., analogous to FST:
f TLj =f Tj − f
Ljj
1− fLjj
,
f TLjk =ϕTjk − ϕ
Ljkjk
1− ϕLjkjk
.
T
Ljk
Lj
fLjkLj
Lk
fLjkLk
f TLjk ...f TLj
FST for arbitrary population structures
We propose
FST =n∑
j=1
wj fTLj,
whereI f TLj = inbreeding coefficient of Lj relative to T
I wj ≥ 0,∑n
j=1 wj = 1 are weights
Backward compatible with FST for subpopulations.Coherent with Wright’s 1951 definition.
Coancestry model and individual allele frequencies
This restricted model assumes the existence of individual-specific allelefrequencies πij , modulated by coancestry coefficients θTjk :
E[πij |T ] = pTi ,
Cov(πij , πik |T ) = pTi(1− pTi
)θTjk ,
xij |πij ∼ Binomial(2, πij).
This model excludes local relationships. Given these assumptions, coancestryand kinship coefficients are the same:
θTjk =
{ϕTjk if j 6= k ,
f Tj = 2ϕTjj − 1 if j = k .
FST =n∑
j=1
wjθTjj
FST estimation under independent subpopulationsWeir-Cockerham and Hudson FST
estimators with πij simplify to
p̂Ti =1n
n∑j=1
πij ,
σ̂2i =
1n − 1
n∑j=1
(πij − p̂Ti
)2,
F̂ indepST =
m∑i=1
σ̂2i
m∑i=1
p̂Ti(1− p̂Ti
)+ 1
nσ̂2i
a.s.−−−→m→∞
FST.
Under independent subpopulations, FST
can be solved for:
E
[1m
m∑i=1
σ̂2i
]= p(1− p)
TFST,
E
[1m
m∑i=1
p̂Ti(1− p̂Ti
)]= p(1− p)
T(1− FST
n
)
FST estimation under arbitrary coancestryWeir-Cockerham and Hudson FST
estimators with πij simplify to
p̂Ti =1n
n∑j=1
πij ,
σ̂2i =
1n − 1
n∑j=1
(πij − p̂Ti
)2,
F̂ indepST =
m∑i=1
σ̂2i
m∑i=1
p̂Ti(1− p̂Ti
)+ 1
nσ̂2i
a.s.−−−→m→∞
n(FST − θ̄T
)n − 1 + FST − nθ̄T
Under the general coancestry model,system is underdetermined:
E
[1m
m∑i=1
σ̂2i
]= p(1− p)
T n(FST − θ̄T )
n − 1,
E
[1m
m∑i=1
p̂Ti(1− p̂Ti
)]= p(1− p)
T(1− θ̄T ).
θ̄T : mean coancestry.
In independent subpopulationsθ̄T = 1
nFST.
Admixture models
Draw alleles from a mixture of populations:
πij =K∑
u=1
pSui qju,
where qju is ancestry proportion, pSui is AF in subpopulation Su.If subpopulations are independent and f TSu is FST of Su relative to T , then
θTjk =K∑
u=1
qjuqkufTSu , FST =
n∑j=1
K∑u=1
wjq2juf
TSu .
Our admixture simulationA) Intermediate population differentiation
FS
T
0.0
1.0
0.0
0.2
B) Intermediate population spread
dens
ity
C) Admixture proportions
ance
stry
0.0
1.0
D) Discrete subpopulations approx.
ance
stry
0.0
1.0
position in 1D geography
Comparison of population structures in simulation
T
S1
f TS1
S2
f TS2
...SK
f TSK
Indep. Subpops.T
S1 S2 ... SK
A1 A2 ... An
Admixture
00.
050.
150.
25
Kin
ship
Indi
vidu
als
Bias estimating the generalized FSTThe popular Weir-Cockerham (WC) and Hudson FST estimators, formulated forindependent subpopulations, are biased in our admixture simulation:
0.00
0.05
0.10
A) Indep. Subpops.
Wei
r−C
ocke
rham
Hud
son
Kin
ship
Plu
g−in
− −−
B) Admixture
Wei
r−C
ocke
rham
Hud
son
Kin
ship
Plu
g−in
− − −
True FST
FSTindep limit
FS
T e
stim
ate
FST estimator
Bias estimating kinship coefficientsThe popular kinship estimator from genotypes and its limit are
ϕ̂Tjk =
m∑i=1
(xij − 2p̂Ti
) (xik − 2p̂Ti
)4
m∑i=1
p̂Ti(1− p̂Ti
) a.s.−−−→m→∞
ϕTjk − ϕ̄T
j − ϕ̄Tk + ϕ̄T
1− ϕ̄T,
where ϕ̄Tj and ϕ̄T are weighted mean kinships. In our admixture simulation:
A) True Kinship B) Estimate C) Limit
−0.0
50.
10.
2
Kin
ship
indi
vidu
als
A new kinship estimator
Bias in new kinship estimator is parametrized by ϕ̄T :
ϕ̂T ,Oldjk =
m∑i=1
(xij − 2p̂Ti
) (xik − 2p̂Ti
)4
m∑i=1
p̂Ti(1− p̂Ti
) a.s.−−−→m→∞
ϕTjk − ϕ̄T
j − ϕ̄Tk + ϕ̄T
1− ϕ̄T,
ϕ̂T ,Newjk =
m∑i=1
(xij − 1)(xik − 1)− 1
4m∑i=1
p̂Ti(1− p̂Ti
) + 1 a.s.−−−→m→∞
ϕTjk − ϕ̄T
1− ϕ̄T.
Remaining bias in ϕ̂T ,Newjk comes from estimating pTi
(1− pTi
)with p̂Ti
(1− p̂Ti
).
A new kinship estimator
Limit of proposed estimate:
ϕ̂T ,Newjk =
m∑i=1
(xij − 1)(xik − 1)− 1
4m∑i=1
p̂Ti(1− p̂Ti
) + 1 a.s.−−−→m→∞
ϕTjk − ϕ̄T
1− ϕ̄T,
If minj ,k
ϕTjk = 0, then
minj ,k
ϕ̂T ,Newjk
a.s.−−−→m→∞
−ϕ̄T
1− ϕ̄T
Performance of proposed estimator
A) Truth B) Proposed C) Existing
−0.1
00.
10.
20.
3
Kin
ship
Indi
vidu
als
Population-level and Individual-level distances in 1000 GenomesA) Distant populations
YRI
YR
I
CEU
CE
U
CHB
CH
B
B) Hispanic populations
PEL
PE
L
MXL
MX
L
CLM
CLM
PUR
PU
R
C) European populations
FIN
FIN
GBR
GB
R
IBS
IBS
TSI
TS
I
−0.0
50
0.05
0.1
0.15
pairw
ise
FS
T
indi
vidu
als
pairw
ise
FS
T
CH
B−C
HB
CE
U−C
EU
YR
I−Y
RI
CH
B−C
EU
CE
U−Y
RI
CH
B−Y
RI
Population−level estimateMean of individual−levelestimates
D) Pairwise comparisons
0.0
0.1
0.2
PU
R−P
UR
PE
L−P
EL
CLM
−CLM
MX
L−M
XL
PU
R−C
LM
CLM
−MX
L
MX
L−P
EL
PU
R−M
XL
CLM
−PE
L
PU
R−P
EL
E) Pairwise comparisons
0.00
0.05
FIN
−FIN
TS
I−T
SI
GB
R−G
BR
IBS
−IB
S
TS
I−IB
S
IBS
−GB
R
TS
I−G
BR
GB
R−F
IN
IBS
−FIN
TS
I−F
IN
F) Pairwise comparisons
0.00
00.
005
0.01
0
Revised FST estimates in 1000 Genomes
A) Independent Pop Model
AFR
AF
R
SAS
SA
S
AMR
AM
R
EUR
EU
R
EAS
EA
S
B) 1000 Genomes
AFR
AF
R
SAS
SA
S
AMR
AM
R
EUR
EU
R
EAS
EA
S
00.
050.
10.
150.
20.
25
Kin
ship
0.0 0.1 0.2 0.3
05
1015
C) Differentiation
FST or Inbreeding
Den
sity
WCHudsonNew (indiv)New (mean)
Indi
vidu
als
We have...
...generalized FST using parameters for arbitrary structure in terms of individuals.
...connected FST, kinship coefficients, and admixture models.
...characterized bias of common estimators when assumptions are broken.
...used an admixture simulation to illustrate biases.
...developed new estimators of FST and kinship/coancestry.
Other work from Dr. OchoaModeling the placebo response inpsychiatric drug trialsCollaboration with Otsuka Pharma.
0 5 10 15
5010
015
020
0
Week
PAN
SS
Study
31−93−20231−97−20131−97−202CN138−001CN138−11331−03−239CN138−39731−12−291
Protein sequence analysisImproving sequence homology stats
AB R E
B
B R Globular
H H ... H H TM
H R R H C R C
CC
E
Future work: Selection tests
xi : genotype vector at SNP i ,Φ̂T : kinship matrix estimate,p̂Ti : ancestral allele frequency estimate,
Then this generalized z-score measures deviation of this SNP from the neutralgenetic structure:
z2i =
(xi − 2p̂Ti 1
)ᵀ (Φ̂T)−1 (
xi − 2p̂Ti 1)
4p̂Ti(1− p̂Ti
) .
Complements other info such as selective sweeps.
Future work: Admixture LD
Moreno-Estrada, et al. (2013)
Simple extension:The kinship matrix varies per locusdepending on population assignments.
More general local kinship estimation?
Future work: Kinship in Recent Mutations
Recall the following only holds for neutral SNPs polymorphic in T :
E[xij |T ] = 2pTi ,
Cov(xij , xik |T ) = 4pTi(1− pTi
)ϕTjk .
A SNP that arose from recent mutation in S instead has pTi = 0 or 1 and:
E[xij |S ] = 2pSi ,
Cov(xij , xik |S) = 4pSi(1− pSi
)ϕSjk .
Also recall: (1− ϕT
jk
)=(1− ϕS
jk
) (1− f TS
).
Recent mutations require special treatment in GWAS/herit. studies!
AcknowledgmentsJohn D. StoreyAndrew BassIrineo CabrerosChee ChenWei HaoEmily NelsonRiley Skeen-Gaar
Neo Christopher ChungInstitute of InformaticsUniversity of Warsaw
Funding:National Institutes of HealthOtsuka Pharmaceuticals
Lewis-Sigler Institute for Integrative Genomics
Future work: Variable kinship in GWAS
Suppose the kinship matrix ΦTi = (ϕT
ijk) varies per locus i :
Cov (xij , xik |T ) = 4pTi(1− pTi
)ϕTijk .
This ΦTi replaces the global kinship ΦT used in LMM and adjusted χ2 GWAS,
varying given local admixture or the recent mutation model.
Future work: Variable kinship in heritability estimationSuppose the kinship matrix ΦT
i = (ϕTijk) varies per locus i :
Cov (xij , xik |T ) = 4pTi(1− pTi
)ϕTijk .
Let y = (yj) be a trait controlled by additive genetic effects as
yj = µ +∑i∈C
βixij + εj ,
The trait’s covariance structure is now given by the mean kinship at causal loci C :
Cov(y|T ) = σ2 (h22Φ̄T + (1− h2)I), where
Φ̄T =∑i∈C
wiΦTi , wi ∝ β2
i pTi (1− pTi ).