Population Stratification

Qunyuan Zhang

Division of Statistical Genomics

GEMS Course M21-621 Computational Statistical Genetics

Mar. 24, 2011

https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx


What is Population Stratification (PS)?

In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.

In the broad sense, PS can be regarded as the presence of a difference in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure and/or cryptic relatedness.


PS & False Positives

False Positives (inflation)

Association could be due to the underlying structure of the population, even when there is no disease-locus association.


An Example of PS-caused False Positive

Sub-population 1
         case  control  total  risk
A          72        8     80   9/1
a          18        2     20   9/1
total      90       10    100   9/1

Sub-population 2
         case  control  total  risk
A           3       27     30   1/9
a           7       63     70   1/9
total      10       90    100   1/9

Mixed population
         case  control  total  risk
A          75       35    110  2.14
a          25       65     90  0.38
total     100      100    200  1.00

• No disease-locus association.

• Risk difference between sub-populations.

• Allele Frequency difference between sub-populations.

• False disease-locus association in the mixed population (any allele with a higher frequency in the higher-risk sub-population appears to be a risk allele).
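The arithmetic behind this example can be checked with a short sketch. It uses only the counts in the three tables; the function names `odds_ratio` and `pool` are illustrative and not part of the original slides.

```python
# Allele counts from the tables: (cases, controls) for alleles A and a.
strata = {
    "Sub-population 1": {"A": (72, 8), "a": (18, 2)},
    "Sub-population 2": {"A": (3, 27), "a": (7, 63)},
}

def odds_ratio(table):
    """Odds ratio of allele A versus allele a for a 2x2 table of counts."""
    case_A, ctrl_A = table["A"]
    case_a, ctrl_a = table["a"]
    return (case_A * ctrl_a) / (ctrl_A * case_a)

def pool(tables):
    """Add the counts of several 2x2 tables cell by cell."""
    tables = list(tables)
    pooled = {}
    for allele in ("A", "a"):
        cases = sum(t[allele][0] for t in tables)
        ctrls = sum(t[allele][1] for t in tables)
        pooled[allele] = (cases, ctrls)
    return pooled

mixed = pool(strata.values())
or_by_stratum = {name: odds_ratio(t) for name, t in strata.items()}
or_mixed = odds_ratio(mixed)

for name, value in or_by_stratum.items():
    print(f"{name}: OR = {value:.2f}")
print(f"Mixed population: OR = {or_mixed:.2f}")
```

Within each sub-population the odds ratio is 1, while pooling the two gives (75 × 65) / (35 × 25) ≈ 5.57, which is the false association described in the bullets.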


Mantel-Haenszel Test for Stratification

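• Adjusted RR
• Standard error
• Chi-square test

An Example

[Equations (1)-(3) and the worked example were images on the slide and are not preserved in this transcript.]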
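Since the slide's own formulas are not preserved, the following is only a sketch using the standard textbook Mantel-Haenszel estimators, applied to the two sub-population tables from the earlier false-positive example. The layout of each table as `(a, b, c, d)` and the choice to omit the continuity correction are assumptions made here, not taken from the slide.

```python
import math

# Each stratum is a 2x2 table (a, b, c, d):
#   a = allele A in cases,  b = allele A in controls,
#   c = allele a in cases,  d = allele a in controls.
strata = [
    (72, 8, 18, 2),   # Sub-population 1
    (3, 27, 7, 63),   # Sub-population 2
]

def mantel_haenszel(tables):
    """Return (adjusted RR, chi-square, p-value); no continuity correction."""
    rr_num = rr_den = 0.0
    diff = var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        rr_num += a * (c + d) / n
        rr_den += c * (a + b) / n
        expected = (a + b) * (a + c) / n           # expected count of cell a under no association
        diff += a - expected
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    chi2 = diff * diff / var
    p_value = math.erfc(math.sqrt(chi2 / 2.0))     # upper tail of chi-square with 1 df
    return rr_num / rr_den, chi2, p_value

rr_mh, chi2_mh, p_mh = mantel_haenszel(strata)
print(rr_mh, chi2_mh, p_mh)
```

For these two strata the observed and expected counts coincide, so the adjusted RR is 1 and the chi-square statistic is 0: stratifying removes the association seen in the mixed population.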


Linear Model

Marker data

Q, variously called:
• Population structure variable
• Genetic background variable
• Membership variable
• Subgroup/sub-population variable
• Ancestry/admixture proportion variable

Usually Q is unknown and needs to be estimated.


Estimating Q by Eigen-analysis

$X = U S V^{T}$

X (SNPs × individuals):
         idv1  idv2  idv3
snp1        0     2     1
snp2        1     2     2
snp3        0     0     1
snp4        0     1     0
snp5        2     0     0

U:
        -0.55   0.33   0.34
        -0.78  -0.10  -0.27
        -0.16   0.04  -0.71
        -0.20   0.14   0.52
        -0.15  -0.93   0.20

S (singular values):
         3.81   0.00   0.00
         0.00   2.05   0.00
         0.00   0.00   1.13

V (columns Q1, Q2, Q3: eigenvectors of COV(X)):
           Q1     Q2     Q3
idv1    -0.28  -0.95   0.11
idv2    -0.75   0.29   0.59
idv3    -0.60   0.08  -0.80

S^2 (eigenvalues):
        14.51   0.00   0.00
         0.00   4.21   0.00
         0.00   0.00   1.28

References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT)

Or SAS Proc PRINCOMP; R svd() and eigen()
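As a dependency-free check of the numbers on this slide, the sketch below recovers the leading eigenvalue and the first Q vector by power iteration on the individual-by-individual cross-product matrix X'X. Power iteration is a stand-in chosen here for brevity; it is not the SVD routine the slide refers to, and it only returns the first component. Note that the slide's values correspond to the uncentered cross-product X'X.

```python
import math

# SNP-by-individual genotype matrix from the slide (rows snp1..snp5, columns idv1..idv3).
X = [
    [0, 2, 1],
    [1, 2, 2],
    [0, 0, 1],
    [0, 1, 0],
    [2, 0, 0],
]

def gram(matrix):
    """Return X'X, the individual-by-individual cross-product matrix."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)] for i in range(n)]

def leading_eigenpair(sym, iterations=200):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(sym)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(sym[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = [sum(sym[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * w[i] for i in range(n))  # Rayleigh quotient
    return eigenvalue, v

eigenvalue, q1 = leading_eigenpair(gram(X))
singular_value = math.sqrt(eigenvalue)
print(round(eigenvalue, 2), round(singular_value, 2), [round(x, 2) for x in q1])
```

The result agrees with the slide's first eigenvalue, first singular value and Q1 up to rounding and an overall sign, which is arbitrary for eigenvectors.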


Eigen-analysis of HapMap Populations

[Figure: HapMap individuals plotted on axes Q1 and Q2]


Estimating Q by MLE (for admixed population)

G: Observed genotypes of admixed [and parental populations]
Q: Allelic frequencies in parental populations
P: Individual membership to be estimated

Goal: obtain P that maximizes Pr(G|P,Q)

1. Assign prior values for Q (randomly or estimated from parental population genotype data) & P (randomly)

2. Compute P(i) by solving $\partial \Pr(G \mid Q, P) / \partial P = 0$

3. Compute Q(i) by solving $\partial \Pr(G \mid Q, P) / \partial Q = 0$

4. Iterate Steps 2 and 3 until convergence.

Tang et al. Genetic Epidemiology, 2005(28): 289–301
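To make the membership update concrete, here is a minimal sketch for one individual and two parental populations, with the allele frequencies held fixed (that is, only the "compute P" step). It uses an EM update rather than solving the score equation directly, and it assumes each allele copy is drawn independently from population 1 with probability p; the function name, the toy frequencies and the starting value are all illustrative assumptions, not taken from Tang et al.

```python
def em_membership(genotypes, freqs_pop1, freqs_pop2, start=0.3, iterations=200):
    """Estimate the fraction of an individual's alleles derived from population 1.

    genotypes:  count of allele A (0, 1 or 2) at each locus
    freqs_pop*: frequency of allele A at each locus in each parental population
                (assumed strictly between 0 and 1)
    """
    p = start
    n_alleles = 2 * len(genotypes)
    for _ in range(iterations):
        total = 0.0
        for g, f1, f2 in zip(genotypes, freqs_pop1, freqs_pop2):
            # Posterior probability that one copy of A (or of a) came from population 1.
            post_A = p * f1 / (p * f1 + (1 - p) * f2)
            post_a = p * (1 - f1) / (p * (1 - f1) + (1 - p) * (1 - f2))
            total += g * post_A + (2 - g) * post_a
        p = total / n_alleles  # new estimate = average posterior over all allele copies
    return p

# Toy data: four loci where allele A is common in population 1 and rare in population 2.
freqs1 = [0.9, 0.9, 0.9, 0.9]
freqs2 = [0.1, 0.1, 0.1, 0.1]
p_all_A = em_membership([2, 2, 2, 2], freqs1, freqs2)   # homozygous A everywhere
p_het = em_membership([1, 1, 1, 1], freqs1, freqs2)     # heterozygous everywhere
print(p_all_A, p_het)
```

An individual carrying only A alleles is assigned almost entirely to population 1, and an individual heterozygous at every locus is assigned a membership of one half. A full implementation would alternate this step with an update of the allele frequencies, as in Steps 2 and 3.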


Estimating Q by MCMC (for admixed population)

Observed G: genotypes of admixed [and parental populations]

Unknown Z: admixed individuals' membership from ancestral populations

Problem: How to estimate Z?

Bayesian and Markov Chain Monte Carlo (MCMC) methods

1. Assume ancestral population number K (see next slide)

2. Define prior distribution Pr(Z) under K

3. Use MCMC to sample from posterior distribution Pr(Z|G) ∝ Pr(Z) ∙ Pr(G|Z)

4. Average over a large number of MCMC samples to obtain an estimate of Z

Falush et al. Genetics, 2003(164):1567–1587
Software: STRUCTURE
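The sample-then-average idea can be illustrated with a much simpler sampler than the one in STRUCTURE. The sketch below is a random-walk Metropolis sampler for a single individual's admixture proportion q (two ancestral populations, known allele frequencies, uniform prior). The model, the function names, the toy data and the tuning values are assumptions made for illustration only.

```python
import math
import random

def log_likelihood(q, genotypes, freqs_pop1, freqs_pop2):
    """log Pr(G | q) when each allele copy comes from population 1 with probability q."""
    total = 0.0
    for g, f1, f2 in zip(genotypes, freqs_pop1, freqs_pop2):
        prob_A = q * f1 + (1 - q) * f2
        total += g * math.log(prob_A) + (2 - g) * math.log(1 - prob_A)
    return total

def metropolis(genotypes, freqs_pop1, freqs_pop2,
               n_samples=20000, burn_in=2000, step=0.1, seed=1):
    """Random-walk Metropolis sampler for q with a uniform prior on [0, 1]."""
    rng = random.Random(seed)
    q = 0.5
    current = log_likelihood(q, genotypes, freqs_pop1, freqs_pop2)
    samples = []
    for i in range(n_samples):
        proposal = q + rng.gauss(0.0, step)
        if 0.0 <= proposal <= 1.0:  # proposals outside the prior support are rejected
            candidate = log_likelihood(proposal, genotypes, freqs_pop1, freqs_pop2)
            if rng.random() < math.exp(min(0.0, candidate - current)):
                q, current = proposal, candidate
        if i >= burn_in:
            samples.append(q)
    return samples

# Toy data: 20 loci, allele A frequency 0.9 in population 1 and 0.1 in population 2,
# individual heterozygous at every locus.
genotypes = [1] * 20
samples = metropolis(genotypes, [0.9] * 20, [0.1] * 20)
posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 2))
```

Averaging the retained samples gives the posterior mean of q, which mirrors Step 4 above. For this symmetric toy data set the posterior is centered on 0.5.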


Infer Population Number (K)


Linear Model (an example including m Q-variables)

$y = a + bx + b_1 Q_1 + b_2 Q_2 + \dots + b_m Q_m + e$

$y = a + bx + \sum_{i=1}^{m} b_i Q_i + e$

SAS Proc REG, Proc GENMOD; R lm(), glm()

Generalized models (Proc GENMOD, glm()) can fit binary/categorical y
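A small sketch of what adding a Q-variable does, using the allele counts from the earlier false-positive example (one record per allele copy). For simplicity it fits ordinary least squares to the 0/1 case status, i.e. a linear probability model, rather than the generalized model one would normally use for a binary trait. The helper names `solve` and `ols` and the 0/1 coding of x and Q are assumptions made here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """Least-squares coefficients; each design row already includes the intercept."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# (x, Q, y, count): x = 1 for allele A, 0 for a; Q = 1 for sub-population 1, 0 for 2;
# y = 1 for case, 0 for control.
cells = [
    (1, 1, 1, 72), (1, 1, 0, 8), (0, 1, 1, 18), (0, 1, 0, 2),
    (1, 0, 1, 3), (1, 0, 0, 27), (0, 0, 1, 7), (0, 0, 0, 63),
]
records = [(x, q, y) for x, q, y, count in cells for _ in range(count)]
trait = [float(r[2]) for r in records]

a0, b_unadjusted = ols([[1.0, r[0]] for r in records], trait)             # y = a + bx + e
a1, b_adjusted, b_q = ols([[1.0, r[0], r[1]] for r in records], trait)    # y = a + bx + b1*Q + e
print(round(b_unadjusted, 3), round(b_adjusted, 3), round(b_q, 3))
```

Without Q the marker effect b is about 0.40; once Q is in the model, b drops to 0 and the whole difference is absorbed by the sub-population term.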


Unified Mixed Model (more general)

SNP(s)

Inferred population membership

ID matrix

Covariate(s)

V = Z G Z ' + R

Modeling the resemblance among individuals


Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model

Based on MVN, the likelihood of trait (y) in matrix form is:

$L = (2\pi)^{-n/2} \, |V|^{-1/2} \exp\left\{ -\tfrac{1}{2} (y - \mu)' V^{-1} (y - \mu) \right\}$

n: no. of individuals (in a pedigree)
V: n×n variance-covariance matrix
y: phenotype vector
μ: mean phenotype vector

V = Z G Z ' + R

$V = 2\Phi\sigma_a^2 + I\sigma_e^2$

Φ: Kinship (IBD) matrix (n×n)
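The sketch below evaluates the multivariate normal log-likelihood for a covariance matrix built as twice the kinship matrix times an additive variance plus an identity matrix times a residual variance. The kinship matrix is that of a hypothetical family of two unrelated parents and two children; the phenotypes, mean and variance components are made-up numbers, and the helper names are illustrative.

```python
import math

def cholesky(v):
    """Lower-triangular L with L L' = v, for a symmetric positive-definite matrix."""
    n = len(v)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = v[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def mvn_loglik(y, mu, v):
    """Log of the multivariate normal density of y with mean mu and covariance v."""
    n = len(y)
    low = cholesky(v)
    # Solve L z = (y - mu) by forward substitution; then (y-mu)' V^-1 (y-mu) = z'z.
    z = []
    for i in range(n):
        z.append((y[i] - mu[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(t * t for t in z))

def covariance(kinship, var_a, var_e):
    """V = 2 * kinship * var_a + I * var_e."""
    n = len(kinship)
    return [[2.0 * kinship[i][j] * var_a + (var_e if i == j else 0.0) for j in range(n)]
            for i in range(n)]

# Kinship matrix of two unrelated parents and their two children (hypothetical pedigree).
phi = [
    [0.50, 0.00, 0.25, 0.25],
    [0.00, 0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50, 0.25],
    [0.25, 0.25, 0.25, 0.50],
]
y = [1.2, -0.4, 0.9, 0.3]    # made-up phenotypes
mu = [0.5, 0.5, 0.5, 0.5]    # made-up common mean
V = covariance(phi, var_a=0.6, var_e=0.4)
loglik = mvn_loglik(y, mu, V)
print(loglik)
```

In practice the variance components would be chosen to maximize this log-likelihood; here they are fixed simply to show how the kinship matrix enters V.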


Kinship

Inbreeding Coefficient
The inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are Identical By Descent (IBD).

Identical By Descent (IBD)
Two alleles are IBD if they are copies of the same ancestral allele.

Kinship/Coancestry
The inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then
inbreeding coefficient of Z = coancestry between X and Y

Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc. (need pedigree and/or marker data)


Kinship Matrix (expected probability of allele sharing among relatives)
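A kinship matrix can be built from a pedigree with the usual recursive rule: a person's kinship with anyone listed earlier is the average of the two parents' kinships with that person, and self-kinship is one half of one plus the kinship between the parents. The sketch below assumes the pedigree is ordered with parents before children and treats anyone with a missing parent as a founder; the function name and the example pedigree are hypothetical.

```python
def kinship_matrix(pedigree):
    """Kinship coefficients for a pedigree listed with parents before their children.

    pedigree: list of (id, father_id, mother_id); founders use None for both parents.
    Returns (ids, matrix) where matrix[i][j] is the kinship between ids[i] and ids[j].
    """
    ids = [person for person, _, _ in pedigree]
    index = {person: i for i, person in enumerate(ids)}
    n = len(ids)
    phi = [[0.0] * n for _ in range(n)]
    for i, (_, father, mother) in enumerate(pedigree):
        if father is None or mother is None:
            phi[i][i] = 0.5                     # founder: non-inbred, unrelated to earlier people
            continue
        f, m = index[father], index[mother]
        for j in range(i):                      # kinship with everyone listed earlier
            phi[i][j] = phi[j][i] = 0.5 * (phi[f][j] + phi[m][j])
        phi[i][i] = 0.5 * (1.0 + phi[f][m])     # 0.5 * (1 + inbreeding coefficient)
    return ids, phi

# Hypothetical pedigree: two founders, their two children, and a child of the two siblings.
pedigree = [
    ("father", None, None),
    ("mother", None, None),
    ("child1", "father", "mother"),
    ("child2", "father", "mother"),
    ("grandchild", "child1", "child2"),
]
ids, phi = kinship_matrix(pedigree)
inbreeding_grandchild = 2.0 * phi[4][4] - 1.0
print(phi[2][3], inbreeding_grandchild)
```

The full siblings have kinship 0.25, and the inbreeding coefficient of their child is also 0.25, matching the rule that an individual's inbreeding coefficient equals the coancestry between its parents.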


Resources for Mixed Model with Kinship Matrix

Software     Kinship          Mixed Model      Data
SAS          Proc INBREED     Proc MIXED       Quantitative trait; pedigree data
SAS          Proc INBREED     Proc GLIMMIX     Quantitative/qualitative trait; pedigree data
R: kinship   makekinship()    lmekin()         Quantitative trait; pedigree data
R: emma      emma.kinship()   emma.REML.t()    Quantitative trait; uses marker data to calculate kinship
EMMAX        emmax-kin        emmax


Diagnosis of Inflation of False Positives

• Inflation: more false positives than expected under the null

• In GWAS, usually due to PS

• Can be caused by inappropriate statistical methods even with no PS

• May (not necessarily) indicate PS


Theoretical Basis of Diagnosis

Uniform distribution [0,1] of p-values under the null

[Figure: histogram of p-values and Q-Q plot of -log10(p), for the cases of inflation and no inflation]


Inflation Rate (IR)

For Binary Trait

For Continuous Trait

Amin, Duijn, Aulchenko, 2007

Devlin et al. 2004


Genomic Control (by IR)

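For Binary Trait: $Y_i^2 = \chi_i^2$

For Continuous Trait: $Y_i^2 = (t_i)^2$

Or based on p-value: $Y_i^2 = \chi^2_{(1 - p_i,\ df = 1)}$

$\tilde{Y}_i^2 = Y_i^2 / \hat{\lambda} \sim \chi^2_{df=1}$

$\tilde{p}_i = \mathrm{Prob}(\chi^2_{df=1} > \tilde{Y}_i^2)$

where $\hat{\lambda}$ is the estimated inflation rate (IR).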
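A minimal sketch of the correction: divide every 1-df test statistic by an estimated inflation factor and recompute the p-values. The slide's own inflation-rate formula is not preserved in this transcript, so the sketch assumes the widely used median-based estimate (median of the observed statistics divided by the median of a 1-df chi-square, about 0.455). The test statistics are made-up numbers.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df

def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def genomic_control(stats):
    """Return (inflation factor, adjusted statistics, adjusted p-values)."""
    lam = statistics.median(stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # common convention: never inflate the statistics
    adjusted = [s / lam for s in stats]
    return lam, adjusted, [chi2_1df_pvalue(s) for s in adjusted]

# Made-up 1-df test statistics (chi-square values, or squared t statistics).
stats = [0.2, 0.9, 1.5, 3.0, 6.0]
lam, adjusted, p_adjusted = genomic_control(stats)
p_raw = [chi2_1df_pvalue(s) for s in stats]

for raw, adj in zip(p_raw, p_adjusted):
    print(f"{raw:.4f} -> {adj:.4f}")
```

After the adjustment the median statistic matches its expected value under the null, and every p-value becomes larger (less significant) whenever the inflation factor exceeds 1.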


Practice

• Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip
• Ignore pedigree.csv; test each SNP in snp.csv for association (with the trait in trait.csv);
• Investigate the p-values to see if there is any inflation;
• Try to explain why;
• List some possible methods to reduce or control the inflation;
• Choose one method and apply it to the data;
• Does it work?
• Try to explain why.
• Clearly document each step of your analysis.

There is no standard answer, feel free to try anything you like!

Report back to [email protected] and [email protected] in one week. Thanks !
