EPI 511, Advanced Population and Medical Genetics… · Alkes Price Harvard School of Public...

Alkes Price Harvard School of Public Health January 31 & February 2, 2017 EPI 511, Advanced Population and Medical Genetics Week 2: • Population structure • Population admixture

  • Alkes Price

    Harvard School of Public Health

    January 31 & February 2, 2017

    EPI 511, Advanced Population and Medical Genetics

    Week 2:

    • Population structure

    • Population admixture

  • Outline

    1. Introduction to population structure

    2. Model-based clustering (STRUCTURE, FRAPPE programs)

    3. Principal Components Analysis (PCA)

    4. Ancestry-informative markers (AIMs)

  • What is population structure?

    Population structure refers to genetic differences

    between populations due to geographic ancestry.

  • Genetic differences between populations are small

    5-7% of worldwide human genetic variation is due to

    genetic differences between human populations.

    The remaining 93-95% of human genetic variation is due to

    genetic variation within human populations

    (Rosenberg et al. 2002 Science).

  • Genetic differences between populations are small (International HapMap Consortium 2005 and 2007, Nature)

    FST = 0.19

    FST = 0.11

    FST = 0.16

  • Populations can be distinguished using

    a large number of genetic markers

    • Model-based clustering programs such as STRUCTURE (Pritchard et al. 2000 Genetics)

    Rosenberg et al. 2002 Science

  • Populations can be distinguished using

    a large number of genetic markers

    • Principal components analysis (PCA) (Cavalli-Sforza 1994, The History and Geography of Human Genes)

    using 3 million markers

  • Model-based clustering vs. PCA:

    What’s the difference?

    Model-based clustering:

    • Output for each individual: ancestry in N population clusters

    • Fractional ancestry (20% pop1, 80% pop2) may be allowed

    • Number N of population clusters must be decided in advance

    • Results may be sensitive to number of population clusters

    Principal components analysis (PCA):

    • Output for each individual: ancestry as principal components

    • PCs do not necessarily correspond to specific populations

    • Results of top PCs are not sensitive to the number of PCs

  • Trees can also describe population structure

    Unrooted tree (Jakobsson et al. 2008 Nature); rooted tree (Li et al. 2008 Science)

    also see Cavalli-Sforza et al. 2003 Nat Genet

  • Population structure vs. Population admixture:

    What’s the difference?

    Population structure: [Tue of Week 2]

    • Genetic differences due to geographic ancestry.

    • Use genome-wide data to infer genome-wide ancestry.

    Population admixture: [Thu of Week 2]

    • Mixed ancestry from multiple continental populations.

    • e.g. African Americans, Latino Americans, Hawaiians.

    • Infer local ancestry at each location in the genome.

  • Population structure vs. Population stratification:

    What’s the difference?

    Population structure: [Tue of Week 2]

    • Genetic differences due to geographic ancestry.

    • Use genome-wide data to infer genome-wide ancestry.

    Population stratification: [Tue of Week 3 & Thu of Week 3]

    • Refers specifically to a genotype-phenotype association study.

    • Differences in genetic ancestry between cases and controls.

  • Outline

    1. Introduction to population structure

    2. Model-based clustering (STRUCTURE, FRAPPE programs)

    3. Principal Components Analysis (PCA)

    4. Ancestry-informative markers (AIMs)

  • Model-based clustering when allele frequencies

    in ancestral populations are known

    Example 1. POP1 and POP2 with known allele frequencies.

    SNP1 SNP2 SNP3 SNP4 ………………………

    POP1 0.25 0.57 0.29 0.38 … (allele frequencies)

    POP2 0.40 0.32 0.84 0.22 … (allele frequencies)

    Individual x 2 0 1 1 … (SNP genotypes)

    Does individual x belong to POP1 or POP2?

    P(DATA | x is in POP1) is proportional to

    $(0.25)^2 (0.75)^0 \cdot (0.57)^0 (0.43)^2 \cdot (0.29)^1 (0.71)^1 \cdot (0.38)^1 (0.62)^1 = 0.0006$

    P(DATA | x is in POP2) is proportional to

    $(0.40)^2 (0.60)^0 \cdot (0.32)^0 (0.68)^2 \cdot (0.84)^1 (0.16)^1 \cdot (0.22)^1 (0.78)^1 = 0.0017$
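
The two products in Example 1 can be checked numerically. The sketch below is only an illustration: the names `POP_FREQS`, `GENOTYPES` and `likelihood` are hypothetical, and only the four SNPs shown on the slide are used.

```python
from math import prod

# Allele frequencies and genotypes copied from Example 1 (first four SNPs only).
POP_FREQS = {
    "POP1": [0.25, 0.57, 0.29, 0.38],
    "POP2": [0.40, 0.32, 0.84, 0.22],
}
GENOTYPES = [2, 0, 1, 1]  # individual x: allele counts at SNP1..SNP4


def likelihood(freqs, genotypes):
    """Product over SNPs of p^g * (1 - p)^(2 - g), up to a constant factor."""
    return prod(p ** g * (1 - p) ** (2 - g) for p, g in zip(freqs, genotypes))


likelihoods = {name: likelihood(f, GENOTYPES) for name, f in POP_FREQS.items()}
best_pop = max(likelihoods, key=likelihoods.get)

for name, value in likelihoods.items():
    print(f"{name}: {value:.4f}")
print("most likely population:", best_pop)
```

Adding more entries to `POP_FREQS` would cover the case of more than two candidate populations, since the function is applied to each population in turn.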

  • (Fractional) model-based clustering when allele

    frequencies in ancestral populations are known

    Example 1. POP1 and POP2 with known allele frequencies.

    SNP1 SNP2 SNP3 SNP4 ………………………

    POP1 0.25 0.57 0.29 0.38 … (allele frequencies)

    POP2 0.40 0.32 0.84 0.22 … (allele frequencies)

    Individual x 2 0 1 1 … (SNP genotypes)

    If individual x has ancestry α from POP1 and (1–α) from POP2,

    then what is the most likely value of α?

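    P(DATA | α) is proportional to

    $[0.25\alpha + 0.40(1-\alpha)]^2 \, [0.75\alpha + 0.60(1-\alpha)]^0$

    $\times \, [0.57\alpha + 0.32(1-\alpha)]^0 \, [0.43\alpha + 0.68(1-\alpha)]^2$

    $\times \, [0.29\alpha + 0.84(1-\alpha)]^1 \, [0.71\alpha + 0.16(1-\alpha)]^1$

    $\times \, [0.38\alpha + 0.22(1-\alpha)]^1 \, [0.62\alpha + 0.78(1-\alpha)]^1$

    Maximum value 0.0020, attained at α = 0.22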
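
One simple way to locate the maximizing α is a grid search over [0, 1]. The sketch below assumes the four SNPs shown on the slide; the grid spacing of 0.001 and the names `admixed_likelihood`, `alpha_hat` and `max_value` are choices made for this illustration, not something the slides specify.

```python
# Allele frequencies and genotypes from Example 1 (first four SNPs only).
P1 = [0.25, 0.57, 0.29, 0.38]  # POP1
P2 = [0.40, 0.32, 0.84, 0.22]  # POP2
G = [2, 0, 1, 1]               # individual x


def admixed_likelihood(alpha):
    """P(DATA | alpha) up to a constant, for ancestry alpha from POP1."""
    total = 1.0
    for p1, p2, g in zip(P1, P2, G):
        q = alpha * p1 + (1 - alpha) * p2  # expected allele frequency
        total *= q ** g * (1 - q) ** (2 - g)
    return total


# Evaluate the likelihood on a fine grid and keep the best alpha.
grid = [k / 1000 for k in range(1001)]
alpha_hat = max(grid, key=admixed_likelihood)
max_value = admixed_likelihood(alpha_hat)

print(f"alpha = {alpha_hat:.2f}, likelihood = {max_value:.4f}")
```

Because the log of this likelihood is a sum of logs of linear functions of α, it is concave, so a grid search (or any one-dimensional optimizer) should find the single maximum.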

  • Model-based clustering when allele frequencies

    in ancestral populations are known

    General case: M SNPs (m = 1 to M), N populations (n = 1 to N),

    known allele frequency pmn for SNP m in population n,

    observed genotype counts gm for SNP m in individual x.

    Which population (n = 1 to N) does individual x belong to?

    P(DATA | x ~ population n) is proportional to

    $$\prod_{m=1}^{M} p_{mn}^{g_m} \, (1 - p_{mn})^{2 - g_m}$$

    Answer: find the choice of n which maximizes this expression.

  • (Fractional) model-based clustering when allele

    frequencies in ancestral populations are known

    General case: M SNPs (m = 1 to M), N populations (n = 1 to N),

    known allele frequency pmn for SNP m in population n,

    observed genotype counts gm for SNP m in individual x.

    If individual x has fractional ancestry αn from each population n,

    subject to Σnαn = 1, then what are the most likely values of αn?

    P(DATA | x ~ α1, …, αN) is proportional to

    $$\prod_{m=1}^{M} \left[ \sum_{n=1}^{N} \alpha_n p_{mn} \right]^{g_m} \left[ \sum_{n=1}^{N} \alpha_n (1 - p_{mn}) \right]^{2 - g_m}$$

    Answer: find the values of αn which maximize this expression.

  • (Fractional) model-based clustering when allele

    frequencies in ancestral populations are unknown

    General case: M SNPs (m = 1 to M), N populations (n = 1 to N),

    unknown allele frequency pmn for SNP m in population n,

    observed genotype counts gim for SNP m in many individuals xi.

    If individual xi has fractional ancestry αin from each population n,

    subject to Σnαin = 1, then what are the most likely values of αin?

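    P(DATA | xi ~ αi1, …, αiN for each i; pmn) is proportional to

    $$\prod_{i=1}^{I} \prod_{m=1}^{M} \left[ \sum_{n=1}^{N} \alpha_{in} p_{mn} \right]^{g_{im}} \left[ \sum_{n=1}^{N} \alpha_{in} (1 - p_{mn}) \right]^{2 - g_{im}}$$

    Answer: find values of αin, pmn which maximize this expression.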

  • How to optimize αin and pmn?

    General case: M SNPs (m = 1 to M), N populations (n = 1 to N),

    unknown allele frequency pmn for SNP m in population n,

    observed genotype counts gim for SNP m in many individuals xi.

    Which ancestries αin and allele frequencies pmn maximize

    $$\prod_{i=1}^{I} \prod_{m=1}^{M} \left[ \sum_{n=1}^{N} \alpha_{in} p_{mn} \right]^{g_{im}} \left[ \sum_{n=1}^{N} \alpha_{in} (1 - p_{mn}) \right]^{2 - g_{im}} \; ?$$

    • Approach #1: EM algorithm (Dempster et al. 1977 JRSS B)

    (FRAPPE program; Tang et al. 2005 Genet Epidemiol)

    also see ADMIXTURE program (Alexander et al. 2009 Genome Res)

  • The EM algorithm

    Want to estimate ancestries αin and allele frequencies pmn.

    Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.

    Let zihmn = P(Zihm = n) denote expectations of hidden variables.

    Here h = 0 or 1 (two haplotypes per individual)

    Let gihm denote haplotype of indiv i, hap h

    Diploid genotype gim = 0 or 1 or 2

    Haploid genotype gihm = 0 or 1, Σh gihm = gim

    Tang et al. 2005 Genet Epidemiol

    also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet

  • The EM algorithm

    Want to estimate ancestries αin and allele frequencies pmn.

    Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.

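    If Zihm is known: choose αin and pmn to maximize

    $$\prod_{i=1}^{I} \prod_{m=1}^{M} \prod_{h=0}^{1} \begin{cases} \alpha_{i Z_{ihm}} \, p_{m Z_{ihm}} & \text{if } g_{ihm} = 1 \\ \alpha_{i Z_{ihm}} \, (1 - p_{m Z_{ihm}}) & \text{if } g_{ihm} = 0 \end{cases}$$

    But Zihm is unknown. What to do?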

    Tang et al. 2005 Genet Epidemiol

    also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet

  • The EM algorithm

    Want to estimate ancestries αin and allele frequencies pmn.

    Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.

    Let zihmn = P(Zihm = n) denote expectations of hidden variables.

    Initialization step: Assign zihmn arbitrarily.

    Tang et al. 2005 Genet Epidemiol

    also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet

  • The EM algorithm

    Want to estimate ancestries αin and allele frequencies pmn.

    Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.

    Let zihmn = P(Zihm = n) denote expectations of hidden variables.

    Expectation step: Compute expectations zihmn from αin and pmn.

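    $$z_{ihmn} = \begin{cases} \dfrac{\alpha_{in} \, p_{mn}}{\sum_{n'=1}^{N} \alpha_{in'} \, p_{mn'}} & \text{if } g_{ihm} = 1 \\[2ex] \dfrac{\alpha_{in} \, (1 - p_{mn})}{\sum_{n'=1}^{N} \alpha_{in'} \, (1 - p_{mn'})} & \text{if } g_{ihm} = 0 \end{cases}$$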

    Tang et al. 2005 Genet Epidemiol

    also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet

  • The EM algorithm

    Want to estimate ancestries αin and allele frequencies pmn.

    Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.

    Let zihmn = P(Zihm = n) denote expectations of hidden variables.

    Maximization step: Maximize P(DATA | αin and pmn) using zihmn.

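    $$\alpha_{in} = \frac{1}{2M} \sum_{m=1}^{M} \sum_{h=0}^{1} z_{ihmn}$$

    $$p_{mn} = \frac{\sum_{i=1}^{I} \sum_{h=0}^{1} z_{ihmn} \, g_{ihm}}{\sum_{i=1}^{I} \sum_{h=0}^{1} z_{ihmn}}$$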

    Tang et al. 2005 Genet Epidemiol

    also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet

  • The EM algorithm

    Want to estimate ancestries αin and allele frequencies pmn.

    Hidden variables: Zihm = ancestry of indiv i, hap h at SNP m.

    Let zihmn = P(Zihm = n) denote expectations of hidden variables.

    Initialization step.

    Maximization step.

    Expectation step.

    Maximization step.

    Expectation step.

    Maximization step.

    etc. (to convergence)

    Tang et al. 2005 Genet Epidemiol also see Dempster et al. 1977 JRSS B, Pritchard et al. 2000 Am J Hum Genet
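
The alternation of expectation and maximization steps can be sketched in NumPy as follows. This is a minimal illustration rather than the FRAPPE implementation: it starts from random αin and pmn (instead of arbitrary zihmn), works directly with unphased genotype counts (an individual with genotype g carries g copies of one allele and 2 − g of the other, so haplotype order is not needed), clips frequencies away from 0 and 1 for numerical safety, and runs a fixed number of iterations on simulated data. All function and variable names are hypothetical.

```python
import numpy as np


def log_likelihood(G, alpha, P):
    """Log of prod_i prod_m q^g (1-q)^(2-g), where q_im = sum_n alpha_in * p_mn."""
    q = alpha @ P.T
    return float(np.sum(G * np.log(q) + (2 - G) * np.log(1 - q)))


def em_step(G, alpha, P, eps=1e-6):
    """One expectation step followed by one maximization step."""
    q = alpha @ P.T  # shape (I, M)
    # E-step: probability that a haplotype carrying allele 1 (z1) or
    # allele 0 (z0) at SNP m in individual i comes from population n.
    z1 = alpha[:, None, :] * P[None, :, :] / q[:, :, None]
    z0 = alpha[:, None, :] * (1 - P)[None, :, :] / (1 - q)[:, :, None]
    # Weight by the number of haplotypes of each type (g and 2 - g).
    w1 = G[:, :, None] * z1
    w0 = (2 - G)[:, :, None] * z0
    # M-step: average over the 2M haplotype positions for alpha,
    # and take the expected allele-1 fraction per population for p.
    alpha_new = (w1 + w0).sum(axis=1) / (2 * G.shape[1])
    P_new = w1.sum(axis=0) / (w1 + w0).sum(axis=0)
    return alpha_new, np.clip(P_new, eps, 1 - eps)


def run_em(G, n_pops, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n_ind, n_snps = G.shape
    alpha = rng.dirichlet(np.ones(n_pops), size=n_ind)   # random start
    P = rng.uniform(0.1, 0.9, size=(n_snps, n_pops))     # random start
    history = [log_likelihood(G, alpha, P)]
    for _ in range(n_iter):
        alpha, P = em_step(G, alpha, P)
        history.append(log_likelihood(G, alpha, P))
    return alpha, P, history


# Simulated data (purely illustrative): 40 admixed individuals, 200 SNPs, 2 populations.
rng = np.random.default_rng(1)
P_true = rng.uniform(0.05, 0.95, size=(200, 2))
a1 = rng.uniform(0, 1, size=40)
alpha_true = np.column_stack([a1, 1 - a1])
G = rng.binomial(2, alpha_true @ P_true.T)

alpha_hat, P_hat, history = run_em(G, n_pops=2)
print("log-likelihood: start %.1f, end %.1f" % (history[0], history[-1]))
```

The log-likelihood should not decrease from one iteration to the next, which gives a simple sanity check. The population labels are arbitrary, so the inferred columns of `alpha_hat` may be swapped relative to `alpha_true`.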

  • Bayesian posterior inference

    General case: M SNPs (m = 1 to M), N populations (n = 1 to N),

    unknown allele frequency pmn for SNP m in population n,

    observed genotype counts gim for SNP m in many individuals xi.

    Which ancestries αin and allele frequencies pmn maximize

    $$\prod_{i=1}^{I} \prod_{m=1}^{M} \left[ \sum_{n=1}^{N} \alpha_{in} p_{mn} \right]^{g_{im}} \left[ \sum_{n=1}^{N} \alpha_{in} (1 - p_{mn}) \right]^{2 - g_{im}} \; ?$$

    • Approach #2: Place Bayesian priors on αin and pmn, then

    sample from posterior via Markov Chain Monte Carlo (MCMC)

    (STRUCTURE program; Pritchard et al. 2000 Genetics)

    or variational Bayes approximation

    (TeraStructure program; Gopalan et al. 2016 Nat Genet)

  • Next steps to understanding model-based clustering

    Let there be rock.

    -- Bon S.

    Let there be data.

    -- Alkes

  • Application #1: Human Genome Diversity Project

    Cann et al. 2002 Science, Cavalli-Sforza et al. 2005 Nat Rev Genet

    also see Mallick et al. 2016 Nature (SGDP), Pagani et al. 2016 Nature (EGDP)

  • STRUCTURE results on HGDP samples

    Rosenberg et al. 2002 Science

    • 1,056 individuals from 52 world populations

    • 377 microsatellite markers (multi-allelic)

    [Figure: STRUCTURE results; region labels: Africa, Europe, Western Eurasia, East Asia, Oceania, America]

  • STRUCTURE results: How many clusters?

    Rosenberg et al. 2002 Science

    “We do not claim that our procedure provides an accurate estimate”

    (Heuristic procedure for #clusters, Pritchard et al. 2000 Genetics)

    [Figure: STRUCTURE results; region labels: Africa, Europe, Western Eurasia, East Asia, Oceania, America]

  • FRAPPE results on GWAS data from HGDP

    Li et al. 2008 Science

    • 938 HGDP individuals (118 related individuals removed)

    • 51 world populations (N. Han and S. Han merged)

    • Illumina 650K chip

    FRAPPE results at K=7:

  • Application #2: diverse African populations

    also see Figure 5 of

    Cavalli-Sforza et al. 2003 Nat Genet

    Language families

    of Africa

  • Application #2: diverse African populations

    • 2,432 individuals from 113 African populations

    • 1,327 markers (microsatellite markers and indels)

    STRUCTURE (Pritchard et al. 2000 Genetics) at K=14.

    Tishkoff et al. 2009 Science; also see Gurdasani et al. 2015 Nature

  • STRUCTURE results on African populations

    K=14:

    [Figure legend: West African/Bantu, East African, Khoisan, Pygmy, European/Middle Eastern]

    Tishkoff et al. 2009 Science; also see Gurdasani et al. 2015 Nature

  • STRUCTURE results on African populations

    K=14: [Figure legend: West African/Bantu]

    Bantu expansion (2000 BC – 1000 AD)

    (Cavalli-Sforza et al. 1994, The History and Geography of Human Genes)

    Tishkoff et al. 2009 Science; also see Gurdasani et al. 2015 Nature

  • Outline

    1. Introduction to population structure

    2. Model-based clustering (STRUCTURE, FRAPPE programs)

    3. Principal Components Analysis (PCA)

    4. Ancestry-informative markers (AIMs)

  • Principal Components Analysis

    10 points in 1,000,000-dimensional space.

  • Axes of variation (PCs, eigenvectors)

    Axis 1 is the axis explaining the

    maximum amount of variation.

    Axis 2 is the axis explaining the

    maximum amount of variation

    among axes orthogonal to Axis 1.

    [Figure: the points with Axis 1, Axis 2, Axis 3, Axis 9 and Axis 10 drawn through them]

  • Top axis of variation

    [Figure: the 10 points shown with Axis 1 and Axis 2, each point labeled with a coordinate: +0.45, +0.02, +0.30, +0.09, -0.36, -0.33, +0.22, -0.08, -0.18, -0.50]

  • The math

    Let X be an M x N matrix with M > N (e.g. M SNPs, N individuals)

    Let Ψ be the N x N covariance matrix of X:

    $\Psi_{jk} = \mathrm{Cov}(x_j, x_k)$, where $x_j$ and $x_k$ are the jth and kth columns of X.

    Matrix diagonalization (eigendecomposition):

    $\Psi = V D V^T$, where

    D is a diagonal N x N matrix of eigenvalues

    V is an N x N matrix whose columns are the eigenvectors of Ψ

    Eigenvectors are orthonormal ($V^T V = I$), thus $\Psi V = V D$, i.e.

    $\Psi v_j = d_j v_j$ ($v_j$ = jth eigenvector, $d_j$ = jth eigenvalue)

    Pearson 1901 Phil Mag, Ser B

    Hotelling 1933 J Educ Psychol

    Jackson 2003, A User’s Guide to Principal Components

  • Toy Example

    $$X = \begin{pmatrix} 2 & -2 \\ 1 & -1 \\ 0 & 0 \\ -1 & 1 \\ -2 & 2 \end{pmatrix}$$

    $$\Psi = \begin{pmatrix} 10 & -10 \\ -10 & 10 \end{pmatrix} = V D V^T = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 20 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$

    Eigenvalue 1 (PC1): $\Psi v_1 = d_1 v_1 = \begin{pmatrix} 20/\sqrt{2} \\ -20/\sqrt{2} \end{pmatrix}$

    Eigenvalue 2 (PC2): $\Psi v_2 = d_2 v_2 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$
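
The toy decomposition can be reproduced with NumPy's symmetric eigensolver. In this sketch Ψ is taken to be $X^T X$ with no further scaling, which is an assumption chosen because it gives the 2 x 2 matrix with entries ±10; the variable names are illustrative, and the overall sign of each eigenvector is arbitrary.

```python
import numpy as np

# Toy matrix: 5 rows, 2 columns; each column already has mean zero.
X = np.array([[2, -2], [1, -1], [0, 0], [-1, 1], [-2, 2]], dtype=float)

# Unscaled cross-product of the columns (assumed definition of Psi here).
Psi = X.T @ X

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(Psi)
order = np.argsort(eigvals)[::-1]
D = np.diag(eigvals[order])
V = eigvecs[:, order]

print("Psi =\n", Psi)
print("eigenvalues:", np.round(np.diag(D), 6))
print("PC1 direction:", np.round(V[:, 0], 4))
print("reconstruction ok:", np.allclose(V @ D @ V.T, Psi))
```

The second eigenvalue is zero because the two columns of X are exact negatives of each other, so all of the variation lies along a single axis.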

  • PCA on genotype data

    G = M x N matrix of individual genotypes

    M SNPs, N individuals

    $g_{ij}$ = genotype (0, 1, or 2 alleles) of SNP i in individual j

    • Subtract off the mean of SNP i: $p_i = \mathrm{Avg}_j \, g_{ij}/2$, set $g_{ij} = g_{ij} - 2p_i$

    (Missing data: set $g_{ij} = 0$ if SNP i in individual j is missing data)

    • Optional: normalize by $\sqrt{2 p_i (1 - p_i)}$, i.e. set $g_{ij} = g_{ij} / \sqrt{2 p_i (1 - p_i)}$

    Ψ = N x N covariance matrix of G

    $\Psi = V D V^T$ (eigendecomposition)

    Columns of V are eigenvectors (principal components, PCs) of G.

    Diagonal entries of D are eigenvalues of G.

    The hope: Top PCs (PC1, PC2) correspond to genetic ancestry.

    Price et al. 2006 Nat Genet, Patterson et al. 2006 PLoS Genet

    also see McVean 2009 PLoS Genet, Engelhardt & Stephens 2010 PLoS Genet
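
The preprocessing and decomposition steps can be sketched as a single NumPy function. Several details here are assumptions made for the illustration rather than statements about any published software: missing genotypes are encoded as `np.nan`, monomorphic SNPs are dropped before normalizing, Ψ is computed as the cross-product divided by the number of SNPs used, and the function name `genotype_pca` is hypothetical.

```python
import numpy as np


def genotype_pca(G, normalize=True):
    """PCA of an (M SNPs x N individuals) genotype matrix with entries 0/1/2.

    Missing genotypes are np.nan. Returns the eigenvalues (descending) and the
    matching eigenvectors (columns) of the N x N matrix Psi.
    """
    G = np.asarray(G, dtype=float)
    p = np.nanmean(G, axis=1) / 2.0          # allele frequency per SNP
    keep = (p > 0) & (p < 1)                 # drop monomorphic SNPs
    G, p = G[keep], p[keep][:, None]
    X = G - 2.0 * p                          # subtract the SNP mean
    if normalize:
        X = X / np.sqrt(2.0 * p * (1.0 - p))
    X[np.isnan(X)] = 0.0                     # missing data set to 0
    Psi = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(Psi)   # ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


# Small random example (illustrative only): 50 SNPs, 8 individuals, one missing call.
rng = np.random.default_rng(0)
G_demo = rng.integers(0, 3, size=(50, 8)).astype(float)
G_demo[3, 2] = np.nan
vals, vecs = genotype_pca(G_demo)
print("top eigenvalue:", round(float(vals[0]), 3))
print("PC1 coordinates:", np.round(vecs[:, 0], 2))
```

Because each SNP is mean-centered across individuals, the smallest eigenvalue is expected to be numerically zero, with the constant vector as its eigenvector.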

  • Approximating top PCs quickly in genetic data

    • Power iteration: a random vector is repeatedly multiplied by the

    target matrix A, stretching it along the top eigenvector of A.

    • In genetic data, the genetic relationship matrix (GRM) is $A = X^T X / M$, where X = normalized genotypes.

    Multiply vector by X and XT in turn to avoid cost of computing A.

    • Can approximate a fixed number of top PCs in time O(MN)

    Rokhlin et al. 2009 J Matrix Anal Appl

    Halko et al. 2011 SIAM Rev

    Galinsky et al. 2016a Am J Hum Genet

    http://www.math.drexel.edu/~pg/520/Math520.html
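
A bare-bones version of power iteration for the top PC might look like the sketch below. It is plain single-vector power iteration with a fixed iteration count, not the randomized block algorithms of the cited papers; the function name, the simulated data and the iteration count are all assumptions made for this illustration.

```python
import numpy as np


def top_pc_power_iteration(X, n_iter=200, seed=0):
    """Approximate the top eigenvector and eigenvalue of A = X^T X / M.

    X is an (M SNPs x N individuals) matrix of normalized genotypes.
    A is never formed: each iteration multiplies by X and then by X^T.
    """
    n_snps, n_ind = X.shape
    rng = np.random.default_rng(seed)
    v = rng.normal(size=n_ind)               # random starting vector
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = X.T @ (X @ v) / n_snps           # two O(MN) products
        v = w / np.linalg.norm(w)
    eigenvalue = float(v @ (X.T @ (X @ v)) / n_snps)
    return v, eigenvalue


# Simulated data with one planted axis separating two groups (illustrative only).
rng = np.random.default_rng(1)
n_snps, n_ind = 500, 40
labels = np.where(np.arange(n_ind) < n_ind // 2, 1.0, -1.0)
X = rng.normal(size=(n_snps, n_ind)) + 0.5 * rng.normal(size=(n_snps, 1)) * labels[None, :]

v, lam = top_pc_power_iteration(X)
print("top eigenvalue estimate:", round(lam, 3))
```

Convergence speed depends on the ratio of the second to the first eigenvalue, so data with a weak top axis would need more iterations than this planted example.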

  • PCA on genotype data: Toy Example

    Genotypes (rows = SNPs, columns = individuals):

    1 1 1 0 0
    0 1 2 1 2
    2 1 1 0 1
    0 0 1 2 2
    2 1 1 0 0
    0 0 1 1 1
    2 2 1 1 0

    Mean-adjust each SNP:

    0.4 0.4 0.4 -0.6 -0.6
    -1.2 -0.2 0.8 -0.2 0.8
    1.0 0.0 0.0 -1.0 0.0
    -1.0 -1.0 0.0 1.0 1.0
    1.2 0.2 0.2 -0.8 -0.8
    -0.6 -0.6 0.4 0.4 0.4
    0.8 0.8 -0.2 -0.2 -1.2

    Covariance matrix (individuals x individuals):

    0.9 0.4 -0.2 -0.5 -0.6
    0.4 0.3 0.0 -0.3 -0.4
    -0.2 0.0 0.1 0.0 0.1
    -0.5 -0.3 0.0 0.4 0.3
    -0.6 -0.4 0.1 0.3 0.6

    PCA axis of variation (one coordinate per individual):

    0.7 0.3 -0.1 -0.4 -0.5

    Price et al. 2006 Nat Genet
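
The toy example can be followed step by step in NumPy. In this sketch the covariance matrix is taken as the cross-product of the mean-adjusted columns divided by the number of SNPs, which is an assumption; under it the computed entries and the top axis should agree with the rounded slide values to within about 0.1. The sign of the axis is arbitrary and is fixed here so that the first individual has a positive coordinate.

```python
import numpy as np

# Toy genotypes: 7 SNPs (rows) x 5 individuals (columns).
G = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 2, 1, 2],
    [2, 1, 1, 0, 1],
    [0, 0, 1, 2, 2],
    [2, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [2, 2, 1, 1, 0],
], dtype=float)

# Step 1: mean-adjust each SNP (row).
X = G - G.mean(axis=1, keepdims=True)

# Step 2: individuals x individuals covariance (division by #SNPs is assumed).
Psi = X.T @ X / X.shape[0]

# Step 3: top eigenvector = top axis of variation, one coordinate per individual.
eigvals, eigvecs = np.linalg.eigh(Psi)     # ascending eigenvalues
pc1 = eigvecs[:, -1]
pc1 = pc1 * np.sign(pc1[0])                # fix the arbitrary sign

print("mean-adjusted genotypes:\n", np.round(X, 1))
print("covariance matrix:\n", np.round(Psi, 2))
print("top axis of variation:", np.round(pc1, 2))
```

Individuals 1 and 2 fall on one side of the axis and individuals 4 and 5 on the other, with individual 3 near zero.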

  • Next steps to understanding PCA

    Let there be rock.

    -- Bon S.

    Let there be data.

    -- Alkes

  • PCA using genotype data from HapMap

    using 3 million markers

    from HapMap2

    International HapMap Consortium 2007 Nature

  • PCA using genotype data from HGDP

    Li et al. 2008 Science

    938 HGDP individuals

    Illumina 650K chip

  • PCA in an admixed population: African Americans

    AA: 21% ± 14% European ancestry

    [Figure: PCA plot; labeled groups: YRI, CHB+JPT, CEU]

    Price, Patterson et al. 2008 PLoS Genet

    also see Smith et al. 2004 Am J Hum Genet; Bryc, Auton et al. 2010 PNAS

  • PCA using genotype data from Europe

    3,192 Europeans

    Affymetrix 500K chip

    Novembre et al. 2008 Nature

    also see Ralph & Coop 2013 PLoS Biol, Leslie et al. 2015 Nature, Haak et al. 2015 Nature

  • PCA using genotype data from Switzerland

    Geographical origin of

    European individuals can be

    inferred to within 300-700km!

    Novembre et al. 2008 Nature

    also see Ralph & Coop 2013 PLoS Biol, Leslie et al. 2015 Nature, Haak et al. 2015 Nature

  • PCA using genotype data from 113,851 UK samples

    Galinsky et al. 2016b Am J Hum Genet

    also see Leslie et al. 2015 Nature

    http://ukmap.facts.co/

  • European American population structure:

    What’s inside the melting pot?

    ???

  • PCA using genotype data from European Americans

    2745 European Americans

    Affymetrix 500K chip

    Price, Butler et al. 2008 PLoS Genet; also see Price et al. 2006 Nat Genet, Tian et al. 2008 PLoS Genet, Galinsky et al. 2016a Am J Hum Genet

  • PCA using genotype data from European Americans

    Galinsky et al. 2016a Am J Hum Genet

  • Genetic distances (FST) between

    European American subpopulations

    [Figure: triangle of pairwise FST values]

    Northwest vs. Ashkenazi: FST = 0.009

    Southeast vs. Ashkenazi: FST = 0.004

    Northwest vs. Southeast: FST = 0.005

    Price, Butler et al. 2008 PLoS Genet

  • PCA using SNP weights from external reference panels

    Chen et al. 2013 Bioinformatics

  • PCs do not necessarily reflect population structure

    • Batch effects (see Clayton et al. 2005 Nat Genet, Price et al. 2006 Nat Genet)

    • Cryptic relatedness (see Patterson et al. 2006 PLoS Genet)

    • Long-range LD, e.g. due to inversion polymorphisms

    (see Tian et al. 2008 PLoS Genet, Price et al. 2008 Am J Hum Genet)

  • To LD-prune or not to LD-prune in PCA?

    “We recommend inferring population structure using all markers …

    based on an analysis of HapMap2 data with >3 million markers

    (45 Chinese and 44 Japanese).”

    -- Supp Note 5 of Price et al. 2006 Nat Genet

    “We corrected for LD using our regression technique.”

    -- Patterson et al. 2006 PLoS Genet (also see Zou et al. 2009 Hum Hered)

    “We identified 24 autosomal long-range LD regions, each spanning

    >2Mb, that explained one of the top PCs [when running PCA] on

    327 European Americans genotyped on the Illumina 550K array.”

    -- Price et al. 2008 AJHG (also see Tian et al. 2008 PLoS Genet)

    PCA of 531 Northern European + 387 Southern European samples

    sequenced at 202 genes (864kb) [Nelson et al. 2012 Science data]:

    $r^2$(PC1, true ancestry) = 0.34; increases to 0.54 with LD-pruning.

    -- Galinsky et al. 2016a Am J Hum Genet (Appendix)

  • Is human population genetic variation

    best described by clusters or clines?

    “We identified six main genetic clusters, five of which correspond

    to major geographic regions.” (Rosenberg et al. 2002 Science)

    “When individuals are sampled homogeneously from around the

    globe, the pattern seen is one of gradients of allele frequencies,

    rather than discrete clusters.” (Serre and Pääbo 2004 Genome Res)

    “Examination of the relationship between genetic and geographic

    distance supports a view in which the clusters arise not as an

    artifact of the sampling scheme, but from small discontinuous

    jumps in genetic distance on opposite sides of geographic barriers.”

    (Rosenberg et al. 2005 PLoS Genet)

  • Do geographic barriers lead to clusters?

    • Continuous geographic distance (along land routes) explains

    69% of the variance in genetic distance between two populations.

    • Continuous geographic distance (along land routes)

    PLUS geographic barriers (ocean, Himalayas, Sahara) explains

    73% of the variance in genetic distance between two populations.

    This suggests that geographic barriers contribute very slightly

    to genetic clustering of world populations.

    Rosenberg et al. 2005 PLoS Genet

    also see Pagani et al. 2016 Nature

  • Outline

    1. Introduction to population structure

    2. Model-based clustering (STRUCTURE, FRAPPE programs)

    3. Principal Components Analysis (PCA)

    4. Ancestry-informative markers (AIMs)

  • Ancestry-informative markers (AIMs)

    Standard approach to inferring genetic ancestry:

    • Genotype each individual on a GWAS chip

    (500,000-1,000,000 random genetic markers).

    Apply model-based clustering or PCA.

  • Price, Butler et al. 2008 PLoS Genet

    PCA using genotype data from European Americans

    2745 European Americans

    Affymetrix 500K chip

  • Ancestry-informative markers (AIMs)

    Standard approach to inferring genetic ancestry:

    • Genotype each individual on a GWAS chip

    (500,000-1,000,000 random genetic markers).

    Apply model-based clustering or PCA.

    OR

    AIM approach to inferring genetic ancestry:

    • Genotype each individual on a small set of 50-300 AIMs:

    markers that are highly informative for genetic ancestry.

    Apply model-based clustering or PCA.

    Hoggart et al. 2003 Am J Hum Genet

  • AIMs for northwest vs. southeast Europe

    100 AIMs distinguishing NW vs. SE ancestry

    • Ascertained using European Americans genotyped at

    100,000 to 500,000 markers.

    • Validated using a panel of samples of known ancestry:

    Swedish, UK, Polish, Greek, Italian, Spanish

    Price, Butler et al. 2008 PLoS Genet; reviewed in Seldin & Price 2008 PLoS Genet

    also see Seldin et al. 2006 PLoS Genet, Tian et al. 2008 PLoS Genet

  • 300 AIMs for northwest vs. southeast Europe

    and southeast Europe vs. Ashkenazi Jewish

    100 AIMs distinguishing NW vs. SE ancestry

    200 AIMs distinguishing SE vs. AJ ancestry

    • Ascertained using European Americans genotyped at

    100,000 to 500,000 markers.

    • Validated using a panel of samples of known ancestry:

    Swedish, UK, Polish, Greek, Italian, Spanish, Ashkenazi

    Price, Butler et al. 2008 PLoS Genet; reviewed in Seldin & Price 2008 PLoS Genet

    also see Seldin et al. 2006 PLoS Genet, Tian et al. 2008 PLoS Genet

  • 300 AIMs for northwest vs. southeast Europe

    and southeast Europe vs. Ashkenazi Jewish

    Price, Butler et al. 2008 PLoS Genet; reviewed in Seldin & Price 2008 PLoS Genet

    also see Seldin et al. 2006 PLoS Genet, Tian et al. 2008 PLoS Genet

  • How many AIMs are needed?

    Theorem 3:

    The squared correlation between an inferred axis of variation

    and the true axis of variation (e.g. using genome-wide data) is

    ≈ x/(1+x), where x = FST times the number of AIMs.

    [where FST is measured in the set of AIMs.]

    Price, Butler et al. 2008 PLoS Genet, Patterson et al. 2006 PLoS Genet

    also see Rosenberg et al. 2003 Am J Hum Genet

  • How many AIMs are needed?

    Theorem 3:

    The squared correlation between an inferred axis of variation

    and the true axis of variation (e.g. using genome-wide data) is

    ≈ x/(1+x), where x = FST times the number of AIMs.

    [where FST is measured in the set of AIMs.]

    e.g. Affymetrix 500K chip for northwest vs. southeast Europe:

    Effective #markers ≈ 100,000, after accounting for LD.

    FST(NW Europe, SE Europe) = 0.005 (for the set of all SNPs)

    x = (0.005)(100,000) = 500

    x/(1+x) = 0.998.

    Price, Butler et al. 2008 PLoS Genet, Patterson et al. 2006 PLoS Genet

    also see Rosenberg et al. 2003 Am J Hum Genet

  • How many AIMs are needed?

    Theorem 3:

    The squared correlation between an inferred axis of variation

    and the true axis of variation (e.g. using genome-wide data) is

    ≈ x/(1+x), where x = FST times the number of AIMs.

    [where FST is measured in the set of AIMs.]

    e.g. 100 AIMs for northwest vs. southeast Europe:

    FST(NW Europe, SE Europe) = 0.005 (for the set of all SNPs)

    FST(NW Europe, SE Europe) = 0.07 for the set of 100 AIMs

    x = (0.07)(100) = 7

    x/(1+x) = 0.88.

    Price, Butler et al. 2008 PLoS Genet, Patterson et al. 2006 PLoS Genet

    also see Rosenberg et al. 2003 Am J Hum Genet

  • How many AIMs are needed?

    Theorem 3:

    The squared correlation between an inferred axis of variation

    and the true axis of variation (e.g. using genome-wide data) is

    ≈ x/(1+x), where x = FST times the number of AIMs.

    [where FST is measured in the set of AIMs.]

    e.g. 200 AIMs for southeast Europe vs. Ashkenazi Jewish:

    FST(SE Europe, AJ) = 0.004 (for the set of all SNPs)

    FST(SE Europe, AJ) = 0.04 for the set of 200 AIMs

    x = (0.04)(200) = 8

    x/(1+x) = 0.89.

    Price, Butler et al. 2008 PLoS Genet, Patterson et al. 2006 PLoS Genet

    also see Rosenberg et al. 2003 Am J Hum Genet
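
The three worked examples of Theorem 3 can be reproduced with a few lines of Python. This is a minimal sketch; the function name `expected_r2` is hypothetical, and the inputs are the FST values and marker counts quoted on the slides.

```python
def expected_r2(fst: float, n_markers: float) -> float:
    """Approximate squared correlation x/(1+x), where x = FST * number of markers."""
    x = fst * n_markers
    return x / (1.0 + x)


# (FST measured in the marker set, effective number of markers)
examples = {
    "500K chip, NW vs. SE Europe": (0.005, 100_000),
    "100 AIMs, NW vs. SE Europe": (0.07, 100),
    "200 AIMs, SE Europe vs. AJ": (0.04, 200),
}

for label, (fst, n) in examples.items():
    print(f"{label}: x = {fst * n:g}, x/(1+x) = {expected_r2(fst, n):.3f}")
```

The formula depends only on the product of FST and marker count, which is why 100 highly differentiated AIMs get close to the accuracy of a full GWAS chip.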

  • AIMs for Africa, Europe, Asia, America

    Lao et al. 2006 Am J Hum Genet

    also see Ruiz-Narvaez et al. 2011 Am J Epidemiol, Galanter et al. 2012 PLoS Genet

    STRUCTURE runs

    using only 10 AIMs

  • Conclusions

    • Genetic differences between human populations are small, but populations can be distinguished using a large number of genetic markers.

    • Model-based clustering is an effective way of modeling genetic variation and inferring ancestry via discrete clusters.

    • PCA is an effective way of modeling genetic variation and inferring ancestry via continuous clines.

    • Model-based clustering methods and PCA can be applied to random markers, or to ancestry-informative markers (AIMs), to infer genetic ancestry.

  • EPI 511, Advanced Population and Medical Genetics

    Week 2:

    • Population structure

    • Population admixture

  • Outline

    1. Admixture leads to variation in genome-wide ancestry

    2. Admixture creates mosaic chromosomes

    3. Local ancestry inference

    4. Evaluating local ancestry inference algorithms

  • Hellenthal et al. 2014 Science

  • What is an admixed population?

    An admixed population is a population with recent

    ancestry from two or more continents

    (e.g. within the past 1,000 years).

  • What is an admixed population?

    An admixed population is a population with recent

    ancestry from two or more continents

    (e.g. within the past 1,000 years).

    Note: the word “admixture” is also sometimes used to refer to more ancient admixture events (e.g. Patterson et al. 2012 Genetics, Hellenthal et al. 2014 Science).

  • Population structure vs. Population admixture:

    What’s the difference?

    Population structure: [Tue of Week 2]

    • Genetic differences due to geographic ancestry.

    • Use genome-wide data to infer genome-wide ancestry.

    Population admixture: [Thu of Week 2]

    • Mixed ancestry from multiple continental populations.

    • e.g. African Americans, Latino Americans.

    • Infer local ancestry at each location in the genome.

    Population admixture implies population structure.

    Population structure does not imply population admixture.

  • Examples of admixed populations

    African Americans:

    • Inherit African and European ancestry

    • >10% of U.S. population

    Smith et al. 2004 Am J Hum Genet

  • Examples of admixed populations

    Hispanic/Latino Americans:

    • Inherit European and Native American

    or European, Native American and African ancestry

    • e.g. Mexican Americans, Puerto Ricans, etc.

    • >15% of U.S. population

    Bryc, Velez et al. 2010 PNAS

  • Examples of admixed populations

    Latinos outside the U.S.:

    • Inherit European and Native American

    or European, Native American and African ancestry

    • hundreds of millions of people throughout Latin America

    Bryc, Velez et al. 2010 PNAS

  • An aside: Characteristics of African,

    European and Native American populations

    African populations:

    • High within-population diversity, low LD (no bottleneck).

    • Low genetic distance (FST) between West African populations

    European populations:

    • Lower within-population diversity, higher LD (bottleneck).

    • Low genetic distance (FST) between European populations

    Native American populations:

    • Lowest within-population diversity, highest LD due to

    multiple population bottlenecks.

    • Very high FST between Native American populations

    Cavalli-Sforza et al. 1994 The History and Geography of Human Genes

    Reich et al. 2012 Nature

  • Other examples of admixed populations

    Native Hawaiians (Polynesian, European, East Asian ancestry)

    Uyghurs (East Asian and European-related ancestry)

    A population that self-identifies and is described in the academic literature as “South African Coloured” (San African, Bantu African, European, S Asian, SE Asian ancestry)

    Haiman et al. 2003 Hum Mol Genet,

    Haiman et al. 2007 Nat Genet

    Xu, Huang et al. 2008 Am J Hum Genet,

    Xu & Jin 2008 Am J Hum Genet

    de Wit et al. 2010 Hum Genet, Patterson et al. 2010 Hum Mol Genet,

    Tishkoff et al. 2009 Science, Chimusa et al. 2013 Hum Mol Genet

  • Inferring genome-wide ancestry proportions

    Apply the usual clustering programs, allowing fractional ancestry

    (see Tue of Week 2 slides):

    • STRUCTURE (Pritchard et al. 2000 Genetics, Falush et al. 2003 Genetics)

    • FRAPPE (Tang et al. 2005 Genet Epidemiol, Li et al. 2008 Science)

    • ADMIXTURE (Alexander et al. 2009 Genome Res)

  • Inferring genome-wide ancestry proportions

    Apply the usual clustering programs, allowing fractional ancestry

    (see Tue of Week 2 slides):

    • STRUCTURE (Pritchard et al. 2000 Genetics, Falush et al. 2003 Genetics)

    • FRAPPE (Tang et al. 2005 Genet Epidemiol, Li et al. 2008 Science)

    • ADMIXTURE (Alexander et al. 2009 Genome Res)

    Or, apply principal components analysis

    (see Tue of Week 2 slides):

    • PCA (Price et al. 2006 Nat Genet, Patterson et al. 2006 PLoS Genet)

  • Admixture leads to variation in genome-wide ancestry

    [Figure with labeled groups YRI, CHB+JPT, CEU and African Americans (AA). AA: 21% ± 14% European ancestry]

    Price, Patterson et al. 2008 PLoS Genet

    also see Smith et al. 2004 Am J Hum Genet; Bryc, Auton et al. 2010 PNAS

    (from Tue of Week 2)

  • Admixture proportion varies across individuals,

    but also varies with U.S. geographic location

    Kittles et al. 2007 CJHP

    also see Bryc et al. 2015 Am J Hum Genet

    % European ancestry in African American populations

  • Latino populations: 3-way admixture

    Bryc, Velez et al. 2010 PNAS

    European

    Native American

    African

  • Latino populations: 3-way admixture

    Price et al. 2007 Am J Hum Genet; also see Bryc, Velez et al. 2010 PNAS;

    Moreno-Estrada et al. 2014 Science; Ruiz-Linares et al. 2014 PLoS Genet

    Mexican Americans

    50% European, 45% Native American, 5% African on average,

    with substantial variation among individuals.

    Puerto Ricans

    60% European, 20% Native American, 20% African on average,

    with substantial variation among individuals.

    Brazilians and Colombians

    70% European, 20% Native American, 10% African on average,

    with substantial variation among individuals. [For populations sampled. Values may not apply to all populations.]

  • Different Native American ancestral populations

    for Latino populations in different regions

    Wang et al. 2008 PLoS Genet

    also see Price et al. 2007 Am J Hum Genet

  • Which HapMap3 populations are admixed?

    CEU northern European USA 180

    CHB Chinese China 90

    JPT Japanese Japan 90

    YRI Yoruba Nigeria 180

    TSI Tuscan Italy 90

    CHD Chinese USA 100

    LWK Luhya Kenya 90

    MKK Maasai Kenya 180

    ASW African-American USA 90

    MXL Mexican-American USA 90

    GIH Gujarati-American USA 90

  • PCA of all HapMap3 populations

    International HapMap3 Consortium 2010 Nature (see Supp Figures)

  • These populations are “homogeneous”

    in their continental ancestry

    International HapMap3 Consortium 2010 Nature (see Supp Figures)

  • ASW, MKK and LWK are admixed

    International HapMap3 Consortium 2010 Nature (see Supp Figures)

  • ASW, MKK and LWK are admixed

    International HapMap3 Consortium 2010 Nature (see Supp Figures)

    Bantu expansion

    (2000 BC – 1000 AD)

    Arab migrations

    (500 – 1500 AD)

    (Cavalli-Sforza et al. 1994,

    The History and Geography

    Of Human Genes)

    X Ancestral East African population

  • STRUCTURE results on African populations

    [Figure legend: West African/Bantu, East African, Khoisan, Pygmy, European/Middle Eastern clusters]

    Tishkoff et al. 2009 Science; also see Gurdasani et al. 2015 Nature

    K=14:

    (from Tue of Week 2)

  • MXL (Mexican Americans) are admixed

    International HapMap3 Consortium 2010 Nature (see Supp Figures)

  • Are GIH (Gujarati Americans) admixed?

    International HapMap3 Consortium 2010 Nature (see Supp Figures)

    also see Reich et al. 2009 Nature, Basu et al. 2016 PNAS

  • Which HGDP populations are admixed?

    Li et al. 2008 Science

    938 HGDP individuals

    Illumina 650K chip (from Tue of Week 2)

  • Which HGDP populations are admixed?

    Li et al. 2008 Science

    admixture in

    Middle East / North Africa?

    Recent? Or not? (Price, Tandon et al. 2009 PLoS Genet)

  • European Americans: 3-way admixture!

    Bryc et al. 2015 Am J Hum Genet

    European Americans

    >99% European,

    0.2% Native American,

    0.2% African on average

    with substantial variation

    among individuals.

  • Trees can also describe population structure

    Unrooted tree (Jakobsson et al. 2008 Nature); rooted tree (Li et al. 2008 Science)

    (from Tue of Week 2)

    also see Cavalli-Sforza et al. 2003 Nat Genet

  • Trees cannot model recent admixture

    root

    YRI CEU

    root

    YRI CEU ASW ASW

    WRONG. WRONG.

  • Outline

    1. Admixture leads to variation in genome-wide ancestry

    2. Admixture creates mosaic chromosomes

    3. Local ancestry inference

    4. Evaluating local ancestry inference algorithms

  • Admixture creates mosaic chromosomes

    Population 1 Population 2

    1 generation later

  • Population 1 Population 2

    2 generations later

    Admixture creates mosaic chromosomes

  • Population 1 Population 2

    several generations later

    Admixture creates mosaic chromosomes

    Local ancestry = 0, 1 or 2

    copies from population 1

  • Population 1 Population 2

    Admixture creates mosaic chromosomes

    several generations later

    Local ancestry = 0, 1 or 2

    copies from population 1

    Average segment length (in Morgans) ~ 1/g

    where g = average #generations since admixture

    g ≈ 6 for African Americans, g ≈ 10 for Latino populations

    Smith et al. 2004 Am J Hum Genet, Price et al. 2007 Am J Hum Genet

  • Population 1 Population 2

    Admixture creates mosaic chromosomes

    several generations later

    Local ancestry = 0, 1 or 2

    copies from population 1

    Average segment length (in Morgans) ~ 1/g [in practice > 1/g, since recombination between segments of the same ancestry creates no new boundary]

    where g = average #generations since admixture

    g ≈ 6 for African Americans, g ≈ 10 for Latino populations

    Smith et al. 2004 Am J Hum Genet, Price et al. 2007 Am J Hum Genet
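
A short sketch of the 1/g rule and of why observed segments run longer than 1/g. It assumes breakpoints accumulate at rate g per Morgan and that a fraction `p_same` of them rejoin two segments of the same ancestry, so they create no visible boundary. The function name, the `p_same` parameter, and the 20% European ancestry used in the example are illustrative assumptions, not values taken from the cited papers.

```python
def mean_segment_length_morgans(g: float, p_same: float = 0.0) -> float:
    """Mean ancestry segment length in Morgans.

    g      : average number of generations since admixture
    p_same : assumed fraction of breakpoints that join two segments of the
             same ancestry (these do not end a segment)
    """
    return 1.0 / (g * (1.0 - p_same))


# Baseline 1/g rule, expressed in centimorgans
for label, g in [("African Americans", 6), ("Latino populations", 10)]:
    print(f"{label}: g = {g}, 1/g = {100 * mean_segment_length_morgans(g):.1f} cM")

# Illustrative two-way case: 20% European ancestry, g = 6.
# A European segment only ends when the next segment is African (prob. 0.8).
european = mean_segment_length_morgans(6, p_same=0.2)
african = mean_segment_length_morgans(6, p_same=0.8)
print(f"European segments: {100 * european:.1f} cM, African segments: {100 * african:.1f} cM")
```

Under this simple model, segments of the majority ancestry are much longer than 1/g, because most breakpoints inside them are invisible.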

  • Mosaic chromosomes create admixture-LD

    Toy example: Admixed population with 50% POP1, 50% POP2

    SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

    SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

    SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.

    SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.

  • Mosaic chromosomes create admixture-LD

    Toy example: Admixed population with 50% POP1, 50% POP2

    SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

    SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

    SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.

    SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.

    P(SNP1=A, SNP2=A) = 50%·0.10·0.10 + 50%·0.90·0.90 = 0.41

    POP1 POP2

  • Mosaic chromosomes create admixture-LD

    Toy example: Admixed population with 50% POP1, 50% POP2

    SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

    SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

    SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.

    SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.

    P(SNP1=A, SNP2=A) = 50%·0.10·0.10 + 50%·0.90·0.90 = 0.41

    P(SNP1=A, SNP2=C) = 50%·0.10·0.90 + 50%·0.90·0.10 = 0.09

    P(SNP1=C, SNP2=A) = 50%·0.90·0.10 + 50%·0.10·0.90 = 0.09

    P(SNP1=C, SNP2=C) = 50%·0.90·0.90 + 50%·0.10·0.10 = 0.41

  • Mosaic chromosomes create admixture-LD

    Toy example: Admixed population with 50% POP1, 50% POP2

    SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

    SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

    SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.

    SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.

    P(SNP1=A, SNP2=A) = 50%·0.10·0.10 + 50%·0.90·0.90 = 0.41

    P(SNP1=A, SNP2=C) = 50%·0.10·0.90 + 50%·0.90·0.10 = 0.09

    P(SNP1=C, SNP2=A) = 50%·0.90·0.10 + 50%·0.10·0.90 = 0.09

    P(SNP1=C, SNP2=C) = 50%·0.90·0.90 + 50%·0.10·0.10 = 0.41

    SNP1 and SNP2 are in admixture-LD in the admixed population!

  • Admixture-LD depends on allele frequency differences

    Toy example: Admixed population with 50% POP1, 50% POP2

    SNP1 = A/C SNP, A allele has frequency 0.10 in POP1, 0.10 in POP2

    SNP2 = A/C SNP, A allele has frequency 0.10 in POP1, 0.90 in POP2

    SNP1 and SNP2 are unlinked in POP1, unlinked in POP2.

    SNP1 and SNP2 are 200kb apart: (nearly) always same local ancestry.

    P(SNP1=A, SNP2=A) = 50%·0.10·0.10 + 50%·0.10·0.90 = 0.05

    P(SNP1=A, SNP2=C) = 50%·0.10·0.90 + 50%·0.10·0.10 = 0.05

    P(SNP1=C, SNP2=A) = 50%·0.90·0.10 + 50%·0.90·0.90 = 0.45

    P(SNP1=C, SNP2=C) = 50%·0.90·0.90 + 50%·0.90·0.10 = 0.45

    No allele frequency difference in SNP1 => no admixture-LD.
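
Both toy examples can be checked numerically. The sketch below assumes, as on the slides, that the two SNPs always share local ancestry and are independent within each ancestral population. It then reports the joint frequency P(A, A), the LD coefficient D, and r² in the admixed population. The function name `admixture_ld` is hypothetical.

```python
def admixture_ld(alpha, snp1, snp2):
    """LD between two SNPs in a two-way admixed population.

    alpha : ancestry proportion from POP1 (POP2 contributes 1 - alpha)
    snp1  : (A-allele frequency in POP1, A-allele frequency in POP2) at SNP1
    snp2  : same for SNP2
    Returns (P(A, A), D, r^2).
    """
    pops = [(alpha, snp1[0], snp2[0]), (1.0 - alpha, snp1[1], snp2[1])]
    # Within each ancestry the SNPs are independent, so haplotype
    # frequencies are products, mixed by ancestry proportion.
    p_aa = sum(w * a * b for w, a, b in pops)
    f1 = sum(w * a for w, a, _ in pops)  # A frequency at SNP1, admixed pop.
    f2 = sum(w * b for w, _, b in pops)  # A frequency at SNP2, admixed pop.
    d = p_aa - f1 * f2
    r2 = d * d / (f1 * (1 - f1) * f2 * (1 - f2))
    return p_aa, d, r2


# Both SNPs differ in frequency between POP1 and POP2
print(admixture_ld(0.5, (0.10, 0.90), (0.10, 0.90)))
# SNP1 has the same frequency in both populations
print(admixture_ld(0.5, (0.10, 0.10), (0.10, 0.90)))
```

In the first case D = 0.41 − 0.5·0.5 = 0.16, giving r² of about 0.41. In the second case P(A, A) = 0.05 equals the product of the marginal frequencies 0.10·0.50, so D and r² are zero.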

  • Mosaic chromosomes create admixture-LD

    Real example of admixture-LD:

    rs164781: 0.42 in CEU, 0.88 in YRI (HapMap3)

    rs10495758: 0.88 in CEU, 0.32 in YRI (HapMap3)

    These SNPs are located roughly 3Mb apart.

    $r^2$ between rs164781 and rs10495758:

    0.01 in CEU, 0.01 in YRI, 0.28 in ASW (HapMap3)

    rs164781 and rs10495758 are in admixture-LD in ASW!

    International HapMap3 Consortium 2010 Nature

    SNPs chosen from Tandon et al. 2011 Genet Epidemiol

  • Mosaic chromosomes create admixture-LD

    Collins-Schramm et al. 2003 Hum Genet

    No LD in Europeans (P-values for LD not significant)

  • Mosaic chromosomes create admixture-LD

    Collins-Schramm et al. 2003 Hum Genet

    Admixture-LD in African Americans (significant P-values)

  • Local ancestry vs. Genome-wide ancestry

    Local ancestry = 0, 1 or 2 copies from population 1 at a specific locus

    Genome-wide ancestry (e.g. 20% European)

  • Outline

    1. Admixture leads to variation in genome-wide ancestry

    2. Admixture creates mosaic chromosomes

    3. Local ancestry inference

    4. Evaluating local ancestry inference algorithms

  • Ancestry-informative marker (AIM) panels for

    local ancestry inference in African Americans

    [Figure: plot of West African Frequency (y-axis, 0% to 100%) vs. European American Frequency (x-axis, 0% to 100%). The most informative ~1% of SNPs provide powerful information about ancestry. Smith et al. 2004]

    • Choose 1,500-3,000 SNPs with large Δ(EUR,AFR) (unlinked, i.e. not in LD, in ancestral populations)

    Smith et al. 2004 Am J Hum Genet

    Tian et al. 2006 Am J Hum Genet (slide from David Reich)

    The most informative SNPs

    provide powerful information

    about local ancestry

    “African-American

    admixture map”

  • Ancestry-informative marker (AIM) panels for

    local ancestry inference in Latino populations

    Price et al. 2007 Am J Hum Genet

    Mao et al. 2007 Am J Hum Genet

    Tian et al. 2007 Am J Hum Genet

    The most informative SNPs

    provide powerful information

    about local ancestry

    “Latino admixture map”

    • Choose 1,500-3,000 SNPs with large Δ(EUR,NA) (unlinked, i.e. not in LD, in ancestral populations)

  • Local ancestry vs. Genome-wide ancestry

    Local ancestry = 0, 1 or 2 copies from population 1 at a specific locus: 1,500-3,000 AIMs

    Genome-wide ancestry (e.g. 20% European): 25-50 AIMs

  • Inferring local ancestry using AIM panels

    SNP chr position Eur freq Afr freq

    rs2814778 1 159,174,683 0% 100%

    1 SNP with Δ=100%: perfect information about local ancestry

    Duffy blood group locus

    see Hamblin et al. 2000 Am J Hum Genet, Hamblin et al. 2002 Am J Hum Genet

  • Inferring local ancestry using AIM panels

    SNP chr position Eur freq Afr freq

    rs1962508 1 158,677,077 4% 74%

    rs2806424 1 159,423,117 84% 26%

    rs1780349 1 161,340,963 44% 99%

    Several SNPs with Δ=60-80%: ???

  • Inferring local ancestry using AIM panels

    SNP chr position Eur freq Afr freq

    rs1962508 1 158,677,077 4% 74%

    rs2806424 1 159,423,117 84% 26%

    rs1780349 1 161,340,963 44% 99%

    Several SNPs with Δ=60-80%: Hidden Markov Model methods

    STRUCTURE (Falush et al. 2003 Genetics), ADMIXMAP (Hoggart et al. 2004

    Am J Hum Genet), ANCESTRYMAP (Patterson et al. 2004 Am J Hum Genet)

    (unobserved) state:

    Local ancestry = 0, 1 or 2

    copies from population 1

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    INITIAL PROBABILITIES (e.g. left end of chromosome):

    TRANSITION PROBABILITIES

    EMISSION PROBABILITIES

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    INITIAL PROBABILITIES (e.g. left end of chromosome):

    $P(X_0 = 0) = (1 - M)^2$

    $P(X_0 = 1) = 2M(1 - M)$

    $P(X_0 = 2) = M^2$

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
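
The initial probabilities treat the two chromosomes as independent draws, each European with probability M. A minimal sketch follows; the function name `initial_probs` is hypothetical.

```python
def initial_probs(M: float) -> list:
    """P(X_0 = 0), P(X_0 = 1), P(X_0 = 2) for genome-wide European ancestry M.

    X_0 counts European chromosomes at the first marker; each of the two
    chromosomes is European independently with probability M.
    """
    return [(1 - M) ** 2, 2 * M * (1 - M), M ** 2]


probs = initial_probs(0.20)
print([round(p, 4) for p in probs])  # [0.64, 0.32, 0.04]
print(round(sum(probs), 12))         # 1.0
```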

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    TRANSITION PROBABILITIES:

    Let d be the genetic distance (in Morgans) between markers j and j+1.

    $P(X_{j+1} = 0 \mid X_j = 0) = e^{-2\lambda d} + 2e^{-\lambda d}(1 - e^{-\lambda d})(1 - M) + (1 - e^{-\lambda d})^2 (1 - M)^2$

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis

    Terms, left to right: 0 of 2, 1 of 2, or 2 of 2 chromosomes recombine.

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    TRANSITION PROBABILITIES:

    Let d be the genetic distance (in Morgans) between markers j and j+1.

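    $P(X_{j+1} = 0 \mid X_j = 0) = e^{-2\lambda d} + 2e^{-\lambda d}(1 - e^{-\lambda d})(1 - M) + (1 - e^{-\lambda d})^2 (1 - M)^2$

    $P(X_{j+1} = 1 \mid X_j = 0) = 2e^{-\lambda d}(1 - e^{-\lambda d})M + (1 - e^{-\lambda d})^2 \, 2M(1 - M)$

    $P(X_{j+1} = 2 \mid X_j = 0) = (1 - e^{-\lambda d})^2 M^2$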

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    TRANSITION PROBABILITIES:

    Let d be the genetic distance (in Morgans) between markers j and j+1.

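    $P(X_{j+1} = 0 \mid X_j = 1) = e^{-\lambda d}(1 - e^{-\lambda d})(1 - M) + (1 - e^{-\lambda d})^2 (1 - M)^2$

    $P(X_{j+1} = 1 \mid X_j = 1) = e^{-2\lambda d} + e^{-\lambda d}(1 - e^{-\lambda d}) + (1 - e^{-\lambda d})^2 \, 2M(1 - M)$

    $P(X_{j+1} = 2 \mid X_j = 1) = e^{-\lambda d}(1 - e^{-\lambda d})M + (1 - e^{-\lambda d})^2 M^2$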

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    TRANSITION PROBABILITIES:

    Let d be the genetic distance (in Morgans) between markers j and j+1.

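    $P(X_{j+1} = 0 \mid X_j = 2) = (1 - e^{-\lambda d})^2 (1 - M)^2$

    $P(X_{j+1} = 1 \mid X_j = 2) = 2e^{-\lambda d}(1 - e^{-\lambda d})(1 - M) + (1 - e^{-\lambda d})^2 \, 2M(1 - M)$

    $P(X_{j+1} = 2 \mid X_j = 2) = e^{-2\lambda d} + 2e^{-\lambda d}(1 - e^{-\lambda d})M + (1 - e^{-\lambda d})^2 M^2$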

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
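
One way to build the full 3×3 transition matrix is to start from a single chromosome and combine two independent copies. This is a sketch under that independence assumption: each chromosome keeps its ancestry with probability e^{−λd} and otherwise redraws it, European with probability M. The function names are hypothetical. Each row should sum to 1, which is a useful check on any hand-expanded closed form.

```python
import math


def haploid_transition(M, lam, d):
    """2x2 transition matrix for one chromosome (0 = African, 1 = European)."""
    stay = math.exp(-lam * d)  # probability of no recombination over distance d
    return [
        [stay + (1 - stay) * (1 - M), (1 - stay) * M],
        [(1 - stay) * (1 - M), stay + (1 - stay) * M],
    ]


def diploid_transition(M, lam, d):
    """3x3 matrix T[s][t] = P(X_{j+1} = t | X_j = s), X = number of European chromosomes."""
    h = haploid_transition(M, lam, d)
    return [
        # From X = 0: both chromosomes start African
        [h[0][0] ** 2, 2 * h[0][0] * h[0][1], h[0][1] ** 2],
        # From X = 1: one African and one European chromosome
        [h[0][0] * h[1][0], h[0][0] * h[1][1] + h[0][1] * h[1][0], h[0][1] * h[1][1]],
        # From X = 2: both chromosomes start European
        [h[1][0] ** 2, 2 * h[1][1] * h[1][0], h[1][1] ** 2],
    ]


# Example: M = 20% European ancestry, lambda = 6 generations, d = 0.01 Morgans
T = diploid_transition(0.20, 6, 0.01)
for row in T:
    print([round(p, 5) for p in row], "row sum =", round(sum(row), 12))
```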

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    EMISSION PROBABILITIES:

    Let $p_A$ and $p_E$ be allele frequencies of marker j in AFR and EUR.

    $P(g_j = 0 \mid X_j = 0) = (1 - p_A)^2$

    $P(g_j = 1 \mid X_j = 0) = 2p_A(1 - p_A)$

    $P(g_j = 2 \mid X_j = 0) = p_A^2$

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    EMISSION PROBABILITIES:

    Let $p_A$ and $p_E$ be allele frequencies of marker j in AFR and EUR.

    $P(g_j = 0 \mid X_j = 1) = (1 - p_A)(1 - p_E)$

    $P(g_j = 1 \mid X_j = 1) = p_A(1 - p_E) + p_E(1 - p_A)$

    $P(g_j = 2 \mid X_j = 1) = p_A p_E$

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    EMISSION PROBABILITIES:

    Let $p_A$ and $p_E$ be allele frequencies of marker j in AFR and EUR.

    $P(g_j = 0 \mid X_j = 2) = (1 - p_E)^2$

    $P(g_j = 1 \mid X_j = 2) = 2p_E(1 - p_E)$

    $P(g_j = 2 \mid X_j = 2) = p_E^2$

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis
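
All nine emission probabilities follow one pattern: given local ancestry, each chromosome carries the counted allele independently, with the frequency of its ancestral population. The sketch below treats p_A and p_E as allele frequencies of the counted allele, which the Hardy-Weinberg form of the formulas implies. The function name `emission_probs` is hypothetical.

```python
def emission_probs(x, p_a, p_e):
    """Return [P(g=0 | X=x), P(g=1 | X=x), P(g=2 | X=x)].

    x   : local ancestry, the number of European chromosomes (0, 1 or 2)
    p_a : frequency of the counted allele in AFR
    p_e : frequency of the counted allele in EUR
    """
    # Allele frequency that applies to each of the two chromosomes
    q1, q2 = {0: (p_a, p_a), 1: (p_a, p_e), 2: (p_e, p_e)}[x]
    return [(1 - q1) * (1 - q2), q1 * (1 - q2) + q2 * (1 - q1), q1 * q2]


# A marker with frequency 0% in EUR and 100% in AFR (as for rs2814778):
# the genotype is fully determined by local ancestry.
for x in (0, 1, 2):
    print(x, emission_probs(x, p_a=1.0, p_e=0.0))

# A less extreme marker still gives a valid distribution for every state.
print(emission_probs(1, p_a=0.74, p_e=0.04))
```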

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    INITIAL PROBABILITIES (e.g. left end of chromosome):

    TRANSITION PROBABILITIES

    EMISSION PROBABILITIES

    Then apply forward-backward algorithm to infer P(Xj | genotypes).

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis

  • Overview of Hidden Markov Model approach

    Then apply forward-backward algorithm to infer P(Xj | genotypes).

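    $P(X_1 \mid g_1)$ $P(X_j \mid g_1 \dots g_j)$ $P(X_{M-1} \mid g_1 \dots g_{M-1})$ $P(X_M \mid g_1 \dots g_M)$ [here M denotes the number of markers]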

    (FORWARD PROBABILITIES)

    Durbin et al 1998 Biological Sequence Analysis

  • Overview of Hidden Markov Model approach

    Then apply forward-backward algorithm to infer P(Xj | genotypes).

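    $P(X_1 \mid g_1)$ $P(X_j \mid g_1 \dots g_j)$ $P(X_{M-1} \mid g_1 \dots g_{M-1})$ $P(X_M \mid g_1 \dots g_M)$

    (FORWARD PROBABILITIES)

    $P(g_2 \dots g_M \mid X_1)$ $P(g_{j+1} \dots g_M \mid X_j)$ $P(g_M \mid X_{M-1})$ $1$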

    (BACKWARD PROBABILITIES)

    Durbin et al 1998 Biological Sequence Analysis

  • Overview of Hidden Markov Model approach

    Then apply forward-backward algorithm to infer P(Xj | genotypes).

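    $P(X_1 \mid g_1)$ $P(X_j \mid g_1 \dots g_j)$ $P(X_{M-1} \mid g_1 \dots g_{M-1})$ $P(X_M \mid g_1 \dots g_M)$

    $P(g_2 \dots g_M \mid X_1)$ $P(g_{j+1} \dots g_M \mid X_j)$ $P(g_M \mid X_{M-1})$ $1$

    $P(X_1 \mid g_1 \dots g_M)$ … $P(X_j \mid g_1 \dots g_M)$ … $P(X_{M-1} \mid g_1 \dots g_M)$ $P(X_M \mid g_1 \dots g_M)$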

    Durbin et al 1998 Biological Sequence Analysis
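
The forward-backward step itself is generic to any discrete HMM. The sketch below is a plain-Python version with per-marker rescaling. It is not the ANCESTRYMAP implementation, and the toy numbers are made up for illustration. It takes initial probabilities, one transition matrix per marker gap, and the emission probability of the observed genotype under each state.

```python
def forward_backward(init, trans, emis):
    """Posterior P(X_j = s | all genotypes) for a discrete HMM.

    init  : init[s] = P(X_1 = s)
    trans : trans[j][s][t] = P(X_{j+1} = t | X_j = s), one matrix per gap
    emis  : emis[j][s] = P(observed g_j | X_j = s)
    """
    n, k = len(emis), len(init)

    # Forward pass: fwd[j][s] is proportional to P(X_j = s, g_1..g_j)
    fwd = []
    for j in range(n):
        if j == 0:
            cur = [init[s] * emis[0][s] for s in range(k)]
        else:
            cur = [emis[j][t] * sum(fwd[j - 1][s] * trans[j - 1][s][t] for s in range(k))
                   for t in range(k)]
        total = sum(cur)
        fwd.append([v / total for v in cur])  # rescale to avoid underflow

    # Backward pass: bwd[j][s] is proportional to P(g_{j+1}..g_n | X_j = s)
    bwd = [[1.0] * k for _ in range(n)]
    for j in range(n - 2, -1, -1):
        cur = [sum(trans[j][s][t] * emis[j + 1][t] * bwd[j + 1][t] for t in range(k))
               for s in range(k)]
        total = sum(cur)
        bwd[j] = [v / total for v in cur]

    # Combine and normalize at each marker
    post = []
    for j in range(n):
        w = [fwd[j][s] * bwd[j][s] for s in range(k)]
        total = sum(w)
        post.append([v / total for v in w])
    return post


# Toy example (made-up numbers): 3 markers, 3 local-ancestry states
init = [0.64, 0.32, 0.04]
T = [[0.90, 0.09, 0.01], [0.05, 0.90, 0.05], [0.01, 0.09, 0.90]]
emis = [[0.5, 0.3, 0.2], [0.0, 0.0, 1.0], [0.2, 0.3, 0.5]]
posterior = forward_backward(init, [T, T], emis)
for j, row in enumerate(posterior, start=1):
    print(f"marker {j}:", [round(p, 3) for p in row])
```

In the toy data the middle marker is fully informative, so its posterior puts all mass on state 2. The neighboring markers are pulled toward state 2 as well, which illustrates how the HMM shares information along the chromosome.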

  • Overview of Hidden Markov Model approach

    • Simplifying assumption: for an individual i, suppose we know

    M = genome-wide ancestry (e.g. 20%)

    λ = average #generations since admixture (e.g. 6)

    • Let Xj = the (unobserved) state (0, 1 or 2 European chromosomes)

    of this individual at marker j along the genome.

    INITIAL PROBABILITIES (e.g. left end of chromosome):

    TRANSITION PROBABILITIES

    EMISSION PROBABILITIES

    Then apply forward-backward algorithm to infer P(Xj | genotypes).

    (Or, use MCMC to integrate over uncertainty in M, λ, pA, pE.)

    Patterson et al. 2004 Am J Hum Genet; also HMM refs: Lander & Green 1987 PNAS,

    Rabiner 1987 Proceedings of the IEEE, Durbin et al 1998 Biological Sequence Analysis

  • Big trouble if markers are in LD in ancestral populations

    Example: Admixed population with 80% POP1, 20% POP2 ancestry

    SNP1 = A/C SNP, A allele has frequency 0.25 in POP1, 0.75 in POP2

    A allele has frequency 80%·0.25 + 20%·0.75 = 0.35 in Admixed pop.

    Inference of local ancestry of a haploid chromosome using SNP1:

    prob 0.35: P(POP1 | A) = 80%·0.25/(80%·0.25 + 20%·0.75) = 57%

    prob 0.65: P(POP1 | C) = 80%·0.75/(80%·0.75 + 20%·0.25) = 92%

    Overall: P(POP1) = 57%·0.35 + 92%·0.65 = 80%. Unbiased.

    Price et al. 2008 Am J Hum Genet

  • Big trouble if markers are in LD in ancestral populations

    Example: Admixed population with 80% POP1, 20% POP2 ancestry

    SNP1 = A/C SNP, A allele has frequency 0.25 in POP1, 0.75 in POP2

    SNP2 = A/C SNP in perfect LD with SNP1 in POP1, POP2

    A allele has frequency 80%·0.25 + 20%·0.75 = 0.35 in Admixed pop.

    Inference of local ancestry of a haploid chr using SNP1, SNP2:

    prob 0.35: P(POP1 | AA) = 80%·0.25^2/(80%·0.25^2 + 20%·0.75^2) = 31%

    prob 0.65: P(POP1 | CC) = 80%·0.75^2/(80%·0.75^2 + 20%·0.25^2) = 97%

    Overall: P(POP1) = 31%·0.35 + 97%·0.65 = 74%. Biased.

    Price et al. 2008 Am J Hum Genet
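
The bias can be reproduced by counting the same SNP once, and then twice as if the copies were independent. In this sketch, `n_copies` is a hypothetical parameter for the number of perfectly linked SNPs that the model wrongly treats as independent. The other inputs are the values from the example.

```python
def posterior_pop1(alpha, freq_pop1, freq_pop2, n_copies):
    """P(POP1 | observed allele) when one SNP is counted n_copies times."""
    like1 = alpha * freq_pop1 ** n_copies
    like2 = (1 - alpha) * freq_pop2 ** n_copies
    return like1 / (like1 + like2)


def average_inferred_pop1(alpha, a_pop1, a_pop2, n_copies):
    """Average inferred POP1 ancestry over the two possible alleles."""
    f_a = alpha * a_pop1 + (1 - alpha) * a_pop2  # A frequency in admixed pop.
    post_a = posterior_pop1(alpha, a_pop1, a_pop2, n_copies)
    post_c = posterior_pop1(alpha, 1 - a_pop1, 1 - a_pop2, n_copies)
    return f_a * post_a + (1 - f_a) * post_c


# 80% POP1 ancestry; A allele frequency 0.25 in POP1 and 0.75 in POP2
unbiased = average_inferred_pop1(0.80, 0.25, 0.75, n_copies=1)
biased = average_inferred_pop1(0.80, 0.25, 0.75, n_copies=2)
print(f"SNP1 only:           {unbiased:.2f}")
print(f"SNP1 + linked SNP2:  {biased:.2f}")
```

With one SNP the average posterior recovers the true 80%. Counting the linked SNP as independent evidence overweights each observation and pulls the average to about 74%.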

  • Inferring local ancestry using GWAS chip data

    Advantages of AIM panels of 1,500+ SNPs:

    • Lower cost: $80/sample

    (vs. $300+/sample for GWAS chips).

    Advantages of GWAS chips:

    • Dense SNP coverage enables LD mapping

    • More accurate local ancestry inference?

  • Inferring local ancestry using GWAS chip data

    • ANCESTRYMAP using a subset of ~8,000 unlinked AIMs

    (Patterson et al. 2004 Am J Hum Genet; Tandon et al. 2011 Genet Epidemiol)

    New methods developed for GWAS chip data:

    • SABER (Tang et al. 2006 Am J Hum Genet)

    • LAMP (Sankararaman et al. 2008 Am J Hum Genet)

    • uSWITCH (Sankararaman et al. 2008 Genome Res)

    • HAPAA (Sundquist et al. 2008 Genome Res)

    • HAPMIX (Price, Tandon et al. 2009 PLoS Genet)

    • WINPOP (Pasaniuc et al. 2009 Bioinformatics)

    • GEDI-ADMX (Pasaniuc et al. 2009 Lect Notes Comput Sci)

    • PCA-based method (Bryc, Auton et al. 2010 PNAS)

    • LAMP-LD (Baran et al. 2012 Bioinformatics)

    • MULTIMIX (Churchhouse & Marchini 2013 Genet Epidemiol)

    • RFMix (Maples et al. 2013 Am J Hum Genet)

    reviewed in Seldin et al. 2011 Nat Rev Genet

  • Inferring local ancestry: LAMP method

    LAMP method: (allele frequencies in ancestral populations not known)

    • Prune SNP set to restrict to unlinked markers (r² < 0.10)

    • Choose fixed window length l

    • Infer local ancestry within each window of length l via EM algorithm

    (Unsupervised clustering, integer-valued haploid local ancestries)

    • For each SNP, compute majority vote of local ancestry across

    all windows overlapping that SNP.

    Sankararaman et al. 2008 Am J Hum Genet

    [Figure: windows 1-7, each of window length l]

  • Inferring local ancestry: LAMP-ANC method

    LAMP-ANC: (allele frequencies in ancestral populations are known)

    • Prune SNP set to restrict to unlinked markers (r² < 0.10)

    • Choose fixed window length l

    • Infer local ancestry within each window of length l via max likelihood

    (Supervised clustering, integer-valued haploid local ancestries)

    • For each SNP, compute majority vote of local ancestry across

    all windows overlapping that SNP.

    Sankararaman et al. 2008 Am J Hum Genet

    [Figure: windows 1-7, each of window length l]
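
The window-plus-majority-vote idea can be sketched as below. This is a simplified illustration rather than the LAMP-ANC code: it handles one haploid chromosome, measures the window length in SNPs, and uses hypothetical names (`windowed_ml_ancestry`, `window`, `step`). It assumes the SNPs have already been pruned to unlinked markers, that allele frequencies are strictly between 0 and 1, and that `step` is no larger than `window` so every SNP is covered.

```python
import numpy as np

def windowed_ml_ancestry(alleles, freqs, window, step):
    """Per-SNP ancestry label by windowed maximum likelihood + majority vote.

    alleles: (T,)   0/1 alleles of one haploid chromosome
    freqs:   (K, T) frequency of allele 1 in each of K ancestral populations
    window:  window length in SNPs; step: offset between window starts
    """
    alleles = np.asarray(alleles)
    freqs = np.asarray(freqs, dtype=float)
    K, T = freqs.shape
    # Log-likelihood of each observed allele under each population.
    loglik = np.where(alleles == 1, np.log(freqs), np.log(1 - freqs))
    votes = np.zeros((K, T), dtype=int)
    starts = list(range(0, max(T - window, 0) + 1, step))
    if starts[-1] + window < T:
        starts.append(T - window)           # make sure the right end is covered
    for s in starts:
        e = min(s + window, T)
        # Constant ancestry is assumed within the window: pick the best single label.
        best = int(np.argmax(loglik[:, s:e].sum(axis=1)))
        votes[best, s:e] += 1
    return votes.argmax(axis=0)             # majority vote across overlapping windows

# Toy usage: ancestry switches halfway along 40 SNPs.
freqs = np.vstack([np.full(40, 0.1), np.full(40, 0.9)])
alleles = np.array([0] * 20 + [1] * 20)
print(windowed_ml_ancestry(alleles, freqs, window=10, step=5))
```

Labels near the true switch point depend on windows that straddle it, which is where the constant-ancestry assumption is violated.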

  • Inferring local ancestry: LAMP and LAMP-ANC

    LAMP and LAMP-ANC:

    • Prune SNP set to restrict to unlinked markers (r² < 0.10)

    • Choose fixed window length l

    • Infer local ancestry within each window of length l

    • For each SNP, compute majority vote across windows containing SNP

    Choice of window length l is key. If window length is

    • too small: not enough information to infer local ancestry

    • too big: violates assumption of constant local ancestry within window

    Sankararaman et al. 2008 Am J Hum Genet

  • Inferring local ancestry: LAMP and LAMP-ANC

    LAMP and LAMP-ANC:

    • Prune SNP set to restrict to unlinked markers (r² < 0.10)

    • Choose fixed window length l

    • Infer local ancestry within each window of length l

    • For each SNP, compute majority vote across windows containing SNP

    Choice of window length l is key. If window length is

    • too small: not enough information to infer local ancestry

    • too big: violates assumption of constant local ancestry within window

    Use window length l which is

    inversely proportional to # generations since admixture,

    i.e. proportional to ancestry segment lengths

    Sankararaman et al. 2008 Am J Hum Genet

  • Inferring local ancestry: LAMP and LAMP-ANC

    LAMP and LAMP-ANC:

    • Prune SNP set to restrict to unlinked markers (r² < 0.10)

    • Choose fixed window length l

    • Infer local ancestry within each window of length l

    • For each SNP, compute majority vote across windows containing SNP

    Advantages:

    • Simple and transparent approach, low computational cost

    Disadvantages:

    • Information from neighboring windows is not used

    • Does not make use of haplotype information

    Sankararaman et al. 2008 Am J Hum Genet

  • WINPOP improvement to LAMP-ANC

    LAMP and LAMP-ANC:

    • Prune SNP set to restrict to unlinked markers (r² < 0.10)

    • Choose fixed window length l

    • Infer local ancestry within each window of length l

    • For each SNP, compute majority vote across windows containing SNP

    WINPOP:

    • Allow variable window length l depending on local genetic

    structure of ancestral populations.

    • Explicitly model the possibility of one recombination event per window,

    enabling larger windows.

    Pasaniuc et al. 2009 Bioinformatics

    also see Baran et al. 2012 Bioinformatics

  • Inferring local ancestry: HAPMIX method

    HAPMIX method: nested Hidden Markov Models

    • Large-scale HMM: transitions between local ancestry states

    (Patterson et al. 2004 Am J Hum Genet).

    • Small-scale HMM: transitions between haplotypes from

    ancestral reference populations (Li & Stephens 2003 Genetics)

    Price, Tandon et al. 2009 PLoS Genet

    [Figure: reference haplotypes hap1-hap5 in each of POP1 and POP2]

  • Inferring local ancestry: HAPMIX method

    HAPMIX method: nested Hidden Markov Models

    • States: local ancestry AND haplotype from POP1 or POP2.

    • Given initial, transition and emission probabilities: use

    forward-backward algorithm to infer P(states | data).

    (Durbin et al. 1998 Biological Sequence Analysis + other HMM refs)

    Price, Tandon et al. 2009 PLoS Genet

    [Figure: reference haplotypes hap1-hap5 in each of POP1 and POP2]
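
A toy version of the nested HMM is sketched below. It is a deliberate simplification and not the HAPMIX model itself: each state is a pair (ancestry, reference haplotype being copied), and the parameters `mu` (POP1 ancestry proportion), `lam` (rate of ancestry switches per Morgan), `rho` (rate of haplotype switches per Morgan) and `eps` (copying mismatch probability) are assumed names and forms chosen for illustration. The published method uses a richer parameterization.

```python
import numpy as np

def nested_hmm_ancestry(target, ref_pop1, ref_pop2, dist, mu, lam, rho, eps):
    """P(local ancestry = POP1) at each marker of one haploid chromosome.

    target:   (T,) 0/1 alleles; ref_pop1, ref_pop2: (n1, T), (n2, T) haplotypes
    dist:     (T-1,) genetic distances between adjacent markers, in Morgans
    """
    target = np.asarray(target)
    refs = np.vstack([ref_pop1, ref_pop2])               # one row per joint state
    pop = np.array([0] * len(ref_pop1) + [1] * len(ref_pop2))
    n_in_pop = np.where(pop == 0, len(ref_pop1), len(ref_pop2))
    S, T = refs.shape
    # Prior over joint states: draw ancestry, then a haplotype uniformly within it.
    prior = np.where(pop == 0, mu, 1 - mu) / n_in_pop
    # Small-scale redraw: uniform over haplotypes of the same population.
    same_pop = (pop[:, None] == pop[None, :]) / n_in_pop[None, :]
    # Emission: the copied allele is observed, with mismatch probability eps.
    emis = np.where(refs == target[None, :], 1 - eps, eps).T   # (T, S)

    def step_matrix(d):
        p_anc = 1 - np.exp(-lam * d)     # large-scale HMM: ancestry redrawn
        p_hap = 1 - np.exp(-rho * d)     # small-scale HMM: copied haplotype redrawn
        within = p_hap * same_pop + (1 - p_hap) * np.eye(S)
        return p_anc * prior[None, :] + (1 - p_anc) * within

    fwd = np.zeros((T, S))
    bwd = np.ones((T, S))
    fwd[0] = prior * emis[0]
    fwd[0] /= fwd[0].sum()
    for j in range(1, T):
        fwd[j] = (fwd[j - 1] @ step_matrix(dist[j - 1])) * emis[j]
        fwd[j] /= fwd[j].sum()
    for j in range(T - 2, -1, -1):
        bwd[j] = step_matrix(dist[j]) @ (emis[j + 1] * bwd[j + 1])
        bwd[j] /= bwd[j].sum()
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)
    return post[:, pop == 0].sum(axis=1)  # marginalize over copied haplotypes

# Toy usage: POP1 references carry allele 0, POP2 references carry allele 1.
T = 20
p = nested_hmm_ancestry([0] * 10 + [1] * 10, np.zeros((3, T), int), np.ones((3, T), int),
                        [0.001] * (T - 1), mu=0.5, lam=6, rho=100, eps=0.01)
print(np.round(p, 3))
```

Summing the posterior over the haplotype index is what turns the joint states back into a local ancestry estimate.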

  • Inferring local ancestry: HAPMIX method

    Advantages:

    • Large-scale + Small-scal