Lecture 2: Population Structure
02-715 Advanced Topics in Computational Genomics
What is population structure?
• Population structure
  – A set of individuals characterized by some measure of genetic distinction
  – A "population" is usually characterized by a distinct distribution over genotypes
  – Example: [Figure: distributions over genotypes aa, aA, AA for Population 1 and Population 2]
Motivation
• Reconstructing individual ancestry: The Genographic Project
  – https://genographic.nationalgeographic.com/genographic/index.html
• Studying human migration
  – Out of Africa
  – Multi-regional hypothesis
• Study of various traits
  – Lactose intolerance
  – Origins in Europe?
  – Infer from
    • Migration studies
    • Mutation studies in populations
[Figure: human migration at 200,000, 50,000, 30,000, and 10,000 years ago. Source: https://genographic.nationalgeographic.com/genographic/index.html]
Overview
• Background
  – Hardy-Weinberg equilibrium
  – Genetic drift
  – Wright's F_ST
• Inferring population structure from genotype data
  – Structure (Falush et al., 2003)
  – Matrix factorization/dimensionality reduction methods (Engelhardt & Stephens, 2010)
Hardy-Weinberg Equilibrium
• Hardy-Weinberg equilibrium
  – Under random mating, both allele and genotype frequencies in a population remain constant over generations.
  – Assumptions of the standard random mating
    • Diploid organism
    • Sexual reproduction
    • Nonoverlapping generations
    • Random mating
    • Large population size
    • Equal allele frequencies in the sexes
    • No migration/mutation/selection
  – Chi-square test for Hardy-Weinberg equilibrium
Hardy-Weinberg Equilibrium
• p, q: allele frequencies of A and a
• D, H, R: genotype frequencies for AA, Aa, aa, respectively.
  – $D = p^2$
  – $H = 2pq$
  – $R = q^2$
Hardy-Weinberg Equilibrium
• The genotype and allele frequencies of the offspring
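The equations on this slide did not survive extraction. As a sketch of the standard textbook argument (not necessarily the exact form shown in the lecture), using the symbols p, q, D, H, R defined on the earlier slides and assuming random mating, with primes marking the offspring generation:

```latex
\begin{aligned}
p &= D + \tfrac{1}{2}H, \qquad q = R + \tfrac{1}{2}H \\
D' &= p^2, \qquad H' = 2pq, \qquad R' = q^2 \\
p' &= D' + \tfrac{1}{2}H' = p^2 + pq = p(p + q) = p
\end{aligned}
```

The last line uses p + q = 1: the allele frequency is unchanged after one round of random mating, so the genotype frequencies stay at p^2, 2pq, q^2 in all later generations.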
Testing Whether Hardy-Weinberg Equilibrium Holds
• Chi-square test
  – Null hypothesis: HWE holds in the observed data
  – Test if the null hypothesis is violated in the data by comparing the observed genotype frequencies (in the parent generation) with the expected frequencies (in the offspring generation)
Testing Whether Hardy-Weinberg Equilibrium Holds
Genotype AA Aa aa Total
Observed 224 64 6 294
Expected ? ? ? 294
Testing Whether Hardy-Weinberg Equilibrium Holds
Genotype AA Aa aa Total
Observed 224 64 6 294
Expected 222.9 66.2 4.9 294
Step 1: Compute allele frequencies from the observed data
$p = \frac{224 \times 2 + 64}{294 \times 2} = 0.871$
$q = 1 - p = 0.129$
Step 2: Compute the expected genotype frequencies
$\mathrm{Expected}(AA) = p^2 n = 0.8707^2 \times 294 = 222.9$
Step 3: Compute the test statistic
$\chi^2 = \sum \frac{(\mathrm{observed} - \mathrm{expected})^2}{\mathrm{expected}} = \frac{(224 - 222.9)^2}{222.9} + \frac{(64 - 66.2)^2}{66.2} + \frac{(6 - 4.9)^2}{4.9} = 0.32$
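The three steps can be reproduced with a short script. This is a minimal sketch: the function name `hwe_chi_square` is illustrative, and the p-value step assumes the usual 1 degree of freedom for a biallelic locus (three genotype classes, minus one, minus one estimated allele frequency), which the slides do not state.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test of HWE for one biallelic locus (assumes 1 degree of freedom)."""
    n = n_AA + n_Aa + n_aa
    # Step 1: allele frequencies from the observed genotype counts
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    # Step 2: expected genotype counts under HWE
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_AA, n_Aa, n_aa]
    # Step 3: test statistic
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, expected, chi2, p_value

p, expected, chi2, p_value = hwe_chi_square(224, 64, 6)
print(round(p, 3), [round(e, 1) for e in expected], round(chi2, 2))
# prints: 0.871 [222.9, 66.2, 4.9] 0.32
```

With these counts the p-value is well above 0.05, so the null hypothesis that HWE holds is not rejected.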
Genetic Drift
• The change in allele frequencies in a population due to random sampling
• Neutral process, unlike natural selection
  – But genetic drift can eliminate an allele from the given population.
• The effect of genetic drift is larger in a small population
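The effect of population size can be illustrated with a small simulation. The sketch below assumes the Wright-Fisher model (each generation's gene copies are drawn at random from the previous generation's pool), which the slides do not name; the function name and parameter values are illustrative.

```python
import random

def wright_fisher(p0, n_individuals, generations, rng):
    """Allele-frequency trajectory under pure drift in a diploid population."""
    n_copies = 2 * n_individuals  # gene copies per generation
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each copy in the next generation is sampled from the current allele pool
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        trajectory.append(p)
    return trajectory

rng = random.Random(0)
small = wright_fisher(0.5, 10, 100, rng)    # 10 individuals
large = wright_fisher(0.5, 1000, 100, rng)  # 1000 individuals
print(small[-1], large[-1])
```

In runs like this, the small population typically wanders much further from the starting frequency of 0.5, and once the frequency reaches 0 or 1 the allele is lost or fixed and the trajectory stays there.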
Population Divergence
• Wright's F_ST
  – Statistic used to quantify the extent of divergence among multiple populations relative to the overall genetic diversity
  – Summarizes the average deviation of a collection of populations away from the mean
  – $F_{ST} = \mathrm{Var}(p_k) / [\,p'(1 - p')\,]$
    • p': the overall frequency of an allele across all subpopulations
    • p_k: the allele frequency within population k
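A minimal sketch of the F_ST formula for a single allele follows. It assumes equally sized subpopulations, so that p' is the plain mean of the p_k and the variance is taken across subpopulations; the function name `fst` is illustrative.

```python
def fst(subpop_freqs):
    """F_ST = Var(p_k) / (p' (1 - p')) for one allele; assumes equal-sized subpopulations."""
    k = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / k                         # overall frequency p'
    var = sum((pk - p_bar) ** 2 for pk in subpop_freqs) / k
    # Undefined when p' is 0 or 1 (the allele is absent or fixed everywhere)
    return var / (p_bar * (1.0 - p_bar))

print(fst([0.5, 0.5]))  # identical subpopulations: no divergence
print(fst([0.2, 0.8]))  # partial divergence
print(fst([0.0, 1.0]))  # each subpopulation fixed for a different allele
```

The value ranges from 0 (all subpopulations share the same allele frequency) to 1 (subpopulations are fixed for different alleles).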
Scenarios of How Populations Evolve
Methods for Learning Population Structure from Genetic Markers
• Low-dimensional projection
  – Matrix-factorization-based methods (Patterson et al., PLoS Genetics 2006)
• Model-based clustering
  – STRUCTURE (Pritchard et al., Genetics 2000)
Low-dimensional Projections
• Genetic data is very large
  – Number of markers may range from a few hundred to hundreds of thousands
  – Thus each individual is described by a high-dimensional vector of marker configurations
  – A low-dimensional projection allows easy visualization
• Allows projection of individuals into a low-dimensional space
• Usually projected to 2 dimensions to allow visualization
Matrix Factorization and Population Structure
• Matrix factorization for learning population structure
  Genotype data (N×P matrix) = Individuals' ancestry proportions (N×K matrix) × Subpopulation allele frequencies (K×P matrix)
  – N: number of samples
  – P: number of genotypes
  – K: number of subpopulations
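The shapes in this factorization can be checked on a toy example. The numbers below are hypothetical, chosen only for illustration: multiplying an N×K ancestry matrix by a K×P allele-frequency matrix gives, for each individual and marker, the allele frequency expected from that individual's ancestry mix.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical toy values: N = 3 individuals, K = 2 subpopulations, P = 4 markers
ancestry = [[1.0, 0.0],   # individual entirely from subpopulation 1
            [0.0, 1.0],   # individual entirely from subpopulation 2
            [0.5, 0.5]]   # admixed individual
allele_freqs = [[0.9, 0.1, 0.8, 0.3],  # subpopulation 1
                [0.2, 0.7, 0.8, 0.6]]  # subpopulation 2

expected = matmul(ancestry, allele_freqs)  # N x P
for row in expected:
    print([round(v, 2) for v in row])
```

Learning population structure runs in the opposite direction: only the genotype matrix is observed, and the two factors are estimated from it.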
Unifying Framework of Matrix Factorization
• PCA
  – Based on eigendecomposition: columns of Λ are orthogonal, rows of F are orthonormal.
  – Works well for the case of isolation-by-distance (continuous variation of populations among individuals)
• Admixture
  – Based on probability models: rows of Λ and columns of F should sum to 1.
  – Works well if the individuals are admixtures of discretely separated populations
• Sparse factor model
  – Sparsity via automatic relevance determination prior
Principal Component Analysis
• Most common form of factor analysis
• The new variables/dimensions ...
  – Are linear combinations of the original ones
  – Are uncorrelated with one another
    • Orthogonal in original dimension space
  – Capture as much of the original variance in the data as possible
  – Are called Principal Components
What are the new axes?
[Figure: scatter plot with axes Original Variable A and Original Variable B, with PC 1 and PC 2 drawn as the new axes]
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis
Principal Components
• First principal component is the direction of greatest variability (covariance) in the data
• Second is the next orthogonal (uncorrelated) direction of greatest variability
  – So first remove all the variability along the first component, and then find the next direction of greatest variability
• And so on …
Dimensionality Reduction
Can ignore the components of lesser significance.
You do lose some information, but if the eigenvalues are small, you don't lose much
  – n dimensions in original data
  – calculate n eigenvectors and eigenvalues
  – choose only the first p eigenvectors, based on their eigenvalues
  – final data set has only p dimensions
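The reduction steps above can be sketched in plain Python. This illustration finds the leading eigenvectors of the covariance matrix by power iteration with deflation (remove the variance along each component, then look for the next direction); that numerical method is an assumption made here to keep the example dependency-free, and in practice a linear algebra library would be used. The function name and toy data are hypothetical.

```python
import math
import random

def pca(data, n_components, n_iter=200, seed=0):
    """Leading principal components via power iteration with deflation (a simple sketch)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d)
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components, eigenvalues = [], []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random starting direction
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in any direction
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        components.append(v)
        eigenvalues.append(lam)
        # Deflation: remove the variability along this component
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    # Project the centered data onto the chosen components
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return components, eigenvalues, projected

# Toy data lying close to the line y = x: one component captures most of the variance
toy = [[0.0, 0.0], [1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
components, eigenvalues, projected = pca(toy, 1)
print(components[0], eigenvalues[0])
```

Here the original data has n = 2 dimensions and the final data set `projected` has p = 1 dimension per point. The sign of each component is arbitrary.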
PCA Analysis (Cavalli-Sforza, 1978)
• Plot of geographical distribution of 3 PCs (intensity proportional to value of each component)
  – First: blue
  – Second: green
  – Third: red
Discrete/Admixed Populations
[Figure panels: SFA, PCA, Admixture; each shows Loading (population) 1, Loading 2, Loading 3]
Analysis of European Genotype Data
[Figure panels: PCA, SFAm, Admixture]
Probabilistic Models for Population Structure
• Mixture model
  – Cluster individuals into K populations
• Admixture model
  – The genotypes of each individual are an admixture of multiple ancestor populations
  – Assumes alleles are in linkage equilibrium
• Linkage model
  – Model recombination, correlation in alleles across chromosome
• Organizing data into clusters such that there is
  • high intra-cluster similarity
  • low inter-cluster similarity
• Informally, finding natural groupings among objects.
[Figure: scatter plot of objects with three cluster centers k1, k2, k3; both axes run from 0 to 5]
• For a pre-defined number of clusters K, initialize K centers randomly.
• Iterate between the following two steps
  – Assign all objects to the nearest center.
  – Move each center to the mean of its members.
• After moving centers, re-assign the objects to the nearest centers.
• Move each center to the mean of its new members.
• Re-assign and move centers until no objects change membership.
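The procedure described on these slides can be sketched as follows. The function name `kmeans`, the choice to initialize centers at randomly chosen data points, and the toy coordinates are assumptions made for illustration.

```python
import random

def kmeans(points, k, rng, max_iter=100):
    """Plain k-means on a list of coordinate tuples; returns (centers, labels)."""
    centers = rng.sample(points, k)  # initialize K centers at random data points
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every object to its nearest center (squared Euclidean distance)
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # no object changed membership: stop
            break
        labels = new_labels
        # Step 2: move each center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels

# Hypothetical toy data: two well-separated groups of three points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, 2, random.Random(0))
print(centers, labels)
```

On data this well separated the two groups are recovered from any starting pair of points. In general the result depends on the initialization, so k-means is usually run from several random starts.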
Soft-Clustering of Individuals into Three Clusters with Gaussian Mixture Model
Probability of   Cluster 1   Cluster 2   Cluster 3   Sum
Individual 1     0.1         0.4         0.5         1
Individual 2     0.8         0.1         0.1         1
Individual 3     0.7         0.2         0.1         1
Individual 4     0.10        0.05        0.85        1
Individual 5     …           …           …           1
Individual 6     …           …           …           1
Individual 7     …           …           …           1
Individual 8     …           …           …           1
Individual 9     …           …           …           1
Individual 10    …           …           …           1
• Each individual can be assigned to more than one cluster with a certain probability.
• For each individual, the probabilities for all clusters should sum to 1 (i.e., each row should sum to 1).
• Each cluster is explained by a cluster center variable (i.e., cluster mean).
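One row of such a table can be computed as below. This is a minimal sketch for one-dimensional data, assuming the cluster means, variances, and mixing weights are already known; all parameter values are hypothetical.

```python
import math

def responsibilities(x, means, variances, weights):
    """Posterior probability of each cluster for one 1-D observation x."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for m, v, w in zip(means, variances, weights)]
    total = sum(dens)
    return [d / total for d in dens]  # normalized so the row sums to 1

# Hypothetical parameters: three clusters with unit variance and equal weights
means, variances, weights = [0.0, 3.0, 6.0], [1.0, 1.0, 1.0], [1 / 3, 1 / 3, 1 / 3]
for x in [0.2, 1.5, 4.0]:
    row = responsibilities(x, means, variances, weights)
    print(x, [round(r, 2) for r in row])
```

A point midway between two cluster means (x = 1.5 here) receives equal probability for both, which is the soft assignment that k-means cannot express.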
Mixture Model
• The goal is to discover K clusters for K populations from an N×J genotype matrix (N: # of samples, J: # of loci) (x_{i,n} in the diagram)
• Assume K populations (clusters)
• θ = Distribution over populations
  – Mixing proportions in mixture model
• β = Distribution over alleles at each locus in each population
  – Mixture component model in mixture model
• To generate an individual's genome
  – All individuals share the same θ
  – Sample z_n from Multinomial(θ)
  – For each locus
    • Sample x_{i,n} from β corresponding to the population chosen by z_n
[Figure: graphical model with nodes α, θ, z, x_{i,n}, β, λ and plates i = 1…J, n = 1…N, k = 1…K]
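The generative process of the mixture model can be sketched as follows. The encoding is an assumption made for illustration: each locus is biallelic, `beta[k][j]` is the frequency of one allele at locus j in population k, and a diploid genotype is stored as the number of copies (0, 1, or 2) of that allele.

```python
import random

def sample_mixture(theta, beta, n_individuals, rng):
    """Sample population labels and genotypes from the mixture model."""
    K, J = len(theta), len(beta[0])
    z, x = [], []
    for _ in range(n_individuals):
        # One population label per individual, shared by all of its loci
        k = rng.choices(range(K), weights=theta)[0]
        # Two allele copies per locus, each drawn from population k's allele frequency
        genotype = [sum(rng.random() < beta[k][j] for _ in range(2)) for j in range(J)]
        z.append(k)
        x.append(genotype)
    return z, x

theta = [0.7, 0.3]            # hypothetical mixing proportions
beta = [[0.9, 0.1, 0.8],      # allele frequencies in population 1
        [0.2, 0.7, 0.4]]      # allele frequencies in population 2
z, x = sample_mixture(theta, beta, 5, random.Random(0))
print(z)
print(x)
```

Inference reverses this process: given only the genotypes, estimate θ, β, and the population label of each individual.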
Admixture Model
• Relax the assumption of one population per individual in the mixture model
• Individuals can be assigned to multiple different populations in different loci
The Admixture Model
• β = Distribution over alleles
  – One per population-locus pair
• To generate an individual's genome
  – Sample θ_n from Dirichlet(α)
  – For each locus
    • Sample z_{i,n} from Multinomial(θ_n)
    • Sample x_{i,n} from β corresponding to the population chosen by z_{i,n}
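A sketch of this generative process for one individual follows. For brevity it assumes one allele copy per locus and biallelic markers, with `beta[k][i]` the frequency of allele 1 at locus i in population k; the Dirichlet draw is built from gamma variates, a standard construction. All names and values are illustrative.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

def sample_admixed_individual(alpha, beta, rng):
    """Sample ancestry proportions, per-locus ancestry labels, and alleles."""
    K, J = len(alpha), len(beta[0])
    theta_n = sample_dirichlet(alpha, rng)  # this individual's ancestry proportions
    z, x = [], []
    for i in range(J):
        k = rng.choices(range(K), weights=theta_n)[0]  # ancestry label at locus i
        z.append(k)
        x.append(1 if rng.random() < beta[k][i] else 0)  # allele drawn from population k
    return theta_n, z, x

alpha = [1.0, 1.0]               # hypothetical Dirichlet parameters
beta = [[0.9, 0.1, 0.8, 0.3],    # allele frequencies in population 1
        [0.2, 0.7, 0.4, 0.6]]    # allele frequencies in population 2
theta_n, z, x = sample_admixed_individual(alpha, beta, random.Random(0))
print([round(t, 2) for t in theta_n], z, x)
```

Unlike the mixture model, the population label is drawn separately at every locus, so a single genome can mix alleles from several ancestral populations.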
Structure Model
• Hypothesis: Modern populations are created by an intermixing of ancestral populations.
• An individual's genome contains contributions from one or more ancestral populations.
• The contributions of populations can be different for different individuals.
• Other assumptions
  – Hardy-Weinberg equilibrium
  – No linkage disequilibrium
  – Markers are i.i.d. (independent and identically distributed)
Linkage Model
• From the admixture model, replace the assumption that the ancestry labels z_{il} for individual i, locus l are independent with the assumption that adjacent z_{il} are correlated.
• Use a Poisson process to model the correlation between neighboring alleles
  – d_l: distance between locus l and locus l+1
  – r: recombination rate
Linkage Model
• As the recombination rate r goes to infinity, all loci become independent and the linkage model becomes the admixture model.
• The recombination rate r can be viewed as being related to the number of generations since admixture occurred.
• Use an MCMC algorithm to fit the unknown parameters.
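The correlation between adjacent ancestry labels can be sketched as a Markov chain along the chromosome. The sketch assumes the form used in the linkage model of Falush et al.: between loci l and l+1 the label is kept with probability exp(-d_l r) (no Poisson event), and otherwise redrawn from the individual's ancestry proportions. The names `theta_n` and `distances`, and the toy values, are illustrative.

```python
import math
import random

def sample_ancestry_chain(theta_n, distances, r, rng):
    """Sample ancestry labels along one chromosome under the linkage model."""
    K = len(theta_n)
    z = [rng.choices(range(K), weights=theta_n)[0]]  # label at the first locus
    for d in distances:
        stay = math.exp(-d * r)  # probability of no Poisson event between the two loci
        if rng.random() < stay:
            z.append(z[-1])      # same ancestry as the previous locus
        else:
            z.append(rng.choices(range(K), weights=theta_n)[0])  # redraw from theta_n
    return z

theta_n = [0.5, 0.5]
distances = [0.01] * 19  # hypothetical distances between 20 loci
rng = random.Random(0)
print(sample_ancestry_chain(theta_n, distances, 1.0, rng))   # long blocks of shared ancestry
print(sample_ancestry_chain(theta_n, distances, 1e6, rng))   # labels effectively independent
```

With r = 0 every locus inherits the label of the first one; as r grows, exp(-d_l r) goes to 0 and each label is drawn independently from the ancestry proportions, which is the admixture model.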
Population Structure from Ancestry Proportion of Each Individual
• How to display population structure?
Genetic Structure of Human Populations (Rosenberg et al., Science 2002)
[Figure: ancestral proportion of each individual, grouped by region: Africa, Europe, Mid-East, Cent./S. Asia, East Asia, Oceania]
Population of Origin Assignments of a Single Individual
[Figure panels: True origin; Estimated origin (unphased data); Estimated origin (phased data)]
Comparison of Different Methods
PCA
  Advantages
    • Statistical tests for significance of results (Patterson et al. 2006)
    • Easy visualization
  Disadvantages
    • No intuition about underlying processes
Model-based Clustering
  Advantages
    • Generative process that explicitly models admixture
    • Clustering is probabilistic: it is possible to assign a confidence level to clusters
  Disadvantages
    • Computationally more demanding
    • Based on assumptions of evolutionary models:
      • Structure: No models of mutation, recombination
      • Recombination added in extension by Falush et al.