Lecture 2: Population Structure


Transcript of Lecture 2: Population Structure

Page 1: Lecture 2: Population Structure

Lecture 2: Population Structure

02-715 Advanced Topics in Computational Genomics

Page 2: Lecture 2: Population Structure

What is population structure?

•  Population structure
   –  A set of individuals characterized by some measure of genetic distinction
   –  A “population” is usually characterized by a distinct distribution over genotypes
   –  Example genotypes: aa, aA, AA

[Figure: genotype distributions over aa, aA, AA for Population 1 and Population 2]

Page 3: Lecture 2: Population Structure

1000 Genomes Project

Page 4: Lecture 2: Population Structure

Motivation

•  Reconstructing individual ancestry: The Genographic Project
   –  https://genographic.nationalgeographic.com/genographic/index.html
•  Studying human migration
   –  Out of Africa
   –  Multi-regional hypothesis
•  Study of various traits
   –  Lactose intolerance
   –  Origins in Europe?
   –  Infer from
      •  Migration studies
      •  Mutation studies in populations

Page 5: Lecture 2: Population Structure

[Figure: human migration timeline with labels 200,000, 50,000, 30,000, and 10,000 years ago. Source: https://genographic.nationalgeographic.com/genographic/index.html]

Page 6: Lecture 2: Population Structure

Overview

•  Background
   –  Hardy-Weinberg equilibrium
   –  Genetic drift
   –  Wright’s FST
•  Inferring population structure from genotype data
   –  Structure (Falush et al., 2003)
   –  Matrix factorization/dimensionality reduction methods (Engelhardt & Stephens, 2010)

Page 7: Lecture 2: Population Structure

Hardy-Weinberg Equilibrium

•  Hardy-Weinberg equilibrium
   –  Under random mating, both allele and genotype frequencies in a population remain constant over generations.
   –  Assumptions of the standard random-mating model
      •  Diploid organism
      •  Sexual reproduction
      •  Nonoverlapping generations
      •  Random mating
      •  Large population size
      •  Equal allele frequencies in the sexes
      •  No migration/mutation/selection
   –  Chi-square test for Hardy-Weinberg equilibrium
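
The chi-square test named on this slide can be sketched as follows. This is a minimal sketch, assuming a single biallelic locus at which both alleles are observed; the function name `hwe_chi_square` and the genotype counts are illustrative and not from the lecture. Observed genotype counts are compared with the counts expected from the estimated allele frequency, using one degree of freedom (three genotype classes, minus one, minus one estimated parameter).

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test for HWE at one biallelic locus."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)      # estimated frequency of allele A
    q = 1.0 - p                          # estimated frequency of allele a
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: one sample exactly at HWE, one with a heterozygote deficit
chi2_eq, pval_eq = hwe_chi_square(25, 50, 25)
chi2_dev, pval_dev = hwe_chi_square(45, 10, 45)
```

A small p-value suggests that at least one of the listed assumptions is violated; population structure itself is one common cause of a heterozygote deficit.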

Page 8: Lecture 2: Population Structure

Hardy-Weinberg Equilibrium

•  D, H, R: genotype frequencies for AA, Aa, aa, respectively
•  p, q: allele frequencies of A and a

Page 9: Lecture 2: Population Structure

Hardy-Weinberg Equilibrium

•  The genotype and allele frequencies of the offspring
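
The equations on these two slides did not survive in the transcript. The standard textbook relations, written with the symbols D, H, R, p, q defined above, are presumably what was shown. The generation subscript t is notation added here for illustration.

```latex
\begin{align*}
p_t &= D_t + \tfrac{1}{2} H_t, \qquad q_t = R_t + \tfrac{1}{2} H_t, \qquad p_t + q_t = 1 \\
D_{t+1} &= p_t^2, \qquad H_{t+1} = 2 p_t q_t, \qquad R_{t+1} = q_t^2 \\
p_{t+1} &= D_{t+1} + \tfrac{1}{2} H_{t+1} = p_t^2 + p_t q_t = p_t (p_t + q_t) = p_t
\end{align*}
```

The last line shows why the allele frequencies, and hence the genotype frequencies, stay constant after one generation of random mating.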

Page 10: Lecture 2: Population Structure

Genetic Drift

•  The change in allele frequencies in a population due to random sampling
•  Neutral process, unlike natural selection
   –  But genetic drift can eliminate an allele from the given population.
•  The effect of genetic drift is larger in a small population
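
Drift can be illustrated with a small simulation. The sketch below assumes a Wright-Fisher model (each generation, the 2N gene copies are sampled with replacement from the previous generation); the lecture does not name a specific model, and the function name `wright_fisher` and the population sizes are illustrative.

```python
import random

def wright_fisher(n_individuals, p0, generations, seed=0):
    """Simulate the frequency of one allele under pure genetic drift."""
    rng = random.Random(seed)
    n_copies = 2 * n_individuals          # diploid: two gene copies per individual
    p = p0
    freqs = [p]
    for _ in range(generations):
        # Binomial sampling of the next generation's gene copies
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        freqs.append(p)
    return freqs

small = wright_fisher(10, 0.5, 2000)     # small population, many generations
large = wright_fisher(1000, 0.5, 50)     # large population, few generations
```

In the small population the allele is eventually lost or fixed (frequency 0 or 1) and then stays there, while the large population changes only slowly over the same time scale.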

Page 11: Lecture 2: Population Structure

Population Divergence

•  Wright’s FST
   –  Statistic used to quantify the extent of divergence among multiple populations relative to the overall genetic diversity
   –  Summarizes the average deviation of a collection of populations away from the mean
   –  FST = Var(pk) / (p’(1 − p’))
      •  p’: the overall frequency of an allele across all subpopulations
      •  pk: the allele frequency within population k
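
The formula above can be computed directly. This is a minimal sketch, assuming equally weighted subpopulations and the population variance (dividing by the number of subpopulations); the function name `fst` and the example frequencies are illustrative.

```python
def fst(subpop_freqs):
    """FST = Var(p_k) / (p' (1 - p')) for one allele across subpopulations."""
    k = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / k                         # overall frequency p'
    var = sum((p - p_bar) ** 2 for p in subpop_freqs) / k # variance of p_k
    return var / (p_bar * (1.0 - p_bar))

fst_none = fst([0.5, 0.5, 0.5])   # identical subpopulations
fst_mild = fst([0.4, 0.6])        # mild divergence
fst_full = fst([0.0, 1.0])        # allele fixed in one, lost in the other
```

The three cases span the range of the statistic: 0 when all subpopulations share the same frequency, 1 when they are fixed for different alleles.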

Page 12: Lecture 2: Population Structure

Scenarios of How Populations Evolve


Page 13: Lecture 2: Population Structure

Methods for Learning Population Structure from Genetic Markers

•  Low-dimensional projection
   –  PCA-based methods (Patterson et al., PLoS Genetics 2006)
•  Clustering
   –  Distance-based (Bowcock et al., Nature 1994)
   –  Model-based
      •  STRUCTURE (Pritchard et al., Genetics 2000)
      •  mStruct (Shringarpure & Xing, Genetics 2008)

Page 14: Lecture 2: Population Structure

Probabilistic Models for Population Structure

•  Mixture model
   –  Cluster individuals into K populations
•  Admixture model
   –  The genotypes of each individual are an admixture of multiple ancestral populations
   –  Assumes alleles are in linkage equilibrium
•  Linkage model
   –  Models recombination and the resulting correlation in alleles along the chromosome
•  F model
   –  Models correlation in allele frequencies among ancestral populations

Page 15: Lecture 2: Population Structure

Mixture Model

•  K populations
•  z(i): population of origin of individual i
•  For each of the K populations
   –  pklj: the frequency of allele j at locus l in population k
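
Given the allele frequencies pklj, the population of origin z(i) can be inferred from an individual's genotype. The sketch below assumes Hardy-Weinberg equilibrium within populations (the two allele copies at a locus are independent draws), independent loci, strictly positive allele frequencies, and a uniform prior over the K populations. The function names and the toy frequencies are hypothetical.

```python
import math

def log_likelihood(genotype, freqs_k):
    """log Pr(genotype | origin k); genotype is a list of (allele, allele) per locus."""
    return sum(math.log(freqs_k[l][a]) + math.log(freqs_k[l][b])
               for l, (a, b) in enumerate(genotype))

def posterior_origin(genotype, freqs):
    """Posterior over z(i) under a uniform prior; freqs[k][l][j] = p_klj."""
    logs = [log_likelihood(genotype, freqs_k) for freqs_k in freqs]
    top = max(logs)                              # subtract max for numerical stability
    weights = [math.exp(x - top) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Toy data: 2 populations, 3 biallelic loci
freqs = [
    [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]],   # population 0
    [[0.2, 0.8], [0.2, 0.8], [0.2, 0.8]],   # population 1
]
genotype = [(0, 0), (0, 1), (0, 0)]
posterior = posterior_origin(genotype, freqs)
```

In a full analysis the frequencies are unknown and are estimated jointly with the labels; this sketch covers only the assignment step.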

Page 16: Lecture 2: Population Structure

Admixture Model

•  Relax the mixture model's assumption of one ancestral population per individual
•  Individuals can have ancestors in multiple different populations
•  qk(i): proportion of individual i’s genome derived from population k
•  Alleles at different loci can come from different populations
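
The generative process implied by these bullets can be sketched as follows: for every allele copy, first draw a population of origin from the individual's proportions qk(i), then draw an allele from that population's frequencies. This is a sketch assuming diploid individuals and independent loci; the function name and the toy frequencies are hypothetical.

```python
import random

def sample_admixed_genotype(q_i, freqs, rng):
    """Sample one diploid genotype; q_i[k] = q_k(i), freqs[k][l][j] = p_klj."""
    populations = list(range(len(q_i)))
    genotype, origins = [], []
    for l in range(len(freqs[0])):
        alleles, labels = [], []
        for _ in range(2):                       # two allele copies per locus
            k = rng.choices(populations, weights=q_i)[0]
            allele_ids = list(range(len(freqs[k][l])))
            j = rng.choices(allele_ids, weights=freqs[k][l])[0]
            labels.append(k)
            alleles.append(j)
        origins.append(tuple(labels))
        genotype.append(tuple(alleles))
    return genotype, origins

rng = random.Random(0)
freqs = [
    [[1.0, 0.0]] * 4,    # population 0 carries only allele 0 (toy values)
    [[0.0, 1.0]] * 4,    # population 1 carries only allele 1 (toy values)
]
pure_genotype, pure_origins = sample_admixed_genotype([1.0, 0.0], freqs, rng)
mixed_genotype, mixed_origins = sample_admixed_genotype([0.5, 0.5], freqs, rng)
```

Setting one entry of `q_i` to 1 recovers the mixture model, in which every allele copy of an individual comes from the same population.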

Page 17: Lecture 2: Population Structure

Structure Model

•  Hypothesis: modern populations are created by an intermixing of ancestral populations.
•  An individual’s genome contains contributions from one or more ancestral populations.
•  The contributions of populations can be different for different individuals.
•  Other assumptions
   –  Hardy-Weinberg equilibrium
   –  No linkage disequilibrium
   –  Markers are i.i.d. (independent and identically distributed)

Page 18: Lecture 2: Population Structure

Linkage Model

•  Starting from the admixture model, replace the assumption that the ancestry labels zil (individual i, locus l) are independent with the assumption that adjacent zil are correlated.
•  Use a Poisson process to model the correlation between neighboring alleles
   –  dl: distance between locus l and locus l+1
   –  r: recombination rate
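
The transition equation for this slide is missing from the transcript. The sketch below assumes the form commonly given for this model: with probability exp(−dl·r) no breakpoint occurs between locus l and l+1 and the ancestry label is copied, and otherwise the label is redrawn from the individual's admixture proportions. The function name `ancestry_transition` and the example values are hypothetical.

```python
import math

def ancestry_transition(q_i, d_l, r):
    """Matrix T[a][b] = Pr(z_{l+1} = b | z_l = a) for one individual."""
    stay = math.exp(-d_l * r)        # probability of no breakpoint in distance d_l
    k = len(q_i)
    return [[(stay if a == b else 0.0) + (1.0 - stay) * q_i[b]
             for b in range(k)]
            for a in range(k)]

q = [0.2, 0.3, 0.5]
t_linked = ancestry_transition(q, 0.1, 0.0)     # r = 0: labels are always copied
t_mid = ancestry_transition(q, 0.1, 5.0)
t_free = ancestry_transition(q, 0.1, 1e6)       # very large r: labels are independent
```

With very large r every row of the matrix equals the admixture proportions, so neighboring labels are independent draws, as in the admixture model.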

Page 19: Lecture 2: Population Structure

Linkage Model

•  As the recombination rate r goes to infinity, all loci become independent and the linkage model reduces to the admixture model.
•  The recombination rate r can be viewed as being related to the number of generations since admixture occurred.
•  Use an MCMC algorithm to fit the unknown parameters.

Page 20: Lecture 2: Population Structure

F Model

•  Introduce correlations in allele frequencies among ancestral populations
   –  pAl: allele frequencies at locus l in the common ancestral population, modeled with a symmetric Dirichlet distribution
   –  Subpopulations of the ancestral population go through genetic drift at different rates Fk
   –  Individuals are admixtures of those K populations, which have drifted from the common ancestral population

Page 21: Lecture 2: Population Structure

F Model

•  Relationship between Fk and FST
•  Designed to distinguish between closely related populations with similar allele frequencies
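
The relationship between Fk and FST can be checked numerically. The sketch below assumes the Dirichlet parameterization given in Falush et al. (2003), where the allele frequencies of population k at locus l are drawn from a Dirichlet distribution with parameters pAl·(1 − Fk)/Fk. Under that assumption the variance of a drifted frequency is Fk·pAl(1 − pAl), which matches the FST formula from the Population Divergence slide. The function name, the ancestral frequencies, and the value of Fk are hypothetical.

```python
import random

def sample_drifted_freqs(p_ancestral, f_k, rng):
    """Draw p_kl ~ Dirichlet(p_Al * (1 - F_k) / F_k) via normalized gamma draws."""
    scale = (1.0 - f_k) / f_k
    draws = [rng.gammavariate(p * scale, 1.0) for p in p_ancestral]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)
p_ancestral = [0.3, 0.7]      # ancestral allele frequencies at one locus
f_k = 0.05                    # small F_k: population close to the ancestor

one_draw = sample_drifted_freqs(p_ancestral, f_k, rng)
samples = [sample_drifted_freqs(p_ancestral, f_k, rng)[0] for _ in range(5000)]
mean_p = sum(samples) / len(samples)
var_p = sum((x - mean_p) ** 2 for x in samples) / len(samples)
f_estimate = var_p / (p_ancestral[0] * (1.0 - p_ancestral[0]))
```

Here `f_estimate` applies the Var(pk) / (p’(1 − p’)) formula to the simulated frequencies and should land close to the Fk used to generate them.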

Page 22: Lecture 2: Population Structure

Scenarios of How Populations Evolve


Page 23: Lecture 2: Population Structure

Unknown Parameters To Be Estimated

•  qi: the admixture proportions of individual i
•  pk: allele frequencies of population k
•  zi: population label for each locus of individual i
•  r: recombination rate
•  Fk: estimate of population divergence from the ancestral population

Page 24: Lecture 2: Population Structure

Population Structure from Ancestry Proportion of Each Individual

•  How to display population structure?

[Figure: Genetic Structure of Human Populations (Rosenberg et al., Science 2002). Ancestral proportion for each individual, grouped by region: Africa, Europe, Mid-East, Cent./S. Asia, East Asia, Oceania]

Page 25: Lecture 2: Population Structure

Population of Origin Assignments of a Single Individual

[Figure panels: True origin; Estimated origin (unphased data); Estimated origin (phased data)]

Page 26: Lecture 2: Population Structure

Admixture vs Divergence


Page 27: Lecture 2: Population Structure

Posterior Distribution of Recombination Rate

•  Using the original dataset
•  After permuting the genotype loci

Page 28: Lecture 2: Population Structure

Distinguishing Between Two Closely Related Populations


Page 29: Lecture 2: Population Structure

Three Sources of Linkage Disequilibrium

•  Mixture LD
   –  Due to variation in ancestry across individuals, which induces correlation among markers at different loci
   –  Modeled by the admixture model
•  Admixture LD
   –  Due to unbroken chunks of DNA derived from an ancestral population
   –  Modeled by the linkage model
•  Background LD
   –  Due to LD within populations
   –  Decays at a smaller scale

Page 30: Lecture 2: Population Structure

Low-dimensional Projections

•  Genetic data is very large
   –  Number of markers may range from a few hundred to hundreds of thousands
   –  Thus each individual is described by a high-dimensional vector of marker configurations
   –  A low-dimensional projection allows easy visualization
•  Technique used
   –  Factor analysis
   –  Many statistical methods exist: ICA, PCA, NMF, etc.
   –  Principal Components Analysis (next slide)
      •  Allows projection of individuals into a low-dimensional space
      •  Usually projected to 2 dimensions to allow visualization

Page 31: Lecture 2: Population Structure

Principal Component Analysis

•  Most common form of factor analysis
•  The new variables/dimensions ...
   –  Are linear combinations of the original ones
   –  Are uncorrelated with one another
      •  Orthogonal in original dimension space
   –  Capture as much of the original variance in the data as possible
   –  Are called principal components
•  Demo at http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html

Page 32: Lecture 2: Population Structure

What are the new axes?

[Figure: scatter plot with axes Original Variable A and Original Variable B, overlaid with the directions PC 1 and PC 2]

•  Orthogonal directions of greatest variance in data
•  Projections along PC 1 discriminate the data most along any one axis

Page 33: Lecture 2: Population Structure

Principal Components

•  First principal component is the direction of greatest variability (covariance) in the data
•  Second is the next orthogonal (uncorrelated) direction of greatest variability
   –  So first remove all the variability along the first component, and then find the next direction of greatest variability
•  And so on ...

Page 34: Lecture 2: Population Structure

Dimensionality Reduction

Can ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don’t lose much

–  n dimensions in original data
–  calculate n eigenvectors and eigenvalues
–  choose only the first p eigenvectors, based on their eigenvalues
–  final data set has only p dimensions
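
The four steps above can be sketched with NumPy. This is a minimal sketch, assuming genotypes coded as 0/1/2 allele counts and an eigendecomposition of the marker covariance matrix; the function name `pca_project` and the simulated two-group data are illustrative, not the lecture's data.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)                      # center each marker
    cov = np.cov(Xc, rowvar=False)               # n x n covariance of the markers
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :n_components]                # keep the first p eigenvectors
    return Xc @ W, eigvals

# Toy data: two groups of 20 individuals, 50 biallelic markers
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=50)
freq_b = rng.uniform(0.1, 0.9, size=50)
G = np.vstack([rng.binomial(2, freq_a, size=(20, 50)),
               rng.binomial(2, freq_b, size=(20, 50))]).astype(float)

scores, eigvals = pca_project(G, 2)              # 40 individuals in 2 dimensions
```

Plotting the two columns of `scores` against each other gives the usual 2-D view of the individuals. The variance of each column equals the corresponding eigenvalue, which is why small eigenvalues mean little information is lost.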

Page 35: Lecture 2: Population Structure

PCA Analysis (Cavalli-Sforza, 1978)

•  Plot of geographical distribution of 3 PCs (intensity proportional to value of each component)
   –  First: blue
   –  Second: green
   –  Third: red

Page 36: Lecture 2: Population Structure

Matrix Factorization and Population Structure

•  Matrix factorization for learning population structure

Genotype data (N×P matrix) = Individuals’ ancestry proportions (N×K matrix) × Subpopulation allele frequencies (K×P matrix)

N: number of samples
P: number of genotyped markers
K: number of subpopulations
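
The shapes in this factorization can be made concrete with a toy simulation. This is a sketch under stated assumptions: biallelic markers, genotypes coded as 0/1/2 allele counts, and each genotype drawn as Binomial(2, individual-specific frequency), so the product of the two factors gives expected allele frequencies rather than the genotype matrix itself. All variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, K = 6, 8, 2                                   # toy sizes

# N x K: ancestry proportions, each row sums to 1
ancestry = rng.dirichlet(np.ones(K), size=N)
# K x P: frequency of the counted allele in each subpopulation
allele_freqs = rng.uniform(0.05, 0.95, size=(K, P))

# N x P: individual-specific allele frequencies (the matrix product on the slide)
individual_freqs = ancestry @ allele_freqs
# N x P: observed genotypes as allele counts in {0, 1, 2}
genotypes = rng.binomial(2, individual_freqs)
```

Inference runs in the opposite direction: only `genotypes` is observed, and the two factors are estimated from it.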

Page 37: Lecture 2: Population Structure

Unifying Framework of Matrix Factorization

•  Admixture
   –  Based on probability models: rows of Λ and columns of F should sum to 1.
   –  Works well if the individuals are admixtures of discretely separated populations
•  PCA
   –  Based on eigendecomposition: columns of Λ are orthogonal, rows of F are orthonormal.
   –  Works well for the case of isolation-by-distance (continuous variation of populations among individuals)
•  Sparse factor model
   –  Sparsity via automatic relevance determination prior
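
The PCA constraints can be verified numerically. The sketch below assumes PCA is computed through a singular value decomposition of the centered genotype matrix, with Λ taken as the scaled left singular vectors and F as the right singular vectors; the function name `pca_factorize` and the random genotypes are illustrative.

```python
import numpy as np

def pca_factorize(G, K):
    """Rank-K factorization of the centered genotype matrix: Gc ~ loadings @ factors."""
    Gc = G - G.mean(axis=0)
    U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
    loadings = U[:, :K] * s[:K]        # N x K (Lambda): orthogonal columns
    factors = Vt[:K, :]                # K x P (F): orthonormal rows
    return loadings, factors, Gc

rng = np.random.default_rng(2)
G = rng.binomial(2, 0.3, size=(10, 15)).astype(float)   # toy genotypes

loadings, factors, Gc = pca_factorize(G, 3)
gram = loadings.T @ loadings           # diagonal if the columns are orthogonal
full_loadings, full_factors, _ = pca_factorize(G, 10)
```

Keeping all components reproduces the centered matrix exactly, while keeping only the first K gives its rank-K approximation.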

Page 38: Lecture 2: Population Structure

Discrete/Admixed Populations

[Figure: Loading 1, Loading 2, and Loading 3 as estimated by SFA, PCA, and Admixture]

Page 39: Lecture 2: Population Structure

Isolation-by-Distance Models


Page 40: Lecture 2: Population Structure

Clustered Populations in 1D Habitat

[Figure: results for SFA, Admixture, and PCA; the panel labels "Assume two populations" and "Assume five populations" each appear twice]

Page 41: Lecture 2: Population Structure

Analysis of European Genotype Data

[Figure panels: PCA, SFAm, Admixture]

Page 42: Lecture 2: Population Structure

Comparison of Different Methods

PCA
•  Advantages
   –  Statistical tests for significance of results (Patterson et al. 2006)
   –  Easy visualization
•  Disadvantages
   –  No intuition about underlying processes

Model-based clustering
•  Advantages
   –  Generative process that explicitly models admixture
   –  Clustering is probabilistic: it is possible to assign a confidence level to clusters
•  Disadvantages
   –  Computationally more demanding
   –  Based on assumptions of evolutionary models
      •  Structure: no models of mutation or recombination
      •  Mutation added in mStruct
      •  Recombination added in the extension by Falush et al.