Lecture 2: Population Structuresssykim/teaching/s13/slides/Lecture2.pdfPopulation Structure from...


Lecture 2: Population Structure

02-715 Advanced Topics in Computational Genomics


What is population structure?

• Population structure
  – A set of individuals characterized by some measure of genetic distinction
  – A "population" is usually characterized by a distinct distribution over genotypes
  – Example genotypes: aa, aA, AA

[Figure: example genotype distributions for Population 1 and Population 2]


Motivation

• Reconstructing individual ancestry: The Genographic Project
  – https://genographic.nationalgeographic.com/genographic/index.html
• Studying human migration
  – Out of Africa
  – Multi-regional hypothesis
• Study of various traits
  – Lactose intolerance
  – Origins in Europe?
  – Infer from
     • Migration studies
     • Mutation studies in populations


[Figure: map of human migration, labeled 200,000, 50,000, 30,000, and 10,000 years ago. Source: https://genographic.nationalgeographic.com/genographic/index.html]


Overview

• Background
  – Hardy-Weinberg Equilibrium
  – Genetic drift
  – Wright's FST
• Inferring population structure from genotype data
  – Structure (Falush et al., 2003)
  – Matrix factorization/dimensionality reduction methods (Engelhardt & Stephens, 2010)


Hardy-Weinberg Equilibrium

• Hardy-Weinberg Equilibrium
  – Under random mating, both allele and genotype frequencies in a population remain constant over generations.
  – Assumptions of the standard random mating model
     • Diploid organism
     • Sexual reproduction
     • Nonoverlapping generations
     • Random mating
     • Large population size
     • Equal allele frequencies in the sexes
     • No migration/mutation/selection
  – Chi-square test for Hardy-Weinberg equilibrium


Hardy-Weinberg Equilibrium

• p, q: allele frequencies of A and a
• D, H, R: genotype frequencies for AA, Aa, aa, respectively.
  – $D = p^2$
  – $H = 2pq$
  – $R = q^2$


Hardy-Weinberg Equilibrium

• The genotype and allele frequencies of the offspring

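The offspring frequencies on the preceding slide can be worked out in a few lines. This is a sketch under the stated assumptions (random mating, parental genotype frequencies D, H, R, so that p = D + H/2 and q = R + H/2); the primed symbols D', H', R', p' for the offspring generation are notation introduced here, not taken from the slides.

```latex
\begin{aligned}
D' &= \left(D + \tfrac{H}{2}\right)^2 = p^2, \qquad
H' = 2\left(D + \tfrac{H}{2}\right)\left(R + \tfrac{H}{2}\right) = 2pq, \qquad
R' = \left(R + \tfrac{H}{2}\right)^2 = q^2, \\
p' &= D' + \tfrac{H'}{2} = p^2 + pq = p\,(p + q) = p .
\end{aligned}
```

The allele frequency is unchanged after one round of random mating, and the genotype frequencies reach $p^2$, $2pq$, $q^2$ and stay there.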

Testing Whether Hardy-Weinberg Equilibrium Holds

• Chi-square test
  – Null hypothesis: HWE holds in the observed data
  – Test if the null hypothesis is violated in the data by comparing the observed genotype frequencies (in the parent generation) with the expected frequencies (in the offspring generation)


Testing Whether Hardy-Weinberg Equilibrium Holds

Genotype    AA      Aa      aa      Total
Observed    224     64      6       294
Expected    ?       ?       ?       294


Testing Whether Hardy-Weinberg Equilibrium Holds

Genotype    AA      Aa      aa      Total
Observed    224     64      6       294
Expected    222.9   66.2    4.9     294

Step 1: Compute allele frequencies from the observed data

$$p = \frac{224 \times 2 + 64}{294 \times 2} = 0.871, \qquad q = 1 - p = 0.129$$

Step 2: Compute the expected genotype frequencies

$$\mathrm{Expected}(AA) = p^2 n = 0.8707^2 \times 294 = 222.9$$

Step 3: Compute the test statistic

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(224 - 222.9)^2}{222.9} + \frac{(64 - 66.2)^2}{66.2} + \frac{(6 - 4.9)^2}{4.9} = 0.32$$

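The three steps of the chi-square test can be sketched in Python as below. The function name `hwe_chi_square` is hypothetical. The p-value uses one degree of freedom (three genotype classes, minus one, minus one estimated allele frequency), which is the usual convention but is not stated on the slides.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test of Hardy-Weinberg equilibrium at one biallelic locus."""
    n = n_AA + n_Aa + n_aa
    # Step 1: allele frequencies from the observed genotype counts
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    # Step 2: expected genotype counts under HWE
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_AA, n_Aa, n_aa]
    # Step 3: test statistic
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Assumed df = 1; for one degree of freedom the chi-square upper tail
    # equals erfc(sqrt(x / 2)), so no external library is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, expected, chi2, p_value

p, expected, chi2, p_value = hwe_chi_square(224, 64, 6)
print(round(p, 3), [round(e, 1) for e in expected], round(chi2, 2))
# prints: 0.871 [222.9, 66.2, 4.9] 0.32
```

With these counts the p-value is well above common significance thresholds, so the data give no evidence against HWE.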

Genetic Drift

• The change in allele frequencies in a population due to random sampling
• Neutral process unlike natural selection
  – But genetic drift can eliminate an allele from the given population.
• The effect of genetic drift is larger in a small population

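A small simulation can illustrate why drift is stronger in small populations. The sketch below assumes a Wright-Fisher style model (each generation resamples 2N allele copies from the current frequency). That model is not named on the slides, and `wright_fisher` and its parameters are hypothetical names.

```python
import random

def wright_fisher(p0, pop_size, generations, rng):
    """Simulate allele-frequency drift in a population of `pop_size` diploids."""
    n_copies = 2 * pop_size
    p = p0
    freqs = [p]
    for _ in range(generations):
        # Each of the 2N allele copies in the next generation is drawn
        # independently with probability p (binomial sampling, no selection).
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        freqs.append(p)
    return freqs

rng = random.Random(0)
small = wright_fisher(0.5, pop_size=10, generations=100, rng=rng)
large = wright_fisher(0.5, pop_size=1000, generations=100, rng=rng)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

Once the frequency reaches 0 or 1 it stays there, which is how drift eliminates an allele; across repeated runs this tends to happen far sooner with `pop_size=10` than with `pop_size=1000`.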

Population Divergence

• Wright's FST
  – Statistic used to quantify the extent of divergence among multiple populations relative to the overall genetic diversity
  – Summarizes the average deviation of a collection of populations away from the mean
  – $F_{ST} = \mathrm{Var}(p_k) / \left(p'(1 - p')\right)$
     • $p'$: the overall frequency of an allele across all subpopulations
     • $p_k$: the allele frequency within population k

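The FST formula can be evaluated directly for a single allele. The sketch below assumes equally sized subpopulations, so that p' is the plain mean of the p_k and Var(p_k) is the population variance; the slides do not specify the weighting, and the function name `fst` is hypothetical.

```python
def fst(subpop_freqs):
    """Wright's F_ST for one allele, given its frequency in each subpopulation.

    Assumes equally sized subpopulations: p' is the unweighted mean of the
    p_k, and Var(p_k) is the (population) variance around that mean.
    """
    k = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / k
    if p_bar in (0.0, 1.0):
        return 0.0  # allele absent or fixed everywhere: no diversity to partition
    var = sum((p - p_bar) ** 2 for p in subpop_freqs) / k
    return var / (p_bar * (1.0 - p_bar))

print(fst([0.5, 0.5]))  # identical subpopulations
print(fst([0.2, 0.8]))  # moderately diverged subpopulations
print(fst([0.0, 1.0]))  # subpopulations fixed for different alleles
```

The three calls cover the range of the statistic: no divergence gives 0, and subpopulations fixed for different alleles give 1.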

Scenarios of How Populations Evolve


Methods for Learning Population Structure from Genetic Markers

• Low-dimensional projection
  – Matrix-factorization-based methods (Patterson et al., PLoS Genetics 2006)
• Model-based clustering
  – STRUCTURE (Pritchard et al., Genetics 2000)


Low-dimensional Projections

• Genetic data is very large
  – Number of markers may range from a few hundred to hundreds of thousands
  – Thus each individual is described by a high-dimensional vector of marker configurations
  – A low-dimensional projection allows easy visualization
• Allows projection of individuals into a low-dimensional space
• Usually projected to 2 dimensions to allow visualization


Matrix Factorization and Population Structure

• Matrix factorization for learning population structure

Genotype data (N×P matrix) = Individuals' ancestry proportions (N×K matrix) × Subpopulation allele frequencies (K×P matrix)

N: number of samples
P: number of genotypes
K: number of subpopulations

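The factorization can be illustrated with a toy product of an N×K ancestry matrix and a K×P allele-frequency matrix. All numbers and names below are hypothetical. The product is read here as each individual's expected allele frequency at each marker, which is an assumption about scaling (genotypes coded 0/1/2 would have twice this expectation).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical toy values: N = 3 individuals, K = 2 subpopulations, P = 4 markers.
ancestry = [        # N x K, each row sums to 1
    [1.0, 0.0],     # individual entirely from subpopulation 1
    [0.0, 1.0],     # individual entirely from subpopulation 2
    [0.5, 0.5],     # evenly admixed individual
]
allele_freqs = [    # K x P, frequency of one allele at each marker
    [0.9, 0.1, 0.5, 0.8],
    [0.2, 0.7, 0.5, 0.1],
]

# N x P matrix: expected allele frequency of each individual at each marker.
expected = matmul(ancestry, allele_freqs)
for row in expected:
    print([round(x, 2) for x in row])
```

Learning population structure runs this in reverse: given only the N×P data, estimate the two factors under constraints that differ between PCA, admixture, and sparse factor models.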

Unifying Framework of Matrix Factorization

• PCA
  – Based on eigendecomposition: columns of Λ are orthogonal, rows of F are orthonormal.
  – Works well for the case of isolation-by-distance (continuous variation of populations among individuals)
• Admixture
  – Based on probability models: rows of Λ and columns of F should sum to 1.
  – Works well if the individuals are admixtures of discretely separated populations
• Sparse factor model
  – Sparsity via automatic relevance determination prior


Principal Component Analysis

• Most common form of factor analysis
• The new variables/dimensions ...
  – Are linear combinations of the original ones
  – Are uncorrelated with one another
     • Orthogonal in original dimension space
  – Capture as much of the original variance in the data as possible
  – Are called Principal Components


What are the new axes?

[Figure: scatter plot on axes "Original Variable A" and "Original Variable B", with the directions PC 1 and PC 2 overlaid]

• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis


Principal Components

• First principal component is the direction of greatest variability (covariance) in the data
• Second is the next orthogonal (uncorrelated) direction of greatest variability
  – So first remove all the variability along the first component, and then find the next direction of greatest variability
• And so on …


Dimensionality Reduction

Can ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don't lose much
  – n dimensions in original data
  – calculate n eigenvectors and eigenvalues
  – choose only the first p eigenvectors, based on their eigenvalues
  – final data set has only p dimensions

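The procedure described on the PCA slides (find the direction of greatest variance, remove it, repeat, keep the top components) can be sketched without external libraries. The version below uses power iteration with deflation on the sample covariance matrix. That is one simple way to obtain eigenvectors, chosen here for self-containment, and is not necessarily how the cited methods compute them. All function names are hypothetical.

```python
import math

def covariance_matrix(data):
    """Sample covariance of `data` (list of rows, one variable per column)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered

def top_eigenvector(matrix, iterations=200):
    """Leading eigenvector/eigenvalue of a symmetric PSD matrix (power iteration)."""
    d = len(matrix)
    # Arbitrary non-uniform start; it must not be orthogonal to the target vector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(matrix[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return v, eigenvalue

def pca(data, n_components):
    """Return (components, eigenvalues, projected data) for the top components."""
    cov, centered = covariance_matrix(data)
    d = len(cov)
    components, eigenvalues = [], []
    for _ in range(n_components):
        v, lam = top_eigenvector(cov)
        components.append(v)
        eigenvalues.append(lam)
        # Deflation: remove the variability along v before finding the next one.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return components, eigenvalues, projected

# Hypothetical 2-D data lying close to the line y = x.
data = [[2.0, 1.9], [1.0, 1.1], [0.0, 0.1], [-1.0, -0.9], [-2.0, -2.1]]
components, eigenvalues, projected = pca(data, n_components=2)
print("PC1 direction:", [round(x, 3) for x in components[0]])
print("eigenvalues:", [round(x, 4) for x in eigenvalues])
```

For this data nearly all the variance sits on the first component, so dropping the second one (keeping p = 1 of n = 2 dimensions) loses very little information.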

PCA Analysis (Cavalli-Sforza, 1978)

• Plot of geographical distribution of 3 PCs (intensity proportional to value of each component)
  – First: blue
  – Second: green
  – Third: red


Discrete/Admixed Populations

[Figure: loadings (populations) 1, 2, and 3 as estimated by SFA, PCA, and Admixture]


Analysis of European Genotype Data

[Figure: results on European genotype data, with panels PCA, SFAm, and Admixture]


Probabilistic Models for Population Structure

• Mixture model
  – Cluster individuals into K populations
• Admixture model
  – The genotypes of each individual are an admixture of multiple ancestor populations
  – Assumes alleles are in linkage equilibrium
• Linkage model
  – Model recombination, correlation in alleles across chromosome


• Organizing data into clusters such that there is
  – high intra-cluster similarity
  – low inter-cluster similarity
• Informally, finding natural groupings among objects.


[Figure: data points on axes from 0 to 5 with three cluster centers k1, k2, k3]

• For a pre-defined number of clusters K, initialize K centers randomly


[Figure: data points on axes from 0 to 5 with three cluster centers k1, k2, k3]

• Iterate between the following two steps
  – Assign all objects to the nearest center.
  – Move a center to the mean of its members.


[Figure: data points on axes from 0 to 5 with three cluster centers k1, k2, k3]

• After moving centers, re-assign the objects…


[Figure: data points on axes from 0 to 5 with three cluster centers k1, k2, k3]

• After moving centers, re-assign the objects to nearest centers.
• Move a center to the mean of its new members.


[Figure: final positions of the cluster centers k1, k2, k3]

• Re-assign and move centers until no object changes membership.

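The clustering procedure walked through on the preceding slides (assign to nearest center, move centers to the mean, stop when memberships no longer change) can be sketched as follows. The slides initialize centers randomly; this sketch takes the initial centers as an argument so the run is reproducible, which is a choice made here. The data and the name `kmeans` are hypothetical.

```python
def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means on 2-D points, starting from the given initial centers."""
    centers = list(initial_centers)
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign every object to its nearest center (squared distance).
        new_assignments = [
            min(range(len(centers)),
                key=lambda k: (x - centers[k][0]) ** 2 + (y - centers[k][1]) ** 2)
            for x, y in points
        ]
        # Stop when no object changed membership.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return centers, assignments

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (4.0, 4.5), (4.5, 4.0), (5.0, 5.0),
          (1.0, 4.5), (1.5, 5.0)]
centers, labels = kmeans(points, [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)])
print(labels)
print(centers)
```

Each object ends up in exactly one cluster (a hard assignment), in contrast to the soft clustering on the next slide.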

Soft-Clustering of Individuals into Three Clusters with Gaussian Mixture Model

Probability of    Cluster 1   Cluster 2   Cluster 3   Sum
Individual 1      0.1         0.4         0.5         1
Individual 2      0.8         0.1         0.1         1
Individual 3      0.7         0.2         0.1         1
Individual 4      0.10        0.05        0.85        1
Individual 5      …           …           …           1
Individual 6      …           …           …           1
Individual 7      …           …           …           1
Individual 8      …           …           …           1
Individual 9      …           …           …           1
Individual 10     …           …           …           1

• Each individual can be assigned to more than one cluster with a certain probability.
• For each individual, the probabilities for all clusters should sum to 1 (i.e., each row should sum to 1).
• Each cluster is explained by a cluster center variable (i.e., cluster mean)

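The soft assignments in the table are posterior cluster probabilities. A minimal sketch of how one row is computed is shown below, assuming a one-dimensional Gaussian mixture whose mixing weights, means, and standard deviations are already known. All parameter values are hypothetical, and fitting them (for example by EM) is not shown.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """Posterior probability of each cluster for a single 1-D observation."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical fitted parameters for three clusters.
weights = [0.3, 0.3, 0.4]
means = [0.0, 5.0, 10.0]
stds = [1.0, 1.0, 1.0]

observations = [0.2, 4.0, 7.5, 9.8]
rows = [responsibilities(x, weights, means, stds) for x in observations]
for x, row in zip(observations, rows):
    print(x, [round(r, 3) for r in row], "sum =", round(sum(row), 3))
```

Each printed row plays the role of one line of the table: one probability per cluster, summing to 1.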

Mixture Model

• The goal is to discover K clusters for K populations from the N×J genotype matrix (N: # of samples, J: # of loci), whose entries are $x_{i,n}$ in the plate diagram
• Assume K populations (clusters)
• θ = Distribution over populations
  – Mixing proportions in mixture model
• β = Distribution over alleles at each locus in each population
  – Mixture component model in mixture model
• To generate an individual's genome
  – All individuals share the same θ
  – Sample $z_i$ from Multinomial(θ)
  – For each locus
     • Sample $x_{i,n}$ from β corresponding to the population chosen by $z_i$

[Figure: plate diagram with nodes α, θ, $z_i$, $x_{i,n}$, $\beta_k^i$, λ and plate labels 1…I, i = 1…J, n = 1…N, k = 1…K]

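The generative process of the mixture model can be sketched as a sampler. The sketch assumes biallelic loci, so β reduces to one allele frequency per population and locus. That simplification, the toy parameter values, and the name `sample_mixture_individual` are all hypothetical.

```python
import random

def sample_mixture_individual(theta, beta, rng):
    """Draw one individual's alleles under the mixture model.

    theta[k]   : probability that an individual comes from population k
    beta[k][j] : frequency of allele 1 at locus j in population k
                 (biallelic loci are an assumption of this sketch)
    """
    populations = list(range(len(theta)))
    # One population label per individual, shared by all of its loci.
    z = rng.choices(populations, weights=theta)[0]
    x = [1 if rng.random() < beta[z][j] else 0 for j in range(len(beta[z]))]
    return z, x

rng = random.Random(0)
theta = [0.6, 0.4]                      # shared by all individuals
beta = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]
for _ in range(3):
    print(sample_mixture_individual(theta, beta, rng))
```

Because z is drawn once per individual, every locus of that individual is generated from the same population. The admixture model relaxes this.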

Admixture Model

• Relax the assumption of one population per individual in the mixture model
• An individual can be assigned to different populations at different loci


The Admixture Model

• β = Distribution over alleles
  – One per population-locus pair
• To generate an individual's genome
  – Sample $\theta_n$ from Dirichlet(α)
  – For each locus
     • Sample $z_{i,n}$ from Multinomial($\theta_n$)
     • Sample $x_{i,n}$ from β corresponding to the population chosen by $z_{i,n}$

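The admixture generative process can be sketched in the same style. As before, biallelic loci are assumed so that β is one allele frequency per population-locus pair. The Dirichlet draw is obtained by normalizing independent Gamma draws, a standard construction. Names and parameter values are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_admixed_individual(alpha, beta, rng):
    """Draw one individual's ancestry proportions, per-locus labels, and alleles.

    beta[k][j] is the frequency of allele 1 at locus j in population k
    (biallelic loci are an assumption of this sketch).
    """
    theta_n = sample_dirichlet(alpha, rng)      # individual-specific proportions
    populations = list(range(len(alpha)))
    z, x = [], []
    for j in range(len(beta[0])):
        z_j = rng.choices(populations, weights=theta_n)[0]  # label for this locus
        z.append(z_j)
        x.append(1 if rng.random() < beta[z_j][j] else 0)
    return theta_n, z, x

rng = random.Random(0)
beta = [[0.9, 0.8, 0.1, 0.7], [0.1, 0.2, 0.9, 0.3]]
theta_n, z, x = sample_admixed_individual([1.0, 1.0], beta, rng)
print([round(t, 3) for t in theta_n], z, x)
```

Unlike the mixture model, the population label is redrawn at every locus, so a single genome can mix contributions from several populations in proportions θ_n.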

Structure Model

• Hypothesis: Modern populations are created by an intermixing of ancestral populations.
• An individual's genome contains contributions from one or more ancestral populations.
• The contributions of populations can be different for different individuals.
• Other assumptions
  – Hardy-Weinberg equilibrium
  – No linkage disequilibrium
  – Markers are i.i.d. (independent and identically distributed)


Linkage Model

• From the admixture model, replace the assumption that the ancestry labels $z_{il}$ for individual i, locus l are independent with the assumption that adjacent $z_{il}$ are correlated.
• Use a Poisson process to model the correlation between neighboring alleles
  – $d_l$: distance between locus l and locus l+1
  – r: recombination rate

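For reference, the Poisson-process assumption leads to a Markov chain over adjacent ancestry labels. The transition probability below follows the form given in Falush et al. (2003), as best recalled here. Writing the ancestry proportion of individual i in population k as $\theta_{i,k}$ is an assumption made to match the admixture-model notation of the earlier slides.

```latex
\Pr\left(z_{i,l+1} = k' \mid z_{i,l} = k\right) =
\begin{cases}
e^{-d_l r} + \left(1 - e^{-d_l r}\right)\theta_{i,k'} & \text{if } k' = k, \\[4pt]
\left(1 - e^{-d_l r}\right)\theta_{i,k'} & \text{otherwise.}
\end{cases}
```

Here $e^{-d_l r}$ is the probability that no breakpoint falls between loci l and l+1. When r is large this term vanishes and each label is drawn from $\theta_i$ independently of its neighbor.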

Linkage Model

• As the recombination rate r goes to infinity, all loci become independent and the linkage model becomes the admixture model.
• Recombination rate r can be viewed as being related to the number of generations since admixture occurred.
• Use an MCMC algorithm to fit the unknown parameters.


Population Structure from Ancestry Proportion of Each Individual

• How to display population structure?

[Figure: ancestral proportion of each individual, grouped by region: Africa, Europe, Mid-East, Cent./S. Asia, East Asia, Oceania. From "Genetic Structure of Human Populations" (Rosenberg et al., Science 2002)]


Population of Origin Assignments of a Single Individual

[Figure: three panels showing True origin, Estimated origin (unphased data), and Estimated origin (phased data)]


Comparison of Different Methods

PCA
  Advantages
    • Statistical tests for significance of results (Patterson et al. 2006)
    • Easy visualization
  Disadvantages
    • No intuition about underlying processes

Model-based Clustering
  Advantages
    • Generative process that explicitly models admixture
    • Clustering is probabilistic: it is possible to assign a confidence level to clusters
  Disadvantages
    • Computationally more demanding
    • Based on assumptions of evolutionary models:
       – Structure: No models of mutation, recombination
       – Recombination added in extension by Falush et al.
