Teaching Population Genetics with R

A Simulation-Based Approach to Teaching Population Genetics: R as a Teaching Platform. Bruce J. Cochrane, Department of Zoology/Biology, Miami University, Oxford, OH


Presented at Evolution 2013, June 24. Describes an approach to teaching population genetics at the upper undergraduate/beginning graduate level, using simulations based in R and incorporating available large genomic data sets.

Transcript of Teaching Population Genetics with R

Page 1: Teaching Population Genetics with R

A Simulation-Based Approach to Teaching Population Genetics:

R as a Teaching Platform

Bruce J. Cochrane

Department of Zoology/Biology

Miami University

Oxford OH

Page 2: Teaching Population Genetics with R

Two Time Points

• 1974
  o Lots of Theory
  o Not much Data
  o Allozymes Rule

• 2013
  o Even More Theory
  o Lots of Data
  o Sequences, -omics, ???

Page 3: Teaching Population Genetics with R

The Problem

• The basic approach hasn’t changed, e.g.
  o Hardy-Weinberg
  o Mutation
  o Selection
  o Drift
  o Etc.

• Much of it is deterministic

Page 4: Teaching Population Genetics with R

And

• There is little initial connection with real data
  o The world seems to revolve around A and a

• At least in my hands, it doesn’t work

Page 5: Teaching Population Genetics with R

The Alternative

• Take a numerical (as opposed to analytical) approach
• Focus on understanding random variables and distributions
• Incorporate “big data”
• Introduce current approaches (coalescence, Bayesian analysis, etc.) in this context

Page 6: Teaching Population Genetics with R

Why R?

• Open Source
• Platform-independent (Windows, Mac, Linux)
• Object oriented
• Facile Graphics
• Web-oriented
• Packages available for specialized functions

Page 7: Teaching Population Genetics with R

Where We are Going

• The Basics: distributions, chi-square and the Hardy-Weinberg equilibrium
• Simulating the Ewens-Watterson Distribution
• Coalescence and summary statistics
• What works and what doesn’t

Page 8: Teaching Population Genetics with R

The RStudio Interface

Page 9: Teaching Population Genetics with R

The Normal Distribution

dat.norm <- rnorm(1000)
hist(dat.norm, freq=FALSE, ylim=c(0,.5))
curve(dnorm(x,0,1), add=TRUE, col="red")
mean(dat.norm)
var(dat.norm)

> mean(dat.norm)
[1] 0.003546691
> var(dat.norm)
[1] 1.020076

Page 10: Teaching Population Genetics with R

Sample Size and Cutoff Values

n <- c(10,30,100,1000)
res <- sapply(n, ndist)
colnames(res) = n
res

> res
             10        30       100      1000
2.5%  -1.110054 -1.599227 -1.713401 -1.981675
97.5%  2.043314  1.679208  1.729095  1.928852
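The ndist() function called on this slide is a course-specific helper that is not shown in the transcript. A minimal sketch of what it might do, assuming it draws n standard normal values and returns the empirical 2.5% and 97.5% quantiles (the name ndist_sketch is hypothetical, not the author's code):

```r
# Hypothetical stand-in for the course function ndist():
# draw n standard normal values and return the empirical cutoffs
ndist_sketch <- function(n) {
  quantile(rnorm(n), c(.025, .975))
}

set.seed(2013)                 # arbitrary seed, only for reproducibility
n <- c(10, 30, 100, 1000)
res <- sapply(n, ndist_sketch) # one column of cutoffs per sample size
colnames(res) <- n
res
```

As the sample size grows, the two rows should approach the theoretical cutoffs given by qnorm(c(.025, .975)), roughly -1.96 and 1.96.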

Page 11: Teaching Population Genetics with R

What is chi-square All About?

xsq <- rchisq(10000,1)
hist(xsq, main="Chi Square Distribution, N=10000, 1 d.f.", xlab="Value")
p05 <- quantile(xsq,.95)
abline(v=p05, col="red")
p05

     95%
3.867886

Page 12: Teaching Population Genetics with R

Simple Generation of Critical Values

d <- 1:10
chicrit <- qchisq(.95,d)
chitab <- cbind(d,chicrit)
chitab

       d   chicrit
 [1,]  1  3.841459
 [2,]  2  5.991465
 [3,]  3  7.814728
 [4,]  4  9.487729
 [5,]  5 11.070498
 [6,]  6 12.591587
 [7,]  7 14.067140
 [8,]  8 15.507313
 [9,]  9 16.918978
[10,] 10 18.307038

Page 13: Teaching Population Genetics with R

Calculating chi-squared

The function

chixw <- function(obs, exp, df=1){
  chi <- sum((obs-exp)^2/exp)   # chi-square statistic
  pr <- 1-pchisq(chi, df)       # upper-tail probability
  c(chi, pr)
}

A sample function call

obs <- c(315,108,101,32)
z <- sum(obs)/16
exp <- c(9*z,3*z,3*z,z)
chixw(obs,exp,3)

The output

chi-square = 0.47
probability(<.05) = 0.93
deg. freedom = 3

Page 14: Teaching Population Genetics with R

Basic Hardy Weinberg Calculations

The Biallelic Case

Sample input

obs <- c(13,35,70)
hw(obs)

Output

[1] "p= 0.2585 q= 0.7415"
     obs exp
[1,]  13   8
[2,]  35  45
[3,]  70  65
[1] "chi squared = 5.732 p = 0.017 with 1 d. f."
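The hw() function is a course-specific helper that the slides do not show. The sketch below is one possible reconstruction for the biallelic case; the name hw_sketch and the round.exp argument are hypothetical. The chi-square value of 5.732 in the output above is reproduced when the expected counts are rounded to whole numbers (8, 45, 65); with unrounded expected counts the statistic comes out somewhat larger.

```r
# Hypothetical stand-in for the course function hw().
# obs: genotype counts in the order AA, Aa, aa
hw_sketch <- function(obs, round.exp = FALSE) {
  n <- sum(obs)
  p <- (2 * obs[1] + obs[2]) / (2 * n)    # frequency of allele A
  q <- 1 - p
  expd <- n * c(p^2, 2 * p * q, q^2)      # Hardy-Weinberg expected counts
  if (round.exp) expd <- round(expd)      # assumption: the slide output rounds these
  chi <- sum((obs - expd)^2 / expd)
  pr <- pchisq(chi, df = 1, lower.tail = FALSE)
  list(p = p, q = q, expected = expd, chi.sq = chi, p.value = pr)
}

hw.res <- hw_sketch(c(13, 35, 70), round.exp = TRUE)
hw.res
```

One degree of freedom is used because three genotype classes are fitted with one estimated allele frequency.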

Page 15: Teaching Population Genetics with R

Illustrating With Ternary Plots

library(HardyWeinberg)
dat <- (HWData(100,100))
gdist <- dat$Xt #create a variable with the working data
HWTernaryPlot(gdist, hwcurve=TRUE, addmarkers=FALSE, region=0, vbounds=FALSE, axis=2, vertexlab=c("0","","1"), main="Theoretical Relationship", cex.main=1.5)

Page 16: Teaching Population Genetics with R

Access to Data

• Direct access of data
  o HapMap
  o Dryad
  o Others

• Manipulation and visualization within R
• Preparation for export (e.g. GenAlEx)

Page 17: Teaching Population Genetics with R

Direct Access of HapMap Data

library(chopsticks)
chr21 <- read.HapMap.data("http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/latest_phaseII_ncbi_b36/fwd_strand/non-redundant/genotypes_chr21_YRI_r24_nr.b36_fwd.txt.gz")
chr21.sum <- summary(chr21$snp.data)
head(chr21.sum)

           Calls Call.rate        MAF      P.AA      P.AB       P.BB     z.HWE
rs885550      90 1.0000000 0.09444444 0.8111111 0.1888889 0.00000000 0.9894243
rs1468022     90 1.0000000 0.00000000 0.0000000 0.0000000 1.00000000        NA
rs169758      90 1.0000000 0.31666667 0.4000000 0.5666667 0.03333333 2.9349509
rs150482      89 0.9888889 0.00000000 0.0000000 0.0000000 1.00000000        NA
rs12627229    89 0.9888889 0.00000000 0.0000000 0.0000000 1.00000000        NA
rs9982283     90 1.0000000 0.05555556 0.0000000 0.1111111 0.88888889 0.5580490

Page 18: Teaching Population Genetics with R

Distribution of Hardy Weinberg Deviation on Chromosome 22 Markers

Page 19: Teaching Population Genetics with R

And Determining the Number of Outliers

nsnps <- length(hwdist)
quant <- quantile(hwdist,c(.025,.975))
low <- length(hwdist[hwdist<quant[1]])
high <- length(hwdist[hwdist>quant[2]])
accept <- nsnps-low-high
low; accept; high

[1] 982
[1] 37330
[1] 976

Page 20: Teaching Population Genetics with R

Sampling and Plotting Deviation from Hardy Weinberg

chr21.poly <- na.omit(chr21.sum) #remove all NA's (fixed SNPs)
chr21.samp <- sample(nrow(chr21.poly),1000, replace=FALSE)
plot(chr21.poly$z.HWE[chr21.samp])

Page 21: Teaching Population Genetics with R

Plotting F for Randomly Sampled Markers

chr21.sub <- chr21.poly[chr21.samp,]
Hexp <- 2*chr21.sub$MAF*(1-chr21.sub$MAF)
Fi <- 1-(chr21.sub$P.AB/Hexp)
plot(Fi,xlab="Locus",ylab="F")
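The same calculation can be checked by hand on a few markers without downloading the HapMap file. The sketch below applies F = 1 - Hobs/Hexp to the three polymorphic SNPs shown in the earlier chr21 summary table; the data frame name snps is an illustrative choice, and the values are copied from that table.

```r
# MAF and observed heterozygote frequency (P.AB) copied from the
# chr21 summary table; only the three polymorphic SNPs are used
snps <- data.frame(
  MAF  = c(0.09444444, 0.31666667, 0.05555556),
  P.AB = c(0.1888889, 0.5666667, 0.1111111),
  row.names = c("rs885550", "rs169758", "rs9982283")
)

Hexp <- 2 * snps$MAF * (1 - snps$MAF)   # expected heterozygosity, 2pq
Fi <- 1 - snps$P.AB / Hexp              # F = 1 - Hobs/Hexp
names(Fi) <- rownames(snps)
round(Fi, 3)
```

All three values are negative, which indicates more heterozygotes than Hardy-Weinberg proportions predict at these loci.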

Page 22: Teaching Population Genetics with R

Additional Information

head(chr21$snp.support)

           dbSNPalleles Assignment Chromosome Position Strand
rs885550            C/T        C/T      chr21  9887804      +
rs1468022           C/T        C/T      chr21  9887958      +
rs169758            C/T        C/T      chr21  9928786      +
rs150482            A/G        A/G      chr21  9932218      +
rs12627229          C/T        C/T      chr21  9935312      +
rs9982283           C/T        C/T      chr21  9935844      +

Page 23: Teaching Population Genetics with R

The Ewens-Watterson Test

• Based on Ewens' (1977) derivation of the theoretical equilibrium distribution of allele frequencies under the infinite alleles model
• Uses expected homozygosity (Σp²) as test statistic
• Compares observed homozygosity in sample to expected distribution in n random simulations
• Observed data are
  o N = number of samples
  o k = number of alleles
  o Allele frequency distribution

Page 24: Teaching Population Genetics with R

Classic Data (Keith et al., 1985)

• Xdh in D. pseudoobscura, analyzed by sequential electrophoresis

• 89 samples, 15 distinct alleles

Page 25: Teaching Population Genetics with R

Testing the Data

1. Input the Data

Xdh <- c(52,9,8,4,4,2,2,1,1,1,1,1,1,1,1) # vector of allele numbers
k <- length(Xdh) # number of alleles = k
n <- sum(Xdh) # number of samples = n

2. Calculate Expected Homozygosity

Fx <-fhat(Xdh)

3. Run the Analysis

Ewens(n,k,Fx)
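The course functions fhat() and Ewens() are not shown in the slides. The sketch below illustrates one way the three steps could be carried out in base R; it is not the author's implementation. It assumes that homozygosity is the sum of squared allele frequencies, and it generates the neutral distribution by simulating samples with the Hoppe urn (Chinese restaurant) process and keeping only those with exactly k alleles. All function names are hypothetical.

```r
# Homozygosity as the sum of squared allele frequencies
fhat_sketch <- function(counts) sum((counts / sum(counts))^2)

# One neutral sample of size n under the infinite alleles model
crp_sample <- function(n, theta) {
  counts <- integer(0)
  for (i in seq_len(n)) {
    if (runif(1) < theta / (theta + i - 1)) {
      counts <- c(counts, 1L)                      # a new allele appears
    } else {
      j <- sample.int(length(counts), 1, prob = counts)
      counts[j] <- counts[j] + 1L                  # copy an existing allele
    }
  }
  counts
}

# Homozygosity of nsim neutral samples that have exactly k alleles
ewens_sketch <- function(n, k, nsim = 1000) {
  # choose theta so that the expected number of alleles equals k
  ek <- function(theta) sum(theta / (theta + 0:(n - 1))) - k
  theta <- uniroot(ek, c(1e-6, 1e6))$root
  sims <- numeric(0)
  while (length(sims) < nsim) {
    s <- crp_sample(n, theta)
    if (length(s) == k) sims <- c(sims, fhat_sketch(s))
  }
  sims
}

set.seed(1)
Xdh <- c(52, 9, 8, 4, 4, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1)
Fx <- fhat_sketch(Xdh)
null.F <- ewens_sketch(sum(Xdh), length(Xdh), nsim = 200)
mean(null.F >= Fx)   # share of neutral samples at least as homozygous
```

Rejection sampling is slow but easy to follow: the number of alleles is a sufficient statistic for theta under this model, so the retained samples do not depend on the particular theta used to generate them.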

Page 26: Teaching Population Genetics with R

The Result

Page 27: Teaching Population Genetics with R

With Newer (and more complete) Data

Lactase Haplotypes in European and African Populations

1. Download data for Lactase gene from HapMap (CEU, YRI)
  o 25 SNPs
  o 48,000 KB

2. Determine numbers of haplotypes and frequencies for each
3. Apply Ewens-Watterson test to each.

Page 28: Teaching Population Genetics with R

The Results

par(mfrow=c(2,1))
pops <- c("ceu","yri")
sapply(pops,hapE)

CEU

YRI

Page 29: Teaching Population Genetics with R

Some Basic Statistics from Sequence Data

library(seqinr)
library(pegas)
dat <- read.fasta(file="./Data/FGB.fas")
# additional code (not shown) is needed to rearrange dat into dat.dna
sites <- seg.sites(dat.dna)
nd <- nuc.div(dat.dna)
taj <- tajima.test(dat.dna)
length(sites); nd; taj$D

[1] 23
[1] 0.007561061
[1] -0.7759744

Intron sequences, 433 nucleotides each

from Peters JL, Roberts TE, Winker K, McCracken KG (2012) PLoS ONE 7(2): e31972. doi:10.1371/journal.pone.0031972

Page 30: Teaching Population Genetics with R

Coalescence I – A Bunch of Trees

library(ape) # provides read.tree
trees <- read.tree("http://dl.dropbox.com/u/9752688/ZOO%20422P/R/msfiles/tree.1.txt")
plot(trees[1:9], layout=9)

Page 31: Teaching Population Genetics with R

Coalescence II - MRCA

msout.1.txt <- system("./ms 10 1000 -t .1 -L", intern=TRUE)
ms.1 <- read.ms.output(msout.1.txt)
hist(ms.1$times[,1], main="MRCA, Theta=0.1", xlab="4N")
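For readers without a compiled copy of ms, the distribution of the time to the most recent common ancestor can be approximated in base R. This is a stand-in sketch, not a replacement for ms: it assumes the standard neutral coalescent with constant population size, where the waiting time while k lineages remain is exponential with rate k(k-1) when time is measured in units of 4N generations. The function name sim_tmrca is hypothetical.

```r
# Simulate the time to the MRCA for a sample of n sequences
# (time in units of 4N generations, matching the ms convention)
sim_tmrca <- function(n, nrep = 1000) {
  k <- n:2
  rates <- k * (k - 1)   # coalescence rate while k lineages remain
  replicate(nrep, sum(rexp(length(rates), rate = rates)))
}

set.seed(42)
tmrca <- sim_tmrca(10, 1000)
mean(tmrca)              # theory: 1 - 1/n, i.e. 0.9 for n = 10
hist(tmrca, main = "MRCA, simulated in base R", xlab = "4N")
```

The mutation parameter theta does not appear here because it affects only the mutations placed on the tree, not the coalescence times.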

Page 32: Teaching Population Genetics with R

Coalescence III – Summary Statistics

# 1000 simulations of 50 samples, with the number of segregating sites fixed at 10
system("./ms 50 1000 -s 10 -L | ./sample_stats >samp.ss")
ss.out <- read_ss("samp.ss")
head(ss.out)

         pi  S         D   thetaH         H
1. 1.825306 10 -0.521575 2.419592 -0.594286
2. 2.746939 10  0.658832 2.518367  0.228571
3. 3.837551 10  2.055665 3.631837  0.205714
4. 2.985306 10  0.964128 2.280000  0.705306
5. 1.577959 10 -0.838371 5.728163 -4.150204
6. 2.991020 10  0.971447 3.539592 -0.548571
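The D column can be recomputed from the pi and S columns, which shows students what the statistic measures. The sketch below implements Tajima's (1989) formula in base R; the function name tajima_d_sketch is hypothetical, and pi.hat is the mean number of pairwise differences (not a per-site value). The inputs are the first three rows of the table above.

```r
# Tajima's D from sample size n, segregating sites S and
# mean pairwise differences pi.hat
tajima_d_sketch <- function(n, S, pi.hat) {
  i  <- 1:(n - 1)
  a1 <- sum(1 / i)
  a2 <- sum(1 / i^2)
  b1 <- (n + 1) / (3 * (n - 1))
  b2 <- 2 * (n^2 + n + 3) / (9 * n * (n - 1))
  c1 <- b1 - 1 / a1
  c2 <- b2 - (n + 2) / (a1 * n) + a2 / a1^2
  e1 <- c1 / a1
  e2 <- c2 / (a1^2 + a2)
  # difference between the two estimates of theta, scaled by its standard deviation
  (pi.hat - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))
}

d.check <- tajima_d_sketch(n = 50, S = 10,
                           pi.hat = c(1.825306, 2.746939, 3.837551))
round(d.check, 6)
```

The results should agree with the D values that sample_stats reports for those rows.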

Page 33: Teaching Population Genetics with R

Coalescence IV – Distribution of Summary Statistics

hist(ss.out$D, main="Distribution of Tajima's D (N=1000)", xlab="D")
abline(v=mean(ss.out$D), col="blue")
abline(v=quantile(ss.out$D,c(.025,.975)), col="red")

Page 34: Teaching Population Genetics with R

Other Uses

• Data Manipulation
  o Conversion of HapMap data for use elsewhere (e.g. GenAlEx)
  o Other data sources via APIs (e.g. package rdryad)

• Other Analyses
  o Hierarchical F statistics (hierfstat)
  o Haplotype networking (pegas)
  o Phylogenetics (ape, phyclust, others)
  o Approximate Bayesian Computation (abc)

• Access for students
  o Scripts available via LMS
  o Course-specific functions can be accessed (source("http://db.tt/A6tReYEC"))
  o Notes with embedded code in HTML (RStudio, knitr)

Page 35: Teaching Population Genetics with R

Sample HTML Rendering

Page 36: Teaching Population Genetics with R

Challenges

• Some coding required
• Data structures are a challenge
• Packages are heterogeneous
• Students resist coding

Page 37: Teaching Population Genetics with R

Nevertheless

• Fundamental concepts can be easily visualized graphically
• Real data can be incorporated from the outset
• It takes students from fundamental concepts to real-world applications and analyses

For further information: [email protected] or http://db.tt/A6tReYEC