Genome-wide association mapping
Introduction to theory and methodology
Aaron Lorenz
Department of Agronomy and Horticulture
GWAS – Genome-wide Association Study
• Big subject
• Lots of methods and software packages
• Lots of considerations for handling data
• We have some data to analyze
• 75 minutes
Slide credit: Mike Gore
Goal
Find genes contributing to variation in phenotypes of interest
Approaches to mapping genes
Yu and Buckler, 2006
Germplasm
• Any genetically diverse natural or artificial population can be used
– Examples
• 71 elite European maize inbred lines (Andersen et al., 2005)
• Diverse panel of 288 maize lines (Harjes et al., 2008)
• Diverse panel of 191 Arabidopsis lines (Stock center accessions and individuals sampled from the wild; Atwell et al. 2010)
• 915 dogs from 80 domestic breeds, 83 wild canids, 10 outbred African shelter dogs.
Linkage disequilibrium (LD)
$$D = p_{AB} - p_A p_B$$
$$r^2 = \frac{D^2}{p_A \, p_a \, p_B \, p_b}$$
• r²: common statistic to quantify LD. Normalized value of D.
• The non-random association of alleles between loci.
• Extent of LD over physical distance determines marker density needed.
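The two LD statistics can be computed directly from gamete (haplotype) counts. The following is a minimal Python sketch, not part of the original slides; the function name `ld_stats` and the sample of ten gametes are hypothetical and chosen only for illustration.

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic loci.

    haplotypes: list of two-letter strings such as "AB", "Ab", "aB", "ab",
    where the first letter is the allele at locus 1 and the second at locus 2.
    """
    n = len(haplotypes)
    p_A = sum(h[0] == "A" for h in haplotypes) / n   # frequency of allele A
    p_B = sum(h[1] == "B" for h in haplotypes) / n   # frequency of allele B
    p_AB = sum(h == "AB" for h in haplotypes) / n    # frequency of the AB gamete
    D = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)        # p_A * p_a * p_B * p_b
    r2 = D * D / denom if denom > 0 else float("nan")
    return D, r2


# Hypothetical sample of 10 gametes (illustration only)
gametes = ["AB"] * 4 + ["Ab"] + ["aB"] + ["ab"] * 4
D, r2 = ld_stats(gametes)
print(f"D = {D:.2f}, r2 = {r2:.2f}")  # D = 0.15, r2 = 0.36
```

When all four gametes are equally frequent, the same function returns D = 0 and r² = 0, which is the linkage-equilibrium case.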
LD decay in bi-parental linkage mapping populations
Slide credit: Peter Bradbury
Plots of LD across the Maize d3 Gene (Remington et al., 2001).
Gaut B. S., Long A. D. Plant Cell 2003;15:1502-1506
Copyright © 2003. American Society of Plant Biologists. All rights reserved.
r² above diagonal, D′ below diagonal; axes in base pairs (bp)
Note that LD drops to nearly 0 within 500 base pairs
Extensive LD in barley of the Upper Midwest
Toy example
• 500 random individuals from a population phenotyped and genotyped
– Genotypes were scored for one marker linked to a candidate gene
– Individuals scored as A1A1 = 0, A1A2 = 1, A2A2 = 2
[Figure: phenotypic value (y-axis) plotted against marker genotype score 0, 1, 2 (x-axis)]
$$y = \mu + bw + e$$
$$H_0: b = 0 \qquad H_A: b \neq 0$$
R: lm function
• Fits a linear model with normal errors and constant variance; generally used for regression analysis with continuous explanatory variables.
• Simple linear regression
– lm(y ~ x)
• See riceGwasEmma.r
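To make the single-marker test concrete, here is a minimal Python sketch of what a simple regression of phenotype on marker score computes. It is not the course script: the data are simulated, the allele frequency (0.4), intercept (10), and marker effect (0.5) are hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable because n is large.

```python
import math
import random
from statistics import NormalDist, mean


def single_marker_test(x, y):
    """Least-squares fit of y = mu + b*x + e.

    Returns (mu_hat, b_hat, se_b, p_value) for the test of H0: b = 0.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    mu_hat = y_bar - b_hat * x_bar
    rss = sum((yi - mu_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    z = b_hat / se_b
    # Normal approximation to the t distribution (assumption; fine for n = 500)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return mu_hat, b_hat, se_b, p_value


random.seed(1)
n, freq, true_b = 500, 0.4, 0.5  # hypothetical simulation settings
# Genotype score = number of A2 alleles (0, 1, or 2)
x = [sum(random.random() < freq for _ in range(2)) for _ in range(n)]
y = [10 + true_b * xi + random.gauss(0, 1) for xi in x]

mu_hat, b_hat, se_b, p = single_marker_test(x, y)
print(f"b_hat = {b_hat:.3f}, se = {se_b:.3f}, p = {p:.2e}")
```

The estimated slope `b_hat` plays the role of the marker effect b, and a small p-value is evidence against H0: b = 0.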
Population structure
• Nearly always present in association mapping panels
• Causes spurious associations if not accounted for
[Figure: two subpopulations; every individual in the first is AABB and every individual in the second is aabb]
Extreme example
Across these two subpopulations, the Ab and aB gametes never occur. When the subpopulations are combined in equal proportions into one population and LD is calculated, freq(AB) = 0.5 and freq(A) = freq(B) = 0.5, so D = freq(AB) – freq(A)*freq(B) = 0.5 – 0.25 = 0.25. The two loci are in complete LD regardless of their physical linkage.
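The same mechanism produces spurious marker-trait associations. The Python sketch below simulates a marker with no effect on the phenotype in two subpopulations that differ in both allele frequency and trait mean, then compares a naive regression slope with one adjusted for subpopulation. All settings (allele frequencies 0.9 and 0.1, means 10 and 8, 250 individuals each) are hypothetical, and centering within subpopulation is used here as a simple stand-in for fitting subpopulation as a fixed effect.

```python
import random
from statistics import mean


def slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx


def center_within(values, groups):
    """Subtract each group's mean from its members."""
    means = {g: mean(v for v, gv in zip(values, groups) if gv == g)
             for g in set(groups)}
    return [v - means[g] for v, g in zip(values, groups)]


random.seed(2)
# Hypothetical settings: the marker has NO true effect on the phenotype.
subpops = {1: {"freq": 0.9, "mean": 10.0}, 2: {"freq": 0.1, "mean": 8.0}}
groups, x, y = [], [], []
for g, par in subpops.items():
    for _ in range(250):
        groups.append(g)
        x.append(sum(random.random() < par["freq"] for _ in range(2)))
        y.append(par["mean"] + random.gauss(0, 1))

naive_b = slope(x, y)  # ignores population structure
adjusted_b = slope(center_within(x, groups), center_within(y, groups))
print(f"naive slope = {naive_b:.2f}, adjusted slope = {adjusted_b:.2f}")
```

The naive slope is far from zero because marker score and phenotype both track subpopulation; once subpopulation is accounted for, the slope falls back toward zero.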
Model population structure
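$$y = \mu + vq + bw + e$$
• q, v: subpop membership and effect
• w, b: marker allele dosage and effect
Matrix notation:
$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Q}\mathbf{v} + \mathbf{W}\mathbf{b} + \mathbf{e}$$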
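Illustration
3 subpopulations, 2 markers, 10 individuals
$$
\begin{bmatrix} 4.4 \\ 4.6 \\ 5.3 \\ 5.0 \\ 5.8 \\ 5.7 \\ 4.3 \\ 4.6 \\ 4.4 \\ 4.8 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \mu
+
\begin{bmatrix}
0.75 & 0.25 & 0.00 \\
0.65 & 0.30 & 0.05 \\
0.50 & 0.40 & 0.10 \\
0.75 & 0.05 & 0.20 \\
0.80 & 0.00 & 0.20 \\
0.20 & 0.60 & 0.20 \\
0.20 & 0.80 & 0.00 \\
0.30 & 0.70 & 0.00 \\
0.10 & 0.00 & 0.90 \\
0.10 & 0.00 & 0.90
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}
+
\begin{bmatrix}
0 & 0 \\
0 & 1 \\
1 & 1 \\
1 & 1 \\
0 & 1 \\
1 & 0 \\
1 & 0 \\
0 & 1 \\
0 & 0 \\
1 & 1
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
+
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \\ e_7 \\ e_8 \\ e_9 \\ e_{10} \end{bmatrix}
$$
$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Q}\mathbf{v} + \mathbf{W}\mathbf{b} + \mathbf{e}$$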
Population structure and differential relatedness (or family structure)
Yu and Buckler, 2006
Mixed-linear model to account for family structure
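$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Q}\mathbf{v} + \mathbf{W}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}$$
$$\mathbf{u} \sim MVN(\mathbf{0}, \mathbf{K}\sigma^2_u)$$
• u = polygenic effect (random)
• K = kinship matrix, normally calculated with genome-wide markers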
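As a worked step, the covariance of the phenotypes implied by the mixed model can be written out. This assumes, in addition to what the slide states, that the residuals e are independent with common variance $\sigma^2_e$ and are uncorrelated with the polygenic effects u (whose variance component is $\sigma^2_u$):

$$
\mathrm{Var}(\mathbf{y}) = \mathbf{V} = \mathbf{Z}\mathbf{K}\mathbf{Z}'\sigma^2_u + \mathbf{I}\sigma^2_e
$$

Related individuals therefore have correlated phenotypes through K, and testing a marker effect requires working with this n × n matrix V rather than assuming independent observations.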
Efficient Mixed-Model Association (EMMA)
• Uses eigenvalue decomposition to solve the mixed-model equations more efficiently
• Taking the direct inverse of the covariance matrix is computationally intensive, which we want to avoid in GWAS.
Options for modeling structure and kinship [see Price et al. (2010)]
Inferring and modeling structure
• Use knowledge on subpop membership directly
• Subpopulation clustering (explicitly infer ancestry)
– STRUCTURE
– ADMIXTURE
• Principal component analysis
– Use top PCs as covariates to correct for pop structure
– Related approach is multi-dimensional scaling (MDS)
Inferring kinship
• Marker similarity matrix
• Realized genomic additive relationship matrix
• Pedigree additive relationship matrix
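As an illustration of the first kinship option, a marker similarity matrix can be sketched in a few lines of Python. This is one simple choice among several (the average proportion of alleles shared between two individuals), not necessarily the definition used by any particular GWAS package; the function name `similarity_kinship` and the four-individual genotype table are hypothetical.

```python
def similarity_kinship(genotypes):
    """Marker similarity matrix from allele dosages.

    genotypes: one list per individual, each holding dosages coded 0, 1, 2.
    Entry (i, j) is the average proportion of alleles shared by i and j,
    computed as mean over markers of (2 - |g_i - g_j|) / 2.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(2 - abs(a - b)
                         for a, b in zip(genotypes[i], genotypes[j]))
            K[i][j] = K[j][i] = shared / (2 * m)  # fill both triangles
    return K


# Hypothetical genotypes: 4 individuals x 4 markers
geno = [
    [0, 2, 1, 0],
    [0, 2, 1, 0],  # identical to individual 1
    [2, 0, 1, 2],
    [1, 2, 0, 0],
]
K = similarity_kinship(geno)
for row in K:
    print(["%.2f" % k for k in row])
```

In practice K would be computed from genome-wide markers rather than four, and the resulting matrix is what enters the mixed model as the covariance structure of the polygenic effect.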
Efficient Mixed-Model Association (EMMA)
See riceGwasEmma.r
Manhattan plot
See riceGwasEmma.r
Statistical threshold: Correcting for multiple testing
[Figure: Manhattan plot with two candidate threshold positions, each marked "Here?"]
Statistical threshold: Correcting for multiple testing
• Bonferroni correction
– alphaC ≈ alphaE / (number of tests), where alphaC is the comparison-wise and alphaE the experiment-wise error rate
– Assumes independent tests
– Too conservative
• Permutation testing
– Good for linkage mapping
– Generally not valid for GWAS because family structure is not preserved
• False-discovery rate (Benjamini and Hochberg, 1995)
– Calculate expected proportion of declared QTL that are false positives
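The Bonferroni and false-discovery-rate approaches can be compared on a small set of p-values. The Python sketch below implements the Bonferroni threshold and the Benjamini-Hochberg step-up procedure; the ten p-values are made up for illustration and do not come from any data set in this lecture.

```python
def bonferroni(p_values, alpha=0.05):
    """Declare a test significant if p <= alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank  # largest rank whose p-value passes its threshold
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


# Hypothetical p-values from 10 marker tests
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
n_bonf = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(f"Bonferroni: {n_bonf} significant; BH FDR: {n_bh} significant")
# Bonferroni: 1 significant; BH FDR: 2 significant
```

With these values the Bonferroni threshold is 0.05 / 10 = 0.005, so only the smallest p-value passes, while the step-up procedure also accepts the second because 0.008 ≤ 0.05 × 2 / 10.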
Calculate effective number of tests
Other software packages to implement linear models for GWAS
• TASSEL: www.maizegenetics.net
• PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/
• EIGENSTRAT: http://www.hsph.harvard.edu/alkes-price/software/
• EMMAX: http://genetics.cs.ucla.edu/emmax/
• GAPIT: http://www.maizegenetics.net/gapit
• GenABEL: http://www.genabel.org/packages/GenABEL
• GWASTools: http://www.bioconductor.org/packages/2.11/bioc/html/GWASTools.html
• FaST-LMM: http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/