Association Mapping versus Genomic Selection Association Mapping To discover genes and genetic...
1
Association Mapping versus Genomic Selection
Association Mapping
• To discover genes and genetic variants that control a trait
• Knowledge can be applied to understand mechanism and genetic architecture, design pathways with diversity, and generate ideas for transgenic improvement
Genomic Selection
• To identify germplasm with the best breeding values and performance
• Can identify complementary varieties that should be crossed for future improvement
2
Association-based selection methods: Genomic selection
• We have MAS, why do we need something different?
• Historical introduction to genomic selection
– The basic idea
– Methods
– Theory
– Selected simulation results
– Empirical results
– Long-term genomic selection
– Introgressing diversity using GS
3
MAS problems
• Relevant germplasm
• Bias of estimated effects
• Effects too small for detection
4
[Figure: research time (years) versus resolution (bp, 1 to 1 x 10^7) for positional cloning, association mapping, pedigree, near-isogenic lines, recombinant inbred lines, intermated recombinant inbreds, and F2 / BC populations, coded by relevance to breeding germplasm (high, low, depends)]
Association mapping identifies QTL rapidly while scanning relevant germplasm
5
Bias in Effect Estimation
[Figure: distribution of locus effect estimates (true effect + error) around the true effect, with a significance threshold; the gap between the average “detected” effect estimate and the true effect is the bias]
• Keep in all loci => No threshold => Estimated effects are unbiased
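The bias described on this slide can be illustrated with a small simulation. This is a minimal sketch, not the author's analysis: the true effect, error standard deviation, threshold, and number of loci are all illustrative assumptions, and every locus is given the same true effect for simplicity.

```python
import random
from statistics import mean

def simulate_detection_bias(true_effect=1.0, error_sd=1.0, threshold=2.0,
                            n_loci=10_000, seed=1):
    """Draw estimates = true effect + error, then average all vs. 'detected' ones."""
    rng = random.Random(seed)
    estimates = [true_effect + rng.gauss(0.0, error_sd) for _ in range(n_loci)]
    # Only estimates that pass the significance threshold are "detected"
    detected = [e for e in estimates if e > threshold]
    return {
        "mean_all": mean(estimates),        # no threshold: close to the true effect
        "mean_detected": mean(detected),    # thresholded: pushed upward
        "n_detected": len(detected),
    }

result = simulate_detection_bias()
bias = result["mean_detected"] - 1.0   # 1.0 is the assumed true effect
print(result, "bias of detected estimates:", round(bias, 2))
```

Averaging over all loci recovers the true effect, while averaging only over loci that pass the threshold overstates it, which is the argument for keeping all markers in the model.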
6
In polygenic traits, much is hidden
Lande & Thompson 1990
E.g., h2 = 0.8, α = 0.01
1200
7
Genomic selection principles
• Meuwissen et al. 2001 Genetics 157:1819-1829
• No distinction between “significant” and “non-significant”; no arbitrary inclusion / exclusion: all markers contribute to prediction
• More effects must be estimated than there are phenotypic observations
• Estimated effects are unbiased
• Capture small effects
8
Genomic selection: Prediction using many markers
[Diagram: Training Population, Genotyping & Phenotyping, Train GS Model; Breeding Material, Genotyping, Calculate GEBV, Make Selections]
Meuwissen et al. 2001 Genetics 157:1819-1829
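The workflow on this slide (train a model on a genotyped and phenotyped training population, then compute GEBV for genotyped-only breeding material and select) can be sketched with ridge regression on marker effects. This is a minimal sketch under stated assumptions: the population sizes, marker coding (-1 / 1), effect variance, and ridge parameter are illustrative, and the helper names (`solve`, `ridge_marker_effects`, `genetic_value`) are hypothetical, not from the source.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ridge_marker_effects(X, y, lam):
    """Ridge estimates of all marker effects: solve (X'X + lam*I) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)

def genetic_value(X, beta):
    """Sum of marker effects for each genotype row."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

rng = random.Random(42)
n_train, n_cand, n_markers = 60, 40, 100   # more effects than phenotypes (assumed sizes)
true_beta = [rng.gauss(0.0, 0.3) for _ in range(n_markers)]

def genotypes(n):
    return [[rng.choice((-1.0, 1.0)) for _ in range(n_markers)] for _ in range(n)]

# Training population: genotyped and phenotyped
X_train = genotypes(n_train)
y_train = [g + rng.gauss(0.0, 1.0) for g in genetic_value(X_train, true_beta)]
y_mean = sum(y_train) / n_train
beta_hat = ridge_marker_effects(X_train, [y - y_mean for y in y_train],
                                lam=1.0 / 0.3 ** 2)   # error var / effect var

# Breeding material: genotyped only, ranked on GEBV
X_cand = genotypes(n_cand)
cand_gebv = genetic_value(X_cand, beta_hat)
selected = sorted(range(n_cand), key=lambda i: cand_gebv[i], reverse=True)[:10]
print("selected candidates:", selected)
```

The ridge penalty is what allows more marker effects than observations to be estimated at once; no marker is tested for significance before it contributes to the GEBV.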
9
Statistical modeling: The two cultures
Breiman 2001 Stat. Sci. 16:199-231
[Diagram: observed inputs X, Nature, observed responses Y]
• Can we understand Y? X, Regression, Y: identify causal inputs
• Can we predict Y? X, Regression / Decision trees / Whatever works, Y
10
Need to shorten breeding cycle
[Figure: selection intensity i versus ratio candidates / selected (log scale)]
[Figure: accuracy rA versus number of replications (log scale)]
i cumulates over breeding cycles
11
Phenotypic Selection
[Diagram: breeding cycle with stages Cross, Inbreed, Phenotype, Select, Release. Annotations: F1 × Inducer; Self DH0; 2 Seasons; 1 Rep, N = 2270, S = 100; 5 Reps, N = 100, S = 10; 2 Years; 1 Season; 3 Years]
12
Genomic Selection
[Diagram: breeding cycle with stages Cross, Inbreed, Phenotype, Select, Release. Annotation: 1 Year!]
13
FastGS
[Diagram: breeding cycle with stages Cross, Inbreed, Phenotype, Select, Release. Annotation: 1 Season = ⅓ Year!!]
14
Selection Intensities
• Phenotypic
– N = 2270, S = 10: i = 2.4
• FastGS
– N = 370, S = 43: i = 1.7
– 9 × i ≅ 15
• Inbreeding: (!!!)
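The FastGS figures above can be checked with the usual large-population approximation for truncation selection, i = φ(z) / p, where p = S / N is the selected proportion and z is the standard normal truncation point. This is a sketch under that assumption; the factor of 9 is taken here to be the number of ⅓-year FastGS cycles in one 3-year phenotypic cycle. It does not attempt to reproduce the phenotypic value of 2.4, whose derivation the slides do not give.

```python
from statistics import NormalDist

def selection_intensity(n_candidates, n_selected):
    """Standardized selection differential for truncation selection
    (infinite-population normal approximation)."""
    p = n_selected / n_candidates
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - p)      # truncation point
    return std_normal.pdf(z) / p         # mean of the selected tail

i_fastgs = selection_intensity(370, 43)
cycles_per_phenotypic_cycle = 9          # assumed: 3 years / (1/3 year)
cumulated = cycles_per_phenotypic_cycle * i_fastgs
print(round(i_fastgs, 2), round(cumulated, 1))
```

Because i cumulates over cycles, a modest per-cycle intensity repeated nine times exceeds what a single, much more intense phenotypic cycle delivers over the same period.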
15
Rates of gain per year
16
Impacts
• Schaeffer, L.R. 2006. Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet. 123:218-223.
17
Schaeffer 2006
[Figure: cost per genetic standard deviation. Phenotypic: $116 M; Genomic: $4.2 M]
18
Potential Impact
[Diagram: two linked cycles around Genomic Selection.
Line Development Cycle: New Germplasm; Genotype; Make crosses and advance generations; Advance lines with highest GEBV; Test varieties and release.
Model Training Cycle: Advance lines informative for model improvement; Phenotype (lines have already been genotyped); Train prediction model; Updated Model]
Heffner, E.L. et al. 2009. Genomic Selection for Crop Improvement. Crop Science 49:1-12
19
What (I think) is revolutionary
[Diagram: two linked cycles around Genomic Selection.
Line Development Cycle: New Germplasm; Genotype; Make crosses and advance generations; Advance lines with highest GEBV; Test varieties and release.
Model Training Cycle: Advance lines informative for model improvement; Phenotype (lines have already been genotyped); Train prediction model; Updated Model]
For a century, breeding has focused on better ways to evaluate lines. Henceforth it will focus on how to improve a model.
Phenotypic Selection
20
A Focus for Information
[Diagram: Genomic Prediction Model Development as the focus, linked to Population Improvement (Select, Cross) and Cultivar Release]
• Current pheno–geno data
• Historical pheno–geno data
• Linkage and association mapping
• Biological knowledge
21
The Alleletarian Revolution
• The breeding line as the focus of evaluation has been dethroned in favor of the allele
• A line is useful to us only with respect to the alleles it carries
• Time-honored practice: replicate (progeny test) lines
• But alleles are replicated regardless of what line carries them
22
Methods
• Linear models:
– Effects are random
– Methods differ in marker effect priors
• Machine learning methods
– Regression trees
23
Linear models: Priors on coefficients
• Ridge regression
• BayesB (SSVS)
• BayesCπ
[Prior equations not recovered; the BayesB and BayesCπ priors are piecewise (“... else ...”)]
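The prior equations on this slide did not survive extraction. As a hedged reconstruction, the forms below are the ones usually given for these three methods in the genomic selection literature; the notation (β_j for the effect of marker j, π for the probability of a zero effect) is an assumption and may differ from what the slide showed.

```latex
% Ridge regression: every marker effect shares one variance
\[ \beta_j \sim N(0, \sigma^2_\beta) \]

% BayesB: a fraction \pi of markers has no effect; the rest have marker-specific variances
\[ \beta_j \begin{cases} = 0 & \text{with probability } \pi \\
   \sim N(0, \sigma^2_{\beta_j}) & \text{else} \end{cases} \]

% BayesC\pi: as BayesB but with a common variance, and \pi estimated from the data
\[ \beta_j \begin{cases} = 0 & \text{with probability } \pi \\
   \sim N(0, \sigma^2_\beta) & \text{else} \end{cases} \]
```

The methods differ only in these priors: ridge regression shrinks all markers equally, while the two mixture priors allow a few markers to carry large effects.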
24
[Figure: density of Var(β) under Ridge regression, BayesB, and BayesCπ]
25
Machine learning methods
• Random Forests
– Forest of regression trees
– Each tree on a bootstrapped sample
– Nodes split on randomly sampled features
– Prediction is forest mean
• Can capture interactions
[Diagram: example regression trees splitting on markers M1 and M2 (genotypes 0 / 1)]
26
Additive models and breeding value
• Breeding value = Mean phenotype of progeny
– Most important parent selection criterion
– Recombination: parents do not always pass combinations of genes to their progeny
–> Sum of individual locus effects
• Linear models capture this; Machine learning methods may not
27
Theory
• How accurate will GS be?
• Impact of GS on inbreeding / loss of diversity
• Genomic selection captures pedigree relatedness among candidates
28
Prediction accuracy = Correlation(predicted, true)
• R = i rA σA
rA = corr(selection criterion, breeding value)
• On simulated data corr(Â, A) is easy
• On real data:
29
Predict prediction accuracy
• Daetwyler, H.D. et al. 2008. Accuracy of Predicting the Genetic Risk of Disease Using a Genome-Wide Approach. PLoS ONE 3:e3395
• Assume all loci affecting the trait are known and are independent
• Assume marker effects are fixed
30
[Figure: predicted accuracy curves; extracted labels: λ, 0.02, 0.1, 0.5, 1, 2, 5, 10, 20]
Replicating hurts: 2000 with 1 plot is better than 1000 with 2 plots
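The replication claim can be checked against the deterministic accuracy formula commonly attributed to Daetwyler et al. 2008, r = sqrt(λh² / (λh² + 1)) with λ = nP / nG. This is a sketch under assumptions: the single-plot heritability (0.2) and nG (1000) are illustrative values, and replication is assumed to act only by raising the heritability of the line mean.

```python
from math import sqrt

def entry_mean_h2(h2_plot, n_reps):
    """Heritability of a line mean over n_reps plots, given single-plot h2
    (genetic variance h2_plot, error variance 1 - h2_plot)."""
    return h2_plot / (h2_plot + (1.0 - h2_plot) / n_reps)

def predicted_accuracy(n_lines, n_loci, h2):
    """r = sqrt(lambda*h2 / (lambda*h2 + 1)), with lambda = nP / nG."""
    lam = n_lines / n_loci
    return sqrt(lam * h2 / (lam * h2 + 1.0))

h2_plot, n_loci = 0.2, 1000   # illustrative assumptions
acc_unreplicated = predicted_accuracy(2000, n_loci, entry_mean_h2(h2_plot, 1))
acc_replicated = predicted_accuracy(1000, n_loci, entry_mean_h2(h2_plot, 2))
print(round(acc_unreplicated, 3), round(acc_replicated, 3))
```

With the same 2000 plots, phenotyping 2000 lines once gives a higher predicted accuracy than phenotyping 1000 lines twice: doubling nP doubles λh², whereas a second replicate raises h² by less than a factor of two.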
31
Predict prediction accuracy
• Hayes, B.J. et al. 2009. Increased accuracy of artificial selection by using the realized relationship matrix. Genetics Research 91:47-60.
• Detail on the population genetics that drive nG
• Assume marker effects are random
• Still assume all markers independent and estimated separately
32
Analytical approximations
[Figure: accuracy versus NP / NG; panels: Daetwyler et al., 2008 and Hayes et al., 2009]
33
Take Homes
• Even with traits of very low heritability (h2 = 0.01), sufficient nP gives accuracy
• Replication may not be good
• The number of loci estimated (nG) is a critical parameter
• If you don’t know where the QTL are, higher marker coverage requires higher nG
• N.B. All conclusions assuming only 100% LD!
34
Genetic diversity loss / inbreeding
• Daetwyler, H.D. et al. 2007. Inbreeding in genome-wide selection. J. Anim. Breed. Genet. 124:369-376
• Avoid selecting close relatives together
• What is the correlation in the estimated breeding value between full sibs?
[Figure: correlation of sibling estimates]
35
Genetic diversity loss / inbreeding
Aj = ½AS + ½AD + aj
(aj: Mendelian sampling term)
[Figure: correlation of sibling estimates in terms of between-family variance σ²B and within-family variance σ²W. BLUP: σ²W = 0; GS: σ²W > 0]
36
Daetwyler et al. 2007 Take Homes
• Genomic selection captures the Mendelian sampling term.
– Correlation between the estimates of sibling performance is reduced
– Co-selection of sibs is reduced
– Rate of inbreeding / loss of diversity is reduced
37
A word on pedigree relatedness
• Five individuals, a, b, c, d, and e.
– a, b, and c unrelated
– d offspring of a and b
– e offspring of a and c
A =
    a  b  c  d  e
a   1  0  0  ½  ½
b   0  1  0  ½  0
c   0  0  1  0  ½
d   ½  ½  0  1  ¼
e   ½  0  ½  ¼  1
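The matrix A above can be built from the pedigree with the standard tabular method. The sketch below is an illustration rather than part of the original slides; the function name `relationship_matrix` and the `(id, sire, dam)` tuple layout are hypothetical, and parents must be listed before their offspring.

```python
def relationship_matrix(pedigree):
    """Additive relationship matrix by the tabular method.

    pedigree: list of (id, sire, dam) with parents listed before offspring;
    None marks an unknown (unrelated founder) parent.
    """
    ids = [ind for ind, _, _ in pedigree]
    index = {ind: k for k, ind in enumerate(ids)}
    n = len(ids)
    A = [[0.0] * n for _ in range(n)]
    for i, (_, sire, dam) in enumerate(pedigree):
        s, d = index.get(sire), index.get(dam)
        # Diagonal: 1 + inbreeding coefficient (half the parents' relationship)
        both_known = s is not None and d is not None
        A[i][i] = 1.0 + (0.5 * A[s][d] if both_known else 0.0)
        # Off-diagonal: average relationship of j with i's two parents
        for j in range(i):
            a_js = A[j][s] if s is not None else 0.0
            a_jd = A[j][d] if d is not None else 0.0
            A[i][j] = A[j][i] = 0.5 * (a_js + a_jd)
    return A

pedigree = [("a", None, None), ("b", None, None), ("c", None, None),
            ("d", "a", "b"), ("e", "a", "c")]
A = relationship_matrix(pedigree)
for name, row in zip("abcde", A):
    print(name, row)
```

For example, the entry for d and e is ½ × (A[d][a] + A[d][c]) = ½ × (½ + 0) = ¼, the half-sib relationship through the shared parent a.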
38
Ridge Regression
Habier, D. et al. 2007. Genetics 177:2389-2397
Hayes, B.J. et al. 2009. Genetics Research 91:47-60.
39
Habier et al. simulation set up
40
Genetic relationship decays fast
Training population here
• Prediction from pedigree relationship loses accuracy very quickly
• Decay rate is initially more rapid then stabilizes after about 5 generations
• Rapid initial decay reflects that the closest marker may not be in highest LD with the QTL
• RR-BLUP accuracy decays more rapidly than Bayes-B because more markers absorb the effect of a QTL
41
Habier et al. 2007 Take homes
• The ability of genomic selection to capture information on genetic relatedness is valuable
• That information decays rapidly
• The amount of that information relates to the number of markers fitted by a model:
– Ridge regression > BayesB
• Bayes-B captured more LD information:
– Long-term accuracy: BayesB > Ridge regression
42
Accuracy due to relationships vs. LD
43
Stochastic vs deterministic prediction
[Figure: accuracy versus NP / NG; series: Zhong et al. and Habier et al.]
44
To replicate or not to replicate
[Figure: 504 lines replicated once versus 168 lines replicated three times; panels: Ridge Regression and BayesB]
45
Genetic diversity loss / inbreeding
Aj = ½AS + ½AD + aj
(aj: Mendelian sampling term)
[Figure: correlation of sibling estimates in terms of between-family variance σ²B and within-family variance σ²W. BLUP: σ²W = 0; GS: σ²W > 0]
Capturing relationship information increases σ²B, NOT σ²W
46
Simulation setting: Meuwissen; Habier; Solberg
• Ne = 100; 1000 generations
• Mutation / Drift / Recombination equilibrium
• High marker mutation rate (2.5 x 10^-3 / locus / generation); higher “haplotype mutation rate”
• Mutation effect distribution Gamma (1.66, 0.4): “effective QTL number” is only about 6 (!)
–> Watch out how you simulate!
47
Results
• Prediction accuracy estimated by simulation
            MHG    HFD
RR-BLUP     0.73   0.64
BayesB      0.85   0.69
• These accuracies are ASTOUNDING
• If h2 = 1, r = 0.71
48
Noteworthy discussion
• Markers flanking QTL not always in model
– QTL effects captured by multiple markers
– No need to “detect” QTL
• Recombination causes accuracy to decay
– Faster than if QTL captured by flanking markers
– Markers far from QTL contribute to capture its effect
• Ne / 2 markers per Morgan achieves close to maximum accuracy
– Dependent on high marker mutation rates (?)
49
Solberg et al. 2008
• Density: Number of markers per Morgan
SSR: ¼ Ne, ½ Ne, 1 Ne, 2 Ne
SNP: 1 Ne, 2 Ne, 4 Ne, 8 Ne
50
Zhong et al. 2009
• Zhong, S. et al. 2009. Genetics 182:355-364.
• 42 diverse 2-row barley
• 1040 markers ~ evenly spaced
• Mating designs to generate 500 high and low LD training dataset
• 20 or 80 QTL; h2 = 0.4
51
Ridge regression vs. BayesB
Zhong et al. 2009
[Figure: accuracy of Ridge Regression and BayesB for 20QTL – HiLD, 20QTL – LoLD, 80QTL – HiLD, and 80QTL – LoLD; QTL observed versus unobserved]
52
Take-home messages
• Ridge regression is not affected by the number of QTL / the QTL effect size
• BayesB performs better with large marker-associated effects
• Co-linearity is more detrimental to BayesB
• High marker density and training pop. size? Yes: BayesB; No: RR-BLUP
53
VanRaden et al. 2009
• VanRaden, P.M. et al. 2009. Invited Review: Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 92:16-24.
54
VanRaden et al. 2009
• Some traits have major genes, others do not
55
VanRaden et al. 2009
• The larger the training population, the better. Where diminishing returns will begin is not in sight.
Predictor
56
Take Homes
• Training population requirements very large
• BayesB did not help
• == no large marker-associated effects ==
• Like the “Case of the missing heritability” in human GWAS studies
– Are many quantitative traits driven by very low frequency variants?
– RR would capture this case better than BayesB
57
Empirical data on crops: TP size
58
Empirical data on crops: Marker No.
59
Empirical data on Humans: Marker No.
Yang et al. 2010. Nat. Genet. 10.1038/ng.608
Out of 295K SNP
60
Long-term genomic selection
• Marker data from elite six-row barley program
• 880 Markers
• 100 hidden as additive-effect QTL
• Evaluate 200 progeny, select 20
• Phenotypic compared to genomic selection
61
Breeding / model update cycles
Evaluation is possible every other season. Candidates from every other cycle can be evaluated. There is still a lag: Parents of C2 are selected based on evaluation of C0.
[Diagram: timeline over Season 1 to Season 6. Phenotypic Selection alternates “Cross & Inbreed” and “Evaluate & Select” steps. Genomic Selection repeats “Cross, Inb. & Select” steps, with “Evaluate” steps alongside]
62
Response in genotypic value
[Figure: mean genotypic value versus phenotypic breeding cycle; series: Genomic, Small Training Pop; Genomic, Large Training Pop; Phenotypic Selection]
63
Accuracy
[Figure: mean realized accuracy versus phenotypic breeding cycle; series: Genomic, Small Training Pop; Genomic, Large Training Pop; Phenotypic Selection]
64
Genetic variance
[Figure: mean genotypic standard deviation versus phenotypic breeding cycle; series: Genomic, Small Training Pop; Genomic, Large Training Pop; Phenotypic Selection]
65
Lost favorable alleles
[Figure: mean number of lost favorable alleles versus phenotypic breeding cycle; series: Genomic, Small Training Pop; Genomic, Large Training Pop; Phenotypic Selection]
66
Goddard 2008; Hayes et al. 2009
67
Response in genotypic value
[Figure: mean genotypic value versus phenotypic breeding cycle; panels: Unweighted and Weighted; series: Genomic, Small Training Pop; Genomic, Large Training Pop; Phenotypic Selection]
68
Genetic variance
[Figure: mean genotypic standard deviation versus phenotypic breeding cycle; panels: Unweighted and Weighted; series: Genomic, Small Training Pop; Genomic, Large Training Pop; Phenotypic Selection]
69
Lost favorable alleles
[Figure: mean number of lost favorable alleles versus phenotypic breeding cycle; panels: Unweighted and Weighted; series: Genomic, Small Training Pop; Genomic, Large Training Pop; Phenotypic Selection]
70
Long term genomic selection
• The acceleration of the breeding cycle is key
• Some favorable alleles will be lost
– Likely those not in LD with any marker
• Managing diversity / favorable alleles appears a good idea
• This can be done using the same data as used for genomic prediction
71
Introgressing diversity
• GS relies on marker–QTL allele association
• An “exotic” line comes from a sub-population divergent from the breeding population
• After sub-populations separate
– Drift moves allele frequencies independently
– Drift & recombination shift associations independently
• Will the GS prediction model identify valuable segments from the exotic?
72
Three approaches
• Create a bi-parental family with the exotic (Bernardo 2009)
– Develop a mini-training population for that family
– Improve the family
– Bring it into the main breeding population
• Develop a separate training population for the exotic sub-population (Ødegård et al. 2009)
• Develop a single multi-subpopulation (species-wide?) training population (Goddard 2006)
73
Need higher marker density
[Diagram: ancestral LD versus sub-population specific LD]
• Tightly–linked: ancestral LD
• Loosely–linked: sub-population specific LD
74
Consistency of association across barley subpopulations
[Figure: correlation of r (0.0 to 1.0) versus genetic distance (0.0 to 0.5); panels: 0 cM recombination distance and 5 cM recombination distance]
75
Example: Dairy cattle breeds
[Figure: prediction accuracy (0 to 0.7) for training populations TP = Holstein, TP = Jersey, and Holstein + Jersey; validation populations VP = Holstein and VP = Jersey]
76
Oat sub-populations (UOPN)
[Figure: sub-populations G1 (N=136), G2 (N=149), G3 (N=161)]
77
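Combined sub-population TP (β-Glucan)
[Figure: accuracy for training population (TP) / validation population (VP) combinations of G1, G2, and G3; labels as extracted: G1 | G2 and G3: 0.11; G3 | G1 and G2: 0.50; G1, G3 | G2: 0.39]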
78
Introgressing diversity using GS
• Need higher marker density
• Analysis of consistency of r may indicate whether current density is sufficient
– Not sure we have it for barley
• If you have the density, a multi-subpopulation training population seems like a good idea
– Focuses the model on tighter ancestral LD rather than looser sub-population specific LD