Association Mapping versus Genomic Selection Association Mapping To discover genes and genetic...

1

Association Mapping versus Genomic Selection

Association Mapping
• To discover genes and genetic variants that control a trait
• Knowledge can be applied to understand mechanism and genetic architecture, design pathways with diversity, and generate ideas for transgenic improvement

Genomic Selection
• To identify germplasm with the best breeding values and performance
• Can identify complementary varieties that should be crossed for future improvement

2

Association-based selection methods: Genomic selection

• We have MAS, why do we need something different?
• Historical introduction to genomic selection
– The basic idea
– Methods
– Theory
– Selected simulation results
– Empirical results
– Long-term genomic selection
– Introgressing diversity using GS

3

MAS problems

• Relevant germplasm
• Bias of estimated effects
• Effects too small for detection

4

Figure: mapping approaches placed by resolution (bp; x-axis, 1 to 1 x 10^7) and research time (years; y-axis), and marked by relevance to breeding germplasm (low, high, or depends). Approaches shown: association mapping, positional cloning, recombinant inbred lines, pedigree, intermated recombinant inbreds, F2 / BC, near-isogenic lines.

Association mapping identifies QTL rapidly while scanning relevant germplasm

5

Bias in Effect Estimation

Figure: locus effect estimates (true effect + error) shown with the true effects and a significance threshold. The average "detected" effect estimate differs from the true effect; the gap is the bias.

• Keep all loci => no threshold => estimated effects are unbiased

6

In polygenic traits, much is hidden

Lande & Thompson 1990

E.g., h2 = 0.8, α = 0.01

1200

7

Genomic selection principles

• Meuwissen et al. 2001 Genetics 157:1819-1829
• No distinction between “significant” and “non-significant”; no arbitrary inclusion / exclusion: all markers contribute to prediction
• More effects must be estimated than there are phenotypic observations
• Estimated effects are unbiased
• Capture small effects
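The point that more effects are estimated than there are observations can be illustrated with a minimal ridge regression sketch. Everything in it is an assumption made for illustration: the simulated 0/1 marker matrix, the sizes `n_lines` and `n_markers`, the penalty `lam`, and the helper name `ridge_marker_effects`. None of it is taken from Meuwissen et al. 2001.

```python
import numpy as np

def ridge_marker_effects(X, y, lam):
    """Ridge estimates of all marker effects at once.

    Solves (X'X + lam * I) b = X'y on centered data. The penalty makes the
    system solvable even when there are more markers than phenotyped lines.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(1)
n_lines, n_markers = 50, 200   # hypothetical sizes: markers > lines
X = rng.integers(0, 2, size=(n_lines, n_markers)).astype(float)
true_effects = rng.normal(0.0, 0.1, n_markers)
y = X @ true_effects + rng.normal(0.0, 1.0, n_lines)

# Every marker gets an effect; no significance test decides inclusion.
effects = ridge_marker_effects(X, y, lam=10.0)

# GEBV of unphenotyped candidates: mean plus the sum of all marker effects.
candidates = rng.integers(0, 2, size=(5, n_markers)).astype(float)
gebv = y.mean() + (candidates - X.mean(axis=0)) @ effects
```

Without the penalty term the normal equations would be singular here, which is why some form of shrinkage or prior is needed once markers outnumber lines.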

8

Genomic selection: Prediction using many markers

Figure (workflow): Training Population → Genotyping & Phenotyping → Train GS Model; Breeding Material → Genotyping → Calculate GEBV → Make Selections.

Meuwissen et al. 2001 Genetics 157:1819-1829

9

Statistical modeling: The two cultures

Breiman 2001 Stat. Sci. 16:199-231

Figure: Nature turns observed inputs X into observed responses Y.

• Can we understand Y? Model X → Y by regression; identify causal inputs.
• Can we predict Y? Model X → Y by regression, decision trees, or whatever works.

10

Need to shorten breeding cycle

Figure, two panels: selection intensity i against the ratio candidates / selected; accuracy rA against the number of replications.

i cumulates over breeding cycles

11

Phenotypic Selection

Figure (breeding cycle): Cross, Inbreed, Phenotype, Select, Release. Labels on the diagram: F1 × Inducer; Self DH0; 2 Seasons; 1 Rep, N = 2270, S = 100; 5 Reps, N = 100, S = 10; 2 Years; 1 Season; 3 Years.

12

Genomic Selection

Figure (breeding cycle): Cross, Inbreed, Phenotype, Select, Release. Label on the diagram: 1 Year!

13

FastGS

Figure (breeding cycle): Cross, Inbreed, Phenotype, Select, Release. Label on the diagram: 1 Season = ⅓ Year!!

14

Selection Intensities

• Phenotypic
– N = 2270, S = 10: i = 2.4
• FastGS
– N = 370, S = 43: i = 1.7
– 9 × i ≅ 15
• Inbreeding: (!!!)
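As a rough check on the FastGS intensity, the standardized selection differential can be sketched from the selected fraction. This assumes a normally distributed selection criterion and a large population; the function name `selection_intensity` is hypothetical, and only the FastGS entry (N = 370, S = 43) is checked here.

```python
from statistics import NormalDist

def selection_intensity(n_candidates, n_selected):
    """Standardized selection differential i under truncation selection.

    Assumes a normal selection criterion and a large population:
    i = pdf(z) / p, where p is the selected fraction and z is the
    truncation point whose upper-tail area equals p.
    """
    p = n_selected / n_candidates
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - p)
    return std_normal.pdf(z) / p

i_fast = selection_intensity(370, 43)  # FastGS figures from the slide
nine_cycles = 9 * i_fast               # nine one-season cycles
print(round(i_fast, 1), round(nine_cycles, 1))
```

Under these assumptions the result rounds to the slide's i = 1.7, and nine cycles give roughly 15.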

15

Rates of gain per year
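A minimal sketch of how a rate of gain per year could be computed from the quantities on the surrounding slides, using R = i rA σA divided by cycle length. The selection intensities (2.4 and 1.7) and cycle lengths (3 years and ⅓ year) follow the slides; the accuracies are hypothetical placeholders, and `gain_per_year` is an illustrative name.

```python
def gain_per_year(i, accuracy, sigma_a, cycle_years):
    """Expected response per year: (i * rA * sigmaA) / cycle length."""
    return i * accuracy * sigma_a / cycle_years

sigma_a = 1.0  # additive genetic SD, so gains are in genetic SD units

# Accuracies below are hypothetical placeholders, not values from the slides.
phenotypic = gain_per_year(i=2.4, accuracy=0.8, sigma_a=sigma_a, cycle_years=3.0)
fast_gs = gain_per_year(i=1.7, accuracy=0.4, sigma_a=sigma_a, cycle_years=1.0 / 3.0)
ratio = fast_gs / phenotypic
```

With these placeholder accuracies the shorter cycle outweighs the lower intensity and accuracy; different accuracies would change the ratio.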

16

Impacts

• Schaeffer, L.R. 2006. Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet. 123:218-223.

17

Schaeffer 2006

Figure: cost per genetic standard deviation: phenotypic $116 M, genomic $4.2 M.

18

Potential Impact

Figure: two linked cycles.
• Line Development Cycle: new germplasm; make crosses and advance generations; genotype; genomic selection; advance lines with highest GEBV; test varieties and release
• Model Training Cycle: advance lines informative for model improvement; phenotype (lines have already been genotyped); train prediction model; updated model

Heffner, E.L. et al. 2009. Genomic Selection for Crop Improvement. Crop Science 49:1-12

19

What (I think) is revolutionary

Figure: the Line Development Cycle and Model Training Cycle diagram from Heffner et al. 2009, repeated.

For a century, breeding has focused on better ways to evaluate lines. Henceforth it will focus on how to improve a model.


20

A Focus for Information

Figure labels: Select; Cross; Cultivar Release; Population Improvement; Genomic Prediction Model Development.

• Current pheno–geno data
• Historical pheno–geno data
• Linkage and association mapping
• Biological knowledge

21

The Alleletarian Revolution

• The breeding line as the focus of evaluation has been dethroned in favor of the allele

• A line is useful to us only with respect to the alleles it carries

• Time-honored practice: replicate (progeny test) lines

• But alleles are replicated regardless of what line carries them

22

Methods

• Linear models:
– Effects are random
– Methods differ in marker effect priors
• Machine learning methods
– Regression trees

23

Linear models: Priors on coefficients

• Ridge regression
• BayesB (SSVS): piecewise prior (… else …)
• BayesCπ: piecewise prior (… else …)
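For reference, the priors usually associated with these three methods in the genomic selection literature can be written as below. This is a sketch in standard notation, not a transcription of the slide: the symbols β_j, σ²_β, π, ν, and S are assumptions introduced here. In BayesB each marker has its own variance and π is typically fixed; in BayesCπ the markers share one variance and π is estimated.

```latex
\begin{align*}
\text{Ridge regression:}\quad & \beta_j \sim N(0, \sigma^2_\beta) \\
\text{BayesB:}\quad & \beta_j
  \begin{cases}
    = 0 & \text{with probability } \pi \\
    \sim N(0, \sigma^2_{\beta_j}),\ \ \sigma^2_{\beta_j} \sim \text{scaled inv-}\chi^2(\nu, S) & \text{else}
  \end{cases} \\
\text{BayesC}\pi\text{:}\quad & \beta_j
  \begin{cases}
    = 0 & \text{with probability } \pi \\
    \sim N(0, \sigma^2_\beta) & \text{else}
  \end{cases}
\end{align*}
```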

24

Figure: density of Var(β) under ridge regression, BayesB, and BayesCπ.

25

Machine learning methods

• Random Forests
– Forest of regression trees
– Each tree on a bootstrapped sample
– Nodes split on randomly sampled features
– Prediction is forest mean
• Can capture interactions

Figure: example regression trees splitting on markers M1 and M2, each coded 0 / 1.
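Why a tree can capture an interaction that an additive model misses can be shown with a deliberately tiny sketch. The phenotype below is a hypothetical pure-interaction case (high only when exactly one of M1, M2 is 1), and `tree_predict` is a hand-written two-level split, not a random forest implementation.

```python
import numpy as np

# The four genotype classes at two markers (M1, M2), each coded 0 / 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
# Hypothetical phenotype with pure interaction and no main effects.
y = np.array([0.0, 1.0, 1.0, 0.0])

# Additive linear model: intercept + M1 effect + M2 effect, by least squares.
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_pred = A @ coef

def tree_predict(X_train, y_train, x):
    """Two-level tree: split on M1, then on M2; predict the leaf mean."""
    in_branch = X_train[:, 0] == x[0]
    in_leaf = in_branch & (X_train[:, 1] == x[1])
    return y_train[in_leaf].mean()

tree_pred = np.array([tree_predict(X, y, x) for x in X])
```

Here the additive fit predicts the overall mean for every class, while the nested split recovers each class exactly. A random forest averages many such trees grown on bootstrapped samples with randomly sampled features.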

26

Additive models and breeding value

• Breeding value = Mean phenotype of progeny
– Most important parent selection criterion
– Recombination: parents do not always pass combinations of genes to their progeny
– → Sum of individual locus effects
• Linear models capture this; machine learning methods may not

27

Theory

• How accurate will GS be?
• Impact of GS on inbreeding / loss of diversity
• Genomic selection captures pedigree relatedness among candidates

28

Prediction accuracy = Correlation(predicted, true)

• R = i rA σA
– rA = corr(selection criterion, breeding value)
• On simulated data, corr(Â, A) is easy
• On real data:

29

Predict prediction accuracy

• Daetwyler, H.D. et al. 2008. Accuracy of Predicting the Genetic Risk of Disease Using a Genome-Wide Approach. PLoS ONE 3:e3395

• Assume all loci affecting the trait are known and are independent
• Assume marker effects are fixed

30

Figure: predicted accuracy curves for λ = 0.02, 0.1, 0.5, 1, 2, 5, 10, 20.

Replicating hurts: 2000 with 1 plot is better than 1000 with 2 plots
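The replication claim can be checked with a small sketch built on the commonly quoted form of the Daetwyler et al. 2008 approximation, r = sqrt(λh² / (λh² + 1)) with λ = nP / nG. The plot-basis heritability, the number of loci, and the function names are hypothetical, and the line-mean heritability formula assumes replication only averages independent plot error.

```python
import math

def expected_accuracy(n_lines, h2, n_loci):
    """Approximate accuracy sqrt(lam*h2 / (lam*h2 + 1)), lam = nP / nG."""
    lam = n_lines / n_loci
    return math.sqrt(lam * h2 / (lam * h2 + 1.0))

def line_mean_h2(h2_plot, n_reps):
    """Heritability of a line mean over n_reps plots, given plot-basis h2.

    Assumes replication only averages down independent plot error.
    """
    return h2_plot / (h2_plot + (1.0 - h2_plot) / n_reps)

h2_plot = 0.2   # hypothetical plot-basis heritability
n_loci = 1000   # hypothetical number of independent loci (nG)

# Same 2000 plots, spent two different ways.
acc_unreplicated = expected_accuracy(2000, line_mean_h2(h2_plot, 1), n_loci)
acc_replicated = expected_accuracy(1000, line_mean_h2(h2_plot, 2), n_loci)
```

Under these assumptions, 2000 lines in one plot each give a higher expected accuracy than 1000 lines in two plots each, because the gain in line-mean heritability does not make up for halving nP.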

31

Predict prediction accuracy

• Hayes, B.J. et al. 2009. Increased accuracy of artificial selection by using the realized relationship matrix. Genetics Research 91:47-60.

• Detail on the population genetics that drive nG

• Assume marker effects are random
• Still assume all markers independent and estimated separately

32

Analytical approximations

Figure, two panels (Daetwyler et al., 2008; Hayes et al., 2009), each with NP / NG on the x-axis.

33

Take Homes

• Even with traits of very low heritability (h2 = 0.01), sufficient nP gives accuracy
• Replication may not be good
• The number of loci estimated (nG) is a critical parameter
• If you don’t know where the QTL are, higher marker coverage requires higher nG
• N.B. All conclusions assume 100% LD!

34

Genetic diversity loss / inbreeding

• Daetwyler, H.D. et al. 2007. Inbreeding in genome-wide selection. J. Anim. Breed. Genet. 124:369-376

• Avoid selecting close relatives together
• What is the correlation in the estimated breeding value between full sibs?

Figure: correlation of sibling estimates.

35


Aj = ½ AS + ½ AD + aj, where aj is the Mendelian sampling term

Correlation of sibling estimates:
• BLUP: σ²B, σ²W = 0
• GS: σ²B, σ²W > 0
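A short worked version of the decomposition may help. It is a sketch under stated assumptions rather than the slide's own derivation: the parents are taken to be unrelated and non-inbred, j and k are full sibs, and σ²B and σ²W are read as the between-family and within-family variances of the estimated breeding values.

```latex
\begin{align*}
A_j &= \tfrac{1}{2} A_S + \tfrac{1}{2} A_D + a_j \\
\operatorname{Var}(A_j) &= \tfrac{1}{4}\sigma^2_A + \tfrac{1}{4}\sigma^2_A + \operatorname{Var}(a_j)
  \;\Rightarrow\; \operatorname{Var}(a_j) = \tfrac{1}{2}\sigma^2_A \\
\operatorname{corr}(\hat{A}_j, \hat{A}_k) &= \frac{\sigma^2_B}{\sigma^2_B + \sigma^2_W}
\end{align*}
```

When the estimates carry no within-family variation (σ²W = 0), full sibs get identical estimates and the correlation is 1; any within-family variation lowers it.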

36

Daetwyler et al. 2007 Take Homes

• Genomic selection captures the Mendelian sampling term.
– Correlation between the estimates of sibling performance is reduced
– Co-selection of sibs is reduced
– Rate of inbreeding / loss of diversity is reduced

37

A word on pedigree relatedness

• Five individuals, a, b, c, d, and e.
– a, b, and c unrelated
– d offspring of a and b
– e offspring of a and c

A =
     a    b    c    d    e
a    1    0    0    ½    ½
b    0    1    0    ½    0
c    0    0    1    0    ½
d    ½    ½    0    1    ¼
e    ½    0    ½    ¼    1
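The matrix for this five-individual pedigree can be reproduced with the tabular method. The sketch below is a minimal illustration; the function name `relationship_matrix` and the dictionary layout of the pedigree are assumptions made here.

```python
def relationship_matrix(pedigree):
    """Additive relationship matrix by the tabular method.

    `pedigree` maps each individual to (parent1, parent2), with None for
    unknown parents, and must list parents before their offspring.
    """
    ids = list(pedigree)
    index = {name: k for k, name in enumerate(ids)}
    n = len(ids)
    A = [[0.0] * n for _ in range(n)]
    for i, name in enumerate(ids):
        p1, p2 = pedigree[name]
        s = index.get(p1)
        d = index.get(p2)
        for j in range(i):
            # Relationship with an earlier individual: average over parents.
            a_s = A[j][s] if s is not None else 0.0
            a_d = A[j][d] if d is not None else 0.0
            A[i][j] = A[j][i] = 0.5 * (a_s + a_d)
        # Diagonal: 1 + inbreeding (half the relationship between parents).
        parents_rel = A[s][d] if s is not None and d is not None else 0.0
        A[i][i] = 1.0 + 0.5 * parents_rel
    return ids, A

pedigree = {
    "a": (None, None), "b": (None, None), "c": (None, None),
    "d": ("a", "b"),
    "e": ("a", "c"),
}
ids, A = relationship_matrix(pedigree)
```

The half-sibs d and e share only parent a, which is where the ¼ in the matrix comes from.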

38

Ridge Regression

Habier, D. et al. 2007. Genetics 177:2389-2397
Hayes, B.J. et al. 2009. Genetics Research 91:47-60.

39

Habier et al. simulation set up

40

Genetic relationship decays fast

Training population here

• Prediction from pedigree relationship loses accuracy very quickly

• Decay rate is initially more rapid then stabilizes after about 5 generations

• Rapid initial decay reflects that the closest marker may not be in highest LD with the QTL

• RR-BLUP accuracy decays more rapidly than Bayes-B because more markers absorb the effect of a QTL

41

Habier et al. 2007 Take homes

• The ability of genomic selection to capture information on genetic relatedness is valuable

• That information decays rapidly
• The amount of that information relates to the number of markers fitted by a model:
– Ridge regression > BayesB
• Bayes-B captured more LD information:
– Long-term accuracy: BayesB > Ridge regression

42

Accuracy due to relationships vs. LD

43

Stochastic vs deterministic prediction

Figure: x-axis NP / NG; series: Zhong et al., Habier et al.

44

To replicate or not to replicate

Figure: 504 lines replicated once versus 168 lines replicated three times; panels: Ridge Regression, BayesB.

45


Aj = ½ AS + ½ AD + aj, where aj is the Mendelian sampling term

Correlation of sibling estimates:
• BLUP: σ²B, σ²W = 0
• GS: σ²B, σ²W > 0

Capturing relationship information increases σ²B, NOT σ²W

46

Simulation setting: Meuwissen; Habier; Solberg

• Ne = 100; 1000 generations
• Mutation / Drift / Recombination equilibrium
• High marker mutation rate (2.5 × 10⁻³ / locus / generation); higher “haplotype mutation rate”
• Mutation effect distribution Gamma (1.66, 0.4): “effective QTL number” is only about 6 (!)
–> Watch out how you simulate!

47

Results

• Prediction accuracy estimated by simulation (MHG: Meuwissen et al. 2001; HFD: Habier et al. 2007)

          MHG    HFD
RR-BLUP   0.73   0.64
BayesB    0.85   0.69

• These accuracies are ASTOUNDING
• If h2 = 1, r = 0.71

48

Noteworthy discussion

• Markers flanking QTL not always in model
– QTL effects captured by multiple markers
– No need to “detect” QTL
• Recombination causes accuracy to decay
– Faster than if QTL captured by flanking markers
– Markers far from QTL contribute to capture its effect
• Ne / 2 markers per Morgan achieves close to maximum accuracy
– Dependent on high marker mutation rates (?)

49

Solberg et al. 2008

• Density: Number of markers per Morgan

– SSR: ¼ Ne, ½ Ne, 1 Ne, 2 Ne
– SNP: 1 Ne, 2 Ne, 4 Ne, 8 Ne

50

Zhong et al. 2009

• Zhong, S. et al. 2009. Genetics 182:355-364.

• 42 diverse 2-row barley
• 1040 markers ~ evenly spaced
• Mating designs to generate 500 high and low LD training datasets
• 20 or 80 QTL; h2 = 0.4

51

Ridge regression Vs. BayesB

Zhong et al. 2009

Figure, four panels: 20 QTL – HiLD; 20 QTL – LoLD; 80 QTL – HiLD; 80 QTL – LoLD. Each compares Ridge Regression with BayesB, with QTL observed or unobserved.

52

Take-home messages

• Ridge regression is not affected by the number of QTL / the QTL effect size

• BayesB performs better with large marker-associated effects

• Co-linearity is more detrimental to BayesB
• High marker density and training pop. size?
– Yes: BayesB
– No: RR-BLUP

53

VanRaden et al. 2009

• VanRaden, P.M. et al. 2009. Invited Review: Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 92:16-24.

54


• Some traits have major genes, others do not

55


• The larger the training population, the better. Where diminishing returns will begin is not in sight.

Figure axis label: Predictor

56

Take Homes

• Training population requirements very large
• BayesB did not help
=> No large marker-associated effects
• Like the “Case of the missing heritability” in human GWAS studies
– Are many quantitative traits driven by very low frequency variants?
– RR would capture this case better than BayesB

57

Empirical data on crops: TP size

58

Empirical data on crops: Marker No.

59

Empirical data on Humans: Marker No.

Yang et al. 2010. Nat. Genet. 10.1038/ng.608

Out of 295K SNP

60

Long-term genomic selection

• Marker data from elite six-row barley program
• 880 Markers
• 100 hidden as additive-effect QTL
• Evaluate 200 progeny, select 20
• Phenotypic compared to genomic selection

61

Breeding / model update cycles

Evaluation is possible every other season. Candidates from every other cycle can be evaluated. There is still a lag: Parents of C2 are selected based on evaluation of C0.

Figure (timeline, Season 1 to Season 6): one scheme alternates “Cross & Inbreed” with “Evaluate & Select”; the Genomic Selection scheme repeats “Cross, Inb. & Select”, with separate “Evaluate” steps.

62

Response in genotypic value

Figure: mean genotypic value (y-axis) against breeding cycle (x-axis); series: Phenotypic; Genomic, Small Training Pop; Genomic, Large Training Pop.


63

Accuracy


Figure: y-axis is mean realized accuracy.



64

Genetic variance


Figure: y-axis is mean genotypic standard deviation.



65

Lost favorable alleles


Figure: y-axis is mean number of lost favorable alleles.



66

Goddard 2008; Hayes et al. 2009

67

Response in genotypic value


Figure: y-axis is mean genotypic value; series: Unweighted, Weighted.

68

Genetic variance


Figure: y-axis is mean genotypic standard deviation; series: Unweighted, Weighted.

69

Lost favorable alleles


Figure: y-axis is mean number of lost favorable alleles; series: Unweighted, Weighted.

70

Long term genomic selection

• The acceleration of the breeding cycle is key
• Some favorable alleles will be lost
– Likely those not in LD with any marker
• Managing diversity / favorable alleles appears a good idea
• This can be done using the same data as used for genomic prediction

71

Introgressing diversity

• GS relies on marker–QTL allele association
• An “exotic” line comes from a sub-population divergent from the breeding population
• After sub-populations separate
– Drift moves allele frequencies independently
– Drift & recombination shift associations independently
• Will the GS prediction model identify valuable segments from the exotic?

72

Three approaches

• Create a bi-parental family with the exotic (Bernardo 2009)
– Develop a mini-training population for that family
– Improve the family
– Bring it into the main breeding population
• Develop a separate training population for the exotic sub-population (Ødegård et al. 2009)
• Develop a single multi-subpopulation (species-wide?) training population (Goddard 2006)

73

Need higher marker density

• Tightly-linked: ancestral LD
• Loosely-linked: sub-population specific LD

74

Consistency of association across barley subpopulations

Figure: correlation of r (y-axis, 0.0 to 1.0) against genetic distance (x-axis, 0.0 to 0.5); panels for 0 cM and 5 cM recombination distance.

75

Example: Dairy cattle breeds

Figure: prediction accuracy (y-axis, 0 to 0.7) for three training populations (TP = Holstein, TP = Jersey, Holstein + Jersey), shown separately for validation populations VP = Holstein and VP = Jersey.

76

Oat sub-populations (UOPN)

G1: N = 136; G2: N = 149; G3: N = 161

77

Combined sub-population TP(β-Glucan)

Figure: TP / VP pairings and accuracy:
• G1 with G2 and G3: 0.11
• G3 with G1 and G2: 0.50
• G2 with G1 and G3: 0.39

78

Introgressing diversity using GS

• Need higher marker density
• Analysis of consistency of r may indicate whether current density is sufficient
– Not sure we have it for barley
• If you have the density, a multi-subpopulation training population seems like a good idea
– Focuses the model on tighter ancestral LD rather than looser sub-population specific LD
