Marker Based Infinitesimal Model for Quantitative Trait ... · Marker Based Infinitesimal Model for...

Marker Based Infinitesimal Model for

Quantitative Trait Analysis

Shizhong Xu

Department of Botany and Plant Sciences

University of California

Riverside, CA 92521

Outline

• Quantitative trait and the infinitesimal model

• Infinitesimal model using marker information

• Adaptive infinitesimal model

• Simulation studies

• Rice and beef cattle data analyses

Outline

Quantitative Trait

Quantitative Genetics Model

Phenotype = Genotype + Environment

Infinitesimal Model

• Infinite number of genes

• Infinitely small effect of each gene

• Effect of an individual gene is not recognizable

• Collective effect of all genes are studied using

pedigree information (genetic relationship)

• Best linear unbiased prediction (BLUP)

Outline

Marker Based Infinitesimal Model

( ) ( )

j jk k jk

Different from Longitudinal Data Analysis

( ) ( )

( ) ( ) ;

y t t t

Numerical Integration

Bin Effect Model

( ) ( )

j jk k jk

j j k k k jk

j jk k jk

Bin Effects

Dense markers

Bin Bin

jk jhk

Z Z hp

Recombination Breakpoint Data

1Marker: ( )

1Breakpoint: ( )

jk jhk

Z Z hp

8 21 0 0.8

10 10jkZ

What Does a Bin Effect Represent?

1( ) ( )

size of bin k

uniform variable

j jk k jk

Assumptions of the Infinitesimal Model

• High linkage disequilibrium within a bin

• Homogeneous genetic effect within a bin

High Linkage Disequilibrium

number of crossovers, inversely

related to linkage disequilibrium

1lim var( ) , high linkage disequilibrium (F )

lim var( ) 0, low linkage disequilibrium

Larger v

ar( ) means higher powerjkZ

Range of Var(Z)

20 0 0

2 1 1 1lim var( ) lim lim

2 1 1lim var( ) lim lim 0

0 var( ) 0.5

choose var( ) as close to 0.5 as possible

but with the number of bins small enough to be

handled by a program for a given sample size

Outline

Adaptive Model Relaxes the Two

Assumptions

• High linkage disequilibrium within a bin

- prevent var(Z) from being zero

• Homogeneous genetic effect within a bin

- make all effects positive

Redefine the Bin Size by the Number

of Markers Within a Bin

number of markers in bin k

j jk k jk

jk jhk

k k kh

Z Z hp

Weighted Average Effect of a Bin

1Unweighted: ( ); ( )

1Weighted: ( ) ( ); ( ) ( )

jk j k k kh hk

jk j kh hk

Z Z h p hp

Z w h Z h w h hp

j jk k jk

Weight System

1 ˆ ˆDefine | | = mean(| |)

ˆwhere is the least squares estimate

of marker within bin

The weight for marker is defined as

ˆ ˆˆ

ˆmean(| |)ˆ| |

k h hh k h p

c b bp

p b bw c b

Weighted Var(Z*) > 0

* * * *

1var( ) var ( ) 2 cov ( ), ( )

1 1 1 2 (1 2 )

1 1 , when no linkage disequilibrium (1 2 )

jk j j jh l hk

h h l hlh l hk

h hlhk

Z Z h Z h Z lp

w w wp

Homogenization of Marker

Effects Within Bin

( ) ˆ( ) | |ˆ

( )where (a constant)

ˆ ˆ| | 0 as long as one 0

k h k k k hh h hh

k h hh

hw h c c p b

Outline

Measurement of Prediction (Cross Validation)

1ˆ( ) , Mean Squared Error

1( ) , Phenotypic Variance

, Squared Correlation

MSE y yn

MSY y yn

MSY MSER

Simulation Experiment

• Genome size = 2,500 cM

• Number of markers = 120,000

• Marker interval = 0.02 cM

• Cross validation (MSE)

• Design I = 20 QTL

• Design II = Clustered polygenic model

• Design III = Polygenic model

• Design IV = Design I with 2,500 x100 cM

True QTL Effect

0 500 1000 1500 2000 2500

true values

Position (cM)

Estimated Bin Effects0 500 1000 1500 2000 2500

(a) Δ = 1cM m = 2400

p = 50

0 500 1000 1500 2000 2500

(b) Δ = 2cM m = 1200

p = 100

0 500 1000 1500 2000 2500

(c) Δ = 5cM m = 480

p = 250

0 500 1000 1500 2000 2500

(d) Δ = 10cM

m = 240 p = 500

0 500 1000 1500 2000 2500

(e) Δ = 20cM

m = 120 p = 1000

0 500 1000 1500 2000 2500

(f) Δ = 40cM m = 60 p = 2000

Position (cM)

0 500 1000 1500 2000 2500

(g) Δ = 100cM m = 24

p = 5000

0 500 1000 1500 2000 2500

(h) Δ = 150cM m = 16

p = 7500

0 500 1000 1500 2000 2500

(i) Δ = 300cM m = 8

p = 15000

0 500 1000 1500 2000 2500

(j) Δ = 600cM

m = 4 p = 30000

0 500 1000 1500 2000 2500

(k) Δ = 1200cM

m = 2 p = 60000

0 500 1000 1500 2000 2500

(l) Δ = 2400cM m = 1 p = 120000

Position (cM)

True and Estimated QTL Effect

0 500 1000 1500 2000 2500

gtrue values

Position (cM)

0 500 1000 1500 2000 2500

Δ = 20cM

m = 120

p = 1000

Position (cM)

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

0 20 40 60 80 100

(a) 2=10, h

2=0.638

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

0 20 40 60 80 100

(b) 2=20, h

2=0.581

Bin size (cM)

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

0 20 40 60 80 100

(c) 2=50, h

2=0.457

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

(data$cm[idx])

se[idx]

0 20 40 60 80 100140

220 (d)

2=100, h

2=0.337

Bin size (cM)

n=200 n=300 n=400 n=500 n=1000

Figure 1. Mean squared error expressed as a function of bin size for Design I. The mean squared

errors were obtained from 100 replicated simulations. The overall proportion of the phenotypic

variance contributed by the 20 simulated QTL was calculated using 2 264.41/ (64.41 26.53 )h . Each panel contains the result of five different sample sizes (n).

The phenotypic variance of the simulated trait is indicated by the light horizontal line in each

panel (each panel represents one of the four different scenarios).

4.0 4.5 5.0 5.5

log10cmlist * 5000

Bin size (log10 cM)

Figure 6. Mean squared error for the simulated data under design IV (low linkage disequilibrium)

plotted against the bin size. The sample size of the simulated population was 500n . The

residual error variance was 2 20 , corresponding to 2 0.777h . The filled circles indicate the

MSE under the infinitesimal model while the open circles indicate the MSE under the adaptive

infinitesimal model. The dashed horizontal line represents the phenotypic variance of the

simulated trait (89.71).

Outline

Rice Tiller Number (Yu et al. 2011)

• Number of recombinant inbred lines: 210

• Number of SNP: 270,820

• Number of natural bins: 1619

• Number of artificial bins: vary from small to large

• Method: Empirical Bayes (eBayes)

• Cross validation: MSE and R-square

Yu et al. 2007, PLoS One 6(3) e17595

Figure 5. The MSE (curve in the left panel) and the R-square (curve in the right panel)

of the rice tiller number trait analysis, expressed as a function of bin size (artificial

bins). The black dashed horizontal line in the left panel is the phenotypic variance. The

red dashed horizontal line in the left panel is the MSE of the natural bin (without

breakpoints with bin) analysis. The red dashed horizontal line in the right panel is the

R-square of the natural bin analysis. R-square increased from 0.42 to 0.55.

Beef Cattle Data Analysis

• Trait = carcass weight

• Number of beef = 922

• Number of SNP markers = 40809

• Number of chromosomes = 29

• Methods = unweighted and weighted

5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5

Bin size (log10 bp)

Figure 7. Mean squared error for the carcass trait of beef cattle plotted against the bin size. The

filled circles indicate the MSE under the infinitesimal model while the open circles indicate the

MSE under the adaptive infinitesimal model. The dashed horizontal line represents the

phenotypic variance of the simulated trait (670.36). The blue horizontal line along with the two

dotted lines represents the MSE and the standard deviation of the MSE in the situation where the

bin size was one (one marker per bin). The sample size was 921n and the number of SNP

markers was 40809p . The bin size was defined as log10 bp. For example, the largest bin size

10log bp 8.5 means that the bin size contains 58.5 10 base pairs.