Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans Qunyuan Zhang...

Post on 13-Dec-2015


Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans

Qunyuan Zhang

Division of Statistical Genomics, Washington University School of Medicine

2

Multiple Comparison (strategy 1)

Type I error (false positive)

Type II error (false negative)

High power

Low power

P value adjustment/correction (Bonferroni, FDR)

Empirical p value (permutation, bootstrap)
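As a rough illustration of the p-value adjustments named above, the following sketch computes Bonferroni-adjusted and Benjamini-Hochberg (FDR) adjusted p-values. The function names and the example p-values are made up for illustration and do not come from the slides.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg (step-up) FDR-adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.60]  # illustrative values only
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; the FDR adjustment controls the expected share of false positives among the reported hits.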

3

Multiple Comparison (strategy 2)

Type I error (false positive)

Type II error (false negative)

Larger sample size

Meta analysis

Biological info or evidence

……

More powerful statistical approach

SMDP: Sequential Multiple Decision Procedure

4

What is SMDP?

A generalized framework for ranking and selection, using optimum sample sizes

A combination of sequential analysis and multiple hypothesis test

5

Feature 1 of SMDP

Sequential Analysis

Start from a small sample size, n0

Increase sample size, sequential test at each stage (n0+1, n0+2, …, n0+i, …)

Stop when stopping rule is satisfied

6

Feature 2 of SMDP

Multiple Decision

Independent test (binary hypothesis test): SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn are tested one by one (test 1, test 2, test 3, test 4, test 5, test 6, …, test n)

Simultaneous test (multiple hypothesis test): SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn are divided into a signal group and a noise group

7

Binary Hypothesis Test used by traditional methods

SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn

test 1 H0: Eff.(SNP1)=0 vs. H1: Eff.(SNP1)≠0

test 2 H0: Eff.(SNP2)=0 vs. H1: Eff.(SNP2)≠0

test 3 ……

test 4 ……

test 5 ……

test 6 ……

test n H0: Eff.(SNPn)=0 vs. H1: Eff.(SNPn)≠0

test-wise error and genome-wise error

multiple testing issue
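To make the gap between test-wise and genome-wise error concrete, here is a minimal sketch. It assumes the n tests are independent, which real SNP data generally are not because of linkage disequilibrium; the function name is hypothetical.

```python
def genome_wise_error(alpha, n_tests):
    """P(at least one false positive) over n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# The genome-wise error grows quickly with the number of tests.
for n in (1, 10, 100, 1000):
    print(n, genome_wise_error(0.05, n))
```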

8

Multiple Hypothesis Test used by SMDP

SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn

H1: SNP1,2,3 are truly different from the others

H2: SNP1,2,4 are truly different from the others

H3 ……

H4 ……

H5: SNP4,5,6 are truly different from the others

H6 ……

……

Hu: SNPn,n-1,n-2 are truly different from the others

Goal: search for the best one

H: any t SNPs are truly different from the others (n-t)

u = number of all possible combinations of t out of n
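The size of the hypothesis space can be checked with a short sketch. The six SNP labels and t = 3 mirror the toy hypotheses listed above; the function name is an assumption for illustration.

```python
import math
from itertools import combinations


def n_hypotheses(n, t):
    """u = number of ways to choose t 'truly different' SNPs out of n."""
    return math.comb(n, t)


snps = ["SNP1", "SNP2", "SNP3", "SNP4", "SNP5", "SNP6"]
hypotheses = list(combinations(snps, 3))  # each tuple is one candidate hypothesis
print(len(hypotheses), n_hypotheses(6, 3))
print(hypotheses[0])

# u explodes for realistic n, which is why enumeration becomes a computing problem.
print(n_hypotheses(5000, 3))
```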

9

General Rule of SMDP (Bechhofer et al., 1968)

Selecting the t best of M K-D populations

[Diagram: populations Pop. 1, Pop. 2, …, Pop. t-1, Pop. t, Pop. k+1, Pop. k+2, …, Pop. M, with D marked on the diagram, sampled sequentially at stages 1, 2, …, h, h+1, …, giving Y1,h, Y2,h, …, Yt,h, …, YM,h at stage h]

U possible combinations of t out of M

\[ U = \frac{M!}{t!\,(M-t)!} \]

For each combination u

\[ Y^{(t)}_{u,h} = \sum_{i=1}^{t} Y_{k_i,h} \]

\[ Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h} \]

Sequential statistic at stage h

\[ W_{[U],h} = \frac{\exp(D^{*} Y^{(t)}_{[U],h})}{\sum_{j=1}^{U} \exp(D^{*} Y^{(t)}_{[j],h})} \]

Stopping rule

\[ W_{[U],h} \ge P^{*} \]

Prob. of correct selection (PCS) > P*, whenever D > D*
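The sequential statistic and stopping rule on this slide can be sketched by brute force for a small number of populations. This assumes the form W = exp(D* Y_[U]) / sum_j exp(D* Y_[j]) with stopping when W >= P*; the function names and example values are hypothetical.

```python
import math
from itertools import combinations


def smdp_statistic(y, t, d_star):
    """Return (W, best) for the current per-population statistics y.

    W compares the largest t-member sum with all U = C(M, t) sums;
    best is the index tuple of the top combination.
    """
    sums = {c: sum(y[i] for i in c) for c in combinations(range(len(y)), t)}
    best = max(sums, key=sums.get)
    top = sums[best]
    # Subtract the largest sum before exponentiating to avoid overflow;
    # the ratio itself is unchanged.
    denom = sum(math.exp(d_star * (s - top)) for s in sums.values())
    return 1.0 / denom, best


def should_stop(y, t, d_star, p_star):
    """Stopping rule: stop as soon as W reaches P*."""
    w, best = smdp_statistic(y, t, d_star)
    return w >= p_star, best


y_h = [5.0, 0.1, 0.2, 0.0]  # illustrative Y values at some stage h
print(smdp_statistic(y_h, t=1, d_star=10.0))
print(should_stop(y_h, t=1, d_star=10.0, p_star=0.95))
```

Enumerating every combination is only feasible for small M and t.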

10

Koopman-Darmois (K-D) Populations (Bechhofer et al., 1968)

[Equation garbled in the transcript: an expression for D between populations i and j]

The freq/density function of a K-D population can be written in the form:

f(x)=exp{P(x)Q(θ)+R(x)+S(θ)}

A. The normal density function with unknown mean and known variance;

B. The normal density function with unknown variance and known mean;

C. The exponential density function with unknown scale parameter and known location parameter;

D. The Poisson distribution with unknown mean;

……

The distance of two K-D populations

\[ D_{i,j} = Q(\theta_i) - Q(\theta_j) \]
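As a worked example of the K-D form, take case B (normal density with known mean μ and unknown variance σ²). One possible choice of the four functions, not necessarily the parameterization used in the cited references, is:

```latex
\begin{align*}
f(x) &= \frac{1}{\sqrt{2\pi\sigma^{2}}}
        \exp\left\{ -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right\} \\
     &= \exp\left\{ (x-\mu)^{2}\left(-\frac{1}{2\sigma^{2}}\right)
        + 0 - \tfrac{1}{2}\ln(2\pi\sigma^{2}) \right\}, \\
P(x) &= (x-\mu)^{2}, \quad
Q(\theta) = -\frac{1}{2\sigma^{2}}, \quad
R(x) = 0, \quad
S(\theta) = -\tfrac{1}{2}\ln(2\pi\sigma^{2}).
\end{align*}
```

With this choice P(x) is a squared deviation, so the information accumulated for a population is a sum of squares, and two populations differ in Q(θ) only through their variances.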

11

12

Combine SMDP With Regression Model (M.A. Province, 2000, page 319)

\[ Z_{h+1} = \alpha + \beta X_{h+1} + \varepsilon_{h+1} \]

\[ r_{h+1} = Z_{h+1} - (\hat{\alpha}^{(h)} + \hat{\beta}^{(h)} X_{h+1}) \]

\[ V_{h+1} \propto r_{h+1}, \qquad V_{h+1} \sim N(0, \sigma^{2}) \]

Case B: the normal density function with unknown variance and known mean;

\[ Y_{i,h} = \sum_{j=1}^{h} V_{i,j}^{2} \]

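13

SMDP - Regression (M.A. Province, 2000)

Data pairs for a marker: (Z1, X1), (Z2, X2), (Z3, X3), …, (Zh, Xh), (Zh+1, Xh+1), …, (ZN, XN)

Sequential sum of squares of regression residuals. Yi,h denotes Y for marker i at stage h (see slide 7)

\[ Z_{h+1} = \alpha + \beta X_{h+1} + \varepsilon_{h+1} \]

\[ r_{h+1} = Z_{h+1} - (\hat{\alpha}^{(h)} + \hat{\beta}^{(h)} X_{h+1}) \]

\[ V_{h+1} = r_{h+1} \sqrt{ \frac{ h \sum_{j=1}^{h} (X_j - \bar{X}^{(h)})^{2} }{ h \sum_{j=1}^{h} (X_j - \bar{X}^{(h)})^{2} + \sum_{j=1}^{h} (X_j - X_{h+1})^{2} } } \sim N(0, \sigma^{2}) \]

\[ Y_{h+1} = \sum_{j=1}^{h+1} V_j^{2} \]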
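A minimal sketch of the sequential sum of squares of regression residuals for one marker follows. It assumes simple linear regression of Z on X and the standard recursive-residual scaling; the function name, the starting size h0 and the handling of a constant X are illustrative assumptions, not details taken from Province (2000).

```python
import math


def sequential_sum_of_squares(z, x, h0=3):
    """Return the running Y after each new pair, starting from h0 pairs.

    At each stage the regression is fitted on the first h pairs, the next
    pair is predicted, and the scaled squared residual is added to Y.
    """
    ys, total = [], 0.0
    for h in range(h0, len(z)):              # index h is pair number h+1
        xs, zs = x[:h], z[:h]
        xbar, zbar = sum(xs) / h, sum(zs) / h
        sxx = sum((v - xbar) ** 2 for v in xs)
        if sxx == 0.0:                       # X constant so far: slope undefined
            ys.append(total)                 # assumption: skip this stage
            continue
        beta = sum((a - xbar) * (b - zbar) for a, b in zip(xs, zs)) / sxx
        alpha = zbar - beta * xbar
        r = z[h] - (alpha + beta * x[h])     # one-step-ahead residual
        # Scale so that V has the same variance as the regression error.
        scale = math.sqrt(1.0 + 1.0 / h + (x[h] - xbar) ** 2 / sxx)
        v = r / scale
        total += v * v
        ys.append(total)
    return ys


x = [0, 1, 2, 0, 1, 2, 1, 0]                  # genotypes coded 0/1/2
z = [1.1, 2.9, 5.2, 0.8, 3.1, 4.9, 3.0, 1.2]  # illustrative phenotypes
print(sequential_sum_of_squares(z, x))
```

Under case B, a marker whose X explains Z well accumulates a smaller sum of squared residuals than a noise marker.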

14

A Real Data Example (M.A. Province, 2000, page 308)

15

Simulation Results (M.A. Province, 2000, page 312)

16

SMDP: Computational Problem

Sequential stage: 1, 2, 3, …, h, h+1, …, N

Statistics at stage h: Y1,h, Y2,h, …, Yk,h, Yk+1,h, Yk+2,h, …, YM,h

U sums of U possible combinations of t out of M. Each sum contains t members of Yi,h.

\[ U = \frac{M!}{t!\,(M-t)!} \]

\[ Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h} \]

\[ W_{[U],h} = \frac{\exp(D^{*} Y^{(t)}_{[U],h})}{\sum_{j=1}^{U} \exp(D^{*} Y^{(t)}_{[j],h})} \ge P^{*} \]

Computer time?

17

Simplified Stopping Rule

\[ Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[S],h} \le Y^{(t)}_{[S+1],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h} \]

\[ W^{[U-S]}_{[U],h} = \frac{\exp(D^{*} Y^{(t)}_{[U],h})}{(S-1)\exp(D^{*} Y^{(t)}_{[S],h}) + \sum_{j=S}^{U} \exp(D^{*} Y^{(t)}_{[j],h})} \ge P^{*} \]

\[ W^{[1]}_{[U],h} \le W^{[2]}_{[U],h} \le \dots \le W^{[U-2]}_{[U],h} \le W^{[U-1]}_{[U],h} = W_{[U],h} \]

U-S+1 = Top Combination Number (TCN)

TCN=2 (i.e. S=U-1, U-S=1) => the simplest stopping rule

\[ Y_{[M-t+1],h} - Y_{[M-t],h} \ge \frac{1}{D^{*}} \ln\left\{ \frac{(U-1)P^{*}}{1-P^{*}} \right\} \]

When TCN=U (i.e. S=1, U-S=U-1) => the original stopping rule

How to choose TCN? Balance between computational accuracy and computational time (Zhang & Province, 2005)
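The trade-off behind TCN can be sketched as follows. The code assumes the simplified statistic replaces the S-1 smallest combination sums by the S-th smallest one, which reduces to the original rule at TCN=U and to a closed-form threshold at TCN=2. Function names and example values are hypothetical, and the full sorted list of sums is used only so the approximation can be compared with the exact value.

```python
import math


def simplified_w(sorted_sums, tcn, d_star):
    """Approximate W from the top `tcn` combination sums (ascending input)."""
    u = len(sorted_sums)
    s = u - tcn + 1                          # TCN = U - S + 1 (S is 1-based)
    top = sorted_sums[-1]
    y_s = sorted_sums[s - 1]
    # (S - 1) copies of exp(D* Y_[S]) stand in for the S - 1 smallest terms.
    denom = (s - 1) * math.exp(d_star * (y_s - top))
    denom += sum(math.exp(d_star * (y - top)) for y in sorted_sums[s - 1:])
    return 1.0 / denom


def simplest_rule(y_sorted, t, d_star, p_star):
    """TCN = 2: compare Y_[M-t+1] - Y_[M-t] with the closed-form threshold."""
    m = len(y_sorted)
    u = math.comb(m, t)
    threshold = math.log((u - 1) * p_star / (1.0 - p_star)) / d_star
    return y_sorted[m - t] - y_sorted[m - t - 1] >= threshold


y = [0.0, 0.2, 0.5, 3.0]                     # ascending Y values, t = 1
for tcn in (2, 3, 4):
    print(tcn, simplified_w(y, tcn, d_star=2.0))
print(simplest_rule(y, t=1, d_star=2.0, p_star=0.95))
```

A smaller TCN gives a smaller (more conservative) W, so the procedure stops no earlier than the original rule while needing far fewer combination sums.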

18

19

Application to Pharmacal Genetics Data

Sample size: 85 cell lines

Genotype: 5841 SNPs

Phenotype: ViabFu7

P*=0.95, D*=10, TCN=10000

72 SNPs, P<0.01

20

SMDP for GWAS

Some technical/programming problems

1. Computer time (approximation & parallelization)

2. Missing data

3. Stability at early stage

4. Rare SNPs

SMDP analysis of GWAS data (500K chip, 1000 subjects) can now be done within 10 hours on a cluster

21

Simulation 1

5000 SNPs, 1 true signal

500 replications

22

Simulation 2: Multiple signals

Genotype data: GAW16 problem 3, 500K SNP data

Phenotype data: simulated LDL (measured at the first visit), ~6500 subjects, 200 replications

Analyses: For each replication, randomly draw 1000 SNPs without true effects and 10 SNPs with minor polygenic effects, and keep all 6 SNPs with relatively major effects, to create a subset of genotypes. Recode the genotypes to 0, 1 and 2 according to the copy number of minor alleles. Apply SMDP to the selected data and repeat the analysis over 200 replications.
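The recoding step can be sketched as below. The two-letter genotype strings and the use of None for missing calls are assumptions for illustration; the actual GAW16 file format is not described on the slide.

```python
from collections import Counter


def recode_by_minor_allele(genotypes):
    """Map genotypes such as 'AG' to 0, 1 or 2 copies of the minor allele.

    Missing calls (None) stay None; a monomorphic SNP is coded as all 0.
    """
    counts = Counter(a for g in genotypes if g for a in g)
    # With tied allele counts, the allele seen first is treated as minor.
    minor = min(counts, key=counts.get) if len(counts) >= 2 else None
    coded = []
    for g in genotypes:
        if g is None:
            coded.append(None)
        elif minor is None:
            coded.append(0)
        else:
            coded.append(g.count(minor))
    return coded


print(recode_by_minor_allele(["AA", "AG", "GG", "AA", None]))
```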

23

Modified SMDP (analysis procedure)

(1) Start analysis (or experiment) from a small sample size;

(2) Perform multiple decision analysis to simultaneously test whether a group of markers is significant;

(3) Eliminate significant markers from the list (if identified);

(4) Add one or multiple new samples to the data;

(5) Repeat (2), (3), (4) …

(6) Stop the procedure when all samples have been used and no more markers are identified.
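The six steps above can be sketched as a control loop. The `decide` callback is a hypothetical stand-in for the multiple decision analysis of step (2), and the toy thresholds exist only to exercise the loop; neither comes from the slides.

```python
def modified_smdp(n_total, markers, decide, n0=10, step=1):
    """Skeleton of the procedure; decide(n, remaining) returns the markers
    judged significant when only the first n samples are used."""
    remaining = list(markers)
    identified = {}                          # marker -> sample size when found
    n = min(n0, n_total)                     # (1) start from a small sample
    while remaining:
        hits = [m for m in decide(n, remaining) if m in remaining]  # (2)
        for m in hits:                       # (3) eliminate significant markers
            identified[m] = n
            remaining.remove(m)
        if n < n_total:
            n = min(n + step, n_total)       # (4) add samples, (5) repeat
        elif not hits:
            break                            # (6) all samples used, nothing new
    return identified, remaining


# Toy decision rule: marker m becomes significant once n reaches needed[m].
needed = {"a": 12, "b": 50}


def toy_decide(n, remaining):
    return [m for m in remaining if m in needed and n >= needed[m]]


identified, remaining = modified_smdp(30, ["a", "b", "c"], toy_decide)
print(identified, remaining)
```

Recording the sample size at which each marker is identified shows how many samples remain available for validation.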

24

ROC Curves of SMDP and Regular Regression Analyses

Ar, Br: Regular regression using all samples

As, Bs: SMDP analyses

Ars, Brs: Regular regression using SMDP’s average sample sizes (ASN)

Ar, As and Ars: Analysis of SNPs with major effects;

Br, Bs and Brs: Analysis of SNPs with minor effects.

ASN: the average sample size used in SMDP, presented as a proportion of the entire sample size.

25

Power comparison of SMDP and regular regression (type I error rate = 0.0025)

SNPs with true effects   Simulated h2   SMDP power   SMDP ASN*   SMDP Validation*   Power of regular regression using ASN
rs7672287                0.003          0.46         4432        0.26               0.40
rs1466535                0.002          0.80         4370        0.50               0.74
rs901824                 0.001          0.00         NA          NA                 0.00
rs10910457               0.005          0.74         4509        0.44               0.73
rs4648068                0.007          0.04         5550        0.00               0.05
rs2294207                0.010          1.00         2077        1.00               0.47

*Validation: proportion of significant tests (P<0.05), based on regression using the rest of the samples after SMDP stops.

*ASN: average sample number used in SMDP

Conclusion: given the same sample size, SMDP-regression is more powerful than regular regression.

26

Application to Real Data

The NHLBI Family Heart Study

Illumina HumanMap550 array data

983 subjects

Coronary Artery Calcification (CAC)

SMDP identifies 69 SNPs using fewer than 811 samples

Traditional regression analysis of all 983 samples identifies:

46122 SNPs (p<0.05)

15 SNPs (FDR<0.05), 11 identified by SMDP

1 SNP (p<0.05/500K), also identified by SMDP

27

Summary of SMDP (advantages)

Efficient use of sample size; extra samples left after stopping can be used for validation

Simultaneously tests a group of signals, avoiding one-by-one tests and p-value adjustment

Increases power (or decreases false positives) given the same average sample size

Flexible experimental design. Extra N

28

Summary of SMDP (limitations)

Compute time (needs approximation & parallelization)

Requirement of Koopman-Darmois distribution family

29

SMDP: P*, t, D*

P*: arbitrary, 0.95

t: fixed or varied

D*: indifference zone

[Diagram: populations Pop. 1, Pop. 2, …, Pop. t-1, Pop. t, Pop. t+1, Pop. t+2, …, Pop. M, with a gap D* between Q(θt) and Q(θt)+D*]

SMDP stopping rule

\[ W_{[U],h} = \frac{\exp(D^{*} Y^{(t)}_{[U],h})}{\sum_{j=1}^{U} \exp(D^{*} Y^{(t)}_{[j],h})} \ge P^{*} \]

Prob. of correct selection (PCS) > P* whenever D > D*

Correct selection: populations with Q(θ) > Q(θt)+D* are selected

30

References

R.E. Bechhofer, J. Kiefer, M. Sobel. 1968. Sequential identification and ranking procedures. The University of Chicago Press, Chicago.

M.A. Province. 2000. A single, sequential, genome-wide test to identify simultaneously all promising areas in a linkage scan. Genetic Epidemiology, 19:301-332.

Q. Zhang, M.A. Province. 2005. Simplified sequential multiple decision procedures for genome scans. 2005 Proceedings of American Statistical Association. Biometrics section: 463-468.

31

Application to GWAS

slide 9

slide 10

32

Simplified Stopping Rule

\[ Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[S],h} \le Y^{(t)}_{[S+1],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h} \]

\[ W^{[U-S]}_{[U],h} = \frac{\exp(D^{*} Y^{(t)}_{[U],h})}{(S-1)\exp(D^{*} Y^{(t)}_{[S],h}) + \sum_{j=S}^{U} \exp(D^{*} Y^{(t)}_{[j],h})} \ge P^{*} \]

\[ W^{[1]}_{[U],h} \le W^{[2]}_{[U],h} \le \dots \le W^{[U-2]}_{[U],h} \le W^{[U-1]}_{[U],h} = W_{[U],h} \]

U-S+1 = Top Combination Number (TCN)

TCN=2 (i.e. S=U-1, U-S=1) => the simplest stopping rule

\[ Y_{[M-t+1],h} - Y_{[M-t],h} \ge \frac{1}{D^{*}} \ln\left\{ \frac{(U-1)P^{*}}{1-P^{*}} \right\} \]

When TCN=U (i.e. S=1, U-S=U-1) => the original stopping rule

How to choose TCN? Balance between computational accuracy and computational time (Zhang & Province, 2005)

33

Zhang & Province, 2005, page 467

P*=0.95, D*=10, TCN=10000

72 SNPs, P<0.01

34

Simplified Stopping Rule (M.A. Province, 2000, page 321-322)

35

A Real Data Example (M.A. Province, 2000, page 310)

36

Simulation Results (2) (M.A. Province, 2000, page 313)

37

Simplified SMDP (Bechhofer et al., 1968)

\[ Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[S],h} \le Y^{(t)}_{[S+1],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h} \]

\[ W^{[U-S]}_{[U],h} = \frac{\exp(D^{*} Y^{(t)}_{[U],h})}{(S-1)\exp(D^{*} Y^{(t)}_{[S],h}) + \sum_{j=S}^{U} \exp(D^{*} Y^{(t)}_{[j],h})} \ge P^{*} \]

\[ W^{[1]}_{[U],h} \le W^{[2]}_{[U],h} \le \dots \le W^{[U-2]}_{[U],h} \le W^{[U-1]}_{[U],h} = W_{[U],h} \]

U-S+1 = Top Combination Number (TCN)

How to choose TCN?

Balance between computational accuracy and computational time

38

Relation of W and t (h=50, D*=10)

Effective Top Combination Number (ETCN)

Zhang & Province, 2005, page 465

39

ETCN Curve

Zhang & Province, 2005, page 466

40

t = ?

Zhang & Province, 2005, page 466

41

SMDP Summary

Advantages:

Test, identify all signals simultaneously, no multiple comparisons

Use “Minimal” N to find significant signals, efficient

Tight control of statistical errors (Type I, II), powerful

Save rest of N for validation, reliable

Further studies:

Computer time

Extension to more methods/models

Extension to non-K-D distributions