Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans Qunyuan Zhang...

Page 1: Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans Qunyuan Zhang Division of Statistical Genomics Washington University.

Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans

Qunyuan Zhang

Division of Statistical Genomics
Washington University School of Medicine

Page 2

Multiple Comparison (strategy 1)

Type I error: false positive
Type II error: false negative

High power
Low power

P value adjustment/correction (Bonferroni, FDR)

Empirical p value (permutation, bootstrap)
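The two p-value corrections named on this slide can be illustrated with a minimal sketch. The function names `bonferroni` and `benjamini_hochberg` are hypothetical, "FDR" is assumed here to mean the Benjamini-Hochberg step-up adjustment (the slide does not say which FDR method is meant), and the p-values are made up for illustration.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example_pvals = [0.01, 0.04, 0.03, 0.005]  # illustrative values only
bonf = bonferroni(example_pvals)
bh = benjamini_hochberg(example_pvals)
```

Bonferroni never yields a smaller adjusted p-value than Benjamini-Hochberg on the same input, which is one way to see the power trade-off the slide alludes to.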

Page 3

Multiple Comparison (strategy 2)

Type I error: false positive
Type II error: false negative

Larger sample size
Meta analysis
Biological info or evidence
……
More powerful statistical approach

SMDP: Sequential Multiple Decision Procedure

Page 4

What is SMDP?

A generalized framework for ranking and selection, using optimum sample sizes

A combination of sequential analysis and multiple hypothesis testing

Page 5

Feature 1 of SMDP: Sequential Analysis

Start from a small sample size (n0)

Increase sample size, sequential test at each stage: n0+1, n0+2, …, n0+i, …

Stop when stopping rule is satisfied

Page 6

Feature 2 of SMDP: Multiple Decision

Independent test (binary hypothesis test): SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn are tested one at a time (test 1, test 2, test 3, test 4, test 5, test 6, …, test n)

Simultaneous test (multiple hypothesis test): SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn are divided into a signal group and a noise group

Page 7

Binary Hypothesis Test used by traditional methods

SNP1: test 1, H0: Eff.(SNP1)=0 vs. H1: Eff.(SNP1)≠0
SNP2: test 2, H0: Eff.(SNP2)=0 vs. H1: Eff.(SNP2)≠0
SNP3: test 3 ……
SNP4: test 4 ……
SNP5: test 5 ……
SNP6: test 6 ……
……
SNPn: test n, H0: Eff.(SNPn)=0 vs. H1: Eff.(SNPn)≠0

test-wise error and genome-wise error

multiple testing issue
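The gap between test-wise and genome-wise error can be made concrete with a small calculation. This is a sketch under the assumption that the n tests are independent (real SNPs are correlated through linkage disequilibrium, so the figures are only indicative); the function names are hypothetical.

```python
def genome_wise_error(test_wise_alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - test_wise_alpha) ** n_tests


def bonferroni_test_wise_alpha(genome_wise_alpha, n_tests):
    """Test-wise threshold that keeps the genome-wise error at or below the target."""
    return genome_wise_alpha / n_tests


# With a test-wise alpha of 0.05, the genome-wise error grows quickly with n.
for n in (1, 10, 100, 1000):
    print(n, genome_wise_error(0.05, n))
```

Under this assumption, holding the genome-wise error at 0.05 across 500,000 tests requires a test-wise threshold of 0.05/500000, which is the kind of stringent cutoff that motivates looking for more powerful procedures.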

Page 8

Multiple Hypothesis Test used by SMDP

Candidates: SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn

H1: SNP1,2,3 are truly different from the others
H2: SNP1,2,4 are truly different from the others
H3 ……
H4 ……
H5: SNP4,5,6 are truly different from the others
H6 ……
……
Hu: SNPn,n-1,n-2 are truly different from the others

Goal: search for the best one

H: any t SNPs are truly different from the other (n-t) SNPs

u = number of all possible combinations of t out of n
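To see what the hypothesis space looks like, the following sketch enumerates every way of choosing t SNPs out of n for a toy case (n=6, t=3). The helper name `candidate_hypotheses` and the SNP labels are illustrative assumptions; the ordering of hypotheses produced here is simply lexicographic and need not match the numbering on the slide.

```python
import itertools
import math


def candidate_hypotheses(snps, t):
    """Every choice of t SNPs as the 'truly different' group."""
    return list(itertools.combinations(snps, t))


snps = ["SNP1", "SNP2", "SNP3", "SNP4", "SNP5", "SNP6"]
hypotheses = candidate_hypotheses(snps, 3)

# u = n! / (t! (n-t)!) counts the hypotheses without listing them.
u = math.comb(len(snps), 3)
print(u, hypotheses[0], hypotheses[-1])
```

Exhaustive enumeration like this is only practical for small n; for genome-scale n the count u is used instead of the explicit list.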

Page 9

General Rule of SMDP (Bechhofer et al., 1968)

Selecting the t best of M K-D populations

Sequential sampling: stages 1, 2, …, h, h+1, …

Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t are separated by a distance D from Pop. t+1, Pop. t+2, …, Pop. M

Sequential statistic at stage h: Y_{1,h}, Y_{2,h}, …, Y_{t,h}, …, Y_{M,h}

U possible combinations of t out of M:

$$U = \frac{M!}{t!\,(M-t)!}$$

For each combination u, with member populations k_1, …, k_t:

$$Y^{(t)}_{u,h} = \sum_{i=1}^{t} Y_{k_i,h}$$

Ordered sums:

$$Y^{(t)}_{[1],h} \le Y^{(t)}_{[2],h} \le \dots \le Y^{(t)}_{[U-1],h} \le Y^{(t)}_{[U],h}$$

Stopping rule:

$$W_{[U],h} = \frac{\exp\left(D^{*}\, Y^{(t)}_{[U],h}\right)}{\sum_{j=1}^{U} \exp\left(D^{*}\, Y^{(t)}_{[j],h}\right)} \ge P^{*}$$

Prob. of correct selection (PCS) > P*, whenever D>D*
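The stopping statistic for this slide can be sketched by brute force for a small M. This is an illustrative sketch, not the authors' implementation: the function name `smdp_stop_statistic` and the toy values of Y are assumptions, and the exponentials are shifted by the top sum purely to avoid numerical overflow.

```python
import itertools
import math


def smdp_stop_statistic(y, t, d_star):
    """Return (W, top_combination) by enumerating all t-subsets of the M statistics."""
    sums = []
    for combo in itertools.combinations(range(len(y)), t):
        sums.append((sum(y[i] for i in combo), combo))
    top_sum, top_combo = max(sums)
    # W = exp(D* Y_top) / sum_j exp(D* Y_j), computed relative to the top sum.
    denom = sum(math.exp(d_star * (s - top_sum)) for s, _ in sums)
    return 1.0 / denom, top_combo


# Toy stage-h statistics for M = 5 populations (made-up numbers).
w_clear, combo_clear = smdp_stop_statistic([5.0, 4.8, 1.0, 0.9, 0.8], t=2, d_star=10)
w_unclear, _ = smdp_stop_statistic([1.0, 1.1, 1.2, 1.3, 1.4], t=2, d_star=10)

p_star = 0.95
print(w_clear >= p_star, w_unclear >= p_star)
```

In the first toy case two populations stand far apart from the rest, so W is close to 1 and sampling would stop; in the second the statistics are nearly tied, W stays well below P*, and sampling would continue.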

Page 10

Koopman-Darmois (K-D) Populations (Bechhofer et al., 1968)

The freq/density function of a K-D population can be written in the form:

f(x)=exp{P(x)Q(θ)+R(x)+S(θ)}

A. The normal density function with unknown mean and known variance;

B. The normal density function with unknown variance and known mean;

C. The exponential density function with unknown scale parameter and known location parameter;

D. The Poisson distribution with unknown mean;

……

The distance of two K-D populations

$$D_{i,j} = \left| Q(\theta_i) - Q(\theta_j) \right|$$

For case B, Q(θ) = -1/(2σ²), so

$$D_{i,j} = \left| \frac{1}{2\sigma_i^{2}} - \frac{1}{2\sigma_j^{2}} \right|$$
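As a worked check of the K-D form, two of the listed cases can be written out explicitly. The symbols μ, σ² and λ are introduced here for illustration (they are not named on the slide); the algebra is standard.

```latex
% Case B: normal density, known mean \mu, unknown variance \sigma^2
\begin{align*}
f(x) &= \frac{1}{\sqrt{2\pi\sigma^{2}}}
        \exp\left\{-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right\} \\
     &= \exp\left\{(x-\mu)^{2}\left(-\frac{1}{2\sigma^{2}}\right)
        + 0 - \tfrac{1}{2}\ln\left(2\pi\sigma^{2}\right)\right\},
\end{align*}
% so P(x) = (x-\mu)^2, Q(\theta) = -1/(2\sigma^2), R(x) = 0,
% S(\theta) = -\tfrac{1}{2}\ln(2\pi\sigma^2).

% Case D: Poisson distribution, unknown mean \lambda
\begin{align*}
f(x) &= \frac{\lambda^{x} e^{-\lambda}}{x!}
      = \exp\left\{x \ln\lambda - \ln x! - \lambda\right\},
\end{align*}
% so P(x) = x, Q(\theta) = \ln\lambda, R(x) = -\ln x!, S(\theta) = -\lambda.
```

In both cases the data enter the likelihood only through the sum of P(x) over observations, which is why a running sum can serve as the sequential statistic.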

Page 12

Combine SMDP With Regression Model (M.A. Province, 2000, page 319)

$$Z = \alpha + \beta X + \varepsilon$$

$$r_{h+1} = Z_{h+1} - \left(\hat{\alpha}^{(h)} + \hat{\beta}^{(h)} X_{h+1}\right)$$

$$V_{h+1} \propto r_{h+1}, \qquad V_{h+1} \sim N(0, \sigma^{2})$$

Case B: the normal density function with unknown variance and known mean;

$$Y_{i,h} = \sum_{j=1}^{h} V_{i,j}^{2}$$

Page 13

SMDP - Regression (M.A. Province, 2000)

Data pairs for a marker: (Z1, X1), (Z2, X2), (Z3, X3), …, (Zh, Xh), (Zh+1, Xh+1), …, (ZN, XN)

$$Z = \alpha + \beta X + \varepsilon$$

$$r_{h+1} = Z_{h+1} - \left(\hat{\alpha}^{(h)} + \hat{\beta}^{(h)} X_{h+1}\right)$$

$$V_{h+1} = r_{h+1} \sqrt{\frac{h \sum_{j=1}^{h} \left(X_j - \bar{X}^{(h)}\right)^{2}}{(h+1) \sum_{j=1}^{h} \left(X_j - \bar{X}^{(h)}\right)^{2} + h \left(X_{h+1} - \bar{X}^{(h)}\right)^{2}}} \sim N(0, \sigma^{2})$$

$$Y_{h+1} = \sum_{j=1}^{h+1} V_{j}^{2}$$

Sequential sum of squares of regression residuals

Yi,h denotes Y for marker i at stage h (see slide 7)
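The sequential sum of squares of regression residuals can be sketched as follows. This is a minimal sketch under stated assumptions, not Province's code: it assumes a simple regression of Z on X with intercept, scales each one-step-ahead residual by the usual prediction-variance factor 1 + 1/h + (X_{h+1} - mean)²/Sxx so that it has variance σ² under the model, and starts from a hypothetical initial sample size `n0`. The function name and toy data are illustrative.

```python
import math


def sequential_residual_ss(z, x, n0=3):
    """Return [(stage, Y)] where Y is the running sum of squared scaled residuals."""
    results = []
    y = 0.0
    for h in range(n0, len(z)):
        xs, zs = x[:h], z[:h]
        x_bar = sum(xs) / h
        z_bar = sum(zs) / h
        sxx = sum((xj - x_bar) ** 2 for xj in xs)
        if sxx == 0.0:
            continue  # slope not estimable yet, e.g. no genotype variation so far
        sxz = sum((xj - x_bar) * (zj - z_bar) for xj, zj in zip(xs, zs))
        beta = sxz / sxx
        alpha = z_bar - beta * x_bar
        # Residual of observation h+1 against the fit from the first h observations.
        r = z[h] - (alpha + beta * x[h])
        scale = 1.0 + 1.0 / h + (x[h] - x_bar) ** 2 / sxx
        v = r / math.sqrt(scale)
        y += v * v
        results.append((h + 1, y))
    return results


# Toy data: genotype codes X and phenotype Z (made-up numbers).
toy = sequential_residual_ss(z=[0.0, 1.0, 2.0, 2.0], x=[0, 1, 2, 1], n0=3)
print(toy)
```

Refitting from scratch at every stage keeps the sketch short; a production version would update the regression sums incrementally and would need a policy for missing data, which the deck lists among its open technical problems.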

Page 14

A Real Data Example (M.A. Province, 2000, page 308)

Page 15

Simulation Results (M.A. Province, 2000, page 312)

Page 16

SMDP: Computational Problem

Sequential stage: 1, 2, 3, …, h, h+1, …, N

Sequential statistics at stage h: Y1,h, Y2,h, …, Yk,h, Yk+1,h, Yk+2,h, …, YM,h

U sums of U possible combinations of t out of M
Each sum contains t members of Yi,h

$$U = \frac{M!}{t!\,(M-t)!}$$

$$Y^{(t)}_{[U],h} \ge Y^{(t)}_{[U-1],h} \ge \dots \ge Y^{(t)}_{[2],h} \ge Y^{(t)}_{[1],h}$$

$$W_{[U],h} = \frac{\exp\left(D^{*}\, Y^{(t)}_{[U],h}\right)}{\sum_{j=1}^{U} \exp\left(D^{*}\, Y^{(t)}_{[j],h}\right)} \ge P^{*}$$

Computer time?
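The scale of the computational problem is easy to quantify. The sketch below counts U for M=5841 (the SNP count quoted in the pharmacogenetic application later in the deck) and a few illustrative values of t; the values of t and the helper names are assumptions chosen for illustration.

```python
import math


def n_combinations(m, t):
    """Exact U = M! / (t! (M-t)!)."""
    return math.comb(m, t)


def log10_n_combinations(m, t):
    """log10(U) via log-gamma, avoiding very large integers."""
    ln_u = math.lgamma(m + 1) - math.lgamma(t + 1) - math.lgamma(m - t + 1)
    return ln_u / math.log(10)


for t in (1, 2, 3, 5, 10):
    print(t, n_combinations(5841, t), round(log10_n_combinations(5841, t), 1))
```

Even for small t the number of sums in the denominator of W grows far beyond what can be enumerated at every sequential stage, which is the motivation for a simplified stopping rule.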

Page 17

Simplified Stopping Rule

$$Y^{(t)}_{[U],h} \ge Y^{(t)}_{[U-1],h} \ge \dots \ge Y^{(t)}_{[S+1],h} \ge Y^{(t)}_{[S],h} \ge \dots \ge Y^{(t)}_{[2],h} \ge Y^{(t)}_{[1],h}$$

$$W^{[U-S]}_{[U],h} = \frac{\exp\left(D^{*}\, Y^{(t)}_{[U],h}\right)}{(S-1)\exp\left(D^{*}\, Y^{(t)}_{[S],h}\right) + \sum_{j=S}^{U} \exp\left(D^{*}\, Y^{(t)}_{[j],h}\right)} \ge P^{*}$$

$$W_{[U],h} = W^{[U-1]}_{[U],h} \ge W^{[U-2]}_{[U],h} \ge \dots \ge W^{[2]}_{[U],h} \ge W^{[1]}_{[U],h}$$

U-S+1 = Top Combination Number (TCN)

TCN=2 (i.e. S=U-1, U-S=1) => the simplest stopping rule

$$Y_{[M-t+1],h} - Y_{[M-t],h} \ge \frac{1}{D^{*}} \ln\left\{\frac{(U-1)\,P^{*}}{1-P^{*}}\right\}$$

where Y_{[i],h} is the i-th smallest of the single-marker statistics Y_{1,h}, …, Y_{M,h}

When TCN=U (i.e. S=1, U-S=U-1) => the original stopping rule

How to choose TCN? Balance between computational accuracy and computational time (Zhang & Province, 2005)
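The simplest stopping rule (TCN=2) needs only the gap between the t-th and (t+1)-th largest single-marker statistics, so it can be checked without enumerating combinations. The sketch below is illustrative: the function names and toy statistics are assumptions, and it requires t < M.

```python
import math


def simplest_rule_threshold(m, t, d_star, p_star):
    """(1/D*) ln{(U-1) P* / (1-P*)} with U = C(M, t)."""
    u = math.comb(m, t)
    return math.log((u - 1) * p_star / (1.0 - p_star)) / d_star


def simplest_rule_stops(y, t, d_star, p_star):
    """True if Y_[M-t+1] - Y_[M-t] reaches the TCN=2 threshold."""
    ordered = sorted(y)  # ascending: ordered[i-1] is Y_[i]
    m = len(ordered)
    gap = ordered[m - t] - ordered[m - t - 1]
    return gap >= simplest_rule_threshold(m, t, d_star, p_star)


# Toy statistics for M = 5, t = 2 (made-up numbers).
threshold = simplest_rule_threshold(5, 2, d_star=10, p_star=0.95)
stop_clear = simplest_rule_stops([5.0, 4.8, 1.0, 0.9, 0.8], 2, 10, 0.95)
stop_unclear = simplest_rule_stops([1.0, 1.1, 1.2, 1.3, 1.4], 2, 10, 0.95)
print(threshold, stop_clear, stop_unclear)
```

Because the TCN=2 approximation replaces every lower-ranked combination with the second-ranked one, it can only make the denominator larger, so it stops no earlier than the original rule; larger TCN values trade extra computation for a closer approximation.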

Page 19

Application to Pharmacal Genetics Data

Sample size: 85 cell lines
Genotype: 5841 SNPs
Phenotype: ViabFu7

P*=0.95, D*=10, TCN=10000

72 SNPs, P<0.01

Page 20

SMDP for GWAS

Some technical/programming problems

1. Computer time (approximation & parallelization)
2. Missing data
3. Stability at early stage
4. Rare SNPs

SMDP can now be run on GWAS data (500K chip, 1000 subjects) within 10 hours on a cluster

Page 21

Simulation 1

5000 SNPs, 1 true signal

500 replications

Page 22

Simulation 2: Multiple signals

Genotype data: GAW16 problem 3, 500K SNP data

Phenotype data: Simulated LDL (measured at the first visit), ~6500 subjects, 200 replications

Analyses: For each replication, randomly draw 1000 SNPs without true effects and 10 SNPs with minor polygenic effects, and keep all 6 SNPs with relatively major effects, to create a subset of genotypes. Recode the genotypes to 0, 1 and 2 according to the copy number of minor alleles. Apply SMDP to the selected data and repeat the analysis over 200 replications.
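The recoding step (0, 1 or 2 copies of the minor allele) can be sketched as below. The input format is an assumption made for illustration: each genotype is a two-character string such as "AG", with `None` for missing calls. The actual GAW16 file format is not described on the slide, and the function name is hypothetical.

```python
from collections import Counter


def recode_minor_allele_count(genotypes):
    """Recode two-character genotypes to the copy number of the minor allele."""
    counts = Counter(a for g in genotypes if g is not None for a in g)
    if len(counts) < 2:
        # Monomorphic (or entirely missing): no minor allele is observed.
        return [None if g is None else 0 for g in genotypes]
    # Least frequent allele; ties are broken alphabetically.
    minor = min(sorted(counts), key=lambda a: counts[a])
    return [None if g is None else g.count(minor) for g in genotypes]


coded = recode_minor_allele_count(["AA", "AG", "GG", "AA", None])
print(coded)
```

Missing genotypes are passed through as `None` rather than imputed, leaving the handling of missing data to the downstream analysis.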

Page 23

Modified SMDP (analysis procedure)

(1) Start analysis (or experiment) from a small sample size;

(2) Perform multiple decision analysis to simultaneously test whether a group of markers is significant;

(3) Eliminate significant markers from the list (if identified);

(4) Add one or multiple new samples to the data;

(5) Repeat (2),(3),(4) …

(6) Stop the procedure when all samples have been used and no more markers are identified.
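The six steps above can be sketched as a control loop. This is only a skeleton under stated assumptions: the multiple decision analysis of step (2) is abstracted into a caller-supplied function `decide(active_markers, h)` that returns the markers declared significant at sample size h, and all names and the toy decision rule are hypothetical.

```python
def modified_smdp(marker_ids, n_total, n0, step, decide):
    """Return ({marker: sample size at identification}, remaining markers)."""
    active = list(marker_ids)
    found = {}
    h = n0                                        # step (1)
    while active:
        hits = set(decide(active, h)) & set(active)   # step (2)
        for marker in hits:                       # step (3)
            found[marker] = h
        active = [m for m in active if m not in hits]
        if h >= n_total:
            if not hits:                          # step (6)
                break
            continue                              # re-test the rest at full sample size
        h = min(h + step, n_total)                # step (4), then repeat (5)
    return found, active


# Toy decision rule: each marker becomes detectable at a fixed sample size.
detect_at = {"a": 5, "b": 8, "c": 10 ** 9}


def toy_decide(active, h):
    return [m for m in active if detect_at[m] <= h]


found, remaining = modified_smdp(["a", "b", "c"], n_total=10, n0=3, step=1, decide=toy_decide)
print(found, remaining)
```

Recording the sample size at which each marker is identified makes it straightforward to compute an average sample number afterwards and to reserve the unused samples for validation.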

Page 24

ROC Curves of SMDP and Regular Regression Analyses

Ar, Br: Regular regression using all samples

As, Bs: SMDP analyses

Ars, Brs: Regular regression using SMDP’s average sample sizes (ASN)

Ar, As and Ars: Analysis of SNPs with major effects;

Br, Bs and Brs: Analysis of SNPs with minor effects.

ASN: the average sample size used in SMDP, presented as a proportion of the entire sample size.

Page 25

Power comparison of SMDP and regular regression (type I error rate = 0.0025)

SNP with true effect   Simulated h2   SMDP power   SMDP ASN*   SMDP validation**   Power of regular regression using ASN
rs7672287              0.003          0.46         4432        0.26                0.40
rs1466535              0.002          0.80         4370        0.50                0.74
rs901824               0.001          0.00         NA          NA                  0.00
rs10910457             0.005          0.74         4509        0.44                0.73
rs4648068              0.007          0.04         5550        0.00                0.05
rs2294207              0.010          1.00         2077        1.00                0.47

*ASN: Average sample number used in SMDP.
**Validation: proportion of significant tests (P<0.05), based on regression using the rest of the samples after SMDP stops.

Conclusion: given the same sample size, SMDP-regression is more powerful than regular regression.

Page 26

Application to Real Data

The NHLBI Family Heart Study
Illumina HumanMap550 array data
983 subjects
Coronary Artery Calcification (CAC)

SMDP identifies 69 SNPs using fewer than 811 samples

Traditional regression analysis of all 983 samples identifies:
46122 SNPs (p<0.05)
15 SNPs (FDR<0.05), 11 of them identified by SMDP
1 SNP (p<0.05/500K), also identified by SMDP

Page 27

Summary of SMDP (advantages)

Efficient use of sample size; extra samples left after stopping can be used for validation

Simultaneously tests a group of signals, avoiding one-by-one tests and p-value adjustment

Increases power (or decreases false positives) given the same average sample size

Flexible experimental design. Extra N

Page 28

Summary of SMDP (limitations)

Compute time (needs approximation & parallelization)

Requirement of Koopman-Darmois distribution family

Page 29

SMDP: P*, t, D*

P*: arbitrary, 0.95

t: fixed or varied

D*: indifference zone

Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t, then Pop. t+1, Pop. t+2, …, Pop. M

SMDP stopping rule

$$W_{[U],h} = \frac{\exp\left(D^{*}\, Y^{(t)}_{[U],h}\right)}{\sum_{j=1}^{U} \exp\left(D^{*}\, Y^{(t)}_{[j],h}\right)} \ge P^{*}$$

Prob. of correct selection (PCS) > P* whenever D>D*

Correct selection: populations with Q(θ) > Q(θt)+D* are selected

Diagram labels: Q(θt), Q(θt)+D*, D*

Page 30

References

R.E. Bechhofer, J. Kiefer, M. Sobel. 1968. Sequential identification and ranking procedures. The University of Chicago Press, Chicago.

M.A. Province. 2000. A single, sequential, genome-wide test to identify simultaneously all promising areas in a linkage scan. Genetic Epidemiology, 19:301-332.

Q. Zhang, M.A. Province. 2005. Simplified sequential multiple decision procedures for genome scans. 2005 Proceedings of American Statistical Association, Biometrics section: 463-468.

Page 31

Application to GWAS

slide 9

slide 10

Page 33

Zhang & Province, 2005, page 467

P*=0.95, D*=10, TCN=10000

72 SNPs, P<0.01

Page 34

Simplified Stopping Rule (M.A. Province, 2000, page 321-322)

Page 35

A Real Data Example (M.A. Province, 2000, page 310)

Page 36

Simulation Results (2) (M.A. Province, 2000, page 313)

Page 37

Simplified SMDP (Bechhofer et al., 1968)

$$Y^{(t)}_{[U],h} \ge Y^{(t)}_{[U-1],h} \ge \dots \ge Y^{(t)}_{[S+1],h} \ge Y^{(t)}_{[S],h} \ge \dots \ge Y^{(t)}_{[2],h} \ge Y^{(t)}_{[1],h}$$

$$W^{[U-S]}_{[U],h} = \frac{\exp\left(D^{*}\, Y^{(t)}_{[U],h}\right)}{(S-1)\exp\left(D^{*}\, Y^{(t)}_{[S],h}\right) + \sum_{j=S}^{U} \exp\left(D^{*}\, Y^{(t)}_{[j],h}\right)} \ge P^{*}$$

$$W_{[U],h} = W^{[U-1]}_{[U],h} \ge W^{[U-2]}_{[U],h} \ge \dots \ge W^{[2]}_{[U],h} \ge W^{[1]}_{[U],h}$$

U-S+1 = Top Combination Number (TCN)

How to choose TCN?

Balance between computational accuracy and computational time

Page 38

Relation of W and t (h=50, D*=10)

Effective Top Combination Number (ETCN)

Zhang & Province, 2005, page 465

Page 39

ETCN Curve

Zhang & Province, 2005, page 466

Page 40

t = ?

Zhang & Province, 2005, page 466

Page 41

SMDP Summary

Advantages:

Test, identify all signals simultaneously, no multiple comparisons

Use “Minimal” N to find significant signals, efficient

Tight control of statistical errors (Type I, II), powerful

Save rest of N for validation, reliable

Further studies:

Computer time

Extension to more methods/models

Extension to non-K-D distributions