Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare...

51
Scalable Statistical Inference for Massive Health Science Data Xihong Lin Department of Biostatistics and Department of Statistics Harvard University

Transcript of Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare...

Page 1: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Scalable Statistical Inference for Massive Health Science Data

Xihong Lin

Department of Biostatistics and Department of Statistics

Harvard University

Page 2: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Examples of Genome, Exposome and Phenome

Smartphone DataWhole Genome Sequencing

Electronic Medical Records

Genome

ExposomePhenome

Page 3: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Our Niche in Big Data Era: Scalable Statistical Inference

Data KnowledgeScalable

ActionsStatistical Inference

Broader

Partnership

Goal: To solve big problems

Page 4: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Whole Genome Sequencing Studies (2015-)

CCGATCCAAGTCCATATATACCGATTTAACCGAA

CCGATCCAAGTCCATATATACCAATTTAACCGAA

CCGATCCAAGTCCATACATACCGATTTAACCGAACCGATCCAAGTCCATACATACCGATTTAACCGAA

CCAATCCAAGTTCATATATACCGATTTGACCGAA

CCGATCTAAGTCCATATATACCGATTTAACCGAA

CCGATCCAAGTCCATACATACCGATTTAACCGAA

Page 5: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

WGS Covers 100% of the Genome

Rare Variants

>97%

GWAS Common Variants <3%

Rare variants are more likely to cause diseases and their coded proteins are more likely to be drug targets.

TOPMed Freeze 5 (n=54,000): 430M Variants (97% are rare variants)

Page 6: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

1000 Genomes N=1000

GSP(NHGRI)N=200,000

2008

2015

2016

TOPMed(NHLBI)

N=150,000

Large Scale WGS Timeline2018

Biobanks(N=millions)

Presenter
Presentation Notes
Animate this one using flyin
Page 7: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

First Goal of WGS Analysis:Signal Detection

Scan the genome to identify genomic regions associated with diseases/traits

Page 8: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Challenges in Rare Variant Analysis of WGS Data

• Simple single SNP analysis does not work

• Need to perform SNP-set analysis

• Estimation is very difficult

APOE Promoter

LDL= 𝐆𝐆𝐆𝐆 + 𝐞𝐞

Page 9: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Sequencing Kernel Association Test (SKAT)• Wu, et al, 2011, AJHG.

(Citations=1400)• SMMAT• STAAR

Generalized Higher Criticism (GHC) /Generalized Berk-Jones (GBJ)/ACAT

• Murkerjee, et al, Ann. Stat, 2015

• Barnett, et al 2017 (GHC), JASA

• Sun and Lin (GBJ), 2017

• Liu, et al (ACAT), 2018

Model:

Sparse AlternativeDense Alternative

Test for Dense & Sparse High-Dimensional Alternatives

𝐘𝐘 = 𝐆𝐆𝐆𝐆 + 𝐞𝐞

Page 10: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Model and Hypothesis

Yi is phenotype (outcome) (i = 1, · · · , n)

Xi contains q covariates

Gi contains p SNPs (AA, AB, BB=0,1,2) in a SNV set, e.g.,variants in the promoter region of APOE.

α and β contain regression coefficients.

µi = E (Yi |Gi ,Xi)

Model

h(µi) = XTi α +G

Ti β

Hypothesis of no gene/network effect (p might be large):

H0 : β = 0 and H1 : β 6= 0 (weak).

March 8, 2019 1 / 23

Page 11: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

• 𝑝𝑝 = dim(𝜷𝜷) might not be small

• Full GLMs hard to fit due to rare variants

• Solution:

• Use score statistics 𝑍𝑍𝑗𝑗 = ∑𝑖𝑖=1𝑛𝑛 𝐺𝐺𝑖𝑖𝑗𝑗(𝑌𝑌𝑖𝑖 − �𝜇𝜇𝑖𝑖𝑖)

• Scability: Fit the null same null model 𝑔𝑔 𝜇𝜇𝑖𝑖 = 𝑿𝑿𝒊𝒊′𝜶𝜶 only once

when scanning the genome

Challenges Addressed in Scalable Inference for WGS Data

Page 12: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Dense/Sparse Alternatives

Unknown Truth: k = p1−α of βj ’s 6= 0

Hypothesis

H0 : β = 0H1 : Some βj 6= 0

Dense alternative (α < 1/2):

Ex: p = 100, α = 0.4⇒ k = 16

Sparse alternative (α > 1/2):

Ex: p = 100, α = 0.6⇒ k = 7

March 8, 2019 2 / 23

Page 13: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Difficulties in Testing

No global optimal most powerful test exists.

Test optimality depends on

Genotype matrix(G ) : Sparsity, LD (correlation)

Signals β: Sparsity, strength, and sign

Distribution of Y

March 8, 2019 3 / 23

Page 14: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

• Burden(B) (if all variants are causal with effects (β’s) in the same direction)

𝐵𝐵 = �𝑗𝑗

𝑝𝑝

𝑤𝑤𝑗𝑗𝑍𝑍𝑗𝑗

2

• SKAT (if there are neural variants and/or with effects (β’s) in different directions)

𝑆𝑆 = �𝑗𝑗

𝑝𝑝

𝑤𝑤𝑗𝑗𝑍𝑍𝑗𝑗2

Dense Regime

Page 15: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Sparse Regime: Higher Criticism (HC) (Tukey,1976)

Let

S(t) =

p∑j=1

1{|Zj |≥t}

Assumes Σ = Ip or sparse (G is a low coherence matrix)

The HC test statistic is (Ingster, 1998; Donoho and Jin, 2003;Arias-Castro, et al, 2011)

HC = supt>0

{S(t)− 2pΦ(t)√

2pΦ(t)(1− 2Φ(t))

}

March 8, 2019 5 / 23

Page 16: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

The Higher Criticism

−4 −2 0 2 4

0.00

0.10

0.20

0.30

Histogram of the Zi

t

Den

sity

argmax{HC(t)}

March 8, 2019 6 / 23

Page 17: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Linear Regression: Existing Results on DetectionBoundary

Dense Regime (α ≤ 12) Sparse Regime (α > 1

2)

A �√

pα− 12

n⇒ all tests

powerless.A <

√2t log p

n, t <

ρ∗gaussian(α) ⇒ all tests pow-erless.

A �√

pα− 12

n⇒ SKAT pow-

erfulA >

√2t log p

r, t >

ρ∗gaussian(α) ⇒ HC powerful.

Setting

Low coherence matrix G (sparse correlation Σ)

A=signal strength of β.

Sparsity index: k = p1−α

March 8, 2019 7 / 23

Page 18: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

The results for binary regression are different fromlinear regression (Mukherjee, et al, 2015, Ann Stat)

If design matrices are too sparse, then signal detection isimpossible no matter how strong signals are.

Two point detection boundary: Maximal Sparsity of G andMinimal Signal Strength β.

March 8, 2019 8 / 23

Page 19: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Asymptotic p-values for HC Does Not Work Wellfor Finte p

The supremum of this standardized empirical process follows aGumbel distribution asymptotically.

Jaeschke (1979) shows that this converges in distribution at anabysmal rate of O{(log p)−1/2}

March 8, 2019 9 / 23

Page 20: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Slow Convergence to Asymptotic Distribution of HC

0 1 2 3 4

0.0

0.2

0.4

0.6

0.8

1.0

x

CD

F(x

)

Theoretical; p=∞ Empirical; p=102 Empirical; p=106

In genetic studies, gene and network sizes

(p=# of SNPs=dozens to thousands)

March 8, 2019 10 / 23

Page 21: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Analytic p-values for HC for Finite p(Barnett and Lin, Biometrika, 2015)

Letting h be the observed HC statistic:

p-value = pr

(supt>0

{S∗(t)− 2pΦ(t)√2pΦ(t)(1− 2Φ(t))

}≥ h

)There exists 0 < t1 < · · · < tp, such that

p-value = 1− pr

(p⋂

k=1

{S∗(tk) ≤ p − k}

)

Then apply the chain rule of conditioning to get a product ofbinomial probabilities.

March 8, 2019 11 / 23

Page 22: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Simulation Study of Type I error rates of HC:Analytic(Exact) vs Asymptotic

p

α 10 50

1·0 9·92× 10−1(7·31× 10−1) 1·01(1·59× 10−1)

1·0× 10−1 1·01× 10−1(6·03× 10−2) 9·75× 10−2(4·90× 10−3)

1·0× 10−2 1·12× 10−2(7·30× 10−3) 9·80× 10−3(4·00× 10−4)

March 8, 2019 12 / 23

Page 23: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Need to account for Correlation among SNPs (LD))

CHRNA3-5 Gene Region

March 8, 2019 13 / 23

Page 24: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Accounting for correlation: Innovated HC (iHC)(Hall and Jin, 2011)

Letting UUT = Cov(Z ) = Σ

Define the transformed (decorrelated) test statistics:

Z∗ = U

−1Z

L−−−→n→∞

MVN(0, Ip)

Set

S∗(t) =

p∑j=1

1{|Z∗j |≥t}

The innovated Higher Criticism test (iHC) statistic is:

iHC = supt>0

{S∗(t)− 2pΦ(t)√2pΦ(t)(1− 2Φ(t))

}

March 8, 2019 14 / 23

Page 25: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Decorrelating using dampens true signals and causesiHC to lose power: CGEM Breast Cancer GWAS:FGFR2 gene

Fre

quen

cy

02

46

8Z

Marginal test statistics

Fre

quen

cy

−4 −2 0 2 4

02

46

8

Z*

March 8, 2019 15 / 23

Page 26: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Generalized Higher Critcism (GHC) (Barnett, et al,2016, JASA)

Recall

S(t) =

p∑j=1

1{|Zj |≥t}

Now we allow Σ to have arbitrary correlation structure.

S(t) is no longer binomial. Instead we approximate withBeta-binomial, matching on first two moments.

The Generalized Higher Criticism (GHC) test statistic is:

GHC = supt>0

S(t)− 2pΦ(t)√Var(S(t))

GHC achieves the same as detection boundary as HC .

March 8, 2019 16 / 23

Page 27: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

The variance estimator Var(S(t))

Let rn = 2p(1−p)

∑1≤k<l≤p(Σkl)

n and let Hi(t) be the Hermite

polynomials: H0(t) = 1, H1(t) = t, H2(t) = t2 − 1 and so on. Then

Cov

(S(tk), S(tj)

)= p[2Φ(max{tj , tk})− 4Φ(tj)Φ(tk)]

+4p(p − 1)φ(tj)φ(tk)∞∑i=1

H2i−1(tj)H2i−1(tk)r 2i

(2i)!

March 8, 2019 17 / 23

Page 28: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Analytic p-values for the GHC

Letting h be the observed GHC statistic:

p-value = pr

supt>0

S(t)− 2pΦ(t)√Var(S(t))

≥ h

There exists 0 < t1 < · · · < tp, such that

p-value = 1− pr

(p⋂

k=1

{S(tk) ≤ p − k}

)

March 8, 2019 18 / 23

Page 29: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Generalized Berk-Jones

Motivation: GHC works well in the very sparse signal case butless well in the moderately sparse signal case in finite samples.

Let s be the realized value of S(t).

Berk-Jones (Sup LR test):

BJ = maxt>0

log

{Pr [S(t) = s|π = s/p]

Pr [S(t) = s|π = π0]

}1

{π0 <

s

p

}Generalized Berk-Jones (Account for correlation):

GBJ = maxt>0

log

{Pr [S(t) = s|π = s/p, cor(Z) = Σ]

Pr [S(t) = s|π = π0, cor(Z) = Σ]

}1

{π0 <

s

p

}

March 8, 2019 20 / 23

Page 30: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Inference using Generalized Higher Criticism andGeneralized Berk-Jones

The distribution of S(t) is over-dispersed binomial and its exactdistribution is hard to calculate.

Approximate the distribution of S(t) using extendedbeta-binomial.

The sups in GHC and GBJ are achieved at the design points andboth GHC/GBJ and their distributions are calculated analyticallyusing approximations.

March 8, 2019 21 / 23

Page 31: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Rejection Boundary Comparisons: GHC vs GBJ

20 SNPs, 100% correlated with ρ=0.3

2 4 6 8 10 12 14 16 18 20

01

23

4

Bo

un

da

ry

Ordered Magnitudes of Test Statistics, |Z|(j)

BJ

GBJ

HC

GHC

Note how we gain ’volume’ in the rejection region near the expectedsignals.

March 8, 2019 22 / 23

Page 32: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Simulation (Main advantage of GBJ: Power gain infinite sample for moderate sparsity)

200 SNPs, ρ1=0.3, ρ2=0, ρ3=0, R2=0.01

2 4 6 8 10 12 14

00.2

0.4

0.6

0.8

1

Pow

er

Number of causal SNPs

GBJ

GHC

MinP

SKAT

OMNI

Extremely sparse regime: 1-3 causal. Moderately sparse regime: 4-13causal. Dense regime: 14+ causal.

March 8, 2019 23 / 23

Page 33: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Key features:

• A general method for combining p-values.• Super fast computation under arbitrary correlation and robust to

correlation. • Powerful when signals are sparse.• Can be used for constructing robust test.

Sparse Regime: ACAT: Aggregated Cauchy Association Test

Yaowu Liu, et al (JASA 2018, AJHG, 2019)

Page 34: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

𝑇𝑇𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴 = �𝑖𝑖=1

𝑑𝑑

𝑤𝑤𝑖𝑖 tan 0.5 − 𝒑𝒑𝒊𝒊 𝜋𝜋

Transform p-value to Cauchy

Weights

Aggregated Cauchy Association Test (ACAT)

Page 35: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Dense signals Sparse signals

Alternative

Neutral variantCausal variant

SlowFastComputation

SKAT / Burden MinP/GHC/GBJ Tests

No prior knowledge about the sparsity of signals. Need robust test.

Existing SNV-set tests

Page 36: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Assumptions: 𝐼𝐼. 𝑝𝑝𝑖𝑖 |𝑍𝑍𝑖𝑖| (z-score) 𝐼𝐼𝐼𝐼. ∀𝑖𝑖, 𝑗𝑗, 𝑍𝑍𝑖𝑖 ,𝑍𝑍𝑗𝑗 ~𝑁𝑁2 0,Θ𝑖𝑖𝑗𝑗Theorem: For any Σ ≥ 0, we have

lim𝑡𝑡→+∞

𝑃𝑃{𝑇𝑇𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴 > 𝑡𝑡}𝑃𝑃{Cauchy 0,1 > 𝑡𝑡}

= 1.

P-value calculation:p − value ≈ 1/2 − {𝑎𝑎𝑎𝑎𝑎𝑎𝑡𝑡𝑎𝑎𝑛𝑛(𝑇𝑇𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴)}/π

Correlation of p-values Not required Super fast

Tail is Cauchy

Theory about ACAT

Page 37: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

IndependentPerfectly dependent

Sample mean ( �𝑋𝑋 = 1

𝑑𝑑∑𝑖𝑖=1𝑑𝑑 𝑋𝑋𝑖𝑖)

𝑋𝑋𝑖𝑖 ~ Cauchy(0,1)

𝑋𝑋𝑖𝑖 ~ Normal(0,1) �𝑋𝑋 ~ N(0,1/d)

�𝑋𝑋 ~ Cauchy(0,1)�𝑋𝑋 ~ Cauchy(0,1)

�𝑋𝑋 ~ N(0,1)

≈Cauchy(0,1)

General Dependency

Heavy tail makes Cauchy distribution insensitive to correlation

Some insights

Page 38: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

P-values

0.35 0.510.25 1.000.15 1.960.05 6.31

0.45 0.16

2e-03 159 5e-03 63.7

Cauchy values

233

ACAT uses a few smallest p-values to represent the significance.

ACAT is powerful against sparse alternatives

Page 39: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

…………MAC<10 MAC>=10

SNVs in a region

Burden 𝑝𝑝0 𝑝𝑝1 𝑝𝑝2 𝑝𝑝3 𝑝𝑝𝑘𝑘……

𝑇𝑇

P-value ≈ 1 − 𝐹𝐹𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐(𝑇𝑇) Super fastAccurate

𝑤𝑤𝑖𝑖,𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴−𝑉𝑉 = 𝑤𝑤𝑖𝑖,𝑆𝑆𝑆𝑆𝐴𝐴𝐴𝐴 × )𝑀𝑀𝑀𝑀𝐹𝐹𝑖𝑖(1 −𝑀𝑀𝑀𝑀𝐹𝐹𝑖𝑖

Saddlepoint method (Dey, et al, 2017)

ACAT-V for testing a SNV-set

Page 40: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Key features:

• Boost RV analysis power by optimally combining statistical evidence of MAFs (default in SKAT), functional annotations, and phenotypic information

• Computationally scalable

• Applicable to any given variant-set

STAAR: variant-Set Test for Association using Annotation infoRmation

Xihao Li and Zilin Li

Page 41: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Optimal weighting: True effect sizes (unknown)

Signal Regions (Effect Sizes (𝜷𝜷)) in the Genome

Page 42: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Question: Which functional scores to use boost power of RV association analysis in a variant-Set

Use Functional Annotations to Prioritize Variants in a Variant-Set

Page 43: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Functional Annotation Database

WGSA

Annovar

CADD

ENCODE

EPIGENOME

Individual Scores

Choosing Weights 𝒘𝒘𝒋𝒋 to Empower WGS Association Analysis

>260 annotations

15 Types of Annotations

80% built on hg38

Genome Functional Variant Annotations (GSP+TOPMED) Hufeng Zhou)

Dynamically incorporate multiple annotation weights in RV Tests

Page 44: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Coding Variants Non-coding Variants

Existing Integrative Annotation Scores are Mainly Driven by Protein and Conservation Scores with Little Correlation with Epigenetic Scores

Page 45: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

• APC1: Epigenetics• APC2: Conservation• APC3: Protein Function• APC4: Negative Selection• APC5: Distance to Coding• APC6: Mutation Density• APC7: Transcription Factor• APC8: MapAbility• APC9: Distance to TEE/TSE• APC10: MicroRNA

Correlation Heatmap with Annotation PCs (GSP Freeze 1, hg38)

Page 46: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

STAAR: Incorporate Multiple Functional Scores to Boost Power of RV Association Analysis Using ACAT

Page 47: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Type I Error Rates Using STAAR are Protected: Simulated WGS data Using COSI (n = 10,000)

𝜶𝜶 = 𝟏𝟏𝟎𝟎−𝟔𝟔 Continuous Traits Dichotomous Traits

STAAR-B 1.1 × 10−6 1.0 × 10−6

STAAR-S 9.9 × 10−7 7.8 × 10−7

STAAR-O 9.3 × 10−7 1.0 × 10−6

STAAR-O uses ACAT to combine STAAR-B and STAAR-O

Page 48: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

ARIC WGS data of LPA (AA, n=1800): Significant 4KB Sliding Windows in Chr 6

Page 49: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

LPA (AA): Significant 4KB Sliding Windows in Chr 6

Page 50: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

Area 1 and Area 2: Weights

Page 51: Scalable Statistical Inference for Massive Health Science DataWGS Covers 100% of the Genome Rare Variants >97% GWAS Common Variants

• Scalable statistical inference is a critical niche for analysis of big data.

• It is important to integrate domain science and computational science in scalable statistical inference to accelerate statistical science and scientific discovery.

• “Optimal” statistical inference needs to context-specific, e.g., dense and sparse regimes for high-dimensional hypothesis testing

• Asymptotic and finite sample results are both important.

Final Remarks