Scalable Statistical Inference for Massive Health Science Data
Xihong Lin
Department of Biostatistics and Department of Statistics
Harvard University
Examples of Genome, Exposome and Phenome
Smartphone Data
Whole Genome Sequencing
Electronic Medical Records
Genome
Exposome
Phenome
Our Niche in Big Data Era: Scalable Statistical Inference
Data, Knowledge, Actions
Scalable Statistical Inference
Broader
Partnership
Goal: To solve big problems
Whole Genome Sequencing Studies (2015-)
CCGATCCAAGTCCATATATACCGATTTAACCGAA
CCGATCCAAGTCCATATATACCAATTTAACCGAA
CCGATCCAAGTCCATACATACCGATTTAACCGAA
CCGATCCAAGTCCATACATACCGATTTAACCGAA
CCAATCCAAGTTCATATATACCGATTTGACCGAA
CCGATCTAAGTCCATATATACCGATTTAACCGAA
CCGATCCAAGTCCATACATACCGATTTAACCGAA
WGS Covers 100% of the Genome
Rare variants: >97%
GWAS common variants: <3%
Rare variants are more likely to cause diseases and their coded proteins are more likely to be drug targets.
TOPMed Freeze 5 (n=54,000): 430M Variants (97% are rare variants)
Large Scale WGS Timeline (2008, 2015, 2016, 2018)
• 1000 Genomes (N=1000)
• GSP (NHGRI) (N=200,000)
• TOPMed (NHLBI) (N=150,000)
• Biobanks (N=millions)
First Goal of WGS Analysis: Signal Detection
Scan the genome to identify genomic regions associated with diseases/traits
Challenges in Rare Variant Analysis of WGS Data
• Simple single SNP analysis does not work
• Need to perform SNP-set analysis
• Estimation is very difficult
APOE Promoter
$\text{LDL} = G\beta + \varepsilon$
Sequencing Kernel Association Test (SKAT)
• Wu, et al, 2011, AJHG. (Citations=1400)
• SMMAT
• STAAR
Generalized Higher Criticism (GHC) / Generalized Berk-Jones (GBJ) / ACAT
• Mukherjee, et al, Ann. Stat, 2015
• Barnett, et al 2017 (GHC), JASA
• Sun and Lin (GBJ), 2017
• Liu, et al (ACAT), 2018
Test for Dense & Sparse High-Dimensional Alternatives
Model: $Y = G\beta + \varepsilon$
Sparse Alternative / Dense Alternative
Model and Hypothesis
$Y_i$ is the phenotype (outcome) ($i = 1, \dots, n$)
$X_i$ contains $q$ covariates
$G_i$ contains $p$ SNPs (AA, AB, BB = 0, 1, 2) in a SNV set, e.g., variants in the promoter region of APOE.
$\alpha$ and $\beta$ contain regression coefficients.
$\mu_i = E(Y_i \mid G_i, X_i)$
Model
$$h(\mu_i) = X_i^T \alpha + G_i^T \beta$$
Hypothesis of no gene/network effect ($p$ might be large):
$$H_0: \beta = 0 \quad \text{and} \quad H_1: \beta \neq 0 \ \text{(weak)}.$$
March 8, 2019 1 / 23
• $p = \dim(\beta)$ might not be small
• Full GLMs are hard to fit due to rare variants
• Solution:
• Use score statistics $S_j = \sum_{i=1}^{n} G_{ij}(Y_i - \hat{\mu}_i)$
• Scalability: Fit the same null model $h(\mu_i) = X_i^T \alpha$ only once when scanning the genome
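The score-statistic idea above can be sketched in a few lines. This is an illustrative sketch only, assuming a null model with an intercept and no covariates, so that the fitted mean is the sample mean of the phenotype. The function name `score_statistics` and the toy data are hypothetical.

```python
def score_statistics(y, genotypes):
    """Score statistics S_j = sum_i G_ij * (y_i - mu_hat_i) for every variant j.

    Assumption for this sketch: the null model has an intercept only, so
    mu_hat_i is the sample mean of y. With covariates, mu_hat would come
    from a regression of y on X, still fitted a single time.
    """
    n = len(y)
    mu_hat = sum(y) / n                      # null model fitted once
    residuals = [yi - mu_hat for yi in y]    # reused for all variants
    n_variants = len(genotypes[0])
    return [
        sum(genotypes[i][j] * residuals[i] for i in range(n))
        for j in range(n_variants)
    ]


# Toy data: 4 subjects (rows), 2 variants (columns), genotypes coded 0/1/2.
y = [1.0, 2.0, 3.0, 6.0]
G = [[0, 1], [0, 0], [1, 0], [2, 1]]
scores = score_statistics(y, G)
```

Because the residuals are computed once and reused, the cost per variant is a single weighted sum, which is what makes a genome-wide scan feasible.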
Challenges Addressed in Scalable Inference for WGS Data
Dense/Sparse Alternatives
Unknown Truth: $k = p^{1-\alpha}$ of the $\beta_j$'s $\neq 0$
Hypothesis
$H_0: \beta = 0$ versus $H_1$: Some $\beta_j \neq 0$
Dense alternative ($\alpha < 1/2$):
Ex: $p = 100$, $\alpha = 0.4 \Rightarrow k = 16$
Sparse alternative ($\alpha > 1/2$):
Ex: $p = 100$, $\alpha = 0.6 \Rightarrow k = 7$
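The two examples can be checked numerically. The sketch below assumes the number of signals is rounded up to the next integer, which reproduces both values on the slide. The helper name `num_signals` is hypothetical.

```python
import math


def num_signals(p, alpha):
    """Number of nonzero effects k = p^(1 - alpha), rounded up to an integer."""
    return math.ceil(p ** (1.0 - alpha))


k_dense = num_signals(100, 0.4)   # dense example: alpha < 1/2
k_sparse = num_signals(100, 0.6)  # sparse example: alpha > 1/2
```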
Difficulties in Testing
No globally optimal (most powerful) test exists.
Test optimality depends on
Genotype matrix ($G$): Sparsity, LD (correlation)
Signals $\beta$: Sparsity, strength, and sign
Distribution of $Y$
• Burden ($B$) (if all variants are causal with effects ($\beta$'s) in the same direction)
$$B = \Big( \sum_{j=1}^{p} w_j S_j \Big)^2$$
• SKAT (if there are neutral variants and/or effects ($\beta$'s) in different directions)
$$Q = \sum_{j=1}^{p} w_j S_j^2$$
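A small numerical sketch shows why the two statistics suit different situations. It assumes the score statistics and weights are already available as plain lists; the function names and the toy values are hypothetical.

```python
def burden_statistic(scores, weights):
    """Burden: square of the weighted sum of score statistics."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2


def skat_statistic(scores, weights):
    """SKAT: weighted sum of squared score statistics."""
    return sum(w * s ** 2 for w, s in zip(weights, scores))


weights = [1.0, 1.0, 1.0]
same_direction = [2.0, 1.5, 1.0]      # all effects share a sign
mixed_direction = [2.0, -1.5, -0.5]   # effects cancel in the weighted sum

burden_same = burden_statistic(same_direction, weights)
burden_mixed = burden_statistic(mixed_direction, weights)
skat_mixed = skat_statistic(mixed_direction, weights)
```

With opposite-sign scores the weighted sum inside the burden statistic cancels to zero, while the SKAT statistic still accumulates every squared score.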
Dense Regime
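Sparse Regime: Higher Criticism (HC) (Tukey, 1976)
Let
$$S(t) = \sum_{j=1}^{p} 1\{|Z_j| \ge t\}$$
Assumes $\Sigma = I_p$ or sparse ($G$ is a low coherence matrix)
The HC test statistic is (Ingster, 1998; Donoho and Jin, 2003; Arias-Castro, et al, 2011)
$$HC = \sup_{t>0} \left\{ \frac{S(t) - 2p\bar{\Phi}(t)}{\sqrt{2p\bar{\Phi}(t)\,(1 - 2\bar{\Phi}(t))}} \right\}$$
where $\bar{\Phi}(t) = 1 - \Phi(t)$ is the standard normal upper-tail probability.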
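The HC statistic can be sketched as follows. Two assumptions are made here that the slide does not state: the supremum is approximated by evaluating the standardized count only at the observed magnitudes |Z_j|, and the Z_j are treated as independent standard normals under the null. The function names are hypothetical.

```python
import math


def norm_sf(t):
    """Standard normal upper-tail probability pr(Z >= t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def higher_criticism(z):
    """Largest standardized exceedance count over the observed |z_j|.

    Assumption: the supremum over t > 0 is approximated on the grid of
    observed magnitudes, a common computational shortcut.
    """
    p = len(z)
    best = -math.inf
    for t in sorted(abs(v) for v in z):
        if t <= 0.0:
            continue
        count = sum(1 for v in z if abs(v) >= t)   # S(t)
        tail = 2.0 * norm_sf(t)                    # null pr(|Z| >= t)
        variance = p * tail * (1.0 - tail)         # binomial variance
        if variance > 0.0:
            best = max(best, (count - p * tail) / math.sqrt(variance))
    return best


hc_signal = higher_criticism([0.1, -0.2, 5.0])   # one strong signal
hc_null = higher_criticism([0.5, -1.0, 0.3])     # nothing unusual
```

A single large statistic among small ones produces a very large HC value, which is the sparse-signal behaviour the test is designed for.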
The Higher Criticism
[Figure: Histogram of the $Z_i$ (density versus $t$), marking $\arg\max\{HC(t)\}$]
Linear Regression: Existing Results on Detection Boundary
Dense Regime ($\alpha \le 1/2$)
• $A \ll \sqrt{p^{\alpha - 1/2}/n} \Rightarrow$ all tests powerless.
• $A \gg \sqrt{p^{\alpha - 1/2}/n} \Rightarrow$ SKAT powerful.
Sparse Regime ($\alpha > 1/2$)
• $A < \sqrt{2t \log p / n}$, $t < \rho^*_{\text{gaussian}}(\alpha) \Rightarrow$ all tests powerless.
• $A > \sqrt{2t \log p / r}$, $t > \rho^*_{\text{gaussian}}(\alpha) \Rightarrow$ HC powerful.
Setting
Low coherence matrix $G$ (sparse correlation $\Sigma$)
$A$ = signal strength of $\beta$.
Sparsity index: $k = p^{1-\alpha}$
The results for binary regression are different from linear regression (Mukherjee, et al, 2015, Ann Stat)
If design matrices are too sparse, then signal detection is impossible no matter how strong signals are.
Two point detection boundary: Maximal Sparsity of $G$ and Minimal Signal Strength $\beta$.
Asymptotic p-values for HC Do Not Work Well for Finite p
The supremum of this standardized empirical process follows a Gumbel distribution asymptotically.
Jaeschke (1979) shows that this converges in distribution at an abysmal rate of $O\{(\log p)^{-1/2}\}$
Slow Convergence to Asymptotic Distribution of HC
[Figure: CDF($x$) versus $x$; curves: theoretical ($p = \infty$), empirical ($p = 10^2$), empirical ($p = 10^6$)]
In genetic studies, gene and network sizes
(p=# of SNPs=dozens to thousands)
Analytic p-values for HC for Finite p (Barnett and Lin, Biometrika, 2015)
Letting $h$ be the observed HC statistic:
$$\text{p-value} = \mathrm{pr}\left( \sup_{t>0} \left\{ \frac{S(t) - 2p\bar{\Phi}(t)}{\sqrt{2p\bar{\Phi}(t)\,(1 - 2\bar{\Phi}(t))}} \right\} \ge h \right)$$
There exist $0 < t_1 < \cdots < t_p$ such that
$$\text{p-value} = 1 - \mathrm{pr}\left( \bigcap_{k=1}^{p} \{ S(t_k) \le p - k \} \right)$$
Then apply the chain rule of conditioning to get a product of binomial probabilities.
Simulation Study of Type I error rates of HC: Analytic (Exact) vs Asymptotic
α              p = 10                        p = 50
1.0            9.92×10^-1 (7.31×10^-1)       1.01 (1.59×10^-1)
1.0×10^-1      1.01×10^-1 (6.03×10^-2)       9.75×10^-2 (4.90×10^-3)
1.0×10^-2      1.12×10^-2 (7.30×10^-3)       9.80×10^-3 (4.00×10^-4)
Need to account for Correlation among SNPs (LD)
CHRNA3-5 Gene Region
Accounting for correlation: Innovated HC (iHC) (Hall and Jin, 2011)
Letting $UU^T = \mathrm{Cov}(Z) = \Sigma$
Define the transformed (decorrelated) test statistics:
$$Z^* = U^{-1} Z \xrightarrow[n \to \infty]{L} MVN(0, I_p)$$
Set
$$S^*(t) = \sum_{j=1}^{p} 1\{|Z^*_j| \ge t\}$$
The innovated Higher Criticism test (iHC) statistic is:
$$iHC = \sup_{t>0} \left\{ \frac{S^*(t) - 2p\bar{\Phi}(t)}{\sqrt{2p\bar{\Phi}(t)\,(1 - 2\bar{\Phi}(t))}} \right\}$$
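The decorrelation step can be illustrated with a small sketch. It assumes U is taken to be the lower-triangular Cholesky factor of the covariance matrix, which is one valid choice among many factorizations with U U^T = Sigma. The function names and the 2 x 2 example are hypothetical.

```python
import math


def cholesky_lower(sigma):
    """Lower-triangular U with U U^T = sigma (sigma must be positive definite)."""
    p = len(sigma)
    u = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            partial = sum(u[i][k] * u[j][k] for k in range(j))
            if i == j:
                u[i][j] = math.sqrt(sigma[i][i] - partial)
            else:
                u[i][j] = (sigma[i][j] - partial) / u[j][j]
    return u


def decorrelate(z, sigma):
    """Solve U z_star = z by forward substitution, i.e. z_star = U^{-1} z."""
    u = cholesky_lower(sigma)
    z_star = []
    for i in range(len(z)):
        partial = sum(u[i][k] * z_star[k] for k in range(i))
        z_star.append((z[i] - partial) / u[i][i])
    return z_star


sigma = [[1.0, 0.5], [0.5, 1.0]]
z_star = decorrelate([1.0, 0.5], sigma)
```

In this toy case the second statistic is fully explained by its correlation with the first, so its decorrelated value is zero.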
Decorrelating dampens true signals and causes iHC to lose power: CGEM Breast Cancer GWAS: FGFR2 gene
[Figure: histograms (frequency) of the marginal test statistics $Z$ and of the decorrelated statistics $Z^*$]
Generalized Higher Criticism (GHC) (Barnett, et al, 2016, JASA)
Recall
$$S(t) = \sum_{j=1}^{p} 1\{|Z_j| \ge t\}$$
Now we allow $\Sigma$ to have arbitrary correlation structure.
$S(t)$ is no longer binomial. Instead we approximate with a Beta-binomial, matching on the first two moments.
The Generalized Higher Criticism (GHC) test statistic is:
$$GHC = \sup_{t>0} \frac{S(t) - 2p\bar{\Phi}(t)}{\sqrt{\widehat{\mathrm{Var}}(S(t))}}$$
GHC achieves the same detection boundary as HC.
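The variance estimator $\widehat{\mathrm{Var}}(S(t))$
Let $\overline{r^n} = \frac{2}{p(p-1)} \sum_{1 \le k < l \le p} (\Sigma_{kl})^n$ and let $H_i(t)$ be the Hermite polynomials: $H_0(t) = 1$, $H_1(t) = t$, $H_2(t) = t^2 - 1$ and so on. Then
$$\mathrm{Cov}\big(S(t_k), S(t_j)\big) = p\big[2\bar{\Phi}(\max\{t_j, t_k\}) - 4\bar{\Phi}(t_j)\bar{\Phi}(t_k)\big] + 4p(p-1)\,\phi(t_j)\phi(t_k) \sum_{i=1}^{\infty} \frac{H_{2i-1}(t_j)\, H_{2i-1}(t_k)\, \overline{r^{2i}}}{(2i)!}$$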
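The covariance formula with both thresholds equal gives the variance of the exceedance count, and it can be sketched directly. The sketch assumes the infinite series is truncated after a fixed number of terms (`n_terms`, a hypothetical tuning parameter), that there are at least two statistics, and that the correlation matrix is supplied as a list of lists.

```python
import math


def norm_sf(t):
    """Standard normal upper-tail probability."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def norm_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def hermite(n, t):
    """Probabilists' Hermite polynomial: H_0 = 1, H_1 = t, H_2 = t^2 - 1."""
    prev, curr = 1.0, t
    if n == 0:
        return prev
    for m in range(1, n):
        prev, curr = curr, t * curr - m * prev
    return curr


def var_exceedance_count(t, sigma, n_terms=20):
    """Variance of S(t) under correlation, with the series truncated."""
    p = len(sigma)
    tail = 2.0 * norm_sf(t)
    variance = p * tail * (1.0 - tail)            # binomial part
    pairs = [(k, l) for k in range(p) for l in range(k + 1, p)]
    for i in range(1, n_terms + 1):
        # Average of the (2i)-th powers of the pairwise correlations.
        mean_r = 2.0 / (p * (p - 1)) * sum(sigma[k][l] ** (2 * i) for k, l in pairs)
        h = hermite(2 * i - 1, t)
        variance += (4.0 * p * (p - 1) * norm_pdf(t) ** 2
                     * h * h * mean_r / math.factorial(2 * i))
    return variance


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
correlated = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
var_independent = var_exceedance_count(1.0, identity)
var_correlated = var_exceedance_count(1.0, correlated)
```

With an identity correlation matrix the correction terms vanish and the binomial variance is recovered, while correlation inflates the variance, consistent with the over-dispersion that motivates the GHC.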
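Analytic p-values for the GHC
Letting $h$ be the observed GHC statistic:
$$\text{p-value} = \mathrm{pr}\left( \sup_{t>0} \frac{S(t) - 2p\bar{\Phi}(t)}{\sqrt{\widehat{\mathrm{Var}}(S(t))}} \ge h \right)$$
There exist $0 < t_1 < \cdots < t_p$ such that
$$\text{p-value} = 1 - \mathrm{pr}\left( \bigcap_{k=1}^{p} \{ S(t_k) \le p - k \} \right)$$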
Generalized Berk-Jones
Motivation: GHC works well in the very sparse signal case but less well in the moderately sparse signal case in finite samples.
Let $s$ be the realized value of $S(t)$.
Berk-Jones (Sup LR test):
$$BJ = \max_{t>0} \log\left\{ \frac{\Pr[S(t) = s \mid \pi = s/p]}{\Pr[S(t) = s \mid \pi = \pi_0]} \right\} 1\left\{ \pi_0 < \frac{s}{p} \right\}$$
Generalized Berk-Jones (Account for correlation):
$$GBJ = \max_{t>0} \log\left\{ \frac{\Pr[S(t) = s \mid \pi = s/p,\ \mathrm{cor}(Z) = \Sigma]}{\Pr[S(t) = s \mid \pi = \pi_0,\ \mathrm{cor}(Z) = \Sigma]} \right\} 1\left\{ \pi_0 < \frac{s}{p} \right\}$$
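The Berk-Jones statistic for independent test statistics can be sketched as a binomial log likelihood ratio. Several assumptions are made here: the exceedance count is binomial, the null exceedance probability (`pi0` in the code) is the two-sided standard normal tail at the threshold, and the maximum is evaluated only at the observed magnitudes. The generalized version would replace the binomial likelihood with one that accounts for correlation, which is not attempted in this sketch.

```python
import math


def norm_sf(t):
    """Standard normal upper-tail probability."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def berk_jones(z):
    """Largest one-sided binomial log likelihood ratio over observed |z_j|."""
    p = len(z)
    best = 0.0
    for t in sorted(abs(v) for v in z):
        if t <= 0.0:
            continue
        s = sum(1 for v in z if abs(v) >= t)   # realized exceedance count
        pi0 = 2.0 * norm_sf(t)                 # null exceedance probability
        pi_hat = s / p                         # maximum likelihood estimate
        if pi0 <= 0.0 or pi_hat <= pi0:
            continue                           # indicator: only excess counts
        log_lr = s * math.log(pi_hat / pi0)
        if s < p:
            log_lr += (p - s) * math.log((1.0 - pi_hat) / (1.0 - pi0))
        best = max(best, log_lr)
    return best


bj_signal = berk_jones([0.1, -0.2, 5.0])
bj_null = berk_jones([0.5, -1.0, 0.3])
```

The binomial coefficient cancels in the ratio, so only the two probability terms are needed.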
Inference using Generalized Higher Criticism and Generalized Berk-Jones
The distribution of $S(t)$ is over-dispersed binomial and its exact distribution is hard to calculate.
Approximate the distribution of $S(t)$ using an extended beta-binomial.
The sups in GHC and GBJ are achieved at the design points, and both GHC/GBJ and their distributions are calculated analytically using approximations.
Rejection Boundary Comparisons: GHC vs GBJ
20 SNPs, 100% correlated with $\rho = 0.3$
[Figure: rejection boundary versus ordered magnitudes of test statistics, $|Z|_{(j)}$, $j = 1, \dots, 20$; curves: BJ, GBJ, HC, GHC]
Note how we gain "volume" in the rejection region near the expected signals.
Simulation (Main advantage of GBJ: Power gain in finite sample for moderate sparsity)
200 SNPs, $\rho_1 = 0.3$, $\rho_2 = 0$, $\rho_3 = 0$, $R^2 = 0.01$
[Figure: power versus number of causal SNPs (2 to 14); curves: GBJ, GHC, MinP, SKAT, OMNI]
Extremely sparse regime: 1-3 causal. Moderately sparse regime: 4-13 causal. Dense regime: 14+ causal.
Key features:
• A general method for combining p-values.
• Super fast computation under arbitrary correlation and robust to correlation.
• Powerful when signals are sparse.
• Can be used for constructing robust tests.
Sparse Regime: ACAT: Aggregated Cauchy Association Test
Yaowu Liu, et al (JASA 2018, AJHG, 2019)
$$T_{ACAT} = \sum_{i=1}^{d} w_i \tan\{(0.5 - p_i)\pi\}$$
Transform p-value to Cauchy
Weights
Aggregated Cauchy Association Test (ACAT)
Dense signals Sparse signals
Alternative
Neutral variant / Causal variant
Computation: Slow / Fast
SKAT / Burden MinP/GHC/GBJ Tests
No prior knowledge about the sparsity of signals. Need robust test.
Existing SNV-set tests
Assumptions:
I. $p_i$ is computed from $|Z_i|$ (z-score)
II. $\forall\, i, j$: $(Z_i, Z_j) \sim N_2(0, \Sigma_{ij})$
Theorem: For any $\Sigma \ge 0$, we have
$$\lim_{t \to +\infty} \frac{P\{T_{ACAT} > t\}}{P\{\text{Cauchy}(0,1) > t\}} = 1.$$
P-value calculation: $\text{p-value} \approx 1/2 - \arctan(T_{ACAT})/\pi$
Correlation of p-values: not required. Super fast.
Tail is Cauchy
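The ACAT transformation and its p-value approximation fit in a few lines. The sketch assumes the weights are normalised to sum to one, so that the standard Cauchy tail applies to the statistic, and uses equal weights by default. The function names are hypothetical; the seven input p-values are the ones listed with their Cauchy values in these slides.

```python
import math


def cauchy_transform(p_value):
    """Map a p-value to a Cauchy variate via tan((0.5 - p) * pi)."""
    return math.tan((0.5 - p_value) * math.pi)


def acat(p_values, weights=None):
    """Return the ACAT statistic and its approximate combined p-value.

    Assumption: weights are rescaled to sum to one before combining.
    """
    d = len(p_values)
    if weights is None:
        weights = [1.0 / d] * d
    total = sum(weights)
    t_stat = sum(w / total * cauchy_transform(p) for w, p in zip(weights, p_values))
    return t_stat, 0.5 - math.atan(t_stat) / math.pi


p_values = [0.45, 0.35, 0.25, 0.15, 0.05, 5e-3, 2e-3]
cauchy_sum = sum(cauchy_transform(p) for p in p_values)   # unweighted sum
t_stat, p_combined = acat(p_values)
```

The two smallest p-values contribute almost all of the unweighted sum, which is why the combined result is driven by the strongest signals. No correlation matrix enters the calculation at any point.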
Theory about ACAT
Sample mean ($\bar{X} = \frac{1}{d} \sum_{i=1}^{d} X_i$)
                          Independent                   Perfectly dependent           General Dependency
$X_i \sim$ Normal(0,1)    $\bar{X} \sim N(0, 1/d)$      $\bar{X} \sim N(0, 1)$
$X_i \sim$ Cauchy(0,1)    $\bar{X} \sim$ Cauchy(0,1)    $\bar{X} \sim$ Cauchy(0,1)    $\approx$ Cauchy(0,1)
Heavy tail makes Cauchy distribution insensitive to correlation
Some insights
P-values    Cauchy values
0.45        0.16
0.35        0.51
0.25        1.00
0.15        1.96
0.05        6.31
5e-03       63.7
2e-03       159
Sum         233
ACAT uses a few smallest p-values to represent the significance.
ACAT is powerful against sparse alternatives
[Diagram: SNVs in a region, split into MAC < 10 and MAC ≥ 10, yield a Burden p-value $p_0$ and p-values $p_1, p_2, p_3, \dots, p_k$, which are combined into $T$]
P-value $\approx 1 - F_{\text{Cauchy}}(T)$: super fast, accurate
$$w_{i,\text{ACAT-V}} = w_{i,\text{SKAT}} \times \sqrt{MAF_i (1 - MAF_i)}$$
Saddlepoint method (Dey, et al, 2017)
ACAT-V for testing a SNV-set
Key features:
• Boost RV analysis power by optimally combining statistical evidence of MAFs (default in SKAT), functional annotations, and phenotypic information
• Computationally scalable
• Applicable to any given variant-set
STAAR: variant-Set Test for Association using Annotation infoRmation
Xihao Li and Zilin Li
Optimal weighting: True effect sizes (unknown)
Signal Regions (Effect Sizes ($\beta$)) in the Genome
Question: Which functional scores should be used to boost the power of RV association analysis in a variant-set?
Use Functional Annotations to Prioritize Variants in a Variant-Set
Functional Annotation Database
WGSA
Annovar
CADD
ENCODE
EPIGENOME
Individual Scores
Choosing Weights to Empower WGS Association Analysis
>260 annotations
15 Types of Annotations
80% built on hg38
Genome Functional Variant Annotations (GSP+TOPMED) (Hufeng Zhou)
Dynamically incorporate multiple annotation weights in RV Tests
Coding Variants Non-coding Variants
Existing Integrative Annotation Scores are Mainly Driven by Protein and Conservation Scores with Little Correlation with Epigenetic Scores
• APC1: Epigenetics
• APC2: Conservation
• APC3: Protein Function
• APC4: Negative Selection
• APC5: Distance to Coding
• APC6: Mutation Density
• APC7: Transcription Factor
• APC8: MapAbility
• APC9: Distance to TEE/TSE
• APC10: MicroRNA
Correlation Heatmap with Annotation PCs (GSP Freeze 1, hg38)
STAAR: Incorporate Multiple Functional Scores to Boost Power of RV Association Analysis Using ACAT
Type I Error Rates Using STAAR are Protected: Simulated WGS data Using COSI (n = 10,000)
α = 10^-6     Continuous Traits    Dichotomous Traits
STAAR-B       1.1 × 10^-6          1.0 × 10^-6
STAAR-S       9.9 × 10^-7          7.8 × 10^-7
STAAR-O       9.3 × 10^-7          1.0 × 10^-6
STAAR-O uses ACAT to combine STAAR-B and STAAR-S
ARIC WGS data of LPA (AA, n=1800): Significant 4KB Sliding Windows in Chr 6
LPA (AA): Significant 4KB Sliding Windows in Chr 6
Area 1 and Area 2: Weights
• Scalable statistical inference is a critical niche for analysis of big data.
• It is important to integrate domain science and computational science in scalable statistical inference to accelerate statistical science and scientific discovery.
• "Optimal" statistical inference needs to be context-specific, e.g., dense and sparse regimes for high-dimensional hypothesis testing.
• Asymptotic and finite sample results are both important.
Final Remarks