BAYESIAN HIGH-DIMENSIONAL MODELS WITH SCALE-MIXTURE SHRINKAGE PRIORS
By
RAY BAI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2018
© 2018 Ray Bai
Dedicated to Mom, Dad, and Will
ACKNOWLEDGMENTS
I owe my deepest gratitude to my PhD advisor Dr. Malay Ghosh. It has truly been an
honor to work under his supervision. In addition to being an outstanding scholar and mentor,
he is also an excellent teacher, and I was fortunate to take four courses with him during my
PhD studies. I also thank Dr. Kshitij Khare, Dr. Nikolay Bliznyuk, and Dr. Arunava Banerjee
for serving on my PhD committee and for providing valuable comments on my dissertation.
I have had the pleasure to take courses with all of my committee members, and I learned a
lot of things from them. I am especially indebted to Dr. Bliznyuk for serving as my Master of
Statistics’ advisor and for involving me in several of his research projects early on.
I also owe a great deal of gratitude to the faculty and staff at the Department of Statistics
at the University of Florida. In addition to the members of my PhD committee, I am also
grateful to Dr. Sophia Su, Dr. Andrew Rosalsky, and Dr. Hani Doss for helping me to develop
a firm foundation in advanced linear algebra, mathematical analysis, probability theory, and
mathematical statistics. Thank you to Tina Greenly, Christine Miron, and Bill Campbell for
their help with administrative tasks. Thank you to Dr. Jim Hobert and Maria Ripol for offering
valuable guidance for navigating graduate school and for their advice on teaching.
I am thankful to my cohort and friends in graduate school: Syed Rahman, Peyman Jalali,
Andrey Skripnikov, Hunter Merrill, Isaac Duerr, Mingyuan Gao, Qian Qin, Tamal Ghosh, Tuo
Chen, Minji Lee, Zeren Xing, Ethan Alt, Grant Backlund, Saptarshi Chakraborty, Satyajit
Ghosh, and Xueying Tang. Thank you for all of the fun times and for the valuable discussions
and help when I needed it.
Finally, I want to thank my boyfriend Will Haslam and my parents for always believing
in me. I would not have survived graduate school without their love, encouragement, and
support. I love you guys very much.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 LITERATURE REVIEW
  1.1 The Sparse Normal Means Problem
  1.2 Bayesian Methods for Sparse Estimation
    1.2.1 Spike-and-Slab
    1.2.2 Scale-Mixture Shrinkage Priors
  1.3 Minimax Estimation and Posterior Contraction
    1.3.1 Sparse Normal Vectors in the Nearly Black Sense
    1.3.2 Theoretical Results for Spike-and-Slab Priors
    1.3.3 Theoretical Results for Scale-Mixture Shrinkage Priors
  1.4 Signal Detection Through Multiple Hypothesis Testing
    1.4.1 Asymptotic Bayes Optimality Under Sparsity
    1.4.2 ABOS of Thresholding Rules Based on Scale-Mixture Shrinkage Priors
  1.5 Sparse Univariate Linear Regression
    1.5.1 Frequentist Approaches
    1.5.2 Bayesian Approaches
      1.5.2.1 Spike-and-slab
      1.5.2.2 Continuous Shrinkage Priors
    1.5.3 Posterior Consistency for Univariate Linear Regression
  1.6 Sparse Multivariate Linear Regression
    1.6.1 Frequentist Approaches
    1.6.2 Bayesian Approaches
    1.6.3 Reduced Rank Regression

2 THE INVERSE GAMMA-GAMMA PRIOR FOR SPARSE ESTIMATION
  2.1 The Inverse Gamma-Gamma (IGG) Prior
  2.2 Concentration Properties of the IGG Prior
    2.2.1 Notation
    2.2.2 Concentration Inequalities for the Shrinkage Factor
  2.3 Posterior Behavior Under the IGG Prior
    2.3.1 Minimax Posterior Contraction Under the IGG Prior
    2.3.2 Kullback-Leibler Risk Bounds
  2.4 Simulation Study
    2.4.1 Computation and Selection of Hyperparameters
    2.4.2 Simulation Study for Sparse Estimation
  2.5 Analysis of a Prostate Cancer Data Set
  2.6 Concluding Remarks

3 MULTIPLE HYPOTHESIS TESTING WITH THE INVERSE GAMMA-GAMMA PRIOR
  3.1 Classification Using the Inverse Gamma-Gamma Prior
    3.1.1 Notation
    3.1.2 Thresholding the Posterior Shrinkage Weight
  3.2 Asymptotic Optimality of the IGG Classification Rule
  3.3 Simulation Study
  3.4 Analysis of a Prostate Cancer Data Set
  3.5 Concluding Remarks

4 HIGH-DIMENSIONAL MULTIVARIATE POSTERIOR CONSISTENCY UNDER GLOBAL-LOCAL SHRINKAGE PRIORS
  4.1 Multivariate Bayesian Model with Shrinkage Priors (MBSP)
    4.1.1 Preliminary Notation and Definitions
    4.1.2 MBSP Model
    4.1.3 Handling Sparsity
  4.2 Posterior Consistency of MBSP
    4.2.1 Notation
    4.2.2 Definition of Posterior Consistency
    4.2.3 Sufficient Conditions for Posterior Consistency
      4.2.3.1 Low-Dimensional Case
      4.2.3.2 Ultrahigh Dimensional Case
    4.2.4 Sufficient Conditions for Posterior Consistency of MBSP
  4.3 Implementation of the MBSP Model
    4.3.1 TPBN Family
    4.3.2 The MBSP-TPBN Model
      4.3.2.1 Computational Details
      4.3.2.2 Specification of Hyperparameters τ, d, and k
    4.3.3 Variable Selection
  4.4 Simulations and Data Analysis
    4.4.1 Simulation Studies
    4.4.2 Yeast Cell Cycle Data Analysis
  4.5 Concluding Remarks

5 SUMMARY AND FUTURE WORK
  5.1 Summary
  5.2 Future Work
    5.2.1 Extensions of the Inverse Gamma-Gamma Prior
    5.2.2 Extensions to Bayesian Multivariate Linear Regression with Shrinkage Priors

APPENDIX

A PROOFS FOR CHAPTER 2
  A.1 Proofs for Section 2.1
  A.2 Proofs for Section 2.3.1
  A.3 Proofs for Section 2.3.2

B PROOFS FOR CHAPTER 3

C PROOFS FOR CHAPTER 4
  C.1 Proofs for Section 4.2.3
    C.1.1 Proof of Theorem 4.1
    C.1.2 Proof of Theorem 4.2
  C.2 Proofs for Section 4.2.4
    C.2.1 Preliminary Lemmas
    C.2.2 Proofs for Theorem 4.3 and Theorem 4.4

D GIBBS SAMPLER FOR THE MBSP-TPBN MODEL
  D.1 Full Conditional Densities for the Gibbs Sampler
  D.2 Fast Sampling of the Full Conditional Density for B
  D.3 Convergence of the Gibbs Sampler

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

1-1 Polynomial-tailed priors, their respective prior densities for π(ξi) up to normalizing
constant C, and the slowly-varying component L(ξi).

2-1 Comparison of average squared error loss for the posterior median estimate of θ
across 100 replications. Results are reported for the IGG1/n, DL (Dirichlet-Laplace),
HS (horseshoe), and the HS+ (horseshoe-plus).

2-2 The z-scores and the effect size estimates for the top 10 genes selected by Efron
(2010), under the IGG, DL, HS, and HS+ models and the two-groups empirical Bayes
model of Efron (2010).

3-1 Comparison of false discovery rate (FDR) for different classification methods under
dense settings. The IGG1/n has the lowest FDR of all the different methods.

4-1 Simulation results for MBSP-TPBN, compared with MBGL-SS, MLASSO, SRRR,
and SPLS, averaged across 100 replications.

4-2 Results for analysis of the yeast cell cycle data set. The MSPE has been scaled by
a factor of 100. In particular, all five models selected the three TFs, ACE2, SWI5,
and SWI6 as significant.
LIST OF FIGURES

2-1 Marginal density of the IGG prior in Eq. 2–5 with hyperparameters a = 0.6, b = 0.4,
in comparison to other shrinkage priors. The DL1/2 prior is the marginal density
for the Dirichlet-Laplace density with D(1/2, ..., 1/2) specified as a prior in the
Bayesian hierarchy.

3-1 Comparison between the posterior inclusion probabilities and the posterior shrinkage
weights 1 − E(κi | Xi) when p = 0.10.

3-2 Estimated misclassification probabilities. The thresholding rule in Eq. 3–1 based on
the IGG posterior mean is nearly as good as the Bayes Oracle rule in Eq. 1–16.

3-3 Posterior mean E(θ | X) vs. X plot for p = 0.25.

4-1 Plots of the estimates and 95% credible bands for four of the 10 TFs that were
deemed significant by the MBSP-TPBN model. The x-axis indicates time (minutes)
and the y-axis indicates the estimated coefficients.

D-1 History plots of the first 10,000 draws from the Gibbs sampler for the MBSP-TPBN
model described in Section D.1, for randomly drawn coefficients bij in B0 from
Experiments 5 and 6 in Section 4.4.1. The top two plots are taken from Experiment 5
(n = 100, p = 500, q = 3), and the bottom two plots are taken from Experiment 6
(n = 150, p = 1000, q = 4).
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
BAYESIAN HIGH-DIMENSIONAL MODELS WITH SCALE-MIXTURE SHRINKAGE PRIORS
By
Ray Bai
August 2018
Chair: Malay Ghosh
Major: Statistics
High-dimensional data is ubiquitous in many modern applications as diverse as medicine,
machine learning, electronic health records, engineering, and finance. As technological
advances have produced larger and more complex data sets, scientists have been faced
with greater challenges. In this dissertation, we address three major challenges in modern
high-dimensional statistics: 1) estimation of sparse noisy vectors, 2) signal detection, and 3)
multivariate linear regression where the number of covariates is larger than the sample size.
To tackle these problems, we work within the Bayesian framework, using continuous shrinkage
priors which can be expressed as scale mixtures of normal densities.
We first review the literature on the methodological and theoretical developments for
these three problems. We then introduce a new fully Bayesian scale-mixture shrinkage prior
known as the inverse gamma-gamma (IGG) prior to handle both the tasks of sparse estimation
of noisy vectors and signal detection. We show that the IGG’s posterior distribution contracts
around the true mean vector at the (near) minimax rate and that the IGG posterior concentrates
at a faster rate than other popular Bayes estimators in the Kullback-Leibler (K-L) sense. To
detect signals, we also propose a hypothesis test based on thresholding the posterior mean
under the IGG prior. Taking the loss function to be the expected number of misclassified tests, our
test procedure is shown to be asymptotically Bayes optimal under sparsity.
Finally, we consider sparse Bayesian estimation in the classical multivariate linear
regression model. We propose a new method, known as the Multivariate Bayesian model with
Shrinkage Priors (MBSP), to estimate the unknown coefficient matrix. We also develop
new theory for posterior consistency under the Bayesian multivariate regression framework,
including the ultrahigh-dimensional setting where the number of covariates grows at nearly
exponential rate with the sample size. We prove that MBSP achieves strong posterior
consistency in both low-dimensional and ultrahigh-dimensional scenarios.
CHAPTER 1
LITERATURE REVIEW
1.1 The Sparse Normal Means Problem
Suppose we observe a noisy n-component random observation (X1, ...,Xn) ∈ Rn, such
that
Xi = θi + ϵi , i = 1, ..., n, (1–1)
where ϵi ∼ N (0, 1), i = 1, ..., n. In the high-dimensional setting where n is very large, sparsity
is a very common phenomenon. That is, in the unknown mean vector θ = (θ1, ..., θn), only
a few of the θi ’s are nonzero. Under the model in Eq. 1–1, we are primarily interested in
separating the signals (θi ≠ 0) from the noise (θi = 0) and giving robust estimates of the
signals.
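The model in Eq. 1–1 is straightforward to simulate. The following Python sketch generates a nearly black mean vector and a noisy observation of it; the dimensions and the signal strength are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the text): n observations,
# of which only q_n carry a true signal.
n, q_n = 1000, 10

# Nearly black mean vector theta: q_n nonzero entries, the rest exactly zero.
theta = np.zeros(n)
theta[:q_n] = 5.0  # signal strength chosen arbitrarily for the sketch

# Observation model of Eq. 1-1: X_i = theta_i + eps_i with eps_i ~ N(0, 1).
X = theta + rng.standard_normal(n)

# Only the first q_n coordinates of X reflect signal; the rest are pure noise.
print(np.count_nonzero(theta), X.shape)  # 10 (1000,)
```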
This simple model is the basis for a number of high-dimensional problems, such as image
reconstruction, genetics, and wavelet analysis (Johnstone & Silverman (2004)). For example, if
we wish to reconstruct an image from millions of pixels of data, only a few pixels are typically
needed to recover the objects of interest. In genetics, we may have tens of thousands of
gene expression data points, but only a few are significantly associated with the phenotype
of interest. For instance, the Wellcome Trust (2007) study confirmed that only seven genes have a
non-negligible association with Type I diabetes. These applications demonstrate that sparsity is
a fairly reasonable assumption for θ in Eq. 1–1.
Existing frequentist methods for obtaining a sparse estimate of θ in Eq. 1–1 include the
popular LASSO (Tibshirani (1996)) and its many variants (see, for example, Zou & Hastie
(2005), Zou (2006), Yuan & Lin (2006), Belloni et al. (2011), and Sun & Zhang (2012)).
All of these methods use either an ℓ1 or a combination of an ℓ1 and ℓ2 penalty function to
shrink many of the θi ’s to zero. These methods are able to produce point estimates with good
theoretical and empirical properties. However, in many cases, it is desirable to obtain not only
a point estimate but a realistic characterization of uncertainty in the parameter estimates.
In high-dimensional settings, frequentist approaches to characterizing uncertainty, such as
bootstrapping or constructing confidence regions, can break down (Bhattacharya et al. (2015)).
For example, Camponovo (2015) recently showed that the bootstrap does not provide a valid
approximation of the distribution of the LASSO estimator. Bayesian approaches to estimating
θ, on the other hand, give a natural way to quantify uncertainty through the posterior density.
As we illustrate later, Bayesian point estimates such as the median or mean also have desirable
frequentist properties.
1.2 Bayesian Methods for Sparse Estimation
While frequentist methods for estimating θ in Eq. 1–1 induce sparsity through a penalty
function on the entries θi , i = 1, ..., n, Bayesian approaches obtain sparse estimates by placing
a carefully constructed prior on θ. Spike-and-slab priors and scale-mixture priors are two of the
most commonly used priors for sparse normal means estimation.
1.2.1 Spike-and-Slab
Spike-and-slab priors are a particularly appealing way to model sparsity. The original
spike-and-slab model, introduced by Mitchell & Beauchamp (1988), was of the form,
π(θi) = (1− p)δ0(θi) + pψ(θi |λ), i = 1, ..., n, (1–2)
where δ0(·) is the “spike” distribution (a point mass at zero), ψ(·|λ) is an absolutely
continuous “slab” distribution, indexed by a hyper-parameter λ, and p is a mixing proportion.
The “spike” forces some coefficients to zero, while the “slab” models the signals. Typically,
λ > 0 is chosen to be large so that the “slab” is very diffuse and signals can be identified with
high probability.
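As a concrete illustration, draws from the prior in Eq. 1–2 can be simulated as below. Taking the slab ψ(·|λ) to be a N(0, λ²) density, and the particular values of p and λ, are assumptions made purely for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hyperparameters for the sketch: mixing weight p and slab scale lam.
p, lam, n = 0.1, 10.0, 100_000

# Eq. 1-2: with probability 1 - p, theta_i = 0 (the point-mass "spike");
# with probability p, theta_i is drawn from the "slab", here taken to be
# a diffuse N(0, lam^2) density for concreteness.
is_signal = rng.random(n) < p
theta = np.where(is_signal, rng.normal(0.0, lam, n), 0.0)

# Roughly a fraction p of the draws is nonzero; the rest are exactly zero.
print(np.count_nonzero(theta) / n)  # approximately 0.1
```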
The choice of (p,λ) is crucial for good performance of spike-and-slab models in Eq. 1–2.
Johnstone & Silverman (2004) utilized an empirical Bayes variant of Eq. 1–2. They used a
restricted marginal maximum likelihood estimate of p and a sufficiently heavy-tailed density
for ψ(·|λ) (i.e. tails at least as heavy as the Laplace distribution). They also considered a fully
Bayesian variant where a suitable beta prior was placed on p.
Despite their interpretability, these point-mass mixtures face computational difficulties
in high dimensions. Because the point-mass mixture in Eq. 1–2 is discontinuous, this model
requires searching over 2n possible models. To circumvent this problem, fully continuous
variants of spike-and-slab densities have been developed. These continuous spike-and-slab
models are of the form,
π(θi) = (1− p)ψ(θi |λ1) + pψ(θi |λ2), i = 1, ..., n, (1–3)
where ψ(·|λi), i = 1, 2 represents a symmetric unimodal density centered at zero, ψ(θi |λ1)
models the spike, and ψ(θi |λ2) models the slab. Typically, λ1 and λ2 are chosen so that their
respective densities have a small and a large variance, respectively. George & McCulloch
(1993) proposed the stochastic search variable selection (SSVS) method, which places a
mixture prior of two normal densities with different variances (one small and one large) on each
of the θi ’s. More recently, Ročková & George (2016) introduced the spike-and-slab LASSO
(SSL), which is a mixture of two Laplace densities with different variances (one small and one
large).
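A minimal sketch of an SSVS-style continuous spike-and-slab density of the form in Eq. 1–3, where ψ(·|λ1) and ψ(·|λ2) are normal densities; the spike and slab standard deviations below are illustrative choices, not prescribed values.

```python
import math

# Hypothetical SSVS-style settings for Eq. 1-3 (not values from the text):
# a narrow "spike" normal and a wide "slab" normal.
p, sd_spike, sd_slab = 0.1, 0.05, 5.0

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def ssvs_prior(theta):
    """Density of (1 - p) N(0, sd_spike^2) + p N(0, sd_slab^2) at theta."""
    return (1 - p) * normal_pdf(theta, sd_spike) + p * normal_pdf(theta, sd_slab)

# A sharp peak at the origin pulls small coefficients toward zero, while the
# slab keeps the tails from vanishing, so large coefficients are not overshrunk.
print(ssvs_prior(0.0) > ssvs_prior(3.0))  # True
print(ssvs_prior(3.0) > 0.0)              # True
```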
1.2.2 Scale-Mixture Shrinkage Priors
Because of the computational difficulties of (point-mass) spike-and-slab priors, a rich
variety of continuous shrinkage priors which can be expressed as scale mixtures of normal
densities has also been developed. These priors behave similarly to spike-and-slab priors but
require significantly less computational effort. They mimic the model in Eq. 1–2 in that
they contain significant probability around zero so that most coefficients are shrunk to zero.
However, they retain heavy enough tails in order to correctly identify and prevent overshrinkage
of the true signals. These priors typically take the form
θi | σi² ∼ N(0, σi²), σi² ∼ π(σi²), i = 1, ..., n, (1–4)
where π : [0,∞) → [0,∞) is a density on the positive reals. π may depend on further
hyperparameters, on which additional priors may or may not be placed. Priors on
σi² in Eq. 1–4 may be either independent for each i = 1, ..., n, in which case the θi coefficients
are independent a posteriori, or they may contain hyperpriors on shared hyperparameters,
in which case, θi ’s are a posteriori dependent. We refer to priors of the form in Eq. 1–4 as
scale-mixture shrinkage priors.
Global-local (GL) shrinkage priors comprise a wide class of scale-mixture shrinkage priors.
GL priors take the form
θi |τ , ξi ∼ N (0, ξiτ), ξi ∼ f , τ ∼ g, (1–5)
where τ is a global shrinkage parameter that shrinks all θi ’s to the origin, while the local
scale parameters ξi ’s control the degree of individual shrinkage. If g puts sufficient mass near
zero and f is an appropriately chosen heavy-tailed density, then GL priors can approximate
the model in Eq. 1–2 through a continuous density concentrated near zero with heavy tails.
Examples of GL shrinkage priors include the horseshoe prior (Carvalho et al. (2010)) and the
Bayesian lasso (Park & Casella (2008)). The horseshoe in particular is very popular and has the
hierarchical form
θi | τ, ξi ∼ N(0, ξiτ), √ξi ∼ C+(0, 1), √τ ∼ C+(0, 1), (1–6)
where C+(0, 1) denotes a half-Cauchy density with scale 1. Global-local shrinkage priors have
also been considered by numerous authors, including Strawderman (1971), Berger (1980),
Armagan et al. (2011), Polson & Scott (2012), Armagan et al. (2013b), Griffin & Brown
(2013), and Bhadra et al. (2017). Armagan et al. (2011) noted that a number of these priors
utilize a beta prime density as the prior for π(ξi) and referred to this general class of shrinkage
priors as the “three parameter beta normal” (TPBN) mixture family. The TPBN family in
particular includes the horseshoe, the Strawderman-Berger (Strawderman (1971) and Berger
(1980)), and the normal-exponential-gamma (NEG; Griffin & Brown (2013)) priors. Polson &
Scott (2012) also generalized the beta prime density to the family of hypergeometric inverted
beta (HIB) priors. Finally, Armagan et al. (2013b) introduced another general class of priors
called the generalized double Pareto (GDP) family.
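Sampling from the horseshoe hierarchy in Eq. 1–6 requires nothing more than half-Cauchy and normal draws, as in the following sketch; the number of coordinates is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000  # number of coordinates, chosen for illustration

# Horseshoe hierarchy of Eq. 1-6: sqrt(xi_i) ~ C+(0, 1) and sqrt(tau) ~ C+(0, 1),
# i.e. absolute values of standard Cauchy draws.
sqrt_tau = abs(rng.standard_cauchy())        # one global scale shared by all i
sqrt_xi = np.abs(rng.standard_cauchy(n))     # local scales, one per coordinate
theta = rng.normal(0.0, sqrt_xi * sqrt_tau)  # theta_i | tau, xi_i ~ N(0, xi_i * tau)

# The draws show the two hallmarks of the prior: a dense spike of near-zero
# values and occasional very large values from the heavy tails.
print(np.median(np.abs(theta)), np.abs(theta).max())
```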
Table 1-1. Polynomial-tailed priors, their respective prior densities for π(ξi) up to normalizing
constant C, and the slowly-varying component L(ξi).

Prior         π(ξi)/C                                          L(ξi)
Student's t   ξi^(−a−1) exp(−a/ξi)                             exp(−a/ξi)
Horseshoe     ξi^(−1/2) (1 + ξi)^(−1)                          ξi^(a+1/2)/(1 + ξi)
Horseshoe+    ξi^(−1/2) (ξi − 1)^(−1) log(ξi)                  ξi^(a+1/2) (ξi − 1)^(−1) log(ξi)
NEG           (1 + ξi)^(−1−a)                                  {ξi/(1 + ξi)}^(a+1)
TPBN          ξi^(u−1) (1 + ξi)^(−a−u)                         {ξi/(1 + ξi)}^(a+u)
GDP           ∫₀^∞ (λ²/2) exp(−λ²ξi/2) λ^(2a−1) exp(−ηλ) dλ    ∫₀^∞ t^a exp(−t − η√(2t/ξi)) dt
HIB           ξi^(u−1) (1 + ξi)^(−(a+u)) exp{−s/(1 + ξi)}      {ξi/(1 + ξi)}^(a+u) exp{−s/(1 + ξi)}
              × {ϕ² + (1 − ϕ²)/(1 + ξi)}^(−1)                  × {ϕ² + (1 − ϕ²)/(1 + ξi)}^(−1)
Ghosh et al. (2016) observed that for a large number of GL priors of the form in Eq. 1–5,
the local parameter ξi has a hyperprior distribution π(ξi) that can be written as
π(ξi) = K ξi^(−a−1) L(ξi), (1–7)

where K > 0 is the constant of proportionality, a is a positive real number, and L is a positive
measurable, non-constant, slowly varying function over (0,∞).
Definition 1.1. A positive measurable function L defined over (A,∞), for some A ≥ 0, is said
to be slowly varying (in Karamata’s sense) if for every fixed α > 0, limx→∞ L(αx)/L(x) = 1.
A thorough treatment of functions of this type can be found in the classical text by
Bingham et al. (1987). Table 1-1 provides a list of several well-known global-local shrinkage
priors that fall in the class of priors of the form given in Eq. 1–5, the corresponding density
π(ξi) for ξi , and the slowly-varying component L(ξi) in Eq. 1–7. Following Tang et al. (2017),
we refer to these scale-mixture priors as polynomial-tailed priors.
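Definition 1.1 is easy to verify numerically for the entries of Table 1-1. The sketch below checks the horseshoe's slowly varying component, which for a = 1/2 reduces to L(ξ) = ξ/(1 + ξ).

```python
# Numerical check of Definition 1.1 for the horseshoe's slowly varying
# component in Table 1-1: with a = 1/2, L(xi) = xi^(a+1/2)/(1 + xi) = xi/(1 + xi).

def L_horseshoe(xi):
    return xi / (1.0 + xi)

alpha = 7.0  # any fixed alpha > 0 works in Definition 1.1
for x in (1e2, 1e4, 1e6):
    print(x, L_horseshoe(alpha * x) / L_horseshoe(x))  # ratios approach 1
```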
1.3 Minimax Estimation and Posterior Contraction
In this section, we review the theory for Bayesian estimation of θ in the sparse normal
means model in Eq. 1–1. In particular, we are interested in studying the frequentist properties
of Bayesian estimates of θ.
1.3.1 Sparse Normal Vectors in the Nearly Black Sense
Suppose that we observe X = (X1, ...,Xn) ∈ Rn from Eq. 1–1. Let ℓ0[qn] denote the
subset of Rn given by
ℓ0[qn] = {θ ∈ Rn : #{1 ≤ j ≤ n : θj ≠ 0} ≤ qn}. (1–8)
If θ ∈ ℓ0[qn] with qn = o(n) as n → ∞, we say that θ is sparse in the “nearly black sense.”
Let θ0 = (θ01, ..., θ0n) be the true mean vector. In their seminal work, Donoho et al. (1992)
showed that for any estimator θ̂ of θ, the corresponding minimax risk with respect
to the ℓ2 norm is given by

infθ̂ supθ0∈ℓ0[qn] Eθ0 ||θ̂ − θ0||₂² = 2qn log(n/qn)(1 + o(1)), as n → ∞. (1–9)
In Eq. 1–9, Eθ0 denotes expectation with respect to the Nn(θ0, In) distribution. Equation 1–9
effectively states that in the presence of sparsity, a minimax-optimal estimator only loses a
logarithmic factor (in the ambient dimension) as a penalty for not knowing the true locations
of the zeroes. Moreover, Eq. 1–9 implies that we only need a number of replicates on the order
of the true sparsity level qn to consistently estimate θ0.
In order for the performance of Bayesian estimators to be compared with frequentist
ones, we say that a Bayesian point estimator θ̂B attains the minimax risk (up to a multiplicative
constant) if

supθ0∈ℓ0[qn] Eθ0 ||θ̂B − θ0||₂² ≍ qn log(n/qn). (1–10)
Examples of potential choices for θ̂B include the posterior median or the posterior mean (as in
Johnstone & Silverman (2004)), or the posterior mode (as in Ročková (2018)). Equation 1–10
pertains only to a particular point estimate. For a fully Bayesian interpretation, we say that the
posterior distribution contracts around the true θ0 at a rate at least as fast as the minimax ℓ2
risk if

supθ0∈ℓ0[qn] Eθ0 Π(θ : ||θ − θ0||₂² > Mn qn log(n/qn) | X) → 0, (1–11)
for every Mn → ∞ as n → ∞. On the other hand, in another seminal paper, Ghosal et al.
(2000) showed that the posterior distribution cannot contract faster than the minimax rate of
qn log(n/qn) around the truth. Hence, the optimal rate of contraction of a posterior distribution
around the true θ0 must be the minimax optimal rate in Eq. 1–9, up to some multiplicative
constant. In other words, if we use a fully Bayesian model to estimate a “nearly black”
normal mean vector, the minimax optimal rate should be our benchmark, and the posterior
distribution should capture the true θ0 in a ball of squared radius at most qn log(n/qn) (up to a
multiplicative constant) with probability tending to one as n → ∞.
1.3.2 Theoretical Results for Spike-and-Slab Priors
There is a large body of theoretical evidence in favor of point-mass mixture priors in Eq.
1–2 (for instance, see George & Foster (2000), Johnstone & Silverman (2004), Johnstone
& Silverman (2005), Abramovich et al. (2007), and Castillo & van der Vaart (2012)). As
remarked by Carvalho et al. (2009), a carefully chosen “two-groups” model can be considered a
“gold standard” for sparse problems. Using the empirical Bayes variant of Eq. 1–2, Johnstone
& Silverman (2004) showed that if the tails of ψ(·|λ) are at least as heavy as Laplace but not
heavier than Cauchy and if we take a restricted marginal maximum likelihood estimator for p,
both the posterior mean and median contract around the true θ0 at minimax rate. They also
showed that with a suitable beta prior on p, the entire posterior distribution contracts at the
minimax rate established in Eq. 1–11.
Recently, minimax-optimality results have also been obtained for continuous spike-and-slab
priors of the form given in Eq. 1–3 by Ročková & George (2016). Specifying a normal density
for ψ(·|λi), i = 1, 2, in Eq. 1–3 does not enable us to obtain minimax-optimality results
because the tails are insufficiently heavy. However, Ročková (2018) showed that minimax
optimality could be achieved for their spike-and-slab LASSO model, where Eq. 1–3 is a mixture
of two Laplace densities with differing variances instead. Specifically, Ročková (2018) showed
that by specifying suitable variances for the Laplace densities and a particular fixed value for
p (all of which depend on sample size n), the posterior mode under the SSL prior attains the
minimax risk in Eq. 1–10. Going further, Ročková (2018) also established that by placing an
appropriate beta prior on the mixing proportion p, the entire posterior distribution of the SSL
contracts at (near) minimax rate in Eq. 1–11.
1.3.3 Theoretical Results for Scale-Mixture Shrinkage Priors
In the statistical literature, there are also many minimax optimality results for global-local
shrinkage priors introduced in Section 1.2.2. van der Pas et al. (2014) showed that by either
treating the global parameter τ in Eq. 1–5 as a tuning parameter that decays to zero at an
appropriate rate as n → ∞ (that is, τ ≡ τn → 0 as n → ∞) or by giving an empirical Bayes
estimate τ based on an estimate of the sparsity level, the posterior mean under the horseshoe
prior in Eq. 1–6 attains the minimax risk in Eq. 1–10, possibly up to a multiplicative constant.
van der Pas et al. (2014) showed that for the same choices of τn or τ̂, the entire posterior
distribution for the horseshoe in Eq. 1–6 keeps pace with the posterior mean and contracts at
the minimax rate. Ghosh & Chakrabarti (2017) extended the work of van der Pas et al. (2014)
by showing that when τ → 0 at an appropriate rate and the true sparsity level is known, the
posterior distribution under a wider class of GL priors (including the student-t prior, the TPBN
family, and the GDP family) contracts at the minimax rate.
All the aforementioned results for global-local shrinkage priors in Eq. 1–5 have required
setting a rate for τ a priori or estimating τ through empirical Bayes in order to achieve the
minimax posterior contraction. Results for fully Bayesian global-local shrinkage priors have
also recently been discovered. Bhattacharya et al. (2015) developed a prior known as the
Dirichlet-Laplace prior, which contains a D(a, ..., a) prior in the scale component and a
gamma prior on the global parameter τ ∼ G(na, 1/2). Bhattacharya et al. (2015) showed
that the Dirichlet-Laplace prior could attain the minimax posterior contraction rate, provided
that an appropriate rate is placed on a and provided that there is a restriction on the signal
size. Specifically, they required that ||θ0||₂² ≤ qn log⁴ n. Recently, van der Pas et al. (2017a)
were also able to attain near-minimax posterior contraction for the horseshoe prior in Eq. 1–6 by
placing a prior on τ and restricting the support of τ to be the interval [1/n, 1].
Moving beyond the global-local framework, van der Pas et al. (2016) provided conditions
for which the posterior distribution under any scale-mixture shrinkage prior of the form
in Eq. 1–4 achieves the minimax posterior contraction rate, provided that the θi ’s are a
posteriori independent. Their result is quite general and covers a wide variety of priors,
including the inverse Gaussian prior, the normal-gamma prior (Griffin & Brown (2010)), and
the spike-and-slab LASSO (Ročková (2018)).
These results for scale-mixture shrinkage priors demonstrate that although scale-mixture
shrinkage priors do not contain a point mass at zero, they mimic the point mass in the
traditional spike-and-slab model in Eq. 1–2 well enough. Meanwhile, their heavy tails ensure
that large observations are not overshrunk.
1.4 Signal Detection Through Multiple Hypothesis Testing
In addition to robust estimation of θ in Eq. 1–1, we are often interested in detecting the
true signals (or non-zero entries) within θ. Here, we are essentially conducting n simultaneous
hypothesis tests, H0i : θi = 0 vs. H1i : θi ≠ 0, i = 1, ..., n. The problem of signal detection for a
noisy vector therefore can be recast as a multiple hypothesis testing problem.
Using the two-components model in Eq. 1–2 as a benchmark, Bogdan et al. (2011)
studied the risk properties of multiple testing rules within the decision theoretic framework
where each θi is truly generated from a two-groups model. Specifically, Bogdan et al.
(2011) considered a symmetric 0-1 loss for each test, with the total risk taken to be the expected
total number of misclassified tests. Below we describe this framework and review some of the
recent work on thresholding rules for scale-mixture shrinkage priors within this framework.
1.4.1 Asymptotic Bayes Optimality Under Sparsity
Suppose we observe X = (X1, ..., Xn), such that Xi ∼ N(θi, 1), for i = 1, ..., n. To
identify the true signals in X, we conduct n simultaneous tests: H0i : θi = 0 against H1i : θi ≠
0, for i = 1, ..., n. For each i, θi is assumed to be generated by a true data-generating model,

θi i.i.d.∼ (1 − p)δ{0} + pN(0, ψ²), i = 1, ..., n, (1–12)
where N(0, ψ²) with ψ² > 0 is a diffuse "slab" density. This point mass mixture model is often
considered a theoretical ideal for generating a sparse vector θ in the statistical literature.
Indeed, Carvalho et al. (2009) referred to the model in Eq. 1–12 as a “gold standard” for
sparse problems.
The model in Eq. 1–12 is equivalent to assuming that for each i, θi is a random
variable whose distribution is determined by a latent binary random variable νi, where νi = 0
denotes the event that H0i is true, while νi = 1 corresponds to the event that H0i is false. Here
the νi's are assumed to be i.i.d. Bernoulli(p) random variables, for some p ∈ (0, 1). Under H0i,
θi ∼ δ{0}, the distribution having point mass 1 at 0, while under H1i, θi ≠ 0 and is assumed to
follow an N (0,ψ2) distribution with ψ2 > 0. The marginal distributions of the Xi ’s are then
given by the following two-groups model:
Xi i.i.d.∼ (1 − p)N(0, 1) + pN(0, 1 + ψ²), i = 1, ..., n. (1–13)
Our testing problem is now equivalent to testing simultaneously
H0i : νi = 0 versus H1i : νi = 1 for i = 1, ..., n. (1–14)
We consider a symmetric 0-1 loss for each individual test and the total loss of a multiple
testing procedure is assumed to be the sum of the individual losses incurred in each test.
Letting t1i and t2i denote the probabilities of Type I and Type II errors of the ith test,
respectively, the Bayes risk of a multiple testing procedure under the two-groups model
in Eq. 1–13 is given by

R = ∑_{i=1}^{n} {(1 − p)t1i + p t2i}. (1–15)
Bogdan et al. (2011) showed that the rule which minimizes the Bayes risk in Eq. 1–15 is the
test which, for each i = 1, ..., n, rejects H0i if
f(xi | νi = 1) / f(xi | νi = 0) > (1 − p)/p,  i.e.  Xi² > c², (1–16)

where f(xi | νi = 1) denotes the marginal density of Xi under H1i, while f(xi | νi = 0) denotes
that under H0i, and c² ≡ c²_{ψ,f} = ((1 + ψ²)/ψ²)(log(1 + ψ²) + 2 log f), with f = (1 − p)/p.
The above rule is known as the Bayes Oracle, because it makes use of the unknown parameters
ψ and p, and hence it is not attainable in finite samples. Reparametrizing as u = ψ² and
v = uf², the above threshold becomes

c² ≡ c²_{u,v} = (1 + 1/u)(log v + log(1 + 1/u)).
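As a quick numerical check, the two parametrizations of the oracle threshold agree exactly. The sketch below is illustrative code (not part of the dissertation); the function names and the example values ψ² = 4, p = 0.1 are our own assumptions.

```python
import math

def oracle_threshold_uv(u, v):
    """Bayes Oracle threshold c^2 in the (u, v) parametrization,
    where u = psi^2 and v = u * f^2 with f = (1 - p) / p."""
    return (1.0 + 1.0 / u) * (math.log(v) + math.log(1.0 + 1.0 / u))

def oracle_threshold_psi_f(psi2, f):
    """The same threshold in the original (psi^2, f) parametrization."""
    return ((1.0 + psi2) / psi2) * (math.log(1.0 + psi2) + 2.0 * math.log(f))

# Example: psi^2 = 4 and p = 0.1, so f = (1 - p)/p = 9 and v = u * f^2 = 324.
u, f = 4.0, 9.0
c2 = oracle_threshold_uv(u, u * f ** 2)
# Identical, since (1 + 1/u)(log v + log(1 + 1/u)) expands to
# ((1 + u)/u)(log(1 + u) + 2 log f) when v = u f^2.
assert abs(c2 - oracle_threshold_psi_f(u, f)) < 1e-12
```

The Bayes Oracle then rejects H0i whenever Xi² exceeds the computed c².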
Bogdan et al. (2011) considered the following asymptotic scheme.
Assumption 1. The sequence of vectors (ψn, pn) satisfies the following conditions:
1. pn → 0 as n → ∞.
2. un = ψn² → ∞ as n → ∞.
3. vn = un fn² = ψn²((1 − pn)/pn)² → ∞ as n → ∞.
4. (log vn)/un → C ∈ (0, ∞) as n → ∞.
Bogdan et al. (2011) provided detailed insight into the constant C. Summarizing briefly, if
C = 0, then both the Type I and Type II errors vanish asymptotically, while for C = ∞, the inference is
essentially no better than tossing a coin. Under Assumption 1, Bogdan et al. (2011) showed
that the corresponding asymptotic optimal Bayes risk has a particularly simple form, which is
given by
R_Opt^BO = n((1 − p)t1^BO + p t2^BO) = np(2Φ(√C) − 1)(1 + o(1)), (1–17)

where the o(1) terms tend to zero as n → ∞ and Φ(·) denotes the standard normal
cumulative distribution function (cdf). A testing procedure with risk R is said to be
asymptotically Bayes optimal under sparsity (ABOS) if

R / R_Opt^BO → 1 as n → ∞. (1–18)
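The leading term np(2Φ(√C) − 1) of Eq. 1–17 is easy to evaluate, which makes the role of C concrete. The snippet below is an illustrative sketch (the function names and example values are ours, not the dissertation's), computing Φ via the error function.

```python
import math

def normal_cdf(x):
    """Standard normal CDF Phi(x) via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def abos_optimal_risk(n, p, C):
    """Leading term of the asymptotic optimal Bayes risk in Eq. 1-17:
    n * p * (2 * Phi(sqrt(C)) - 1)."""
    return n * p * (2.0 * normal_cdf(math.sqrt(C)) - 1.0)

# n = 10000 tests, signal proportion p = 0.01, so n*p = 100 true signals.
# C = 0 gives vanishing risk; large C pushes the risk toward n*p,
# the cost of missing essentially every signal.
risks = [abos_optimal_risk(10000, 0.01, C) for C in (0.0, 1.0, 9.0)]
```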
1.4.2 ABOS of Thresholding Rules Based on Scale-Mixture Shrinkage Priors
Bogdan et al. (2011) gave conditions under which traditional multiple testing rules,
such as the Benjamini & Hochberg (1995) procedure for controlling the false discovery rate
or the Bonferroni family-wise error adjustment procedure, are ABOS. Thresholding rules based on
scale-mixture shrinkage priors in Eq. 1–4 have also recently been considered.
While scale-mixture shrinkage priors are attractive because of their computational
efficiency, they do not produce exact zeroes as estimates. Therefore, to classify estimates as
signals or noise, one must use some sort of thresholding rule. Thresholding rules based on the
posterior mean for global-local shrinkage priors in Eq. 1–5 have been studied extensively in the
literature.
One easily sees that the conditional mean under GL priors in Eq. 1–5 is given by

E(θi | X1, ..., Xn, ξi, τ) = (1 − κi)Xi, (1–19)

where κi = 1/(1 + τξi). Since κi ∈ (0, 1), it is clear from Eq. 1–19 that the amount of shrinkage is
controlled by the shrinkage factor κi, which depends on both ξi and τ. Namely, the posterior
mean E(θi | X1, ..., Xn) ≈ Xi for large signals Xi, while E(θi | X1, ..., Xn) ≈ 0 for small Xi.
Therefore, it seems reasonable to classify the entries in θ as either signal or noise depending
upon this shrinkage factor κi .
For the horseshoe prior in Eq. 1–6, Carvalho et al. (2010) first introduced the thresholding
rule,

Reject H0i if E(1 − κi | X1, ..., Xn) > 1/2, (1–20)

where κi = 1/(1 + ξiτ), √ξi ∼ C+(0, 1), and √τ ∼ C+(0, 1). Ghosh et al. (2016) later extended
the classification rule in Eq. 1–20 for a general class of global-local shrinkage priors, which
includes the Strawderman-Berger, normal-exponential-gamma, and generalized double Pareto
priors.
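To make the rule in Eq. 1–20 concrete, note that conditionally on τ = 1 (a simplification we adopt here; the full horseshoe also mixes over √τ ∼ C+(0, 1)), the posterior of κi is proportional to e^{−κXi²/2} κ^{a−1/2}(1 − κ)^{b−1} with a = b = 1/2, i.e. a Beta(1, 1/2) kernel tilted by the likelihood. The sketch below is our own illustration, not the dissertation's code: it estimates E(1 − κi | Xi) by self-normalized Monte Carlo and applies the 1/2 threshold.

```python
import numpy as np

def posterior_signal_weight(x, a=0.5, b=0.5, n_draws=200000, seed=0):
    """Estimate E(1 - kappa | x) for the conditional (tau = 1) posterior
    pi(kappa | x) ∝ exp(-kappa x^2/2) kappa^(a - 1/2) (1 - kappa)^(b - 1)
    by reweighting Beta(a + 1/2, b) draws with the likelihood tilt."""
    rng = np.random.default_rng(seed)
    kappa = rng.beta(a + 0.5, b, size=n_draws)
    w = np.exp(-0.5 * kappa * x ** 2)
    return float(np.sum((1.0 - kappa) * w) / np.sum(w))

# Eq. 1-20: flag observation x as a signal when E(1 - kappa | x) > 1/2.
noise_like = posterior_signal_weight(0.5)   # small |x|: heavily shrunk
signal_like = posterior_signal_weight(6.0)  # large |x|: barely shrunk
```

For a small observation the estimated weight falls well below 1/2 (no rejection), while for a large one it approaches 1, illustrating the tail robustness discussed above.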
The theoretical properties of the classification rule in Eq. 1–20 have been studied
within the ABOS framework described in Section 1.4.1. Assuming that the θi ’s come from
a two-components model and placing an appropriate rate of decay on τ , Datta & Ghosh
(2013) showed that the thresholding rule in Eq. 1–20 for the horseshoe prior in Eq. 1–6 could
asymptotically attain the ABOS risk in Eq. 1–17 up to a multiplicative constant.
Ghosh et al. (2016) generalized Datta & Ghosh (2013)’s result to a general class of
shrinkage priors of the form in Eq. 1–5, which includes the student-t distribution, the TPBN
family, and the GDP family of priors. In particular, Ghosh et al. (2016) considered both the
case where τ is treated as a tuning parameter that depends on sample size and the case
where τ is the empirical Bayes estimator for τ given by van der Pas et al. (2014). Ghosh &
Chakrabarti (2017) later showed that the thresholding rule in Eq. 1–20 for this same class of priors
can asymptotically attain the ABOS risk in Eq. 1–17 exactly. Bhadra et al. (2017)
also extended the classification rule in Eq. 1–20 to the horseshoe+ prior. The horseshoe+ prior
adds an extra half-Cauchy hyperprior to the hierarchy of the horseshoe prior in order to induce
ultra-sparse estimates of θ. Bhadra et al. (2017) established that, with an appropriate rate
specified for τ , the horseshoe+ prior asymptotically attains the ABOS risk in Eq. 1–17 up to a
multiplicative constant.
1.5 Sparse Univariate Linear Regression
Before we discuss high-dimensional multivariate linear regression, we first review some
frequentist and Bayesian methods for the univariate linear regression model in high-dimensional
settings. The model we consider first is
Y = Xβ + ε, (1–21)
where Y = (y1, ..., yn)⊤ is an n × 1 vector of observations of some response, X is an n × p design
matrix, and ε ∼ Nn(0,σ2In) is an n-dimensional random noise vector. In high-dimensional
settings, p is typically much greater than n, which renders traditional estimation and model
selection techniques such as ordinary least squares or best subsets regression infeasible. In
particular, when p > n, the matrix X⊤X is singular, so the usual ordinary least squares
estimator (X⊤X)⁻¹X⊤Y does not exist, and least squares solutions are no longer unique. Further,
it becomes computationally infeasible to search over the 2^p possible models when p is very
large. To mitigate these problems, statisticians
typically impose a sparsity assumption on β in Eq. 1–21 (i.e. most of the entries in β are
assumed to be zero) in order to make β estimable.
1.5.1 Frequentist Approaches
In frequentist approaches to estimating β in Eq. 1–21, the most commonly used method
for inducing sparsity is through imposing regularization penalties on the coefficients of interest.
These frequentist estimators can be obtained by minimizing a penalized least squares objective
function,
min_β  ||Y − Xβ||₂² + λ ∑_{i=1}^{p} ρ(βi),
where ρ(·) is an appropriately chosen (usually convex) penalty function, and λ > 0 is a
tuning parameter. Popular choices of penalty functions include the LASSO (Tibshirani (1996))
and its many variants, including the adaptive lasso (Zou (2006)), the group lasso (Yuan &
Lin (2006)), and the elastic net (Zou & Hastie (2005)). These methods use either an ℓ1 or
a combination of an ℓ1 and ℓ2 penalty function to shrink irrelevant predictors or groups of
predictors to exactly zero. These methods are attractive because they induce exact zeros as
estimates for some of the βi ’s, i = 1, ..., p, therefore enabling statisticians to simultaneously
perform estimation and variable selection.
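The penalized criterion above is typically minimized with soft-thresholding-based algorithms, which is also what produces the exact zeros. Below is a minimal proximal-gradient (ISTA) sketch for the LASSO penalty ρ(βi) = |βi|; the simulated data, step size, and λ are our illustrative assumptions, not values from the dissertation.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||y - X beta||_2^2 + lam * ||beta||_1 by proximal gradient."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1/L for the smooth part
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = (3.0, -2.0, 1.5)                 # sparse ground truth
y = X @ beta_true + 0.5 * rng.standard_normal(100)
beta_hat = lasso_ista(X, y, lam=20.0)            # exact zeros for most noise coords
```

The soft-thresholding step is exactly what sets irrelevant coefficients to zero, giving simultaneous estimation and variable selection.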
1.5.2 Bayesian Approaches
1.5.2.1 Spike-and-slab
In the Bayesian univariate regression model, spike-and-slab priors in Eq. 1–2 have been a
popular choice for inducing sparsity in the coefficients for regression problems. In the context
of linear regression, these priors are placed on βi , i = 1, ..., p, in the following hierarchical
formulation:

Y | X, β ∼ Nn(Xβ, σ²In),
βi ∼ (1 − p)δ0(βi) + p ψ(βi | λ), i = 1, ..., p,
p ∼ π(p), σ² ∼ µ(σ²),
(1–22)
where ψ is a diffuse unimodal density symmetric around zero and indexed by scale parameter
λ, π(·) has support on (0, 1), and µ(·) has support on (0,∞). Popular choices for π include
Uniform(0, 1) or Beta(a, b), while µ is usually chosen to be an inverse gamma density or
the noninformative Jeffreys prior. Under the model in Eq. 1–22, some of the regression
coefficients are forced to zero (the “spike”), while ψ (the “slab”) models the nonzero
coefficients. In order to perform group estimation and group variable selection, Xu & Ghosh
(2015) also introduced the Bayesian group lasso with spike-and-slab priors (BGL-SS), which
uses a mixture prior with a point mass at the vector 0_mg ∈ R^mg (where mg denotes the size of
group g) for the "spike" and a multivariate normal distribution for the "slab."
Just as in the normal means model in Eq. 1–1, the point mass mixture can face
computational difficulties when p is very large. Therefore, continuous variants of spike-and-slab
of the form in Eq. 1–3, such as the celebrated SSVS method by George & McCulloch (1993)
or the recent SSL model by Ročková & George (2016), are often used in practice instead.
Recently, Ishwaran & Rao (2005) and Narisetty & He (2014) also used mixture priors of
normals but rescaled the variances (dependent upon the sample size n) in order to
better control the amount of shrinkage for each individual coefficient.
1.5.2.2 Continuous Shrinkage Priors
When p is large, spike-and-slab priors can face computational problems, since they
require either searching over 2^p possible models or data augmentation via latent variables. To
circumvent these issues, continuous shrinkage priors of the form in Eq. 1–4 are also popular for
Bayesian univariate regression. In the context of univariate linear regression, our hierarchical
model with shrinkage priors is typically of the form,
Y | X, β ∼ Nn(Xβ, σ²In),
βi ind∼ N(0, σ²ωi²), i = 1, ..., p,
ωi² ∼ π(ωi²), i = 1, ..., p,
σ² ∼ µ(σ²),
(1–23)

where π(ωi²) is a carefully chosen prior on the scale component in the prior for βi. In
particular, we obtain special cases of the model in Eq. 1–23 by placing global-local shrinkage
priors of the form in Eq. 1–5 on the coefficients βi , i = 1, ..., p. Just as in the normal means
model in Eq. 1–1, these shrinkage priors shrink most of the coefficients towards zero, but their
tail robustness prevents overshrinkage of true nonzero coefficients.
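Computationally, the appeal of Eq. 1–23 is conjugacy: conditionally on the scales, β | Y, ω², σ² ∼ N((X⊤X + D⁻¹)⁻¹X⊤Y, σ²(X⊤X + D⁻¹)⁻¹) with D = diag(ω₁², ..., ωp²), which is the workhorse update inside Gibbs samplers for these models. The sketch below is an illustration with made-up fixed scales (not a full sampler): small ωi² force near-zero estimates, while large ωi² leave coefficients essentially unshrunk.

```python
import numpy as np

def conditional_beta_mean(X, y, omega2):
    """Posterior mean of beta given the scales under Eq. 1-23:
    (X'X + D^{-1})^{-1} X'y with D = diag(omega2)."""
    D_inv = np.diag(1.0 / np.asarray(omega2, dtype=float))
    return np.linalg.solve(X.T @ X + D_inv, X.T @ y)

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 5))
y = X @ np.array([2.0, 0.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(50)
# Large scale on coordinate 1 (essentially no shrinkage), tiny scales on the rest.
beta = conditional_beta_mean(X, y, omega2=[100.0, 1e-4, 1e-4, 1e-4, 1e-4])
```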
1.5.3 Posterior Consistency for Univariate Linear Regression
Suppose that the true model is
Yn = Xnβ0n + εn, (1–24)
where εn ∼ Nn(0,σ2In) and β0n depends on n. For convenience, we denote β0n as β0 going
forward, noting that β0 depends on n.
Let {β0}n≥1 be the sequence of true coefficients, and let P0 denote the distribution of
{Yn}n≥1 under Eq. 1–24. Let {πn(βn)}n≥1 and {πn(βn|Yn)}n≥1 denote the sequences of prior
and posterior densities for β. We say that our model consistently estimates β0 if the posterior
probability that βn lies in an ε-neighborhood of β0 (ε > 0) converges to 1 almost surely with
respect to the measure P0 as n → ∞. Formally, we give the following definition (see Armagan
et al. (2013a)):
Definition 1.2. (posterior consistency) Let Bn = {βn : ||βn − β0||₂ > ε}, where ε > 0. The
sequence of posterior distributions of βn under the prior πn(βn) is said to be strongly consistent
under Eq. 1–24 if, for any ε > 0,

Πn(Bn | Yn) = Πn(||βn − β0||₂ > ε | Yn) → 0 a.s. P0 as n → ∞.
Consistency results have been established for both spike-and-slab models in Eq. 1–22
and continuous shrinkage models in Eq. 1–23 in the case where the number of covariates p
grows slower than or at the same rate as the sample size n. Ishwaran & Rao (2011) established
that for a rescaled version of spike-and-slab, the posterior mean β̂n consistently estimates
β0 (≡ β0n) in Eq. 1–24. Letting A denote the subset of true nonzero coefficients in β0,
Ishwaran & Rao (2011) also showed that √n(β̂n^A − β0^A) is asymptotically normally distributed
with mean 0 and variance-covariance matrix equal to the inverse of the appropriate submatrix of
the Fisher information matrix. This is known as the oracle property (see Fan & Li (2001)). Xu & Ghosh
(2015) also established the oracle property for the posterior median for their spike-and-slab
model with grouped variables under the assumption of an orthogonal design matrix
X. These consistency results concerned only a particular point estimate of β0 rather than the
entire posterior density.
For a variety of GL shrinkage priors of the form in Eq. 1–5, Armagan et al. (2013a)
established posterior consistency of the entire posterior distribution, as defined in Definition
1.2, when p grows slower than n. Zhang & Bondell (2017) also established posterior
consistency under the Dirichlet-Laplace prior when p grows slower than n. Moving beyond
the “small p, large n” scenario, Dasgupta (2016) established posterior consistency for the
Bayesian lasso under the assumption of orthogonal design when p grows at the same rate as n.
1.6 Sparse Multivariate Linear Regression
We now consider the classical multivariate normal linear regression model,
Y = XB+ E, (1–25)
where Y = (Y1, ...,Yq) is an n × q response matrix of n samples and q response variables,
X is an n × p matrix of n samples and p covariates, B ∈ Rp×q is the coefficient matrix,
and E = (ε1, ..., εn)⊤ is an n × q noise matrix. Under normality, we assume that εi i.i.d.∼
Nq(0, Σ), i = 1, ..., n. In other words, each row of E is identically distributed with mean 0 and
covariance Σ.
Our focus is on obtaining a sparse estimate of the p × q coefficient matrix B. In
practical settings, particularly in high-dimensional settings when p > n, it is important not
only to provide robust estimates of B, but to choose a subset of regressor variables from the
p rows of B which are good for prediction on the q responses. Although p may be large, the
number of predictors that are actually associated with the responses is generally quite small. A
parsimonious model also tends to give far better estimation and prediction performance than a
dense model, which further motivates the need for sparse estimates of B.
1.6.1 Frequentist Approaches
The ℓ1 and ℓ2 regularization methods described in Section 1.5.1 have been naturally
extended to the multivariate regression setup in Eq. 1–25 where sparsity in the coefficients
matrix is desired. For example, Rothman et al. (2010) utilized an ℓ1 penalty on each individual
coefficient of B in Eq. 1–25, in addition to an ℓ1 penalty on the off-diagonal entries of the
covariance matrix to perform joint sparse estimation of B and Σ. Li et al. (2015) proposed the
multivariate sparse group lasso, which utilizes a combination of a group ℓ2 penalty on rows of
B and an ℓ1 penalty on the individual coefficients bij to perform sparse estimation and variable
selection at both the group and within-group levels. Wilms & Croux (2017) also considered
a model which uses an ℓ2 penalty on rows of B to shrink entire rows to zero, while jointly
estimating the covariance matrix Σ.
1.6.2 Bayesian Approaches
The two-components mixture approach in Section 1.5.2 has been extended to the
multivariate framework of Eq. 1–25 by Brown et al. (1998), Liquet et al. (2016), and Liquet
et al. (2017). Brown et al. (1998) and Liquet et al. (2016) first facilitate variable selection
by associating each of the p rows of B, bi , 1 ≤ i ≤ p, with a p-dimensional binary vector
γ = (γ1, ..., γp), where each entry in γ follows a Bernoulli distribution. The selected bi ’s are
then estimated by placing a multivariate Zellner g-prior (see Zellner (1986)) on the sub-matrix
of the selected covariates.
Recently, Liquet et al. (2017) extended Xu & Ghosh (2015)’s work to the multivariate
case with a method called Multivariate Group Selection with Spike and Slab Prior (MBGL-SS).
Under MBGL-SS, rows of B are grouped together and modeled with a prior mixture density
with a point mass at 0 ∈ R^{mg·q} having positive probability (where mg denotes the size of the
gth group and q is the number of responses). Liquet et al. (2017) use the posterior median
B̂ = (b̂ij)_{p×q} as the estimate for B, so that entire rows are estimated to be exactly zero.
1.6.3 Reduced Rank Regression
Finally, both frequentist and Bayesian reduced rank regression (RRR) approaches
have been developed to tackle the problem of sparse estimation of B in Eq. 1–25. RRR
constrains the coefficient matrix B to be rank-deficient. Chen & Huang (2012) proposed a
rank-constrained adaptive group lasso approach to recover a low-rank matrix with some rows of
B estimated to be exactly zero. Bunea et al. (2012) also proposed a joint sparse and low-rank
estimation approach and derived its non-asymptotic oracle bounds. The RRR approach was
recently adapted to the Bayesian framework by Zhu et al. (2014) and Goh et al. (2017). In the
Bayesian framework, rank-reducing priors are used to shrink most of the rows and columns in
B towards 0p ∈ Rp or 0⊤q ∈ Rq.
CHAPTER 2
THE INVERSE GAMMA-GAMMA PRIOR FOR SPARSE ESTIMATION
In this chapter, we introduce a new fully Bayesian scale-mixture shrinkage prior known
as the inverse gamma-gamma (IGG) prior. Our goal is twofold. Having observed a vector
X = (X1, ...,Xn) with entries from the model in Eq. 1–1,
Xi = θi + ϵi , ϵi ∼ N (0, 1), i = 1, ..., n,
we would like to simultaneously achieve: 1) robust estimation of θ, and 2) a robust testing
rule for identifying true signals. Multiple testing with the IGG prior is deferred to Chapter 3.
In this chapter, we discuss the IGG’s theoretical properties and illustrate how it can be used to
estimate sparse noisy vectors.
The IGG is a special case of the popular three parameter beta normal (TPBN) family, first
introduced by Armagan et al. (2011). The TPBN mixture family generalizes several well-known
scale-mixture shrinkage priors. This family places a beta prime density (also known as the
inverted beta) as a prior on the scale parameter, λi, and is of the form,

θi | τ, λi ind∼ N(0, τλi),
π(λi) = [Γ(a + b)/(Γ(a)Γ(b))] λi^{a−1}(1 + λi)^{−(a+b)}, i = 1, ..., n, (2–1)
where a and b are positive constants. Examples of priors that fall under the TPBN family
include the horseshoe prior (a = b = 0.5), the Strawderman-Berger prior (a = 1, b = 0.5), and
the normal-exponential gamma (NEG) prior (a = 1, b > 0).
With the IGG prior, the global parameter τ is fixed at τ = 1. However, we show that we
can achieve (near) minimax posterior contraction by simply specifying sample-size dependent
hyperparameters a and b, rather than by tuning or estimating a shared global parameter τ .
Our prior therefore does not fall under the global-local framework and our theoretical results
differ from many existing results based on global-local priors. We further justify the use of
the IGG by obtaining a sharper upper bound on the rate of posterior concentration in the
Kullback-Leibler sense than previous upper bounds derived for the horseshoe and horseshoe+
densities.
The organization of this chapter is as follows. In Section 2.1, we introduce the IGG
prior. We show that it mimics traditional shrinkage priors by placing heavy mass around
zero. We also establish various concentration properties of the IGG prior that characterize its
tail behavior and that are crucial for establishing our theoretical results. In Section 2.3, we
discuss the behavior of the posterior under the IGG prior. We show that for a class of sparse
normal mean vectors, the posterior distribution under the IGG prior contracts around the true
θ at (near) minimax rate under mild conditions. We also show that the upper bound for the
posterior concentration rate in the Kullback-Leibler sense is sharper for the IGG than it is for
other known Bayes estimators. In Section 2.4, we present simulation results which demonstrate
that the IGG prior has excellent performance for estimation in finite samples. Finally, in Section
2.5, we utilize the IGG prior to analyze a prostate cancer data set.
2.1 The Inverse Gamma-Gamma (IGG) Prior
Suppose we have observed X ∼ Nn(θ, In), and our task is to estimate the n-dimensional
vector, θ. Consider putting a scale-mixture prior on each θi , i = 1, ..., n of the form
θi |σ2iind∼ N (0,σ2i ), i = 1, ..., n,
σ2ii .i .d .∼ β′(a, b), i = 1, ..., n,
(2–2)
where β′(a, b) denotes the beta prime density in Eq. 2–1. The scale mixture prior in Eq. 2–2
is a special case of the TPBN family of priors with the global parameter τ fixed at τ = 1. One
easily sees that the posterior mean of θi under Eq. 2–2 is given by

E(θi | Xi) = E{E(θi | Xi, σi²) | Xi} = {E(1 − κi | Xi)}Xi, (2–3)

where κi = 1/(1 + σi²). Using a simple transformation of variables, we also see that the posterior
density of the shrinkage factor κi is proportional to
Figure 2-1. Marginal density of the IGG prior in Eq. 2–5 with hyperparameters a = 0.6, b = 0.4, in comparison to other shrinkage priors. The DL_{1/2} prior is the marginal density for the Dirichlet-Laplace density with D(1/2, ..., 1/2) specified as a prior in the Bayesian hierarchy.
π(κi | Xi) ∝ exp(−κiXi²/2) κi^{a−1/2}(1 − κi)^{b−1}, κi ∈ (0, 1). (2–4)
From Eq. 2–3, it is clear that the amount of shrinkage is controlled by the shrinkage
factor κi. With appropriately chosen a and b, one can obtain sparse estimates of the θi's. For
example, with a = b = 0.5, the implied prior on the scale σi is the standard half-Cauchy density
C+(0, 1), i.e. the horseshoe prior.
To distinguish our work from previous results, we note that the beta prime density in
Eq. 2–1 can be rewritten as a product of independent inverse gamma and gamma densities.
We reparametrize Eq. 2–2 as follows:
θi | λi, ξi ind∼ N(0, λiξi), i = 1, ..., n,
λi i.i.d.∼ IG(a, 1), i = 1, ..., n,
ξi i.i.d.∼ G(b, 1), i = 1, ..., n,
(2–5)
where a, b > 0, and refer to this prior as the inverse gamma-gamma (IGG) prior. It should
be noted that the rate parameter 1 in Eq. 2–5 could be replaced by any positive constant.
Eq. 2–5 gives us some important intuition into the behavior of the IGG prior. Namely, for
small values of b, G(b, 1) places more mass around zero. As Proposition 2.1 shows, for any
0 < b ≤ 1/2, the marginal distribution of a single θ under the IGG prior has a singularity at
zero.
Proposition 2.1. If θ is endowed with the IGG prior in Eq. 2–5, then the marginal distribution
of θ is unbounded with a singularity at zero for any 0 < b ≤ 1/2.
Proof. See Appendix A.1.
Proposition 2.1 gives us some insight into how we should choose the hyperparameters in
Eq. 2–5. Namely, we see that for small values of b, the IGG prior can induce sparse estimates
of the θi ’s by shrinking most observations to zero. As we will illustrate in Section 2.2, the tails
of the IGG prior are still heavy enough to identify signals that are significantly far away from
zero.
Figure 2-1 gives a plot of the marginal density π(θi) for the IGG prior in Eq. 2–5, with
a = 0.6 and b = 0.4. Figure 2-1 shows that with a small value of b, the IGG has a singularity
at zero. The IGG prior also appears to place slightly heavier mass around zero than other
well-known scale-mixture shrinkage priors, while maintaining the same tail robustness. In Section
2.3, we provide a theoretical argument showing that the shrinkage profile near zero under
the IGG is indeed more aggressive than that of previously known Bayesian estimators.
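The singularity in Figure 2-1 is also easy to see empirically, since the hierarchy in Eq. 2–5 is trivial to sample: draw λ ~ IG(a, 1) as the reciprocal of a G(a, 1) variate, ξ ~ G(b, 1), and then θ | λ, ξ ~ N(0, λξ). The sketch below is our own illustration (the sample sizes and the 0.05 window are arbitrary choices); it compares the IGG's mass near the origin with that of a standard normal.

```python
import numpy as np

def sample_igg(n, a=0.6, b=0.4, seed=0):
    """Draw n variates from the IGG prior of Eq. 2-5."""
    rng = np.random.default_rng(seed)
    lam = 1.0 / rng.gamma(shape=a, scale=1.0, size=n)  # lambda ~ IG(a, 1)
    xi = rng.gamma(shape=b, scale=1.0, size=n)         # xi ~ G(b, 1)
    return rng.standard_normal(n) * np.sqrt(lam * xi)  # theta | lambda, xi

theta = sample_igg(200000)
z = np.random.default_rng(1).standard_normal(200000)
frac_igg = float(np.mean(np.abs(theta) < 0.05))    # mass in a small window
frac_norm = float(np.mean(np.abs(z) < 0.05))       # roughly 0.04 for N(0, 1)
# The IGG piles up several times more mass near zero, mimicking the "spike,"
# while its heavy tails still produce occasional very large draws.
```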
2.2 Concentration Properties of the IGG Prior
2.2.1 Notation
Throughout the rest of the chapter, we use the following notation. Let {an} and {bn} be
two non-negative sequences of real numbers indexed by n, where bn ≠ 0 for sufficiently large n.
We write an ≍ bn to denote 0 < lim inf_{n→∞} an/bn ≤ lim sup_{n→∞} an/bn < ∞, and an ≲ bn to denote
that there exists a constant C > 0, independent of n, such that an ≤ Cbn provided n is sufficiently
large. If lim_{n→∞} an/bn = 1, we write an ∼ bn. Moreover, if |an/bn| ≤ M for all sufficiently
large n, where M > 0 is a positive constant independent of n, then we write an = O(bn). If
lim_{n→∞} an/bn = 0, we write an = o(bn). Thus, an = o(1) if lim_{n→∞} an = 0.
Throughout, we also use Z to denote a standard normal N(0, 1) random variable having
cumulative distribution function Φ(·) and probability density function ϕ(·).
2.2.2 Concentration Inequalities for the Shrinkage Factor
Consider the IGG prior given in Eq. 2–5, but now the hyperparameter b is allowed
to vary with n as n → ∞. Namely, we allow 0 < bn < 1 for all n, with bn → 0 as
n → ∞, so that even more mass is placed around zero as n → ∞. We also fix a to lie in the
interval (1/2, ∞). To emphasize that the hyperparameter bn depends on n, we rewrite the prior
in Eq. 2–5 as

θi | λi, ξi ind∼ N(0, λiξi), i = 1, ..., n,
λi i.i.d.∼ IG(a, 1), i = 1, ..., n,
ξi i.i.d.∼ G(bn, 1), i = 1, ..., n,
(2–6)

where bn ∈ (0, 1) with bn = o(1) and a ∈ (1/2, ∞). For the rest of this chapter, we label this particular
variant of the IGG prior as the IGGn prior.
As described in Section 2.1, the shrinkage factor κi = 1/(1 + λiξi) plays a critical role in the
amount of shrinkage of each observation Xi. In this section, we further characterize the tail
properties of the posterior distribution π(κi | Xi), which demonstrates that the IGGn prior in
Eq. 2–6 shrinks most estimates of the θi's to zero but still has heavy enough tails to identify true
signals. In the following results, we assume the IGGn prior on θi, with Xi ∼ N(θi, 1).
Theorem 2.1. For any a, bn ∈ (0, ∞),

E(1 − κi | Xi) ≤ e^{Xi²/2} · bn/(a + bn + 1/2).
Proof. See Appendix A.1.
Corollary 2.1.1. If a is fixed and bn → 0 as n → ∞, then E(1− κi |Xi) → 0 as n → ∞.
Theorem 2.2. Fix ϵ ∈ (0, 1). For any a ∈ (1/2, ∞) and bn ∈ (0, 1),

Pr(κi < ϵ | Xi) ≤ e^{Xi²/2} · bnϵ/((a + 1/2)(1 − ϵ)).
Proof. See Appendix A.1.
Corollary 2.2.1. If a ∈ (1/2, ∞) is fixed and bn → 0 as n → ∞, then by Theorem 2.2,
Pr(κi ≥ ϵ | Xi) → 1 for any fixed ϵ ∈ (0, 1).
Theorem 2.3. Fix η ∈ (0, 1) and δ ∈ (0, 1). Then for any a ∈ (1/2, ∞) and bn ∈ (0, 1),

Pr(κi > η | Xi) ≤ [(a + 1/2)(1 − η)^{bn} / (bn(ηδ)^{a+1/2})] exp(−η(1 − δ)Xi²/2).
Proof. See Appendix A.1.
Corollary 2.3.1. For any fixed n where a ∈ (1/2, ∞) and bn ∈ (0, 1), and for every fixed η ∈ (0, 1),
Pr(κi ≤ η | Xi) → 1 as Xi → ∞.

Corollary 2.3.2. For any fixed n where a ∈ (1/2, ∞) and bn ∈ (0, 1),
E(1 − κi | Xi) → 1 as Xi → ∞.
Since E(θi | Xi) = {E(1 − κi | Xi)}Xi, Corollaries 2.1.1 and 2.2.1 illustrate that all
observations will be shrunk towards the origin under the IGGn prior in Eq. 2–6. However,
Corollaries 2.3.1 and 2.3.2 demonstrate that if Xi is large enough, then the posterior mean
{E(1 − κi | Xi)}Xi ≈ Xi. This assures us that the tails of the IGG prior are still sufficiently
heavy to detect true signals.
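These bounds can be probed numerically. By Eq. 2–4 (with b replaced by bn), π(κi | Xi) is a Beta(a + 1/2, bn) kernel tilted by e^{−κXi²/2}, so E(1 − κi | Xi) can be estimated by self-normalized Monte Carlo and compared with the bound in Theorem 2.1. The sketch below is our own illustration; the values a = 1 and bn = 0.05 are arbitrary choices satisfying a ∈ (1/2, ∞) and bn ∈ (0, 1).

```python
import numpy as np

def igg_shrink_weight(x, a, b, n_draws=400000, seed=0):
    """Monte Carlo estimate of E(1 - kappa | x) under Eq. 2-4:
    reweight Beta(a + 1/2, b) draws by the tilt exp(-kappa x^2 / 2)."""
    rng = np.random.default_rng(seed)
    kappa = rng.beta(a + 0.5, b, size=n_draws)
    w = np.exp(-0.5 * kappa * x ** 2)
    return float(np.sum((1.0 - kappa) * w) / np.sum(w))

a, bn = 1.0, 0.05
for x in (0.5, 1.0, 2.0):
    est = igg_shrink_weight(x, a, bn)
    bound = np.exp(x ** 2 / 2.0) * bn / (a + bn + 0.5)  # Theorem 2.1
    assert est <= bound   # the bound holds at each x
```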
We will use the concentration properties established in Theorems 2.1 and 2.3 to provide
sufficient conditions for which the posterior mean and posterior distribution under the IGGn
prior in Eq. 2–6 contract around the true θ0 at minimax or near-minimax rate in Section 2.3.
These concentration properties will also help us to construct the multiple testing procedure
based on κi in Chapter 3.
2.3 Posterior Behavior Under the IGG Prior
2.3.1 Minimax Posterior Contraction Under the IGG Prior
We first study the mean squared error (MSE) and the posterior variance under the IGG prior
and provide upper bounds on both. For all of our results, we assume that the true θ0 belongs
to the set of nearly black vectors defined in Eq. 1–8. With a suitably chosen rate for bn in
Eq. 2–6, these upper bounds are equal, up to a multiplicative constant, to the minimax risk.
Utilizing these bounds, we also show that the posterior distribution under the IGGn prior in Eq.
2–6 is able to contract around θ0 at minimax-optimal rates.
Since the priors in Eq. 2–6 are independently placed on each θi , i = 1, ..., n, we
denote the resulting vector of posterior means (E(θ1|X1), ...,E(θn|Xn)) by T (X) and the
ith individual posterior mean by T (Xi). Therefore, T (X) is the Bayes estimate of θ under
squared error loss. Theorem 2.4 gives an upper bound on the mean squared error for T (X).
Theorem 2.4. Suppose X ∼ Nn(θ0, In), where θ0 ∈ ℓ0[qn]. Let T(X) denote the posterior
mean vector under Eq. 2–6. If a ∈ (1/2, ∞) and bn ∈ (0, 1) with bn → 0 as n → ∞, the MSE
satisfies

sup_{θ0 ∈ ℓ0[qn]} E_{θ0}||T(X) − θ0||² ≲ qn log(1/bn) + (n − qn)bn √log(1/bn),
provided that qn → ∞ and qn = o(n) as n → ∞.
Proof. See Appendix A.2.
By the minimax result in Donoho et al. (1992), we also have the lower bound,

sup_{θ0 ∈ ℓ0[qn]} E_{θ0}||T(X) − θ0||² ≥ 2qn log(n/qn)(1 + o(1)),
as n, qn → ∞ and qn = o(n). The choice bn = (qn/n)^α, for α ≥ 1, therefore leads to an
upper bound on the MSE of order qn log(n/qn), with a multiplicative constant of at most 2α. Based on
these observations, we immediately have the following corollary.
Corollary 2.4.1. Suppose that qn is known and that we set bn = (qn/n)^α, where α ≥ 1. Then
under the conditions of Theorem 2.4,

sup_{θ0 ∈ ℓ0[qn]} E_{θ0}||T(X) − θ0||² ≍ qn log(n/qn).
Corollary 2.4.1 shows that the posterior mean under the IGG prior performs well as a
point estimator for θ0, as it is able to attain the minimax risk (possibly up to a multiplicative
constant of at most 2 for α = 1). Although the IGG prior does not include a point mass at
zero, Proposition 2.1 and Corollary 2.4.1 together show that the pole at zero for the IGG prior
mimics the point mass well enough, while the heavy tails ensure that large observations are not
over-shrunk.
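To make the rate calculation in Corollary 2.4.1 explicit, a sketch of substituting bn = (qn/n)^α into the two terms of the Theorem 2.4 upper bound (the ≲ there hides an absolute constant):

```latex
q_n \log\frac{1}{b_n} = \alpha\, q_n \log\frac{n}{q_n},
\qquad
(n - q_n)\, b_n \sqrt{\log\frac{1}{b_n}}
\;\le\; n \left(\frac{q_n}{n}\right)^{\alpha} \sqrt{\alpha \log\frac{n}{q_n}}
\;\le\; q_n \sqrt{\alpha \log\frac{n}{q_n}}
\;=\; o\!\left(q_n \log\frac{n}{q_n}\right),
```

where the second inequality uses α ≥ 1. The first term therefore dominates, giving the stated order qn log(n/qn).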
The next theorem gives an upper bound for the total posterior variance corresponding to
the IGGn prior in Eq. 2–6.
Theorem 2.5. Suppose X ∼ Nn(θ0, In), where θ0 ∈ ℓ0[qn]. Under the IGGn prior in Eq. 2–6
and the conditions of Theorem 2.4, the total posterior variance satisfies

sup_{θ0∈ℓ0[qn]} E_{θ0} ∑_{i=1}^{n} Var(θi | Xi) ≲ qn log(1/bn) + (n − qn) bn √(log(1/bn)),

provided that qn → ∞ and qn = o(n) as n → ∞.
Proof. See Appendix A.2.
Having proven Theorems 2.4 and 2.5, we are now ready to state our main theorem
concerning optimal posterior contraction. Theorem 2.6 shows that the IGG is competitive with
other popular heavy-tailed priors like the Dirichlet-Laplace prior considered by Bhattacharya
et al. (2015) or the entire class of global-local shrinkage priors considered in Ghosh &
Chakrabarti (2017). As before, we denote the posterior mean vector under Eq. 2–6 as T (X).
Theorem 2.6. Suppose X ∼ Nn(θ0, In), where θ0 ∈ ℓ0[qn]. Suppose that the true sparsity
level qn is known, with qn → ∞ and qn = o(n) as n → ∞. Under the prior in Eq. 2–6, with
a ∈ (1/2, ∞) and bn = (qn/n)^α, α ≥ 1,

sup_{θ0∈ℓ0[qn]} E_{θ0} Π( θ : ||θ − θ0||² > Mn qn log(n/qn) | X ) → 0,   (2–7)

and

sup_{θ0∈ℓ0[qn]} E_{θ0} Π( θ : ||θ − T(X)||² > Mn qn log(n/qn) | X ) → 0,   (2–8)

for every Mn → ∞ as n → ∞.
Proof. A straightforward application of Markov’s inequality combined with the results of
Theorems 2.4 and 2.5 leads to Eq. 2–7, while Eq. 2–8 follows from Markov’s inequality
combined with only the result of Theorem 2.5.
Theorem 2.6 shows that under mild regularity conditions, the posterior distribution
under the IGG prior contracts around both the true mean vector and the corresponding Bayes
estimates at least as fast as the minimax ℓ2 risk in Eq. 1–9. Since the posterior distribution
cannot contract around the truth faster than the rate qn log(n/qn) (by Ghosal et al. (2000)),
the posterior distribution for the IGG prior under the conditions of Theorem 2.6 must contract
around the true θ0 at the minimax optimal rate in Eq. 1–9 up to some multiplicative constant.
We remark that the conditions needed to attain the minimax rate of posterior contraction
are quite mild. Namely, we only require that qn = o(n), and we do not need to make any
assumptions on the size of the true signal or the true sparsity level. For comparison, Castillo
& van der Vaart (2012) showed that the spike-and-slab prior with a Gaussian slab contracts
at a sub-optimal rate if ||θ0||² ≳ qn log(n/qn). Bhattacharya et al. (2015) showed that given
the Dir(a, ..., a) prior in the Dirichlet-Laplace prior, the posterior contracts around θ0 at the
minimax rate, provided that ||θ0||₂² ≤ qn log⁴ n if a = n^(−(1+β)), or provided that qn ≳ log n
if a = 1/n. The IGGn prior in Eq. 2–6 removes these restrictions on θ0 and qn. Moreover, our
minimax contraction result does not rely on tuning or estimating a global tuning parameter τ,
as many previous authors have done, but instead on appropriate selection of the hyperparameters
a and b in the Bayesian hierarchy for the product density of an IG(a, 1) and a G(b, 1).
In reality, the true sparsity level qn is rarely known, so the best we can do is to
obtain the near-minimax contraction rate of qn log n. A suitable modification of Theorem 2.6
leads to the following corollary.
Corollary 2.6.1. Suppose X ∼ Nn(θ0, In), where θ0 ∈ ℓ0[qn]. Suppose that the true sparsity
level qn is unknown, but that qn → ∞ and qn = o(n) as n → ∞. Under the prior in Eq. 2–6,
with a ∈ (1/2, ∞) and bn = 1/n^α, α ≥ 1,

sup_{θ0∈ℓ0[qn]} E_{θ0} Π( θ : ||θ − θ0||² > Mn qn log n | X ) → 0,   (2–9)

and

sup_{θ0∈ℓ0[qn]} E_{θ0} Π( θ : ||θ − T(X)||² > Mn qn log n | X ) → 0,   (2–10)

for every Mn → ∞ as n → ∞.
Having shown that the posterior mean under the model in Eq. 2–6 attains the near-minimax
risk up to a multiplicative constant, and that its posterior density captures the true θ0 in a
ball of squared radius at most qn log n up to some multiplicative constant, we now quantify its
shrinkage profile around zero in terms of Kullback-Leibler risk bounds. We show that this risk
bound is in fact sharper than those known for other shrinkage priors.
2.3.2 Kullback-Leibler Risk Bounds
In Section 2.3.1, we established that the choice bn = 1/n allows the IGGn posterior
to contract at the near-minimax rate, provided that a ∈ (1/2, ∞). Figure 2-1 suggests that
the shrinkage around zero is more aggressive for the IGGn prior than it is for other known
shrinkage priors when a and b are both set to small values. In this section, we provide a
theoretical justification for this behavior near zero.
Carvalho et al. (2010) and Bhadra et al. (2017) showed that when the true data
generating model is Nn(0, In), the Bayes estimate for the sampling density of the horseshoe
and the horseshoe+ estimators respectively converge to the true model at a super-efficient
rate in terms of the Kullback-Leibler (K-L) distance between the true model and the posterior
density. They argue that as a result, the horseshoe and horseshoe+ estimators squelch noise
better than other shrinkage estimators. However, in this section, we show that the IGGn prior
is able to shrink noise even more aggressively with appropriately chosen bn.
Let θ0 be the true parameter value and f(x|θ) the sampling model. Further, let
K(q1, q2) = E_{q1} log(q1/q2) denote the K-L divergence of the density q2 from q1. The proof
utilizes the following result by Clarke & Barron (1990).

Proposition 2.2. (Clarke and Barron, 1990). Let νn(dθ|x1, ..., xn) be the posterior distribution
corresponding to some prior ν(dθ) after observing data X = (x1, ..., xn) according to the
sampling model f(x|θ). Define the posterior predictive density qn(x) = ∫ f(x|θ) νn(dθ|x1, ..., xn),
and let ν(Aϵ) denote the prior measure of the set Aϵ = {θ : K(qθ0, qθ) ≤ ϵ}. Assume further
that ν(Aϵ) > 0 for all ϵ > 0. Then the Cesàro-average risk of the Bayes estimator, defined as
Rn ≡ n⁻¹ ∑_{j=1}^{n} K(qθ0, qj), satisfies

Rn ≤ ϵ − (1/n) log ν(Aϵ).
Using the above proposition, it is shown in Carvalho et al. (2010) and Bhadra et al.
(2017) that when the global parameter τ is fixed at τ = 1 and the true parameter θ0 = 0, the
horseshoe and the horseshoe+ respectively both have Cesàro-average risk which satisfies

Rn = O( (1/n) log( n/(log n)^d ) ),   (2–11)

where d is a positive constant. This rate is super-efficient, in the sense that the upper bound
on the risk is lower than that of the maximum likelihood estimator (MLE), which has the rate
O(log n / n) when θ0 = 0. The next theorem establishes that the IGG prior can achieve an even
sharper rate of convergence, O(n⁻¹), in the K-L sense, with appropriate choices of a and b.
Theorem 2.7. Suppose that the true sampling model pθ0 is xj ∼ N(θ0, 1). Then, under
the IGG prior with any a > 0 and bn = 1/n, the rate of convergence of Rn when θ0 = 0
satisfies the inequality

Rn ≤ (1/n) [ 2 + log(√π) + (a + 2) log 2 + log(a + 1/2) ] + (2 log n)/n².   (2–12)
Proof. See Appendix A.3.
Since (log n)/n² = o(n⁻¹), we see from Theorem 2.7 that the IGGn posterior density with
hyperparameters a > 0 and bn = 1/n has a convergence rate of O(n⁻¹). This
convergence rate is sharper than that of the horseshoe or horseshoe+, both of which converge
convergence rate is sharper than that of the horseshoe or horseshoe+, both of which converge
at the rate of O{n−1(log n− d log log n)} when θ0 = 0. To our knowledge, this is the sharpest
known bound on Cesàro-average risk for any Bayes estimator. Our result provides a rigorous
explanation for the observation that the IGG seems to shrink noise more aggressively than
other scale-mixture shrinkage priors.
Theorem 2.7 not only justifies the use of bn = 1/n as a choice for the hyperparameter b in
the IGG prior, but it also provides insight into how we should choose the hyperparameter a.
Equation 2–12 shows that the constant C in Rn ≤ Cn⁻¹ + o(n⁻¹) can be large if a is set to be
large. This theorem thus implies that in order to minimize the K-L distance between Nn(θ0, In)
and the IGG posterior density, we should pick a to be small. Since we require a ∈ (1/2, ∞) in
order to achieve the near-minimax contraction rate, our theoretical results suggest that we
should set a ∈ (1/2, 1/2 + δ] for small δ > 0 for optimal posterior concentration.
2.4 Simulation Study
2.4.1 Computation and Selection of Hyperparameters
Letting κi = 1/(1 + λiξi), the full conditional distributions for the model in Eq. 2–5 are

θi | rest ∼ N( (1 − κi)Xi, 1 − κi ),   i = 1, ..., n,
λi | rest ∼ IG( a + 1/2, θi²/(2ξi) + 1 ),   i = 1, ..., n,
ξi | rest ∼ GIG( θi²/λi, 2, b − 1/2 ),   i = 1, ..., n,   (2–13)

where GIG(a, b, p) denotes a generalized inverse Gaussian density with f(x; a, b, p) ∝
x^(p−1) e^(−(a/x + bx)/2). Therefore, the IGG model in Eq. 2–5 can be implemented straightforwardly
with Gibbs sampling, utilizing the full conditionals in Eq. 2–13.
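As an illustration of how the full conditionals in Eq. 2–13 translate into code, the following is a minimal Python sketch of this Gibbs sampler (not the author's implementation: the function name, the initialization λi = ξi = 1, and the small numerical floor on θi²/λi are our own choices, and SciPy's `geninvgauss(p, b, scale=s)`, with density ∝ x^(p−1) exp(−b(x + 1/x)/2) before scaling, is mapped to GIG(χ, ψ, p) via b = √(χψ), s = √(χ/ψ) with ψ = 2):

```python
import numpy as np
from scipy.stats import geninvgauss


def igg_gibbs(x, n_iter=10000, burn=5000, seed=0):
    """Gibbs sampler for the IGG_(1/n) model, cycling through the full
    conditionals in Eq. 2-13 with a = 1/2 + 1/n and b = 1/n.
    Returns the estimated posterior mean of theta."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    a, b = 0.5 + 1.0 / n, 1.0 / n
    rng = np.random.default_rng(seed)
    lam = np.ones(n)            # local scales lambda_i (our initialization)
    xi = np.ones(n)             # local scales xi_i (our initialization)
    theta_sum = np.zeros(n)
    for t in range(n_iter):
        # theta_i | rest ~ N((1 - kappa_i) X_i, 1 - kappa_i)
        kappa = 1.0 / (1.0 + lam * xi)
        theta = rng.normal((1.0 - kappa) * x, np.sqrt(1.0 - kappa))
        # lambda_i | rest ~ IG(a + 1/2, theta_i^2/(2 xi_i) + 1)
        lam = 1.0 / rng.gamma(a + 0.5, 1.0 / (theta**2 / (2.0 * xi) + 1.0))
        # xi_i | rest ~ GIG(chi = theta_i^2/lambda_i, psi = 2, p = b - 1/2)
        chi = np.maximum(theta**2 / lam, 1e-8)  # numerical floor (ours)
        xi = geninvgauss.rvs(b - 0.5, np.sqrt(2.0 * chi),
                             scale=np.sqrt(chi / 2.0), random_state=rng)
        if t >= burn:
            theta_sum += theta
    return theta_sum / (n_iter - burn)
```

In line with the selective-shrinkage behavior proven in Chapter 2, entries near zero are pulled aggressively toward zero while large observations are left nearly unshrunk.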
For all our simulations, we set a = 1/2 + 1/n and b = 1/n, in light of Theorems 2.6 and 2.7.
These choices of a and b ensure that the IGG posterior will contract around the true θ0 at
least at the near-minimax rate, while keeping a ∈ (1/2, ∞) small. We denote our IGG prior with
hyperparameters (a, b) = (1/2 + 1/n, 1/n) as IGG1/n. For both of the simulation studies described
below, we run 10,000 iterations of a Gibbs sampler, discarding the first 5,000 as burn-in.
2.4.2 Simulation Study for Sparse Estimation
To illustrate finite-sample performance of the IGG1/n prior, we use the set-up in Bhadra
et al. (2017) where we specify sparsity levels of q/n = 0.05, 0.10, 0.20, and 0.30, and set the
signals all equal to values of either A = 7 or 8, for a total of eight simulation settings. With
n = 200, we randomly generate n-dimensional vectors under these settings and compute the
average squared error loss corresponding to the posterior median across 100 replicates.
We compare our results for IGG1/n to the average squared error loss of the posterior
median under the Dirichlet-Laplace (DL), the horseshoe (HS), and the horseshoe+ (HS+)
estimators, since these are global-local shrinkage priors in Eq. 1–5 with singularities at zero.
For the HS and HS+ priors, we use a fully Bayesian approach, with τ ∼ C+(0, 1), as in
Ghosh et al. (2016). For the DL prior, we specify a = 1/n in the Dir(a, ..., a) prior on the scale
component, along with τ ∼ G(na, 1/2), as in Bhattacharya et al. (2015). Our results are
presented in Table 2-1.
Table 2-1 shows that under these various sparsity and signal strength settings, the
IGG1/n’s posterior median has the lowest (estimated) squared error loss in nearly all of the
simulation settings. It performs better than the horseshoe and the horseshoe+ in all settings.
Our empirical results confirm the theoretical properties that were proven in Section 2.3 and
illustrate that for finite samples, the IGG prior often outperforms other popular shrinkage
priors. Our empirical results also lend strong support to the use of the inverted beta prior
Table 2-1. Comparison of average squared error loss for the posterior median estimate of θ
across 100 replications. Results are reported for the IGG1/n, DL (Dirichlet-Laplace),
HS (horseshoe), and the HS+ (horseshoe-plus).

q/n    A   IGG     DL      HS      HS+
0.05   7   13.88   14.30   18.11   14.41
0.05   8   13.34   13.27   17.71   13.96
0.10   7   27.21   29.91   35.91   30.18
0.10   8   25.95   27.67   34.77   29.36
0.20   7   49.78   56.40   71.18   58.25
0.20   8   47.24   52.22   69.81   57.11
0.30   7   74.42   85.72   104.67  86.00
0.30   8   70.83   79.03   104.02  84.70
β′(a, b) as the scale density in scale-mixture shrinkage priors in Eq. 1–4. However, our results
suggest that we can obtain better estimation if we allow a and b to vary with the sample size,
rather than keeping them fixed (as the horseshoe priors do, with a = b = 0.5).
2.5 Analysis of a Prostate Cancer Data Set
We demonstrate practical application of the IGG prior using a popular prostate cancer
data set introduced by Singh et al. (2002). In this data set, there are gene expression values
for n = 6033 genes for m = 102 subjects, with m1 = 50 normal control subjects and m2 = 52
prostate cancer patients. We aim to identify genes that are significantly different between
control subjects and cancer patients. This problem can be reformulated as the normal means
problem in Eq. 1–1 by first conducting a two-sample t-test for each gene and then transforming
the test statistics (t1, ..., tn) to z-scores using the inverse normal cumulative distribution
function (CDF) transform zi = Φ⁻¹(F_{t100}(ti)), where F_{t100} denotes the CDF of the
Student's t distribution with 100 degrees of freedom. With z-scores (z1, ..., zn), our model is now

zi = θi + ϵi,   ϵi ∼ N(0, 1),   i = 1, ..., n,   (2–14)
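The t-to-z transformation described above can be sketched in a few lines (a hypothetical helper for illustration, not part of the original analysis):

```python
import numpy as np
from scipy import stats


def t_to_z(t_stats, df=100):
    """Map two-sample t-statistics to z-scores via the inverse normal
    CDF transform z_i = Phi^{-1}(F_{t_df}(t_i))."""
    t_stats = np.asarray(t_stats, dtype=float)
    return stats.norm.ppf(stats.t.cdf(t_stats, df))
```

Because the t distribution has heavier tails than the normal, the transform pulls each statistic slightly toward zero while preserving its sign.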
With this problem recast as a normal means problem, we can now estimate θ = (θ1, ..., θn). As
argued by Efron (2010), |θi | can be interpreted as the effect size of the ith gene for prostate
cancer. Efron (2010) first analyzed the model in Eq. 2–14 for this particular data set by
obtaining empirical Bayes estimates θEfroni , i = 1, ..., n, based on the two-groups model in Eq.
Table 2-2. The z-scores and the effect size estimates for the top 10 genes selected by Efron
(2010), under the IGG, DL, HS, and HS+ models and the two-groups empirical Bayes
model of Efron (2010).

Gene   z-score  θi^IGG  θi^DL  θi^HS  θi^HS+  θi^Efron
610    5.29     4.85    4.52   4.85   4.91    4.11
1720   4.83     4.33    3.94   4.33   4.35    3.65
332    4.47     3.78    3.40   3.78   3.99    3.24
364    -4.42    -3.78   -3.10  -3.78  -3.85   -3.57
914    4.40     3.71    3.11   3.71   3.86    3.16
3940   -4.33    -3.70   -3.06  -3.70  -3.80   -3.52
4546   -4.29    -3.59   -3.09  -3.59  -3.62   -3.47
1068   4.25     3.49    3.09   3.49   3.46    2.99
579    4.19     3.31    2.98   3.31   3.01    2.92
4331   -4.14    -3.41   -2.87  -3.41  -3.43   -3.30
1–12. In our analysis, we use the posterior means θi , i = 1, ..., n, to estimate the strength of
association.
Table 2-2 shows the top 10 genes selected by Efron (2010) and their estimated effect size
on prostate cancer. We compare Efron (2010)’s empirical Bayes posterior mean estimates with
the posterior mean estimates under the IGG, DL, HS, and HS+ priors. Our results confirm the
tail robustness of the IGG prior. All of the scale-mixture shrinkage priors shrink the estimated
effect size for significant genes less aggressively than Efron's procedure. Table 2-2 also shows
that for large signals, the IGG posterior shrinks slightly less than the DL
posterior and roughly the same amount as the HS posterior. The HS+ posterior shrinks the
test statistics the least for large signals, but the IGG’s estimates are still quite similar to those
of the HS+.
2.6 Concluding Remarks
In this chapter, we have introduced a new scale-mixture shrinkage prior called the Inverse
Gamma-Gamma prior for estimating sparse normal mean vectors. This prior has been shown
to have a number of good theoretical properties, including heavy probability mass around zero
and heavy tails. This enables the IGG prior to perform selective shrinkage and to attain (near)
minimax contraction around the true θ in Eq. 1–1. The IGG posterior also converges to the
true model in the Kullback-Leibler sense at a rate which has a sharper upper bound than the
upper bounds on the rates for the horseshoe and horseshoe+ posterior densities.
The IGG, HS, and HS+ all fall under the class of priors which utilize a beta prime density
as a prior on the scale component for the model in Eq. 1–4. However, our results suggest that
there is added flexibility in allowing the parameters (a, b) in Eq. 2–1 to vary with sample size
rather than keeping them fixed. This added flexibility leads to excellent empirical performance
and obviates the need to estimate a global tuning parameter τ . Despite the absence of a
data-dependent global parameter τ , the IGG model adapts well to sparsity, performing well
under both sparse and dense settings. This seems to be in stark contrast to remarks made by
authors like Carvalho et al. (2010) who have argued that scale-mixture shrinkage priors which
do not contain shared global parameters do not enjoy the benefits of adaptivity.
CHAPTER 3
MULTIPLE HYPOTHESIS TESTING WITH THE INVERSE GAMMA-GAMMA PRIOR
In Chapter 2, we introduced the inverse gamma-gamma prior in Eq. 2–5 which can be
used for estimation of sparse noisy vectors in the model given by Eq. 1–1,
Xi = θi + ϵi , ϵi ∼ N (0, 1), i = 1, ..., n,
Through its combination of aggressive shrinkage of noise towards zero and its tail robustness,
the IGG prior performs well for estimation. However, it does not produce exact zeros as
estimates. Therefore, in order to classify the θi's as either signal (θi ≠ 0) or noise (θi = 0), we
need to use a thresholding rule.
In this chapter, we discuss how the IGG prior in Eq. 2–5 may be used for classification,
or equivalently, simultaneously testing H0i : θi = 0 vs. H1i : θi ≠ 0, i = 1, ..., n. Using the
decision theoretic framework of Bogdan et al. (2011) described in Section 1.4.1, we show that
our testing rule for classifying signals asymptotically achieves the Bayes Oracle risk. While
previously, Ghosh & Chakrabarti (2017) demonstrated that testing rules based on global-local
priors of the form in Eq. 1–5 could asymptotically attain the optimal Bayes risk exactly, their
result required tuning or estimating a global parameter τ . The IGG prior avoids this by placing
appropriate values (dependent upon sample size) as its hyperparameters instead.
In Section 3.1, we introduce our testing rule with the IGG prior based on thresholding the
shrinkage factor κi = 1/(1 + λiξi). In Section 3.2, assuming that the true data generating model is
Eq. 1–12, we present upper and lower bounds on the probabilities of Type I and Type II errors
for our thresholding rule. Using these bounds, we establish that in the presence of sparsity,
our rule is asymptotically Bayes optimal under sparsity (ABOS). In Section 3.3, we present
simulations which show that the IGG has excellent performance for multiple testing in both
sparse and dense settings, and moreover, that it has tight control over the false discovery rate
(FDR). Finally, in Section 3.4, we use the IGG prior to analyze the prostate cancer data set
from Section 2.5 within the context of multiple hypothesis testing.
3.1 Classification Using the Inverse Gamma-Gamma Prior
3.1.1 Notation
We use the following notation for the rest of this chapter. Let {an} and {bn} be two
non-negative sequences of real numbers indexed by n, where bn ≠ 0 for sufficiently large n. We
write an ≍ bn to denote 0 < lim inf_{n→∞} an/bn ≤ lim sup_{n→∞} an/bn < ∞, and an ≲ bn to
denote that there exists a constant C > 0, independent of n, such that an ≤ C bn provided n is
sufficiently large. If lim_{n→∞} an/bn = 1, we write an ∼ bn. Moreover, if |an/bn| ≤ M for all
sufficiently large n, where M > 0 is a positive constant independent of n, then we write
an = O(bn). If lim_{n→∞} an/bn = 0, we write an = o(bn). Thus, an = o(1) if lim_{n→∞} an = 0.
Throughout the chapter, we also use Z to denote a standard normal N(0, 1) random
variable, having cumulative distribution function Φ(·) and probability density function ϕ(·),
respectively.
3.1.2 Thresholding the Posterior Shrinkage Weight
As we noted in Chapter 2, the posterior mean under the IGG prior in Eq. 2–5,
E{E(θi | Xi, σi²) | Xi} = E{(1 − κi) | Xi} Xi, depends heavily on the shrinkage factor
κi = 1/(1 + σi²) = 1/(1 + λiξi). Let θ̂i denote the posterior mean under the IGG prior. In
particular, we established in Corollaries 2.1.1 through 2.3.2 that if Xi ≈ 0, then θ̂i ≈ 0;
meanwhile, for large Xi's, the posterior mean θ̂i ≈ Xi. Because of the concentration properties
of the IGG prior proven in Sections 2.2 and 2.3, a sensible thresholding rule classifies
observations as signals or as noise based on the posterior distribution of this shrinkage factor.
Consider the following testing rule for the ith observation Xi:

Reject H0i if E(1 − κi | Xi) > 1/2,   (3–1)
where κi is the shrinkage factor based on the IGGn prior in Eq. 2–6. We show in the
subsequent sections that the thresholding rule in Eq. 3–1 has both strong theoretical
guarantees and excellent empirical performance.
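Given posterior draws of the shrinkage factors κi (for example, from a Gibbs sampler as in Section 2.4.1), the rule in Eq. 3–1 reduces to a one-line computation; a sketch, in which the (n_draws × n) input format is our own convention:

```python
import numpy as np


def igg_threshold(kappa_draws):
    """Thresholding rule of Eq. 3-1: flag observation i as a signal when
    the posterior mean of 1 - kappa_i exceeds 1/2.

    kappa_draws: array of shape (n_draws, n) of posterior samples of
    kappa_i = 1/(1 + lambda_i * xi_i)."""
    weights = 1.0 - np.asarray(kappa_draws).mean(axis=0)  # E(1 - kappa_i | X_i)
    return weights > 0.5
```

The returned boolean vector marks the rejected null hypotheses H0i.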
3.2 Asymptotic Optimality of the IGG Classification Rule
Suppose that the true data generating model is Eq. 1–12, i.e.

θi i.i.d.∼ (1 − p) δ{0} + p N(0, ψ²),   i = 1, ..., n,

where ψ² > 0. Within the context of multiple testing, a good benchmark for our test
procedure in Eq. 3–1 is whether it is ABOS, i.e. whether its risk is asymptotically
equal to the Bayes Oracle risk in Eq. 1–17. Adopting the asymptotic framework of
Bogdan et al. (2011), we let RIGG denote the asymptotic Bayes risk of the testing rule in Eq. 3–1,
and we compare it to the Bayes Oracle risk, denoted as R^BO_Opt.
Before we state our main theorem, we first present four lemmas which give upper
and lower bounds on the Type I and Type II error probabilities, t1i and t2i respectively, for the
classification rule in Eq. 3–1. These error probabilities are given respectively by

t1i = Pr[ E(1 − κi | Xi) > 1/2 | H0i is true ],
t2i = Pr[ E(1 − κi | Xi) ≤ 1/2 | H1i is true ].   (3–2)
Lemma 3.1. Suppose that X1, ..., Xn are i.i.d. observations having the distribution in Eq. 1–13,
where the sequence of vectors (ψ², p) satisfies Assumption 1. Suppose we wish to test Eq.
1–14 using the classification rule in Eq. 3–1. Then for all n, an upper bound for the probability
of a Type I error for the ith test is given by

t1i ≤ [ 2bn / √(π(a + bn + 1/2)) ] · [ log( (a + bn + 1/2)/(2bn) ) ]^(−1/2).
Proof. See Appendix B.
Lemma 3.2. Suppose that X1, ..., Xn are i.i.d. observations following the distribution from
Eq. 1–13, where the sequence of vectors (ψ², p) satisfies Assumption 1. Suppose we wish to
test Eq. 1–14 using the classification rule in Eq. 3–1. Suppose further that a ∈ (1/2, ∞) and
bn ∈ (0, 1), with bn → 0 as n → ∞. Then for any η ∈ (0, 1/2), δ ∈ (0, 1), and sufficiently large
n, a lower bound for the probability of a Type I error for the ith test is given by

t1i ≥ 1 − Φ( √{ [2/(η(1 − δ))] log( (a + 1/2)(1 − η) / (bn (ηδ)^(a+1/2)) ) } ).
Proof. See Appendix B.
Lemma 3.3. Suppose we have the same set-up as Lemma 3.1. Assume further that bn → 0
in such a way that lim_{n→∞} bn^(1/4)/pn ∈ (0, ∞). Then for any η ∈ (0, 1/2), δ ∈ (0, 1), and
sufficiently large n, an upper bound for the probability of a Type II error for the ith test is
given by

t2i ≤ [ 2Φ( √( C/(2η(1 − δ)) ) ) − 1 ] (1 + o(1)),

where the o(1) terms tend to zero as n → ∞.
Proof. See Appendix B.
Lemma 3.4. Suppose we have the same set-up as Lemma 3.1. Then a lower bound for the
probability of a Type II error for the ith test is given by

t2i ≥ [ 2Φ(√C) − 1 ] (1 + o(1)) as n → ∞,

where the o(1) terms tend to zero as n → ∞.
Having obtained bounds on Type I and Type II errors in Lemmas 3.1, 3.2, 3.3, and 3.4, we
now state our main theorem.
Theorem 3.1. Suppose that X1, ..., Xn are i.i.d. observations drawn from the distribution in
Eq. 1–13, where the sequence of vectors (ψ², p) satisfies Assumption 1. Suppose we wish to
test Eq. 1–14 using the classification rule in Eq. 3–1. Suppose further that a ∈ (1/2, ∞) and
bn ∈ (0, 1), with bn → 0 as n → ∞ in such a way that lim_{n→∞} bn^(1/4)/pn ∈ (0, ∞). Then

lim_{n→∞} RIGG / R^BO_Opt = 1,   (3–3)

i.e. the classification rule in Eq. 3–1 based on the IGGn prior in Eq. 2–6 is ABOS.
Proof. See Appendix B.
[Figure: scatter plot of posterior inclusion probability against X]

Figure 3-1. Comparison between the posterior inclusion probabilities and the posterior
shrinkage weights 1 − E(κi | Xi) when p = 0.10.
We have shown that our thresholding rule based on the IGGn posterior asymptotically
attains the Bayes Oracle risk exactly, provided that bn decays to zero at a certain rate relative
to the sparsity level p. For example, if the prior mixing proportion pn is known, we can set the
hyperparameter bn = pn⁴. Then the conditions for the classification rule in Eq. 3–1 to be ABOS are
satisfied. Theorem 3.1 thus provides theoretical justification for using the IGG prior for signal
detection.
3.3 Simulation Study
For the multiple testing rule in Eq. 3–1, we adopt the simulation framework of Datta &
Ghosh (2013) and Ghosh et al. (2016) and fix sparsity levels at p ∈ {0.01, 0.05, 0.10, 0.15, 0.2,
0.25, 0.3, 0.35, 0.4, 0.45, 0.5} for a total of 11 simulation settings. For sample size
n = 200 and each p, we generate data from the two-groups model in Eq. 1–12, with
ψ = √(2 log n) ≈ 3.26. We then add Gaussian white noise to our data and apply the
thresholding rule in Eq. 3–1 using IGG1/n to classify the θi's in our model as either signals
(θi ≠ 0) or noise (θi = 0). We estimate the average misclassification probability (MP) for the
thresholding rule in Eq. 3–1 from 100 replicates.
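The data-generating step of this experiment can be sketched as follows (a minimal illustration of the set-up; the function name and seed handling are our own):

```python
import numpy as np


def simulate_two_groups(n=200, p=0.10, seed=0):
    """Draw theta from the two-groups model of Eq. 1-12 with
    psi = sqrt(2 log n), then add N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    psi = np.sqrt(2.0 * np.log(n))
    signal = rng.random(n) < p            # indicator of a nonzero mean
    theta = np.where(signal, rng.normal(0.0, psi, n), 0.0)
    x = theta + rng.normal(0.0, 1.0, n)   # observed noisy vector
    return theta, x, signal
```

Each replicate of the study draws a fresh (θ, X) pair from this model before applying the competing classification rules.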
[Figure: estimated misclassification probability against sparsity, with legend MP = p, Oracle, BH, IGG, DL, HS, HS+]

Figure 3-2. Estimated misclassification probabilities. The thresholding rule in Eq. 3–1 based on
the IGG posterior mean is nearly as good as the Bayes Oracle rule in Eq. 1–16.
Taking p = 0.10, we plot in Figure 3-1 the theoretical posterior inclusion probabilities
ωi(Xi) = Pr(νi = 1 | Xi) for the two-groups model in Eq. 1–12, given by

ωi(Xi) = π(νi = 1 | Xi) = { ((1 − p)/p) √(1 + ψ²) exp( −(Xi²/2) · ψ²/(1 + ψ²) ) + 1 }^(−1),
along with the shrinkage weights 1−E(κi |Xi) corresponding to the IGG1/n prior. The circles in
the figure denote the theoretical posterior inclusion probabilities, while the triangles correspond
to the shrinkage weights 1 − E(κi |Xi). The figure clearly shows that for small values of
the sparsity level p, the shrinkage weights are in close proximity to the posterior inclusion
probabilities. This, together with the theoretical results established in Section 3.2, justifies
using 1 − E(κi | Xi) as an approximation to the corresponding posterior inclusion probabilities
ωi(Xi) in sparse situations. This motivates the use of the IGG1/n prior in Eq. 2–6 and its
corresponding decision rule in Eq. 3–1 for identifying signals in noisy data.
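For reference, the theoretical inclusion probability ωi(Xi) displayed above is straightforward to compute; a sketch:

```python
import numpy as np


def inclusion_prob(x, p, psi):
    """Theoretical posterior inclusion probability omega_i(X_i) =
    Pr(nu_i = 1 | X_i) under the two-groups model of Eq. 1-12."""
    x = np.asarray(x, dtype=float)
    prior_odds = (1.0 - p) / p
    # Bayes factor of the null against the N(0, psi^2) slab at x
    bayes_factor = np.sqrt(1.0 + psi**2) * np.exp(
        -0.5 * x**2 * psi**2 / (1.0 + psi**2))
    return 1.0 / (prior_odds * bayes_factor + 1.0)
```

As expected, the probability is small near X = 0 and approaches one for large |X|.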
Figure 3-2 shows the estimated misclassification probabilities (MP) for the decision rule
in Eq. 3–1 for the IGG1/n prior, along with the estimated MP’s for the Bayes Oracle (BO),
the Benjamini-Hochberg procedure (BH), the Dirichlet-Laplace (DL), the horseshoe (HS), and
the horseshoe+ (HS+). The Bayes Oracle rule, defined in Eq. 1–16, is the decision rule that
minimizes the expected number of misclassified signals in Eq. 1–15 when (p,ψ) are known.
The Bayes Oracle therefore serves as the lower bound to the MP, whereas the line MP = p
corresponds to the situation where we reject all null hypotheses without looking at the data.
For the Benjamini-Hochberg rule, we use αn = 1/ log n = 0.1887. Bogdan et al. (2011)
theoretically established the ABOS property of the BH procedure for this choice of αn. For the
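For concreteness, the BH step-up rule at level α can be sketched as follows (a generic implementation of the standard procedure, not code from this dissertation):

```python
import numpy as np


def benjamini_hochberg(pvals, alpha=0.1887):
    """Benjamini-Hochberg step-up rule: with sorted p-values
    p_(1) <= ... <= p_(n), reject the k smallest, where k is the largest
    index with p_(k) <= alpha * k / n. The default alpha = 1/log(200)
    is the choice used in this section's simulations."""
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    below = np.nonzero(pvals[order] <= alpha * np.arange(1, n + 1) / n)[0]
    k = below[-1] + 1 if below.size else 0
    reject = np.zeros(n, dtype=bool)
    reject[order[:k]] = True
    return reject
```

The step-up structure (reject everything below the largest crossing index, not just the contiguous initial run) is what gives BH its FDR guarantee.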
DL, HS, and HS+ priors, we use the classification rule
Reject H0i if E(1− κi |X1...,Xn) >1
2, (3–4)
where κi = 11+σ2
i
and σ2i is the scale parameter in the scale-mixture shrinkage model in
Eq. 1–4. For the horseshoe and horseshoe+ priors, we specify a half-Cauchy prior on the
global parameter τ ∼ C+(0, 1). Since τ is a shared global parameter, the posterior for
κi depends on all the data. Carvalho et al. (2010) first introduced the thresholding rule
in Eq. 3–4 for the horseshoe. Ghosh et al. (2016) later extended the rule in Eq. 3–4 for
a general class of global-local shrinkage priors, which includes the Strawderman-Berger,
normal-exponential-gamma, and generalized double Pareto priors. Based on Ghosh et al.
(2016)’s simulation results, the horseshoe performs similarly to or better than these other
aforementioned priors, so we do not include these other priors in our comparison study.
Our results provide strong support for our theoretical findings in Section 3.2 and strong
justification for the use of the test procedure in Eq. 3–1 to classify signals. As Figure 3-2
illustrates, the misclassification probability for the IGG prior with (a, b) = (1/2 + 1/n, 1/n) is nearly
equal to that of the Bayes Oracle, which gives the lowest possible MP. The thresholding rule
in Eq. 3–4 based on the horseshoe+ prior and the Dirichlet-Laplace priors also appears to
be quite competitive compared to the Bayes Oracle. Bhadra et al. (2017) proved that the
horseshoe+ prior asymptotically matches the Bayes Oracle risk up to a multiplicative constant
if τ is treated as a tuning parameter, but did not prove this for the case where τ is endowed
with a prior. There also does not appear to be any theoretical justification for the thresholding
rule in Eq. 3–4 under the DL prior in the literature. On the other hand, Theorem 3.1 provides
Table 3-1. Comparison of false discovery rate (FDR) for different classification methods under
dense settings. The IGG1/n has the lowest FDR of all the different methods.

p      BO     BH     IGG    DL     HS     HS+
0.30   0.08   0.13   0.005  0.08   0.14   0.09
0.35   0.08   0.12   0.004  0.09   0.22   0.10
0.40   0.09   0.11   0.004  0.10   0.31   0.10
0.45   0.10   0.10   0.003  0.12   0.39   0.11
0.50   0.10   0.09   0.003  0.13   0.43   0.11
theoretical support for the use of the thresholding rule in Eq. 3–1 under the IGG prior, which is
confirmed by our empirical study.
Figure 3-2 also shows that the performance for the rule in Eq. 3–4 under the horseshoe
degrades considerably as θ = (θ1, ..., θn) becomes more dense. With sparsity level p = 0.5,
the horseshoe’s misclassification rate is close to 0.4, only marginally better than rejecting
all the null hypotheses without looking at the data. This phenomenon was also observed by
Datta & Ghosh (2013) and Ghosh et al. (2016). This appears to be because in the dense
setting, there are many noisy entries that are “moderately” far from zero, and the horseshoe
prior does not shrink these aggressively enough towards zero in order for the testing rule in Eq.
3–4 to classify these as true noise. The horseshoe+ prior seems to alleviate this by adding an
additional half-Cauchy C+(0, 1) prior to the Bayes hierarchy. In Table 3-1, we report the false
discovery rate (FDR) under dense settings for the different methods. We see that the FDR is
quite a bit larger for the horseshoe than for the other methods. Table 3-1 also shows that the
IGG1/n prior has very tight control over the FDR in dense settings. Although the IGG prior is
not constructed to specifically control FDR, we see that in practice, it does provide excellent
control of false positives.
Finally, we demonstrate the shrinkage properties corresponding to the IGG1/n prior
along with the horseshoe, the horseshoe+, and the Dirichlet-Laplace priors. In Figure 3-3, we
plot the posterior expectations E(θi | Xi) for the IGG1/n prior and the posterior expectations
E(θi | X1, ..., Xn) for the HS, HS+, and DL priors. The amount
of posterior shrinkage can be observed in terms of distance between the 45◦ line and the
[Figure: posterior expectation against X, with legend Flat, IGG, HS, HS+, DL]

Figure 3-3. Posterior mean E(θ|X) vs. X plot for p = 0.25.
posterior expectation. Figure 3-3 shows that near zero, the noisy entries are more aggressively
shrunk towards zero for the IGG1/n prior than for the other priors with poles at zero. This
confirms our findings in Theorem 2.7 which proved that the shrinkage profile near zero is more
aggressive for the IGG1/n prior in the Kullback-Leibler sense than for the HS or HS+ priors.
Meanwhile, Figure 3-3 also shows that the signals are left mostly unshrunk, confirming that the
IGG shares the same tail robustness as the other priors. The more aggressive shrinkage of noise
explains why the IGG achieves better estimation, as we demonstrated in Section 2.4.2.
3.4 Analysis of a Prostate Cancer Data Set
We demonstrate multiple testing with the IGG prior using the same prostate cancer data set
(Singh et al. (2002)) that we analyzed in Section 2.5. As described in Section 2.5, our aim is to identify
significant differences between control subjects and prostate cancer patients from n = 6033
genes. After conducting a two-sample t-test for each gene, we transform our test statistics to
z-scores (z1, ..., zn) through an inverse normal CDF transform to obtain the final model:
zi = θi + ϵi , i = 1, ..., n,
where ϵi ∼ N(0, 1). This allows us to implement the IGG prior on the z-scores to conduct the
simultaneous tests H0i : θi = 0 vs. H1i : θi ≠ 0, i = 1, ..., n, to identify genes that are
significantly associated with prostate cancer.
With our z-scores, we implement the IGG1/n model with (a, b) = (1/2 + 1/n, 1/n) on the
model in Eq. 2–14 and use the classification rule in Eq. 3–1 to identify significant genes. For
comparison, we also fit this model for the DL, HS, and HS+ priors, and benchmark it to the
Benjamini-Hochberg (BH) procedure with FDR level α set to 0.10. The IGG1/n prior selects 85 genes
as significant, in comparison to 60 genes under the BH procedure. The HS prior selects 62
genes as significant. The HS+ and DL priors select 41 and 42 genes respectively, indicating
more conservative estimates. All 60 of the genes flagged as significant by the BH procedure
are included in the 85 genes that the IGG prior classifies as significant. On the other hand, the
HS prior’s conclusions diverge from the BH procedure. Seven genes (genes 11, 377, 637, 805,
1588, 3269, and 4040) are deemed significant by the HS, but not by BH.
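As a point of reference for the comparison above, the BH step-up rule at level α can be sketched in a few lines. The Python function below is our own generic illustration (not the code used for this analysis); it forms two-sided normal p-values from the z-scores and applies the step-up rule.

```python
import math
import numpy as np

def benjamini_hochberg(z, alpha=0.10):
    """BH step-up rule on two-sided p-values computed from z-scores.

    Returns a boolean array marking the hypotheses rejected at FDR level alpha.
    """
    n = len(z)
    # two-sided p-value: 2*(1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    pvals = np.array([math.erfc(abs(zi) / math.sqrt(2)) for zi in z])
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, n + 1) / n
    passed = pvals[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()      # largest index meeting the step-up bound
        reject[order[: k + 1]] = True        # reject all hypotheses with smaller p-values
    return reject
```

On the prostate data, one would call `benjamini_hochberg(z, alpha=0.10)` with the vector of 6033 z-scores.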
3.5 Concluding Remarks
In this chapter, we have introduced a thresholding rule in Eq. 3–1 based on the posterior
density of the shrinkage factor κi = 1/(1 + λiξi) under the inverse gamma-gamma prior in Eq. 2–5.
We have shown that our thresholding rule asymptotically attains the Bayes Oracle risk in Eq.
1–17 exactly under mild conditions.
Our work ultimately moves the testing problem beyond the global-local framework in Eq.
1–5. Previously, Datta & Ghosh (2013), Ghosh et al. (2016), Ghosh & Chakrabarti (2017),
and Bhadra et al. (2017) have shown that horseshoe or horseshoe-like priors asymptotically
attain the Bayes Oracle risk (possibly up to a multiplicative constant) either by specifying a
rate for the global parameter τ in Eq. 1–5 or by estimating it with an empirical Bayes plug-in
estimator. In the case with the IGG prior, we prove that our thresholding rule based on the
posterior mean is ABOS without utilizing a shared global tuning parameter τ . This appears
to be the first time that the Bayes Oracle property has been established for a scale-mixture
shrinkage prior that falls outside the class of global-local shrinkage priors.
Finally, through our simulation studies, we have shown that the classification rule in Eq.
3–1 performs well under both sparse and dense settings. Our classification rule yields roughly
the same misclassification probability (MP) as the Bayes Oracle in all simulation settings.
We have also shown that the IGG prior provides very tight control of the false discovery rate.
The aggressive shrinkage profile near zero was theoretically established in Theorem 2.7 and is
illustrated in Figure 3-3. As a result, the IGG squelches noise more effectively than many other
global-local shrinkage priors, and this keeps the number of false positives low.
CHAPTER 4
HIGH-DIMENSIONAL MULTIVARIATE POSTERIOR CONSISTENCY UNDER
GLOBAL-LOCAL SHRINKAGE PRIORS
In this chapter, we consider the multivariate normal linear regression model in Eq. 1–25,
Y = XB+ E,
where Y = (Y1, ...,Yq) is an n × q response matrix of n samples and q response variables,
X is an n × p matrix of n samples and p covariates, B ∈ Rp×q is the coefficient matrix,
and E = (ε1, ..., εn)^⊤ is an n × q noise matrix. Under normality, we assume that the εi are
i.i.d. Nq(0, Σ), i = 1, ..., n. In other words, each row of E is identically distributed with mean 0 and
covariance Σ. Throughout this chapter, we also assume that Y and X are centered so there is
no intercept term in B.
In high-dimensional settings, we are often interested in both sparse estimation of B and
variable selection from the p covariates. We adopt a Bayesian approach to this joint problem
by using global-local (GL) shrinkage priors in Eq. 1–5. GL priors were introduced in Section
1.2.2, and this class of priors encompasses a wide variety of scale-mixture shrinkage priors,
including the horseshoe prior in Eq. 1–6 and many others (e.g., see Table 1-1).
We specifically consider polynomial-tailed priors, which are priors that have tails that
are heavier than exponential. A formal mathematical definition was given in Eq. 1–7, and
we reiterate it later in this chapter. Although polynomial-tailed priors have been studied
extensively in univariate regression, their potential utility for multivariate analysis seems to have
been largely overlooked. In this chapter, we introduce a new Bayesian approach for estimating
the unknown p × q coefficient matrix B in Eq. 1–25 using polynomial-tailed priors. We call our
method the Multivariate Bayesian model with Shrinkage Priors (MBSP).
While there have been many methodological developments for Bayesian multivariate
linear regression, theoretical results in this domain have not kept pace with applications.
There appears to be very little theoretical justification for adopting Bayesian methodology in
multivariate regression. In this thesis, we take a step towards resolving this gap by providing
sufficient conditions under which Bayesian multivariate linear regression models can obtain
posterior consistency. To our knowledge, Theorem 4.2 is the first result in the literature to give
general conditions for posterior consistency under the model in Eq. 1–25 when p > n and when
p grows at nearly exponential rate with sample size n. We further illustrate that our method
based on polynomial-tailed priors achieves strong posterior consistency in both low-dimensional
and ultrahigh-dimensional settings.
The rest of the chapter is organized as follows. In Section 4.1, we introduce the MBSP
model and provide some insight into how it facilitates sparse estimation and variable selection.
In Section 4.2, we present sufficient conditions for our model to achieve posterior consistency
in both the cases where p grows slower than n and the case when p grows at nearly
exponential rate with n. In Section 4.3, we show how to implement MBSP using the three
parameter beta normal (TPBN) family of priors in Eq. 2–2 as a special case and how to utilize
our method for variable selection. Efficient Gibbs sampling and computational complexity
considerations are also discussed. In Section 4.4, we illustrate our method’s finite sample
performance through simulations and analysis of a real data set.
4.1 Multivariate Bayesian Model with Shrinkage Priors (MBSP)
4.1.1 Preliminary Notation and Definitions
We first introduce the following notation and definitions.
Definition 4.1. A random matrix Y is said to have the matrix-normal density if Y has the
density function (on the space R^{a×b}):

f(Y) = (2π)^{-ab/2} |U|^{-b/2} |V|^{-a/2} exp{ -(1/2) tr[U^{-1}(Y − M)V^{-1}(Y − M)^⊤] },   (4–1)

where M ∈ R^{a×b}, and U and V are positive definite matrices of dimension a × a and b × b
respectively. If Y is distributed as a matrix-normal distribution with pdf given in Eq. 4–1, we
write Y ∼ MN_{a×b}(M, U, V).
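As a sanity check on Eq. 4–1 (not part of the text itself): if Y ∼ MN_{a×b}(M, U, V), then vec(Y) ∼ N(vec(M), V ⊗ U) under column-stacking. The Python sketch below (all function and variable names are our own) evaluates the matrix-normal density on the log scale and confirms it matches the corresponding multivariate normal density.

```python
import numpy as np

def matrix_normal_logpdf(Y, M, U, V):
    """Log of the matrix-normal density in Eq. 4-1."""
    a, b = Y.shape
    R = Y - M
    # quadratic form tr[U^{-1} R V^{-1} R^T]
    quad = np.trace(np.linalg.solve(U, R) @ np.linalg.solve(V, R.T))
    _, ldU = np.linalg.slogdet(U)
    _, ldV = np.linalg.slogdet(V)
    return -0.5 * (a * b * np.log(2 * np.pi) + b * ldU + a * ldV + quad)

def mvn_logpdf(x, mu, S):
    """Log-density of a multivariate normal N(mu, S)."""
    r = x - mu
    _, ldS = np.linalg.slogdet(S)
    return -0.5 * (len(x) * np.log(2 * np.pi) + ldS + r @ np.linalg.solve(S, r))

rng = np.random.default_rng(0)
a, b = 3, 2
M = rng.standard_normal((a, b))
G1 = rng.standard_normal((a, a)); U = G1 @ G1.T + a * np.eye(a)   # random SPD row covariance
G2 = rng.standard_normal((b, b)); V = G2 @ G2.T + b * np.eye(b)   # random SPD column covariance
Y = rng.standard_normal((a, b))

lp_matrix = matrix_normal_logpdf(Y, M, U, V)
# vec() stacks columns, so vec(Y) ~ N(vec(M), V kron U)
lp_vec = mvn_logpdf(Y.flatten(order="F"), M.flatten(order="F"), np.kron(V, U))
assert np.isclose(lp_matrix, lp_vec)
```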
Definition 4.2. The matrix O ∈ Ra×b denotes the a × b matrix with all zero entries.
4.1.2 MBSP Model
Our multivariate Bayesian model formulation for the model in Eq. 1–25 with shrinkage
priors (henceforth referred to as MBSP) is as follows:
Y | X, B, Σ ∼ MN_{n×q}(XB, In, Σ),
B | ξ1, ..., ξp, Σ ∼ MN_{p×q}(O, τ diag(ξ1, ..., ξp), Σ),
ξi ∼ π(ξi) independently, i = 1, ..., p,   (4–2)
where π(ξi) is a polynomial-tailed prior density of the form in Eq. 1–7,
π(ξi) = K ξi^{-a-1} L(ξi),

where K > 0 is the constant of proportionality, a is a positive real number, and L is a positive,
measurable, non-constant, slowly varying function over (0,∞). The formal definition of slowly
varying functions was given in Definition 1.1, and examples of polynomial-tailed priors are given
in Table 1-1.
4.1.3 Handling Sparsity
In this section, we illustrate how the MBSP model induces sparsity. First note that in Eq.
4–2, an alternative way of writing the density of Y | X, B, Σ is

f(Y | X, B, Σ) ∝ |Σ|^{-n/2} exp{ -(1/2) ∑_{i=1}^n (yi − ∑_{j=1}^p xij bj)^⊤ Σ^{-1} (yi − ∑_{j=1}^p xij bj) },   (4–3)

where bj denotes the jth row of B.
Following from Eq. 4–3, we see that under the model in Eq. 4–2 with known Σ, the joint
prior density π(B, ξ1, ..., ξp) is

π(B, ξ1, ..., ξp) ∝ ∏_{j=1}^p ξj^{-q/2} exp{ -(1/(2ξj)) ||bj(τΣ)^{-1/2}||2^2 } π(ξj),   (4–4)

where || · ||2 denotes the ℓ2 vector norm. Since the p rows of B are independent, we see from
Eq. 4–4 that this choice of prior induces sparsity on the rows of B, while also accounting for
the covariance structure of the q responses. This ultimately facilitates sparse estimation of B
as a whole and variable selection from the p regressors.
For example, if ξj ∼ IG(αj, γj/2) independently (where IG denotes the inverse-gamma density), then
the marginal density for B (after integrating out the ξj's) is proportional to

∏_{j=1}^p ( ||bj(τΣ)^{-1/2}||2^2 + γj )^{-(αj + q/2)},   (4–5)

which corresponds to a multivariate Student's t density.
On the other hand, if π(ξj) ∝ ξj^{q/2-1}(1 + ξj)^{-1}, then the joint density in Eq. 4–4 is proportional to

∏_{j=1}^p ξj^{-1}(1 + ξj)^{-1} exp{ -(1/(2ξj)) ||bj(τΣ)^{-1/2}||2^2 },   (4–6)

and integrating out the ξj's gives a multivariate horseshoe density function.
As the examples in Eq. 4–5 and Eq. 4–6 demonstrate, our model allows us to obtain
sparse estimates of B by inducing row-wise sparsity in B with a matrix-normal scale mixture
using global-local shrinkage priors. This row-wise sparsity also facilitates variable selection from
the p variables.
4.2 Posterior Consistency of MBSP
4.2.1 Notation
We first introduce some notation that will be used throughout this chapter. For any two
sequences of positive real numbers {an} and {bn} with bn ≠ 0, we write an = O(bn) if
|an/bn| ≤ M for all n, for some positive real number M independent of n, and an = o(bn) to
denote lim_{n→∞} an/bn = 0. Therefore, an = o(1) if lim_{n→∞} an = 0.
For a vector v ∈ R^n, ||v||2 := (∑_{i=1}^n vi^2)^{1/2} denotes the ℓ2 norm. For a matrix A ∈ R^{a×b}
with entries aij, ||A||F := (tr(A^⊤A))^{1/2} = (∑_{i=1}^a ∑_{j=1}^b aij^2)^{1/2} denotes the Frobenius norm of A.
For a symmetric matrix A, we denote its minimum and maximum eigenvalues by λmin(A) and
λmax(A) respectively. Finally, for an arbitrary set A, we denote its cardinality by |A|.
4.2.2 Definition of Posterior Consistency
For this section, we denote the number of predictors by pn to emphasize that p depends
on n and is allowed to grow with n. Suppose that the true model is
Yn = XnB0n + En, (4–7)
where Yn := (Yn,1, ..., Yn,q) and En ∼ MN_{n×q}(O, In, Σ). For convenience, we denote B0n as
B0 going forward, noting B0 depends on pn (and therefore on n).
Let {B0}n≥1 be the sequence of true coefficient matrices, and let P0 denote the
distribution of {Yn}n≥1 under Eq. 4–7. Let {πn(Bn)}n≥1 and {πn(Bn|Yn)}n≥1 denote
the sequences of prior and posterior densities for the coefficient matrix Bn. Analogously, let
{Πn(Bn)}n≥1 and {Πn(Bn|Yn)}n≥1 denote the sequences of prior and posterior distributions.
In order to achieve consistent estimation of B0(≡ B0n), the posterior probability that Bn lies
in an ε-neighborhood of B0 should converge to 1 almost surely with respect to P0 measure as
n → ∞. We therefore define strong posterior consistency as follows:
Definition 4.3. (posterior consistency) Let Bn = {Bn : ||Bn − B0||F > ε}, where
ε > 0. The sequence of posterior distributions of Bn under the prior πn(Bn) is said to be strongly
consistent under Eq. 4–7 if, for any ε > 0,

Πn(Bn|Yn) = Πn(||Bn − B0||F > ε | Yn) → 0 a.s. P0 as n → ∞.
Using Definition 4.3, we now state two general theorems and a corollary that provide
general conditions under which priors on B (not just the MBSP model) may achieve strong
posterior consistency in both low-dimensional and ultrahigh-dimensional settings.
4.2.3 Sufficient Conditions for Posterior Consistency
For our theoretical investigation, we assume Σ to be fixed and known and the dimension of
the response variables, q, to be fixed. In practice, Σ is typically unknown, and one can estimate
it from the data. In Section 4.3, we present a fully Bayesian implementation of MBSP by
placing an appropriate inverse-Wishart prior on Σ.
Theorem 4.1 applies to the case where the number of predictors pn diverges to ∞ at a
rate slower than n as n → ∞, while Theorem 4.2 applies to the case where pn grows to ∞
at a faster rate than n as n → ∞. To handle these two cases, we require different sets of
regularity assumptions.
4.2.3.1 Low-Dimensional Case
We first impose the following regularity conditions which are all standard ones used in
the literature and relatively mild (see, for example, Armagan et al. (2013a)). In particular,
Assumption (A2) ensures that X⊤n Xn is positive-definite for all n and that B0 is estimable.
Regularity Conditions
(A1) pn = o(n) and pn ≤ n for all n ≥ 1.
(A2) There exist constants c1, c2 so that

0 < c1 < lim inf_{n→∞} λmin(Xn^⊤Xn/n) ≤ lim sup_{n→∞} λmax(Xn^⊤Xn/n) < c2 < ∞.

(A3) There exist constants d1 and d2 so that

0 < d1 < λmin(Σ) ≤ λmax(Σ) < d2 < ∞.
Using these conditions, we are able to attain a very simple sufficient condition for strong
posterior consistency under Eq. 4–7, as defined in Definition 4.3, which we state in the next
theorem.
Theorem 4.1. Assume that conditions (A1)-(A3) hold. Then the posterior of Bn under any
prior πn(Bn) is strongly consistent under Eq. 4–7, i.e., for Bn = {Bn : ||Bn − B0||F > ε} and
any arbitrary ε > 0,

Πn(Bn|Yn) → 0 a.s. P0 as n → ∞

if

Πn(Bn : ||Bn − B0||F < Δ/n^{ρ/2}) > exp(−kn)   (4–8)

for all 0 < Δ < ε^2 c1 d1^{1/2}/(48 c2^{1/2} d2) and 0 < k < ε^2 c1/(32 d2) − 3Δ c2^{1/2}/(2 d1^{1/2}), where ρ > 0.
Proof. See Appendix C.1.
The condition in Eq. 4–8 in Theorem 4.1 states that as long as the prior distribution for Bn captures
B0 inside a ball of radius Δ/n^{ρ/2} with sufficiently high probability for large n, the posterior of
Bn will be strongly consistent.
4.2.3.2 Ultrahigh Dimensional Case
To achieve posterior consistency when pn ≫ n and pn ≥ O(n), we require additional
restrictions on the eigenstructure of Xn and an additional assumption on the size of the true
model. Working under the assumption of sparsity, we assume that the true model in Eq. 4–7
contains only a few nonzero predictors. That is, most of the rows of B0 should contain only
zero entries. We denote by S∗ ⊂ {1, 2, ..., pn} the set of indices of the rows of B0 with at
least one nonzero entry and let s∗ = |S∗| be the size of S∗. We need the following regularity
conditions.
Regularity Conditions
(B1) pn > n for all n ≥ 1, and log(pn) = O(nd) for some 0 < d < 1.
(B2) The rank of Xn is n.
(B3) Let J denote a set of indices, where J ⊂ {1, ..., pn} such that |J | ≤ n. Let XJ denote
the submatrix of Xn that contains the columns with indices in J . For any such set J ,
there exists a finite constant c1(> 0) so that

lim inf_{n→∞} λmin(XJ^⊤XJ/n) ≥ c1.
(B4) There is a finite constant c2(> 0) so that

lim sup_{n→∞} λmax(Xn^⊤Xn/n) < c2.
(B5) There exist constants d1 and d2 so that

0 < d1 < λmin(Σ) ≤ λmax(Σ) < d2 < ∞.
(B6) S∗ is nonempty for all n ≥ 1, and s∗ = o(n/ log(pn)).
Condition (B1) allows the number of predictors pn to grow at nearly exponential rate.
In particular, pn may grow at a rate of e^{n^d}, where 0 < d < 1. In the high-dimensional
literature, it is a standard assumption that log(pn) = o(n). Condition (B3) assumes that for
any full-rank submatrix XJ of Xn, the minimum eigenvalue of XJ^⊤XJ is asymptotically bounded below by nc1.
This condition is needed to overcome potential identifiability issues, since trivially, the smallest
singular value of Xn is zero. (B4) imposes an asymptotic upper bound on the maximum eigenvalue of
Xn^⊤Xn/n, which poses no issue. Finally, Condition (B6) allows the true model size to grow with n
but at a rate slower than n/log(pn). (B6) is a standard condition that has been used to establish
estimation consistency when pn grows at nearly exponential rate with n for frequentist point
estimators, such as the Dantzig estimator (Candes & Tao (2007)), the scaled LASSO (Sun &
Zhang (2012)), and the LASSO (Tibshirani (1996)). In ultrahigh-dimensional problems, it is
generally agreed that s∗ must be small relative to both p and n in order to attain estimation
consistency and minimax convergence rates, and hence, this restriction on the growth rate of
s∗.
Under these regularity conditions, we are able to attain a simple sufficient condition for
posterior consistency under Eq. 4–7 even when pn grows faster than n. Theorem 4.2 gives the
sufficient condition for strong consistency.
Theorem 4.2. Assume that conditions (B1)-(B6) hold. Then the posterior of Bn under any
prior πn(Bn) is strongly consistent under Eq. 4–7, i.e., for Bn = {Bn : ||Bn − B0||F > ε} and
any arbitrary ε > 0,

Πn(Bn|Yn) → 0 a.s. P0 as n → ∞

if

Πn(Bn : ||Bn − B0||F < Δ/n^{ρ/2}) > exp(−kn)   (4–9)

for all 0 < Δ < ε^2 c1 d1^{1/2}/(48 c2^{1/2} d2) and 0 < k < ε^2 c1/(32 d2) − 3Δ c2^{1/2}/(2 d1^{1/2}), where ρ > 0.

Proof. See Appendix C.1.
Proof. See Appendix C.1.
Similar to Eq. 4–8 in Theorem 4.1, Eq. 4–9 in Theorem 4.2 states that as long as the
prior distribution for Bn captures B0 inside a ball of radius Δ/n^{ρ/2} with sufficiently high
probability for large n, the posterior of Bn will be strongly consistent. To our knowledge,
Theorem 4.2 is the first theorem in the literature to address the issue of ultra high-dimensional
consistency in Bayesian multivariate linear regression. There has been very little theoretical
investigation done in the framework of Bayesian multivariate regression, and the results in this
thesis take a step towards narrowing this theoretical gap.
Now that we have provided simple sufficient conditions for posterior consistency in
Theorems 4.1 and 4.2, we are ready to state our main theorems which demonstrate the power
of the MBSP model in Eq. 4–2 under polynomial-tailed hyperpriors of the form in Eq. 1–7.
4.2.4 Sufficient Conditions for Posterior Consistency of MBSP
We now establish posterior consistency under the MBSP model in Eq. 4–2, assuming that
Σ is fixed and known, q is fixed, and that τ = τn is a tuning parameter that depends on n.
As in Section 4.2.3, we assume that most of the rows of B0 are zero, i.e. that the true
model S∗ ⊂ {1, ..., pn} is small relative to the total number of predictors. As before, we
consider the cases where pn = o(n) and pn ≥ O(n) separately. We also require the following
regularity assumptions which turn out to be sufficient for both the low-dimensional and ultra
high-dimensional cases. Here, b0jk denotes an entry in B0.
Regularity Conditions
(C1) For the slowly varying function L(t) in the priors for ξi, 1 ≤ i ≤ p, in Eq. 1–7,
lim_{t→∞} L(t) ∈ (0,∞). That is, there exists c0(> 0) such that L(t) ≥ c0 for all t ≥ t0,
for some t0 which depends on both L and c0.
(C2) There exists M > 0 so that sup_{j,k} |b0jk| ≤ M < ∞ for all n, i.e. the maximum entry of
B0 is uniformly bounded above in absolute value.
(C3) 0 < τn < 1 for all n, and τn = o(pn^{-1} n^{-ρ}) for ρ > 0.
Remark 1. Condition (C1) is a very mild condition which ensures that L(·) is slowly varying.
Ghosh et al. (2016) established that (C1) holds for L(·) in the TPBN priors (L(ξi) = (1 + 1/ξi)^{-(α+β)}) and the GDP priors (L(ξi) = 2^{-α/2-1} ∫_0^∞ e^{-β(2u/ξi)^{1/2}} e^{-u} u^{(α/2+1)-1} du). The TPBN
family in particular includes many well-known one-group shrinkage priors, such as the horseshoe
prior (α = 0.5, β = 0.5), the Strawderman-Berger prior (α = 1, β = 0.5), and the
normal-exponential-gamma prior (α = 1, β > 0). As remarked by Ghosh & Chakrabarti
(2017), one easily verifies that Assumption (C1) also holds for the inverse-gamma priors
(π(ξi) ∝ ξi^{-α-1} e^{-b/ξi}) and the half-t priors (π(ξi) ∝ (1 + ξi/ν)^{-(ν+1)/2}).
Remark 2. Condition (C2) is a mild condition that bounds the entries of B0 in absolute value
for all n, while (C3) specifies an appropriate rate of decay for τn. It is possible that the upper
bound on the rate for τn can be loosened for individual GL priors. However, since we wish to
encompass all possible priors of the form in Eq. 1–7, we provide a general rate that works for
all the polynomial-tailed priors considered in this thesis.
We are now ready to state our main theorem for posterior consistency of the MBSP model.
Theorem 4.3 (low-dimensional case). Suppose that we have the MBSP model in Eq. 4–2
with hyperpriors of the form in Eq. 1–7. Provided that Assumptions (A1)-(A3) and (C1)-(C3)
hold, our model achieves strong posterior consistency. That is, for any ε > 0,
Πn(Bn : ||Bn − B0||F > ε | Yn) → 0 a.s. P0 as n → ∞.
Proof. See Appendix C.2.
Theorem 4.3 establishes posterior consistency for the MBSP model only when pn = o(n).
We also note that in the low-dimensional setting where pn = o(n), we place no restrictions on
the growth on the number of nonzero predictors in the true model relative to sample size n.
This contrasts with a previous result by Armagan et al. (2013a), who required that the number
of true nonzero covariates grow slower than n/log(n).
In the ultra high-dimensional case where pn ≥ O(n), we can still achieve posterior
consistency under the MBSP model, with additional mild restrictions on the design matrix Xn
and on the size of the true model. Theorem 4.4 deals with the ultra high-dimensional scenario.
Theorem 4.4 (ultra high-dimensional case). Suppose that we have the MBSP model
in Eq. 4–2 with hyperpriors of the form in 1–7. Provided that Assumptions (B1)-(B6) and
(C1)-(C3) hold, our model achieves strong posterior consistency. That is, for any ε > 0,
Πn(Bn : ||Bn − B0||F > ε | Yn) → 0 a.s. P0 as n → ∞.
Proof. See Appendix C.2.
Interestingly enough, to ensure posterior consistency in the ultrahigh-dimensional setting,
the only thing that needs to be controlled is the tuning parameter τn, provided that our
hyperpriors in Eq. 4–2 have the form in Eq. 1–7. However, in the high-dimensional regime,
pn is allowed to grow at nearly exponential rate, and therefore, the rate of decay for τn from
Condition (C3) necessarily needs to be much faster. Intuitively, this makes sense because we
must sum over pnq terms in order to compute the Frobenius normed difference in Theorem
4.4.
Taken together, Theorems 4.3 and 4.4 both provide theoretical justification for the use of
global-local shrinkage priors for multivariate linear regression. Even when we allow the number
of predictors to grow at nearly exponential rate, the posterior distribution under the MBSP
model is able to consistently estimate B0 in Eq. 4–7. Our result is also very general in that a
wide class of shrinkage priors, as indicated in Table 1-1, can be used for the hyperpriors ξi ’s in
Eq. 4–2.
4.3 Implementation of the MBSP Model
In this section, we demonstrate how to implement the MBSP model using the three
parameter beta normal (TPBN) mixture family (Armagan et al. (2011)). We choose the
TPBN family because it is rich enough to generalize several well-known polynomial-tailed
priors. Although we focus on the TPBN family, our model can easily be implemented for other
global-local shrinkage priors (such as the Student’s t prior or the generalized double Pareto
prior) using similar techniques as the ones we describe below.
4.3.1 TPBN Family
A random variable y is said to follow the three parameter beta density, denoted
TPB(u, a, τ), if

π(y) = [Γ(u + a)/(Γ(u)Γ(a))] τ^a y^{a-1} (1 − y)^{u-1} {1 − (1 − τ)y}^{-(u+a)}.
In univariate regression, a global-local shrinkage prior of the form

βi | τ, ξi ∼ N(0, τξi), i = 1, ..., p,
π(ξi) = [Γ(u + a)/(Γ(u)Γ(a))] ξi^{u-1} (1 + ξi)^{-(u+a)}, i = 1, ..., p,   (4–10)

may therefore be represented alternatively as

βi | νi ∼ N(0, νi^{-1} − 1),
νi ∼ TPB(u, a, τ).   (4–11)

After integrating out νi in Eq. 4–11, the marginal prior for βi is said to belong to the TPBN
family. Special cases of Eq. 4–11 include the horseshoe prior (u = 0.5, a = 0.5), the
Strawderman-Berger prior (u = 1, a = 0.5), and the normal-exponential-gamma (NEG) prior
(u = 1, a > 0). By Proposition 1 of Armagan et al. (2011), Eq. 4–10 and Eq. 4–11 can also
be written as a hierarchical mixture of two Gamma distributions,

βi | ψi ∼ N(0, ψi), ψi | ζi ∼ G(u, ζi), ζi ∼ G(a, τ),   (4–12)

where ψi = ξiτ.
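The equivalence between Eq. 4–11 and the Gamma-Gamma hierarchy in Eq. 4–12 can be checked by simulation. The Python sketch below is our own illustration; it assumes the rate parameterization G(shape, rate), under which the marginal of ψi is proportional to ψ^{u-1}(ψ + τ)^{-(u+a)}. For u = a = 0.5 this is the horseshoe case, whose local scale √ξi is standard half-Cauchy.

```python
import numpy as np

rng = np.random.default_rng(1)
u = a = 0.5          # horseshoe special case of the TPBN family
tau = 1.0
N = 200_000

# Gamma-Gamma hierarchy of Eq. 4-12 (rate parameterization assumed):
# zeta_i ~ G(a, rate=tau), then psi_i | zeta_i ~ G(u, rate=zeta_i)
zeta = rng.gamma(shape=a, scale=1.0 / tau, size=N)
psi = rng.gamma(shape=u, scale=1.0 / zeta)

# Direct half-Cauchy representation: psi_i = tau * xi_i, sqrt(xi_i) ~ C+(0, 1)
lam = np.abs(rng.standard_cauchy(N))
psi_hc = tau * lam ** 2

# Both routes should yield the same distribution; compare medians of sqrt(psi)
assert abs(np.median(np.sqrt(psi)) - np.median(np.sqrt(psi_hc))) < 0.05

beta = rng.standard_normal(N) * np.sqrt(psi)   # draws from the marginal TPBN prior
```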
4.3.2 The MBSP-TPBN Model
Taking our MBSP model in Eq. 4–2 with the TPBN family as our chosen prior and
placing an inverse-Wishart conjugate prior on Σ, we can construct a specific variant of the
MBSP model which we term the MBSP-TPBN model. For our theoretical study of MBSP, we
assumed Σ to be known and the dimension of the responses q to be fixed (and thus, q < n
for large n). However, in order for our model to be implemented in finite samples, q can be of
any size (including q ≫ n), provided that the posterior distribution is proper. The use of an
inverse-Wishart prior ensures posterior propriety.
Reparametrizing the variance terms τξi, 1 ≤ i ≤ p, in terms of the ψi's from Eq. 4–12,
the MBSP-TPBN model is as follows:

Y | X, B, Σ ∼ MN_{n×q}(XB, In, Σ),
B | ψ1, ..., ψp, Σ ∼ MN_{p×q}(O, diag(ψ1, ..., ψp), Σ),
ψi | ζi ∼ G(u, ζi) independently, i = 1, ..., p,
ζi ∼ G(a, τ) i.i.d., i = 1, ..., p,
Σ ∼ IW(d, kIq),   (4–13)

where u, a, d, k, and τ are appropriately chosen hyperparameters. The MBSP-TPBN model
can be implemented using the R package MBSP, which is available on the Comprehensive R
Archive Network (CRAN).
4.3.2.1 Computational Details
The full conditional densities under the MBSP-TPBN model in Eq. 4–13 are available in
closed form, and hence, can be implemented straightforwardly using Gibbs sampling. Moreover,
by suitably modifying an algorithm introduced by Bhattacharya et al. (2016) for drawing from
the matrix-normal density in Eq. 4–1, we can significantly reduce the computational complexity
of sampling from the full conditional density for B from O(p3) to O(n2p) when p ≫ n. We
provide technical details for our Gibbs sampling algorithm and our algorithm for sampling
efficiently from the conditional density for B in Appendix D.
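To give a feel for the fast-sampling step, here is a univariate-response sketch in the spirit of the Bhattacharya et al. (2016) algorithm: to draw θ ∼ N(A^{-1}Φ^⊤y, A^{-1}) with A = Φ^⊤Φ + D^{-1}, one samples in the n-dimensional space instead of inverting the p × p matrix A. This Python version is our own illustration (the actual MBSP package works with the matrix-variate conditional for B in R).

```python
import numpy as np

def fast_normal_draw(Phi, D_diag, y, rng):
    """Draw from N(A^{-1} Phi^T y, A^{-1}), A = Phi^T Phi + diag(D_diag)^{-1},
    in O(n^2 p) time by solving only an n x n system."""
    n, p = Phi.shape
    u = rng.standard_normal(p) * np.sqrt(D_diag)       # u ~ N(0, D)
    delta = rng.standard_normal(n)                     # delta ~ N(0, I_n)
    v = Phi @ u + delta
    M = (Phi * D_diag) @ Phi.T + np.eye(n)             # Phi D Phi^T + I_n (n x n)
    w = np.linalg.solve(M, y - v)
    return u + D_diag * (Phi.T @ w)

# Sanity check: the implied posterior mean agrees with the direct p x p solve,
# which is exactly the Woodbury identity underlying the algorithm.
rng = np.random.default_rng(2)
n, p = 20, 50
Phi = rng.standard_normal((n, p))
D_diag = rng.uniform(0.1, 2.0, size=p)
y = rng.standard_normal(n)
mean_fast = D_diag * (Phi.T @ np.linalg.solve((Phi * D_diag) @ Phi.T + np.eye(n), y))
mean_direct = np.linalg.solve(Phi.T @ Phi + np.diag(1.0 / D_diag), Phi.T @ y)
assert np.allclose(mean_fast, mean_direct)
draw = fast_normal_draw(Phi, D_diag, y, rng)
```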
In our experience, with good initial estimates (B^(init), Σ^(init)) for B and Σ, the Gibbs
sampler converges quite quickly, usually within 5000 iterations. In Appendix D, we describe
how to initialize (B^(init), Σ^(init)) and also provide history plots of the draws
from the Gibbs sampler for individual coefficients of B from experiment 5 (n = 100, p = 500,
q = 3) and experiment 6 (n = 150, p = 1000, q = 4) of our simulation studies in Section
4.4.1, which illustrate rapid convergence.
Although our algorithm is efficient, Gibbs sampling can still be prohibitive if p is
extremely large (say, on the order of millions). In this case, we recommend first screening
the p covariates based on the magnitude of their marginal correlations with the responses
(y1, ... , yq) and then implementing the MBSP model on the reduced subset of covariates.
This marginal screening technique for dimension reduction has long been advocated for
ultrahigh-dimensional problems, even for non-Bayesian approaches (e.g., Fan & Lv (2008),
Fan & Song (2010)). Faster alternatives to MCMC to handle extremely large p are also worth
exploring in the future.
4.3.2.2 Specification of Hyperparameters τ , d , and k
Just as in Eq. 4–2, the τ in Eq. 4–13 continues to act as a global shrinkage parameter.
A natural question is how to specify an appropriate value for τ . Armagan et al. (2011)
recommend setting τ to the expected level of sparsity. Given our theoretical results in
Theorems 4.3 and 4.4, we set τ ≡ τn = 1/(p√n log n). This choice of τ satisfies the sufficient
conditions for posterior consistency in both the low-dimensional and the high-dimensional
settings when Σ is fixed and known.
In order to specify the hyperparameters d and k in the IW(d, kIq) prior for Σ, we appeal
to the arguments made by Brown et al. (1998). As noted by Brown et al. (1998), if we set
d = 3, then Σ has a finite first moment, with E(Σ) = k/(d − 2) Iq = kIq. Additionally, as argued
in Bhadra & Mallick (2013) and Brown et al. (1998), k should a priori be comparable in size
with the likely variances of Y given X. Accordingly, we take our initial estimate of B from the
Gibbs sampler, B^(init) (specified in Section 4.3.2.1), and take k as the variance of the residuals,
Y − XB^(init).
4.3.3 Variable Selection
Although the MBSP model in Eq. 4–2 and the MBSP-TPBN model in Eq. 4–13 produce
robust estimates for B, they do not produce exact zeros. In order to use the MBSP-TPBN
model for variable selection, we recommend looking at the 95% credible intervals for each entry
bij in row i and column j . If the credible intervals for every single entry in row i , 1 ≤ i ≤ p,
contain zero, then we classify predictor i as an irrelevant predictor. If at least one credible
interval in row i , 1 ≤ i ≤ p does not contain zero, then we classify i as an active predictor.
The empirical performance of this variable selection method seems to work well, as shown in
Section 4.4.
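The row-wise selection rule just described is easy to express given posterior draws of B. Below is a minimal Python sketch (the function and variable names are our own, not part of the MBSP package):

```python
import numpy as np

def select_active_rows(B_draws, level=0.95):
    """Classify each of the p predictors as active/irrelevant from posterior draws.

    B_draws: array of shape (n_draws, p, q) of posterior samples of B.
    A predictor (row) is active if at least one of its q entrywise
    credible intervals excludes zero.
    """
    lo = np.quantile(B_draws, (1 - level) / 2, axis=0)       # (p, q) lower bounds
    hi = np.quantile(B_draws, 1 - (1 - level) / 2, axis=0)   # (p, q) upper bounds
    excludes_zero = (lo > 0) | (hi < 0)                      # entrywise decision
    return excludes_zero.any(axis=1)                         # (p,) row-wise decision

# Toy check: row 0 centered far from zero, row 1 centered at zero.
rng = np.random.default_rng(3)
draws = rng.standard_normal((4000, 2, 3)) * 0.1
draws[:, 0, 0] += 5.0
active = select_active_rows(draws)
assert active.tolist() == [True, False]
```

With, say, 10,000 retained Gibbs draws stored as an array of shape (10000, p, q), `select_active_rows` returns the length-p vector of active/irrelevant decisions.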
4.4 Simulations and Data Analysis
4.4.1 Simulation Studies
For our simulation studies, we implement the MBSP-TPBN model in Eq. 4–13 using our
R package MBSP. We specify u = 0.5, a = 0.5 so that the polynomial-tailed prior that we
utilize is the horseshoe prior. The horseshoe is known to perform well in simulations (Carvalho
et al. (2010); van der Pas et al. (2014)). We set τ = 1/(p√n log n), d = 3, and k comparable to the
size of the likely variance of Y given X.
In all of our simulations, we generate data from the multivariate linear regression model
in Eq. 1–25 as follows. The rows of the design matrix X are independently generated from a
p-variate normal distribution with mean 0 and a covariance matrix whose (i, j)th entry is
0.5^{|i−j|}. The sparse p × q matrix B is generated by
first randomly selecting an active set of predictors, A ⊂ {1, 2, ..., p}. For rows with indices in
the set A, we independently draw every row element from Unif([−5, −0.5] ∪ [0.5, 5]). All the
other rows of B, i.e. those with indices in A^C, are then set equal to zero. Finally, the rows of the noise matrix E are
independently generated from Nq(0, Σ), where Σ = (Σij)q×q with Σij = σ^2(0.5)^{|i−j|}, σ^2 = 2.
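The data-generating recipe above can be sketched as follows. This is a Python illustration of the simulation design (the helper function and its names are ours), not the code used to produce Table 4-1.

```python
import numpy as np

def generate_mbsp_data(n, p, q, n_active, sigma2=2.0, rng=None):
    """Simulate (X, B, Y) following the simulation design of Section 4.4.1 (a sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    # AR(1)-type covariances with (i, j)th entry 0.5^{|i-j|}
    cov_x = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    cov_e = sigma2 * 0.5 ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))
    X = rng.multivariate_normal(np.zeros(p), cov_x, size=n)
    # Row-sparse B: active rows drawn from Unif([-5, -0.5] U [0.5, 5])
    B = np.zeros((p, q))
    active = rng.choice(p, size=n_active, replace=False)
    signs = rng.choice([-1, 1], size=(n_active, q))
    B[active] = signs * rng.uniform(0.5, 5.0, size=(n_active, q))
    E = rng.multivariate_normal(np.zeros(q), cov_e, size=n)
    return X, B, X @ B + E, active

X, B, Y, active = generate_mbsp_data(n=60, p=30, q=3, n_active=5,
                                     rng=np.random.default_rng(4))
```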
We consider six different simulation settings with varying levels of sparsity.
• Experiment 1 (p < n): n = 60, p = 30, q = 3, 5 active predictors (sparse model).
• Experiment 2 (p < n): n = 80, p = 60, q = 6, 40 active predictors (dense model).
• Experiment 3 (p > n): n = 50, p = 200, q = 5, 20 active predictors (sparse model).
• Experiment 4 (p > n): n = 60, p = 100, q = 6, 40 active predictors (dense model).
• Experiment 5 (p ≫ n): n = 100, p = 500, q = 3, 10 active predictors (ultra-sparse model).
• Experiment 6 (p ≫ n): n = 150, p = 1000, q = 4, 50 active predictors (sparse model).
The Gibbs sampler described in Section 4.3.2.1 is efficient in handling the two p ≫ n setups
in experiments 5 and 6. Running on an Intel Xeon E5-2698 v3 processor, the Gibbs sampler
runs about 761 iterations per minute for Experiment 5 and about 134 iterations per minute for
Experiment 6. In all our experiments, we run Gibbs sampling for 15,000 iterations, discarding
the first 5000 iterations as burn-in.
As our point estimate for B, we take the posterior median B̂ = (b̂ij)p×q. To perform
variable selection, we inspect the 95% individual credible interval for every entry and classify
predictors as irrelevant if all of the q intervals in that row contain 0, as described in Section
4.3.3. We compute mean squared errors (MSEs) rescaled by a factor of 100, as well as the
false discovery rate (FDR), false negative rate (FNR), and overall misclassification probability
(MP) as follows:

MSEest = 100 × ||B̂ − B||F^2/(pq),
MSEpred = 100 × ||XB̂ − XB||F^2/(nq),
FDR = FP/(TP + FP),
FNR = FN/(TN + FN),
MP = (FP + FN)/(pq),
where FP, TP, FN, and TN denote the number of false positives, true positives, false negatives,
and true negatives respectively.
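Given the selected and true active sets, these selection metrics are straightforward to compute. The Python sketch below (names ours) follows the definitions exactly as written above, including FNR = FN/(TN + FN) and the pq denominator for MP.

```python
import numpy as np

def selection_metrics(selected, truth, p, q):
    """FDR, FNR, and MP per the definitions above; boolean arrays of length p."""
    tp = np.sum(selected & truth)      # true positives
    fp = np.sum(selected & ~truth)     # false positives
    fn = np.sum(~selected & truth)     # false negatives
    tn = np.sum(~selected & ~truth)    # true negatives
    fdr = fp / (tp + fp) if (tp + fp) > 0 else 0.0
    fnr = fn / (tn + fn) if (tn + fn) > 0 else 0.0
    mp = (fp + fn) / (p * q)
    return fdr, fnr, mp

sel = np.array([True, True, False, False, True])
tru = np.array([True, False, False, True, True])
fdr, fnr, mp = selection_metrics(sel, tru, p=5, q=2)
```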
We compare the performance of the MBSP-TPBN estimator with that of four other
row-sparse estimators of B. An alternative Bayesian approach based on the spike-and-slab
formulation is studied. Namely, we consider the multivariate Bayesian group lasso posterior
median estimator with a spike-and-slab prior (MBGL-SS), introduced by Liquet et al. (2017),
which applies a spike-and-slab prior with a point mass 0mgq for the gth group of covariates,
which corresponds to mg rows of B. When the grouping structure of the covariates is not
available, we can still utilize the MBGL-SS method by applying the spike-and-slab prior to
each individual row of B. In our study, we consider each predictor as its own “group” (i.e.,
mg = 1, g = 1, ... , p) so that individual rows are shrunk to 0⊤q . This method can be
implemented in R using the MBSGS package.
In addition, we compare the performance of MBSP-TPBN to three frequentist
point estimators obtained through regularization penalties on the rows of B. The R package
glmnet (Friedman et al. (2010)) provides an option to fit the following model to multivariate
data, which we call the multivariate lasso (MLASSO) method:

B̂_MLASSO = argmin_{B ∈ R^{p×q}} ( ||Y − XB||_F² + λ Σ_{j=1}^{p} ||b_j||_2 ).

The MLASSO criterion penalizes the ℓ2 norm of each row of B (an ℓ1-type penalty over the
row norms), which shrinks entire row estimates to 0_q⊤. We also compare the MBSP-TPBN
estimator to the row-sparse reduced-rank regression (SRRR) estimator, introduced by
Chen & Huang (2012), which uses an adaptive group lasso penalty on the rows of B but
further constrains the solution to be rank-deficient. Finally, we compare our method to the
sparse partial least squares (SPLS) estimator, introduced by Chun & Keleş (2010). SPLS
combines partial least squares (PLS) regression with a regularization penalty on the rows of B
in order to obtain a row-sparse PLS estimate of B. The SRRR and SPLS methods are available
in the R packages rrpack and spls, respectively.
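glmnet is an R package, but the same row-penalized criterion is available in other libraries; for instance, scikit-learn's MultiTaskLasso minimizes an equivalent mixed ℓ1/ℓ2 row penalty (up to a 1/(2n) scaling of the squared-error term). The following is our own illustrative sketch on simulated data, not part of the study:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

# Simulated data: only the first 5 of p = 30 predictors are active.
rng = np.random.default_rng(0)
n, p, q = 60, 30, 3
X = rng.standard_normal((n, p))
B = np.zeros((p, q))
B[:5, :] = rng.uniform(1.0, 2.0, size=(5, q))
Y = X @ B + rng.standard_normal((n, q))

# MultiTaskLasso minimizes ||Y - XB||_F^2 / (2n) + alpha * sum_j ||b_j||_2,
# the same mixed l1/l2 row penalty as the MLASSO criterion above.
fit = MultiTaskLasso(alpha=0.5).fit(X, Y)
B_hat = fit.coef_.T          # scikit-learn stores coefficients as (q, p)
# Rows with nonzero norm are the selected predictors; the five true
# signal rows should be among them for this signal strength.
active_rows = np.flatnonzero(np.linalg.norm(B_hat, axis=1) > 0)
```

Because the penalty acts on whole rows, an entire predictor's coefficient vector is either zeroed out or retained, mirroring the row-sparsity targeted by the methods compared here.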
Table 4-1 shows the results, averaged across 100 replications, for the MBSP-TPBN
model in Eq. 4–13, compared with MBGL-SS, MLASSO, SRRR, and SPLS. As the results illustrate,
the Bayesian methods tend to outperform the frequentist ones in the low-dimensional case
where p < n. In the two low-dimensional experiments (Experiments 1 and 2), the MBGL-SS
estimator performs the best across all of our performance metrics, with the MBSP-TPBN
model following closely behind.
However, in all the high-dimensional (p > n) settings, MBSP-TPBN significantly
outperforms all of its competitors. Table 4-1 shows that the MBSP-TPBN model has a lower
MSEest than the other four methods in experiments 3 through 6. In experiments 5 and 6 (the
Table 4-1. Simulation results for MBSP-TPBN, compared with MBGL-SS, MLASSO, SRRR, and SPLS, averaged across 100 replications.

Experiment 1: n = 60, p = 30, q = 3, 5 active predictors (sparse model).
Method    MSEest   MSEpred  FDR     FNR     MP
MBSP      1.146    24.842   0.015   0       0.003
MBGL-SS   0.718    17.074   0.005   0       0.001
MLASSO    2.181    41.424   0.6412  0       0.335
SRRR      1.646    29.256   0.3270  0       0.128
SPLS      2.428    43.879   0.1093  0.0019  0.028

Experiment 2: n = 80, p = 60, q = 6, 40 active predictors (dense model).
Method    MSEest   MSEpred  FDR     FNR   MP
MBSP      5.617    104.88   0.0034  0     0.0023
MBGL-SS   5.202    101.40   0.0007  0     0.0005
MLASSO    10.478   130.90   0.3307  0     0.330
SRRR      5.695    104.67   0.0491  0     0.038
SPLS      244.136  3633.77  0.2071  0     0.223

Experiment 3: n = 50, p = 200, q = 5, 20 active predictors (sparse model).
Method    MSEest  MSEpred  FDR     FNR    MP
MBSP      1.357   117.52   0.0117  0      0.0013
MBGL-SS   57.25   694.81   0.858   0.02   0.619
MLASSO    8.400   169.026  0.7758  0      0.349
SRRR      17.46   161.70   0.698   0      0.307
SPLS      48.551  2006.03  0.422   0.033  0.103

Experiment 4: n = 60, p = 100, q = 6, 40 active predictors (dense model).
Method    MSEest  MSEpred  FDR     FNR     MP
MBSP      11.030  172.89   0.0266  0       0.0114
MBGL-SS   204.33  318.80   0.505   0.1265  0.415
MLASSO    44.635  188.81   0.544   0       0.479
SRRR      242.67  193.64   0.594   0       0.587
SPLS      213.19  3909.07  0.135   0.0005  0.005

Experiment 5: n = 100, p = 500, q = 3, 10 active predictors (ultra-sparse model).
Method    MSEest  MSEpred  FDR     FNR     MP
MBSP      0.0374  12.888   0.064   0       0.0015
MBGL-SS   1.327   155.51   0.483   0.0005  0.092
MLASSO    0.2357  75.961   0.837   0       0.115
SRRR      0.9841  49.428   0.688   0       0.104
SPLS      0.3886  138.62   0.1355  0.0005  0.005

Experiment 6: n = 150, p = 1000, q = 4, 50 active predictors (sparse model).
Method    MSEest  MSEpred  FDR     FNR      MP
MBSP      0.0155  8.934    0.0025  0.00003  0.00016
MBGL-SS   1.327   155.51   0.483   0.0005   0.092
MLASSO    1.982   181.95   0.810   0        0.214
SRRR      0.9841  49.428   0.688   0        0.104
SPLS      25.560  8631.92  0.420   0.021    0.051
p ≫ n scenarios), the MSEest and MSEpred are both much lower for the MBSP-TPBN model
than for the other methods.
Additionally, using the 95% credible interval technique in Section 4.3.3 to perform variable
selection, the FDR and the overall MP are also consistently low for the MBSP-TPBN model.
Even when the true underlying model is not sparse, as in Experiments 2 and 4, MBSP performs
very well and correctly identifies most of the signals. In both of the ultrahigh-dimensional
settings considered in Experiments 5 and 6, the other four methods report high FDRs, while
MBSP's FDR remains very small.
In short, our experimental results show that the MBSP model in Eq. 4–2 has excellent
finite sample performance for both estimation and selection, is robust to non-sparse situations,
and scales very well to large p compared to the other methods. In addition to its strong
empirical performance, the MBSP model (as well as the MBGL-SS model) provides a vehicle
for uncertainty quantification through the posterior credible intervals.
4.4.2 Yeast Cell Cycle Data Analysis
We illustrate the MBSP methodology on a yeast cell cycle data set. This data set was
first analyzed by Chun & Keleş (2010) and is available in the spls package in R. Transcription
factors (TFs) are sequence-specific DNA binding proteins which regulate the transcription
of genes from DNA to mRNA by binding specific DNA sequences. In order to understand
their role as a regulatory mechanism, one often wishes to study the relationship between TFs
and their target genes at different time points. In this yeast cell cycle data set, mRNA levels
were measured every 7 minutes over a duration of 119 minutes, giving 18 distinct time points.
The 542 × 18 response matrix Y consists of 542 cell-cycle-regulated genes from an α factor
arrest method, with columns corresponding to the mRNA levels at the 18 time points. The
542 × 106 design matrix X contains the binding information for a total of 106 TFs.
In practice, many of the TFs are not actually related to the genes, so our aim is to recover
a parsimonious model containing only a small number of truly significant TFs. To
Table 4-2. Results for the analysis of the yeast cell cycle data set. The MSPE has been scaled by a factor of 100. In particular, all five models selected the three TFs ACE2, SWI5, and SWI6 as significant.

Method    Number of Proteins Selected  MSPE
MBSP      12                           18.673
MBGL-SS   7                            20.093
MLASSO    78                           17.912
SRRR      44                           18.204
SPLS      44                           18.904
perform variable selection, we fit the MBSP-TPBN model in Eq. 4–13 and then use the 95%
credible interval method described in Section 4.3.3. Beyond identifying significant TFs, we
assess the predictive performance of the MBSP-TPBN model with five-fold cross-validation,
using 80 percent of the data as a training set to obtain an estimate of B. We take the
posterior median B̂_train = (b̂_ij)_train and use it to compute the mean squared error of the
residuals on the remaining 20 percent of held-out data. We repeat this five times, using a
different training/test split each time, and take the average MSE as our mean squared
prediction error (MSPE). For clarity, we scale the MSPE by a factor of 100.
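The cross-validation scheme just described can be sketched generically as follows (our own illustration; `fit_fn` is a hypothetical stand-in for whichever estimator is being evaluated):

```python
import numpy as np

def mspe_cv(X, Y, fit_fn, n_folds=5, seed=1, scale=100.0):
    """K-fold cross-validated mean squared prediction error, scaled by 100
    as in the text.  `fit_fn(X_train, Y_train)` must return an estimated
    coefficient matrix of shape (p, q)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        B_hat = fit_fn(X[train], Y[train])
        resid = Y[test] - X[test] @ B_hat
        errors.append(np.mean(resid ** 2))
    return scale * float(np.mean(errors))

# Example with ordinary least squares as the fitting rule.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
B = rng.standard_normal((4, 2))
Y = X @ B + 0.1 * rng.standard_normal((50, 2))
ols = lambda Xt, Yt: np.linalg.lstsq(Xt, Yt, rcond=None)[0]
mspe = mspe_cv(X, Y, ols)   # roughly 100 times the noise variance of 0.01
```

Any of the Bayesian or frequentist estimators in the comparison could be plugged in as `fit_fn` in place of the OLS rule used here for illustration.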
Table 4-2 shows our results compared with the MBGL-SS, MLASSO, SRRR, and
SPLS methods. MBSP-TPBN selects 12 of the 106 TFs as significant, so we do recover a
parsimonious model. All five methods selected the TFs ACE2, SWI5, and SWI6. The two
Bayesian methods recover much sparser models than the frequentist methods.
In particular, the MLASSO method has the lowest MSPE, but it selects 78 of the 106 TFs as
significant, suggesting that there may be overfitting in spite of the regularization penalty on
the rows of B. Our results suggest that the frequentist methods may have good predictive
performance on this particular data set, but at the expense of parsimony. In practice, sparse
models are preferred for the sake of interpretability, and our numerical results illustrate that the
MBSP model recovers a sparse model with competitive predictive performance.
Finally, Figure 4-1 illustrates the posterior median estimates and the 95% credible bands
for four of the 12 TFs that were selected as significant by the MBSP-TPBN model. These
[Figure 4-1: four panels titled ACE2, HIR1, NDD1, and SWI6, each plotting estimated coefficients against time; x-axis 0 to 120 minutes, y-axis −0.6 to 0.6.]
Figure 4-1. Plots of the estimates and 95% credible bands for four of the 12 TFs that were deemed significant by the MBSP-TPBN model. The x-axis indicates time (minutes) and the y-axis indicates the estimated coefficients.
plots illustrate that the standard errors under the MBSP-TPBN model are not too large. One
of the potential drawbacks of using credible intervals for selection is that the intervals may
be too conservative, but we see that this is not the case here. These plots, combined with our
earlier simulation results and our data analysis results, provide empirical evidence for using the
MBSP model for estimation and variable selection. However, further theoretical investigation
is warranted in order to justify the use of marginal credible intervals for variable selection. In
particular, van der Pas et al. (2017b) showed that marginal credible intervals may provide
overconfident uncertainty statements for certain large signal values when applied to estimating
normal mean vectors, and the same issue could be present here.
4.5 Concluding Remarks
In this chapter, we have introduced a method for sparse multivariate Bayesian estimation
with shrinkage priors (MBSP). Previously, polynomial-tailed GL shrinkage priors of the form
given in Eq. 1–5 have mainly been used in univariate regression or in the estimation of normal
mean vectors. In this thesis, we have extended the use of polynomial-tailed priors to the
multivariate linear regression framework.
We have made several important contributions to both methodology and theory. First, our
model may be used for sparse multivariate estimation for p, n, and q of any size. To motivate
the MBSP model, we have shown that the posterior distribution can consistently estimate B in
Eq. 1–25 in both the low-dimensional and ultrahigh-dimensional settings where p is allowed to
grow nearly exponentially with n (with the response dimension q fixed). To our knowledge,
Theorem 4.2 gives the first general sufficient conditions for strong posterior consistency in
Bayesian multivariate linear regression models when p > n and log(p) = o(n).
Moreover, our method is general enough to encompass a large family of heavy-tailed priors,
including the Student’s-t prior, the horseshoe prior, the generalized double Pareto prior, and
others.
The MBSP model in Eq. 4–2 can be implemented using straightforward Gibbs sampling.
We implemented a fully Bayesian version of it with an appropriate prior on Σ and with
polynomial-tailed priors belonging to the TPBN family, using the horseshoe prior as a special
case. By examining the 95% posterior credible intervals for the elements in each row of B
under its posterior distribution, we also showed how one could use the MBSP model
for variable selection. Through simulations and data analysis on a yeast cell cycle data set, we
have illustrated that our model has excellent performance in finite samples for both estimation
and variable selection.
CHAPTER 5
SUMMARY AND FUTURE WORK
5.1 Summary
In recent years, Bayesian scale-mixture shrinkage priors have gained a great amount of
attention because of their computational efficiency and their ability to mimic point-mass
mixtures in obtaining sparse estimates. This thesis contributes to this large body of
methodological and theoretical work.
In Chapter 1, we surveyed the literature on sparse normal means estimation, Bayesian
multiple hypothesis testing, and sparse univariate and multivariate linear regression. In
Chapter 2, we introduced the inverse gamma-gamma (IGG) prior for estimation of sparse
noisy vectors. This prior has a number of attractive theoretical properties, including minimax
posterior contraction and super-efficient convergence in the Kullback-Leibler sense. In Chapter
3, we introduced a thresholding rule for signal detection based on the IGG posterior and
demonstrated that our procedure has the Bayes Oracle property for multiple hypothesis testing.
Finally, in Chapter 4, we introduced the multivariate Bayesian model with shrinkage priors
(MBSP) which uses global-local shrinkage priors for sparse multivariate linear regression. The
MBSP model recovers a row-sparse estimate of the unknown p × q coefficient matrix B
and consistently estimates the true B even when the number of predictors grows at nearly
exponential rate with sample size.
5.2 Future Work
5.2.1 Extensions of the Inverse Gamma-Gamma Prior
There are a number of possible extensions and further investigations for the inverse
gamma-gamma prior. In Chapters 2 and 3, the hyperparameters for the IGG prior were set
as (a, b) = (1/2 + 1/n, 1/n), based on our findings about the theoretical behavior of the posterior.
However, we could investigate if taking empirical Bayes estimates of (a, b) can boost the
IGG’s performance. Recently, van der Pas et al. (2017a) found that the performance of the
horseshoe can be improved if the global parameter τ is taken to be the value τ̂ that maximizes
the marginal likelihood over the interval [1/n, 1]. It is possible that a similar procedure
for setting (a, b) would lead to even better performance and better adaptivity to the underlying
sparsity for the IGG.
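As a rough illustration of such a procedure, the marginal maximum likelihood estimate of the horseshoe's global parameter can be approximated by numerical integration and a grid search over [1/n, 1]. This sketch reflects our own simplifying choices of grid and quadrature, not the exact estimator of van der Pas et al. (2017a):

```python
import numpy as np
from scipy.integrate import quad

def horseshoe_marginal(x, tau):
    """Marginal density of x ~ N(theta, 1) with theta ~ horseshoe(tau),
    obtained by numerically integrating out the half-Cauchy local scale."""
    def integrand(lam):
        v = 1.0 + (tau * lam) ** 2
        normal = np.exp(-x ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
        half_cauchy = (2.0 / np.pi) / (1.0 + lam ** 2)
        return normal * half_cauchy
    # Split at lam ~ |x|/tau so the adaptive rule does not miss the
    # region where a large observation concentrates the integrand's mass.
    cut = 1.0 + abs(x) / tau
    part1, _ = quad(integrand, 0.0, cut)
    part2, _ = quad(integrand, cut, np.inf)
    return part1 + part2

def tau_mmle(x, grid_size=40):
    """Grid-search approximation to the marginal maximum likelihood
    estimate of tau over [1/n, 1]."""
    n = len(x)
    taus = np.geomspace(1.0 / n, 1.0, grid_size)
    loglik = [sum(np.log(horseshoe_marginal(xi, t)) for xi in x) for t in taus]
    return taus[int(np.argmax(loglik))]

# A sparse vector: mostly noise-level entries plus two large signals.
x = np.array([0.1, -0.2, 0.05, 0.3, -0.1, 0.0, 0.15, -0.25, 6.0, 7.5])
tau_hat = tau_mmle(x)
```

An analogous grid search over (a, b) could be used to explore empirical Bayes tuning of the IGG hyperparameters.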
While we have proven that the IGG posterior achieves the (near) minimax rate of
contraction for nearly black vectors and concentrates at a super-efficient rate in the
Kullback-Leibler sense, we have not proven any results about uncertainty quantification
using the IGG. It is unknown whether the credible balls and marginal credible sets of size
1 − α, α ∈ (0, 1), under the IGG posterior have good frequentist coverage and optimal
size. The literature on uncertainty quantification for scale-mixture shrinkage priors seems
underdeveloped as of now. Some work has recently been done on this for the horseshoe prior
by van der Pas et al. (2017b), who give conditions under which the credible balls and
marginal credible intervals constructed from the horseshoe posterior give
correct frequentist coverage in the normal means problem. It would be very interesting to see if
the IGG can achieve similar optimality for uncertainty quantification under milder conditions.
Recently, the “mild dimension” scenario for the normal means problem, where q_n/n → c,
c > 0, as n → ∞, has also been of great interest, but Bayesian development in this area
has been slow. Given its excellent performance even in dense settings, it may be worthwhile
to conduct theoretical analysis of the IGG prior’s properties and behavior under moderate
dimensions.
Under the IGG prior, the components are a posteriori independent and therefore separable.
Despite the absence of a data-dependent global parameter, the IGG model adapts well to
sparsity, performing well under both sparse and dense settings. However, several authors
such as Carvalho et al. (2010) have argued that Bayesian models adapt to the underlying
sparsity far better when they include global parameters with priors placed on them. In light
of these arguments, we could investigate if theoretical and empirical performance can be
improved further by incorporating a global parameter into the IGG framework and creating a
non-separable variant of the IGG.
The IGG prior can also be extended to other statistical problems besides the normal means
problem and multiple testing. For example, we could adapt the IGG prior for sparse covariance
estimation, variable selection with covariates, and multiclass classification. We conjecture that
the IGG would satisfy many optimality properties (e.g. model selection consistency, optimal
posterior contraction, etc.) if it were utilized in these other contexts.
5.2.2 Extensions to Bayesian Multivariate Linear Regression with Shrinkage Priors
In Chapter 4, we demonstrated that the MBSP model could achieve posterior consistency
in both low-dimensional (p = o(n)) and ultrahigh-dimensional (log p = o(n)) settings. The
next step is to quantify the posterior contraction rate. In the multivariate linear regression
framework, we say that the posterior distribution contracts at the rate r_n if

Π(||B_n − B_0||_F > M_n r_n | Y_n) → 0 a.s. P_0 as n → ∞,
for every Mn → ∞ as n → ∞. In the context of high-dimensional univariate regression,
several authors (e.g., Castillo et al. (2015) and Ročková & George (2016)) have attained
optimal posterior contraction rates of O(√(s log p / n)) with respect to the ℓ1 and ℓ2 norms
(where s denotes the number of active predictors). It is worth noting that √(s log p / n) is the
familiar minimax rate of convergence under squared error loss for a number of frequentist point
and the Dantzig selector (Candes & Tao (2007)). We conjecture that under suitable regularity
conditions and compatibility conditions on the design matrix, the MBSP model can attain a
similarly optimal posterior rate of contraction.
Additionally, we could investigate if posterior consistency and optimal posterior contraction
rates can be achieved if we allow the number of response variables q to diverge to infinity
in the MBSP model. From an implementation standpoint, q can be of any size, but for our
theoretical investigation of the MBSP model, we assumed q to be fixed. If q is allowed to grow
as sample size grows, then some sort of sparsity assumption for the response variables may
need to be imposed. We surmise that novel techniques would also be needed to prove posterior
consistency in this scenario, since the distributional theory we used to prove our consistency
results may not apply if q is no longer fixed.
Extension of our posterior consistency results to the case where Σ is unknown and
endowed with a prior also remains an open problem. In this case, we need to integrate out
Σ in order to work with the marginal density of the prior on B. If we assume the standard
inverse-Wishart prior on Σ, this gives rise to a matrix-variate t distribution. Handling this
density is very nontrivial and would require significantly different techniques than the ones
we used to establish posterior consistency in Chapter 4. Nevertheless, this warrants future
investigation.
For variable selection with the MBSP model, we relied on the post hoc method of
examining the 95% credible intervals for each entry of the estimated coefficient matrix B.
Further theoretical justification for this selection method is needed. Other possible thresholding
rules should also be investigated. Because scale-mixture shrinkage priors place zero probability
at exactly zero, we must necessarily use thresholding to perform variable selection. How to
optimally choose this threshold (or thresholds) in high-dimensional settings remains an active
area of research.
Finally, in the wider context of multivariate analysis, we could also investigate the
use of global-local shrinkage priors for reduced rank regression (RRR) or partial least
squares regression (PLS). While there has been a great deal of work on RRR and PLS in
the frequentist framework, Bayesian methodological and theoretical developments in these
areas have been rather sparse.
All the aforementioned are very important open problems in Bayesian multivariate linear
regression, and we hope that the methodology and theory introduced in this thesis can serve as
the foundation for further developments in this area.
APPENDIX A
PROOFS FOR CHAPTER 2
In this Appendix, we provide proofs of all the propositions, lemmas, and theorems in Chapter 2.
A.1 Proofs for Section 2.1
Proof of Proposition 2.1. The joint density of the prior is proportional to

π(θ, ξ, λ) ∝ (λξ)^{−1/2} exp( −θ²/(2λξ) ) λ^{−a−1} exp(−1/λ) ξ^{b−1} exp(−ξ)
           ∝ ξ^{b−3/2} exp(−ξ) λ^{−a−3/2} exp( −(θ²/(2ξ) + 1)(1/λ) ).

Thus,

π(θ, ξ) ∝ ξ^{b−3/2} exp(−ξ) ∫_0^∞ λ^{−a−3/2} exp( −(θ²/(2ξ) + 1)(1/λ) ) dλ
        ∝ ( θ²/(2ξ) + 1 )^{−(a+1/2)} ξ^{b−3/2} e^{−ξ},

and thus the marginal density of θ is proportional to

π(θ) ∝ ∫_0^∞ ( θ²/(2ξ) + 1 )^{−(a+1/2)} ξ^{b−3/2} e^{−ξ} dξ.     (A–1)
As |θ| → 0, the expression in Eq. A–1 is bounded below by

C ∫_0^∞ ξ^{b−3/2} e^{−ξ} dξ,     (A–2)
where C is a constant that depends on a and b. The integral expression in Eq. A–2 clearly
diverges to ∞ for any 0 < b ≤ 1/2. Therefore, Eq. A–1 diverges to infinity as |θ| → 0, by the
monotone convergence theorem.
Proof of Theorem 2.1. From Eq. 2–4, the posterior distribution of κ_i under IGG_n is
proportional to

π(κ_i | X_i) ∝ exp( −κ_i X_i²/2 ) κ_i^{a−1/2} (1−κ_i)^{b_n−1},  κ_i ∈ (0, 1).     (A–3)
Since exp(−κ_i X_i²/2) is strictly decreasing in κ_i on (0, 1), we have

E(1−κ_i | X_i) = [∫_0^1 κ_i^{a−1/2} (1−κ_i)^{b_n} exp(−κ_i X_i²/2) dκ_i] / [∫_0^1 κ_i^{a−1/2} (1−κ_i)^{b_n−1} exp(−κ_i X_i²/2) dκ_i]
  ≤ e^{X_i²/2} [∫_0^1 κ_i^{a−1/2} (1−κ_i)^{b_n} dκ_i] / [∫_0^1 κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i]
  = e^{X_i²/2} [Γ(a+1/2) Γ(b_n+1) / Γ(a+b_n+3/2)] × [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))]
  = e^{X_i²/2} b_n / (a + b_n + 1/2).
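The bound just derived is easy to check numerically. The following sketch is our own (not part of the thesis); it uses SciPy quadrature with an algebraic endpoint weight to handle the (1−κ)^{b_n−1} singularity, and verifies E(1−κ | x) ≤ e^{x²/2} b_n/(a + b_n + 1/2) at a few points:

```python
import numpy as np
from scipy.integrate import quad

def post_mean_shrinkage(x, a, b):
    """E(1 - kappa | x) for the posterior density proportional to
    kappa^(a-1/2) (1-kappa)^(b-1) exp(-kappa x^2/2) on (0, 1).  quad's
    'alg' weight (x-0)^alpha * (1-x)^beta absorbs the integrable
    endpoint singularity of (1-kappa)^(b-1)."""
    f = lambda k: np.exp(-k * x ** 2 / 2)
    num, _ = quad(f, 0, 1, weight='alg', wvar=(a - 0.5, b))      # extra (1-kappa)
    den, _ = quad(f, 0, 1, weight='alg', wvar=(a - 0.5, b - 1))
    return num / den

a, b = 0.6, 0.05   # a > 1/2 and a small b_n, as in the theorem's setting
for x in (0.5, 1.0, 2.0, 3.0):
    bound = np.exp(x ** 2 / 2) * b / (a + b + 0.5)
    assert post_mean_shrinkage(x, a, b) <= bound
```

The bound is loose for large |x| (the factor e^{x²/2} dominates), which is consistent with its role in controlling the zero-mean terms only for moderate observations.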
Proof of Theorem 2.2. Note that since a ∈ (1/2, ∞), κ_i^{a−1/2} is increasing in κ_i on (0, 1).
Additionally, since b_n ∈ (0, 1), (1−κ_i)^{b_n−1} is increasing in κ_i on (0, 1). Using these facts,
we have

Pr(κ_i < ϵ | X_i) ≤ [∫_0^ϵ exp(−κ_i X_i²/2) κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i] / [∫_ϵ^1 exp(−κ_i X_i²/2) κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i]
  ≤ e^{X_i²/2} [∫_0^ϵ κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i] / [∫_ϵ^1 κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i]
  ≤ e^{X_i²/2} (1−ϵ)^{b_n−1} [∫_0^ϵ κ_i^{a−1/2} dκ_i] / [ϵ^{a−1/2} ∫_ϵ^1 (1−κ_i)^{b_n−1} dκ_i]
  = e^{X_i²/2} (1−ϵ)^{b_n−1} (a+1/2)^{−1} ϵ^{a+1/2} / [b_n^{−1} ϵ^{a−1/2} (1−ϵ)^{b_n}]
  = e^{X_i²/2} b_n ϵ / [(a+1/2)(1−ϵ)].
Proof of Theorem 2.3. First, note that since b_n ∈ (0, 1), (1−κ_i)^{b_n−1} is increasing in κ_i on
(0, 1). Therefore, letting C denote the normalizing constant, which depends on X_i, we have

∫_0^η π(κ_i | X_i) dκ_i = C ∫_0^η exp(−κ_i X_i²/2) κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i
  ≥ C ∫_0^{ηδ} exp(−κ_i X_i²/2) κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i
  ≥ C exp(−ηδ X_i²/2) ∫_0^{ηδ} κ_i^{a−1/2} dκ_i
  = C exp(−ηδ X_i²/2) (a+1/2)^{−1} (ηδ)^{a+1/2}.     (A–4)

Also, since a ∈ (1/2, ∞), κ_i^{a−1/2} is increasing in κ_i on (0, 1). Hence

∫_η^1 π(κ_i | X_i) dκ_i = C ∫_η^1 exp(−κ_i X_i²/2) κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i
  ≤ C exp(−η X_i²/2) ∫_η^1 κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i
  ≤ C exp(−η X_i²/2) ∫_η^1 (1−κ_i)^{b_n−1} dκ_i
  = C exp(−η X_i²/2) b_n^{−1} (1−η)^{b_n}.     (A–5)

Combining Eq. A–4 and Eq. A–5, we have

Pr(κ_i > η | X_i) ≤ [∫_η^1 π(κ_i | X_i) dκ_i] / [∫_0^η π(κ_i | X_i) dκ_i]
  ≤ (a+1/2) (1−η)^{b_n} / [b_n (ηδ)^{a+1/2}] × exp( −η(1−δ) X_i²/2 ).
A.2 Proofs for Section 2.3.1
Before proving Theorems 2.4 and 2.5, we first prove four lemmas. For Lemmas A.1, A.2,
A.3, and A.4, we denote by T(x) = E{(1−κ) | x} · x the posterior mean under the IGG_n model
in Eq. 2–6 for a single observation x, where κ = 1/(1+λξ). Our arguments follow closely those of
van der Pas et al. (2014), Datta & Ghosh (2013), and Ghosh & Chakrabarti (2017), except
that their arguments rely on controlling the rate of decay of the tuning parameter τ or an
empirical Bayes estimator τ̂. In our case, since we are dealing with a fully Bayesian model, the
degree of posterior contraction is instead controlled by the positive sequence of hyperparameters
b_n in Eq. 2–6.

Lemma A.1. Let T(x) be the posterior mean under the IGG_n model in Eq. 2–6 for a single
observation x drawn from N(θ, 1). Suppose we have constants η ∈ (0, 1), δ ∈ (0, 1),
a ∈ (1/2, ∞), and b_n ∈ (0, 1), where b_n → 0 as n → ∞. Then for any d > 2 and fixed
n, |T(x) − x| can be bounded above by a real-valued function h_n(x), depending on d and
satisfying the following: for any ρ > d, h_n(·) satisfies

lim_{n→∞} sup_{|x| > √(ρ log(1/b_n))} h_n(x) = 0.     (A–6)
Proof of Lemma A.1. Fix η ∈ (0, 1) and δ ∈ (0, 1). First observe that

|T(x) − x| = |x E(κ|x)| ≤ |x E(κ 1{κ < η}|x)| + |x E(κ 1{κ > η}|x)|.     (A–7)

We consider the two terms in Eq. A–7 separately. From Eq. 2–4 and the fact that (1−κ)^{b_n−1}
is increasing in κ ∈ (0, 1) when b_n ∈ (0, 1), we have

|x E(κ 1{κ < η}|x)| = |x| [∫_0^η κ · κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ]
  ≤ |x| (1−η)^{b_n−1} [∫_0^η κ^{a+1/2} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} e^{−κx²/2} dκ]
  = |x| (1−η)^{b_n−1} [∫_0^{ηx²} (t/x²)^{a+1/2} e^{−t/2} dt] / [∫_0^{x²} (t/x²)^{a−1/2} e^{−t/2} dt]
  = |x| (1−η)^{b_n−1} (1/x²) [∫_0^{ηx²} t^{a+1/2} e^{−t/2} dt] / [∫_0^{x²} t^{a−1/2} e^{−t/2} dt]
  ≤ (1−η)^{b_n−1} [∫_0^∞ t^{a+1/2} e^{−t/2} dt] [∫_0^{x²} t^{a−1/2} e^{−t/2} dt]^{−1} |x|^{−1}
  = C(n) [∫_0^{x²} t^{a−1/2} e^{−t/2} dt]^{−1} |x|^{−1}
  = h_1(x) (say),     (A–8)

where we use the change of variables t = κx² in the second equality, and C(n) =
(1−η)^{b_n−1} ∫_0^∞ t^{a+1/2} e^{−t/2} dt = (1−η)^{b_n−1} 2^{a+3/2} Γ(a+3/2). Next, observe that since
κ ∈ (0, 1),

|x E(κ 1{κ > η}|x)| ≤ |x| Pr(κ > η | x)
  ≤ [(a+1/2) (1−η)^{b_n} / (b_n (ηδ)^{a+1/2})] |x| exp( −η(1−δ) x²/2 )
  = h_2(x) (say),     (A–9)

where we use Theorem 2.3 for the second inequality.

Let h_n(x) = h_1(x) + h_2(x). Combining Eq. A–7 through Eq. A–9, we have that for every
x ∈ R and fixed n,

|T(x) − x| ≤ h_n(x).     (A–10)

Observe from Eq. A–8 that for fixed n, h_1(x) is strictly decreasing in |x|. Therefore, we have
that for any fixed n and ρ > 0,
sup_{|x| > √(ρ log(1/b_n))} h_1(x) ≤ C(n) [ √(ρ log(1/b_n)) ∫_0^{ρ log(1/b_n)} t^{a−1/2} e^{−t/2} dt ]^{−1},

and since b_n → 0 as n → ∞, this implies that

lim_{n→∞} sup_{|x| > √(ρ log(1/b_n))} h_1(x) = 0.     (A–11)
Next, observe from Eq. A–9 that for fixed n, h_2(x) is eventually decreasing in |x|, with a
maximum at |x| = 1/√(η(1−δ)). Therefore, for sufficiently large n, we have

sup_{|x| > √(ρ log(1/b_n))} h_2(x) ≤ h_2( √(ρ log(1/b_n)) ).

Letting K ≡ K(a, η, δ) = (a+1/2) / (ηδ)^{a+1/2}, we have from Eq. A–9 and the fact that
0 < b_n < 1 for all n that

lim_{n→∞} h_2( √(ρ log(1/b_n)) ) = K lim_{n→∞} [(1−η)^{b_n} / b_n] √(ρ log(1/b_n)) e^{−(η(1−δ)/2) ρ log(1/b_n)}
  ≤ K lim_{n→∞} (1/b_n) √(ρ log(1/b_n)) e^{(η(1−δ)/2) log(b_n^ρ)}
  = K √ρ lim_{n→∞} (b_n)^{(η(1−δ)/2)(ρ − 2/(η(1−δ)))} √(log(1/b_n))
  = 0 if ρ > 2/(η(1−δ)), and ∞ otherwise,

from which it follows that

lim_{n→∞} sup_{|x| > √(ρ log(1/b_n))} h_2(x) = 0 if ρ > 2/(η(1−δ)), and ∞ otherwise.     (A–12)

Combining Eq. A–11 and Eq. A–12, we have for h_n(x) = h_1(x) + h_2(x) that

lim_{n→∞} sup_{|x| > √(ρ log(1/b_n))} h_n(x) = 0 if ρ > 2/(η(1−δ)), and ∞ otherwise.     (A–13)
Since η ∈ (0, 1) and δ ∈ (0, 1), any real number larger than 2 can be expressed in the form
2/(η(1−δ)). For example, taking η = 5/6 and δ = 1/5, we obtain 2/(η(1−δ)) = 3. Hence, given
any d > 2, choose 0 < η, δ < 1 such that d = 2/(η(1−δ)). Clearly, h_n(·) depends on d. Following
Eq. A–10 and Eq. A–13, we see that |T(x) − x| is uniformly bounded above by h_n(x) for all
n and that the condition in Eq. A–6 is satisfied for any d > 2. This completes the proof.
Remark: Under the conditions of Lemma A.1, we see that for any fixed n,

lim_{|x|→∞} |T(x) − x| = 0.     (A–14)

Equation A–14 shows that under the IGG prior, large observations remain almost unshrunk no
matter what the sample size n is. This is critical to the prior's ability to properly identify
signals in our data.
Lemma A.2. Let T(x) be the posterior mean and let Var(θ|x) be the posterior variance
under the IGG_n prior in Eq. 2–6. Then for a single observation x ∼ N(θ, 1), Var(θ|x) can be
represented by the following identities:

Var(θ|x) = T(x)/x − (T(x) − x)² + x² [∫_0^1 κ^{a+3/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ],     (A–15)

and

Var(θ|x) = T(x)/x − T(x)² + x² [∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ],     (A–16)

both of which satisfy the bound Var(θ|x) ≤ 1 + x².
Proof of Lemma A.2. We first prove Eq. A–15. By the law of iterated variance and the
fact that θ | κ, x ∼ N((1−κ)x, 1−κ), we have

Var(θ|x) = E[Var(θ|κ, x)] + Var[E(θ|κ, x)]
  = E(1−κ|x) + Var[(1−κ)x | x]
  = E(1−κ|x) + x² Var(κ|x)
  = E(1−κ|x) + x² E(κ²|x) − x² [E(κ|x)]².

Since x − T(x) = x E(κ|x), we may rewrite the above as

Var(θ|x) = T(x)/x − (T(x) − x)² + x² [∫_0^1 κ^{a+3/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ].

Since κ^{a+3/2} ≤ κ^{a−1/2} for all a ∈ R when κ ∈ (0, 1), it follows that the above display can be
bounded from above as Var(θ|x) ≤ 1 + x².

Next, we show that Eq. A–16 holds. We may alternatively represent Var(θ|x) as

Var(θ|x) = E(1−κ|x) + x² E[(1−κ)²|x] − x² E²[(1−κ)|x]
  = T(x)/x − T(x)² + x² [∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ].

Since (1−κ)^{b_n+1} ≤ (1−κ)^{b_n−1} for all b_n ∈ R when κ ∈ (0, 1), it follows that the above
display can also be bounded from above as Var(θ|x) ≤ 1 + x².
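As a quick numerical sanity check (our own sketch, not part of the thesis), the two representations in Eq. A–15 and Eq. A–16 can be evaluated by quadrature and compared, along with the bound Var(θ|x) ≤ 1 + x²:

```python
import numpy as np
from scipy.integrate import quad

def moment(x, a, b, i, j):
    """E[kappa^i (1-kappa)^j | x] under the posterior density proportional
    to kappa^(a-1/2) (1-kappa)^(b-1) exp(-kappa x^2/2); quad's 'alg'
    weight handles the algebraic endpoint behavior exactly."""
    f = lambda k: np.exp(-k * x ** 2 / 2)
    num, _ = quad(f, 0, 1, weight='alg', wvar=(a - 0.5 + i, b - 1 + j))
    den, _ = quad(f, 0, 1, weight='alg', wvar=(a - 0.5, b - 1))
    return num / den

a, b = 0.6, 0.05
for x in (0.5, 1.5, 3.0):
    T = x * moment(x, a, b, 0, 1)                                  # posterior mean T(x)
    var1 = T / x - (T - x) ** 2 + x ** 2 * moment(x, a, b, 2, 0)   # Eq. A-15
    var2 = T / x - T ** 2 + x ** 2 * moment(x, a, b, 0, 2)         # Eq. A-16
    assert abs(var1 - var2) < 1e-6
    assert var1 <= 1 + x ** 2
```

The agreement of the two forms reflects the identity x²Var(κ|x) = x²Var(1−κ|x) used in the proof.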
Lemma A.3. Let Var(θ|x) be the posterior variance under Eq. 2–6 for a single observation
x drawn from N(θ, 1). Suppose we have constants η ∈ (0, 1), δ ∈ (0, 1), a ∈ (1/2, ∞), and
b_n ∈ (0, 1), where b_n → 0 as n → ∞. Then there exists a nonnegative, measurable,
real-valued function h_n(x) such that Var(θ|x) ≤ h_n(x) for all x ∈ R. Moreover, h_n(x) → 1 as
x → ∞ for any fixed b_n ∈ (0, 1), and for any d > 1, h_n(·) satisfies

lim_{n→∞} sup_{|x| > √(2ρ log(1/b_n))} h_n(x) = 1 for any ρ > d.     (A–17)
Proof of Lemma A.3. We use the representation of Var(θ|x) given in Eq. A–15. It is clear
that T(x)/x − (T(x) − x)² can be bounded above by h_1(x) = 1 for all x ∈ R. To bound the last
term in Eq. A–15, fix η ∈ (0, 1) and δ ∈ (0, 1), and split this term into the sum

x² [∫_0^η κ^{a+3/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] + x² [∫_η^1 κ^{a+3/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ].     (A–18)

Following the same techniques as in Lemma A.1 to bound each term in this sum, we can
show that there exists a real-valued function h_2(x) that bounds Eq. A–18 uniformly for all
x ∈ R and for which h_2(x) → 0 as x → ∞ for any fixed n. Moreover, by mimicking the proof
of Lemma A.1, it can similarly be shown that, for any d > 1, this function h_2(x) satisfies

lim_{n→∞} sup_{|x| > √(2ρ log(1/b_n))} h_2(x) = 0 for any ρ > d.

Therefore, letting h_n(x) = h_1(x) + h_2(x) = 1 + h_2(x), the lemma is proven.
Lemma A.4. Define J_n(x) as follows:

J_n(x) = x² [∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ].     (A–19)

Suppose that a ∈ (1/2, ∞) is fixed and b_n ∈ (0, 1) for all n. Then we have the following upper
bound for J_n(x):

J_n(x) ≤ b_n e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n+1))].     (A–20)
Proof of Lemma A.4. We have

J_n(x) = x² [∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ]
  ≤ x² e^{x²/2} [∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} dκ]
  = x² e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))] ∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ
  = e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))] ∫_0^{x²} (t/x²)^{a−1/2} (1 − t/x²)^{b_n+1} e^{−t/2} dt,     (A–21)

where we used the change of variables t = κx² in the last equality. For 0 < t < x², we
have 0 < 1 − t/x² < 1, and since b_n ∈ (0, 1) for all n, we have (1 − t/x²)^{b_n+1} < 1 for all n.
Therefore, from Eq. A–21, we may further bound J_n(x) from above as

J_n(x) ≤ e^{x²/2} (x²)^{1/2−a} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))] ∫_0^{x²} t^{a−1/2} e^{−t/2} dt
  ≤ e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))] ∫_0^{x²} e^{−t/2} dt
  ≤ e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))]
  = b_n e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n+1))],

where we used the fact that a ∈ (1/2, ∞), and hence t^{a−1/2} ≤ (x²)^{a−1/2} for t ∈ (0, x²), for the
second inequality, and the fact that Γ(b_n+1) = b_n Γ(b_n) for the last equality.
Lemmas A.1, A.2, A.3, and A.4 are crucial in proving Theorems 2.4 and 2.5, which provide
asymptotic upper bounds on the mean squared error (MSE) of the posterior mean and on the
total posterior variance under the IGG_n prior in Eq. 2–6. These theorems will ultimately allow
us to provide sufficient conditions under which the posterior mean and the posterior
distribution under the IGG_n prior contract at minimax rates.
Proof of Theorem 2.4. Define q̃_n = #{i : θ_{0i} ≠ 0}. We split the MSE,

E_{θ_0} ||T(X) − θ_0||² = Σ_{i=1}^{n} E_{θ_{0i}} (T(X_i) − θ_{0i})²,

as

Σ_{i=1}^{n} E_{θ_{0i}} (T(X_i) − θ_{0i})² = Σ_{i : θ_{0i} ≠ 0} E_{θ_{0i}} (T(X_i) − θ_{0i})² + Σ_{i : θ_{0i} = 0} E_{θ_{0i}} (T(X_i) − θ_{0i})².     (A–22)

We consider the nonzero means and the zero means separately.
Nonzero means: For θ_{0i} ≠ 0, using the Cauchy-Schwarz inequality and the fact that
E_{θ_{0i}}(X_i − θ_{0i})² = 1, we get

E_{θ_{0i}} (T(X_i) − θ_{0i})² = E_{θ_{0i}} (T(X_i) − X_i + X_i − θ_{0i})²
  = E_{θ_{0i}} (T(X_i) − X_i)² + E_{θ_{0i}} (X_i − θ_{0i})² + 2 E_{θ_{0i}} [(T(X_i) − X_i)(X_i − θ_{0i})]
  ≤ E_{θ_{0i}} (T(X_i) − X_i)² + 1 + 2 √(E_{θ_{0i}} (T(X_i) − X_i)²) √(E_{θ_{0i}} (X_i − θ_{0i})²)
  = [ √(E_{θ_{0i}} (T(X_i) − X_i)²) + 1 ]².     (A–23)

We now define

ζ_n = √(2 log(1/b_n)).     (A–24)

Fix any d > 2 and choose any ρ > d. Then, using Lemma A.1, there exists a nonnegative
real-valued function h_n(·), depending on d, such that

|T(x) − x| ≤ h_n(x) for all x ∈ R,     (A–25)

and

lim_{n→∞} sup_{|x| > ρζ_n} h_n(x) = 0.     (A–26)

Using the fact that (T(X_i) − X_i)² ≤ X_i², together with Eq. A–26, we obtain

E_{θ_{0i}} (T(X_i) − X_i)² = E_{θ_{0i}} [(T(X_i) − X_i)² 1{|X_i| ≤ ρζ_n}] + E_{θ_{0i}} [(T(X_i) − X_i)² 1{|X_i| > ρζ_n}]
  ≤ ρ² ζ_n² + ( sup_{|x| > ρζ_n} h_n(x) )².     (A–27)

Using Eq. A–26 and the fact that ζ_n → ∞ as n → ∞ by Eq. A–24, it follows that

( sup_{|x| > ρζ_n} h_n(x) )² = o(ζ_n²) as n → ∞.     (A–28)
94
By combining Eq. A–27 and Eq. A–28, we get
$$ \mathbb{E}_{\theta_{0i}}(T(X_i) - X_i)^2 \le \rho^2\zeta_n^2(1 + o(1)) \text{ as } n \to \infty. \tag{A–29} $$
Noting that Eq. A–29 holds uniformly for any $i$ such that $\theta_{0i} \neq 0$, we combine Eq. A–23, Eq. A–24, and Eq. A–29 to conclude that
$$ \sum_{i:\theta_{0i}\neq 0} \mathbb{E}_{\theta_{0i}}(T(X_i) - \theta_{0i})^2 \lesssim \tilde{q}_n\log\left(\frac{1}{b_n}\right) \text{ as } n \to \infty. \tag{A–30} $$
Zero means: For $\theta_{0i} = 0$, the corresponding MSE can be split as follows:
$$ \mathbb{E}_0 T(X_i)^2 = \mathbb{E}_0[T(X_i)^2\mathbf{1}\{|X_i| \le \zeta_n\}] + \mathbb{E}_0[T(X_i)^2\mathbf{1}\{|X_i| > \zeta_n\}], \tag{A–31} $$
where $\zeta_n$ is as in Eq. A–24. Using Theorem 2.1, we have
$$
\begin{aligned}
\mathbb{E}_0[T(X_i)^2\mathbf{1}\{|X_i| \le \zeta_n\}] &\le \left(\frac{b_n}{a+b_n+1/2}\right)^2\int_{-\zeta_n}^{\zeta_n} x^2 e^{x^2/2}\,dx \\
&\le \frac{b_n^2}{a^2}\int_{-\zeta_n}^{\zeta_n} x^2 e^{x^2/2}\,dx = \frac{2b_n^2}{a^2}\int_0^{\zeta_n} x^2 e^{x^2/2}\,dx \\
&\le \frac{2b_n^2}{a^2}\left(\zeta_n e^{\zeta_n^2/2}\right) \lesssim b_n\sqrt{\log\left(\frac{1}{b_n}\right)}, \tag{A–32}
\end{aligned}
$$
where we use integration by parts for the third inequality.
Now, using the fact that $|T(x)| \le |x|$ for all $x \in \mathbb{R}$,
$$
\begin{aligned}
\mathbb{E}_0[T(X_i)^2\mathbf{1}\{|X_i| > \zeta_n\}] &\le 2\int_{\zeta_n}^{\infty} x^2\phi(x)\,dx \\
&\le 2\zeta_n\phi(\zeta_n) + \frac{2\phi(\zeta_n)}{\zeta_n} \\
&= \sqrt{\frac{2}{\pi}}\,\zeta_n e^{-\zeta_n^2/2}(1 + o(1)) \\
&\lesssim b_n\sqrt{\log\left(\frac{1}{b_n}\right)}, \tag{A–33}
\end{aligned}
$$
where we used the identity $x^2\phi(x) = \phi(x) - \frac{d}{dx}[x\phi(x)]$ together with Mills' ratio, $1 - \Phi(x) \le \frac{\phi(x)}{x}$ for all $x > 0$, for the second inequality. Combining Eq. A–32 and Eq. A–33, we have that
$$ \sum_{i:\theta_{0i}=0} \mathbb{E}_{\theta_{0i}} T(X_i)^2 \lesssim (n - \tilde{q}_n)\,b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. \tag{A–34} $$
From Eq. A–22, Eq. A–30, and Eq. A–34, it immediately follows that
$$ \mathbb{E}_{\theta_0}\|T(X) - \theta_0\|^2 = \sum_{i=1}^n \mathbb{E}_{\theta_{0i}}(T(X_i) - \theta_{0i})^2 \lesssim \tilde{q}_n\log\left(\frac{1}{b_n}\right) + (n - \tilde{q}_n)\,b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. $$
The required result now follows by observing that $\tilde{q}_n \le q_n$ and $q_n = o(n)$, and then taking the supremum over all $\theta_0 \in \ell_0[q_n]$. This completes the proof of Theorem 2.4.
Proof of Theorem 2.5. Define $\tilde{q}_n = \#\{i : \theta_{0i} \neq 0\}$. We decompose the total variance as
$$ \mathbb{E}_{\theta_0}\sum_{i=1}^n \mathrm{Var}(\theta_i|X_i) = \sum_{i:\theta_{0i}\neq 0} \mathbb{E}_{\theta_{0i}}\mathrm{Var}(\theta_i|X_i) + \sum_{i:\theta_{0i}=0} \mathbb{E}_{\theta_{0i}}\mathrm{Var}(\theta_i|X_i), \tag{A–35} $$
and consider the nonzero means and zero means separately.

Nonzero means: Fix $d > 1$ and choose any $\rho > d$, and let $\zeta_n$ be defined as in Eq. A–24. For $\theta_{0i} \neq 0$, we have from Eq. A–17 in Lemma A.3 that
$$ \mathbb{E}_{\theta_{0i}}[\mathrm{Var}(\theta_i|X_i)\mathbf{1}\{|X_i| > \rho\zeta_n\}] \lesssim 1. \tag{A–36} $$
Moreover, Lemma A.2 shows that $\mathrm{Var}(\theta|x) \le 1 + x^2$ for any $x \in \mathbb{R}$, and so we must also have that as $n \to \infty$,
$$ \mathbb{E}_{\theta_{0i}}[\mathrm{Var}(\theta_i|X_i)\mathbf{1}\{|X_i| \le \rho\zeta_n\}] \lesssim \zeta_n^2. \tag{A–37} $$
Combining Eq. A–36 and Eq. A–37, we have that, as $n \to \infty$,
$$ \mathbb{E}_{\theta_{0i}}\mathrm{Var}(\theta_i|X_i) \lesssim 1 + \zeta_n^2, $$
and thus, summing over all $i$ such that $\theta_{0i} \neq 0$,
$$ \sum_{i:\theta_{0i}\neq 0} \mathbb{E}_{\theta_{0i}}\mathrm{Var}(\theta_i|X_i) \lesssim \tilde{q}_n(1 + \zeta_n^2) \lesssim \tilde{q}_n\log\left(\frac{1}{b_n}\right). \tag{A–38} $$
Zero means: For $\theta_{0i} = 0$, we use the same $\zeta_n$ that we used for the nonzero means. We have from Lemma A.2 that $\mathrm{Var}(\theta|x) \le 1 + x^2$. Using the identity $x^2\phi(x) = \phi(x) - \frac{d}{dx}[x\phi(x)]$ for $x \in \mathbb{R}$, we obtain that as $n \to \infty$,
$$ \mathbb{E}_0[\mathrm{Var}(\theta_i|X_i)\mathbf{1}\{|X_i| > \zeta_n\}] \le 2\int_{\zeta_n}^{\infty}(1 + x^2)\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx \lesssim \frac{b_n}{\zeta_n} + \zeta_n b_n \lesssim b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. \tag{A–39} $$
Next, we consider $|X_i| \le \zeta_n$. We have by Eq. A–16 in Lemma A.2 that $\mathrm{Var}(\theta|x) \le \frac{T(x)}{x} + J_n(x)$, where $J_n(x)$ is the last term in Eq. A–16. Lemma A.4 gives an upper bound on $J_n(x)$ in Eq. A–20. Since $a \in (\frac12,\infty)$ is fixed and $b_n \in (0,1)$ with $b_n \to 0$ as $n \to \infty$, the term in parentheses in Eq. A–20 is uniformly bounded above by a constant. Therefore, we have by Lemma A.4 that $J_n(x) \lesssim b_n$. Moreover, $\frac{T(x)}{x} = \mathbb{E}(1-\kappa|x)$, and it is clear from Theorem 2.1 that $\mathbb{E}(1-\kappa|x) \lesssim b_n e^{x^2/2}$, so that integrating this bound over $\{|X_i| \le \zeta_n\}$ against the $\mathcal{N}(0,1)$ density contributes at most a constant multiple of $\zeta_n b_n$. Altogether, we have that as $n \to \infty$,
$$ \mathbb{E}_0[\mathrm{Var}(\theta_i|X_i)\mathbf{1}\{|X_i| \le \zeta_n\}] \lesssim \zeta_n b_n \lesssim b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. \tag{A–40} $$
Combining Eq. A–39 and Eq. A–40, it follows that as $n \to \infty$,
$$ \mathbb{E}_0\mathrm{Var}(\theta_i|X_i) \lesssim b_n\sqrt{\log\left(\frac{1}{b_n}\right)}, $$
and consequently,
$$ \sum_{i:\theta_{0i}=0} \mathbb{E}_{\theta_{0i}}\mathrm{Var}(\theta_i|X_i) \lesssim (n - \tilde{q}_n)\,b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. \tag{A–41} $$
Combining Eq. A–35, Eq. A–38, and Eq. A–41, it follows that as $n \to \infty$,
$$ \mathbb{E}_{\theta_0}\sum_{i=1}^n \mathrm{Var}(\theta_i|X_i) \lesssim \tilde{q}_n\log\left(\frac{1}{b_n}\right) + (n - \tilde{q}_n)\,b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. $$
The required result now follows by observing that $\tilde{q}_n \le q_n$ and $q_n = o(n)$, and then taking the supremum over all $\theta_0 \in \ell_0[q_n]$. This completes the proof of Theorem 2.5.
A.3 Proofs for Section 2.3.2
Proof of Theorem 2.7. Using the beta prime representation of the IGG prior, we have
$$ \pi(\theta) = \frac{1}{(2\pi)^{1/2}B(a,b)}\int_0^{\infty}\exp\left(-\frac{\theta^2}{2u}\right)u^{b-\frac32}(1+u)^{-a-b}\,du, $$
where $B(a,b)$ denotes the beta function. Under the change of variables $z = \frac{\theta^2}{2u}$, we have
$$ \pi(\theta) = \frac{2^{a+\frac12}}{(2\pi)^{1/2}B(a,b)}(\theta^2)^{b-\frac12}\int_0^{\infty}\exp(-z)z^{a-\frac12}(\theta^2+2z)^{-a-b}\,dz. \tag{A–42} $$
Now define the set $A_{\epsilon} = \{\theta : |\theta| \le \epsilon\}$. Then from Eq. A–42, and for $0 < \epsilon < 1$, we have
$$
\begin{aligned}
\nu(A_{\epsilon}) &= P(|\theta| \le \epsilon) \\
&= \frac{2^{a+\frac12}\int_0^{\infty}\exp(-z)z^{a-\frac12}\left(\int_{|\theta|\le\epsilon}(\theta^2)^{b-\frac12}(\theta^2+2z)^{-a-b}\,d\theta\right)dz}{(2\pi)^{1/2}B(a,b)} \\
&\ge \frac{2^{a+\frac12}\int_0^{\infty}\exp(-z)z^{a-\frac12}(2z+1)^{-a-b}\left(\int_{|\theta|\le\epsilon}(\theta^2)^{b-\frac12}\,d\theta\right)dz}{(2\pi)^{1/2}B(a,b)} \\
&\ge \frac{2^{a+\frac12}\,2^{-a-b}\cdot 2\int_0^{\infty}\exp(-z)z^{a-\frac12}(1+z)^{-a-b}\left(\int_0^{\epsilon}(\theta^2)^{b-\frac12}\,d\theta\right)dz}{(2\pi)^{1/2}B(a,b)} \\
&= \frac{\epsilon^{2b}}{2^b\,b\,B(a,b)\,\pi^{1/2}}\int_0^{\infty}\exp(-z)z^{a-\frac12}(1+z)^{-a-b}\,dz. \tag{A–43}
\end{aligned}
$$
To bound the integral term in Eq. A–43, note that
$$ \int_0^{\infty}\exp(-z)z^{a-\frac12}(1+z)^{-a-b}\,dz \ge \int_0^1 \exp(-z)z^{a-\frac12}(1+z)^{-a-b}\,dz \ge e^{-1}2^{-a-b}\left(a+\frac12\right)^{-1}. \tag{A–44} $$
Therefore, combining Eq. A–43 and Eq. A–44, we have
$$
\begin{aligned}
\nu(A_{\epsilon}) &\ge \frac{\epsilon^{2b}}{2^b\,b\,B(a,b)\,\pi^{1/2}}\,e^{-1}2^{-a-b}\left(a+\frac12\right)^{-1} \\
&= \frac{\epsilon^{2b}\,\Gamma(a+b)}{2^b\,\Gamma(a)\,\Gamma(b+1)\,\pi^{1/2}}\,e^{-1}2^{-a-b}\left(a+\frac12\right)^{-1} \\
&\ge \frac{\epsilon^{2b}\,\Gamma(a)}{\Gamma(a)\,\Gamma(2)\,\Gamma(\frac12)}\,e^{-1}2^{-a-2b}\left(a+\frac12\right)^{-1} \\
&\ge (\epsilon^2)^b\,\pi^{-1/2}\,e^{-1}2^{-a-2}\left(a+\frac12\right)^{-1}, \tag{A–45}
\end{aligned}
$$
where we use the fact that $0 < b < 1$ for the last two inequalities.
Following Clarke & Barron (1990), the optimal rate of convergence comes from setting $\epsilon_n = 1/n$, which reflects the ideal case of independent samples $x_1, \dots, x_n$. We therefore apply Proposition 2.2, substituting in $\epsilon = 1/n$ and $b = 1/n$ and invoking the lower bound for $\nu(A_{\epsilon})$ found in Eq. A–45. This ultimately gives us an upper bound on the Cesàro-average risk:
$$
R_n \le \frac1n - \frac1n\log\left[\left(\frac1n\right)^{\frac2n}\pi^{-1/2}e^{-1}2^{-a-2}\left(a+\frac12\right)^{-1}\right]
= \frac1n\left[2 + \log(\sqrt{\pi}) + (a+2)\log(2) + \log\left(a+\frac12\right)\right] + \frac{2\log n}{n^2},
$$
when $\theta_0 = 0$.
APPENDIX B
PROOFS FOR CHAPTER 3
In this Appendix, we provide proofs of all the lemmas and theorems in Section 3.2. Our proof methods follow those of Datta & Ghosh (2013), Ghosh et al. (2016), and Ghosh & Chakrabarti (2017), except that our arguments rely on control of the sequence of hyperparameters $b_n$, rather than on specifying a rate or an estimate for a global parameter $\tau$, as in the global-local framework of Eq. 1–5.
Proof of Lemma 3.1. By Theorem 2.1, the event $\left\{\mathbb{E}(1-\kappa_i|X_i) > \frac12\right\}$ implies the event
$$ \left\{e^{X_i^2/2}\left(\frac{b_n}{a+b_n+1/2}\right) > \frac12\right\} \Leftrightarrow \left\{X_i^2 > 2\log\left(\frac{a+b_n+1/2}{2b_n}\right)\right\}. $$
Therefore, noting that under $H_{0i}$, $X_i \sim \mathcal{N}(0,1)$, and using Mills' ratio, i.e. $P(|Z| > x) \le \frac{2\phi(x)}{x}$, we have
$$
\begin{aligned}
t_{1i} &\le \Pr\left(X_i^2 > 2\log\left(\frac{a+b_n+1/2}{2b_n}\right)\,\Big|\,H_{0i}\text{ is true}\right) \\
&= \Pr\left(|Z| > \sqrt{2\log\left(\frac{a+b_n+1/2}{2b_n}\right)}\right) \\
&\le \frac{2\phi\left(\sqrt{2\log\left(\frac{a+b_n+1/2}{2b_n}\right)}\right)}{\sqrt{2\log\left(\frac{a+b_n+1/2}{2b_n}\right)}} \\
&= \frac{2b_n}{\sqrt{\pi}(a+b_n+1/2)}\left[\log\left(\frac{a+b_n+1/2}{2b_n}\right)\right]^{-1/2}.
\end{aligned}
$$
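The closed-form bound at the end of Lemma 3.1 can be checked numerically for particular hyperparameter values; the choice $a = 1$ and the grid of $b_n$ values below are illustrative assumptions only:

```python
import numpy as np
from scipy.stats import norm

a = 1.0
for bn in [0.1, 0.01, 0.001]:
    L = np.log((a + bn + 0.5) / (2.0 * bn))        # the log(.) term
    exact = 2.0 * norm.sf(np.sqrt(2.0 * L))        # Pr(|Z| > sqrt(2 log(...)))
    bound = 2.0 * bn / (np.sqrt(np.pi) * (a + bn + 0.5)) * L ** (-0.5)
    assert exact <= bound
```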
Proof of Lemma 3.2. By definition, the probability of a Type I error for the $i$th decision is given by
$$ t_{1i} = \Pr\left[\mathbb{E}(1-\kappa_i|X_i) > \frac12\,\Big|\,H_{0i}\text{ is true}\right]. $$
We have by Theorem 2.3 that
$$ \mathbb{E}(\kappa_i|X_i) \le \eta + \frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}}\exp\left(-\frac{\eta(1-\delta)}{2}X_i^2\right), $$
and so it follows that
$$ \left\{\mathbb{E}(1-\kappa_i|X_i) > \frac12\right\} \supseteq \left\{\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}}\exp\left(-\frac{\eta(1-\delta)}{2}X_i^2\right) < \frac12-\eta\right\}. $$
Thus, using the definition of $t_{1i}$ and the above, and noting that under $H_{0i}$, $X_i \sim \mathcal{N}(0,1)$, we have for sufficiently large $n$,
$$
\begin{aligned}
t_{1i} &\ge \Pr\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}}\exp\left(-\frac{\eta(1-\delta)}{2}X_i^2\right) < \frac12-\eta\,\Big|\,H_{0i}\text{ is true}\right) \\
&= \Pr\left(X_i^2 > \frac{2}{\eta(1-\delta)}\left[\log\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}(\frac12-\eta)}\right)\right]\right) \\
&= 2\Pr\left(Z > \sqrt{\frac{2}{\eta(1-\delta)}\left[\log\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}(\frac12-\eta)}\right)\right]}\right) \\
&= 2\left[1 - \Phi\left(\sqrt{\frac{2}{\eta(1-\delta)}\left[\log\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}(\frac12-\eta)}\right)\right]}\right)\right],
\end{aligned}
$$
where for the last two equalities we used the fact that $b_n \to 0$ as $n \to \infty$ and the fact that $\eta, \eta\delta \in (0,\frac12)$, so that the $\log(\cdot)$ term above is greater than zero for sufficiently large $n$.
Proof of Lemma 3.3. By definition, the probability of a Type II error is given by
$$ t_{2i} = \Pr\left(\mathbb{E}(1-\kappa_i|X_i) \le \frac12\,\Big|\,H_{1i}\text{ is true}\right). $$
Fix $\eta \in (0,\frac12)$ and $\delta \in (0,1)$. Using the inequality
$$ \kappa_i \le \mathbf{1}\{\eta < \kappa_i \le 1\} + \eta, $$
we obtain
$$ \mathbb{E}(\kappa_i|X_i) \le \Pr(\kappa_i > \eta|X_i) + \eta. $$
Coupled with Theorem 2.3, we obtain that for sufficiently large $n$,
$$ \left\{\mathbb{E}(\kappa_i|X_i) > \frac12\right\} \subseteq \left\{\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}}\exp\left(-\frac{\eta(1-\delta)}{2}X_i^2\right) > \frac12-\eta\right\}. $$
Therefore,
$$
\begin{aligned}
t_{2i} &= \Pr\left(\mathbb{E}(\kappa_i|X_i) > \frac12\,\Big|\,H_{1i}\text{ is true}\right) \\
&\le \Pr\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}}\exp\left(-\frac{\eta(1-\delta)}{2}X_i^2\right) > \frac12-\eta\,\Big|\,H_{1i}\text{ is true}\right) \\
&= \Pr\left(X_i^2 < \frac{2}{\eta(1-\delta)}\left\{\log\left(\frac{a+\frac12}{b_n(\eta\delta)^{a}}\right) - \log\left(\frac{(\frac12-\eta)(\eta\delta)^{1/2}}{1-\eta}\right)\right\}\,\Big|\,H_{1i}\text{ is true}\right) \\
&= \Pr\left(X_i^2 < \frac{2}{\eta(1-\delta)}\log\left(\frac{a+\frac12}{b_n(\eta\delta)^{a}}\right)(1+o(1))\,\Big|\,H_{1i}\text{ is true}\right), \tag{B–1}
\end{aligned}
$$
where in the final equality we used the fact that the second $\log(\cdot)$ term in the second-to-last equality is a bounded constant, while the first $\log(\cdot)$ term diverges since $b_n \to 0$ as $n \to \infty$, so the constant is absorbed into the $(1+o(1))$ factor.
Note that under $H_{1i}$, $X_i \sim \mathcal{N}(0, 1+\psi^2)$. Therefore, by Eq. B–1 and the fact that
$$ \lim_{n\to\infty}\frac{\psi_n^2}{1+\psi_n^2} = 1 $$
(by the second condition of Assumption 1), we have
$$ t_{2i} \le \Pr\left(|Z| < \sqrt{\frac{2}{\eta(1-\delta)}}\sqrt{\frac{\log\left((a+\frac12)(\eta\delta)^{-a}b_n^{-1}\right)}{\psi^2}}\,(1+o(1))\right) \text{ as } n \to \infty. \tag{B–2} $$
By assumption, $\lim_{n\to\infty}\frac{b_n^{1/4}}{p_n} \in (0,\infty)$. This then implies that $\lim_{n\to\infty}\frac{b_n^{7/8}}{p_n^2} = 0$. Therefore, by the fourth condition of Assumption 1 and the fact that $\psi^2 \to \infty$ as $n \to \infty$, we have
$$
\frac{\log\left((a+\frac12)(\eta\delta)^{-a}b_n^{-1}\right)}{\psi^2} = \frac{\log\left((a+\frac12)(\eta\delta)^{-a}\right) + \log(b_n^{-1})}{\psi^2}
= \left(\frac{\log(b_n^{-1/8})}{\psi^2} + \frac{\log(b_n^{-7/8})}{\psi^2}\right)(1+o(1))
= \frac{\log(b_n^{-1/2})}{4\psi^2}(1+o(1)) \to \frac{C}{4} \text{ as } n \to \infty. \tag{B–3}
$$
Thus, using Eq. B–2 and Eq. B–3, we have
$$
\begin{aligned}
t_{2i} &\le \Pr\left(|Z| < \sqrt{\frac{C}{2\eta(1-\delta)}}\,(1+o(1))\right) \text{ as } n \to \infty \\
&= \Pr\left(|Z| < \sqrt{\frac{C}{2\eta(1-\delta)}}\right)(1+o(1)) \text{ as } n \to \infty \\
&= \left[2\Phi\left(\sqrt{\frac{C}{2\eta(1-\delta)}}\right) - 1\right](1+o(1)) \text{ as } n \to \infty.
\end{aligned}
$$
Proof of Lemma 3.4. By definition, the probability of a Type II error for the $i$th decision is given by
$$ t_{2i} = \Pr\left(\mathbb{E}(1-\kappa_i|X_i) \le \frac12\,\Big|\,H_{1i}\text{ is true}\right). $$
For any $n$, we have by Theorem 2.1 that
$$ \left\{e^{X_i^2/2}\left(\frac{b_n}{a+b_n+1/2}\right) \le \frac12\right\} \subseteq \left\{\mathbb{E}(1-\kappa_i|X_i) \le \frac12\right\}. $$
Therefore,
$$
\begin{aligned}
t_{2i} &= \Pr\left(\mathbb{E}(1-\kappa_i|X_i) \le \frac12\,\Big|\,H_{1i}\text{ is true}\right) \\
&\ge \Pr\left(e^{X_i^2/2}\left(\frac{b_n}{a+b_n+1/2}\right) \le \frac12\,\Big|\,H_{1i}\text{ is true}\right) \\
&= \Pr\left(X_i^2 \le 2\log\left(\frac{a+b_n+1/2}{2b_n}\right)\,\Big|\,H_{1i}\text{ is true}\right). \tag{B–4}
\end{aligned}
$$
Since $X_i \sim \mathcal{N}(0, 1+\psi^2)$ under $H_{1i}$, we have by the second condition in Assumption 1 that $\lim_{n\to\infty}\frac{\psi_n^2}{1+\psi_n^2} = 1$. From Eq. B–4 and the facts that $a \in (\frac12,\infty)$ and $b_n \in (0,1)$ for all $n$ (so $b_n^{-1} \ge b_n^{-1/2}$ for all $n$), we have for sufficiently large $n$,
$$
\begin{aligned}
t_{2i} &\ge \Pr\left(|Z| \le \sqrt{\frac{2\log\left(\frac{a+b_n+1/2}{2b_n}\right)}{\psi^2}}\,(1+o(1))\right) \text{ as } n \to \infty \\
&\ge \Pr\left(|Z| \le \sqrt{\frac{\log\left(\frac{1}{2b_n}\right)}{\psi^2}}\,(1+o(1))\right) \text{ as } n \to \infty \\
&\ge \Pr\left(|Z| \le \sqrt{\frac{\log(b_n^{-1/2}) + \log(1/2)}{\psi^2}}\,(1+o(1))\right) \text{ as } n \to \infty \\
&= \Pr(|Z| \le \sqrt{C})(1+o(1)) \text{ as } n \to \infty \\
&= \left[2\Phi(\sqrt{C}) - 1\right](1+o(1)) \text{ as } n \to \infty,
\end{aligned}
$$
where in the second-to-last equality, we used the assumption that $\lim_{n\to\infty}\frac{b_n^{1/4}}{p_n} \in (0,\infty)$ and the second and fourth conditions from Assumption 1.
Proof of Theorem 3.1. Since the $\kappa_i$'s, $i = 1, \dots, n$, are a posteriori independent, the Type I and Type II error probabilities $t_{1i}$ and $t_{2i}$ are the same for every test $i$, $i = 1, \dots, n$. By Lemmas 3.1 and 3.2, for large enough $n$,
$$
2\left[1 - \Phi\left(\sqrt{\frac{2}{\eta(1-\delta)}\left[\log\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}(\frac12-\eta)}\right)\right]}\right)\right]
\le t_{1i}
\le \frac{2b_n}{\sqrt{\pi}(a+b_n+1/2)}\left[\log\left(\frac{a+b_n+1/2}{2b_n}\right)\right]^{-1/2}.
$$
Taking the limit as $n \to \infty$ of all the terms above and using the sandwich theorem, we have
$$ \lim_{n\to\infty} t_{1i} = 0 \tag{B–5} $$
for the $i$th test, under the assumptions on the hyperparameters $a$ and $b_n$.
By Lemmas 3.3 and 3.4, for any $\eta \in (0,\frac12)$ and $\delta \in (0,1)$,
$$ \left[2\Phi(\sqrt{C}) - 1\right](1+o(1)) \le t_{2i} \le \left[2\Phi\left(\sqrt{\frac{C}{2\eta(1-\delta)}}\right) - 1\right](1+o(1)). \tag{B–6} $$
Therefore, we have by Eq. B–5 and Eq. B–6 that as $n \to \infty$, the asymptotic risk in Eq. 1–15 of the classification rule in Eq. 3–1, $R_{IGG}$, can be bounded as follows:
$$ np\left(2\Phi(\sqrt{C}) - 1\right)(1+o(1)) \le R_{IGG} \le np\left(2\Phi\left(\sqrt{\frac{C}{2\eta(1-\delta)}}\right) - 1\right)(1+o(1)). \tag{B–7} $$
Therefore, from Eq. 1–17 and Eq. B–7, we have that as $n \to \infty$,
$$ 1 \le \liminf_{n\to\infty}\frac{R_{IGG}}{R^{BO}_{Opt}} \le \limsup_{n\to\infty}\frac{R_{IGG}}{R^{BO}_{Opt}} \le \frac{2\Phi\left(\sqrt{\frac{C}{2\eta(1-\delta)}}\right) - 1}{2\Phi(\sqrt{C}) - 1}. \tag{B–8} $$
Now, the supremum of $\eta(1-\delta)$ over $(\eta,\delta) \in (0,\frac12)\times(0,1)$ is clearly $\frac12$, and so the infimum of the numerator in the right-most term in Eq. B–8 is therefore $2\Phi(\sqrt{C}) - 1$. Thus,
$$ 1 \le \liminf_{n\to\infty}\frac{R_{IGG}}{R^{BO}_{Opt}} \le \limsup_{n\to\infty}\frac{R_{IGG}}{R^{BO}_{Opt}} \le 1, $$
so the classification rule in Eq. 3–1 is ABOS, i.e. $\frac{R_{IGG}}{R^{BO}_{Opt}} \to 1$ as $n \to \infty$.
APPENDIX C
PROOFS FOR CHAPTER 4
In this Appendix, we provide proofs of all the theorems in Chapter 4.
C.1 Proofs for Section 4.2.3
C.1.1 Proof of Theorem 4.1
The proof of Theorem 4.1 is based on a lemma. This lemma is similar to Lemma 1.1 in Goh et al. (2017), with suitable modifications so that we utilize Conditions (A1)-(A3) explicitly. Furthermore, Goh et al. (2017) gave a sufficient condition for posterior consistency in the Frobenius norm when $p_n = o(n)$ in Theorem 1 of their paper. However, we are not clear about a particular step in their proof. They assert that
$$
\left\{(A,B) : n^{-1}\left(\|(Y_n - XC)\Sigma^{-1/2}\|_F^2 - \|(Y_n - XC^*)\Sigma^{-1/2}\|_F^2\right) < 2\nu,\ C = AB^{\top}\right\}
\supseteq \left\{(A,B) : n^{-1}\left|\,\|Y_n - XC\|_F^2 - \|Y_n - XC^*\|_F^2\,\right| < 2\tau_{\min}\nu,\ C = AB^{\top}\right\},
$$
where $\tau_{\min}$ is the minimum eigenvalue of $\Sigma$. This does not seem to be true in general, unless the matrix $(Y_n - XC)(Y_n - XC)^{\top} - (Y_n - XC^*)(Y_n - XC^*)^{\top}$ is positive definite, which cannot be assumed. Our proof for Theorem 4.1 thus gives a different sufficient condition for posterior consistency in this low-dimensional setting. Moreover, the proof of Theorem 4.2 in the ultrahigh-dimensional case requires a suitable modification of Theorem 4.1. Thus, we deem it beneficial to write out all the details for Lemma C.1 and Theorem 4.1.
Lemma C.1. Define $\mathcal{B}_{\varepsilon} = \{B_n : \|B_n - B_0\|_F > \varepsilon\}$, where $\varepsilon > 0$. To test $H_0: B_n = B_0$ vs. $H_1: B_n \in \mathcal{B}_{\varepsilon}$, define a test function $\Phi_n = \mathbf{1}(Y_n \in \mathcal{C}_n)$, where the critical region is $\mathcal{C}_n := \left\{Y_n : \|\widehat{B}_n - B_0\|_F > \varepsilon/2\right\}$ and $\widehat{B}_n = (X_n^{\top}X_n)^{-1}X_n^{\top}Y_n$. Then, under the model in Eq. 4–7 and assumptions (A1)-(A3), we have that as $n \to \infty$:
1. $\mathbb{E}_{B_0}(\Phi_n) \le \exp(-\varepsilon^2 nc_1/16d_2)$,
2. $\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Phi_n) \le \exp(-\varepsilon^2 nc_1/16d_2)$.
Proof of Lemma C.1. Since $\widehat{B}_n \sim \mathcal{MN}_{p_n\times q}(B_0, (X_n^{\top}X_n)^{-1}, \Sigma)$ w.r.t. $P_0$-measure,
$$ Z_n = (X_n^{\top}X_n)^{1/2}(\widehat{B}_n - B_0)\Sigma^{-1/2} \sim \mathcal{MN}_{p_n\times q}(O, I_{p_n}, I_q). \tag{C–1} $$
Using the fact that for square conformal positive definite matrices $A, B$, $\lambda_{\min}(A)\operatorname{tr}(B) \le \operatorname{tr}(AB) \le \lambda_{\max}(A)\operatorname{tr}(B)$, we have
$$
\begin{aligned}
\mathbb{E}_{B_0}(\Phi_n) &= P_{B_0}\left(Y_n : \|\widehat{B}_n - B_0\|_F > \varepsilon/2\right) \\
&= P_{B_0}\left(\|(X_n^{\top}X_n)^{-1/2}Z_n\Sigma^{1/2}\|_F^2 > \varepsilon^2/4\right) \quad \text{(by Eq. C–1)} \\
&= P_{B_0}\left(\operatorname{tr}(\Sigma^{1/2}Z_n^{\top}(X_n^{\top}X_n)^{-1}Z_n\Sigma^{1/2}) > \varepsilon^2/4\right) \\
&\le P_{B_0}\left(n^{-1}c_1^{-1}\operatorname{tr}(\Sigma^{1/2}Z_n^{\top}Z_n\Sigma^{1/2}) > \varepsilon^2/4\right) \\
&\le P_{B_0}\left(n^{-1}c_1^{-1}d_2\operatorname{tr}(Z_n^{\top}Z_n) > \varepsilon^2/4\right) \\
&= P_{B_0}\left(\|Z_n\|_F^2 > \frac{\varepsilon^2 c_1 n}{4d_2}\right) = \Pr\left(\chi^2_{p_nq} > \frac{\varepsilon^2 c_1 n}{4d_2}\right), \tag{C–2}
\end{aligned}
$$
where the two inequalities follow from Assumptions (A2) and (A3), respectively, and the last equality follows from Eq. C–1. By Armagan et al. (2013a), for all $m > 0$, $\Pr(\chi^2_m \ge x) \le \exp(-x/4)$ whenever $x \ge 8m$. Using Assumption (A1) and noting that $q$ is fixed, we have by Eq. C–2 that as $n \to \infty$,
$$ \mathbb{E}_{B_0}(\Phi_n) \le \Pr\left(\chi^2_{p_nq} > \frac{\varepsilon^2 c_1 n}{4d_2}\right) \le \exp\left(-\frac{\varepsilon^2 c_1 n}{16d_2}\right), $$
thus establishing the first part of the lemma.

We next show the second part of the lemma. We have
$$
\begin{aligned}
\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Phi_n) &= \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : \|\widehat{B}_n - B_0\|_F \le \varepsilon/2\right) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : \left|\,\|\widehat{B}_n - B_n\|_F - \|B_n - B_0\|_F\,\right| \le \varepsilon/2\right) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : -\varepsilon/2 + \|B_n - B_0\|_F \le \|\widehat{B}_n - B_n\|_F\right) \\
&\le P_{B_n}\left(Y_n : \|\widehat{B}_n - B_n\|_F > \varepsilon/2\right) \\
&\le \exp\left(-\frac{\varepsilon^2 c_1 n}{16d_2}\right),
\end{aligned}
$$
where the last inequality follows from the fact that $\widehat{B}_n \sim \mathcal{MN}_{p_n\times q}(B_n, (X_n^{\top}X_n)^{-1}, \Sigma)$ under $P_{B_n}$, so we may use the same steps that were used to prove the first part of the lemma. Therefore, we have also established the second part of the lemma.
Proof of Theorem 4.1. We utilize the proof technique of Theorem 1 in Armagan et al. (2013a) and modify it suitably for the multivariate case, subject to Conditions (A1)-(A3). The posterior probability of $\mathcal{B}_{\varepsilon}$ is given by
$$ \Pi_n(\mathcal{B}_{\varepsilon}|Y_n) = \frac{\int_{\mathcal{B}_{\varepsilon}}\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)}{\int\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)} \le \Phi_n + \frac{(1-\Phi_n)J_{\mathcal{B}_{\varepsilon}}}{J_n} = I_1 + \frac{I_2}{J_n}, \tag{C–3} $$
where $J_{\mathcal{B}_{\varepsilon}} = \int_{\mathcal{B}_{\varepsilon}}\left\{\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\right\}\pi_n(dB_n)$ and $J_n = \int\left\{\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\right\}\pi_n(dB_n)$.
Let $b = \frac{\varepsilon^2 c_1}{16d_2}$. For sufficiently large $n$, using Markov's inequality and the first part of Lemma C.1, we have
$$ P_{B_0}\left(I_1 \ge \exp\left(-\frac{bn}{2}\right)\right) \le \exp\left(\frac{bn}{2}\right)\mathbb{E}_{B_0}(I_1) \le \exp\left(-\frac{bn}{2}\right). $$
This implies that $\sum_{n=1}^{\infty}P_{B_0}\left(I_1 \ge \exp(-bn/2)\right) < \infty$. Thus, by the Borel-Cantelli Lemma, $I_1 \to 0$ a.s. $P_0$ as $n \to \infty$.
We next look at the behavior of $I_2$. We have
$$
\begin{aligned}
\mathbb{E}_{B_0}I_2 &= \mathbb{E}_{B_0}\{(1-\Phi_n)J_{\mathcal{B}_{\varepsilon}}\} = \mathbb{E}_{B_0}\left\{(1-\Phi_n)\int_{\mathcal{B}_{\varepsilon}}\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)\right\} \\
&= \int_{\mathcal{B}_{\varepsilon}}\int(1-\Phi_n)f(Y_n|B_n)\,dY_n\,\pi_n(dB_n) \\
&\le \pi_n(\mathcal{B}_{\varepsilon})\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Phi_n) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Phi_n) \le \exp(-bn),
\end{aligned}
$$
where the last inequality follows from the second part of Lemma C.1.

Thus, for sufficiently large $n$, $P_{B_0}(I_2 \ge \exp(-bn/2)) \le \exp(-bn/2)$, which implies that $\sum_{n=1}^{\infty}P_{B_0}\left(I_2 \ge \exp(-bn/2)\right) < \infty$. Thus, by the Borel-Cantelli Lemma, $I_2 \to 0$ a.s. $P_0$ as $n \to \infty$.
We have now shown that both $I_1$ and $I_2$ in Eq. C–3 tend towards zero exponentially fast. We now analyze the behavior of $J_n$. To complete the proof, we need to show that
$$ \exp(bn/2)J_n \to \infty \ P_0\text{-a.s. as } n \to \infty. \tag{C–4} $$
Note that
$$ \exp(bn/2)J_n = \exp(bn/2)\int\exp\left\{-n\cdot\frac1n\log\frac{f(Y_n|B_0)}{f(Y_n|B_n)}\right\}\pi_n(dB_n) \ge \exp\{(b/2-\nu)n\}\,\pi_n(D_{n,\nu}), \tag{C–5} $$
where $D_{n,\nu} = \left\{B_n : n^{-1}\log\left(\frac{f(Y_n|B_0)}{f(Y_n|B_n)}\right) < \nu\right\}$ for $0 < \nu < b/2$. Therefore, we have
$$
\begin{aligned}
D_{n,\nu} &= \left\{B_n : n^{-1}\left(\frac12\operatorname{tr}\left[(Y_n-X_nB_n)^{\top}(Y_n-X_nB_n)\Sigma^{-1}\right] - \frac12\operatorname{tr}\left[(Y_n-X_nB_0)^{\top}(Y_n-X_nB_0)\Sigma^{-1}\right]\right) < \nu\right\} \\
&\equiv \left\{B_n : n^{-1}\left(\operatorname{tr}\left[\Sigma^{-1/2}(Y_n-X_nB_n)^{\top}(Y_n-X_nB_n)\Sigma^{-1/2}\right] - \operatorname{tr}\left[\Sigma^{-1/2}(Y_n-X_nB_0)^{\top}(Y_n-X_nB_0)\Sigma^{-1/2}\right]\right) < 2\nu\right\} \\
&\equiv \left\{B_n : n^{-1}\left(\|(Y_n-X_nB_n)\Sigma^{-1/2}\|_F^2 - \|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F^2\right) < 2\nu\right\}.
\end{aligned}
$$
Noting that
$$ \|(Y_n-X_nB_n)\Sigma^{-1/2}\|_F^2 \le \|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F^2 + \|X_n(B_n-B_0)\Sigma^{-1/2}\|_F^2 + 2\|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F\|X_n(B_n-B_0)\Sigma^{-1/2}\|_F, $$
we have
$$
\begin{aligned}
\pi_n(D_{n,\nu}) &\ge \pi_n\left\{B_n : n^{-1}\left(2\|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F\|X_n(B_n-B_0)\Sigma^{-1/2}\|_F + \|X_n(B_n-B_0)\Sigma^{-1/2}\|_F^2\right) < 2\nu\right\} \\
&\ge \pi_n\left\{B_n : n^{-1}\|X_n(B_n-B_0)\Sigma^{-1/2}\|_F < \frac{2\nu}{3\kappa_n},\ \|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F < \kappa_n\right\}, \tag{C–6}
\end{aligned}
$$
for some positive increasing sequence $\kappa_n$ such that $\kappa_n \to \infty$ as $n \to \infty$.

Set $\kappa_n = n^{(1+\rho)/2}$ for $\rho > 0$. Since $E_n = Y_n - X_nB_0$, we have $Z_n = (Y_n - X_nB_0)\Sigma^{-1/2} \sim \mathcal{MN}_{n\times q}(O, I_n, I_q)$. Therefore, as $n \to \infty$,
$$ P_{B_0}(\|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F > \kappa_n) = P_{B_0}(\|Z_n\|_F^2 > \kappa_n^2) = \Pr\left(\chi^2_{nq} > n^{1+\rho}\right) \le \exp\left(-\frac{n^{1+\rho}}{4}\right), $$
where the last inequality follows from the fact that for all $m > 0$, $\Pr(\chi^2_m \ge x) \le \exp(-x/4)$ when $x \ge 8m$, and the assumptions that $q$ is fixed and $\rho > 0$. Since $\sum_{n=1}^{\infty}P_{B_0}(\|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F > \kappa_n) \le \sum_{n=1}^{\infty}\exp\left(-\frac{n^{1+\rho}}{4}\right) < \infty$, we have by the Borel-Cantelli Lemma that
$$ P_{B_0}\left\{\|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F > \kappa_n \text{ infinitely often}\right\} = 0. $$
For sufficiently large $n$, we have from Eq. C–6 that
$$
\begin{aligned}
\pi_n(D_{n,\nu}) &\ge \pi_n\left\{B_n : n^{-1}\|X_n(B_n-B_0)\Sigma^{-1/2}\|_F < \frac{2\nu}{3\kappa_n}\right\} \\
&\ge \pi_n\left\{B_n : n^{-1}n^{1/2}c_2^{1/2}d_1^{-1/2}\|B_n-B_0\|_F < \frac{2\nu}{3\kappa_n}\right\} \\
&= \pi_n\left\{B_n : \|B_n-B_0\|_F < \left(\frac{2d_1^{1/2}\nu}{3c_2^{1/2}}\right)n^{-(1+\rho)/2}n^{1/2}\right\} \\
&= \pi_n\left\{B_n : \|B_n-B_0\|_F < \frac{\Delta}{n^{\rho/2}}\right\}, \tag{C–7}
\end{aligned}
$$
where $\Delta = \frac{2d_1^{1/2}\nu}{3c_2^{1/2}}$. The second inequality in Eq. C–7 follows from Assumptions (A2) and (A3) and the fact that
$$
\|X_n(B_n-B_0)\Sigma^{-1/2}\|_F = \sqrt{\operatorname{tr}\left[\Sigma^{-1/2}(B_n-B_0)^{\top}X_n^{\top}X_n(B_n-B_0)\Sigma^{-1/2}\right]}
\le \sqrt{\lambda_{\max}(X_n^{\top}X_n)\lambda_{\max}(\Sigma^{-1})\|B_n-B_0\|_F^2}
< n^{1/2}c_2^{1/2}d_1^{-1/2}\|B_n-B_0\|_F.
$$
Therefore, from Eq. C–7, if $\pi_n\left\{B_n : \|B_n-B_0\|_F < \frac{\Delta}{n^{\rho/2}}\right\} > \exp(-kn)$ for all $0 < k < b/2-\nu$, then Eq. C–4 will hold.
Substituting $b = \frac{\varepsilon^2 c_1}{16d_2}$ and $\Delta = \frac{2d_1^{1/2}\nu}{3c_2^{1/2}}$ (so that $\nu = \frac{3\Delta c_2^{1/2}}{2d_1^{1/2}}$), we obtain $0 < k < \frac{\varepsilon^2 c_1}{32d_2} - \frac{3\Delta c_2^{1/2}}{2d_1^{1/2}}$. To ensure that $k > 0$, we must have $0 < \Delta < \frac{\varepsilon^2 c_1 d_1^{1/2}}{48c_2^{1/2}d_2}$.

Therefore, if the conditions on $\Delta$ and $k$ in Theorem 4.1 are satisfied, then Eq. C–4 holds. This ensures that the expected value of Eq. C–3 w.r.t. $P_0$-measure approaches 0 as $n \to \infty$, which ultimately establishes that posterior consistency holds if Eq. 4–8 is satisfied.
C.1.2 Proof of Theorem 4.2
The proof of Theorem 4.2 also requires the construction of an appropriate test function. In this case, the test must be very carefully constructed, since $X_n^{\top}X_n$ is no longer nonsingular. We first define some constants and prove a lemma.
For arbitrary $\varepsilon > 0$ and with $c_1$ and $d_2$ as specified in (B3) and (B5), let
$$ c_3 = \frac{\varepsilon^2 c_1}{16d_2}, \tag{C–8} $$
and
$$ m_n = \left\lfloor\frac{nc_3}{6\log p_n}\right\rfloor. \tag{C–9} $$
Lemma C.2. Define the set $\mathcal{B}_{\varepsilon} = \{B_n : \|B_n - B_0\|_F > \varepsilon\}$. Suppose that Conditions (B1)-(B6) hold under Eq. 4–7. In order to test $H_0: B_n = B_0$ vs. $H_1: B_n \in \mathcal{B}_{\varepsilon}$, there exists a test function $\Psi_n$ such that as $n \to \infty$:
1. $\mathbb{E}_{B_0}(\Psi_n) \le \exp(-nc_3/2)$,
2. $\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Psi_n) \le \exp(-nc_3)$,
where $c_3$ is defined in Eq. C–8.

Proof of Lemma C.2. By Condition (B1), we must have that $\frac{n}{\log p_n} \to \infty$. Moreover, by Eq. C–9, $m_n = o(n)$, since $\log p_n \to \infty$ as $n \to \infty$. Combining this with Assumption (B6), we must have that for sufficiently large $n$, the positive integer $m_n$ determined by Eq. C–9 satisfies $0 < s^* < m_n < n$.

For sufficiently large $n$ so that $s^* < m_n < n$, define the set $\mathcal{M}$ as the set of models $S$ which properly contain the true model $S^* \subset \{1, \dots, p_n\}$, so that
$$ \mathcal{M} = \{S : S \supset S^*,\ S \neq S^*,\ |S| \le m_n\}, \tag{C–10} $$
and define the set $\mathcal{T}$ as
$$ \mathcal{T} = \{S : S \subset \{1, \dots, p_n\},\ S \notin \mathcal{M},\ |S| \le n\}. \tag{C–11} $$
Let $X_S$ denote the submatrix of $X$ with columns indexed by model $S$, and let $B_0^S$ denote the submatrix of $B_0$ that contains the rows of $B_0$ indexed by $S$. Define the following sets $\mathcal{C}_n$ and $\mathcal{E}_n$:
$$ \mathcal{C}_n = \bigvee_{S\in\mathcal{M}}\left\{\|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F > \varepsilon/2\right\}, \tag{C–12} $$
$$ \mathcal{E}_n = \bigwedge_{S\in\mathcal{T}}\left\{\|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F \le \varepsilon/2\right\}, \tag{C–13} $$
where $\bigvee$ indicates the union over all models $S$ contained in $\mathcal{M}$, $\bigwedge$ indicates the intersection over all models contained in $\mathcal{T}$, and $\varepsilon > 0$ is arbitrary. Essentially, the set $\mathcal{C}_n$ is the union, over all models $S$ that properly contain the true model $S^*$ and whose submatrix $X_S$ has at least $s^*$ and at most $m_n (< n)$ columns, of the events $\|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F > \varepsilon/2$. Given our choice of $m_n$, $X_S^{\top}X_S$ is nonsingular for all models $S$ contained in our sets.

We are now ready to define our test function $\Psi_n$. To test $H_0: B_n = B_0$ vs. $H_1: B_n \in \mathcal{B}_{\varepsilon}$, define $\Psi_n = \mathbf{1}(Y_n \in \mathcal{C}_n)$, where the critical region is defined as in Eq. C–12. We now show that Lemma C.2 holds with this choice of $\Psi_n$.
Let $s$ be the size of an arbitrary model $S$. Noting also that there are $\binom{p_n}{s}$ ways to select a model of size $s$, we therefore have for sufficiently large $n$,
$$
\begin{aligned}
\mathbb{E}_{B_0}(\Psi_n) &\le \sum_{S\in\mathcal{M}}P_{B_0}\left(Y_n : \|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F > \varepsilon/2\right) \\
&= \sum_{s=s^*+1}^{m_n}\binom{p_n}{s}P_{B_0}\left(Y_n : \|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F > \varepsilon/2\right) \\
&\le \sum_{s=s^*+1}^{m_n}\binom{p_n}{s}\Pr\left(\chi^2_{sq} > \frac{\varepsilon^2 c_1 n}{4d_2}\right) \\
&\le \sum_{s=s^*+1}^{m_n}\binom{p_n}{s}\exp(-nc_3) \\
&\le (m_n-s^*)\binom{p_n}{m_n}\exp(-nc_3) \\
&\le (m_n-s^*)\left(\frac{ep_n}{m_n}\right)^{m_n}\exp(-nc_3), \tag{C–14}
\end{aligned}
$$
where we use the same argument as in Part 1 of Lemma C.1 for the second inequality; the facts that $\Pr(\chi^2_m > x) \le \exp(-x/4)$ when $x \ge 8m$ and $m_n = o(n)$ for the third inequality; and the fact that $\sum_{i=k}^m\binom{n}{i} \le (m-k+1)\binom{n}{m}$ for the fourth inequality in Eq. C–14.
Since $\log n = o(n)$, we must have for sufficiently large $n$ that $\log n < \frac{c_3 n}{6}$. Then, from the definition of $m_n$, we have
$$
\begin{aligned}
\log(m_n-s^*) + m_n\left(1 + \log\left(\frac{p_n}{m_n}\right)\right) &\le \log(m_n) + m_n(1 + \log(p_n)) \\
&\le \log(n) + \frac{c_3 n}{6\log p_n} + \left(\frac{c_3 n}{6\log p_n}\right)\log(p_n) \\
&\le \frac{c_3 n}{6} + \frac{c_3 n}{6} + \frac{c_3 n}{6} = \frac{c_3 n}{2}. \tag{C–15}
\end{aligned}
$$
Therefore, from Eq. C–14 and Eq. C–15, we must have that $\mathbb{E}_{B_0}(\Psi_n) \le \exp(-nc_3/2)$ as $n \to \infty$. This proves the first part of the lemma.
Next, letting $\mathcal{E}_n$ be the set defined in Eq. C–13, we observe that as $n \to \infty$,
$$
\begin{aligned}
\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Psi_n) &= \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}(Y_n \notin \mathcal{C}_n) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}(Y_n \in \mathcal{E}_n) \quad (\text{since } \mathcal{C}_n^c \subseteq \mathcal{E}_n) \\
&= \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(\bigcap_{S\in\mathcal{T}}\left\{Y_n : \|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F \le \varepsilon/2\right\}\right) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : \|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F \le \varepsilon/2\right) \text{ for some } S \in \mathcal{T}.
\end{aligned}
$$
Writing $\widehat{B}_n^S = (X_S^{\top}X_S)^{-1}X_S^{\top}Y_n$ for this single model $S \in \mathcal{T}$, we get
$$
\begin{aligned}
\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Psi_n) &\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : \left|\,\|\widehat{B}_n^S - B_n^S\|_F - \|B_n^S - B_0^S\|_F\,\right| \le \varepsilon/2\right) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : -\varepsilon/2 + \|B_n^S - B_0^S\|_F \le \|\widehat{B}_n^S - B_n^S\|_F\right) \\
&\le P_{B_n^S}\left(Y_n : \|\widehat{B}_n^S - B_n^S\|_F > \varepsilon/2\right) \\
&\le \exp(-c_3 n), \tag{C–16}
\end{aligned}
$$
as $n \to \infty$. To arrive at Eq. C–16, we invoked Part 2 of Lemma C.1 for the final inequality.
Proof of Theorem 4.2. In light of Lemma C.2, we suitably modify the argument of Theorem 4.1 for the ultrahigh-dimensional case. Let $\Psi_n$ be the test function defined in Lemma C.2 for sufficiently large $n$. The posterior probability of $\mathcal{B}_{\varepsilon}$ is given by
$$ \Pi_n(\mathcal{B}_{\varepsilon}|Y_n) = \frac{\int_{\mathcal{B}_{\varepsilon}}\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)}{\int\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)} \le \Psi_n + \frac{(1-\Psi_n)J_{\mathcal{B}_{\varepsilon}}}{J_n} = I_1 + \frac{I_2}{J_n}, \tag{C–17} $$
where $J_{\mathcal{B}_{\varepsilon}} = \int_{\mathcal{B}_{\varepsilon}}\left\{\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\right\}\pi_n(dB_n)$ and $J_n = \int\left\{\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\right\}\pi_n(dB_n)$.

For sufficiently large $n$, using Markov's inequality and the first part of Lemma C.2, and taking $c_3$ as defined in Eq. C–8, we have
$$ P_{B_0}\left(I_1 \ge \exp\left(-\frac{nc_3}{4}\right)\right) \le \exp\left(\frac{nc_3}{4}\right)\mathbb{E}_{B_0}(I_1) \le \exp\left(-\frac{nc_3}{4}\right). $$
This implies that $\sum_{n=1}^{\infty}P_{B_0}\left(I_1 \ge \exp(-nc_3/4)\right) < \infty$. Thus, by the Borel-Cantelli Lemma, we have $P_{B_0}(I_1 \ge \exp(-nc_3/4) \text{ infinitely often}) = 0$, i.e. $I_1 \to 0$ a.s. $P_0$ as $n \to \infty$.
We next look at the behavior of $I_2$. We have
$$
\begin{aligned}
\mathbb{E}_{B_0}I_2 &= \mathbb{E}_{B_0}\{(1-\Psi_n)J_{\mathcal{B}_{\varepsilon}}\} = \mathbb{E}_{B_0}\left\{(1-\Psi_n)\int_{\mathcal{B}_{\varepsilon}}\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)\right\} \\
&= \int_{\mathcal{B}_{\varepsilon}}\int(1-\Psi_n)f(Y_n|B_n)\,dY_n\,\pi_n(dB_n) \\
&\le \pi_n(\mathcal{B}_{\varepsilon})\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Psi_n) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Psi_n) \le \exp(-nc_3),
\end{aligned}
$$
where the last inequality follows from the second part of Lemma C.2, and $c_3$ is again from Eq. C–8.

Thus, for sufficiently large $n$, $P_{B_0}(I_2 \ge \exp(-nc_3/2)) \le \exp(-nc_3/2)$, which implies that $\sum_{n=1}^{\infty}P_{B_0}\left(I_2 \ge \exp(-nc_3/2)\right) < \infty$. Thus, by the Borel-Cantelli Lemma, $I_2 \to 0$ a.s. $P_0$ as $n \to \infty$.
We have now shown that both $I_1$ and $I_2$ in Eq. C–17 tend towards zero exponentially fast. We now analyze the behavior of $J_n$. To complete the proof, we need to show that
$$ \exp(nc_3/2)J_n \to \infty \ P_0\text{-a.s. as } n \to \infty. \tag{C–18} $$
Note that
$$ \exp(nc_3/2)J_n = \exp(nc_3/2)\int\exp\left\{-n\cdot\frac1n\log\frac{f(Y_n|B_0)}{f(Y_n|B_n)}\right\}\pi_n(dB_n) \ge \exp\{(c_3/2-\nu)n\}\,\pi_n(D_{n,\nu}), \tag{C–19} $$
where $D_{n,\nu} = \left\{B_n : n^{-1}\log\left(\frac{f(Y_n|B_0)}{f(Y_n|B_n)}\right) < \nu\right\}$ for $0 < \nu < c_3/2$.
Because of Assumption (B4), which plays the role of Assumption (A2) in bounding the maximum singular value of $X_n$ from above, the rest of the proof is essentially identical to the remainder of the proof of Theorem 4.1, with suitable modifications (i.e. replacing the constants in Conditions (A2)-(A3) with their analogues in Conditions (B3)-(B5), and substituting in the expression in Eq. C–8 for $c_3$).

Therefore, if the conditions on $\Delta$ and $k$ in Theorem 4.2 are satisfied, then Eq. C–19 is satisfied, i.e. $\exp(nc_3/2)J_n \to \infty$ as $n \to \infty$. This ensures that the expected value of Eq. C–17 w.r.t. $P_0$-measure approaches 0 as $n \to \infty$, which ultimately establishes that posterior consistency holds if Eq. 4–9 is satisfied.
C.2 Proofs for Section 4.2.4
C.2.1 Preliminary Lemmas
Before proving Theorems 4.3 and 4.4, we first prove two lemmas which characterize the marginal prior density for the rows of $B$. Throughout this section, we let $b_i$, $1 \le i \le p$, denote the $i$th row of $B$ under the model in Eq. 4–2, with polynomial-tailed hyperpriors of the form given in Eq. 1–7. Lemma C.4 in particular plays a central role in proving our theoretical results in Section 4.2.
Lemma C.3. Under the MBSP model in Eq. 4–2 with polynomial-tailed hyperpriors of the form in Eq. 1–7, the marginal density $\pi(b_i|\Sigma)$ is equal to
$$ \pi(b_i|\Sigma) = D\int_0^{\infty}\xi_i^{-q/2-a-1}\exp\left\{-\frac{1}{2\xi_i\tau}\|b_i\Sigma^{-1/2}\|_2^2\right\}L(\xi_i)\,d\xi_i, $$
where $D > 0$ is an appropriate constant.
Proof of Lemma C.3. Let $\mathcal{D} = \mathrm{diag}(\xi_1, \dots, \xi_p)$. Using Definition 4.1, the joint prior for the MBSP model in Eq. 4–2 with polynomial-tailed priors is
$$
\begin{aligned}
\pi(B, \xi_1, \dots, \xi_p|\Sigma) &\propto |\mathcal{D}|^{-q/2}|\Sigma|^{-p/2}\exp\left\{-\frac12\operatorname{tr}\left[\Sigma^{-1}B^{\top}\tau^{-1}\mathcal{D}^{-1}B\right]\right\}\times\prod_{i=1}^p\pi(\xi_i) \\
&\propto \left[\prod_{i=1}^p\xi_i^{-q/2}\right]\exp\left\{-\frac{1}{2\tau}\sum_{i=1}^p\|\xi_i^{-1/2}b_i\Sigma^{-1/2}\|_2^2\right\}\times\prod_{i=1}^p\pi(\xi_i) \\
&\propto \prod_{i=1}^p\left[\xi_i^{-q/2}\exp\left\{-\frac{1}{2\xi_i\tau}\|b_i\Sigma^{-1/2}\|_2^2\right\}\pi(\xi_i)\right] \\
&\propto \prod_{i=1}^p\left[\xi_i^{-q/2-a-1}\exp\left\{-\frac{1}{2\xi_i\tau}\|b_i\Sigma^{-1/2}\|_2^2\right\}L(\xi_i)\right]. \tag{C–20}
\end{aligned}
$$
Since the rows $b_i$ and the $\xi_i$'s, $1 \le i \le p$, are independent, we have from Eq. C–20 that
$$ \pi(b_i, \xi_i|\Sigma) \propto \xi_i^{-q/2-a-1}\exp\left\{-\frac{1}{2\xi_i\tau}\|b_i\Sigma^{-1/2}\|_2^2\right\}L(\xi_i). $$
Integrating out $\xi_i$ gives the desired marginal prior $\pi(b_i|\Sigma)$.
Though we are not able to obtain a closed form solution for $\pi(b_i|\Sigma)$, we are able to obtain a lower bound on it that can be written in closed form, as we illustrate in the next lemma.

Lemma C.4. Suppose Condition (A3) on the eigenvalues of $\Sigma$ and Condition (C1) on the slowly varying function $L(\cdot)$ in Eq. 1–7 hold. Under the MBSP model in Eq. 4–2 with polynomial-tailed hyperpriors of the form in Eq. 1–7 and known $\Sigma$, the marginal density for $b_i$, the $i$th row of $B$, can be bounded below by
$$ C\exp\left(-\frac{\|b_i\|_2^2}{2\tau d_1 t_0}\right), \tag{C–21} $$
where $C = Dc_0 t_0^{-q/2-a}\left(\frac q2 + a\right)^{-1}$.
Proof of Lemma C.4. Following from Lemma C.3, we have
$$
\begin{aligned}
\pi(b_i) &= D\int_0^{\infty}\xi_i^{-q/2-a-1}\exp\left\{-\frac{1}{2\xi_i\tau}\|b_i\Sigma^{-1/2}\|_2^2\right\}L(\xi_i)\,d\xi_i \\
&\ge D\int_0^{\infty}\xi_i^{-q/2-a-1}\exp\left\{-\frac{\|b_i\|_2^2}{2\xi_i\tau d_1}\right\}L(\xi_i)\,d\xi_i \quad \text{(C–22)} \\
&\ge D\int_{t_0}^{\infty}\xi_i^{-q/2-a-1}\exp\left\{-\frac{\|b_i\|_2^2}{2\xi_i\tau d_1}\right\}L(\xi_i)\,d\xi_i \\
&\ge Dc_0\int_{t_0}^{\infty}\xi_i^{-q/2-a-1}\exp\left\{-\frac{\|b_i\|_2^2}{2\xi_i\tau d_1}\right\}d\xi_i \quad \text{(C–23)} \\
&= Dc_0\left(\frac{2\tau d_1}{\|b_i\|_2^2}\right)^{q/2+a}\int_0^{\|b_i\|_2^2/2\tau d_1t_0}u^{q/2+a-1}e^{-u}\,du \quad \text{(C–24)} \\
&\ge Dc_0\left(\frac{2\tau d_1}{\|b_i\|_2^2}\right)^{q/2+a}\exp\left(-\frac{\|b_i\|_2^2}{2\tau d_1t_0}\right)\int_0^{\|b_i\|_2^2/2\tau d_1t_0}u^{q/2+a-1}\,du \\
&= Dc_0t_0^{-q/2-a}\left(\frac q2+a\right)^{-1}\exp\left(-\frac{\|b_i\|_2^2}{2\tau d_1t_0}\right) = C\exp\left(-\frac{\|b_i\|_2^2}{2\tau d_1t_0}\right),
\end{aligned}
$$
where Eq. C–22 follows from Condition (A3), Eq. C–23 follows from Condition (C1), and Eq. C–24 follows from the change of variables $u = \frac{\|b_i\|_2^2}{2\xi_i\tau d_1}$. We have thus established the lower bound in Eq. C–21 for the marginal density of $b_i$.
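The closed-form lower bound of Lemma C.4 can be verified by direct numerical integration in a simple special case. The sketch below takes $L \equiv 1$ (so $c_0 = 1$ and (C1) holds for any $t_0$), $\Sigma = I$ (so $d_1 = 1$), and the illustrative values $q = 2$, $a = 1$, $\tau = t_0 = 1$; the constant $D$ cancels from both sides:

```python
import numpy as np
from scipy.integrate import quad

q, a, tau, t0, d1, c0 = 2, 1.0, 1.0, 1.0, 1.0, 1.0

def marginal_over_D(r):
    # integral_0^inf xi^{-q/2-a-1} exp(-r/(2 xi tau)) L(xi) dxi, with L == 1
    # and r = ||b_i||_2^2 (Sigma = I, so ||b_i Sigma^{-1/2}||^2 = r).
    f = lambda xi: xi ** (-q / 2 - a - 1) * np.exp(-r / (2.0 * xi * tau))
    val, _ = quad(f, 0, np.inf)
    return val

def lower_bound_over_D(r):
    # C/D = c0 t0^{-q/2-a} (q/2 + a)^{-1} exp(-r / (2 tau d1 t0))
    return c0 * t0 ** (-q / 2 - a) / (q / 2 + a) * np.exp(-r / (2.0 * tau * d1 * t0))

for r in [0.5, 1.0, 4.0, 16.0]:
    assert marginal_over_D(r) >= lower_bound_over_D(r)
```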
C.2.2 Proofs for Theorem 4.3 and Theorem 4.4
Before we prove Theorems 4.3 and 4.4 for the MBSP model in Eq. 4–2 with hyperpriors of the form in Eq. 1–7, we first introduce some notation. Because we are operating under the assumption of sparsity, most of the rows of $B_0$ should contain only entries of zero.

Our proofs depend on partitioning $B_0$ into sets of active and inactive predictors. To this end, let $b_{0j}$ denote the $j$th row of the true coefficient matrix $B_0$ and $b_{nj}$ denote the $j$th row of $B_n$, where both $B_0$ and $B_n$ depend on $n$. We also let $\mathcal{A}_n := \{j : b_{0j} \neq 0,\ 1 \le j \le p_n\}$ denote the set of indices of the nonzero rows of $B_0$. This indicates the active predictors. Equivalently, $\mathcal{A}_n^c$ is the set of indices of the zero rows (or the inactive predictors).
Proof of Theorem 4.3. For the low-dimensional setting, let $s = |S|$ denote the size of the true model. Since (A1)-(A3) hold, it is enough to show (by Theorem 4.1) that, for sufficiently large $n$ and any $k > 0$,
$$ \pi_n\left(B_n : \|B_n - B_0\|_F < \frac{\Delta}{n^{\rho/2}}\right) > \exp(-kn), $$
where $0 < \Delta < \frac{\varepsilon^2 c_1 d_1^{1/2}}{48c_2^{1/2}d_2}$. We have
$$
\begin{aligned}
\pi_n\left(B_n : \|B_n - B_0\|_F < \frac{\Delta}{n^{\rho/2}}\right) &= \pi_n\left(B_n : \|B_n - B_0\|_F^2 < \frac{\Delta^2}{n^{\rho}}\right) \\
&= \pi_n\left(B_n : \sum_{j\in\mathcal{A}_n}\|b_{nj}-b_{0j}\|_2^2 + \sum_{j\in\mathcal{A}_n^c}\|b_{nj}\|_2^2 < \frac{\Delta^2}{n^{\rho}}\right) \\
&\ge \pi_n\left(B_n : \sum_{j\in\mathcal{A}_n}\|b_{nj}-b_{0j}\|_2^2 < \frac{s\Delta^2}{p_n n^{\rho}},\ \sum_{j\in\mathcal{A}_n^c}\|b_{nj}\|_2^2 < \frac{(p_n-s)\Delta^2}{p_n n^{\rho}}\right) \\
&\ge \left\{\prod_{j\in\mathcal{A}_n}\pi_n\left(b_{nj} : \|b_{nj}-b_{0j}\|_2^2 < \frac{\Delta^2}{p_n n^{\rho}}\right)\right\}\times\pi_n\left(\sum_{j\in\mathcal{A}_n^c}\|b_{nj}\|_2^2 < \frac{(p_n-s)\Delta^2}{p_n n^{\rho}}\right). \tag{C–25}
\end{aligned}
$$
Define the density
$$ \pi_n(b_j) \propto \exp\left(-\frac{\|b_j\|_2^2}{2\tau_n d_1 t_0}\right). \tag{C–26} $$
Since (C1) holds for the slowly varying component of Eq. 1–7, we have by the lower bound in Lemma C.4, Eq. C–25, and Eq. C–26 that it is sufficient to show that
$$ \left\{\pi_n\left(b_{nj} : \|b_{nj}-b_{0j}\|_2^2 < \frac{\Delta^2}{p_n n^{\rho}}\right)\right\}^s\times\pi_n\left(\sum_{j\in\mathcal{A}_n^c}\|b_{nj}\|_2^2 < \frac{(p_n-s)\Delta^2}{p_n n^{\rho}}\right) > \exp(-kn) \tag{C–27} $$
for sufficiently large $n$ and any $k > 0$, in order to obtain posterior consistency (again by Theorem 4.1). We consider the two terms in the product on the left-hand side of Eq. C–27 separately. Note that
$$
\pi_n\left(b_{nj} : \sum_{k=1}^q(b_{njk}-b_{0jk})^2 < \frac{\Delta^2}{p_n n^{\rho}}\right) \ge \pi_n\left(b_{nj} : \sum_{k=1}^q|b_{njk}-b_{0jk}| < \frac{\Delta}{\sqrt{p_n n^{\rho}}}\right)
\ge \prod_{k=1}^q\left\{\pi_n\left(b_{njk} : |b_{njk}-b_{0jk}| < \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right)\right\}. \tag{C–28}
$$
By Eq. C–26, $\pi_n(b_{njk}) = \frac{1}{\sqrt{2\pi\tau_n d_1 t_0}}\exp\left(-\frac{b_{njk}^2}{2\tau_n d_1 t_0}\right)$, i.e. $b_{njk} \sim N(0, \tau_n d_1 t_0)$. Therefore, we have
$$
\pi_n\left(b_{njk} : |b_{njk}-b_{0jk}| < \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right) = \frac{1}{\sqrt{2\pi\tau_n d_1 t_0}}\int_{b_{0jk}-\frac{\Delta}{q\sqrt{p_n n^{\rho}}}}^{b_{0jk}+\frac{\Delta}{q\sqrt{p_n n^{\rho}}}}\exp\left(-\frac{b_{njk}^2}{2\tau_n d_1 t_0}\right)db_{njk}
= \Pr\left(-\frac{\Delta}{q\sqrt{p_n n^{\rho}}} \le X - b_{0jk} \le \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right)
= \Pr\left(|X-b_{0jk}| \le \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right), \tag{C–29}
$$
where $X \sim N(b_{0jk}, \tau_n d_1 t_0)$. By Assumption (C2), $b_{0jk}$ is finite for every $n$. Furthermore, for any random variable $X \sim N(\mu, \sigma^2)$, we have the concentration inequality $\Pr(|X-\mu| > t) \le 2e^{-\frac{t^2}{2\sigma^2}}$ for any $t > 0$. Setting $X \sim N(b_{0jk}, \tau_n d_1 t_0)$ and $t = \frac{\Delta}{q\sqrt{p_n n^{\rho}}}$, we have
$$ \Pr\left(|X-b_{0jk}| \le \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right) = 1 - \Pr\left(|X-b_{0jk}| \ge \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right) \ge 1 - 2\exp\left(-\frac{\Delta^2}{2q^2 p_n n^{\rho}\tau_n d_1 t_0}\right). \tag{C–30} $$
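The sub-Gaussian concentration inequality invoked for Eq. C–30 can also be confirmed numerically; a short sketch:

```python
import numpy as np
from scipy.stats import norm

# Pr(|X - mu| > t) <= 2 exp(-t^2 / (2 sigma^2)) for X ~ N(mu, sigma^2), t > 0.
for sigma in [0.5, 1.0, 3.0]:
    for t in np.linspace(0.1, 10.0, 50):
        exact = 2.0 * norm.sf(t / sigma)   # exact two-sided tail probability
        assert exact <= 2.0 * np.exp(-t**2 / (2.0 * sigma**2))
```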
We now consider the second term on the left-hand side of Eq. C–27. Since $\mathbb{E}(b_{njk}^2) = \tau_n d_1 t_0$, an application of Markov's inequality gives
$$ \pi_n\left(\sum_{j\in\mathcal{A}_n^c}\sum_{k=1}^q b_{njk}^2 < \frac{(p_n-s)\Delta^2}{p_n n^{\rho}}\right) \ge 1 - \frac{p_n n^{\rho}\,\mathbb{E}\left(\sum_{j\in\mathcal{A}_n^c}\sum_{k=1}^q b_{njk}^2\right)}{(p_n-s)\Delta^2} = 1 - \frac{p_n q n^{\rho}\tau_n d_1 t_0}{\Delta^2}. \tag{C–31} $$
Combining Eq. C–27 through Eq. C–31, we obtain as a lower bound for the left-hand side of Eq. C–27,
$$ \left\{1 - 2\exp\left(-\frac{\Delta^2}{2q^2 p_n n^{\rho}\tau_n d_1 t_0}\right)\right\}^{qs}\left(1 - \frac{p_n q n^{\rho}\tau_n d_1 t_0}{\Delta^2}\right). \tag{C–32} $$
By Assumption (C3), it is clear that Eq. C–32 tends to 1 as $n \to \infty$, so this quantity is certainly greater than $e^{-kn}$ for any $k > 0$ for sufficiently large $n$. Since the lower bound in Eq. C–27 holds for all sufficiently large $n$, we have under the given conditions that the MBSP model in Eq. 4–2 achieves posterior consistency in the Frobenius norm.
Proof of Theorem 4.4. For the ultrahigh-dimensional setting, we first let $S^* \subset \{1, 2, \dots, p_n\}$ denote the indices of the nonzero rows, and denote the true size of $S^*$ as $s^* = |S^*|$. Since (B1)-(B6) hold, it is enough to show by Theorem 4.2 that, for sufficiently large $n$ and any $k > 0$,
$$ \pi_n\left(B_n : \|B_n - B_0\|_F < \frac{\Delta}{n^{\rho/2}}\right) > \exp(-kn), $$
where $0 < \Delta < \frac{\varepsilon^2 c_1 d_1^{1/2}}{48c_2^{1/2}d_2}$ and $\rho > 0$.

By Assumption (C1) for the slowly varying component in Eq. 1–7, Lemma C.4, and Theorem 4.2, it is thus sufficient to show that
$$ \left\{\pi_n\left(b_{nj} : \|b_{nj}-b_{0j}\|_2^2 < \frac{\Delta^2}{p_n n^{\rho}}\right)\right\}^{s^*}\times\pi_n\left(\sum_{j\in\mathcal{A}_n^c}\|b_{nj}\|_2^2 < \frac{(p_n-s^*)\Delta^2}{p_n n^{\rho}}\right) > \exp(-kn) \tag{C–33} $$
for sufficiently large $n$ and any $k > 0$, where the density $\pi_n$ is defined in Eq. C–26. Mimicking the proof of Theorem 4.3 and given regularity conditions (C1) and (C2), we obtain as a lower bound for the left-hand side of Eq. C–33,
$$ \left\{1 - 2\exp\left(-\frac{\Delta^2}{2q^2 p_n n^{\rho}\tau_n d_1 t_0}\right)\right\}^{qs^*}\left(1 - \frac{p_n q n^{\rho}\tau_n d_1 t_0}{\Delta^2}\right). \tag{C–34} $$
Under Assumption (C3), Eq. C–34 is clearly greater than $e^{-kn}$ for any $k > 0$ and sufficiently large $n$, since Eq. C–34 tends to 1 as $n \to \infty$. We have thus proven posterior consistency in the Frobenius norm for the ultrahigh-dimensional case as well.
APPENDIX D
GIBBS SAMPLER FOR THE MBSP-TPBN MODEL
In this Appendix, we provide the technical details of the Gibbs sampler for the MBSP-TPBN
model in Eq. 4–13 from Section 4.3.2. We also present an efficient method for sampling from
the full conditional of B. These algorithms are implemented in the R package MBSP.
D.1 Full Conditional Densities for the Gibbs Sampler
The full conditional densities are all available in closed form as follows. Letting T =
diag(ψ_1, ..., ψ_p), b_i denote the ith row of B, and GIG(a, b, p) denote a generalized inverse
Gaussian density with f(x; a, b, p) ∝ x^{p−1} e^{−(a/x + bx)/2}, we have

B | rest ∼ MN_{p×q}( (X⊤X + T^{−1})^{−1} X⊤Y, (X⊤X + T^{−1})^{−1}, Σ ),
Σ | rest ∼ IW( n + p + d, (Y − XB)⊤(Y − XB) + B⊤T^{−1}B + kI_q ),
ψ_i | rest ∼ GIG( ||b_i Σ^{−1/2}||_2^2, 2ζ_i, u − q/2 ), independently for i = 1, ..., p, (D–1)
ζ_i | rest ∼ G( a, τ + ψ_i ), independently for i = 1, ..., p.
Because the full conditional densities are available in closed form, we can implement the
MBSP-TPBN model in Eq. 4–13 straightforwardly using Gibbs sampling.
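The ψ_i updates in Eq. D–1 require draws from the three-parameter GIG density above. As an illustrative sketch (in Python rather than R; the helper name sample_gig and the parameter mapping to SciPy's geninvgauss are ours and should be verified against the SciPy documentation), the (a, b, p) parameterization can be matched to SciPy's as follows:

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(a, b, p, size, rng):
    """Draw from the GIG density f(x; a, b, p) ∝ x^(p-1) exp(-(a/x + b x)/2),
    assuming a > 0 and b > 0.

    SciPy's geninvgauss(p, b_s, scale=s) has density
    ∝ (x/s)^(p-1) exp(-b_s ((x/s) + (s/x)) / 2),
    so matching the two exponents gives b_s = sqrt(a*b) and s = sqrt(a/b)."""
    return geninvgauss.rvs(p, np.sqrt(a * b), scale=np.sqrt(a / b),
                           size=size, random_state=rng)
```

With a = ||b_i Σ^{−1/2}||_2^2, b = 2ζ_i, and p = u − q/2, one call per row would give the ψ_i draws in Eq. D–1.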
D.2 Fast Sampling of the Full Conditional Density for B
In Eq. D–1, the most computationally intensive operation is sampling from the density
π(B | rest). Much of the computational cost comes from computing the inverse
(X⊤X + T^{−1})^{−1}, which requires O(p^3) time if we use Cholesky factorization methods. In
the case where p < n, this is not a problem. However, when p ≫ n, this operation can be
prohibitively costly.
In this section, we provide an alternative algorithm for sampling from the density
MN_{p×q}( (X⊤X + T^{−1})^{−1} X⊤Y, (X⊤X + T^{−1})^{−1}, Σ ) in O(n^2 p) time. Bhattacharya
et al. (2016) originally devised an algorithm to efficiently sample from a class of structured
multivariate Gaussian distributions. Our algorithm below is a matrix-variate extension of the
algorithm given by Bhattacharya et al. (2016).
Algorithm 1
Step 1. Sample U ∼ MN_{p×q}(O, T, Σ) and M ∼ MN_{n×q}(O, I_n, Σ).
Step 2. Set V = XU + M.
Step 3. Solve for W in the system of equations (XTX⊤ + I_n)W = Y − V.
Step 4. Set Θ = U + TX⊤W.
With the above algorithm, we have the following proposition.
Proposition D.1. Suppose Θ is obtained by following Algorithm 1. Then

Θ ∼ MN_{p×q}( (X⊤X + T^{−1})^{−1} X⊤Y, (X⊤X + T^{−1})^{−1}, Σ ).
Proof. This follows from a trivial modification of Proposition 1 in Bhattacharya et al. (2016).
From Algorithm 1, it is clear that the most computationally intensive step is solving the
system of equations in Step 3. However, since T is a diagonal matrix, it follows from the
arguments in Bhattacharya et al. (2016) that computing the inverse of (XTX⊤ + I_n) can be
done in O(n^2 p) time. Once this inverse is obtained, solving the system of equations can be
done in O(n^2 q) time, and in general, q ≪ p. Algorithm 1 is thus O(n^2 p) when p > n.
Since our algorithm scales linearly with p, it provides a significant reduction in computing
time over typical methods based on Cholesky factorization when p ≫ n.
On the other hand, if p < n, then Algorithm 1 provides no time savings, so in that case we
simply use Cholesky factorization methods to sample from the full conditional density
π(B | rest) in O(p^3) time.
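As a concrete sketch of the steps above (in Python/NumPy rather than R; the function name sample_B and the choice to pass T as the vector of its diagonal entries are ours), Algorithm 1 can be written as:

```python
import numpy as np

def sample_B(X, Y, T_diag, Sigma, rng):
    """One draw from MN_{p x q}((X'X + T^{-1})^{-1} X'Y, (X'X + T^{-1})^{-1}, Sigma),
    following Algorithm 1; cost is O(n^2 p) when p > n."""
    n, p = X.shape
    q = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)                       # Sigma = L L'
    # Step 1: U ~ MN_{p x q}(O, T, Sigma), M ~ MN_{n x q}(O, I_n, Sigma)
    U = np.sqrt(T_diag)[:, None] * (rng.standard_normal((p, q)) @ L.T)
    M = rng.standard_normal((n, q)) @ L.T
    # Step 2: V = XU + M
    V = X @ U + M
    # Step 3: solve the n x n system (X T X' + I_n) W = Y - V
    W = np.linalg.solve(X @ (T_diag[:, None] * X.T) + np.eye(n), Y - V)
    # Step 4: Theta = U + T X' W
    return U + T_diag[:, None] * (X.T @ W)
```

The only O(n^2 p) work is forming XTX⊤; the subsequent n × n solve replaces the O(p^3) Cholesky factorization of X⊤X + T^{−1}.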
D.3 Convergence of the Gibbs Sampler
In order to ensure quick convergence, we need good initial guesses B^{(init)} and Σ^{(init)}
for B and Σ, respectively. We take as our initial guess for B,
B^{(init)} = (X⊤X + λI_p)^{−1}X⊤Y, where λ = δ + λ_{min+}(X), λ_{min+}(X) is the smallest
positive singular value of X, and δ = 0.01.
[Figure D-1 appears here: four trace plots, x-axis "Iteration" (0 to 10,000), y-axis −10 to 10, with panels titled "True Nonzero Coefficient" and "True Zero Coefficient".]

Figure D-1. History plots of the first 10,000 draws from the Gibbs sampler for the MBSP-TPBN model described in Section D.1 for randomly drawn coefficients b_ij in B_0 from Experiments 5 and 6 in Section 4.4.1. The top two plots are taken from Experiment 5 (n = 100, p = 500, q = 3), and the bottom two plots are taken from Experiment 6 (n = 150, p = 1000, q = 4).
This ensures that X⊤X + λI_p is positive definite. For Σ, we take as our initial guess
Σ^{(init)} = (1/n)(Y − XB^{(init)})⊤(Y − XB^{(init)}).
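These initial guesses can be sketched as follows (Python/NumPy; the function name init_guesses and the tolerance used to decide which singular values count as positive are our choices):

```python
import numpy as np

def init_guesses(X, Y, delta=0.01, tol=1e-10):
    """Ridge-type initial guesses B_init and Sigma_init for the Gibbs sampler."""
    n, p = X.shape
    sv = np.linalg.svd(X, compute_uv=False)
    lam = delta + sv[sv > tol].min()     # delta + smallest positive singular value
    # B_init = (X'X + lam I_p)^{-1} X'Y; the ridge term makes the system nonsingular
    B_init = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    resid = Y - X @ B_init
    Sigma_init = resid.T @ resid / n     # (1/n)(Y - X B_init)'(Y - X B_init)
    return B_init, Sigma_init
```

Even when p > n (so that X⊤X alone is singular), adding λI_p with λ > 0 makes the system solvable.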
Figure D-1 shows the history plots of the first 10,000 draws from the Gibbs sampler for the
MBSP-TPBN model described in Section D.1 for four randomly drawn coefficients b_ij in B
from Experiments 5 and 6 in Section 4.4.1. The top two plots correspond to a true nonzero
coefficient (b_{0ij} = −3.8103) and a true zero coefficient (b_{0ij} = 0) from Experiment 5 in
Section 4.4.1 (n = 100, p = 500, q = 3). The bottom two plots correspond to a true nonzero
coefficient (b_{0ij} = 3.1436) and a true zero coefficient (b_{0ij} = 0) from Experiment 6 in
Section 4.4.1 (n = 150, p = 1000, q = 4).
We consider two different Markov chains with different starting values for B^{(init)}: 1) the
ridge estimator described above, and 2) the regularized MLASSO estimator described in
Section 4.4.1. We see from the plots in Figure D-1 that although the two chains start from
different initial values of b_ij^{(init)}, they mix well and appear to converge rapidly to a
stationary distribution that captures the true coefficients b_{0ij} with high probability.
REFERENCES

Abramovich, F., Grinshtein, V., & Pensky, M. (2007). On optimality of Bayesian testimation in the normal means problem. Annals of Statistics, 35(5), 2261–2286.

Armagan, A., Clyde, M., & Dunson, D. B. (2011). Generalized beta mixtures of Gaussians. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, & K. Weinberger (Eds.) Advances in Neural Information Processing Systems 24, (pp. 523–531).

Armagan, A., Dunson, D., Lee, J., Bajwa, W., & Strawn, N. (2013a). Posterior consistency in linear models under shrinkage priors. Biometrika, 100(4), 1011–1018.

Armagan, A., Dunson, D. B., & Lee, J. (2013b). Generalized double Pareto shrinkage. Statistica Sinica, 23(1), 119–143.

Belloni, A., Chernozhukov, V., & Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4), 791–806.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 57(1), 289–300.

Berger, J. (1980). A robust generalized Bayes estimator and confidence region for a multivariate normal mean. Annals of Statistics, 8(4), 716–761.

Bhadra, A., Datta, J., Polson, N. G., & Willard, B. (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis, 12(4), 1105–1131.

Bhadra, A., & Mallick, B. K. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2), 447–457.

Bhattacharya, A., Chakraborty, A., & Mallick, B. K. (2016). Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4), 985–991.

Bhattacharya, A., Pati, D., Pillai, N. S., & Dunson, D. B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512), 1479–1490.

Bingham, N., Goldie, C., & Teugels, J. (1987). Regular Variation (Encyclopedia of Mathematics and its Applications). Cambridge University Press.

Bogdan, M., Chakrabarti, A., Frommlet, F., & Ghosh, J. K. (2011). Asymptotic Bayes-optimality under sparsity of some multiple testing procedures. Annals of Statistics, 39(3), 1551–1579.

Brown, P. J., Vannucci, M., & Fearn, T. (1998). Multivariate Bayesian variable selection and prediction. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(3), 627–641.

Bunea, F., She, Y., & Wegkamp, M. H. (2012). Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. Annals of Statistics, 40(5), 2359–2388.

Camponovo, L. (2015). On the validity of the pairs bootstrap for lasso estimators. Biometrika, 102(4), 981–987.

Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313–2351.

Carvalho, C. M., Polson, N. G., & Scott, J. G. (2009). Handling sparsity via the horseshoe. In D. van Dyk, & M. Welling (Eds.) Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, vol. 5 of Proceedings of Machine Learning Research, (pp. 73–80). Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA: PMLR.

Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465–480.

Castillo, I., Schmidt-Hieber, J., & van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Annals of Statistics, 43(5), 1986–2018.

Castillo, I., & van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. Annals of Statistics, 40(4), 2069–2101.

Chen, L., & Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association, 107(500), 1533–1545.

Chun, H., & Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1), 3–25.

Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36, 453–471.

Dasgupta, S. (2016). High-dimensional posterior consistency of the Bayesian lasso. Communications in Statistics - Theory and Methods, 45(22), 6700–6708.

Datta, J., & Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1), 111–132.

Donoho, D. L., Johnstone, I. M., Hoch, J. C., & Stern, A. S. (1992). Maximum entropy and the nearly black object. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 54(1), 41–81.

Efron, B. (2010). The future of indirect evidence. Statistical Science, 25(2), 145–157.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911.

Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics, 38(6), 3567–3604.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.

George, E. I., & Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika, 87(4), 731–747.

George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881–889.

Ghosal, S., Ghosh, J. K., & van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics, 28(2), 500–531.

Ghosh, P., & Chakrabarti, A. (2017). Asymptotic optimality of one-group shrinkage priors in sparse high-dimensional problems. Bayesian Analysis, 12(4), 1133–1161.

Ghosh, P., Tang, X., Ghosh, M., & Chakrabarti, A. (2016). Asymptotic properties of Bayes risk of a general class of shrinkage priors in multiple hypothesis testing under sparsity. Bayesian Analysis, 11(3), 753–796.

Goh, G., Dey, D. K., & Chen, K. (2017). Bayesian sparse reduced rank multivariate regression. Journal of Multivariate Analysis, 157, 14–28.

Griffin, J. E., & Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1), 171–188.

Griffin, J. E., & Brown, P. J. (2013). Some priors for sparse regression modelling. Bayesian Analysis, 8(3), 691–702.

Ishwaran, H., & Rao, J. S. (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. Annals of Statistics, 33(2), 730–773.

Ishwaran, H., & Rao, J. S. (2011). Consistency of spike and slab regression. Statistics & Probability Letters, 81(12), 1920–1928.

Johnstone, I. M., & Silverman, B. W. (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. Annals of Statistics, 32(4), 1594–1649.

Johnstone, I. M., & Silverman, B. W. (2005). Empirical Bayes selection of wavelet thresholds. Annals of Statistics, 33(4), 1700–1752.

Li, Y., Nan, B., & Zhu, J. (2015). Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics, 71(2), 354–363.

Liquet, B., Bottolo, L., Campanella, G., Richardson, S., & Chadeau-Hyam, M. (2016). R2GUESS: A graphics processing unit-based R package for Bayesian variable selection regression of multivariate responses. Journal of Statistical Software, 69(1), 1–32.

Liquet, B., Mengersen, K., Pettitt, A. N., & Sutton, M. (2017). Bayesian variable selection regression of multivariate responses for group data. Bayesian Analysis, 12(4), 1039–1067.

Mitchell, T., & Beauchamp, J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404), 1023–1032.

Narisetty, N. N., & He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. Annals of Statistics, 42(2), 789–817.

Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681–686.

Polson, N. G., & Scott, J. G. (2012). On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4), 887–902.

Ročková, V. (2018). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. Annals of Statistics, 46(1), 401–437.

Ročková, V., & George, E. I. (2016). The spike-and-slab lasso. Journal of the American Statistical Association. To appear.

Rothman, A. J., Levina, E., & Zhu, J. (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4), 947–962.

Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D'Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W., Golub, T. R., & Sellers, W. R. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 203–209.

Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. Annals of Mathematical Statistics, 42(1), 385–388.

Sun, T., & Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika, 99(4), 879–898.

Tang, X., Xu, X., Ghosh, M., & Ghosh, P. (2017). Bayesian variable selection and estimation based on global-local shrinkage priors. Sankhya A. Retrieved from https://doi.org/10.1007/s13171-017-0118-2.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

van der Pas, S., Kleijn, B., & van der Vaart, A. (2014). The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics, 8(2), 2585–2618.

van der Pas, S., Salomond, J.-B., & Schmidt-Hieber, J. (2016). Conditions for posterior contraction in the sparse normal means problem. Electronic Journal of Statistics, 10(1), 976–1000.

van der Pas, S., Szabó, B., & van der Vaart, A. (2017a). Adaptive posterior contraction rates for the horseshoe. Electronic Journal of Statistics, 11(2), 3196–3225.

van der Pas, S., Szabó, B., & van der Vaart, A. (2017b). Uncertainty quantification for the horseshoe (with discussion). Bayesian Analysis, 12(4), 1221–1274.

Wellcome Trust (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447, 661–678.

Wilms, I., & Croux, C. (2017). An algorithm for the multivariate group lasso with covariance estimation. Journal of Applied Statistics, 0(0), 1–14.

Xu, X., & Ghosh, M. (2015). Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4), 909–936.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.

Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g prior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Studies in Bayesian Econometrics, (pp. 233–243). Eds. P. K. Goel and A. Zellner. Amsterdam: North-Holland/Elsevier.

Zhang, Y., & Bondell, H. D. (2017). Variable selection via penalized credible regions with Dirichlet–Laplace global-local shrinkage priors. Advance publication. Retrieved from https://projecteuclid.org/euclid.ba/1508551721.

Zhu, H., Khondker, Z., Lu, Z., & Ibrahim, J. G. (2014). Bayesian generalized low rank regression models for neuroimaging phenotypes and genetic markers. Journal of the American Statistical Association, 109(507), 977–990.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
BIOGRAPHICAL SKETCH
Ray Bai graduated from Cornell University in May 2007 with bachelor’s degrees in
government and economics. He then worked as a financial analyst at State Street Bank &
Trust from May 2007 to August 2010. From September 2010 to May 2012, he attended
graduate school at the University of Massachusetts Amherst where he earned his master’s in
applied mathematics. He then worked as a systems engineer at General Dynamics Mission
Systems from June 2012 to July 2014. He joined the Department of Statistics at the University
of Florida in August 2014 where he earned his master’s in statistics in August 2016 and his
doctorate in August 2018.