BAYESIAN HIGH-DIMENSIONAL MODELS WITH SCALE-MIXTURE SHRINKAGE PRIORS
By
RAY BAI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2018
© 2018 Ray Bai
Dedicated to Mom, Dad, and Will
ACKNOWLEDGMENTS
I owe my deepest gratitude to my PhD advisor Dr. Malay Ghosh. It has truly been an
honor to work under his supervision. In addition to being an outstanding scholar and mentor,
he is also an excellent teacher, and I was fortunate to take four courses with him during my
PhD studies. I also thank Dr. Kshitij Khare, Dr. Nikolay Bliznyuk, and Dr. Arunava Banerjee
for serving on my PhD committee and for providing valuable comments on my dissertation.
I have had the pleasure to take courses with all of my committee members, and I learned a
lot of things from them. I am especially indebted to Dr. Bliznyuk for serving as my Master of
Statistics’ advisor and for involving me in several of his research projects early on.
I also owe a great deal of gratitude to the faculty and staff at the Department of Statistics
at the University of Florida. In addition to the members of my PhD committee, I am also
grateful to Dr. Sophia Su, Dr. Andrew Rosalsky, and Dr. Hani Doss for helping me to develop
a firm foundation in advanced linear algebra, mathematical analysis, probability theory, and
mathematical statistics. Thank you to Tina Greenly, Christine Miron, and Bill Campbell for
their help with administrative tasks. Thank you to Dr. Jim Hobert and Maria Ripol for offering
valuable guidance for navigating graduate school and for their advice on teaching.
I am thankful to my cohort and friends in graduate school: Syed Rahman, Peyman Jalali,
Andrey Skripnikov, Hunter Merrill, Isaac Duerr, Mingyuan Gao, Qian Qin, Tamal Ghosh, Tuo
Chen, Minji Lee, Zeren Xing, Ethan Alt, Grant Backlund, Saptarshi Chakraborty, Satyajit
Ghosh, and Xueying Tang. Thank you for all of the fun times and for the valuable discussions
and help when I needed it.
Finally, I want to thank my boyfriend Will Haslam and my parents for always believing
in me. I would not have survived graduate school without their love, encouragement, and
support. I love you guys very much.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 LITERATURE REVIEW
  1.1 The Sparse Normal Means Problem
  1.2 Bayesian Methods for Sparse Estimation
    1.2.1 Spike-and-Slab
    1.2.2 Scale-Mixture Shrinkage Priors
  1.3 Minimax Estimation and Posterior Contraction
    1.3.1 Sparse Normal Vectors in the Nearly Black Sense
    1.3.2 Theoretical Results for Spike-and-Slab Priors
    1.3.3 Theoretical Results for Scale-Mixture Shrinkage Priors
  1.4 Signal Detection Through Multiple Hypothesis Testing
    1.4.1 Asymptotic Bayes Optimality Under Sparsity
    1.4.2 ABOS of Thresholding Rules Based on Scale-Mixture Shrinkage Priors
  1.5 Sparse Univariate Linear Regression
    1.5.1 Frequentist Approaches
    1.5.2 Bayesian Approaches
      1.5.2.1 Spike-and-slab
      1.5.2.2 Continuous Shrinkage Priors
    1.5.3 Posterior Consistency for Univariate Linear Regression
  1.6 Sparse Multivariate Linear Regression
    1.6.1 Frequentist Approaches
    1.6.2 Bayesian Approaches
    1.6.3 Reduced Rank Regression

2 THE INVERSE GAMMA-GAMMA PRIOR FOR SPARSE ESTIMATION
  2.1 The Inverse Gamma-Gamma (IGG) Prior
  2.2 Concentration Properties of the IGG Prior
    2.2.1 Notation
    2.2.2 Concentration Inequalities for the Shrinkage Factor
  2.3 Posterior Behavior Under the IGG Prior
    2.3.1 Minimax Posterior Contraction Under the IGG Prior
    2.3.2 Kullback-Leibler Risk Bounds
  2.4 Simulation Study
    2.4.1 Computation and Selection of Hyperparameters
    2.4.2 Simulation Study for Sparse Estimation
  2.5 Analysis of a Prostate Cancer Data Set
  2.6 Concluding Remarks

3 MULTIPLE HYPOTHESIS TESTING WITH THE INVERSE GAMMA-GAMMA PRIOR
  3.1 Classification Using the Inverse Gamma-Gamma Prior
    3.1.1 Notation
    3.1.2 Thresholding the Posterior Shrinkage Weight
  3.2 Asymptotic Optimality of the IGG Classification Rule
  3.3 Simulation Study
  3.4 Analysis of a Prostate Cancer Data Set
  3.5 Concluding Remarks

4 HIGH-DIMENSIONAL MULTIVARIATE POSTERIOR CONSISTENCY UNDER GLOBAL-LOCAL SHRINKAGE PRIORS
  4.1 Multivariate Bayesian Model with Shrinkage Priors (MBSP)
    4.1.1 Preliminary Notation and Definitions
    4.1.2 MBSP Model
    4.1.3 Handling Sparsity
  4.2 Posterior Consistency of MBSP
    4.2.1 Notation
    4.2.2 Definition of Posterior Consistency
    4.2.3 Sufficient Conditions for Posterior Consistency
      4.2.3.1 Low-Dimensional Case
      4.2.3.2 Ultrahigh Dimensional Case
    4.2.4 Sufficient Conditions for Posterior Consistency of MBSP
  4.3 Implementation of the MBSP Model
    4.3.1 TPBN Family
    4.3.2 The MBSP-TPBN Model
      4.3.2.1 Computational Details
      4.3.2.2 Specification of Hyperparameters τ, d, and k
    4.3.3 Variable Selection
  4.4 Simulations and Data Analysis
    4.4.1 Simulation Studies
    4.4.2 Yeast Cell Cycle Data Analysis
  4.5 Concluding Remarks

5 SUMMARY AND FUTURE WORK
  5.1 Summary
  5.2 Future Work
    5.2.1 Extensions of the Inverse Gamma-Gamma Prior
    5.2.2 Extensions to Bayesian Multivariate Linear Regression with Shrinkage Priors

APPENDIX

A PROOFS FOR CHAPTER 2
  A.1 Proofs for Section 2.1
  A.2 Proofs for Section 2.3.1
  A.3 Proofs for Section 2.3.2

B PROOFS FOR CHAPTER 3

C PROOFS FOR CHAPTER 4
  C.1 Proofs for Section 4.2.3
    C.1.1 Proof of Theorem 4.1
    C.1.2 Proof of Theorem 4.2
  C.2 Proofs for Section 4.2.4
    C.2.1 Preliminary Lemmas
    C.2.2 Proofs for Theorem 4.3 and Theorem 4.4

D GIBBS SAMPLER FOR THE MBSP-TPBN MODEL
  D.1 Full Conditional Densities for the Gibbs Sampler
  D.2 Fast Sampling of the Full Conditional Density for B
  D.3 Convergence of the Gibbs Sampler

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

1-1 Polynomial-tailed priors, their respective prior densities for π(ξi) up to normalizing
constant C, and the slowly-varying component L(ξi).

2-1 Comparison of average squared error loss for the posterior median estimate of θ
across 100 replications. Results are reported for the IGG1/n, DL (Dirichlet-Laplace),
HS (horseshoe), and the HS+ (horseshoe-plus).

2-2 The z-scores and the effect size estimates for the top 10 genes selected by Efron
(2010), under the IGG, DL, HS, and HS+ models and the two-groups empirical Bayes
model of Efron (2010).

3-1 Comparison of false discovery rate (FDR) for different classification methods under
dense settings. The IGG1/n has the lowest FDR of all the different methods.

4-1 Simulation results for MBSP-TPBN, compared with MBGL-SS, MLASSO, SRRR,
and SPLS, averaged across 100 replications.

4-2 Results for analysis of the yeast cell cycle data set. The MSPE has been scaled by
a factor of 100. In particular, all five models selected the three TFs, ACE2, SWI5,
and SWI6 as significant.
LIST OF FIGURES

2-1 Marginal density of the IGG prior in Eq. 2–5 with hyperparameters a = 0.6, b = 0.4,
in comparison to other shrinkage priors. The DL1/2 prior is the marginal density
for the Dirichlet-Laplace density with D(1/2, ..., 1/2) specified as a prior in the
Bayesian hierarchy.

3-1 Comparison between the posterior inclusion probabilities and the posterior shrinkage
weights 1 − E(κi | Xi) when p = 0.10.

3-2 Estimated misclassification probabilities. The thresholding rule in Eq. 3–1 based on
the IGG posterior mean is nearly as good as the Bayes Oracle rule in Eq. 1–16.

3-3 Posterior mean E(θ | X) vs. X plot for p = 0.25.

4-1 Plots of the estimates and 95% credible bands for four of the 10 TFs that were
deemed significant by the MBSP-TPBN model. The x-axis indicates time (minutes)
and the y-axis indicates the estimated coefficients.

D-1 History plots of the first 10,000 draws from the Gibbs sampler for the MBSP-TPBN
model described in Section D.1, for randomly drawn coefficients bij in B0 from
Experiments 5 and 6 in Section 4.4.1. The top two plots are taken from Experiment 5
(n = 100, p = 500, q = 3), and the bottom two plots are taken from Experiment 6
(n = 150, p = 1000, q = 4).
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
BAYESIAN HIGH-DIMENSIONAL MODELS WITH SCALE-MIXTURE SHRINKAGE PRIORS
By
Ray Bai
August 2018
Chair: Malay Ghosh
Major: Statistics
High-dimensional data is ubiquitous in many modern applications as diverse as medicine,
machine learning, electronic health records, engineering, and finance. As technological
advances have produced larger and more complex data sets, scientists have been faced
with greater challenges. In this dissertation, we address three major challenges in modern
high-dimensional statistics: 1) estimation of sparse noisy vectors, 2) signal detection, and 3)
multivariate linear regression where the number of covariates is larger than the sample size.
To tackle these problems, we work within the Bayesian framework, using continuous shrinkage
priors which can be expressed as scale mixtures of normal densities.
We first review the literature on the methodological and theoretical developments for
these three problems. We then introduce a new fully Bayesian scale-mixture shrinkage prior
known as the inverse gamma-gamma (IGG) prior to handle both the tasks of sparse estimation
of noisy vectors and signal detection. We show that the IGG’s posterior distribution contracts
around the true mean vector at the (near) minimax rate and that the IGG posterior concentrates
at a faster rate than other popular Bayes estimators in the Kullback-Leibler (K-L) sense. To
detect signals, we also propose a hypothesis test based on thresholding the posterior mean
under the IGG prior. Taking the loss function to be the expected number of misclassified tests, our
test procedure is shown to be asymptotically Bayes optimal under sparsity.
Finally, we consider sparse Bayesian estimation in the classical multivariate linear
regression model. We propose a new method, known as the Multivariate Bayesian model with
Shrinkage Priors (MBSP), to estimate the unknown coefficient matrix. We also develop
new theory for posterior consistency under the Bayesian multivariate regression framework,
including the ultrahigh-dimensional setting where the number of covariates grows at nearly
exponential rate with the sample size. We prove that MBSP achieves strong posterior
consistency in both low-dimensional and ultrahigh-dimensional scenarios.
CHAPTER 1
LITERATURE REVIEW
1.1 The Sparse Normal Means Problem
Suppose we observe a noisy n-component random observation (X1, ...,Xn) ∈ Rn, such
that
Xi = θi + ϵi , i = 1, ..., n, (1–1)
where ϵi ∼ N (0, 1), i = 1, ..., n. In the high-dimensional setting where n is very large, sparsity
is a very common phenomenon. That is, in the unknown mean vector θ = (θ1, ..., θn), only
a few of the θi ’s are nonzero. Under the model in Eq. 1–1, we are primarily interested in
separating the signals (θi ≠ 0) from the noise (θi = 0) and giving robust estimates of the
signals.
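The model in Eq. 1–1 is straightforward to simulate. The following Python sketch generates a nearly black mean vector and a noisy observation of it; the dimensions and the signal strength are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the text): n observations,
# of which only q_n carry a true signal.
n, q_n = 1000, 10

# Nearly black mean vector theta: q_n nonzero entries, the rest exactly zero.
theta = np.zeros(n)
theta[:q_n] = 5.0  # signal strength chosen arbitrarily for the sketch

# Observation model of Eq. 1-1: X_i = theta_i + eps_i with eps_i ~ N(0, 1).
X = theta + rng.standard_normal(n)

# Only the first q_n coordinates of X reflect signal; the rest are pure noise.
print(np.count_nonzero(theta), X.shape)  # 10 (1000,)
```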
This simple model is the basis for a number of high-dimensional problems, such as image
reconstruction, genetics, and wavelet analysis (Johnstone & Silverman (2004)). For example, if
we wish to reconstruct an image from millions of pixels of data, only a few pixels are typically
needed to recover the objects of interest. In genetics, we may have tens of thousands of
gene expression data points, but only a few are significantly associated with the phenotype
of interest. For instance, the Wellcome Trust (2007) study confirmed that only seven genes have a
non-negligible association with Type I diabetes. These applications demonstrate that sparsity is
a fairly reasonable assumption for θ in Eq. 1–1.
Existing frequentist methods for obtaining a sparse estimate of θ in Eq. 1–1 include the
popular LASSO (Tibshirani (1996)) and its many variants (see, for example, Zou & Hastie
(2005), Zou (2006), Yuan & Lin (2006), Belloni et al. (2011), and Sun & Zhang (2012)).
All of these methods use either an ℓ1 or a combination of an ℓ1 and ℓ2 penalty function to
shrink many of the θi ’s to zero. These methods are able to produce point estimates with good
theoretical and empirical properties. However, in many cases, it is desirable to obtain not only
a point estimate but a realistic characterization of uncertainty in the parameter estimates.
In high-dimensional settings, frequentist approaches to characterizing uncertainty, such as
bootstrapping or constructing confidence regions, can break down (Bhattacharya et al. (2015)).
For example, Camponovo (2015) recently showed that the bootstrap does not provide a valid
approximation of the distribution of the LASSO estimator. Bayesian approaches to estimating
θ, on the other hand, give a natural way to quantify uncertainty through the posterior density.
As we illustrate later, Bayesian point estimates such as the median or mean also have desirable
frequentist properties.
1.2 Bayesian Methods for Sparse Estimation
While frequentist methods for estimating θ in Eq. 1–1 induce sparsity through a penalty
function on the entries θi , i = 1, ..., n, Bayesian approaches obtain sparse estimates by placing
a carefully constructed prior on θ. Spike-and-slab priors and scale-mixture priors are two of the
most commonly used priors for sparse normal means estimation.
1.2.1 Spike-and-Slab
Spike-and-slab priors are a particularly appealing way to model sparsity. The original
spike-and-slab model, introduced by Mitchell & Beauchamp (1988), was of the form,
π(θi) = (1− p)δ0(θi) + pψ(θi |λ), i = 1, ..., n, (1–2)
where δ0(·) is the “spike” distribution (a point mass at zero), ψ(·|λ) is an absolutely
continuous “slab” distribution, indexed by a hyper-parameter λ, and p is a mixing proportion.
The “spike” forces some coefficients to zero, while the “slab” models the signals. Typically,
λ > 0 is chosen to be large so that the “slab” is very diffuse and signals can be identified with
high probability.
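As a concrete illustration, draws from the prior in Eq. 1–2 can be simulated as below. Taking the slab ψ(·|λ) to be a N(0, λ²) density, and the particular values of p and λ, are assumptions made purely for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hyperparameters for the sketch: mixing weight p and slab scale lam.
p, lam, n = 0.1, 10.0, 100_000

# Eq. 1-2: with probability 1 - p, theta_i = 0 (the point-mass "spike");
# with probability p, theta_i is drawn from the "slab", here taken to be
# a diffuse N(0, lam^2) density for concreteness.
is_signal = rng.random(n) < p
theta = np.where(is_signal, rng.normal(0.0, lam, n), 0.0)

# Roughly a fraction p of the draws is nonzero; the rest are exactly zero.
print(np.count_nonzero(theta) / n)  # approximately 0.1
```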
The choice of (p,λ) is crucial for good performance of spike-and-slab models in Eq. 1–2.
Johnstone & Silverman (2004) utilized an empirical Bayes variant of Eq. 1–2. They used a
restricted marginal maximum likelihood estimate of p and a sufficiently heavy-tailed density
for ψ(·|λ) (i.e. tails at least as heavy as the Laplace distribution). They also considered a fully
Bayesian variant where a suitable beta prior was placed on p.
Despite their interpretability, these point-mass mixtures face computational difficulties
in high dimensions. Because the point-mass mixture in Eq. 1–2 is discontinuous, this model
requires searching over 2n possible models. To circumvent this problem, fully continuous
variants of spike-and-slab densities have been developed. These continuous spike-and-slab
models are of the form,
π(θi) = (1− p)ψ(θi |λ1) + pψ(θi |λ2), i = 1, ..., n, (1–3)
where ψ(·|λi), i = 1, 2 represents a symmetric unimodal density centered at zero, ψ(θi |λ1)
models the spike, and ψ(θi |λ2) models the slab. Typically, λ1 and λ2 are chosen so that their
respective densities have a small and a large variance, respectively. George & McCulloch
(1993) proposed the stochastic search variable selection (SSVS) method, which places a
mixture prior of two normal densities with different variances (one small and one large) on each
of the θi ’s. More recently, Ročková & George (2016) introduced the spike-and-slab LASSO
(SSL), which is a mixture of two Laplace densities with different variances (one small and one
large).
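A minimal sketch of an SSVS-style continuous spike-and-slab density of the form in Eq. 1–3, where ψ(·|λ1) and ψ(·|λ2) are normal densities; the spike and slab standard deviations below are illustrative choices, not prescribed values.

```python
import math

# Hypothetical SSVS-style settings for Eq. 1-3 (not values from the text):
# a narrow "spike" normal and a wide "slab" normal.
p, sd_spike, sd_slab = 0.1, 0.05, 5.0

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def ssvs_prior(theta):
    """Density of (1 - p) N(0, sd_spike^2) + p N(0, sd_slab^2) at theta."""
    return (1 - p) * normal_pdf(theta, sd_spike) + p * normal_pdf(theta, sd_slab)

# A sharp peak at the origin pulls small coefficients toward zero, while the
# slab keeps the tails from vanishing, so large coefficients are not overshrunk.
print(ssvs_prior(0.0) > ssvs_prior(3.0))  # True
print(ssvs_prior(3.0) > 0.0)              # True
```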
1.2.2 Scale-Mixture Shrinkage Priors
Because of the computational difficulties of (point-mass) spike-and-slab priors, a rich
variety of continuous shrinkage priors which can be expressed as scale mixtures of normal
densities has also been developed. These priors behave similarly to spike-and-slab priors but
require significantly less computational effort. They mimic the model in Eq. 1–2 in that
they contain significant probability around zero so that most coefficients are shrunk to zero.
However, they retain heavy enough tails in order to correctly identify and prevent overshrinkage
of the true signals. These priors typically take the form
θi | σi² ∼ N(0, σi²), σi² ∼ π(σi²), i = 1, ..., n, (1–4)
where π : [0,∞) → [0,∞) is a density on the positive reals. π may depend on further
hyperparameters, on which additional priors may or may not be placed. Priors on
σi² in Eq. 1–4 may be either independent for each i = 1, ..., n, in which case the θi coefficients
are independent a posteriori, or they may contain hyperpriors on shared hyperparameters,
in which case, θi ’s are a posteriori dependent. We refer to priors of the form in Eq. 1–4 as
scale-mixture shrinkage priors.
Global-local (GL) shrinkage priors comprise a wide class of scale-mixture shrinkage priors.
GL priors take the form
θi |τ , ξi ∼ N (0, ξiτ), ξi ∼ f , τ ∼ g, (1–5)
where τ is a global shrinkage parameter that shrinks all θi ’s to the origin, while the local
scale parameters ξi ’s control the degree of individual shrinkage. If g puts sufficient mass near
zero and f is an appropriately chosen heavy-tailed density, then GL priors can approximate
the model in Eq. 1–2 through a continuous density concentrated near zero with heavy tails.
Examples of GL shrinkage priors include the horseshoe prior (Carvalho et al. (2010)) and the
Bayesian lasso (Park & Casella (2008)). The horseshoe in particular is very popular and has the
hierarchical form
θi | τ, ξi ∼ N(0, ξiτ), √ξi ∼ C+(0, 1), √τ ∼ C+(0, 1), (1–6)
where C+(0, 1) denotes a half-Cauchy density with scale 1. Global-local shrinkage priors have
also been considered by numerous authors, including Strawderman (1971), Berger (1980),
Armagan et al. (2011), Polson & Scott (2012), Armagan et al. (2013b), Griffin & Brown
(2013), and Bhadra et al. (2017). Armagan et al. (2011) noted that a number of these priors
utilize a beta prime density as the prior for π(ξi) and referred to this general class of shrinkage
priors as the “three parameter beta normal” (TPBN) mixture family. The TPBN family in
particular includes the horseshoe, the Strawderman-Berger (Strawderman (1971) and Berger
(1980)), and the normal-exponential-gamma (NEG; Griffin & Brown (2013)) priors. Polson &
Scott (2012) also generalized the beta prime density to the family of hypergeometric inverted
beta (HIB) priors. Finally, Armagan et al. (2013b) introduced another general class of priors
called the generalized double Pareto (GDP) family.
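Sampling from the horseshoe hierarchy in Eq. 1–6 requires nothing more than half-Cauchy and normal draws, as in the following sketch; the number of coordinates is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000  # number of coordinates, chosen for illustration

# Horseshoe hierarchy of Eq. 1-6: sqrt(xi_i) ~ C+(0, 1) and sqrt(tau) ~ C+(0, 1),
# i.e. absolute values of standard Cauchy draws.
sqrt_tau = abs(rng.standard_cauchy())        # one global scale shared by all i
sqrt_xi = np.abs(rng.standard_cauchy(n))     # local scales, one per coordinate
theta = rng.normal(0.0, sqrt_xi * sqrt_tau)  # theta_i | tau, xi_i ~ N(0, xi_i * tau)

# The draws show the two hallmarks of the prior: a dense spike of near-zero
# values and occasional very large values from the heavy tails.
print(np.median(np.abs(theta)), np.abs(theta).max())
```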
Table 1-1. Polynomial-tailed priors, their respective prior densities for π(ξi) up to normalizing
constant C, and the slowly-varying component L(ξi).

Prior         π(ξi)/C                                          L(ξi)
Student's t   ξi^(−a−1) exp(−a/ξi)                             exp(−a/ξi)
Horseshoe     ξi^(−1/2) (1 + ξi)^(−1)                          ξi^(a+1/2)/(1 + ξi)
Horseshoe+    ξi^(−1/2) (ξi − 1)^(−1) log(ξi)                  ξi^(a+1/2) (ξi − 1)^(−1) log(ξi)
NEG           (1 + ξi)^(−1−a)                                  {ξi/(1 + ξi)}^(a+1)
TPBN          ξi^(u−1) (1 + ξi)^(−a−u)                         {ξi/(1 + ξi)}^(a+u)
GDP           ∫₀^∞ (λ²/2) exp(−λ²ξi/2) λ^(2a−1) exp(−ηλ) dλ    ∫₀^∞ t^a exp(−t − η√(2t/ξi)) dt
HIB           ξi^(u−1) (1 + ξi)^(−(a+u)) exp{−s/(1 + ξi)}      {ξi/(1 + ξi)}^(a+u) exp{−s/(1 + ξi)}
              × {ϕ² + (1 − ϕ²)/(1 + ξi)}^(−1)                  × {ϕ² + (1 − ϕ²)/(1 + ξi)}^(−1)
Ghosh et al. (2016) observed that for a large number of GL priors of the form in Eq. 1–5,
the local parameter ξi has a hyperprior distribution π(ξi) that can be written as
π(ξi) = K ξi^(−a−1) L(ξi), (1–7)

where K > 0 is the constant of proportionality, a is a positive real number, and L is a positive
measurable, non-constant, slowly varying function over (0,∞).
Definition 1.1. A positive measurable function L defined over (A,∞), for some A ≥ 0, is said
to be slowly varying (in Karamata’s sense) if for every fixed α > 0, limx→∞ L(αx)/L(x) = 1.
A thorough treatment of functions of this type can be found in the classical text by
Bingham et al. (1987). Table 1-1 provides a list of several well-known global-local shrinkage
priors that fall in the class of priors of the form given in Eq. 1–5, the corresponding density
π(ξi) for ξi , and the slowly-varying component L(ξi) in Eq. 1–7. Following Tang et al. (2017),
we refer to these scale-mixture priors as polynomial-tailed priors.
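Definition 1.1 is easy to verify numerically for the entries of Table 1-1. The sketch below checks the horseshoe's slowly varying component, which for a = 1/2 reduces to L(ξ) = ξ/(1 + ξ).

```python
# Numerical check of Definition 1.1 for the horseshoe's slowly varying
# component in Table 1-1: with a = 1/2, L(xi) = xi^(a+1/2)/(1 + xi) = xi/(1 + xi).

def L_horseshoe(xi):
    return xi / (1.0 + xi)

alpha = 7.0  # any fixed alpha > 0 works in Definition 1.1
for x in (1e2, 1e4, 1e6):
    print(x, L_horseshoe(alpha * x) / L_horseshoe(x))  # ratios approach 1
```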
1.3 Minimax Estimation and Posterior Contraction
In this section, we review the theory for Bayesian estimation of θ in the sparse normal
means model in Eq. 1–1. In particular, we are interested in studying the frequentist properties
of Bayesian estimates of θ.
1.3.1 Sparse Normal Vectors in the Nearly Black Sense
Suppose that we observe X = (X1, ...,Xn) ∈ Rn from Eq. 1–1. Let ℓ0[qn] denote the
subset of Rn given by
ℓ0[qn] = {θ ∈ Rn : #{1 ≤ j ≤ n : θj ≠ 0} ≤ qn}. (1–8)
If θ ∈ ℓ0[qn] with qn = o(n) as n → ∞, we say that θ is sparse in the “nearly black sense.”
Let θ0 = (θ01, ..., θ0n) be the true mean vector. In their seminal work, Donoho et al. (1992)
showed that for any estimator θ̂ of θ, the corresponding minimax risk with respect
to the ℓ2 norm is given by

infθ̂ supθ0∈ℓ0[qn] Eθ0 ||θ̂ − θ0||₂² = 2qn log(n/qn)(1 + o(1)), as n → ∞. (1–9)
In Eq. 1–9, Eθ0 denotes expectation with respect to the Nn(θ0, In) distribution. Equation 1–9
effectively states that in the presence of sparsity, a minimax-optimal estimator only loses a
logarithmic factor (in the ambient dimension) as a penalty for not knowing the true locations
of the zeroes. Moreover, Eq. 1–9 implies that we only need a number of replicates on the order
of the true sparsity level qn to consistently estimate θ0.
In order for the performance of Bayesian estimators to be compared with frequentist
ones, we say that a Bayesian point estimator θ̂B attains the minimax risk (up to a multiplicative
constant) if

supθ0∈ℓ0[qn] Eθ0 ||θ̂B − θ0||₂² ≍ qn log(n/qn). (1–10)
Examples of potential choices for θ̂B include the posterior median or the posterior mean (as in
Johnstone & Silverman (2004)), or the posterior mode (as in Ročková (2018)). Equation 1–10
pertains only to a particular point estimate. For a fully Bayesian interpretation, we say that the
posterior distribution contracts around the true θ0 at a rate at least as fast as the minimax ℓ2
risk if

supθ0∈ℓ0[qn] Eθ0 Π(θ : ||θ − θ0||₂² > Mn qn log(n/qn) | X) → 0, (1–11)
for every Mn → ∞ as n → ∞. On the other hand, in another seminal paper, Ghosal et al.
(2000) showed that the posterior distribution cannot contract faster than the minimax rate of
qn log(n/qn) around the truth. Hence, the optimal rate of contraction of a posterior distribution
around the true θ0 must be the minimax optimal rate in Eq. 1–9, up to some multiplicative
constant. In other words, if we use a fully Bayesian model to estimate a “nearly black”
normal mean vector, the minimax optimal rate should be our benchmark, and the posterior
distribution should capture the true θ0 in a ball of squared radius at most qn log(n/qn) (up to a
multiplicative constant) with probability tending to one as n → ∞.
1.3.2 Theoretical Results for Spike-and-Slab Priors
There is a large body of theoretical evidence in favor of point-mass mixture priors in Eq.
1–2 (for instance, see George & Foster (2000), Johnstone & Silverman (2004), Johnstone
& Silverman (2005), Abramovich et al. (2007), and Castillo & van der Vaart (2012)). As
remarked by Carvalho et al. (2009), a carefully chosen “two-groups” model can be considered a
“gold standard” for sparse problems. Using the empirical Bayes variant of Eq. 1–2, Johnstone
& Silverman (2004) showed that if the tails of ψ(·|λ) are at least as heavy as Laplace but not
heavier than Cauchy and if we take a restricted marginal maximum likelihood estimator for p,
both the posterior mean and median contract around the true θ0 at minimax rate. They also
showed that with a suitable beta prior on p, the entire posterior distribution contracts at the
minimax rate established in Eq. 1–11.
Recently, minimax-optimality results have also been obtained for continuous spike-and-slab
priors of the form given in Eq. 1–3 by Ročková & George (2016). Specifying a normal density
for ψ(·|λi), i = 1, 2, in Eq. 1–3 does not enable us to obtain minimax-optimality results
because the tails are insufficiently heavy. However, Ročková (2018) showed that minimax
optimality could be achieved for their spike-and-slab LASSO model, where Eq. 1–3 is a mixture
of two Laplace densities with differing variances instead. Specifically, Ročková (2018) showed
that by specifying suitable variances for the Laplace densities and a particular fixed value for
p (all of which depend on sample size n), the posterior mode under the SSL prior attains the
minimax risk in Eq. 1–10. Going further, Ročková (2018) also established that by placing an
appropriate beta prior on the mixing proportion p, the entire posterior distribution of the SSL
contracts at (near) minimax rate in Eq. 1–11.
1.3.3 Theoretical Results for Scale-Mixture Shrinkage Priors
In the statistical literature, there are also many minimax optimality results for global-local
shrinkage priors introduced in Section 1.2.2. van der Pas et al. (2014) showed that by either
treating the global parameter τ in Eq. 1–5 as a tuning parameter that decays to zero at an
appropriate rate as n → ∞ (that is, τ ≡ τn → 0 as n → ∞) or by giving an empirical Bayes
estimate τ based on an estimate of the sparsity level, the posterior mean under the horseshoe
prior in Eq. 1–6 attains the minimax risk in Eq. 1–10, possibly up to a multiplicative constant.
van der Pas et al. (2014) showed that for the same choices of τn or τ̂, the entire posterior
distribution for the horseshoe in Eq. 1–6 keeps pace with the posterior mean and contracts at
the minimax rate. Ghosh & Chakrabarti (2017) extended the work of van der Pas et al. (2014)
by showing that when τ → 0 at an appropriate rate and the true sparsity level is known, the
posterior distribution under a wider class of GL priors (including the student-t prior, the TPBN
family, and the GDP family) contracts at the minimax rate.
All the aforementioned results for global-local shrinkage priors in Eq. 1–5 have required
setting a rate for τ a priori or estimating τ through empirical Bayes in order to achieve the
minimax posterior contraction. Results for fully Bayesian global-local shrinkage priors have
also recently been discovered. Bhattacharya et al. (2015) developed a prior known as the
Dirichlet-Laplace prior, which contains a D(a, ..., a) prior in the scale component and a
gamma prior on the global parameter τ ∼ G(na, 1/2). Bhattacharya et al. (2015) showed
that the Dirichlet-Laplace prior could attain the minimax posterior contraction rate, provided
that an appropriate rate is placed on a and provided that there is a restriction on the signal
size. Specifically, they required that ||θ0||₂² ≤ qn log⁴ n. Recently, van der Pas et al. (2017a)
were also able to attain near-minimax posterior contraction for the horseshoe prior in Eq. 1–6 by
placing a prior on τ and restricting the support of τ to be the interval [1/n, 1].
Moving beyond the global-local framework, van der Pas et al. (2016) provided conditions
for which the posterior distribution under any scale-mixture shrinkage prior of the form
in Eq. 1–4 achieves the minimax posterior contraction rate, provided that the θi ’s are a
posteriori independent. Their result is quite general and covers a wide variety of priors,
including the inverse Gaussian prior, the normal-gamma prior (Griffin & Brown (2010)), and
the spike-and-slab LASSO (Ročková (2018)).
These results for scale-mixture shrinkage priors demonstrate that although scale-mixture
shrinkage priors do not contain a point mass at zero, they mimic the point mass in the
traditional spike-and-slab model in Eq. 1–2 well enough. Meanwhile, their heavy tails ensure
that large observations are not overshrunk.
1.4 Signal Detection Through Multiple Hypothesis Testing
In addition to robust estimation of θ in Eq. 1–1, we are often interested in detecting the
true signals (or non-zero entries) within θ. Here, we are essentially conducting n simultaneous
hypothesis tests, H0i : θi = 0 vs. H1i : θi ≠ 0, i = 1, ..., n. The problem of signal detection for a
noisy vector therefore can be recast as a multiple hypothesis testing problem.
Using the two-components model in Eq. 1–2 as a benchmark, Bogdan et al. (2011)
studied the risk properties of multiple testing rules within the decision theoretic framework
where each θi is truly generated from a two-groups model. Specifically, Bogdan et al.
(2011) considered a symmetric 0-1 loss for each test, with the total risk taken to be the expected
total number of misclassified tests. Below we describe this framework and review some of the
recent work on thresholding rules for scale-mixture shrinkage priors within this framework.
1.4.1 Asymptotic Bayes Optimality Under Sparsity
Suppose we observe X = (X1, ..., Xn), such that Xi ∼ N(θi, 1), for i = 1, ..., n. To
identify the true signals in X, we conduct n simultaneous tests: H0i : θi = 0 against H1i : θi ≠
0, for i = 1, ..., n. For each i, θi is assumed to be generated by a true data-generating model,

θi i.i.d.∼ (1 − p)δ{0} + pN(0, ψ²), i = 1, ..., n, (1–12)
where N(0, ψ²) with ψ² > 0 is a diffuse "slab" density. This point mass mixture model is often
considered a theoretical ideal for generating a sparse vector θ in the statistical literature.
Indeed, Carvalho et al. (2009) referred to the model in Eq. 1–12 as a “gold standard” for
sparse problems.
The model in Eq. 1–12 is equivalent to assuming that for each i, θi is a random
variable whose distribution is determined by a latent binary random variable νi, where νi = 0
denotes the event that H0i is true, while νi = 1 corresponds to the event that H0i is false. Here
the νi's are assumed to be i.i.d. Bernoulli(p) random variables, for some p ∈ (0, 1). Under H0i,
θi ∼ δ{0}, the distribution having point mass 1 at 0, while under H1i, θi ≠ 0 and is assumed to
follow an N (0,ψ2) distribution with ψ2 > 0. The marginal distributions of the Xi ’s are then
given by the following two-groups model:
Xi i.i.d.∼ (1 − p)N(0, 1) + pN(0, 1 + ψ²), i = 1, ..., n. (1–13)
Our testing problem is now equivalent to testing simultaneously
H0i : νi = 0 versus H1i : νi = 1 for i = 1, ..., n. (1–14)
We consider a symmetric 0-1 loss for each individual test and the total loss of a multiple
testing procedure is assumed to be the sum of the individual losses incurred in each test.
Letting t1i and t2i denote the probabilities of Type I and Type II errors of the ith test,
respectively, the Bayes risk of a multiple testing procedure under the two-groups model
in Eq. 1–13 is given by

R = ∑_{i=1}^{n} {(1 − p)t1i + p t2i}. (1–15)
Bogdan et al. (2011) showed that the rule which minimizes the Bayes risk in Eq. 1–15 is the
test which, for each i = 1, ..., n, rejects H0i if
f(xi | νi = 1) / f(xi | νi = 0) > (1 − p)/p,  i.e.  Xi² > c², (1–16)

where f(xi | νi = 1) denotes the marginal density of Xi under H1i, while f(xi | νi = 0) denotes
that under H0i, and c² ≡ c²_{ψ,f} = ((1 + ψ²)/ψ²)(log(1 + ψ²) + 2 log f), with f = (1 − p)/p.
The above rule is known as the Bayes Oracle, because it makes use of the unknown parameters
ψ and p, and hence it is not attainable in finite samples. Reparametrizing as u = ψ² and
v = uf², the above threshold becomes

c² ≡ c²_{u,v} = (1 + 1/u)(log v + log(1 + 1/u)).
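As a quick numerical check, the two parametrizations of the oracle threshold agree exactly. The sketch below is illustrative code (not part of the dissertation); the function names and the example values ψ² = 4, p = 0.1 are our own assumptions.

```python
import math

def oracle_threshold_uv(u, v):
    """Bayes Oracle threshold c^2 in the (u, v) parametrization,
    where u = psi^2 and v = u * f^2 with f = (1 - p) / p."""
    return (1.0 + 1.0 / u) * (math.log(v) + math.log(1.0 + 1.0 / u))

def oracle_threshold_psi_f(psi2, f):
    """The same threshold in the original (psi^2, f) parametrization."""
    return ((1.0 + psi2) / psi2) * (math.log(1.0 + psi2) + 2.0 * math.log(f))

# Example: psi^2 = 4 and p = 0.1, so f = (1 - p)/p = 9 and v = u * f^2 = 324.
u, f = 4.0, 9.0
c2 = oracle_threshold_uv(u, u * f ** 2)
# Identical, since (1 + 1/u)(log v + log(1 + 1/u)) expands to
# ((1 + u)/u)(log(1 + u) + 2 log f) when v = u f^2.
assert abs(c2 - oracle_threshold_psi_f(u, f)) < 1e-12
```

The Bayes Oracle then rejects H0i whenever Xi² exceeds the computed c².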
Bogdan et al. (2011) considered the following asymptotic scheme.
Assumption 1. The sequence of vectors (ψn, pn) satisfies the following conditions:
1. pn → 0 as n → ∞.
2. un = ψn² → ∞ as n → ∞.
3. vn = un fn² = ψn²((1 − pn)/pn)² → ∞ as n → ∞.
4. (log vn)/un → C ∈ (0, ∞) as n → ∞.
Bogdan et al. (2011) provided detailed insight into the constant C. Summarizing briefly, if
C = 0, then both the Type I and Type II errors vanish asymptotically, while for C = ∞, the inference is
essentially no better than tossing a coin. Under Assumption 1, Bogdan et al. (2011) showed
that the corresponding asymptotic optimal Bayes risk has a particularly simple form, which is
given by
R_Opt^BO = n((1 − p)t1^BO + p t2^BO) = np(2Φ(√C) − 1)(1 + o(1)), (1–17)

where the o(1) terms tend to zero as n → ∞ and Φ(·) denotes the standard normal
cumulative distribution function (cdf). A testing procedure with risk R is said to be
asymptotically Bayes optimal under sparsity (ABOS) if

R / R_Opt^BO → 1 as n → ∞. (1–18)
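The leading term np(2Φ(√C) − 1) of Eq. 1–17 is easy to evaluate, which makes the role of C concrete. The snippet below is an illustrative sketch (the function names and example values are ours, not the dissertation's), computing Φ via the error function.

```python
import math

def normal_cdf(x):
    """Standard normal CDF Phi(x) via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def abos_optimal_risk(n, p, C):
    """Leading term of the asymptotic optimal Bayes risk in Eq. 1-17:
    n * p * (2 * Phi(sqrt(C)) - 1)."""
    return n * p * (2.0 * normal_cdf(math.sqrt(C)) - 1.0)

# n = 10000 tests, signal proportion p = 0.01, so n*p = 100 true signals.
# C = 0 gives vanishing risk; large C pushes the risk toward n*p,
# the cost of missing essentially every signal.
risks = [abos_optimal_risk(10000, 0.01, C) for C in (0.0, 1.0, 9.0)]
```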
1.4.2 ABOS of Thresholding Rules Based on Scale-Mixture Shrinkage Priors
Bogdan et al. (2011) gave conditions under which traditional multiple testing rules,
such as the Benjamini & Hochberg (1995) procedure for controlling the false discovery rate
or the Bonferroni family-wise error adjustment procedure, are ABOS. Thresholding rules based on
scale-mixture shrinkage priors in Eq. 1–4 have also recently been considered.
While scale-mixture shrinkage priors are attractive because of their computational
efficiency, they do not produce exact zeroes as estimates. Therefore, to classify estimates as
signals or noise, one must use some sort of thresholding rule. Thresholding rules based on the
posterior mean for global-local shrinkage priors in Eq. 1–5 have been studied extensively in the
literature.
One easily sees that the conditional mean under GL priors in Eq. 1–5 is given by

E(θi | X1, ..., Xn, ξi, τ) = (1 − κi)Xi, (1–19)

where κi = 1/(1 + τξi). Since κi ∈ (0, 1), it is clear from Eq. 1–19 that the amount of shrinkage is
controlled by the shrinkage factor κi, which depends on both ξi and τ. Namely, the posterior
mean E(θi | X1, ..., Xn) ≈ Xi for large signals Xi, while E(θi | X1, ..., Xn) ≈ 0 for small Xi.
Therefore, it seems reasonable to classify the entries in θ as either signal or noise depending
upon this shrinkage factor κi .
For the horseshoe prior in Eq. 1–6, Carvalho et al. (2010) first introduced the thresholding
rule,

Reject H0i if E(1 − κi | X1, ..., Xn) > 1/2, (1–20)

where κi = 1/(1 + ξiτ), √ξi ∼ C+(0, 1), and √τ ∼ C+(0, 1). Ghosh et al. (2016) later extended
the classification rule in Eq. 1–20 for a general class of global-local shrinkage priors, which
includes the Strawderman-Berger, normal-exponential-gamma, and generalized double Pareto
priors.
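To make the rule in Eq. 1–20 concrete, note that conditionally on τ = 1 (a simplification we adopt here; the full horseshoe also mixes over √τ ∼ C+(0, 1)), the posterior of κi is proportional to e^{−κXi²/2} κ^{a−1/2}(1 − κ)^{b−1} with a = b = 1/2, i.e. a Beta(1, 1/2) kernel tilted by the likelihood. The sketch below is our own illustration, not the dissertation's code: it estimates E(1 − κi | Xi) by self-normalized Monte Carlo and applies the 1/2 threshold.

```python
import numpy as np

def posterior_signal_weight(x, a=0.5, b=0.5, n_draws=200000, seed=0):
    """Estimate E(1 - kappa | x) for the conditional (tau = 1) posterior
    pi(kappa | x) ∝ exp(-kappa x^2/2) kappa^(a - 1/2) (1 - kappa)^(b - 1)
    by reweighting Beta(a + 1/2, b) draws with the likelihood tilt."""
    rng = np.random.default_rng(seed)
    kappa = rng.beta(a + 0.5, b, size=n_draws)
    w = np.exp(-0.5 * kappa * x ** 2)
    return float(np.sum((1.0 - kappa) * w) / np.sum(w))

# Eq. 1-20: flag observation x as a signal when E(1 - kappa | x) > 1/2.
noise_like = posterior_signal_weight(0.5)   # small |x|: heavily shrunk
signal_like = posterior_signal_weight(6.0)  # large |x|: barely shrunk
```

For a small observation the estimated weight falls well below 1/2 (no rejection), while for a large one it approaches 1, illustrating the tail robustness discussed above.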
The theoretical properties of the classification rule in Eq. 1–20 have been studied
within the ABOS framework described in Section 1.4.1. Assuming that the θi ’s come from
a two-components model and placing an appropriate rate of decay on τ , Datta & Ghosh
(2013) showed that the thresholding rule in Eq. 1–20 for the horseshoe prior in Eq. 1–6 could
asymptotically attain the ABOS risk in Eq. 1–17 up to a multiplicative constant.
Ghosh et al. (2016) generalized Datta & Ghosh (2013)’s result to a general class of
shrinkage priors of the form in Eq. 1–5, which includes the student-t distribution, the TPBN
family, and the GDP family of priors. In particular, Ghosh et al. (2016) considered both the
case where τ is treated as a tuning parameter that depends on sample size and the case
where τ is the empirical Bayes estimator for τ given by van der Pas et al. (2014). Ghosh &
Chakrabarti (2017) later showed that the thresholding rule in Eq. 1–20 for this same class of priors
can asymptotically attain the ABOS risk in Eq. 1–17 exactly. Bhadra et al. (2017)
also extended the classification rule in Eq. 1–20 to the horseshoe+ prior. The horseshoe+ prior
adds an extra half-Cauchy hyperprior to the hierarchy of the horseshoe prior in order to induce
ultra-sparse estimates of θ. Bhadra et al. (2017) established that, with an appropriate rate
specified for τ , the horseshoe+ prior asymptotically attains the ABOS risk in Eq. 1–17 up to a
multiplicative constant.
1.5 Sparse Univariate Linear Regression
Before we discuss high-dimensional multivariate linear regression, we first review some
frequentist and Bayesian methods for the univariate linear regression model in high-dimensional
settings. The model we consider first is
Y = Xβ + ε, (1–21)
where Y = (y1, ..., yn)⊤ is an n × 1 vector of observations of some response, X is an n × p design
matrix, and ε ∼ Nn(0,σ2In) is an n-dimensional random noise vector. In high-dimensional
settings, p is typically much greater than n, which renders traditional estimation and model
selection techniques such as ordinary least squares or best subsets regression infeasible. In
particular, when p > n, the matrix X⊤X is singular, so the usual ordinary least squares
estimator (X⊤X)⁻¹X⊤Y does not exist, and least squares solutions are no longer unique. Further,
it becomes computationally infeasible to search over the 2^p possible models when p is very
large. To mitigate these problems, statisticians
typically impose a sparsity assumption on β in Eq. 1–21 (i.e. most of the entries in β are
assumed to be zero) in order to make β estimable.
1.5.1 Frequentist Approaches
In frequentist approaches to estimating β in Eq. 1–21, the most commonly used method
for inducing sparsity is through imposing regularization penalties on the coefficients of interest.
These frequentist estimators can be obtained by minimizing a penalized least squares objective
function,
min_β  ||Y − Xβ||₂² + λ ∑_{i=1}^{p} ρ(βi),
where ρ(·) is an appropriately chosen (usually convex) penalty function, and λ > 0 is a
tuning parameter. Popular choices of penalty functions include the LASSO (Tibshirani (1996))
and its many variants, including the adaptive lasso (Zou (2006)), the group lasso (Yuan &
Lin (2006)), and the elastic net (Zou & Hastie (2005)). These methods use either an ℓ1 or
a combination of an ℓ1 and ℓ2 penalty function to shrink irrelevant predictors or groups of
predictors to exactly zero. These methods are attractive because they induce exact zeros as
estimates for some of the βi ’s, i = 1, ..., p, therefore enabling statisticians to simultaneously
perform estimation and variable selection.
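The penalized criterion above is typically minimized with soft-thresholding-based algorithms, which is also what produces the exact zeros. Below is a minimal proximal-gradient (ISTA) sketch for the LASSO penalty ρ(βi) = |βi|; the simulated data, step size, and λ are our illustrative assumptions, not values from the dissertation.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||y - X beta||_2^2 + lam * ||beta||_1 by proximal gradient."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1/L for the smooth part
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = (3.0, -2.0, 1.5)                 # sparse ground truth
y = X @ beta_true + 0.5 * rng.standard_normal(100)
beta_hat = lasso_ista(X, y, lam=20.0)            # exact zeros for most noise coords
```

The soft-thresholding step is exactly what sets irrelevant coefficients to zero, giving simultaneous estimation and variable selection.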
1.5.2 Bayesian Approaches
1.5.2.1 Spike-and-slab
In the Bayesian univariate regression model, spike-and-slab priors in Eq. 1–2 have been a
popular choice for inducing sparsity in the coefficients for regression problems. In the context
of linear regression, these priors are placed on βi , i = 1, ..., p, in the following hierarchical
formulation:

Y | X, β ∼ Nn(Xβ, σ²In),
βi ∼ (1 − p)δ0(βi) + p ψ(βi | λ), i = 1, ..., p,
p ∼ π(p), σ² ∼ µ(σ²),
(1–22)
where ψ is a diffuse unimodal density symmetric around zero and indexed by scale parameter
λ, π(·) has support on (0, 1), and µ(·) has support on (0,∞). Popular choices for π include
Uniform(0, 1) or Beta(a, b), while µ is usually chosen to be an inverse gamma density or
the noninformative Jeffreys prior. Under the model in Eq. 1–22, some of the regression
coefficients are forced to zero (the “spike”), while ψ (the “slab”) models the nonzero
coefficients. In order to perform group estimation and group variable selection, Xu & Ghosh
(2015) also introduced the Bayesian group lasso with spike-and-slab priors (BGL-SS), which
uses a mixture prior with a point mass at the vector 0_mg ∈ R^mg (where mg denotes the size of
group g) for the "spike" and a multivariate normal distribution for the "slab."
Just as in the normal means model in Eq. 1–1, the point mass mixture can face
computational difficulties when p is very large. Therefore, continuous variants of spike-and-slab
of the form in Eq. 1–3, such as the celebrated SSVS method by George & McCulloch (1993)
or the recent SSL model by Ročková & George (2016), are often used in practice instead.
Recently, Ishwaran & Rao (2005) and Narisetty & He (2014) also used mixture priors of
normals but rescaled the variances (dependent upon the sample size n) in order to
better control the amount of shrinkage for each individual coefficient.
1.5.2.2 Continuous Shrinkage Priors
When p is large, spike-and-slab priors can face computational problems, since they
require either searching over 2^p possible models or data augmentation via latent variables. To
circumvent these issues, continuous shrinkage priors of the form in Eq. 1–4 are also popular for
Bayesian univariate regression. In the context of univariate linear regression, our hierarchical
model with shrinkage priors is typically of the form,
Y | X, β ∼ Nn(Xβ, σ²In),
βi ind∼ N(0, σ²ωi²), i = 1, ..., p,
ωi² ∼ π(ωi²), i = 1, ..., p,
σ² ∼ µ(σ²),
(1–23)

where π(ωi²) is a carefully chosen prior on the scale component in the prior for βi. In
particular, we obtain special cases of the model in Eq. 1–23 by placing global-local shrinkage
priors of the form in Eq. 1–5 on the coefficients βi , i = 1, ..., p. Just as in the normal means
model in Eq. 1–1, these shrinkage priors shrink most of the coefficients towards zero, but their
tail robustness prevents overshrinkage of true nonzero coefficients.
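Computationally, the appeal of Eq. 1–23 is conjugacy: conditionally on the scales, β | Y, ω², σ² ∼ N((X⊤X + D⁻¹)⁻¹X⊤Y, σ²(X⊤X + D⁻¹)⁻¹) with D = diag(ω₁², ..., ωp²), which is the workhorse update inside Gibbs samplers for these models. The sketch below is an illustration with made-up fixed scales (not a full sampler): small ωi² force near-zero estimates, while large ωi² leave coefficients essentially unshrunk.

```python
import numpy as np

def conditional_beta_mean(X, y, omega2):
    """Posterior mean of beta given the scales under Eq. 1-23:
    (X'X + D^{-1})^{-1} X'y with D = diag(omega2)."""
    D_inv = np.diag(1.0 / np.asarray(omega2, dtype=float))
    return np.linalg.solve(X.T @ X + D_inv, X.T @ y)

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 5))
y = X @ np.array([2.0, 0.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(50)
# Large scale on coordinate 1 (essentially no shrinkage), tiny scales on the rest.
beta = conditional_beta_mean(X, y, omega2=[100.0, 1e-4, 1e-4, 1e-4, 1e-4])
```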
1.5.3 Posterior Consistency for Univariate Linear Regression
Suppose that the true model is
Yn = Xnβ0n + εn, (1–24)
where εn ∼ Nn(0,σ2In) and β0n depends on n. For convenience, we denote β0n as β0 going
forward, noting that β0 depends on n.
Let {β0}n≥1 be the sequence of true coefficients, and let P0 denote the distribution of
{Yn}n≥1 under Eq. 1–24. Let {πn(βn)}n≥1 and {πn(βn|Yn)}n≥1 denote the sequences of prior
and posterior densities for β. We say that our model consistently estimates β0 if the posterior
probability that βn lies in an ε-neighborhood of β0 (ε > 0) converges to 1 almost surely with
respect to the measure P0 as n → ∞. Formally, we give the following definition (see Armagan
et al. (2013a)):
Definition 1.2. (posterior consistency) Let Bn = {βn : ||βn − β0||₂ > ε}, where ε > 0. The
sequence of posterior distributions of βn under the prior πn(βn) is said to be strongly consistent
under Eq. 1–24 if, for any ε > 0,

Πn(Bn | Yn) = Πn(||βn − β0||₂ > ε | Yn) → 0 a.s. P0 as n → ∞.
Consistency results have been established for both spike-and-slab models in Eq. 1–22
and continuous shrinkage models in Eq. 1–23 in the case where the number of covariates p
grows slower than or at the same rate as the sample size n. Ishwaran & Rao (2011) established
that for a rescaled version of spike-and-slab, the posterior mean β̂n consistently estimates
β0 (≡ β0n) in Eq. 1–24. Letting A denote the subset of true nonzero coefficients in β0,
Ishwaran & Rao (2011) also showed that √n(β̂n^A − β0^A) is asymptotically normally distributed
with mean 0 and variance-covariance matrix equal to the inverse of the appropriate submatrix of
the Fisher information matrix. This is known as the oracle property (see Fan & Li (2001)). Xu & Ghosh
(2015) also established the oracle property for the posterior median for their spike-and-slab
model with grouped variables under the assumption of an orthogonal design matrix
X. These consistency results concerned only a particular point estimate of β0 rather than the
entire posterior density.
For a variety of GL shrinkage priors of the form in Eq. 1–5, Armagan et al. (2013a)
established posterior consistency of the entire posterior distribution, as defined in Definition
1.2, when p grows slower than n. Zhang & Bondell (2017) also established posterior
consistency under the Dirichlet-Laplace prior when p grows slower than n. Moving beyond
the “small p, large n” scenario, Dasgupta (2016) established posterior consistency for the
Bayesian lasso under the assumption of orthogonal design when p grows at the same rate as n.
1.6 Sparse Multivariate Linear Regression
We now consider the classical multivariate normal linear regression model,
Y = XB+ E, (1–25)
where Y = (Y1, ...,Yq) is an n × q response matrix of n samples and q response variables,
X is an n × p matrix of n samples and p covariates, B ∈ Rp×q is the coefficient matrix,
and E = (ε1, ..., εn)⊤ is an n × q noise matrix. Under normality, we assume that εi i.i.d.∼
Nq(0, Σ), i = 1, ..., n. In other words, each row of E is identically distributed with mean 0 and
covariance Σ.
Our focus is on obtaining a sparse estimate of the p × q coefficient matrix B. In
practical settings, particularly in high-dimensional settings when p > n, it is important not
only to provide robust estimates of B, but to choose a subset of regressor variables from the
p rows of B which are good for prediction on the q responses. Although p may be large, the
number of predictors that are actually associated with the responses is generally quite small. A
parsimonious model also tends to give far better estimation and prediction performance than a
dense model, which further motivates the need for sparse estimates of B.
1.6.1 Frequentist Approaches
The ℓ1 and ℓ2 regularization methods described in Section 1.5.1 have been naturally
extended to the multivariate regression setup in Eq. 1–25 where sparsity in the coefficients
matrix is desired. For example, Rothman et al. (2010) utilized an ℓ1 penalty on each individual
coefficient of B in Eq. 1–25, in addition to an ℓ1 penalty on the off-diagonal entries of the
covariance matrix to perform joint sparse estimation of B and Σ. Li et al. (2015) proposed the
multivariate sparse group lasso, which utilizes a combination of a group ℓ2 penalty on rows of
B and an ℓ1 penalty on the individual coefficients bij to perform sparse estimation and variable
selection at both the group and within-group levels. Wilms & Croux (2017) also considered
a model which uses an ℓ2 penalty on rows of B to shrink entire rows to zero, while jointly
estimating the covariance matrix Σ.
1.6.2 Bayesian Approaches
The two-components mixture approach in Section 1.5.2 has been extended to the
multivariate framework of Eq. 1–25 by Brown et al. (1998), Liquet et al. (2016), and Liquet
et al. (2017). Brown et al. (1998) and Liquet et al. (2016) first facilitate variable selection
by associating each of the p rows of B, bi , 1 ≤ i ≤ p, with a p-dimensional binary vector
γ = (γ1, ..., γp), where each entry in γ follows a Bernoulli distribution. The selected bi ’s are
then estimated by placing a multivariate Zellner g-prior (see Zellner (1986)) on the sub-matrix
of the selected covariates.
Recently, Liquet et al. (2017) extended Xu & Ghosh (2015)’s work to the multivariate
case with a method called Multivariate Group Selection with Spike and Slab Prior (MBGL-SS).
Under MBGL-SS, rows of B are grouped together and modeled with a prior mixture density
with a point mass at 0 ∈ R^{mg·q} having positive probability (where mg denotes the size of the
gth group and q is the number of responses). Liquet et al. (2017) use the posterior median
B̂ = (b̂ij)_{p×q} as the estimate for B, so that entire rows are estimated to be exactly zero.
1.6.3 Reduced Rank Regression
Finally, both frequentist and Bayesian reduced rank regression (RRR) approaches
have been developed to tackle the problem of sparse estimation of B in Eq. 1–25. RRR
constrains the coefficient matrix B to be rank-deficient. Chen & Huang (2012) proposed a
rank-constrained adaptive group lasso approach to recover a low-rank matrix with some rows of
B estimated to be exactly zero. Bunea et al. (2012) also proposed a joint sparse and low-rank
estimation approach and derived its non-asymptotic oracle bounds. The RRR approach was
recently adapted to the Bayesian framework by Zhu et al. (2014) and Goh et al. (2017). In the
Bayesian framework, rank-reducing priors are used to shrink most of the rows and columns in
B towards 0p ∈ Rp or 0⊤q ∈ Rq.
CHAPTER 2
THE INVERSE GAMMA-GAMMA PRIOR FOR SPARSE ESTIMATION
In this chapter, we introduce a new fully Bayesian scale-mixture shrinkage prior known
as the inverse gamma-gamma (IGG) prior. Our goal is twofold. Having observed a vector
X = (X1, ...,Xn) with entries from the model in Eq. 1–1,
Xi = θi + ϵi , ϵi ∼ N (0, 1), i = 1, ..., n,
we would like to simultaneously achieve: 1) robust estimation of θ, and 2) a robust testing
rule for identifying true signals. Multiple testing with the IGG prior is deferred to Chapter 3.
In this chapter, we discuss the IGG’s theoretical properties and illustrate how it can be used to
estimate sparse noisy vectors.
The IGG is a special case of the popular three parameter beta normal (TPBN) family, first
introduced by Armagan et al. (2011). The TPBN mixture family generalizes several well-known
scale-mixture shrinkage priors. This family places a beta prime density (also known as the
inverted beta) as a prior on the scale parameter, λi, and is of the form,

θi | τ, λi ind∼ N(0, τλi),
π(λi) = [Γ(a + b)/(Γ(a)Γ(b))] λi^{a−1}(1 + λi)^{−(a+b)}, i = 1, ..., n, (2–1)
where a and b are positive constants. Examples of priors that fall under the TPBN family
include the horseshoe prior (a = b = 0.5), the Strawderman-Berger prior (a = 1, b = 0.5), and
the normal-exponential gamma (NEG) prior (a = 1, b > 0).
With the IGG prior, the global parameter τ is fixed at τ = 1. However, we show that we
can achieve (near) minimax posterior contraction by simply specifying sample-size dependent
hyperparameters a and b, rather than by tuning or estimating a shared global parameter τ .
Our prior therefore does not fall under the global-local framework and our theoretical results
differ from many existing results based on global-local priors. We further justify the use of
the IGG by obtaining a sharper upper bound on the rate of posterior concentration in the
Kullback-Leibler sense than previous upper bounds derived for the horseshoe and horseshoe+
densities.
The organization of this chapter is as follows. In Section 2.1, we introduce the IGG
prior. We show that it mimics traditional shrinkage priors by placing heavy mass around
zero. We also establish various concentration properties of the IGG prior that characterize its
tail behavior and that are crucial for establishing our theoretical results. In Section 2.3, we
discuss the behavior of the posterior under the IGG prior. We show that for a class of sparse
normal mean vectors, the posterior distribution under the IGG prior contracts around the true
θ at (near) minimax rate under mild conditions. We also show that the upper bound for the
posterior concentration rate in the Kullback-Leibler sense is sharper for the IGG than it is for
other known Bayes estimators. In Section 2.4, we present simulation results which demonstrate
that the IGG prior has excellent performance for estimation in finite samples. Finally, in Section
2.5, we utilize the IGG prior to analyze a prostate cancer data set.
2.1 The Inverse Gamma-Gamma (IGG) Prior
Suppose we have observed X ∼ Nn(θ, In), and our task is to estimate the n-dimensional
vector, θ. Consider putting a scale-mixture prior on each θi , i = 1, ..., n of the form
θi |σ2iind∼ N (0,σ2i ), i = 1, ..., n,
σ2ii .i .d .∼ β′(a, b), i = 1, ..., n,
(2–2)
where β′(a, b) denotes the beta prime density in Eq. 2–1. The scale mixture prior in Eq. 2–2
is a special case of the TPBN family of priors with the global parameter τ fixed at τ = 1. One
easily sees that the posterior mean of θi under Eq. 2–2 is given by

E(θi | Xi) = E{E(θi | Xi, σi²) | Xi} = {E(1 − κi | Xi)}Xi, (2–3)

where κi = 1/(1 + σi²). Using a simple transformation of variables, we also see that the posterior
density of the shrinkage factor κi is proportional to
Figure 2-1. Marginal density of the IGG prior in Eq. 2–5 with hyperparameters a = 0.6, b = 0.4, in comparison to other shrinkage priors. The DL_{1/2} prior is the marginal density for the Dirichlet-Laplace density with D(1/2, ..., 1/2) specified as a prior in the Bayesian hierarchy.
π(κi | Xi) ∝ exp(−κiXi²/2) κi^{a−1/2}(1 − κi)^{b−1}, κi ∈ (0, 1). (2–4)
From Eq. 2–3, it is clear that the amount of shrinkage is controlled by the shrinkage
factor κi. With appropriately chosen a and b, one can obtain sparse estimates of the θi's. For
example, with a = b = 0.5, the implied prior on the scale σi is the standard half-Cauchy density
C+(0, 1), i.e. the horseshoe prior.
To distinguish our work from previous results, we note that the beta prime density in
Eq. 2–1 can be rewritten as a product of independent inverse gamma and gamma densities.
We reparametrize Eq. 2–2 as follows:
θi | λi, ξi ind∼ N(0, λiξi), i = 1, ..., n,
λi i.i.d.∼ IG(a, 1), i = 1, ..., n,
ξi i.i.d.∼ G(b, 1), i = 1, ..., n,
(2–5)
where a, b > 0, and refer to this prior as the inverse gamma-gamma (IGG) prior. It should
be noted that the rate parameter 1 in Eq. 2–5 could be replaced by any positive constant.
Eq. 2–5 gives us some important intuition into the behavior of the IGG prior. Namely, for
small values of b, G(b, 1) places more mass around zero. As Proposition 2.1 shows, for any
0 < b ≤ 1/2, the marginal distribution of a single θ under the IGG prior has a singularity at
zero.
Proposition 2.1. If θ is endowed with the IGG prior in Eq. 2–5, then the marginal distribution
of θ is unbounded with a singularity at zero for any 0 < b ≤ 1/2.
Proof. See Appendix A.1.
Proposition 2.1 gives us some insight into how we should choose the hyperparameters in
Eq. 2–5. Namely, we see that for small values of b, the IGG prior can induce sparse estimates
of the θi ’s by shrinking most observations to zero. As we will illustrate in Section 2.2, the tails
of the IGG prior are still heavy enough to identify signals that are significantly far away from
zero.
Figure 2-1 gives a plot of the marginal density π(θi) for the IGG prior in Eq. 2–5, with
a = 0.6 and b = 0.4. Figure 2-1 shows that with a small value of b, the IGG has a singularity
at zero. The IGG prior also appears to place slightly heavier mass around zero than other
well-known scale-mixture shrinkage priors, while maintaining the same tail robustness. In Section
2.3, we provide a theoretical argument showing that the shrinkage profile near zero under
the IGG is indeed more aggressive than that of previously known Bayesian estimators.
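The singularity in Figure 2-1 is also easy to see empirically, since the hierarchy in Eq. 2–5 is trivial to sample: draw λ ~ IG(a, 1) as the reciprocal of a G(a, 1) variate, ξ ~ G(b, 1), and then θ | λ, ξ ~ N(0, λξ). The sketch below is our own illustration (the sample sizes and the 0.05 window are arbitrary choices); it compares the IGG's mass near the origin with that of a standard normal.

```python
import numpy as np

def sample_igg(n, a=0.6, b=0.4, seed=0):
    """Draw n variates from the IGG prior of Eq. 2-5."""
    rng = np.random.default_rng(seed)
    lam = 1.0 / rng.gamma(shape=a, scale=1.0, size=n)  # lambda ~ IG(a, 1)
    xi = rng.gamma(shape=b, scale=1.0, size=n)         # xi ~ G(b, 1)
    return rng.standard_normal(n) * np.sqrt(lam * xi)  # theta | lambda, xi

theta = sample_igg(200000)
z = np.random.default_rng(1).standard_normal(200000)
frac_igg = float(np.mean(np.abs(theta) < 0.05))    # mass in a small window
frac_norm = float(np.mean(np.abs(z) < 0.05))       # roughly 0.04 for N(0, 1)
# The IGG piles up several times more mass near zero, mimicking the "spike,"
# while its heavy tails still produce occasional very large draws.
```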
2.2 Concentration Properties of the IGG Prior
2.2.1 Notation
Throughout the rest of the chapter, we use the following notation. Let {an} and {bn} be
two non-negative sequences of real numbers indexed by n, where bn ≠ 0 for sufficiently large n.
We write an ≍ bn to denote 0 < lim inf_{n→∞} an/bn ≤ lim sup_{n→∞} an/bn < ∞, and an ≲ bn to denote
that there exists a constant C > 0, independent of n, such that an ≤ Cbn provided n is sufficiently
large. If lim_{n→∞} an/bn = 1, we write an ∼ bn. Moreover, if |an/bn| ≤ M for all sufficiently
large n, where M > 0 is a positive constant independent of n, then we write an = O(bn). If
lim_{n→∞} an/bn = 0, we write an = o(bn). Thus, an = o(1) if lim_{n→∞} an = 0.
Throughout, we also use Z to denote a standard normal N(0, 1) random variable having
cumulative distribution function Φ(·) and probability density function ϕ(·).
2.2.2 Concentration Inequalities for the Shrinkage Factor
Consider the IGG prior given in Eq. 2–5, but now the hyperparameter b is allowed
to vary with n as n → ∞. Namely, we allow 0 < bn < 1 for all n, with bn → 0 as
n → ∞, so that even more mass is placed around zero as n → ∞. We also fix a to lie in the
interval (1/2, ∞). To emphasize that the hyperparameter bn depends on n, we rewrite the prior
in Eq. 2–5 as

θi | λi, ξi ind∼ N(0, λiξi), i = 1, ..., n,
λi i.i.d.∼ IG(a, 1), i = 1, ..., n,
ξi i.i.d.∼ G(bn, 1), i = 1, ..., n,
(2–6)

where bn ∈ (0, 1) with bn = o(1) and a ∈ (1/2, ∞). For the rest of this chapter, we label this particular
variant of the IGG prior as the IGGn prior.
As described in Section 2.1, the shrinkage factor κi = 1/(1 + λiξi) plays a critical role in the
amount of shrinkage of each observation Xi. In this section, we further characterize the tail
properties of the posterior distribution π(κi | Xi), which demonstrates that the IGGn prior in
Eq. 2–6 shrinks most estimates of the θi's to zero but still has heavy enough tails to identify true
signals. In the following results, we assume the IGGn prior on θi, with Xi ∼ N(θi, 1).
Theorem 2.1. For any a, bn ∈ (0, ∞),

E(1 − κi | Xi) ≤ e^{Xi²/2} · bn/(a + bn + 1/2).
Proof. See Appendix A.1.
Corollary 2.1.1. If a is fixed and bn → 0 as n → ∞, then E(1− κi |Xi) → 0 as n → ∞.
Theorem 2.2. Fix ϵ ∈ (0, 1). For any a ∈ (1/2, ∞) and bn ∈ (0, 1),

Pr(κi < ϵ | Xi) ≤ e^{Xi²/2} · bnϵ/((a + 1/2)(1 − ϵ)).
Proof. See Appendix A.1.
Corollary 2.2.1. If a ∈ (1/2, ∞) is fixed and bn → 0 as n → ∞, then by Theorem 2.2,
Pr(κi ≥ ϵ | Xi) → 1 for any fixed ϵ ∈ (0, 1).
Theorem 2.3. Fix η ∈ (0, 1) and δ ∈ (0, 1). Then for any a ∈ (1/2, ∞) and bn ∈ (0, 1),

Pr(κi > η | Xi) ≤ [(a + 1/2)(1 − η)^{bn} / (bn(ηδ)^{a+1/2})] exp(−η(1 − δ)Xi²/2).
Proof. See Appendix A.1.
Corollary 2.3.1. For any fixed n where a ∈ (1/2, ∞) and bn ∈ (0, 1), and for every fixed η ∈ (0, 1),
Pr(κi ≤ η | Xi) → 1 as Xi → ∞.

Corollary 2.3.2. For any fixed n where a ∈ (1/2, ∞) and bn ∈ (0, 1),
E(1 − κi | Xi) → 1 as Xi → ∞.
Since E(θi | Xi) = {E(1 − κi | Xi)}Xi, Corollaries 2.1.1 and 2.2.1 illustrate that all
observations will be shrunk towards the origin under the IGGn prior in Eq. 2–6. However,
Corollaries 2.3.1 and 2.3.2 demonstrate that if Xi is large enough, then the posterior mean
{E(1 − κi | Xi)}Xi ≈ Xi. This assures us that the tails of the IGG prior are still sufficiently
heavy to detect true signals.
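These bounds can be probed numerically. By Eq. 2–4 (with b replaced by bn), π(κi | Xi) is a Beta(a + 1/2, bn) kernel tilted by e^{−κXi²/2}, so E(1 − κi | Xi) can be estimated by self-normalized Monte Carlo and compared with the bound in Theorem 2.1. The sketch below is our own illustration; the values a = 1 and bn = 0.05 are arbitrary choices satisfying a ∈ (1/2, ∞) and bn ∈ (0, 1).

```python
import numpy as np

def igg_shrink_weight(x, a, b, n_draws=400000, seed=0):
    """Monte Carlo estimate of E(1 - kappa | x) under Eq. 2-4:
    reweight Beta(a + 1/2, b) draws by the tilt exp(-kappa x^2 / 2)."""
    rng = np.random.default_rng(seed)
    kappa = rng.beta(a + 0.5, b, size=n_draws)
    w = np.exp(-0.5 * kappa * x ** 2)
    return float(np.sum((1.0 - kappa) * w) / np.sum(w))

a, bn = 1.0, 0.05
for x in (0.5, 1.0, 2.0):
    est = igg_shrink_weight(x, a, bn)
    bound = np.exp(x ** 2 / 2.0) * bn / (a + bn + 0.5)  # Theorem 2.1
    assert est <= bound   # the bound holds at each x
```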
We will use the concentration properties established in Theorems 2.1 and 2.3 to provide
sufficient conditions for which the posterior mean and posterior distribution under the IGGn
prior in Eq. 2–6 contract around the true θ0 at minimax or near-minimax rate in Section 2.3.
These concentration properties will also help us to construct the multiple testing procedure
based on κi in Chapter 3.
2.3 Posterior Behavior Under the IGG Prior
2.3.1 Minimax Posterior Contraction Under the IGG Prior
We first study the mean squared error (MSE) and the posterior variance under the IGG prior
and provide upper bounds on both. For all of our results, we assume that the true θ0 belongs
to the set of nearly black vectors defined in Eq. 1–8. With a suitably chosen rate for bn in
Eq. 2–6, these upper bounds are equal, up to a multiplicative constant, to the minimax risk.
Utilizing these bounds, we also show that the posterior distribution under the IGGn prior in Eq.
2–6 is able to contract around θ0 at minimax-optimal rates.
Since the priors in Eq. 2–6 are independently placed on each θi , i = 1, ..., n, we
denote the resulting vector of posterior means (E(θ1|X1), ...,E(θn|Xn)) by T (X) and the
ith individual posterior mean by T (Xi). Therefore, T (X) is the Bayes estimate of θ under
squared error loss. Theorem 2.4 gives an upper bound on the mean squared error for T (X).
Theorem 2.4. Suppose X ∼ Nn(θ0, In), where θ0 ∈ ℓ0[qn]. Let T(X) denote the posterior
mean vector under Eq. 2–6. If a ∈ (1/2, ∞) and bn ∈ (0, 1) with bn → 0 as n → ∞, the MSE
satisfies

sup_{θ0 ∈ ℓ0[qn]} E_{θ0}||T(X) − θ0||² ≲ qn log(1/bn) + (n − qn)bn √log(1/bn),
provided that qn → ∞ and qn = o(n) as n → ∞.
Proof. See Appendix A.2.
By the minimax result in Donoho et al. (1992), we also have the lower bound,

sup_{θ0 ∈ ℓ0[qn]} E_{θ0}||T(X) − θ0||² ≥ 2qn log(n/qn)(1 + o(1)),
as n, qn → ∞ and qn = o(n). The choice bn = (qn/n)^α, for α ≥ 1, therefore leads to an
upper bound on the MSE of order qn log(n/qn), with a multiplicative constant of at most 2α. Based on
these observations, we immediately have the following corollary.
Corollary 2.4.1. Suppose that qn is known and that we set bn = (qn/n)^α, where α ≥ 1. Then
under the conditions of Theorem 2.4,

sup_{θ0 ∈ ℓ0[qn]} E_{θ0}||T(X) − θ0||² ≍ qn log(n/qn).
Corollary 2.4.1 shows that the posterior mean under the IGG prior performs well as a
point estimator for θ0, as it is able to attain the minimax risk (possibly up to a multiplicative
constant of at most 2 for α = 1). Although the IGG prior does not include a point mass at
zero, Proposition 2.1 and Corollary 2.4.1 together show that the pole at zero for the IGG prior
mimics the point mass well enough, while the heavy tails ensure that large observations are not
over-shrunk.
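To make the rate calculation in Corollary 2.4.1 explicit, a sketch of substituting bn = (qn/n)^α into the two terms of the Theorem 2.4 upper bound (the ≲ there hides an absolute constant):

```latex
q_n \log\frac{1}{b_n} = \alpha\, q_n \log\frac{n}{q_n},
\qquad
(n - q_n)\, b_n \sqrt{\log\frac{1}{b_n}}
\;\le\; n \left(\frac{q_n}{n}\right)^{\alpha} \sqrt{\alpha \log\frac{n}{q_n}}
\;\le\; q_n \sqrt{\alpha \log\frac{n}{q_n}}
\;=\; o\!\left(q_n \log\frac{n}{q_n}\right),
```

where the second inequality uses α ≥ 1. The first term therefore dominates, giving the stated order qn log(n/qn).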
The next theorem gives an upper bound for the total posterior variance corresponding to
the IGGn prior in Eq. 2–6.
Theorem 2.5. Suppose X ∼ Nn(θ0, In), where θ0 ∈ ℓ0[qn]. Under the IGGn prior in Eq. 2–6
and the conditions of Theorem 2.4, the total posterior variance satisfies

sup_{θ0∈ℓ0[qn]} E_{θ0} ∑_{i=1}^{n} Var(θi | Xi) ≲ qn log(1/bn) + (n − qn) bn √(log(1/bn)),

provided that qn → ∞ and qn = o(n) as n → ∞.
Proof. See Appendix A.2.
Having proven Theorems 2.4 and 2.5, we are now ready to state our main theorem
concerning optimal posterior contraction. Theorem 2.6 shows that the IGG is competitive with
other popular heavy-tailed priors like the Dirichlet-Laplace prior considered by Bhattacharya
et al. (2015) or the entire class of global-local shrinkage priors considered in Ghosh &
Chakrabarti (2017). As before, we denote the posterior mean vector under Eq. 2–6 as T (X).
Theorem 2.6. Suppose X ∼ Nn(θ0, In), where θ0 ∈ ℓ0[qn]. Suppose that the true sparsity
level qn is known, with qn → ∞ and qn = o(n) as n → ∞. Under the prior in Eq. 2–6, with
a ∈ (1/2, ∞) and bn = (qn/n)^α, α ≥ 1,

sup_{θ0∈ℓ0[qn]} E_{θ0} Π( θ : ||θ − θ0||² > Mn qn log(n/qn) | X ) → 0,   (2–7)

and

sup_{θ0∈ℓ0[qn]} E_{θ0} Π( θ : ||θ − T(X)||² > Mn qn log(n/qn) | X ) → 0,   (2–8)

for every Mn → ∞ as n → ∞.
Proof. A straightforward application of Markov’s inequality combined with the results of
Theorems 2.4 and 2.5 leads to Eq. 2–7, while Eq. 2–8 follows from Markov’s inequality
combined with only the result of Theorem 2.5.
Theorem 2.6 shows that under mild regularity conditions, the posterior distribution
under the IGG prior contracts around both the true mean vector and the corresponding Bayes
estimates at least as fast as the minimax ℓ2 risk in Eq. 1–9. Since the posterior distribution
cannot contract around the truth faster than the rate qn log(n/qn) (by Ghosal et al. (2000)),
the posterior distribution for the IGG prior under the conditions of Theorem 2.6 must contract
around the true θ0 at the minimax optimal rate in Eq. 1–9 up to some multiplicative constant.
We remark that the conditions needed to attain the minimax rate of posterior contraction
are quite mild. Namely, we only require that qn = o(n), and we do not need to make any
assumptions on the size of the true signal or the true sparsity level. For comparison, Castillo
& van der Vaart (2012) showed that the spike-and-slab prior with a Gaussian slab contracts
at a sub-optimal rate if ||θ0||² ≳ qn log(n/qn). Bhattacharya et al. (2015) showed that given
the Dir(a, ..., a) prior in the Dirichlet-Laplace prior, the posterior contracts around θ0 at the
minimax rate, provided that ||θ0||₂² ≤ qn log⁴ n if a = n^(−(1+β)), or provided that qn ≳ log n
if a = 1/n. The IGGn prior in Eq. 2–6 removes these restrictions on θ0 and qn. Moreover, our
minimax contraction result does not rely on tuning or estimating a global tuning parameter τ,
as many previous authors have done, but instead on appropriate selection of the hyperparameters
a and b in the Bayesian hierarchy for the product density of an IG(a, 1) and a G(b, 1).
In reality, the true sparsity level qn is rarely known, so the best we can do is to
obtain the near-minimax contraction rate of qn log n. A suitable modification of Theorem 2.6
leads to the following corollary.
Corollary 2.6.1. Suppose X ∼ Nn(θ0, In), where θ0 ∈ ℓ0[qn]. Suppose that the true sparsity
level qn is unknown, but that qn → ∞ and qn = o(n) as n → ∞. Under the prior in Eq. 2–6,
with a ∈ (1/2, ∞) and bn = 1/n^α, α ≥ 1,

sup_{θ0∈ℓ0[qn]} E_{θ0} Π( θ : ||θ − θ0||² > Mn qn log n | X ) → 0,   (2–9)

and

sup_{θ0∈ℓ0[qn]} E_{θ0} Π( θ : ||θ − T(X)||² > Mn qn log n | X ) → 0,   (2–10)

for every Mn → ∞ as n → ∞.
Having shown that the posterior mean under the model in Eq. 2–6 attains the near-minimax
risk up to a multiplicative constant, and that its posterior density captures the true θ0 in a
ball of squared radius at most qn log n up to some multiplicative constant, we now quantify its
shrinkage profile around zero in terms of Kullback-Leibler risk bounds. We show that this risk
bound is in fact sharper than those known for other shrinkage priors.
2.3.2 Kullback-Leibler Risk Bounds
In Section 2.3.1, we established that the choice bn = 1/n allows the IGGn posterior
to contract at the near-minimax rate, provided that a ∈ (1/2, ∞). Figure 2-1 suggests that
the shrinkage around zero is more aggressive for the IGGn prior than it is for other known
shrinkage priors when a and b are both set to small values. In this section, we provide a
theoretical justification for this behavior near zero.
Carvalho et al. (2010) and Bhadra et al. (2017) showed that when the true data
generating model is Nn(0, In), the Bayes estimate for the sampling density of the horseshoe
and the horseshoe+ estimators respectively converge to the true model at a super-efficient
rate in terms of the Kullback-Leibler (K-L) distance between the true model and the posterior
density. They argue that as a result, the horseshoe and horseshoe+ estimators squelch noise
better than other shrinkage estimators. However, in this section, we show that the IGGn prior
is able to shrink noise even more aggressively with appropriately chosen bn.
Let θ0 be the true parameter value and f(x|θ) the sampling model. Further, let
K(q1, q2) = E_{q1} log(q1/q2) denote the K-L divergence of the density q2 from q1. The proof
utilizes the following result by Clarke & Barron (1990).

Proposition 2.2. (Clarke and Barron, 1990). Let νn(dθ|x1, ..., xn) be the posterior distribution
corresponding to some prior ν(dθ) after observing data X = (x1, ..., xn) according to the
sampling model f(x|θ). Define the posterior predictive density qn(x) = ∫ f(x|θ) νn(dθ|x1, ..., xn),
and let ν(Aϵ) denote the prior measure of the set Aϵ = {θ : K(qθ0, qθ) ≤ ϵ}. Assume further
that ν(Aϵ) > 0 for all ϵ > 0. Then the Cesàro-average risk of the Bayes estimator, defined as
Rn ≡ n⁻¹ ∑_{j=1}^{n} K(qθ0, qj), satisfies

Rn ≤ ϵ − (1/n) log ν(Aϵ).
Using the above proposition, it is shown in Carvalho et al. (2010) and Bhadra et al.
(2017) that when the global parameter τ is fixed at τ = 1 and the true parameter θ0 = 0, the
horseshoe and the horseshoe+ respectively both have Cesàro-average risk which satisfies

Rn = O( (1/n) log( n/(log n)^d ) ),   (2–11)

where d is a positive constant. This rate is super-efficient, in the sense that the upper bound
on the risk is lower than that of the maximum likelihood estimator (MLE), which has the rate
O(log n / n) when θ0 = 0. The next theorem establishes that the IGG prior can achieve an even
sharper rate of convergence, O(n⁻¹), in the K-L sense, with appropriate choices of a and b.
Theorem 2.7. Suppose that the true sampling model pθ0 is xj ∼ N(θ0, 1). Then, under
the IGG prior with any a > 0 and bn = 1/n, the rate of convergence of Rn when θ0 = 0
satisfies the inequality

Rn ≤ (1/n) [ 2 + log(√π) + (a + 2) log 2 + log(a + 1/2) ] + (2 log n)/n².   (2–12)
Proof. See Appendix A.3.
Since (log n)/n² = o(n⁻¹), we see from Theorem 2.7 that the IGGn posterior density with
hyperparameters a > 0 and bn = 1/n has a convergence rate of O(n⁻¹). This
convergence rate is sharper than that of the horseshoe or horseshoe+, both of which converge
convergence rate is sharper than that of the horseshoe or horseshoe+, both of which converge
at the rate of O{n−1(log n− d log log n)} when θ0 = 0. To our knowledge, this is the sharpest
known bound on Cesàro-average risk for any Bayes estimator. Our result provides a rigorous
explanation for the observation that the IGG seems to shrink noise more aggressively than
other scale-mixture shrinkage priors.
Theorem 2.7 not only justifies the use of bn = 1/n as a choice for the hyperparameter b in
the IGG prior, but it also provides insight into how we should choose the hyperparameter a.
Equation 2–12 shows that the constant C in Rn ≤ Cn⁻¹ + o(n⁻¹) can be large if a is set to be
large. This theorem thus implies that in order to minimize the K-L distance between Nn(θ0, In)
and the IGG posterior density, we should pick a to be small. Since we require a ∈ (1/2, ∞) in
order to achieve the near-minimax contraction rate, our theoretical results suggest that we
should set a ∈ (1/2, 1/2 + δ] for small δ > 0 for optimal posterior concentration.
2.4 Simulation Study
2.4.1 Computation and Selection of Hyperparameters
Letting κi = 1/(1 + λiξi), the full conditional distributions for the model in Eq. 2–5 are

θi | rest ∼ N( (1 − κi)Xi, 1 − κi ),   i = 1, ..., n,
λi | rest ∼ IG( a + 1/2, θi²/(2ξi) + 1 ),   i = 1, ..., n,
ξi | rest ∼ GIG( θi²/λi, 2, b − 1/2 ),   i = 1, ..., n,   (2–13)

where GIG(a, b, p) denotes a generalized inverse Gaussian density with f(x; a, b, p) ∝
x^(p−1) e^(−(a/x + bx)/2). Therefore, the IGG model in Eq. 2–5 can be implemented straightforwardly
with Gibbs sampling, utilizing the full conditionals in Eq. 2–13.
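As an illustration of how the full conditionals in Eq. 2–13 translate into code, the following is a minimal Python sketch of this Gibbs sampler (not the author's implementation: the function name, the initialization λi = ξi = 1, and the small numerical floor on θi²/λi are our own choices, and SciPy's `geninvgauss(p, b, scale=s)`, with density ∝ x^(p−1) exp(−b(x + 1/x)/2) before scaling, is mapped to GIG(χ, ψ, p) via b = √(χψ), s = √(χ/ψ) with ψ = 2):

```python
import numpy as np
from scipy.stats import geninvgauss


def igg_gibbs(x, n_iter=10000, burn=5000, seed=0):
    """Gibbs sampler for the IGG_(1/n) model, cycling through the full
    conditionals in Eq. 2-13 with a = 1/2 + 1/n and b = 1/n.
    Returns the estimated posterior mean of theta."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    a, b = 0.5 + 1.0 / n, 1.0 / n
    rng = np.random.default_rng(seed)
    lam = np.ones(n)            # local scales lambda_i (our initialization)
    xi = np.ones(n)             # local scales xi_i (our initialization)
    theta_sum = np.zeros(n)
    for t in range(n_iter):
        # theta_i | rest ~ N((1 - kappa_i) X_i, 1 - kappa_i)
        kappa = 1.0 / (1.0 + lam * xi)
        theta = rng.normal((1.0 - kappa) * x, np.sqrt(1.0 - kappa))
        # lambda_i | rest ~ IG(a + 1/2, theta_i^2/(2 xi_i) + 1)
        lam = 1.0 / rng.gamma(a + 0.5, 1.0 / (theta**2 / (2.0 * xi) + 1.0))
        # xi_i | rest ~ GIG(chi = theta_i^2/lambda_i, psi = 2, p = b - 1/2)
        chi = np.maximum(theta**2 / lam, 1e-8)  # numerical floor (ours)
        xi = geninvgauss.rvs(b - 0.5, np.sqrt(2.0 * chi),
                             scale=np.sqrt(chi / 2.0), random_state=rng)
        if t >= burn:
            theta_sum += theta
    return theta_sum / (n_iter - burn)
```

In line with the selective-shrinkage behavior proven in Chapter 2, entries near zero are pulled aggressively toward zero while large observations are left nearly unshrunk.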
For all our simulations, we set a = 1/2 + 1/n and b = 1/n, in light of Theorems 2.6 and 2.7.
These choices of a and b ensure that the IGG posterior will contract around the true θ0 at
least at the near-minimax rate, while keeping a ∈ (1/2, ∞) small. We denote our IGG prior with
hyperparameters (a, b) = (1/2 + 1/n, 1/n) as IGG1/n. For both of the simulation studies described
below, we run 10,000 iterations of a Gibbs sampler, discarding the first 5,000 as burn-in.
2.4.2 Simulation Study for Sparse Estimation
To illustrate finite-sample performance of the IGG1/n prior, we use the set-up in Bhadra
et al. (2017) where we specify sparsity levels of q/n = 0.05, 0.10, 0.20, and 0.30, and set the
signals all equal to values of either A = 7 or 8, for a total of eight simulation settings. With
n = 200, we randomly generate n-dimensional vectors under these settings and compute the
average squared error loss corresponding to the posterior median across 100 replicates.
We compare our results for IGG1/n to the average squared error loss of the posterior
median under the Dirichlet-Laplace (DL), the horseshoe (HS), and the horseshoe+ (HS+)
estimators, since these are global-local shrinkage priors in Eq. 1–5 with singularities at zero.
For the HS and HS+ priors, we use a fully Bayesian approach, with τ ∼ C+(0, 1), as in
Ghosh et al. (2016). For the DL prior, we specify a = 1/n in the Dir(a, ..., a) prior on the scale
component, along with τ ∼ G(na, 1/2), as in Bhattacharya et al. (2015). Our results are
presented in Table 2-1.
Table 2-1 shows that under these various sparsity and signal strength settings, the
IGG1/n’s posterior median has the lowest (estimated) squared error loss in nearly all of the
simulation settings. It performs better than the horseshoe and the horseshoe+ in all settings.
Our empirical results confirm the theoretical properties that were proven in Section 2.3 and
illustrate that for finite samples, the IGG prior often outperforms other popular shrinkage
priors. Our empirical results also lend strong support to the use of the inverted beta prior
Table 2-1. Comparison of average squared error loss for the posterior median estimate of θ
across 100 replications. Results are reported for the IGG1/n, DL (Dirichlet-Laplace),
HS (horseshoe), and the HS+ (horseshoe-plus).

q/n    A   IGG     DL      HS      HS+
0.05   7   13.88   14.30   18.11   14.41
0.05   8   13.34   13.27   17.71   13.96
0.10   7   27.21   29.91   35.91   30.18
0.10   8   25.95   27.67   34.77   29.36
0.20   7   49.78   56.40   71.18   58.25
0.20   8   47.24   52.22   69.81   57.11
0.30   7   74.42   85.72   104.67  86.00
0.30   8   70.83   79.03   104.02  84.70
β′(a, b) as the scale density in scale-mixture shrinkage priors in Eq. 1–4. However, our results
suggest that we can obtain better estimation if we allow a and b to vary with the sample size,
rather than keeping them fixed (as the horseshoe priors do, with a = b = 0.5).
2.5 Analysis of a Prostate Cancer Data Set
We demonstrate practical application of the IGG prior using a popular prostate cancer
data set introduced by Singh et al. (2002). In this data set, there are gene expression values
for n = 6033 genes for m = 102 subjects, with m1 = 50 normal control subjects and m2 = 52
prostate cancer patients. We aim to identify genes that are significantly different between
control subjects and cancer patients. This problem can be reformulated as the normal means
problem in Eq. 1–1 by first conducting a two-sample t-test for each gene and then transforming
the test statistics (t1, ..., tn) to z-scores using the inverse normal cumulative distribution
function (CDF) transform zi = Φ⁻¹(F_{t100}(ti)), where F_{t100} denotes the CDF of the
Student's t distribution with 100 degrees of freedom. With z-scores (z1, ..., zn), our model is now

zi = θi + ϵi,   ϵi ∼ N(0, 1),   i = 1, ..., n,   (2–14)
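The t-to-z transformation described above can be sketched in a few lines (a hypothetical helper for illustration, not part of the original analysis):

```python
import numpy as np
from scipy import stats


def t_to_z(t_stats, df=100):
    """Map two-sample t-statistics to z-scores via the inverse normal
    CDF transform z_i = Phi^{-1}(F_{t_df}(t_i))."""
    t_stats = np.asarray(t_stats, dtype=float)
    return stats.norm.ppf(stats.t.cdf(t_stats, df))
```

Because the t distribution has heavier tails than the normal, the transform pulls each statistic slightly toward zero while preserving its sign.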
With this problem recast as a normal means problem, we can now estimate θ = (θ1, ..., θn). As
argued by Efron (2010), |θi | can be interpreted as the effect size of the ith gene for prostate
cancer. Efron (2010) first analyzed the model in Eq. 2–14 for this particular data set by
obtaining empirical Bayes estimates θEfroni , i = 1, ..., n, based on the two-groups model in Eq.
Table 2-2. The z-scores and the effect size estimates for the top 10 genes selected by Efron
(2010), under the IGG, DL, HS, and HS+ models and the two-groups empirical Bayes
model of Efron (2010).

Gene   z-score  θi^IGG  θi^DL  θi^HS  θi^HS+  θi^Efron
610    5.29     4.85    4.52   4.85   4.91    4.11
1720   4.83     4.33    3.94   4.33   4.35    3.65
332    4.47     3.78    3.40   3.78   3.99    3.24
364    -4.42    -3.78   -3.10  -3.78  -3.85   -3.57
914    4.40     3.71    3.11   3.71   3.86    3.16
3940   -4.33    -3.70   -3.06  -3.70  -3.80   -3.52
4546   -4.29    -3.59   -3.09  -3.59  -3.62   -3.47
1068   4.25     3.49    3.09   3.49   3.46    2.99
579    4.19     3.31    2.98   3.31   3.01    2.92
4331   -4.14    -3.41   -2.87  -3.41  -3.43   -3.30
1–12. In our analysis, we use the posterior means θi , i = 1, ..., n, to estimate the strength of
association.
Table 2-2 shows the top 10 genes selected by Efron (2010) and their estimated effect size
on prostate cancer. We compare Efron (2010)’s empirical Bayes posterior mean estimates with
the posterior mean estimates under the IGG, DL, HS, and HS+ priors. Our results confirm the
tail robustness of the IGG prior. All of the scale-mixture shrinkage priors shrink the estimated
effect size for significant genes less aggressively than Efron's procedure. Table 2-2 also shows
that for large signals, the IGG posterior shrinks slightly less than the DL
posterior and roughly the same amount as the HS posterior. The HS+ posterior shrinks the
test statistics the least for large signals, but the IGG’s estimates are still quite similar to those
of the HS+.
2.6 Concluding Remarks
In this chapter, we have introduced a new scale-mixture shrinkage prior called the Inverse
Gamma-Gamma prior for estimating sparse normal mean vectors. This prior has been shown
to have a number of good theoretical properties, including heavy probability mass around zero
and heavy tails. This enables the IGG prior to perform selective shrinkage and to attain (near)
minimax contraction around the true θ in Eq. 1–1. The IGG posterior also converges to the
true model in the Kullback-Leibler sense at a rate which has a sharper upper bound than the
upper bounds on the rates for the horseshoe and horseshoe+ posterior densities.
The IGG, HS, and HS+ all fall under the class of priors which utilize a beta prime density
as a prior on the scale component for the model in Eq. 1–4. However, our results suggest that
there is added flexibility in allowing the parameters (a, b) in Eq. 2–1 to vary with sample size
rather than keeping them fixed. This added flexibility leads to excellent empirical performance
and obviates the need to estimate a global tuning parameter τ . Despite the absence of a
data-dependent global parameter τ , the IGG model adapts well to sparsity, performing well
under both sparse and dense settings. This seems to be in stark contrast to remarks made by
authors like Carvalho et al. (2010) who have argued that scale-mixture shrinkage priors which
do not contain shared global parameters do not enjoy the benefits of adaptivity.
CHAPTER 3
MULTIPLE HYPOTHESIS TESTING WITH THE INVERSE GAMMA-GAMMA PRIOR
In Chapter 2, we introduced the inverse gamma-gamma prior in Eq. 2–5 which can be
used for estimation of sparse noisy vectors in the model given by Eq. 1–1,
Xi = θi + ϵi , ϵi ∼ N (0, 1), i = 1, ..., n,
Through its combination of aggressive shrinkage of noise towards zero and its tail robustness,
the IGG prior performs well for estimation. However, it does not produce exact zeros as
estimates. Therefore, in order to classify the θi's as either signal (θi ≠ 0) or noise (θi = 0), we
need to use a thresholding rule.
In this chapter, we discuss how the IGG prior in Eq. 2–5 may be used for classification,
or equivalently, simultaneously testing H0i : θi = 0 vs. H1i : θi ≠ 0, i = 1, ..., n. Using the
decision theoretic framework of Bogdan et al. (2011) described in Section 1.4.1, we show that
our testing rule for classifying signals asymptotically achieves the Bayes Oracle risk. While
previously, Ghosh & Chakrabarti (2017) demonstrated that testing rules based on global-local
priors of the form in Eq. 1–5 could asymptotically attain the optimal Bayes risk exactly, their
result required tuning or estimating a global parameter τ . The IGG prior avoids this by placing
appropriate values (dependent upon sample size) as its hyperparameters instead.
In Section 3.1, we introduce our testing rule with the IGG prior based on thresholding the
shrinkage factor κi = 1/(1 + λiξi). In Section 3.2, assuming that the true data generating model is
Eq. 1–12, we present upper and lower bounds on the probabilities of Type I and Type II errors
for our thresholding rule. Using these bounds, we establish that in the presence of sparsity,
our rule is asymptotically Bayes optimal under sparsity (ABOS). In Section 3.3, we present
simulations which show that the IGG has excellent performance for multiple testing in both
sparse and dense settings, and moreover, that it has tight control over the false discovery rate
(FDR). Finally, in Section 3.4, we use the IGG prior to analyze the prostate cancer data set
from Section 2.5 within the context of multiple hypothesis testing.
3.1 Classification Using the Inverse Gamma-Gamma Prior
3.1.1 Notation
We use the following notation for the rest of this chapter. Let {an} and {bn} be two
non-negative sequences of real numbers indexed by n, where bn ≠ 0 for sufficiently large n. We
write an ≍ bn to denote 0 < lim inf_{n→∞} an/bn ≤ lim sup_{n→∞} an/bn < ∞, and an ≲ bn to
denote that there exists a constant C > 0, independent of n, such that an ≤ C bn provided n is
sufficiently large. If lim_{n→∞} an/bn = 1, we write an ∼ bn. Moreover, if |an/bn| ≤ M for all
sufficiently large n, where M > 0 is a positive constant independent of n, then we write
an = O(bn). If lim_{n→∞} an/bn = 0, we write an = o(bn). Thus, an = o(1) if lim_{n→∞} an = 0.
Throughout the chapter, we also use Z to denote a standard normal N(0, 1) random
variable, having cumulative distribution function Φ(·) and probability density function ϕ(·),
respectively.
3.1.2 Thresholding the Posterior Shrinkage Weight
As we noted in Chapter 2, the posterior mean under the IGG prior in Eq. 2–5,
E{E(θi | Xi, σi²) | Xi} = E{(1 − κi) | Xi} Xi, depends heavily on the shrinkage factor
κi = 1/(1 + σi²) = 1/(1 + λiξi). Let θ̂i denote the posterior mean under the IGG prior. In
particular, we established in Corollaries 2.1.1 through 2.3.2 that if Xi ≈ 0, then θ̂i ≈ 0;
meanwhile, for large Xi's, the posterior mean θ̂i ≈ Xi. Because of the concentration properties
of the IGG prior proven in Sections 2.2 and 2.3, a sensible thresholding rule classifies
observations as signals or as noise based on the posterior distribution of this shrinkage factor.
Consider the following testing rule for the ith observation Xi:

Reject H0i if E(1 − κi | Xi) > 1/2,   (3–1)
where κi is the shrinkage factor based on the IGGn prior in Eq. 2–6. We show in the
subsequent sections that the thresholding rule in Eq. 3–1 has both strong theoretical
guarantees and excellent empirical performance.
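Given posterior draws of the shrinkage factors κi (for example, from a Gibbs sampler as in Section 2.4.1), the rule in Eq. 3–1 reduces to a one-line computation; a sketch, in which the (n_draws × n) input format is our own convention:

```python
import numpy as np


def igg_threshold(kappa_draws):
    """Thresholding rule of Eq. 3-1: flag observation i as a signal when
    the posterior mean of 1 - kappa_i exceeds 1/2.

    kappa_draws: array of shape (n_draws, n) of posterior samples of
    kappa_i = 1/(1 + lambda_i * xi_i)."""
    weights = 1.0 - np.asarray(kappa_draws).mean(axis=0)  # E(1 - kappa_i | X_i)
    return weights > 0.5
```

The returned boolean vector marks the rejected null hypotheses H0i.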
3.2 Asymptotic Optimality of the IGG Classification Rule
Suppose that the true data generating model is Eq. 1–12, i.e.

θi i.i.d.∼ (1 − p) δ{0} + p N(0, ψ²),   i = 1, ..., n,

where ψ² > 0. Within the context of multiple testing, a good benchmark for our test
procedure in Eq. 3–1 is whether it is ABOS, i.e. whether its risk is asymptotically
equal to the Bayes Oracle risk in Eq. 1–17. Adopting the asymptotic framework of
Bogdan et al. (2011), we let RIGG denote the asymptotic Bayes risk of the testing rule in Eq. 3–1,
and we compare it to the Bayes Oracle risk, denoted as R^BO_Opt.
Before we state our main theorem, we first present four lemmas which give upper
and lower bounds on the Type I and Type II error probabilities, t1i and t2i respectively, for the
classification rule in Eq. 3–1. These error probabilities are given respectively by

t1i = Pr[ E(1 − κi | Xi) > 1/2 | H0i is true ],
t2i = Pr[ E(1 − κi | Xi) ≤ 1/2 | H1i is true ].   (3–2)
Lemma 3.1. Suppose that X1, ..., Xn are i.i.d. observations having the distribution in Eq. 1–13,
where the sequence of vectors (ψ², p) satisfies Assumption 1. Suppose we wish to test Eq.
1–14 using the classification rule in Eq. 3–1. Then for all n, an upper bound for the probability
of a Type I error for the ith test is given by

t1i ≤ [ 2bn / √(π(a + bn + 1/2)) ] · [ log( (a + bn + 1/2)/(2bn) ) ]^(−1/2).
Proof. See Appendix B.
Lemma 3.2. Suppose that X1, ..., Xn are i.i.d. observations following the distribution from
Eq. 1–13, where the sequence of vectors (ψ², p) satisfies Assumption 1. Suppose we wish to
test Eq. 1–14 using the classification rule in Eq. 3–1. Suppose further that a ∈ (1/2, ∞) and
bn ∈ (0, 1), with bn → 0 as n → ∞. Then for any η ∈ (0, 1/2), δ ∈ (0, 1), and sufficiently large
n, a lower bound for the probability of a Type I error for the ith test is given by

t1i ≥ 1 − Φ( √{ [2/(η(1 − δ))] log( (a + 1/2)(1 − η) / (bn (ηδ)^(a+1/2)) ) } ).
Proof. See Appendix B.
Lemma 3.3. Suppose we have the same set-up as Lemma 3.1. Assume further that bn → 0
in such a way that lim_{n→∞} bn^(1/4)/pn ∈ (0, ∞). Then for any η ∈ (0, 1/2), δ ∈ (0, 1), and
sufficiently large n, an upper bound for the probability of a Type II error for the ith test is
given by

t2i ≤ [ 2Φ( √( C/(2η(1 − δ)) ) ) − 1 ] (1 + o(1)),

where the o(1) terms tend to zero as n → ∞.
Proof. See Appendix B.
Lemma 3.4. Suppose we have the same set-up as Lemma 3.1. Then a lower bound for the
probability of a Type II error for the ith test is given by

t2i ≥ [ 2Φ(√C) − 1 ] (1 + o(1)) as n → ∞,

where the o(1) terms tend to zero as n → ∞.
Having obtained bounds on Type I and Type II errors in Lemmas 3.1, 3.2, 3.3, and 3.4, we
now state our main theorem.
Theorem 3.1. Suppose that X1, ..., Xn are i.i.d. observations drawn from the distribution in
Eq. 1–13, where the sequence of vectors (ψ², p) satisfies Assumption 1. Suppose we wish to
test Eq. 1–14 using the classification rule in Eq. 3–1. Suppose further that a ∈ (1/2, ∞) and
bn ∈ (0, 1), with bn → 0 as n → ∞ in such a way that lim_{n→∞} bn^(1/4)/pn ∈ (0, ∞). Then

lim_{n→∞} RIGG / R^BO_Opt = 1,   (3–3)

i.e. the classification rule in Eq. 3–1 based on the IGGn prior in Eq. 2–6 is ABOS.
Proof. See Appendix B.
[Figure: scatter plot of posterior inclusion probability against X]

Figure 3-1. Comparison between the posterior inclusion probabilities and the posterior
shrinkage weights 1 − E(κi | Xi) when p = 0.10.
We have shown that our thresholding rule based on the IGGn posterior asymptotically
attains the Bayes Oracle risk exactly, provided that bn decays to zero at a certain rate relative
to the sparsity level p. For example, if the prior mixing proportion pn is known, we can set the
hyperparameter bn = pn⁴. Then the conditions for the classification rule in Eq. 3–1 to be ABOS are
satisfied. Theorem 3.1 thus provides theoretical justification for using the IGG prior for signal
detection.
3.3 Simulation Study
For the multiple testing rule in Eq. 3–1, we adopt the simulation framework of Datta &
Ghosh (2013) and Ghosh et al. (2016) and fix sparsity levels at p ∈ {0.01, 0.05, 0.10, 0.15, 0.2,
0.25, 0.3, 0.35, 0.4, 0.45, 0.5} for a total of 11 simulation settings. For sample size
n = 200 and each p, we generate data from the two-groups model in Eq. 1–12, with
ψ = √(2 log n) ≈ 3.26. We then add Gaussian white noise to our data and apply the
thresholding rule in Eq. 3–1 using IGG1/n to classify the θi's in our model as either signals
(θi ≠ 0) or noise (θi = 0). We estimate the average misclassification probability (MP) for the
thresholding rule in Eq. 3–1 from 100 replicates.
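The data-generating step of this experiment can be sketched as follows (a minimal illustration of the set-up; the function name and seed handling are our own):

```python
import numpy as np


def simulate_two_groups(n=200, p=0.10, seed=0):
    """Draw theta from the two-groups model of Eq. 1-12 with
    psi = sqrt(2 log n), then add N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    psi = np.sqrt(2.0 * np.log(n))
    signal = rng.random(n) < p            # indicator of a nonzero mean
    theta = np.where(signal, rng.normal(0.0, psi, n), 0.0)
    x = theta + rng.normal(0.0, 1.0, n)   # observed noisy vector
    return theta, x, signal
```

Each replicate of the study draws a fresh (θ, X) pair from this model before applying the competing classification rules.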
[Figure: estimated misclassification probability against sparsity, with legend MP = p, Oracle, BH, IGG, DL, HS, HS+]

Figure 3-2. Estimated misclassification probabilities. The thresholding rule in Eq. 3–1 based on
the IGG posterior mean is nearly as good as the Bayes Oracle rule in Eq. 1–16.
Taking p = 0.10, we plot in Figure 3-1 the theoretical posterior inclusion probabilities
ωi(Xi) = Pr(νi = 1 | Xi) for the two-groups model in Eq. 1–12, given by

ωi(Xi) = π(νi = 1 | Xi) = { ((1 − p)/p) √(1 + ψ²) exp( −(Xi²/2) · ψ²/(1 + ψ²) ) + 1 }^(−1),
along with the shrinkage weights 1−E(κi |Xi) corresponding to the IGG1/n prior. The circles in
the figure denote the theoretical posterior inclusion probabilities, while the triangles correspond
to the shrinkage weights 1 − E(κi |Xi). The figure clearly shows that for small values of
the sparsity level p, the shrinkage weights are in close proximity to the posterior inclusion
probabilities. This, together with the theoretical results established in Section 3.2, justifies
using 1 − E(κi | Xi) as an approximation to the corresponding posterior inclusion probabilities
ωi(Xi) in sparse situations. This motivates the use of the IGG1/n prior in Eq. 2–6 and its
corresponding decision rule in Eq. 3–1 for identifying signals in noisy data.
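For reference, the theoretical inclusion probability ωi(Xi) displayed above is straightforward to compute; a sketch:

```python
import numpy as np


def inclusion_prob(x, p, psi):
    """Theoretical posterior inclusion probability omega_i(X_i) =
    Pr(nu_i = 1 | X_i) under the two-groups model of Eq. 1-12."""
    x = np.asarray(x, dtype=float)
    prior_odds = (1.0 - p) / p
    # Bayes factor of the null against the N(0, psi^2) slab at x
    bayes_factor = np.sqrt(1.0 + psi**2) * np.exp(
        -0.5 * x**2 * psi**2 / (1.0 + psi**2))
    return 1.0 / (prior_odds * bayes_factor + 1.0)
```

As expected, the probability is small near X = 0 and approaches one for large |X|.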
Figure 3-2 shows the estimated misclassification probabilities (MP) for the decision rule
in Eq. 3–1 for the IGG1/n prior, along with the estimated MP’s for the Bayes Oracle (BO),
the Benjamini-Hochberg procedure (BH), the Dirichlet-Laplace (DL), the horseshoe (HS), and
the horseshoe+ (HS+). The Bayes Oracle rule, defined in Eq. 1–16, is the decision rule that
minimizes the expected number of misclassified signals in Eq. 1–15 when (p,ψ) are known.
The Bayes Oracle therefore serves as the lower bound to the MP, whereas the line MP = p
corresponds to the situation where we reject all null hypotheses without looking at the data.
For the Benjamini-Hochberg rule, we use αn = 1/ log n = 0.1887. Bogdan et al. (2011)
theoretically established the ABOS property of the BH procedure for this choice of αn. For the
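For concreteness, the BH step-up rule at level α can be sketched as follows (a generic implementation of the standard procedure, not code from this dissertation):

```python
import numpy as np


def benjamini_hochberg(pvals, alpha=0.1887):
    """Benjamini-Hochberg step-up rule: with sorted p-values
    p_(1) <= ... <= p_(n), reject the k smallest, where k is the largest
    index with p_(k) <= alpha * k / n. The default alpha = 1/log(200)
    is the choice used in this section's simulations."""
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    below = np.nonzero(pvals[order] <= alpha * np.arange(1, n + 1) / n)[0]
    k = below[-1] + 1 if below.size else 0
    reject = np.zeros(n, dtype=bool)
    reject[order[:k]] = True
    return reject
```

The step-up structure (reject everything below the largest crossing index, not just the contiguous initial run) is what gives BH its FDR guarantee.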
DL, HS, and HS+ priors, we use the classification rule
Reject H0i if E(1− κi |X1...,Xn) >1
2, (3–4)
where κi = 11+σ2
i
and σ2i is the scale parameter in the scale-mixture shrinkage model in
Eq. 1–4. For the horseshoe and horseshoe+ priors, we specify a half-Cauchy prior on the
global parameter τ ∼ C+(0, 1). Since τ is a shared global parameter, the posterior for
κi depends on all the data. Carvalho et al. (2010) first introduced the thresholding rule
in Eq. 3–4 for the horseshoe. Ghosh et al. (2016) later extended the rule in Eq. 3–4 for
a general class of global-local shrinkage priors, which includes the Strawderman-Berger,
normal-exponential-gamma, and generalized double Pareto priors. Based on Ghosh et al.
(2016)’s simulation results, the horseshoe performs similarly to or better than these other
aforementioned priors, so we do not include these other priors in our comparison study.
Our results provide strong support for our theoretical findings in Section 3.2 and strong
justification for the use of the test procedure in Eq. 3–1 to classify signals. As Figure 3-2
illustrates, the misclassification probability for the IGG prior with (a, b) = (1/2 + 1/n, 1/n) is nearly
equal to that of the Bayes Oracle, which gives the lowest possible MP. The thresholding rule
in Eq. 3–4 based on the horseshoe+ prior and the Dirichlet-Laplace priors also appears to
be quite competitive compared to the Bayes Oracle. Bhadra et al. (2017) proved that the
horseshoe+ prior asymptotically matches the Bayes Oracle risk up to a multiplicative constant
if τ is treated as a tuning parameter, but did not prove this for the case where τ is endowed
with a prior. There also does not appear to be any theoretical justification for the thresholding
rule in Eq. 3–4 under the DL prior in the literature. On the other hand, Theorem 3.1 provides
Table 3-1. Comparison of false discovery rate (FDR) for different classification methods under
dense settings. The IGG1/n has the lowest FDR of all the different methods.

p      BO     BH     IGG    DL     HS     HS+
0.30   0.08   0.13   0.005  0.08   0.14   0.09
0.35   0.08   0.12   0.004  0.09   0.22   0.10
0.40   0.09   0.11   0.004  0.10   0.31   0.10
0.45   0.10   0.10   0.003  0.12   0.39   0.11
0.50   0.10   0.09   0.003  0.13   0.43   0.11
theoretical support for the use of the thresholding rule in Eq. 3–1 under the IGG prior, which is
confirmed by our empirical study.
Figure 3-2 also shows that the performance for the rule in Eq. 3–4 under the horseshoe
degrades considerably as θ = (θ1, ..., θn) becomes more dense. With sparsity level p = 0.5,
the horseshoe’s misclassification rate is close to 0.4, only marginally better than rejecting
all the null hypotheses without looking at the data. This phenomenon was also observed by
Datta & Ghosh (2013) and Ghosh et al. (2016). This appears to be because in the dense
setting, there are many noisy entries that are “moderately” far from zero, and the horseshoe
prior does not shrink these aggressively enough towards zero in order for the testing rule in Eq.
3–4 to classify these as true noise. The horseshoe+ prior seems to alleviate this by adding an
additional half-Cauchy C+(0, 1) prior to the Bayes hierarchy. In Table 3-1, we report the false
discovery rate (FDR) under dense settings for the different methods. We see that the FDR is
quite a bit larger for the horseshoe than for the other methods. Table 3-1 also shows that the
IGG1/n prior has very tight control over the FDR in dense settings. Although the IGG prior is
not constructed to specifically control FDR, we see that in practice, it does provide excellent
control of false positives.
Finally, we demonstrate the shrinkage properties corresponding to the IGG1/n prior
along with the horseshoe, the horseshoe+, and the Dirichlet-Laplace priors. In Figure 3-3, we
plot the posterior expectations E(θi | Xi) for the IGG1/n prior and the posterior expectations
E(θi | X1, ..., Xn) for the HS, HS+, and DL priors. The amount
of posterior shrinkage can be observed in terms of distance between the 45◦ line and the
[Figure: posterior expectation against X, with legend Flat, IGG, HS, HS+, DL]

Figure 3-3. Posterior mean E(θ|X) vs. X plot for p = 0.25.
posterior expectation. Figure 3-3 shows that near zero, the noisy entries are more aggressively
shrunk towards zero for the IGG1/n prior than for the other priors with poles at zero. This
confirms our findings in Theorem 2.7 which proved that the shrinkage profile near zero is more
aggressive for the IGG1/n prior in the Kullback-Leibler sense than for the HS or HS+ priors.
Meanwhile, Figure 3-3 also shows that the signals are left mostly unshrunk, confirming that the
IGG shares the same tail robustness as the other priors. The more aggressive shrinkage of noise
explains why the IGG achieves better estimation, as we demonstrated in Section 2.4.2.
3.4 Analysis of a Prostate Cancer Data Set
We demonstrate multiple testing with the IGG prior using the same prostate cancer data set
(Singh et al. (2002)) that we analyzed in Section 2.5. As described in Section 2.5, our aim is to identify
significant differences between control subjects and prostate cancer patients from n = 6033
genes. After conducting a two-sample t-test for each gene, we transform our test statistics to
z-scores (z1, ..., zn) through an inverse normal CDF transform to obtain the final model:
zi = θi + ϵi , i = 1, ..., n,
where ϵi ∼ N(0, 1). This allows us to implement the IGG prior on the z-scores to conduct the
simultaneous tests H0i : θi = 0 vs. H1i : θi ≠ 0, i = 1, ..., n, to identify genes that are
significantly associated with prostate cancer.
With our z-scores, we implement the IGG1/n model with (a, b) = (1/2 + 1/n, 1/n) on the
model in Eq. 2–14 and use the classification rule in Eq. 3–1 to identify significant genes. For
comparison, we also fit this model for the DL, HS, and HS+ priors, and benchmark it to the
Benjamini-Hochberg (BH) procedure with FDR level α set to 0.10. The IGG1/n prior selects 85 genes
as significant, in comparison to 60 genes under the BH procedure. The HS prior selects 62
genes as significant. The HS+ and DL priors select 41 and 42 genes respectively, indicating
more conservative estimates. All 60 of the genes flagged as significant by the BH procedure
are included in the 85 genes that the IGG prior classifies as significant. On the other hand, the
HS prior’s conclusions diverge from the BH procedure. Seven genes (genes 11, 377, 637, 805,
1588, 3269, and 4040) are deemed significant by the HS, but not by BH.
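As a point of reference for the comparison above, the BH step-up rule at level α can be sketched in a few lines. The Python function below is our own generic illustration (not the code used for this analysis); it forms two-sided normal p-values from the z-scores and applies the step-up rule.

```python
import math
import numpy as np

def benjamini_hochberg(z, alpha=0.10):
    """BH step-up rule on two-sided p-values computed from z-scores.

    Returns a boolean array marking the hypotheses rejected at FDR level alpha.
    """
    n = len(z)
    # two-sided p-value: 2*(1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    pvals = np.array([math.erfc(abs(zi) / math.sqrt(2)) for zi in z])
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, n + 1) / n
    passed = pvals[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()      # largest index meeting the step-up bound
        reject[order[: k + 1]] = True        # reject all hypotheses with smaller p-values
    return reject
```

On the prostate data, one would call `benjamini_hochberg(z, alpha=0.10)` with the vector of 6033 z-scores.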
3.5 Concluding Remarks
In this chapter, we have introduced a thresholding rule in Eq. 3–1 based on the posterior
density of the shrinkage factor κi = 1/(1 + λiξi) under the inverse gamma-gamma prior in Eq. 2–5.
We have shown that our thresholding rule asymptotically attains the Bayes Oracle risk in Eq.
1–17 exactly under mild conditions.
Our work ultimately moves the testing problem beyond the global-local framework in Eq.
1–5. Previously, Datta & Ghosh (2013), Ghosh et al. (2016), Ghosh & Chakrabarti (2017),
and Bhadra et al. (2017) have shown that horseshoe or horseshoe-like priors asymptotically
attain the Bayes Oracle risk (possibly up to a multiplicative constant) either by specifying a
rate for the global parameter τ in Eq. 1–5 or by estimating it with an empirical Bayes plug-in
estimator. In the case with the IGG prior, we prove that our thresholding rule based on the
posterior mean is ABOS without utilizing a shared global tuning parameter τ . This appears
to be the first time that the Bayes Oracle property has been established for a scale-mixture
shrinkage prior that falls outside the class of global-local shrinkage priors.
Finally, through our simulation studies, we have shown that the classification rule in Eq.
3–1 performs well under both sparse and dense settings. Our classification rule yields roughly
the same misclassification probability (MP) as the Bayes Oracle in all simulation settings.
We have also shown that the IGG prior provides very tight control of the false discovery rate.
The aggressive shrinkage profile near zero was theoretically established in Theorem 2.7 and is
illustrated in Figure 3-3. As a result, the IGG squelches noise more effectively than many other
global-local shrinkage priors, and this keeps the number of false positives low.
CHAPTER 4
HIGH-DIMENSIONAL MULTIVARIATE POSTERIOR CONSISTENCY UNDER
GLOBAL-LOCAL SHRINKAGE PRIORS
In this chapter, we consider the multivariate normal linear regression model in Eq. 1–25,
Y = XB+ E,
where Y = (Y1, ...,Yq) is an n × q response matrix of n samples and q response variables,
X is an n × p matrix of n samples and p covariates, B ∈ Rp×q is the coefficient matrix,
and E = (ε1, ..., εn)^⊤ is an n × q noise matrix. Under normality, we assume that the εi are
i.i.d. Nq(0, Σ), i = 1, ..., n. In other words, each row of E is identically distributed with mean 0 and
covariance Σ. Throughout this chapter, we also assume that Y and X are centered so there is
no intercept term in B.
In high-dimensional settings, we are often interested in both sparse estimation of B and
variable selection from the p covariates. We adopt a Bayesian approach to this joint problem
by using global-local (GL) shrinkage priors in Eq. 1–5. GL priors were introduced in Section
1.2.2, and this class of priors encompasses a wide variety of scale-mixture shrinkage priors,
including the horseshoe prior in Eq. 1–6 and many others (e.g., see Table 1-1).
We specifically consider polynomial-tailed priors, which are priors that have tails that
are heavier than exponential. A formal mathematical definition was given in Eq. 1–7, and
we reiterate it later in this chapter. Although polynomial-tailed priors have been studied
extensively in univariate regression, their potential utility for multivariate analysis seems to have
been largely overlooked. In this chapter, we introduce a new Bayesian approach for estimating
the unknown p × q coefficient matrix B in Eq. 1–25 using polynomial-tailed priors. We call our
method the Multivariate Bayesian model with Shrinkage Priors (MBSP).
While there have been many methodological developments for Bayesian multivariate
linear regression, theoretical results in this domain have not kept pace with applications.
There appears to be very little theoretical justification for adopting Bayesian methodology in
multivariate regression. In this thesis, we take a step towards resolving this gap by providing
sufficient conditions under which Bayesian multivariate linear regression models can obtain
posterior consistency. To our knowledge, Theorem 4.2 is the first result in the literature to give
general conditions for posterior consistency under the model in Eq. 1–25 when p > n and when
p grows at nearly exponential rate with sample size n. We further illustrate that our method
based on polynomial-tailed priors achieves strong posterior consistency in both low-dimensional
and ultrahigh-dimensional settings.
The rest of the chapter is organized as follows. In Section 4.1, we introduce the MBSP
model and provide some insight into how it facilitates sparse estimation and variable selection.
In Section 4.2, we present sufficient conditions for our model to achieve posterior consistency
in both the cases where p grows slower than n and the case when p grows at nearly
exponential rate with n. In Section 4.3, we show how to implement MBSP using the three
parameter beta normal (TPBN) family of priors in Eq. 2–2 as a special case and how to utilize
our method for variable selection. Efficient Gibbs sampling and computational complexity
considerations are also discussed. In Section 4.4, we illustrate our method’s finite sample
performance through simulations and analysis of a real data set.
4.1 Multivariate Bayesian Model with Shrinkage Priors (MBSP)
4.1.1 Preliminary Notation and Definitions
We first introduce the following notation and definitions.
Definition 4.1. A random matrix Y is said to have the matrix-normal density if Y has the
density function (on the space R^{a×b}):

f(Y) = (2π)^{-ab/2} |U|^{-b/2} |V|^{-a/2} exp{ -(1/2) tr[U^{-1}(Y − M)V^{-1}(Y − M)^⊤] },   (4–1)

where M ∈ R^{a×b}, and U and V are positive definite matrices of dimension a × a and b × b
respectively. If Y is distributed as a matrix-normal distribution with pdf given in Eq. 4–1, we
write Y ∼ MN_{a×b}(M, U, V).
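As a sanity check on Eq. 4–1 (not part of the text itself): if Y ∼ MN_{a×b}(M, U, V), then vec(Y) ∼ N(vec(M), V ⊗ U) under column-stacking. The Python sketch below (all function and variable names are our own) evaluates the matrix-normal density on the log scale and confirms it matches the corresponding multivariate normal density.

```python
import numpy as np

def matrix_normal_logpdf(Y, M, U, V):
    """Log of the matrix-normal density in Eq. 4-1."""
    a, b = Y.shape
    R = Y - M
    # quadratic form tr[U^{-1} R V^{-1} R^T]
    quad = np.trace(np.linalg.solve(U, R) @ np.linalg.solve(V, R.T))
    _, ldU = np.linalg.slogdet(U)
    _, ldV = np.linalg.slogdet(V)
    return -0.5 * (a * b * np.log(2 * np.pi) + b * ldU + a * ldV + quad)

def mvn_logpdf(x, mu, S):
    """Log-density of a multivariate normal N(mu, S)."""
    r = x - mu
    _, ldS = np.linalg.slogdet(S)
    return -0.5 * (len(x) * np.log(2 * np.pi) + ldS + r @ np.linalg.solve(S, r))

rng = np.random.default_rng(0)
a, b = 3, 2
M = rng.standard_normal((a, b))
G1 = rng.standard_normal((a, a)); U = G1 @ G1.T + a * np.eye(a)   # random SPD row covariance
G2 = rng.standard_normal((b, b)); V = G2 @ G2.T + b * np.eye(b)   # random SPD column covariance
Y = rng.standard_normal((a, b))

lp_matrix = matrix_normal_logpdf(Y, M, U, V)
# vec() stacks columns, so vec(Y) ~ N(vec(M), V kron U)
lp_vec = mvn_logpdf(Y.flatten(order="F"), M.flatten(order="F"), np.kron(V, U))
assert np.isclose(lp_matrix, lp_vec)
```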
Definition 4.2. The matrix O ∈ Ra×b denotes the a × b matrix with all zero entries.
4.1.2 MBSP Model
Our multivariate Bayesian model formulation for the model in Eq. 1–25 with shrinkage
priors (henceforth referred to as MBSP) is as follows:
Y | X, B, Σ ∼ MN_{n×q}(XB, In, Σ),
B | ξ1, ..., ξp, Σ ∼ MN_{p×q}(O, τ diag(ξ1, ..., ξp), Σ),
ξi ∼ π(ξi) independently, i = 1, ..., p,   (4–2)
where π(ξi) is a polynomial-tailed prior density of the form in Eq. 1–7,
π(ξi) = K ξi^{-a-1} L(ξi),

where K > 0 is the constant of proportionality, a is a positive real number, and L is a positive,
measurable, non-constant, slowly varying function over (0,∞). The formal definition of slowly
varying functions was given in Definition 1.1, and examples of polynomial-tailed priors are given
in Table 1-1.
4.1.3 Handling Sparsity
In this section, we illustrate how the MBSP model induces sparsity. First note that in Eq.
4–2, an alternative way of writing the density of Y | X, B, Σ is

f(Y | X, B, Σ) ∝ |Σ|^{-n/2} exp{ -(1/2) ∑_{i=1}^n (yi − ∑_{j=1}^p xij bj)^⊤ Σ^{-1} (yi − ∑_{j=1}^p xij bj) },   (4–3)

where bj denotes the jth row of B.
Following from Eq. 4–3, we see that under the model in Eq. 4–2 with known Σ, the joint
prior density π(B, ξ1, ..., ξp) is

π(B, ξ1, ..., ξp) ∝ ∏_{j=1}^p ξj^{-q/2} exp{ -(1/(2ξj)) ||bj(τΣ)^{-1/2}||2^2 } π(ξj),   (4–4)

where || · ||2 denotes the ℓ2 vector norm. Since the p rows of B are independent, we see from
Eq. 4–4 that this choice of prior induces sparsity on the rows of B, while also accounting for
the covariance structure of the q responses. This ultimately facilitates sparse estimation of B
as a whole and variable selection from the p regressors.
For example, if ξj ∼ IG(αj, γj/2) independently (where IG denotes the inverse-gamma density), then
the marginal density for B (after integrating out the ξj's) is proportional to

∏_{j=1}^p ( ||bj(τΣ)^{-1/2}||2^2 + γj )^{-(αj + q/2)},   (4–5)

which corresponds to a multivariate Student's t density.
On the other hand, if π(ξj) ∝ ξj^{q/2-1}(1 + ξj)^{-1}, then the joint density in Eq. 4–4 is proportional to

∏_{j=1}^p ξj^{-1}(1 + ξj)^{-1} exp{ -(1/(2ξj)) ||bj(τΣ)^{-1/2}||2^2 },   (4–6)

and integrating out the ξj's gives a multivariate horseshoe density function.
As the examples in Eq. 4–5 and Eq. 4–6 demonstrate, our model allows us to obtain
sparse estimates of B by inducing row-wise sparsity in B with a matrix-normal scale mixture
using global-local shrinkage priors. This row-wise sparsity also facilitates variable selection from
the p variables.
4.2 Posterior Consistency of MBSP
4.2.1 Notation
We first introduce some notation that will be used throughout this chapter. For any two
sequences of positive real numbers {an} and {bn} with bn ≠ 0, we write an = O(bn) if
|an/bn| ≤ M for all n, for some positive real number M independent of n, and an = o(bn) to
denote lim_{n→∞} an/bn = 0. Therefore, an = o(1) if lim_{n→∞} an = 0.
For a vector v ∈ R^n, ||v||2 := (∑_{i=1}^n vi^2)^{1/2} denotes the ℓ2 norm. For a matrix A ∈ R^{a×b}
with entries aij, ||A||F := (tr(A^⊤A))^{1/2} = (∑_{i=1}^a ∑_{j=1}^b aij^2)^{1/2} denotes the Frobenius norm of A.
For a symmetric matrix A, we denote its minimum and maximum eigenvalues by λmin(A) and
λmax(A) respectively. Finally, for an arbitrary set A, we denote its cardinality by |A|.
4.2.2 Definition of Posterior Consistency
For this section, we denote the number of predictors by pn to emphasize that p depends
on n and is allowed to grow with n. Suppose that the true model is
Yn = XnB0n + En, (4–7)
where Yn := (Yn,1, ..., Yn,q) and En ∼ MN_{n×q}(O, In, Σ). For convenience, we denote B0n as
B0 going forward, noting B0 depends on pn (and therefore on n).
Let {B0}n≥1 be the sequence of true coefficient matrices, and let P0 denote the
distribution of {Yn}n≥1 under Eq. 4–7. Let {πn(Bn)}n≥1 and {πn(Bn|Yn)}n≥1 denote
the sequences of prior and posterior densities for the coefficient matrix Bn. Analogously, let
{Πn(Bn)}n≥1 and {Πn(Bn|Yn)}n≥1 denote the sequences of prior and posterior distributions.
In order to achieve consistent estimation of B0(≡ B0n), the posterior probability that Bn lies
in an ε-neighborhood of B0 should converge to 1 almost surely with respect to P0 measure as
n → ∞. We therefore define strong posterior consistency as follows:
Definition 4.3. (posterior consistency) Let Bn = {Bn : ||Bn − B0||F > ε}, where
ε > 0. The sequence of posterior distributions of Bn under the prior πn(Bn) is said to be strongly
consistent under Eq. 4–7 if, for any ε > 0,

Πn(Bn|Yn) = Πn(||Bn − B0||F > ε | Yn) → 0 a.s. P0 as n → ∞.
Using Definition 4.3, we now state two general theorems and a corollary that provide
general conditions under which priors on B (not just the MBSP model) may achieve strong
posterior consistency in both low-dimensional and ultrahigh-dimensional settings.
4.2.3 Sufficient Conditions for Posterior Consistency
For our theoretical investigation, we assume Σ to be fixed and known and the dimension of
the response variables, q, to be fixed. In practice, Σ is typically unknown, and one can estimate
it from the data. In Section 4.3, we present a fully Bayesian implementation of MBSP by
placing an appropriate inverse-Wishart prior on Σ.
Theorem 4.1 applies to the case where the number of predictors pn diverges to ∞ at a
rate slower than n as n → ∞, while Theorem 4.2 applies to the case where pn grows to ∞
at a faster rate than n as n → ∞. To handle these two cases, we require different sets of
regularity assumptions.
4.2.3.1 Low-Dimensional Case
We first impose the following regularity conditions which are all standard ones used in
the literature and relatively mild (see, for example, Armagan et al. (2013a)). In particular,
Assumption (A2) ensures that X⊤n Xn is positive-definite for all n and that B0 is estimable.
Regularity Conditions
(A1) pn = o(n) and pn ≤ n for all n ≥ 1.
(A2) There exist constants c1, c2 so that

0 < c1 < lim inf_{n→∞} λmin(Xn^⊤Xn/n) ≤ lim sup_{n→∞} λmax(Xn^⊤Xn/n) < c2 < ∞.

(A3) There exist constants d1 and d2 so that

0 < d1 < λmin(Σ) ≤ λmax(Σ) < d2 < ∞.
Using these conditions, we are able to attain a very simple sufficient condition for strong
posterior consistency under Eq. 4–7, as defined in Definition 4.3, which we state in the next
theorem.
Theorem 4.1. Assume that conditions (A1)-(A3) hold. Then the posterior of Bn under any
prior πn(Bn) is strongly consistent under Eq. 4–7, i.e., for Bn = {Bn : ||Bn − B0||F > ε} and
any arbitrary ε > 0,

Πn(Bn|Yn) → 0 a.s. P0 as n → ∞

if

Πn(Bn : ||Bn − B0||F < Δ/n^{ρ/2}) > exp(−kn)   (4–8)

for all 0 < Δ < ε^2 c1 d1^{1/2}/(48 c2^{1/2} d2) and 0 < k < ε^2 c1/(32 d2) − 3Δ c2^{1/2}/(2 d1^{1/2}), where ρ > 0.
Proof. See Appendix C.1.
The condition in Eq. 4–8 in Theorem 4.1 states that as long as the prior distribution for Bn captures
B0 inside a ball of radius Δ/n^{ρ/2} with sufficiently high probability for large n, the posterior of
Bn will be strongly consistent.
4.2.3.2 Ultrahigh Dimensional Case
To achieve posterior consistency when pn ≫ n and pn ≥ O(n), we require additional
restrictions on the eigenstructure of Xn and an additional assumption on the size of the true
model. Working under the assumption of sparsity, we assume that the true model in Eq. 4–7
contains only a few nonzero predictors. That is, most of the rows of B0 should contain only
zero entries. We denote by S∗ ⊂ {1, 2, ..., pn} the set of indices of the rows of B0 with at
least one nonzero entry and let s∗ = |S∗| be the size of S∗. We need the following regularity
conditions.
Regularity Conditions
(B1) pn > n for all n ≥ 1, and log(pn) = O(nd) for some 0 < d < 1.
(B2) The rank of Xn is n.
(B3) Let J denote a set of indices, where J ⊂ {1, ..., pn} such that |J | ≤ n. Let XJ denote
the submatrix of Xn that contains the columns with indices in J . For any such set J ,
there exists a finite constant c1(> 0) so that

lim inf_{n→∞} λmin(XJ^⊤XJ/n) ≥ c1.
(B4) There is a finite constant c2(> 0) so that

lim sup_{n→∞} λmax(Xn^⊤Xn/n) < c2.
(B5) There exist constants d1 and d2 so that

0 < d1 < λmin(Σ) ≤ λmax(Σ) < d2 < ∞.
(B6) S∗ is nonempty for all n ≥ 1, and s∗ = o(n/ log(pn)).
Condition (B1) allows the number of predictors pn to grow at nearly exponential rate.
In particular, pn may grow at a rate of e^{n^d}, where 0 < d < 1. In the high-dimensional
literature, it is a standard assumption that log(pn) = o(n). Condition (B3) assumes that for
any full-rank submatrix XJ of Xn, the minimum eigenvalue of XJ^⊤XJ is asymptotically bounded below by nc1.
This condition is needed to overcome potential identifiability issues, since trivially, the smallest
singular value of Xn is zero. (B4) imposes an asymptotic upper bound on the maximum eigenvalue of
Xn^⊤Xn/n, which poses no issue. Finally, Condition (B6) allows the true model size to grow with n
but at a rate slower than n/log(pn). (B6) is a standard condition that has been used to establish
estimation consistency when pn grows at nearly exponential rate with n for frequentist point
estimators, such as the Dantzig estimator (Candes & Tao (2007)), the scaled LASSO (Sun &
Zhang (2012)), and the LASSO (Tibshirani (1996)). In ultrahigh-dimensional problems, it is
generally agreed that s∗ must be small relative to both p and n in order to attain estimation
consistency and minimax convergence rates, and hence, this restriction on the growth rate of
s∗.
Under these regularity conditions, we are able to attain a simple sufficient condition for
posterior consistency under Eq. 4–7 even when pn grows faster than n. Theorem 4.2 gives the
sufficient condition for strong consistency.
Theorem 4.2. Assume that conditions (B1)-(B6) hold. Then the posterior of Bn under any
prior πn(Bn) is strongly consistent under Eq. 4–7, i.e., for Bn = {Bn : ||Bn − B0||F > ε} and
any arbitrary ε > 0,

Πn(Bn|Yn) → 0 a.s. P0 as n → ∞

if

Πn(Bn : ||Bn − B0||F < Δ/n^{ρ/2}) > exp(−kn)   (4–9)

for all 0 < Δ < ε^2 c1 d1^{1/2}/(48 c2^{1/2} d2) and 0 < k < ε^2 c1/(32 d2) − 3Δ c2^{1/2}/(2 d1^{1/2}), where ρ > 0.

Proof. See Appendix C.1.
Proof. See Appendix C.1.
Similar to Eq. 4–8 in Theorem 4.1, Eq. 4–9 in Theorem 4.2 states that as long as the
prior distribution for Bn captures B0 inside a ball of radius Δ/n^{ρ/2} with sufficiently high
probability for large n, the posterior of Bn will be strongly consistent. To our knowledge,
Theorem 4.2 is the first theorem in the literature to address the issue of ultra high-dimensional
consistency in Bayesian multivariate linear regression. There has been very little theoretical
investigation done in the framework of Bayesian multivariate regression, and the results in this
thesis take a step towards narrowing this theoretical gap.
Now that we have provided simple sufficient conditions for posterior consistency in
Theorems 4.1 and 4.2, we are ready to state our main theorems which demonstrate the power
of the MBSP model in Eq. 4–2 under polynomial-tailed hyperpriors of the form in Eq. 1–7.
4.2.4 Sufficient Conditions for Posterior Consistency of MBSP
We now establish posterior consistency under the MBSP model in Eq. 4–2, assuming that
Σ is fixed and known, q is fixed, and that τ = τn is a tuning parameter that depends on n.
As in Section 4.2.3, we assume that most of the rows of B0 are zero, i.e. that the true
model S∗ ⊂ {1, ..., pn} is small relative to the total number of predictors. As before, we
consider the cases where pn = o(n) and pn ≥ O(n) separately. We also require the following
regularity assumptions which turn out to be sufficient for both the low-dimensional and ultra
high-dimensional cases. Here, b0jk denotes an entry in B0.
Regularity Conditions
(C1) For the slowly varying function L(t) in the priors for ξi, 1 ≤ i ≤ p, in Eq. 1–7,
lim_{t→∞} L(t) ∈ (0,∞). That is, there exists c0(> 0) such that L(t) ≥ c0 for all t ≥ t0,
for some t0 which depends on both L and c0.
(C2) There exists M > 0 so that sup_{j,k} |b0jk| ≤ M < ∞ for all n, i.e. the maximum entry of
B0 is uniformly bounded above in absolute value.
(C3) 0 < τn < 1 for all n, and τn = o(pn^{-1} n^{-ρ}) for ρ > 0.
Remark 1. Condition (C1) is a very mild condition which ensures that L(·) is slowly varying.
Ghosh et al. (2016) established that (C1) holds for L(·) in the TPBN priors (L(ξi) = (1 + 1/ξi)^{-(α+β)}) and the GDP priors (L(ξi) = 2^{-α/2-1} ∫_0^∞ e^{-β(2u/ξi)^{1/2}} e^{-u} u^{(α/2+1)-1} du). The TPBN
family in particular includes many well-known one-group shrinkage priors, such as the horseshoe
prior (α = 0.5, β = 0.5), the Strawderman-Berger prior (α = 1, β = 0.5), and the
normal-exponential-gamma prior (α = 1, β > 0). As remarked by Ghosh & Chakrabarti
(2017), one easily verifies that Assumption (C1) also holds for the inverse-gamma priors
(π(ξi) ∝ ξi^{-α-1} e^{-b/ξi}) and the half-t priors (π(ξi) ∝ (1 + ξi/ν)^{-(ν+1)/2}).
Remark 2. Condition (C2) is a mild condition that bounds the entries of B0 in absolute value
for all n, while (C3) specifies an appropriate rate of decay for τn. It is possible that the upper
bound on the rate for τn can be loosened for individual GL priors. However, since we wish to
encompass all possible priors of the form in Eq. 1–7, we provide a general rate that works for
all the polynomial-tailed priors considered in this thesis.
We are now ready to state our main theorem for posterior consistency of the MBSP model.
Theorem 4.3 (low-dimensional case). Suppose that we have the MBSP model in Eq. 4–2
with hyperpriors of the form in Eq. 1–7. Provided that Assumptions (A1)-(A3) and (C1)-(C3)
hold, our model achieves strong posterior consistency. That is, for any ε > 0,
Πn(Bn : ||Bn − B0||F > ε | Yn) → 0 a.s. P0 as n → ∞.
Proof. See Appendix C.2.
Theorem 4.3 establishes posterior consistency for the MBSP model only when pn = o(n).
We also note that in the low-dimensional setting where pn = o(n), we place no restrictions on
the growth on the number of nonzero predictors in the true model relative to sample size n.
This contrasts with a previous result by Armagan et al. (2013a), who required that the number
of true nonzero covariates grow slower than n/log(n).
In the ultra high-dimensional case where pn ≥ O(n), we can still achieve posterior
consistency under the MBSP model, with additional mild restrictions on the design matrix Xn
and on the size of the true model. Theorem 4.4 deals with the ultra high-dimensional scenario.
Theorem 4.4 (ultra high-dimensional case). Suppose that we have the MBSP model
in Eq. 4–2 with hyperpriors of the form in 1–7. Provided that Assumptions (B1)-(B6) and
(C1)-(C3) hold, our model achieves strong posterior consistency. That is, for any ε > 0,
Πn(Bn : ||Bn − B0||F > ε | Yn) → 0 a.s. P0 as n → ∞.
Proof. See Appendix C.2.
Interestingly enough, to ensure posterior consistency in the ultrahigh-dimensional setting,
the only thing that needs to be controlled is the tuning parameter τn, provided that our
hyperpriors in Eq. 4–2 have the form in Eq. 1–7. However, in the high-dimensional regime,
pn is allowed to grow at nearly exponential rate, and therefore, the rate of decay for τn from
Condition (C3) necessarily needs to be much faster. Intuitively, this makes sense because we
must sum over pnq terms in order to compute the Frobenius normed difference in Theorem
4.4.
Taken together, Theorems 4.3 and 4.4 both provide theoretical justification for the use of
global-local shrinkage priors for multivariate linear regression. Even when we allow the number
of predictors to grow at nearly exponential rate, the posterior distribution under the MBSP
model is able to consistently estimate B0 in Eq. 4–7. Our result is also very general in that a
wide class of shrinkage priors, as indicated in Table 1-1, can be used for the hyperpriors ξi ’s in
Eq. 4–2.
4.3 Implementation of the MBSP Model
In this section, we demonstrate how to implement the MBSP model using the three
parameter beta normal (TPBN) mixture family (Armagan et al. (2011)). We choose the
TPBN family because it is rich enough to generalize several well-known polynomial-tailed
priors. Although we focus on the TPBN family, our model can easily be implemented for other
global-local shrinkage priors (such as the Student’s t prior or the generalized double Pareto
prior) using similar techniques as the ones we describe below.
4.3.1 TPBN Family
A random variable y is said to follow the three parameter beta density, denoted
TPB(u, a, τ), if

π(y) = [Γ(u + a)/(Γ(u)Γ(a))] τ^a y^{a-1} (1 − y)^{u-1} {1 − (1 − τ)y}^{-(u+a)}.
In univariate regression, a global-local shrinkage prior of the form

βi | τ, ξi ∼ N(0, τξi), i = 1, ..., p,
π(ξi) = [Γ(u + a)/(Γ(u)Γ(a))] ξi^{u-1} (1 + ξi)^{-(u+a)}, i = 1, ..., p,   (4–10)

may therefore be represented alternatively as

βi | νi ∼ N(0, νi^{-1} − 1),
νi ∼ TPB(u, a, τ).   (4–11)

After integrating out νi in Eq. 4–11, the marginal prior for βi is said to belong to the TPBN
family. Special cases of Eq. 4–11 include the horseshoe prior (u = 0.5, a = 0.5), the
Strawderman-Berger prior (u = 1, a = 0.5), and the normal-exponential-gamma (NEG) prior
(u = 1, a > 0). By Proposition 1 of Armagan et al. (2011), Eq. 4–10 and Eq. 4–11 can also
be written as a hierarchical mixture of two Gamma distributions,

βi | ψi ∼ N(0, ψi), ψi | ζi ∼ G(u, ζi), ζi ∼ G(a, τ),   (4–12)

where ψi = ξiτ.
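The equivalence between Eq. 4–11 and the Gamma-Gamma hierarchy in Eq. 4–12 can be checked by simulation. The Python sketch below is our own illustration; it assumes the rate parameterization G(shape, rate), under which the marginal of ψi is proportional to ψ^{u-1}(ψ + τ)^{-(u+a)}. For u = a = 0.5 this is the horseshoe case, whose local scale √ξi is standard half-Cauchy.

```python
import numpy as np

rng = np.random.default_rng(1)
u = a = 0.5          # horseshoe special case of the TPBN family
tau = 1.0
N = 200_000

# Gamma-Gamma hierarchy of Eq. 4-12 (rate parameterization assumed):
# zeta_i ~ G(a, rate=tau), then psi_i | zeta_i ~ G(u, rate=zeta_i)
zeta = rng.gamma(shape=a, scale=1.0 / tau, size=N)
psi = rng.gamma(shape=u, scale=1.0 / zeta)

# Direct half-Cauchy representation: psi_i = tau * xi_i, sqrt(xi_i) ~ C+(0, 1)
lam = np.abs(rng.standard_cauchy(N))
psi_hc = tau * lam ** 2

# Both routes should yield the same distribution; compare medians of sqrt(psi)
assert abs(np.median(np.sqrt(psi)) - np.median(np.sqrt(psi_hc))) < 0.05

beta = rng.standard_normal(N) * np.sqrt(psi)   # draws from the marginal TPBN prior
```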
4.3.2 The MBSP-TPBN Model
Taking our MBSP model in Eq. 4–2 with the TPBN family as our chosen prior and
placing an inverse-Wishart conjugate prior on Σ, we can construct a specific variant of the
MBSP model which we term the MBSP-TPBN model. For our theoretical study of MBSP, we
assumed Σ to be known and the dimension of the responses q to be fixed (and thus, q < n
for large n). However, in order for our model to be implemented in finite samples, q can be of
any size (including q ≫ n), provided that the posterior distribution is proper. The use of an
inverse-Wishart prior ensures posterior propriety.
Reparametrizing the variance terms τξi, 1 ≤ i ≤ p, in terms of the ψi's from Eq. 4–12,
the MBSP-TPBN model is as follows:

Y | X, B, Σ ∼ MN_{n×q}(XB, In, Σ),
B | ψ1, ..., ψp, Σ ∼ MN_{p×q}(O, diag(ψ1, ..., ψp), Σ),
ψi | ζi ∼ G(u, ζi) independently, i = 1, ..., p,
ζi ∼ G(a, τ) i.i.d., i = 1, ..., p,
Σ ∼ IW(d, kIq),   (4–13)

where u, a, d, k, and τ are appropriately chosen hyperparameters. The MBSP-TPBN model
can be implemented using the R package MBSP, which is available on the Comprehensive R
Archive Network (CRAN).
4.3.2.1 Computational Details
The full conditional densities under the MBSP-TPBN model in Eq. 4–13 are available in
closed form, and hence, can be implemented straightforwardly using Gibbs sampling. Moreover,
by suitably modifying an algorithm introduced by Bhattacharya et al. (2016) for drawing from
the matrix-normal density in Eq. 4–1, we can significantly reduce the computational complexity
of sampling from the full conditional density for B from O(p3) to O(n2p) when p ≫ n. We
provide technical details for our Gibbs sampling algorithm and our algorithm for sampling
efficiently from the conditional density for B in Appendix D.
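To give a feel for the fast-sampling step, here is a univariate-response sketch in the spirit of the Bhattacharya et al. (2016) algorithm: to draw θ ∼ N(A^{-1}Φ^⊤y, A^{-1}) with A = Φ^⊤Φ + D^{-1}, one samples in the n-dimensional space instead of inverting the p × p matrix A. This Python version is our own illustration (the actual MBSP package works with the matrix-variate conditional for B in R).

```python
import numpy as np

def fast_normal_draw(Phi, D_diag, y, rng):
    """Draw from N(A^{-1} Phi^T y, A^{-1}), A = Phi^T Phi + diag(D_diag)^{-1},
    in O(n^2 p) time by solving only an n x n system."""
    n, p = Phi.shape
    u = rng.standard_normal(p) * np.sqrt(D_diag)       # u ~ N(0, D)
    delta = rng.standard_normal(n)                     # delta ~ N(0, I_n)
    v = Phi @ u + delta
    M = (Phi * D_diag) @ Phi.T + np.eye(n)             # Phi D Phi^T + I_n (n x n)
    w = np.linalg.solve(M, y - v)
    return u + D_diag * (Phi.T @ w)

# Sanity check: the implied posterior mean agrees with the direct p x p solve,
# which is exactly the Woodbury identity underlying the algorithm.
rng = np.random.default_rng(2)
n, p = 20, 50
Phi = rng.standard_normal((n, p))
D_diag = rng.uniform(0.1, 2.0, size=p)
y = rng.standard_normal(n)
mean_fast = D_diag * (Phi.T @ np.linalg.solve((Phi * D_diag) @ Phi.T + np.eye(n), y))
mean_direct = np.linalg.solve(Phi.T @ Phi + np.diag(1.0 / D_diag), Phi.T @ y)
assert np.allclose(mean_fast, mean_direct)
draw = fast_normal_draw(Phi, D_diag, y, rng)
```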
In our experience, with good initial estimates (B^(init), Σ^(init)) for B and Σ, the Gibbs
sampler converges quite quickly, usually within 5000 iterations. In Appendix D, we describe
how to initialize (B^(init), Σ^(init)) and also provide history plots of the draws
from the Gibbs sampler for individual coefficients of B from experiment 5 (n = 100, p = 500,
q = 3) and experiment 6 (n = 150, p = 1000, q = 4) of our simulation studies in Section
4.4.1, which illustrate rapid convergence.
Although our algorithm is efficient, Gibbs sampling can still be prohibitive if p is
extremely large (say, on the order of millions). In this case, we recommend first screening
the p covariates based on the magnitude of their marginal correlations with the responses
(y1, ... , yq) and then implementing the MBSP model on the reduced subset of covariates.
This marginal screening technique for dimension reduction has long been advocated for
ultrahigh-dimensional problems, even for non-Bayesian approaches (e.g., Fan & Lv (2008),
Fan & Song (2010)). Faster alternatives to MCMC to handle extremely large p are also worth
exploring in the future.
4.3.2.2 Specification of Hyperparameters τ , d , and k
Just as in Eq. 4–2, the τ in Eq. 4–13 continues to act as a global shrinkage parameter.
A natural question is how to specify an appropriate value for τ . Armagan et al. (2011)
recommend setting τ to the expected level of sparsity. Given our theoretical results in
Theorems 4.3 and 4.4, we set τ ≡ τn = 1/(p√n log n). This choice of τ satisfies the sufficient
conditions for posterior consistency in both the low-dimensional and the high-dimensional
settings when Σ is fixed and known.
In order to specify the hyperparameters d and k in the IW(d, kIq) prior for Σ, we appeal
to the arguments made by Brown et al. (1998). As noted by Brown et al. (1998), if we set
d = 3, then Σ has a finite first moment, with E(Σ) = k/(d − 2) Iq = kIq. Additionally, as argued
in Bhadra & Mallick (2013) and Brown et al. (1998), k should a priori be comparable in size
with the likely variances of Y given X. Accordingly, we take our initial estimate of B from the
Gibbs sampler, B^(init) (specified in Section 4.3.2.1), and take k as the variance of the residuals,
Y − XB^(init).
4.3.3 Variable Selection
Although the MBSP model in Eq. 4–2 and the MBSP-TPBN model in Eq. 4–13 produce
robust estimates for B, they do not produce exact zeros. In order to use the MBSP-TPBN
model for variable selection, we recommend looking at the 95% credible intervals for each entry
bij in row i and column j . If the credible intervals for every single entry in row i , 1 ≤ i ≤ p,
contain zero, then we classify predictor i as an irrelevant predictor. If at least one credible
interval in row i , 1 ≤ i ≤ p does not contain zero, then we classify i as an active predictor.
The empirical performance of this variable selection method seems to work well, as shown in
Section 4.4.
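The row-wise selection rule just described is easy to express given posterior draws of B. Below is a minimal Python sketch (the function and variable names are our own, not part of the MBSP package):

```python
import numpy as np

def select_active_rows(B_draws, level=0.95):
    """Classify each of the p predictors as active/irrelevant from posterior draws.

    B_draws: array of shape (n_draws, p, q) of posterior samples of B.
    A predictor (row) is active if at least one of its q entrywise
    credible intervals excludes zero.
    """
    lo = np.quantile(B_draws, (1 - level) / 2, axis=0)       # (p, q) lower bounds
    hi = np.quantile(B_draws, 1 - (1 - level) / 2, axis=0)   # (p, q) upper bounds
    excludes_zero = (lo > 0) | (hi < 0)                      # entrywise decision
    return excludes_zero.any(axis=1)                         # (p,) row-wise decision

# Toy check: row 0 centered far from zero, row 1 centered at zero.
rng = np.random.default_rng(3)
draws = rng.standard_normal((4000, 2, 3)) * 0.1
draws[:, 0, 0] += 5.0
active = select_active_rows(draws)
assert active.tolist() == [True, False]
```

With, say, 10,000 retained Gibbs draws stored as an array of shape (10000, p, q), `select_active_rows` returns the length-p vector of active/irrelevant decisions.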
4.4 Simulations and Data Analysis
4.4.1 Simulation Studies
For our simulation studies, we implement the MBSP-TPBN model in Eq. 4–13 using our
R package MBSP. We specify u = 0.5, a = 0.5 so that the polynomial-tailed prior that we
utilize is the horseshoe prior. The horseshoe is known to perform well in simulations (Carvalho
et al. (2010); van der Pas et al. (2014)). We set τ = 1/(p√n log n), d = 3, and k comparable to the
size of the likely variance of Y given X.
In all of our simulations, we generate data from the multivariate linear regression model
in Eq. 1–25 as follows. The rows of the design matrix X are independently generated from a
p-variate normal distribution with mean 0 and a covariance matrix whose (i, j)th entry is
0.5^{|i−j|}. The sparse p × q matrix B is generated by
first randomly selecting an active set of predictors, A ⊂ {1, 2, ..., p}. For rows with indices in
the set A, we independently draw every row element from Unif([−5, −0.5] ∪ [0.5, 5]). All the
other rows of B, i.e. those with indices in A^C, are then set equal to zero. Finally, the rows of the noise matrix E are
independently generated from Nq(0, Σ), where Σ = (Σij)q×q with Σij = σ^2(0.5)^{|i−j|}, σ^2 = 2.
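The data-generating recipe above can be sketched as follows. This is a Python illustration of the simulation design (the helper function and its names are ours), not the code used to produce Table 4-1.

```python
import numpy as np

def generate_mbsp_data(n, p, q, n_active, sigma2=2.0, rng=None):
    """Simulate (X, B, Y) following the simulation design of Section 4.4.1 (a sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    # AR(1)-type covariances with (i, j)th entry 0.5^{|i-j|}
    cov_x = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    cov_e = sigma2 * 0.5 ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))
    X = rng.multivariate_normal(np.zeros(p), cov_x, size=n)
    # Row-sparse B: active rows drawn from Unif([-5, -0.5] U [0.5, 5])
    B = np.zeros((p, q))
    active = rng.choice(p, size=n_active, replace=False)
    signs = rng.choice([-1, 1], size=(n_active, q))
    B[active] = signs * rng.uniform(0.5, 5.0, size=(n_active, q))
    E = rng.multivariate_normal(np.zeros(q), cov_e, size=n)
    return X, B, X @ B + E, active

X, B, Y, active = generate_mbsp_data(n=60, p=30, q=3, n_active=5,
                                     rng=np.random.default_rng(4))
```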
We consider six different simulation settings with varying levels of sparsity.
• Experiment 1 (p < n): n = 60, p = 30, q = 3, 5 active predictors (sparse model).
• Experiment 2 (p < n): n = 80, p = 60, q = 6, 40 active predictors (dense model).
• Experiment 3 (p > n): n = 50, p = 200, q = 5, 20 active predictors (sparse model).
• Experiment 4 (p > n): n = 60, p = 100, q = 6, 40 active predictors (dense model).
• Experiment 5 (p ≫ n): n = 100, p = 500, q = 3, 10 active predictors (ultra-sparse model).
• Experiment 6 (p ≫ n): n = 150, p = 1000, q = 4, 50 active predictors (sparse model).
The Gibbs sampler described in Section 4.3.2.1 is efficient in handling the two p ≫ n setups
in experiments 5 and 6. Running on an Intel Xeon E5-2698 v3 processor, the Gibbs sampler
runs about 761 iterations per minute for Experiment 5 and about 134 iterations per minute for
Experiment 6. In all our experiments, we run Gibbs sampling for 15,000 iterations, discarding
the first 5000 iterations as burn-in.
As our point estimate for B, we take the posterior median B̂ = (b̂ij)p×q. To perform
variable selection, we inspect the 95% individual credible interval for every entry and classify
predictors as irrelevant if all of the q intervals in that row contain 0, as described in Section
4.3.3. We compute mean squared errors (MSEs) rescaled by a factor of 100, as well as the
false discovery rate (FDR), false negative rate (FNR), and overall misclassification probability
(MP) as follows:

MSEest = 100 × ||B̂ − B||F^2/(pq),
MSEpred = 100 × ||XB̂ − XB||F^2/(nq),
FDR = FP/(TP + FP),
FNR = FN/(TN + FN),
MP = (FP + FN)/(pq),
where FP, TP, FN, and TN denote the number of false positives, true positives, false negatives,
and true negatives respectively.
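Given the selected and true active sets, these selection metrics are straightforward to compute. The Python sketch below (names ours) follows the definitions exactly as written above, including FNR = FN/(TN + FN) and the pq denominator for MP.

```python
import numpy as np

def selection_metrics(selected, truth, p, q):
    """FDR, FNR, and MP per the definitions above; boolean arrays of length p."""
    tp = np.sum(selected & truth)      # true positives
    fp = np.sum(selected & ~truth)     # false positives
    fn = np.sum(~selected & truth)     # false negatives
    tn = np.sum(~selected & ~truth)    # true negatives
    fdr = fp / (tp + fp) if (tp + fp) > 0 else 0.0
    fnr = fn / (tn + fn) if (tn + fn) > 0 else 0.0
    mp = (fp + fn) / (p * q)
    return fdr, fnr, mp

sel = np.array([True, True, False, False, True])
tru = np.array([True, False, False, True, True])
fdr, fnr, mp = selection_metrics(sel, tru, p=5, q=2)
```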
We compare the performance of the MBSP-TPBN estimator with that of four other
row-sparse estimators of B. An alternative Bayesian approach based on the spike-and-slab
formulation is studied. Namely, we consider the multivariate Bayesian group lasso posterior
median estimator with a spike-and-slab prior (MBGL-SS), introduced by Liquet et al. (2017),
which applies a spike-and-slab prior with a point mass 0mgq for the gth group of covariates,
which corresponds to mg rows of B. When the grouping structure of the covariates is not
available, we can still utilize the MBGL-SS method by applying the spike-and-slab prior to
each individual row of B. In our study, we consider each predictor as its own “group” (i.e.,
mg = 1, g = 1, ... , p) so that individual rows are shrunk to 0⊤q . This method can be
implemented in R using the MBSGS package.
In addition, we compare the performance of MBSP-TPBN to three frequentist
point estimators obtained through regularization penalties on the rows of B. The R package
glmnet (Friedman et al. (2010)) provides an option to fit the following model to multivariate
data, which we call the multivariate lasso (MLASSO) method:

B̂_MLASSO = argmin_{B ∈ R^{p×q}} ( ||Y − XB||_F² + λ Σ_{j=1}^{p} ||b_j||_2 ).

The MLASSO criterion penalizes the ℓ2 norm of each row of B (an ℓ1-type penalty over the
row norms), which shrinks entire row estimates to 0_q⊤. We also compare the MBSP-TPBN
estimator to the row-sparse reduced-rank regression (SRRR) estimator, introduced by
Chen & Huang (2012), which uses an adaptive group lasso penalty on the rows of B but
further constrains the solution to be rank-deficient. Finally, we compare our method to the
sparse partial least squares (SPLS) estimator, introduced by Chun & Keleş (2010). SPLS
combines partial least squares (PLS) regression with a regularization penalty on the rows of B
in order to obtain a row-sparse PLS estimate of B. The SRRR and SPLS methods are available
in the R packages rrpack and spls, respectively.
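glmnet is an R package, but the same row-penalized criterion is available in other libraries; for instance, scikit-learn's MultiTaskLasso minimizes an equivalent mixed ℓ1/ℓ2 row penalty (up to a 1/(2n) scaling of the squared-error term). The following is our own illustrative sketch on simulated data, not part of the study:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

# Simulated data: only the first 5 of p = 30 predictors are active.
rng = np.random.default_rng(0)
n, p, q = 60, 30, 3
X = rng.standard_normal((n, p))
B = np.zeros((p, q))
B[:5, :] = rng.uniform(1.0, 2.0, size=(5, q))
Y = X @ B + rng.standard_normal((n, q))

# MultiTaskLasso minimizes ||Y - XB||_F^2 / (2n) + alpha * sum_j ||b_j||_2,
# the same mixed l1/l2 row penalty as the MLASSO criterion above.
fit = MultiTaskLasso(alpha=0.5).fit(X, Y)
B_hat = fit.coef_.T          # scikit-learn stores coefficients as (q, p)
# Rows with nonzero norm are the selected predictors; the five true
# signal rows should be among them for this signal strength.
active_rows = np.flatnonzero(np.linalg.norm(B_hat, axis=1) > 0)
```

Because the penalty acts on whole rows, an entire predictor's coefficient vector is either zeroed out or retained, mirroring the row-sparsity targeted by the methods compared here.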
Table 4-1 shows the results, averaged across 100 replications, for the MBSP-TPBN
model in Eq. 4–13, compared with MBGL-SS, MLASSO, SRRR, and SPLS. As the results illustrate,
the Bayesian methods tend to outperform the frequentist ones in the low-dimensional case
where p < n. In the two low-dimensional experiments (Experiments 1 and 2), the MBGL-SS
estimator performs the best across all of our performance metrics, with the MBSP-TPBN
model following closely behind.
However, in all the high-dimensional (p > n) settings, MBSP-TPBN significantly
outperforms all of its competitors. Table 4-1 shows that the MBSP-TPBN model has a lower
MSEest than the other four methods in experiments 3 through 6. In experiments 5 and 6 (the
Table 4-1. Simulation results for MBSP-TPBN, compared with MBGL-SS, MLASSO, SRRR, and SPLS, averaged across 100 replications.

Experiment 1: n = 60, p = 30, q = 3, 5 active predictors (sparse model).
Method    MSEest   MSEpred  FDR     FNR     MP
MBSP      1.146    24.842   0.015   0       0.003
MBGL-SS   0.718    17.074   0.005   0       0.001
MLASSO    2.181    41.424   0.6412  0       0.335
SRRR      1.646    29.256   0.3270  0       0.128
SPLS      2.428    43.879   0.1093  0.0019  0.028

Experiment 2: n = 80, p = 60, q = 6, 40 active predictors (dense model).
Method    MSEest   MSEpred  FDR     FNR   MP
MBSP      5.617    104.88   0.0034  0     0.0023
MBGL-SS   5.202    101.40   0.0007  0     0.0005
MLASSO    10.478   130.90   0.3307  0     0.330
SRRR      5.695    104.67   0.0491  0     0.038
SPLS      244.136  3633.77  0.2071  0     0.223

Experiment 3: n = 50, p = 200, q = 5, 20 active predictors (sparse model).
Method    MSEest  MSEpred  FDR     FNR    MP
MBSP      1.357   117.52   0.0117  0      0.0013
MBGL-SS   57.25   694.81   0.858   0.02   0.619
MLASSO    8.400   169.026  0.7758  0      0.349
SRRR      17.46   161.70   0.698   0      0.307
SPLS      48.551  2006.03  0.422   0.033  0.103

Experiment 4: n = 60, p = 100, q = 6, 40 active predictors (dense model).
Method    MSEest  MSEpred  FDR     FNR     MP
MBSP      11.030  172.89   0.0266  0       0.0114
MBGL-SS   204.33  318.80   0.505   0.1265  0.415
MLASSO    44.635  188.81   0.544   0       0.479
SRRR      242.67  193.64   0.594   0       0.587
SPLS      213.19  3909.07  0.135   0.0005  0.005

Experiment 5: n = 100, p = 500, q = 3, 10 active predictors (ultra-sparse model).
Method    MSEest  MSEpred  FDR     FNR     MP
MBSP      0.0374  12.888   0.064   0       0.0015
MBGL-SS   1.327   155.51   0.483   0.0005  0.092
MLASSO    0.2357  75.961   0.837   0       0.115
SRRR      0.9841  49.428   0.688   0       0.104
SPLS      0.3886  138.62   0.1355  0.0005  0.005

Experiment 6: n = 150, p = 1000, q = 4, 50 active predictors (sparse model).
Method    MSEest  MSEpred  FDR     FNR      MP
MBSP      0.0155  8.934    0.0025  0.00003  0.00016
MBGL-SS   1.327   155.51   0.483   0.0005   0.092
MLASSO    1.982   181.95   0.810   0        0.214
SRRR      0.9841  49.428   0.688   0        0.104
SPLS      25.560  8631.92  0.420   0.021    0.051
p ≫ n scenarios), the MSEest and MSEpred are both much lower for the MBSP-TPBN model
than for the other methods.
Additionally, using the 95% credible interval technique in Section 4.3.3 to perform variable
selection, the FDR and the overall MP are also consistently low for the MBSP-TPBN model.
Even when the true underlying model is not sparse, as in Experiments 2 and 4, MBSP performs
very well and correctly identifies most of the signals. In both of the ultrahigh-dimensional
settings considered in Experiments 5 and 6, the other four methods report high FDRs, while
MBSP's FDR remains very small.
In short, our experimental results show that the MBSP model in Eq. 4–2 has excellent
finite sample performance for both estimation and selection, is robust to non-sparse situations,
and scales very well to large p compared to the other methods. In addition to its strong
empirical performance, the MBSP model (as well as the MBGL-SS model) provides a vehicle
for uncertainty quantification through the posterior credible intervals.
4.4.2 Yeast Cell Cycle Data Analysis
We illustrate the MBSP methodology on a yeast cell cycle data set. This data set was
first analyzed by Chun & Keleş (2010) and is available in the spls package in R. Transcription
factors (TFs) are sequence-specific DNA binding proteins which regulate the transcription
of genes from DNA to mRNA by binding specific DNA sequences. In order to understand
their role as a regulatory mechanism, one often wishes to study the relationship between TFs
and their target genes at different time points. In this yeast cell cycle data set, mRNA levels
were measured every 7 minutes over a duration of 119 minutes, giving 18 distinct time points.
The 542 × 18 response matrix Y consists of 542 cell-cycle-regulated genes from an α factor
arrest method, with columns corresponding to the mRNA levels at the 18 time points. The
542 × 106 design matrix X contains the binding information for a total of 106 TFs.
In practice, many of the TFs are not actually related to the genes, so our aim is to recover
a parsimonious model containing only a small number of truly significant TFs. To
Table 4-2. Results for the analysis of the yeast cell cycle data set. The MSPE has been scaled by a factor of 100. In particular, all five models selected the three TFs ACE2, SWI5, and SWI6 as significant.

Method    Number of Proteins Selected  MSPE
MBSP      12                           18.673
MBGL-SS   7                            20.093
MLASSO    78                           17.912
SRRR      44                           18.204
SPLS      44                           18.904
perform variable selection, we fit the MBSP-TPBN model in Eq. 4–13 and then use the 95%
credible interval method described in Section 4.3.3. Beyond identifying significant TFs, we
assess the predictive performance of the MBSP-TPBN model with five-fold cross-validation,
using 80 percent of the data as a training set to obtain an estimate of B. We take the
posterior median B̂_train = (b̂_ij)_train and use it to compute the mean squared error of the
residuals on the remaining 20 percent of held-out data. We repeat this five times, using a
different training/test split each time, and take the average MSE as our mean squared
prediction error (MSPE). For clarity, we scale the MSPE by a factor of 100.
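The cross-validation scheme just described can be sketched generically as follows (our own illustration; `fit_fn` is a hypothetical stand-in for whichever estimator is being evaluated):

```python
import numpy as np

def mspe_cv(X, Y, fit_fn, n_folds=5, seed=1, scale=100.0):
    """K-fold cross-validated mean squared prediction error, scaled by 100
    as in the text.  `fit_fn(X_train, Y_train)` must return an estimated
    coefficient matrix of shape (p, q)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        B_hat = fit_fn(X[train], Y[train])
        resid = Y[test] - X[test] @ B_hat
        errors.append(np.mean(resid ** 2))
    return scale * float(np.mean(errors))

# Example with ordinary least squares as the fitting rule.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
B = rng.standard_normal((4, 2))
Y = X @ B + 0.1 * rng.standard_normal((50, 2))
ols = lambda Xt, Yt: np.linalg.lstsq(Xt, Yt, rcond=None)[0]
mspe = mspe_cv(X, Y, ols)   # roughly 100 times the noise variance of 0.01
```

Any of the Bayesian or frequentist estimators in the comparison could be plugged in as `fit_fn` in place of the OLS rule used here for illustration.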
Table 4-2 shows our results compared with the MBGL-SS, MLASSO, SRRR, and
SPLS methods. MBSP-TPBN selects 12 of the 106 TFs as significant, so we do recover a
parsimonious model. All five methods selected the TFs ACE2, SWI5, and SWI6. The two
Bayesian methods recover much sparser models than the frequentist methods.
In particular, the MLASSO method has the lowest MSPE, but it selects 78 of the 106 TFs as
significant, suggesting that there may be overfitting in spite of the regularization penalty on
the rows of B. Our results suggest that the frequentist methods may have good predictive
performance on this particular data set, but at the expense of parsimony. In practice, sparse
models are preferred for the sake of interpretability, and our numerical results illustrate that the
MBSP model recovers a sparse model with competitive predictive performance.
Finally, Figure 4-1 illustrates the posterior median estimates and the 95% credible bands
for four of the 12 TFs that were selected as significant by the MBSP-TPBN model. These
[Figure 4-1: four panels titled ACE2, HIR1, NDD1, and SWI6, each plotting estimated coefficients against time; x-axis 0 to 120 minutes, y-axis −0.6 to 0.6.]
Figure 4-1. Plots of the estimates and 95% credible bands for four of the 12 TFs that were deemed significant by the MBSP-TPBN model. The x-axis indicates time (minutes) and the y-axis indicates the estimated coefficients.
plots illustrate that the standard errors under the MBSP-TPBN model are not too large. One
of the potential drawbacks of using credible intervals for selection is that the intervals may
be too conservative, but we see that this is not the case here. These plots, combined with our
earlier simulation results and our data analysis results, provide empirical evidence for using the
MBSP model for estimation and variable selection. However, further theoretical investigation
is warranted in order to justify the use of marginal credible intervals for variable selection. In
particular, van der Pas et al. (2017b) showed that marginal credible intervals may provide
overconfident uncertainty statements for certain large signal values when applied to estimating
normal mean vectors, and the same issue could be present here.
4.5 Concluding Remarks
In this chapter, we have introduced a method for sparse multivariate Bayesian estimation
with shrinkage priors (MBSP). Previously, polynomial-tailed GL shrinkage priors of the form
given in Eq. 1–5 have mainly been used in univariate regression or in the estimation of normal
mean vectors. In this thesis, we have extended the use of polynomial-tailed priors to the
multivariate linear regression framework.
We have made several important contributions to both methodology and theory. First, our
model may be used for sparse multivariate estimation for p, n, and q of any size. To motivate
the MBSP model, we have shown that the posterior distribution can consistently estimate B in
Eq. 1–25 in both the low-dimensional and ultrahigh-dimensional settings where p is allowed to
grow nearly exponentially with n (with the response dimension q fixed). To our knowledge,
Theorem 4.2 gives the first general sufficient conditions for strong posterior consistency in
Bayesian multivariate linear regression models when p > n and log(p) = o(n).
Moreover, our method is general enough to encompass a large family of heavy-tailed priors,
including the Student’s-t prior, the horseshoe prior, the generalized double Pareto prior, and
others.
The MBSP model in Eq. 4–2 can be implemented using straightforward Gibbs sampling.
We implemented a fully Bayesian version of it with an appropriate prior on Σ and with
polynomial-tailed priors belonging to the TPBN family, using the horseshoe prior as a special
case. By examining the 95% posterior credible intervals for the elements in each row of B
under its posterior distribution, we also showed how one could use the MBSP model
for variable selection. Through simulations and data analysis on a yeast cell cycle data set, we
have illustrated that our model has excellent performance in finite samples for both estimation
and variable selection.
CHAPTER 5
SUMMARY AND FUTURE WORK
5.1 Summary
In recent years, Bayesian scale-mixture shrinkage priors have gained a great amount of
attention because of their computational efficiency and their ability to mimic point-mass
mixtures in obtaining sparse estimates. This thesis contributes to this large body of
methodological and theoretical work.
In Chapter 1, we surveyed the literature on sparse normal means estimation, Bayesian
multiple hypothesis testing, and sparse univariate and multivariate linear regression. In
Chapter 2, we introduced the inverse gamma-gamma (IGG) prior for estimation of sparse
noisy vectors. This prior has a number of attractive theoretical properties, including minimax
posterior contraction and super-efficient convergence in the Kullback-Leibler sense. In Chapter
3, we introduced a thresholding rule for signal detection based on the IGG posterior and
demonstrated that our procedure has the Bayes Oracle property for multiple hypothesis testing.
Finally, in Chapter 4, we introduced the multivariate Bayesian model with shrinkage priors
(MBSP) which uses global-local shrinkage priors for sparse multivariate linear regression. The
MBSP model recovers a row-sparse estimate of the unknown p × q coefficient matrix B
and consistently estimates the true B even when the number of predictors grows at nearly
exponential rate with sample size.
5.2 Future Work
5.2.1 Extensions of the Inverse Gamma-Gamma Prior
There are a number of possible extensions and further investigations for the inverse
gamma-gamma prior. In Chapters 2 and 3, the hyperparameters for the IGG prior were set
as (a, b) = (1/2 + 1/n, 1/n), based on our findings about the theoretical behavior of the posterior.
However, we could investigate if taking empirical Bayes estimates of (a, b) can boost the
IGG’s performance. Recently, van der Pas et al. (2017a) found that the performance of the
horseshoe can be improved if the global parameter τ is taken to be the value τ̂ that maximizes
the marginal likelihood over the interval [1/n, 1]. It is possible that a similar procedure
for setting (a, b) would lead to even better performance and better adaptivity to the underlying
sparsity for the IGG.
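As a rough illustration of such a procedure, the marginal maximum likelihood estimate of the horseshoe's global parameter can be approximated by numerical integration and a grid search over [1/n, 1]. This sketch reflects our own simplifying choices of grid and quadrature, not the exact estimator of van der Pas et al. (2017a):

```python
import numpy as np
from scipy.integrate import quad

def horseshoe_marginal(x, tau):
    """Marginal density of x ~ N(theta, 1) with theta ~ horseshoe(tau),
    obtained by numerically integrating out the half-Cauchy local scale."""
    def integrand(lam):
        v = 1.0 + (tau * lam) ** 2
        normal = np.exp(-x ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
        half_cauchy = (2.0 / np.pi) / (1.0 + lam ** 2)
        return normal * half_cauchy
    # Split at lam ~ |x|/tau so the adaptive rule does not miss the
    # region where a large observation concentrates the integrand's mass.
    cut = 1.0 + abs(x) / tau
    part1, _ = quad(integrand, 0.0, cut)
    part2, _ = quad(integrand, cut, np.inf)
    return part1 + part2

def tau_mmle(x, grid_size=40):
    """Grid-search approximation to the marginal maximum likelihood
    estimate of tau over [1/n, 1]."""
    n = len(x)
    taus = np.geomspace(1.0 / n, 1.0, grid_size)
    loglik = [sum(np.log(horseshoe_marginal(xi, t)) for xi in x) for t in taus]
    return taus[int(np.argmax(loglik))]

# A sparse vector: mostly noise-level entries plus two large signals.
x = np.array([0.1, -0.2, 0.05, 0.3, -0.1, 0.0, 0.15, -0.25, 6.0, 7.5])
tau_hat = tau_mmle(x)
```

An analogous grid search over (a, b) could be used to explore empirical Bayes tuning of the IGG hyperparameters.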
While we have proven that the IGG posterior achieves the (near) minimax rate of
contraction for nearly black vectors and concentrates at a super-efficient rate in the
Kullback-Leibler sense, we have not proven any results about uncertainty quantification
using the IGG. It is unknown whether the credible balls and marginal credible sets of size
1 − α, α ∈ (0, 1), under the IGG posterior have good frequentist coverage and optimal
size. The literature on uncertainty quantification for scale-mixture shrinkage priors seems
underdeveloped as of now. Some work has recently been done on this for the horseshoe prior
by van der Pas et al. (2017b), who give conditions under which the credible balls and
marginal credible intervals constructed from the horseshoe posterior give
correct frequentist coverage in the normal means problem. It would be very interesting to see if
the IGG can achieve similar optimality for uncertainty quantification under milder conditions.
Recently, the “mild dimension” scenario for the normal means problem, where q_n/n → c,
c > 0, as n → ∞, has also been of great interest, but Bayesian development in this area
has been slow. Given its excellent performance even in dense settings, it may be worthwhile
to conduct theoretical analysis of the IGG prior’s properties and behavior under moderate
dimensions.
Under the IGG prior, the components are a posteriori independent and therefore separable.
Despite the absence of a data-dependent global parameter, the IGG model adapts well to
sparsity, performing well under both sparse and dense settings. However, several authors
such as Carvalho et al. (2010) have argued that Bayesian models adapt to the underlying
sparsity far better when they include global parameters with priors placed on them. In light
of these arguments, we could investigate if theoretical and empirical performance can be
improved further by incorporating a global parameter into the IGG framework and creating a
non-separable variant of the IGG.
The IGG prior can also be extended to other statistical problems besides the normal means
problem and multiple testing. For example, we could adapt the IGG prior for sparse covariance
estimation, variable selection with covariates, and multiclass classification. We conjecture that
the IGG would satisfy many optimality properties (e.g. model selection consistency, optimal
posterior contraction, etc.) if it were utilized in these other contexts.
5.2.2 Extensions to Bayesian Multivariate Linear Regression with Shrinkage Priors
In Chapter 4, we demonstrated that the MBSP model could achieve posterior consistency
in both low-dimensional (p = o(n)) and ultrahigh-dimensional (log p = o(n)) settings. The
next step is to quantify the posterior contraction rate. In the multivariate linear regression
framework, we say that the posterior distribution contracts at the rate r_n if

Π(||B_n − B_0||_F > M_n r_n | Y_n) → 0 a.s. P_0 as n → ∞,
for every Mn → ∞ as n → ∞. In the context of high-dimensional univariate regression,
several authors (e.g., Castillo et al. (2015) and Ročková & George (2016)) have attained
optimal posterior contraction rates of O(√(s log p / n)) with respect to the ℓ1 and ℓ2 norms
(where s denotes the number of active predictors). It is worth noting that √(s log p / n) is the
familiar minimax rate of convergence under squared error loss for a number of frequentist point
and the Dantzig selector (Candes & Tao (2007)). We conjecture that under suitable regularity
conditions and compatibility conditions on the design matrix, the MBSP model can attain a
similarly optimal posterior rate of contraction.
Additionally, we could investigate if posterior consistency and optimal posterior contraction
rates can be achieved if we allow the number of response variables q to diverge to infinity
in the MBSP model. From an implementation standpoint, q can be of any size, but for our
theoretical investigation of the MBSP model, we assumed q to be fixed. If q is allowed to grow
as sample size grows, then some sort of sparsity assumption for the response variables may
need to be imposed. We surmise that novel techniques would also be needed to prove posterior
consistency in this scenario, since the distributional theory we used to prove our consistency
results may not apply if q is no longer fixed.
Extension of our posterior consistency results to the case where Σ is unknown and
endowed with a prior also remains an open problem. In this case, we need to integrate out
Σ in order to work with the marginal density of the prior on B. If we assume the standard
inverse-Wishart prior on Σ, this gives rise to a matrix-variate t distribution. Handling this
density is very nontrivial and would require significantly different techniques than the ones
we used to establish posterior consistency in Chapter 4. Nevertheless, this warrants future
investigation.
For variable selection with the MBSP model, we relied on the post hoc method of
examining the 95% credible intervals for each entry of the estimated coefficient matrix B.
Further theoretical justification for this selection method is needed. Other possible thresholding
rules should also be investigated. Because scale-mixture shrinkage priors place zero probability
at exactly zero, we must necessarily use thresholding to perform variable selection. How to
optimally choose this threshold (or thresholds) in high-dimensional settings remains an active
area of research.
Finally, in the wider context of multivariate analysis, we could also investigate the
use of global-local shrinkage priors for reduced rank regression (RRR) or partial least
squares regression (PLS). While there has been a great deal of work on RRR and PLS in
the frequentist framework, Bayesian methodological and theoretical developments in these
areas have been rather sparse.
All the aforementioned are very important open problems in Bayesian multivariate linear
regression, and we hope that the methodology and theory introduced in this thesis can serve as
the foundation for further developments in this area.
APPENDIX A
PROOFS FOR CHAPTER 2
In this Appendix, we provide proofs of all the propositions, lemmas, and theorems in Chapter 2.
A.1 Proofs for Section 2.1
Proof of Proposition 2.1. The joint density of the prior is proportional to

π(θ, ξ, λ) ∝ (λξ)^{−1/2} exp( −θ²/(2λξ) ) λ^{−a−1} exp(−1/λ) ξ^{b−1} exp(−ξ)
           ∝ ξ^{b−3/2} exp(−ξ) λ^{−a−3/2} exp( −(θ²/(2ξ) + 1)(1/λ) ).

Thus,

π(θ, ξ) ∝ ξ^{b−3/2} exp(−ξ) ∫_0^∞ λ^{−a−3/2} exp( −(θ²/(2ξ) + 1)(1/λ) ) dλ
        ∝ ( θ²/(2ξ) + 1 )^{−(a+1/2)} ξ^{b−3/2} e^{−ξ},

and thus the marginal density of θ is proportional to

π(θ) ∝ ∫_0^∞ ( θ²/(2ξ) + 1 )^{−(a+1/2)} ξ^{b−3/2} e^{−ξ} dξ.     (A–1)
As |θ| → 0, the expression in Eq. A–1 is bounded below by

C ∫_0^∞ ξ^{b−3/2} e^{−ξ} dξ,     (A–2)
where C is a constant that depends on a and b. The integral expression in Eq. A–2 clearly
diverges to ∞ for any 0 < b ≤ 1/2. Therefore, Eq. A–1 diverges to infinity as |θ| → 0, by the
monotone convergence theorem.
Proof of Theorem 2.1. From Eq. 2–4, the posterior distribution of κ_i under IGG_n is
proportional to

π(κ_i | X_i) ∝ exp( −κ_i X_i²/2 ) κ_i^{a−1/2} (1−κ_i)^{b_n−1},  κ_i ∈ (0, 1).     (A–3)
Since exp(−κ_i X_i²/2) is strictly decreasing in κ_i on (0, 1), we have

E(1−κ_i | X_i) = [∫_0^1 κ_i^{a−1/2} (1−κ_i)^{b_n} exp(−κ_i X_i²/2) dκ_i] / [∫_0^1 κ_i^{a−1/2} (1−κ_i)^{b_n−1} exp(−κ_i X_i²/2) dκ_i]
  ≤ e^{X_i²/2} [∫_0^1 κ_i^{a−1/2} (1−κ_i)^{b_n} dκ_i] / [∫_0^1 κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i]
  = e^{X_i²/2} [Γ(a+1/2) Γ(b_n+1) / Γ(a+b_n+3/2)] × [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))]
  = e^{X_i²/2} b_n / (a + b_n + 1/2).
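The bound just derived is easy to check numerically. The following sketch is our own (not part of the thesis); it uses SciPy quadrature with an algebraic endpoint weight to handle the (1−κ)^{b_n−1} singularity, and verifies E(1−κ | x) ≤ e^{x²/2} b_n/(a + b_n + 1/2) at a few points:

```python
import numpy as np
from scipy.integrate import quad

def post_mean_shrinkage(x, a, b):
    """E(1 - kappa | x) for the posterior density proportional to
    kappa^(a-1/2) (1-kappa)^(b-1) exp(-kappa x^2/2) on (0, 1).  quad's
    'alg' weight (x-0)^alpha * (1-x)^beta absorbs the integrable
    endpoint singularity of (1-kappa)^(b-1)."""
    f = lambda k: np.exp(-k * x ** 2 / 2)
    num, _ = quad(f, 0, 1, weight='alg', wvar=(a - 0.5, b))      # extra (1-kappa)
    den, _ = quad(f, 0, 1, weight='alg', wvar=(a - 0.5, b - 1))
    return num / den

a, b = 0.6, 0.05   # a > 1/2 and a small b_n, as in the theorem's setting
for x in (0.5, 1.0, 2.0, 3.0):
    bound = np.exp(x ** 2 / 2) * b / (a + b + 0.5)
    assert post_mean_shrinkage(x, a, b) <= bound
```

The bound is loose for large |x| (the factor e^{x²/2} dominates), which is consistent with its role in controlling the zero-mean terms only for moderate observations.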
Proof of Theorem 2.2. Note that since a ∈ (1/2, ∞), κ_i^{a−1/2} is increasing in κ_i on (0, 1).
Additionally, since b_n ∈ (0, 1), (1−κ_i)^{b_n−1} is increasing in κ_i on (0, 1). Using these facts,
we have

Pr(κ_i < ϵ | X_i) ≤ [∫_0^ϵ exp(−κ_i X_i²/2) κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i] / [∫_ϵ^1 exp(−κ_i X_i²/2) κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i]
  ≤ e^{X_i²/2} [∫_0^ϵ κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i] / [∫_ϵ^1 κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i]
  ≤ e^{X_i²/2} (1−ϵ)^{b_n−1} [∫_0^ϵ κ_i^{a−1/2} dκ_i] / [ϵ^{a−1/2} ∫_ϵ^1 (1−κ_i)^{b_n−1} dκ_i]
  = e^{X_i²/2} (1−ϵ)^{b_n−1} (a+1/2)^{−1} ϵ^{a+1/2} / [b_n^{−1} ϵ^{a−1/2} (1−ϵ)^{b_n}]
  = e^{X_i²/2} b_n ϵ / [(a+1/2)(1−ϵ)].
Proof of Theorem 2.3. First, note that since b_n ∈ (0, 1), (1−κ_i)^{b_n−1} is increasing in κ_i on
(0, 1). Therefore, letting C denote the normalizing constant, which depends on X_i, we have

∫_0^η π(κ_i | X_i) dκ_i = C ∫_0^η exp(−κ_i X_i²/2) κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i
  ≥ C ∫_0^{ηδ} exp(−κ_i X_i²/2) κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i
  ≥ C exp(−ηδ X_i²/2) ∫_0^{ηδ} κ_i^{a−1/2} dκ_i
  = C exp(−ηδ X_i²/2) (a+1/2)^{−1} (ηδ)^{a+1/2}.     (A–4)

Also, since a ∈ (1/2, ∞), κ_i^{a−1/2} is increasing in κ_i on (0, 1). Hence

∫_η^1 π(κ_i | X_i) dκ_i = C ∫_η^1 exp(−κ_i X_i²/2) κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i
  ≤ C exp(−η X_i²/2) ∫_η^1 κ_i^{a−1/2} (1−κ_i)^{b_n−1} dκ_i
  ≤ C exp(−η X_i²/2) ∫_η^1 (1−κ_i)^{b_n−1} dκ_i
  = C exp(−η X_i²/2) b_n^{−1} (1−η)^{b_n}.     (A–5)

Combining Eq. A–4 and Eq. A–5, we have

Pr(κ_i > η | X_i) ≤ [∫_η^1 π(κ_i | X_i) dκ_i] / [∫_0^η π(κ_i | X_i) dκ_i]
  ≤ (a+1/2) (1−η)^{b_n} / [b_n (ηδ)^{a+1/2}] × exp( −η(1−δ) X_i²/2 ).
A.2 Proofs for Section 2.3.1
Before proving Theorems 2.4 and 2.5, we first prove four lemmas. For Lemmas A.1, A.2,
A.3, and A.4, we denote by T(x) = E{(1−κ) | x} · x the posterior mean under the IGG_n model
in Eq. 2–6 for a single observation x, where κ = 1/(1+λξ). Our arguments follow closely those of
van der Pas et al. (2014), Datta & Ghosh (2013), and Ghosh & Chakrabarti (2017), except
that their arguments rely on controlling the rate of decay of the tuning parameter τ or an
empirical Bayes estimator τ̂. In our case, since we are dealing with a fully Bayesian model, the
degree of posterior contraction is instead controlled by the positive sequence of hyperparameters
b_n in Eq. 2–6.

Lemma A.1. Let T(x) be the posterior mean under the IGG_n model in Eq. 2–6 for a single
observation x drawn from N(θ, 1). Suppose we have constants η ∈ (0, 1), δ ∈ (0, 1),
a ∈ (1/2, ∞), and b_n ∈ (0, 1), where b_n → 0 as n → ∞. Then for any d > 2 and fixed
n, |T(x) − x| can be bounded above by a real-valued function h_n(x), depending on d and
satisfying the following: for any ρ > d, h_n(·) satisfies

lim_{n→∞} sup_{|x| > √(ρ log(1/b_n))} h_n(x) = 0.     (A–6)
Proof of Lemma A.1. Fix η ∈ (0, 1) and δ ∈ (0, 1). First observe that

|T(x) − x| = |x E(κ|x)| ≤ |x E(κ 1{κ < η}|x)| + |x E(κ 1{κ > η}|x)|.     (A–7)

We consider the two terms in Eq. A–7 separately. From Eq. 2–4 and the fact that (1−κ)^{b_n−1}
is increasing in κ ∈ (0, 1) when b_n ∈ (0, 1), we have

|x E(κ 1{κ < η}|x)| = |x| [∫_0^η κ · κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ]
  ≤ |x| (1−η)^{b_n−1} [∫_0^η κ^{a+1/2} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} e^{−κx²/2} dκ]
  = |x| (1−η)^{b_n−1} [∫_0^{ηx²} (t/x²)^{a+1/2} e^{−t/2} dt] / [∫_0^{x²} (t/x²)^{a−1/2} e^{−t/2} dt]
  = |x| (1−η)^{b_n−1} (1/x²) [∫_0^{ηx²} t^{a+1/2} e^{−t/2} dt] / [∫_0^{x²} t^{a−1/2} e^{−t/2} dt]
  ≤ (1−η)^{b_n−1} [∫_0^∞ t^{a+1/2} e^{−t/2} dt] [∫_0^{x²} t^{a−1/2} e^{−t/2} dt]^{−1} |x|^{−1}
  = C(n) [∫_0^{x²} t^{a−1/2} e^{−t/2} dt]^{−1} |x|^{−1}
  = h_1(x) (say),     (A–8)

where we use the change of variables t = κx² in the second equality, and C(n) =
(1−η)^{b_n−1} ∫_0^∞ t^{a+1/2} e^{−t/2} dt = (1−η)^{b_n−1} 2^{a+3/2} Γ(a+3/2). Next, observe that since
κ ∈ (0, 1),

|x E(κ 1{κ > η}|x)| ≤ |x| Pr(κ > η | x)
  ≤ [(a+1/2) (1−η)^{b_n} / (b_n (ηδ)^{a+1/2})] |x| exp( −η(1−δ) x²/2 )
  = h_2(x) (say),     (A–9)

where we use Theorem 2.3 for the second inequality.

Let h_n(x) = h_1(x) + h_2(x). Combining Eq. A–7 through Eq. A–9, we have that for every
x ∈ R and fixed n,

|T(x) − x| ≤ h_n(x).     (A–10)

Observe from Eq. A–8 that for fixed n, h_1(x) is strictly decreasing in |x|. Therefore, we have
that for any fixed n and ρ > 0,
sup_{|x| > √(ρ log(1/b_n))} h_1(x) ≤ C(n) [ √(ρ log(1/b_n)) ∫_0^{ρ log(1/b_n)} t^{a−1/2} e^{−t/2} dt ]^{−1},

and since b_n → 0 as n → ∞, this implies that

lim_{n→∞} sup_{|x| > √(ρ log(1/b_n))} h_1(x) = 0.     (A–11)
Next, observe from Eq. A–9 that for fixed n, h_2(x) is eventually decreasing in |x|, with a
maximum at |x| = 1/√(η(1−δ)). Therefore, for sufficiently large n, we have

sup_{|x| > √(ρ log(1/b_n))} h_2(x) ≤ h_2( √(ρ log(1/b_n)) ).

Letting K ≡ K(a, η, δ) = (a+1/2) / (ηδ)^{a+1/2}, we have from Eq. A–9 and the fact that
0 < b_n < 1 for all n that

lim_{n→∞} h_2( √(ρ log(1/b_n)) ) = K lim_{n→∞} [(1−η)^{b_n} / b_n] √(ρ log(1/b_n)) e^{−(η(1−δ)/2) ρ log(1/b_n)}
  ≤ K lim_{n→∞} (1/b_n) √(ρ log(1/b_n)) e^{(η(1−δ)/2) log(b_n^ρ)}
  = K √ρ lim_{n→∞} (b_n)^{(η(1−δ)/2)(ρ − 2/(η(1−δ)))} √(log(1/b_n))
  = 0 if ρ > 2/(η(1−δ)), and ∞ otherwise,

from which it follows that

lim_{n→∞} sup_{|x| > √(ρ log(1/b_n))} h_2(x) = 0 if ρ > 2/(η(1−δ)), and ∞ otherwise.     (A–12)

Combining Eq. A–11 and Eq. A–12, we have for h_n(x) = h_1(x) + h_2(x) that

lim_{n→∞} sup_{|x| > √(ρ log(1/b_n))} h_n(x) = 0 if ρ > 2/(η(1−δ)), and ∞ otherwise.     (A–13)
Since η ∈ (0, 1) and δ ∈ (0, 1), any real number larger than 2 can be expressed in the form
2/(η(1−δ)). For example, taking η = 5/6 and δ = 1/5, we obtain 2/(η(1−δ)) = 3. Hence, given
any d > 2, choose 0 < η, δ < 1 such that d = 2/(η(1−δ)). Clearly, h_n(·) depends on d. Following
Eq. A–10 and Eq. A–13, we see that |T(x) − x| is uniformly bounded above by h_n(x) for all
n and that the condition in Eq. A–6 is satisfied for any d > 2. This completes the proof.
Remark: Under the conditions of Lemma A.1, we see that for any fixed n,

lim_{|x|→∞} |T(x) − x| = 0.     (A–14)

Equation A–14 shows that under the IGG prior, large observations remain almost unshrunk no
matter what the sample size n is. This is critical to the prior's ability to properly identify
signals in our data.
Lemma A.2. Let T(x) be the posterior mean and let Var(θ|x) be the posterior variance
under the IGG_n prior in Eq. 2–6. Then for a single observation x ∼ N(θ, 1), Var(θ|x) can be
represented by the following identities:

Var(θ|x) = T(x)/x − (T(x) − x)² + x² [∫_0^1 κ^{a+3/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ],     (A–15)

and

Var(θ|x) = T(x)/x − T(x)² + x² [∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ],     (A–16)

both of which satisfy the bound Var(θ|x) ≤ 1 + x².
Proof of Lemma A.2. We first prove Eq. A–15. By the law of iterated variance and the
fact that θ | κ, x ∼ N((1−κ)x, 1−κ), we have

Var(θ|x) = E[Var(θ|κ, x)] + Var[E(θ|κ, x)]
  = E(1−κ|x) + Var[(1−κ)x | x]
  = E(1−κ|x) + x² Var(κ|x)
  = E(1−κ|x) + x² E(κ²|x) − x² [E(κ|x)]².

Since x − T(x) = x E(κ|x), we may rewrite the above as

Var(θ|x) = T(x)/x − (T(x) − x)² + x² [∫_0^1 κ^{a+3/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ].

Since κ^{a+3/2} ≤ κ^{a−1/2} for all a ∈ R when κ ∈ (0, 1), it follows that the above display can be
bounded from above as Var(θ|x) ≤ 1 + x².

Next, we show that Eq. A–16 holds. We may alternatively represent Var(θ|x) as

Var(θ|x) = E(1−κ|x) + x² E[(1−κ)²|x] − x² E²[(1−κ)|x]
  = T(x)/x − T(x)² + x² [∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ].

Since (1−κ)^{b_n+1} ≤ (1−κ)^{b_n−1} for all b_n ∈ R when κ ∈ (0, 1), it follows that the above
display can also be bounded from above as Var(θ|x) ≤ 1 + x².
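As a quick numerical sanity check (our own sketch, not part of the thesis), the two representations in Eq. A–15 and Eq. A–16 can be evaluated by quadrature and compared, along with the bound Var(θ|x) ≤ 1 + x²:

```python
import numpy as np
from scipy.integrate import quad

def moment(x, a, b, i, j):
    """E[kappa^i (1-kappa)^j | x] under the posterior density proportional
    to kappa^(a-1/2) (1-kappa)^(b-1) exp(-kappa x^2/2); quad's 'alg'
    weight handles the algebraic endpoint behavior exactly."""
    f = lambda k: np.exp(-k * x ** 2 / 2)
    num, _ = quad(f, 0, 1, weight='alg', wvar=(a - 0.5 + i, b - 1 + j))
    den, _ = quad(f, 0, 1, weight='alg', wvar=(a - 0.5, b - 1))
    return num / den

a, b = 0.6, 0.05
for x in (0.5, 1.5, 3.0):
    T = x * moment(x, a, b, 0, 1)                                  # posterior mean T(x)
    var1 = T / x - (T - x) ** 2 + x ** 2 * moment(x, a, b, 2, 0)   # Eq. A-15
    var2 = T / x - T ** 2 + x ** 2 * moment(x, a, b, 0, 2)         # Eq. A-16
    assert abs(var1 - var2) < 1e-6
    assert var1 <= 1 + x ** 2
```

The agreement of the two forms reflects the identity x²Var(κ|x) = x²Var(1−κ|x) used in the proof.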
Lemma A.3. Let Var(θ|x) be the posterior variance under Eq. 2–6 for a single observation
x drawn from N(θ, 1). Suppose we have constants η ∈ (0, 1), δ ∈ (0, 1), a ∈ (1/2, ∞), and
b_n ∈ (0, 1), where b_n → 0 as n → ∞. Then there exists a nonnegative, measurable,
real-valued function h_n(x) such that Var(θ|x) ≤ h_n(x) for all x ∈ R. Moreover, h_n(x) → 1 as
x → ∞ for any fixed b_n ∈ (0, 1), and for any d > 1, h_n(·) satisfies

lim_{n→∞} sup_{|x| > √(2ρ log(1/b_n))} h_n(x) = 1 for any ρ > d.     (A–17)
Proof of Lemma A.3. We use the representation of Var(θ|x) given in Eq. A–15. It is clear
that T(x)/x − (T(x) − x)² can be bounded above by h_1(x) = 1 for all x ∈ R. To bound the last
term in Eq. A–15, fix η ∈ (0, 1) and δ ∈ (0, 1), and split this term into the sum

x² [∫_0^η κ^{a+3/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] + x² [∫_η^1 κ^{a+3/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ].     (A–18)

Following the same techniques as in Lemma A.1 to bound each term in this sum, we can
show that there exists a real-valued function h_2(x) that bounds Eq. A–18 uniformly for all
x ∈ R and for which h_2(x) → 0 as x → ∞ for any fixed n. Moreover, by mimicking the proof
of Lemma A.1, it can similarly be shown that, for any d > 1, this function h_2(x) satisfies

lim_{n→∞} sup_{|x| > √(2ρ log(1/b_n))} h_2(x) = 0 for any ρ > d.

Therefore, letting h_n(x) = h_1(x) + h_2(x) = 1 + h_2(x), the lemma is proven.
Lemma A.4. Define J_n(x) as follows:

J_n(x) = x² [∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ].     (A–19)

Suppose that a ∈ (1/2, ∞) is fixed and b_n ∈ (0, 1) for all n. Then we have the following upper
bound for J_n(x):

J_n(x) ≤ b_n e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n+1))].     (A–20)
Proof of Lemma A.4. We have

J_n(x) = x² [∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} e^{−κx²/2} dκ]
  ≤ x² e^{x²/2} [∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ] / [∫_0^1 κ^{a−1/2} (1−κ)^{b_n−1} dκ]
  = x² e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))] ∫_0^1 κ^{a−1/2} (1−κ)^{b_n+1} e^{−κx²/2} dκ
  = e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))] ∫_0^{x²} (t/x²)^{a−1/2} (1 − t/x²)^{b_n+1} e^{−t/2} dt,     (A–21)

where we used the change of variables t = κx² in the last equality. For 0 < t < x², we
have 0 < 1 − t/x² < 1, and since b_n ∈ (0, 1) for all n, we have (1 − t/x²)^{b_n+1} < 1 for all n.
Therefore, from Eq. A–21, we may further bound J_n(x) from above as

J_n(x) ≤ e^{x²/2} (x²)^{1/2−a} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))] ∫_0^{x²} t^{a−1/2} e^{−t/2} dt
  ≤ e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))] ∫_0^{x²} e^{−t/2} dt
  ≤ e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n))]
  = b_n e^{x²/2} [Γ(a+b_n+1/2) / (Γ(a+1/2) Γ(b_n+1))],

where we used the fact that a ∈ (1/2, ∞), and hence t^{a−1/2} ≤ (x²)^{a−1/2} for t ∈ (0, x²), for the
second inequality, and the fact that Γ(b_n+1) = b_n Γ(b_n) for the last equality.
Lemmas A.1, A.2, A.3, and A.4 are crucial in proving Theorems 2.4 and 2.5, which provide
asymptotic upper bounds on the mean squared error (MSE) of the posterior mean and on the
total posterior variance under the IGG_n prior in Eq. 2–6. These theorems will ultimately allow
us to provide sufficient conditions under which the posterior mean and the posterior
distribution under the IGG_n prior contract at minimax rates.
Proof of Theorem 2.4. Define q̃_n = #{i : θ_{0i} ≠ 0}. We split the MSE,

E_{θ_0} ||T(X) − θ_0||² = Σ_{i=1}^{n} E_{θ_{0i}} (T(X_i) − θ_{0i})²,

as

Σ_{i=1}^{n} E_{θ_{0i}} (T(X_i) − θ_{0i})² = Σ_{i : θ_{0i} ≠ 0} E_{θ_{0i}} (T(X_i) − θ_{0i})² + Σ_{i : θ_{0i} = 0} E_{θ_{0i}} (T(X_i) − θ_{0i})².     (A–22)

We consider the nonzero means and the zero means separately.
Nonzero means: For θ_{0i} ≠ 0, using the Cauchy-Schwarz inequality and the fact that
E_{θ_{0i}}(X_i − θ_{0i})² = 1, we get

E_{θ_{0i}} (T(X_i) − θ_{0i})² = E_{θ_{0i}} (T(X_i) − X_i + X_i − θ_{0i})²
  = E_{θ_{0i}} (T(X_i) − X_i)² + E_{θ_{0i}} (X_i − θ_{0i})² + 2 E_{θ_{0i}} [(T(X_i) − X_i)(X_i − θ_{0i})]
  ≤ E_{θ_{0i}} (T(X_i) − X_i)² + 1 + 2 √(E_{θ_{0i}} (T(X_i) − X_i)²) √(E_{θ_{0i}} (X_i − θ_{0i})²)
  = [ √(E_{θ_{0i}} (T(X_i) − X_i)²) + 1 ]².     (A–23)

We now define

ζ_n = √(2 log(1/b_n)).     (A–24)

Fix any d > 2 and choose any ρ > d. Then, using Lemma A.1, there exists a nonnegative
real-valued function h_n(·), depending on d, such that

|T(x) − x| ≤ h_n(x) for all x ∈ R,     (A–25)

and

lim_{n→∞} sup_{|x| > ρζ_n} h_n(x) = 0.     (A–26)

Using the fact that (T(X_i) − X_i)² ≤ X_i², together with Eq. A–26, we obtain

E_{θ_{0i}} (T(X_i) − X_i)² = E_{θ_{0i}} [(T(X_i) − X_i)² 1{|X_i| ≤ ρζ_n}] + E_{θ_{0i}} [(T(X_i) − X_i)² 1{|X_i| > ρζ_n}]
  ≤ ρ² ζ_n² + ( sup_{|x| > ρζ_n} h_n(x) )².     (A–27)

Using Eq. A–26 and the fact that ζ_n → ∞ as n → ∞ by Eq. A–24, it follows that

( sup_{|x| > ρζ_n} h_n(x) )² = o(ζ_n²) as n → ∞.     (A–28)
94
By combining Eq. A–27 and Eq. A–28, we get
$$ \mathbb{E}_{\theta_{0i}}(T(X_i) - X_i)^2 \le \rho^2\zeta_n^2(1 + o(1)) \text{ as } n \to \infty. \tag{A–29} $$
Noting that Eq. A–29 holds uniformly for any $i$ such that $\theta_{0i} \neq 0$, we combine Eq. A–23, Eq. A–24, and Eq. A–29 to conclude that
$$ \sum_{i:\theta_{0i}\neq 0} \mathbb{E}_{\theta_{0i}}(T(X_i) - \theta_{0i})^2 \lesssim \tilde{q}_n\log\left(\frac{1}{b_n}\right) \text{ as } n \to \infty. \tag{A–30} $$
Zero means: For $\theta_{0i} = 0$, the corresponding MSE can be split as follows:
$$ \mathbb{E}_0 T(X_i)^2 = \mathbb{E}_0[T(X_i)^2\mathbf{1}\{|X_i| \le \zeta_n\}] + \mathbb{E}_0[T(X_i)^2\mathbf{1}\{|X_i| > \zeta_n\}], \tag{A–31} $$
where $\zeta_n$ is as in Eq. A–24. Using Theorem 2.1, we have
$$
\begin{aligned}
\mathbb{E}_0[T(X_i)^2\mathbf{1}\{|X_i| \le \zeta_n\}] &\le \left(\frac{b_n}{a+b_n+1/2}\right)^2\int_{-\zeta_n}^{\zeta_n} x^2 e^{x^2/2}\,dx \\
&\le \frac{b_n^2}{a^2}\int_{-\zeta_n}^{\zeta_n} x^2 e^{x^2/2}\,dx = \frac{2b_n^2}{a^2}\int_0^{\zeta_n} x^2 e^{x^2/2}\,dx \\
&\le \frac{2b_n^2}{a^2}\left(\zeta_n e^{\zeta_n^2/2}\right) \lesssim b_n\sqrt{\log\left(\frac{1}{b_n}\right)}, \tag{A–32}
\end{aligned}
$$
where we use integration by parts for the third inequality.
Now, using the fact that $|T(x)| \le |x|$ for all $x \in \mathbb{R}$,
$$
\begin{aligned}
\mathbb{E}_0[T(X_i)^2\mathbf{1}\{|X_i| > \zeta_n\}] &\le 2\int_{\zeta_n}^{\infty} x^2\phi(x)\,dx \\
&\le 2\zeta_n\phi(\zeta_n) + \frac{2\phi(\zeta_n)}{\zeta_n} \\
&= \sqrt{\frac{2}{\pi}}\,\zeta_n e^{-\zeta_n^2/2}(1 + o(1)) \\
&\lesssim b_n\sqrt{\log\left(\frac{1}{b_n}\right)}, \tag{A–33}
\end{aligned}
$$
where we used the identity $x^2\phi(x) = \phi(x) - \frac{d}{dx}[x\phi(x)]$ together with Mills' ratio, $1 - \Phi(x) \le \frac{\phi(x)}{x}$ for all $x > 0$, for the second inequality. Combining Eq. A–32 and Eq. A–33, we have that
$$ \sum_{i:\theta_{0i}=0} \mathbb{E}_{\theta_{0i}} T(X_i)^2 \lesssim (n - \tilde{q}_n)\,b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. \tag{A–34} $$
From Eq. A–22, Eq. A–30, and Eq. A–34, it immediately follows that
$$ \mathbb{E}_{\theta_0}\|T(X) - \theta_0\|^2 = \sum_{i=1}^n \mathbb{E}_{\theta_{0i}}(T(X_i) - \theta_{0i})^2 \lesssim \tilde{q}_n\log\left(\frac{1}{b_n}\right) + (n - \tilde{q}_n)\,b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. $$
The required result now follows by observing that $\tilde{q}_n \le q_n$ and $q_n = o(n)$, and then taking the supremum over all $\theta_0 \in \ell_0[q_n]$. This completes the proof of Theorem 2.4.
Proof of Theorem 2.5. Define $\tilde{q}_n = \#\{i : \theta_{0i} \neq 0\}$. We decompose the total variance as
$$ \mathbb{E}_{\theta_0}\sum_{i=1}^n \mathrm{Var}(\theta_i|X_i) = \sum_{i:\theta_{0i}\neq 0} \mathbb{E}_{\theta_{0i}}\mathrm{Var}(\theta_i|X_i) + \sum_{i:\theta_{0i}=0} \mathbb{E}_{\theta_{0i}}\mathrm{Var}(\theta_i|X_i), \tag{A–35} $$
and consider the nonzero means and zero means separately.

Nonzero means: Fix $d > 1$ and choose any $\rho > d$, and let $\zeta_n$ be defined as in Eq. A–24. For $\theta_{0i} \neq 0$, we have from Eq. A–17 in Lemma A.3 that
$$ \mathbb{E}_{\theta_{0i}}[\mathrm{Var}(\theta_i|X_i)\mathbf{1}\{|X_i| > \rho\zeta_n\}] \lesssim 1. \tag{A–36} $$
Moreover, Lemma A.2 shows that $\mathrm{Var}(\theta|x) \le 1 + x^2$ for any $x \in \mathbb{R}$, and so we must also have that as $n \to \infty$,
$$ \mathbb{E}_{\theta_{0i}}[\mathrm{Var}(\theta_i|X_i)\mathbf{1}\{|X_i| \le \rho\zeta_n\}] \lesssim \zeta_n^2. \tag{A–37} $$
Combining Eq. A–36 and Eq. A–37, we have that, as $n \to \infty$,
$$ \mathbb{E}_{\theta_{0i}}\mathrm{Var}(\theta_i|X_i) \lesssim 1 + \zeta_n^2, $$
and thus, summing over all $i$ such that $\theta_{0i} \neq 0$,
$$ \sum_{i:\theta_{0i}\neq 0} \mathbb{E}_{\theta_{0i}}\mathrm{Var}(\theta_i|X_i) \lesssim \tilde{q}_n(1 + \zeta_n^2) \lesssim \tilde{q}_n\log\left(\frac{1}{b_n}\right). \tag{A–38} $$
Zero means: For $\theta_{0i} = 0$, we use the same $\zeta_n$ that we used for the nonzero means. We have from Lemma A.2 that $\mathrm{Var}(\theta|x) \le 1 + x^2$. Using the identity $x^2\phi(x) = \phi(x) - \frac{d}{dx}[x\phi(x)]$ for $x \in \mathbb{R}$, we obtain that as $n \to \infty$,
$$ \mathbb{E}_0[\mathrm{Var}(\theta_i|X_i)\mathbf{1}\{|X_i| > \zeta_n\}] \le 2\int_{\zeta_n}^{\infty}(1 + x^2)\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx \lesssim \frac{b_n}{\zeta_n} + \zeta_n b_n \lesssim b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. \tag{A–39} $$
Next, we consider $|X_i| \le \zeta_n$. We have by Eq. A–16 in Lemma A.2 that $\mathrm{Var}(\theta|x) \le \frac{T(x)}{x} + J_n(x)$, where $J_n(x)$ is the last term in Eq. A–16. Lemma A.4 gives an upper bound on $J_n(x)$ in Eq. A–20. Since $a \in (\frac12,\infty)$ is fixed and $b_n \in (0,1)$ with $b_n \to 0$ as $n \to \infty$, the term in parentheses in Eq. A–20 is uniformly bounded above by a constant. Therefore, we have by Lemma A.4 that $J_n(x) \lesssim b_n$. Moreover, $\frac{T(x)}{x} = \mathbb{E}(1-\kappa|x)$, and it is clear from Theorem 2.1 that $\mathbb{E}(1-\kappa|x) \lesssim b_n e^{x^2/2}$, so that integrating this bound over $\{|X_i| \le \zeta_n\}$ against the $\mathcal{N}(0,1)$ density contributes at most a constant multiple of $\zeta_n b_n$. Altogether, we have that as $n \to \infty$,
$$ \mathbb{E}_0[\mathrm{Var}(\theta_i|X_i)\mathbf{1}\{|X_i| \le \zeta_n\}] \lesssim \zeta_n b_n \lesssim b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. \tag{A–40} $$
Combining Eq. A–39 and Eq. A–40, it follows that as $n \to \infty$,
$$ \mathbb{E}_0\mathrm{Var}(\theta_i|X_i) \lesssim b_n\sqrt{\log\left(\frac{1}{b_n}\right)}, $$
and consequently,
$$ \sum_{i:\theta_{0i}=0} \mathbb{E}_{\theta_{0i}}\mathrm{Var}(\theta_i|X_i) \lesssim (n - \tilde{q}_n)\,b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. \tag{A–41} $$
Combining Eq. A–35, Eq. A–38, and Eq. A–41, it follows that as $n \to \infty$,
$$ \mathbb{E}_{\theta_0}\sum_{i=1}^n \mathrm{Var}(\theta_i|X_i) \lesssim \tilde{q}_n\log\left(\frac{1}{b_n}\right) + (n - \tilde{q}_n)\,b_n\sqrt{\log\left(\frac{1}{b_n}\right)}. $$
The required result now follows by observing that $\tilde{q}_n \le q_n$ and $q_n = o(n)$, and then taking the supremum over all $\theta_0 \in \ell_0[q_n]$. This completes the proof of Theorem 2.5.
A.3 Proofs for Section 2.3.2
Proof of Theorem 2.7. Using the beta prime representation of the IGG prior, we have
$$ \pi(\theta) = \frac{1}{(2\pi)^{1/2}B(a,b)}\int_0^{\infty}\exp\left(-\frac{\theta^2}{2u}\right)u^{b-\frac32}(1+u)^{-a-b}\,du, $$
where $B(a,b)$ denotes the beta function. Under the change of variables $z = \frac{\theta^2}{2u}$, we have
$$ \pi(\theta) = \frac{2^{a+\frac12}}{(2\pi)^{1/2}B(a,b)}(\theta^2)^{b-\frac12}\int_0^{\infty}\exp(-z)z^{a-\frac12}(\theta^2+2z)^{-a-b}\,dz. \tag{A–42} $$
Now define the set $A_{\epsilon} = \{\theta : |\theta| \le \epsilon\}$. Then from Eq. A–42, and for $0 < \epsilon < 1$, we have
$$
\begin{aligned}
\nu(A_{\epsilon}) &= P(|\theta| \le \epsilon) \\
&= \frac{2^{a+\frac12}\int_0^{\infty}\exp(-z)z^{a-\frac12}\left(\int_{|\theta|\le\epsilon}(\theta^2)^{b-\frac12}(\theta^2+2z)^{-a-b}\,d\theta\right)dz}{(2\pi)^{1/2}B(a,b)} \\
&\ge \frac{2^{a+\frac12}\int_0^{\infty}\exp(-z)z^{a-\frac12}(2z+1)^{-a-b}\left(\int_{|\theta|\le\epsilon}(\theta^2)^{b-\frac12}\,d\theta\right)dz}{(2\pi)^{1/2}B(a,b)} \\
&\ge \frac{2^{a+\frac12}\,2^{-a-b}\cdot 2\int_0^{\infty}\exp(-z)z^{a-\frac12}(1+z)^{-a-b}\left(\int_0^{\epsilon}(\theta^2)^{b-\frac12}\,d\theta\right)dz}{(2\pi)^{1/2}B(a,b)} \\
&= \frac{\epsilon^{2b}}{2^b\,b\,B(a,b)\,\pi^{1/2}}\int_0^{\infty}\exp(-z)z^{a-\frac12}(1+z)^{-a-b}\,dz. \tag{A–43}
\end{aligned}
$$
To bound the integral term in Eq. A–43, note that
$$ \int_0^{\infty}\exp(-z)z^{a-\frac12}(1+z)^{-a-b}\,dz \ge \int_0^1 \exp(-z)z^{a-\frac12}(1+z)^{-a-b}\,dz \ge e^{-1}2^{-a-b}\left(a+\frac12\right)^{-1}. \tag{A–44} $$
Therefore, combining Eq. A–43 and Eq. A–44, we have
$$
\begin{aligned}
\nu(A_{\epsilon}) &\ge \frac{\epsilon^{2b}}{2^b\,b\,B(a,b)\,\pi^{1/2}}\,e^{-1}2^{-a-b}\left(a+\frac12\right)^{-1} \\
&= \frac{\epsilon^{2b}\,\Gamma(a+b)}{2^b\,\Gamma(a)\,\Gamma(b+1)\,\pi^{1/2}}\,e^{-1}2^{-a-b}\left(a+\frac12\right)^{-1} \\
&\ge \frac{\epsilon^{2b}\,\Gamma(a)}{\Gamma(a)\,\Gamma(2)\,\Gamma(\frac12)}\,e^{-1}2^{-a-2b}\left(a+\frac12\right)^{-1} \\
&\ge (\epsilon^2)^b\,\pi^{-1/2}\,e^{-1}2^{-a-2}\left(a+\frac12\right)^{-1}, \tag{A–45}
\end{aligned}
$$
where we use the fact that $0 < b < 1$ for the last two inequalities.
Following Clarke & Barron (1990), the optimal rate of convergence comes from setting $\epsilon_n = 1/n$, which reflects the ideal case of independent samples $x_1, \dots, x_n$. We therefore apply Proposition 2.2, substituting in $\epsilon = 1/n$ and $b = 1/n$ and invoking the lower bound for $\nu(A_{\epsilon})$ found in Eq. A–45. This ultimately gives us an upper bound on the Cesàro-average risk:
$$
R_n \le \frac1n - \frac1n\log\left[\left(\frac1n\right)^{\frac2n}\pi^{-1/2}e^{-1}2^{-a-2}\left(a+\frac12\right)^{-1}\right]
= \frac1n\left[2 + \log(\sqrt{\pi}) + (a+2)\log(2) + \log\left(a+\frac12\right)\right] + \frac{2\log n}{n^2},
$$
when $\theta_0 = 0$.
APPENDIX B
PROOFS FOR CHAPTER 3
In this Appendix, we provide proofs of all the lemmas and theorems in Section 3.2. Our proof methods follow those of Datta & Ghosh (2013), Ghosh et al. (2016), and Ghosh & Chakrabarti (2017), except that our arguments rely on control of the sequence of hyperparameters $b_n$, rather than on specifying a rate or an estimate for a global parameter $\tau$, as in the global-local framework of Eq. 1–5.
Proof of Lemma 3.1. By Theorem 2.1, the event $\left\{\mathbb{E}(1-\kappa_i|X_i) > \frac12\right\}$ implies the event
$$ \left\{e^{X_i^2/2}\left(\frac{b_n}{a+b_n+1/2}\right) > \frac12\right\} \Leftrightarrow \left\{X_i^2 > 2\log\left(\frac{a+b_n+1/2}{2b_n}\right)\right\}. $$
Therefore, noting that under $H_{0i}$, $X_i \sim \mathcal{N}(0,1)$, and using Mills' ratio, i.e. $P(|Z| > x) \le \frac{2\phi(x)}{x}$, we have
$$
\begin{aligned}
t_{1i} &\le \Pr\left(X_i^2 > 2\log\left(\frac{a+b_n+1/2}{2b_n}\right)\,\Big|\,H_{0i}\text{ is true}\right) \\
&= \Pr\left(|Z| > \sqrt{2\log\left(\frac{a+b_n+1/2}{2b_n}\right)}\right) \\
&\le \frac{2\phi\left(\sqrt{2\log\left(\frac{a+b_n+1/2}{2b_n}\right)}\right)}{\sqrt{2\log\left(\frac{a+b_n+1/2}{2b_n}\right)}} \\
&= \frac{2b_n}{\sqrt{\pi}(a+b_n+1/2)}\left[\log\left(\frac{a+b_n+1/2}{2b_n}\right)\right]^{-1/2}.
\end{aligned}
$$
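The closed-form bound at the end of Lemma 3.1 can be checked numerically for particular hyperparameter values; the choice $a = 1$ and the grid of $b_n$ values below are illustrative assumptions only:

```python
import numpy as np
from scipy.stats import norm

a = 1.0
for bn in [0.1, 0.01, 0.001]:
    L = np.log((a + bn + 0.5) / (2.0 * bn))        # the log(.) term
    exact = 2.0 * norm.sf(np.sqrt(2.0 * L))        # Pr(|Z| > sqrt(2 log(...)))
    bound = 2.0 * bn / (np.sqrt(np.pi) * (a + bn + 0.5)) * L ** (-0.5)
    assert exact <= bound
```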
Proof of Lemma 3.2. By definition, the probability of a Type I error for the $i$th decision is given by
$$ t_{1i} = \Pr\left[\mathbb{E}(1-\kappa_i|X_i) > \frac12\,\Big|\,H_{0i}\text{ is true}\right]. $$
We have by Theorem 2.3 that
$$ \mathbb{E}(\kappa_i|X_i) \le \eta + \frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}}\exp\left(-\frac{\eta(1-\delta)}{2}X_i^2\right), $$
and so it follows that
$$ \left\{\mathbb{E}(1-\kappa_i|X_i) > \frac12\right\} \supseteq \left\{\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}}\exp\left(-\frac{\eta(1-\delta)}{2}X_i^2\right) < \frac12-\eta\right\}. $$
Thus, using the definition of $t_{1i}$ and the above, and noting that under $H_{0i}$, $X_i \sim \mathcal{N}(0,1)$, we have for sufficiently large $n$,
$$
\begin{aligned}
t_{1i} &\ge \Pr\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}}\exp\left(-\frac{\eta(1-\delta)}{2}X_i^2\right) < \frac12-\eta\,\Big|\,H_{0i}\text{ is true}\right) \\
&= \Pr\left(X_i^2 > \frac{2}{\eta(1-\delta)}\left[\log\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}(\frac12-\eta)}\right)\right]\right) \\
&= 2\Pr\left(Z > \sqrt{\frac{2}{\eta(1-\delta)}\left[\log\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}(\frac12-\eta)}\right)\right]}\right) \\
&= 2\left[1 - \Phi\left(\sqrt{\frac{2}{\eta(1-\delta)}\left[\log\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}(\frac12-\eta)}\right)\right]}\right)\right],
\end{aligned}
$$
where for the last two equalities we used the fact that $b_n \to 0$ as $n \to \infty$ and the fact that $\eta, \eta\delta \in (0,\frac12)$, so that the $\log(\cdot)$ term above is greater than zero for sufficiently large $n$.
Proof of Lemma 3.3. By definition, the probability of a Type II error is given by
$$ t_{2i} = \Pr\left(\mathbb{E}(1-\kappa_i|X_i) \le \frac12\,\Big|\,H_{1i}\text{ is true}\right). $$
Fix $\eta \in (0,\frac12)$ and $\delta \in (0,1)$. Using the inequality
$$ \kappa_i \le \mathbf{1}\{\eta < \kappa_i \le 1\} + \eta, $$
we obtain
$$ \mathbb{E}(\kappa_i|X_i) \le \Pr(\kappa_i > \eta|X_i) + \eta. $$
Coupled with Theorem 2.3, we obtain that for sufficiently large $n$,
$$ \left\{\mathbb{E}(\kappa_i|X_i) > \frac12\right\} \subseteq \left\{\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}}\exp\left(-\frac{\eta(1-\delta)}{2}X_i^2\right) > \frac12-\eta\right\}. $$
Therefore,
$$
\begin{aligned}
t_{2i} &= \Pr\left(\mathbb{E}(\kappa_i|X_i) > \frac12\,\Big|\,H_{1i}\text{ is true}\right) \\
&\le \Pr\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}}\exp\left(-\frac{\eta(1-\delta)}{2}X_i^2\right) > \frac12-\eta\,\Big|\,H_{1i}\text{ is true}\right) \\
&= \Pr\left(X_i^2 < \frac{2}{\eta(1-\delta)}\left\{\log\left(\frac{a+\frac12}{b_n(\eta\delta)^{a}}\right) - \log\left(\frac{(\frac12-\eta)(\eta\delta)^{1/2}}{1-\eta}\right)\right\}\,\Big|\,H_{1i}\text{ is true}\right) \\
&= \Pr\left(X_i^2 < \frac{2}{\eta(1-\delta)}\log\left(\frac{a+\frac12}{b_n(\eta\delta)^{a}}\right)(1+o(1))\,\Big|\,H_{1i}\text{ is true}\right), \tag{B–1}
\end{aligned}
$$
where in the final equality we used the fact that the second $\log(\cdot)$ term in the second-to-last equality is a bounded constant, while the first $\log(\cdot)$ term diverges since $b_n \to 0$ as $n \to \infty$, so the constant is absorbed into the $(1+o(1))$ factor.
Note that under $H_{1i}$, $X_i \sim \mathcal{N}(0, 1+\psi^2)$. Therefore, by Eq. B–1 and the fact that
$$ \lim_{n\to\infty}\frac{\psi_n^2}{1+\psi_n^2} = 1 $$
(by the second condition of Assumption 1), we have
$$ t_{2i} \le \Pr\left(|Z| < \sqrt{\frac{2}{\eta(1-\delta)}}\sqrt{\frac{\log\left((a+\frac12)(\eta\delta)^{-a}b_n^{-1}\right)}{\psi^2}}\,(1+o(1))\right) \text{ as } n \to \infty. \tag{B–2} $$
By assumption, $\lim_{n\to\infty}\frac{b_n^{1/4}}{p_n} \in (0,\infty)$. This then implies that $\lim_{n\to\infty}\frac{b_n^{7/8}}{p_n^2} = 0$. Therefore, by the fourth condition of Assumption 1 and the fact that $\psi^2 \to \infty$ as $n \to \infty$, we have
$$
\frac{\log\left((a+\frac12)(\eta\delta)^{-a}b_n^{-1}\right)}{\psi^2} = \frac{\log\left((a+\frac12)(\eta\delta)^{-a}\right) + \log(b_n^{-1})}{\psi^2}
= \left(\frac{\log(b_n^{-1/8})}{\psi^2} + \frac{\log(b_n^{-7/8})}{\psi^2}\right)(1+o(1))
= \frac{\log(b_n^{-1/2})}{4\psi^2}(1+o(1)) \to \frac{C}{4} \text{ as } n \to \infty. \tag{B–3}
$$
Thus, using Eq. B–2 and Eq. B–3, we have
$$
\begin{aligned}
t_{2i} &\le \Pr\left(|Z| < \sqrt{\frac{C}{2\eta(1-\delta)}}\,(1+o(1))\right) \text{ as } n \to \infty \\
&= \Pr\left(|Z| < \sqrt{\frac{C}{2\eta(1-\delta)}}\right)(1+o(1)) \text{ as } n \to \infty \\
&= \left[2\Phi\left(\sqrt{\frac{C}{2\eta(1-\delta)}}\right) - 1\right](1+o(1)) \text{ as } n \to \infty.
\end{aligned}
$$
Proof of Lemma 3.4. By definition, the probability of a Type II error for the $i$th decision is given by
$$ t_{2i} = \Pr\left(\mathbb{E}(1-\kappa_i|X_i) \le \frac12\,\Big|\,H_{1i}\text{ is true}\right). $$
For any $n$, we have by Theorem 2.1 that
$$ \left\{e^{X_i^2/2}\left(\frac{b_n}{a+b_n+1/2}\right) \le \frac12\right\} \subseteq \left\{\mathbb{E}(1-\kappa_i|X_i) \le \frac12\right\}. $$
Therefore,
$$
\begin{aligned}
t_{2i} &= \Pr\left(\mathbb{E}(1-\kappa_i|X_i) \le \frac12\,\Big|\,H_{1i}\text{ is true}\right) \\
&\ge \Pr\left(e^{X_i^2/2}\left(\frac{b_n}{a+b_n+1/2}\right) \le \frac12\,\Big|\,H_{1i}\text{ is true}\right) \\
&= \Pr\left(X_i^2 \le 2\log\left(\frac{a+b_n+1/2}{2b_n}\right)\,\Big|\,H_{1i}\text{ is true}\right). \tag{B–4}
\end{aligned}
$$
Since $X_i \sim \mathcal{N}(0, 1+\psi^2)$ under $H_{1i}$, we have by the second condition in Assumption 1 that $\lim_{n\to\infty}\frac{\psi_n^2}{1+\psi_n^2} = 1$. From Eq. B–4 and the facts that $a \in (\frac12,\infty)$ and $b_n \in (0,1)$ for all $n$ (so $b_n^{-1} \ge b_n^{-1/2}$ for all $n$), we have for sufficiently large $n$,
$$
\begin{aligned}
t_{2i} &\ge \Pr\left(|Z| \le \sqrt{\frac{2\log\left(\frac{a+b_n+1/2}{2b_n}\right)}{\psi^2}}\,(1+o(1))\right) \text{ as } n \to \infty \\
&\ge \Pr\left(|Z| \le \sqrt{\frac{\log\left(\frac{1}{2b_n}\right)}{\psi^2}}\,(1+o(1))\right) \text{ as } n \to \infty \\
&\ge \Pr\left(|Z| \le \sqrt{\frac{\log(b_n^{-1/2}) + \log(1/2)}{\psi^2}}\,(1+o(1))\right) \text{ as } n \to \infty \\
&= \Pr(|Z| \le \sqrt{C})(1+o(1)) \text{ as } n \to \infty \\
&= \left[2\Phi(\sqrt{C}) - 1\right](1+o(1)) \text{ as } n \to \infty,
\end{aligned}
$$
where in the second-to-last equality, we used the assumption that $\lim_{n\to\infty}\frac{b_n^{1/4}}{p_n} \in (0,\infty)$ and the second and fourth conditions from Assumption 1.
Proof of Theorem 3.1. Since the $\kappa_i$'s, $i = 1, \dots, n$, are a posteriori independent, the Type I and Type II error probabilities $t_{1i}$ and $t_{2i}$ are the same for every test $i$, $i = 1, \dots, n$. By Lemmas 3.1 and 3.2, for large enough $n$,
$$
2\left[1 - \Phi\left(\sqrt{\frac{2}{\eta(1-\delta)}\left[\log\left(\frac{(a+\frac12)(1-\eta)}{b_n(\eta\delta)^{a+\frac12}(\frac12-\eta)}\right)\right]}\right)\right]
\le t_{1i}
\le \frac{2b_n}{\sqrt{\pi}(a+b_n+1/2)}\left[\log\left(\frac{a+b_n+1/2}{2b_n}\right)\right]^{-1/2}.
$$
Taking the limit as $n \to \infty$ of all the terms above and using the sandwich theorem, we have
$$ \lim_{n\to\infty} t_{1i} = 0 \tag{B–5} $$
for the $i$th test, under the assumptions on the hyperparameters $a$ and $b_n$.
By Lemmas 3.3 and 3.4, for any $\eta \in (0,\frac12)$ and $\delta \in (0,1)$,
$$ \left[2\Phi(\sqrt{C}) - 1\right](1+o(1)) \le t_{2i} \le \left[2\Phi\left(\sqrt{\frac{C}{2\eta(1-\delta)}}\right) - 1\right](1+o(1)). \tag{B–6} $$
Therefore, we have by Eq. B–5 and Eq. B–6 that as $n \to \infty$, the asymptotic risk in Eq. 1–15 of the classification rule in Eq. 3–1, $R_{IGG}$, can be bounded as follows:
$$ np\left(2\Phi(\sqrt{C}) - 1\right)(1+o(1)) \le R_{IGG} \le np\left(2\Phi\left(\sqrt{\frac{C}{2\eta(1-\delta)}}\right) - 1\right)(1+o(1)). \tag{B–7} $$
Therefore, from Eq. 1–17 and Eq. B–7, we have that as $n \to \infty$,
$$ 1 \le \liminf_{n\to\infty}\frac{R_{IGG}}{R^{BO}_{Opt}} \le \limsup_{n\to\infty}\frac{R_{IGG}}{R^{BO}_{Opt}} \le \frac{2\Phi\left(\sqrt{\frac{C}{2\eta(1-\delta)}}\right) - 1}{2\Phi(\sqrt{C}) - 1}. \tag{B–8} $$
Now, the supremum of $\eta(1-\delta)$ over $(\eta,\delta) \in (0,\frac12)\times(0,1)$ is clearly $\frac12$, and so the infimum of the numerator in the right-most term in Eq. B–8 is therefore $2\Phi(\sqrt{C}) - 1$. Thus,
$$ 1 \le \liminf_{n\to\infty}\frac{R_{IGG}}{R^{BO}_{Opt}} \le \limsup_{n\to\infty}\frac{R_{IGG}}{R^{BO}_{Opt}} \le 1, $$
so the classification rule in Eq. 3–1 is ABOS, i.e. $\frac{R_{IGG}}{R^{BO}_{Opt}} \to 1$ as $n \to \infty$.
APPENDIX C
PROOFS FOR CHAPTER 4
In this Appendix, we provide proofs of all the theorems in Chapter 4.
C.1 Proofs for Section 4.2.3
C.1.1 Proof of Theorem 4.1
The proof of Theorem 4.1 is based on a lemma. This lemma is similar to Lemma 1.1 in Goh et al. (2017), with suitable modifications so that we utilize Conditions (A1)-(A3) explicitly. Furthermore, Goh et al. (2017) gave a sufficient condition for posterior consistency in the Frobenius norm when $p_n = o(n)$ in Theorem 1 of their paper. However, we are not clear about a particular step in their proof. They assert that
$$
\left\{(A,B) : n^{-1}\left(\|(Y_n - XC)\Sigma^{-1/2}\|_F^2 - \|(Y_n - XC^*)\Sigma^{-1/2}\|_F^2\right) < 2\nu,\ C = AB^{\top}\right\}
\supseteq \left\{(A,B) : n^{-1}\left|\,\|Y_n - XC\|_F^2 - \|Y_n - XC^*\|_F^2\,\right| < 2\tau_{\min}\nu,\ C = AB^{\top}\right\},
$$
where $\tau_{\min}$ is the minimum eigenvalue of $\Sigma$. This does not seem to be true in general, unless the matrix $(Y_n - XC)(Y_n - XC)^{\top} - (Y_n - XC^*)(Y_n - XC^*)^{\top}$ is positive definite, which cannot be assumed. Our proof for Theorem 4.1 thus gives a different sufficient condition for posterior consistency in this low-dimensional setting. Moreover, the proof of Theorem 4.2 in the ultrahigh-dimensional case requires a suitable modification of Theorem 4.1. Thus, we deem it beneficial to write out all the details for Lemma C.1 and Theorem 4.1.
Lemma C.1. Define $\mathcal{B}_{\varepsilon} = \{B_n : \|B_n - B_0\|_F > \varepsilon\}$, where $\varepsilon > 0$. To test $H_0: B_n = B_0$ vs. $H_1: B_n \in \mathcal{B}_{\varepsilon}$, define a test function $\Phi_n = \mathbf{1}(Y_n \in \mathcal{C}_n)$, where the critical region is $\mathcal{C}_n := \left\{Y_n : \|\widehat{B}_n - B_0\|_F > \varepsilon/2\right\}$ and $\widehat{B}_n = (X_n^{\top}X_n)^{-1}X_n^{\top}Y_n$. Then, under the model in Eq. 4–7 and assumptions (A1)-(A3), we have that as $n \to \infty$:
1. $\mathbb{E}_{B_0}(\Phi_n) \le \exp(-\varepsilon^2 nc_1/16d_2)$,
2. $\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Phi_n) \le \exp(-\varepsilon^2 nc_1/16d_2)$.
Proof of Lemma C.1. Since $\widehat{B}_n \sim \mathcal{MN}_{p_n\times q}(B_0, (X_n^{\top}X_n)^{-1}, \Sigma)$ w.r.t. $P_0$-measure,
$$ Z_n = (X_n^{\top}X_n)^{1/2}(\widehat{B}_n - B_0)\Sigma^{-1/2} \sim \mathcal{MN}_{p_n\times q}(O, I_{p_n}, I_q). \tag{C–1} $$
Using the fact that for square conformal positive definite matrices $A, B$, $\lambda_{\min}(A)\operatorname{tr}(B) \le \operatorname{tr}(AB) \le \lambda_{\max}(A)\operatorname{tr}(B)$, we have
$$
\begin{aligned}
\mathbb{E}_{B_0}(\Phi_n) &= P_{B_0}\left(Y_n : \|\widehat{B}_n - B_0\|_F > \varepsilon/2\right) \\
&= P_{B_0}\left(\|(X_n^{\top}X_n)^{-1/2}Z_n\Sigma^{1/2}\|_F^2 > \varepsilon^2/4\right) \quad \text{(by Eq. C–1)} \\
&= P_{B_0}\left(\operatorname{tr}(\Sigma^{1/2}Z_n^{\top}(X_n^{\top}X_n)^{-1}Z_n\Sigma^{1/2}) > \varepsilon^2/4\right) \\
&\le P_{B_0}\left(n^{-1}c_1^{-1}\operatorname{tr}(\Sigma^{1/2}Z_n^{\top}Z_n\Sigma^{1/2}) > \varepsilon^2/4\right) \\
&\le P_{B_0}\left(n^{-1}c_1^{-1}d_2\operatorname{tr}(Z_n^{\top}Z_n) > \varepsilon^2/4\right) \\
&= P_{B_0}\left(\|Z_n\|_F^2 > \frac{\varepsilon^2 c_1 n}{4d_2}\right) = \Pr\left(\chi^2_{p_nq} > \frac{\varepsilon^2 c_1 n}{4d_2}\right), \tag{C–2}
\end{aligned}
$$
where the two inequalities follow from Assumptions (A2) and (A3), respectively, and the last equality follows from Eq. C–1. By Armagan et al. (2013a), for all $m > 0$, $\Pr(\chi^2_m \ge x) \le \exp(-x/4)$ whenever $x \ge 8m$. Using Assumption (A1) and noting that $q$ is fixed, we have by Eq. C–2 that as $n \to \infty$,
$$ \mathbb{E}_{B_0}(\Phi_n) \le \Pr\left(\chi^2_{p_nq} > \frac{\varepsilon^2 c_1 n}{4d_2}\right) \le \exp\left(-\frac{\varepsilon^2 c_1 n}{16d_2}\right), $$
thus establishing the first part of the lemma.

We next show the second part of the lemma. We have
$$
\begin{aligned}
\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Phi_n) &= \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : \|\widehat{B}_n - B_0\|_F \le \varepsilon/2\right) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : \left|\,\|\widehat{B}_n - B_n\|_F - \|B_n - B_0\|_F\,\right| \le \varepsilon/2\right) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : -\varepsilon/2 + \|B_n - B_0\|_F \le \|\widehat{B}_n - B_n\|_F\right) \\
&\le P_{B_n}\left(Y_n : \|\widehat{B}_n - B_n\|_F > \varepsilon/2\right) \\
&\le \exp\left(-\frac{\varepsilon^2 c_1 n}{16d_2}\right),
\end{aligned}
$$
where the last inequality follows from the fact that $\widehat{B}_n \sim \mathcal{MN}_{p_n\times q}(B_n, (X_n^{\top}X_n)^{-1}, \Sigma)$ under $P_{B_n}$, so we may use the same steps that were used to prove the first part of the lemma. Therefore, we have also established the second part of the lemma.
Proof of Theorem 4.1. We utilize the proof technique of Theorem 1 in Armagan et al. (2013a) and modify it suitably for the multivariate case, subject to Conditions (A1)-(A3). The posterior probability of $\mathcal{B}_{\varepsilon}$ is given by
$$ \Pi_n(\mathcal{B}_{\varepsilon}|Y_n) = \frac{\int_{\mathcal{B}_{\varepsilon}}\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)}{\int\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)} \le \Phi_n + \frac{(1-\Phi_n)J_{\mathcal{B}_{\varepsilon}}}{J_n} = I_1 + \frac{I_2}{J_n}, \tag{C–3} $$
where $J_{\mathcal{B}_{\varepsilon}} = \int_{\mathcal{B}_{\varepsilon}}\left\{\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\right\}\pi_n(dB_n)$ and $J_n = \int\left\{\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\right\}\pi_n(dB_n)$.
Let $b = \frac{\varepsilon^2 c_1}{16d_2}$. For sufficiently large $n$, using Markov's inequality and the first part of Lemma C.1, we have
$$ P_{B_0}\left(I_1 \ge \exp\left(-\frac{bn}{2}\right)\right) \le \exp\left(\frac{bn}{2}\right)\mathbb{E}_{B_0}(I_1) \le \exp\left(-\frac{bn}{2}\right). $$
This implies that $\sum_{n=1}^{\infty}P_{B_0}\left(I_1 \ge \exp(-bn/2)\right) < \infty$. Thus, by the Borel-Cantelli Lemma, $I_1 \to 0$ a.s. $P_0$ as $n \to \infty$.
We next look at the behavior of $I_2$. We have
$$
\begin{aligned}
\mathbb{E}_{B_0}I_2 &= \mathbb{E}_{B_0}\{(1-\Phi_n)J_{\mathcal{B}_{\varepsilon}}\} = \mathbb{E}_{B_0}\left\{(1-\Phi_n)\int_{\mathcal{B}_{\varepsilon}}\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)\right\} \\
&= \int_{\mathcal{B}_{\varepsilon}}\int(1-\Phi_n)f(Y_n|B_n)\,dY_n\,\pi_n(dB_n) \\
&\le \pi_n(\mathcal{B}_{\varepsilon})\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Phi_n) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Phi_n) \le \exp(-bn),
\end{aligned}
$$
where the last inequality follows from the second part of Lemma C.1.

Thus, for sufficiently large $n$, $P_{B_0}(I_2 \ge \exp(-bn/2)) \le \exp(-bn/2)$, which implies that $\sum_{n=1}^{\infty}P_{B_0}\left(I_2 \ge \exp(-bn/2)\right) < \infty$. Thus, by the Borel-Cantelli Lemma, $I_2 \to 0$ a.s. $P_0$ as $n \to \infty$.
We have now shown that both $I_1$ and $I_2$ in Eq. C–3 tend towards zero exponentially fast. We now analyze the behavior of $J_n$. To complete the proof, we need to show that
$$ \exp(bn/2)J_n \to \infty \ P_0\text{-a.s. as } n \to \infty. \tag{C–4} $$
Note that
$$ \exp(bn/2)J_n = \exp(bn/2)\int\exp\left\{-n\cdot\frac1n\log\frac{f(Y_n|B_0)}{f(Y_n|B_n)}\right\}\pi_n(dB_n) \ge \exp\{(b/2-\nu)n\}\,\pi_n(D_{n,\nu}), \tag{C–5} $$
where $D_{n,\nu} = \left\{B_n : n^{-1}\log\left(\frac{f(Y_n|B_0)}{f(Y_n|B_n)}\right) < \nu\right\}$ for $0 < \nu < b/2$. Therefore, we have
$$
\begin{aligned}
D_{n,\nu} &= \left\{B_n : n^{-1}\left(\frac12\operatorname{tr}\left[(Y_n-X_nB_n)^{\top}(Y_n-X_nB_n)\Sigma^{-1}\right] - \frac12\operatorname{tr}\left[(Y_n-X_nB_0)^{\top}(Y_n-X_nB_0)\Sigma^{-1}\right]\right) < \nu\right\} \\
&\equiv \left\{B_n : n^{-1}\left(\operatorname{tr}\left[\Sigma^{-1/2}(Y_n-X_nB_n)^{\top}(Y_n-X_nB_n)\Sigma^{-1/2}\right] - \operatorname{tr}\left[\Sigma^{-1/2}(Y_n-X_nB_0)^{\top}(Y_n-X_nB_0)\Sigma^{-1/2}\right]\right) < 2\nu\right\} \\
&\equiv \left\{B_n : n^{-1}\left(\|(Y_n-X_nB_n)\Sigma^{-1/2}\|_F^2 - \|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F^2\right) < 2\nu\right\}.
\end{aligned}
$$
Noting that
$$ \|(Y_n-X_nB_n)\Sigma^{-1/2}\|_F^2 \le \|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F^2 + \|X_n(B_n-B_0)\Sigma^{-1/2}\|_F^2 + 2\|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F\|X_n(B_n-B_0)\Sigma^{-1/2}\|_F, $$
we have
$$
\begin{aligned}
\pi_n(D_{n,\nu}) &\ge \pi_n\left\{B_n : n^{-1}\left(2\|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F\|X_n(B_n-B_0)\Sigma^{-1/2}\|_F + \|X_n(B_n-B_0)\Sigma^{-1/2}\|_F^2\right) < 2\nu\right\} \\
&\ge \pi_n\left\{B_n : n^{-1}\|X_n(B_n-B_0)\Sigma^{-1/2}\|_F < \frac{2\nu}{3\kappa_n},\ \|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F < \kappa_n\right\}, \tag{C–6}
\end{aligned}
$$
for some positive increasing sequence $\kappa_n$ such that $\kappa_n \to \infty$ as $n \to \infty$.

Set $\kappa_n = n^{(1+\rho)/2}$ for $\rho > 0$. Since $E_n = Y_n - X_nB_0$, we have $Z_n = (Y_n - X_nB_0)\Sigma^{-1/2} \sim \mathcal{MN}_{n\times q}(O, I_n, I_q)$. Therefore, as $n \to \infty$,
$$ P_{B_0}(\|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F > \kappa_n) = P_{B_0}(\|Z_n\|_F^2 > \kappa_n^2) = \Pr\left(\chi^2_{nq} > n^{1+\rho}\right) \le \exp\left(-\frac{n^{1+\rho}}{4}\right), $$
where the last inequality follows from the fact that for all $m > 0$, $\Pr(\chi^2_m \ge x) \le \exp(-x/4)$ when $x \ge 8m$, and the assumptions that $q$ is fixed and $\rho > 0$. Since $\sum_{n=1}^{\infty}P_{B_0}(\|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F > \kappa_n) \le \sum_{n=1}^{\infty}\exp\left(-\frac{n^{1+\rho}}{4}\right) < \infty$, we have by the Borel-Cantelli Lemma that
$$ P_{B_0}\left\{\|(Y_n-X_nB_0)\Sigma^{-1/2}\|_F > \kappa_n \text{ infinitely often}\right\} = 0. $$
For sufficiently large $n$, we have from Eq. C–6 that
$$
\begin{aligned}
\pi_n(D_{n,\nu}) &\ge \pi_n\left\{B_n : n^{-1}\|X_n(B_n-B_0)\Sigma^{-1/2}\|_F < \frac{2\nu}{3\kappa_n}\right\} \\
&\ge \pi_n\left\{B_n : n^{-1}n^{1/2}c_2^{1/2}d_1^{-1/2}\|B_n-B_0\|_F < \frac{2\nu}{3\kappa_n}\right\} \\
&= \pi_n\left\{B_n : \|B_n-B_0\|_F < \left(\frac{2d_1^{1/2}\nu}{3c_2^{1/2}}\right)n^{-(1+\rho)/2}n^{1/2}\right\} \\
&= \pi_n\left\{B_n : \|B_n-B_0\|_F < \frac{\Delta}{n^{\rho/2}}\right\}, \tag{C–7}
\end{aligned}
$$
where $\Delta = \frac{2d_1^{1/2}\nu}{3c_2^{1/2}}$. The second inequality in Eq. C–7 follows from Assumptions (A2) and (A3) and the fact that
$$
\|X_n(B_n-B_0)\Sigma^{-1/2}\|_F = \sqrt{\operatorname{tr}\left[\Sigma^{-1/2}(B_n-B_0)^{\top}X_n^{\top}X_n(B_n-B_0)\Sigma^{-1/2}\right]}
\le \sqrt{\lambda_{\max}(X_n^{\top}X_n)\lambda_{\max}(\Sigma^{-1})\|B_n-B_0\|_F^2}
< n^{1/2}c_2^{1/2}d_1^{-1/2}\|B_n-B_0\|_F.
$$
Therefore, from Eq. C–7, if $\pi_n\left\{B_n : \|B_n-B_0\|_F < \frac{\Delta}{n^{\rho/2}}\right\} > \exp(-kn)$ for all $0 < k < b/2-\nu$, then Eq. C–4 will hold.
Substituting $b = \frac{\varepsilon^2 c_1}{16d_2}$ and $\Delta = \frac{2d_1^{1/2}\nu}{3c_2^{1/2}}$ (so that $\nu = \frac{3\Delta c_2^{1/2}}{2d_1^{1/2}}$), we obtain $0 < k < \frac{\varepsilon^2 c_1}{32d_2} - \frac{3\Delta c_2^{1/2}}{2d_1^{1/2}}$. To ensure that $k > 0$, we must have $0 < \Delta < \frac{\varepsilon^2 c_1 d_1^{1/2}}{48c_2^{1/2}d_2}$.

Therefore, if the conditions on $\Delta$ and $k$ in Theorem 4.1 are satisfied, then Eq. C–4 holds. This ensures that the expected value of Eq. C–3 w.r.t. $P_0$-measure approaches 0 as $n \to \infty$, which ultimately establishes that posterior consistency holds if Eq. 4–8 is satisfied.
C.1.2 Proof of Theorem 4.2
The proof of Theorem 4.2 also requires the construction of an appropriate test function. In this case, the test must be very carefully constructed, since $X_n^{\top}X_n$ is no longer nonsingular. We first define some constants and prove a lemma.
For arbitrary $\varepsilon > 0$ and with $c_1$ and $d_2$ as specified in (B3) and (B5), let
$$ c_3 = \frac{\varepsilon^2 c_1}{16d_2}, \tag{C–8} $$
and
$$ m_n = \left\lfloor\frac{nc_3}{6\log p_n}\right\rfloor. \tag{C–9} $$
Lemma C.2. Define the set $\mathcal{B}_{\varepsilon} = \{B_n : \|B_n - B_0\|_F > \varepsilon\}$. Suppose that Conditions (B1)-(B6) hold under Eq. 4–7. In order to test $H_0: B_n = B_0$ vs. $H_1: B_n \in \mathcal{B}_{\varepsilon}$, there exists a test function $\Psi_n$ such that as $n \to \infty$:
1. $\mathbb{E}_{B_0}(\Psi_n) \le \exp(-nc_3/2)$,
2. $\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Psi_n) \le \exp(-nc_3)$,
where $c_3$ is defined in Eq. C–8.

Proof of Lemma C.2. By Condition (B1), we must have that $\frac{n}{\log p_n} \to \infty$. Moreover, by Eq. C–9, $m_n = o(n)$, since $\log p_n \to \infty$ as $n \to \infty$. Combining this with Assumption (B6), we must have that for sufficiently large $n$, the positive integer $m_n$ determined by Eq. C–9 satisfies $0 < s^* < m_n < n$.

For sufficiently large $n$ so that $s^* < m_n < n$, define the set $\mathcal{M}$ as the set of models $S$ which properly contain the true model $S^* \subset \{1, \dots, p_n\}$, so that
$$ \mathcal{M} = \{S : S \supset S^*,\ S \neq S^*,\ |S| \le m_n\}, \tag{C–10} $$
and define the set $\mathcal{T}$ as
$$ \mathcal{T} = \{S : S \subset \{1, \dots, p_n\},\ S \notin \mathcal{M},\ |S| \le n\}. \tag{C–11} $$
Let $X_S$ denote the submatrix of $X$ with columns indexed by model $S$, and let $B_0^S$ denote the submatrix of $B_0$ that contains the rows of $B_0$ indexed by $S$. Define the following sets $\mathcal{C}_n$ and $\mathcal{E}_n$:
$$ \mathcal{C}_n = \bigvee_{S\in\mathcal{M}}\left\{\|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F > \varepsilon/2\right\}, \tag{C–12} $$
$$ \mathcal{E}_n = \bigwedge_{S\in\mathcal{T}}\left\{\|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F \le \varepsilon/2\right\}, \tag{C–13} $$
where $\bigvee$ indicates the union over all models $S$ contained in $\mathcal{M}$, $\bigwedge$ indicates the intersection over all models contained in $\mathcal{T}$, and $\varepsilon > 0$ is arbitrary. Essentially, the set $\mathcal{C}_n$ is the union, over all models $S$ that properly contain the true model $S^*$ and whose submatrix $X_S$ has at least $s^*$ and at most $m_n (< n)$ columns, of the events $\|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F > \varepsilon/2$. Given our choice of $m_n$, $X_S^{\top}X_S$ is nonsingular for all models $S$ contained in our sets.

We are now ready to define our test function $\Psi_n$. To test $H_0: B_n = B_0$ vs. $H_1: B_n \in \mathcal{B}_{\varepsilon}$, define $\Psi_n = \mathbf{1}(Y_n \in \mathcal{C}_n)$, where the critical region is defined as in Eq. C–12. We now show that Lemma C.2 holds with this choice of $\Psi_n$.
Let $s$ be the size of an arbitrary model $S$. Noting also that there are $\binom{p_n}{s}$ ways to select a model of size $s$, we therefore have for sufficiently large $n$,
$$
\begin{aligned}
\mathbb{E}_{B_0}(\Psi_n) &\le \sum_{S\in\mathcal{M}}P_{B_0}\left(Y_n : \|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F > \varepsilon/2\right) \\
&= \sum_{s=s^*+1}^{m_n}\binom{p_n}{s}P_{B_0}\left(Y_n : \|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F > \varepsilon/2\right) \\
&\le \sum_{s=s^*+1}^{m_n}\binom{p_n}{s}\Pr\left(\chi^2_{sq} > \frac{\varepsilon^2 c_1 n}{4d_2}\right) \\
&\le \sum_{s=s^*+1}^{m_n}\binom{p_n}{s}\exp(-nc_3) \\
&\le (m_n-s^*)\binom{p_n}{m_n}\exp(-nc_3) \\
&\le (m_n-s^*)\left(\frac{ep_n}{m_n}\right)^{m_n}\exp(-nc_3), \tag{C–14}
\end{aligned}
$$
where we use the same argument as in Part 1 of Lemma C.1 for the second inequality; the facts that $\Pr(\chi^2_m > x) \le \exp(-x/4)$ when $x \ge 8m$ and $m_n = o(n)$ for the third inequality; and the fact that $\sum_{i=k}^m\binom{n}{i} \le (m-k+1)\binom{n}{m}$ for the fourth inequality in Eq. C–14.
Since $\log n = o(n)$, we must have for sufficiently large $n$ that $\log n < \frac{c_3 n}{6}$. Then, from the definition of $m_n$, we have
$$
\begin{aligned}
\log(m_n-s^*) + m_n\left(1 + \log\left(\frac{p_n}{m_n}\right)\right) &\le \log(m_n) + m_n(1 + \log(p_n)) \\
&\le \log(n) + \frac{c_3 n}{6\log p_n} + \left(\frac{c_3 n}{6\log p_n}\right)\log(p_n) \\
&\le \frac{c_3 n}{6} + \frac{c_3 n}{6} + \frac{c_3 n}{6} = \frac{c_3 n}{2}. \tag{C–15}
\end{aligned}
$$
Therefore, from Eq. C–14 and Eq. C–15, we must have that $\mathbb{E}_{B_0}(\Psi_n) \le \exp(-nc_3/2)$ as $n \to \infty$. This proves the first part of the lemma.
Next, letting $\mathcal{E}_n$ be the set defined in Eq. C–13, we observe that as $n \to \infty$,
$$
\begin{aligned}
\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Psi_n) &= \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}(Y_n \notin \mathcal{C}_n) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}(Y_n \in \mathcal{E}_n) \quad (\text{since } \mathcal{C}_n^c \subseteq \mathcal{E}_n) \\
&= \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(\bigcap_{S\in\mathcal{T}}\left\{Y_n : \|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F \le \varepsilon/2\right\}\right) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : \|(X_S^{\top}X_S)^{-1}X_S^{\top}Y_n - B_0^S\|_F \le \varepsilon/2\right) \text{ for some } S \in \mathcal{T}.
\end{aligned}
$$
Writing $\widehat{B}_n^S = (X_S^{\top}X_S)^{-1}X_S^{\top}Y_n$ for this single model $S \in \mathcal{T}$, we get
$$
\begin{aligned}
\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Psi_n) &\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : \left|\,\|\widehat{B}_n^S - B_n^S\|_F - \|B_n^S - B_0^S\|_F\,\right| \le \varepsilon/2\right) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}P_{B_n}\left(Y_n : -\varepsilon/2 + \|B_n^S - B_0^S\|_F \le \|\widehat{B}_n^S - B_n^S\|_F\right) \\
&\le P_{B_n^S}\left(Y_n : \|\widehat{B}_n^S - B_n^S\|_F > \varepsilon/2\right) \\
&\le \exp(-c_3 n), \tag{C–16}
\end{aligned}
$$
as $n \to \infty$. To arrive at Eq. C–16, we invoked Part 2 of Lemma C.1 for the final inequality.
Proof of Theorem 4.2. In light of Lemma C.2, we suitably modify the argument of Theorem 4.1 for the ultrahigh-dimensional case. Let $\Psi_n$ be the test function defined in Lemma C.2 for sufficiently large $n$. The posterior probability of $\mathcal{B}_{\varepsilon}$ is given by
$$ \Pi_n(\mathcal{B}_{\varepsilon}|Y_n) = \frac{\int_{\mathcal{B}_{\varepsilon}}\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)}{\int\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)} \le \Psi_n + \frac{(1-\Psi_n)J_{\mathcal{B}_{\varepsilon}}}{J_n} = I_1 + \frac{I_2}{J_n}, \tag{C–17} $$
where $J_{\mathcal{B}_{\varepsilon}} = \int_{\mathcal{B}_{\varepsilon}}\left\{\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\right\}\pi_n(dB_n)$ and $J_n = \int\left\{\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\right\}\pi_n(dB_n)$.

For sufficiently large $n$, using Markov's inequality and the first part of Lemma C.2, and taking $c_3$ as defined in Eq. C–8, we have
$$ P_{B_0}\left(I_1 \ge \exp\left(-\frac{nc_3}{4}\right)\right) \le \exp\left(\frac{nc_3}{4}\right)\mathbb{E}_{B_0}(I_1) \le \exp\left(-\frac{nc_3}{4}\right). $$
This implies that $\sum_{n=1}^{\infty}P_{B_0}\left(I_1 \ge \exp(-nc_3/4)\right) < \infty$. Thus, by the Borel-Cantelli Lemma, we have $P_{B_0}(I_1 \ge \exp(-nc_3/4) \text{ infinitely often}) = 0$, i.e. $I_1 \to 0$ a.s. $P_0$ as $n \to \infty$.
We next look at the behavior of $I_2$. We have
$$
\begin{aligned}
\mathbb{E}_{B_0}I_2 &= \mathbb{E}_{B_0}\{(1-\Psi_n)J_{\mathcal{B}_{\varepsilon}}\} = \mathbb{E}_{B_0}\left\{(1-\Psi_n)\int_{\mathcal{B}_{\varepsilon}}\frac{f(Y_n|B_n)}{f(Y_n|B_0)}\,\pi_n(dB_n)\right\} \\
&= \int_{\mathcal{B}_{\varepsilon}}\int(1-\Psi_n)f(Y_n|B_n)\,dY_n\,\pi_n(dB_n) \\
&\le \pi_n(\mathcal{B}_{\varepsilon})\sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Psi_n) \\
&\le \sup_{B_n\in\mathcal{B}_{\varepsilon}}\mathbb{E}_{B_n}(1-\Psi_n) \le \exp(-nc_3),
\end{aligned}
$$
where the last inequality follows from the second part of Lemma C.2, and $c_3$ is again from Eq. C–8.

Thus, for sufficiently large $n$, $P_{B_0}(I_2 \ge \exp(-nc_3/2)) \le \exp(-nc_3/2)$, which implies that $\sum_{n=1}^{\infty}P_{B_0}\left(I_2 \ge \exp(-nc_3/2)\right) < \infty$. Thus, by the Borel-Cantelli Lemma, $I_2 \to 0$ a.s. $P_0$ as $n \to \infty$.
We have now shown that both $I_1$ and $I_2$ in Eq. C–17 tend towards zero exponentially fast. We now analyze the behavior of $J_n$. To complete the proof, we need to show that
$$ \exp(nc_3/2)J_n \to \infty \ P_0\text{-a.s. as } n \to \infty. \tag{C–18} $$
Note that
$$ \exp(nc_3/2)J_n = \exp(nc_3/2)\int\exp\left\{-n\cdot\frac1n\log\frac{f(Y_n|B_0)}{f(Y_n|B_n)}\right\}\pi_n(dB_n) \ge \exp\{(c_3/2-\nu)n\}\,\pi_n(D_{n,\nu}), \tag{C–19} $$
where $D_{n,\nu} = \left\{B_n : n^{-1}\log\left(\frac{f(Y_n|B_0)}{f(Y_n|B_n)}\right) < \nu\right\}$ for $0 < \nu < c_3/2$.
Because of Assumption (B4), which plays the role of Assumption (A2) in bounding the maximum singular value of $X_n$ from above, the rest of the proof is essentially identical to the remainder of the proof of Theorem 4.1, with suitable modifications (i.e. replacing the constants in Conditions (A2)-(A3) with their analogues in Conditions (B3)-(B5), and substituting in the expression in Eq. C–8 for $c_3$).

Therefore, if the conditions on $\Delta$ and $k$ in Theorem 4.2 are satisfied, then Eq. C–19 is satisfied, i.e. $\exp(nc_3/2)J_n \to \infty$ as $n \to \infty$. This ensures that the expected value of Eq. C–17 w.r.t. $P_0$-measure approaches 0 as $n \to \infty$, which ultimately establishes that posterior consistency holds if Eq. 4–9 is satisfied.
C.2 Proofs for Section 4.2.4
C.2.1 Preliminary Lemmas
Before proving Theorems 4.3 and 4.4, we first prove two lemmas which characterize the marginal prior density for the rows of $B$. Throughout this section, we let $b_i$, $1 \le i \le p$, denote the $i$th row of $B$ under the model in Eq. 4–2, with polynomial-tailed hyperpriors of the form given in Eq. 1–7. Lemma C.4 in particular plays a central role in proving our theoretical results in Section 4.2.
Lemma C.3. Under the MBSP model in Eq. 4–2 with polynomial-tailed hyperpriors of the form in Eq. 1–7, the marginal density $\pi(b_i|\Sigma)$ is equal to
$$ \pi(b_i|\Sigma) = D\int_0^{\infty}\xi_i^{-q/2-a-1}\exp\left\{-\frac{1}{2\xi_i\tau}\|b_i\Sigma^{-1/2}\|_2^2\right\}L(\xi_i)\,d\xi_i, $$
where $D > 0$ is an appropriate constant.
Proof of Lemma C.3. Let $\mathcal{D} = \mathrm{diag}(\xi_1, \dots, \xi_p)$. Using Definition 4.1, the joint prior for the MBSP model in Eq. 4–2 with polynomial-tailed priors is
$$
\begin{aligned}
\pi(B, \xi_1, \dots, \xi_p|\Sigma) &\propto |\mathcal{D}|^{-q/2}|\Sigma|^{-p/2}\exp\left\{-\frac12\operatorname{tr}\left[\Sigma^{-1}B^{\top}\tau^{-1}\mathcal{D}^{-1}B\right]\right\}\times\prod_{i=1}^p\pi(\xi_i) \\
&\propto \left[\prod_{i=1}^p\xi_i^{-q/2}\right]\exp\left\{-\frac{1}{2\tau}\sum_{i=1}^p\|\xi_i^{-1/2}b_i\Sigma^{-1/2}\|_2^2\right\}\times\prod_{i=1}^p\pi(\xi_i) \\
&\propto \prod_{i=1}^p\left[\xi_i^{-q/2}\exp\left\{-\frac{1}{2\xi_i\tau}\|b_i\Sigma^{-1/2}\|_2^2\right\}\pi(\xi_i)\right] \\
&\propto \prod_{i=1}^p\left[\xi_i^{-q/2-a-1}\exp\left\{-\frac{1}{2\xi_i\tau}\|b_i\Sigma^{-1/2}\|_2^2\right\}L(\xi_i)\right]. \tag{C–20}
\end{aligned}
$$
Since the rows $b_i$ and the $\xi_i$'s, $1 \le i \le p$, are independent, we have from Eq. C–20 that
$$ \pi(b_i, \xi_i|\Sigma) \propto \xi_i^{-q/2-a-1}\exp\left\{-\frac{1}{2\xi_i\tau}\|b_i\Sigma^{-1/2}\|_2^2\right\}L(\xi_i). $$
Integrating out $\xi_i$ gives the desired marginal prior $\pi(b_i|\Sigma)$.
Though we are not able to obtain a closed form solution for $\pi(b_i|\Sigma)$, we are able to obtain a lower bound on it that can be written in closed form, as we illustrate in the next lemma.

Lemma C.4. Suppose Condition (A3) on the eigenvalues of $\Sigma$ and Condition (C1) on the slowly varying function $L(\cdot)$ in Eq. 1–7 hold. Under the MBSP model in Eq. 4–2 with polynomial-tailed hyperpriors of the form in Eq. 1–7 and known $\Sigma$, the marginal density for $b_i$, the $i$th row of $B$, can be bounded below by
$$ C\exp\left(-\frac{\|b_i\|_2^2}{2\tau d_1 t_0}\right), \tag{C–21} $$
where $C = Dc_0 t_0^{-q/2-a}\left(\frac q2 + a\right)^{-1}$.
Proof of Lemma C.4. Following from Lemma C.3, we have
$$
\begin{aligned}
\pi(b_i) &= D\int_0^{\infty}\xi_i^{-q/2-a-1}\exp\left\{-\frac{1}{2\xi_i\tau}\|b_i\Sigma^{-1/2}\|_2^2\right\}L(\xi_i)\,d\xi_i \\
&\ge D\int_0^{\infty}\xi_i^{-q/2-a-1}\exp\left\{-\frac{\|b_i\|_2^2}{2\xi_i\tau d_1}\right\}L(\xi_i)\,d\xi_i \quad \text{(C–22)} \\
&\ge D\int_{t_0}^{\infty}\xi_i^{-q/2-a-1}\exp\left\{-\frac{\|b_i\|_2^2}{2\xi_i\tau d_1}\right\}L(\xi_i)\,d\xi_i \\
&\ge Dc_0\int_{t_0}^{\infty}\xi_i^{-q/2-a-1}\exp\left\{-\frac{\|b_i\|_2^2}{2\xi_i\tau d_1}\right\}d\xi_i \quad \text{(C–23)} \\
&= Dc_0\left(\frac{2\tau d_1}{\|b_i\|_2^2}\right)^{q/2+a}\int_0^{\|b_i\|_2^2/2\tau d_1t_0}u^{q/2+a-1}e^{-u}\,du \quad \text{(C–24)} \\
&\ge Dc_0\left(\frac{2\tau d_1}{\|b_i\|_2^2}\right)^{q/2+a}\exp\left(-\frac{\|b_i\|_2^2}{2\tau d_1t_0}\right)\int_0^{\|b_i\|_2^2/2\tau d_1t_0}u^{q/2+a-1}\,du \\
&= Dc_0t_0^{-q/2-a}\left(\frac q2+a\right)^{-1}\exp\left(-\frac{\|b_i\|_2^2}{2\tau d_1t_0}\right) = C\exp\left(-\frac{\|b_i\|_2^2}{2\tau d_1t_0}\right),
\end{aligned}
$$
where Eq. C–22 follows from Condition (A3), Eq. C–23 follows from Condition (C1), and Eq. C–24 follows from the change of variables $u = \frac{\|b_i\|_2^2}{2\xi_i\tau d_1}$. We have thus established the lower bound in Eq. C–21 for the marginal density of $b_i$.
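The closed-form lower bound of Lemma C.4 can be verified by direct numerical integration in a simple special case. The sketch below takes $L \equiv 1$ (so $c_0 = 1$ and (C1) holds for any $t_0$), $\Sigma = I$ (so $d_1 = 1$), and the illustrative values $q = 2$, $a = 1$, $\tau = t_0 = 1$; the constant $D$ cancels from both sides:

```python
import numpy as np
from scipy.integrate import quad

q, a, tau, t0, d1, c0 = 2, 1.0, 1.0, 1.0, 1.0, 1.0

def marginal_over_D(r):
    # integral_0^inf xi^{-q/2-a-1} exp(-r/(2 xi tau)) L(xi) dxi, with L == 1
    # and r = ||b_i||_2^2 (Sigma = I, so ||b_i Sigma^{-1/2}||^2 = r).
    f = lambda xi: xi ** (-q / 2 - a - 1) * np.exp(-r / (2.0 * xi * tau))
    val, _ = quad(f, 0, np.inf)
    return val

def lower_bound_over_D(r):
    # C/D = c0 t0^{-q/2-a} (q/2 + a)^{-1} exp(-r / (2 tau d1 t0))
    return c0 * t0 ** (-q / 2 - a) / (q / 2 + a) * np.exp(-r / (2.0 * tau * d1 * t0))

for r in [0.5, 1.0, 4.0, 16.0]:
    assert marginal_over_D(r) >= lower_bound_over_D(r)
```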
C.2.2 Proofs for Theorem 4.3 and Theorem 4.4
Before we prove Theorems 4.3 and 4.4 for the MBSP model in Eq. 4–2 with hyperpriors of the form in Eq. 1–7, we first introduce some notation. Because we are operating under the assumption of sparsity, most of the rows of $B_0$ should contain only entries of zero.

Our proofs depend on partitioning $B_0$ into sets of active and inactive predictors. To this end, let $b_{0j}$ denote the $j$th row of the true coefficient matrix $B_0$ and $b_{nj}$ denote the $j$th row of $B_n$, where both $B_0$ and $B_n$ depend on $n$. We also let $\mathcal{A}_n := \{j : b_{0j} \neq 0,\ 1 \le j \le p_n\}$ denote the set of indices of the nonzero rows of $B_0$. This indicates the active predictors. Equivalently, $\mathcal{A}_n^c$ is the set of indices of the zero rows (or the inactive predictors).
Proof of Theorem 4.3. For the low-dimensional setting, let $s = |S|$ denote the size of the true model. Since (A1)-(A3) hold, it is enough to show (by Theorem 4.1) that, for sufficiently large $n$ and any $k > 0$,
$$ \pi_n\left(B_n : \|B_n - B_0\|_F < \frac{\Delta}{n^{\rho/2}}\right) > \exp(-kn), $$
where $0 < \Delta < \frac{\varepsilon^2 c_1 d_1^{1/2}}{48c_2^{1/2}d_2}$. We have
$$
\begin{aligned}
\pi_n\left(B_n : \|B_n - B_0\|_F < \frac{\Delta}{n^{\rho/2}}\right) &= \pi_n\left(B_n : \|B_n - B_0\|_F^2 < \frac{\Delta^2}{n^{\rho}}\right) \\
&= \pi_n\left(B_n : \sum_{j\in\mathcal{A}_n}\|b_{nj}-b_{0j}\|_2^2 + \sum_{j\in\mathcal{A}_n^c}\|b_{nj}\|_2^2 < \frac{\Delta^2}{n^{\rho}}\right) \\
&\ge \pi_n\left(B_n : \sum_{j\in\mathcal{A}_n}\|b_{nj}-b_{0j}\|_2^2 < \frac{s\Delta^2}{p_n n^{\rho}},\ \sum_{j\in\mathcal{A}_n^c}\|b_{nj}\|_2^2 < \frac{(p_n-s)\Delta^2}{p_n n^{\rho}}\right) \\
&\ge \left\{\prod_{j\in\mathcal{A}_n}\pi_n\left(b_{nj} : \|b_{nj}-b_{0j}\|_2^2 < \frac{\Delta^2}{p_n n^{\rho}}\right)\right\}\times\pi_n\left(\sum_{j\in\mathcal{A}_n^c}\|b_{nj}\|_2^2 < \frac{(p_n-s)\Delta^2}{p_n n^{\rho}}\right). \tag{C–25}
\end{aligned}
$$
Define the density
$$ \pi_n(b_j) \propto \exp\left(-\frac{\|b_j\|_2^2}{2\tau_n d_1 t_0}\right). \tag{C–26} $$
Since (C1) holds for the slowly varying component of Eq. 1–7, we have by the lower bound in Lemma C.4, Eq. C–25, and Eq. C–26 that it is sufficient to show that
$$ \left\{\pi_n\left(b_{nj} : \|b_{nj}-b_{0j}\|_2^2 < \frac{\Delta^2}{p_n n^{\rho}}\right)\right\}^s\times\pi_n\left(\sum_{j\in\mathcal{A}_n^c}\|b_{nj}\|_2^2 < \frac{(p_n-s)\Delta^2}{p_n n^{\rho}}\right) > \exp(-kn) \tag{C–27} $$
for sufficiently large $n$ and any $k > 0$, in order to obtain posterior consistency (again by Theorem 4.1). We consider the two terms in the product on the left-hand side of Eq. C–27 separately. Note that
$$
\pi_n\left(b_{nj} : \sum_{k=1}^q(b_{njk}-b_{0jk})^2 < \frac{\Delta^2}{p_n n^{\rho}}\right) \ge \pi_n\left(b_{nj} : \sum_{k=1}^q|b_{njk}-b_{0jk}| < \frac{\Delta}{\sqrt{p_n n^{\rho}}}\right)
\ge \prod_{k=1}^q\left\{\pi_n\left(b_{njk} : |b_{njk}-b_{0jk}| < \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right)\right\}. \tag{C–28}
$$
By Eq. C–26, $\pi_n(b_{njk}) = \frac{1}{\sqrt{2\pi\tau_n d_1 t_0}}\exp\left(-\frac{b_{njk}^2}{2\tau_n d_1 t_0}\right)$, i.e. $b_{njk} \sim N(0, \tau_n d_1 t_0)$. Therefore, we have
$$
\pi_n\left(b_{njk} : |b_{njk}-b_{0jk}| < \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right) = \frac{1}{\sqrt{2\pi\tau_n d_1 t_0}}\int_{b_{0jk}-\frac{\Delta}{q\sqrt{p_n n^{\rho}}}}^{b_{0jk}+\frac{\Delta}{q\sqrt{p_n n^{\rho}}}}\exp\left(-\frac{b_{njk}^2}{2\tau_n d_1 t_0}\right)db_{njk}
= \Pr\left(-\frac{\Delta}{q\sqrt{p_n n^{\rho}}} \le X - b_{0jk} \le \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right)
= \Pr\left(|X-b_{0jk}| \le \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right), \tag{C–29}
$$
where $X \sim N(b_{0jk}, \tau_n d_1 t_0)$. By Assumption (C2), $b_{0jk}$ is finite for every $n$. Furthermore, for any random variable $X \sim N(\mu, \sigma^2)$, we have the concentration inequality $\Pr(|X-\mu| > t) \le 2e^{-\frac{t^2}{2\sigma^2}}$ for any $t > 0$. Setting $X \sim N(b_{0jk}, \tau_n d_1 t_0)$ and $t = \frac{\Delta}{q\sqrt{p_n n^{\rho}}}$, we have
$$ \Pr\left(|X-b_{0jk}| \le \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right) = 1 - \Pr\left(|X-b_{0jk}| \ge \frac{\Delta}{q\sqrt{p_n n^{\rho}}}\right) \ge 1 - 2\exp\left(-\frac{\Delta^2}{2q^2 p_n n^{\rho}\tau_n d_1 t_0}\right). \tag{C–30} $$
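The sub-Gaussian concentration inequality invoked for Eq. C–30 can also be confirmed numerically; a short sketch:

```python
import numpy as np
from scipy.stats import norm

# Pr(|X - mu| > t) <= 2 exp(-t^2 / (2 sigma^2)) for X ~ N(mu, sigma^2), t > 0.
for sigma in [0.5, 1.0, 3.0]:
    for t in np.linspace(0.1, 10.0, 50):
        exact = 2.0 * norm.sf(t / sigma)   # exact two-sided tail probability
        assert exact <= 2.0 * np.exp(-t**2 / (2.0 * sigma**2))
```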
We now consider the second term on the left-hand side of Eq. C–27. Since $\mathbb{E}(b_{njk}^2) = \tau_n d_1 t_0$, an application of Markov's inequality gives
$$ \pi_n\left(\sum_{j\in\mathcal{A}_n^c}\sum_{k=1}^q b_{njk}^2 < \frac{(p_n-s)\Delta^2}{p_n n^{\rho}}\right) \ge 1 - \frac{p_n n^{\rho}\,\mathbb{E}\left(\sum_{j\in\mathcal{A}_n^c}\sum_{k=1}^q b_{njk}^2\right)}{(p_n-s)\Delta^2} = 1 - \frac{p_n q n^{\rho}\tau_n d_1 t_0}{\Delta^2}. \tag{C–31} $$
Combining Eq. C–27 through Eq. C–31, we obtain as a lower bound for the left-hand side of Eq. C–27,
$$ \left\{1 - 2\exp\left(-\frac{\Delta^2}{2q^2 p_n n^{\rho}\tau_n d_1 t_0}\right)\right\}^{qs}\left(1 - \frac{p_n q n^{\rho}\tau_n d_1 t_0}{\Delta^2}\right). \tag{C–32} $$
By Assumption (C3), it is clear that Eq. C–32 tends to 1 as $n \to \infty$, so this quantity is certainly greater than $e^{-kn}$ for any $k > 0$ for sufficiently large $n$. Since the lower bound in Eq. C–27 holds for all sufficiently large $n$, we have under the given conditions that the MBSP model in Eq. 4–2 achieves posterior consistency in the Frobenius norm.
Proof of Theorem 4.4. For the ultrahigh-dimensional setting, we first let $S^* \subset \{1, 2, \dots, p_n\}$ denote the indices of the nonzero rows, and denote the true size of $S^*$ as $s^* = |S^*|$. Since (B1)-(B6) hold, it is enough to show by Theorem 4.2 that, for sufficiently large $n$ and any $k > 0$,
$$ \pi_n\left(B_n : \|B_n - B_0\|_F < \frac{\Delta}{n^{\rho/2}}\right) > \exp(-kn), $$
where $0 < \Delta < \frac{\varepsilon^2 c_1 d_1^{1/2}}{48c_2^{1/2}d_2}$ and $\rho > 0$.

By Assumption (C1) for the slowly varying component in Eq. 1–7, Lemma C.4, and Theorem 4.2, it is thus sufficient to show that
$$ \left\{\pi_n\left(b_{nj} : \|b_{nj}-b_{0j}\|_2^2 < \frac{\Delta^2}{p_n n^{\rho}}\right)\right\}^{s^*}\times\pi_n\left(\sum_{j\in\mathcal{A}_n^c}\|b_{nj}\|_2^2 < \frac{(p_n-s^*)\Delta^2}{p_n n^{\rho}}\right) > \exp(-kn) \tag{C–33} $$
for sufficiently large $n$ and any $k > 0$, where the density $\pi_n$ is defined in Eq. C–26. Mimicking the proof of Theorem 4.3 and given regularity conditions (C1) and (C2), we obtain as a lower bound for the left-hand side of Eq. C–33,
$$ \left\{1 - 2\exp\left(-\frac{\Delta^2}{2q^2 p_n n^{\rho}\tau_n d_1 t_0}\right)\right\}^{qs^*}\left(1 - \frac{p_n q n^{\rho}\tau_n d_1 t_0}{\Delta^2}\right). \tag{C–34} $$
Under Assumption (C3), Eq. C–34 is clearly greater than $e^{-kn}$ for any $k > 0$ and sufficiently large $n$, since Eq. C–34 tends to 1 as $n \to \infty$. We have thus proven posterior consistency in the Frobenius norm for the ultrahigh-dimensional case as well.
APPENDIX D
GIBBS SAMPLER FOR THE MBSP-TPBN MODEL
In this Appendix, we provide the technical details of the Gibbs sampler for the MBSP-TPBN
model in Eq. 4–13 from Section 4.3.2. We also present an efficient method for sampling from
the full conditional of B. These algorithms are implemented in the R package MBSP.
D.1 Full Conditional Densities for the Gibbs Sampler
The full conditional densities are all available in closed form as follows. Letting T =
diag(ψ_1, ..., ψ_p), b_i denote the ith row of B, and GIG(a, b, p) denote a generalized inverse
Gaussian density with f(x; a, b, p) ∝ x^{p−1} e^{−(a/x + bx)/2}, we have

B | rest ∼ MN_{p×q}( (X⊤X + T^{−1})^{−1} X⊤Y, (X⊤X + T^{−1})^{−1}, Σ ),
Σ | rest ∼ IW( n + p + d, (Y − XB)⊤(Y − XB) + B⊤T^{−1}B + kI_q ),
ψ_i | rest ∼ GIG( ||b_i Σ^{−1/2}||_2^2, 2ζ_i, u − q/2 ), independently for i = 1, ..., p, (D–1)
ζ_i | rest ∼ G( a, τ + ψ_i ), independently for i = 1, ..., p.
Because the full conditional densities are available in closed form, we can implement the
MBSP-TPBN model in Eq. 4–13 straightforwardly using Gibbs sampling.
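The ψ_i updates in Eq. D–1 require draws from the three-parameter GIG density above. As an illustrative sketch (in Python rather than R; the helper name sample_gig and the parameter mapping to SciPy's geninvgauss are ours and should be verified against the SciPy documentation), the (a, b, p) parameterization can be matched to SciPy's as follows:

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(a, b, p, size, rng):
    """Draw from the GIG density f(x; a, b, p) ∝ x^(p-1) exp(-(a/x + b x)/2),
    assuming a > 0 and b > 0.

    SciPy's geninvgauss(p, b_s, scale=s) has density
    ∝ (x/s)^(p-1) exp(-b_s ((x/s) + (s/x)) / 2),
    so matching the two exponents gives b_s = sqrt(a*b) and s = sqrt(a/b)."""
    return geninvgauss.rvs(p, np.sqrt(a * b), scale=np.sqrt(a / b),
                           size=size, random_state=rng)
```

With a = ||b_i Σ^{−1/2}||_2^2, b = 2ζ_i, and p = u − q/2, one call per row would give the ψ_i draws in Eq. D–1.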
D.2 Fast Sampling of the Full Conditional Density for B
In Eq. D–1, the most computationally intensive operation is sampling from the density
π(B | rest). Much of the computational cost comes from computing the inverse
(X⊤X + T^{−1})^{−1}, which requires O(p^3) time if we use Cholesky factorization methods. In
the case where p < n, this is not a problem. However, when p ≫ n, this operation can be
prohibitively costly.
In this section, we provide an alternative algorithm for sampling from the density
MN_{p×q}( (X⊤X + T^{−1})^{−1} X⊤Y, (X⊤X + T^{−1})^{−1}, Σ ) in O(n^2 p) time. Bhattacharya
et al. (2016) originally devised an algorithm to efficiently sample from a class of structured
multivariate Gaussian distributions. Our algorithm below is a matrix-variate extension of the
algorithm given by Bhattacharya et al. (2016).
Algorithm 1
Step 1. Sample U ∼ MN_{p×q}(O, T, Σ) and M ∼ MN_{n×q}(O, I_n, Σ).
Step 2. Set V = XU + M.
Step 3. Solve for W in the system of equations (XTX⊤ + I_n)W = Y − V.
Step 4. Set Θ = U + TX⊤W.
With the above algorithm, we have the following proposition.
Proposition D.1. Suppose Θ is obtained by following Algorithm 1. Then

Θ ∼ MN_{p×q}( (X⊤X + T^{−1})^{−1} X⊤Y, (X⊤X + T^{−1})^{−1}, Σ ).
Proof. This follows from a trivial modification of Proposition 1 in Bhattacharya et al. (2016).
From Algorithm 1, it is clear that the most computationally intensive step is solving the
system of equations in Step 3. However, since T is a diagonal matrix, it follows from the
arguments in Bhattacharya et al. (2016) that computing the inverse of (XTX⊤ + I_n) can be
done in O(n^2 p) time. Once this inverse is obtained, solving the system of equations can be
done in O(n^2 q) time, and in general, q ≪ p. Algorithm 1 is thus O(n^2 p) when p > n.
Since our algorithm scales linearly with p, it provides a significant reduction in computing
time over typical methods based on Cholesky factorization when p ≫ n.
On the other hand, if p < n, then Algorithm 1 provides no time savings, so in that case we
simply use Cholesky factorization methods to sample from the full conditional density
π(B | rest) in O(p^3) time.
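As a concrete sketch of the steps above (in Python/NumPy rather than R; the function name sample_B and the choice to pass T as the vector of its diagonal entries are ours), Algorithm 1 can be written as:

```python
import numpy as np

def sample_B(X, Y, T_diag, Sigma, rng):
    """One draw from MN_{p x q}((X'X + T^{-1})^{-1} X'Y, (X'X + T^{-1})^{-1}, Sigma),
    following Algorithm 1; cost is O(n^2 p) when p > n."""
    n, p = X.shape
    q = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)                       # Sigma = L L'
    # Step 1: U ~ MN_{p x q}(O, T, Sigma), M ~ MN_{n x q}(O, I_n, Sigma)
    U = np.sqrt(T_diag)[:, None] * (rng.standard_normal((p, q)) @ L.T)
    M = rng.standard_normal((n, q)) @ L.T
    # Step 2: V = XU + M
    V = X @ U + M
    # Step 3: solve the n x n system (X T X' + I_n) W = Y - V
    W = np.linalg.solve(X @ (T_diag[:, None] * X.T) + np.eye(n), Y - V)
    # Step 4: Theta = U + T X' W
    return U + T_diag[:, None] * (X.T @ W)
```

The only O(n^2 p) work is forming XTX⊤; the subsequent n × n solve replaces the O(p^3) Cholesky factorization of X⊤X + T^{−1}.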
D.3 Convergence of the Gibbs Sampler
In order to ensure quick convergence, we need good initial guesses B^{(init)} and Σ^{(init)}
for B and Σ, respectively. We take as our initial guess for B,
B^{(init)} = (X⊤X + λI_p)^{−1}X⊤Y, where λ = δ + λ_{min+}(X), λ_{min+}(X) is the smallest
positive singular value of X, and δ = 0.01.
[Figure D-1 appears here: four trace plots, x-axis "Iteration" (0 to 10,000), y-axis −10 to 10, with panels titled "True Nonzero Coefficient" and "True Zero Coefficient".]

Figure D-1. History plots of the first 10,000 draws from the Gibbs sampler for the MBSP-TPBN model described in Section D.1 for randomly drawn coefficients b_ij in B_0 from Experiments 5 and 6 in Section 4.4.1. The top two plots are taken from Experiment 5 (n = 100, p = 500, q = 3), and the bottom two plots are taken from Experiment 6 (n = 150, p = 1000, q = 4).
This ensures that X⊤X + λI_p is positive definite. For Σ, we take as our initial guess
Σ^{(init)} = (1/n)(Y − XB^{(init)})⊤(Y − XB^{(init)}).
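These initial guesses can be sketched as follows (Python/NumPy; the function name init_guesses and the tolerance used to decide which singular values count as positive are our choices):

```python
import numpy as np

def init_guesses(X, Y, delta=0.01, tol=1e-10):
    """Ridge-type initial guesses B_init and Sigma_init for the Gibbs sampler."""
    n, p = X.shape
    sv = np.linalg.svd(X, compute_uv=False)
    lam = delta + sv[sv > tol].min()     # delta + smallest positive singular value
    # B_init = (X'X + lam I_p)^{-1} X'Y; the ridge term makes the system nonsingular
    B_init = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    resid = Y - X @ B_init
    Sigma_init = resid.T @ resid / n     # (1/n)(Y - X B_init)'(Y - X B_init)
    return B_init, Sigma_init
```

Even when p > n (so that X⊤X alone is singular), adding λI_p with λ > 0 makes the system solvable.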
Figure D-1 shows the history plots of the first 10,000 draws from the Gibbs sampler for the
MBSP-TPBN model described in Section D.1 for four randomly drawn coefficients b_ij in B
from Experiments 5 and 6 in Section 4.4.1. The top two plots correspond to a true nonzero
coefficient (b_{0ij} = −3.8103) and a true zero coefficient (b_{0ij} = 0) from Experiment 5 in
Section 4.4.1 (n = 100, p = 500, q = 3). The bottom two plots correspond to a true nonzero
coefficient (b_{0ij} = 3.1436) and a true zero coefficient (b_{0ij} = 0) from Experiment 6 in
Section 4.4.1 (n = 150, p = 1000, q = 4).
We consider two different Markov chains with different starting values for B^{(init)}: 1) the
ridge estimator described above, and 2) the regularized MLASSO estimator described in
Section 4.4.1. We see from the plots in Figure D-1 that although the two chains start from
different initial values of b_ij^{(init)}, they mix well and appear to converge rapidly to a
stationary distribution that captures the true coefficients b_{0ij} with high probability.
REFERENCES

Abramovich, F., Grinshtein, V., & Pensky, M. (2007). On optimality of Bayesian testimation in the normal means problem. Annals of Statistics, 35(5), 2261–2286.

Armagan, A., Clyde, M., & Dunson, D. B. (2011). Generalized beta mixtures of Gaussians. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, & K. Weinberger (Eds.) Advances in Neural Information Processing Systems 24, (pp. 523–531).

Armagan, A., Dunson, D., Lee, J., Bajwa, W., & Strawn, N. (2013a). Posterior consistency in linear models under shrinkage priors. Biometrika, 100(4), 1011–1018.

Armagan, A., Dunson, D. B., & Lee, J. (2013b). Generalized double Pareto shrinkage. Statistica Sinica, 23(1), 119–143.

Belloni, A., Chernozhukov, V., & Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4), 791–806.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 57(1), 289–300.

Berger, J. (1980). A robust generalized Bayes estimator and confidence region for a multivariate normal mean. Annals of Statistics, 8(4), 716–761.

Bhadra, A., Datta, J., Polson, N. G., & Willard, B. (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis, 12(4), 1105–1131.

Bhadra, A., & Mallick, B. K. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2), 447–457.

Bhattacharya, A., Chakraborty, A., & Mallick, B. K. (2016). Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4), 985–991.

Bhattacharya, A., Pati, D., Pillai, N. S., & Dunson, D. B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512), 1479–1490.

Bingham, N., Goldie, C., & Teugels, J. (1987). Regular Variation (Encyclopedia of Mathematics and its Applications). Cambridge University Press.

Bogdan, M., Chakrabarti, A., Frommlet, F., & Ghosh, J. K. (2011). Asymptotic Bayes-optimality under sparsity of some multiple testing procedures. Annals of Statistics, 39(3), 1551–1579.

Brown, P. J., Vannucci, M., & Fearn, T. (1998). Multivariate Bayesian variable selection and prediction. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(3), 627–641.

Bunea, F., She, Y., & Wegkamp, M. H. (2012). Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. Annals of Statistics, 40(5), 2359–2388.

Camponovo, L. (2015). On the validity of the pairs bootstrap for lasso estimators. Biometrika, 102(4), 981–987.

Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313–2351.

Carvalho, C. M., Polson, N. G., & Scott, J. G. (2009). Handling sparsity via the horseshoe. In D. van Dyk, & M. Welling (Eds.) Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, vol. 5 of Proceedings of Machine Learning Research, (pp. 73–80). Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA: PMLR.

Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465–480.

Castillo, I., Schmidt-Hieber, J., & van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Annals of Statistics, 43(5), 1986–2018.

Castillo, I., & van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. Annals of Statistics, 40(4), 2069–2101.

Chen, L., & Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association, 107(500), 1533–1545.

Chun, H., & Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1), 3–25.

Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36, 453–471.

Dasgupta, S. (2016). High-dimensional posterior consistency of the Bayesian lasso. Communications in Statistics - Theory and Methods, 45(22), 6700–6708.

Datta, J., & Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1), 111–132.

Donoho, D. L., Johnstone, I. M., Hoch, J. C., & Stern, A. S. (1992). Maximum entropy and the nearly black object. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 54(1), 41–81.

Efron, B. (2010). The future of indirect evidence. Statistical Science, 25(2), 145–157.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911.

Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics, 38(6), 3567–3604.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.

George, E. I., & Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika, 87(4), 731–747.

George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881–889.

Ghosal, S., Ghosh, J. K., & van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics, 28(2), 500–531.

Ghosh, P., & Chakrabarti, A. (2017). Asymptotic optimality of one-group shrinkage priors in sparse high-dimensional problems. Bayesian Analysis, 12(4), 1133–1161.

Ghosh, P., Tang, X., Ghosh, M., & Chakrabarti, A. (2016). Asymptotic properties of Bayes risk of a general class of shrinkage priors in multiple hypothesis testing under sparsity. Bayesian Analysis, 11(3), 753–796.

Goh, G., Dey, D. K., & Chen, K. (2017). Bayesian sparse reduced rank multivariate regression. Journal of Multivariate Analysis, 157, 14–28.

Griffin, J. E., & Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1), 171–188.

Griffin, J. E., & Brown, P. J. (2013). Some priors for sparse regression modelling. Bayesian Analysis, 8(3), 691–702.

Ishwaran, H., & Rao, J. S. (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. Annals of Statistics, 33(2), 730–773.

Ishwaran, H., & Rao, J. S. (2011). Consistency of spike and slab regression. Statistics & Probability Letters, 81(12), 1920–1928.

Johnstone, I. M., & Silverman, B. W. (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. Annals of Statistics, 32(4), 1594–1649.

Johnstone, I. M., & Silverman, B. W. (2005). Empirical Bayes selection of wavelet thresholds. Annals of Statistics, 33(4), 1700–1752.

Li, Y., Nan, B., & Zhu, J. (2015). Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics, 71(2), 354–363.

Liquet, B., Bottolo, L., Campanella, G., Richardson, S., & Chadeau-Hyam, M. (2016). R2GUESS: A graphics processing unit-based R package for Bayesian variable selection regression of multivariate responses. Journal of Statistical Software, 69(1), 1–32.

Liquet, B., Mengersen, K., Pettitt, A. N., & Sutton, M. (2017). Bayesian variable selection regression of multivariate responses for group data. Bayesian Analysis, 12(4), 1039–1067.

Mitchell, T., & Beauchamp, J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404), 1023–1032.

Narisetty, N. N., & He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. Annals of Statistics, 42(2), 789–817.

Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681–686.

Polson, N. G., & Scott, J. G. (2012). On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4), 887–902.

Ročková, V. (2018). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. Annals of Statistics, 46(1), 401–437.

Ročková, V., & George, E. I. (2016). The spike-and-slab lasso. Journal of the American Statistical Association. To appear.

Rothman, A. J., Levina, E., & Zhu, J. (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4), 947–962.

Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D'Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W., Golub, T. R., & Sellers, W. R. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 203–209.

Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. Annals of Mathematical Statistics, 42(1), 385–388.

Sun, T., & Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika, 99(4), 879–898.

Tang, X., Xu, X., Ghosh, M., & Ghosh, P. (2017). Bayesian variable selection and estimation based on global-local shrinkage priors. Sankhya A. Retrieved from https://doi.org/10.1007/s13171-017-0118-2.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

van der Pas, S., Kleijn, B., & van der Vaart, A. (2014). The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics, 8(2), 2585–2618.

van der Pas, S., Salomond, J.-B., & Schmidt-Hieber, J. (2016). Conditions for posterior contraction in the sparse normal means problem. Electronic Journal of Statistics, 10(1), 976–1000.

van der Pas, S., Szabó, B., & van der Vaart, A. (2017a). Adaptive posterior contraction rates for the horseshoe. Electronic Journal of Statistics, 11(2), 3196–3225.

van der Pas, S., Szabó, B., & van der Vaart, A. (2017b). Uncertainty quantification for the horseshoe (with discussion). Bayesian Analysis, 12(4), 1221–1274.

Wellcome Trust (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447, 661–678.

Wilms, I., & Croux, C. (2017). An algorithm for the multivariate group lasso with covariance estimation. Journal of Applied Statistics, 0(0), 1–14.

Xu, X., & Ghosh, M. (2015). Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4), 909–936.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.

Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g prior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Studies in Bayesian Econometrics, (pp. 233–243). Eds. P. K. Goel and A. Zellner. Amsterdam: North-Holland/Elsevier.

Zhang, Y., & Bondell, H. D. (2017). Variable selection via penalized credible regions with Dirichlet–Laplace global-local shrinkage priors. Advance publication. Retrieved from https://projecteuclid.org/euclid.ba/1508551721.

Zhu, H., Khondker, Z., Lu, Z., & Ibrahim, J. G. (2014). Bayesian generalized low rank regression models for neuroimaging phenotypes and genetic markers. Journal of the American Statistical Association, 109(507), 977–990.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
BIOGRAPHICAL SKETCH
Ray Bai graduated from Cornell University in May 2007 with bachelor’s degrees in
government and economics. He then worked as a financial analyst at State Street Bank &
Trust from May 2007 to August 2010. From September 2010 to May 2012, he attended
graduate school at the University of Massachusetts Amherst where he earned his master’s in
applied mathematics. He then worked as a systems engineer at General Dynamics Mission
Systems from June 2012 to July 2014. He joined the Department of Statistics at the University
of Florida in August 2014 where he earned his master’s in statistics in August 2016 and his
doctorate in August 2018.