

Submitted to the Annals of Statistics



    School of Mathematics and Statistics, University of Sydney

    We develop methodology and theory for a mean field variational Bayes approximation to a linear model with a spike and slab prior on the regression coefficients. In particular we show how our method forces a subset of regression coefficients to be numerically indistinguishable from zero; under mild regularity conditions estimators based on our method consistently estimate the model parameters with easily obtainable and appropriately sized standard error estimates; and, most strikingly, select the true model at an exponential rate in the sample size. We also develop a practical method for simultaneously choosing reasonable initial parameter values and tuning the main tuning parameter of our algorithms, which is both computationally efficient and empirically performs as well as or better than some popular variable selection approaches.

    1. Introduction. Variable selection is one of the key problems in statistics, as evidenced by papers too numerous to mention all but a small subset. Major classes of model selection approaches include criteria-based procedures (Akaike, 1973; Mallows, 1973; Schwarz, 1978), penalized regression (Tibshirani, 1996; Fan and Li, 2001; Fan and Peng, 2004) and Bayesian modeling approaches (Bottolo and Richardson, 2010; Hans, Dobra and West, 2007; Li and Zhang, 2010; Stingo and Vannucci, 2011). One of the key forces driving this research is model selection for large scale problems where the number of candidate variables is large or where the model is nonstandard in some way. Good overviews of the latest approaches to model selection are Johnstone and Titterington (2009); Fan and Lv (2010); Müller and Welsh (2010); Bühlmann and van de Geer (2011); Johnson and Rossell (2012).

    Bayesian model selection approaches have the advantage of being able to easily incorporate many sources of variation simultaneously, including prior knowledge. However, except for special cases such as linear models with very carefully chosen priors (see for example Liang et al., 2008; or Maruyama and George, 2011), Bayesian inference via Markov chain Monte Carlo (MCMC) for moderate to large scale problems is inefficient. For this reason an enormous amount of effort has been put into developing MCMC and similar stochastic search based methods which can be used to explore the model space efficiently (Nott and Kohn, 2005; Hans, Dobra and West, 2007; O'Hara and Sillanpää, 2009; Bottolo and Richardson, 2010; Li and Zhang, 2010; Stingo and Vannucci, 2011). However, for sufficiently large scale problems even these approaches can be deemed too slow to be used in practice. A further drawback to these methods is that there are no available diagnostics to determine whether the MCMC chain has either converged or explored a sufficient proportion of models in the model space.

    Mean field variational Bayes (VB) is an efficient but approximate alternative to MCMC for Bayesian inference (Bishop, 2006; Ormerod and Wand, 2010). In general VB is typically a much faster, deterministic alternative to stochastic search algorithms. However, unlike MCMC, methods based on VB cannot achieve an arbitrary accuracy. Nevertheless, VB has been shown to be an effective approach to several practical problems including document retrieval (Jordan, 2004), functional

    Keywords and phrases: Mean field variational Bayes, spike and slab prior, approximate Bayesian inference

    imsart-aos ver. 2014/02/20 file: VariableSelection_AnnalsFinal.tex date: June 27, 2014


    magnetic resonance imaging (Flandin and Penny, 2007), and cluster analysis for gene-expression data (Teschendorff et al., 2005).
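    To make the mean field idea concrete (this is a generic textbook illustration, not the algorithm developed in this paper), consider the classic conjugate normal model with unknown mean and precision: the factorized approximation q(μ)q(τ) is updated by coordinate ascent, where each factor is refreshed using expectations under the current other factor. A minimal sketch, with illustrative prior values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)  # data with true mean 2, sd 1.5
N, xbar = len(x), x.mean()

# Illustrative conjugate priors: mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0  # initial guess for E_q[tau]
for _ in range(50):
    # Update q(mu) = N(mu_N, 1/lam_N) given E_q[tau]
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) = Gamma(a_N, b_N) given moments of q(mu)
    a_N = a0 + (N + 1) / 2
    E_ss = np.sum((x - mu_N) ** 2) + N / lam_N  # E_q[sum (x_i - mu)^2]
    b_N = b0 + 0.5 * (E_ss + lam0 * ((mu_N - mu0) ** 2 + 1 / lam_N))
    E_tau = a_N / b_N

print(mu_N, (1.0 / E_tau) ** 0.5)  # close to the true mean 2 and sd 1.5
```

The loop is deterministic and converges in a handful of iterations, in contrast to the stochastic exploration that MCMC requires; this speed-versus-exactness trade-off is the one discussed above.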

    A criticism often leveled at VB methods is that they often fail to provide reliable estimates of posterior variances. Such criticism can be made on empirical grounds, e.g., Wand et al. (2011); Carbonetto and Stephens (2011), or theoretical grounds, e.g., Wang and Titterington (2006); Rue, Martino and Chopin (2009). However, as previously shown in You, Ormerod and Müller (2014), such criticism does not hold for VB methods in general, at least in an asymptotic sense. Furthermore, variational approximation has been shown to be useful in frequentist settings (Hall, Ormerod and Wand, 2011; Hall et al., 2011).

    In this paper we consider a spike and slab prior on the regression coefficients (see Mitchell and Beauchamp, 1988; George and McCulloch, 1993) in order to encourage sparse estimators. This entails using VB to approximate the posterior distribution of indicator variables to select which variables are to be included in the model. We consider this modification to be amongst the simplest such modifications to the standard Bayesian linear model (using conjugate priors) to automatically select variables to be included in the model. Our contributions are:

    (i) We show how our VB method induces sparsity upon the regression coefficients;
    (ii) We show, under mild assumptions, that our estimators for the model parameters are consistent with easily obtainable and appropriately sized standard error estimates;
    (iii) Under these same assumptions our VB method selects the true model at an exponential rate in n; and
    (iv) We develop a practical method for simultaneously choosing reasonable initial parameter values and tuning the main tuning parameter of our algorithms.

    Contribution (iii) is a remarkable result as we do not know of another method with such a provably fast rate of convergence to the true model. However, we note an important drawback of our theoretical analysis is that we only consider the case where n > p and p is fixed (but still possibly very large).

    We are by no means the first to consider model selection via the use of model indicator variables within the context of variational approximation. Earlier papers which use either expectation maximization (which may be viewed as a special case of VB) or VB include Huang, Morris and Frey (2007), Rattray et al. (2009), Logsdon, Hoffman and Mezey (2010), Carbonetto and Stephens (2011), Wand and Ormerod (2011) and Rockova and George (2014). However, apart from Rockova and George (2014), these references did not attempt to understand how sparsity was achieved, and the latter reference did not consider proving the rates of convergence for the estimators they considered. Furthermore, each of these papers considered slightly different models and tuning parameter selection approaches from those here.

    The spike and slab prior is a sparsity inducing prior since it has a discontinuous derivative at the origin (owing to the spike at zero). We could also employ other such priors on our regression coefficients. Many recent papers have considered such priors, including Polson and Scott (2010), Carvalho, Polson and Scott (2010), Griffin and Brown (2011), Armagan, Dunson and Lee (2013) and Neville, Ormerod and Wand (2014). Such priors entail at least some shrinkage on the regression coefficients. We believe that some coefficient shrinkage is highly likely to be beneficial in practice. However, we have not pursued such priors here but focus on developing theory for what we believe is the simplest model deviating from the linear model which incorporates a sparsity inducing prior. Adapting the theory we develop for the spike and slab prior here to other sparsity inducing priors is beyond the scope of this paper.

    The paper is organized as follows. Section 2 considers model selection for a linear model using



    a spike and slab prior on the regression coefficients and provides a motivating example from real data. Section 3 summarizes our main results, which are proved in Section 4. Section 5 discusses initialization and hyperparameter selection. Numerical examples are shown in Section 6 and illustrate the good empirical properties of our methods. We discuss our results and conclude in Section 7. Some auxiliary material is provided in the appendix.

    2. Bayesian linear model selection. Suppose that we have observed data $(y_i, \mathbf{x}_i)$, $1 \le i \le n$, and hypothesize that $y_i \overset{\text{ind.}}{\sim} N(\mathbf{x}_i^T\boldsymbol{\beta}, \sigma^2)$, $1 \le i \le n$, for some coefficients $\boldsymbol{\beta}$ and noise variance $\sigma^2$, where $\mathbf{x}_i$ is a $p$-vector of predictors. When spike and slab priors and a conjugate prior are employed on $\boldsymbol{\beta}$ and $\sigma^2$, respectively, a Bayesian version of the linear regression model can be written as follows,

    (1)  $\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2 \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}), \quad \sigma^2 \sim \text{Inverse-Gamma}(A, B),$

    $\beta_j \mid \gamma_j \sim \gamma_j\, N(0, \sigma_\beta^2) + (1 - \gamma_j)\,\delta_0 \quad \text{and} \quad \gamma_j \sim \text{Bernoulli}(\rho), \quad j = 1, \ldots, p,$

    where $\mathbf{X}$ is an $n \times p$ design matrix whose $i$th row is $\mathbf{x}_i^T$ (possibly including an intercept), $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$ is a $p$-vector of regression coefficients, $\text{Inverse-Gamma}(A, B)$ is the inverse gamma distribution with shape parameter $A$ and scale parameter $B$, and $\delta_0$ is the degenerate distribution with point mass at $0$. The parameters $\sigma_\beta^2$, $A$ and $B$ are fixed prior hyperparameters, and $\rho \in (0, 1)$ is also a hyperparameter which controls sparsity. Contrasting with Rockova and George (2014) we use $\rho$ rather than $\sigma_\beta^2$ as a tuning parameter to control sparsity. The selection of $\rho$ (or $\sigma_\beta^2$ for that matter) is particularly important and is a point which we will discuss later.

    Replacing $\beta_j$ with $\gamma_j\beta_j$ for $1 \le j \le p$ we can recast the model as

    $\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2, \boldsymbol{\gamma} \sim N(\mathbf{X}\boldsymbol{\Gamma}\boldsymbol{\beta}, \sigma^2\mathbf{I}), \quad \sigma^2 \sim \text{Inverse-Gamma}(A, B),$

    $\beta_j \sim N(0, \sigma_\beta^2) \quad \text{and} \quad \gamma_j \sim \text{Bernoulli}(\rho), \quad j = 1, \ldots, p,$

    where $\boldsymbol{\Gamma} = \text{diag}(\gamma_1, \ldots, \gamma_p)$, which is easier to deal with using VB. In the signal processing literature this is sometimes called the Bernoulli-Gaussian (Soussen et al., 2011) or binary mask model and is closely related to $\ell_0$ regularization (see Murphy, 2012, Section 13.2.2). Wand and Ormerod (2011) also considered what they call
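    The generative model above can be sketched by drawing one dataset from it. The hyperparameter values below ($n$, $p$, $\rho$, $\sigma_\beta$, $A$, $B$) are illustrative choices, not values used in the paper; note that wherever $\gamma_j = 0$ the effective coefficient $\gamma_j\beta_j$ is exactly zero, which is how the binary mask formulation encodes variable exclusion:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10
rho, sigma_beta = 0.3, 2.0   # illustrative: inclusion probability and slab sd
A, B = 2.0, 1.0              # illustrative Inverse-Gamma hyperparameters

# sigma^2 ~ Inverse-Gamma(A, B): draw tau ~ Gamma(A, rate=B), take 1/tau
sigma2 = 1.0 / rng.gamma(A, 1.0 / B)

X = rng.normal(size=(n, p))                       # design matrix
gamma = rng.binomial(1, rho, size=p)              # gamma_j ~ Bernoulli(rho)
beta = rng.normal(0.0, sigma_beta, size=p)        # beta_j ~ N(0, sigma_beta^2)
coef = gamma * beta                               # Gamma beta: exact zeros where gamma_j = 0
y = X @ coef + rng.normal(0.0, np.sqrt(sigma2), size=n)  # y ~ N(X Gamma beta, sigma^2 I)

print(coef)
```

In a run of this sketch, roughly $\rho p$ of the entries of `coef` are nonzero and the rest are exactly zero, mirroring the point mass at the origin in the spike and slab prior.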