AIC Using Bayesian Marginal Likelihood


description

Proposal of an information criterion that measures, from a frequentist point of view, the prediction risk of the predictive density based on the Bayesian marginal likelihood, with application to variable selection in linear regression models.

Transcript of AIC Using Bayesian Marginal Likelihood

  • CIRJE Discussion Papers can be downloaded without charge from:

    http://www.cirje.e.u-tokyo.ac.jp/research/03research02dp.html

    Discussion Papers are a series of manuscripts in their draft form. They are not intended for

    circulation or distribution except as indicated by the author. For that reason Discussion Papers

    may not be reproduced or distributed without the written consent of the author.

    CIRJE-F-971

    A Variant of AIC Using Bayesian Marginal Likelihood

    Yuki Kawakubo
    Graduate School of Economics, The University of Tokyo

    Tatsuya Kubokawa
    The University of Tokyo

    Muni S. Srivastava
    University of Toronto

    April 2015

  • A Variant of AIC Using Bayesian Marginal Likelihood

    Yuki Kawakubo*, Tatsuya Kubokawa† and Muni S. Srivastava‡

    April 27, 2015

    Abstract

    We propose an information criterion which measures the prediction risk of the predictive density based on the Bayesian marginal likelihood from a frequentist point of view. We derive the criteria for selecting variables in linear regression models by putting a prior on the regression coefficients, and discuss the relationship between the proposed criteria and other related ones. There are three advantages of our method. Firstly, it is a compromise between the frequentist and Bayesian standpoints because it evaluates the frequentist's risk of the Bayesian model; thus it is less influenced by prior misspecification. Secondly, a non-informative improper prior can also be used for constructing the criterion. When the uniform prior is assumed on the regression coefficients, the resulting criterion is identical to the residual information criterion (RIC) of Shi and Tsai (2002). Lastly, the criteria have the consistency property for selecting the true model.

    Key words and phrases: AIC, BIC, consistency, Kullback–Leibler divergence, linear regression model, residual information criterion, variable selection.

    1 Introduction

    The problem of selecting appropriate models has been extensively studied in the literature since Akaike (1973, 1974), who derived the so-called Akaike information criterion (AIC). Since the AIC and its variants are based on the risk of the predictive densities with respect to the Kullback–Leibler (KL) divergence, they can select a good model in the light of prediction. It is known, however, that the AIC-type criteria do not have the consistency property, namely, the probability that the criteria select the true model does not converge to 1. Another approach to model selection is Bayesian procedures such as Bayes factors and the Bayesian

    *Graduate School of Economics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN, Research Fellow of the Japan Society for the Promotion of Science, E-Mail: [email protected]

    †Faculty of Economics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN, E-Mail: [email protected]

    ‡Department of Statistics, University of Toronto, 100 St George Street, Toronto, Ontario, CANADA M5S 3G3, E-Mail: [email protected]


  • information criterion (BIC) suggested by Schwarz (1978), both of which are constructed based on the Bayesian marginal likelihood. Bayesian procedures for model selection have the consistency property in some specific models, while they do not select models in terms of prediction. In addition, it is known that Bayes factors do not work for improper prior distributions and that the BIC does not use any specific prior information. In this paper, we provide a unified framework for deriving an information criterion for model selection so that it can produce various information criteria including AIC, BIC and the residual information criterion (RIC) suggested by Shi and Tsai (2002). In particular, we propose an intermediate criterion between AIC and BIC using the empirical Bayes method.

    To explain the unified framework in a general setup, let $y$ be an $n$-variate observable random vector whose density is $f(y|\omega)$ for a vector of unknown parameters $\omega$. Let $\hat{f}(\tilde{y}; y)$ be a predictive density for $f(\tilde{y}|\omega)$, where $\tilde{y}$ is an independent replication of $y$. We here evaluate the predictive performance of $\hat{f}(\tilde{y}; y)$ in terms of the following risk:

    $$ R(\omega, \hat{f}) = \int \left[ \int \log\left\{ \frac{f(\tilde{y}|\omega)}{\hat{f}(\tilde{y}; y)} \right\} f(\tilde{y}|\omega)\, d\tilde{y} \right] f(y|\omega)\, dy. \qquad (1.1) $$

    Since this is interpreted as a risk with respect to the KL divergence, we call it the KL risk. The spirit of AIC suggests that we can provide an information criterion for model selection as an (asymptotically) unbiased estimator of the information

    $$ I(\omega, \hat{f}) = \int\!\!\int -2 \log\{\hat{f}(\tilde{y}; y)\}\, f(\tilde{y}|\omega) f(y|\omega)\, d\tilde{y}\, dy = E_\omega\left[ -2 \log\{\hat{f}(\tilde{y}; y)\} \right], \qquad (1.2) $$

    which is a part of (1.1) (multiplied by 2), where $E_\omega$ denotes the expectation with respect to the distribution $f(\tilde{y}, y|\omega) = f(\tilde{y}|\omega) f(y|\omega)$. Let $\Delta = I(\omega, \hat{f}) - E_\omega[-2 \log\{\hat{f}(y; y)\}]$. Then, the AIC variant based on the predictor $\hat{f}(\tilde{y}; y)$ is defined by

    $$ \mathrm{IC}(\hat{f}) = -2 \log\{\hat{f}(y; y)\} + \hat{\Delta}, $$

    where $\hat{\Delta}$ is an (asymptotically) unbiased estimator of $\Delta$.

    It is interesting to point out that $\mathrm{IC}(\hat{f})$ produces AIC and BIC for specific predictors.

    (AIC) Put $\hat{f}(\tilde{y}; y) = f(\tilde{y}|\hat{\omega})$ for the maximum likelihood estimator $\hat{\omega}$ of $\omega$. Then, $\mathrm{IC}(f(\tilde{y}|\hat{\omega}))$ is the exact AIC or the corrected AIC suggested by Sugiura (1978) and Hurvich and Tsai (1989), which is approximated by the AIC of Akaike (1973, 1974) as $-2 \log\{f(y|\hat{\omega})\} + 2 \dim(\omega)$.

    (BIC) Put $\hat{f}(\tilde{y}; y) = f_{\pi_0}(\tilde{y}) = \int f(\tilde{y}|\omega)\, \pi_0(\omega)\, d\omega$ for a proper prior distribution $\pi_0(\omega)$. Since it can easily be seen that $I(\omega, f_{\pi_0}) = E_\omega[-2 \log\{f_{\pi_0}(y)\}]$, we have $\Delta = 0$ in this case, so that $\mathrm{IC}(f_{\pi_0}) = -2 \log\{f_{\pi_0}(y)\}$, which is the Bayesian marginal likelihood. It is noted that $-2 \log\{f_{\pi_0}(y)\}$ is approximated by $\mathrm{BIC} = -2 \log\{f(y|\hat{\omega})\} + \log(n) \dim(\omega)$.
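    As a quick numerical illustration of the (BIC) case, the sketch below compares the exact $-2\log f_{\pi_0}(y)$ with its BIC approximation in a conjugate normal-mean model with known variance. This is a minimal sketch under hypothetical settings (the model, prior variance, and data are our own choices, not taken from this paper); the two quantities should agree up to an $O(1)$ remainder.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
n, mu_true, sigma, tau = 50, 1.0, 1.0, 2.0   # hypothetical setting
y = rng.normal(mu_true, sigma, size=n)

# Model: y_i = mu + e_i, e_i ~ N(0, sigma^2) with sigma known; prior mu ~ N(0, tau^2).
# Exact marginal likelihood: y ~ N_n(0, sigma^2 I_n + tau^2 J_n).
cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
exact = -2.0 * multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

# BIC approximation: -2 log f(y | mu_hat) + log(n) * dim(omega), with mu_hat = ybar.
mu_hat = y.mean()
bic = -2.0 * norm(loc=mu_hat, scale=sigma).logpdf(y).sum() + np.log(n) * 1

print(f"-2 log marginal likelihood: {exact:.2f}")
print(f"BIC approximation:          {bic:.2f}")   # close, up to an O(1) term
```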


  • The criterion $\mathrm{IC}(\hat{f})$ can produce not only the conventional information criteria AIC and BIC, but also various criteria between AIC and BIC. For example, suppose that $\omega$ is divided as $\omega = (\theta^t, \nu^t)^t$ for a $p$-dimensional parameter vector of interest $\theta$ and a $q$-dimensional nuisance parameter vector $\nu$. We assume that $\theta$ has a prior density $\pi(\theta|\lambda, \nu)$ with hyperparameter $\lambda$. The model is described as

    $$ y|\theta \sim f(y|\theta, \nu), \qquad \theta \sim \pi(\theta|\lambda, \nu), $$

    and $\lambda$ and $\nu$ are estimated from the data. Inference based on such a model is called an empirical Bayes procedure. Put $\hat{f}(\tilde{y}; y) = f_\pi(\tilde{y}|\hat{\lambda}, \hat{\nu}) = \int f(\tilde{y}|\theta, \hat{\nu})\, \pi(\theta|\hat{\lambda}, \hat{\nu})\, d\theta$ for some estimators $\hat{\lambda}$ and $\hat{\nu}$. Then, the information in (1.2) is

    $$ I(\omega, f_\pi) = \int\!\!\int -2 \log\{f_\pi(\tilde{y}|\hat{\lambda}, \hat{\nu})\}\, f(\tilde{y}|\theta, \nu) f(y|\theta, \nu)\, d\tilde{y}\, dy, \qquad (1.3) $$

    and the resulting information criterion is

    $$ \mathrm{IC}(f_\pi) = -2 \log\{f_\pi(y|\hat{\lambda}, \hat{\nu})\} + \hat{\Delta}, \qquad (1.4) $$

    where $\hat{\Delta}$ is an (asymptotically) unbiased estimator of $\Delta = I(\omega, f_\pi) - E_\omega[-2 \log\{f_\pi(y|\hat{\lambda}, \hat{\nu})\}]$.

    There are three motivations for considering the information $I(\omega, f_\pi)$ in (1.3) and the information criterion $\mathrm{IC}(f_\pi)$ in (1.4).

    Firstly, it is noted that the Bayesian predictor $f_\pi(\tilde{y}|\hat{\lambda}, \hat{\nu})$ is evaluated by the risk $R(\omega, f_\pi)$ in (1.1), which is based on a frequentist point of view. On the other hand, the Bayesian risk is

    $$ r(\eta, \hat{f}) = \int R(\omega, \hat{f})\, \pi(\theta|\lambda, \nu)\, d\theta, \qquad (1.5) $$

    which measures the prediction error of $\hat{f}(\tilde{y}; y)$ under the assumption that the prior information is correct, where $\eta = (\lambda^t, \nu^t)^t$. The resulting Bayesian criteria, such as the PIC (Kitagawa, 1997) and the DIC (Spiegelhalter et al., 2002), are sensitive to prior misspecification, since they depend on the prior information. Because $R(\omega, f_\pi)$ measures the prediction error of the Bayesian model from the standpoint of frequentists, however, the resulting criterion $\mathrm{IC}(f_\pi)$ is less influenced by prior misspecification.

    Secondly, we can construct the information criterion $\mathrm{IC}(f_\pi)$ when the prior distribution of $\theta$ is improper, since the information $I(\omega, f_\pi)$ in (1.3) can be defined formally for the corresponding improper marginal likelihood. Because the Bayesian risk $r(\eta, f_\pi)$ does not exist for an improper prior, however, we cannot obtain the corresponding Bayesian criteria and cannot use the Bayesian risk. Objective Bayesians want to avoid informative priors, and many non-informative priors are improper. The suggested criterion $\mathrm{IC}(f_\pi)$ responds to such a request. For example, objective Bayesians assume the uniform improper prior on the regression coefficients in linear regression models. It is interesting to note that the resulting variable selection criterion (1.4) is identical to the residual information criterion (RIC) of Shi and Tsai (2002), as shown in the next section.


  • Lastly, the criterion has the consistency property. We derive the criterion for the variable selection problem in the general linear regression model and prove that the criterion selects the true model with probability tending to one. The BIC and the marginal likelihood are known to be consistent (Nishii, 1984), while most AIC-type criteria are not. AIC-type criteria, however, have the property of choosing a good model in the sense of minimizing the prediction error (Shibata, 1981; Shao, 1997). Our proposed criterion is consistent for selection of the parameters of interest and selects a good model in the light of prediction based on the empirical Bayes model.

    The rest of the paper is organized as follows. In Section 2, we obtain the information criterion (1.4) in the linear regression model with a general covariance structure and compare it with other related criteria. In Section 3, we prove the consistency of the criteria. In Section 4, we investigate the performance of the criteria through simulations. Section 5 concludes the paper with some discussion.

    2 Proposed Criteria

    2.1 Variable selection criteria for linear regression model

    Consider the linear regression model

    $$ y = X\beta + \varepsilon, \qquad (2.1) $$

    where $y$ is an $n \times 1$ observation vector of the response variables, $X$ is an $n \times p$ matrix of the explanatory variables, $\beta$ is a $p \times 1$ vector of the regression coefficients, and $\varepsilon$ is an $n \times 1$ vector of the random errors. Throughout the paper, we assume that $X$ has full column rank $p$. The random error $\varepsilon$ has the distribution $\mathcal{N}_n(0, \sigma^2 V)$, where $\sigma^2$ is an unknown scalar and $V$ is a known positive definite matrix. We consider the problem of selecting the explanatory variables and assume that the true model is included in the family of candidate models, which is the common assumption for obtaining this kind of criterion.

    We shall construct variable selection criteria of the form (1.4) for the regression model (2.1). We consider the following two situations.

    [i] Normal prior for $\beta$. We first assume the prior distribution of $\beta$,

    $$ \pi(\beta|\sigma^2) \sim \mathcal{N}_p(0, \sigma^2 W), $$

    where $W$ is a suitably chosen $p \times p$ matrix of full rank. Examples of $W$ are $W = \lambda (X^t X)^{-1}$ for $\lambda > 0$ when $V$ is the identity matrix, which was introduced by Zellner (1986), or more simply $W = \lambda^{-1} I_p$. Because the likelihood is $f(y|\beta, \sigma^2) \sim \mathcal{N}_n(X\beta, \sigma^2 V)$, the marginal likelihood function is

    $$ f_\pi(y|\sigma^2) = \int f(y|\beta, \sigma^2)\, \pi(\beta|\sigma^2)\, d\beta = (2\pi\sigma^2)^{-n/2}\, |V|^{-1/2}\, |W|^{-1/2}\, |X^t V^{-1} X + W^{-1}|^{-1/2} \exp\{-y^t A y/(2\sigma^2)\}, $$

    4

  • where $A = V^{-1} - V^{-1}X(X^t V^{-1} X + W^{-1})^{-1}X^t V^{-1}$. Note that $A = (V + B)^{-1}$ for $B = XWX^t$, namely $f_\pi(y|\sigma^2) \sim \mathcal{N}_n(0, \sigma^2(V + B))$. Then we take the predictive density as $\hat{f}(\tilde{y}; y) = f_\pi(\tilde{y}|\hat{\sigma}^2)$, and the information (1.3) can be written as

    $$ I_{\pi,1}(\omega) = E_\omega\left[ n \log(2\pi\hat{\sigma}^2) + \log|V| + \log|W X^t V^{-1} X + I_p| + \tilde{y}^t A \tilde{y}/\hat{\sigma}^2 \right], \qquad (2.2) $$

    where $\hat{\sigma}^2 = y^t P y/n$ for $P = V^{-1} - V^{-1}X(X^t V^{-1}X)^{-1}X^t V^{-1}$, and $E_\omega$ denotes the expectation with respect to the distribution $f(\tilde{y}, y|\beta, \sigma^2) = f(\tilde{y}|\beta, \sigma^2) f(y|\beta, \sigma^2)$ for $\omega = (\beta^t, \sigma^2)^t$. Note that $\beta$ is the parameter of interest and $\sigma^2$ is the nuisance parameter, which correspond to $\theta$ and $\nu$ in the previous section. Then we propose the following information criterion.

    Proposition 2.1 The information $I_{\pi,1}(\omega)$ in (2.2) is unbiasedly estimated by the information criterion

    $$ \mathrm{IC}_{\pi,1} = -2 \log\{f_\pi(y|\hat{\sigma}^2)\} + \frac{2n}{n - p - 2}, \qquad (2.3) $$

    where

    $$ -2 \log\{f_\pi(y|\hat{\sigma}^2)\} = n \log\hat{\sigma}^2 + \log|V| + \log|W X^t V^{-1} X + I_p| + y^t A y/\hat{\sigma}^2 + (\mathrm{const}), $$

    namely, $E_\omega(\mathrm{IC}_{\pi,1}) = I_{\pi,1}(\omega)$.

    If $n^{-1} W^{1/2} X^t V^{-1} X W^{1/2}$ converges to a $p \times p$ positive definite matrix as $n \to \infty$, $\log|W X^t V^{-1} X + I_p|$ can be approximated by $p \log n$. In that case, $\mathrm{IC}_{\pi,1}$ is approximately expressed as

    $$ \mathrm{IC}^*_{\pi,1} = n \log\hat{\sigma}^2 + \log|V| + p \log n + 2 + y^t A y/\hat{\sigma}^2, $$

    when $n$ is large.
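    The following is a minimal sketch of how $\mathrm{IC}_{\pi,1}$ in (2.3) could be computed for given $y$, $X$, $V$ and $W$. The function name, the example data, and the choice $V = I_n$, $W = I_p$ are our own illustrative assumptions; the additive constant is dropped, as in the proposition.

```python
import numpy as np

def ic_pi1(y, X, V, W):
    """Sketch of IC_{pi,1} in (2.3): -2 log f_pi(y | sigma_hat^2) + 2n/(n - p - 2),
    with the additive constant dropped."""
    n, p = X.shape
    Vinv = np.linalg.inv(V)
    G = X.T @ Vinv @ X                                    # X' V^{-1} X
    P = Vinv - Vinv @ X @ np.linalg.solve(G, X.T @ Vinv)  # P = V^{-1} - V^{-1}X(...)^{-1}X'V^{-1}
    A = Vinv - Vinv @ X @ np.linalg.solve(G + np.linalg.inv(W), X.T @ Vinv)
    sigma2_hat = float(y @ P @ y) / n                     # sigma_hat^2 = y'Py/n
    m2loglik = (n * np.log(sigma2_hat)
                + np.linalg.slogdet(V)[1]
                + np.linalg.slogdet(W @ G + np.eye(p))[1]
                + float(y @ A @ y) / sigma2_hat)
    return m2loglik + 2.0 * n / (n - p - 2)

# hypothetical example with V = I_n and W = I_p
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.standard_normal(n)
print(ic_pi1(y, X, np.eye(n), np.eye(p)))
```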

    Alternatively, the KL risk $r(\eta, \hat{f})$ in (1.5) can also be used for evaluating the risk of the predictive density $f_\pi(\tilde{y}|\hat{\sigma}^2)$, since the prior distribution is proper. The resulting criterion is

    $$ \mathrm{IC}_{\pi,2} = n \log\hat{\sigma}^2 + \log|V| + p \log n + p, \qquad (2.4) $$

    which is an asymptotically unbiased estimator of $I_{\pi,2}(\sigma^2) = E_\pi[I_{\pi,1}(\omega)]$ up to a constant, where $E_\pi$ denotes the expectation with respect to the prior distribution $\pi(\beta|\sigma^2)$, namely $E_\pi E_\omega(\mathrm{IC}_{\pi,2}) \to I_{\pi,2}(\sigma^2) + (\mathrm{const})$ as $n \to \infty$. It is interesting to point out that $\mathrm{IC}_{\pi,2}$ is analogous to the criterion proposed by Bozdogan (1987), known as the consistent AIC, who suggested replacing the penalty term $2p$ in the AIC with $p + p \log n$.

    [ii] Uniform prior for $\beta$. We next assume the uniform prior for $\beta$, namely $\beta \sim \mathrm{uniform}(\mathbb{R}^p)$. Though this is an improper prior distribution, we can obtain the marginal likelihood function formally:

    $$ f_R(y|\sigma^2) = \int f(y|\beta, \sigma^2)\, d\beta = (2\pi\sigma^2)^{-(n-p)/2}\, |V|^{-1/2}\, |X^t V^{-1} X|^{-1/2} \exp\{-y^t P y/(2\sigma^2)\}, $$


  • which is known as the residual likelihood (Patterson and Thompson, 1971), where $P = V^{-1} - V^{-1}X(X^t V^{-1}X)^{-1}X^t V^{-1}$. Then we take the predictive density as $\hat{f}(\tilde{y}; y) = f_R(\tilde{y}|\tilde{\sigma}^2)$, and the information (1.3) can be written as

    $$ I_R(\omega) = E_\omega\left[ (n - p) \log(2\pi\tilde{\sigma}^2) + \log|V| + \log|X^t V^{-1} X| + \tilde{y}^t P \tilde{y}/\tilde{\sigma}^2 \right], \qquad (2.5) $$

    where $\tilde{\sigma}^2 = y^t P y/(n - p)$, which is the residual maximum likelihood (REML) estimator of $\sigma^2$ based on the residual likelihood $f_R(y|\sigma^2)$. Then we propose the following information criterion.

    Proposition 2.2 The information $I_R(\omega)$ in (2.5) is unbiasedly estimated by the information criterion

    $$ \mathrm{IC}_R = -2 \log\{f_R(y|\tilde{\sigma}^2)\} + \frac{2(n - p)}{n - p - 2}, \qquad (2.6) $$

    where

    $$ -2 \log\{f_R(y|\tilde{\sigma}^2)\} = (n - p) \log\tilde{\sigma}^2 + \log|V| + \log|X^t V^{-1} X| + y^t P y/\tilde{\sigma}^2 + (\mathrm{const}), $$

    namely, $E_\omega(\mathrm{IC}_R) = I_R(\omega)$.

    Note that $y^t P y/\tilde{\sigma}^2 = n - p$. If $n^{-1} X^t V^{-1} X$ converges to a $p \times p$ positive definite matrix as $n \to \infty$, $\log|X^t V^{-1} X|$ can be approximated by $p \log n$. In that case, we can approximately write

    $$ \mathrm{IC}^*_R = (n - p) \log\tilde{\sigma}^2 + \log|V| + p \log n + \frac{(n - p)^2}{n - p - 2}, \qquad (2.7) $$

    when $n$ is large. It is important to note that $\mathrm{IC}^*_R$ is identical to the RIC proposed by Shi and Tsai (2002) up to a constant. Since $(n - p)^2/(n - p - 2) = (n + 2) + \{4/(n - p - 2) - p\}$ and $n + 2$ is irrelevant to the model, we can subtract $n + 2$ from $\mathrm{IC}^*_R$ in (2.7), which results in the RIC exactly. Note that the criterion based on $f_R(y|\sigma^2)$ and $r(\eta, f_R)$ cannot be constructed, because the corresponding KL risk diverges to infinity.
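    A minimal sketch of how $\mathrm{IC}_R$ in (2.6) could be computed, again with the additive constant dropped; the reduction of (2.7) to the RIC by removing the model-free term $n + 2$ is noted only in a comment. The function and example data are hypothetical.

```python
import numpy as np

def ic_r(y, X, V):
    """Sketch of IC_R in (2.6), based on the residual (REML) likelihood, constant dropped."""
    n, p = X.shape
    Vinv = np.linalg.inv(V)
    G = X.T @ Vinv @ X
    P = Vinv - Vinv @ X @ np.linalg.solve(G, X.T @ Vinv)
    sigma2_tilde = float(y @ P @ y) / (n - p)             # REML estimator of sigma^2
    m2loglik = ((n - p) * np.log(sigma2_tilde)
                + np.linalg.slogdet(V)[1]
                + np.linalg.slogdet(G)[1]
                + (n - p))                                 # y'Py / sigma_tilde^2 = n - p
    return m2loglik + 2.0 * (n - p) / (n - p - 2)

# The large-n version (2.7) replaces log|G| by p*log(n); subtracting the
# model-free constant n + 2 from (2.7) gives the RIC of Shi and Tsai (2002).
rng = np.random.default_rng(2)
n, p = 60, 4
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 1.0, 0.0, 0.0]) + rng.standard_normal(n)
print(ic_r(y, X, np.eye(n)))
```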

    2.2 Extension to the case of unknown covariance

    In the derivation of the criteria, we have assumed that the scaled covariance matrix $V$ of the error vector is known. However, it is often the case that $V$ is unknown and is some function of an unknown parameter $\psi$, namely $V = V(\psi)$. In that case, $V$ in each criterion is replaced with its plug-in estimator $V(\hat{\psi})$, where $\hat{\psi}$ is some consistent estimator of $\psi$. This strategy is also used in many other studies, for example in Shi and Tsai (2002), who proposed the RIC. We suggest that $\psi$ be estimated based on the full model. Estimating the nuisance parameters from the full model is similar in spirit to the $C_p$ criterion of Mallows (1973). The scaled covariance matrix $W$ of the prior distribution of $\beta$ is also assumed to be known. In practice, its structure should be specified, and we have to estimate the parameter $\lambda$ involved in $W$ from the data. In the same manner as $V$, $W$ in each criterion is replaced with $W(\hat{\lambda})$. We propose that $\lambda$ be estimated based on each candidate model under consideration, because the structure of $W$ depends on the model.
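    To make the plug-in strategy concrete, here is a sketch of the resulting two-stage procedure: a single consistent estimate $\hat\psi$ is computed once from the full model, while the criterion is recomputed for each candidate subset with $V(\hat\psi)$ plugged in. The specific criterion inside the loop (an approximate residual-likelihood form), the AR(1) covariance structure, and all names are illustrative assumptions, not the only possible choices.

```python
import numpy as np
from itertools import combinations

def ar1_cov(n, rho):
    """Scaled covariance V(psi) with (i, j) element rho^{|i-j|} (AR(1) example)."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def crit(y, Xj, V):
    """Approximate residual-likelihood criterion in the spirit of (2.7), constants dropped."""
    n, pj = Xj.shape
    Vinv = np.linalg.inv(V)
    P = Vinv - Vinv @ Xj @ np.linalg.solve(Xj.T @ Vinv @ Xj, Xj.T @ Vinv)
    s2 = float(y @ P @ y) / (n - pj)
    return (n - pj) * np.log(s2) + pj * np.log(n) + (n - pj) ** 2 / (n - pj - 2)

def select(y, X, psi_hat):
    V = ar1_cov(len(y), psi_hat)            # psi estimated once, from the full model
    best, best_j = np.inf, None
    for k in range(1, X.shape[1] + 1):
        for j in combinations(range(X.shape[1]), k):
            value = crit(y, X[:, list(j)], V)   # criterion recomputed per candidate j
            if value < best:
                best, best_j = value, j
    return best_j

rng = np.random.default_rng(3)
n = 80
X = rng.standard_normal((n, 5))
y = X[:, :2] @ np.array([1.0, 1.0]) + rng.standard_normal(n)
print(select(y, X, psi_hat=0.0))            # with rho = 0, V is just I_n
```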


  • We here give three examples for the regression model (2.1): a regression model with constant variance, a variance components model, and a regression model with ARMA errors, where the second and the third include an unknown parameter in the covariance matrix.

    [1] Regression model with constant variance. In the case where $V = I_n$, (2.1) represents a multiple regression model with constant variance. In this model, the scaled covariance matrix $V$ does not contain any unknown parameters.

    [2] Variance components model. Consider a variance components model (Henderson, 1950) described by

    $$ y = X\beta + Z_2 v_2 + \cdots + Z_r v_r + \varepsilon, \qquad (2.8) $$

    where $Z_i$ is an $n \times m_i$ matrix with $V_i = Z_i Z_i^t$, $v_i$ is an $m_i \times 1$ random vector having the distribution $\mathcal{N}_{m_i}(0, \psi_i I_{m_i})$ for $i \geq 2$, $\varepsilon$ is an $n \times 1$ random vector with $\mathcal{N}_n(0, V_0 + \psi_1 V_1)$ for known $n \times n$ matrices $V_0$ and $V_1$, and $\varepsilon, v_2, \ldots, v_r$ are mutually independently distributed. The nested error regression model (NERM) is a special case of the variance components model, given by

    $$ y_{ij} = x_{ij}^t \beta + v_i + \varepsilon_{ij}, \quad (i = 1, \ldots, m;\ j = 1, \ldots, n_i), \qquad (2.9) $$

    where the $v_i$'s and $\varepsilon_{ij}$'s are mutually independently distributed as $v_i \sim \mathcal{N}(0, \tau^2)$ and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$, and $n = \sum_{i=1}^m n_i$. Note that the NERM in (2.9) is obtained from the variance components model (2.8) by setting $\psi_1 = \sigma^2$, $\psi_2 = \tau^2$, $V_1 = I_n$ and $Z_2 = \mathrm{diag}(j_{n_1}, \ldots, j_{n_m})$, where $j_k$ is the $k$-dimensional vector of ones. This model is often used for clustered data, and $v_i$ can be seen as the random effect of the $i$-th cluster (Battese et al., 1988). For such a model, when one is interested in a specific cluster or in predicting the random effects, the conditional AIC proposed by Vaida and Blanchard (2005), which is based on the conditional likelihood given the random effects, is appropriate. However, when the aim of the analysis is focused on the population, the NERM can be seen as a linear regression model in which the random effects are absorbed into the error term, namely $V = V(\psi) = \psi V_2 + I_n$ in (2.1), where $\psi = \tau^2/\sigma^2$ and $V_2 = Z_2 Z_2^t = \mathrm{diag}(J_{n_1}, \ldots, J_{n_m})$ for $J_k = j_k j_k^t$. In that case, our proposed variable selection procedure is valid.
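    For the balanced NERM, the scaled covariance $V(\psi) = \psi V_2 + I_n$ is block diagonal with blocks $\psi J_{n_0} + I_{n_0}$; a small construction sketch (function name and values are hypothetical):

```python
import numpy as np
from scipy.linalg import block_diag

def nerm_cov(m, n0, psi):
    """V(psi) = psi * Z2 Z2' + I_n for the balanced NERM with m clusters of size n0."""
    block = psi * np.ones((n0, n0)) + np.eye(n0)   # psi * J_{n0} + I_{n0}
    return block_diag(*[block] * m)

print(nerm_cov(m=2, n0=3, psi=0.5))
```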

    [3] Regression model with autoregressive moving average errors. Consider the regression model (2.1), assuming that the random errors are generated by an ARMA($q, r$) process defined by

    $$ \varepsilon_i - \phi_1 \varepsilon_{i-1} - \cdots - \phi_q \varepsilon_{i-q} = u_i - \varphi_1 u_{i-1} - \cdots - \varphi_r u_{i-r}, $$

    where $\{u_i\}$ is a sequence of independent normal random variables having mean $0$ and variance $\tau^2$. A special case of this model is the regression model with AR(1) errors satisfying $\varepsilon_1 \sim \mathcal{N}(0, \tau^2/(1 - \rho^2))$, $\varepsilon_i = \rho \varepsilon_{i-1} + u_i$, $u_i \sim \mathcal{N}(0, \tau^2)$ for $i = 2, 3, \ldots, n$. When we define $\sigma^2 = \tau^2/(1 - \rho^2)$, the $(i, j)$ element of the scaled covariance matrix $V$ in (2.1) is $\rho^{|i-j|}$.

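    A quick Monte Carlo check of the stated form of $V$ in the AR(1) case: simulating the recursion and comparing the empirical correlations of $(\varepsilon_i, \varepsilon_j)$ with $\rho^{|i-j|}$. The settings below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho, tau, reps = 6, 0.6, 1.0, 200_000
eps = np.empty((reps, n))
eps[:, 0] = rng.normal(0.0, tau / np.sqrt(1 - rho**2), size=reps)  # eps_1 ~ N(0, tau^2/(1-rho^2))
for i in range(1, n):
    eps[:, i] = rho * eps[:, i - 1] + rng.normal(0.0, tau, size=reps)

sigma2 = tau**2 / (1 - rho**2)                 # sigma^2 = tau^2/(1 - rho^2)
V_empirical = (eps.T @ eps) / reps / sigma2    # should approach rho^{|i-j|}
idx = np.arange(n)
V_theory = rho ** np.abs(idx[:, None] - idx[None, :])
print(np.max(np.abs(V_empirical - V_theory)))  # small for large reps
```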
    3 Consistency of the Criteria

    In this section, we prove that the proposed criteria have the consistency property. Our asymptotic framework is that $n$ goes to infinity while the dimension of the regression


  • coefficients is fixed. Following Shi and Tsai (2002), we first show that the criteria are consistent for the regression model with constant variance and prespecified $W$, and then extend the result to the regression model with a general covariance matrix and to the case where $W$ is estimated.

    To discuss the consistency, we define the class of candidate models and the true model more formally. Let the $n \times p$ matrix $X$ consist of all the explanatory variables and assume that $\mathrm{rank}(X) = p$. To define a candidate model by an index $j$, suppose that $j$ denotes a subset of $\omega = \{1, \ldots, p\}$ containing $p_j$ elements, namely $p_j = \#(j)$, and that $X_j$ consists of the $p_j$ columns of $X$ indexed by the elements of $j$. Note that $X_\omega = X$ and $p_\omega = p$. We define the index set by $J = \mathcal{P}(\omega)$, namely the power set of $\omega$. Then the model $j$ is

    $$ y = X_j \beta_j + \varepsilon_j, \qquad \varepsilon_j \sim \mathcal{N}_n(0, \sigma_j^2 V), $$

    where $X_j$ is $n \times p_j$ and $\beta_j$ is $p_j \times 1$. The prior distribution of $\beta_j$ is $\beta_j \sim \mathcal{N}_{p_j}(0, \sigma^2 W_j)$. The true model is indexed by $j_0$, and $X_{j_0}\beta_{j_0}$ is abbreviated to $X_0\beta_0$, which is the true mean of $y$. $J$ is the collection of all the candidate models; divide $J$ into two subsets $J_+$ and $J_-$, where $J_+ = \{j \in J : j_0 \subseteq j\}$ and $J_- = J \setminus J_+$. Note that the true model $j_0$ is the smallest model in $J_+$. Let $\hat{\jmath}$ denote the model selected by some criterion. Following Shi and Tsai (2002), we make the following assumptions.

    (A1) $E(\varepsilon_1^{4s})$

  • 4 Simulations

    In this section, we compare the numerical performance of the proposed criteria with some conventional ones: AIC, BIC, and the corrected AIC (AICc) of Sugiura (1978) and Hurvich and Tsai (1989). We consider the three regression models, namely the regression model with constant variance, the NERM, and the regression model with AR(1) errors, which were taken as examples of (2.1) in Section 2.2. For the NERM, we consider the balanced sample case, namely $n_1 = \cdots = n_m (= n_0)$. In each simulation, 1000 realizations are generated from (2.1) with $\beta = (1, 1, 1, 1, 0, 0, 0)^t$, namely the full model is seven-dimensional and the true model is four-dimensional. All explanatory variables are randomly generated from the standard normal distribution. The signal-to-noise ratio ($\mathrm{SNR} = \{\mathrm{var}(x_i^t\beta)/\mathrm{var}(\varepsilon_i)\}^{1/2}$) is controlled at 1, 3, and 5. In the NERM, three cases of the variance ratio $\psi = \tau^2/\sigma^2$ are considered, with $\psi = 0.5, 1$ and $2$. In the regression model with AR(1) errors, three correlation structures are considered, with AR parameter $\rho = 0.1, 0.5$ and $0.8$.

    When deriving the criteria $\mathrm{IC}_{\pi,1}$, $\mathrm{IC}^*_{\pi,1}$ and $\mathrm{IC}_{\pi,2}$, we set the prior distribution of $\beta$ as $\mathcal{N}_p(0, \sigma^2\lambda^{-1}I_p)$, namely $W = \lambda^{-1}I_p$. The hyperparameter $\lambda$ is estimated by maximizing the marginal likelihood $f_\pi(y|\hat{\sigma}^2)$, where the estimate $\hat{\sigma}^2 = y^t P y/n$ of $\sigma^2$ is plugged in. The unknown parameter $\psi$ involved in $V$ is estimated by a consistent estimator based on the full model. In the NERM, $\psi = \tau^2/\sigma^2$ is estimated by $\hat{\tau}^2_{PR}/\hat{\sigma}^2_{PR}$, where $\hat{\tau}^2_{PR}$ and $\hat{\sigma}^2_{PR}$ are the unbiased estimators proposed by Prasad and Rao (1990). Let $S_0 = y^t\{I_n - X(X^tX)^{-1}X^t\}y$ and $S_1 = y^t\{E - EX(X^tEX)^{-1}X^tE\}y$, where $E = \mathrm{diag}(E_1, \ldots, E_m)$ and $E_i = I_{n_0} - n_0^{-1}J_{n_0}$ for $i = 1, \ldots, m$. Then, the Prasad–Rao estimators of $\sigma^2$ and $\tau^2$ are

    $$ \hat{\sigma}^2_{PR} = S_1/(n - m - p), \qquad \hat{\tau}^2_{PR} = \{S_0 - (n - p)\hat{\sigma}^2_{PR}\}/n_*, $$

    where $n_* = n - \mathrm{tr}[Z_2^tX(X^tX)^{-1}X^tZ_2]$. In the regression model with AR(1) errors, the AR parameter $\rho$ is estimated by the maximum likelihood estimator based on the full model. Note that $\psi$ is estimated based on the full model, and that $\sigma^2$ and $\lambda$ are estimated based on each candidate model using the plug-in version of $V(\hat{\psi})$.
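    A sketch of the hyperparameter step described above for the case $W = \lambda^{-1}I_p$: $\hat\sigma^2 = y^t P y/n$ is plugged in and $\lambda$ is chosen to maximize $f_\pi(y|\hat\sigma^2)$, here by a simple bounded one-dimensional search. The optimizer, bounds, and example data are our own choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg2_log_marginal(lam, y, X, V, sigma2_hat):
    """-2 log f_pi(y | sigma_hat^2) for the prior beta ~ N_p(0, sigma^2 lambda^{-1} I_p)."""
    n, p = X.shape
    Vinv = np.linalg.inv(V)
    G = X.T @ Vinv @ X
    A = Vinv - Vinv @ X @ np.linalg.solve(G + lam * np.eye(p), X.T @ Vinv)
    return (n * np.log(2 * np.pi * sigma2_hat)
            + np.linalg.slogdet(V)[1]
            + np.linalg.slogdet(G / lam + np.eye(p))[1]   # log|W G + I_p| with W = I_p/lambda
            + float(y @ A @ y) / sigma2_hat)

rng = np.random.default_rng(5)
n, p = 60, 4
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 1.0, 0.0, 0.0]) + rng.standard_normal(n)
V = np.eye(n)
Vinv = np.linalg.inv(V)
P = Vinv - Vinv @ X @ np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)
sigma2_hat = float(y @ P @ y) / n                          # sigma_hat^2 = y'Py/n

res = minimize_scalar(neg2_log_marginal, bounds=(1e-4, 1e4), method="bounded",
                      args=(y, X, V, sigma2_hat))
print("lambda_hat =", res.x)
```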

    The candidate models consist of all the subsets of the full model, and the model is selected by each criterion. The performance of the criteria is measured by the number of times the true model is selected and by the prediction error of the selected model based on quadratic loss, namely $\|X_{\hat{\jmath}}\hat{\beta}_{\hat{\jmath}} - X_0\beta_0\|^2/n$.
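    For concreteness, the loss above can be computed as follows once a model $\hat\jmath$ has been selected. Taking $\hat\beta_{\hat\jmath}$ to be the GLS estimator under the candidate model is our own assumption here, and the data are illustrative.

```python
import numpy as np

def prediction_error(X, beta_true, Xj, V, y):
    """Quadratic loss ||X_j beta_hat_j - X beta_true||^2 / n for a selected subset,
    with beta_hat_j taken here as the GLS estimator under the candidate model."""
    n = len(y)
    Vinv = np.linalg.inv(V)
    beta_hat = np.linalg.solve(Xj.T @ Vinv @ Xj, Xj.T @ Vinv @ y)
    return float(np.sum((Xj @ beta_hat - X @ beta_true) ** 2)) / n

rng = np.random.default_rng(8)
n = 40
X = rng.standard_normal((n, 7))
beta_true = np.array([1, 1, 1, 1, 0, 0, 0], dtype=float)
y = X @ beta_true + rng.standard_normal(n)
print(prediction_error(X, beta_true, X[:, :4], np.eye(n), y))   # loss of the true submodel
```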

    Tables 1–3 give the number of times each criterion selects the true model, and the average prediction error of the model selected by each criterion is shown in Tables 4–6, for each of the regression models. From these tables, we can see the following three facts. Firstly, the number of times the true model is selected approaches 1000 for all the proposed criteria, which is numerical evidence of the consistency of the criteria. Though the BIC is also consistent, its small-sample performance is not as good as that of our criteria. Secondly, the proposed criteria are not only consistent but also have smaller prediction error even when the sample size is small. In particular, $\mathrm{IC}_{\pi,1}$ is the best in most of the experiments, except when both the sample size and the SNR are small; AIC and AICc perform well in that situation in terms of


  • prediction error. Thirdly, $\mathrm{IC}_{\pi,1}$ and $\mathrm{IC}_R$ have better performance than their approximations $\mathrm{IC}^*_{\pi,1}$ and $\mathrm{IC}^*_R$, respectively, but the difference gets smaller as $n$ becomes larger.

    Table 1: The number of times the true model was selected by each criterion in 1000 realizations of the regression model with constant variance.

                          SNR = 1   SNR = 3   SNR = 5
    n = 20   AIC              130       428       428
             BIC              118       587       588
             AICc              89       749       755
             IC_{π,1}         115       843       905
             IC*_{π,1}         73       732       738
             IC_{π,2}          73       731       737
             IC_R             143       797       882
             IC*_R            147       828       898
    n = 40   AIC              419       536       536
             BIC              424       800       800
             AICc             470       687       687
             IC_{π,1}         472       900       938
             IC*_{π,1}        353       876       876
             IC_{π,2}         352       876       876
             IC_R             462       895       934
             IC*_R            478       899       941
    n = 80   AIC              546       553       553
             BIC              827       872       872
             AICc             604       613       613
             IC_{π,1}         750       934       968
             IC*_{π,1}        839       928       928
             IC_{π,2}         838       928       928
             IC_R             722       937       968
             IC*_R            739       936       969


  • Table 2: The number of times the true model was selected by each criterion in 1000 realizations of the nested error regression model.

                             ψ = 0.5             ψ = 1               ψ = 2
    SNR                   1     3     5       1     3     5       1     3     5
    n0 = 5   AIC         75   382   385     104   387   393     137   393   403
    m = 4    BIC         60   546   564      90   539   574     134   556   586
             AICc        57   694   736      86   693   742     140   691   748
             IC_{π,1}    71   821   902     119   829   915     181   835   941
             IC*_{π,1}   35   646   702      66   656   715     115   662   721
             IC_{π,2}    34   642   701      70   653   715     116   657   720
             IC_R       244   789   888     310   818   900     432   838   924
             IC*_R       78   723   912     106   723   922     149   698   941
    n0 = 5   AIC        220   458   458     235   465   465     259   473   473
    m = 8    BIC        169   731   731     208   739   741     240   746   750
             AICc       219   607   607     251   612   612     284   625   627
             IC_{π,1}   319   891   936     369   913   943     436   928   953
             IC*_{π,1}  107   838   839     151   836   841     199   843   853
             IC_{π,2}   112   838   839     158   836   841     202   841   853
             IC_R       436   890   934     512   903   943     593   925   953
             IC*_R      209   901   944     230   901   949     247   905   962
    n0 = 5   AIC        418   522   522     417   528   528     418   545   545
    m = 16   BIC        394   853   853     407   859   859     416   866   866
             AICc       452   594   594     447   603   603     443   616   616
             IC_{π,1}   622   926   955     618   941   959     624   951   968
             IC*_{π,1}  299   911   910     332   913   913     354   915   915
             IC_{π,2}   300   910   910     331   913   913     358   915   915
             IC_R       691   925   954     695   942   960     709   953   968
             IC*_R      443   932   959     438   946   964     421   956   972


  • Table 3: The number of times the true model was selected by each criterion in 1000 realizations of the regression model with AR(1) errors.

                             ρ = 0.1             ρ = 0.5             ρ = 0.8
    SNR                   1     3     5       1     3     5       1     3     5
    n = 20   AIC        110   347   346     125   373   372     193   377   414
             BIC         91   482   482     118   523   542     209   502   587
             AICc        84   642   646     117   652   688     228   584   715
             IC_{π,1}   101   741   834     138   774   862     274   742   901
             IC*_{π,1}   64   620   625      83   625   667     194   554   688
             IC_{π,2}    63   618   624      88   621   667     197   553   685
             IC_R       123   698   801     224   769   846     562   808   901
             IC*_R      122   702   790     144   715   826     241   646   825
    n = 40   AIC        365   483   483     356   544   544     283   533   551
             BIC        369   756   756     334   791   793     286   737   797
             AICc       416   642   642     373   671   672     308   668   700
             IC_{π,1}   422   877   917     450   909   949     427   901   976
             IC*_{π,1}  315   844   844     281   859   862     247   754   856
             IC_{π,2}   314   844   844     282   858   862     255   752   854
             IC_R       430   866   917     507   903   945     685   918   974
             IC*_R      412   865   918     377   881   932     316   785   932
    n = 80   AIC        516   521   521     483   552   552     333   553   553
             BIC        789   851   851     598   865   865     334   859   868
             AICc       586   593   593     525   614   614     367   637   638
             IC_{π,1}   738   926   949     691   936   962     550   961   979
             IC*_{π,1}  795   912   912     560   905   905     289   899   919
             IC_{π,2}   792   912   912     559   905   905     296   889   919
             IC_R       713   921   951     724   938   961     692   966   980
             IC*_R      714   920   951     600   911   952     382   908   958


  • Table 4: The prediction error of the best model selected by each criterion for the regression model with constant variance.

                         SNR = 1   SNR = 3   SNR = 5
    n = 20   AIC           1.59     0.141    0.0504
             BIC           1.74     0.131    0.0468
             AICc          1.77     0.120    0.0421
             IC_{π,1}      1.70     0.111    0.0372
             IC*_{π,1}     1.92     0.122    0.0429
             IC_{π,2}      1.92     0.122    0.0430
             IC_R          1.56     0.116    0.0383
             IC*_R         1.57     0.114    0.0374
    n = 40   AIC          0.708    0.0660    0.0238
             BIC          0.862    0.0568    0.0205
             AICc         0.732    0.0609    0.0219
             IC_{π,1}     0.754    0.0523    0.0180
             IC*_{π,1}     1.05    0.0534    0.0192
             IC_{π,2}      1.05    0.0534    0.0192
             IC_R         0.716    0.0524    0.0181
             IC*_R        0.718    0.0522    0.0179
    n = 80   AIC          0.292    0.0321    0.0115
             BIC          0.265    0.0260   0.00936
             AICc         0.283    0.0310    0.0112
             IC_{π,1}     0.265    0.0244   0.00841
             IC*_{π,1}    0.285    0.0245   0.00883
             IC_{π,2}     0.285    0.0245   0.00883
             IC_R         0.270    0.0243   0.00841
             IC*_R        0.267    0.0243   0.00840


  • Table 5: The prediction error of the best model selected by each criterion for the nested error regression model.

                               ψ = 0.5                      ψ = 1                        ψ = 2
    SNR                  1       3       5          1       3       5          1       3       5
    n0 = 5   AIC       1.80   0.150   0.0524      1.61   0.145   0.0494      1.44   0.140   0.0463
    m = 4    BIC       1.96   0.159   0.0496      1.76   0.159   0.0473      1.50   0.158   0.0447
             AICc      2.00   0.172   0.0458      1.78   0.174   0.0443      1.53   0.179   0.0428
             IC_{π,1}  1.93   0.151   0.0417      1.72   0.156   0.0410      1.47   0.159   0.0400
             IC*_{π,1} 2.14   0.190   0.0470      1.88   0.188   0.0452      1.60   0.186   0.0434
             IC_{π,2}  2.14   0.192   0.0470      1.87   0.190   0.0452      1.59   0.186   0.0434
             IC_R      1.48   0.135   0.0422      1.37   0.137   0.0413      1.21   0.139   0.0404
             IC*_R     1.91   0.251   0.0413      1.72   0.271   0.0434      1.52   0.316   0.0471
    n0 = 5   AIC      0.983  0.0696   0.0251     0.907  0.0659   0.0237     0.824  0.0619   0.0223
    m = 8    BIC       1.25  0.0622   0.0224      1.11  0.0615   0.0216      1.01  0.0613   0.0209
             AICc      1.10  0.0655   0.0236     0.989  0.0627   0.0226     0.899  0.0610   0.0215
             IC_{π,1} 0.989  0.0567   0.0197     0.888  0.0554   0.0196     0.804  0.0565   0.0194
             IC*_{π,1} 1.41  0.0594   0.0211      1.21  0.0613   0.0207      1.07  0.0635   0.0202
             IC_{π,2}  1.40  0.0594   0.0211      1.20  0.0613   0.0207      1.07  0.0652   0.0202
             IC_R     0.743  0.0568   0.0197     0.666  0.0557   0.0196     0.602  0.0565   0.0194
             IC*_R     1.16  0.0609   0.0196      1.07  0.0691   0.0195      1.01  0.0841   0.0193
    n0 = 5   AIC      0.451  0.0341   0.0123     0.440  0.0325   0.0117     0.434  0.0308   0.0111
    m = 16   BIC      0.740  0.0294   0.0106     0.711  0.0289   0.0104     0.684  0.0284   0.0102
             AICc     0.489  0.0333   0.0120     0.483  0.0319   0.0115     0.481  0.0304   0.0109
             IC_{π,1} 0.433  0.0280  0.00980     0.435  0.0277  0.00983     0.438  0.0275  0.00980
             IC*_{π,1} 0.864 0.0283   0.0102     0.812  0.0281   0.0101     0.770  0.0279   0.0100
             IC_{π,2} 0.862  0.0283   0.0102     0.811  0.0281   0.0101     0.766  0.0279   0.0100
             IC_R     0.348  0.0280  0.00981     0.346  0.0277  0.00982     0.344  0.0274  0.00980
             IC*_R    0.658  0.0278  0.00977     0.664  0.0276  0.00979     0.683  0.0281  0.00978


  • Table 6: The prediction error of the best model selected by each criterion for the regression model with AR(1) errors.

                               ρ = 0.1                      ρ = 0.5                      ρ = 0.8
    SNR                  1       3       5          1       3       5          1       3       5
    n = 20   AIC       1.85   0.175   0.0628      1.71   0.173   0.0613      1.75   0.256   0.0759
             BIC       2.01   0.170   0.0595      1.82   0.192   0.0583      1.73   0.305   0.0802
             AICc      2.04   0.163   0.0549      1.85   0.203   0.0577      1.69   0.343   0.0909
             IC_{π,1}  1.96   0.153   0.0492      1.78   0.182   0.0538      1.66   0.312   0.0831
             IC*_{π,1} 2.17   0.164   0.0559      1.95   0.211   0.0566      1.71   0.355   0.0962
             IC_{π,2}  2.17   0.164   0.0559      1.95   0.216   0.0566      1.72   0.354   0.0972
             IC_R      1.79   0.154   0.0505      1.59   0.157   0.0516      1.71   0.224   0.0720
             IC*_R     1.84   0.159   0.0507      1.72   0.212   0.0557      1.77   0.361   0.114
    n = 40   AIC      0.783  0.0733   0.0264     0.812  0.0685   0.0246      1.09   0.124   0.0365
             BIC      0.973  0.0639   0.0230     0.992  0.0640   0.0225      1.13   0.162   0.0408
             AICc     0.824  0.0681   0.0245     0.881  0.0662   0.0236      1.12   0.135   0.0360
             IC_{π,1} 0.839  0.0587   0.0204     0.832  0.0595   0.0207      1.07   0.136   0.0347
             IC*_{π,1} 1.15  0.0601   0.0216      1.10  0.0638   0.0218      1.14   0.202   0.0432
             IC_{π,2}  1.15  0.0601   0.0216      1.10  0.0639   0.0218      1.15   0.202   0.0441
             IC_R     0.782  0.0592   0.0203     0.711  0.0598   0.0208     0.919   0.118   0.0347
             IC*_R    0.813  0.0592   0.0203     0.857  0.0631   0.0209      1.12   0.195   0.0446
    n = 80   AIC      0.311  0.0342   0.0123     0.365  0.0329   0.0118     0.666  0.0520   0.0187
             BIC      0.301  0.0282   0.0102     0.500  0.0290   0.0105     0.861  0.0613   0.0182
             AICc     0.303  0.0331   0.0119     0.377  0.0322   0.0116     0.693  0.0526   0.0186
             IC_{π,1} 0.286  0.0262  0.00920     0.365  0.0279  0.00983     0.643  0.0530   0.0179
             IC*_{π,1} 0.331 0.0266  0.00958     0.566  0.0284   0.0102     0.914  0.0692   0.0181
             IC_{π,2} 0.333  0.0266  0.00958     0.566  0.0284   0.0102     0.910  0.0692   0.0181
             IC_R     0.290  0.0264  0.00918     0.330  0.0278  0.00984     0.514  0.0499   0.0179
             IC*_R    0.292  0.0264  0.00919     0.414  0.0283  0.00991     0.778  0.0662   0.0180


  • 5 Discussion

    We have derived variable selection criteria for the linear regression model based on the frequentist KL risk of the predictive density constructed from the Bayesian marginal likelihood. We have proved the consistency of the criteria and have shown through simulations that they also perform well in the sense of prediction.

    We gave some advantages of the approach based on the frequentist's risk $R(\omega, \hat{f})$ in (1.1). We here explain them more clearly through a comparison with the related Bayesian criteria. When the prior distribution $\pi(\theta|\lambda, \nu)$ is proper, we can treat the Bayesian prediction risk

    $$ r(\eta, \hat{f}) = \int R(\omega, \hat{f})\, \pi(\theta|\lambda, \nu)\, d\theta $$

    in (1.5). When $\lambda$ and $\nu$ are known, the predictive density $\hat{f}(\tilde{y}; y)$ which minimizes $r(\eta, \hat{f})$ is the Bayesian predictive density (posterior predictive density) $\hat{f}(\tilde{y}|y; \lambda, \nu)$ given by

    $$ \int f(\tilde{y}|\theta, \nu)\, \pi(\theta|y, \lambda, \nu)\, d\theta = \frac{\int f(\tilde{y}|\theta, \nu) f(y|\theta, \nu)\, \pi(\theta|\lambda, \nu)\, d\theta}{\int f(y|\theta, \nu)\, \pi(\theta|\lambda, \nu)\, d\theta}. $$

    When $\lambda$ and $\nu$ are unknown, we can consider the Bayesian risk of the plug-in predictive density $\hat{f}(\tilde{y}|y; \hat{\lambda}, \hat{\nu})$. Then the resulting criterion is known as the predictive likelihood (Akaike, 1980a) or the PIC (Kitagawa, 1997). The deviance information criterion (DIC) of Spiegelhalter et al. (2002) and the Bayesian predictive information criterion (BPIC) of Ando (2007) are related criteria based on the Bayesian prediction risk $r(\eta, \hat{f})$.

    The Akaike Bayesian information criterion (ABIC) (Akaike, 1980b) is another information criterion based on the Bayesian marginal likelihood, given by

    $$ \mathrm{ABIC} = -2 \log\{f_\pi(y|\hat{\lambda})\} + 2 \dim(\lambda), $$

    where the nuisance parameter $\nu$ is not considered. The ABIC measures the following KL risk:

    $$ \int \left[ \int \log\left\{ \frac{f_\pi(\tilde{y}|\lambda)}{f_\pi(\tilde{y}|\hat{\lambda})} \right\} f_\pi(\tilde{y}|\lambda)\, d\tilde{y} \right] f_\pi(y|\lambda)\, dy, $$

    which is not the same as either $R(\omega, \hat{f})$ or $r(\eta, \hat{f})$. The ABIC is a criterion for choosing the hyperparameter $\lambda$ in the same sense as the AIC. However, it is noted that the ABIC works as a model selection criterion for $\theta$ because it is based on the Bayesian marginal likelihood.

    A drawback of such Bayesian criteria is that we cannot construct them for improper prior distributions $\pi(\theta|\lambda, \nu)$, since the corresponding Bayesian prediction risks do not exist. On the other hand, we can construct the corresponding criteria based on $R(\omega, \hat{f})$, because the approach suggested in this paper measures the prediction risk in the frequentist framework. In fact, putting the uniform improper prior on the regression coefficients in the linear regression model, we obtain the RIC of Shi and Tsai (2002). Note that criteria based on the


  • improper marginal likelihood work as variable selection criteria only when the marginal likelihood itself does. For the case where improper priors cannot be used for model selection, intrinsic priors have been proposed in the literature (Berger and Pericchi, 1996; Casella and Moreno, 2006, and others), which give an objective and automatic procedure. As future work, it is worthwhile to consider such an automatic procedure in the framework of our proposed criteria.

    Acknowledgments.

    The first author was supported in part by Grant-in-Aid for Scientific Research (26-10395) from the Japan Society for the Promotion of Science (JSPS). The second author was supported in part by Grant-in-Aid for Scientific Research (23243039 and 26330036) from JSPS. The third author was supported in part by NSERC of Canada.

    A Derivations of the Criteria

    In this section, we show the derivations of the criteria. To this end, we use the following lemma, which was shown in Section A.2 of Srivastava and Kubokawa (2010).

    Lemma A.1 Assume that $C$ is an $n \times n$ symmetric matrix, that $M$ is an idempotent matrix of rank $p$, and that $u \sim \mathcal{N}(0, I_n)$. Then,

    $$ E\left[ \frac{u^t C u}{u^t(I_n - M)u} \right] = \frac{\mathrm{tr}(C)}{n - p - 2} - \frac{2\,\mathrm{tr}[C(I_n - M)]}{(n - p)(n - p - 2)}. $$
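    A quick Monte Carlo check of Lemma A.1, under a hypothetical choice of $C$ and $M$ (this is only a numerical sanity check, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, reps = 12, 3, 200_000

B = rng.standard_normal((n, p))
M = B @ np.linalg.solve(B.T @ B, B.T)                 # idempotent matrix of rank p
C = rng.standard_normal((n, n)); C = (C + C.T) / 2    # symmetric matrix

u = rng.standard_normal((reps, n))
num = np.einsum('ri,ij,rj->r', u, C, u)               # u'Cu per replication
den = np.einsum('ri,ij,rj->r', u, np.eye(n) - M, u)   # u'(I - M)u per replication
mc = np.mean(num / den)

lemma = (np.trace(C) / (n - p - 2)
         - 2 * np.trace(C @ (np.eye(n) - M)) / ((n - p) * (n - p - 2)))
print(mc, lemma)    # the two values should be close
```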

    A.1 Derivation of $\mathrm{IC}_{\pi,1}$ in (2.3)

    It is sufficient to show that the bias correction $\Delta_{\pi,1} = I_{\pi,1}(\omega) - E_\omega[-2 \log\{f_\pi(y|\hat{\sigma}^2)\}]$ is $2n/(n - p - 2)$, where $I_{\pi,1}(\omega)$ is given by (2.2). It follows that

    $$ \Delta_{\pi,1} = E_\omega(\tilde{y}^t A \tilde{y}/\hat{\sigma}^2) - E_\omega(y^t A y/\hat{\sigma}^2) = E_\omega(\tilde{y}^t A \tilde{y}) \cdot E_\omega(1/\hat{\sigma}^2) - E_\omega(y^t A y/\hat{\sigma}^2). $$

    Firstly,

    $$ E_\omega(\tilde{y}^t A \tilde{y}) = E_\omega[(\tilde{y} - X\beta + X\beta)^t A (\tilde{y} - X\beta + X\beta)] = \sigma^2 \mathrm{tr}(AV) + \beta^t X^t A X \beta. \qquad (\mathrm{A.1}) $$

    Secondly, noting that $n\hat{\sigma}^2 = y^t P y = \sigma^2 u^t(I_n - M)u$ for

    $$ u = V^{-1/2}(y - X\beta)/\sigma, \qquad M = V^{-1/2}X(X^t V^{-1} X)^{-1}X^t V^{-1/2}, \qquad (\mathrm{A.2}) $$

    and that $PX = 0$, we can obtain

    $$ E_\omega(1/\hat{\sigma}^2) = n\, E_\omega\left[ \frac{1}{y^t P y} \right] = n\, E_\omega\left[ \frac{1}{\sigma^2 u^t(I_n - M)u} \right] = \frac{n}{\sigma^2(n - p - 2)}. \qquad (\mathrm{A.3}) $$


  • Finally,

    $$ E_\omega(y^t A y/\hat{\sigma}^2) = n\, E_\omega\left[ \frac{y^t A y}{y^t P y} \right] = n\, E_\omega\left[ \frac{\sigma^2 u^t V^{1/2} A V^{1/2} u + \beta^t X^t A X \beta}{\sigma^2 u^t(I_n - M)u} \right] = n\left[ \frac{\mathrm{tr}(AV)}{n - p - 2} - \frac{2\,\mathrm{tr}(AVPV)}{(n - p)(n - p - 2)} + \frac{\beta^t X^t A X \beta}{\sigma^2(n - p - 2)} \right]. \qquad (\mathrm{A.4}) $$

    The last equality above can be derived from Lemma A.1. Combining (A.1), (A.3) and (A.4), we get

    $$ \Delta_{\pi,1} = \frac{2n\,\mathrm{tr}(AVPV)}{(n - p)(n - p - 2)}. $$

    We can see that

    $$ \mathrm{tr}(AVPV) = \mathrm{tr}\{(V + B)^{-1}(V + B - B)PV\} = \mathrm{tr}(PV) - \mathrm{tr}\{(V + B)^{-1}BPV\} = \mathrm{tr}(I_n - M) = n - p, \qquad (\mathrm{A.5}) $$

    since $BP = XWX^tP = 0$, and then we obtain $\Delta_{\pi,1} = 2n/(n - p - 2)$.
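    The bias correction just derived can also be checked numerically: simulating pairs $(y, \tilde y)$ and comparing the Monte Carlo estimate of $\Delta_{\pi,1} = E_\omega(\tilde y^t A \tilde y/\hat\sigma^2) - E_\omega(y^t A y/\hat\sigma^2)$ with $2n/(n-p-2)$. The design, prior and parameter values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, reps = 30, 3, 100_000
X = rng.standard_normal((n, p))
beta, sigma = np.array([1.0, -0.5, 0.2]), 1.3
V, W = np.eye(n), np.eye(p)

Vinv = np.linalg.inv(V)
G = X.T @ Vinv @ X
P = Vinv - Vinv @ X @ np.linalg.solve(G, X.T @ Vinv)
A = Vinv - Vinv @ X @ np.linalg.solve(G + np.linalg.inv(W), X.T @ Vinv)

mean = X @ beta
L = sigma * np.linalg.cholesky(V)
y  = mean + (L @ rng.standard_normal((n, reps))).T    # observations y
yt = mean + (L @ rng.standard_normal((n, reps))).T    # independent replications y_tilde

sigma2_hat = np.einsum('ri,ij,rj->r', y, P, y) / n
delta_mc = np.mean(np.einsum('ri,ij,rj->r', yt, A, yt) / sigma2_hat
                   - np.einsum('ri,ij,rj->r', y, A, y) / sigma2_hat)
print(delta_mc, 2 * n / (n - p - 2))                  # should be close
```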

    A.2 Derivation of $\mathrm{IC}_{\pi,2}$ in (2.4)

    From the fact that $E_\omega(\mathrm{IC}_{\pi,1}) = I_{\pi,1}(\omega)$ and that $E_\pi E_\omega(\mathrm{IC}_{\pi,1}) = E_\pi[I_{\pi,1}(\omega)] = I_{\pi,2}(\sigma^2)$, it suffices to show that $E_\pi E_\omega(\mathrm{IC}_{\pi,1})$ is approximated as

    $$ E_\pi E_\omega(\mathrm{IC}_{\pi,1}) \approx E_\pi E_\omega[n \log\hat{\sigma}^2 + \log|V| + p \log n + 2 + y^t A y/\hat{\sigma}^2] \approx E_\pi E_\omega[n \log\hat{\sigma}^2 + \log|V| + p \log n + p] + (n + 2) = E_\pi E_\omega(\mathrm{IC}_{\pi,2}) + (n + 2), $$

    when $n$ is large. Note that $n + 2$ is irrelevant to the model. It follows that

    $$ E_\omega\left( \frac{y^t A y}{\hat{\sigma}^2} \right) = n\, E_\omega\left[ \frac{y^t\{V^{-1} - V^{-1}X(X^t V^{-1} X + W^{-1})^{-1}X^t V^{-1}\}y}{y^t\{V^{-1} - V^{-1}X(X^t V^{-1} X)^{-1}X^t V^{-1}\}y} \right] = n + n\, E_\omega\left[ \frac{y^t V^{-1}X(X^t V^{-1} X + W^{-1})^{-1}W^{-1}(X^t V^{-1} X)^{-1}X^t V^{-1}y}{y^t\{V^{-1} - V^{-1}X(X^t V^{-1} X)^{-1}X^t V^{-1}\}y} \right] = n + \frac{n}{\sigma^2(n - p - 2)}\, E_\omega\left[ y^t V^{-1}X(X^t V^{-1} X + W^{-1})^{-1}W^{-1}(X^t V^{-1} X)^{-1}X^t V^{-1}y \right] = n + \frac{n}{\sigma^2(n - p - 2)} \left[ \sigma^2\, \mathrm{tr}\{(X^t V^{-1} X + W^{-1})^{-1}W^{-1}\} + \beta^t X^t V^{-1} X(X^t V^{-1} X + W^{-1})^{-1}W^{-1}\beta \right], $$

    and that

    $$ E_\pi[\beta^t X^t V^{-1} X(X^t V^{-1} X + W^{-1})^{-1}W^{-1}\beta] = \sigma^2\, \mathrm{tr}[X^t V^{-1} X(X^t V^{-1} X + W^{-1})^{-1}]. $$


  • If $n^{-1} X^t V^{-1} X$ converges to a $p \times p$ positive definite matrix as $n \to \infty$, then $\mathrm{tr}[(X^t V^{-1} X + W^{-1})^{-1}W^{-1}] \to 0$ and $\mathrm{tr}[X^t V^{-1} X(X^t V^{-1} X + W^{-1})^{-1}] \to p$. Then we can obtain $E_\pi E_\omega(y^t A y/\hat{\sigma}^2 - n) \to p$, which is what we wanted to show.

    A.3 Derivation of $\mathrm{IC}_R$ in (2.6)

    We shall show that the bias correction $\Delta_R = I_R(\omega) - E_\omega[-2 \log\{f_R(y|\tilde{\sigma}^2)\}]$ is $2(n - p)/(n - p - 2)$, where $I_R(\omega)$ is given by (2.5). Then,

    $$ \Delta_R = E_\omega(\tilde{y}^t P \tilde{y}/\tilde{\sigma}^2) - E_\omega(y^t P y/\tilde{\sigma}^2) = E_\omega(\tilde{y}^t P \tilde{y}) \cdot E_\omega(1/\tilde{\sigma}^2) - (n - p). $$

    Since $E_\omega(\tilde{y}^t P \tilde{y}) = (n - p)\sigma^2$ and $E_\omega(1/\tilde{\sigma}^2) = (n - p)/\{\sigma^2(n - p - 2)\}$, we get $\Delta_R = 2(n - p)/(n - p - 2)$.

    B Proof of Theorem 1

    We only prove the consistency of $\mathrm{IC}_{\pi,1}$. The proof of the consistency of the other criteria can be done in the same manner. Because we see that

    $$ P(\hat{\jmath} = j) \leq P\{\mathrm{IC}_{\pi,1}(j) < \mathrm{IC}_{\pi,1}(j_0)\} $$

    for any $j \in J \setminus \{j_0\}$, it suffices to show that $P\{\mathrm{IC}_{\pi,1}(j) < \mathrm{IC}_{\pi,1}(j_0)\} \to 0$, or equivalently $P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_0) > 0\} \to 1$, as $n \to \infty$. When $V = I_n$, we obtain

    $$ \mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_0) = I_1 + I_2 + I_3, $$

    where

    $$ I_1 = n \log(\hat{\sigma}^2_j/\hat{\sigma}^2_0) + y^t A_j y/\hat{\sigma}^2_j - y^t A_0 y/\hat{\sigma}^2_0, $$
    $$ I_2 = \log|X^t_j X_j + W_j^{-1}| - \log|X^t_0 X_0 + W_0^{-1}|, $$
    $$ I_3 = \log\{|W_j|/|W_0|\} + \frac{2n}{n - p_j - 2} - \frac{2n}{n - p_0 - 2}, $$

    for $\hat{\sigma}^2_j = y^t(I_n - H_j)y/n$ with $H_j = X_j(X^t_j X_j)^{-1}X^t_j$, $\hat{\sigma}^2_0 = \hat{\sigma}^2_{j_0}$, $A_j = I_n - X_j(X^t_j X_j + W_j^{-1})^{-1}X^t_j$ and $H_0 = H_{j_0}$.

    We evaluate the asymptotic behaviors of $I_1$, $I_2$ and $I_3$ for $j \in J_-$ and $j \in J_+ \setminus \{j_0\}$ separately.

    [Case of $j \in J_-$]. Firstly, we evaluate $I_1$. We decompose $I_1 = I_{11} + I_{12}$, where $I_{11} = n \log(\hat{\sigma}^2_j/\hat{\sigma}^2_0)$ and $I_{12} = y^t A_j y/\hat{\sigma}^2_j - y^t A_0 y/\hat{\sigma}^2_0$. It follows that

    $$ \hat{\sigma}^2_j - \hat{\sigma}^2_0 = (X_0\beta_0 + \varepsilon)^t(I_n - H_j)(X_0\beta_0 + \varepsilon)/n - \varepsilon^t(I_n - H_0)\varepsilon/n = \|X_0\beta_0 - H_j X_0\beta_0\|^2/n + o_p(1). $$


  • Then we can see that

    $$ n^{-1}I_{11} = \log\left(1 + \frac{\hat{\sigma}^2_j - \hat{\sigma}^2_0}{\hat{\sigma}^2_0}\right) = \log\left(1 + \frac{\|X_0\beta_0 - H_j X_0\beta_0\|^2}{n\sigma^2}\right) + o_p(1), \qquad (\mathrm{B.1}) $$

    and it follows from assumption (A3) that

    $$ \liminf_{n \to \infty}\ \log\left(1 + \frac{\|X_0\beta_0 - H_j X_0\beta_0\|^2}{n\sigma^2}\right) > 0. \qquad (\mathrm{B.2}) $$

    Because $y^t A_j y/(n\hat{\sigma}^2_j) = 1 + o_p(1)$ and $y^t A_0 y/(n\hat{\sigma}^2_0) = 1 + o_p(1)$, we obtain

    $$ n^{-1}I_{12} = o_p(1). \qquad (\mathrm{B.3}) $$

    Secondly, we evaluate $I_2$. It follows that

    $$ \log|X^t_j X_j + W_j^{-1}| = p_j \log n + \log|X^t_j X_j/n + W_j^{-1}/n| = p_j \log n + O(1). $$

    It can also be seen that $\log|X^t_0 X_0 + W_0^{-1}| = p_0 \log n + O(1)$. Then,

    $$ n^{-1}I_2 = (p_j - p_0)\, n^{-1} \log n + o(1) = o(1). \qquad (\mathrm{B.4}) $$

    Lastly, it is easy to see that

    $$ n^{-1}I_3 = o(1). \qquad (\mathrm{B.5}) $$

    From (B.1)–(B.5), it follows that

    $$ P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_0) > 0\} \to 1, \qquad (\mathrm{B.6}) $$

    for all $j \in J_-$.

    [Case of $j \in J_+ \setminus \{j_0\}$]. Firstly, we evaluate $I_1$. From the fact that

    $$ \hat{\sigma}^2_0 - \hat{\sigma}^2_j = \varepsilon^t(H_j - H_0)\varepsilon/n = O_p(n^{-1}), \qquad (\mathrm{B.7}) $$

    it follows that

    $$ (\log n)^{-1}I_{11} = (\log n)^{-1}\, n \log\left[\frac{\hat{\sigma}^2_0 - (\hat{\sigma}^2_0 - \hat{\sigma}^2_j)}{\hat{\sigma}^2_0}\right] = (\log n)^{-1}\, n \log\{1 + O_p(n^{-1})\} = o_p(1). \qquad (\mathrm{B.8}) $$

    As for $I_{12}$, from (B.7) and $y^t A_j y - y^t A_0 y = O_p(1)$, we can obtain

    $$ I_{12} = y^t A_j y/\hat{\sigma}^2_j - y^t A_0 y/\hat{\sigma}^2_0 = (y^t A_j y - y^t A_0 y)/\hat{\sigma}^2_0 + O_p(1) = O_p(1). $$

    Then,

    $$ (\log n)^{-1}I_{12} = o_p(1). \qquad (\mathrm{B.9}) $$


  • Secondly, we evaluate $I_2$. Since $p_j > p_0$ for all $j \in J_+ \setminus \{j_0\}$,

    $$ \liminf_{n \to \infty}\ (\log n)^{-1}I_2 = p_j - p_0 > 0. \qquad (\mathrm{B.10}) $$

    Finally, it is easy to see that

    $$ (\log n)^{-1}I_3 = o(1). \qquad (\mathrm{B.11}) $$

    From (B.8)–(B.11), it follows that

    $$ P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_0) > 0\} \to 1, \qquad (\mathrm{B.12}) $$

    for all $j \in J_+ \setminus \{j_0\}$.

    Combining (B.6) and (B.12), we obtain

    $$ P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_0) > 0\} \to 1, $$

    for all $j \in J \setminus \{j_0\}$, which shows that $\mathrm{IC}_{\pi,1}$ is consistent.

    References

    Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory (B.N. Petrov and F. Csáki, eds.), 267–281, Akadémiai Kiadó, Budapest.

    Akaike, H. (1974). A new look at the statistical model identification. System identification and time-series analysis. IEEE Trans. Autom. Contr., AC-19, 716–723.

    Akaike, H. (1980a). On the use of predictive likelihood of a Gaussian model. Ann. Inst. Statist. Math., 32, 311–324.

    Akaike, H. (1980b). Likelihood and the Bayes procedure. In Bayesian Statistics (J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, eds.), 141–166, University Press, Valencia, Spain.

    Ando, T. (2007). Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika, 94, 443–458.

    Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. J. Amer. Statist. Assoc., 83, 28–36.

    Berger, J.O. and Pericchi, L.R. (1996). The intrinsic Bayes factor for model selection and prediction. J. Amer. Statist. Assoc., 91, 109–122.

    Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.


  • Casella, G. and Moreno, E. (2006). Objective Bayesian variable selection. J. Amer. Statist. Assoc., 101, 157–167.

    Henderson, C.R. (1950). Estimation of genetic parameters. Ann. Math. Statist., 21, 309–310.

    Hurvich, C.M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.

    Kitagawa, G. (1997). Information criteria for the predictive evaluation of Bayesian models. Commun. Statist. Theory Meth., 26, 2223–2246.

    Mallows, C.L. (1973). Some comments on C_p. Technometrics, 15, 661–675.

    Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist., 12, 758–765.

    Patterson, H.D. and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545–554.

    Prasad, N.G.N. and Rao, J.N.K. (1990). The estimation of the mean squared error of small-area estimators. J. Amer. Statist. Assoc., 85, 163–171.

    Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6, 461–464.

    Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sin., 7, 221–264.

    Shi, P. and Tsai, C.-L. (2002). Regression model selection – a residual likelihood approach. J. Royal Statist. Soc. B, 64, 237–252.

    Shibata, R. (1981). An optimal selection of regression variables. Biometrika, 68, 45–54.

    Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. J. Royal Statist. Soc. B, 64, 583–639.

    Srivastava, M.S. and Kubokawa, T. (2010). Conditional information criteria for selecting variables in linear mixed models. J. Multivariate Anal., 101, 1970–1980.

    Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Commun. Statist. Theory Meth., 7, 13–26.

    Vaida, F. and Blanchard, S. (2005). Conditional Akaike information for mixed-effects models. Biometrika, 92, 351–370.

    Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (P.K. Goel and A. Zellner, eds.), 233–243, Amsterdam: North-Holland/Elsevier.
