Aspects Of Composite Likelihood Estimation And Prediction
by
Ximing Xu
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Statistics
University of Toronto
© Copyright by Ximing Xu 2012
Aspects Of Composite Likelihood Estimation And Prediction
Ximing Xu
Submitted for the Degree of Doctor of Philosophy
Department of Statistics, University of Toronto
June 2012
Abstract
A composite likelihood is usually constructed by multiplying a collection of lower dimensional marginal or conditional densities. In recent years, composite likelihood methods have received increasing interest for modeling complex data arising from various application areas, where the full likelihood function is analytically unknown or computationally prohibitive due to the structure of dependence, the dimension of the data or the presence of nuisance parameters.
In this thesis we investigate some theoretical properties of the maximum composite likelihood estimator (MCLE). In particular, we obtain the limit of the MCLE in a general setting, and set out a framework for understanding the notion of robustness in the context of composite likelihood inference. We also study the improvement of the efficiency of a composite likelihood by incorporating additional component likelihoods, or by using component likelihoods of higher dimension. We show through some illustrative examples that such strategies do not always work and may impair the efficiency. We also show that the MCLE of the parameter of interest can be less efficient when the nuisance parameters are known than when they are unknown.
In addition to the theoretical study of composite likelihood estimation, we also explore the possibility of using composite likelihood to make predictive inference in computer experiments. The Gaussian process model is widely used to build statistical emulators for computer experiments. However, when the number of trials is large, both estimation and prediction based on a Gaussian process can be computationally intractable due to the dimension of the covariance matrix. To address this problem, we propose prediction methods based on different composite likelihood functions, which do not require the evaluation of the large covariance matrix and hence alleviate the computational burden. Simulation studies show that the blockwise composite likelihood-based predictors perform well and are competitive with the optimal predictor based on the full likelihood.
Acknowledgements
First and foremost, I would like to express my deepest gratitude to my supervisor, Professor Nancy Reid, for her inspiration, guidance, encouragement and patience throughout my doctoral program.
I thank my thesis committee members, Professor Keith Knight and Professor Muni Srivastava, and the external examiner, Professor Naisyin Wang of the University of Michigan, for their constructive feedback and suggestions on my work. I would also like to thank Professor Derek Bingham at Simon Fraser University for his help with my work on computer experiments. Many thanks to all the faculty, staff and my fellow graduate students for making my study at Toronto such an enjoyable experience.
Finally, I dedicate this thesis to my parents, for always understanding, supporting and loving me.
Contents

1 Introduction
  1.1 Definition and Notation
  1.2 Asymptotics of maximum composite likelihood estimators
  1.3 Main results and Outline

2 On the robustness of maximum composite likelihood estimators
  2.1 The limit of the MCLE in a general setting
    2.1.1 Introduction and assumptions
    2.1.2 The main theorem
  2.2 Aspects of robustness for the MCLE
  2.3 Examples
    2.3.1 Estimation of association parameters
    2.3.2 Estimation of the correlation
    2.3.3 No compatible joint density exists
    2.3.4 A class of distributions with normal margins
  2.4 Concluding Remarks

3 On the efficiency of maximum composite likelihood estimators
  3.1 Efficiency of composite likelihood with more component likelihoods
    3.1.1 General setting
    3.1.2 Product of information-unbiased composite likelihoods
    3.1.3 Product of uncorrelated composite likelihoods
    3.1.4 Pairwise likelihood and independence likelihood
  3.2 Efficiency of composite likelihood with known nuisance parameters
    3.2.1 Equicorrelated multivariate normal model
    3.2.2 Discussion
  3.3 Theoretical results on multiparameter case
  3.4 Concluding Remarks

4 Prediction in computer experiments with composite likelihood
  4.1 Computer experiments
    4.1.1 Gaussian random function model
    4.1.2 Estimation for GRF model
    4.1.3 Prediction for GRF model
  4.2 Estimation using composite likelihood
  4.3 Prediction using composite likelihood
    4.3.1 Maximum pairwise likelihood predictors
    4.3.2 Maximum triplet-wise likelihood predictors
    4.3.3 Maximum blockwise likelihood predictors
  4.4 Simulation study
    4.4.1 Prediction for GRF model with 1-dimensional input
    4.4.2 Prediction for GRF model with 2-dimensional input
  4.5 Discussion and Future Work

Bibliography

List of Figures

2.1 The ratio of the simulated variances, Sρ/Sρpl, as a function of ρ. The lines shown are for p = 3, 6, 8 (descending).
3.1 The asymptotic variances (multiplied by n) of the maximum composite likelihood estimators for the full likelihood (solid line), the independence likelihood (dashed line) and the pairwise likelihood (dotted line).
3.2 The plot of the ratio r(ρ) = avar(ρpl)/avar(ρpl) at p = 3. The vertical and horizontal dashed lines denote ρ = 0 and r(ρ) = 1, respectively.
3.3 The plot of the ratio r(ρ) = avar(ρcl)/avar(ρcl) at p = 3. The horizontal dashed line denotes r(ρ) = 1.

List of Tables

2.1 Model Specification
2.2 Performance of ρ, ρpl and ρ when n = 100, M = 10000, p = 3 and t = 1.
4.1 EMSPE of the six predictors for different γ when α = 1.99
4.2 EMSPE of the six predictors for different α when γ = 100
4.3 EMSPE of the six predictors at different sample size n
4.4 EMSPE of the six predictors with 2-dimensional input
Chapter 1
Introduction
The likelihood function plays a critical role in statistical inference in both frequentist and
Bayesian frameworks. However, with the current explosion in the size of data sets and the
increase in complexity of the dependencies among variables in many realistic models, the
full likelihood function is often impractical to construct or too cumbersome to evaluate. In
these situations, composite likelihood, which is usually constructed by compounding some
lower dimensional likelihood objects, appears to be an attractive alternative.
The idea of composite likelihood dates back at least to the pseudolikelihood, a product
of conditional densities, suggested by Besag (1974, 1975) for modelling spatial data. Since
then, various types of composite likelihood methods have been proposed in a range of complex
applications. Examples include variants of Besag’s pseudolikelihood using blocks of
observations in both conditional and conditioned events (Vecchia, 1988; Stein et al., 2004),
the pairwise likelihood for longitudinal data (Renard et al., 2004) and time series (Varin
and Vidoni, 2006; Davis and Yau, 2011), and composite likelihood methods in Markov
chain models (Hjort and Varin, 2007). A recent comprehensive review on the applications
of composite likelihood methods is given in Varin et al. (2011).
1.1 Definition and Notation
Consider a p-dimensional random vector Y with probability density function f(y; θ) for
some q-dimensional parameter vector θ ∈ Θ. Denote by A1, . . . , AK a set of marginal
or conditional events, for example determined by the joint or conditional distributions of
sub-vectors of Y , with associated likelihood functions Lk(θ; y), k = 1, . . . , K. Following
Lindsay (1988), the composite likelihood function is defined as
CL(θ; y) =K∏k=1
Lk(θ; y)wk , (1.1)
where wk is a set of nonnegative weights. Note that Lk(θ; y) might depend only on a sub-
vector of θ. The choice of the component likelihood functions Lk(θ; y) and the weights
wk may be critical to improve the accuracy and efficiency of the resulting statistical
inference (Lindsay, 1988; Joe and Lee, 2009; Varin et al., 2011). From the above definition
1 Introduction 3
it is easy to see that the full likelihood function is a special case of a composite likelihood function; however, a composite likelihood function will not usually be a genuine likelihood function, that is, it will not be proportional to the density function of any random vector.
Composite likelihood functions are usually distinguished into conditional and marginal
versions. Two of the most commonly used composite conditional likelihood functions are
the pairwise conditional likelihood function,

CL_{PC}(\theta; y) = \prod_{r=1}^{p} \prod_{s \neq r} f(y_r \mid y_s; \theta)^{w_{rs}}, \qquad (1.2)

and the full conditional likelihood function,

CL_{FC}(\theta; y) = \prod_{r=1}^{p} f(y_r \mid y_{(-r)}; \theta)^{w_r}, \qquad (1.3)

where y_{(-r)} denotes the vector y omitting the rth component y_r.

Two particularly useful composite marginal likelihood functions are the independence likelihood function,

CL_{ind}(\theta; y) = \prod_{r=1}^{p} f(y_r; \theta)^{w_r}, \qquad (1.4)

and the pairwise likelihood function,

CL_{pair}(\theta; y) = \prod_{r=1}^{p-1} \prod_{s=r+1}^{p} f(y_r, y_s; \theta)^{w_{rs}}. \qquad (1.5)
With a sample of independent observations y = (y^{(1)}, \ldots, y^{(n)}), where each y^{(i)} is a p-dimensional vector, the composite log-likelihood function is

c\ell(\theta; y) = \sum_{i=1}^{n} c\ell(\theta; y^{(i)}) = \sum_{i=1}^{n} \log CL(\theta; y^{(i)}), \qquad (1.6)
and the composite score function uc(θ; y) is defined as
u_c(\theta; y) = \nabla_\theta\, c\ell(\theta; y), \qquad (1.7)
where ∇θ denotes the operation of differentiation with respect to the vector parameter θ.
The notation u(θ; y) is reserved for the score function of the full likelihood.
Because each of the component likelihood functions Lk(θ; y) is based on a marginal or conditional density, it is easy to show that E{uc(θ; y)} = 0 under mild regularity conditions on the Lk(θ; y), such as differentiability of log Lk(θ; y) under the integral sign and a support not depending on θ (Godambe, 1960). The Kullback-Leibler inequality also applies to the composite likelihood function (1.1):
E_{\theta_0}\!\left[\log \frac{CL(\theta_0; y)}{CL(\theta; y)}\right] = \sum_{k=1}^{K} E_{\theta_0}\!\left[\log \frac{L_k(\theta_0; y)}{L_k(\theta; y)}\right] \geq 0, \qquad (1.8)
where θ0 denotes the true value of θ, and the expectation Eθ0(·) is determined by the true
density functions with θ = θ0. The validity of composite likelihood inference is usually justified either based on the theory of unbiased estimating equations or based on the
Kullback-Leibler criterion (Lindsay, 1988; Varin, 2008; Molenberghs and Verbeke, 2005).
Composite likelihood versions of Wald, score, and likelihood ratio test statistics can be
easily constructed, and have been proposed in a range of applications (Geys et al., 1999;
Chandler and Bate, 2007; Varin, 2008; Pace et al., 2011; Molenberghs and Verbeke, 2005,
Chapter 9). Some discussion of higher order asymptotic theory for the composite likelihood
ratio statistic can be found in Jin (2009, Section 2.4). Model selection criteria have also
been derived under the framework of composite likelihood; examples include the composite
AIC criterion in Varin and Vidoni (2005) and the composite BIC criterion in Gao and Song
(2010). In this thesis we focus on estimation and prediction using composite likelihood
functions. Some theoretical results on the maximum composite likelihood estimator are
summarized in the next section.
1.2 Asymptotics of maximum composite likelihood estimators
Given a sample y = (y(1), ..., y(n)) as in Section 1.1, the maximum composite likelihood
estimator (MCLE) of θ in the family of models, f(y; θ), is defined as the maximizer of the
composite log-likelihood function (1.6), i.e.
\theta_{CL} = \arg\max_{\theta}\, c\ell(\theta; y). \qquad (1.9)
In many applications, θCL may be found by solving the composite score equation uc(θ; y) = ∇θ cℓ(θ; y) = 0. Under some regularity conditions, as n → ∞, θCL is consistent and asymptotically normally distributed (Lindsay, 1988; Varin, 2008; Xu and Reid, 2011):

\sqrt{n}\,(\theta_{CL} - \theta) \xrightarrow{d} N_q\{0, G^{-1}(\theta)\},

where N_q\{\mu, \Sigma\} denotes the q-dimensional normal distribution with mean and variance as indicated, and G(θ) is the Godambe information matrix (Godambe, 1960):
G(\theta) = H(\theta)\, J^{-1}(\theta)\, H(\theta), \qquad (1.10)
where the sensitivity matrix H(θ) = E{−∇θ uc(θ; y)} and the variability matrix J(θ) = varθ{uc(θ; y)}. Throughout this thesis both H(θ) and J(θ) are assumed to be positive definite matrices; note that H(θ) and J(θ) are by definition nonnegative definite.
Because the component score functions uk(θ; y) = ∇θ log Lk(θ; y) can be correlated, the second Bartlett identity does not hold for composite likelihood functions in general, i.e. H(θ) ≠ J(θ). Following Lindsay (1982), we call a composite likelihood CL(θ; y) information-unbiased if H(θ) = J(θ) for all θ ∈ Θ, and information-biased otherwise. For the full likelihood function, we have H(θ) = J(θ) = I(θ), the expected Fisher information.
By differentiating the identity E{uc(θ; y)} = 0 with respect to θ, we have

E\{\nabla_\theta u_c(\theta; y)\} + E\{u(\theta; y)\, u_c^{T}(\theta; y)\} = 0, \qquad (1.11)

where u(θ; y) is the score function of the full likelihood. So H(θ) = cov{u(θ; y), uc(θ; y)}, and the Godambe information G(θ) can also be written as

G(\theta) = \mathrm{cov}\{u(\theta; y), u_c(\theta; y)\}\, \big[\mathrm{var}_\theta\{u_c(\theta; y)\}\big]^{-1}\, \mathrm{cov}\{u_c(\theta; y), u(\theta; y)\}. \qquad (1.12)

The multivariate version of the Cauchy-Schwarz inequality implies that

I(\theta) = \mathrm{var}_\theta\{u(\theta; y)\} \geq G(\theta),

i.e. the full likelihood function is more efficient than any other composite likelihood function (Lindsay, 1988, Lemma 4A).
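The sandwich form (1.10) is easy to check by Monte Carlo in a case where H(θ) and J(θ) are available in closed form. The sketch below, with arbitrary illustrative settings not taken from the thesis, uses the independence likelihood for the common mean µ of an equicorrelated normal vector with unit variances, for which uc(µ; y) = Σr(yr − µ), H = p and J = p + p(p − 1)ρ, so G = p/{1 + (p − 1)ρ}.

```python
import numpy as np

# Monte Carlo check of the Godambe information G = H J^{-1} H (1.10) for the
# independence likelihood of the mean of an equicorrelated normal vector.
# All numerical settings are illustrative choices.
rng = np.random.default_rng(0)
p, rho, mu0, n = 4, 0.5, 1.0, 200_000
Sigma = (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))
y = rng.multivariate_normal(mu0 * np.ones(p), Sigma, size=n)

u = (y - mu0).sum(axis=1)   # composite score u_c(mu0; y) per observation
H = p                       # sensitivity: -E{d u_c / d mu} = p
J = u.var()                 # variability: var(u_c); theory gives p + p(p-1)rho
G = H**2 / J                # Godambe information; theory gives p / {1 + (p-1)rho}
```

With these settings the theoretical values are J = 10 and G = 1.6, and the Monte Carlo estimates agree closely; the gap between G and the full-likelihood Fisher information reflects the efficiency loss discussed above.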
1.3 Main results and Outline
In modelling only lower dimensional marginal or conditional densities, composite likelihood inference is widely viewed as robust, in the sense that the inference is valid for a range of statistical models consistent with the component densities. Chapter 2 presents a rigorous proof of the limit of the maximum composite likelihood estimator taking model misspecification into account. The notion of robustness in the context of composite likelihood inference is studied based on the consistency of the MCLE, and clarified through some illustrative examples. We also carry out a simulation study of the performance of the MCLE in a constructed model suggested by Arnold (2010) that is not multivariate normal, but has multivariate normal marginal distributions.
Intuitively, a more efficient composite likelihood function can be obtained by pooling more component likelihood functions, or by using higher dimensional component likelihood functions. However, such strategies may cause a loss of efficiency even when the component likelihood functions are independent, as illustrated through some simple examples in Chapter 3. Sufficient conditions that guarantee an improvement in efficiency are
presented in different scenarios. Chapter 3 also investigates the efficiency of the maximum
composite likelihood estimation in the presence of nuisance parameters. In the equicorrelated multivariate normal model, we show that the maximum composite likelihood estimator of the correlation coefficient ρ can be less efficient when the common variance σ² is known than when σ² is unknown. Necessary and sufficient conditions for this paradox to occur are discussed.
Composite likelihood methods have been proposed to estimate the parameters in a variety of complex models (Varin et al., 2011). However, little work has been done on using composite likelihood for predictive inference (Grunenfelder, 2010). Chapter 4 aims to use composite likelihood to make predictions in computer experiments, where Gaussian processes are usually used to build the statistical emulators. When the number of trials n is large, both estimation and prediction based on a Gaussian process can be computationally intractable due to the dimension of the covariance matrix. To address this problem, we propose prediction methods based on maximizing different composite likelihood functions, which reduce the computational complexity from O(n^3) to O(n^2). Simulation studies show that the proposed predictors outperform the pairwise likelihood-based predictor suggested by Grunenfelder (2010), and are competitive with the optimal predictor based on the full likelihood function.
Chapter 2
On the robustness of maximum
composite likelihood estimators
This chapter is a complement to the discussion on the robustness of composite likelihood inference in Varin (2008) and Varin et al. (2011). To formulate ideas about robustness we distinguish between the true data-generating model and the working model used for inference, following Kent (1982). We assume the random vector Y has distribution function G(y), with associated density function g(y); the marginal distribution function of the subvector Ysk ⊂ Y is Gk(ysk) and the corresponding density function is gk(ysk), k = 1, ..., K, with respect to some dominating measure µ. Now consider the family of modelled distributions for Ysk, with common support and density functions {fk(ysk; θ); θ ∈ Ω} with respect to the
same dominating measure µ. We restrict attention to the unweighted composite marginal
likelihood function:
CL(\theta; y) = \prod_{k=1}^{K} f_k(y_{s_k}; \theta). \qquad (2.1)
The family of densities {f(y; θ); θ ∈ Ω} is correctly specified if there exists θ0 ∈ Ω such that f(y; θ0) = g(y); if no such θ0 exists, the model is misspecified. The composite marginal likelihood function (2.1) is correctly specified if {fk(ysk; θ); θ ∈ Ω} is correctly specified for each k ∈ {1, ..., K}.
If the full model is misspecified, then as in Kent (1982) and White (1982), we define θ∗ML
as the parameter which minimizes the Kullback-Leibler divergence between the specified
full model and the true model g(·). Similarly, for a misspecified composite likelihood
function CL(θ; y), θ∗ is defined as the parameter point which minimizes the composite
Kullback-Leibler divergence (Varin and Vidoni, 2005):
\theta^{*} = \arg\min_{\theta} E_g\!\left[\log \frac{\prod_{k=1}^{K} g_k(y_{s_k})}{CL(\theta; y)}\right] = \arg\min_{\theta} \sum_{k=1}^{K} E_g\!\left[\log \frac{g_k(y_{s_k})}{f_k(y_{s_k}; \theta)}\right]. \qquad (2.2)
Consistency of the maximum composite likelihood estimator is claimed in several papers, although without detailed proof; see for example Lindsay (1988), Molenberghs and Verbeke (2005, Chapter 9) and Jin (2009). Asymptotic results on misspecified full likelihood functions, as in White (1982), cannot be applied to the case of composite likelihood directly, since the composite likelihood function will not usually be a genuine likelihood
function, as mentioned in Section 1.1. In Section 2.1 we adapt Wald’s classical approach
(Wald, 1949) to establish the result that the MCLE converges almost surely to θ∗ defined in
(2.2), taking model misspecification into account. The regularity conditions are analogous
to those given in Wald’s proof, but applied to each component density function fk(ysk; θ)
without explicit assumptions on the full model.
2.1 The limit of the MCLE in a general setting
For analytical simplicity we only treat the composite marginal likelihood functions (2.1)
with equal weights; however, the results obtained here should extend readily to more general situations.
2.1.1 Introduction and assumptions
Following Wald (1949) we introduce some notation for the needed assumptions. For any θ and for ρ, r > 0, let f(y; θ, ρ) = sup{f(y; θ′) : ‖θ′ − θ‖ ≤ ρ}, where ‖ · ‖ denotes the Euclidean norm; ϕ(y, r) = sup{f(y; θ) : ‖θ‖ > r}; f∗(y; θ, ρ) = max{f(y; θ, ρ), 1}; ϕ∗(y, r) = max{ϕ(y, r), 1}.
For each k ∈ {1, 2, ..., K}, we make the following assumptions, analogous to Assumptions 1 to 8 in Wald (1949):
(A0): The parameter space Ω is a closed subset of q-dimensional Cartesian space.
(A1): fk(ysk; θ, ρ) is a measurable function of ysk for any θ and ρ.
(A2): The density function fk(ysk; θ) is distinct for different values of θ, i.e. if θ1 ≠ θ2, then µ[{ysk : fk(ysk; θ1) ≠ fk(ysk; θ2)}] > 0.
(A3): For sufficiently small ρ and sufficiently large r, the expected values ∫ log f∗k(ysk; θ, ρ) gk(ysk) dµ(ysk) and ∫ log ϕ∗k(ysk, r) gk(ysk) dµ(ysk) are finite.
(A4): For any θ ∈ Ω, there exists a set Bkθ such that ∫_{Bkθ} gk(ysk) dµ(ysk) = 0 and fk(ysk; θ′) → fk(ysk; θ) as θ′ → θ for all ysk in the complement of Bkθ.
(A5): The expectation of log gk(ysk) exists.
(A6): There exists a set Ak such that ∫_{Ak} gk(ysk) dµ(ysk) = 0 and lim‖θ‖→∞ fk(ysk; θ) = 0 for all ysk in the complement of Ak.
(A7): There exists a unique point θ∗ ∈ Ω which minimizes the composite Kullback-Leibler
divergence defined in (2.2).
2.1.2 The main theorem
Theorem 1. Assume that y^{(1)}, \ldots, y^{(n)} are independently and identically distributed with distribution function G(y). Under the regularity conditions (A0)-(A7), the maximum composite likelihood estimator θCL converges almost surely to θ∗ defined in (2.2).
Before we prove Theorem 1, we state the following lemmas. By the expected value
Eg(·), we shall mean the expected value determined under the true distribution G(y).
Lemma 1. For any θ ≠ θ∗, we have

E_g \sum_{k=1}^{K} \log f_k(y_{s_k}; \theta) < E_g \sum_{k=1}^{K} \log f_k(y_{s_k}; \theta^*) \leq E_g \sum_{k=1}^{K} \log g_k(y_{s_k}).

Lemma 2.

\lim_{\rho \to 0} E_g \sum_{k=1}^{K} \log f_k(y_{s_k}; \theta, \rho) = E_g \sum_{k=1}^{K} \log f_k(y_{s_k}; \theta).

Lemma 3.

\lim_{r \to \infty} E_g \sum_{k=1}^{K} \log \varphi_k(y_{s_k}, r) = -\infty.

The three lemmas follow immediately from Assumption (A7) and Lemmas 1, 2 and 3 in Wald (1949).
Proof: First we shall prove that

\Pr\left\{\lim_{n\to\infty} \sup_{\theta \in \omega} \frac{\prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta)}{\prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)} = 0\right\} = 1, \qquad (2.3)

for any closed subset ω of Ω that does not contain θ∗ defined in (A7).
From Lemma 3, for each i we can choose r_0 > 0 such that

E_g \sum_{k=1}^{K} \log \varphi_k(y^{(i)}_{s_k}, r_0) < E_g \sum_{k=1}^{K} \log f_k(y^{(i)}_{s_k}; \theta^*). \qquad (2.4)

Let ω_0 = \{\theta : \theta \in \omega, \|\theta\| \leq r_0\} \subseteq \omega. From Lemmas 1 and 2, for each θ ∈ ω_0 we can find a ρ_θ such that

E_g \sum_{k=1}^{K} \log f_k(y^{(i)}_{s_k}; \theta, \rho_\theta) < E_g \sum_{k=1}^{K} \log f_k(y^{(i)}_{s_k}; \theta^*). \qquad (2.5)
Since ω_0 is compact, by the finite-covering theorem there exists a finite number of points θ_1, ..., θ_h in ω_0 such that S(θ_1, ρ_{θ_1}) ∪ ... ∪ S(θ_h, ρ_{θ_h}) ⊇ ω_0, where S(θ, ρ) denotes the sphere with center θ and radius ρ. Clearly, we have

\sup_{\theta \in \omega} \prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta) \leq \sum_{l=1}^{h} \prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta_l, \rho_{\theta_l}) + \prod_{i=1}^{n} \prod_{k=1}^{K} \varphi_k(y^{(i)}_{s_k}, r_0).
Hence (2.3) is proved if we can show that

\Pr\left\{\lim_{n\to\infty} \frac{\prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta_l, \rho_{\theta_l})}{\prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)} = 0\right\} = 1, \quad (l = 1, \ldots, h),

and

\Pr\left\{\lim_{n\to\infty} \frac{\prod_{i=1}^{n} \prod_{k=1}^{K} \varphi_k(y^{(i)}_{s_k}, r_0)}{\prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)} = 0\right\} = 1.

Proving the above two equations is equivalent to showing that, for l = 1, \ldots, h,

\Pr\left[\lim_{n\to\infty} \sum_{i=1}^{n} \left\{\log \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta_l, \rho_{\theta_l}) - \log \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)\right\} = -\infty\right] = 1,

and

\Pr\left[\lim_{n\to\infty} \sum_{i=1}^{n} \left\{\log \prod_{k=1}^{K} \varphi_k(y^{(i)}_{s_k}, r_0) - \log \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)\right\} = -\infty\right] = 1.

These equations follow immediately from (2.4), (2.5) and the strong law of large numbers.
Let θ_n(y^{(1)}, \ldots, y^{(n)}) be any function of the observations y^{(1)}, \ldots, y^{(n)} such that

\frac{\prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta_n)}{\prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)} \geq c > 0 \quad \text{for all } n \text{ and all } y^{(1)}, \ldots, y^{(n)}. \qquad (2.6)

If we can show that

\Pr\left\{\lim_{n\to\infty} \theta_n = \theta^*\right\} = 1, \qquad (2.7)
the proof of Theorem 1 is completed, since the maximum composite likelihood estimator θCL satisfies (2.6). To prove (2.7) it is sufficient to show that, for any ε > 0, with probability one all limit points θ̄ of the sequence {θ_n} satisfy ‖θ̄ − θ∗‖ ≤ ε.
If there exists a limit point θ̄ such that ‖θ̄ − θ∗‖ > ε, we have

\sup_{\|\theta - \theta^*\| \geq \varepsilon}\; \prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta) \geq \prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta_n) \quad \text{for infinitely many } n.

Then

\frac{\sup_{\|\theta - \theta^*\| \geq \varepsilon} \prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta)}{\prod_{i=1}^{n} \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)} \geq c > 0 \quad \text{for infinitely many } n.

According to our previous result (2.3), this is an event with probability zero. Thus we have shown that with probability one all limit points θ̄ of the sequence {θ_n} satisfy ‖θ̄ − θ∗‖ ≤ ε, and equation (2.7) is obtained.
Since the ordinary likelihood function is a special case of composite likelihood, the consistency of the maximum likelihood estimator under a misspecified model (White, 1982, Theorem 2.2) follows immediately from Theorem 1.
Corollary 1. If the composite likelihood function (2.1) is correctly specified, under the
assumptions (A0)-(A6), the maximum composite likelihood estimator θCL converges to the
true parameter point θ0 almost surely.
2.2 Aspects of robustness for the MCLE
Model specifications under different mechanisms and their impact on the convergence of the resulting maximum likelihood estimators are illustrated schematically in Table 2.1. The first row illustrates the result that has been most studied: when the model and sub-models are correctly specified, the resulting MCLE and MLE are both consistent for the true parameter value, under some regularity conditions, and the MCLE will be less efficient than the MLE, although a number of examples indicate that the loss of efficiency can be quite small.
Table 2.1: Model Specification

Model               | Full Likelihood    | Composite Likelihood
Correctly specified | f(y; θ0) = g(y)    | fk(ysk; θ0) = gk(ysk) for all k
                    | θML → θ0           | θCL → θ0
Misspecified        | f(y; θ) ≠ g(y)     | fk(ysk; θ) ≠ gk(ysk) for some k
                    | θML → θ∗ML         | θCL → θ∗
The interesting case for robustness of consistency is when the components of the composite likelihood, such as lower dimensional marginal densities, are correctly specified, but the full likelihood is misspecified. In this case the MLE will not usually be consistent for the true parameter value. On the other hand, the MCLE, which is calculated from the composite likelihood function making use of the correctly specified lower dimensional margins only, still converges to the true parameter value without depending on the full model. However, the asymptotic variance of the MCLE may vary dramatically, depending on the true
underlying full model.
Finally, if both the composite and the full likelihood functions are misspecified, the MCLE and the MLE will converge not to the true parameter value, but to θ∗ and θ∗ML respectively.
2.3 Examples
In this section, we illustrate some of the points above with some simple examples constructed to highlight aspects of robustness.
2.3.1 Estimation of association parameters
This example is due to Andrei and Kendziorski (2009). Suppose Y1 ∼ N(µ1, σ1²), Y2 ∼ N(µ2, σ2²) and ε ∼ N(0, 1) are independent random variables. Let Y3 = Y1 + Y2 + bY1Y2 + ε, b ≠ 0. We can show that all full conditional distributions, i.e. f(y1|y2, y3), f(y2|y1, y3) and f(y3|y1, y2), are normal, but the joint distribution is not multivariate normal due to the nonzero interaction term bY1Y2.
Given a random sample (y_1^{(i)}, y_2^{(i)}, y_3^{(i)}), i = 1, \ldots, n, if we misspecify the joint model as multivariate normal, b will be forced to be 0 directly; if we use the full conditional distribution f(y3|y1, y2), the MCLE of b is

b_{CL} = \frac{\sum_{i=1}^{n} y_1^{(i)} y_2^{(i)} \big(y_3^{(i)} - y_1^{(i)} - y_2^{(i)}\big)}{\sum_{i=1}^{n} \big(y_1^{(i)} y_2^{(i)}\big)^2},

which is consistent for b. We can also use f(y1|y2, y3) or f(y2|y1, y3), but the resulting MCLE cannot be expressed in a closed form and numerical methods are needed.
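A quick simulation confirms the consistency of this closed-form MCLE; all numerical settings below are illustrative choices, not values from the thesis.

```python
import numpy as np

# Simulate from the true model of this example and compute the MCLE of b
# based on the full conditional density f(y3 | y1, y2).
rng = np.random.default_rng(1)
n, b = 100_000, 0.7                       # illustrative choices
y1 = rng.normal(0.5, 1.0, n)              # Y1 ~ N(mu1, sigma1^2)
y2 = rng.normal(-0.3, 1.5, n)             # Y2 ~ N(mu2, sigma2^2)
y3 = y1 + y2 + b*y1*y2 + rng.normal(size=n)

# closed-form MCLE of b from the conditional score equation
b_cl = np.sum(y1*y2*(y3 - y1 - y2)) / np.sum((y1*y2)**2)
```

The estimate lands within Monte Carlo error of the true b, whereas fitting a misspecified joint multivariate normal would force the interaction term to zero.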
2.3.2 Estimation of the correlation
The random vector (Y1, Y2, Y3, Y4)T follows a multivariate normal distribution with mean vector (0, 0, 0, 0)T and covariance matrix

\Sigma = \begin{pmatrix} 1 & \rho_0 & 2\rho_0 & 2\rho_0 \\ \rho_0 & 1 & 2\rho_0 & 2\rho_0 \\ 2\rho_0 & 2\rho_0 & 1 & \rho_0 \\ 2\rho_0 & 2\rho_0 & \rho_0 & 1 \end{pmatrix}.
If we model the joint distribution of (Y1, Y2, Y3, Y4)T as multivariate normal with zero mean vector and all correlations equal, the covariance matrix is then misspecified and the resulting MLE will not be consistent for ρ0. On the other hand, if we only use the correct information about the pairs (Y1, Y2) and (Y3, Y4), and construct the pairwise likelihood
CL(ρ; y) = f12(y1, y2; ρ)f34(y3, y4; ρ), (2.8)
where both f12 and f34 are the density functions for a bivariate normal with mean vector (0, 0)T and covariance matrix

\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},

then by Corollary 1, the resulting MCLE is consistent for ρ0.
It is of interest to note that the parameter constraint needed to ensure that the covariance matrix is nonnegative definite in the correct full likelihood is −1/5 ≤ ρ ≤ 1/3, whereas in the composite likelihood (2.8) the parameter constraint is −1 ≤ ρ ≤ 1. The composite likelihood (2.8) can also be thought of as the likelihood function for a multivariate normal distribution with a block diagonal covariance matrix, which is obviously different from the true full model.
From this example we can see that even if different parameter constraints are imposed
or the composite likelihood is compatible with a different full model, the MCLE will be
consistent as long as all of the component likelihoods are correctly specified.
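This can be checked numerically. The following sketch, with illustrative settings and a simple grid search in place of a formal optimizer, simulates from the true covariance matrix with ρ0 = 0.2 (inside the valid range −1/5 ≤ ρ0 ≤ 1/3) and maximizes the pairwise likelihood (2.8) built from the two correctly specified pairs.

```python
import numpy as np

# Simulate from the true 4-dimensional model and maximize CL(rho; y) in (2.8),
# which uses only the correctly specified pairs (Y1, Y2) and (Y3, Y4).
rng = np.random.default_rng(2)
rho0, n = 0.2, 50_000                     # illustrative choices
S = np.array([[1, rho0, 2*rho0, 2*rho0],
              [rho0, 1, 2*rho0, 2*rho0],
              [2*rho0, 2*rho0, 1, rho0],
              [2*rho0, 2*rho0, rho0, 1]])
y = rng.multivariate_normal(np.zeros(4), S, size=n)

def pair_loglik(rho, a, b):
    # standard bivariate normal log-likelihood in rho (additive constants dropped)
    q = (a**2 - 2.0*rho*a*b + b**2) / (1.0 - rho**2)
    return np.sum(-0.5*np.log(1.0 - rho**2) - 0.5*q)

grid = np.linspace(-0.95, 0.95, 381)      # grid search over the correlation
cl = [pair_loglik(r, y[:, 0], y[:, 1]) + pair_loglik(r, y[:, 2], y[:, 3]) for r in grid]
rho_cl = grid[int(np.argmax(cl))]
```

The maximizer recovers ρ0 to within Monte Carlo and grid error, while the misspecified equal-correlation full model would not.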
2.3.3 No compatible joint density exists
Suppose the true model for the random vector (Y1, Y2, Y3) is multivariate normal with mean vector µ0(1, 1, 1)T, µ0 > 0, and covariance matrix equal to the identity matrix. Now consider the following pairwise likelihood function
CL(µ; y) = f12(y1, y2;µ)f13(y1, y3;µ)f23(y2, y3;µ), (2.9)
where both f12 and f23 are the density functions for a bivariate normal distribution with unknown mean vector µ(1, 1)T and covariance matrix equal to the 2 × 2 identity matrix. However, f13(y1, y3; µ) is misspecified as

f_{13}(y_1, y_3; \mu) = \frac{1}{\mu} \exp\!\left(-\frac{y_1}{\mu}\right) \frac{1}{\sqrt{2\pi}} \exp\!\left\{-\frac{(y_3 - \mu)^2}{2}\right\}.
It is easy to see that no compatible joint density exists for the composite likelihood function
(2.9) since from f12 and f13 we will get different marginal densities of Y1.
Given a random sample (y_1^{(i)}, y_2^{(i)}, y_3^{(i)}), i = 1, \ldots, n, the MCLE of µ based on the composite likelihood function (2.9), µCL, can be obtained by solving the score equation

5n\mu^3 - S_n\mu^2 + n\mu - S_{1n} = 0, \qquad (2.10)

where S_n = \sum_{i=1}^{n} \big(y_1^{(i)} + 2y_2^{(i)} + 2y_3^{(i)}\big) and S_{1n} = \sum_{i=1}^{n} y_1^{(i)}.
As n → ∞, a direct argument using the consistency of sample means for the population mean shows that the unique real root of (2.10) converges to µ0. The asymptotic variance of µCL can be calculated using the Godambe information function G(θ), and the ratio of the asymptotic variance of the MLE of µ based on the true multivariate normal model, µML, to that of µCL is

r = \frac{\{5 + (1/\mu^2)\}^2}{3\,[\,8 + \{1 + (1/\mu^2)\}^2\,]}.

It is easy to check that r ≤ 1, with equality only for µ = 1.
From this artificial example we can see that although no compatible joint density exists, the limit of the MCLE may still be meaningful, and even consistent for the true parameter value. In general the MCLE converges to θ*, the minimizer of the composite Kullback–Leibler divergence, whether the specified sub-models are compatible or not. If the specified sub-models are close to the corresponding true sub-models, we can expect θ* to be close to the true parameter value even if those specified sub-models are incompatible.
2.3.4 A class of distributions with normal margins
This example is suggested by Arnold (2010). Suppose the density function of the random vector Y = (Y1, Y2, . . . , Yp) is

f(y) = φ(p)(y;µ,Σ) + g(µ,Σ) (∏_{i=1}^p yi) IA(y), (2.11)

where φ(p)(y;µ,Σ) is the density function of a p-dimensional multivariate normal with mean vector µ and covariance matrix Σ, g(·) is a function of the parameters chosen to guarantee that f(y) ≥ 0, A = {y : −t ≤ yi ≤ t, i = 1, 2, . . . , p}, t is a threshold parameter, and IA(y) = 1 if y ∈ A and 0 otherwise. It is easy to show that all k(< p)-dimensional sub-vectors of y follow k-dimensional multivariate normal distributions with the corresponding mean vectors and covariance matrices. When t = 0, f(y) becomes φ(p)(y;µ,Σ). This
example also provides a general approach to constructing a model with the same margins as a pre-specified density. In model (2.11), depending on the complexity of the function g(·), the calculation of the MLE may be very difficult. In the simulation study, we let t = 1, µ = 0 and Σ = (1 − ρ)Ip + ρJp, where Ip is the p × p identity matrix, Jp is a p × p matrix with all entries equal to 1, and ρ is the common correlation coefficient for p ≥ 3. Since A ⊆ {y : yTy ≤ p}, we can choose the function g as

g(µ,Σ) = inf_{yTy ≤ p} φ(p)(y;µ,Σ) ≤ inf_{y∈A} φ(p)(y;µ,Σ).
To calculate g(µ,Σ), we use the fact that

sup_{yTy ≤ p} yTΣ⁻¹y = pλp,

where λp is the largest eigenvalue of Σ⁻¹. For Σ = (1 − ρ)Ip + ρJp, we can show that λp = 1/(1 − ρ) if 0 ≤ ρ < 1, and λp = 1/{1 + (p − 1)ρ} if 1/(1 − p) < ρ ≤ 0.
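This eigenvalue formula is easy to confirm numerically (an illustrative sketch of our own; p = 3 and the two values of ρ are arbitrary choices):

```python
import numpy as np

def lambda_p(p, rho):
    """Largest eigenvalue of the inverse of Sigma = (1 - rho) I_p + rho J_p."""
    Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    return np.linalg.eigvalsh(np.linalg.inv(Sigma)).max()

p = 3
# For rho >= 0 the formula gives 1/(1 - rho); for rho <= 0 it gives 1/{1 + (p-1) rho}.
assert np.isclose(lambda_p(p, 0.4), 1 / (1 - 0.4))
assert np.isclose(lambda_p(p, -0.3), 1 / (1 + (p - 1) * (-0.3)))
```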
We begin with p = 3 and consider three different estimators of ρ: the MLE ρ̂; the MCLE ρ̂pl, obtained by maximizing the pairwise likelihood function (1.5) with equal weights; and the unbiased estimator based on the method of moments,

ρ̃ = 2S2/{np(p − 1)}, where S2 = ∑_{i=1}^n ∑_{s>r} yr(i) ys(i).
The last two estimators are free of the function g(·) and are more computationally conve-
nient than the MLE.
The rejection sampling method is used to generate n sample points from the joint distribution (2.11), using the fact that

f(y) ≤ φ(p)(y;µ,Σ){1 + IA(y) ∏_{i=1}^p yi} ≤ 2φ(p)(y;µ,Σ).
We used numerical methods to calculate ρ̂ and ρ̂pl by solving the relevant score equations, and calculated simulation means and variances of the three estimators ρ̂, ρ̂pl and ρ̃. In Table 2.2, Sρ̂, Sρ̂pl and Sρ̃ denote the simulation variances of the three estimators, respectively. The ratios Sρ̂/Sρ̂pl and Sρ̂pl/Sρ̃ are used to compare the efficiencies of the three estimators.

The results for sample size n = 100, simulation size M = 10000, threshold t = 1 and dimension p = 3 are presented in Table 2.2. All three methods produce accurate estimates. With the exception of ρ = −0.49, var(ρ̂pl) is very close to var(ρ̂), and var(ρ̂pl) is smaller than var(ρ̃) for every value of ρ except ρ = 0. We also performed the simulation for t = 2, 4, 8, and observed the same phenomenon. Figure 2.1 illustrates the relative efficiency of ρ̂pl compared with the MLE ρ̂, with increasing p.
Table 2.2: Performance of ρ̂, ρ̂pl and ρ̃ when n = 100, M = 10000, p = 3 and t = 1.

true value of ρ        -0.49    -0.25    0        0.25     0.5      0.75     0.99
sim. mean of ρ̂pl      -0.4924  -0.2512  -0.0012  0.2515   0.4986   0.7487   0.9899
sim. mean of ρ̂        -0.4900  -0.2481  0.0019   0.2489   0.4983   0.7502   0.9900
sim. mean of ρ̃        -0.4908  -0.2479  -0.0015  0.2521   0.4998   0.7511   0.9874
sim. variance of ρ̂pl  0.0008   0.0013   0.0036   0.0037   0.0024   0.0006   10⁻⁶
sim. variance of ρ̂    10⁻⁶     0.0012   0.0036   0.0036   0.0023   0.0059   10⁻⁶
sim. variance of ρ̃    0.0025   0.0023   0.0035   0.0057   0.0092   0.0155   0.0215
Sρ̂/Sρ̂pl              0.0025   0.9231   1.0000   0.9730   0.9583   0.9833   1.0000
Sρ̂pl/Sρ̃              0.3334   0.5614   1.0252   0.6521   0.2599   0.0402   10⁻⁵
Figure 2.1: The ratio of the simulated variances, Sρ̂/Sρ̂pl, as a function of ρ. The lines shown are for p = 3, 6, 8 (descending).
2.4 Concluding Remarks
This chapter sets out some issues in the study of robustness of composite likelihood infer-
ence; specifically emphasizing robustness of consistency. Robustness in inference usually
means obtaining the same inferential result under a range of models. In point estimation the
range of models is often considered to be small-probability perturbations of the assumed
model, to reflect the sampling notion of occasional outliers.
In the framework of composite likelihood inference, the range of models is, loosely speaking, all models consistent with the specified set of sub-models {fk(ysk; θ)}. For example, if pairwise likelihood is used, the range of models is those consistent with the assumed
bivariate distributions. In many, or even most, applications of composite likelihood, it is
not immediately clear what that range of models looks like, and indeed whether there is
even a single model compatible with the assumed sub-models.
The Wald assumptions set out in Section 2.1.1 are sufficient to ensure consistency of
the MCLE, although they may be stronger than necessary. The most restrictive of these
assumptions is (A7): that there exists a unique point θ∗ ∈ Ω that minimizes the Kullback-
Leibler divergence (2.2). For each component likelihood the assumption that there is a
unique θ∗k ∈ Ωk would be more closely analogous to the usual Wald assumption for the
MLE.
However, even in cases where neither the MLE nor the MCLE is consistent, the MCLE might still be more reliable than the MLE, since mis-specifying a high-dimensional complex joint density may be much more likely than mis-specifying some simpler lower-dimensional densities.
The MCLE also has a type of robustness of efficiency: in computing the asymptotic
variance the composite likelihood is always treated as a “misspecified” model even if all
component likelihoods are correctly specified. On the other hand, the inverse of the Fisher
information matrix I(θ), which is used as the asymptotic variance of the MLE, is sensitive
to model misspecification.
Composite likelihood also has a type of computational robustness, discussed in Varin
et al. (2011); there is some evidence from applied work that the composite likelihood sur-
face is smoother, and hence easier to maximize, than the full likelihood surface.
There is also some evidence that composite likelihood inference is robust to missing
data, although there is still much work to be done in this area. Recent papers discussing
this include Yi et al. (2011), Molenberghs et al. (2011), and He and Yi (2011).
Chapter 3
On the efficiency of maximum composite likelihood estimators
Given two composite likelihood functions CL1(θ; y) and CL2(θ; y), CL2(θ; y) is said to
be more efficient than CL1(θ; y) if the Godambe information in CL2(θ; y) is greater than
the Godambe information in CL1(θ; y) in the sense of matrix inequality. It is known that
the full likelihood function is more efficient than any other composite likelihood function
(Godambe, 1960; Lindsay, 1988).
In practice, we usually want to select a composite likelihood that achieves a good balance between computational savings and statistical efficiency. We might expect that either increasing the number of component likelihoods, or increasing the dimension of the component likelihoods, would improve the efficiency, although extra computing time
may be needed. When θ is a scalar, basic results on efficiency of the product of two or more
composite likelihood functions are stated in Section 3.1; some examples are also presented to show that such strategies may impair the efficiency, even when the component composite
likelihoods are independent. In Section 3.2, the equicorrelated multivariate normal model
is used to illustrate that in the presence of nuisance parameters, the maximum composite
likelihood estimator of the parameter of interest can be less efficient when the nuisance
parameters are known. Some theoretical results on the multiparameter case are presented
in Section 3.3.
3.1 Efficiency of composite likelihood with more component likelihoods
3.1.1 General setting
Consider a composite likelihood function of the p-dimensional vector Y , CLc(θ; y), which
can be expressed as a product of two “smaller” composite likelihood functions CL1(θ; y)
and CL2(θ; y), i.e. CLc(θ; y) = CL1(θ; y) × CL2(θ; y). If both CL1(θ; y) and CL2(θ; y)
are true likelihood functions, with uncorrelated score functions, then the information in
CLc(θ; y) is equal to the sum of the information in CL1(θ; y) and in CL2(θ; y). In fact,
this information additivity also holds for the product of information-unbiased composite
likelihood functions with uncorrelated composite scores.
Denote the Godambe information in CL1(θ; y) and CL2(θ; y) by G1(θ) = H1²(θ)J1⁻¹(θ) and G2(θ) = H2²(θ)J2⁻¹(θ), respectively. A lower bound on the efficiency of the compound composite likelihood function CLc(θ; y) is given by the following theorem.
Theorem 2. If G1(θ) ≤ G2(θ), then CLc(θ; y) is at least as efficient as CL1(θ; y).
Proof: The Godambe information in CLc(θ; y) can be calculated directly as

Gc(θ) = (1 + γ)² / (1 + λ + 2ρ√λ) · G1(θ), (3.1)

where γ = H2(θ)/H1(θ), λ = J2(θ)/J1(θ) and ρ = corr{u1(θ; y), u2(θ; y)}. When γ ≥ √λ, i.e. G2(θ) ≥ G1(θ), it is easy to check that (1 + γ)²/(1 + λ + 2ρ√λ) ≥ 1 for any ρ ∈ [−1, 1], and equality holds if and only if γ = √λ and ρ = 1.
Corollary 2. If G1(θ) ≤ G2(θ), the weighted compound composite likelihood function CLwc(θ; y) = CL1^{w1}(θ; y) × CL2^{w2}(θ; y) is at least as efficient as CL1(θ; y) for any w1 > 0 and w2 > 0.
Corollary 3. If the composite likelihood functions CL1(θ), CL2(θ), . . . , CLm(θ) have the same Godambe information G(θ), the information in the compound composite likelihood function CLc(θ) = ∏_{i=1}^m CLi(θ) is greater than or equal to G(θ).
Corollary 2 follows from Theorem 2 immediately by noting that CL1^{w1}(θ; y) has the same information as CL1(θ; y), and CL2^{w2}(θ; y) has the same information as CL2(θ; y), for any positive w1 and w2. For information-unbiased composite likelihood functions, a version of Corollary 3 is given in Lindsay (1988, Lemma 4C).
In equation (3.1), we also note that (1 + γ)²/(1 + λ + 2ρ√λ) can be smaller than 1 when γ < √λ, i.e. the Godambe information decreases as more component likelihoods are added, which will be illustrated through the following example.
Example 3.1.1 (Correlated regression). The model for the random vector y = (y1, y2) is

y1(i) = αti + εi,   εi ∼ N(0, σ1²), (3.2)
y2(i) = αti + ε′i,  ε′i ∼ N(0, σ2²), (3.3)

with cov(εi, ε′j) = ρσ1σ2 when i = j, and cov(εi, ε′j) = 0 when i ≠ j (i, j = 1, . . . , n). Assume σ1 and σ2 are known, and α is the unknown parameter of interest.
Suppose we ignore the bivariate distribution of (y1, y2), and use either model (3.2) or model (3.3) separately to estimate α, without knowing the value of ρ. The maximum likelihood estimators based on (3.2) and (3.3) are given by α̂1 = ∑_{i=1}^n ti y1(i) / ∑_{i=1}^n ti² and α̂2 = ∑_{i=1}^n ti y2(i) / ∑_{i=1}^n ti², respectively. We also consider the weighted independence likelihood ignoring the correlation between y1 and y2:

CLc(α; y1, y2) = ∏_{i=1}^n f(y1(i); α)^{w1} f(y2(i); α)^{w2},
where the weights w1 and w2 are some positive constants. The corresponding maximum composite likelihood estimator is α̂CL = w1* α̂1 + w2* α̂2, with w1* = w1σ2²/(w1σ2² + w2σ1²) and w2* = w2σ1²/(w1σ2² + w2σ1²). The variance of α̂CL is

var(α̂CL) = 1/(1 + γ)² · σ1²/S + γ²/(1 + γ)² · σ2²/S + 2γ/(1 + γ)² · ρσ1σ2/S,

where γ = w2σ1²/(w1σ2²) and S = ∑_{i=1}^n ti².
Whether var(α̂CL) ≤ var(α̂1) = σ1²/S depends on the values of ρ, σ2², σ1² and the ratio w2/w1. When σ2² ≤ σ1², we have var(α̂CL) ≤ var(α̂1) for all w1, w2 and ρ ∈ [−1, 1], which agrees with Corollary 2. On the other hand, when σ2²/σ1² = 4 and w1 = w2, var(α̂CL) < var(α̂1) for ρ < 5/16, and var(α̂CL) > var(α̂1) for ρ > 5/16.
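A quick numerical check of this crossover (our own sketch; σ1² = 1, σ2² = 4, w1 = w2 and S = 1 are illustrative choices):

```python
# Variance of the weighted MCLE in Example 3.1.1, with gamma = w2*s1^2/(w1*s2^2).
def var_cl(rho, s1_sq=1.0, s2_sq=4.0, w1=1.0, w2=1.0, S=1.0):
    g = w2 * s1_sq / (w1 * s2_sq)
    s1, s2 = s1_sq ** 0.5, s2_sq ** 0.5
    return (s1_sq + g ** 2 * s2_sq + 2 * g * rho * s1 * s2) / ((1 + g) ** 2 * S)

var_1 = 1.0  # var(alpha_hat_1) = s1^2 / S

# Adding the second (correlated) likelihood helps for rho < 5/16 and hurts for rho > 5/16.
assert var_cl(0.2) < var_1
assert var_cl(0.4) > var_1
assert abs(var_cl(5 / 16) - var_1) < 1e-12
```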
3.1.2 Product of information-unbiased composite likelihoods
Under the usual regularity conditions, the likelihood function is information-unbiased.
More generally, any composite likelihood function constructed by multiplying likelihood
functions with mutually uncorrelated scores is also information-unbiased. An example is
the partial likelihood of Cox (1975). Given a p-dimensional vector Y = (Y1, . . . , Yp)T with density function f(y1, . . . , yp; θ), and defining f(y1 | y0; θ) = f(y1; θ), it is easy to show that the covariance between the score function of f(yi | y1, . . . , yi−1; θ) and the score function of f(yj | y1, . . . , yj−1; θ) is zero for any i ≠ j ∈ {1, . . . , p}. Without loss of generality assume j > i; from equation (1.11), the covariance between the two score functions is

cov{∇θ log f(yi | y1, . . . , yi−1; θ), ∇θ log f(y1, . . . , yj−1, yj; θ)} − cov{∇θ log f(yi | y1, . . . , yi−1; θ), ∇θ log f(y1, . . . , yj−1; θ)} = Hi(θ) − Hi(θ) = 0,

where Hi(θ) is the sensitivity matrix of f(yi | y1, . . . , yi−1; θ). Thus, any composite likelihood function of the type CL(y; θ) = ∏_{i∈A} f(yi | y1, . . . , yi−1; θ), where A ⊆ {1, . . . , p}, is information-unbiased.
From Section 3.1.1, we have that information additivity holds for uncorrelated
information-unbiased composite likelihood functions. On the other hand, the product of
correlated information-unbiased composite likelihood functions with the same information is at least as efficient as any of its components (Lindsay, 1988, Lemma 4C). In the latter
case, a stronger improvement of efficiency can be achieved with an additional assumption
on the covariance between any two composite likelihood functions.
Lemma 4. Assume CL1(θ), CL2(θ), . . . , CLm(θ) are information-unbiased, each with the same Godambe information G(θ). If ∑_{j≠i} cov{∇θ log CLi(θ), ∇θ log CLj(θ)} is equal to a constant, R(θ), for any i ∈ {1, . . . ,m}, then the compound composite likelihood function CLc^m(θ) = ∏_{i=1}^m CLi(θ) is at least as efficient as CLc^{m−1}(θ) = ∏_{i=1}^{m−1} CLi(θ) for any m ≥ 2.
The proof of Lemma 4 is obtained by showing that the Godambe information for CLc^m(θ) is m²G(θ){mG(θ) + mR(θ)}⁻¹G(θ) = mG(θ){G(θ) + R(θ)}⁻¹G(θ), which is an increasing function of m. This proof also holds for the multiparameter case. The assumptions in Lemma 4 apply in a variety of applications of composite likelihood inference, including pairwise and conditional pairwise likelihood in the equicorrelated multivariate distribution (Cox and Reid, 2004; Mardia et al., 2009), pairwise likelihood in a Mantel–Haenszel procedure (Lindsay, 1988), and a version of composite conditional likelihood in Markov chain models (Hjort and Varin, 2007) and in lattice data ignoring boundary effects (Besag, 1975).
3.1.3 Product of uncorrelated composite likelihoods
Since uncorrelated information-unbiased composite likelihood functions satisfy informa-
tion additivity, in this subsection we focus only on the product of uncorrelated information-
biased composite likelihood functions. Denote by Ys1 and Ys2 two sub-vectors of the p-dimensional random vector Y. We consider a composite likelihood function for Ys1, CL(ys1; θ), with Godambe information G1(θ) = H1²(θ)J1⁻¹(θ), multiplied by the likelihood function of Ys2, f(ys2; θ), with Fisher information I2(θ).

Lemma 5. If the score functions for CL(ys1; θ) and f(ys2; θ) are uncorrelated, the compound composite likelihood function CLc(θ) = CL(ys1; θ) × f(ys2; θ) is more efficient than CL(ys1; θ) if and only if I2(θ) > G1(θ) − 2H1(θ).
Proof: Denote by Gc(θ) the Godambe information of the compound composite likelihood function CLc(θ). By definition,

Gc(θ) − G1(θ) = {I2(θ) + H1(θ)}²/{I2(θ) + J1(θ)} − H1²(θ)/J1(θ)
             = [{I2(θ) + H1(θ)}² J1(θ) − H1²(θ){I2(θ) + J1(θ)}] / [{J1(θ) + I2(θ)} J1(θ)]
             = I2(θ){I2(θ)J1(θ) + 2H1(θ)J1(θ) − H1²(θ)} / [{J1(θ) + I2(θ)} J1(θ)].

Since I2(θ) and J1(θ) are positive, Gc(θ) > G1(θ) if and only if I2(θ)J1(θ) > H1²(θ) − 2H1(θ)J1(θ). Noting that H1²(θ)/J1(θ) = G1(θ), Lemma 5 is therefore proved.
By Lemma 5, in order to improve the efficiency by incorporating an independent like-
lihood function, G1(θ) − 2H1(θ) should not be too large. When G1(θ) = H1(θ), i.e.
CL(ys1; θ) is information-unbiased, we always have I2(θ) > G1(θ)− 2H1(θ). As an illus-
tration of Lemma 5, we consider the following example.
Example 3.1.2 (Product of independent normal models). The random vector (Y1, Y2, Y3) follows a normal distribution with mean vector µ(1, 1, 1)T and covariance matrix

Σ = ( 1   ρ   0
      ρ   1   0
      0   0   σ² ).

Assume σ² is known, µ and ρ are unknown, and µ is the only parameter of interest. Suppose we do not know the bivariate distribution of Y1 and Y2, and use the independence likelihood function CL12(µ) = f(y1;µ) × f(y2;µ), which is free of the nuisance parameter ρ, to estimate µ. To incorporate the information contained in the independent variable Y3, we also consider the composite likelihood function CL123(µ) = CL12(µ) × f(y3;µ).
Given a random sample of size n from the model, the Fisher information in f(y3;µ) is n/σ², and the Godambe information in CL12(µ) is G12(µ) = H12²(µ)/J12(µ), with H12(µ) = 2n and J12(µ) = 2n(1 + ρ). By Lemma 5, CL123(µ) is more efficient than CL12(µ) if and only if n/σ² > 2n/(1 + ρ) − 4n. When σ² = 2, this inequality becomes ρ > −5/9. So if ρ ∈ [−1,−5/9], CL123(µ) is less efficient than CL12(µ).
We can also compare the variances of the two maximum composite likelihood estimators directly. The maximum composite likelihood estimator is µ̂12 = (ȳ1 + ȳ2)/2 for CL12(µ), and µ̂123 = σ²(ȳ1 + ȳ2)/(1 + 2σ²) + ȳ3/(1 + 2σ²) for CL123(µ). When σ² = 2, the variance of µ̂12 is (1 + ρ)/(2n) and the variance of µ̂123 is (10 + 8ρ)/(25n). It is easy to show that (10 + 8ρ)/(25n) is smaller than (1 + ρ)/(2n) if and only if ρ > −5/9.
Note that if ρ = −1 this result is expected, as (Y1, Y2) determines µ exactly with µ ≡ (Y1 + Y2)/2; but the dependence on σ² of the range of ρ over which Y3 degrades the inference is surprising: as σ² increases this range approaches [−1,−1/2).
3.1.4 Pairwise likelihood and independence likelihood
Intuitively, a composite likelihood with higher dimensional component likelihoods should achieve a higher efficiency, although it usually demands more computational cost. In this subsection we focus on comparing the independence likelihood CLind(θ; y) = ∏_{r=1}^p f(yr; θ) and the pairwise likelihood CLpair(θ; y) = ∏_{r=1}^{p−1} ∏_{s=r+1}^p f(yr, ys; θ). Under independence, CLind(θ; y) is identical to the full likelihood, and CLpair(θ; y) = CLind(θ; y)^{p−1}, which is also fully efficient. For a multivariate normal model with continuous responses, Zhao and Joe (2005) proved that the maximum pairwise likelihood estimator of the regression parameter has a smaller asymptotic variance than the maximum
independence likelihood estimator. On the other hand, pairwise likelihood can be expressed
as a product of the independence likelihood and the pairwise conditional likelihood, i.e.
CLpair(θ; y)² = CLind(θ; y)^{p−1} ∏_{r=1}^p ∏_{s≠r} f(yr | ys; θ).
If we can show that the pairwise conditional likelihood dominates the independence likeli-
hood in terms of efficiency, then CLpair(θ; y) is more efficient than CLind(θ; y) by Corol-
lary 2. Arnold and Strauss (1991) showed that this may not be true in a bivariate binary
model. However, in their example, the pairwise likelihood is identical to the full likeli-
hood and hence still more efficient than the independence likelihood. In this subsection we generalize Arnold and Strauss's example to a four-dimensional binary model, which allows us to compare the (asymptotic) variances of different composite likelihood estimators analytically, and observe the reverse relationship.
Example 3.1.3 (A partial multivariate binary model). Suppose (Y1, Y2, Y3, Y4) follows a Multinomial(1; θ, θ, θ/k, 1 − 2θ − θ/k) distribution, where k is a positive constant and 0 ≤ θ ≤ k/(2k + 1). The parameter θ controls both the mean and covariance structures, and we can change the value of k to adjust the strength of dependence. Given a random sample of size n from this model, we estimate θ based only on the partial observations (y1(i), y2(i), y3(i)), i = 1, . . . , n.
The full likelihood for the model of (Y1, Y2, Y3) is

L(θ) = ∏_{i=1}^n θ^{y1(i)+y2(i)} (θ/k)^{y3(i)} (1 − 2θ − θ/k)^{1−y1(i)−y2(i)−y3(i)}. (3.4)

Solving the score equation we get the maximum likelihood estimator of θ, θ̂ = (ȳ1 + ȳ2 + ȳ3)/(2 + 1/k). The exact variance of θ̂ is

var{(ȳ1 + ȳ2 + ȳ3)/(2 + 1/k)} = (1/n){θ/(2 + 1/k) − θ²}.
The independence likelihood function for the model of (Y1, Y2, Y3) is

CLind(θ) = ∏_{i=1}^n f(y1(i); θ) f(y2(i); θ) f(y3(i); θ)
         = ∏_{i=1}^n θ^{y1(i)} (1 − θ)^{1−y1(i)} θ^{y2(i)} (1 − θ)^{1−y2(i)} (θ/k)^{y3(i)} (1 − θ/k)^{1−y3(i)}, (3.5)

and we can calculate its sensitivity matrix and variability matrix as

Hind(θ) = 2/θ + 2/(1 − θ) + 1/(kθ) + 1/{k(k − θ)},

Jind(θ) = 2/{θ(1 − θ)} + 1/{θ(k − θ)} − 2/(1 − θ)² − 4/{(1 − θ)(k − θ)}.
The pairwise likelihood function is

CLpair(θ) = ∏_{i=1}^n f(y1(i), y2(i); θ) f(y1(i), y3(i); θ) f(y2(i), y3(i); θ)
          = ∏_{i=1}^n θ^{2y1(i)+2y2(i)} (1 − 2θ)^{1−y1(i)−y2(i)} (θ/k)^{2y3(i)} (1 − θ − θ/k)^{2−y1(i)−y2(i)−2y3(i)}, (3.6)

and we can calculate its sensitivity matrix and variability matrix as

Hpair(θ) = 4/θ + 2/(kθ) + 4/(1 − 2θ) + 2(1 + 1/k)²/{1 − (1 + 1/k)θ},

Jpair(θ) = 2A²θ(1 − θ) + B²(θ/k)(1 − θ/k) − 2A²θ² − 4ABθ²/k,

where A = 2/θ + 2/(1 − 2θ) + (1 + 1/k)/(1 − θ − θ/k) and B = 2/θ + 2(1 + 1/k)/(1 − θ − θ/k).
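The asymptotic variances J/H² (times n) from these formulas can be compared numerically (our own sketch using k = 5; the test points θ = 0.1 and θ = 0.4 are arbitrary choices):

```python
def avars(theta, k=5.0):
    """Asymptotic variances (times n) of the MLE, independence and pairwise MCLEs."""
    v_full = theta / (2 + 1 / k) - theta ** 2
    H_ind = 2 / theta + 2 / (1 - theta) + 1 / (k * theta) + 1 / (k * (k - theta))
    J_ind = (2 / (theta * (1 - theta)) + 1 / (theta * (k - theta))
             - 2 / (1 - theta) ** 2 - 4 / ((1 - theta) * (k - theta)))
    A = 2 / theta + 2 / (1 - 2 * theta) + (1 + 1 / k) / (1 - theta - theta / k)
    B = 2 / theta + 2 * (1 + 1 / k) / (1 - theta - theta / k)
    H_pair = (4 / theta + 2 / (k * theta) + 4 / (1 - 2 * theta)
              + 2 * (1 + 1 / k) ** 2 / (1 - (1 + 1 / k) * theta))
    J_pair = (2 * A ** 2 * theta * (1 - theta) + B ** 2 * (theta / k) * (1 - theta / k)
              - 2 * A ** 2 * theta ** 2 - 4 * A * B * theta ** 2 / k)
    return v_full, J_ind / H_ind ** 2, J_pair / H_pair ** 2

# Near theta = 0.1 the three estimators are almost indistinguishable ...
v = avars(0.1)
assert max(v) / min(v) < 1.01
# ... while at theta = 0.4 the ordering full < independence < pairwise emerges.
v_full, v_ind, v_pair = avars(0.4)
assert v_full < v_ind < v_pair
```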
For k = 5, the asymptotic variances of the maximum composite likelihood estimators
for (3.4), (3.5) and (3.6) multiplied by n are plotted as a function of θ in Figure 3.1. We
can see that when θ < 0·3, the three estimators perform almost equally well; when θ >
0·3, the full likelihood becomes more efficient than the independence likelihood, and the
independence likelihood estimator is more efficient than the pairwise likelihood estimator.
We also carried out the comparisons for different values of k and found that: at k = 1, both
the independence likelihood and the pairwise likelihood are fully efficient; when k > 1, the
independence likelihood is more efficient than the pairwise likelihood and the difference
goes to zero when k → ∞; when k < 1, the pairwise likelihood is more efficient than the
independence likelihood and the difference goes to zero when k → 0.
This example suggests that in practical applications of composite likelihood inference, where the models will usually have more complex dependence structures and incomplete data (Yi et al., 2011), some care is required in the use of higher dimensional composite likelihoods to obtain more efficient estimators.
From the discussion above we know that a composite likelihood function with more com-
ponent likelihoods, or with higher dimensional component likelihoods, usually requires
more computing time but does not guarantee a more efficient estimator. The most direct
Figure 3.1: The asymptotic variances (multiplied by n) of the maximum composite like-lihood estimators for the full likelihood (solid line), the independence likelihood (dashedline) and the pairwise likelihood (dotted line).
way to avoid the decrease of efficiency is to adopt a weighting scheme: a careful choice of w1 and w2 will make CLwc(θ; y) more efficient than both CL1(θ; y) and CL2(θ; y), and much research has been done in this direction within the framework of unbiased estimating equations (Lindsay, 1988; Zhao and Joe, 2005; Joe and Lee, 2009).
We can also consider multiplying component likelihoods of different dimensions, for ex-
ample the Hoeffding scores as suggested in Lindsay et al. (2011) and the second-order
log-likelihoods in Cox and Reid (2004); or using some hybrid composite likelihood meth-
ods which employ two or more different composite likelihood functions simultaneously
to make inference, for example the hybrid pairwise likelihood method proposed in Kuk
(2007).
3 On the efficiency of maximum composite likelihood estimators 43
3.2 Efficiency of composite likelihood with known nuisance parameters
In the presence of nuisance parameters, it is well known that the asymptotic variance of
the maximum likelihood estimator will become smaller when the nuisance parameters are
replaced by their true values. Meanwhile, the reverse relationship has been noted for semi-
parametric inference using estimating functions (Henmi and Eguchi, 2004). We are inter-
ested to know whether such a phenomenon could occur for the estimators based on a com-
posite score function, which is a generalized score function as well as a special unbiased
estimating function. It is easy to check that this paradox will not occur for information-
unbiased composite likelihood functions. Suppose the q-dimensional parameter vector θ is partitioned as θ = (ψT, λT)T, where ψ is a q1-dimensional parameter vector of interest and λ is a q2-dimensional nuisance parameter vector, q = q1 + q2. The Godambe information matrix of an information-unbiased composite likelihood is G(θ) = H(θ) = J(θ), and

G(θ) = ( Gψψ  Gψλ
         Gλψ  Gλλ ),

where Gψψ is the q1 × q1 submatrix of G(θ) pertaining to ψ, and Gλλ the q2 × q2 submatrix of G(θ) pertaining to λ. When λ is unknown, the asymptotic variance of the MCLE of ψ is given by (Gψψ − GψλGλλ⁻¹Gλψ)⁻¹; when λ is known, the asymptotic variance of the MCLE of ψ can be shown to be Gψψ⁻¹. Since GψλGλλ⁻¹Gλψ is a nonnegative definite matrix, we have (Gψψ − GψλGλλ⁻¹Gλψ)⁻¹ ≥ Gψψ⁻¹. In this section we focus on the inference based on information-biased composite likelihood functions.
The example given by Henmi and Eguchi (2004) can be thought of as a hybrid composite likelihood approach to a missing data problem, where the nuisance parameters are estimated based on the marginal distribution of the indicator of missing status, and the parameters of interest are obtained by maximizing the weighted likelihood function of the response variable with weights depending on the nuisance parameters. In this section we will investigate the situation where only one composite likelihood CL(θ; y) is used and the estimators of the parameters are obtained by solving its composite score equation uc(θ; y) = ∇θ cℓ(θ; y) = 0.
3.2.1 Equicorrelated multivariate normal model
The equicorrelated multivariate normal model has been well studied to compare the effi-
ciency of pairwise likelihood and full likelihood in different settings (Arnold and Strauss,
1991; Cox and Reid, 2004; Mardia et al., 2009). As shown in Pace et al. (2011), the sensitivity matrix H(θ) of pairwise likelihood is not identical to its variability matrix J(θ) in this model. In this section we compare the pairwise likelihood with itself when the nuisance parameters are unknown and known.
Suppose y(1), . . . , y(n) are n independent observations from the p-dimensional multivariate normal distribution with zero mean and covariance matrix Σ = σ²{(1 − ρ)Ip + ρJp}, where Ip is the p × p identity matrix and Jp is a p × p matrix with all entries equal to 1. The common correlation coefficient ρ is the parameter of interest. When σ² is unknown, the maximum pairwise likelihood estimator of ρ, denoted by ρ̂pl, is the same as the maximum likelihood estimator of ρ; when σ² is known, the maximum pairwise likelihood estimator, denoted by ρ̃pl, is less efficient than the maximum likelihood estimator of ρ (Cox and Reid, 2004; Mardia et al., 2009).
The asymptotic variance of ρ̃pl is (Cox and Reid, 2004)

avar(ρ̃pl) = 2(1 − ρ)² c(p, ρ) / {np(p − 1)(1 + ρ²)²}, (3.7)

where c(p, ρ) = (1 − ρ)²(3ρ² + p²ρ² + 1) + pρ(−3ρ³ + 8ρ² − 3ρ + 2).

The asymptotic variance of ρ̂pl can be shown to be

avar(ρ̂pl) = 2(1 − ρ)²{1 + (p − 1)ρ}² / {np(p − 1)}. (3.8)

Comparing equations (3.7) and (3.8), we find that as ρ approaches its lower bound −1/(p − 1), avar(ρ̂pl) decreases to zero while avar(ρ̃pl) does not. The ratio of the asymptotic variances, avar(ρ̃pl)/avar(ρ̂pl), as a function of ρ is plotted in Figure 3.2 for p = 3.
Figure 3.2: The plot of the ratio r(ρ) = avar(ρ̃pl)/avar(ρ̂pl) at p = 3. The vertical and horizontal dashed lines denote ρ = 0 and r(ρ) = 1, respectively.
We can see that when ρ is positive, ρ̃pl is more efficient than ρ̂pl; when ρ < 0, the opposite phenomenon is observed, and when ρ approaches the lower bound −0·5, the ratio diverges to infinity. We performed the comparisons for different p and observed the same phenomenon.
3.2.2 Discussion
To see that information-biasedness is not a sufficient condition for the paradox to occur,
we consider another information-biased composite likelihood function, the full conditional
likelihood for the same model:

CLFC(θ; y) = ∏_{r=1}^p f(yr | y(−r); θ),
where y(−r) denotes the random vector excluding yr. When σ² is unknown, the maximum full conditional likelihood estimator of ρ, ρ̂cl, is identical to ρ̂pl and fully efficient (Mardia et al., 2009); when σ² is known, the maximum full conditional likelihood estimator, ρ̃cl, is less efficient than the maximum likelihood estimator for p ≥ 3. Using the formula in Mardia et al. (2007), the ratio of the asymptotic variances, avar(ρ̃cl)/avar(ρ̂cl), as a function of ρ is plotted in Figure 3.3 for p = 3. We can see that the ratio does not exceed 1 for all ρ ∈ [−1/(p − 1), 1].
Denote by σ̂²pl the maximum pairwise likelihood estimator of σ². As suggested in Henmi and Eguchi (2004, Proposition 1), a sufficient condition to observe the paradox in this example is that ρ̂pl and σ̂²pl are asymptotically independent, while ρ̃pl and σ̂²pl are not. It can be shown that the asymptotic covariance between ρ̂pl and σ̂²pl is 2ρ(1 − ρ){1 + (p − 1)ρ}σ²/(np), which goes to 0 as ρ approaches −1/(p − 1); ρ̃pl and σ̂²pl are not asymptotically independent when ρ = −1/(p − 1). This may explain why the paradox occurs when ρ is close to its lower bound −1/(p − 1).
One way to avoid the paradoxical phenomenon is to convert the composite score func-
tion uc(θ; y) to an unbiased estimating function by projecting (Henmi and Eguchi, 2004;
Figure 3.3: The plot of the ratio r(ρ) = avar(ρcl)/avar(ρcl) at p = 3. The horizontal dashedline denotes r(ρ) = 1.
Lindsay et al., 2011):
u*c(θ; y) = H(θ)J⁻¹(θ)uc(θ; y) = argmin_{ν = Auc(θ; y)} E{‖u(θ; y) − ν(θ; y)‖²}, (3.9)
where u(θ; y) is the score function of the full likelihood, A ranges over all q × q matrices, and H(θ) and J(θ) are the sensitivity and variability matrices. It is easy to check that u*c(θ; y) is information-unbiased. Since H(θ) and J(θ) are constant matrices, this projection does not change the point estimator of θ, and u*c(θ; y) has the same Godambe information as uc(θ; y). In the equicorrelated multivariate normal model with θ = (ρ, σ²), Kenne Pagui (2009) showed that the score function of the pairwise likelihood is upl(θ; y) = J(θ)H⁻¹(θ)u(θ; y).
From (3.9), the projected estimating function of upl(θ; y) is equal to the score function of the full likelihood, u(θ; y).
In complex models, the required computation for the projected estimating function u*c(θ; y) can be intractable; in practice it may be better to carefully design a composite likelihood that is free of nuisance parameters. As an example, a pairwise difference likelihood that eliminates the nuisance parameters in a Neyman–Scott problem is described in Hjort and Varin (2007).
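To make this concrete, here is a minimal sketch of the classical Neyman–Scott setup (our own construction, not code from Hjort and Varin (2007)): with two observations per block, differencing within blocks removes the nuisance means, so the full MLE of σ² converges to σ²/2 while the pairwise-difference estimator is consistent.

```python
import numpy as np

# Neyman-Scott: y_{i1}, y_{i2} ~ N(mu_i, sigma^2) with m nuisance means mu_i
rng = np.random.default_rng(1)
m, sigma2 = 20000, 1.0
mu = rng.normal(0.0, 1.0, size=m)                  # nuisance parameters
y = mu[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(m, 2))

# Full MLE: average squared deviation from each block mean (inconsistent)
sigma2_mle = np.mean((y - y.mean(axis=1, keepdims=True))**2)

# Pairwise difference likelihood: d_i = y_{i1} - y_{i2} ~ N(0, 2 sigma^2)
d = y[:, 0] - y[:, 1]
sigma2_pd = np.mean(d**2) / 2                      # consistent for sigma^2

print(sigma2_mle, sigma2_pd)                       # approx 0.5 and 1.0
```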
3.3 Theoretical results on multiparameter case
In Section 3.1 we focused on composite likelihood inference for a scalar parameter. In this section we consider the multiparameter version of Theorem 2, from which the corresponding Corollary 2 and Corollary 3 follow immediately.
We start by assuming H₁(θ) = H₂(θ). To show that the compound composite likelihood CLc(θ; y) is more efficient than CL₁(θ; y), in the sense that Gc(θ) − G₁(θ) is a nonnegative definite matrix, it is equivalent to show that G₁⁻¹(θ) ≥ Gc⁻¹(θ), i.e.

H₁⁻¹(θ)J₁(θ)H₁⁻¹(θ) ≥ {H₁(θ) + H₂(θ)}⁻¹ var{uc1(θ; y) + uc2(θ; y)} {H₁(θ) + H₂(θ)}⁻¹, (3.10)

where uci(θ; y) is the composite score function for CLi(θ; y), i = 1, 2.
Since H₁(θ) + H₂(θ) is positive definite, to show (3.10) it is equivalent to show that

var{uc1(θ; y) + uc2(θ; y)} ≤ {H₁(θ) + H₂(θ)} H₁⁻¹(θ)J₁(θ)H₁⁻¹(θ) {H₁(θ) + H₂(θ)}. (3.11)

Define Cov₁₂(θ; y) = cov{uc1(θ; y), uc2(θ; y)} and B₁₂(θ) = J₁(θ)H₁⁻¹(θ)H₂(θ). The left-hand side of (3.11) is equal to J₁(θ) + J₂(θ) + Cov₁₂(θ; y) + Cov₁₂ᵀ(θ; y), and the right-hand side is J₁(θ) + B₁₂(θ) + B₁₂ᵀ(θ) + H₂(θ)H₁⁻¹(θ)J₁(θ)H₁⁻¹(θ)H₂(θ).
Because G₂⁻¹(θ) = H₂⁻¹(θ)J₂(θ)H₂⁻¹(θ) ≤ H₁⁻¹(θ)J₁(θ)H₁⁻¹(θ) = G₁⁻¹(θ), we have J₂(θ) ≤ H₂(θ)H₁⁻¹(θ)J₁(θ)H₁⁻¹(θ)H₂(θ). Hence, to show (3.11) we only need to show

Cov₁₂(θ; y) + Cov₁₂ᵀ(θ; y) ≤ B₁₂(θ) + B₁₂ᵀ(θ). (3.12)

The assumption that H₁(θ) = H₂(θ) implies that B₁₂(θ) + B₁₂ᵀ(θ) = J₁(θ) + J₁(θ) ≥ J₁(θ) + J₂(θ) ≥ Cov₁₂(θ; y) + Cov₁₂ᵀ(θ; y).
When H₁(θ) ≠ H₂(θ), additional assumptions may be needed for Gc(θ) − G₁(θ) to be a nonnegative definite matrix. To see this, we consider a simple case where H₂(θ)J₂⁻¹(θ)H₂(θ) = H₁(θ)J₁⁻¹(θ)H₁(θ) and cov{uc1(θ; y), uc2(θ; y)} is a zero matrix. The inequality (3.11) then simplifies to

J₁(θ) + J₂(θ) ≤ J₁(θ) + J₂(θ) + B₁₂(θ) + B₁₂ᵀ(θ),

so B₁₂(θ) + B₁₂ᵀ(θ) is required to be a nonnegative definite matrix.
In general, we can define u*c2(θ; y) = H₁(θ)H₂⁻¹(θ)uc2(θ; y), which has sensitivity matrix H₁(θ) and Godambe information matrix G₂(θ) = H₂(θ)J₂⁻¹(θ)H₂(θ). Since var{u*c2(θ; y)} = H₁(θ)H₂⁻¹(θ)J₂(θ)H₂⁻¹(θ)H₁(θ) ≤ J₁(θ), we can show that

cov{uc1(θ; y), u*c2(θ; y)} + cov{u*c2(θ; y), uc1(θ; y)} ≤ var{uc1(θ; y)} + var{u*c2(θ; y)} ≤ J₁(θ) + J₁(θ).

Define J*₁(θ) = J₁(θ) − cov{u*c2(θ; y), uc1(θ; y)}. From the inequality (3.12), a sufficient condition for Gc(θ) ≥ G₁(θ) is that J*₁ᵀ(θ)H₁⁻¹(θ)H₂(θ) + H₂(θ)H₁⁻¹(θ)J*₁(θ) is a nonnegative definite matrix, which is true if H₁(θ) = H₂(θ), or if the composite likelihood functions CL₁(θ; y) and CL₂(θ; y) are information-unbiased, i.e. H₁(θ) = J₁(θ) and H₂(θ) = J₂(θ).
3.4 Concluding Remarks
An information-unbiased composite likelihood function can be thought of as a true likelihood function based on partial information about the full model, and its Godambe information plays a role similar to that of the Fisher information in full likelihood inference. However, it may be very inefficient. On the other hand, an information-biased composite likelihood not only uses partially correct information but also implicitly introduces extra incorrect information, leading to the undesirable paradoxes discussed in Sections 3.1 and 3.2.
In many applications (Varin et al., 2011), an optimally weighted estimating equation is constructed from the composite score function to achieve higher efficiency within the framework of unbiased estimating equations. However, such an optimal estimating equation is usually very expensive to compute, and is unlikely to be the score function of any composite likelihood function. One direction for future research is to develop specific theory for the construction of an optimally weighted composite likelihood function which achieves higher efficiency while retaining the features of a likelihood-type objective function, such as the Kullback–Leibler inequality (1.8).
Chapter 4

Prediction in computer experiments with composite likelihood
4.1 Computer experiments
Computer experiments have been used successfully for predicting weather and climate, modeling wildfire evolution, assessing the performance of integrated circuits, and in many other scientific and technological fields where physical experiments are impossible or too expensive and time-consuming to conduct (Santner et al., 2003; Fang et al., 2006). In a computer experiment, we usually run a computer code to solve a mathematical system which approximates some real physical process. We can vary the inputs to the
code and observe how the output is affected. Due to the complexity of the mathematical
system and the high dimensionality of the inputs, it may take a long time to obtain even
a single output. To address this problem, statistical models have been used as surrogates
for computer simulators. Different from physical experiments, computer experiments are
typically deterministic, i.e. running the code twice with identical input will produce the
same output. Hence the principles of randomization, blocking and replication do not work
for computer experiments, and our statistical modeling scheme for a computer experiment
should be able to capture this deterministic feature. It is also desirable for the models
to allow for some smoothness assumption about the response surface. In addition, the
predictions are expected to have zero uncertainty at the observed inputs, small uncertainty
close to the observed inputs and larger uncertainty further away.
4.1.1 Gaussian random function model
Modeling the computer outputs as a sample path of a Gaussian process is a popular sta-
tistical approach due to its flexibility for fitting a large class of response surfaces and its
convenience for analytic work (Sacks et al., 1989a,b; Welch et al., 1992). The uncertainty about the output at an untried input setting comes from the fact that there can be more than one random path passing through all of the observed points. Now consider the d-dimensional input vector x = (x1, x2, . . . , xd) and the scalar output Y. In a Gaussian random function
(GRF) model, the relationship between Y and x is modelled as
Y(x) = Σ_{j=1}^{k} βjφj(x) + Z(x) = φ(x)ᵀβ + Z(x), (4.1)

where φ(x) = (φ1(x), . . . , φk(x))ᵀ is a k × 1 vector of basis functions, β = (β1, . . . , βk)ᵀ is a vector of coefficients, and Z(·) is a mean-zero Gaussian process for x ∈ X ⊆ ℝᵈ. The
covariance function of Z(·) at two inputs x and x* is

cov(Z(x), Z(x*)) = σ²R(x, x*; θ), (4.2)

where σ² is the marginal variance of Z(·), R(·, ·) is the correlation function and θ is the parameter vector governing the correlation structure.
Following convention, we require that Z(·) is stationary, which implies that R(x, x∗; θ)
depends only on the difference, x − x∗. In spatial statistics Z(·) is often assumed to be
isotropic, i.e. the correlation function depends only on ||x − x∗||, the Euclidean distance
between x and x∗. However, in the context of computer experiments, anisotropic corre-
lation functions are commonly used because the input variables are usually measured on
different scales and impact the output in very different ways.
The correlation function R(x, x*; θ) is usually modelled as a product of correlations in each dimension of the input vector x, i.e.

R(x, x*; θ) = ∏_{i=1}^{d} Ri(|xi − xi*|; θi). (4.3)
Two widely used families of correlation functions are the power exponential correlation function and the Matérn correlation function (Matérn, 1986). In this thesis we only consider the power exponential correlation function

Ri(|xi − xi*|; θi) = exp{−γi|xi − xi*|^αi}, αi ∈ (0, 2], γi > 0, (4.4)

where θi = (γi, αi). The exponent αi can be interpreted as a smoothness parameter, which determines the differentiability of the sample paths: in the ith dimension they are infinitely differentiable when αi = 2 and nondifferentiable when αi < 2 (Santner et al., 2003, Chapter 2). The parameter γi can be interpreted as a dependence parameter: as γi increases, the range of dependence decreases. The Gaussian correlation function (αi = 2) and the Ornstein–Uhlenbeck correlation function (αi = 1) are two special cases in this family.
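A direct implementation of the product correlation (4.3) with power exponential components (4.4) is straightforward; the function name and vectorized layout below are our own choices, not from the thesis.

```python
import numpy as np

def power_exp_corr(X, Xstar, gamma, alpha):
    """R[i, j] = prod_k exp(-gamma_k |X[i, k] - Xstar[j, k]|**alpha_k)."""
    X, Xstar = np.atleast_2d(X), np.atleast_2d(Xstar)
    diff = np.abs(X[:, None, :] - Xstar[None, :, :])   # shape n x m x d
    return np.exp(-(gamma * diff**alpha).sum(axis=-1))

# small anisotropic example: Gaussian in dimension 1, Ornstein-Uhlenbeck in dimension 2
X = np.array([[0.1, 0.2], [0.4, 0.9], [0.7, 0.3]])
R = power_exp_corr(X, X, gamma=np.array([2.0, 5.0]), alpha=np.array([2.0, 1.0]))
```

The resulting matrix is symmetric with unit diagonal, as a correlation matrix must be.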
4.1.2 Estimation for GRF model
Now we assume a collection of outputs y(1), . . . , y(n) is observed at the inputs x(1), . . . , x(n) respectively, where each input x(·) is a d-dimensional vector. Under the GRF model (4.1), the log-likelihood function for y = (y(1), . . . , y(n))ᵀ = (y(x(1)), . . . , y(x(n)))ᵀ, up to an additive constant, is

−(1/2){n log σ² + log |R(θ)| + (1/σ²)(y − Φβ)ᵀR(θ)⁻¹(y − Φβ)}, (4.5)

where Φ is the n × k matrix of basis functions and R(θ) is the n × n matrix of correlations with (l, m)th entry R(θ)_{lm} = R(x(l), x(m); θ), l, m = 1, . . . , n.
In the analysis of computer experiments, the most popular way to estimate the parame-
ters of the Gaussian process model is a likelihood-based method such as MLE or REML
(Santner et al., 2003), rather than the variogram method which is usually used in geostatis-
tics to estimate the correlation parameters (e.g. Cressie, 1993). The difference is partly
due to the incommensurability and higher dimensionality of the input vector in computer
experiments. As n gets larger, maximum likelihood estimation becomes computationally
infeasible: we need to evaluate the log-likelihood function (4.5) many times, and each evaluation requires |R(θ)| and R(θ)⁻¹, both of which take O(n³) operations.
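For reference, one such O(n³) evaluation of (4.5) can be organized around a single Cholesky factorization of R(θ), which yields both the log-determinant and the quadratic form. The helper below is our own sketch (names are our own), checked against a direct dense multivariate normal evaluation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def grf_loglik(y, Phi, beta, sigma2, R):
    """Log-likelihood (4.5) including its additive constant."""
    n = y.size
    L = np.linalg.cholesky(R)                 # the O(n^3) step
    resid = y - Phi @ beta
    z = np.linalg.solve(L, resid)             # so that z @ z = resid' R^{-1} resid
    logdetR = 2 * np.log(np.diag(L)).sum()
    return -0.5 * (n * np.log(2 * np.pi) + n * np.log(sigma2)
                   + logdetR + (z @ z) / sigma2)

# check on a small example with an Ornstein-Uhlenbeck correlation
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(size=8))
R = np.exp(-5.0 * np.abs(np.subtract.outer(x, x)))
Phi = np.column_stack([np.ones(8), x])
beta, sigma2 = np.array([1.0, -2.0]), 0.5
y = rng.multivariate_normal(Phi @ beta, sigma2 * R)
ll = grf_loglik(y, Phi, beta, sigma2, R)
ll_direct = multivariate_normal.logpdf(y, mean=Phi @ beta, cov=sigma2 * R)
```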
From a fully Bayesian point of view, the Gaussian random function (4.1) is a prior on the response function space, and the parameters (β, θ, σ²) are treated as hyperparameters. Point estimation of the parameters is then not required, but we still need to evaluate the n-dimensional normal likelihood function to obtain the predictive distribution.
To address this “big n” problem, two classes of approaches have been proposed:
1. Simplifying the correlation function. R(θ) is modelled with a much more easily manipulated structure, such as a sparse matrix. This class includes low-rank approximations (Stein, 2008; Cressie and Johannesson, 2008), covariance tapering techniques (Kaufman et al., 2008; Furrer et al., 2006; Sang and Huang, 2011) and compactly supported correlation matrices (Kaufman et al., 2011).
2. Approximating the likelihood function. This class includes Besag’s conditional com-
posite likelihood and its variants (Besag, 1975; Vecchia, 1988; Stein et al., 2004),
pairwise difference likelihood (Curriero and Lele, 1999), pairwise likelihood for im-
age models (Nott and Ryden, 1999) and binary spatial data (Heagerty and Lele,
1998), blockwise composite likelihood (Caragea and Smith, 2007) and ensemble methods such as the Bayesian Committee Machine, developed within the Bayesian framework (Tresp, 2000).
4.1.3 Prediction for GRF model
We would like a fast statistical surrogate for the computer code to predict the outputs at
untried input settings, with associated measures of uncertainty. For a new input x(0), we want to predict its output y(0) = y(x(0)), given the observations y = (y(1), . . . , y(n))ᵀ = (y(x(1)), . . . , y(x(n)))ᵀ. The most popular method for prediction is the best linear unbiased predictor (BLUP), which minimizes the mean squared prediction error E(C₀ᵀy − y(0))² among all linear unbiased predictors ŷ(0) = C₀ᵀy, where C₀ is an n × 1 vector of constants satisfying E(C₀ᵀy) = E(y(0)). This approach is also known as “Kriging” in the field of spatial
statistics (Matheron, 1963; Stein, 1999). For the Gaussian random function model (4.1),
the BLUP of y(0) is given by the conditional mean

E(y(0) | y) = φ₀ᵀβ + r₀ᵀR(θ)⁻¹(y − Φβ), (4.6)
where φ₀ = φ(x(0)), and r₀ = (R(x(0), x(1)), . . . , R(x(0), x(n)))ᵀ is the vector of correlations between y(0) and y. The vector r₀ reflects the information about y(x(0)) contained in each y(x(l)) (l = 1, . . . , n), and R(θ)⁻¹ accounts for the clustering effects among the observations. To see this, assume φ₀ᵀβ = 0 and n = 3, and y(0) = y(x(0)) is predicted using the three observed points y(x(1)), y(x(2)) and y(x(3)). Suppose ||x(1) − x(2)|| ≈ 0 and y(x(1)) ≈ y(x(2)); the BLUP of y(x(0)) is then

E(y(0) | y(x(1)), y(x(2)), y(x(3))) ≈ {(r01 − r13r03)y(x(1)) + (r03 − r13r01)y(x(3))}/(1 − r13²)
= E(y(0) | y(x(1)), y(x(3))) ≈ E(y(0) | y(x(2)), y(x(3))),

where rij = R(x(i), x(j)), i, j ∈ {0, 1, 2, 3}.
When the parameters (β, θ, σ²) are all known, the unconditional prediction variance of the BLUP (4.6) is

var{y(0) − E(y(0) | y)} = var[E{y(0) − E(y(0) | y) | y}] + E[var{y(0) − E(y(0) | y) | y}]
= 0 + E[var{y(0) − E(y(0) | y) | y}]
= E{var(y(0) | y)}.

Since var(y(0) | y) = σ²(1 − r₀ᵀR(θ)⁻¹r₀) does not depend on y, we have

var{y(0) − E(y(0) | y)} = var(y(0) | y) = σ²(1 − r₀ᵀR(θ)⁻¹r₀). (4.7)
It is easy to show that at any observed input, x(0) = x(l) with l ∈ {1, . . . , n}, the BLUP for y(x(0)) is E(y(x(0)) | y) = y(x(l)), which is a desirable feature for a deterministic computer code. It is also worth noting that the variance (4.7) is equal to zero when x(0) = x(l) for l ∈ {1, . . . , n}, due to the fact that r_lᵀR(θ)⁻¹ = e_lᵀ, where e_l is the lth unit vector (0, . . . , 0, 1, 0, . . . , 0)ᵀ (Santner et al., 2003, p. 90-93). To calculate the BLUP (4.6) and its variance (4.7), we still need to compute the inverse of the n × n covariance matrix.
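The interpolation property can be checked directly. The sketch below (our own helper, with a constant trend and an Ornstein–Uhlenbeck correlation as illustrative choices) computes the BLUP (4.6) and variance (4.7) at an observed input:

```python
import numpy as np

def blup(x0, x, y, beta0, gamma, sigma2):
    """BLUP (4.6) and its variance (4.7) for a constant trend beta0."""
    R = np.exp(-gamma * np.abs(np.subtract.outer(x, x)))  # OU correlation
    r0 = np.exp(-gamma * np.abs(x - x0))
    w = np.linalg.solve(R, r0)             # w = R^{-1} r0
    pred = beta0 + w @ (y - beta0)
    var = sigma2 * (1.0 - w @ r0)
    return pred, var

x = np.array([0.05, 0.3, 0.55, 0.8, 0.95])
y = np.array([1.2, 0.7, -0.4, 0.1, 0.9])
# predict at an observed input: the surrogate interpolates, with zero variance
pred, var = blup(x[2], x, y, beta0=0.0, gamma=3.0, sigma2=1.0)
```

At x(0) = x(3) here, pred equals y(3) and var equals 0 up to rounding error, since r₀ is the corresponding column of R.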
In this thesis we do not consider the uncertainty in the estimated parameters, which might be handled in the fully Bayesian framework mentioned in Section 4.1.2 (Santner et al., 2003). In particular, when σ² and θ are known and we put a non-informative prior on β,
[β] ∝ 1, the predictive distribution of y(x(0)) can be shown to be normal with mean

φ₀ᵀβ̂ + r₀ᵀR(θ)⁻¹(y − Φβ̂) (4.8)

and variance

σ²[ 1 − (φ₀ᵀ, r₀ᵀ) ( 0  Φᵀ
                     Φ  R(θ) )⁻¹ ( φ₀
                                   r₀ ) ], (4.9)

where β̂ = (ΦᵀR(θ)⁻¹Φ)⁻¹ΦᵀR(θ)⁻¹y is equal to the MLE of β.
In the next section we consider the pairwise likelihood function (1.5) for estimating the
unknown parameters. In Section 4.3, we propose prediction methods based on different
composite likelihood functions to approximate the BLUP (4.6). Composite likelihood-based estimation and prediction do not involve the large n × n matrix, and reduce the computational complexity from O(n³) to O(n²).
4.2 Estimation using composite likelihood
Pairwise likelihood is one of the most widely used versions of composite likelihood, and has been used to fit Gaussian process models with isotropic or geometrically anisotropic correlation functions in spatial data analysis (Heagerty and Lele, 1998; Nott and Ryden, 1999).
The pairwise log-likelihood function for the full model (4.5), up to an additive constant, is

−(1/2) Σ_{l=1}^{n−1} Σ_{m=l+1}^{n} { log σ⁴ + log |R_{l,m}| + (1/σ²)(y_{l,m} − Φ_{l,m}β)ᵀ R_{l,m}⁻¹ (y_{l,m} − Φ_{l,m}β) }, (4.10)

where y_{l,m} is the bivariate vector (y(l), y(m))ᵀ, Φ_{l,m} is the 2 × k matrix of basis functions for y_{l,m}, and R_{l,m} is the 2 × 2 correlation matrix of y_{l,m}. The maximum pairwise likelihood estimator (MPLE) of β is

β̂pl = ( Σ_{l=1}^{n−1} Σ_{m=l+1}^{n} Φ_{l,m}ᵀ R_{l,m}⁻¹ Φ_{l,m} )⁻¹ Σ_{l=1}^{n−1} Σ_{m=l+1}^{n} Φ_{l,m}ᵀ R_{l,m}⁻¹ y_{l,m} (4.11)

and the MPLE of σ² is

σ̂²pl = {1/(n(n − 1))} Σ_{l=1}^{n−1} Σ_{m=l+1}^{n} (y_{l,m} − Φ_{l,m}β̂pl)ᵀ R_{l,m}⁻¹ (y_{l,m} − Φ_{l,m}β̂pl), (4.12)

where the normalizing constant n(n − 1) counts the two coordinates contributed by each of the n(n − 1)/2 pairs.
Substituting β̂pl and σ̂²pl into the pairwise log-likelihood function (4.10), we get an objective function which depends only on θ and can be maximized to obtain the MPLE of θ. The computational complexity of finding the maximum pairwise likelihood estimators is O(n²). The computational burden can be further reduced if we exclude pairs formed by observations far apart, which may also improve the efficiency (Davis and Yau, 2011; Varin et al., 2005).
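The closed forms (4.11) and (4.12) are easy to implement by accumulating pair-level sums; the sketch below is our own code (an Ornstein–Uhlenbeck correlation and σ² = 1 are illustrative choices, not from the thesis).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, beta = 30, np.array([1.0, -0.5])
x = np.sort(rng.uniform(size=n))
Phi = np.column_stack([np.ones(n), x])
R = np.exp(-10.0 * np.abs(np.subtract.outer(x, x)))  # OU correlation
y = rng.multivariate_normal(Phi @ beta, R)           # sigma^2 = 1

# (4.11): accumulate the pair-level GLS sums
A = np.zeros((2, 2))
b = np.zeros(2)
for l, m in combinations(range(n), 2):
    idx = [l, m]
    Rlm_inv = np.linalg.inv(R[np.ix_(idx, idx)])
    A += Phi[idx].T @ Rlm_inv @ Phi[idx]
    b += Phi[idx].T @ Rlm_inv @ y[idx]
beta_pl = np.linalg.solve(A, b)

# (4.12): each of the n(n-1)/2 pairs contributes a 2-dimensional quadratic form
q = 0.0
for l, m in combinations(range(n), 2):
    idx = [l, m]
    resid = y[idx] - Phi[idx] @ beta_pl
    q += resid @ np.linalg.inv(R[np.ix_(idx, idx)]) @ resid
sigma2_pl = q / (n * (n - 1))
```

Each pair involves only 2 × 2 linear algebra, which is where the overall O(n²) cost comes from.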
Consistency and asymptotic normality of the maximum composite likelihood estimators
can be obtained under some regularity conditions within the framework of “increasing do-
main” asymptotics (Nott and Ryden, 1999; Caragea and Smith, 2007; Bevilacqua et al.,
2012).
4.3 Prediction using composite likelihood
In this section we assume the parameters are known and develop composite likelihood-
based techniques to approximate the BLUP (4.6). In practice we can replace the parameters
by their maximum composite likelihood estimators obtained in Section 4.2.
Under the model (4.1), the joint distribution of y(0) and y = (y(1), . . . , y(n))ᵀ is

( y; y(0) ) ∼ MVN_{n+1}( ( Φ; φ₀ᵀ )β, σ²( R(θ), r₀; r₀ᵀ, 1 ) ), (4.13)

where semicolons separate the rows of the partitioned vectors and matrix.
Treating y(0) as an unknown parameter, Jones et al. (1998, Appendix 1) showed that the MLE of y(0) is

ŷ(0)mle = φ₀ᵀβ + r₀ᵀR(θ)⁻¹(y − Φβ), (4.14)

which is identical to the BLUP (4.6). Following this line, we consider maximizing a composite likelihood function, instead of the full likelihood function of (4.13), to get a maximum composite likelihood estimator of y(0) as an approximation to ŷ(0)mle.
4.3.1 Maximum pairwise likelihood predictors
The pairwise likelihood function of (4.13) is

CLpair(y, y(0)) = ∏_{l=1}^{n−1} ∏_{m=l+1}^{n} f(y(l), y(m); θ, β, σ²) ∏_{l=1}^{n} f(y(l), y(0); θ, β, σ²), (4.15)

where each f(·, ·; θ, β, σ²) is a bivariate normal density function. The maximum pairwise likelihood predictor of y(0) is obtained by maximizing (4.15) with respect to y(0):
ŷ(0)pl = φ₀ᵀβ + { Σ_{l=1}^{n} r_{l,0}(y(l) − φ_lβ)/(1 − r_{l,0}²) } / { Σ_{l=1}^{n} 1/(1 − r_{l,0}²) }, (4.16)

where φ_l is the 1 × k vector of basis functions for y(l), and r_{l,0} denotes the correlation between y(l) and y(0). The number of operations to calculate ŷ(0)pl for each prediction is O(n). This
maximum pairwise likelihood predictor was also suggested by Grunenfelder (2010), and
applied to the max-stable process for modeling spatial extremes, rather than the Gaussian
process. The performances there and in our simulation study, shown in Section 4.4, were
unsatisfactory, and we consider a better approach in the next section.
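For a constant trend, (4.16) is a one-line weighted average; the sketch below (helper names our own) implements it and illustrates that with a single observation it coincides with the BLUP.

```python
import numpy as np

def predict_pairwise(r0, y, beta0):
    """Maximum pairwise likelihood predictor (4.16), constant trend beta0.

    r0[l] = corr(y(l), y(0)); each bivariate term carries weight 1/(1 - r0[l]^2).
    """
    w = 1.0 / (1.0 - r0**2)
    return beta0 + np.sum(w * r0 * (y - beta0)) / np.sum(w)

# with n = 1 the predictor reduces to the BLUP beta0 + r*(y - beta0)
r0 = np.array([0.6])
y = np.array([2.0])
pred = predict_pairwise(r0, y, beta0=1.0)   # 1.0 + 0.6*(2.0 - 1.0) = 1.6
```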
4.3.2 Maximum triplet-wise likelihood predictors
The maximum pairwise likelihood predictor (4.16) does not account for the clustering effect between any two observations, as illustrated in Section 4.1.3, so we consider the maximum triplet-wise likelihood predictor of y(0) to achieve an improvement. Similarly to the pairwise likelihood function (4.15), the triplet-wise likelihood function of the (n + 1)-dimensional normal model (4.13) is constructed as a product of all the possible trivariate normal densities, and the maximum triplet-wise likelihood predictor is given by

ŷ(0)tr = φ₀ᵀβ + { Σ_{l=1}^{n−1} Σ_{m=l+1}^{n} r_{lm0}ᵀ R_{l,m}⁻¹ (y_{l,m} − Φ_{l,m}β)/(1 − r_{lm0}ᵀ R_{l,m}⁻¹ r_{lm0}) } / { Σ_{l=1}^{n−1} Σ_{m=l+1}^{n} 1/(1 − r_{lm0}ᵀ R_{l,m}⁻¹ r_{lm0}) }, (4.17)
where y_{l,m} is the paired vector (y(l), y(m))ᵀ, Φ_{l,m} is the 2 × k matrix of basis functions for y_{l,m}, R_{l,m} is the 2 × 2 correlation matrix of y_{l,m}, and r_{lm0} is the 2 × 1 vector of correlations between y(0) and y_{l,m}. The number of operations to calculate ŷ(0)tr for each prediction is O(n²), and hence this approach may become infeasible when the number of requested predictions is very large. Moreover, the maximum triplet-wise likelihood predictor cannot account for higher-order clustering effects, such as three or more observations close to each other. To address these problems, we generalize the definition of maximum pairwise likelihood predictors in a different direction in the next section.
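A direct implementation of (4.17) for a constant trend is sketched below (our own code, not from the thesis). As a sanity check, with n = 2 there is a single pair, and the predictor reduces to the conditional mean given both observations, i.e. the BLUP.

```python
import numpy as np
from itertools import combinations

def predict_tripletwise(R, r0, y, beta0):
    """Maximum triplet-wise likelihood predictor (4.17), constant trend beta0."""
    num = den = 0.0
    for l, m in combinations(range(y.size), 2):
        Rlm_inv = np.linalg.inv(R[np.ix_([l, m], [l, m])])
        rlm0 = r0[[l, m]]
        c = 1.0 - rlm0 @ Rlm_inv @ rlm0     # proportional to var(y(0) | y_lm)
        num += (rlm0 @ Rlm_inv @ (y[[l, m]] - beta0)) / c
        den += 1.0 / c
    return beta0 + num / den

R = np.array([[1.0, 0.4], [0.4, 1.0]])
r0 = np.array([0.7, 0.2])
y = np.array([1.5, -0.3])
pred = predict_tripletwise(R, r0, y, beta0=0.0)
blup = r0 @ np.linalg.solve(R, y)           # zero-trend BLUP with n = 2
```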
4.3.3 Maximum blockwise likelihood predictors
We split the observed data (y(1), . . . , y(n)) into B blocks, D1, . . . , DB, according to their
input settings. The sizes of the blocks are assumed to be n1, . . . , nB respectively. In spatial
statistics, each block may be chosen as a geographic neighborhood. However, the selection
of the blocks in the context of computer experiments seems much more difficult due to the
dimensionality of the input and the anisotropic correlation structures. Ideally, the inputs
within a block should be more homogeneous than those in different blocks, and the correlation between any two blocks should be weak. The blockwise likelihood function of the
joint model (4.13) is
CLblock(y, y(0)) = ∏_{b=1}^{B−1} ∏_{c=b+1}^{B} f(D_b, D_c | β, θ, σ²) ∏_{b=1}^{B} f(D_b, y(0) | β, θ, σ²), (4.18)

which can be seen as a pairwise likelihood function treating each block as one single observation. When the size of each block is equal to 1, the blockwise likelihood function reduces to the pairwise likelihood function; when the number of blocks B = 1, the blockwise likelihood function becomes the full likelihood function.
Since ∏_{b=1}^{B−1} ∏_{c=b+1}^{B} f(D_b, D_c | β, θ, σ²) is not a function of y(0), maximizing the blockwise likelihood function (4.18) is equivalent to maximizing ∏_{b=1}^{B} f(D_b, y(0) | β, θ, σ²).
The maximum blockwise likelihood predictor is given by

ŷ(0)bl = { Σ_{b=1}^{B} var⁻¹(y(0) | D_b) }⁻¹ Σ_{b=1}^{B} var⁻¹(y(0) | D_b) E(y(0) | D_b)
= φ₀ᵀβ + { Σ_{b=1}^{B} r_{Db0}ᵀ R_{Db}⁻¹ (D_b − Φ_{Db}β)/(1 − r_{Db0}ᵀ R_{Db}⁻¹ r_{Db0}) } / { Σ_{b=1}^{B} 1/(1 − r_{Db0}ᵀ R_{Db}⁻¹ r_{Db0}) }, (4.19)

where Φ_{Db} is the n_b × k matrix of basis functions for D_b, R_{Db} is the n_b × n_b correlation matrix for D_b, and r_{Db0} is the n_b × 1 vector of correlations between y(0) and D_b.
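The predictor (4.19) is a weighted average of per-block conditional means. In the sketch below (our own code, constant trend), taking B = 1 recovers the BLUP (4.6), while B = 2 gives the blockwise approximation.

```python
import numpy as np

def predict_blockwise(R, r0, y, blocks, beta0):
    """Maximum blockwise likelihood predictor (4.19), constant trend beta0."""
    num = den = 0.0
    for idx in blocks:                       # idx: indices of block D_b
        Rb_inv = np.linalg.inv(R[np.ix_(idx, idx)])
        rb0 = r0[idx]
        c = 1.0 - rb0 @ Rb_inv @ rb0         # proportional to var(y(0) | D_b)
        num += (rb0 @ Rb_inv @ (y[idx] - beta0)) / c
        den += 1.0 / c
    return beta0 + num / den

x = np.array([0.1, 0.2, 0.6, 0.7])
R = np.exp(-4.0 * np.abs(np.subtract.outer(x, x)))   # OU correlation
r0 = np.exp(-4.0 * np.abs(x - 0.45))
y = np.array([1.0, 0.8, -0.2, 0.3])
one_block = predict_blockwise(R, r0, y, [list(range(4))], beta0=0.0)
blup = r0 @ np.linalg.solve(R, y)
two_blocks = predict_blockwise(R, r0, y, [[0, 1], [2, 3]], beta0=0.0)
```

Each block involves only n_b × n_b linear algebra, which is the source of the computational savings discussed below.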
We also consider the weighted blockwise likelihood function

CLwblock(y, y(0)) = ∏_{b=1}^{B} f(D_b, y(0) | β, θ, σ²)^{w_b}, (4.20)

where the weight w_b = var⁻¹(y(0) | D_b). Maximizing the weighted blockwise likelihood function (4.20), we get the maximum weighted blockwise likelihood predictor

ŷ(0)wbl = { Σ_{b=1}^{B} var⁻²(y(0) | D_b) }⁻¹ Σ_{b=1}^{B} var⁻²(y(0) | D_b) E(y(0) | D_b)
= φ₀ᵀβ + { Σ_{b=1}^{B} r_{Db0}ᵀ R_{Db}⁻¹ (D_b − Φ_{Db}β)/(1 − r_{Db0}ᵀ R_{Db}⁻¹ r_{Db0})² } / { Σ_{b=1}^{B} 1/(1 − r_{Db0}ᵀ R_{Db}⁻¹ r_{Db0})² }, (4.21)

which, compared with the unweighted blockwise likelihood predictor (4.19), places more weight on blocks with smaller var(y(0) | D_b).
In the joint multivariate model (4.13), if we maximize the full conditional likelihood function f(y(0) | y(1), . . . , y(n); β, θ, σ²) with respect to y(0), the resulting predictor of y(0) is also identical to the BLUP (4.6). Following this line, we may approximate the full conditional likelihood function using a composite likelihood, which is then maximized to obtain a predictor of y(0).
Denote by D̄_i the collection of blocks D_1, . . . , D_i, i = 1, . . . , B. If we approximate the conditional likelihood function f(D_i | D̄_{i−1}, y(0); β, θ, σ²) by f(D_i | y(0); β, θ, σ²), it can be shown by Bayes' rule that the full conditional likelihood function f(y(0) | y(1), . . . , y(n); β, θ, σ²) is proportional to

∏_{b=1}^{B} f(y(0) | D_b; β, θ, σ²) / {f(y(0); β, θ, σ²)}^{B−1}. (4.22)
This approximation is good if the observations in different blocks are nearly independent. Maximizing the composite conditional likelihood function (4.22), we obtain an alternative to the blockwise likelihood-based predictor ŷ(0)bl:

ŷ(0)abl = { Σ_{b=1}^{B} var⁻¹(y(0) | D_b) − (B − 1)/σ² }⁻¹ { Σ_{b=1}^{B} var⁻¹(y(0) | D_b) E(y(0) | D_b) − ((B − 1)/σ²) E(y(0)) }
= φ₀ᵀβ + { Σ_{b=1}^{B} r_{Db0}ᵀ R_{Db}⁻¹ (D_b − Φ_{Db}β)/(1 − r_{Db0}ᵀ R_{Db}⁻¹ r_{Db0}) } / { Σ_{b=1}^{B} 1/(1 − r_{Db0}ᵀ R_{Db}⁻¹ r_{Db0}) − (B − 1) }, (4.23)
which is identical to the predictor given by the Bayesian Committee Machine (Tresp, 2000) when the number of query points is equal to 1 and all the (hyper)parameters are assumed known. In practice, the parameters are usually unknown; the Bayesian Committee Machine puts prior distributions on the parameters, while our approach replaces the parameters by their maximum composite likelihood estimators.
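A sketch of (4.23) for a constant trend is given below (our own code, not from the thesis). When B = 1 the correction term B − 1 vanishes and the predictor coincides with the BLUP (4.6).

```python
import numpy as np

def predict_bcm(R, r0, y, blocks, beta0):
    """Conditional-likelihood predictor (4.23), constant trend beta0."""
    B = len(blocks)
    num = den = 0.0
    for idx in blocks:
        Rb_inv = np.linalg.inv(R[np.ix_(idx, idx)])
        rb0 = r0[idx]
        c = 1.0 - rb0 @ Rb_inv @ rb0
        num += (rb0 @ Rb_inv @ (y[idx] - beta0)) / c
        den += 1.0 / c
    return beta0 + num / (den - (B - 1))     # (B - 1) correction from (4.22)

x = np.array([0.1, 0.2, 0.6, 0.7])
R = np.exp(-4.0 * np.abs(np.subtract.outer(x, x)))
r0 = np.exp(-4.0 * np.abs(x - 0.45))
y = np.array([1.0, 0.8, -0.2, 0.3])
one_block = predict_bcm(R, r0, y, [list(range(4))], beta0=0.0)
blup = r0 @ np.linalg.solve(R, y)
```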
Now assume all of the blocks have the same size n_B; the number of operations to calculate the blockwise likelihood-based predictors defined above for each prediction is then O(n_B² n). Because the expectation E(D_b − Φ_{Db}β) = 0, it is easy to see that the predictors ŷ(0)bl, ŷ(0)wbl and ŷ(0)abl are all unbiased, i.e. their expectations under the joint distribution of (y(1), . . . , y(n)) are equal to E(y(0)) = φ₀ᵀβ. Moreover, if the block boundaries are fixed, the blockwise composite likelihood-based predictors will converge to the BLUP (4.6) as the density of the observations increases to infinity (Eidsvik et al., 2011). In the next section, we compare the performances of the proposed composite likelihood-based predictors in a simple model setting.
4.4 Simulation study
4.4.1 Prediction for GRF model with 1-dimensional input
With a univariate input x, the Gaussian random function is modelled with zero mean and correlation function R(|x − x*|; θ) = exp{−γ|x − x*|^α}, i.e. φ(x)ᵀβ ≡ 0. For α = 1.99, we uniformly generate n + 1 = 101 input values from the unit interval [0, 1];
a sample path is then drawn from the Gaussian process to give n outputs at the selected
input values. We repeat the simulation M = 1000 times, and at each time we randomly
pick one input value, denoted as x(0), associated with the output y(x(0)), which is to be
predicted by the rest of the observations y(x(1)), . . . , y(x(n)). The composite likelihood-based predictors developed in the previous section, as well as the BLUP based on the full likelihood, are compared in terms of the empirical mean square prediction error:
EMSPE = (1/M) Σ_{m=1}^{M} {ŷ_m − y(x(0))}², (4.24)

where ŷ_m denotes the predicted value of y(x(0)) in the mth replicate. To compute the blockwise likelihood-based predictors, we sort y(x(1)), . . . , y(x(n)) according to their input values, and then split the sorted data into B blocks in order, each of size n_B = 5. The results of the comparisons at different values of γ are presented in Table 4.1. As mentioned in Section 4.1.1,
when γ increases, the strength of dependence decreases. From Table 4.1, the BLUP ŷ(0)mle has the smallest EMSPE in all cases, as expected. The maximum triplet-wise likelihood predictor performs better than the maximum pairwise likelihood predictor, especially when γ is not large. The blockwise likelihood-based predictors outperform the pairwise and triplet-wise likelihood-based predictors. When γ = 1000, ŷ(0)bl and ŷ(0)abl become inaccurate, while the maximum weighted blockwise likelihood predictor ŷ(0)wbl is still competitive with the BLUP ŷ(0)mle. As γ increases to ∞, the Gaussian process behaves like white noise, and the EMSPEs converge to the individual variance σ², which is equal to 1 in this simulation study.
Table 4.1: EMSPE of the six predictors for different γ when α = 1.99

            ŷ(0)mle      ŷ(0)pl    ŷ(0)tr      ŷ(0)bl      ŷ(0)wbl     ŷ(0)abl
γ = 1       5.15×10⁻⁶   0.0025    8.96×10⁻⁵   9.79×10⁻⁶   6.18×10⁻⁶   9.92×10⁻⁶
γ = 10      4.74×10⁻⁵   0.0218    0.0022      9.79×10⁻⁵   5.98×10⁻⁵   8.37×10⁻⁵
γ = 100     0.0012      0.2415    0.1079      0.0034      0.0012      0.0020
γ = 1000    0.0441      0.8788    0.7932      0.3964      0.0664      0.2826
γ = 10000   0.6294      0.8400    0.8285      0.8240      0.7914      0.8096
We also performed the comparisons at different values of α, with γ fixed at 100. The results are shown in Table 4.2. Recall that α controls the smoothness of the sample path. When α decreases, the sample path becomes less smooth, and the accuracy of prediction also decreases. The maximum weighted blockwise likelihood predictor outperforms ŷ(0)bl and ŷ(0)abl, especially when α = 1.8.
Table 4.2: EMSPE of the six predictors for different α when γ = 100

            ŷ(0)mle   ŷ(0)pl   ŷ(0)tr   ŷ(0)bl   ŷ(0)wbl   ŷ(0)abl
α = 1.99    0.0012    0.2415   0.1079   0.0034   0.0012    0.0020
α = 1.9     0.0123    0.3080   0.1807   0.0806   0.0124    0.0456
α = 1.8     0.0233    0.5217   0.3445   0.2627   0.0252    0.1668
α = 1.5     0.1381    0.9876   0.8556   0.8892   0.4447    0.7677
α = 1       0.6126    0.8285   0.8208   0.8214   0.8106    0.8118
To investigate the influence of the sample size and the density of the observations on the predictive accuracy of the proposed prediction methods, the EMSPEs are compared for different values of n, with α = 1.8 and γ = 100. The block size is still set equal to 5. The results for M = 1000 simulations are presented in Table 4.3. As n increases, the density of the observations increases, and all of the predictors become more accurate. The maximum weighted blockwise likelihood predictor has the smallest EMSPE among all the composite likelihood-based predictors, and its EMSPE is comparable with that of the BLUP ŷ(0)mle.
Table 4.3: EMSPE of the six predictors at different sample sizes n

            ŷ(0)mle     ŷ(0)pl   ŷ(0)tr   ŷ(0)bl   ŷ(0)wbl     ŷ(0)abl
n = 100     0.0233      0.5217   0.3445   0.2627   0.0252      0.1668
n = 200     0.0064      0.4742   0.2691   0.1879   0.0069      0.1059
n = 400     0.0025      0.2028   0.0946   0.0650   0.0026      0.0315
n = 1000    2.93×10⁻⁴   0.0706   0.0230   0.0142   3.12×10⁻⁴   0.0056
4.4.2 Prediction for GRF model with 2-dimensional input
In the univariate input case, it is easy to identify the blocks formed by adjacent observations according to their input values. In this section we consider a Gaussian random function with 2-dimensional input vector x = (x1, x2). The mean function is still set equal to zero, and the correlation function is modelled as

R(x − x*; θ) = ∏_{i=1}^{2} exp(−γi|xi − xi*|^α).

The smoothness parameter α is the same in both dimensions, while the dependence parameters γ1 and γ2 can differ. At α = 1.99, we uniformly generate n + 1 = 101 input values from the unit square [0, 1] × [0, 1]. γ1 is fixed at 100, while the value of γ2 is allowed to vary. The blocks, each of size 5, are identified in the same way as in the univariate input case, according to the value of x1 only, the first dimension of the input vector x.
At each setting of parameters, we repeat the simulation M = 1000 times. The composite
likelihood-based predictors and the BLUP are compared in terms of their empirical mean
square prediction errors (4.24).
The results of simulations are summarized in Table 4.4. As γ2 decreases, the strength
of dependence increases, and all of the predictors become more accurate. The blockwise
likelihood-based predictors still outperform the pairwise and triplet-wise likelihood-based
predictors. The maximum weighted blockwise likelihood predictor y(0)wbl performs better
than all the other composite likelihood-based predictors, although its relative accuracy com-
pared with the BLUP is not as good as the univariate case.
Table 4.4: EMSPE of the six predictors with 2-dimensional input
          y(0)mle   y(0)pl   y(0)tr   y(0)bl   y(0)wbl   y(0)abl
γ2 = 0    0.0012    0.3614   0.1594   0.0052   0.0012    0.0028
γ2 = 5    0.0758    0.8688   0.7569   0.5839   0.2688    0.4742
γ2 = 10   0.1401    0.9173   0.8413   0.7459   0.4117    0.6514
γ2 = 50   0.3969    0.9537   0.9151   0.8857   0.7488    0.8348
4.5 Discussion and Future Work
The blockwise composite likelihood-based predictors developed in this chapter may be
thought of as a weighted average of the best linear predictors obtained from each block,
with the weights determined automatically by maximizing (weighted) blockwise composite
likelihood functions. The preliminary simulation studies in Section 4.4 show that the
blockwise composite likelihood-based predictors outperform the maximum pairwise and
maximum triplet-wise likelihood predictors, and that the maximum weighted blockwise
composite likelihood predictor y(0)wbl appears better than y(0)bl and y(0)abl, especially
when the strength of dependence is not too weak. The idea of maximizing a composite
likelihood function to develop prediction methods is not restricted to the multivariate
normal distribution: it may also be applied to other statistical models, such as the
max-stable process for modeling spatial extremes (Grunenfelder, 2010).
In the simulation studies we assume the parameters in the Gaussian random functions
are known. In practice, the parameters are usually unknown, and can be estimated using
the maximum pairwise likelihood estimator mentioned in Section 4.2. The influence of the
uncertainty in parameter estimation on the prediction will be investigated in future work.
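As one hedged illustration of that estimation step, the dependence parameter of a zero-mean, unit-variance process can be estimated by maximizing a pairwise log-likelihood built from bivariate normal densities of nearby pairs. The distance truncation `max_lag`, the grid search, and all defaults below are assumptions made for this sketch; it is not the Section 4.2 estimator itself.

```python
import numpy as np

def neg_pairwise_loglik(gamma, x, y, alpha=1.99, max_lag=0.1):
    """Negative pairwise log-likelihood (additive constants dropped) for the
    dependence parameter gamma of a zero-mean, unit-variance process with
    correlation exp(-gamma |x - x*|^alpha), using only pairs of distinct
    inputs within distance max_lag."""
    nll = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            d = abs(x[i] - x[j])
            if d > max_lag:
                continue                # drop distant, nearly independent pairs
            rho = np.exp(-gamma * d ** alpha)
            # bivariate standard normal log-density with correlation rho
            q = (y[i] ** 2 - 2 * rho * y[i] * y[j] + y[j] ** 2) / (1 - rho ** 2)
            nll += 0.5 * (np.log(1 - rho ** 2) + q)
    return nll

def fit_gamma(x, y, grid):
    """Crude grid-search maximum pairwise likelihood estimate of gamma."""
    return grid[np.argmin([neg_pairwise_loglik(g, x, y) for g in grid])]
```

In practice one would replace the grid search with a numerical optimizer and estimate the smoothness and variance parameters jointly.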
The proposed blockwise likelihood-based predictors will be applied to some real problems
with high-dimensional input vectors, for example the data set from a computer experiment
on photometric redshift (Kaufman et al., 2011), where the full likelihood approach does
not work due to the size of the data. Moreover, the variances, as well as the prediction
intervals, for the blockwise likelihood-based predictors will also be derived in future work.
Bibliography
A. Andrei and C. Kendziorski. An efficient method for identifying statistical interactors in
gene association networks. Biostatistics, 10:706–718, 2009.
B. C. Arnold. Example of a non-normal distribution with normal marginals. Personal
communication, 2010.
B. C. Arnold and D. Strauss. Pseudolikelihood estimation: some examples. Sankhya Ser.
B, 53:233–243, 1991.
J. E. Besag. Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist.
Soc. Ser. B, 36:192–236, 1974.
J. E. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179–195, 1975.
M. Bevilacqua, C. Gaetan, J. Mateu, and E. Porcu. Estimating space and space-time covari-
ance functions for large data sets: a weighted composite likelihood approach. J. Amer.
Statist. Assoc., To appear, 2012.
P. Caragea and R. L. Smith. Asymptotic properties of computationally efficient alternative
estimators for a class of multivariate normal models. J. Multivariate Anal., 98:1417–
1440, 2007.
R. E. Chandler and S. Bate. Inference for clustered data using the independence loglikeli-
hood. Biometrika, 94:167–183, 2007.
D. R. Cox. Partial likelihood. Biometrika, 62:268–276, 1975.
D. R. Cox and N. Reid. A note on pseudolikelihood constructed from marginal densities.
Biometrika, 91:729–737, 2004.
N. Cressie and G. Johannesson. Fixed rank kriging for very large spatial data sets. J. Roy.
Statist. Soc. Ser. B, 70:209–226, 2008.
F. Curriero and S. Lele. A composite likelihood approach to semivariogram estimation. J.
Agric. Biol. Environ. Stat., 4:9–28, 1999.
R. A. Davis and C. Y. Yau. Comments on pairwise likelihood in time series models. Statist.
Sinica, 21:255–277, 2011.
J. Eidsvik, B. A. Shaby, B. J. Reich, M. Wheeler, and J. Niemi. Estimation and prediction in
spatial models with block composite likelihoods using parallel computing. Unpublished
manuscript, 2011.
K. Fang, R. Li, and A. Sudjianto. Design and Modeling for Computer Experiments. Chapman &
Hall/CRC, 2006.
R. Furrer, M. G. Genton, and D. Nychka. Covariance tapering for interpolation of large
spatial datasets. J. Comput. Graph. Statist., 15:502–523, 2006.
X. Gao and P. X.-K. Song. Composite likelihood Bayesian information criteria for model
selection in high-dimensional data. J. Amer. Statist. Assoc., 105:1531–1540, 2010.
H. Geys, G. Molenberghs, and L. Ryan. Pseudolikelihood modeling of multivariate out-
comes in developmental toxicology. J. Amer. Statist. Assoc., 94:734–745, 1999.
V. P. Godambe. An optimum property of regular maximum likelihood estimation. Ann. Math.
Statist., 31:1208–1212, 1960.
C. Grunenfelder. Aspects of composite likelihood inference. Master’s thesis, Imperial
College London, 2010.
W. He and G. Y. Yi. A pairwise likelihood method for correlated binary data with/without
missing observations under generalized partially linear single-index models. Statist.
Sinica, 21:207–229, 2011.
P. Heagerty and S. Lele. A composite likelihood approach to binary spatial data. J. Amer.
Statist. Assoc., 93:1099–1111, 1998.
M. Henmi and S. Eguchi. A paradox concerning nuisance parameters and projected esti-
mating functions. Biometrika, 91:929–943, 2004.
N. L. Hjort and C. Varin. ML, PL, QL in Markov chain models. Scand. J. Statist., 35:64–82,
2007.
Z. Jin. Aspects of composite likelihood inference. PhD thesis, University of Toronto, 2009.
H. Joe and Y. Lee. On weighting of bivariate margins in pairwise likelihood. J. Mult. Anal.,
100:670–685, 2009.
D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-
box functions. J. Global Optim., 13:455–492, 1998.
C. G. Kaufman, D. Bingham, S. Habib, K. Heitmann, and J. A. Frieman. Efficient emula-
tors of computer experiments using compactly supported correlation functions, with an
application to cosmology. Ann. Appl. Stat., 5:2470–2492, 2011.
C. G. Kaufman, M. J. Schervish, and D. W. Nychka. Covariance tapering for likelihood-
based estimation in large spatial data sets. J. Amer. Statist. Assoc., 103:1545–1555,
2008.
E. C. Kenne Pagui. Pairwise likelihood in multivariate normal models. Master’s thesis,
University of Padova, 2009.
J. T. Kent. Robust properties of likelihood ratio tests. Biometrika, 69:19–27, 1982.
A. Y. C. Kuk. A hybrid pairwise likelihood method. Biometrika, 94:939–952, 2007.
B. G. Lindsay. Conditional score functions: some optimality results. Biometrika, 69:503–
512, 1982.
B. G. Lindsay. Composite likelihood methods. Contemp. Math., 80:221–239, 1988.
B. G. Lindsay, G. Y. Yi, and J. Sun. Issues and strategies in the selection of composite
likelihoods. Statist. Sinica, 21:71–105, 2011.
K. V. Mardia, G. Hughes, and C. C. Taylor. Efficiency of the pseudolikelihood for multivariate
normal and von Mises distributions. Technical report, Department of Statistics,
University of Leeds, 2007.
K. V. Mardia, J. T. Kent, G. Hughes, and C. C. Taylor. Maximum likelihood estimation
using composite likelihoods for closed exponential families. Biometrika, 96:975–982,
2009.
B. Matérn. Spatial Variation. New York: Springer, 1986.
G. Matheron. Principles of geostatistics. Economic Geology, 58:1246–1266, 1963.
G. Molenberghs and G. Verbeke. Models for Discrete Longitudinal Data. New York:
Springer, 2005.
G. Molenberghs, M. Kenward, G. Verbeke, and T. Berhanu. Pseudo-likelihood estimation
for incomplete data. Statist. Sinica, 21:187–206, 2011.
D. Nott and T. Ryden. Pairwise likelihood methods for inference in image models.
Biometrika, 86:661–676, 1999.
L. Pace, A. Salvan, and N. Sartori. Adjusting composite likelihood ratio statistics. Statist.
Sinica, 21:129–148, 2011.
D. Renard, G. Molenberghs, and H. Geys. A pairwise likelihood approach to estimation in
multilevel probit models. Comput. Statist. Data Anal., 44:649–667, 2004.
J. Sacks, S. B. Schiller, and W. J. Welch. Designs for computer experiments. Technometrics,
31:41–47, 1989a.
J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer
experiments (with discussion). Statist. Sci., 4:409–435, 1989b.
H. Sang and J. Z. Huang. A full scale approximation of covariance functions for large
spatial data sets. J. Roy. Statist. Soc. Ser. B, 74:111–132, 2011.
T. J. Santner, B. J. Williams, and W. I. Notz. The Design and Analysis of Computer Exper-
iments. New York: Springer, 2003.
M. Stein. Statistical Interpolation of Spatial Data: Some Theory for Kriging. New York:
Springer, 1999.
M. Stein. A modeling approach for large spatial datasets. J. Korean Statist. Soc., 37:3–10,
2008.
M. Stein, Z. Chi, and L. Welty. Approximating likelihoods for large spatial data sets. J.
Roy. Statist. Soc. Ser. B, 66:275–296, 2004.
V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000.
C. Varin. On composite marginal likelihoods. Adv. Statist. Anal., 92:1–28, 2008.
C. Varin and P. Vidoni. A note on composite likelihood inference and model selection.
Biometrika, 92:519–528, 2005.
C. Varin and P. Vidoni. Pairwise likelihood inference for ordinal categorical time series.
Comput. Statist. Data Anal., 51:2365–2373, 2006.
C. Varin, G. Høst, and Ø. Skare. Pairwise likelihood inference in spatial generalized linear
mixed models. Comput. Statist. Data Anal., 49:1173–1191, 2005.
C. Varin, N. Reid, and D. Firth. An overview of composite likelihood methods. Statist.
Sinica, 21:5–42, 2011.
A. V. Vecchia. Estimation and model identification for continuous spatial processes. J.
Roy. Statist. Soc. Ser. B, 50:297–312, 1988.
A. Wald. Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist.,
20:595–601, 1949.
W. J. Welch, R. J. Buck, J. Sacks, H. P. Wynn, T. J. Mitchell, and M. D. Morris. Screening,
predicting, and computer experiments. Technometrics, 34:15–22, 1992.
H. White. Maximum likelihood estimation of misspecified models. Econometrica, 50:1–25,
1982.
X. Xu and N. Reid. On the robustness of maximum composite likelihood estimate. J.
Statist. Plan. Infer., 141:3047–3054, 2011.
G. Y. Yi, L. L. Zeng, and J. R. Cook. A robust pairwise likelihood method for incomplete
longitudinal binary data arising in clusters. Can. J. Statist., 39:34–51, 2011.
Y. Zhao and H. Joe. Composite likelihood estimation in multivariate data analysis. Can. J.
Statist., 33:335–356, 2005.