
Journal of Multivariate Analysis 114 (2013) 378–388


Comparison of confidence intervals for correlation coefficients based on incomplete monotone samples and those based on listwise deletion

K. Krishnamoorthy

Department of Mathematics, University of Louisiana at Lafayette, Lafayette, LA 70504, USA

Article info

Article history: Received 9 February 2012; Available online 13 August 2012.

AMS subject classifications: 62H12, 62H15

Keywords: Coverage probability; Lower triangular invariant; Missing at random; Monotone samples

Abstract

Inferential procedures for estimating and comparing normal correlation coefficients based on incomplete samples with a monotone missing pattern are considered. The procedures are based on the generalized variable (GV) approach. It is shown that the GV methods based on complete or incomplete samples are exact for estimating or testing a simple correlation coefficient. Procedures based on incomplete samples for comparing two overlapping dependent correlation coefficients are also proposed. For both problems, Monte Carlo simulation studies indicate that the inferences based on incomplete samples and those based on samples after listwise or pairwise deletion are similar, and the loss of efficiency from ignoring the additional data is not appreciable. The proposed GV approach is simple, and it can be readily extended to other problems, such as that of estimating two non-overlapping dependent correlations. The results are illustrated using two examples.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

The Pearson product–moment correlation is the most popular measure of association between two continuous random variables. Assuming normality, several authors have addressed the problem of estimating or testing correlation coefficients in various contexts, and provided solutions based on large sample theory. If the underlying distribution is bivariate normal, then an exact t procedure is available to test if the population correlation coefficient ρ is significantly different from zero. To test a non-zero value of ρ, the test based on Fisher's z transformation of the sample correlation coefficient [5] is commonly used. Fisher's approach is reasonably accurate for moderate samples, and standard software packages use this approach to find confidence limits (CLs) for ρ. There is an exact method, which produces uniformly most accurate confidence intervals (CIs), available in the literature (see [2, Section 4.2]). However, this exact method is not popular because of computational complexity.

Fisher's z transformation for the one-sample case can be readily extended to the problem of testing two independent correlation coefficients, but the test cannot be transformed into a procedure for setting CLs for the difference between correlation coefficients. Olkin and Finn [16,17] proposed a normal-based asymptotic method that can be used for testing as well as for obtaining CIs. In general, the procedures given in the literature are based on asymptotic theory, and simulation studies by Krishnamoorthy and Xia [11] indicated that such asymptotic procedures are, in some cases, not satisfactory even for large samples.

In this article, we consider inferential procedures for correlation coefficients based on missing data. Missing data arise, for example, during data gathering and recording, when the experiment involves a group of individuals over a period of time, as in clinical trials, or in a planned experiment where the variables that are expensive to measure are collected only from a subset of a sample. To ignore the missingness mechanism, we assume that the data are missing at random (MAR). Lu and Copas [13] noted that inference from the likelihood method is valid if and only if the missing data mechanism is

E-mail address: [email protected].

0047-259X/$ – see front matter © 2012 Elsevier Inc. All rights reserved. doi:10.1016/j.jmva.2012.08.003


MAR. For a formal definition and exposition of MAR and missing completely at random (MCAR), we refer to [12, Section 1.3] and [8]. There are a few missing patterns considered in the literature, but incomplete data with a monotone pattern is common, and it is convenient for making inference. For the multivariate normal case, Anderson [1] gives a simple approach to derive the maximum likelihood estimates (MLEs) and presents them for a special case. Some invariance properties of the MLEs enable us to develop finite sample inferential procedures for the mean and the covariance matrix of a multivariate normal distribution. See the articles by Krishnamoorthy and Pannala [9,10], Hao and Krishnamoorthy [7], the recent articles by Chang and Richards [3,4], and the references therein.

Although several papers address the problems of making inference on a multivariate normal mean vector and covariance matrix, the problem of estimating or testing a correlation coefficient with missing data is seldom addressed in the literature. Our online review indicates that commonly used software packages use the standard approach after deleting the records for subjects with missing observations. This standard practice is simple but does not utilize the information in the additional data. So it is of interest to assess the loss of efficiency of the standard approach by comparing its results with those of methods that utilize the additional data.

In this article, we provide a generalized variable (GV) method for making inference on a simple correlation coefficient and for comparing two dependent correlation coefficients based on incomplete samples with a monotone pattern. The proposed approach is similar to the one for the complete sample case given in [11], but here we show that the GV solutions to the one-sample problems are exact for complete or incomplete samples. Furthermore, the GV approach for incomplete samples can be readily extended to testing or interval estimation of the difference between two independent correlations, the difference between two overlapping dependent correlations [15], and the difference between two non-overlapping dependent correlation coefficients [16,14].

The rest of the article is organized as follows. In Section 2, we describe the MLE for the normal covariance matrix Σ, and develop generalized pivotal quantities (GPQs) for the elements of Σ. In Section 3, we develop a GPQ for the simple correlation coefficient ρ as a function of the GPQs of the elements of Σ, and outline inferential procedures based on the GPQ. In Section 4, we extend the results of Section 3 to compare two overlapping dependent correlation coefficients. In Section 5, we compare the expected widths of CIs based on the complete pairs and of those based on incomplete samples to assess the gain in precision by using the additional data. An illustrative example and an example based on simulated data are provided in Section 6. Some concluding remarks and applications of the GV approach to other correlation problems are given in Section 7.

2. Generalized pivotal quantity for a normal correlation matrix

Let X be a p-variate normal random vector with mean vector µ and covariance matrix Σ. Let the correlation matrix based on Σ be

$$\rho = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix}.$$

Consider a monotone sample of n1 subjects with the following pattern:

$$\begin{array}{ccccccc} X_{11} & \ldots & X_{1n_p} & \ldots & X_{1n_2} & \ldots & X_{1n_1} \\ X_{21} & \ldots & X_{2n_p} & \ldots & X_{2n_2} & & \\ \vdots & & \vdots & & & & \\ X_{p1} & \ldots & X_{pn_p} & & & & \end{array} \quad (1)$$

Note that there are $n_i$ observations available on the ith component, i = 1, . . . , p, and we assume that n1 ≥ n2 ≥ · · · ≥ np. That is, measurements on the first component are available for all n1 subjects, measurements on the first two components are available only for n2 subjects, and so on. Let Σ̂ be the MLE of Σ based on sample (1). Write

$$\hat{\Sigma} = \hat{W}\hat{W}', \quad (2)$$

where Ŵ is the Cholesky factor of Σ̂ with positive diagonal elements. Let θ = (θij) be the Cholesky factor of Σ. Since the MLE Σ̂ is invariant under lower triangular transformations as well as under location transformations, the distribution of θ⁻¹Ŵ does not depend on any unknown parameters.

To find a GPQ for θ = (θij), let w be an observed value of Ŵ defined in (2). Then

$$w\left(\theta^{-1}\hat{W}\right)^{-1} = wV^{-1} = A = (a_{ij}) \ \text{ is a GPQ for } \theta = (\theta_{ij}),\ i \ge j. \quad (3)$$

Notice that A is a lower triangular matrix with aij = 0 for i < j. Furthermore, for a given w, the distribution of A does not depend on any unknown parameters. The element aij is a GPQ for θij for i ≥ j. Also, if h(θ) is a real valued function of θ, then h(A) is a GPQ for h(θ). For example, the percentiles of h(A) can be used to construct CIs for h(θ). For more details and numerous applications of the GV approach, see the book by Weerahandi [19], and for details in the present context see [11].


A GPQ for ρij can be expressed in terms of GPQs for θij. Toward this, we note that

$$\rho_{ij} = \frac{\sum_{k=1}^{j} \theta_{ik}\theta_{jk}}{\sqrt{\sum_{k=1}^{i} \theta_{ik}^2\, \sum_{k=1}^{j} \theta_{jk}^2}},$$

and so an expression for the GPQ of ρij is given by

$$G_{\rho_{ij}} = \frac{\sum_{k=1}^{j} a_{ik}a_{jk}}{\sqrt{\sum_{k=1}^{i} a_{ik}^2\, \sum_{k=1}^{j} a_{jk}^2}} \quad \text{for } i \ge j, \quad (4)$$

where aij is the (i, j)th element of A in (3).
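In code, (4) amounts to a few dot products over the rows of a lower-triangular matrix. A minimal numpy sketch, with a helper name of our own choosing (not from the paper):

```python
import numpy as np

def gpq_corr(A, i, j):
    """GPQ for rho_ij from a lower-triangular matrix A, per Eq. (4); i >= j (0-based)."""
    num = np.dot(A[i, :j + 1], A[j, :j + 1])            # sum_{k<=j} a_ik a_jk
    den = np.sqrt(np.dot(A[i, :i + 1], A[i, :i + 1]) *
                  np.dot(A[j, :j + 1], A[j, :j + 1]))   # sqrt(sum a_ik^2 * sum a_jk^2)
    return num / den
```

In particular, when A is the Cholesky factor of a covariance matrix, gpq_corr returns the corresponding correlation coefficient.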

3. Inference on simple correlation coefficient

To write the MLE Σ̂ explicitly, let us write the monotone sample in (1) as

$$\begin{array}{ccccccc} X_{11} & X_{12} & \ldots & X_{1n} & Y_1 & \ldots & Y_m \\ X_{21} & X_{22} & \ldots & X_{2n}. & & & \end{array} \quad (5)$$

That is, we have a sample of n observations available on both components, and additional m observations are available onthe first component. Define

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12} & S_{22} \end{pmatrix} = \sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})' \qquad \text{and} \qquad V = \sum_{i=1}^{m}(Y_i - \bar{Y})^2, \quad (6)$$

so that S ∼ W2(n − 1, Σ) independently of V ∼ W1(m − 1, σ11), where Wp(m, Ω) denotes the p-dimensional Wishart distribution with m degrees of freedom and positive definite parameter matrix Ω. The MLEs of the elements σij of Σ are given (see [1]) by

$$\hat{\Sigma} = \begin{pmatrix} \hat{\sigma}_{11} & \hat{\sigma}_{12} \\ \hat{\sigma}_{12} & \hat{\sigma}_{22} \end{pmatrix} = \begin{pmatrix} \dfrac{S_{11} + V + \left(\frac{1}{m} + \frac{1}{n}\right)^{-1}(\bar{X}_1 - \bar{Y})^2}{m + n} & \dfrac{\hat{\sigma}_{11} S_{12}}{S_{11}} \\[1.5ex] \dfrac{\hat{\sigma}_{11} S_{12}}{S_{11}} & \dfrac{S_{2.1}}{n} + \dfrac{S_{21}\hat{\sigma}_{11} S_{12}}{S_{11}^2} \end{pmatrix}, \quad (7)$$

where S2.1 = S22 − S21 S11⁻¹ S12. Letting

$$Q = V + \left(\frac{1}{m} + \frac{1}{n}\right)^{-1}\left(\bar{X}_1 - \bar{Y}\right)^2, \quad (8)$$
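The MLE in (7)–(8) can be sketched in numpy as follows; the function name and argument convention are ours, not the paper's:

```python
import numpy as np

def mle_bivariate_monotone(x1, x2, extra_x1):
    """MLE of a bivariate normal covariance matrix from n complete pairs
    (x1, x2) plus m additional observations on the first component,
    following Eqs. (6)-(8).  Helper name is ours, not the paper's."""
    x1, x2, extra_x1 = map(np.asarray, (x1, x2, extra_x1))
    n, m = len(x1), len(extra_x1)
    s11 = np.sum((x1 - x1.mean()) ** 2)
    s22 = np.sum((x2 - x2.mean()) ** 2)
    s12 = np.sum((x1 - x1.mean()) * (x2 - x2.mean()))
    v = np.sum((extra_x1 - extra_x1.mean()) ** 2)
    q = v + (1.0 / m + 1.0 / n) ** -1 * (x1.mean() - extra_x1.mean()) ** 2
    sig11 = (s11 + q) / (m + n)                      # (1,1) entry of Eq. (7)
    sig12 = sig11 * s12 / s11
    s2_1 = s22 - s12 ** 2 / s11                      # S_{2.1} = S22 - S21 S11^{-1} S12
    sig22 = s2_1 / n + s12 ** 2 * sig11 / s11 ** 2
    return np.array([[sig11, sig12], [sig12, sig22]])
```

One property worth noting: the estimated regression slope σ̂12/σ̂11 equals the complete-pair slope S12/S11, so the additional data only rescale the first-component variance.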

the MLE of Σ can be written as

$$\hat{\Sigma} = T\begin{pmatrix} \dfrac{1 + Q/S_{11}}{m + n} & 0 \\ 0 & \dfrac{1}{n} \end{pmatrix}T' = \hat{W}\hat{W}', \qquad \text{where } \hat{W} = T\begin{pmatrix} \sqrt{\dfrac{1 + Q/S_{11}}{m + n}} & 0 \\ 0 & \sqrt{\dfrac{1}{n}} \end{pmatrix}, \quad (9)$$

and T = (Tij) is the lower triangular matrix with positive diagonal elements such that TT′ = S (see [18]). As noted earlier, the distribution of θ⁻¹Ŵ does not depend on any parameters, and in fact

$$V = \theta^{-1}\hat{W} \ \text{ is distributed as } \ T^*\begin{pmatrix} \sqrt{\dfrac{1 + Q^*/S^*_{11}}{m + n}} & 0 \\ 0 & \sqrt{\dfrac{1}{n}} \end{pmatrix}, \quad (10)$$

where T* = (T*ij) is the Cholesky factor of S* = (S*ij), which follows W2(n − 1, I2) independently of Q* ∼ χ²_m. Furthermore, it is known that the T*ij are independent, with T*ii² ∼ χ²_{n−i}, i = 1, 2, and T*ij ∼ N(0, 1) for i > j. The GPQ for the simple correlation coefficient ρ can be obtained as a special case of (4), and is given by

$$G_\rho = \frac{a_{21}}{\sqrt{a_{21}^2 + a_{22}^2}}. \quad (11)$$


An explicit expression for A can be obtained as follows. Let s = (sij) be an observed value of S, let t = (tij) be the Choleskyfactor of s, and let q be an observed value of Q in (8). It follows from (9) that

$$w = t\begin{pmatrix} \sqrt{\dfrac{1 + q/s_{11}}{m + n}} & 0 \\ 0 & \sqrt{\dfrac{1}{n}} \end{pmatrix}, \quad (12)$$

and so

$$A = wV^{-1} = w\left[T^*\begin{pmatrix} \sqrt{\dfrac{1 + Q^*/S^*_{11}}{m + n}} & 0 \\ 0 & \sqrt{\dfrac{1}{n}} \end{pmatrix}\right]^{-1} = \begin{pmatrix} \left(\dfrac{s_{11} + q}{S^*_{11} + Q^*}\right)^{1/2} & 0 \\[1.5ex] t_{11}^{-1}t_{21}\left(\dfrac{s_{11} + q}{S^*_{11} + Q^*}\right)^{1/2} - \dfrac{T^*_{21}t_{22}}{T^*_{11}T^*_{22}} & \dfrac{t_{22}}{T^*_{22}} \end{pmatrix}$$

$$= \begin{pmatrix} \left(\dfrac{s_{11} + q}{\chi^2_{n-1} + \chi^2_m}\right)^{1/2} & 0 \\[1.5ex] s_{11}^{-1}s_{21}\left(\dfrac{s_{11} + q}{\chi^2_{n-1} + \chi^2_m}\right)^{1/2} - Z\sqrt{\dfrac{s_{2.1}}{\chi^2_{n-1}\chi^2_{n-2}}} & \sqrt{\dfrac{s_{2.1}}{\chi^2_{n-2}}} \end{pmatrix}, \quad (13)$$

where Z is a standard normal random variable, and Z and the χ² random variables are mutually independent.

3.1. Confidence interval for a simple correlation coefficient

A CI for ρ can be obtained from one for

$$\eta = \frac{\rho}{\sqrt{1 - \rho^2}} = \frac{\theta_{21}}{\theta_{22}} \quad (14)$$

in a relatively easier manner. Specifically, if (L, U) is a CI for η, then (L/√(1 + L²), U/√(1 + U²)) is a CI for ρ. Let r = s21/√(s11 s22) be the sample correlation coefficient based on the n complete cases, and let r* = s21/√(s11 s2.1) = r/√(1 − r²). Using (3) and (13), the GPQ for η can be expressed as

$$G_\eta = \frac{G_{\theta_{21}}}{G_{\theta_{22}}} = \frac{a_{21}}{a_{22}} = \frac{s_{21}}{\sqrt{s_{11}s_{2.1}}}\left[\frac{(1 + q/s_{11})\,\chi^2_{n-2}}{\chi^2_{n-1} + \chi^2_m}\right]^{1/2} - \frac{Z}{\sqrt{\chi^2_{n-1}}} = r^*\left(\frac{\chi^2_{n-2}}{\chi^2_{n-1}}\right)^{1/2}\left[\left(1 + \frac{q}{s_{11}}\right)\left(1 + \frac{\chi^2_m}{\chi^2_{n-1}}\right)^{-1}\right]^{1/2} - \frac{Z}{\sqrt{\chi^2_{n-1}}}. \quad (15)$$

The GPQ for η based on only the n complete pairs, denoted by G^c_η, is obtained by dropping the factor [(1 + q/s11)(1 + χ²_m/χ²_{n−1})⁻¹]^{1/2} from the GPQ in (15). That is,

$$G^c_\eta = r^*\left(\frac{\chi^2_{n-2}}{\chi^2_{n-1}}\right)^{1/2} - \frac{Z}{\sqrt{\chi^2_{n-1}}}. \quad (16)$$

The following algorithm describes computational details to find a generalized CI for ρ.


Algorithm 1.

1. For a given incomplete sample of size (n, m), compute s11, s12, s22, and r based on the n complete pairs of observations, and q = Σ_{i=1}^{m}(y_i − ȳ)² + (1/m + 1/n)⁻¹(x̄1 − ȳ)²; set r* = r/√(1 − r²).
2. Generate a standard normal variate Z, and chi-square variates χ²_m, χ²_{n−1} and χ²_{n−2}.
3. Set G_η = r*[(1 + q/s11)χ²_{n−2}/(χ²_{n−1} + χ²_m)]^{1/2} − Z/√χ²_{n−1}.
4. Repeat steps 2 and 3 a large number of times, say, 10,000.

The 100α percentile and the 100(1 − α) percentile of the G_η's generated above form a 1 − 2α generalized CI for η. The lower percentile is a 1 − α one-sided lower CL for η, and the upper one is a 1 − α one-sided upper confidence limit for η. If (L, U) is a 1 − 2α CI for η, then (L/√(1 + L²), U/√(1 + U²)) is a 1 − 2α CI for ρ.
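Algorithm 1, together with the η → ρ back-transformation just described, can be sketched as follows (the function name, defaults and return convention are ours):

```python
import numpy as np

def generalized_ci_rho(x1, x2, extra_x1, alpha=0.025, runs=10_000, seed=1):
    """Generalized CI for rho from n complete pairs (x1, x2) and m extra
    observations on the first component, following Algorithm 1.
    Returns a (1 - 2*alpha) two-sided CI for rho."""
    x1, x2, extra_x1 = map(np.asarray, (x1, x2, extra_x1))
    n, m = len(x1), len(extra_x1)
    s11 = np.sum((x1 - x1.mean()) ** 2)
    s22 = np.sum((x2 - x2.mean()) ** 2)
    s12 = np.sum((x1 - x1.mean()) * (x2 - x2.mean()))
    r = s12 / np.sqrt(s11 * s22)
    r_star = r / np.sqrt(1.0 - r ** 2)
    q = (np.sum((extra_x1 - extra_x1.mean()) ** 2)
         + (1.0 / m + 1.0 / n) ** -1 * (x1.mean() - extra_x1.mean()) ** 2)
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(runs)
    chi_m = rng.chisquare(m, runs)
    chi_n1 = rng.chisquare(n - 1, runs)
    chi_n2 = rng.chisquare(n - 2, runs)
    g_eta = (r_star * np.sqrt((1.0 + q / s11) * chi_n2 / (chi_n1 + chi_m))
             - z / np.sqrt(chi_n1))                     # step 3 of Algorithm 1
    lo, hi = np.quantile(g_eta, [alpha, 1.0 - alpha])   # percentiles of G_eta
    return lo / np.sqrt(1 + lo ** 2), hi / np.sqrt(1 + hi ** 2)  # back to rho
```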

3.2. A test for simple correlation coefficient ρ

As η and ρ have a one-to-one relation, it is enough to develop a test for η. A generalized test variable for η is given by G^t_η = G_η − η. For testing H0 : η ≤ η0 vs. Ha : η > η0, the generalized p-value is given by

$$\sup_{H_0} P(G^t_\eta \le 0) = \sup_{H_0} P(G_\eta \le \eta) = P(G_\eta \le \eta_0) = P\left(\left(\eta_0 + \frac{Z}{\sqrt{\chi^2_{n-1}}}\right)\left(\frac{\chi^2_{n-2}}{\chi^2_{n-1} + \chi^2_m}\right)^{-1/2} \ge r^*\sqrt{1 + \frac{q}{s_{11}}}\right), \quad (17)$$

and the generalized test rejects H0 whenever the above generalized p-value is less than the nominal level α. It is easy to check that the generalized p-value based only on n complete pairs is given by (17) with χ²_m and √(1 + q/s11) removed. It has been shown in the Appendix that the generalized p-value has a uniform(0, 1) distribution, and so the generalized test is exact. As a result, the generalized CI for η (equivalently, the generalized CI for ρ) described in the preceding section is also exact.
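A Monte Carlo estimate of the generalized p-value in (17) can be sketched as follows (function name ours); it reuses the same summary statistics as Algorithm 1:

```python
import numpy as np

def generalized_pvalue_eta(r, s11, q, n, m, eta0, runs=100_000, seed=1):
    """Monte Carlo estimate of the generalized p-value in Eq. (17) for
    H0: eta <= eta0 vs. Ha: eta > eta0, where eta = rho/sqrt(1 - rho^2)."""
    r_star = r / np.sqrt(1.0 - r ** 2)
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(runs)
    chi_m = rng.chisquare(m, runs)
    chi_n1 = rng.chisquare(n - 1, runs)
    chi_n2 = rng.chisquare(n - 2, runs)
    lhs = (eta0 + z / np.sqrt(chi_n1)) * np.sqrt((chi_n1 + chi_m) / chi_n2)
    rhs = r_star * np.sqrt(1.0 + q / s11)
    return np.mean(lhs >= rhs)   # proportion of runs satisfying Eq. (17)
```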

4. Comparison between two overlapping dependent correlations

Let us now describe a generalized variable approach for comparing two overlapping dependent correlation coefficients ρ21 and ρ31 based on the monotone sample in (1) with p = 3. To express the MLE of Σ, we partition the data matrix (1) as in [20]:

$$X_1 = \left(X_{11}, \ldots, X_{1n_3}, \ldots, X_{1n_2}, \ldots, X_{1n_1}\right)_{1\times n_1}, \qquad X_2 = \begin{pmatrix} X_{11}, \ldots, X_{1n_2} \\ X_{21}, \ldots, X_{2n_2} \end{pmatrix}_{2\times n_2},$$

$$X_3 = \begin{pmatrix} X_{11}, \ldots, X_{1n_3} \\ X_{21}, \ldots, X_{2n_3} \\ X_{31}, \ldots, X_{3n_3} \end{pmatrix}_{3\times n_3}. \quad (18)$$

That is, Xl is the submatrix of (1) formed by the first nl columns and the first l rows, l = 1, 2, 3. Let X̄l and Sl denote, respectively, the sample mean vector and the sums of squares and products matrix based on Xl, l = 1, 2, 3. We partition these means and matrices accordingly as follows:

$$\bar{X}_1 = \bar{X}^{(1)}_1, \qquad \bar{X}_2 = \begin{pmatrix} \bar{X}^{(1)}_2 \\ \bar{X}^{(2)}_2 \end{pmatrix}, \qquad \bar{X}_3 = \begin{pmatrix} \bar{X}^{(1)}_3 \\ \bar{X}^{(2)}_3 \\ \bar{X}^{(3)}_3 \end{pmatrix},$$

$$S_1 = S^{(1,1)}_1, \qquad S_2 = \begin{pmatrix} S^{(1,1)}_2 & S^{(1,2)}_2 \\ S^{(2,1)}_2 & S^{(2,2)}_2 \end{pmatrix} \qquad \text{and} \qquad S_3 = \begin{pmatrix} S^{(1,1)}_3 & S^{(1,2)}_3 & S^{(1,3)}_3 \\ S^{(2,1)}_3 & S^{(2,2)}_3 & S^{(2,3)}_3 \\ S^{(3,1)}_3 & S^{(3,2)}_3 & S^{(3,3)}_3 \end{pmatrix}. \quad (19)$$

Let

$$\hat{b}_{21} = S^{(2,1)}_2\left(S^{(1,1)}_2\right)^{-1}, \qquad (\hat{b}_{31}, \hat{b}_{32}) = \left(S^{(3,1)}_3, S^{(3,2)}_3\right)\begin{pmatrix} S^{(1,1)}_3 & S^{(1,2)}_3 \\ S^{(2,1)}_3 & S^{(2,2)}_3 \end{pmatrix}^{-1},$$

$$\hat{\sigma}_{2.1} = \frac{1}{n_2}\left[S^{(2,2)}_2 - S^{(2,1)}_2\left(S^{(1,1)}_2\right)^{-1}S^{(1,2)}_2\right],$$


and

$$\hat{\sigma}_{3.21} = \frac{1}{n_3}\left[S^{(3,3)}_3 - \left(S^{(3,1)}_3, S^{(3,2)}_3\right)\begin{pmatrix} S^{(1,1)}_3 & S^{(1,2)}_3 \\ S^{(2,1)}_3 & S^{(2,2)}_3 \end{pmatrix}^{-1}\begin{pmatrix} S^{(1,3)}_3 \\ S^{(2,3)}_3 \end{pmatrix}\right].$$

The MLEs are given by

$$\hat{\mu}_1 = \bar{X}_1, \quad \hat{\mu}_2 = \bar{X}^{(2)}_2 - \hat{b}_{21}(\bar{X}^{(1)}_2 - \hat{\mu}_1), \quad \hat{\mu}_3 = \bar{X}^{(3)}_3 - \hat{b}_{31}(\bar{X}^{(1)}_3 - \hat{\mu}_1) - \hat{b}_{32}(\bar{X}^{(2)}_3 - \hat{\mu}_2),$$

$$\hat{\Sigma} = \begin{pmatrix} \hat{\sigma}_{11} & - & - \\ \hat{\sigma}_{21} & \hat{\sigma}_{22} & - \\ \hat{\sigma}_{31} & \hat{\sigma}_{32} & \hat{\sigma}_{33} \end{pmatrix} = \begin{pmatrix} \frac{1}{n_1}S^{(1,1)}_1 & - & - \\ \hat{b}_{21}\hat{\sigma}_{11} & \hat{\sigma}_{2.1} + \hat{b}_{21}\hat{\sigma}_{12} & - \\ \hat{b}_{31}\hat{\sigma}_{11} + \hat{b}_{32}\hat{\sigma}_{21} & \hat{b}_{31}\hat{\sigma}_{12} + \hat{b}_{32}\hat{\sigma}_{22} & \hat{\sigma}_{3.21} + \hat{b}_{31}\hat{\sigma}_{13} + \hat{b}_{32}\hat{\sigma}_{23} \end{pmatrix}. \quad (20)$$
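The covariance part of (20) can be sketched in numpy as follows (function and argument names are ours). A useful sanity check: when no data are missing (n1 = n2 = n3), the estimate reduces to S3/n3.

```python
import numpy as np

def mle_trivariate_monotone(X1, X2, X3):
    """MLE of Sigma per Eq. (20) from a trivariate monotone sample:
    X1 is the 1 x n1 first row, X2 the 2 x n2 block, X3 the 3 x n3 block
    of data matrix (1).  Function and argument names are ours."""
    def ssp(M):
        # sums of squares and products matrix about the column means
        C = M - M.mean(axis=1, keepdims=True)
        return C @ C.T
    X1 = np.atleast_2d(X1)
    n1, n2, n3 = X1.shape[1], X2.shape[1], X3.shape[1]
    S1, S2, S3 = ssp(X1), ssp(X2), ssp(X3)
    b21 = S2[1, 0] / S2[0, 0]
    b3 = S3[2, :2] @ np.linalg.inv(S3[:2, :2])           # (b31, b32)
    sig2_1 = (S2[1, 1] - S2[1, 0] ** 2 / S2[0, 0]) / n2
    sig3_21 = (S3[2, 2] - S3[2, :2] @ np.linalg.inv(S3[:2, :2]) @ S3[:2, 2]) / n3
    sig = np.empty((3, 3))
    sig[0, 0] = S1[0, 0] / n1
    sig[1, 0] = b21 * sig[0, 0]
    sig[1, 1] = sig2_1 + b21 * sig[1, 0]
    sig[2, 0] = b3[0] * sig[0, 0] + b3[1] * sig[1, 0]
    sig[2, 1] = b3[0] * sig[1, 0] + b3[1] * sig[1, 1]
    sig[2, 2] = sig3_21 + b3[0] * sig[2, 0] + b3[1] * sig[2, 1]
    sig[0, 1], sig[0, 2], sig[1, 2] = sig[1, 0], sig[2, 0], sig[2, 1]
    return sig
```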

Let Ŵ be the Cholesky factor of Σ̂, and let w be an observed value of Ŵ. Let Σ̂* be the MLE based on a sample with monotone pattern (1) (with p = 3) from a N3(0, I) distribution, and let T* be the Cholesky factor of Σ̂*. Then, a GPQ for θ, the Cholesky factor of Σ, is given by

$$A = wV^{-1}, \quad \text{where } V \text{ is distributed as } T^*. \quad (21)$$

Note that, for a given w, the distribution of A does not depend on any unknown parameters, and so its distribution can be evaluated empirically.

A GPQ for ρ21 − ρ31 can be obtained from the general expression (4), and is given by

$$G_{\rho_{21}-\rho_{31}} = G_{\rho_{21}} - G_{\rho_{31}} = \frac{a_{21}}{\sqrt{a_{21}^2 + a_{22}^2}} - \frac{a_{31}}{\sqrt{a_{31}^2 + a_{32}^2 + a_{33}^2}}. \quad (22)$$

The lower 100α percentile and the upper 100α percentile of Gρ21 − Gρ31 form a 1 − 2α confidence interval for ρ21 − ρ31. These percentiles can be estimated by Monte Carlo simulation as outlined in Algorithm 2.

Algorithm 2.

1. For a given monotone sample with sizes n1, n2 and n3, compute the MLE Σ̂ and its Cholesky factor w.
2. Generate a monotone sample of the form (1) from a N3(0, I) distribution.
3. Compute the MLE Σ̂* based on the sample in step 2, and find its Cholesky factor V so that VV′ = Σ̂*.
4. Compute A = wV⁻¹.
5. Set G_{ρ21−ρ31} = a21/√(a21² + a22²) − a31/√(a31² + a32² + a33²), where aij is the (i, j)th element of A.
6. Repeat steps 2–5 a large number of times, say, 10,000.

The 100α and 100(1 − α) percentiles of the 10,000 GPQs form a 1 − 2α confidence interval for ρ21 − ρ31. The proportion of GPQs that are greater than zero is an estimate of the generalized p-value for testing the hypotheses H0 : ρ21 ≥ ρ31 vs. Ha : ρ21 < ρ31.

5. Comparison of CIs based on incomplete pairs with those based on samples after subjectwise deletion

To judge the gain in efficiency of the estimates based on n complete pairs and m additional observations on the first component, we estimate the bias and the mean squared error of ρ̂ = σ̂12/√(σ̂11σ̂22), where the σ̂ij's are the MLEs in (7). The estimated bias and MSE are given in Table 1 for n = 10 and 30, and some values of m including zero. Note that m = 0 corresponds to the point estimate based on the sample with no additional data on the first component. We observe in Table 1 that the use of additional data in fact produces inefficient estimates when ρ = 0. However, when ρ > 0, the bias decreases with increasing m. The MSE increases with increasing m for ρ ≤ 0.5, and decreases with increasing m for ρ ≥ 0.6. In general, we see that, for moderate to large values of ρ, both bias and MSE decrease as m increases, and so there seem to be some benefits of using additional data.

To assess the loss of precision of CIs based only on complete pairs, we estimate their expected widths and compare them with those of CIs based on incomplete samples, that is, with m additional observations on the first component. The expected widths are estimated as follows. We first generated 2500 statistics s ∼ W2(n − 1, Σ) and q ∼ σ11χ²_m, assuming that σ11 = σ22 = 1 and σ12 = ρ. As the estimation procedure is scale invariant, without loss of generality, we can assume ρ ≥ 0 for evaluating expected widths. For each set (s, q) generated, we used 5000 simulation runs to find the 95% generalized CI for ρ. The average width of these 2500 CIs is a Monte Carlo estimate of the expected width at the assumed values of ρ and sample size. The estimated expected widths of 95% CIs are given in Table 2 for values of ρ ranging from 0 to 0.95. For the same sample size and parameter configurations, we also estimated expectations of 95% upper CLs, and they are presented in Table 3.

We observe from the estimated values in Tables 2 and 3 that, for fixed n, the expected width or the expectation of the upper CL remains essentially the same with increasing m. In some cases (e.g., n = 40 and ρ ≥ 0.7 in Table 2), we see some improvements.


Table 1
Bias and MSE (in parentheses) of ρ̂ = σ̂12/√(σ̂11σ̂22) based on n complete pairs and m additional observations on the first component.

        n = 10                                                          n = 30
ρ       m = 0               m = 5               m = 20                  m = 0               m = 10              m = 30
0       −0.00034 (0.11196)  −0.00032 (0.12043)  −0.00197 (0.12866)      0.00003 (0.03457)   0.00004 (0.03540)   0.00062 (0.03601)
0.10    −0.00723 (0.10956)  −0.00606 (0.11774)  −0.00468 (0.12492)      −0.00161 (0.03367)  −0.00111 (0.03439)  0.00038 (0.03552)
0.20    −0.01779 (0.10470)  −0.01387 (0.11219)  −0.00788 (0.11925)      −0.00390 (0.03223)  −0.00247 (0.03272)  0.00007 (0.03295)
0.30    −0.03190 (0.09652)  −0.02334 (0.10284)  −0.01477 (0.10869)      −0.00925 (0.02897)  −0.00631 (0.02926)  −0.00198 (0.02938)
0.40    −0.05027 (0.08581)  −0.03545 (0.09030)  −0.02241 (0.09431)      −0.01461 (0.02504)  −0.00955 (0.02497)  −0.00355 (0.02520)
0.50    −0.07142 (0.07247)  −0.04869 (0.07476)  −0.02799 (0.07593)      −0.02133 (0.02032)  −0.01356 (0.01996)  −0.00496 (0.01965)
0.60    −0.09455 (0.05645)  −0.06207 (0.05642)  −0.02959 (0.05528)      −0.02824 (0.01504)  −0.01746 (0.01441)  −0.00518 (0.01394)
0.70    −0.11587 (0.03881)  −0.07199 (0.03687)  −0.02907 (0.03530)      −0.03478 (0.00995)  −0.02079 (0.00928)  −0.00491 (0.00865)
0.80    −0.13424 (0.02181)  −0.07861 (0.01888)  −0.02459 (0.01647)      −0.03979 (0.00507)  −0.02275 (0.00453)  −0.00392 (0.00405)
0.90    −0.14548 (0.00704)  −0.08088 (0.00515)  −0.01871 (0.00378)      −0.04301 (0.00148)  −0.02390 (0.00126)  −0.00299 (0.00106)
0.95    −0.15181 (0.00206)  −0.08151 (0.00128)  −0.01453 (0.00084)      −0.04472 (0.00039)  −0.02442 (0.00032)  −0.00227 (0.00027)

Table 2
Expected widths of CIs for ρ based on n complete pairs and of those based on n complete pairs and m additional observations on the first component.

        n = 10                  n = 30                  n = 40                  n = 50
ρ       m = 0 5    10   20      0    10   15   20       0    10   20   30       0    10   20   30
0       1.12  1.12 1.12 1.12    0.69 0.69 0.69 0.69     0.60 0.60 0.60 0.60     0.54 0.54 0.54 0.54
0.10    1.12  1.11 1.11 1.11    0.68 0.68 0.68 0.68     0.60 0.60 0.60 0.60     0.54 0.54 0.54 0.54
0.20    1.10  1.09 1.10 1.10    0.67 0.67 0.66 0.67     0.58 0.58 0.58 0.58     0.52 0.52 0.52 0.52
0.30    1.07  1.07 1.07 1.07    0.64 0.64 0.64 0.64     0.56 0.55 0.55 0.55     0.50 0.50 0.50 0.49
0.40    1.02  1.02 1.03 1.02    0.60 0.59 0.59 0.59     0.52 0.52 0.52 0.52     0.46 0.46 0.46 0.46
0.50    0.97  0.96 0.96 0.95    0.54 0.54 0.54 0.54     0.47 0.46 0.46 0.46     0.42 0.42 0.42 0.41
0.60    0.87  0.87 0.87 0.87    0.48 0.47 0.47 0.47     0.41 0.40 0.40 0.40     0.37 0.36 0.36 0.36
0.70    0.76  0.76 0.76 0.76    0.39 0.38 0.38 0.38     0.34 0.32 0.29 0.29     0.30 0.29 0.29 0.29
0.80    0.60  0.60 0.59 0.59    0.28 0.27 0.27 0.27     0.24 0.23 0.20 0.20     0.21 0.21 0.20 0.20
0.90    0.38  0.36 0.35 0.34    0.16 0.15 0.15 0.15     0.13 0.12 0.11 0.11     0.12 0.11 0.11 0.11
0.95    0.22  0.20 0.19 0.19    0.08 0.08 0.08 0.08     0.07 0.06 0.06 0.06     0.06 0.06 0.06 0.05

Table 3
Expectations of 95% upper CLs based on n complete pairs and of those based on n complete pairs and m additional observations on the first component.

        n = 10                  n = 30                  n = 40                  n = 50
ρ       m = 0 5    10   20      0    10   15   20       0    10   20   30       0    10   20   30
0       0.49  0.49 0.47 0.48    0.29 0.29 0.29 0.29     0.25 0.25 0.26 0.25     0.23 0.23 0.23 0.23
0.10    0.55  0.55 0.55 0.54    0.38 0.38 0.38 0.37     0.34 0.34 0.34 0.34     0.31 0.31 0.32 0.32
0.20    0.61  0.60 0.60 0.60    0.46 0.46 0.45 0.45     0.43 0.43 0.43 0.42     0.41 0.41 0.40 0.40
0.30    0.67  0.66 0.66 0.67    0.54 0.53 0.53 0.53     0.51 0.51 0.51 0.51     0.49 0.49 0.49 0.49
0.40    0.73  0.72 0.72 0.71    0.61 0.61 0.61 0.61     0.59 0.59 0.58 0.58     0.57 0.57 0.57 0.57
0.50    0.77  0.77 0.76 0.75    0.68 0.68 0.68 0.68     0.66 0.66 0.66 0.66     0.65 0.65 0.64 0.65
0.60    0.83  0.82 0.81 0.81    0.75 0.75 0.75 0.75     0.74 0.73 0.73 0.73     0.72 0.72 0.72 0.72
0.70    0.87  0.86 0.86 0.86    0.82 0.81 0.81 0.81     0.81 0.80 0.80 0.80     0.80 0.79 0.79 0.79
0.80    0.92  0.91 0.91 0.90    0.88 0.88 0.88 0.88     0.87 0.87 0.87 0.87     0.87 0.86 0.86 0.86
0.90    0.96  0.96 0.95 0.95    0.94 0.94 0.94 0.94     0.94 0.94 0.93 0.93     0.93 0.93 0.93 0.93
0.95    0.98  0.98 0.98 0.98    0.97 0.97 0.97 0.97     0.97 0.97 0.97 0.97     0.97 0.97 0.97 0.97

However, on an overall basis, we see that additional data on one of the components does not offer noticeable improvement in estimating the correlation coefficient. Note that the comparison of expected widths indicates that the power properties of the test based only on complete pairs and those of the test based on incomplete samples should be similar.

To judge the accuracy of the GV procedure for comparing two dependent overlapping correlation coefficients, we estimate the coverage probabilities of the generalized CIs by Monte Carlo simulation. The estimated coverage probabilities of 95% CIs for ρ21 − ρ31 based on a monotone sample of the form

$$\begin{array}{ccccccc} X_{11} & \ldots & X_{1n_3} & \ldots & X_{1n_2} & \ldots & X_{1n_1} \\ X_{21} & \ldots & X_{2n_3} & \ldots & X_{2n_2} & & \\ X_{31} & \ldots & X_{3n_3} & & & & \end{array} \quad (23)$$

are given in Table 4 for some values of (ρ21, ρ31, ρ32) and sample sizes ranging from small to large. The estimated coverage probabilities in Table 4 are close to the nominal level 0.95, except for a few cases where they are between 0.93 and 0.94. Overall, the GV approach for constructing CIs for the difference between two dependent correlation coefficients seems to be very satisfactory.

We also evaluated expected widths of CIs for ρ21 − ρ31 based on (a) incomplete samples (i.e., all sample observations in (23)), (b) pairwise deletion (i.e., monotone pattern (23) with x1,n2+1, . . . , x1,n1 removed), and (c) subjectwise deletion


Table 4
Coverage probabilities of 95% generalized CIs for ρ21 − ρ31.

(ρ21, ρ31, ρ32)     (n1, n2, n3)
                    (15, 12, 9)   (25, 20, 15)  (30, 20, 10)  (40, 30, 25)  (50, 40, 35)
(−0.1, 0.4, 0.4)    0.947         0.946         0.948         0.950         0.949
(0.2, 0.5, 0.5)     0.941         0.951         0.947         0.951         0.949
(0.1, 0.5, 0.5)     0.946         0.939         0.950         0.948         0.950
(0.5, 0.5, 0.5)     0.934         0.940         0.942         0.944         0.945
(0.3, 0.3, 0.4)     0.935         0.938         0.939         0.940         0.947
(0.4, 0.6, 0.6)     0.948         0.942         0.951         0.953         0.948
(0.5, 0.8, 0.8)     0.950         0.948         0.950         0.954         0.950

Table 5
Expected widths of 95% CIs for ρ21 − ρ31 based on (a) incomplete samples, (b) pairwise deletion and (c) subjectwise deletion.

(ρ21, ρ31, ρ32)     (15, 12, 9)         (25, 20, 15)        (30, 20, 10)        (50, 40, 35)
                    (a)   (b)   (c)     (a)   (b)   (c)     (a)   (b)   (c)     (a)   (b)   (c)
(−0.1, 0.4, 0.4)    1.34  1.42  1.36    1.04  1.04  1.04    1.28  1.30  1.27    0.66  0.66  0.67
(0.2, 0.5, 0.5)     1.30  1.34  1.31    0.96  0.96  0.97    1.22  1.23  1.23    0.61  0.61  0.62
(0.1, 0.5, 0.5)     1.31  1.31  1.30    0.95  0.95  0.95    1.21  1.21  1.20    0.61  0.60  0.61
(0.5, 0.5, 0.5)     1.31  1.32  1.31    0.94  0.93  0.96    1.22  1.20  1.22    0.57  0.57  0.59
(0.3, 0.3, 0.3)     1.62  1.61  1.62    1.17  1.17  1.16    1.44  1.46  1.43    0.74  0.74  0.75
(0.4, 0.6, 0.6)     1.19  1.19  1.19    0.84  0.84  0.87    1.08  1.07  1.11    0.52  0.52  0.53
(0.5, 0.8, 0.8)     0.80  0.81  0.83    0.57  0.58  0.60    0.68  0.68  0.69    0.36  0.36  0.37

Table 6
ATP levels in youngest and oldest sons.

Family  Youngest (x)  Oldest (y)
1       4.18          4.81
2       5.16          4.98
3       4.85          4.48
4       3.43          4.19
5       4.53          4.27
6       5.13          4.87
7       4.10          4.74
8       4.77          4.53
9       4.12          3.72
10      4.65          4.62
11      6.03          5.83
12      5.94          4.40*
13      5.99          4.87*
14      5.43          5.44*
15      5.00          4.70*
16      4.82          4.14*
17      5.25          5.30*

(i.e., based only on the 3 × n3 data matrix). These estimated expected widths are given in Table 5. Comparison of the expected widths in Table 5 indicates that the use of additional data does not offer appreciable improvement in precision. For instance, in the case of (n1, n2, n3) = (30, 20, 10) in Table 5, we see that the expected width based on 10 complete cases is 0.69 while the one based on 20 additional observations on the first component and 10 additional observations on the second component is 0.68; the reduction in expected width due to 30 additional observations is not appreciable.

6. Examples

Example 1. To illustrate the methods for estimating a correlation coefficient, we shall use the data given in [6, Example 9.3]. The data represent erythrocyte adenosine triphosphate (ATP) levels in youngest and oldest sons in 17 families. For easy reference, the data are given in Table 6. The ATP level is important because it determines the ability of the blood to carry energy to the cells of the body.

For this example, the correlation coefficient based on all 17 cases is r = 0.597. Using the GV approach, Krishnamoorthy and Xia [11] computed the 95% CI for the population correlation coefficient ρ as (0.156, 0.827). For the purpose of illustration, let us assume that the observations marked by ∗ in Table 6 are missing, so that n = 11 and m = 6. The required statistics are computed as follows. The sample correlation coefficient r based on the first 11 complete cases is 0.768, s11 = 4.753 and q = 3.480; thus r* = r/√(1 − r²) = 1.199. The 95% CI based on the first 11 complete pairs is


Table 7Simulated data from a trivariate normal distribution with Σ = (ρij); ρ21 = 0.1, ρ32 = 0.4, ρ31 = 0.7.

No.   x1      x2      x3       No.   x1      x2       x3       No.   x1      x2       x3
 1    0.07   −0.27    0.89     11   −0.22   −1.60    −0.52*    21   −0.42    0.12*   −0.76*
 2   −0.66    0.53   −0.77     12    0.67   −0.65     0.91*    22   −0.60    0.37*   −0.73*
 3   −0.48    0.12   −1.28     13    0.89   −0.43     0.96*    23   −0.47   −0.79*   −0.47*
 4   −0.06   −0.09    0.11     14   −0.35    0.22     0.09*    24    0.55   −0.07*    0.44*
 5    1.94    1.57    2.15     15   −0.70    0.09     0.13*    25    0.80   −1.91*   −0.53*
 6   −0.33    0.70   −1.00     16    0.29    1.45     0.34*    26   −1.32    0.29*   −0.87*
 7    0.06   −0.02    0.30     17   −0.78    1.49     0.03*    27    1.86    1.20*    2.14*
 8    1.66   −0.25    1.26     18    0.38   −0.56    −0.67*    28   −0.63   −0.84*   −1.56*
 9   −0.41   −2.52   −0.76     19   −0.04   −0.99    −0.06*    29   −0.44   −0.47*   −0.10*
10    0.28   −0.42   −0.24     20   −0.76    0.14    −0.52*    30    0.08    1.62*    0.34*

* The entries are assumed missing.

(0.293, 0.927). The 95% CI for ρ based on the 11 complete pairs and an additional 6 observations on the x component (that is, cases 12–17) is (0.307, 0.920). Note that the latter CI is slightly shorter than the one based only on complete pairs, and they are in agreement with our earlier conclusion that they are expected to have approximately the same width. All CIs were computed using simulation with 100,000 runs.
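The simulation just described can be sketched in code. The following is a minimal illustration, not the author's program: the function name gv_ci_rho and the argument layout are ours, and the pivot is the one derived in the Appendix.

```python
import numpy as np

rng = np.random.default_rng(1)

def gv_ci_rho(r, n, m=0, q_over_s11=0.0, runs=100_000, level=0.95):
    """CI for rho from percentiles of the generalized pivot.

    r          -- sample correlation from the n complete pairs
    m          -- number of additional observations on one component
    q_over_s11 -- observed q/s11 from those additional observations
    """
    # Observed value of S21/sqrt(S11*S2.1) * sqrt(1 + Q/S11); note that
    # s21/sqrt(s11*s2.1) = r/sqrt(1 - r^2) = r*.
    v = r / np.sqrt(1.0 - r**2) * np.sqrt(1.0 + q_over_s11)
    z = rng.standard_normal(runs)
    c_n1 = rng.chisquare(n - 1, runs)          # chi^2_{n-1}
    c_n2 = rng.chisquare(n - 2, runs)          # chi^2_{n-2}
    c_m = rng.chisquare(m, runs) if m > 0 else 0.0
    g_eta = v * np.sqrt(c_n2 / (c_n1 + c_m)) - z / np.sqrt(c_n1)
    g_rho = g_eta / np.sqrt(1.0 + g_eta**2)    # back-transform eta -> rho
    a = (1.0 - level) / 2.0
    return np.quantile(g_rho, [a, 1.0 - a])

# Example 1 statistics: r = 0.768 on n = 11 complete pairs,
# q = 3.480, s11 = 4.753, m = 6 extra observations on x.
ci_complete = gv_ci_rho(0.768, n=11)                                   # paper: (0.293, 0.927)
ci_augmented = gv_ci_rho(0.768, n=11, m=6, q_over_s11=3.480 / 4.753)   # paper: (0.307, 0.920)
```

With m = 0 the pivot reduces to the complete-sample GPQ of [11].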

Example 2. To illustrate the GV approach for constructing a CI for the difference between two dependent correlation coefficients, we simulated a sample of 30 observations from a trivariate normal distribution with the correlation matrix given in Table 7. A monotone sample is created by assuming that the data marked by ∗ are missing. Note that 20 observations on X3 and 10 observations on X2 are missing. We shall find the 95% CI for the difference ρ31 − ρ21 based on (i) all 30 observations, (ii) the monotone sample with n1 = 30, n2 = 20 and n3 = 10, and (iii) the sample after subjectwise deletion, that is, only the first 10 subjects for whom no observation is missing. The statistics based on all 30 complete observations are

S = [ 18.345   2.569  16.708
       2.569  28.155   9.483
      16.708   9.483  23.538 ],

and the sample correlation coefficients are r21 = 0.1131 and r31 = 0.8040. Using the approach of [11], we computed the 95% generalized CI for ρ31 − ρ21 as (0.360, 1.03), which indicates that ρ31 is significantly larger than ρ21.

The necessary statistics for computing the MLE of Σ (see Section 5) based on the incomplete samples with n1 = 30, n2 = 20 and n3 = 10 are

S1 = 18.345,   S2 = [ 10.766   1.374
                       1.374  18.670 ],

S3 = [  7.124  3.028   7.897
        3.028  9.879   3.602
        7.897  3.602  10.927 ],                                    (24)

b21 = 0.1276, b31 = 1.0963, b32 = 0.0286, σ̂2.1 = 0.9247 and σ̂3.21 = 0.2167. The MLE is

Σ̂ = [ 0.611  0.078  0.673
      0.078  0.935  0.112
      0.673  0.112  0.957 ].
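The printed MLE can be reproduced from these regression-type statistics via the standard factorization for monotone data: σ11 from the full first-component sample, then the x2-on-x1 regression (b21, σ̂2.1), then the x3-on-(x1, x2) regression (b31, b32, σ̂3.21). A quick check (a sketch; the variable names are ours):

```python
import numpy as np

# Statistics printed above: S1/n1, then the two fitted regressions.
sigma11 = 18.345 / 30
b21, sigma2_1 = 0.1276, 0.9247
b31, b32, sigma3_21 = 1.0963, 0.0286, 0.2167

# Build Sigma-hat entry by entry from the factorization.
sigma21 = b21 * sigma11
sigma22 = sigma2_1 + b21 * sigma21            # = sigma2.1 + b21^2 * sigma11
sigma31 = b31 * sigma11 + b32 * sigma21
sigma32 = b31 * sigma21 + b32 * sigma22
b = np.array([b31, b32])
Sigma2 = np.array([[sigma11, sigma21],
                   [sigma21, sigma22]])
sigma33 = sigma3_21 + b @ Sigma2 @ b          # residual variance plus explained part

Sigma_hat = np.array([[sigma11, sigma21, sigma31],
                      [sigma21, sigma22, sigma32],
                      [sigma31, sigma32, sigma33]])
```

The entries of Sigma_hat agree with the printed Σ̂ to the three decimals shown.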

The 95% CI for the difference ρ31 − ρ21 is (0.222, 1.36).

To compute the 95% CI for ρ31 − ρ21 based on the sample after subjectwise deletion, that is, based only on the first 10 subjects, we computed the correlation coefficients based on S3 as r21 = 0.3609, r31 = 0.8951 and r32 = 0.3467. Using these sample correlation coefficients and the GV procedure in [11], we found the 95% generalized CI for ρ31 − ρ21 to be (0.010, 1.19).

We observe from the above results that the CI based on all 30 observations, (0.360, 1.03), is narrower than the CI (0.222, 1.36) based on the incomplete monotone samples, and the latter is only slightly shorter than the CI (0.010, 1.19) based on the sample after subjectwise deletion. Nevertheless, all CIs lead to the same conclusion that ρ31 is significantly larger than ρ21.
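The complete-data interval in (i) can be reproduced by simulating a GPQ for Σ from the observed Cholesky factor of S, along the lines of [11]. The sketch below uses our own function names, and assumes the Bartlett construction for the Cholesky factor of a standard Wishart matrix; it is an illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)

def bartlett_lower(df, p):
    """Lower Cholesky factor of a Wishart(df, I_p) matrix (Bartlett decomposition)."""
    T = np.zeros((p, p))
    for i in range(p):
        T[i, i] = np.sqrt(rng.chisquare(df - i))   # T_ii^2 ~ chi^2_{df-i}
        T[i, :i] = rng.standard_normal(i)          # below-diagonal N(0, 1)
    return T

def gv_ci_rho31_minus_rho21(S, n, runs=20_000, level=0.95):
    """Generalized CI for rho31 - rho21 from complete trivariate data."""
    A = np.linalg.cholesky(S)                      # observed Cholesky factor of S
    diffs = np.empty(runs)
    for b in range(runs):
        R = bartlett_lower(n - 1, 3)
        L = np.linalg.solve(R.T, A.T).T            # L = A R^{-1}
        G = L @ L.T                                # GPQ draw for Sigma
        d = np.sqrt(np.diag(G))
        diffs[b] = G[2, 0] / (d[2] * d[0]) - G[1, 0] / (d[1] * d[0])
    a = (1 - level) / 2
    return np.quantile(diffs, [a, 1 - a])

S = np.array([[18.345,  2.569, 16.708],
              [ 2.569, 28.155,  9.483],
              [16.708,  9.483, 23.538]])
lo, hi = gv_ci_rho31_minus_rho21(S, n=30)          # paper reports (0.360, 1.03)
```

The lower limit staying above zero reproduces the conclusion that ρ31 is significantly larger than ρ21.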

7. Discussion

In this article, we have developed exact methods for testing hypotheses and computing CIs for a simple correlation coefficient based on incomplete samples. Computationally, the methods based on complete data and on incomplete samples are similar, but the use of additional data does not offer noticeable improvement in estimating or testing a simple correlation coefficient. Thus, even though additional data on one of the components are useful for finding an efficient estimate or test for a mean vector or for a variance–covariance matrix, they are not very useful in simple correlation analysis. Our Monte Carlo simulation studies also indicate similar results for comparing two dependent overlapping correlations.


The proposed methods can be readily extended to compare two independent correlation coefficients based on incomplete samples. Specifically, the GPQ for ρ1 − ρ2, where ρ1 and ρ2 are correlation coefficients associated with two independent bivariate normal distributions, is given by Gρ1−ρ2 = Gρ1 − Gρ2. Expressions for Gρ1 and Gρ2 can be obtained from (4). The CIs for the difference ρ1 − ρ2 based on this GPQ are not exact, but they are very satisfactory in terms of coverage probabilities. However, simulation studies (not reported here) indicated that the GV procedures based on incomplete samples and those based on complete cases perform very similarly, as in the one-sample problem in Section 3.
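A minimal sketch of this two-sample construction, using the complete-case pivot for each sample (the helper names are ours, and q = 0, m = 0 is assumed for each sample):

```python
import numpy as np

rng = np.random.default_rng(3)

def g_rho(r, n, runs):
    """Draws of the complete-sample generalized pivot for rho (see the Appendix)."""
    u = r / np.sqrt(1 - r**2)                  # observed r*
    z = rng.standard_normal(runs)
    c1 = rng.chisquare(n - 1, runs)
    c2 = rng.chisquare(n - 2, runs)
    g_eta = u * np.sqrt(c2 / c1) - z / np.sqrt(c1)
    return g_eta / np.sqrt(1 + g_eta**2)

def gv_ci_rho_diff(r1, n1, r2, n2, runs=100_000, level=0.95):
    """CI for rho1 - rho2 via G_{rho1-rho2} = G_{rho1} - G_{rho2}."""
    d = g_rho(r1, n1, runs) - g_rho(r2, n2, runs)
    a = (1 - level) / 2
    return np.quantile(d, [a, 1 - a])

# Sanity check: with equal sample correlations the CI should cover zero.
lo, hi = gv_ci_rho_diff(0.6, 25, 0.6, 25)
```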

The GV approach proposed in this article can also be applied to the related problem of comparing two dependent non-overlapping correlation coefficients ρ12 and ρ34. This problem has been addressed in [14,15,11] for the case of complete samples. A GPQ for any ρij based on incomplete monotone samples can be obtained using the expression for the MLE of Σ described in [20], following the lines of [11, Section 2]. However, obtaining generalized inference based on a monotone sample is quite involved, because it is difficult to express the GPQs for the ρij's explicitly in terms of observed statistics and other random variables whose distributions are free of parameters. The percentiles of the GPQ for ρ12 − ρ34 can be obtained by simulating monotone samples (with the missing pattern of the observed samples) from a multivariate normal distribution with µ = 0 and Σ = I. Based on the simulation results of this article, however, we expect that the improvement due to the additional data would be negligible.

Appendix

Note that the generalized p-value for testing H0: η ≤ η0 vs. Ha: η > η0, where η0 is a specified value of η = θ21/θ22 = ρ/√(1 − ρ²), is given by

P(Gη ≤ η0) = P( [s21/√(s11 s2.1)] √(1 + q/s11) [χ²_{n−2}/(χ²_{n−1} + χ²_m)]^{1/2} − Z/√(χ²_{n−1}) ≤ η0 )

           = P( [s21/√(s11 s2.1)] √(1 + q/s11) ≤ (η0 + Z/√(χ²_{n−1})) [χ²_{n−2}/(χ²_{n−1} + χ²_m)]^{−1/2} ).

Recall that s is an observed value of S and q is an observed value of Q, and so if we show that

[S21/√(S11 S2.1)] √(1 + Q/S11)   is distributed as   (η0 + Z/√(χ²_{n−1})) [χ²_{n−2}/(χ²_{n−1} + χ²_m)]^{−1/2},   (A.1)

then the probability integral transform implies that the generalized p-value follows a uniform(0, 1) distribution, and so the generalized test is exact.

To prove (A.1), let T be the Cholesky factor of S and θ be the Cholesky factor of Σ. Furthermore, let R be the Cholesky factor of a W2(n − 1, I) matrix, so that the Rij's are independent with

R²ii ∼ χ²_{n−i}, i = 1, 2, and R21 ∼ N(0, 1).   (A.2)

Since T is distributed as θR, we have Tii ∼ θiiRii, i = 1, 2, and T21 ∼ θ22(R21 + ηR11). Writing the Sij in terms of the Tij, and using the distributional results for the Tij, we see that

[S21/√(S11 S2.1)] √(1 + Q/S11) = (T21/T22) √(1 + χ²_m/χ²_{n−1}) ∼ [(R21 + R11η)/R22] √(1 + χ²_m/χ²_{n−1}).   (A.3)
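For completeness, the rearrangement can be spelled out. With R21 = Z ∼ N(0, 1), R11 = √(χ²_{n−1}) and R22 = √(χ²_{n−2}) as in (A.2),

[(R21 + R11η)/R22] √(1 + χ²_m/χ²_{n−1})
    = [(Z + η√(χ²_{n−1}))/√(χ²_{n−2})] · √((χ²_{n−1} + χ²_m)/χ²_{n−1})
    = (η + Z/√(χ²_{n−1})) [χ²_{n−2}/(χ²_{n−1} + χ²_m)]^{−1/2},

which is the right-hand side of (A.1) with η = η0.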

Using the distributional results for the Rij's in (A.2) and rearranging the terms, we prove (A.1).

To show that the generalized test based on a complete sample is exact, we simply ignore the additional m observations, and need only show that

S21/√(S11 S2.1) = T21/T22 ∼ (R21 + R11η)/R22.

The above distributional result follows from (A.3).

References

[1] T.W. Anderson, Maximum likelihood estimates for a multivariate normal distribution when some observations are missing, Journal of the American Statistical Association 52 (1957) 200–203.
[2] T.W. Anderson, An Introduction to Multivariate Statistical Analysis, Wiley, Hoboken, NJ, 1984.
[3] W.-Y. Chang, D.St.P. Richards, Finite-sample inference with monotone incomplete multivariate normal data, I, Journal of Multivariate Analysis 100 (2009) 1883–1899.
[4] W.-Y. Chang, D.St.P. Richards, Finite-sample inference with monotone incomplete multivariate normal data, II, Journal of Multivariate Analysis 101 (2010) 603–620.
[5] R.A. Fisher, On the probable error of a coefficient of correlation deduced from a small sample, Metron 1 (1921) 3–32.
[6] L.D. Fisher, G. Van Belle, Biostatistics: A Methodology for the Health Sciences, Wiley, Hoboken, NJ, 1993.
[7] J. Hao, K. Krishnamoorthy, Inferences on normal covariance matrix and generalized variance with incomplete data, Journal of Multivariate Analysis 78 (2001) 62–82.
[8] D.F. Heitjan, S. Basu, Distinguishing missing at random and missing completely at random, The American Statistician 50 (1996) 207–213.
[9] K. Krishnamoorthy, M. Pannala, Some simple test procedures for normal mean vector with incomplete data, Annals of the Institute of Statistical Mathematics 50 (1998) 531–542.
[10] K. Krishnamoorthy, M. Pannala, Confidence estimation of normal mean vector with incomplete data, Canadian Journal of Statistics 27 (1999) 395–407.
[11] K. Krishnamoorthy, Y. Xia, Inferences on correlation coefficients: one-sample, independent and correlated cases, Journal of Statistical Planning and Inference 137 (2007) 2362–2379.
[12] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, Wiley, Hoboken, NJ, 2002.
[13] G.B. Lu, J.B. Copas, Missing at random, likelihood ignorability and model completeness, Annals of Statistics 32 (2004) 754–765.
[14] X.L. Meng, R. Rosenthal, D.B. Rubin, Comparing correlated correlation coefficients, Psychological Bulletin 111 (1992) 172–175.
[15] J.J. Neill, O.J. Dunn, Equality of dependent correlation coefficients, Biometrics 31 (1975) 531–543.
[16] I. Olkin, J.D. Finn, Testing correlated correlations, Psychological Bulletin 108 (1990) 330–333.
[17] I. Olkin, J.D. Finn, Correlations redux, Psychological Bulletin 118 (1995) 155–164.
[18] D. Sharma, K. Krishnamoorthy, Improved minimax estimators of normal covariance and precision matrices from incomplete samples, Calcutta Statistical Association Bulletin 34 (1985) 23–42.
[19] S. Weerahandi, Exact Statistical Methods for Data Analysis, Springer-Verlag, New York, 1995.
[20] J. Yu, K. Krishnamoorthy, M.K. Pannala, Two-sample inference for normal mean vectors based on monotone missing data, Journal of Multivariate Analysis 97 (2006) 2162–2176.