
Proc. Natl. Acad. Sci. USA
Vol. 78, No. 5, pp. 2664-2668, May 1981
Statistics

Sibling and parent-offspring correlation estimation with variable family size

(familial correlations/human genetics/demographics)

SAMUEL KARLIN, EDWARD C. CAMERON, AND PAUL T. WILLIAMS

Department of Mathematics, Stanford University, Stanford, California 94305

Contributed by Samuel Karlin, January 21, 1981

ABSTRACT A method for estimating intrafamilial correlations under variable family sizes involving weightings of paired data points is introduced. Classical methods of intrafamilial correlation estimation and those in current use are outlined and critically analyzed. Extensions of the proposed estimation method to more general data structures are delineated.

A problem of interest with regard to nuclear-familial data concerns the assessment of similarity or degree of resemblance among family members. A commonly used approach to evaluating similarity is via correlation analysis. There is considerable recent literature pertaining to both the theoretical and practical problem of estimating correlations among relatives meaningfully and efficiently, particularly under the complication of varying family size and structure (1-5). Two main categories of estimators are usually employed in the correlation analysis of nuclear family data, those for between-sibling correlations and those for parent-offspring correlations.

One widely used statistic for the assessment of sib-sib similarity is the intraclass correlation coefficient dating from Fisher (6) and extended to handle unequal family sizes by way of equating certain variance components to their corresponding expectations (7). This statistic, henceforth called the ANOVA estimate, can produce negative values inconsistent with the modeling assumption that it is determined as a ratio of variance estimates (see Section 2 below). Other approaches involve maximum likelihood estimators (MLEs), usually in the framework of multivariate normal models, correlation calculations based on randomly selected sib pairs, and Bayesian methods. Because the maximum likelihood estimate of sibling correlation subject to variable family size does not yield an explicit formula, its implementation and application are quite formidable; Donner and Koval (2) discuss some algorithms for approximation of the MLE. Another procedure in current use (and quite popular in corresponding econometrics contexts) is to calculate sibling correlation stratified on family size and to combine the estimates with weights proportional to the inverse of the variance estimates for these classes. This is partly by analogy with testing independence in contingency tables. It is difficult to assess its relevance and reliability. Finally, with respect to parent-offspring similarity, three interclass correlations known as the pairwise, sib-mean, and random sib methods are reviewed by Rosner et al. (1) along with a proposed "ensemble" statistic. Advantages and limitations of these methods are discussed in Section 2.

Approaching the problem from a perspective that is fundamentally different from those summarized above, we present here an internally consistent and flexible approach to the general problem of correlation estimation that is based on minimal modeling assumptions, lends intuitive comprehensibility to the meaning of the proposed estimates, is simple to implement, especially for variable family sizes, and agrees with the maximum likelihood result for a multivariate normal model under constant family size. We promulgate this approach in a general theoretical framework of correlation estimation between sets delineated in Section 5. The technique is applied in Section 3 to the problem of correlation estimation in the context of the inherent structure of familial data. In the case of sibling correlations, we provide a series of statistics that separately emphasize contributions of the family versus those of the individual, and compared with contributions accenting sib-pair units. Extensions of our approach to take account of demographic conditions, environmental factors, and natural stratifications of the underlying population base are also briefly considered.

2. Correlation estimation on nuclear family data

We first stipulate the basic assumptions that underlie all of the correlational methods to be discussed. Trait values are assumed to be independently distributed among families. Moreover, all offspring within a family are taken as equivalent to each other with respect to offspring-offspring resemblance. When required it is assumed that the data have been first adjusted for the effects of age, sex, birth order, and other concomitant variables.

A standard description of the variables and parameters of the population model can be set forth in terms of first and second moments as follows. We treat the case of one parent per family for ease of exposition. Let the trait values of family $i$ of size $k_i$ be $(y_i, x_{i1}, x_{i2}, \ldots, x_{i,k_i})$ with parental value $y_i$ and offspring values $x_{i1}, x_{i2}, \ldots, x_{i,k_i}$. We assume the families share mean and variance population values such that

$$E[y_i] = \mu_p,\quad E[x_{i\nu}] = \mu_s,\quad \mathrm{Var}(y_i) = \sigma_p^2,\quad \mathrm{Var}(x_{i\nu}) = \sigma_s^2, \quad [1]$$

$$\mathrm{Corr}(y_i, x_{i\nu}) = \rho_{ps}\ \ (\text{independent of } i \text{ and } \nu = 1,\ldots,k_i),$$

$$\mathrm{Corr}(x_{i\nu}, x_{i\mu}) = \rho_{ss}\ \ (\text{independent of } i \text{ and } \nu \neq \mu = 1,2,\ldots,k_i).$$

Thus, all parents have equal means and variances, as do offspring, but those of the latter typically differ from those of the former, while all parent-offspring correlations are equal and all sibling correlations are equal as well but differ in general from those between parents and offspring. This model, with all the moment assumptions, is very general, and more restrictive distributional or underlying structural assumptions are often embedded within this context to obtain desired correlation estimates.
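One consequence of the moment model of Eq. 1 is worth writing out, because the behavior of sibship means drives several of the estimators reviewed next. The step below is a sketch derived only from the stated first and second moments; the symbol $\bar{x}_i$ (the mean of the $k_i$ offspring values in family $i$) is introduced here for illustration.

```latex
\operatorname{Cov}(x_{i\nu}, x_{i\mu}) = \rho_{ss}\,\sigma_s^2 \quad (\nu \neq \mu),
\qquad
\operatorname{Var}(\bar{x}_i)
  = \frac{1}{k_i^{2}}\Bigl[k_i\,\sigma_s^{2} + k_i(k_i-1)\,\rho_{ss}\,\sigma_s^{2}\Bigr]
  = \frac{\sigma_s^{2}}{k_i}\Bigl[1 + (k_i-1)\,\rho_{ss}\Bigr].
```

Because a variance cannot be negative, the model can hold only if $\rho_{ss} \geq -1/(k_i - 1)$ for every family size present, so large sibships limit how negative a sibling correlation can be.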

Sib-Sib Correlation Estimation (Classical and Related Methods). To provide perspective on our approach, we review some of the classical methods and include some brief remarks on their advantages and disadvantages.

Abbreviations: ANOVA, analysis of variance; MLE, maximum likelihood estimator; W.S.V., within-sibship variance statistic; B.S.V., between-sibship variance statistic; T.V.*, total variance.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.

(i) The ANOVA estimate. This is the sibling correlation estimate in most common use in medical genetic epidemiology literature (e.g., refs. 8-12). This method is strongly tied to normal theory and unbiased estimation in the following way: For $x_{ij}$, $i = 1,\ldots,n$ ($n$ is the number of sampled families); $j = 1,\ldots,k_i$, $k_i \geq 2$ ($k_i$ is the number of sibs in the $i$th family) from a one-way ANOVA model of type I, i.e., $x_{ij} = \mu + A_i + e_{ij}$ for $\mu$, $A_i$, $i = 1,\ldots,n$ fixed constants and $e_{ij}$ independently distributed $N(0,\sigma_e^2)$, we have

Within-sibship variance statistic

$$= \text{W.S.V.} = \frac{1}{\sum_{i=1}^{n}(k_i - 1)} \sum_{i=1}^{n}\sum_{j=1}^{k_i} (x_{ij} - \bar{x}_i)^2 \quad [2a]$$

Between-sibship variance statistic

$$= \text{B.S.V.} = \frac{1}{n-1}\sum_{i=1}^{n} k_i(\bar{x}_i - \bar{x})^2 \quad [2b]$$

$$\text{where}\quad \bar{x}_i = \frac{1}{k_i}\sum_{j=1}^{k_i} x_{ij},\qquad \bar{x} = \frac{1}{\sum_{i=1}^{n} k_i}\sum_{i=1}^{n}\sum_{j=1}^{k_i} x_{ij}, \quad [2c]$$

which are the residual variance (normalized by degrees of freedom) of the best least-squares fit for the model $x_{ij} = \mu + A_i$ and the (normalized) difference between this residual variance and the residual variance for the model $x_{ij} = \mu$, respectively.

Under this model we know on the basis of normal linear theory (13) that (B.S.V./W.S.V.) = $F$ is distributed according to an $F$-distribution with parameters $n - 1$ and $\sum_{i=1}^{n}(k_i - 1)$ under the hypothesis $A_i = 0$, $i = 1,\ldots,n$.

To obtain a sibling correlation estimator the $\{x_{ij}\}$ are considered in terms of an ANOVA model of type II (with variable family effects), i.e., $x_{ij} = \mu + A_i + e_{ij}$, where $\mu$ is a constant but $A_i$ and $e_{ij}$ are independent random variables for all $i$ and $j$ distributed in the manner $A_i \sim N(0,\sigma_A^2)$, $e_{ij} \sim N(0,\sigma_e^2)$, and then

$$\rho_{ss} = \frac{\mathrm{Cov}(x_{ij}, x_{ij'})}{\mathrm{Var}(x_{ij})} = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_e^2}. \quad [3]$$

The estimates of $\sigma_A^2$ and $\sigma_e^2$ are obtained by Henderson's method (13) by equating B.S.V. and W.S.V. to their expectations under the ANOVA model II and solving the resulting two linear equations in two unknowns. Here $E[\text{W.S.V.}] = \sigma_e^2$ and $E[\text{B.S.V.}] = k_0\sigma_A^2 + \sigma_e^2$, where

$$k_0 = \frac{1}{n-1}\left(\sum_{i=1}^{n} k_i - \frac{\sum_{i=1}^{n} k_i^2}{\sum_{i=1}^{n} k_i}\right) \geq 2 \quad \text{for } k_i \geq 2. \quad [4]$$

Substituting into Eq. 3 the appropriate unbiased linear combinations for $\hat\sigma_A^2$ and $\hat\sigma_e^2$ conforming with Eq. 4, we obtain

$$\hat\rho_{ss} = \frac{\hat\sigma_A^2}{\hat\sigma_A^2 + \hat\sigma_e^2} = \frac{\text{B.S.V.} - \text{W.S.V.}}{\text{B.S.V.} + (k_0 - 1)\,\text{W.S.V.}}. \quad [5]$$

There are several caveats in this approach to correlation estimation: First, it is not correct that B.S.V./W.S.V. is distributed $F$ in the case of the ANOVA model II except when $\sigma_A^2 = 0$, i.e., $\rho_{ss} = 0$ (13). Thus, although very directly tied to assumptions of normality, this method does not yield suitable hypothesis-testing procedures when $\rho_{ss} \neq 0$ (the normality assumption in the model II formulation can be lifted, but the statistics B.S.V. and W.S.V. of Eqs. 2, linear combinations of which are used in a method of moments, still are motivated from normal theory).

Second, the estimate is strongly tied to the structure of the underlying ANOVA model of additive effects, which is a special assumption for which there is no guarantee of applicability to, or appropriateness for, real data.

Third, it is possible to obtain W.S.V. > B.S.V., which entails $\hat\sigma_A^2 < 0$ and thus $\hat\rho_{ss} < 0$, contrary to the ANOVA model assumption of $\sigma_A^2 \geq 0$, i.e., $\hat\sigma_A^2$ is inadmissible. In practice $\hat\rho_{ss}$ is set equal to zero when $\hat\rho_{ss} < 0$ occurs, but this inconsistency in the model raises questions.

Finally, this $\hat\rho_{ss}$ does not reduce to the maximum likelihood estimate for the multivariate normal case when $k_i$ is constant.

(ii) Another moment method. We can attempt to obtain a variance decomposition of

$$\text{Total variance} = \text{T.V.*} = \frac{1}{\sum_{i=1}^{n} k_i}\sum_{i=1}^{n}\sum_{j=1}^{k_i}(x_{ij} - \bar{x})^2, \quad \text{where}\quad \bar{x} = \frac{1}{\sum_{i=1}^{n} k_i}\sum_{i=1}^{n}\sum_{j=1}^{k_i} x_{ij}.$$

When $k_i \equiv k$ (and only when the $k_i$ are unvarying), there results the orthogonal decomposition

$$\text{T.V.*} = \text{W.S.V.*} + \text{B.S.V.*} \quad [6a]$$

with

$$\text{W.S.V.*} = \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k}(x_{ij} - \bar{x}_i)^2, \qquad \text{B.S.V.*} = \frac{1}{n}\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2. \quad [6b]$$

We can consider in lieu of W.S.V. and B.S.V. of Eqs. 2a and 2b, respectively, the quantities

$$\text{W.S.V.*} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{k_i}\sum_{j=1}^{k_i}(x_{ij} - \bar{x}_i)^2, \qquad \bar{x}_i = \frac{1}{k_i}\sum_{j=1}^{k_i} x_{ij},$$

and B.S.V.* the same as in Eq. 6b. When $k_i$ is variable the formula 6a is not correct. Nevertheless, the following expectations ensue:

$$E[\text{W.S.V.*}] = \left(1 - E\!\left[\tfrac{1}{k}\right]\right)(1 - \rho_{ss})\,\sigma^2,$$

$$E[\text{B.S.V.*}] = \left(1 - \tfrac{1}{n}\right)\left(\rho_{ss}\left(1 - E\!\left[\tfrac{1}{k}\right]\right) + E\!\left[\tfrac{1}{k}\right]\right)\sigma^2, \quad [7]$$

in which $E[1/k]$ is the expectation of the reciprocal of family size regarded as a random variable and $\sigma^2 = \mathrm{Var}(x_{ij})$.

If we form $F^* = \text{B.S.V.*}/\text{W.S.V.*}$, then solving from Eqs. 7 suggests the estimate

$$\hat\rho_{ss} = \frac{F^*\left(1 - E\!\left[\tfrac{1}{k}\right]\right) - E\!\left[\tfrac{1}{k}\right]\left(1 - \tfrac{1}{n}\right)}{\left(F^* + 1 - \tfrac{1}{n}\right)\left(1 - E\!\left[\tfrac{1}{k}\right]\right)}. \quad [8]$$

This differs from the ANOVA estimator and $-1 \leq \hat\rho_{ss} \leq 1$.

For $k_i \equiv k$, $[(k - 1)n/(n - 1)]F^*$ is distributed $F$ with parameters $n - 1$ and $n(k - 1)$ in the normal case for population correlation $\rho_{ss} = 0$, providing a test statistic in this case, if desired. Some insight into the distributional properties of the $F^*$ statistic can be developed by simulation procedures. We emphasize again this is not the same as the ANOVA method and we do not generally advocate its use.

(iii) C. A. B. Smith sibling correlation proposal. C. A. B. Smith (3) reviews several problems and interpretations in estimating genetic correlations. He suggests a method for estimating $\rho_{ss}$ that is strongly tied to modeling assumptions but that is of a somewhat different kind from the canonical intraclass correlation coefficient. Smith assumes that each set of sibship traits $x_{ij}$, $j = 1,\ldots,k_i$ is a set of independent random variables from a distribution with mean $\mu_i$ and variance $v_i$ in which the $\mu_i$ and $v_i$ are themselves random variables such that the $\mu_i$ are drawn from a distribution with variance $B$ and the $v_i$ are drawn from a distribution with mean $A$. Thus, siblings in family $i$ are independent given their $(\mu_i, v_i)$, but have dependency via the

larger framework of varying $(\mu_i, v_i)$ between families. From this he obtains $\rho_{ss} = B/(A+B)$, and specifying $\hat\rho_{ss} = \hat{B}/(\hat{A}+\hat{B})$ determines an estimate for $\rho_{ss}$ of a form similar to that for the ANOVA estimate. He chooses his estimate for $A$ to be $\hat{A} = \hat{E}(v_i) = \text{W.S.V.}$ as defined for the ANOVA model, while his $\hat{B}$ is determined in a more complicated way involving an iterative procedure.

Many of the problems inherent to the ANOVA approach obtain for this approach as well. Specifically, the method depends strongly on the specific assumptions of the model and there is no reduction to the maximum likelihood estimate in the normal case when $k_i = k$ for all $i$. Finally, the estimates for $A$ and $B$, while based on unbiasedness and consistency considerations, take their form from normal theory and in that sense are highly distributionally dependent.
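The two moment-based estimators reviewed in (i) and (ii) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, each family is assumed to be a list of sib trait values, $\bar{x}$ is taken as the grand mean of Eq. 2c, and $E[1/k]$ in Eq. 8 is replaced by the sample average of $1/k_i$, which is an assumption.

```python
from typing import List, Sequence, Tuple

Families = Sequence[Sequence[float]]


def _summaries(families: Families) -> Tuple[List[int], List[float], float, List[float]]:
    """Family sizes, family means, grand mean (Eq. 2c), within-family sums of squares."""
    sizes = [len(f) for f in families]
    means = [sum(f) / len(f) for f in families]
    grand = sum(sum(f) for f in families) / sum(sizes)
    within = [sum((x - m) ** 2 for x in f) for f, m in zip(families, means)]
    return sizes, means, grand, within


def anova_rho(families: Families) -> float:
    """ANOVA estimate of Eq. 5, built from W.S.V. (2a), B.S.V. (2b), and k0 (4)."""
    sizes, means, grand, within = _summaries(families)
    n, total = len(sizes), sum(sizes)
    wsv = sum(within) / (total - n)
    bsv = sum(k * (m - grand) ** 2 for k, m in zip(sizes, means)) / (n - 1)
    k0 = (total - sum(k * k for k in sizes) / total) / (n - 1)
    return (bsv - wsv) / (bsv + (k0 - 1) * wsv)


def moment_rho(families: Families) -> float:
    """Estimate of Eq. 8; E[1/k] is approximated by the sample mean of 1/k_i (assumption)."""
    sizes, means, grand, within = _summaries(families)
    n = len(sizes)
    wsv_star = sum(ss / k for ss, k in zip(within, sizes)) / n
    bsv_star = sum((m - grand) ** 2 for m in means) / n
    f_star = bsv_star / wsv_star
    e_inv = sum(1.0 / k for k in sizes) / n
    c = 1.0 - 1.0 / n
    return (f_star * (1 - e_inv) - c * e_inv) / ((f_star + c) * (1 - e_inv))
```

For two sibships with identical means, such as `[[0.0, 10.0], [1.0, 9.0]]`, B.S.V. is zero and `anova_rho` returns a negative value, which is the third caveat listed under (i).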

(iv) Maximum likelihood estimation of $\rho_{ss}$. The maximum likelihood approach depends on imposing distributional assumptions on the set of family random variables; specifically, the standard model takes the observed variables $(x_{i1}, \ldots, x_{ik_i})$ to be multivariate normal with mean and covariance parameters as specified in Eq. 1. An explicit MLE is inaccessible for $k_i$ varying, and to maximize the likelihood function numerically is usually quite formidable. In the special case of $k_i \equiv k$, it is straightforward to show that

$$\hat\rho_{ML} = \frac{\sum_{i=1}^{n}\sum_{\nu=1}^{k}\sum_{\mu=1,\,\mu\neq\nu}^{k}(x_{i\nu} - \hat\mu)(x_{i\mu} - \hat\mu)}{(k - 1)\sum_{i=1}^{n}\sum_{j=1}^{k}(x_{ij} - \hat\mu)^2}, \quad [9]$$

where $\hat\mu = \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k} x_{ij}$.

(v) Single sib-pair and "ensemble" methods. Another form of the sib-sib correlation estimator is the single sib-pair estimate:

$$\hat\rho_{ss} = \frac{\sum_{i=1}^{n}(x_i^{*} - \hat\mu^{*})(x_i^{**} - \hat\mu^{**})}{\left(\sum_{i=1}^{n}(x_i^{*} - \hat\mu^{*})^2 \sum_{i=1}^{n}(x_i^{**} - \hat\mu^{**})^2\right)^{1/2}},$$

where $\hat\mu^{*} = \frac{1}{n}\sum_{i=1}^{n} x_i^{*}$, $\hat\mu^{**} = \frac{1}{n}\sum_{i=1}^{n} x_i^{**}$, and in which $x_i^{*}$, $x_i^{**}$ are selected either at random or depending on birth order (e.g., youngest versus oldest) or by some other criterion from $x_{ij}$, $j = 1, \ldots, k_i$, given the sample realizations. This estimator has the problem of discarding information in a way that seems clearly not to maintain sufficiency.
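A short sketch of the two estimators just described may help fix the notation. The function names are hypothetical. `rho_ml_constant_k` follows Eq. 9 and refuses unequal family sizes; `rho_single_pair` implements only one possible selection rule, pairing the first and last listed sib of each family, on the assumption that each list is stored in birth order.

```python
import math
from typing import Sequence

Families = Sequence[Sequence[float]]


def rho_ml_constant_k(families: Families) -> float:
    """Eq. 9: explicit normal-theory MLE, valid only when every family has k sibs."""
    sizes = {len(f) for f in families}
    if len(sizes) != 1:
        raise ValueError("Eq. 9 requires a constant family size")
    k = sizes.pop()
    n = len(families)
    mu = sum(sum(f) for f in families) / (n * k)
    num = den = 0.0
    for f in families:
        dev = [x - mu for x in f]
        total = sum(dev)
        squares = sum(d * d for d in dev)
        # Sum over ordered pairs nu != mu equals (sum of deviations)^2 - sum of squares.
        num += total * total - squares
        den += squares
    return num / ((k - 1) * den)


def rho_single_pair(families: Families) -> float:
    """Single sib-pair estimate using the first and last sib of each family (assumed rule)."""
    first = [f[0] for f in families]
    last = [f[-1] for f in families]
    n = len(families)
    m1, m2 = sum(first) / n, sum(last) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(first, last))
    sxx = sum((a - m1) ** 2 for a in first)
    syy = sum((b - m2) ** 2 for b in last)
    return sxy / math.sqrt(sxx * syy)
```

With families of three or more sibs the single-pair rule ignores every middle sib, which is the loss of information noted above.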

It seems intuitively transparent that a superior estimate to this is the "ensemble" version [the ensemble estimate for the parent-offspring correlation was introduced by Rosner et al. (1)], which is defined to be

$$\hat\rho_E = \frac{E\left[\sum_{i=1}^{n}(x_i^{*} - \hat\mu^{*})(x_i^{**} - \hat\mu^{**})\right]}{E\left[\sum_{i=1}^{n}(x_i^{*} - \hat\mu^{*})^2\right]},$$

in which expectations are computed given the sample realizations. We determine this to be

$$\hat\rho_E = \frac{\left(1 - \frac{1}{n}\right)\sum_{i=1}^{n}\frac{1}{k_i(k_i - 1)}\sum_{l \neq m} x_{il}x_{im} - \frac{1}{n}\sum_{i \neq j}\frac{1}{k_i k_j}\sum_{l=1}^{k_i}\sum_{m=1}^{k_j} x_{il}x_{jm}}{\left(1 - \frac{1}{n}\right)\sum_{i=1}^{n}\frac{1}{k_i}\sum_{l=1}^{k_i} x_{il}^2 - \frac{1}{n}\sum_{i \neq j}\frac{1}{k_i k_j}\sum_{l=1}^{k_i}\sum_{m=1}^{k_j} x_{il}x_{jm}}.$$

Two caveats of this approach are the lack of reduction to the maximum likelihood estimate for $k_i \equiv k$ and the fact that the estimate $\hat\rho_E$ can be less than $-1$ (although always $\hat\rho_E \leq 1$). It has the advantage of not being derived via an underlying modeling assumption. This estimate is apparently not in current use.

3. An alternative approach to estimating sibling correlations

Our proposal for sib-sib correlation estimation is to consider the set of all sib-pairs

$$(x_{i\nu}, x_{i\mu}),\quad \nu \neq \mu = 1,2,\ldots,k_i;\ i = 1,2,\ldots,n, \quad [10]$$

consisting of $\sum_{i=1}^{n} k_i(k_i - 1)$ points in the Euclidean plane and prescribe a discrete probability distribution $\mathcal{S}$ on this set, assigning weights $w_{i\nu\mu}$ to $(x_{i\nu}, x_{i\mu})$ (in which often $w_{i\nu\mu}$ is specified independent of $\nu$ and $\mu$ and varies only with the family index $i$), and then to compute the correlation of the first and second components under this probability distribution as a sibling correlation estimator.

Three natural choices of the distribution $\mathcal{S}$ with weights $w_{i\nu\mu}$ are explicitly as follows:

$$\text{Sib-pair density:}\quad w_{i\nu\mu} = \frac{1}{\sum_{j=1}^{n} k_j(k_j - 1)} \quad [11a]$$

$$\text{Individual density:}\quad w_{i\nu\mu} = \frac{1}{(k_i - 1)\sum_{j=1}^{n} k_j} \quad [11b]$$

$$\text{Family density:}\quad w_{i\nu\mu} = \frac{1}{n\,k_i(k_i - 1)} \quad [11c]$$

The sib-pair density 11a emphasizes contributions from large families the most, whereas the family density 11c emphasizes contributions from large families the least, in fact, treating all families equivalently. The individual density 11b lies between the other two in its emphasis on large families.

Another appealing choice is to superimpose on the densities of Eqs. 11 factors that reflect demographic influences or other stratification conditions. In this vein, let $P_r$ be the population frequency (available from census data) of families composed of $r$ siblings. Suppose in the actual sample the observed frequencies are $p_r$. For a given $\{w_i\}$ we modify $w_i$ to $w_i^{*}$ by the definition

$$w_i^{*} = \frac{w_i\,P_{k_i}}{p_{k_i}\,N}, \quad [11d]$$

in which $N$ is a normalizing constant ensuring $\sum_{i} k_i(k_i - 1)w_i^{*} = 1$. The modified density $\{w_i^{*}\}$ adjusts the weights so as to better reflect the family size distribution of the total population.

This method can generalize to determinations of the distribution $\mathcal{S}$ based on sex and age of the family members, cultural criteria, etc.

We now proceed to derive the sibling correlations for the distributions of Eqs. 11.

(i) Sib-pair method. (For the density of Eq. 11a.) We count all possible sib pairs over all families such that each sib pair carries equal weight. The component mean and variance are

$$\hat\mu_S = \frac{1}{\sum_{i=1}^{n} k_i(k_i - 1)}\sum_{i=1}^{n}(k_i - 1)\sum_{j=1}^{k_i} x_{ij} \quad\text{and}\quad \hat\sigma_S^2 = \frac{1}{\sum_{i=1}^{n} k_i(k_i - 1)}\sum_{i=1}^{n}(k_i - 1)\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_S)^2,$$

respectively. The correlation coefficient is

$$\hat\rho_S = \frac{1}{\hat\sigma_S^2\,\sum_{i=1}^{n} k_i(k_i - 1)}\sum_{i=1}^{n}\sum_{\nu=1}^{k_i}\sum_{\mu=1,\,\mu\neq\nu}^{k_i}(x_{i\nu} - \hat\mu_S)(x_{i\mu} - \hat\mu_S). \quad [12a]$$

This procedure is equivalent to averaging all sib-pairs within a family before averaging between families.
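The demographic adjustment of Eq. 11d can be illustrated with a small sketch. The function names and the dictionary `population_freq` (mapping a sibship size to its population frequency) are hypothetical, and the sample frequencies are simply the observed proportions of each family size, which is an assumption about how they would be tabulated.

```python
from collections import Counter
from typing import Dict, List, Sequence


def pair_mass_by_family(sizes: Sequence[int], weights: Sequence[float]) -> List[float]:
    """Total probability carried by the k_i(k_i - 1) ordered sib pairs of each family."""
    return [k * (k - 1) * w for k, w in zip(sizes, weights)]


def reweight_by_population(
    sizes: Sequence[int],
    weights: Sequence[float],
    population_freq: Dict[int, float],
) -> List[float]:
    """Eq. 11d: scale w_i by (population frequency / sample frequency), then renormalize."""
    n = len(sizes)
    counts = Counter(sizes)
    raw = [w * population_freq[k] / (counts[k] / n) for k, w in zip(sizes, weights)]
    norm = sum(pair_mass_by_family(sizes, raw))  # plays the role of N
    return [r / norm for r in raw]
```

As a usage example, take sizes `[2, 2, 3]` with the family density of Eq. 11c, so each family starts with mass 1/3. If sizes 2 and 3 are equally common in the population, the adjusted masses become 1/4, 1/4, and 1/2, so each size class carries half of the total.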

(ii) Individual method. (For the density of Eq. 11b.) This is equivalent to averaging over individuals before averaging between individuals. Thus, each individual of $\{x_{ij}\}$ contributes effectively once in the sample under the density 11b. The mean and variance values are, respectively,

$$\hat\mu_I = \frac{1}{\sum_{i=1}^{n} k_i}\sum_{i=1}^{n}\sum_{j=1}^{k_i} x_{ij} \quad\text{and}\quad \hat\sigma_I^2 = \frac{1}{\sum_{i=1}^{n} k_i}\sum_{i=1}^{n}\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_I)^2.$$

The correlation coefficient is

$$\hat\rho_I = \frac{\sum_{i=1}^{n}\frac{1}{k_i - 1}\sum_{\nu=1}^{k_i}\sum_{\mu=1,\,\mu\neq\nu}^{k_i}(x_{i\nu} - \hat\mu_I)(x_{i\mu} - \hat\mu_I)}{\sum_{i=1}^{n}\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_I)^2}. \quad [12b]$$

(iii) Family method. The density 11c weights each family equally, independent of the size of the sibship. Here

$$\hat\mu_F = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{k_i}\sum_{j=1}^{k_i} x_{ij}, \qquad \hat\sigma_F^2 = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{k_i}\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_F)^2,$$

in which we average within families before averaging between families in computing the sib-sib correlation statistic. The corresponding correlation estimate becomes

$$\hat\rho_F = \frac{\sum_{i=1}^{n}\frac{1}{k_i(k_i - 1)}\sum_{\nu=1}^{k_i}\sum_{\mu=1,\,\mu\neq\nu}^{k_i}(x_{i\nu} - \hat\mu_F)(x_{i\mu} - \hat\mu_F)}{\sum_{i=1}^{n}\frac{1}{k_i}\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_F)^2}. \quad [12c]$$
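The sib-pair, individual, and family estimators differ only in the weight attached to each family's ordered sib pairs, so one routine can serve all three. The sketch below is illustrative rather than the authors' implementation; the function names and the `method` keyword are hypothetical, and every family is assumed to contain at least two sibs.

```python
from typing import List, Sequence

Families = Sequence[Sequence[float]]


def sib_pair_weights(sizes: Sequence[int], method: str) -> List[float]:
    """Per-pair weights of Eqs. 11a-11c, normalized so sum_i k_i(k_i-1) w_i = 1."""
    shape = {
        "pair": lambda k: 1.0,
        "individual": lambda k: 1.0 / (k - 1),
        "family": lambda k: 1.0 / (k * (k - 1)),
    }[method]
    raw = [shape(k) for k in sizes]
    norm = sum(k * (k - 1) * r for k, r in zip(sizes, raw))
    return [r / norm for r in raw]


def weighted_sib_correlation(families: Families, method: str = "family") -> float:
    """Pearson correlation over all ordered sib pairs under the chosen density."""
    sizes = [len(f) for f in families]
    weights = sib_pair_weights(sizes, method)
    # Each sib appears as a first component in (k_i - 1) ordered pairs.
    mu = sum(w * (k - 1) * sum(f) for f, k, w in zip(families, sizes, weights))
    num = den = 0.0
    for f, k, w in zip(families, sizes, weights):
        dev = [x - mu for x in f]
        total = sum(dev)
        squares = sum(d * d for d in dev)
        num += w * (total * total - squares)  # sum over ordered pairs nu != mu
        den += w * (k - 1) * squares
    return num / den
```

For the two sibships `[[1, 3], [2, 4, 6]]` the three densities give different answers (-1/6 for the sib-pair density, -3/37 for the individual density, -1/17 for the family density), while for equal-sized families all three coincide.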

(iv) The general method. For a general probability measure $w_i$ assigned to $(x_{i\nu}, x_{i\mu})$, $i = 1,2,\ldots,n$, $1 \leq \nu \neq \mu \leq k_i$, symmetrically defined over the sib pairs of each family, but which could vary between family units, we have the correlation

$$\hat\rho_{ss} = \frac{\sum_{i=1}^{n} w_i \sum_{\nu=1}^{k_i}\sum_{\mu=1,\,\mu\neq\nu}^{k_i}(x_{i\nu} - \hat\mu)(x_{i\mu} - \hat\mu)}{\sum_{i=1}^{n} w_i(k_i - 1)\sum_{j=1}^{k_i}(x_{ij} - \hat\mu)^2} \quad [13]$$

in which $\hat\mu = \sum_{i=1}^{n} w_i(k_i - 1)\sum_{j=1}^{k_i} x_{ij}$, and the $\{w_i\}$ are constrained to satisfy $\sum_{i=1}^{n} k_i(k_i - 1)w_i = 1$.

For the case of constant family size $k_i \equiv k$, all estimates reduce identically to the statistic of Eq. 9, which is the MLE for the multivariate normal case.

4. Parent-offspring correlations

Rosner et al. (1) reviewed the status of various parent-offspring correlation estimates currently in use, including the pairwise estimator, sib-mean estimator, random-sib estimator, and an "ensemble" estimator (see ref. 1 for definitions and detailed discussion). The sib-mean and random-sib estimators stand out as ones that lose sufficiency in contracting the data unduly. The ensemble estimator has the difficulty of not being a true correlation, whereas the pairwise estimator will appear as a special case of the class of estimators we propose. As in the sib-sib case, the MLE exists in explicit form only when $k_i \equiv k$.

Following mutatis mutandis our methodology in the calculation of sibling correlation estimates, we consider the complete set of parent-offspring values

$$(y_i, x_{i\nu}),\quad i = 1,2,\ldots,n,\ \nu = 1,\ldots,k_i, \quad [14]$$

induced by the family sets $(y_i, x_{i1}, \ldots, x_{i,k_i})$ and assign to them an appropriate discrete distribution $\mathcal{P}$ [with weight $\pi_i$ assigned to $(y_i, x_{i\nu})$, normalized by $\sum_{i=1}^{n}\pi_i k_i = 1$] from which we can compute a parent-offspring correlation. Two obvious canonical choices for $\mathcal{P}$ are

$$\text{the family density}\quad \pi_i = \frac{1}{n k_i} \quad [15a]$$

and

$$\text{the individual density}\quad \pi_i = \frac{1}{\sum_{j=1}^{n} k_j}. \quad [15b]$$

For a general density $\pi_i$ we obtain the estimator

$$\hat\rho_{ps} = \frac{\sum_{i=1}^{n}\pi_i(y_i - \hat\mu_1)\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_2)}{\left(\sum_{i=1}^{n}\pi_i k_i(y_i - \hat\mu_1)^2\ \sum_{i=1}^{n}\pi_i\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_2)^2\right)^{1/2}}$$

with $\hat\mu_1 = \sum_{i=1}^{n}\pi_i k_i y_i$, $\hat\mu_2 = \sum_{i=1}^{n}\pi_i\sum_{j=1}^{k_i} x_{ij}$.

(i) The family method. The density is that of Eq. 15a. This averages within families leading to

$$\hat\rho_F = \frac{\sum_{i=1}^{n}(y_i - \hat\mu_{1F})\frac{1}{k_i}\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_{2F})}{\left(\sum_{i=1}^{n}(y_i - \hat\mu_{1F})^2\ \sum_{i=1}^{n}\frac{1}{k_i}\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_{2F})^2\right)^{1/2}} \quad [16a]$$

where $\hat\mu_{1F} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\hat\mu_{2F} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{k_i}\sum_{j=1}^{k_i} x_{ij}$.

(ii) Individual method. The density is that of Eq. 15b. Because there is a one-to-one correspondence between individual siblings and possible parent-offspring pairs, this is equivalent to a parent-offspring pair estimator. The calculation gives

$$\hat\mu_{1I} = \sum_{i=1}^{n} k_i y_i \Big/ \sum_{i=1}^{n} k_i, \qquad \hat\mu_{2I} = \sum_{i=1}^{n}\sum_{j=1}^{k_i} x_{ij} \Big/ \sum_{i=1}^{n} k_i,$$

and

$$\hat\rho_I = \frac{\sum_{i=1}^{n}(y_i - \hat\mu_{1I})\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_{2I})}{\left(\sum_{i=1}^{n} k_i(y_i - \hat\mu_{1I})^2\ \sum_{i=1}^{n}\sum_{j=1}^{k_i}(x_{ij} - \hat\mu_{2I})^2\right)^{1/2}}.$$

This statistic is precisely the pairwise parent-offspring estimator most commonly used (e.g., see refs. 9-12).

The family method emphasizes contributions from large families less than the individual method.

For $k_i \equiv k$, both the family and individual estimates reduce to the MLE for the multivariate normal case:

$$\hat\rho_{ps(ML)} = \frac{\sum_{i=1}^{n}(y_i - \hat\mu_{1ML})\sum_{j=1}^{k}(x_{ij} - \hat\mu_{2ML})}{\left(k\sum_{i=1}^{n}(y_i - \hat\mu_{1ML})^2\ \sum_{i=1}^{n}\sum_{j=1}^{k}(x_{ij} - \hat\mu_{2ML})^2\right)^{1/2}}$$

where $\hat\mu_{1ML} = \frac{1}{n}\sum_{i=1}^{n} y_i$, $\hat\mu_{2ML} = \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k} x_{ij}$.

5. A general method of correlation estimation between sets

It is useful to extend the concept of degree of resemblance by defining a correlation between sets of variables for which dependencies between sets may exist. Formally, we address the question, given a set $S$ on which are defined ordered pairs of subsets $(S_{1i}, S_{2i})$, $i = 1,\ldots,n$, what is the degree of resemblance between the sets $S_{1i}$, $i = 1,\ldots,n$, and $S_{2i}$, $i = 1,\ldots,n$. (Presumably, in actual cases of applications, the form of the subsets $S_{ji}$ will be determined by some natural structure in the problem at hand, e.g., $S_{1i} = S_{2i}$, $i = 1,\ldots,n$, with $\bigcup_i S_{1i} = S$ and $S_{1j} \cap S_{1k} = \emptyset$, $j \neq k$, as in sib-sib correlations, or $S_{1i} \cap S_{2i} = \emptyset$, $S_{1j} \cap S_{1k} = \emptyset$, $j \neq k$, with $\bigcup_{j=1}^{2}\bigcup_{i=1}^{n} S_{ji} = S$ as in parent-offspring correlations.)


Our approach is to define a new sample space consisting of all possible pairs of the appropriate variables chosen from each subset pair (excluding pairs consisting of identical points), assign some relative weight or measure to those pairs, and then compute the correlation estimate in the same form as the Pearson statistic (or with some trimming or rank-ordering modification when appropriate for increasing robustness) with these new variable pairs weighted appropriately.

For the sake of brevity of exposition we will assume that variable pairs taken from the same subset pairs will receive equal weight. Formally, we let $a_i$ and $b_i$ represent the number of points in $S_{1i}$ and $S_{2i}$, respectively, and $c_i$ represent the number of points in common between $S_{1i}$ and $S_{2i}$. Then, letting $d_i$ represent the number of possible choices of (distinct) pairs from $(S_{1i}, S_{2i})$, we find $d_i = a_i b_i - c_i$.

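Finally, we let $w_i$ be the weight of pairs chosen from $(S_{1i}, S_{2i})$, and to assure $w_i$ is a density on the set of all sib pairs (this is technically unnecessary because the statistic itself is scale independent), constrain $w_i$ such that $\sum_{i=1}^{n} d_i w_i = 1$. Then if $S_{1i}$ has elements $\xi_{ij}$, $j = 1,\ldots,a_i$ and $S_{2i}$ has elements $\psi_{ij}$, $j = 1,\ldots,b_i$, in which $\xi_{ij} = \psi_{ij}$, $j = 1,\ldots,c_i$, we let

$$\hat\mu_1 = \sum_{i=1}^{n} w_i\left(b_i\sum_{j=1}^{a_i}\xi_{ij} - \sum_{j=1}^{c_i}\xi_{ij}\right), \qquad \hat\mu_2 = \sum_{i=1}^{n} w_i\left(a_i\sum_{j=1}^{b_i}\psi_{ij} - \sum_{j=1}^{c_i}\psi_{ij}\right),$$

$$\hat\sigma_1^2 = \sum_{i=1}^{n} w_i\left(b_i\sum_{j=1}^{a_i}(\xi_{ij} - \hat\mu_1)^2 - \sum_{j=1}^{c_i}(\xi_{ij} - \hat\mu_1)^2\right),$$

$$\hat\sigma_2^2 = \sum_{i=1}^{n} w_i\left(a_i\sum_{j=1}^{b_i}(\psi_{ij} - \hat\mu_2)^2 - \sum_{j=1}^{c_i}(\psi_{ij} - \hat\mu_2)^2\right),$$

$$\hat\rho = \frac{1}{(\hat\sigma_1^2\,\hat\sigma_2^2)^{1/2}}\sum_{i=1}^{n} w_i\left(\sum_{j=1}^{a_i}\sum_{l=1}^{b_i}(\xi_{ij} - \hat\mu_1)(\psi_{il} - \hat\mu_2) - \sum_{j=1}^{c_i}(\xi_{ij} - \hat\mu_1)(\psi_{ij} - \hat\mu_2)\right).$$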
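The set-based construction can be checked by brute force: enumerate every admissible pair, attach the unit's weight, and compute a weighted Pearson correlation. The sketch below does exactly that. The function names and the tuple layout `(s1, s2, c)` are hypothetical, with the first `c` entries of `s1` and `s2` assumed to be the shared points, and the weights are renormalized internally so any positive scaling gives the same result.

```python
import math
from typing import List, Sequence, Tuple

Unit = Tuple[Sequence[float], Sequence[float], int]  # (S_1i, S_2i, c_i)


def distinct_pairs(s1: Sequence[float], s2: Sequence[float], c: int) -> List[Tuple[float, float]]:
    """All a*b pairs minus the c pairs that match a shared point with itself."""
    return [
        (a, b)
        for j, a in enumerate(s1)
        for l, b in enumerate(s2)
        if not (j == l and j < c)
    ]


def set_correlation(units: Sequence[Unit], weights: Sequence[float]) -> float:
    """Weighted Pearson correlation over the pairs of every unit (one weight per unit)."""
    weighted = [
        (w, a, b)
        for (s1, s2, c), w in zip(units, weights)
        for a, b in distinct_pairs(s1, s2, c)
    ]
    total = sum(w for w, _, _ in weighted)
    m1 = sum(w * a for w, a, _ in weighted) / total
    m2 = sum(w * b for w, _, b in weighted) / total
    v1 = sum(w * (a - m1) ** 2 for w, a, _ in weighted) / total
    v2 = sum(w * (b - m2) ** 2 for w, _, b in weighted) / total
    cov = sum(w * (a - m1) * (b - m2) for w, a, b in weighted) / total
    return cov / math.sqrt(v1 * v2)
```

Sib-sib correlation corresponds to units of the form `(sibs, sibs, len(sibs))`, and parent-offspring correlation to units of the form `([parent], sibs, 0)`; with equal weights on every parent-offspring pair this reproduces the pairwise estimator.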

6. Summary and discussion

In this paper we advocate an approach to the estimation of familial correlations that differs from the approaches commonly in use. In the case of sibling correlations we introduce a natural discrete probability distribution defined for the set of all sib-pairs and determine the correlation based on this distribution. The approach in the parent-offspring case is similar. In particular, we have advocated three specific estimators and natural extensions thereof: the family statistic $\hat\rho_F$ (see Eq. 12c), in which each family contributes equally independent of family size; the sib-pair estimator $\hat\rho_S$ (see Eq. 12a), in which each pair receives equal weight; and the individual estimator $\hat\rho_I$ (see Eq. 12b), in which each individual is counted equally with each other. Each of the three methods produces a different estimator, all of which reduce to the same statistic for constant family size as the normal MLE. This approach to the estimation of correlations carries the following advantages:

(i) All estimates for the sibling correlation $\rho_{ss}$ and the parent-offspring correlation $\rho_{po}$ are computed directly in the form of Pearson's estimate (not indirectly, e.g., through moments), guaranteeing $|\hat{\rho}| \le 1$; thus all possible values of $\hat{\rho}$, including negative estimates, are consistent with our assumptions.

(ii) We make a minimum of modeling assumptions; i.e., no specific model is assumed beyond the existence of first and second moments for the underlying distribution corresponding to Eq. 1.

(iii) Our estimate reduces to the maximum likelihood estimate, $\hat{\rho}_L$, in the case $k_i \equiv k$ for the multivariate normal distribution.

We believe this basic perspective and approach to the problem of assessing intrafamilial similarity lend intuitive comprehensibility to the meaning of the various estimates. Thus, our approach lends overall coherency to the estimation of intrafamily correlations and may also be applied to new problems (e.g., family data involving extended family sets), or in general to any data for which correlation estimates within or between sets, or both, are required (see Section 5), or possibly to higher-order interactions between sets of variables.
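The three weightings summarized at the start of this section can be illustrated with a short Python sketch. Eqs. 12a-c are not reproduced in this passage, so the weights below are an assumption derived only from the verbal descriptions (equal weight per sib pair, per individual, or per family) together with the normalization $\sum_i k_i(k_i - 1) w_i = 1$. The function name `sib_correlation` and the `scheme` labels are hypothetical.

```python
def sib_correlation(families, scheme="pair"):
    """Pearson correlation over weighted ordered sib pairs (illustrative sketch).

    families -- list of lists of trait values, each with at least two sibs.
    scheme   -- "pair": every ordered sib pair gets equal weight;
                "individual": every individual gets equal total weight;
                "family": every family gets equal total weight.
    """
    k = [len(f) for f in families]
    if scheme == "pair":
        raw = [1.0 for _ in k]
    elif scheme == "individual":
        raw = [1.0 / (ki - 1) for ki in k]          # each sib starts k_i - 1 pairs
    elif scheme == "family":
        raw = [1.0 / (ki * (ki - 1)) for ki in k]   # a family holds k_i(k_i - 1) pairs
    else:
        raise ValueError("unknown scheme: " + scheme)

    # Rescale so that sum_i k_i (k_i - 1) w_i = 1.
    norm = sum(ki * (ki - 1) * ri for ki, ri in zip(k, raw))
    w = [ri / norm for ri in raw]

    # The pair distribution is symmetric, so one mean and one variance suffice.
    mu = sum(wi * (ki - 1) * sum(f) for wi, ki, f in zip(w, k, families))
    var = sum(wi * (ki - 1) * sum((v - mu) ** 2 for v in f)
              for wi, ki, f in zip(w, k, families))
    # Sum over j != l of products = (sum of deviations)^2 - sum of squares.
    cov = sum(wi * (sum(v - mu for v in f) ** 2 - sum((v - mu) ** 2 for v in f))
              for wi, f in zip(w, families))
    return cov / var

# With unequal family sizes the three schemes generally give different values.
families = [[4.1, 4.6], [5.0, 5.4, 5.9, 6.3], [3.8, 4.0, 4.9]]
for scheme in ("pair", "individual", "family"):
    print(scheme, sib_correlation(families, scheme))
```

Because each estimate is an ordinary Pearson correlation under a probability distribution on sib pairs, it lies in $[-1, 1]$ by construction, and for equal family sizes the three sets of weights coincide.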

Finally, our method readily yields a way of fine tuning the analysis of familial correlation by adjusting for such factors as demographic variables, age and sex distributions within families, and cultural criteria. Because family size is important in our method, we may account for other factors that influence the relative importance of families of differing sizes by reweighting by $V_i = Q_i w_i$, in which $w_i$ is the weight as determined previously and $Q_i$ reflects other factors while preserving the normalization $\sum_i k_i(k_i - 1) Q_i w_i = 1$. For example, data might be collected from a population with a known family size distribution and we may wish to correct for possible sampling error relating to family size. To this end, we can use $V_i = w_i p_{k_i}/(P_{k_i} N)$, in which $N$ is a normalization constant, $p_{k_i}$ is the frequency of family size $k_i$ in the population, and $P_{k_i}$ is its frequency within the sample.

We also may wish to weight by family in a way that reflects factors other than size (e.g., age or sex distribution). Weighting could even be done on a finer level involving the characteristics of specific offspring, parents, sib pairs, etc.

On theoretical grounds, it is worthwhile to point out that there are no universal orderings between the sib-pair, individual, and family correlation estimates of Section 3.
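The family-size correction described above, in which each weight is multiplied by the ratio of the population frequency to the sample frequency of its family size and then renormalized, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `reweight_by_family_size` and the dictionary input for the population frequencies are hypothetical, and the constant $N$ is obtained here simply by enforcing $\sum_i k_i(k_i - 1) V_i = 1$.

```python
from collections import Counter

def reweight_by_family_size(sizes, base_weights, population_freq):
    """Return weights V_i proportional to w_i * p_k / P_k (illustrative sketch).

    sizes           -- family sizes k_i, one per sampled family
    base_weights    -- weights w_i determined previously
    population_freq -- dict mapping family size k to its population frequency p_k
    """
    n = len(sizes)
    # P_k: frequency of each family size within the sample.
    sample_freq = {k: count / n for k, count in Counter(sizes).items()}
    raw = [w * population_freq[k] / sample_freq[k]
           for k, w in zip(sizes, base_weights)]
    # Normalization constant, chosen so that sum_i k_i (k_i - 1) V_i = 1.
    norm = sum(k * (k - 1) * r for k, r in zip(sizes, raw))
    return [r / norm for r in raw]

# Size-2 families are over-represented in this sample (2 of 3 families),
# while the assumed population has sizes 2 and 3 equally frequent.
sizes = [2, 2, 3]
family_weights = [1.0 / (len(sizes) * k * (k - 1)) for k in sizes]
V = reweight_by_family_size(sizes, family_weights, {2: 0.5, 3: 0.5})
print(V)
```

Starting from weights that give every family equal total weight, the corrected weights in this example give the size-2 and size-3 families equal aggregate weight, matching the assumed population frequencies.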

Our recommendation on using the various correlation estimators is to compute them all (e.g., $\hat{\rho}_F$, $\hat{\rho}_I$, $\hat{\rho}_S$, $\hat{\rho}_{\mathrm{ANOVA}}$ in the sibling correlation case) and to incorporate into their interpretation the inherent differences. When the results are about the same we may conclude that family size exerts no special influence. If there are manifest differences in these estimators [and we may expect this in some cases; e.g., Sing et al. (14) have suggested that family size may influence cholesterol levels], then further study of the effects of family size, demographic factors, family sex ratio, etc., is warranted. In dealing with and comparing data sets of diverse populations the different estimators may be very revealing.

This work was supported in part by National Institutes of Health Grant 5R01 GM 10452-17, National Science Foundation Grant MCS79-24310, and National Heart, Lung, and Blood Institute Contract N01-HV-2161.

1. Rosner, B., Donner, A. & Hennekens, C. H. (1977) Appl. Stat. 26, 179-187.
2. Donner, A. & Koval, J. J. (1980) Biometrics 36, 19-25.
3. Smith, C. A. B. (1980) Ann. Hum. Genet. 43, 265-284.
4. Rosner, B., Donner, A. & Hennekens, C. H. (1979) Biometrics 35, 461-471.
5. Hennekens, C. H., Jesse, M. S., Klein, B. E., Gourley, J. E. & Blumenthal, S. (1976) Am. J. Epidemiol. 103, 457-463.
6. Fisher, R. A. (1938) Statistical Methods for Research Workers (Oliver and Boyd, London), pp. 213-249.
7. Snedecor, G. & Cochran, W. G. (1967) Statistical Methods (Iowa State Univ. Press, Ames, IA), pp. 294-296.
8. Karlin, S., Williams, P. T. & Carmelli, D. (1981) Am. J. Hum. Genet. 33, 262-282.
9. Hayes, C. G., Tyroler, H. A. & Cassel, T. C. (1971) Arch. Intern. Med. 128, 965-981.
10. Schrott, H. G., Bucher, K. A., Clarke, W. R. & Lauer, R. M. (1979) in Genetic Analysis of Common Diseases: Applications to Predictive Factors in Coronary Disease, eds. Sing, C. & Skolnick, M. (Liss, New York), pp. 619-646.
11. Tager, I. B., Rosner, B., Tishler, P. V., Speizer, F. E. & Kass, E. H. (1976) Am. Rev. Resp. Dis. 114, 485-492.
12. Hewitt, D., Jones, G. J. L. & Goden, G. W. (1979) Atherosclerosis 32, 381-396.
13. Searle, S. R. (1971) Linear Models (Wiley, New York), pp. 376-472.
14. Sing, C. F., Orr, J. P. & Moll, P. P. (1980) in Childhood Prevention of Atherosclerosis and Hypertension (Raven, New York), pp. 87-97.
