
Influential Observations

in

the Analysis of Additive Ipsative Data

WONG Wai-Wan

A Thesis Submitted in Partial Fulfillment

of the Requirements for the Degree of

Master of Philosophy

in

Statistics

© The Chinese University of Hong Kong

JUNE 2003

The Chinese University of Hong Kong holds the copyright of this thesis. Any person(s) intending to use a part or whole of the materials in the thesis in a proposed publication must seek copyright release from the Dean of the Graduate School.


Abstract of thesis entitled:

Influential Observations in the Analysis of Additive Ipsative Data

Submitted by Wong Wai-Wan

for the degree of Master of Philosophy in Statistics

at The Chinese University of Hong Kong in June 2003.

Abstract

The objective of this thesis is to use the local influence approach to develop a procedure for identifying influential observations in the analysis of additive ipsative data. Ipsative data are data collected, or modified, in such a way that each observation is subject to a constant-sum constraint. Ipsative data are usually grouped into two categories: additive ipsative data (AID) and multiplicative ipsative data (MID). This thesis considers AID. Let $x$ be the observed AID vector and assume that $x$ is multivariate normal with mean $\mu$ and covariance matrix $\Sigma$. The distribution of $x$ is then degenerate (singular) because the elements of $\Sigma$ are subject to a constant-sum constraint. As a result, the maximum likelihood (ML) estimates of the model parameters $\mu$ and $\Sigma$ cannot be obtained directly. A common way to overcome this problem is to introduce a transformation $x^* = Dx$ such that the transformed $x^*$ is non-ipsative and has a nonsingular density with mean $\mu^*$ and covariance matrix $\Sigma^*$. The ML estimates of $\mu^*$ and $\Sigma^*$ can then be obtained in a straightforward way. For non-ipsative data, various diagnostic measures have been developed to identify influential observations. For example, Poon and Poon (2002) constructed three diagnostic measures, $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$. In a multivariate normal model, these measures identify observations that exert an undue influence on, respectively, the estimates of the mean vector together with the covariance matrix, the estimate of the mean vector, and the estimate of the covariance matrix. We make use of these diagnostic measures to develop procedures for identifying observations that exert an undue influence on the estimation of $\mu$ and $\Sigma$. Examples based on real data sets illustrate the practicability of the proposed procedure.


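摘要 (Abstract, translated from the Chinese)

The aim of this thesis is to use the local influence approach as a basis for establishing a procedure that identifies influential observations in the analysis of additive ipsative data. Ipsative data are data that are collected, or obtained by modification, in such a way that they are subject to a constant-sum constraint. Ipsative data are usually divided into two categories: additive ipsative data (AID) and multiplicative ipsative data (MID). This thesis discusses AID. Let $x$ be the observed AID vector, and suppose that $x$ follows a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. The distribution of $x$ is then degenerate (singular), because the elements of $\Sigma$ are subject to a constant-sum constraint. Consequently, the maximum likelihood estimates of the model parameters $\mu$ and $\Sigma$ cannot be obtained directly. The usual remedy is to introduce the transformation $x^* = Dx$, so that the transformed $x^*$ is non-ipsative and has a nonsingular density with mean $\mu^*$ and covariance matrix $\Sigma^*$. The maximum likelihood estimates of $\mu^*$ and $\Sigma^*$ can then be obtained directly. For non-ipsative data, a variety of diagnostic measures are available for identifying influential observations, for example the three diagnostic measures $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$ constructed by Poon and Poon (2002). In a multivariate normal model, these measures identify, respectively, observations with an unusual influence on the estimates of the mean vector and the covariance matrix, on the estimate of the mean vector, and on the estimate of the covariance matrix. We use these diagnostic measures to build a procedure for identifying observations that exert an undue influence on the estimates of $\mu$ and $\Sigma$. Examples based on real data sets illustrate the practicality of the procedure.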


Acknowledgment

I would like to express my sincere gratitude to my supervisor, Prof. Poon Wai-Yin, for her invaluable encouragement and advice during the period of this research program. I would also like to thank my fellow classmates for their kind assistance. The work described in this thesis was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (RGC Ref. No. CUHK 4347/01H).


Contents

Abstract
Acknowledgement
1 Introduction
    1.1 Ipsative Data
    1.2 Transformation
    1.3 Influence Analysis
2 Ipsative Data
    2.1 Additive Ipsative Data (AID) and Multiplicative Ipsative Data (MID)
        2.1.1 Additive Ipsative Data (AID)
        2.1.2 Multiplicative Ipsative Data (MID)
    2.2 Partially Additive Ipsative Data (PAID)
        2.2.1 Vector of PAID
        2.2.2 Special Cases of PAID
3 Transformation
    3.1 Distribution of AID
    3.2 Transformation
    3.3 Relationship between Parameters of AID and the Transformed Vector
4 Influence Analysis of Ipsative Data
    4.1 The Postulated Model
    4.2 Perturbation
    4.3 Likelihood Displacement
    4.4 Normal Curvature
    4.5 Computation of the Normal Curvature
    4.6 Diagnostic Measures
        4.6.1 Observations influencing the estimates of $\mu$ and $\Sigma$
        4.6.2 Observations influencing the estimate of $\mu$ or $\Sigma$
    4.7 Examples
        4.7.1 Example 1: Foraminiferal Compositions Data Set (AID)
        4.7.2 Example 2: Compositions of Sediments Data Set (PAID)
5 Discussion
A Proof of Propositions
    A.1 Proof of Proposition 3
    A.2 Proof of Proposition 4
B Analytical Expressions of $(\ddot{L}_{\theta^*})^{-1}$, $\Delta^*$ and $\ddot{L}_c$
    B.1 Analytical Expression of $(\ddot{L}_{\theta^*})^{-1}$
    B.2 Analytical Expression of $\Delta^*$
    B.3 Analytical Expression of $\ddot{L}_c$
C Calculation of $\partial\theta^*/\partial\theta^T$
D Matlab Commands of Example 1
Bibliography

Chapter 1

Introduction

1.1 Ipsative Data

Data in ipsative form arise in various disciplines: for example, the compositions of rock samples in geology, the proportions of total expenditure allocated to different items in a household budget survey in sociology, and the ranked variables of personal attitudes in a recruitment test in psychology. Ipsativity is a property of a data matrix, such as a set of scores: a data matrix is said to be ipsative when the sum of the scores over each respondent is a constant. Measures with this constant-sum property are called ipsative data (Cattell, 1944; Hicks, 1970). In mathematical notation, let $x$ be a $p \times 1$ column vector such that

\[ \mathbf{1}^T x = c, \tag{1.1} \]

where $\mathbf{1}$ is a $p \times 1$ unit vector (every entry equal to one) and $c$ is a constant scalar. Then $x$ is an ipsative data vector.

Ipsative data are commonly grouped into two categories (Chan & Bentler, 1993): additive ipsative data (AID) and multiplicative ipsative data (MID). AID can be obtained by subtracting each individual's unweighted average from the raw scores. In matrix form, the transformation from the raw scores to AID is

\[ x = [I - \mathbf{1}(\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T]\,y, \tag{1.2} \]

where $x$ is the $p \times 1$ vector of AID, $I$ is the $p \times p$ identity matrix, $\mathbf{1}$ is the $p \times 1$ unit vector, and $y$ is the underlying $p \times 1$ nonipsative vector. We say that $x$ is ipsative because

\[ \mathbf{1}^T x = 0. \tag{1.3} \]

In Chapter 2, we will give more details on the definition of ipsative data.

1.2 Transformation

The multivariate normal distribution is one of the most popular distributions for analyzing multivariate data. It is specified by a mean vector $\mu$ together with a symmetric and positive definite covariance matrix $\Sigma$, and these parameters are usually estimated by the method of maximum likelihood. Our development of the influence analysis also relies on the multivariate normal distribution. Let $x$ be a vector of AID distributed as multivariate normal with mean $\mu$ and covariance matrix $\Sigma$. Then

\[ \mathbf{1}^T \Sigma = 0^T, \tag{1.4} \]

where $\Sigma$ is a $p \times p$ matrix, $\mathbf{1}$ is a $p \times 1$ unit vector and $0$ is a $p \times 1$ vector with all entries equal to zero. Because of the zero-sum constraint (1.3), neither the population covariance matrix $\Sigma$ nor the sample covariance matrix $S$ is of full rank, so ordinary maximum likelihood estimation cannot be applied to ipsative data directly. The common approach to this problem is to introduce a transformation $x^* = Dx$ such that the resulting $x^*$ is non-ipsative. The ML estimates of the mean $\mu^*$ and covariance matrix $\Sigma^*$ of $x^*$ can then be obtained in a straightforward way. The transformation $D$ is discussed further in Chapter 3.


1.3 Influence Analysis

There are two major paradigms in the influence analysis literature: the deletion approach and the local influence approach. The deletion approach assesses the effect of dropping a case on a chosen quantity; a typical diagnostic measure is Cook's distance (Cook, 1977). The main weakness of this approach is that masking effects arise in the presence of multiple unusual observations. Two kinds of masking effect can arise in the presence of several outliers: joint influence and conditional influence (Lawrance, 1995; Poon and Poon, 2001).

The local influence approach develops diagnostic measures by examining the consequence of an infinitesimal perturbation on a relevant quantity. This approach is well suited to detecting joint influence. A general method for assessing the influence of local perturbation was first proposed by Cook (1986) and then modified by Poon and Poon (1999). Poon and Poon (2002) constructed three diagnostic measures, $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$, to identify influential observations in non-ipsative multivariate data. Like other diagnostic measures based on the local influence approach, these measures are able to address joint influence.

In the influence analysis of multivariate data, different statistical procedures place their emphasis on different sample quantities. Some procedures emphasize both the estimate of the mean vector and the estimate of the covariance matrix, while others emphasize only one of them. In view of this, Poon and Poon constructed the diagnostic measures $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$ to identify influential observations of different natures. In particular, $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$ identify observations that exert an unusual influence on, respectively, the estimates of the mean vector together with the covariance matrix, the estimate of the mean vector, and the estimate of the covariance matrix.

In this thesis, the diagnostic measures developed for the influence analysis of ipsative data rely on the measures $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$ described above. The procedure for establishing diagnostic measures for ipsative data, together with real-data illustrations of its practicability, is presented in Chapter 4.


Chapter 2

Ipsative Data

Data are said to be ipsative if each observation is subject to a constant-sum constraint. Recall (1.1),

\[ \mathbf{1}^T x = c, \tag{2.1} \]

where $\mathbf{1}$ is a $p \times 1$ unit vector and $c$ is a constant scalar. Then $x$ is a $p$-dimensional data vector with the ipsative property.

Ipsative data are useful in practice. For example, if one person rank-orders his or her favorite subjects and someone else does the same, the intensity of preference for any particular subject cannot be compared because of ipsativity, but the rankings can. The use of ipsative items in questionnaires can therefore reduce the scope for lying. Another example concerns questionnaires in the occupational field, where employers may be particularly interested in possible 'negative' personality traits, such as whether applicants are likely to be lazy, dishonest, or bad-tempered. Applicants are likely to feel freer to report their personality when ipsative items are used.

One property of ipsative data is that, although scores are independent across respondents, within a respondent the score on each variable depends on the scores on the other variables.


2.1 Additive Ipsative Data (AID) and Multiplicative Ipsative Data (MID)

2.1.1 Additive Ipsative Data (AID)

Sometimes ipsative data result from a transformation of original nonipsative data. For example, suppose a drug abuse survey is conducted to learn subjects' monthly consumption of four different drugs. Because the topic is sensitive, relative consumption is collected in order to maintain the integrity of the assessment. Suppose the monthly consumption of the four drugs is (32, 24, 38, 22) for subject A, with average 29, and (20, 12, 26, 10) for subject B, with average 17. After subtracting each subject's average from the raw data, both subjects have the same relative consumption (3, -5, 9, -7).

In the above example, the vector of deviation scores is obtained by subtracting each individual's unweighted average from the raw scores. This kind of transformation is called the additive ipsative transformation (Chan & Bentler, 1993). Let $y$ be a $p \times 1$ column vector of raw scores, called preipsative data. Then $x$ acquires its ipsative property through

\[ x = y - \mathbf{1}\bar{y}, \tag{2.2} \]

where $\bar{y}$ is the average of the elements of $y$ and $\mathbf{1}$ is a $p \times 1$ unit vector. The vector $y$ is called the underlying vector of the AID vector $x$.

AID is subject to the constant-sum constraint

\[ \mathbf{1}^T x = \mathbf{1}^T(y - \mathbf{1}\bar{y}) = \mathbf{1}^T y - \mathbf{1}^T\mathbf{1}\bar{y} = 0. \tag{2.3} \]
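The additive ipsative transformation (2.2) can be sketched in a few lines of Python. This is only an illustration using the drug-consumption figures quoted above; the function name `to_aid` is a hypothetical helper and not part of the thesis.

```python
def to_aid(raw_scores):
    """Additive ipsative transformation (2.2): subtract the unweighted mean."""
    mean = sum(raw_scores) / len(raw_scores)
    return [score - mean for score in raw_scores]


# Monthly consumption figures quoted in the text for subjects A and B.
subject_a = [32, 24, 38, 22]
subject_b = [20, 12, 26, 10]

aid_a = to_aid(subject_a)
aid_b = to_aid(subject_b)

print(aid_a)       # deviation scores for subject A
print(aid_b)       # deviation scores for subject B
print(sum(aid_a))  # the zero-sum constraint (2.3)
```

Both subjects yield the same deviation vector even though their raw consumption levels differ, which is exactly the information that the ipsative transformation discards.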


2.1.2 Multiplicative Ipsative Data (MID)

As mentioned in Chapter 1, another type of ipsative transformation is the multiplicative ipsative transformation. The multiplicative ipsative data (MID) vector $x$ is obtained by taking the proportion of each $y_i > 0$ to the whole, $\sum_i y_i$. Let $y = (y_1, y_2, \ldots, y_p)^T$ be the preipsative vector of the MID vector $x$. Then

\[ x = \Big(y_1 \big/ \textstyle\sum_i y_i,\; y_2 \big/ \textstyle\sum_i y_i,\; \ldots,\; y_p \big/ \textstyle\sum_i y_i\Big)^T = (\mathbf{1}^T y)^{-1} y. \tag{2.4} \]

Under definition (2.4), $x$ is also called compositional data. For example, the ingredients of foods are usually expressed as percentages on food packages. These percentages are examples of MID.

MID is also subject to a constant-sum constraint:

\[ \mathbf{1}^T x = \mathbf{1}^T(\mathbf{1}^T y)^{-1} y = (\mathbf{1}^T y)^{-1}(\mathbf{1}^T y) = 1. \tag{2.5} \]

By (2.3) and (2.5), the sum of scores for each respondent is zero for AID and one for MID. The average of the elements of a MID vector is therefore always one divided by the number of measures, so MID can easily be transformed to AID by subtracting this average from each score.
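The conversion from MID to AID described above can be sketched as follows. The raw scores are hypothetical values chosen for illustration, and `to_mid` and `mid_to_aid` are assumed helper names, not functions from the thesis.

```python
def to_mid(raw_scores):
    """Multiplicative ipsative transformation (2.4): divide by the total."""
    total = sum(raw_scores)
    return [score / total for score in raw_scores]


def mid_to_aid(mid):
    """Subtract the MID average, which always equals 1/p."""
    p = len(mid)
    return [value - 1.0 / p for value in mid]


raw = [2.0, 3.0, 5.0]   # hypothetical positive raw scores
mid = to_mid(raw)       # proportions that sum to one, as in (2.5)
aid = mid_to_aid(mid)   # deviations that sum to zero, as in (2.3)

print(mid, sum(mid))
print(aid, sum(aid))
```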

2.2 Partially Additive Ipsative Data (PAID)

Real data may contain ipsative and nonipsative measures at the same time; such data are known as partially ipsative data (Hicks, 1970). For example, in tests of different rocks, scientists may wish to examine not only the compositions of the rocks but also their hardness and color. The measures of composition are then ipsative while the others are nonipsative. Similarly, some data contain additive ipsative data (AID) and nonipsative data at the same time. Chan and Bentler (1996) defined this kind of data as partially additive ipsative data (PAID). PAID is a collection of AIDs, weighted AIDs, and nonipsative data. In the following, we give more details on Chan and Bentler's definition of PAID, which generalizes the definition of AID in (1.2).

2.2.1 Vector of PAID

Similar to AID, an underlying nonipsative vector $y$ is assumed for PAID. Let $y = (y_1^T, y_2^T, \ldots, y_G^T)^T$ be the $p \times 1$ underlying random vector, where $y_k$ is a $p_k \times 1$ subvector of $y$, $k = 1, \ldots, G$.

Then $x = (x_1^T, x_2^T, \ldots, x_G^T)^T$ is a $p \times 1$ vector of PAID (with respect to $y$) if

\[ x_k = [I_k - U_k(V_k^T U_k)^{-1} V_k^T]\, y_k = A_k y_k, \qquad k = 1, \ldots, G, \tag{2.6} \]

where $I_k$ is a $p_k \times p_k$ identity matrix, and $U_k$ and $V_k$ are $p_k \times q_k$ known matrices with $\mathrm{rank}(U_k) = \mathrm{rank}(V_k) = q_k$.

Also, we have

\[ \sum_{k=1}^{G} p_k = p \quad \text{and} \quad \sum_{k=1}^{G} q_k = q > 0. \]

Let $A$ be a $p \times p$ block-diagonal matrix with diagonal blocks $A_1, A_2, \ldots, A_G$, and let $V$ be a $p \times q$ block-diagonal matrix with diagonal blocks $V_1, V_2, \ldots, V_G$.

Then, for $k = 1, \ldots, G$, (2.6) becomes

\[ x = Ay, \tag{2.7} \]

and

\[ V^T x = \big((V_1^T x_1)^T, (V_2^T x_2)^T, \ldots, (V_G^T x_G)^T\big)^T = 0. \tag{2.8} \]


From (2.8), each subvector $x_k$ is subject to $q_k$ linear constraints, $V_k^T x_k = 0$. Altogether, therefore, $x$ is subject to $q$ constraints. Here $q$ is regarded as a measure of the degree of ipsativity (DI) of PAID. In general, as $q$ (DI) increases, the amount of information remaining in PAID decreases.

2.2.2 Special Cases of PAID

Let $\mathbf{1}_k$ be a $p_k \times 1$ unit vector and $c_k$ be any non-zero $p_k \times 1$ known vector. We then have the following special cases of PAID:

1. If $U_k = V_k = \mathbf{1}_k$ for all $k$, then $x$ contains $G$ different sets of AID. When $G = 1$, PAID equals AID.

2. If $U_k = \mathbf{1}_k$ and $V_k = c_k$, then $x_k$ is the vector of deviation scores obtained by subtracting the individual's weighted average from the raw data $y_k$. In this case $x_k$ is regarded as weighted AID with weight $c_k$.

3. If $q_k = 0$, then $U_k$ and $V_k$ do not exist. Therefore $A_k = I_k$ and $x_k = y_k$. In this case, $x$ is allowed to contain a nonipsative component. When this happens, $V$ needs to be modified by removing the columns associated with $V_k$.
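The second special case (weighted AID) can be illustrated with a short sketch. With $U_k = \mathbf{1}_k$ and $V_k = c_k$, (2.6) reduces to subtracting the weighted average $c_k^T y_k / c_k^T \mathbf{1}_k$ from every element of $y_k$, which requires $c_k^T \mathbf{1}_k \neq 0$. The scores, the weights and the function name `weighted_aid` below are hypothetical.

```python
def weighted_aid(raw_scores, weights):
    """Weighted AID: x = [I - 1 (c'1)^{-1} c'] y, i.e. y minus its weighted mean."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("c'1 must be non-zero for (2.6) to be defined")
    weighted_mean = sum(w * y for w, y in zip(weights, raw_scores)) / total_weight
    return [y - weighted_mean for y in raw_scores]


y = [4.0, 8.0, 6.0]   # hypothetical raw scores y_k
c = [1.0, 2.0, 1.0]   # hypothetical weight vector c_k

x = weighted_aid(y, c)
constraint = sum(w * v for w, v in zip(c, x))   # V_k' x_k, should vanish

print(x)
print(constraint)
```

Setting every weight to one recovers the ordinary AID of the first special case.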

Remark:

In this thesis, the procedure for identifying influential observations is first developed for AID. The procedure is then applied to PAID and is found to be applicable there as well.


Chapter 3

Transformation

3.1 Distribution of AID

Let $y$ be a $p \times 1$ underlying random vector in $\mathbb{R}^p$ and assume that it is distributed as $N_p[\mu_y, \Sigma_y]$, where $\Sigma_y$ is a symmetric and positive definite matrix. Correspondingly, let $x$ be a $p \times 1$ ipsative data vector with mean $\mu$ and covariance matrix $\Sigma$. Then, we have

\[ E(x) = \mu = A\mu_y \tag{3.1} \]

and

\[ \mathrm{cov}(x) = \Sigma = A\Sigma_y A^T, \]

where $A$ is a known block-diagonal matrix whose elements are determined by the pattern of ipsativity. For AID,

\[ \mathrm{rank}(A) = p - 1, \tag{3.2} \]

and for PAID, the rank of $A$ is given by

\[ \mathrm{rank}(A) = \sum_{k=1}^{G} \mathrm{rank}(A_k) = \sum_{k=1}^{G} (p_k - q_k) = p - q < p. \tag{3.3} \]


From now on, we denote the rank of $A$ by $r$, with $r < p$.

Proposition 1

If $\Sigma_y$ is positive definite, then $\mathrm{rank}(A\Sigma_y A^T) = \mathrm{rank}(A)$.

Proof:

See Chan and Bentler (1996). □

Therefore, $\mathrm{rank}(\Sigma) = \mathrm{rank}(A\Sigma_y A^T) = \mathrm{rank}(A) = r < p$. The observed $x$ has a degenerate (singular) normal distribution with mean $\mu$ and covariance matrix $\Sigma$, where $\Sigma$ is symmetric and, because its rank is less than $p$, positive semidefinite rather than positive definite.
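As a small worked illustration of this rank result (a sketch under the assumption $\Sigma_y = I$, not an example taken from the thesis), consider AID with $p = 2$. Because $A$ is symmetric and idempotent, $\Sigma$ coincides with $A$ and is visibly singular:

```latex
\[
A = I - \tfrac{1}{2}\mathbf{1}\mathbf{1}^T
  = \begin{pmatrix} \tfrac12 & -\tfrac12 \\ -\tfrac12 & \tfrac12 \end{pmatrix},
\qquad
\Sigma = A \Sigma_y A^T = A A^T = A,
\]
\[
\det \Sigma = \tfrac14 - \tfrac14 = 0,
\qquad
\mathrm{rank}(\Sigma) = 1 = p - 1,
\qquad
\mathbf{1}^T \Sigma = 0^T .
\]
```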

3.2 Transformation

As the density of $x$ does not exist, the ML estimates of $\mu$ and $\Sigma$ cannot be obtained from $x$ directly. To overcome this problem, Chan and Bentler (1996) proposed a transformation $D: \mathbb{R}^p \to \mathbb{R}^r$ such that the transformed $x^* = Dx$ has a nonsingular density. They also established the following way of finding the transformation.

Let $D$ be an $r \times p$ matrix such that $R(D^T) = R(A)$. Thus, $D^T$ has full column rank $r$ and

\[ P_{R(D^T)} = P_{R(A)} = D^T(DD^T)^{-1}D, \tag{3.4} \]

where $P_{R(\cdot)}$ is the orthogonal projection matrix onto $R(\cdot)$. Since $x \in R(A)$, there exists a $D$ such that

\[ x = P_{R(A)}\,x = D^T(DD^T)^{-1}x^*, \tag{3.5} \]

where

\[ x^* = Dx \tag{3.6} \]

provides the required transformation of $x$. A simple way to obtain a matrix $D$ that satisfies (3.6) is to remove the redundant rows of $A$.

According to Chan and Bentler (1996), the transformation (3.6) is reversible, because $x$ can be perfectly reconstructed from $x^*$. As $x^*$ preserves the information about $x$, it is reasonable to find the ML estimates of $\mu$ and $\Sigma$ from $x^*$.
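The reversibility of the transformation can be checked numerically. The sketch below assumes AID with $p = 3$, builds $D$ by dropping the last (redundant) row of $A = I - \mathbf{1}(\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T$, and reconstructs $x$ as $D^T(DD^T)^{-1}x^*$. The data vector and the helper functions are hypothetical; exact fractions are used so that the reconstruction can be compared without rounding error.

```python
from fractions import Fraction as F


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


p = 3
# Centering matrix A for AID.
A = [[F(int(i == j)) - F(1, p) for j in range(p)] for i in range(p)]
D = A[:p - 1]                      # remove the redundant last row: r = p - 1

x = [F(3), F(-5), F(2)]            # hypothetical AID vector (sums to zero)
x_star = matvec(D, x)              # non-ipsative transformed vector, x* = Dx

# D'(DD')^{-1}, the p x r matrix that maps x* back to x.
back = matmul(transpose(D), inverse_2x2(matmul(D, transpose(D))))
x_rebuilt = matvec(back, x_star)

print(x_star)
print(x_rebuilt)
```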

Proposition 2

The covariance matrix of $x^*$, $DA\Sigma_y A^T D^T$, is positive definite.

Proof:

See Chan and Bentler (1996). □

From Proposition 2, it can be seen that $x^*$ follows a nonsingular $N_r[\mu^*, \Sigma^*]$ with $\mu^* = D\mu = DA\mu_y$ and $\Sigma^* = D\Sigma D^T = DA\Sigma_y A^T D^T$, where $\Sigma^*$ is a symmetric and positive definite matrix.

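Remark:

AID is a special case of PAID with $U_k = V_k = \mathbf{1}_k$ for all $k$, and $G = 1$. Combining (2.2) and (2.6), we have

\[ x = y - \mathbf{1}\bar{y} = [I - \mathbf{1}(\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T]\,y = Ay, \tag{3.7} \]

with $\mathrm{rank}(A) = p - 1$. Hence, the transformation $D$ maps $\mathbb{R}^p$ to $\mathbb{R}^{p-1}$ and $D^T$ has full column rank $p - 1$.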


3.3 Relationship between Parameters of AID and the Transformed Vector

Assuming that $x$ follows $N_p[\mu, \Sigma]$, it can be seen that the distribution of $x^*$ is $N_r[\mu^*, \Sigma^*]$, where $\mu^* = D\mu$ and $\Sigma^* = D\Sigma D^T$. From (3.5) and (3.6), the relationship between the mean and the covariance matrix of $x$ and those of $x^*$ is given by

\[ \mu^* = D\mu \quad \text{and} \quad \mu = D^T(DD^T)^{-1}\mu^*, \tag{3.8} \]

\[ \Sigma^* = D\Sigma D^T \quad \text{and} \quad \Sigma = D^T(DD^T)^{-1}\Sigma^*(DD^T)^{-1}D. \tag{3.9} \]

Let

\[ \theta = (\mu^T, \sigma^T)^T \]

be a $p^* \times 1$ vector storing the elements of $\mu$ and the lower triangular elements of $\Sigma$, with $p^* = p + p(p+1)/2$, and let

\[ \theta^* = (\mu^{*T}, \sigma^{*T})^T \]

be an $r^* \times 1$ vector storing the elements of $\mu^*$ and the lower triangular elements of $\Sigma^*$, with $r^* = r + r(r+1)/2$.

Since the elements of the matrix $D$ are known constants, it follows from (3.8) and (3.9) that there exist an $r^* \times p^*$ constant matrix $R$ and a $p^* \times r^*$ constant matrix $Q$ such that

\[ \theta^* = R\theta \tag{3.10} \]

and

\[ \theta = Q\theta^*. \tag{3.11} \]
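To make the parameter vector concrete, the sketch below stacks a mean vector and the lower triangular elements of a covariance matrix into a single list of length $p^* = p + p(p+1)/2$. The row-by-row ordering of the lower triangle is an assumption made for illustration (the thesis does not state the ordering here), and the numerical values are hypothetical.

```python
def stack_parameters(mean, cov):
    """Stack the mean and the lower-triangular covariance entries into one vector."""
    p = len(mean)
    # Row-by-row ordering of the lower triangle (an assumed convention).
    lower = [cov[i][j] for i in range(p) for j in range(i + 1)]
    return list(mean) + lower


p = 3
mu = [0.0, 0.0, 0.0]                  # hypothetical AID mean (sums to zero)
sigma = [[2.0, -1.0, -1.0],           # hypothetical singular covariance matrix
         [-1.0, 2.0, -1.0],           # whose rows sum to zero, as in (1.4)
         [-1.0, -1.0, 2.0]]

theta = stack_parameters(mu, sigma)
p_star = p + p * (p + 1) // 2

print(theta)
print(len(theta), p_star)
```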


Chapter 4

Influence Analysis of Ipsative

Data

In previous chapters, the definitions of ipsative data and the transformation were introduced in detail. In this chapter, a procedure based on the local influence approach is developed for identifying influential observations in the analysis of ipsative data. The procedure builds on the work of Cook (1986), Poon and Poon (1999) and Poon and Poon (2002).

The steps of the local influence approach for developing diagnostic measures for ipsative data are as follows:

1. Define the postulated model.

2. Choose a perturbation of the postulated model.

3. Define the induced likelihood displacement function $f(\cdot)$.

4. Use differential geometry techniques to assess the behavior of the influence graph $g$ of $f(\cdot)$, with a view to developing diagnostic measures for identifying influential observations.


4.1 The Postulated Model

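In this section, we define the postulated model for ipsative data. Let $x$ be a $p \times 1$ random vector of AID distributed as $N_p[\mu, \Sigma]$, where $\Sigma = \{\sigma_{ij}\}$, and let $\{x_i, i = 1, \ldots, n\}$ be a random sample from which we estimate the mean and the covariance matrix. Then $x^*$ is an $r \times 1$ transformed vector, obtained via the transformation (3.6) and distributed as $N_r[\mu^*, \Sigma^*]$, where $\Sigma^* = \{\sigma^*_{ij}\}$ is a symmetric and positive definite matrix, and $\{x_i^*, i = 1, \ldots, n\}$ is the transformed sample. The maximum likelihood (ML) estimates $\hat{\mu}^*$ of $\mu^*$ and $\hat{\Sigma}^*$ of $\Sigma^*$ are obtained by maximizing the log-likelihood

\[ L(\theta^*) = \sum_{i=1}^{n} \frac{1}{2}\Big(-r\log(2\pi) - \log|\Sigma^*| - (x_i^* - \mu^*)^T \Sigma^{*-1} (x_i^* - \mu^*)\Big). \tag{4.1} \]

If we express $L(\theta^*)$ in (4.1) in terms of $\theta = (\mu^T, \sigma^T)^T$, it becomes

\[ L(\theta) = \sum_{i=1}^{n} \frac{1}{2}\Big(-r\log(2\pi) - \log|D\Sigma D^T| - (x_i^* - D\mu)^T (D\Sigma D^T)^{-1} (x_i^* - D\mu)\Big). \tag{4.2} \]

It can be shown that the likelihood function $L(\theta)$ is maximized when

\[ D\hat{\mu} = \frac{\sum_{i=1}^{n} x_i^*}{n} = \bar{x}^* \tag{4.3} \]

and

\[ D\hat{\Sigma}D^T = \frac{\sum_{i=1}^{n}(x_i^* - \bar{x}^*)(x_i^* - \bar{x}^*)^T}{n} = \frac{(n-1)S^*}{n}, \tag{4.4} \]

where $S^*$ is the unbiased sample covariance matrix of $x^*$. Then, the ML estimate $\hat{\theta} = (\hat{\mu}^T, \hat{\sigma}^T)^T$ of $\theta$ is given by

\[ \hat{\mu} = D^T(DD^T)^{-1}\bar{x}^* \tag{4.5} \]

and

\[ \hat{\Sigma} = D^T(DD^T)^{-1}\left(\frac{(n-1)S^*}{n}\right)(DD^T)^{-1}D. \tag{4.6} \]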
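A minimal sketch of the estimation route just described, assuming AID with $p = 3$ and a $D$ obtained by dropping the last row of the centering matrix: the sample is transformed, the mean and the divisor-$n$ covariance of $x^*$ are computed, and both are mapped back with $D^T(DD^T)^{-1}$. The sample values and helper names are hypothetical. Because every AID vector already lies in $R(A)$, the back-transformed estimates coincide with the sample mean and divisor-$n$ covariance of the original $x_i$, which gives a convenient check.

```python
from fractions import Fraction as F


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def ml_covariance(rows):
    """Covariance with divisor n, i.e. (n - 1) S / n."""
    n = len(rows)
    centre = mean_vector(rows)
    dev = [[v - c for v, c in zip(row, centre)] for row in rows]
    k = len(centre)
    return [[sum(d[i] * d[j] for d in dev) / n for j in range(k)] for i in range(k)]


# Hypothetical AID sample with p = 3; every row sums to zero.
sample = [[F(3), F(-5), F(2)], [F(1), F(1), F(-2)],
          [F(-4), F(2), F(2)], [F(0), F(-2), F(2)]]

p = 3
A = [[F(int(i == j)) - F(1, p) for j in range(p)] for i in range(p)]
D = A[:p - 1]
back = matmul(transpose(D), inverse_2x2(matmul(D, transpose(D))))  # D'(DD')^{-1}

starred = [[sum(d * v for d, v in zip(row, x)) for row in D] for x in sample]
mean_star = mean_vector(starred)                                    # (4.3)
cov_star = ml_covariance(starred)                                   # (4.4)

mu_hat = [sum(b * m for b, m in zip(row, mean_star)) for row in back]   # (4.5)
sigma_hat = matmul(matmul(back, cov_star), transpose(back))             # (4.6)

print(mu_hat)
print(sigma_hat)
```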


It is noted that there may exist more than one matrix $D$ of full row rank that satisfies (3.6). However, the functions $L(\theta) = L_D(\theta)$ defined in (4.2) for different choices of $D$ differ only by a constant.

Proposition 3

The functions $L_{D_1}(\theta)$ and $L_{D_2}(\theta)$ differ by a constant.

Proof:

See Appendix A.1. □

4.2 Perturbation

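In order to assess the influence of an individual observation on the estimate of $\theta$, the case-weights perturbation (Cook, 1986) is introduced into the log-likelihood. The resulting perturbed likelihood is given by

\[ L(\theta \mid \omega) = \sum_{i=1}^{n} \frac{\omega_i}{2}\Big(-r\log(2\pi) - \log|D\Sigma D^T| - (x_i^* - D\mu)^T (D\Sigma D^T)^{-1} (x_i^* - D\mu)\Big) \tag{4.7} \]

\[ \phantom{L(\theta \mid \omega)} = \sum_{i=1}^{n} \frac{\omega_i}{2}\Big(-r\log(2\pi) - \log|\Sigma^*| - (x_i^* - \mu^*)^T \Sigma^{*-1} (x_i^* - \mu^*)\Big) = L(\theta^* \mid \omega), \tag{4.8} \]

where $\omega_i$, $i = 1, \ldots, n$, are perturbation parameters, and $\omega = (\omega_1, \ldots, \omega_n)^T$, which collects these parameters, is an $n \times 1$ vector in a relevant perturbation space $\Omega$ of $\mathbb{R}^n$. It is assumed that there exists an $\omega_0$ such that $L(\theta \mid \omega_0) = L(\theta)$ for all $\theta$ and $L(\theta^* \mid \omega_0) = L(\theta^*)$ for all $\theta^*$.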


4.3 Likelihood Displacement

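Let $\hat{\theta}$ and $\hat{\theta}_\omega$ be the estimates of $\theta$ that maximize the likelihoods $L(\theta)$ in (4.2) and $L(\theta \mid \omega)$ in (4.7) respectively. Let $\hat{\theta}^*$ and $\hat{\theta}^*_\omega$ be the estimates of $\theta^*$ that maximize the likelihoods $L(\theta^*)$ in (4.1) and $L(\theta^* \mid \omega)$ in (4.8) respectively. Therefore, from the relationships given in (3.10) and (3.11), we have

\[ \hat{\theta}^* = R\hat{\theta} \quad \text{and} \quad \hat{\theta}^*_\omega = R\hat{\theta}_\omega, \tag{4.9} \]

\[ \hat{\theta} = Q\hat{\theta}^* \quad \text{and} \quad \hat{\theta}_\omega = Q\hat{\theta}^*_\omega. \tag{4.10} \]

Following Cook (1986), the discrepancy between $\hat{\theta}$ and $\hat{\theta}_\omega$ can be measured by the likelihood displacement function

\[ f(\omega) = L(\hat{\theta} \mid \omega_0) - L(\hat{\theta}_\omega \mid \omega_0). \tag{4.11} \]

This function attains its minimum value at $\omega = \omega_0$. We have $L(\theta \mid \omega_0) = L(\theta)$ and $L(\theta^* \mid \omega_0) = L(\theta^*)$ if $\omega_0 = \mathbf{1}$, where $\mathbf{1}$ is an $n \times 1$ vector with 1 in every slot. When the perturbation specified by the perturbation parameter $\omega$ causes a considerable deviation of $\hat{\theta}_\omega$ from $\hat{\theta}$, a considerable deviation of $f(\omega)$ from $f(\omega_0)$ results. Hence, examining the change of $f(\omega)$ as a function of $\omega$ enables us to assess the influential perturbations and so identify observations that influence the estimate of $\theta$. It is noted that $f(\omega)$ does not depend on the choice of $D$.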

Proposition 4

The displacement function $f(\omega)$ does not depend on the choice of $D$.

Proof:

See Appendix A.2. □
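The behavior of the likelihood displacement under case weights can be illustrated in the simplest setting. The sketch below assumes a univariate normal model (the case $r = 1$), for which the weighted log-likelihood is maximized by the weighted mean and the weighted divisor-$\sum\omega_i$ variance. The data and function names are hypothetical, and the code is an illustration of the idea rather than the multivariate procedure of this chapter.

```python
import math


def log_likelihood(data, mu, var, weights=None):
    """Case-weighted normal log-likelihood; unit weights give the unperturbed one."""
    if weights is None:
        weights = [1.0] * len(data)
    return sum(0.5 * w * (-math.log(2 * math.pi) - math.log(var) - (x - mu) ** 2 / var)
               for x, w in zip(data, weights))


def weighted_mle(data, weights):
    """Maximizer of the case-weighted log-likelihood (univariate normal)."""
    total = sum(weights)
    mu = sum(w * x for x, w in zip(data, weights)) / total
    var = sum(w * (x - mu) ** 2 for x, w in zip(data, weights)) / total
    return mu, var


def displacement(data, weights):
    """f(w) = L(theta_hat | w0) - L(theta_hat_w | w0), with w0 a vector of ones."""
    ones = [1.0] * len(data)
    mu0, var0 = weighted_mle(data, ones)
    mu_w, var_w = weighted_mle(data, weights)
    return log_likelihood(data, mu0, var0) - log_likelihood(data, mu_w, var_w)


data = [1.0, 2.0, 3.0, 10.0]                        # hypothetical sample; last value is extreme
f_null = displacement(data, [1.0, 1.0, 1.0, 1.0])   # no perturbation
f_down = displacement(data, [1.0, 1.0, 1.0, 0.5])   # down-weight the extreme case

print(f_null)
print(f_down)
```

The displacement is zero at $\omega_0$ and positive once the extreme case is down-weighted, in line with the statement that $f$ attains its minimum at $\omega_0$.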


4.4 Normal Curvature

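In this section, we assess the behavior of the influence graph $g$ of the likelihood displacement function $f(\omega)$ by applying differential geometry techniques. Cook (1986) proposed the normal curvature $C_\ell$ to quantify the change of $f(\omega)$ from $f(\omega_0)$, where $\ell$ defines the direction of a straight line in $\Omega$ passing through $\omega_0$. A large value of $C_\ell$ indicates that perturbation along the corresponding direction $\ell$ induces a substantial change in the likelihood displacement. Poon and Poon (1999) showed that the normal curvature $C_\ell$ of the influence graph $g$ of $f(\omega)$ along a direction $\ell$ at the optimal point $\omega_0$ is given by

\[ C_\ell = \left.\frac{\ell^T H_f\, \ell}{(1 + \nabla_f^T \nabla_f)^{1/2}\; \ell^T (I + \nabla_f \nabla_f^T)\, \ell}\right|_{\omega = \omega_0}, \tag{4.12} \]

where $\nabla_f = (\partial f/\partial\omega_1, \ldots, \partial f/\partial\omega_n)^T$ is the gradient vector of $f$, and $H_f$ is an $n \times n$ matrix given by

\[ H_f = \frac{\partial^2 f}{\partial\omega\,\partial\omega^T}. \tag{4.13} \]

As $f(\omega)$ in (4.11) attains its minimum at $\omega = \omega_0$, the gradient vanishes there, and if $\ell$ is chosen such that $\|\ell\| = 1$, then $C_\ell$ in (4.12) reduces to

\[ C_\ell = \ell^T H_f\, \ell\,\big|_{\omega = \omega_0}. \tag{4.14} \]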

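Let $\ddot{L}_\theta$ and $\Delta$ be $p^* \times p^*$ and $p^* \times n$ matrices respectively, defined as

\[ \ddot{L}_\theta = \left.\frac{\partial^2 L(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta = \hat{\theta}} \quad \text{and} \quad \Delta = \left.\frac{\partial^2 L(\theta \mid \omega)}{\partial\theta\,\partial\omega^T}\right|_{\theta = \hat{\theta},\, \omega = \omega_0}. \tag{4.15} \]

Cook (1986) has shown that $C_\ell$ can in general be written as a function of $\ddot{L}_\theta$ and $\Delta$. On the other hand, let

\[ \ddot{L}_{\theta^*} = \left.\frac{\partial^2 L(\theta^*)}{\partial\theta^*\,\partial\theta^{*T}}\right|_{\theta^* = \hat{\theta}^*}, \tag{4.16} \]

then $\ddot{L}_{\theta^*}$ is an $r^* \times r^*$ matrix that can be obtained using the covariance matrices of $\hat{\mu}^*$ and $\hat{\sigma}^*$ (Poon & Poon, 2002, equation (9); see Appendix B.1).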


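The relationship between $\ddot{L}_\theta$ and $\ddot{L}_{\theta^*}$ is given by:

\[ \ddot{L}_\theta = \frac{\partial^2 L(\theta)}{\partial\theta\,\partial\theta^T} = \left(\frac{\partial\theta^*}{\partial\theta^T}\right)^T \frac{\partial^2 L(\theta^*)}{\partial\theta^*\,\partial\theta^{*T}} \left(\frac{\partial\theta^*}{\partial\theta^T}\right) = R^T \ddot{L}_{\theta^*} R. \tag{4.17} \]

The calculation of $\partial\theta^*/\partial\theta^T$ is in Appendix C.

Moreover, if

\[ \Delta^* = \left.\frac{\partial^2 L(\theta^* \mid \omega)}{\partial\theta^*\,\partial\omega^T}\right|_{\theta^* = \hat{\theta}^*,\, \omega = \omega_0}, \tag{4.18} \]

then $\Delta^*$ is an $r^* \times n$ matrix whose elements have analytical expressions available in Poon and Poon (2002, equations (15) and (17); see Appendix B.2). The relationship between $\Delta$ and $\Delta^*$ is given by:

\[ \Delta = \frac{\partial^2 L(\theta \mid \omega)}{\partial\theta\,\partial\omega^T} = \left(\frac{\partial\theta^*}{\partial\theta^T}\right)^T \frac{\partial^2 L(\theta^* \mid \omega)}{\partial\theta^*\,\partial\omega^T} = R^T \Delta^*. \tag{4.19} \]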

The normal curvature $C_\ell$ and the conformal normal curvature $B_\ell$ (Poon & Poon, 1999) can in general be computed from the matrix $\ddot{L}_\theta$, evaluated at $\hat{\theta}$. It is worth noting that the conformal normal curvature $B_\ell$, which transforms the normal curvature $C_\ell$ onto the unit interval, has been demonstrated by Poon and Poon (1999) to be an effective influence measure. However, the matrix $\ddot{L}_\theta$ is singular for ipsative data, so it is difficult to apply the methods proposed by Cook (1986) or Poon and Poon (2002) directly to develop diagnostic measures. We therefore consider the following method for computing the normal curvature.


4.5 Computation of the Normal Curvature

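Following Cook (1986, equation (12)), we have

\[ H_f = -J^T \ddot{L}_\theta J, \tag{4.20} \]

where $J$ is a $p^* \times n$ matrix defined as

\[ J = \left.\frac{\partial\hat{\theta}_\omega}{\partial\omega^T}\right|_{\omega = \omega_0}. \tag{4.21} \]

To evaluate $J$, we use the fact that

\[ \left.\frac{\partial L(\theta \mid \omega)}{\partial\theta_j}\right|_{\theta = \hat{\theta}_\omega} = 0 \tag{4.22} \]

for $j = 1, 2, \ldots, p^*$ and all $\omega$ in $\Omega$. Differentiating (4.22) with respect to $\omega_i$ for $i = 1, 2, \ldots, n$ and evaluating at $\hat{\theta}$ and $\omega_0$, we obtain

\[ \sum_{k=1}^{p^*} \frac{\partial^2 L(\theta \mid \omega)}{\partial\theta_j\,\partial\theta_k}\,\frac{\partial\hat{\theta}_{\omega,k}}{\partial\omega_i} + \frac{\partial^2 L(\theta \mid \omega)}{\partial\theta_j\,\partial\omega_i} = 0. \tag{4.23} \]

By (4.15) and (4.21), (4.23) reduces to

\[ \ddot{L}_\theta J + \Delta = 0, \tag{4.24} \]

where all matrices are evaluated at $\theta = \hat{\theta}$ and $\omega = \omega_0$. The matrix $\ddot{L}_\theta$ is singular because its covariance part is singular. We use the crude covariance matrix $\mathrm{cov}_c(\hat{\sigma})$ in Aitchison (1986, p.52) to approximate the negative of the covariance part of the inverse of $\ddot{L}_\theta$. The crude covariance matrix $\mathrm{cov}_c(\hat{\sigma})$ is calculated from the ipsative data as if they were ordinary non-ipsative data. Let $\ddot{L}_c$ be the approximation to the inverse of $\ddot{L}_\theta$ in which the negative of the covariance part equals the crude covariance matrix. Then, $\ddot{L}_c$ is a block-diagonal matrix given by

\[ \ddot{L}_c = \begin{pmatrix} -\mathrm{cov}(\hat{\mu}) & 0 \\ 0 & -\mathrm{cov}_c(\hat{\sigma}) \end{pmatrix}, \tag{4.25} \]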


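where $\mathrm{cov}(\hat{\mu})$ and $\mathrm{cov}_c(\hat{\sigma})$ can be computed using equations (10) and (11) in Poon and Poon (2002; see Appendix B.3). As a result (see also (4.19)),

\[ J = -\ddot{L}_c \Delta = -\ddot{L}_c R^T \Delta^*. \tag{4.26} \]

Furthermore, from (4.20), (4.26) and (4.17), we have

\[ \begin{aligned} H_f &= -J^T \ddot{L}_\theta J \\ &= -(-\ddot{L}_c R^T \Delta^*)^T\, \ddot{L}_\theta\, (-\ddot{L}_c R^T \Delta^*) \\ &= -(-\Delta^{*T} R \ddot{L}_c)(R^T \ddot{L}_{\theta^*} R)(-\ddot{L}_c R^T \Delta^*) \\ &= -\Delta^{*T} R \ddot{L}_c R^T \ddot{L}_{\theta^*} R \ddot{L}_c R^T \Delta^*. \end{aligned} \tag{4.27} \]

Hence, by (4.14) and (4.27),

\[ C_\ell = -\ell^T\big(\Delta^{*T} R \ddot{L}_c R^T \ddot{L}_{\theta^*} R \ddot{L}_c R^T \Delta^*\big)\ell. \tag{4.28} \]

It is noted that the most influential observations can be identified from the direction $\ell_{\max}$ along which the greatest change in the likelihood displacement is observed; $\ell_{\max}$ is the direction that gives $C_{\max} = \max_\ell C_\ell$. Moreover, $C_{\max}$ and $\ell_{\max}$ are the largest eigenvalue and the associated eigenvector of the symmetric matrix $H_f$.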
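One simple way to obtain $C_{\max}$ and $\ell_{\max}$ from a given symmetric matrix is power iteration, sketched below. This is a swapped-in numerical technique for illustration, not the method used in the thesis (which relies on Matlab). It assumes that the matrix is positive semidefinite, that its largest eigenvalue is simple, and that the starting vector is not orthogonal to $\ell_{\max}$. The matrix `H` is a hypothetical stand-in for $H_f$.

```python
import math


def largest_eigenpair(h, iterations=500):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    n = len(h)
    v = [1.0 / math.sqrt(n)] * n          # starting direction (assumed suitable)
    for _ in range(iterations):
        w = [sum(h[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:                    # zero matrix: every direction is flat
            break
        v = [c / norm for c in w]
    # Rayleigh quotient l' H l for the unit vector l = v.
    c_max = sum(v[i] * h[i][j] * v[j] for i in range(n) for j in range(n))
    return c_max, v


# Hypothetical symmetric positive semidefinite stand-in for H_f.
H = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]

c_max, l_max = largest_eigenpair(H)
print(c_max)
print(l_max)
```

Large absolute entries of `l_max` point to the perturbation parameters, and hence the cases, that contribute most to the direction of greatest curvature.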

4.6 Diagnostic Measures

As the normal curvature $C_\ell$ is defined on an unbounded interval, it may be difficult to judge its magnitude. The normal curvature is therefore transformed one-to-one onto the unit interval, and the transformed quantity is called the conformal normal curvature $B_\ell$. At the critical point $\omega_0$ along the direction $\ell$, the conformal normal curvature is given by

\[ B_\ell = \left.\frac{\ell^T H_f\, \ell}{\sqrt{\mathrm{tr}(H_f^2)}}\right|_{\theta = \hat{\theta},\, \omega = \omega_0}. \tag{4.29} \]


4.6.1 Observations influencing the estimates of $\mu$ and $\Sigma$

Let $E_j$, $j = 1, \ldots, n$, be the vectors of the $n$-dimensional standard basis. It is demonstrated by Poon and Poon (1999) that $B_{E_j} = B_j$, $j = 1, \ldots, n$, are effective measures for identifying the influential perturbation parameters when $C_{\max}$ is sufficiently large. It is noted that $B_j$ is the $j$-th diagonal element of the matrix

\[ \left.\frac{H_f}{\sqrt{\mathrm{tr}(H_f^2)}}\right|_{\theta = \hat{\theta},\, \omega = \omega_0}. \tag{4.30} \]

At this point, it is possible to compute the matrix $H_f$ by (4.27) and hence the diagnostic measures for revealing observations that influence the estimates of both $\mu$ and $\Sigma$. Observations that exert an undue influence on the estimates of both $\mu$ and $\Sigma$ can be located by examining $B_j$, $j = 1, \ldots, n$: the observations with large $B_j$ values are the influential ones.
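Given a matrix $H_f$, the diagnostics $B_j$ are cheap to compute. The sketch below assumes that $B_j$ is the $j$-th diagonal element of $H_f/\sqrt{\mathrm{tr}(H_f^2)}$ with $H_f$ symmetric and positive semidefinite; the matrix `H` is hypothetical and the function name is an illustrative choice.

```python
import math


def conformal_diagnostics(h):
    """Diagonal of H / sqrt(tr(H^2)) for a symmetric matrix H."""
    n = len(h)
    trace_h2 = sum(h[i][j] * h[j][i] for i in range(n) for j in range(n))
    scale = math.sqrt(trace_h2)
    return [h[j][j] / scale for j in range(n)]


# Hypothetical symmetric positive semidefinite stand-in for H_f.
H = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]

b = conformal_diagnostics(H)
print(b)   # one value per case; larger values flag more influential cases
```

For a positive semidefinite matrix each value lies in the unit interval, which is what makes the magnitudes comparable across data sets.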

4.6.2 Observations influencing the estimate of $\mu$ or $\Sigma$

There may exist observations that are influential on the estimate of $\mu$ but not on the estimate of $\Sigma$, or vice versa. Hence, we follow the method of Poon and Poon (2002) to construct two further diagnostic measures for ipsative data, $B_j^{\mu}$ and $B_j^{\sigma}$, $j = 1, \ldots, n$, which identify observations that exert an undue influence on the estimate of $\mu$ and on the estimate of $\Sigma$ respectively.

The diagnostic measures $B_j$, $j = 1, \ldots, n$, are developed from the influence graph $g$ of the likelihood displacement function $f(\omega)$ given in (4.11), so the estimates of all parameters in $\theta$ are affected by the perturbation. When only the effects on a subset of the parameters are of interest, Cook (1986) demonstrated that these effects can be studied by examining the normal curvature of the influence graph of an objective function derived from (4.11).


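Let $\ddot{L}_R(\mu)$ and $\ddot{L}_R(\sigma)$ be $p \times p$ and $(p^* - p) \times (p^* - p)$ matrices respectively, such that

\[ \ddot{L}_\theta = \begin{pmatrix} \ddot{L}_R(\mu) & 0 \\ 0 & \ddot{L}_R(\sigma) \end{pmatrix}, \tag{4.31} \]

where $\ddot{L}_\theta = R^T \ddot{L}_{\theta^*} R$ is obtained via (4.17).

Then, the influence of the perturbation on the estimate of $\mu$ can be studied by examining the normal curvature

\[ C_\ell^{\mu} = \ell^T H_f^{\mu}\, \ell = -\ell^T\big(\Delta^{*T} R \ddot{L}_c^{\mu} \ddot{L}_R^{\mu} \ddot{L}_c^{\mu} R^T \Delta^*\big)\ell, \tag{4.32} \]

where $\Delta^*$ is the same $r^* \times n$ matrix as in (4.18), $\ddot{L}_c^{\mu}$ is a $p^* \times p^*$ matrix given by

\[ \ddot{L}_c^{\mu} = \begin{pmatrix} -\mathrm{cov}(\hat{\mu}) & 0 \\ 0 & 0 \end{pmatrix}, \tag{4.33} \]

and $\ddot{L}_R^{\mu}$ is a $p^* \times p^*$ matrix equal to

\[ \ddot{L}_R^{\mu} = \begin{pmatrix} \ddot{L}_R(\mu) & 0 \\ 0 & 0 \end{pmatrix}. \tag{4.34} \]

Observations that exert an undue influence on the estimate of $\mu$ can be traced by examining the diagonal elements $B_j^{\mu}$, $j = 1, \ldots, n$, of the matrix

\[ \left.\frac{H_f^{\mu}}{\sqrt{\mathrm{tr}\big((H_f^{\mu})^2\big)}}\right|_{\theta = \hat{\theta},\, \omega = \omega_0}. \tag{4.35} \]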

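Similarly, the influence of the perturbation on the estimate of $\Sigma$ is reflected by the normal curvature

\[ C_\ell^{\sigma} = \ell^T H_f^{\sigma}\, \ell = -\ell^T\big(\Delta^{*T} R \ddot{L}_c^{\sigma} \ddot{L}_R^{\sigma} \ddot{L}_c^{\sigma} R^T \Delta^*\big)\ell, \tag{4.36} \]

where $\ddot{L}_c^{\sigma}$ is a $p^* \times p^*$ matrix given by

\[ \ddot{L}_c^{\sigma} = \begin{pmatrix} 0 & 0 \\ 0 & -\mathrm{cov}_c(\hat{\sigma}) \end{pmatrix}, \tag{4.37} \]

and $\ddot{L}_R^{\sigma}$ is a $p^* \times p^*$ matrix equal to

\[ \ddot{L}_R^{\sigma} = \begin{pmatrix} 0 & 0 \\ 0 & \ddot{L}_R(\sigma) \end{pmatrix}. \tag{4.38} \]

Observations that exert a disproportionate influence on the estimate of $\Sigma$ can be located by examining the diagonal elements $B_j^{\sigma}$, $j = 1, \ldots, n$, of the matrix

\[ \left.\frac{H_f^{\sigma}}{\sqrt{\mathrm{tr}\big((H_f^{\sigma})^2\big)}}\right|_{\theta = \hat{\theta},\, \omega = \omega_0}. \tag{4.39} \]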

At this point, we have constructed three diagnostic measures $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$, based on the local influence approach, to identify observations that exert an irregular influence on the estimates of $\mu$ and $\Sigma$ together, on the estimate of $\mu$, and on the estimate of $\Sigma$, respectively. In the next section, we demonstrate by examples how to apply the developed procedure to analyze ipsative data.

4.7 Examples

4.7.1 Example 1: Foraminiferal Compositions Data Set (AID)

As a demonstration, we first consider the foraminiferal compositions data set, which is available in Aitchison (1986, p.399). The data set consists of 30 specimens. Each composition consists of the percentages by weight of four species: Neogloboquadrina atlantica, Neogloboquadrina pachyderma, Globorotalia obesa and Globigerinoides triloba, which we conveniently abbreviate to Prop.1, Prop.2, Prop.3 and Prop.4. Pairwise scatter plots among the 4 variables are provided in Figure 4.1. The data are transformed from MID to AID by subtracting 1 from each slot and analyzed using the proposed procedure, which is programmed in Matlab (see Appendix D). The results are plotted in Figures 4.2a, 4.2b and 4.2c, where the values of $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$, $j = 1, \ldots, n$, are computed from $H$, $H^{\mu}$ and $H^{\sigma}$ respectively with a $D$ satisfying (3.6). The selected $D$ for this example is

$$D = \begin{pmatrix} 3/4 & -1/4 & -1/4 & -1/4 \\ -1/4 & 3/4 & -1/4 & -1/4 \\ -1/4 & -1/4 & -1/4 & 3/4 \end{pmatrix}. \tag{4.40}$$
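As a quick check of what the transformation in (4.40) does, the sketch below applies this $D$ to one observation after centering it to a zero sum, as the Matlab commands of Appendix D do. It is a minimal illustration in plain Python with exact fractions; the helper name `matvec` is hypothetical. Since this $D$ is the centering matrix with its third row deleted, a zero-sum observation is simply mapped to its first, second and fourth coordinates.

```python
from fractions import Fraction as F

# D of (4.40): centering matrix I - (1/4) 1 1' with the third row deleted.
D = [[F(3, 4), F(-1, 4), F(-1, 4), F(-1, 4)],
     [F(-1, 4), F(3, 4), F(-1, 4), F(-1, 4)],
     [F(-1, 4), F(-1, 4), F(-1, 4), F(3, 4)]]


def matvec(A, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


# First specimen of the foraminiferal data (Appendix D).
raw = [F(74, 100), F(19, 100), F(3, 100), F(4, 100)]
c = sum(raw)                          # constant sum of the observation
x = [v - c / len(raw) for v in raw]   # centered to a zero sum
x_star = matvec(D, x)                 # transformed, non-ipsative observation
```

Each row of $D$ sums to zero, so adding any constant to all slots of an observation leaves `x_star` unchanged.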

The results show that cases 24, 13, 25 and 26, listed in decreasing order of influence, are the most influential to the estimates of both $\mu$ and $\Sigma$. The findings are sensible because all of these observations are usually located at the boundaries of the data point clouds formed by pairs of variables. Therefore, they are influential to the estimates of both $\mu$ and $\Sigma$. Among these observations, the most extreme case, namely case 24, is usually located far from the majority of the data; while it is the most influential to both estimates of $\mu$ and $\Sigma$, its effect on $\Sigma$ is more pronounced than that on $\mu$. On the other hand, we find from Figures 4.2b and 4.2c that $B_{30}^{\mu}$ is relatively large while $B_{30}^{\sigma}$ is not, so we conclude that the influence of case 30 on the estimate of $\mu$ is larger than that on the estimate of $\Sigma$. The effect of case 30 is quite similar to that of cases 13, 25 and 26, except that its effect on the estimate of $\Sigma$ is a little lower. This may be because it lies in the corners of the scatter plots more often than the others.

4.7.2 Example 2: Compositions of Sediments Data Set (PAID)

The second data set considered is the compositions of sediments data set (PAID), which is available in Aitchison (1986, p.359). Specimens of sediments are traditionally separated into three mutually exclusive and exhaustive components, sand, silt and clay, and the proportions of these constituents by weight are quoted as compositions. The data set records the compositions of 39 sediment samples at different water depths in an Arctic lake. Thus, there are 4 variables: sand, silt, clay and water depth. Pairwise scatter plots among the 4 variables are provided in Figure 4.3. We subtracted 1 from the values of the first three variables, sand, silt and clay, so that the data set becomes PAID. We analyzed the data set using the proposed procedure and plotted the results in Figures 4.4a, 4.4b and 4.4c, where the values of $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$, $j = 1, \ldots, n$, are computed from $H$, $H^{\mu}$ and $H^{\sigma}$ respectively with a $D$ satisfying (3.6). The selected $D$ for this example is

$$D = \begin{pmatrix} 2/3 & -1/3 & -1/3 & 0 \\ -1/3 & -1/3 & 2/3 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}. \tag{4.41}$$

Cases 7 and 14 are classified as influential to the estimates of both $\mu$ and $\Sigma$. From Figure 4.3, we find that these two observations are usually located outside the ellipse formed by the majority of the data. As a result, they are the most influential to both estimates of $\mu$ and $\Sigma$, but the effect on $\Sigma$ is more substantial than that on $\mu$. On the other hand, from Figures 4.4b and 4.4c, cases 1 and 18, which have a considerable effect on the estimate of $\mu$, do not have a similarly prominent effect on the estimate of $\Sigma$. We can see from Figure 4.3 that these two observations usually lie at the corners of the pairwise scatter plots of the variables sand, silt and clay. Due to the constant-sum property of these three variables, an extra large value of one variable results in small values of the others. Therefore, cases 1 and 18 affect the estimate of the mean vector substantially.

Figure 4.1: Scatter plots for the foraminiferal compositions data set (pairwise plots of Prop.1 to Prop.4)

Figure 4.2: Index plots of the influential measures for the foraminiferal compositions data set (panels (a), (b) and (c), each plotted against the case index)

Figure 4.3: Scatter plots for the compositions of sediments data set (pairwise plots of sand, silt, clay and water depth)

Figure 4.4: Index plots of the influential measures for the compositions of sediments data set (panels (a), (b) and (c), each plotted against the case index)

Chapter 5

Discussion

Data are said to be ipsative when they are subject to a constant-sum constraint. This constraint automatically induces singular structures for the population covariance matrix and the sample covariance matrix. Nearly all the major computational problems in establishing a procedure for the analysis of ipsative data result from this singularity. To establish a procedure for influence analysis, we first need to define the likelihood function and calculate the ML estimates. Since the ML approach cannot be applied directly to data with a singular covariance matrix, the transformation in (3.6) is introduced, as is common practice. After that, we follow the work of Poon and Poon (1999), which is a modification of the work of Cook (1986), to assess the influence of local perturbation by the use of the normal curvature $C_l$ and the conformal normal curvature $B_l$. Then, following the method of Poon and Poon (2002), we established three diagnostic measures $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$. These diagnostic measures are found to be applicable to all kinds of ipsative data defined in this thesis, that is, AID, MID and PAID.
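The singularity induced by the constant-sum constraint is easy to see numerically. The following minimal sketch, in plain Python, computes the sample covariance matrix of five rows of the foraminiferal data listed in Appendix D; the helper name `sample_cov` is hypothetical. Because every row of the data sums to the same constant, every row of the covariance matrix sums to zero, so the vector of ones lies in its null space.

```python
def sample_cov(rows):
    """Unbiased sample covariance matrix of observations stored as rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)]
            for a in range(p)]


# Five constant-sum observations (each row sums to 1).
data = [[0.74, 0.19, 0.03, 0.04],
        [0.58, 0.29, 0.01, 0.12],
        [0.58, 0.19, 0.22, 0.01],
        [0.61, 0.28, 0.08, 0.03],
        [0.82, 0.13, 0.02, 0.03]]

S = sample_cov(data)
row_sums = [sum(row) for row in S]   # all zero up to rounding error
```

A matrix with a non-trivial null space has no ordinary inverse, which is why the likelihood is defined on the transformed data instead.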

In the process of calculating the normal curvature, a crude covariance matrix (Aitchison, 1986) is used to approximate the negative of the covariance part of the inverse of the matrix $\ddot{L}_{\theta}$. The singular structure of the matrix $\ddot{L}_{\theta}$ is also a consequence of the singularity of the covariance matrix. It is found that the diagnostic measures $B_j$ and $B_j^{\sigma}$ change with respect to $D$; this change may be due to the use of the crude covariance matrix. However, as $f(\omega)$ is invariant with respect to $D$, we can solve this problem by computing

$$\frac{\partial^{2} f(\omega)}{\partial \omega\, \partial \omega^{T}} \tag{5.1}$$

directly. As this is very tedious, further study is needed.

Because the purpose of identifying influential observations is exploratory rather than confirmatory, and any identified observation will be followed by a careful analysis of the underlying facts, strict adherence to a critical value for identification seems unnecessary. Thus, in determining the magnitude of a measure for observations worthy of further notice, we make use of the natural gap approach and detect large values with the help of an index plot. In most cases, this simple method can effectively disclose observations that need attention. However, if objectivity is desired, one may employ the reference constant proposed by Poon and Poon (1999). The constant utilizes the geometric concept of mean curvature to establish a benchmark for judging the largeness of a measure.
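One simple way to mimic the natural gap approach in code is sketched below. This is only an assumed formalization, since in this thesis the gap is judged visually from an index plot: the measures are sorted, the largest jump between consecutive sorted values is located, and the cases above that jump are flagged. The function name `natural_gap_flags` and the values are made up for illustration.

```python
def natural_gap_flags(values):
    """Return 1-based case indices lying above the largest gap in the sorted values."""
    # Case positions ordered from the largest measure to the smallest.
    order = sorted(range(len(values)), key=lambda j: values[j], reverse=True)
    # Gap between each sorted value and the next smaller one.
    gaps = [values[order[k]] - values[order[k + 1]] for k in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return sorted(j + 1 for j in order[:cut + 1])


# Made-up diagnostic measures for seven cases.
measures = [0.02, 0.01, 0.45, 0.03, 0.02, 0.38, 0.01]
flagged = natural_gap_flags(measures)
```

Such a rule always flags at least one case, so it is a screening aid rather than a test; flagged cases still need the careful follow-up analysis described above.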

The development of our procedure for the analysis of ipsative data is based on a simple case, where the covariance matrix $\Sigma$ has no structure and the data are assumed to be normally distributed. Thus, we can generalize the procedure in two directions. First, the developed procedure can be generalized to the case where the covariance matrix $\Sigma$ is structured. Second, it can be generalized to other multivariate distributions. The normal distribution is chosen in this study not only because of its popularity, but also because many multivariate techniques make use of the ML estimates of the normal model parameters, that is, the sample mean and/or the sample covariance matrix. Therefore, whenever an attempt is made to use the sample mean given in (4.5) and the sample covariance matrix given in (4.6) to estimate the location or dispersion of an ipsative data set, the proposed diagnostic measures are applicable to identify those observations that exercise undue influence on the estimates.

Appendix A

Proof of Propositions

A.1 Proof of Proposition 3

Proposition 3

The functions $L_{D_1}(\theta)$ and $L_{D_2}(\theta)$ differ by a constant.

Proof

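Let $D_1$ and $D_2$ be two different matrices satisfying $R(D_j) = R(A)$, $j = 1, 2$. Since $\mathrm{rank}(D_1) = \mathrm{rank}(D_2) = r$, there exists a nonsingular $r \times r$ matrix $K$ such that $D_1 = K D_2$. Let $x^*_{ij} = D_j x_i$ and $\bar{x}^*_j = \sum_{i=1}^{n} x^*_{ij}/n$, $i = 1, \ldots, n$ and $j = 1, 2$; then

$$x^*_{i1} = D_1 x_i = K D_2 x_i = K x^*_{i2} \quad \text{and} \quad \bar{x}^*_1 = \sum_{i=1}^{n} x^*_{i1}/n = \sum_{i=1}^{n} K x^*_{i2}/n = K \bar{x}^*_2. \tag{A.1}$$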

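We write the second and third terms of (4.2) as follows:

$$\begin{aligned} & -\frac{n}{2} \log |D_1 \Sigma D_1^T| - \frac{1}{2} \sum_{i=1}^{n} (x^*_{i1} - D_1 \mu)^T (D_1 \Sigma D_1^T)^{-1} (x^*_{i1} - D_1 \mu) \\ ={} & -\frac{n}{2} \log |D_1 \Sigma D_1^T| - \frac{1}{2} \mathrm{tr}\left( (D_1 \Sigma D_1^T)^{-1} W_1 \right) - \frac{n}{2} (\bar{x}^*_1 - D_1 \mu)^T (D_1 \Sigma D_1^T)^{-1} (\bar{x}^*_1 - D_1 \mu), \end{aligned} \tag{A.2}$$

where $W_1 = \sum_{i=1}^{n} (x^*_{i1} - \bar{x}^*_1)(x^*_{i1} - \bar{x}^*_1)^T$ (see Anderson, 1974, p.45 equation (2) and p.46 equation (9)). By (A.1), $W_1 = K W_2 K^T$, where $W_2$ is defined similarly to $W_1$. As a result, (A.2) becomes

$$\begin{aligned} & -\frac{n}{2} \log |K D_2 \Sigma D_2^T K^T| - \frac{1}{2} \mathrm{tr}\left( (K D_2 \Sigma D_2^T K^T)^{-1} K W_2 K^T \right) \\ & \quad - \frac{n}{2} (K \bar{x}^*_2 - K D_2 \mu)^T (K D_2 \Sigma D_2^T K^T)^{-1} (K \bar{x}^*_2 - K D_2 \mu) \\ ={} & -\frac{n}{2} \log |D_2 \Sigma D_2^T| - \frac{1}{2} \mathrm{tr}\left( (D_2 \Sigma D_2^T)^{-1} W_2 \right) - \frac{n}{2} (\bar{x}^*_2 - D_2 \mu)^T (D_2 \Sigma D_2^T)^{-1} (\bar{x}^*_2 - D_2 \mu) + c, \end{aligned} \tag{A.3}$$

where $c$ is a constant. This shows that $L_{D_1}(\theta)$ and $L_{D_2}(\theta)$ differ by a constant.

■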
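The constant in (A.3) can be made explicit. The short derivation below is a sketch that assumes only the product rule for determinants and the cyclic property of the trace; writing $A = D_2 \Sigma D_2^T$ is shorthand introduced here for illustration.

```latex
\begin{aligned}
\log |K A K^T| &= \log\left( |K|\,|A|\,|K^T| \right) = \log |A| + 2 \log |\det K| , \\
\mathrm{tr}\left( (K A K^T)^{-1} K W_2 K^T \right)
  &= \mathrm{tr}\left( K^{-T} A^{-1} K^{-1} K W_2 K^T \right)
   = \mathrm{tr}\left( A^{-1} W_2 \right) , \\
(K u)^T (K A K^T)^{-1} (K u) &= u^T A^{-1} u
  \quad \text{for any } r \times 1 \text{ vector } u , \\
\text{so that}\quad c &= -\frac{n}{2} \cdot 2 \log |\det K| = -\,n \log |\det K| .
\end{aligned}
```

Only the log-determinant term is affected by $K$, and the resulting constant depends on neither $\mu$ nor $\Sigma$, which is why the maximizers of the two likelihoods coincide.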

A.2 Proof of Proposition 4

Proposition 4

The displacement function $f(\omega)$ does not depend on the choice of $D$.

Proof

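Let $\hat{\theta} = (\hat{\mu}^T, \hat{\sigma}^T)^T$ and $\hat{\theta}_{\omega} = (\hat{\mu}_{\omega}^T, \hat{\sigma}_{\omega}^T)^T$. Let $D_1$ and $D_2$ be two different matrices satisfying $R(D_j) = R(A)$, $j = 1, 2$. Since $L_{D_1}(\theta)$ and $L_{D_2}(\theta)$ differ by a constant, $\hat{\theta}$ is the same for different $D$. Similarly, since $L_{D_1}(\theta \mid \omega)$ and $L_{D_2}(\theta \mid \omega)$ differ by a constant, $\hat{\theta}_{\omega}$ is the same for different $D$. Also, since $\mathrm{rank}(D_1) = \mathrm{rank}(D_2) = r$, there exists a nonsingular $r \times r$ matrix $K$ such that $D_1 = K D_2$. Let $x^*_{ij} = D_j x_i$ and $\bar{x}^*_j = \sum_{i=1}^{n} x^*_{ij}/n$, $i = 1, \ldots, n$ and $j = 1, 2$. Similar to the proof of Proposition 3, we have

$$\begin{aligned} f_{D_1}(\omega) ={} & L_{D_1}(\hat{\theta} \mid \omega_0) - L_{D_1}(\hat{\theta}_{\omega} \mid \omega_0) \\ ={} & -\frac{n}{2} \log |D_1 \hat{\Sigma} D_1^T| - \frac{1}{2} \mathrm{tr}\left( (D_1 \hat{\Sigma} D_1^T)^{-1} W_1 \right) - \frac{n}{2} (\bar{x}^*_1 - D_1 \hat{\mu})^T (D_1 \hat{\Sigma} D_1^T)^{-1} (\bar{x}^*_1 - D_1 \hat{\mu}) \\ & + \frac{n}{2} \log |D_1 \hat{\Sigma}_{\omega} D_1^T| + \frac{1}{2} \mathrm{tr}\left( (D_1 \hat{\Sigma}_{\omega} D_1^T)^{-1} W_1 \right) + \frac{n}{2} (\bar{x}^*_1 - D_1 \hat{\mu}_{\omega})^T (D_1 \hat{\Sigma}_{\omega} D_1^T)^{-1} (\bar{x}^*_1 - D_1 \hat{\mu}_{\omega}) \\ ={} & -\frac{n}{2} \log |K D_2 \hat{\Sigma} D_2^T K^T| - \frac{1}{2} \mathrm{tr}\left( (K D_2 \hat{\Sigma} D_2^T K^T)^{-1} K W_2 K^T \right) \\ & - \frac{n}{2} (K \bar{x}^*_2 - K D_2 \hat{\mu})^T (K D_2 \hat{\Sigma} D_2^T K^T)^{-1} (K \bar{x}^*_2 - K D_2 \hat{\mu}) \\ & + \frac{n}{2} \log |K D_2 \hat{\Sigma}_{\omega} D_2^T K^T| + \frac{1}{2} \mathrm{tr}\left( (K D_2 \hat{\Sigma}_{\omega} D_2^T K^T)^{-1} K W_2 K^T \right) \\ & + \frac{n}{2} (K \bar{x}^*_2 - K D_2 \hat{\mu}_{\omega})^T (K D_2 \hat{\Sigma}_{\omega} D_2^T K^T)^{-1} (K \bar{x}^*_2 - K D_2 \hat{\mu}_{\omega}) \\ ={} & -\frac{n}{2} \log |D_2 \hat{\Sigma} D_2^T| - \frac{1}{2} \mathrm{tr}\left( (D_2 \hat{\Sigma} D_2^T)^{-1} W_2 \right) - \frac{n}{2} (\bar{x}^*_2 - D_2 \hat{\mu})^T (D_2 \hat{\Sigma} D_2^T)^{-1} (\bar{x}^*_2 - D_2 \hat{\mu}) \\ & + \frac{n}{2} \log |D_2 \hat{\Sigma}_{\omega} D_2^T| + \frac{1}{2} \mathrm{tr}\left( (D_2 \hat{\Sigma}_{\omega} D_2^T)^{-1} W_2 \right) + \frac{n}{2} (\bar{x}^*_2 - D_2 \hat{\mu}_{\omega})^T (D_2 \hat{\Sigma}_{\omega} D_2^T)^{-1} (\bar{x}^*_2 - D_2 \hat{\mu}_{\omega}) \\ ={} & f_{D_2}(\omega), \end{aligned}$$

where $W_j = \sum_{i=1}^{n} (x^*_{ij} - \bar{x}^*_j)(x^*_{ij} - \bar{x}^*_j)^T$, $j = 1, 2$, and $W_1 = K W_2 K^T$. Therefore, $f(\omega)$ does not depend on the choice of $D$. ■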

Appendix B

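Analytical Expressions of $(\ddot{L}_{\theta^*})^{-1}$, $\Delta^*$ and $\ddot{L}_c$

B.1 Analytical Expression of $(\ddot{L}_{\theta^*})^{-1}$

Following Poon and Poon (2002, equation (9)), $-\ddot{L}_{\theta^*}$ is the observed information for the postulated model and the ML estimates $\hat{\mu}^*$ of $\mu^*$ and $\hat{\sigma}^*$ of $\sigma^*$ are statistically independent. Then, $(\ddot{L}_{\theta^*})^{-1}$ is an $r^* \times r^*$ block diagonal matrix given by

$$(\ddot{L}_{\theta^*})^{-1} = \begin{pmatrix} -\mathrm{cov}(\hat{\mu}^*) & 0 \\ 0 & -\mathrm{cov}(\hat{\sigma}^*) \end{pmatrix} = \begin{pmatrix} \{ (\ddot{L}_{\theta^*}^{-1})_{ab} \} & 0 \\ 0 & \{ (\ddot{L}_{\theta^*}^{-1})_{(\alpha\beta)(\gamma\rho)} \} \end{pmatrix},$$

where $\mathrm{cov}(\hat{\mu}^*)$ is an $r \times r$ matrix storing the covariance matrix of $\hat{\mu}^*$ and $\mathrm{cov}(\hat{\sigma}^*)$ is an $(r^* - r) \times (r^* - r)$ matrix storing the covariance matrix of $\hat{\sigma}^*$. Following Poon and Poon (2002, equations (10) and (11)), the elements corresponding to $\mathrm{cov}(\hat{\mu}^*)$ and $\mathrm{cov}(\hat{\sigma}^*)$ are respectively given by

$$(\ddot{L}_{\theta^*}^{-1})_{ab} = -\mathrm{cov}(\hat{\mu}^*_a, \hat{\mu}^*_b) = -\frac{1}{n} \sigma^*_{ab}$$

and

$$(\ddot{L}_{\theta^*}^{-1})_{(\alpha\beta)(\gamma\rho)} = -\mathrm{cov}(\hat{\sigma}^*_{\alpha\beta}, \hat{\sigma}^*_{\gamma\rho}) = -\frac{1}{n} \left( \sigma^*_{\alpha\gamma} \sigma^*_{\beta\rho} + \sigma^*_{\alpha\rho} \sigma^*_{\beta\gamma} \right).$$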
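The covariance block above can be assembled with the same index bookkeeping that the Matlab commands of Appendix D use. The following is a minimal sketch in plain Python, assuming a small made-up covariance matrix and sample size; the function name `cov_block` is hypothetical. Pairs $(\alpha, \beta)$ with $\alpha \geq \beta$ are stored row by row, so with 0-based indices the pair $(a, b)$ sits at position $a(a+1)/2 + b$.

```python
def cov_block(Sigma, n):
    """Matrix with entries -(s_ag * s_bh + s_ah * s_bg) / n over lower-triangular pairs."""
    r = len(Sigma)
    m = r * (r + 1) // 2
    out = [[0.0] * m for _ in range(m)]
    for a in range(r):
        for b in range(a + 1):
            for g in range(r):
                for h in range(g + 1):
                    t = a * (a + 1) // 2 + b   # position of the pair (a, b)
                    o = g * (g + 1) // 2 + h   # position of the pair (g, h)
                    out[t][o] = -(Sigma[a][g] * Sigma[b][h]
                                  + Sigma[a][h] * Sigma[b][g]) / n
    return out


Sigma = [[2.0, 0.5],
         [0.5, 1.0]]           # made-up covariance matrix
block = cov_block(Sigma, n=10)  # 3 x 3 block for (1,1), (2,1), (2,2)
```

The first diagonal entry reduces to $-2\sigma_{11}^2/n$, the negative of the familiar variance of a sample variance under normality.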

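B.2 Analytical Expression of $\Delta^*$

Let

$e_a$ be an $r \times 1$ vector with 1 at its $a$-th slot and zeros elsewhere,

$z^*_i = x^*_i - \mu^*$ be an $r \times 1$ vector with $z^*_{ib}$ at its $b$-th slot, and

$\sigma^{*ab}$ be the $(a, b)$-th element of $\Sigma^{*-1}$.

Then, following equations (15) and (17) in Poon and Poon (2002), $\Delta^*$ is an $r^* \times n$ matrix whose $i$-th column has elements

$$\Delta^*_{a,i} = e_a^T \Sigma^{*-1} z^*_i$$

and

$$\Delta^*_{(ab),i} = \left( 1 - \tfrac{1}{2} \delta_{ab} \right) \left( e_a^T \Sigma^{*-1} z^*_i z^{*T}_i \Sigma^{*-1} e_b - \sigma^{*ab} \right),$$

where $\delta_{ab}$ equals 1 if $a = b$ and 0 otherwise, and where they correspond to $\mu^*_a$ in $\mu^*$ and $\sigma^*_{ab}$ in $\Sigma^*$ respectively.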

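B.3 Analytical Expression of $\ddot{L}_c$

The matrix $\ddot{L}_c$ is a $p^* \times p^*$ block diagonal matrix given by

$$\ddot{L}_c = \begin{pmatrix} -\mathrm{cov}(\hat{\mu}) & 0 \\ 0 & -\mathrm{cov}_c(\hat{\sigma}) \end{pmatrix} = \begin{pmatrix} \{ (\ddot{L}_c)_{ab} \} & 0 \\ 0 & \{ (\ddot{L}_c)_{(\alpha\beta)(\gamma\rho)} \} \end{pmatrix},$$

where $\mathrm{cov}(\hat{\mu})$ is a $p \times p$ matrix storing the covariance matrix of $\hat{\mu}$, and $\mathrm{cov}_c(\hat{\sigma})$ is a $(p^* - p) \times (p^* - p)$ matrix storing the covariance matrix of $\hat{\sigma}$. By Poon and Poon (2002, equations (10) and (11)), the elements corresponding to $\mathrm{cov}(\hat{\mu})$ and $\mathrm{cov}_c(\hat{\sigma})$ are respectively given by

$$(\ddot{L}_c)_{ab} = -\mathrm{cov}(\hat{\mu}_a, \hat{\mu}_b) = -\frac{1}{n} \sigma_{ab}$$

and

$$(\ddot{L}_c)_{(\alpha\beta)(\gamma\rho)} = -\mathrm{cov}_c(\hat{\sigma}_{\alpha\beta}, \hat{\sigma}_{\gamma\rho}) = -\frac{1}{n} \left( \sigma_{\alpha\gamma} \sigma_{\beta\rho} + \sigma_{\alpha\rho} \sigma_{\beta\gamma} \right).$$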

Appendix C

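Calculation of $\partial \theta^{*} / \partial \theta$

Let

$$\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & \ddots & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix} \quad \text{and} \quad D = \begin{pmatrix} d_{11} & \cdots & d_{1p} \\ \vdots & \ddots & \vdots \\ d_{r1} & \cdots & d_{rp} \end{pmatrix}. \tag{C.1}$$

Then,

$$\mu^* = D \mu = \begin{pmatrix} \sum_{i=1}^{p} d_{1i} \mu_i \\ \vdots \\ \sum_{i=1}^{p} d_{ri} \mu_i \end{pmatrix} \tag{C.2}$$

and

$$\Sigma^* = D \Sigma D^T = \begin{pmatrix} \sum_{i=1}^{p} \sum_{j=1}^{p} d_{1i} d_{1j} \sigma_{ij} & \cdots & \sum_{i=1}^{p} \sum_{j=1}^{p} d_{1i} d_{rj} \sigma_{ij} \\ \vdots & \ddots & \vdots \\ \sum_{i=1}^{p} \sum_{j=1}^{p} d_{ri} d_{1j} \sigma_{ij} & \cdots & \sum_{i=1}^{p} \sum_{j=1}^{p} d_{ri} d_{rj} \sigma_{ij} \end{pmatrix}. \tag{C.3}$$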

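Let

$\sigma$ be a $(p^* - p) \times 1$ vector storing the lower triangular elements of $\Sigma$, where $p^* = p + p(p+1)/2$, and

$\sigma^*$ be an $(r^* - r) \times 1$ vector storing the lower triangular elements of $\Sigma^*$, where $r^* = r + r(r+1)/2$,

i.e.

$$\sigma = \begin{pmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{22} \\ \vdots \\ \sigma_{p1} \\ \vdots \\ \sigma_{pp} \end{pmatrix} \quad \text{and} \quad \sigma^* = \begin{pmatrix} \sum_{i=1}^{p} \sum_{j=1}^{p} d_{1i} d_{1j} \sigma_{ij} \\ \sum_{i=1}^{p} \sum_{j=1}^{p} d_{2i} d_{1j} \sigma_{ij} \\ \sum_{i=1}^{p} \sum_{j=1}^{p} d_{2i} d_{2j} \sigma_{ij} \\ \vdots \\ \sum_{i=1}^{p} \sum_{j=1}^{p} d_{ri} d_{rj} \sigma_{ij} \end{pmatrix}. \tag{C.4}$$

Differentiating $\mu^*$ with respect to $\mu$ for $r < p$:

$$\frac{\partial \mu^{*T}}{\partial \mu} = \begin{pmatrix} \frac{\partial}{\partial \mu_1} \left( \sum_{i=1}^{p} d_{1i} \mu_i \right) & \cdots & \frac{\partial}{\partial \mu_1} \left( \sum_{i=1}^{p} d_{ri} \mu_i \right) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial \mu_p} \left( \sum_{i=1}^{p} d_{1i} \mu_i \right) & \cdots & \frac{\partial}{\partial \mu_p} \left( \sum_{i=1}^{p} d_{ri} \mu_i \right) \end{pmatrix} = D^T. \tag{C.5}$$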

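Differentiating $\sigma^*$ with respect to $\sigma$ for $r < p$:

$$\frac{\partial \sigma^*_{kl}}{\partial \sigma_{ij}} = S_{to} = \begin{cases} d_{ki} d_{lj} & \text{for } i = j \text{ and } k \geq l, \\ d_{ki} d_{lj} + d_{kj} d_{li} & \text{for } i > j \text{ and } k \geq l, \end{cases} \tag{C.6}$$

where $i, j = 1, \ldots, p$; $k, l = 1, \ldots, r$; $t = \frac{i(i-1)}{2} + j$ and $o = \frac{k(k-1)}{2} + l$.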

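Let $S$ be a $(p^* - p) \times (r^* - r)$ matrix whose $(t, o)$-th element is the derivative $\partial \sigma^*_{kl} / \partial \sigma_{ij}$ given in (C.6), such that

$$S = \frac{\partial \sigma^{*T}}{\partial \sigma}. \tag{C.7}$$

So,

$$\frac{\partial \theta^{*T}}{\partial \theta} = \begin{pmatrix} D^T & 0 \\ 0 & S \end{pmatrix} = R^T, \tag{C.8}$$

where $R$ is the $r^* \times p^*$ constant matrix given in (3.10) or (4.9).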
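The derivative matrix in (C.6) and (C.7) can be checked numerically, because $\sigma^*$ is linear in $\sigma$: stacking the lower triangle of $D \Sigma D^T$ must give the same vector as multiplying the stacked lower triangle of $\Sigma$ by $S^T$. The sketch below is a minimal plain-Python illustration of that check, not the thesis program; the helper names `vech_index`, `vech` and `build_S`, the $2 \times 3$ matrix $D$ and the entries of $\Sigma$ are assumptions made for the example.

```python
def vech_index(a, b):
    """0-based position of element (a, b), a >= b, in row-wise lower-triangular order."""
    return a * (a + 1) // 2 + b


def vech(M):
    """Stack the lower-triangular elements of a square matrix row by row."""
    return [M[a][b] for a in range(len(M)) for b in range(a + 1)]


def build_S(D):
    """Derivative of vech(D Sigma D') with respect to vech(Sigma), as in (C.6)."""
    r, p = len(D), len(D[0])
    S = [[0.0] * (r * (r + 1) // 2) for _ in range(p * (p + 1) // 2)]
    for i in range(p):
        for j in range(i + 1):
            for k in range(r):
                for l in range(k + 1):
                    t, o = vech_index(i, j), vech_index(k, l)
                    if i == j:
                        S[t][o] = D[k][i] * D[l][j]
                    else:
                        S[t][o] = D[k][i] * D[l][j] + D[k][j] * D[l][i]
    return S


D = [[2 / 3, -1 / 3, -1 / 3],
     [-1 / 3, -1 / 3, 2 / 3]]
Sigma = [[2.0, 0.5, 0.1],
         [0.5, 1.0, 0.3],
         [0.1, 0.3, 1.5]]

# Direct computation of Sigma* = D Sigma D'.
Sigma_star = [[sum(D[k][i] * Sigma[i][j] * D[l][j]
                   for i in range(3) for j in range(3))
               for l in range(2)] for k in range(2)]

S = build_S(D)
sig = vech(Sigma)
via_S = [sum(S[t][o] * sig[t] for t in range(len(sig))) for o in range(len(S[0]))]
```

The off-diagonal case in (C.6) carries two terms because $\sigma_{ij}$ and $\sigma_{ji}$ are the same parameter, which is exactly what the agreement of `via_S` with the stacked `Sigma_star` confirms.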

Appendix D

Matlab Commands of Example 1

% RAW DATA

X = [ 0.74 0.19 0.03 0.04    % X : AID with row sum c (Dim: p,n)
      0.58 0.29 0.01 0.12
      0.58 0.19 0.22 0.01
      0.61 0.28 0.08 0.03
      0.82 0.13 0.02 0.03
      0.48 0.38 0.01 0.13
      0.59 0.38 0    0.03
      0.76 0.12 0.09 0.03
      0.81 0.12 0.04 0.03
      0.68 0.23 0.05 0.04
      0.72 0.2  0.04 0.04
      0.62 0.27 0.09 0.02
      0.45 0.25 0.29 0.01
      0.66 0.25 0.06 0.03
      0.85 0.13 0.01 0.01
      0.75 0.09 0.15 0.01
      0.69 0.25 0    0.06
      0.76 0.1  0.11 0.03
      0.66 0.29 0.01 0.04
      0.66 0.24 0.06 0.04
      0.5  0.46 0    0.04
      0.65 0.25 0.05 0.05
      0.60 0.35 0.02 0.03
      0.4  0.27 0.01 0.32
      0.6  0.1  0.3  0
      0.6  0.1  0.29 0.01
      0.59 0.39 0.01 0.01
      0.58 0.39 0.01 0.02
      0.61 0.34 0.02 0.03
      0.39 0.49 0.12 0 ]';

% DIMENSION

n = 30;                  % n : number of observations
p = 4;                   % p : number of variables
r = p-1;                 % r : number of variables of x* : [z]
q = p+(p*(p+1)/2);       % q : dimension of theta
s = r+(r*(r+1)/2);       % s : dimension of theta*

% TRANSFORM X TO BE WITH A ZERO CONSTRAINT

for i = 1:n
    c(i) = sum(X(:,i));      % c : the sum of each observation
end
x = X - repmat(c/p,p,1);     % x : ipsative data with a constant sum ZERO (Dim: p,n)
mx = (mean(x'))';            % mx : mean(x) (Dim: p,1)
ex = cov(x');                % ex : covariance matrix of x (Dim: p,p)

% THE INFORMATION MATRIX OF x (Dim: q,q)

infmx = -ex/n;               % infmx : mean part of inf. matrix of x (Dim: p,p)
for i = 1:p                  % infex : covariance part of inf. matrix of x (Dim: q-p,q-p)
    for j = 1:i              % (i=alpha, j=beta, k=gamma, l=rho)
        for k = 1:p
            for l = 1:k
                t = ((i-1)*i)/2 + j;
                o = ((k-1)*k)/2 + l;
                infex(t,o) = -((ex(i,k)*ex(j,l))+(ex(i,l)*ex(j,k)))/n;
            end
        end
    end
end
infx = [infmx,zeros(p,q-p);zeros(q-p,p),infex];             % inf. matrix for Bj
infxm = [infmx,zeros(p,q-p);zeros(q-p,p),zeros(q-p,q-p)];   % inf. matrix for Bj (mu)
infxe = [zeros(p,p),zeros(p,q-p);zeros(q-p,p),infex];       % inf. matrix for Bj (sigma)

% TRANSFORMATION d (Dim: r,p)

d3 = [3/4 -1/4 -1/4 -1/4; -1/4 3/4 -1/4 -1/4; -1/4 -1/4 -1/4 3/4];   % Delete 3rd row
d = d3;

% TRANSFORM DATA TO NONIPSATIVE

z = d*x;                     % z : non-ipsative data set x* (Dim: r,n)
mz = (mean(z'))';            % mz : mean vector of z (Dim: r,1)
ez = cov(z');                % ez : covariance matrix of z (Dim: r,r)
inez = inv(ez);              % inez : inverse of cov(z) (Dim: r,r)
y = z-repmat(mz,1,n);        % y : z - mean(z) (Dim: r,n)

% INFORMATION MATRIX OF z (Dim: s,s)

infmz = -ez/n;               % infmz : mean part of inf. matrix of z (Dim: r,r)
for i = 1:r                  % infez : covariance part of inf.* matrix of z (Dim: s-r,s-r)
    for j = 1:i              % (i=alpha, j=beta, k=gamma, l=rho)
        for k = 1:r
            for l = 1:k
                t = ((i-1)*i)/2 + j;
                o = ((k-1)*k)/2 + l;
                infez(t,o) = -((ez(i,k)*ez(j,l))+(ez(i,l)*ez(j,k)))/n;
            end
        end
    end
end
infz = [infmz,zeros(r,s-r);zeros(s-r,r),infez];             % inf. matrix* for Bj
infzm = [infmz,zeros(r,s-r);zeros(s-r,r),zeros(s-r,s-r)];   % inf. matrix* for Bj (mu)
infze = [zeros(r,r),zeros(r,s-r);zeros(s-r,r),infez];       % inf. matrix* for Bj (sigma)

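% THE TRIANGULAR MATRIX OF Z : [triz] (Dim: s,n)

for i = 1:n                  % trimz : mean part of triangular* matrix (Dim: r,n)
    trimz(:,i) = inez*y(:,i);
end
for i = 1:n                  % triez : covariance part of triangular* matrix (Dim: s-r,n)
    for k = 1:r
        for t = 1:k
            o = ((k-1)*k)/2 + t;
            % assumed loop body (not legible in the source): derivative of the
            % log-likelihood of case i w.r.t. sigma*_(kt), halved when k = t
            triez(o,i) = (1-(k==t)/2)*(trimz(k,i)*trimz(t,i)-inez(k,t));
        end
    end
end
triz = [trimz;triez];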

% CALCULATE THE DIFF. OF THETA* RESPECT TO THETA : [dtt] (Dim: q,s)

dmm = d';                    % dmm : diff. mu* respect to mu (Dim: p,r)
for i = 1:p                  % dee : diff. vec(sigma*) respect to vec(sigma) (Dim: q-p,s-r)
    for j = 1:i
        for k = 1:r
            for l = 1:k
                t = ((i-1)*i)/2 + j;
                o = ((k-1)*k)/2 + l;
                if i == j
                    dee(t,o) = d(k,i)*d(l,j);
                else
                    dee(t,o) = (d(k,i)*d(l,j)) + (d(k,j)*d(l,i));
                end
            end
        end
    end
end
dtt = [dmm,zeros(p,s-r);zeros(q-p,r),dee];

% OVERALL B : [B]

Ldd = dtt*inv(infz)*dtt';
H = -triz'*dtt'*infx*Ldd*infx*dtt*triz;
B = diag(H/(trace(H*H))^(1/2))

% B RESPECT TO MEAN : [Bm]

Lddmtemp = Ldd(1:p,1:p);
Lddm = [Lddmtemp,zeros(p,q-p);zeros(q-p,p),zeros(q-p,q-p)];
Hm = -triz'*dtt'*infxm*Lddm*infxm*dtt*triz;
Bm = diag(Hm/(trace(Hm*Hm))^(1/2))

% B RESPECT TO VAR : [Be]

Lddetemp = Ldd((p+1):q,(p+1):q);
Ldde = [zeros(p,p),zeros(p,q-p);zeros(q-p,p),Lddetemp];
He = -triz'*dtt'*infxe*Ldde*infxe*dtt*triz;
Be = diag(He/(trace(He*He))^(1/2))

Bibliography

Aitchison, J. (1986). The statistical analysis of compositional data. London: Chapman and Hall.

Anderson, T.W. (1974). An introduction to multivariate statistical analysis (2nd ed.). New York: Wiley.

Cattell, R.B. (1944). Psychological measurement: Ipsative, normative and interactive. Psychological Review, 51, 292-303.

Chan, W. and Bentler, P.M. (1993). The covariance structure analysis of ipsative data. Sociological Methods & Research, 22, 214-247.

Chan, W. and Bentler, P.M. (1996). Covariance structure analysis of partially additive ipsative data using restricted maximum likelihood estimation. Multivariate Behavioral Research, 31(3), 289-312.

Cook, R.D. (1977). Detection of influential observations in linear regression. Technometrics, 19, 15-18.

Cook, R.D. (1986). Assessment of local influence (with discussion). Journal of the Royal Statistical Society, B, 48, 133-169.

Hicks, L.E. (1970). Some properties of ipsative, normative and forced-choice normative measures. Psychological Bulletin, 74(3), 167-184.

Lawrance, A.J. (1995). Deletion influence and masking in regression. Journal of the Royal Statistical Society, B, 57, 181-189.

Poon, W.Y. and Poon, Y.S. (1999). Conformal normal curvature and assessment of local influence. Journal of the Royal Statistical Society, B, 61, 51-61.

Poon, W.Y. and Poon, Y.S. (2001). Conditional local influence in case-weights linear regression. British Journal of Mathematical and Statistical Psychology, 54, 177-191.

Poon, W.Y. and Poon, Y.S. (2002). Influential observations in the estimation of mean vector and covariance matrix. British Journal of Mathematical and Statistical Psychology, 55, 177-192.
