Influential Observations
in
the Analysis of Additive Ipsative Data
WONG Wai-Wan
A Thesis Submitted in Partial Fulfillment
of the Requirements for the Degree of
Master of Philosophy
in
Statistics
© The Chinese University of Hong Kong
JUNE 2003
The Chinese University of Hong Kong holds the copyright of this thesis. Any
person(s) intending to use a part or whole of the materials in the thesis in a
proposed publication must seek copyright release from the Dean of the Graduate
School.
Abstract of thesis entitled:
Influential Observations in the Analysis of Additive Ipsative Data
Submitted by Wong Wai-Wan
for the degree of Master of Philosophy in Statistics
at The Chinese University of Hong Kong in June 2003.
Abstract
The objective of this thesis is to use the local influence approach to develop a procedure for identifying influential observations in the analysis of additive ipsative data. Ipsative data are data collected, or modified, in such a way that each observation is subject to a constant-sum constraint. Ipsative data are usually grouped into two categories: additive ipsative data (AID) and multiplicative ipsative data (MID). This thesis considers AID. Let $x$ be the observed AID vector and assume that $x$ is multivariate normal with mean $\mu$ and covariance matrix $\Sigma$. The distribution of $x$ is then degenerate (singular) because the elements of $\Sigma$ are subject to a constant-sum constraint. As a result, the maximum likelihood (ML) estimates of the model parameters $\mu$ and $\Sigma$ cannot be obtained directly. A common way to overcome this problem is to introduce a transformation $x^* = Dx$ such that the transformed $x^*$ is non-ipsative and has a nonsingular density with mean $\mu^*$ and covariance matrix $\Sigma^*$. The ML estimates of $\mu^*$ and $\Sigma^*$ can then be obtained in a straightforward way. For non-ipsative data, various diagnostic measures have been developed to identify influential observations. For example, Poon and Poon (2002) constructed three diagnostic measures, $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$. They identify observations that exert an irregular influence on, respectively, the estimates of the mean vector together with the covariance matrix, the estimate of the mean vector, and the estimate of the covariance matrix in a multivariate normal model. We make use of these diagnostic measures to develop procedures that identify observations exerting an undue influence on the estimation of $\mu$ and $\Sigma$. Examples based on real data sets illustrate the practicability of the proposed procedure.
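Abstract (translated from the Chinese version)
The purpose of this thesis is to build, on the basis of the local influence approach, a procedure that identifies influential observations in the analysis of additive ipsative data. Ipsative data are data that have been collected, or modified, so that they are subject to a constant-sum constraint. Ipsative data are usually divided into two classes: additive ipsative data (AID) and multiplicative ipsative data (MID). This thesis discusses AID. Let $x$ be the observed AID vector and assume that $x$ follows a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. The distribution of $x$ is then degenerate (singular), because the elements of $\Sigma$ are subject to a constant-sum constraint. Consequently, the maximum likelihood estimates of the model parameters $\mu$ and $\Sigma$ cannot be obtained directly. The usual remedy is to introduce a transformation $x^* = Dx$ so that the transformed $x^*$ is non-ipsative and has a nonsingular density with mean $\mu^*$ and covariance matrix $\Sigma^*$. The maximum likelihood estimates of $\mu^*$ and $\Sigma^*$ can then be obtained directly. For non-ipsative data, many diagnostic measures are available for identifying influential observations, for example the three diagnostic measures $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$ constructed by Poon and Poon (2002). In a multivariate normal model, these measures identify, respectively, observations with an abnormal influence on the estimates of the mean vector and the covariance matrix, on the estimate of the mean vector, and on the estimate of the covariance matrix. We use these diagnostic measures to build a procedure that identifies observations exerting an undue influence on the estimates of $\mu$ and $\Sigma$. Examples based on real data sets illustrate the practicability of the procedure.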
Acknowledgment
I would like to express my sincere gratitude to my supervisor, Prof. Poon Wai-Yin, for her invaluable encouragement and advice during the period of this research program. I would also like to thank my fellow classmates for their kind assistance. The work described in this thesis was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (RGC Ref. No. CUHK 4347/01H).
Contents
Abstract
Acknowledgment
1 Introduction
1.1 Ipsative Data
1.2 Transformation
1.3 Influence Analysis
2 Ipsative Data
2.1 Additive Ipsative Data (AID) and Multiplicative Ipsative Data (MID)
2.1.1 Additive Ipsative Data (AID)
2.1.2 Multiplicative Ipsative Data (MID)
2.2 Partially Additive Ipsative Data (PAID)
2.2.1 Vector of PAID
2.2.2 Special Cases of PAID
3 Transformation
3.1 Distribution of AID
3.2 Transformation
3.3 Relationship between Parameters of AID and the Transformed Vector
4 Influence Analysis of Ipsative Data
4.1 The Postulated Model
4.2 Perturbation
4.3 Likelihood Displacement
4.4 Normal Curvature
4.5 Computation of the Normal Curvature
4.6 Diagnostic Measures
4.6.1 Observations influencing the estimates of $\mu$ and $\Sigma$
4.6.2 Observations influencing the estimate of $\mu$ or $\Sigma$
4.7 Examples
4.7.1 Example 1: Foraminiferal Compositions Data Set (AID)
4.7.2 Example 2: Compositions of Sediments Data Set (PAID)
5 Discussion
A Proof of Propositions
A.1 Proof of Proposition 3
A.2 Proof of Proposition 4
B Analytical Expressions of $(\ddot{L}_{\theta^*})^{-1}$, $\Delta^*$ and $\ddot{L}_c$
B.1 Analytical Expression of $(\ddot{L}_{\theta^*})^{-1}$
B.2 Analytical Expression of $\Delta^*$
B.3 Analytical Expression of $\ddot{L}_c$
C Calculation of $\partial\theta^*/\partial\theta^T$
D Matlab Commands of Example 1
Bibliography
Chapter 1
Introduction
1.1 Ipsative Data
Data in ipsative form are used in many disciplines, for example the compositions of rock samples in geology, the proportions of total expenditure allocated to different categories in a household budget survey in sociology, and the ranked variables of personal attitudes in a recruitment test in psychology. Ipsativity is a mathematical term that refers to a property of a data matrix, such as a set of scores. A data matrix is said to be ipsative when the sum of the scores over each respondent is a constant. Measures with this constant-sum property are called ipsative data (Cattell, 1944; Hicks, 1970). In mathematical notation, let $x$ be a $p \times 1$ column vector such that
\[ \mathbf{1}^T x = c, \tag{1.1} \]
where $\mathbf{1}$ is a $p \times 1$ unit vector (every entry equal to one) and $c$ is a constant scalar. Then $x$ is an ipsative data vector.
Ipsative data are commonly grouped into two categories (Chan & Bentler, 1993): additive ipsative data (AID) and multiplicative ipsative data (MID). AID can be obtained by subtracting each individual's unweighted average from the raw scores. In matrix form, the transformation from the raw scores to AID is
\[ x = [I - \mathbf{1}(\mathbf{1}^T \mathbf{1})^{-1} \mathbf{1}^T]\, y, \tag{1.2} \]
where $x$ is the $p \times 1$ vector of AID, $I$ is a $p \times p$ identity matrix, $\mathbf{1}$ is a $p \times 1$ unit vector, and $y$ is the underlying $p \times 1$ nonipsative vector. We say $x$ is ipsative because
\[ \mathbf{1}^T x = 0. \tag{1.3} \]
Chapter 2 gives more details on the definition of ipsative data.
1.2 Transformation
The multivariate normal distribution is one of the most popular distributions for analyzing multivariate data. The distribution is specified by a mean vector $\mu$ together with a symmetric and positive definite covariance matrix $\Sigma$. The estimates of these parameters are usually produced by the method of maximum likelihood. Our development of the influence analysis also relies on the multivariate normal distribution. Let $x$ be a vector of AID distributed as multivariate normal with mean $\mu$ and covariance matrix $\Sigma$. Then
\[ \mathbf{1}^T \Sigma = 0^T, \tag{1.4} \]
where $\Sigma$ is a $p \times p$ matrix, $\mathbf{1}$ is a $p \times 1$ unit vector and $0$ is a $p \times 1$ vector with all entries equal to zero. Because of the zero-sum constraint in (1.3), neither the population covariance matrix $\Sigma$ nor the sample covariance matrix $S$ is of full rank, so ordinary maximum likelihood estimation cannot be applied to ipsative data directly. The common approach to this problem is to introduce a transformation $x^* = Dx$ such that the resulting $x^*$ is non-ipsative. The ML estimates of $\mu^*$ and $\Sigma^*$ can then be obtained in a straightforward way. The transformation $D$ is discussed further in Chapter 3.
1.3 Influence Analysis
There are two major paradigms in the influence analysis literature: the deletion approach and the local influence approach. The deletion approach assesses the effect of dropping a case on a chosen quantity. A typical diagnostic measure is Cook's distance (Cook, 1977). The main weakness of this approach is that masking effects arise in the presence of multiple unusual observations. Two kinds of masking effect can arise in the presence of several outliers: joint influence and conditional influence (Lawrance, 1995; Poon and Poon, 2001).
The local influence approach develops diagnostic measures by examining the consequence of an infinitesimal perturbation on a relevant quantity. This approach is well developed for detecting joint influence. A general method for assessing the influence of local perturbation was first proposed by Cook (1986) and then modified by Poon and Poon (1999). Poon and Poon (2002) constructed three diagnostic measures, $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$, to identify influential observations in non-ipsative multivariate data. Like other diagnostic measures based on the local influence approach, they are able to address joint influence.
In the influence analysis of multivariate data, different statistical procedures put their emphasis on different sample quantities. Some procedures emphasize both the estimate of the mean vector and the estimate of the covariance matrix, while others emphasize only one of them. In view of this, Poon and Poon constructed the diagnostic measures $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$ to identify influential observations of different natures. In particular, $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$ identify observations that exercise unusual influence on the estimates of the mean vector together with the covariance matrix, the estimate of the mean vector, and the estimate of the covariance matrix, respectively.
In this thesis, the diagnostic measures developed for the influence analysis of ipsative data rely on the above diagnostic measures $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$. The procedure for establishing diagnostic measures for ipsative data, together with real data illustrations of the practicability of the proposed procedure, is presented in Chapter 4.
Chapter 2
Ipsative Data
Data are said to be ipsative if each observation is subject to a constant-sum constraint. Recall (1.1),
\[ \mathbf{1}^T x = c, \tag{2.1} \]
where $\mathbf{1}$ is a $p \times 1$ unit vector and $c$ is a constant scalar. Therefore, $x$ is a $p$-dimensional data vector with the ipsative property.
Ipsative data are useful in practice. For example, if two people each rank-order their favorite subjects, the intensity of preference for any particular subject cannot be compared between them because of ipsativity, but the rankings can. The use of ipsative items in questionnaires can also lower the scope for lying. In questionnaires in the occupational field, for example, employers may be particularly interested in possible 'negative' personality traits, such as whether applicants are likely to be lazy, dishonest, or bad-tempered. Applicants are liable to feel freer to report their personality when ipsative items are used.
One property of ipsative data is that, although scores are independent across respondents, within a respondent the score on each variable depends on the scores on the other variables.
2.1 Additive Ipsative Data (AID) and Multiplicative Ipsative Data (MID)
2.1.1 Additive Ipsative Data (AID)
Sometimes, ipsative data are the consequence of a transformation of original nonipsative data. For example, suppose a drug abuse survey is conducted to learn subjects' monthly consumption of four different drugs. Since this topic is sensitive, relative consumption is collected in order to maintain the integrity of the assessment. Suppose the monthly consumption of the four drugs is (32, 24, 38, 22) for subject A, with average 29, and (20, 12, 26, 10) for subject B, with average 17. After subtracting the average from the raw data, both subjects have the same relative consumption (3, -5, 9, -7).
In the above example, the vector of deviation scores is obtained by subtracting each individual's unweighted average from the raw scores. This kind of transformation is called the additive ipsative transformation (Chan & Bentler, 1993). Let $y$ be a $p \times 1$ column vector of raw scores, called preipsative data. Then $x$ acquires its ipsative property through the equation
\[ x = y - \mathbf{1}\bar{y}, \tag{2.2} \]
where $\bar{y}$ is the average of the elements of $y$ and $\mathbf{1}$ is a $p \times 1$ unit vector. $y$ is called the underlying vector of the AID vector $x$.
AID is subject to the constant-sum constraint
\[ \mathbf{1}^T x = \mathbf{1}^T (y - \mathbf{1}\bar{y}) = \mathbf{1}^T y - \mathbf{1}^T \mathbf{1}\,\bar{y} = 0. \tag{2.3} \]
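The additive ipsative transformation (2.2) can be sketched in a few lines of Python. This is only an illustration using the drug consumption figures quoted above; the function name `additive_ipsatize` is a hypothetical helper, not something defined in the thesis.

```python
from fractions import Fraction

def additive_ipsatize(y):
    """Return x = y - 1*ybar: each score minus the respondent's own mean."""
    ybar = Fraction(sum(y), len(y))          # unweighted average of the raw scores
    return [Fraction(v) - ybar for v in y]

subject_a = additive_ipsatize([32, 24, 38, 22])   # raw scores with average 29
subject_b = additive_ipsatize([20, 12, 26, 10])   # raw scores with average 17

print([int(v) for v in subject_a])   # [3, -5, 9, -7]
print(subject_a == subject_b)        # True: identical relative consumption
print(sum(subject_a))                # 0, the zero-sum constraint (2.3)
```

Exact rational arithmetic is used so that the zero-sum constraint holds exactly rather than up to rounding error.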
2.1.2 Multiplicative Ipsative Data (MID)
As mentioned in Chapter 1, another type of ipsative transformation is the multiplicative ipsative transformation. The multiplicative ipsative data (MID) vector $x$ is obtained by taking the proportion of each $y_i > 0$ relative to the whole, $\sum_i y_i$. Let $y = (y_1, y_2, \ldots, y_p)^T$ be the preipsative vector of the MID vector $x$. Then
\[ x = \left( y_1 / \textstyle\sum_i y_i,\; y_2 / \sum_i y_i,\; \ldots,\; y_p / \sum_i y_i \right)^T = (\mathbf{1}^T y)^{-1} y. \tag{2.4} \]
Under the definition in (2.4), $x$ is also called compositional data. For example, the ingredients of foods are commonly expressed as percentages on food packages. These percentages are examples of MID.
MID is also subject to a constant-sum constraint:
\[ \mathbf{1}^T x = (\mathbf{1}^T y)^{-1} (\mathbf{1}^T y) = 1. \tag{2.5} \]
By (2.3) and (2.5), the sum of the scores for each respondent is zero for AID and one for MID. The average of a MID vector always equals one divided by the number of measures, $1/p$. MID can therefore be transformed to AID by subtracting this average from each score.
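A minimal Python sketch of the multiplicative ipsative transformation (2.4), followed by the conversion from MID to AID, is given below. The raw scores are made up for illustration and the function names are hypothetical.

```python
from fractions import Fraction

def multiplicative_ipsatize(y):
    """Return x = y / (1^T y): each positive score as a share of the total."""
    total = sum(y)
    return [Fraction(v, total) for v in y]

def mid_to_aid(x):
    """Subtract the MID average 1/p from each proportion."""
    mean = Fraction(1, len(x))
    return [v - mean for v in x]

y = [2, 3, 5, 10]                     # hypothetical preipsative scores
x_mid = multiplicative_ipsatize(y)    # proportions 1/10, 3/20, 1/4, 1/2
x_aid = mid_to_aid(x_mid)             # deviations -3/20, -1/10, 0, 1/4

print(sum(x_mid))   # 1, the constraint (2.5)
print(sum(x_aid))   # 0, the constraint (2.3)
```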
2.2 Partially Additive Ipsative Data (PAID)
Real data may contain ipsative and nonipsative measures at the same time. Such data are known as partially ipsative data (Hicks, 1970). For example, in tests of different rocks, scientists may wish to examine not only the compositions of the rocks but also their hardness and color. The measures of composition are then ipsative while the others are nonipsative. Similarly, some data contain additive ipsative data (AID) and nonipsative data at the same time. Chan and Bentler (1996) defined this kind of data as partially additive ipsative data (PAID). PAID is a collection of AIDs, weighted AIDs, and nonipsative data. In the following, we give more details on Chan and Bentler's definition of PAID, which generalizes the definition of AID in (1.2).
2.2.1 Vector of PAID
As for AID, an underlying nonipsative vector $y$ is assumed for PAID. Let $y = (y_1^T, y_2^T, \ldots, y_G^T)^T$ be the $p \times 1$ underlying random vector, where $y_k$ is a $p_k \times 1$ subvector of $y$, $k = 1, \ldots, G$.
Then $x = (x_1^T, x_2^T, \ldots, x_G^T)^T$ is a $p \times 1$ vector of PAID (with respect to $y$) if
\[ x_k = [I_k - U_k (V_k^T U_k)^{-1} V_k^T]\, y_k = A_k y_k, \qquad k = 1, \ldots, G, \tag{2.6} \]
where $I_k$ is a $p_k \times p_k$ identity matrix, and $U_k$ and $V_k$ are $p_k \times q_k$ known matrices with $\mathrm{rank}(U_k) = \mathrm{rank}(V_k) = q_k$.
Also, we have
\[ \sum_{k=1}^{G} p_k = p \quad \text{and} \quad \sum_{k=1}^{G} q_k = q > 0. \]
Let
$A$ be a $p \times p$ block-diagonal matrix with diagonal blocks $A_1, A_2, \ldots, A_G$, and
$V$ be a $p \times q$ block-diagonal matrix with diagonal blocks $V_1, V_2, \ldots, V_G$.
Then, for $k = 1, \ldots, G$, (2.6) becomes
\[ x = Ay, \tag{2.7} \]
and
\[ V^T x = \left( (V_1^T x_1)^T, (V_2^T x_2)^T, \ldots, (V_G^T x_G)^T \right)^T = 0. \tag{2.8} \]
From (2.8), each subvector $x_k$ is subject to $q_k$ linear constraints, $V_k^T x_k = 0$. Altogether, $x$ is therefore subject to $q$ constraints. Here, $q$ is regarded as a measure of the degree of ipsativity (DI) of PAID. In general, as $q$ (DI) increases, the amount of information remaining in PAID decreases.
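To make (2.6) and (2.8) concrete, the following Python sketch builds a single block $A_k$ for the simplest case $q_k = 1$, where $U_k$ and $V_k$ are column vectors and $(V_k^T U_k)^{-1}$ is a scalar. The weights and scores are made up for illustration, and the helper names `paid_block` and `matvec` are hypothetical.

```python
from fractions import Fraction

def paid_block(u, v):
    """A_k = I - u (v^T u)^{-1} v^T for one constraint (q_k = 1)."""
    p = len(u)
    vtu = sum(Fraction(a) * b for a, b in zip(v, u))   # scalar v^T u
    return [[Fraction(int(i == j)) - Fraction(u[i]) * v[j] / vtu
             for j in range(p)] for i in range(p)]

def matvec(a, y):
    """Matrix-vector product A y."""
    return [sum(aij * yj for aij, yj in zip(row, y)) for row in a]

ones = [1, 1, 1]
weights = [1, 2, 3]      # hypothetical weight vector
y = [4, 7, 1]            # hypothetical raw scores

# U_k = 1 and V_k = weights: deviations from the weighted average 21/6 = 7/2.
x = matvec(paid_block(ones, weights), y)
constraint = sum(w * xi for w, xi in zip(weights, x))   # V_k^T x_k

print(x)            # [Fraction(1, 2), Fraction(7, 2), Fraction(-5, 2)]
print(constraint)   # 0, as required by (2.8)

# U_k = V_k = 1 recovers ordinary AID: deviations from the plain average 4.
x_aid = matvec(paid_block(ones, ones), y)
print([int(v) for v in x_aid])   # [0, 3, -3]
```

The check `constraint == 0` is the single-block version of $V^T x = 0$; with several blocks the same check would be applied block by block.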
2.2.2 Special Cases of PAID
Let $\mathbf{1}_k$ be a $p_k \times 1$ unit vector and $c_k$ be any nonzero $p_k \times 1$ known vector. We then have the following special cases of PAID:
1. If $U_k = V_k = \mathbf{1}_k$ for all $k$, then $x$ contains $G$ different sets of AID. When $G = 1$, PAID equals AID.
2. If $U_k = \mathbf{1}_k$ and $V_k = c_k$, then $x_k$ consists of the deviation scores obtained by subtracting the individual's weighted average from the raw data $y_k$. Then $x_k$ is regarded as weighted AID with weight $c_k$.
3. If $q_k = 0$, then $U_k$ and $V_k$ do not exist. Therefore, $A_k = I_k$ and $x_k = y_k$. In this case, $x$ is allowed to contain a nonipsative component. When this happens, $V$ needs to be modified by removing the columns associated with $V_k$.
Remark:
In this thesis, the procedure for identifying influential observations is first developed for the analysis of AID. The procedure is then applied to PAID, where it is found to be applicable as well.
Chapter 3
Transformation
3.1 Distribution of AID
Let $y$ be a $p \times 1$ underlying random vector in $R^p$ and assume that it is distributed as $N_p[\mu_y, \Sigma_y]$, where $\Sigma_y$ is a symmetric and positive definite matrix. Correspondingly, let $x$ be a $p \times 1$ ipsative data vector with mean $\mu$ and covariance matrix $\Sigma$. Then, we have
\[ E(x) = \mu = A\mu_y \tag{3.1} \]
and
\[ \mathrm{cov}(x) = \Sigma = A \Sigma_y A^T, \]
where $A$ is a known block-diagonal matrix whose elements are determined by the pattern of ipsativity. For AID,
\[ \mathrm{rank}(A) = p - 1, \tag{3.2} \]
and for PAID, the rank of $A$ is given by
\[ \mathrm{rank}(A) = \sum_{k=1}^{G} \mathrm{rank}(A_k) = \sum_{k=1}^{G} (p_k - q_k) = p - q < p. \tag{3.3} \]
From now on, we denote the rank of $A$ by $r$, with $r < p$.
Proposition 1
If $\Sigma_y$ is positive definite, then $\mathrm{rank}(A \Sigma_y A^T) = \mathrm{rank}(A)$.
Proof:
See Chan and Bentler (1996).
Therefore, $\mathrm{rank}(\Sigma) = \mathrm{rank}(A \Sigma_y A^T) = \mathrm{rank}(A) = r < p$. The observed $x$ has a degenerate (singular) normal distribution with mean $\mu$ and covariance matrix $\Sigma$, where $\Sigma$ is symmetric and, being of rank $r < p$, positive semidefinite rather than positive definite.
3.2 Transformation
As the density of $x$ does not exist, the ML estimates of $\mu$ and $\Sigma$ cannot be obtained from $x$ directly. To overcome this problem, Chan and Bentler (1996) proposed a transformation $D: R^p \to R^r$ such that the transformed $x^* = Dx$ has a nonsingular density. They also established the following result for finding the transformation.
Let $D$ be an $r \times p$ matrix such that $R(D^T) = R(A)$. Thus, $D^T$ has full column rank $r$ and
\[ P_{R(D^T)} = P_{R(A)} = D^T (D D^T)^{-1} D, \tag{3.4} \]
where $P_{R(\cdot)}$ is the orthogonal projection matrix onto $R(\cdot)$. Since $x \in R(A)$, there exists a $D$ such that
\[ x = D^T (D D^T)^{-1} x^* = P_{R(A)}\, x, \tag{3.5} \]
where
\[ x^* = Dx \tag{3.6} \]
provides the required transformation of $x$. A simple way to obtain a matrix $D$ that satisfies (3.6) is to remove the redundant rows of $A$.
According to Chan and Bentler (1996), the transformation (3.6) is reversible, because $x$ can be perfectly reconstructed from $x^*$. As $x^*$ preserves the information about $x$, finding the ML estimates of $\mu$ and $\Sigma$ from $x^*$ is reasonable.
Proposition 2
The covariance matrix of $x^*$, $D A \Sigma_y A^T D^T$, is positive definite.
Proof:
See Chan and Bentler (1996).
From Proposition 2, it can be seen that $x^*$ follows a nonsingular $N_r[\mu^*, \Sigma^*]$ with $\mu^* = D\mu = D A \mu_y$ and $\Sigma^* = D \Sigma D^T = D A \Sigma_y A^T D^T$, where $\Sigma^*$ is a symmetric and positive definite matrix.
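Remark:
AID is a special case of PAID with $U_k = V_k = \mathbf{1}_k$ for all $k$, and $G = 1$. Combining (2.2) and (2.6), we have
\[ x = y - \mathbf{1}\bar{y} = [I - \mathbf{1}(\mathbf{1}^T \mathbf{1})^{-1} \mathbf{1}^T]\, y = Ay, \tag{3.7} \]
with $\mathrm{rank}(A) = p - 1$. Hence, the transformation $D$ maps $R^p$ to $R^{p-1}$ and $D^T$ has full column rank $p - 1$.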
3.3 Relationship between Parameters of AID and the Transformed Vector
Assuming that $x$ follows $N_p[\mu, \Sigma]$, the distribution of $x^*$ is $N_r[\mu^*, \Sigma^*]$, where $\mu^* = D\mu$ and $\Sigma^* = D \Sigma D^T$. From (3.5) and (3.6), the relationship between the means and the covariance matrices of $x$ and $x^*$ is given by
\[ \mu^* = D\mu \quad \text{and} \quad \mu = D^T (D D^T)^{-1} \mu^*, \tag{3.8} \]
\[ \Sigma^* = D \Sigma D^T \quad \text{and} \quad \Sigma = D^T (D D^T)^{-1} \Sigma^* (D D^T)^{-1} D. \tag{3.9} \]
Let
\[ \theta = (\mu^T, \sigma^T)^T \]
be a $p^* \times 1$ vector storing the elements of $\mu$ and the lower triangular elements of $\Sigma$, with $p^* = p + p(p+1)/2$, and
\[ \theta^* = (\mu^{*T}, \sigma^{*T})^T \]
be an $r^* \times 1$ vector storing the elements of $\mu^*$ and the lower triangular elements of $\Sigma^*$, with $r^* = r + r(r+1)/2$.
Since the elements of the matrix $D$ are known constants, it follows from (3.8) and (3.9) that there exist an $r^* \times p^*$ constant matrix $R$ and a $p^* \times r^*$ constant matrix $Q$ such that
\[ \theta^* = R\theta \tag{3.10} \]
and
\[ \theta = Q\theta^*. \tag{3.11} \]
Chapter 4
Influence Analysis of Ipsative
Data
In previous chapters, the definitions of ipsative data and of the transformation were introduced in detail. In this chapter, a procedure based on the local influence approach is developed for identifying influential observations in the analysis of ipsative data. The procedure builds on the work of Cook (1986), Poon and Poon (1999) and Poon and Poon (2002).
The steps of the local influence approach for developing diagnostic measures for ipsative data are as follows:
1. Define the postulated model.
2. Choose a perturbation on the postulated model.
3. Define the induced likelihood displacement function $f(\cdot)$.
4. Use differential geometry techniques to assess the behavior of the influence graph $g$ of $f(\cdot)$, with a view to developing diagnostic measures for identifying influential observations.
4.1 The Postulated Model
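In this section, we define the postulated model for ipsative data. Let $x$ be a $p \times 1$ random vector of AID distributed as $N_p[\mu, \Sigma]$, where $\Sigma = \{\sigma_{ab}\}$, and let $\{x_i, i = 1, \ldots, n\}$ be a random sample from which we estimate the mean and the covariance matrix. Then $x^*$ is an $r \times 1$ transformed vector, obtained via the transformation (3.6) and distributed as $N_r[\mu^*, \Sigma^*]$, where $\Sigma^* = \{\sigma^*_{ab}\}$ is a symmetric and positive definite matrix and $\{x_i^*, i = 1, \ldots, n\}$ is the transformed sample.
The maximum likelihood (ML) estimates $\hat{\mu}^*$ of $\mu^*$ and $\hat{\Sigma}^*$ of $\Sigma^*$ are obtained by maximizing the log-likelihood
\[ L(\theta^*) = \sum_{i=1}^{n} \frac{1}{2} \left( -r \log(2\pi) - \log|\Sigma^*| - (x_i^* - \mu^*)^T \Sigma^{*-1} (x_i^* - \mu^*) \right). \tag{4.1} \]
If we express $L(\theta^*)$ in (4.1) in terms of $\theta = (\mu^T, \sigma^T)^T$, it becomes
\[ L(\theta) = \sum_{i=1}^{n} \frac{1}{2} \left( -r \log(2\pi) - \log|D \Sigma D^T| - (x_i^* - D\mu)^T (D \Sigma D^T)^{-1} (x_i^* - D\mu) \right). \tag{4.2} \]
It can be shown that the likelihood function $L(\theta)$ is maximized when
\[ D\hat{\mu} = \frac{\sum_{i=1}^{n} x_i^*}{n} = \bar{x}^* \tag{4.3} \]
and
\[ D \hat{\Sigma} D^T = \frac{\sum_{i=1}^{n} (x_i^* - \bar{x}^*)(x_i^* - \bar{x}^*)^T}{n} = \frac{n-1}{n} S^*, \tag{4.4} \]
where $S^*$ is the unbiased sample covariance matrix of $x^*$. Then, the ML estimate $\hat{\theta} = (\hat{\mu}^T, \hat{\sigma}^T)^T$ of $\theta$ is given by
\[ \hat{\mu} = D^T (D D^T)^{-1} \bar{x}^* \tag{4.5} \]
and
\[ \hat{\Sigma} = D^T (D D^T)^{-1} \left( \frac{n-1}{n} S^* \right) (D D^T)^{-1} D. \tag{4.6} \]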
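The estimates (4.5) and (4.6) can be illustrated with a small numerical sketch. The Python code below uses a made-up sample of $n = 4$ respondents and $p = 3$ items (not data from the thesis), transforms the AID sample with a $D$ formed from the first two rows of $A$, and maps the ML estimates of $\mu^*$ and $\Sigma^*$ back to $\hat{\mu}$ and $\hat{\Sigma}$. All helper names are hypothetical.

```python
from fractions import Fraction

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse(m):
    """Gauss-Jordan inverse of a nonsingular square matrix of Fractions."""
    n = len(m)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        pivot_row = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def col_means(m):
    return [sum(col) / len(m) for col in zip(*m)]

def ml_cov(m):
    """ML covariance (divisor n) of the rows of m, i.e. (n-1)/n times S."""
    mean = col_means(m)
    c = [[v - mu for v, mu in zip(row, mean)] for row in m]
    return [[v / len(m) for v in row] for row in matmul(transpose(c), c)]

raw = [[5, 1, 3], [2, 2, 8], [7, 4, 1], [3, 6, 6]]    # hypothetical raw scores
p = len(raw[0])
X = [[Fraction(v) - Fraction(sum(y), p) for v in y] for y in raw]   # AID sample

A = [[Fraction(int(i == j)) - Fraction(1, p) for j in range(p)] for i in range(p)]
D = A[:p - 1]                                   # drop the last (redundant) row
Dt = transpose(D)
back = matmul(Dt, inverse(matmul(D, Dt)))       # D^T (D D^T)^{-1}

X_star = matmul(X, Dt)                          # rows are x_i^* = D x_i
mu_hat = matmul(back, [[v] for v in col_means(X_star)])               # (4.5)
sigma_hat = matmul(matmul(back, ml_cov(X_star)), transpose(back))     # (4.6)

print(mu_hat == [[v] for v in col_means(X)])    # True
print(sigma_hat == ml_cov(X))                   # True
print(all(sum(row) == 0 for row in sigma_hat))  # True: rows of the estimate sum to zero
```

In this AID example the back-transformed estimates coincide with the sample mean and the divisor-$n$ covariance of the ipsative sample itself, and the estimated covariance matrix satisfies the zero-sum constraint (1.4).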
Note that more than one full row rank matrix $D$ may satisfy (3.6). However, the functions $L(\theta) = L_D(\theta)$ defined in (4.2) with respect to different $D$ differ only by a constant.
Proposition 3
The functions $L_{D_1}(\theta)$ and $L_{D_2}(\theta)$ differ by a constant.
Proof:
See Appendix A.1.
4.2 Perturbation
In order to assess the influence of individual observations on the estimate of $\theta$, the case-weights perturbation (Cook, 1986) is introduced into the log-likelihood. The resulting perturbed log-likelihood is given by
\[ L(\theta \mid \omega) = \sum_{i=1}^{n} \frac{\omega_i}{2} \left( -r \log(2\pi) - \log|D \Sigma D^T| - (x_i^* - D\mu)^T (D \Sigma D^T)^{-1} (x_i^* - D\mu) \right) \tag{4.7} \]
\[ = \sum_{i=1}^{n} \frac{\omega_i}{2} \left( -r \log(2\pi) - \log|\Sigma^*| - (x_i^* - \mu^*)^T \Sigma^{*-1} (x_i^* - \mu^*) \right) = L(\theta^* \mid \omega), \tag{4.8} \]
where $\omega_i$, $i = 1, \ldots, n$, are perturbation parameters, and $\omega = (\omega_1, \ldots, \omega_n)^T$, which collects these parameters, is an $n \times 1$ vector in a relevant perturbation space $\Omega$ of $R^n$. It is assumed that there exists an $\omega_0$ such that $L(\theta \mid \omega_0) = L(\theta)$ for all $\theta$ and $L(\theta^* \mid \omega_0) = L(\theta^*)$ for all $\theta^*$.
4.3 Likelihood Displacement
Let $\hat{\theta}$ and $\hat{\theta}_\omega$ be the estimates of $\theta$ that maximize the likelihood $L(\theta)$ in (4.2) and $L(\theta \mid \omega)$ in (4.7), respectively. Let $\hat{\theta}^*$ and $\hat{\theta}^*_\omega$ be the estimates of $\theta^*$ that maximize the likelihood $L(\theta^*)$ in (4.1) and $L(\theta^* \mid \omega)$ in (4.8), respectively. From the relationships given in (3.10) and (3.11), we have
\[ \hat{\theta}^* = R\hat{\theta} \quad \text{and} \quad \hat{\theta}^*_\omega = R\hat{\theta}_\omega, \tag{4.9} \]
\[ \hat{\theta} = Q\hat{\theta}^* \quad \text{and} \quad \hat{\theta}_\omega = Q\hat{\theta}^*_\omega. \tag{4.10} \]
Following Cook (1986), the discrepancy between $\hat{\theta}$ and $\hat{\theta}_\omega$ can be measured by the likelihood displacement function
\[ f(\omega) = L(\hat{\theta} \mid \omega_0) - L(\hat{\theta}_\omega \mid \omega_0). \tag{4.11} \]
This function attains its minimum value at $\omega = \omega_0$. We have $L(\theta \mid \omega_0) = L(\theta)$ and $L(\theta^* \mid \omega_0) = L(\theta^*)$ if $\omega_0 = \mathbf{1}$, where $\mathbf{1}$ is an $n \times 1$ vector with 1 in every slot. When the perturbation specified by $\omega$ causes a considerable deviation of $\hat{\theta}_\omega$ from $\hat{\theta}$, a considerable deviation of $f(\omega)$ from $f(\omega_0)$ results. Hence, examining the change of $f(\omega)$ as a function of $\omega$ enables us to assess the influential perturbations and so to identify observations that influence the estimate of $\theta$.
Proposition 4
The displacement function $f(\omega)$ does not depend on the choice of $D$.
Proof:
See Appendix A.2.
4.4 Normal Curvature
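In this section, we assess the behavior of the influence graph $g$ of the likelihood displacement function $f(\omega)$ by applying differential geometry techniques. Cook (1986) proposed the normal curvature $C_\ell$ to quantify the change of $f(\omega)$ from $f(\omega_0)$, where $\ell$ defines a direction for a straight line in $\Omega$ passing through $\omega_0$. A large value of $C_\ell$ indicates that the perturbation along the corresponding direction $\ell$ induces a substantial change in the likelihood displacement. Poon and Poon (1999) have shown that the normal curvature $C_\ell$ of the influence graph $g$ of $f(\omega)$ along a direction $\ell$ at the point $\omega_0$ is given by
\[ C_\ell = \frac{\ell^T H_f \ell}{(1 + \nabla_f^T \nabla_f)^{1/2}\; \ell^T (I + \nabla_f \nabla_f^T)\, \ell}, \tag{4.12} \]
where $\nabla_f = (\partial f / \partial \omega_1, \ldots, \partial f / \partial \omega_n)^T$ is the gradient vector of $f$, and $H_f$ is an $n \times n$ matrix given by
\[ H_f = \frac{\partial^2 f}{\partial \omega\, \partial \omega^T}. \tag{4.13} \]
As $f(\omega)$ in (4.11) attains its minimum when $\omega = \omega_0$, the gradient vanishes there, and if $\ell$ is chosen such that $\|\ell\| = 1$, then $C_\ell$ in (4.12) reduces to
\[ C_\ell = \ell^T H_f \ell. \tag{4.14} \]
Let $\ddot{L}_\theta$ and $\Delta$ be $p^* \times p^*$ and $p^* \times n$ matrices respectively, defined as
\[ \ddot{L}_\theta = \left. \frac{\partial^2 L(\theta)}{\partial \theta\, \partial \theta^T} \right|_{\theta = \hat{\theta}} \quad \text{and} \quad \Delta = \left. \frac{\partial^2 L(\theta \mid \omega)}{\partial \theta\, \partial \omega^T} \right|_{\theta = \hat{\theta},\, \omega = \omega_0}. \tag{4.15} \]
Cook (1986) has shown that $C_\ell$ can in general be written as a function of $\ddot{L}_\theta$ and $\Delta$. On the other hand, let
\[ \ddot{L}_{\theta^*} = \left. \frac{\partial^2 L(\theta^*)}{\partial \theta^*\, \partial \theta^{*T}} \right|_{\theta^* = \hat{\theta}^*}, \tag{4.16} \]
then $\ddot{L}_{\theta^*}$ is an $r^* \times r^*$ matrix that can be obtained using the covariance matrix of $\hat{\mu}^*$ and $\hat{\sigma}^*$ (Poon & Poon, 2002, equation (9); see Appendix B.1).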
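The relationship between $\ddot{L}_\theta$ and $\ddot{L}_{\theta^*}$ is given by
\[ \ddot{L}_\theta = \frac{\partial^2 L(\theta)}{\partial \theta\, \partial \theta^T} = \left( \frac{\partial \theta^*}{\partial \theta^T} \right)^T \frac{\partial^2 L(\theta^*)}{\partial \theta^*\, \partial \theta^{*T}} \left( \frac{\partial \theta^*}{\partial \theta^T} \right) = R^T \ddot{L}_{\theta^*} R. \tag{4.17} \]
The calculation of $\partial \theta^* / \partial \theta^T$ is given in Appendix C.
Moreover, if
\[ \Delta^* = \left. \frac{\partial^2 L(\theta^* \mid \omega)}{\partial \theta^*\, \partial \omega^T} \right|_{\theta^* = \hat{\theta}^*,\, \omega = \omega_0}, \tag{4.18} \]
then $\Delta^*$ is an $r^* \times n$ matrix whose elements have analytical expressions available in Poon and Poon (2002, equations (15) and (17); see Appendix B.2). The relationship between $\Delta$ and $\Delta^*$ is given by
\[ \Delta = \frac{\partial^2 L(\theta \mid \omega)}{\partial \theta\, \partial \omega^T} = \left( \frac{\partial \theta^*}{\partial \theta^T} \right)^T \frac{\partial^2 L(\theta^* \mid \omega)}{\partial \theta^*\, \partial \omega^T} = R^T \Delta^*. \tag{4.19} \]
The normal curvature $C_\ell$ and the conformal normal curvature $B_\ell$ (Poon & Poon, 1999) can in general be computed from the matrix $\ddot{L}_\theta$, evaluated at $\hat{\theta}$. It is worth noting that the conformal normal curvature $B_\ell$, which transforms the normal curvature $C_\ell$ onto the unit interval, has been demonstrated by Poon and Poon (1999) to be an effective influence measure. However, the matrix $\ddot{L}_\theta$ is singular for ipsative data. Thus, it is difficult to apply the methods proposed by Cook (1986) or Poon and Poon (2002) directly to develop diagnostic measures. We therefore consider the following method for computing the normal curvature.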
4.5 Computation of the Normal Curvature
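Following Cook (1986, equation (12)), we have
\[ H_f = -J^T \ddot{L}_\theta J, \tag{4.20} \]
where $J$ is a $p^* \times n$ matrix defined as
\[ J = \left. \frac{\partial \hat{\theta}_\omega}{\partial \omega^T} \right|_{\omega = \omega_0}. \tag{4.21} \]
To evaluate $J$, we use the fact that
\[ \left. \frac{\partial L(\theta \mid \omega)}{\partial \theta_j} \right|_{\theta = \hat{\theta}_\omega} = 0 \tag{4.22} \]
for $j = 1, 2, \ldots, p^*$ and all $\omega$ in $\Omega$. Differentiating (4.22) with respect to $\omega_i$ for $i = 1, 2, \ldots, n$ and evaluating at $\hat{\theta}$ and $\omega_0$, we obtain
\[ \sum_{k=1}^{p^*} \frac{\partial^2 L(\theta \mid \omega)}{\partial \theta_j\, \partial \theta_k} \frac{\partial \hat{\theta}_{\omega,k}}{\partial \omega_i} + \frac{\partial^2 L(\theta \mid \omega)}{\partial \theta_j\, \partial \omega_i} = 0. \tag{4.23} \]
By (4.15) and (4.21), (4.23) reduces to
\[ \ddot{L}_\theta J + \Delta = 0, \tag{4.24} \]
where all matrices are evaluated at $\theta = \hat{\theta}$ and $\omega = \omega_0$. The matrix $\ddot{L}_\theta$ is singular because its covariance part is singular. We use the crude covariance matrix $\mathrm{cov}_c(\hat{\sigma})$ in Aitchison (1986, p. 52) to approximate the negative of the covariance part of the inverse of $\ddot{L}_\theta$. The crude covariance matrix $\mathrm{cov}_c(\hat{\sigma})$ is calculated from the ipsative data as if they were ordinary non-ipsative data. Let $\ddot{L}_c$ be the approximation to the inverse of $\ddot{L}_\theta$ in which the negative of the covariance part equals the crude covariance matrix. Then $\ddot{L}_c$ is a block-diagonal matrix given by
\[ \ddot{L}_c = \begin{pmatrix} -\mathrm{cov}(\hat{\mu}) & 0 \\ 0 & -\mathrm{cov}_c(\hat{\sigma}) \end{pmatrix}, \tag{4.25} \]
where $\mathrm{cov}(\hat{\mu})$ and $\mathrm{cov}_c(\hat{\sigma})$ can be computed using equations (10) and (11) in Poon and Poon (2002; see Appendix B.3). As a result (see also (4.19)),
\[ J = -\ddot{L}_c \Delta = -\ddot{L}_c R^T \Delta^*. \tag{4.26} \]
Furthermore, from (4.20), (4.26) and (4.17), we have
\[ H_f = -J^T \ddot{L}_\theta J = -(-\ddot{L}_c R^T \Delta^*)^T \ddot{L}_\theta (-\ddot{L}_c R^T \Delta^*) = -(-\ddot{L}_c R^T \Delta^*)^T (R^T \ddot{L}_{\theta^*} R)(-\ddot{L}_c R^T \Delta^*) = -\Delta^{*T} R \ddot{L}_c R^T \ddot{L}_{\theta^*} R \ddot{L}_c R^T \Delta^*. \tag{4.27} \]
Hence, by (4.14) and (4.27),
\[ C_\ell = -\ell^T \left( \Delta^{*T} R \ddot{L}_c R^T \ddot{L}_{\theta^*} R \ddot{L}_c R^T \Delta^* \right) \ell. \tag{4.28} \]
The most influential observations can be identified from the direction $\ell_{\max}$ along which the greatest change in the likelihood displacement is observed, that is, the direction that gives $C_{\max} = \max_\ell C_\ell$. Moreover, $C_{\max}$ and $\ell_{\max}$ are the largest eigenvalue and the associated eigenvector of the symmetric matrix $H_f$.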
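Once a symmetric matrix $H_f$ is available, $C_{\max}$ and $\ell_{\max}$ can be obtained from any symmetric eigenvalue routine. As a minimal sketch, the Python code below uses plain power iteration, which is a substitute technique chosen for brevity and not the method used in the thesis. It assumes that $H_f$ is positive semidefinite with a unique largest eigenvalue, and the $3 \times 3$ matrix is a made-up stand-in for $H_f$.

```python
import math

def top_eigenpair(h, iters=200):
    """Power iteration for the largest eigenvalue and eigenvector of h.

    Assumes h is symmetric positive semidefinite with a unique largest
    eigenvalue; returns (c_max, l_max) with l_max of unit length.
    """
    n = len(h)
    v = [float(i + 1) for i in range(n)]          # arbitrary nonzero start
    for _ in range(iters):
        w = [sum(hij * vj for hij, vj in zip(row, v)) for row in h]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    hv = [sum(hij * vj for hij, vj in zip(row, v)) for row in h]
    c_max = sum(a * b for a, b in zip(v, hv))     # l^T H l, as in (4.14)
    return c_max, v

# Hypothetical stand-in for H_f with eigenvalues 1, 2 and 4.
H = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]

c_max, l_max = top_eigenpair(H)
most_influential = max(range(len(l_max)), key=lambda j: abs(l_max[j]))
print(round(c_max, 6))        # 4.0
print(most_influential + 1)   # 2: the case with the largest entry of l_max
```

Cases whose entries in $\ell_{\max}$ are large in absolute value contribute most to the direction of greatest change in the likelihood displacement.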
4.6 Diagnostic Measures
As the normal curvature $C_\ell$ is defined on an unbounded interval, its magnitude may be difficult to judge. The normal curvature is therefore transformed one-to-one onto the unit interval, and the transformed quantity is called the conformal normal curvature $B_\ell$. At the critical point $\omega_0$ along the direction $\ell$, the conformal normal curvature is given by
\[ B_\ell = \left. \frac{\ell^T H_f \ell}{\sqrt{\mathrm{tr}(H_f^2)}} \right|_{\theta = \hat{\theta},\, \omega = \omega_0}. \tag{4.29} \]
4.6.1 Observations influencing the estimates of $\mu$ and $\Sigma$
Let $E_j$, $j = 1, \ldots, n$, be the vectors of the $n$-dimensional standard basis. Poon and Poon (1999) demonstrated that $B_{E_j} = B_j$, $j = 1, \ldots, n$, are effective measures for identifying the influential perturbation parameters when $C_{\max}$ is sufficiently large. Note that $B_j$ is the $j$-th diagonal element of the matrix
\[ \left. \frac{H_f}{\sqrt{\mathrm{tr}(H_f^2)}} \right|_{\theta = \hat{\theta},\, \omega = \omega_0}. \tag{4.30} \]
At this point, the matrix $H_f$ can be computed by (4.27), and hence so can the diagnostic measures for revealing observations that influence the estimates of both $\mu$ and $\Sigma$. Observations that exert an undue influence on the estimates of both $\mu$ and $\Sigma$ can be located by examining $B_j$, $j = 1, \ldots, n$: observations with large $B_j$ values are the influential ones.
4.6.2 Observations influencing the estimate of $\mu$ or $\Sigma$
Some observations may be influential to the estimate of $\mu$ but not to the estimate of $\Sigma$, or vice versa. Hence, we follow the method of Poon and Poon (2002) to construct two further diagnostic measures for ipsative data, $B_j^{\mu}$ and $B_j^{\sigma}$, $j = 1, \ldots, n$, which identify observations that exercise an undue influence on the estimate of $\mu$ and on the estimate of $\Sigma$, respectively.
The diagnostic measures $B_j$, $j = 1, \ldots, n$, are developed from the influence graph $g$ of the likelihood displacement function $f(\omega)$ given in (4.11). Therefore, the estimates of all parameters in $\theta$ are affected by the perturbation. When only the effects on a subset of the parameters are of interest, Cook (1986) demonstrated that they can be studied by examining the normal curvature of the influence graph of an objective function derived from (4.11).
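Let $\ddot{L}_\theta(\mu)$ and $\ddot{L}_\theta(\sigma)$ be $p \times p$ and $(p^* - p) \times (p^* - p)$ matrices respectively, such that
\[ \ddot{L}_\theta = \begin{pmatrix} \ddot{L}_\theta(\mu) & 0 \\ 0 & \ddot{L}_\theta(\sigma) \end{pmatrix}, \tag{4.31} \]
where $\ddot{L}_\theta = R^T \ddot{L}_{\theta^*} R$ is obtained via (4.17).
Then, the influence of the perturbation on the estimate of $\mu$ can be studied by examining the normal curvature
\[ C_\ell^{\mu} = \ell^T H_f^{\mu} \ell = -\ell^T \left( \Delta^{*T} R \ddot{L}_c^{\mu} \ddot{L}_\theta^{\mu} \ddot{L}_c^{\mu} R^T \Delta^* \right) \ell, \tag{4.32} \]
where $\Delta^*$ is the same $r^* \times n$ matrix as in (4.18), $\ddot{L}_c^{\mu}$ is a $p^* \times p^*$ matrix given by
\[ \ddot{L}_c^{\mu} = \begin{pmatrix} -\mathrm{cov}(\hat{\mu}) & 0 \\ 0 & 0 \end{pmatrix}, \tag{4.33} \]
and $\ddot{L}_\theta^{\mu}$ is a $p^* \times p^*$ matrix equal to
\[ \ddot{L}_\theta^{\mu} = \begin{pmatrix} \ddot{L}_\theta(\mu) & 0 \\ 0 & 0 \end{pmatrix}. \tag{4.34} \]
Observations that exercise an undue influence on the estimate of $\mu$ can be traced by examining the diagonal elements $B_j^{\mu}$, $j = 1, \ldots, n$, of the matrix
\[ \left. \frac{H_f^{\mu}}{\sqrt{\mathrm{tr}\left( (H_f^{\mu})^2 \right)}} \right|_{\theta = \hat{\theta},\, \omega = \omega_0}. \tag{4.35} \]
Similarly, the influence of the perturbation on the estimate of $\Sigma$ is reflected by the normal curvature
\[ C_\ell^{\sigma} = \ell^T H_f^{\sigma} \ell = -\ell^T \left( \Delta^{*T} R \ddot{L}_c^{\sigma} \ddot{L}_\theta^{\sigma} \ddot{L}_c^{\sigma} R^T \Delta^* \right) \ell, \tag{4.36} \]
where $\ddot{L}_c^{\sigma}$ is a $p^* \times p^*$ matrix given by
\[ \ddot{L}_c^{\sigma} = \begin{pmatrix} 0 & 0 \\ 0 & -\mathrm{cov}_c(\hat{\sigma}) \end{pmatrix}, \tag{4.37} \]
and $\ddot{L}_\theta^{\sigma}$ is a $p^* \times p^*$ matrix equal to
\[ \ddot{L}_\theta^{\sigma} = \begin{pmatrix} 0 & 0 \\ 0 & \ddot{L}_\theta(\sigma) \end{pmatrix}. \tag{4.38} \]
Observations that exercise a disproportionate influence on the estimate of $\Sigma$ can be located by examining the diagonal elements $B_j^{\sigma}$, $j = 1, \ldots, n$, of the matrix
\[ \left. \frac{H_f^{\sigma}}{\sqrt{\mathrm{tr}\left( (H_f^{\sigma})^2 \right)}} \right|_{\theta = \hat{\theta},\, \omega = \omega_0}. \tag{4.39} \]
We have now constructed three diagnostic measures based on the local influence approach, $B_j$, $B_j^{\mu}$ and $B_j^{\sigma}$, to identify observations that exert an irregular influence on the estimates of $\mu$ together with $\Sigma$, the estimate of $\mu$, and the estimate of $\Sigma$, respectively. In the next section, we demonstrate by examples how to apply the developed procedure to the analysis of ipsative data.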
4.7 Examples
4.7.1 Example 1: Foraminiferal Compositions Data Set
(AID)
As a demonstration, we first consider the foraminiferal compositions data set,
which is available in Aitchison (1986, p.399). The data set consists of 30 specimens.
Each composition consists of the percentages by weight of four components:
Neogloboquadrina atlantica, Neogloboquadrina pachyderma, Globorotalia
obesa and Globigerinoides triloba, which we conveniently abbreviate to Prop.1,
Prop.2, Prop.3 and Prop.4. Pairwise scatter plots among the 4 variables are provided
in Figure 4.1. The data are transformed from MID to AID by subtracting
1 from each slot and analyzed using the proposed procedure, which is programmed
in Matlab (see Appendix D). The results are plotted in Figures 4.2a, 4.2b and
4.2c, where the values of $B_j$, $B_j^{\mu}$ and $B_j^{\Sigma}$, $j = 1, \ldots, n$, are computed from $H$, $H^{\mu}$
and $H^{\Sigma}$ respectively with a $D$ satisfying (3.6). The selected $D$ for this example is
$$D = \begin{pmatrix} 3/4 & -1/4 & -1/4 & -1/4 \\ -1/4 & 3/4 & -1/4 & -1/4 \\ -1/4 & -1/4 & -1/4 & 3/4 \end{pmatrix}. \tag{4.40}$$
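
The following sketch illustrates what the matrix in (4.40) does to a zero-sum observation. It assumes the centring step used in Appendix D (subtracting the row mean), and the helper names `center` and `transform` are hypothetical. Because this $D$ is the $4 \times 4$ centring matrix with its third row deleted, applying it to a zero-sum vector simply drops the third component.

```python
from fractions import Fraction as F

# D from (4.40): the centring matrix I - (1/4) 1 1^T with its third row deleted.
D = [
    [F(3, 4), F(-1, 4), F(-1, 4), F(-1, 4)],
    [F(-1, 4), F(3, 4), F(-1, 4), F(-1, 4)],
    [F(-1, 4), F(-1, 4), F(-1, 4), F(3, 4)],
]


def center(obs):
    """Subtract the mean of the components so that they sum to zero."""
    mean = sum(obs) / len(obs)
    return [v - mean for v in obs]


def transform(d, x):
    """Compute the matrix-vector product d x."""
    return [sum(dk * xk for dk, xk in zip(row, x)) for row in d]


# First observation listed in Appendix D, written as exact fractions.
raw = [F(74, 100), F(19, 100), F(3, 100), F(4, 100)]
x = center(raw)           # additive ipsative vector with constant sum zero
x_star = transform(D, x)  # non-ipsative vector of dimension r = 3
```

Exact fractions are used only so that the equality can be checked without rounding error.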
The results show that cases 24, 13, 25 and 26, listed in decreasing order of
influence, are the most influential on the estimates of both $\mu$ and $\Sigma$. The findings
are sensible because all of these observations are usually located at the boundaries
of the data point clouds formed by pairs of variables. Among these observations, the
most extreme case, namely case 24, is usually located far from the majority of the
data; while it is the most influential on both estimates, its effect on $\Sigma$ is more
pronounced than that on $\mu$. On the other hand, we found from Figures 4.2b and 4.2c
that $B_{30}^{\mu}$ is relatively large while $B_{30}^{\Sigma}$ is not, so we conclude that the influence of
case 30 on the estimate of $\mu$ is larger than that on the estimate of $\Sigma$. The effect
of case 30 is quite similar to that of cases 13, 25 and 26, except that its effect on
the estimate of $\Sigma$ is a little lower. This may be because it lies more often in the
corners of the scatter plots than the others.
4.7.2 Example 2: Compositions of Sediments Data Set
(PAID)
The second data set considered is the compositions of sediments data set (PAID),
which is available in Aitchison (1986, p.359). Specimens of sediments are traditionally
separated into three mutually exclusive and exhaustive components, sand,
silt and clay, and the proportions of these constituents by weight are quoted as
compositions. The data set records compositions of 39 sediment samples at different
water depths in an Arctic lake. Thus, there are 4 variables: sand, silt, clay and
water depth. Pairwise scatter plots among the 4 variables are provided in Figure
4.3. We subtracted 1 from the values of the first three variables, sand, silt and clay,
so that the data set becomes PAID. We analyzed the data set using the
proposed procedure and plotted the results in Figures 4.4a, 4.4b and 4.4c, where the
values of $B_j$, $B_j^{\mu}$ and $B_j^{\Sigma}$, $j = 1, \ldots, n$, are computed from $H$, $H^{\mu}$ and $H^{\Sigma}$
respectively with a $D$ satisfying (3.6). The selected $D$ for this example is
$$D = \begin{pmatrix} 2/3 & -1/3 & -1/3 & 0 \\ -1/3 & -1/3 & 2/3 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}. \tag{4.41}$$
Cases 7 and 14 are classified as influential on the estimates of both $\mu$ and $\Sigma$. From
Figure 4.3, we found that these two observations are usually located outside the
ellipse formed by the majority of the data. As a result, they are the most influential
on both estimates, but their effect on $\Sigma$ is more substantial than that
on $\mu$. On the other hand, Figures 4.4b and 4.4c show that cases 1 and 18, which have
a considerable effect on the estimate of $\mu$, do not have a similarly prominent effect
on the estimate of $\Sigma$. We can see from Figure 4.3 that these two observations usually
lie at the corners of the pairwise scatter plots of the variables sand, silt and clay.
Owing to the constant-sum property of these three variables, an extra large value of
one variable results in small values of the others. Therefore, cases 1 and 18
affect the estimate of the mean vector substantially.
[Figure 4.1 image: pairwise scatter plots of Prop.1, Prop.2, Prop.3 and Prop.4, with cases 13, 24, 25, 26 and 30 labelled]
Figure 4.1: Scatter plots for the foraminiferal compositions data set
[Figure 4.2 image: index plots (a), (b) and (c) of the diagnostic measures against the case index, with cases 13, 24, 25, 26 and 30 labelled]
Figure 4.2: Index plots of the influential measures for the foraminiferal compositions data set
[Figure 4.3 image: pairwise scatter plots of sand, silt, clay and water depth, with labelled cases including 7, 14 and 18]
Figure 4.3: Scatter plots for the compositions of sediments data set
[Figure 4.4 image: index plots (a), (b) and (c) of the diagnostic measures against the case index, with cases 1, 7, 14 and 18 labelled]
Figure 4.4: Index plots of the influential measures for the compositions of sediments data set
Chapter 5
Discussion
Data are said to be ipsative when they are subject to a constant-sum constraint. This
constraint automatically induces singular structures for the population covariance
matrix and the sample covariance matrix. Nearly all major computational problems in
establishing a procedure for the analysis of ipsative data result from
this singularity. To establish a procedure for influence
analysis, we first need to define the likelihood function and calculate the ML
estimates. Since the ML approach cannot be applied directly to data with a singular
covariance matrix, the transformation in (3.6) is introduced, as is common practice.
After that, we follow the work of Poon and Poon (1999), which is a modification
of the work of Cook (1986), to assess the influence of local perturbation by the
use of the normal curvature $C_l$ and the conformal normal curvature $B_l$. Then,
following the method of Poon and Poon (2002), we established three diagnostic
measures $B_j$, $B_j^{\mu}$ and $B_j^{\Sigma}$. These diagnostic measures are found to be applicable
to all kinds of ipsative data defined in this thesis, that is, AID, MID and PAID.
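
To make the singularity concrete, here is a small sketch showing that the rows of the sample covariance matrix of constant-sum data add up to zero, so the matrix cannot be inverted. The toy observations and the helper `sample_covariance` are assumptions for illustration and are not taken from the data sets above.

```python
from fractions import Fraction as F


def sample_covariance(rows):
    """Unbiased sample covariance matrix of observations stored as rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [
            sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
            for b in range(p)
        ]
        for a in range(p)
    ]


# Toy observations (illustrative only); every row sums to the constant 10.
toy = [
    [F(5), F(3), F(2)],
    [F(1), F(6), F(3)],
    [F(4), F(4), F(2)],
    [F(2), F(2), F(6)],
]
cov = sample_covariance(toy)
row_sums = [sum(row) for row in cov]  # each entry is exactly zero
```

Since the vector of ones is mapped to zero, the covariance matrix has a zero eigenvalue, which is why a transformation such as (3.6) is needed before the likelihood can be written down.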
In the process of calculating the normal curvature, a crude covariance matrix
(Aitchison, 1986) is used to approximate the negative of the covariance part of
the inverse of the matrix $\ddot{L}_{\theta}$. The singular structure of the matrix $\ddot{L}_{\theta}$ is also
a consequence of the singularity of the covariance matrix. It is found that the
diagnostic measures $B_j$ and $B_j^{\Sigma}$ change with $D$; this change may be due
to the use of the crude covariance matrix. However, as $f(\omega)$ is invariant with
respect to $D$, we can solve this problem by computing
S g (5.1)
directly. As this is very tedious, further study is needed.
The purpose of identifying influential observations is exploratory rather than
confirmatory, and any identified observations will be followed up
by a careful analysis of the underlying facts, so strict adherence
to a critical value for identification does not seem necessary. Thus, in determining
which values of a measure mark observations worthy of further notice, we make
use of the natural gap approach and detect large values with the help
of an index plot. In most cases, this simple method can effectively disclose
observations that need attention. However, if objectivity is desired, one may employ
the reference constant proposed by Poon and Poon (1999). The constant uses
the geometric concept of mean curvature to establish a benchmark for judging the
largeness of a measure.
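
The natural gap approach is applied visually in this thesis. Purely as an illustration, the sketch below shows one hypothetical way to mimic it numerically: sort the values of a measure, find the largest gap between consecutive sorted values, and flag the cases above that gap. The function `flag_by_natural_gap` and the toy values are assumptions and do not reproduce any result reported above.

```python
def flag_by_natural_gap(values):
    """Return 1-based case numbers lying above the largest gap in the sorted values.

    This is an assumed formalisation of reading an index plot by eye;
    it needs at least two values.
    """
    order = sorted(range(len(values)), key=lambda j: values[j], reverse=True)
    ranked = [values[j] for j in order]
    gaps = [ranked[k] - ranked[k + 1] for k in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return sorted(j + 1 for j in order[: cut + 1])


# Toy values of a diagnostic measure for six cases (illustrative only).
toy_measure = [0.02, 0.45, 0.03, 0.01, 0.30, 0.04]
flagged = flag_by_natural_gap(toy_measure)
```

Any case flagged this way would still call for the careful follow-up analysis described above.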
The development of our procedure for the analysis of ipsative data is based
on a simple case, where the covariance matrix $\Sigma$ has no structure and the data
are assumed to be normally distributed. Thus, we can generalize the procedure
in two directions. First, the developed procedure can be generalized to the case
where the covariance matrix $\Sigma$ is structured. Second, it can be generalized to
other multivariate distributions. The normal distribution is chosen in this study
not only because of its popularity, but also because many multivariate techniques
make use of the ML estimates of the normal model parameters, that is, the sample
mean and/or the sample covariance matrix. Therefore, whenever an attempt is
made to use the sample mean given in (4.5) and the sample covariance matrix given in
(4.6) to estimate the location or dispersion of an ipsative data set, the proposed
diagnostic measures are applicable to identify those observations that exercise
undue influential effects on the estimates.
Appendix A
Proof of Propositions
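A.1 Proof of Proposition 3
Proposition 3
The functions $L_{D_1}(\theta)$ and $L_{D_2}(\theta)$ differ by a constant.
Proof
Let $D_1$ and $D_2$ be two different matrices satisfying $R(D_j) = R(A)$, $j = 1, 2$.
Since $\mathrm{rank}(D_1) = \mathrm{rank}(D_2) = r$, there exists a nonsingular $r \times r$ matrix $K$ such
that $D_1 = K D_2$. Let $x_{ij}^* = D_j x_i$ and $\bar{x}_j^* = \sum_{i=1}^{n} x_{ij}^* / n$, $i = 1, \ldots, n$ and $j = 1, 2$;
then
$$x_{i1}^* = D_1 x_i = K D_2 x_i = K x_{i2}^* \quad \text{and} \quad \bar{x}_1^* = \sum_{i=1}^{n} x_{i1}^* / n = \sum_{i=1}^{n} K x_{i2}^* / n = K \bar{x}_2^*. \tag{A.1}$$
We write the second and third terms of (4.2) as follows:
$$\begin{aligned}
& -\frac{n}{2} \log |D_1 \Sigma D_1^T| - \frac{1}{2} \sum_{i=1}^{n} (x_{i1}^* - D_1 \mu)^T (D_1 \Sigma D_1^T)^{-1} (x_{i1}^* - D_1 \mu) \\
&= -\frac{n}{2} \log |D_1 \Sigma D_1^T| - \frac{1}{2} \mathrm{tr}\{(D_1 \Sigma D_1^T)^{-1} W_1\} \\
&\quad - \frac{n}{2} (\bar{x}_1^* - D_1 \mu)^T (D_1 \Sigma D_1^T)^{-1} (\bar{x}_1^* - D_1 \mu),
\end{aligned} \tag{A.2}$$
where $W_1 = \sum_{i=1}^{n} (x_{i1}^* - \bar{x}_1^*)(x_{i1}^* - \bar{x}_1^*)^T$. (See Anderson, 1974, p.45 equation (2)
and p.46 equation (9).) By (A.1), $W_1 = K W_2 K^T$, where $W_2$ is defined similarly
to $W_1$. As a result, (A.2) becomes
$$\begin{aligned}
& -\frac{n}{2} \log |K D_2 \Sigma D_2^T K^T| - \frac{1}{2} \mathrm{tr}\{(K D_2 \Sigma D_2^T K^T)^{-1} K W_2 K^T\} \\
&\quad - \frac{n}{2} (K \bar{x}_2^* - K D_2 \mu)^T (K D_2 \Sigma D_2^T K^T)^{-1} (K \bar{x}_2^* - K D_2 \mu) \\
&= -\frac{n}{2} \log |D_2 \Sigma D_2^T| - \frac{1}{2} \mathrm{tr}\{(D_2 \Sigma D_2^T)^{-1} W_2\} \\
&\quad - \frac{n}{2} (\bar{x}_2^* - D_2 \mu)^T (D_2 \Sigma D_2^T)^{-1} (\bar{x}_2^* - D_2 \mu) + c,
\end{aligned} \tag{A.3}$$
where $c$ is a constant. This shows that $L_{D_1}(\theta)$ and $L_{D_2}(\theta)$ differ by a constant.
■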
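
The invariance used in the proof can be checked numerically. The sketch below evaluates the three terms of (A.2) for a $2 \times 2$ case before and after multiplying by a nonsingular $K$; under the algebra above, the difference should equal $-n \log |\det K|$. All matrices and the helper names (`matmul`, `inv2`, `kernel`) are toy assumptions chosen for illustration.

```python
import math


def matmul(a, b):
    """Product of two 2 x 2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]


def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]


def kernel(s, w, d, n):
    """-(n/2) log|S| - (1/2) tr(S^{-1} W) - (n/2) d^T S^{-1} d, as in (A.2)."""
    s_inv = inv2(s)
    prod = matmul(s_inv, w)
    quad = sum(d[i] * s_inv[i][j] * d[j] for i in range(2) for j in range(2))
    return -0.5 * n * math.log(det2(s)) - 0.5 * (prod[0][0] + prod[1][1]) - 0.5 * n * quad


n = 5
S2 = [[2.0, 0.5], [0.5, 1.0]]   # plays the role of D_2 Sigma D_2^T
W2 = [[3.0, 1.0], [1.0, 2.0]]   # plays the role of W_2
d2 = [0.3, -0.2]                # plays the role of xbar*_2 - D_2 mu
K = [[1.0, 2.0], [0.0, 3.0]]    # nonsingular, det K = 3

S1 = matmul(matmul(K, S2), transpose(K))
W1 = matmul(matmul(K, W2), transpose(K))
d1 = [sum(K[i][j] * d2[j] for j in range(2)) for i in range(2)]

difference = kernel(S1, W1, d1, n) - kernel(S2, W2, d2, n)
expected = -n * math.log(abs(det2(K)))
```

The difference depends only on $K$ and $n$, not on $\mu$ or $\Sigma$, which is the sense in which the two likelihoods differ by a constant.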
A.2 Proof of Proposition 4
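Proposition 4
The displacement function $f(\omega)$ does not depend on the choice of $D$.
Proof
Let $\hat{\theta} = (\hat{\mu}^T, \hat{\sigma}^T)^T$ and $\hat{\theta}_{\omega} = (\hat{\mu}_{\omega}^T, \hat{\sigma}_{\omega}^T)^T$. Let $D_1$ and $D_2$ be two different
matrices satisfying $R(D_j) = R(A)$, $j = 1, 2$. Since $L_{D_1}(\theta)$ and $L_{D_2}(\theta)$ differ by
a constant, $\hat{\theta}$ is the same for different $D$. Similarly, since $L_{D_1}(\theta \mid \omega)$ and
$L_{D_2}(\theta \mid \omega)$ differ by a constant, $\hat{\theta}_{\omega}$ is the same for different $D$. Also, since
$\mathrm{rank}(D_1) = \mathrm{rank}(D_2) = r$, there exists a nonsingular $r \times r$ matrix $K$ such that
$D_1 = K D_2$. Let $x_{ij}^* = D_j x_i$ and $\bar{x}_j^* = \sum_{i=1}^{n} x_{ij}^* / n$, $i = 1, \ldots, n$ and $j = 1, 2$.
Similar to the proof of Proposition 3, we have
$$\begin{aligned}
f_{D_1}(\omega) &= L_{D_1}(\hat{\theta} \mid \omega_0) - L_{D_1}(\hat{\theta}_{\omega} \mid \omega_0) \\
&= -\frac{n}{2} \log |D_1 \hat{\Sigma} D_1^T| - \frac{1}{2} \mathrm{tr}\{(D_1 \hat{\Sigma} D_1^T)^{-1} W_1\}
 - \frac{n}{2} (\bar{x}_1^* - D_1 \hat{\mu})^T (D_1 \hat{\Sigma} D_1^T)^{-1} (\bar{x}_1^* - D_1 \hat{\mu}) \\
&\quad + \frac{n}{2} \log |D_1 \hat{\Sigma}_{\omega} D_1^T| + \frac{1}{2} \mathrm{tr}\{(D_1 \hat{\Sigma}_{\omega} D_1^T)^{-1} W_1\}
 + \frac{n}{2} (\bar{x}_1^* - D_1 \hat{\mu}_{\omega})^T (D_1 \hat{\Sigma}_{\omega} D_1^T)^{-1} (\bar{x}_1^* - D_1 \hat{\mu}_{\omega}) \\
&= -\frac{n}{2} \log |K D_2 \hat{\Sigma} D_2^T K^T| - \frac{1}{2} \mathrm{tr}\{(K D_2 \hat{\Sigma} D_2^T K^T)^{-1} K W_2 K^T\} \\
&\quad - \frac{n}{2} (K \bar{x}_2^* - K D_2 \hat{\mu})^T (K D_2 \hat{\Sigma} D_2^T K^T)^{-1} (K \bar{x}_2^* - K D_2 \hat{\mu}) \\
&\quad + \frac{n}{2} \log |K D_2 \hat{\Sigma}_{\omega} D_2^T K^T| + \frac{1}{2} \mathrm{tr}\{(K D_2 \hat{\Sigma}_{\omega} D_2^T K^T)^{-1} K W_2 K^T\} \\
&\quad + \frac{n}{2} (K \bar{x}_2^* - K D_2 \hat{\mu}_{\omega})^T (K D_2 \hat{\Sigma}_{\omega} D_2^T K^T)^{-1} (K \bar{x}_2^* - K D_2 \hat{\mu}_{\omega}) \\
&= -\frac{n}{2} \log |D_2 \hat{\Sigma} D_2^T| - \frac{1}{2} \mathrm{tr}\{(D_2 \hat{\Sigma} D_2^T)^{-1} W_2\}
 - \frac{n}{2} (\bar{x}_2^* - D_2 \hat{\mu})^T (D_2 \hat{\Sigma} D_2^T)^{-1} (\bar{x}_2^* - D_2 \hat{\mu}) \\
&\quad + \frac{n}{2} \log |D_2 \hat{\Sigma}_{\omega} D_2^T| + \frac{1}{2} \mathrm{tr}\{(D_2 \hat{\Sigma}_{\omega} D_2^T)^{-1} W_2\}
 + \frac{n}{2} (\bar{x}_2^* - D_2 \hat{\mu}_{\omega})^T (D_2 \hat{\Sigma}_{\omega} D_2^T)^{-1} (\bar{x}_2^* - D_2 \hat{\mu}_{\omega}) \\
&= f_{D_2}(\omega),
\end{aligned}$$
where $W_j = \sum_{i=1}^{n} (x_{ij}^* - \bar{x}_j^*)(x_{ij}^* - \bar{x}_j^*)^T$, $j = 1, 2$, and $W_1 = K W_2 K^T$. Therefore,
$f(\omega)$ does not depend on the choice of $D$. ■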
Appendix B
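Analytical Expressions of $(\ddot{L}_{\theta^*})^{-1}$,
$\Delta^*$ and $L_c$
B.1 Analytical Expression of $(\ddot{L}_{\theta^*})^{-1}$
Following Poon and Poon (2002, equation (9)), $-\ddot{L}_{\theta^*}$ is the observed information
for the postulated model and the ML estimates $\hat{\mu}^*$ of $\mu^*$ and $\hat{\sigma}^*$ of $\sigma^*$ are
statistically independent. Then, $(\ddot{L}_{\theta^*})^{-1}$ is an $r^* \times r^*$ block diagonal matrix given
by
$$(\ddot{L}_{\theta^*})^{-1} = \begin{pmatrix} -\mathrm{cov}(\hat{\mu}^*) & 0 \\ 0 & -\mathrm{cov}(\hat{\sigma}^*) \end{pmatrix}
= \begin{pmatrix} \{(\ddot{L}_{\theta^*}^{-1})_{ab}\} & 0 \\ 0 & \{(\ddot{L}_{\theta^*}^{-1})_{(ab)(cd)}\} \end{pmatrix},$$
where $\mathrm{cov}(\hat{\mu}^*)$ is an $r \times r$ matrix storing the covariance matrix of $\hat{\mu}^*$ and $\mathrm{cov}(\hat{\sigma}^*)$ is
an $(r^* - r) \times (r^* - r)$ matrix storing the covariance matrix of $\hat{\sigma}^*$. Following Poon
and Poon (2002, equations (10) and (11)), the elements of $\mathrm{cov}(\hat{\mu}^*)$ and $\mathrm{cov}(\hat{\sigma}^*)$
are respectively given by
$$(\ddot{L}_{\theta^*}^{-1})_{ab} = -\mathrm{cov}(\hat{\mu}_a^*, \hat{\mu}_b^*) = -\frac{1}{n} \sigma_{ab}^*$$
and
$$(\ddot{L}_{\theta^*}^{-1})_{(ab)(cd)} = -\mathrm{cov}(\hat{\sigma}_{ab}^*, \hat{\sigma}_{cd}^*) = -\frac{1}{n} (\sigma_{ac}^* \sigma_{bd}^* + \sigma_{ad}^* \sigma_{bc}^*).$$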
B.2 Analytical Expression of $\Delta^*$
Let
$e_a$ be an $r \times 1$ vector with 1 at its $a$-th slot and zeros elsewhere,
$z_i^* = x_i^* - \mu^*$ be an $r \times 1$ vector with $z_{ib}^*$ at its $b$-th slot, and
$\sigma^{*ab}$ be the $(a, b)$-th element of $\Sigma^{*-1}$.
Then, following equations (15) and (17) in Poon and Poon (2002), $\Delta^*$ is an
$r^* \times n$ matrix with the $i$-th column consisting of the elements
$$\Delta_{a,i}^* = e_a^T \Sigma^{*-1} z_i^*$$
and
where they correspond to $\mu_a^*$ in $\mu^*$ and $\sigma_{ab}^*$ in $\Sigma^*$ respectively.
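B.3 Analytical Expression of $L_c$
The matrix $L_c$ is a $p^* \times p^*$ block diagonal matrix given by
$$L_c = \begin{pmatrix} -\mathrm{cov}(\hat{\mu}) & 0 \\ 0 & -\mathrm{cov}_c(\hat{\sigma}) \end{pmatrix}
= \begin{pmatrix} \{(L_c)_{ab}\} & 0 \\ 0 & \{(L_c)_{(\alpha\beta)(\gamma\rho)}\} \end{pmatrix},$$
where $\mathrm{cov}(\hat{\mu})$ is a $p \times p$ matrix storing the covariance matrix of $\hat{\mu}$, and $\mathrm{cov}_c(\hat{\sigma})$ is
a $(p^* - p) \times (p^* - p)$ matrix storing the covariance matrix of $\hat{\sigma}$. By Poon and
Poon (2002, equations (10) and (11)), the elements of $\mathrm{cov}(\hat{\mu})$ and $\mathrm{cov}_c(\hat{\sigma})$ are
respectively given by
$$(L_c)_{ab} = -\mathrm{cov}(\hat{\mu}_a, \hat{\mu}_b) = -\frac{1}{n} \sigma_{ab},$$
and
$$(L_c)_{(\alpha\beta)(\gamma\rho)} = -\mathrm{cov}_c(\hat{\sigma}_{\alpha\beta}, \hat{\sigma}_{\gamma\rho}) = -\frac{1}{n} (\sigma_{\alpha\gamma} \sigma_{\beta\rho} + \sigma_{\alpha\rho} \sigma_{\beta\gamma}).$$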
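
As a small worked illustration of the covariance block of $L_c$, the sketch below fills in the elements $-(\sigma_{\alpha\gamma}\sigma_{\beta\rho} + \sigma_{\alpha\rho}\sigma_{\beta\gamma})/n$ for the lower-triangular elements of a toy $2 \times 2$ matrix, mirroring the loop that builds `infex` in Appendix D. The function name `crude_cov_block` and the toy inputs are assumptions.

```python
from fractions import Fraction as F


def crude_cov_block(sigma, n):
    """Covariance block of L_c, indexed by lower-triangular pairs (11, 21, 22, ...)."""
    p = len(sigma)
    pairs = [(a, b) for a in range(p) for b in range(a + 1)]
    return [
        [
            -(sigma[a][g] * sigma[b][r] + sigma[a][r] * sigma[b][g]) / F(n)
            for (g, r) in pairs
        ]
        for (a, b) in pairs
    ]


# Toy covariance matrix and sample size (illustrative only).
Sigma_toy = [[F(2), F(1)], [F(1), F(3)]]
block = crude_cov_block(Sigma_toy, 10)
```

For this toy input the diagonal entries are $-4/5$, $-7/10$ and $-9/5$, and the block is symmetric.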
Appendix C
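Calculation of $\partial \theta^{*T} / \partial \theta$
Let
$$\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & \ddots & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix}
\quad \text{and} \quad
D = \begin{pmatrix} d_{11} & \cdots & d_{1p} \\ \vdots & \ddots & \vdots \\ d_{r1} & \cdots & d_{rp} \end{pmatrix}. \tag{C.1}$$
Then,
$$\mu^* = D\mu = \begin{pmatrix} \sum_{i=1}^{p} d_{1i} \mu_i \\ \vdots \\ \sum_{i=1}^{p} d_{ri} \mu_i \end{pmatrix} \tag{C.2}$$
and
$$\Sigma^* = D \Sigma D^T = \begin{pmatrix}
\sum_{i=1}^{p} \sum_{j=1}^{p} d_{1i} d_{1j} \sigma_{ij} & \cdots & \sum_{i=1}^{p} \sum_{j=1}^{p} d_{1i} d_{rj} \sigma_{ij} \\
\vdots & \ddots & \vdots \\
\sum_{i=1}^{p} \sum_{j=1}^{p} d_{ri} d_{1j} \sigma_{ij} & \cdots & \sum_{i=1}^{p} \sum_{j=1}^{p} d_{ri} d_{rj} \sigma_{ij}
\end{pmatrix}. \tag{C.3}$$
Let
$\sigma$ be a $(p^* - p) \times 1$ vector storing the lower triangular elements of $\Sigma$, where
$p^* = p + p(p+1)/2$, and
$\sigma^*$ be an $(r^* - r) \times 1$ vector storing the lower triangular elements of $\Sigma^*$, where
$r^* = r + r(r+1)/2$,
i.e.
$$\sigma = \begin{pmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{22} \\ \vdots \\ \sigma_{p1} \\ \vdots \\ \sigma_{pp} \end{pmatrix}
\quad \text{and} \quad
\sigma^* = \begin{pmatrix}
\sum_{i=1}^{p} \sum_{j=1}^{p} d_{1i} d_{1j} \sigma_{ij} \\
\sum_{i=1}^{p} \sum_{j=1}^{p} d_{2i} d_{1j} \sigma_{ij} \\
\sum_{i=1}^{p} \sum_{j=1}^{p} d_{2i} d_{2j} \sigma_{ij} \\
\vdots \\
\sum_{i=1}^{p} \sum_{j=1}^{p} d_{ri} d_{rj} \sigma_{ij}
\end{pmatrix}. \tag{C.4}$$
Differentiating $\mu^*$ with respect to $\mu$, for $r < p$:
$$\frac{\partial \mu^{*T}}{\partial \mu} = \begin{pmatrix}
\frac{\partial}{\partial \mu_1} \left( \sum_{i=1}^{p} d_{1i} \mu_i \right) & \cdots & \frac{\partial}{\partial \mu_1} \left( \sum_{i=1}^{p} d_{ri} \mu_i \right) \\
\vdots & \ddots & \vdots \\
\frac{\partial}{\partial \mu_p} \left( \sum_{i=1}^{p} d_{1i} \mu_i \right) & \cdots & \frac{\partial}{\partial \mu_p} \left( \sum_{i=1}^{p} d_{ri} \mu_i \right)
\end{pmatrix} = D^T. \tag{C.5}$$
Differentiating $\sigma^*$ with respect to $\sigma$, for $r < p$:
$$\left( \frac{\partial \sigma^{*T}}{\partial \sigma} \right)_{(t,o)} = \frac{\partial \sigma_{kl}^*}{\partial \sigma_{ij}} =
\begin{cases}
d_{ki} d_{lj} & \text{for } i = j \text{ and } k \geq l, \\
d_{ki} d_{lj} + d_{kj} d_{li} & \text{for } i > j \text{ and } k \geq l,
\end{cases} \tag{C.6}$$
where $i, j = 1, \ldots, p$; $k, l = 1, \ldots, r$; $t = i(i-1)/2 + j$ and $o = k(k-1)/2 + l$.
Collecting these elements gives the $(p^* - p) \times (r^* - r)$ matrix
$$\frac{\partial \sigma^{*T}}{\partial \sigma} = \left\{ \frac{\partial \sigma_{kl}^*}{\partial \sigma_{ij}} \right\}. \tag{C.7}$$
So,
$$\frac{\partial \theta^{*T}}{\partial \theta} = \begin{pmatrix} D^T & 0 \\ 0 & \partial \sigma^{*T} / \partial \sigma \end{pmatrix} = R^T, \tag{C.8}$$
where $R$ is the $r^* \times p^*$ constant matrix given in (3.10) or (4.9).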
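
The element formula in (C.6) can be checked against a direct computation of $D \Sigma D^T$. The sketch below builds the derivative matrix for the $D$ of (4.40) and confirms that it maps the stacked lower-triangular elements of a toy $\Sigma$ to those of $\Sigma^*$. The helper names `vech` and `dsigma_star_dsigma` and the toy $\Sigma$ are assumptions for illustration.

```python
from fractions import Fraction as F


def vech(m):
    """Stack the lower-triangular elements row by row: m11, m21, m22, m31, ..."""
    return [m[i][j] for i in range(len(m)) for j in range(i + 1)]


def dsigma_star_dsigma(d):
    """Matrix of (C.6): rows indexed by sigma_ij (i >= j), columns by sigma*_kl (k >= l)."""
    r, p = len(d), len(d[0])
    rows = []
    for i in range(p):
        for j in range(i + 1):
            row = []
            for k in range(r):
                for l in range(k + 1):
                    if i == j:
                        row.append(d[k][i] * d[l][j])
                    else:
                        row.append(d[k][i] * d[l][j] + d[k][j] * d[l][i])
            rows.append(row)
    return rows


# D from (4.40) and a toy symmetric Sigma (the latter is illustrative only).
D = [
    [F(3, 4), F(-1, 4), F(-1, 4), F(-1, 4)],
    [F(-1, 4), F(3, 4), F(-1, 4), F(-1, 4)],
    [F(-1, 4), F(-1, 4), F(-1, 4), F(3, 4)],
]
Sigma = [[4, 1, 0, 2], [1, 3, 1, 0], [0, 1, 5, 1], [2, 0, 1, 6]]
r, p = len(D), len(D[0])

# Sigma* = D Sigma D^T computed directly from (C.3).
Sigma_star = [
    [sum(D[k][i] * Sigma[i][j] * D[l][j] for i in range(p) for j in range(p)) for l in range(r)]
    for k in range(r)
]

E = dsigma_star_dsigma(D)
sigma = vech(Sigma)
# sigma* obtained through the derivative matrix: sigma*_o = sum_t E[t][o] * sigma_t.
sigma_star_via_E = [sum(E[t][o] * sigma[t] for t in range(len(sigma))) for o in range(len(E[0]))]
```

Because $\sigma^*$ is linear in $\sigma$, the derivative matrix reproduces $\sigma^*$ exactly, which is why $R$ in (C.8) is a constant matrix.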
Appendix D
Matlab Commands of Example 1
% RAW DATA
X = [ 0.74 0.19 0.03 0.04     % X : AID with row sum c (Dim: p,n)
      0.58 0.29 0.01 0.12
      0.58 0.19 0.22 0.01
      0.61 0.28 0.08 0.03
      0.82 0.13 0.02 0.03
      0.48 0.38 0    0.13
      0.59 0.38 0    0.03
      0.76 0.12 0.09 0.03
      0.81 0.12 0.04 0.03
      0.68 0.23 0.05 0.04
      0.72 0.2  0.04 0.04
      0.62 0.27 0.09 0.02
      0.45 0.25 0.29 0.01
      0.66 0.25 0.06 0.03
      0.85 0.13 0.01 0.01
      0.75 0.09 0.15 0.01
      0.69 0.25 0    0.06
      0.76 0.1  0.11 0.03
      0.66 0.29 0.01 0.04
      0.66 0.24 0.06 0.04
      0.5  0.46 0    0.04
      0.65 0.25 0.05 0.05
      0.60 0.35 0.02 0.03
      0.4  0.27 0.01 0.32
      0.6  0.1  0.3  0
      0.6  0.1  0.29 0.01
      0.59 0.39 0.01 0.01
      0.58 0.39 0.01 0.02
      0.61 0.34 0.02 0.03
      0.39 0.49 0.12 0    ]';

% DIMENSION
n = 30;                 % n : number of observations
p = 4;                  % p : number of variables
r = p-1;                % r : number of variables of x* : [z]
q = p+(p*(p+1)/2);      % q : dimension of theta
s = r+(r*(r+1)/2);      % s : dimension of theta*

% TRANSFORM X TO BE WITH A ZERO CONSTRAINT
for i = 1:n
    c(i) = sum(X(:,i));         % c : the sum of each observation
end
x = X - repmat(c/p,p,1);        % x : ipsative data with a constant sum ZERO (Dim: p,n)
mx = (mean(x'))';               % mx : mean(x) (Dim: p,1)
ex = cov(x');                   % ex : covariance matrix of x (Dim: p,p)

% THE INFORMATION MATRIX OF x (Dim: q,q)
infmx = -ex/n;                  % infmx : mean part of inf. matrix of x (Dim: p,p)
for i = 1:p                     % infex : covariance part of inf. matrix of x (Dim: q-p,q-p)
    for j = 1:i                 % (i=alpha, j=beta, k=gamma, l=rho)
        for k = 1:p
            for l = 1:k
                t = ((i-1)*i)/2 + j;
                o = ((k-1)*k)/2 + l;
                infex(t,o) = -((ex(i,k)*ex(j,l))+(ex(i,l)*ex(j,k)))/n;
            end
        end
    end
end
infx = [infmx,zeros(p,q-p);zeros(q-p,p),infex];                % inf. matrix for Bj
infxm = [infmx,zeros(p,q-p);zeros(q-p,p),zeros(q-p,q-p)];      % inf. matrix for Bj (mu)
infxe = [zeros(p,p),zeros(p,q-p);zeros(q-p,p),infex];          % inf. matrix for Bj (sigma)

% TRANSFORMATION d (Dim: r,p)
d3 = [3/4 -1/4 -1/4 -1/4 ; -1/4 3/4 -1/4 -1/4 ; -1/4 -1/4 -1/4 3/4];   % Delete 3rd row
d = d3;

% TRANSFORM DATA TO NONIPSATIVE
z = d*x;                        % z : non-ipsative data set x* (Dim: r,n)
mz = (mean(z'))';               % mz : mean vector of z (Dim: r,1)
ez = cov(z');                   % ez : covariance matrix of z (Dim: r,r)
inez = inv(ez);                 % inez : inverse of cov(z) (Dim: r,r)
y = z-repmat(mz,1,n);           % y : z - mean(z) (Dim: r,n)

% INFORMATION MATRIX OF z (Dim: s,s)
infmz = -ez/n;                  % infmz : mean part of inf. matrix of z (Dim: r,r)
for i = 1:r                     % infez : covariance part of inf.* matrix of z (Dim: s-r,s-r)
    for j = 1:i                 % (i=alpha, j=beta, k=gamma, l=rho)
        for k = 1:r
            for l = 1:k
                t = ((i-1)*i)/2 + j;
                o = ((k-1)*k)/2 + l;
                infez(t,o) = -((ez(i,k)*ez(j,l))+(ez(i,l)*ez(j,k)))/n;
            end
        end
    end
end
infz = [infmz,zeros(r,s-r);zeros(s-r,r),infez];                % inf. matrix* for Bj
infzm = [infmz,zeros(r,s-r);zeros(s-r,r),zeros(s-r,s-r)];      % inf. matrix* for Bj (mu)
infze = [zeros(r,r),zeros(r,s-r);zeros(s-r,r),infez];          % inf. matrix* for Bj (sigma)

% THE TRIANGULAR MATRIX OF Z : [triz] (Dim: s,n)
for i = 1:n                     % trimz : mean part of triangular* matrix (Dim: r,n)
    trimz(:,i) = inez*y(:,i);   % triez : covariance part of triangular* matrix (Dim: s-r,n)
end
for i = 1:n
    for k = 1:r
        for t = 1:k
            % (the statements assigning the elements of triez are missing here)
        end
    end
end
triz = [trimz;triez];

% CALCULATE THE DIFF. OF THETA* RESPECT TO THETA : [dtt] (Dim: q,s)
dmm = d';                       % dmm : diff. mu* respect to mu (Dim: p,r)
for i = 1:p                     % dee : diff. vec(sigma*) respect to vec(sigma) (Dim: q-p,s-r)
    for j = 1:i
        for k = 1:r
            for l = 1:k
                t = ((i-1)*i)/2 + j;
                o = ((k-1)*k)/2 + l;
                if i == j
                    dee(t,o) = d(k,i)*d(l,j);
                else
                    dee(t,o) = (d(k,i)*d(l,j)) + (d(k,j)*d(l,i));
                end
            end
        end
    end
end
dtt = [dmm,zeros(p,s-r);zeros(q-p,r),dee];

% OVERALL B : [B]
Ldd = dtt*inv(infz)*dtt';
H = -triz'*dtt'*infx*Ldd*infx*dtt*triz;
B = diag(H/(trace(H*H))^(1/2))

% B RESPECT TO MEAN : [Bm]
Lddmtemp = Ldd(1:p,1:p);
Lddm = [Lddmtemp,zeros(p,q-p);zeros(q-p,p),zeros(q-p,q-p)];
Hm = -triz'*dtt'*infxm*Lddm*infxm*dtt*triz;
Bm = diag(Hm/(trace(Hm*Hm))^(1/2))

% B RESPECT TO VAR : [Be]
Lddetemp = Ldd((p+1):q,(p+1):q);
Ldde = [zeros(p,p),zeros(p,q-p);zeros(q-p,p),Lddetemp];
He = -triz'*dtt'*infxe*Ldde*infxe*dtt*triz;
Be = diag(He/(trace(He*He))^(1/2))
Bibliography
Aitchison, J. (1986). The statistical analysis of compositional data. London:
Chapman and Hall.
Anderson, T.W. (1974). An introduction to multivariate statistical analysis (2nd
ed.). New York: Wiley.
Cattell, R.B. (1944). Psychological measurement: Ipsative, normative and
interactive. Psychological Review, 51, 292-303.
Chan, W. and Bentler, P.M. (1993). The covariance structure analysis of ipsative
data. Sociological Methods & Research, 22, 214-247.
Chan, W. and Bentler, P.M. (1996). Covariance structure analysis of partially
additive ipsative data using restricted maximum likelihood estimation.
Multivariate Behavioral Research, 31(3), 289-312.
Cook, R.D. (1977). Detection of influential observations in linear regression.
Technometrics, 19, 15-18.
Cook, R.D. (1986). Assessment of local influence (with discussion). Journal of
the Royal Statistical Society, B, 48, 133-169.
Hicks, L.E. (1970). Some properties of ipsative, normative and forced-choice
normative measures. Psychological Bulletin, 74(3), 167-184.
Lawrance, A.J. (1995). Deletion influence and masking in regression. Journal
of the Royal Statistical Society, B, 57, 181-189.
Poon, W.Y. and Poon, Y.S. (1999). Conformal normal curvature and assessment
of local influence. Journal of the Royal Statistical Society, B, 61, 51-61.
Poon, W.Y. and Poon, Y.S. (2001). Conditional local influence in case-weights
linear regression. British Journal of Mathematical and Statistical
Psychology, 54, 177-191.
Poon, W.Y. and Poon, Y.S. (2002). Influential observations in the estimation of
mean vector and covariance matrix. British Journal of Mathematical and
Statistical Psychology, 55, 177-192.