Chain graph models of multivariate regression type for ...e-mail: [email protected]...

arX

iv:0

906.

2098

v1 [

stat

.ME

] 1

1 Ju

n 20

09

Chain graph models of multivariate

regression type for categorical data

Giovanni M. Marchetti

Dipartimento di Statistica “G. Parenti”, University of Florence

Viale Morgagni 59, 50134 Firenze, Italy

e-mail: [email protected]

Monia Lupparelli

Dipartimento di Scienze Statistiche “P. Fortunati”, University of Bologna

Via Belle Arti 41, 40126 Bologna, Italy

e-mail: [email protected]

Abstract: We discuss a class of graphical models for discrete data defined

by what we call a multivariate regression chain graph Markov property. We

propose a parameterization based on a sequence of generalized linear mod-

els with a multivariate logistic link function. We show the relationship with

a chain graph model recently defined in the literature, and we prove that

the proposed parametrization for the joint distribution defines a complete

and hierarchical marginal log-linear model.

Keywords and phrases:Graphical Markovmodels, block-recursiveMarkov

property, discrete chain graph models of type IV, multivariate logistic re-

gression models, marginal log-linear models.

1. Introduction

Discrete graphical Markov models are models for discrete distributions repre-

sentable by graphs, associating nodes with the variables and using rules that

translate properties of the graph into conditional independence statements be-

tween variables. There are several classes of graphical models; see [21] for a re-

view. In this paper we focus on the class of multivariate regression chain graphs

and we discuss their definition and parameterization for discrete variables.

The simplest case of multivariate regression chain graph is called covariance

graph and it is an undirected graph representing marginal independencies: the

basic requirement is that if two sets of nodes are disconnected, then the asso-

ciated variables are jointly independent. On the other hand, the edges describe

the marginal associations between variables, all treated on equal standing and

drawn as dashed lines, [6], or as bi-directed arrows, [18]. The general definition

1

http://arxiv.org/abs/0906.2098v1

mailto:[email protected]

mailto:[email protected]

G. M. Marchetti and M. Lupparelli/Chain graph models of multivariate regression type 2

of covariance graph models for discrete variables is discussed in [9] and [15].

Multivariate regression chain graphs are a generalization of covariance graphs

in which the variables can be arranged in a sequence of groups, called chain

components, ordered on the basis of subject-matter considerations, and all the

variables within a group are considered to be on an equal standing as responses

potentially depending on all the variables in all previous groups. They were

proposed first in [5], with several examples discussed in [6, Chapter 5].

In this chain graph the edges are undirected within the chain components,

and directed between components, all pointing in the same direction, i.e., with

no chance of semi-directed cycles. The interpretation of the undirected graphs

within a component is that of a covariance graph, but conditional on all variables

in preceding components. For example, the missing edge (1, 3) in the graph of

Figure 1(b) is interpreted as the independence statement X1 ⊥⊥ X3|X4, X5, com-

pactly written in terms of sets as 1⊥⊥ 3|4,5. On the other hand, the interpretation

of the directed edges is that of multivariate regression models, with a missing

edge denoting a conditional independence of the response on an explanatory

variable given all the remaining potential predictors. Thus, in the chain graph

of Figure 1(a) the missing arrow (1, 4) indicates the independence statement

1⊥⊥ 4|3.

Other types of chain graph models can be defined depending on the definition

of the conditioning sets involved, as was pointed out in [5]. Recently, [7] distin-

guished four fundamental types comprising the classical (Lauritzen, Wermuth,

Frydenberg, [14], [11]) and the alternative (Andersson, Madigan, Perlman, [1])

chain graph models, called of type I and II, respectively.

In this paper we give a formal definition of multivariate regression chain graph

models and we prove that they are equivalent to the chain graph models of type

IV, in Drton’s classification [7]. Moreover, we provide a parameterization based

on recursive multivariate logistic regression models. These models, introduced

in [17, Section 6.5.4] and [12], have parameters with a statistical interpretation

and can be used to define all the independence constraints. The models can be

defined by an intuitive rule, see Theorem 2, based on the structure of the chain

graph, that can be translated into a sequence of explicit regression models.

Finally, we extend the general rule connecting logit to log-linear models (see

[10, Chapter 6]), by proving that the multivariate logistic regression parameters

are identical to log-linear parameters defined in selected marginal distributions.

This is used to show in Proposition 5 that the full set of regression coefficients

defines a well-behaved marginal log-linear parameterization, called hierarchical


and complete; see [3].

The given results show, either using properties of multivariate logistic models

or of marginal log-linear models that any discrete multivariate regression chain

graph model is a curved exponential family.

All the proofs are deferred to Appendix A, for the equivalence of the two

chain graph Markov properties, and to Appendix B, for the relation between

multivariate logistic and marginal log-linear parameters.

2. Multivariate regression chain graphs

2.1. Chain graphs

The basic definitions and notation used in this paper closely follow [7], and

they are briefly recalled below. Formally, a chain graph G = (V,E) is a graph

with finite node set V = {1, . . . , d} and an edge set E that may contain either

directed edges or undirected edges and with no semi-directed cycle, i.e. no path

from a node to itself with at least one directed edge such that all directed edges

have the same orientation. The node set V of a chain graph can be partitioned

(a) (b) (c)

Fig 1. Three chain graphs with chain components (a): T = {{1, 2}, {3, 4}}; (b): T ={{1, 2, 3}, {4, 5}}; (c): T = {{1, 2, 3, 4}, {5, 6}, {7}}. Dashed lines could only occur withinchain components.

into disjoint connected subsets T called chain components, such that all edges

in each subgraph GT are undirected and the edges between different subsets

T1 6= T2 are directed, pointing in the same direction. The collection of chain

components will be denoted by T , with V =⋃

T∈T T .

The subgraphs GT within each chain component are by definition undirected

and for chain graphs with the multivariate regression interpretation, their edges

are drawn as dashed or bi-directed ←→. The former convention is adopted

in the three chain graphs of Figure 1 having two and three chain components.

In the undirected subgraphs GT , a nonempty subset A ⊆ T of nodes is said

to be disconnected if it contains at least two nodes that are not linked by any

path in GT . In this case, A can be partitioned uniquely into a set of r > 1


(a) (b)

Fig 2. A chain graph (a) and one possible compatible order (b) of the four chain components:{1, 2} ≺ {3, 4} ≺ {5, 6} ≺ {7, 8}. In (b) the set of predecessors of T = {1, 2} is pre(T ) ={3, 4, 5, 6, 7, 8}, while the set of parent components of T is pa

D(T ) = {3, 4, 5, 6}.

connected components A1, . . . , Ar. For example, in chain graph (c) of Figure 1,

the set {1, 2, 4} is disconnected because the induced subgraph, 1 2 4, has

two connected components A1 = {1, 2} and A2 = {4}.

Any chain graph yields a directed acyclic graph D of its chain components

having T as node set and an edge T1 ≻T2 whenever there exists in the chain

graph G at least one edge v ≻w connecting a node v in T1 with node w in T2.

In this directed graph, we may define for each T the set paD(T ) as the union of

all the chain components that are parents of T in the directed graph D. This

concept is distinct from the usual notion of the parents paG(A) of a set of nodes

A in the chain graph, that is the set of all the nodes w outside A such that

w ≻v with v ∈ A. For instance, in the graph of Figure 2(a), for T = {1, 2},

the set of parent components is paD(T ) = {3, 4, 5, 6}, whereas the set of parents

of T is paG(T ) = {3, 6}.

In this paper we start the analysis from a given chain graph G = (V,E) with

an associated collection T of chain components. However in applied work, where

variables are linked to nodes by the correspondence Xv for v ∈ V , usually a set

of chain components is assumed known from previous studies of substantive the-

ories or from the temporal ordering of the variables. For variables within such

chain components no direction of influence is specified and they are considered

as joint responses, i.e. to be on equal standing. The relations between variables

in different chain components are directional and are typically based on a pre-

liminary distinction of responses, intermediate responses and purely explanatory

factors. Often, in this approach, a full ordering of the assumed components is

defined based on time order or on a subject-matter working hypothesis; see [6].

A different possibility would be that of stacking chain components that are not


implied to be ordered by the graph.

In the rest of the paper we shall follow the first approach. Given a chain graph

G with chain components (T |T ∈ T ), we can always define a strict total order ≺

of the chain components that is compatible with the partial order induced by the

chain graph. For instance, in the chain graph of Figure 2(a) there are four chain

components ordered in graph (b) as {1, 2} ≺ {3, 4} ≺ {5, 6} ≺ {7, 8}. Note that

the chosen total order of the chain components is in general not unique and that

another compatible ordering could be {1, 2} ≺ {5, 6} ≺ {3, 4} ≺ {7, 8}.

Then, for each T , the set af all components preceding T is known and we

may define the cumulative set pre(T ) =⋃

T ′≺T T ′ of nodes contained in the

precedecessors of component T . The set pre(T ) captures the notion of all the

potential explanatory variables of the response variables within T . By definition,

as the full ordering of the components is compatible with G, the set of predeces-

sors pre(T ) of each chain component T always includes the parent components

paD(T ).

2.2. Multivariate regression Markov property

Given a chain graph G = (V,E) with a compatible total order ≺ of its chain

components, and given a random vector X, its joint probability distribution P

is said to obey the multivariate regression Markov property for G if it satisfies

the following conditions.

a) For all chain component T and for all connected subsets A of T the condi-

tional distribution of variables in A given the variables in the predecessors

of T depends only on the parents of A:

pA|pre(T ) = pA|paG(A). (1a)

b) For all chain component T and for all disconnected subsets A of T , the

variables in the connected components A1, . . . , Ar, of A are jointly inde-

pendent given the variables in the predecessors of T :

pA|pre(T ) = pA1|pre(T ) × · · · × pAr| pre(T ). (1b)

The stated multivariate Markov property defines a chain graph model in the

variables X, formally defined as follows by translating the previous conditions

a) and b) into independence statements.

Definition 1. Let G be a chain graph with chain components (T |T ∈ T ) and

let pre(T ) define a compatible full ordering of the chain components. A joint


distribution P of the random vector X obeys the multivariate regression Markov

property for G if it satisfies the following independencies. For all T ∈ T and

for all A ⊆ T :

(mr1) if A is connected: A⊥⊥ [pre(T ) \ paG(A)] | paG(A),

(mr2) if A is disconnected with connected components A1, . . . , Ar :

A1 ⊥⊥ · · · ⊥⊥ Ar | pre(T ).

Condition (mr2) implies a stronger requirement than a set of pairwise in-

dependencies associated to each missing undirected edge. For example, in this

model, two pairwise independencies 1⊥⊥ 2 and 1⊥⊥ 3 can occur only in combina-

tion with the joint independence 1⊥⊥ 2,3.

Note that this definition depends on the chosen full ordering of the chain com-

ponents. However, the multivariate regression Markov property is nonetheless

equivalent to the block-recursive Markov property of type IV ([7]; see Defini-

tion 2 in the Appendix A) that, in turn depends only on the underlying chain

graph G. We have, in fact, the following theorem proved in Appendix A.

Theorem 1. Given a chain graph G, the multivariate regression Markov prop-

erty is equivalent to the block-recursive Markov property of type IV (see Defini-

tion 2 in Appendix A.)

One consequence of this equivalence is that, when the joint probability density

p(x) is strictly positive, then it factorizes according to the directed acyclic graph

of the chain components:

p(x) =∏

T∈T

p(xT |xpaD(T )). (2)

This standard property is shared by all chain graph Markov properties; see [21]

and [7].

Example 1. The independencies implied by the multivariate regression chain

graph Markov property are illustrated below for each of the graphs of Figure 1.

Graph (a) represents the independencies of the seemingly unrelated regression

model; see [5], [8]. For T = {1, 2} and pre(T ) = {3, 4} we have the independen-

cies 1⊥⊥ 4|3 and 2⊥⊥ 3|4. Note that for the connected set A = {1, 2} the condition

(mr1) implies the trivial statement A⊥⊥ ∅| pre(T ). The remaining component is

complete with no predecessors, and thus implies no independencies.

In graph (b) let T = {1, 2, 3} and pre(T ) = {4, 5}. Thus, for each connected

subset A ⊆ T , by (mr1), we have the non trivial statements

1⊥⊥ 5 | 4; 2⊥⊥ 4,5; 3⊥⊥ 4 | 5; 1,2⊥⊥ 5 | 4; 2,3⊥⊥ 4 | 5.


Then, for the remaining disconnected set A = {1, 3} we obtain by (mr2) the

independence 1⊥⊥ 3 | 4,5.

In graph (c), considering the conditional distribution pT |pre(T ) for T = {1, 2, 3, 4}

and pre(T ) = {5, 6, 7} we can define independencies for each of the eight con-

nected subsets of T . For instance we have

1⊥⊥ 5,6,7; 1,2⊥⊥ 6,7 | 5; 1,2,3,4⊥⊥ 7 | 5,6.

Notice that the last independence is equivalent to the factorization

p = p1234|56 · p56|7 · p7

of the joint probability distribution according to the directed acyclic graph of

the chain components. The remaining five disconnected subsets of T imply the

conditional independencies 1,2⊥⊥ 4|5,6,7 and 1⊥⊥ 3,4 | 5,6,7.

Remark 1. When each component T induces a complete subgraph GT and,

for all subset A in T , the set of parents of A, paG(A), coincides with the set of

the parent components of T , paD(T ), then the only conditional independence

implied by the multivariate regression Markov property is

A⊥⊥ [pre(T ) \ paD(T )]| paD(T ), for all A ⊆ T, T ∈ T .

This condition is in turn equivalent just to the factorization (2) of the joint

probability distribution.

Remark 2. In Definition 1, (mr2) is equivalent to imposing that for all T the

conditional distribution pT |pre(T ), satisfies the independencies of a covariance

graph model with respect to the subgraph GT .

In [15, Proposition 3], it is shown that a covariance graph model is defined by

constraining to zero, in the multivariate logistic parameterization, the param-

eters corresponding to all disconnected subsets of the graph. In the following

subsection we extend this approach to the multivariate regression chain graph

models.

3. Recursive multivariate logistic regression models

3.1. Notation

Let X = (Xv|v ∈ V ) be a discrete random vector, where each variable Xv has a

finite number rv of levels. Thus X takes values in the set I =∏

v∈V {1, . . . , rv}

whose elements are the cells of the joint contingency table, denoted by i =


(i1, . . . , id). The first level of each variable is considered a reference level and we

consider also the set I⋆ =∏

v∈V {2, . . . , rv} of cells having all indices different

from the first. The elements of I⋆ are denoted by i⋆.

The joint probability distribution of X is defined by the mass function

p(i) = P (Xv = iv, v = 1, . . . , d) for all i ∈ I,

or equivalently by the probability vector p = (p(i), i ∈ I). With three variables

we shall use often pijk instead of p(i1, i2, i3).

Given two disjoint subsets A and B of V , we define the marginal probability

distribution of XB by

p(iB) =∑

jB=iB

p(j)

where iB is a subvector of i belonging to the marginal contingency table IB =∏

v∈B{1, . . . , rv}. In the same way we define the set I⋆B =∏

v∈B{2, . . . , rv}.

The conditional probability distributions are defined as usual and denoted by

p(iA|iB), for iA ∈ IA and iB ∈ IB or, compactly, by pA|B.

A discrete multivariate regression chain graph model PMR(G) associated with

the chain graph G = (V,E) is the set of strictly positive joint probability

distributions p(i), for i ∈ I that obeys the multivariate regression Markov

property. By Theorem 1 this class coincides with the set PIV(G) of discrete

chain graph models of type IV.

In the next subsection we define an appropriate parameterization for each

component of the standard factorization

p(i) =∏

T∈T

p(iT | ipre(T )) (3)

of the joint probability distribution. Actually we define a saturated linear model

for a suitable transformation of the parameters of each conditional probability

distribution p(iT | ipre(T )).

3.2. Multivariate logistic contrasts

The suggested link function is the multivariate logistic transformation; see [17,

p. 219] and [12]. This link transforms the joint probability vector of the responses

into a vector of logistic contrasts defined for all the marginal distributions.

The contrasts of interest are all sets of univariate, bivariate, and higher-order

contrasts. In general, a multivariate logistic contrast for a marginal table pA is

defined by the function

η(A)(i⋆A) =∑

s⊆A

(−1)|A\s| log p(i⋆s, 1A\s), for i⋆ ∈ I⋆A, (4)


where the notation |A \ s| denotes the cardinality of set A \ s. The contrasts

for a margin A are denoted by η(A) and the full vector of the contrasts for all

nonempty margins A ⊆ V are denoted by η. The following example illustrates

the transformation for two responses.

Example 2. Let pij, for i = 1, 2, j = 1, 2, 3 be a joint bivariate distribution

for two discrete variables X1 and X2. Then, the multivariate logistic transform

changes the vector p of probabilities, belonging to the 5-dimensional simplex,

into the 5× 1 vector

η =

η(1)

η(2)

η(12)

where η(1) = logp2+

p1+, η(2) =

logp+2

p+1

logp+3

p+1

, η(12) =

logp11p22

p21p12

logp11p23

p21p13

,

where the + suffix indicates summing over the corresponding index. Thus, the

parameters η(1) and η(2) are marginal baseline logits for the variables X1 and

X2, while η(12) is a vector of log odds-ratios. The definition used in this paper

uses baseline coding, i.e., the contrasts are defined with respect to a reference

level, by convention the first. Therefore the dimension of the vectors η(1), η(2)

and η(12) are the dimensions of the sets I⋆1 , I⋆2 and I⋆12. Other coding schemes

can be adopted, as discussed for instance in [20] and [2].

Remark 3. This transformation for multivariate binary variables is discussed

in [12], that show that the function from p to η is a smooth (C∞) one-to-one

function having a smooth inverse, i.e. it is a diffeomorphism; see also [3]. For

general discrete variables see [15].

Remark 4. The parameters are not variation-independent, i.e. they do not

belong to a hyper-rectangle. A specified value for η will be called strongly com-

patible if it corresponds to a valid distribution p. In general, in fact, for some

values η there is no corresponding probability vector p.

However, the parameters satisfy the upward compatibility property, i.e., they

have the same meaning across different marginal distributions; see [12] and [15,

Proposition 4].

The calculation of multivariate logistic contrasts can be extended to condi-

tional distributions as follows. For fixed ipre(T ) ∈ Ipre(T ), let pT |pre(T ) be the

vector with strictly positive components p(iT |ipre(T )) > 0, iT ∈ IT . Then the

corresponding multivariate logistic parameters are denoted by a vector ηT |pre(T )

composed of all the (conditional) multivariate logistic contrasts η(A)(ipre(T )) for

all nonempty subsets A of T . For the upper compatibility property, ηA|pre(T ) is a


subvector of ηT | pre(T ) and from Remark 3 the transformation to the conditional

probability vector pA|pre(T ), is a diffeomorphism.

3.3. Saturated model

To specify the dependence of the responses in a chain component T on the

variables in pre(T ) we define a saturated multivariate logistic model for the

conditional probability distribution of XT given Xpre(T ). This is obtained by

specifying for each subvector of contrasts η(A) of ηT |pre(T ), for all nonempty

A ⊆ T , the linear predictor

η(A) = Z(A)β(A), (5)

where Z(A) is a full-rank design matrix and β(A) is parameter vector. Equiva-

lently, we express the dependence of the multivariate logistic contrasts on the

variables in the preceding components by a complete factorial model

η(A)(ipre(T )) =∑

b⊆pre(T )

β(A)b (ib), (6)

where the vectors β(A)b (ib) have dimensions of the sets I⋆A and, according to the

baseline coding, vanish when at least one component of ib takes on the first

level. Again, here different codings may be used if appropriate.

The link function of the model for each pT |pre(T ), can be written in matrix

form as

ηT |pre(T ) = ZTβT , (7)

where ZT = diag(Z(A)) is a block-diagonal matrix and βT is obtained by stack-

ing all the vectors β(A), for all the nonempty subsets A of T . The full model for

the joint probability then follows from the factorization of equation (3).

The form of matrix ZT depends on the used coding, but the relation between

βT and the vector of parameters β∗T , say, obtained from a different coding can

be derived in closed form.

Example 2. (continued.) Suppose that the responses X1 and X2 depend on

two binary explanatory variables X3 and X4, with levels indexed by k and ℓ,

respectively. Then, the saturated model is

η(A)(k, ℓ) = β(A)∅ + β

(A)3 (k) + β

(A)4 (ℓ) + β

(A)34 (k, ℓ), k, ℓ = 1, 2,

for A = {1}, {2}, {12}. The explicit form of the matrix Z(A) in equation (5) is,

using the Kronecker product ⊗ operator,

Z(A) = I ⊗

1 0

1 1

⊗

1 0

1 1

,


i.e., a matrix of a complete factorial design matrix, where I is an indentity

matrix of order equal to the common dimension of each η(A)(k, ℓ).

Remark 5. Following [17, p. 222] we shall denote the model, for the sake of

brevity, by a multivariate model formula

X1 : X3 ∗ X4; X2 : X3 ∗ X4; X12 : X3 ∗ X4

where X3 ∗ X4 = X3 + X4 + X3.X4 is the factorial expansion inWilkinson and Rogers’

symbolic notation; [22]. Notice that, as the linear map (5) is non singular, a zero

constraint of type η(A) = 0 is equivalent to a zero constraint on all the corre-

sponding regression coefficients, i.e., to β(A) = 0.

4. Discrete multivariate regression chain graph models

4.1. Linear constraints

A multivariate regression chain graph model can be specified by imposing zero

constraints on the parameters βT of the saturated model (7). Models of this

form belong to the class of multivariate logistic models [12]. We give first an

example and then we state the general result.

Example 2. (continued.) Assume that the chain graph G is defined as in Fig-

ure 1(a) and that X1 and X2 have two and three levels respectively as in Ex-

ample 2. Thus, we require, as shown in the graph, that X1 depends only on X3

and X2 depends only on X4. Therefore, we specify the model

η(1)(k, ℓ) = β(1)∅ + β

(1)3 (k),

η(2)(k, ℓ) = β(2)∅ + β

(2)4 (ℓ),

η(12)(k, ℓ) = β(12)∅ + β

(12)3 (k) + β

(12)4 (ℓ) + β

(12)34 (k, ℓ).

with a corresponding multivariate model formula

X1 : X3; X2 : X4; X12 : X3 ∗ X4.

The reduced model satisfies the two independencies 1⊥⊥ 4|3 and 2⊥⊥ 3|4 because

the first two equations are equivalent to p1|34 = p1|3 and p2|34 = p1|4, respectively.

The log odds-ratio between X1 and X2, on the other hand, depends in general

on all the combinations (k, ℓ) of levels of the two explanatory variables.

The following Theorem states a general rule to parametrize any discrete chain

graph model of multivariate regression type.


Theorem 2. Let G be a chain graph with chain components (T |T ∈ T ) and let

pre(T ) define a compatible full ordering of the chain components. A joint dis-

tribution of the discrete random vector X, satisfies the condition (mr1) of the

multivariate regression Markov property if and only if, for each chain component

T and for all connected subsets A ⊆ T ,

η(A)(ipre(T )) =∑

b⊆paG(A)

β(A)b (ib), (8)

by imposing in the multivariate logistic model (6) the constraints β(A)b (ib) = 0

for all b ⊆ pre(T ) such that b * paG(A), and ib ∈ Ib.

Moreover, the joint distribution satisfies the condition (mr2) if and only if,

for each chain component T and for any disconnected subset A of T ,

η(A)(ipre(T )) = 0, (9)

by imposing in model (6) the constraints β(A)b (ib) = 0 for b ⊆ pre(T ), and

ib ∈ Ib.

Example 3. The chain graph in Figure 1(b) corresponds to the multivariate

logistic model formula

X1 : X4; X2 : 1; X3 : X5; X12 : X4; X13 : 0; X23 : X5; X123 : X4 ∗ X5.

Notice that as there is no arrow pointing to node 2, the marginal logit of X2 is

assumed constant, and this is indicated by X2 : 1. This in turn specifies the joint

independence 2⊥⊥ 4,5. The missing edge (1, 3) with associated independence

1⊥⊥ 3|4,5, is obtained by imposing that the bivariate logit between X1 and X3

is zero, denoted by the model formula X13 : 0.

Remark 6. The discrete chain graph model defined by the constraints of The-

orem 2 implies, by Theorem 2, the factorization of the joint density according

to the chain components, i.e., pT |pre(T ) = pT |paD(T ), for all T . Moreover, defining

D = pre(T ) \ paD(T ), the factorization is equivalent to the statements that

η(A)(ipre(T )) = η(A)(ipaD(T ), iD) do not depend on iD, for all A ⊆ T . (10)

Therefore, the model defined in Theorem (2) implies that the equations (8) and

(9) hold, with η(A)(ipre(T )) substituted by η(A)(ipaD(T )), i.e., by the multivariate

logistic contrasts in the probability distribution pT |paD(T ) conditional on the

variables in the parent components instead that on all preceding components.


Remark 7. The class of models defined in Remark 1, corresponding exactly to

the factorization (2), all the independencies are obtained by setting paG(A) =

paD(T ) for all A ⊆ T in equation (8).

Remark 6 suggests an alternative definition of the multivariate logistic re-

gression models representing the chain graph.

Corollary 3. The joint probability distribution of the random vector X satisfies

the multivariate regression Markov property, if and only if

(i) the joint density factorizes according to the chain components

p(i) =∏

T∈T

p(iT |ipaD(T )); (11)

(ii) for each conditional distribution pT | paD(T ), for T ∈ T a multivariate logis-

tic model is specified, for all A ⊆ T , by

η(A)(ipaD(T )) =

∑

b⊆paG(A)

β(A)b (ib), if A is connected,

0, if A is disconnected.

(12)

The Corollary follows from Remark 6. Thus, by exploiting the factorization,

the multivariate logistic models can be defined in the lower-dimensional condi-

tional distributions pT | paD(T ), whereas, in the previous approach, the indepen-

dencies required by the factorization were incorporated in the models for the

dependence of XT on all the previous variables.

4.2. Marginal log-linear parameterization

If the discrete multivariate regression chain graph model is defined by recur-

sive multivariate logistic models for the conditional distribution pT |pre(T ), as

explained in Theorem 2, then the complete set of parameters βT for T ∈ T ,

actually defines a marginal log-linear model; see [3]. We give the details below.

A parameterization of the discrete joint probability distribution p(i) is said

to be a marginal log-linear parameterization if it is obtained by combining the

log-linear parameters of several marginal distributions pM1, . . . , pMs

of interest.

For each j = 1, . . . , s the log-linear parameters defined within the marginal

distribution pMjare called marginal log-linear ‘interactions’ and are denoted by

λMj

L for L ∈ Lj, a collection of subsets of Mj .

The marginal log-linear parameterization is said to be hierarchical and com-

plete if it is generated by a list of pairs (Mj ,Lj), for j = 1, . . . , s, such that:


(i) the sets Lj , for j = 1, . . . , m form a partition of the set of all possible

nonempty subsets of V ;

(ii) each Lj forms an ascending class of interactions defined within the margin

Mj .

Property (ii) requires that if L ∈ Lj then also any superset L′ such that

L ⊆ L′ ⊆ Mj belongs to Lj . In [3] it is proved that any complete and hi-

erarchical marginal log-linear parameterization is smooth. In fact, it defines a

diffeomorphism of the interior of the simplex of all strictly positive discrete

probability distributions.

Then, the following Lemma, states that the parameters of the saturated mul-

tivariate logistic model (5) are identical to the log-linear parameters of selected

marginal distributions.

Lemma 4. For any fixed subset A in T , the parameter β(A) in the linear pre-

dictor η(A) = Z(A)β(A), in the saturated model (5) coincides with the vector

(λMA,T

L ), obtained by stacking the marginal log-linear parameters λMA,T

L defined

in the margins MA,T = A ∪ pre(T ) for the interactions L = A ∪ b for all

b ⊆ pre(T ).

Moreover, the collection of all the multivariate logistic regression parameters

in fact defines a hierarchical and complete marginal log-linear parameterization

for the joint distribution.

Proposition 5. The marginal log-linear parameterization (λMA,T

L ) generated by

MA,T = A ∪ pre(T ) and LA,T = {A ∪ b : b ⊆ pre(T )},

for all nonempty A ⊆ T and for all T ∈ T , is hierarchical and complete and its

parameters coincide with the set of parameters β(A) of the saturated multivariate

logistic model (5).

We shall name this parameterization the multivariate regression log-linear

parameterization of the class of discrete chain graph models with chain compo-

nents T . It was used before in [16] with an illustration for a chain graph like

that in Figure 1(b).

By combining this result with Theorem 2 is straightforward to verify that

any discrete chain graph model PMR(G) is defined by a set of zero constraints

on these marginal log-linear parameters.

Proposition 6. The joint distribution of the discrete random vector X, obeys

the multivariate regression Markov property with respect to G if and only if


the parameters of the multivariate regression log-linear parameterization satisfy

λMA,T

L = 0 for all L = A ∪ b with

b ⊆ pre(T ) and b * paG(A) if A is connected, and

b ⊆ pre(T ) if A is disconnected.

The discrete chain graph model PMR(G) is thus defined by a set of linear

restrictions on complete hierarchical marginal log-linear parameters. Then [3,

Theorem 5] proves that the model is a curved exponential family. The same

result has been obtained for discrete chain graph modelsPIV(G), with a different

proof, by [7].

Acknoledgement

We thank Nanny Wermuth and D. R. Cox for helpful comments. The work was

supported by MIUR, Rome, under the project PRIN 2007, XECZ7L001 and

XECZ7L002.

References

[1] Andersson, S., Madigan, D. and Perlman, M. (2001). Alternative

Markov properties for chain graphs. Scand. J. Statist., 28 33–85.

[2] Bartolucci, F., Colombi, R. and Forcina, A. (2007). An extended

class of marginal link functions for modelling contingency tables by equality

and inequality constraints. Statist. Sinica, 17 691–711.

[3] Bergsma, W. P. and Rudas, T. (2002). Marginal log-linear models for

categorical data. Ann. Statist., 30 140–159.

[4] Cowell, R. G.,Dawid, A. P., Lauritzen, S. L. and Spiegelhalter,

D. J. (1999). Probabilistic networks and expert systems. Springer Series,

New York.

[5] Cox, D. R. and Wermuth, N. (1993). Linear dependencies represented

by chain graphs (with discussion). Statist. Sci., 8 204–218, 247–277.

[6] Cox, D. R. and Wermuth, N. (1996). Multivariate dependencies. Chap-

man and Hall, London.

[7] Drton, M. (2009). Discrete chain graph models. Bernoulli (Forthcoming).

[8] Drton, M. and Richardson, T. S. (2004). Multimodality of the likeli-

hood in the bivariate seemingly unrelated regression models. Biometrika,

91 383–392.


[9] Drton, M. and Richardson, T. S. (2008). Binary models for marginal

independence. J. R. Stat. Soc. Ser. B Stat. Methodol., 70 287–309.

[10] Fienberg, S. E. (2007). The analysis o cross-classified categorical data.

Springer Verlag, New York.

[11] Frydenberg, M. (1990). The chain graph Markov property. Scand. J.

Statist., 17 333–353.

[12] Glonek, G. J. N. and McCullagh, P. (1995). Multivariate logistic

models. J. R. Stat. Soc. Ser. B Stat. Methodol., 57 533–546.

[13] Lauritzen, S. L. (1996). Graphical Models. Oxford University Press,

Oxford.

[14] Lauritzen, S. L. and Wermuth, N. (1989). Graphical models for asso-

ciations between variables, some of which are qualitative and some quan-

titative. Ann. Statist., 17 31–57.

[15] Lupparelli, M., Marchetti, G. M. and Bergsma, W. P. (2009).

Parameterizations and fitting of bi-directed graph models to categorical

data. Scand. J. Statist. Forthcoming. Preprint available at arXiv:0801.1440.

[16] Marchetti, G. M. and Lupparelli, M. (2008). Parameterization and

fitting of a class of discrete graphical models. In Compstat 2008 - Proceed-

ings in Computational Statistics: 18th Symposium (P. Brito, ed.). Physica-

Verlag, Heidelberg, 117–128.

[17] McCullagh, P. and Nelder, J. A. (1989). Generalized linear models.

Chapman and Hall, London.

[18] Richardson, T. S. (2003). Markov property for acyclic directed mixed

graphs. Scand. J. Statist., 30 145–157.

[19] Studeny, M. (2005). Probabilistic conditional independence structures.

Springer Series in Information Science and Statistics, London.

[20] Wermuth, N. and Cox, D. R. (1992). On the relation between inter-

actions obtained with alternative codings of discrete variables. Methodika,

VI 76–85.

[21] Wermuth, N. and Cox, D. R. (2004). Joint response graphs and separa-

tion induced by triangular systems. J. R. Stat. Soc. Ser. B Stat. Methodol.,

66 687–717.

[22] Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic description of

factorial models for analysis of variance. J. Roy. Statist. Soc. Ser. C, 22

392–399.


Appendix A: Equivalence of the multivariate regression and

block-recursive Markov property of type IV

The following properties of conditional independence will be used in the subsequent

proofs. Let X be a random vector and let A, B, C, and D be a collection of pairwise

disjoint subsets of V . Then, the basic property of conditional independence is that

A⊥⊥ BD | C is equivalent to [A⊥⊥ B | CD and A⊥⊥ D | C], (13)

see for instance [19, p. 13]. In this statement, and henceforth, we use the notation BD

as a shortcut for B ∪ D. From this property several related results can be obtained

as for instance

A⊥⊥ BD | C and B ⊥⊥ D | C imply AB ⊥⊥ D | C (14)

A⊥⊥ BD | C and B ⊥⊥ D | AC imply B ⊥⊥ D | C. (15)

To state the block-recursive Markov property for chain graphs of type IV we need

two further concepts from graph theory. Given a chain graph G, the set NbG(A) is

the union of A itself and the set of nodes w that are neighbours of A, i.e. coupled by

an undirected edge to some node node v in A. Moreover, the set of non-descendants

ndD(T ) of a chain component T , is the union of all components T ′ such that there is

no directed path from T to T ′ in the directed graph of chain components D.

The following is the definition of the block-recursive Markov property of type IV.

Definition 2. [7]. Let G be a chain graph with chain components (T |T ∈ T ) and di-

rected acyclic graph D of components. The joint probability distribution of X obeys the

block-recursive Markov property of type IV if it satisfies the following independencies:

(iv0) A⊥⊥ [ndD(T ) \ paD(T )] | paD(T ) for all T ∈ T

(iv1) A⊥⊥ [paD(T ) \ paG(A)] | paG(A) for all T ∈ T , for all A ⊆ T

(iv2) A⊥⊥ [T \NbG(A)] | paD(T ) for all T ∈ T for all connected subsets A ⊆ T.

As [7] remarks, when the joint probability distribution is strictly positive, condition

(iv0) is equivalent to the factorization (2).

In the following proofs T ∈ T is an arbitrary chain component of the graph G,

that by assumption is connected. If A is a nonempty subset of T , then the parents

of A, paG(A) are always included in the set paD(T ). Also, assuming any strict total

ordering of the chain components, compatible with the partial order implied by the

chain graph, we have

paG(A) ⊆ paD(T ) ⊆ pre(T ). (16)

Then, we may define the pairwise disjoint sets

B = paG(A), C = paD(T ) \ paG(A), D = pre(T ) \ paD(T )


with

paD(T ) = BC, and pre(T ) = BCD, pre(T ) \ paG(A) = CD.

denoting unions of the relevant disjoint sets.

The first result shows that we can replace the original Definition 2 of block-recursive

Markov property by the following alternative equivalent form.

Lemma 7. The block-recursive Markov property of type IV for a chain graph G states

equivalently that:

(iv0′) T ⊥⊥ [pre(T ) \ paD(T )]|paD(T ) for all T ∈ T

(iv1′) A⊥⊥ [paD(T ) \ paG(A)] | paG(A) for all T ∈ T , for all connected A ⊆ T

(iv2′) A1 ⊥⊥ · · · ⊥⊥ Ar | paD(T ) for all T ∈ T for all disconnected A ⊆ T,

where A1, . . . , Ar are the connected components of A if disconnected.

Proof. Equation (iv0) states that the joint probability distribution obeys the local

directed Markov property relative to the directed graph D of the chain components.

Then, from [4, Theorem 5.12], this is equivalent to (iv0′) for any well-chosen total

order of the components, calling this statement the ordered directed Markov property.

Condition (iv2) is equivalent to the mutual independence of the connected com-

ponents of the disconnected set A, in the conditional distribution pA|paD(T ), as stated

in (iv2′); [9].

Considering finally statement (iv1), this is extended to all subsets A, but in fact

it is not needed for disconnected sets, because this is implied by statement (iv2)

for disconnected sets and statement (iv1) for connected sets. Assume that A is dis-

connected with two connected components A1 and A2 and let Bj = paG(Aj), for

j = 1, 2 and Bj = B \ Bj . Thus B = paG(A) = B1B1 = B2B2. Thus, (iv1) ap-

plied to connected sets A1 and A2 gives A1 ⊥⊥ B1C|B1 and A2 ⊥⊥ B2C|B2. Thus, by

property (13) Aj ⊥⊥ C|B is implied, j = 1, 2. By the same property, we may com-

bine this statements with statement A1 ⊥⊥ A2|BC implied by (iv2), obtaining the

independence A1 ⊥⊥ A2C|B. As A2 ⊥⊥ C|B, by property (14) we get A1A2 ⊥⊥ C|B, i.e.

A⊥⊥ paD(T ) \ paG(A)|paG(A) as required. The result can be generalized to discon-

nected sets with more than two connected components. Therefore in Definition 2 we

may restrict statement (iv1) to connected sets.

Then, we are ready to prove that the multivariate regression Markov property is

equivalent to the above block-recursive Markov property.

Proof of Theorem 1. We establish the equivalence of Definition 1 with the state-

ments (iv0′), (iv1′) and (iv3′) of Proposition 7.

1) [(mr1) implies (iv0′) and (iv1′).] Given a chain component T and a connected

nonempty subset A ⊆ T , statement (mr1) is A⊥⊥ CD | B, and, by property (13),


this is equivalent to A⊥⊥ C|B and A⊥⊥ D | BC. The first statement is A⊥⊥ [paD(T ) \

paG(A)]|paG(A) i.e. (iv1′). The second statement, for A = T , gives T ⊥⊥ [pre(T ) \

paD(T )]|paD(T ), i.e., (iv0′).

2) [(iv0′) and (iv1′) imply (mr1).] Statement (iv0′) is T ⊥⊥ D|BC and thus, for

any A ⊆ T , property (13) implies A⊥⊥ D|BC. Moreover, statement (iv1′) is A⊥⊥ C|B.

Thus the last two statements together imply, by property (13), A⊥⊥ CD|B that trans-

lates into A⊥⊥ [pre(T ) \ paG(A)]|paG(A), i.e., (mr1).

3) [mr1 and (mr2) imply (iv2′).] Assume that A is disconnected with two con-

nected components A1 and A2 and let Bj = paG(Aj), for j = 1, 2 and Bj = B \ Bj.

Thus B = paG(A) = B1B1 = B2B2. Then statement (mr2) is A1 ⊥⊥ A2|BCD. Then

we use again the argument of the proof of Proposition 7, to show that even when

A is disconnected, the multivariate regression Markov property implies A⊥⊥ [pre(T ) \

paG(A)]|paG(A). In fact, we can apply statement (mr1) to the connected A1 we ob-

tain A1 ⊥⊥ B1CD|B1 and, by property (13), A1 ⊥⊥ CD|B. Then, we combine this state-

ment with A1 ⊥⊥ A2|BCD obtaining A1 ⊥⊥ A2CD|B. As A2 ⊥⊥ CD|B, by property (14)

we conclude that A⊥⊥ CD|B. Now, using property (15), A1 ⊥⊥ A2|BCD and A⊥⊥ CD|B

imply A1 ⊥⊥ A2|BC, that translates back into A1 ⊥⊥ A2|paD(T ), i.e. (iv2′). The general

case of any finite number r of connected components Aj is treated similarly.

4) [(iv0′)) and (iv2′) imply (mr2).] Again we give the details for a disconnected

set A with two connected components A1 and A2. Statement (iv2′) is A1 ⊥⊥ A2|BC.

By (iv0′) we also know that T ⊥⊥ D|BC, and thus for the subset A = A1A2 ⊆ T we

have A1A2 ⊥⊥ D|BC. Therefore, by property (14) we conclude that A1 ⊥⊥ A2|BCD, or

A1 ⊥⊥ A2|pre(T ), i.e. (mr2).

Appendix B: Relation between multivariate logistic and log-linear

parameters

Given a subvector XM of the given random vector X, the log-linear expansion of its

marginal probability distribution pM is

log pM (iM ) =∑

s⊆M

λMs (is), (17)

where λMs (is) defines the ‘interaction’ parameters of order |s|, in the baseline param-

eterization, i.e. with the implicit constraint that the function returns zero whenever

at least one of the indices in is takes the first level.

Then we can relate the multivariate logistic contrasts η(A)(i⋆A|ipre(T )) to selected

log-linear parameters as follows.

Lemma 8. If η(A)(i⋆A|ipre(T )) is the multivariate logistic contrast of the conditional

probability distribution pA|pre(T ) for A subset of T , then, with B = pre(T ),

η(A)(i⋆A|iB) =∑

b⊆B

λABAb (i

⋆A, ib) (18)


where λABAb (i

⋆A, ib) are the log-linear interaction parameters of order A ∪ b in the

marginal probability distribution pAB.

Proof. First note that the multivariate logistic contrasts η(A|B)(i⋆A|iB) can be written

η(A|B)(i⋆A|iB) =∑

s⊆A

(−1)|A\s| log pAB(i⋆s, iB ,1A\s). (19)

Then, we express the logarithm of the joint probabilities pAB as sum of log-linear

interactions using (17)

log pAB(i⋆s, iB ,1A\s) =

∑

a⊆A

∑

b⊆B

λABab (i⋆a∩s,1a\s, ib)

=∑

a⊆s

∑

b⊆B

λABab (i⋆a, ib).

Therefore, by substitution into equation (19) we get

ηA|B(i⋆A|iB) =∑

s⊆A

(−1)|A\s|∑

a⊆s

∑

b⊆B

λABab (i⋆a, ib)

=∑

b⊆B

∑

s⊆A

(−1)|A\s|∑

a⊆s

λABab (i⋆a, ib)

=∑

b⊆B

λABAb (i

⋆A, ib),

where the last equality is obtained using Mobius inversion, see [13, p. 239, Lemma

A.2]).

Lemma 8 is used in the proof of Theorem 2 given below.

Proof of Theorem 2. If (8) holds for any chain component T , then for any con-

nected set A ⊆ T , η(A)(ipre(T )) is a function of ipaG(T ) only. Therefore, using the

diffeomorphism and the property of upward compatibility discussed in Remark 4, the

conditional distribution pA|pre(T ) coincide with pA|paG(A) and condition (mr1) holds.

Conversely, if condition (mr1) holds and pA|pre(T ) = pA|paG(A), for all connected

subsets A of T , then the components of η(A)(ipre(T )) are

η(A)(i⋆A|ipre(T )) =∑

s⊆A

(−1)|A\s| log p(i⋆s,1A\s | ipaG(T ))

=∑

b⊆B

λABAb (i

⋆A, ib), with B = paG(T )

by Lemma 8, and thus (8) holds with β(A)b (ib) = λAB

Ab (ib) where λABAb (ib) denotes the

vector of log-linear parameters λABAb (i

⋆A, ib) for all i

⋆A ∈ I

⋆A.

Condition (mr2) of Definition 1 is equivalent to impose that, for any chain com-

ponent T , the conditional distribution pT |pre(T ) satisfies the independence model of

a covariance sub-graph GT . In [15] is proved that, given a joint distribution pT , a


covariance graph model is satisfied if and only if in the multivariate logistic param-

eterization ηT , η(A) = 0 for all disconnected sets A ⊆ T . Therefore, extending this

result to the conditional distribution pT |pre(T ) and considering the diffeomorphism

(6), condition (mr2) holds if and only if η(A)(iB) = 0 for every disconnected set

A ⊆ T . Following the factorial model (7), β(A)b (ib) = 0 with b ⊆ pre(T ), for each

disconnected subset A of T . Notice that, by Lemma 8, β(A)b (ib) = λAB

Ab (ib) = 0, with

b ⊆ pre(T ).

Proof of Proposition 5. The sets LA,T are pairwise disjoint and their union is

KT =⋃

∅6=A⊆T

LA,T = P(A ∪ pre(T )) \ P(pre(T ))

where P(·) denotes the power set. Moreover, for T ∈ T the sets KT are pairwise

disjoint and their union⋃

T∈T

KT = P(V ) \ {∅}

coincides with the set of all nonempty subsets of V , thus implying that condition (ii) is

satisfied. Condition (i) is also satisfied by the very definition of the sets LA,T . Finally,

the equality with the parameters of saturated model (5) follows from Propositions (4).

Chain graph models of multivariate regression type for ...e-mail: [email protected]...

Documents

Transcript of Chain graph models of multivariate regression type for ...e-mail: [email protected]...