Parameter identifiability of discrete DAG models with latent variables

John A. Rhodes

Algebraic Statistics 2014

IIT, May 19 – 22

Note: AK ⊂ US, P(x ∈ AK|x ∈ US) ≈ .16, P(p ∈ AK|p ∈ US) ≈ .0023, P(c ∈ AK|c ∈ US) ≈ 1

Thanks to those who made AS2014 possible:

Local Organizers: Sonja Petrovic, Despina Stasi

Program Committee: Stephen Fienberg, Sonja Petrovic, Seth Sullivant, Henry Wynn, Ruriko Yoshida

IIT grad students: Weronika J. Swiechowicz, Carlo Pierandozzi, Martin Dillon, Kawkab Alhejoj, Dane Wilburne, Junyu He

IIT undergrads: Xintong Li, Meng (Mamie) Wang

Collaborators:

Elizabeth Allman, Mathematics, UAF

Elena Stanghellini, Statistics, Perugia

Marco Valtorta, Computer Science, South Carolina


Example:

[Figure: example DAG with latent node 0 and observable nodes 1–7]

Variables Xi have finite state spaces, of size ni .

Q: Are model parameters identifiable?


Parameters: Conditional probabilities P(Xi | pa(Xi))

Joint distribution: ∑_{X0} ∏_i P(Xi | pa(Xi))

Identifiability: The joint distribution of observable variables

determines the parameters (up to...)


Identifiability:

1) Parameterization is polynomial, so focus on generic behavior.

(generic complex? real? stochastic?)

2) Latent variables ⇒ “label-swapping”

⇒ n0!-to-1 parametrization, at best

Q’: Is the parameterization generically k-to-1 for some finite k?

If so, characterize the fibers of the parameterization.
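The label-swapping point above can be checked directly on a small case. The sketch below assumes a star model with a binary latent X0 and three binary observable children; the parameter values and the helper name `star_joint` are arbitrary illustrative choices, not taken from the talk. It uses exact rational arithmetic to confirm that relabeling the states of X0 changes the parameters but not the observable joint distribution.

```python
from fractions import Fraction as F
from itertools import product

def star_joint(prior, cond):
    """Observable joint of a star model with a latent root.

    prior[h] = P(X0 = h); cond[i][h][x] = P(X_i = x | X0 = h).
    Returns a dict mapping observable states (x_1, ..., x_m) to probabilities.
    """
    sizes = [len(c[0]) for c in cond]
    joint = {}
    for xs in product(*(range(n) for n in sizes)):
        total = F(0)
        for h, ph in enumerate(prior):
            term = ph
            for c, x in zip(cond, xs):
                term *= c[h][x]
            total += term  # marginalize over the latent state h
        joint[xs] = total
    return joint

# Illustrative parameters (assumed values, chosen arbitrarily).
prior = [F(1, 3), F(2, 3)]
cond = [
    [[F(1, 4), F(3, 4)], [F(2, 3), F(1, 3)]],  # P(X1 | X0)
    [[F(1, 5), F(4, 5)], [F(1, 2), F(1, 2)]],  # P(X2 | X0)
    [[F(3, 7), F(4, 7)], [F(1, 6), F(5, 6)]],  # P(X3 | X0)
]

# Swap the two latent labels: reverse the prior and the rows of each P(X_i | X0).
swapped_prior = prior[::-1]
swapped_cond = [c[::-1] for c in cond]

same_joint = star_joint(prior, cond) == star_joint(swapped_prior, swapped_cond)
different_parameters = (prior, cond) != (swapped_prior, swapped_cond)
print(same_joint, different_parameters)
```

With n0 latent states there are n0! such relabelings, which is why the parameterization can be at best n0!-to-1.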


Common practical approach:

With J = Jacobian of parameterization, N=dim(parameter space)

compute rank(J) at many random points

• If rank(J) < N everywhere, parameters are not identifiable (∞-to-1).

• If rank(J) = N, then parameters are locally identifiable.

• Since local identifiability ⇏ global identifiability, assume/hope

label swapping is the only issue, k = n0!.
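The rank test described above can be sketched for small binary models. The code below is an illustration under stated assumptions, not the speaker's implementation: it treats one binary latent X0 with m binary children, uses the free parameters P(X0=1) and P(Xi=1 | X0=h), and the function names are hypothetical. Because this parameterization is affine in each parameter separately, a difference quotient with step 1 gives each partial derivative exactly, so the rank can be computed over the rationals without a numerical tolerance.

```python
from fractions import Fraction as F
from itertools import product
import random

def star_probs(theta, m):
    """Joint probabilities of m binary leaves with one binary latent parent.

    theta = [P(X0=1), P(X1=1|X0=0), P(X1=1|X0=1), ..., P(Xm=1|X0=0), P(Xm=1|X0=1)].
    """
    out = []
    for xs in product((0, 1), repeat=m):
        total = F(0)
        for h in (0, 1):
            term = theta[0] if h == 1 else 1 - theta[0]
            for i, x in enumerate(xs):
                p1 = theta[1 + 2 * i + h]
                term *= p1 if x == 1 else 1 - p1
            total += term
        out.append(total)
    return out

def jacobian(theta, m):
    """Exact Jacobian: the map is affine in each parameter separately,
    so a difference quotient with step 1 equals the partial derivative."""
    base = star_probs(theta, m)
    cols = []
    for j in range(len(theta)):
        bumped = list(theta)
        bumped[j] += 1
        cols.append([b - a for a, b in zip(base, star_probs(bumped, m))])
    return [list(row) for row in zip(*cols)]  # rows = outputs, columns = parameters

def rank(matrix):
    """Rank by Gaussian elimination over the rationals."""
    rows = [list(r) for r in matrix]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def generic_rank(m, trials=5, seed=0):
    """Return (N, largest Jacobian rank seen at random rational points)."""
    rng = random.Random(seed)
    n_params = 1 + 2 * m
    ranks = []
    for _ in range(trials):
        theta = [F(rng.randint(1, 99), 100) for _ in range(n_params)]
        ranks.append(rank(jacobian(theta, m)))
    return n_params, max(ranks)

print(generic_rank(2))  # two observable children
print(generic_rank(3))  # three observable children
```

For two children the rank should stay below N = 5, consistent with an ∞-to-1 parameterization, while for three children it should reach N = 7, which shows only local identifiability and says nothing about the size of the fibers.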


For any specific DAG and finite state spaces, one can (try to)

answer this question with computational algebra, but....

Q”: What graphical criteria address identifiability?

Cf. “do”-calculus for identifiability of causal effects: it determines

exactly what is identifiable and gives rational formulas

(for nonparametric latent variables).


Simple DAGs:

[Figure: latent node 0 with children 1 and 2]

P(X1, X2) = M1^T D M2

D = diag(P(X0)), M1 = P(X1 | X0), M2 = P(X2 | X0)

Non-uniqueness of matrix factorization

⇒ ∞-to-1 parameterization
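One way to see the non-uniqueness concretely is to build a one-parameter family of distinct stochastic parameters with the same P(X1, X2). The sketch below assumes binary variables and uses the mixing matrix G = [[1−t, t], [t, 1−t]], replacing M2 by G M2 and M1^T D by M1^T D G^{-1}. This construction, the starting parameter values, and the helper names are illustrative assumptions rather than the argument given in the talk.

```python
from fractions import Fraction as F

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def joint(d, M1, M2):
    """P(X1, X2) = M1^T D M2 with D = diag(d)."""
    D = [[d[i] if i == j else F(0) for j in range(len(d))] for i in range(len(d))]
    return matmul(matmul(transpose(M1), D), M2)

def reparameterize(d, M1, M2, t):
    """Return different parameters with the same P(X1, X2), for binary X0."""
    G = [[1 - t, t], [t, 1 - t]]
    det = 1 - 2 * t
    G_inv = [[(1 - t) / det, -t / det], [-t / det, (1 - t) / det]]
    # A[x][h] = P(X1 = x, X0 = h)
    A = matmul(transpose(M1), [[d[0], F(0)], [F(0), d[1]]])
    A_new = matmul(A, G_inv)
    d_new = [sum(col) for col in zip(*A_new)]  # new P(X0)
    M1_new = [[A_new[x][h] / d_new[h] for x in range(len(A_new))] for h in range(2)]
    return d_new, M1_new, matmul(G, M2)

# Illustrative starting parameters (assumed values).
d = [F(1, 2), F(1, 2)]
M1 = [[F(1, 4), F(3, 4)], [F(3, 4), F(1, 4)]]
M2 = [[F(1, 5), F(4, 5)], [F(7, 10), F(3, 10)]]

target = joint(d, M1, M2)
family = [reparameterize(d, M1, M2, F(k, 20)) for k in range(1, 4)]  # t = 1/20, 1/10, 3/20

# Each member must be a genuine stochastic parameter choice ...
all_valid = all(
    min(dn) > 0 and all(min(r) >= 0 for r in M1n + M2n)
    for dn, M1n, M2n in family
)
# ... and must reproduce exactly the same observable distribution.
all_same_joint = all(joint(*params) == target for params in family)
print(all_valid, all_same_joint)
```

Since t can vary continuously in a small interval, the fiber over this P(X1, X2) is infinite.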


Simple DAGs:

Star model — tensor decomposition

[Figure: latent node 0 with children 1, 2, 3]

Kruskal’s Theorem: Decomposition of a generic 3-tensor is unique

if n1, n2, n3 sufficiently large relative to n0,

n1 + n2 + n3 ≥ 2n0 + 2

⇒ Parameters are identifiable, up to label swapping.


Example (Kuroki and Pearl 2014)

[Figure: DAG with latent node 0 and observable nodes 1, 2, 3, 4]

By do-calculus, P(X3 | do(X2)) is not identifiable.

But...

If X0 has finite state space, X1,X4 have larger state spaces,

P(X3 | do(X2)) is identifiable.


In fact, all parameters are identifiable, up to label swapping.

[Figure: DAG with latent node 0 and observable nodes 1, 2, 3, 4]

• reverse 1 → 2, giving a Markov equivalent model

• condition on X2 to obtain a generic Kruskal model with the “same”

parameters

• identify P(X4 | X0), up to label swap

• Solve P(X1,X2,X3;X4) = P(X1,X2,X3;X0)P(X4 | X0)

for P(X1,X2,X3;X0) to “uncover” latent

• From P(X0,X1,X2,X3) find remaining parameters.
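Once P(X4 | X0) is known, the “uncover the latent” step in the list above is a linear solve. A minimal sketch, assuming all variables are binary so that P(X4 | X0) is an invertible 2×2 matrix, and using made-up numbers for the hidden table P(X1,X2,X3; X0) (hypothetical values, not from the talk):

```python
from fractions import Fraction as F

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("P(X4 | X0) is singular; the latent cannot be uncovered this way")
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical hidden table: row x is a joint state of (X1, X2, X3), column h a state of X0.
weights = [[1, 2], [3, 1], [2, 2], [1, 4], [5, 1], [2, 3], [1, 1], [4, 3]]
total = sum(map(sum, weights))
joint_with_latent = [[F(w, total) for w in row] for row in weights]

# Hypothetical P(X4 | X0): rows indexed by X0, columns by X4.
M4 = [[F(4, 5), F(1, 5)], [F(3, 10), F(7, 10)]]

# What is actually observed: P(X1,X2,X3; X4) = P(X1,X2,X3; X0) P(X4 | X0).
observed = matmul(joint_with_latent, M4)

# Solve for the hidden table.
recovered = matmul(observed, inverse_2x2(M4))
print(recovered == joint_with_latent)

# If P(X4 | X0) is only known up to label swap, the recovered columns swap too.
swapped = matmul(observed, inverse_2x2(M4[::-1]))
print(swapped == [row[::-1] for row in joint_with_latent])
```

When X4 has more states than X0, the same step would use a right inverse of P(X4 | X0), which exists when that matrix has full row rank.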


More generally...

To gain insight, consider all DAG models with:

• 1 latent, parent of at most 4 observables

• binary variables

Goals:

• Develop algebraic arguments not tied to binary case

• Reduce more complex DAGs to these... (more later)

Is conditioning/marginalizing/Kruskal enough to successfully

analyze these?


Almost ....

[Graph column: DAG drawings for each model not recoverable]

Model            dim(Θ)   2^A − 1   k
2-B, B ≥ 0       ≥ 5      3         ∞
3-0              7        7         2
3-Bx, B ≥ 1      ≥ 9      7         ∞
4-0              9        15        2
4-1              11       15        2
4-2a             13       15        ∞
4-2b,c           13       15        2
4-2d             15       15        2
4-3a,b (A)       15       15        2
4-3c,d           17       15        ∞
4-3e,f (B)       15       15        4
4-3g             17       15        ∞
4-3h             25       15        ∞
4-3i             25       15        ∞
4-Bx, B ≥ 4      ≥ 19     15        ∞

2 interesting cases...


Model A (binary)

With binary variables, the parameterization for

[Figure: Model A, a DAG with latent node 0 and observable nodes 1, 2, 3, 4]

is generically 2-to-1 on stochastic parameter space.

This model is not reducible to Kruskal.


Sketch:

• Condition on X1,X3 (4 ways), to give 4 matrices

• Construct expressions in these matrices whose eigenvectors

identify parameters.

• Need generic condition: distinct eigenvalues. Equivalently:

There is a 3-way interaction between X0,X1,X3


Why is the 3-way interaction needed?

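The model

[Figure: DAG with latent node 0 and observable nodes 1, 2, 3, 4]

has an ∞-to-1 parameterization. So

[Figure: DAG with latent node 0 and observable nodes 1, 2, 3, 4, 5]

does as well.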


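But conditioning

[Figure: DAG with latent node 0 and observable nodes 1, 2, 3, 4, 5]

on X5 yields

[Figure: DAG with latent node 0 and observable nodes 1, 2, 3, 4]

still with an ∞-to-1 parameterization.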


Contradiction, since

[Figure: DAG with latent node 0 and observable nodes 1, 2, 3, 4]

has a 2-to-1 parameterization (Model A).

FLAW: Conditioning gave a non-generic instance, with no 3-way

interaction between 0, 1, 3, so there is no contradiction.


Moral 1: Conditioning must be done carefully, to give a generic

model.

Moral 2: Frameworks such as summary graphs and maximal

ancestral graphs, which graphically depict some consequences of

conditioning, are not helpful here: they do not give generic instances.


Model A (general)

The model

[Figure: Model A, a DAG with latent node 0 and observable nodes 1, 2, 3, 4]

is generically identifiable, up to label swapping, provided

n2, n4 ≥ n0


Model B (binary)

With binary variables, the parameterization for

[Figure: Model B, a DAG with latent node 0 and observable nodes 1, 2, 3, 4]

is generically 4-to-1 on stochastic parameter space

(not just label swapping).

Sketch:

• Conditioning a generic model on X1 yields 2 generic star models

[Figure: latent node 0 with children labeled 1, 2, 3]

• These each have 2-to-1 parameterizations.

• Any of the 4 choices of parameters for them can be “combined”

to give parameters for the original model.


Model B (general)

If n2, n3, n4 sufficiently large relative to n0, then

[Figure: Model B, a DAG with latent node 0 and observable nodes 1, 2, 3, 4]

has a (n0!)^n1-to-1 parameterization.

Moreover, a full fiber can be obtained from any single element by

rational formulas.


Large DAG models

If a DAG model has a k-to-1 parameterization, then

k is unchanged if:

• remove observable sinks with all parents observable

• pass to Markov equivalent graphs

k may change if:

• marginalize/condition on observed variables

Cautions:

• Marginalize only over sinks, but risk losing identifiability.

• Condition carefully, to get generic model.


A general result

Building on Model B

[Figure: Model B, a DAG with latent node 0 and observable nodes 1, 2, 3, 4]

Theorem: Suppose a DAG has one latent node 0 with no parents,

and three observable sinks 1, 2, 3 that are children of 0.

Let

C = Anc(Chd(0) ∩ Anc(1) ∩ Anc(2) ∩ Anc(3)) ∖ {0},

and

u = |C ∩ Pa(1) ∩ Pa(2) ∩ Pa(3)| .

Then for binary variables, the parameterization is generically k-to-1

with k = 2^(2^u).

The potential fiber can be described, and thus k can be determined

exactly.

(non-binary version also)


Example: Model B

[Figure: Model B, a DAG with latent node 0 and observable nodes 1, 2, 3, 4]

Sinks 2,3,4, all children of 0,

C = Anc(Chd(0) ∩ Anc(2) ∩ Anc(3) ∩ Anc(4)) ∖ {0}

= {1}

u = |C ∩ Pa(2) ∩ Pa(3) ∩ Pa(4)|

= 1

so 2^(2^1) = 4-to-1 parameterization
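The sets in this example can be computed mechanically from parent lists. The sketch below rests on several assumptions that the extracted slides do not confirm: Model B is taken to have edges 0 → 1, 0 → 2, 0 → 3, 0 → 4, 1 → 2, 1 → 3, 1 → 4 (consistent with dim(Θ) = 15, C = {1} and u = 1, but the drawing itself is lost); Anc is read as including the nodes themselves; the count is read as k = 2^(2^u); and the function names are hypothetical.

```python
def ancestors(parents, nodes):
    """All ancestors of the given nodes, including the nodes themselves (assumed convention)."""
    seen = set()
    stack = list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen

def three_sink_count(parents, latent, sinks):
    """Return (C, u, k) for one parentless latent node and three observable sinks."""
    children = {v for v, ps in parents.items() if latent in ps}
    common = set.intersection(*(ancestors(parents, [s]) for s in sinks))
    C = ancestors(parents, children & common) - {latent}
    u = len(C.intersection(*(set(parents[s]) for s in sinks)))
    return C, u, 2 ** (2 ** u)

# Assumed edge set for Model B, given as node -> list of parents.
model_b = {0: [], 1: [0], 2: [0, 1], 3: [0, 1], 4: [0, 1]}
print(three_sink_count(model_b, latent=0, sinks=[2, 3, 4]))

# Star model 3-0: latent 0 with children 1, 2, 3 and no other edges.
star = {0: [], 1: [0], 2: [0], 3: [0]}
print(three_sink_count(star, latent=0, sinks=[1, 2, 3]))
```

Under these assumptions the first call reproduces C = {1}, u = 1 and k = 4, and the star model gives C = ∅, u = 0 and k = 2, matching the k column for model 3-0.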


Example: from beginning of talk

[Figure: the example DAG with latent node 0 and observable nodes 1–7]

Remove 7 (observable child with observable parents):

[Figure: the example DAG with node 7 removed, leaving latent node 0 and observable nodes 1–6]

Sinks 4,5,6 all children of 0,

C = Anc(Chd(0) ∩ Anc(4) ∩ Anc(5) ∩ Anc(6)) ∖ {0}

= ∅

u = |C ∩ Pa(4) ∩ Pa(5) ∩ Pa(6)|

= 0

so 2^(2^0) = 2-to-1 parameterization


Final comments:

• A 2-sink theorem is “under development,” building on

[Figure: DAG with latent node 0 and observable nodes 1, 2, 3, 4]

• Multiple latent variables with no/limited common children may

be tractable.

• Main impediment to non-binary variables is awkwardness of

statements.

