Causality Basics
(Based on Peters et al., 2017, "Elements of Causal Inference: Foundations and Learning Algorithms")
Chang Liu
2020.12.16

Page 2: Content
• Causality and Structural Causal Models.
• Causal Inference.
• Causal Discovery.

Page 3: Causality
• Formal definition of causality: "two variables have a causal relation if intervening on the cause may change the effect, but not vice versa" [Pearl, 2009; Peters et al., 2017].
• Intervention: change the value of a variable by leveraging its mechanism and changing variables outside the considered system.
• Example: for the Altitude and average Temperature of a city, $A \to T$.
  • Running a huge heater (intervening on $T$) does not lower $A$.
  • Raising the city with a huge elevator (intervening on $A$) lowers $T$.
• Causality contains more information than observation (static data).
  • E.g., both $p(A)\,p(T \mid A)$ (i.e., $A \to T$) and $p(T)\,p(A \mid T)$ (i.e., $T \to A$) can describe the observed relation $p(A, T)$.
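A minimal simulation sketch of this asymmetry (the linear temperature mechanism and all numbers are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(intervene_A=None, intervene_T=None):
    """Sample (A, T) from a toy SCM A -> T, allowing atomic interventions."""
    A = rng.uniform(0, 3, n) if intervene_A is None else np.full(n, intervene_A)
    # Mechanism: temperature drops with altitude, plus noise (illustrative numbers).
    T = (15.0 - 6.5 * A + rng.normal(0, 1, n)) if intervene_T is None else np.full(n, intervene_T)
    return A, T

A_obs, T_obs = sample()
A_doT, _ = sample(intervene_T=40.0)   # "running a huge heater"
_, T_doA = sample(intervene_A=3.0)    # "raising the city"

print(np.mean(A_obs), np.mean(A_doT))  # intervening on T leaves A unchanged
print(np.mean(T_obs), np.mean(T_doA))  # intervening on A changes T
```

Intervening on $T$ leaves the distribution of $A$ untouched, while intervening on $A$ shifts $T$; the observational joint alone cannot reveal this asymmetry.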

Page 4: Causality
The Independent Mechanism Principle.
• Principle 2.1 (Independent mechanisms): The causal generative process of a system's variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other conditional distributions.
• It is therefore possible to conduct a localized intervention: we can change one mechanism without affecting the others.

Page 5: Structural Causal Models
For describing causal relations.
• Definition 6.2 (Structural causal models) An SCM $\mathfrak{C} := (\mathbf{S}, P_\mathbf{N})$ consists of a collection $\mathbf{S}$ of $d$ (structural) assignments $X_j := f_j(\mathrm{pa}_j, N_j)$, $j = 1, \ldots, d$, where $\mathrm{pa}_j$ are called the parents of $X_j$, and a joint distribution $P_\mathbf{N} = P_{N_1, \ldots, N_d}$ over the noise variables, which we require to be jointly independent.
• The joint independence requirement on $\mathbf{N}$ comes from the Independent Mechanism Principle.
• Causal graph $\mathcal{G}$: the Directed Acyclic Graph constructed from the assignments (an edge from each parent to $X_j$).
• Proposition 6.3 (Entailed distributions) An SCM $\mathfrak{C}$ defines a unique distribution over the variables $\mathbf{X} = (X_1, \ldots, X_d)$ such that $X_j = f_j(\mathrm{pa}_j, N_j)$, $j = 1, \ldots, d$, in distribution, called the entailed distribution $P_\mathbf{X}^{\mathfrak{C}}$ and sometimes written $P_\mathbf{X}$.
• Implications of an SCM: [diagram not transcribed].
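A minimal sketch of Definition 6.2 and Proposition 6.3 in code; the dict-of-assignments representation and the concrete two-variable SCM are my own illustration, not from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# An SCM as (assignments S, noise distribution P_N); here d = 2 with X1 -> X2.
noises = {
    "X1": lambda n: rng.normal(0, 1, n),
    "X2": lambda n: rng.uniform(-1, 1, n),
}
assignments = {  # X_j := f_j(pa_j, N_j)
    "X1": lambda v, N: N,
    "X2": lambda v, N: v["X1"] ** 2 + N,
}
order = ["X1", "X2"]  # a topological order of the causal graph G

def sample_entailed(n, assignments=assignments):
    """Draw n samples from the entailed distribution P_X^C (Prop. 6.3)."""
    values = {}
    for j in order:
        values[j] = assignments[j](values, noises[j](n))
    return values

samples = sample_entailed(10_000)
print({k: round(v.mean(), 3) for k, v in samples.items()})
```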

Page 6: Structural Causal Models
Describing interventions using SCMs.
• "What if $X$ was set to $x$?" "What if patients took the treatment?"
• Definition 6.8 (Intervention distribution) Consider an SCM $\mathfrak{C} := (\mathbf{S}, P_\mathbf{N})$ and its entailed distribution $P_\mathbf{X}^{\mathfrak{C}}$. We replace one (or several) of the structural assignments to obtain a new SCM $\tilde{\mathfrak{C}}$. Assume that we replace the assignment for $X_k$ by $X_k := \tilde{f}(\widetilde{\mathrm{pa}}_k, \tilde{N}_k)$, where all the new and old noises are required to be jointly independent. The distribution entailed by $\tilde{\mathfrak{C}}$ is the intervention distribution.
• Atomic/ideal, structural [Eberhardt and Scheines, 2007] / surgical [Pearl, 2009] / independent / deterministic [Korb et al., 2004] intervention: when $\tilde{f}(\widetilde{\mathrm{pa}}_k, \tilde{N}_k)$ puts a point mass on a value $a$. We simply write $P_\mathbf{X}^{\mathfrak{C};\,do(X_k := a)}$.
• Soft intervention: intervene with a distribution.

๐‘‹

๐‘‡ ๐‘Œ

๐‘‹

๐‘ก ๐‘Œ

Page 7: Structural Causal Models
Describing interventions using SCMs.
• Definition 6.12 (Total causal effect) Given an SCM $\mathfrak{C}$, there is a total causal effect from $X$ to $Y$ if $X$ and $Y$ are dependent in $P_\mathbf{X}^{\mathfrak{C};\,do(X := \tilde{N}_X)}$ for some random variable $\tilde{N}_X$.
• Proposition 6.13 (Total causal effects) The following statements are equivalent, given an SCM $\mathfrak{C}$:
  1. There is a total causal effect from $X$ to $Y$.
  2. There are $x^1$ and $x^2$ such that $P_Y^{\mathfrak{C};\,do(X := x^1)} \neq P_Y^{\mathfrak{C};\,do(X := x^2)}$.
  3. There is $x^1$ such that $P_Y^{\mathfrak{C};\,do(X := x^1)} \neq P_Y^{\mathfrak{C}}$.
  4. $X$ and $Y$ are dependent in $P_{X,Y}^{\mathfrak{C};\,do(X := \tilde{N}_X)}$ for any $\tilde{N}_X$ whose distribution has full support.
• Proposition 6.14 (Graphical criteria for total causal effects) Consider an SCM $\mathfrak{C}$ with corresponding graph $\mathcal{G}$.
  • If there is no directed path from $X$ to $Y$, then there is no total causal effect.
  • Sometimes there is a directed path but no total causal effect, as in the sketch below.
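A hypothetical example of the last point, with two directed paths from $X$ to $Y$ that cancel exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# SCM: X -> A -> Y and X -> Y, with exact cancellation (contrived on purpose).
X = rng.normal(0, 1, n)
A = X + rng.normal(0, 0.1, n)          # A := X + N_A
Y = A - X + rng.normal(0, 1, n)        # Y := A - X + N_Y

# Under do(X := N~), Y = (X + N_A) - X + N_Y = N_A + N_Y: independent of X.
X_int = rng.normal(3, 1, n)            # a randomized intervention on X
A_int = X_int + rng.normal(0, 0.1, n)
Y_int = A_int - X_int + rng.normal(0, 1, n)
print(np.corrcoef(X_int, Y_int)[0, 1])  # ~0: no total causal effect
```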

Page 8: Structural Causal Models
Describing counterfactuals using SCMs.
• "What if I had taken the treatment?" "What would the outcome be had I acted differently at that moment/situation?"
• Intervention {in a given situation/context} / {for a given individual/unit}; intervention with a conditional.
• Definition 6.17 (Counterfactuals) Consider an SCM $\mathfrak{C} := (\mathbf{S}, P_\mathbf{N})$. Given some observations $\mathbf{x}$, define a counterfactual SCM by replacing the distribution of noise variables: $\mathfrak{C}_{\mathbf{X}=\mathbf{x}} := (\mathbf{S}, P_\mathbf{N}^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x}})$, where $P_\mathbf{N}^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x}} := P_{\mathbf{N} \mid \mathbf{X}=\mathbf{x}}$ (the noises need not be jointly independent anymore).
• "The probability that $Z$ would have been 7 (say) had $Y$ been set to 2 for observation $\mathbf{x}$": $P_Z^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x};\,do(Y := 2)}(Z = 7)$.
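A sketch of the abduction-action-prediction reading of Definition 6.17, on a toy SCM simple enough that the noise posterior given the observation is a point mass; the SCM and numbers are my own, arranged so that the slide's "Z would have been 7" comes out exactly:

```python
# Counterfactual in a toy SCM: Y := N_Y, Z := 2*Y + N_Z (invertible in the noises).
y_obs, z_obs = 1.0, 5.0   # the observation x = (y, z)

# Abduction: condition P_N on the observation (here the posterior is a point mass).
n_y = y_obs               # N_Y = Y = 1
n_z = z_obs - 2 * y_obs   # N_Z = Z - 2Y = 3

# Action: intervene do(Y := 2) in the counterfactual SCM C_{X=x}.
y_cf = 2.0

# Prediction: evaluate the assignments with the abducted noise values.
z_cf = 2 * y_cf + n_z
print(z_cf)               # 7.0: Z would have been 7, had Y been set to 2
```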

Page 9: Structural Causal Models
Observational implications (SCM as a BayesNet); causality and correlation.
• Definition 6.1 (d-separation) A path $P$ is d-separated (blocked) by a set of nodes $\mathbf{S}$ iff $P$ either contains an emitter ($\to X \to$ or $\leftarrow X \to$) that is in $\mathbf{S}$, or a collider ($\to X \leftarrow$) such that neither it nor any of its descendants is in $\mathbf{S}$.
• Theorem 6.22 (Equivalence of Markov properties) When $P_\mathbf{X}$ has a density, the following are equivalent: (i) the global Markov property (d-separation in $\mathcal{G}$ implies conditional independence), (ii) the local Markov property, and (iii) the Markov factorization $p(\mathbf{x}) = \prod_{j=1}^{d} p(x_j \mid \mathrm{pa}_j)$.
• (iii) is what we mean by using a DAG/BayesNet to define a joint distribution (i.e., the distribution entailed by the SCM). Its equivalence to (i) means that we can read off conditional independence assertions of this joint distribution from the graph.
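d-separation can be checked mechanically; a small sketch with networkx (the checker is nx.d_separated in older releases and nx.is_d_separator in newer ones, hence the getattr):

```python
import networkx as nx

# Collider X -> Z <- Y, plus a descendant W of the collider Z.
G = nx.DiGraph([("X", "Z"), ("Y", "Z"), ("Z", "W")])

# nx.d_separated (networkx >= 2.4) was renamed nx.is_d_separator (>= 3.3).
d_sep = getattr(nx, "is_d_separator", getattr(nx, "d_separated", None))

print(d_sep(G, {"X"}, {"Y"}, set()))   # True: the collider blocks the path
print(d_sep(G, {"X"}, {"Y"}, {"Z"}))   # False: conditioning on the collider opens it
print(d_sep(G, {"X"}, {"Y"}, {"W"}))   # False: so does conditioning on a descendant
```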

Page 10: Structural Causal Models
Observational implications (SCM as a BayesNet); causality and correlation.
• Confounding bias.
  • $T$ and $Y$ are correlated if $X$ is not given.
  • E.g., $T$ = chocolate consumption, $Y$ = number of Nobel prizes, $X$ = development status.
• Simpson's paradox.
  • $T$: treatment. $Y$: recovery. $X$: socio-economic status.
  • Is the treatment effective for recovery? It can happen that $p(Y=1 \mid T=1) > p(Y=1 \mid T=0)$, but $p^{do(T:=1)}(Y=1) < p^{do(T:=0)}(Y=1)$.
  • Since $p(Y \mid t) = \sum_x p(Y \mid x, t)\, p(x \mid t)$, conditioning on $T = 1$ infers a good $X$ and thus a higher probability of recovery.

๐‘‹

๐‘‡ ๐‘Œ

๐‘‹

๐‘‡ ๐‘Œ

๐‘‹

๐‘ก ๐‘Œ

Page 11: Structural Causal Models
Observational implications (SCM as a BayesNet); causality and correlation.
• Selection bias.
  • $H$ and $F$ are correlated if secretly/unconsciously conditioned on $R$.
• Berkson's paradox.
  • "Why are handsome men such jerks?" [Ellenberg, 2014]
  • $H$ = handsome, $F$ = friendly, $R$ = in a relationship.
  • Single women often date men with $R = 0$. Such men tend to be either not $H$ or not $F$.
• (Both confounding and selection produce such spurious correlations.)

[Diagram: the collider structure $H \to R \leftarrow F$.]
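A short simulation of the collider; the Bernoulli model ("a relationship mostly requires both $H$ and $F$") is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

H = rng.random(n) < 0.5                    # handsome
F = rng.random(n) < 0.5                    # friendly (independent of H)
R = (H & F) & (rng.random(n) < 0.8)        # relationship: mostly requires both

print(np.corrcoef(H, F)[0, 1])             # ~0 in the whole population
single = ~R
print(np.corrcoef(H[single], F[single])[0, 1])  # negative: Berkson's paradox
```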

Page 12: Structural Causal Models
Levels of model according to prediction ability:
• Probabilistic/observational model (machine learning models): $P_\mathbf{X}$.
• + intervention distributions ⇒ interventional model (causal graphical models).
• + counterfactual statements ⇒ counterfactual model (SCMs).
• An interventional model has the utility of a probabilistic/observational model: the four graphs below all entail the same observational distribution,
  $p(X, Y) = p(X)\,p(Y \mid X) = p(Y)\,p(X \mid Y) = \int p(z)\,p(X \mid z)\,p(Y \mid z)\,\mathrm{d}z = \int p(z)\,p(X \mid z)\,p(Y \mid X, z)\,\mathrm{d}z$,
  but the observational model cannot determine the interventional distribution, which differs across them:
  $p^{do(X:=x)}(Y) = p(Y \mid x)$, $\ p(Y)$, $\ \int p(z)\,p(Y \mid z)\,\mathrm{d}z = p(Y)$, $\ \int p(z)\,p(Y \mid x, z)\,\mathrm{d}z$, respectively.
  [Diagrams: $X \to Y$; $X \leftarrow Y$; $X \leftarrow Z \to Y$; and $Z \to X$, $Z \to Y$, $X \to Y$.]
• Example 6.19: Two SCMs that induce the same graph, observational distributions, and intervention distributions can still entail different counterfactual statements ("probabilistically and interventionally equivalent" but not "counterfactually equivalent"). See the sketch below.
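A sketch in the spirit of Example 6.19, using my own binary construction rather than the book's exact one: two SCMs over the graph $X \to Y$ that agree observationally and interventionally yet disagree counterfactually.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

X = rng.integers(0, 2, n)
NY = rng.integers(0, 2, n)
N1, N2 = rng.integers(0, 2, n), rng.integers(0, 2, n)

Y_A = X ^ NY                    # SCM A: Y := X XOR N_Y
Y_B = np.where(X == 1, N1, N2)  # SCM B: Y := N_1 if X = 1, else N_2

# Observationally (and under do(X := x)) both give Y ~ Bernoulli(1/2), indep. of X:
print(Y_A[X == 0].mean(), Y_A[X == 1].mean(), Y_B[X == 0].mean(), Y_B[X == 1].mean())

# Counterfactual "Y had X been 1", for units observed with X = 0, Y = 0:
obs_A = (X == 0) & (Y_A == 0)       # abduction in A: N_Y must be 0
print((1 ^ NY[obs_A]).mean())       # -> 1.0: Y would certainly have been 1
obs_B = (X == 0) & (Y_B == 0)       # abduction in B: N_2 = 0, N_1 unconstrained
print(N1[obs_B].mean())             # -> ~0.5: Y would have been 1 half the time
```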

Page 13: Structural Causal Models
Levels of model according to prediction ability. [Figure not transcribed.]

Page 15: Causal Inference

• Given an SCM, infer the distribution/outcome of a variable under an intervention or a unit-level intervention (counterfactual).
• Estimate causal effects:
  • Average treatment effect (ATE): $\mathbb{E}^{do(T:=1)}[Y] - \mathbb{E}^{do(T:=0)}[Y]$.
  • Individual treatment effect (ITE): $\mathbb{E}^{do(T:=1)}[Y \mid \mathbf{x}] - \mathbb{E}^{do(T:=0)}[Y \mid \mathbf{x}]$.
• Estimating the interventional distribution is the key task (a counterfactual is the conditional in an intervened SCM).
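A minimal Monte-Carlo sketch of the ATE on a hypothetical SCM; it compares the two intervened copies directly, which is possible only because the SCM itself is known:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical SCM: X -> T, X -> Y, T -> Y.
X = rng.normal(0, 1, n)
NY = rng.normal(0, 1, n)

def Y_under_do(t):                  # mutilated SCM with T := t
    return 1.5 * t + 2.0 * X + NY

ate = Y_under_do(1).mean() - Y_under_do(0).mean()
print(ate)                          # ~1.5, the coefficient on T
```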

Page 16: Causal Inference
• Truncated factorization [Pearl, 1993] / G-computation formula [Robins, 1986] / manipulation theorem [Spirtes et al., 2000]:
  $p^{\mathfrak{C};\,do(X_k := \tilde{N}_k)}(x_1, \ldots, x_d) = \tilde{p}(x_k) \prod_{j \neq k} p^{\mathfrak{C}}(x_j \mid x_{\mathrm{pa}_j})$, where $\tilde{p}$ is the density of $\tilde{N}_k$.
• Almost evident from the interventional SCM.
• Relies on the autonomy/modularity of causality (the principle of independent mechanisms): one causal mechanism does not influence or inform other causal mechanisms.
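A sketch verifying the formula on a small binary chain $Z \to X \to Y$ (all tables hypothetical): the truncated factorization is compared against Monte-Carlo samples from the mutilated SCM.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical binary chain Z -> X -> Y.
p_z1 = 0.6
p_x1_given_z = {0: 0.2, 1: 0.7}
p_y1_given_x = {0: 0.3, 1: 0.9}
p_tilde_x1 = 0.3                    # soft intervention: X := N~ with p(N~=1) = 0.3

def p_do(z, x, y):
    """Truncated factorization: p^{do}(z,x,y) = p(z) * p~(x) * p(y|x)."""
    pz = p_z1 if z else 1 - p_z1
    px = p_tilde_x1 if x else 1 - p_tilde_x1
    py = p_y1_given_x[x] if y else 1 - p_y1_given_x[x]
    return pz * px * py

# Monte-Carlo from the mutilated SCM: only X's assignment is replaced.
n = 500_000
Z = (rng.random(n) < p_z1).astype(int)
X = (rng.random(n) < p_tilde_x1).astype(int)            # ignores Z now
Y = (rng.random(n) < np.where(X == 1, p_y1_given_x[1], p_y1_given_x[0])).astype(int)

for z, x, y in product([0, 1], repeat=3):
    mc = np.mean((Z == z) & (X == x) & (Y == y))
    print((z, x, y), round(p_do(z, x, y), 4), round(mc, 4))
```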

Page 17: Causal Inference
• Definition 6.38 (Valid adjustment set, v.a.s.) Consider an SCM $\mathfrak{C}$ over nodes $\mathbf{V}$. Let $X, Y \in \mathbf{V}$ with $Y \notin \mathrm{pa}_X$. We call a set $\mathbf{Z} \subseteq \mathbf{V} \setminus \{X, Y\}$ a v.a.s. for $(X, Y)$ if $p^{\mathfrak{C};\,do(X:=x)}(y) = \sum_{\mathbf{z}} p^{\mathfrak{C}}(y \mid x, \mathbf{z})\, p^{\mathfrak{C}}(\mathbf{z})$.
• Importance: we can then use observations of the variables in a v.a.s. to estimate an intervention distribution.
• By the previous result (truncated factorization), $\mathrm{pa}_X$ is a v.a.s.
• Are there any other v.a.s.'s?

๐‘

๐‘‹ ๐‘Œ

๐‘

๐‘ฅ ๐‘Œ

Page 18: Causal Inference
• Proposition 6.41 (Valid adjustment sets) The following $\mathbf{Z}$'s are v.a.s.'s for $(X, Y)$, where $X, Y \in \mathbf{X}$ and $Y \notin \mathrm{pa}_X$:
  1. "Parent adjustment": $\mathbf{Z} := \mathrm{pa}_X$;
  2. "Backdoor criterion" (Pearl's admissible/sufficient set): any $\mathbf{Z} \subseteq \mathbf{X} \setminus \{X, Y\}$ such that
     • $\mathbf{Z}$ contains no descendant of $X$, and
     • $\mathbf{Z}$ blocks all paths from $X$ to $Y$ entering $X$ through the backdoor ($X \leftarrow \cdots Y$);
  3. "Toward necessity" (characterizes all v.a.s.'s) [Shpitser et al., 2010; Perkovic et al., 2015]: any $\mathbf{Z} \subseteq \mathbf{X} \setminus \{X, Y\}$ such that
     • $\mathbf{Z}$ contains no node ($X$ excluded) on a directed path from $X$ to $Y$, nor any descendant of such a node, and
     • $\mathbf{Z}$ blocks all non-directed paths from $X$ to $Y$.


Page 20: Causal Inference
Do-calculus: an alternative way to estimate an interventional distribution from observations.
• The three rules of do-calculus: consider disjoint subsets $\mathbf{X}, \mathbf{Y}, \mathbf{Z}, \mathbf{W}$ of nodes of a DAG $\mathcal{G}$.
  1. "Insertion/deletion of observations": $p^{\mathfrak{C};\,do(\mathbf{X}:=\mathbf{x})}(\mathbf{y} \mid \mathbf{z}, \mathbf{w}) = p^{\mathfrak{C};\,do(\mathbf{X}:=\mathbf{x})}(\mathbf{y} \mid \mathbf{w})$, if $\mathbf{Y}$ and $\mathbf{Z}$ are d-separated by $\mathbf{X}, \mathbf{W}$ in a graph where incoming edges into $\mathbf{X}$ have been removed.
  2. "Action/observation exchange": $p^{\mathfrak{C};\,do(\mathbf{X}:=\mathbf{x},\, \mathbf{Z}:=\mathbf{z})}(\mathbf{y} \mid \mathbf{w}) = p^{\mathfrak{C};\,do(\mathbf{X}:=\mathbf{x})}(\mathbf{y} \mid \mathbf{z}, \mathbf{w})$, if $\mathbf{Y}$ and $\mathbf{Z}$ are d-separated by $\mathbf{X}, \mathbf{W}$ in a graph where incoming edges into $\mathbf{X}$ and outgoing edges from $\mathbf{Z}$ have been removed.
  3. "Insertion/deletion of actions": $p^{\mathfrak{C};\,do(\mathbf{X}:=\mathbf{x},\, \mathbf{Z}:=\mathbf{z})}(\mathbf{y} \mid \mathbf{w}) = p^{\mathfrak{C};\,do(\mathbf{X}:=\mathbf{x})}(\mathbf{y} \mid \mathbf{w})$, if $\mathbf{Y}$ and $\mathbf{Z}$ are d-separated by $\mathbf{X}, \mathbf{W}$ in a graph where incoming edges into $\mathbf{X}$ and into $\mathbf{Z}(\mathbf{W})$ have been removed. Here, $\mathbf{Z}(\mathbf{W})$ is the subset of nodes in $\mathbf{Z}$ that are not ancestors of any node in $\mathbf{W}$ in the graph obtained from $\mathcal{G}$ by removing all edges into $\mathbf{X}$.

Page 21: Causal Inference
Do-calculus: an alternative way to estimate an interventional distribution from observations.
• Theorem 6.45 (Do-calculus)
  1. The three rules are complete: all identifiable interventional distributions can be computed by an iterative application of these three rules [Huang and Valtorta, 2006; Shpitser and Pearl, 2006] (more general than v.a.s.).
  2. There is an algorithm [Tian, 2002] that is guaranteed [Huang and Valtorta, 2006; Shpitser and Pearl, 2006] to find all identifiable interventional distributions.
  3. There is a necessary and sufficient graphical criterion for identifiability of interventional distributions [Shpitser and Pearl, 2006, Corollary 3], based on so-called hedges [see also Huang and Valtorta, 2006].

Page 22: Causal Inference
Do-calculus: an alternative way to estimate an interventional distribution from observations.
• Example 6.46 (Front-door adjustment). [Diagram: $X \to Z \to Y$, with a hidden confounder $U$ pointing into both $X$ and $Y$.]
• There is no v.a.s. if we do not observe $U$.
• But $p^{\mathfrak{C};\,do(X:=x)}(y)$ is identifiable using the do-calculus if $p^{\mathfrak{C}}(z \mid x) > 0$:
  $p^{\mathfrak{C};\,do(X:=x)}(y) = \sum_z p^{\mathfrak{C}}(z \mid x) \sum_{x'} p^{\mathfrak{C}}(y \mid x', z)\, p^{\mathfrak{C}}(x')$.
• Observing $Z$ in addition to $X$ and $Y$ reveals causal information nicely: causal relations can be explored by observing the "channel" $Z$ that carries the "signal" from $X$ to $Y$.

Page 23: Causal Discovery
Learn a causal model (the causal graph of an SCM) from data.
• In most cases, only observational data is available: observational causal discovery.
Two mainstream classes of methods:
• Conditional Independence (CI) / constraint-based methods (e.g., the PC algorithm).
• Directional methods (e.g., additive noise models).

Page 24: Causal Discovery
CI-based methods.
• Assumption: $P_\mathbf{X}$ is Markovian (makes CI assertions as the graph does, in one of the three equivalent forms) and faithful (makes no more CI assertions than the graph does) w.r.t. the causal graph.
• The Markov equivalence class is then identifiable (Lemma 7.2; Spirtes et al., 2000).
• Definition 6.24 (Markov equivalence of graphs) Two DAGs are said to be Markov equivalent if they have the same set of distributions that are Markovian w.r.t. them; equivalently, they entail the same set of d-separations.
• Lemma 6.25 (Graphical criteria for Markov equivalence; Verma and Pearl [1991]; Frydenberg [1990]) Two DAGs are Markov equivalent iff they have the same skeleton and the same immoralities (v-structures).

Page 25: Causal Discovery
CI-based methods.
• Instances:
  • Inductive Causation (IC) algorithm [Pearl, 2009].
  • SGS algorithm (Spirtes, Glymour, Scheines) [Spirtes et al., 2000].
  • PC algorithm (Peter Spirtes & Clark Glymour) [Spirtes et al., 2000].

Page 26: Causal Discovery
CI-based methods.
• Algorithm: estimation of the skeleton.
  • Lemma 7.8 [Verma and Pearl, 1991, Lemma 1] (i) Two nodes $X, Y$ in a DAG are adjacent iff they cannot be d-separated by any subset $\mathbf{S} \subseteq \mathbf{X} \setminus \{X, Y\}$. (ii) If two nodes $X, Y$ in a DAG are not adjacent, then they are d-separated by either $\mathrm{pa}_X$ or $\mathrm{pa}_Y$.
  • IC/SGS: for each pair of nodes $(X, Y)$, search through all such $\mathbf{S}$'s. $X, Y$ are adjacent iff they are conditionally dependent given every such $\mathbf{S}$ (by (i)).
  • PC: efficient search for $\mathbf{S}$. Start with a fully connected undirected graph, and traverse such $\mathbf{S}$'s in order of increasing size. For size $k$, one only has to go through $\mathbf{S}$'s that are subsets either of the neighbors of $X$ or of the neighbors of $Y$ (by the contrapositive of (ii)).
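A sketch of the skeleton phase in the PC style; partial correlation serves as the CI test (so it is only appropriate for roughly linear-Gaussian data), and the search is my simplification (subsets from the neighbors of one endpoint only), not the full algorithm:

```python
import numpy as np
from itertools import combinations

def ci_test(data, i, j, S, thresh=0.02):
    """Crude CI test: partial correlation of X_i, X_j given X_S (linear-Gaussian)."""
    def residual(k):
        if not S:
            return data[:, k] - data[:, k].mean()
        A = np.column_stack([data[:, list(S)], np.ones(len(data))])
        beta, *_ = np.linalg.lstsq(A, data[:, k], rcond=None)
        return data[:, k] - A @ beta
    r = np.corrcoef(residual(i), residual(j))[0, 1]
    return abs(r) < thresh

def pc_skeleton(data):
    d = data.shape[1]
    adj = {i: set(range(d)) - {i} for i in range(d)}   # fully connected start
    k = 0
    while any(len(adj[i] - {j}) >= k for i in adj for j in adj[i]):
        for i in range(d):
            for j in list(adj[i]):
                if j < i:                              # test each edge once per size k
                    continue
                for S in combinations(adj[i] - {j}, k):  # only neighbors of i
                    if ci_test(data, i, j, S):
                        adj[i].discard(j); adj[j].discard(i)
                        break
        k += 1
    return adj

# Chain X0 -> X1 -> X2: skeleton should be {0-1, 1-2}, with no 0-2 edge.
rng = np.random.default_rng(0)
X0 = rng.normal(size=20_000)
X1 = X0 + 0.5 * rng.normal(size=20_000)
X2 = X1 + 0.5 * rng.normal(size=20_000)
print(pc_skeleton(np.column_stack([X0, X1, X2])))
```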

Page 27: Causal Discovery
CI-based methods.
• Algorithm: orientation of edges.
  • For a structure $X - Z - Y$ with no edge between $X$ and $Y$ in the skeleton, let $\mathbf{A}$ be a set that d-separates $X$ and $Y$. The structure can be oriented as $X \to Z \leftarrow Y$ iff $Z \notin \mathbf{A}$.
  • More edges can then be oriented in order to, e.g., avoid cycles.
  • Meek's orientation rules [Meek, 1995]: a complete set of such orientation rules.

Page 28: Causal Discovery
Additive Noise Models (ANMs).
• Philosophy: define a family of "particularly natural" conditionals such that if a conditional can be described by this family, then the conditional in the opposite direction cannot be covered by it.
• This determines a causal direction even for two variables with only observational data.
• Definition 7.3 (ANMs) We call an SCM $\mathfrak{C}$ an ANM if the structural assignments are of the form $X_j := f_j(\mathrm{pa}_j) + N_j$, $j = 1, \ldots, d$. Further assume that the $f_j$'s are differentiable and the $N_j$'s have strictly positive densities.
• Recall that in the definition of an SCM, the $N_j$'s are required to be jointly independent, in keeping with the independent mechanism principle.
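A bivariate sketch of the resulting discovery rule; polynomial regression and a crude residual-dependence score stand in for a nonparametric regressor and a proper independence test such as HSIC (both stand-ins are my simplification):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

X = rng.uniform(-2, 2, n)
Y = X + X**3 + 0.5 * rng.normal(size=n)     # ANM in the direction X -> Y

def dep_score(cause, effect, deg=5):
    """Fit a polynomial regression and score how much the residual still
    depends on the input (a crude stand-in for an HSIC independence test)."""
    coef = np.polyfit(cause, effect, deg)
    resid = effect - np.polyval(coef, cause)
    # The residual's magnitude should not track the input under a correct ANM.
    return abs(np.corrcoef(np.abs(resid), cause)[0, 1]) + \
           abs(np.corrcoef(np.abs(resid), cause**2)[0, 1])

print(dep_score(X, Y))  # small: residuals look independent of X
print(dep_score(Y, X))  # larger: no backward ANM fits, so infer X -> Y
```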

Page 29: Causal Discovery
Additive Noise Models (ANMs).
• Philosophy: [illustration not transcribed.]

Page 30: Causal Discovery
Additive Noise Models (ANMs).
• All the non-identifiable cases of ANMs [Zhang & Hyvarinen, 2009, Thm. 8; Peters et al., 2014, Prop. 23]: Consider the bivariate ANM $X \to Y$, $Y := f(X) + N$, where $N \perp X$ and $f$ is three times differentiable. Assume that $f'(x)\,(\log p_N)''(y) = 0$ only at finitely many points $(x, y)$. If there is a backward ANM $Y \to X$, then one of the following must hold:
  (i) $X$: Gaussian; $N$: Gaussian; $f$: linear.
  (ii) $X$: log-mix-lin-exp; $N$: log-mix-lin-exp; $f$: linear.
  (iii), (iv) $X$: log-mix-lin-exp; $N$: one-sided asymptotically exponential, or a generalized mixture of two exponentials; $f$: strictly monotonic with $f'(x) \to 0$ as $x \to \infty$ or $x \to -\infty$.
  (v) $X$: generalized mixture of two exponentials; $N$: two-sided asymptotically exponential; $f$: strictly monotonic with $f'(x) \to 0$ as $x \to \infty$ or $x \to -\infty$.

Page 31: Causal Discovery
Common choices of ANMs:
• In the multivariate case, even linear Gaussian with equal noise variances is identifiable [Peters and Buhlmann, 2014].
• Linear Non-Gaussian Acyclic Models (LiNGAMs) [Thm. 7.6; Shimizu et al., 2006].
• Nonlinear Gaussian Additive Noise Models [Thm. 7.7; Peters et al., 2014, Cor. 31].

Page 32: Causal Discovery
Learning a causal graph with ANMs.
• Score-based methods.
  • Maximize the likelihood using a greedy search technique (for optimization over the graph space).
  • For nonlinear Gaussian ANMs [Buhlmann et al., 2014]: given a current graph $\mathcal{G}$, regress each variable on its parents and obtain the score $\log p(\mathcal{D} \mid \mathcal{G}) := \log p(\mathcal{D} \mid \mathcal{G}, \hat{\theta}) = -\sum_{j=1}^{d} \log \widehat{\mathrm{var}}[R_j]$, where $R_j := X_j - \hat{f}_j(\mathrm{pa}_j)$ is the residual.
  • When computing the score of a neighboring graph, utilize the decomposition and update only the altered summand.
  • The procedure can be shown to be consistent [Buhlmann et al., 2014].
  • If the noise cannot be assumed to be Gaussian, one can estimate the noise distribution [Nowzohour and Buhlmann, 2016] and obtain an entropy-like score.
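A sketch of this score for a fixed candidate graph, with polynomial regression standing in for the nonparametric regressor and hypothetical three-variable data:

```python
import numpy as np

def graph_score(data, parents, deg=3):
    """Score -sum_j log var-hat(R_j) for a nonlinear Gaussian ANM, where R_j is
    the residual of regressing X_j on its parents in G (polynomial stand-in)."""
    score = 0.0
    for j, pa in parents.items():
        x = data[:, j]
        if not pa:
            resid = x - x.mean()
        else:
            # Simple additive polynomial regression on the parents.
            A = np.column_stack([data[:, p] ** k for p in pa for k in range(1, deg + 1)]
                                + [np.ones(len(data))])
            beta, *_ = np.linalg.lstsq(A, x, rcond=None)
            resid = x - A @ beta
        score -= np.log(resid.var())
    return score

rng = np.random.default_rng(0)
X0 = rng.normal(size=10_000)
X1 = np.sin(X0) + 0.3 * rng.normal(size=10_000)
X2 = X1 ** 2 + 0.3 * rng.normal(size=10_000)
data = np.column_stack([X0, X1, X2])

true_g = {0: [], 1: [0], 2: [1]}
wrong_g = {0: [], 1: [2], 2: [0]}      # a mis-oriented neighboring graph
print(graph_score(data, true_g))
print(graph_score(data, wrong_g))      # lower score than the true graph
```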

Page 33: Causal Discovery
Learning a causal graph with ANMs.
• RESIT (regression with subsequent independence test) [Mooij et al., 2009; Peters et al., 2014]:
  • Phase 1: based on the fact that for each node $X_i$, the corresponding $N_i$ is independent of all non-descendants of $X_i$. In particular, for each sink node $X_i$, $N_i \perp \mathbf{X} \setminus \{X_i\}$. Iteratively identify a sink, remove it, and recurse to obtain a causal order.
  • Phase 2: visit every node and eliminate incoming edges as long as the residuals remain independent.
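A sketch of Phase 1 only, reusing the same crude stand-ins as on Page 28 (real RESIT uses a proper independence test such as HSIC):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical ANM with causal order X0 -> X1 -> X2.
X0 = rng.uniform(-2, 2, n)
X1 = np.tanh(2 * X0) + 0.3 * rng.normal(size=n)
X2 = X1 ** 3 + 0.3 * rng.normal(size=n)
data = np.column_stack([X0, X1, X2])

def regress_residual(y, Z, deg=3):
    """Additive polynomial regression of y on the columns of Z; returns residual."""
    A = np.column_stack([Z[:, k] ** p for k in range(Z.shape[1])
                         for p in range(1, deg + 1)] + [np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

def dep(resid, others):
    """Crude residual-dependence score (stand-in for an HSIC independence test)."""
    a = np.abs(resid)
    return sum(abs(np.corrcoef(a, others[:, k])[0, 1]) +
               abs(np.corrcoef(a, others[:, k] ** 2)[0, 1])
               for k in range(others.shape[1]))

# Phase 1: repeatedly declare as sink the variable whose residual, after
# regressing it on all remaining variables, looks most independent of them.
remaining, reversed_order = [0, 1, 2], []
while len(remaining) > 1:
    scores = {}
    for i in remaining:
        rest = [k for k in remaining if k != i]
        scores[i] = dep(regress_residual(data[:, i], data[:, rest]), data[:, rest])
    sink = min(scores, key=scores.get)
    reversed_order.append(sink)
    remaining.remove(sink)
reversed_order.append(remaining[0])
print(reversed_order[::-1])   # estimated causal order, hopefully [0, 1, 2]
```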

Page 34: Causal Discovery
Learning a causal graph with ANMs.
• Independence-Based Score [Peters et al., 2014]:
  • RESIT is asymmetric, so mistakes propagate and accumulate over the iterations.
  • Intuition: (i) under the true graph $\mathcal{G}^0$, all the residuals are independent of each other; (ii) causal minimality helps identify a unique DAG. [Score formula not transcribed.]
  • $\mathrm{res}_i^{\mathcal{G},\mathrm{RM}}$ are the residuals of node $X_i$ when it is regressed on its parents specified in $\mathcal{G}$ using regression method RM.
  • DM denotes a dependence measure. The second term of the score encourages causal minimality.
• ICA-based methods for LiNGAMs [Shimizu et al., 2006; 2011].

Page 35: Causal Discovery
• CI-based methods.
  • More faithful to data.
  • Identification only up to a Markov equivalence class.
  • CI test methods may be tricky.
• ANMs.
  • A relatively strong assumption/belief (that one family is particularly natural for modeling conditionals).
  • Identification within a Markov equivalence class.
• Leveraging intervened datasets.
  • Based on the invariance principle of causality.
  • Identification within a Markov equivalence class.
  • Stronger requirements on data.
  • [Peters et al., 2016; Romeijn & Williamson, 2018; Bengio et al., 2019].

Page 36: Thanks!