
Alexander Marx, Jilles Vreeken

Telling Cause from Effect using MDL-based Local and Global Regression

November 20, 2017

Question of the Day

How can we infer the causal direction between two univariate numeric random variables $X$ and $Y$ and give a reliable confidence measure?

How this is currently done

[Figure: scatter plot of Y against X, both axes scaled to [0, 1]]

Additive Noise Models (ANMs)

If $X \to Y$, there exists a function $f$ such that

$$Y = f(X) + N, \quad \text{with } N \perp\!\!\!\perp X,$$

but not for the inverse direction.

Problems
- need for a powerful independence test
- often independence holds (to some extent) in both directions
- $p$-values are sensitive to sample size

How we do it

We build on the Algorithmic Markov Condition.

Formally, if $X \to Y$, then

$$K(P(X)) + K(P(Y \mid X)) \le K(P(Y)) + K(P(X \mid Y)),$$

where $K(x)$ denotes the Kolmogorov complexity of $x$, which is the length of the shortest program that outputs $x$ and halts.

(Janzing and Schölkopf 2010)

Informally: “Simpler to describe cause, and then effect given cause, than if we do so vice versa.”

The Minimum Description Length Principle

Approximate $K$ with $L$, using two-part MDL.

Given a model class $\mathcal{M}$, the best model $M \in \mathcal{M}$ for data $D$ is the $M$ that minimizes the total encoded cost:

$$\underbrace{L(D)}_{\text{cost to encode } D \text{ and } M} = \min_{M \in \mathcal{M}} \Big( \underbrace{L(M)}_{\text{complexity of } M} + \underbrace{L(D \mid M)}_{\text{complexity of } D \text{ given } M} \Big)$$

MDL avoids overfitting by considering both the fit of the model and the model complexity.
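To make the two-part score concrete, the sketch below selects a polynomial degree by minimizing L(M) + L(D | M). It is an illustration under stated assumptions, not the encoding used in these slides: the model cost is a flat 32 bits per coefficient, the data cost is a Gaussian code at a hypothetical resolution `tau`, and the names `gaussian_bits` and `select_degree` are made up for this example.

```python
import math
import numpy as np

def gaussian_bits(residuals, tau=0.01):
    """Bits to encode residuals with a Gaussian code at resolution tau (assumed value)."""
    n = len(residuals)
    # Maximum-likelihood variance; the floor guards against log(0) on a perfect fit.
    sigma2 = max(float(np.mean(np.square(residuals))), 1e-12)
    return n / 2 * (1 / math.log(2) + math.log2(2 * math.pi * sigma2)) - n * math.log2(tau)

def select_degree(x, y, max_degree=5, bits_per_param=32.0):
    """Return (degree, total_bits) minimizing L(M) + L(D | M) over polynomial degrees."""
    best = None
    for degree in range(max_degree + 1):
        coeffs = np.polyfit(x, y, degree)
        residuals = y - np.polyval(coeffs, x)
        model_bits = (degree + 1) * bits_per_param  # L(M): assumed flat cost per coefficient
        data_bits = gaussian_bits(residuals)        # L(D | M)
        total = model_bits + data_bits
        if best is None or total < best[1]:
            best = (degree, total)
    return best

# Synthetic quadratic with Gaussian noise (illustrative data only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.1, 200)
degree, total_bits = select_degree(x, y)
```

Degrees above two reduce the residual cost only marginally here, so the extra parameter bits outweigh the gain; degrees below two pay heavily in residual cost.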

MDL as a Practical Instantiation

If $X \to Y$, we have that

$$\frac{L(X) + L(Y \mid X)}{L(X) + L(Y)} < \frac{L(Y) + L(X \mid Y)}{L(X) + L(Y)}$$

Encode using two-part MDL
- restrict the model class to regression functions
- that is, we need to find the function that minimizes the complexity of the model and of the data given the model
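The inequality above translates directly into a decision rule. The following is a minimal sketch that assumes the four code lengths (in bits) have already been computed; the function name `infer_direction` and the "undecided" outcome for ties are assumptions of this example, not part of the slides.

```python
def infer_direction(l_x, l_y, l_y_given_x, l_x_given_y):
    """Compare the normalized costs of encoding X -> Y and Y -> X."""
    norm = l_x + l_y
    delta_xy = (l_x + l_y_given_x) / norm  # cost of X first, then Y given X
    delta_yx = (l_y + l_x_given_y) / norm  # cost of Y first, then X given Y
    if delta_xy < delta_yx:
        return "X -> Y"
    if delta_yx < delta_xy:
        return "Y -> X"
    return "undecided"  # assumption: equal costs give no decision
```

For example, `infer_direction(100, 100, 60, 90)` compares 160/200 against 190/200 and therefore favours X -> Y.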

How to encode a function

Encode $L(Y \mid X) = L(f) + L(Y \mid f, X)$
- $Y = f(X) + N$
- $f$ is a regression function
- $f$ minimizes the two-part cost

Data given model: $L(Y \mid f, X) = L(N)$
- encode the noise assuming a normal distribution

Model: $L(f)$
- encode the type (linear, quadratic, exponential, …)
- encode the parameters $\alpha, \beta, \gamma, \ldots$

[Figure: scatter plot of Y against X, both axes scaled to [0, 1]]

Non-determinism

[Figure: left panel, Y against X on [0, 1]; right panel, the $y_i$ associated to a single $x$ plotted against $\tilde{x}$]

Setup
- global trend function
- local functions restricted to the same type (we assume the noise follows the same distribution)

Computation
- for $x \in X$ with $\mathrm{count}(x) = c$
  - create $\tilde{x}$, an evenly spaced sequence of length $c$ that is symmetric around zero
  - $\mathrm{Fit}(\mathrm{sort}(y_i \text{ mapped to } x), \tilde{x})$

Encoding
- encode the type and parameters of the functions
- encode the mapping of local functions to $x \in X$
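The computation step can be sketched as follows. This is an illustration under assumptions: the endpoints of the sequence are taken to be -1 and 1 (as the axis of the figure suggests), the local function type is a straight line, and `local_fits` and `min_count` are hypothetical names rather than anything defined in the slides.

```python
from collections import defaultdict
import numpy as np

def local_fits(x, y, min_count=2):
    """Fit one local line per x value that occurs more than min_count times."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[float(xi)].append(float(yi))

    fits = {}
    for xi, ys in groups.items():
        c = len(ys)
        if c <= min_count:
            continue  # too few y values to justify a local function
        x_tilde = np.linspace(-1.0, 1.0, c)  # assumed endpoints, symmetric around zero
        y_sorted = np.sort(ys)               # sort the y values mapped to this x
        fits[xi] = np.polyfit(x_tilde, y_sorted, 1)  # [slope, intercept]
    return fits
```

Sorting the y values before fitting turns the spread at a single x into a monotone curve over the auxiliary axis, which a simple function type can then describe.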


Slope: compute $L(Y \mid X)$

    F ← ∅
    f_g ← fit global function; add f_g to F
    for each function type t do
        F_t ← F
        for x ∈ X with count(x) above the threshold do
            f_l ← fit local function on x̃ of x
            if adding f_l to F_t reduces overall costs then
                F_t ← F_t ∪ {f_l}
            end
        end
        F ← min(F, F_t)
    end
    return costs of Y given F and X

Confidence and Significance

How certain are we?

$$\mathbb{C} = \left| \frac{L(X) + L(Y \mid X)}{L(X) + L(Y)} - \frac{L(Y) + L(X \mid Y)}{L(X) + L(Y)} \right|$$

where $L(X \to Y)$ and $L(Y \to X)$ label the numerators of the first and second fraction
- the higher, the more certain
- robust w.r.t. sample size

But is a given inference significant?
- we can use the no-hypercompression inequality to test significance

$$P\big(L_0(X) - L(X) \ge k\big) \le 2^{-k}$$

- our null hypothesis $L_0$ is that $X$ and $Y$ are only correlated

(for details on the no-hypercompression inequality see Grünwald, 2007)
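Both quantities are cheap to compute once the code lengths are known. The sketch below assumes all lengths are given in bits, takes the absolute difference so that the confidence is non-negative, and treats the null-model length as an input because the slides do not detail how it is encoded. The function names are illustrative, and the default level of 0.001 is the one quoted for the benchmark results.

```python
def confidence(l_x, l_y, l_y_given_x, l_x_given_y):
    """Absolute difference of the two normalized description lengths."""
    norm = l_x + l_y
    return abs((l_x + l_y_given_x) / norm - (l_y + l_x_given_y) / norm)

def hypercompression_bound(l_null, l_model):
    """Upper bound 2^(-k) on the probability of compressing k bits better than the null."""
    k = l_null - l_model
    if k <= 0:
        return 1.0  # no gain over the null model, so the bound is trivial
    return 2.0 ** (-k)

def is_significant(l_null, l_model, alpha=0.001):
    """Declare significance when the bound does not exceed alpha."""
    return hypercompression_bound(l_null, l_model) <= alpha
```

A gain of 10 bits over the null model bounds the probability by 2^-10, which is just below 0.001, so roughly ten bits are needed at that level.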

Confidence Robustness

[Figure: three panels, RESIT (HSIC indep.), IGCI (Entropy), and SLOPE (Compression), each plotting confidence against the number of data points (100, 250, 500, 1000)]

Synthetic data generated with ANMs

[Figure: accuracy in % of RESIT, IGCI, and SLOPE per generating model (uu, ug, un, gu, gg, gn, bu, bg, bn, pu, pg, pn) for data of the form $Y = f(X) + N$; the panel labels are Uniform, Gaussian, Binomial, and Poisson]

Performance on Benchmark Data (Tübingen, 97 univariate numeric cause-effect pairs)

Inferences of state-of-the-art algorithms, ordered by confidence values.

SLOPE is 85% accurate with $\alpha = 0.001$

Conclusions

We considered causal inference on univariate numeric data
- we propose an MDL score for local and global regression
- reliable confidence score, independent of sample size
- significance test based on no-hypercompression
- very good performance on synthetic data, outclassing the state of the art on benchmark data

Future: consider multivariate, mixed-type, and causal graphs

Thank you!


Runtimes

[Figure: runtime in seconds (log scale, 10^1 to 10^6) of IGCI, SLOPE, RESIT, ANM, and CURE]

Slope only Deterministic

[Figure: fraction of correct decisions against decision rate (both from 0 to 1) for SLOPE, SLOPE_D, CURE, RESIT, IGCI, and ANM]

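Formulas

Encoding a function

$$L(f) = \sum_{\phi \in \Phi} \Big( L_{\mathbb{N}}(s) + L_{\mathbb{N}}(\phi \cdot 10^{s}) \Big)$$

where the sum runs over the parameters $\phi$ of $f$.

Encoding the model

$$L(F) = L_{\mathbb{N}}(|F|) + \log \binom{|X| - 1}{|F_l| - 1} + 2 \log(|\mathcal{F}|) + L(f_g) + \sum_{f_l \in F_l} L(f_l)$$

Encoding the data given the model

$$L(Y \mid F, X) = \sum_{f \in F} \left( \frac{n_f}{2} \left( \frac{1}{\ln 2} + \log 2\pi\sigma^2 \right) - n_f \log \tau \right)$$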
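As a hedged illustration of the last formula, the sketch below evaluates the Gaussian data cost for the residuals of a single function, estimating the variance from the residuals themselves (an assumption) and using base-2 logarithms so that the result is in bits. It also includes Rissanen's universal integer code as one common choice for the integer encoding; the slides do not spell that encoding out, so treat it as an assumption. Function names are illustrative.

```python
import math

def gaussian_data_cost(residuals, tau):
    """n/2 * (1/ln 2 + log2(2*pi*sigma^2)) - n * log2(tau), in bits.

    Requires a non-zero residual variance; tau is the data resolution.
    """
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # variance estimated from the residuals
    return n / 2 * (1 / math.log(2) + math.log2(2 * math.pi * sigma2)) - n * math.log2(tau)

def universal_int_cost(z):
    """Rissanen's universal code length for an integer z >= 1 (assumed integer code)."""
    if z < 1:
        raise ValueError("z must be at least 1")
    cost = math.log2(2.865064)  # normalizing constant of the universal code
    term = math.log2(z)
    while term > 0:             # add log z + log log z + ... while the terms are positive
        cost += term
        term = math.log2(term)
    return cost
```

Halving `tau` adds one bit per data point to the data cost, which is how the resolution of the data enters the score.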