De nihilo nihil - Christian-Albrechts-Universität zu … nihilo nihil Statistical Modelling Causal...


Page 1

De nihilo nihil

Statistical Modelling

Page 2

Causal Relationships

- response: blood pressure
- explanatory: caffeine intake
- disturbing factor: body weight
- disturbing factor: cigarette smoke

[Diagram legend: causation, association]

Page 3

Statistical Modelling

... entails the analysis of the functional relationship between response (or dependent) variables and explanatory (or independent) variables, including adjustment for uncontrollable disturbing factors.

Page 4

Statistical Modelling: Basic Approaches

Experimental Modelling: experimental evaluation of the effects of given explanatory variables upon a response variable, involving either randomisation or matching (or 'control') for known disturbing factors (e.g. temperature and humidity as determinants of the adhesion of dental prostheses)

Observational Modelling: observation-based analysis of the relationship between a response variable and several explanatory variables (e.g. birth weight and gestational age)

Page 5

Linear Models

$$Y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + Ε$$

Y: response variable
X1,...,Xk: explanatory variables
Ε: random error

Ε is generally assumed to be N(0,σ²) with unknown σ.

The use of multiple linear (and other) models allows the regression coefficients bi to be estimated while taking the influence of disturbing factors into account ('adjustment').

Page 6

Linear Models

$$y_{pred} = a + b_1 x_1 + \dots + b_k x_k$$

[Figure labels: 0, E(Y), Ε, Y]

Page 7

Miss America Body Features 1984 - 2002

[Scatter plot: body weight (pounds, axis from 90 to 150) against body height (inches, axis from 62 to 72)]

y: body weight (pounds), x1: body height (inches)

ypred = -111.29 + 3.44⋅x1
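As a small illustration, the fitted line can be evaluated for a given height. The sketch below only plugs a value into the equation printed on the slide; the function name `predict_weight` and the example height of 66 inches are assumptions made here for illustration.

```python
# Coefficients as printed on the slide: ypred = -111.29 + 3.44 * x1
INTERCEPT = -111.29   # a
SLOPE = 3.44          # b1, pounds per inch of body height


def predict_weight(height_inches: float) -> float:
    """Predicted body weight (pounds) for a given body height (inches)."""
    return INTERCEPT + SLOPE * height_inches


# Hypothetical example: a body height of 66 inches
print(round(predict_weight(66), 2))  # 115.75
```

The slope reads as an average difference of 3.44 pounds per additional inch of height within the observed range (62 to 72 inches); extrapolating far outside that range is not supported by the data shown.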

Page 8

Statistical Modelling Procedure

1. Data Exploration: isolated assessment of the possible relevance of each explanatory variable

2. Model Formulation: mathematical modelling of the multifaceted relationship between explanatory and response variables, invoking scientific plausibility

3. Model Selection: parameter estimation ('regression'), hypothesis testing (e.g. likelihood ratio, p value, coefficient of determination)

4. Model Checking: comparison between model predictions and observations ('residual diagnostics')

Page 9

Prediction of Body Fat Percentage

Body fat percentage can be determined by dual energy X-ray absorptiometry (DXA), a fairly accurate but time-consuming and expensive technique. On the other hand, measurements of triceps skin fold thickness and of thigh and mid arm circumference may not be as accurate as DXA, but are quicker and cheaper to perform.

from: J. Neter, W. Wasserman, M.H. Kutner (1997) Applied Linear Statistical Models

Page 10

Prediction of Body Fat Percentage

- response: body fat
- explanatory: skin fold
- explanatory: thigh
- explanatory: mid arm

from: J. Neter, W. Wasserman, M.H. Kutner (1997) Applied Linear Statistical Models

Page 11

Prediction of Body Fat Percentage

Y: body fat (%)   X1: skin fold (mm)   X2: thigh (cm)   X3: mid arm (cm)
11.9              19.5                 43.1             29.1
22.8              24.7                 49.8             28.2
18.7              30.7                 51.9             37.0
20.1              29.8                 54.3             31.1
12.9              19.1                 42.2             30.9
21.7              25.6                 53.9             23.7
27.1              31.4                 58.5             27.6
...

Variables Y, X1,...,X3 were measured simultaneously in 20 individuals.

from: J. Neter, W. Wasserman, M.H. Kutner (1997) Applied Linear Statistical Models

Page 12

Multiple Linear Regression: Data Exploration

pair-wise Pearson correlation coefficients r (upper right half) and two-sided p values for r=0 (lower left half)

       Y        X1       X2       X3
Y      -        0.843    0.878    0.142
X1     <0.001   -        0.924    0.458
X2     <0.001   <0.001   -        0.085
X3     0.549    0.042    0.723    -

Page 13

Multiple Linear Regression: Data Exploration

[Scatter plot: body fat (%, axis from 10 to 30) against skin fold thickness (mm, axis from 10 to 35)]

y: body fat (%), x1: skin fold thickness (mm)

ypred = -1.496 + 0.857⋅x1

R² = 0.711

Page 14

Multiple Linear Regression: Data Exploration

[Scatter plot: body fat (%, axis from 10 to 30) against thigh circumference (cm, axis from 40 to 60)]

y: body fat (%), x2: thigh circumference (cm)

ypred = -23.634 + 0.857⋅x2

R² = 0.771

Page 15

Multiple Linear Regression: Data Exploration

[Scatter plot: body fat (%, axis from 10 to 30) against mid arm circumference (cm, axis from 20 to 40)]

y: body fat (%), x3: mid arm circumference (cm)

ypred = 14.687 + 0.199⋅x3

R² = 0.020
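For a regression with a single explanatory variable, the coefficient of determination equals the squared Pearson correlation between response and explanatory variable. The following sketch checks this relationship for the three single-variable fits, using only the rounded figures quoted in the correlation table and on the scatter plot pages; the dictionary names are assumptions introduced for illustration.

```python
# Pearson correlations between body fat (Y) and each explanatory variable,
# as listed in the correlation table (Page 12)
r_with_y = {"skin fold": 0.843, "thigh": 0.878, "mid arm": 0.142}

# R^2 values reported for the three single-variable fits (Pages 13 to 15)
reported_r2 = {"skin fold": 0.711, "thigh": 0.771, "mid arm": 0.020}

for name, r in r_with_y.items():
    r2 = r ** 2
    print(f"{name}: r^2 = {r2:.3f}, reported R^2 = {reported_r2[name]:.3f}")
```

This identity holds only for a single explanatory variable; with several explanatory variables, R² is no longer the square of any one pair-wise correlation.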

Page 16

Multiple Linear Regression: Model Formulation

$$Y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + Ε$$

linear model with normal error Ε

Page 17

Model Selection

Forward Selection: stepwise inclusion of explanatory variables, starting with the best variable (e.g. that with the smallest p value)

Backward Selection: stepwise reduction of the number of explanatory variables, starting with the "full" model

Page 18

Multiple Linear Regression: Model (Backward) Selection

Parameter estimation from model equations using maximum likelihood or least squares methods

$$y_1 = a + b_1 x_{1,1} + b_2 x_{2,1} + b_3 x_{3,1} + \varepsilon_1$$

$$y_2 = a + b_1 x_{1,2} + b_2 x_{2,2} + b_3 x_{3,2} + \varepsilon_2$$

$$\vdots$$

$$y_{20} = a + b_1 x_{1,20} + b_2 x_{2,20} + b_3 x_{3,20} + \varepsilon_{20}$$
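To illustrate the least squares principle, the sketch below fits a line with a single explanatory variable (skin fold) by minimising the sum of squared errors in closed form. It is an illustration only: it uses the seven observations printed on Page 11 rather than all 20, so its estimates are not expected to reproduce the values on the slides, and the function name `least_squares_line` is an assumption. With three explanatory variables the same criterion is minimised over a, b1, b2 and b3 simultaneously.

```python
# The seven (skin fold, body fat) pairs printed on Page 11;
# the remaining 13 of the 20 observations are not shown in the transcript.
skin_fold = [19.5, 24.7, 30.7, 29.8, 19.1, 25.6, 31.4]   # x1 (mm)
body_fat = [11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1]    # y (%)


def least_squares_line(x, y):
    """Return (a, b) minimising sum((y_i - a - b * x_i) ** 2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b


a, b = least_squares_line(skin_fold, body_fat)

# Estimated errors (residuals) of the fitted line
residuals = [yi - (a + b * xi) for xi, yi in zip(skin_fold, body_fat)]
print(f"a = {a:.3f}, b = {b:.3f}")
```

A property of any least squares fit that includes an intercept is that the residuals sum to zero, which offers a quick sanity check of the computation.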

Page 19

Multiple Linear Regression: Full Model

term             estimate   s.e.
a (intercept)    117.085    99.782
b1 (skin fold)   4.334      3.016
b2 (thigh)       -2.857     2.582
b3 (mid arm)     -2.186     1.595

ypred = 117.085 + 4.334⋅x1 - 2.857⋅x2 - 2.186⋅x3

R² = 0.895

Page 20

Multiple Linear Regression: Model (Backward) Selection

For each regression coefficient $b_i$, test the null hypothesis $H_{i,0}: b_i = 0$ against the alternative $H_{i,A}: b_i \neq 0$ using, for example, a Wald test.

$$W_i = \frac{\hat{b}_i}{\mathrm{s.e.}(\hat{b}_i)}$$

Since $W_i \sim N(0,1)$ under $H_{i,0}$, reject $H_{i,0}$ if $|W_i| > z_{1-\alpha/2}$.
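The Wald statistic is simply the estimate divided by its standard error. The sketch below applies this to the full-model table on Page 19 and derives a two-sided p value under the N(0,1) reference stated above. The function name `wald_test` is an assumption, and the p values tabulated on the slides may have been computed with a different reference distribution (for instance Student's t), so they need not agree exactly with the normal-based values printed here.

```python
import math


def wald_test(estimate: float, std_error: float):
    """Return the Wald statistic W and a two-sided p value under W ~ N(0,1)."""
    w = estimate / std_error
    # For standard normal Z, P(|Z| > |w|) equals erfc(|w| / sqrt(2))
    p = math.erfc(abs(w) / math.sqrt(2))
    return w, p


# Full-model estimates and standard errors from Page 19
full_model = {
    "a (intercept)": (117.085, 99.782),
    "b1 (skin fold)": (4.334, 3.016),
    "b2 (thigh)": (-2.857, 2.582),
    "b3 (mid arm)": (-2.186, 1.595),
}

for term, (est, se) in full_model.items():
    w, p = wald_test(est, se)
    print(f"{term}: W = {w:.3f}, p (normal reference) = {p:.3f}")
```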

Page 21

Multiple Linear Regression: Model (Backward) Selection

term             W        p
a (intercept)    1.173    0.258
b1 (skin fold)   1.437    0.170
b2 (thigh)       -1.106   0.285
b3 (mid arm)     -1.370   0.190

Page 22

Multiple Linear Regression: Final Model

term             estimate   s.e.
a (intercept)    6.792      4.488
b1 (skin fold)   1.001      0.128
b3 (mid arm)     -0.431     0.177

ypred = 6.792 + 1.001⋅x1 - 0.431⋅x3

R² = 0.887

Page 23

Multiple Linear Regression: Final Model

term             W        p
a (intercept)    1.513    0.149
b1 (skin fold)   7.803    <0.001
b3 (mid arm)     -2.442   0.026

Page 24

Multiple Linear Regression: Model Checking

verification whether the (random) error Ε is N(0,σ²)

'standardized residuals'

$$\varepsilon_i = \frac{y_i - y_{pred,i}}{s_{y - y_{pred}}}$$

[Plot: standardized residuals (axis from -2.0 to 2.0) against body fat (%, axis from 10 to 30)]

Page 25

Multiple Linear Regression: Model Checking

[Figure: four residual plots, panels (a) to (d), each showing the residual (with a reference line at 0) against the response variable]

Page 26

Other (Normal) Linear Models

Analysis of Variance (ANOVA): explanatory variables are either qualitative or quantitative, but discrete

Analysis of Covariance (ANCOVA): some explanatory variables are continuous, some are discrete (multiple regression)

Page 27

Linear Models

Y: response variable
X1,...,Xk: explanatory variables
Ε: N(0,σ²) with unknown σ

$$Y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + Ε$$

$$\operatorname{E}(Y) = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + \operatorname{E}(Ε)$$

$$\operatorname{E}(Y) = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

The last step uses E(Ε) = 0.

Page 28

Generalised Linear Models

Y: response variable
X1,...,Xk: explanatory variables
G: link function

$$G[\operatorname{E}(Y)] = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

for a dichotomous response variable Y: E(Y) = 0⋅P(Y=0) + 1⋅P(Y=1) = P(Y=1) = π

Page 29

Logistic Regression

Generalised Linear Model with the 'logit' as the link function

$$\operatorname{logit}(x) = \ln\left(\frac{x}{1-x}\right)$$

$$\operatorname{logit}(\pi) = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

[Plot: logit(x) (axis from -6 to 6) against x (axis from 0.0 to 1.0)]
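A short sketch of the logit link and its inverse may help make the transformation concrete. The function names `logit` and `inv_logit` are assumptions chosen here; the inverse is the logistic function that the slides introduce on a later page.

```python
import math


def logit(p: float) -> float:
    """Log-odds ln(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x)), the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-x))


# Map a few probabilities to the log-odds scale and back again
for p in (0.1, 0.5, 0.9):
    x = logit(p)
    print(f"p = {p}: logit = {x:.3f}, back-transformed = {inv_logit(x):.3f}")
```

The logit maps probabilities in (0, 1) onto the whole real line, which is what allows an unbounded linear predictor on the right-hand side of the model equation.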

Page 30

Logistic Regression: Adjusted Odds Ratio

Let X1 be a dichotomous explanatory variable (e.g. 1: "exposed", 0: "not exposed")

$$\operatorname{logit}(\pi_e) = a + b_1 \cdot 1 + b_2 x_2 + \dots + b_k x_k$$

$$\operatorname{logit}(\pi_n) = a + b_1 \cdot 0 + b_2 x_2 + \dots + b_k x_k$$

$$b_1 = \operatorname{logit}(\pi_e) - \operatorname{logit}(\pi_n) = \ln\left(\frac{\pi_e}{1-\pi_e}\right) - \ln\left(\frac{\pi_n}{1-\pi_n}\right) = \ln\left(\frac{\pi_e/(1-\pi_e)}{\pi_n/(1-\pi_n)}\right) = \ln(OR)$$

$$OR = \exp(b_1)$$
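The relation OR = exp(b1) can be tried out numerically. The sketch below uses two coefficients quoted on the later Evans County slides (CAT in the full model, smoking in the final model); the function name `odds_ratio_from_coefficient` is an assumption introduced for illustration.

```python
import math


def odds_ratio_from_coefficient(b: float) -> float:
    """Adjusted odds ratio for a 0/1 explanatory variable: OR = exp(b)."""
    return math.exp(b)


# Coefficients quoted on the later Evans County slides
b_cat_full = 0.598       # b1 (CAT) in the full model
b_smoke_final = 0.851    # b4 (smoking) in the final model

print(round(odds_ratio_from_coefficient(b_cat_full), 2))     # 1.82
print(round(odds_ratio_from_coefficient(b_smoke_final), 2))  # 2.34
```

Because the printed coefficients are rounded to three decimals, odds ratios recomputed this way can differ in the last digit from values derived from unrounded estimates.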

Page 31

The Evans County Heart Study

In 1960, the entire population of Evans County, Georgia, aged 40 and over was given a complete cardiovascular examination. Some 609 white males were followed for 9 years to determine their coronary heart disease (CHD) status.

Hames C (1971) Arch Intern Med 128: 883-886.

Page 32

The Evans County Heart Study

Y: CHD status (dichotomous) 0: "no", 1: "yes"

x1: catecholamine level (CAT; dichotomous) 0: "low", 1: "high"
x2: age (years)
x3: cholesterol (CHL; mg/dL)
x4: smoking status (dichotomous) 0: "never smoker", 1: "ever smoker"
x5: hypertension (dichotomous) 0: "no", 1: "yes"
x6: ECG abnormalities (dichotomous) 0: "no", 1: "yes"

from: Kleinbaum DG (1994) Logistic Regression - A Self-Learning Text. Springer, New York

Page 33

Logistic Regression: Data Exploration

explanatory variable   CHD: no (n=538)   CHD: yes (n=71)   p
CAT (%)                95 (18%)          27 (38%)          <0.001
age                    53 ± 9            57 ± 10           0.002
CHL                    210 ± 39          222 ± 39          0.021
smoking (%)            333 (62%)         54 (76%)          0.025
hypertension (%)       212 (39%)         43 (60%)          <0.001
ECG (%)                137 (26%)         29 (41%)          0.010

number and percentage, or mean ± s.e., with p values from χ²-test or t-test, respectively

Page 34

The Evans County Heart Study

Unadjusted Odds Ratios

CAT            CHD   no CHD
low            44    443
high           27    95

OR_CAT = (27⋅443)/(95⋅44) = 2.86

Smoking        CHD   no CHD
no             17    205
yes            54    333

OR_smoke = (54⋅205)/(333⋅17) = 1.96

Hypertension   CHD   no CHD
no             28    326
yes            43    212

OR_hyp = (43⋅326)/(212⋅28) = 2.36

ECG abnormality   CHD   no CHD
no                42    401
yes               29    137

OR_ECG = (29⋅401)/(137⋅42) = 2.02
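The four unadjusted odds ratios follow from the cross-product of each 2x2 table. The sketch below recomputes them from the counts given on this page; the function name `odds_ratio` and the argument order are assumptions made for illustration.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Cross-product odds ratio of a 2x2 table."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Counts per factor, in the order:
# (CHD among exposed, no CHD among exposed, CHD among unexposed, no CHD among unexposed)
tables = {
    "CAT (high vs low)": (27, 95, 44, 443),
    "smoking (yes vs no)": (54, 333, 17, 205),
    "hypertension (yes vs no)": (43, 212, 28, 326),
    "ECG abnormality (yes vs no)": (29, 137, 42, 401),
}

for name, counts in tables.items():
    print(f"{name}: OR = {odds_ratio(*counts):.2f}")
```

These values are unadjusted: each table ignores the other explanatory variables, which is why they can differ from the odds ratios obtained from the logistic model.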

Page 35

Logistic Regression: Model Formulation

$$\operatorname{logit}(\pi) = a + b_1 x_1 + b_2 x_2 + \dots + b_6 x_6$$

logistic model with π=E(Y) equal to the 9-year incidence proportion (or "risk") of CHD

Page 36

Logistic Regression: The Full Model

term                estimate   s.e.
a (intercept)       -6.772     1.140
b1 (CAT)            0.598      0.352
b2 (age)            0.032      0.015
b3 (CHL)            0.009      0.003
b4 (smoking)        0.834      0.305
b5 (hypertension)   0.439      0.291
b6 (ECG)            0.369      0.294

Page 37

The Evans County Heart Study

Adjusted versus Unadjusted Odds Ratios

term                estimate   adjusted odds ratio   unadjusted odds ratio
b1 (CAT)            0.598      1.82                  2.86
b4 (smoking)        0.834      2.30                  1.96
b5 (hypertension)   0.439      1.55                  2.36
b6 (ECG)            0.369      1.49                  2.02

Page 38

Logistic Regression: Model (Backward) Selection

term                W        p
a (intercept)       -5.940   <0.001
b1 (CAT)            1.698    0.089
b2 (age)            2.123    0.034
b3 (CHL)            2.680    0.007
b4 (smoking)        2.734    0.006
b5 (hypertension)   1.509    0.131
b6 (ECG)            1.258    0.208

Page 39

Logistic Regression: The Final Model

term            estimate   s.e.
a (intercept)   -7.027     1.107
b2 (age)        0.051      0.014
b3 (CHL)        0.007      0.003
b4 (smoking)    0.851      0.301

logit(π) = -7.027 + 0.051⋅x2 + 0.007⋅x3 + 0.851⋅x4

OR_smoke unadjusted: 1.96, adjusted: 2.34

Page 40

Logistic Function (logit⁻¹)

$$\operatorname{logit}^{-1}(x) = \frac{1}{1+\exp(-x)}$$

$$\pi = \frac{1}{1+\exp(-a - b_1 x_1 - b_2 x_2 - \dots - b_k x_k)}$$

[Plot: logit⁻¹(x) (axis from 0.0 to 1.0) against x (axis from -10 to 10)]

Page 41

The Evans County Heart Study

What is the 9-year CHD risk of a 45-year-old ever-smoker with a cholesterol level of 260 mg/dL?

x2=45, x3=260, x4=1

$$\pi = \frac{1}{1+\exp(7.027 - 0.051 \cdot 45 - 0.007 \cdot 260 - 0.851 \cdot 1)} = \frac{1}{1+\exp(2.061)} = 0.113$$
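The calculation above can be reproduced with a few lines of code. This sketch hard-codes the final-model coefficients from Page 39; the function name `chd_risk` and the constant names are assumptions introduced for illustration.

```python
import math

# Final-model coefficients from Page 39
A = -7.027        # intercept
B_AGE = 0.051     # per year of age (x2)
B_CHL = 0.007     # per mg/dL of cholesterol (x3)
B_SMOKE = 0.851   # ever smoker (1) versus never smoker (0) (x4)


def chd_risk(age: float, chl: float, ever_smoker: int) -> float:
    """9-year CHD risk predicted by the final logistic model."""
    linear_predictor = A + B_AGE * age + B_CHL * chl + B_SMOKE * ever_smoker
    # The logistic function converts the linear predictor into a probability
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# The example from the slide: 45 years old, 260 mg/dL, ever smoker
print(round(chd_risk(45, 260, 1), 3))  # 0.113
```

Setting `ever_smoker` to 0 with the other inputs unchanged gives the corresponding risk for a never smoker, which makes the effect of the smoking coefficient visible on the probability scale.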

Page 42

Logistic Regression: Screening Test

The comparison of the individual risk, π, with a given threshold, ρ, provides a "screening" test for the disease.

π > ρ: test positive
π ≤ ρ: test negative
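The decision rule can be written as a one-line comparison. In the sketch below, the function name `screening_test` is an assumption; the example risk of 0.113 is the value computed on Page 41 and the threshold of 0.11 is the value of ρ quoted on the ROC curve page.

```python
def screening_test(risk: float, threshold: float) -> str:
    """Return 'positive' if the risk exceeds the threshold, else 'negative'."""
    return "positive" if risk > threshold else "negative"


print(screening_test(0.113, 0.11))  # positive
print(screening_test(0.110, 0.11))  # negative (risk equal to the threshold)
```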

Page 43

Logistic Regression: Screening Test (ROC Curve)

[ROC curve; axes labelled "1-sensitivity" and "specificity", each from 0 to 1; marked point at 1-sensitivity = 0.32 and specificity = 0.61]

ρ: 0.11
sensitivity: 0.68
specificity: 0.61
Youden's index: 0.29

baseline risk: 71/(71+538) = 0.12
PPV: 0.19
NPV: 0.93

AUC: 0.68
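Youden's index and the predictive values can be derived from sensitivity, specificity and the baseline risk. The sketch below uses the rounded figures quoted on this page and Bayes' rule; the function name `screening_summary` is an assumption. Because the inputs are rounded, the last digit of a result may differ slightly from the value shown on the slide.

```python
def screening_summary(sensitivity: float, specificity: float, prevalence: float):
    """Return (Youden's index, PPV, NPV) for a screening test."""
    youden = sensitivity + specificity - 1.0
    # Expected proportions of the four test outcomes in the population
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)   # P(disease | test positive)
    npv = true_neg / (true_neg + false_neg)   # P(no disease | test negative)
    return youden, ppv, npv


prevalence = 71 / (71 + 538)   # baseline risk of CHD in the study
youden, ppv, npv = screening_summary(0.68, 0.61, prevalence)
print(f"Youden's index: {youden:.2f}, PPV: {ppv:.2f}, NPV: {npv:.2f}")
```

The low PPV alongside a high NPV reflects the low baseline risk: most positive screening results occur in individuals who do not go on to develop the disease.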

Page 44

"Triple Test" for Down Syndrome

The triple test is done between the 16th and 18th weeks of pregnancy. The test measures three substances, or markers, that are passed from the fetus and the placenta into the mother's bloodstream - AFP, human chorionic gonadotropin and unconjugated estriol. [...] A method was found to combine results of the three tests with a mother's age to identify women at increased risk for having a baby with Down's syndrome. Since that time, a number of studies have shown that the triple test can detect 60 to 70 percent of Down's syndrome cases. Because it is a screening test, the triple test identifies pregnancies that are at increased risk, or "screen-positive" for Down's syndrome. A positive result does not necessarily mean the baby is affected, but is only a signal for further testing.

American Society of Clinical Pathology (www.ascp.org)

Page 45

Summary

- Statistical modelling entails the analysis of the functional relationship between response and explanatory variables.

- Experimental modelling is based upon prospective trials, addressing controlled explanatory variables. Observational modelling makes use of uncontrolled, observational data.

- Statistical modelling proceeds in multiple steps, including data exploration, followed by model formulation, selection and checking.

- The most commonly used class of statistical models is that of generalised linear models, encompassing (multiple) linear regression, analysis of variance and logistic regression.

- Multiple models 'adjust' the effect of explanatory variables for any bias introduced by disturbing factors.