Introduction to Biostatistics (ZJU 2008)


Wenjiang Fu, Ph.D
Associate Professor
Division of Biostatistics, Department of Epidemiology
Michigan State University
East Lansing, Michigan 48824, USA
Email: [email protected]
www: http://www.msu.edu/~fuw


Logistic regression model
  Why use logistic regression?
  Estimation by maximum likelihood
  Interpretation of coefficients
  Hypothesis testing
  Evaluating the performance of the model

Why logistic regression?
Many important research topics have a binary dependent variable, e.g. disease vs. no disease, damage vs. no damage, death vs. survival.
Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable, coded 0 (absence of disease) or 1 (presence of disease).
To explain the variability of the binary variable by other variables, either continuous or categorical (such as age, sex, BMI, marital status, or socio-economic status), use a statistical model that relates the probability of the response event to the explanatory variables.

Logistic Regression Model
Event (Y = 1), no event (Y = 0). We want to model the mean:
  E(Y) = P(Y=1) * 1 + P(Y=0) * 0 = P(Y=1)
but π = P(Y=1) is bounded between 0 and 1, while
the linear predictor β0 + x1β1 + x2β2 + x3β3 + … is a linear combination and may take any value.

The "logit" model solves this problem.
Single independent variable:
  ln [p/(1-p)] = α + βX
where the probability p = P(Y=1 | X).
p/(1-p) is the "odds": the ratio of the probability that the event occurs to the probability that it does not occur, under condition X.
ln[p/(1-p)] is the "log odds" or "logit" of the probability.

More:
The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
The estimated probability is:
  p = 1/[1 + exp(-α - βX)]
    = exp(α + βX) / [1 + exp(α + βX)]
if α + βX = 0, then p = .50
as α + βX gets very large (positive), p approaches 1
as α + βX gets very negative, p approaches 0
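The limiting behaviour of the estimated probability can be checked numerically. The R sketch below is only an illustration: the intercept and slope values are hypothetical and not taken from the lecture. It compares a hand-written inverse logit with the base-R function plogis.

```r
# Inverse-logit: p = 1 / (1 + exp(-eta)) = exp(eta) / (1 + exp(eta))
inv_logit <- function(eta) 1 / (1 + exp(-eta))

alpha <- -1    # hypothetical intercept, for illustration only
beta  <- 0.5   # hypothetical slope, for illustration only
x     <- c(-40, 0, 2, 40)

eta <- alpha + beta * x   # linear predictor: may take any real value
p   <- inv_logit(eta)     # probability: stays between 0 and 1

print(cbind(x, eta, p))

p_at_zero      <- inv_logit(0)               # linear predictor equal to 0
same_as_plogis <- all.equal(p, plogis(eta))  # agrees with the built-in
```

With these values the linear predictor is 0 at x = 2, so the third probability is exactly 0.5, while the first and last are close to 0 and 1.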

Comparing LinReg and Logit Models
[Figure: fitted LinReg model and Logit model plotted against the observed outcomes Y=0 and Y=1.]

Maximum Likelihood Estimation (MLE)
MLE is a statistical method for estimating the coefficients of a model.
The likelihood function (L) measures the probability of observing the particular set of dependent variable values (Y1=y1, …, Yn=yn):
  L = Prob(y1, y2, …, yn)
The higher the L, the higher the probability of observing the sample data at hand.

Maximum Likelihood Estimator
MLE involves finding the coefficients (α, β) that make the log of the likelihood function (logLik < 0) as large as possible.
The maximum likelihood estimator maximizes the following likelihood function:
$$\mathrm{Lik} = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$
where p_i = exp(α + βX_i) / [1 + exp(α + βX_i)]

Maximum Likelihood Estimator
Or equivalently, the MLE maximizes the log-likelihood function
$$l = \log \mathrm{Lik} = \sum_{i=1}^{n} y_i \log\frac{p_i}{1 - p_i} + \sum_{i=1}^{n} \log(1 - p_i)$$
where p_i = exp(α + βX_i) / [1 + exp(α + βX_i)], and log[p_i/(1 - p_i)] = α + βX_i is the link function.
The MLE is in general a biased estimator in finite samples, but it is consistent (by large sample theory, the estimator converges to the true model parameters as the sample size increases).
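To make the maximization concrete, the following R sketch maximizes the logistic log-likelihood directly with the general-purpose optimizer optim and compares the result with glm. The eight data points are hypothetical and serve only as an illustration.

```r
# Toy data (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 1, 0, 1, 1, 0, 1)

# Log-likelihood: sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ],
# which equals sum_i [ y_i * log(p_i / (1 - p_i)) + log(1 - p_i) ]
loglik <- function(par) {
  eta <- par[1] + par[2] * x
  sum(y * eta - log1p(exp(eta)))
}

# optim() minimizes, so pass the negative log-likelihood
opt <- optim(c(0, 0), function(par) -loglik(par), method = "BFGS")

# Reference fit from glm (iteratively reweighted least squares)
fit <- glm(y ~ x, family = binomial(link = "logit"))

mle_optim <- opt$par
mle_glm   <- unname(coef(fit))
print(rbind(optim = mle_optim, glm = mle_glm))
print(c(loglik_optim = -opt$value, loglik_glm = as.numeric(logLik(fit))))
```

Both routes should agree up to numerical tolerance, and the maximized log-likelihood is negative, as noted above.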

Interpretation of Coefficients
Since ln [p/(1-p)] = α + βX,
the slope β is interpreted as the rate of change in the "log odds" as X changes … not very useful.
More useful: p = Pr(Y=1 | X).
p/(1-p) is the odds under condition X.
Example. X: smoker; Y = 1: lung cancer.
p/(1-p) is the odds of lung cancer given smoking status.


If X is continuous, exp(β) measures the multiplicative change in the odds for a one-unit increase of X.
  ln(OR) = ln[odds(X+1)] - ln[odds(X)]
         = ln[p(Y=1|X+1) / (1 - p(Y=1|X+1))] - ln[p(Y=1|X) / (1 - p(Y=1|X))]
         = α + β(X+1) - (α + βX) = β
  ln(OR) = β, so OR = exp(β)

If X is binary, exp(β) measures the ratio of the odds for one group (X=1) compared to the other (X=0).
  ln(OR) = ln[p(Y=1|X=1) / (1 - p(Y=1|X=1))] - ln[p(Y=1|X=0) / (1 - p(Y=1|X=0))]
         = α + β·1 - (α + β·0) = β
  ln(OR) = β, so OR = exp(β)


Interpretation of parameter β
Model: logit(p) = β0 + x1β1

        Y=1                                          Y=0
X=1     Pr(1|x=1) = e^(β0+β1) / (1 + e^(β0+β1))      1 - Pr(1|x=1) = 1 / (1 + e^(β0+β1))
X=0     Pr(1|x=0) = e^(β0) / (1 + e^(β0))            1 - Pr(1|x=0) = 1 / (1 + e^(β0))

$$OR = \frac{P(1 \mid x=1)\,[1 - P(1 \mid x=0)]}{[1 - P(1 \mid x=1)]\,P(1 \mid x=0)} = e^{\beta_1}$$
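The identity between the odds ratio and exp(β1) for a binary X can be verified on a small table. The counts in this R sketch are hypothetical and chosen only for illustration. The cross-product odds ratio from the 2x2 table is compared with the exponentiated slope of a logistic regression fitted to the same grouped data.

```r
# Hypothetical 2x2 counts (illustration only, not from the lecture)
#        Y=1  Y=0
# X=1     30   70
# X=0     10   90
dat <- data.frame(x = c(1, 0), events = c(30, 10), nonevents = c(70, 90))

# Cross-product odds ratio from the table
or_table <- (30 * 90) / (70 * 10)

# Logistic regression on grouped data: logit(p) = beta0 + beta1 * x
fit      <- glm(cbind(events, nonevents) ~ x, family = binomial, data = dat)
or_model <- unname(exp(coef(fit)["x"]))

print(c(or_table = or_table, or_model = or_model))
```

The two numbers agree because the model with a single binary covariate reproduces the observed proportions in both rows of the table.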

Model Assessment
There are several statistics which can be used for comparing alternative models or evaluating the performance of a single model:
  LRT, Wald tests
  Percent Correct Predictions

Model Chi-Square Statistic
The model likelihood ratio test (LRT) statistic is
  LR = [-2 log lik(Reduced model)] - [-2 log lik(Full model)]
Example: test of β,
  LR = -2 [log lik(α) - log lik(α, β)]
  log lik(α) is the log-likelihood of the model with only the intercept
  log lik(α, β) is the log-likelihood of the model with the intercept and X
Under the null hypothesis, the LR statistic approximately follows a chi-square distribution with r degrees of freedom, where r = difference in the number of parameters between the two models.
Use the LRT statistic to determine if the overall model is statistically significant.

Percent Correct Predictions
Predicted outcome (majority vote method):
  if predicted prob Pr(x) ≥ 0.5, assign ŷ = 1
  otherwise assign ŷ = 0
Compare the predicted outcome ŷ and the actual outcome y and compute the percentage of correct outcomes.

An Example:
                Predicted
Observed        0       1       % Correct
0               328     24      93.18%
1               139     44      24.04%
Overall                         69.53%
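The percentages in this classification table follow directly from its counts. A minimal R sketch, using only the four counts shown in the example:

```r
# Classification table from the example: rows = observed, columns = predicted
tab <- matrix(c(328, 24,
                139, 44),
              nrow = 2, byrow = TRUE,
              dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))

# Correct predictions sit on the diagonal
pct_correct_by_class <- 100 * diag(tab) / rowSums(tab)   # per observed class
pct_correct_overall  <- 100 * sum(diag(tab)) / sum(tab)  # all subjects

print(round(pct_correct_by_class, 2))
print(round(pct_correct_overall, 2))
```

The per-class figures show that the overall percentage hides a large difference: most observed 0s are classified correctly, while most observed 1s are not.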

Testing significance of variables
Omitted variable(s) can result in bias in the coefficient estimates. To test for omitted variables you can conduct a likelihood ratio test:
  LR[q] = [-2 log lik(constrained model, i = k-q)] - [-2 log lik(unconstrained model, i = k)]
where LR follows a chi-square distribution with q degrees of freedom, q ≥ 1 is the number of omitted variables, and i is the number of variables in each model.

An Example:
Variable      B         Wald      Sig
PETS          -0.699    10.968    0.001
MOBLHOME       1.570    29.412    0.000
TENURE        -0.020     5.993    0.014
EDUC           0.049     1.079    0.299
CHILD          0.009     0.011    0.917
WHITE          0.186     0.422    0.516
FEMALE         0.018     0.008    0.928
Constant      -1.049     2.073    0.150
(The Wald statistic is based on B/se(B).)

Beginning -2 LL    687.36
Ending -2 LL       641.41

Constructing the LR Test
Ending -2 loglik Partial Model    641.84
Ending -2 loglik Full Model       641.41
Block Chi-Square                  0.43
DF                                3
Critical Value                    11.345

“Since the chi-squared value is less than the critical value the set of coefficients is not statistically significant. The full model is not an improvement over the partial model.”
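The numbers in this LR test can be reproduced in a few lines of R. The sketch uses only the two -2 log-likelihood values and the degrees of freedom from the slide. The critical value 11.345 corresponds to the 0.01 significance level of a chi-square distribution with 3 degrees of freedom.

```r
neg2ll_partial <- 641.84   # ending -2 log-likelihood, partial (constrained) model
neg2ll_full    <- 641.41   # ending -2 log-likelihood, full (unconstrained) model
df             <- 3        # number of coefficients tested in the block

lr_stat <- neg2ll_partial - neg2ll_full              # block chi-square
crit_01 <- qchisq(0.99, df)                          # critical value at the 0.01 level
p_value <- pchisq(lr_stat, df, lower.tail = FALSE)   # upper-tail p-value

print(c(lr_stat = lr_stat, critical = crit_01, p_value = p_value))
```

Because the statistic is far below the critical value, the three extra coefficients are not jointly significant, which is the conclusion quoted on the slide.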

Multiple Logistic Regression
The probability of an event, with the event coded as a binary outcome.
Logistic regression model (logit function):
  log [π / (1 - π)] = β0 + x1β1 + … + xpβp = η
equivalent to
  π = exp(η) / [1 + exp(η)]
Why do we need multiple logistic regression?
A simple RxC table cannot solve all problems, and can be misleading.

Multiple Logistic Regression - Formulation
The relationship between π and x is S-shaped:
$$E(Y \mid x) = P(Y = 1 \mid x) = \pi(x) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$
The logit (log-odds) transformation (link function):
$$\ln\frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

Multiple Logistic Regression
Assess risk factors
  Individually: H0: βk = 0
  Globally: H0: βm = … = βm+t = 0
while controlling for confounders and other important determinants of the event.

Interpretation of the parameters
If π is the probability of an event and O is the odds for that event, then
$$\mathrm{Odds} = \frac{\pi(x)}{1 - \pi(x)} = \frac{\text{probability of event}}{\text{probability of no event}}$$
The link function in logistic regression gives the log-odds:
$$g(x) = \ln\frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

Example: Snoring & Heart Disease
An epidemiologic study surveyed 2484 subjects to examine whether snoring was a possible risk factor for heart disease.

                  Snoring
Heart Disease     Never    Occasional    Nearly every night    Every night
Yes               24       35            21                    30
No                1355     603           192                   224
Prop(yes)         .017     .055          .099                  .118

Constructing Indicator Variables
Let Z1 = 1 if occasional, 0 otherwise
Let Z2 = 1 if nearly every night, 0 otherwise
Let Z3 = 1 if every night, 0 otherwise

SAS Codes

    data hd;
      input hd $ snoring $ count;
      /* indicator variables; "never" is the reference group */
      Z1 = (snoring = "occa");
      Z2 = (snoring = "nearly");
      Z3 = (snoring = "every");
    cards;
    yes never 24
    yes occa 35
    yes nearly 21
    yes every 30
    no never 1355
    no occa 603
    no nearly 192
    no every 224
    ;
    run;

    proc logistic data=hd descending;
      model hd (event='yes') = Z1 Z2 Z3;
      freq count;
    run;

SAS OUTPUT

    Ordered              Total
    Value       hd       Frequency
    1           yes      110
    2           no       2374

    Probability modeled is hd='yes'.

    Model Convergence Status
    Convergence criterion (GCONV=1E-8) satisfied.

    Model Fit Statistics
                    Intercept    Intercept and
    Criterion       Only         Covariates
    AIC             902.827      842.923
    SC              908.645      866.193
    -2 Log L        900.827      834.923

    Testing Global Null Hypothesis: BETA=0
    Test            Chi-Square    DF    Pr
    LikRatio        65.9045       3     <.0001
    Score           72.7821       3     <.0001
    Wald            58.9513       3     <.0001

    The LOGISTIC Procedure

    Analysis of Maximum Likelihood Estimates
                               Standard    Wald
    Param    DF    Estimate    Error       Chi-Square    P
    Inter    1     -4.0335     0.2059      383.6641      <.0001
    Z1       1      1.1869     0.2695       19.3959      <.0001
    Z2       1      1.8205     0.3086       34.8027      <.0001
    Z3       1      2.0231     0.2832       51.0313      <.0001

    Odds Ratio Estimates
              Point       95% Wald
    Effect    Estimate    Confidence Limits
    Z1        3.277       1.932     5.558
    Z2        6.175       3.373    11.306
    Z3        7.561       4.341    13.172

Calculating Probabilities
The fitted logistic regression function is
  Logit(π) = -4.0335 + 1.1869 Z1 + 1.8205 Z2 + 2.0231 Z3
So, the probability of heart disease if never snore is
  exp(-4.0335) / (1 + exp(-4.0335)) = .0174
If snore occasionally,
  exp(-4.0335 + 1.1869) / (1 + exp(-4.0335 + 1.1869)) = .0549

Calculating Odds Ratios
If Z1 = Z2 = Z3 = 0, then the odds are exp(-4.0335)
If Z2 = Z3 = 0, but Z1 = 1, then the odds are exp(-4.0335 + 1.1869)
The ratio of odds is then exp(1.1869) = 3.2769
Interpretation: compared with people who never snore, people who snore occasionally have 3.28 times the odds of heart disease.
What is the odds ratio for comparing those who snore nearly every night with occasional snorers?
What is the odds ratio for comparing those who snore every night with those who snore nearly every night?
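The snoring analysis can be re-run in R as a cross-check. The sketch below is one possible R counterpart of the SAS program. It is an assumption of this example that a factor with "never" as the reference level plays the role of the indicators Z1, Z2 and Z3. Odds ratios between any two snoring groups are obtained by exponentiating the difference of their coefficients.

```r
snore <- data.frame(
  snoring = factor(c("never", "occa", "nearly", "every"),
                   levels = c("never", "occa", "nearly", "every")),
  yes = c(24, 35, 21, 30),
  no  = c(1355, 603, 192, 224)
)

# Grouped-data logistic regression; "never" is the reference level
fit <- glm(cbind(yes, no) ~ snoring, family = binomial, data = snore)
b   <- unname(coef(fit))   # intercept, then occa, nearly, every

# Fitted probability of heart disease in each snoring group
p_hat <- plogis(b[1] + c(never = 0, occa = b[2], nearly = b[3], every = b[4]))

# Odds ratios relative to never-snorers, and between adjacent groups
or_vs_never        <- exp(b[2:4])
or_nearly_vs_occa  <- exp(b[3] - b[2])
or_every_vs_nearly <- exp(b[4] - b[3])

print(round(b, 4))
print(round(p_hat, 4))
print(round(c(or_vs_never, or_nearly_vs_occa, or_every_vs_nearly), 3))
```

The coefficients and the odds ratios relative to never-snorers should match the SAS output up to rounding.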

Example: Genetic Association Study
Idiopathic Pulmonary Fibrosis (IPF) is known to be associated with age and gender (older people and males are more likely to be affected).
One study with 174 cases and 225 controls found an association of IPF with the genotype of one gene, COX2.8473 (C → T).
P-value by Pearson chi-square test: p = 0.0241.
Q: Is this association true?

Genotype CC CT TT total

Case 88 72 14 174

Control 84 113 28 225

Total 172 185 42 399
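The Pearson chi-square test on this genotype table can be reproduced with the base-R function chisq.test. A minimal sketch, using only the case and control counts from the table:

```r
geno <- matrix(c(88,  72, 14,
                 84, 113, 28),
               nrow = 2, byrow = TRUE,
               dimnames = list(status   = c("Case", "Control"),
                               genotype = c("CC", "CT", "TT")))

# Pearson chi-square test of independence (2 x 3 table, so 2 degrees of freedom)
res <- chisq.test(geno)
x2  <- unname(res$statistic)
df  <- unname(res$parameter)
p   <- res$p.value

print(c(x2 = x2, df = df, p = p))
```

This test ignores age and sex, which is why the slides go on to fit a logistic regression model that adjusts for both.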

Example on genetic study
Logistic regression model
  logit [Pr(IPF)] = intercept + snp + sex + age
Results:
                  Wald
Effect    DF      Chi-square    P-value
SNP       2       2.7811        0.2489
sex       1       9.1172        0.0025
age       1       100.454       <.0001

Example on genetic study
Investigate why it happens by age and sex

Disease (N: normal; D: disease; T: total) by age
       0-29    30-49    50-64    65-74    75+
N      104     77       35       7        2
D      0       10       42       68       54
T      104     87       77       75       56

Disease (N: normal; D: disease; T: total) by sex
       male    female
N      72      153
D      108     66
T      180     219

SNP genotype by sex
         male    female
CC       79      93
CT       75      110
TT       26      16
Total    180     219
Pearson chi-square test: X2 = 6.3911, p = 0.0409

Example on genetic study
Investigate why it happens by age and sex

Age class by genotype
         CC      CT      TT
0-29     36      58      10
30-49    34      42      11
50-64    35      32      10
65-74    37      30      8
75+      30      23      3
T        172     185     42
Pearson chi-square test: X2 = 10.01, p = 0.2643
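The two chi-square tests for genotype by sex and genotype by age class can be reproduced in the same way. This R sketch uses only the counts from the tables above and shows the pattern: genotype is associated with sex but not clearly with age class.

```r
# SNP genotype by sex (rows: CC, CT, TT; columns: male, female)
by_sex <- matrix(c(79,  93,
                   75, 110,
                   26,  16),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(genotype = c("CC", "CT", "TT"),
                                 sex = c("male", "female")))

# Age class by genotype (rows: age classes; columns: CC, CT, TT)
by_age <- matrix(c(36, 58, 10,
                   34, 42, 11,
                   35, 32, 10,
                   37, 30,  8,
                   30, 23,  3),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(age = c("0-29", "30-49", "50-64", "65-74", "75+"),
                                 genotype = c("CC", "CT", "TT")))

test_sex <- chisq.test(by_sex)   # 2 degrees of freedom
test_age <- chisq.test(by_age)   # 8 degrees of freedom

x2_sex <- unname(test_sex$statistic); p_sex <- test_sex$p.value
x2_age <- unname(test_age$statistic); p_age <- test_age$p.value
print(c(x2_sex = x2_sex, p_sex = p_sex, x2_age = x2_age, p_age = p_age))
```

Since sex is related both to disease status and to genotype, it can act as a confounder of the unadjusted genotype association.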

Loglinear regression model
Why use loglinear regression?
Often, observations are counts (≥ 0), together with the number of people exposed to risk (total population at risk).
Estimation by maximum likelihood: the likelihood function is
$$\mathrm{Lik} = \prod_{i=1}^{n} \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!}$$
or, as a log-likelihood function,
$$l = \log \mathrm{Lik} = \sum_{i=1}^{n} \left[ y_i \log \lambda_i - \lambda_i - \log(y_i!) \right] = \sum_{i=1}^{n} y_i \log \lambda_i - \sum_{i=1}^{n} \lambda_i + \mathrm{Const}$$
where λ_i is the expected count and log λ_i is the link function.

R program for Logistic and Loglinear models
Both logistic and loglinear models are special generalized linear models (GLM).
R uses a function 'glm' for fitting GLM models with different families and link functions:
  Logistic regression: binomial family
  Loglinear regression: Poisson family
Logistic:

    fit <- glm(as.factor(y) ~ x, family = binomial(link = logit))

Loglinear:

    fit <- glm(y ~ x, family = poisson(link = log))

R output for Logistic and Loglinear models

    > fit
    Call:  glm(formula = y ~ x, family = binomial(link = logit))

    Coefficients:
    (Intercept)           x1           x2           x3           x4
      -0.725709     0.014827     0.015248     0.007396    -0.041097

    Degrees of Freedom: 19 Total (i.e. Null);  15 Residual
    Null Deviance:      24.43
    Residual Deviance: 21.22    AIC: 31.22

R output for Logistic

    > summary(fit)
    Call:
    glm(formula = y ~ x, family = binomial(link = logit))

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.3708  -0.9643  -0.3896   1.2518   1.6197

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.725709   1.988678  -0.365    0.715
    x1           0.014827   0.021880   0.678    0.498
    x2           0.015248   0.022781   0.669    0.503
    x3           0.007396   0.018840   0.393    0.695
    x4          -0.041097   0.030172  -1.362    0.173

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 24.435  on 19  degrees of freedom
    Residual deviance: 21.223  on 15  degrees of freedom

Mortality from Cervical Cancer in Ontario 1960-94
Rate (per 10^5 person-years) and Frequency (in parentheses)

Age \ Year   60-64         65-69         70-74         75-79         80-84         85-89         90-94
20-24        0.15 (2)      0.11 (2)      0.15 (3)      0.14 (3)      0.14 (3)      0.20 (4)      0.13 (1)
25-29        1.22 (14)     0.52 (8)      1.24 (23)     0.80 (16)     0.88 (20)     0.47 (11)     0.93 (8)
30-34        3.15 (35)     2.94 (37)     2.01 (32)     1.45 (27)     1.79 (38)     1.31 (32)     1.08 (11)
35-39        5.38 (62)     4.47 (52)     3.59 (46)     3.86 (61)     3.12 (60)     2.47 (55)     2.16 (21)
40-44        9.80 (116)    7.15 (84)     4.32 (51)     5.12 (66)     3.71 (60)     2.47 (63)     2.16 (33)
45-49        15.66 (160)   10.97 (130)   7.75 (91)     4.69 (55)     5.17 (67)     5.02 (83)     3.41 (27)
50-54        17.01 (151)   13.32 (138)   8.19 (97)     6.82 (80)     6.12 (72)     4.65 (61)     5.79 (35)
55-59        18.56 (141)   15.23 (133)   11.53 (118)   9.12 (107)    5.94 (70)     5.81 (69)     5.77 (29)
60-64        22.44 (144)   16.08 (121)   13.66 (117)   10.71 (108)   7.93 (92)     7.35 (86)     4.02 (19)
65-69        23.53 (128)   18.87 (119)   15.31 (112)   13.79 (115)   10.36 (102)   7.60 (86)     6.83 (31)
70-74        25.89 (116)   19.36 (97)    15.36 (89)    15.18 (103)   13.95 (108)   10.42 (96)    10.44 (44)
75-79        29.12 (94)    20.08 (75)    23.84 (102)   16.29 (82)    14.90 (88)    11.50 (78)    12.73 (38)
80-84        31.76 (62)    24.72 (59)    21.51 (60)    23.82 (79)    12.69 (50)    17.40 (81)    12.77 (27)
85+          33.16 (42)    28.95 (50)    22.90 (50)    24.94 (68)    15.23 (51)    13.88 (56)    10.42 (19)

Disease Rate (Per 10^5) and Frequency (in parentheses)

Age \ Year   60-64        65-69        70-74        75-79
20-24        0.15 (2)     0.11 (2)     0.15 (3)     0.14 (3)
25-29        1.22 (14)    0.52 (8)     1.24 (23)    0.80 (16)
30-34        3.15 (45)    2.94 (37)    2.01 (32)    1.45 (27)

Rows: Age. Columns: Year. Diagonals: Birth Cohort.

Statistical Models
Loglinear model
  log(ER_ij) = log(n_ij) + μ + α_i + β_j + γ_ij
μ: intercept; ER_ij: expected frequency in cell (i, j); n_ij: population-years of exposure (offset term).
α_1, …, α_a: row (age) effects, Σ_{i=1}^{a} α_i = 0
β_1, …, β_p: column (period) effects, Σ_{j=1}^{p} β_j = 0
γ_ij: interaction effects, Σ γ_ij = 0
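A loglinear model with an offset can be fitted in R with glm and the Poisson family. The sketch below is an illustration under stated assumptions. It uses the first two age groups and first four periods of the cervical cancer table, recovers person-years from the published rates as count / rate × 10^5 (so they are approximate), omits the interaction term, and imposes the sum-to-zero constraints through "contr.sum" contrasts.

```r
# Cells taken from the cervical cancer table (ages 20-24 and 25-29, 1960-79)
dat <- data.frame(
  age    = factor(rep(c("20-24", "25-29"), each = 4)),
  period = factor(rep(c("60-64", "65-69", "70-74", "75-79"), times = 2)),
  deaths = c(2, 2, 3, 3, 14, 8, 23, 16),
  rate   = c(0.15, 0.11, 0.15, 0.14, 1.22, 0.52, 1.24, 0.80)  # per 10^5
)

# Approximate person-years implied by the rounded rates (an assumption)
dat$py <- dat$deaths / dat$rate * 1e5

# log(ER_ij) = log(n_ij) + mu + alpha_i + beta_j, with sum-to-zero effects
fit <- glm(deaths ~ age + period + offset(log(py)),
           family = poisson(link = "log"), data = dat,
           contrasts = list(age = "contr.sum", period = "contr.sum"))

print(coef(fit))
fitted_total   <- sum(fitted(fit))   # expected deaths summed over cells
observed_total <- sum(dat$deaths)
```

Because the model contains an intercept and uses the log link, the fitted and observed totals coincide, which gives a quick sanity check on the fit.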

Statistical Models
APC loglinear model
  log(ER_ij) = log(n_ij) + μ + α_i + β_j + γ_k
μ: intercept; ER_ij: expected frequency in cell (i, j);
n_ij: population-years of exposure (offset term).
α_1, …, α_a: row (age) effects, Σ_{i=1}^{a} α_i = 0
β_1, …, β_p: column (period) effects, Σ_{j=1}^{p} β_j = 0
γ_1, …, γ_{a+p-1}: diagonal (cohort) effects, Σ_{k=1}^{a+p-1} γ_k = 0

Problem
Linear dependency among covariates:
  Period - Age = Cohort
Rows, columns and diagonals on a Lexis diagram
Multiple estimators!


Matrix form
Models in matrix form
  log(ER) = log(n) + X b
  b = (μ, α_1, …, α_{a-1}, β_1, …, β_{p-1}, γ_1, …, γ_{a+p-2})^T
X: singular design matrix, 1 less than full rank

Matrix form
Regression design matrix (X) for APC model with 3x3 table

          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
     [1,]    1    0    1    0    0    0    1    0
     [2,]    1    0    0    1    0    0    0    1
     [3,]    1    0   -1   -1   -1   -1   -1   -1
     [4,]    0    1    1    0    0    1    0    0
     [5,]    0    1    0    1    0    0    1    0
     [6,]    0    1   -1   -1    0    0    0    1
     [7,]   -1   -1    1    0    1    0    0    0
     [8,]   -1   -1    0    1    0    1    0    0
     [9,]   -1   -1   -1   -1    0    0    1    0

Eigenvalues of (X^T X): 12.16 9.23 4.98 3.35 3.00 2.15 1.12 0.00
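The singularity of this design matrix can be confirmed numerically. The R sketch below enters the 9 x 8 matrix from the slide by hand (two age columns, two period columns, four cohort columns, no intercept column) and computes the eigenvalues of X^T X and the rank of X.

```r
# Design matrix for the 3x3 APC table, copied row by row from the slide
X <- matrix(c(
   1,  0,  1,  0,  0,  0,  1,  0,
   1,  0,  0,  1,  0,  0,  0,  1,
   1,  0, -1, -1, -1, -1, -1, -1,
   0,  1,  1,  0,  0,  1,  0,  0,
   0,  1,  0,  1,  0,  0,  1,  0,
   0,  1, -1, -1,  0,  0,  0,  1,
  -1, -1,  1,  0,  1,  0,  0,  0,
  -1, -1,  0,  1,  0,  1,  0,  0,
  -1, -1, -1, -1,  0,  0,  1,  0),
  nrow = 9, byrow = TRUE)

# Eigenvalues of X'X in decreasing order; one of them is numerically zero
ev <- eigen(crossprod(X), symmetric = TRUE)$values
print(round(ev, 2))

# Rank of X: one less than the number of columns
rank_X <- qr(X)$rank
print(rank_X)
```

A single zero eigenvalue means there is exactly one direction in parameter space along which the data carry no information, which is the source of the multiple estimators.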

R-program to generate APC Matrix

    apcmat1 <- function (a = 3, p = 3) {
      ## construct APC matrix for APC analysis
      x <- matrix(0, a*p, 2*(a+p-2))
      gammac <- rbind(diag(1, a+p-2), -1)

      for (i in 1:(a-1)) {
        # for alpha
        x[(i-1)*p+(1:p), i] <- 1
        # for beta
        x[(i-1)*p+(1:(p-1)), a:(a+p-2)] <- diag(1, p-1)
        x[i*p, a:(a+p-2)] <- -1
        # for gamma
        for (j in 1:p)
          x[(i-1)*p+j, -1:-(a+p-2)] <- gammac[a-i+j, ]
      }

      # last block (row in apc table) for gamma
      for (j in 1:p)
        x[(a-1)*p+j, -1:-(a+p-2)] <- gammac[j, ]
      gammac <- NULL

      # last block (row in apc table) for alpha
      x[(a-1)*p+(1:p), 1:(a-1)] <- -1

      # last block (row in apc table) for beta
      x[(a-1)*p+(1:(p-1)), a:(a+p-2)] <- diag(1, p-1)
      x[a*p, a:(a+p-2)] <- -1

      x
    }


Challenge - Identification
Constraint on parameters (Kupper et al, 1985)
One more constraint → trend determination
But different constraint → different trend

Challenge - Identification
Conclusion: all estimators are biased
  Except one
  With constraint satisfied by true parameters of model
Not verifiable!!
Identifiability problem!
Mystery?

Previous Methods
Estimable functions (Fienberg + Mason 1979, Clayton + Schifflers 1987, Holford 1985, 1991)
  Independent of selection of constraint
  Invariant characteristics of trends
  Nonlinear components estimable: curvature
  Linear component (slope) not estimable!

Previous Conclusion
Identifiability problem: difficult!
Linear trends are not estimable (numerically)!
  Kupper et al. (1985), Holford (1985, 1991), Clayton and Schifflers (1987), Fienberg and Mason (1979, 1985).
Kupper et al. (1985) provided a condition for an estimable function.

Mystery?
Approach?
Multiple estimators!
How to pick the "correct one"?
Hint: think about math, not statistics!

Structure of Estimators
Each estimator
  b̂ = B + tB0
B0: eigenvector of X^T X with eigenvalue 0.
  ||B0|| = 1, independent of disease rate
t: arbitrary real number
B: orthogonal to B0, uniquely determined by disease/event rate

B0 Independent of Rate
Kupper et al. (1985):
  B* = (0, A, P, C)
  A = [1 - (a+1)/2, …, (a-1) - (a+1)/2]
  P = [-1 + (p+1)/2, …, -(p-1) + (p+1)/2]
  C = [1 - (a+p)/2, …, (a+p-2) - (a+p)/2]
  B0 = B* / ||B*||
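The closed form for B0 can be checked against the 3x3 design matrix. In the R sketch below, the function name kupper_b0 is hypothetical. The cohort part is taken over the a + p - 2 free cohort parameters so that its length matches the number of cohort columns, and an intercept column of ones is prepended to the 9 x 8 matrix from the slides to match the leading 0 in B*.

```r
# Null vector B0 = B* / ||B*|| with B* = (0, A, P, C)  (hypothetical helper)
kupper_b0 <- function(a, p) {
  A <- (1:(a - 1)) - (a + 1) / 2        # age part
  P <- (p + 1) / 2 - (1:(p - 1))        # period part
  C <- (1:(a + p - 2)) - (a + p) / 2    # cohort part
  b_star <- c(0, A, P, C)               # leading 0 belongs to the intercept
  b_star / sqrt(sum(b_star^2))
}

# 3x3 APC design matrix from the slides, without the intercept column
X <- matrix(c(
   1,  0,  1,  0,  0,  0,  1,  0,
   1,  0,  0,  1,  0,  0,  0,  1,
   1,  0, -1, -1, -1, -1, -1, -1,
   0,  1,  1,  0,  0,  1,  0,  0,
   0,  1,  0,  1,  0,  0,  1,  0,
   0,  1, -1, -1,  0,  0,  0,  1,
  -1, -1,  1,  0,  1,  0,  0,  0,
  -1, -1,  0,  1,  0,  1,  0,  0,
  -1, -1, -1, -1,  0,  0,  1,  0),
  nrow = 9, byrow = TRUE)
X1 <- cbind(1, X)          # add the intercept column

b0 <- kupper_b0(3, 3)
print(round(b0 * sqrt(8), 6))          # unnormalized B*: 0 -1 0 1 0 -2 -1 0 1
max_abs_Xb0 <- max(abs(X1 %*% b0))     # should be numerically zero
```

Since X1 %*% b0 vanishes and b0 depends only on a and p, adding any multiple of b0 to an estimator leaves the fitted values unchanged regardless of the observed rates.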

Estimable Function
Theorem 1
E(B) is estimable, and determines both linear and nonlinear components of trend;
  B = (I - B0 B0^T) b̂
E(B) is the only estimable function that determines both linear and nonlinear components;
  L b̂ = L(B + tB0) = LB + tLB0 = LB
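The projection in Theorem 1 can be illustrated with a small numerical example. The vectors in this R sketch are hypothetical and three-dimensional only for readability. Two estimators that differ only in the multiple t of B0 are projected onto the same B, and a linear function L with L B0 = 0 takes the same value on both.

```r
# Hypothetical unit null vector B0 and component B orthogonal to it
B0 <- c(1, -1, 0) / sqrt(2)
B  <- c(1, 1, 2)                    # sum(B * B0) is 0

# Two estimators from different constraints: same B, different t
b1 <- B + 0.7 * B0
b2 <- B - 3.0 * B0

# Projection I - B0 B0^T removes the arbitrary t * B0 term
proj      <- diag(3) - tcrossprod(B0)
B_from_b1 <- drop(proj %*% b1)
B_from_b2 <- drop(proj %*% b2)

# A linear function with L B0 = 0 does not depend on t
L    <- c(1, 1, 5)
L_b1 <- sum(L * b1)
L_b2 <- sum(L * b2)

print(rbind(B_from_b1, B_from_b2))
print(c(L_b1 = L_b1, L_b2 = L_b2))
```

Whatever constraint produced b1 or b2, the projection returns the same vector, which is the sense in which B is free of the arbitrary term.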

Geometry for Estimable E(B)
[Figure: geometry of the estimators, showing the origin O, the vectors B and B0, the components t1 B0 and t2 B0, and the estimators b̂1 and b̂2.]

Constraint
Quantitative constraints
  Specify relationship between parameters
  Require a priori knowledge of event/disease
Qualitative constraints
  Require no a priori knowledge

Properties of B
Intrinsic to disease/event: arbitrary tB0 removed
Only estimable function (linear + nonlinear)
Robust estimation by sensitivity analysis
Consistent estimation for intercept and age effects μ, α_1, …, α_{a-1} as p → ∞

New Method - Intrinsic Estimator
Structure of estimators: B + tB0
Intrinsic estimator: B
  Determined by removing tB0, the arbitrary term
Effective trend: trend determined by B

Cervical Cancer Mortality
[Figure: three panels. Age trend with 95% CI (age effect vs. age, 20 to 80); period trend with 95% CI (period effect vs. period, 1960 to 1990); cohort trend with 95% CI (cohort effect vs. cohort, 1880 to 1960).]

Homicide Arrest Rate (per 10^5) (R. O'Brien, 2000)

Age \ Calendar year   1960    1965    1970    1975    1980    1985    1990    1995
15                    8.89    9.07    17.22   17.54   18.02   16.32   36.52   35.24
20                    14.00   15.18   23.76   25.62   23.95   21.11   29.10   32.34
25                    13.45   14.69   20.09   21.05   18.91   16.79   17.99   16.75
30                    10.73   11.70   16.00   15.81   15.22   12.59   12.44   10.05
35                    9.37    9.76    13.13   12.83   12.31   9.60    9.38    7.27
40                    6.48    7.41    10.10   10.52   8.79    7.50    6.81    5.48
45                    5.71    5.56    7.51    7.32    6.76    5.31    5.17    3.67

Homicide Arrest Rate
[Figure: three panels. Age trend (age effect vs. age, 15 to 45); period trend (period effect vs. period, 1960 to 1990); cohort trend (cohort effect vs. cohort, 1920 to 1980).]

On-line Software APCsoft
Based on IE method: www.apcsoft.epi.msu.edu
Web-based software
Need to upload Excel files.
Output: analysis results and dynamic graphics.
