Introduction to Biostatistics (ZJU 2008)
Wenjiang Fu, Ph.D.
Associate Professor
Division of Biostatistics, Department of Epidemiology
Michigan State University
East Lansing, Michigan 48824, USA
Email: [email protected]
www: http://www.msu.edu/~fuw
Logistic regression model
Why use logistic regression?
Estimation by maximum likelihood
Interpretation of coefficients
Hypothesis testing
Evaluating the performance of the model
Why logistic regression?
Many important research topics have a binary dependent variable, e.g. disease vs. no disease, damage vs. no damage, death vs. survival.
Binary logistic regression is a type of regression analysis in which the dependent variable is a dummy variable, coded 0 (absence of disease) or 1 (presence of disease).
The goal is to explain the variability of the binary variable by other variables, either continuous or categorical, such as age, sex, BMI, marital status or socio-economic status, using a statistical model that relates the probability of the response event to the explanatory variables.
Logistic Regression Model
Event (Y = 1), no event (Y = 0). We want to model the mean:
E(Y) = P(Y=1) * 1 + P(Y=0) * 0 = P(Y=1)
but π = P(Y=1) is bounded between 0 and 1, whereas the linear predictor β0 + x1β1 + x2β2 + x3β3 + … is a linear combination and may take any value.
The "logit" model solves this problem.
Single independent variable: ln[p/(1-p)] = α + βX
where the probability p = P(Y=1 | X).
p/(1-p) is the "odds": the ratio of the probability that the event occurs to the probability that it does not occur, under condition X.
ln[p/(1-p)] is the "log odds", or the "logit" of the probability.
More:
The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
The estimated probability is:
p = 1/[1+exp(-α - βX)] = exp(α + βX) / [1+exp(α + βX)]
If α + βX = 0, then p = 0.50.
As α + βX becomes large and positive, p approaches 1.
As α + βX becomes large and negative, p approaches 0.
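The limiting behaviour described above can be checked numerically. The following is a minimal sketch in Python; the function name `inv_logit` is chosen here for illustration and does not come from the slides.

```python
import math

def inv_logit(eta):
    """Map a linear predictor eta = alpha + beta * x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# p is 0.5 at eta = 0 and moves toward 0 or 1 as eta moves away from 0.
for eta in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"eta = {eta:6.1f}  ->  p = {inv_logit(eta):.5f}")
```

Whatever value the linear predictor takes, the result stays strictly between 0 and 1, which is the property a linear model for p lacks.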
Comparing LinReg and Logit Models
[Figure: fitted curves of the LinReg model and the Logit model plotted against the observed outcomes Y=0 and Y=1.]
Maximum Likelihood Estimation (MLE)
MLE is a statistical method for estimating the coefficients of a model.
The likelihood function (L) measures the probability of observing the particular set of dependent variable values (Y1=y1, …, Yn=yn):
L = Prob(y1, y2, …, yn)
The higher the L, the higher the probability of observing the sample data at hand.
Maximum Likelihood Estimator
MLE involves finding the coefficients (α, β) that make the log of the likelihood function (logLik < 0) as large as possible.
The maximum likelihood estimator maximizes the following likelihood function:
$$ Lik = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} $$
where p_i = exp(α + βX_i) / [1 + exp(α + βX_i)].
Maximum Likelihood Estimator
Or equivalently, the MLE maximizes the log-likelihood function
$$ l = \log Lik = \sum_{i=1}^{n} y_i \log\frac{p_i}{1 - p_i} + \sum_{i=1}^{n} \log(1 - p_i) $$
where p_i = exp(α + βX_i) / [1 + exp(α + βX_i)].
The term log[p/(1-p)] is the link function. When all observations share a common p, the second sum reduces to n log(1 - p).
The MLE is in general a biased estimator in finite samples, but it is consistent: by large-sample theory, the estimator converges to the true model parameters as the sample size increases.
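To make the maximization concrete, here is a minimal sketch of a Newton-Raphson fit for a model with a single binary covariate. It is an illustration under stated assumptions, not the algorithm used by any particular package: the function names are hypothetical, and the grouped counts are the "never" and "occasional" snoring groups from the heart-disease example later in these slides.

```python
import math

# Grouped data: (x, number of subjects, number with the event).
# x = 0: never snore (24 of 1379), x = 1: occasional snorers (35 of 638).
groups = [(0, 1379, 24), (1, 638, 35)]

def log_lik(alpha, beta):
    """Log-likelihood (up to a constant) for logit(p) = alpha + beta * x."""
    total = 0.0
    for x, n, y in groups:
        eta = alpha + beta * x
        # y * log(p/(1-p)) + n * log(1-p), using log(1-p) = -log(1 + exp(eta))
        total += y * eta - n * math.log(1.0 + math.exp(eta))
    return total

def fit_newton(max_iter=50, tol=1e-10):
    """Maximize log_lik by Newton-Raphson, starting from alpha = beta = 0."""
    alpha, beta = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n, y in groups:
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
            r = y - n * p              # contribution to the score (gradient)
            w = n * p * (1.0 - p)      # contribution to the information matrix
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system H * step = g by Cramer's rule.
        det = h00 * h11 - h01 * h01
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        alpha += step_a
        beta += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return alpha, beta

alpha_hat, beta_hat = fit_newton()
print(f"alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}, "
      f"logLik = {log_lik(alpha_hat, beta_hat):.3f}")
```

Because the model has one parameter per group, the estimates should coincide with the closed-form values ln(24/1355) for the intercept and the log of the sample odds ratio for the slope, which is a convenient check on the iteration.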
Interpretation of Coefficients
Since ln[p/(1-p)] = α + βX, the slope β is interpreted as the rate of change in the "log odds" as X changes … not very useful.
More useful: p = Pr(Y=1 | X).
p/(1-p) is the odds under condition X.
Example. X: smoker; Y=1: lung cancer.
p/(1-p) is the odds of lung cancer given smoking status.
Interpretation of Coefficients
If X is continuous, exp(β) measures the multiplicative change in the odds for a one-unit increase in X:
ln(OR) = ln[odds(X+1)] - ln[odds(X)]
= ln[p(Y=1|X+1) / (1-p(Y=1|X+1))] - ln[p(Y=1|X) / (1-p(Y=1|X))]
= α + β(X+1) - (α + βX) = β
ln(OR) = β, so OR = exp(β)
If X is binary, exp(β) measures the odds for one group (X=1) relative to the other (X=0):
ln[p(Y=1|X=1) / (1-p(Y=1|X=1))] - ln[p(Y=1|X=0) / (1-p(Y=1|X=0))]
= α + β·1 - (α + β·0) = β
ln(OR) = β, so OR = exp(β)
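Interpretation of Coefficients
Interpretation of parameter β. Model: logit(p) = β0 + x1β1
Cell probabilities for X=1 (Y=1 and Y=0):
$$ P(Y=1 \mid x_1=1) = \frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}}, \qquad P(Y=0 \mid x_1=1) = \frac{1}{1+e^{\beta_0+\beta_1}} $$
Cell probabilities for X=0 (Y=1 and Y=0):
$$ P(Y=1 \mid x_1=0) = \frac{e^{\beta_0}}{1+e^{\beta_0}}, \qquad P(Y=0 \mid x_1=0) = \frac{1}{1+e^{\beta_0}} $$
Odds ratio:
$$ OR = \frac{P(1 \mid x_1=1)\,[1 - P(1 \mid x_1=0)]}{[1 - P(1 \mid x_1=1)]\,P(1 \mid x_1=0)} = e^{\beta_1} $$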
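The identity OR = exp(β1) can be confirmed numerically from the four cell probabilities. This is a small sketch; the coefficient values are hypothetical and chosen only for illustration.

```python
import math

def prob_event(beta0, beta1, x1):
    """P(Y = 1 | x1) under logit(p) = beta0 + beta1 * x1."""
    eta = beta0 + beta1 * x1
    return math.exp(eta) / (1.0 + math.exp(eta))

beta0, beta1 = -1.2, 0.7   # hypothetical values, not taken from the slides

p1 = prob_event(beta0, beta1, 1)   # P(Y=1 | x1=1)
p0 = prob_event(beta0, beta1, 0)   # P(Y=1 | x1=0)

# Cross-product ratio of the 2x2 table of probabilities.
odds_ratio = (p1 * (1.0 - p0)) / ((1.0 - p1) * p0)
print(odds_ratio, math.exp(beta1))
```

The intercept β0 cancels in the ratio, so the two printed numbers agree whatever value of β0 is used.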
Model Assessment
There are several statistics which can be used for comparing alternative models or evaluating the performance of a single model:
LRT, Wald tests
Percent Correct Predictions
Model Chi-Squares Statistic
The model likelihood ratio test (LRT) statistic is
LR = [-2 lik(Reduced model)] - [-2 lik(Full model)]
where lik denotes the maximized log-likelihood of the fitted model.
Example: test of β, LR = -2 [lik(α) - lik(α, β)]
lik(α) is the log-likelihood of the model with only the intercept
lik(α, β) is the log-likelihood of the model with the intercept and X
Under the null hypothesis, the LR statistic approximately follows a chi-square distribution with r degrees of freedom, where r is the difference in the number of parameters between the two models.
Use the LRT statistic to determine if the overall model is statistically significant.
Percent Correct Predictions
Predicted outcome ŷ (majority vote method):
if the predicted probability Pr(x) ≥ 0.5, assign ŷ = 1
otherwise assign ŷ = 0
Compare the predicted outcome ŷ with the actual outcome y and compute the percentage of correct predictions.
An Example:
              Predicted
Observed      0       1       % Correct
0             328     24      93.18%
1             139     44      24.04%
Overall                       69.53%
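The percentages in the classification table follow directly from its counts. A minimal sketch, using the four counts shown in the example (variable names are illustrative):

```python
# Classification table: rows = observed outcome (0, 1), columns = predicted (0, 1).
table = [[328, 24],
         [139, 44]]

# Percent correct within each observed class: diagonal count / row total.
pct_correct_0 = 100.0 * table[0][0] / sum(table[0])
pct_correct_1 = 100.0 * table[1][1] / sum(table[1])

# Overall percent correct: sum of the diagonal / grand total.
n_total = sum(sum(row) for row in table)
pct_overall = 100.0 * (table[0][0] + table[1][1]) / n_total

print(f"{pct_correct_0:.2f}%  {pct_correct_1:.2f}%  {pct_overall:.2f}%")
```

The overall figure is dominated by the larger class, so it is worth reading it next to the per-class percentages, as the table does.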
Testing significance of variables
Omitted variable(s) can result in bias in the coefficient estimates. To test for omitted variables you can conduct a likelihood ratio test:
LR[q] = [-2 lik(constrained model, i = k-q)] - [-2 lik(unconstrained model, i = k)]
where lik is the log-likelihood, i is the number of variables in the model, and LR follows a chi-square distribution with q degrees of freedom, q ≥ 1 being the number of omitted variables.
An Example:
Variable     B        Wald      Sig
PETS        -0.699    10.968    0.001
MOBLHOME     1.570    29.412    0.000
TENURE      -0.020     5.993    0.014
EDUC         0.049     1.079    0.299
CHILD        0.009     0.011    0.917
WHITE        0.186     0.422    0.516
FEMALE       0.018     0.008    0.928
Constant    -1.049     2.073    0.150
Beginning -2 LL   687.36
Ending -2 LL      641.41
The Wald statistic is based on B/se(B).
Constructing the LR Test
Ending -2 loglik, Partial Model    641.84
Ending -2 loglik, Full Model       641.41
Block Chi-Square                   0.43
DF                                 3
Critical Value                     11.345
"Since the chi-squared value is less than the critical value, the set of coefficients is not statistically significant. The full model is not an improvement over the partial model."
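The block test above can be reproduced with a few lines of Python. This is a sketch under one stated assumption: the helper `chi2_sf_df3` (a name introduced here) uses the closed-form upper tail of the chi-square distribution that holds specifically for 3 degrees of freedom, so it is not a general-purpose routine.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 df.

    Closed form valid for 3 degrees of freedom only:
    erfc(sqrt(x/2)) + 2 * sqrt(x / (2*pi)) * exp(-x/2).
    """
    z = x / 2.0
    return math.erfc(math.sqrt(z)) + 2.0 * math.sqrt(z / math.pi) * math.exp(-z)

neg2ll_partial = 641.84   # -2 loglik, partial model (from the slide)
neg2ll_full = 641.41      # -2 loglik, full model (from the slide)

lr = neg2ll_partial - neg2ll_full      # block chi-square
p_value = chi2_sf_df3(lr)              # 3 variables were added, so df = 3

print(f"LR = {lr:.2f}, p-value = {p_value:.3f}")
print("significant at the 1% level:", lr > 11.345)
```

Comparing LR with the critical value 11.345 and comparing the p-value with 0.01 are the same decision rule expressed two ways.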
Multiple Logistic regression
The probability of an event labeled as a binary outcome.
Logistic regression model: logit function.
log[π / (1-π)] = β0 + x1β1 + … + xpβp = η
equivalent to π = exp(η) / [1 + exp(η)]
Why do we need multiple logistic regression?
A simple R×C table cannot solve all problems, and can be misleading.
Multiple Logistic Regression - Formulation
The relationship between π and x is S-shaped.
The logit (log-odds) transformation (link function):
$$ \ln\frac{\pi(x)}{1-\pi(x)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p $$
$$ E(Y \mid x) = P(Y=1 \mid x) = \pi(x) = \frac{e^{\beta_0+\beta_1 X_1+\dots+\beta_p X_p}}{1+e^{\beta_0+\beta_1 X_1+\dots+\beta_p X_p}} $$
Multiple Logistic Regression
Assess risk factors
Individually: H0: βk = 0
Globally: H0: βm = … = βm+t = 0
while controlling for confounders and other important determinants of the event.
Interpretation of the parameters
If π is the probability of an event and O is the odds for that event, then
$$ O = \frac{\pi(x)}{1-\pi(x)} = \frac{\text{probability of event}}{\text{probability of no event}} $$
The link function in logistic regression gives the log-odds:
$$ g(x) = \ln\frac{\pi(x)}{1-\pi(x)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p $$
Example: Snoring & Heart Disease
An epidemiologic study surveyed 2484 subjects to examine whether snoring was a possible risk factor for heart disease.
                          Snoring
Heart Disease   Never   Occasional   Nearly every night   Every night
Yes             24      35           21                   30
No              1355    603          192                  224
Prop(yes)       .017    .055         .099                 .118
Constructing Indicator variables
Let Z1 = 1 if occasional, 0 otherwise
Let Z2 = 1 if nearly every night, 0 otherwise
Let Z3 = 1 if every night, 0 otherwise
SAS Codes
data hd;
  input hd $ snoring $ count;
  Z1 = (snoring="occa");
  Z2 = (snoring="nearly");
  Z3 = (snoring="every");
  cards;
yes never 24
yes occa 35
yes nearly 21
yes every 30
no never 1355
no occa 603
no nearly 192
no every 224
;
run;
proc logistic data=hd descending;
  model hd (event='yes') = Z1 Z2 Z3;
  freq count;
run;
SAS OUTPUT
Ordered                Total
Value       hd         Frequency
1           yes        110
2           no         2374
Probability modeled is hd='yes'.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
                Intercept    Intercept and
Criterion       Only         Covariates
AIC             902.827      842.923
SC              908.645      866.193
-2 Log L        900.827      834.923
Testing Global Null Hypothesis: BETA=0
Test        Chi-Square    DF    Pr
LikRatio    65.9045       3     <.0001
Score       72.7821       3     <.0001
Wald        58.9513       3     <.0001
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
                          Standard    Wald
Param    DF    Estimate   Error       Chi-Square    P
Inter    1     -4.0335    0.2059      383.6641      <.0001
Z1       1      1.1869    0.2695       19.3959      <.0001
Z2       1      1.8205    0.3086       34.8027      <.0001
Z3       1      2.0231    0.2832       51.0313      <.0001
Odds Ratio Estimates
          Point       95% Wald
Effect    Estimate    Confidence Limits
Z1        3.277       1.932     5.558
Z2        6.175       3.373    11.306
Z3        7.561       4.341    13.172
Calculating Probabilities
The fitted logistic regression function is
Logit(π) = -4.0335 + 1.1869 Z1 + 1.8205 Z2 + 2.0231 Z3
So, the probability of heart disease for those who never snore is exp(-4.0335) / (1+exp(-4.0335)) = .0174
For those who snore occasionally, it is exp(-4.0335+1.1869) / (1+exp(-4.0335+1.1869)) = .0549
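The same calculation extends to the two remaining snoring groups. The sketch below plugs the fitted coefficients from the SAS output into the logistic function; the dictionary layout and names are illustrative choices, not part of the original analysis.

```python
import math

# Fitted coefficients from the SAS output above.
coef = {"intercept": -4.0335, "Z1": 1.1869, "Z2": 1.8205, "Z3": 2.0231}

# Indicator values (Z1, Z2, Z3) for each snoring group.
groups = {
    "never": (0, 0, 0),
    "occasional": (1, 0, 0),
    "nearly every night": (0, 1, 0),
    "every night": (0, 0, 1),
}

fitted = {}
for name, (z1, z2, z3) in groups.items():
    eta = coef["intercept"] + coef["Z1"] * z1 + coef["Z2"] * z2 + coef["Z3"] * z3
    fitted[name] = math.exp(eta) / (1.0 + math.exp(eta))
    print(f"{name:20s} {fitted[name]:.4f}")
```

Because the model uses one indicator per non-reference group, the fitted probabilities should match the observed proportions in the Prop(yes) row of the data table up to rounding.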
Calculating Odds Ratios
If Z1=Z2=Z3=0, then the odds are exp(-4.0335)
If Z2=Z3=0, but Z1=1, then the odds are exp(-4.0335+1.1869)
The ratio of the odds is then exp(1.1869) = 3.2769
Interpretation: Compared with people who never snore, people who snore occasionally have 3.28 times the odds of heart disease.
What is the odds ratio for comparing those who snore nearly every night with occasional snorers?
What is the odds ratio for comparing those who snore every night with those who snore nearly every night?
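One way to approach the two questions is to note that an odds ratio between two non-reference groups is the exponential of the difference of their coefficients, since the intercept cancels. A short sketch using the fitted values from the SAS output (variable names are illustrative):

```python
import math

b_z1, b_z2, b_z3 = 1.1869, 1.8205, 2.0231   # fitted coefficients for Z1, Z2, Z3

# Nearly every night (Z2) versus occasional (Z1).
or_nearly_vs_occasional = math.exp(b_z2 - b_z1)

# Every night (Z3) versus nearly every night (Z2).
or_every_vs_nearly = math.exp(b_z3 - b_z2)

print(f"nearly every night vs occasional:  {or_nearly_vs_occasional:.3f}")
print(f"every night vs nearly every night: {or_every_vs_nearly:.3f}")
```

These values can be cross-checked against the cross-product ratios computed directly from the counts in the data table, for example (21/192) / (35/603) for the first comparison.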
Example: Genetic Association Study
Idiopathic Pulmonary Fibrosis (IPF) is known to be associated with age and gender (older people and males are more likely to be affected).
One study with 174 cases and 225 controls found an association of IPF with the genotype of one gene, COX2.8473 (C → T).
P-value by Pearson chi-square test: p = 0.0241.
Q: Is this association true?
Genotype CC CT TT total
Case 88 72 14 174
Control 84 113 28 225
Total 172 185 42 399
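The quoted p-value can be reproduced from the genotype table. The sketch below computes the Pearson chi-square statistic from observed and expected counts; it relies on the fact that for 2 degrees of freedom the chi-square upper tail is exactly exp(-x/2), so no statistics library is needed. Function and variable names are illustrative.

```python
import math

def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for an R x C table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: case, control; columns: CC, CT, TT.
genotype = [[88, 72, 14],
            [84, 113, 28]]

stat, df = pearson_chi2(genotype)
p_value = math.exp(-stat / 2.0)   # exact upper tail for df = 2 only
print(f"X2 = {stat:.4f}, df = {df}, p = {p_value:.4f}")
```

This is an unadjusted test: it ignores age and sex, which is precisely the concern the following slides investigate.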
Example on genetic study
Logistic regression model:
logit[Pr(IPF)] = intercept + snp + sex + age
Results (Wald tests):
Effect    DF    Chi-square    P-value
SNP       2     2.7811        0.2489
sex       1     9.1172        0.0025
age       1     100.454       <.0001
Example on genetic study
Investigate why it happens by age and sex
Disease (N-normal; D-disease) by age
      0-29    30-49    50-64    65-74    75+
N     104     77       35       7        2
D     0       10       42       68       54
T     104     87       77       75       56
Example on genetic study
Investigate why it happens by age and sex
Disease (N-normal; D-disease) by sex
      male    female
N     72      153
D     108     66
T     180     219
Example on genetic study
Investigate why it happens by age and sex
SNP genotype by sex
         male    female
CC       79      93
CT       75      110
TT       26      16
Total    180     219
Pearson chi-square test: X2 = 6.3911, p = 0.0409
Example on genetic study
Investigate why it happens by age and sex
Age class by genotype
         CC     CT     TT
0-29     36     58     10
30-49    34     42     11
50-64    35     32     10
65-74    37     30     8
75+      30     23     3
T        172    185    42
Pearson chi-square test: X2 = 10.01, p = 0.2643
Loglinear regression model
Why use loglinear regression?
Often, observations are counts (≥ 0), with a known number of people exposed to risk (the total population at risk).
Estimation by the maximum likelihood estimator, which maximizes the Poisson likelihood
$$ Lik = \prod_{i=1}^{n} \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} $$
or the loglik function
$$ l = \log Lik = \sum_{i=1}^{n} \left[ y_i \log\lambda - \lambda - \log(y_i!) \right] = \log\lambda \sum_{i=1}^{n} y_i - n\lambda + \text{Const.} $$
Link function: log λ
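As a small numerical illustration of the Poisson log-likelihood, the sketch below evaluates it for a handful of counts and checks that the sample mean maximizes it. The counts are the seven frequencies for ages 20-24 in the cervical cancer table of these slides, treated here, purely for illustration, as if they shared one rate and ignoring the exposure offset; the function name is hypothetical.

```python
import math

counts = [2, 2, 3, 3, 3, 4, 1]   # frequencies for age 20-24, periods 60-64 to 90-94

def poisson_loglik(lam, ys):
    """Sum over observations of y*log(lam) - lam - log(y!)."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in ys)

# Setting the derivative sum(y)/lam - n to zero gives lam_hat = mean(y).
lam_hat = sum(counts) / len(counts)

for lam in (0.8 * lam_hat, lam_hat, 1.2 * lam_hat):
    print(f"lambda = {lam:.3f}  loglik = {poisson_loglik(lam, counts):.4f}")
```

In the regression setting λ is no longer a single number: log λ is modeled as a linear predictor plus the log of the population at risk, but the likelihood being maximized has the same form.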
R program for Logistic and Loglinear models
Both logistic and loglinear models are special cases of generalized linear models (GLM).
R uses the function 'glm' for fitting GLMs with different link functions:
Logistic regression: binomial family
Loglinear regression: Poisson family
Logistic: fit <- glm(as.factor(y) ~ x, family = binomial(link = logit))
Loglinear: fit <- glm(y ~ x, family = poisson(link = log))
R output for Logistic and Loglinear models
> fit
Call:  glm(formula = y ~ x, family = binomial(link = logit))
Coefficients:
(Intercept)           x1           x2           x3           x4
  -0.725709     0.014827     0.015248     0.007396    -0.041097
Degrees of Freedom: 19 Total (i.e. Null);  15 Residual
Null Deviance:     24.43
Residual Deviance: 21.22    AIC: 31.22
R output for Logistic
> summary(fit)
Call:
glm(formula = y ~ x, family = binomial(link = logit))
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3708  -0.9643  -0.3896   1.2518   1.6197
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -0.725709    1.988678   -0.365     0.715
x1           0.014827    0.021880    0.678     0.498
x2           0.015248    0.022781    0.669     0.503
x3           0.007396    0.018840    0.393     0.695
x4          -0.041097    0.030172   -1.362     0.173
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 24.435 on 19 degrees of freedom
Residual deviance: 21.223 on 15 degrees of freedom
Mortality from Cervical Cancer in Ontario 1960-94: Rate (per 10^5 person-years) and Frequency
Rows: age; columns: year. Each cell shows rate (frequency).
Age     60-64         65-69         70-74         75-79         80-84         85-89         90-94
20-24   0.15 (2)      0.11 (2)      0.15 (3)      0.14 (3)      0.14 (3)      0.20 (4)      0.13 (1)
25-29   1.22 (14)     0.52 (8)      1.24 (23)     0.80 (16)     0.88 (20)     0.47 (11)     0.93 (8)
30-34   3.15 (35)     2.94 (37)     2.01 (32)     1.45 (27)     1.79 (38)     1.31 (32)     1.08 (11)
35-39   5.38 (62)     4.47 (52)     3.59 (46)     3.86 (61)     3.12 (60)     2.47 (55)     2.16 (21)
40-44   9.80 (116)    7.15 (84)     4.32 (51)     5.12 (66)     3.71 (60)     2.47 (63)     2.16 (33)
45-49   15.66 (160)   10.97 (130)   7.75 (91)     4.69 (55)     5.17 (67)     5.02 (83)     3.41 (27)
50-54   17.01 (151)   13.32 (138)   8.19 (97)     6.82 (80)     6.12 (72)     4.65 (61)     5.79 (35)
55-59   18.56 (141)   15.23 (133)   11.53 (118)   9.12 (107)    5.94 (70)     5.81 (69)     5.77 (29)
60-64   22.44 (144)   16.08 (121)   13.66 (117)   10.71 (108)   7.93 (92)     7.35 (86)     4.02 (19)
65-69   23.53 (128)   18.87 (119)   15.31 (112)   13.79 (115)   10.36 (102)   7.60 (86)     6.83 (31)
70-74   25.89 (116)   19.36 (97)    15.36 (89)    15.18 (103)   13.95 (108)   10.42 (96)    10.44 (44)
75-79   29.12 (94)    20.08 (75)    23.84 (102)   16.29 (82)    14.90 (88)    11.50 (78)    12.73 (38)
80-84   31.76 (62)    24.72 (59)    21.51 (60)    23.82 (79)    12.69 (50)    17.40 (81)    12.77 (27)
85+     33.16 (42)    28.95 (50)    22.90 (50)    24.94 (68)    15.23 (51)    13.88 (56)    10.42 (19)
Disease Rate (per 10^5) and Frequency
Rows: Age; columns: Year; diagonals: Birth Cohort. Each cell shows rate (frequency).
Age     60-64        65-69        70-74        75-79
20-24   0.15 (2)     0.11 (2)     0.15 (3)     0.14 (3)
25-29   1.22 (14)    0.52 (8)     1.24 (23)    0.80 (16)
30-34   3.15 (45)    2.94 (37)    2.01 (32)    1.45 (27)
Statistical Models
Loglinear model
log(ER_ij) = log(n_ij) + μ + α_i + β_j + γ_ij
μ: intercept; ER_ij: expected frequency in cell (i, j); n_ij: population-years of exposure (offset term).
α_1, …, α_a: row (age) effects, Σ_{i=1}^{a} α_i = 0
β_1, …, β_p: column (period) effects, Σ_{j=1}^{p} β_j = 0
γ_ij: interaction effects, Σ γ_ij = 0
Statistical Models
APC loglinear model
log(ER_ij) = log(n_ij) + μ + α_i + β_j + γ_k
μ: intercept; ER_ij: expected frequency in cell (i, j); n_ij: population-years of exposure (offset term).
α_1, …, α_a: row (age) effects, Σ_{i=1}^{a} α_i = 0
β_1, …, β_p: column (period) effects, Σ_{j=1}^{p} β_j = 0
γ_1, …, γ_{a+p-1}: diagonal (cohort) effects, Σ_{k=1}^{a+p-1} γ_k = 0
Problem
Linear dependency among covariates: Period - Age = Cohort
Rows, columns and diagonals on a Lexis diagram
Multiple estimators!
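The dependency can be seen by writing the cohort (diagonal) index as a function of the age and period indices. The sketch below uses the convention k = a - i + j, which is the indexing that appears in the R function apcmat1 later in these slides; the Python function name is an illustrative choice.

```python
def cohort_index(i, j, a):
    """Cohort (diagonal) index for age group i and period j (both 1-based),
    in a table with a age groups: k = a - i + j."""
    return a - i + j

a, p = 3, 3   # a 3 x 3 age-by-period table

# Print the table of cohort indices: cells on the same diagonal share an index.
for i in range(1, a + 1):
    print([cohort_index(i, j, a) for j in range(1, p + 1)])

# Every cell's cohort index is fixed once its age and period indices are known,
# so the three sets of effects cannot vary independently.
all_indices = {cohort_index(i, j, a) for i in range(1, a + 1) for j in range(1, p + 1)}
print(sorted(all_indices))
```

An a × p table therefore has a + p - 1 distinct cohorts, the youngest age group in the last period being the most recent one.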
Matrix form
Models in matrix form: log(ER) = log(n) + Xb
b = (μ, α_1, …, α_{a-1}, β_1, …, β_{p-1}, γ_1, …, γ_{a+p-2})^T
X: singular design matrix, one less than full rank
Matrix form
Regression design matrix (X) for the APC model with a 3x3 table
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
 [1,]    1    0    1    0    0    0    1    0
 [2,]    1    0    0    1    0    0    0    1
 [3,]    1    0   -1   -1   -1   -1   -1   -1
 [4,]    0    1    1    0    0    1    0    0
 [5,]    0    1    0    1    0    0    1    0
 [6,]    0    1   -1   -1    0    0    0    1
 [7,]   -1   -1    1    0    1    0    0    0
 [8,]   -1   -1    0    1    0    1    0    0
 [9,]   -1   -1   -1   -1    0    0    1    0
Eigenvalues of (X^T X): 12.16 9.23 4.98 3.35 3.00 2.15 1.12 0.00
R-program to generate APC Matrix
apcmat1 <- function (a = 3, p = 3) {
  ## construct APC matrix for APC analysis
  ## rows: the a*p cells of the table; columns: age, period and cohort effects
  x <- matrix(0, a*p, 2*(a+p-2))
  gammac <- rbind(diag(1, a+p-2), -1)                 # cohort coding; last cohort is all -1
  for (i in 1:(a-1)) {
    x[(i-1)*p+(1:p), i] <- 1                          # for alpha
    x[(i-1)*p+(1:(p-1)), a:(a+p-2)] <- diag(1, p-1)   # for beta
    x[i*p, a:(a+p-2)] <- -1
    for (j in 1:p)
      x[(i-1)*p+j, -1:-(a+p-2)] <- gammac[a-i+j, ]    # for gamma
  }
  for (j in 1:p)
    x[(a-1)*p+j, -1:-(a+p-2)] <- gammac[j, ]          # last block (row in apc table) for gamma
  gammac <- NULL
  x[(a-1)*p+(1:p), 1:(a-1)] <- -1                     # last block (row in apc table) for alpha
  x[(a-1)*p+(1:(p-1)), a:(a+p-2)] <- diag(1, p-1)     # last block (row in apc table) for beta
  x[a*p, a:(a+p-2)] <- -1
  x
}
Challenge - Identification
Constraint on parameters (Kupper et al., 1985)
One more constraint → trend determination
But a different constraint → a different trend
Challenge - Identification
Conclusion: all estimators are biased, except one: the estimator whose constraint is satisfied by the true parameters of the model.
Not verifiable!!
Identifiability problem!
Mystery?
Previous Methods
Estimable functions (Fienberg + Mason 1979, Clayton + Schifflers 1987, Holford 1985, 1991)
Independent of the selection of constraint
Invariant characteristics of trends
Nonlinear components estimable: curvature
Linear component (slope) not estimable!
Previous Conclusion
Identifiability problem: difficult!
Linear trends are not estimable (numerically)!
Kupper et al. (1985), Holford (1985, 1991), Clayton and Schifflers (1987), Fienberg and Mason (1979, 1985).
Kupper et al. (1985) provided a condition for estimable functions.
Mystery?
Approach?
Multiple estimators!
How to pick the "correct one"?
Hint: think about math, not statistics!
Structure of Estimators
Each estimator: b̂ = B + tB0
B0: eigenvector of X^T X with eigenvalue 0; ||B0|| = 1, independent of the disease rate
t: arbitrary real number
B: orthogonal to B0, uniquely determined by the disease/event rate
B0 Independent of Rate
Kupper et al. (1985): B* = (0, A, P, C)
A = [1 - (a+1)/2, …, (a-1) - (a+1)/2]
P = [-1 + (p+1)/2, …, -(p-1) + (p+1)/2]
C = [1 - (a+p)/2, …, (a+p-2) - (a+p)/2]
B0 = B* / ||B*||
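The claim that B0 depends only on the table dimensions can be checked directly: the vector built from A, P and C is annihilated by the design matrix, whatever the data. The sketch below is a Python port of the R function apcmat1 from these slides (0-based indexing, names chosen here), so like that function it omits the intercept column and the leading 0 of B*. It assumes the cohort part C runs over the a+p-2 cohort columns of the matrix, with entries k - (a+p)/2.

```python
import math

def apc_design(a, p):
    """APC design matrix for an a x p table (port of the slides' apcmat1)."""
    n_c = a + p - 2                       # number of cohort columns
    # Cohort coding: identity rows for cohorts 1..a+p-2, all -1 for the last one.
    gammac = [[1.0 if r == c else 0.0 for c in range(n_c)] for r in range(n_c)]
    gammac.append([-1.0] * n_c)
    x = [[0.0] * (2 * n_c) for _ in range(a * p)]
    for i in range(1, a + 1):             # age group, 1-based
        for j in range(1, p + 1):         # period, 1-based
            row = x[(i - 1) * p + (j - 1)]
            if i < a:                     # age columns 0 .. a-2
                row[i - 1] = 1.0
            else:
                for c in range(a - 1):
                    row[c] = -1.0
            if j < p:                     # period columns a-1 .. a+p-3
                row[a - 1 + j - 1] = 1.0
            else:
                for c in range(p - 1):
                    row[a - 1 + c] = -1.0
            k = a - i + j                 # cohort index of cell (i, j)
            row[a + p - 2:] = gammac[k - 1]
    return x

def null_vector(a, p):
    """(A, P, C) part of B*, following the formulas attributed to Kupper et al."""
    A = [i - (a + 1) / 2 for i in range(1, a)]
    P = [(p + 1) / 2 - j for j in range(1, p)]
    C = [k - (a + p) / 2 for k in range(1, a + p - 1)]
    return A + P + C

def mat_vec(x, v):
    return [sum(r * s for r, s in zip(row, v)) for row in x]

a, p = 3, 3
X = apc_design(a, p)
b_star = null_vector(a, p)
norm = math.sqrt(sum(v * v for v in b_star))
B0 = [v / norm for v in b_star]           # unit-length null vector

print(mat_vec(X, b_star))                 # each entry should be zero
```

Since X B* = 0, adding any multiple of B0 to a solution leaves the fitted values unchanged, which is the source of the multiple estimators.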
Estimable Function
Theorem 1
E(B) is estimable and determines both the linear and nonlinear components of the trend:
B = (I - B0 B0^T) b̂
E(B) is the only estimable function that determines both linear and nonlinear components. For an estimable function Lb, LB0 = 0, so
L b̂ = L(B + tB0) = LB + tLB0 = LB
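The projection B = (I - B0 B0^T) b̂ can be illustrated with two estimators that differ only by a multiple of B0. In this sketch the null direction is the one for a 3x3 table written in the (A, P, C) form attributed to Kupper et al., without the intercept entry; the estimator values are hypothetical numbers chosen only to show that both estimators project to the same B.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_null_component(b_hat, b0):
    """Compute B = (I - B0 B0^T) b_hat for a unit-length vector b0."""
    t = dot(b_hat, b0)                      # coefficient of b_hat along B0
    return [bi - t * ui for bi, ui in zip(b_hat, b0)]

# Null direction for a = p = 3: A = (-1, 0), P = (1, 0), C = (-2, -1, 0, 1).
raw = [-1.0, 0.0, 1.0, 0.0, -2.0, -1.0, 0.0, 1.0]
norm = math.sqrt(dot(raw, raw))
b0 = [v / norm for v in raw]

b_hat_1 = [0.4, -0.1, 0.2, 0.3, -0.5, 0.1, 0.0, 0.6]       # hypothetical estimator
b_hat_2 = [bi + 2.5 * ui for bi, ui in zip(b_hat_1, b0)]   # same B, different t

B1 = remove_null_component(b_hat_1, b0)
B2 = remove_null_component(b_hat_2, b0)
print(max(abs(u - v) for u, v in zip(B1, B2)))   # difference between the two projections
```

Whatever constraint produced b̂, the projection removes the arbitrary term tB0 and leaves the component that is determined by the data.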
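Geometry for Estimable E(B)
[Figure: vectors drawn from the origin O. The estimators b̂1 = B + t1B0 and b̂2 = B + t2B0 share the component B and differ only along the direction B0.]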
Constraint
Quantitative constraints
Specify relationship between parameters
Require a priori knowledge of event/disease
Qualitative constraints
Require no a priori knowledge
Properties of B
Intrinsic to disease/event: the arbitrary term tB0 is removed
Only estimable function that determines linear + nonlinear components
Robust estimation by sensitivity analysis
Consistent estimation for intercept and age effects μ, α_1, …, α_{a-1} as p → ∞
New Method - Intrinsic Estimator
Structure of estimators: B + tB0
Intrinsic estimator: B, determined by removing the arbitrary term tB0
Effective trend: trend determined by B
Cervical Cancer Mortality
[Figure: three panels showing the age trend (age effect vs. age, 20 to 80), the period trend (period effect vs. period, 1960 to 1990) and the cohort trend (cohort effect vs. cohort, 1880 to 1960), each with 95% CI.]
Homicide Arrest Rate (per 10^5) (R. O'Brien, 2000)
Rows: age; columns: calendar year.
Age   1960    1965    1970    1975    1980    1985    1990    1995
15    8.89    9.07    17.22   17.54   18.02   16.32   36.52   35.24
20    14.00   15.18   23.76   25.62   23.95   21.11   29.10   32.34
25    13.45   14.69   20.09   21.05   18.91   16.79   17.99   16.75
30    10.73   11.70   16.00   15.81   15.22   12.59   12.44   10.05
35    9.37    9.76    13.13   12.83   12.31   9.60    9.38    7.27
40    6.48    7.41    10.10   10.52   8.79    7.50    6.81    5.48
45    5.71    5.56    7.51    7.32    6.76    5.31    5.17    3.67
Homicide Arrest Rate
[Figure: three panels showing the age trend (age effect vs. age, 15 to 45), the period trend (period effect vs. period, 1960 to 1990) and the cohort trend (cohort effect vs. cohort, 1920 to 1980).]
On-line Software APCsoft
Based on IE method: www.apcsoft.epi.msu.edu
Web-based software
Need to upload Excel files.
Outputs analysis results and dynamic graphics.
Dynamic graph.