
Biost 518 / 515, Winter 2014 Homework #5 February 3, 2014

Biost 518: Applied Biostatistics II

Biost 515: Biostatistics II

Emerson, Winter 2014

Homework #5

February 3, 2014

Written problems: To be submitted as an MS-Word compatible file to the class Catalyst dropbox by 9:30 am on Monday, February 10, 2014. See the instructions for peer grading of the homework that are posted on the web pages.

On this (as all homeworks) Stata / R code and unedited Stata / R output is TOTALLY unacceptable. Instead, prepare a table of statistics gleaned from the Stata output. The table should be appropriate for inclusion in a scientific report, with all statistics rounded to a reasonable number of significant digits. (I am interested in how statistics are used to answer the scientific question.)

Unless explicitly told otherwise in the statement of the problem, in all problems requesting “statistical analyses” (either descriptive or inferential), you should present both

Methods: A brief sentence or paragraph describing the statistical methods you used. This should be using wording suitable for a scientific journal, though it might be a little more detailed. A reader should be able to reproduce your analysis. DO NOT PROVIDE Stata OR R CODE.

Inference: A paragraph providing full statistical inference in answer to the question. Please see the supplementary document relating to “Reporting Associations” for details.

Problems 2 and 3 of the homework build on the analyses performed in homeworks #1 through #4. As such, all questions relate to associations among death from any cause, serum low density lipoprotein (LDL) levels, age, and sex in a population of generally healthy elderly subjects in four U.S. communities. This homework uses the subset of information that was collected to examine MRI changes in the brain. The data can be found on the class web page (follow the link to Datasets) in the file labeled mri.txt. Documentation is in the file mri.pdf. See homework #1 for additional information. Problem 1 of this homework uses the same dataset to explore associations between prevalence of diabetes and race in the population from which that sample was drawn.

1. Perform a statistical regression analysis evaluating an association between prevalence of diabetes and race by comparing the odds of a diabetes diagnosis across groups defined by race.

a. Fit a logistic regression model that uses whites as a reference group. Is this a saturated model? Provide a formal report (methods and inference) about the scientific question regarding an association between diabetes and race.

Method: A robust logistic regression model was fit using whites as the reference group, with new indicator variables generated for black, Asian, and other race.

Log(odds)=a+b1*BLACK+b2*ASIAN+b3*OTHER

(BLACK: black=1, everything else=0; ASIAN: Asian=1, everything else=0; OTHER: other=1, everything else=0)

Results:

Log(odds)=-2.2208+0.6568*BLACK-0.4648*ASIAN+0.6113*OTHER

(BLACK: black=1, everything else=0; ASIAN: Asian=1, everything else=0; OTHER: other=1, everything else=0)
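The percentages quoted in part (b) follow from exponentiating the coefficients in the fitted equation above. The following is a minimal Python sketch of that arithmetic, not the analysis itself: the coefficient values are copied from the equation, while the dictionary layout and the helper name `percent_difference` are illustrative assumptions.

```python
import math

# Coefficients copied from the fitted equation above (log-odds scale).
coefficients = {
    "intercept": -2.2208,  # log odds of diabetes in whites (reference group)
    "BLACK": 0.6568,       # log odds ratio, blacks vs whites
    "ASIAN": -0.4648,      # log odds ratio, Asians vs whites
    "OTHER": 0.6113,       # log odds ratio, other race vs whites
}

# Exponentiating the intercept gives the odds in the reference group.
odds_white = math.exp(coefficients["intercept"])

# Exponentiating each slope gives an odds ratio relative to whites.
odds_ratios = {
    group: math.exp(beta)
    for group, beta in coefficients.items()
    if group != "intercept"
}


def percent_difference(odds_ratio):
    """Relative difference in odds versus the reference group, in percent."""
    return (odds_ratio - 1.0) * 100.0


print(f"Odds in whites: {odds_white:.4f}")
for group, ratio in odds_ratios.items():
    print(f"{group}: OR = {ratio:.4f} ({percent_difference(ratio):+.2f}% vs whites)")
```

Because the coefficients are rounded to four decimals, the recomputed percentages can differ from the reported ones in the last digit.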

3864, 02/14/14,

Overall score: 15/65

3864, 02/14/14,

Overall: 2/10. Points breakdown: 3 pts for addressing the saturation question; 3 pts for stating the method; 4 pts for reporting inferences.

Saturation: omitted from answer. 0/3

Method: cursory, insufficient for a report. Not even one full sentence. See the key for how methods and inferences are reported. Missing the discussion of the Huber-White sandwich estimator, 2-sided p, Wald statistic, multiple comparisons, missing subjects. 1/3

Inferences: No formal inferences made in (a), just a formula. It looks like you put your inferences in (b) rather than (a). But even referring to (b), all the figures reported in (b) are off; none of the numbers match what is in the key. 1/4


b. Using the regression model fit in part (a), provide an interpretation for each of the regression parameters (including the intercept).

Ans: For the 572 white subjects, the estimated odds of diabetes is 10.85% (e^(-2.2208)), significantly different from zero (p<0.001). A 95% CI suggests that it would not be unusual if the odds of diabetes in whites were anywhere between 10.85% and 14.30%.

For the 104 black subjects, the estimated odds of diabetes is 92.86% higher than for white subjects. The difference is statistically significant (p=0.026). A 95% CI suggests that the observation would not be unusual if the odds of diabetes in black subjects were anywhere from 8.15% higher to 243.91% higher than in white subjects.

For the 47 Asian subjects, the estimated odds of diabetes is 37.18% lower than for white subjects, although the difference is not statistically significant (p=0.449). A 95% CI suggests that the observation would not be unusual if the odds of diabetes in Asian subjects were anywhere from 81.12% lower to 109.08% higher than in white subjects.

For the 12 subjects of other race, the estimated odds of diabetes is 84.29% higher than for white subjects, although the difference is not statistically significant (p=0.438). A 95% CI suggests that the observation would not be unusual if the odds of diabetes in subjects of other race were anywhere from 60.65% lower to 763.13% higher than in white subjects.

c. If we were to ignore issues related to multiple comparisons, what conclusions would you reach based on the p values reported in the regression output from part (a) using a 0.05 level of significance?

Method: A Wald test is used to test whether coefficients b1, b2, and b3 are simultaneously zero.

Results: p=0.0956; we could not reject the hypothesis that b1=b2=b3=0 at the 0.05 level of significance. If we ignored issues related to multiple comparisons, we might have removed the parameters for Asian and other race, which are not statistically significant, before proceeding to other tests.

d. Now fit a logistic regression model that uses blacks as a reference group. How would your report of formal inference differ from that that you provided in part (a)? How does this regression model relate to that in part (a)?

Methods: Similar to (a), but using blacks as the reference group.

Results:


Log(odds)=-1.5639-0.6568*WHITE-1.1216*ASIAN-0.0455*OTHER

(WHITE: white=1, everything else=0; ASIAN: Asian=1, everything else=0; OTHER: other=1, everything else=0)

This is a reparameterization of the model in part (a).
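The reparameterization can be checked by arithmetic alone: changing the reference group shifts every coefficient by the old coefficient of the new reference group. The sketch below assumes the coefficients reported in part (a); the function name `rereference` is hypothetical and only illustrates the bookkeeping.

```python
# Coefficients from part (a), with whites as the reference group.
white_ref = {
    "intercept": -2.2208,
    "BLACK": 0.6568,
    "ASIAN": -0.4648,
    "OTHER": 0.6113,
}


def rereference(coefs, new_ref, old_ref_name):
    """Re-express dummy-variable coefficients against a new reference group.

    coefs        : dict with an "intercept" and one slope per non-reference group
    new_ref      : key of the group that becomes the new reference
    old_ref_name : label to use for the former reference group
    """
    shift = coefs[new_ref]
    # New intercept is the log odds in the new reference group.
    result = {"intercept": coefs["intercept"] + shift, old_ref_name: -shift}
    # Every other group is now compared with the new reference group.
    for name, beta in coefs.items():
        if name not in ("intercept", new_ref):
            result[name] = beta - shift
    return result


black_ref = rereference(white_ref, "BLACK", "WHITE")
for name, beta in black_ref.items():
    print(f"{name}: {beta:.4f}")
```

Up to rounding in the last digit, the output reproduces the coefficients in the equation for part (d), which is what it means for the two models to give identical fitted odds for each race group.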

e. Using the regression model fit in part (d), provide an interpretation for each of the regression parameters (including the intercept.)

3864, 02/14/14,

1/3 Missing any statement about statistical inferences. I see you again responded to the wrong question and put inferences in (e). Unfortunately, again all the numbers are incorrect. There is no statistically significant finding. 1 pt for noting this is a reparameterization.

3864, 02/14/14,

0/3 This is incorrect. You cannot remove the groups that are not statistically significant. See lecture notes on the error in doing this. The correct answer is you would have mistakenly concluded there was a statistically significant difference between blacks and whites.

3864, 02/14/14,

0/3 The answer is not responsive to the question. The question asks for an interpretation of the regression parameters – slopes, intercept.


Ans: For the 104 black subjects, the estimated odds of diabetes is 20.93%, significantly different from zero (p<0.001). A 95% CI suggests that it would not be unusual if the odds of diabetes in blacks were anywhere between 12.59% and 34.80%.

For the 572 white subjects, the estimated odds of diabetes is 48.15% lower than for black subjects. The difference is statistically significant (p=0.026). A 95% CI suggests that the observation would not be unusual if the odds of diabetes in white subjects were anywhere from 7.54% lower to 70.92% lower than in blacks.

For the 47 Asian subjects, the estimated odds of diabetes is 67.42% lower than for blacks, although the difference is not statistically significant (p=0.085). A 95% CI suggests that the observation would not be unusual if the odds of diabetes in Asian subjects were anywhere from 90.91% lower to 16.70% higher than in blacks.

For the 12 subjects of other race, the estimated odds of diabetes is 4.43% lower than for blacks, although the difference is not statistically significant (p=0.956). A 95% CI suggests that the observation would not be unusual if the odds of diabetes in subjects of other race were anywhere from 70.75% lower to 374.24% higher than in blacks.

f. If we were to ignore issues related to multiple comparisons, what conclusions would you reach based on the p values reported in the regression output from part (d) using a 0.05 level of significance?

Method: A Wald test is used to test whether coefficients b1, b2, and b3 are simultaneously zero.

Results: p=0.0956; we could not reject the hypothesis that b1=b2=b3=0 at the 0.05 level of significance. If we ignored issues related to multiple comparisons, we might have removed the parameters for Asian and other race, which are not statistically significant, before proceeding to other tests.

g. What do your results from parts (c) and (f) say about the dangers of using the p values for individual regression parameters from a dummy variable regression to decide whether to include or exclude those variables in a regression model (i.e., in a “stepwise model building” procedure)?

Ans: If we ignored issues related to multiple comparisons, we might have removed the parameters for Asian and other race, which are not statistically significant, before proceeding to other tests.

2. Perform a statistical regression analysis evaluating an association between all-cause mortality and serum LDL by comparing the instantaneous risk (hazard) of death over the entire period of observation across groups defined by serum LDL when fit as dummy variables using the categories suggested by the Mayo Clinic as reported on Homework #1. The Stata egen command can be used to categorize the LDL levels:

egen ldlCTG = cut(ldl), at(0 70 100 130 160 190 250)
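For readers without Stata, the grouping performed by the command above can be sketched in Python. This is an illustration under the assumption that `cut()` with `at()` codes each value by the left end of its interval and leaves values below the first break, or at or above the last break, missing; the function name `cut_at` is hypothetical.

```python
from bisect import bisect_right

# Break points taken from the at() option in the command above.
BREAKS = [0, 70, 100, 130, 160, 190, 250]


def cut_at(value, breaks=BREAKS):
    """Return the left end of the interval [b_i, b_{i+1}) containing value.

    Returns None (missing) when value is missing, below the first break,
    or at or above the last break.
    """
    if value is None or value < breaks[0] or value >= breaks[-1]:
        return None
    # bisect_right finds the first break strictly greater than value,
    # so the break just before it is the left end of the interval.
    return breaks[bisect_right(breaks, value) - 1]


for ldl in (69.9, 70, 129, 160, 249, 250, None):
    print(ldl, "->", cut_at(ldl))
```

Under this coding a subject with LDL exactly 70 falls in the 70-99 category, and a value of 250 or more would be treated as missing rather than assigned to the top category.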

a. Include full description of your methods, appropriate descriptive statistics, and full report of your inferential statistics.

3864, 02/14/14,

3/5 Missing any discussion of the dangers of doing this.

3864, 02/14/14,

0/3 This is incorrect. You cannot remove the groups that are not statistically significant. See lecture notes on the error in doing this. The correct answer is you would have mistakenly concluded there was a statistically significant difference between blacks and whites.

3864, 02/14/14,

0/3 The answer is not responsive to the question. The question asks for an interpretation of the regression parameters – slopes, intercept.


Methods for descriptive statistics: Descriptive statistics for the censoring distribution included the minimum and maximum observed censoring times and the Kaplan-Meier estimates of the 10th, 50th (median), and 90th percentiles.

Serum LDL categories (ldlCTG)

                                         0-69   70-99   100-129   130-159   160-189   190-250   All subjects (with LDL available)
N subjects                                 22     143       228       225        83        24   725
N deaths                                   10      28        44        34        11         4   131
2 year survival probability (%)           100    95.8      93.9      95.6      98.8      95.8   96.7
5 year survival probability (%)          59.1    83.2      81.1      87.1        88      83.3   86
10th Pctile of survival                  3.46     3.8      3.41       4.3      4.53      4.13   3.66
20th Pctile of survival                  3.55    5.44      5.36        NA        NA        NA   5.54
5.75 Year restricted mean of survival    4.91    5.24      5.23      5.35      5.35      5.32   5.29

Methods for inferential statistics: Distributions of time to death from any cause were compared across groups defined by serum LDL, fit as dummy variables using the categories suggested by the Mayo Clinic, using proportional hazards regression. The group with serum LDL from 100 to 129 mg/dL is used as the reference group.

3864, 02/14/14,

Overall: 3/10 Points breakdown: (3 pts for methods; 3 pts for descriptive statistics; 4 pts for reporting inferences). Method – omitted from answer – 0/3 Descriptive Statistics –3/3 Inferences – The point estimates and inferential statistics are all incorrect – none of the figures match those in the key. 0/4


The association between all-cause mortality and serum LDL was quantified by the hazard ratio computed from the regression model, with confidence intervals and two-sided p values computed using Wald statistics based on the Huber-White sandwich estimator. Subjects missing data for serum LDL at the time of study accrual were omitted from the analysis.

Inferential results: From a proportional hazards regression analysis, we estimate that the instantaneous risk of death in the group with LDL from 0 to 69 mg/dL is 154.72% higher than that in the group with LDL from 100 to 129 mg/dL. A 95% confidence interval suggests that this observation would not be judged unusual if the true instantaneous risk of death in the group with LDL from 0 to 69 mg/dL were anywhere from 34.38% higher to 382.82% higher than that in the group with LDL from 100 to 129 mg/dL. A two-sided p value of 0.004 suggests that we can with high confidence reject the null hypothesis that the instantaneous risk of death is the same in the two groups.

We estimate that the instantaneous risk of death in the group with LDL from 70 to 99 mg/dL is 1.38% higher than that in the group with LDL from 100 to 129 mg/dL. A 95% confidence interval suggests that this observation would not be judged unusual if the true instantaneous risk of death in the group with LDL from 70 to 99 mg/dL were anywhere from 36.93% lower to 62.97% higher than that in the group with LDL from 100 to 129 mg/dL. A two-sided p value of 0.995 suggests that we cannot reject the null hypothesis that the instantaneous risk of death is the same in the two groups.

We estimate that the instantaneous risk of death in the group with LDL from 130 to 159 mg/dL is 25.14% lower than that in the group with LDL from 100 to 129 mg/dL. A 95% confidence interval suggests that this observation would not be judged unusual if the true instantaneous risk of death in the group with LDL from 130 to 159 mg/dL were anywhere from 42.19% lower to 17.24% higher than that in the group with LDL from 100 to 129 mg/dL. A two-sided p value of 0.206 suggests that we cannot reject the null hypothesis that the instantaneous risk of death is the same in the two groups.



We estimate that the instantaneous risk of death in the group with LDL from 190 to 250 mg/dL is 19.13% lower than that in the group with LDL from 100 to 129 mg/dL. A 95% confidence interval suggests that this observation would not be judged unusual if the true instantaneous risk of death in the group with LDL from 190 to 250 mg/dL were anywhere from 71.23% lower to 126.20% higher than that in the group with LDL from 100 to 129 mg/dL. A two-sided p value of 0.205 suggests that we cannot reject the null hypothesis that the instantaneous risk of death is the same in the two groups.
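A Wald-based interval is symmetric on the log hazard ratio scale, so a reported hazard ratio, confidence interval, and p value can be cross-checked against one another. The sketch below does this for the 0 to 69 mg/dL comparison reported above, reading "154.72% higher" as a hazard ratio of 2.5472 with interval 1.3438 to 4.8282. It assumes a normal-approximation Wald interval; the function name `wald_from_ci` is hypothetical, and the standard error is back-calculated rather than taken from any regression output.

```python
import math
from statistics import NormalDist


def wald_from_ci(hazard_ratio, ci_lower, ci_upper, level=0.95):
    """Recover log HR, its standard error, the Wald z, and the two-sided p value.

    Assumes the interval was built as exp(log_hr +/- z_crit * se).
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    log_hr = math.log(hazard_ratio)
    # The interval width on the log scale is 2 * z_crit * se.
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * z_crit)
    z = log_hr / se
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return log_hr, se, z, p_two_sided


# Values implied by the percentages reported for the 0 to 69 mg/dL group.
log_hr, se, z, p_value = wald_from_ci(2.5472, 1.3438, 4.8282)
print(f"log HR = {log_hr:.4f}, SE = {se:.4f}, z = {z:.3f}, p = {p_value:.4f}")
```

For this comparison the back-calculated p value rounds to the reported 0.004. The same check can be applied to the other groups to spot transcription slips between an estimate, its interval, and its p value.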

b. Provide an interpretation for each parameter in your regression model, including the intercept.

Ans:

The intercept is the baseline hazard of death, which is the hazard of death in the group with LDL from 100 to 129 mg/dL.

From a proportional hazards regression analysis, we estimate that the instantaneous risk of death in the group with LDL from 0 to 69 mg/dL is 154.72% higher than that in the group with LDL from 100 to 129 mg/dL. A 95% confidence interval suggests that this observation would not be judged unusual if the true instantaneous risk of death in the group with LDL from 0 to 69 mg/dL were anywhere from 34.38% higher to 382.82% higher than that in the group with LDL from 100 to 129 mg/dL. A two-sided p value of 0.004 suggests that we can with high confidence reject the null hypothesis that the instantaneous risk of death is the same in the two groups.

We estimate that the instantaneous risk of death in the group with LDL from 70 to 99 mg/dL is 1.38% higher than that in the group with LDL from 100 to 129 mg/dL. A 95% confidence interval suggests that this observation would not be judged unusual if the true instantaneous risk of death in the group with LDL from 70 to 99 mg/dL were anywhere from 36.93% lower to 62.97% higher than that in the group with LDL from 100 to 129 mg/dL. A two-sided p value of 0.995 suggests that we cannot reject the null hypothesis that the instantaneous risk of death is the same in the two groups.


3864, 02/14/14,

Again this answer is not responsive to the question – it is asking for the interpretation of the parameters not the inferences.

3864, 02/14/14,

Incorrect – it is the hazard function in the reference group, which is the group having serum LDL less than 70 mg/dL

3864, 02/14/14,

0/5



We estimate that the instantaneous risk of death in the group with LDL from 190 to 250 mg/dL is 19.13% lower than that in the group with LDL from 100 to 129 mg/dL. A 95% confidence interval suggests that this observation would not be judged unusual if the true instantaneous risk of death in the group with LDL from 190 to 250 mg/dL were anywhere from 71.23% lower to 126.20% higher than that in the group with LDL from 100 to 129 mg/dL. A two-sided p value of 0.205 suggests that we cannot reject the null hypothesis that the instantaneous risk of death is the same in the two groups.

c. What analysis would you perform to assess whether the regression model used in this problem provides a “better fit” than does a model that uses only a continuous linear term for LDL? What is the result of such an analysis?

Method: Compare the Wald chi-square statistics and p values of the two models, and compare the trends in relative hazard.

Result: The Wald chi-square statistic for the regression model used in this problem is 15.42, larger than the Wald chi-square statistic of 6.76 for a model that uses only a continuous linear term for LDL. The p value for the regression model used in this problem is 0.0087, smaller than the p value of 0.0093 for a model that uses only a continuous linear term for LDL.

According to the Kaplan-Meier survival curves, the hazard in the group with LDL from 160 to 189 mg/dL is lower than in the other groups, but in a Cox regression model that uses only a continuous linear term for LDL, the hazard in the group with LDL from 160 to 189 mg/dL is higher than in the group with LDL from 190 to 250 mg/dL. The regression model used in this problem provides a better fit than does a model that uses only a continuous linear term for LDL.
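The p values quoted in the result for part (c) can be reproduced from the chi-square statistics alone, which also shows why the two numbers are not directly comparable: they are referred to chi-square distributions with different degrees of freedom. The sketch below assumes 5 degrees of freedom for the dummy variable model (six LDL categories, one of them the reference) and 1 for the linear model; the function name `chi2_upper_tail` is hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term = 1.0
        total = 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: start from df = 1 and step up two degrees of freedom at a time,
    # using Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    tail = math.erfc(math.sqrt(half))
    for k in range(1, df, 2):
        tail += math.exp(-half) * half ** (k / 2.0) / math.gamma(k / 2.0 + 1.0)
    return tail


p_dummy = chi2_upper_tail(15.42, 5)   # dummy variable model, assumed 5 df
p_linear = chi2_upper_tail(6.76, 1)   # linear term model, 1 df
print(f"dummy variables: p = {p_dummy:.4f}; linear term: p = {p_linear:.4f}")
```

Each of these tests compares its model with a model containing no LDL term at all, so neither one compares the two models with each other.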

3864, 02/14/14,

0/5 The method, p-value and conclusions are incorrect. See key: I included a linear continuous term along with the dummy variables, and then tested that the regression coefficients for the dummy variables were 0 in a multiple partial Wald test having a chi square distribution with 5 degrees of freedom. The P value of 0.399 suggests that we do not have sufficient evidence of a trend in the log hazard that is nonlinear, when we use the dummy variables to model departures from linearity.


d. For each population defined by serum LDL value, compute the hazard ratio relative to a group having serum LDL of 160 mg/dL. (This will be used in problem 4). This can be effected by generating fitted hazard ratio estimates for each individual in the sample, and then dividing that fitted value by the fitted value for a subject having a LDL of 160 mg/dL.
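The division described in part (d) amounts to subtracting linear predictors on the log scale, because the baseline hazard cancels. The sketch below shows that step in Python for any fitted model supplied as a function. The slope used in the example is a made-up placeholder, not an estimate from this dataset, and both function names are hypothetical.

```python
import math


def hazard_ratios_vs_reference(linear_predictor, ldl_values, reference_ldl=160.0):
    """Hazard ratio of each LDL value relative to reference_ldl.

    linear_predictor maps an LDL value to the fitted log relative hazard
    (the x*beta part of a proportional hazards model).
    """
    reference = linear_predictor(reference_ldl)
    # exp(xb) / exp(xb_ref) == exp(xb - xb_ref); the baseline hazard cancels.
    return [math.exp(linear_predictor(x) - reference) for x in ldl_values]


# HYPOTHETICAL slope for illustration only; not a result from mri.txt.
HYPOTHETICAL_LOG_HR_PER_MG_DL = -0.005


def example_linear_predictor(ldl):
    """Log relative hazard under a single continuous linear LDL term."""
    return HYPOTHETICAL_LOG_HR_PER_MG_DL * ldl


ratios = hazard_ratios_vs_reference(example_linear_predictor, [100.0, 160.0, 190.0])
for ldl, ratio in zip([100.0, 160.0, 190.0], ratios):
    print(f"LDL {ldl:.0f} mg/dL: HR vs 160 mg/dL = {ratio:.3f}")
```

By construction the ratio at 160 mg/dL is exactly 1, which makes fitted curves from different models comparable on one plot. For the dummy variable or spline models, `linear_predictor` would instead evaluate the fitted coefficients on the corresponding indicator or spline terms.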

Method: A reparameterization of the proportional hazards regression model in part (a).

Results: The fitted hazard ratios relative to a group having serum LDL of 160 mg/dL are shown in the following figure:


3. Perform a statistical regression analysis evaluating an association between all-cause mortality and serum LDL by comparing the instantaneous risk (hazard) of death over the entire period of observation across groups defined by serum LDL when fit as linear splines using the categories suggested by the Mayo Clinic as reported on Homework #1. The Stata mkspline command can be used to create the predictors that can be used in a regression:

mkspline ldl0 70 ldl70 100 ldl100 130 ldl130 160 ldl160 190 ldl190 = ldl
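The predictors created by the mkspline command can be sketched in Python as follows. This assumes the default (non-marginal) linear spline coding, in which each new variable measures how far LDL extends into its own interval, so that each regression coefficient is the slope within that interval. The function name `linear_spline_basis` is hypothetical.

```python
# Knots taken from the mkspline command in the problem statement.
KNOTS = [70, 100, 130, 160, 190]


def linear_spline_basis(x, knots=KNOTS):
    """Return the six linear spline terms (ldl0, ldl70, ..., ldl190) for one LDL value."""
    # First term: LDL itself, capped at the first knot.
    basis = [min(x, knots[0])]
    # Middle terms: the portion of LDL lying between consecutive knots.
    for lower, upper in zip(knots, knots[1:]):
        basis.append(max(min(x, upper), lower) - lower)
    # Last term: the portion of LDL above the final knot.
    basis.append(max(x, knots[-1]) - knots[-1])
    return basis


for ldl in (50, 85, 160, 200):
    print(ldl, "->", linear_spline_basis(ldl))
```

The six terms always sum to the original LDL value, which is why a model with only a linear LDL term is the special case in which all six slopes are equal.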

a. Include full description of your methods, appropriate descriptive statistics, and full report of your inferential statistics.

Method: Distributions of time to death from any cause were compared across groups defined by serum LDL, fit as linear splines using the categories suggested by the Mayo Clinic, using proportional hazards regression.

Result: From a proportional hazards regression analysis, we estimate that in the group with LDL below 70 mg/dL, the instantaneous risk of death is 2.19% lower for each 1 mg/dL higher serum LDL level at baseline.

We estimate that in the group with LDL from 70 to 99 mg/dL, the instantaneous risk of death is 2.03% lower for each 1 mg/dL higher serum LDL level at baseline.


We estimate that in the group with LDL from 130 to 159 mg/dL, the instantaneous risk of death is 0.36% higher for each 1 mg/dL higher serum LDL level at baseline.

3864, 02/14/14,

Overall: 2/10 Points breakdown: (5 pts for stating method; 5 pts for reporting inferences). Method – See the key for how methods and inferences are reported. Missing the discussion of the Huber-White sandwich estimator, 2-sided p, Wald statistic, multiple comparisons, missing subjects, and how the linearity of the association was tested. 3/5 Inferences – 5/5




b. Provide an interpretation for each parameter in your regression model, including the intercept.

From a proportional hazards regression analysis, we estimate that in the group with LDL below 70 mg/dL, the instantaneous risk of death is 2.19% lower for each 1 mg/dL higher serum LDL level at baseline.






The intercept is the baseline hazard of death.

3864, 02/14/14,

1/5 The “intercept” in a proportional hazards model is the hazard function in the “reference group”. In this model, the baseline hazard function corresponds to the group having serum LDL equal to 0 mg/dL. That is outside our data, and not really possible. So the baseline hazard group is not really of interest. Missing any discussion of the slopes.


c. What analysis would you perform to assess whether the regression model used in this problem provides a “better fit” than does a model that uses only a continuous linear term for LDL? What is the result of such an analysis?

I think that, compared with a proportional hazards model that uses only a continuous linear term for LDL, the regression model used in this problem provides a “better fit”.

d. For each population defined by serum LDL value, compute the hazard ratio relative to a group having serum LDL of 160 mg/dL. (This will be used in problem 4). This can be effected by generating fitted hazard ratio estimates for each individual in the sample, and then dividing that fitted value by the fitted value for a subject having a LDL of 160 mg/dL.

Ans: The same as in 2(d).

4. By answering the following questions, compare the relative advantages and disadvantages of the various statistical analysis strategies we have considered in Homeworks 1-4 and problems 2 and 3 in this homework.

a. What advantages do the regression strategies used in Homeworks 4 and 5 provide over the approaches used in Homeworks 1-3?

Homeworks 1-3 mainly use descriptive statistics and dichotomize serum LDL. Homework 4 treats LDL level as a continuous variable and predicts a trend that is predominantly downward with higher LDL. Homework 5 uses flexible methods rather than just a continuous linear term for LDL.

3864, 02/14/14,

1/3 This answer is not responsive to the question. The question asked what advantages do the regression strategies used in homeworks 4 and 5 provide over the approaches used in homework 1-3 not just a description of what strategies were used. No discussion of advantages.

3864, 02/14/14,

0/5 This answer is not responsive to the question. The question asked what analysis you would perform to assess which model is a better fit.


b. Comment on any similarities or differences of the fitted values from the three models fit in Homework 4 and the two models fit in problems 2 and 3 of this homework.

Homework 4 uses only a continuous linear term for LDL. Homework 5 uses flexible methods. The models in Homework 4 predict a trend that is predominantly downward with higher LDL, but the predictions from the two models fit in problems 2 and 3 are slightly different from those in Homework 4.

c. A priori, of all the analyses we have considered for exploring an (unadjusted) association between all cause mortality and serum LDL in an elderly population, which one would you prefer and why?

I prefer the method in problem 3 of this homework. It is more precise.

Discussion Sections: February 3 - 7, 2014

We continue to discuss the dataset regarding FEV and smoking in children. Come to discussion section prepared to describe descriptive statistics, especially as they relate to confounding, precision, effect modification, and the impact of heteroscedasticity.

3864, 02/14/14,

1/5 From key: When making inference about an association, my top choice a priori would have been the PH regression with the logarithmically transformed LDL. On biological principles, I expected a multiplicative effect across LDL levels such that each doubling of the LDL might be associated with a more constant HR. (I was not so strong in that belief that I would have rejected a linear continuous fit instead, just for the more straightforward interpretation.) But when LDL is my POI, I would have shied away from the more flexible models. When testing for nonlinearity, I would have used the quadratic as my first choice, but a linear spline model with fewer knots (to avoid unnecessary loss of precision) would also have been acceptable. When adjusting for LDL as a confounder, I might consider the log fit as likely to be adequate on scientific grounds, but if there was any doubt in my mind I would use a quadratic or possibly a linear spline model with only 3 or 4 knots. Key among my recommendations is the avoidance of dummy variables, especially when fit to quartiles or quintiles. I have not yet seen a situation where that is superior to one of these other approaches.

3864, 02/14/14,

1/3 Missing chart with the fitted HR from all the relevant models and analysis of how the models differ. (There is not much difference. The straight line model is much closer to the three than the dummy variable model. The dummy variable model seemed a poor choice. The flexible model seemed to agree with the log fit.)
