Lecture5-Regression - Pennsylvania State University
8/1/16
Lecture 5: Linear Regression Modeling Christopher S. Hollenbeak, PhD Jane R. Schubart, PhD The Outcomes Research Toolbox
Review of Homework
What sample size is required to achieve 80% power (alpha level 0.05) for a randomized trial that expects a mortality rate of 5% in the treatment group and 10% in the control group?
STATA: sampsi 0.10 0.05, p(0.8)

Estimated sample size for two-sample comparison of proportions

Test Ho: p1 = p2, where p1 is the proportion in population 1
                    and p2 is the proportion in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            p1 =   0.1000
            p2 =   0.0500
         n2/n1 =   1.00

Estimated required sample sizes:

            n1 =      474
            n2 =      474
Review of Homework
What sample size is required to achieve 80% power (alpha level 0.05) for a randomized trial that expects a mortality rate of 5% in the treatment group and 10% in the control group?
R: pwr.2p.test(h=ES.h(0.10,0.05), power=0.80, sig.level=0.05)

     Difference of proportion power calculation for binomial distribution (arcsine transformation)

              h = 0.1924743
              n = 423.7319
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: same sample sizes
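The Stata and R answers to this question differ (474 versus about 424 per group) because the two commands rest on different approximations. The base R sketch below reproduces both figures under two assumptions that are not stated on the slides: that pwr.2p.test works from the arcsine effect size h with a normal approximation, and that sampsi applies a continuity correction by default. All variable names are illustrative.

```r
# Sample size per group for comparing two proportions
# (80% power, two-sided alpha of 0.05).
p1 <- 0.10; p2 <- 0.05
alpha <- 0.05; power <- 0.80
za <- qnorm(1 - alpha / 2)   # critical value for the two-sided test
zb <- qnorm(power)           # normal quantile for the target power

# 1) Arcsine-transformed effect size (Cohen's h), the quantity pwr reports
h <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
n_arcsine <- 2 * ((za + zb) / h)^2

# 2) Classical normal approximation, no continuity correction
pbar <- (p1 + p2) / 2
n_normal <- (za * sqrt(2 * pbar * (1 - pbar)) +
             zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2

# 3) Same formula with a continuity correction
#    (ASSUMPTION: this is what sampsi does by default)
n_cc <- (n_normal / 4) * (1 + sqrt(1 + 4 / (n_normal * abs(p1 - p2))))^2

round(c(arcsine = n_arcsine, normal = n_normal, corrected = n_cc), 1)
```

Rounding the corrected value up to a whole patient should give the 474 per group that Stata reports, and the arcsine value should match the pwr output closely.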
Review of Homework
What sample size is required to achieve 80% power (alpha level 0.05) for a randomized trial that expects a systolic blood pressure of 120 in the treatment group and 130 in the control group, if the standard deviation is 10?
STATA: sampsi 120 130, sd(10) p(.8)

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =      120
            m2 =      130
           sd1 =       10
           sd2 =       10
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =       16
            n2 =       16
Review of Homework
What sample size is required to achieve 80% power (alpha level 0.05) for a randomized trial that expects a systolic blood pressure of 120 in the treatment group and 130 in the control group, if the standard deviation is 10?
R: pwr.t.test(d=(120-130)/10, power=0.80, sig.level=0.05, type="two.sample")

     Two-sample t test power calculation

              n = 16.71473
              d = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
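Stata (16 per group) and R (about 16.7, so 17 per group) disagree slightly here as well. One plausible explanation, offered as an assumption rather than taken from the slides, is that sampsi uses the normal approximation while the R calculation uses the t distribution. A base R sketch of both, using power.t.test from the stats package instead of pwr:

```r
# Sample size per group for comparing two means (difference 10, SD 10).
delta <- 130 - 120
sdev  <- 10
alpha <- 0.05; power <- 0.80

# Normal-approximation formula: n = 2 * ((z_alpha/2 + z_beta) * sd / delta)^2
n_normal <- 2 * ((qnorm(1 - alpha / 2) + qnorm(power)) * sdev / delta)^2

# t-based calculation from base R
n_t <- power.t.test(delta = delta, sd = sdev, sig.level = alpha, power = power,
                    type = "two.sample", alternative = "two.sided")$n

# Round up to whole patients per group
c(normal = ceiling(n_normal), t_based = ceiling(n_t))
```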
Review of Homework
How much power does a randomized controlled trial have that yielded a mortality rate of 5% among 250 treated patients and 10% among 275 control patients?
STATA: sampsi .05 .1, n1(250) n2(275)

Estimated power for two-sample comparison of proportions

Test Ho: p1 = p2, where p1 is the proportion in population 1
                    and p2 is the proportion in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
            p1 =   0.0500
            p2 =   0.1000
sample size n1 =      250
            n2 =      275
         n2/n1 =     1.10

Estimated power:

         power =   0.5130
Review of Homework
How much power does a randomized controlled trial have that yielded a mortality rate of 5% among 250 treated patients and 10% among 275 control patients?
R: pwr.2p2n.test(h=ES.h(0.10,0.05), n1=275, n2=250, sig.level=0.05)

     Difference of proportion power calculation for binomial distribution (arcsine transformation)

              h = 0.1924743
             n1 = 275
             n2 = 250
      sig.level = 0.05
          power = 0.5958599
    alternative = two.sided

NOTE: different sample sizes
Review of Homework
How much power was achieved in a randomized controlled trial that achieved a LOS of 9 days (SD=3.2) among 130 treated patients, and a LOS of 12 days (SD=7) among 140 control patients?
STATA: sampsi 9 12, sd1(3.2) sd2(7) n1(130) n2(140)

Estimated power for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
            m1 =        9
            m2 =       12
           sd1 =      3.2
           sd2 =        7
sample size n1 =      130
            n2 =      140
         n2/n1 =     1.08

Estimated power:

         power =   0.9956
Review of Homework
How much power was achieved in a randomized controlled trial that achieved a LOS of 9 days (SD=3.2) among 130 treated patients, and a LOS of 12 days (SD=7) among 140 control patients?
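R: psd <- (129*3.2 + 139*7)/(129+139)
   pwr.t2n.test(d=(12-9)/psd, n1=130, n2=140, sig.level=0.05)

     t test power calculation

             n1 = 130
             n2 = 140
              d = 0.5801703
      sig.level = 0.05
          power = 0.9973337
    alternative = two.sided

NOTE: psd as computed here is a weighted average of the two standard deviations. The conventional pooled SD averages the variances instead, sqrt((129*3.2^2 + 139*7^2)/(129+139)), which gives a slightly smaller d (about 0.545).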
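To make the difference between the two pooling choices concrete, the sketch below computes the effect size both ways in base R. The variable names are illustrative, and the power itself is not recomputed here.

```r
# Group summaries from the homework question
n1 <- 130; s1 <- 3.2   # treated patients
n2 <- 140; s2 <- 7     # control patients
mean_diff <- 12 - 9    # difference in mean LOS (days)

# Weighted average of the SDs, as in the R code on the slide
sd_avg <- ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)

# Conventional pooled SD: weighted average of the variances, then square root
sd_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

d_slide  <- mean_diff / sd_avg
d_pooled <- mean_diff / sd_pooled
c(d_slide = d_slide, d_pooled = d_pooled)
```

With samples this large, either value of d leaves the power very close to 1, so the conclusion of the homework answer is unlikely to change.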
Overview
• Linear regression model conceptually
• Linear regression model graphically
• Performing linear regression in Stata
• Performing linear regression in R
• A regression model for LOS in liver transplantation
Linear Regression Model
• The linear regression model is the most important statistical model you will learn
• It may not fit every situation, but it will be the starting point
• Recall we use linear regression when we have a continuous dependent variable
Linear Regression Model
• Linear regression is the first multivariate model we will study
– Multivariate means there may be more than one independent variable
• Independent variables could be continuous, binary, or categorical
When to Use Linear Regression
• Think of cause-and-effect
• Examples:
– What is the effect of age on total costs?
– What is the effect of a surgical site infection on total length of stay?
Linear Model
• The linear regression model proposes a linear relationship between two variables
– Then uses the data to find the best straight line model
• Think back to high school algebra
– Linear function was y = mx + b
• Linear function has a slope and an intercept
[Figure: Linear Function. Plot of f(x) = -2 + 1.5x against x, with the intercept (b = -2) and the slope (m = 1.5) labeled.]
Intercept and Slope
• The intercept tells the value of the function when x is zero
– Here, when x is zero, the function equals -2
• The slope tells how much y changes when x changes one unit
– Here, for each one unit increase in x, y goes up by 1.5
– If slope is negative, then y goes down when x goes up
• If we know the slope and intercept terms, then we can predict y
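As a small illustration of predicting y from a known slope and intercept, the sketch below codes the line from the example, f(x) = -2 + 1.5x, as an R function. The function and variable names are illustrative.

```r
# The example line: intercept b = -2, slope m = 1.5
b <- -2
m <- 1.5
f <- function(x) b + m * x

f(0)             # value at x = 0 is the intercept
f(3) - f(2)      # a one-unit increase in x changes y by the slope
f(c(-2, 0, 4))   # predictions for several x values at once
```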
Linear Model
• In statistics we will introduce
– More than one x
– A random error term
$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \varepsilon_i$$
• i is an index for observations (usually patients)
• $\varepsilon_i$ is a random error term
• We assume $\varepsilon_i$ is normally distributed
Linear Regression
• The linear regression model estimates the straight line that best fits the data
• Best fit is the line that minimizes the “error” or the difference between the predicted line and the actual values
• Since errors are positive and negative, each error is squared, and then summed
• This gives an estimate for each beta coefficient
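The least-squares idea can be sketched on simulated data: the coefficients that minimize the sum of squared errors solve the normal equations, and they agree with what lm() reports. Everything below (the data, the true coefficients, the names) is made up for illustration and is not from the liver transplant data.

```r
set.seed(1)                      # simulated data, for illustration only
n  <- 200
x1 <- rnorm(n)                   # a continuous covariate
x2 <- rbinom(n, 1, 0.4)          # a binary covariate
y  <- 3 + 2 * x1 - 1 * x2 + rnorm(n)

# Design matrix with a column of 1s for the intercept
X <- cbind(1, x1, x2)

# Solve the normal equations (X'X) b = X'y for the least-squares estimates
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Same model through lm()
fit <- lm(y ~ x1 + x2)

# Sum of squared errors for any candidate coefficient vector
sse <- function(b) sum((y - X %*% b)^2)

cbind(normal_eq = drop(beta_hat), lm = coef(fit))
sse(beta_hat) <= sse(beta_hat + c(0.1, 0, 0))   # moving away from the estimate raises the SSE
```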
[Figure: Linear Regression. Plot of f(x) against x with labels for Predicted Values, Actual Values, Error (+), and Error (-).]
Regression Results
• Linear regression results include:
– An estimate of the intercept
– Estimates of beta coefficients for each variable
– 95% confidence interval for each coefficient
– Test statistic for the hypothesis that each coefficient equals zero (i.e. the null hypothesis)
– A p-value for the null hypothesis
– A measure of goodness-of-fit
How To Interpret Results
• The intercept term is the expected value of the dependent variable when all independent variables equal zero
• For continuous independent variables the coefficient is the incremental effect on the dependent variable of a one-unit increase
• For binary independent variables the coefficient is the incremental effect of the presence of the variable
• For categorical independent variables the coefficient is the incremental effect of the variable relative to the reference category
Example
      Source |       SS       df       MS              Number of obs =     777
-------------+------------------------------           F(  5,   771) =   16.69
       Model |  65650.0665     5  13130.0133           Prob > F      =  0.0000
    Residual |  606554.886   771  786.711914           R-squared     =  0.0977
-------------+------------------------------           Adj R-squared =  0.0918
       Total |  672204.952   776  866.243495           Root MSE      =  28.048

------------------------------------------------------------------------------
         los |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.3017448   .0592052    -5.10   0.000    -.4179674   -.1855222
      female |   2.829169   2.037956     1.39   0.165    -1.171431    6.829769
       black |   7.158582   5.015663     1.43   0.154    -2.687394    17.00456
        abmm |   3.968621   1.118231     3.55   0.000     1.773483    6.163759
         ssi |   11.87794   2.090173     5.68   0.000     7.774839    15.98105
       _cons |   22.54716   4.911483     4.59   0.000      12.9057    32.18863
------------------------------------------------------------------------------

Regress LOS on covariates in the liver transplant data
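To see how the coefficients combine into a prediction, the sketch below plugs a hypothetical patient into the fitted equation using the coefficients from this output. The patient profile is invented for illustration, and abmm is assumed to be the number of HLA-A and -B mismatches.

```r
# Coefficients copied from the Stata output for the LOS regression
b <- c(cons = 22.54716, age = -0.3017448, female = 2.829169,
       black = 7.158582, abmm = 3.968621, ssi = 11.87794)

# Hypothetical patient: 50-year-old white male, no HLA-A/B mismatches, with an SSI
# (abmm is ASSUMED to count HLA-A and -B mismatches)
patient <- c(cons = 1, age = 50, female = 0, black = 0, abmm = 0, ssi = 1)

# Predicted LOS = intercept + sum of coefficient * covariate value
predicted_los <- sum(b * patient[names(b)])
predicted_los

# The same patient without an SSI differs by exactly the ssi coefficient
predicted_los - b[["ssi"]]
```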
Liver Transplant Paper
• Interpret Table V
[Excerpt from Hollenbeak, Alfrey, and Souba. Surgery, Volume 130, Number 2, p. 393]

[...]ly to develop a surgical site infection (P = .037), and low levels of serum albumin in grams per deciliters were associated with a 41% greater risk of infection (P = .009). Patients who received OKT3 in the first postoperative week were approximately 50% more likely to develop a surgical site infection (P = .039), which suggests the important relationship between acute rejection and infection. Other factors studied, including pediatric transplant recipients, female sex, ethnicity, volume of packed red blood cells, and cold ischemia time did not significantly affect the likelihood of surgical site infection.

Patient and graft survival. One-year patient and graft survival were estimated with the method of Kaplan and Meier [28]. Death with a functioning graft was considered a graft loss. Although the 1-year survival rate for patients with infections was slightly lower than that for patients without infections (88.9% vs 91.8%), the difference was not statistically significant (P = .22; Fig 1). The graft survival rate, however, was significantly lower for patients with surgical site infections (80% vs 87%, P = .018).

Resource utilization. As noted previously, 292 patients developed surgical site infections. However, only 159 patients developed surgical site infection during the transplant admission. These 159 patients had significantly higher resource utilization requirements than patients who did not develop surgical site infections during the transplant hospitalization. Patients who developed surgical site infections incurred on average, unadjusted for other factors, $159,975 in additional charges ($337,409 vs $177,433; P = .0001) and 24 additional hospital days (47 vs 23; P = .0001) compared with uninfected transplant recipients. Note that infections did not concentrate resource use in any single department but significantly increased costs throughout all cost centers (P < .01 for all cost centers; Fig 2).

Results from a multivariate analysis of costs are presented in Table V and show that although surgical site infections were not the only factor that contributed to increased resource utilization, they were the most costly. Severity, determined by the Karnofsky scale, increased charges by nearly $11,000 per index point (P = .0002). Packed whole red blood cells increased charges by $9727 (P < .0001) per 1000 mL unit. Each additional hour of cold ischemia time resulted in an additional $1656 (P = .021) in charges, and each HLA-A or -B mismatch cost $13,504 (P = .037). Finally, patients with edema incurred, on average, $24,763 in additional charges (P = .041). Race and sex were not significantly associated with increased charges. Controlling for these other factors reduced the estimate of excess costs attributable to surgical site infections to $131,276 (P = .0001). Surgical site infections had the largest impact on resource utilization of any variable studied.

Table V. Multivariate analysis of factors affecting the cost of liver transplantation

Variable                      Estimated cost   P value
Surgical site infection       $131,276         .0001
Karnofsky scale               $11,009          .0002
Packed red blood cells        $9727            .0001
Cold ischemia time            $1656            .0211
HLA-A and -B mismatches       $13,504          .0372
Edema                         $24,763          .0411
Female sex                    $17,683          .1308
Black ethnicity               $27,665          .333

R² = 0.21

DISCUSSION

Preoperative infections are significant sources of morbidity, mortality, and resource utilization among orthotopic liver transplant recipients [5,6,29-31]. In this study of 777 first, single organ transplant recipients from the NIDDK Liver Transplantation Database, we found that transplant recipients who develop surgical site infections had a significantly higher rate of graft loss, accumulated more than 24 additional hospital days, and incurred approximately $131,000 in excess charges. As noted previously, these excess charges represent 1995 dollars. The medical care component of the Consumer Price Index rose approximately 18.3% between 1995 and 2000. Inflating excess charges by this amount suggests that the excess charges attributable to surgical site infections may be as high as $155,269 today.

Our findings, with respect to risk factors for infection, are similar to previous reports in the literature. Pretransplant ascites and low serum albumin levels have been previously suggested as risk factors for infection [5,12]. Others have also found factors relating to surgical technique, including leaks in the biliary anastomosis and duration of surgical procedure, and use of OKT3 to be significant predictors of infection [7,13,14,16,19]. Our finding that patients with surgical site infections do not experience significantly higher rates of mortality but do have higher rates of graft loss was also suggested by Whiting et al [13].

Multivariate analysis of charges suggested that each HLA-A, and HLA-B mismatch was associated with $13,504 in excess charges. Although rejection episodes may explain this phenomenon, it is [...]
Goodness of Fit
• How well the model fits the data is measured by the coefficient of determination: R²
– R² is the proportion of the variation in the dependent variable explained by the model
– Ranges from 0 to 1 (often quoted as a percent)
• Most cross-sectional analyses are lucky to get 35%
• Be suspicious of R² values over 80%!
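As a worked check, R² and adjusted R² can be recomputed from the sums of squares in the Stata output for the LOS regression shown earlier. The sketch below uses those reported numbers; the variable names are illustrative.

```r
# Sums of squares from the Stata output for the LOS regression
ss_model <- 65650.0665
ss_resid <- 606554.886
ss_total <- 672204.952
n <- 777   # observations
k <- 5     # independent variables

# R-squared: share of total variation explained by the model
r2 <- ss_model / ss_total

# Adjusted R-squared: penalizes for the number of independent variables
adj_r2 <- 1 - (ss_resid / (n - k - 1)) / (ss_total / (n - 1))

round(c(r2 = r2, adj_r2 = adj_r2), 4)
```

These should agree with the R-squared (0.0977) and Adj R-squared (0.0918) lines of the output, so this model explains a little under 10% of the variation in LOS.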
Assumptions of Linear Regression
1. There is a linear relationship between the dependent and independent variables (linearity, violation is nonlinearity)
2. The errors are normally distributed (normality, violation is non-normality)
3. The variance of the error term is constant (homoskedasticity, violation is heteroskedasticity)
4. Covariates are not highly correlated with each other (violation is called colinearity)
[Figure: Violation of Linearity. Plot of f(x) against x.]
[Figure: Violation of Normality. Histogram of LOS (density).]
[Figure: Violation of Constant Variance. Plot of f(x) against x.]
Colinearity
• The other problems are not too serious, usually
• Colinearity can be a huge problem
• Symptoms:
– Model fit will be excellent (high R²), but
– Few coefficients are statistically significant
• Perform correlations on covariates and look for highly correlated pairs
• Remove one of the offending variables in a pair
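The correlation screen described above can be sketched in R on simulated covariates. The data, the variable names, and the 0.8 cutoff are all assumptions chosen for illustration, not values from the lecture.

```r
set.seed(2)                          # simulated covariates, for illustration only
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)        # nearly a copy of x1, so highly correlated
x3 <- rnorm(n)                       # unrelated to the other two
covars <- data.frame(x1, x2, x3)

# Pairwise correlations among the covariates
r <- cor(covars)
r[upper.tri(r, diag = TRUE)] <- NA   # keep each pair only once

# Flag pairs whose absolute correlation exceeds an arbitrary threshold
high <- which(abs(r) > 0.8, arr.ind = TRUE)

data.frame(var1 = rownames(r)[high[, "row"]],
           var2 = colnames(r)[high[, "col"]],
           r    = r[high])
```

Only the x1 and x2 pair should be flagged; one of the two would then be dropped before fitting the regression.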
Regression in Stata
• Command to run a regression in Stata is
– regress depvar indvar1 indvar2 … indvark
• Example:
– regress los age female black abmm ssi
Stata Results
      Source |       SS       df       MS              Number of obs =     777
-------------+------------------------------           F(  5,   771) =   16.69
       Model |  65650.0665     5  13130.0133           Prob > F      =  0.0000
    Residual |  606554.886   771  786.711914           R-squared     =  0.0977
-------------+------------------------------           Adj R-squared =  0.0918
       Total |  672204.952   776  866.243495           Root MSE      =  28.048

------------------------------------------------------------------------------
         los |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.3017448   .0592052    -5.10   0.000    -.4179674   -.1855222
      female |   2.829169   2.037956     1.39   0.165    -1.171431    6.829769
       black |   7.158582   5.015663     1.43   0.154    -2.687394    17.00456
        abmm |   3.968621   1.118231     3.55   0.000     1.773483    6.163759
         ssi |   11.87794   2.090173     5.68   0.000     7.774839    15.98105
       _cons |   22.54716   4.911483     4.59   0.000      12.9057    32.18863
------------------------------------------------------------------------------
Regression in Stata
• When including binary and categorical variables, you must leave out one category
– For example, I can only include male OR female, not both
• The excluded category becomes the REFERENCE category
• If you try to include all, Stata will drop one
• Stata drops any observation that has a missing value for any variable in the model
Regression in R
• The regression function is lm()
– lm stands for “linear model”
• Create a linear model object, then summarize
– reg1 <- lm(depvar ~ indvar1 + indvar2 + … + indvark)
– summary(reg1)
• Example:
– reg1 <- lm(data1$los ~ data1$age + data1$female + data1$black + data1$abmm + data1$ssi)
– summary(reg1)
R Results

Call:
lm(formula = data1$los ~ data1$age + data1$female + data1$black +
    data1$abmm + data1$ssi)

Residuals:
    Min      1Q  Median      3Q     Max
 -39.11  -13.73   -6.09    3.57  338.09

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  22.54716    4.91148   4.591 5.16e-06 ***
data1$age    -0.30174    0.05921  -5.097 4.35e-07 ***
data1$female  2.82917    2.03796   1.388  0.16547
data1$black   7.15858    5.01566   1.427  0.15391
data1$abmm    3.96862    1.11823   3.549  0.00041 ***
data1$ssi    11.87794    2.09017   5.683 1.88e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.05 on 771 degrees of freedom
Multiple R-squared:  0.09766,  Adjusted R-squared:  0.09181
F-statistic: 16.69 on 5 and 771 DF,  p-value: 1.127e-15
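The coefficient names in this output carry the data1$ prefix because each variable was passed as data1$name. An equivalent and somewhat tidier call uses the data argument of lm(). The sketch below shows that both styles fit the same model; the data frame here is a simulated stand-in with only two covariates, not the course data set.

```r
set.seed(3)   # hypothetical stand-in for data1; all values are simulated
data1 <- data.frame(age = round(runif(100, 18, 70)),
                    ssi = rbinom(100, 1, 0.3))
data1$los <- 20 - 0.3 * data1$age + 12 * data1$ssi + rnorm(100, sd = 5)

reg_a <- lm(data1$los ~ data1$age + data1$ssi)   # style used on the slide
reg_b <- lm(los ~ age + ssi, data = data1)       # same model via the data argument

names(coef(reg_a))   # names include the data1$ prefix
names(coef(reg_b))   # plain variable names
all.equal(unname(coef(reg_a)), unname(coef(reg_b)))
```

The data argument also makes it easier to reuse the fitted model with predict() on new data, since the variable names no longer refer to one specific data frame.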
R and Confidence Intervals
• Notice that the summary() output in R does not include confidence intervals
• Two choices:
– Compute them by hand in Excel from the coefficient and the standard error
  • Lower: $\beta - 1.96 \times SE$
  • Upper: $\beta + 1.96 \times SE$
– Use confint() function in R
  • confint(reg1)
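As a worked check of the hand calculation, the sketch below computes the 95% interval for the ssi coefficient from its estimate and standard error in the regression output. It also shows the t-based version, since confint() on an lm object uses the t distribution rather than 1.96; with 771 degrees of freedom the two are nearly identical. Variable names are illustrative.

```r
# Coefficient and standard error for ssi, from the regression output
est <- 11.87794
se  <- 2.090173
df  <- 771          # residual degrees of freedom

ci_normal <- est + c(-1, 1) * 1.96 * se            # hand calculation with 1.96
ci_t      <- est + c(-1, 1) * qt(0.975, df) * se   # t-based interval

rbind(normal = ci_normal, t = ci_t)
```

The t-based limits should match the interval Stata prints for ssi (7.774839 to 15.98105) up to rounding.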
Reporting Regression Results
• Include coefficient, 95% confidence interval, and p-value
• For p-values less than 0.0001, indicate with “< 0.0001” rather than the actual p-value
• Indicate reference group
• Indicate units for continuous variables
• Include R² in the table caption
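The p-value rule can be applied mechanically with a small helper. The function below is a hypothetical convenience written for this illustration (it is not part of base R), applied to the p-values from the R output for the LOS model.

```r
# Format p-values for a results table:
# values below 0.0001 are shown as "< 0.0001", others to four decimals
format_p <- function(p, digits = 4) {
  ifelse(p < 0.0001, "< 0.0001", formatC(p, format = "f", digits = digits))
}

# p-values from the R output for the LOS model
p <- c(age = 4.35e-07, female = 0.16547, black = 0.15391,
       abmm = 0.00041, ssi = 1.88e-08)
format_p(p)
```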
Categories or Continuous?
• We have two versions of the age variable
– Continuous
– Categories (0-39, 40-49, 50-59, 60+)
• Which should we use in regression?
• Both are correct, but make different assumptions about the relationship between age and LOS
– Continuous assumes linear relationship
– Categories allow for nonlinear relationship
Categories or Continuous?
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     14.747      4.388   3.361 0.000816 ***
data1$age4049  -10.050      2.745  -3.661 0.000269 ***
data1$age5059   -9.675      2.742  -3.528 0.000443 ***
data1$age60     -3.849      3.170  -1.214 0.225053
data1$female     2.579      2.057   1.254 0.210340
data1$black      9.544      5.039   1.894 0.058584 .
data1$abmm       4.055      1.125   3.603 0.000335 ***
data1$ssi       12.210      2.102   5.809 9.21e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Is the relationship linear?
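One way to build age categories like those in this model is cut(), which turns a continuous variable into a factor; lm() then creates one indicator per category and omits the first level as the reference. The sketch below uses simulated ages, and the variable names are illustrative rather than those in the course data set.

```r
set.seed(4)   # simulated ages, for illustration only
age <- round(runif(200, 0, 80))

# Categories from the lecture: 0-39 (reference), 40-49, 50-59, 60+
age_cat <- cut(age, breaks = c(-Inf, 39, 49, 59, Inf),
               labels = c("0-39", "40-49", "50-59", "60+"))

table(age_cat)

# The indicator columns lm() would build; the first level (0-39) is left out
head(model.matrix(~ age_cat))
```

If the category coefficients do not rise or fall in roughly equal steps, as in the output above, that suggests the relationship between age and LOS is not linear.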
[Slides 37 and 38: table template (not recovered from the slides)]
Write-up
• Here is how to describe regression results in your paper:
“The reference patient, which is a white male infant with no surgical site infection and zero HLA A or B mismatches, stayed on average 22.5 days. After controlling for other factors, patients who developed a surgical site infection stayed on average nearly 12 days longer than patients without infections (p<0.0001).”
Homework
• Read the Hollenbeak paper that is posted on the web site
• Replicate Table V in the paper, but do not include edema or the Karnofsky score
– Your coefficients will be different!
– Format the table according to the template on Slides 37 and 38