Lecture5-Regression - Pennsylvania State University · • Linear regression is the first...


8/1/16


Lecture 5: Linear Regression Modeling
Christopher S. Hollenbeak, PhD
Jane R. Schubart, PhD
The Outcomes Research Toolbox

Review of Homework

2

What sample size is required to achieve 80% power (alpha level 0.05) for a randomized trial that expects a mortality rate of 5% in the treatment group and 10% in the control group?

STATA: sampsi 0.10 0.05, p(0.8)

Estimated sample size for two-sample comparison of proportions

Test Ho: p1 = p2, where p1 is the proportion in population 1
                    and p2 is the proportion in population 2
Assumptions:
         alpha =   0.0500  (two-sided)
         power =   0.8000
            p1 =   0.1000
            p2 =   0.0500
         n2/n1 =   1.00

Estimated required sample sizes:
            n1 =      474
            n2 =      474

Review of Homework

3

What sample size is required to achieve 80% power (alpha level 0.05) for a randomized trial that expects a mortality rate of 5% in the treatment group and 10% in the control group?

R: pwr.2p.test(h=ES.h(0.10,0.05), power=0.80, sig.level=0.05)

     Difference of proportion power calculation for binomial distribution (arcsine transformation)

              h = 0.1924743
              n = 423.7319
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: same sample sizes
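The Stata and R answers to this question differ (474 versus about 424 per group) because the two commands rest on different approximations. As a rough check, the Python sketch below reproduces both numbers: a normal approximation with a continuity correction (which sampsi applies by default), and the arcsine-transformed effect size h used by pwr.2p.test. The function names are illustrative and are not part of either package.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def n_two_proportions(p1, p2, alpha=0.05, power=0.80, continuity=True):
    """Per-group n for a two-sided test of two proportions (equal groups)."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    diff = abs(p1 - p2)
    n = (z_a * sqrt(2 * p_bar * (1 - p_bar))
         + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / diff ** 2
    if continuity:
        # Continuity correction for equal group sizes
        n = n / 4 * (1 + sqrt(1 + 4 / (n * diff))) ** 2
    return n


def n_arcsine(p1, p2, alpha=0.05, power=0.80):
    """Per-group n from the arcsine effect size h (normal approximation)."""
    h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return 2 * ((z_a + z_b) / h) ** 2


n_stata_like = ceil(n_two_proportions(0.10, 0.05))
n_uncorrected = ceil(n_two_proportions(0.10, 0.05, continuity=False))
n_r_like = n_arcsine(0.10, 0.05)
print(n_stata_like, n_uncorrected, round(n_r_like, 1))
```

Under these assumptions the corrected formula gives 474 per group and the arcsine formula gives roughly 423.7, so the gap comes from the method, not from a mistake in either command.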


Review of Homework

4

What sample size is required to achieve 80% power (alpha level 0.05) for a randomized trial that expects a systolic blood pressure of 120 in the treatment group and 130 in the control group, if the standard deviation is 10?

STATA: sampsi 120 130, sd(10) p(.8)

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:
         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =      120
            m2 =      130
           sd1 =       10
           sd2 =       10
         n2/n1 =     1.00

Estimated required sample sizes:
            n1 =       16
            n2 =       16

Review of Homework

5

What sample size is required to achieve 80% power (alpha level 0.05) for a randomized trial that expects a systolic blood pressure of 120 in the treatment group and 130 in the control group, if the standard deviation is 10?

R: pwr.t.test(d=(120-130)/10, power=0.80, sig.level=0.05, type="two.sample")

     Two-sample t test power calculation

              n = 16.71473
              d = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
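Stata reports 16 per group while R reports 16.71, which rounds up to 17. A plausible explanation is that sampsi uses the normal approximation while pwr.t.test works with the t distribution. The short Python sketch below applies the textbook normal-approximation formula n = 2(z_alpha/2 + z_beta)^2 sd^2 / (m1 - m2)^2; the function name is illustrative only.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def n_two_means(m1, m2, sd, alpha=0.05, power=0.80):
    """Per-group n for a two-sided comparison of means with a common SD."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return 2 * ((z_a + z_b) * sd / (m1 - m2)) ** 2


n_normal = n_two_means(120, 130, 10)
print(round(n_normal, 2), ceil(n_normal))
```

The unrounded value is about 15.7, which rounds up to the 16 per group shown in the Stata output.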

Review of Homework

6

How much power does a randomized controlled trial have that yielded a mortality rate of 5% among 250 treated patients and 10% among 275 control patients?

STATA: sampsi .05 .1, n1(250) n2(275)

Estimated power for two-sample comparison of proportions

Test Ho: p1 = p2, where p1 is the proportion in population 1
                    and p2 is the proportion in population 2
Assumptions:
         alpha =   0.0500  (two-sided)
            p1 =   0.0500
            p2 =   0.1000
sample size n1 =      250
            n2 =      275
         n2/n1 =     1.10

Estimated power:
         power =   0.5130


Review of Homework

7

How much power does a randomized controlled trial have that yielded a mortality rate of 5% among 250 treated patients and 10% among 275 control patients?

R: pwr.2p2n.test(h=ES.h(0.10,0.05), n1=275, n2=250, sig.level=0.05)

     Difference of proportion power calculation for binomial distribution (arcsine transformation)

              h = 0.1924743
             n1 = 275
             n2 = 250
      sig.level = 0.05
          power = 0.5958599
    alternative = two.sided

NOTE: different sample sizes

Review of Homework

8

How much power was achieved in a randomized controlled trial that achieved a LOS of 9 days (SD=3.2) among 130 treated patients, and a LOS of 12 days (SD=7) among 140 control patients?

STATA: sampsi 9 12, sd1(3.2) sd2(7) n1(130) n2(140)

Estimated power for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:
         alpha =   0.0500  (two-sided)
            m1 =        9
            m2 =       12
           sd1 =      3.2
           sd2 =        7
sample size n1 =      130
            n2 =      140
         n2/n1 =     1.08

Estimated power:
         power =   0.9956

Review of Homework

9

How much power was achieved in a randomized controlled trial that achieved a LOS of 9 days (SD=3.2) among 130 treated patients, and a LOS of 12 days (SD=7) among 140 control patients?

R: psd <- (129*3.2 + 139*7)/(129+139)
   pwr.t2n.test(d=(12-9)/psd, n1=130, n2=140, sig.level=0.05)

     t test power calculation

             n1 = 130
             n2 = 140
              d = 0.5801703
      sig.level = 0.05
          power = 0.9973337
    alternative = two.sided
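Note that psd in the R code above is a weighted average of the two standard deviations. The conventional pooled standard deviation weights the variances and then takes a square root. The Python sketch below computes both so the effect sizes can be compared; the variable names are illustrative, and the power for the second effect size is not recomputed here.

```python
from math import sqrt

n1, sd1 = 130, 3.2  # treated patients
n2, sd2 = 140, 7.0  # control patients

# As on the slide: weighted average of the standard deviations
sd_weighted_avg = ((n1 - 1) * sd1 + (n2 - 1) * sd2) / (n1 + n2 - 2)

# Conventional pooled SD: weighted average of the variances, then square root
sd_pooled = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))

d_slide = (12 - 9) / sd_weighted_avg
d_pooled = (12 - 9) / sd_pooled
print(round(d_slide, 4), round(d_pooled, 4))
```

The first value reproduces the d = 0.5801703 in the R output; the pooled-variance version gives a somewhat smaller effect size, so the reported power should be read with that in mind.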


Overview

•  Linear regression model conceptually
•  Linear regression model graphically
•  Performing linear regression in Stata
•  Performing linear regression in R
•  A regression model for LOS in liver transplantation

10

Linear Regression Model

•  The linear regression model is the most important statistical model you will learn

•  It may not fit every situation, but it will be the starting point

•  Recall we use linear regression when we have a continuous dependent variable

11

Linear Regression Model

•  Linear regression is the first multivariate model we will study
   –  Multivariate means there may be more than one independent variable

•  Independent variables could be continuous, binary, or categorical


When to Use Linear Regression

•  Think of cause-and-effect
•  Examples:
   –  What is the effect of age on total costs?
   –  What is the effect of a surgical site infection on total length of stay?

13

Linear Model

•  The linear regression model proposes a linear relationship between two variables
   –  Then uses the data to find the best straight-line model

•  Think back to high school algebra
   –  Linear function was y = mx + b

•  Linear function has a slope and an intercept

14

15

[Figure: Linear Function. Plot of f(x) = -2 + 1.5x against x, annotated with the intercept (b = -2) and the slope (m = 1.5).]


Intercept and Slope

•  The intercept tells the value of the function when x is zero
   –  Here, when x is zero, the function equals -2

•  The slope tells how much y changes when x changes one unit
   –  Here, for each one-unit increase in x, y goes up by 1.5
   –  If slope is negative, then y goes down when x goes up

•  If we know the slope and intercept terms, then we can predict y
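As a small illustration of predicting y from a known slope and intercept, the Python sketch below evaluates the linear function from the slide, f(x) = -2 + 1.5x. The function name f and its default arguments are chosen here for illustration.

```python
def f(x, intercept=-2.0, slope=1.5):
    """Linear function: the intercept plus the slope times x."""
    return intercept + slope * x


for x in (0, 1, 2):
    print(x, f(x))
```

At x = 0 the function returns the intercept, and each one-unit step in x changes the result by the slope.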

16

Linear Model

•  In statistics we will introduce
   –  More than one x
   –  A random error term

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \varepsilon_i$$

•  i is an index for observations (usually patients)
•  ε_i is a random error term
•  We assume ε_i is normally distributed

17

Linear Regression

•  The linear regression model estimates the straight line that best fits the data

•  Best fit is the line that minimizes the “error” or the difference between the predicted line and the actual values

•  Since errors are positive and negative, each error is squared, and then summed; the best-fit line is the one that minimizes this sum of squared errors

•  This gives an estimate for each beta coefficient
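To make the least-squares idea concrete, here is a minimal Python sketch for a single independent variable, using the closed-form solution (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The data points are made up for illustration and do not come from the lecture data.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sse(xs, ys, intercept, slope):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
best = sse(xs, ys, b0, b1)
worse = sse(xs, ys, b0, b1 + 0.1)  # nudge the slope away from the estimate
print(round(b0, 3), round(b1, 3), best < worse)
```

Moving either coefficient away from its estimate increases the sum of squared errors, which is what "best fit" means here.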

18


19

[Figure: Linear Regression. Actual values scattered around the line of predicted values, f(x) against x, with positive and negative errors marked.]

Regression Results

•  Linear regression results include:
   –  An estimate of the intercept
   –  Estimates of beta coefficients for each variable
   –  95% confidence interval for each coefficient
   –  Test statistic for the hypothesis that each coefficient equals zero (i.e. the null hypothesis)
   –  A p-value for the null hypothesis
   –  A measure of goodness-of-fit

20

How To Interpret Results

•  Intercept term is expected value of the dependent variable assuming all independent variables equal zero

•  For continuous independent variables the coefficient is the incremental effect on the dependent variable of a one-unit increase

•  For binary independent variables the coefficient is the incremental effect of the presence of the variable

•  For categorical independent variables the coefficient is the incremental effect of the variable relative to the reference category

21


Example

22

      Source |       SS       df       MS              Number of obs =     777
-------------+------------------------------           F(  5,   771) =   16.69
       Model |  65650.0665     5  13130.0133           Prob > F      =  0.0000
    Residual |  606554.886   771  786.711914           R-squared     =  0.0977
-------------+------------------------------           Adj R-squared =  0.0918
       Total |  672204.952   776  866.243495           Root MSE      =  28.048

------------------------------------------------------------------------------
         los |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.3017448   .0592052    -5.10   0.000    -.4179674   -.1855222
      female |   2.829169   2.037956     1.39   0.165    -1.171431    6.829769
       black |   7.158582   5.015663     1.43   0.154    -2.687394    17.00456
        abmm |   3.968621   1.118231     3.55   0.000     1.773483    6.163759
         ssi |   11.87794   2.090173     5.68   0.000     7.774839    15.98105
       _cons |   22.54716   4.911483     4.59   0.000      12.9057    32.18863
------------------------------------------------------------------------------

Regress LOS on covariates in the liver transplant data
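One way to read the coefficients in the Stata output above is to plug them into the regression equation for a specific patient. The Python sketch below does this for a hypothetical 50-year-old white male with no HLA mismatches, with and without a surgical site infection. The patient profile and the function name are assumptions made for illustration.

```python
# Coefficients copied from the Stata output above
coef = {
    "_cons": 22.54716,
    "age": -0.3017448,
    "female": 2.829169,
    "black": 7.158582,
    "abmm": 3.968621,
    "ssi": 11.87794,
}


def predict_los(age, female=0, black=0, abmm=0, ssi=0):
    """Predicted length of stay (days) from the fitted linear model."""
    return (coef["_cons"]
            + coef["age"] * age
            + coef["female"] * female
            + coef["black"] * black
            + coef["abmm"] * abmm
            + coef["ssi"] * ssi)


without_ssi = predict_los(age=50)
with_ssi = predict_los(age=50, ssi=1)
print(round(without_ssi, 2), round(with_ssi, 2), round(with_ssi - without_ssi, 2))
```

The difference between the two predictions equals the ssi coefficient, which is the "incremental effect of the presence of the variable" for a binary covariate.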

Liver Transplant Paper

•  Interpret Table V

[Excerpt: Hollenbeak, Alfrey, and Souba. Surgery, Volume 130, Number 2, p. 393]

ly to develop a surgical site infection (P = .037), and low levels of serum albumin in grams per deciliters were associated with a 41% greater risk of infection (P = .009). Patients who received OKT3 in the first postoperative week were approximately 50% more likely to develop a surgical site infection (P = .039), which suggests the important relationship between acute rejection and infection. Other factors studied, including pediatric transplant recipients, female sex, ethnicity, volume of packed red blood cells, and cold ischemia time did not significantly affect the likelihood of surgical site infection.

Patient and graft survival. One-year patient and graft survival were estimated with the method of Kaplan and Meier.28 Death with a functioning graft was considered a graft loss. Although the 1-year survival rate for patients with infections was slightly lower than that for patients without infections (88.9% vs 91.8%), the difference was not statistically significant (P = .22; Fig 1). The graft survival rate, however, was significantly lower for patients with surgical site infections (80% vs 87%, P = .018).

Resource utilization. As noted previously, 292 patients developed surgical site infections. However, only 159 patients developed surgical site infection during the transplant admission. These 159 patients had significantly higher resource utilization requirements than patients who did not develop surgical site infections during the transplant hospitalization. Patients who developed surgical site infections incurred on average, unadjusted for other factors, $159,975 in additional charges ($337,409 vs $177,433; P = .0001) and 24 additional hospital days (47 vs 23; P = .0001) compared with uninfected transplant recipients. Note that infections did not concentrate resource use in any single department but significantly increased costs throughout all cost centers (P < .01 for all cost centers; Fig 2).

Results from a multivariate analysis of costs are presented in Table V and show that although surgical site infections were not the only factor that contributed to increased resource utilization, they were the most costly. Severity, determined by the Karnofsky scale, increased charges by nearly $11,000 per index point (P = .0002). Packed whole red blood cells increased charges by $9727 (P < .0001) per 1000 mL unit. Each additional hour of cold ischemia time resulted in an additional $1656 (P = .021) in charges, and each HLA-A or -B mismatch cost $13,504 (P = .037). Finally, patients with edema incurred, on average, $24,763 in additional charges (P = .041). Race and sex were not significantly associated with increased charges. Controlling for these other factors reduced the estimate of excess costs attributable to surgical site infections to $131,276 (P = .0001). Surgical site infections had the largest impact on resource utilization of any variable studied.

Table V. Multivariate analysis of factors affecting the cost of liver transplantation

Variable                     Estimated cost    P value
Surgical site infection      $131,276          .0001
Karnofsky scale              $11,009           .0002
Packed red blood cells       $9727             .0001
Cold ischemia time           $1656             .0211
HLA-A and -B mismatches      $13,504           .0372
Edema                        $24,763           .0411
Female sex                   $17,683           .1308
Black ethnicity              $27,665           .333

R2 = 0.21

DISCUSSION

Preoperative infections are significant sources of morbidity, mortality, and resource utilization among orthotopic liver transplant recipients.5,6,29-31 In this study of 777 first, single organ transplant recipients from the NIDDK Liver Transplantation Database, we found that transplant recipients who develop surgical site infections had a significantly higher rate of graft loss, accumulated more than 24 additional hospital days, and incurred approximately $131,000 in excess charges. As noted previously, these excess charges represent 1995 dollars. The medical care component of the Consumer Price Index rose approximately 18.3% between 1995 and 2000. Inflating excess charges by this amount suggests that the excess charges attributable to surgical site infections may be as high as $155,269 today.

Our findings, with respect to risk factors for infection, are similar to previous reports in the literature. Pretransplant ascites and low serum albumin levels have been previously suggested as risk factors for infection.5,12 Others have also found factors relating to surgical technique, including leaks in the biliary anastomosis and duration of surgical procedure, and use of OKT3 to be significant predictors of infection.7,13,14,16,19 Our finding that patients with surgical site infections do not experience significantly higher rates of mortality but do have higher rates of graft loss was also suggested by Whiting et al.13

Multivariate analysis of charges suggested that each HLA-A, and HLA-B mismatch was associated with $13,504 in excess charges. Although rejection episodes may explain this phenomenon, it is

Goodness of Fit

•  How well the model fits the data is measured by the coefficient of determination: R2

–  R2 is the proportion of the variation in the dependent variable explained by the model

–  Ranges from 0 to 1

•  Most cross-sectional analyses are lucky to get 35%

•  Be suspicious of R2 values over 80%!
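R2 can be recovered directly from the sums of squares in a regression table. The Python sketch below uses the Model, Residual, and Total sums of squares from the Stata output for the LOS example and applies R2 = SS_model / SS_total, along with the usual adjusted R2 formula. The variable names are illustrative.

```python
# Sums of squares from the Stata output for the LOS regression
ss_model = 65650.0665
ss_residual = 606554.886
ss_total = 672204.952

n_obs = 777  # number of observations
k = 5        # number of independent variables

r2 = ss_model / ss_total
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - k - 1)
print(round(r2, 4), round(adj_r2, 4))
```

These match the R-squared (0.0977) and Adj R-squared (0.0918) reported by Stata: the model explains just under 10% of the variation in length of stay.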

24


Assumptions of Linear Regression

1.  There is a linear relationship between the dependent and independent variable (linearity, violation is nonlinearity)

2.  The errors are normally distributed (normality, violation is non-normality)

3.  The variance of the error term is constant (homoskedasticity, violation is heteroskedasticity)

4.  Covariates are not highly correlated with each other (violation is called colinearity)

25

26

[Figure: Violation of Linearity. Plot of f(x) against x.]

[Figure: Violation of Normality. Distribution of LOS (axis labels: LOS, Density).]

[Figure: Violation of Constant Variance. Plot of f(x) against x.]

Colinearity

•  The other problems are not too serious, usually
•  Colinearity can be a huge problem
•  Symptoms:
   –  Model fit will be excellent (high R2), but
   –  Few coefficients are statistically significant

•  Perform correlations on covariates and look for highly correlated pairs

•  Remove one of the offending variables in a pair
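The correlation check described above can be sketched in a few lines of Python. The covariate names, the data, and the 0.8 cutoff below are all assumptions made for illustration; the lecture does not specify a threshold for "highly correlated".

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical covariates, for illustration only
covariates = {
    "cov_a": [1, 2, 3, 4, 5],
    "cov_b": [2, 4, 6, 8, 10.5],  # nearly a multiple of cov_a
    "cov_c": [3, 1, 4, 1, 5],
}

THRESHOLD = 0.8  # assumed cutoff, not taken from the lecture

flagged = [
    (a, b)
    for a, b in combinations(covariates, 2)
    if abs(pearson(covariates[a], covariates[b])) > THRESHOLD
]
print(flagged)
```

Each flagged pair is a candidate for dropping one of its two variables from the model.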

27


Regression in Stata

•  Command to run a regression in Stata is
   –  regress depvar indvar1 indvar2 … indvark

•  Example:
   –  regress los age female black abmm ssi

28

Stata Results

29

      Source |       SS       df       MS              Number of obs =     777
-------------+------------------------------           F(  5,   771) =   16.69
       Model |  65650.0665     5  13130.0133           Prob > F      =  0.0000
    Residual |  606554.886   771  786.711914           R-squared     =  0.0977
-------------+------------------------------           Adj R-squared =  0.0918
       Total |  672204.952   776  866.243495           Root MSE      =  28.048

------------------------------------------------------------------------------
         los |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.3017448   .0592052    -5.10   0.000    -.4179674   -.1855222
      female |   2.829169   2.037956     1.39   0.165    -1.171431    6.829769
       black |   7.158582   5.015663     1.43   0.154    -2.687394    17.00456
        abmm |   3.968621   1.118231     3.55   0.000     1.773483    6.163759
         ssi |   11.87794   2.090173     5.68   0.000     7.774839    15.98105
       _cons |   22.54716   4.911483     4.59   0.000      12.9057    32.18863
------------------------------------------------------------------------------

Regression in Stata

•  When including binary and categorical variables, you must leave out one category
   –  For example, I can only include male OR female, not both

•  The excluded category becomes the REFERENCE category

•  If you try to include all, Stata will drop one
•  Stata drops an observation from the regression if any variable in the model is missing for that observation

30


Regression in R

•  The regression function is lm()
   –  lm stands for “linear model”

•  Create a linear model object, then summarize
   –  reg1 <- lm(depvar ~ indvar1 + indvar2 + … + indvark)
   –  summary(reg1)

•  Example:
   –  reg1 <- lm(data1$los ~ data1$age + data1$female + data1$black + data1$abmm + data1$ssi)
   –  summary(reg1)

R Results

Call:
lm(formula = data1$los ~ data1$age + data1$female + data1$black +
    data1$abmm + data1$ssi)

Residuals:
   Min     1Q Median     3Q    Max
-39.11 -13.73  -6.09   3.57 338.09

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  22.54716    4.91148   4.591 5.16e-06 ***
data1$age    -0.30174    0.05921  -5.097 4.35e-07 ***
data1$female  2.82917    2.03796   1.388  0.16547
data1$black   7.15858    5.01566   1.427  0.15391
data1$abmm    3.96862    1.11823   3.549  0.00041 ***
data1$ssi    11.87794    2.09017   5.683 1.88e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.05 on 771 degrees of freedom
Multiple R-squared: 0.09766, Adjusted R-squared: 0.09181
F-statistic: 16.69 on 5 and 771 DF, p-value: 1.127e-15

R and Confidence Intervals

•  Notice that summary() in R does not report confidence intervals for you

•  Two choices:
   –  Compute them by hand in Excel from the coefficient and the standard error
      •  Lower: $\beta - 1.96 \times SE$
      •  Upper: $\beta + 1.96 \times SE$
   –  Use confint() function in R
      •  confint(reg1)
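The by-hand calculation can be sketched in Python as well as in Excel. The example below applies the 1.96 rule to the ssi coefficient and standard error from the regression output; the function name ci95 is illustrative.

```python
def ci95(coef, se, z=1.96):
    """Approximate 95% confidence interval: coefficient +/- 1.96 * SE."""
    return coef - z * se, coef + z * se


# ssi coefficient and standard error from the regression output
lower, upper = ci95(11.87794, 2.090173)
print(round(lower, 4), round(upper, 4))
```

The result is close to, but slightly narrower than, the interval Stata reports for ssi (7.774839 to 15.98105), because Stata uses the t distribution with 771 degrees of freedom in place of the 1.96 normal value.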


Reporting Regression Results

•  Include coefficient, 95% confidence interval, and p-value

•  For p-values less than 0.0001, indicate with “< 0.0001” rather than the actual p-value

•  Indicate reference group
•  Indicate units for continuous variables
•  Include R2 in the table caption

34

Categories or Continuous?

•  We have two versions of the age variable
   –  Continuous
   –  Categories (0-39, 40-49, 50-59, 60+)

•  Which should we use in regression?
•  Both are correct, but make different assumptions about the relationship between age and LOS
   –  Continuous assumes a linear relationship
   –  Categories allow for a nonlinear relationship

Categories or Continuous?

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     14.747      4.388   3.361 0.000816 ***
data1$age4049  -10.050      2.745  -3.661 0.000269 ***
data1$age5059   -9.675      2.742  -3.528 0.000443 ***
data1$age60     -3.849      3.170  -1.214 0.225053
data1$female     2.579      2.057   1.254 0.210340
data1$black      9.544      5.039   1.894 0.058584 .
data1$abmm       4.055      1.125   3.603 0.000335 ***
data1$ssi       12.210      2.102   5.809 9.21e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Is the relationship linear?


37

38

Write-up

•  Here is how to describe regression results in your paper:

“The reference patient, which is a white male infant with no surgical site infection and zero HLA A or B mismatches, stayed on average 22.5 days. After controlling for other factors, patients who developed a surgical site infection stayed on average nearly 12 days longer than patients without infections (p<0.0001).”

39


Homework

•  Read the Hollenbeak paper that is posted on the web site

•  Replicate Table V in the paper, but do not include edema or the Karnofsky score
   –  Your coefficients will be different!
   –  Format the table according to the template on Slides 37 and 38

40