Correlation and Linear Regression


Page 1: Correlation and Linear Regression

2/28/2015 Correlation and Linear Regression

http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Correlation-Regression/BS704_Correlation-Regression_print.html 1/10

Correlation and Linear Regression

Introduction

In this section we discuss correlation analysis, which is a technique used to quantify the associations between two continuous variables. For example, we might want to quantify the association between body mass index and systolic blood pressure, or between hours of exercise per week and percent body fat. Regression analysis is a related technique to assess the relationship between an outcome variable and one or more risk factors or confounding variables (confounding is discussed later). The outcome variable is also called the response or dependent variable, and the risk factors and confounders are called the predictors, or explanatory or independent variables. In regression analysis, the dependent variable is denoted "Y" and the independent variables are denoted by "X".

[NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a situation in which inferences should be limited to identifying associations. The terms "independent" and "dependent" variable are less subject to these interpretations as they do not strongly imply cause and effect.]

Learning Objectives

After completing this module, the student will be able to:

1. Define and provide examples of dependent and independent variables in a study of a public health problem
2. Compute and interpret a correlation coefficient
3. Compute and interpret coefficients in a linear regression analysis

Correlation Analysis

In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson Product Moment correlation coefficient. The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other).

The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association.

For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.

It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient does not detect this. Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical displays are particularly useful to explore associations between variables.
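The point above can be demonstrated with a short sketch in Python, using made-up data: a perfectly deterministic but quadratic relationship yields a Pearson correlation of exactly zero, which is why plotting the data first matters.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient; the (n - 1) divisors cancel out."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / (var_x * var_y) ** 0.5

# y is completely determined by x, yet r = 0: the association is
# strong but not linear, so the correlation coefficient misses it.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]
print(pearson_r(x, y))   # 0.0
```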

The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.


Scenario 1 depicts a strong positive association (r = 0.9), similar to what we might see for the correlation between infant birth weight and birth length.

Scenario 2 depicts a weaker association (r = 0.2) that we might expect to see between age and body mass index (which tends to increase with age).

Scenario 3 might depict the lack of association (r approximately 0) between the extent of media exposure in adolescence and the age at which adolescents initiate sexual activity.

Scenario 4 might depict the strong negative association (r = -0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.

Example - Correlation of Gestational Age and Birth Weight

A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.


We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus y = birth weight and x = gestational age. The data are displayed in a scatter diagram in the figure below.

Each point represents an (x,y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that the independent variable is on the horizontal axis (or X-axis), and the dependent variable is on the vertical axis (or Y-axis). The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.

The formula for the sample correlation coefficient is

r = Cov(x,y) / √(sx² · sy²)

where Cov(x,y) is the covariance of x and y, defined as

Cov(x,y) = Σ(x - x̄)(y - ȳ) / (n - 1)

and sx² and sy² are the sample variances of x and y, defined as

sx² = Σ(x - x̄)² / (n - 1)   and   sy² = Σ(y - ȳ)² / (n - 1)

The variances of x and y measure the variability of the x scores and y scores around their respective sample means (x̄ and ȳ, considered separately). The covariance measures the variability of the (x,y) pairs around the mean of x and mean of y, considered simultaneously.

To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight.
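The computation described above can be sketched in Python. The numbers below are made-up illustrative values, not the data from the 17-infant study (which appeared in tables in the original figures):

```python
x = [34, 36, 38, 40, 41]            # gestational age in weeks (illustrative)
y = [2100, 2500, 2900, 3200, 3400]  # birth weight in grams (illustrative)

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample variances and covariance, each with divisor (n - 1)
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Sample correlation coefficient: covariance scaled by both standard deviations
r = cov_xy / (var_x * var_y) ** 0.5
print(round(r, 3))
```

For this made-up sample the correlation comes out very close to +1, as the steadily increasing pairs suggest.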

We first summarize the gestational age data. The mean gestational age is:

To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.

The variance of gestational age is:

Next, we summarize the birth weight data. The mean birth weight is:

The variance of birth weight is computed just as we did for gestational age as shown in the table below.


The variance of birth weight is:

Next we compute the covariance,

To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, i.e., the product (x - x̄)(y - ȳ).

The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply.

The covariance of gestational age and birth weight is:

We now compute the sample correlation coefficient:

Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.


As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller.1

Regression Analysis

Regression analysis is a widely used technique which is useful for many applications. We introduce the technique here and expand on its uses in subsequent modules.

Simple Linear Regression

Simple linear regression is a technique that is appropriate to understand the association between one independent (or predictor) variable and one continuous dependent (or outcome) variable. For example, suppose we want to assess the association between total cholesterol (in milligrams per deciliter, mg/dL) and body mass index (BMI, measured as the ratio of weight in kilograms to height in meters²) where total cholesterol is the dependent variable, and BMI is the independent variable. In regression analysis, the dependent variable is denoted Y and the independent variable is denoted X. So, in this case, Y = total cholesterol and X = BMI.

When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression analysis. This analysis assumes that there is a linear association between the two variables. (If a different relationship is hypothesized, such as a curvilinear or exponential relationship, alternative regression analyses are performed.)

The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol. Each point represents the observed (x, y) pair, in this case, BMI and the corresponding total cholesterol measured in each participant. Note that the independent variable is on the horizontal axis and the dependent variable on the vertical axis.

BMI and Total Cholesterol

The graph shows that there is a positive or direct association between BMI and total cholesterol; participants with lower BMI are more likely to have lower total cholesterol levels and participants with higher BMI are more likely to have higher total cholesterol levels. In contrast, suppose we examine the association between BMI and HDL cholesterol.

The graph below depicts the relationship between BMI and HDL cholesterol in the same sample of n=20 participants.

BMI and HDL Cholesterol

This graph shows a negative or inverse association between BMI and HDL cholesterol, i.e., those with lower BMI are more likely to have higher HDL cholesterol levels and those with higher BMI are more likely to have lower HDL cholesterol levels.

For either of these relationships we could use simple linear regression analysis to estimate the equation of the line that best describes the association between the independent variable and the dependent variable. The simple linear regression equation is as follows:


Ŷ = b0 + b1X

where Ŷ is the predicted or expected value of the outcome, X is the predictor, b0 is the estimated Y-intercept, and b1 is the estimated slope. The Y-intercept and slope are estimated from the sample data so as to minimize the sum of the squared differences between the observed and the predicted values of the outcome, i.e., the estimates minimize:

Σ(y - Ŷ)²

These differences between observed and predicted values of the outcome are called residuals. The estimates of the Y-intercept and slope minimize the sum of the squared residuals, and are called the least squares estimates.1

Residuals

Conceptually, if the values of X provided a perfect prediction of Y, then the sum of the squared differences between observed and predicted values of Y would be 0. That would mean that variability in Y could be completely explained by differences in X. However, if the differences between observed and predicted values are not 0, then we are unable to entirely account for differences in Y based on X, and there are residual errors in the prediction. The residual error could result from inaccurate measurements of X or Y, or there could be other variables besides X that affect the value of Y.

Based on the observed data, the best estimate of a linear relationship will be obtained from an equation for the line that minimizes the differences between observed and predicted values of the outcome. The Y-intercept of this line is the value of the dependent variable (Y) when the independent variable (X) is zero. The slope of the line is the change in the dependent variable (Y) relative to a one-unit change in the independent variable (X). The least squares estimates of the Y-intercept and slope are computed as follows:

b1 = r(Sy / Sx)   and   b0 = ȳ - b1x̄

where r is the sample correlation coefficient, the sample means are x̄ and ȳ, and Sx and Sy are the standard deviations of the independent variable x and the dependent variable y, respectively.
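These formulas can be sketched in Python. The data here are hypothetical, not the BMI/cholesterol sample; note that the result agrees with the direct least squares slope Σ(x - x̄)(y - ȳ) / Σ(x - x̄)².

```python
x = [20, 22, 25, 28, 30, 32]        # hypothetical BMI values
y = [160, 170, 188, 205, 218, 230]  # hypothetical total cholesterol (mg/dL)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sample standard deviations and Pearson r
Sx = (sum((xi - mean_x) ** 2 for xi in x) / (n - 1)) ** 0.5
Sy = (sum((yi - mean_y) ** 2 for yi in y) / (n - 1)) ** 0.5
r = sum((xi - mean_x) * (yi - mean_y)
        for xi, yi in zip(x, y)) / ((n - 1) * Sx * Sy)

# Least squares estimates via the formulas above
b1 = r * (Sy / Sx)
b0 = mean_y - b1 * mean_x
print(round(b0, 2), round(b1, 2))
```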

BMI and Total Cholesterol

The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and total cholesterol are b0 = 28.07 and b1 = 6.49. These are computed as follows:

The estimate of the Y-intercept (b0 = 28.07) represents the estimated total cholesterol level when BMI is zero. Because a BMI of zero is meaningless, the Y-intercept is not informative. The estimate of the slope (b1 = 6.49) represents the change in total cholesterol relative to a one-unit change in BMI. For example, if we compare two participants whose BMIs differ by 1 unit, we would expect their total cholesterols to differ by approximately 6.49 units (with the person with the higher BMI having the higher total cholesterol).

The equation of the regression line is as follows:

Ŷ = 28.07 + 6.49X

The graph below shows the estimated regression line superimposed on the scatter diagram.

The regression equation can be used to estimate a participant's total cholesterol as a function of his/her BMI. For example, suppose a participant has a BMI of 25. We would estimate their total cholesterol to be 28.07 + 6.49(25) = 190.32. The equation can also be used to estimate total cholesterol for other values of BMI. However, the equation should only be used to estimate cholesterol levels for persons whose BMIs are in the range of the data used to generate the regression equation. In our sample, BMI ranges from 20 to 32, thus the equation should only be used to generate estimates of total cholesterol for persons with BMI in that range.
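That calculation, together with the range restriction, can be captured in a short sketch; the guard against extrapolation is our addition, reflecting the caution in the text.

```python
def predict_total_cholesterol(bmi):
    """Predicted total cholesterol from the fitted line Y-hat = 28.07 + 6.49*BMI.
    Refuses to extrapolate beyond the observed BMI range (20 to 32)."""
    if not 20 <= bmi <= 32:
        raise ValueError("BMI outside the range of the observed data (20-32)")
    return 28.07 + 6.49 * bmi

print(round(predict_total_cholesterol(25), 2))   # 190.32, as in the text
```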

There are statistical tests that can be performed to assess whether the estimated regression coefficients (b0 and b1) are statistically significantly different from zero. The test of most interest is usually H0: b1 = 0 versus H1: b1 ≠ 0, where b1 is the population slope. If the population slope is significantly different from zero, we conclude that there is a statistically significant association between the independent and dependent variables.

BMI and HDL Cholesterol

The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and HDL cholesterol are as follows: b0 = 111.77 and b1 = -2.35. These are computed as follows:

Again, the Y-intercept is uninformative because a BMI of zero is meaningless. The estimate of the slope (b1 = -2.35) represents the change in HDL cholesterol relative to a one-unit change in BMI. If we compare two participants whose BMIs differ by 1 unit, we would expect their HDL cholesterols to differ by approximately 2.35 units (with the person with the higher BMI having the lower HDL cholesterol). The figure below shows the regression line superimposed on the scatter diagram for BMI and HDL cholesterol.

Linear regression analysis rests on the assumption that the dependent variable is continuous and that the distribution of the dependent variable (Y) at each value of the independent variable (X) is approximately normally distributed. Note, however, that the independent variable can be continuous (e.g., BMI) or can be dichotomous (see below).

Comparing Mean HDL Levels With Regression Analysis

Consider a clinical trial to evaluate the efficacy of a new drug to increase HDL cholesterol. We could compare the mean HDL levels between treatment groups statistically using a two independent samples t test. Here we consider an alternate approach. Summary data for the trial are shown below:


             Sample Size    Mean HDL    Standard Deviation of HDL
New Drug         50          40.16               4.46
Placebo          50          39.21               3.91

HDL cholesterol is the continuous dependent variable and treatment assignment (new drug versus placebo) is the independent variable. Suppose the data on n=100 participants are entered into a statistical computing package. The outcome (Y) is HDL cholesterol in mg/dL and the independent variable (X) is treatment assignment. For this analysis, X is coded as 1 for participants who received the new drug and as 0 for participants who received the placebo. A simple linear regression equation is estimated as follows:

Ŷ = 39.21 + 0.95X,

where Ŷ is the estimated HDL level and X is a dichotomous variable (also called an indicator variable, in this case indicating whether the participant was assigned to the new drug or to placebo). The estimate of the Y-intercept is b0 = 39.21. The Y-intercept is the value of Y (HDL cholesterol) when X is zero. In this example, X = 0 indicates assignment to the placebo group. Thus, the Y-intercept is exactly equal to the mean HDL level in the placebo group. The slope is estimated as b1 = 0.95. The slope represents the estimated change in Y (HDL cholesterol) relative to a one-unit change in X. A one-unit change in X represents a difference in treatment assignment (placebo versus new drug). The slope represents the difference in mean HDL levels between the treatment groups. Thus, the mean HDL for participants receiving the new drug is:

Ŷ = 39.21 + 0.95(1) = 40.16
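This equivalence is general: with a 0/1 indicator as X, the least squares intercept is always the mean of the X = 0 group, and the slope is the difference in group means. A small sketch with made-up HDL values (not the trial data) demonstrates it:

```python
placebo = [38.0, 39.5, 40.0, 39.3]   # X = 0 group (illustrative values)
drug = [40.0, 41.2, 39.8, 40.6]      # X = 1 group (illustrative values)

x = [0] * len(placebo) + [1] * len(drug)
y = placebo + drug

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares slope and intercept
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

# b0 equals the placebo mean; b0 + b1 equals the drug-group mean
print(round(b0, 3), round(sum(placebo) / len(placebo), 3))
print(round(b0 + b1, 3), round(sum(drug) / len(drug), 3))
```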


A study was conducted to assess the association between a person's intelligence and the size of their brain. Participants completed a standardized IQ test and researchers used Magnetic Resonance Imaging (MRI) to determine brain size. Demographic information, including the patient's gender, was also recorded.

The Controversy Over Environmental Tobacco Smoke Exposure

There is convincing evidence that active smoking is a cause of lung cancer and heart disease. Many studies done in a wide variety of circumstances have consistently demonstrated a strong association and also indicate that the risk of lung cancer and cardiovascular disease (i.e., heart attacks) increases in a dose-related way. These studies have led to the conclusion that active smoking is causally related to lung cancer and cardiovascular disease. Studies in active smokers have had the advantage that the lifetime exposure to tobacco smoke can be quantified with reasonable accuracy, since the unit dose is consistent (one cigarette) and the habitual nature of tobacco smoking makes it possible for most smokers to provide a reasonable estimate of their total lifetime exposure quantified in terms of cigarettes per day or packs per day. Frequently, average daily exposure (cigarettes or packs) is combined with duration of use in years in order to quantify exposure as "pack-years".

It has been much more difficult to establish whether environmental tobacco smoke (ETS) exposure is causally related to chronic diseases like heart disease and lung cancer, because the total lifetime exposure dosage is lower, and it is much more difficult to accurately estimate total lifetime exposure. In addition, quantifying these risks is also complicated because of confounding factors. For example, ETS exposure is usually classified based on parental or spousal smoking, but these studies are unable to quantify other environmental exposures to tobacco smoke, and inability to quantify and adjust for other environmental exposures such as air pollution makes it difficult to demonstrate an association even if one existed. As a result, there continues to be controversy over the risk imposed by environmental tobacco smoke (ETS). Some have gone so far as to claim that even very brief exposure to ETS can cause a myocardial infarction (heart attack), but a very large prospective cohort study by Enstrom and Kabat was unable to demonstrate significant associations between exposure to spousal ETS and coronary heart disease, chronic obstructive pulmonary disease, or lung cancer. (It should be noted, however, that the report by Enstrom and Kabat has been widely criticized for methodological problems, and these authors also had financial ties to the tobacco industry.)

Correlation analysis provides a useful tool for thinking about this controversy. Consider data from the British Doctors Cohort. They reported the annual mortality for a variety of diseases at four levels of cigarette smoking per day: never smoked, 1-14/day, 15-24/day, and 25+/day. In order to perform a correlation analysis, I rounded the exposure levels to 0, 10, 20, and 30 respectively.

Cigarettes Smoked Per Day    CVD Mortality/100,000 men/yr.    Lung Cancer Mortality/100,000 men/yr.
0                                      572                                  14
10 (actually 1-14)                     802                                 105
20 (actually 15-24)                    892                                 208
30 (actually >24)                     1025                                 355

The figures below show the two estimated regression lines superimposed on the scatter diagram. The correlation with amount of smoking was strong for both CVD mortality (r = 0.98) and for lung cancer (r = 0.99). Note also that the Y-intercept is a meaningful number here; it represents the predicted annual death rate from these diseases in individuals who never smoked. The Y-intercept for prediction of CVD is slightly higher than the observed rate in never smokers, while the Y-intercept for lung cancer is lower than the observed rate in never smokers.

The linearity of these relationships suggests that there is an incremental risk with each additional cigarette smoked per day, and the additional risk is estimated by the slopes. This perhaps helps us think about the consequences of ETS exposure. For example, the risk of lung cancer in never smokers is quite low, but there is a finite risk; various reports suggest a risk of 10-15 lung cancers/100,000 per year. If an individual who never smoked actively was exposed to the equivalent of one cigarette's smoke in the form of ETS, then the regression suggests that their risk would increase by 11.26 lung cancer deaths per 100,000 per year. However, the risk is clearly dose-related. Therefore, if a non-smoker was employed by a tavern with heavy levels of ETS, the risk might be substantially greater.
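The slope quoted above can be reproduced directly from the rounded table with a least squares fit in Python:

```python
cigs = [0, 10, 20, 30]       # cigarettes smoked per day (rounded levels)
lung = [14, 105, 208, 355]   # lung cancer mortality per 100,000 men per year

n = len(cigs)
mean_x, mean_y = sum(cigs) / n, sum(lung) / n

# Least squares slope and intercept
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(cigs, lung))
      / sum((x - mean_x) ** 2 for x in cigs))
b0 = mean_y - b1 * mean_x

print(round(b1, 2))   # 11.26 extra deaths/100,000/yr per cigarette/day
print(round(b0, 1))   # 1.6 -- below the observed never-smoker rate of 14
```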


Finally, it should be noted that some findings suggest that the association between smoking and heart disease is non-linear at the very lowest exposure levels, meaning that non-smokers have a disproportionate increase in risk when exposed to ETS due to an increase in platelet aggregation.

Summary

Correlation and linear regression analysis are statistical techniques to quantify associations between an independent, sometimes called a predictor, variable (X) and a continuous dependent outcome variable (Y). For correlation analysis, the independent variable (X) can be continuous (e.g., gestational age) or ordinal (e.g., increasing categories of cigarettes per day). Regression analysis can also accommodate dichotomous independent variables.

The procedures described here assume that the association between the independent and dependent variables is linear. With some adjustments, regression analysis can also be used to estimate associations that follow another functional form (e.g., curvilinear, quadratic). Here we consider associations between one independent variable and one continuous dependent variable. The regression analysis is called simple linear regression; "simple" in this case refers to the fact that there is a single independent variable. In the next module, we consider regression analysis with several independent variables, or predictors, considered simultaneously.