
    Module 3: Multiple Regression Concepts

Fiona Steele¹

    Centre for Multilevel Modelling

Contents

Introduction
  What is Multiple Regression?
  Motivation
  Conditioning
  Data for multiple regression analysis
  Introduction to Dataset
C3.1 Regression with a Single Continuous Explanatory Variable
  C3.1.1 Examining data graphically
  C3.1.2 The linear regression model
  C3.1.3 The fitted regression line
  C3.1.4 Explained and unexplained variance and R-squared
  C3.1.5 Hypothesis testing
  C3.1.6 Model checking
C3.2 Comparing Groups: Regression with a Single Categorical Explanatory Variable
  C3.2.1 Comparing two groups
  C3.2.2 Comparing more than two groups
  C3.2.3 Comparing a large number of groups
C3.3 Regression with More than One Explanatory Variable (Multiple Regression)
  C3.3.1 Statistical control
  C3.3.2 The multiple regression model
  C3.3.3 Using multiple regression to model a non-linear relationship
  C3.3.4 Adding further predictors
C3.4 Interaction Effects
  C3.4.1 Model with fixed slopes across groups
  C3.4.2 Fitting separate models for each group
  C3.4.3 Allowing for varying slopes in a pooled analysis: interaction effects
  C3.4.4 Testing for interaction effects
  C3.4.5 Another example: allowing age effects to be different in different countries
C3.5 Checking Model Assumptions in Multiple Regression
  C3.5.1 Checking the normality assumption
  C3.5.2 Checking the homoskedasticity assumption
  C3.5.3 Outliers

¹ With additional material from Kelvyn Jones. Comments from Sacha Brostoff, Jon Rasbash and Rebecca Pillinger on an earlier draft are gratefully acknowledged.


All of the sections within this module have online quizzes for you to test your understanding. To find the quizzes:

EXAMPLE: From within the LEMMA learning environment
• Go down to the section for Module 3: Multiple Regression
• Click "3.1 Regression with a Single Continuous Explanatory Variable" to open Lesson 3.1
• Click "Q 1" to open the first question

All of the sections within this module have practicals so you can learn how to perform this kind of analysis in MLwiN or other software packages. To find the practicals:

EXAMPLE: From within the LEMMA learning environment
• Go down to the section for Module 3: Multiple Regression, then
• Either click "3.1 Regression with a Single Continuous Explanatory Variable" to open Lesson 3.1 and click the practical for that lesson,
• Or click "Print all Module 3 MLwiN Practicals"

Pre-requisites
• Understanding of types of variables (continuous vs. categorical; dependent and explanatory); covered in Module 1
• Correlation between variables
• Confidence intervals around estimates
• Hypothesis testing, p-values
• Independent samples t-test for comparing the means of two groups

Online resources:
http://www.sportsci.org/resource/stats/
http://www.socialresearchmethods.net/
http://www.animatedsoftware.com/statglos/statglos.htm
http://davidmlane.com/hyperstat/index.html


Introduction

What is Multiple Regression?
Multiple regression is a technique used to study the relationship between an outcome variable and a set of explanatory or predictor variables.

Motivation
To illustrate the ideas of multiple regression, we will consider a research problem of assessing the evidence for gender discrimination in legal firms. Statistical modelling can provide the following:
• A quantitative assessment of the size of the effect, e.g. the difference in salary between women and men is 5000 per annum;
• A quantitative assessment after taking account of other variables, e.g. a female worker earns 6500 less after taking account of years of experience. This conditioning on other variables distinguishes multiple regression modelling from simple tests for differences;
• A measure of uncertainty for the size of the effect, e.g. we can be 95% confident that the female-male difference in salary in the population from which our sample was drawn lies between 4500 and 5500.

We can use regression modelling in different modes: 1) as description (what is the average salary for men and women?), 2) as part of causal inference (does being female result in a lower salary?), and 3) for prediction ("what happens if" questions).

Conditioning
The key feature that distinguishes multiple regression from simple regression is that more than one predictor variable is involved. Even if we are interested in the effect of just one variable (gender) on another (salary), we need to take account of other variables as they may compromise the results. We can recognise three distinct cases where it is important to control or adjust for the effects of other variables:

i) Inflation of a relationship when extraneous variables are not taken into account. For example, a substantial gender effect could be reduced after taking account of type of employment. This is because jobs that are characterized by poor pay (e.g. in the service sector) have a predominantly female labour force.

ii) Suppression of a relationship. An apparently small gender gap could increase when account is taken of years of employment, women having longer service and poorer pay.


    iii) No confounding. The original relationship remains substantially unaltered when account is taken of other variables. Note, however, that there may be unmeasured confounders.

Data for multiple regression analysis
Statistical analysis requires a quantifiable outcome measure (dependent variable) to assess the effects of discrimination. Possibilities include the following, differentiated by the nature of the measurement:
• a continuous measure of salary;
• a binary indicator of whether an employee was promoted or not;
• a three-category indicator of promotion (promoted, not promoted, not even considered);
• a count of the number of times rejected for promotion;
• the length of time that it has taken to gain promotion.

All of these outcomes can be analysed using regression analysis, but different techniques are required for different scales of measurement. The term multiple regression is usually applied when the dependent variable is measured on a continuous scale. A dichotomous dependent variable can be analysed using logistic regression, and multinomial logistic and ordinal regression can be applied to nominal and ordinal dependent variables respectively. There are also methods for handling counts (Poisson regression) and time-to-event data (event history analysis or survival analysis). These techniques will be described in later modules.

The explanatory variables may also have different scales of measurement. For example, gender is a binary categorical variable; ethnicity is categorical with more than two categories; education might be measured on an ordinal scale (e.g. grouped years of education); years of employment could be measured on a continuous scale. Multiple regression can handle all of these types of explanatory variable, and we will consider examples of both continuous and categorical variables in this module.

Introduction to Dataset
The ideas of multiple regression will be introduced using data from the 2002 European Social Survey (ESS). Measures of ten human values have been constructed for 20 countries in the European Union. According to value theory, values are defined as desirable, trans-situational goals that serve as guiding principles in people's lives. Further details on value theory and how it is operationalised in the ESS can be found on the ESS education net (http://essedunet.nsd.uib.no/cms/topics/1/). We will study one of the ten values, hedonism, defined as "pleasure and sensuous gratification for oneself". The measure we use is based on responses to the question "How much like you is this person?":


"He (sic) seeks every chance he can to have fun. It is important to him to do things that give him pleasure."

"Having a good time is important to him. He likes to spoil himself."

A respondent's own values are inferred from their self-reported similarity to a person with the above descriptions. Each of the two items is rated on a 6-point scale (from "very much like me" to "not like me at all"). The mean of these ratings is calculated for each individual. The mean of the two hedonism items is then adjusted for individual differences in scale use² by subtracting the mean of all value items (a total of 21 items are used to measure the 10 values). These centred scores recognise that the 10 values function as a system rather than independently. The centred hedonism score is interpreted as a measure of the relative importance of hedonism to an individual within their whole value system. The scores on the hedonism variable range from -3.76 to 2.90, where higher scores indicate more hedonistic beliefs.

We consider three countries (France, Germany and the UK) with a total sample size of 5845; that is, we use a subsample of the original data. Hedonism is taken as the outcome variable in our analysis. We consider four explanatory variables:
• Age in years
• Gender (coded 0 for male and 1 for female)
• Country (coded 1 for the UK, 2 for Germany and 3 for France)
• Years of education

An extract of the data is given below.

Respondent   Hedonism   Age   Gender   Country   Education
1            1.55       25    0        2         10
2            0.76       30    0        2         11
3            -0.26      59    0        2         9
4            -1.00      47    1        3         10
...          ...        ...   ...      ...       ...
5845         0.74       65    0        1         9

² Some individuals will tend to select responses from one side of the scale ("very much like me") for any item, while others will select from the other side ("not like me at all"). If we ignore these differences in response tendency, we might incorrectly infer that the first type of individual believes that all values are important, while the second believes that all values are unimportant.
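To make the scoring procedure concrete, the following is a minimal Python sketch of how a centred hedonism score could be computed. It uses made-up ratings rather than the ESS data, and the assumption that the first two columns hold the two hedonism items is purely illustrative.

```python
# Sketch of the centred hedonism score: mean of the two hedonism items,
# adjusted for each respondent's overall use of the rating scale.
import numpy as np

rng = np.random.default_rng(1)
n = 5                                                     # a few illustrative respondents
items = rng.integers(1, 7, size=(n, 21)).astype(float)    # 21 value items on a 6-point scale
hedonism_items = items[:, :2]                             # suppose columns 0-1 are the hedonism items

raw_hedonism = hedonism_items.mean(axis=1)                # mean of the two hedonism ratings
scale_use = items.mean(axis=1)                            # respondent's mean over all 21 items
centred_hedonism = raw_hedonism - scale_use               # adjust for individual scale use
print(centred_hedonism)
```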


    C3.1 Regression with a Single Continuous Explanatory Variable

We will begin with a description of simple linear regression for studying the relationship between a pair of continuous variables, which we denote by Y and X. Simple regression is also commonly known as bivariate regression because only two variables are involved:
• Y is the outcome variable (also called a response or dependent variable);
• X is the explanatory variable (also called a predictor or independent variable).

C3.1.1 Examining data graphically

Before carrying out a regression analysis, it is important to look at your data first. There are various assumptions made when we fit a regression model, which we will consider later, but there are two checks that should always be carried out before fitting any models: i) examine the distribution of the variables and check that the values are all valid, and ii) look at the nature of the relationship between X and Y.

Distribution of Y
We can examine the distribution of a continuous variable using a histogram. At this stage, we are checking that the values appear reasonable. Are there any outliers, i.e. observations outside the general pattern? Are there any values of -99 in the data that should be declared as missing values? We also look at the shape of the distribution: is it a symmetrical bell-shaped distribution (normal), or is it skewed? Although it is the residuals³ that are assumed to be normally distributed in a multiple regression model, rather than the dependent variable, a skewed Y will often produce skewed residuals. If the residuals turn out to be non-normal, it may be possible to transform Y to obtain a normally distributed variable. For example, a positively skewed distribution (with a long tail to the right) will often look more symmetrical after taking logarithms.

Figure 3.1 shows the distribution of the hedonism scores. It appears approximately normal with no obvious outliers. The mean of the hedonism score is -0.15 and the standard deviation is 0.97.

Distribution of X
For a regression analysis the distribution of the explanatory variable is unimportant, but it is sensible to look at descriptive statistics for any variables that we analyse to check for unusual values.

³ The residual for each observation is the difference between the observed value of Y and the value of Y predicted by the model. See C3.1.2 for further details.


We will first consider age as an explanatory variable for hedonism. The age range in our sample is 14 to 98 years, with a mean of 46.7 and standard deviation of 18.1.

Relationship between X and Y
In its simplest form, a regression analysis assumes that the relationship between X and Y is linear, i.e. that it can be reasonably approximated by a straight line. If the relationship is non-linear, it may be possible to transform one of the variables to make the relationship linear, or the regression model can be modified (see C3.3.3). The relationship between two variables can be viewed in a scatterplot. A scatterplot can also reveal outliers.

    Figure 3.1. Histogram of hedonism

Figure 3.2 shows a scatterplot of hedonism versus age, where the size of the plotting symbol is proportional to the number of respondents represented by a particular data point. Also shown is what is commonly called the line of best fit, which we will come back to in a moment. The scatterplot shows a negative relationship: as age increases, hedonism decreases. The Pearson correlation coefficient for the linear relationship is -0.34.


    Figure 3.2. Plot of hedonism by age
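These graphical checks can be scripted. Below is a minimal Python sketch, using simulated stand-ins for the hedonism and age variables rather than the actual ESS data: it draws the histogram of Y, the scatterplot of Y against X, and computes the Pearson correlation.

```python
# Sketch of the pre-modelling checks: distribution of Y, relationship of Y to X.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
age = rng.uniform(14, 98, 500)                              # simulated ages
hedonism = 0.712 - 0.018 * age + rng.normal(0, 0.92, 500)   # mimics the fitted line

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.hist(hedonism, bins=30)                                 # check shape and outliers
ax1.set(title="Histogram of hedonism", xlabel="Hedonism")
ax2.scatter(age, hedonism, s=8, alpha=0.4)                  # check linearity and outliers
ax2.set(title="Hedonism by age", xlabel="Age", ylabel="Hedonism")
print("Pearson r:", np.corrcoef(age, hedonism)[0, 1])
plt.show()
```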

C3.1.2 The linear regression model

In a linear regression analysis, we fit a straight line to the scatterplot of Y against X. The equation of a straight line is traditionally written as

$y = mx + c$    (3.1)

where m is the gradient or slope of the line, and c is the intercept, the point at which the line cuts the Y-axis (i.e. the value of y when x = 0). The gradient is interpreted as the change in y expected for a 1-unit change in x. In statistics, we often refer to m and c as coefficients. A coefficient of a variable is a quantity that multiplies it. The slope m is the coefficient of the predictor x, and the intercept c is the coefficient of a variable which equals 1 for each observation (usually referred to as the constant). Because we will soon be adding more explanatory variables (Xs), it is convenient to use a more general notation with coefficients represented by Greek betas ($\beta$). Thus (3.1) becomes

$y = \beta_0 + \beta_1 x$    (3.2)


so that the intercept is now denoted by $\beta_0$ and the slope by $\beta_1$. The subscripts on the $\beta$s indicate the variable to which each coefficient is attached. We could have written (3.2) as $y = \beta_0 x_0 + \beta_1 x_1$, where $x_0 = 1$ for every observation and $x_1 = x$. Later we will be adding further explanatory variables ($x_2$, $x_3$, etc.) with coefficients $\beta_2$, $\beta_3$, etc.

For a given individual $i$ ($i = 1, 2, 3, \ldots, n$), we denote their value on Y by $y_i$ and their value on X by $x_i$. (Note that when we consider more than one explanatory variable, we will introduce a second subscript to index the variable. For example, $x_{2i}$ will denote the value on variable $x_2$ for individual $i$.) For individual $i$, the linear relationship between Y and X may be expressed as:

$y_i = \beta_0 + \beta_1 x_i + e_i$    (3.3)

$e_i$ is called the residual and is the difference between the $i$th individual's actual y-value and that predicted by their x-value. We know that we cannot perfectly predict an individual's value on Y from their value on X; the points in a scatterplot of x and y will never lie perfectly on a straight line (see Figure 3.2, for example). The residuals represent the (vertical) scatter of points about the regression line.

Equation (3.3) is called the linear regression model. $\beta_0$ and $\beta_1$ are the intercept and slope of the regression line in the population from which our sample was drawn, and $e_i$ is the difference between an individual's y-value and the value of y predicted by the population regression line. We estimate these quantities using the sample data. Quantities such as $\beta_0$ and $\beta_1$ that relate to the population are called parameters. Parameters are very often represented by Greek letters in statistics⁴.

We make the following assumptions about the residuals $e_i$:

i) The residuals are normally distributed with zero mean and variance $\sigma^2$ (spoken as "sigma-squared"). This assumption is often written in shorthand as $e_i \sim N(0, \sigma^2)$.

ii) The variance of the residuals is constant, whatever the value of x. This means that if we take a slice through the scatterplot of y versus x at any particular value of x, the y-values have approximately the same variation as at any other value of x. If the variance is constant, we say the residuals are homoskedastic; otherwise they are said to be heteroskedastic.

⁴ Other examples of parameters are the population mean $\mu$ (mu) and the population variance $\sigma^2$ (sigma-squared).


iii) The residuals are not correlated with one another, i.e. they are independent. Correlations might arise if some individuals contribute more than one observation (e.g. repeated measures) or if individuals are clustered in some way (e.g. in schools). If it is suspected that residuals are correlated, the regression model needs to be modified, e.g. to a multilevel model (see Module 5).

If these assumptions are not met, the estimate of $\beta_0$, and more importantly $\beta_1$, may be biased and imprecise.

C3.1.3 The fitted regression line

In linear regression analysis, $\beta_0$ and $\beta_1$ are estimated from the data using a method called least squares, in which the sum of the squared residuals is minimized⁵. (Responses with other scales of measurement require other techniques, but all of them are based on the same underlying principle of minimizing the poorness of fit between the actual data points and the fitted model.) By applying the method of least squares to our sample data, we obtain an estimate of the underlying population value of the intercept and of the slope. These estimates are denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$ (spoken as "beta-0-hat" and "beta-1-hat"). The predicted value of y for individual $i$ is denoted by $\hat{y}_i$ and is calculated as:

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$    (3.4)

Equation (3.4) is the equation of the estimated or fitted regression line. The predicted value $\hat{y}_i$ is the point on the fitted line corresponding to $x_i$. If we regress hedonism on age we obtain $\hat{\beta}_0 = 0.712$ and $\hat{\beta}_1 = -0.018$, and the fitted regression line is written (substituting HED for y and AGE for x) as:

$\widehat{HED}_i = 0.712 - 0.018\,AGE_i$

The slope estimate tells us that for every extra year of age, hedonism is predicted to decrease by 0.018. Importantly, the decrease in hedonism expected for an increase from 14 to 15 years old is the same as for an increase from 54 to 55 years old. This is a direct consequence of assuming that the underlying functional form of the model is linear and fitting a linear equation.

⁵ A description of least squares can be found at http://mathforum.org/dynamic/java_gsp/squares.html


We can use the fitted line to predict an individual's hedonism based on their age. So, for example, for an individual of age 25 we would predict a hedonism score of

$0.712 - (0.018 \times 25) = 0.262.$

In contrast, we would predict a score of -0.188 for someone of age 50. The regression line is the line of best fit shown in Figure 3.2. Most statistical packages will report the results of a regression analysis in tabular form, e.g. as in Table 3.1.

Table 3.1. Results from a simple regression of hedonism on age

           Coefficient
Constant   0.712
Age        -0.018
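As an illustration of the least-squares calculation, here is a minimal Python sketch. The data are simulated to mimic the fitted line above, so the estimates only approximate Table 3.1; the module's own analyses were run in MLwiN.

```python
# Sketch of fitting the simple regression of hedonism on age by least squares.
import numpy as np

rng = np.random.default_rng(42)
age = rng.uniform(14, 98, 5845)
hed = 0.712 - 0.018 * age + rng.normal(0, 0.92, len(age))

X = np.column_stack([np.ones_like(age), age])     # constant + AGE
beta, *_ = np.linalg.lstsq(X, hed, rcond=None)    # minimises the sum of squared residuals
b0, b1 = beta
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print("prediction at age 25:", b0 + b1 * 25)      # about 0.262 in the module
print("prediction at age 50:", b0 + b1 * 50)      # about -0.188
```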

Interpretation of the intercept and slope estimates

$\hat{\beta}_0 = 0.712$ is the predicted value of Y when X = 0. So we would expect someone of age zero to have a hedonism score of 0.712. Because the minimum age in the sample is 14, this is not very informative.

$\hat{\beta}_1 = -0.018$ is the predicted change in Y for a 1-unit change in X. So we expect a decrease of 0.018 in the hedonism score for each 1-year increase in age.

Centring
Continuous variables are often centred about the mean so that the intercept has a more meaningful interpretation. For example, we would centre the variable AGE by subtracting the sample mean of 46 years from each of its values. If we then repeat the regression analysis replacing AGE by AGE-46, the intercept becomes the predicted value of Y when AGE-46 = 0, i.e. when AGE = 46. Rather than a prediction for a baby of 0 years, which is well outside the age range in the sample, the intercept now gives a prediction for a 46-year-old adult. The intercept in the analysis based on centred AGE is estimated as -0.139, which is the predicted hedonism score for a 46 year old. Centring does not affect the estimate of the slope because only the origin of X has been shifted; its scale (standard deviation) has not changed.

Standardisation and standardised coefficients
Sometimes X is standardised, which involves subtracting the sample mean and then dividing the result by the standard deviation:


$\frac{X - \text{mean of } X}{\text{SD of } X}.$

Standardising a variable forces it to have a mean of zero and a standard deviation of 1, while centring shifts only the origin and leaves the scale unaltered. After standardisation a unit corresponds to one standard deviation, so if X is standardised its slope is interpreted as the change in Y expected for a one standard deviation change in X.

Sometimes standardised coefficients are calculated. In simple regression the standardised coefficient of X is the slope that would be obtained if X and Y had both been standardised, which is equivalent to the Pearson correlation coefficient. The standardised coefficient of X is interpreted as the number of standard deviation units change in Y that we would expect for each standard deviation change in X. While standardised coefficients put each variable on the same scale, and may therefore be useful for comparing the effect of X on Y in different subpopulations, the natural meaning of the X and Y variables is lost. The use and interpretation of standardised coefficients in multiple regression is discussed in C3.3.2.

When age is standardised, the estimated intercept and slope of the regression line are $\hat{\beta}_0 = -0.151$ and $\hat{\beta}_1 = -0.335$. If we also standardise hedonism, we obtain $\hat{\beta}_1 = -0.343$ (now a standardised coefficient), which is equal to the Pearson correlation coefficient given earlier in C3.1.1.

Important note: We cannot claim that there is a causal relationship between X and Y from such a simple model or, indeed, from any regression model applied to observational data. So when interpreting the slope it is better to avoid statements like "a change in X leads to" or "causes an increase in Y". Taking account of other factors would provide stronger evidence of a causal relationship if the original relationship did not change as additional predictors are included in the model.
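The following Python sketch, again on simulated stand-in data, demonstrates the effects of centring and standardising described above: centring changes only the intercept, standardising X rescales the slope, and standardising both X and Y yields the Pearson correlation.

```python
# Sketch of centring and standardising in simple regression.
import numpy as np

def fit(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]   # (intercept, slope)

rng = np.random.default_rng(42)
age = rng.uniform(14, 98, 5845)
hed = 0.712 - 0.018 * age + rng.normal(0, 0.92, len(age))

print("raw:          ", fit(age, hed))
print("centred:      ", fit(age - age.mean(), hed))   # same slope, new intercept
z_age = (age - age.mean()) / age.std()
print("standardised: ", fit(z_age, hed))              # slope per SD of age
z_hed = (hed - hed.mean()) / hed.std()
print("both std.:    ", fit(z_age, z_hed))            # slope equals Pearson r
```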


C3.1.4 Explained and unexplained variance and R-squared

All statistical models have a common form:

Response = Systematic part + Random part

where for a simple regression (3.3) the systematic part is $\beta_0 + \beta_1 x_i$ and the random part is the residual $e_i$. The systematic part gives the average relationship between the response and the predictor(s), while the random part is what is left over (the unexplained part) after taking account of the included predictor(s).

Figure 3.2 displays the values on X and Y for individuals in the sample, and a straight line that we have threaded through the (X, Y) data points to represent the systematic relation between hedonism and age. The line represents the fitted values, e.g. if you are 20 years old you are predicted to have a hedonism score of about 0.35. The term "random" means "allowed to vary" and, in relation to Figure 3.2, the random part is the portion of hedonism that is not accounted for by the underlying average relationship with age. Some people are more and some less hedonistic given their age. The residual is the difference between the actual and predicted hedonism.

In some cases there will be a close fit between the actual and fitted values, e.g. if differences in age explain most of the variability in hedonism. In other cases there may be a lot of noise, e.g. if, for any given age, there is a wide range of hedonism scores. It is helpful to characterise this residual variability. To do so requires us to make some assumptions about the residuals (normality and homoskedasticity; see C3.1.2). Under these assumptions we can summarise the variability in a single statistic, the variance of the residuals $\sigma^2$. We can think of the residual variance as the part of the variance in Y that is unexplained by X. The part of the variance in Y that is explained by X (the systematic part of the model) is called the explained variance in Y. For Figure 3.2 the residual or unexplained variance is 0.84. The total variance in hedonism scores (which is the sum of the explained and unexplained variances) is 0.95, so by subtraction the explained variance is 0.11.

Another key summary statistic is the R-squared ($R^2$) value, which gives the correspondence between the actual and fitted values on a scale between zero (no correspondence) and 1 (complete correspondence). R-squared can also be interpreted as the proportion of the total variance in Y that can be explained by variability in X. For the hedonism data, R-squared = 0.11/0.95 = 0.12, so 12% of the variance in hedonism scores can be explained by age. In the case of simple regression, R-squared is the square of the Pearson correlation coefficient.
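The variance decomposition can be computed directly, as the Python sketch below shows on the same simulated stand-in data; the real-data value reported above is 0.12.

```python
# Sketch of the explained/unexplained variance decomposition behind R-squared.
import numpy as np

rng = np.random.default_rng(42)
age = rng.uniform(14, 98, 5845)
hed = 0.712 - 0.018 * age + rng.normal(0, 0.92, len(age))

X = np.column_stack([np.ones_like(age), age])
beta = np.linalg.lstsq(X, hed, rcond=None)[0]
resid = hed - X @ beta                    # estimated residuals

total_var = hed.var()                     # explained + unexplained
resid_var = resid.var()                   # the unexplained part
r_squared = 1 - resid_var / total_var
print(f"R-squared = {r_squared:.3f}")     # equals corr(age, hed) ** 2 here
```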


C3.1.5 Hypothesis testing

We must bear in mind that the estimates of the intercept and slope are subject to sampling variability, as is any statistic calculated from a sample. While we have established that there is a negative relationship between hedonism and age in our sample, we are really interested in their relationship in the population from which our sample was drawn (the combined populations of France, Germany and the UK). In other words, is the relationship statistically significant, or could we have got such a result by chance?

The null hypothesis (H0) for our test is that there is no relationship between hedonism and age in the population, in which case $\beta_1 = 0$. The alternative hypothesis (HA) is that there is a relationship, i.e. $\beta_1 \neq 0$. The test of a relationship between hedonism and age is based on the estimate of the slope of the relationship and a measure of the precision of this estimate. The standard error is a measure of imprecision, where large values indicate greater uncertainty about the true (population) value. The standard error is inversely related to sample size, so that the precision of the estimate of $\beta_1$ increases as the sample size increases. The standard error also depends on the amount of variability in X and the amount of variance in Y that is unexplained by X (the residual variance): the standard error decreases as the variance in X increases, and increases as the residual variance increases.

In our example, the standard error of $\hat{\beta}_1$ is 0.001, and a 95% confidence interval for $\beta_1$ is therefore

$\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1) = -0.018 \pm 1.96(0.001) = (-0.020, -0.016).$⁶

Zero (the value of $\beta_1$ under H0) is well outside the 95% confidence interval, so we reject the null hypothesis and conclude that the relationship is statistically significant at the 5% level. We can also calculate a confidence interval for the population intercept $\beta_0$, but the slope is of principal interest since it measures the relationship between X and Y. Alternatively, but equivalently, we can calculate the test statistic (often called the Z- or t-ratio)

⁶ -1.96 and +1.96 are the 2.5% and 97.5% points of a standard normal distribution (one with a mean of zero and a standard deviation of one). The middle 95% of the distribution lies between these points.
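The standard error, confidence interval and the Z-ratio introduced next can all be computed from the fitted model. Here is a minimal Python sketch on the same simulated stand-in data; the formulas are the standard least-squares ones, and the numbers will only approximate those quoted for the real sample.

```python
# Sketch of the standard error, 95% CI and Z-ratio for the slope.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
age = rng.uniform(14, 98, 5845)
hed = 0.712 - 0.018 * age + rng.normal(0, 0.92, len(age))

X = np.column_stack([np.ones_like(age), age])
beta = np.linalg.lstsq(X, hed, rcond=None)[0]
resid = hed - X @ beta

n, k = X.shape
sigma2 = resid @ resid / (n - k)            # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance matrix of the estimates
se_b1 = np.sqrt(cov[1, 1])

b1 = beta[1]
print(f"95% CI: ({b1 - 1.96 * se_b1:.4f}, {b1 + 1.96 * se_b1:.4f})")
z = b1 / se_b1
print(f"z = {z:.1f}, p = {2 * norm.sf(abs(z)):.2g}")   # two-sided p-value
```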


$z = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{-0.018}{0.001} = -27.9$

which is compared to a normal distribution (or a t distribution if the sample size is small). In this case the p-value is tiny, less than 0.001. If there were no relationship between hedonism and age in the population (i.e. the true slope were zero), we would expect less than 0.1% of samples from that population to produce a slope estimate of magnitude greater than 0.018. In the practice sections, we will generally use the Z-ratio to test significance rather than calculating confidence intervals.

C3.1.6 Model checking

A number of assumptions lie behind a regression model. These were given in C3.1.2 but, briefly, we assume:

i) The residuals $e_i$ are normally distributed.

ii) The variance of the residuals is constant, whatever the value of x, i.e. the residuals are homoskedastic.

iii) The residuals are not correlated with one another, i.e. they are independent.

We can check the validity of assumptions i) and ii) by examining plots of the estimated residuals. If it is suspected that residuals might be correlated because the data are clustered in some way, we can test assumption iii) by comparing a multilevel model, which accounts for clustering, with a multiple regression model which ignores clustering (see Module 5).

To check assumptions about $e_i$, we use the estimated residuals, which are the differences between the observed and predicted values of y:

$\hat{e}_i = y_i - \hat{y}_i$

We usually work with the standardized residuals $r_i$, which we obtain by dividing $\hat{e}_i$ by their standard deviation.


Checking the normality assumption
We can check whether residuals are normally distributed by looking at a histogram or a normal probability plot of the standardized residuals. If the normality assumption holds, the points in a normal plot should lie on a straight line. Figure 3.3 and Figure 3.4 show a histogram and normal probability plot of residuals from a simple regression model with age. Both plots suggest that the normal distribution assumption is reasonable here.

Checking the homoskedasticity assumption
To check that the variance of the residuals is fairly constant across the range of X, we can examine a plot of the standardized residuals against X and check that the vertical scatter of the residuals is roughly the same for different values of X, with no funnelling.

Figure 3.3. Histogram of $r_i$ (frequency of standardized residuals)

Figure 3.5 shows a plot of $r_i$ versus $x_i$. The vertical spread of the points appears fairly equal across different values of X, so we conclude that the assumption of homoskedasticity is reasonable.


Figure 3.4. Normal probability plot of $r_i$ (expected versus observed cumulative probability)

Figure 3.5. Plot of $r_i$ versus $x_i$ (standardized residual against age in years)


Outliers
We can also check for outliers using any of the above residual plots. An outlier is a point with a particularly large residual. We would expect approximately 95% of the standardized residuals to lie between -2 and +2. Of major interest, however, is whether an outlier has undue influence on our results. For example, in simple regression, an outlier with very large values on X and Y could push up a positive slope. A straightforward way to judge the influence of an outlier is to refit the regression line after excluding it. If the results are very similar to those based on all observations, we would conclude that the outlier does not have undue influence. An observation's influence can also be measured by a statistic called Cook's D (see C3.5.3).

Don't forget to do the practical for this section! (See the start of this module for details of how to find the practical.) Please read P3.1, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz for this section! (See the start of this module for details of how to find the quiz questions.)
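The checks in this section translate directly into a few lines of Python. The sketch below, on simulated stand-in data, produces the histogram, normal probability plot and residual-versus-X plot, and then illustrates the refit-without-the-outlier check; it drops the single largest residual for simplicity, whereas in practice you would identify influential points more carefully (e.g. with Cook's D).

```python
# Sketch of the residual checks: normality, homoskedasticity, and influence.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
age = rng.uniform(14, 98, 1000)
hed = 0.712 - 0.018 * age + rng.normal(0, 0.92, len(age))

X = np.column_stack([np.ones_like(age), age])
beta = np.linalg.lstsq(X, hed, rcond=None)[0]
resid = hed - X @ beta
r = resid / resid.std()                        # standardized residuals

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(r, bins=30)                       # normality: roughly bell-shaped?
stats.probplot(r, dist="norm", plot=axes[1])   # normality: points on a straight line?
axes[2].scatter(age, r, s=8, alpha=0.4)        # homoskedasticity: no funnelling?
plt.show()

# Influence check: drop the observation with the largest residual and refit
keep = np.abs(r) < np.abs(r).max()
beta_refit = np.linalg.lstsq(X[keep], hed[keep], rcond=None)[0]
print("full-sample fit:", beta, " refit without outlier:", beta_refit)
```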


    C3.2 Comparing Groups: Regression with a Single Categorical Explanatory Variable

When X is continuous, we are fitting a straight-line relationship. Regression can also be applied when X is categorical, in which case we are allowing the mean of Y to be potentially different for the different categories of X.

C3.2.1 Comparing two groups

Suppose that a categorical explanatory variable X has only two categories. We wish to compare the mean of our response variable Y for the two groups defined by these categories. We will examine whether there are gender differences in hedonism. In the human values dataset there is a variable called SEX, which is coded 1 for female and 0 for male. Variables that have codes of 0 and 1 are often called dummy variables. If we simply calculate the mean of our response variable HED for men and women, we obtain the results given in Table 3.2.

Table 3.2. Descriptive statistics for hedonism by sex

        Sample size   Mean hedonism score
Women   2747          -0.225
Men     3098          -0.069

So the (female - male) difference in means is -0.225 - (-0.069) = -0.156.

Normal (t) test for comparing two independent samples
We can use a normal test (or t-test if the sample is small) to test for a difference between women and men in the population. The null hypothesis for the test is that the difference between the population means of hedonism for women and men is zero. The test statistic is -6.12 and the p-value is less than 0.0001. A 95% confidence interval for the difference between the female and male population means is (-0.206, -0.106), which does not contain the null value of zero. We therefore conclude that the difference between women's and men's hedonism scores is statistically significant (at the 0.01% level).

Comparing two groups using regression
We can also compare groups using a regression model. The advantage of using regression, rather than a normal (or t) test, is that in a regression model we can allow for the effects of other variables as well as gender. To start with, however, we will consider gender as the only explanatory variable and demonstrate how men's and women's hedonism scores can be compared using regression.


Suppose we fit the simple regression model

$y_i = \beta_0 + \beta_1 x_i + e_i$

where $y_i$ is the hedonism score of individual $i$, and $x_i = 1$ if the individual is a woman and 0 if the respondent is a man⁷.

Table 3.3. Regression of hedonism on sex

           Coefficient   Standard error
Constant   -0.069        0.019
Sex        -0.156        0.025

The regression output is given in Table 3.3, from which we obtain the fitted regression equation:

$\widehat{HED}_i = -0.069 - 0.156\,SEX_i$

We can use this equation to predict HED for men and women:

For men (SEX = 0): $\widehat{HED} = -0.069 - 0.156(0) = -0.069$
For women (SEX = 1): $\widehat{HED} = -0.069 - 0.156(1) = -0.225$

Notice that these predicted values are just the mean hedonism scores for men and women, and that the coefficient of SEX is the difference between these means (women's mean minus men's mean, since SEX is coded 1 for women here). The null hypothesis that there is no difference between the mean score for men and women in the population can be expressed as H0: $\beta_1 = 0$. The standard error of $\hat{\beta}_1$ is 0.025 and the Z-ratio is therefore $-0.156/0.025 = -6.12$. The 95% confidence interval for $\beta_1$ is (-0.206, -0.106). Note that these results are exactly the same as those for the independent samples comparison of means test given earlier. So if SEX is the only explanatory variable, a regression analysis gives exactly the same results as a t-test. But only in a regression analysis can we include other explanatory variables.

⁷ Note that it would not be sensible to centre or standardize a binary variable.
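The equivalence between the two-group regression and the t-test is easy to verify numerically. Below is a minimal Python sketch with simulated scores whose group sizes and means mimic Table 3.2; the exact values will differ from the module's because the data are made up.

```python
# Sketch: regressing Y on a 0/1 dummy reproduces the two-group comparison.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sex = np.repeat([0, 1], [3098, 2747])                  # 0 = male, 1 = female
hed = np.where(sex == 1, -0.225, -0.069) + rng.normal(0, 0.97, sex.size)

# Regression of hedonism on the SEX dummy
X = np.column_stack([np.ones_like(sex, float), sex])
b0, b1 = np.linalg.lstsq(X, hed, rcond=None)[0]
print(f"constant = {b0:.3f} (men's mean), coefficient of SEX = {b1:.3f}")

# The coefficient of SEX equals the difference in group means, and the
# test of beta1 = 0 matches the independent-samples t-test:
t, p = stats.ttest_ind(hed[sex == 1], hed[sex == 0])
print(f"t = {t:.2f}, p = {p:.2g}")
```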


C3.2.2 Comparing more than two groups

Suppose now that a categorical explanatory variable X has three categories. We wish to compare the mean of our outcome variable Y for the three groups defined by these categories. The respondents in the hedonism example come from three countries. The mean of HED for each country is given in Table 3.4.

Table 3.4. Descriptive statistics for hedonism by country

          Sample size   Mean hedonism score
UK        1748          -0.384
Germany   2785          -0.128
France    1312          0.108

Analysis of Variance (ANOVA)
The standard way to compare more than two groups is to use analysis of variance (ANOVA). The null hypothesis is that there is no difference between groups (i.e. that the group means are all equal). Table 3.5 shows the results from an ANOVA for a comparison of hedonism for the three countries. When there is just one categorical variable, this type of analysis is usually called a one-way ANOVA.

Table 3.5. Analysis of variance of country differences in hedonism

                    Sum of squares   d.f.   Mean square   F statistic   p-value
Between countries   184.5            2      92.3          100.4         < 0.001


Comparing groups using multiple regression
The statistical model behind ANOVA is in fact a multiple regression model. But rather than including country itself as an explanatory variable¹⁰, we create dummy variables for two of the three countries and include these. Suppose we create three variables which indicate whether a respondent is from a particular country:

UK = 1 if the respondent is from the UK; = 0 if from Germany or France
GERM = 1 if the respondent is from Germany; = 0 if from the UK or France
FRANCE = 1 if the respondent is from France; = 0 if from the UK or Germany

These variables are called dummy variables¹¹. In fact we do not need all three of these variables, because if we know a respondent's value on two of them, we can infer their value on the third. For example, if we know that UK = 0 and GERM = 1, then we know that FRANCE = 0. (A respondent can only be living in one country at the time of the survey, so only one of UK, GERM and FRANCE can equal 1 for any given individual.) By the same argument, when we have a categorical variable with only two categories (e.g. our SEX variable in C3.2.1) we do not need to create any additional variables. SEX is already a dummy variable, and can therefore be included directly in the model as an explanatory variable.

To allow for differences between the UK, Germany and France, we choose (arbitrarily) two of the country dummy variables and include those as explanatory variables. Suppose we choose GERM and FRANCE; then the multiple regression model is:

$HED_i = \beta_0 + \beta_1 GERM_i + \beta_2 FRANCE_i + e_i$

Table 3.6 shows the results from fitting this model.

¹⁰ It would not make sense to fit a model of the form $HED_i = \beta_0 + \beta_1 COUNTRY_i + e_i$ because the coding of COUNTRY is arbitrary (i.e. COUNTRY is a nominal variable). In such a model, $\beta_1$ would be interpreted as the effect on HED of a 1-unit change in COUNTRY, but a 1-unit change in COUNTRY has no meaning!
¹¹ This is the most common way of coding dummy variables for a categorical variable and is often called simple coding, but other types of coding are possible depending on which comparisons are of interest. A comprehensive discussion of alternative coding systems can be found at http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter5/statareg5.htm


Table 3.6. Regression analysis of hedonism on country, UK taken as reference

           Coefficient   Standard error   Z-ratio   p-value
Constant   -0.384        0.023            -         -
Germany    0.256         0.029            8.765     < 0.001
France     0.492         -                -         < 0.001
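Here is a minimal Python sketch of the dummy-variable coding and fit, using simulated scores whose group sizes and means mimic Table 3.4; the country codes follow the dataset description (1 UK, 2 Germany, 3 France).

```python
# Sketch: dummy variables for a three-category predictor, UK as reference.
import numpy as np

rng = np.random.default_rng(42)
country = np.repeat([1, 2, 3], [1748, 2785, 1312])
means = {1: -0.384, 2: -0.128, 3: 0.108}
hed = np.vectorize(means.get)(country) + rng.normal(0, 0.95, country.size)

germ = (country == 2).astype(float)      # GERM dummy
france = (country == 3).astype(float)    # FRANCE dummy
X = np.column_stack([np.ones_like(hed), germ, france])
b0, b1, b2 = np.linalg.lstsq(X, hed, rcond=None)[0]
print(f"constant (UK mean)      = {b0:.3f}")
print(f"Germany - UK difference = {b1:.3f}")       # about 0.256
print(f"France - UK difference  = {b2:.3f}")       # about 0.492
print(f"France - Germany        = {b2 - b1:.3f}")  # about 0.236
```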


The constant estimates the mean hedonism score for the reference category, the UK, and the coefficients of the GERM and FRANCE dummies estimate the differences between each of those countries and the UK. The remaining contrast is between Germany and France, which is estimated as $\hat{\beta}_2 - \hat{\beta}_1 = 0.492 - 0.256 = 0.236$.

The null hypothesis for testing whether there is a difference between the mean hedonism scores in the populations of Germany and the UK can be expressed as H0: $\beta_1 = 0$. Similarly, the null for testing whether there is a difference between France and the UK is H0: $\beta_2 = 0$. The simplest way to compare Germany and France would be to refit the model, making one of these countries the reference category. Table 3.7 shows the results when Germany is taken as the reference, i.e. when the UK and FRANCE dummies are included in the model. The difference between the means for France and Germany is now obtained directly (from the coefficient of the FRANCE dummy) as 0.236. All coefficients in Table 3.6 and Table 3.7 are significantly different from zero (all p-values are < 0.001).


    C3.3 Regression with More than One Explanatory Variable (Multiple Regression)

C3.3.1 Statistical control

So far we have used simple regression to assess the linear relationship between two variables. In reality there will be a number of factors that are potential predictors of the outcome variable. The advantage of using a regression framework is that we can straightforwardly account for the effects of multiple variables simultaneously.

Examples

i) Suppose we compare two secondary schools on their age-16 exam performance, e.g. we might compare the percentage of students who achieve a pass in five or more subjects. Suppose we find that school 1 has a higher percentage with 5+ passes than school 2. Would we conclude that school 1's performance was better than school 2's? What other factors would we like to take into account? An obvious candidate would be a measure of students' achievement when they entered secondary school, so that school effects are "value-added".

ii) Comparisons of men's and women's salaries often reveal that women earn less. Explanations that are commonly put forward for this discrepancy are that women tend to work in jobs that have been traditionally lower paid, or that women have taken time out of paid employment to raise children. To determine whether there are salary differences between men and women who have been working in the same job for the same amount of time, we would wish to account for occupation and number of years of full-time employment, as well as other factors such as education level. Using multiple regression we can test whether these other factors explain gender differences in salary, i.e. does any gender difference disappear when we adjust for the effects of these other variables?

We can use multiple regression to take into account, or adjust for, other factors that might predict the response variable. Sometimes the effects of these other factors are of interest in themselves, e.g. predictors of age-16 attainment other than the school attended. Other times the effects of other factors are not of major interest, but it is important to adjust for their effects to obtain more meaningful estimates of the effects that we are interested in. Such factors are often called controls.


C3.3.2 The multiple regression model

In simple regression we have a single predictor or explanatory variable (X), and the linear regression model is

$y_i = \beta_0 + \beta_1 x_i + e_i.$

In multiple regression, we have more than one predictor. Suppose that we have two predictors, denoted by X1 and X2, which may be continuous or categorical. We have in fact already used a multiple regression model to analyse country differences in hedonism (in C3.2.2). Although there was just one predictor, country, it was represented by two dummy variables. More generally, we can include several predictors, and any of these may be represented by a set of dummy variables. The multiple (linear) regression model for two continuous (or dichotomous) explanatory variables is written

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i$

where $\beta_0$ is the value of y that would be expected when $x_1 = 0$ and $x_2 = 0$. The coefficients $\beta_1$ and $\beta_2$ are interpreted as follows:

• $\beta_1$ is the coefficient of $x_1$, which is interpreted as the change in y for a 1-unit change in $x_1$, controlling or adjusting for the effect of $x_2$. In other words, $\beta_1$ is the effect of $x_1$ for individuals with the same value of $x_2$ (or "holding $x_2$ constant").

• Similarly, $\beta_2$ is the coefficient of $x_2$, which is interpreted as the change in y for a 1-unit change in $x_2$, controlling for the effect of $x_1$.

Because each multiple regression coefficient represents the relationship between an explanatory variable and the dependent variable, conditioning on the effect of all other explanatory variables in the model, they are sometimes called partial regression coefficients.

We can test for a linear relationship between the response variable Y and a predictor variable Xk by testing the null hypothesis that the coefficient of Xk is zero (H0: $\beta_k = 0$) versus the alternative hypothesis that the coefficient is non-zero (HA: $\beta_k \neq 0$). As in simple regression, we can test for significance by examining confidence intervals for each parameter or, equivalently, by comparing Z-ratios to the normal distribution and calculating a p-value.


As in the simple regression model, $e_i$ is a residual. The residuals now represent factors other than X1 and X2 that predict Y, but we use $e_i$ in a general way to represent residuals in any model.

If X1 and X2 are continuous we can examine their relationship using a scatterplot. Note that when we have two predictors we would need a three-dimensional scatterplot to represent the relationship between Y and X1 and X2 graphically¹³. As it can be difficult to interpret three-dimensional plots, we can explore the data by looking at plots of Y versus X1, Y versus X2, and X1 versus X2. The third plot is important to check whether X1 and X2 are highly correlated.

Example
We will begin with the case where both X1 and X2 are continuous. Let's consider the effects of age (X1) and education (X2) on hedonism. We will ignore gender and country differences for now. We have already examined the bivariate relationship between hedonism and age and found that older respondents tend to be less hedonistic (in C3.1). This relationship may change when we account for education if education is related to both hedonism and age. For example, we would expect older respondents to have fewer years of education, and a higher level of education might be associated with less hedonistic beliefs if the more career-minded choose study over having a good time!

Figure 3.6 shows the relationship between hedonism and education (see Figure 3.2 for a plot of hedonism versus age). The relationship between the two explanatory variables, age and education, is shown in Figure 3.7. The correlation between hedonism and education is very weak; the Pearson coefficient is only 0.024. As expected, there is a negative correlation between age and education (r = -0.242). Because of the weak correlation between hedonism and education, however, we would not expect the addition of education in a multiple regression to have much impact on the coefficient of age.

In C3.1.3 the fitted equation from a simple regression of hedonism on age was found to be:

$\widehat{HED}_i = 0.712 - 0.018\,AGE_i.$

If we add education to the model, we obtain the following fitted multiple regression equation:

$\widehat{HED}_i = 0.971 - 0.019\,AGE_i - 0.017\,EDUC_i.$

¹³ For those of you who are interested, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ is the equation of a plane in 3-dimensional space.


    Notice that, as expected, there is little change in the coefficient of age when education is added, but the relationship between hedonism and education is now negative after accounting for age. Both relationships are significantly different from zero at the 0.1% level. The relationship between hedonism and education should be interpreted with some caution, however. We should hesitate to conclude that education affects or causes hedonism. It is likely that hedonism and education are both influenced by variables that we have not accounted for in this model.

    Figure 3.6. Plot of hedonism by education


    Figure 3.7. Plot of age by education
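A two-predictor fit adds nothing new computationally: the design matrix simply gains a column. The Python sketch below simulates data whose age-education relationship loosely mimics the one described above, so the estimates only approximate the fitted equation in the text.

```python
# Sketch of the multiple regression of hedonism on age and education.
import numpy as np

rng = np.random.default_rng(42)
n = 5845
age = rng.uniform(14, 98, n)
educ = np.clip(16 - 0.05 * (age - 46) + rng.normal(0, 3, n), 0, None)  # older -> fewer years
hed = 0.971 - 0.019 * age - 0.017 * educ + rng.normal(0, 0.9, n)

X = np.column_stack([np.ones(n), age, educ])
b0, b1, b2 = np.linalg.lstsq(X, hed, rcond=None)[0]
print(f"HED-hat = {b0:.3f} + {b1:.3f}*AGE + {b2:.3f}*EDUC")
# b1: change in hedonism per extra year of age, holding education constant
# b2: change in hedonism per extra year of education, holding age constant
```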

    Standardised coefficients Standardisation and standardised coefficients were introduced in C3.1.3. To recap, the standardised coefficient for a predictor X is the estimate of the slope that would be obtained if X and Y were both standardised before the regression analysis. In simple regression, the standardised coefficient of X is equal to the Pearson correlation coefficient. In multiple regression, with two predictors X1 and X2, the standardised coefficient of X1 is interpreted as the change in standardised Y for a 1-unit change in standardised X1, holding X2 constant. (Recall that 1 unit of a standardised variable corresponds to 1 standard deviation.) For example, in a multiple regression model of hedonism on age and education, the standardised coefficient for AGE is -0.358. Thus we can say that a 1 standard deviation change in age predicts a 0.358 standard deviation decrease in hedonism. Note that if all variables (Y and the Xs) had been standardised prior to the analysis, then the unstandardised and standardised coefficients would be equal. Standardised coefficients are produced by many statistical software packages and reported in much published quantitative research, but they should be interpreted with caution. It is often claimed that standardised coefficients can be compared across the predictors to determine which has the strongest influence on Y. However, predictors are usually correlated with one another and it is rarely possible to change the value of one without changing the value of another.


Further, some predictors may be easier to manipulate than others, which is particularly important if regression results are used to inform public policy¹⁴. When standardised coefficients are reported, they should be accompanied by the corresponding unstandardised coefficients, which represent effects in terms of the original units of measurement for X and Y. This is particularly important for categorical X.

C3.3.3 Using multiple regression to model a non-linear relationship

Suppose a scatterplot of Y versus X resembles Figure 3.8. The relationship is non-linear, so it would not be appropriate to fit the straight-line relationship implied by a linear regression model. We should fit a curve through the points rather than a line. The simplest curve is a quadratic function (or a second-order polynomial):

$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i$

Note that the above is an example of a multiple regression model with $x_1 = x$ and $x_2 = x^2$. Also shown in Figure 3.8 is the fitted quadratic curve, which turns out to

have equation

$\hat{y}_i = 1.0 + 0.021 x_i - 0.247 x_i^2.$

Figure 3.8. Example of a non-linear relationship between Y and X

¹⁴ See http://www.tufts.edu/~gdallal/importnt.htm for further discussion of the use and interpretation of standardised coefficients.


The results from fitting a quadratic curve to the relationship between hedonism and age are given in Table 3.8. Note that this analysis is based on standardised age and its square. This is because, for older respondents (remember the oldest is 98), age² takes very large values; this may cause computational difficulties, and the coefficient of age² would be very small.

Table 3.8. Regression with quadratic effects for age

                      Coeff.    S.E.     Z-ratio    p-value
    Constant          -0.222    0.017    -
    Standardised age  -0.348    0.012    -28.669
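As a concrete illustration of the mechanics (not of the module's actual results), the sketch below fits the quadratic model by adding the square of standardised age as a second column of the design matrix. The data are synthetic and the generating coefficients are made up.

    # Minimal sketch: quadratic regression on standardised age (synthetic data).
    import numpy as np

    rng = np.random.default_rng(7)
    n = 2000
    age = rng.uniform(14, 98, n)
    z_age = (age - age.mean()) / age.std(ddof=1)  # standardise before squaring,
                                                  # so age**2 cannot get huge
    hed = -0.2 - 0.35 * z_age - 0.05 * z_age**2 + rng.normal(0, 0.8, n)

    X = np.column_stack([np.ones(n), z_age, z_age**2])
    beta, *_ = np.linalg.lstsq(X, hed, rcond=None)
    print(beta)                                   # [constant, linear, quadratic]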

C3.3.4 Adding further predictors

Principles of model selection

In most quantitative research there is a large set of potential explanatory variables. Many procedures have been proposed to select the best model from a set of variables automatically (e.g. backward elimination, forward selection, stepwise selection), and many of these have been implemented in mainstream statistical software. These procedures are sometimes useful in that they provide a systematic means of model selection, but they should be used with caution or you may be accused of data dredging.

In practice your research design and analysis will be guided by theory, which will come from previous research in the same or related areas as well as your own ideas. Often you will have several rival theories that you wish to compare, to assess which has the strongest empirical support. These theories and your particular research question will guide the order in which you enter explanatory variables into the model. For example, suppose you are interested in examining gender differences in salary levels. The first model you fit might include only a gender effect. Suppose you find that there is a significant difference between men and women. You might then add in other explanatory variables to see which ones, if any, help to explain the gender difference. A further step in the analysis would be to test whether the gender difference is the same for all men and women; for example, gender differences may be larger in some occupation categories than in others (an example of an interaction effect; see C3.4).

In other situations, there will be variables that you want to include for interpretation purposes. For example, in educational research, you might be interested in looking at predictors of academic progress rather than academic attainment at one point in time. One way to do that is to include prior attainment as an explanatory variable in the model.

Effect sizes

The size of the coefficient for a predictor variable Xk will depend on the scales of Xk and the response variable. For example, suppose we multiply each value of AGE by 12 to give age in months rather than years, and refit the multiple regression model with age and education effects. We obtain the results shown in Table 3.9.

Table 3.9. Regression of hedonism on age and education for different age scales

                 Age in years          Age in months
                 Coeff.    Z-ratio     Coeff.    Z-ratio
    Constant     0.971     -           0.971     -
    Age          -0.019    -28.206     -0.002    -28.206
    Education    -0.017    -4.915      -0.017    -4.915
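The rescaling property that this table illustrates is easy to verify directly. The sketch below is not from the module: synthetic data, made-up coefficients; it simply refits the same model with age multiplied by 12.

    # Minimal sketch: changing the scale of a predictor (synthetic data).
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    age_years = rng.uniform(14, 98, n)
    educ = rng.normal(12, 3, n)
    hed = 1.0 - 0.02 * age_years - 0.02 * educ + rng.normal(0, 0.8, n)

    def ols(y, *cols):
        X = np.column_stack([np.ones(len(y)), *cols])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta

    print(ols(hed, age_years, educ))        # [constant, age slope, educ slope]
    print(ols(hed, age_years * 12, educ))   # age slope divided by 12; constant
                                            # and education slope unchanged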


The coefficient of age in months (AGE*12) is -0.002, which is the coefficient of age in years (AGE) divided by 12. This is because 1 unit on the scale of AGE*12 is equal to 1/12 of a unit on the scale of AGE. Notice that the intercept does not change, because AGE = 0 means the same whether the measurement is in months or years. The coefficient of education is unaffected by transformations of age. Because regression coefficients depend on scale, standardised coefficients are sometimes quoted too. When researchers talk about effect sizes, they are often referring to standardised coefficients.

Don't forget to do the practical for this section! (see page 2 for details of how to find the practical) Please read P3.3, which is available in online form or as part of a pdf file.

Don't forget to take the online quiz for this section! (see page 2 for details of how to find the quiz questions)


C3.4 Interaction Effects

In C3.2 we saw how to compare groups using dummy variables in a regression model. For example, we compared the mean hedonism score for men and women, and for different countries. So far, however, we have assumed that the effects of other predictor variables, e.g. age, are the same for each group. This is equivalent to assuming that group differences in hedonism are the same for all values of the other predictors. This assumption may be unrealistic. Perhaps age differences in hedonism are more pronounced among men, which would imply that the age effect differs for men and women. Two predictors are said to have an interaction effect on Y if the effect of one of the predictors on Y depends on the value of the other predictor.

C3.4.1 Model with fixed slopes across groups

Suppose we fit a multiple regression model with age and gender effects:

$HED_i = \beta_0 + \beta_1\,AGE_i + \beta_2\,SEX_i + e_i$   (3.7)

We obtain the following fitted regression equation:

$\widehat{HED}_i = 0.791 - 0.018\,AGE_i - 0.152\,SEX_i$

For SEX = 0 (men), the relationship between HED and AGE is represented by the line:

$\widehat{HED}_i = 0.791 - 0.018\,AGE_i$

and for SEX = 1 (women), the fitted line is:

$\widehat{HED}_i = 0.639 - 0.018\,AGE_i$

So the lines for men and women have different intercepts, but the same slope, i.e. the regression lines are parallel (see Figure 3.10). There are two equivalent ways of interpreting Figure 3.10. We can say that the effect of age on hedonism is the same for men and women. Alternatively, we can say that the gender difference in hedonism is the same at all ages.


    Figure 3.10. Regression lines for men and women, fixed slopes

Note: The age range in the sample is 14 to 98 years. The software used to draw the plot has extrapolated the regression lines beyond the observed range, which is not generally recommended.

C3.4.2 Fitting separate models for each group

Is it reasonable to assume that the gender difference in hedonism is the same at all ages? One way of allowing men and women to have different slopes for the relationship between hedonism and age is to fit a separate regression line for each sex. We do this by splitting the sample by gender¹⁶, and fitting a simple regression of HED on AGE for each sex. If we do this, we obtain the results shown in Table 3.10.

¹⁶ This is often done using a 'select if' command or menu option, or by requesting an analysis that is stratified by gender.


Table 3.10. Regression of hedonism on age with separate models fitted for men and women

                     Coeff.    S.E.     Z-ratio
    Men
      Constant       0.839     0.047    -
      Age (years)    -0.019    0.001    -20.854
    Women
      Constant       0.597     0.047    -
      Age (years)    -0.018    0.001    -18.910

For men the slope of age is -0.019, compared to -0.018 for women. So the slope is slightly steeper for men. Because women have a lower intercept than men, a steeper slope for men implies that the gender difference is greater among younger respondents (see Figure 3.11 later). While splitting the sample into groups is a simple way of allowing for different slopes for each group, there are several problems with this approach:

i) The sample size for some groups may be small.

ii) There may be more than one categorical predictor, and therefore more than one way of grouping the data. The effects of the other predictors may vary across each grouping, e.g. hedonism may vary by sex and by country. Splitting the data into groups defined by sex and country will lead to a large number of groups; in this dataset, the sample sizes in each group remain large, but this will often not be the case.

iii) In general there will be several predictors in the model, but it is unlikely that the effects of all predictors will vary across groups. In that case, fitting a separate regression for each group is inefficient. Where the coefficient of a predictor does not vary across groups, it would be better to estimate it using information from the whole sample; the estimate of the coefficient would then be based on a larger sample size and would therefore have a smaller standard error than if it were estimated separately for each group.

iv) When separate analyses are carried out for each group, it is not possible to carry out hypothesis tests to compare coefficients across groups. For example, if we fit separate regressions of hedonism on age for men and women, we cannot test whether there is a gender difference in the relationship between hedonism and age in the population.
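For concreteness, here is a minimal sketch of the split-sample approach just described, before we turn to the pooled alternative. It is not from the module: the data are synthetic, with a made-up gender difference in both intercept and slope.

    # Minimal sketch: separate regressions of hedonism on age by sex (synthetic).
    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000
    age = rng.uniform(14, 98, n)
    sex = rng.integers(0, 2, n)                  # 0 = men, 1 = women
    hed = (0.8 - 0.019 * age - 0.15 * sex + 0.001 * age * sex
           + rng.normal(0, 0.8, n))

    def ols(y, x):
        X = np.column_stack([np.ones(len(y)), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta

    for s, label in [(0, "men"), (1, "women")]:
        mask = sex == s                          # the "select if" step
        const, slope = ols(hed[mask], age[mask])
        print(f"{label}: intercept = {const:.3f}, age slope = {slope:.3f}")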


    C3.4.3 Allowing for varying slopes in a pooled analysis: interaction effects

Rather than fitting a separate model for each sex, we will fit a single model to the whole pooled sample. We create a new variable which is the product of AGE and SEX:

AGE_SEX = AGE × SEX

The new variable AGE_SEX is added as another predictor variable to model (3.7) to give:

$HED_i = \beta_0 + \beta_1\,AGE_i + \beta_2\,SEX_i + \beta_3\,AGE\_SEX_i + e_i$   (3.8)

Table 3.11 gives an extract of the analysis data file to which (3.8) could be fitted.

Table 3.11. Example of hedonism dataset with age by sex interaction variable

    Respondent   Hedonism   AGE   SEX   AGE_SEX
    1            1.55       25    0     0
    2            0.76       30    0     0
    3            -0.26      59    0     0
    4            -1.00      47    1     47
    .            .          .     .     .
    .            .          .     .     .
    5845         0.74       65    0     0
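Constructing such a dataset and fitting model (3.8) takes only a few lines; the sketch below is not from the module and uses the same synthetic data-generating scheme as the earlier sketches.

    # Minimal sketch: pooled regression with an age-by-sex interaction (synthetic).
    import numpy as np

    rng = np.random.default_rng(3)
    n = 2000
    age = rng.uniform(14, 98, n)
    sex = rng.integers(0, 2, n)
    hed = (0.8 - 0.019 * age - 0.15 * sex + 0.001 * age * sex
           + rng.normal(0, 0.8, n))

    age_sex = age * sex                              # the interaction variable
    X = np.column_stack([np.ones(n), age, sex, age_sex])
    beta, *_ = np.linalg.lstsq(X, hed, rcond=None)

    # beta = [b0, b1, b2, b3]: b1 is the age slope for men, b1 + b3 the slope
    # for women, so b3 is the slope difference the split-sample fits cannot test.
    print(beta)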

The inclusion of AGE_SEX, called the interaction between AGE and SEX, allows the effect of AGE on HED to differ for men and women (or, equivalently, the effect of sex on HED to depend on AGE). If the effect of age differs by sex, we say that there is an interaction effect. To see how an interaction effect works, we will look at the regression model for each value of SEX. For SEX = 0 (men), AGE_SEX = 0, so the regression model (3.8) becomes:

$HED_i = \beta_0 + \beta_1\,AGE_i + e_i$   (3.9)

For SEX = 1 (women), AGE_SEX = AGE and the regression model (3.8) becomes:

$HED_i = \beta_0 + \beta_1\,AGE_i + \beta_2 + \beta_3\,AGE_i + e_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,AGE_i + e_i$   (3.10)


In equation (3.9) the intercept is $\beta_0$, and in (3.10) it is $\beta_0 + \beta_2$. So $\beta_2$ is the difference between the intercepts for men and women. In equation (3.9) the slope of AGE is $\beta_1$, and in (3.10) it is $\beta_1 + \beta_3$. So $\beta_3$ is the difference between the slopes for men and women. Table 3.12 shows the results from fitting model (3.8) to the hedonism data.

Table 3.12. Regression of hedonism on age and sex, pooled analysis with interaction

                   Coeff.    S.E.     Z-ratio    p-value
    Constant       0.839     0.048    -          -
    Age (years)    -0.019    0.001    -20.075

C3.4.5 Another example: allowing age effects to be different in different countries

In this example we allow the relationship between hedonism and age to differ across each of the three countries, which is equivalent to testing whether differences between countries depend on age. In C3.2.2, we allowed for country effects by including dummy variables for Germany and France, i.e. we included the variables GERM and FRANCE as predictors in the regression model. To allow the effect of age on hedonism to vary across countries, we need to create two interaction variables, which we will call AGE_GERM and AGE_FRANCE. These are defined as follows:

AGE_GERM = AGE × GERM
AGE_FRANCE = AGE × FRANCE

The interaction model has the form:

$HED_i = \beta_0 + \beta_1\,AGE_i + \beta_2\,GERM_i + \beta_3\,FRANCE_i + \beta_4\,AGE\_GERM_i + \beta_5\,AGE\_FRANCE_i + e_i$

The results from fitting this model are given in Table 3.13.

Table 3.13. Regression with age by country interaction effect

                   Coeff.    S.E.     Z-ratio    p-value
    Constant       0.604     0.061    -          -
    Age (years)    -0.021    0.001    -17.210


    C3.5 Checking Model Assumptions in Multiple Regression

The assumptions of a multiple regression model are the same as those for a simple regression model (see C3.1.6), i.e.

i) the residuals $e_i$ are normally distributed,
ii) the variance of the residuals is the same for each value of X (or combination of values for different X variables), and
iii) the residuals are independent.

We can check assumptions i) and ii) by looking at various plots of the standardised residuals. The same plots can be used to check for outliers, and the influence of outliers on the regression results can be assessed by looking at the distribution of the Cook's D statistic.

C3.5.1 Checking the normality assumption

In C3.1.6, we checked the normality assumption of simple regression using two plots of the standardised residuals: a histogram and a normal probability plot. The same plots are used in multiple regression. Figure 3.12 and Figure 3.13 show the histogram and normal probability plot of residuals from a multiple regression model of hedonism that includes age, education, gender and country effects. The histogram shows a symmetric bell-shaped distribution and the normal plot shows a straight line, suggesting that the normal distribution assumption is reasonable.
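The two plots are straightforward to produce in most packages; a minimal sketch (not from the module) is given below, using simulated values in place of the real standardised residuals.

    # Minimal sketch: normality checks for standardised residuals (simulated here).
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(4)
    resid_std = rng.normal(0, 1, 5000)   # stand-in for real standardised residuals

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(resid_std, bins=40)
    ax1.set_title("Histogram of standardised residuals")
    stats.probplot(resid_std, dist="norm", plot=ax2)  # normal probability plot
    ax2.set_title("Normal probability plot")
    plt.show()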

Figure 3.12. Histogram of $r_i$


Figure 3.13. Normal probability plot of $r_i$

C3.5.2 Checking the homoskedasticity assumption

For simple regression, we check that the variance of the residuals is fairly constant across the range of X in a plot of the standardised residuals against the explanatory variable X. In multiple regression, it is useful to start with a plot of $r_i$ against $\hat{y}_i$ because, for any individual, the predicted value of y is a linear function of their values on all X variables in the model. This should be followed by an examination of pairwise plots of the standardised residuals against each explanatory variable X in turn. For each plot we are looking for indications of funnelling, where the vertical scatter of the residuals differs for different values of $x_i$ or $\hat{y}_i$; in that case the assumption of homoskedasticity is not met. A common reason for funnelling (or heteroskedasticity) is the existence of groups in the data among which the relationship between Y and one or more X variables differs, i.e. unmodelled interaction effects.

To illustrate the idea of funnelling, suppose that the relationship between Y and a continuous variable X1 is different for two subgroups defined by a binary variable X2: the relationship between Y and X1 is positive for both groups, but stronger for X2 = 0 than for X2 = 1. The predicted regression lines from a multiple regression of Y on X1, X2 and their interaction X1*X2 are shown in Figure 3.14.


Figure 3.14. Prediction lines from a multiple regression with an interaction effect (standardised predicted value against x1, with separate lines for x2 = 0 and x2 = 1)

Now suppose we mistakenly fit a simple regression of Y on X1, ignoring the fact that there are two groups with different relationships between Y and X1. Figure 3.15 shows the residual plot for this misspecified model¹⁷. (The data points for the groups defined by X2 are distinguished, but remember that X2 is not included in the model.) The plot shows evidence of heteroskedasticity because the vertical spread of the residuals gets smaller as X1 increases; this is an example of what we mean by funnelling. Why has this happened? Instead of fitting two regression lines with different intercepts and slopes for each group, we have fitted a single average line which would lie somewhere between the lines in Figure 3.14¹⁸. At small values of X1, where we have the largest difference in the predicted value of Y for the two groups, the residuals about this line are large and positive for X2 = 1 and large and negative for X2 = 0. The difference between the groups becomes smaller as X1 increases, so the average line lies close to the individual group lines and the residuals are smaller.

¹⁷ The residuals are plotted against $x_{1i}$, but the plot of $r_i$ against the predicted response, $\hat{y}_i$, would look exactly the same because in simple regression $\hat{y}_i$ is just a linear function of $x_{1i}$.
¹⁸ We would expect this average line to lie closer to the line for the largest group.
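The funnelling mechanism is easy to reproduce by simulation. The sketch below is not from the module: it generates two groups whose lines converge as X1 increases, then fits the misspecified simple regression and plots the residuals.

    # Minimal sketch: heteroskedastic residuals from an omitted interaction.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    n = 1000
    x1 = rng.uniform(-5, 5, n)
    x2 = rng.integers(0, 2, n)                     # binary group indicator
    intercept = np.where(x2 == 0, -2.0, 1.0)
    slope = np.where(x2 == 0, 0.8, 0.2)            # positive for both groups,
                                                   # stronger for x2 = 0
    y = intercept + slope * x1 + rng.normal(0, 0.3, n)

    # Misspecified model: regress y on x1 only, ignoring x2 and the interaction.
    X = np.column_stack([np.ones(n), x1])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    resid_std = resid / resid.std(ddof=2)          # crude standardisation

    plt.scatter(x1, resid_std, c=x2, s=8)          # colour the points by group
    plt.xlabel("x1")
    plt.ylabel("Standardised residual")
    plt.show()                                     # the spread narrows as x1 grows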

Figure 3.15. Plot of $r_i$ versus $x_{1i}$ from fitting a misspecified regression without X2 or its interaction with X1

Returning to the hedonism data, Figure 3.16 shows a plot of $r_i$ versus the standardised predicted values $\hat{y}_i$ from the model with age, education, gender and country included as explanatory variables. The vertical spread of the points appears fairly equal across different values of standardised $\hat{y}_i$, so we conclude that the assumption of homoskedasticity is reasonable.

C3.5.3 Outliers

We can also check for outliers using any of the residual plots. An outlier is a point with a particularly large residual. We would expect approximately 95% of the standardised residuals to lie between -2 and +2. Of major interest, however, is whether an outlier has undue influence on our results.


An influence statistic called Cook's D (where D stands for distance) measures how different our estimated regression coefficients would be if a sample observation were omitted. Cook's D is calculated for every observation. The higher the value of D, the more likely it is that an observation exerts influence on the estimates of the coefficients. However, D does not have a fixed range, so we focus on values of D that are considerably greater than, say, the 90th percentile.
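Many packages compute Cook's D directly. As a minimal sketch (not from the module, and not the module's software), the following uses statsmodels on synthetic data and flags observations whose D is far above the 90th percentile.

    # Minimal sketch: Cook's D with statsmodels (synthetic data).
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    n = 1000
    X = sm.add_constant(rng.normal(size=(n, 2)))   # intercept + two predictors
    y = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(0, 1, n)

    fit = sm.OLS(y, X).fit()
    cooks_d, _ = fit.get_influence().cooks_distance  # one value per observation

    p90 = np.quantile(cooks_d, 0.90)
    flagged = np.where(cooks_d > 10 * p90)[0]      # "considerably greater" than
                                                   # the 90th percentile (10x is
                                                   # an arbitrary illustration)
    print(f"90th percentile of D: {p90:.6f}; flagged cases: {flagged}")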

Figure 3.16. Plot of $r_i$ versus standardised $\hat{y}_i$ from a multiple regression of hedonism scores

For a regression of hedonism on age, education, sex and country, we find that the 90th percentile of the distribution of Cook's D is 0.000046. A boxplot of Cook's D is given in Figure 3.17. Two observations have relatively large values of D: case numbers 3225 and 2948. However, removing these observations from the analysis has negligible impact on our results (see Table 3.14).


Figure 3.17. Boxplot of Cook's D

Table 3.14. Impact of omitting outliers on estimated coefficients and Z-ratios

                         Full sample           Omitting observations 3225 and 2948
                         Coeff.    Z-ratio     Coeff.    Z-ratio
    Constant             0.790     -           0.789     -
    Age (years)          -0.019    -27.611     -0.019    -27.537
    Education (years)    -0.015    -4.281      -0.015    -4.350
    Female               -0.160    -6.752      -0.160    -6.774
    Country
      Germany            0.222     8.068       0.222     8.090
      France             0.436     13.145      0.441     13.322

Don't forget to do the practical for this section! (see page 2 for details of how to find the practical) Please read P3.5, which is available in online form or as part of a pdf file.

Don't forget to take the online quizzes for this module if you haven't already done so! (see page 2 for details of how to find the quizzes)