1 Applying Regression Slide 2 2 The Course 14 (or so) lessons Some flexibility Depends how we feel What we get through Slide 3 3 Part I: Theory of Regression 1.Models in statistics 2.Models with more than one parameter: regression 3.Samples to populations 4.Introducing multiple regression 5.More on multiple regression Slide 4 4 Part 2: Application of regression 6.Categorical predictor variables 7.Assumptions in regression analysis 8.Issues in regression analysis 9.Non-linear regression 10.Categorical and count variables 11.Moderators (interactions) in regression 12.Mediation and path analysis Part 3:Taking Regression Further (Kind of brief) 13.Introducing longitudinal multilevel models Slide 5 Bonuses Bonus lesson1: Why is it called regression? Bonus lesson 2: Other types of regression. 5 Slide 6 6 House Rules Jeremy must remember Not to talk too fast If you dont understand Ask Any time If you think Im wrong Ask. (Im not always right) Slide 7 The Assistants Carla Xena - [email protected] Eugenia Suarez Moran [email protected] Arian Daneshmand- [email protected] Slide 8 8 Learning New Techniques Best kind of data to learn a new technique Data that you know well, and understand Your own data In computer labs (esp later on) Use your own data if you like My data Ill provide you with Simple examples, small sample sizes Conceptually simple (even silly) Slide 9 9 Computer Programs Stata Mostly Ill explain SPSS options Youll like Stata more Excel For calculations Semi-optional GPower Slide 10 10 Lesson 1: Models in statistics Models, parsimony, error, mean, OLS estimators Slide 11 11 What is a Model? Slide 12 12 What is a model? Representation Of reality Not reality Model aeroplane represents a real aeroplane If model aeroplane = real aeroplane, it isnt a model Slide 13 13 Statistics is about modelling Representing and simplifying Sifting What is important from what is not important Parsimony In statistical models we seek parsimony Parsimony simplicity Slide 14 14 Parsimony in Science A model should be: 1: able to explain a lot 2: use as few concepts as possible More it explains The more you get Fewer concepts The lower the price Is it worth paying a higher price for a better model? Slide 15 15 The Mean as a Model Slide 16 16 The (Arithmetic) Mean We all know the mean The average Learned about it at school Forget (didnt know) about how clever the mean is The mean is: An Ordinary Least Squares (OLS) estimator Best Linear Unbiased Estimator (BLUE) Slide 17 17 Mean as OLS Estimator Going back a step or two MODEL was a representation of DATA We said we want a model that explains a lot How much does a model explain? DATA = MODEL + ERROR ERROR = DATA - MODEL We want a model with as little ERROR as possible Slide 18 18 What is error? Error (e)Model (b 0 ) mean Data (Y) 0.031.63 0.021.62 0.201.80 -0.051.55 -0.20 1.60 1.40 Slide 19 19 How can we calculate the amount of error? Sum of errors? Sum of absolute errors? Slide 20 20 Are small and large errors equivalent? One error of 4 Four errors of 1 The same? What happens with different data? 
Y = (2, 2, 5): b0 = 2. Not very representative. Y = (2, 2, 4, 4): b0 = any value from 2 to 4. Indeterminate: there are an infinite number of solutions which would satisfy our criterion for minimum error.

Slide 21: Sum of squared errors (SSE).

Slide 22: Determinate: always gives one answer. If we minimise SSE we get the mean. Shown in the graph of SSE plotted against b0: the minimum value of SSE occurs when b0 = mean.

Slides 23-24: The Mean as an OLS Estimate.

Slide 25: Mean as OLS estimate. The mean is an Ordinary Least Squares (OLS) estimate, as are lots of other things. This is exciting because OLS estimators are BLUE: Best Linear Unbiased Estimators. Proven with the Gauss-Markov theorem, which we won't worry about.

Slide 26: BLUE estimators. "Best" means minimum variance (of all possible unbiased estimators): a narrower distribution than other estimators, e.g. the median or mode.

Slide 27: SSE and the Standard Deviation: tying up a loose end.

Slide 28: SSE is closely related to the SD. The sample standard deviation (s) is a biased estimator of the population standard deviation. We need to know the mean to calculate the SD, which reduces N by 1; hence we divide by N - 1, not N. It is like losing one df.

Slide 29: Proof that the mean minimises SSE: not that difficult, as statistical proofs go. Available in Maxwell and Delaney, Designing Experiments and Analysing Data, and Judd and McClelland, Data Analysis: A Model Comparison Approach (out of print?).

Slide 30: What's a df? The number of parameters free to vary when one is fixed. The term comes from engineering: the movement available to structures.

Slide 31: Back to the data. The mean has 5 (N) df; it is the 1st moment. The SD has N - 1 df, because the mean has been fixed; it is the 2nd moment, and can be thought of as the amount cases vary away from the mean.

Slide 32: While we are at it: skewness has N - 2 df (the 3rd moment); kurtosis has N - 3 df (the 4th moment).

Slide 33: Parsimony and df. The number of df remaining is a measure of parsimony. A model which contained all the data would have 0 df: not a parsimonious model. The normal distribution can be described in terms of the mean and SD, i.e. 2 parameters (z with 0 parameters).

Slide 34: Summary of Lesson 1. Statistics is about modelling DATA. Models have parameters: fewer parameters, more parsimony, better. Models need to minimise ERROR: the best model has the least ERROR. This depends on how we define ERROR; if we define error as the sum of squared deviations from the predicted value, the mean is the best MODEL.

Slide 35: Lesson 1a: A really brief introduction to Stata.

Slide 36: Commands. Command review, variable list, commands, output.

Slide 37: Stata commands. You can use the menus, but commands are easy. All have a similar format: command variables, options. Stata is case sensitive (BEDS, beds and Beds are different). Stata lets you shorten commands: summarize sqft can become su sq.

Slide 38: More Stata commands. Open exercise 1.4.dta and run: summarize sqm; table beds; mean price; histogram price. Or, abbreviated: su sq; tab be; mean pr; hist pr.

Slide 39: Lesson 2: Models with one more parameter: regression.

Slide 40: In Lesson 1 we said: use a model to predict and describe data; the mean is a simple, one-parameter model.

Slide 41: More Models: Slopes and Intercepts.

Slide 42: More models. The mean is OK, as far as it goes; it just doesn't go very far. It is a very simple prediction which uses very little information. We often have more information than that, and we want to use more information than that.

Slide 43: House prices. Look at house prices in one area of Los Angeles. Predictors of house prices, using: sale price, size, number of bedrooms, size of lot, year built.

Slides 44-45: [figures]

Slide 46: House prices.

address                    list price   beds   baths   sqft
3628 OLYMPIAD Dr              649500      4     3      2575
3673 OLYMPIAD Dr              450000      2     3      1910
3838 CHANSON Dr               489900      3     2      2856
3838 West 58TH Pl             330000      4     2      1651
3919 West 58TH Pl             349000      3     2      1466
3954 FAIRWAY Blvd             514900      3     2.25   2018
4044 OLYMPIAD Dr              649000      4     2.5    3019
4336 DON LUIS Dr              474000      2     2.5    2188
4421 West 59TH St             460000      3     2      1519
4518 WHELAN Pl                388000      2     1.5    1403
4670 West 63RD St             259500      3     2      1491
5000 ANGELES VISTA Blvd       678800      5     4      3808

Slide 47: One-parameter model: the mean. How much is that house worth? $415,689. We use 1 df to say that.

Slide 48: Adding more parameters. We have more information than this, so we might as well use it. Add a linear function of size in square feet (x1).

Slide 49: Alternative expression. The estimate of Y (the expected value of Y) versus the observed value of Y.

Slide 50: Estimating the model. We can estimate this model in four different, equivalent ways, which provides more than one way of thinking about it: 1. estimating the slope which minimises SSE; 2. examining the proportional reduction in SSE; 3. calculating the covariance; 4. looking at the accuracy of the predictions.

Slide 51: Estimate the Slope to Minimise SSE.

Slide 52: Estimate the slope, stage 1. Draw a scatterplot with the x-axis at the mean, not at zero. Mark the errors on it; these are called residuals. Sum and square these to find SSE.

Slides 53-54: [scatterplots]

Slide 55: Add another slope to the chart, redraw the residuals and recalculate SSE. Move the line around to find the slope which minimises SSE: find the slope.

Slide 56: First attempt. [chart]

Slide 57: Any straight line can be defined with two parameters: the location (height) of the line, b0, sometimes called the constant, and the gradient of the slope, b1.

Slide 58: Gradient: move 1 unit along x and the line rises b1 units.

Slide 59: Height: b0 units.

Slide 60: Height. If we fix the slope to zero, the height becomes the mean; hence the mean is b0. The height is defined as the point where the line hits the y-axis: the constant, or y-intercept.

Slide 61: Why "the constant"? The term is b0 x0, where x0 is 1.00 for every case, i.e. x0 is constant. This is implicit in Stata (and SPSS, SAS, R); some packages force you to make it explicit. (Later on we'll need to make it explicit.)

Slide 62: Why "the intercept"? It is where the regression line intercepts the y-axis; it is sometimes called the y-intercept.

Slide 63: Finding the slope. How do we find the values of b0 and b1? Start with "we jiggle the values" to find the best estimates which minimise SSE: an iterative approach. Being computer-intensive used to matter; it doesn't really any more (with fast computers and sensible search algorithms; more on that later).

Slide 64: Start with b0 = 416 (the mean), b1 = 0.5 (a nice round number): SSE = 365,774.
b0 = 300, b1 = 0.5: SSE = 341,683
b0 = 300, b1 = 0.6: SSE = 310,240
b0 = 300, b1 = 0.8: SSE = 264,573
b0 = 300, b1 = 1: SSE = 301,797
b0 = 250, b1 = 1: SSE = 255,366
...

Slide 65: Quite a long time later: b0 = 216.357, b1 = 1.084, SSE = 145,636.78. This gives the position of the regression line, or line of best fit. Better than guessing; not necessarily the only method, but it is OLS, so it is the best (it is BLUE).

Slide 66: [chart]

Slide 67: We now know that a zero-square-metre house is worth $216,000, and adding a square metre adds $1,080. This tells us two things: don't extrapolate to meaningless values of the x-axis, and the constant is not necessarily useful in itself, but it is necessary to estimate the equation.

Slide 68: Exercise 2a, 2b.

Slide 69: Standardised regression line. One big "but": the slope is scale dependent. Values change (inflation); scales change (thousands? hundreds?).
We need to deal with this.

Slide 70: Don't express it in raw units; express it in SD units. From the slides: x1 = 183.82, y = 114.637, b1 = 1.103. We increase x1 by 1 and the predicted y increases by 1.084; so we increase x1 by 1 and the predicted y increases by 0.0094 SDs.

Slide 71: Similarly, 1 unit of x1 = 1/69.017 SDs. Increase x1 by 1 SD and the predicted y increases by 1.103 x (69.017/1) = 76.126. Put them both together.

Slide 72: The standardised regression line: the change (in SDs) in the predicted y associated with a change of 1 SD in x1. A different route to the same answer: standardise both variables (divide by their SDs) and find the line of best fit.

Slide 73: The standardised regression line has a special name: the correlation coefficient (r). (The r stands for regression, but more on that later.) The correlation coefficient is a standardised regression slope: relative change, in terms of SDs.

Slide 74: Exercise 2c.

Slide 75: Proportional Reduction in Error.

Slide 76: Proportional reduction in error. We might be interested in the level of improvement of the model: how much less error (as a proportion) do we have? This is the Proportional Reduction in Error (PRE). Mean only: Error(model 0) = 341,683. Mean + slope: Error(model 1) = 196,046.

Slide 77: [calculation]

Slide 78: But we squared all the errors in the first place, so we could take the square root. This is the correlation coefficient: the correlation coefficient is the square root of the proportion of variance explained.

Slide 79: Standardised Covariance.

Slide 80: Standardised covariance. We are still iterating; we need a closed-form equation, an equation we can solve to get the parameter estimates. The answer is a standardised covariance. A variable has variance: an amount of "differentness". We have used SSE so far.

Slide 81: SSE varies with N: higher N, higher SSE. Divide by N and we get SSE per person (or per house). (Actually N - 1: we have lost a df to the mean.) This gives us the variance, which is the same as SD squared. We thought of SSE as a scattergram of Y plotted against X (the image is repeated on the next slide).

Slide 82: [scattergram of Y against X]

Slide 83: Or we could plot Y against Y. The axes meet at the mean (415). Draw a square for each point, calculate the area of each square and sum the areas. The sum of the areas = SSE; the sum of the areas divided by N = the variance.

Slide 84: [plot of Y against Y]

Slide 85: Draw squares. 35 - 88.9 = -53.9 on one axis and 35 - 88.9 = -53.9 on the other: area = (-53.9) x (-53.9) = 2905.21. Likewise 138 - 88.9 = 40.1 and 138 - 88.9 = 40.1: area = 40.1 x 40.1 = 1608.1.

Slide 86: What if we do the same procedure, but instead of Y against Y we plot Y against X? Draw rectangles (not squares), sum the areas, and divide by N - 1. This gives us the variance of x with y: the covariance, shortened to Cov(x, y).

Slide 87: [plot of Y against X]

Slide 88: 55 - 88.9 = -33.9 and 1 - 3 = -2: area = (-33.9) x (-2) = 67.8. 138 - 88.9 = 49.1 and 4 - 3 = 1: area = 49.1 x 1 = 49.1.

Slide 89: More formally (and easily), we can state what we are doing as an equation: Cov(x, y) = Σ(xi - x̄)(yi - ȳ) / (N - 1), where Cov(x, y) is the covariance. Here Cov(x, y) = 5165. What do points in different sectors do to the covariance? (A short worked sketch linking the covariance to the correlation and the slope follows below.)
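To tie the covariance back to the slope, here is a brief worked version of that step, using the values quoted on these slides (Cov(x, y) = 5165, with SDs of roughly 69.017 for size and 114.637 for price; the exact figures are taken from the slides' own example and the symbols s_x and s_y simply stand for those two sample SDs):

\[
r = \frac{\mathrm{Cov}(x,y)}{s_x\, s_y} = \frac{5165}{69.017 \times 114.637} \approx 0.65,
\qquad
b_1 = r\,\frac{s_y}{s_x} = \frac{\mathrm{Cov}(x,y)}{s_x^{2}} = \frac{5165}{69.017^{2}} \approx 1.08
\]

These are essentially the same values that the iterative search reached earlier (b1 = 1.084) and that appear later as the correlation between actual and predicted price (r = 0.653).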
Slide 90 90 Problem with the covariance Tells us about two things The variance of X and Y The covariance Need to standardise it Like the slope Two ways to standardise the covariance Standardise the variables first Subtract from mean and divide by SD Standardise the covariance afterwards Slide 91 91 First approach Much more computationally expensive Too much like hard work to do by hand Need to standardise every value Second approach Much easier Standardise the final value only Need the combined variance Multiply two variances Find square root (were multiplied in first place) Slide 92 92 Standardised covariance Slide 93 93 The correlation coefficient A standardised covariance is a correlation coefficient Slide 94 94 Expanded Slide 95 95 This means We now have a closed form equation to calculate the correlation Which is the standardised slope Which we can use to calculate the unstandardised slope Slide 96 96 We know that: Slide 97 97 So value of b 1 is the same as the iterative approach Slide 98 98 The intercept Just while we are at it The variables are centred at zero We subtracted the mean from both variables Intercept is zero, because the axes cross at the mean Slide 99 99 Add mean of y to the constant Adjusts for centring y Subtract mean of x But not the whole mean of x Need to correct it for the slope Slide 100 100 Accuracy of Prediction Slide 101 101 One More (Last One) We have one more way to calculate the correlation Looking at the accuracy of the prediction Use the parameters b 0 and b 1 To calculate a predicted value for each case Slide 102 102 Plot actual price against predicted price From the model Slide 103 103 Slide 104 104 r = 0.653 The correlation between actual and predicted value Seems a futile thing to do And at this stage, it is But later on, we will see why Slide 105 105 Some More Formulae For hand calculation Point biserial Slide 106 106 Phi ( ) Used for 2 dichotomous variables Vote PVote Q HomeownerA: 19B: 54 Not homeownerC: 60D:53 Slide 107 107 Problem with the phi correlation Unless P x = P y (or P x = 1 P y ) Maximum (absolute) value is < 1.00 Tetrachoric correlation can be used to correct this Rank (Spearman) correlation Used where data are ranked Slide 108 108 Summary Mean is an OLS estimate OLS estimates are BLUE Regression line Best prediction of outcome from predictor OLS estimate (like mean) Standardised regression line A correlation Slide 109 109 Four ways to think about a correlation 1. Standardised regression line 2. Proportional Reduction in Error (PRE) 3. Standardised covariance 4. Accuracy of prediction Slide 110 Regression and Correlation in Stata Correlation: correlate x y correlate x y, cov regress y x Or regress price sqm 110 Slide 111 Post-Estimation Stata commands leave behind something You can run post-estimation commands They mean from the last regression Get predicted values: predict my_preds Get residuals: predict my_res, residuals 111 This comes after the comma, so its an option Slide 112 Graphs Scatterplot scatter price beds Regression line lfit price beds Both graphs twoway (scatter price beds) (lfit price beds) 112 Slide 113 What happens if you run reg without a predictor? 
regress price 113 Slide 114 Exercises 114 Slide 115 115 Lesson 3: Samples to Populations Standard Errors and Statistical Significance Slide 116 116 The Problem In Social Sciences We investigate samples Theoretically Randomly taken from a specified population Every member has an equal chance of being sampled Sampling one member does not alter the chances of sampling another Not the case in (say) physics, biology, etc. Slide 117 117 Population But its the population that we are interested in Not the sample Population statistic represented with Greek letter Hat means estimate Slide 118 118 Sample statistics (e.g. mean) estimate population parameters Want to know Likely size of the parameter If it is > 0 Slide 119 119 Sampling Distribution We need to know the sampling distribution of a parameter estimate How much does it vary from sample to sample If we make some assumptions We can know the sampling distribution of many statistics Start with the mean Slide 120 120 Sampling Distribution of the Mean Given Normal distribution Random sample Continuous data Mean has a known sampling distribution Repeatedly sampling will give a known distribution of means Centred around the true (population) mean ( ) Slide 121 121 Analysis Example: Memory Difference in memory for different words 10 participants given a list of 30 words to learn, and then tested Two types of word Abstract: e.g. love, justice Concrete: e.g. carrot, table Slide 122 122 Slide 123 123 Confidence Intervals This means If we know the mean in our sample We can estimate where the mean in the population ( ) is likely to be Using The standard error (se) of the mean Represents the standard deviation of the sampling distribution of the mean Slide 124 124 Almost 2 SDs contain 95% 1 SD contains 68% Slide 125 125 We know the sampling distribution of the mean t distributed if N < 30 Normal with large N (>30) Asymptotically normal Know the range within means from other samples will fall Therefore the likely range of Slide 126 126 Two implications of equation Increasing N decreases SE But only a bit (SE halfs if N is 400 times bigger) Decreasing SD decreases SE Calculate Confidence Intervals From standard errors 95% is a standard level of CI 95% of samples the true mean will lie within the 95% CIs In large samples: 95% CI = 1.96 SE In smaller samples: depends on t distribution ( df =N-1=9) Slide 127 127 Slide 128 128 Slide 129 129 What is a CI? (For 95% CI): 95% chance that the true (population) value lies within the confidence interval? No; 95% of samples, true mean will land within the confidence interval? Slide 130 130 Significance Test Probability that is a certain value Almost always 0 Doesnt have to be though We want to test the hypothesis that the difference is equal to 0 i.e. 
find the probability of this difference occurring in our sample IF =0 (Not the same as the probability that =0) Slide 131 131 Calculate SE, and then t t has a known sampling distribution Can test probability that a certain value is included Slide 132 132 Other Parameter Estimates Same approach Prediction, slope, intercept, predicted values At this point, prediction and slope are the same Wont be later on One predictor only More complicated with > 1 Slide 133 133 Testing the Degree of Prediction Prediction is correlation of Y with The correlation when we have one IV Use F, rather than t Started with SSE for the mean only This is SS total Divide this into SS residual SS regression SS tot = SS reg + SS res Slide 134 134 Slide 135 135 Back to the house prices Original SSE (SS total ) = 341683 SS residual = 196046 What is left after our model SS regression = 341683 196046= 145636 What our model explains Slide 136 136 Slide 137 137 F = 18.6, df = 1, 25, p = 0.0002 Can reject H 0 H 0 : Prediction is not better than chance A significant effect Slide 138 138 Statistical Significance: What does a p-value (really) mean? Slide 139 139 A Quiz Six questions, each true or false Write down your answers (if you like) An experiment has been done. Carried out perfectly. All assumptions perfectly satisfied. Absolutely no problems. P = 0.01 Which of the following can we say? Slide 140 140 1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means). Slide 141 141 2. You have found the probability of the null hypothesis being true. Slide 142 142 3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means). Slide 143 143 4. You can deduce the probability of the experimental hypothesis being true. Slide 144 144 5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. Slide 145 145 6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. Slide 146 146 OK, What is a p-value Cohen (1994) [a p-value] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe it does (p 997). Slide 147 147 OK, What is a p-value Sorry, didnt answer the question Its The probability of obtaining a result as or more extreme than the result we have in the study, given that the null hypothesis is true Not probability the null hypothesis is true Slide 148 148 A Bit of Notation Not because we like notation But we have to say a lot less Probability P Null hypothesis is true H Result (data) D Given - | Slide 149 149 Whats a P Value P(D|H) Probability of the data occurring if the null hypothesis is true Not P(H|D) (what we want to know) Probability that the null hypothesis is true, given that we have the data = p(H) P(H|D) P(D|H) Slide 150 150 What is probability you are prime minister Given that you are British P(M|B) Very low What is probability you are British Given you are prime minister P(B|M) Very high P(M|B) P(B|M) Slide 151 151 Theres been a murder Someone murdered an instructor (perhaps they talked too much) The police have DNA The police have your DNA They match(!) 
DNA matches 1 in 1,000,000 people Whats the probability you didnt do the murder, given the DNA match (H|D) Slide 152 152 Police say: P(D|H) = 1/1,000,000 Luckily, you have Jeremy on your defence team We say: P(D|H) P(H|D) Probability that someone matches the DNA, who didnt do the murder Incredibly high Slide 153 153 Back to the Questions Haller and Kraus (2002) Asked those questions of groups in Germany Psychology Students Psychology lecturers and professors (who didnt teach stats) Psychology lecturers and professors (who did teach stats) Slide 154 154 1.You have absolutely disproved the null hypothesis (that is, there is no difference between the population means). True 34% of students 15% of professors/lecturers, 10% of professors/lecturers teaching statistics False We have found evidence against the null hypothesis Slide 155 155 2.You have found the probability of the null hypothesis being true. 32% of students 26% of professors/lecturers 17% of professors/lecturers teaching statistics False We dont know Slide 156 156 3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means). 20% of students 13% of professors/lecturers 10% of professors/lecturers teaching statistics False Slide 157 157 4.You can deduce the probability of the experimental hypothesis being true. 59% of students 33% of professors/lecturers 33% of professors/lecturers teaching statistics False Slide 158 158 5.You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. 68% of students 67% of professors/lecturers 73% of professors professors/lecturers teaching statistics False Can be worked out P(replication) Slide 159 159 6.You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. 41% of students 49% of professors/lecturers 37% of professors professors/lecturers teaching statistics False Another tricky one It can be worked out Slide 160 160 One Last Quiz I carry out a study All assumptions perfectly satisfied Random sample from population I find p = 0.05 You replicate the study exactly What is probability you find p < 0.05? Slide 161 161 I carry out a study All assumptions perfectly satisfied Random sample from population I find p = 0.01 You replicate the study exactly What is probability you find p < 0.05? Slide 162 162 Significance testing creates boundaries and gaps where none exist. Significance testing means that we find it hard to build upon knowledge we dont get an accumulation of knowledge Slide 163 163 Yates (1951) "the emphasis given to formal tests of significance... has resulted in... an undue concentration of effort by mathematical statisticians on investigations of tests of significance applicable to problems which are of little or no practical importance... and... it has caused scientific research workers to pay undue attention to the results of the tests of significance... 
and too little to the estimates of the magnitude of the effects they are investigating Slide 164 164 Testing the Slope Same idea as with the mean Estimate 95% CI of slope Estimate significance of difference from a value (usually 0) Need to know the SD of the slope Similar to SD of the mean Slide 165 165 Slide 166 166 Similar to equation for SD of mean Then we need standard error -Similar (ish) When we have standard error Can go on to 95% CI Significance of difference Slide 167 167 Slide 168 168 Confidence Limits 95% CI t dist with N - k - 1 df is 2.31 CI = 5.24 2.31 = 12.06 95% confidence limits Slide 169 169 Significance of difference from zero i.e. probability of getting result if =0 Not probability that = 0 This probability is (of course) the same as the value for the prediction Slide 170 170 Testing the Standardised Slope (Correlation) Correlation is bounded between 1 and +1 Does not have symmetrical distribution, except around 0 Need to transform it Fisher z transformation approximately normal Slide 171 171 95% CIs 0.879 1.96 * 0.38 = 0.13 0.879 + 1.96 * 0.38 = 1.62 Slide 172 172 Transform back to correlation 95% CIs = 0.13 to 0.92 Very wide Because of small sample size Maybe thats why CIs are not reported? Slide 173 173 Using Excel Functions in excel Fisher() to carry out Fisher transformation Fisherinv() to transform back to correlation Slide 174 174 The Others Same ideas for calculation of CIs and SEs for Predicted score Gives expected range of values given X Same for intercept But we have probably had enough Slide 175 One more tricky thing (Dont worry if you dont understand) For means, regression estimates, etc Estimate 1.0000 95% confidence intervals 0.0000, 2.0000 P = 0.05000 They match 175 Slide 176 For correlations, odds ratios, etc No longer match 95% CIs 0.0000, 0.50000 P-value 0.052000 Because of the sampling distribution of the mean Does not depend on the value The sampling distribution of a proportion Does depend on the value More certainty around 0.9 than around 0.00. 176 Slide 177 177 Lesson 4: Introducing Multiple Regression Slide 178 178 Residuals We said Y = b 0 + b 1 x 1 We could have said Y i = b 0 + b 1 x i1 + e i We ignored the i on the Y And we ignored the e i Its called error, after all But it isnt just error Trying to tell us something Slide 179 179 What Error Tells Us Error tells us that a case has a different score for Y than we predict There is something about that case Called the residual What is left over, after the model Contains information Something is making the residual 0 But what? 
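To make "the residual contains information" concrete, here is a minimal Stata sketch in the spirit of the house-price example (it assumes price and sqm are in memory, as in the earlier exercises; the names res and adj are only illustrative):

    regress price sqm
    predict res, residuals         // DATA - MODEL for each house
    quietly summarize price
    generate adj = r(mean) + res   // residual plus the mean: the "adjusted" price
    list price sqm res adj in 1/5

The adjusted value is the price each house would be expected to fetch if all houses were of equal size, which is the sense in which the residual is Y "controlling for" X.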
Slide 180 Slide 181 181 Slide 182 182 The residual (+ the mean) is the expected value of Y If all cases were equal on X It is the value of Y, controlling for X Other words: Holding constant Partialling Residualising (residualised scores) Conditioned on Slide 183 183 Sometimes adjustment is enough on its own Measure performance against criteria Teenage pregnancy rate Measure pregnancy and abortion rate in areas Control for socio-economic deprivation, religion, rural/urban and anything else important See which areas have lower teenage pregnancy and abortion rate, given same level of deprivation Value added education tables Measure school performance Control for initial intake Slide 184 Sqm PricePredictedResidual Adj Value (mean + resid) 239.2605.0 475.77129.23 544.8 177.4400.0 408.78-8.78 406.8 265.3529.5 504.0825.42 441.0 153.4315.0 382.69-67.69 347.9 136.2341.0 364.05-23.05 392.6 187.5525.0 419.66105.34 520.9 280.5585.0 520.5164.49 480.1 203.3430.0 436.79-6.79 408.8 141.1436.0 369.3966.61 482.2 130.3390.0 357.7032.30 447.9 184 Slide 185 185 Control? In experimental research Use experimental control e.g. same conditions, materials, time of day, accurate measures, random assignment to conditions In non-experimental research Cant use experimental control Use statistical control instead Slide 186 186 Analysis of Residuals What predicts differences in crime rate After controlling for socio-economic deprivation Number of police? Crime prevention schemes? Rural/Urban proportions? Something else This is (mostly) what multiple regression is about Slide 187 187 Exam performance Consider number of books a student read (books) Number of lectures (max 20) a student attended (attend) Books and attend as IV, grade as outcome Slide 188 188 First 10 cases Slide 189 189 Use books as IV R =0.492, F =12.1, df =1, 28, p =0.001 b 0 =52.1, b 1 =5.7 (Intercept makes sense) Use attend as IV R =0.482, F =11.5, df =1, 38, p =0.002 b 0 =37.0, b 1 =1.9 (Intercept makes less sense) Slide 190 190 Books 5 43210 Grade (100) 100 90 80 70 60 50 40 30 Slide 191 191 Slide 192 192 Problem Use R 2 to give proportion of shared variance Books = 24% Attend = 23% So we have explained 24% + 23% = 47% of the variance NO!!!!! Slide 193 193 Correlation of books and attend is (unsurprisingly) not zero Some of the variance that books shares with grade, is also shared by attend Look at the correlation matrix BOOKS ATTEND GRADE BOOKSATTENDGRADE 0.44 0.49 0.48 1 1 1 Slide 194 194 I have access to 2 cars My wife has access to 2 cars We have access to four cars? No. We need to know how many of my 2 cars are shared Similarly with regression But we can do this with the residuals Residuals are what is left after (say) books See if residual variance is explained by attend Can use this new residual variance to calculate SS res, SS total and SS reg Slide 195 195 Well. Almost. This would give us correct values for SS Would not be correct for slopes, etc Because assumes that the variables have a causal priority Why should attend have to take what is left from books? Why should books have to take what is left by attend? 
Use OLS again; take variance they share Slide 196 196 Simultaneously estimate 2 parameters b 1 and b 2 Y = b 0 + b 1 x 1 + b 2 x 2 x 1 and x 2 are IVs Shared variance Not trying to fit a line any more Trying to fit a plane Can solve iteratively Closed form equations better But they are unwieldy Slide 197 197 x1x1 x2x2 y 3D scatterplot (2points only) Slide 198 198 x1x1 x2x2 y b0b0 b1b1 b2b2 Slide 199 199 Slide 200 Increasing Power What if the predictors dont correlate? Regression is still good It increases the power to detect effects (More on power later) Less variance left over When do we know the two predictors dont correlate? 200 Slide 201 201 (Really) Ridiculous Equations Slide 202 202 The good news There is an easier way The bad news It involves matrix algebra The good news We dont really need to know how to do it Slide 203 Were not programming computers So we usually dont care Very, very occasionally it helps to know what the computer is doing 203 Slide 204 204 Back to the Good News We can calculate the standardised parameters as B=R xx -1 x R xy Where B is the vector of regression weights R xx -1 is the inverse of the correlation matrix of the independent (x) variables R xy is the vector of correlations of the correlations of the x and y variables Slide 205 Exercise 4.2 205 Slide 206 Exercises Exercise 4.1 Grades data in Excel Exercise 4.2 Repeat in Stata Exercise 4.3 Zero correlation Exercise 4.4 Repeat therapy data Exercise 4.5 PTSD in families. 206 Slide 207 207 Lesson 5: More on Multiple Regression Slide 208 Contents More on parameter estimates Standard errors of coefficients R, R2, adjusted R2 Extra bits Suppressors Decisions about control variables Standardized estimates > 1 Variable entry techniques 208 Slide 209 More on Parameter Estimates 209 Slide 210 210 Parameter Estimates Parameter estimates ( b 1, b 2 b k ) were standardised Because we analysed a correlation matrix Represent the correlation of each IV with the outcome When all other IVs are held constant Slide 211 211 Can also be unstandardised Unstandardised represent the unit (rather than SDs) change in the outcome associated with a 1 unit change in the IV When all the other variables are held constant Parameters have standard errors associated with them As with one IV Hence t-test, and associated probability can be calculated Trickier than with one IV Slide 212 212 Standard Error of Regression Coefficient Standardised is easier R 2 i is the value of R 2 when all other predictors are used as predictors of that variable Note that if R 2 i = 0, the equation is the same as for previous Slide 213 Multiple R 213 Slide 214 214 Multiple R The degree of prediction R (or Multiple R ) No longer equal to b R 2 Might be equal to the sum of squares of B Only if all x s are uncorrelated Slide 215 215 In Terms of Variance Can also think of R2 in terms of variance explained. 
Each IV explains some variance in the outcome The IVs share some of their variance Cant share the same variance twice Slide 216 216 The total variance of Y = 1 Variance in Y accounted for by x 1 r x 1 y 2 = 0.36 Variance in Y accounted for by x 2 r x2y 2 = 0.36 Slide 217 217 In this model R 2 = r yx 1 2 + r yx 2 2 R 2 = 0.36 + 0.36 = 0.72 R = 0.72 = 0.85 But If x 1 and x 2 are correlated No longer the case Slide 218 218 The total variance of Y = 1 Variance in Y accounted for by x 2 r x2y 2 = 0.36 Variance in Y accounted for by x 1 r x1y 2 = 0.36 Variance shared between x 1 and x 2 (not equal to r x1x2 ) Slide 219 219 So We can no longer sum the r 2 Need to sum them, and subtract the shared variance i.e. the correlation But Its not the correlation between them Its the correlation between them as a proportion of the variance of Y Two different ways Slide 220 220 If r x1x2 = 0 r xy = b x 1 Equivalent to r yx1 2 + r yx 2 2 Based on estimates Slide 221 221 r x1x2 = 0 Equivalent to r yx1 2 + r yx2 2 Based on correlations Slide 222 222 Can also be calculated using methods we have seen Based on PRE (predicted value) Based on correlation with prediction Same procedure with >2 IVs Slide 223 223 Adjusted R 2 R 2 is on average an overestimate of population value of R 2 Any x will not correlate 0 with Y Any variation away from 0 increases R Variation from 0 more pronounced with lower N Need to correct R 2 Adjusted R 2 Slide 224 224 1 R 2 Proportion of unexplained variance We multiple this by an adjustment More variables greater adjustment More people less adjustment Calculation of Adj. R 2 Slide 225 225 Slide 226 226 Extra Bits Some stranger things that can happen Counter-intuitive Slide 227 227 Can be hard to understand Very counter-intuitive Definition A predictor which increases the size of the parameters associated with other predictors above the size of their correlations Suppressor variables Slide 228 228 An example (based on Horst, 1941) Success of trainee pilots Mechanical ability ( x 1 ), verbal ability ( x 2 ), success ( y ) Correlation matrix Slide 229 229 Mechanical ability correlates 0.3 with success Verbal ability correlates 0.0 with success What will the parameter estimates be? (Dont look ahead until you have had a guess) Slide 230 230 Mechanical ability b = 0.4 Larger than r ! Verbal ability b = -0.2 Smaller than r !! So what is happening? You need verbal ability to do the mechanical ability test Not actually related to mechanical ability Measure of mechanical ability is contaminated by verbal ability Slide 231 231 High mech, low verbal High mech This is positive (.4) Low verbal Negative, because we are talking about standardised scores (-(-.2) (.2) Your mech is really high you did well on the mechanical test, without being good at the words High mech, high verbal Well, you had a head start on mech, because of verbal, and need to be brought down a bit Slide 232 232 Another suppressor? b1 = b2 = b1 = b2 = Slide 233 233 Another suppressor? b 1 =0.26 b 2 = -0.06 Slide 234 234 And another? b 1 = b 2 = Slide 235 235 And another? b 1 = 0.53 b 2 = -0.47 Slide 236 236 One more? b 1 = b 2 = Slide 237 237 One more? 
b 1 = 0.53 b 2 = 0.47 Slide 238 238 Suppression happens when two opposing forces are happening together And have opposite effects Dont throw away your IVs, Just because they are uncorrelated with the outcome Be careful in interpretation of regression estimates Really need the correlations too, to interpret what is going on Cannot compare between studies with different predictors Think about what you want to know Before throwing variables into the analysis Slide 239 What to Control For? What is the added value of a better college In terms of salary More academic people go to better colleges Control for: Ability? Social class? Mothers education? Parents income? Course? Ethnic group? 239 Slide 240 Decisions about control variables Guided from theory Effect of gender Controlling for hair length and skirt wearing? 240 Slide 241 241 Slide 242 Do dogs make kids healthier? What to control for? Parents weight? Yes: Obese parents are more likely to have obese kids, kids who are thinner, relative to the parents are thinner. No: Dog might make parent thinner. By controlling for parental weight, youre controlling for the effect of dog 242 Slide 243 Slide 244 Dog Kids health Good control vars Bad control vars Slide 245 Dog Kids health Income Parent Weight Child Asthma Rural/Urban? House/apartment? Slide 246 246 Standardised Estimates > 1 Correlations are bounded -1.00 r +1.00 We think of standardised regression estimates as being similarly bounded But they are not Can go >1.00, Hierarchical Regression (Cont) Example (using cars) Parameters from final model: hireg price () (extro) car | Coef. Std. Err. t P>|t| [95% Conf. Interval] extro |.463.1296 3.57 0.001.2004.72626 R2 change statistics R2 change F(df) change p 0.128 12.773(1,36) 0.001 (What is relationship between t and F?) We know the p-value of the R 2 change When there is one predictor in the block What about when theres more than one? 267 Slide 268 Hierarchical Regression (Cont) test isnt exactly what we want But it is the same as what we want Advantage of test You can always use it (I can always remember how it works) 268 Slide 269 (For SPSS) SPSS calls them blocks Enter some variables, click next block Enter more variables Click on Statistics Click on R-squared change 269 Slide 270 Stepwise Regression Add stepwise: prefix With Pr() probability value to be removed from equation Pe() probability value to be entered into equation stepwise, pe(0.05) pr(0.2): reg price sqm lotsize originallis 270 Slide 271 271 A quick note on R 2 R 2 is sometimes regarded as the fit of a regression model Bad idea If good fit is required maximise R 2 Leads to entering variables which do not make theoretical sense Slide 272 Propensity Scores Another method of controlling for variables Ensure that predictors are uncorrelated with one predictor Dont need to control for them 272 Slide 273 x s Uncorrelated? Two cases when x s are uncorrelated Experimental design Predictors are uncorrelated We randomly assigned people to conditions to ensure that was the case Sample weights We can deliberately sample Ensure that they are uncorrelated 273 Slide 274 20 women with college degree 20 women without college degree 20 men with college degree 20 men without college degree Or use post hoc sample weights Propensity weighting Weight to ensure that variables are uncorrelated Usually done to avoid having to control E.g. 
ethnic differences in PTSD symptoms Can incorporate many more control variables 100+ 274 Slide 275 Propensity Scores Race profiling of police stops Same time, place, area, etc www.youtube.com/watch?v=Oot0BOaQTZI 275 Slide 276 276 Critique of Multiple Regression Goertzel (2002) Myths of murder and multiple regression Skeptical Inquirer (Paper B1) Econometrics and regression are junk science Multiple regression models (in US) Used to guide social policy Slide 277 277 More Guns, Less Crime (controlling for other factors) Lott and Mustard: A 1% increase in gun ownership 3.3% decrease in murder rates But: More guns in rural Southern US More crime in urban North (crack cocaine epidemic at time of data) Slide 278 278 Executions Cut Crime No difference between crimes in states in US with or without death penalty Ehrlich (1975) controlled all variables that affect crime rates Death penalty had effect in reducing crime rate No statistical way to decide whos right Slide 279 279 Legalised Abortion Donohue and Levitt (1999) Legalised abortion in 1970s cut crime in 1990s Lott and Whitley (2001) Legalising abortion decreased murder rates by 0.5 to 7 per cent. Its impossible to model these data Controlling for other historical events Crack cocaine (again) Slide 280 Crime is still dropping in the US Despite the recession Levitt says its mysterious, because the abortion effect should be over Some suggest Xboxes, Playstations, etc Netflix, DVRs (Violent movies reduce crime). 280 Slide 281 281 Another Critique Berk (2003) Regression analysis: a constructive critique (Sage) Three cheers for regression As a descriptive technique Two cheers for regression As an inferential technique One cheer for regression As a causal analysis Slide 282 282 Is Regression Useless? Do regression carefully Dont go beyond data which you have a strong theoretical understanding of Validate models Where possible, validate predictive power of models in other areas, times, groups Particularly important with stepwise Slide 283 283 Lesson 6: Categorical Predictors Slide 284 284 Introduction Slide 285 285 Introduction So far, just looked at continuous predictors Also possible to use categorical (nominal, qualitative) predictors e.g. Sex; Job; Religion; Region; Type (of anything) Usually analysed with t-test/ANOVA Slide 286 286 Historical Note But these (t-test/ANOVA) are special cases of regression analysis Aspects of General Linear Models (GLMs) So why treat them differently? 
Fishers fault Computers fault Regression, as we have seen, is computationally difficult Matrix inversion and multiplication Cant do it, without a computer Slide 287 287 In the special cases where: You have one categorical predictor Your IVs are uncorrelated It is much easier to do it by partitioning of sums of squares These cases Very rare in applied research Very common in experimental research Fisher worked at Rothamsted agricultural research station Never have problems manipulating wheat, pigs, cabbages, etc Slide 288 288 In psychology Led to a split between experimental psychologists and correlational psychologists Experimental psychologists (until recently) would not think in terms of continuous variables Still (too) common to dichotomise a variable Too difficult to analyse it properly Equivalent to discarding 1/3 of your data Slide 289 289 The Approach Slide 290 290 The Approach Recode the nominal variable Into one, or more, variables to represent that variable Names are slightly confusing Some texts talk of dummy coding to refer to all of these techniques Some (most) refer to dummy coding to refer to one of them Most have more than one name Slide 291 291 If a variable has g possible categories it is represented by g -1 variables Simplest case: Smokes: Yes or No Variable 1 represents Yes Variable 2 is redundant If it isnt yes, its no Slide 292 292 The Techniques Slide 293 293 We will examine two coding schemes Dummy coding For two groups For >2 groups Effect coding For >2 groups Look at analysis of change Equivalent to ANCOVA Pretest-posttest designs Slide 294 294 Dummy Coding 2 Groups Sometimes called simple coding A categorical variable with two groups One group chosen as a reference group The other group is represented in a variable e.g. 2 groups: Experimental (Group 1) and Control (Group 0) Control is the reference group Dummy variable represents experimental group Call this variable group1 Slide 295 295 For variable group1 1 = Yes, 0=No Slide 296 296 Some data Group is x, score is y Slide 297 297 Control Group = 0 Intercept = Score on Y when x = 0 Intercept = mean of control group Experimental Group = 1 b = change in Y when x increases 1 unit b = difference between experimental group and control group Slide 298 298 Gradient of slope represents difference between means Slide 299 299 Dummy Coding 3+ Groups With three groups the approach is the similar g = 3, therefore g -1 = 2 variables needed 3 Groups Control Experimental Group 1 Experimental Group 2 Slide 300 300 Recoded into two variables Note do not need a 3 rd variable If we are not in group 1 or group 2 MUST be in control group 3 rd variable would add no information (What would happen to determinant?) Slide 301 301 F and associated p Tests H 0 that b 1 and b 2 and associated p-values Test difference between each experimental group and the reference group To test difference between experimental groups Need to rerun analysis (or just do ANOVA with post-hoc tests) Slide 302 302 One more complication Have now run multiple comparisons Increases i.e. probability of type I error Need to correct for this Bonferroni correction Multiply given p -values by two/three (depending how many comparisons were made) Slide 303 303 Effect Coding Usually used for 3+ groups Compares each group (except the reference group) to the mean of all groups Dummy coding compares each group to the reference group. 
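Before the worked examples, a minimal sketch of how the two coding schemes can be set up by hand in Stata (this assumes a three-category variable group and an outcome score, with group 1 as the reference; all of the variable names are illustrative):

    * dummy coding: each coded variable compares one group with the reference group
    generate dummy2 = (group == 2)
    generate dummy3 = (group == 3)

    * effect coding: as dummy coding, but the reference group scores -1 on every coded variable
    generate effect2 = (group == 2) - (group == 1)
    generate effect3 = (group == 3) - (group == 1)

    regress score dummy2 dummy3
    regress score effect2 effect3

Both models fit the data equally well; only the meaning of the individual b coefficients changes.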
Example with 5 groups 1 group selected as reference group Group 5 Slide 304 304 Each group (except reference) has a variable 1 if the individual is in that group 0 if not -1 if in reference group Slide 305 305 Examples Dummy coding and Effect Coding Group 1 chosen as reference group each time Data Slide 306 306 Dummy Groupdummy2dummy3 100 210 301 GroupEffect2effect3 1 210 301 Effect Slide 307 307 Dummy R =0.543, F =5.7, df=2, 27, p =0.009 b 0 = 52.4, b 1 = 3.9, p =0.100 b 2 = 7.7, p =0.002 Effect R =0.543, F =5.7, df =2, 27, p =0.009 b 0 = 56.27, b 1 = 0.03, p=0.980 b 2 = 3.8, p=0.007 Slide 308 308 In Stata Use xi: prefix for dummy coding Use xi3: module for more codings But I dont like it, I do it by hand I dont understand what its doing It makes very long variables And then I cant use test BUT: If doing stepwise, you need to keep the variables together Example: xi: reg outcome contpred i.catpred Put i. in front of categorical predictors This has changed in Stata 11. xi: no longer needed Slide 309 xi: reg salary i.job_description ------------------------------------------------------ salary | Coef. Std. Err. t P>|t| -------------+---------------------------------------- _Ijob_desc~2 | 3100.34 2023.76 1.53 0.126 _Ijob_desc~3 | 36139.2 1228.352 29.42 0.000 _cons | 27838.5 532.4865 52.28 0.000 ------------------------------------------------------ 309 Slide 310 Exercise 6.1 5 golf balls Which is best? 310 Slide 311 311 In SPSS SPSS provides two equivalent procedures for regression Regression GLM GLM will: Automatically code categorical variables Automatically calculate interaction terms Allow you to not understand GLM wont: Give standardised effects Give hierarchical R 2 p-values Slide 312 312 ANCOVA and Regression Slide 313 313 Test (Which is a trick; but its designed to make you think about it) Use bank data (Ex 5.3) Compare the pay rise (difference between salbegin and salary) For ethnic minority and non-minority staff What do you find? Slide 314 314 ANCOVA and Regression Dummy coding approach has one special use In ANCOVA, for the analysis of change Pre-test post-test experimental design Control group and (one or more) experimental groups Tempting to use difference score + t-test / mixed design ANOVA Inappropriate Slide 315 315 Salivary cortisol levels Used as a measure of stress Not absolute level, but change in level over day may be interesting Test at: 9.00am, 9.00pm Two groups High stress group (cancer biopsy) Group 1 Low stress group (no biopsy) Group 0 Slide 316 316 Correlation of AM and PM = 0.493 ( p =0.008) Has there been a significant difference in the rate of change of salivary cortisol? 3 different approaches Slide 317 317 Approach 1 find the differences, do a t-test t = 1.31, df =26, p =0.203 Approach 2 mixed ANOVA, look for interaction effect F = 1.71, df = 1, 26, p = 0.203 F = t 2 Approach 3 regression (ANCOVA) based approach Slide 318 318 IVs: AM and group outcome: PM b 1 (group) = 3.59, standardised b 1 =0.432, p = 0.01 Why is the regression approach better? 
The other two approaches took the difference Assumes that r = 1.00 Any difference from r = 1.00 and you add error variance Subtracting error is the same as adding error Slide 319 319 Using regression Ensures that all the variance that is subtracted is true Reduces the error variance Two effects Adjusts the means Compensates for differences between groups Removes error variance Data is am-pm cortisol Slide 320 320 More on Change If difference score is correlated with either pre-test or post-test Subtraction fails to remove the difference between the scores If two scores are uncorrelated Difference will be correlated with both Failure to control Equal SDs, r = 0 Correlation of change and pre-score =0.707 Slide 321 321 Even More on Change A topic of surprising complexity What I said about difference scores isnt always true Lords paradox it depends on the precise question you want to answer Collins and Horn (1993). Best methods for the analysis of change Collins and Sayer (2001). New methods for the analysis of change More later Slide 322 322 Lesson 7: Assumptions in Regression Analysis Slide 323 323 The Assumptions 1.The distribution of residuals is normal (at each value of the outcome). 2.The variance of the residuals for every set of values for the predictor is equal. violation is called heteroscedasticity. 3.The error term is additive no interactions. 4.At every value of the outcome the expected (mean) value of the residuals is zero No non-linear relationships Slide 324 324 5.The expected correlation between residuals, for any two cases, is 0. The independence assumption (lack of autocorrelation) 6.All predictors are uncorrelated with the error term. 7.No predictors are a perfect linear function of other predictors (no perfect multicollinearity) 8.The mean of the error term is zero. Slide 325 325 What are we going to do Deal with some of these assumptions in some detail Deal with others in passing only look at them again later on Slide 326 326 Assumption 1: The Distribution of Residuals is Normal at Every Value of the outcome Slide 327 327 Look at Normal Distributions A normal distribution symmetrical, bell-shaped (so they say) Slide 328 328 What can go wrong? Skew non-symmetricality one tail longer than the other Kurtosis too flat or too peaked kurtosed Outliers Individual cases which are far from the distribution Slide 329 329 Effects on the Mean Skew biases the mean, in direction of skew Kurtosis mean not biased standard deviation is and hence standard errors, and significance tests Slide 330 330 Examining Univariate Distributions Graphs Histograms Boxplots P-P plots Calculation based methods Slide 331 331 Histograms A and B Slide 332 332 C and D Slide 333 333 E & F Slide 334 334 Histograms can be tricky . 
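These graphical and calculation-based checks are quick to produce in Stata; a minimal sketch, assuming the residuals from a fitted model have already been saved as res (for example with predict res, residuals), although any variable can be inspected the same way:

    histogram res, normal    // histogram with a normal curve overlaid
    graph box res            // boxplot
    pnorm res                // standardised normal probability (P-P) plot
    qnorm res                // quantile plot against the normal
    summarize res, detail    // includes skewness and kurtosis figures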
Slide 335 335 Boxplots Slide 336 336 P-P Plots A & B Slide 337 337 C & D Slide 338 338 E & F Slide 339 339 Skew and Kurtosis statistics Outlier detection statistics Calculation Based Slide 340 340 Skew and Kurtosis Statistics Normal distribution skew = 0 kurtosis = 0 Two methods for calculation Fishers and Pearsons Very similar answers Associated standard error can be used for significance (t-test) of departure from normality not actually very useful Never normal above N = 400 Slide 341 341 Slide 342 342 Outlier Detection Calculate distance from mean z-score (number of standard deviations) deleted z-score that case biased the mean, so remove it Look up expected distance from mean 1% 3+ SDs Slide 343 343 Non-Normality in Regression Slide 344 344 Effects on OLS Estimates The mean is an OLS estimate The regression line is an OLS estimate Lack of normality biases the position of the regression slope makes the standard errors wrong probability values attached to statistical significance wrong Slide 345 345 Checks on Normality Check residuals are normally distributed Draw histogram residuals Use regression diagnostics Lots of them Most arent very interesting Slide 346 346 Regression Diagnostics Residuals Standardised, studentised-deleted look for cases > |3| (?) Influence statistics Look for the effect a case has If we remove that case, do we get a different answer? DFBeta, Standardised DFBeta changes in b Slide 347 347 DfFit, Standardised DfFit change in predicted value Distances measures of distance from the centroid some include IV, some dont Slide 348 348 More on Residuals Residuals are trickier than you might have imagined Raw residuals OK Standardised residuals Residuals divided by SD Slide 349 349 Standardised / Studentised Now we can calculate the standardised residuals SPSS calls them studentised residuals Also called internally studentised residuals Slide 350 350 Deleted Studentised Residuals Studentised residuals do not have a known distribution Cannot use them for inference Deleted studentised residuals Externally studentised residuals Studentized (jackknifed) residuals Distributed as t With df = N k 1 Slide 351 351 Testing Significance We can calculate the probability of a residual Is it sampled from the same population BUT Massive type I error rate Bonferroni correct it Multiply p value by N Slide 352 352 Bivariate Normality We didnt just say residuals normally distributed We said at every value of the outcomes Two variables can be normally distributed univariate, but not bivariate Slide 353 353 Couples IQs male and female Seem reasonably normal Slide 354 354 But wait!! Slide 355 355 When we look at bivariate normality not normal there is an outlier So plot X against Y OK for bivariate but may be a multivariate outlier Need to draw graph in 3+ dimensions cant draw a graph in 3 dimensions But we can look at the residuals instead Slide 356 356 IQ histogram of residuals Slide 357 357 Multivariate Outliers Will be explored later in the exercises So we move on Slide 358 358 What to do about Non- Normality Skew and Kurtosis Skew much easier to deal with Kurtosis less serious anyway Transform data removes skew positive skew log transform negative skew - square Slide 359 359 Transformation May need to transform IV and/or outcome More often outcome time, income, symptoms (e.g. 
depression) all positively skewed can cause non-linear effects (more later) if only one is transformed alters interpretation of unstandardised parameter May alter meaning of variable Some people say that this is such a big problem Never transform May add / remove non-linear and moderator effects Slide 360 360 Change measures increase sensitivity at ranges avoiding floor and ceiling effects Outliers Can be tricky Why did the outlier occur? Error? Delete them. Weird person? Probably delete them Normal person? Tricky. Slide 361 361 You are trying to model a process is the data point outside the process e.g. lottery winners, when looking at salary yawn, when looking at reaction time Which is better? A good model, which explains 99% of your data? (because we threw outliers out) A poor model, which explains all of it (because we keep outliers in) I prefer a good model Slide 362 More on House Prices Zillow.com tracks and predicts house prices In the USA Sometimes detects outliers We dont trust this selling price We havent used it 362 Slide 363 Example in Stata reg salary educ predict res, res hist res gen logsalary= log(salary) reg logsalary educ predict logres, res hist logres 363 Slide 364 Slide 365 Slide 366 But Parameter estimates change Interpretation of parameter estimate is different Exercise 7.0, 7.1 366 Slide 367 Bootstrapping Bootstrapping is very, very cool And very, very clever But very, very simple 367 Slide 368 Bootstrapping When we estimate a test statistic (F or r or t or 2 ) We rely on knowing the sampling distribution Which we know If the distributional assumptions are satisfied 368 Slide 369 Estimate the Distribution Bootstrapping lets you: Skip the bit about distribution Estimate the sampling distribution from the data This shouldnt be allowed Hence bootstrapping But it is 369 Slide 370 How to Bootstrap We resample, with replacement Take our sample Sample 1 individual Put that individual back, so that they can be sampled again Sample another individual Keep going until weve sampled as many people as were in the sample Analyze the data Repeat the process B times Where B is a big number 370 Slide 371 Example 371 Original 1 2 3 4 5 6 7 8 9 10 B1 1 1 3 3 3 3 7 7 9 9 B2 1 2 3 4 4 4 8 8 9 10 B3 2 2 3 2 4 4 6 7 9 9 Slide 372 Analyze each dataset Sampling distribution of statistic Gives sampling distribution 2 approaches to CI or P Semi-parametric Calculate standard error of statistic Call that the standard deviation Does not make assumption about distribution of data Makes assumption about sampling distribution 372 Slide 373 Non-parametric Stata calls this percentile Count. If you have 1000 samples 25 th is lower CI 975 th is upper CI P-value is proportion that cross zero Non-parametric needs more samples 373 Slide 374 Bootstrapping in Stata Very easy: Use bootstrap: (or bs: or bstrap: ) prefix or (Better) use vce(bootstrap) option By default does 50 samples Not enough Use reps() At least 1000 374 Slide 375 Example reg salary salbegin educ, vce(bootstrap, reps(50)) | Observed Bootstrap | Coef. Std. Err. z -----------+--------------------------------- salbegin | 1.672631.0863302 19.37 Again salbegin | 1.672631.0737315 22.69 375 Slide 376 More Reps 1,000 reps Z = 17.31 Again Z = 17.59 10,000 reps 17.23 17.02 376 Slide 377 377 Exercise 7.2, 7.3 Slide 378 378 Assumption 2: The variance of the residuals for every set of values for the predictor is equal. 
Slide 379 379 Heteroscedasticity This assumption is about heteroscedasticity of the residuals Hetero = different Scedastic = scattered We don't want heteroscedasticity we want our data to be homoscedastic Draw a scatterplot to investigate Slide 380 380 Slide 381 381 A scatterplot only works with one IV would need every combination of IVs Easy way round it use predicted values and residuals Plot predicted values against residuals A bit like turning the scatterplot to make the line of best fit flat Slide 382 382 Good no heteroscedasticity Slide 383 383 Bad heteroscedasticity Slide 384 384 Testing Heteroscedasticity White's test 1. Do regression, save residuals. 2. Square residuals 3. Square IVs 4. Calculate interactions of IVs e.g. x1*x2, x1*x3, x2*x3 Slide 385 385 5. Run regression using squared residuals as outcome; IVs, squared IVs, and interactions as predictors 6. Test statistic = N × R², distributed as χ², df = k (for the second regression) Use education and salbegin to predict salary (employee data.sav) R² = 0.113, N = 474, χ² = 53.5, df = 5, p < 0.0001 Automatic in Stata: estat imtest, white Slide 386 386 Plot of Predicted and Residual Slide 387 White's Test as Test of Interest Possible to have a theory that predicts heteroscedasticity Lupien et al., 2006 Heteroscedasticity in relationship of hippocampal volume and age 387 Slide 388 388 Magnitude of Heteroscedasticity Chop data into 5 slices Calculate variance of each slice Check ratio of smallest to largest Less than 5 OK Slide 389
gen slice = 1
replace slice = 2 if pred > 30000
replace slice = 3 if pred > 60000
replace slice = 4 if pred > 90000
replace slice = 5 if pred > 120000
bysort slice: su pred
1: 3954 5: 17116 (Doesn't look too bad, thanks to skew in predictors) 389 Slide 390 390 Dealing with Heteroscedasticity Use Huber-White (robust) estimates Also called sandwich estimates Also called empirical estimates Use survey techniques Relatively straightforward in SAS and Stata, fiddly in SPSS Google: SPSS Huber-White Slide 391 SE can be calculated with the sandwich estimator: Var(b) = (X'X)^-1 X'ΩX (X'X)^-1, where Ω collects the squared residuals Why's it a Sandwich? The two (X'X)^-1 terms are the bread, X'ΩX is the filling 391 Slide 392 Example reg salary educ Standard errors: 204, 2821 reg salary educ, robust Standard errors: 267, 3347 SEs usually go up, can go down 392 Slide 393 393 Heteroscedasticity Implications and Meanings Implications What happens as a result of heteroscedasticity? Parameter estimates are correct not biased Standard errors (hence p-values) are incorrect Slide 394 394 However If there is no skew in predicted scores P-values a tiny bit wrong If skewed, P-values can be very wrong Exercise 7.4 Slide 395 Robust SE Haiku
T-stat looks too good.
Use robust standard errors
significance gone
395 Slide 396 396 Meaning What is heteroscedasticity trying to tell us? Our model is wrong it is misspecified Something important is happening that we have not accounted for e.g.
amount of money given to charity ( given ) depends on: earnings degree of importance the person assigns to the charity ( import ) Slide 397 397 Do the regression analysis R² = 0.60, p < 0.001 seems quite good b0 = 0.24, p = 0.97 b1 = 0.71, p < 0.001 b2 = 0.23, p = 0.031 White's test χ² = 18.6, df = 5, p = 0.002 The plot of predicted values against residuals Slide 398 398 Plot shows heteroscedastic relationship Slide 399 399 Which means the effects of the variables are not additive If you think that what a charity does is important you might give more money how much more depends on how much money you have Slide 400 400 Slide 401 401 One more thing about heteroscedasticity it is the equivalent of homogeneity of variance in ANOVA/t-tests Slide 402 Exercise 7.4, 7.5, 7.6 402 Slide 403 403 Assumption 3: The Error Term is Additive Slide 404 404 Additivity What heteroscedasticity shows you effects of variables need to be additive (assume no interaction between the variables) Heteroscedasticity doesn't always show it to you can test for it, but hard work (same as homogeneity of covariance assumption in ANCOVA) Have to know it from your theory A specification error Slide 405 405 Additivity and Theory Two IVs Alcohol has sedative effect A bit makes you a bit tired A lot makes you very tired Some painkillers have sedative effect A bit makes you a bit tired A lot makes you very tired A bit of alcohol and a bit of painkiller doesn't just make you very tired Effects multiply together, don't add together Slide 406 406 If you don't test for it It's very hard to know that it will happen So many possible non-additive effects Cannot test for all of them Can test for obvious ones In medicine Choose to test for salient non-additive effects e.g. sex, race More on this, when we look at moderators Slide 407 Exercise 7.6 Exercise 7.7 407 Slide 408 408 Assumption 4: At every value of the outcome the expected (mean) value of the residuals is zero Slide 409 409 Linearity Relationships between variables should be linear best represented by a straight line Not a very common problem in social sciences measures are not sufficiently accurate (much measurement error) to make a difference R² too low unlike, say, physics Slide 410 410 Relationship between speed of travel and fuel used Slide 411 411 R² = 0.938 looks pretty good know speed, make a good prediction of fuel BUT look at the chart if we know speed we can make a perfect prediction of fuel used R² should be 1.00 Slide 412 412 Detecting Non-Linearity Residual plot just like heteroscedasticity Using this example very, very obvious usually pretty obvious Slide 413 413 Residual plot Slide 414 414 Linearity: A Case of Additivity Linearity = additivity along the range of the IV Jeremy rides his bicycle harder Increase in speed depends on current speed Not additive, multiplicative MacCallum and Mar (1995). Distinguishing between moderator and quadratic effects in multiple regression. Psychological Bulletin. Slide 415 415 Assumption 5: The expected correlation between residuals, for any two cases, is 0. The independence assumption (lack of autocorrelation) Slide 416 416 Independence Assumption Also: lack of autocorrelation Tricky one often ignored exists for almost all tests All cases should be independent of one another knowing the value of one case should not tell you anything about the value of other cases Slide 417 417 How is it Detected?
Can be difficult need some clever statistics (multilevel models) Better off avoiding situations where it arises Or handling it when it does arise Residual Plots Slide 418 418 Residual Plots Were data collected in time order? If so plot ID number against the residuals Look for any pattern Test for linear relationship Non-linear relationship Heteroscedasticity Slide 419 419 Slide 420 420 How does it arise? Two main ways time-series analyses When cases are time periods weather on Tuesday and weather on Wednesday correlated inflation 1972, inflation 1973 are correlated clusters of cases patients treated by three doctors children from different classes people assessed in groups Slide 421 421 Why does it matter? Standard errors can be wrong therefore significance tests can be wrong Parameter estimates can be wrong really, really wrong from positive to negative An example students do an exam (on statistics) choose one of three questions IV: time outcome: grade Slide 422 422 Result, with line of best fit Slide 423 423 Result shows that people who spent longer in the exam achieve better grades BUT we haven't considered which question people answered we might have violated the independence assumption outcome will be autocorrelated Look again with questions marked Slide 424 424 Now somewhat different Slide 425 425 Now, people that spent longer got lower grades questions differed in difficulty do a hard one, get better grade if you can do it, you can do it quickly Slide 426 Dealing with Non-Independence For time series data Time series analysis (another course) Multilevel models (hard, another course) For clustered data Robust standard errors Generalized estimating equations Multilevel models 426 Slide 427 Cluster Robust Standard Errors Predictor: School size Outcome: Grades Sample: 20 schools 20 children per school What is the N? 427 Slide 428 Robust Standard Errors Sample is: 400 children is it 400? Not really Each child adds information First child in a school adds lots of information about that school 100th child in a school adds less information How much less depends on how similar the children in the school are 20 schools It's more than 20 428 Slide 429 Robust SE in Stata Very easy reg outcome predictor, robust cluster(clusterid) BUT Only to be used where clustering is a nuisance only Only adjusts standard errors, not parameter estimates Only to be used where parameter estimates shouldn't be affected by clustering 429 Slide 430 Example of Robust SE Effects of incentives for attendance at adult literacy class Some students rewarded for attendance Others not rewarded 152 classes randomly assigned to each condition Scores measured at mid term and final 430 Slide 431 Example of Robust SE Naïve reg postscore tx midscore Est: -.6798066, SE: .7218797 Clustered reg postscore tx midscore, robust cluster(classid) Est: -.6798066, SE: .9329929 431 Slide 432 Problem with Robust Estimates Only corrects standard error Does not correct estimate Other predictors must be uncorrelated with predictors of group membership Or estimates wrong Two alternatives: Generalized estimating equations (GEE) Multilevel models 432 Slide 433 Independence + Heteroscedasticity Assumption is that residuals are: Independently and identically distributed i.i.d. Same procedure used for both problems Really, same problem 433 Slide 434 Exercise 7.9, exercise 7.10 434 Slide 435 435 Assumption 6: All predictor variables are uncorrelated with the error term.
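A brief sketch of the clustered-data options from slides 429–432, for reference. Variable names (postscore, tx, midscore, classid) follow the adult-literacy example; the GEE and mixed-model lines assume a reasonably recent Stata (older versions use xtmixed rather than mixed), and the working-correlation choice is illustrative.

* cluster-robust standard errors (modern vce() syntax for the command on slide 429)
regress postscore tx midscore, vce(cluster classid)
* alternative 1: GEE with an exchangeable working correlation within classes
xtset classid
xtgee postscore tx midscore, family(gaussian) link(identity) corr(exchangeable) vce(robust)
* alternative 2: multilevel model with a random intercept for class
mixed postscore tx midscore || classid:

Unlike the cluster-robust line, the GEE and mixed-model approaches do more than patch the standard errors, which is why slide 432 points to them when the parameter estimates themselves may be affected by clustering.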
Slide 436 436 Uncorrelated with the Error Term A curious assumption by definition, the residuals are uncorrelated with the predictors (try it and see, if you like) There are no other predictors that are important That correlate with the error i.e. Have an effect Slide 437 437 Problem in economics Demand increases supply Supply increases wages Higher wages increase demand OLS estimates will be (badly) biased in this case need a different estimation procedure two-stage least squares simultaneous equation modelling Instrumental variables Slide 438 Another Haiku Supply and demand: without a good instrument, not identified. 438 Slide 439 439 Assumption 7: No predictors are a perfect linear function of other predictors no perfect multicollinearity Slide 440 440 No Perfect Multicollinearity IVs must not be linear functions of one another matrix of correlations of IVs is not positive definite cannot be inverted analysis cannot proceed Have seen this with age, age start, time working (cant have all three in the model) also occurs with subscale and total in model at the same time Slide 441 441 Large amounts of collinearity a problem (as we shall see) sometimes not an assumption Exercise 7.11 Slide 442 442 Assumption 8: The mean of the error term is zero. You will like this one. Slide 443 443 Mean of the Error Term = 0 Mean of the residuals = 0 That is what the constant is for if the mean of the error term deviates from zero, the constant soaks it up - note, Greek letters because we are talking about population values Slide 444 444 Can do regression without the constant Usually a bad idea E.g R 2 = 0.995, p < 0.001 Looks good Slide 445 445 Slide 446 446 Lesson 8: Issues in Regression Analysis Things that alter the interpretation of the regression equation Slide 447 447 The Four Issues Causality Sample sizes Collinearity Measurement error Slide 448 448 Causality Slide 449 449 What is a Cause? Debate about definition of cause some statistics (and philosophy) books try to avoid it completely We are not going into depth just going to show why it is hard Two dimensions of cause Ultimate versus proximal cause Determinate versus probabilistic Slide 450 450 Proximal versus Ultimate Why am I here? I walked here because This is the location of the class because Eric Tanenbaum asked me because (I dont know) because I was in my office when he rang because I was a lecturer at Derby University because I saw an advert in the paper because Slide 451 451 I exist because My parents met because My father had a job Proximal cause the direct and immediate cause of something Ultimate cause the thing that started the process off I fell off my bicycle because of the bump I fell off because I was going too fast Slide 452 452 Determinate versus Probabilistic Cause Why did I fall off my bicycle? I was going too fast But every time I ride too fast, I dont fall off Probabilistic cause Why did my tyre go flat? A nail was stuck in my tyre Every time a nail sticks in my tyre, the tyre goes flat Deterministic cause Slide 453 453 Can get into trouble by mixing them together Eating deep fried Mars Bars and doing no exercise are causes of heart disease My Grandad ate three deep fried Mars Bars every day, and the most exercise he ever got was when he walked to the shop next door to buy one (Deliberately?) 
confusing deterministic and probabilistic causes Slide 454 454 Criteria for Causation Association (correlation) Direction of Influence (a b) Isolation (not c a and c b) Slide 455 455 Association Correlation does not mean causation we all know But Causation does mean correlation Need to show that two things are related may be correlation may be regression when controlling for third (or more) factor Slide 456 456 Relationship between price and sales suppliers may be cunning when people want it more stick the price up So no relationship between price and sales Slide 457 457 Until (or course) we control for demand b 1 (Price) = -0.56 b 2 (Demand) = 0.94 But which variables do we enter? Slide 458 458 Direction of Influence Relationship between A and B three possible processes ABABAB C A causes B B causes A C causes A & B Slide 459 459 How do we establish the direction of influence? Longitudinally? Storm Barometer Drops Now if we could just get that barometer needle to stay where it is Where the role of theory comes in (more on this later) Slide 460 460 Isolation Isolate the outcome from all other influences as experimenters try to do Cannot do this can statistically isolate the effect using multiple regression Slide 461 461 Role of Theory Strong theory is crucial to making causal statements Fisher said: to make causal statements make your theories elaborate. dont rely purely on statistical analysis Need strong theory to guide analyses what critics of non-experimental research dont understand Slide 462 462 S.J. Gould a critic says correlate price of petrol and his age, for the last 10 years find a correlation Ha! (He says) that doesnt mean there is a causal link Of course not! (We say). No social scientist would do that analysis without first thinking (very hard) about the possible causal relations between the variables of interest Would control for time, prices, etc Slide 463 463 Atkinson, et al. (1996) relationship between college grades and number of hours worked negative correlation Need to control for other variables ability, intelligence Gould says Most correlations are non- causal (1982, p243) Of course!!!! Slide 464 464 I drink a lot of beer 120 non-causal correlations 16 causal relations karaoke jokes (about statistics) children wake early bathroom headache sleeping equations (beermat) laugh thirsty fried breakfast no beer curry chips falling over lose keys curtains closed Slide 465 465 Abelson (1995) elaborates on this method of signatures A collection of correlations relating to the process the signature of the process e.g. tobacco smoking and lung cancer can we account for all of these findings with any other theory? Slide 466 466 1.The longer a person has smoked cigarettes, the greater the risk of cancer. 2.The more cigarettes a person smokes over a given time period, the greater the risk of cancer. 3.People who stop smoking have lower cancer rates than do those who keep smoking. 4.Smokers cancers tend to occur in the lungs, and be of a particular type. 5.Smokers have elevated rates of other diseases. 6.People who smoke cigars or pipes, and do not usually inhale, have abnormally high rates of lip cancer. 7.Smokers of filter-tipped cigarettes have lower cancer rates than other cigarette smokers. 8.Non-smokers who live with smokers have elevated cancer rates. 
(Abelson, 1995: 183-184) Slide 467 467 In addition, should be no anomalous correlations If smokers had more fallen arches than non- smokers, not consistent with theory Failure to use theory to select appropriate variables specification error e.g. in previous example Predict wealth from price and sales increase price, price increases Increase sales, price increases Slide 468 468 Sometimes these are indicators of the process, not the process itself e.g. barometer stopping the needle wont help e.g. inflation? Indicator or cause of economic health? Slide 469 469 No Causation without Experimentation Blatantly untrue I dont doubt that the sun shining makes us warm Why the aversion? Pearl (2000) says problem is that there is no mathematical operator (e.g. =) No one realised that you needed one Until you build a robot Slide 470 470 AI and Causality A robot needs to make judgements about causality Needs to have a mathematical representation of causality Suddenly, a problem! Doesnt exist Most operators are non-directional Causality is directional Slide 471 471 Sample Sizes How many subjects does it take to run a regression analysis? Slide 472 472 Introduction Social scientists dont worry enough about the sample size required Why didnt you get a significant result? I didnt have a large enough sample Not a common answer, but very common reason More recently awareness of sample size is increasing use too few no point doing the research use too many waste their time Slide 473 473 Research funding bodies Ethical review panels both become more interested in sample size calculations We will look at two approaches Rules of thumb (quite quickly) Power Analysis (more slowly) Slide 474 474 Rules of Thumb Lots of simple rules of thumb exist 10 cases per IV and at least 100 cases Green (1991) more sophisticated To test significance of R 2 N = 50 + 8 k To test significance of slopes, N = 104 + k Rules of thumb dont take into account all the information that we have Power analysis does Slide 475 475 Power Analysis Introducing Power Analysis Hypothesis test tells us the probability of a result of that magnitude occurring, if the null hypothesis is correct (i.e. there is no effect in the population) Doesnt tell us the probability of that result, if the null hypothesis is false (i.e., there actually is an effect in the population) Slide 476 476 According to Cohen (1982) all null hypotheses are false everything that might have an effect, does have an effect it is just that the effect is often very tiny Slide 477 477 Type I Errors Type I error is false rejection of H 0 Probability of making a type I error the significance value cut-off usually 0.05 (by convention) Always this value Not affected by sample size type of test Slide 478 478 Type II errors Type II error is false acceptance of the null hypothesis Much, much trickier We think we have some idea we almost certainly dont Example I do an experiment (random sampling, all assumptions perfectly satisfied) I find p = 0.05 Slide 479 479 You repeat the experiment exactly different random sample from same population What is probability you will find p < 0.05? Answer: 0.5 Another experiment, I find p = 0.01 Probability you find p < 0.05? 
Answer: 0.79 Very hard to work out not intuitive need to understand non-central sampling distributions (more in a minute) Slide 480 480 Probability of type II error = beta (β) same symbol as the population regression parameter (to be confusing) Power = 1 − β Probability of getting a significant result (given that there is a significant result to be found) Slide 481 481
Research Findings vs State of the World:
we find an effect (p < 0.05) when H0 is true (no effect to be found): Type I error, p = α
we find no effect (p > 0.05) when H0 is false (effect to be found): Type II error, p = β
we find an effect (p < 0.05) when H0 is false (effect to be found): power = 1 − β
Slide 482 482 Four parameters in power analysis prob. of Type I error (α) prob. of Type II error (β; power = 1 − β) Effect size size of effect in population N Know any three, can calculate the fourth Look at them one at a time Slide 483 483 Probability of Type I error (α) Usually set to 0.05 Somewhat arbitrary sometimes adjusted because of circumstances rarely because of power analysis May want to adjust it, based on power analysis Slide 484 484 Probability of type II error (β) Power (probability of finding a result) = 1 − β Standard is 80% Some argue for 90% Implication that Type I error is 4 times more serious than type II error adjust ratio with compromise power analysis Slide 485 485 Effect size in the population Most problematic to determine Three ways 1. What effect size would be useful to find? R² = 0.01 - no use (probably) 2. Base it on previous research what have other people found? 3. Use Cohen's conventions small R² = 0.02 medium R² = 0.13 large R² = 0.26 Slide 486 486 Effect size usually measured as f² For R²: f² = R² / (1 − R²) Slide 487 487 For (standardised) slopes: f² = sr² / (1 − R²) Where sr² is the contribution to the variance accounted for by the variable of interest i.e. sr² = R² (with variable) − R² (without) the change in R² in hierarchical regression Slide 488 488 N the sample size usually use the other three parameters to determine this sometimes adjust the other parameters based on this e.g. "You can have 50 participants. No more." Slide 489 489 Doing power analysis With a power analysis program SamplePower, GPower (free), Nquery With Stata command sampsi Which I find very confusing But we'll use it anyway Slide 490 sampsi Limited in usefulness A categorical, two group predictor sampsi 0 0.5, pre(1) r01(0.5) n1(50) sd(1) Find power for detecting an effect of 0.5 When there's one other variable at baseline Which correlates 0.5 50 people in each group When sd is 1.0 490 Slide 491 sampsi Method: ANCOVA relative efficiency = 1.143 adjustment to sd = 0.935 adjusted sd1 = 0.935 Estimated power: power = 0.762 491 Slide 492 GPower Better for regression designs 492 Slide 493 Slide 494 Slide 495 495 Underpowered Studies Research in the social sciences is often underpowered Why? See Paper B11, "the persistence of underpowered studies" Slide 496 496 Extra Reading Power traditionally focuses on p values What about CIs?
Paper B8 Obtaining regression coefficients that are accurate, not simply significant Slide 497 Exercise 8.1 497 Slide 498 498 Collinearity Slide 499 499 Collinearity as Issue and Assumption Collinearity (multicollinearity) the extent to which the predictors are (multiply) correlated If R 2 for any IV, using other IVs = 1.00 perfect collinearity variable is linear sum of other variables regression will not proceed (SPSS will arbitrarily throw out a variable) Slide 500 500 R 2 < 1.00, but high other problems may arise Four things to look at in collinearity meaning implications detection actions Slide 501 501 Meaning of Collinearity Literally co-linearity lying along the same line Perfect collinearity when some IVs predict another Total = S1 + S2 + S3 + S4 S1 = Total (S2 + S3 + S4) rare Slide 502 502 Less than perfect when some IVs are close to predicting other IVs correlations between IVs are high (usually, but not always) high multiple correlations Slide 503 503 Implications Effects the stability of the parameter estimates and so the standard errors of the parameter estimates and so the significance and CIs Because shared variance, which the regression procedure doesnt know where to put Slide 504 504 Sex differences due to genetics? due to upbringing? (almost) perfect collinearity statistically impossible to tell Slide 505 505 When collinearity is less than perfect increases variability of estimates between samples estimates are unstable reflected in the variances, and hence standard errors Slide 506 506 Detecting Collinearity Look at the parameter estimates large standardised parameter estimates (>0.3?), which are not significant be suspicious Run a series of regressions each IV as outcome all other IVs as IVs for each IV Slide 507 507 Sounds like hard work? SPSS does it for us! Ask for collinearity diagnostics Tolerance calculated for every IV Variance Inflation Factor sq. root of amount s.e. has been increased Slide 508 508 Actions What you can do about collinearity no quick fix (Fox, 1991) 1.Get new data avoids the problem address the question in a different way e.g. find people who have been raised as the wrong gender exist, but rare Not a very useful suggestion Slide 509 509 2.Collect more data not different data, more data collinearity increases standard error (se) se decreases as N increases get a bigger N 3.Remove / Combine variables If an IV correlates highly with other IVs Not telling us much new If you have two (or more) IVs which are very similar e.g. 
2 measures of depression, socio- economic status, achievement, etc Slide 510 510 sum them, average them, remove one Many measures use principal components analysis to reduce them 3.Use stepwise regression (or some flavour of) See previous comments Can be useful in theoretical vacuum 4.Ridge regression not very useful behaves weirdly Slide 511 Exercise 8.2, 8.3, 8.4 511 Slide 512 512 Measurement Error Slide 513 513 What is Measurement Error In social science, it is unlikely that we measure any variable perfectly measurement error represents this imperfection We assume that we have a true score T A measure of that score x Slide 514 514 just like a regression equation standardise the parameters T is the reliability the amount of variance in x which comes from T but, like a regression equation assume that e is random and has mean of zero more on that later Slide 515 515 Simple Effects of Measurement Error Lowers the measured correlation between two variables Real correlation true scores ( x * and y *) Measured correlation measured scores ( x and y ) Slide 516 516 x*x* y*y* yx e Reliability of y r yy Reliability of x r xx True correlation of x and y r x*y* Measured correlation of x and y r xy e Slide 517 517 Attenuation of correlation Attenuation corrected correlation Slide 518 518 Example Slide 519 519 Complex Effects of Measurement Error Really horribly complex Measurement error reduces correlations reduces estimate of reducing one estimate increases others because of effects of control combined with effects of suppressor variables exercise to examine this Slide 520 520 Dealing with Measurement Error Attenuation correction very dangerous not recommended Avoid in the first place use reliable measures dont discard information dont categorise Age: 10-20, 21-30, 31-40 Slide 521 521 Complications Assume measurement error is additive linear Additive e.g. weight people may under-report / over- report at the extremes Linear particularly the case when using proxy variables Slide 522 522 e.g. proxy measures Want to know effort on childcare, count number of children 1 st child is more effort than 19 th child Want to know financial status, count income 1 st 1 much greater effect on financial status than the 1,000,000 th. 
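A quick sketch of the attenuation correction implied by the slides above: the correlation between true scores can be estimated as the observed correlation divided by the square root of the product of the two reliabilities. The numbers below are purely illustrative, not taken from the slides.

* disattenuated correlation: r(true) = r(observed) / sqrt(rxx * ryy)
local rxy = 0.30      // observed correlation between x and y (illustrative)
local rxx = 0.70      // reliability of x (illustrative)
local ryy = 0.80      // reliability of y (illustrative)
display "disattenuated r = " `rxy' / sqrt(`rxx' * `ryy')

With these figures an observed r of .30 corresponds to a true-score correlation of about .40; but, as the "Dealing with Measurement Error" slide warns, routinely applying this correction is dangerous, and using reliable measures in the first place is the better option.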
Slide 523 Exercise 8.5 523 Slide 524 524 Lesson 9: Non-Linear Analysis in Regression Slide 525 525 Introduction Non-linear effect occurs when the effect of one predictor is not consistent across the range of the IV Assumption is violated expected value of residuals = 0 no longer the case Slide 526 526 Some Examples Slide 527 527 (Graph: skill plotted against experience) A Learning Curve Slide 528 528 (Graph: performance plotted against arousal) Yerkes-Dodson Law of Arousal Slide 529 529 (Graph: enthusiasm, from suicidal to enthusiastic, plotted against time from 0 to 3.5 hours) Enthusiasm Levels over a Lesson on Regression Slide 530 530 Learning line changed direction once Yerkes-Dodson line changed direction once Enthusiasm line changed direction twice Slide 531 531 Everything is Non-Linear Every relationship we look at is non-linear, for two reasons Exam results cannot keep increasing with reading more books Linear in the range we examine For small departures from linearity Cannot detect the difference Non-parsimonious solution Slide 532 532 Non-Linear Transformations Slide 533 533 Bending the Line Non-linear regression is hard We cheat, and linearise the data Do linear regression Transformations We need to transform the data rather than estimating a curved line which would be very difficult may not work with OLS we can take a straight line, and bend it or take a curved line, and straighten it back to linear (OLS) regression Slide 534 534 We still do linear regression Linear in the parameters Y = b1x + b2x² + ... Can do non-linear regression Non-linear in the parameters e.g. Y = b1·x^b2 (a parameter appears as an exponent) Much trickier Statistical theory either breaks down OR becomes harder Slide 535 535 Linear transformations multiply by a constant add a constant change the slope and the intercept Slide 536 536 (Graph: the lines y = x, y = 2x and y = x + 3) Slide 537 537 Linear transformations are no use alter the slope and intercept don't alter the standardised parameter estimate Non-linear transformation will bend the slope quadratic transformation y = x² one change of direction Slide 538 538 Cubic transformation y = x² + x³ two changes of direction Slide 539 539 To estimate a non-linear regression we don't actually estimate anything non-linear we transform the x-variable to a non-linear version can estimate that straight line represents the curve we don't bend the line, we stretch the space around the line, and make it flat Slide 540 540 Detecting Non-linearity Slide 541 541 Draw a Scatterplot Draw a scatterplot of y plotted against x see if it looks a bit non-linear e.g. Education and beginning salary from bank data with line of best fit Slide 542 542 A Real Example Starting salary and years of education From employee data.sav Slide 543 543 (Residual plot annotations: regions where the expected value of the error (residual) is > 0 and regions where it is < 0) Slide 544 544 Use Residual Plot Scatterplot is only good for one variable use the residual plot (that we used for heteroscedasticity) Good for many variables Slide 545 545 We want points to lie in a nice straight sausage Slide 546 546 We don't want a nasty bent sausage Slide 547 547 Educational level and starting salary Slide 548 548 Carrying Out Non-Linear Regression Slide 549 549 Linear Transformation Linear transformation doesn't change interpretation of slope standardised slope se, t, or p of slope R² Can change ef