Clase Regresión Lineal

Biostatistics Course

Transcript of Clase Regresión Lineal

MULTIPLE LINEAR REGRESSION
Gerónimo Maldonado-Martínez, RPT, MPH, PhD(c); Diana M. Fernández-Santos, MS, EdD

Just a little poke
Sir Francis Galton: widely promoted regression techniques; cousin of Charles Darwin.

Making Sense of Regression
My emphasis here is on understanding the key elements of regression:
- Requirements
- Application
- Limitations

Regression Is a Powerful Analytical Technique
It enables researchers to do two things:
1. Determine the strength of the relationship (the r-squared value).

Regression Is a Powerful Analytical Technique
2. Determine the impact of the independent variable(s) on the dependent variable.
The regression coefficient is the predicted change in the dependent variable for every one-unit change in the independent variable.
Collectively, the regression coefficients enable researchers to estimate how the dependent variable will change under different scenarios for the independent variables.

Assumptions
- Variables are normally distributed.
- Variables are continuous in nature.
- There is a linear relationship between the independent and dependent variables.
- Homoscedasticity.

Multiple Regression Equation
Y = a + b1X1 + b2X2 + ... + bkXk + e
Where:
Y = predicted value of the dependent variable
a = the constant or Y intercept (where the imaginary line crosses the Y axis)
b = the regression coefficient
X = the independent variable
e = error

Theoretical Linear Model
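A minimal sketch of this equation in practice, fitting simulated data with NumPy's least-squares solver (the data, coefficient values, and variable names below are invented for illustration and are not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n observations with k = 2 independent variables (illustrative values only)
n = 200
X = rng.normal(size=(n, 2))                  # X1, X2
a_true = 5.0                                 # "true" constant (Y intercept)
b_true = np.array([2.0, -1.5])               # "true" regression coefficients
e = rng.normal(scale=1.0, size=n)            # error term
y = a_true + X @ b_true + e                  # Y = a + b1*X1 + b2*X2 + e

# Least-squares fit: prepend a column of 1s so the first estimate is the constant a
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print("a (intercept):", round(coef[0], 2), "| b1, b2:", np.round(coef[1:], 2))
```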

Linear Regression: Model Type
Example: X = size of house, Y = cost of house.
Deterministic model: an equation or set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables.

Probabilistic model: a method used to capture the randomness that is part of a real-life process.

R-Square and Its Companions
- r = the correlation coefficient (overall fit or measure of association; also called Pearson's r, the Pearson product-moment correlation coefficient, or the zero-order coefficient).
- r-square = the proportion of explained variance in the dependent variable (also called the coefficient of determination).
- 1 minus r-square = the proportion of unexplained variance in the dependent variable.
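As a quick illustration of how these three quantities relate, here is a small sketch using NumPy on made-up (x, y) data (the numbers are not from the slides):

```python
import numpy as np

# Made-up (x, y) data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

r = np.corrcoef(x, y)[0, 1]   # Pearson's r: measure of association
r2 = r ** 2                   # coefficient of determination: explained variance
print(f"r = {r:.3f}, r-square = {r2:.3f}, unexplained proportion = {1 - r2:.3f}")
```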

Dirty Interpretation
Example: Researchers look at GRE scores and academic performance in graduate school, as measured by grade point average (GPA).
The hypothesis is that people who have high GRE scores will also have high GPAs.
From an admissions committee perspective, the belief is that GRE scores are a good predictor of future academic success and are, therefore, a good criterion for admission decisions.
The researchers report an r-squared of .20: GRE scores explain 20 percent of the variation in GPAs. This means that 80 percent of the variation in GPAs is explained by other factors.

A quick example

[Scatter plot: X axis = age of planes (5, 10, 20 years); Y axis = plane maintenance costs ($500, $1,000); a line marks the predicted values if the relationship were perfect.]

How It Is Applied
Analysts collect data over the past two years and crunch it. The computer gives these results:
Y = 100 + .020X
The constant is 100: if they do not fly at all, the computer estimates there is still a cost of $100.
The .020 is the regression coefficient, interpreted as: for every mile flown, there is a $.02 change in maintenance costs.

Y = 100 + .020X
Interpreting the regression coefficient:
- For every mile flown, maintenance costs go up by 2 cents.
- For every 100 miles flown, costs go up by $2.
- For every 1,000 miles, costs go up by $20.
- For every 100,000 miles, costs go up by $2,000.

How It Is Applied: Making Maintenance Cost Estimates
They can then solve the equation. Assuming 100,000 miles will be flown, how much will they need to budget for maintenance?
100,000 multiplied by .020 = $2,000
Y = 100 + $2,000 + error
The estimated maintenance cost: $2,100 + error
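The same calculation as a tiny Python sketch, using the constant and coefficient reported on the slide (the function name is ours):

```python
# Fitted equation from the slide: Y = 100 + 0.020 * miles_flown
CONSTANT = 100.0      # estimated fixed cost even if no miles are flown
COEFFICIENT = 0.020   # dollars of maintenance cost per mile flown

def predicted_maintenance_cost(miles_flown: float) -> float:
    """Point estimate only; the error term is not included."""
    return CONSTANT + COEFFICIENT * miles_flown

print(predicted_maintenance_cost(100_000))  # 2100.0 -> budget about $2,100, plus error
```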

Practicality

Simple Regression: Another Example
Hypothesis: If schools have a higher percentage of poor children, then they will have lower test scores.
A regression analysis shows:
- A regression coefficient of -.04
- An r-squared value of .25

Even More Interpretation
- Regression coefficient: for every one-point increase in the percentage of children in poverty within a school, the average test score goes down by .04.
- R-squared: 25% of the variation in test scores is explained by the percentage of children in poverty in the school.
Researchers will ask: what other factors might explain differences in test scores across schools?

Multiple Regression Equation
Y = a + b1X1 + b2X2 + b3X3 + b4X4 + e
Y = dependent variable
X1 = independent variable 1, controlling for X2, X3, X4
X2 = independent variable 2, controlling for X1, X3, X4
X3 = independent variable 3, controlling for X1, X2, X4
X4 = independent variable 4, controlling for X1, X2, X3

Multiple Regression Equation
It has the same basic structure as simple regression:
- Y is still the dependent variable.
- There is still a constant (a) and some amount of error (e) that the computer calculates.
- But there are more Xs to represent the multiple independent variables.

Multiple Regression: An Example
Hypothesis: income is a function of education and seniority.
We suggest that income (the dependent variable) will increase as both education and seniority increase (two independent variables).
Y (Income) = a + b1(education) + b2(seniority) + error

Multiple Regression: Interpretation
Results: Y = 6000 + 400X1 (education) + 200X2 (seniority), R-square = .67
First look at the R-square: it shows a strong relationship, so the analysis can continue.
Partial regression coefficients:
- For every year of education, holding seniority constant, income increases by $400.
- For every year of seniority, holding education constant, income increases by $200.

Multiple Regression: Application
Estimate the income of someone who has 10 years of education and 5 years of seniority. We solve the regression equation:
- Multiply the 10 years of education by the regression coefficient of 400: equals 4,000.
- Multiply the 5 years of seniority by the regression coefficient of 200: equals 1,000.
Put it together with the constant and you have:
Y = 6000 + 400(10) + 200(5) + error
Y = $11,000 + error
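The same plug-in calculation as a short Python sketch, using the fitted coefficients reported on the slide (the function name is ours):

```python
# Fitted equation from the slide: Y = 6000 + 400 * education + 200 * seniority
def predicted_income(education_years: float, seniority_years: float) -> float:
    """Point estimate only; the error term is not included."""
    return 6000 + 400 * education_years + 200 * seniority_years

print(predicted_income(10, 5))  # 6000 + 4000 + 1000 = 11000 -> about $11,000, plus error
```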

Data → Statistics
[Table of example x and y values]

Data → Statistics → Information

Demystifying the Monster
Multivariate regression pitfalls:
- Multicollinearity
- Residual confounding
- Overfitting

Multicollinearity
Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model.
A high variance inflation factor (VIF) is a warning sign (a common rule of thumb is VIF > 10; a VIF of 1 means no collinearity at all). Tolerance is the reciprocal of the VIF, so a low tolerance is equally bad.
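A sketch of how VIF and tolerance can be computed by hand with NumPy; the weight / BMI-style predictors below are simulated purely to illustrate the problem (none of these numbers come from the slides):

```python
import numpy as np

def vif_and_tolerance(X: np.ndarray):
    """For each column of X, regress it on the remaining columns and report
    VIF = 1 / (1 - R^2) and tolerance = 1 - R^2."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        out.append((1 / (1 - r2), 1 - r2))  # (VIF, tolerance)
    return out

# Simulated example: a BMI-like variable that is nearly a rescaled copy of weight
rng = np.random.default_rng(1)
weight = rng.normal(70, 10, 300)
bmi_like = 0.35 * weight + rng.normal(0, 0.5, 300)
age = rng.normal(40, 12, 300)                    # unrelated predictor for contrast
X = np.column_stack([weight, bmi_like, age])
for name, (vif, tol) in zip(["weight", "bmi_like", "age"], vif_and_tolerance(X)):
    print(f"{name}: VIF = {vif:.1f}, tolerance = {tol:.3f}")  # weight & bmi_like: huge VIF, tiny tolerance
```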

Residual Confounding
You cannot completely wipe out confounding simply by adjusting for variables in multiple regression unless those variables are measured with zero error (which is usually impossible).
Example: meat eating and mortality.
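A small simulation of the point above, using NumPy and entirely made-up numbers: adjusting for a confounder that is measured with error only partially removes its influence, leaving a spurious coefficient on the exposure.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Simulated scenario (illustrative only): a confounder Z drives both the
# exposure X and the outcome Y, while X itself has no effect on Y.
z = rng.normal(size=n)                           # true confounder
x = z + rng.normal(size=n)                       # exposure influenced by Z
y = 2.0 * z + rng.normal(size=n)                 # outcome driven only by Z
z_measured = z + rng.normal(scale=1.0, size=n)   # Z measured with error

def adjusted_x_coefficient(covariate: np.ndarray) -> float:
    """Coefficient on X after adjusting for the given covariate."""
    design = np.column_stack([np.ones(n), x, covariate])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

print("adjusting for the true Z: ", round(adjusted_x_coefficient(z), 3))          # ~0, confounding removed
print("adjusting for noisy Z:    ", round(adjusted_x_coefficient(z_measured), 3)) # noticeably above 0
```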

A clean example in PRISM

A real linear regression output
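The Prism screenshots themselves are not reproduced in this transcript. As a rough stand-in for what a typical linear regression output contains, here is a minimal sketch using Python's statsmodels on simulated data (none of it comes from the slides):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=100)   # simulated data

X = sm.add_constant(x)    # adds the intercept column
fit = sm.OLS(y, X).fit()
print(fit.summary())      # coefficients, standard errors, t and p values, R-squared, confidence intervals
```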


References
Kleinbaum, D. G. Applied Regression Analysis and Multivariable Methods, 3rd ed. (2011).