Simple Linear Regression-Part 1

  • 8/11/2019 Simple Linear Regression-Part 1

    10/8/2014 Slide 1

    Linear regression provides additional statistical information about the relationship between two quantitative variables.

    The coefficient of determination, R², indicates the percentage of variance in the dependent variable that is accounted for by variability in the independent variable.

    The regression equation is the formula for the trend or fit line, which enables us to predict the dependent variable for any given value of the independent variable.

    The regression equation has two parts: the intercept and the slope.

    The intercept is the point on the vertical axis where the regression line crosses, that is, the predicted value of the dependent variable when the independent variable is zero. It generally does not provide useful information.

    10/8/2014 Slide 2

    The slope is the change in the dependent variable for a one-unit change in the independent variable. The slope tells us the direction and magnitude of change.

    The regression line represents the predicted value of the dependent variable for each value of the independent variable.

    The differences between the predicted values and the actual values of the dependent variable are called residuals. Residuals are the errors that we cannot predict.

    Residuals provide us with an important diagnostic tool for determining whether linear regression is an appropriate statistical technique for analyzing the relationship between two quantitative variables.
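
    The slope, intercept, predicted values, and residuals described above can be sketched in a few lines of Python. This is a minimal illustration using the standard least-squares formulas; the function name fit_line and the five data points are made up for the example and do not come from the course data set.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-products of deviations / sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical illustrative data (not from GSS2000R)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

intercept, slope = fit_line(xs, ys)

# Predicted value of the dependent variable for each observed x
predicted = [intercept + slope * x for x in xs]

# Residual = actual value minus predicted value
residuals = [y - p for y, p in zip(ys, predicted)]

print(intercept, slope)
print(residuals)
```

    For a least-squares line with an intercept, the residuals sum to zero (up to floating-point rounding), which is a quick sanity check on the fit.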

    10/8/2014 Slide 3

    Linear regression requires us to satisfy three assumptions about the distributions of the two quantitative variables:

    - No outliers
    - A linear relationship between the variables
    - Equal variance of the residuals across predicted values

    The evaluation of the conformity of the analysis to these assumptions is generally based upon visual analysis of two plots: the scatterplot of the dependent variable by the independent variable, and the residual plot, a scatterplot of the residuals on the vertical axis by the predicted values on the horizontal axis.

    Numeric results are also available to evaluate each of these assumptions.
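
    As a rough sketch of what a residual plot contains, the code below builds the (predicted value, residual) pairs that would be plotted and compares the spread of the residuals in the lower and upper halves of the predicted values. The data, the function names, and the half-split comparison are assumptions made for illustration only; this is not the numeric diagnostic that SPSS reports.

```python
def residual_plot_points(xs, ys):
    """Return (predicted, residual) pairs for a least-squares line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    points = []
    for x, y in zip(xs, ys):
        predicted = intercept + slope * x
        points.append((predicted, y - predicted))
    return points


def spread(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hypothetical data whose scatter widens as x increases
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 4.1, 5.9, 8.2, 9.0, 13.0, 12.0, 18.0]

# Sort by predicted value, then split into a lower and an upper half
points = sorted(residual_plot_points(xs, ys))
half = len(points) // 2
low_spread = spread([resid for _, resid in points[:half]])
high_spread = spread([resid for _, resid in points[half:]])

print(low_spread, high_spread)
```

    A much larger spread in one half than in the other is the numeric counterpart of a funnel shape in the residual plot, which would suggest unequal variance.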

    10/8/2014 Slide 4

    If we do not satisfy the assumptions, we can:

    - Report the results, noting the limitations produced by violation of the assumptions
    - Report the results, ignoring the violations of assumptions, using the argument of robustness to violations
    - Re-express one or both variables
    - Omit the outliers
    - Dichotomize the independent variable, splitting the values at the mean, median, or some other logical value

    Simple linear regression refers to analysis with one independent variable. Multiple regression refers to analysis with more than one independent variable.

    Visualizing Linear Regression

    SW388R6

    Data Analysis and Computers I

    Visualizing Regression Analysis - 1

    While we will base our problem solving on numeric statistical results computed by SPSS, we can use a scatterplot to demonstrate regression graphically.

    We will use the variable "highest year of school completed" [educ] as the independent variable and "occupational prestige score" [prestg80] as the dependent variable from the GSS2000R data set to demonstrate the relationship graphically.

    Visualizing Regression Analysis - 2

    The dots in the body of the chart represent the cases in the distribution.

    The independent variable is plotted on the x-axis, or the horizontal axis.

    The dependent variable is plotted on the y-axis, or the vertical axis.

    A scatterplot of prestg80 by educ produced by SPSS.

    Visualizing Regression Analysis - 3

    I have drawn a green horizontal line through the mean of prestg80 (44.17).

    NOTE: the plots were created in SPSS by adding features to the default plot.

    The differences between the mean line and the dots (shown as pink lines) are the deviations.

    The sum of the squared deviations is the measure of total error when the mean is used as the estimated score for each case.
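
    The total error around the mean can be computed directly. The sketch below uses a handful of made-up scores (not prestg80 values) to show the deviations from the mean and their squared sum.

```python
# Hypothetical scores, chosen so the arithmetic is easy to follow
scores = [2, 4, 5, 4, 5]

mean_score = sum(scores) / len(scores)

# Deviation = each score minus the mean (the "pink lines" in the plot)
deviations = [s - mean_score for s in scores]

# Total error when the mean is used as the estimate for every case
total_sum_of_squares = sum(d ** 2 for d in deviations)

print(mean_score)
print(deviations)
print(total_sum_of_squares)
```

    The deviations themselves always sum to zero, which is why they are squared before being added.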

    Visualizing Regression Analysis - 4

    A regression line and the regression equation are added in red to the scatterplot.

    The pink deviations from the mean have been replaced with the orange deviations from the regression line. Deviations between cases and the regression line are called residuals.

    Visualizing Regression Analysis - 5

    The existence of a relationship between the variables is supported when the sum of the squared orange residuals is significantly less than the sum of the squared pink deviations.

    Recall that both deviations and residuals can be referred to as errors. If there is a relationship, we can characterize it as a reduction in error.

    Visualizing Regression Analysis - 6

    While it is difficult for us to square and sum deviations and residuals by hand, SPSS regression output provides us with the answer.

    The sum of the squared pink deviations from the mean is the Total Sum of Squares in the ANOVA table (49104.91).

    The sum of the squared orange residuals from the regression line is the Residual Sum of Squares in the ANOVA table (37086.80).

    Visualizing Regression Analysis - 7

    The difference between the Total Sum of Squares and the Residual Sum of Squares is the Regression Sum of Squares.

    The Regression Sum of Squares is the amount of error that can be eliminated by using the regression equation to estimate values of prestg80 instead of the mean of prestg80.

    The Regression Sum of Squares in the ANOVA table is 12018.11.
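
    The relationship among the three sums of squares can be checked on a small example. The sketch below uses made-up data (not GSS2000R) and computes each sum of squares independently, so that the identity Total = Regression + Residual is verified rather than assumed.

```python
# Hypothetical illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in xs]

# Total: squared deviations of the actual values from the mean
total_ss = sum((y - mean_y) ** 2 for y in ys)

# Residual: squared deviations of the actual values from the regression line
residual_ss = sum((y - p) ** 2 for y, p in zip(ys, predicted))

# Regression: squared deviations of the predicted values from the mean
regression_ss = sum((p - mean_y) ** 2 for p in predicted)

print(total_ss, residual_ss, regression_ss)
```

    With the values reported in the ANOVA table, the same identity gives 49104.91 - 37086.80 = 12018.11.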

    Visualizing Regression Analysis - 8

    We can compute the proportion of error that was reduced by the regression by dividing the Regression Sum of Squares by the Total Sum of Squares:

    12018.11 ÷ 49104.91 = 0.245
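
    The division can be reproduced from the sums of squares quoted from the ANOVA table. The sketch below also shows the equivalent form, one minus the share of error that remains; the variable names are chosen for this example.

```python
# Sums of squares as reported in the ANOVA table on the earlier slides
total_ss = 49104.91
residual_ss = 37086.80
regression_ss = 12018.11

# Proportion of error reduced by the regression
r_squared = regression_ss / total_ss

# Equivalent form: 1 minus the proportion of error that remains
r_squared_alt = 1 - residual_ss / total_ss

print(round(r_squared, 3))
print(round(r_squared_alt, 3))
```

    Both forms agree because the Regression Sum of Squares is defined as the Total Sum of Squares minus the Residual Sum of Squares.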

    Visualizing Regression Analysis - 9

    The reduction in error that we computed (0.245) is equal to the R Square that SPSS provides in the Model Summary table.

    R² is the coefficient of determination, which is usually characterized as:

    - the proportion of variance in the dependent variable explained by the independent variable, or
    - the reduction in error (or increase in accuracy).

    In multiple regression, the symbol for the coefficient of determination is R². In simple linear regression, the symbol is r².

    Visualizing Regression Analysis - 10

    The correlation coefficient, Multiple R, is defined as the positive square root of R Square. SPSS uses the same terminology for Multiple Regression and Simple Linear Regression.

    This can be misleading in Simple Linear Regression, where the correlation between the two variables, r, can have a negative sign for an inverse relationship. While Multiple R will have the same absolute value as r in Simple Linear Regression, we should look at Beta in the table of coefficients to make certain that we are interpreting the direction of the relationship correctly.
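
    The sign issue is easy to see with a small inverse relationship. In the sketch below the data are made up, r is computed from the usual Pearson formula, and Multiple R is taken as the positive square root of r squared, following the definition above.

```python
import math

# Hypothetical data with an inverse relationship: y tends to fall as x rises
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 5, 2, 2]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Pearson correlation: keeps the sign of the relationship
r = sxy / math.sqrt(sxx * syy)

# Multiple R: positive square root of R Square, so the sign is lost
multiple_r = math.sqrt(r ** 2)

print(r)
print(multiple_r)
```

    Here r is negative while Multiple R is positive with the same magnitude, which is why the direction has to be read from the coefficients rather than from Multiple R.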

    Visualizing Regression Analysis - 11

    The regression equation is based on the Unstandardized Coefficients (B) in the table of Coefficients.

    The B coefficient labeled (Constant) is the intercept. The B coefficient for the variable educ is the slope of the regression line.

    The regression equation for the relationship between prestg80 and educ is:

    prestg80 = 12.928 + 2.359 × educ

    Visualizing Regression Analysis - 12

    The Standardized Coefficients (Beta) in the table of Coefficients are the regression coefficients for the relationship between the standardized dependent variable (z-scores) and the standardized independent variable (z-scores).

    Since standardizing variables removes the unit of measurement from the coefficients, we can compare the Beta coefficients to interpret the relative importance of each independent variable in Multiple Regression.

    In Simple Linear Regression, Beta will be equal to r, the correlation coefficient. Multiple R, r, and Beta all have the same absolute value, though Multiple R will be positive even when r and Beta are negative.
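
    The claim that Beta equals r in Simple Linear Regression can be checked numerically. The sketch below uses made-up data, converts both variables to z-scores, and fits the slope on the standardized values; the helper names z_scores and slope_of are chosen for this example.

```python
import math
from statistics import mean, stdev


def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def slope_of(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


# Hypothetical data (not from GSS2000R)
xs = [8, 10, 12, 12, 14, 16]
ys = [30, 28, 41, 45, 50, 62]

# Unstandardized slope (B) on the raw variables
b = slope_of(xs, ys)

# Standardized slope (Beta): the slope fitted on z-scores
beta = slope_of(z_scores(xs), z_scores(ys))

# Pearson correlation computed independently
mx, my = mean(xs), mean(ys)
r = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / math.sqrt(sum((x - mx) ** 2 for x in xs)
                 * sum((y - my) ** 2 for y in ys)))

print(b, beta, r)
```

    Beta can also be obtained from B without refitting, as B multiplied by the ratio of the standard deviation of the independent variable to that of the dependent variable.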

    Visualizing Regression Analysis - 13

    The sign of the Beta coefficient, as well as the sign of the B coefficient, tells us the direction of the relationship.

    If the coefficients are positive, the relationship is characterized as direct or positive, meaning that higher values of the dependent variable are associated with higher values of the independent variable.

    If the coefficients are negative, the relationship is characterized as inverse or negative, meaning that lower values of the dependent variable are associated with higher values of the independent variable.

    Visualizing Regression Analysis - 14

    The regression line represents the estimated value of prestg80 for every value of educ.

    To obtain the estimate, we draw a vertical line from the value on the x-axis up to the point where it intersects the regression line. We then draw a horizontal line from the intersection point to the y-axis. The point where that line meets the y-axis is the estimated value for the dependent variable.

    Visualizing Regression Analysis - 15

    If we draw a vertical line from the educ value of 5 to the regression line and then a horizontal line to the vertical axis, we see that the estimated value for prestg80 is about 25.

    We can compute the value more precisely by substituting into the regression equation (coefficients rounded to two decimals):

    prestg80 = 12.93 + 2.36 × 5 = 24.73
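
    The substitution can be wrapped in a small function. The sketch below uses the unstandardized coefficients quoted earlier from the Coefficients table (12.928 and 2.359); the function name predict_prestige is made up for this example.

```python
# Unstandardized coefficients from the Coefficients table
INTERCEPT = 12.928   # B for (Constant)
SLOPE = 2.359        # B for educ


def predict_prestige(educ):
    """Estimated prestg80 for a given number of years of school completed."""
    return INTERCEPT + SLOPE * educ


# Estimate for 5 years of education using the three-decimal coefficients
estimate = predict_prestige(5)

# The same estimate using coefficients rounded to two decimals
rounded_estimate = round(12.93 + 2.36 * 5, 2)

print(round(estimate, 3))
print(rounded_estimate)
```

    The two results differ slightly (24.723 versus 24.73) only because of rounding in the coefficients; both agree with the value of about 25 read from the plot.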

    Examples of Residual Plots

    Slide 23

    Example of a null plot, which conforms to the assumptions

    Slide 24

    Example of violation of linearity assumption

    Slide 25

    Example of violation of equal variance

    Slide 26

    Example of presence of outliers