Introduction to regression analysis (transcript) · faculty.cse.tamu.edu/slupoli/notes/Statistics/Session 11...
Introduction to regression analysis
Correlations: Measuring and Describing Relationships
The direction of the relationship is measured by the sign of the correlation (+ or -).
o A positive correlation means that the two variables tend to change in the same direction; as one increases, the other also tends to increase (positive slope).
o A negative correlation means that the two variables tend to change in opposite directions; as one increases, the other tends to decrease (negative slope).
The most common form of relationship is a straight-line, or linear, relationship, which is measured by the Pearson correlation (the relationship between 2 variables).
The strength or consistency of the relationship is measured by the numerical value of the correlation. A value of 1.00 indicates a perfect relationship and a value of 0 indicates no relationship.
Correlation Example
What are the two variables? What does the pattern suggest? How would you describe this relationship to someone?
The Pearson Correlation (most common) measures the relationship by comparing the amount of covariability (variation from the relationship between X and Y) to the amount X and Y vary separately.
o notice only 2 variables
o cannot be used with multivariate regression equations
The magnitude of the Pearson correlation ranges from:
o 0.00 (indicating no linear relationship between X and Y) to
o 1.00 (indicating a perfect straight-line relationship between X and Y).
The correlation can be either positive or negative depending on the direction of the relationship
Major weakness of the Pearson Correlation:
o using it alone is not enough in complex equations, since it takes multiple variables to predict, and there are other influences on those other variables
Example: multivariate regression equations
*** there should be a +/- sign between each predictor (IV) variable
Since there are many variables, Pearson may not work with all equations.
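As a sketch of the definition above, the Pearson correlation can be computed directly as covariability divided by how much X and Y vary separately. The data below are made up to show the two extremes:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: the covariability of X and Y divided by
    the amount X and Y vary separately."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Perfect positive linear relationship (positive slope) -> r = 1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
# Perfect negative linear relationship (negative slope) -> r = -1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

Note that this formula only ever takes two variables, which is exactly the limitation the notes point out for multivariate equations.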
Introduction to Linear Equations & Simple Regression
The Pearson correlation measures the degree to which a set of data points form a straight-line relationship.
Regression in general is a statistical procedure that determines the equation for the straight line that best fits a specific set of data.
o this procedure will reflect variables in the multivariate equation
o allows the prediction of the DV with variables that are categorical and/or continuous
The resulting straight line is called the regression line.
There are two types of regressions:
o simple – 2 variables (covered here)
o multivariate – many variables (covered later)
Linear Equations & Simple Regression
Any straight line can be represented by an equation of the form Y = a + bX , where a and b are constants.
The value of b is called the slope constant and determines the direction and degree to which the line is tilted.
The value of a is called the Y-intercept and determines the point where the line crosses the Y-axis.
Example Linear Equation
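A minimal worked example of Y = a + bX, with hypothetical constants a = 40 (Y-intercept) and b = 3 (slope):

```python
# Hypothetical constants: a = 40 (Y-intercept), b = 3 (slope).
a, b = 40, 3

def predict(x):
    """Y = a + bX: start at the intercept, tilt by the slope."""
    return a + b * x

print(predict(0))   # 40  (the line crosses the Y-axis at a)
print(predict(10))  # 70  (each unit of X adds b = 3 to Y)
```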
Understanding the linear regression equation
How well a set of data points fits a straight line can be measured by calculating the distance between the data points and the line.
o Using the formula Ŷ = bX + a, it is possible to find the predicted value Ŷ ("Y hat") for any X.
o The error (distance) between the actual value and the predicted value can be found by: Y – Ŷ
o Because some distances may be negative, each distance should be squared so that all measures are positive.
A closer examination of the best-fit line
The total error between the data points and the line is obtained by squaring each distance and then summing the squared values.
o Total squared error = Σ(Y – Ŷ)²
The regression equation is designed to produce the minimum sum of squared errors.
Best fitting line has the smallest total squared error: the line is called the least-squared-error solution and is expressed as: Ŷ = bX + a
Equation for the regression line:
b = SP / SS_X = Σ(X – X̄)(Y – Ȳ) / Σ(X – X̄)²
a = Ȳ – bX̄
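A sketch of the least-squared-error solution on made-up data: the slope comes from the deviation sums, the intercept forces the line through the means, and the total squared error Σ(Y – Ŷ)² is what this line minimizes.

```python
import numpy as np

def least_squares(x, y):
    """Slope and intercept that minimize total squared error Σ(Y - Ŷ)²."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # slope: covariability of X and Y over the variability of X alone
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
    # intercept: the line passes through the point of means
    a = y.mean() - b * x.mean()
    return a, b

x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 8, 9]                # hypothetical data
a, b = least_squares(x, y)          # a = 0.5, b = 1.7
y_hat = a + b * np.asarray(x)       # predicted values Ŷ
sse = ((np.asarray(y) - y_hat)**2).sum()   # total squared error = 4.3
```

No other straight line through these five points produces a smaller `sse`; that is the sense in which this is the "least-squared-error solution."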
Analysis of a Simple Regression
Finally, the overall significance of the regression equation can be evaluated by computing an F-ratio.
A significant F-ratio indicates that the equation predicts a significant portion of the variability in the Y scores (more than would be expected by chance alone).
o using the best combination of systematic (treatment) and unsystematic (error) variables
To compute the F-ratio, you first calculate a variance (MS) for the predicted variability and for the unpredicted variability.
Overall, we want to be able to say that there is little chance of error and that our treatment was the predictor and had an effect.
p = chance of random luck; the rest is due to our treatment.
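A sketch of the F-ratio computation described above, assuming the standard split of SS_Y into predicted (r²·SS_Y) and unpredicted ((1 − r²)·SS_Y) portions; the numbers passed in are hypothetical:

```python
def regression_f(r2, ss_y, n, k=1):
    """F = MS_predicted / MS_unpredicted for a regression with k predictors.
    MS_predicted   = (r² · SS_Y) / k           (df = k)
    MS_unpredicted = ((1 - r²) · SS_Y) / (n - k - 1)"""
    ms_pred = (r2 * ss_y) / k
    ms_err = ((1 - r2) * ss_y) / (n - k - 1)
    return ms_pred / ms_err

# Hypothetical values: r² = 0.64, SS_Y = 100, n = 12 participants.
print(regression_f(0.64, 100, 12))   # ≈ 17.78
```

A large F means the predicted variance (treatment) dwarfs the unpredicted variance (error), which is exactly the "more than chance alone" claim in the notes.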
Intro. to Multiple Regression w/ 2 Predictor Variables pg 157 (somewhere/some book)
In the same way that linear regression produces an equation that uses values of X to predict values of Y, multiple regression produces an equation that uses two different variables (X1 and X2) to predict values of Y.
The equation is determined by a least-squared-error solution that minimizes the squared distances between the actual Y values and the predicted Y values.
Review - Two classes of variables in the model (p. 158, need to put in Variables notes)
Dependent Variable (DV)
o variable being predicted
o cannot be nominal
 since categorical results are not graphable
o called (many options, all the same thing):
 criterion variable
 outcome variable
 dependent variable
o important that it is in every record/survey/data collection, since we cannot predict without it
 must use listwise deletion
Independent Variables (IV)
o variables being used as predictors
o called (many options, all the same thing):
 predictor variable
 independent variable
 factors
o if missing data, use pairwise data cleaning
o a literature review should uncover possible IVs to make a better prediction!
 take what has been done, add something to the equation to make it a better predictor
 or look for a gap in the literature for research justification
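The listwise-vs-pairwise distinction above can be sketched with pandas (the column names and data are hypothetical): `dropna()` implements listwise deletion, while `DataFrame.corr()` computes each correlation pairwise, using every record where that pair of variables is present.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with a missing value on one predictor.
df = pd.DataFrame({
    "gpa":   [3.1, 2.8, np.nan, 3.9],
    "gre":   [310, 300, 320, 330],
    "score": [75, 70, 80, 90],          # the DV
})

# Listwise deletion: drop any record missing *any* variable —
# required for the DV, since we cannot predict without it.
listwise = df.dropna()                  # keeps 3 of the 4 records

# Pairwise deletion: each correlation uses all records where that
# pair is present, so gre-score still uses all 4 rows.
pairwise_r = df.corr()
```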
Multiple regression basics
We always want to examine a prediction.
o We want to use multiple variables (multivariate) to predict an outcome variable more efficiently.
Statistical goal
o The statistical goal of multiple regression is to produce a model in the form of a linear equation that identifies the best weighted linear combination of independent variables in the study, to optimally predict the criterion variable (p. 160).
Regression weights for the predictors
o They will change based on the variables in the model.
o The variables are relative to the overall model.
o We will look at the model as a whole to predict the total amount of variance explained (THIS IS HUGE).
Fully specified regression model
o It depends on the variables that you selected.
o You can't choose all the variables (parsimony matters).
Multiple Regression (raw score model)
For two predictor variables, the general form of the multiple regression equation is:
Ŷ = a + b1X1 + b2X2 + e
The ability of the multiple regression equation to accurately predict the Y values is measured by first computing the proportion of the Y-score variability that is predicted by the regression equation and the proportion that is not predicted.
For a regression with two predictor variables, the predicted proportion is R² = SS_predicted / SS_Y, and the unpredicted proportion is 1 – R².
As with linear regression, the unpredicted variability (SS and df) can be used to compute a standard error of estimate that measures the standard distance between the actual Y values and the predicted values. In addition, the overall significance of the multiple regression equation can be evaluated with an F-ratio.
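A sketch of the two-predictor raw score model, solved by least squares with NumPy; the GPA/GRE data are made up for illustration:

```python
import numpy as np

# Hypothetical data: GPA (X1) and GRE (X2) predicting an outcome Y.
x1 = np.array([3.1, 2.8, 3.5, 3.9, 3.0])
x2 = np.array([310, 300, 320, 330, 305])
y  = np.array([75, 70, 82, 90, 73])

# Design matrix: a column of 1s for the intercept a, then X1 and X2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squared-error solution for Ŷ = a + b1·X1 + b2·X2
(a, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = a + b1 * x1 + b2 * x2
ss_y = ((y - y.mean())**2).sum()          # total Y-score variability
ss_error = ((y - y_hat)**2).sum()         # unpredicted variability
r_squared = 1 - ss_error / ss_y           # proportion predicted
```

`r_squared` is the proportion of Y-score variability predicted by the equation, and `ss_error` (with its df) is what feeds the standard error of estimate and the F-ratio.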
Standardized model (using z-scores)
This is easier to use and to discuss the results of the regression model because everything is based on units with a mean of 0 and a s.d. of 1.
z_Ypred = β1·z_X1 + β2·z_X2 + … + βn·z_Xn
z_Ypred = (0.31)(z_GPA) + (0.62)(z_GRE)
You need to build, develop, and create a set of variables known as the variate (p. 164). This is the composite of all the independent variables to the right of the equation. Another way of thinking about this side of the equation is as the latent variable (an abstract notion, like academic aptitude).
Understanding the regression models
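The standardized model above can be sketched directly: convert each predictor to z-scores, then form the weighted composite (the variate) using the notes' example weights β1 = 0.31 and β2 = 0.62. The raw GPA/GRE scores are hypothetical:

```python
import numpy as np

def z(scores):
    """z-score transform: mean 0, s.d. 1 units."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / s.std()

# Hypothetical raw scores for the two predictors.
gpa = [3.1, 2.8, 3.5, 3.9, 3.0]
gre = [310, 300, 320, 330, 305]

# z_Ypred = (0.31)(z_GPA) + (0.62)(z_GRE): the weighted composite of
# the IVs — the variate, or "latent variable" (e.g. academic aptitude).
z_y_pred = 0.31 * z(gpa) + 0.62 * z(gre)
```

Because both predictors are on the same 0/1 scale, the β weights are directly comparable: here GRE (0.62) contributes twice the weight of GPA (0.31) to the composite.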