Introduction to regression analysis (faculty.cse.tamu.edu/slupoli/notes/Statistics/Session 11...)


Page 1

Introduction to regression analysis

Correlations: Measuring and Describing Relationships

The direction of the relationship is measured by the sign of the correlation (+ or -).

o A positive correlation means that the two variables tend to change in the same direction; as one increases, the other also tends to increase (positive slope).

o A negative correlation means that the two variables tend to change in opposite directions; as one increases, the other tends to decrease (negative slope).

The most common form of relationship is a straight-line, or linear, relationship, which is measured by the Pearson correlation (the correlation between 2 variables).

The strength or consistency of the relationship is measured by the numerical value of the correlation. A value of 1.00 indicates a perfect relationship and a value of zero indicates no relationship.

Correlation Example

Page 2

What are the two variables? What does the pattern suggest? How would you describe this relationship to someone?

Page 3

The Pearson Correlation (most common) attempts to predict the relationship by comparing the amount of covariability (variation from the relationship between X and Y) to the amount X and Y vary separately.

o notice: only 2 variables
o cannot be used with multivariate regression equations

The magnitude of the Pearson correlation ranges from:
o 0.00 (indicating no linear relationship between X and Y) to
o 1.00 (indicating a perfect straight-line relationship between X and Y).

The correlation can be either positive or negative depending on the direction of the relationship.

Major weakness of the Pearson Correlation:
o using it alone is not enough in complex equations, since it takes multiple variables to predict and there are other influences on those other variables

Example multivariate regression equation:

*** should be +/- between each predictor (IV) variable

Since there are many variables, Pearson may not work with all equations.
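The ratio described above (covariability of X and Y relative to how much X and Y vary separately) can be sketched directly in Python; the data points below are made up for illustration:

```python
# Pearson correlation from its definition: the covariability of X and Y (SP)
# divided by the square root of the product of their separate variabilities.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # covariability
    ssx = sum((x - mx) ** 2 for x in xs)                   # X varies separately
    ssy = sum((y - my) ** 2 for y in ys)                   # Y varies separately
    return sp / math.sqrt(ssx * ssy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 3))   # → 0.775 (positive, fairly strong)
```

Note that the function takes exactly one X and one Y, which mirrors the limitation above: Pearson's r alone cannot handle a multivariate regression equation.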

Page 4

Introduction to Linear Equations & Simple Regression

The Pearson correlation measures the degree to which a set of data points form a straight-line relationship.

Regression in general is a statistical procedure that determines the equation for the straight line that best fits a specific set of data.
o this procedure will reflect the variables in the multivariate equation
o allows the prediction of the DV with variables that are categorical and/or continuous

The resulting straight line is called the regression line. There are two types of regressions:
o simple – 2 variables (covered here)
o multivariate – many variables (covered later)

Linear Equations & Simple Regression

Any straight line can be represented by an equation of the form Y = a + bX , where a and b are constants.

The value of b is called the slope constant and determines the direction and degree to which the line is tilted.

The value of a is called the Y-intercept and determines the point where the line crosses the Y-axis.

Example Linear Equation
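A minimal sketch of the equation above, with a made-up intercept and slope:

```python
# A straight line Y = a + bX: a is where the line crosses the Y-axis,
# b is how much Y changes for each one-unit increase in X.
def line(a, b, x):
    return a + b * x

a, b = 3, 2           # illustrative intercept and slope
print(line(a, b, 0))  # → 3  (the Y-intercept: the line crosses the Y-axis here)
print(line(a, b, 1))  # → 5  (each unit of X adds b = 2 to Y)
```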

Page 5

Understanding the linear regression equation

How well a set of data points fits a straight line can be measured by calculating the distance between the data points and the line.
o Using the formula Ŷ = bX + a, it is possible to find the predicted value Ŷ ("Y hat") for any X.
o The error (distance) between the actual value and the predicted value can be found by: Y – Ŷ
o Because some distances may be negative, each distance should be squared so that all measures are positive.
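The three steps above can be sketched as follows; the data points and coefficients are made up for illustration:

```python
# Predicted values Ŷ = bX + a, the distances Y - Ŷ, and their squares.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]   # actual Y values (illustrative)
a, b = 0.0, 2.0             # illustrative intercept and slope

yhat = [a + b * x for x in xs]                  # predicted Ŷ for each X
errors = [y - yh for y, yh in zip(ys, yhat)]    # Y - Ŷ; some are negative
squared = [e ** 2 for e in errors]              # squaring makes all positive
print(round(sum(squared), 4))                   # → 0.1
```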

Page 6

A closer examination of the best-fit line

The total error between the data points and the line is obtained by squaring each distance and then summing the squared values.
o Total squared error = Ʃ(Y – Ŷ)²

The regression equation is designed to produce the minimum sum of squared errors. The best-fitting line has the smallest total squared error; this line is called the least-squared-error solution and is expressed as: Ŷ = bX + a

Equation for the regression line
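The least-squared-error solution has a closed form for simple regression: b = SP/SSx and a = ȳ − b·x̄. A sketch, using made-up data that happen to lie exactly on a line:

```python
# Least-squares slope and intercept for a simple (2-variable) regression.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # SP: covariability
    ssx = sum((x - mx) ** 2 for x in xs)                   # SSx: X variability
    b = sp / ssx            # slope minimizing the total squared error
    a = my - b * mx         # intercept: line passes through (x̄, ȳ)
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]       # exactly Y = 1 + 2X, so the fit recovers it
print(fit_line(xs, ys))     # → (1.0, 2.0)
```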

Page 7

Analysis of a Simple Regression

Finally, the overall significance of the regression equation can be evaluated by computing an F-ratio.

A significant F-ratio indicates that the equation predicts a significant portion of the variability in the Y scores (more than would be expected by chance alone).
o using the best combination of systematic (treatment) and unsystematic (error) variability (pic 11/5 6:08)

To compute the F-ratio, you first calculate a variance (MS) for the predicted variability and for the unpredicted variability.

Overall we want to be able to say that there is little chance of error and that our treatment was the predictor and had an effect.
o p = chance of random luck; the rest is due to our treatment
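The F-ratio computation described above can be sketched as follows; the data and predictions are made up for illustration:

```python
# F-ratio for a regression: MS for the predicted (systematic) variability
# divided by MS for the unpredicted (error) variability.
def f_ratio(ys, yhats, n_predictors=1):
    n = len(ys)
    my = sum(ys) / n
    ss_total = sum((y - my) ** 2 for y in ys)                  # total variability
    ss_resid = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))  # unpredicted (error)
    ss_reg = ss_total - ss_resid                               # predicted (treatment)
    ms_reg = ss_reg / n_predictors                             # df = number of predictors
    ms_resid = ss_resid / (n - n_predictors - 1)               # df = n - k - 1
    return ms_reg / ms_resid

ys = [2.1, 3.9, 6.2, 7.8]
yhats = [2.0, 4.0, 6.0, 8.0]          # predictions from some fitted line
print(round(f_ratio(ys, yhats), 1))   # → 376.0 (a large F: little is left to chance)
```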

Page 8

Intro. to Multiple Regression w/ 2 Predictor Variables, pg 157 (somewhere/some book)

In the same way that linear regression produces an equation that uses values of X to predict values of Y, multiple regression produces an equation that uses two different variables (X1 and X2) to predict values of Y.

The equation is determined by a least-squared-error solution that minimizes the squared distances between the actual Y values and the predicted Y values.

(pic here from notes and 6:17 on 11/5) ??? not sure this matches. Look back at his ppt.

Page 9

Review - Two classes of variables in the model (p. 158, need to put in Variables notes)

Dependent Variables (DV)
o variable being predicted
o cannot be nominal, since categorical results are not graphable
o called (many options, all the same thing):
  criterion variable
  outcome variable
  dependent variable
o important that it is in every record/survey/data collection, since we cannot predict without it
  must use listwise data cleaning

Independent Variables (IV)
o variables being used as predictors
o called (many options, all the same thing):
  predictor variable
  independent variable
  factor
o if missing data, use pairwise data cleaning
o a literature review should uncover possible IVs to make a better prediction!
  take what has been done, add something to the equation to make it a better predictor
  or look for a gap in the literature for research justification

Page 10

Multiple regression basics

We always want to examine a prediction.
o We want to use multiple variables (multivariate) to predict an outcome variable more efficiently.

Statistical goal
o The statistical goal of multiple regression is to produce a model, in the form of a linear equation, that identifies the best weighted linear combination of independent variables in the study to optimally predict the criterion variable (p. 160).

Regression weights for the predictors
o They will change based on the variables in the model.
o The variables are relative to the overall model.
o We will look at the model as a whole to predict the total amount of variance explained (THIS IS HUGE).

Fully specified regression model
o It depends on the variables that you selected.
o You can't choose all the variables (parsimony matters).

Multiple Regression (raw score model)

For two predictor variables, the general form of the multiple regression equation is:

Y = a + b1X1 + b2X2 + e

The ability of the multiple regression equation to accurately predict the Y values is measured by first computing the proportion of the Y-score variability that is predicted by the regression equation and the proportion that is not predicted.
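For exactly two predictors, the least-squared-error coefficients have a standard closed form (from the normal equations). A pure-Python sketch, with made-up data constructed to lie exactly on a plane so the fit recovers the known weights:

```python
# Two-predictor regression Ŷ = a + b1·X1 + b2·X2 via the textbook
# normal-equation formulas for the 2-IV case.
def fit_two_predictors(x1s, x2s, ys):
    n = len(ys)
    m1, m2, my = sum(x1s) / n, sum(x2s) / n, sum(ys) / n
    ss1 = sum((x - m1) ** 2 for x in x1s)                       # SS of X1
    ss2 = sum((x - m2) ** 2 for x in x2s)                       # SS of X2
    sp12 = sum((p - m1) * (q - m2) for p, q in zip(x1s, x2s))   # SP of X1, X2
    sp1y = sum((p - m1) * (y - my) for p, y in zip(x1s, ys))    # SP of X1, Y
    sp2y = sum((q - m2) * (y - my) for q, y in zip(x2s, ys))    # SP of X2, Y
    denom = ss1 * ss2 - sp12 ** 2
    b1 = (ss2 * sp1y - sp12 * sp2y) / denom
    b2 = (ss1 * sp2y - sp12 * sp1y) / denom
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [1 + 2 * p + 0.5 * q for p, q in zip(x1, x2)]  # plane Y = 1 + 2·X1 + 0.5·X2
print(fit_two_predictors(x1, x2, y))               # → (1.0, 2.0, 0.5)
```

Note how each weight depends on the other predictor through sp12, which is why the regression weights change when variables are added to or removed from the model.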

Page 11

For a regression with two predictor variables:

As with linear regression, the unpredicted variability (SS and df) can be used to compute a standard error of estimate that measures the standard distance between the actual Y values and the predicted values. In addition, the overall significance of the multiple regression equation can be evaluated with an F-ratio.

Standardized model (using z-scores)

This is easier to use when discussing the results of the regression model because everything is based on units with a mean of 0 and a s.d. of 1.

Ŷz = β1·Xz1 + β2·Xz2 + … + βn·Xzn
Ŷz = (0.31)(GPAz) + (0.62)(GREz)

You need to build, develop, and create a set of variables known as the variate (p. 164). This is the composite of all the independent variables to the right of the equation. Another way of thinking about this side of the equation is as the latent variable (an abstract notion, like academic aptitude).

Understanding the regression models
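The standardized model can be sketched as follows. The 0.31 and 0.62 weights are the example beta weights from the notes; the raw GPA and GRE scores are made up for illustration:

```python
# Standardized (beta-weight) form: z-score each variable so everything is in
# units of mean 0 and s.d. 1, then combine the z-scores with the β weights.
import math

def z_scores(xs):
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))  # sample s.d.
    return [(x - m) / sd for x in xs]

gpa_z = z_scores([3.1, 3.5, 2.8, 3.9, 3.2])   # illustrative raw GPAs
gre_z = z_scores([300, 310, 295, 325, 305])   # illustrative raw GRE scores

b1, b2 = 0.31, 0.62                            # beta weights from the notes
y_pred_z = [b1 * g + b2 * r for g, r in zip(gpa_z, gre_z)]
print([round(v, 2) for v in y_pred_z])         # predicted Y, also in z-units
```

Because both predictors are on the same (s.d.) scale, the betas are directly comparable: here GRE (0.62) contributes twice the weight of GPA (0.31) to the composite, which is one way to think about the variate.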