correlation and regression

INTRODUCTION TO STATISTICAL THEORY FOR SCIENTIST CORRELATION AND REGRESSION


Page 2: correlation and regression

• Typical questions: "Are two or more variables linearly related? If so, what is the strength of the relationship?"

• The numerical measure used to determine whether two or more variables are linearly related, and how strong the relationship is, is called the CORRELATION COEFFICIENT

• There are two types of relationship: SIMPLE RELATIONSHIP and MULTIPLE RELATIONSHIP.

Page 3: correlation and regression

CORRELATION

• Statistical method used to determine whether a linear relationship between variables exists

REGRESSION

• Statistical method used to describe the nature of the relationship between variables: positive or negative, linear or nonlinear

SIMPLE REGRESSION

• Has two variables: an independent (explanatory) variable and a dependent (response) variable

MULTIPLE REGRESSION

• Two or more independent variables are used to predict one dependent variable

POSITIVE RELATIONSHIP

• Both variables increase or decrease at the same time

NEGATIVE RELATIONSHIP

• As one variable increases, the other variable decreases, and vice versa.

Page 4: correlation and regression

Scatter plots and Correlation

• To find the relationship between two different variables, data need to be collected. Example: the relationship between the number of hours of study and exam grades

• The independent variable is the variable that can be controlled or manipulated; the dependent variable cannot be controlled or manipulated

• The dependent and independent variables can be plotted on a graph called a scatter plot

• The independent variable x is plotted on the horizontal axis and the dependent variable y on the vertical axis

• A scatter plot is a visual way to show the relationship between two variables

Page 5: correlation and regression

Company    Cars (in ten thousands)    Revenue (in billions)

A          63                         7

B          29                         3.9

C          20.8                       2.1

D          19.1                       2.8

E          13.4                       1.4

F          8.5                        1.5

SCATTER PLOT: a graph of the ordered pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y

Page 6: correlation and regression

Correlation

• The correlation explained here is the Pearson Product Moment Correlation Coefficient (PPMC), named after Karl Pearson

• The value of the correlation coefficient ranges from -1 to +1.

• A value close to +1 shows that there is a strong positive correlation, while a value close to -1 shows that there is a strong negative correlation

• A value of r close to zero means that there is no linear relationship between the variables, or only a weak one.

The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two quantitative variables. The symbol for the sample correlation coefficient is r; the symbol for the population correlation coefficient is ρ (rho)

Page 7: correlation and regression

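FORMULA FOR CORRELATION COEFFICIENT r

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

where n is the number of data pairs

• Rounding rule for correlation: round the value of r to three decimal places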
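As an illustration, the computational form of the PPMC can be applied to the six car-rental data pairs from the scatter plot slide. This is a minimal sketch: the function name `pearson_r` and the variable names are chosen here for illustration and do not come from the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r (computational PPMC formula)."""
    n = len(xs)                                   # number of data pairs
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Data from the company table: cars (in ten thousands), revenue (in billions)
cars = [63, 29, 20.8, 19.1, 13.4, 8.5]
revenue = [7, 3.9, 2.1, 2.8, 1.4, 1.5]

r = pearson_r(cars, revenue)
print(round(r, 3))  # rounded to three decimal places: 0.982
```

A value this close to +1 suggests a strong positive linear relationship between the number of cars and revenue.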

Page 8: correlation and regression

SIGNIFICANCE OF THE CORRELATION COEFFICIENT

• State the hypotheses: H₀: ρ = 0 and H₁: ρ ≠ 0

• Find the critical values: use the t distribution with α and degrees of freedom = n - 2

• Compute the test value

• Make the decision and summarize the results

- When H₀ is rejected, there is a significant correlation between the variables
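The steps above can be sketched in a few lines. Several things here are assumptions not stated on the slide: the test value is taken to be the standard statistic t = r·sqrt((n - 2)/(1 - r²)), the inputs r = 0.982 and n = 6 are the values obtained from the car-rental data, α = 0.05 is chosen for illustration, and the critical value 2.776 is read from a standard t table (two-tailed, 4 degrees of freedom).

```python
from math import sqrt

def correlation_t_value(r, n):
    """Test value for H0: rho = 0, using t = r * sqrt((n - 2) / (1 - r^2))."""
    return r * sqrt((n - 2) / (1 - r ** 2))

r, n = 0.982, 6        # assumed inputs: car-rental data, six data pairs
critical_value = 2.776  # assumed: t table, alpha = 0.05 two-tailed, d.f. = n - 2 = 4

t = correlation_t_value(r, n)
reject_h0 = abs(t) > critical_value  # two-tailed decision rule

print(round(t, 1), reject_h0)
```

When `reject_h0` is true, the sample gives enough evidence of a significant linear relationship at the chosen α.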

Page 9: correlation and regression

Regression

• We previously tested the significance of the correlation coefficient. If the correlation is significant, the next step is to determine the equation of the regression line

• LINE OF BEST FIT: best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum

• A best fit is needed because the value of y will be predicted from the values of x; hence the closer the points are to the line, the better the prediction will be

Page 10: correlation and regression

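• The equation of the regression line is written as $y' = a + bx$

• Formulas for finding a and b:

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2} \qquad b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$

where a is the y' intercept and b is the slope of the line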

• The sign of the correlation coefficient and the sign of the slope of the regression line will always be the same
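A minimal sketch of the calculation, assuming the standard least-squares formulas for the intercept a and slope b (written out in the comments) and using the car-rental data from the earlier table. The function name `regression_line` is illustrative only.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    # a = [(sum y)(sum x^2) - (sum x)(sum xy)] / [n(sum x^2) - (sum x)^2]
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    # b = [n(sum xy) - (sum x)(sum y)] / [n(sum x^2) - (sum x)^2]
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b

cars = [63, 29, 20.8, 19.1, 13.4, 8.5]     # in ten thousands
revenue = [7, 3.9, 2.1, 2.8, 1.4, 1.5]     # in billions

a, b = regression_line(cars, revenue)
print(round(a, 3), round(b, 3))  # 0.396 0.106
```

The slope is positive, which agrees with the positive correlation found for the same data.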

Page 11: correlation and regression

• MARGINAL CHANGE: the magnitude of the change in one variable when the other variable changes by exactly 1 unit.

• See Example 10-9: the slope of the regression line is 0.106, which means that for each increase of 10,000 cars, the value of y changes by 0.106 unit ($106 million) on average.

• EXTRAPOLATION: making predictions beyond the bounds of the data.

• When predictions are made, they are based on present conditions or on the premise that present trends will continue.

• OUTLIER: a point that seems out of place when compared with the other points

• Some of these points can affect the equation of the regression line; such points are called influential points or influential observations
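To make marginal change and extrapolation concrete, the sketch below predicts revenue from the number of cars and flags any prediction outside the observed range. The slope 0.106 is the value quoted above; the intercept 0.396 and the range 8.5 to 63 are assumptions taken from a least-squares fit to the car-rental table, and `predict_revenue` is a hypothetical helper name.

```python
def predict_revenue(cars, a=0.396, b=0.106, x_min=8.5, x_max=63.0):
    """Return (predicted revenue in billions, True if the prediction extrapolates).

    cars is in ten thousands; a, b, x_min and x_max are assumed values
    from the car-rental example, not part of the original slide.
    """
    is_extrapolation = not (x_min <= cars <= x_max)
    return a + b * cars, is_extrapolation

inside, flag_inside = predict_revenue(30)     # within the observed data
outside, flag_outside = predict_revenue(100)  # beyond the bounds of the data

# Marginal change: raising x by exactly 1 unit changes y' by the slope b
marginal_change = predict_revenue(31)[0] - predict_revenue(30)[0]

print(round(inside, 3), flag_inside)
print(round(outside, 3), flag_outside)
print(round(marginal_change, 3))
```

A flagged prediction is not necessarily wrong, but it rests on the premise that the trend continues outside the range of the collected data.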

Page 12: correlation and regression

Coefficient of determination

x    1    2    3    4    5

y    10   8    12   16   20

• The equation of the regression line is $y' = 4.8 + 2.8x$, with r = 0.919

• For each x, we now have y' as the predicted value and y as the observed value

• The closer the observed values are to the predicted values, the better the fit

• The different types of variation associated with the regression model need to be identified

Page 13: correlation and regression

• Total variation, $\sum (y - \bar{y})^2$, is the sum of the squares of the vertical distances of each point from the mean $\bar{y}$

• Total variation can be divided into two parts: explained variation and unexplained variation

• Explained variation is the variation obtained from the relationship, where y' is the predicted value: $\sum (y' - \bar{y})^2$

• The closer the value of r is to -1 or +1, the better the points fit the line and the closer $\sum (y' - \bar{y})^2$ is to $\sum (y - \bar{y})^2$

• Unexplained variation is the variation that occurs due to chance, found by: $\sum (y - y')^2$

• When the unexplained variation is small, the value of r is close to -1 or +1

• If all points fall on the regression line, the unexplained variation will be 0

Page 14: correlation and regression

• TOTAL VARIATION:

$$\sum (y - \bar{y})^2 = \sum (y' - \bar{y})^2 + \sum (y - y')^2$$

• Note that the values $(y - y')$ are called residuals

• A RESIDUAL is the difference between the actual value of y and the predicted value y' for a given x value

• The mean of the residuals is always zero

• The residuals can be plotted on a graph called a residual plot

• A residual plot is used to determine whether the regression line can be used to make predictions
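The decomposition can be checked numerically on the small data set from the coefficient of determination slide (x = 1 to 5, y = 10, 8, 12, 16, 20). This is a sketch under the assumption that the line is fitted by ordinary least squares; all variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 12, 16, 20]
n = len(xs)

# Least-squares intercept a and slope b for y' = a + b*x
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
denominator = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
b = (n * sum_xy - sum_x * sum_y) / denominator

y_mean = sum_y / n
predicted = [a + b * x for x in xs]                    # y' for each x
residuals = [y - yp for y, yp in zip(ys, predicted)]   # y - y'

total = sum((y - y_mean) ** 2 for y in ys)               # total variation
explained = sum((yp - y_mean) ** 2 for yp in predicted)  # explained variation
unexplained = sum(e ** 2 for e in residuals)             # unexplained variation
mean_residual = sum(residuals) / n

print(round(total, 1), round(explained, 1), round(unexplained, 1))  # 92.8 78.4 14.4
```

Up to floating-point rounding, `explained + unexplained` equals `total` and `mean_residual` is zero, as the bullets above state.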

Page 15: correlation and regression

• In diagram (a), the relationship between x and y is linear, and the regression line can be used to make predictions

• In diagrams (b) through (f), the residual plots show that the regression lines are not suitable for prediction

• For example, diagram (b) shows that the variance of the residuals increases as the values of x increase

Page 16: correlation and regression

Coefficient of determination

• The coefficient of determination is the ratio of the explained variation to the total variation and is denoted by $r^2$

• The term $r^2$ is usually expressed as a percentage

• If $r^2$ = 78.4/92.8 = 0.845, it means that 84.5% of the total variation is explained by the regression line using the independent variable

• Another way to compute $r^2$ is by squaring the correlation coefficient r

• The rest of the variation (1 - 0.845 = 0.155, or 15.5%) is called the coefficient of nondetermination
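Both routes to the coefficient of determination can be compared directly. The sketch below uses the figures quoted on the slides (explained variation 78.4, total variation 92.8, r = 0.919); because r is already rounded to three decimal places, the two results agree only after rounding.

```python
explained_variation = 78.4
total_variation = 92.8
r = 0.919  # correlation coefficient quoted for the same data, already rounded

r_squared = explained_variation / total_variation  # ratio of variations
r_squared_from_r = r ** 2                          # squaring r
nondetermination = 1 - r_squared                   # coefficient of nondetermination

print(round(r_squared, 3))         # 0.845
print(round(r_squared_from_r, 3))  # 0.845
print(round(nondetermination, 3))  # 0.155
```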