correlation and regression

Post on 15-Dec-2014


INTRODUCTION TO STATISTICAL THEORY FOR SCIENTIST

CORRELATION AND REGRESSION

• Typical questions are: "Are two or more variables linearly related? If so, how strong is the relationship?"

• The numerical measure used to determine whether two or more variables are linearly related, and how strong the relationship is, is called the CORRELATION COEFFICIENT

• There are two types of relationship: SIMPLE RELATIONSHIP and MULTIPLE RELATIONSHIP.

CORRELATION

• Statistical method used to determine whether a linear relationship between variables exists

REGRESSION

• Statistical method used to describe the nature of the relationship between variables: positive or negative, linear or nonlinear

SIMPLE REGRESSION

• Has two variables: an independent (explanatory) variable and a dependent (response) variable

MULTIPLE REGRESSION

• Two or more independent variables are used to predict one dependent variable

POSITIVE RELATIONSHIP

• Both variables increase or decrease at the same time

NEGATIVE RELATIONSHIP

• As one variable increases, the other variable decreases, and vice versa.

Scatter plots and Correlation

• To find the relationship between two different variables, data need to be collected. Example: the relationship between the number of hours of study and exam grades

• The independent variable is the variable that can be controlled or manipulated, while the dependent variable cannot

• The dependent and independent variables can be plotted on a graph called a scatter plot

• The independent variable x is plotted on the horizontal axis and the dependent variable y on the vertical axis

• A scatter plot is a visual way to show the relationship between two variables

Company   Cars (in ten thousands)   Revenue (in billions)

A         63                        7

B         29                        3.9

C         20.8                      2.1

D         19.1                      2.8

E         13.4                      1.4

F         8.5                       1.5

SCATTER PLOT is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y
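
As a rough, text-only illustration of the definition, the sketch below places the six (cars, revenue) pairs from the table on a character grid, with x on the horizontal axis and y on the vertical axis. The helper name `ascii_scatter` and the grid size are assumptions made for this example; a plotting library would normally be used instead.

```python
# Data from the table: cars (in ten thousands) and revenue (in billions)
cars = [63, 29, 20.8, 19.1, 13.4, 8.5]
revenue = [7, 3.9, 2.1, 2.8, 1.4, 1.5]


def ascii_scatter(xs, ys, width=40, height=10):
    """Return text rows with one '*' per (x, y) pair; larger y is drawn higher."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate to a column / row index on the grid
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so y grows upward
    return ["".join(line) for line in grid]


plot = ascii_scatter(cars, revenue)
print("\n".join(plot))
```

The points rise from the lower left to the upper right, which is the visual signature of a positive relationship.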

Correlation

• The correlation described here is the Pearson Product Moment Correlation Coefficient (PPMC), developed by Karl Pearson

• The correlation value ranges from -1 to +1.

• A value close to +1 indicates a strong positive linear correlation, while a value close to -1 indicates a strong negative linear correlation

• A value of r close to zero means that there is no linear relationship between the variables, or only a weak one.

The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two quantitative variables. The symbol for the sample correlation coefficient is r, and the symbol for the population correlation coefficient is ρ (rho)

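FORMULA FOR CORRELATION COEFFICIENT r

$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}$$

where n is the number of data pairs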

• Rounding rule for correlation: round the value of r to three decimal places
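
To make the calculation concrete, here is a minimal sketch that applies the usual sum-based form of the Pearson coefficient to the cars and revenue table, then rounds to three decimal places as the rounding rule asks. The function name `pearson_r` is chosen for this example only.

```python
from math import sqrt

# Data from the table: cars (in ten thousands) and revenue (in billions)
cars = [63, 29, 20.8, 19.1, 13.4, 8.5]
revenue = [7, 3.9, 2.1, 2.8, 1.4, 1.5]


def pearson_r(xs, ys):
    """Sample correlation coefficient computed from the raw sums."""
    n = len(xs)  # number of data pairs
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(cars, revenue)
print(round(r, 3))  # 0.982
```

A value of 0.982 is close to +1, so these data show a strong positive linear relationship.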

SIGNIFICANCE OF THE CORRELATION COEFFICIENT

• State the hypotheses
  $H_0: \rho = 0$
  $H_1: \rho \neq 0$

• Find the critical values: use the t distribution with α and degrees of freedom = n – 2

• Compute the test value

• Make the decision and summarize the results
  - when $H_0$ is rejected, there is a significant correlation between the variables
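
The steps above can be sketched for the cars and revenue data. The transcript does not show the test-value formula, so this example assumes the standard one, t = r·sqrt((n − 2)/(1 − r²)). The value r = 0.982 is computed from the six data pairs in the table, and the critical value 2.776 is the two-tailed t-table entry for α = 0.05 with 4 degrees of freedom; α = 0.05 is itself an assumption for illustration.

```python
from math import sqrt


def correlation_t(r, n):
    """Test value for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


r = 0.982         # sample correlation of the six (cars, revenue) pairs
n = 6             # number of data pairs, so d.f. = n - 2 = 4
critical = 2.776  # t-table value: alpha = 0.05, two-tailed, d.f. = 4

t = correlation_t(r, n)
reject_h0 = abs(t) > critical  # two-tailed decision rule

print(round(t, 2), reject_h0)
```

Because the test value falls far outside ±2.776, the null hypothesis is rejected and the correlation is treated as significant.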

Regression

• We previously tested the significance of the correlation coefficient. If the correlation is significant, the next step is to determine the equation of the regression line

• LINE OF BEST FIT: best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum

• A best fit is needed because values of y will be predicted from values of x; hence the closer the points are to the line, the better the prediction will be

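• Equation of the regression line is written as $y' = a + bx$

• Formulas for finding a and b:

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} \qquad b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

where a is the $y'$ intercept and b is the slope of the line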

• The sign of the correlation coefficient and the sign of the slope of the regression line will always be the same

• MARGINAL CHANGE: the magnitude of the change in one variable when the other variable changes exactly 1 unit.

• See example 10-9: the slope of the regression line is 0.106, which means that for each increase of 10,000 cars, the value of y changes by 0.106 unit ($106 million) on average.

• EXTRAPOLATION: making predictions beyond the bounds of the data.

• When predictions are made, they are based on present conditions or on the premise that present trends will continue.

• OUTLIER: a point that seems out of place when compared with the other points

• Some of these points can affect the equation of the regression line; such points are called influential points or influential observations
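
As an illustration of the regression ideas in this section, the sketch below fits the least-squares line to the cars and revenue table using the standard sum-based formulas for the intercept a and slope b, and reproduces the slope of 0.106 quoted above. The names `fit_line` and `predict` are hypothetical helpers for this example; `predict` refuses to extrapolate beyond the observed x values.

```python
# Data from the table: cars (in ten thousands) and revenue (in billions)
cars = [63, 29, 20.8, 19.1, 13.4, 8.5]
revenue = [7, 3.9, 2.1, 2.8, 1.4, 1.5]


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # intercept
    b = (n * sum_xy - sum_x * sum_y) / denom       # slope
    return a, b


a, b = fit_line(cars, revenue)


def predict(x):
    """Predict revenue for x cars (ten thousands), inside the data range only."""
    if not min(cars) <= x <= max(cars):
        raise ValueError("x is outside the observed data; this would be extrapolation")
    return a + b * x


print(round(a, 3), round(b, 3))  # 0.396 0.106
print(round(predict(40), 2))     # predicted revenue (billions) for 400,000 cars
```

The slope is positive, matching the sign of the correlation coefficient for the same data.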

Coefficient of determination

x 1 2 3 4 5

y 10 8 12 16 20

• Equation of the regression line is $y' = 4.8 + 2.8x$ with r = 0.919

• Now for each x, we have $y'$ as the predicted value and y as the observed value

• The closer the observed values are to the predicted values, the better the fit

• Different types of variation associated with the regression model need to be identified

• Total variation, $\sum (y - \bar{y})^2$, is the sum of the squares of the vertical distances of each point from the mean

• Total variation can be divided into two parts: explained variation and unexplained variation

• Explained variation is the variation obtained from the relationship, which is $\sum (y' - \bar{y})^2$

• The closer the value of r is to -1 or +1, the better the points fit the line and the closer $\sum (y' - \bar{y})^2$ is to $\sum (y - \bar{y})^2$

• Unexplained variation is the variation that occurs due to chance, found by $\sum (y - y')^2$

• When the unexplained variation is small, the value of r is closer to -1 or +1

• If all points fall on the regression line, the unexplained variation will be 0

• TOTAL VARIATION:

$$\sum (y - \bar{y})^2 = \sum (y' - \bar{y})^2 + \sum (y - y')^2$$

• Note that the values $y - y'$ are called residuals

• A RESIDUAL is the difference between the actual value of y and the predicted value $y'$ for a given x value

• The mean of the residuals is always zero

• The residual values can be plotted on a graph called a residual plot

• The residual plot is used to determine whether the regression line can be used to make predictions

• For diagram (a), the relationship between x and y is linear and the regression line can be used to make predictions

• For (b) through (f), the residual plots show that the regression lines are not suitable for prediction

• For example, diagram (b) shows that the variance of the residuals increases as the values of x increase
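
The variation and residual ideas in this section can be checked numerically on the five (x, y) pairs given at its start. This is a minimal sketch: it fits the least-squares line in mean-deviation form (an equivalent way of writing the usual formulas), which gives intercept 4.8 and slope 2.8, then computes the predicted values, the residuals and the three sums of squares. The variable names are chosen for this example.

```python
x = [1, 2, 3, 4, 5]
y = [10, 8, 12, 16, 20]
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares slope and intercept (mean-deviation form)
b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
a = y_mean - b * x_mean

predicted = [a + b * xi for xi in x]                    # predicted values
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # observed minus predicted

total = sum((yi - y_mean) ** 2 for yi in y)              # total variation
explained = sum((pi - y_mean) ** 2 for pi in predicted)  # explained variation
unexplained = sum(e ** 2 for e in residuals)             # unexplained variation

print(round(total, 1), round(explained, 1), round(unexplained, 1))
print(round(abs(sum(residuals)), 6))  # residuals sum to zero, up to rounding
```

The explained and unexplained parts (78.4 and 14.4) add up to the total variation of 92.8, and the residuals sum to zero, so their mean is zero as well.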

Coefficient of determination

• The coefficient of determination is the ratio of the explained variation to the total variation and is denoted by $r^2$

• The term $r^2$ is usually expressed as a percentage

• If $r^2$ = 78.4/92.8 = 0.845, it means that 84.5% of the total variation is explained by the regression line using the independent variable

• Another way to compute $r^2$ is by squaring r

• The rest of the variation (1 – 0.845 = 0.155) is called the coefficient of nondetermination
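
The arithmetic in this section can be confirmed with a short sketch. It uses the explained variation (78.4), the total variation (92.8) and r = 0.919 quoted above; the variable names are chosen for this example.

```python
explained = 78.4  # explained variation
total = 92.8      # total variation
r = 0.919         # correlation coefficient, rounded to three decimals

r_squared = explained / total          # coefficient of determination
nondetermination = 1 - r_squared       # coefficient of nondetermination
r_squared_from_r = r ** 2              # alternative route: square r

print(round(r_squared, 3), round(nondetermination, 3))  # 0.845 0.155
print(round(r_squared_from_r, 3))                       # 0.845
```

Both routes agree to three decimal places; small differences beyond that come from r having been rounded before squaring.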