Correlation - Pennsylvania State University

Post on 05-May-2022

Correlation

A statistical method for measuring the

relationship between two variables

Three characteristics

Direction of the relationship

Form of the relationship

Strength/Consistency

Direction of the Relationship

Positive correlation

Variables moving in the same direction

Negative correlation

Variables moving in opposite directions

Form of the Relationship

Linear or non-linear

Predicting data

Strength/Consistency

How well do data fit the specific form?

Measured by the distance between actual data

and the predicted data

The absolute value of a correlation

Measures the goodness of fit

1: Perfect fit

0: No fit at all

Correlation Measures

The Pearson correlation

Linear relationship

The sign of the correlation: direction

The numerical value: the degree of the relationship

The Spearman correlation

For an ordinal scale of measurement

Both the X values and the Y values are ranks

Measures the consistency of the relationship

Not necessarily linear

The point-biserial correlation

Used to measure the correlation between a regular (numerical) variable and a dichotomous variable

The Pearson Correlation

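$$r = \frac{SP}{\sqrt{SS_X \, SS_Y}}$$

r = (degree to which X and Y vary together) / (degree to which X and Y vary separately)

r = (covariability of X and Y) / (variability of X and Y separately)

$$SP = \sum (X - M_X)(Y - M_Y) = \sum XY - \frac{\sum X \sum Y}{n}$$

$$SS_X = \sum (X - M_X)^2 = \sum X^2 - \frac{(\sum X)^2}{n}$$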

Check the Result

Using the scatterplot of data

Drawing the envelope around all data points

Checking the direction and shape of the

envelope

[Figure: scatterplot of the X and Y data]

X Y

0 1

10 3

4 1

8 2

8 3
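The Pearson formula can be checked by hand on the five (X, Y) pairs above. The following is a minimal sketch, assuming plain Python with no statistics library; the function names are illustrative and not part of the original slides.

```python
import math

# The five (X, Y) pairs from the "Check the Result" slide.
xs = [0, 10, 4, 8, 8]
ys = [1, 3, 1, 2, 3]

def sum_of_squares(values):
    """SS = sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def sum_of_products(xs, ys):
    """SP = sum of (X - M_X)(Y - M_Y)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def pearson_r(xs, ys):
    """r = SP / sqrt(SS_X * SS_Y)."""
    return sum_of_products(xs, ys) / math.sqrt(sum_of_squares(xs) * sum_of_squares(ys))

r = pearson_r(xs, ys)
# SP = 14, SS_X = 64, SS_Y = 4, so r = 14 / 16 = 0.875
print(r)
```

The positive sign and the value close to 1 agree with what the scatterplot should show: an envelope that rises from left to right and is fairly narrow.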

Interpreting Correlations

Prediction

Correlation only describes the relationship between two variables. It does not necessarily imply causation!

The value could be affected greatly by the data range.

Data Range and Correlation

Outliers can dramatically affect the value.

Outlier and Correlation
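How strongly a single outlier can move the correlation is easy to see numerically. The sketch below uses hypothetical scores invented for this illustration (they do not come from the slides) and compares r with and without one extreme point.

```python
import math

def pearson_r(xs, ys):
    """Pearson r = SP / sqrt(SS_X * SS_Y)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical scores, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]
r_without = pearson_r(xs, ys)

# The same scores plus one extreme point far from the rest.
r_with = pearson_r(xs + [15], ys + [15])

print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

With these made-up numbers the correlation goes from roughly 0.14 to roughly 0.95, even though only one of the six points changed.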

The Strength of Relationship

The coefficient of determination

The squared value of the correlation (r²)

How much of the variance in the dependent variable

is accounted for by the independent variable

Similar to the effect-size measures (such as r²) reported with t-tests

Hypothesis Tests with the

Pearson Correlation

Pearson correlation is usually computed for

sample data, but used to test hypotheses

about the relationship in the population

Population correlation shown by Greek letter

rho (ρ)

Non-directional: H0: ρ = 0 and H1: ρ ≠ 0

Directional: H0: ρ ≤ 0 and H1: ρ > 0 or

Directional: H0: ρ ≥ 0 and H1: ρ < 0

Population vs. Sample

Correlation Hypothesis Test

Sample correlation r used to test population ρ

Hypothesis test can be computed using

either t or F

Use t table to find critical value

$$t = \frac{r - \rho}{s_r}, \qquad s_r = \sqrt{\frac{1 - r^2}{df}}$$

About df

What should the df be?

Suppose the sample size is n

$$df = n - 2$$

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

Example

α = .05

n = 30

r = 0.35

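$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.35}{\sqrt{\dfrac{1 - 0.35^2}{28}}} \approx 1.98$$

Two-tailed test: critical value ±2.048

Fail to reject the null hypothesis (1.98 < 2.048)

One-tailed test: critical value 1.701

Reject the null hypothesis (1.98 > 1.701)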
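The example can be reproduced in a few lines. This is a minimal sketch, assuming H0: ρ = 0; the function name is illustrative, and the two critical values are copied from the slide (t table, df = 28, α = .05) rather than computed.

```python
import math

def t_for_r(r, n):
    """t statistic for a sample correlation r under H0: rho = 0, with df = n - 2."""
    df = n - 2
    s_r = math.sqrt((1 - r ** 2) / df)  # standard error of r
    return r / s_r, df

t, df = t_for_r(0.35, 30)

# Critical values taken from the slide, not computed here.
CRIT_TWO_TAILED = 2.048
CRIT_ONE_TAILED = 1.701

reject_two_tailed = abs(t) > CRIT_TWO_TAILED
reject_one_tailed = t > CRIT_ONE_TAILED

print(f"t({df}) = {t:.3f}")
print("two-tailed: reject H0?", reject_two_tailed)
print("one-tailed: reject H0?", reject_one_tailed)
```

The same sample correlation is not significant in a two-tailed test but is significant in a one-tailed test, because the one-tailed critical value is smaller.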

Using r Directly …

Report Correlations

A correlation for the data revealed a

significant relationship between amount of

education and annual income, r(28) = 0.65,

p < .01, two-tailed.

Usually, Multiple Variables

Involved in Correlation Tests

Partial Correlation

Involvement of

other factors in

correlation?

Partial Correlation

A partial correlation measures the

relationship between two variables while

mathematically controlling the influence of a

third variable by holding it constant

$$r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$$

Example

Number of Churches (X)   Number of Crimes (Y)   Population (Z)
 1                        4                     1
 2                        3                     1
 3                        1                     1
 4                        2                     1
 5                        5                     1
 7                        8                     2
 8                       11                     2
 9                        9                     2
10                        7                     2
11                       10                     2
13                       15                     3
14                       14                     3
15                       16                     3
16                       17                     3
17                       13                     3

$$r_{xy \cdot z} = 0$$
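The partial correlation for this example can be verified numerically. The sketch below is a plain-Python illustration with hypothetical helper names; it computes the three pairwise Pearson correlations for the 15 cities and plugs them into the partial correlation formula.

```python
import math

# The 15 cities from the example: churches (X), crimes (Y), population (Z).
churches = [1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17]
crimes = [4, 3, 1, 2, 5, 8, 11, 9, 7, 10, 15, 14, 16, 17, 13]
population = [1] * 5 + [2] * 5 + [3] * 5

def pearson_r(xs, ys):
    """Pearson r = SP / sqrt(SS_X * SS_Y)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

def partial_r(xs, ys, zs):
    """Correlation between X and Y with Z held constant."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

r_xy = pearson_r(churches, crimes)
r_xy_z = partial_r(churches, crimes, population)

print(f"r_xy   = {r_xy:.3f}")    # strong raw correlation, about 0.923
print(f"r_xy.z = {r_xy_z:.3f}")  # essentially zero once population is controlled
```

Churches and crimes look strongly related only because both grow with population; once population is held constant the relationship disappears.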

What if the relationship looks

like this?

The Spearman Correlation

Measures the degree of consistency of direction

Not necessarily linear

One extra step before calculating the Pearson correlation: ranking the X and Y values

Then compute the correlation of the ranked values

X (value)   Y (value)   X (rank)   Y (rank)
1           3           2          2
6           4           4          3
2           5           3          4
0           2           1          1

Ranking Tied Scores

Tied scores receive the same rank

First, rank all scores by position

Then give each tied score the mean of the tied positions

X (value)   Y (value)   X (rank)   Y (position)   Y (rank)
1           3           2          2              2.5
6           3           4          3              2.5
2           5           3          4              4
0           2           1          1              1
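The ranking procedure, including the rule for ties, can be sketched as follows. This is an illustrative implementation with hypothetical function names: tied scores share the mean of the positions they occupy, and the Spearman correlation is then the Pearson correlation of the ranks.

```python
import math

def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson r = SP / sqrt(SS_X * SS_Y)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

# The tied-scores example from the slide.
x_values = [1, 6, 2, 0]
y_values = [3, 3, 5, 2]

x_ranks = average_ranks(x_values)  # [2.0, 4.0, 3.0, 1.0]
y_ranks = average_ranks(y_values)  # [2.5, 2.5, 4.0, 1.0]
r_s = pearson_r(x_ranks, y_ranks)

print(x_ranks, y_ranks)
print(f"Spearman r_s = {r_s:.3f}")
```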

Special Formula for

Spearman Correlation

$$SS = \frac{n(n^2 - 1)}{12}$$

$$r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$$
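When there are no ties, the special formula gives the same answer as the Pearson correlation of the ranks. The sketch below checks this on the four untied X and Y ranks from the earlier Spearman table; D is taken to be the difference between the X rank and the Y rank of each individual, which is the usual definition but is not spelled out on the slide.

```python
import math

# Ranks from the untied Spearman example (X values 1, 6, 2, 0; Y values 3, 4, 5, 2).
x_ranks = [2, 4, 3, 1]
y_ranks = [2, 3, 4, 1]
n = len(x_ranks)

# Special formula: r_s = 1 - 6 * sum(D^2) / (n * (n^2 - 1)).
sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(x_ranks, y_ranks))
r_s_special = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

# For untied ranks 1..n, SS = n * (n^2 - 1) / 12.
ss_formula = n * (n ** 2 - 1) / 12
mean_rank = sum(x_ranks) / n
ss_direct = sum((rx - mean_rank) ** 2 for rx in x_ranks)

# Cross-check against the Pearson correlation computed on the ranks.
mx = sum(x_ranks) / n
my = sum(y_ranks) / n
sp = sum((rx - mx) * (ry - my) for rx, ry in zip(x_ranks, y_ranks))
ss_y = sum((ry - my) ** 2 for ry in y_ranks)
r_s_pearson = sp / math.sqrt(ss_direct * ss_y)

print(sum_d_squared, ss_formula, ss_direct)
print(r_s_special, r_s_pearson)
```

Here the sum of D² is 2, SS is 5 by either route, and both versions of the correlation come out at 0.8. With tied ranks the special formula is only approximate, so the Pearson-on-ranks route is the safer default.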

The Point-Biserial Correlation

Just like the Pearson correlation

One variable has only two values

Gender, success/failure, college education or not, …

The size of the correlation does not depend on which two

numbers are used to code the categories (1/0, 1/-1, etc.)

Point-Biserial Correlation vs. t Test

t test

t = 4

p <.001

df = 18

Point-Biserial

r = 0.686
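The two results on this slide describe the same data. The link is the standard identity r² = t² / (t² + df), which is not stated on the slide and is added here as background; the sketch below applies it to the slide's numbers.

```python
import math

# Values from the slide's independent-measures t test.
t = 4.0
df = 18

# Standard conversion between t and the point-biserial correlation
# (an identity supplied for illustration, not shown on the slide).
r_squared = t ** 2 / (t ** 2 + df)
r = math.sqrt(r_squared)

print(f"r^2 = {r_squared:.3f}")  # proportion of variance accounted for
print(f"r   = {r:.3f}")
```

The t test answers whether the two groups differ significantly, while the point-biserial r (and r²) describes how strong the relationship is.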

If we know two variables are

linearly related, how can we

describe such a relationship?

Using a linear equation

y = bx + a

Regression

Goal of Regression

Determining two constants for a linear

equation: y=bx+a

b: slope

a: intercept

Methods

The least-squares solution

Distance = Y − Ŷ

Minimizing Σ(Y − Ŷ)²

Formula

$$b = \frac{SP}{SS_X}$$

$$a = M_Y - b\,M_X$$
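The slope and intercept formulas can be tried on real numbers. This minimal sketch borrows the eight (X, Y) pairs listed under "In Excel" later in the transcript; the function name is illustrative.

```python
# The eight (X, Y) pairs from the "In Excel" slide.
xs = [5, 1, 4, 7, 6, 4, 3, 2]
ys = [10, 4, 5, 11, 15, 6, 5, 0]

def fit_line(xs, ys):
    """Least-squares line Y-hat = bX + a, with b = SP / SS_X and a = M_Y - b * M_X."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    b = sp / ss_x
    a = my - b * mx
    return b, a

b, a = fit_line(xs, ys)
# SP = 56, SS_X = 28, M_X = 4, M_Y = 7, so b = 2 and a = 7 - 2 * 4 = -1
print(f"Y-hat = {b}X + ({a})")
```

Because a = M_Y − b·M_X, the fitted line always passes through the point (M_X, M_Y), here (4, 7).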

Regression in Excel

Draw a scatterplot

Show the trendline

Linear Equations and

Regression

The Pearson correlation

measures a linear relationship

between two variables

This figure

Makes the relationship easier to

see

Shows the central tendency of

the relationship

Can be used for prediction

Linear Equations

General equation for a line

Equation: Y = bX + a

X and Y are variables

a and b are fixed constants

Regression

Regression is a method of finding an

equation describing the best-fitting line for a

set of data

Least squares

Minimizing the errors for the known data

That is, the error of prediction

Error of Prediction

With a linear function from regression, we

can calculate the predicted value based on a

given X

Ŷ

Error of prediction: Y- Ŷ

Often squared

Standard Error of Estimate

Regression equation makes a prediction

Precision of the estimate is measured by the

standard error of estimate (SEoE)

$$\text{SEoE} = \sqrt{\frac{SS_{residual}}{df}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$$

Relationship Between

Correlation and Standard Error

of Estimate

$$SS_{regression} = r^2\, SS_Y$$

$$SS_{residual} = (1 - r^2)\, SS_Y$$

$$\text{SEoE} = \sqrt{\frac{SS_{residual}}{df}} = \sqrt{\frac{(1 - r^2)\, SS_Y}{n - 2}}$$
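Both routes to the residual sum of squares can be compared directly. The sketch below, again using the eight (X, Y) pairs listed under "In Excel" as an assumed data set, computes SS_residual once from the prediction errors and once from r, and then the standard error of estimate.

```python
import math

# The eight (X, Y) pairs from the "In Excel" slide.
xs = [5, 1, 4, 7, 6, 4, 3, 2]
ys = [10, 4, 5, 11, 15, 6, 5, 0]
n = len(xs)

mx = sum(xs) / n
my = sum(ys) / n
sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
ss_x = sum((x - mx) ** 2 for x in xs)
ss_y = sum((y - my) ** 2 for y in ys)

b = sp / ss_x              # slope
a = my - b * mx            # intercept
r = sp / math.sqrt(ss_x * ss_y)

# Route 1: sum the squared errors of prediction, (Y - Y-hat)^2.
ss_residual_direct = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Route 2: use the correlation, SS_residual = (1 - r^2) * SS_Y.
ss_residual_from_r = (1 - r ** 2) * ss_y
ss_regression = r ** 2 * ss_y

seoe = math.sqrt(ss_residual_direct / (n - 2))

print(ss_residual_direct, ss_residual_from_r, ss_regression)
print(f"standard error of estimate = {seoe:.3f}")
```

For these data SS_Y = 156 splits into SS_regression = 112 and SS_residual = 44, and the standard error of estimate is about 2.71.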

Testing Regression

Significance

Analysis of Regression

Similar to Analysis of Variance

Uses an F-ratio of two Mean Square values

Each MS is an SS divided by its df

H0: the slope of the regression line (b or beta)

is zero

no regression

Mean Squares and F-ratio

$$MS_{regression} = \frac{SS_{regression}}{df_{regression}}$$

$$MS_{residual} = \frac{SS_{residual}}{df_{residual}}$$

$$F = \frac{MS_{regression}}{MS_{residual}}$$

SS and df in

Regression Analysis

SPSS Output Example

In Excel

X   Y
5   10
1   4
4   5
7   11
6   15
4   6
3   5
2   0
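The F-ratio for these eight points can be worked out by hand as a check on the Excel or SPSS output. This is a sketch under the usual assumptions for one predictor, df_regression = 1 and df_residual = n − 2; those df values are standard but do not survive in the transcript.

```python
# The eight (X, Y) pairs from the "In Excel" slide.
xs = [5, 1, 4, 7, 6, 4, 3, 2]
ys = [10, 4, 5, 11, 15, 6, 5, 0]
n = len(xs)

mx = sum(xs) / n
my = sum(ys) / n
sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
ss_x = sum((x - mx) ** 2 for x in xs)
ss_y = sum((y - my) ** 2 for y in ys)

r_squared = sp ** 2 / (ss_x * ss_y)
ss_regression = r_squared * ss_y
ss_residual = (1 - r_squared) * ss_y

# Assumed degrees of freedom for simple regression with one predictor.
df_regression = 1
df_residual = n - 2

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

print(f"F({df_regression}, {df_residual}) = {f_ratio:.2f}")
```

The resulting F is then compared with the critical value from an F table for the same degrees of freedom; with a single predictor it equals the square of the t statistic for the correlation.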

ANOVA and Regression

Basically the same method, but different

perspectives to look at the results

Main effect in ANOVA == a variable in

regression

Interaction between two factors ==

multiplication of two variables in regression

Regression not only tells whether there is a difference,

but also predicts by how much.

Multivariate regression

Linear or Non-Linear

Regression?

Linear models are usually good enough for

most research in IST.

If the relationship may be non-linear, how can you tell

that your linear model is not appropriate?

Look at the distribution of the residuals
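A residual check can be sketched in a few lines. The data below are hypothetical (Y = X², invented for this illustration): a straight line is fitted to curved data and the residuals are printed in order of X.

```python
# Hypothetical curved data, invented for illustration only.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]
n = len(xs)

# Least-squares line: b = SP / SS_X, a = M_Y - b * M_X.
mx = sum(xs) / n
my = sum(ys) / n
sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
ss_x = sum((x - mx) ** 2 for x in xs)
b = sp / ss_x
a = my - b * mx

# Residual = actual Y minus predicted Y.
residuals = [y - (b * x + a) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than a random scatter around zero, suggests that a straight line is the wrong form for the data.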

In Summary

Correlation: the relationship between two

variables

Direction, form, degree

Three methods

For different purposes

Regression

Determining the linear equation that best fits the data

Slope and intercept

Homework

Three problems to solve.