Correlation and Covariance


Transcript of Correlation and Covariance

Page 1: Correlation and Covariance

Correlation and Covariance

Page 2: Correlation and Covariance

Overview

Outcome / Dependent Variable (Y-Axis): Continuous (e.g., Height)

Predictor Variable (X-Axis): Continuous or Categorical

Plot choice: Histogram (one continuous variable), Scatter (continuous vs. continuous), Boxplot (continuous vs. categorical)

Page 3: Correlation and Covariance

Variables

Dependent Variable: Y (e.g., Height)

Independent Variables: the X's (X1, X2, X3, X4)

Page 4: Correlation and Covariance

Correlation Matrix for Continuous Variables

chart.Correlation(num2) — PerformanceAnalytics package
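A minimal sketch of that call (num2 is assumed to be a data frame of numeric columns; the mtcars subset below is only a stand-in for the slide's data):

# install.packages("PerformanceAnalytics")       # once, if not installed
library(PerformanceAnalytics)

num2 <- mtcars[, c("mpg", "hp", "wt", "qsec")]   # stand-in numeric data
# Histograms on the diagonal, scatterplots below it, correlations above it
chart.Correlation(num2, histogram = TRUE, method = "pearson")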

Page 5: Correlation and Covariance

Slide 5

• A deviation is the difference between the mean and an actual data point.

• Deviations can be calculated by taking each score and subtracting the mean from it:

$\text{deviation}_i = x_i - \bar{x}$

Calculating ‘Error’

Page 6: Correlation and Covariance

Calculating ‘Error’

Page 7: Correlation and Covariance

Slide 7

• Could we take the errors (deviations) between the mean and each data point and simply add them up?

$\sum (X - \bar{X}) = 0$

Score   Mean   Deviation
1       2.6    -1.6
2       2.6    -0.6
3       2.6     0.4
3       2.6     0.4
4       2.6     1.4

Total = 0

Use the Total Error? (Deviation)

Page 8: Correlation and Covariance

Slide 8

• We could add the deviations to find out the total error.

• Deviations cancel out because some are positive and others negative.

• Therefore, we square each deviation.

• If we add these squared deviations we get the sum of squared errors (SS).

Sum of Squared Errors (Deviation)

Page 9: Correlation and Covariance

Slide 9

$SS = \sum (X - \bar{X})^2 = 5.20$

Score   Mean   Deviation   Squared Deviation
1       2.6    -1.6        2.56
2       2.6    -0.6        0.36
3       2.6     0.4        0.16
3       2.6     0.4        0.16
4       2.6     1.4        1.96

Total SS = 5.20

Sum of Squared Errors
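A minimal R sketch of this calculation, using the five scores from the table above:

scores <- c(1, 2, 3, 3, 4)
dev <- scores - mean(scores)   # deviations: -1.6 -0.6 0.4 0.4 1.4
sum(dev)                       # 0: the raw deviations cancel out
SS <- sum(dev^2)               # sum of squared errors
SS                             # 5.2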

Page 10: Correlation and Covariance

Slide 10

• The variance is measured in units squared.

• This isn’t a very meaningful metric so we take the square root value.

• This is the standard deviation (s).

$s = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}} = \sqrt{\dfrac{5.20}{5}} = 1.02$

Standard Deviation

Page 11: Correlation and Covariance

Slide 11

• The sum of squares is a good measure of overall variability, but is dependent on the number of scores.

• We calculate the average variability by dividing by the number of scores (n).

• This value is called the variance (s2).

Variance
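A minimal R sketch, continuing with the same scores. Note that the slide divides by n, while base R's var() and sd() divide by n - 1, so the numbers differ slightly:

scores <- c(1, 2, 3, 3, 4)
SS <- sum((scores - mean(scores))^2)   # 5.2
n <- length(scores)

SS / n          # variance as on the slides: 1.04
sqrt(SS / n)    # standard deviation as on the slides: ~1.02
var(scores)     # R's sample variance (n - 1 in the denominator): 1.3
sd(scores)      # R's sample standard deviation: ~1.14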

Page 12: Correlation and Covariance

Slide 12

Same Mean, Different Standard Deviation

Page 13: Correlation and Covariance

Temperature Variation Across Cities

[Histograms of hourly temperatures (count of hours) for Austin, Las Vegas, San Diego, San Francisco, and Tampa Bay]

Page 14: Correlation and Covariance

$\mathrm{cov}(x, y) = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$

Covariance

[Scatter plot of each person's deviation from the mean in Y against their deviation in X; persons 2, 3, and 5 look to have similar magnitudes of deviation from their means]

Page 15: Correlation and Covariance

$\mathrm{cov}(x, y) = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$
$= \dfrac{(-0.4)(-3) + (-1.4)(-2) + (-1.4)(-1) + (0.6)(2) + (2.6)(4)}{4}$
$= \dfrac{1.2 + 2.8 + 1.4 + 1.2 + 10.4}{4} = \dfrac{17}{4} = 4.25$

Covariance

• Calculate the error [deviation] between the mean and each subject’s score for the first variable (x).

• Calculate the error [deviation] between the mean and their score for the second variable (y).

• Multiply these error values.

• Add these values and you get the cross-product deviations.

• The covariance is the average of the cross-product deviations (see the sketch below).
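A minimal R sketch of these steps. The raw scores below are assumed (they are not shown on the slide) but are chosen so their deviations match the worked example above (x deviations -0.4, -1.4, -1.4, 0.6, 2.6; y deviations -3, -2, -1, 2, 4):

x <- c(5, 4, 4, 6, 8)            # assumed scores for x
y <- c(8, 9, 10, 13, 15)         # assumed scores for y

dx <- x - mean(x)                # deviations of x from its mean
dy <- y - mean(y)                # deviations of y from its mean
cross <- dx * dy                 # cross-product deviations

sum(cross) / (length(x) - 1)     # average cross-product deviation: 4.25
cov(x, y)                        # base R agrees: 4.25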

Page 16: Correlation and Covariance

$\mathrm{cov}(x, y) = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$

Covariance

Age  Income  Education
7    4       3
4    1       8
6    3       5
8    6       1
8    5       7
7    2       9
5    3       3
9    5       8
7    4       5
8    2       2
9    5       2
8    4       2
9    2       3
8    4       7
3    1       4
3    1       3
8    2       6
1    2       5
3    1       7
6    3       3

[Scatter plot: Age vs. Income]

[Column chart of each observation's deviation from its own mean: Delta A (age) and Delta I (income)]

Do they VARY the same way relative to their own means?

Covariance of Age and Income = 47 / 19 ≈ 2.47
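A minimal R sketch reproducing that value from the Age and Income columns of the table above:

age    <- c(7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6)
income <- c(4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3)

dA <- age - mean(age)        # "Delta A": age deviations from the mean
dI <- income - mean(income)  # "Delta I": income deviations from the mean

sum(dA * dI) / (length(age) - 1)   # ~2.47
cov(age, income)                   # same value from base R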

Page 17: Correlation and Covariance

• It depends upon the units of measurement. E.g., the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is 11.

• One solution: standardize it (normalize the data) by dividing by the standard deviations of both variables.

• The standardized version of covariance is known as the correlation coefficient. It is relatively unaffected by units of measurement.

Limitations of Covariance

Page 18: Correlation and Covariance

$r = \dfrac{\mathrm{cov}_{xy}}{s_x s_y} = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}$

The Correlation Coefficient

$r = \dfrac{\mathrm{cov}_{xy}}{s_x s_y} = \dfrac{4.25}{1.67 \times 2.92} = .87$
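A minimal R sketch of this standardization, using the same assumed scores as in the covariance sketch earlier:

x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

sd(x)                        # ~1.67
sd(y)                        # ~2.92
cov(x, y) / (sd(x) * sd(y))  # ~0.87
cor(x, y)                    # same: Pearson's correlation coefficient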

Page 19: Correlation and Covariance

• It varies between -1 and +1; 0 = no relationship.

• It is an effect size: ±.1 = small effect, ±.3 = medium effect, ±.5 = large effect.

• Coefficient of determination, r²: by squaring the value of r you get the proportion of variance in one variable shared by the other.

Things to Know about the Correlation

Page 20: Correlation and Covariance

Correlation

Covariance is High: r ~1

Covariance is Low: r ~0

Page 21: Correlation and Covariance

Correlation

[Scatter plot of Age vs. Income with a fitted linear trendline]

f(x) = 0.4330x + 0.2506, R² = 0.4424
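A minimal R sketch of that trendline, fitted to the Age/Income data from the earlier slide; the coefficients and R² match the equation on the chart:

age    <- c(7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6)
income <- c(4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3)

fit <- lm(income ~ age)
coef(fit)                  # slope ~0.433, intercept ~0.251
summary(fit)$r.squared     # ~0.442, the R-squared on the chart
cor(age, income)^2         # same value: the squared correlation

plot(age, income, main = "Age vs. Income")
abline(fit)                # draw the fitted trendline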

Page 22: Correlation and Covariance

Correlation

Need inter-item/variable correlations > .30

Page 23: Correlation and Covariance

Character Vector: b <- c("one", "two", "three")

Numeric Vector: a <- c(1, 2, 5.3, 6, -2, 4)

Matrix: y <- matrix(1:20, nrow = 5, ncol = 4)

Data frame:
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(d, e, f)
names(mydata) <- c("ID", "Color", "Passed")

List: w <- list(name = "Fred", age = 5.3)

Data Structures

Framework Source: Hadley Wickham

Page 24: Correlation and Covariance

Correlation Matrix
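The slide shows a correlation matrix image. A minimal sketch of producing one with base R, assuming a data frame of numeric columns (the mtcars subset is only a stand-in):

num <- mtcars[, c("mpg", "hp", "wt", "qsec")]   # stand-in numeric data
round(cor(num), 2)     # pairwise Pearson correlations as a matrix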

Page 25: Correlation and Covariance

Correlation and Covariance

$\mathrm{cov}(x, y) = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$

Page 26: Correlation and Covariance

Revisiting the Height Dataset

Page 27: Correlation and Covariance

Galton: Height Dataset

cor(heights)
Error in cor(heights) : 'x' must be numeric

Initial workaround: Create data.frame without the Factors

h2 <- data.frame(h$father,h$mother,h$avgp,h$childNum,h$kids)

The cor() function does not handle Factors.

Later we will RECODE the variable into a 0, 1

Excel's CORREL() does not handle them either.
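A minimal sketch of the workaround, assuming h holds the Galton heights data with the column names used on the slide; the last two lines show an alternative that drops the Factor columns programmatically:

# Manual workaround from the slide: rebuild with only the numeric columns
h2 <- data.frame(h$father, h$mother, h$avgp, h$childNum, h$kids)
cor(h2)

# Alternative: keep every numeric column without listing them by hand
h_num <- h[, sapply(h, is.numeric)]
cor(h_num)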

Page 28: Correlation and Covariance

Histogram of Correlation Coefficients

[Histogram x-axis ranges from -1 to +1]

Page 29: Correlation and Covariance

Correlation Matrix: Both Types

library(car)
scatterplotMatrix(heights)

Zoom in on Gender

Page 30: Correlation and Covariance

Correlation Matrix for Continuous Variables

chart.Correlation(num2) — PerformanceAnalytics package

Page 31: Correlation and Covariance

Categorical: Revisit Box Plot

Factors/categorical variables work with boxplots; however, some functions are not set up to handle Factors.

Note there is an equation here: Y = mx + b

Correlation will depend on the spread of the distributions.

Page 32: Correlation and Covariance

Manual Calculation: Note Stdev is Lower

Note that with 0/1 coding the deltas from the mean are small, so the standard deviation is lower, whereas the continuous variable has a lot of variation (spread).

Page 33: Correlation and Covariance

Categorical: Recode! Gender recoded as 0 = Female, 1 = Male.

correl() does not work with Factor variables.

Formula now works!
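A minimal R sketch of the recode, assuming the data frame h has a Factor column named gender with a level "M" for male (the real column and level names may differ):

# Hypothetical column and level names; adjust to the actual data
h$genderNum <- ifelse(h$gender == "M", 1, 0)   # 0 = Female, 1 = Male

cor(h$father, h$genderNum)   # cor() now works on the numeric recode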

Page 34: Correlation and Covariance

Correlation: Continuous & Discrete

More examples of cor.test()
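A minimal sketch of cor.test(), which returns the correlation along with a significance test and confidence interval (shown on the assumed Age/Income vectors from earlier):

age    <- c(7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6)
income <- c(4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3)

cor.test(age, income)    # Pearson by default: r, t statistic, p-value, 95% CI
# method = "spearman" or "kendall" gives the rank-based alternatives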

Page 35: Correlation and Covariance

• Too many variables are difficult to handle.

• It takes computing power to handle all that data.

• Principal components analysis seeks to identify and quantify underlying components by analyzing the original, observable variables.

• In many cases, we can wind up working with just a few—on the order of, say, three to ten—principal components or factors instead of tens or hundreds of conventionally measured variables.

Overview

Page 36: Correlation and Covariance

[Diagram: the observable variables X1, X2, X3 project onto principal-component vectors Z1, Z2, Z3. Which component explains the most variance?]

Principal Components Analysis
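A minimal sketch of principal components analysis in R with prcomp(), on stand-in numeric data; scaling puts the variables on a common footing, and the summary shows how much variance each component explains:

X <- mtcars[, c("mpg", "disp", "hp", "wt")]    # stand-in observable variables

pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)     # proportion of variance explained by each component
pca$rotation     # loadings: how each X contributes to each Z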

Page 37: Correlation and Covariance

Principal Components Analysis

Page 38: Correlation and Covariance

Principal Components

Page 39: Correlation and Covariance

Correlation Regression