Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109,...

40
Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated

Transcript of Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109,...

Page 1: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Covariance and CorrelationChris Piech

CS109, Stanford University

Your random variables are correlated

Page 2: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Four Prototypical Trajectories

Review

Page 3: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Expectation and Variance

The two most important descriptors of a distribution, a random variable or a dataset

Page 4: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Peer Grades in Coursera HCI

-80 -60 -40 -20 0 20 40 60 8060 80 10040200

True grade = 58E[X] = 57.5

Let X be a random variable that represents a peer grade

If we base a final score off 4 peer-grades, 19% of students would get a score that was off by more than 10 percentage points

E(X) = weighted average

Page 5: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Peer Grades in Coursera HCI

-80 -60 -40 -20 0 20 40 60 8060 80 10040200

True grade = 58E[X] = 57.5

Let X be a random variable that represents a peer gradeVar(X) = E[(X – µ)2]

Page 6: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Peer Grades in Coursera HCI

-80 -60 -40 -20 0 20 40 60 8060 80 10040200

True grade = 58E[X] = 57.5

Let X be a random variable that represents a peer gradeVar(X) = E[(X – µ)2]

X (X – µ)225 points 1056 points2

Page 7: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Peer Grades in Coursera HCI

-80 -60 -40 -20 0 20 40 60 8060 80 10040200

True grade = 58E[X] = 57.5

Let X be a random variable that represents a peer gradeVar(X) = E[(X – µ)2]

X (X – µ)225 points 1056 points2

80 points 506 points2

Page 8: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Peer Grades in Coursera HCI

-80 -60 -40 -20 0 20 40 60 8060 80 10040200

True grade = 58E[X] = 57.5

Let X be a random variable that represents a peer gradeVar(X) = E[(X – µ)2]

X (X – µ)225 points 1056 points2

80 points 506 points2

50 points 56 points2

Page 9: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Peer Grades in Coursera HCI

-80 -60 -40 -20 0 20 40 60 8060 80 10040200

True grade = 58E[X] = 57.5

Let X be a random variable that represents a peer gradeVar(X) = E[(X – µ)2]

X (X – µ)225 points 1056 points2

80 points 506 points2

50 points 56 points2

E [(X – µ)2] = 52 points2

Page 10: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Peer Grades in Coursera HCI

-80 -60 -40 -20 0 20 40 60 8060 80 10040200

True grade = 58E[X] = 57.5

Let X be a random variable that represents a peer gradeVar(X) = E[(X – µ)2]

X (X – µ)225 points 1056 points2

80 points 506 points2

50 points 56 points2

E [(X – µ)2] = 52 points2

SD(X) = 7.2 points

Page 11: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

E[X + Y] = E[X] + E[Y]

Generalized:

Holds regardless of dependency between Xi’s

∑∑==

=⎥⎦

⎤⎢⎣

⎡ n

ii

n

ii XEXE

11][

Expected Values of Sums

Page 12: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

E[g(X,Y )] =X

x,y

g(x, y)p(x, y)

E[g(X)] =X

x

g(x)p(x)

Expected Values of Functions

For example:

Let g(X,Y ) = X · Y

X, Y are independent random variables:

E[X · Y ] = E[g(X,Y )]

=X

x,y

g(x, y) · p(x, y)

=X

x,y

xy · p(x)p(y)

=X

x

xp(x) ·X

y

yp(y) = E[X] · E[Y ]

Page 13: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Four Prototypical Trajectories

Dance of Covariance

Page 14: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Recall our Ebola Bats

Page 15: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Gene1 Gene2 Gene3 Gene4 Gene5 TraitTRUE FALSE TRUE TRUE FALSE FALSEFALSE FALSE TRUE TRUE TRUE TRUETRUE FALSE TRUE FALSE FALSE FALSETRUE FALSE TRUE TRUE TRUE FALSEFALSE TRUE TRUE TRUE TRUE TRUEFALSE FALSE FALSE TRUE FALSE FALSETRUE FALSE FALSE TRUE FALSE FALSETRUE FALSE FALSE TRUE FALSE FALSETRUE FALSE TRUE FALSE FALSE FALSEFALSE TRUE FALSE TRUE FALSE FALSETRUE TRUE FALSE TRUE FALSE FALSETRUE FALSE FALSE TRUE FALSE FALSETRUE FALSE TRUE TRUE TRUE FALSEFALSE FALSE TRUE TRUE FALSE FALSETRUE FALSE FALSE TRUE FALSE FALSETRUE FALSE FALSE TRUE FALSE FALSE

…TRUE FALSE FALSE TRUE FALSE FALSE

Bat Data

Page 16: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Gene1 Gene2 Gene3 Gene4 Gene5 Trait0.71 0.29 0.89 0.82 0.76 0.830.17 0.02 0.89 0.02 0.94 0.850.01 0.63 0.76 0.38 0.82 0.030.19 0.95 0.63 0.89 0.94 0.320.46 0.96 0.36 0.12 0.50 0.100.48 0.51 0.45 0.16 0.40 0.530.20 0.77 0.27 0.23 0.90 0.670.49 0.24 0.77 0.37 0.29 0.710.59 0.95 0.38 0.42 0.72 0.250.43 0.66 0.57 0.03 0.15 0.240.32 0.42 0.25 0.12 0.79 0.980.77 0.31 0.66 0.78 0.68 0.770.46 0.59 0.38 0.99 0.71 0.370.97 0.66 0.05 0.99 0.36 0.180.50 0.66 0.35 0.41 0.62 0.080.70 0.85 0.98 0.29 0.59 0.38

...0.78 0.09 0.69 0.41 0.82 0.76

Expression Amount

Page 17: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

-4

0

4

-4 0 4

-4

0

4

-4 0 4

Spot The Difference

Page 18: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

-2

0

2

4

6

0 5

-2

0

2

4

6

0 5

Spot The Difference

Page 19: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Understanding Covariance

Page 20: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

-2

0

2

4

6

0 5

x� E[x] = 3

y � E[y] = 2.6

(x� E[x])(y � E[y]) = 7.8

Vary Together

Page 21: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

-2

0

2

4

6

0 5

Vary Together

(x� E[x])(y � E[y]) = 0

x� E[x] ⇡ 0

y � E[y] ⇡ 0

Page 22: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

-2

0

2

4

6

0 5

Vary Together

y � E[y] = �2.8

x� E[x] = �1.1

(x� E[x])(y � E[y]) ⇡ 3.1

Page 23: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Say X and Y are arbitrary random variables• Covariance of X and Y:

])][])([[(),(Cov YEYXEXEYX −−=

The Dance of the Covariance

x y (x – E[X])(y – E[Y])p(x,y)

Above mean

Above mean

Positive

Bellow mean

Bellow mean

Positive

Bellow mean

Above mean

Negative

Above mean

Bellow mean

Negative

Page 24: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Say X and Y are arbitrary random variables• Covariance of X and Y:

• Equivalently:

§ X and Y independent, E[XY] = E[X]E[Y] à Cov(X,Y) = 0

§ But Cov(X,Y) = 0 does not imply X and Y independent!

])][])([[(),(Cov YEYXEXEYX −−=

]][][][][[),(Cov XEYEYXEYXEXYEYX +−−=

][][][][][][][ YEXEYEXEYEXEXYE +−−=

][][][ YEXEXYE −=

The Dance of the Covariance

Page 25: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Consider the following data:Weight Height Weight * Height

64 57 364871 59 418953 49 259767 62 415455 51 280558 50 290077 55 423557 48 273656 42 235251 42 214276 61 463668 57 3876

E[W] = 62.75

E[H] = 52.75

E[W*H] = 3355.83

3035404550556065

40 45 50 55 60 65 70 75 80

Height

Weight

Cov(W, H) = E[W*H] – E[W]E[H]= 3355.83 – (62.75)(52.75)= 45.77

Covariance and Data

Page 26: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

CovarianceSocrative: (a) positive, (b) negative, (c) zero

Page 27: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

CovarianceSocrative: (a) positive, (b) negative, (c) zero

Positive

Page 28: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• X and Y are random variables with PMF:

§ E[X] = -1(1/3) + 0(1/3) + 1(1/3) = 0§ E[Y] = 0(2/3) + 1(1/3) = 1/3§ Since XY = 0, E[XY] = 0§ Cov(X, Y) = E[XY] – E[X]E[Y] = 0 – 0 = 0

• But, X and Y are clearly dependent!

XY -1 0 1 pY(y)

0 1/3 0 1/3 2/3

1 0 1/3 0 1/3

pX(x) 1/3 1/3 1/3 1

⎩⎨⎧ ≠

= otherwise 1

0 if 0 XY

Independence and Covariance

Page 29: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Consider rolling a 6-sided die§ Let indicator variable X = 1 if roll is 1, 2, 3, or 4

§ Let indicator variable Y = 1 if roll is 3, 4, 5, or 6

• What is Cov(X, Y)?§ E[X] = 2/3 and E[Y] = 2/3

§ E[XY] =

= (0 * 0) + (0 * 1/3) + (0 * 1/3) + (1 * 1/3) = 1/3

§ Cov(X, Y) = E[XY] – E[X]E[Y] = 1/3 – 4/9 = -1/9§ Consider: P(X = 1) = 2/3 and P(X = 1 | Y = 1) = 1/2

o Observing Y = 1 makes X = 1 less likely

∑∑x y

yxpxy ),(

Example of Covariance

Page 30: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Say X and Y are arbitrary random variables§

§

§

• Covariance of sums of random variables§ X1, X2, …, Xn and Y1, Y2, …, Ym are random variables

§

),(Cov),(Cov XYYX =

)(Var][][][),(Cov 2 XXEXEXEXX =−=

),(Cov),(Cov YXaYbaX =+

∑∑∑∑= ===

=⎟⎟⎠

⎞⎜⎜⎝

⎛ n

i

m

jji

m

jj

n

ii YXY,X

1 111),(Cov Cov

Properties of Covariance

Page 31: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Let IA and IB be indicators for events A and B

§ E[IA] = P(A), E[IB] = P(B), E[IAIB] = P(AB)§ Cov(IA, IB) = E[IAIB] – E[IA] E[IB]

= P(AB) – P(A)P(B)= P(A | B)P(B) – P(A)P(B)= P(B)[P(A | B) – P(A)]

§ Cov(IA, IB) determined by P(A | B) – P(A)§ P(A | B) > P(A) ⇒ ρ(IA, IB) > 0§ P(A | B) = P(A) ⇒ ρ(IA, IB) = 0 (and Cov(IA, IB) = 0)§ P(A | B) < P(A) ⇒ ρ(IA, IB) < 0

⎩⎨⎧

=otherwise 0

occurs if 1 AIA

⎩⎨⎧

=otherwise 0

occurs if 1 BIB

Do Indicators Covary?

Page 32: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Four Prototypical Trajectories

Correlation

Page 33: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Consider the following data:Weight Height Weight * Height

64 57 364871 59 418953 49 259767 62 415455 51 280558 50 290077 55 423557 48 273656 42 235251 42 214276 61 463668 57 3876

E[W] = 62.75

E[H] = 52.75

E[W*H] = 3355.83

3035404550556065

40 45 50 55 60 65 70 75 80

Height

Weight

Cov(W, H) = E[W*H] – E[W]E[H]= 3355.83 – (62.75)(52.75)= 45.77

What is Wrong With This?

Page 34: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

�Std(X)Std(Y ) Cov(X,Y ) Std(X)Std(Y )

Page 35: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Say X and Y are arbitrary random variables§ Correlation of X and Y, denoted ρ(X, Y):

§ Note: -1 ≤ ρ(X, Y) ≤ 1 § Correlation measures linearity between X and Y§ ρ(X, Y) = 1 ⇒ Y = aX + b where a = σy/σx

§ ρ(X, Y) = -1 ⇒ Y = aX + b where a = -σy/σx

§ ρ(X, Y) = 0 ⇒ absence of linear relationshipo But, X and Y can still be related in some other way!

§ If ρ(X, Y) = 0, we say X and Y are “uncorrelated”o Note: Independence implies uncorrelated, but not vice versa!

Y)Var(X)Var(),(Cov),( YXYX =ρ

Viva La Correlatión

Page 36: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Say X and Y are arbitrary random variables§ Correlation of X and Y, denoted ρ(X, Y):

• Say Y = cX. Correlation should be 1.

Y)Var(X)Var(),(Cov),( YXYX =ρ

Viva La Correlatión

Page 37: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Four Prototypical Trajectories

If we have time

Page 38: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Computing Cov(Xi, Xj)§ Indicator Ii(k) = 1 if trial k has outcome i, 0 otherwise

§

§ When a ≠ b, trial a and b independent:

§ When a = b:§ Since trial a cannot have outcome i and j:

⇒ Xi and Xj negatively correlated

∑=

=n

kii kIX

1

)( ∑=

=n

kjj kIX

1

)(

∑∑= =

=n

a

n

bjiji aIbIXX

1 1

))(),((Cov),(Cov

0))(),((Cov =aIbI ji

)]([)]([)]()([))(),((Cov aIEaIEaIaIEaIbI jijiji −=

ii pkIE =)]([

0)]()([ =aIaIE ji

∑∑===

−==n

aji

n

bajiji aIEaIEaIbIXX

11

)])([)]([())(),((Cov),(Cov

ji

n

aji pnppp −=−=∑

=1

)(

Covariance and the Multinomial

Page 39: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

• Multinomial distributions:§ Count of strings hashed into buckets in hash table§ Number of server requests across machines in cluster§ Distribution of words/tokens in an email§ Etc.

• When m (# outcomes) is large, pi is small§ For equally likely outcomes: pi = 1/m

§ Large m ⇒ Xi and Xj very mildly negatively correlated§ Poisson paradigm applicable

2),(Cov mn

jiji pnpXX −=−=

Multinomials All Around

Page 40: Covariance and Correlation - Stanford University · Covariance and Correlation Chris Piech CS109, Stanford University Your random variables are correlated. Four Prototypical Trajectories

Four Prototypical Trajectories

Que te vayas bien