Correlation
(Irises, Vincent van Gogh, 1889)
setosa versicolor virginica
Iris data
• Fisher's iris data

       S.Length  S.Width  P.Length  P.Width  Species
  1      5.1       3.5      1.4       0.2    setosa
  2      4.9       3.0      1.4       0.2    setosa
  …
 49      5.3       3.7      1.5       0.2    setosa
 50      5.0       3.3      1.4       0.2    setosa
 51      7.0       3.2      4.7       1.4    versicolor
 52      6.4       3.2      4.5       1.5    versicolor
  …
 99      6.2       2.9      4.3       1.3    versicolor
100      5.7       2.8      4.1       1.3    versicolor
101      6.3       3.3      6.0       2.5    virginica
  …
150      5.9       3.0      5.1       1.8    virginica
Scatter-plot matrix
[Figure: scatter-plot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]
Scatter plot (by group) and trendlines
[Figure: Sepal.Width against Sepal.Length, grouped by species: setosa, versicolor, virginica]
Scatterplot for setosa of iris data
[Figure: Sepal.Width against Sepal.Length for setosa]
[Figure: three scatter plots, both axes from 30 to 90: negative relationship, positive relationship, no apparent relationship]
How to quantify the relationship?
[Figure: three scatter plots, both axes from 30 to 90]
Count pairs

Count positive pairs
$X$:              20   30   50   70   80
$Y$:              30   10   40   70   50
$X - \bar{X}$:   -30  -20    0   20   30
$Y - \bar{Y}$:   -10  -30    0   30   10

Count negative pairs
$X$:              20   30   50   70   80
$Y$:              50   70   40   10   30
$X - \bar{X}$:   -30  -20    0   20   30
$Y - \bar{Y}$:    10   30    0  -30  -10
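The pair-counting idea can be sketched in a few lines of Python. This is only an illustration: the helper name `count_sign_pairs` is hypothetical, and the data are the two X, Y sets from the slide (X = 20, 30, 50, 70, 80 with Y = 30, 10, 40, 70, 50 and with Y = 50, 70, 40, 10, 30). A pair counts as positive when both deviations from the mean have the same sign, and as negative when the signs differ.

```python
from statistics import mean

def count_sign_pairs(xs, ys):
    """Return (positive, negative) counts of deviation products."""
    x_bar, y_bar = mean(xs), mean(ys)
    positive = negative = 0
    for x, y in zip(xs, ys):
        product = (x - x_bar) * (y - y_bar)
        if product > 0:
            positive += 1      # both deviations share a sign
        elif product < 0:
            negative += 1      # deviations have opposite signs
        # a product of exactly 0 is counted in neither group
    return positive, negative

x = [20, 30, 50, 70, 80]
y_up = [30, 10, 40, 70, 50]     # first data set on the slide
y_down = [50, 70, 40, 10, 30]   # second data set on the slide

pairs_up = count_sign_pairs(x, y_up)
pairs_down = count_sign_pairs(x, y_down)
print(pairs_up, pairs_down)     # (4, 0) (0, 4)
```

Counting alone ignores how far each point lies from the means, which is one reason to move on to products of deviations.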
[Figure: the same scatter plot drawn twice, once with the horizontal axis from 30 to 90 and once from 300 to 900]
Scale matters and needs to be considered.
-10, -2, 3, 5, 7, 9
5, -7, 10, -3, 8, 5
Maximize the sum of products of each pair.

Positively matched (largest sum of products):
-10, -2, 3, 5, 7, 9
-7, -3, 5, 5, 8, 10

Negatively matched (smallest sum of products):
-10, -2, 3, 5, 7, 9
10, 8, 5, 5, -3, -7
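As a quick check of the matching idea, the following Python sketch computes the sum of products for the two lists on the slide in three arrangements: as given, both sorted in the same direction (positively matched), and sorted in opposite directions (negatively matched). The function name `sum_of_products` is just an illustrative choice.

```python
def sum_of_products(xs, ys):
    """Sum of the pairwise products x_i * y_i."""
    return sum(x * y for x, y in zip(xs, ys))

x = [-10, -2, 3, 5, 7, 9]
y = [5, -7, 10, -3, 8, 5]

as_given = sum_of_products(x, y)
# Both lists ascending: large values meet large values.
positively_matched = sum_of_products(sorted(x), sorted(y))
# One ascending, one descending: large values meet small values.
negatively_matched = sum_of_products(sorted(x), sorted(y, reverse=True))

print(as_given, positively_matched, negatively_matched)  # 80 262 -160
```

The arbitrary pairing falls between the two extremes, which is what makes the sum of products usable as a measure of how well two variables are matched.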
$X$:              20   30   50   70   80
$Y$:              30   10   40   70   50
$X - \bar{X}$:   -30  -20    0   20   30
$Y - \bar{Y}$:   -10  -30    0   30   10

$$\mathrm{Cov}^*(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
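The sample covariance, the sum of products of deviations divided by n - 1, can be computed directly for the slide's data (X = 20, 30, 50, 70, 80 and Y = 30, 10, 40, 70, 50). This is a minimal sketch; `sample_cov` is a hypothetical helper name, not something from the slides.

```python
def sample_cov(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

x = [20, 30, 50, 70, 80]
y = [30, 10, 40, 70, 50]

# Deviation products are 300, 600, 0, 600, 300, which sum to 1800.
cov_xy = sample_cov(x, y)
print(cov_xy)  # 450.0
```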
[Figure: three scatter plots, both axes from 30 to 90, labeled with the sign of the covariance: negative (-), zero (0), positive (+)]
Need to consider scale

$X - \bar{X}$:   -300  -200    0   200   300
$Y - \bar{Y}$:    -20   -60    0    60    20

$X - \bar{X}$:    -30   -20    0    20    30
$Y - \bar{Y}$:    -10   -30    0    30    10

$$\mathrm{Cov}^*(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
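$$\mathrm{Var}^*(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})$$

$$\mathrm{Corr}^*(X, Y) = \frac{\mathrm{Cov}^*(X, Y)}{\sqrt{\mathrm{Var}^*(X)\,\mathrm{Var}^*(Y)}}$$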
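$X_2 - \bar{X}_2$:   -300  -200    0   200   300
$Y - \bar{Y}$:        -20   -60    0    60    20

$$\mathrm{Cov}^*(X_2, Y) = \mathrm{Cov}^*(10 X_1, Y) = 10\,\mathrm{Cov}^*(X_1, Y)$$
$$\mathrm{Var}^*(X_2) = \mathrm{Var}^*(10 X_1) = 10^2\,\mathrm{Var}^*(X_1)$$
$$\mathrm{Corr}^*(X_2, Y) = \mathrm{Corr}^*(X_1, Y)$$

$X_1 - \bar{X}_1$:    -30   -20    0    20    30
$Y - \bar{Y}$:        -20   -60    0    60    20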
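The scaling rules can be checked numerically. The sketch below uses the deviations shown on the slide as the data themselves (they already have mean zero, so this is an assumption made for convenience, not the original raw values) and multiplies X1 by 10 to obtain X2. The helper names `sample_cov` and `sample_corr` are hypothetical.

```python
from math import sqrt, isclose

def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample correlation: covariance over the root of the variances."""
    return sample_cov(xs, ys) / sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

x1 = [-30, -20, 0, 20, 30]   # deviations of X1 from the slide
y = [-20, -60, 0, 60, 20]    # deviations of Y from the slide
x2 = [10 * v for v in x1]    # X2 = 10 * X1

cov1, cov2 = sample_cov(x1, y), sample_cov(x2, y)
var1, var2 = sample_cov(x1, x1), sample_cov(x2, x2)
corr1, corr2 = sample_corr(x1, y), sample_corr(x2, y)

print(cov1, cov2)              # covariance grows by a factor of 10
print(var1, var2)              # variance grows by a factor of 10**2
print(isclose(corr1, corr2))   # correlation is unchanged
```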
[Figure: two scatter plots, both axes from 30 to 90, one with correlation -1 and one with correlation +1]

Cauchy-Schwarz inequality
$$-1 \le \mathrm{Corr}^*(X, Y) \le 1$$
-1: (very strong) negative linear relationship
+1: (very strong) positive linear relationship
Exercise
$X$:   20   30   50   70   80
$Y$:   50   70   40   10   30
$$\mathrm{Cov}^*(X, Y) = ?$$
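Sample version
$$\mathrm{Cov}^*(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Population version
$$\mathrm{Cov}(X, Y) = E\big((X - \mu_X)(Y - \mu_Y)\big) = E(XY) - \mu_X \mu_Y$$
$$\mu_X = E(X), \quad \mu_Y = E(Y)$$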
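Population covariance

Joint distribution of X and Y:
X \ Y     1      2      3     Total
1         0     0.25    0     0.25
2        0.25    0     0.25   0.5
3         0     0.25    0     0.25
Total    0.25   0.5    0.25   1

X    Y    XY    f
1    2    2     0.25
2    1    2     0.25
2    3    6     0.25
3    2    6     0.25
3    3    9     0
…    …    …     …

$$\mu_X = \mu_Y = 2$$
$$E(XY) = 2(0.25) + 2(0.25) + 6(0.25) + 6(0.25) = 4$$
$$\mathrm{Cov}(X, Y) = E\big((X - \mu_X)(Y - \mu_Y)\big) = E(XY) - \mu_X \mu_Y = 4 - 2 \cdot 2 = 0$$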
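The population covariance of this joint distribution can be reproduced with a short Python sketch. The dictionary `joint` is an illustrative encoding of the table: it maps each (x, y) cell with nonzero probability to that probability, and cells with probability 0 are simply left out.

```python
# Nonzero cells of the joint distribution: (x, y) -> probability.
joint = {
    (1, 2): 0.25,
    (2, 1): 0.25,
    (2, 3): 0.25,
    (3, 2): 0.25,
}

mu_x = sum(x * p for (x, _), p in joint.items())       # E(X)
mu_y = sum(y * p for (_, y), p in joint.items())       # E(Y)
e_xy = sum(x * y * p for (x, y), p in joint.items())   # E(XY)

cov_xy = e_xy - mu_x * mu_y
print(mu_x, mu_y, e_xy, cov_xy)  # 2.0 2.0 4.0 0.0
```

The covariance is zero here even though X and Y are clearly related: knowing that X is 1 or 3 forces Y to be 2.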
Population covariance

X \ Y     1      2      3     Total
1        1/3     0      0     1/3
2         0     1/3     0     1/3
3         0      0     1/3    1/3
Total    1/3    1/3    1/3    1

$$\mathrm{Cov}(X, Y) = E\big((X - \mu_X)(Y - \mu_Y)\big) = E(XY) - \mu_X \mu_Y = ?$$

Exercise

X \ Y     1      2      3     Total
1         0      0     1/3    1/3
2         0     1/3     0     1/3
3        1/3     0      0     1/3
Total    1/3    1/3    1/3    1
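$$\mathrm{Cov}(X, Y) = E\big((X - \mu_X)(Y - \mu_Y)\big)$$
Continuous case:
$$= \int\!\!\int (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dy\, dx$$
Discrete case:
$$= \sum_i \sum_j (x_i - \mu_X)(y_j - \mu_Y)\, f(x_i, y_j)$$
Discrete case, listed pair by pair:
$$= \sum_i (x_i - \mu_X)(y_i - \mu_Y)\, f(x_i, y_i)$$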
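Sample (size n):
$$s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
$$r_{XY} = \frac{s_{XY}}{\sqrt{s_X^2\, s_Y^2}}$$

Population (size N):
$$\sigma_{XY} = \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)\, f(x_i, y_i)$$
$$\rho_{XY} = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2\, \sigma_Y^2}}$$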
Covariance is a measure of linear association
between two variables. Covariance is not a measure
of curved association.
[Figure: scatter plot of y against x, with x from 20 to 100 and y from -100 to 150]
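A small numerical illustration of this point, using made-up data rather than the data behind the slide's figure: when y is exactly the square of x and the x values are symmetric around zero, the relationship is perfect but curved, and the sample covariance comes out as zero. The helper name `sample_cov` is hypothetical.

```python
def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: y is fully determined by x, but not linearly.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]   # 4, 1, 0, 1, 4

cov_curved = sample_cov(x, y)
print(cov_curved)  # 0.0

# For comparison, a straight-line relationship on the same x values.
y_line = [3 * v + 1 for v in x]
cov_line = sample_cov(x, y_line)
print(cov_line)    # 7.5
```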
Covariance is affected by the scales of the variables,
but correlation is not, except for the sign of the scale factor.
Covariance may be any real value,
but correlation always lies in [-1, 1].
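$$Z = (X - \mu)/c$$
$$\mathrm{Cov}(Z, Y) = \mathrm{Cov}(X, Y)/c$$
$$\mathrm{Corr}(Z, Y) = \begin{cases} \mathrm{Corr}(X, Y), & c > 0 \\ -\mathrm{Corr}(X, Y), & c < 0 \end{cases}$$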
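Correlation is the covariance for standardized variables
$$Z_X = (X - \mu_X)/\sigma_X, \quad Z_Y = (Y - \mu_Y)/\sigma_Y$$
$$\mathrm{Cov}(Z_X, Z_Y) = \frac{\mathrm{Cov}(X - \mu_X,\, Y - \mu_Y)}{\sigma_X \sigma_Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \mathrm{Corr}(X, Y)$$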
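The identity between correlation and the covariance of standardized variables can be checked numerically. The sketch below is an assumption-laden illustration: it uses the sample analogues (sample mean and sample standard deviation with divisor n - 1) in place of the population quantities, the data X = 20, 30, 50, 70, 80 and Y = 30, 10, 40, 70, 50 from the earlier slides, and hypothetical helper names `sample_cov` and `standardize`.

```python
from math import sqrt, isclose

def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def standardize(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    center = sum(values) / len(values)
    spread = sqrt(sample_cov(values, values))
    return [(v - center) / spread for v in values]

x = [20, 30, 50, 70, 80]
y = [30, 10, 40, 70, 50]

corr_xy = sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))
cov_zx_zy = sample_cov(standardize(x), standardize(y))

print(round(corr_xy, 4), round(cov_zx_zy, 4))
print(isclose(corr_xy, cov_zx_zy))  # the two quantities agree
```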
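[Figure: two scatter plots; the first has both axes from 30 to 90, the second has the horizontal axis from 300 to 900 and the vertical axis from 30 to 90]
First plot: Covariance = 189, Correlation = 0.78
Second plot: Covariance = ?, Correlation = ?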
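[Figure: two scatter plots; the first has both axes from 30 to 90, the second has the horizontal axis from -90 to -60 and the vertical axis from 60 to 120]
First plot: Covariance = 189, Correlation = 0.78
Second plot: Covariance = ?, Correlation = ?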
Grouped by Zip Code
Pooling groups that each have a positive correlation does not necessarily give a positive correlation.
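A toy example makes the pooling effect concrete. The two groups below are invented for illustration (they are not the zip-code data from the slide): within each group the correlation is +1, yet the pooled data have a strongly negative correlation because the group with larger x values has smaller y values. The helper names `sample_cov` and `sample_corr` are hypothetical.

```python
from math import sqrt, isclose

def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample correlation coefficient."""
    return sample_cov(xs, ys) / sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

# Hypothetical groups: y rises with x inside each group.
group_a_x, group_a_y = [1, 2, 3], [11, 12, 13]
group_b_x, group_b_y = [11, 12, 13], [1, 2, 3]

corr_a = sample_corr(group_a_x, group_a_y)
corr_b = sample_corr(group_b_x, group_b_y)
corr_pooled = sample_corr(group_a_x + group_b_x, group_a_y + group_b_y)

print(corr_a, corr_b)          # positive within each group
print(round(corr_pooled, 3))   # negative once the groups are pooled
```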
High correlation does not mean that one variable
is the cause of the other.
Correlation is a measure of linear association,
not of causation.
Correlation and causality
Do more Starbucks (STBK) stores make apartment (APT) prices rise higher?
The more Starbucks, the higher the APT price!

APT prices in Seoul
District       STBK   APT price
Gangnam-gu      45      1030
Gangdong-gu      2       530
Jung-gu         24       520
Jungnang-gu      0       330
STBK: number of Starbucks stores
APT price: average apartment price per 1 m²

The more Starbucks, the deeper the financial crisis!
Thank you!